Data Curation for Machine Learning Applied to Geothermal Power Plant Operational Data for GOOML: Geothermal Operational Optimization with Machine Learning: Preprint

Nicole Taverna, Grant Buster, Jay Huggins, Michael Rossol, Paul Siratovich, Jon Weers, Andrea Blair, Christine Siega, Warren Mannington, Alex Urgel, Jonathan Cen, Jaime Quinao, Robbie Watt, John Akerley

Research output: Contribution to conference (Paper)


Geothermal Operational Optimization with Machine Learning (GOOML) is a transferable and extensible component-based geothermal asset modeling framework that considers complex steamfield relationships and identifies optimization prospects using a data-driven approach to physics-guided, data-centric machine learning. This framework has been used to develop digital twins that provide steamfield operators with operational environments to analyze and understand historical and forecasted power production, explore new steamfield configuration possibilities, and seek optimal asset management in real-world applications. To create, test, and apply the GOOML framework, diverse time-series datasets spanning multiple years were sourced from various geothermal power plant components within several complex real-world geothermal operations. These operations are based in the United States and New Zealand and include a variety of technologies, end uses, and configurations, collectively covering nearly all relevant operating conditions for modern geothermal fields. Datasets were acquired from multiple sources to ensure that machine learning experiments generalized properly to various operating conditions. The data varied in quality, format, and completeness. To ensure consistency across the datasets, a standardized data curation process was developed to reliably streamline data preparation.
This paper discusses best practices learned from the GOOML data curation process, which comprises the following steps: 1) acquisition of large quantities of data from power plant operators, 2) digestion of the data to gain an initial understanding of what is included, 3) data transformation, which includes converting the data into a standardized machine-readable format so that they can be visualized, quality checked, and cleaned, 4) quality assurance and quality control, involving identification of significant data gaps and apparent anomalies through mapping of data features to real-world componentry via the GOOML historical model, followed by discussion with modelers and power plant operators to identify additional data needs and to resolve issues, 5) use in machine learning algorithms, and 6) repetition of steps one through five until all data needs are met and the data are deemed suitable for producing trustworthy modeling results, which may be disseminated, ideally along with the curated dataset. This iterative process focuses on improving the quality of the data rather than tuning machine learning model parameters, and it supports a shift towards data-centric AI as a means of improving the real-world applicability of geothermal machine learning projects.
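As a rough illustration of steps 3 and 4 above, the following minimal Python sketch converts raw operator records into a standardized time series and then flags data gaps and implausible values. It is not the GOOML implementation: the function names, the `pressure_bar` field, the one-hour sampling interval, and the 0 to 50 bar plausibility range are all hypothetical choices made for this example.

```python
from datetime import datetime, timedelta


def standardize(raw_rows, time_key, value_key):
    """Step 3 (transformation): parse rows into sorted (timestamp, value) pairs."""
    series = []
    for row in raw_rows:
        ts = datetime.fromisoformat(row[time_key])
        try:
            value = float(row[value_key])
        except (TypeError, ValueError):
            value = None  # keep the timestamp so the missing reading stays visible
        series.append((ts, value))
    return sorted(series, key=lambda pair: pair[0])


def find_gaps(series, expected_step):
    """Step 4 (QA/QC): intervals where consecutive samples are too far apart."""
    gaps = []
    for (t0, _), (t1, _) in zip(series, series[1:]):
        if t1 - t0 > expected_step:
            gaps.append((t0, t1))
    return gaps


def find_out_of_range(series, low, high):
    """Step 4 (QA/QC): timestamps whose value is missing or outside [low, high]."""
    return [ts for ts, v in series if v is None or not (low <= v <= high)]


# Hypothetical raw export from an operator (field names are assumptions).
raw = [
    {"time": "2021-01-01T00:00:00", "pressure_bar": "10.2"},
    {"time": "2021-01-01T01:00:00", "pressure_bar": "10.4"},
    {"time": "2021-01-01T04:00:00", "pressure_bar": "-999"},  # sentinel value
    {"time": "2021-01-01T05:00:00", "pressure_bar": "n/a"},   # unparseable
]

series = standardize(raw, "time", "pressure_bar")
gaps = find_gaps(series, timedelta(hours=1))       # assumed hourly sampling
flagged = find_out_of_range(series, 0.0, 50.0)     # assumed plausible range
```

In a workflow like the one described, the flagged gaps and values would be starting points for discussion with modelers and plant operators rather than automatic deletions, since an apparent anomaly may reflect a real operating event.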
Original language: American English
Number of pages: 13
State: Published - 2022
Event: 47th Stanford Geothermal Workshop - Stanford, California
Duration: 7 Feb 2022 to 9 Feb 2022


Conference: 47th Stanford Geothermal Workshop
City: Stanford, California

NREL Publication Number

  • NREL/CP-6A20-81649


Keywords

  • access
  • accessibility
  • collaboration
  • curation
  • data
  • data curation
  • data pipeline
  • data-centric AI
  • discoverability
  • dissemination
  • DOE
  • GDR
  • geothermal
  • machine learning
  • open
  • OpenEI
  • pipeline
  • power plant operations
  • standards
  • storage
  • transfer
  • translation
  • usability

