Abstract
For machine learning outputs to be applicable to real-world problems, high-quality data are needed to ensure high-quality results. With the recent emphasis on machine learning in geothermal, there is an increasing need to focus on the quality of the data available for these projects. For example, Geothermal Operational Optimization Using Machine Learning (GOOML) used large quantities of geothermal power plant operational data to inform power plant operational configurations that maximize power generation. High-quality datasets result from dependable sensors or devices collecting data, high frequency of measurements, sufficient data points, adequate metadata, reliable storage of data, and sufficient data curation. Another component that contributes to high-quality data is reusability, which can be enhanced through data standardization. Data standardization creates consistency in the formatting and contents of like datasets, lessening preprocessing requirements and ensuring that a given dataset provides adequate information. The Geothermal Data Repository (GDR) aims to help improve data quality by implementing data pipelines that automate standardization for high-value datasets, alongside reliable and accessible long-term storage. As such, the GDR has decided to shift away from recommending the use of Excel-based content models and towards the implementation of automated data pipelines. This takes the burden of data standardization off the user and project team and will increase the amount of standardized geothermal data available through the GDR. A set of recommendations, or data standard, for each data type will accompany each data pipeline in order to advise data collection for maximum usability in future research.
This paper serves to describe the GDR's proposed transition towards data standardization through automated data pipelines, to discuss the need for and value of such a shift, and to call for suggestions from the community regarding the most useful data standards and pipelines.
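To make the idea of an automated standardization step concrete, the following is a minimal sketch in Python, not the GDR's actual pipeline. The field names, units, column mapping, and the `standardize_record` helper are all hypothetical assumptions chosen for illustration. It shows one raw row being renamed to a common schema, converted to a common unit, given an ISO 8601 timestamp, and checked for missing required fields.

```python
from datetime import datetime

# Hypothetical standard: these field names and units are illustrative
# assumptions, not an actual GDR data standard.
REQUIRED_FIELDS = ("timestamp", "well_id", "wellhead_pressure_kpa")


def standardize_record(raw, column_map, unit_factors, time_format):
    """Map one raw row (a dict) onto the hypothetical standard schema.

    Returns the standardized row and a list of required fields that are
    still missing, so a pipeline can report gaps instead of hiding them.
    """
    out = {}
    for raw_name, value in raw.items():
        std_name = column_map.get(raw_name.strip().lower())
        if std_name is None:
            continue  # skip columns the hypothetical standard does not define
        out[std_name] = value

    # Convert numeric fields into the standard unit.
    for field, factor in unit_factors.items():
        if field in out:
            out[field] = round(float(out[field]) * factor, 3)

    # Normalize the timestamp to ISO 8601.
    if "timestamp" in out:
        parsed = datetime.strptime(out["timestamp"], time_format)
        out["timestamp"] = parsed.isoformat()

    missing = [f for f in REQUIRED_FIELDS if f not in out]
    return out, missing


# Example: a submitter's column names and units differ from the standard.
column_map = {
    "date time": "timestamp",
    "well": "well_id",
    "whp (bar)": "wellhead_pressure_kpa",
}
unit_factors = {"wellhead_pressure_kpa": 100.0}  # 1 bar = 100 kPa

raw_row = {"Date Time": "02/06/2023 14:30", "Well": "W-12", "WHP (bar)": "12.5"}
record, missing = standardize_record(
    raw_row, column_map, unit_factors, "%m/%d/%Y %H:%M"
)
print(record, missing)
```

In a sketch like this, the per-source mapping and unit factors are the only parts a submitter's dataset would change; the target schema stays fixed, which is what makes like datasets directly comparable.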
Original language | American English |
---|---|
Number of pages | 10 |
State | Published - 2023 |
Event | 48th Stanford Geothermal Workshop, Stanford, California (6 Feb 2023 → 8 Feb 2023) |
NREL Publication Number
- NREL/CP-6A20-84994
Keywords
- data
- data curation
- data quality
- data science
- GDR
- machine learning
- pipelines
- standardization