Fostering Geothermal Machine Learning Success: Elevating Big Data Accessibility and Automated Data Standardization in the Geothermal Data Repository

Research output: Contribution to conference (Paper)

Abstract

The Department of Energy's (DOE's) Geothermal Data Repository (GDR) has improved its data lakes as well as its data standards and automated data pipelines. The GDR data lakes have reduced storage- and compute-related barriers to using large geothermal datasets, making these datasets accessible to anyone with a modern computer and an internet connection. More recently, the GDR has been working to further reduce barriers by streamlining the data intake process, educating users on the process and requirements, and helping users access data from the data lakes. These improvements have increased the number of datasets the GDR can accept into its data lakes and have made it easier for users who are new to cloud tools to access these datasets, increasing the overall accessibility of big geothermal data for machine learning and other projects. In addition, the GDR now has built-in data standards and pipelines for drilling data, geospatial data, and distributed acoustic sensing (DAS) data. These standardization efforts aim to enhance the real-world applicability of geothermal machine learning outcomes by improving the quality of training data. Specifically, by standardizing high-value datasets, the GDR reduces project-specific data curation requirements, leaving more time for actual research. Automating this process lifts the burden of standardization from the user, ultimately increasing the availability of standardized data.
This paper provides an update on recent improvements made to the GDR's data lakes and automated data pipelines, including: (1) streamlining the data lake intake process, (2) better educating users on the process and requirements through a new data lakes page, (3) adding data lake direct access links to GDR data lake submission pages, (4) implementing a DAS data pipeline to convert DAS data uploaded in SEG-Y format to a standardized Hierarchical Data Format version 5 (HDF5) format, (5) extending this pipeline to encompass data in the GDR data lake, (6) adding metadata requirements for geospatial data, (7) making user interface/user experience (UI/UX) enhancements to the data pipelines' documentation pages, and (8) improving the GDR's data standards and pipelines pages to better guide users in ensuring that their data is standardized by the GDR's automated data pipelines. © 2024 Geothermal Resources Council. All rights reserved.
Original language: American English
Pages: 2279-2291
Number of pages: 13
State: Published - 2025
Event: 2024 Geothermal Rising Conference - Waikoloa, Hawaii
Duration: 27 Oct 2024 to 30 Oct 2024

Conference

Conference: 2024 Geothermal Rising Conference
City: Waikoloa, Hawaii
Period: 27/10/24 to 30/10/24

Bibliographical note

See NREL/CP-6A20-90400 for preprint

NREL Publication Number

  • NREL/CP-6A20-93583

Keywords

  • accessibility
  • das
  • data
  • data lake
  • data pipeline
  • data science
  • data standard
  • distributed acoustic sensing
  • gdr
  • geospatial
  • gis
  • user experience
