TY - GEN
T1 - Conquering Data Chaos: Research Data Management with Kubernetes
AU - Clark, Struan
PY - 2023
Y1 - 2023
N2 - Managing massive volumes of data and effectively making it accessible to researchers poses significant challenges and is a barrier to scientific discovery. In many cases, critical data is locked up in unwieldy file formats or one-off databases and is too large to effectively process on a single machine. This talk explores the role of Kubernetes, an open-source container orchestration platform, in addressing research data management challenges. I will discuss how we are using a set of publicly available open-source and home-grown tools in the National Renewable Energy Laboratory (NREL) Data, Analysis, and Visualization (DAV) group to help researchers overcome data-related bottlenecks. The talk will begin by providing an overview of the data challenges faced in research data management, including data storage, processing, and analysis. I will highlight Kubernetes' ability to handle large-scale data by leveraging containerization and distributed computing, including distributed storage. Kubernetes allows researchers to encapsulate data processing infrastructure and workflows into portable containers, enabling reproducibility and ease of deployment. Kubernetes can then schedule and manage the resource allocation of these containers to enable efficient utilization of limited computing resources, leading to more efficient data processing and analysis. I will discuss some limitations of traditional, siloed approaches to dealing with data and emphasize the need for solutions that foster collaboration. I will highlight how we are using Kubernetes at NREL to facilitate data sharing and cooperation among research teams. Kubernetes' flexible architecture enables the deployment of shared computing environments, such as Apache Superset, where researchers can seamlessly access and analyze shared datasets.
Letting one research team easily consume data generated by another, with Kubernetes as a central data platform, is one of the major wins we have seen from adopting the platform. Finally, I will showcase real-world use cases from NREL where we have used Kubernetes to solve some persistent data challenges involving large volumes of sensor and monitoring data. I will discuss the challenges we encountered when creating our cluster and making it available as a production-ready resource. I will also discuss the specific suite of tools we have deployed in our infrastructure, including Postgres and Apache Druid for columnar and timeseries data and Redpanda Kafka for streaming data, and the process that went into selecting these tools.
AB - Managing massive volumes of data and effectively making it accessible to researchers poses significant challenges and is a barrier to scientific discovery. In many cases, critical data is locked up in unwieldy file formats or one-off databases and is too large to effectively process on a single machine. This talk explores the role of Kubernetes, an open-source container orchestration platform, in addressing research data management challenges. I will discuss how we are using a set of publicly available open-source and home-grown tools in the National Renewable Energy Laboratory (NREL) Data, Analysis, and Visualization (DAV) group to help researchers overcome data-related bottlenecks. The talk will begin by providing an overview of the data challenges faced in research data management, including data storage, processing, and analysis. I will highlight Kubernetes' ability to handle large-scale data by leveraging containerization and distributed computing, including distributed storage. Kubernetes allows researchers to encapsulate data processing infrastructure and workflows into portable containers, enabling reproducibility and ease of deployment. Kubernetes can then schedule and manage the resource allocation of these containers to enable efficient utilization of limited computing resources, leading to more efficient data processing and analysis. I will discuss some limitations of traditional, siloed approaches to dealing with data and emphasize the need for solutions that foster collaboration. I will highlight how we are using Kubernetes at NREL to facilitate data sharing and cooperation among research teams. Kubernetes' flexible architecture enables the deployment of shared computing environments, such as Apache Superset, where researchers can seamlessly access and analyze shared datasets.
Letting one research team easily consume data generated by another, with Kubernetes as a central data platform, is one of the major wins we have seen from adopting the platform. Finally, I will showcase real-world use cases from NREL where we have used Kubernetes to solve some persistent data challenges involving large volumes of sensor and monitoring data. I will discuss the challenges we encountered when creating our cluster and making it available as a production-ready resource. I will also discuss the specific suite of tools we have deployed in our infrastructure, including Postgres and Apache Druid for columnar and timeseries data and Redpanda Kafka for streaming data, and the process that went into selecting these tools.
KW - collaborative environment
KW - data analysis
KW - data challenges
KW - data management
KW - kubernetes
M3 - Presentation
T3 - Presented at the US-RSE Conference, 16-18 October 2023, Chicago, Illinois
ER -