Generic and ML Workloads in an HPC Datacenter: Node Energy, Job Failures, and Node-Job Analysis

Xiaoyu Chu, Daniel Hofstatter, Shashikant Ilager, Sacheendra Talluri, Duncan Kampert, Damian Podareanu, Dmitry Duplyakin, Ivona Brandic, Alexandru Iosup

Research output: Contribution to conference › Paper

Abstract

HPC datacenters offer a backbone to the modern digital society. Increasingly, they run Machine Learning (ML) jobs next to generic, compute-intensive workloads, supporting science, business, and other decision-making processes. However, understanding how ML jobs impact the operation of HPC datacenters, relative to generic jobs, remains desirable but understudied. In this work, we leverage long-term operational data, collected from a national-scale production HPC datacenter, and statistically compare how ML and generic jobs can impact the performance, failures, resource utilization, and energy consumption of HPC datacenters. Our study provides key insights, e.g., ML-related power usage causes GPU nodes to run into temperature limitations, median/mean runtime and failure rates are higher for ML jobs than for generic jobs, both ML and generic jobs exhibit highly variable arrival processes and resource demands, significant amounts of energy are spent on unsuccessfully terminating jobs, and concurrent jobs tend to terminate in the same state. We open-source our cleaned-up data traces on Zenodo (https://doi.org/10.5281/zenodo.13685426), and provide our analysis toolkit as software hosted on GitHub (https://github.com/atlarge-research/2024-icpads-hpc-workload-characterization). This study offers multiple benefits for datacenter administrators, who can improve operational efficiency, and for researchers, who can further improve system designs, scheduling techniques, etc.

Original language: American English
Pages: 710-719
Number of pages: 10
State: Published - 2024
Event: 2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS) - Belgrade, Serbia
Duration: 10 Oct 2024 to 14 Oct 2024

Conference

Conference: 2024 IEEE 30th International Conference on Parallel and Distributed Systems (ICPADS)
City: Belgrade, Serbia
Period: 10/10/24 to 14/10/24

NREL Publication Number

  • NREL/CP-2C00-92731

Keywords

  • crossanalysis
  • datacenters
  • energy consumption
  • failure analysis
  • GPU
  • HPC
  • machine learning
  • multivariate analysis
  • system modeling
  • workload characterization
