Mastering HPC Runtime Prediction: From Observing Patterns to a Methodological Approach: Preprint

Research output: Contribution to conference: Paper

Abstract

The continual expansion of high-performance computing (HPC) brings with it an increasing need for efficiency. Heavy investment in energy, hardware, and software infrastructure to support peta- and exascale computing requires the optimization of existing systems and, wherever possible, the discernment and adoption of best practices toward these goals. Such is the case for runtime prediction. When a job is submitted to an HPC system, the user provides an estimate of its runtime in the form of a "requested wallclock" time. Error in this user-provided estimate can lead to jobs being killed prematurely by the scheduler, longer queue wait times, and lower system utilization. More than fifteen years of research has been directed at mitigating these effects through data-driven runtime prediction. Codified here is a set of commonalities and insights emerging from this body of work, which we present as recommendations and best practices. These practices are combined into a methodological approach, which is described and evaluated on an 11-million-job dataset from the National Renewable Energy Laboratory's petascale HPC system, Eagle. This dataset and the accompanying codebase have been released to the public domain for the benefit of the wider HPC research community.
Original language: American English
Number of pages: 14
State: Published - 2023
Event: PEARC23 - Portland, Oregon
Duration: 23 Jul 2023 to 27 Jul 2023

Conference

Conference: PEARC23
City: Portland, Oregon
Period: 23/07/23 to 27/07/23

NREL Publication Number

  • NREL/CP-2C00-86526

Keywords

  • high performance computing
  • runtime prediction
  • state of practice
