Abstract
The continual expansion of high-performance computing (HPC) brings with it an increasing need for efficiency. Heavy investment in energy, hardware, and software infrastructure to support peta- and exascale computing requires the optimization of existing systems and, wherever possible, the discernment and adoption of best practices toward these goals. Such is the case for runtime prediction. When a job is submitted to an HPC system, an estimate of its runtime is provided by the user in the form of "requested wallclock". Error in this user-provided estimate can lead to jobs being prematurely killed by the scheduler, increased wait time in the queue, and decreased system utilization. More than fifteen years of research has been directed at mitigating these effects by using data-driven runtime predictions. Codified here is a set of commonalities and insights emerging from this body of work, which we present as recommendations and best practices. These practices are combined into a methodological approach described and evaluated on an 11-million-job dataset from the National Renewable Energy Laboratory's petascale HPC system, Eagle. This dataset and the accompanying codebase have been released to the public domain for the benefit of the wider HPC research community.
Original language | American English |
---|---|
Pages | 75-85 |
Number of pages | 11 |
State | Published - 2023 |
Event | PEARC '23: Practice and Experience in Advanced Research Computing - Portland, Oregon; Duration: 23 Jul 2023 → 27 Jul 2023 |
Conference
Conference | PEARC '23: Practice and Experience in Advanced Research Computing |
---|---|
City | Portland, Oregon |
Period | 23/07/23 → 27/07/23 |
Bibliographical note
See NREL/CP-2C00-86526 for preprint
NREL Publication Number
- NREL/CP-2C00-88325
Keywords
- high performance computing
- runtime prediction
- state of practice