Tandem Predictions for HPC Jobs: Article No. 23

Research output: Contribution to conference (paper)

Abstract

In predictive analytics applied to High Performance Computing (HPC), the two most prominent tasks are the prediction of job runtimes and the prediction of job queue times, both of which can inform HPC users in their everyday decision making. Accurate runtime predictions can help users choose better wallclock times at job submission, decreasing the odds of their jobs waiting in queues longer than necessary. Accurate and timely queue time predictions for the available partitions can guide users toward favorable partitions for running their jobs. This potential is well understood, as shown by the abundance of research studies that propose solutions for these tasks, including work published in the last several years. The tasks appear well suited to Machine Learning (ML) solutions, since there is no shortage of training data: HPC centers run millions of jobs over time. However, having studied the existing research literature and looked for examples in the toolchains supported at exemplar HPC facilities, we surprisingly do not find any practical solutions that are ready to be adopted. We interpret this as a manifestation of the shortage of UX/UI efforts that support HPC analytics and as a sign that the research has not come to a consensus on how to solve these tasks. In this study, we aim to shed new light on the long-running task of job queue time prediction by exploring the utility of runtime predictions in improving prediction accuracy and by predicting these two metrics together, in tandem. In other words, we show how runtime predictions become valuable input for queue time modeling. We challenge the existing approaches to feature engineering for queue time prediction and describe promising results obtained for a large dataset of HPC jobs from a supercomputer at the National Renewable Energy Laboratory.
Original language: American English
Pages: 1-9
Number of pages: 9
State: Published - 2024
Event: PEARC24 - Providence, RI
Duration: 22 Jul 2024 - 25 Jul 2024

Conference

Conference: PEARC24
City: Providence, RI
Period: 22/07/24 - 25/07/24

Bibliographical note

See NREL/CP-2C00-91373 for preprint

NREL Publication Number

  • NREL/CP-2C00-90228

Keywords

  • HPC
  • job queue times
  • job runtimes
  • predictions
