Record: A Machine Learning Framework for User-Specific HPC Resource Recommendation

Jun 4, 2026
Beste Oztop
Abstract

Users of today’s HPC systems tend to overestimate the resources (wall time, CPU cores, memory) they request for batch jobs in order to avoid early termination. This habit is rational from the user’s perspective, but costly: jobs wait longer in the queue, allocated resources sit idle, and schedulers operate on systematically inflated requests. Existing workload managers have no mechanism for predicting execution time or resource usage before a batch job is queued and scheduled.

This talk presents Record, a machine learning-based resource recommendation framework designed to close that gap. Record uses historical resource usage data from the workload manager and applies a Random Forest regression model to recommend resource requests at job submission time. With two job grouping mechanisms (k-means clustering and user-based grouping), Record outperforms the baseline approaches across all target variables and dataset combinations in the offline prediction experiments. In a real-world deployment on the Boston University Shared Computing Cluster, Record reduced the average waiting time for more than 2,000 batch job submissions from 17.5 hours to 1.2 hours. Furthermore, Record decreased average CPU core-hour consumption across four case studies without any changes to the underlying scheduler.
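To make the idea of grouped Random Forest recommendation concrete, the following is a minimal sketch, not Record's actual implementation. It assumes scikit-learn and NumPy are available, and everything in it is hypothetical: the synthetic job history, the two features (requested cores and requested hours), the function names, and the `margin` safety factor are illustrative choices that do not come from the talk. It groups jobs either by user or by k-means cluster, fits one regressor per group, and predicts wall time for a new submission.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor


def make_synthetic_history(n_jobs=300, seed=0):
    """Build a toy job history (hypothetical data, not from any real cluster)."""
    rng = np.random.default_rng(seed)
    user_id = rng.integers(0, 5, size=n_jobs)
    req_cores = rng.choice([1, 4, 8, 16], size=n_jobs)
    req_hours = rng.choice([12.0, 24.0, 48.0], size=n_jobs)
    # Assumption: each user consumes only a fraction of the requested wall time.
    frac = 0.1 + 0.1 * user_id / 4 + rng.uniform(0.0, 0.1, size=n_jobs)
    used_hours = req_hours * frac
    features = np.column_stack([req_cores, req_hours]).astype(float)
    return user_id, features, used_hours


def fit_per_group(groups, features, target, seed=0):
    """Fit one Random Forest regressor for each group label."""
    models = {}
    for g in np.unique(groups):
        mask = groups == g
        model = RandomForestRegressor(n_estimators=50, random_state=seed)
        model.fit(features[mask], target[mask])
        models[int(g)] = model
    return models


def recommend(models, group, job_features, margin=1.2):
    """Predict usage for one job and pad it with a hypothetical safety margin."""
    x = np.asarray([job_features], dtype=float)
    return float(models[int(group)].predict(x)[0] * margin)


if __name__ == "__main__":
    user_id, X, used_hours = make_synthetic_history()
    new_job = [8, 24.0]  # requested cores, requested hours

    # Variant 1: user-based grouping (the group label is the submitting user).
    user_models = fit_per_group(user_id, X, used_hours)
    print("user-based:", recommend(user_models, 2, new_job))

    # Variant 2: k-means grouping (the group label is the nearest cluster).
    km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
    cluster_models = fit_per_group(km.labels_, X, used_hours)
    cluster = km.predict(np.asarray([new_job], dtype=float))[0]
    print("k-means:", recommend(cluster_models, cluster, new_job))
```

The margin reflects one plausible design choice: an underestimate terminates the job early, so a recommender may prefer to err slightly high. Whether and how Record pads its predictions is not stated here.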

We will conclude with lessons learned from real-world deployment and a discussion of how Record generalizes to other HPC environments, with implications for scheduling efficiency, energy consumption, and user experience.

Authors
Beste Oztop
PhD Candidate
Beste Oztop is a PhD candidate in the Department of Electrical and Computer Engineering at Boston University. She received her B.S. degree in electrical and electronics engineering from Middle East Technical University, Turkey. Her research focuses on resilient and efficient high-performance computing (HPC) systems, with a specific interest in applied machine learning for intelligent resource management.