December 4, 2025 Minutes

Dec 4, 2025 · Natalie Bates

The meeting opened with a debrief of the SC25 Birds of a Feather (BoF) session. Approximately 34 to 36 participants engaged in the real-time Mentimeter survey, with the majority identifying as System Administrators or Researchers. Participation from HPC end-users was very low (only one or two individuals), suggesting a need to better bridge the gap with the user community. Operational data analytics (ODA) continues to be perceived as “difficult” by many, though the group questioned whether this reflects actual technical complexity or the specific “in the weeds” perspective of the attendees. The survey used a spider chart to track interest across several use cases, with performance monitoring, optimization, and resource utilization analysis emerging as the top priorities. Researchers were significantly more likely than operations staff to use AI and machine learning for data analysis, while visualization, performance analysis, and job reports remained the community’s primary tools.
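For readers wanting to reproduce the survey's presentation, a spider (radar) chart of this kind takes only a few lines of matplotlib. The sketch below uses use-case labels drawn from the minutes but made-up interest scores, since the actual Mentimeter values were not recorded here.

```python
import numpy as np
import matplotlib.pyplot as plt

# Use cases mentioned in the minutes; the scores are illustrative only,
# chosen to match the reported ordering, not the real survey values.
use_cases = ["Performance monitoring", "Optimization",
             "Resource utilization", "Future planning",
             "User behavior", "Predictive maintenance",
             "Energy management"]
scores = [4.6, 4.4, 4.3, 3.8, 3.6, 3.3, 3.2]

# Spread the axes evenly around the circle and close the polygon
# by repeating the first point at the end.
angles = np.linspace(0, 2 * np.pi, len(use_cases), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
ax.plot(angles, values, linewidth=2)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(use_cases, fontsize=8)
ax.set_ylim(0, 5)
ax.set_title("ODA use-case interest (illustrative)")
plt.tight_layout()
plt.show()
```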

The core of the debrief discussion centered on why operational data isn’t being shared more effectively between sites or with researchers. Key blockers identified include policy and privacy concerns, a fear of data being misinterpreted to reflect poorly on a facility, and a general “inertia” regarding instrumentation. While researchers strongly agree that HPC data is valuable, they reported low satisfaction with the current availability and quality of public datasets. “Knowledge sharing” was identified as a major missing link, specifically regarding cross-site collaboration on how to interpret and act on telemetry. There is a growing desire for real-time data access, though making this available remains both a technical and a policy challenge. The increasing complexity of hardware (various GPUs and accelerators) makes data standardization difficult, as metrics change with every generation. There may also be a lack of “error labels” (contextual data explaining why a job failed), which are critical for training ML models. The team suggested conducting a year-over-year comparison of Mentimeter results from 2021 through 2025 to identify long-term trends in ODA, and reviewing the SC25 BoF recording to analyze the flow of discussion and identify areas for improvement in session moderation and intro slides.
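On the “error labels” point, a minimal sketch of what a labeled failure record might look like is shown below. The field names and label vocabulary are hypothetical assumptions for illustration; no such schema was agreed at the meeting.

```python
from dataclasses import dataclass
from enum import Enum

class FailureCause(Enum):
    """Hypothetical label vocabulary; a real one would be agreed across sites."""
    NODE_HARDWARE = "node_hardware"
    OUT_OF_MEMORY = "out_of_memory"
    FILESYSTEM = "filesystem"
    NETWORK = "network"
    USER_ERROR = "user_error"
    UNKNOWN = "unknown"

@dataclass
class LabeledJobFailure:
    """One training example pairing job telemetry with a human-assigned cause."""
    job_id: str
    exit_code: int
    telemetry: dict[str, list[float]]  # e.g. per-node power or GPU-temp series
    cause: FailureCause                # the "error label" the group found missing
    annotator_note: str = ""           # free-text context from support staff
```

Pairing each failed job's telemetry with a human-assigned cause, rather than the exit code alone, is what would make such records usable as ML training data.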

The SC25 BoF itself, “Operational Data Analytics: Mind the Gap,” was moderated by Tim Osborne, with featured speakers Wolfgang Frings and Terry Jones. Framed within the Energy Efficient HPC Working Group, the ODA Team focuses on system telemetry: providing a global view of operational data analytics practices, sharing lessons learned, and informing next-generation data collection and monitoring. The initial Mentimeter audience survey indicated that most attendees were familiar with operational data analytics but generally found it difficult, which was described as the main reason the BoF is held every year. The audience was primarily composed of HPC site staff and researchers, with a few users and no cloud providers present. Performance monitoring and optimization were rated as the most important use cases, followed by future planning and user behavior analysis, with predictive maintenance and energy management considered important but slightly lower in priority.

Wolfgang Frings from the Jülich Supercomputing Centre presented an overview of LLView and explained how monitoring knowledge is shared with users. He noted that monitoring infrastructure in HPC centers is typically accessible only to administrators, and that users often lack direct access because data must be tailored to specific jobs. Users usually rely on shell commands or support staff to obtain information such as GPU or network data, which does not scale well. LLView adds a layer to the monitoring infrastructure by collecting information useful for users and storing it in an internal database for aggregation and analysis. It provides role-based access: administrators and support staff can see the entire system, project leaders can see jobs within their projects, and users can see only their own jobs. Users can access job reports in HTML or PDF format that are generated during runtime and include timelines for metrics such as core usage and GPU temperature. The tool supports troubleshooting by helping identify hardware anomalies, such as one GPU running hotter than its peers, as well as memory leaks in applications: users can monitor memory consumption in real time and stop a job before it reaches physical limits. LLView is used on JUPITER, the first exascale system in Europe, and has negligible overhead because it relies on existing data sources rather than querying compute nodes directly.
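As an illustration of the kind of check such per-job reports enable, here is a minimal sketch of a “one GPU running hotter than its peers” detector. The function, the robust z-score rule, and the cutoff are assumptions for illustration, not LLView's actual logic.

```python
import statistics

def flag_hot_gpus(gpu_temps: dict[str, float], cutoff: float = 3.5) -> list[str]:
    """Return IDs of GPUs running unusually hot relative to their peers.

    Uses a robust z-score based on the median absolute deviation (MAD),
    so one hot GPU cannot mask itself by inflating the spread the way it
    would with a mean/stdev baseline. Illustrative only.
    """
    temps = list(gpu_temps.values())
    if len(temps) < 4:
        return []  # too few devices for a meaningful peer baseline
    med = statistics.median(temps)
    mad = statistics.median(abs(t - med) for t in temps)
    if mad == 0:
        mad = 1e-9  # peers are identical; any deviation stands out
    return [gpu for gpu, t in gpu_temps.items()
            if 0.6745 * (t - med) / mad > cutoff]

# Example: gpu7 runs hot relative to the other seven GPUs in the job.
print(flag_hot_gpus({"gpu0": 62, "gpu1": 61, "gpu2": 63, "gpu3": 62,
                     "gpu4": 61, "gpu5": 62, "gpu6": 63, "gpu7": 85}))
# -> ['gpu7']
```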

Terry Jones from Oak Ridge National Laboratory discussed the ExaDigiT project and application fingerprinting. He explained that, as a researcher, his focus is on discovering new insights and testing hypotheses, and that high-quality data is essential. Researchers often struggle with incomplete datasets or restrictions due to intellectual property or privacy concerns. ExaDigiT aims to model supercomputing centers using a real-time simulator, and his specific focus is on recognizing applications based solely on telemetry such as power profiles or input/output patterns. Recognizing an application (for example, a climate model) allows prediction of its behavior in the near future, which is important for managing shared resources. If two jobs each require a large fraction of parallel file system bandwidth, they should not run at the same time, and fingerprints can be used to schedule them serially to avoid slowdowns. Application fingerprints can also be used to schedule jobs in ways that meet power efficiency targets without exceeding infrastructure limits such as transformer capacity.
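A minimal sketch of that co-scheduling idea follows. The fingerprint fields, the aggregate bandwidth figure, and the headroom factor are all assumptions made up for illustration; ExaDigiT's actual simulator models are far richer.

```python
from dataclasses import dataclass

PFS_BANDWIDTH_GBS = 1000.0  # hypothetical aggregate file system bandwidth
HEADROOM = 0.8              # leave slack for background I/O (assumed policy)

@dataclass
class Fingerprint:
    """Behavior predicted for a recognized application, e.g. a climate model."""
    app_name: str
    peak_io_gbs: float  # predicted peak I/O inferred from its telemetry profile

def can_corun(a: Fingerprint, b: Fingerprint) -> bool:
    """True if both jobs fit within the file system's usable bandwidth."""
    return a.peak_io_gbs + b.peak_io_gbs <= HEADROOM * PFS_BANDWIDTH_GBS

climate = Fingerprint("climate-model", peak_io_gbs=600.0)
cfd = Fingerprint("cfd-solver", peak_io_gbs=450.0)
viz = Fingerprint("in-situ-viz", peak_io_gbs=120.0)

print(can_corun(climate, cfd))  # False: schedule these two serially
print(can_corun(climate, viz))  # True: safe to overlap
```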

In the community discussion and Q&A session, Tim Osborne shared survey results on hurdles to providing data. The top concerns were institutional policy, leadership worries about how data reflects on a center, security, and misinterpretation of data. Researchers noted that, although they need datasets, public availability is low and the quality of existing data is often poor. One discussion point addressed the use of artificial intelligence in production environments: an audience member asked why AI-based scheduling, such as job time-limit prediction, is not widely used. Melissa Romanus from NERSC explained that centers are hesitant to rely on automated responses without a human in the loop because of the risk of disrupting production workflows. Another major topic was standardization, with strong interest in standardizing telemetry namespaces and data columns so that datasets could be compared across different laboratories. The discussion also touched on differences between small and large systems. Wolfgang Frings suggested that administrators use smaller systems as test environments for operational data analytics experiments, though it was noted that success on a small system is often not accepted as proof that the approach will work on a large production system with tens of thousands of nodes.
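To make the standardization point concrete, below is one possible shape for a cross-site telemetry record. Every namespace and column name here is hypothetical: the discussion called for agreeing on such a standard, it did not define one.

```python
from dataclasses import dataclass

@dataclass
class TelemetryRecord:
    """Illustrative cross-site record; field names are assumptions, not a standard."""
    site: str             # owning facility, e.g. an anonymized site code
    system: str           # machine name
    metric: str           # namespaced metric, e.g. "node.gpu.power_w"
    node_id: str
    timestamp_utc: float  # epoch seconds in UTC, avoiding per-site clock skew
    value: float
    unit: str             # explicit SI unit so columns compare across labs

record = TelemetryRecord(
    site="example-site", system="example-system",
    metric="node.gpu.power_w", node_id="c0-0n1",
    timestamp_utc=1_733_300_000.0, value=412.5, unit="W",
)
```

Carrying the unit and a namespaced metric name in every row is what would let datasets from different laboratories be joined and compared directly.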

Author
Natalie Bates
EE HPC WG Technical and Executive Lead
Natalie has been the technical and executive lead of the EE HPC WG, which disseminates best practices, shares information through peer-to-peer exchange, and takes collective action, since its inception in 2010.