October 9, 2025 Minutes | HPC Operational Data Analytics

The meeting opened with a review of the schedule for upcoming meetings and events. The BoF dry run was scheduled as a team meeting on November 6th, leading up to the actual SC25 BoF on Thursday, November 20th from 12:15 to 1:15 PM CST. The BoF is covered in the Digital Experience and will have live streaming equipment available. A BoF debriefing was scheduled for December 4th to recap discussion topics and share information with those who could not attend in person. The team meeting on January 1st (New Year’s Day) is cancelled, as is the meeting that would have been held on January 29th, because it conflicts with HPC Asia. The first meeting of the new year will be on January 15th, followed by the usual four-week schedule. The January 15th meeting will also mark the start of the new Tech Talks series, alternatively referred to as “Tasche” Talks (a German word meaning “bag” or “pocket”). The series is starting with a new name and schedule to allow more time for preparation and proper announcement on the webpage.

The group then previewed the upcoming BoF session, which will run 60 minutes. Its main topic is data sharing, specifically making monitoring and operational data available to communities beyond the operators, including users and researchers. Tim Osborne will serve as Master of Ceremonies and will introduce the ODA team and the Energy Efficient HPC Working Group. The session will use Mentimeter for its interactive segments, starting with a “Know your audience” section to gauge attendee awareness of ODA and their current use cases. Wolfgang Frings (JSC) will present on LLView, Jülich’s tool for job-centric monitoring that provides users with role-based access to near real-time job reports and performance data. Harry Jones (ORNL) will then present on the ExaDIGIT project, discussing their work on application fingerprinting and the digital twin framework, along with the challenges they faced in working with existing data.

The remaining BoF time is reserved for interactive discussion, using Mentimeter for structured polling and Q&A, organized around three perspectives. From the data center perspective, the questions ask whether operators are satisfied with the data they collect and whether they see problems in collection (such as getting high-frequency, low-overhead information from NVIDIA GPUs); whether they collect application-specific metrics; whether they give users access to monitoring data through web front ends like LLView; whether they actively engage users based on operational data insights (including proactive support and automatic notifications of strange job behavior); whether they make data available to third parties such as researchers (for example, post-decommissioning releases like Blue Waters and Marconi 100); and what the main hurdles are to providing data (such as GDPR compliance in Europe). From the user perspective, the questions probe whether users are aware they can access operational data, what kinds of data they need (for example, GPU efficiency), and whether they would like the data delivered in a form they can process, such as a REST API that lets them pull it into their workflows automatically. From the researcher perspective, the questions explore whether researchers can access operational data, what data they want (such as power consumption or workload data), what research they want to do (such as analysis of the queuing system or energy usage for heat reuse), whether data centers should make insights on application classification available in public datasets, and whether researchers are aware of or have used existing publicly available data from projects like Blue Waters, Marconi 100, or Fugaku.

The ensuing discussion centered on the challenge of user engagement with monitoring tools. HPC centers have difficulty getting users to engage with the tools even when they are provided, and centers are unsure how to advertise their data viewing tools (like LLView) so that users know they exist. There is a concern that most users “just submit their jobs and don’t care,” contacting support only when a problem occurs and never checking the monitoring dashboards. This observation contrasts with the typical ODA discussion about how hard it is for users to get access to data. The group discussed whether centers should move beyond simply providing a dashboard (a passive approach) and instead actively engage users, for instance by proactively contacting users based on operational data insights (such as seeing “bad jobs” running) and by implementing automatic notifications when strange job behavior is detected in the monitoring system.

For the data center and researcher perspectives, the feedback focused on non-technical barriers to sharing and on maximizing data value. In Europe, strict GDPR data protection laws make it “tricky” and “particularly hard” to collect or share user-specific data, especially with third parties like researchers. Researchers noted a critical lack of application-level information in publicly released datasets such as those from Fugaku and Marconi 100. While these datasets contain general accounting data, they lack details on which applications, or even which types of applications (machine learning versus HPC simulations, for example), were running. This information is important for researchers but a “nightmare in terms of data protection” for data centers to disclose. Some centers, like NREL, try to address this internally by combining data from job submission scripts, application binaries, and original allocation submissions.

Authors

Natalie Bates

EE HPC WG Technical and Executive Lead

Natalie has been the technical and executive leader of the EE HPC WG since its inception in 2010. The group disseminates best practices, shares information (peer-to-peer exchange), and takes collective action.
