Operational Data Analytics: HPC Efficiency Improvements with Interoperable Monitoring and Analysis
Operational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems and their data centers. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system increasingly easy. Many HPC sites have deployed sophisticated frameworks that collect telemetry data from their HPC systems, the applications running on them, and their supporting infrastructure, leading to an unprecedented volume of monitoring data. However, making the data work for HPC operations is not straightforward, and many HPC sites duplicate effort developing methods and tools to analyze the data and leverage it for operations.
There is clear demand within the community to collaborate on this, but such collaboration is severely hampered because standards for the semantics and naming of monitoring data are currently missing. This BoF continues the discussions on monitoring and operational data analytics that started at previous iterations of SC and ISC and have continued within the ODA team of the Energy Efficient HPC Working Group (EEHPCWG). It will focus strongly on sharing lessons learned, identifying opportunities for collaboration in operational data analytics, and establishing a path forward in standardizing monitoring data to facilitate interoperability of tools across sites.
Congress Center Hamburg, Hall E
Hamburg,
Session Overview
ISC 2024, Tuesday, May 14, 3:30 PM to 4:30 PM, in Hall E. Thomas Ilsche (TU Dresden / ZIH) led the session, which carried the SC23 conversation on data standardization forward with a concrete focus on interoperable monitoring and analysis.
Why This Matters
By ISC 2024, many HPC sites had deployed sophisticated telemetry frameworks across infrastructure, systems, applications, and users. The bottleneck had shifted: effort was being duplicated because sites could not meaningfully combine or compare their data. Interoperability, shared semantics, naming, and metadata conventions were prerequisites for any real cross-site collaboration.
The BoF used Mentimeter polls and audience discussion to gather feedback on specific interoperability challenges, building on approaches that had worked in earlier iterations. The SC23 BoF had drawn nearly 60 participants, and the ISC 2024 session continued the effort to grow the engaged community.
Outcome
The ISC 2024 BoF reinforced the direction that led to the SC25 BoF’s “Mind the Gap” theme and, ultimately, to the acceptance of the first peer-reviewed HPC-ODA Workshop at SC26. Interoperability, standardization, and bridging stakeholder communities have remained the community’s central challenges.