Operational Data Analytics

Nov 16, 2023 · Rachel Palumbo, Kadidia Konaté, Melissa Romanus, Norm Bourassa, Jim Brandt, Jeff Hanson, Tim Osborne, Michael Ott, Ben Schwaller, Woong Shin, Kathleen Shoga, Keiji Yamamoto
Abstract
Operational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system increasingly easy. However, making the data work for HPC operations is not straightforward, and effort is being duplicated at many HPC sites to develop methods and tools to analyze the data and leverage it for operations. There is a clear demand to collaborate on this within the community, but because standards for the semantics and naming of monitoring data are currently missing, such collaboration is severely hampered.
Event
Location

Colorado Convention Center, Room 607

700 14th Street, Denver, Colorado 80202

Session Overview

The BoF took place at SC23 in Denver on Thursday, November 16, 2023, at 12:15 PM MST in Room 607. It was led by Rachel Palumbo (ORNL) with a panel drawn from across the ODA team: LBNL, Sandia, HPE, ORNL, LRZ, LLNL, and RIKEN.

The Standardization Problem

Previous BoFs had charted the ODA field from holistic monitoring (SC19) to the data tsunami (SC22) to statistical and AI methods. During SC22, a thread had emerged around open data and the standardization of monitoring data. SC23 put that thread front and center.

Some sites had successfully leveraged their monitoring data to optimize HPC operations, but their approaches were not easily transferable. The main stumbling blocks for closer collaboration were different approaches to organizing data, incompatible naming schemes, and a lack of metadata. Staff with data-analysis expertise were in short supply, and without shared data conventions, sites were effectively redoing each other’s work.
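To make the naming and metadata problem concrete, here is a minimal sketch of what bridging two sites' conventions might involve. Everything in it is hypothetical and chosen only for illustration: the metric names, the shared name "node.power", the units, and the scale factors do not come from any site or from a standard discussed at the BoF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One monitoring sample in a shared schema (hypothetical)."""
    metric: str   # shared metric name
    unit: str     # unit after conversion
    value: float  # reading expressed in that unit


# Hypothetical per-site mappings:
# local metric name -> (shared name, shared unit, scale factor to shared unit)
SITE_A = {"node_pwr_w": ("node.power", "W", 1.0)}
SITE_B = {"PowerConsumption_kW": ("node.power", "W", 1000.0)}


def normalize(mapping, name, value):
    """Translate a site-local reading into the shared schema.

    Raises KeyError if the site mapping has no entry for `name`.
    """
    shared_name, unit, scale = mapping[name]
    return Sample(shared_name, unit, value * scale)


# Two sites report node power under different names and units.
a = normalize(SITE_A, "node_pwr_w", 412.0)
b = normalize(SITE_B, "PowerConsumption_kW", 0.5)
print(a)
print(b)
```

After normalization both readings carry the same metric name and unit, so an analysis written against the shared schema could run on data from either site. Without an agreed convention, each site has to build and maintain tables like these on its own, which is the duplicated effort described above.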

Outcome

The BoF reported back on current efforts to drive standardization, helped to establish a larger community backed by the EE HPC WG, fostered collaboration among different HPC sites, and drove further discussion toward standardization and data sharing. The standardization theme continued into ISC 2024 as “Interoperable Monitoring and Analysis.”