Operational Data Analytics: Drowning in Data

Nov 16, 2022
Michael Ott, Melissa Romanus, Norm Bourassa, Rachel Palumbo, Woong Shin, Jeff Hanson, Torsten Wilde, Jim Brandt, Ben Schwaller, Natalie Bates
Abstract
Operational Data Analytics (ODA) provides unique opportunities to analyze, understand, and optimize operations of HPC systems. Readily available open-source frameworks make the collection of monitoring data from different domains of the HPC system (infrastructure, system hardware, software, applications) increasingly easy. However, making the data work for HPC operations is not straightforward. AI-based methods seem interesting, but it is not obvious which tools and methods are suitable for this type of data. This BoF aims to bring together practitioners in HPC operations to share use cases for ODA, discuss problems, and provide feedback.
Event
Location

Kay Bailey Hutchison Convention Center

Dallas, Texas

Session Overview

SC22 in Dallas marked a shift in the ODA BoF series. Where earlier sessions focused on building the monitoring stack, “Drowning in Data” turned to the problem that came after: sites had successfully collected vast amounts of telemetry but now struggled to visualize it at useful granularity or to analyze it for actionable knowledge.

The session combined presentations from US, European, and Asian HPC facilities on specific use cases and lessons learned with open audience discussion, leaving roughly half the session for interaction.

Why This Matters

Most HPC sites were engaged in some form of ODA, whether they called it that or not. Many were overwhelmed by the amount of data they collected and found it difficult to either visualize it in enough detail or find the right tool to extract actionable knowledge. The big-data world offered many methods, but picking the right one required expertise both in data analytics and in the HPC domain.

Threshold-based alarms frequently produced nuisance alerts that overwhelmed operators. Anomaly-based methods held promise for making alarms more relevant. The BoF brought ODA researchers, HPC operators, and data-analytics experts into the same room to share experiences and lessons learned.
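
To make the contrast between the two alarm styles concrete, here is a minimal Python sketch. It is illustrative only and does not come from any of the session's presentations: the sample readings, the 30.0 limit, the window of 5 samples, and the z-score cutoff of 3.0 are all assumed values, and a trailing-window z-score is just one simple stand-in for the broader family of anomaly-based methods.

```python
from statistics import mean, stdev

# Hypothetical inlet-temperature samples (degrees C) that hover around a
# fixed limit of 30.0 and contain one genuine spike at index 8.
readings = [29.8, 30.1, 29.9, 30.2, 30.0, 30.1, 29.9, 30.2, 36.0, 30.0]


def threshold_alerts(values, limit):
    """Return the indices of all samples strictly above a fixed limit."""
    return [i for i, value in enumerate(values) if value > limit]


def zscore_alerts(values, window=5, z_limit=3.0):
    """Return the indices of samples that deviate strongly from the
    mean of the preceding `window` samples (a trailing z-score)."""
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # A perfectly flat history gives no scale to compare against;
            # this sketch simply skips such samples.
            continue
        if abs(values[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts


if __name__ == "__main__":
    print("threshold alerts:", threshold_alerts(readings, limit=30.0))
    print("z-score alerts:  ", zscore_alerts(readings))
```

On this sample the fixed limit fires on five samples, four of which are ordinary jitter around 30.0, while the trailing z-score flags only the spike at index 8. Real telemetry would call for tuning the window and cutoff per sensor, and the first `window` samples are never evaluated in this sketch.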

Historical Context

“Drowning in Data” built directly on the SC18 and SC19 BoFs on ODA infrastructure and on the SC21 BoF that introduced the 4x4 ODA conceptual framework. It set up the standardization and interoperability themes that would dominate the following years’ sessions.