Operational Data Analytics

Nov 20, 2019 · Michael Ott, Melissa Romanus, Keiji Yamamoto, Luca Bortot, Michael Mason · 1 min read
Abstract
For a couple of years now, multiple supercomputing sites around the globe have had development and implementation projects underway for expanded monitoring frameworks that collect operational parameters of HPC systems and facility support infrastructure into a single unified database, providing a new and more comprehensive overview of all operations. Early-adopter sites have already deployed these systems into production and are building a valuable repository of performance data. What to do with this wealth of data, how to process and analyze it, and how to feed it back into improved operations will be the topic of this BoF session.
Event
Location

Colorado Convention Center

Denver, Colorado

Session Overview

The SC19 BoF brought together HPC sites that had already deployed comprehensive high-resolution monitoring with those still planning their own rollouts. Short presentations from three leading-edge sites across the US, Asia, and Europe framed the discussion, followed by open exchange on lessons learned, use cases, and challenges.

Why This Matters

The Energy Efficient HPC Working Group (EE HPC WG) and others had long argued that fine-grained instrumentation and monitoring of HPC systems are required to understand, control, and optimize HPC operations. By 2019, leading sites had deployed sophisticated frameworks that could collect, store, and retrieve telemetry data from thousands of devices at high resolution, covering everything from power provisioning and cooling infrastructure down to individual compute nodes and applications. The next challenge, and the focus of this BoF, was making use of that data.

Obvious use cases included data center infrastructure performance optimization and Fault Detection and Diagnostics. More sophisticated scenarios involved feeding data back to facility control systems or batch schedulers to optimize energy performance and utilization.

Organizer

The BoF was organized by the EE HPC WG Operational Data Analytics team. Supporting resources are available on the session’s companion page on the EE HPC WG site: eehpcwg.llnl.gov/conf_sc19.html.