A Conceptual Framework for HPC Operational Data Analytics

Sep 1, 2021
Alessio Netti
Woong Shin
Michael Ott
Torsten Wilde
Natalie Bates
Abstract
This paper provides a broad framework for understanding trends in Operational Data Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to allow for the continuous monitoring, archiving, and analysis of near real-time performance data, providing immediately actionable information for multiple operational uses. In this work, we combine two models to provide a comprehensive HPC ODA framework. The first is an evolutionary model of analytics capabilities with four types: descriptive, diagnostic, predictive, and prescriptive. The second is a four-pillar model for energy-efficient HPC operations that covers facility, system hardware, system software, and applications. This new framework is then overlaid with a description of current development and production deployments of ODA within leading-edge HPC facilities. Finally, we perform a comprehensive survey of ODA works and classify them according to our framework in order to demonstrate its effectiveness.
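As a rough illustration of the two-axis framework described in the abstract, the four analytics types and the four pillars can be read as a 4 x 4 grid into which surveyed works are placed. The sketch below is not taken from the paper: the names `AnalyticsType`, `Pillar`, and `classify` are assumptions made for this example, and the listed works are hypothetical placeholders rather than entries from the actual survey.

```python
from collections import defaultdict
from enum import Enum


class AnalyticsType(Enum):
    """The four analytics types named in the abstract."""
    DESCRIPTIVE = "descriptive"
    DIAGNOSTIC = "diagnostic"
    PREDICTIVE = "predictive"
    PRESCRIPTIVE = "prescriptive"


class Pillar(Enum):
    """The four pillars of energy-efficient HPC operations named in the abstract."""
    FACILITY = "facility"
    SYSTEM_HARDWARE = "system hardware"
    SYSTEM_SOFTWARE = "system software"
    APPLICATIONS = "applications"


def classify(works):
    """Group (name, analytics_type, pillar) tuples by their grid cell.

    Hypothetical helper: returns a dict mapping (AnalyticsType, Pillar)
    to the list of work names that fall into that cell.
    """
    grid = defaultdict(list)
    for name, analytics_type, pillar in works:
        grid[(analytics_type, pillar)].append(name)
    return grid


# Placeholder entries for illustration only; not real survey data.
works = [
    ("example work A", AnalyticsType.DESCRIPTIVE, Pillar.FACILITY),
    ("example work B", AnalyticsType.PREDICTIVE, Pillar.SYSTEM_HARDWARE),
    ("example work C", AnalyticsType.PREDICTIVE, Pillar.SYSTEM_HARDWARE),
]
grid = classify(works)
```

Keying the grid on the pair of enum members keeps both axes explicit, so a cell with no entries shows directly where no placeholder work has been assigned.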
Type
Publication
EEHPCWG State of Practice Workshop at the 2021 IEEE International Conference on Cluster Computing (CLUSTER)
Authors
Woong Shin
Member
Woong Shin, Ph.D. is a Research Scientist at Oak Ridge National Laboratory (ORNL), specializing in high-performance computing (HPC), AI/ML applications, and energy-efficient supercomputing.
Michael Ott
ODA Team Leader
Michael Ott is a senior research engineer in the Future Computing group at Leibniz Supercomputing Centre (LRZ).
Natalie Bates
EE HPC WG Technical and Executive Lead
Natalie has been the technical and executive leader of the EE HPC WG since its inception in 2010. The group disseminates best practices, shares information (peer-to-peer exchange), and takes collective action.