A Conceptual Framework for HPC Operational Data Analytics

Sep 1, 2021
Alessio Netti
Woong Shin
Michael Ott
Torsten Wilde
Natalie Bates
Abstract
This paper provides a broad framework for understanding trends in Operational Data Analytics (ODA) for High-Performance Computing (HPC) facilities. The goal of ODA is to allow for the continuous monitoring, archiving, and analysis of near real-time performance data, providing immediately actionable information for multiple operational uses. In this work, we combine two models to provide a comprehensive HPC ODA framework: one is an evolutionary model of analytics capabilities that consists of four types, which are descriptive, diagnostic, predictive and prescriptive, while the other is a four-pillar model for energy-efficient HPC operations that covers facility, system hardware, system software, and applications. This new framework is then overlaid with a description of current development and production deployments of ODA within leading-edge HPC facilities. Finally, we perform a comprehensive survey of ODA works and classify them according to our framework, in order to demonstrate its effectiveness.
Type
Publication
EEHPCWG State of Practice Workshop at the 2021 IEEE International Conference on Cluster Computing (CLUSTER)
Authors
Alessio Netti
HPC/AI Research Engineer
Alessio Netti (Ph.D., Technical University of Munich, 2022) is an HPC/AI research engineer at DeepL, after earlier work at Leibniz Supercomputing Centre (LRZ) and at Intel on HPC and AI dependability. He is a lead co-author of “A Conceptual Framework for HPC Operational Data Analytics” (IEEE 2021), which establishes the 4x4 scope-and-capability model widely used by the ODA community.
Woong Shin
Research Scientist
Woong Shin, Ph.D., is a Research Scientist at Oak Ridge National Laboratory (ORNL), specializing in high-performance computing (HPC), AI/ML applications, and energy-efficient supercomputing.
Michael Ott
Senior Research Engineer
Michael Ott is a senior research engineer in the Future Computing group at Leibniz Supercomputing Centre (LRZ).
Torsten Wilde
HPC System Software Architect
Torsten Wilde (Ph.D.) is a system software architect at Hewlett Packard Enterprise (HPE). His work spans high-volume, high-frequency data collection and analytics for IT operations and dynamic system power management. He is part of the leadership team of the Energy Efficient HPC Working Group (EE HPC WG).
Natalie Bates
EE HPC WG Technical and Executive Lead
Natalie Bates has been the technical and executive lead of the EE HPC WG since its inception in 2010. The group disseminates best practices, shares information through peer-to-peer exchange, and takes collective action.