Operating End-to-End HPC Data Center Observability at Scale

Mar 12, 2026
Melissa Romanus
Basil Lalli
Abstract

As NERSC prepares for its next-generation system, Doudna, the team has been building Omni, an end-to-end observability platform that manages millions of metrics and logs per second across a diverse, multi-vendor environment. Omni functions as a centralized data lake, ingesting everything from CPU and GPU metrics to building-level data such as groundwater cooling measurements and particle counts from California wildfires.

To monitor its HPC systems, NERSC runs a distributed platform comprising hundreds of monitoring nodes, 17 specialized Kubernetes clusters, and thousands of environmental sensors. The environment is entirely Git-driven, using Ansible, FluxCD, and NetBox to enable rolling upgrades and automated pod restarts without interrupting the continuous 24/7 data streams. The talk includes technical deep dives and a look at live production dashboards.

Authors
Melissa Romanus
Data Management Engineer
Melissa Romanus is a data management engineer at the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, where she is part of the Operations Technology Group. Her work focuses on the ingestion, collection, analysis, and visualization of real-time streaming operational and systems data in HPC data centers. Her research interests span operational data analytics, the architecture of large-scale data lakes, and automating scientific workloads on HPC systems.
Basil Lalli
Computer Systems Engineer
Basil Lalli is a Computer Systems Engineer in the HPC Technology Department at the National Energy Research Scientific Computing Center (NERSC), Lawrence Berkeley National Laboratory, where he has worked since 2012.