March 12, 2026 Minutes

Mar 12, 2026·
Natalie Bates

The meeting featured a tech talk titled “Operating End-to-End HPC Data Center Observability at Scale,” presented by Melissa Romanus and Basil Lalli (NERSC/LBNL). The talk focused on Omni, the observability platform that NERSC has been building as the center prepares for its next-generation system, Doudna.

Romanus and Lalli described how the team manages millions of metrics and logs per second across a diverse, multi-vendor environment. Omni functions as a centralized data lake, ingesting everything from CPU and GPU metrics to building-level data such as groundwater cooling measurements and particle counts from California wildfires. This breadth allows the center to correlate facility conditions with system behavior in a single place.

To monitor its HPC systems, NERSC runs a distributed platform comprising hundreds of monitoring nodes, 17 specialized Kubernetes clusters, and thousands of environmental sensors. The environment is entirely Git-driven, using Ansible, FluxCD, and NetBox to support rolling upgrades and automated pod restarts without interrupting the continuous 24/7 data streams.

The recording includes technical deep dives and “eye candy” dashboards, and is highly recommended for anyone interested in the operational details behind end-to-end observability at scale.

Authors
Natalie Bates
EE HPC WG Technical and Executive Lead
Natalie has been the technical and executive leader of the EE HPC WG since its inception in 2010. The group disseminates best practices, shares information (peer-to-peer exchange), and takes collective action.