March 12, 2026 Minutes
The meeting featured a tech talk titled “Operating End-to-End HPC Data Center Observability at Scale,” presented by Melissa Romanus and Basil Lalli (NERSC/LBNL). The talk focused on Omni, the observability platform that NERSC has been building as the center prepares for its next-generation system, Doudna.
Romanus and Lalli described how the team manages millions of metrics and logs per second across a diverse, multi-vendor environment. Omni functions as a centralized data lake, ingesting everything from CPU and GPU metrics to building-level data such as groundwater cooling measurements and particle counts from California wildfires. This breadth allows the center to correlate facility conditions with system behavior in a single place.
To monitor its HPC systems, NERSC uses a distributed platform comprising hundreds of monitoring nodes, 17 specialized Kubernetes clusters, and thousands of environmental sensors. The environment is entirely Git-driven, using Ansible, FluxCD, and NetBox to enable rolling upgrades and automated pod restarts without interrupting the 24/7 data streams.
The recording contains technical deep dives and “eye candy” dashboards, and is highly recommended for anyone interested in the operational details behind end-to-end observability at scale.