Operating End-to-End HPC Data Center Observability at Scale

As NERSC prepares for its next-generation system, Doudna, the team has been building Omni, an end-to-end observability platform that manages millions of metrics and logs per second across a diverse, multi-vendor environment. Omni functions as a centralized data lake, ingesting everything from CPU and GPU metrics to building-level data such as groundwater cooling measurements and particle counts from California wildfires.
To monitor its HPC systems, NERSC runs a distributed platform comprising hundreds of monitoring nodes, 17 specialized Kubernetes clusters, and thousands of environmental sensors. The environment is entirely Git-driven: Ansible, FluxCD, and NetBox enable rolling upgrades and automated pod restarts without interrupting the 24/7 data streams. The talk includes technical deep dives and live production dashboards.