[Remote] Staff Site Reliability Engineer
Note: This is a remote role open to candidates in the USA. Hamilton Barnes is a fast-growing AI infrastructure business, operating as a compute liquidity layer for global AI labs and cloud providers. They are seeking a Staff Site Reliability Engineer to lead incident response and own the production health of multi-thousand GPU fleets, shaping how GPU infrastructure is operated at scale.
Responsibilities
- Lead P0/P1 incident response across the full GPU stack, owning triage, postmortems and systemic fixes
- Own production health of multi-thousand GPU fleets across providers, including node lifecycle, firmware rollouts and driver upgrades
- Build and maintain GPU health checks, fabric monitoring, observability and automated remediation tooling
- Define on-call practices including rotations, runbooks, escalation paths and blameless incident reviews
- Act as the senior reliability voice in customer-facing incident reviews and architecture deep-dives, and partner with product engineering on SLOs and error budgets
Skills
- Multiple years of hands-on experience building and operating large-scale GPU infrastructure
- Deep expertise with NVIDIA H100/H200/B200/GB200 including NVLink/NVSwitch topology and hardware failure modes
- Production experience with InfiniBand, RoCE and NVLink fabrics alongside NCCL, CUDA and PyTorch distributed training
- Production-grade Go, Python or Rust, with strong Kubernetes and/or Slurm/HPC experience
- Expert knowledge of Linux internals, covering kernel tuning, the NVIDIA driver/CUDA lifecycle and BPF tooling
Company Overview