[Remote] Staff Site Reliability Engineer

Work from home, full-time role

Note: This is a remote role open to candidates in the USA. Hamilton Barnes is hiring on behalf of a fast-growing AI infrastructure business that operates as a compute liquidity layer for global AI labs and cloud providers. The company is seeking a Staff Site Reliability Engineer to lead incident response and own the production health of multi-thousand GPU fleets, shaping how GPU infrastructure is operated at scale.

Responsibilities

Lead P0/P1 incident response across the full GPU stack, owning triage, postmortems and systemic fixes
Own production health of multi-thousand GPU fleets across providers, including node lifecycle, firmware rollouts and driver upgrades
Build and maintain GPU health checks, fabric monitoring, observability and automated remediation tooling
Define on-call practices including rotations, runbooks, escalation paths and blameless incident reviews
Act as the senior reliability voice in customer-facing incident reviews and architecture deep-dives, and partner with product engineering on SLOs and error budgets

Skills

Multiple years of hands-on experience building and operating large-scale GPU infrastructure
Deep expertise with NVIDIA H100/H200/B200/GB200 including NVLink/NVSwitch topology and hardware failure modes
Production experience with InfiniBand, RoCE and NVLink fabrics alongside NCCL, CUDA and PyTorch distributed training
Production-grade Go, Python or Rust with strong Kubernetes and/or Slurm/HPC experience
Expert knowledge of Linux internals, covering kernel tuning, NVIDIA driver/CUDA lifecycle and BPF tooling

Company Overview

Hamilton Barnes is an award-winning recruitment consultancy, here to help you secure the best talent and opportunities in the networking, telecommunications, data center and security space. It was founded in 2014 and is headquartered in London, England, with a workforce of 51-200 employees. Its website is https://hamilton-barnes.co.uk.
