[Remote] Staff Network Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Nebius is leading a new era in cloud infrastructure for the global AI economy. The company is seeking a Network Site Reliability Engineer to build and run its network infrastructure, ensuring reliability and operational efficiency as the company scales.
Responsibilities
- Define and own reliability goals for network services and critical paths (SLIs/SLOs, availability targets, and error budgets where they make sense)
- Drive reliability improvements across the whole network: not only services, but also site readiness, inter-site connectivity (DCI), and operational standards
- Own incident response for your areas, lead investigations/postmortems, and turn failures into durable fixes (not repeated firefighting)
- Build and evolve observability: actionable metrics/logs/traces, alerting, and faster debug loops during and after incidents
- Design safer change workflows: automation, CI/CD, test/staging environments, canarying, rollbacks, and auditability for network changes
- Work closely with network engineers and platform teams to embed operability into designs and keep operations practical and fast
Skills
- Strong production Linux fundamentals and a structured approach to debugging complex systems
- Solid understanding of networking basics and how real networks fail (control plane vs data plane, latency/loss, failure domains, etc.)
- Hands-on experience operating high-availability systems and improving them over time (not just "keeping the lights on")
- Ability to write and maintain software/automation (Go is common for us; Python is also welcome)
- Experience with modern infrastructure tooling (e.g., IaC, CI/CD, container platforms) and comfort automating operational workflows
- Experience with high-throughput traffic processing: load balancers, tunneling/decap, NAT64, or similar datapath-heavy systems
- Low-level networking performance/debug background (eBPF/XDP, DPDK, perf/ftrace, kernel networking internals)
- Experience building network-safe delivery pipelines (testing labs, staged rollouts, automated verification, drift detection)
- Background with large-scale network observability/telemetry (e.g., routing/flow telemetry, regression detection at scale)
Benefits
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and ownership
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment and talented teams
Company Overview