[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Sanity is a modern content operating system that replaces rigid legacy content management systems. The Senior Site Reliability Engineer will work closely with development teams to design and build infrastructure that ensures the scalability and reliability of the platform, while also mentoring engineers and contributing to operational excellence.
Responsibilities
- Design, build, and operate the shared platform foundations engineers ship on every day: GCP infrastructure, Kubernetes, networking, routing, CI/CD, and observability
- Diagnose and troubleshoot complex distributed systems running at high request volume
- Ensure observability and analyze the behavior of our stack
- Contribute to in-flight work like modernizing our edge, caching, and gateway layers onto Fastly and tightening observability across the platform
- Raise the reliability bar through better dashboards, alert severity, paging standards, on-call readiness, and incident response
- Make deployment boring in the best way: build golden paths, production readiness checks, safe rollouts, and useful automation so engineers have fewer places to look before they ship
- Mentor engineers and raise the technical bar through code review, design review, and pairing
- Participate in our on-call rotation and help our developer on-call rollout land well
Skills
- Based in the United States, with reasonable overlap with European engineering hours
- Experience with SRE/DevOps tools, processes, and culture
- 5+ years of experience as part of an SRE on-call rotation
- Analytical approach to designing, diagnosing, and optimizing infrastructure
- Experience with managing scalable, highly available, cloud-based applications, ideally with high request volume and customer-facing uptime expectations
- Experience with Kubernetes for orchestrating, scaling, and managing containerized applications in cloud-based environments
- Experience building CI/CD pipelines
- Experience with an observability stack (Prometheus or similar tools)
- Comfortable working across CDNs, edge, gateways, and caching layers, or eager to go deep there
- You improve on-call and reliability by building systems, standards, and feedback loops that make production healthier over time
- You are comfortable dealing with incidents and outages and have built a practical, thoughtful communication style for handling high-pressure situations
- An open but considered approach to new technologies
Benefits
- A highly skilled, inspiring, and supportive team
- Real infrastructure scale and meaningful, hands-on work changing how it runs
- Positive, flexible, and trust-based work environment that encourages long-term professional and personal growth
- A global, multiculturally diverse group of colleagues and customers
- Comprehensive health plans and perks
- A healthy work-life balance that accommodates individual and family needs
- Competitive stock options program and location-based salary