[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Akuity is a company focused on simplifying the software delivery process for DevOps and Platform Engineering teams. It is seeking a Senior Site Reliability Engineer to ensure the reliability of the Akuity platform and to collaborate with various teams to improve system performance and incident response.
Responsibilities
- Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
- Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
- Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
- Partner with engineering teams to build reliability into new features before they ship to production
- Participate in an on-call rotation and act as incident commander for high-severity production events
- Build and maintain runbooks, escalation paths, and incident playbooks that keep mean time to resolution low
- Drive improvements to alerting fidelity: reduce noise, increase signal, eliminate toil
- Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items
Skills
- 5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
- Deep hands-on Kubernetes expertise; you understand the scheduler, networking, storage, and autoscaling at a level where you can debug anything
- Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
- Experience defining and operating against SLOs in production; you've written error budgets, not just read about them
- Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
- Solid scripting and automation skills in Go, Python, Bash, or similar; you automate what you touch
- Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
- Based within US time zones (Pacific through Eastern), which includes Canada and other regions
- Experience with Argo CD, Kargo, or GitOps-based delivery workflows
- Familiarity with multi-region, multi-cluster Kubernetes deployments
- Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, or PCI DSS)
- Background operating infrastructure for other platform or developer tooling companies
Benefits
- Equity participation in a well-funded, growing company
- Fully remote: work from anywhere within US time zones (Pacific through Eastern), including Canada and other regions
- Home office stipend and equipment budget
- Flexible time off and a culture that respects it
- Work directly with the engineers who built Argo CD and Kargo; you'll learn a lot here
- US-based employees receive full benefits, including comprehensive health, dental, and vision coverage
Company Overview