[Remote] Senior Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. ARA is an information technology company seeking a Senior Site Reliability Engineer. The role involves partnering with development and IT teams to improve system operability and support, maintaining operational standards, and improving platform stability.
Responsibilities
- Partner with software developers, platform engineers, and IT staff to improve system design, operability, deployment safety, and production support readiness
- Define and maintain operational standards, runbooks, support procedures, escalation paths, and service-level objectives
- Evaluate system architecture and changes to ensure they balance functional requirements, service quality, reliability, security, and compliance needs
- Drive continuous improvement in platform stability, maintenance, and availability
- Provide advanced technical support and troubleshooting for complex platform and service issues affecting internal users and stakeholders
Skills
- 8+ years of experience in Site Reliability Engineering, DevOps, Platform Engineering, Systems Engineering, or related infrastructure roles supporting production services
- Strong experience with Linux systems administration and troubleshooting in enterprise environments
- Strong experience operating and maintaining on-prem Kubernetes platforms and all related components including CRI, CNI, and CSI plugins
- Experience deploying and maintaining applications on Kubernetes using Helm, Kustomize, and similar tooling
- Experience supporting DevOps tooling such as GitLab, Artifactory, Jira, and Confluence
- Experience with GitOps tools such as FluxCD or ArgoCD
- Proficiency scripting with at least one of Python, Go, or Bash
- Strong experience designing, maintaining, and maturing observability tooling including monitoring, dashboards, logging and tracing, and supporting SLOs
- Strong understanding of reliability engineering concepts: service health indicators; high availability design, failure reduction, and testing; operational readiness practices, including developing documentation, runbooks, and architectural descriptions; and incident response, root cause analysis, and remediation/recovery
- Ability to obtain a security clearance, which includes U.S. citizenship
- Bachelor's degree in CS, Software Engineering, or another IT-related field, or equivalent experience
- Experience with multiple Linux distributions including Ubuntu
- Experience with at least one of the following: Tanzu Kubernetes, Nutanix Kubernetes Platform, or Canonical Kubernetes
- Experience with cloud platforms such as AWS and Azure
- Experience with infrastructure automation and configuration management
- Experience managing AI tooling on Kubernetes, including MCP servers, LLM platforms (vLLM, Ollama), and Kubeflow
- Experience with security and compliance considerations in regulated environments
- DoD experience
- Active or inactive Secret Security Clearance
Company Overview