[Remote] Site Reliability Engineer
Note: This is a remote position open to candidates in the USA. Second Sight Solutions, a subsidiary of Berkeley Research Group (BRG), is a health technology company that focuses on improving transparency in drug discount data exchange. The company is seeking a Site Reliability Engineer to design, build, and maintain highly available systems and infrastructure while collaborating with software developers and operations teams to enhance system reliability.
Responsibilities
- Design, implement, and maintain scalable and reliable systems in cloud environments such as Azure Cloud Services
- Build and maintain release pipelines on CI/CD platforms (GitHub Actions, GitLab CI)
- Provide operational support for full-stack software applications
- Increase system resilience through expert-level coding and rigorous release and change management
- Develop service-level indicators and objectives to automate release validation
- Improve automation and increase the system’s self-healing capability
- Collect operating system data and report performance metrics to stakeholders
- Ensure security best practices are followed in cloud infrastructure and application deployments
- Manage cloud and database system maintenance, and debug production issues as they arise
- Improve reliability, quality, and time-to-market of our suite of software solutions
- Partner with security and product teams to define and publish policies, processes, and playbooks to facilitate rapid and effective handling of alerts and incidents
- Lead incident management processes; respond to outages and service disruptions promptly
Skills
- Bachelor's degree in computer science or a similar field
- Five years' experience as a site reliability engineer or in a similar role
- Strong programming skills (Golang, Ruby, Python, or similar)
- Proven ability to diagnose and monitor performance and reliability issues across the stack
- Expertise in Kubernetes
- Relevant industry certifications, such as the Site Reliability Engineering (SRE) Foundation certification
- Proven experience working with cloud-native infrastructure (Azure Cloud Services, AWS, or GCP)
- Experience working with observability and incident management tools (Datadog, OpsGenie, PagerDuty)
- Experience scripting operating system tasks with Infrastructure as Code
- Impeccable communication skills
- Ability to problem-solve in a fast-paced, high-stakes environment
Company Overview