[Remote] Senior/Staff Site Reliability Engineer - Data Center
Note: This is a remote position open to candidates in the USA. PathAI is on a mission to improve patient outcomes with AI-powered pathology. They are seeking a skilled senior/staff-level Site Reliability Engineer focused on designing, building, and operating their hybrid cloud/on-prem environment.
Responsibilities
- Advancing the state of our operations by implementing SRE best practices - focusing on users, monitoring, and automation
- Engineering infrastructure patterns for cloud environments in Amazon Web Services - building in security, reliability and scalability
- Designing, building, and operating our data center to support our rapidly growing Machine Learning team
- Integrating on-premises data center environments with existing cloud infrastructure to create a seamless hybrid cloud environment
- Improving the reliability and resilience of our infrastructure through root-cause analysis and by reviewing gaps in its design and implementation
- Participating in platform on-call rotations and assisting with urgent incident response
Skills
- 5+ years of relevant experience
- Automation: You work hard to eliminate toil by automating everything through scripting, configuration management tools (Ansible), and code (Python/Go)
- You've built monitoring infrastructure with modern observability tools (Datadog/Grafana/Prometheus)
- You've worked with infrastructure as code (Terraform/CloudFormation)
- You've administered physical hardware stacks in production settings (iDRAC/IPMI/NVIDIA UFM/Juniper Systems)
- You're opinionated on storage solutions and how they can be optimized for high-performance workloads (Quobyte/S3/FSx/EFS)
- Familiarity with modern network designs and comfort operating across network layers
- Some experience with and opinions on virtualization, containerization, or container orchestration platforms (EKS/Cluster API/KVM)
- Operations experience: You've managed critical production infrastructure and are familiar with incident response, scaling, and the challenges of rapid growth
- A bachelor's degree in Computer Science or equivalent experience
- An insatiable intellectual curiosity and the ability to learn quickly in a complex space
- Travel: Willingness to travel up to 25% of the time
Benefits
- Eligible for equity