[Remote] Senior Software Engineer - Site Reliability Engineering (SRE)
Note: This is a remote position open to candidates in the USA. Noctua Technology is a company focused on driving digital transformation through Site Reliability Engineering. It is seeking a highly experienced Senior Site Reliability Engineer (SRE) to define strategy and apply advanced software engineering principles to operations, ensuring the reliability and performance of large-scale production systems.
Responsibilities
- Drive the definition and adoption of SLIs and SLOs across multiple services or entire platforms, ensuring alignment with business goals
- Design and architect Infrastructure as Code (IaC) solutions for large-scale, complex environments, establishing standards and best practices
- Implement and manage containerized and serverless architectures using Docker, Kubernetes, and cloud-native services, focusing on performance and error budgets
- Build and maintain reliable and self-healing CI/CD pipelines to automate deployments and improve development workflows
- Implement and refine comprehensive monitoring, alerting, and logging to detect and address performance and availability issues proactively
- Lead the strategic effort to eliminate toil, identifying and championing major automation projects that deliver significant organizational efficiency
- Lead high-severity incident response and coordinate blameless postmortems for major outages, driving the resulting remediation and systemic improvements
- Implement cloud security best practices, including identity and access management (IAM), encryption, and compliance controls
- Proactively identify and address system weaknesses and ensure performance under stress
- Support disaster recovery and high availability strategies through backup and failover planning
- Serve as a primary SRE liaison for development teams, influencing application architecture and design to meet reliability and scalability targets from inception
- Create and maintain documentation for cloud architectures, deployment processes, and best practices
- Contribute to internal knowledge-sharing initiatives, ensuring continuous learning within the team
- Act as a subject matter expert and trusted advisor to clients and internal leadership on cloud infrastructure, reliability strategy, and Service Level Agreement (SLA) negotiations
- Act on client feedback to refine and enhance cloud solutions
- Conduct training and knowledge-sharing sessions to help clients manage their cloud environments effectively
- Stay updated on the latest developments in cloud infrastructure and technology trends
- Drive innovation by proposing and implementing new techniques and technologies
Skills
- 5+ years of experience in site reliability engineering, cloud engineering, or related fields
- Strong software engineering skills with an emphasis on writing clean, modular, and maintainable code, specifically for automation and system management
- Deep experience with Infrastructure as Code (IaC) tools like Terraform or CloudFormation
- Deep experience with containerization and orchestration tools like Docker and Kubernetes
- Deep knowledge of networking concepts, cloud security best practices, and identity management
- Experience with programming or scripting languages such as Python, Bash, or Go
- Experience with CI/CD pipelines and DevOps methodologies
- Strong problem-solving skills and the ability to troubleshoot complex cloud environments
- Demonstrated ability to influence technical decision-making across organizational boundaries
- Bachelor's or advanced degree in Computer Science or a related field
- Any of the following cloud certifications: Google Cloud Professional Cloud Architect, Google Cloud Professional Cloud DevOps Engineer, AWS Certified Solutions Architect, AWS Certified Developer, AWS Certified SysOps Administrator
- CompTIA Security+ certification or an equivalent DoD 8140/8570 IAT Level II baseline certification
Company Overview