[Remote] Strategic Operations Engineer III
Note: This is a remote position open to candidates in the USA. Backblaze is the object storage leader in the open cloud movement and is seeking a Strategic Operations Engineer III to join its team. The role involves managing incident, problem, and change processes while leveraging AI/ML for operational improvements.
Responsibilities
- Lead and govern the end-to-end incident management lifecycle, including detection, triage, escalation, and resolution
- Drive major incident management (MIM) processes and communications
- Improve MTTR (Mean Time to Resolution) through automation and process optimization
- Establish and maintain incident response playbooks and runbooks
- Maintain and improve intelligent heatmaps leveraging AI/ML to identify recurring technical themes and prioritize long-term remediation
- Implement trend analysis and proactive problem identification using observability data and AI
- Track and manage problem records to closure
- Govern change management processes (lead the CAB), ensuring safe, compliant, and low-risk deployments
- Define and enforce change policies, risk assessments, and approval workflows
- Drive continuous improvement in release and deployment practices
- Maintain a strong understanding of system architecture and monitoring strategies, identifying gaps and opportunities for improvement
- Partner with engineering teams to improve system resilience and performance
- Reduce alert fatigue by improving signal-to-noise ratio in monitoring systems
- Leverage AI/ML for anomaly detection, predictive alerting, and automated root cause analysis
- Implement AI-driven solutions to optimize incident response and operational workflows
- Analyze large-scale operational data to identify patterns and recommend improvements
- Apply AIOps platforms or build AI-driven operational solutions
Skills
- 5+ years of experience in IT Operations, SRE, or similar roles
- Strong expertise in Incident, Problem, and Change Management (ITIL or similar frameworks)
- Proven experience in governing and optimizing operational processes
- AI & Data Expertise: Strong knowledge of AI/ML concepts, including anomaly detection, predictive analytics, and data modeling
- AIOps Experience: Hands-on experience with AIOps platforms or building AI-driven operational solutions (event correlation, alert prioritization)
- ITIL certification (Foundation or higher)
- Proficiency with platforms such as Jira, ServiceNow (SNOW), FireHydrant, and Moogsoft
- Experience working in high-availability, large-scale environments
Benefits
- W2 option
- Learning, developing, and growing are key parts of our culture
- We want people to feel comfortable expressing their true selves and to come, stay, and do their best work here