[Remote] Principal AIOps Engineer
Note: This is a remote position open to candidates in the USA. CVS Health is seeking a Principal AIOps Engineer to lead enterprise-scale AIOps strategy, automation, and observability. The role focuses on modernizing IT operations through advanced technologies to enhance reliability and efficiency.
Responsibilities
- Lead the AIOps strategy, roadmap, and operating model (intake, triage, automation lifecycle, KPIs) to measurably improve MTTR, alert quality, and operational efficiency
- Own the observability-to-AIOps pipeline (metrics, logs, traces, events) and drive standardization of telemetry, service health models, and actionable alerting across teams and platforms
- Design and implement event intelligence: correlation, deduplication, suppression, anomaly detection, incident clustering, and probable-cause analysis using topology/CMDB context
- Advise operations, service owners, and leadership stakeholders; lead change enablement, adoption, and value measurement for AIOps and agentic automation across the organization
- Develop ServiceNow-centric AIOps integrations (ITSM + ITOM/Event Management where applicable): event ingestion, alert-to-incident policies, enrichment, assignment/routing, approvals, change workflows, and closure updates for auditable closed-loop ops
- Establish governance for operational AI (risk controls, approvals, auditability, data access, prompt/response logging, evaluation, and continuous improvement) in partnership with security, compliance, and operations
- Build and operationalize agentic AI workflows for incident triage and resolution: signal summarization, similar-incident retrieval, knowledge article drafting, ticket updates, stakeholder communications, and human-in-the-loop remediation
- Enable closed-loop automation and self-healing by connecting AIOps detections to orchestrated actions (runbooks/workflows), with clear approvals, safety checks, and rollback paths
- Partner with NOC/SOC, infrastructure, and application owners to onboard services into AIOps, define service models, and improve signal quality, escalation paths, and operational readiness
- Create enablement materials (playbooks, operating procedures, dashboards) and coach teams on AIOps practices, agentic AI usage, and responsible automation
Skills
- 10+ years of experience in SRE or production operations supporting highly available services, along with experience working in a product model
- Proven technical leadership: ability to set direction, lead cross-team initiatives, and advise stakeholders through architecture reviews, tradeoffs, and operational readiness
- Strong programming/scripting skills (Python preferred) and experience building automation, integrations, and APIs
- Experience integrating observability platforms and event sources across hybrid environments (cloud/on-prem) and operating production-grade monitoring/event management at scale
- Strong ServiceNow experience as an ITSM system of record (Incident/Problem/Change; CMDB/asset concepts). Ability to build and operate integrations at scale (REST, webhooks, event management) to support automation and auditability
- Automation & Integration Engineering: Python (preferred) for automation and data/ML pipelines; experience building integrations, services, and operational tooling
- Workflow orchestration and integrations (ServiceNow APIs, event pipelines, runbook automation) with strong reliability, security, and auditability practices
- AIOps, ITSM/ITOM (ServiceNow) & Agentic AI Ecosystem, Observability: Prometheus/Grafana, OpenTelemetry, ELK/Splunk/Datadog (or equivalent)
- ServiceNow ITSM/ITOM: Incident/Problem/Change, CMDB/service mapping concepts, and Event Management/AIOps integrations (where applicable)
- Agentic AI frameworks: building tool-using agents, retrieval workflows, prompt/response logging, evaluation, and guardrails
- Operational ML/Analytics: anomaly detection and time-series analysis, correlation approaches, and model/agent evaluation & monitoring in production
- Demonstrated experience applying machine learning and/or LLM-based approaches to operational problems (noise reduction, correlation, anomaly detection, summarization, and assisted remediation) in production environments
- Experience building an agentic AI platform/ecosystem (shared tools, standardized patterns, evaluation, and guardrails) and enabling multiple teams to safely deliver automations
- Familiarity with ServiceNow ITOM / Event Management / AIOps capabilities (or equivalent) and integrating observability signals into ITSM workflows
- Strong Linux and networking fundamentals (TCP/IP, DNS, TLS, load balancing) and ability to troubleshoot distributed systems end-to-end
- DevOps or platform engineering experience supporting highly available services, along with experience working in a product model
- Excellent communication skills with the ability to lead incident bridges, write clear postmortems, and influence reliability improvements across teams
Benefits
- CVS Health bonus, commission or short-term incentive program
- Award target in the company’s equity award program
- Comprehensive benefits package designed to support the physical, emotional, and financial well‑being of colleagues and their families
- Medical, dental, and vision coverage
- Paid time off
- Retirement savings options
- Wellness programs
- Other resources, based on eligibility