[Remote] Data Operations Manager
Note: This is a remote job open to candidates in the USA. HealthEdge is looking for a Data Operations Manager with a focus on data insights who thinks like an engineer, works like a scientist and communicates like a strategist. This role is designed to improve production stability by leveraging AI to analyze operational data and drive permanent resolutions for technical issues affecting client services.
Responsibilities
- Design, build and maintain pipelines that consolidate data from PagerDuty, Jira, ServiceNow, Datadog, Splunk and other operational sources into a unified analytical layer
- Develop and curate data models that identify repeat incidents, known error patterns, chronic alert noise and engineering toil consuming disproportionate remediation cycles
- Maintain data quality, lineage and governance standards across all ingested sources – ensuring findings are defensible when presented to senior leadership
- Leverage AI and automation – including the Claude API and Claude-powered workflows – to accelerate pattern detection, root cause hypothesis generation and report synthesis across large operational datasets
- Own and drive the Problem Management lifecycle across HealthEdge client-facing products
- Translate incident patterns into structured Problem Records with defined scope, impact quantification, and recommended permanent fix strategies
- Partner with Engineering, SRE, Platform and Product teams to embed problem-driven prioritization into sprint planning and tech debt roadmaps
- Facilitate Problem Review sessions – leading cross-functional teams from data to decision
- Define and track KPIs that demonstrate Problem Management value: reduction in repeat SEV1/SEV2 incidents, MTTR improvement, tech debt resolution velocity and engineering hours reclaimed from toil
- Build interactive, executive-ready dashboards and data visualizations that make hotspots, failure modes and technical debt load immediately comprehensible to both engineering and business stakeholders
- Apply generative AI tooling to synthesize multi-source operational signals into clear, narrative-driven analysis – reducing time from data to decision
- Develop automated reporting workflows that surface trending issues and emerging risk patterns without requiring manual aggregation cycles
- Support monthly ceremonies by providing KPI and outcome trends and highlighting the factors that influence them
- Present operational intelligence findings and Problem Management outcomes to Engineering leadership, VP-level+ audiences and cross-functional stakeholders
- Provide strategic guidance on where the most urgent pockets of risk to platform availability exist, and drive prioritization accordingly
- Translate technical findings – infrastructure failure modes, code regression patterns, dependency risks – into business value framing that drives prioritization conversations
- Author Problem Record summaries, trend analyses and executive briefings that are concise, evidence-based and action-oriented
Skills
- 5+ years of data engineering experience with production-grade pipeline design, transformation logic and operational data modeling
- Proficiency with Python or Scala for data processing; strong SQL for analytical querying against large, event-driven datasets
- Hands-on experience with Jira and at least two of the following: PagerDuty, Datadog, Splunk, ServiceNow – ideally in an operational analytics or SRE context
- Experience integrating large language model (LLM) APIs – including Anthropic Claude, OpenAI or similar – into data workflows, automated summarization pipelines or insight generation applications
- Proficiency building interactive dashboards and data visualizations; experience with Amazon Quick Suite is a strong plus
- Working knowledge of ITIL or equivalent ITSM frameworks – specifically Incident Management, Problem Management and Change Management process disciplines
- Demonstrated ability to identify repeat failure patterns in incident or monitoring data and drive structured root cause analysis and resolution workflows
- Familiarity with SRE principles – toil quantification, error budgets, SLO/SLA measurement – and how engineering teams use these to prioritize reliability work
- Strong written and verbal communication skills, with demonstrated experience presenting technical analysis to VP or C-level audiences
- Ability to translate complex, multi-variable findings into business impact narratives that drive prioritization decisions
- Comfortable driving cross-functional alignment – navigating competing priorities across Engineering, Product, Operations and Leadership stakeholders
- Self-directed and intellectually curious; you pursue root causes with the same rigor you bring to your data models
- Experience in a healthcare SaaS environment or regulated platform with high availability requirements
- Prior role embedded in an SRE, NOC, Platform Engineering or Operations function – particularly one that included formal Problem Management or post-incident review responsibilities
- Experience building AI-powered operational tooling – such as automated incident summarization, intelligent alert correlation or AI-assisted root cause classification
- Familiarity with HealthEdge products or the payer technology landscape is a meaningful plus
- ITIL Foundation certification or equivalent
Company Overview