[Remote] Software Engineer III - AI/ML Platform Operations
Note: This is a remote job open to candidates in the USA. CSAA Insurance Group (CSAA IG), a AAA insurer, is one of the leading personal lines property and casualty insurance groups in the United States. The company is seeking a Software Engineer – AI/ML Platform Operations to lead the operational excellence, reliability, and support of its enterprise AI and data platforms.
Responsibilities
- Provide technical leadership for AI/ML platforms including Palantir, AWS Bedrock, Amazon SageMaker, and related cloud-native technologies
- Ensure platform reliability, scalability, performance, security, and operational readiness for production AI workloads
- Support deployment, monitoring, maintenance, and lifecycle management of AI/ML solutions and Generative AI services
- Establish operational standards, support models, service-level objectives (SLOs), and platform governance practices
- Design and implement automation, monitoring, observability, and operational tooling to improve platform reliability and efficiency
- Develop and maintain dashboards, health metrics, alerts, logging frameworks, and operational runbooks
- Enhance CI/CD pipelines, deployment automation, infrastructure-as-code, and model release processes
- Implement best practices for MLOps, model monitoring, model lifecycle management, and AI operational governance
- Serve as a senior escalation point for complex production issues involving AI platforms, machine learning workloads, cloud infrastructure, and data integrations
- Lead root cause analysis efforts and drive corrective and preventive actions to improve platform stability
- Solve performance, availability, deployment, and integration issues across AI and data ecosystems
- Partner with engineering and business teams to rapidly restore service and minimize operational risk
- Provide mentorship, technical guidance, and operational expertise to engineers and platform teams
- Influence platform strategy, architecture decisions, operational processes, and technology adoption
- Collaborate with team members to align platform capabilities with business priorities and AI adoption goals
- Communicate complex technical concepts effectively to both technical and non-technical audiences
- Remain current with advancements in AI/ML, Generative AI, cloud technologies, platform engineering, and reliability practices
- Identify opportunities to improve operational efficiency, governance, security, and developer experience
- Champion modern engineering practices including automation, observability, DevOps, Site Reliability Engineering (SRE), and AI Operations (AIOps)
Skills
- 3+ years of progressive experience in software engineering, platform engineering, cloud operations, MLOps, DevOps, or related technical disciplines
- Bachelor's degree in Computer Science, Engineering, Information Technology, or a related field, or equivalent practical experience
- Experience supporting production cloud-based applications and services in AWS environments
- Strong experience with software engineering and automation using languages such as Python, Java, JavaScript/TypeScript, or Node.js
- Experience with CI/CD, build, integration, and deployment tools such as Jenkins, Maven, GitHub Actions, or equivalent
- Experience with cloud-native services including compute, storage, networking, databases, and serverless architectures
- Experience building and maintaining operational monitoring, observability, and alerting solutions
- Strong troubleshooting, incident response, and root cause analysis skills
- Excellent communication, collaboration, and technical leadership capabilities
- Experience with AI/ML platforms such as Palantir Foundry, Amazon SageMaker, AWS Bedrock, Databricks, or similar ecosystems
- Experience supporting Generative AI applications, LLM-based solutions, prompt orchestration frameworks, and Retrieval-Augmented Generation (RAG) architectures
- Knowledge of MLOps practices including model deployment, monitoring, governance, experimentation, and lifecycle management
- Experience with observability and monitoring platforms such as Datadog, Splunk, Grafana, Prometheus, CloudWatch, or OpenTelemetry
- Familiarity with AI governance, responsible AI principles, model risk management, and operational controls
- Relevant cloud, AI/ML, DevOps, or platform engineering certifications
- Actively shapes our company culture (e.g., participating in employee resource groups, volunteering, etc.)
- Lives into cultural norms (e.g., willing to have cameras on when it matters: helping onboard new team members, building relationships, etc.)
- Travels as needed for role, including divisional / team meetings and other in-person meetings
- Fulfills business needs, which may include investing extra time, helping other teams, etc.
Benefits
- Total compensation package
- Annual bonus eligibility for most roles
- 401(k) with a company match
- Career Growth: leaders and mentors partner with employees to align interests, unlock development opportunities, and support long‑term success
- Flexible Workplace: a remote-first culture with Home-Flex roles that work primarily from home, often with the flexibility to work from various locations, including CSAA offices
- Inclusion and Belonging: an inclusive and welcoming workplace
- Sustainability: taking bold action to build more resilient communities and reduce environmental impact
- Company-wide annual discretionary bonus, through our Annual Incentive Plan (AIP), of up to 8% of eligible pay
Company Overview