[Remote] DevOps Engineer - Agentic AI Platform
Note: This is a remote position open to candidates in the USA. Advisor360° is a leading SaaS platform in the fintech space, transforming how wealth management firms operate and serve their clients. The company is seeking a DevOps Engineer for its Agentic AI team. The role is responsible for operating AI infrastructure and applying AI to DevOps processes to improve operational efficiency.
Responsibilities
- Provision and operate AI infrastructure: the Kubernetes, identity, secrets, and gateway layers that AI and agentic services depend on—built so teams can ship LLM-powered features safely
- Apply AI to DevOps itself: build and operate agent-assisted automation that reduces toil—triage, PR review, runbook generation, log and incident analysis. We already run AI in our delivery pipeline and want a teammate who'll take it further
- Cluster operations on AKS: node pool sizing, autoscaling policies, namespace isolation, and day-two operational hygiene across environments
- GitOps delivery with ArgoCD: app-of-apps structure, environment promotion, rollback strategy, and the guardrails that keep one team's bad deploy from cascading
- Deployment strategies: rolling, blue-green, and canary patterns for agentic services where a bad rollout has downstream effects on active workflows
- Platform reliability: SLIs, SLOs, alerting, and runbooks for the infra layer—so when something breaks at 2am, there's a playbook to follow (and you help write it)
- Cost and capacity management: AI workloads have spiky, non-linear cost profiles. You'll instrument and enforce budgets, quotas, and rightsizing across the cluster
Skills
- 3+ years operating Kubernetes in production
- Hands-on GitOps with ArgoCD: multi-environment setups, sync waves, health checks, and rollback under pressure
- Azure fluency: AKS, ACR, Azure Monitor, Key Vault, and managed/workload identity
- Infrastructure-as-code as a default: Terraform for everything—no console cowboys
- Scripting in Python, Go, or Bash for automation and tooling—maintained code, not one-offs
- Solid incident-response instincts; you've been on-call, written postmortems, and fixed the underlying conditions rather than just the symptom
- A real foothold in AI for infrastructure—either you've applied AI/LLMs to operations work (automation, triage, code or PR review, log analysis), or you've provisioned and operated infrastructure for AI workloads. You don't need an ML background; you need to be the DevOps engineer who's already reaching for AI and wants to go deeper
- AI gateway / proxy patterns for AI workloads—centralized provider-key management, rate limiting, quotas, cost attribution, and failover in front of LLM providers
- Agentic AI frameworks (LangGraph, AutoGen, or similar) and the infrastructure patterns they require
- LLM inference / serving infrastructure (vLLM, TGI, Triton, or managed equivalents) and GPU capacity management
- Policy-as-code with OPA/Gatekeeper for cluster governance
- OpenTelemetry and distributed tracing across non-trivial services
- Service mesh (Istio or Linkerd) for service-to-service auth and traffic management
- Multi-tenant platform expertise
Benefits
- Competitive base salaries
- Annual performance-based bonuses
- The chance to share in the equity value you and your colleagues create during your time with the company
- Comprehensive health benefits, including dental, life, and disability insurance
- Unlimited paid time off program
Company Overview