[Remote] Lead SageMaker Platform Engineer
Note: This is a remote position open to candidates in the USA. Perficient, a global AI and technology consulting firm, is seeking a Lead SageMaker Platform Engineer to help migrate data and machine learning processes onto AWS SageMaker Pipelines. The role involves debugging and deploying SageMaker pipelines, collaborating closely with data scientists, and ensuring reliable production environments.
Responsibilities
- Pair with our data scientists in live debugging sessions to diagnose and fix broken SageMaker pipelines and get them through the SDLC to prod
- Rapidly triage failures using AWS logs and telemetry (CloudWatch, CloudTrail, SageMaker pipeline/execution logs, etc.) and pinpoint root causes
- Untangle permissions issues across pipeline execution roles, cross-account access, and CI/CD identity (GitHub Actions OIDC, Terraform-managed IAM)
- Help debug cross-account model artifact syncing (shared services) and the testing-account validation flow
- Level up the team’s mental model for how the platform works and where to look when things break
Skills
- Expert-level AWS operational experience, especially debugging via logs and telemetry (CloudWatch Logs/Metrics, CloudTrail, X-Ray or equivalent) — can move from a vague failure to a root cause fast
- Deep IAM / permissions expertise in a multi-account setup: execution roles, assume-role/cross-account access, resource policies, KMS/encryption permissions, and reasoning about 'who is allowed to do what, as which principal.'
- Hands-on SageMaker experience, including SageMaker Studio and SageMaker Pipelines — knows how pipelines are defined, deployed, and executed, and where to look when a step fails. (Operating/debugging, not modeling.)
- Multi-account AWS experience aligned to an SDLC (dev/test/prod), including cross-account resource sharing and promotion patterns
- Comfortable working embedded and hands-on: live pairing, screen-sharing, and debugging under time pressure
- Strong communicator who can explain why something broke and how to avoid it next time
- Terraform experience, especially managing IAM and SageMaker/data infrastructure as code
- GitHub Actions CI/CD experience, particularly OIDC-based authentication to AWS (no long-lived keys) and the IAM trust policies behind it
- Experience with Amazon Redshift, and ideally trusted identity propagation / IAM Identity Center integration
- Some ML/MLOps background — enough to speak the language of model training, artifacts, and deployment (helpful, not required)
- AWS certifications (e.g., Solutions Architect Pro, DevOps Engineer Pro, ML Specialty) as a signal, though hands-on evidence matters more
Benefits
- Information regarding the benefits available for this position is in our benefits overview
Company Overview