[Remote] Senior Staff Machine Learning Engineer, Data & Eval
Note: This is a remote position open to candidates in the USA. Airbnb is a leading hospitality company that connects hosts and guests for unique stays and experiences. The company is seeking a Senior Staff Machine Learning Engineer to set technical direction and lead execution for the ML evaluation and data systems that power its customer support AI products.
Responsibilities
- Define evaluation strategy and success metrics for GenAI systems, aligning offline evaluation with online business and customer experience outcomes
- Build and scale evaluation frameworks (golden sets, synthetic data, automated regressions, rubric-based grading, LLM-as-judge where appropriate) with strong controls for bias, drift, and reliability
- Design the data flywheel: instrumentation, feedback collection, data quality checks, labeling strategy, dataset versioning, and governance to support continuous improvement
- Lead cross-functional quality initiatives across product, ops, and engineering, driving clarity on what “good” looks like and how teams act on evaluation results
- Develop and productionize pipelines for dataset creation, model monitoring, evaluation-at-scale, and continuous testing (pre-deploy and post-deploy)
- Drive technical decisions and architecture for evaluation and data infrastructure, balancing speed, rigor, cost, and safety
Skills
- Educational Background: PhD in Computer Science, Mathematics, Statistics, or related technical field (or equivalent practical experience)
- Industry Experience: 10+ years building, testing, and shipping ML/AI systems end-to-end, including 2+ years with GenAI/LLM systems in production
- Leadership Experience: 5+ years leading large, ambiguous technical initiatives as a senior IC, influencing roadmap and engineering/science direction across teams
- Technical Proficiency: Deep expertise in evaluation methodology (offline/online alignment, metric design, human-in-the-loop evaluation, A/B testing, power analysis, regression testing)
- Hands-on experience with GenAI systems, including orchestration, retrieval, tool calling, and memory
- Experience building data pipelines and quality systems (labeling workflows, dataset curation, versioning, monitoring, and governance)
- Solid ML fundamentals and best practices (model selection, training/serving, monitoring, reliability, and model lifecycle management)
- Customer Support Systems: Experience applying ML/AI to customer support workflows (e.g., agent assist, classification/routing, resolution recommendation, QA)
- Infrastructure & Quality at Scale: Experience building robust evaluation platforms for agent behavior validation, safety/guardrails, and continuous improvement
- Agile Practice for Applied AI: Proven ability to take evaluation and data flywheel work from incubation to production, iterating quickly while maintaining scientific rigor
Benefits
- Bonus
- Equity
- Benefits
- Employee Travel Credits