[Remote] Senior AI Agent & Evaluations Engineer
Note: This is a remote position open to candidates in the USA. Vacatia is building the future of vacation ownership, with a focus on transforming the industry through AI. The company is seeking a Senior AI Agent & Evaluations Engineer to design and improve AI agents that directly affect customer experience and operational efficiency, and to own the intelligence layer behind these systems.
Responsibilities
- Design, refine, and optimize prompts, tool definitions, routing logic, and decision-making behavior across Vacatia's AI agent ecosystem
- Build and maintain evaluation frameworks, golden datasets, grading systems, and regression testing pipelines that measure agent quality and reliability
- Develop guardrails and safe-failure mechanisms that ensure agents operate responsibly in customer-facing and financially sensitive workflows
- Monitor production performance, investigate failures, identify edge cases, and continuously improve agent outcomes through data-driven iteration
- Partner with business stakeholders to translate policies, operational requirements, and domain expertise into measurable agent behavior
- Collaborate with engineering teams to define context requirements, tool contracts, and integration specifications that support agent success
- Create scalable frameworks and reusable patterns for deploying AI agents across new business workflows and use cases
- Establish best practices for prompt engineering, evaluation methodologies, observability, and agent operations
Skills
- Proven experience shipping and owning production AI agents or LLM-powered systems beyond proof-of-concept environments
- Deep expertise in prompt engineering, including system prompts, tool usage, context management, output constraints, and agent behavior design
- Hands-on experience building evaluation frameworks using golden datasets, scoring rubrics, LLM-as-judge methodologies, and regression testing
- Strong familiarity with modern AI development tools such as Claude Code, Codex, or similar coding agents
- Experience with agent observability and evaluation platforms such as LangSmith, Langfuse, Arize, Galileo, or comparable solutions
- Ability to distinguish prompt issues from data, tooling, model, or evaluation failures and systematically improve agent performance
- Strong written and verbal communication skills with the ability to work effectively across engineering and business teams
- Demonstrated ownership mindset with a passion for building reliable, measurable, and continuously improving AI systems
- Experience building agents that process communication-based workflows including emails, support tickets, chat interactions, or transcripts
- Experience with multiple agent frameworks and a practical understanding of their tradeoffs
- Familiarity with the evolving LLM landscape and model selection strategies
- Experience designing and implementing end-to-end evaluation pipelines and agent operations workflows
- Production experience with online evaluation systems and automated scoring of live traffic
- Experience integrating AI systems with Salesforce, Amazon Connect, or other customer engagement platforms
- Background in customer-facing industries where accuracy, compliance, and communication quality are critical
- Contributions to open-source projects, technical writing, or public thought leadership in AI, prompt engineering, or agent development