[Remote] Principal AI / Machine Learning Data Engineer - Remote or hybrid from MN or DC
Note: This is a remote position open to candidates in the USA. Dice is a global organization that delivers care, aided by technology, to help millions of people live healthier lives. The Principal AI Data Engineer will design and build end-to-end AI pipelines for large-scale unstructured data, enabling advanced analytics and Generative AI. This role focuses on transforming complex datasets into AI-ready data products and building modern data pipelines.
Responsibilities
- Design, develop, and maintain scalable data pipelines and data platforms supporting analytics, machine learning, and AI use cases
- Build and optimize ingestion frameworks for large-scale structured and unstructured data, including streaming and event-driven sources
- Partner with cross-functional stakeholders to understand evolving data and AI needs and define long-term technical solutions
- Enable and support machine learning and AI workflows, including feature engineering, data preparation, and model deployment support
- Drive strategic initiatives around Generative AI, data quality, observability, lineage, and governance
- Develop and maintain frameworks that support rapid experimentation and deployment of AI/ML solutions
- Introduce and evolve best practices in data modeling, orchestration, testing, and monitoring
- Identify and champion opportunities for platform scalability, performance optimization, and cost efficiency
- Collaborate with product, analytics, and infrastructure teams to deliver high-impact data and AI solutions
- Build and maintain reusable parsing, enrichment, analytic, and service libraries to accelerate delivery across teams
- Work comfortably under time-sensitive conditions while ensuring thoroughness
- Maintain high ethical standards and the ability to remain objective and confidential
- Build and operate production data platforms and pipelines across batch and streaming workloads
- Do hands-on engineering in Python and SQL, and in JVM languages (Java/Scala) within Spark ecosystems
- Apply distributed processing and lakehouse/warehouse patterns (e.g., Spark/PySpark, Databricks, Snowflake)
- Build pipelines for OCR, document parsing, and text extraction from image-based or scanned data sources
- Enable Generative AI solutions in production (e.g., RAG-style architectures), including retrieval patterns and evaluation/monitoring practices
- Take knowledge-centric data approaches (e.g., metadata-driven systems, entity resolution, and/or graph concepts) to improve discoverability and downstream analytics
- Bring a data quality, observability, and monitoring mindset (profiling, validation, alerting, and reliability improvements)
- Work with orchestration, CI/CD, containerization, and infrastructure-as-code (e.g., Airflow, GitHub Actions, Docker, Terraform, Kubernetes)
- Work in the cloud (AWS, Azure, and/or Google Cloud Platform), including secure handling of sensitive data (PII/PHI) and collaboration with compliance partners
- Lead through influence, mentor engineers, and translate ambiguous problems into scalable technical roadmaps
Skills
- Bachelor's degree or equivalent experience
- 5+ years of experience designing, building, and operating scalable data pipelines and platforms (batch + streaming)
- 2+ years of experience deploying Generative AI solutions to production (e.g., RAG, LLM-powered pipelines, semantic search)
- Proven hands-on development experience in Python and SQL, with experience in Spark/PySpark and Databricks (or similar distributed platforms)
- Experience building ingestion and processing frameworks for unstructured data (OCR, documents, images), including parsing and enrichment
- Experience with cloud platforms (AWS/Azure/Google Cloud Platform), DevOps/CI/CD, and infrastructure-as-code, including secure handling of sensitive data (PII/PHI)
- Proven ability to design scalable solutions, implement data quality/observability practices, and collaborate across stakeholders
- Experience with cloud platforms such as AWS, Azure, or Google Cloud, including managed data services
- Experience with streaming and event-driven architectures (e.g., Kafka, Kinesis, Event Hubs)
- Experience with data quality and validation frameworks (e.g., Great Expectations, Deequ) and/or data observability tooling
- Experience enabling MLOps practices (e.g., feature stores, model registries, experiment tracking, deployment automation)
- Experience with lakehouse architectures, Delta Lake, and advanced Spark optimization/performance tuning
- Experience with data visualization tools and libraries such as Plotly, seaborn, and Chart.js
- Experience with machine learning and predictive analytics
- Familiarity with security and privacy concepts for data platforms (e.g., least privilege, PII/PHI handling) and working with compliance partners
- Solid hands-on engineering in Python and SQL; familiarity with JVM languages (Java/Scala) in Spark ecosystems
Benefits
- A comprehensive benefits package
- Incentive and recognition programs
- Equity stock purchase
- 401k contribution (all benefits are subject to eligibility requirements)