Senior Data Engineer
This is a hands-on building role: you turn raw, messy fabrication data into the clean, well-modeled, AI-ready datasets that our AI/ML and analytics workloads run on 🚀

🧑🏻‍💻 Responsibilities:
- Build and operate ingestion, ELT/ETL, and orchestration pipelines that move data from our MongoDB Atlas operational store and other sources into our analytical and AI-serving layers
- Implement layered (medallion-style) transformations with idempotent, backfillable, incrementally loaded jobs
- Apply deduplication, normalization, and validation so downstream data is high-quality and trustworthy
- Modernize legacy / homegrown data flows via incremental, strangler-fig migrations that keep production stable
- Build embeddings and vector pipelines, and the feature/retrieval-ready datasets that RAG, semantic search, and agentic workloads depend on
- Make production data AI-ready in practice: well-structured, lineage-tracked, and retrieval-friendly, in partnership with ML and application engineering
- Implement real-time and change-data-capture flows from MongoDB (Change Streams / CDC) where workloads require fresh data
- Implement the canonical data model, schemas, and data contracts defined by the Data Architect, enforced in-repo so other teams build against stable definitions
- Exercise sound persistence judgment in execution: land data in the right store (document / NoSQL, vector, analytical) per the architectural direction
- Contribute to build-vs-buy decisions by prototyping with proven, industry-standard tooling over custom development
- Establish testing, data-quality, and lineage checks for the pipelines you own, with clear alerting and runbooks
- Instrument pipeline observability (freshness, volume, schema-drift, cost) so failures are caught before consumers feel them
- Use AI-assisted development tools (Claude Code, Copilot, Cursor) as a force multiplier for transformation logic, query tuning, and migration scripting
- Partner with database engineering on extracting from and protecting the production store
- Partner with the Data Architect on implementing target-state patterns and surfacing what's hard to build
- Partner with ML, AI, and application engineers on the data they consume, shaping and governing it so it's safe and ready to build on

🤝 If you have:
- 5+ years of hands-on data engineering experience building and operating production data pipelines at scale
- Strong programming and data skills: Python and SQL, with solid software-engineering fundamentals (version control, testing, CI); shipping and maintaining production code, not just notebooks
- Hands-on MongoDB at production scale (Atlas ideal): document modeling, aggregation framework, change streams / CDC, and extracting from a document store into analytical / AI-serving layers. Our stack is NoSQL / MongoDB, not relational; this is a core requirement, not an extra
- Demonstrated experience with ELT/ETL pipeline design, transformation frameworks (dbt or equivalent), and orchestration (Airflow, Dagster, or Azure Data Factory)
- Experience building on cloud-native data platforms and lake / lakehouse / warehouse architectures, with layered (medallion-style) modeling
- Hands-on experience preparing data for AI/ML or analytical consumers (embeddings / vector pipelines, RAG-/feature-ready datasets, or equivalent), including deduplication, normalization, and validation
- Familiarity with vector search and embeddings in production (MongoDB Atlas Vector Search or equivalent)
- Demonstrated use of AI-assisted development tools (Claude Code, Copilot, Cursor) for data and pipeline work
- Strong grasp of data quality, testing, lineage, and pipeline observability practices
- Comfortable working in a complex, specialized domain. MEP / AEC / construction experience is a plus; appetite to learn the domain is required

🦾 It’s a plus:
- Experience with the Azure data ecosystem (Data Factory, Synapse Analytics, Azure Functions, Event Grid)
- Lakehouse platforms (Databricks, Snowflake) or open table formats (Iceberg, Delta, Hudi); feature stores (Feast or equivalent)
- Streaming / event-driven data processing (Kafka, Event Hubs, Spark Structured Streaming)
- CDC and cross-engine sync (MongoDB Change Streams, Debezium, or equivalent)
- Experience with geometric / BIM / CAD data or other multi-modal, unstructured source data
- Knowledge-graph, ontology, or semantic-layer exposure
- Data governance for AI/agent access to production data: query-cost controls, read-path safety, lineage, audit
- SOC 2 and data-classification awareness

This call is made within the framework of Law 19.691 on the Promotion of Employment for Persons with Disabilities, including individuals registered in the National Registry of Persons with Disabilities of the Ministry of Social Development.