[Remote] Mid-Level Data Engineer, Veterans Affairs
Note: This is a remote position open to candidates in the USA. ThunderYard Solutions is seeking a Data Engineer to support the U.S. Department of Veterans Affairs in designing, developing, and maintaining scalable data solutions. The role involves collaborating with cross-functional teams to optimize data pipelines and ensure compliance with federal standards.
Responsibilities
- Design, develop, and maintain ETL/ELT pipelines to ingest, transform, and load data from multiple sources such as APIs, relational databases, cloud storage, and streaming platforms
- Build scalable batch and near-real-time data pipelines using Databricks and Apache Spark (PySpark / SQL)
- Implement data transformation logic following best practices for performance, reliability, and reusability
- Support schema evolution, data validation, deduplication, and error handling in ETL workflows
- Develop and optimize pipelines using Delta Lake and medallion (Bronze / Silver / Gold) architecture patterns
- Use Databricks Workflows / Jobs or similar orchestration tools to schedule and monitor pipelines
- Optimize Spark jobs for performance and cost (partitioning, caching, file sizing, query tuning)
- Collaborate on data governance initiatives using Unity Catalog, access controls, and lineage where applicable
- Work closely with data architects, analytics teams, and downstream consumers to define data requirements
- Troubleshoot pipeline failures and data quality issues, and implement long-term fixes
- Produce documentation for pipelines, datasets, and operational runbooks
- Participate in CI/CD practices using Git-based version control for notebooks and code deployments
Skills
- 3+ years of experience as a Data Engineer or in a similar data-focused role
- Hands-on experience with Databricks
- Strong experience building ETL/ELT pipelines
- Proficiency in Python and SQL
- Experience with Apache Spark / PySpark
- Familiarity with cloud platforms such as Azure
- Solid understanding of data modeling, data warehousing, and analytics use cases
- Experience with Delta Live Tables (DLT) or Databricks Auto Loader
- Experience with orchestration tools such as Airflow
- Familiarity with streaming data technologies (Kafka, Event Hubs, Kinesis)
- Experience supporting analytics tools (Power BI, Tableau, Looker) connected to Databricks
- Databricks certification (Associate or Professional)
Benefits
- Medical, dental and vision insurance
- 401k matching
- PTO
- Certification reimbursement
Company Overview