[Remote] Senior Data Engineer (Customer Data Products)
Note: This is a remote position open to candidates in the USA. NBME is a not-for-profit organization that specializes in the creation of assessments and learning tools for physicians and health professionals. In this role, the Data Engineer will deliver meaningful insights and help modernize and optimize NBME's data platform by building data lakes and complex data integration pipelines.
Responsibilities
- Code, test, deploy, orchestrate, monitor, document, and troubleshoot cloud-based data engineering processes, feature stores, and vector databases in accordance with best practices and security standards throughout the development lifecycle
- Partner closely with data scientists, AI researchers, data and enterprise architects, and business stakeholders to identify, extract, clean, and format structured and unstructured data for AI/ML model training, fine-tuning, and feature extraction
- Lead evaluation, research, and experimentation efforts with batch and streaming data technologies, LLM data preparation frameworks, and MLOps tools to keep pace with industry innovation
- Act as a technical lead to showcase the capabilities of emerging AI and data technologies, enabling the widespread adoption of modern data techniques across the organization
- Significantly contribute to the definition and refinement of processes and procedures for the data engineering practice
- Educate and develop ETL developers on cloud-based data engineering initiatives to enable their transition to the data engineer role and practice
- Ensure the integrity and accuracy of corporate data, with particular attention to data security
- Ensure high data quality for Data Services, Analytics, and Master Data Management
- Help coordinate technical solutions and take responsibility for the design, development, testing, and delivery of solutions
- Build automated, scalable, test-driven data pipelines
- Utilize software development practices such as version control via Git, CI/CD, and release management to enhance existing CI/CD pipelines in AWS
- Collaborate with Data Engineers, DevOps engineers and architects on improvement opportunities for DataOps tools and frameworks
Skills
- Bachelor's Degree
- At least 7 years of experience in application development (Internship experience does not apply)
- At least 4 years of experience in big data technologies
- At least 4 years of experience with cloud computing using AWS
- 4+ years of experience in application development including Python, SQL, Scala, or Java
- 4+ years of experience with distributed data/computing tools (MapReduce, Hadoop, Hive, EMR, Kafka, Spark, MySQL, etc.)
- 4+ years of experience working on real-time data and streaming applications
- 4+ years of experience with NoSQL implementation (Mongo, Cassandra)
- 4+ years of data warehousing experience (Redshift)
- 6+ years of experience with UNIX/Linux including basic commands and shell scripting
- 7+ years of experience with Agile engineering practices
- 7+ years of experience with SQL optimization
- 4+ years of experience with PySpark
- 3+ years of experience with process orchestration including Airflow, Kubeflow, AWS Step Functions, or Luigi
- Proven experience implementing Generative AI, LLM data preparation pipelines, and Vector Databases (e.g., Pinecone, Milvus, pgvector)
- Strong experience building and maintaining Feature Stores for machine learning models
- Experience building highly scalable, secure, and production-ready APIs and Data-as-a-Service (DaaS) platforms
- AWS Certified Data Engineer or AWS Certified Machine Learning - Specialty certifications
- 3+ years of experience with Machine Learning
- Experience with building a Data-as-a-service platform
- Experience with building APIs
Benefits
- Healthcare, Dental, Prescription, and Vision plans
- 401(k) w/match
- Tuition Reimbursement Plan
- Commuter Benefit: Public Transit or Parking options
- Remote Friendly Workplace