[Remote] Research Scientist, Data
Note: This is a remote position open to candidates in the USA. Pika is pioneering the next generation of creative infrastructure built around real-time, multimodal generation and intelligent agentic platforms. They are looking for a staff- or lead-level Research Engineer, Data to architect and scale the data engineering systems that support model training for advanced multimodal foundation models.
Responsibilities
- Take ownership of large-scale data pipeline architecture and implementation to support model training and research workflows for text, image, audio, and video datasets
- Partner with research and engineering teams to curate, clean, and manage diverse, sensory-rich datasets for pre-training and mid-training of multimodal models
- Develop strategies and tools for scalable data ingestion, labeling, filtering, augmentation, and storage
- Ensure data quality, reliability, and compliance, including managing privacy and ethical considerations throughout the data lifecycle
- Optimize data processing, transformation, and delivery for large-scale distributed training pipelines
- Prototype and productionize new methods for dataset creation, management, and continuous improvement in response to researcher needs
- Contribute to the integration of research-driven data advancements into production-ready systems
- Stay informed on emerging data engineering and ML data management developments, bringing best practices to our systems
Skills
- 5+ years of experience building and scaling data pipelines for machine learning applications at a staff or lead engineer level, ideally in research or model training environments
- Strong background in data engineering and ML data curation for LLMs, VLMs, or other large-scale multimodal models
- Expertise in distributed data systems (e.g., Spark, Hadoop, Ray, or similar) and efficient large dataset processing/ETL workflows
- Proven ability to build robust, scalable, and production-grade data infrastructure for ML pipelines
- Experience developing tools for data labeling, filtering, deduplication, quality assurance, and dataset management
- Strong programming skills (Python, SQL, PySpark, or similar) and familiarity with cloud data platforms (AWS, GCP, Azure)
- Knowledge of privacy, compliance, ethics, and best practices in data collection and management
- Excellent cross-functional collaboration, problem-solving, and communication skills
- Passion for enabling cutting-edge generative AI and creative technology through data excellence
Benefits
- Competitive salary and substantial equity in a high-growth startup
- Full health benefits, 401k matching, and more
- Collaborative, mission-driven team environment with major growth opportunities
- Flexible on-site/remote hybrid (HQ in Palo Alto, CA)