We are seeking a Senior+ Research Infrastructure Engineer to join our growing team. In this role, you will design, build, and operate distributed data systems that power large-scale ingestion, processing, and transformation of datasets used for AI model training. These datasets span traditional structured data as well as unstructured assets such as images and 3D models, which often require specialized preprocessing for pretraining and fine-tuning workflows. This is a versatile role: you’ll own end-to-end pipelines (from ingestion to transformation), ensure data quality and scalability, and collaborate closely with ML researchers to prepare diverse datasets for cutting-edge model training. You’ll thrive in our fast-paced startup environment, where problem-solving, adaptability, and wearing multiple hats are the norm.
Job Responsibilities:
Architect pipelines across cloud object storage (S3, GCS, Azure Blob), data lakes, and metadata catalogs
Optimize large-scale processing with distributed frameworks (Spark, Dask, Ray, Flink, or equivalents)
Implement partitioning, sharding, caching strategies, and observability (monitoring, logging, alerting) for reliable pipelines
Design, implement, and maintain distributed ingestion pipelines for structured and unstructured data (images, 3D/2D assets, binaries)
Build scalable ETL/ELT workflows to transform, validate, and enrich datasets for AI/ML model training and analytics
Support preprocessing of unstructured assets (e.g., images, 3D/2D models, video) for training pipelines, including format conversion, normalization, augmentation, and metadata extraction
Implement validation and quality checks to ensure datasets meet ML training requirements
Collaborate with ML researchers to quickly adapt pipelines to evolving pretraining and evaluation needs
Use infrastructure-as-code (Terraform, Kubernetes, etc.) to manage scalable and reproducible environments
Integrate CI/CD best practices for data workflows
Maintain data lineage, reproducibility, and governance for datasets used in AI/ML pipelines
Work cross-functionally with ML researchers, graphics/vision engineers, and platform teams
Embrace versatility: switch between infrastructure-level challenges and asset/data-level problem solving
Contribute to a culture of fast iteration, pragmatic trade-offs, and collaborative ownership
Requirements:
5+ years of experience in data engineering, distributed systems, or a similar field
Strong programming skills in Python (Scala/Java/C++ a plus)
Solid skills in SQL for analytics, transformations, and warehouse/lakehouse integration
Proficiency with distributed frameworks (Spark, Dask, Ray, Flink)
Familiarity with cloud platforms (AWS/GCP/Azure) and storage/table formats (S3, Parquet, Delta Lake, etc.)
Experience with workflow orchestration tools (Airflow, Prefect, Dagster)
Comfortable in a startup environment: versatile, self-directed, pragmatic, and adaptive
Strong problem solver who enjoys tackling ambiguous challenges
Commitment to building robust, maintainable, and observable systems
Nice to have:
Kubernetes for distributed workloads and orchestration
Data warehouses or lakehouse platforms (Snowflake, BigQuery, Databricks, Redshift)
Familiarity with GPU-accelerated computing and HPC clusters