We are seeking a Staff / Principal ML Infrastructure Engineer to lead the design, deployment, and scaling of our large language model infrastructure. This role sits at the intersection of machine learning, systems engineering, and platform design, enabling teams to train, serve, and monitor models efficiently and reliably. This is not a prompt engineering role – it is focused on building robust, production-grade ML infrastructure and operational pipelines.
Job Responsibilities:
Design, implement, and maintain high-performance infrastructure for training and serving LLMs
Optimize model pipelines for efficiency, latency, and cost at scale
Collaborate with ML researchers, platform engineers, and product teams to deploy models safely into production
Build monitoring, alerting, and tooling to ensure reliability and observability of large-scale ML systems
Evaluate and integrate new frameworks, tools, and architectures to improve ML workflows
Provide technical leadership and mentorship to other engineers on the team
Requirements:
7+ years of software engineering experience, including 3+ years building production ML systems
Deep experience with distributed training and inference frameworks (e.g., PyTorch, JAX, TensorFlow)
Familiarity with model serving technologies and orchestration (e.g., Triton, Ray, Kubernetes)
Strong understanding of GPU/TPU infrastructure, performance optimization, and scalability challenges
Proven experience balancing reliability, latency, and cost trade-offs in production ML systems
Excellent collaboration, communication, and problem-solving skills
Nice to have:
Experience mentoring or leading engineering teams
What we offer:
Flexible work arrangements and competitive compensation