We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure. You will build, maintain, and scale Luma’s infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.
Job Responsibilities:
Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale
Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance
Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices
Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level
Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure
Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues
Requirements:
8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment
Deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance
Strong experience with cloud providers such as AWS or OCI
A drive to solve complex, low-level problems where hardware and software intersect
Energy and the ability to thrive in a less structured, fast-paced environment
Working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO
Practical experience with InfiniBand, RDMA, or RoCE, and an understanding of how to optimize throughput for massive distributed training jobs
Nice to have:
Deep expertise with GPU tooling, such as DCGM for NVIDIA GPUs or ROCm for AMD GPUs
Experience managing large-scale GPU clusters for AI/ML workloads (training or inference)
Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray