CrawlJobs Logo

Software Engineer: ML Infra

generalistai.com Logo

Generalist AI

Location Icon

Location:
United States , San Mateo

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

200000.00 - 350000.00 USD / Year

Job Description:

Generalist trains very large robot foundation models. This requires utilizing very large numbers of the latest generation GPU hardware and infrastructure (currently Nvidia) to run distributed training jobs and researcher experiments. We have extreme requirements on storage and data loading infrastructure that requires maximizing cloud infrastructure and custom solutions. You will also own inference infrastructure. For our robots this is a fleet of on-prem GPUs attached to robots that have extreme real-time and latency budgets in compute constrained environments.

Job Responsibility:

  • Owning our GPU compute fleets
  • Ensure our GPUs are easy for researchers to use and maximally utilized
  • Optimizing and improving ML data loading transport and storage in highly distributed fully utilized environments
  • Orchestration of robot inference fleets

Requirements:

  • Have managed large fleets of GPUs doing large-scale, long-term, highly distributed training runs or inference
  • Deep experience in Slurm or Kubernetes for ML workload orchestration
  • Have build high-scale ML data loaders and preparation systems
  • Deeply understand every layer of the ML hardware, storage, and networking stacks
  • Have experience in the NVidia GPU ecosystem
What we offer:

Offers Equity

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Fulltime
Work Type:
On-site work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Software Engineer: ML Infra

Senior Software Engineer – ML Model Compliance & Automation

We are seeking a highly skilled and motivated Senior Software Engineer to lead t...
Location
Location
India , Jaipur
Salary
Salary:
Not provided
infoobjects.com Logo
InfoObjects
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Experience Required: 3 - 7 yrs
  • GoLang (preferred)
  • Python (preferred)
  • Bash
  • MLOps Tools: KitOps, MLModelCI, MLflow, ONNX, TensorFlow, PyTorch, Docker
  • SBOM & Security: Syft, Grype, Trivy, CycloneDX, SPDX
  • CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
  • Infra: Kubernetes, Docker, Helm, Terraform
  • Cloud: AWS, GCP, Azure (EKS/GKE/ECS preferred)
  • Version Control: Git, GitOps
Job Responsibility
Job Responsibility
  • Model Packaging & Artifact Management: Design and implement workflows for packaging ML models using KitOps, ONNX, MLflow, or TensorFlow SavedModel
  • Manage model artifact versioning, registries, and reproducibility
  • Ensure artifact integrity, consistency, and traceability across CI/CD pipelines
  • Model Profiling & Optimization: Automate model profiling (latency, size, ops) using MLModelCI, TorchServe, or ONNX Runtime
  • Apply quantization, pruning, and format conversions (e.g., FP32→INT8) for optimization
  • Embed profiling and optimization checks into CI/CD pipelines to assess deployment readiness
  • Compliance & SBOM Generation: Develop pipelines to generate and validate SBOMs for ML models
  • Implement compliance checks for licensing, vulnerabilities, and security using CycloneDX, SPDX, Syft, or Trivy
  • Validate schema, dependencies, and runtime environments for production readiness
  • Cloud Integration & Deployment: Automate model registration, endpoint creation, and monitoring setup in AWS/GCP/Azure
  • Fulltime
Read More
Arrow Right

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...
Location
Location
United States , San Francisco; Sunnyvale
Salary
Salary:
302000.00 - 336000.00 USD / Year
uber.com Logo
Uber
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
  • 10+ years of experience building and deploying ML models in large-scale production environments
  • Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
  • Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
  • Proven track record of leading complex ML projects from research through production with significant measurable business impact
  • Strong programming skills in Python, Java, or Go with experience building production ML systems
  • Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
  • Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
  • Advanced Deep Learning and Neural Network architectures
  • Scalable ML architecture and distributed model training
Job Responsibility
Job Responsibility
  • Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
  • Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
  • Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
  • Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
  • Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
  • Serve as a representative for the Marketplace organization to the broader internal and external technical community
  • Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
  • Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
  • Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
  • Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies
What we offer
What we offer
  • Eligible to participate in Uber's bonus program
  • May be offered an equity award & other types of comp
  • Eligible to participate in a 401(k) plan
  • Eligible for various benefits (details at provided link)
  • Fulltime
Read More
Arrow Right

ML Infra Engineer

In this role you will help scale and optimize our training systems and core mode...
Location
Location
United States , San Francisco
Salary
Salary:
Not provided
physicalintelligence.company Logo
Physical Intelligence
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms
  • Hands-on large-scale training experience in JAX (preferred), PyTorch
  • Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines
  • Experience managing training workloads on cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS)
  • Ability to debug and optimize performance bottlenecks across the training stack
  • Strong cross-functional communication and ownership mindset
Job Responsibility
Job Responsibility
  • Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging
  • Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction
  • Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization
  • Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments
  • Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost
  • Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale
  • Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Palo Alto
Salary
Salary:
90000.00 - 300000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...
Location
Location
United States , Chevy Chase; New York City; Palo Alto
Salary
Salary:
115000.00 - 300000.00 USD / Year
geico.com Logo
Geico
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s degree in computer science, Engineering, or related technical field (or equivalent experience)
  • 8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
  • 3+ years of hands-on experience with machine learning infrastructure and deployment at scale
  • 2+ years of experience working with Large Language Models and transformer architectures
  • Proficient in Python
  • strong skills in Go, Rust, or Java preferred
  • Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
  • Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
  • Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
  • Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)
Job Responsibility
Job Responsibility
  • Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
  • Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
  • Design, implement, and maintain feature stores for ML model training and inference pipelines
  • Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
  • Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
  • Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
  • Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
  • Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
  • Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
  • Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases
What we offer
What we offer
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation
  • a 401K savings plan vested from day one that offers a 6% match
  • performance and recognition-based incentives
  • and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Supports flexibility- We provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
  • Fulltime
Read More
Arrow Right

ML Infra Engineer (Data Systems)

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infras...
Location
Location
United States , San Francisco
Salary
Salary:
Not provided
physicalintelligence.company Logo
Physical Intelligence
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Strong software engineering fundamentals
  • Experience building distributed systems or large-scale data pipelines
  • Comfort reasoning about performance, memory, I/O, and storage efficiency
  • Familiarity with batch and/or streaming processing systems
  • Experience with object storage systems and data format tradeoffs
  • Ownership mindset: design, build, operate, and iterate on systems end-to-end
  • Enjoy working closely with researchers and unblocking fast-moving projects
Job Responsibility
Job Responsibility
  • Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data
  • Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets
  • Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns
  • choose file formats with performance and scalability in mind
  • Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations
  • Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce time from data arrival → model training
  • Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts
  • Data Movement: Move hundreds of terabytes to petabytes efficiently across clusters and environments
  • Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions
  • Cross-Functional Collaboration: Work closely with cross-functional teams of researchers, engineers and roboticists to translate evolving data needs into robust systems
  • Fulltime
Read More
Arrow Right

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...
Location
Location
Salary
Salary:
Not provided
cohere.com Logo
Cohere
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, Pytorch and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands on experience on training large model at scale and having contributed to the tooling and/or setup of the training infrastructure
Job Responsibility
Job Responsibility
  • Design and write high-performant and scalable software for training
  • Improve our training setup from an infrastructure and codebase performance standpoint
  • Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime
Read More
Arrow Right

Software Engineer, Systems ML - SW/HW Co-design

Meta is seeking an AI Software Engineer to join our Research & Development teams...
Location
Location
United States , Sunnyvale
Salary
Salary:
257000.00 USD / Year
meta.com Logo
Meta
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
  • Specialized experience in one or more of the following machine learning/deep learning domains: Hardware accelerators architecture, GPU architecture, machine learning compilers, or ML systems, AI infrastructure, high performance computing, performance optimizations, or Machine learning frameworks (e.g. PyTorch), numerics and SW/HW co-design
  • Experience developing AI-System infrastructure or AI algorithms in C/C++ or Python
Job Responsibility
Job Responsibility
  • Apply relevant AI infrastructure and hardware acceleration techniques to build & optimize our intelligent ML systems that improve Meta’s products and experiences
  • Goal setting related to project impact, AI system design, and infrastructure/developer efficiency
  • Directly or influencing partners to deliver impact through deep, thorough data-driven analysis
  • Drive large efforts across multiple teams
  • Define use cases, and develop methodology & benchmarks to evaluate different approaches
  • Apply in depth knowledge of how the ML infra interacts with the other systems around it
  • Mentor other engineers / research scientists & improve the quality of engineering work in the broader team
What we offer
What we offer
  • bonus
  • equity
  • benefits
Read More
Arrow Right