Software Engineer: ML Infra Job at Generalist AI (San Mateo)

Senior Software Engineer – ML Model Compliance & Automation

We are seeking a highly skilled and motivated Senior Software Engineer to lead t...

Location

India, Jaipur

Salary:

Not provided

InfoObjects

Expiration Date

Until further notice

Requirements

Experience Required: 3 - 7 yrs
GoLang (preferred)
Python (preferred)
Bash
MLOps Tools: KitOps, MLModelCI, MLflow, ONNX, TensorFlow, PyTorch, Docker
SBOM & Security: Syft, Grype, Trivy, CycloneDX, SPDX
CI/CD: GitHub Actions, GitLab CI, Jenkins, ArgoCD
Infra: Kubernetes, Docker, Helm, Terraform
Cloud: AWS, GCP, Azure (EKS/GKE/ECS preferred)
Version Control: Git, GitOps

Job Responsibility

Model Packaging & Artifact Management: Design and implement workflows for packaging ML models using KitOps, ONNX, MLflow, or TensorFlow SavedModel
Manage model artifact versioning, registries, and reproducibility
Ensure artifact integrity, consistency, and traceability across CI/CD pipelines
Model Profiling & Optimization: Automate model profiling (latency, size, ops) using MLModelCI, TorchServe, or ONNX Runtime
Apply quantization, pruning, and format conversions (e.g., FP32→INT8) for optimization
Embed profiling and optimization checks into CI/CD pipelines to assess deployment readiness
Compliance & SBOM Generation: Develop pipelines to generate and validate SBOMs for ML models
Implement compliance checks for licensing, vulnerabilities, and security using CycloneDX, SPDX, Syft, or Trivy
Validate schema, dependencies, and runtime environments for production readiness
Cloud Integration & Deployment: Automate model registration, endpoint creation, and monitoring setup in AWS/GCP/Azure

Fulltime

Principal Engineer - Marketplace

Principal Engineer role in the Marketplace Engineering team to lead breakthrough...

Location

United States, San Francisco; Sunnyvale

Salary:

302000.00 - 336000.00 USD / Year

Uber

Expiration Date

Until further notice

Requirements

PhD in Computer Science, Machine Learning, Operations Research, or related quantitative field OR Master’s degree with 12+ years of industry experience
10+ years of experience building and deploying ML models in large-scale production environments
Expert-level proficiency in modern ML frameworks (TensorFlow, PyTorch, JAX) and distributed computing platforms (Spark, Ray)
Deep expertise across multiple areas including: Deep Learning, Causal Inference, Reinforcement Learning, Multi-objective Optimization, Algorithmic Game Theory, and Large-scale Ads Ranking/Auction Systems
Proven track record of leading complex ML projects from research through production with significant measurable business impact
Strong programming skills in Python, Java, or Go with experience building production ML systems
Experience with feature engineering, model serving, and ML infrastructure at scale (handling millions of predictions per second)
Technical leadership experience including mentoring senior engineers and driving cross-team technical initiatives
Advanced Deep Learning and Neural Network architectures
Scalable ML architecture and distributed model training

Job Responsibility

Lead the design and implementation of advanced ML systems for dynamic pricing algorithms serving millions of drivers across 70+ countries around the world
Architect real-time ML infrastructure handling 1M+ pricing decisions per second with sub-50ms latency requirements
Drive breakthrough research in causal ML, reinforcement learning, algorithmic game theory, and multi-objective optimization for marketplace optimization with strategic agents
Own end-to-end ML model lifecycle from research through production deployment and continuous optimization
Develop and enforce best practices in system design, ensuring data integrity, security, and optimal performance
Serve as a representative for the Marketplace organization to the broader internal and external technical community
Contribute to the eng brand for Marketplace and serve as a talent magnet to help attract and retain talent for the team
Stay abreast of industry trends and emerging technologies in software engineering, focused particularly on ML/AI, to enhance our systems and processes continually
Build scalable ML architecture and feature management systems supporting Driver Pricing and broader Marketplace teams
Design experimentation frameworks enabling rapid testing of pricing algorithms using A/B, Switchback, Synthetic Control, and other experimental methodologies

What we offer

Eligible to participate in Uber's bonus program
May be offered an equity award and other types of compensation
Eligible to participate in a 401(k) plan
Eligible for various benefits (details at provided link)

Fulltime

ML Infra Engineer

In this role you will help scale and optimize our training systems and core mode...

Location

United States, San Francisco

Salary:

Not provided

Physical Intelligence

Expiration Date

Until further notice

Requirements

Strong software engineering fundamentals and experience building ML training infrastructure or internal platforms
Hands-on large-scale training experience in JAX (preferred), PyTorch
Familiarity with distributed training, multi-host setups, data loaders, and evaluation pipelines
Experience managing training workloads on clusters and cloud platforms (e.g., SLURM, Kubernetes, GCP TPU/GKE, AWS)
Ability to debug and optimize performance bottlenecks across the training stack
Strong cross-functional communication and ownership mindset

Job Responsibility

Own training/inference infrastructure: Design, implement, and maintain systems for large-scale model training, including scheduling, job management, checkpointing, and metrics/logging
Scale distributed training: Work with researchers to scale JAX-based training across TPU and GPU clusters with minimal friction
Optimize performance: Profile and improve memory usage, device utilization, throughput, and distributed synchronization
Enable rapid iteration: Build abstractions for launching, monitoring, debugging, and reproducing experiments
Manage compute resources: Ensure efficient allocation and utilization of cloud-based GPU/TPU compute while controlling cost
Partner with researchers: Translate research needs into infra capabilities and guide best practices for training at scale
Contribute to core training code: Evolve JAX model and training code to support new architectures, modalities, and evaluation metrics

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Palo Alto

Salary:

90000.00 - 300000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

Staff Software Engineer - AI/ML Infra

GEICO AI platform and Infrastructure team is seeking an exceptional Senior ML Pl...

Location

United States, Chevy Chase; New York City; Palo Alto

Salary:

115000.00 - 300000.00 USD / Year

Geico

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent experience)
8+ years of software engineering experience with focus on infrastructure, platform engineering, or MLOps
3+ years of hands-on experience with machine learning infrastructure and deployment at scale
2+ years of experience working with Large Language Models and transformer architectures
Proficient in Python; strong skills in Go, Rust, or Java preferred
Proven experience working with open source LLMs (Llama 2/3, Qwen, Mistral, Gemma, Code Llama, etc.)
Proficient in Kubernetes including custom operators, helm charts, and GPU scheduling
Deep expertise in Azure services (AKS, Azure ML, Container Registry, Storage, Networking)
Experience implementing and operating feature stores (Chronon, Feast, Tecton, Azure ML Feature Store, or custom solutions)

Job Responsibility

Design and implement scalable infrastructure for training, fine-tuning, and serving open source LLMs (Llama, Mistral, Gemma, etc.)
Architect and manage Kubernetes clusters for ML workloads, including GPU scheduling, autoscaling, and resource optimization
Design, implement, and maintain feature stores for ML model training and inference pipelines
Build and optimize LLM inference systems using frameworks like vLLM, TensorRT-LLM, and custom serving solutions
Ensure 99.9%+ uptime for ML platforms through robust monitoring, alerting, and incident response procedures
Design and implement ML platforms using DataRobot, Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Container Instances
Develop and maintain infrastructure using Terraform, ARM templates, and Azure DevOps
Implement cost-effective solutions for GPU compute, storage, and networking across Azure regions
Ensure ML platforms meet enterprise security standards and regulatory compliance requirements
Evaluate and potentially implement hybrid cloud solutions with AWS/GCP as backup or specialized use cases

What we offer

Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
Financial benefits including market-competitive compensation; a 401(k) savings plan vested from day one that offers a 6% match; performance and recognition-based incentives; and tuition assistance
Access to additional benefits like mental healthcare as well as fertility and adoption assistance
Supports flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year

Fulltime

ML Infra Engineer (Data Systems)

As an ML Infra Engineer (Data Systems), you’ll build and operate the data infras...

Location

United States, San Francisco

Salary:

Not provided

Physical Intelligence

Expiration Date

Until further notice

Requirements

Strong software engineering fundamentals
Experience building distributed systems or large-scale data pipelines
Comfort reasoning about performance, memory, I/O, and storage efficiency
Familiarity with batch and/or streaming processing systems
Experience with object storage systems and data format tradeoffs
Ownership mindset: design, build, operate, and iterate on systems end-to-end
Enjoy working closely with researchers and unblocking fast-moving projects

Job Responsibility

Data Ingestion & Processing: Design and build high-throughput pipelines that validate, transform, and featurize raw multimodal data
Batch & Streaming Systems: Operate large-scale batch and streaming workflows over massive datasets
Storage Systems: Design object storage layouts, metadata systems, and efficient access patterns; choose file formats with performance and scalability in mind
Data Lifecycle Management: Build systems for backfills, dataset rebuilds, garbage collection, and large-scale transformations
Training-Time Performance: Optimize dataloaders, sharding, prefetching, caching, and throughput to reduce time from data arrival → model training
Metadata & Indexing: Build scalable metadata stores for datasets, annotations, and training artifacts
Data Movement: Move hundreds of terabytes to petabytes efficiently across clusters and environments
Operational Correctness: Implement observability, validation, and guardrails to prevent silent data regressions
Cross-Functional Collaboration: Work closely with cross-functional teams of researchers, engineers and roboticists to translate evolving data needs into robust systems

Fulltime

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...

Location

Salary:

Not provided

Cohere

Expiration Date

Until further notice

Requirements

Extremely strong software engineering skills
Proficiency in Python and related ML frameworks and compilers such as JAX, PyTorch, and XLA/MLIR
Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
Experience using large-scale distributed training strategies
Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure

Job Responsibility

Design and write high-performance, scalable software for training
Improve our training setup from an infrastructure and codebase performance standpoint
Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
Research, implement, and experiment with ideas on our supercompute and data infrastructure
Learn from and work with the best researchers in the field

What we offer

An open and inclusive culture and work environment
Work closely with a team on the cutting edge of AI research
Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits, including a separate budget to take care of your mental health
100% Parental Leave top-up for up to 6 months
Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
6 weeks of vacation (30 working days!)

Fulltime

Software Engineer, Systems ML - SW/HW Co-design

Meta is seeking an AI Software Engineer to join our Research & Development teams...

Location

United States, Sunnyvale

Salary:

257000.00 USD / Year

Software Engineer: ML Infra

Generalist AI

Location:
United States, San Mateo; Somerville

Category:
IT - Software Development

Contract Type:
Not provided

Job Posted:
February 18, 2026

Similar Jobs for Software Engineer: ML Infra