Staff ML Infrastructure Engineer

Darwin Recruitment GmbH

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:
Not provided

Job Description:

We are seeking a Staff / Principal ML Infrastructure Engineer to lead the design, deployment, and scaling of our large language model infrastructure. This role sits at the intersection of machine learning, systems engineering, and platform design, enabling teams to train, serve, and monitor models efficiently and reliably. This is not a prompt engineering role – it is focused on building robust, production-grade ML infrastructure and operational pipelines.

Job Responsibility:

  • Design, implement, and maintain high-performance infrastructure for training and serving LLMs
  • Optimize model pipelines for efficiency, latency, and cost at scale
  • Collaborate with ML researchers, platform engineers, and product teams to deploy models safely into production
  • Build monitoring, alerting, and tooling to ensure reliability and observability of large-scale ML systems
  • Evaluate and integrate new frameworks, tools, and architectures to improve ML workflows
  • Provide technical leadership and mentorship to other engineers on the team

Requirements:

  • 7+ years of software engineering experience, including 3+ years building production ML systems
  • Deep experience with distributed training and inference frameworks (e.g., PyTorch, JAX, TensorFlow)
  • Familiarity with model serving technologies and orchestration (e.g., Triton, Ray, Kubernetes)
  • Strong understanding of GPU/TPU infrastructure, performance optimization, and scalability challenges
  • Proven experience solving reliability, latency, and cost trade-offs in production ML systems
  • Excellent collaboration, communication, and problem-solving skills

Nice to have:

Experience mentoring or leading engineering teams is a plus

What we offer:

Flexible work arrangements and competitive compensation

Additional Information:

Job Posted:
January 05, 2026

Employment Type:
Full-time
Work Type:
Remote work

Similar Jobs for Staff ML Infrastructure Engineer

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location:
Canada; United States
Salary:
195000.00 - 285000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, NewRelic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility:
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
Employment Type:
Full-time

Staff Platform Engineer

Join our dynamic team as a Compute Platform Engineer and play a pivotal role in ...
Location:
Canada, Vancouver
Salary:
190000.00 - 240000.00 CAD / Year
Inworld AI
Expiration Date:
Until further notice
Requirements:
  • 7 years of experience in software engineering
  • 5 years of experience with infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Kustomize manifests/Helm charts for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficiency in at least one backend programming or scripting language, such as Golang, Python, or Bash
Job Responsibility:
  • Work closely with backend and ML engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for our AI engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Conduct root cause analysis to identify critical issues and develop automated solutions to prevent recurrence
  • Develop and share best practices to improve automation and efficiency across our engineering teams
What we offer:
  • Bonus
  • Equity
  • Benefits
Employment Type:
Full-time

Member of Technical Staff, Cloud Infrastructure

As a Software Engineer on our Cloud Infrastructure team, you'll be at the forefr...
Location:
United States, New York, NY; San Mateo, CA; Redwood City, CA
Salary:
175000.00 - 220000.00 USD / Year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 5+ years of experience designing and building backend infrastructure in cloud environments (e.g., AWS, GCP, Azure)
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, TensorFlow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Strong software development skills in languages like Python or C++
  • Deep understanding of distributed systems fundamentals: scheduling, orchestration, storage, networking, and compute optimization
Job Responsibility:
  • Architect and build scalable, resilient, and high-performance backend infrastructure to support distributed training, inference, and data processing pipelines
  • Lead technical design discussions, mentor other engineers, and establish best practices for building and operating large-scale ML infrastructure
  • Design and implement core backend services (e.g., job schedulers, resource managers, autoscalers, model serving layers) with a focus on efficiency and low latency
  • Drive infrastructure optimization initiatives, including compute cost reduction, storage lifecycle management, and network performance tuning
  • Collaborate cross-functionally with ML, DevOps, and product teams to translate research and product needs into robust infrastructure solutions
  • Continuously evaluate and integrate cloud-native and open-source technologies (e.g., Kubernetes, Ray, Kubeflow, MLFlow) to enhance our platform’s capabilities and reliability
  • Own end-to-end systems from design to deployment and observability, with a strong emphasis on reliability, fault tolerance, and operational excellence
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary
  • Comprehensive benefits package
Employment Type:
Full-time

Staff Platform Engineer

Join our dynamic team as a Compute Platform Engineer and play a pivotal role in ...
Location:
United States, Mountain View, California
Salary:
180000.00 - 280000.00 USD / Year
Inworld AI
Expiration Date:
Until further notice
Requirements:
  • 7 years of experience in software engineering
  • 5 years of experience with infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Kustomize manifests/Helm charts for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficiency in at least one backend programming or scripting language, such as Golang, Python, or Bash
  • Candidates must be based in the SF Bay Area or willing to relocate (you will be working on-site in our South Bay office a few days a week)
Job Responsibility:
  • Work closely with backend and ML engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for our AI engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Conduct root cause analysis to identify critical issues and develop automated solutions to prevent recurrence
  • Develop and share best practices to improve automation and efficiency across our engineering teams
What we offer:
  • Equity and benefits
Employment Type:
Full-time

Staff Backend Engineer

Kalepa is looking for a Staff Backend Engineer to work on its AI Copilot platfor...
Location:
Not provided
Salary:
145000.00 - 185000.00 USD / Year
Kalepa
Expiration Date:
Until further notice
Requirements:
  • 8+ years of relevant software engineering experience
  • Excellent development skills including design, debugging and problem solving
  • Bachelor's or master's degree in computer science or a related field
  • Experience with Python3 or other OO languages (Java, C++, C#, etc.)
  • Experience with AWS (EC2, Lambda, etc.) and serverless technologies
  • Experience with relational databases, preference for PostgreSQL
  • Experience working on distributed systems creating scalable, fault-tolerant infrastructure
  • Experience building data-driven microservices leveraging RESTful APIs
  • Experience with tools such as Docker, Git, GitHub, Flask, NumPy, Pandas
Job Responsibility:
  • Work on advanced systems including NLP, firmographic data, entity resolution
  • Solve problems at the intersection of large and performant data pipelines, distributed systems, machine learning models, and robust infrastructure
  • Collaborate with a global team of full-stack, data, ML, and DevOps engineers
  • Build scalable and reliable backend solutions
What we offer:
  • Competitive salary (based on experience level)
  • Significant equity options package
  • 20 days of PTO a year
  • Global team offsites
  • Healthy living/gym stipend
  • Mobile phone bill stipend
  • Continuing education credits
Employment Type:
Full-time

Staff Software Engineer, Backend

The Staff Engineer will work closely with AI/ML engineers, product managers, app...
Location:
United States, NYC
Salary:
160000.00 - 190000.00 USD / Year
Conductor
Expiration Date:
Until further notice
Requirements:
  • Completed studies in Computer Science, Mathematics, Engineering, or a related field, or equivalent professional experience
  • 8+ years of experience in software development, with experience in product-driven companies
  • Strong expertise in system design, distributed computing, and scalable architecture patterns for handling large datasets and high-throughput applications
  • Proficiency in multiple programming languages with strong Python coding skills. Experience with Java is highly valued
  • Strong database experience including both SQL and NoSQL systems, with knowledge of data modeling and optimization techniques
  • Experience with AI/ML technologies including LLMs, vector databases (e.g., Milvus), embeddings, and ML frameworks
  • Knowledge of MLOps practices, model deployment, and AI system integration in production environments
  • Experience working across the full software development lifecycle including CI/CD, monitoring, testing, and production deployment
  • Proven track record of technical leadership, mentoring engineers, and driving engineering excellence within teams
  • Up to date with rapidly evolving technologies, with a demonstrated ability to evaluate and adopt new tools and frameworks
Job Responsibility:
  • Lead the technical architecture, design, and implementation of large-scale distributed systems and data platforms to support customer needs and business growth
  • Oversee the planning, execution, and successful delivery of complex engineering projects, ensuring adherence to engineering best practices and quality standards
  • Design and build scalable, high-performance backend systems and APIs that handle millions of requests and large datasets efficiently
  • Architect robust data processing pipelines and ETL workflows using modern cloud technologies and distributed computing frameworks
  • Drive technical decision-making across the engineering organization, evaluating trade-offs and establishing engineering standards and practices
  • Lead cross-functional collaboration with product, AI/ML engineering, data engineering, and infrastructure teams to deliver comprehensive solutions
  • Build and maintain CI/CD pipelines, monitoring systems, and deployment automation to ensure reliable software delivery
  • Implement AI/ML capabilities including LLM integration, vector databases, and intelligent content processing workflows
  • Mentor senior and junior engineers, fostering technical excellence and knowledge sharing within the engineering organization
What we offer:
  • 100% covered employee medical plan
  • dental and vision plans
  • 401(k) with employer contribution
  • an unlimited vacation policy
  • 10 sick days
  • short-term disability
  • long-term disability
  • generous paid parental leave
  • employee assistance program
  • flexible savings accounts
Employment Type:
Full-time

Staff Software Engineer

As a Staff Forward Deployed Engineer (FDE) at Invisible, you'll lead high-impact...
Location:
United States, Austin; New York; San Francisco Bay Area; Washington DC–Baltimore
Salary:
213000.00 - 300000.00 USD / Year
Invisible Technologies
Expiration Date:
Until further notice
Requirements:
  • 8+ years of software engineering experience, including significant time spent building data, ML, or backend systems
  • Deep proficiency in Python with hands-on experience using Hugging Face, LangChain, OpenAI, Pinecone, and related ecosystems
  • Skilled in full-stack and API-based deployment patterns, including Docker, FastAPI, Kubernetes, and cloud environments (GCP, AWS)
  • Experienced with workflow orchestration libraries, pub/sub systems (Kafka), and schema governance
  • Expertise in data governance and operations, including Unity Catalog and policy management, cluster/job orchestration, data contracts and quality enforcement, Delta/ETL pipelines, and replay processes
  • Strong product and system design instincts — you understand business needs and how to translate them into technical architecture
  • Experience building usable systems from messy data and ambiguous requirements
  • Excellent communication and client-facing skills; you’ve led conversations with technical and non-technical stakeholders alike
  • Proven experience owning projects from scoping through deployment in ambiguous, high-stakes environments
Job Responsibility:
  • Partner with delivery and executive stakeholders to scope, design, and lead implementation of AI-driven solutions
  • Identify transformational opportunities in messy, ambiguous workflows and turn them into repeatable systems
  • Lead architecture design and trade-off discussions across performance, scalability, cost, and reliability
  • Own projects from first discovery call through full deployment — including client-facing delivery, internal coordination, and post-launch iteration
  • Build shared infrastructure, reusable components, and internal playbooks to level up the team
  • Coach and mentor mid-level engineers and help shape the culture of forward-deployed AI engineering at Invisible
What we offer:
  • Bonus
  • Equity
  • Benefits
Employment Type:
Full-time

Staff Machine Learning Engineer

Join PagerDuty as a Staff Machine Learning Engineer to tackle complex problems, ...
Location:
Canada, Toronto
Salary:
156000.00 - 232000.00 CAD / Year
PagerDuty
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience building, designing, and evolving data architecture for large-scale systems
  • Excellent communication skills
  • Experience working with Product teams, ensuring and driving a timely delivery
  • Deep understanding of the trade-offs to be considered when designing and delivering machine learning solutions to production
  • Experience leading cross-team architecture discussions, building technical prototypes, and driving the adoption of best practices across diverse teams
  • Demonstrated experience with data engineering processes, working with unstructured data and cloud-based data infrastructures
  • Passionate about ML engineering and interested in driving discussions with stakeholders and executives
Job Responsibility:
  • Build and improve the capabilities of the data platform that enable and accelerate the production of ML/AI-based solutions
  • Drive and define standards for AI/ML across the organization
  • Provide guidance, technical leadership, and mentoring to other members of the team
  • Mentor junior members and participate in scaling up the existing team
  • Proactively recommend improvements and new approaches addressing potential systemic pain points and technical debt
  • Anticipate technical demands on the data platform based on the organization’s roadmap and systematically drive the evolution of the architecture toward those ends
  • Develop a long-term plan for ML/AI investments
What we offer:
  • Competitive salary
  • Comprehensive benefits package from day one
  • Flexible work arrangements
  • Company equity
  • ESPP (Employee Stock Purchase Program)
  • Retirement or pension plan
  • Generous paid vacation time
  • Paid holidays and sick leave
  • Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
  • Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent
Employment Type:
Full-time