We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux, distributed systems, and scaling GPU clusters more than fiddling with notebooks. You will own the infrastructure that powers our ML training and inference workloads across multiple cloud providers, from bare‑bones Linux to container orchestration and CI/CD. You will sit close to the R&D team, but your home is production infrastructure: clusters, networks, storage, observability, and automation. Your work will directly determine how fast we can train, ship, and iterate on models.
Job Responsibilities:
Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster tooling) and configuration management
Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch; see the sketch after this list)
Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
Participate in on‑call rotation for critical ML infrastructure and lead incident response and post‑mortems when things break
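For a flavor of the monitoring work above, here is a minimal, hypothetical sketch of a per‑GPU utilization exporter that Prometheus could scrape. It assumes nvidia-smi is on the PATH and the prometheus_client package is installed; the port, metric name, and interval are illustrative placeholders, not our actual setup.

# Hypothetical sketch of the kind of GPU instrumentation this role owns.
# Assumes nvidia-smi is available and prometheus_client is installed.
import subprocess
import time

from prometheus_client import Gauge, start_http_server

GPU_UTIL = Gauge("gpu_utilization_percent", "GPU utilization reported by nvidia-smi", ["gpu"])

def scrape_gpu_utilization() -> None:
    # nvidia-smi prints one utilization value per line, one line per GPU
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu", "--format=csv,noheader,nounits"],
        text=True,
    )
    for index, line in enumerate(out.strip().splitlines()):
        GPU_UTIL.labels(gpu=str(index)).set(float(line))

if __name__ == "__main__":
    start_http_server(9101)  # expose /metrics for Prometheus to scrape
    while True:
        scrape_gpu_utilization()
        time.sleep(15)       # roughly align with a typical scrape interval

In production, this kind of signal would feed Grafana dashboards and alerting rules rather than run as a standalone script.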
Requirements:
Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer (see the toy example after this list)
Strong ownership mindset, comfort with ambiguity, and enthusiasm for scaling and hardening critical infrastructure for an ML‑heavy environment
Willingness to learn
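On the "read and debug ML code" requirement above, here is a toy example of the kind of PyTorch training loop you should be able to follow and troubleshoot. The model and data are synthetic stand‑ins, not anything from our codebase.

# Toy PyTorch training loop: the model, data, and hyperparameters are
# synthetic placeholders for illustration only.
import torch
import torch.nn as nn

model = nn.Linear(16, 2)  # stand-in for a real network
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

inputs = torch.randn(8, 16)            # synthetic batch
targets = torch.randint(0, 2, (8,))    # synthetic labels

for step in range(10):
    optimizer.zero_grad()                   # clear gradients from the previous step
    loss = loss_fn(model(inputs), targets)  # forward pass
    loss.backward()                         # backward pass
    optimizer.step()                        # parameter update
    print(f"step={step} loss={loss.item():.4f}")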
What we offer:
Intellectually stimulating work environment
Be a pioneer: work with real-time data processing and AI
Work in one of the hottest AI startups, with exciting career prospects
A globally distributed team
Responsibilities and the ability to make a significant contribution to the company’s success