Senior ML Infrastructure / ML DevOps Engineer

Pathway

Location:

Contract Type:
Employment contract

Salary:
Not provided

Job Description:

We are looking for a Senior ML Infrastructure / DevOps Engineer who loves Linux, distributed systems, and scaling GPU clusters more than fiddling with notebooks. You will own the infrastructure that powers our ML training and inference workloads across multiple cloud providers, from bare‑bones Linux to container orchestration and CI/CD. You will sit close to the R&D team, but your home is production infrastructure: clusters, networks, storage, observability, and automation. Your work will directly determine how fast we can train, ship, and iterate on models.

Job Responsibilities:

  • Design, operate, and scale GPU and CPU clusters for ML training and inference (Slurm, Kubernetes, autoscaling, queueing, quota management)
  • Automate infrastructure provisioning and configuration using infrastructure‑as‑code (Terraform, CloudFormation, cluster tooling) and configuration management
  • Build and maintain robust ML pipelines (data ingestion, training, evaluation, deployment) with strong guarantees around reproducibility, traceability, and rollback
  • Implement and evolve ML‑centric CI/CD: testing, packaging, deployment of models and services
  • Own monitoring, logging, and alerting across training and serving: GPU/CPU utilization, latency, throughput, failures, and data/model drift (Grafana, Prometheus, Loki, CloudWatch)
  • Work with terabyte‑scale datasets and the associated storage, networking, and performance challenges
  • Partner closely with ML engineers and researchers to productionize their work, translating experimental setups into robust, scalable systems
  • Participate in the on‑call rotation for critical ML infrastructure, and lead incident response and post‑mortems when things break

Requirements:

  • Former or current Linux / systems / network administrator comfortable living in the shell and debugging at OS and network layers (systemd, filesystems, iptables/security groups, DNS, TLS, routing)
  • 5+ years of experience in DevOps/SRE/Platform/Infrastructure roles running production systems, ideally with high‑performance or ML workloads
  • Deep familiarity with Linux as a daily driver, including shell scripting and configuration of clusters and services
  • Strong experience with workload management, containerization, and orchestration (Slurm, Docker, Kubernetes) in production environments
  • Solid understanding of CI/CD tools and workflows (GitHub Actions, GitLab CI, Jenkins, etc.), including building pipelines from scratch
  • Hands-on cloud infrastructure experience (AWS, GCP, Azure), especially around GPU instances, VPC/networking, storage, and managed ML services (e.g., SageMaker HyperPod, Vertex AI)
  • Proficiency with infrastructure as code (Terraform, CloudFormation, or similar) and a bias toward automation over manual operations
  • Experience with monitoring and logging stacks (Grafana, Prometheus, Loki, CloudWatch, or equivalents)
  • Familiarity with ML pipeline and experiment orchestration tools (MLflow, Kubeflow, Airflow, Metaflow, etc.) and with model/version management
  • Solid programming skills in Python, plus the ability to read and debug code that uses common ML libraries (PyTorch, TensorFlow) even if you are not a full‑time model developer
  • Strong ownership mindset, comfort with ambiguity, and enthusiasm for scaling and hardening critical infrastructure for an ML‑heavy environment
  • Willingness to learn

What we offer:

  • Intellectually stimulating work environment
  • Be a pioneer: you get to work with real-time data processing and AI
  • Work in one of the hottest AI startups, with exciting career prospects
  • Team members are distributed across the world
  • Real responsibility and the ability to make a significant contribution to the company’s success
  • Inclusive workplace culture

Additional Information:

Job Posted:
January 07, 2026

Employment Type:
Full-time
Work Type:
Remote work
