Senior DevOps Engineer (AI & Cloud Infrastructure)

Inflection AI

Location:
United States, Palo Alto

Contract Type:
Not provided

Salary:

175000.00 - 250000.00 USD / Year

Job Description:

We are seeking a Senior DevOps Engineer to design, deploy, and operate the next generation of Inflection AI’s cloud and AI infrastructure. This role sits at the intersection of AI research and production systems, owning the reliability, scalability, and performance of GPU-enabled platforms that power large-scale LLM training and inference. You will work across Azure and AWS to build highly automated, observable, and resilient infrastructure supporting low-latency AI applications in production.

Job Responsibility:

  • Architect, deploy, and operate large-scale LLM inference servers and AI applications with a focus on low latency, high availability, and production reliability
  • Design, provision, and maintain complex cloud architectures across Azure and AWS, including storage, compute, networking, databases, and native LLM services
  • Manage GPU-enabled Kubernetes clusters and Slurm-based HPC environments, optimizing resource allocation for AI training and inference workloads
  • Deploy and operate core Kubernetes infrastructure components and operators (GPU operators, ingress controllers, service meshes, CNIs, CSIs, and storage drivers)
  • Build scalable infrastructure-as-code and deployment workflows using Terraform, Helm, Kustomize, ArgoCD, and GitOps best practices
  • Design and maintain centralized observability systems using Prometheus, Grafana, ClickHouse, and cloud-native monitoring tools
  • Participate in on-call rotations, lead incident response, perform post-mortems, and continuously improve system reliability and SLAs

Requirements:

  • 5+ years of hands-on experience in DevOps, Site Reliability Engineering, or ML Infrastructure supporting high-scale, production systems
  • Deep expertise in Azure and AWS, including storage, compute, networking, databases, and cloud-native monitoring services
  • Strong Kubernetes administration experience, including GPU scheduling, operator deployment, and management of core infrastructure components
  • Experience with Slurm is highly desirable
  • Proven experience deploying, scaling, and operating Large Language Models (LLMs) and inference engines such as vLLM, TGI, or Triton
  • Strong experience with modern DevOps tooling: Terraform, Helm, Kustomize, ArgoCD, GitHub Actions or GitLab CI, Prometheus, Grafana, and ClickHouse
  • Advanced scripting and automation skills in Python and Bash, with the ability to debug complex distributed systems and optimize performance at scale
  • Demonstrated ability to troubleshoot LLM servers, Kubernetes workloads, GPU utilization, and cloud infrastructure bottlenecks
  • Bachelor’s degree or equivalent in a field related to the position

What we offer:

  • Diverse medical, dental and vision options
  • 401k matching program
  • Unlimited paid time off
  • Parental leave and flexibility for all parents and caregivers
  • Support of country-specific visa needs for international employees living in the Bay Area
  • Meaningful equity component

Additional Information:

Job Posted:
January 26, 2026

Employment Type:
Full-time

Similar Jobs for Senior DevOps Engineer (AI & Cloud Infrastructure)

Senior Devops & AI Engineer

This role presents a unique opportunity to contribute to the future of impactful...
Location:
India, Hyderabad
Salary:
Not provided
Fission Labs
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or related field
  • 6+ years of experience in infrastructure management roles, with a focus on cloud platforms (Azure and AWS preferred)
  • Hands-on experience with operations (DevSecOps) principles and best practices
  • Proficiency in scripting languages such as Python, PowerShell, or Bash
  • Excellent communication and collaboration skills
  • In-depth knowledge of Linux operating systems, including CentOS, Ubuntu, and Red Hat, with expertise in shell scripting, package management, and system administration
  • Hands-on experience with a wide range of AWS and Azure services
  • Develop and maintain Infrastructure as Code (IaC) templates using tools such as Terraform or AWS CloudFormation
  • Experience setting up cloud infrastructure stacks, databases, and service endpoints, and with GPU and CPU resource scaling and optimization
  • Should have worked with AIOps/MLOps
Job Responsibility:
  • Configure and optimize Linux-based servers for performance, security, and resource utilization, including kernel tuning, file system management, and network configuration
  • Architect cloud solutions leveraging best practices and services offered by AWS and Azure, optimizing for scalability, reliability, and cost-effectiveness
  • Implement and manage hybrid cloud environments, facilitating seamless integration and interoperability between AWS and Azure services
  • Establish version control practices for IaC templates, ensuring traceability, auditability, and reproducibility of infrastructure changes
What we offer:
  • Opportunity to work on impactful technical challenges with global reach
  • Vast opportunities for self-development, including online university access and knowledge sharing opportunities
  • Sponsored Tech Talks & Hackathons to foster innovation and learning
  • Generous benefits packages including health insurance, retirement benefits, flexible work hours, and more
  • Supportive work environment with forums to explore passions beyond work
  • Full-time

Senior Platform Engineer - CI/CD & AI Automation (AI-first)

Groupon is undergoing a critical platform transformation, modernizing its core d...
Location:
Czechia, Prague
Salary:
Not provided
Groupon
Expiration Date:
Until further notice
Requirements:
  • 5+ years of dedicated experience in Platform Engineering, DevOps, or Infrastructure roles
  • Deep expertise building, scaling, and migrating CI/CD systems, with strong practical experience in Jenkins and/or GitHub Actions
  • Expertise in scripting and automation (Python, Go, or Bash)
  • Solid understanding of container technologies, Kubernetes, and cloud build systems
  • Proven experience leveraging AI tooling (e.g., Claude Code, code analysis) to meaningfully increase developer output and optimize platform work
  • Excellent communication and ability to drive technical decisions across multiple platform and product teams
Job Responsibility:
  • Platform Transformation: Lead the design, planning, and execution of the Jenkins-to-GitHub Actions migration across a large portfolio of microservices
  • Pipeline Engineering: Design and optimize high-performance, secure, and observable CI/CD workflows across GitHub Actions, Jenkins, and Kubernetes environments
  • AI-First Automation: Drive an AI-First workflow by leveraging tools (e.g., Copilot, code generation) to eliminate infrastructure toil, accelerate development, and analyze pipeline failures
  • Core Automation: Develop robust platform automation (e.g., Python, Go, Bash) to improve build efficiency, artifact caching, reliability, and repository hygiene
  • Security & Compliance: Harden CI/CD infrastructure with robust controls for secrets management, RBAC, audit logging, and secure runner design
  • Observability: Implement and enhance CI/CD observability using tools like Prometheus, Grafana, and OpenTelemetry to provide deep insights into performance and reliability
  • Technical Leadership: Mentor engineers and partner across Cloud, Security, and Developer Experience teams to define and evolve our end-to-end delivery platform architecture

Senior Java Architect & Cloud Engineer

The Equity Middle Office technology group is actively transforming its technolog...
Location:
Singapore, Singapore
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Degree in Computer Science or Electronic/Electrical Engineering
  • ~15 years of banking software development experience, including management experience or equivalent
  • Knowledge of low-latency frameworks such as Chronicle / garbage-free programming in Java
  • Knowledge of IT infrastructure (i.e. IT networks, communications, and data centre management) and infrastructure support operations
  • Working experience in Linux operating system, Windows, Groovy, Python, JavaScript, Java, ELK, Bitbucket, Jenkins, Confluence, SonarQube, Nexus and scripting experience to do integrations through API, CLI for extracting data and to perform automated operations
  • Very strong experience in shell scripting and batch scripting for automation and command-line integration, and in invoking REST APIs using Postman, is mandatory
  • Must have hands-on experience building microservices using Java and the Spring Boot framework stack
  • Working experience with messaging platforms such as AMPS, TIBCO, Solace, and MQ
  • Experience with relational SQL and NoSQL databases
  • Strong knowledge and experience in DevOps automation, containerization and orchestration using tools such as Gradle, Maven, Docker, Kubernetes, Terraform, Artifactory
Job Responsibility:
  • Be recognized as a trusted partner for business application owners and other technology teams who seek to make use of Cloud based infrastructure
  • Define the technology roadmap and prioritize technical resources against it to achieve maximum success
  • Ensuring the platform conforms to security best practices and is fully consistent with banking audit and compliance requirements and fully consistent with the design ethos and technical requirements of external cloud providers
  • Supporting adoption of containers and container control frameworks for internal Cloud Services, including container platform selection and design and ensuring that self-service design/deployment/control web containers is appropriate for requirements
  • Ensuring lifecycle management artifacts such as documentation, test cases, and source code repositories are actively used and maintained
  • Recommend new services to complement and enhance infrastructure elements to stream-line and support applications development and deployment
  • Developing highly available infrastructures in a cloud services environment, preferably with cloud providers such as OpenShift or AWS
  • Implement Continuous Integration / Continuous Deployment practices, tooling, and techniques, particularly evidence of leading organizational and cultural change to adopt CI/CD practices (Jira, Confluence, Bitbucket, Git, Jenkins, Artifactory, Terraform, Packer, Rundeck, Ansible, AWS, ELK, AppDynamics)
  • Enable AI based monitoring automation to effectively detect/predict/prevent issues in the environment and code base
  • Full-time

Senior AI and Machine Learning Engineer

We are seeking Senior AI/ML & Innovation Engineer who will be leading initiative...
Location:
United States, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or master’s degree in computer science, engineering, data science, machine learning, artificial intelligence, or closely related quantitative discipline
  • Typically, 7-10 years’ experience
  • Deep understanding of machine learning algorithms, such as linear regression, decision trees, support vector machines, random forests, deep learning models (e.g., neural networks), and reinforcement learning
  • A strong foundation in mathematics and statistics
  • Proficiency in programming languages such as Python, R, or Java
  • Strong understanding of GitHub Copilot, Cursor, n8n, vibe coding, Windsurf, and similar technologies
  • Experience in cloud infrastructure (AWS, Azure, etc.)
  • Knowledge of open source, Linux, etc.
  • Understanding of DevOps and SRE
  • Advanced knowledge and experience in deep learning
Job Responsibility:
  • Conducts research and stays up to date with the latest advancements in AI and machine learning technologies, frameworks, and algorithms
  • Collaborates with cross-functional teams to understand business requirements and design AI and machine learning solutions
  • Develops, implements, and optimizes machine learning models and algorithms
  • Deploys machine learning models into production environments
  • Monitors the performance of deployed models
  • Organizes and leads comprehensive design review sessions
  • Works collaboratively with the engineering manager and team lead to set design and implementation standards
  • Regularly leads meetings
  • Has experience in providing technical leadership, mentorship, and guidance to junior team members
  • Develops and delivers strategic presentations and reports to senior stakeholders
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Full-time

Senior DevOps Engineer (GCP)

Our client is a global UK-based financial services and investment banking organi...
Location:
Salary:
Not provided
N-iX
Expiration Date:
Until further notice
Requirements:
  • 5+ years of experience in DevOps, Cloud Engineering, or SRE roles
  • Strong hands-on experience with Google Cloud Platform, including: GKE / Kubernetes, Cloud Run, Cloud Functions, Pub/Sub, Cloud Storage, VPC, IAM, networking, security
  • Expertise in Terraform, Helm, or other IaC tools
  • Experience building CI/CD pipelines (GitHub Actions, GitLab CI, CircleCI, Jenkins, etc.)
  • Strong understanding of containerization and orchestration: Docker, Kubernetes
  • Solid experience with monitoring, observability, and logging stacks
  • Familiarity with networking, load balancing, security hardening, and zero-trust principles
  • Experience supporting production systems in high-availability, distributed environments
  • Strong scripting skills (Python, Bash, or similar)
  • Experience working with agile engineering teams
Job Responsibility:
  • Design, implement, and maintain cloud infrastructure on Google Cloud (GKE, Cloud Run, Cloud Functions, Pub/Sub, Cloud Storage)
  • Build and optimize CI/CD pipelines (GitHub Actions, GitLab CI, Jenkins, or similar)
  • Develop infrastructure-as-code using Terraform or similar tools
  • Set up and maintain container orchestration (Kubernetes, GKE) and automated deployment workflows
  • Implement monitoring, alerting, and observability using tools such as Prometheus, Grafana, ELK/Elastic, Stackdriver, or OpenTelemetry
  • Ensure compliance with security and governance standards across all environments
  • Collaborate closely with engineering teams to ensure scalable, high-performance deployment architectures
  • Support AI/ML and GenAI workloads (Vertex AI pipelines, model hosting, GPU workloads, inference optimization)
  • Manage environment strategies, release pipelines, configuration management, and secrets management
  • Optimize cloud costs and recommend improvements for performance and reliability
What we offer:
  • Flexible working format - remote, office-based or flexible
  • A competitive salary and good compensation package
  • Personalized career growth
  • Professional development tools (mentorship program, tech talks and trainings, centers of excellence, and more)
  • Active tech communities with regular knowledge sharing
  • Education reimbursement
  • Memorable anniversary presents
  • Corporate events and team buildings
  • Other location-specific benefits

Senior AI Engineer

We are seeking an experienced Senior Python Software Engineer (Senior AI Develop...
Location:
Poland, Warsaw
Salary:
Not provided
Inetum
Expiration Date:
Until further notice
Requirements:
  • Degree in Computer Science, Data Science, Artificial Intelligence, or a related field, or equivalent practical experience
  • Several years of experience in AI and Machine Learning development, ideally within Customer Care solutions
  • Strong proficiency in Python and NLP frameworks
  • Hands-on experience with Azure AI services (e.g., Azure Machine Learning, Cognitive Services, Bot Services)
  • Solid understanding of cloud architectures and microservices on Azure
  • Experience with CI/CD pipelines and MLOps
  • Analytical mindset and strong problem-solving capabilities
  • Polish & English speaker
Job Responsibility:
  • Design, develop, and integrate AI/ML solutions, with a particular focus on Generative AI (GenAI), LLMs, and multi-modal (chat, voice) interfaces
  • Architect and deliver customer-facing AI agents that provide real-time, intelligent automation for support, marketing, or transactional use cases
  • Build and maintain multi-model pipelines for inference, fine-tuning, chunking, and embedding-based retrieval (RAG) systems
  • Deploy, monitor, and optimize AI models in production-grade environments using Kubernetes and Azure-native services
  • Integrate GenAI agents with cross-company APIs, backend services, and partner systems through MCP for dynamic tool use and data enrichment
  • Collaborate closely with DevOps engineers to implement scalable CI/CD pipelines, infrastructure-as-code, and secure AI workload automation
  • Evaluate and integrate open-source and proprietary LLMs, embeddings, and vector databases
  • Optimize prompt engineering strategies and implement orchestration tools (e.g., LangChain, MCP) to enable complex task execution
  • Build robust model evaluation frameworks, A/B testing environments, and experiment tracking for iterative development
  • Design privacy-first AI workflows that comply with GDPR, anonymization, and auditability (e.g., PII scrubbing, user consent)
What we offer:
  • Flexible working hours
  • Hybrid work model, allowing employees to divide their time between home and modern offices in key Polish cities
  • A cafeteria system that allows employees to personalize benefits by choosing from a variety of options
  • Generous referral bonuses, offering up to PLN6,000 for referring specialists
  • Additional revenue sharing opportunities for initiating partnerships with new clients
  • Ongoing guidance from a dedicated Team Manager for each employee
  • Tailored technical mentoring from an assigned technical leader, depending on individual expertise and project needs
  • Dedicated team-building budget for online and on-site team events
  • Opportunities to participate in charitable initiatives and local sports programs
  • A supportive and inclusive work culture with an emphasis on diversity and mutual respect
  • Full-time

Engineering Manager, Infrastructure

As an Engineering Manager for the Infrastructure team, you’ll lead the engineers...
Location:
Canada; United States
Salary:
195000.00 - 285000.00 USD / Year
Apollo.io
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on software or infrastructure engineering experience
  • 2+ years of experience leading teams of senior and staff-level engineers in platform, SRE, or infrastructure domains
  • Proven ability to design and operate large-scale distributed systems in cloud environments (preferably GCP or AWS)
  • Expertise with Kubernetes, Docker, Terraform, Ubuntu, and CI/CD pipelines
  • Familiarity with observability tools (Grafana, Prometheus, ELK, Datadog, New Relic) and performance tuning
  • Strong grounding in networking, security, and reliability principles
  • Experience managing infrastructure costs, availability SLAs, and high-throughput systems at scale
Job Responsibility:
  • Lead, coach, and grow a distributed team of high-impact Infrastructure Engineers
  • Partner with senior engineering leadership on strategic initiatives such as cloud migration, infrastructure scaling, platform reliability, and cost efficiency
  • Define and implement modern operational excellence practices, including SLOs, error budgets, incident reviews, and performance monitoring
  • Guide technical decision-making across key areas like Kubernetes, GCP, observability, networking, CI/CD, and IaC (Terraform, Ansible)
  • Collaborate with AI, Data, and Product Engineering teams to ensure infrastructure scalability for ML and AI-native workloads
  • Run effective 1:1s, career development conversations, and quarterly performance reviews
  • Support recruiting efforts to attract top engineering talent across time zones
What we offer:
  • Equity
  • Company bonus or sales commissions/bonuses
  • 401(k) plan
  • At least 10 paid holidays per year
  • Flex PTO
  • Parental leave
  • Employee assistance program and wellbeing benefits
  • Global travel coverage
  • Life/AD&D/STD/LTD insurance
  • FSA/HSA and medical, dental, and vision benefits
  • Full-time

Senior Software Development Engineer

CVS Health is transforming digital healthcare by building cohesive, scalable pla...
Location:
United States
Salary:
92700.00 - 222480.00 USD / Year
CVS Health
Expiration Date:
January 30, 2026
Requirements:
  • A minimum of 5 years of experience as a multi-faceted Software Engineer, DevOps Engineer or similar role delivering automated infrastructure solutions at an enterprise scale
  • 3+ years of experience in one or more programming languages such as Python or Java, coupled with strong experience with infrastructure as code (IaC), specifically Terraform and/or Ansible
  • 2+ years of experience using and administering build automation, continuous integration, and source code management CI/CD tools (GitHub Actions, Octopus, Jenkins, etc.)
  • 2+ years of experience with cloud platforms (Azure, GCP, AWS) as well as on-prem environments (VMware)
  • 2+ years of experience with API development
  • Bachelor's degree or equivalent experience (HS diploma + 4 years relevant experience)
Job Responsibility:
  • Design, build, and deploy infrastructure and platform capabilities for public, private, and hybrid cloud environments
  • Drive the creation of automation to improve delivery and speed-to-market
  • Collaborate with security, architecture, and engineering teams to establish and enforce standards and best practices
  • Partner with product and program teams to define roadmaps and execution plans
  • Build APIs to orchestrate automation exposing self-service capabilities to improve customer experiences
  • Drive initiatives to incorporate open-source tools, cloud native services, and AI to increase efficiency
What we offer:
  • Affordable medical plan options
  • 401(k) plan with matching company contributions
  • Employee stock purchase plan
  • No-cost wellness screenings
  • Tobacco cessation and weight management programs
  • Confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Full-time