HPC Operations Engineering Manager

Microsoft Corporation

Location:
United States, Mountain View

Contract Type:
Not provided

Salary:

139900.00 - 274800.00 USD / Year

Job Description:

Microsoft AI is seeking an experienced HPC Operations Engineering Manager to join our Infrastructure Team. In this role, you’ll lead a team of Site Reliability Engineers who blend software engineering and systems engineering to keep our large-scale distributed AI infrastructure reliable and efficient. You’ll work closely with ML researchers, data engineers, and product developers to design and operate the platforms that power training, fine-tuning, and serving generative AI models.

Job Responsibility:

  • Team leadership: Lead a team of experienced SREs to ensure uptime, resiliency and fault tolerance of AI model training and inference systems
  • Observability: Design and help maintain monitoring, alerting, and logging systems to provide real-time visibility into model serving pipelines and infrastructure
  • Automation & Tooling: Lead building of automation for deployments, incident response, scaling, and failover in hybrid cloud/on-prem CPU+GPU environments
  • Incident Management: Lead on-call rotations, troubleshoot production issues, conduct blameless postmortems, and drive continuous improvements
  • Security & Compliance: Ensure data privacy, compliance, and secure operations across model training and serving environments
  • Collaboration: Partner with ML engineers and platform teams to improve developer experience and accelerate research-to-production workflows

Requirements:

  • Bachelor's Degree in Computer Science or related technical field AND 8+ years technical engineering experience in Site Reliability Engineering, DevOps, or Infrastructure Engineering Leadership roles AND 8+ years experience with Kubernetes, Docker, and container orchestration AND 8+ years experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code AND 6+ years experience with programming/scripting languages including, but not limited to, Python, Go, or Bash
  • OR equivalent experience
  • Master's Degree in Computer Science or related technical field AND 12+ years technical engineering experience AND 10+ years experience with Kubernetes, Docker, and container orchestration AND 10+ years experience with public cloud platforms like Azure/AWS/GCP and infrastructure-as-code
  • OR equivalent experience
  • 6+ years people management experience
  • Experience in monitoring & observability tools (Grafana, Datadog, OpenTelemetry, etc.)
  • Experience running large-scale GPU clusters for ML/AI workloads
  • Experience with high-performance computing (HPC) and workload schedulers (Kubernetes operators)
  • Knowledge of CI/CD pipelines for inference and ML model deployment
  • Solid knowledge of distributed systems, networking, and storage
  • Familiarity with ML training/inference pipelines
  • Background in capacity planning & cost optimization for GPU-heavy environments

Additional Information:

Job Posted:
February 16, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for HPC Operations Engineering Manager

HPC & AI System Test Engineering Manager

Manages a team of systems engineers for high-performance computing (HPC) server ...
Location:
United States, Chippewa Falls
Salary:
137000.00 - 315000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • First level university degree or equivalent experience required
  • May have advanced university degree
  • Typically 5 or more years of related work experience, including 0-2 years of people management experience
  • Strong leadership skills, including coaching, team building, and conflict resolution
  • Advanced project management skills including time and risk management, resource prioritization, and project structuring
  • Strong analytical and problem-solving skills
  • Ability to manage human capital across geographies to drive workforce development and achieve desired results
  • Strong verbal and written communication skills, including negotiation, presentation, and influence skills
  • Advanced business acumen, technical knowledge, and extensive knowledge in applications and technologies
  • Strong multi-tasking and prioritization skills
Job Responsibility:
  • Provides direct and ongoing leadership for a team of individual contributors testing and validating new products, enhancements and updates
  • Manages headcount, deliverables, schedules, and costs for multiple ongoing projects
  • Communicates project status and escalates issues to direct managers, program managers, and internal and external development partners
  • Manages relationships with outsourced partners and suppliers
  • Proactively identifies opportunities for process improvement and cost reduction
  • Provides people-care management for assigned team members, including hiring, setting and monitoring of annual performance plans, coaching, and career development
  • Coordinates with third-party product vendors and engineering managers to track development issues and implement solutions
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Fulltime

HPC & AI System Test Engineering Manager

The HPC Integrated Systems Test (IST) team is seeking a Systems Engineering Mana...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • First level university degree or equivalent experience required
  • May have advanced university degree
  • Typically 5 or more years of related work experience, including 0-2 years of people management experience
  • Strong leadership skills, including coaching, team building, and conflict resolution
  • Advanced project management skills including time and risk management, resource prioritization, and project structuring
  • Strong analytical and problem-solving skills
  • Ability to manage human capital across geographies to drive workforce development and achieve desired results
  • Strong verbal and written communication skills, including negotiation, presentation, and influence skills
  • Advanced business acumen, technical knowledge, and extensive knowledge in applications and technologies
  • Strong multi-tasking and prioritization skills
Job Responsibility:
  • Provides direct and ongoing leadership for a team of individual contributors testing and validating new products, enhancements and updates
  • Coordinates projects for systems software, including operating systems, networking, utilities, and Internet-related tools
  • Manages headcount, deliverables, schedules, and costs for multiple ongoing projects
  • Communicates project status and escalates issues to direct managers, program managers, and internal and external development partners
  • Manages relationships with outsourced partners and suppliers
  • Proactively identifies opportunities for process improvement and cost reduction
  • Provides people-care management for assigned team members, including hiring, setting and monitoring of annual performance plans, coaching, and career development
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Senior HPC Deployment Engineer

As a High Performance Computer (HPC) Solution Installation and Deployment Engine...
Location:
Australia, Melbourne
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Proven experience in installing, configuring, and deploying HPC systems
  • Strong knowledge of HPC architectures, parallel computing, and cluster management
  • Proficiency in Linux/Unix operating systems
  • Experience with HPC software tools and libraries (e.g., MPI, OpenMP, SLURM, Torque)
  • Familiarity with high-speed networking technologies (e.g., InfiniBand, Ethernet)
  • Excellent problem-solving skills and attention to detail
  • Strong communication and interpersonal skills
  • Ability to work independently and as part of a team
  • Certifications in relevant technologies (e.g., Red Hat Certified Engineer, Certified HPC Professional)
  • Experience with cloud-based HPC solutions
Job Responsibility:
  • Install and configure HPC hardware and software components, including servers, storage, and networking equipment
  • Set up and manage high-speed interconnects (e.g., InfiniBand, Ethernet)
  • Deploy operating systems, cluster management software, and parallel file systems
  • Coordinate with clients and project managers to understand deployment requirements and timelines
  • Implement and document HPC deployment processes and best practices
  • Perform system testing and validation to ensure optimal performance and reliability
  • Provide technical support to clients during the installation and deployment phases
  • Conduct training sessions for clients on HPC system usage and maintenance
  • Develop and maintain user documentation and guides
  • Monitor and analyze system performance to identify and resolve bottlenecks
What we offer:
  • Comprehensive suite of benefits supporting physical, financial, and emotional wellbeing
  • Specific programs for personal and professional development
  • Inclusion and flexibility to manage work and personal needs
  • Fulltime

HPC Systems/Software Engineer

HPC Systems/Software Engineer needs to understand cluster concepts and required ...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 6+ years experience
  • Expertise in multiple software systems design tools and languages
  • Strong analytical and problem-solving skills
  • Designing software systems running on multiple platform types
  • Very good systems knowledge, including hardware, firmware, and operating systems
  • Linux systems knowledge with Python and other languages
  • Good understanding of network boot technologies (PXE, gPXE/Etherboot, etc.)
  • Storage-specific knowledge: LVM, RAID, iSCSI, disk partitioning (GPT, MBR)
  • Exposure to the open-source community and open-source software
Job Responsibility:
  • Designs enhancements, updates, and programming changes for portions and subsystems of systems software
  • Analyzes design and determines coding, programming, and integration activities required
  • Writes and executes complete testing plans, protocols, and documentation
  • Leads a project team of other software systems engineers
  • Collaborates and communicates with management and development partners
  • Represents the software systems engineering team for all phases of development projects
  • Provides guidance and mentoring to less-experienced staff members
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Fulltime

HPC & AI Systems Engineer for Integrated Systems Test

HPC & AI Systems Engineer for Integrated Systems Test role at Hewlett Packard En...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or master's degree in Computer Engineering, Computer Science, Electrical Engineering, Information Systems, or equivalent
  • Minimum 4 years of experience
  • Experience with certification and submission to OS vendors for Linux (Red Hat, SLES, Ubuntu, etc.), Windows Server operating systems, Windows Client operating systems, and VMware (ESXi)
  • Experience installing and working with Linux, Windows, and VMware operating systems
  • Experience with programming or scripting languages and databases (Python, PowerShell, Perl, Linux Shell, Java, MySQL, MS SQL Server)
  • Understanding of Redfish commands, RESTful API, and JSON format
  • Knowledge of creating and using Docker containers and VMs
  • Experience in configuring storage (internal/external storage, file systems, and RAID/non-RAID settings) and networking devices (iSCSI, FCoE, IPs, VLANs, bonding, jumbo frames, LAGs)
  • Knowledge of networking concepts such as NIC teaming, VLANs, IPv4, IPv6
  • Excellent written and verbal communication skills in English
Job Responsibility:
  • Work with Program & Product Management, technical leads, and product development teams to obtain product feature requirements
  • Design and implement new test features in existing and new test cases
  • Analyze, debug and provide feedback/resolution on issues uncovered by test team prior to submission of results to OS vendors for approval
  • Implement software solutions for multiple test programs/projects with internal and outsourced development partners
  • Review and evaluate the implementation and use of test automation and test tools
  • Plan, develop, and implement software tools for the testing and evaluation of current and next-generation HPE HPC products
  • Debug and analyze issues to a successful resolution
  • Perform testing in local and remote labs
  • Drive appropriate automated test execution to test engineers at various global locations
  • Provide training and guidance to test teams both onshore and offshore
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits that supports physical, financial and emotional wellbeing
  • Fulltime

HPC & AI System Test Engineer

Our organization includes high-performance computing (HPC) server platforms, net...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Systems Engineering, or equivalent
  • Typically 4-6 years experience
  • Possess experience with XD, Apollo, Industry Standard Server, Storage, and Networking products
  • Have experience with Linux Operating Systems (OS) such as Ubuntu, RHEL and SUSE
  • Excellent understanding of testing methodologies
  • Excellent understanding of hardware and software interactions
  • Excellent analytical and problem-solving skills
  • Experience in the overall architecture of software and hardware for products and solutions
  • Strong analytical and problem-solving skills
  • Knowledge of a programming or scripting language (Python, Perl, Linux Shell)
Job Responsibility:
  • Work with Program & Product Management teams to understand test requirements
  • Debug and troubleshoot issues with various teams
  • Work with cross-functional teams to deliver quality HPC systems
  • Work with 3rd party product vendors and engineering teams to track development issues and solutions
  • Demonstrate the ability to effectively manage diverse test tasks and priorities in a fast-paced fluid environment
  • Effectively respond to changing program requirements, changes to product test plans and compressed schedules while meeting program development requirements
  • Work with product development teams to understand new product features required for test programs/projects, work with technical leads and testers to design and develop appropriate test plans
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

HPC & AI System Test Engineer

The HPC Integrated Systems Test (IST) team is seeking early career and new gradu...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • 0-4 years experience
  • Experience with Industry Standard Server, Storage, and Networking products
  • Experience with Linux Operating Systems (OS) such as Ubuntu, RHEL and SUSE
  • Understanding of testing methodologies
  • Understanding of hardware and software interactions
  • Analytical and problem-solving skills
  • Ability to perform testing in local and remote labs
  • Experience in the overall architecture of software and hardware for products and solutions
  • Knowledge of a programming or scripting language (Python, Perl, Linux Shell)
Job Responsibility:
  • Work with IST technical leads and program managers to understand test requirements, design and develop appropriate test plans, execute test plans, debug and troubleshoot issues
  • Work with various cross-functional teams and the product development teams to understand new product features required for test programs/projects to deliver quality HPC systems
  • Work with 3rd party product vendors and engineering teams to track development issues and solutions
  • Demonstrate the ability to effectively manage diverse test tasks and priorities in a fast-paced fluid environment
  • Effectively respond to changing program requirements, changes to product test plans and compressed schedules while meeting program development requirements
  • Analyze, debug and provide feedback/resolution on issues uncovered by test team prior to submission of results to OS vendors for approval
  • Review and evaluate the implementation and use of test automation and test tools
  • Ensure development issues are resolved in a cost-effective, efficient, and timely manner
What we offer:
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Specific programs for career development
  • Unconditional inclusiveness aligned with individual uniqueness
  • Fulltime

HPC & AI System Test Engineer

Hewlett Packard Enterprise is seeking early career professionals with a backgrou...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Information Systems, or equivalent
  • Typically 0-4 years experience
  • Possess experience with Industry Standard Server, Storage, and Networking products
  • Have experience with Linux Operating Systems (OS) such as Ubuntu, RHEL and SUSE
  • Have an understanding of testing methodologies
  • Have an understanding of hardware and software interactions
  • Have analytical and problem-solving skills
  • Perform testing in local and remote labs
  • Experience in the overall architecture of software and hardware for products and solutions
  • Strong analytical and problem-solving skills
Job Responsibility:
  • Work with IST technical leads and program managers to understand test requirements, design and develop appropriate test plans, execute test plans, debug and troubleshoot issues
  • Work with various cross-functional teams and the product development teams to understand new product features required for test programs/projects to deliver quality HPC systems
  • Work with 3rd party product vendors and engineering teams to track development issues and solutions
  • Demonstrate the ability to effectively manage diverse test tasks and priorities in a fast-paced fluid environment
  • Effectively respond to changing program requirements, changes to product test plans and compressed schedules while meeting program development requirements
  • Analyze, debug and provide feedback/resolution on issues uncovered by test team prior to submission of results to OS vendors for approval
  • Review and evaluate the implementation and use of test automation and test tools
  • Ensure development issues are resolved in a cost-effective, efficient, and timely manner
  • Debug and analyze issues to a successful resolution
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime