Principal Supercomputing Operations Engineering Manager

Microsoft Corporation

Location:
United States, Multiple Locations

Contract Type:
Not provided

Salary:

139900.00 - 274800.00 USD / Year

Job Description:

Microsoft Azure’s Artificial Intelligence and High Performance Computing (AI/HPC) organization powers some of the world’s largest cloud-native supercomputers used for frontier AI training, scientific computing, and large-scale distributed simulations. Our team builds and operates hyperscale GPU clusters that consistently place Azure among global leaders in the Top500, MLPerf, and Graph500 benchmarks. By joining us, you step into the engineering core responsible for ensuring these systems remain reliable, performant, and ready for the next wave of AI innovation.
At this scale, interconnect fabrics are a first-order reliability system that directly determines GPU availability, training throughput, and customer SLAs. As a Principal Supercomputing Operations Engineering Manager, you own the operational strategy and organizational execution for interconnect fabric reliability across flagship AI supercomputing environments. You lead teams that operate InfiniBand and GPU interconnect fabrics as a single end-to-end reliability domain, defining how they are operated, debugged, hardened, and scaled in production.
This is a hands-on technical leadership role combined with people and operational management. You are accountable not only for technical outcomes, but for building and leading high-performing engineering teams that consistently deliver availability, correctness, and resilience under extreme scale and ambiguity. You set expectations, drive execution through others, and ensure your team is prepared to respond decisively to the most complex production failures.
You lead and oversee the most severe fabric-related incidents, guiding technical direction, escalation strategy, and risk trade-offs while empowering senior engineers to execute deep investigations. Beyond incident response, you define operational strategy, reliability models, and systemic prevention mechanisms that reduce recurrence at fleet scale.
Your impact multiplies through organizational leadership: developing talent, setting operational standards, influencing engineering direction across organizations, and partnering deeply with platform, hardware, firmware, and service teams to deliver durable reliability improvements. You are responsible for ensuring that your organization produces high-quality automation, diagnostics, telemetry, playbooks, and escalation models that materially improve operability and debuggability across the platform. Through your leadership, judgment, and technical direction, Azure’s largest AI supercomputing platforms scale safely, predictably, and sustainably to meet the demands of next-generation AI workloads.

Job Responsibility:

  • Own and drive the end-to-end operational strategy for InfiniBand and GPU interconnect fabric reliability across large-scale AI supercomputing environments, ensuring sustained GPU availability, training stability, and SLA compliance
  • Lead, manage, and grow a team of senior and principal engineers responsible for fabric operations, setting clear expectations, developing talent, and holding the organization accountable for outcomes
  • Provide senior technical leadership and executive decision-making during high-severity fabric incidents, guiding investigation strategy, escalation paths, and risk trade-offs while ensuring effective execution through the team
  • Ensure consistent, high-quality incident response, root cause analysis, and post-incident follow-through across the organization, with a strong emphasis on systemic prevention over one-off fixes
  • Drive operational excellence by defining reliability models, failure domains, and long-term corrective strategies, and ensuring adoption of authoritative TSGs, playbooks, and escalation frameworks
  • Partner deeply with platform, hardware, firmware, and service teams to align roadmaps, influence design decisions, and close systemic reliability gaps impacting interconnect fabrics at scale
  • Sponsor and prioritize automation, telemetry, diagnostics, and tooling investments that improve detection, observability, debuggability, and time to mitigation across the fleet

Requirements:

  • Bachelor's Degree in Computer Science or related technical discipline AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screening: Microsoft Cloud Background Check. This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.

Nice to have:

  • Bachelor's Degree in Computer Science OR related technical field AND 10+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, OR Python
  • OR Master's Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience.
  • 4+ years people management experience.
  • 6+ years of experience operating large-scale distributed systems, high-performance computing (HPC), or artificial intelligence (AI) infrastructure in production environments
  • Demonstrated experience leading engineering teams responsible for mission-critical production infrastructure with direct impact on service availability, GPU workloads, and customer SLAs
  • Strong hands-on background in operating and debugging interconnect fabrics or similarly complex infrastructure supporting large-scale compute workloads
  • Solid Linux systems knowledge with experience reasoning across operating systems, drivers, services, and hardware layers
  • Proven ability to make high-impact technical and organizational decisions under ambiguity while balancing availability, risk, long-term correctness, and business impact

Additional Information:

Job Posted:
March 01, 2026

Employment Type:
Full-time
Work Type:
Remote work