Senior Technical Program Manager – AI Infrastructure, Site Operations

Cerebras Systems

Location:
United States, Sunnyvale

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.

This Sr. TPM role owns site and data center operations programs supporting Cerebras’ AI Cloud and customer deployments. The position sits at Sunnyvale HQ and works closely with Hardware Engineering, Inference Engineering, and Operations leadership to ensure Cerebras systems are reliably deployed, operated, and scaled. This is a highly technical, execution-focused TPM role with strong emphasis on operational readiness, cross-functional coordination, and metrics/KPIs.

Job Responsibility:

  • Own end-to-end technical programs for data center and site operations
  • Act as single-threaded owner across Hardware & Systems Engineering; AI Cloud Infrastructure & Operations; Network & Storage Engineering; and facilities, power, cooling, and colo partners
  • Drive site readiness for Cerebras Wafer-Scale Engine systems
  • Partner on installation, commissioning, change management, and break/fix workflows
  • Lead incident reviews and postmortems; ensure corrective actions are closed
  • Define and own operational metrics and KPIs, including availability and reliability; incident rate, severity, and MTTR / MTTD; deployment readiness and time-to-service; and capacity and operational risk
  • Build executive-level dashboards and reporting
  • Establish program governance, risk tracking, and RACI clarity
  • Present program status, metrics, and operational risks to senior leadership

Requirements:

  • 8+ years in Technical Program Management, Infrastructure Ops, or Data Center Ops
  • Experience leading large, cross-functional infrastructure programs
  • Strong understanding of data center power and cooling fundamentals, network and storage basics, and hardware-centric platforms
  • Proven ability to define and operationalize metrics
  • Strong written and executive-level communication skills

Nice to have:

  • AI/ML, HPC, or accelerator-based infrastructure
  • High-density and/or liquid-cooled data centers
  • Working with colocation providers and facilities teams
  • Incident management, reliability, or service operations background

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Enjoy our simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Work Type:
On-site work

Similar Jobs for Senior Technical Program Manager – AI Infrastructure, Site Operations

Technical Program Manager, AI Infrastructure

Be part of the team that builds and operates the world's fastest AI infrastructu...
Location:
United States, Sunnyvale
Salary:
Not provided
Company:
Cerebras Systems
Expiration Date:
Until further notice
Requirements:
  • Experience leading large, cross-functional infrastructure programs
  • Experience with AI/ML, HPC, or accelerator-based infrastructure
  • Strong understanding of data center power and cooling fundamentals
  • Experience installing and managing network, storage, and compute devices
  • Proven ability to define and operationalize metrics
  • Strong written and executive-level communication skills
  • Experience working with colocation providers and facilities teams
  • Background in incident management, reliability, or service operations
Job Responsibility:
  • Own end-to-end technical programs for multiple data center buildouts, coordinating with partners, contractors, and internal teams
  • Drive facility site readiness for power and cooling for Cerebras Wafer-Scale Engine systems
  • Coordinate equipment delivery and manage vendor accountability for schedules and quality related to rack integration and inter-rack cabling
  • Act as the single-threaded owner across internal partners: Hardware & Systems Engineering, Network & Storage Engineering, AI Cloud Infrastructure & Operations
  • Enforce handover criteria between site completion, equipment deployment, and operations
  • Own overall schedule tracking, risk identification, and mitigation, creating clear visibility for leadership
  • Establish program governance, risk tracking, and RACI clarity
  • Present program status, metrics, and operational risks to senior leadership
  • Drive partner accountability on contractual milestones and commercial commitments
  • Document repeatable processes and implement them to scale across future data centers
What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Enjoy our simple, non-corporate work culture that respects individual beliefs

Senior Technical Program Manager - Datacenter Infrastructure

The Datacenter leasing Senior Technical Program Manager will be part of a team r...
Location:
Singapore, Singapore
Salary:
Not provided
Company:
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Civil, Electrical, Mechanical, Telecom Engineering, or related technical field AND 4+ years’ experience in engineering, operations, commissioning or technical program management
  • 3+ years’ experience managing cross functional and/or cross-team projects
  • 3+ years of experience in data center design, infrastructure, and critical environments
  • Broad infrastructure knowledge across mechanical, electrical, and controls systems with a focus on Datacenter integration and performance
  • Familiarity with key industry standards and best practices, including ASHRAE, Uptime Institute, ANSI, and NFPA
  • Familiarity with high-density power and cooling solutions, sustainability initiatives, and emerging technologies for AI workloads
  • Ability to meet Microsoft, customer and/or government security screening requirements
Job Responsibility:
  • Act as a Subject Matter Expert (SME) and provide global program support
  • Drive technical solutions for leased datacenters in partnership with Microsoft’s and Lessor’s core engineering teams
  • Evaluate lessor’s design proposal against technical requirements and mitigate non-compliance through technical and commercial solutions
  • Assess lessor’s compliance through review of technical documents, site assessments, and stakeholder engagement
  • Partner with internal and external stakeholders during construction, RFS, and operations handover to unblock any technical issues risking the on-time delivery of Datacenter to customers
  • Drive cost impact analysis on non-compliance and specification changes. Escalate and provide visibility and feedback to leadership on cost drivers
  • Partner with Microsoft Engineering, Integration, Security, Operations, and Energy teams on resolution management
  • Drive partner accountability on contractual milestones and commercial commitments
  • Own overall schedule tracking, risk identification, blockers, and mitigation for the assigned projects, creating clear visibility for leadership
Job Type:
Full-time

Engineering Director

We are seeking a seasoned Engineering Director who thrives in challenging and fa...
Location:
Puerto Rico, Aguadilla
Salary:
Not provided
Company:
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Significant work experience as a director or in a similar position working across multiple stakeholder organizations, with at least 10 years of people leadership experience specific to SW and Cloud engineering
  • Solid experience leading SW development across storage, networking, on-prem, and SaaS is a must
  • Experience in setting up geographically distributed sites
  • Must have a strong background in software development lifecycle including cloud infrastructure
  • Familiarity with agile methodologies and tools like JIRA
  • Prior experience in cloud product development and deployments, with end-to-end ownership and accountability
  • Solid understanding of fundamental AI and machine learning concepts, including supervised and unsupervised learning, deep learning, reinforcement learning, natural language processing, computer vision, and statistical modeling
  • Extensive business acumen, technical knowledge, and industry experience encompassing one or more engineering, technology, and product domains
  • Demonstrated abilities to drive transformation across a business with exceptional skills in the management of change
Job Responsibility:
  • Oversee the Puerto Rico Site daily operations, strategic planning and cross-functional team leadership for Hybrid Cloud
  • Recruit, mentor, and manage teams of AI/ML engineers, QA Engineers, Design Engineers and innovation specialists to deliver cutting-edge solutions
  • Continuously evaluate new tools, platforms, and frameworks in AI/ML to drive competitive advantage and operational efficiency
  • Ensure alignment with corporate goals while fostering a high-performance culture, operational efficiency, and employee engagement
  • Lead the development and execution of AI/ML strategies that align with business goals and drive innovation across products, services, or operations
  • Create strategic and tactical operations and resource plans, goals, and priorities for assigned organization based on business and technology roadmap and functional objectives
  • Engage with various senior leaders across the organization, program managers, R&D, support, Quality, product managers, technical leaders and executives to communicate program status, escalate issues, and guide and influence strategic decision-making
  • Manage senior relationships and escalated issues with outsourced partners and suppliers, including setting expectations regarding deliverables, product quality, schedules, and costs; ensure that the organization is effectively leveraging outsourced resources
  • Identify opportunities for and drive organizational initiatives and programs to support business process improvements and cost reductions
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Job Type:
Full-time

Senior Director, Critical Environments (Lab Operations)

We are seeking an industry veteran to serve as the Senior Director, Critical Env...
Location:
Taiwan, New Taipei City
Salary:
Not provided
Company:
JLL
Expiration Date:
Until further notice
Requirements:
  • 20+ years of progressive experience in Critical Environments (Data Centers, Semiconductor, Pharma, or R&D Labs), covering operations, engineering, planning, and innovation
  • 15+ years of direct people management experience, specifically leading large technical teams (50-100+ staff) and 'managing managers' in a multi-site, matrixed environment
  • Bachelor’s degree in Engineering (Mechanical/Electrical), Facilities Management, or a related technical field is required
  • A Master’s degree or MBA is highly preferred
  • Professional Engineer (PE), Certified Facility Manager (CFM), or PMP is preferred
Job Responsibility:
  • Executive Leadership & Organizational Strategy: Manage and mentor a high-performing organization of 100+ staff members through direct supervision of five specialized Directors
  • Foster a 'No Ego' culture of accountability and collaboration across diverse teams
  • Serve as the primary strategic partner to senior client stakeholders
  • Present complex technical and data concepts as clear business strategies to the C-Suite
  • Define the competency requirements and training standards for the entire critical environments organization
  • Operational Resilience & 24/7 Command: Oversee the Director of Critical Operations and Senior Director of Engineering & Ops Center to ensure 100% uptime in critical operations
  • Serve as the ultimate escalation point for major incidents
  • Lead executive communication, mitigation strategy, and systemic Root Cause Analysis (RCA)
  • Direct the strategy of the 24/7 Operations Center
  • Technical Governance & Engineering Excellence: Oversee comprehensive design reviews for MEP (Mechanical, Electrical, Plumbing) topology
Job Type:
Full-time

Senior Site Reliability Engineer

Zuora’s Cloud Engineering organization owns the reliability, scalability, and op...
Location:
India, Chennai
Salary:
Not provided
Company:
Zuora
Expiration Date:
Until further notice
Requirements:
  • 8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations
  • Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch
  • Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance
  • Strong programming and automation skills using Python and Shell; experience building production-grade automation systems
  • Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting
  • Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments
  • Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact
  • Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale
  • Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering
Job Responsibility:
  • Reliability Architecture & Platform Strategy: Own and evolve the reliability architecture of large-scale, distributed SaaS systems by defining SLOs, SLIs, error budgets, and resilience patterns aligned with business objectives
  • AI-Driven Automation & Intelligent Operations: Design, build, and operationalize AI-powered automation to reduce operational toil and improve system stability
  • Advanced Cloud & Infrastructure Engineering: Lead the design and operation of complex AWS-based infrastructure and Kubernetes platforms, optimizing for availability, security, and cost efficiency
  • Incident Leadership & Operational Excellence: Act as a technical leader during high-severity production incidents, driving structured response, decision-making, and recovery
  • Technical Leadership & Cross-Functional Influence: Influence reliability outcomes beyond the SRE team by partnering closely with Engineering, Product, and Security stakeholders
What we offer:
  • Competitive compensation, variable bonus and performance reward opportunities, and retirement programs
  • Medical Insurance
  • Generous, flexible time off
  • Paid holidays, “wellness” days and company wide end of year break
  • 6 months fully paid parental leave
  • Learning & Development stipend
  • Opportunities to volunteer and give back, including charitable donation match
  • Free resources and support for your mental wellbeing
Job Type:
Full-time

Senior Backend Software Engineer, Cloud Management

We are seeking talented Senior Software Engineers to design, build, and scale Cr...
Location:
United States, San Francisco; Sunnyvale
Salary:
175,000 - 210,000 USD / Year
Company:
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 5+ years of software development experience
  • Programming with modern compiled languages such as Go, Rust, Java, or C++
  • Proven ability to design and scale fault-tolerant distributed systems and develop managed cloud services
  • Strong fundamentals in data structures, algorithms, microservices, and infrastructure tools like Docker, Kubernetes, Terraform, and CI/CD systems
  • Ability to work with cross-functional teams to align priorities and deliver customer-first solutions
  • Experience guiding engineers, improving hiring and onboarding processes, and driving team growth
  • Exceptional ability to articulate complex ideas and align technical solutions with customer needs
  • Customer-Centric Mindset
  • Any experience building out infrastructure tooling is a plus
Job Responsibility:
  • Design, develop, and maintain scalable and reliable services that power our cloud platform’s user-facing experiences
  • Collaborate with cross-functional teams, like product and design, to evaluate tools, frameworks, and customer needs, creating innovative solutions
  • Design and build backend systems that underpin our cloud platform, covering everything from authentication flows to scalable, reliable access to infrastructure resources
  • Contribute to architectural decisions that support reliability and maintainability across the company
  • Mentor engineers, enhance hiring practices, and contribute to building a strong, inclusive engineering culture
  • Build scalable, reliable cloud services, such as user access management, Gateways, user features, and notification systems, tailored to customer needs
  • Partner with customer success and operations teams to create intuitive tools that enhance the end-user experience
  • Develop automation software that simplifies infrastructure deployment and management for seamless customer operations
  • Implement features that differentiate Crusoe Cloud, focusing on operational efficiency, low-touch adoption, turn-key AI services and scalability
  • Work closely with cloud support, engineering, and site reliability teams to align technical solutions with customer feedback and operational goals
What we offer:
  • Restricted Stock Units
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Job Type:
Full-time

Senior Software Engineer, Managed Services

Crusoe's mission is to accelerate the abundance of energy and intelligence. We’r...
Location:
United States, San Francisco; Sunnyvale
Salary:
166,000 - 201,000 USD / Year
Company:
Crusoe
Expiration Date:
Until further notice
Requirements:
  • Cloud Expertise: Proven ability to design and scale fault-tolerant distributed systems and develop managed cloud services
  • Technical Proficiency: Strong fundamentals in microservices and infrastructure technologies like Docker, Kubernetes, Terraform, and CI/CD systems. Experience with observability principles and technologies, e.g., time-series databases, log aggregation, distributed tracing
  • Customer-Centric Mindset: A passion for creating intuitive, high-quality solutions that directly impact customer success and satisfaction
  • Collaboration Skills: Ability to work with cross-functional teams to align priorities and deliver customer-first solutions
  • Communication Skills: Exceptional ability to articulate complex ideas and align technical solutions with customer needs
  • Team Leadership: Mentor engineers, enhance hiring practices, and contribute to building a strong, inclusive engineering culture
  • Professional Experience: 3-5 years of software development experience, including programming with modern compiled languages such as Go, Rust, Java, or C++
Job Responsibility:
  • Building Foundational Infrastructure: Build and scale core infrastructure services that manage critical resources within our cloud platform. This involves designing, developing, and deploying robust and reliable systems from the ground up
  • Scalable Design: Design highly scalable, durable, and reliable platform services that prioritize ease of use
  • Cross Functional Collaboration: Lead projects that require collaborating with engineering, cloud support, site reliability, and product teams to assess tools, frameworks, and solutions that align with both customer and operational needs
  • Innovation: Implement features that differentiate Crusoe Cloud, focusing on operational efficiency, low-touch adoption, turn-key AI services, and scalability
What we offer:
  • Industry competitive pay
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
Job Type:
Full-time

Vice President, Venue Technology

The Vice President, Venue Technology is a visionary and execution-focused leader...
Location:
United States, Frisco
Salary:
Not provided
Company:
Legends Global
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Information Technology, Engineering, Computer Science, or a related technical discipline required
  • Master’s degree or MBA strongly preferred
  • Minimum of 15 years of progressive experience in enterprise or venue technology leadership, with 7+ years in executive-level roles overseeing large-scale, multi-site operations
  • Proven success in leading technology transformation initiatives across sports, entertainment, hospitality, or other high-volume guest-facing industries
  • Experience managing global or national portfolios of venues, with demonstrated ability to scale technology operations and standardize platforms across diverse environments
  • Track record of delivering complex capital projects involving infrastructure modernization, digital innovation, and cross-functional stakeholder alignment
  • Deep expertise in venue technology ecosystems, including:
      • Networking: Enterprise-grade LAN/WAN/Wi-Fi, DAS, 5G, SD-WAN (Cisco, Aruba, Extreme)
      • AV/Broadcast: Control rooms, IPTV, digital signage, live production systems (QSYS, Ross, Evertz)
      • Compute & Storage: Hybrid cloud, edge computing, virtualization (VxRail, Nutanix, VMware)
      • Security & Access: Physical security, surveillance, access control, Zero Trust (Genetec, Avigilon)
Job Responsibility:
  • Develop and execute a multi-year venue technology roadmap aligned with Legends Global’s business strategy, operational priorities, and guest experience goals
  • Serve as the executive sponsor for venue technology innovation, advising senior leadership on emerging trends, investment opportunities, and competitive differentiation
  • Champion enterprise-wide initiatives such as smart venue platforms, digital twin technologies, and AI-driven operational intelligence
  • Oversee venue technology budgets
  • Partner on new venue construction and major renovation projects
  • Oversee the design, deployment, and lifecycle management of mission-critical systems including:
      • Network infrastructure (LAN/WAN/Wi-Fi, DAS, 5G)
      • AV and broadcast systems (IPTV, control rooms, digital signage)
      • Compute and storage environments (hybrid cloud, edge computing)
      • Venue IT operations (access control, ticketing, incident management)
  • Establish and enforce enterprise standards for technology architecture, cybersecurity, scalability, and interoperability across all venues
What we offer:
  • Medical, dental, vision, life and disability insurance, paid vacation, and 401(k) plan
Job Type:
Full-time