Performance Reliability Engineer

Cerebras Systems

Location:
United States; Canada, Sunnyvale

Contract Type:
Not provided

Salary:

Not provided

Job Description:

Join Cerebras as a Performance & Reliability Engineer within our innovative Co-Design and Next Generation Team. Our groundbreaking CS-3 system has set new benchmarks in high-performance ML training and inference. It uses a dinner-plate-sized chip with 44 GB of on-chip memory to surpass traditional hardware capabilities. This role focuses on characterizing and optimizing the performance and reliability of state-of-the-art AI models running on Cerebras' breakthrough hardware.

Job Responsibility:

  • Characterize and enhance the performance and reliability of advanced ML hardware/software systems, with emphasis on reducing power and thermal fluctuations
  • Analyze ML workloads, software kernels, and hardware architecture for power and performance impacts, and synthesize high-level insights across these layers
  • Develop creative software solutions to improve reliability and performance, collaborating cross-functionally to deploy these solutions in production
  • Influence the design of Cerebras' next-generation AI architecture and software stack through rigorous workload analysis and computational efficiency optimization
  • Partner with ML engineers, researchers, and reliability specialists to understand model behavior and drive system-level improvements from a software perspective
  • Collaborate with teams in architecture, silicon, and research to advance our computational platforms and influence future system designs

Requirements:

  • BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field
  • 3+ years of relevant experience in performance engineering, reliability, computer architecture, and/or software design
  • Proficiency in Python or other scripting languages
  • Experience with C/C++ and assembly programming
  • Demonstrated expertise with system-level performance and reliability optimization
  • Strong verbal and written communication skills

Nice to have:

  • Hands-on experience with ML models, ML frameworks, and collective communication
  • Understanding of thermal management principles and power delivery for advanced semiconductors

What we offer:

  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source your cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • Join a simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Similar Jobs for Performance Reliability Engineer

Reliability Engineer - Manufacturing Maintenance Equipment

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...
Location:
United States, Fayetteville
Salary:
Not provided
Advanced Technology Products
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in engineering (ABET accredited) or equivalent experience (e.g., heavy industrial maintenance, reliability, or operations experience)
  • Minimum of five years of reliability experience
  • Demonstrates ability to perform full array of reliability tool sets
  • Experience performing RCA
  • Experience performing RCM and FMEA
  • Master Level Proficiency in Predictive Technology
  • Vibration I Certification
  • Infrared I Certification
  • Machine Health Monitoring Strong Proficiency
  • Coaching & Experience with Work Execution Management
Job Responsibility:
  • Promotes and adheres to the ATS safety culture
  • Ensures compliance with regulatory requirements and ATS policies and procedures
  • Partners with internal/external customer for engineered solutions to improve reliability and throughput
  • Identifies opportunities for Capital Expenditures for equipment replacement with supervision (develops and communicates ROI)
  • Champions operating systems, critical elements, and best practices to enable a precision reliability culture
  • Knowledgeable application of common precision tools and practices
  • Fully understands reliability-centered maintenance and deliverables (equipment-specific maintenance plan, ESMP)
  • Actively collaborates with maintenance team on the use of predictive, preventative, and precision maintenance technologies and strategies designed to identify or control risks prior to failure and ensure optimum maintenance execution
  • Understands and performs failure mode & effects analysis
  • Advanced understanding of Work Execution Management (WEM) to train and mentor on gaps & improvements identified through reliability strategy session performance

Reliability Engineer I

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...
Location:
United States, Northumberland
Salary:
Not provided
Advanced Technology Products
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in engineering (ABET accredited) or equivalent experience (e.g., heavy industrial maintenance, reliability, or operations experience)
  • Minimum of one year of reliability experience
  • Demonstrates ability to use reliability tool sets
  • Experience performing RCA
  • Involvement with RCM & FMEA
  • Master Level Proficiency in Predictive Technology
  • Vibration I Certification
  • Machine Health Monitoring Intermediate Proficiency
  • Experience with Work Execution Management
  • Technical understanding of electrical or mechanical components, tools, and designs
Job Responsibility:
  • Promotes and adheres to the ATS safety culture
  • Ensures compliance with regulatory requirements and ATS policies and procedures
  • Partners with internal/external customer for engineered solutions to improve reliability and throughput
  • Identifies opportunities for Capital Expenditures for equipment replacement (develops and communicates ROI)
  • Highly knowledgeable in operating systems, critical elements, and best practices to enable a precision reliability culture
  • Knowledgeable application of common precision tools and practices
  • Partners with peers to perform reliability-centered maintenance and deliverables (equipment-specific maintenance plan, ESMP)
  • Actively collaborates with maintenance team on the use of predictive, preventative, and precision maintenance technologies and strategies designed to identify or control risks prior to failure and ensure optimum maintenance execution
  • Partners with peers to perform failure mode & effects analysis
  • Understands Work Execution Management (WEM) & improvements identified through reliability strategy session performance
  • Fulltime

Reliability Engineer

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...
Location:
United States, Tupelo, Mississippi
Salary:
Not provided
Advanced Technology Products
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in engineering (ABET accredited) or equivalent experience (e.g., heavy industrial maintenance, reliability, or operations experience)
  • Minimum of one year of reliability experience
  • Demonstrates ability to use reliability tool sets
  • Experience performing RCA
  • Involvement with RCM & FMEA
  • Master Level Proficiency in Predictive Technology
  • Vibration I Certification
  • Machine Health Monitoring Intermediate Proficiency
  • Experience with Work Execution Management
  • Technical understanding of electrical or mechanical components, tools, and designs
Job Responsibility:
  • Promotes and adheres to the ATS safety culture
  • Ensures compliance with regulatory requirements and ATS policies and procedures
  • Partners with internal/external customer for engineered solutions to improve reliability and throughput
  • Identifies opportunities for Capital Expenditures for equipment replacement (develops and communicates ROI)
  • Highly knowledgeable in operating systems, critical elements, and best practices to enable a precision reliability culture
  • Knowledgeable application of common precision tools and practices
  • Partners with peers to perform reliability-centered maintenance and deliverables (equipment-specific maintenance plan, ESMP)
  • Actively collaborates with maintenance team on the use of predictive, preventative, and precision maintenance technologies and strategies designed to identify or control risks prior to failure and ensure optimum maintenance execution
  • Partners with peers to perform failure mode & effects analysis
  • Understands Work Execution Management (WEM) & improvements identified through reliability strategy session performance
  • Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility:
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer:
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
  • Fulltime

Senior Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...
Location:
Spain
Salary:
85000.00 - 115000.00 EUR / Year
Affirm
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
  • A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
  • Meaningful experience contributing to or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
  • 4+ years working in a Site Reliability or Production Engineering team
  • Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
  • Experience making impactful changes in a large code base, and a suite of tools and practices that enable you and your team to do so safely
  • Strong verbal and written communication skills that support effective collaboration with our global engineering team
  • On-Call Rotation - There would be an on-call rotation for this role as a requirement
Job Responsibility:
  • You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery
  • You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
  • You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
  • You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
  • You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
  • You will help develop talent on your team by providing feedback and guidance, and leading by example
What we offer:
  • Away Days - wellness days to take off work and recharge
  • Learning & Development programs
  • Parental benefit
  • Employee Resource & Community Groups
  • Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
  • Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
  • Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
  • ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
  • Fulltime

Database Reliability Engineer

We are committed to providing our customers with reliable and secure services at...
Location:
Netherlands
Salary:
Not provided
ClickHouse
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science or a related field
  • At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering
  • Previous experience operating ClickHouse or other SQL databases in production
  • Excellent understanding of distributed database internals and SQL; ClickHouse knowledge in particular is a major plus
  • Scripting experience with Shell or Python, and ability to read and understand C++ code
  • Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
  • You are a strong problem-solver and have solid production debugging skills
  • You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward
  • You have a high level of responsibility, ownership, and accountability
  • Excellent communication skills
Job Responsibility:
  • Continuously improve the reliability and performance of ClickHouse core
  • Improve and create metrics and alerts for ClickHouse to be able to identify and prevent problems in production before they affect customers
  • Dig deeper into the most common problems encountered by customers in ClickHouse core to identify their root causes, submit bug fixes and issue reports, and suggest improvements
  • Enhance and refine incident response processes and post-mortem analysis for ClickHouse core-related outages, including working with support and Cloud teams to communicate with the impacted customers
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize customer impact
What we offer:
  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 Home office setup if you’re a remote employee
  • Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites
  • Fulltime

Database Reliability Engineer

We are committed to providing our customers with reliable and secure services at...
Location:
Germany
Salary:
Not provided
ClickHouse
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s degree in Computer Science or a related field
  • At least 5 years of experience in Reliability Engineering, QA, or customer-facing engineering
  • Previous experience operating ClickHouse or other SQL databases in production
  • Excellent understanding of distributed database internals and SQL; ClickHouse knowledge in particular is a major plus
  • Scripting experience with Shell or Python, and ability to read and understand C++ code
  • Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
  • You are a strong problem-solver and have solid production debugging skills
  • You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward
  • You have a high level of responsibility, ownership, and accountability
  • Excellent communication skills
Job Responsibility:
  • Continuously improve the reliability and performance of ClickHouse core
  • Improve and create metrics and alerts for ClickHouse to be able to identify and prevent problems in production before they affect customers
  • Dig deeper into the most common problems encountered by customers in ClickHouse core to identify their root causes, submit bug fixes and issue reports, and suggest improvements
  • Enhance and refine incident response processes and post-mortem analysis for ClickHouse core-related outages, including working with support and Cloud teams to communicate with the impacted customers
  • Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
  • Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize customer impact
What we offer:
  • Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
  • Healthcare - Employer contributions towards your healthcare
  • Equity in the company - Every new team member who joins our company receives stock options
  • Time off - Flexible time off in the US, generous entitlement in other countries
  • A $500 Home office setup if you’re a remote employee
  • Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites

Reliability Engineer

The Reliability Engineer is responsible for developing and leading asset reliabi...
Location:
United States, Bennettsville
Salary:
Not provided
Domtar
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Mechanical Engineering or related technical field
  • Minimum five (5) years of experience in maintenance, reliability, or engineering within manufacturing or heavy industrial environments (pulp and paper experience preferred)
  • Strong knowledge of RCM, FMEA, RCFA, CMMS systems, and predictive maintenance technologies
  • Demonstrated commitment to safety and continuous improvement
Job Responsibility:
  • Lead the development and execution of precision, preventive, and predictive maintenance strategies that improve equipment reliability
  • Champion Root Cause Problem Elimination (RCPE) and Failure Mode & Effects Analysis (FMEA) to proactively address equipment failures
  • Manage and optimize condition-based monitoring programs, including vibration, infrared, oil analysis, and ultrasound technologies
  • Establish and maintain robust systems and tools that enable maintenance and operations teams to monitor and interpret equipment and process health data effectively
  • Optimize maintenance strategies using asset criticality and reliability data to focus efforts on high-impact equipment
  • Analyze failure data and trends to identify systemic issues and drive continuous improvement initiatives
  • Collaborate with planning and scheduling teams to ensure timely and efficient execution of maintenance activities aligned with reliability goals
  • Serve as a subject matter expert on reliability tools, CMMS platforms, and emerging technologies
  • Develop and deliver training and communications to enhance reliability awareness and engagement among maintenance and operations personnel
  • Monitor and report on key reliability and maintenance KPIs, such as MTBF, MTTR, and OEE
What we offer:
  • Competitive compensation
  • A supportive working environment
  • Rewarding career paths
  • Plenty of opportunities for learning and growth
  • Fulltime