Performance Reliability Engineer Job at Cerebras Systems (Sunnyvale)

Reliability Engineer - Manufacturing Maintenance Equipment

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...

Location

United States, Fayetteville

Salary:

Not provided

Advanced Technology Products

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering (ABET accredited) or equivalent experience (e.g., heavy industrial maintenance, reliability, or operations experience)
Minimum of five years of reliability experience
Demonstrates ability to apply the full array of reliability tool sets
Experience performing root cause analysis (RCA)
Experience performing reliability-centered maintenance (RCM) and failure mode & effects analysis (FMEA)
Master Level Proficiency in Predictive Technology
Vibration I Certification
Infrared I Certification
Strong proficiency in machine health monitoring
Coaching & Experience with Work Execution Management

Job Responsibility

Promotes and adheres to the ATS safety culture
Ensures compliance with regulatory requirements and ATS policies and procedures
Partners with internal/external customer for engineered solutions to improve reliability and throughput
Identifies opportunities for Capital Expenditures for equipment replacement with supervision (develops and communicates ROI)
Champions operating systems, critical elements, and best practices to enable a precision reliability culture
Knowledgeable application of common precision tools and practices
Fully understands reliability-centered maintenance and its deliverables (equipment-specific maintenance plan, ESMP)
Actively collaborates with maintenance team on the use of predictive, preventative, and precision maintenance technologies and strategies designed to identify or control risks prior to failure and ensure optimum maintenance execution
Understands and performs failure mode & effects analysis
Advanced understanding of Work Execution Management (WEM) to train and mentor on gaps & improvements identified through reliability strategy session performance

Reliability Engineer I

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...

Location

United States, Northumberland

Salary:

Not provided

Advanced Technology Products

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering (ABET accredited) or equivalent experience (e.g., heavy industrial maintenance, reliability, or operations experience)
Minimum of one year of reliability experience
Demonstrates ability to use reliability tool sets
Experience performing root cause analysis (RCA)
Involvement with RCM & FMEA
Master Level Proficiency in Predictive Technology
Vibration I Certification
Intermediate proficiency in machine health monitoring
Experience with Work Execution Management
Technical understanding of electrical or mechanical components, tools, and designs

Job Responsibility

Promotes and adheres to the ATS safety culture
Ensures compliance with regulatory requirements and ATS policies and procedures
Partners with internal/external customer for engineered solutions to improve reliability and throughput
Identifies opportunities for Capital Expenditures for equipment replacement (develops and communicates ROI)
Highly knowledgeable in operating systems, critical elements, and best practices to enable a precision reliability culture
Knowledgeable application of common precision tools and practices
Partners with peers to perform reliability-centered maintenance and produce its deliverables (equipment-specific maintenance plan, ESMP)
Actively collaborates with maintenance team on the use of predictive, preventative, and precision maintenance technologies and strategies designed to identify or control risks prior to failure and ensure optimum maintenance execution
Partners with peers to perform failure mode & effects analysis
Understands Work Execution Management (WEM) & improvements identified through reliability strategy session performance

Fulltime

Reliability Engineer

Founded in 1985, ATS is a company with a presence in the United States, Mexico a...

Location

United States, Tupelo, Mississippi

Salary:

Not provided

Advanced Technology Products

Expiration Date

Until further notice

Requirements

Bachelor’s degree in engineering (ABET accredited) or equivalent experience (e.g., heavy industrial maintenance, reliability, or operations experience)
Minimum of one year of reliability experience
Demonstrates ability to use reliability tool sets
Experience performing root cause analysis (RCA)
Involvement with RCM & FMEA
Master Level Proficiency in Predictive Technology
Vibration I Certification
Intermediate proficiency in machine health monitoring
Experience with Work Execution Management
Technical understanding of electrical or mechanical components, tools, and designs

Job Responsibility

Promotes and adheres to the ATS safety culture
Ensures compliance with regulatory requirements and ATS policies and procedures
Partners with internal/external customer for engineered solutions to improve reliability and throughput
Identifies opportunities for Capital Expenditures for equipment replacement (develops and communicates ROI)
Highly knowledgeable in operating systems, critical elements, and best practices to enable a precision reliability culture
Knowledgeable application of common precision tools and practices
Partners with peers to perform reliability-centered maintenance and produce its deliverables (equipment-specific maintenance plan, ESMP)
Actively collaborates with maintenance team on the use of predictive, preventative, and precision maintenance technologies and strategies designed to identify or control risks prior to failure and ensure optimum maintenance execution
Partners with peers to perform failure mode & effects analysis
Understands Work Execution Management (WEM) & improvements identified through reliability strategy session performance

Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and increasing the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Fulltime

Senior Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...

Location

Spain

Salary:

85000.00 - 115000.00 EUR / Year

Affirm

Expiration Date

Until further notice

Requirements

4+ years of experience designing, developing and launching backend systems at scale using scripting and development languages like Bash, Python or Kotlin
A track record of developing highly available distributed systems using technologies like AWS, MySQL and Kubernetes
Meaningful experience contributing in or driving parts of the Incident Lifecycle process, enabling actionable insights that improve the quality culture, reliability, resilience, and system performance
4+ years working in a Site Reliability or Production Engineering team
Experience defining a technical plan for the delivery of a significant feature or system component with an elegant, simple and extensible design
Experience making impactful changes in a large code base, and developing a suite of tools and practices that enable you and your team to do so safely
Strong verbal and written communication skills that support effective collaboration with our global engineering team
On-call rotation - participation in an on-call rotation is a requirement for this role

Job Responsibility

You will be responsible for owning and delivering quarterly goals for your team, leading engineers on your team through ambiguity to solve open-ended problems, and ensuring that everyone is supported throughout delivery
You will support your peers and stakeholders in the product development lifecycle by collaborating with infrastructure, product management, developer experience & analytics by participating in ideation, articulating technical constraints, and partnering on decisions that properly consider risks and trade-offs
You will proactively identify technical solutions and operational processes that strengthen incident readiness, response, and post-incident analysis
You will support the operations and availability of your team’s artifacts by creating and monitoring metrics, escalating when needed, and supporting “keep the lights on” & on-call efforts
You will foster a culture of quality and ownership on your team by setting or improving code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
You will help develop talent on your team by providing feedback and guidance, and leading by example

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental benefit
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount

Fulltime

Database Reliability Engineer

We are committed to providing our customers with reliable and secure services at...

Location

Netherlands

Salary:

Not provided

ClickHouse

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field
At least 5 years of experience in Reliability Engineering, QA or customer facing engineering
Previous experience operating ClickHouse or other SQL databases in production
Excellent understanding of distributed database internals and SQL; knowledge of ClickHouse in particular is a major plus
Scripting experience with Shell or Python, and ability to read and understand C++ code
Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
You are a strong problem-solver and have solid production debugging skills
You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward
You have a high level of responsibility, ownership, and accountability
Excellent communication skills

Job Responsibility

Continuously improve the reliability and performance of ClickHouse core
Improve and create metrics and alerts for ClickHouse to be able to identify and prevent problems in production before they affect customers
Dig deeper into the most common problems encountered by customers in ClickHouse core to identify their root causes, submit bug fixes and issue reports, and suggest improvements
Enhance and refine incident response processes and post-mortem analysis for ClickHouse core related outages including working with support and Cloud teams to communicate to the impacted customers
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize customer impact

What we offer

Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
Healthcare - Employer contributions towards your healthcare
Equity in the company - Every new team member who joins our company receives stock options
Time off - Flexible time off in the US, generous entitlement in other countries
A $500 Home office setup if you’re a remote employee
Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites

Fulltime

Database Reliability Engineer

We are committed to providing our customers with reliable and secure services at...

Location

Germany

Salary:

Not provided

ClickHouse

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science or a related field
At least 5 years of experience in Reliability Engineering, QA or customer facing engineering
Previous experience operating ClickHouse or other SQL databases in production
Excellent understanding of distributed database internals and SQL; knowledge of ClickHouse in particular is a major plus
Scripting experience with Shell or Python, and ability to read and understand C++ code
Knowledge of cloud computing platforms such as AWS, Azure, or Google Cloud Platform
You are a strong problem-solver and have solid production debugging skills
You thrive in a fast-paced environment as part of a global team, and you see yourself as a partner with the business with the shared goal of moving the business forward
You have a high level of responsibility, ownership, and accountability
Excellent communication skills

Job Responsibility

Continuously improve the reliability and performance of ClickHouse core
Improve and create metrics and alerts for ClickHouse to be able to identify and prevent problems in production before they affect customers
Dig deeper into the most common problems encountered by customers in ClickHouse core to identify their root causes, submit bug fixes and issue reports, and suggest improvements
Enhance and refine incident response processes and post-mortem analysis for ClickHouse core related outages including working with support and Cloud teams to communicate to the impacted customers
Plan, enable, and drive Chaos initiatives across Engineering teams, based upon internal priorities
Manage on-call processes to respond to performance and reliability issues, and establish best practices for coordinating escalation to resolve issues and minimize customer impact

What we offer

Flexible work environment - ClickHouse is a globally distributed company and remote-friendly. We currently operate in 20 countries
Healthcare - Employer contributions towards your healthcare
Equity in the company - Every new team member who joins our company receives stock options
Time off - Flexible time off in the US, generous entitlement in other countries
A $500 Home office setup if you’re a remote employee
Global Gatherings – We believe in the power of in-person connection and offer opportunities to engage with colleagues at company-wide offsites

Reliability Engineer

The Reliability Engineer is responsible for developing and leading asset reliabi...

Location

United States, Bennettsville

Salary:

Not provided

Domtar

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Mechanical Engineering or related technical field
Minimum five (5) years of experience in maintenance, reliability, or engineering within manufacturing or heavy industrial environments (pulp and paper experience preferred)
Strong knowledge of RCM, FMEA, RCFA, CMMS systems, and predictive maintenance technologies
Demonstrated commitment to safety and continuous improvement

Job Responsibility

Lead the development and execution of precision, preventive, and predictive maintenance strategies that improve equipment reliability
Champion Root Cause Problem Elimination (RCPE) and Failure Mode & Effects Analysis (FMEA) to proactively address equipment failures
Manage and optimize condition-based monitoring programs, including vibration, infrared, oil analysis, and ultrasound technologies
Establish and maintain robust systems and tools that enable maintenance and operations teams to monitor and interpret equipment and process health data effectively
Optimize maintenance strategies using asset criticality and reliability data to focus efforts on high-impact equipment
Analyze failure data and trends to identify systemic issues and drive continuous improvement initiatives
Collaborate with planning and scheduling teams to ensure timely and efficient execution of maintenance activities aligned with reliability goals
Serve as a subject matter expert on reliability tools, CMMS platforms, and emerging technologies
Develop and deliver training and communications to enhance reliability awareness and engagement among maintenance and operations personnel
Monitor and report on key reliability and maintenance KPIs, such as MTBF, MTTR, and OEE

What we offer

Competitive compensation
A supportive working environment
Rewarding career paths
Plenty of opportunities for learning and growth

Fulltime

Performance Reliability Engineer

Cerebras Systems

Location:
United States; Canada, Sunnyvale; Toronto

Category:
IT - Software Development

Contract Type:
Not provided

Job Posted:
February 17, 2026
