We are looking for a hands-on, first-principles engineer who is fluent in Linux, comfortable operating close to the metal, and capable of architecting systems for the next generation of AI infrastructure. You will build, maintain, and scale Luma’s infrastructure across on-prem and multi-vendor clouds (AWS & OCI), serving as the bridge between hardware vendors, cloud providers, and our research teams.
Job Responsibilities:
Architect for Reliability & Scale: Participate in critical re-architecture sessions to redesign our systems for higher efficiency and scale
Own Multi-Cloud GPU Clusters: Take end-to-end ownership of our production clusters for training and inference across AWS and OCI, ensuring high availability and peak performance
Drive Security & Compliance: Assist in achieving and maintaining security certifications (SOC 2 Type 1 & 2, ISO standards) by implementing robust infrastructure security practices
Deep Linux Performance Tuning: Use your mastery of Linux systems to troubleshoot and optimize performance at the OS and kernel level
Build Robust Automation: Write high-quality tools and automation in Python, Go, or Bash to manage, monitor, and heal our infrastructure
Debug Complex Hardware/Software Failures: Serve as the final escalation point for the most challenging GPU, networking (InfiniBand/RDMA), and system-level issues
Requirements:
8+ years of experience as an SRE, production engineer, or infrastructure engineer in a fast-paced, large-scale environment
Deep, hands-on expertise in Linux, containerized systems, and debugging low-level system performance
Strong experience with cloud providers such as AWS or OCI
A drive to solve complex, low-level problems where hardware and software intersect
Energy and the ability to thrive in a less structured, fast-paced environment
Working knowledge of security best practices and familiarity with compliance frameworks, such as SOC 2 and ISO
Practical experience with InfiniBand, RDMA, or RoCE, and an understanding of how to optimize throughput for massive distributed training jobs
Nice to have:
Deep expertise with GPU tooling, such as DCGM for NVIDIA GPUs or ROCm for AMD GPUs
Experience managing large-scale GPU clusters for AI/ML workloads (training or inference)
Familiarity with job management systems based on Kubernetes or orchestration frameworks like Ray