Senior Software Engineer, Site Reliability

Babylist

Location:
United States; Canada

Contract Type:
Not provided

Salary:

186818.00 - 224183.00 USD; CAD / Year

Job Description:

Babylist is looking for a Senior Software Engineer, Site Reliability to join our Platform team. In this position, you will play a vital role in ensuring the stability, scalability, and reliability of our systems and services. You will work closely with all Babylist Engineering teams to support shared infrastructure and developer tools. Your expertise in site reliability engineering, AWS cloud infrastructure, and modern DevOps practices will be instrumental in optimizing our systems and driving continuous improvement.

Job Responsibility:

  • Manage and build our AWS infrastructure using Infrastructure as Code (IaC) tools like Terraform
  • Improve the speed and reliability of our Continuous Integration (CI) systems
  • Provide support to developers in troubleshooting issues
  • Establish, communicate, and support best practices for monitoring and alerting

Requirements:

  • 8+ years of experience as a Site Reliability Engineer or in a similar role
  • Experience supporting high-traffic consumer-facing websites
  • Proficiency with Terraform
  • Strong experience working with AWS cloud-based infrastructure and services
  • Proficiency with Docker and Kubernetes
  • Solid understanding of cloud-native systems design
  • Troubleshooting and debugging skills
  • Experience designing and supporting CI systems
  • Familiarity with monitoring and alerting best practices
  • Proven experience in on-call management best practices
  • Excellent verbal and written communication skills
  • Comfortable with and enthusiastic about working in an AI-forward environment

What we offer:

  • Company-paid medical, dental, and vision insurance
  • Retirement savings plan with company matching and flexible spending accounts
  • Generous paid parental leave and PTO
  • Remote work stipend
  • Perks for physical, mental, and emotional health, parenting, childcare, and financial planning

Additional Information:

Job Posted:
December 06, 2025

Employment Type:
Fulltime
Work Type:
Remote work

Similar Jobs for Senior Software Engineer, Site Reliability

Senior Site Reliability Engineer

We are looking for a Senior Site Reliability Engineer who is passionate about sc...
Salary:
Not provided
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 5+ years experience operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring, tweaking dashboards, defining alerts, writing runbooks, etc.
  • 5+ years of hands-on experience with public cloud offerings (AWS components like EC2, CloudFormation, RDS / Aurora, Caches, SQS - or equivalents, e.g. in GCP / Azure)
  • Familiarity with Unix / Linux operating systems
  • Strong emphasis on debugging, improving code, and automating routine tasks
  • Strong backend engineering experience in one or more prominent languages such as Java, Go or Python
  • Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
  • An ability and desire to mentor and coach engineers
What we offer:
  • Health coverage
  • Paid volunteer days
  • Wellness resources
Employment Type:
Fulltime

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
Job Responsibility:
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy, and other compliance standards are maintained for digital solutions deployed in the cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain a high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes
What we offer:
  • Healthcare benefits
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
  • Flexible Spending Accounts
  • Educational assistance programs
  • Paid holidays
  • Paid time off
  • Paid parental leave
  • Commuting benefits
  • Employee Discount Program
Employment Type:
Fulltime

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...
Location:
United States, Deerfield
Salary:
96000.00 - 132000.00 USD / Year
Baxter
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in computer science, IT, or related field (or equivalent experience)
  • Prior experience in Site Reliability Engineering and cloud-based infrastructure management
  • Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
  • Azure administration and operations experience, with certifications a plus
  • Knowledge of related technologies, including cloud, encryption, and security protocols
  • Systems administration experience in Windows and Linux environments
  • Proven problem-solving skills and experience with scripting and automation tools
  • Ability to create accurate documentation and reports, with excellent communication skills
  • Applicants must be authorized to work for any employer in the U.S.
  • Unable to sponsor or take over sponsorship of an employment visa at this time.
Job Responsibility:
  • Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
  • Manage and administer Azure resources, including virtual machines, databases, and networking components
  • Define and document operating procedures to ensure required security, privacy, and other compliance standards are maintained for digital solutions deployed in the cloud
  • Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
  • Define and refine Operations SLAs to maintain a high level of Customer Satisfaction
  • Establish non-functional requirements to meet SLAs
  • Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
  • Define key performance indicators that can be monitored, measured, and used to derive opportunities
  • Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
  • Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes
What we offer:
  • Support for Parents
  • Continuing Education/Professional Development
  • Employee Health & Well-Being Benefits
  • Paid Time Off
  • 2 Days a Year to Volunteer
  • Medical and dental coverage starting day one
  • Insurance coverage for basic life, accident, short-term and long-term disability
  • Business travel accident insurance
  • Employee Stock Purchase Plan (ESPP)
  • 401(k) Retirement Savings Plan
Employment Type:
Fulltime

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...
Location:
United States, San Francisco
Salary:
180960.00 - 230900.00 USD / Year
Atlassian
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
  • Four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash
  • Networking technologies such as TCP/IP or security
  • Four years of experience in automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
  • Knowledge of Linux and Windows systems
  • Cloud technologies within AWS, GCP, Azure
  • Continuous integration and continuous delivery/deployment (CI/CD) practices and monitoring and observability practices
  • Must pass technical interview
Job Responsibility:
  • Architect, develop, and troubleshoot large-scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
  • Provide real-time feedback on production systems
  • Work with product family and platform developers to maintain and improve services and performance with a strong customer focus
  • Utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
  • Responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
  • Build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud as well as automate repetitive work
  • Help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
  • Responsible for continuous integration and continuous delivery/deployment (CI/CD) practices and monitoring and observability practices
What we offer:
  • Health and wellbeing resources
  • Paid volunteer days
Employment Type:
Fulltime

Senior Site Reliability Engineer

HiveWatch is seeking a Staff Site Reliability Engineer to join our Platform Team...
Location:
United States, El Segundo
Salary:
183000.00 - 235000.00 USD / Year
HiveWatch
Expiration Date:
Until further notice
Requirements:
  • 7+ years of software engineering experience with strong coding skills in production environments
  • 5+ years of SRE, DevOps, or production operations experience
  • Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
  • Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
  • Proficiency in at least one object-oriented programming language in our tech stack (Java, Kotlin, Python)
  • Hands-on experience with relational databases and SQL performance optimization
  • Experience with monitoring and observability tools (Prometheus, Grafana, DataDog, or equivalent)
  • Strong debugging skills across distributed systems and microservices architectures
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
Job Responsibility:
  • Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
  • Debug and resolve complex production issues across the full stack, from infrastructure to application code
  • Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
  • Perform root cause analysis requiring deep code-level investigation and implement preventive measures
  • Build automation and tooling to reduce operational toil and improve system reliability
  • Maintain CI/CD pipelines, observability infrastructure, and database performance optimization
  • Increase the resiliency, scalability, and maintainability of production environments
  • Establish on-call procedures and disaster recovery processes
  • Provide technical leadership and mentorship to foster engineering excellence and reliability culture
What we offer:
  • Comprehensive health coverage: medical, dental, vision, and life insurance
  • Cutting-edge work in an emerging field with huge growth potential
  • Competitive compensation packages designed to reward top talent
  • A modern, newly renovated HQ right on Main Street in El Segundo, CA
  • 401(k) with a 4% company match to help you invest in your future (match launches in 2026)
  • Flexible paid time off so you can recharge when you need it
  • Additional benefits include ClassPass credits and a discount on pet insurance
  • A family-friendly, compassionate culture that values balance and belonging
  • Eligible to participate in HiveWatch Equity Incentive Plan
Employment Type:
Fulltime

Senior Site Reliability Engineer

What will you be doing at Miniclip? Participate in an on-call rotation with the ...
Location:
Portugal, Lisbon
Salary:
Not provided
Miniclip
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on experience with AWS in both development and operations contexts
  • Strong Linux system administration skills, including performance tuning and debugging
  • Software development background and strong coding skills in one or more of the following: Go, Python, Ruby
  • Experience with Infrastructure as Code, particularly Terraform
  • Familiarity with CI/CD pipelines and artifact management tools
  • A mindset for resilient systems design, thinking about edge cases, failure modes, and graceful degradation
  • Excellent communication skills in English, both written and spoken
  • Comfortable in a fast-paced environment and adaptable to shifting priorities
Job Responsibility:
  • Participate in an on-call rotation with the Cloud Engineering team to respond to production incidents and outages
  • Operate and evolve infrastructure using Infrastructure as Code (Terraform), configuration management tools, and containerized platforms on AWS
  • Build and maintain observability tooling to detect symptoms before they lead to outages
  • Automate repetitive tasks and processes to reduce operational toil
  • Collaborate with Engineering and Product teams to design resilient systems that meet performance and reliability goals
  • Troubleshoot production issues across application, network, and infrastructure layers
  • Document systems, processes, and runbooks to improve team transparency and onboarding

Senior Site Reliability Engineer

You'll join the team primarily responsible for making our self-hosted product of...
Location:
United States
Salary:
200000.00 - 220000.00 USD / Year
Tines
Expiration Date:
Until further notice
Requirements:
  • 5-8 years in an SRE or similar role
  • Experience architecting, maintaining, and supporting systems with containerized applications, ideally k8s
  • Experience with troubleshooting deployment issues, creating clear documentation, and designing robust escalation paths
  • Comfortable learning new technologies
  • Experience with Ruby, Rails, React, TypeScript, Postgres, Redis and Docker
  • Customer obsessed and willing to go deep into unfamiliar stacks to find root causes
  • Authorized to work for any employer in the U.S.
Job Responsibility:
  • Making our self-hosted product offering as easy as possible for customers to install and operate
  • Owning all of the supporting services and tools that our self-hosted customers rely on
  • Identifying and fixing availability risks and monitoring gaps
  • Enabling software engineers to build new product features that work seamlessly across cloud and self-hosted environments
  • Using our own product extensively to automate infrastructure maintenance and to build DevOps tooling for customer deployments
  • Identifying areas for improvement in our containerized architecture and deployment strategies
  • Mentoring other engineers in container orchestration and Kubernetes best practices
  • Acting as a subject matter expert for critical self-hosted customer issues
What we offer:
  • Competitive salary
  • Startup equity & extended exercise window
  • Matching retirement plans
  • Home office setup
  • Private healthcare plans
  • 25 days annual leave
  • Extra company holidays
  • Generous parental leave programs
  • Flexibility in how and where you work
  • Phone and home Internet allowance
Employment Type:
Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz is a fast-growing startup developing the industry’s leading vector databa...
Salary:
175000.00 - 225000.00 USD / Year
Zilliz
Expiration Date:
Until further notice
Requirements:
  • 4+ years of experience in site reliability engineering or similar roles with a focus on cloud-native systems
  • Proficiency in programming languages such as Python, Go, or Java
  • Strong knowledge of container orchestration technologies like Kubernetes and Docker
  • Expertise with cloud platforms such as AWS, GCP, or Azure, and their respective monitoring and management tools
  • Experience with infrastructure as code tools such as Terraform or Ansible
  • Familiarity with CI/CD tools such as Jenkins, GitLab CI, or Argo
  • Proven ability to troubleshoot complex distributed systems and resolve issues promptly
  • Bachelor’s degree or above in computer science, software engineering, or other relevant disciplines
  • Ability to thrive in a fast-paced, startup environment and handle multiple projects simultaneously
Job Responsibility:
  • Work at the intersection of development and site reliability, creating SRE tools and systems and supporting existing infrastructure and platforms
  • Ensure the reliability, availability, and performance of Zilliz’s distributed database systems
  • Develop and implement strategies for monitoring, incident management, and disaster recovery
  • Automate system operations and maintenance tasks to improve efficiency and reduce manual intervention
  • Design and build tools to manage and monitor infrastructure, ensuring scalability and robustness
  • Collaborate with software engineers to enhance system reliability, scalability, and performance
  • Maintain and improve the CI/CD pipeline to ensure smooth and rapid deployment of changes
  • Actively contribute to the Milvus Vector Database open-source community, focusing on improving reliability and operational efficiency
Employment Type:
Fulltime