As a Senior SRE, you'll be a technical leader in how we design, observe, and operate our systems in production. You focus on how services behave as a whole: reliability, performance, failure modes, and the experience of the engineers who build them.
Job Responsibilities:
Design systems with resilience, graceful degradation, and capacity in mind
Define and measure SLOs and SLIs
Use Datadog together with CloudWatch to build observability
Configure alerting and routing
Continuously improve our incident lifecycle
Combine strong software fundamentals with reliability thinking
Requirements:
Master's or bachelor's degree in computer science or a related field, or related experience
4+ years of experience in an SRE or Software Engineering role
A track record of successfully managing production environments at scale
A strong belief that observability is critically important
Experience using SLOs, SLIs, and KPIs to guide decisions
You've read (or written) part or all of the SRE book and contextualized it
Proficiency in leveraging AI productivity tools (e.g., Cursor, Claude Code) and prompt engineering
Hands-on experience shepherding services from design to production
Experience tackling site-wide outages
Passion for mentoring engineers
What we offer:
Healthcare
Internet/cell phone reimbursement
A learning and development stipend
Opportunities to collaborate with, and travel to, our Palo Alto HQ and Bangkok site