We are looking for a Senior Site Reliability Engineer to join our growing engineering team. We are a company that values SRE principles and practices. We believe in empowering our SREs to make data-driven decisions, automate operational tasks, and continuously improve the reliability of our systems. We foster a blameless culture where everyone is encouraged to learn from mistakes and share knowledge. If you are passionate about building and maintaining highly reliable systems, we would love to hear from you!
Job Responsibilities:
Lead the design of scalable, fault-tolerant and self-healing systems in a multi-region AWS environment
Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies
Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures
Identify patterns of manual work and lead the development of internal tools/automation to permanently eliminate them
Develop and maintain automated runbooks and playbooks for common operational tasks and complex incident response
Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights
Proactively identify and mitigate operational risks through chaos engineering and architecture reviews
Work with software engineers to design systems for reliability, scalability, and maintainability from the early stages of the SDLC
Continuously evaluate and optimize system performance, capacity, and cost efficiency
Go beyond participating in the on-call rotation: refine the on-call experience to reduce alert fatigue, improve MTTR, and keep the rotation sustainable
Requirements:
Bachelor’s degree in Computer Engineering or a similar discipline
5+ years of experience as a Site Reliability Engineer or in a similar role
3+ years of experience with AWS services, including strong knowledge of container orchestration
2+ years of Kubernetes experience
Deep understanding of observability principles and tools such as Prometheus, Datadog, and OpenTelemetry
Experience leading incident management and complex postmortem analysis
Experience with, and interest in, managing infrastructure as code (Terraform)
Experience with chaos engineering and other techniques for testing system resilience
Experience with CI/CD tools such as GitHub Actions for automated delivery
Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling