The Principal Site Reliability Engineer will be a senior technical expert responsible for driving end-to-end resilience, reliability, and scalability across our mission-critical payments platform. This role focuses on front-to-back payment flows, ensuring systems are designed for fault tolerance, observability, and operational excellence. You will perform deep technical reviews, troubleshoot complex issues, and define patterns for resiliency by design. As a hands-on engineer, you will collaborate with development and production support teams, advocate chaos engineering, and build a culture of designing for failure. This position requires strong technical breadth across infrastructure, applications, networks, databases, and integrations, combined with expertise in modern reliability engineering practices.
Job Responsibilities:
Drive strategies to improve reliability, maintainability, and scalability across payment flows and platform components
conduct deep technical assessments of system architectures, identifying risks and recommending improvements for fault tolerance and disaster recovery
act as a senior escalation point for production incidents, lead root cause analysis (RCA), and implement permanent fixes to prevent recurrence
define and enforce reliability patterns, frameworks, and best practices
advocate and implement chaos engineering principles to validate system resilience under real-world failure scenarios
design and implement full-stack observability solutions, including metrics, logging, distributed tracing, and alerting
develop automation for failover, capacity management, and self-healing mechanisms to reduce operational risk
partner with development, infrastructure, and production support teams to embed reliability into the software development life cycle (SDLC)
analyze service risk assessments and production incidents to identify systemic issues and drive long-term improvements
promote operational excellence and a mindset of designing for failure across all engineering teams
provide guidance and expertise to engineering teams to ensure alignment with best practices and foster a culture of technical excellence
contribute to strategic planning by aligning technical decisions with business goals, anticipating future technology trends, and providing insights to optimize product roadmaps
design and implement complex, scalable, and maintainable software solutions, considering long-term viability and business objectives
mentor and coach junior and mid-level engineers to foster professional growth and knowledge sharing, elevating the overall skillset and capabilities of the organization
collaborate with business partners, product managers, designers, and other stakeholders to translate business requirements into technical solutions and ensure a cohesive approach to product development
drive innovation within the organization by identifying and incorporating new technologies, methodologies, and industry practices into the engineering process
Requirements:
12+ years in software engineering or infrastructure roles
at least 5 years focused on reliability engineering or SRE
proven experience building and operating fault-tolerant, highly available systems at scale
strong knowledge of distributed systems, resiliency patterns (circuit breakers, retries, failover), and disaster recovery strategies
expertise across infrastructure (compute, storage, networking), application architecture, databases, and integration patterns
ability to troubleshoot complex technical issues across distributed systems and perform deep root cause analysis
skilled at working with development, operations, and architecture teams to embed reliability into design and delivery