Zuora’s Cloud Engineering organization owns the reliability, scalability, and operational excellence of our global, customer-facing SaaS platforms. We are seeking a Senior Site Reliability Engineer to play a technical leadership role in advancing Zuora’s reliability strategy with a strong focus on AI-driven automation and intelligent operations. This role goes beyond execution and requires ownership of complex systems, definition of new approaches, and influence across teams.
Job Responsibilities:
Reliability Architecture & Platform Strategy: Own and evolve the reliability architecture of large-scale, distributed SaaS systems by defining SLOs, SLIs, error budgets, and resilience patterns aligned with business objectives
AI-Driven Automation & Intelligent Operations: Design, build, and operationalize AI-powered automation to reduce operational toil and improve system stability
Advanced Cloud & Infrastructure Engineering: Lead the design and operation of complex AWS-based infrastructure and Kubernetes platforms, optimizing for availability, security, and cost efficiency
Incident Leadership & Operational Excellence: Act as a technical leader during high-severity production incidents, driving structured response, decision-making, and recovery
Technical Leadership & Cross-Functional Influence: Influence reliability outcomes beyond the SRE team by partnering closely with Engineering, Product, and Security stakeholders
Requirements:
8+ years of hands-on experience in Site Reliability Engineering, DevOps, or large-scale production operations
Advanced expertise in AWS, including architecture design across services such as EC2, EKS, VPC, IAM, RDS, S3, and CloudWatch
Deep experience with Infrastructure-as-Code using Terraform, including complex modules, state management, and governance
Strong programming and automation skills using Python and Shell, with experience building production-grade automation systems
Expert-level Linux systems knowledge, including performance tuning, security hardening, and deep troubleshooting
Proven experience operating distributed systems and data streaming platforms such as Kafka in high-throughput environments
Demonstrated ability to work independently on complex, ambiguous problems with broad organizational impact
Proven technical leadership experience driving large, cross-team reliability or infrastructure initiatives, including setting technical direction, influencing design decisions, and mentoring engineers to deliver measurable outcomes at scale
Practical experience designing or implementing AI/ML-driven automation in operations, reliability, or platform engineering
Experience integrating AI capabilities into monitoring, alerting, incident response, or workflow automation systems
Strong understanding of how AI can be safely and effectively applied in production environments
Nice to have:
Experience with advanced observability platforms (Prometheus, Grafana, ELK, or similar) enhanced with AI-driven insights
Familiarity with predictive analytics, anomaly detection, or AIOps platforms
Experience influencing architectural decisions at a platform or product level
Prior experience operating in a 24/7, global, high-availability SaaS environment
What we offer:
Competitive compensation, variable bonus and performance reward opportunities, and retirement programs
Medical Insurance
Generous, flexible time off
Paid holidays, “wellness” days, and a company-wide end-of-year break
6 months of fully paid parental leave
Learning & Development stipend
Opportunities to volunteer and give back, including charitable donation match
Free resources and support for your mental wellbeing