We are seeking a senior Resilience Engineer to own and evolve the stability, availability, and recoverability of our IoT platforms. This role operates at the intersection of system architecture, reliability engineering, and operational excellence, with end-to-end accountability for designing resilience into our services. You will define and govern resilience strategies, influence platform architecture, and partner across product, infrastructure, and engineering teams to ensure our systems continue to perform under failure, scale, and unexpected disruption.
Job Responsibilities:
Developing and governing resilience strategies across system architecture, deployment, monitoring, and incident response
Defining and tracking stability KPIs (e.g., MTTD, MTTR, error budgets), partnering with performance and operations teams to meet or exceed targets
Designing and implementing fault injection testing, chaos engineering practices, and scenario-based simulations to validate platform robustness
Collaborating with product, infrastructure, architecture, and development teams to redesign services with built-in redundancy, failover, and graceful degradation
Driving automation and observability improvements to reduce noise, increase fault detection speed, and support predictive failure mitigation
Contributing to the design and maintenance of our Business Continuity and Disaster Recovery Plan (BCDR), ensuring IoT systems remain resilient and recoverable in the face of unexpected disruptions
Owning the resilience roadmap and continuously assessing emerging threats, technologies, and architectural shifts to guide the evolution of stability practices
Evangelizing a culture of resilience through internal communication, workshops, and post-incident learning programs
Delivering high-quality engineering solutions while continuously strengthening the resilience, scalability, and cost efficiency of our IoT platform
Consistently meeting or exceeding delivery expectations by prioritizing the highest-leverage resilience initiatives that improve customer experience, business outcomes, and financial performance
Building trusted, transparent, and outcome-driven relationships by providing clear technical direction and trade-off recommendations to business and engineering stakeholders
Requirements:
Educated to BSc degree level in Software Engineering, Computer Science, or a related discipline
Strong scripting and automation experience (e.g., Python, Bash, Go, PowerShell), with a demonstrated ability to replace manual processes with reliable, scalable automation
Proven experience designing and operating high-availability, fault-tolerant systems, including the use of chaos engineering techniques and proactive failure-mitigation strategies
Experience applying Business Continuity and resilience standards (e.g., ISO 22301) in the context of real-world platform design and operational readiness
Hands-on experience designing or integrating monitoring, alerting, and automated testing frameworks to support early fault detection and system validation
Broad experience working with Linux-based platforms across on-premises and cloud environments, with an understanding of how infrastructure choices impact reliability, scalability, and recovery
Deep expertise in Site Reliability Engineering principles, including SLOs/SLIs, error budgets, observability, toil reduction, and automation, with the ability to apply them at platform and system scale to guide architectural decisions and long-term resilience strategy
Proven ability to balance long-term platform stability with delivery velocity by making clear, data-driven trade-offs
Strong understanding of security principles, practices, and standards, and the ability to incorporate them into resilient, real-world technical solutions
Deep command of telemetry, logging, and alerting ecosystems (e.g., Prometheus, Grafana, ELK, Datadog, Splunk), with the ability to design signals that enable early fault detection and informed decision-making
Experience defining meaningful SLIs and building dashboards that drive architectural insight, prioritization, and corrective action
Proven experience leading blameless post-incident reviews, root cause analysis, and systemic improvements across multiple teams
Expertise in identifying and addressing system bottlenecks, latency issues, and throughput constraints in distributed environments
Proficiency in forecasting demand, planning capacity, and managing system growth in a cost-efficient and sustainable manner
Strong track record of partnering with software engineering, infrastructure, product, and business teams to embed resilience into the full development lifecycle
Fluency in English.
What we offer:
Hybrid Work Model - Flexible hybrid work model with 8-10 in-office days per month, managed by team leaders
Vodafone Products and Services - Employees get a mobile phone, free communication plan, data card, and various discounts on services and products
Recognition - Recognition programs for innovative, creative, high-potential employees and exemplary behaviors
Health and Well-being - Well-being Program offers nutrition and psychological consultations, webinars, workshops, and discounts on various services and products
Learning - Access to Communities of Practice and a customizable digital training platform with high-quality content (namely Harvard Business Publishing, Skillsoft, and Speexx)
Local and International Mobility - Internal recruitment with local and international rotation opportunities across departments and roles.