Senior Site Reliability Engineer. As a Site Reliability Engineer, you will be responsible for applying development skills and mindset to Platform Engineering, with the goal of improving the reliability of the company’s systems through automation and continuous integration and delivery.
Job Responsibilities:
Effectively manage troubleshooting and recovery of complex production incidents, ranging from low to critical impact
Drive incident resolution through a systematic problem-solving approach, coupled with a strong sense of ownership and drive
Actively participate in the team’s Agile stories (project work) to streamline and enhance the team’s day-to-day operations
Create, manage, and utilize appropriate technical procedural documentation (run books)
Proactively monitor all applications and infrastructure behind the company’s external and internal customer-facing services, including availability, latency, performance, and capacity
Influence resiliency and scalability in production environments in Azure
Assist with conducting Root Cause Analysis (RCA) on critical production outages, and develop and implement mitigation strategies
Utilize production support expertise to influence and support new designs, architectures, standards, and methods, maintaining stability and availability for large-scale distributed systems
Proactively identify and implement opportunities for automation of routine maintenance tasks, data gathering, and resolution of common issues
Continuously seek to develop new skills and technical expertise, as well as proactively share knowledge with others
Build software and systems to manage platform infrastructure and applications to improve reliability, quality, and time-to-market of the company’s suite of software solutions
Gather and analyze operating systems/applications metrics to assist in performance tuning and fault finding
Participate in system design consulting, platform management, capacity planning, testing & release procedures
Create sustainable systems and services through automation and uplifts
Balance feature development speed and reliability with well-defined service level objectives
Perform disaster recovery operations, monitor network performance, and troubleshoot, diagnose, and resolve hardware, software, and other network and system problems
Requirements:
Bachelor’s degree in Computer Science or equivalent relevant experience (degree preferred but not required)
In-depth understanding of web service protocols and REST API design and consumption
Experience with both container and serverless computing
Microsoft Azure/architecture certifications preferred
Skilled in Cloud/PaaS Environments (e.g., Azure), LAN, WAN, Network Security
Proficient, collaborative, & experienced in building reliable, scalable, enterprise systems
Ability to identify root-cause sources of instability in high-traffic, large-scale distributed systems
Linux administration, troubleshooting, and performance tuning experience
Understanding of observability principles (monitoring, logging, tracing, alerting) and of the tools and practices that promote observability