We are looking for a passionate, hard-working, and talented Site Reliability Engineer (SRE) to take the lead on solving some of the toughest operational challenges in some of the most sensitive and mission-critical automated warehouse solutions. The SRE team will drive the stability and sustainability of these next-generation systems and discover innovative ways to scale and operate them reliably as we expand. In this role, you will work with cross-functional teams such as Operations, Mechanical Hardware Engineering, IT Infrastructure Systems, and Software Engineering to identify and address underlying resiliency gaps.
Job Responsibilities:
Analyze various sources of metrics and dashboards, parse logs, and translate the findings into a facts-based, actionable Root Cause Analysis (RCA) investigation, leading teams of Subject Matter Experts to find the actual cause
Host RCA calls as a chair and drive the RCA process to conclusion within tight SLAs with customer-facing deliverables
Lead problem tickets and improvements to major software components, systems, and features to improve the availability, scalability, latency, and efficiency of the Symbotic System
Engage in and improve the service lifecycle from inception and design to deployment, operation, and refinement based on lessons learned through deep dives
Hands-on troubleshooting of VMware, Kubernetes, Custom Software, and infrastructure performance incidents
Be a trusted technical advisor who leads complex root cause analysis investigations from beginning to end until maximum improvements are identified
Demonstrate sound knowledge of gathering logs and facilitating a facts-based root cause analysis with cross-functional teams
Assist internal teams with corrective actions and improvement tickets, and influence their completion goals
Flexibility to work occasional non-standard hours, including weekends, may be required depending on criticality and workload demands
Ability to travel up to 10%
Requirements:
Bachelor’s degree in Software Engineering, Information Systems, Computer Science or a related field
Minimum of 5 years of experience working with ITSM tools such as Jira or an equivalent
Minimum of 5 years of infrastructure engineering experience with a record demonstrating the delivery of high-quality, large-scale solutions requiring planning and change control
Minimum of 5 years of experience operating production systems, including troubleshooting, testing, and automation
Minimum of 5 years of experience leading technical Root Cause Analysis
Ability to prioritize parallel RCA investigations and tasks by influencing cross-functional teams to complete actions on time and to a high standard of quality
Experience with executive incident communications and RCA report writing, with strong written communication skills for non-technical audiences
Ability to apply a broad technical background to projects through excellent problem-solving and the competence to work with other technical teams
Ability to efficiently read and understand GitLab technical documentation
Experience in the advanced use of tools like Prometheus, Grafana, LogicMonitor, Elastic, and VMware, and use of the CLI (Kubernetes or Linux)
Nice to have:
ITIL Problem Management experience
Knowledge of Power BI, Tableau, executive report writing, and presentation skills