We’re looking for a Site Reliability Engineering Manager with extensive leadership and observability experience in cloud-based applications. In this role, you will lead, manage, and mentor a team of SREs, define and track metrics related to the company's SLOs, SLIs, and SLAs, and operationalize incident management, handling, and communication. The SRE Manager will be responsible for the availability and performance of all external- and internal-facing application endpoints that help drive Checkr’s business. Extensive knowledge of AWS, Kubernetes, and event orchestration is desired. Tooling knowledge of Datadog, PagerDuty, and Atlassian (Jira, Confluence) is highly preferred, as you will identify strategies to improve our full-stack telemetry and monitoring capabilities. You will also mentor SREs on observability-related work and support their career development.
Job Responsibilities:
Expand and improve our observability and monitoring footprint in line with cost efficiency
Drive and delegate the day-to-day escalations and incidents with on-call engineering teams
Collaborate with other Engineering Managers to define metrics and dashboarding requirements
Ensure stakeholders and partners are informed of incidents and incident trends while working with other departments, such as account managers, legal, and marketing, for outbound communication
Review the work of the SRE team, help them get unblocked, and provide mentoring
Meet with the team and individuals weekly to collaborate and discuss topics related to processes, planning, and goals
Manage and assist the on-call incident commander and owners in resolving production reliability issues, ensuring timely communication, retrospectives, and postmortems are performed and delivered
Participate in design and production reviews for new features, products, or infrastructure
Assist in planning for the growth of Checkr’s infrastructure, reliability/resiliency, and resources
Requirements:
8+ years working in a relevant role, including 4+ years of technical leadership experience mentoring engineers
4+ years of experience architecting and administrating observability stacks, either managed or self-hosted (e.g., Datadog, New Relic, Prometheus, Elastic Stack/ELK, OpenTelemetry)
Experience with operation of containerized microservices running on the public cloud, asynchronous event processing, and databases
Knowledge of Linux, Git, and CI/CD pipelines
On-call support of highly available production systems
Designing and building new tools to automate repetitive tasks, prevent incidents, or improve MTTR using a programming language such as Python
Experience with automation and Infrastructure as Code using tools like Terraform, Terragrunt, or CloudFormation
Understanding of how application components interact and experience contributing to architectural discussions
Unwavering commitment to operational security and best practices
Ownership: identify problems, propose solutions, and then coach and guide a team to implement them
Connection: motivated to help other teams improve their service reliability and to continuously improve tooling and services
What we offer:
A fast-paced and collaborative environment
Learning and development allowance
Competitive compensation and opportunity for advancement
100% medical, dental, and vision coverage
Up to 25K reimbursement for fertility, adoption, and parental planning services