Principal SRE

Barclays

Location:
India, Pune

Contract Type:
Not provided

Salary:
Not provided

Job Description:

The Principal Site Reliability Engineer will be a senior technical expert responsible for driving end-to-end resilience, reliability, and scalability across our mission-critical payments platform. This role focuses on front-to-back payment flows, ensuring systems are designed for fault tolerance, observability, and operational excellence. You will perform deep technical reviews, troubleshoot complex issues, and define patterns for resiliency by design. As a hands-on engineer, you will collaborate with development and production support teams, advocate chaos engineering, and build a culture of designing for failure. This position requires strong technical breadth across infrastructure, applications, networks, databases, and integrations, combined with expertise in modern reliability engineering practices.

Job Responsibilities:

  • Drive strategies to improve reliability, maintainability, and scalability across payment flows and platform components
  • Conduct deep technical assessments of system architectures, identifying risks and recommending improvements for fault tolerance and disaster recovery
  • Act as a senior escalation point for production incidents, lead root cause analysis (RCA), and implement permanent fixes to prevent recurrence
  • Define and enforce reliability patterns, frameworks, and best practices
  • Advocate and implement chaos engineering principles to validate system resilience under real-world failure scenarios
  • Design and implement full-stack observability solutions, including metrics, logging, distributed tracing, and alerting
  • Develop automation for failover, capacity management, and self-healing mechanisms to reduce operational risk
  • Partner with development, infrastructure, and production support teams to embed reliability into the SDLC
  • Analyze service risk assessments and production incidents to identify systemic issues and drive long-term improvements
  • Promote operational excellence and a mindset of designing for failure across all engineering teams
  • Provide guidance and expertise to engineering teams to ensure alignment with best practices and foster a culture of technical excellence
  • Contribute to strategic planning by aligning technical decisions with business goals, anticipating future technology trends, and providing insights to optimize product roadmaps
  • Design and implement complex, scalable, and maintainable software solutions, considering long-term viability and business objectives
  • Mentor and coach junior and mid-level engineers to foster professional growth and knowledge sharing, elevating the overall skill set and capabilities of the organization
  • Collaborate with business partners, product managers, designers, and other stakeholders to translate business requirements into technical solutions and ensure a cohesive approach to product development
  • Drive innovation within the organization by identifying and incorporating new technologies, methodologies, and industry practices into the engineering process

Requirements:

  • 12+ years in software engineering or infrastructure roles
  • At least 5 years focused on reliability engineering or SRE
  • Proven experience building and operating fault-tolerant, highly available systems at scale
  • Strong knowledge of distributed systems, resiliency patterns (circuit breakers, retries, failover), and disaster recovery strategies
  • Expertise across infrastructure (compute, storage, networking), application architecture, databases, and integration patterns
  • Ability to troubleshoot complex technical issues across distributed systems and perform deep root cause analysis
  • Skilled at working with development, operations, and architecture teams to embed reliability into design and delivery

What we offer:
  • Competitive holiday allowance
  • Life assurance
  • Private medical care
  • Pension contribution

Additional Information:

Job Posted:
January 22, 2026

Employment Type:
Full-time
Work Type:
Hybrid work

Similar Jobs for Principal SRE

Intermediate Software Engineer SRE – AI

At PointClickCare our mission is simple: to help providers deliver exceptional c...
Location:
Canada, Mississauga
Salary:
115000.00 - 128000.00 CAD / Year
PointClickCare
Expiration Date:
Until further notice
Requirements:
  • 5+ years' experience in software engineering
  • Experience with SRE principles
  • Experience with AI/ML in production environments
  • A passion for automation, intelligent systems, and operational excellence
  • Strong debugging, problem-solving, and system design skills
  • Languages: Python, Java, Bash, Terraform
  • Platforms: Azure, Kubernetes, Docker
  • Tools: Datadog, Prometheus, AppDynamics, ELK, GitHub Actions
  • ML/AI: MCP framework, AI agents, Vector store, Agent orchestration (LangChain), RAG
  • CI/CD: Jenkins, ArgoCD, Spinnaker
Job Responsibilities:
  • Build ML-based anomaly detection and pattern recognition systems
  • Enhance telemetry with smart tagging and metadata for better AI insights
  • Develop event-driven workflows and self-healing systems using AI triggers
  • Automate incident response with generative AI and custom AI agent orchestration
  • Use time-series forecasting and predictive modelling to anticipate failures
  • Optimise infrastructure with AI-powered autoscaling and cost-aware resource allocation
  • Build scalable, fault-tolerant systems in a cloud-native environment
  • Participate in on-call rotations and lead incident response for critical systems
  • Skilled in API integration for streamlined data exchange and system connectivity
  • Run internal AIOps workshops and help teams adopt AI maturity models
What we offer:
  • Benefits starting from Day 1
  • Retirement Plan Matching
  • Flexible Paid Time Off
  • Wellness Support Programs and Resources
  • Parental & Caregiver Leaves
  • Fertility & Adoption Support
  • Continuous Development Support Program
  • Employee Assistance Program
  • Allyship and Inclusion Communities
  • Employee Recognition … and more
Employment Type:
Full-time

Site Reliability Engineer

We are recruiting a Junior SRE for a company that provides an advanced data, ope...
Location:
Portugal, Lisboa
Salary:
Not provided
Precise
Expiration Date:
Until further notice
Requirements:
  • Up to 2-3 years of experience in a Site Reliability Engineering (SRE), DevOps, or Production Engineering role, with a deep understanding of SRE principles and best practices
  • Incident management expertise, including triaging, escalation, and resolution of high-severity outages
  • Proficiency in at least one coding language (Python or Java) for automation and debugging
  • Hands-on experience in Kubernetes (K8s) for managing and orchestrating containerized applications
  • Cloud experience (AWS preferred) with exposure to key services like EC2, S3, Lambda, and CloudWatch
  • Excellent communication skills to articulate technical challenges and solutions effectively
  • Strong troubleshooting and problem-solving skills, with experience diagnosing complex production issues
  • Ability to stay calm under pressure, multitask, and prioritize effectively in fast-moving environments
  • Fluency in English (spoken and written) is required
  • Must have the legal right to work in the country
Employment Type:
Full-time

Site Reliability Engineer

We are recruiting a Senior SRE for a company that provides an advanced data, ope...
Location:
Portugal, Lisboa
Salary:
Not provided
Precise
Expiration Date:
Until further notice
Requirements:
  • Up to 5 years of experience in a Site Reliability Engineering (SRE), DevOps, or Production Engineering role, with a deep understanding of SRE principles and best practices
  • Incident management expertise, including triaging, escalation, and resolution of high-severity outages
  • Proficiency in at least one coding language (Python or Java) for automation and debugging
  • Hands-on experience in Kubernetes (K8s) for managing and orchestrating containerized applications
  • Cloud experience (AWS preferred) with exposure to key services like EC2, S3, Lambda, and CloudWatch
  • Excellent communication skills to articulate technical challenges and solutions effectively
  • Strong troubleshooting and problem-solving skills, with experience diagnosing complex production issues
  • Ability to stay calm under pressure, multitask, and prioritize effectively in fast-moving environments
  • Fluency in English (spoken and written) is required
  • Must have the legal right to work in the country
Employment Type:
Full-time

Software Engineering Professional

Working in this role you will play a critical part in the operation of the BT Bu...
Location:
United Kingdom, Belfast
Salary:
Not provided
Plusnet
Expiration Date:
Until further notice
Requirements:
  • Programming / scripting experience
  • Understanding of SRE principles and a willingness to grow and develop these new principles with BT Business SRE
  • Understanding of CI/CD pipelines
  • Experience of identifying and automating manual processes using technologies like Ansible
  • Experience of using Application Performance Monitoring tools such as Dynatrace
  • You are organised and like to get things done. The ability to adapt, take risks and embrace change will be a necessity
  • Empathetic and good with people; you like working with people and finding solutions together
  • Have an understanding of agile methodologies/frameworks
  • Good communication skills, comfortable with presenting to team members and other wider teams
Job Responsibilities:
  • Work with colleagues across the various Business SRE teams to design and develop SRE software solutions
  • Be part of a team responsible for the implementation of APM and service monitoring and reporting, with a desire to auto-remediate problems
  • Be part of a team responsible for the development of SRE Tooling and BT Business infrastructure automation using SRE software approaches
  • Be part of a team responsible for the design, build and deployment of AI/ML solutions
  • Support and contribute to BT Business Service Assurance SRE goals
  • Ensure good software engineering practices
  • Produce clear documentation for the Observability and Tooling solutions we develop
  • Support our agile methods and ambition to grow our SRE culture throughout with enthusiasm
  • Contribute to our culture and team’s wellbeing
What we offer:
  • Competitive salary
  • 25 days annual leave (plus bank holidays)
  • 10% on target bonus
  • Life Assurance
  • Pension scheme
  • Direct share scheme
  • Option to join the Healthcare Cash Plan or other benefits such as dental insurance, gym memberships etc.
  • 50% off EE mobile pay monthly or SIM only plans
  • Exclusive colleague discounts on our latest and greatest BT broadband packages
  • BT TV with TNT Sports and NOW Entertainment & 50% discount for friends and family on EE SIM Only plans & airtime element off a Flex Pay plan
Employment Type:
Full-time

Forward Deployed Engineer

As a Forward Deployed Engineer (FDE) at Virtru, you will play a pivotal role in ...
Location:
United States, Tampa
Salary:
190000.00 - 240000.00 USD / Year
Virtru
Expiration Date:
Until further notice
Requirements:
  • Active U.S. Secret Clearance required
  • Minimum of 5 years of experience as a cloud engineer, demonstrating a strong understanding of SRE principles for highly scalable and reliable systems
  • Bachelor's degree in Computer Science or related field
  • Proficiency in DevSecOps practices, with experience in source code repositories and CI/CD pipeline solutions such as Team Foundation Server/Azure DevOps, Bitbucket, and GitHub
  • Expertise in Infrastructure as Code (IaC) and best practices for managing cloud infrastructure
  • Familiarity with containerization, Kubernetes (k8s) and orchestration tooling such as OpenShift, Rancher, and Helm
  • Ability to excel both independently and as part of a collaborative team
  • Effective communication and collaboration skills with the on-site customer and the support team
  • Willingness to work onsite 5 days a week in Tampa, FL.
Job Responsibilities:
  • Monitor platform and containerized applications to ensure optimal performance and availability
  • Identify and mitigate performance and availability risks and issues in real-time
  • Contribute to the development and optimization of core platform functions to establish a robust infrastructure
  • Collaborate closely with internal teams and government clients on a daily basis.
What we offer:
  • Flexible PTO policy
  • $1,500 annual Learning & Development Stipend
  • Home-Office Stipend
  • Internal mobility options
  • Frequent company-sponsored Team Celebrations
  • Access to an Employee Assistance Program
  • Access to Headspace, a mental health app
  • A high degree of flexibility
  • Competitive compensation
  • Generous parental, medical, and bereavement policies
Employment Type:
Full-time

Forward Deployed Engineer

As a Forward Deployed Engineer (FDE) at Virtru, you will play a pivotal role in ...
Location:
United States, Columbia
Salary:
190000.00 - 240000.00 USD / Year
Virtru
Expiration Date:
Until further notice
Requirements:
  • Active U.S. TS/SCI Clearance required
  • Minimum of 5 years of experience as a cloud engineer, demonstrating a strong understanding of SRE principles for highly scalable and reliable systems
  • Bachelor's degree in Computer Science or related field
  • Proficiency in DevSecOps practices, with experience in source code repositories and CI/CD pipeline solutions such as Team Foundation Server/Azure DevOps, Bitbucket, and GitHub
  • Expertise in Infrastructure as Code (IaC) and best practices for managing cloud infrastructure
  • Familiarity with containerization, Kubernetes (k8s) and orchestration tooling such as OpenShift, Rancher, and Helm
  • Ability to excel both independently and as part of a collaborative team
  • Effective communication and collaboration skills with the on-site customer and the support team
  • Willingness to work onsite 3-5 days per week in Columbia, MD
Job Responsibilities:
  • Monitor platform and containerized applications to ensure optimal performance and availability
  • Identify and mitigate performance and availability risks and issues in real-time
  • Contribute to the development and optimization of core platform functions to establish a robust infrastructure
  • Collaborate closely with internal teams and government clients on a daily basis
What we offer:
  • A Flexible PTO policy
  • A $1,500 annual Learning & Development Stipend
  • Home-Office Stipend
  • Internal mobility options
  • Frequent company-sponsored Team Celebrations
  • Access to an Employee Assistance Program
  • Access to Headspace, a mental health app
  • A high degree of flexibility
  • Competitive compensation
  • Generous parental, medical, and bereavement policies
Employment Type:
Full-time

Software Engineer, Site Reliability

As a Site Reliability Engineer (SRE) at Fireworks AI, you will play a critical r...
Location:
United States, San Mateo
Salary:
Not provided
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, related technical field, or equivalent practical experience
  • 5+ years of experience in Site Reliability Engineering, DevOps, or a similar role focused on large-scale production systems
  • Deep expertise in SRE principles and practices, including SLOs, SLIs, operational automation, incident management, and post-mortems
  • Extensive hands-on experience with public cloud platforms (AWS, GCP, Azure), including compute, networking, storage, and database services
  • Strong experience with containerization technologies (Docker) and orchestration platforms (Kubernetes)
  • Proficiency in designing and implementing robust monitoring, logging, and alerting systems using tools like Prometheus, Grafana, ELK stack, and distributed tracing
  • Solid programming/scripting skills in at least one language (e.g., Python, Go) for automation and tool development
  • In-depth knowledge of Linux operating systems, networking fundamentals, and system debugging
  • Proven ability to troubleshoot complex issues across the entire stack
  • Excellent communication, collaboration, and problem-solving skills
Job Responsibilities:
  • Ensuring System Reliability: Ensure systems are designed and implemented with high availability, scalability, and performance. Focus on fault tolerance, disaster recovery, identifying and removing scaling bottlenecks, and performance optimization across our multi-cloud infrastructure
  • Incident Management & Response: Lead efforts in incident detection, response, and resolution for critical production issues. Drive post-mortems to identify root causes and implement preventative measures to improve system reliability
  • Observability & Monitoring: Develop, implement, and maintain comprehensive monitoring, alerting, logging, and tracing solutions to provide deep insights into system health and performance
  • Automation & Toil Reduction: Identify and automate repetitive operational tasks to reduce toil and improve operational efficiency. Develop tools and scripts to streamline deployments, scaling, and system management
  • Capacity Planning & Performance Tuning: Work proactively on capacity planning to ensure our infrastructure can gracefully handle growth and peak loads. Optimize system performance and resource utilization
  • Reliability Best Practices: Collaborate with software engineers to embed reliability principles (e.g., SLOs, SLIs, error budgets) into the development lifecycle, promoting a culture of operational excellence
  • On-call Rotation: Participate in a periodic on-call rotation to support our production environment and respond to critical alerts
Employment Type:
Full-time

Platform Engineer DevOps

We are looking for an experienced Platform Engineer DevOps to ensure that the fo...
Location:
France, Paris
Salary:
Not provided
cozycozy
Expiration Date:
Until further notice
Requirements:
  • 5+ years of hands-on experience in Platform Engineering, Infrastructure or DevOps
  • Expertise in operating and scaling Kubernetes and Docker in production environments
  • Proven experience managing hybrid cloud / on-premises infrastructure for high-traffic applications
  • A strong background in designing and implementing robust CI/CD pipelines (GitLab CI, Jenkins, etc.)
  • Experience with Infrastructure as Code (Terraform, Ansible, etc.)
  • Experience with monitoring, alerting, and reliability practices (SRE principles)
  • The mindset to mentor and guide other engineers, fostering a culture of automation and operational excellence
  • Excellent communication skills in English
  • The demonstrated ability to drive complex projects
Job Responsibilities:
  • Implement, maintain and secure infrastructure (cloud, bare-metal, Kubernetes clusters)
  • Automate environment configuration using Infrastructure as Code (e.g., Terraform, Ansible) and adhere to GitOps principles
  • Implement full-stack observability (metrics, logs, traces), sophisticated alerting, and participate in the incident management lifecycle
  • Ensure compliance with Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for all managed services
  • Implement and manage secrets management systems
  • Contribute to the design and evolution of hybrid infrastructure
  • Define, lead, and maintain engineering standards for security, reliability, and technology selection across the organization, supporting the Head of Engineering in defining the platform roadmap
  • Drive continuous improvement initiatives for cloud cost optimization, scalability, performance, and platform security posture
  • Maintain comprehensive, up-to-date documentation and best practices to foster self-service and cross-team enablement
  • Design, implement, and maintain CI/CD pipelines (using GitLab CI, GitHub, and/or Jenkins) tailored for microservice architectures built with Node.js
What we offer:
  • Competitive salary
  • Stock options
  • Alan health insurance
  • Swile card
  • Unlimited coffee, tea, snacks, and drinks in the office