Join Barclays as a Site Reliability Engineer and play a key role in building a new, high-impact SRE capability within Markets Post-Trade. As part of a cross-cutting team, you will expand application stability and reliability measurement by automating reliability tooling, closing telemetry gaps, and addressing reliability findings across multiple mission-critical systems. You will help extend and scale an SRE solution across Markets Post-Trade, driving full-stack observability for cash settlements, securities settlement, and liquidity management flows. Through centralised dashboards and end-to-end transaction tracing, you will deliver greater transparency and faster issue resolution, and enable the adoption of AI-driven observability, anomaly detection, and advanced analytics. This role focuses on pre-emptive monitoring, optimisation, and non-functional architecture design to ensure resilient, high-performing systems in a fast-paced environment.
Job Responsibilities:
Ensure the availability, performance, and scalability of systems and services through proactive monitoring, maintenance, and capacity planning
Analyse, respond to, and resolve system outages and disruptions, and implement measures to prevent similar incidents from recurring
Develop tools and scripts that automate operational processes, reducing manual workload, increasing efficiency, and improving system resilience
Monitor and optimise system performance and resource usage, identify and address bottlenecks, and implement best practices for performance tuning
Collaborate with development teams to integrate best practices for reliability, scalability, and performance into the software development lifecycle, and work closely with other teams to ensure smooth and efficient operations
Stay informed of industry technology trends and innovations, and actively contribute to the organisation's technology communities to foster a culture of technical excellence and growth
Requirements:
Experience with observability and APM tools such as OpenTelemetry, Elastic, AppDynamics, or Prometheus
Experience designing and implementing resilience patterns, including Retry, Timeout, Circuit Breaker, Bulkhead, Throttling, and Saga
Proficiency with load-testing tools such as HP Performance Center, LoadRunner, k6, or JMeter
Solid knowledge of networking and security fundamentals, including VPC design, IAM, encryption, and secrets management
Operational experience with scripting and/or programming languages such as Java, Python, Ruby, or Bash
Nice to have:
Experience in the financial services industry
Experience with infrastructure-as-code tools such as Chef and Ansible
Working knowledge of CI/CD tools including GitLab, Jenkins, Nolio, and TeamCity
Experience operating in Red Hat, Windows, and Kubernetes environments
Familiarity with alerting and monitoring tools such as ITRS Geneos