Site Reliability Engineering (SRE) Team Lead Job at OneMain Financial (Irving)

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...

Location

Ireland, Dublin

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Solid SRE process experience
5+ years of leading high-performance, 24x7 DevOps or SysOps teams
Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
Experience with Microsoft Windows and Windows Server operating systems
Experience tracking tickets and resolving them on time
Hands-on experience with ticketing tools (ServiceNow)
Excellent verbal, written, presentation and interpersonal communication skills
Ability to make complex technical matters easy to comprehend for non-technical audiences.

Job Responsibility

Taking end-to-end ownership of application support and issue resolution for production systems
Implementing, monitoring, and maintaining CI/CD frameworks
Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
Influencing a culture of Site Reliability Engineering. Engaging in training and mentoring to help develop other engineers with an SRE mindset
Providing first-line post-deployment technical support at the L1 and L2 levels for applications and/or associated production systems, including diagnostics and network health monitoring
Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
Managing a team of technical support engineers who provide technical support to users
Escalating complex problems to the L3 level of expertise within the organization, along with observations from investigative and diagnostic assessments
Coordinating the investigation of recurring technical issues affecting user systems and seeing them through to resolution
Guiding the team in escalating, resolving, and tracking production incidents to closure

What we offer

Competitive base salary (which is annually reviewed)
Hybrid working model (up to 2 days working at home per week)
Additional benefits to support you and your family to be well, live well and save well.

Full-time

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...

Location

India, Bangalore

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Minimum 2 years of experience managing or leading cloud operations teams
Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
Familiarity with modern CI/CD automation and tools
Excellent communication, stakeholder management, and team-building skills
Experience scaling SRE practices in high-growth or large-scale environments
Ability to balance long-term reliability initiatives with short-term delivery needs.

Job Responsibility

Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
Define and track key reliability metrics, and report on team performance and system health to leadership
Contribute to hiring, onboarding, and career development for SREs.

What we offer

Health & Wellbeing benefits for physical, financial, and emotional wellbeing
Personal & Professional Development programs
Unconditional inclusion in the workplace.

Full-time

Site Reliability Engineer

Corporate Tools is looking for a Site Reliability Engineer. You will be a tradit...

Location

United States

Salary:

175000.00 USD / Year

Corporate Tools

Expiration Date

Until further notice

Requirements

Bachelor's degree in Computer Science, Software Engineering, or equivalent practical experience
5+ years of experience in software engineering
2+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
Deep experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code tools such as Terraform, CloudFormation, or Pulumi
Strong proficiency with Kubernetes, Docker, and container orchestration in production environments
Hands-on experience with observability and monitoring tools like Prometheus, Grafana, OpenTelemetry, Sentry, or New Relic
Proven ability to design and implement highly available, fault-tolerant systems and lead proactive incident response efforts
Experience with performance tuning, database optimization, and caching strategies (e.g., PostgreSQL, Redis, Memcached)
Demonstrated ability to drive reliability improvements, reduce operational toil, and foster a culture of resilience and continuous improvement
Experience leading reliability-focused initiatives such as post-incident reviews, capacity planning, and root cause analysis

Job Responsibility

Stop problems before they start
Fix issues quickly and learn from them
Help keep systems steady, secure, and running
Work closely with DevOps engineers to build out tools and automation
Take ownership

What we offer

100% employer-paid medical, dental and vision for employees
Annual review with raise option
22 days Paid Time Off accrued annually, and 4 holidays
After 3 years, PTO increases to 29 days
Employees transition to flexible time off after 5 years with the company—not accrued, not capped, take time off when you want
Paid Parental Leave
Up to 6% company matching 401(k) with no vesting period
Quarterly allowance
Open concept office with friendly coworkers
Creative environment where you can make a difference

Full-time

Lead Site Reliability Engineer

Groupon is a marketplace where customers discover new experiences and services e...

Location

India, Bangalore

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in systems engineering
At least 5 years in SRE or DevOps roles
Expertise in cloud platforms (GCP, AWS) and container orchestration (Kubernetes, Docker)
Proficiency in programming and scripting languages like Python, Go, and Bash
Advanced knowledge of Infrastructure as Code (IaC) tools such as Terraform and Ansible
Deep understanding of networking, DNS, load balancing, and security principles
Proven track record of managing high-availability systems in demanding environments
Exceptional analytical and problem-solving skills

Job Responsibility

Architect and maintain fault-tolerant systems, ensuring uptime SLAs of 99.9% or higher
Drive automation in infrastructure management and deployment using Terraform, Ansible, Kubernetes, and similar tools
Create and optimize CI/CD pipelines to ensure reliable, secure, and efficient software delivery
Build and enhance comprehensive observability solutions, including monitoring, logging, and alerting systems using Prometheus, Grafana, and the ELK stack
Collaborate with stakeholders to define and achieve SLIs, SLOs, and error budgets aligned with business needs
Lead incident response during on-call rotations, ensuring rapid resolution and root cause analysis for critical issues
Design and execute performance testing, capacity planning, and scalability strategies for evolving workloads
Proactively identify and resolve bottlenecks, increasing system performance and developer efficiency
Mentor junior engineers, fostering a collaborative and growth-oriented team environment
Guide architectural decisions that drive innovation and enhance system reliability

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
A collaborative and innovative work environment that values your expertise and contributions
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer (AI-first SRE)

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Peru

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response and incident reviews, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Colombia

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering
5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response and incident reviews, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems

Principal Site Reliability Engineer

Groupon is modernizing its global platform — and reliability is at the center of...

Location

Ecuador

Salary:

Not provided

Groupon

Expiration Date

Until further notice

Requirements

10+ years in software/systems engineering, including 5+ years in SRE or platform reliability
Strong experience with GCP (preferred) or AWS, Kubernetes, and Terraform
Proficiency in Python or Go for automation and tooling
Deep understanding of observability stacks (Prometheus, Grafana, OpenTelemetry) and service meshes (Istio, Envoy)
Hands-on AIOps experience: anomaly detection, predictive analytics, ML-assisted operations
Strong communication and influencing skills — data over hierarchy

Job Responsibility

Architect and maintain self-healing systems with 99.9%+ availability targets
Use AI/ML to automate infrastructure governance and detect configuration or IaC anti-patterns
Implement adaptive SLIs/SLOs that evolve automatically from real-time data
Build AIOps-based observability and auto-remediation pipelines
Apply predictive modeling to forecast failures before they impact users
Lead chaos, performance, and resilience testing programs
Map platform and service behavior to revenue impact and drive improved revenue resilience through better infrastructure performance
Mentor engineers and drive reliability standards across teams
Partner with platform, data, and product teams to ensure stability aligns with business goals
Support major incident response and incident reviews, and participate in on-call rotations

What we offer

The opportunity to work with cutting-edge technologies in a transformative environment
Professional growth and leadership development pathways tailored to your aspirations
A chance to leave a lasting impact by shaping the future of reliable and scalable systems