SRE Production Support

Select Minds

Location:
United States, Livonia

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We’re passionate about building software that solves problems. We count on our site reliability engineers (SREs) to give users the rich feature set, high availability, and stellar performance they need to pursue their missions. As we expand customer deployments, we’re seeking an experienced SRE to deliver insights from massive-scale data in real time. Specifically, we’re looking for someone with fresh ideas and a unique viewpoint who enjoys collaborating with a cross-functional team to develop real-world solutions and a positive user experience in every interaction.

Job Responsibility:

  • Monitor and report on application behavior analytics; triage by identifying, diagnosing, and coordinating the resolution of performance problems before they impact end users; and participate in rapid root cause diagnosis of problems in the application and infrastructure
  • Identify the functional domain in which a problem resides (server utilization, network saturation, application tuning)
  • Participate in all Major Incident Management and Root Cause Analysis calls and provide expert troubleshooting support as needed
  • Troubleshoot incidents and problems, resolve issues in a timely manner, and determine the fault or underlying cause, working with both customer and vendor personnel
  • Monitor high-value, business-centric transactions and manage response actions
  • Maintain accurate documentation for the assigned workspace and procedures, and update procedures covering, but not limited to, the software and hardware layers
  • Understand and use de-escalation techniques when working with difficult customers
  • Monitor application infrastructure and the network using tools such as Splunk, AppDynamics, and Dynatrace
  • Proactively detect, report, log, and respond to all network performance and availability problems in each part of the application
  • Follow incident, problem, and change management processes for the supported technology infrastructure, and review system requirements and application dependencies to determine monitoring configuration
  • Contribute to creating documentation

Requirements:

  • Master’s degree in Computer Science or related discipline
  • Ability to program (structured and OOP) using one or more high-level languages, such as Python, Java, C/C++, Ruby, and JavaScript
  • 5 to 6 years of experience in Production Support
  • 6+ years of professional experience in SRE Production Support
  • Experience with NFS, HDFS, Ceph, and Amazon S3, as well as dynamic resource management frameworks (Apache Mesos, Kubernetes, YARN)
  • Must provide 24×7 support for production servers on a rotation basis

Additional Information:

Job Posted:
December 11, 2025

Employment Type:
Fulltime

Similar Jobs for SRE Production Support

Head of Support

Coralogix is a modern, full-stack observability platform transforming how busine...
Location:
Israel, Ramat Gan
Salary:
Not provided
Coralogix
Expiration Date:
Until further notice
Requirements:
  • Experience in technical support, DevOps, SRE, or similar roles
  • Strong knowledge of AWS/Azure/GCP and Kubernetes ecosystems
  • Familiarity with observability tools (Kibana, Grafana, Prometheus, Datadog, Splunk, ELK)
  • Hands-on experience with Kubernetes, Docker, and distributed systems
  • Proficiency with ELK concepts, RegEx, Lucene, and PromQL
  • Proven leadership of global/multi-regional support teams (35+ people)
  • Strong incident management and escalation-handling skills
  • Ability to optimize support operations, workflows, and tooling
  • Strong analytical and data-driven decision-making abilities
  • Excellent communicator with technical and non-technical audiences
Job Responsibility:
  • Lead and coach global Technical Support Engineering teams
  • Ensure high-quality support with improvements in CSAT, response/resolution times, backlog, and KPIs
  • Maintain clear global processes and standards
  • Align with regional leads for coverage across time zones
  • Act as the senior escalation point for complex issues
  • Guide engineers in root cause analysis, distributed systems, and observability
  • Oversee incident management with strong communication and collaboration
  • Maintain hands-on knowledge of Coralogix architecture and tooling
  • Drive continuous improvement to streamline workflows and reduce escalations
  • Enhance productivity through better tools, processes, and automation
Employment Type:
Fulltime

Site Reliability Engineering Support Lead

Site Reliability Engineering Support Lead role focused on application support, d...
Location:
Ireland, Dublin
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • Solid SRE process experience
  • 5+ years leading a high-performance, 24x7 DevOps or SysOps team
  • Proficiency in Windows administration, Office 365, Exchange, SharePoint, Active Directory, Backup, Networking and Infrastructure
  • Experience with Microsoft Windows and Windows Server
  • Experience in ticket tracking and resolving on time
  • Hands-on experience on ticketing tools (ServiceNow)
  • Excellent verbal, written, presentation and interpersonal communication skills
  • Ability to make complex technical matters easy-to-comprehend for non-technical persons.
Job Responsibility:
  • Taking end-to-end Ownership of Application Support for Production Systems Issues resolution
  • Implementing, monitoring, and maintaining CI/CD frameworks
  • Developing new capabilities, coordinating implementation across a large number of teams including infrastructure, developer tools and information security
  • Influencing a culture of Site Reliability Engineering, and training and mentoring other engineers to develop an SRE mindset
  • Providing first-line post-deployment technical support at the L1 and L2 levels for applications and/or associated production systems, including diagnostics and network health monitoring
  • Coordinating and/or deploying hands-on fixes, patches, and software updates at the application level and, as appropriate, at the network level
  • Managing a team of technical support engineers who provide technical support to users
  • Escalating complex problems to the L3 level of expertise within the organization, along with observations from investigative and diagnostic assessments
  • Coordinating the investigation of recurring technical issues affecting user systems and seeing them through to resolution
  • Escalating, resolving, and tracking production incidents to closure while guiding the team
What we offer:
  • Competitive base salary (which is annually reviewed)
  • Hybrid working model (up to 2 days working at home per week)
  • Additional benefits to support you and your family to be well, live well and save well.
Employment Type:
Fulltime

FX Applications Support Senior Analyst

As an OpsTech Application Support Analyst, the candidate will play a pivotal rol...
Location:
Australia, Sydney
Salary:
Not provided
Citi
Expiration Date:
Until further notice
Requirements:
  • 5-8 years of experience in an Application Support role
  • experience installing, configuring or supporting business applications
  • experience with some programming languages and willingness/ability to learn
  • advanced execution capabilities and ability to adjust quickly to changes and re-prioritization
  • effective written and verbal communications including ability to explain technical issues in simple terms that non-IT staff can understand
  • demonstrated analytical skills
  • issue tracking and reporting using tools
  • knowledge of or experience with problem management tools
  • good all-round technical skills
  • effectively share information with other support team members and with other technology teams
Job Responsibility:
  • Provide technical and business support for users of Citi Applications
  • maintain application systems
  • manage, maintain and support applications
  • perform start of day checks, continuous monitoring, and regional handover
  • develop and maintain technical support documentation
  • maximize the potential of applications
  • assess risk and impact of production issues and escalate
  • ensure storage and archiving procedures are functioning correctly
  • formulate and define scope and objectives for complex application enhancements
  • prioritize bug fixes and support tooling requirements
What we offer:
  • Rewarding work in a supportive environment
  • clear opportunities for progression
  • exciting company benefits
Employment Type:
Fulltime

Lead SRE

We are looking for a Lead SRE to join our Inetum Team and be part of a work cult...
Location:
Portugal, Lisbon
Salary:
Not provided
Inetum
Expiration Date:
Until further notice
Requirements:
  • SRE IT production processes
  • Agile / DevOps mindset and problem solving
  • Scripting: Python, YAML, Shell
  • Monitoring: Dynatrace, Nagios
  • Linux
  • Admin Network (DNS, Firewall, Switch)
  • DevOps stack: Git & Git Flow, Artifactory, Jenkins or Gitlab CI, Ansible Tower, Digital ai Release
  • Cloud: Kubernetes, Docker, Argo CD, Vault, Helm
  • End-to-end IT organization and processes (from development to run / operate)
  • Technical Architecture
Job Responsibility:
  • Train SREs and their managers on SRE practices
  • Co-construct the transformation strategy and the support plan by participating in workshops, brainstorming with the transformation team and producing training content
  • Coach and support
Employment Type:
Fulltime

Expert Site Reliability Engineer

Expert Site Reliability Engineer provides technical expertise and strategic guid...
Location:
India
Salary:
Not provided
Altera Digital Health Inc. UK
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree (Preferred)
  • 8+ years of relevant work experience
  • 5–7 years of expert-level systems engineering experience with the assigned product
  • 8+ years of experience with healthcare products in a support, development, or consultancy environment
  • Experience with Windows Server and IIS
  • Experience with SQL
  • Experience in Application support
Job Responsibility:
  • Provide continual technical guidance and support to the client on an ongoing basis
  • Collaborate with the internal technical teams to ensure successful implementation and integration of the proposed solutions
  • Collaborate with business stakeholders and TAM to understand business requirements and objectives
  • Design solutions that align with Hosting best practices, industry standards, and organizational business priorities
  • Develop and document overall technical architecture for the client
  • Design and document integration of various systems, components, and third-party services
  • Create architectural diagrams and documentation
  • Identify potential technical risks and provide mitigation strategies
  • Proactively address the challenges related to project deliverables and client environments
  • Review Control systems for your assigned client on a weekly basis and take appropriate actions to mitigate issues
Employment Type:
Fulltime

Service Management Specialist

We are currently seeking an experienced professional to join our team in the rol...
Location:
China, Guangzhou
Salary:
Not provided
HSBC
Expiration Date:
December 31, 2025
Requirements:
  • Minimum 10 years of experience in production support, SRE, or DevOps roles, with a proven track record of managing and improving large-scale, mission-critical systems
  • Advanced programming and scripting skills (e.g., Java, Python, Go, SQL, API development, backend systems)
  • Extensive experience with containerization (Docker) and orchestration platforms (Kubernetes), including designing and managing large-scale deployments
  • Proficiency in monitoring and observability tools such as Splunk, CloudWatch, AppDynamics, Prometheus, or Grafana
  • Strong expertise in Infrastructure as Code (IaC) tools like Terraform, CloudFormation, or Ansible, with experience in managing cloud-based infrastructure (AWS, Azure, or GCP)
  • Demonstrable experience in designing and implementing automation pipelines for CI/CD and operational tasks
  • Proven ability to lead cross-functional teams to resolve complex technical issues and drive system improvements
  • Strong understanding of security best practices, including vulnerability management and secure system design
  • Excellent written and verbal communication skills in both Mandarin and English, with the ability to communicate complex technical concepts to diverse audiences
  • Experience in mentoring and leading junior engineers, fostering a collaborative and high-performing team environment
Job Responsibility:
  • Lead the design, implementation, and enhancement of service monitoring systems to ensure services operate within agreed Service Level Objectives (SLOs) and enable rapid response to performance indicator breaches
  • Drive automation initiatives by identifying opportunities to replace manual tasks with software solutions, improving efficiency and reliability across systems
  • Perform in-depth system analysis, configuration management, and implement improvements to enhance system software performance, availability, scalability, and reliability
  • Oversee and approve deployment changes, ensuring adherence to best practices and minimizing change-related incidents that could impact the error budget
  • Collaborate with cross-functional teams, including software engineers, testers, and product managers, to ensure systems meet non-functional requirements such as performance, security, and availability
  • Develop and enforce best practices for incident management, root cause analysis, and post-mortem processes to improve system resilience
  • Mentor and guide junior SREs, fostering a culture of continuous learning and operational excellence
  • Maintain and expand system documentation, including runbooks, architecture diagrams, and operational procedures, ensuring critical knowledge is accessible to the team
  • Lead capacity planning and disaster recovery strategies to ensure system readiness for growth and unexpected events
  • Stay updated on industry trends and emerging technologies, driving innovation and improvements in reliability engineering practices
What we offer:
  • Continuous professional development
  • Flexible working
  • Opportunities to grow within an inclusive and diverse environment
Employment Type:
Fulltime

Senior DevOps Engineer

Join HSBC as a Senior DevOps Engineer in Shanghai or Xi'an, focusing on driving ...
Location:
China, Shanghai; Xi'an
Salary:
Not provided
HSBC
Expiration Date:
December 31, 2025
Requirements:
  • Bachelor's degree in computer science, Information Technology, Cybersecurity, or a related field
  • Advanced degrees or certifications (e.g., AWS Certified DevOps Engineer, CISSP, CISM) are a plus
  • Minimum of 7 years of experience in DevOps, DevSecOps, or related roles, with at least 3 years in a leadership or senior engineering role
  • Proven experience in automating CI/CD pipelines and implementing security practices in a financial services or banking environment
  • Experience supporting production support teams during incidents, with a focus on rapid resolution and root cause analysis
  • Familiarity with coordinating with global/regional SRE and DevOps teams in a distributed environment
  • Expertise in Jenkins, GitLab CI, GitHub Actions, or CircleCI for building secure, automated pipelines
  • Proficiency in Terraform, CloudFormation, or Ansible for automated infrastructure provisioning
  • Deep knowledge of AWS, Azure, or GCP for managing secure, scalable infrastructure
  • Knowledge of Ali Cloud would be an advantage
Job Responsibility:
  • Lead the design and implementation of secure, automated CI/CD pipelines
  • Implement Infrastructure as Code using tools like Terraform or Ansible
  • Automate security scanning, compliance checks, and vulnerability management within development workflows
  • Drive adoption of DevSecOps best practices
  • Collaborate with production support team to resolve production incidents
  • Provide technical expertise during incident response
  • Work closely with production support and application teams
  • Partner with the operation resilience project team
  • Coordinate with global and regional SRE and DevOps teams
  • Ensure compliance with China's regulatory requirements
What we offer:
  • Flexible working
  • Continuous professional development
  • Opportunities to grow within an inclusive and diverse environment.
Employment Type:
Fulltime

Senior Site Reliability Engineer

Digital Business Services (DBS) Our GCIO organisation plays a critical role for ...
Location:
China, Shanghai
Salary:
Not provided
HSBC
Expiration Date:
December 31, 2025
Requirements:
  • Bachelor's degree in computer science, Information Technology, or a related field. Advanced degrees or certifications (e.g., ITIL, AWS Certified Solutions Architect, Google SRE) are a plus
  • Minimum of 5 years of experience in site reliability engineering, software development, or systems engineering, preferably in a financial services environment
  • Proven experience in automating operational processes and managing high-availability systems
  • Experience collaborating with production support, application development, and global teams in a distributed environment
  • Programming: Proficiency in Python, Go, Java, or Ruby for automation and tool development
  • Systems: Deep knowledge of Linux/Unix systems for administration, performance tuning, and debugging
  • Cloud and Infrastructure: Expertise in AWS, Azure, or GCP, and Infrastructure as Code (IaC) tools like Terraform or Ansible
  • Containerization: Experience with Docker and Kubernetes for managing containerized banking applications
  • Monitoring: Proficiency in Prometheus, Grafana, Splunk, or Datadog for observability and performance monitoring
  • CI/CD: Familiarity with Jenkins, GitLab CI, or GitHub Actions for integrating reliability into deployment pipelines
Job Responsibility:
  • Design, develop, and implement automation tools and scripts to reduce manual operational tasks ("toil") and enhance system resilience
  • Ensure high availability (e.g., 99.99% uptime) of critical banking applications, including core banking, payment systems, and global platforms and local systems
  • Conduct capacity planning and chaos engineering to test and improve system resilience under failure conditions
  • Participate in on-call rotations to respond to production incidents, troubleshoot issues, and conduct post-mortems to prevent recurrence
  • Collaborate with production support teams for rapid incident resolution and escalate complex issues to application teams or vendors as needed
  • Work closely with production support teams to streamline incident handling and integrate automated solutions into support processes
  • Partner with application development teams to embed reliability practices into the software development lifecycle (SDLC)
  • Engage with the bank's operation resilience project team to align on initiatives for regulatory compliance, disaster recovery, and system robustness
  • Coordinate with global and regional SRE and DevOps teams to ensure consistency in tools, processes, and standards across distributed banking systems
  • Implement and maintain monitoring solutions to track service-level indicators (SLIs) and ensure service-level objectives (SLOs) are met
Employment Type:
Fulltime