Site Reliability Engineer 2

Microsoft Corporation

Location:
United States, Redmond

Contract Type:
Not provided

Salary:

100600.00 - 199000.00 USD / Year

Job Description:

The M365 Copilot App Platform team provides the platform APIs, infrastructure, and backend web server for the Microsoft 365 Copilot app. All partner teams have built their AI-enabled experiences on our platform and depend on us for their success. We own everything from the application code to the platform APIs, deployment pipelines, and infrastructure, including the backend web server and middle-tier service that support the application on the web, Windows, and Mac. This role is central to enabling the M365 Copilot app, one of Microsoft’s key strategic products in the competitive AI landscape.

Job Responsibility:

  • Leverage expertise in distributed systems, cloud technology layers, platform APIs, and infrastructure components to improve availability, reliability, performance, observability, and security of the middle-tier services
  • Identify opportunities to enhance service quality by analyzing production telemetry and applying insights to propose and implement engineering changes
  • Participate in on‑call rotations and incident responses, engaging with product engineering teams throughout the product lifecycle
  • Independently create, test, and deploy changes through safe deployment processes (SDP) to improve operability and code quality
  • Collaborate with engineers and architects to diagnose and resolve production issues and prevent recurrence
  • Develop and maintain the middle-tier service, platform APIs, deployment pipelines, and infrastructure supporting the M365 Copilot app
  • Work closely with partner teams to enable new capabilities and ensure the platform meets reliability and performance requirements
  • Contribute to the continuous evolution of infrastructure and tooling to support services at scale
  • Collaborate with cross-functional teams to enable the M365 Copilot app and drive innovation
  • Work closely with partner teams to build additional capabilities into our application

Requirements:

  • Master's Degree in Computer Science, Information Technology, or related field AND 1+ year(s) technical experience in software engineering, network engineering, or systems administration OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, or systems administration OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter

Nice to have:

  • Bachelor's Degree in Computer Science or related technical field AND 4+ years of technical engineering experience or experience as a Site Reliability Engineer building and shipping production software or services with code in languages including, but not limited to, C#, JavaScript, or TypeScript OR equivalent experience
  • Experience in distributed systems and/or cloud platforms (Azure, Kubernetes, Docker, containers ecosystem)
  • Proven ability to modify componentized, well-architected infrastructure software and collaborate across teams
  • 1+ years of experience with incident management and reliability engineering in cloud or AI environments
  • Proficient in scripting (PowerShell, Shell script, etc.) and expertise in Linux
  • Technical experience working with large-scale cloud or distributed systems
  • Experience running highly-available, mission-critical large-scale distributed systems, including domain expertise in areas such as scalable & fault tolerant system design, observability & monitoring, safe change management, automation, reliability & security risk reduction
  • Motivated and self-driven
  • Strong cross-team communication and partnership skills
  • Creativity, insightfulness, and sensitivity for a dynamic approach to problem solving

Additional Information:

Job Posted:
February 16, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work
Similar Jobs for Site Reliability Engineer 2

Site Reliability Engineer 2

Join us. At PagerDuty, you'll tackle complex problems, collaborate with kind and...
Location
Portugal, Lisbon
Salary:
Not provided
PagerDuty
Expiration Date
Until further notice
Requirements
  • 3+ years of experience in Site Reliability Engineering, DevOps, or Platform Engineering roles
  • Experience with Kubernetes and container orchestration
  • Experience working on cloud-native infrastructure (e.g. AWS, GCP, Azure)
  • Proficiency in at least one programming language (e.g. Python, Ruby, Go, etc.)
  • Experience with Infrastructure as Code (e.g. Terraform, CloudFormation)
Job Responsibility
  • Deploy, configure, monitor and optimize highly available Kubernetes clusters on AWS/EKS
  • Help maintain the overall health of the platform, including triaging and troubleshooting production issues, monitoring system capacity, and working with other technical teams to ensure adherence to compliance and security best practices
  • Continuously strive to improve the internal developer experience and the software development lifecycle
  • Stay current on technical trends to suggest innovative tools and approaches to interesting problems
  • Participate in a 24/7 on-call rotation
What we offer
  • Competitive salary
  • Comprehensive benefits package from day one
  • Flexible work arrangements
  • Company equity
  • ESPP (Employee Stock Purchase Program)
  • Retirement or pension plan
  • Generous paid vacation time
  • Paid holidays and sick leave
  • Dutonian Wellness Days & HibernationDuty - companywide paid days off in addition to PTO
  • Paid parental leave: 22 weeks for pregnant parent, 12 weeks for non-pregnant parent
Employment Type:
Fulltime

Site Reliability Engineering Manager

Hewlett Packard Enterprise (HPE) is looking for a Site Reliability Engineering M...
Location
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • 7–10 years of experience in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
  • Minimum 2 years of experience managing or leading cloud operations teams
  • Deep understanding of cloud platforms (AWS, GCP, or Azure) and cloud-native architectures
  • Hands-on experience with Kubernetes, containers, infrastructure as code (e.g., Terraform), and configuration management tools
  • Strong foundation in observability (monitoring, logging, tracing), automation using Python, and incident response
  • Familiarity with modern CI/CD automation and tools
  • Excellent communication, stakeholder management, and team-building skills
  • Experience scaling SRE practices in high-growth or large-scale environments
  • Ability to balance long-term reliability initiatives with short-term delivery needs.
Job Responsibility
  • Lead and mentor a team of Site Reliability Engineers, supporting their growth, performance, and well-being
  • Own the reliability strategy for SASE cloud infrastructure systems, including incident management, SLIs/SLOs, and capacity planning
  • Partner with Engineering, Product, and Security teams to design and deliver highly available, scalable, and resilient cloud-native services
  • Guide the team in building automation, improving observability, and improving the operational efficiency of our cloud infrastructure
  • Drive adoption of best practices in monitoring, alerting, on-call operations, and runbook development
  • Build and maintain a strong engineering culture based on ownership, collaboration, and continuous learning
  • Define and track key reliability metrics, and report on team performance and system health to leadership
  • Contribute to hiring, onboarding, and career development for SREs.
What we offer
  • Health & Wellbeing benefits for physical, financial, and emotional wellbeing
  • Personal & Professional Development programs
  • Unconditional inclusion in the workplace.
Employment Type:
Fulltime

Principal Site Reliability Engineer

We are looking for a reliability expert who is passionate about scaling Cloud se...
Location
Salary:
Not provided
Atlassian
Expiration Date
Until further notice
Requirements
  • Expert-level proficiency with 10+ years experience in one or more prominent languages such as Java, Go or Python
  • Expert-level proficiency with 7+ years experience in public cloud offerings (with at least 2+ years specifically on GCP)
  • Expert-level proficiency with 7+ years experience in operating high-availability, fault-tolerant, scalable, distributed software in production: building monitoring into your code, tweaking dashboards, defining alerts, writing runbooks, etc.
  • Excellent communication skills in written and verbal forms, and an ability to communicate complex technical issues to a range of technical and non-technical audiences (management, peers, clients)
  • An ability and desire to mentor and coach engineers
Job Responsibility
  • Analyse and help improve our services and processes to get us to an even higher level of reliability, performance, scalability, and cost efficiency
  • Cross team and functional boundaries to advocate for reliability methodologies
  • Work with a variety of platform, product and SRE teams to both build reliability into our platform and drive adoption of those practices into our products
  • Be the driving force for change

Site Reliability Engineer

Corporate Tools is looking for a Site Reliability Engineer. You will be a tradit...
Location
United States
Salary:
175000.00 USD / Year
Corporate Tools
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Software Engineering, or equivalent practical experience
  • 5+ years of experience in software engineering
  • 2+ years of experience in site reliability engineering, DevOps, or infrastructure engineering roles
  • Deep experience with cloud platforms (AWS, Azure, or GCP) and infrastructure as code tools such as Terraform, CloudFormation, or Pulumi
  • Strong proficiency with Kubernetes, Docker, and container orchestration in production environments
  • Hands-on experience with observability and monitoring tools like Prometheus, Grafana, OpenTelemetry, Sentry, or New Relic
  • Proven ability to design and implement highly available, fault-tolerant systems and lead proactive incident response efforts
  • Experience with performance tuning, database optimization, and caching strategies (e.g., PostgreSQL, Redis, Memcached)
  • Demonstrated ability to drive reliability improvements, reduce operational toil, and foster a culture of resilience and continuous improvement
  • Experience leading reliability-focused initiatives such as post-incident reviews, capacity planning, and root cause analysis
Job Responsibility
  • Stop problems before they start
  • Fix issues quickly and learn from them
  • Help keep systems steady, secure, and running
  • Work closely with DevOps engineers to build out tools and automation
  • Take ownership
What we offer
  • 100% employer-paid medical, dental and vision for employees
  • Annual review with raise option
  • 22 days Paid Time Off accrued annually, and 4 holidays
  • After 3 years, PTO increases to 29 days
  • Employees transition to flexible time off after 5 years with the company—not accrued, not capped, take time off when you want
  • Paid Parental Leave
  • Up to 6% company matching 401(k) with no vesting period
  • Quarterly allowance
  • Open concept office with friendly coworkers
  • Creative environment where you can make a difference
Employment Type:
Fulltime

Site Reliability Engineer

We are recruiting a Junior SRE for a company that provides an advanced data, ope...
Location
Portugal, Lisboa
Salary:
Not provided
Precise
Expiration Date
Until further notice
Requirements
  • Up to 2-3 years of experience in a Site Reliability Engineering (SRE), DevOps, or Production Engineering role, with a deep understanding of SRE principles and best practices
  • Incident management expertise, including triaging, escalation, and resolution of high-severity outages
  • Proficiency in at least one coding language (Python or Java) for automation and debugging
  • Hands-on experience in Kubernetes (K8s) for managing and orchestrating containerized applications
  • Cloud experience (AWS preferred) with exposure to key services like EC2, S3, Lambda, and CloudWatch
  • Excellent communication skills to articulate technical challenges and solutions effectively
  • Strong troubleshooting and problem-solving skills, with experience diagnosing complex production issues
  • Ability to stay calm under pressure, multitask, and prioritize effectively in fast-moving environments
  • Fluency in English (spoken and written) is required
  • Must have the legal right to work in the country
Employment Type:
Fulltime

Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...
Location
Singapore, Singapore
Salary:
Not provided
Citi
Expiration Date
Until further notice
Requirements
  • Bachelor’s degree or equivalent work experience
  • 3+ years of relevant work experience
  • Highly motivated self-starter with good interpersonal and communication skills
  • Certification or formal training in site reliability engineering concepts and practices would be beneficial
  • Prior experience working towards SLIs, SLOs and observability capabilities
  • 2+ years experience in Python alongside Linux based scripting languages
  • Experience working on observability, logging and metrics toolsets
  • Experience with k8s and container technologies such as Docker, OpenShift and EKS
  • Experience with Secrets products such as HashiCorp Vault or CyberArk beneficial but not essential
  • Experience with CI/CD tools such as Terraform, Jenkins, Ansible.
Job Responsibility
  • Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
  • Architecting and building tools and platforms that provide capabilities for SRE
  • Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
  • Actively owning production level incidents till resolution.
Employment Type:
Fulltime

Site Reliability Engineer (Developer Experience)

KnowBe4’s Site Reliability Engineers help ensure that our platforms are reliable...
Location
India, Kochi
Salary:
Not provided
KnowBe4
Expiration Date
Until further notice
Requirements
  • BS/MS/Ph.D. or equivalent plus 2 years experience
  • Comfortable maintaining existing scripts in one or more programming languages (e.g. Python, Ruby, JavaScript)
  • Experience maintaining infrastructure in AWS
  • Experience maintaining workflows for continuous integration and continuous deployment (CI/CD) - GitLab is preferred
  • Effective communication skills
  • Ability to easily adapt while working on competing projects
  • Demonstrated ability to learn new technologies quickly
Job Responsibility
  • Work with other Site Reliability Engineers to build highly scalable and resilient applications and infrastructure in AWS
  • Maintain and improve extensible infrastructure-as-code using Terraform
  • Learn, maintain, and improve our existing deployment strategies
  • Deliver effective observability, monitoring, and alerting patterns for KnowBe4’s applications and infrastructure
  • Assist in identifying and resolving production incidents
  • Correct deficiencies in our current applications and infrastructure
  • Implement solutions to complex technical problems
What we offer
  • Company-wide bonuses based on monthly sales targets
  • Employee referral bonuses
  • Adoption assistance
  • Tuition reimbursement
  • Certification reimbursement
  • Certification completion bonuses
Employment Type:
Fulltime

Staff Engineer, Site Reliability

LearnUpon is looking for a Staff Site Reliability Engineer to join our team in I...
Location
Ireland, Dublin
Salary:
Not provided
LearnUpon
Expiration Date
Until further notice
Requirements
  • 7+ years of experience in a software or Ops role
  • 5+ years of cloud engineering experience, with at least 2 years experience with AWS
  • Experience deploying microservice environments, using containerisation technologies such as Kubernetes and Docker
  • Experience in designing and implementing Observability tech stacks
  • Have championed the benefits of Observability to Engineering teams
  • Can architect the design of SLO/SLI implementation that balances the needs of different teams
  • Familiar with cost analysis of Observability metrics gathering, Engineering effort, and tooling
  • Experience building and supporting large-scale distributed systems that back a consumer app or website with associated requirements of performance, security and disaster recovery
  • Experience with implementing IaC (e.g. CloudFormation, Terraform etc.), automation tooling (e.g. Puppet, Ansible etc.), CI/CD (e.g. Jenkins, Travis CI, GitLab etc.)
  • Able to effectively communicate technical ideas to and collaborate with both technical and non-technical peers
Job Responsibility
  • Identifying opportunities to improve and scale our infrastructure for performance, observability, maintainability, and cost, by creating innovative solutions
  • Leading our efforts to build an observability function that incorporates application metrics, application transaction tracking, and event log management
  • Driving the processes to maintain resilient, scalable and cost-effective infrastructure
  • Working with other Engineering teams to provide infrastructure solutions that meet their ongoing requirements
  • Building tools focused on measuring, monitoring and alerting, with an eye towards self-service in order to promote Engineers’ ownership of observability
  • Reacting quickly to changing customer and business needs
  • Participating in the on-call rota
  • Mentoring junior talent
What we offer
  • Work in a fun and supportive environment with regular team events
  • Excellent career progression
  • Structured learning environment
  • Competitive salary and company ESOP
  • Private health insurance
  • 26 days annual leave
Employment Type:
Fulltime