Senior Site Reliability Engineer Cloud Platform Job at Zilliz

Senior Site Reliability Engineer

Baxter International is seeking a skilled Senior Principal Site Reliability Engi...

Location

United States , Deerfield

Salary:

96000.00 - 132000.00 USD / Year

Baxter

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science, IT, or related field (or equivalent experience)
Prior experience in Site Reliability Engineering and cloud-based infrastructure management
Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
Azure administration and operations experience, with certifications a plus
Knowledge of related technologies, including cloud, encryption, and security protocols
Systems administration experience in Windows and Linux environments
Proven problem-solving skills and experience with scripting and automation tools
Ability to create accurate documentation and reports, with excellent communication skills

Job Responsibility

Drive strategies to ensure 24x7 availability of services and business continuity for customer facing healthcare software applications and platforms hosted on Microsoft Azure cloud
Manage and administer Azure resources, including virtual machines, databases, and networking components
Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
Define and refine Operations SLAs to maintain high level of Customer Satisfaction
Establish non-functional requirements to meet SLAs
Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
Define key performance indicators that can be monitored, measured, and used to derive opportunities
Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes

What we offer

Healthcare benefits
Employee Stock Purchase Plan (ESPP)
401(k) Retirement Savings Plan
Flexible Spending Accounts
Educational assistance programs
Paid holidays
Paid time off
Paid parental leave
Commuting benefits
Employee Discount Program

Fulltime

Senior Site Reliability Engineer

This is a role at Baxter where your work impacts saving and sustaining lives thr...

Location

United States , Deerfield

Salary:

96000.00 - 132000.00 USD / Year

Baxter

Expiration Date

Until further notice

Requirements

Bachelor's degree in computer science, IT, or related field (or equivalent experience)
Prior experience in Site Reliability Engineering and cloud-based infrastructure management
Experience in enterprise engineering, including 24x7 uptime, regulated environments, and planning/operations
Azure administration and operations experience, with certifications a plus
Knowledge of related technologies, including cloud, encryption, and security protocols
Systems administration experience in Windows and Linux environments
Proven problem-solving skills and experience with scripting and automation tools
Ability to create accurate documentation and reports, with excellent communication skills
Applicants must be authorized to work for any employer in the U.S.
Unable to sponsor or take over sponsorship of an employment visa at this time.

Job Responsibility

Drive strategies to ensure 24x7 availability of services and business continuity for customer-facing healthcare software applications and platforms hosted on Microsoft Azure cloud
Manage and administer Azure resources, including virtual machines, databases, and networking components
Define and document operating procedures to ensure required security, privacy and other compliance standards are maintained for digital solutions deployed in cloud
Manage process, planning, and execution for Disaster Recovery (DR) and Business Continuity Planning (BCP)
Define and refine Operations SLAs to maintain high level of Customer Satisfaction
Establish non-functional requirements to meet SLAs
Establish infrastructure and application monitoring dashboards and workflow for automatic routing of notifications
Define key performance indicators that can be monitored, measured, and used to derive opportunities
Standardize site metrics for stakeholders, reporting on various KPIs including SLAs, availability, capacity utilization, service metrics and cost utilization
Work closely with DevOps Engineers to automate infrastructure provisioning and deployment processes.

What we offer

Support for Parents
Continuing Education/Professional Development
Employee Health & Well-Being Benefits
Paid Time Off
2 Days a Year to Volunteer
Medical and dental coverage starting day one
Insurance coverage for basic life, accident, short-term and long-term disability
Business travel accident insurance
Employee Stock Purchase Plan (ESPP)
401(k) Retirement Savings Plan

Fulltime

Senior Site Reliability Engineer

Architect, develop, and troubleshoot large-scale infrastructure, maintain and im...

Location

United States , San Francisco

Salary:

180960.00 - 230900.00 USD / Year

Atlassian

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Software Engineering, Information Technology or a closely related field
four years of experience as a Site Reliability Engineer architecting, developing, and troubleshooting large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash
networking technologies such as TCP/IP or security
four years of experience in automation development and infrastructure as code implementation using tools such as Terraform, AWS CloudFormation, Ansible, or Salt
knowledge of Linux and Windows systems
cloud technologies within AWS, GCP, Azure
continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices
must pass technical interview

Job Responsibility

Architect, develop, and troubleshoot large scale infrastructure utilizing programming languages such as PowerShell, Python, or Bash and networking technologies such as TCP/IP or security
provide real-time feedback on production systems
work with product family and platform developers to maintain and improve services and performance with a strong customer focus
utilize a variety of data collection, enrichment, analytics, and visualizations to support our complex systems
responsible for automation development and infrastructure-as-code implementation using tools such as Terraform, AWS CloudFormation, Ansible, and/or Salt
build solutions to enhance availability, performance, and stability for hundreds of Atlassian enterprise customers in the cloud as well as automate repetitive work
help secure the cloud architecture with penetration testing, vulnerability resolution, and compliance audit responses
responsible for continuous integration continuous delivery/deployment (CICD) practices and monitoring and observability practices

What we offer

Health and wellbeing resources
paid volunteer days

Fulltime

Senior Vice President, Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...

Location

Singapore , Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent work experience
8+ years of relevant work experience
Highly motivated self-starter with excellent interpersonal and communication skills. Able to communicate efficiently at multiple levels of seniority
Certification or formal training in site reliability engineering concepts and practices
Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
5+ years experience in Python (preferable) or Java, on large scale systems alongside Linux based scripting languages
Experience working on observability, logging and metrics toolsets
Experience of k8s and container technologies such as Docker, Openshift and EKS.
Experience with public cloud technologies such as AWS, GCP or Azure
Experience with Secrets products such as HashiCorp Vault or CyberArk

Job Responsibility

Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
Architecting and building tools and platforms that provide capabilities for SRE
Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organization
Actively owning production level incidents till resolution.

Fulltime

Vice President - Cloud Security Site Reliability Engineer

This role sits within the Cloud Security team which is responsible for Private a...

Location

Singapore , Singapore

Salary:

Not provided

Citi

Expiration Date

Until further notice

Requirements

Bachelor’s degree or equivalent work experience
6+ years of relevant work experience
Highly motivated self-starter with excellent interpersonal and communication skills. Able to communicate efficiently at multiple levels of seniority
Certification or formal training in site reliability engineering concepts and practices
Prior experience working towards SLIs, SLOs and observability capabilities at a large scale
4+ years experience in Python (preferable) or Java, on large scale systems alongside Linux based scripting languages
Experience working on observability, logging and metrics toolsets
Experience of k8s and container technologies such as Docker, Openshift and EKS
Experience with public cloud technologies such as AWS, GCP or Azure
Experience with Secrets products such as HashiCorp Vault or CyberArk

Job Responsibility

Working across Container products and Secrets products, across Public and Private Cloud, as well as Cloud native specific products
Architecting and building tools and platforms that provide capabilities for SRE
Collaboration with multiple stakeholders and partners across Engineering and Operations as well as partner teams within the wider Citi organisation
Actively owning production level incidents till resolution.

Fulltime

Junior Site Reliability Engineer

As a Jr. Site Reliability Engineer, you will 'make things scale' which includes ...

Location

United Kingdom

Salary:

Not provided

accesso

Expiration Date

Until further notice

Requirements

Some practical exposure to cloud platforms (AWS/Azure/GCP)—coursework, internships, or self-led projects
Ability to self-learn with assistance from Senior Engineers
Basic scripting ability using Python or Bash
Familiarity with basic Linux systems and general command–line
Understanding of Git and basic CI/CD concepts
Good written and verbal communication
customer-focused approach
Ability to work with minimal direction
Willingness to learn, take direction and work within a team

Job Responsibility

Assisting with provisioning and deploying accesso Horizon components to customer cloud accounts using Infrastructure as Code (Terraform)
Help maintain CI/CD pipelines (GitHub Actions) for application and infrastructure deployments
Support monitoring, logging and alerting (Prometheus, Grafana & Coralogix) and respond to basic alerts with supervision
Implement and improve basic automation and scripting
Participate in incident triage, root cause investigation and follow-up tasks
Follow security and compliance requirements for customer cloud environments (identity, secrets, network controls)
Produce and maintain operational runbooks, deployment guides and change notes
Participate in on-call rotation as a L1 responder
Normal workday may require time outside the normal working day
Learn and apply accesso Horizon product architecture and configuration

What we offer

Competitive compensation package including an annual bonus opportunity
8-days of paid bank holiday leave and 26-days of paid annual leave (paid leave increases with tenure)
8 hours of paid Volunteer Time Off
Inclusive Family Benefits, including a $7,500 benefit for surrogacy, adoption, and fertility
Robust health insurance scheme with the opportunity to participate in private medical scheme after satisfactory performance
Matching pension scheme (up to 8%)
Unlimited access to Udemy for Business
Flexible work schedule

Fulltime

Senior Site Reliability Engineer

We are seeking an experienced Senior Site Reliability Engineer (L3) to join our ...

Location

India , Chennai

Salary:

Not provided

Arcadia

Expiration Date

Until further notice

Requirements

Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience
8–10+ years of experience in SRE/DevOps/Cloud Engineering, with deep hands-on exposure to AWS and Kubernetes
Strong hands-on experience with: Terraform & Infrastructure as Code
AWS core services (EKS, IAM, RDS, EC2, VPC, CloudWatch, CloudTrail, GuardDuty)
Jenkins + Groovy, GitHub Actions, ArgoCD, FluxCD
Kubernetes troubleshooting and operations
Prometheus/Grafana/Datadog observability stacks
Proven ability to operate in high-scale, high-uptime, multi-environment production systems
Experience building automation via Python/Bash and reducing operational toil
Strong understanding of incident management, root cause analysis, and reliability engineering principles

Job Responsibility

Design, build, and maintain AWS infrastructure (EKS, VPC, RDS, IAM, CloudWatch, CloudTrail, GuardDuty, Load Balancers, S3, CloudFront) using Terraform and CloudFormation
Lead all aspects of Kubernetes operations including cluster upgrades, performance tuning, CNI troubleshooting, workload scaling, Helm chart packaging, and GitOps deployments
Own and evolve our CI/CD ecosystem across Jenkins (Groovy scripting), GitHub Actions, AWS CodePipeline, ArgoCD, and FluxCD
Improve platform reliability by reducing operational toil through automation, scripting (Python/Bash), and proactive system hardening
Implement and enhance observability across Prometheus, Grafana, Loki, Tempo, Datadog, and CloudWatch—ensuring actionable alerting, dashboards, and metrics alignment with SLO/SLIs
Drive FinOps initiatives, identifying cost inefficiencies and working with engineering teams to implement best practices, tagging standards, budgeting, and resource right-sizing
Manage database operations across MySQL and PostgreSQL including backups, performance tuning, replication, and operational runbooks
Maintain and improve secret management using Vault, AWS Secrets Manager, and Parameter Store
Strengthen cloud security posture with IAM least privilege, CSPM reviews, audit readiness, GuardDuty/CloudTrail monitoring, and environment hardening
Troubleshoot complex production issues across networking, Kubernetes, compute, databases, and CI/CD systems

What we offer

Competitive compensation and employee stock options
Hybrid/remote-first working model (India-based role, with global collaboration)
Flexible leave policy
Comprehensive medical insurance (self + family members)
Annual performance cycle + quarterly recognition awards
A supportive, diverse engineering culture grounded in empathy, teamwork, and innovation

Fulltime

Senior Site Reliability Engineer

HiveWatch is seeking a Staff Site Reliability Engineer to join our Platform Team...

Location

United States , El Segundo

Salary:

183000.00 - 235000.00 USD / Year

HiveWatch

Expiration Date

Until further notice

Requirements

7+ years of software engineering experience with strong coding skills in production environments
5+ years of SRE, DevOps, or production operations experience
Expertise with cloud platforms (AWS preferred) and containerized applications (Docker, Kubernetes)
Experience with Infrastructure as Code (Terraform, CloudFormation, or similar)
Proficiency in at least one object oriented programming language in our tech stack (Java, Kotlin, Python)
Hands-on experience with relational databases and SQL performance optimization
Experience with monitoring and observability tools (Prometheus, Grafana, DataDog, or equivalent)
Strong debugging skills across distributed systems and microservices architectures
Bachelor's degree in Computer Science, Engineering, or equivalent practical experience

Job Responsibility

Own the reliability of mission-critical systems including production monitoring, alerting, and capacity planning
Debug and resolve complex production issues across the full stack, from infrastructure to application code
Participate in a regular on-call rotation to provide 24/7 coverage for critical systems
Perform root cause analysis requiring deep code-level investigation and implement preventive measures
Build automation and tooling to reduce operational toil and improve system reliability
Maintain CI/CD pipelines, observability infrastructure, and database performance optimization
Increase the resiliency, scalability, and maintainability of production environments
Establish on-call procedures and disaster recovery processes
Provide technical leadership and mentorship to foster engineering excellence and reliability culture

What we offer

Comprehensive health coverage: medical, dental, vision, and life insurance
Cutting-edge work in an emerging field with huge growth potential
Competitive compensation packages designed to reward top talent
A modern, newly renovated HQ right on Main Street in El Segundo, CA
401(k) with a 4% company match to help you invest in your future (match launches in 2026)
Flexible paid time off so you can recharge when you need it
Additional benefits include ClassPass credits and a discount on pet insurance
A family-friendly, compassionate culture that values balance and belonging
Eligible to participate in HiveWatch Equity Incentive Plan

Fulltime

Senior Site Reliability Engineer Cloud Platform

Zilliz

Location:

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
December 14, 2025

Looking for more opportunities? Search for other job offers that match your skills and interests.

Similar Jobs for Senior Site Reliability Engineer Cloud Platform

Senior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Vice President, Cloud Security Site Reliability Engineer

Vice President - Cloud Security Site Reliability Engineer

Junior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Site Reliability Engineer Cloud Platform

Zilliz

Location:

Category:IT - Software Development

Contract Type:Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:December 14, 2025

Looking for more opportunities? Search for other job offers that match your skills and interests.

Similar Jobs for Senior Site Reliability Engineer Cloud Platform

Senior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Vice President, Cloud Security Site Reliability Engineer

Vice President - Cloud Security Site Reliability Engineer

Junior Site Reliability Engineer

Senior Site Reliability Engineer

Senior Site Reliability Engineer

Category:
IT - Software Development

Contract Type:
Not provided

Job Posted:
December 14, 2025