Site Reliability Engineer (SRE) – ML Platform

Thirdeye Data

Location:
Sunnyvale, United States

Contract Type:
Not provided

Salary:
Not provided

Job Responsibility:

  • Implement continuous deployment using GitHub Actions, Flux, and Kustomize
  • Design and implement cloud solutions and build MLOps pipelines on AWS
  • Containerize and deploy data science models using Docker, vLLM, and Kubernetes
  • Communicate with a team of data scientists, data engineers, and architects, and document the processes
  • Develop and deploy scalable tools and services for our clients to handle machine learning training and inference
  • Apply knowledge of ML models and LLMs

Requirements:

  • 6+ years of experience in MLOps with strong knowledge of Kubernetes, Python, MongoDB, and AWS
  • Good understanding of Apache Solr
  • Proficient with Linux administration
  • Knowledge of ML models and LLMs
  • Ability to understand tools used by data scientists and experience with software development and test automation
  • Ability to design and implement cloud solutions and to build MLOps pipelines on AWS
  • Experience working with cloud computing and database systems
  • Experience building custom integrations between cloud-based systems using APIs
  • Experience developing and maintaining ML systems built with open-source tools
  • Experience with MLOps frameworks such as Kubeflow, MLflow, DataRobot, and Airflow, as well as Docker and Kubernetes
  • Experience developing containers and working with Kubernetes in cloud computing environments
  • Familiarity with one or more data-oriented workflow orchestration frameworks (Kubeflow, Airflow, Argo, etc.)
  • Ability to translate business needs to technical requirements
  • Strong understanding of software testing, benchmarking, and continuous integration
  • Exposure to machine learning methodology and best practices
  • Good communication skills and ability to work in a team

Additional Information:

Job Posted:
December 26, 2025

Employment Type:
Full-time
Work Type:
On-site work