Senior Machine Learning Site Reliability Engineer

Prima

Location:
Italy, Milan

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Are you looking for a new challenge? Fancy helping us shape the future of motor insurance? Prima could be the place for you.

Since 2015, we’ve been using our love of data and tech to rethink motor insurance and bring drivers a great experience at a great price. Our story began in Italy, where we’ve quickly become the number one online motor insurance provider. In fact, we’re trusted by over 4 million drivers. And now we’re expanding to help millions more drivers in the UK and Spain.

To help fuel that growth, we need a Senior Machine Learning Site Reliability Engineer to join our Infrastructure team. This team is the beating heart of Prima. You’ll be joining over 300 engineers across software development, infrastructure, operations and security. Fueled by curiosity, experimentation and collaboration, you’ll help deliver scalable, impactful solutions that shape the future of insurance.

Job Responsibility:

  • Hands-on Reliability & System Engineering: Design, build, and operate reliable and scalable systems by defining and monitoring SLOs/SLIs, working directly on production infrastructure, and collaborating closely with software engineers on system design and reliability improvements
  • Automation, Operations & Incident Response: Actively develop automation for infrastructure and operational workflows to eliminate toil and reduce MTTR, participate in and lead incident response, and drive blameless post-incident reviews with concrete follow-ups implemented in code and tooling
  • Performance, Capacity & Security: Continuously analyze and optimize system performance and cost, provide data, insights, and recommendations to inform capacity planning, and support security best practices through hands-on vulnerability remediation and threat mitigation

Requirements:

  • SRE & Cloud Engineering: Hands-on experience with SRE practices in production, strong AWS expertise, Kubernetes, networking, DNS, and Infrastructure as Code (Pulumi preferred, Terraform a plus)
  • Automation, Software Engineering and MLOps: Demonstrate strong software engineering fundamentals with an emphasis on code quality and maintainability. This includes solid Python proficiency and deep knowledge of the Python ecosystem (testing, debugging, packaging), hands-on experience with PySpark, and a consistent focus on writing clean, well-structured, and maintainable code. Familiarity with MLOps practices such as model registries, model versioning, retraining workflows, and end-to-end deployment lifecycles is also expected
  • Reliability, Data & Operations: Lead incident response and RCAs, improve system reliability, and engage stakeholders to propose solutions, share learnings, and mentor others

Nice to have:

  • Regulated Environments & Security: Experience operating in highly regulated industries (e.g. Insurance, Banking, Healthcare), managing sensitive data, and supporting secure networking setups, including exposure to security technologies such as Cloudflare
  • Distributed Systems & Microservices: Strong understanding of microservices architectures, their principles and trade-offs, with the ability to troubleshoot and maintain distributed systems and supporting technologies (RabbitMQ, Kafka, PostgreSQL, Redis)
  • Observability & Platform Operations: Hands-on experience with Datadog for platform and application monitoring, performance optimisation, and solid fundamentals in database structures and operational troubleshooting, with exposure to systems built in languages such as Rust and Elixir

What we offer:

  • Grow with us: access to learning resources, mentorship and a growth plan tailored to you
  • Thrive and perform: private healthcare, gym discounts, wellbeing programs and mental health support

Additional Information:

Job Posted:
January 20, 2026

Employment Type:
Fulltime
Work Type:
Remote work

Similar Jobs for Senior Machine Learning Site Reliability Engineer

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • medical, dental, and vision insurance
  • commuter benefits
  • flexible paid time off (PTO)
  • catered lunch
  • 401(k) matching
  • early-stage equity
Employment Type:
Fulltime

Senior Site Reliability Engineer

The Windows and Devices mission is to create innovative, trusted, and open produ...
Location:
India, Hyderabad
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Master's Degree in Computer Science, Information Technology, or related field AND 6+ years technical experience in software engineering, network engineering, or systems administration
  • OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 8+ years technical experience in software engineering, network engineering, or systems administration
  • OR equivalent experience
  • 3+ years technical experience working with large-scale cloud or distributed systems
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility:
  • Independently designs, creates, tests, and deploys changes through a safe deployment process (SDP) to enhance code quality and improve the observability, security, reliability and operability of platforms, systems, and products at scale
  • Leverages technical expertise in the infrastructure of cloud technologies and specific products to advocate for, or directly contribute to, automation that improves the availability, security, quality, observability, reliability, efficiency, and performance of related sets of products
  • Leverages end-to-end technical expertise and telemetry analysis alongside advanced artificial intelligence (AI) and machine learning (ML) algorithms to identify patterns and opportunities to implement configuration and data changes
  • Shares insights and best practices via documented artifacts that can be applied to improve development and operations across related sets of systems, platforms, and/or products
  • Writes code, scripts, systems, and/or artificial intelligence (AI)/machine learning (ML) platforms to automate operations tasks at scale
  • Develops, maintains, and implements capacity planning models and monitoring tools to forecast product capacity, related security risk, and resource demands
  • Handles incidents during on-call shifts assessing impact, troubleshooting complex problems, taking appropriate action to mitigate impact, and heading investigations to address root cause(s)
  • Leverages existing tools and automation, including the safe deployment process (SDP), to enable product engineering teams within their organization to increase the velocity in which they can reliably and safely implement changes in production
  • Draws insights from performance and resource monitoring across products and services within their organization to identify whether there is a need to optimize algorithms, security, infrastructure, or architecture
  • Analyzes data from telemetry pipelines and monitoring tools that detail operations metrics of systems, platforms, or products operating at scale
Employment Type:
Fulltime

Senior Machine Learning Engineer

Security represents the most critical priorities for our customers in a world aw...
Location:
India, Hyderabad
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Data Science, Engineering, or a related technical field
  • 7+ years of overall experience
  • 5+ years of hands‑on software engineering experience writing production‑quality code
  • 3+ years designing and implementing end‑to‑end software systems
  • Solid understanding of machine learning fundamentals, model evaluation, experimentation, and performance trade‑offs
  • Experience building or operationalizing LLM / Generative AI systems, including RAG, prompt engineering, or agent‑based architectures
  • Proven ability to collaborate across disciplines and operate with autonomy at senior IC scope
Job Responsibility:
  • Design, develop, and deploy AI / ML systems across the full lifecycle, including data ingestion, feature engineering, model training, evaluation, and production integration
  • Build and optimize Generative AI and LLM‑based systems, including agentic workflows, prompt engineering, retrieval‑augmented generation (RAG), and fine‑tuning approaches
  • Write production‑grade code (Python, C#, and/or Java) with a strong focus on scalability, performance, security, testability, and maintainability
  • Partner closely with engineering, product management, and applied science teams to translate business and customer requirements into robust technical solutions
  • Ship and operate large‑scale AI services in cloud environments, with ownership of reliability, latency, throughput, accuracy, and cost efficiency
  • Define and execute model evaluation strategies, including offline experiments, online monitoring, drift detection, bias analysis, and feedback loops
  • Implement MLOps best practices, including CI/CD for models, versioning, rollout strategies, observability, and live‑site monitoring
  • Apply Responsible AI principles—privacy, security, explainability, fairness, and compliance—throughout system design and deployment
  • Stay current with advancements in GenAI, LLM frameworks, and ML infrastructure, assessing feasibility and impact for enterprise security scenarios
  • Contribute technical leadership by reviewing designs, mentoring peers, and raising the overall engineering and scientific bar of the team
Employment Type:
Fulltime

Senior Manager - DevSecOps & Site Reliability Engineering

We’re building a world of health around every individual — shaping a more connec...
Location:
United States
Salary:
118450.00 - 236900.00 USD / Year
CVS Health
Expiration Date:
March 30, 2026
Requirements:
  • 7+ years of DevSecOps & SRE leadership in hybrid enterprise environments
  • Proven release management for multi-team agile trains (7+ teams)
  • Hands-on experience with CI/CD (GitHub/ADO), artifact management, code scanning, and observability stacks
  • Deep knowledge of security frameworks and compliance in healthcare-grade systems
  • Strong coaching, stakeholder management, and executive communication skills
  • 3+ years in change/release management, incident/problem management, and ITIL frameworks
  • Experience with cloud platforms (AWS, Azure, GCP), containerization (Docker, Kubernetes), and observability tools
  • Experience managing vendor/contractor teams
  • Bachelor’s Degree or equivalent work experience in Computer Science, Information Systems, Data Engineering, Data Analytics, Machine Learning, or related field required
Job Responsibility:
  • Architect and implement DevSecOps and SRE practices for hybrid environments (on-prem and cloud)
  • Define and execute DevSecOps strategy, including policy, standards, and guardrails
  • Oversee CI/CD pipelines, security automation, and incident management
  • Build and operate pipelines with integrated quality gates, code scanning (Sonar), secrets management, SBOM, and IaC
  • Lead SRE functions: SLIs/SLOs, error budgets, resilience engineering, performance, and capacity planning
  • Deploy, manage, and optimize observability and event management platforms
  • Stand up metrics, tracing, logging, and immutable logging for governance and audits
  • Coordinate releases across 7+ scrum teams, aligning regression/UAT calendars and compliance gates
  • Lead and govern change/release management processes, including CAB participation and risk mitigation
  • Champion security-by-design: threat modeling, shift-left testing, dependency hygiene, data segmentation, and zero-trust
What we offer:
  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
Employment Type:
Fulltime

Senior Software Engineer

The AI & Innovation team at Microsoft Suzhou is seeking a highly motivated Senio...
Location:
China, Beijing
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science, Electrical Engineering, or related technical field AND 4+ years of technical engineering experience with coding in languages such as Python, C++, or C#
  • OR equivalent industry experience
  • 7+ years of software engineering experience with a focus on AI/ML systems
  • Proven experience with one or more of the following: Developing or applying generative AI models
  • Building and optimizing inference pipelines for large AI models on cloud infrastructure
  • Integrating AI features into consumer-facing web or mobile applications at scale
  • Working with programmatic advertising ecosystems
  • Familiarity with cloud services (Azure preferred), microservices architecture, and DevOps practices
  • Hands-on experience in at least two of the three core areas: AI/ML Prototyping: Experience with deep learning frameworks (PyTorch, TensorFlow) and implementing/tuning models from recent literature
  • Video/Graphics Processing: Experience with video codecs (FFmpeg), computer graphics, GPU programming (CUDA), or real-time media pipelines
Job Responsibility:
  • Rapid AI Prototyping: Design, build, and iterate on high-potential prototypes for AI-powered video generation, editing, and content understanding
  • System Integration & Productionization: Bridge the gap between research prototypes and production-ready systems
  • Integrate AI video generation capabilities with large-scale advertising platforms and consumer products
  • Full-Stack Development: Develop end-to-end solutions encompassing backend AI service APIs, model inference optimization, and frontend interfaces
  • Cross-Functional Collaboration: Work closely with Applied Scientists, Machine Learning Engineers, Product Managers, and Ads Platform teams
  • Technical Leadership: Drive architectural decisions for scalable, reliable, and cost-effective AI service deployment
  • Mentor junior engineers and promote engineering best practices
  • Live Site Ownership: Participate in on-call rotations and act as a Designated Responsible Individual (DRI) to ensure the health, performance, and reliability of services
Employment Type:
Fulltime

Senior Software Engineer

The AI Platform organization at Microsoft builds the end-to-end Azure AI stack/P...
Location:
India, Bangalore
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience writing production code in building internet scale services and distributed systems
  • Ability to debug, read code and work on a large and increasing codebase
  • Engineering knowledge of machine learning systems and data pipelines
  • Experience mentoring other developers, working with partners, and being a team player
  • Excellent communication and presentation skills
Job Responsibility:
  • Design, implement, and support scalable, reliable, high-performance services
  • Write clean and concise code with unit tests
  • Design, implement, and support new features as well as extend existing systems
  • Investigate live site issues and implement and deploy fixes
  • Participate in an on-call rotation
  • Drive quality engineering via code reviews and design discussions
Employment Type:
Fulltime

Senior Software Engineer and Principal Software Engineer - PowerPoint AI Team

The PowerPoint team is embarking on an exciting new chapter - evolving a product...
Location:
United States, Redmond
Salary:
119800.00 - 234700.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
  • 8+ years of experience in backend service engineering, including work on high-scale infrastructures
  • Proficiency in one or more systems programming languages such as C#, C++
  • 1+ years of experience in software engineering, designing and developing systems (and APIs) that deploy and integrate with AI models
  • 2+ years of experience working with rich telemetry, making data driven decisions, and carrying out rapid experimentation
  • 2+ years of experience building software for scale, performance, and reliability
  • Academic or industry experience building, fine-tuning, or deploying models (any category), or building eval-driven systems that use them
Job Responsibility:
  • Lead design and delivery of complex, scalable AI features ensuring resilience and exceptional user experience
  • Drive technical strategy and architecture decisions across multiple services, influencing partner teams and aligning with compliance and security requirements
  • Champion modern engineering practices, including AI-driven approaches, automation, and cloud-native patterns, across the full development lifecycle
  • Mentor and guide engineers, fostering technical excellence and continuous improvement in security, reliability, and performance
  • Collaborate cross-org to solve challenging technical problems, streamline processes, and reduce operational costs while improving live-site health
  • Design and implement scalable backend services optimized for machine learning workflows and large language model integration
  • Develop and maintain evaluation-driven systems that leverage text and multimodal inputs (e.g., images) to power visual-creation experiences
  • Build and optimize APIs and infrastructure to support high-performance model inference and experimentation at scale
  • Collaborate with product, ML, and design teams to integrate models into user-facing features, ensuring seamless functionality and performance
  • Conduct model evaluations and experiments, analyze results, and iterate on improvements to enhance accuracy and user experience
Employment Type:
Fulltime

Senior Software Engineer

Security represents the most critical priorities for our customers in a world aw...
Location:
India, Hyderabad
Salary:
Not provided
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • 8+ years of software development experience
  • Strong skills in distributed systems and data processing
  • Hands-on experience with cloud platforms such as Azure, AWS, or Google Cloud; experience with Azure Services is a plus
  • Solid understanding of Object-Oriented Programming (OOP) and common Design Patterns
  • Excellent communication and collaboration abilities, with the capacity to handle ambiguity and prioritize effectively
  • BS or MS degree in Computer Science or Engineering, or equivalent work experience
Job Responsibility:
  • Build cloud-scale services that process and analyze massive volumes of organizational signals in real time
  • Harness the power of Apache Spark for high-performance data processing and scalable pipelines
  • Apply machine learning to uncover subtle patterns and anomalies that signal insider threats
  • Craft intelligent user experiences using React and AI-driven insights to help security analysts act with confidence
  • Collaborate across disciplines—from data science to UX to cloud infrastructure—in a fast-paced, high-impact environment
  • Design and deliver end-to-end features including system architecture, coding, deployment, scalability, performance, and quality
  • Ensure engineering excellence by writing effective code, unit tests, debugging, code reviews, and building CI/CD pipelines
  • Troubleshoot and optimize Live Site operations, focusing on automation, reliability, and monitoring
Employment Type:
Fulltime