Staff Engineer, Site Reliability Job at LearnUpon (Dublin)

Staff Site Reliability Engineer

At Ledger, we are looking for an experienced Reliability Engineer to join our SR...

Location

France, Paris

Salary:

Not provided

Ledger

Expiration Date

Until further notice

Requirements

8+ years of cloud engineering at scale, in organizations operating SaaS solutions
Proficiency in working in Unix/Linux environments, Git, Python, Terraform, Kubernetes, AWS cloud solutions and architectures, CI/CD tools, Argo CD, Ansible, configuration management, etc.
Strong knowledge of observability practices, with experience implementing and managing logging, monitoring, and alerting frameworks using solutions such as Datadog or Prometheus/Grafana/Loki
Experience with cross-functional work and a demonstrated collaborative approach to building key relationships across the organization and defining project scope, goals, plans, and deliverables
Customer-focused, with the ability to identify and understand the needs of both internal and external customers
Creative problem-solving and analysis skills with an ability to identify, develop, and implement solutions to meet the needs of the business
Excellent presentation and written communication
Ability to deal with ambiguity, high levels of pressure, and rapidly changing environments
Engineering degree.

Job Responsibility

Participate in building a DevOps / SRE culture and enable the transition to modern infrastructure management and deployment practices
Participate in building the SRE team roadmap (vision and delivery accountability). Anticipate stakeholder needs and the emergence of game-changing technologies, and challenge scope and deadlines
Perform integration of platform software components
Participate in designing and delivering solutions to improve the availability, scalability, latency, and efficiency of systems
Influence and create standards & best practices in support of service level objectives
Automate key SRE metrics including SLOs/SLAs and error budgets
Provide expert support to our level-2/application support team, to troubleshoot priority incidents, and conduct post-mortems
Apply analytics on past incidents and usage patterns to predict issues and take proactive actions
Ensure control of technical debt and promote quality practices
Apply SRE and chaos engineering approaches across all strategic systems, in coordination with Service Design, to predict and prevent outages and improve solution availability

What we offer

Equity: Employees are the foundation of our success, and we award stock options so you can share in that success as we grow
Flexibility: A hybrid work policy
Social: Annual company outing for Ledgerdary Days, plus frequent social events, snacks and drinks
Medical: Comprehensive health insurance policy offering extensive medical, dental and vision care coverage
Well-being: Personal development, coaching & fitness with our dedicated partners
Vacation: Five weeks of paid leave per year, in addition to national holidays and rest & relaxation (RTT) days
High tech: Access to high performance office equipment and gadgets, including Apple products
Transport: Ledger reimburses part of your preferred means of transportation
Discounts: Employee discount on all our products.

Fulltime

Staff Site Reliability Engineer

We are looking for a Site Reliability Engineer to own our internal systems infra...

Location

United States, Sunnyvale

Salary:

175000.00 - 250000.00 USD / Year

Figure

Expiration Date

Until further notice

Requirements

Strong experience with Linux/Unix systems administration
Proficiency in programming/scripting
Extensive experience with cloud platforms (Azure, AWS, GCP) and on-prem hardware architectures
Experience designing, deploying, and operating high-availability, fault-tolerant, and distributed systems
Mastery of infrastructure as code (Terraform, CloudFormation, Ansible…)
Familiarity with monitoring, logging, and alerting tools (Prometheus, Grafana, Datadog…)
Solid understanding of networking fundamentals (TCP/IP, DNS, HTTP, load balancers, firewalls)
Experience defining Service Level Objectives (SLO), developing runbooks/incident response plans, facilitating post-mortems and managing systems assets
Ability to work in cross-functional teams with developers, infra, and product teams
Excellent verbal and written communication skills

Job Responsibility

Be the go-to person for mission-critical infrastructure enabling critical operations such as Source Configuration Management, CI/CD systems, software distribution, supplier portals, manufacturing, and more
Migrate SaaS to self-hosted solutions to enhance security and reliability
Implement monitoring and alerting systems, and define incident response plans and runbooks
Reduce manual workload by automating deployment and scaling
Establish strong relationships with stakeholders to identify infrastructure needs and establish Service Level Objectives
Use a data-driven approach to demonstrate service robustness and track optimization work
Partner with the security team to ensure that security remediations and updates are applied in a timely manner

Fulltime

Staff Site Reliability Engineer

Affirm is reinventing credit to make it more honest and friendly, giving consume...

Location

Spain

Salary:

101000.00 - 131000.00 EUR / Year

Affirm

Expiration Date

Until further notice

Requirements

8+ years of experience designing, developing, and launching backend systems at scale, and serving as a subject-matter point of reference for them, using scripting and development languages like Bash, Python, or Kotlin
Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark, and Kubernetes
Track record of managing, driving, and improving the incident lifecycle process, from live incident management through retrospective and post-incident analysis, to provide actionable insights that enhance overall system reliability, resilience, and performance
7+ years of experience in Site Reliability or Production Engineering teams
Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
Ability to write high quality code that is easily understood and used by others
Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization
Equivalent practical experience or a Bachelor’s degree in a related field
Based in Spain for the role

Job Responsibility

Set the technical strategy and vision for your team on a multi-year time scale, and help your team tie it together with critical, business-impacting projects
Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience, and analytics to ensure technical sustainability, risks, and trade-offs are well understood and managed
Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
Help develop talent on your team by providing feedback and guidance, and leading by example
Participate in an on-call rotation

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental benefit
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount
Visa sponsorship

Fulltime

Staff Site Reliability Engineer

Site Reliability Engineering at Affirm is a small, yet crucial, team that helps ...

Location

Poland

Salary:

358000.00 - 458000.00 PLN / Year

Affirm

Expiration Date

Until further notice

Requirements

8+ years of experience designing, developing, and launching backend systems at scale, and serving as a subject-matter point of reference for them, using scripting and development languages like Bash, Python, or Kotlin
Extensive track record of developing highly available distributed systems using technologies like AWS, MySQL, Spark, and Kubernetes
Track record of managing, driving, and improving the incident lifecycle process, from live incident management through retrospective and post-incident analysis, to provide actionable insights that enhance overall system reliability, resilience, and performance
7+ years of experience in Site Reliability or Production Engineering teams
Demonstrate curiosity with empathy, and strong opinions loosely held
Experience delivering major features, system components or deprecating existing functionality in a system through the definition of a technical and execution plan
Write high quality code that is easily understood and used by others
Thrive in ambiguity, and are comfortable moving from low level language idioms all the way to the architecture of large systems to understand how they work
Growth and impact trajectory demonstrates that you have mastered gathering and iterating on feedback from your engineering and cross-functional peers
Strong verbal and written communication skills that support effective collaboration with our global engineering team and key stakeholders of an organization

Job Responsibility

Set the technical strategy and vision for your team on a multi-year time scale, and help your team tie it together with critical, business-impacting projects
Collaborate across teams in the product development lifecycle, working with infrastructure, product management, developer experience, and analytics to ensure technical sustainability, risks, and trade-offs are well understood and managed
Act as a force-multiplier for your team through your definition and advocacy of technical solutions and operational processes
Take ownership of your team’s operations and availability by ensuring you have the right monitoring, triage rotations, playbooks, policies, testing and alerting in place to support “keep the lights on” & on-call efforts
Foster a culture of quality and ownership on your team by setting code review and design standards for your team, and advocating for them beyond your team through your writing and tech talks
Help develop talent on your team by providing feedback and guidance, and leading by example

What we offer

Flexible Spending Wallets for tech, food and lifestyle
Away Days - wellness days to take off work and recharge
Learning & Development programs
Parental leave
Employee Resource & Community Groups
Health care coverage - Affirm covers all premiums for all levels of coverage for you and your dependents
Flexible Spending Wallets - generous stipends for spending on Technology, Food, various Lifestyle needs, and family forming expenses
Time off - competitive vacation and holiday schedules allowing you to take time off to rest and recharge
ESPP - An employee stock purchase plan enabling you to buy shares of Affirm at a discount

Fulltime

Staff Platform Engineer

Join our dynamic team as a Compute Platform Engineer and play a pivotal role in ...

Location

United States, Mountain View, California

Salary:

180000.00 - 280000.00 USD / Year

Inworld AI

Expiration Date

Until further notice

Requirements

7 years of experience in software engineering
5 years of experience with infrastructure-as-code
Proficiency in managing Kubernetes clusters and applications, including creating Kustomize manifests/Helm charts for new applications
Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
Candidates must be based in the SF Bay Area or willing to relocate (you will be working on-site in our South Bay office a few days a week)

Job Responsibility

Work closely with backend and ML engineering teams to design, deploy, and maintain reliable, high-performance, and secure cloud infrastructure for our AI engine and Studio
Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
Identify and implement opportunities to enhance engineering speed and efficiency
Conduct root cause analysis to identify critical issues and develop automated solutions to prevent recurrence
Develop and share best practices to improve automation and efficiency across our engineering teams

What we offer

Equity and benefits

Fulltime

Staff Software Engineer

As a Staff Forward Deployed Engineer (FDE) at Invisible, you'll lead high-impact...

Location

United States, Austin; New York; San Francisco Bay Area; Washington DC–Baltimore

Salary:

213000.00 - 300000.00 USD / Year

Invisible Technologies

Expiration Date

Until further notice

Requirements

8+ years of software engineering experience, including significant time spent building data, ML, or backend systems
Deep proficiency in Python with hands-on experience using Hugging Face, LangChain, OpenAI, Pinecone, and related ecosystems
Skilled in full-stack and API-based deployment patterns, including Docker, FastAPI, Kubernetes, and cloud environments (GCP, AWS)
Experienced with workflow orchestration libraries, pub/sub systems (Kafka), and schema governance
Expertise in data governance and operations, including Unity Catalog and policy management, cluster/job orchestration, data contracts and quality enforcement, Delta/ETL pipelines, and replay processes
Strong product and system design instincts — you understand business needs and how to translate them into technical architecture
Experience building usable systems from messy data and ambiguous requirements
Excellent communication and client-facing skills: you’ve led conversations with technical and non-technical stakeholders alike
Proven experience owning projects from scoping through deployment in ambiguous, high-stakes environments

Job Responsibility

Partner with delivery and executive stakeholders to scope, design, and lead implementation of AI-driven solutions
Identify transformational opportunities in messy, ambiguous workflows and turn them into repeatable systems
Lead architecture design and trade-off discussions across performance, scalability, cost, and reliability
Own projects from first discovery call through full deployment — including client-facing delivery, internal coordination, and post-launch iteration
Build shared infrastructure, reusable components, and internal playbooks to level up the team
Coach and mentor mid-level engineers and help shape the culture of forward-deployed AI engineering at Invisible

What we offer

bonus
equity
benefits

Fulltime

Staff Software Engineer, Forward Deployed

As a Staff Forward Deployed Engineer (FDE) at Invisible, you'll lead high-impact...

Location

United Kingdom, London

Salary:

Not provided

Invisible Technologies

Expiration Date

Until further notice

Requirements

5+ years of software engineering experience, including significant time spent building data, ML, or backend systems
Deep proficiency in Python and experience with ML/LLM frameworks such as Hugging Face, LangChain, OpenAI, Pinecone, etc.
Familiarity with full-stack or API-based deployment patterns (Docker, FastAPI, Kubernetes, GCP/AWS)
Strong product and system design instincts — you understand business needs and how to translate them into technical architecture
Experience building usable systems from messy data and ambiguous requirements
Excellent communication and client-facing skills: you’ve led conversations with technical and non-technical stakeholders alike
Proven experience owning projects from scoping through deployment in ambiguous, high-stakes environments
Willingness to be on call for our customers when situations arise
Ability to travel roughly 25–50% of the time, sometimes at short notice, primarily across Europe with occasional international roll-outs, to work directly on-site with clients

Job Responsibility

Partner with delivery and executive stakeholders to scope, design, and lead implementation of AI-driven solutions
Identify transformational opportunities in messy, ambiguous workflows and turn them into repeatable systems
Lead architecture design and trade-off discussions across performance, scalability, cost, and reliability
Own projects from first discovery call through full deployment — including client-facing delivery, internal coordination, and post-launch iteration
Build shared infrastructure, reusable components, and internal playbooks to level up the team
Coach and mentor mid-level engineers and help shape the culture of forward-deployed AI engineering at Invisible

What we offer

Bonuses and equity are included in offers above entry level

Fulltime

Staff Observability Operations Engineer

We are currently seeking several experienced and highly skilled Staff Observabil...

Location

United States, Hartford

Salary:

130295.00 - 260590.00 USD / Year

CVS Health

Expiration Date

Until further notice

Requirements

7+ years of experience in IT operations, with significant responsibilities in system monitoring, performance tuning, and troubleshooting enterprise applications
5+ years in a Site Reliability Engineering (SRE) role deploying and managing modern observability solutions
5+ years managing and implementing observability and event management platforms (e.g., AppDynamics, Splunk, Prometheus, Grafana)
Experience developing and administering ServiceNow ITOM event management solutions
Experience deploying and managing service reliability platforms (e.g., xMatters, OpsGenie, PagerDuty)
Experience with and deep knowledge of cloud environments, cloud monitoring platforms, and container orchestration tools (e.g., AWS/CloudTrail, Azure/Monitor, GCP/GCM, Kubernetes, OpenShift)
Proficiency in Python and other scripting and automation tools such as Ansible, PowerShell, and Bash for automation and configuration
Hands-on experience deploying, managing, and administering observability platforms
Hands-on experience leading, coordinating, and performing migration of application, platform, and infrastructure observability solutions
Proven ability to troubleshoot and resolve complex technical issues

Job Responsibility

Deploy and implement modern observability solutions
Manage and administer observability and event management platforms
Coordinate and manage release cycles for observability platforms
Troubleshoot and resolve incidents related to observability platforms
Continuously monitor and enhance platform performance
Collaborate with cross-functional stakeholders
Provide training and mentoring to junior engineers
Ensure compliance and security of observability platforms
Maintain documentation of observability platform configurations
Generate and analyze reports on platform performance and capacity

What we offer

Affordable medical plan options
A 401(k) plan (including matching company contributions)
An employee stock purchase plan
No-cost programs for all colleagues, including wellness screenings, tobacco cessation, and weight management programs
Confidential counseling and financial coaching
Paid time off
Flexible work schedules
Family leave
Dependent care resources
Colleague assistance programs

Fulltime

Staff Engineer, Site Reliability

LearnUpon

Location:
Ireland, Dublin

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
December 09, 2025


