Software Engineer - Cloud FinOps & Reliability

Luma AI

Location:
United States, Palo Alto

Contract Type:
Not provided

Salary:

120000.00 - 255000.00 USD / Year

Job Description:

This is a foundational engineering position for a technical, data-driven expert who gets excited about optimization at a massive scale. As a foundational member of our SRE team, you will specialize in FinOps and cloud cost management, owning the financial health of one of the world's largest multi-cloud GPU infrastructures. You will be an SRE who applies a deep understanding of cloud architecture and pricing models to find and eliminate inefficiency. You will use your software engineering skills to build the tools and automation required to govern our cloud spend, providing critical insights that allow us to scale our AI research and products sustainably.

Job Responsibility:

  • Analyze & Optimize: Actively monitor and analyze costs across our entire technical ecosystem—including multi-cloud infrastructure (AWS, GCP, OCI), on-premise clusters, and third-party services—to identify and execute on opportunities for cost optimization. Develop forecasting models to predict future spend and inform our capacity planning
  • Manage & Commit: Develop and actively manage a multi-million dollar portfolio of Reserved Instances (RIs) and Savings Plans to maximize commitment-based discounts across our global GPU and CPU fleets
  • Automate & Build: Apply a software engineering approach to design, build, and maintain custom tools and automation in Python and SQL. Your systems will track, analyze, and report on costs across our entire fleet of providers and services, with a focus on detecting anomalies immediately
  • Partner & Advise: Working closely as an embedded member of the SRE team, you will partner with fellow SREs and research teams to model the cost implications of new models and infrastructure designs, providing expert guidance on cost-performance trade-offs
  • Visualize & Report: Create and manage a centralized observability stack for cloud costs, building dashboards in tools like Grafana to give a real-time, granular view of our financial posture to all stakeholders

Requirements:

  • 5+ years of experience in a technical role such as Site Reliability Engineer, DevOps Engineer, Infrastructure Engineer, or a dedicated Cloud Cost Engineer
  • Deep, hands-on expertise with the cost models and optimization levers of at least one major cloud provider (AWS, GCP), and a willingness to learn others
  • Proficient in Python for the purpose of scripting, data analysis, and building automation tooling
  • Strong, foundational understanding of cloud infrastructure, including containerization (Docker, Kubernetes), networking, and storage
  • Not an accountant: you are a systems thinker who is passionate about applying engineering principles to solve financial challenges at scale
  • A tenacious troubleshooter and a data-driven decision-maker who thrives on finding the 'why' behind the numbers

Nice to have:

  • Experience managing a monthly cloud spend in excess of $1 million
  • Relevant certifications, such as the FinOps Certified Practitioner (FOCP)
  • Experience building custom cost allocation, showback, or chargeback systems from scratch
  • A background working with large-scale GPU clusters for AI/ML workloads

Additional Information:

Job Posted:
January 13, 2026

Employment Type:
Fulltime
Work Type:
Hybrid work

Similar Jobs for Software Engineer - Cloud FinOps & Reliability

Cloud Engineering Manager - FinOps

This role combines technical expertise, leadership, and operational excellence t...
Location:
India, Bangalore
Salary:
Not provided
Company:
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Proven expertise in cloud platforms (e.g., AWS, Azure, Google Cloud) and cloud-native technologies
  • Strong knowledge of FinOps principles and cloud financial management, including cost optimization, forecasting, and governance
  • Experience with application development frameworks (e.g., Node.js, Python, Java) and modern software engineering practices
  • Familiarity with cloud monitoring and cost management tools, such as AWS Cost Explorer, Azure Cost Management, or third-party FinOps platforms (e.g., CloudHealth, Apptio)
  • Proficiency in containerization and orchestration technologies such as Docker and Kubernetes
  • Demonstrated success in leading engineering teams, managing priorities, and delivering complex projects on time and within budget
  • Strong collaboration skills, with the ability to work effectively across engineering, finance, and business teams
  • Exceptional ability to communicate technical concepts to non-technical stakeholders and align engineering efforts with business goals
  • Bachelor’s or master’s degree in computer science, engineering, information systems, or related field
  • Typically, 7-10 years’ experience, including 0-2 years of people management experience
Job Responsibility:
  • Lead and inspire a team of cloud engineers focused on FinOps application development, fostering a culture of innovation, collaboration, and continuous improvement
  • Drive the design, development, and implementation of cloud engineering applications that enable visibility, optimization, and governance of cloud costs and usage
  • Architect scalable, secure, and resilient solutions that align with FinOps principles (e.g., cost optimization, forecasting, usage analytics)
  • Collaborate with product managers and business stakeholders to define requirements, prioritize features, and deliver value-driven solutions
  • Ensure seamless integration of FinOps applications with existing HPE cloud platform tools and systems
  • Lead efforts to optimize cloud infrastructure costs and usage patterns across HPE's cloud platforms, leveraging advanced analytics and automation
  • Establish and enforce engineering best practices, including CI/CD pipelines, DevSecOps principles, and automated testing frameworks
  • Monitor and improve application performance, reliability, and scalability through proactive measures and robust incident management
  • Collaborate with finance teams to ensure compliance with cloud spending policies and reporting requirements
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Fulltime

Staff Platform Software Engineer

EarnIn is seeking a Staff Platform Engineer to lead the strategic design, automa...
Location:
India, Bengaluru
Salary:
Not provided
Company:
EarnIn
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s Degree in Computer Science or equivalent industry experience
  • 7+ years of experience in cloud infrastructure, managing large-scale, high-availability, customer-facing distributed systems
  • Proven experience mentoring and guiding senior engineers, driving technical decisions, and leading company-wide cloud initiatives
  • Mastery of public cloud providers, specifically AWS (EKS, DynamoDB, Aurora, Kinesis, etc.)
  • Strong expertise in containerized microservices running on Kubernetes
  • Deep knowledge of automation and configuration management tools (Terraform, Ansible)
  • Expertise in CI/CD pipelines and tools, including Jenkins, GHA, Argo CD, Spinnaker, and FluxCD or similar
  • Experience with advanced observability tools (DataDog, CloudWatch)
  • Track record of leading cost optimization / FinOps initiatives, performance tuning, and operational excellence projects
  • Proven ability to drive cross-functional initiatives with engineering, product, and business teams
Job Responsibility:
  • Serve as a key architect and thought leader in the cloud infrastructure domain, guiding the team on best practices
  • Mentor and coach senior engineers across the company in advanced cloud operations practices
  • Provide oversight of hosted Linux and Windows systems, networks, databases, and applications, identifying and solving critical performance, scalability, and stability challenges
  • Design and develop reusable components and operational strategies to enhance the scalability, performance, and monitoring of cloud systems
  • Collaborate with other senior engineers to create technical solutions that address company-wide cloud challenges
  • Lead the establishment and continuous evolution of infrastructure-as-code best practices, driving automation, self-healing, and security standards
  • Drive operational cost savings through service optimizations, autoscaling strategies, and distributed processing architectures
  • Collaborate closely with cross-functional teams, including security, engineering, and business teams, to ensure that operational strategies align with company-wide objectives
  • Provide thought leadership in company-wide initiatives such as observability, automation, and disaster recovery
  • Continuously evaluate existing tools and processes, and lead efforts to socialize, present, and implement enhancements for optimal operational efficiency
What we offer:
  • healthcare
  • internet/cell phone reimbursement
  • a learning and development stipend
  • opportunities to travel to our Mountain View HQ
Employment Type:
Fulltime

Executive Director, Digital SRE & Operations

We’re building a world of health around every individual — shaping a more connec...
Location:
United States, Austin, Texas
Salary:
175100.00 - 334750.00 USD / Year
Company:
CVS Health
Expiration Date:
March 31, 2026
Requirements:
  • 18+ years of experience in software engineering, platform operations, or site reliability engineering
  • 8+ years leading large-scale SRE, DevOps, or platform reliability organizations
  • Experience leveraging AI/ML for operations, including anomaly detection, predictive alerts, log analysis, or automated remediation
  • Familiarity with AIOps tools such as Datadog Watchdog, Dynatrace Davis, Splunk AI, Elastic AIOps, or custom ML/LLM solutions
  • Understanding of how to safely operate and monitor AI-enabled production systems
  • Deep expertise in distributed systems, cloud infrastructure, and high-availability architectures
  • Strong knowledge of SRE principles, DevOps, and reliability engineering at scale
  • Experience implementing AIOps or AI-driven operational tooling
  • Executive-level communication skills with the ability to influence senior leaders and business stakeholders
  • Experience operating mission-critical digital platforms serving millions of users
Job Responsibility:
  • Define and own the enterprise SRE strategy, including SLOs, SLIs, error budgets, and reliability roadmaps
  • Establish reliability standards and practices across web, mobile, backend services, APIs, data platforms, and AI workloads
  • Drive a culture of reliability-by-design and operational excellence across engineering teams
  • Lead adoption of AIOps capabilities for proactive issue detection, alert noise reduction, and predictive failure prevention
  • Implement AI-assisted incident triage, automated runbooks, root-cause analysis, and self-healing systems
  • Partner with the AI Platform team to integrate LLMs and ML models into operational workflows (log summarization, anomaly detection, remediation)
  • Own enterprise observability strategy across metrics, logs, traces, and user experience monitoring
  • Standardize tooling and practices using platforms such as Datadog, Splunk, Prometheus, Grafana, OpenTelemetry
  • Deliver real-time dashboards and executive reporting on uptime, performance, latency, and error budgets
  • Partner with DevOps and Platform teams to ensure safe, automated, and scalable CI/CD pipelines
What we offer:
  • Affordable medical plan options
  • 401(k) plan (including matching company contributions)
  • Employee stock purchase plan
  • No-cost programs for all colleagues including wellness screenings, tobacco cessation and weight management programs, confidential counseling and financial coaching
  • Paid time off
  • Flexible work schedules
  • Family leave
  • Dependent care resources
  • Colleague assistance programs
  • Tuition assistance
Employment Type:
Fulltime

Distinguished Engineer

At GEICO, we offer a rewarding career where your ambitions are met with endless ...
Location:
United States, Chevy Chase
Salary:
150000.00 - 300000.00 USD / Year
Company:
Geico
Expiration Date:
Until further notice
Requirements:
  • 10+ years of professional experience in software engineering
  • 8+ years of experience with architecture and design
  • 6+ years of experience in open-source frameworks
  • 4+ years of experience with AWS, GCP, Azure, or another cloud service
  • Bachelor's degree in computer science, Information Systems, or equivalent education or work experience
  • Deep hands-on experience building complex distributed systems that process large-scale telemetry, and architectures that support that scale and performance, with strong knowledge of Docker and Kubernetes
  • Advanced knowledge of at least two OOP languages such as Java, Go, or Python
  • Strong understanding of open-source databases such as MySQL and PostgreSQL, and a strong foundation with NoSQL databases such as ClickHouse, Cassandra, and Apache Trino. Knowledge of big data formats such as Parquet or Avro
  • Experience architecting, designing, and building observability platform solutions and advanced data analytics using open-source technologies is a big plus
  • Experience building distributed systems
Job Responsibility:
  • Develop and drive the overall tech strategy for the Reliability and observability tools organization, and report to the Senior Director
  • Focus on multiple areas and provide technical and thought leadership as Observability Domain Technical Champion
  • Collaborate with product managers, team members, customers, and other engineering teams to solve our toughest problems
  • Develop and execute technical software development strategy for the Observability Engineering domain
  • Accountable for the quality, usability, and performance of the solutions
  • Be a role model and mentor, helping to coach and strengthen the technical expertise and know-how of our engineering and product community. Influence and educate executives
  • Consistently share best practices and improve processes within and across teams
  • Lead the design and architecture of resilient and scalable systems, considering both on-premises and cloud-based solutions
  • Develop and maintain comprehensive incident response plans to address various disaster scenarios on our backup/restore systems
  • Conduct regular simulations and drills to ensure the readiness of the organization in the event of a disaster
What we offer:
  • Comprehensive Total Rewards program that offers personalized coverage tailor-made for you and your family’s overall well-being
  • Financial benefits including market-competitive compensation, a 401K savings plan vested from day one that offers a 6% match, performance and recognition-based incentives, and tuition assistance
  • Access to additional benefits like mental healthcare as well as fertility and adoption assistance
  • Support for flexibility: we provide workplace flexibility as well as our GEICO Flex program, which offers the ability to work from anywhere in the US for up to four weeks per year
Employment Type:
Fulltime

Senior Infrastructure Software Engineer, Cloud Foundation

As a Senior Infrastructure Software Engineer on the Cloud Platform org, you will...
Location:
Poland
Salary:
314500.00 - 425500.00 PLN / Year
Company:
Dropbox
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience as a backend, platform, or infrastructure engineer, with a proven track record of building scalable, reliable systems
  • Proficiency in backend development with Golang and Python (required)
  • Hands-on experience deploying and managing production workloads in public cloud environments (AWS and/or Azure)
  • Expertise with infrastructure-as-code (Terraform, CDK) and automation of cloud infrastructure configuration
  • Strong knowledge of public cloud architecture best practices, including AWS Well-Architected principles, networking, and identity/access management
  • Ability to design and implement technical solutions that translate business and product requirements into efficient cloud-based architectures
  • Effective communication skills for cross-functional collaboration and driving alignment on cloud solutions
Job Responsibility:
  • Design and build highly available, scalable services that provision and seamlessly integrate secure public cloud infrastructure
  • Partner with security and network engineering teams to define clear requirements and set standards for public cloud usage across Dropbox
  • Collaborate with Capacity Engineering to integrate supply-side capabilities with FinOps tooling and processes
  • Document, share, and promote best practices to help product engineering teams succeed in public cloud environments
  • Shape technical direction at an organizational level by translating business and technical constraints into actionable roadmaps, and driving alignment across the platform org
  • Provide technical guidance and mentorship to junior engineers via code review and design docs
  • Contribute to the evolution of Dropbox’s infrastructure stack by improving code quality and system reliability
What we offer:
  • Competitive medical, dental and vision coverage
  • Retirement savings through a defined contribution pension or savings plan
  • Flexible PTO/Paid Time Off, paid holidays, Volunteer Time Off, and more
  • Income Protection Plans: Life and disability insurance
  • Business Travel Protection: Travel medical and accident insurance
  • Perks Allowance to be used on what matters most to you
  • Parental benefits including: Parental Leave, Fertility Benefits, Adoptions and Surrogacy support, and Lactation support
  • Mental health and wellness benefits
Employment Type:
Fulltime

Senior Software Engineer, Capacity & Efficiency

Join us in building the future of finance. Our mission is to democratize finance...
Location:
United States, Bellevue
Salary:
196000.00 - 230000.00 USD / Year
Company:
Robinhood
Expiration Date:
Until further notice
Requirements:
  • Experience building and operating production systems in a cloud-native environment, ideally on AWS
  • Strong proficiency with Kubernetes and a practical understanding of resource efficiency and capacity planning
  • Experience working on infrastructure, platform, or data-heavy systems where cost, scale, and reliability matter
  • Ability to reason about cloud cost models, including tradeoffs between performance, reliability, and spend
  • Clear communication skills and comfort working with partner teams to explain findings and technical recommendations
Job Responsibility:
  • Build and maintain software systems that detect, track, and attribute AWS cloud costs to the correct teams, services, and workloads
  • Develop tooling to identify cost anomalies, regressions, and over-provisioned resources across Kubernetes and managed services
  • Partner with Data Science to support forecasting models, unit economics, and projections that surface future cost risks
  • Analyze infrastructure usage patterns to identify inefficiencies and implement technical solutions that reduce cloud spend
  • Collaborate with partner teams to land efficiency improvements, validate cost reductions, and track remediation outcomes
What we offer:
  • Performance-driven compensation with multipliers for outsized impact, bonus programs, equity ownership, and 401(k) matching
  • 100% paid health insurance for employees with 90% coverage for dependents
  • Lifestyle wallet — a highly flexible benefits spending account for wellness, learning, and more
  • Employer-paid life & disability insurance, fertility benefits, and mental health benefits
  • Time off to recharge including company holidays, paid time off, sick time, parental leave, and more
  • Exceptional office experience with catered meals, events, and comfortable workspaces
Employment Type:
Fulltime

Principal Azure DevOps Engineer

We are looking to recruit an SC Cleared Principal Azure DevOps Engineer for a le...
Location:
United Kingdom
Salary:
80000.00 - 90000.00 GBP / Year
Company:
DataCareers
Expiration Date:
Until further notice
Requirements:
  • Extensive experience in Azure services and architecture (VMs, EntraID, Application Gateway, Sentinel, Defender for Cloud, Azure Fabric, Functions, Logic Apps, Front Door, App Service, Dev Box, Azure Migrate)
  • Strong expertise in Azure DevOps, GitHub CI/CD, and build/release automation
  • Proficiency with Infrastructure as Code (Terraform, Pulumi, CloudFormation, PowerShell)
  • Experience deploying solutions in AWS is desirable
  • Familiarity with containerization and orchestration (Docker, Kubernetes) and automation/configuration tools (Ansible)
  • Strong scripting skills (PowerShell, Bash, Python)
  • Experience with monitoring and observability tools (Grafana, Azure Monitor, DataDog, New Relic)
  • Deep understanding of cloud security, governance, and FinOps principles
  • Solid Windows, Linux, and Microsoft 365 design and implementation experience
  • Proven experience migrating databases (e.g., MS SQL) in cloud environments
Job Responsibility:
  • Lead the design and implementation of cloud infrastructure and DevOps processes across client projects
  • Act as a technical advisor for cloud engineers, providing guidance on CI/CD automation, container orchestration, and platform reliability
  • Design, document, and maintain secure technical and security architectures aligned with best practices
  • Collaborate with Architecture, Security, Software Engineering, and Product teams to align cloud platform strategy
  • Drive improvements in automation, infrastructure as code, and overall DevOps maturity across projects
  • Mentor and coach engineering teams to adopt modern engineering practices and automation strategies
  • Deliver large-scale infrastructure transformation projects with low-level design expertise
  • Stay ahead of emerging technologies, applying them to deliver maximum client value
Employment Type:
Fulltime

Senior Azure DevOps Engineer

We are looking to recruit an SC Cleared Senior Azure DevOps Engineer for a leadi...
Location:
United Kingdom
Salary:
80000.00 - 90000.00 GBP / Year
Company:
DataCareers
Expiration Date:
Until further notice
Requirements:
  • Extensive experience in Azure services and architecture (VMs, EntraID, Application Gateway, Sentinel, Defender for Cloud, Azure Fabric, Functions, Logic Apps, Front Door, App Service, Dev Box, Azure Migrate)
  • Strong expertise in Azure DevOps, GitHub CI/CD, and build/release automation
  • Proficiency with Infrastructure as Code (Terraform, Pulumi, CloudFormation, PowerShell)
  • Experience deploying solutions in AWS is desirable
  • Familiarity with containerization and orchestration (Docker, Kubernetes) and automation/configuration tools (Ansible)
  • Strong scripting skills (PowerShell, Bash, Python)
  • Experience with monitoring and observability tools (Grafana, Azure Monitor, DataDog, New Relic)
  • Deep understanding of cloud security, governance, and FinOps principles
  • Solid Windows, Linux, and Microsoft 365 design and implementation experience
  • Proven experience migrating databases (e.g., MS SQL) in cloud environments
Job Responsibility:
  • Lead the design and implementation of cloud infrastructure and DevOps processes across client projects
  • Act as a technical advisor for cloud engineers, providing guidance on CI/CD automation, container orchestration, and platform reliability
  • Design, document, and maintain secure technical and security architectures aligned with best practices
  • Collaborate with Architecture, Security, Software Engineering, and Product teams to align cloud platform strategy
  • Drive improvements in automation, infrastructure as code, and overall DevOps maturity across projects
  • Mentor and coach engineering teams to adopt modern engineering practices and automation strategies
  • Deliver large-scale infrastructure transformation projects with low-level design expertise
  • Stay ahead of emerging technologies, applying them to deliver maximum client value
Employment Type:
Fulltime