Senior Site Reliability Engineer - Fleet Reliability

Lambda

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

230,000 - 345,000 USD / Year

Job Description:

Lambda, The Superintelligence Cloud, is a leader in AI cloud infrastructure serving tens of thousands of customers. Our customers range from AI researchers to enterprises and hyperscalers. Lambda's mission is to make compute as ubiquitous as electricity and give everyone the power of superintelligence. One person, one GPU. If you'd like to build the world's best AI cloud, join us. Engineering at Lambda is responsible for building and scaling our cloud offering. Our scope includes the Lambda website, cloud APIs and systems as well as internal tooling for system deployment, management and maintenance.

Job Responsibility:

  • Define Fleet Health metrics and indicators to objectively measure and improve system availability
  • Collaborate with the observability team on comprehensive monitoring and alerting systems to proactively predict, detect and respond to issues or anomalies
  • Create runbooks and automated remediations for common failure scenarios
  • Build in automation and auditing to ensure compliance and improve efficiency and productivity
  • Participate in on-call rotations and provide support for incident response and resolution
  • Implement and integrate logging and metrics across platforms such as Datadog, Prometheus, OpenTelemetry, Grafana, SumoLogic, etc.

Requirements:

  • 7+ years of experience in Site Reliability Engineering, DevOps, or a similar role
  • Strong understanding of modern AI infrastructure, from GPU architectures to hardware performance optimization
  • Strong understanding of Linux-based systems in a distributed environment
  • Solid understanding of Python and Go, with experience working with SWE teams to improve internal tooling
  • Experience with monitoring and alerting tools (e.g., Prometheus, Grafana, SumoLogic)
  • Proficiency in automation and configuration management tools (e.g., Ansible, Terraform)
  • Familiarity with cloud platforms (e.g., OCI, AWS, GCP, Azure)
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Passion for continuous improvement and innovation

Nice to have:

  • Experience in the machine learning or computer hardware industry
  • Knowledge of containerization and orchestration technologies (e.g., Docker, Kubernetes)
  • Experience building and/or operating HPC resources
  • Background in chaos engineering or similar reliability testing methodologies
  • Understanding of compliance frameworks (SOC 2, ISO 27001, etc.)

What we offer:

  • Generous cash & equity compensation
  • Health, dental, and vision coverage for you and your dependents
  • Wellness and commuter stipends for select roles
  • 401(k) plan with 2% company match (USA employees)
  • Flexible paid time off plan

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time

Work Type:
Hybrid work

Similar Jobs for Senior Site Reliability Engineer - Fleet Reliability

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150,000 - 240,000 USD / Year
Chef Robotics
Expiration Date
Until further notice
Requirements
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer
  • medical, dental, and vision insurance
  • commuter benefits
  • flexible paid time off (PTO)
  • catered lunch
  • 401(k) matching
  • early-stage equity
Employment Type:
Full-time

Senior Site Reliability Engineer

It's fun to work in a company where people truly believe in what they're doing! ...
Location:
India, Bengaluru
Salary:
Not provided
BlackLine
Expiration Date
Until further notice
Requirements
  • 5–10+ years in SRE, DevOps, or systems engineering in production cloud environments
  • B.Tech/B.E. in Computer Science or related field
  • Expertise in automation, observability & monitoring, CI/CD pipelines, and incident management
  • Experience with SRE principles (SLI/SLO/error budgets/postmortems, etc.)
  • Proficient in IaC tools like Terraform, Ansible, Chef
  • Experience working with HashiCorp tools (Consul, Vault, Nomad, Packer)
  • Strong cloud knowledge (GCP preferred, AWS/Azure a plus)
  • Experience with containerization & orchestration (Docker, Kubernetes, ArgoCD, etc.)
  • Advanced scripting and automation (Python, Go, PowerShell)
  • Familiarity with cloud cost monitoring and optimization techniques
Job Responsibility
  • Own performance, scalability, and operational excellence across critical services
  • Blend software engineering and systems engineering to build and run large-scale, fault-tolerant, distributed systems—focusing on performance, capacity, availability, and security
  • Own service reliability across the stack and collaborate closely with developers, architects, and infrastructure teams to ensure services are resilient by design and self-healing by default
  • Automate operational tasks to reduce toil and increase team velocity
  • Lead timely and reliable deployments, with emphasis on progressive delivery techniques (canary, blue/green, feature flags, zero outage, etc.)
  • Partner in blameless postmortems and ensure incident reviews lead to systemic fixes
  • Automate secure lifecycle of certificates, secrets, and credentials
  • Build and maintain cloud-native security stacks and compliance guardrails
  • Execute infrastructure rotation and automated rehydration to maintain fleet hygiene
  • Create and manage highly reproducible environment provisioning via Infrastructure as Code
What we offer
  • A technology-based company with a sense of adventure and a vision for the future
  • A culture that is kind, open, and accepting
  • A culture where BlackLiners' continued growth and learning are empowered
  • BlackLine offers a wide variety of professional development seminars and inclusive affinity groups to celebrate and support our diversity

Senior Network Technician

As Senior Network Technician, you would help support the rollout of GeniusIQ, ou...
Location:
United Kingdom, Manchester
Salary:
Not provided
Genius Sports
Expiration Date
Until further notice
Requirements
  • 5 years’ experience with system and network administration on infrastructure with 100+ Linux servers
  • Strong understanding of the entire Linux server stack: OS boot and installation, system, networking, container deployment, logging, metrics & monitoring, out-of-band management, etc.
  • Strong understanding of OSI network layers 2-3-4 and network configuration: switching, VLANs, routing, firewall rules, ARP, DHCP, DNS, TCP, switch command-line, etc.
  • Proficiency in Bash scripting
  • Ability to communicate efficiently and articulate concepts based on the audience, including remote hands, engineering and customers
Job Responsibility
  • Supervise IT issue tracking and resolution for a large fleet of bare-metal Linux servers and network equipment in hundreds of sport venues in Europe
  • Assist venue operations coordinators with preparation of equipment and installation, based on automation processes developed by site reliability engineers
  • Communicate kindly with external venue IT and management staff
  • Partner with software engineers to eliminate common issues
Employment Type:
Full-time

Senior Maintenance Planner

We are currently seeking an experienced Senior Mobile Fleet Maintenance Planner ...
Location:
Australia, Mudgee
Salary:
Not provided
Peabody Energy
Expiration Date
Until further notice
Requirements
  • Mechanical Trade or Engineering qualification
  • 3+ years' experience as a Maintenance Planner desirable
  • Strong working knowledge of SAP, maintenance planning and scheduling principles and procedures
  • Strong interpersonal and communication skills
  • Demonstrated experience in safety systems and processes including JSEAs, risk assessments and permits
  • Experience with Microsoft Project is desirable but not required
  • Goal-oriented, with the ability to work autonomously
Job Responsibility
  • Ensuring maintenance "best practice" techniques are implemented to ensure equipment is maintained to a high safety, productive and reliable standard
  • Working with stakeholders to manage lead time on parts
  • Prioritisation of work and time management
  • An active role in Forecasting Costs for Field Short to Mid-Term work
  • Working with the Maintenance Execution Team to develop plans that support the Maintenance function to meet the needs of the business
  • Developing and maintaining relationships with our internal departments as well as our key suppliers
  • Ensuring compliance with relevant statutory, legislative, WH&S standards and site policies and procedures
  • Developing a high-performing planning and scheduling team
Employment Type:
Full-time

Accounts Payable Specialist

We are looking for an experienced Accounts Payable Specialist to join our team i...
Location:
United States, Pompano Beach
Salary:
Not provided
Robert Half
Expiration Date
Until further notice
Requirements
  • Proven experience in full-cycle accounts payable processes
  • Strong knowledge of invoice coding and three-way matching with purchase orders
  • Familiarity with Automated Clearing House (ACH) and check run procedures
  • Excellent attention to detail and organizational skills
  • Ability to work effectively in a fast-paced environment
  • Proficiency in accounting software and systems
  • Strong communication skills to collaborate with vendors and internal teams
  • Demonstrated ability to meet deadlines and handle high-volume workloads
Job Responsibility
  • Process a high volume of invoices weekly, including both three-way matching with purchase orders and standard administrative invoices
  • Review, match, and code invoices accurately to ensure proper payment and record-keeping
  • Prepare and execute check runs and Automated Clearing House (ACH) transactions
  • Assist with reconciling accounts and resolving discrepancies with vendors
  • Collaborate with internal departments to ensure proper documentation and approvals for payments
  • Maintain organized records of invoices, payments, and other financial documents
  • Support efforts to streamline accounts payable processes and improve efficiency
  • Provide timely updates and reports to management regarding payment statuses
  • Ensure compliance with company policies and accounting standards
  • Assist in catching up on delayed payments and maintaining accurate financial records
What we offer
  • Medical, vision, dental, and life and disability insurance
  • Eligibility to enroll in company 401(k) plan
  • Access to top jobs, competitive compensation and benefits
  • Free online training

Home Health LPN

Home Health LPN position in Manitowoc County. The role involves providing qualit...
Location:
United States, Green Bay, WI; Aurora Medical Center Manitowoc County
Salary:
25.30 - 37.95 USD / Hour
Advocate Health Care
Expiration Date
Until further notice
Requirements
  • Licensed in the State of Illinois (or eligible for licensure)
  • Able to demonstrate compliance with state continuing education requirements
  • 2 years of Med/Surgical clinical experience
  • CPR certified
  • Basic knowledge of computer use (e.g., Microsoft Word, email access and use)
  • Ability to learn and use computer-based scheduling and documentation systems
  • Ability to communicate professionally both verbally and through written reports, with strong documentation skills
  • Good interpersonal skills
  • Good time management and self-organization skills
  • Proficiency in the ability to obtain lab draws and report results to primary nurse or clinical manager
Job Responsibility
  • Assuring/Improving Quality of Care: Facilitates the patient and family's right to receive quality cost-effective care
  • Utilizes appropriate resources in response to situations that have the potential to negatively impact patient and family outcomes
  • Adapts practice to the latest standards according to evidence-based literature
  • Participates in efforts to reduce risk and improve patient safety
  • Consistently demonstrates the ability to thoroughly and decisively document to support reimbursement and regulatory requirements
  • Care and Service Coordination: Practices as an effective team member of the patient care team to formulate an integrated approach to care
  • Recognizes changes in clinical situations. Consults with RN as to action based on observation
  • Consistently makes sound clinical decisions, demonstrating the ability to care for all patients including those with complex problems
  • Prioritizes and organizes patient care based on established plan of care, ensures clear hand-off communication to achieve optimal health outcomes for patients, utilizing excellent communication techniques like SBAR and 5P's
  • Actively participates in interdisciplinary case conferences
What we offer
  • Paid Time Off programs
  • Health and welfare benefits such as medical, dental, vision, life, and Short- and Long-Term Disability
  • Flexible Spending Accounts for eligible health care and dependent care expenses
  • Family benefits such as adoption assistance and paid parental leave
  • Defined contribution retirement plans with employer match and other financial wellness programs
  • Educational Assistance Program
  • Premium pay such as shift, on call, and more based on a teammate's job
  • Incentive pay for select positions
  • Opportunity for annual increases based on performance
Employment Type:
Full-time

Registered Nurse - Endocrinology Clinic

Our expert team of endocrinology physicians in Franklin, WI, can diagnose and tr...
Location:
United States, Franklin
Salary:
35.50 - 53.25 USD / Hour
Advocate Aurora Health
Expiration Date
Until further notice
Requirements
  • Graduate of a Board of Nursing approved nursing education program
  • Basic Life Support (BLS)
  • Active, unrestricted registered nurse (RN) multi-state compact and/or single-state license with privileges to practice in the state(s) where the RN is providing client nursing services
  • Typically requires 1 year of experience in clinical nursing
  • Strong clinical judgment and critical thinking
  • Time management, prioritization and problem-solving skills
  • Excellent communication and interpersonal skills
  • Ability to work in a fast-paced, dynamic environment
  • Proficiency in operating computer functions (e.g., email, electronic records, digital platforms, etc.)
  • Must be able to sit, stand, walk, lift, squat, bend, reach above shoulders, and twist frequently throughout the workday
Job Responsibility
  • Engages in unit councils, professional governance, and quality initiatives to improve care processes and apply evidence-based practices
  • Utilizes the nursing process to assess, plan, diagnose, implement, and evaluate nursing care, engaging patients and families through the continuum of care
  • Monitors patient conditions, adjusts care plans, mobilizes resources, and collaborates with the care team to influence care outcomes
  • Upholds and promotes a culture of safety
  • Continuously evaluates patient, team, and unit outcomes, taking action as needed
  • May administer medications, treatments, and therapies safely and according to clinical protocols and procedures
  • Demonstrates effective communication, feedback, and conflict resolution, fostering team collaboration and appropriate delegation
  • Pursues professional development, completes required education, and maintains certifications
  • Adheres to the ANA Code of Ethics and practices ethical decision-making, respects interdisciplinary roles, and contributes to integrated, unbiased patient care
  • Appropriate delegation to other Registered Nurses, Licensed Practical Nurses, and unlicensed assistive personnel
What we offer
  • Paid Time Off programs
  • Health and welfare benefits such as medical, dental, vision, life, and Short- and Long-Term Disability
  • Flexible Spending Accounts for eligible health care and dependent care expenses
  • Family benefits such as adoption assistance and paid parental leave
  • Defined contribution retirement plans with employer match and other financial wellness programs
  • Educational Assistance Program
  • Opportunity for annual increases based on performance
  • Premium pay such as shift, on call, and more
  • Incentive pay for select positions
Employment Type:
Full-time

Event Staff

ASM Global, the leader in privately managed public assembly facilities, has an e...
Location:
United States, Kissimmee
Salary:
Not provided
Legends Global
Expiration Date
Until further notice
Requirements
  • Must be able to speak, read and write English
  • Professional attitude and appearance
  • Ability to listen and follow instructions
  • Ability to work independently and in a team environment
  • Good communication, customer service, and public relations skills
  • Good organizational and problem-solving skills
  • Ability to work flexible hours including daytime, evening, weekends and holidays as needed
  • High school diploma or general education degree (GED)
  • Guest services background preferred
  • Ability to stand for long periods of time
Job Responsibility
  • Greet each guest with a smile and encourage them to enjoy their visit to our facilities
  • Listen attentively to patrons' questions, concerns or suggestions and be prepared to answer their questions
  • Inspect your assigned areas for any safety hazards or seating irregularities prior to opening the doors
  • Report any problems to the Event Manager on duty
  • Know the locations of the nearest restrooms, drinking fountains, smoking sections and concession stands
  • Face the incoming patrons (not the floor) while standing at the top of your section or in the center of your passageway
  • Constantly scan the seating areas for any unusual happenings
  • Scan your assigned area for cans, bottles, and any alcohol related problems
  • Be alert for any objects being thrown from the seating areas and from areas above your area
  • Watch for seat jumpers to protect the integrity of the tickets
Employment Type:
Part-time