Staff Software Engineer, Slurm

Crusoe

Location:
United States, San Francisco

Contract Type:
Not provided

Salary:

185000.00 - 224000.00 USD / Year

Job Description:

We are actively seeking an exceptional Staff Software Engineer to join our cloud software team, focusing specifically on building and operating Slurm as a fully managed cloud service within Crusoe Cloud. This role is crucial for delivering next-generation orchestration capabilities to power GPU-accelerated and high-performance computing (HPC) at scale. Your expertise will be instrumental in designing and scaling our carbon-reducing operating model, and advancing our AI training clusters to lead the industry in reliability and performance. You will shape the technical direction of systems that allow customers to run advanced workloads across CPUs, NVIDIA and AMD GPUs, and high-performance networking environments. You will be involved in writing and reviewing code, contributing to proposals, and drafting architecture documents. You will evaluate tools and frameworks, considering their impact on reliability, scalability, operational costs, and ease of adoption.

Job Responsibility:

  • Lead the development and engineering of our managed Slurm offering, providing a seamless experience for AI/ML and HPC customers who rely on robust Slurm job scheduling
  • Contribute to the development of scalable and robust software solutions, closely aligning with the strategic objectives outlined in the Crusoe Cloud roadmap
  • Design, build, and maintain Kubernetes operators and controllers dedicated to managing the lifecycle, configuration, and state of large-scale Slurm clusters
  • Drive the integration of GPU acceleration in the Slurm environment, including device plugin architecture, GPU operators, accelerator-aware scheduling, and resource allocation
  • Ensure that high-performance networking technologies, such as InfiniBand and RoCE, are correctly leveraged for distributed GPU workloads running through Slurm
  • Implement and manage features such as multi-tenancy, cluster lifecycle management, auto-scaling, and high availability for the managed Slurm control plane services
  • Develop scalable systems to compete with leading managed services
  • Support the development of your peers by sharing knowledge and providing guidance in technical discussions

Requirements:

  • 7+ years of experience working in software engineering, with strong experience in Systems Engineering
  • Experience in distributed systems, cloud, or HPC environments is a must
  • 2+ years of programming experience in Go (Golang)
  • Strong proficiency in other systems languages (Rust, C++, Python for HPC tooling) is also beneficial
  • Extensive experience with Kubernetes, Linux engineering, and debugging
  • Deep knowledge of Slurm (Simple Linux Utility for Resource Management) administration and the architecture required for managing compute jobs in high-performance environments
  • Skilled in infrastructure as code and familiar with systems-level challenges, ideally with experience utilizing Terraform
  • Understanding of Argo, CI/CD, and automated testing pipelines
  • Able to design and take ownership of system architecture, including CI/CD pipelines, while ensuring adherence to security standards
  • Strong knowledge of container networking (CNI plugins, service meshes) and Linux networking fundamentals
  • Familiarity with GPU integration in Kubernetes, including device plugins and GPU operators
  • Excellent communication skills, both verbal and written

What we offer:

  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Subscription to the Calm app
  • MetLife Legal
  • Company-paid commuter benefit: $300 per month

Additional Information:

Job Posted:
February 21, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Staff Software Engineer, Slurm

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location:
Canada, Vancouver
Salary:
190000.00 - 240000.00 CAD / Year
Inworld AI
Expiration Date
Until further notice
Requirements
  • 7+ years of software engineering experience
  • 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable, secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
  • equity
  • benefits
  • Fulltime

Staff MLOps Engineer

At Inworld, we’re building the AI framework behind the next generation of real-t...
Location:
United States, Mountain View
Salary:
180000.00 - 280000.00 USD / Year
Inworld AI
Expiration Date
Until further notice
Requirements
  • 7+ years of software engineering experience, with 5+ years of infrastructure-as-code
  • Proficiency in managing Kubernetes clusters and applications, including creating Helm charts/Kustomize manifests for new applications
  • Experience in creating and maintaining CI/CD pipelines for both applications and infrastructure deployments (using tools like Terraform/Terragrunt, ArgoCD, GitHub Actions, Ansible, etc.)
  • Deep knowledge of at least one major cloud provider (Google Cloud Platform, Microsoft Azure, Oracle Cloud)
  • Proficient in at least one backend programming/scripting language such as Golang, Python, or Bash
  • Knowledge of SLURM or similar job schedulers for distributed training
  • Experience with data pipeline and workflow management tools
  • Desire to work at a fast-growing Series A startup, comfortable with uncertainty, owning and scaling new products, and embracing an experimental and iterative development process
  • In-office location: Mountain View, CA, United States. You must be available for hybrid work
Job Responsibility
  • Build and scale MLOps systems to streamline the end-to-end ML model lifecycle on the Inworld AI platform, from training to deployment
  • Design and implement robust model training, evaluation, and release pipelines
  • Collaborate cross-functionally with ML and backend teams to design, deploy, and maintain scalable, secure infrastructure for Inworld’s AI Engine and Studio
  • Facilitate a "you build it, you run it" culture by providing the necessary tools and processes for monitoring the reliability, availability, and performance of services
  • Manage CI/CD pipelines to ensure smooth and efficient code integration and deployment
  • Identify and implement opportunities to enhance engineering speed and efficiency
  • Provide technical leadership in ML engineering best practices, raise the technical bar, and mentor junior engineers in MLOps principles
What we offer
  • equity and benefits
  • Fulltime

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...
Location
Salary:
Not provided
Cohere
Expiration Date
Until further notice
Requirements
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, PyTorch, and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure
Job Responsibility
  • Design and write high-performance, scalable software for training
  • Improve our training setup from an infrastructure and codebase performance standpoint
  • Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime

Member of Technical Staff, Post-Training

Advance the state of the art for model post training, ship state of the art mode...
Location
Salary:
Not provided
Cohere
Expiration Date
Until further notice
Requirements
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, PyTorch, and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands-on experience training large models at scale
  • Hands-on experience with the post-training phase of model training, with a strong emphasis on performance optimisation
Job Responsibility
  • Design and write high-performance, scalable software for training models
  • Consistently post-train the models to reach SOTA level performance
  • Coordinate with other specialist teams (Agentic, Code…) to produce models that have strong all-encompassing performance
  • Craft and implement techniques to improve the performance and results of our training cycles in both the SFT and RL regimes
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
  • Fulltime

Senior Staff Cloud Support Engineer

As a Senior Staff Cloud Support Engineer, you are a technical authority within C...
Location:
United States, San Francisco; Sunnyvale
Salary:
180000.00 - 220000.00 USD / Year
Crusoe
Expiration Date
Until further notice
Requirements
  • 8+ years experience in SRE, DevOps, HPC, or Cloud Infrastructure roles
  • Advanced Linux systems expertise
  • Deep Kubernetes operational experience (CKA-level or higher)
  • Strong networking knowledge: InfiniBand, RDMA, RoCE, SDN
  • Experience supporting AI/ML workloads at scale (GPU clusters)
  • Proven track record of resolving multi-layer, distributed system failures
  • Strong customer communication and executive-facing presence
Job Responsibility
  • Serve as highest-level escalation point for complex P1/P0 incidents
  • Lead cross-functional root cause investigations involving compute, networking (IB/RDMA/RoCE), storage, and orchestration layers
  • Partner with SRE and Software teams (Storage, Networking, Compute, K8s) to design systemic fixes rather than recurring workarounds
  • Design and improve node validation, burn-in processes, performance baselining, and release readiness
  • Influence Kubernetes architecture, workload orchestration (Slurm, Terraform), and AI/ML cluster stability
  • Reduce MTTR and incident recurrence through structural improvements
  • Troubleshoot NCCL, IB, GPU driver/firmware issues, distributed training failures
  • Support complex AI workloads (training + inference) with performance tuning and observability improvements
  • Act as senior technical advisor during high-risk customer incidents
  • Deliver executive-ready RCAs with clarity and confidence
What we offer
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
  • Fulltime

HPC Principal Federal Technical Consultant

Principal Consultant to join our High-Performance Computing (HPC) team. In this ...
Location
United States
Salary:
115500.00 - 266000.00 USD / Year
Hewlett Packard Enterprise
Expiration Date
Until further notice
Requirements
  • 8+ years of professional experience, with at least 3 in HPC architecture, systems engineering, or large-scale infrastructure design
  • Advanced degree in Computer Science, Engineering, Physics, or related technical field (or equivalent experience)
  • Proven ability to design and deliver complex, multi-vendor HPC solutions at scale
  • Demonstrated ability to independently complete solution implementations and application design deliverables
  • Must be United States Citizen due to the responsibilities and requirements of the role as this will be supporting a Federal site
  • Top Secret Clearance, TS/SCI with Full Scope Polygraph (FSP)
  • Must be willing to travel as the business dictates
  • Expertise in one or more of the following: parallel computing, MPI/OpenMP, GPU acceleration, workload schedulers (Slurm, Altair PBS Pro, Torque/MOAB, etc.), or large-scale data storage systems (Lustre, GPFS, Ceph)
  • Experience with network boot technologies (PXE, gPXE/Etherboot, etc.)
  • Storage specific knowledge: LVM, RAID, iSCSI, Disk partitioning (GPT, MBR)
Job Responsibility
  • Lead the technical implementation design and delivery of world class scale HPC solutions, from requirements gathering to implementation
  • Provide architectural guidance on compute, storage, networking, and workload management tailored to customer use cases
  • Configure, deploy, and maintain Linux-based HPC clusters, associated storage, and network infrastructure
  • Work in close collaboration with customers on finalizing and deploying HPC software applications, hosting platforms, and management systems that enable customer research and production workloads
  • Provide technical support and troubleshooting for HPC implementation in secure locations
  • Work on both operational support and strategic HPC projects
  • Actively participate in customer user group environments
  • Evaluate and implement new tools, middleware, and methodologies to improve operations and service delivery
  • Ensure compliance with enterprise IT security and technology controls
  • Act as principal consultant in customer engagements, often leading cross-functional project teams (including customer staff)
What we offer
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime

HGV driver

Start or continue your career in logistics with Jaguar Land Rover's Castle Bromw...
Location:
United Kingdom, Birmingham
Salary:
16.08 GBP / Hour
Randstad
Expiration Date
February 25, 2026
Requirements
  • Valid HGV licence class 1 and knowledge of associated duties i.e. secure loading
  • A full UK driving licence with fewer than 6 penalty points and no disqualifications
  • ADR (Dangerous Goods) Trained
  • Holds a Driver Certificate of Professional Competence (CPC)
  • Knowledge of load security
  • Knowledge of Government and European Union traffic regulations including tachographs
  • Available for shift working patterns as required
  • Ability to work overtime and unsocial hours
  • Knowledge of working time directive
  • Knowledge of road network in the UK
Job Responsibility
  • Safely distribute and transport materials, components, and sub-assemblies to meet operational needs
  • Collect and verify delivery instructions for accurate and timely shipments
  • Conduct routine vehicle inspections and preventative maintenance
  • Assist in the loading and unloading of vehicles safely
  • Ensure compliance with government and EU traffic regulations, including tachograph usage
  • Plan routes effectively to meet delivery schedules
  • Maintain high standards of health, safety, and accident prevention
What we offer
  • 34 days holiday (including bank holidays) and a 2-week summer shutdown
  • Confidential mental health and financial support
  • On-site Employee Inclusion Council
  • Free onsite parking, canteen with healthy options, and Costa coffee machines
  • Public transport links nearby
  • Discounts at shops, gyms, cinemas, and restaurants via our benefits app

Primary Teaching Assistant

Primary Teaching Assistant - Day-to-Day Supply. Are you looking for a rewarding ...
Location:
United Kingdom, Plymouth
Salary:
Not provided
Randstad
Expiration Date
March 04, 2026
Requirements
  • Previous experience working with children (schools, nurseries, sports coaching, or youth groups)
  • A good standard of English and Maths to support Primary-aged learners
  • An enhanced DBS on the update service (or the willingness to apply for one)
Job Responsibility
  • Classroom Support: Assisting the teacher with lesson setup and helping pupils stay on task
  • Core Skills: Supporting small groups with reading, writing, and phonics (KS1) or numeracy and literacy (KS2)
  • Early Years/KS1: Helping younger children learn through play and develop their social skills
  • Well-being: Providing a supportive environment and helping with lunchtime or playground supervision
What we offer
  • Great Pay: Earn £90 - £95 per day with no hidden fees and weekly pay
  • Ultimate Flexibility: Manage your own diary, perfect for parents, students, or those seeking a better work-life balance
  • Zero Admin: No planning, no marking, and no parents' evenings
  • Variety: Experience different school cultures across Plymouth and find the setting that suits you best
  • training and education
  • Safeguarding and Prevent