Engineering Manager, Inference Platform

Cerebras Systems

Location:
United States; Canada, Sunnyvale

Contract Type:
Not provided

Salary:
Not provided

Job Description:

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach lets Cerebras deliver industry-leading training and inference speeds and empowers machine learning users to run large-scale ML applications effortlessly, without the hassle of managing hundreds of GPUs or TPUs. Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of compute, transforming key workloads with ultra-high-speed inference. Thanks to this wafer-scale architecture, Cerebras Inference offers the fastest generative AI inference in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order-of-magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

Job Responsibility:

  • Provide hands-on technical leadership, owning the technical vision and roadmap for the Cerebras Inference Platform, from internal scaling to on-prem customer solutions
  • Lead the end-to-end development of distributed inference systems, including request routing, autoscaling, and resource orchestration on Cerebras' unique hardware (a minimal routing/autoscaling sketch follows this list)
  • Drive a culture of operational excellence, guaranteeing platform reliability (>99.9% uptime), performance, and efficiency
  • Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution
  • Productize the platform into an enterprise-ready, on-prem solution, collaborating closely with product, ops, and customer teams to ensure successful deployments
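
For illustration only: the posting gives no implementation details, so the following is a minimal sketch assuming a simple least-loaded routing policy and an invented per-replica capacity. The names (Replica, Router, "wse-0", the 80% threshold) are hypothetical and are not Cerebras APIs or actual platform code.

    import random
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Replica:
        """One inference replica; in_flight counts outstanding requests."""
        name: str
        in_flight: int = 0

    @dataclass
    class Router:
        """Least-loaded request routing with a crude scale-out signal."""
        replicas: List[Replica]
        max_in_flight_per_replica: int = 8  # illustrative capacity assumption

        def pick(self) -> Replica:
            # Route to the replica with the fewest outstanding requests,
            # breaking ties randomly so no single node becomes a hot spot.
            least = min(r.in_flight for r in self.replicas)
            return random.choice([r for r in self.replicas if r.in_flight == least])

        def wants_scale_out(self) -> bool:
            # Naive autoscaling signal: ask for another replica once average
            # load crosses 80% of the assumed per-replica capacity.
            avg = sum(r.in_flight for r in self.replicas) / len(self.replicas)
            return avg > 0.8 * self.max_in_flight_per_replica

    router = Router(replicas=[Replica("wse-0"), Replica("wse-1")])
    target = router.pick()
    target.in_flight += 1  # dispatch one request to the chosen replica
    print(target.name, router.wants_scale_out())

A production router would additionally handle health checks, request queueing and batching, and connection draining during scale-in; the sketch only shows the placement and scaling decisions.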

Requirements:

  • 6+ years in high-scale software engineering
  • 3+ years leading distributed systems or ML infra teams
  • Strong coding and code-review skills
  • Proven track record scaling LLM inference: optimizing latency (<100 ms P99), throughput, batching, memory/IO efficiency, and resource utilization (see the measurement sketch after this list)
  • Expertise in distributed inference/training for modern LLMs
  • Understanding of AI/ML ecosystems, including public clouds (AWS/GCP/Azure)
  • Hands-on experience with model-serving frameworks (e.g., vLLM, TensorRT-LLM, Triton, or similar) and ML stacks (PyTorch, Hugging Face, SageMaker)
  • Deep experience with orchestration (Kubernetes/EKS, Slurm), large clusters, and low-latency networking
  • Strong background in monitoring and reliability engineering (Prometheus/Grafana, incident response, post-mortems)
  • Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products
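
As with the sketch above, the following is a hedged illustration (not from the posting) of how one might measure P99 latency and export Prometheus metrics around an LLM served with vLLM's offline API. The model name, metric name, port, and histogram buckets are assumptions.

    import time

    import numpy as np
    from prometheus_client import Histogram, start_http_server
    from vllm import LLM, SamplingParams

    LATENCY = Histogram(
        "inference_request_latency_seconds",
        "End-to-end latency of each generate() call",
        buckets=(0.025, 0.05, 0.1, 0.25, 0.5, 1.0),  # a 100 ms P99 target sits mid-range
    )

    def main() -> None:
        start_http_server(9400)  # expose /metrics for Prometheus to scrape
        llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # hypothetical model choice
        params = SamplingParams(max_tokens=64)
        latencies = []
        for prompt in ["Hello", "Summarize wafer-scale inference in one sentence."] * 50:
            start = time.perf_counter()
            llm.generate([prompt], params)  # batch of one, for simple per-request timing
            elapsed = time.perf_counter() - start
            LATENCY.observe(elapsed)
            latencies.append(elapsed)
        print(f"P99 latency: {np.percentile(latencies, 99) * 1000:.1f} ms")

    if __name__ == "__main__":
        main()

In a real serving path the timing would wrap continuous-batching or streaming requests rather than an offline loop, but the percentile and histogram mechanics are the same.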

Nice to have:

  • Experience with on-prem/private cloud deployments
  • Background in edge or streaming inference, multi-region systems, or security/privacy in AI
  • Customer-facing experience with enterprise deployments

What we offer:
  • Build a breakthrough AI platform beyond the constraints of the GPU
  • Publish and open-source cutting-edge AI research
  • Work on one of the fastest AI supercomputers in the world
  • Enjoy job stability with startup vitality
  • A simple, non-corporate work culture that respects individual beliefs

Additional Information:

Job Posted:
February 17, 2026

Employment Type:
Fulltime
Work Type:
On-site work

Similar Jobs for Engineering Manager, Inference Platform

Engineering Manager - Machine Learning Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
241200.00 - 400000.00 USD / Year
Company:
Plaid
Expiration Date:
Until further notice
Requirements:
  • 8–10 years of experience in ML infrastructure, including direct hands-on expertise as an engineer (IC/TL)
  • 2+ years of experience managing infrastructure or ML platform engineers
  • Proven experience delivering and operating ML or AI infrastructure at scale
  • Solid technical depth across ML/AI infrastructure domains (e.g., feature stores, pipelines, deployment, inference, observability)
  • Demonstrated ability to drive execution on complex technical projects with cross-team stakeholders
  • Strong communication and stakeholder management skills
Job Responsibility:
  • Lead and support the ML Infra team, driving project execution and ensuring delivery on key commitments
  • Build and launch Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Define and drive adoption of an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines, deployment tooling, and inference systems
  • Partner with ML product teams to understand requirements and deliver solutions that accelerate model development and iteration
  • Recruit, mentor, and develop engineers, fostering a collaborative and high-performing team culture
What we offer:
  • medical
  • dental
  • vision
  • 401(k)
  • equity
  • commission
Employment Type:
Fulltime

Senior Principal Technical Program Manager - ML Platform

Location:
Not provided
Salary:
231300.00 - 301975.00 USD / Year
Company:
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 8+ years of experience on software teams as a Development Manager, Technical Product Manager, or TPM leading technical platform areas
  • Deep domain experience in AI and/or Search. Examples: Model Inference, Model Evaluation, Model Training, LLM Ops, Semantic Search, Search Relevance, etc.
  • Partner with Engineering in defining direction, strategy, and execution at the platform level
  • Strategic thinking and ability to understand business objectives to translate them into technical problems and programs.
  • Technical understanding of systems involved. Willingness to develop domain expertise in the area they operate - storage, networking, authentication, capacity management, service deployments, etc.
  • TPMs are not expected to write or read code, but are expected to understand system flows, block architectures, APIs and such.
  • Experience defining and running end-to-end complex technical programs
  • Strong leadership, organizational, and communication skills
Job Responsibility:
  • Understand and stay up-to-date on latest innovations in AI and Search. Partner closely with engineering teams to translate these into practical platform evolution for Atlassian bringing value to our customers.
  • Analyze business objectives, customer needs, product adoption inhibitors and opportunities, industry trends, and based on these, in close collaboration with your stakeholders, define a long-term strategy and roadmap for your platform and product components.
  • Understand business objectives and translate them into technical systems problems that need to be prioritized and solved in the current business environment.
  • Define specific systems programs and create a plan of action for realizing those programs. Such programs could be around capacity planning, migration efforts, high availability, network architecture, performance optimization, reliability improvements and more.
  • Use your technical understanding of Atlassian and related systems to partner with and influence engineers and architects in making progress on these problems.
  • Responsible for taking a systematic approach to engineering problems. This includes: prioritizing tasks, scoping out the project, defining objectives, and making consistent progress against each of these.
  • Be accountable for the success of these technical programs by managing the entire lifecycle from initiation to forecasting, budgeting, scheduling, etc.
  • Manage complex dependencies and projects with a broad scope across the company
What we offer:
  • health and wellbeing resources
  • paid volunteer days

Senior ML Platform Engineer

At WHOOP, we're on a mission to unlock human performance and healthspan. WHOOP e...
Location:
United States, Boston
Salary:
150000.00 - 210000.00 USD / Year
Company:
Whoop
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s or Master’s Degree in Computer Science, Engineering, or a related field, or equivalent practical experience
  • 5+ years of experience in software engineering with a focus on ML infrastructure, cloud platforms, or MLOps
  • Strong programming skills in Python, with experience in building distributed systems and REST/gRPC APIs
  • Deep knowledge of cloud-native services and infrastructure-as-code (e.g., AWS CDK, Terraform, CloudFormation)
  • Hands-on experience with model deployment platforms such as AWS SageMaker, Vertex AI, or Kubernetes-based serving stacks
  • Proficiency in ML lifecycle tools (MLflow, Weights & Biases, BentoML) and containerization strategies (Docker, Kubernetes)
  • Understanding of data engineering and ingestion pipelines, with ability to interface with data lakes, feature stores, and streaming systems
  • Proven ability to work cross-functionally with Data Science, Data Platform, and Software Engineering teams, influencing decisions and driving alignment
  • Passion for AI and automation to solve real-world problems and improve operational workflows
Job Responsibility:
  • Architect, build, own, and operate scalable ML infrastructure in cloud environments (e.g., AWS), optimizing for speed, observability, cost, and reproducibility
  • Create, support, and maintain core MLOps infrastructure (e.g., MLflow, feature store, experiment tracking, model registry), ensuring reliability, scalability, and long-term sustainability
  • Develop, evolve, and operate MLOps platforms and frameworks that standardize model deployment, versioning, drift detection, and lifecycle management at scale
  • Implement and continuously maintain end-to-end CI/CD pipelines for ML models using orchestration tools (e.g., Prefect, Airflow, Argo Workflows), ensuring robust testing, reproducibility, and traceability
  • Partner closely with Data Science, Sensor Intelligence, and Data Platform teams to operationalize and support model development, deployment, and monitoring workflows
  • Build, manage, and maintain both real-time and batch inference infrastructure, supporting diverse use cases from physiological analytics to personalized feedback loops for WHOOP members
  • Design, implement, and own automated observability tooling (e.g., for model latency, data drift, accuracy degradation), integrating metrics, logging, and alerting with existing platforms
  • Leverage AI-powered tools and automation to reduce operational overhead, enhance developer productivity, and accelerate model release cycles
  • Contribute to and maintain internal platform documentation, SDKs, and training materials, enabling self-service capabilities for model deployment and experimentation
  • Continuously evaluate and integrate emerging technologies and deployment strategies, influencing WHOOP’s roadmap for AI-driven platform efficiency, reliability, and scale
What we offer:
  • equity
  • benefits
Employment Type:
Fulltime

Senior Machine Learning Engineering Manager, Gen AI

We're seeking a Senior Machine Learning Manager (M60) to lead a cross-functional...
Location:
United States
Salary:
193500.00 - 303150.00 USD / Year
Company:
Atlassian
Expiration Date:
Until further notice
Requirements:
  • 8+ years in ML, search, or backend engineering roles, with 3+ years leading teams
  • Strong track record of shipping ML-powered or LLM-integrated user-facing products
  • Experience with RAG systems (vector search, hybrid retrieval, LLM orchestration)
  • Deep experience in either modeling (e.g., LLMs, search, NLP) or engineering (e.g., backend infra, full-stack), with the ability to lead end-to-end
  • Deep understanding of LLM ecosystems (OpenAI, Claude, Mistral, OSS), orchestration frameworks (LangChain, LlamaIndex), and vector databases (Weaviate, Pinecone, FAISS, etc.)
  • Strong product intuition and ability to translate complex tech into valuable user features
  • Familiarity with GenAI evaluation methods: hallucination detection, groundedness scoring, and human-in-the-loop feedback loops
  • Master’s or PhD in Computer Science, Machine Learning, or a related field preferred, or equivalent practical experience
Job Responsibility:
  • Lead the vision, design, and execution of LLM-powered AI products, leveraging advanced AI modeling (e.g., SLM post-training/fine-tuning), RAG architectures, and hybrid ranking systems
  • Define system architecture across retrievers, rankers, orchestration layers, prompt templates, and feedback mechanisms
  • Work closely with product and design teams to ensure delightful, fast, and grounded user experiences
  • Build and manage a cross-disciplinary team including ML engineers, backend/frontend engineers, and applied scientists
  • Foster a culture of E2E ownership, empowering the team to move from prototype to production quickly and iteratively
  • Mentor individuals to grow in both technical depth and product acumen
  • Shape the technical roadmap and long-term strategy for GenAI search across Atlassian’s product suite
  • Partner with platform and infra teams to scale inference, evaluate performance, and integrate usage signals for continuous improvement
  • Champion data quality, grounding, and responsible AI practices in all deployed features
What we offer:
  • health and wellbeing resources
  • paid volunteer days
Employment Type:
Fulltime

ML Platform Engineer

We are seeking a Machine Learning Engineer to help build and scale our machine l...
Location:
United States
Salary:
Not provided
Company:
Duetto
Expiration Date:
Until further notice
Requirements:
  • 3+ years of experience in ML engineering or a similar role building and deploying machine learning models in production
  • Strong experience with AWS ML services (SageMaker, Lambda, EMR, ECR) for training, serving, and orchestrating model workflows
  • Hands-on experience with Kubernetes (e.g., EKS) for container orchestration and job execution at scale
  • Strong proficiency in Python, with exposure to ML/DL libraries such as TensorFlow, PyTorch, scikit-learn
  • Experience working with feature stores, data pipelines, and model versioning tools (e.g., SageMaker Feature Store, Feast, MLflow)
  • Familiarity with infrastructure-as-code and deployment tools such as Terraform, GitHub Actions, or similar CI/CD systems
  • Experience with logging and monitoring stacks such as Prometheus, Grafana, CloudWatch, or similar
  • Experience working in cross-functional teams with data scientists and DevOps engineers to bring models from research to production
  • Strong communication skills and ability to operate effectively in a fast-paced, ambiguous environment with shifting priorities
Job Responsibility:
  • Develop, maintain, and scale machine learning pipelines for training, validation, and batch or real-time inference across thousands of hotel-specific models
  • Build reusable components to support model training, evaluation, deployment, and monitoring within a Kubernetes- and AWS-based environment
  • Partner with data scientists to translate notebooks and prototypes into production-grade, versioned training workflows
  • Implement and maintain feature engineering workflows, integrating with custom feature pipelines and supporting services
  • Collaborate with platform and DevOps teams to manage infrastructure-as-code (Terraform), automate deployment (CI/CD), and ensure reliability and security
  • Integrate model monitoring for performance metrics, drift detection, and alerting (using tools like Prometheus, CloudWatch, or Grafana)
  • Improve retraining, rollback, and model versioning strategies across different deployment contexts
  • Support experimentation infrastructure and A/B testing integrations for ML-based products

Staff Product Manager, Managed Inference

As a core member of the Crusoe Managed AI Services team, you will own the comple...
Location:
United States, San Francisco; Sunnyvale; New York
Salary:
204000.00 - 247000.00 USD / Year
Company:
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 6+ years of experience in technical product management or engineering roles with product responsibilities
  • Experience building and launching cloud infrastructure, platform, or AI/ML services used in production
  • Strong understanding of cloud infrastructure (e.g., AWS, GCP, Azure) and modern compute architectures
  • Familiarity with the machine learning lifecycle, particularly model deployment, inference, and monitoring
  • Strong communication and collaboration skills, with experience working across engineering, product, and business teams
  • Demonstrated ability to operate independently with strong product judgment and a bias for action
  • Bachelor’s degree in Computer Science or a related technical field (or equivalent experience)
Job Responsibility:
  • Own the end-to-end product lifecycle for Crusoe’s Managed Inference services, including roadmap definition, execution, and iteration
  • Translate customer needs, market signals, and technical constraints into clear product requirements and prioritization
  • Partner closely with Engineering, Infrastructure, and Platform teams to deliver scalable, reliable inference services
  • Drive product decisions across performance, reliability, cost efficiency, and developer experience
  • Define and track success metrics for inference services in production environments
  • Collaborate with go-to-market teams to support product launches, positioning, and customer adoption
  • Communicate product strategy and tradeoffs clearly to cross-functional partners and leadership
What we offer:
  • Restricted Stock Units in a fast growing, well-funded technology company
  • Health insurance package options that include HDHP and PPO, vision, and dental for you and your dependents
  • Employer contributions to HSA accounts
  • Paid Parental Leave
  • Paid life insurance, short-term and long-term disability
  • Teladoc
  • 401(k) with a 100% match up to 4% of salary
  • Generous paid time off and holiday schedule
  • Cell phone reimbursement
  • Tuition reimbursement
Employment Type:
Fulltime

Engineering Manager, GenAI Platform

Our generative AI-powered products are bringing joy back to the practice of medi...
Location:
United States, San Francisco
Salary:
220000.00 - 270000.00 USD / Year
Company:
Abridge
Expiration Date:
Until further notice
Requirements:
  • 5+ years as a software engineer
  • 2+ years managing software engineering teams
  • Experience in building and leading teams in the AI Infrastructure domain - LLM workflows, agentic systems, retrieval and evaluation
  • Comfortable giving constructive feedback on technical designs and code reviews
  • Skilled in building secure, compliant systems in major cloud platforms (GCP preferred, AWS experience welcome)
  • Familiar with containers, databases, and modern frontend frameworks
  • Skilled at hiring and mentorship, with a track record of helping engineers grow their skills and careers
  • Excited about being hands-on in a fast-moving, productive, and supportive environment
  • Has thrived in a fast-growing startup and knows how to operate in that environment
Job Responsibility:
  • Recruit and mentor engineers with backgrounds in LLM-driven workflows and agentic systems
  • Provide regular feedback, create opportunities for career growth, and foster a culture of collaboration and excellence
  • Work closely with ML/AI Research, Inference, Data, and Product teams to plan, execute, and support multiple projects simultaneously; be responsible for the engineering process in the team and the output of the team
  • Help set business context for the team and apply your knowledge of business and software engineering to guide architectural discussions, anticipate and resolve technical challenges, and advocate for technology projects
  • Set a high standard for your team, including software quality
What we offer:
  • Generous Time Off: 14 paid holidays, flexible PTO for salaried employees, and accrued time off for hourly employees
  • Comprehensive Health Plans: Medical, Dental, and Vision coverage for all full-time employees and their families
  • Generous HSA Contribution: If you choose a High Deductible Health Plan, Abridge makes monthly contributions to your HSA
  • Paid Parental Leave: Generous paid parental leave for all full-time employees
  • Family Forming Benefits: Resources and financial support to help you build your family
  • 401(k) Matching: Contribution matching to help invest in your future
  • Personal Device Allowance: Tax free funds for personal device usage
  • Pre-tax Benefits: Access to Flexible Spending Accounts (FSA) and Commuter Benefits
  • Lifestyle Wallet: Monthly contributions for fitness, professional development, coworking, and more
  • Mental Health Support: Dedicated access to therapy and coaching to help you reach your goals
Employment Type:
Fulltime

Director of Engineering, Cloud Availability

As the Director of Engineering, Cloud Availability, you will lead our engineerin...
Location:
Ireland, Dublin
Salary:
Not provided
Company:
Crusoe
Expiration Date:
Until further notice
Requirements:
  • 10+ years of engineering leadership experience with a proven track record of managing high-performing technical teams
  • Deep technical knowledge of public cloud infrastructure and experience building or operating large-scale platforms (Public, Private, or Hybrid)
  • Expert-level understanding of availability, observability, SLIs/SLOs, and modern incident management frameworks
  • Proven ability to lead remote teams and successfully collaborate with US-based engineering organizations
  • Demonstrated success navigating and leading within a matrix organizational structure
  • Strong familiarity with virtual and managed Kubernetes platforms, such as EKS, GKE, or AKS
  • The ability to balance long-term organizational strategy with the immediate tactical needs of a fast-growing engineering site
Job Responsibility:
  • Organizational Leadership: Partner closely with Data Center, Network, and SRE teams to build and scale a world-class engineering organization in Dublin
  • Site Leadership & Culture: Serve as the primary point of contact and face of Crusoe leadership in Dublin, proactively managing office sentiment and ensuring the team remains focused on high-impact objectives
  • Global Strategic Alignment: Build high-trust partnerships with US-based leadership to ensure local priorities are perfectly synchronized with the global business roadmap
  • Operational Excellence: Implement and refine "follow-the-sun" protocols to enable smooth hand-offs between time zones, ensuring zero customer disruption and 24/7 reliability
  • Unified Team Vision: Foster a "one-team" mindset across geographic boundaries, breaking down silos and promoting deep collaboration between Dublin and US offices
  • Talent Development: Level up the Dublin engineering team by identifying individual strengths and establishing a culture of mentorship to grow the next generation of Engineering Leads and ICs
  • Reliability Initiatives: Lead the development of SRE functions for IaaS and managed services, including Inference, SLURM, and automated cluster management
What we offer:
  • pension contributions
  • private health and dental insurance
  • income protection
  • life assurance
Employment Type:
Fulltime