Performance Engineer - Inference Job at Cerebras Systems (Toronto)

New

Head of Inference Kernels

As a core member of the team, you will play a pivotal role in leading a high-per...

Location

United States , San Jose

Salary:

200000.00 - 300000.00 USD / Year

Etched

Expiration Date

Until further notice

Requirements

Experience in designing and optimizing GPU kernels for deep learning on GPUs using CUDA, and assembly (ASM)
Experience with low-level programming to maximize performance for AI operations, leveraging tools like Compute Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
Deep fluency with transformer inference architecture, optimization levers, and full-stack systems (e.g., vLLM, custom runtimes)
History of delivering tangible perf wins on GPU hardware or custom AI accelerators
Solid understanding of roofline models of compute throughput, memory bandwidth and interconnect performance
Experienced in running large-scale workloads on heterogeneous compute clusters, optimizing for efficiency and scalability of AI workloads
Scopes projects crisply, sets aggressive but realistic milestones, and drives technical decision-making across the team
Anticipates blockers and shifts resources proactively

Job Responsibility

Architect Best-in-Class Inference Performance on Sohu: Deliver continuous batching throughput exceeding B200 by ≥10x on priority workloads
Develop Best-in-Performance Inference Mega Kernels: Develop complex, fused kernels that increase chip utilization and reduce inference latency, and validate these optimizations through benchmarking and regression-tested in production pipelines
Architect Model Mapping Strategies: Develop system level optimizations using a mix of techniques such tensor parallelism and expert parallelism for optimal performance
Hardware-Software Co-design of Inference-time Algorithmic Innovation: Develop and deploy production-ready inference-time algorithmic improvements (e.g., speculative decoding, prefill-decode disaggregation, KV cache offloading)
Build Scalable Team and Roadmap: Grow and retain a team of high-performing inference optimization engineers
Cross-Functional Performance Alignment: Ensure inference stack and performance goals are aligned with the software infrastructure teams, GTM and hardware teams for future generations of our hardware

What we offer

Medical, dental, and vision packages with generous premium coverage
$500 per month credit for waiving medical benefits
Housing subsidy of $2k per month for those living within walking distance of the office
Relocation support for those moving to San Jose (Santana Row)
Various wellness benefits covering fitness, mental health, and more
Daily lunch + dinner in our office
significant equity package

Fulltime

New

Inference Technical Lead

The Sora team is pioneering multimodal capabilities for OpenAI’s foundation mode...

Location

United States , San Francisco

Salary:

380000.00 USD / Year

OpenAI

Expiration Date

Until further notice

Requirements

Deep expertise in model performance optimization, particularly at the inference layer
Strong background in kernel-level systems, data movement, and low-level performance tuning
Excited about scaling high-performing AI systems that serve real-world, multimodal workloads
Can navigate ambiguity, set technical direction, and drive complex initiatives to completion

Job Responsibility

Perform engineering efforts focused on improving model serving, inference performance, and system efficiency
Drive optimizations from a kernel and data movement perspective to improve system throughput and reliability
Partner closely with research and product teams to ensure our models perform effectively at scale
Design, build, and improve critical serving infrastructure to support Sora’s growth and reliability needs
Contribute to improvements in model serving efficiency for Sora
Drive initiatives to optimize inference performance and scalability
Be engaged in model design, to help assist our researchers in developing inference-friendly models

What we offer

Offers Equity
Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
401(k) retirement plan with employer match
Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
Mental health and wellness support
Employer-paid basic life and disability coverage
Annual learning and development stipend to fuel your professional growth

Fulltime

New

Engineering Manager - Inference

We are looking for an Inference Engineering Manager to lead our AI Inference tea...

Location

United States , San Francisco

Salary:

300000.00 - 385000.00 USD / Year

Perplexity

Expiration Date

Until further notice

Requirements

5+ years of engineering experience with 2+ years in a technical leadership or management role
Deep experience with ML systems and inference frameworks (PyTorch, TensorFlow, ONNX, TensorRT, vLLM)
Strong understanding of LLM architecture: Multi-Head Attention, Multi/Grouped-Query Attention, and common layers
Experience with inference optimizations: batching, quantization, kernel fusion, FlashAttention
Familiarity with GPU characteristics, roofline models, and performance analysis
Experience deploying reliable, distributed, real-time systems at scale
Track record of building and leading high-performing engineering teams
Experience with parallelism strategies: tensor parallelism, pipeline parallelism, expert parallelism
Strong technical communication and cross-functional collaboration skills

Job Responsibility

Lead and grow a high-performing team of AI inference engineers
Develop APIs for AI inference used by both internal and external customers
Architect and scale our inference infrastructure for reliability and efficiency
Benchmark and eliminate bottlenecks throughout our inference stack
Drive large sparse/MoE model inference at rack scale, including sharding strategies for massive models
Push the frontier with building inference systems to support sparse attention, disaggregated pre-fill/decoding serving, etc.
Improve the reliability and observability of our systems and lead incident response
Own technical decisions around batching, throughput, latency, and GPU utilization
Partner with ML research teams on model optimization and deployment
Recruit, mentor, and develop engineering talent

What we offer

Equity
Health
Dental
Vision
Retirement
Fitness
Commuter and dependent care accounts

Fulltime

New

Machine Learning Engineer - Inference

Together AI is seeking a Machine Learning Engineer to join our Inference Engine ...

Location

United States , San Francisco

Salary:

160000.00 - 230000.00 USD / Year

Together AI

Expiration Date

Until further notice

Requirements

3+ years of experience writing high-performance, well-tested, production-quality code
Proficiency with Python and PyTorch
Demonstrated experience in building high performance libraries and tooling
Excellent understanding of low-level operating systems concepts including multi-threading, memory management, networking, storage, performance, and scale

Job Responsibility

Design and build the production systems that power the Together AI inference engine, enabling reliability and performance at scale
Develop and optimize runtime inference services for large-scale AI applications
Collaborate with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
Conduct design and code reviews to ensure high standards of quality
Create services, tools, and developer documentation to support the inference engine
Implement robust and fault-tolerant systems for data ingestion and processing

What we offer

competitive compensation
startup equity
health insurance
other competitive benefits

Fulltime

New

LLM Inference Frameworks and Optimization Engineer

At Together.ai, we are building state-of-the-art infrastructure to enable effici...

Location

United States , San Francisco

Salary:

160000.00 - 230000.00 USD / Year

Together AI

Expiration Date

Until further notice

Requirements

3+ years of experience in deep learning inference frameworks, distributed systems, or high-performance computing
Familiar with at least one LLM inference frameworks (e.g., TensorRT-LLM, vLLM, SGLang, TGI(Text Generation Inference))
Background knowledge and experience in at least one of the following: GPU programming (CUDA/Triton/TensorRT), compiler, model quantization, and GPU cluster scheduling
Deep understanding of KV cache systems like Mooncake, PagedAttention, or custom in-house variants
Proficient in Python and C++/CUDA for high-performance deep learning inference
Deep understanding of Transformer architectures and LLM/VLM/Diffusion model optimization
Knowledge of inference optimization, such as workload scheduling, CUDA graph, compiled, efficient kernels
Strong analytical problem-solving skills with a performance-driven mindset
Excellent collaboration and communication skills across teams

Job Responsibility

Design and develop fault-tolerant, high-concurrency distributed inference engine for text, image, and multimodal generation models
Implement and optimize distributed inference strategies, including Mixture of Experts (MoE) parallelism, tensor parallelism, pipeline parallelism for high-performance serving
Apply CUDA graph optimizations, TensorRT/TRT-LLM graph optimizations, and PyTorch-based compilation (torch.compile), and speculative decoding to enhance efficiency and scalability
Collaborate with hardware teams on performance bottleneck analysis, co-optimize inference performance for GPUs, TPUs, or custom accelerators
Work closely with AI researchers and infrastructure engineers to develop efficient model execution plans and optimize E2E model serving pipelines

What we offer

competitive compensation
startup equity
health insurance
other competitive benefits

Fulltime

Research Engineer AI

The role involves conducting high-quality research in AI and HPC, shaping future...

Location

United Kingdom , Bristol

Salary:

Not provided

Hewlett Packard Enterprise

Expiration Date

Until further notice

Requirements

A good working knowledge of AI/ML frameworks, at least TensorFlow and PyTorch, as well as the data preparation, handling, and lineage control, as well as model deployment, in particular in a distributed environment
At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
Parallel programming experience, with relevant programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages is highly desirable

Job Responsibility

Perform world-class research while also shaping products of the future
Enable high performance AI software stacks on supercomputers
Provide new environments/abstractions to support application developers to build, deploy, and run AI applications taking advantage of leading-edge hardware at scale
Manage modern data-intensive AI training and inference workloads
Port and optimize workloads of key research centers like the AI safety institute
Support onboarding and scaling of domain-specific applications
Foster collaboration with the UK and European research community

What we offer

Health & Wellbeing benefits that support physical, financial and emotional wellbeing
Career development programs catered to achieving career goals
Unconditional inclusion in the workplace
Flexibility to manage work and personal needs

Fulltime

New

LLM Inference Performance & Evals Engineer

Join the inference model team dedicated to bring up the state-of-the-art models,...

Location

Canada , Toronto

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

3+ years building high-performance ML or systems software
Solid grounding in Transformer math—attention scaling, KV-cache, quantisation—or clear evidence you learn this material rapidly
Comfort navigating the full AI toolchain: Python modeling code, compiler IRs, performance profiling, etc.
Strong debugging skills across performance, numerical accuracy, and runtime integration
Prior experience in modeling, compilers or crafting benchmarks or performance studies
not just black-box QA tests
Strong passion to leverage AI agents or workflow orchestration tools to boost personal productivity

Job Responsibility

Prototype and benchmark cutting-edge ideas: new attentions, MoE, speculative decoding, and many more innovations as they emerge
Develop agent-driven automation that designs experiments, schedules runs, triages regressions, and drafts pull-requests
Work closely with compiler, runtime, and silicon teams: unique opportunity to experience the full stack of software/hardware innovation
Keep pace with the latest open- and closed-source models
run them first on wafer scale to expose new optimization opportunities

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open source their cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Our simple, non-corporate work culture that respects individual beliefs

New

Senior GPU Engineer

We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team....

Location

China , Beijing

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
4+ years of experience in systems programming, HPC, or GPU software development, featuring at least 5 years of hands-on CUDA/C++ kernel development
Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
Mastery of NVIDIA Nsight Systems/Compute
Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput

Job Responsibility

Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads
Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)

Fulltime

Performance Engineer - Inference

Cerebras Systems

Location:
Canada , Toronto

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Additional Information:

Job Posted:
February 17, 2026

Looking for more opportunities? Search for other job offers that match your skills and interests.

Similar Jobs for Performance Engineer - Inference

Head of Inference Kernels

Inference Technical Lead

Engineering Manager - Inference

Machine Learning Engineer - Inference

LLM Inference Frameworks and Optimization Engineer

Research Engineer AI

LLM Inference Performance & Evals Engineer

Senior GPU Engineer

Performance Engineer - Inference

Cerebras Systems

Location:Canada , Toronto

Category:IT - Software Development

Contract Type:Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Additional Information:

Job Posted:February 17, 2026

Looking for more opportunities? Search for other job offers that match your skills and interests.

Similar Jobs for Performance Engineer - Inference

Head of Inference Kernels

Inference Technical Lead

Engineering Manager - Inference

Machine Learning Engineer - Inference

LLM Inference Frameworks and Optimization Engineer

Research Engineer AI

LLM Inference Performance & Evals Engineer

Senior GPU Engineer

Location:
Canada , Toronto

Category:
IT - Software Development

Contract Type:
Not provided

Job Posted:
February 17, 2026