Engineering Manager, GPU Kernel Job at Wayve (London)

New

Sr. Software Development Engineer

As a core member of the team, you will play a pivotal role in optimizing and dev...

Location

China, Shanghai

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Skilled engineer with strong technical and analytical expertise in C++ development within Linux environments
Ability to define goals, manage development efforts, and deliver high-quality solutions
Strong problem-solving skills
Proactive approach
Keen understanding of software engineering best practices
Experience in GPU kernel development & optimization for AMD GPUs using HIP, CUDA, and assembly (ASM)
Strong knowledge of AMD architectures (GCN, RDNA) and low-level programming
Experience leveraging tools like Composable Kernel (CK), CUTLASS, and Triton for multi-GPU and multi-platform performance
Experience in integrating optimized GPU performance into machine learning frameworks (e.g., TensorFlow, PyTorch)
Skilled in Python and C++

Job Responsibility

Optimize Deep Learning Frameworks: Enhance and optimize frameworks like TensorFlow and PyTorch for AMD GPUs in open-source repositories
Develop GPU Kernels: Create and optimize GPU kernels to maximize performance for specific AI operations
Develop & Optimize Models: Design and optimize deep learning models specifically for AMD GPU performance
Collaborate with GPU Library Teams: Work closely with internal teams to analyze and improve training and inference performance on AMD GPUs
Collaborate with Open-Source Maintainers: Engage with framework maintainers to ensure code changes are aligned with requirements and integrated upstream
Work in Distributed Computing Environments: Optimize deep learning performance on both scale-up (multi-GPU) and scale-out (multi-node) systems
Utilize Cutting-Edge Compiler Tech: Leverage advanced compiler technologies to improve deep learning performance
Optimize Deep Learning Pipeline: Enhance the full pipeline, including integrating graph compilers
Software Engineering Best Practices: Apply sound engineering principles to ensure robust, maintainable solutions

Engineering Manager, Kernel Reliability

We're looking for a deeply technical, hands-on engineering leader for our on-fie...

Location

United States, Sunnyvale; Canada, Toronto

Salary:

Not provided

Cerebras Systems

Expiration Date

Until further notice

Requirements

6+ years in software engineering
3+ years leading teams in SW/HW reliability, debug, diagnostics, failure analysis, or related fields
Expertise in parallel and distributed programming (message passing, multicore, GPU, embedded, etc.)
Expertise in debug and diagnostic tool development or expert usage (debuggers, core dump handling, code sanitizers, etc.)
Experience debugging distributed and parallel applications (deadlocks, livelocks, race conditions, etc.)
Deep understanding of computer architectures (instruction pipelining, multithreading, networking, etc.)
Strong background in monitoring and reliability engineering (incident response, post-mortem analysis, etc.)
Demonstrated ability to recruit and retain high-performing teams, mentor engineers, and partner cross-functionally to deliver customer-facing products.

Job Responsibility

Provide hands-on technical leadership, owning the technical vision and roadmap for the kernel-centric reliability of our internal and customer-facing systems
Assist System and Cluster Operations teams in reducing system and service downtime after failures by providing tooling and manual intervention for failure analysis and diagnostics
Work with the Debug Team to enhance debug tools with the goal of speeding up failure analysis
Collaborate with SW teams to improve the software stack, including kernels, for better on-field debugging and failure analysis
Work with the ASIC and HW architecture teams to co-design next-generation architectures with reliability and ease of debug in mind
Lead, mentor, and grow a high-caliber team of engineers, fostering a culture of technical excellence and rapid execution.

What we offer

Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source your cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
Simple, non-corporate work culture that respects individual beliefs.

Senior Machine Learning Engineer

As a Machine Learning Engineer at Dedrone, you’ll play a pivotal role in advanci...

Location

United States, Sterling

Salary:

Not provided

Axon

Expiration Date

Until further notice

Requirements

5+ years of professional experience in modern C++ (C++14/17 or later), with strong object-oriented and generic programming skills
Deep understanding of multithreading and concurrency (threads, thread pools, locks, lock-free structures, atomics, futures, async patterns) and experience building robust, concurrent systems
Hands-on experience with parallel processing frameworks or patterns (SIMD, task-based parallelism, GPU offload, or similar) for real-time or high-throughput applications
Strong command of data structures and algorithms, and the ability to choose and implement the right structures for performance-critical, memory-constrained environments
Proven experience with memory management and performance optimization in C++ (stack vs heap, custom allocators, cache-aware design, avoiding fragmentation, RAII, move semantics)
Practical experience with CUDA (or similar GPU programming frameworks): writing kernels, managing GPU memory, optimizing for occupancy and bandwidth, and integrating with C++ codebases
Familiarity with Linux-based development (build systems like CMake, unit testing frameworks, containerization and/or cross-compilation for edge devices)
Strong debugging and profiling skills across CPU and GPU, and a methodical approach to benchmarking and regression testing
Excellent collaboration and communication skills, with a track record of working closely with research or ML teams to move algorithms from prototype to production

Job Responsibility

Design and implement high-performance C++ software that runs computer vision and tracking algorithms in real time on edge devices
Work closely with computer vision / self-supervised learning engineers to integrate their models into production pipelines, including pre/post-processing, I/O, and system orchestration
Build and optimize multithreaded and parallel processing pipelines for ingesting, synchronizing, and processing data from a networked system of cameras
Implement and tune CUDA kernels and GPU-accelerated components to maximize throughput and minimize latency for inference, tracking, and search
Design robust data structures and memory management strategies for handling large volumes of video, sensor, and metadata streams under tight compute and power constraints
Profile and optimize code using tools such as perf, valgrind, nvprof / Nsight, and similar to identify bottlenecks and improve CPU/GPU utilization
Collaborate with simulation and CV teams to deploy and evaluate algorithms in realistic test scenarios, including fault handling and performance monitoring
Develop clean, well-tested, and well-documented C++ libraries and services that can be reused across products and future airspace applications
Contribute to system-level architecture decisions, including inter-process communication, scheduling, resource allocation, and deployment strategies on edge platforms

What we offer

Competitive salary and 401k with employer match
Discretionary paid time off
Paid parental leave for all
Medical, Dental, Vision plans
Fitness Programs
Emotional & Mental Wellness support
Learning & Development programs
Snacks in our offices

Fulltime

New

Senior Software Development Engineer

We are seeking an experienced and highly technical SMTS Software Development Eng...

Location

United Kingdom

Salary:

Not provided

AMD

Expiration Date

Until further notice

Requirements

Bachelor’s or Master’s degree in Computer Science, Computer Engineering, or related technical field
8+ years of software engineering experience in systems software, runtime libraries, GPU programming, or compiler/runtime interfaces
Strong proficiency in modern C++ (C++14/C++17 or newer), templates, memory models, and low‑level systems programming
Deep understanding of at least one GPU computing model (HIP, CUDA, SYCL, OpenCL, OpenMP offload)
Hands‑on experience with runtime systems, driver interfaces, or high‑performance compute libraries
Strong debugging skills using tools such as gdb, sanitizers, profilers, and GPU debugging tools
Solid understanding of parallel programming concepts—memory hierarchy, synchronization, concurrency, thread scheduling

Job Responsibility

Architect, implement, and optimize features in the HIP runtime, including memory management, kernel dispatch, device abstraction, multi‑GPU coordination, and synchronization primitives
Contribute to the evolution of the HIP programming model and interoperability with ROCr, HSA runtime, and compiler toolchains
Ensure functional correctness, performance, and scalability of runtime APIs across different GPU generations
Conduct root‑cause analysis and systems‑level debugging across the runtime, driver, compiler, and hardware layers
Profile GPU applications and internal runtime components to identify bottlenecks and design performance improvements
Optimize HIP runtime behavior for large-scale AI, HPC, and cloud workloads
Work closely with compiler teams (LLVM/Clang), driver teams, GPU architecture, and systems engineers to deliver end‑to‑end GPU software solutions
Contribute to API specifications and collaborate with upstream open-source communities where appropriate
Define and drive technical strategy for correctness, reliability, and conformance of the HIP runtime
Support enhancements in automated testing, CI, and stress/failure scenarios in the HIP test suite

New

ROCm Core SW Project Manager

We are seeking an experienced Project Manager to manage ROCm development project...

Location

Canada, Markham

Salary:

139200.00 - 208800.00 CAD / Year

AMD

Expiration Date

Until further notice

Requirements

5+ years of program or project management experience in software development
At least 3 years focused on systems software, GPU computing, or HPC/AI infrastructure
Demonstrated experience managing complex, multi-team technical programs involving pre-silicon validation or hardware/software co-design
Strong foundational knowledge of machine learning frameworks, model architectures, and performance optimization techniques
Deep understanding of software development lifecycle (SDLC), agile methodologies, and modern CI/CD practices
Excellent stakeholder management, communication, and influencing skills across engineering and executive levels
Bachelor’s degree in Computer Science, Electrical Engineering, or related technical field

Job Responsibility

Manage ROCm development projects for AMD's next-generation GPUs
Drive internal SW execution including GPU performance optimization, pre-silicon performance feature development, and GPU kernel development
Coordinate across software, hardware, and validation teams to deliver a high-performance, reliable, and scalable ROCm software stack
Work with the ROCm SW team to drive pre-silicon software development and performance validation activities using SW/HW emulation platforms
Orchestrate hardware-software co-development efforts for new GPU ML features
Establish and track KPIs for new GPU feature quality, performance, and time-to-market
Proactively identify and mitigate project risks

Fulltime

Senior Software Engineer

As a Senior Software Engineer, you will lead the design, development, and valida...

Location

United States, Multiple Locations

Salary:

119800.00 - 234700.00 USD / Year

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
2+ years experience in Kernel bring-up and platform enablement
1+ years experience in GPU driver development and integration
2+ years experience in C / C++ kernel-space programming, Git-based source management and release branching, RPM packaging, spec file authoring, and build automation
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role

Job Responsibility

Lead kernel integration and validation for new silicon platforms, from early board bring‑up through full feature enablement
Architect and maintain the Maintenance OS (MOS) kernel, ensuring long‑term stability, security, and compatibility across multiple hardware generations
Own the end‑to‑end lifecycle of GPU drivers (NVIDIA, amdgpu, ROCm), including: integration of out‑of‑tree (OOT) kernel drivers; DKMS packaging, build, and version‑tracking; and compatibility validation against kernel and firmware baselines
Define and manage build and release pipelines for kernel RPMs, driver SRPMs, and signing workflows
Collaborate with hardware, platform, and firmware teams to validate kernel features tied to new silicon capabilities (PCIe, CXL, IOMMU, NUMA, etc.)
Own spec files, RPM packaging, and associated CI/CD automation for kernel and driver deliverables
Conduct deep‑dive debugging across the full stack — from kernel to device firmware — to resolve performance, stability, or bring‑up issues
Drive engagement with upstream Linux communities to upstream or align kernel changes where feasible

Fulltime

New

Software Engineer 2 - Processing Unit for Copilot

We are seeking an expert GPU Engineer 2 to join our AI Infrastructure team. In t...

Location

China, Beijing

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Architectural Mastery: Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
Performance Engineering: Mastery of NVIDIA Nsight Systems/Compute
Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput

Job Responsibility

Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads
Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)

Fulltime

New

Senior Software Engineer - Processing Unit for Copilot

We are seeking an expert Senior GPU Engineer to join our AI Infrastructure team....

Location

China, Beijing

Salary:

Not provided

Microsoft Corporation

Expiration Date

Until further notice

Requirements

Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience
Architectural Mastery: Expertise in the CUDA programming model and NVIDIA GPU architectures (specifically Ampere/Hopper)
Deep understanding of the memory hierarchy (Shared Memory, L2 cache, Registers), warp-level primitives, occupancy optimization, and bank conflict resolution
Familiarity with advanced hardware features: Tensor Cores, TMA (Tensor Memory Accelerator), and asynchronous copy
Proven ability to navigate and modify complex, large-scale codebases (e.g., PyTorch internals, Linux kernel)
Experience with build and binding ecosystems: CMake, pybind11, and CI/CD for GPU workloads
Performance Engineering: Mastery of NVIDIA Nsight Systems/Compute
Ability to mathematically reason about performance using the Roofline Model, memory bandwidth utilization, and compute throughput

Job Responsibility

Custom Operator Development: Design and implement highly optimized GPU kernels (CUDA/Triton) for critical deep learning operations (e.g., FlashAttention, GEMM, LayerNorm) to outperform standard libraries
Inference Engine Architecture: Contribute to the development of our high-performance inference engine, focusing on graph optimizations, operator fusion, and dynamic memory management (e.g., KV Cache optimization)
Performance Optimization: Deeply analyze and profile model performance using tools like Nsight Systems/Compute. Identify bottlenecks in memory bandwidth, instruction throughput, and kernel launch overheads
Model Acceleration: Implement advanced acceleration techniques such as Quantization (INT8, FP8, AWQ), Kernel Fusion, and continuous batching
Distributed Computing: Optimize communication primitives (NCCL) to enable efficient multi-GPU and multi-node inference (Tensor Parallelism, Pipeline Parallelism)
Hardware Adaptation: Ensure the software stack fully utilizes modern GPU architecture features (e.g., NVIDIA Hopper/Ampere Tensor Cores, Asynchronous Copy)

Fulltime

Engineering Manager, GPU Kernel

Wayve

Location:
United Kingdom, London

Category:
IT - Software Development

Contract Type:
Not provided

Salary:

Job Description:

Job Responsibility:

Requirements:

Nice to have:

Additional Information:

Job Posted:
January 01, 2026

Similar Jobs for Engineering Manager, GPU Kernel

Sr. Software Development Engineer

Engineering Manager, Kernel Reliability

Senior Machine Learning Engineer

Senior Software Development Engineer

ROCm Core SW Project Manager

Senior Software Engineer

Software Engineer 2 - Processing Unit for Copilot

Senior Software Engineer - Processing Unit for Copilot
