Software Development Engineer – Distributed Inference

AMD

Location:
United States, Austin

Contract Type:
Not provided

Salary:
143280.00 - 214920.00 USD / Year

Job Description:

AMD is looking for a software engineer who is passionate about distributed inference on AMD GPUs and about improving the performance of key applications and benchmarks. You will be a member of a core team of talented industry specialists and will work with the very latest hardware and software technology.

Job Responsibility:

  • Enable and benchmark AI models on large-scale distributed systems to evaluate performance, accuracy, and scalability
  • Optimize AI workloads across scale-up (multi-GPU), scale-out (multi-node), and scale-across distributed system configurations
  • Collaborate closely with internal GPU library teams to analyze and optimize distributed workloads for high throughput and low latency
  • Develop and apply optimal parallelization strategies for AI workloads to achieve best-in-class performance across diverse system configurations
  • Contribute to distributed model management systems, model zoos, monitoring frameworks, benchmarking pipelines, and technical documentation
  • Build and maintain real-time dashboards reporting performance, accuracy, and reliability metrics for internal stakeholders and external users

Requirements:

  • Bachelor’s, Master’s, or PhD degree in Computer Science, Computer Engineering, or a related field, or equivalent practical experience
  • Strong technical expertise in C++/Python development
  • Experience solving performance problems and investigating scalability issues on multi-GPU, multi-node clusters
  • Passionate about quality assurance, benchmarking, and automation in the AI/ML space
  • Strong C/C++ and Python skills, with experience in software design, debugging, performance analysis, and test development
  • Experience running AI workloads on large-scale, heterogeneous compute clusters
  • Familiarity with cluster management and orchestration platforms such as SLURM and Kubernetes (K8s)
  • Experience with GitHub, Jenkins, or similar CI/CD tools and modern development workflows

Nice to have:

  • Hands-on experience with AI inference or serving frameworks such as vLLM, SGLang, and Llama.cpp
  • Understanding of KV cache transfer mechanisms and technologies (e.g., Mooncake, NIXL/RIXL) and expert parallelism approaches (e.g., DeepEP, MORI, PPLX-Garden)

Additional Information:

Job Posted:
February 04, 2026

Similar Jobs for Software Development Engineer – Distributed Inference

Senior Software Engineer

At JFrog, we’re reinventing DevOps and MLOps to help the world’s greatest compan...
Location:
Israel, Netanya/Tel Aviv
Salary:
Not provided
JFrog
Expiration Date:
Until further notice
Requirements:
  • 5+ years of proven experience in software development
  • Strong background in designing, developing, and debugging complex distributed systems (e.g., microservices, event-driven architectures)
  • Hands-on experience with containerized environments, microservices, and Kubernetes
  • Proven experience with at least one major cloud provider (e.g., AWS, GCP, Azure)
  • Ability to lead technical discussions, mentor engineers, and drive architectural decisions
Job Responsibility:
  • Be an integral part of a highly skilled team working to build the leading MLOps platform in the industry
  • Maintain and evolve the Runtime team’s products, ensuring their reliability and scalability
  • Design and develop a complete hosting system that supports various types of inference, analytics, monitoring, distribution, and more – enabling customers to run large-scale real-time, batch, and streaming ML pipelines
  • Play a key role in shaping our cross-company engineering culture
  • Conduct high-quality design reviews with a strong emphasis on scalability, maintainability, security, and sound use of design patterns
  • Write maintainable, well-tested code in multiple programming languages
  • Continuously improve the efficiency, scalability, and stability of critical system components

Software Engineer Staff

This Software Engineer Staff will be engaged in data science-related research an...
Location:
India, Bangalore
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Use analytical and programming skills and open-source systems such as Apache Storm, Apache Spark, Elasticsearch, Cassandra, and graph databases to develop data processing pipelines with the required efficacy and latency
  • Good knowledge of and experience with big data tool sets and the techniques of distributed storage and computation engines
  • Experience developing reusable and highly scalable data processing components
  • Good knowledge of and experience working with cloud-based CI/CD tools and cloud DevOps teams to collect stats and create monitors for data processing pipelines
  • Develop good-quality Python APIs to support microservices
  • Knowledge of the APIs of various NoSQL storage systems such as Elasticsearch, Cassandra, and Redis
  • Good understanding of Python Flask web services and the ability to develop good-quality code
  • Troubleshoot production environments and customer-reported issues
  • Knowledge of multi-cloud production environments
  • Agility in troubleshooting open-source data processing engines such as Apache Spark, Apache Storm, and Apache Flink
Job Responsibility:
  • Designs, develops, troubleshoots and debugs software programs for software enhancements and new products
  • Develops software including operating systems, compilers, routers, networks, utilities, databases and Internet-related tools
  • Determines hardware compatibility and/or influences hardware design
  • Engage in data science-related research, software application development, and engineering duties related to our enterprise-grade Wi-Fi technology and autonomous platform to provide unprecedented visibility into the user experience
  • Collaborate with other engineers and product managers to build the next generation of autonomous Wi-Fi networks leveraging big data and predictive models
  • Use knowledge of wireless communication networks, machine learning and software engineering to develop and implement scalable algorithms to process a large amount of streaming data to detect anomalies, predict problems, and classify them in real-time
  • Leverage the data collected from the Wi-Fi network to empower the inference engine of our Mist platform and systems, including the Mist virtual assistant chat bot
  • Determine the likelihood of failures across the Wi-Fi network and perform failure scope analysis
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime

Principal Software Engineer

Principal Software Engineer role at Hewlett Packard Enterprise to design, develo...
Location:
United States, San Jose
Salary:
148000.00 - 340500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
  • 10+ years of experience in software engineering with a focus on Python, Go, or Java
  • Strong understanding of RESTful API design and development
  • 2+ years of experience working with large-scale distributed systems based on either cloud technologies or Kubernetes
  • 2+ years of experience with event-driven technologies like Kafka and Apache Storm/Flink
  • 2+ years of experience with big data technologies like Apache Spark/Databricks
  • Proficient in working with Redis and databases like Cassandra/Datastax
  • Excellent problem-solving and analytical skills
  • Strong communication and collaboration skills
Job Responsibility:
  • Design, develop, and test software related to the cloud-based network configuration and reporting system
  • Solve complex problems and design subsystems for the Mist platform
  • Develop software for highly scalable and fault-tolerant cloud-scale distributed applications
  • Develop microservices using Python and/or Go (Golang)
  • Develop event-driven systems using Python and Java
  • Develop software for AIDE's real-time data pipeline and batch processing
  • Develop ETL pipelines aiding in training and inference of various ML models using big-data frameworks like Apache Spark
  • Build metrics, monitoring and structured logging into the product
  • Write unit, integration and functional tests
  • Participate in collaborative, DevOps style, lean practices
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive suite of benefits supporting physical, financial and emotional wellbeing
  • Fulltime

Senior Software Engineer - ML Infrastructure

We build simple yet innovative consumer products and developer APIs that shape h...
Location:
United States, San Francisco
Salary:
180000.00 - 270000.00 USD / Year
Plaid
Expiration Date:
Until further notice
Requirements:
  • 5+ years of industry experience as a software engineer, with strong focus on ML/AI infrastructure or large-scale distributed systems
  • Hands-on expertise in building and operating ML platforms (e.g., feature stores, data pipelines, training/inference frameworks)
  • Proven experience delivering reliable and scalable infrastructure in production
  • Solid understanding of ML Ops concepts and tooling, as well as best practices for observability, security, and reliability
  • Strong communication skills and ability to collaborate across teams
Job Responsibility:
  • Design and implement large-scale ML infrastructure, including feature stores, pipelines, deployment tooling, and inference systems
  • Drive the rollout of Plaid’s next-generation feature store to improve reliability and velocity of model development
  • Help define and evangelize an ML Ops “golden path” for secure, scalable model training, deployment, and monitoring
  • Ensure operational excellence of ML pipelines and services, including reliability, scalability, performance, and cost efficiency
  • Collaborate with ML product teams to understand requirements and deliver solutions that accelerate experimentation and iteration
  • Contribute to technical strategy and architecture discussions within the team
  • Mentor and support other engineers through code reviews, design discussions, and technical guidance
What we offer:
  • medical, dental, vision, and 401(k)
  • Fulltime

Software Engineer, Infrastructure

As a Software Engineer on our Infrastructure team, you will help design and buil...
Location:
United States, New York; San Mateo; Redwood City
Salary:
140000.00 - 150000.00 USD / Year
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • Strong programming skills in Python, C++, or a similar language
  • Solid understanding of computer systems concepts such as networking, storage, and distributed computing
  • Familiarity with cloud platforms like AWS, GCP, or Azure, and containerization tools like Docker or Kubernetes
  • Knowledge and interest in cloud infrastructure, distributed systems, and machine learning
Job Responsibility:
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as job schedulers, autoscalers, resource managers, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Collaborate with ML, DevOps, and product teams to translate research and product needs into infrastructure solutions
  • Learn and apply modern cloud technologies including Kubernetes, Ray, Kubeflow, and MLFlow
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer:
  • Meaningful equity in a fast-growing startup
  • Competitive salary and comprehensive benefits package
  • Fulltime

Principal Software Engineer

Principal Software Engineer role at Hewlett Packard Enterprise to design, develo...
Location:
United States, San Jose
Salary:
148000.00 - 340500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's or Master's degree in Computer Science, Computer Engineering, or a related field
  • 10+ years of experience in software engineering with a focus on Python, Go, or Java
  • Strong understanding of RESTful API design and development
  • 2+ years of experience working with large-scale distributed systems based on either cloud technologies or Kubernetes
  • 2+ years of experience with event-driven technologies like Kafka and Apache Storm/Flink
  • 2+ years of experience with big data technologies like Apache Spark/Databricks
  • Proficient in working with Redis and databases like Cassandra/Datastax
  • Must hold U.S. citizenship
Job Responsibility:
  • Design, develop, and test software related to the cloud-based network configuration and reporting system
  • Solve complex problems and design subsystems for the Mist platform
  • Develop software for highly scalable and fault-tolerant cloud-scale distributed applications
  • Develop microservices using Python and/or Go (Golang)
  • Develop event-driven systems using Python and Java
  • Develop software for AIDE's real-time data pipeline and batch processing
  • Develop ETL pipelines aiding in training and inference of various ML models using big-data frameworks like Apache Spark
  • Build metrics, monitoring and structured logging into the product
  • Write unit, integration and functional tests
  • Participate in collaborative, DevOps style, lean practices
What we offer:
  • Health & Wellbeing benefits
  • Personal & Professional Development programs
  • Unconditional Inclusion environment
  • Comprehensive benefits suite supporting physical, financial and emotional wellbeing
  • Fulltime

Software Engineer, AI Infrastructure

As a Software Engineer on our AI Infrastructure team, you will help design the c...
Location:
United States, New York, NY; San Mateo, CA
Salary:
Not provided
Fireworks AI
Expiration Date:
Until further notice
Requirements:
  • Bachelor’s degree in Computer Science, Engineering, or a related technical field (or equivalent practical experience)
  • 3 years of experience in software engineering, with a focus on infrastructure or machine learning systems
  • Strong programming skills in Python, Go, or a similar language
  • Proven experience in ML infrastructure and tooling (e.g., PyTorch, MLflow, Vertex AI, SageMaker, Kubernetes, etc.)
  • Basic understanding of LLM concepts (e.g., context length, disaggregated prefill, KV cache memory estimation)
Job Responsibility:
  • Contribute to the design and development of scalable backend infrastructure that supports distributed training, inference, and data pipelines
  • Build and maintain core backend services such as the LLM CI/CD pipeline, control plane, and model serving systems
  • Support performance optimization, cost efficiency, and reliability improvements across compute, storage, and networking layers
  • Build frameworks and safeguards to ensure Fireworks AI has the best model quality in the industry
  • Collaborate with performance, training, and product teams to translate research and product needs into infrastructure solutions
  • Participate in code reviews, technical discussions, and continuous integration and deployment processes
What we offer:
  • Solve Hard Problems: Tackle challenges at the forefront of AI infrastructure
  • Build What’s Next: Work with bleeding-edge technology that impacts how businesses and developers harness AI globally
  • Ownership & Impact: Join a fast-growing, passionate team where your work directly shapes the future of AI—no bureaucracy, just results
  • Learn from the Best: Collaborate with world-class engineers and AI researchers who thrive on curiosity and innovation
  • Fulltime

Senior Software Engineer, Backend

As a Senior Software Engineer, Backend specializing in database architecture and...
Location:
United States, San Francisco
Salary:
150000.00 - 240000.00 USD / Year
Chef Robotics
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Computer Science, Engineering, or equivalent practical experience
  • 7+ years of professional experience in backend development roles with demonstrated leadership experience
  • Expert knowledge of relational databases (MySQL, PostgreSQL) including schema design, optimization, and administration
  • Strong proficiency with Python and JavaScript/TypeScript with advanced software engineering skills
  • Extensive experience leading projects with at least two web frameworks: Flask, FastAPI, Django, Node.js, or Next.js
  • Proven experience designing and implementing RESTful and GraphQL APIs at scale
  • Advanced understanding of containerization (Docker) and orchestration (Kubernetes) technologies
  • Experience with cloud infrastructure and deployment (AWS, GCP, or Azure) in production environments
  • Proven experience leading complex backend projects and mentoring junior engineers
  • Understanding of data requirements for robotics or automation systems
Job Responsibility:
  • Lead the design, implementation, and optimization of database schemas to support robot operations, telemetry, recipe management, and system analytics
  • Develop robust data migration strategies and version control for database schema evolution
  • Implement efficient query optimization and indexing strategies to support high-throughput robot operations
  • Establish data integrity protocols and backup systems to ensure operational continuity across customer deployments
  • Create scalable data access layers that balance security, performance, and maintainability
  • Mentor team members on database design patterns and optimization techniques
  • Lead the development and maintenance of scalable APIs to serve robot control systems, dashboards, and monitoring tools
  • Design and implement secure authentication and authorization mechanisms across backend services
  • Develop robust middleware for processing and validating data between robotics subsystems
  • Create service interfaces that enable efficient communication between robotics components and cloud services
What we offer:
  • medical, dental, and vision insurance
  • commuter benefits
  • flexible paid time off (PTO)
  • catered lunch
  • 401(k) matching
  • early-stage equity
  • Fulltime