In the AI infrastructure organization, key focus areas include simplifying large hardware deployments to a push-button operation, providing a single pane of glass for observability and monitoring, and delivering software capabilities for built-in resiliency. We are looking for a Senior Software Development Engineer in Test who can make a big impact on how we test and validate thousands of nodes in large deployments to ensure the cluster is 99.999% reliable.
Job Responsibilities:
Innovate and execute tests on cutting-edge AI infrastructure
Define optimized test strategies and methodologies
Be a quick learner and adapt to new technologies
Build a strong understanding of how to break large distributed-systems challenges into smaller components that can be unit tested
Take an automation-first approach: aim for 100% automated testing of all cluster features across high availability, failure scenarios, performance, stress, and security
Champion cluster security, reliability (99.9999% uptime), and ease of use through observability
Test all components of the AI cluster, including but not limited to cluster software such as Kubernetes, Prometheus, and Grafana, and cluster hardware such as ML wafer-scale accelerators, CPU runtime nodes, the high-speed SwarmX interconnect, and high-speed transfer of weights through the MemoryX interconnect
Qualify cluster networking solutions consisting of high-speed switches, routers, and optics from various vendors
Qualify cluster security features, including OS security, network security, cloud compliance, user access, and security certifications
Requirements:
Bachelor's or master's degree in computer science, electrical engineering, AI, data science, or a related field
10+ years of experience testing in at least one area such as enterprise software, distributed systems, or datacenter hardware and software
Experience working with large enterprise or cloud networking infrastructure, including high-speed switches, routers, and firewalls
Experience qualifying networking vendor platforms such as Juniper, Arista, or Cisco, and network test equipment such as Ixia or Spirent
Experience with datacenter networking technologies such as BGP, ECN, and PFC
Experience testing network security, compliance, and firewalls
Strong coding skills in at least one programming language such as Python, Go, or C/C++
Strong skills in debugging issues across large distributed systems, hardware, and software, with experience using tools such as gdb, strace, and network monitors
Strong understanding of operating system internals such as memory management, file systems, security basics, and performance
Strong understanding of datacenter layout and of device performance characteristics for PCIe, networking, and storage
Nice to have:
Experience with cloud technologies such as AWS, Kubernetes, and Docker, and with monitoring tools such as Grafana and Prometheus
Understanding of and experience with ML model training and inference
Understanding of ML hardware accelerators such as GPUs and custom accelerator ASICs
What we offer:
Build a breakthrough AI platform beyond the constraints of the GPU
Publish and open-source cutting-edge AI research
Work on one of the fastest AI supercomputers in the world
Enjoy job stability with startup vitality
A simple, non-corporate work culture that respects individual beliefs