Supercomputing Test Software Engineer

Etched

Location:
Taiwan, Taipei

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We are seeking highly motivated, detail-oriented Software Engineers to join our Supercomputing Testing team. This team plays a critical role in ensuring the reliability and stability of our highest-performance inference server hardware and software. As a Software Engineer on this team, you will design, develop, and execute comprehensive supercomputing test suites, analyze test results, and collaborate with hardware and software engineering teams at Etched and our ODM partners to identify and resolve potential issues. You will be at the forefront of ensuring that our server products meet the highest quality standards before they reach our customers.

Job Responsibility:

  • Design, develop, and implement automated supercomputing test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance
  • Execute tests on server hardware, monitor system performance and health, and analyze test results
  • Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Contribute to the development and maintenance of the supercomputing testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the testing process

Requirements:

  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs

Nice to have:

  • Experience with hardware burn-in testing or reliability testing
  • Experience with performance testing and benchmarking tools
  • Familiarity with hardware diagnostic tools and techniques
  • Experience with CI/CD pipelines
  • Knowledge of low-level hardware communication protocols (I2C, etc.)
  • Experience with data analysis tools and techniques

What we offer:

  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Supercomputing Test Software Engineer

Supercomputing Engineer (Test)

We are seeking highly motivated and detail-oriented Supercomputing Engineer (Tes...
Location:
United States, San Jose
Salary:
150000.00 - 275000.00 USD / Year
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility:
  • Test Development: Design, develop, and implement automated burn-in test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance
  • Test Execution: Execute burn-in tests on server hardware, monitor system performance and health, and analyze test results
  • Failure Analysis: Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaboration: Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Test Infrastructure: Contribute to the development and maintenance of the burn-in testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Documentation: Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Performance Analysis: Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Continuous Improvement: Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the burn-in testing process
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Employment Type:
Full-time

Supercomputing Software Engineer

We are seeking a highly skilled and motivated Supercomputing Software Engineer t...
Location:
Taiwan, Taipei
Salary:
Not provided
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in C/C++ or Python
  • Strong understanding of BIOS and BMC firmware architectures
  • Experience with server boot processes
  • Knowledge of root-of-trust and security principles
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Experience with advanced system logging and diagnostic tools
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility:
  • Integrate and maintain BIOS and BMC firmware, ensuring robust and efficient server boot processes
  • Measure and Tune System Performance Configuration: Analyze DRAM timings, PCIe configurations, power state transitions, etc., to ensure high performance and maximal reliability
  • Root of Trust and Security: Validate security features, including root of trust mechanisms, to protect system integrity and data security
  • Advanced System Logging and Diagnostics: Design and implement advanced system logging and diagnostic capabilities to facilitate efficient troubleshooting and performance analysis
  • Data Center Orchestration Integration: Integrate and optimize node-level data center orchestration technologies, such as Kubernetes and Docker, into the system software stack
  • System Validation and Testing: Develop and execute comprehensive test plans to validate system software functionality, stability, and performance
  • Collaboration and Troubleshooting: Collaborate with hardware and software teams to diagnose and resolve complex system-level issues
What we offer:
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
Employment Type:
Full-time

HPC Senior Technical Writer

In this position you will collaborate with knowledge management project leads an...
Location:
United States of America, Chippewa Falls
Salary:
81500.00 - 187500.00 USD / Year
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Bachelor's degree in Technical Communications, Computer Science, or related technical/communications field with 4-6 years related experience
  • Advanced University degree and 2-4 years' experience or equivalent
  • Understands concepts and develops in-depth working knowledge of products, applications, and systems in assigned area of responsibility
  • Ability to deliver on multiple project technical requirements, schedules, and information formats
  • Codes in HTML, DHTML, XML, JavaScript or similar as required
  • Applies developed subject matter knowledge to solve common and complex business issues and recommends appropriate alternatives
  • Works on problems of diverse complexity and scope
  • May act as a team or project leader, providing direction to team activities and facilitating information validation and the team decision-making process
  • Exercises independent judgment to identify and select a solution
  • Knowledge of HPC system software and hardware components, including operating systems, programming languages, system monitoring applications, HPC storage, chassis, servers, compute nodes, blades, coolant systems, power supplies, high-speed network switches and cabling, and more
Job Responsibility:
  • Create technical product documentation for software products and hardware
  • Analyze customer information requirements and product specifications to define scope of work and documentation plan
  • Identify and address the needs of all user groups, including end users, system administrators, internal support engineers, product developers, integration test teams, and training developers
  • Test documentation for install or administrative tasks to improve information deliverables and provide feedback on ease of use and user interfaces to product development
  • Manage workload in Jira and source management tools, including SDL, Oxygen, Git, and GitHub, to manage changes in the shared work environment
  • Create, revise, and manage content in Oxygen Author (DITA), Markdown, and other content tools
  • Work with developers, testers, product managers, technical support, and training to identify new features and content that needs to be reworked
What we offer:
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
Employment Type:
Full-time

Software Engineer II

Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team...
Location:
United States, Multiple Locations
Salary:
100600.00 - 199000.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 2+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python
  • OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements
  • Microsoft Cloud Background Check
Job Responsibility:
  • Be proactive and innovative about adding new metrics for monitoring the health of the supercomputers
  • Collaborate with team members and stakeholders to understand requirements and produce detailed, data-driven, collaborative design for assigned features
  • Independently use appropriate artificial intelligence tools and practices across the software development lifecycle to develop, test, debug, and maintain code for supercomputer health monitoring systems
  • Remain current in skills by investing time and effort into staying abreast of current developments that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale
  • Act as a Designated Responsible Individual (DRI) working on-call to monitor system/product feature/service for degradation, downtime, or interruptions and gain approval to restore system/product/service for simple problems
Employment Type:
Full-time

Principal Software Engineer

Microsoft Azure High Performance Computing & AI Engineering (HPC & AI Eng) team ...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python - OR equivalent experience
  • 5+ years of hands-on experience designing and developing high-volume, low-latency pipelines using products such as AzPubSub, Event Hubs, Azure Stream Analytics, Kafka, Grafana, Prometheus, or equivalent products
  • 3+ years of experience with one of AI/HPC system management OR High-Speed Networks OR HPC Storage OR managing Cloud Infrastructure
  • Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role
  • Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter
Job Responsibility:
  • Architect, design, and develop high-volume, low-latency end-to-end event pipelines that can provide first-to-know insights on events causing job interrupts and affecting job reliability
  • Conduct analysis of existing event pipelines to evaluate fidelity, granularity and latency of critical events
  • Contribute to improving key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers by enabling data scientists and domain experts to use the telemetry to identify events and issues at the intersection of datacenter and hardware, develop hypotheses, conduct A/B tests, and synthesize results
  • Partner with cross-organizational teams to evaluate available telemetry and latency, and drive the architecture, design, development, and deployment of end-to-end solutions to manage core infrastructure, including current and next-generation datacenter, IT hardware, power, and cooling technologies
  • Drive engineering and operational excellence based on issues and learnings from strategic customers on their usage scenarios to improve product features and capabilities
  • Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses and championing initiatives to minimize future customer impact
Employment Type:
Full-time

Supercomputing Engineer (Network)

We are seeking highly motivated and skilled Supercomputing Engineers (Network) t...
Location:
United States, San Jose
Salary:
150000.00 - 275000.00 USD / Year
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in C/C++
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Strong experience with device-to-device networking technologies (RDMA, GPUDirect, etc.), including RoCE
  • Experience with zero-copy networking, RDMA verbs and memory registration
  • Familiarity with queue pairs, completion queues, and transport types
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
Job Responsibility:
  • Design, develop, and implement RDMA-based network peering, supporting high-bandwidth, low-latency communication across PCIe nodes within and across racks
  • Develop tests that qualify host processors (x86), NICs, TORs and device network interfaces for high performance
  • Furnish burn-in teams with tests that represent both real-world use cases and workloads for device-to-device networking and extreme-load stress testing
  • Define the key metrics that system software must collect to maintain high availability and performance under extreme communications workloads
What we offer:
  • Medical, dental, and vision packages with generous premium coverage
  • $500 per month credit for waiving medical benefits
  • Housing subsidy of $2k per month for those living within walking distance of the office
  • Relocation support for those moving to San Jose (Santana Row)
  • Various wellness benefits covering fitness, mental health, and more
  • Daily lunch + dinner in our office
Employment Type:
Full-time

Software Engineer, Hardware

As a software engineer on the Scaling team, you’ll help build and optimize the l...
Location:
United States, San Francisco
Salary:
266000.00 - 455000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • Proficient in systems programming (e.g., Rust, C++) and scripting languages like Python
  • Experience in one or more of the following areas: compiler development, kernel authoring, accelerator programming, runtime systems, distributed systems, or high-performance simulation
  • Deep curiosity about how large-scale systems work, and enjoyment in making them faster, simpler, and more reliable
  • Excited to work in a fast-paced, highly collaborative environment with evolving hardware and ML system demands
  • Value engineering excellence, technical leadership, and thoughtful system design
Job Responsibility:
  • Design and build APIs and runtime components to orchestrate computation and data movement across heterogeneous ML workloads
  • Contribute to compiler infrastructure, including the development of optimizations and compiler passes to support evolving hardware
  • Engineer and optimize compute and data kernels, ensuring correctness, high performance, and portability across simulation and production environments
  • Profile and optimize system bottlenecks, especially around I/O, memory hierarchy, and interconnects, at both local and distributed scales
  • Develop simulation infrastructure to validate runtime behaviors, test training stack changes, and support early-stage hardware and system development
  • Rapidly deploy runtime and compiler updates to new supercomputing builds in close collaboration with hardware and research teams
  • Work across a diverse stack, primarily using Rust and Python, with opportunities to influence architecture decisions across the training framework
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Electrical Engineer - Systems

The Scaling team works on the design of our AI supercomputers, doing everything ...
Location:
United States, San Francisco
Salary:
225000.00 - 445000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • At least 10 years of industry experience, including experience designing hardware systems for data center applications
  • Experience in EE circuit design, CPU/GPU/TPU hardware system design, board bring-up, system design, integration, and system bring-up
  • Master's degree in Electrical Engineering, Computer Engineering, Physics, a related field, or equivalent practical experience
  • Have a strong bias toward action, and won’t take no for an answer
  • Have experience with and good knowledge of system design in the mechanical and product design areas, from xPUs and boards to rack level and data center level
  • Have a strong intrinsic desire to learn and fill in missing skills, and an equally strong talent for sharing that information clearly and concisely with others
  • Are comfortable with ambiguity and rapidly changing conditions
Job Responsibility:
  • Work on Machine Learning/AI hardware systems projects to craft the solutions for current and future data center deployments
  • Work with the hardware team on test vehicles and bring-up board design, evaluating end-to-end system design trade-offs
  • Lead EE circuit-level design and work with power, thermal, and mechanical teams to drive AI hardware system design
  • Work with product teams to ensure that systems meet their goals, and with ASIC/FPGA, Software, and Verification teams to ensure proper verification of features
  • Work with the manufacturing teams to ensure that designs are manufacturable and ready for volume production, and with the field teams to support systems that are deployed in the data center
  • Gather system requirements, define architecture, execute hardware design, and perform product validation
  • Lead the system bring-up, validation, NPI, deployment, and sustaining of hardware solutions
  • Work cross-functionally with Hardware, Software, Mechanical, Thermal, Validation, Manufacturing, and external vendors
  • Drive system development from concept through production
  • Lead debug and root cause analysis of deployed systems
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time