Supercomputing Software Engineer

Etched

Location:
Taiwan, Taipei

Contract Type:
Not provided

Salary:

Not provided

Job Description:

We are seeking a highly skilled and motivated Supercomputing Software Engineer to join our team, responsible for the foundational software that powers our server infrastructure. This role focuses on the development, integration, and debugging of critical system software components, including BIOS, BMC firmware, boot processes (including NetBoot), root of trust implementations, advanced system logging, and kernel-mode drivers. You will play a pivotal role in ensuring the reliability, security, and performance of our server platforms, and contribute to the integration of data center orchestration technologies at the node level.

Job Responsibility:

  • Integrate and maintain BIOS and BMC firmware, ensuring robust and efficient server boot processes
  • Measure and Tune System Performance Configuration: Analyze DRAM timings, PCIe configurations, power state transitions, etc., to ensure high performance and maximal reliability
  • Root of Trust and Security: Validate security features, including root of trust mechanisms, to protect system integrity and data security
  • Advanced System Logging and Diagnostics: Design and implement advanced system logging and diagnostic capabilities to facilitate efficient troubleshooting and performance analysis
  • Data Center Orchestration Integration: Integrate and optimize node-level data center orchestration technologies, such as Kubernetes and Docker, into the system software stack
  • System Validation and Testing: Develop and execute comprehensive test plans to validate system software functionality, stability, and performance
  • Collaboration and Troubleshooting: Collaborate with hardware and software teams to diagnose and resolve complex system-level issues

Requirements:

  • Proficiency in C/C++ or Python
  • Strong understanding of BIOS and BMC firmware architectures
  • Experience with server boot processes
  • Knowledge of root-of-trust and security principles
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Experience with advanced system logging and diagnostic tools
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs

Nice to have:

  • Experience with data center orchestration technologies (Kubernetes, Docker)
  • Experience with tracing tools like perf, eBPF, ftrace, etc.
  • Experience with performance testing and benchmarking tools (gprof, VTune, Wireshark, etc.)
  • Experience with CI/CD pipelines
  • Experience with Rust
  • Experience with kernel-mode driver development and debugging

What we offer:

  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits

Additional Information:

Job Posted:
February 18, 2026

Employment Type:
Full-time
Work Type:
On-site work

Similar Jobs for Supercomputing Software Engineer

Senior Research Engineer

The HPE HPC & AI EMEA Research Lab (ERL) is characterized by a unique blend of i...
Location:
Germany, Munich, Berlin
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Parallel programming experience, with programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, PGAS languages, etc.
  • An understanding of AI/ML frameworks; experience with frameworks such as TensorFlow or PyTorch is highly desirable
  • An interest in system and data center monitoring and operational data analysis
  • Professional language skills in English and German
Job Responsibility:
  • Perform world-class research while also shaping products of the future
  • Work with the most esteemed research partners across Europe
  • Enable high performance research software on pre-Exascale and Exascale supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run applications taking advantage of leading-edge hardware at scale
  • Make and operate HPC/AI systems and datacenters in a sustainable way
  • Manage modern data-intensive workloads in high performance environments
What we offer:
  • Competitive salary and extensive benefits package (pension scheme, insurances, bike and car leasing, and other fringe benefits)
  • Work-life balance (flexible working time and hybrid workplace model, 30 vacation days, four HPE Wellness-Fridays, up to six months paid parental leave)
  • Support for education, training, and career development
  • Diverse and dynamic work environment

Software Engineer, Frontier Systems - Power Management

As a Software Engineer on the Frontier Systems team focused on power management,...
Location:
United States, San Francisco
Salary:
295000.00 - 445000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • 7+ years of software engineering experience with a focus on solving large-scale, system-level challenges
  • Strong proficiency in Python and familiarity with automation and scripting tools (e.g., shell scripting)
  • Experience with distributed systems to efficiently aggregate and analyze streaming data
  • Knowledge of electrical engineering concepts including digital signal processing, power systems, Fast Fourier Transforms, or related areas
  • Experience in system-level investigations and development of automated solutions to address power management, fault detection, and remediation
  • Strong analytical skills and the ability to dig into noisy data (experience with SQL, PromQL, Pandas, etc.)
  • Comfort working with both hardware and software teams to solve multidisciplinary problems
Job Responsibility:
  • Develop and implement system-level and software-level solutions to optimize power usage in large-scale supercomputers, ensuring efficient and reliable operations
  • Build automation to monitor power consumption patterns during training workloads and design algorithms to stabilize these fluctuations, preventing issues with grid reliability
  • Work with researchers and engineers to design tools for real-time monitoring, detection, and remediation of power-related hardware and system faults
  • Collaborate cross-functionally to translate complex electrical system requirements into code, while driving continuous improvements in power management solutions
  • Drive the development of power throttling mechanisms at the IT system level to dynamically adjust power usage based on workload demands and infrastructure limitations
  • Collaborate with hardware design teams to integrate system-level power control requirements into IT hardware design, ensuring seamless coordination between software-driven power management and hardware capabilities
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Supercomputing Test Software Engineer

We are seeking highly motivated and detail-oriented Software Engineers to join o...
Location:
Taiwan, Taipei
Salary:
Not provided
Etched
Expiration Date:
Until further notice
Requirements:
  • Proficiency in at least one scripting language (e.g., Python, Bash, Go)
  • Experience with software testing methodologies and tools
  • Strong understanding of operating systems (Linux preferred) and server hardware architectures
  • Ability to analyze complex technical problems and provide effective solutions
  • Excellent communication and collaboration skills
  • Ability to work independently and as part of a team
  • Experience with version control systems (e.g., Git)
  • Experience with reading and interpreting hardware logs
Job Responsibility:
  • Design, develop, and implement automated supercomputing test suites using common scripting languages (Python, Go, Bash) and test frameworks across all aspects of system operation, including boot sequences, root-of-trust, system management, workload deployment, and performance
  • Execute tests on server hardware, monitor system performance and health, and analyze test results
  • Investigate and debug hardware and software failures identified during testing, providing detailed reports and mitigation plans
  • Collaborate with internal and external hardware and software engineering teams to identify root causes of failures and implement corrective actions
  • Contribute to the development and maintenance of the supercomputing testing infrastructure, including portable test environments and automation tools runnable in any environment
  • Create and maintain comprehensive documentation for test plans, test cases, and test results
  • Analyze system performance metrics to identify potential bottlenecks and areas for optimization
  • Participate in continuous improvement efforts to enhance the efficiency and effectiveness of the testing process
What we offer:
  • Competitive compensation packages including generous equity packages
  • Comprehensive insurance coverage and other top-of-market benefits
Employment Type:
Full-time

Research Engineer AI

The role involves conducting high-quality research in AI and HPC, shaping future...
Location:
United Kingdom, Bristol
Salary:
Not provided
Hewlett Packard Enterprise
Expiration Date:
Until further notice
Requirements:
  • A good working knowledge of AI/ML frameworks (at least TensorFlow and PyTorch), of data preparation, handling, and lineage control, and of model deployment, particularly in a distributed environment
  • At least a B.Sc. equivalent in a Science, Technology, Engineering or Mathematical discipline
  • Development experience in compiled languages such as C, C++ or Fortran and experience with interpreted environments such as Python
  • Parallel programming experience with relevant programming models such as OpenMP, MPI, CUDA, OpenACC, HIP, or PGAS languages is highly desirable
Job Responsibility:
  • Perform world-class research while also shaping products of the future
  • Enable high performance AI software stacks on supercomputers
  • Provide new environments/abstractions to support application developers to build, deploy, and run AI applications taking advantage of leading-edge hardware at scale
  • Manage modern data-intensive AI training and inference workloads
  • Port and optimize workloads of key research centers like the AI Safety Institute
  • Support onboarding and scaling of domain-specific applications
  • Foster collaboration with the UK and European research community
What we offer:
  • Health & Wellbeing benefits that support physical, financial and emotional wellbeing
  • Career development programs catered to achieving career goals
  • Unconditional inclusion in the workplace
  • Flexibility to manage work and personal needs
Employment Type:
Full-time

Software Engineer, Frontier Systems

The Frontier Systems team at OpenAI builds, launches, and supports the largest s...
Location:
United States, San Francisco
Salary:
250000.00 - 445000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • 7+ years of industry experience in software engineering
  • Proficiency with Python and shell scripting
  • A high degree of comfort digging into noisy data with SQL, PromQL, and Pandas or any other tool necessary
  • Experience developing reproducible analyses
  • A balance of strengths in building and operationalizing
Job Responsibility:
  • Own and improve the system health checks that keep our hyperscale supercomputers stable during model training
  • Lead deep dives into hardware failures and system-level bugs to understand how things break at scale
  • Build automation that monitors and fixes issues across thousands of machines, so researchers can keep moving without interruption
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time

Principal Software Engineer

Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team...
Location:
United States, Multiple Locations
Salary:
139900.00 - 274800.00 USD / Year
Microsoft Corporation
Expiration Date:
Until further notice
Requirements:
  • Bachelor's Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python, OR equivalent experience
  • Ability to meet Microsoft, customer and/or government security screening requirements is required for this role. These requirements include, but are not limited to, the following specialized security screenings: Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud Background Check upon hire/transfer and every two years thereafter.
Job Responsibility:
  • Partner with appropriate stakeholders to determine user requirements for a set of scenarios.
  • Lead identification of dependencies and the development of design documents for a product, application, service, or platform, primarily catering towards exhaustive health monitoring of AI training supercomputers.
  • Build AI Supercomputer observability solutions at scale, with deep focus on actionability to improve availability and reliability of supercomputers.
  • Lead by example and mentor others to produce extensible and maintainable code used across products.
  • Leverage subject-matter expertise of cross-product features with appropriate stakeholders (e.g., project managers) to drive multiple groups’ project plans, release plans, and work items.
  • Hold accountability as a Designated Responsible Individual (DRI), mentoring engineers across products/solutions, working on-call to monitor system/product/service for degradation, downtime, or interruptions.
  • Proactively seek new knowledge and adapt to new trends, technical solutions, and patterns that will improve the availability, reliability, efficiency, observability, and performance of products while also driving consistency in monitoring and operations at scale and share knowledge with other engineers.
Employment Type:
Full-time

Member of Technical Staff, Training Infra Engineer

Contribute in and provide strong support for model training pipelines, ship stat...
Location:
Salary:
Not provided
Cohere
Expiration Date:
Until further notice
Requirements:
  • Extremely strong software engineering skills
  • Proficiency in Python and related ML frameworks such as JAX, PyTorch and XLA/MLIR
  • Experience with distributed training infrastructures (Kubernetes, Slurm) and associated frameworks (Ray)
  • Experience using large-scale distributed training strategies
  • Hands-on experience training large models at scale and contributing to the tooling and/or setup of the training infrastructure
Job Responsibility:
  • Design and write high-performance, scalable software for training
  • Improve our training setup from an infrastructure and codebase performance standpoint
  • Craft and implement tools to speed up our training cycles and improve the overall efficacy of our training infrastructure
  • Research, implement, and experiment with ideas on our supercompute and data infrastructure
  • Learn from and work with the best researchers in the field
What we offer:
  • An open and inclusive culture and work environment
  • Work closely with a team on the cutting edge of AI research
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits, including a separate budget to take care of your mental health
  • 100% Parental Leave top-up for up to 6 months
  • Personal enrichment benefits towards arts and culture, fitness and well-being, quality time, and workspace improvement
  • Remote-flexible, offices in Toronto, New York, San Francisco, London and Paris, as well as a co-working stipend
  • 6 weeks of vacation (30 working days!)
Employment Type:
Full-time

Software Engineer, Data Visualization

The Data Visualization team at OpenAI is responsible for building and maintainin...
Location:
United States, San Francisco
Salary:
230000.00 - 385000.00 USD / Year
OpenAI
Expiration Date:
Until further notice
Requirements:
  • Strong experience in full-stack software development, with a focus on building scientific or infrastructure visualization tools
  • Proficiency in both front-end and back-end programming languages such as Python, JavaScript, SQL, or similar
  • Familiarity with front-end technologies like React, back-end technologies like Node.js, and databases like Snowflake
  • Experience with visualization libraries and frameworks (e.g., Plotly, Grafana)
  • Strong understanding of full-stack architecture, design principles, and best practices
  • Excellent problem-solving skills and attention to detail
  • Strong communication skills and the ability to work collaboratively in a team environment
Job Responsibility:
  • Develop and maintain full-stack visualization tools for hardware and software analysis
  • Design intuitive front-end interfaces and robust back-end systems for monitoring the performance and health of supercomputer systems
  • Collaborate with researchers and engineers to understand their needs and deliver effective full-stack visualization solutions
  • Ensure high performance, reliability, and scalability of visualization tools across both front-end and back-end systems
  • Continuously improve existing tools and develop new features to meet evolving requirements
What we offer:
  • Medical, dental, and vision insurance for you and your family, with employer contributions to Health Savings Accounts
  • Pre-tax accounts for Health FSA, Dependent Care FSA, and commuter expenses (parking and transit)
  • 401(k) retirement plan with employer match
  • Paid parental leave (up to 24 weeks for birth parents and 20 weeks for non-birthing parents), plus paid medical and caregiver leave (up to 8 weeks)
  • Paid time off: flexible PTO for exempt employees and up to 15 days annually for non-exempt employees
  • 13+ paid company holidays, and multiple paid coordinated company office closures throughout the year for focus and recharge, plus paid sick or safe time (1 hour per 30 hours worked, or more, as required by applicable state or local law)
  • Mental health and wellness support
  • Employer-paid basic life and disability coverage
  • Annual learning and development stipend to fuel your professional growth
  • Daily meals in our offices, and meal delivery credits as eligible
Employment Type:
Full-time