CrawlJobs Logo

Senior Machine Learning Infrastructure Engineer

plus.ai Logo

PlusAI

Location Icon

Location:
United States , Santa Clara

Category Icon

Job Type Icon

Contract Type:
Not provided

Salary Icon

Salary:

160000.00 - 200000.00 USD / Year

Job Description:

As a Senior ML Infrastructure Engineer at Plus, you will design scalable architectures capable of handling petabytes of data while ensuring optimal performance for both training and inference phases. You will build robust pipelines for managing model versioning systems and experiment tracking frameworks, which are essential for maintaining reproducibility across experiments. Additionally, you will be responsible for managing large-scale GPU clusters. This role offers unparalleled opportunities—both technically and professionally—for individuals passionate about solving challenging problems using modern cloud-native technologies. Ideal candidates thrive in environments that leverage tools such as Docker containers orchestrated via Kubernetes clusters, seamlessly integrated with state-of-the-art deep learning frameworks like PyTorch or TensorFlow. If you are eager to push the boundaries of what's possible in machine learning infrastructure and contribute to cutting-edge solutions, this position is an excellent fit!

Job Responsibility:

  • Design and develop scalable, high-performance systems for training, inference, deploying, and monitoring ML models at scale
  • Build and maintain efficient data pipelines, model versioning systems, and experiment tracking frameworks
  • Collaborate with cross-functional teams, including ML researchers and engineers, to identify bottlenecks and improve platform usability
  • Implement distributed systems and storage solutions optimized for machine learning workloadsDrive improvements in CI/CD workflows for ML models and infrastructure
  • Ensure high availability and reliability of the ML platform by implementing robust monitoring, logging, and alerting systems
  • Stay current with industry trends and integrate relevant tools and frameworks to enhance the platform
  • Mentor junior engineers and contribute to a culture of technical excellence
  • Ensure that your work is performed in accordance with the company’s Quality Management System (QMS) requirements and contribute to continuous improvement efforts
  • Ensure team compliance with QMS, monitor quality, and drive process improvements

Requirements:

  • Phd or MS in Computer Science, Electrical Engineering, or related field
  • Good oral and written communication skills
  • Phd new grad or Masters with 3+ years of software engineering experience with a focus on ML infrastructure or distributed systems
  • Proficiency in in Python, C++, SQL
  • Deep understanding of containerization, orchestration technologies, distributed ML workload, and experiment tracking tools (e.g., Docker, Kubernetes, multiprocessing, Kubeflow, and mlflow)
  • Deploy and manage resources across multiple cloud platforms (AWS, GCP, or on-prem environments)
  • Proficiency in at least one deep learning framework, such as PyTorch and data pipeline tools (e.g., Apache Airflow, Prefect)
  • Strong knowledge of distributed systems, databases, and storage solutions
  • Extensive software design and development skills
  • Ability to learn and adapt to new technologies and contribute in a productive environment

Nice to have:

  • Familiarity with fundamental deep learning architectures, such as Convolutional Neural Networks (CNNs) and Transformer models
  • Experience in building large-scale ML datasets, MLOps pipelines, and distributed computing frameworks like Ray
  • Experience working with autonomous vehicles or robotics
What we offer:
  • Work, learn and grow in a highly future-oriented, innovative and dynamic field
  • Wide range of opportunities for personal and professional development
  • Catered free lunch, unlimited snacks and beverages
  • Highly competitive salary and benefits package, including 401(k) plan

Additional Information:

Job Posted:
December 11, 2025

Employment Type:
Fulltime
Work Type:
Hybrid work
Job Link Share:

Looking for more opportunities? Search for other job offers that match your skills and interests.

Briefcase Icon

Similar Jobs for Senior Machine Learning Infrastructure Engineer

Senior AI Infrastructure Engineer

This role will be responsible for designing, deploying, and maintaining high-per...
Location
Location
United States , Bothell; Overland Park; Bellevue
Salary
Salary:
113600.00 - 205000.00 USD / Year
https://www.t-mobile.com Logo
T-Mobile
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years technical engineering experience, preferably in multiple technology focus areas
  • Expert understanding of AI/ML infrastructure components, or GPU-based systems – preferably in a high-availability, large scale environment
  • Hands-on Experience with NVIDIA DGX servers, BasePOD architectures, and advanced GPU technologies
  • Proficient in Linux/UNIX environments, including scripting/automation tools (Bash, Python, Ansible, Terraform)
  • Understanding of AI infrastructure security best practices
  • Experience with container orchestration (Kubernetes, Docker) and GPU workload management tools
  • Strong knowledge of networking (InfiniBand/Ethernet) and storage solutions in AI/ML contexts
Job Responsibility
Job Responsibility
  • Technical System Expertise: Understands system protocols, how systems operate and data flows
  • Technical Engineering Services: Drives engineering projects by active contribution to the application of engineering techniques
  • Innovation: Contributes to designs to implement new ideas which improve an existing and new system/process/service
  • Technical Writing: Writes basic documentation on how technology works
  • Technical Leadership: Collaborates with technical teams and utilizes system expertise to deliver technical solutions
  • Technology Strategy: Contributes to new and existing technology options that support business goals
What we offer
What we offer
  • Competitive base salary and compensation package
  • Annual stock grant
  • Employee stock purchase plan
  • 401(k)
  • Access to free, year-round money coaches
  • Medical, dental and vision insurance
  • Flexible spending account
  • Paid time off
  • Paid holidays
  • Paid parental and family leave
  • Fulltime
Read More
Arrow Right

Senior Principal Machine Learning Engineer - LLM Post-Training and Optimization

Atlassian is seeking a highly skilled and experienced Senior Principle Machine L...
Location
Location
United States , Mountain View
Salary
Salary:
243100.00 - 407200.00 USD / Year
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Ph.D. or Master’s degree in Computer Science, Machine Learning, Artificial Intelligence, or a related field
  • 8+ years of experience in machine learning, with a focus on large-scale model development and optimization
  • Deep expertise in LLM and transformer architectures (e.g., GPT, BERT, T5)
  • Strong proficiency in Python and ML frameworks such as PyTorch, JAX, or TensorFlow
  • Experience with distributed training techniques and large-scale data processing pipelines
  • Proven track record of deploying machine learning models in production environments
  • Familiarity with model optimization techniques, including quantization, pruning, and knowledge distillation
  • Strong problem-solving skills and ability to work in a fast-paced, collaborative environment
  • Excellent communication skills and ability to translate technical concepts for diverse audiences
Job Responsibility
Job Responsibility
  • Lead the fine-tuning and post-training optimization of large language models (LLMs) for diverse applications
  • Develop and implement techniques for model compression, quantization, pruning, and knowledge distillation to optimize performance and reduce computational costs
  • Conduct research on advanced techniques in transfer learning, reinforcement learning, and prompt engineering for LLMs
  • Design and execute rigorous benchmarking and evaluation frameworks to assess model performance across multiple dimensions
  • Collaborate with infrastructure teams to optimize LLM deployment pipelines, ensuring scalability and efficiency in production environments
  • Stay at the forefront of advancements in LLM technologies, sharing insights, driving innovation within the team, and leading agile development
  • Mentoring other team members, facilitating within/across team workshops, fostering a culture of technical excellence and continuous learning
What we offer
What we offer
  • health coverage
  • paid volunteer days
  • wellness resources
  • Fulltime
Read More
Arrow Right

Senior Machine Learning Systems Engineer

Our team is building the foundations to democratise Machine Learning for Atlassi...
Location
Location
India , Bengaluru
Salary
Salary:
Not provided
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Fluency in at least one modern object-oriented programming language (preferably Java/Kotlin)
  • Understanding and experience with Machine Learning project lifecycle and tools
  • Understanding of LLMs, best deployment practices and inference optimisation
  • Experience in building and implementing high-performance RESTful micro-services
  • Experience building and operating large scale distributed systems using Amazon Web Services (Sagemaker, S3, Cloud Formation, AWS Security and Networking)
  • Experience with Continuous Delivery and Continuous Integration
Job Responsibility
Job Responsibility
  • Build and scale the core infrastructure to allow software engineers, ML engineers & data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Build systems for product teams like Jira & Confluence to provide access to curated LLMs
  • Use software development expertise to solve difficult problems, tackling infrastructure and architecture challenges
  • Lead engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
What we offer
What we offer
  • Health coverage
  • Paid volunteer days
  • Wellness resources
  • Fulltime
Read More
Arrow Right

Senior Machine Learning Engineer

As a Senior Machine Learning Engineer in the Central AI team, you will build and...
Location
Location
Australia , Sydney
Salary
Salary:
Not provided
https://www.atlassian.com Logo
Atlassian
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Master or PhD in a quantitative subject (Statistics, Mathematics, Computer Science, Operations Research, or relevant work experience)
  • 3+ years of related industry experience in the data science domain
  • Expertise in Python or Java with and the ability to write performant production-quality code, familiarity with SQL, knowledge of Spark and cloud data environments (e.g. AWS, Databricks)
  • Experience building and scaling machine learning models in business applications using large amounts of data
  • Ability to communicate and explain data science concepts to diverse audiences, craft a compelling story
  • Focus on business practicality and the 80/20 rule
  • very high bar for output quality, but recognize the business benefit of "having something now" vs "perfection sometime in the future"
  • Agile development mindset, appreciating the benefit of constant iteration and improvement
Job Responsibility
Job Responsibility
  • Build and maintain the core infrastructure to allow machine learning engineers and data scientists to develop, train, evaluate, deploy, and operate Machine Learning models and pipelines
  • Use software development expertise to solve difficult problems, tackling complex infrastructure and architecture challenges
  • Design system and model architectures, conducting rigorous experimentation and model evaluations, and providing guidance to junior ML engineers
  • Lead other engineers to drive involved projects from technical design to launch
  • Collaborate with other teams and internal customers to set expectations, gather input and communicate results
What we offer
What we offer
  • Health and wellbeing resources
  • Paid volunteer days
  • Fulltime
Read More
Arrow Right

Senior AI and Machine Learning Engineer

We are seeking Senior AI/ML & Innovation Engineer who will be leading initiative...
Location
Location
United States , Aguadilla
Salary
Salary:
Not provided
https://www.hpe.com/ Logo
Hewlett Packard Enterprise
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor's or master’s degree in computer science, engineering, data science, machine learning, artificial intelligence, or closely related quantitative discipline
  • Typically, 7-10 years’ experience
  • Deep understanding of machine learning algorithms, such as linear regression, decision trees, support vector machines, random forests, deep learning models (e.g., neural networks), and reinforcement learning
  • A strong foundation in mathematics and statistics
  • Proficiency in programming languages such as Python, R, or Java
  • Strong understanding of GitHub CoPilot, Cursor, N8N, vibe coding, Windsurf, and similar technologies
  • Experience in Cloud Infrastructure (AWS, Azure, etc)
  • Knowledge of Open Source, Linux, etc
  • Understanding of Devops, SRE
  • Advanced knowledge and experience in deep learning
Job Responsibility
Job Responsibility
  • Conducts research and stays up to date with the latest advancements in AI and machine learning technologies, frameworks, and algorithms
  • Collaborates with cross-functional teams to understand business requirements and design AI and machine learning solutions
  • Develops, implements, and optimizes machine learning models and algorithms
  • Deploys machine learning models into production environments
  • Monitors the performance of deployed models
  • Organizes and leads comprehensive design review sessions
  • Works collaboratively with the engineering manager and team lead to set design and implementation standards
  • Regularly leads meetings
  • Has experience in providing technical leadership, mentorship, and guidance to junior team members
  • Develops and delivers strategic presentations and reports to senior stakeholders
What we offer
What we offer
  • Health & Wellbeing
  • Personal & Professional Development
  • Unconditional Inclusion
  • Fulltime
Read More
Arrow Right

Senior Staff Machine Learning Engineer (AI Agent)

At Cresta, the AI Agent team is on a mission to create state-of-the-art AI Agent...
Location
Location
United States; Canada
Salary
Salary:
Not provided
cresta.com Logo
Cresta
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • Bachelor’s Degree in Computer Science, Mathematics, or a related field
  • Master’s or Ph.D. preferred, or equivalent professional experience
  • 7+ years of hands-on industry experience with AI and machine learning
  • 3+ years of experience working with LLMs in large-scale production environments
  • Expert knowledge of machine learning concepts and methods, especially those related to NLP, Generative AI, and working with LLMs
  • Proven leadership in designing and deploying AI solutions at scale
  • Extensive practical knowledge of modern machine learning frameworks and technologies (e.g., PyTorch, Tensorflow, Hugging Face, NumPy)
  • Experience with distributed systems and cloud-based AI infrastructure
  • Strong problem-solving and strategic thinking abilities
  • Proven ability to lead cross-functional teams and work collaboratively to deliver innovative AI solutions in production
Job Responsibility
Job Responsibility
  • Design, develop, and deploy Cresta’s AI Agent solutions and proprietary models
  • Focus on practical AI challenges such as improving reasoning, planning capabilities, and evaluation in real-world scenarios
  • Collaborate with cross-functional teams including front-end and back-end software engineers to integrate AI Agents into Cresta’s customer solutions
  • Lead initiatives to scale AI systems for production environments, ensuring performance and reliability across use cases
  • Contribute to solving cutting-edge problems in AI and help define the future roadmap for Cresta’s AI Agents
  • Innovate and research ways to improve security, cost-efficiency, and reliability of AI systems
What we offer
What we offer
  • Variety of medical, dental, and vision plans
  • Paid parental leave
  • Monthly Health & Wellness allowance
  • Work from home office stipend
  • Lunch reimbursement for in-office employees
  • PTO: 3 weeks in Canada
  • Base salary, equity, and a variety of benefits
  • Fulltime
Read More
Arrow Right

Senior Machine Learning Engineer, Personalization and Recommendations

As a Senior Machine Learning Engineer on the Personalization & Recommendations t...
Location
Location
United States , San Francisco
Salary
Salary:
183360.00 - 248000.00 USD / Year
edtechjobs.io Logo
EdTech Jobs
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 5+ years of experience in applied machine learning or ML-heavy software engineering, with a strong focus on personalization, ranking, or recommendation systems
  • Demonstrated impact improving key metrics such as CTR, retention, or engagement through recommender or search systems in production
  • Strong hands-on skills in Python and PyTorch, with expertise in data and feature engineering, distributed training and inference on GPUs, and familiarity with modern MLOps practices — including model registries, feature stores, monitoring, and drift detection
  • Deep understanding of retrieval and ranking architectures, such as Two-Tower models, deep cross networks, Transformers, or MMoE, and the ability to apply them to real-world problems
  • Experience with large-scale embedding models and vector search, including FAISS, ScaNN, or similar systems
  • Proficiency in experiment design and evaluation, connecting offline metrics (AUC, NDCG, calibration) with online A/B test outcomes to drive product decisions
  • Clear, effective communication, collaborating well with product managers, data scientists, engineers, and cross-functional partners
  • A growth and mentorship mindset, helping elevate team quality in modeling, experimentation, and reliability
  • Commitment to responsible and inclusive personalization, ensuring our systems respect learner privacy, fairness, and diverse goals
Job Responsibility
Job Responsibility
  • Design and implement personalization models across candidate retrieval, ranking, and post-ranking layers, leveraging user embeddings, contextual signals and content features
  • Develop scalable retrieval and serving systems using architectures such as Two-Tower models, deep ranking networks, and ANN-based vector search for real-time personalization
  • Build and maintain model training, evaluation, and deployment pipelines, ensuring reliability, training–serving consistency, observability, and robust monitoring
  • Partner with Product and Data Science to translate learner objectives (engagement, retention, mastery) into measurable modeling goals and experiment designs
  • Advance evaluation methodologies, contributing to offline metric design (e.g., NDCG, CTR, calibration) and supporting rigorous A/B testing to measure learner and business impact
  • Collaborate with platform and infrastructure teams to optimize distributed training, inference latency, and serving cost in production environments
  • Stay informed on industry and research trends, evaluating opportunities to meaningfully apply them within Quizlet’s ecosystem
  • Mentor junior and mid-level engineers, supporting technical growth, experimentation rigor, and responsible ML practices
  • Champion collaboration, inclusion, curiosity, and data-driven problem solving, contributing to a healthy and productive team culture
What we offer
What we offer
  • 20 vacation days
  • Competitive health, dental, and vision insurance (100% employee and 75% dependent PPO, Dental, VSP Choice)
  • Employer-sponsored 401k plan with company match
  • Access to LinkedIn Learning and other resources to support professional growth
  • Paid Family Leave, FSA, HSA, Commuter benefits, and Wellness benefits
  • 40 hours of annual paid time off to participate in volunteer programs of choice
  • Fulltime
Read More
Arrow Right

Senior Staff Machine Learning Engineer

Help design our AI platform and develop our next generation of machine learning ...
Location
Location
United States , San Francisco
Salary
Salary:
216500.00 - 324500.00 USD / Year
gofundme.com Logo
GoFundMe
Expiration Date
Until further notice
Flip Icon
Requirements
Requirements
  • 9+ years of hands-on experience in machine learning engineering, AI development, software engineering, or related fields
  • Experience emphasizing secure, large-scale, distributed system design, AI/ML pipeline development, and implementation
  • Extensive experience designing, developing, and operating scalable backend systems
  • Experience applying software engineering best practices such as domain-driven design, event-driven architectures, and microservices
  • Deep expertise in agentic workflows, AI evaluation solutions, prompt management, and secure AI development and testing practices
  • Strong knowledge of relational and document-based databases, data storage paradigms, and efficient RESTful API design
  • Experience establishing robust CI/CD pipelines, automated testing (unit and integration), and deployment practices
  • Strong leadership skills, including effective planning and management of complex projects, mentoring of team members, and fostering a collaborative, high-performing engineering culture
  • Excellent communicator, able to articulate complex technical concepts clearly to both technical and non-technical stakeholders
  • Bachelor's degree in Computer Science, Software Engineering, or a related technical field (preferred)
Job Responsibility
Job Responsibility
  • Design and implement AI platforms to enable scalable and secure access to LLMs from multiple model providers for diverse use cases
  • Design and implement agentic workflows, agentic tool ecosystems, and LLM prompt management solutions
  • Design, build, and optimize scalable model training, fine tuning, and inference pipelines, ensuring robust integration with production systems
  • Influence technical strategy and approach to developing embedding stores, vector databases, and other reusable assets
  • Lead initiatives to streamline ML and AI workflows, improve operational efficiency, and establish standardized procedures to achieve consistent, high-quality results across our AI systems
  • Design and develop backend services and RESTful APIs using Python and FastAPI, integrating seamlessly with ML pipelines and services
  • Take operational responsibility for team-owned services, including performance monitoring, optimization, troubleshooting, and participation in an on-call rotation
  • Collaborate with both technical and non-technical colleagues, including data and applied scientists, software engineers, product managers, and business stakeholders, to deliver reliable and scalable ML-driven products
  • Coach and mentor fellow ML engineers, promoting a culture of collaboration, continuous improvement, and engineering excellence within the team
  • Employ a diverse set of tools and platforms including Python, AWS, Databricks, Docker, Kubernetes, FastAPI, Terraform, Snowflake, Coralogix, and GitHub to build, deploy, and maintain scalable, highly available machine learning infrastructure
What we offer
What we offer
  • Competitive pay
  • Comprehensive healthcare benefits
  • Financial assistance for things like hybrid work, family planning
  • Generous parental leave
  • Flexible time-off policies
  • Mental health and wellness resources
  • Learning, development, and recognition programs
  • Fulltime
Read More
Arrow Right