Location

Palo Alto

Employment Type

Full time

Department

Software

At Rhoda AI, we’re building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a team that includes researchers from Stanford, Berkeley, Harvard, and beyond. We’re not building a feature; we’re building a new computing platform for physical work. With over $400M raised, we’re investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.

We’re looking for a Staff / Principal ML Engineer to build and own our training platform — the system that makes large-scale training reliable, reproducible, and easy to run. You will define how training jobs are launched, tracked, recovered, and debugged across the cluster. Your work ensures that researchers can move fast without fighting infrastructure.

This role sits at the core of research velocity. When training fails, you make it recover automatically. When experiments are hard to reproduce, you fix the system. When GPU-hours are wasted, you make the waste visible and preventable.

What You’ll Do

Own the training job lifecycle

Design and build systems for job launch and configuration, monitoring and state tracking, automatic retry and resume, and failure handling and recovery
Define clean, scalable interfaces for running distributed training: CLI / SDK / config systems and standardized launch templates across model families

Build robust checkpointing and recovery systems

Develop checkpointing systems that are reliable (no silent corruption or mismatch), efficient (fast save/load at scale), and flexible (support sharded and distributed models)
Enable seamless resume from failures, partial recovery (e.g., node/rank failures), and consistent state across distributed jobs

Make training reproducible and debuggable

Build systems for experiment configuration and versioning, tracking training state, metrics, and lineage, and reproducible “golden runs” and configs
Ensure runs can be reliably reproduced and differences between runs are explainable

Make performance and failures observable

Create unified visibility into per-job behavior (failures, slowdowns, anomalies) and fleet-wide trends (GPU utilization, failure modes, wasted compute)
Partner with training systems engineers to surface step-time breakdowns, resource inefficiencies, and failure patterns across jobs

Reduce operational burden on researchers

Eliminate manual debugging and babysitting of training jobs
Provide clean abstractions so researchers don’t need to think about cluster quirks, retry logic, or distributed setup details
Goal: make large-scale training feel simple and reliable

Collaborate with infra / SRE on cluster reliability

Work with infrastructure teams to reduce GPU waste from node failures, network instability, checkpointing/storage bottlenecks, and scheduler placement issues

What We’re Looking For

Strong experience building distributed systems or ML infrastructure
Experience with large-scale training environments (preferred but not required)
Hands-on experience with modern ML stacks (e.g., PyTorch; JAX a plus)
Solid understanding of distributed systems fundamentals (fault tolerance, state management, retries), training workflows and failure modes, and checkpointing and data consistency challenges
Strong product / systems instincts — you build tools people actually want to use and simplify complex workflows into clean abstractions
High ownership mindset and comfort in a fast-moving environment

Nice To Have (But Not Required)

Experience with checkpointing for large distributed models (FSDP / ZeRO / sharded states)
Experience with cluster schedulers (Slurm, Kubernetes, Ray, etc.)
Experience building experiment tracking or ML observability systems
Familiarity with large-scale storage systems and I/O bottlenecks

Why This Role

Own the reliability layer that every training run in the company depends on — your systems are the foundation research velocity is built on
Direct impact on developer experience and research throughput at a company building real-world embodied intelligence, not toy ML pipelines
High ownership in a small, elite team where your infrastructure decisions compound across every model the research team trains

Machine Learning Engineer – Training Platform Full Time NEW

Gigascale Capital