Palo Alto
Employment Type
Full time
Department
Software
At Rhoda AI, we’re building the full-stack foundation for the next generation of humanoid robots, from high-performance, software-defined hardware to the foundational models and video world models that control it. Our robots are designed to be generalists capable of operating in complex, real-world environments and handling scenarios unseen in training. We work at the intersection of large-scale learning, robotics, and systems, with a research team drawn from Stanford, Berkeley, Harvard, and beyond. We’re not building a feature; we’re building a new computing platform for physical work. With over $400M raised, we’re investing aggressively in the R&D, hardware development, and manufacturing scale-up to make that a reality.
We’re looking for a Staff / Principal ML Engineer to build and own our training platform: the system that makes large-scale training reliable, reproducible, and easy to run. You will define how training jobs are launched, tracked, recovered, and debugged across the cluster. Your work ensures that researchers can move fast without fighting infrastructure.
This role sits at the core of research velocity. When training fails, you make it recover automatically. When experiments are hard to reproduce, you fix the system. When GPU-hours are wasted, you make the waste visible and preventable.
What You’ll Do
Own the training job lifecycle
Build robust checkpointing and recovery systems
Make training reproducible and debuggable
Make performance and failures observable
Reduce operational burden on researchers
Collaborate with infra / SRE on cluster reliability
What We’re Looking For
Nice To Have (But Not Required)
Why This Role