Training a Whole-Body Control Foundation Model

A springboard that launches humanoid robots towards general-purpose capability.

Published on August 28, 2025

A humanoid robot’s most significant advantage is the ability to perform a wide variety of tasks in highly constrained environments. The built world was designed with people in mind, and machines that can match our general capability will be the most valuable. Of course, that doesn’t mean humanoid robots have to look exactly like us, which we’ve covered before.

The task at hand is relatively clear, but performing the work poses a variety of challenges. The robot must keep its balance, move about smoothly, and be robust to environmental disturbances, all while reaching, grasping, and manipulating with a high degree of precision. Above all, the robot must never fall.

To this end, we’ve developed a whole-body control foundation model for our humanoid robot, Digit, to ensure it remains safe and stable while performing a wide variety of tasks. It functions much like the motor cortex in the human brain, taking signals from different parts of the brain (different levels of our control hierarchy, in Digit’s case) to control voluntary movements and fine motor skills.

This model is a relatively small LSTM neural network with fewer than one million parameters, trained in NVIDIA’s Isaac Sim physics simulator for decades of simulated time compressed into three or four days of wall-clock training.
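To give a sense of scale, the standard LSTM parameter-count formula shows how comfortably a control-sized recurrent network fits under one million parameters. The observation, hidden, and action sizes below are illustrative assumptions, not Digit's actual architecture:

```python
# Parameter count of a single-layer LSTM followed by a linear output head.
# The sizes below are assumed for illustration; they are NOT Digit's
# published architecture.

def lstm_params(input_size: int, hidden_size: int) -> int:
    # An LSTM has 4 gates; each gate has an input weight matrix,
    # a recurrent weight matrix, and a bias vector.
    return 4 * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

def linear_params(in_features: int, out_features: int) -> int:
    return in_features * out_features + out_features

obs_dim, hidden, act_dim = 100, 256, 20   # assumed sizes
total = lstm_params(obs_dim, hidden) + linear_params(hidden, act_dim)
print(total)  # 370708 -- well under one million parameters
```

Even doubling the hidden size to 512 keeps such a network in the low millions, which is tiny by modern foundation-model standards and cheap enough to run inside a real-time control loop.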

Digit’s motor cortex is learned purely in simulation and transfers zero-shot to the real world. It can be prompted with dense free-space position and orientation objectives for the arms and torso to accomplish different goals, including walking around and picking and placing heavy objects. We can learn downstream dexterous manipulation skills on top of the motor cortex, and similarly coordinate complex behaviors using LLMs.

We can prompt our model using various methods to accomplish a diverse set of tasks.

In March, we deployed an early version of this work at NVIDIA’s GTC conference to show Digit shopping for groceries. For this demo, the policy was prompted by object detections from an open-vocabulary object detector, lifted to 3D, and execution was managed via a state machine planning loop.

This is a significant step in making Digit more robust to disturbances, even while executing complex manipulation plans.

We are also able to prompt the motor cortex using a Gemini research preview.

The controller is also robust to picking up very heavy objects.

Why is this hard?

To perform useful work, a robot must be able to position and move its end effectors in the world robustly. For fixed-base robots, this is a well-understood problem; simple controllers making use of model-based algorithms like Inverse Kinematics (IK) or Inverse Dynamics (ID) have been widely used to great effect for decades. From the perspective of a user of such a system, one needs only provide the desired pose of the end effector, and the fixed-base robot will quickly move to match that target pose.
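For a fixed-base arm, the "provide a target pose, get joint angles" interface really is this simple. Here is a minimal sketch for a planar two-link arm using the textbook law-of-cosines solution; the link lengths are arbitrary illustration values, not any real robot's:

```python
import math

# Closed-form inverse kinematics for a planar 2-link arm: given a target
# (x, y) for the end effector, solve for the two joint angles. This is
# the well-understood fixed-base case described above.

def ik_2link(x: float, y: float, l1: float = 1.0, l2: float = 1.0):
    r2 = x * x + y * y
    # Law of cosines gives the elbow angle directly.
    c2 = (r2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if abs(c2) > 1.0:
        raise ValueError("target out of reach")
    q2 = math.acos(c2)  # elbow-down solution
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2

def fk_2link(q1: float, q2: float, l1: float = 1.0, l2: float = 1.0):
    # Forward kinematics, used here only to verify the IK solution.
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

q1, q2 = ik_2link(1.2, 0.5)
print(fk_2link(q1, q2))  # recovers (1.2, 0.5) up to floating-point error
```

The rest of this section explains why nothing this clean exists for a legged robot, whose base moves and whose contact state changes as it works.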

We wish to create the same interface for humanoid robots: one where the robot is informed simply by the desired motion of its end effectors, and the robot quickly and efficiently positions itself to achieve those targets.

Unfortunately, this capability is not so trivial on a legged robot. The physics of legged locomotion have two modes: one where a leg is swinging freely through the air, and another where the leg is planted on the ground and applying forces to the rest of the robot. These two sets of physics are separated by the notion of making or breaking ‘contact’, which happens when the robot places its foot on the ground or lifts it off. This complexity carries a heavy computational cost, so simplifying assumptions are almost always employed to make the control and planning problem easier. One commonly made assumption is that the robot's legs will remain in contact with the ground while the end effectors are reaching for their targets or manipulating objects in the world - essentially turning a legged platform into a fixed-base one.
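The mode switch at contact is the structural difficulty. A toy one-dimensional example makes it concrete: a point mass falls freely under gravity until it touches the ground, then a spring-damper ground model takes over. The two branches below are the two physics regimes; the stiffness and damping values are assumptions chosen only to make the toy settle nicely:

```python
# Toy hybrid dynamics: a unit point mass in free flight above z = 0,
# and under a spring-damper ground-contact model below it. Real legged
# dynamics are far richer, but the discrete mode switch at contact is
# the key structure. G, K, C are illustrative, not physical robot values.

G, K, C = 9.81, 500.0, 10.0   # gravity (m/s^2), ground stiffness, damping

def step(z: float, v: float, dt: float = 0.001):
    if z > 0.0:                    # swing/flight phase: gravity only
        a = -G
    else:                          # stance phase: ground reaction forces
        a = -G - K * z - C * v
    v += a * dt                    # semi-implicit Euler integration
    z += v * dt
    return z, v

z, v = 1.0, 0.0                    # drop from 1 m
for _ in range(5000):              # simulate 5 seconds
    z, v = step(z, v)
print(round(z, 4))                 # settles near the spring equilibrium -G/K
```

Every controller for a legged robot has to reason about which branch of this `if` it is on, and about when and where the branch will change; that combinatorial structure is what makes whole-body planning so much harder than fixed-base IK.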

This is a powerful heuristic that is the foundation of many of the impressive advances that the field of humanoid robotics has recently enjoyed, but is also a fundamental limitation on the performance of modern robots. Preventing the robot from dynamically adjusting its foot placement will limit its manipulation workspace and prevent it from intelligently and naturally reacting to disturbance forces that it may encounter during the course of its work.

A more ideal interface would be one where the robot simply tracks the hand motions desired by the user while taking steps on its own when necessary, avoiding collisions with the environment, and trying really hard not to fall over. The difficulty of generating dynamically feasible whole-body motion plans in real time has historically made this sort of interface intractable for humanoid robots, at least until recently.

Enter, reinforcement learning

Deep reinforcement learning is quickly emerging as the dominant control paradigm for humanoid robots. Rather than explicitly modeling the equations of motion of the hybrid dynamics of the robot in the controller or making simplifying assumptions about the contact state of the whole system, it is possible to train a neural network in a physics simulator to act as a controller for the purpose of tracking a whole-body motion, and then deploy that neural network on hardware.

Recent results on humanoid whole-body control are impressive, highly dynamic, and require far less expertise and effort to get working on hardware than traditional model-based control methods. However, most of this body of work focuses on dynamic whole-body motions such as dancing and does not achieve the precise tracking required for mobile manipulation. We focus on the ability to apply forces on the environment with the hands as well as the feet, so that Digit can lift and maneuver heavy objects.

Covering the workspace

Most existing work training whole-body controllers for humanoid robots with reinforcement learning (for example, GMT or TWIST) uses large, offline datasets of human motion-capture data like AMASS or LAFAN1 to train whole-body control networks. This has led to some incredibly impressive and lifelike results, but these trajectory distributions often don’t cover the entire desired manipulation workspace. Inevitably, the hand pose targets in these offline datasets densely cover some well-represented region of the workspace and leave the extremes or other key regions sparsely populated. It is important to us that the training distribution of hand targets uniformly covers the full workspace, so that the robot can reliably perform complex manipulations that use its entire extent.

To ensure that our motor cortex is able to reach any point in our workspace with high precision and reliability, we employ a random sampling scheme wherein we randomly pick positions and orientations uniformly from the workspace, and generate random translational and rotational movement speeds between those points to create time-indexed trajectories for both hands and the torso. We train our motor cortex to reach those points using a reward term that considers the translational and rotational error between the current hand pose and the target hand pose.
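A minimal sketch of this sampling scheme, under stated assumptions: the workspace is approximated as an axis-aligned box, movement speeds are drawn from an assumed range, and tracking is scored with an exponential reward on pose error. The bounds, speed range, and reward scales below are illustrative, not Agility's actual values:

```python
import math
import random

# Uniformly sample start/goal hand positions from an assumed box-shaped
# workspace, pick a random movement speed, and build a time-indexed
# straight-line segment. Orientation sampling is omitted for brevity.

WORKSPACE = [(-0.5, 0.8), (-0.6, 0.6), (0.2, 1.6)]  # x, y, z bounds in meters

def sample_position():
    return [random.uniform(lo, hi) for lo, hi in WORKSPACE]

def sample_segment():
    """One time-indexed trajectory segment between two random targets."""
    start, goal = sample_position(), sample_position()
    speed = random.uniform(0.1, 1.0)        # m/s, assumed range
    duration = math.dist(start, goal) / speed
    return start, goal, duration

def tracking_reward(pos_err_m: float, rot_err_rad: float) -> float:
    # Peaks at 1.0 for perfect tracking and decays with both the
    # translational and the rotational error, as described above.
    # The scale factors 5.0 and 2.0 are illustrative assumptions.
    return math.exp(-5.0 * pos_err_m) * math.exp(-2.0 * rot_err_rad)

print(tracking_reward(0.0, 0.0))                                # 1.0
print(tracking_reward(0.1, 0.2) < tracking_reward(0.05, 0.1))   # True
```

Because the targets are drawn uniformly rather than from human motion data, the extremes of the workspace are visited as often as the comfortable middle, which is exactly the coverage property the text argues for.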

Position > velocity

In order to train an RL policy that can precisely interact with the world in a variety of ways, we need it to be able to walk the robot to specific positions in the world and stay there. However, the vast majority of the literature on applying reinforcement learning to legged robots trains neural networks to track target root velocities (‘match this speed and direction’) rather than root positions (‘go to this point’) (velocity-based VMP, AMOR, FALCON, ExBody and ExBody2, HumanPlus, GMT, R2S2, MHC vs. position-capable CLONE, OmniH2O and HOVER). Training controllers to track velocities rather than positions provides a much richer and more forgiving reward signal for locomotion, so it is often a pragmatic choice. In addition, tracking a root position on hardware at test time requires some form of odometry, such as a floating-base pose estimator, which is non-trivial to implement.

However, a locomotion controller conditioned on a target velocity rather than a target position requires a higher-level planner or a human teleoperator to provide constant guidance to correct position drift. Ideally, we would tell the controller where it should be in free space, and it would navigate there and naturally return to that position even if shoved or otherwise perturbed.
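The distinction can be caricatured in a few lines, assuming a one-dimensional robot base. A velocity-conditioned policy integrates whatever command it is given, so drift accumulates; a position-conditioned policy can regulate toward its target on its own, sketched here as a simple proportional law with a speed limit (gains and limits are illustrative assumptions, not how the actual learned policy works internally):

```python
# Velocity conditioning: the controller has no notion of where it
# "should" be, so an external shove becomes permanent drift unless a
# higher layer notices and corrects it.
def velocity_command_step(x: float, v_cmd: float, dt: float = 0.02) -> float:
    return x + v_cmd * dt

# Position conditioning: the controller itself regulates toward the
# target, here caricatured as a speed-limited proportional controller.
def position_command_step(x: float, x_target: float, dt: float = 0.02,
                          kp: float = 2.0, v_max: float = 0.5) -> float:
    v = max(-v_max, min(v_max, kp * (x_target - x)))
    return x + v * dt

x = 0.0 + 0.3                       # external shove: 0.3 m of drift
for _ in range(500):                # the position-conditioned base walks back
    x = position_command_step(x, x_target=0.0)
print(round(x, 4))                  # drift is rejected without outside help
```

This self-correcting behavior is what lets a downstream planner say "stand here" once, instead of streaming velocity corrections every control tick.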

Prompting in task space is better than configuration space

Another issue is that, in prior work, the upper-body target setpoints are often parameterized in joint space. This means that to actually use the whole-body controller, you need either a motion-capture suit and a mapping from human configuration space to robot configuration space, or a planner or model for generating upper-body motion plans.

These are serious limitations, because the higher-level planning layer, human teleoperator, or LLM then needs a model of the robot’s kinematic state. It would be much more convenient for downstream scripting, teleoperation, and behavior cloning built on top of the controller if the interface were simply free-space positions and orientations. This representation is also much closer to the one used by state-of-the-art imitation learning methods, and it means we can predict object locations directly with models like Gemini and pass them to the policy for execution.
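Concretely, a task-space prompting interface might look like the sketch below: the caller, whether a script, a teleoperation device, or an LLM-based planner, supplies only free-space poses and never touches joint angles. The field names and quaternion convention are assumptions for illustration, not Agility's actual API:

```python
from dataclasses import dataclass

# Hypothetical task-space prompt structure: everything is expressed as
# free-space poses, so the caller needs no model of the robot's joints.

@dataclass
class PoseTarget:
    position: tuple       # (x, y, z) in meters, world frame
    orientation: tuple    # unit quaternion (w, x, y, z)

@dataclass
class WholeBodyPrompt:
    left_hand: PoseTarget
    right_hand: PoseTarget
    torso: PoseTarget

# A downstream planner only needs object poses -- e.g. detections from an
# open-vocabulary detector lifted to 3D -- to assemble a prompt:
shelf_grasp = WholeBodyPrompt(
    left_hand=PoseTarget((0.4, 0.3, 1.1), (1.0, 0.0, 0.0, 0.0)),
    right_hand=PoseTarget((0.4, -0.3, 1.1), (1.0, 0.0, 0.0, 0.0)),
    torso=PoseTarget((0.0, 0.0, 0.9), (1.0, 0.0, 0.0, 0.0)),
)
print(shelf_grasp.left_hand.position)  # (0.4, 0.3, 1.1)
```

Because nothing in this structure refers to the robot's joints, the same prompt format works whether it is produced by a human, a state machine, or a vision-language model.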

Conclusion

At Agility, we’re actively building and deploying intelligent humanoid robots that can perform useful work in human environments. Digit’s “motor cortex” is a part of this: it functions as an “always on” safety layer that enables reactive and intuitive control of our robots. We can build complex behaviors on top of this lower-level motor cortex, including learning dexterous mobile manipulation behaviors. We consider this to be the first step towards a safe and robust motion foundation model for real-world humanoid robots.

Do you want to dig even deeper into this subject? Check out this article from Dr. Alan Fern, co-director of the Dynamic Robotics and Artificial Intelligence Lab at Oregon State University: The Emerging Humanoid Motor Cortex: An Inventory of RL-Trained Controllers