Agility and AI

A primer on how we use and benefit from artificial intelligence

Introduction

Our humanoids are built to help with the hardest work – strenuous, repetitive tasks in highly dynamic industrial settings. They connect islands of automation. They work tirelessly. And they’re designed to operate in the same spaces that people do. 

But people are incredibly adaptable when it comes to learning new tasks and can make adjustments in real time to a change in working conditions. Building a humanoid to operate in this manner presents an immense engineering challenge, and requires an enormous amount of foundational data. Thankfully, advances in artificial intelligence (AI) over the past few years have drastically changed this pursuit. 

Training protocols, skill acquisition, and time to deployment are changing at a rapid clip, and our team is uniquely positioned to capitalize on the moment.

Before AI, we used fixed automation

The Agility team has been building robots for over ten years, which means we know firsthand what it was like to work before current AI innovations. Previously, it could take skilled engineers weeks to program a simple robot movement. Through sustained effort, Agility built reliable, safe systems using traditional control methods – methods that required us to be explicit about the precise actions we wanted a robot to take.

Robots controlled by prescriptive protocols like this excelled at repeating the same movement thousands of times in the exact same environment. Put simply, a factory could press play and let the work happen. Picture a stationary robotic arm performing the same motion time and time again, or an automated guided vehicle (AGV) following a fixed path on a facility’s floor plan.

However, the moment something shifted in an industrial environment such as this (lighting, floor conditions, new obstructions), robots needed a human to intervene and reprogram them. They couldn’t simply adapt and take new action. With their human form factor, legs, and multi-functional end effectors, humanoids were designed to automate tasks where human-like flexibility was needed. However, that flexibility demanded a far greater degree of adaptability than fixed automation ever had. In short, humanoids needed more training data and different models to function properly.

If humanoids were to become the transformational technology we knew they could be, they needed the capability to pivot the way a person does in a split second. Enter AI. 

Functioning AI models need data – a lot of it

AI offers us an unparalleled resource to help power our humanoids so they can proficiently perform a wider range of tasks in the real world. But where does the data come from? 

There are three main ways to acquire the data AI models need: 

  1. Generate it ourselves: The most expensive way is to generate the data ourselves. Agility is able to do this today via teleoperation, or teleop. Using Learning from Demonstration (LfD) methods (see below), we can teleop Digit through the same task numerous times, then use the observational data as training material. We gather data this way at our own facility, as well as through our work at numerous client deployments – pivotal field experience that few others have. 
  2. Simulation: The second method is using simulation (see below). NVIDIA’s open frameworks and libraries, as well as the MuJoCo library from Google DeepMind, help us generate accurate virtual physics environments for large-scale, repetitive training sessions. We can play out an unlimited number of parallel scenarios in rapid succession, compressing months of effort into hours.   
  3. Public data: The third method is using tranches of free data available on the public internet. This is how the modern AI frontier labs have made such rapid progress training their large language models (LLMs). But comparable data does not yet exist for robot movement.
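Part of why simulation scales so well is that each virtual rollout can vary the physics it runs under, so the model sees thousands of slightly different conditions. Here is a minimal sketch of that idea; the parameter names and ranges are illustrative assumptions, not Agility's actual simulation setup:

```python
# Illustrative sketch of randomized simulated rollouts ("domain
# randomization"). All parameter names and ranges are assumptions
# made for this example.

import random

def sample_environment(rng):
    # Randomize physical conditions so each rollout is slightly different.
    return {
        "floor_friction": rng.uniform(0.4, 1.0),
        "payload_kg": rng.uniform(0.0, 15.0),
        "lighting": rng.choice(["dim", "normal", "bright"]),
    }

def run_rollout(env):
    # Stand-in for one simulated training episode under conditions `env`.
    return {"env": env, "steps": 500}

def generate_batch(n, seed=0):
    """Generate n randomized rollouts; trivially parallelizable."""
    rng = random.Random(seed)
    return [run_rollout(sample_environment(rng)) for _ in range(n)]

batch = generate_batch(10_000)  # months of physical trials, minutes of compute
print(len(batch))
```

Each rollout is independent, which is what makes it easy to run huge batches of them in parallel on GPU-accelerated simulators.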

Before diving into the particulars of these options, it’s helpful to understand the software stack that these training methods actually feed into.

Our humanoids learn through a three-layer stack

Any task Digit performs, from unloading a tote to navigating a crowded facility, requires a stack of coordinated intelligence working simultaneously. Picture three layers, each operating at a different speed and scale:

Cognition: The slowest layer, responsible for high-level thinking. What is the task? What are the individual steps? What does the environment look like? This is where planning and semantic understanding happen and where LLMs and vision-language-action (VLA) models are highly useful.

Skills: The middle layer, operating in real time. Given the plan from above, how exactly do I pick up this tote? How do I adjust my grip when the weight feels different than I expected? This layer translates intent into specific physical actions and is where LfD is particularly helpful. 

Controls: The fastest layer, running in fractions of a millisecond. How do I stay balanced while doing all of that? How do I react to a sudden shift in load, or to a slippery surface? This layer is physical intelligence – stability and moment-to-moment body coordination – and it’s where we use reinforcement learning (RL) and simulation for training. 
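The defining feature of the stack is that each layer ticks at a different rate: cognition replans occasionally, skills update many times per second, and controls run every cycle. A minimal sketch of that timing structure, with invented rates and function names (this is not Agility's actual architecture):

```python
# A toy three-layer control stack. Each layer updates at a different
# frequency. All rates, names, and return values are illustrative
# assumptions for this example.

CONTROL_HZ = 1000   # controls layer: balance, low-level commands
SKILL_HZ = 100      # skills layer: grips, trajectories
COGNITION_HZ = 1    # cognition layer: task planning

def cognition_step(state):
    # Decide the current high-level step of the task (illustrative).
    return {"goal": "place_tote_on_pallet"}

def skill_step(state, plan):
    # Translate the plan into a concrete motion target (illustrative).
    return {"target_pose": "above_pallet", "goal": plan["goal"]}

def control_step(state, intent):
    # Compute commands that keep the robot balanced (illustrative).
    return {"torques": [0.0] * 6, "intent": intent}

def run(ticks):
    """Run the stack for `ticks` control cycles; slower layers update less often."""
    state, plan, intent = {}, None, None
    counts = {"cognition": 0, "skills": 0, "controls": 0}
    for t in range(ticks):
        if t % (CONTROL_HZ // COGNITION_HZ) == 0:
            plan = cognition_step(state)
            counts["cognition"] += 1
        if t % (CONTROL_HZ // SKILL_HZ) == 0:
            intent = skill_step(state, plan)
            counts["skills"] += 1
        control_step(state, intent)
        counts["controls"] += 1
    return counts

print(run(1000))  # one simulated second: 1 plan, 100 skill updates, 1000 control steps
```

The point of the structure is separation of concerns: slow, expensive reasoning never blocks the fast loop that keeps the robot upright.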

Previously, each of these layers was built by hand: mathematicians and engineers would describe the world as precisely as possible so that the robot could navigate it. AI offers us an alternative. Instead of writing prescriptive rules, you let a learning algorithm figure them out using data, experience, or both. 

Learning from Demonstration (LfD) 

With LfD, AI gives humanoids a way of making movement decisions that is data-driven, instead of the rigid, pre-planned code we used to rely on. Engineers can teleop a humanoid to perform a complex task in a range of ways, then aggregate that data and use it to build an AI model.

At Agility, our engineers sync with Digit via VR headsets. They’re able to see out of the humanoid’s eyes, so to speak, and use two controllers to show the robot exactly what to do – say, unloading a tote from a conveyor belt and placing it on a pallet. As they do, Digit's sensors are capturing everything: camera feeds, joint positions, force readings, end effector location. When we do this enough times, across enough variations, our engineers can build a model that informs Digit’s general understanding of the task rather than a rigid script for executing it.
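The raw material of LfD is the stream of synchronized sensor readings and operator commands described above. A minimal sketch of how such demonstrations might be logged and aggregated into training pairs; the field names here are assumptions for illustration, not Agility's actual data schema:

```python
# Illustrative sketch of logging teleoperation demonstrations for
# Learning from Demonstration. Field names are assumed for this example.

from dataclasses import dataclass, field

@dataclass
class Frame:
    timestamp: float
    camera: bytes          # raw camera frame at this instant
    joint_positions: list  # one reading per joint
    forces: list           # force/torque sensor readings
    operator_action: list  # what the teleoperator commanded

@dataclass
class Demonstration:
    task: str
    frames: list = field(default_factory=list)

    def record(self, frame: Frame):
        self.frames.append(frame)

def build_dataset(demos):
    """Aggregate demonstrations into (observation, action) training pairs."""
    pairs = []
    for demo in demos:
        for f in demo.frames:
            obs = (f.camera, tuple(f.joint_positions), tuple(f.forces))
            pairs.append((obs, tuple(f.operator_action)))
    return pairs

demo = Demonstration(task="unload_tote")
demo.record(Frame(0.00, b"", [0.10, 0.20], [1.5], [0.05, -0.02]))
demo.record(Frame(0.01, b"", [0.12, 0.21], [1.4], [0.04, -0.01]))
print(len(build_dataset([demo])))  # 2 training pairs from one short demo
```

A model trained on many such pairs, across many operators and task variations, learns a mapping from what the robot senses to what a person would do – a general understanding rather than a script.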

A man wearing a VR headset and holding handhold controls, with a humanoid robot in the background.

This helps with the kinds of skills that are nearly impossible to describe in rules but very natural to demonstrate: the instinctive way people recalibrate their grip when items inside a tote shift, or the natural body adjustment we make to reach an awkward angle. LfD captures that embodied expertise and transfers it to Digit.

And while this method is an incredibly useful training tool, it is still limited by the amount of data that human engineers and operators can physically generate. It is bound by space and time. Simulations are not.

Simulation training

A much more efficient, albeit compute-intensive, method is simulation training. Here, we can greatly increase the number of "repetitions" we have to train an underlying model on. It’s in simulation training that we use RL. 

RL can be described as follows: a virtual Digit tries a movement, receives feedback in the form of a reward for doing well or a penalty for doing poorly, and gradually adjusts its behavior to maximize its score. This means a simulated version of Digit discovers the “correct” movements over time on its own, not through prescriptive rules.
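The try-score-adjust loop above can be sketched with a deliberately tiny example. Here the "robot" is a single parameter (a lean angle), the reward is simply how close it stays to upright, and the agent keeps any variation that scores better. This is a hedged illustration of the reward-driven idea, not the algorithm Agility actually trains with:

```python
# Toy reward-driven learning loop. The "balance" task is reduced to one
# parameter purely for illustration.

import random

def reward(lean_angle):
    # Higher reward the closer the simulated robot stays to upright (0.0).
    return -abs(lean_angle)

def train(episodes=2000, seed=0):
    rng = random.Random(seed)
    policy = 5.0  # start badly leaned; here the "policy" is just an angle
    best = reward(policy)
    for _ in range(episodes):
        candidate = policy + rng.uniform(-0.5, 0.5)  # try a small variation
        score = reward(candidate)
        if score > best:            # reward for doing well: keep the change
            policy, best = candidate, score
        # doing poorly carries an implicit penalty: the change is discarded
    return policy

learned = train()
print(round(learned, 3))  # converges toward upright (0.0)
```

Real RL for locomotion uses far richer policies, rewards, and optimizers, but the core loop – act, score, adjust toward higher reward – is the same.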

This approach is particularly well suited to the controls layer because the goal is easy to define: we want Digit to stay standing and balanced even when the physics are extremely complex or variable. RL lets Digit explore options in the most cost-effective way. Training a robot with RL in the real world would require it to fail thousands of times before developing reliable behavior – which would mean crashing an expensive piece of engineering thousands of times. It’s just not practical. In simulation, those trials happen in a fraction of the time without risk to hardware.

Renderings of humanoid robots navigating around one another in open spaces and across uneven terrain.

None of this would be possible without our partner, NVIDIA. Our simulation-first workflow uses the open robotics learning framework NVIDIA Isaac Lab for reinforcement learning, and NVIDIA Jetson Thor to run larger, more powerful reasoning models on our robots locally.

One of the results is what we call Digit's whole-body control foundation model – a base layer of physical intelligence that governs how the robot moves, keeps balance, and coordinates its arms and legs together. It functions like the robot's motor intuition: the underlying capability that everything else is built on top of.

What this means for skill advancements and our tech

The net result of these different training methods is a dramatic increase in the amount of data we have to train our humanoids on, and the possibility of new platforms where an ever-increasing library of skills can be developed. This means delivering the reliability industries need today while offering the flexibility to adopt new use cases tomorrow.

We’re building towards a future where deployed humanoids can receive new skills wirelessly via cloud updates as new use cases emerge. Eventually, a single powerful model will learn new tasks with very little additional training required – getting us closer to Agility’s goal of creating a world where people and humanoids work together with ease. 

Conclusion 

Agility is a place where the most advanced robotics engineering meets the latest in AI development. We’re taking what was previously impossible, given cost and time constraints, and developing AI models that bring the dynamism and flexibility of human workers to humanoids – robots that can help us with the most dangerous and repetitive work. 

To read more about many of the technical terms included in this blog, visit our Glossary.