This is a spin-off from my post on The Future of Robot Learning, and focuses more on learned simulators, which I think will play a major role in the future of the industry.

Robots are at an interesting intersection between being an engineered system and an embodied intelligence. We have fairly well-established practices for designing and building highly reliable engineered systems, but we are actively figuring out how to develop flexible intelligent systems that learn from experience and that reach the same levels of safety and reliability. This combination represents a unique set of challenges and an exciting frontier in technology today.

Simulation is a well-established tool in both engineering and human skill acquisition—it’s been used to design complex fighter jets and to teach the human pilots that fly them. It’s also used universally in robotics for debugging, evaluation, and (sometimes) training, to make development safer and cheaper, and to induce scenarios that are otherwise impossible or too dangerous to prep for.

Simulation is going to play an increasingly important role in the future of robotics, by providing a readable and writeable proxy for reality. The best robotics companies will have the best simulators, because that will be the cheapest, quickest, and safest way to iterate on and deploy intelligent systems in the real world.

As our robotic systems become capable of handling increasingly complex and varied settings, however, our simulators must become increasingly complex and varied to remain useful. And I believe we are going to hit a wall with traditional simulator development—a wall very similar to the one we hit with traditional computer vision before deep learning.

In the long run, then, we will need to learn our simulators from data, much more akin to how humans learn their world models, and much more akin to how the rest of modern robot learning systems work. There is no other way to handle the variety of the real world in a general and scalable way. Think of an egg cracking, pouring out, and sizzling on a frying pan, or a paintbrush dipping into a bucket, dragging across the wall, and leaving a streak of red paint behind.

The Simulator of Tomorrow

Traditional simulators provide a nice conceptual blueprint for the future of learned simulators. Traditional simulators are fairly general and reusable tools; they can be reprogrammed for many tasks; they have nice structure that enables us to interface with them and visualize their results in interpretable formats; they are built up from a central codebase, where effort and insights can pile up over time in a central place and compound, rather than needing to be constructed from scratch for each environment or task.

But traditional simulators are limited in many ways, and ultimately by introducing learned components, we can go far beyond what they are currently capable of. Beyond additional accuracy and variety, the Simulator of Tomorrow will be much easier to use and will enable things like:

Automatic grounding. Instead of defining YAML/XML files to specify all the details and possible things we want to vary over, we could naturally “prompt” the model to simulate what we want. It could absorb videos, still images, text, sound, technical drawings, robot specifications, meshes—any modality that we could encode with a neural network—and spit out a simulator description. We could film a quick video of our scene, with some robot specifications, command data and proprioception, and get out a calibrated and general simulation of the scene and the robot.

Native rendering. On the flip side of grounding is rich rendering coming directly from the model. Instead of dealing with complex rendering APIs or designing custom visuals for effects like smoke or dust, we could just query the model, and as long as it has sufficient video data of these effects, it could render them.

Repeatability and controllability. For training, we could induce specific and repeatable settings that we want our agent to practice, using natural interfaces (video, text). We could debug our system by pulling in failures from the fleet, embedding similar scenarios in the simulator, and creating behavioral unit and integration tests.

Intelligent domain randomization. Because powerful generative models will have to model uncertainty in the environment, sampling them will yield something like intelligent domain randomization. Instead of randomizing over a bunch of wacky parameters, our model could be tuned to the underlying distribution and only give us variety we might actually see in the real world. For example, given a video of an opaque container, the model samples over the range of possible masses that could fit in the container.
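To make the container example concrete, here is a minimal sketch contrasting blind randomization with sampling that is conditioned on what was observed. A learned model would infer the distribution from video; here it is hand-written, and every name and number (the content densities, shell_mass_kg, the 100 kg upper bound) is a hypothetical stand-in chosen only for illustration.

```python
import random

# Rough, illustrative densities in kg per litre for things the container might hold.
# A learned simulator would infer this distribution from data instead.
PLAUSIBLE_CONTENTS = {"empty": 0.0, "water": 1.0, "sand": 1.6}


def naive_mass_sample(rng):
    """Classic domain randomization: a wide range that ignores the scene."""
    return rng.uniform(0.0, 100.0)


def grounded_mass_sample(rng, volume_litres, shell_mass_kg=0.2):
    """Sample only masses that could physically fit in the observed container."""
    density = rng.choice(list(PLAUSIBLE_CONTENTS.values()))
    fill_fraction = rng.uniform(0.0, 1.0)
    return shell_mass_kg + density * fill_fraction * volume_litres


if __name__ == "__main__":
    rng = random.Random(0)
    # Suppose the video shows a container of roughly two litres.
    grounded = [grounded_mass_sample(rng, volume_litres=2.0) for _ in range(5)]
    naive = [naive_mass_sample(rng) for _ in range(5)]
    print("grounded:", [round(m, 2) for m in grounded])
    print("naive:   ", [round(m, 2) for m in naive])
```

The grounded sampler never proposes a 60 kg two-litre container, so the policy spends its practice time on cases it might actually meet.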

Differentiability. Currently, the environment is a giant stop_gradient in the middle of our reinforcement learning computational graph. In fact, it’s even worse than a stop_gradient, since we usually have to call into a separate Python or C++ API. Every other part of the system is learned and differentiable, so if we can patch these issues, there is a lot of opportunity for cleaner designs and perhaps more straightforward application of ideas from generative modeling. Or just straight up supervised learning. (Technically, some of this is available today in the differentiable simulators currently being developed, but those are still generally external software that lives outside a PyTorch/JAX graph, and, more importantly, their accuracy is upper bounded by human engineering effort.)
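As a toy illustration of what removing the stop_gradient buys us, the sketch below pushes a derivative through every step of a simulated rollout and uses it to tune an action by plain gradient descent. It is written in pure Python with a tiny forward-mode autodiff class so it runs anywhere; in practice a framework such as PyTorch or JAX would supply the gradients. The dynamics in sim_step are a hand-written stand-in for a learned model, and all names and constants are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value plus its derivative w.r.t. one chosen input (forward-mode autodiff)."""
    val: float
    grad: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.val + other.val, self.grad + other.grad)
    __radd__ = __add__

    def __sub__(self, other):
        other = _lift(other)
        return Dual(self.val - other.val, self.grad - other.grad)

    def __mul__(self, other):
        other = _lift(other)
        return Dual(self.val * other.val,
                    self.val * other.grad + self.grad * other.val)
    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants with zero derivative."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sim_step(pos, vel, action, dt=0.1, gain=2.0, drag=0.5):
    """Hypothetical stand-in for one step of a learned, differentiable simulator."""
    vel = vel + dt * (gain * action - drag * vel)
    pos = pos + dt * vel
    return pos, vel


def rollout_loss(action, target, steps):
    """Squared distance to the target after applying a constant action."""
    pos, vel = 0.0, 0.0
    for _ in range(steps):
        pos, vel = sim_step(pos, vel, action)
    err = pos - target
    return err * err


def optimize_action(target=1.0, steps=20, lr=0.05, iters=200):
    """Gradient descent on the action, using gradients that flow through the rollout."""
    action = 0.0
    for _ in range(iters):
        loss = rollout_loss(Dual(action, 1.0), target, steps)
        action -= lr * loss.grad
    return action, rollout_loss(Dual(action, 1.0), target, steps).val


if __name__ == "__main__":
    action, loss = optimize_action()
    print(f"action={action:.4f} final loss={loss:.2e}")
```

With a black-box simulator, the same tuning would need finite differences or a score-function estimator; here the gradient is exact and comes for free with the forward pass.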

Keeping data local. On a related note, by making the environment just another nn.Module, we never have to leave the compute graph or the accelerator. To train our policy, we can just hook it straight up to the firehose of data coming from the model. Resetting an environment is just a means of sampling from a new seed or prompt, and we can easily generate many counterfactual outcomes from a single state.
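The idea that a reset is just a fresh sample, and that counterfactuals come from re-rolling one state, can be sketched in a few lines. The step function below is a hypothetical stand-in for a forward pass of a learned stochastic model, and the state layout and noise level are assumptions made for the example.

```python
import random


def reset(seed):
    """'Resetting' is just drawing an initial state from a seeded sampler."""
    rng = random.Random(seed)
    return {"pos": rng.uniform(-1.0, 1.0), "vel": 0.0}


def step(state, action, rng, dt=0.1, noise=0.05):
    """Hypothetical stand-in for one forward pass of a learned stochastic model."""
    vel = state["vel"] + dt * action + rng.gauss(0.0, noise)
    return {"pos": state["pos"] + dt * vel, "vel": vel}


def counterfactuals(state, actions, seed, horizon=10):
    """Roll the same starting state forward under several candidate actions."""
    outcomes = {}
    for action in actions:
        # Re-seeding gives every branch identical noise, so only the action differs.
        rng = random.Random(seed)
        s = state
        for _ in range(horizon):
            s = step(s, action, rng)
        outcomes[action] = s["pos"]
    return outcomes


if __name__ == "__main__":
    start = reset(seed=3)
    print(counterfactuals(start, actions=[-1.0, 0.0, 1.0], seed=0))
```

Because the same seed always yields the same state and the same noise, any episode can be replayed exactly, and "what if the agent had done something else?" becomes a cheap query rather than a new experiment.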

Code simplicity. With traditional simulators (and Software 1.0 generally), the more features we support, the more complex and unwieldy the codebase gets, both for development and usage. For Software 2.0, improving accuracy is “just a matter” of scaling the size of the model, along with data and compute. And for users of the simulator, the interface stays simple and we can use natural interfaces to program it (e.g., natural language like in OpenAI’s API). It’s not a free lunch and this is not going to be trivial or cheap, but in the long run it seems more manageable as our Software 2.0 tools develop and as Moore’s Law runs for a few more cycles.

Portability. On another related note, learned simulators would have far fewer dependencies. We just need to save the weights and model definition, and then we can load them anywhere that supports the floating point operation primitives. We can deploy them in the browser for interactivity, for example, or on any hardware that supports those ops.

Sim2Real Engine. A learned simulator may enable a Sim2Real Engine, where we iteratively bootstrap a system by: training models inside of the simulator, using those models to collect data in the real world, and using that data to train and improve the simulator. Rinse, repeat.
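The three-step loop can be sketched with a deliberately tiny example. Everything here is an assumption made for illustration: the "real world" is a one-parameter function with a hidden gain, the "simulator" has a single learnable parameter, and "training a policy" collapses to a closed-form solve. A real system would replace each piece with a learned model.

```python
import random

TRUE_GAIN = 0.7  # hidden property of the "real world"; the loop never reads it directly


def real_world(action, rng):
    """Stand-in for a noisy real-world trial: returns the observed displacement."""
    return TRUE_GAIN * action + rng.gauss(0.0, 0.02)


def train_policy_in_sim(sim_gain, target):
    """'Training' inside the toy simulator: pick the action the sim says hits the target."""
    return target / sim_gain


def fit_simulator(dataset):
    """Least-squares fit of the simulator's single parameter to all real data so far."""
    num = sum(a * y for a, y in dataset)
    den = sum(a * a for a, _ in dataset)
    return num / den


def sim2real_engine(rounds=5, target=1.0, seed=0):
    rng = random.Random(seed)
    sim_gain, dataset, history = 2.0, [], []  # start with a badly wrong simulator
    for _ in range(rounds):
        action = train_policy_in_sim(sim_gain, target)  # 1. train in the simulator
        outcome = real_world(action, rng)               # 2. collect real-world data
        dataset.append((action, outcome))
        sim_gain = fit_simulator(dataset)               # 3. improve the simulator
        history.append(abs(outcome - target))
    return sim_gain, history


if __name__ == "__main__":
    gain, errors = sim2real_engine()
    print(f"learned sim gain: {gain:.3f}")
    print("real-world error per round:", [round(e, 3) for e in errors])
```

Even in this toy, the first real-world trial misses badly because the simulator is wrong, and each later round benefits from a simulator that has seen more real data.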

Science and engineering applications. A learned simulator may be useful for answering scientific questions and for use in an engineering design process. It could offer a more repeatable and examinable model of the real world. We could study the dynamics of systems more easily. We could plug in information like technical drawings and descriptions of new parts and observe how systems behave (similar to how simulators are used now, but in an easier, more automatic way).

Conclusion

The big question, then, is how are we going to build this?

My brief answer for now is: gradually. I don’t think we currently have the technology (compute and otherwise) to build and run a fully learned simulator in a way that is economically sustainable. In the meantime, it seems like a good strategy to pick the low-hanging fruit where we can learn components of the simulator and incorporate them in the loop for greater accuracy. Jemin Hwangbo et al.’s work on the ANYmal robot is probably my favorite example in this space right now.