Imagine a huge video prediction model trained on a massive chunk of YouTube: tens of thousands of cooking, DIY, home-improvement, exercise, and gardening videos. Something like a general-purpose predictive model, with generative abilities akin to GPT-3 (and beyond), but for video instead of text.

What might this enable in robotics? And more generally, for bringing advances from information tech to the physical world?

Recent advances in huge language models demonstrate a path to powerful AI systems. First, develop fairly simple but scalable architectures and training procedures, and then proceed to scale the shit out of them. Bigger models, more compute, and more data lead to better performance and wholly new qualitative abilities, it turns out. The scaling hypothesis is true. All hail compute ;)

When GPT is given more compute and more parameters, it keeps on filling those parameters with more and more knowledge. It learns basic syntax and grammar so it can better predict the next word. Then it learns paragraph and dialogue structure. Then emotional sentiment. Then, at 175 billion parameters, it picks up things like amateur chess playing, arithmetic, and UI programming.

There’s good reason to believe these scaling trends are robust and that they hold beyond natural language, in video for example. Current language models are still quite limited, of course. There are many more issues to fix and details to get right in text and other domains, but it seems like we’re just getting started with massively scaled models.

So, just as GPT-3 picks up on grammar, sentiment, and so on in order to better predict the next word, a future video-based GPT-X model is going to be able to learn accurate physics to better predict the next frame. It will probably take specialized effort beyond a vanilla video prediction model. But certainly with enough physics-specific data, a few built-in inductive biases, and some fine-tuning, a learned model could become insanely good at physical prediction and simulation. And physical simulation is just one narrow use case of such a model. I imagine that AI-generated video content and VR environments, for example, are going to be huge. There’ll be plenty of incentive to develop large generative video models outside of physics prediction.

By absorbing knowledge across many domains, a large video prediction model could simulate environments with a physical accuracy and generality far beyond what is possible today. It could replace all the narrow, specialized, hand-engineered simulators (rigid-body, fluid, optical, agent, etc.) with a single great tool for robot learning, engineering, and scientific discovery.

A single model could simulate an egg cracking, pouring out, and sizzling on a frying pan. A paintbrush dipping into a bucket, dragging across the wall, and leaving a red streak of paint behind it. Sunlight passing through a magnifying glass, catching a pile of dry leaves on fire: a rising trail of smoke, a marshmallow cooking over the flame. A human stepping to the side when someone else is walking towards them, or getting mad if that someone gets too close and bumps them. Any number of other interactions that are nearly impossible for us to write computer simulations for, but for which we have, or could collect, a lot of data.

Such a simulator could be incredibly useful in robot learning. Imagine sim2real learning with the smallest possible reality gap, or model-based learning with the best possible model. All in a package with a natural interface.

We could “prompt” our model with a video sequence to match our specific robotics setup and task. Film a video of our room layout and our paintbrush dipping into the paint bucket. The model would automatically generate a virtual scene of our scenario that we could freely modify. “What about blue paint instead?”

No XML files, no painstaking calibration or modeling of, say, the articulated physics of a Rubik’s Cube. (With all the cubing videos on YouTube, we should be especially well covered here lol.) Just film a video of our scene and the model would catch on, like GPT-3 catches on when given prompts.
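
To make the "prompting" idea concrete, here’s a toy sketch in PyTorch. The `VideoGPT` class, its interface, and the tiny network inside it are all stand-ins I made up for illustration; the real thing would be a giant pretrained model with a much richer conditioning interface.

```python
import torch
import torch.nn as nn

# Toy stand-in for the big pretrained video prediction model.
# The class name, interface, and architecture are made up for illustration only.
class VideoGPT(nn.Module):
    def __init__(self, frame_shape=(3, 64, 64), hidden=256):
        super().__init__()
        c, h, w = frame_shape
        self.frame_shape = frame_shape
        self.encode = nn.Sequential(nn.Linear(c * h * w, hidden), nn.ReLU())
        self.dynamics = nn.GRU(hidden, hidden, batch_first=True)
        self.decode = nn.Linear(hidden, c * h * w)

    def forward(self, frames):
        # frames: (batch, time, C, H, W) -> one predicted next frame per input frame
        b, t = frames.shape[:2]
        z = self.encode(frames.reshape(b, t, -1))
        h, _ = self.dynamics(z)
        return self.decode(h).reshape(b, t, *self.frame_shape)

# "Prompting": condition on a short clip filmed in our own room, then roll the
# model forward to get a simulated continuation of that exact scene.
model = VideoGPT()
prompt = torch.rand(1, 16, 3, 64, 64)      # 16 filmed frames of our setup

frames = prompt
for _ in range(32):                        # imagine 32 frames into the future
    next_frame = model(frames)[:, -1:]     # predict one step ahead
    frames = torch.cat([frames, next_frame], dim=1)

simulated_scene = frames[:, prompt.shape[1]:]
```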

It’s all differentiable and can be placed directly in our PyTorch/TensorFlow computational graph. Gradients flow like water.

Model-based learning algos can plug directly into it. Plan ahead and pipe RL gradients directly through the model. Maybe with some fully continuous, fully differentiable analog of MCTS, which has worked so well in Alpha/MuZero.
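
Here’s roughly what "piping gradients through the model" could look like: optimize an action sequence by backpropagating a goal loss through a frozen, differentiable dynamics model. The little dynamics network, the latent/action dimensions, and the goal encoding below are placeholder assumptions, not a real pretrained model.

```python
import torch
import torch.nn as nn

# Minimal sketch of gradient-based planning through a differentiable model.
# In the post's scenario, `dynamics` would be the pretrained video model
# (or its latent dynamics); here it is an untrained stand-in.
torch.manual_seed(0)

latent_dim, action_dim, horizon = 32, 4, 10
dynamics = nn.Sequential(                    # s_{t+1} = f(s_t, a_t), fully differentiable
    nn.Linear(latent_dim + action_dim, 64), nn.Tanh(), nn.Linear(64, latent_dim)
)
dynamics.requires_grad_(False)               # freeze the "pretrained" model; optimize only actions

state = torch.zeros(1, latent_dim)           # current (encoded) scene
goal = torch.ones(1, latent_dim)             # target (encoded) scene
actions = torch.zeros(horizon, 1, action_dim, requires_grad=True)
opt = torch.optim.Adam([actions], lr=0.1)

for step in range(200):                      # gradient-based planning loop
    s = state
    for t in range(horizon):                 # roll the model forward in imagination
        s = dynamics(torch.cat([s, actions[t]], dim=-1))
    loss = ((s - goal) ** 2).mean()          # how far the imagined rollout ends from the goal
    opt.zero_grad()
    loss.backward()                          # gradients flow through every model step
    opt.step()
```

In practice you’d presumably plan in the model’s latent space and re-plan as new frames come in (MPC-style), rather than trusting a single open-loop rollout.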

Train a robot to paint your room. Visualize the sequence of actions the robot would take. Make modifications. “Be careful not to spill on the rug, and don’t paint the baseboards.” Visualize the adjusted behavior to ensure it achieves exactly what you had in mind.

Train models directly from human preferences in source videos. Learn that humans don’t like spilling paint, or breaking vases, or burning their eggs. Learn how humans and animals move naturally. How humans are polite in letting others pass.
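
For a sense of how that might work with today’s tools, here’s a rough sketch of preference learning over video clips, in the spirit of Christiano et al.’s learning-from-human-preferences setup: train a reward model so the clip humans prefer scores higher. The clip encoder, shapes, and data are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Sketch of a reward model trained from pairwise human preferences over clips.
# The encoder and fake data are assumptions for illustration only.
class ClipRewardModel(nn.Module):
    def __init__(self, frame_dim=3 * 64 * 64, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, clip):
        # clip: (batch, time, frame_dim) -> one scalar score per clip (sum of per-frame rewards)
        return self.net(clip).sum(dim=1).squeeze(-1)

reward_model = ClipRewardModel()
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

# Fake batch: clip_a is "preferred" over clip_b (e.g. paint on the wall, not the rug).
clip_a = torch.rand(8, 16, 3 * 64 * 64)
clip_b = torch.rand(8, 16, 3 * 64 * 64)

for step in range(100):
    r_a, r_b = reward_model(clip_a), reward_model(clip_b)
    # Bradley-Terry / logistic loss: the preferred clip should score higher.
    loss = -F.logsigmoid(r_a - r_b).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```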

This could be an incredible tool for future progress. One that a small fraction of people build and maintain, while many others benefit from what it enables them to do.

Beyond just a training tool, it could form the basis of an internal model that the agent uses online during deployment. The agent could understand and interact with the world in terms of its high-fidelity physics model, relying on its future predictions for making decisions in the world.

For example, humans can accurately predict what will happen if we bump our coffee cup near the edge of a table, or how someone might respond if we bump theirs.

Human intuitive physics and psychology predictions are very good, despite the fact that our information-processing abilities are severely constrained by our hardware, the DNA bottleneck, and whatever we can learn in a single lifetime. In theory, it seems like you could do much better: digital brains specifically optimized to model these things, trained on orders of magnitude more data than anyone encounters in a lifetime, with more memory and much higher-accuracy representations.

For example, such brains may be able to predict things like the precise trajectory of the falling coffee cup. Or perhaps whether that outdoor deck all those people are standing on is about to collapse. Having watched thousands of videos of structural failures on YouTube, including several of this exact thing, they might know all the tell-tale signs, like the overloading from too many people and the stress fractures in the wood. Models that have “experienced” much more than any single human may have extraordinary capabilities like this.

If text-based-GPT-X is like having thousands of world experts to talk to, robot-embodied-video-GPT could be like having thousands of world experts in the room with you. It could know things like survival skills, yoga, workout routines, guitar chords. It could explain and demonstrate the mechanics of these things to you (e.g., starting a fire with a magnifying glass [youtube video]). Like having an Iain M. Banks Culture drone or Star Wars droid with you. A C-3PO that knows all the languages or whatever.

To caveat, it’s hard to say how far we are from video-GPT-X. It’s possible that patching up current limitations proves extremely difficult. It’s possible text is a uniquely well-suited modality for progress here. (Probably true. Images and video are less semantically dense than text: you need to process many more bits to get the relevant information, perhaps thousands of irrelevant or redundant pixels just to determine you’re looking at a brick wall or something. I don’t see this as a roadblock to very accurate video models, though. It probably just means we’ll need a bit more cleverness and a lot more scale.)

But with improved hardware, larger investments, and efficiency gains, massive-scale video prediction and general physics simulators don’t seem too far on the horizon. Seems like they’re probably worth planning for.

Anyway, cheers, thanks for reading.


(I’m happy to catch any comments, criticism, or feedback you have below, or via DMs, email, etc.)