Using Depth Maps to Steer Video Generation Models
Last year I shared some of the challenges that exist when trying to steer the output of text-to-video models.
Things have improved rapidly since then. Frontier video models have aggressively incorporated reference-guided generation, which has opened up more options for controlling generated outputs. Text-only prompting is a thing of the past; we're almost at the point where we have fine-grained multi-modal control over every aspect of the video generation pipeline.
The Problems With Text
Let's recap why text isn't great for specifying things that require precision and reproducibility, like camera movement or object orientation.
Take the following prompt:
At frame zero, the camera is 1m away from the character, oriented towards the character's face at a height of 1.5m and focused on the eyes. The camera then animates over 120 frames to rotate around the character's head to a close-up on the left ear, peering inside the ear canal.
There are a few problems with this. First, it's textual, not visual (duh). You can't immediately see whether the camera positions and animation curves are "good"; you need to feed the prompt to the model and wait for the generated results before you can inspect them. If the placement or animation needs tweaking, you have to adjust the prompt, feed it to the model again, and wait for the next generation to complete. This is not only slow, it's also very expensive.
Second, it's not reproducible. Generation is stochastic, so feeding the exact same prompt to the model multiple times does not guarantee the same animation curves.
Third, it couples content
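For contrast, here is a rough sketch of the camera move from the prompt above written as explicit per-frame data. Unlike the text prompt, it can be inspected before anything is generated, and it is identical on every run. The function name, the coordinate convention (subject at the origin, Y up, camera starting on the +Z axis), the 90 degree sweep, and the 0.2m end distance are all assumptions made for illustration; none of them come from a particular tool.

```python
import math

def orbit_camera_path(start_radius_m=1.0, end_radius_m=0.2, height_m=1.5,
                      sweep_deg=90.0, frames=120):
    """Return one (x, y, z) camera position per frame, frames 0..`frames`.

    Assumed convention: the subject sits at the origin, Y is up, and the
    camera starts in front of the face on the +Z axis.
    """
    path = []
    for frame in range(frames + 1):
        t = frame / frames  # normalized time, 0.0 at the start, 1.0 at the end
        angle = math.radians(t * sweep_deg)
        # Linear interpolation of the distance gives the push-in to a close-up.
        radius = start_radius_m + t * (end_radius_m - start_radius_m)
        x = radius * math.sin(angle)
        z = radius * math.cos(angle)
        path.append((x, height_m, z))
    return path

path = orbit_camera_path()
print(path[0])    # camera position at frame 0
print(path[60])   # halfway through the move
print(path[-1])   # final position, beside the head
```

Every number here can be plotted or previewed in a 3D viewport before spending anything on generation, which is exactly what a text description does not allow.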
Better would be to give the model some kind of additional signal so it understands that, when generating the video based on the text prompt, it should also
Here, we are using depth maps.
Here is the viewport view from the Mixreel app, and the rendered depth map video.
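If you haven't worked with depth maps before, each frame is just a grayscale image in which brightness encodes distance from the camera. Below is a minimal sketch of that encoding, assuming linear depth in meters, hand-picked near and far clip distances, and the "nearer is brighter" convention commonly used for depth conditioning. It is not how Mixreel renders its depth pass, and `depth_to_gray` is a hypothetical helper written for this example.

```python
def depth_to_gray(depth, near, far):
    """Map per-pixel depth values to 8-bit gray levels (nearer = brighter).

    `depth` is a list of rows of distances in meters; values outside
    [near, far] are clamped to that range first.
    """
    span = far - near
    gray = []
    for row in depth:
        out_row = []
        for d in row:
            clamped = min(max(d, near), far)
            t = (clamped - near) / span  # 0.0 at the near plane, 1.0 at the far plane
            out_row.append(round(255 * (1.0 - t)))
        gray.append(out_row)
    return gray

# A tiny 2x2 "frame": one pixel at the near plane, one beyond the far plane.
frame = [[0.5, 1.0],
         [3.0, 10.0]]
gray = depth_to_gray(frame, near=0.5, far=5.0)
print(gray)
```

The choice of near and far planes matters: too wide a range squeezes the subject into a handful of gray levels, and the model gets very little shape information to work with.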
Seedance 2.0 (via Dreamina)
Dreamina Seedance 2.0 does not appear to be a strictly depth-controlled model; it behaves more like a video-reference-guided model. Here, we pass the depth video as a reference.
Wan-2.2 A14B ControlNet
This is a ControlNet for Wan 2.2 trained specifically for depth maps.
LTX-2-19B Control
This is a ControlNet for LTX-2 trained specifically for depth maps. However, this result was generated via Wavespeed, which performs its own depth estimation from the reference video and does not allow you to pass a depth video directly.