
July 2025 Update

It's been roughly 6 months since I started working on the first iteration of Mixreel. I've been regularly sharing progress on Bluesky, but this is the first time I've put together a comprehensive update on what I've been doing since I started. I also wanted to record my thoughts on what I've found works well, what hasn't worked so well, and where I think my efforts should be focused.

The Idea

Over the last 12 months, video generation models have gone from a minor curiosity to a potential must-have tool for visual creatives. Objects no longer morph into each other or fly off randomly into the sky. Image-to-video models create genuinely plausible extensions of static scenes. Text-to-video models adhere reasonably faithfully to their prompts.

This is true of both open-source models (Wan, CogVideo) and proprietary platforms (Runway, Luma Labs, Sora, Veo, Kling).

However, when working with these models, you're still mostly constrained to a single text prompt. It's difficult, if not impossible, to describe in words exactly how the camera should move, how the scene is composed, how it's lit, or any of the multitude of other things that go into making a video.

So that was the starting point for the Mixreel prototype. I wanted to experiment with building a 3D scene editor that gives control over the camera and the scene, while using the (untextured) rendered output as the input for a video generation model.

Progress So Far

I'll get into the nitty-gritty below, but here's a quick overview of the application in its current form:

  • a viewport with an editor camera and a render camera, both navigable with the mouse
  • a prefab asset library for inserting and moving objects around a scene
  • a functional timeline for keyframing basic transform channels (a minimal sketch of the idea follows this list)
  • rendering of the timeline to VP9 video, with submission to various video generation APIs
  • a backend database for storing and retrieving scenes and projects
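
To make the timeline item above a little more concrete, here's a minimal sketch of how keyframed transform channels can be evaluated. It's illustrative only (the Keyframe/Channel names aren't Mixreel's actual code): keyframes are sorted by time, and the value at the playhead is linearly interpolated between the two keyframes that bracket it.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Keyframe:
    time: float   # seconds on the timeline
    value: float  # value of one transform channel (e.g. camera X position)


class Channel:
    """One animatable channel (position.x, rotation.y, ...) as a sorted list of keyframes."""

    def __init__(self, keys: list[Keyframe]):
        self.keys = sorted(keys, key=lambda k: k.time)

    def sample(self, t: float) -> float:
        """Linearly interpolate the channel value at time t, clamping at the ends."""
        times = [k.time for k in self.keys]
        i = bisect_right(times, t)
        if i == 0:
            return self.keys[0].value
        if i == len(self.keys):
            return self.keys[-1].value
        a, b = self.keys[i - 1], self.keys[i]
        alpha = (t - a.time) / (b.time - a.time)
        return a.value + alpha * (b.value - a.value)


# Sample a simple 2-second camera dolly at 24 fps.
dolly_x = Channel([Keyframe(0.0, 0.0), Keyframe(2.0, 5.0)])
frames = [dolly_x.sample(f / 24) for f in range(48)]
```

In practice each object has several such channels (translation, rotation, scale), and the renderer evaluates all of them for every frame before rasterizing.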

Here's a timelapse I made today, where I use the macOS version to create a very basic scene.

The web version shares the same rendering backend and is (mostly) at feature parity with the desktop version.

The Outputs

Does the architecture work? Yes, but the results are definitely not good yet.

Here's an example of a scene I designed in Mixreel and exported as a video. It simply shows a floor, a set of stairs against the wall, and a few cubes, with the camera moving in a simple track through the scene:

Here is the result of passing this video to Luma Labs' "Modify" feature with the following prompt:

"The opening scene of a movie in a futuristic factory with conveyor belts moving boxes back and forth"

Basically... mush.

All the details visible in the input video (stairs, floor, walls, etc.) have completely disappeared. The output has basically no coherence whatsoever; I can't even tell what is being displayed here.

This was using the default "Strength" setting for Luma, which allows the model considerable freedom to depart from the input video.
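
For readers who haven't used these services: under the hood this is just a video-to-video request with three knobs: the input clip, the text prompt, and a strength value controlling how far the model may stray from the input. The sketch below is purely illustrative; the endpoint and field names are placeholders, not Luma's actual API.

```python
import requests

# Hypothetical video-to-video "modify" request. The endpoint and field names
# are placeholders for illustration only, NOT Luma's actual API.
API_URL = "https://example.com/v1/video/modify"
API_KEY = "YOUR_API_KEY"

payload = {
    "prompt": (
        "The opening scene of a movie in a futuristic factory "
        "with conveyor belts moving boxes back and forth"
    ),
    "input_video_url": "https://example.com/renders/stairs_and_cubes.webm",  # placeholder
    # Lower strength keeps the output closer to the input video;
    # higher strength gives the model more creative freedom.
    "strength": 0.4,
}

resp = requests.post(API_URL, json=payload, headers={"Authorization": f"Bearer {API_KEY}"})
resp.raise_for_status()
print(resp.json())  # typically a job id to poll until the finished video is ready
```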

Here's what I get if I reduce the Strength and re-run with a slightly modified prompt:

"In a futuristic factory, the camera zooms in on a set of stairs, then after reaching the top, the camera turns to show a conveyor belt of boxes being made"

This resembles the original video much more closely. However:

  • it's still ugly as sin and totally useless
  • even for such a simple scene, there are strange artifacts
  • it doesn't really adhere to the prompt at all

In other words, still not usable for anything serious.

A Different Approach

Let's step back and think about what we're actually expecting the model to do here.

When we pass an input video and a text prompt, we're basically asking the model to come up with something that is visually plausible, matches the description we provide, and still adheres closely to the surfaces and edges of the objects in the scene.

Maybe this is asking too much; after all, there's only so much you can do with a flat surface like the face of a cube. Perhaps loosening these restrictions a little will help the model come up with something aesthetically coherent.

What if we remove the flat surfaces and just ask the model to use the edges as a reference?
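
As a concrete sketch of what that looks like, the untextured render can be turned into a per-frame Canny edge map with OpenCV, and that edge-only clip becomes the structural reference for a ControlNet-conditioned video model. File names below are placeholders.

```python
import cv2

# Turn the untextured render into a per-frame Canny edge video that a ControlNet
# can use as a structural reference. File names are placeholders.
cap = cv2.VideoCapture("mixreel_render.webm")
fps = cap.get(cv2.CAP_PROP_FPS)
width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))

fourcc = cv2.VideoWriter_fourcc(*"mp4v")
writer = cv2.VideoWriter("edges.mp4", fourcc, fps, (width, height))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Thresholds control how many edges survive; flat untextured surfaces
    # mostly disappear, leaving just the object outlines.
    edges = cv2.Canny(gray, 100, 200)
    writer.write(cv2.cvtColor(edges, cv2.COLOR_GRAY2BGR))

cap.release()
writer.release()
```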

Here's the output from CogVideo 2B with a Canny ControlNet:

This is admittedly still awful. But:

  1. it's probably an improvement, which suggests this is a direction worth exploring.
  2. it's an older, minuscule open-source model; larger and newer models would probably do better.

The Challenges Ahead

A lot of challenges remain. Here are some of the problems that I think need to be solved before the Mixreel platform can open up:

1) Inconsistency/speed

Video generation still involves a lot of trial-and-error. It's infeasible to ask the user to wait 5-10 minutes to see a single iteration (particularly if 9/10 of those need to be discarded).

There must be some form of real-time feedback to ensure you're heading in the right direction.

2) Rasterizing polygons is limiting

Working with polygonal meshes (basic cubes/spheres/etc., or prefab assets) is very restrictive. By forcing video generation models to "colour inside the lines" of these basic shapes, we drastically constrain the diversity of outputs that the models can deliver.

Something truly compelling will probably involve a mixture of the following methods:

  • alpha-blended billboarding (e.g. rendering a backdrop in 2D, then adding custom objects on top; a minimal compositing sketch follows this list)
  • splatting/radiance fields (i.e. either capturing or generating volumetric videos and then adding/manipulating objects in that scene)
  • image-to-3D (generating 3D objects directly from images and inserting these into the scene)
  • "concept tokens" (extracting latent-space representations of highly detailed items)

3) Professional vs amateur

It's still not clear to me whether this is a tool for video professionals, or something that should be simple and accessible for casual users.

The question is definitely premature, because there's a long road ahead before the core technical details are smoothed out.

I'm mindful, though, that targeting the former means a lot more work is needed to refine the UI and the control system. Tools like Blender, Maya, Nuke, and Houdini have had thousands of man-hours invested in small quality-of-life improvements. These are the features that you don't actively notice when you're using the tools, but you definitely notice when they're not there.

I'm looking forward to developing the concept further throughout 2025. I'm planning to provide updates here on a more regular basis, but if you're interested in hearing things in real-time, follow me on Bluesky.

-- Nick