But how do AI videos actually work? | Guest video by @WelchLabsVideo
Diffusion models, CLIP, and the math of turning text into images
Welch Labs Book: https://www.welchlabs.com/resources/imaginary-numbers-book
Sections
0:00 - Intro
3:37 - CLIP
6:25 - Shared Embedding Space
8:16 - Diffusion Models & DDPM
11:44 - Learning Vector Fields
22:00 - DDIM
25:25 - DALL-E 2
26:37 - Conditioning
30:02 - Guidance
33:39 - Negative Prompts
34:27 - Outro
35:32 - About guest videos + Grant’s Reaction
Special Thanks to:
Jonathan Ho - Jonathan is the author of the DDPM paper and the classifier-free guidance paper:
https://arxiv.org/pdf/2006.11239
https://arxiv.org/pdf/2207.12598
Preetum Nakkiran - Preetum has an excellent introductory diffusion tutorial:
https://arxiv.org/pdf/2406.08929
Chenyang Yuan - Many of the animations in this video were implemented using manim and Chenyang’s smalldiffusion library: https://github.com/yuanchenyang/smalldiffusion
Chenyang also has a terrific tutorial and an MIT course on diffusion models:
https://www.chenyang.co/diffusion.html
https://www.practical-diffusion.org/
Other References
All of Sander Dieleman’s diffusion blog posts are fantastic: https://sander.ai/
CLIP Paper: https://arxiv.org/pdf/2103.00020
DDIM Paper: https://arxiv.org/pdf/2010.02502
Score-Based Generative Modeling: https://arxiv.org/pdf/2011.13456
Wan2.1: https://github.com/Wan-Video/Wan2.1
Stable Diffusion: https://huggingface.co/stabilityai/stable-diffusion-2
Midjourney: https://www.midjourney.com/
Veo: https://deepmind.google/models/veo/
DALL-E 2 paper: https://cdn.openai.com/papers/dall-e-2.pdf
Code for this video: https://github.com/stephencwelch/manim_videos/tree/master/_2025/sora
Written by: Stephen Welch, with very helpful feedback from Grant Sanderson
Produced by: Stephen Welch, Sam Baskin, and Pranav Gundu
Technical Notes
The noise videos in the opening have been passed through a VAE (the diffusion process actually happens in a compressed “latent” space), which acts very much like a video compressor - this is why the noise videos don’t look like pure salt-and-pepper noise.
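If you want to poke at this yourself, here is a minimal sketch of encoding a frame into a latent space, assuming the Hugging Face diffusers library and a standard Stable Diffusion VAE checkpoint (not necessarily the exact VAE used for these clips):

```python
# Sketch: encode a frame into a VAE latent space (assumes the diffusers library
# and a standard SD VAE checkpoint -- not necessarily the one used in the video).
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
frame = torch.randn(1, 3, 512, 512)  # stand-in for a real 512x512 RGB frame

with torch.no_grad():
    latents = vae.encode(frame).latent_dist.sample()

print(latents.shape)  # torch.Size([1, 4, 64, 64]): ~48x fewer numbers than the
                      # pixels, which is why decoded noise looks "compressed"
                      # rather than like raw salt-and-pepper noise
```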
6:15 CLIP: Although directly minimizing cosine similarity on a single batch would push our vectors 180 degrees apart, in practice CLIP tends to spread concepts roughly uniformly over the hypersphere it operates on, so unrelated vectors end up closer to orthogonal than antipodal. For this reason, we animated these vectors as orthogonal-ish. See: https://proceedings.mlr.press/v119/wang20k/wang20k.pdf
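To make the batch-level point concrete, here is a minimal sketch of a CLIP-style symmetric contrastive loss (the names, shapes, and temperature value are illustrative, not the exact CLIP training code). Note that it only pushes mismatched pairs apart relative to matched pairs, which in aggregate spreads embeddings over the sphere rather than forcing them to exactly 180 degrees:

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss (illustrative only).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities on the unit hypersphere.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise cosine similarities for the whole batch:
    # logits[i, j] compares image i with caption j.
    logits = image_emb @ text_emb.T / temperature

    # Matching image/caption pairs sit on the diagonal.
    targets = torch.arange(logits.shape[0])

    # Cross-entropy in both directions: pull matching pairs together and push
    # mismatched pairs apart only relative to the matches.
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2

# Example: a batch of 8 random 512-dim image and text embeddings.
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```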
Per Chenyang Yuan: at 10:15, the blurry image that results when removing the random noise in DDPM sampling is probably due to a mismatch in noise levels when calling the denoiser. When the denoiser is called on x_{t-1} during DDPM sampling, x_{t-1} is expected to have a certain noise level (call it sigma_{t-1}). If you generate x_{t-1} from x_t without adding noise, the noise actually present in x_{t-1} is always smaller than sigma_{t-1}. This causes the denoiser to remove too much noise, pushing the result toward the mean of the dataset.
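A sketch of a single DDPM sampling step may make this concrete (epsilon-prediction form with sigma_t^2 = beta_t as in the DDPM paper; the model and schedule names are hypothetical):

```python
# Sketch of one DDPM sampling step. `eps_model(x, t)` is a trained noise
# predictor; `betas`, `alphas`, `alpha_bars` are precomputed 1-D schedules.
import torch

def ddpm_step(eps_model, x_t, t, betas, alphas, alpha_bars, add_noise=True):
    eps = eps_model(x_t, t)  # predicted noise at level t
    # Posterior mean: remove the predicted noise, rescaled for step t.
    mean = (x_t - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    if add_noise and t > 0:
        # Fresh Gaussian noise keeps x_{t-1} at the noise level sigma_{t-1}
        # that the denoiser expects on the next call.
        return mean + torch.sqrt(betas[t]) * torch.randn_like(x_t)
    # Skipping this term (add_noise=False) leaves x_{t-1} "too clean" for its
    # timestep, so later denoiser calls over-denoise toward the dataset mean --
    # the blurry result described above.
    return mean
```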
The text conditioning input to Stable Diffusion is not the 512-dim text embedding vector, but the output of the layer before that, with dimension 77x512: https://stackoverflow.com/a/79243065
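A quick way to see both shapes, using the Hugging Face transformers library with a small CLIP checkpoint (the specific checkpoint here is just for illustration; the exact dimensions depend on which text encoder a given Stable Diffusion version uses):

```python
# Sketch: per-token CLIP text features vs. the single pooled embedding.
# The checkpoint name is illustrative; dimensions vary with the text encoder.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")

tokens = tokenizer(["an astronaut riding a horse"], padding="max_length",
                   max_length=77, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = text_model(**tokens)

print(out.last_hidden_state.shape)  # (1, 77, 512): per-token features, the kind
                                    # of sequence the diffusion model cross-attends to
print(out.pooler_output.shape)      # (1, 512): the single pooled text vector
```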
For the vectors at 31:40 - some implementations use f(x, t, cat) + alpha(f(x, t, cat) - f(x, t)), and some use f(x, t) + alpha(f(x, t, cat) - f(x, t)); in the second form, an alpha value of 1 corresponds to no guidance. I chose the second format here to keep things simpler.
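In code, the two conventions look like this (a sketch with a generic denoiser f(x, t, label); all names are hypothetical, and label=None stands for the unconditional call):

```python
# Sketch of the two classifier-free guidance conventions mentioned above.
def guided_v1(f, x, t, label, alpha):
    # Convention 1: start from the conditional output; alpha = 0 is no extra guidance.
    return f(x, t, label) + alpha * (f(x, t, label) - f(x, t, None))

def guided_v2(f, x, t, label, alpha):
    # Convention 2 (used in the video): start from the unconditional output.
    # alpha = 1 recovers plain conditional generation (no guidance); larger alpha
    # pushes further along the (conditional - unconditional) direction.
    return f(x, t, None) + alpha * (f(x, t, label) - f(x, t, None))
```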
At 30:30, the unconditional t=1 vector field looks a bit different from how it did at the 17:15 mark. This is because different models were trained for different parts of the video, and likely also reflects different random initializations.
Premium Beat Music ID: EEDYZ3FP44YX8OWT