Noise to Narrative
Diffusion models are trained on images paired with captions describing what is in them. During training, a small amount of noise is added to each picture over many iterations, following a carefully designed schedule, until nothing but random static remains. The model learns to predict what the slightly less noisy version of an image should look like at each step.
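The noising side of this can be sketched in a few lines. This is a minimal illustration, assuming a simple linear noise schedule; the variable names and the schedule values are illustrative, not taken from any particular model.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000                                # number of noising steps
betas = np.linspace(1e-4, 0.02, T)      # the "carefully designed schedule"
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)         # cumulative fraction of signal kept

def noisy_version(x0, t):
    """Return the image after t steps of noise, in closed form."""
    noise = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise

x0 = rng.standard_normal((8, 8))        # stand-in for a tiny image
early = noisy_version(x0, 10)           # still mostly the original image
late = noisy_version(x0, T - 1)         # essentially pure static
```

The point of the schedule is visible in `alpha_bars`: it starts near 1 (almost all image) and decays toward 0 (almost all noise), and the model's training task is to step back along that curve.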
Then, when you want to create an image, you start from pure noise and supply your prompt (i.e. a caption). The model removes a little noise at a time, iterating until an image emerges. The text prompt is converted into a numerical representation that guides the denoising process.
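The sampling loop looks roughly like the following sketch. Here `predict_noise` is a hypothetical stand-in for the trained network (a real one would be a neural net conditioned on the prompt embedding); the update rule is the standard DDPM-style step, and everything else is assumed for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t, prompt_embedding):
    # Placeholder for a trained model that predicts the noise in x,
    # guided by the prompt embedding. Returns zeros here.
    return np.zeros_like(x)

prompt_embedding = rng.standard_normal(16)   # stand-in for the encoded prompt

x = rng.standard_normal((8, 8))              # start from pure noise
for t in reversed(range(T)):
    eps = predict_noise(x, t, prompt_embedding)
    # Remove the predicted noise component (DDPM mean update).
    x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    if t > 0:
        # Re-inject a small amount of noise at every step but the last.
        x = x + np.sqrt(betas[t]) * rng.standard_normal(x.shape)
```

With a real network in place of the placeholder, each pass through the loop produces a slightly "less noisy" image, which is exactly the per-step prediction the model was trained on.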
Think of it like learning something new. You have some hooks in your brain from before; you add context, new lessons, new information, and suddenly the picture becomes somewhat clearer.
Most LLMs (ChatGPT, Mistral, Claude, Gemini and so on) write the way we humans do: one word after another.
Diffusion-based text models have started to appear. Think of them as working in reverse. You might supply only the last words, like the ending of a script, and from complete noise the model starts filling in words that make sense.
The model keeps refining this noisy text, adding more coherent words, until in the end we have something somewhat unique. This approach creates text quite differently from traditional word-by-word generation.
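One common way to frame discrete text diffusion is as mask-and-refine: start with every position hidden, then commit a few words per step, in no particular order. The toy below illustrates only that control flow; the "model" is a lookup table, and the sentence, mask token, and commit count are all made up for the example.

```python
import random

random.seed(0)

target = ["the", "cat", "sat", "on", "the", "mat"]
MASK = "<mask>"

def toy_denoiser(tokens):
    """Stand-in for a trained model: proposes a word for each mask."""
    return [target[i] if tok == MASK else tok for i, tok in enumerate(tokens)]

# "Complete noise": every position masked, except a known ending,
# like the last word of a script.
text = [MASK] * 5 + ["mat"]

# Refine iteratively, committing a couple of positions per step
# rather than generating strictly left to right.
for step in range(3):
    guess = toy_denoiser(text)
    masked = [i for i, tok in enumerate(text) if tok == MASK]
    for i in random.sample(masked, min(2, len(masked))):
        text[i] = guess[i]

print(" ".join(text))  # → the cat sat on the mat
```

Notice that the middle of the sentence can be settled before its beginning, which is the property the word-by-word approach simply does not have.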
What does this mean for text?
Since generation is not constrained to left-to-right thinking, the output can be more creative and surprising. It may also help where sequential models struggle to maintain coherence across long passages.
In this way it mimics how creatives work: moving from rough draft to refinement to polished work.
A future with multiple kinds of models, each useful in a different part of the creative flow, is an exciting prospect.