Stable Diffusion for Dummies

10 min read · Jun 7, 2023

An intuitive and in-depth look at this breakthrough model in AI art.

Image generated using the prompt “astronaut riding a horse”

Unless you have been living under a rock for the past year, you have likely heard about the recent breakthroughs in AI that have schools, corporations, and governments scrambling to keep up with the powerful new technology at our fingertips. One such model is Stable Diffusion, which generates images from a text prompt. Text-to-image models of this kind have produced images capable of winning art competitions.

Jason Allen’s “Théâtre D’opéra Spatial,” which won 1st place at the Colorado State Fair. Allen created the image using Midjourney, a proprietary text-to-image service in the same family of tools as Stable Diffusion.

With Stable Diffusion’s growing mainstream popularity, it is vital to better understand this emerging technology. This article provides an intuitive but comprehensive guide on how Stable Diffusion works, as well as some accompanying code to better illustrate the concepts in practice.

Background on Diffusion Models

In order to better understand Stable Diffusion, it is necessary to have a basic knowledge of its backbone: diffusion models.

Diffusion models are a form of generative model built to create new data that resembles the data they were trained on. They have a variety of uses, such as data generation for domains where real data is limited (e.g., medical imaging).

Let's dig a bit deeper into how these models work.

Diffusion models consist of a forward and a backward process. The forward process progressively destroys data, traditionally images, by adding noise until only pure noise remains. The backward process, carried out by a neural network (typically a U-Net), then aims to recover the original data from the noise.

Image from https://developer.nvidia.com/blog/improving-diffusion-models-as-an-alternative-to-gans-part-1/

This is analogous to throwing a piece of cake at the wall, and then trying to bake the same cake while only having its splattered remains as a guide.
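To make the forward process concrete, here is a minimal sketch of how an image can be noised in one shot to any timestep. It assumes a DDPM-style linear noise schedule; the schedule values, the `forward_noise` helper, and the all-ones stand-in image are illustrative choices, not taken from Stable Diffusion itself.

```python
import numpy as np

def forward_noise(x0, t, alpha_bar, rng):
    """Sample x_t directly from x_0: sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

# Assumed linear schedule over 1000 steps (illustrative values)
betas = np.linspace(1e-4, 0.02, 1000)
alpha_bar = np.cumprod(1.0 - betas)   # fraction of the original signal left at each step

rng = np.random.default_rng(0)
x0 = np.ones((64, 64))                # stand-in for an image
x_early = forward_noise(x0, 10, alpha_bar, rng)    # still close to the original
x_late = forward_noise(x0, 999, alpha_bar, rng)    # essentially pure noise
```

Early timesteps keep most of the original signal, while the last timestep is statistically indistinguishable from Gaussian noise. That is the "splattered cake" the backward process has to start from.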

Eventually, the trained model is supplied pure noise and only the backward process is run to synthesize new data similar to that in the training dataset.

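Diffusion models are generative models, and are thus often compared with Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). The comparison is often framed as a trilemma between sample quality, sample diversity, and sampling speed: GANs sample quickly and produce high-quality images but tend to cover less of the data distribution, VAEs sample quickly and cover the distribution well but produce lower-quality images, and diffusion models offer quality and diversity at the cost of slow sampling.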

However, diffusion models have recently pulled ahead in popularity because their main drawback, sampling/generation speed, has been addressed. By introducing a latent space (https://arxiv.org/pdf/2112.10752.pdf) into which images are autoencoded, the forward/backward process operates on compressed representations, allowing for faster sampling overall. In other words, the original images are compressed, or encoded, into a smaller latent representation using a neural network, and the diffusion model is only responsible for learning from and generating these latents. Once generated, the latents are passed through a decoder, which fills in details at a higher resolution.

Going back to our cake analogy, this is similar to learning to reconstruct a cake without icing, and having a resident baker do just the icing instead. This makes learning the reconstruction much easier, as there is simply less to do when we don’t worry about the finer details.
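A quick back-of-the-envelope calculation shows how much smaller the problem becomes. The sketch below uses the sizes that appear later in this article: a 512x512 RGB image, and a latent with 4 channels at 1/8 of the resolution.

```python
# Pixel space: a 512x512 RGB image
pixel_values = 512 * 512 * 3

# Latent space: 4 channels at 1/8 of the height and width
latent_values = (512 // 8) * (512 // 8) * 4

compression = pixel_values / latent_values
print(pixel_values, latent_values, compression)   # 786432 16384 48.0
```

Under these assumptions, the diffusion model handles 48 times fewer values per image than it would in pixel space.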

With these fundamentals out of the way, we can now apply these ideas to better understand how Stable Diffusion works.

Stable Diffusion Overview

Let's start off by better understanding the components of the model, and then do a comprehensive walkthrough with some code.

1. CLIP Text Encoder

One of the main differences of Stable Diffusion compared to traditional diffusion models is that it accepts a text prompt. To encode this prompt into a form understandable by an algorithm, we use OpenAI’s CLIP model.

There are several guides on how exactly CLIP works to encode text, but the gist is that CLIP was trained to place related images and text close together in a shared latent space. In other words, if CLIP is given an image of a dog and a set of candidate captions, it should be able to correctly pick out “photo of a dog”, because the model has learned to put the image and text encodings close to each other in latent space.

Rather than using its predictive capabilities, we just use CLIP to encode text into a latent representation where it has proximity to similar images. So essentially, we only use the text encoder half of CLIP in order to process our text prompt.
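The idea of "closeness in latent space" can be sketched with cosine similarity. The three-dimensional vectors below are made-up stand-ins (real CLIP embeddings are far larger and come from the trained encoders), so this only illustrates the matching step, not CLIP itself.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical embeddings for illustration only
image_embedding = np.array([0.9, 0.1, 0.2])   # pretend this encodes a photo of a dog
caption_embeddings = {
    "photo of a dog": np.array([0.8, 0.2, 0.1]),
    "photo of a cat": np.array([0.1, 0.9, 0.2]),
    "photo of a car": np.array([0.1, 0.2, 0.9]),
}

scores = {caption: cosine_similarity(image_embedding, emb)
          for caption, emb in caption_embeddings.items()}
best_caption = max(scores, key=scores.get)
print(best_caption)
```

The caption whose embedding points in nearly the same direction as the image embedding gets the highest score, which is the sense in which related text and images sit "close" to each other.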

2. Variational AutoEncoder (VAE)

A VAE is a neural network that facilitates the conversion to/from latent space for images.

The Encoder acts like a compressor, squishing the input image into a lower dimensional latent representation.

Once the forward / reverse diffusion process finishes and the diffusion model has output a reconstruction of the original latent, this output latent is passed through the Decoder to create an image with the same resolution as the input images.

As explained in the previous section on Diffusion Models, this process is what makes Stable Diffusion so powerful — the VAE enables the diffusion model, which is the most computationally expensive part of Stable Diffusion, to run in latent space. Since latent space is effectively a compression of the original image, the diffusion process is far more efficient and so can be done on consumer GPUs which usually have very limited VRAM.

3. Diffusion Model

The basics of Diffusion Models were already covered earlier in this article, but the main modification to keep in mind for Stable Diffusion is that the backward process uses the text embedding as well as random noise to generate the desired image.

Putting it All Together

To put it all together, we can follow this general procedure to build our own Stable Diffusion pipeline to generate images from text:

1. Encode our text prompt using the CLIP model.
2. Generate some random noise in the latent dimension.
3. Load in a pretrained U-Net model, and perform the reverse process for a fixed number of timesteps, using the random noise and encoded text prompt as input. The output of this step is the latent representation of our generated image.
4. Load in a pretrained VAE, and perform the Decoding process on the output latent from the previous step to obtain the final output image, in full resolution.
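Before bringing in the real models, the procedure above can be traced with stand-in functions that only reproduce the tensor shapes involved. Everything here (`encode_text`, `denoise_step`, `decode`) is a hypothetical placeholder with no learned behavior; the shapes are the ones used later in this article.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_text(prompt):
    """Stand-in for CLIP: prompt -> (1, 77, 768) embedding."""
    return rng.standard_normal((1, 77, 768))

def denoise_step(latents, text_emb):
    """Stand-in for one U-Net + scheduler step; the real one uses text_emb."""
    return 0.9 * latents

def decode(latents):
    """Stand-in for the VAE decoder: 8x upsampling to 3 channels."""
    return latents[:, :3].repeat(8, axis=2).repeat(8, axis=3)

text_emb = encode_text("astronaut riding a horse")   # 1. encode the prompt
latents = rng.standard_normal((1, 4, 64, 64))        # 2. random noise in latent space
for _ in range(50):                                  # 3. reverse process
    latents = denoise_step(latents, text_emb)
image = decode(latents)                              # 4. decode to full resolution

print(text_emb.shape, latents.shape, image.shape)
```

The point of the sketch is the data flow: the expensive loop only ever touches the small 4x64x64 latent, and the full 3x512x512 image appears once at the very end.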

Stable Diffusion Code Walkthrough

Finally, let's do a quick demo of how the process outlined above can be implemented in practice.

Note — for all of the following steps, we use the trained checkpoints of the components of Stable Diffusion. This is done because this guide serves to illustrate how Stable Diffusion functions as a whole, and not how the training for each component was done.

CLIP

Let's start by instantiating a tokenizer and encoder for the CLIP model. We then define a prompt, and use the tokenizer and encoder to get an embedding for it.

Let's use the prompt “pikachu enjoying a meal in front of the Eiffel tower”.

import torch
from transformers import CLIPTextModel, CLIPTokenizer

# The tokenizer only maps text to integer IDs, so it takes no dtype argument
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")

prompt = "pikachu enjoying a meal in front of the Eiffel tower"
# Pad (or truncate) the prompt to the model's maximum length of 77 tokens
tokens = tokenizer([prompt], padding="max_length", truncation=True, return_tensors="pt")
# Index 0 selects the per-token hidden states, of shape (1, 77, 768)
with torch.no_grad():
    embedding = text_encoder(tokens.input_ids.to("cuda"))[0].half()

The CLIP tokenizer splits the prompt into tokens (whole words or pieces of words) and maps each one to a unique number which represents it. The tokenizer returns a tensor of size 1x77, as our prompt could have a maximum of 77 tokens. Since our prompt only had 10 tokens, the remaining positions are just padded with a default value. You can see the non-default tokens are the 10 tokens starting from 28107.
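The padding behavior is easy to mimic in plain Python. This is a toy sketch: the `pad_tokens` helper, the token IDs, and the padding value of 0 are hypothetical, whereas a real tokenizer looks IDs up in its vocabulary and uses its own padding token.

```python
MAX_LENGTH = 77   # maximum number of tokens per prompt

def pad_tokens(token_ids, pad_id, max_length=MAX_LENGTH):
    """Truncate or right-pad a list of token IDs to a fixed length."""
    clipped = token_ids[:max_length]
    return clipped + [pad_id] * (max_length - len(clipped))

# Hypothetical IDs for illustration only
token_ids = [101, 7, 42, 9, 13]
padded = pad_tokens(token_ids, pad_id=0)
print(len(padded))   # 77
```

A fixed length means every prompt, short or long, produces an input of the same shape, which is what lets the encoder return the same embedding size for any prompt.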

We then pass the tokens to the CLIP encoder, which runs the pretrained CLIP model and spits out a 1x77x768 embedding for our prompt:

Going back to what CLIP does, this embedding, though seemingly nonsensical to us, represents the prompt in latent space; thus, it is close to embeddings of images which are similar in meaning, possibly those of Pikachu, the Eiffel tower, etc.

VAE

Let's start off by creating a function to convert an image to its latent representation through the Encoder so we can better visualize the latent space. We will be working with this image:

from PIL import Image 
from torchvision import transforms as tfms   
import numpy as np 
import matplotlib.pyplot as plt 
from diffusers import AutoencoderKL 
import requests

# Loading our pretrained vae
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")

def image_to_latent(image):
    # ToTensor gives (C, H, W) in [0, 1]; add a batch dimension -> (1, C, H, W)
    image_tens = tfms.ToTensor()(image).unsqueeze(0)
    # Normalizing to [-1, 1]
    image_tens = image_tens * 2.0 - 1.0
    # Moving tensor onto GPU
    image_tens = image_tens.to(device="cuda", dtype=torch.float16)

    # Encoding our image using the encoder of the vae (0.18215 is the latent scaling factor)
    with torch.no_grad():
        latents = vae.encode(image_tens).latent_dist.sample() * 0.18215

    # Plotting each of the 4 latent channels as a grayscale image
    fig, axs = plt.subplots(1, 4, figsize=(16, 4))
    for c in range(4):
        axs[c].imshow(latents[0][c].detach().cpu().float(), cmap='Greys')

    return latents

url = "https://oyster.ignimgs.com/mediawiki/apis.ign.com/pokemon-blue-version/8/89/Pikachu.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert('RGB').resize((512,512))
latents = image_to_latent(image)

Latent space representation of the input

From the above image, we can see that latents have shape 64x64x4. We can also think of this as 4 grayscale images stacked on top of each other — the above visualization shows each image separately.

To recap how this is used in Stable Diffusion, the Diffusion model / U-Net would be trained on these latents. So, instead of the original image going through the forward / backward diffusion process, these latent representations are used in the processes instead.

Next, we go over the Decoding process, where latent space representations are converted back to image space.

def latent_to_image(latent):
    latent = (1 / 0.18215) * latent

    with torch.no_grad():
        # Decoding our latents
        image = vae.decode(latent).sample
      
    # Normalizing RGB values back to [0, 1]
    image = (image / 2 + 0.5).clamp(0, 1)

    # Put the tensor back on the cpu, convert it back to numpy
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")

    # Return the reconstructed image as a PIL image
    return [Image.fromarray(image) for image in images][0]

reconstructed_image = latent_to_image(latents)
reconstructed_image.show()

And we get our original image back! Well, almost — a key point to remember about VAEs is that they are not lossless, and the reconstructed images may have some defects or abnormalities that were not present in the original image. However, this can be minimized through proper training, as shown in the high quality reconstruction above.
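One way to put a number on "almost" is peak signal-to-noise ratio (PSNR), a common measure of reconstruction quality. The `psnr` helper below is an illustrative addition, not part of the original walkthrough; it assumes both images have the same size and 8-bit pixel values.

```python
import numpy as np

def psnr(original, reconstructed, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means a closer reconstruction."""
    # Cast to float first so uint8 subtraction cannot wrap around
    a = np.asarray(original, dtype=np.float64)
    b = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)
```

With the variables from the walkthrough above, this could be called as `psnr(image, reconstructed_image)`, since both are 512x512 RGB PIL images. A lossless round trip would give infinity; a lossy VAE gives a finite value.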

Diffusion Model (U-Net)

Let's start by initializing our pretrained U-Net model and a scheduler. Schedulers are predefined functions that determine how much noise is present in the latent at each timestep of the diffusion process, and how to step from one noise level to the next. There are several different schedulers available, but for the sake of this tutorial we choose LMS.

from diffusers import UNet2DConditionModel, LMSDiscreteScheduler

scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(51)

unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")
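To get a feel for what the scheduler parameters mean, the sketch below recomputes a noise schedule from the same `beta_start`, `beta_end`, and `num_train_timesteps` values. It assumes that "scaled_linear" means linear in the square root of beta, and that the noise level (sigma) is derived from the cumulative product of `1 - beta`; treat it as an approximation of what the library does internally rather than its exact code.

```python
import numpy as np

beta_start, beta_end, num_train_timesteps = 0.00085, 0.012, 1000

# Assumed "scaled_linear" schedule: linear in sqrt(beta), then squared
betas = np.linspace(beta_start ** 0.5, beta_end ** 0.5, num_train_timesteps) ** 2

# Fraction of the original signal that survives up to each timestep
alphas_cumprod = np.cumprod(1.0 - betas)

# Noise level per timestep, in the form used by sigma-based samplers
sigmas = np.sqrt((1.0 - alphas_cumprod) / alphas_cumprod)
print(sigmas[0], sigmas[-1])
```

The noise level grows from almost nothing at the first timestep to many times the scale of the signal at the last one. Sampling walks this curve in reverse, visiting only a subset of the 1000 training timesteps.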

As explained earlier, we start with random noise in the latent space, then run one step of the reverse diffusion process by passing the current sample through the U-Net. This process is iterated a fixed number of times to progressively denoise the image. Here is an animation of the process:

Let's take a look at just one timestep in this process. To make this process clearer, we will use an empty text prompt so the U-Net is forced to only attend to the image.

# Adding noise to the image latents (from image_to_latent above) using the scheduler
noise = torch.randn_like(latents)
timestep = scheduler.timesteps[40]
sigma = scheduler.sigmas[40]  # noise level that add_noise applies at this timestep
noisy_latents = scheduler.add_noise(latents, noise, timesteps=torch.tensor([timestep]))
latent_to_image(noisy_latents).show()

# Text prompt empty in this case as we are not generating a new image
empty_prompt = ""
tokens = tokenizer([empty_prompt], padding="max_length", truncation=True, return_tensors="pt")
with torch.no_grad():
    text_embeddings = text_encoder(tokens.input_ids.to("cuda"))[0].half()

# Predicting noise with the U-Net (the LMS scheduler expects a rescaled input)
latent_model_input = scheduler.scale_model_input(noisy_latents, timestep).half()
with torch.no_grad():
    noise_pred = unet(latent_model_input, timestep, encoder_hidden_states=text_embeddings).sample

# Subtract the predicted noise, scaled by the noise level, from the latent and visualize
latent_to_image(noisy_latents - sigma * noise_pred).show()

Image after one step of the reverse process

We can see that the image is clearer after passing through the U-Net and part of the noise is removed. This is repeated a fixed number of times to completely denoise the object.
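The arithmetic behind a single denoising step can be checked in isolation. The sketch below assumes the noisy latent is formed as `clean + sigma * noise` and uses a hypothetical "oracle" that returns the exact noise that was added; a real U-Net only approximates that noise, which is why many small steps are needed in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
clean = rng.standard_normal((4, 64, 64))   # stand-in for an image latent
noise = rng.standard_normal(clean.shape)
sigma = 0.5                                # illustrative noise level

# Forward: add scaled noise
noisy = clean + sigma * noise

def oracle_noise_predictor(noisy_latent):
    """Hypothetical perfect predictor; returns the exact noise that was added."""
    return noise

# Reverse: remove the predicted noise, scaled by the same noise level
estimate = noisy - sigma * oracle_noise_predictor(noisy)
print(np.abs(estimate - clean).max())
```

With a perfect prediction the clean latent is recovered in one step. The quality of the U-Net's noise estimate is therefore what determines how good the denoised result looks.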

Full Pipeline

Let's run a full example of Stable Diffusion using our code and the text prompt “pikachu having dinner in front of the Eiffel tower”:

from tqdm import tqdm

# Encoding prompt
prompt = "pikachu having dinner in front of the Eiffel tower"
tokens = tokenizer([prompt], padding="max_length", max_length=tokenizer.model_max_length, truncation=True, return_tensors="pt") 
embedding = text_encoder(tokens.input_ids.to("cuda"))[0].half()

# Adding an unconditional prompt
empty_tokens = tokenizer([""], padding="max_length", max_length=embedding.shape[1], truncation=True, return_tensors="pt") 
empty_embedding = text_encoder(empty_tokens.input_ids.to("cuda"))[0].half()
emb = torch.cat([empty_embedding, embedding])

# Initiating random noise in the latent dimension, which is 1/8 of the desired image dimension.
dim = 512
latents = torch.randn((1, unet.config.in_channels, dim//8, dim//8))

# Setting up scheduler
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=1000)
scheduler.set_timesteps(100)

# Noising latents
latents = latents.to("cuda").half() * scheduler.init_noise_sigma

# Guidance parameter - dictates how closely the model listens to the prompt
g = 7
for i,ts in enumerate(tqdm(scheduler.timesteps)):
    inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)
    
    # Predicting noise using U-Net
    with torch.no_grad(): u,t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)
        
    # Performing Guidance
    pred = u + g*(t-u)
    
    # Conditioning the latents
    latents = scheduler.step(pred, ts, latents).prev_sample
        
# Decoding the final latent representation into a 512x512 RGB image
latent_to_image(latents).show()

Not bad for something we coded in only a few lines!
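The guidance line in the loop above, `pred = u + g*(t-u)`, is worth a closer look. This small sketch applies the same formula to made-up two-element vectors (the `guide` helper and the numbers are illustrative only) to show what different values of the guidance parameter do.

```python
import numpy as np

def guide(uncond, cond, scale):
    """Move from the unconditional prediction toward (and past) the conditional one."""
    return uncond + scale * (cond - uncond)

uncond = np.array([0.0, 1.0])   # prediction with an empty prompt
cond = np.array([1.0, 3.0])     # prediction with the text prompt

no_guidance = guide(uncond, cond, 0.0)   # equals uncond: the prompt is ignored
plain_cond = guide(uncond, cond, 1.0)    # equals cond: the ordinary conditional prediction
strong = guide(uncond, cond, 7.0)        # extrapolates well past cond
```

A scale above 1 exaggerates the difference the prompt makes, which is why larger values make the model follow the prompt more closely.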

Moreover, there are ways to significantly improve the quality of images generated, such as using a Stable Diffusion model checkpoint further trained on specialized datasets. Here is an image created with the same prompt but using the Lexica Aperture V3 model:

There are several specialized Stable Diffusion models for generating images with a specific style available online. Lexica’s is available over here.

Conclusion

I hope this article gave a comprehensive introduction to Stable Diffusion and sparked your interest in how this technology could be used. If you have any feedback, please reach out to me on LinkedIn. I’d love to hear it.