This project aims to implement and deploy diffusion models to generate images.
I use random seed 42 for all of Part 5A. Using the 3 text prompts provided, we get the image outputs from the model below. Overall, the quality of each image is very high, with fine details, vibrant colors, sharp contrast, and little blurriness. Using a larger num_inference_steps gave me more realistic results with brighter colors, especially for "a man wearing a hat". Using a smaller num_inference_steps gave me fewer details, as with "a rocket ship" at 10 steps, and slightly more blurriness, as with "an oil painting of a snowy mountain village" at 10 steps, even though "a man wearing a hat" at 10 steps was still a decently realistic image. So there is some variance, and the trend with respect to num_inference_steps is not always direct.
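For reference, here is a minimal sketch of how this seeded sampling might look with a diffusers-style pipeline; stage_1 and its call signature are assumptions for illustration, not the exact notebook code.

```python
import torch

prompts = ["an oil painting of a snowy mountain village",
           "a man wearing a hat",
           "a rocket ship"]

for steps in (10, 20, 40):
    for prompt in prompts:
        # Re-seed per sample so each image is reproducible on its own.
        generator = torch.Generator().manual_seed(42)
        # `stage_1` is assumed to be a loaded DeepFloyd IF stage-1 pipeline.
        image = stage_1(prompt, num_inference_steps=steps,
                        generator=generator).images[0]
```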
(Figures: generated images at num_inference_steps = 10, 20, and 40.)
We start by implementing the forward function to add noise to a clean image.
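Concretely, the forward process is $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. A minimal sketch, assuming alphas_cumprod holds the schedule's cumulative products $\bar\alpha_t$:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return torch.sqrt(a_bar) * im + torch.sqrt(1.0 - a_bar) * eps
```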
Now, we can use Gaussian blur filtering to attempt to remove the noise. As we can see, this method does not really work, since we can no longer clearly distinguish the Campanile in the image.
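The classical baseline looks something like the following; the kernel size and sigma here are illustrative choices, not the exact values used.

```python
import torchvision.transforms.functional as TF

# Try to undo additive noise with a Gaussian low-pass filter.
blurred = TF.gaussian_blur(im_noisy, kernel_size=5, sigma=2.0)
```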
Using stage1.unet and the embeddings for the prompt "a high quality photo", we can try to denoise in one step. To remove the estimated noise from the image, we can follow the equation below.
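$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}}$$

where $\hat\epsilon$ is the noise estimated by the UNet.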
To improve on the one-step denoising, we can denoise iteratively. Rather than stepping through all 1000 timesteps, we skip with strides; here we use a stride of 30. We follow the equation below, starting at i_start = 10.
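$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma$$

where $t' < t$ is the next (less noisy) timestep in the stride, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, $\hat{x}_0$ is the current clean-image estimate, and $v_\sigma$ is random noise scaled by the predicted variance.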
Now, we will generate images from scratch (pure noise) by setting i_start = 0. The image quality isn't great, which we will fix in the next part.
To apply CFG, we follow the equation below.
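$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$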
where $\epsilon$ is the combined noise estimate, $\epsilon_c$ is the conditional noise estimate, and $\epsilon_u$ is the unconditional noise estimate. We use CFG scale $\gamma = 7$ to generate the images below.
We can see that our images are a bit clearer, with finer details and fewer regions of blurriness.
Now, we can use the SDEdit algorithm to noise the original image a little and force it back onto the natural image manifold without conditioning. We can observe the sequence of edits. In the images below, we can see that the image with i_start = 20 has qualities of both the prompt we pass in ("a high quality photo") and the original image.
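A minimal sketch of the SDEdit step, reusing the forward function above; iterative_denoise, strided_timesteps, and prompt_embeds are assumed helper names, not necessarily the notebook's exact ones.

```python
def sdedit(im, i_start, strided_timesteps, prompt_embeds):
    # Noise the original image up to the chosen starting timestep...
    t = strided_timesteps[i_start]
    im_noisy = forward(im, t, alphas_cumprod)
    # ...then force it back to the image manifold with the usual
    # iterative denoising loop (assumed helper from the previous part).
    return iterative_denoise(im_noisy, i_start, strided_timesteps, prompt_embeds)
```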
I added two noise levels, 25 and 30, which generate better results (closer to the corresponding original images) than smaller noise levels, as expected. Since these drawings are much simpler than the images I found online, the sequence of diffusion outputs below converges to the drawn image much more quickly and sharply.
For the hand-drawn images, I used the starter code in the Google Colab notebook to draw and process them. Note that the original Hoover Tower and Green Building images are blurry because the code resizes them to 64x64.
Now, we can inpaint part of the image using a binary mask $m$: at each denoising step we keep the diffusion sample where $m = 1$ and force the region where $m = 0$ back to the (appropriately noised) original image, as in the update below. Three examples follow.
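$$x_t \leftarrow m \odot x_t + (1 - m) \odot \text{forward}(x_{\text{orig}}, t)$$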
For the Hoover Tower and plain-field images, I used the same parameters for inpainting. The inpainted plain-field image looks the best and coolest. Since the Green Building is not exactly front-facing, the mask was slightly off-center, but the result still looks acceptable.
Here, we can use prompts to condition the image so that it retains qualities from both the prompt and the original image.
Using the prompt "a photo of a man", we get the result below. We can see that the image at noise level 20 gives us a man with a different face.
Using the prompt "a photo of a dog", we get the result below. The image at noise level 20 has some traces of a dog in the cat's face but retains most of the qualities that make up the cat image.
In this section, we will use our UNet to get two noise estimates, one for the upright image and one for a flipped copy, and average them. This process gives us visual anagrams: images that match one prompt right side up and another upside down. The prompts for the pairs of images are shown with the results below.
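Concretely, with flip denoting a vertical flip of the image and $p_1$, $p_2$ the two prompt embeddings:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \qquad \epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big), \qquad \epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$$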
All images are recognizable both right side up and upside down.
Similar to visual anagrams, we can use low-pass and high-pass filters to create a hybrid image that looks like one prompt up close and like another from far away. The prompts for the pairs of images are shown with the results below.
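Concretely, with $f_{\text{lowpass}}$ a Gaussian blur, $f_{\text{highpass}}$ its complement, and $p_1$ the prompt meant to be seen from far away:

$$\epsilon = f_{\text{lowpass}}\big(\text{UNet}(x_t, t, p_1)\big) + f_{\text{highpass}}\big(\text{UNet}(x_t, t, p_2)\big)$$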
We can design a UNet architecture consisting of ConvBlocks, DownBlocks, and UpBlocks and train it as a denoiser on the MNIST dataset. Below is a visualization of simply adding noise to a sample of MNIST images at increasing noise levels $\sigma$.
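The noising here is plain additive Gaussian noise; a minimal sketch:

```python
import torch

# z = x + sigma * eps, eps ~ N(0, I): the noising used for the denoiser.
def add_noise(x, sigma):
    return x + sigma * torch.randn_like(x)
```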
We will use a batch size of 256 and a hidden dimension of 128 to train over the training dataset for 5 epochs. Our loss curve (plotted on a log scale) decreases steadily, as shown below.
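A hedged sketch of the training loop; the Denoiser constructor, optimizer, learning rate, and fixed training noise level are assumptions, not necessarily the exact settings used.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

model = Denoiser(hidden_dim=128)                        # hypothetical constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)     # assumed optimizer/lr
loader = DataLoader(train_set, batch_size=256, shuffle=True)  # MNIST train set

for epoch in range(5):
    for x, _ in loader:                  # labels are unused by the denoiser
        z = add_noise(x, sigma=0.5)      # assumed fixed training sigma
        loss = F.mse_loss(model(z), x)   # L2 denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
```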
Below are the results of inference on the training dataset. We can see that after 1 epoch of training, the model can't produce as black a background as the original images. After 5 epochs, the model does much better, with only a slight amount of blurriness around the white digits.
These are the results on the test set when we vary the $\sigma$ values for noising. The denoised images for $\sigma = 0.8$ and $\sigma = 1.0$ are worse than those for smaller $\sigma$ values, but the digits are still distinguishable.
Now, we can take advantage of time and class conditioning to train a stronger UNet.
We will follow the equation from Part 1.3 of 5A. With a DDPM schedule, we can use a smaller $T$ (300 vs. 1000) and get better results. We add 2 fully-connected blocks (FCBlocks) to our UNet, which take the timestep as input and inject it into the UpBlocks of the architecture. Below is our training loss, which steadily decreases, like in the previous part of 5B.
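A hedged sketch of what such an FCBlock might look like; the layer sizes and activation are assumptions.

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the (normalized) timestep to a conditioning vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, t):
        return self.net(t)
```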
Now, we can call ddpm_sample, which we implement by following the sampling algorithm from the DDPM paper. Note that we set the random seed in each iteration of the reverse diffusion loop in order to maintain reproducibility.
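A hedged sketch of how ddpm_sample might look; the per-step re-seeding mirrors the reproducibility note above, and the normalized time input model(x, t/T) is an assumption about the UNet's interface.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, T=300, shape=(16, 1, 28, 28), seed=0):
    # Sketch of Algorithm 2 from Ho et al. (2020).
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    torch.manual_seed(seed)
    x = torch.randn(shape)  # start from pure noise
    for t in range(T - 1, -1, -1):
        torch.manual_seed(seed + t)  # re-seed each iteration for reproducibility
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_norm = torch.full((shape[0], 1), t / T)
        eps = model(x, t_norm)  # predicted noise
        x = (x - (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        x = x + torch.sqrt(betas[t]) * z
    return x
```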
Last but not least, we can condition our UNet on classes by adding 2 more FCBlocks that take in class-conditioning vectors c, applying dropout with probability 10%, in which case we set c to 0 so the model also learns an unconditional estimate. We follow the algorithms from the paper to implement training and sampling. Below is our log loss from training.
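A minimal sketch of the class-conditioning dropout; the one-hot encoding and masking shown are assumptions about implementation details.

```python
import torch
import torch.nn.functional as F

def make_class_cond(labels, num_classes=10, p_uncond=0.1):
    # One-hot class vectors, zeroed out with probability 10% so the
    # model also learns an unconditional noise estimate (needed for CFG).
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1) > p_uncond).float()
    return c * keep
```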
We use CFG with scale $\gamma = 5.0$ for sampling, combining the class-conditional and unconditional estimates as in 5A.
These samples are much clearer and more refined than the time-conditioned ones: the strokes are thicker and the digits more distinguishable. There are some stray marks, but they do not affect the overall appearance of the digits.
It was nice to dive into the deep learning behind this project and to see how we can iteratively make improvements to our model and algorithms (and understand the rationale behind them) to get better results.