Project 5 - Fun With Diffusion Models!

Part A - The Power of Diffusion Models!

By Ethan Chen

Introduction

This project aims to implement and deploy diffusion models to generate images.

I use random seed 42 throughout Part 5A. Using the three provided text prompts, the model produces the images below. Overall, the quality of each image is high, with fine detail, vibrant colors, sharp contrast, and little blurriness. A larger num_inference_steps gave more realistic results with brighter colors, especially for "a man wearing a hat". A smaller num_inference_steps gave fewer details (e.g. "a rocket ship" at 10 steps) and slightly more blurriness (e.g. "an oil painting of a snowy mountain village" at 10 steps), although "a man wearing a hat" at 10 steps was still a reasonably realistic image. There is some variance, so the trend with num_inference_steps is not always monotonic.

num_inference_steps=10

stage1_stage2_steps_10.png

num_inference_steps=20

stage1_stage2_steps_20.png

num_inference_steps=40

stage1_stage2_steps_40.png

Part 1: Sampling Loops

Part 1.1: Implementing the Forward Process

We start by implementing the forward function to add noise to a clean image.
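A minimal sketch of this forward (noising) function in PyTorch, assuming alphas_cumprod is the precomputed tensor of cumulative alpha products from the model's noise scheduler (the names here are illustrative):

    import torch

    def forward(im, t, alphas_cumprod):
        # im: clean image tensor; t: integer timestep; alphas_cumprod: 1-D tensor of
        # cumulative products of the schedule alphas (assumed to come from the scheduler).
        alpha_bar_t = alphas_cumprod[t]
        eps = torch.randn_like(im)   # epsilon ~ N(0, I)
        return torch.sqrt(alpha_bar_t) * im + torch.sqrt(1 - alpha_bar_t) * eps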

Berkeley Campanile

campanile_modified.jpg
noisy_campanile_t_250.png
noisy_campanile_t_500.png
noisy_campanile_t_750.png

Part 1.2: Classical Denoising

Now, we can use Gaussian blur filtering to attempt to remove the noise. As we can see, this method does not work well: the Campanile is no longer clearly distinguishable in the blurred images.
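For reference, a sketch of this classical baseline using torchvision's Gaussian blur; the kernel size and sigma below are illustrative choices, not necessarily the exact values used for the results:

    import torchvision.transforms.functional as TF

    def blur_denoise(noisy_im, kernel_size=5, sigma=2.0):
        # Classical "denoising": low-pass filter the noisy image. This suppresses
        # high-frequency noise but also blurs away the image's real detail.
        return TF.gaussian_blur(noisy_im, kernel_size=kernel_size, sigma=sigma)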

noisy_and_blurred_campanile_t_250.png
noisy_and_blurred_campanile_t_500.png
noisy_and_blurred_campanile_t_750.png

Part 1.3: One-Step Denoising

Using stage1.unet and the embeddings for the prompt "a high quality photo", we can estimate the noise in a noisy image and try to denoise in one step. The noise was added according to the forward-process equation below; solving it for $x_0$, with the UNet's noise estimate in place of $\epsilon$, removes the estimated noise.

$ x_t = \sqrt{\bar{\alpha}_t}x_0 + \sqrt{1-\bar{\alpha}_t}\epsilon \quad \text{where } \epsilon \sim N(0,1) $
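A minimal sketch of that rearrangement, assuming noise_est is the UNet's noise prediction at timestep t:

    import torch

    def one_step_denoise(x_t, noise_est, t, alphas_cumprod):
        # Rearranging x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps for x_0,
        # with the UNet's noise estimate standing in for the true eps.
        alpha_bar_t = alphas_cumprod[t]
        return (x_t - torch.sqrt(1 - alpha_bar_t) * noise_est) / torch.sqrt(alpha_bar_t)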

noisy_and_denoised_campanile_t_250.png
noisy_and_denoised_campanile_t_500.png
noisy_and_denoised_campanile_t_750.png

Part 1.4: Iterative Denoising

To improve on the one-step denoising result, we can denoise iteratively. Rather than stepping through all 1000 timesteps, we skip along the schedule with a stride of 30, starting at i_start = 10 and following the update equation below.

$ x_{t^\prime} = \frac{\sqrt{\bar{\alpha}_{t^\prime}}\beta_t}{1-\bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t^\prime})}{1-\bar{\alpha}_t}x_t + v_\sigma $
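One update of this rule as a sketch, assuming x0_est is the current clean-image estimate, v_sigma is the added variance term, and alpha_t, beta_t are defined relative to the strided schedule as noted in the comments:

    import torch

    def denoise_step(x_t, x0_est, t, t_prime, alphas_cumprod, v_sigma):
        # One strided update from timestep t to t' (t' < t). Here alpha_t and beta_t
        # are assumed to be defined relative to the strided schedule:
        # alpha_t = a_bar_t / a_bar_t' and beta_t = 1 - alpha_t.
        a_bar_t, a_bar_tp = alphas_cumprod[t], alphas_cumprod[t_prime]
        alpha_t = a_bar_t / a_bar_tp
        beta_t = 1 - alpha_t
        return ((torch.sqrt(a_bar_tp) * beta_t / (1 - a_bar_t)) * x0_est
                + (torch.sqrt(alpha_t) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t
                + v_sigma)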

noisy_campanile_t_690.png
noisy_campanile_t_540.png
noisy_campanile_t_390.png
noisy_campanile_t_240.png
noisy_campanile_t_90.png

Berkeley Campanile

campanile_modified.jpg
iterative_denoised_campanile.png
one_step_denoised_campanile.png
gaussian_blurred_campanile.png

Part 1.5: Diffusion Model Sampling

Now, we will generate images from scratch (pure noise) by setting i_start = 0. The image quality isn't great, which we will fix in the next part.

5_sampled_images_part1.5.png

Part 1.6: Classifier-Free Guidance (CFG)

To apply CFG, we follow the equation below.

$ \epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u) $

where $\epsilon$ is our combined noise estimate, $\epsilon_c$ is our conditional noise estimate, and $\epsilon_u$ is our unconditional noise estimate. We use CFG scale $\gamma = 7$ to generate the images below.
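As a sketch, the combination is a single line; eps_c and eps_u are assumed to be the UNet outputs for the conditional and unconditional (empty-prompt) embeddings:

    def cfg_noise_estimate(eps_c, eps_u, gamma=7.0):
        # gamma > 1 pushes the estimate past the conditional one,
        # trading sample diversity for fidelity to the prompt.
        return eps_u + gamma * (eps_c - eps_u)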

5_sampled_images_cfg_part1.6.png

We can see that our images are a bit clearer, with finer details and fewer blurry regions.

Part 1.7: Image-to-image Translation

Now, we can use the SDEdit algorithm: add a little noise to the original image and then run the iterative denoiser to force it back onto the natural image manifold, without conditioning on the original image. Observing the sequence of edits below, the image at i_start = 20 has qualities of both the prompt we pass in ("a high quality photo") and the original image.
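A sketch of the SDEdit loop over starting indices; forward is the noising function from Part 1.1, while strided_timesteps, iterative_denoise_cfg, test_im, and show are hypothetical names standing in for the Part 1.4/1.6 machinery:

    # strided_timesteps: the strided schedule from Part 1.4 (larger index = less noise added)
    for i_start in [1, 3, 5, 7, 10, 20]:                      # illustrative noise levels
        t = strided_timesteps[i_start]
        x_t = forward(test_im, t)                             # noise the original image a little...
        edit = iterative_denoise_cfg(x_t, i_start=i_start)    # ...then project back onto the manifold
        show(edit)                                            # hypothetical display helper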

campanile_1.7.1.png

Berkeley Campanile

campanile_modified.jpg

I added two noise levels, 25 and 30, which generate better results (closer to the corresponding original images) than smaller noise levels, as expected. Since the hand-drawn images in Part 1.7.1 below are much simpler than the images I got from the web, their diffusion sequences converge to the drawn image much more quickly and sharply.

hoover_tower_1.7.1.png

Hoover Tower

hoover_tower_modified.jpg
green_building_1.7.1.png

Green Building

green_building_modified.jpg

Part 1.7.1: Editing Hand-Drawn and Web Images

For the hand-drawn images, I used the starter code in the Google Colab notebook to draw and process them. Note that the original Hoover Tower and Green Building images are blurry because the code resizes them to 64x64.

chair_1.7.1.png

Chair

chair_modified.png
lamp_1.7.1.png

Lamp

lamp_modified.png

Part 1.7.2: Inpainting

Now, we can inpaint part of an image using a binary mask: we run the diffusion loop as usual, but after every denoising step we force the pixels outside the mask back to a noised copy of the original image. Below are 3 examples.
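The core of the method is one extra line per denoising step. A sketch, where mask, x_t, and noisy_original (the original image noised to the current timestep with the Part 1.1 forward function) are assumed inputs:

    def inpaint_step(x_t, noisy_original, mask):
        # mask == 1 where new content is generated; elsewhere the pixels are forced
        # back to a noised copy of the original image at the same timestep.
        return mask * x_t + (1 - mask) * noisy_original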

Original

campanile_inpainting_original.jpg

Mask

campanile_inpainting_mask.jpg

Hole to Fill

campanile_inpainting_to_replace.jpg
campanile_inpainting_inpainted.png

For the Hoover Tower and plain field images, I used the same inpainting settings as for the Campanile. The inpainted plain field image looks the best and coolest.

Original

hoover_tower_inpainting_original.jpg

Mask

hoover_tower_inpainting_mask.jpg

Hole to Fill

hoover_tower_inpainting_to_replace.jpg
hoover_tower_inpainting_inpainted.png

Original

plain_inpainting_original.jpg

Mask

plain_inpainting_mask.jpg

Hole to Fill

plain_inpainting_to_replace.jpg
plain_inpainting_inpainted.png

Since the Green Building is not exactly front-facing, the mask was slightly off-center, but the result still looks acceptable.

Part 1.7.3: Text-Conditional Image-to-image Translation

Here, we condition the image-to-image translation on a text prompt, so the result retains qualities of both the prompt and the original image.

campanile_1.7.3.png

Using prompt "a photo of a man", we get the result below. We can see that the image from noise level 20 gives us a man with a different face.

daniel_craig_1.7.3.png

Using prompt "a photo of a dog", we get the result below. The image from noise level 20 has some traces of a dog in the cat's face but retains most of the qualities that make up the cat image.

cat_1.7.3.png

Part 1.8: Visual Anagrams

In this section, we will use our UNet to get two noise estimates, one per prompt (with the image flipped for the second), and average them; a sketch of this combination is shown below. This process gives us visual anagrams. The prompts for the pairs of images are as follows:

  1. "an oil painting of an old man" and "an oil painting of people around a campfire"
  2. "a photo of a hipster barista" and "a man wearing a hat"
  3. "a rocket ship" and "a photo of a dog"

All images are recognizable both right-side up and upside down.
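A sketch of the per-step noise-estimate combination referenced above, assuming unet(x, t, emb) returns the predicted noise for prompt embedding emb (CFG is omitted for brevity, and the flip here is along the height axis):

    import torch

    def anagram_noise_estimate(unet, x_t, t, emb_1, emb_2):
        # Noise estimate for prompt 1 on the image as-is.
        eps_1 = unet(x_t, t, emb_1)
        # Noise estimate for prompt 2 on the flipped image,
        # flipped back afterwards so both estimates share the same orientation.
        eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, emb_2), dims=[-2])
        # Denoise with the average of the two estimates.
        return (eps_1 + eps_2) / 2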

visual_anagram_rightside_up1.png
visual_anagram_upside_down1.png
visual_anagram_rightside_up2.png
visual_anagram_upside_down2.png
visual_anagram_rightside_up3.png
visual_anagram_upside_down3.png

Part 1.9: Hybrid Images

Similar to visual anagrams, we can apply low-pass and high-pass filters to the two noise estimates to create a hybrid image that looks like one prompt from up close and like another prompt from far away; a sketch of this frequency blending follows the results below. The prompts for the pairs of images are as follows:

  1. "a lithograph of a skull" and "a lithograph of waterfalls"
  2. "a rocket ship" and "a pencil"
  3. "a man wearing a hat" and "a lithograph of a skull"
hybrid_image1.png
hybrid_image2.png
hybrid_image3.png
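Referring back to the description above, here is a sketch of the per-step noise combination; the unet call signature, embedding names, and the Gaussian kernel size and sigma are all illustrative assumptions:

    import torchvision.transforms.functional as TF

    def hybrid_noise_estimate(unet, x_t, t, emb_far, emb_near, kernel_size=33, sigma=2.0):
        # Low-pass the estimate for the "far away" prompt, high-pass the one for
        # the "up close" prompt, and combine the two before denoising as usual.
        eps_far = unet(x_t, t, emb_far)
        eps_near = unet(x_t, t, emb_near)
        low_pass = TF.gaussian_blur(eps_far, kernel_size=kernel_size, sigma=sigma)
        high_pass = eps_near - TF.gaussian_blur(eps_near, kernel_size=kernel_size, sigma=sigma)
        return low_pass + high_pass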

Part B: Diffusion Models from Scratch!

Part 1: Training a Single-Step Denoising UNet

We design a UNet architecture consisting of ConvBlocks, DownBlocks, and UpBlocks and train it as a denoiser on the MNIST dataset. Below are visualizations of adding varying amounts of noise to a sample of MNIST images.
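The noising used for these visualizations (and later for the training pairs), as a one-line sketch:

    import torch

    def add_noise(x, sigma):
        # x: clean MNIST batch in [0, 1]; sigma: noise level (e.g. 0.0 through 1.0)
        return x + sigma * torch.randn_like(x)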

single_step_ablations.png

We use a batch size of 256 and a hidden dimension of 128, and train on the training set for 5 epochs. The training loss, plotted on a log scale below, decreases steadily.
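A sketch of that training setup; the learning rate, training noise level sigma, and the unet and device variables are assumptions, since only the batch size, hidden dimension, and epoch count are stated above:

    import torch
    import torch.nn.functional as F
    from torch.utils.data import DataLoader
    from torchvision import datasets, transforms

    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_set = datasets.MNIST(root="./data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=256, shuffle=True)

    # unet: the ConvBlock/DownBlock/UpBlock denoiser with hidden dimension 128,
    # assumed to be defined elsewhere and moved to `device`.
    optimizer = torch.optim.Adam(unet.parameters(), lr=1e-4)   # assumed learning rate
    sigma = 0.5                                                # assumed training noise level

    for epoch in range(5):
        for x, _ in loader:                        # labels are unused by the plain denoiser
            x = x.to(device)
            z = x + sigma * torch.randn_like(x)    # noisy input
            loss = F.mse_loss(unet(z), x)          # L2 loss against the clean image
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()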

single_step_log_loss.png

Below are the results of inference on the training dataset. After 1 epoch of training, the model cannot produce backgrounds as uniformly black as in the original images. After 5 epochs, it does much better, with only a slight amount of blurriness around the white digits.

single_step_1_epoch_inference.png
single_step_5_epoch_inference.png

These are the results on the test set when we vary the $\sigma$ values used for noising. The denoised images for $\sigma = 0.8$ and $\sigma = 1.0$ are worse than those for smaller $\sigma$ values, but the digits are still distinguishable.

single_step_test_inference.png

Part 2: Training a Diffusion Model

Now, we can condition on timestep and class to train a stronger UNet.

Part 2.1-2.3: Adding Time Conditioning to UNet + Training + Sampling

We follow the forward-process equation from Part 1.3 of 5A. Using a DDPM schedule with a smaller T (300 instead of 1000), we can get better results. We add two fully-connected blocks (FCBlocks) to our UNet; they take the timestep as input and modulate features feeding into the UpBlocks of the architecture. Below is our training loss, which steadily decreases, as in the previous part of 5B.
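A minimal sketch of one way to wire this in; the FCBlock composition (Linear, GELU, Linear) and the additive injection into the decoder features are assumptions, and fc1_t, fc2_t, unflat, up1 are illustrative names:

    import torch
    import torch.nn as nn

    class FCBlock(nn.Sequential):
        # A small MLP mapping the (normalized) timestep to a feature-modulation vector.
        def __init__(self, in_dim, out_dim):
            super().__init__(nn.Linear(in_dim, out_dim), nn.GELU(), nn.Linear(out_dim, out_dim))

    # Inside the UNet's forward pass (sketch; D is the hidden dimension):
    #   t_norm = t / T                              # timestep normalized to [0, 1], shape (B, 1)
    #   t1 = self.fc1_t(t_norm)[:, :, None, None]   # (B, D, 1, 1)
    #   t2 = self.fc2_t(t_norm)[:, :, None, None]
    #   unflat = unflat + t1                        # modulate the unflattened bottleneck features
    #   up1 = up1 + t2                              # and the input to the first UpBlock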

tc_log_loss.png

Now, we can call ddpm_sample, which we implement by following the algorithm from the paper. Note that we set the random seed in each iteration of the reverse diffusion loop in order to maintain reproducibility.
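A sketch of such a sampler, following the standard DDPM ancestral-sampling update; the unet call signature, the schedule tensors, and the per-iteration seeding scheme are assumptions:

    import torch

    @torch.no_grad()
    def ddpm_sample(unet, betas, alphas, alphas_cumprod, T=300,
                    shape=(16, 1, 28, 28), seed=42, device="cpu"):
        # betas / alphas / alphas_cumprod: length-T schedule tensors, indexed 0..T-1.
        x = torch.randn(shape, device=device)
        for t in range(T - 1, -1, -1):
            torch.manual_seed(seed + t)            # reseed each iteration for reproducibility
            t_norm = torch.full((shape[0], 1), t / T, device=device)
            eps = unet(x, t_norm)                  # predicted noise at this timestep
            z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
            x = (x - (1 - alphas[t]) / torch.sqrt(1 - alphas_cumprod[t]) * eps) / torch.sqrt(alphas[t]) \
                + torch.sqrt(betas[t]) * z
        return x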

Sample after 5 epochs of training

tc_5_epochs_sample.png

Sample after 20 epochs of training

tc_20_epochs_sample.png

Part 2.4-2.5: Adding Class-Conditioning to UNet + Training + Sampling

Last but not least, we can condition on class by adding 2 more FCBlocks that take in a class-conditioning vector c. During training we apply dropout with probability 10%, setting c to 0 so the model still learns an unconditional noise estimate. We follow the algorithms from the paper to implement training and sampling. Below is our log loss from training.
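A sketch of building the conditioning vector with the 10% unconditional dropout; the one-hot encoding and variable names are assumptions:

    import torch
    import torch.nn.functional as F

    def make_class_vector(labels, num_classes=10, p_uncond=0.1):
        # One-hot encode the digit classes, then zero out roughly 10% of them so the
        # model also learns an unconditional noise estimate (needed for CFG at sampling time).
        c = F.one_hot(labels, num_classes).float()
        keep = (torch.rand(labels.shape[0], 1) > p_uncond).float()
        return c * keep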

cc_log_loss.png

We use $\gamma = 5.0$ in CFG for sampling.

Sample after 5 epochs of training

cc_5_epochs_sample.png

Sample after 20 epochs of training

cc_20_epochs_sample.png

These samples are much clearer and more refined than the time-conditioned ones: the strokes are thicker and the digits more distinguishable. There are some stray marks, but they do not detract from the overall digits.

What I learned

It was nice to dive into the deep learning behind this project and see how we can iteratively improve our model and algorithms, and understand the rationale for each change, to get better results.