This project aims to implement and deploy diffusion models to generate images.
I use random seed 42 for all of Part 5A. Using the 3 text prompts provided, we get the image outputs from the model below. Overall, the quality of each image is very high, with fine details, vibrant colors, sharp contrast, and little blurriness. Using a larger num_inference_steps gave me more realistic results with brighter colors, especially for "a man wearing a hat". Using a smaller num_inference_steps gave me fewer details, as with "a rocket ship" at 10 steps, and slightly more blurriness, as with "an oil painting of a snowy mountain village" at 10 steps, even though "a man wearing a hat" at 10 steps was still a decently realistic image. So there is some variance, and the trend with respect to num_inference_steps is not always direct.
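For reference, here is a minimal sketch of how this seeded sampling might look with a diffusers-style pipeline; stage_1 and its call signature are assumptions for illustration, not the exact notebook code.

```python
import torch

prompts = ["an oil painting of a snowy mountain village",
           "a man wearing a hat",
           "a rocket ship"]

for steps in (10, 20, 40):
    for prompt in prompts:
        # Re-seed per sample so each image is reproducible on its own.
        generator = torch.Generator().manual_seed(42)
        # `stage_1` is assumed to be a loaded DeepFloyd IF stage-1 pipeline.
        image = stage_1(prompt, num_inference_steps=steps,
                        generator=generator).images[0]
```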
(Figures: generated images at num_inference_steps = 10, 20, and 40.)
We start by implementing the forward function to add noise to a clean image.
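Concretely, the forward process is $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1 - \bar\alpha_t}\,\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$. A minimal sketch, assuming alphas_cumprod holds the schedule's cumulative products $\bar\alpha_t$:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Add noise: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps."""
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)
    return torch.sqrt(a_bar) * im + torch.sqrt(1.0 - a_bar) * eps
```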
Now, we can use Gaussian blur filtering to attempt to remove the noise. As we can see, this method does not really work, since we can no longer clearly distinguish the Campanile in the image.
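The classical baseline looks something like the following; the kernel size and sigma here are illustrative choices, not the exact values used.

```python
import torchvision.transforms.functional as TF

# Try to undo additive noise with a Gaussian low-pass filter.
blurred = TF.gaussian_blur(im_noisy, kernel_size=5, sigma=2.0)
```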
Using stage1.unet and the embeddings for the prompt "a high quality photo", we can try to denoise in one step. To remove the estimated noise from the image, we can follow the equation below.
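$$\hat{x}_0 = \frac{x_t - \sqrt{1 - \bar\alpha_t}\,\hat\epsilon}{\sqrt{\bar\alpha_t}}$$

where $\hat\epsilon$ is the noise estimated by the UNet.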
To improve on the one-step denoising, we can denoise iteratively. Rather than stepping through all 1000 timesteps, we skip with strides; here we use a stride of 30. We follow the equation below, starting at i_start = 10.
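$$x_{t'} = \frac{\sqrt{\bar\alpha_{t'}}\,\beta_t}{1 - \bar\alpha_t}\,\hat{x}_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar\alpha_{t'})}{1 - \bar\alpha_t}\,x_t + v_\sigma$$

where $t' < t$ is the next (less noisy) timestep in the stride, $\alpha_t = \bar\alpha_t / \bar\alpha_{t'}$, $\beta_t = 1 - \alpha_t$, $\hat{x}_0$ is the current clean-image estimate, and $v_\sigma$ is random noise scaled by the predicted variance.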
Now, we will generate images from scratch (pure noise) by setting i_start = 0. The image quality isn't great, which we will fix in the next part.
To apply CFG, we follow the equation below.
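$$\epsilon = \epsilon_u + \gamma\,(\epsilon_c - \epsilon_u)$$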
where $\epsilon$ is the combined noise estimate, $\epsilon_c$ is the conditional noise estimate, and $\epsilon_u$ is the unconditional noise estimate. We use CFG scale $\gamma = 7$ to generate the images below.
We can see that our images are a bit clearer, with finer details and fewer regions of blurriness.
Now, we can use the SDEdit algorithm to noise the original image a little and force it back onto the natural image manifold without conditioning. We can observe the sequence of edits. In the images below, we can see that the image with i_start = 20 has qualities of both the prompt we pass in ("a high quality photo") and the original image.
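A minimal sketch of the SDEdit step, reusing the forward function above; iterative_denoise, strided_timesteps, and prompt_embeds are assumed helper names, not necessarily the notebook's exact ones.

```python
def sdedit(im, i_start, strided_timesteps, prompt_embeds):
    # Noise the original image up to the chosen starting timestep...
    t = strided_timesteps[i_start]
    im_noisy = forward(im, t, alphas_cumprod)
    # ...then force it back to the image manifold with the usual
    # iterative denoising loop (assumed helper from the previous part).
    return iterative_denoise(im_noisy, i_start, strided_timesteps, prompt_embeds)
```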
I added two noise levels, 25 and 30, which generate better results (closer to the corresponding original images) than smaller noise levels, as expected. Since these drawings are much simpler than the images I found online, the sequence of diffusion outputs below converges to the drawn image much more quickly and sharply.
For the hand-drawn images, I used the starter code in the Google Colab notebook to draw and process them. Note that the original Hoover Tower and Green Building images are blurry because the code resizes them to 64x64.
Now, we can inpaint part of the image using a binary mask $m$: at each denoising step we keep the diffusion sample where $m = 1$ and force the region where $m = 0$ back to the (appropriately noised) original image, as in the update below. Three examples follow.
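$$x_t \leftarrow m \odot x_t + (1 - m) \odot \text{forward}(x_{\text{orig}}, t)$$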
For the Hoover Tower and plain-field images, I used the same parameters for inpainting. The inpainted plain-field image looks the best and coolest. Since the Green Building is not exactly front-facing, the mask was slightly off-center, but the result still looks acceptable.
Here, we can use prompts to condition the image so that it retains qualities from both the prompt and the original image.
Using the prompt "a photo of a man", we get the result below. We can see that the image at noise level 20 gives us a man with a different face.
Using the prompt "a photo of a dog", we get the result below. The image at noise level 20 has some traces of a dog in the cat's face but retains most of the qualities that make up the cat image.
In this section, we will use our UNet to get two noise estimates, one for the upright image and one for a flipped copy, and average them. This process gives us visual anagrams: images that match one prompt right side up and another upside down. The prompts for the pairs of images are shown with the results below.
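Concretely, with flip denoting a vertical flip of the image and $p_1$, $p_2$ the two prompt embeddings:

$$\epsilon_1 = \text{UNet}(x_t, t, p_1), \qquad \epsilon_2 = \text{flip}\big(\text{UNet}(\text{flip}(x_t), t, p_2)\big), \qquad \epsilon = \frac{\epsilon_1 + \epsilon_2}{2}$$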
All images are recognizable both right side up and upside down.
Similar to visual anagrams, we can use low-pass and high-pass filters to create a hybrid image that looks like one prompt up close and like another from far away. The prompts for the pairs of images are shown with the results below.
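Concretely, with $f_{\text{lowpass}}$ a Gaussian blur, $f_{\text{highpass}}$ its complement, and $p_1$ the prompt meant to be seen from far away:

$$\epsilon = f_{\text{lowpass}}\big(\text{UNet}(x_t, t, p_1)\big) + f_{\text{highpass}}\big(\text{UNet}(x_t, t, p_2)\big)$$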
We can design a UNet architecture consisting of ConvBlocks, DownBlocks, and UpBlocks and train it as a denoiser on the MNIST dataset. Below is a visualization of simply adding noise to a sample of MNIST images at increasing noise levels $\sigma$.
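The noising here is plain additive Gaussian noise; a minimal sketch:

```python
import torch

# z = x + sigma * eps, eps ~ N(0, I): the noising used for the denoiser.
def add_noise(x, sigma):
    return x + sigma * torch.randn_like(x)
```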
We will use a batch size of 256 and a hidden dimension of 128 to train over the training dataset for 5 epochs. Our loss curve (plotted on a log scale) decreases steadily, as shown below.
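A hedged sketch of the training loop; the Denoiser constructor, optimizer, learning rate, and fixed training noise level are assumptions, not necessarily the exact settings used.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

model = Denoiser(hidden_dim=128)                        # hypothetical constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)     # assumed optimizer/lr
loader = DataLoader(train_set, batch_size=256, shuffle=True)  # MNIST train set

for epoch in range(5):
    for x, _ in loader:                  # labels are unused by the denoiser
        z = add_noise(x, sigma=0.5)      # assumed fixed training sigma
        loss = F.mse_loss(model(z), x)   # L2 denoising objective
        opt.zero_grad()
        loss.backward()
        opt.step()
```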
Below are the results of inference on the training dataset. We can see that after 1 epoch of training, the model can't produce as black a background as the original images. After 5 epochs, the model does much better, with only a slight amount of blurriness around the white digits.
These are the results on the test set when we vary the $\sigma$ values for noising. The denoised images for $\sigma = 0.8$ and $\sigma = 1.0$ are worse than those for smaller $\sigma$ values, but the digits are still distinguishable.
Now, we can take advantage of time and class conditioning to train a stronger UNet.
We will follow the equation from Part 1.3 of 5A. With a DDPM schedule, we can use a smaller $T$ (300 vs. 1000) and get better results. We add 2 fully-connected blocks (FCBlocks) to our UNet, which take the timestep as input and inject it into the UpBlocks of the architecture. Below is our training loss, which steadily decreases, like in the previous part of 5B.
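A hedged sketch of what such an FCBlock might look like; the layer sizes and activation are assumptions.

```python
import torch.nn as nn

class FCBlock(nn.Module):
    """Maps the (normalized) timestep to a conditioning vector."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_ch, out_ch),
            nn.GELU(),
            nn.Linear(out_ch, out_ch),
        )

    def forward(self, t):
        return self.net(t)
```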
Now, we can call ddpm_sample, which we implement by following the sampling algorithm from the DDPM paper. Note that we set the random seed in each iteration of the reverse diffusion loop in order to maintain reproducibility.
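A hedged sketch of how ddpm_sample might look; the per-step re-seeding mirrors the reproducibility note above, and the normalized time input model(x, t/T) is an assumption about the UNet's interface.

```python
import torch

@torch.no_grad()
def ddpm_sample(model, betas, T=300, shape=(16, 1, 28, 28), seed=0):
    # Sketch of Algorithm 2 from Ho et al. (2020).
    alphas = 1.0 - betas
    alphas_bar = torch.cumprod(alphas, dim=0)

    torch.manual_seed(seed)
    x = torch.randn(shape)  # start from pure noise
    for t in range(T - 1, -1, -1):
        torch.manual_seed(seed + t)  # re-seed each iteration for reproducibility
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_norm = torch.full((shape[0], 1), t / T)
        eps = model(x, t_norm)  # predicted noise
        x = (x - (1.0 - alphas[t]) / torch.sqrt(1.0 - alphas_bar[t]) * eps) \
            / torch.sqrt(alphas[t])
        x = x + torch.sqrt(betas[t]) * z
    return x
```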
Last but not least, we can condition our UNet on classes by adding 2 more FCBlocks that take in class-conditioning vectors c, applying dropout with probability 10%, in which case we set c to 0 so the model also learns an unconditional estimate. We follow the algorithms from the paper to implement training and sampling. Below is our log loss from training.
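A minimal sketch of the class-conditioning dropout; the one-hot encoding and masking shown are assumptions about implementation details.

```python
import torch
import torch.nn.functional as F

def make_class_cond(labels, num_classes=10, p_uncond=0.1):
    # One-hot class vectors, zeroed out with probability 10% so the
    # model also learns an unconditional noise estimate (needed for CFG).
    c = F.one_hot(labels, num_classes).float()
    keep = (torch.rand(labels.shape[0], 1) > p_uncond).float()
    return c * keep
```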
We use CFG with scale $\gamma = 5.0$ for sampling, combining the class-conditional and unconditional estimates as in 5A.
These samples are much clearer and more refined than the time-conditioned ones: the strokes are thicker and the digits more distinguishable. There are some stray marks, but they do not affect the overall appearance of the digits.
It was nice to dive into the deep learning behind this project and to see how we can iteratively make improvements to our model and algorithms (and understand the rationale behind them) to get better results.