CS180 Project 5A

This is the website for CS180 Project 5, where we learned a lot more about diffusion models and image generation. I thought the coolest part of the project was building the time- and class-conditioned UNet and finding that the class-conditioned UNet does a really good job of generating images that look like the digits in the MNIST dataset.

Downloading Precomputed Text Embeddings

20 inference steps.
5 inference steps.

To start, we wanted to learn a little more about diffusion models, so we looked at some precomputed text embeddings and the sample images generated from them. The three captions given were "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship". The first three images use 20 inference steps and the next three use 5 inference steps. With 20 inference steps, all the images basically match the text prompts and the quality seems pretty good, except that the colors are pretty off; the rocket ship, for example, looks more like a caricature than an actual depiction of a rocket ship. With 5 inference steps the quality got a lot worse, again especially for the rocket ship, which is now broken apart into pieces with some strange things on top of it. This supports the hypothesis that, at least at the beginning, more inference steps can improve the quality of the outputs. The random seed I am using is 2988.

Implementing the Forward Process

The original image, noisy image at t = 250, noisy image at t = 500, noisy image at t = 750

For the forward process, I followed the two equations given, specifically equation A.2 from the project spec: you compute a noisy image x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon, where epsilon is random Gaussian noise. This produced the test image above at noise levels [250, 500, 750]; as you can clearly see, as the noise level increases the image gets noisier and noisier, since we inject more random noise.
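To make this concrete, here is a minimal sketch of the forward process as I implemented it, assuming alphas_cumprod is the precomputed tensor of alpha-bar values from the scheduler (the variable names are just for illustration):

import torch

def forward(im, t, alphas_cumprod):
    # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * epsilon
    a_bar = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    return torch.sqrt(a_bar) * im + torch.sqrt(1 - a_bar) * eps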

Classical Denoising

The noisy image followed by the Gaussian blurred noisy image with a 5x5 kernel at t = 250, t = 500, t = 750. The results are not great, and especially at t = 750 not much of the original image really appears to be reconstructed.

Next I tried to use the denoising tactics that we have used in previous projects, which is applying a Gaussian blur to the image with torchvision.transforms.functional.gaussian_blur to try to remove the noise. For this I used a 5x5 kernel. Above are the noisy images and corresponding Gaussian-denoised versions. This did not do so well, so it felt necessary to try another method of denoising.
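For reference, this classical baseline is just a few lines; here is a sketch using the same 5x5 kernel (the sigma value here is an assumption):

import torchvision.transforms.functional as TF

def blur_denoise(im_noisy, sigma=2.0):
    # classical "denoising": blur away the high-frequency noise with a 5x5 Gaussian kernel
    return TF.gaussian_blur(im_noisy, kernel_size=[5, 5], sigma=[sigma, sigma])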

One-Step Denoising

Top is the original image; then, in sequence, the noisy images and their one-step denoised versions. Note that the outputs at later timesteps have less noise around them, but don't necessarily look that much like the original image; for example, the shapes and colors of the building are a little off.

Now, we do one-step denoising. For this piece I used a UNet conditioned on the amount of Gaussian noise to first estimate the noise, then remove that noise to get an estimate of the original image. Note that when removing the noise, I had to reverse the process by which the noise was applied, dealing with the alpha_bar_t scalars and doing the "backwards" of what we do to add the noise: (im_noisy - sqrt(1 - alpha_bar_t) * noise_est) / sqrt(alpha_bar_t). This produced images that look slightly better than the Gaussian blur, but still not amazing, as described in the caption above. Thus, we turn to iterative denoising.
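Here is a rough sketch of that one-step denoise, treating unet as a generic noise predictor (the real call goes through the DeepFloyd stage-1 UNet with the "a high quality photo" embedding, so the exact call signature here is an assumption):

import torch

def one_step_denoise(im_noisy, t, unet, prompt_embeds, alphas_cumprod):
    with torch.no_grad():
        noise_est = unet(im_noisy, t, prompt_embeds)  # estimate the injected noise
    a_bar = alphas_cumprod[t]
    # invert the forward equation: x_0 = (x_t - sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_bar_t)
    return (im_noisy - torch.sqrt(1 - a_bar) * noise_est) / torch.sqrt(a_bar)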

Iterative Denoising

The image through iterative denoising, showing the result every 5 timesteps and, at the end, the final result of the iterative denoising.
The same image that we ran iterative denoising on, but denoised with one-step denoising (top) and Gaussian blur (bottom). We see that iterative denoising clearly does a lot better.

Now, we do iterative denoising following equations 6 and 7 of the DDPM paper. The key idea is that we keep improving the image over multiple timesteps: at each step we get a noise estimate and remove part of it, so over time we can better estimate the total noise, roughly as the sum of these per-step estimates. The iterative denoising result is displayed above; since we display a result every 5 timesteps, you can watch the image get better and better as more and more noise is removed. We can contrast these results with the second set of images above, which shows the same image denoised with one-step denoising and Gaussian blur, to see that iterative denoising does better.
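Here is a sketch of a single update between two strided timesteps t (noisier) and t_prime (less noisy), following the posterior mean from DDPM equations 6 and 7; I leave out the added variance term v_sigma for brevity:

import torch

def denoise_step(x_t, x0_est, t, t_prime, alphas_cumprod):
    a_bar_t = alphas_cumprod[t]
    a_bar_tp = alphas_cumprod[t_prime]
    alpha = a_bar_t / a_bar_tp  # effective alpha between the two strided timesteps
    beta = 1 - alpha
    # weighted combination of the current clean-image estimate and the current noisy image
    return (torch.sqrt(a_bar_tp) * beta / (1 - a_bar_t)) * x0_est \
         + (torch.sqrt(alpha) * (1 - a_bar_tp) / (1 - a_bar_t)) * x_t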

Diffusion Model Sampling

The five generated images using iterative_denoise on pure noise.

Next we do Diffusion Model Sampling. Here, we use the iterative_denoise function with i_start = 0 and pass in completely random noise in order to generate an image from scratch: denoising pure noise produces an image that wasn't already there. Above are our five images generated by using iterative_denoise to denoise pure noise.

Classifier-Free Guidance

The five images generated using CFG with gamma = 7. Notice that these are much higher quality than the previous images generated.

Next we want to improve upon these generated images a little, so we do Classifier-Free Guidance (CFG). This computes a conditional and an unconditional noise estimate and then sets the new noise estimate equal to the unconditional estimate + gamma * (conditional estimate - unconditional estimate). For the unconditional noise estimate we just use an empty prompt embedding, whereas for the conditional one we use the "a high quality photo" embedding. The gamma parameter controls how far we push past the unconditional estimate toward the conditional one; using gamma = 7, we notice that with gamma > 1 you get much higher quality images. You can see these images above.
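A sketch of the CFG noise estimate, again treating unet as a generic noise predictor (the call signature is an assumption):

import torch

def cfg_noise_estimate(x_t, t, unet, cond_embeds, uncond_embeds, gamma=7.0):
    with torch.no_grad():
        eps_cond = unet(x_t, t, cond_embeds)      # "a high quality photo" embedding
        eps_uncond = unet(x_t, t, uncond_embeds)  # empty prompt embedding
    # gamma > 1 pushes the estimate past the conditional one, which improves quality
    return eps_uncond + gamma * (eps_cond - eps_uncond)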

Image to Image Translation

Image to image translation with the campanile image using starting indexes [1, 3, 5, 7, 10, 20]. As you can see, as the starting index increases, we make fewer "edits" and get closer to the actual campanile image.
Image to image translation with an image of a cat off Wikipedia using starting indexes [1, 3, 5, 7, 10, 20]. This process does slightly worse, but the final images do look closer and closer to a cat. The bottom image is the original image; the rest are at each starting index.
Image to image translation with an image of a random couple off Pinterest (because that's the only pinimg link I could find), using starting indexes [1, 3, 5, 7, 10, 20]. This does a lot better than the cat. The bottom image is the original image; the rest are at each starting index.

Next we do image to image translation. We take an original image, add some noise, and then use the SDEdit algorithm to get back a new image using the text prompt "a high quality photo". This should return an image similar to the test image, as long as the noise is low enough. We do this by running the forward process to get a noisy test image, then running iterative_denoise_cfg with different starting indexes to see the "edits" to the image; the larger the starting index, the less noise we add and the closer the generated image is to the original.
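The whole SDEdit flow then reduces to a couple of lines; this sketch assumes the forward and iterative_denoise_cfg routines described above and the decreasing timestep list strided_timesteps from the project:

def sdedit(im_orig, i_start, strided_timesteps, alphas_cumprod):
    t = strided_timesteps[i_start]  # larger i_start -> smaller t -> less noise added
    x_t = forward(im_orig, t, alphas_cumprod)  # noise the original image up to timestep t
    return iterative_denoise_cfg(x_t, i_start=i_start)  # then denoise it back with CFG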

Editing Hand Drawn and Web Images

The avocado image at each starting index, same process as the above images. Again, we see that at lower starting indexes the result looks less like the original avocado, and as i_start increases we get closer and closer to the original image.
I hand drew an image that looked kind of like an envelope and got these results. The last image looks a lot like what I drew except with an additional smiley face, which is cute.
I hand drew a heart, and the last image looks a lot like it, but I thought it was kind of funny that second to last looks more like a lightbulb.

Now, we do the same thing with one image from the web (an avocado Pinterest image) and two hand-drawn images; these are the results. This shows that these "edits" and image to image translation work not just on real-life photos but also on hand-drawn and web images, which is really cool.

Inpainting

The setup for the campanile inpainting. See the picture, mask, and portion of the image.
The inpainted campanile, see how the top comes from a different image.
The setup for the cat inpainting. See the picture, mask, and portion of the image. I set the mask to be the region of the cat's head.
The cat inpainted, with a different head there.
The setup for the couple inpainting. See the picture, mask, and portion of the image. I set the mask to be the man of the couple.
The new inpainted image; the woman looks like she is leaning against a wall instead of against the man, which is cool.

Next, I did inpainting. Here, we leave everything outside the mask as the original image and generate new content only inside the mask. So at every step, x_t <- m * x_t + (1 - m) * forward(x_orig, t), where m is the mask. This gave the above results, which are pretty cool: we essentially composite a generated region onto part of the original image.
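A sketch of that per-step inpainting constraint, where mask is 1 inside the region to generate and 0 outside (variable names are just for illustration):

import torch

def inpaint_step(x_t, im_orig, mask, t, alphas_cumprod):
    a_bar = alphas_cumprod[t]
    # re-noise the original image to timestep t, i.e. forward(im_orig, t)
    noised_orig = torch.sqrt(a_bar) * im_orig + torch.sqrt(1 - a_bar) * torch.randn_like(im_orig)
    # keep the generated content inside the mask, and the (re-noised) original everywhere else
    return mask * x_t + (1 - mask) * noised_orig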

Text Conditional Image to Image Translation

The same as image to image translation, but with the prompt "a rocket ship" using the campanile image. You can see that it looks like a merge between a rocket ship and the campanile as the starting index increases.
The same prompt of "a rocket ship" but with the pinterest image of a couple.
The same prompt of "a rocket ship" but with the cat image from wikipedia.

Next, I did the same thing as above but with a text prompt, which also conditions the image generation on the text. This just means that instead of the generic "a high quality photo" prompt, you give a text prompt describing what you want the edits to look like. The resulting images are above.

Visual Anagrams

An example of a visual anagram with the top prompt as "an oil painting of an old man" and the bottom as "an oil painting of people around a campfire".
An example of a visual anagram with the top prompt as "a rocket ship" and the bottom as "a pencil". This one turned out pretty good I think.
An example of "a photo of a dog" on top and the bottom as "a man wearing a hat". This one is pretty cool since you can see how the dog face kind of goes through the hat. Probably my favorite of the three.

After all of this, we can then make visual anagrams, which let us generate images that work as optical illusions. The method behind this is: first, we denoise a noisy image with a text prompt for what we want the upright image to look like. Then, we flip the image upside down and denoise it with another prompt for what we want the flipped image to look like, and flip that estimate back. We then average the two noise estimates and run the diffusion denoising with that average. This makes it so that if you look at the image right-side up it looks like one thing, and upside down it looks like something else, based on the two text prompts. This was super cool, so above are a couple of examples of it working in action.
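A sketch of the averaged noise estimate for the anagrams; unet is again a generic noise predictor (the call signature is an assumption) and the flip is a vertical flip of the image tensor:

import torch

def anagram_noise_estimate(x_t, t, unet, embeds_upright, embeds_flipped):
    with torch.no_grad():
        eps_1 = unet(x_t, t, embeds_upright)  # estimate for the right-side-up prompt
        # flip the image, estimate noise for the second prompt, then flip that estimate back
        eps_2 = torch.flip(unet(torch.flip(x_t, dims=[-2]), t, embeds_flipped), dims=[-2])
    return (eps_1 + eps_2) / 2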

Hybrid Images

A skull from close up and a waterfall from far away.
A man from close up and a dog from far away.
A rocket from close up and a pencil from far away.

Finally, we implement hybrid images like in Project 2. For these images, we first run the UNet on the noisy image with two different text prompts. Then, we take the low pass of the noise estimate for the first and the high pass of the noise estimate for the second, and their sum is the new noise estimate that we use for the diffusion step. This should produce an image where the low-passed prompt dominates from far away and the high-passed prompt dominates from close up; the logic here is the same as the logic behind low/high frequencies in Project 2. You can see this in the above images.
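A sketch of the hybrid noise estimate, using a Gaussian blur as the low-pass filter (the kernel size and sigma here are assumptions, and unet is the same generic noise predictor as before):

import torch
import torchvision.transforms.functional as TF

def hybrid_noise_estimate(x_t, t, unet, embeds_low, embeds_high, ksize=33, sigma=2.0):
    with torch.no_grad():
        eps_low = unet(x_t, t, embeds_low)    # prompt whose coarse structure we keep
        eps_high = unet(x_t, t, embeds_high)  # prompt whose fine detail we keep
    low = TF.gaussian_blur(eps_low, kernel_size=[ksize, ksize], sigma=[sigma, sigma])
    high = eps_high - TF.gaussian_blur(eps_high, kernel_size=[ksize, ksize], sigma=[sigma, sigma])
    return low + high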

CS180 Project 5B

Implementing the UNet and Using the UNet to Train a Denoiser

Next, we move to Project 5B, which is implementing an actual UNet model. The UNet has many pieces and its infrastructure is described well in the project spec; the basic idea is a series of convolutions with downsampling and upsampling, concatenating corresponding layers together, to build the UNet model. In this case, we want to optimize our model for the L2 loss, aka the MSE: we take an image x from the dataset, add some amount of noise to it (scaling pure noise by some sigma), and then optimize the L2 loss between our denoised prediction and x. To start, we apply this noising to some images from the dataset, using sigmas of 0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0, which you can see below.
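The noising itself is a one-liner; here is a minimal sketch of how the noisy training inputs are built:

import torch

def add_noise(x, sigma):
    # z = x + sigma * epsilon, for a batch of clean MNIST images x
    return x + sigma * torch.randn_like(x)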

Five of the images from the training set noised at the different levels, sigma = [0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0].

Training the Simple Model

Next, we train: we apply noise with sigma = 0.5 to images from the training set, shuffling the training dataset and using a batch size of 256 for 5 epochs. We then optimize the L2 loss between our prediction and the original image, using an Adam optimizer with learning rate 1e-4. This results in the training curve below, plotted over each step. You can also see the results after the 1st and 5th epochs below. They are okay, but still a bit blurry.
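A rough sketch of that training loop with the hyperparameters above; UNet here is a stand-in for the project's unconditional UNet class, so its constructor arguments are assumptions:

import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

device = "cuda" if torch.cuda.is_available() else "cpu"
train_set = datasets.MNIST(root="./data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=256, shuffle=True)

model = UNet(in_channels=1, num_hiddens=128).to(device)  # hypothetical constructor
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(5):
    for x, _ in loader:
        x = x.to(device)
        z = x + 0.5 * torch.randn_like(x)  # noisy input at sigma = 0.5
        loss = torch.nn.functional.mse_loss(model(z), x)  # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()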

The training loss for the simple UNet with no conditioning with parameters mentioned above.
The results after the first epoch of the images.
The results after the fifth epoch of the images.

Out of Distribution Testing

Now, we want to see how this does out of distribution, i.e., tested at noise levels it was not trained on. We try many different sigmas (levels of noise), as you can see below. It does worse and worse as the noise level gets farther from the 0.5 it was trained on, which makes sense since those are new conditions; and as there is more noise it is also going to do worse simply because there is more to denoise.

Denoised on sigma = 0.0.
Denoised on sigma = 0.2.
Denoised on sigma = 0.4.
Denoised on sigma = 0.5.
Denoised on sigma = 0.6.
Denoised on sigma = 0.8.
Denoised on sigma = 1.0.

Training a Diffusion Model

Now, we're going to iteratively denoise images using the UNet model, turning it into a diffusion model. The equations we use here come from DDPM and are the same as many of the equations from Part A. We start by generating a pure-noise image. Then, we predict the noise epsilon that was added to that image and subtract it out. We do this iteratively, feeding the new image back in over and over again, until we can generate a more realistic image. For this to work well, we need to condition the UNet on the timestep it is currently at, since the variance of x_t changes with t.

Adding Time Conditioning to the UNet and Training the UNet

To add time conditioning, we take our previous UNet infrastructure and add two additional FCBlocks that allow us to embed time information into our model. We normalize t to be in the range [0, 1], passing in t/300. Then, to train, we pick a random image from the training set, get a random t, and predict the noise in x_t. We do this for many different images at many different timesteps so that our model can deal with multiple timesteps. This time, instead of 5 epochs, we do this for 20 epochs and use an exponential learning rate decay scheduler for our Adam optimizer. Below is our training curve for the time-conditioned UNet.
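A sketch of one time-conditioned training step; alphas_cumprod is the DDPM alpha-bar schedule for T = 300, and the model's exact interface is an assumption:

import torch
import torch.nn.functional as F

T = 300

def train_step(model, opt, x, alphas_cumprod):
    t = torch.randint(0, T, (x.shape[0],), device=x.device)  # a random timestep per image
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x)
    x_t = torch.sqrt(a_bar) * x + torch.sqrt(1 - a_bar) * eps  # forward process
    eps_pred = model(x_t, t.float() / T)  # pass in the normalized time t/300
    loss = F.mse_loss(eps_pred, eps)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()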

The training loss of our time-conditioned UNet with parameters mentioned above.

Sampling from the UNet

Then, we sample from the UNet using a procedure very similar to Part A: we start from pure noise and step back through each timestep, iteratively denoising until we get our sampled values. The more epochs the model trains for, the better this does. As you can see below, these are the sampled outputs after 5 epochs and 20 epochs.
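A sketch of that sampling loop, using the standard DDPM update with the betas/alphas/alpha-bar schedules for T = 300 (the model interface is the same assumption as above):

import torch

@torch.no_grad()
def sample(model, shape, betas, alphas, alphas_cumprod, T=300, device="cpu"):
    x = torch.randn(shape, device=device)  # start from pure noise
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = model(x, t_batch.float() / T)
        # DDPM update: remove the predicted noise, then add back a little fresh noise
        mean = (x - betas[t] / torch.sqrt(1 - alphas_cumprod[t]) * eps) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the last step
        x = mean + torch.sqrt(betas[t]) * z
    return x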

The samples from time conditioned UNet after five epochs. They look okay, but some are pretty strange, for example the bottom left.
The samples from time conditioned UNet after twenty epochs. These look a lot better, but there's definitely still room for improvement.

Adding Class Conditioning to the UNet

Now, we want to condition our UNet on the digit class 0-9, since we know 0s tend to look more like other 0s than like other digits, etc., and we can encode that additional information into our model to make it better. We make a one-hot-encoded vector that is 1 for the given class and 0 for the other classes, and use it alongside a mask (all 0s with probability 10%, to give us 10% dropout of the conditioning, and all 1s the rest of the time) to build our model. The purpose of this dropout is so that our UNet still works even if we don't condition on the class. The process is very similar to time conditioning, except that we also have c, and 10% of the time we do unconditional generation. The resulting training curve is below.
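A sketch of how the class vector and the 10% dropout mask might be built (the exact interface is an assumption); the rest of the training step is the same as the time-conditioned one, just with c passed in as well:

import torch
import torch.nn.functional as F

def make_class_condition(labels, num_classes=10, p_uncond=0.1):
    c = F.one_hot(labels, num_classes).float()  # one-hot class vectors
    # zero out the conditioning for ~10% of the batch so the model also learns unconditional generation
    keep = (torch.rand(labels.shape[0], device=labels.device) > p_uncond).float()
    return c * keep.unsqueeze(1)

# in the training step: eps_pred = model(x_t, t.float() / T, make_class_condition(labels))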

The class conditioned training curve.

Sampling from the Class Conditioned UNet

Class conditioned sampling of 4 of each number after 5 epochs.
Class conditioned sampling of 4 of each number after 20 epochs.

Now, we want to sample from the class-conditioned UNet in the same way as before, but we also need to specify the classes we want to sample. I sampled four 0s, four 1s, four 2s, etc., in order to see four of each digit. You'll notice that the class conditioning did pretty well, since the output image always matched the class it was conditioned on, even after only 5 epochs of training. The results look a lot better than even the time-conditioned ones, and you can see them above.

Now we're done! We've implemented a class and time conditioned UNet and learned a lot about diffusion models and image generation. Very cool project, thanks for assigning this one!