This is the website for CS180 Project 5, where we learned a lot more about diffusion models and image generation. I thought the coolest part of the project was building the time- and class-conditioned UNet and finding that the class-conditioned UNet does a really good job of generating images that look like the digits in the MNIST dataset.
To start, we wanted to learn a little more about diffusion models. We began by looking at some precomputed text embeddings and the sample images generated from them. The three captions given were "an oil painting of a snowy mountain village", "a man wearing a hat", and "a rocket ship". The first three images use 20 inference steps and the next three use 5 inference steps. With 20 inference steps, all of the images match their text prompts quite closely and the quality is reasonably good, although the colors are somewhat off; the rocket ship, for example, looks more like a caricature than an actual depiction of a rocket. With 5 inference steps the quality gets a lot worse, again especially for the rocket ship, which now breaks apart into pieces with some strange artifacts on top. This supports the hypothesis that, at least up to a point, more inference steps improve the quality of the outputs. The random seed I am using is 2988.
For the forward process, I exactly followed the equations given, specifically equation A.2 from the project spec: the noisy image x_t is computed as x_t = sqrt(alpha-bar-t) * x_0 + sqrt(1 - alpha-bar-t) * epsilon, i.e., a sample from a Gaussian whose mean and variance are determined by alpha-bar-t and the random noise epsilon. This resulted in the following test image at noise levels [250, 500, 750]; as you can clearly see, the image gets noisier and noisier as the noise level increases, since we inject more and more random noise.
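Roughly, in code (a minimal sketch; alphas_cumprod and im are my placeholder names for the scheduler's alpha-bar table and the test image, not necessarily the variables from the spec):

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise an image to timestep t following equation A.2."""
    alpha_bar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)  # epsilon ~ N(0, I)
    return alpha_bar_t.sqrt() * im + (1 - alpha_bar_t).sqrt() * eps

# the three noise levels shown above
noisy = [forward(im, t, alphas_cumprod) for t in [250, 500, 750]]
```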
Next I tried the denoising tactic we have used in previous projects: applying a Gaussian blur to the image with torchvision.transforms.functional.gaussian_blur to try to remove the noise, using a 5x5 kernel. Above are the noisy images and their corresponding Gaussian-denoised versions. This did not do so well, so it felt necessary to try another method of denoising.
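The blur itself is a one-liner with torchvision; a sketch reusing the noisy images from the snippet above:

```python
import torchvision.transforms.functional as TF

# classical baseline: a 5x5 Gaussian blur on each noisy image
blurred = [TF.gaussian_blur(x_t, kernel_size=5) for x_t in noisy]
```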
Now, we do one-step denoising. For this piece I used a UNet conditioned on the amount of Gaussian noise to first estimate the noise, then remove that noise to get an estimate of the original image. Note that when removing the noise, I had to invert the forward process, dealing with the alpha-bar-t scalars and doing the "backwards" of what we do to add the noise: x_0_est = (im_noisy - sqrt(1 - alpha-bar-t) * noise_est) / sqrt(alpha-bar-t). This produced images that looked slightly better than the Gaussian blur, but still not amazing, as described in the caption above. Thus, we turn to iterative denoising.
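In code, the inversion looks roughly like this (noise_est stands for whatever the UNet predicts as the noise in x_t):

```python
def one_step_denoise(x_t, t, noise_est, alphas_cumprod):
    """Invert x_t = sqrt(abar_t)*x_0 + sqrt(1 - abar_t)*eps to estimate x_0."""
    alpha_bar_t = alphas_cumprod[t]
    return (x_t - (1 - alpha_bar_t).sqrt() * noise_est) / alpha_bar_t.sqrt()
```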
Now, we do iterative denoising following equations 6 and 7 of the DDPM paper. The key idea is that we denoise over multiple timesteps: at each step we get a noise estimate and remove part of the noise, so the total noise is removed gradually rather than all at once. The iterative denoising result is displayed above; since we display a result every 5 timesteps, you can watch the image get better and better as more and more noise is removed. We can contrast these results with the second set of images above (the same image with one-step denoising and Gaussian blur) to see that iterative denoising does better.
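Here is a sketch of one strided step in the form I used (my paraphrase of DDPM equations 6 and 7; variable names are mine, and I am omitting the predicted-variance term that gets added at the end):

```python
def denoise_step(x_t, x0_est, t, t_prev, alphas_cumprod):
    """One DDPM update from timestep t to the less-noisy timestep t_prev."""
    abar_t, abar_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    alpha_t = abar_t / abar_prev
    beta_t = 1 - alpha_t
    return (abar_prev.sqrt() * beta_t / (1 - abar_t)) * x0_est \
         + (alpha_t.sqrt() * (1 - abar_prev) / (1 - abar_t)) * x_t
```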
Next we do Diffusion Model Sampling. Here, we call the iterative_denoise function with i_start = 0 and pass in completely random noise, so the model denoises pure noise and thereby generates an image from scratch. Above are five images generated this way.
Next we want to improve these generated images a little, so we use Classifier-Free Guidance (CFG). This computes a conditional and an unconditional noise estimate and sets the new noise estimate equal to the unconditional estimate + gamma * (conditional estimate - unconditional estimate). For the unconditional noise estimate we just use an empty prompt embedding, whereas for the conditional one we use the "a high quality photo" embedding. The gamma parameter controls how strongly the conditional estimate is weighted; using gamma = 7, we see that gamma > 1 gives noticeably higher quality images. You can see these images above.
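The CFG combination is just one line; a sketch with gamma = 7 as in the text (eps_uncond comes from the empty prompt, eps_cond from "a high quality photo"):

```python
gamma = 7
# gamma > 1 extrapolates past the conditional estimate, away from the unconditional one
eps_cfg = eps_uncond + gamma * (eps_cond - eps_uncond)
```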
Next we do image-to-image translation. We take an original image, add some noise, and then use the SDEdit algorithm to get back a new image using the text prompt "a high quality photo". This should return an image similar to the test image, as long as the added noise is low enough. We do this by running the forward process to get a noisy test image, then running iterative_denoise_cfg with different starting indexes to see the "edits" to the image: the larger the starting index (i.e., the less noise added before denoising), the closer the generated image is to the original.
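A sketch of the SDEdit procedure under my naming (strided_timesteps, forward, and iterative_denoise_cfg are the pieces described above):

```python
def sdedit(test_im, i_start, alphas_cumprod, strided_timesteps):
    """Noise the image to the timestep matching i_start, then denoise with CFG."""
    t_start = strided_timesteps[i_start]
    x_t = forward(test_im, t_start, alphas_cumprod)
    return iterative_denoise_cfg(x_t, i_start=i_start)
```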
Now, we do the same thing with one image from the web (an avocado image from Pinterest) and two hand-drawn images; these are the results. This shows that such "edits" and image-to-image translation work not just on real photographs but also on hand-drawn and web images, which is really cool.
Next, I did inpainting. Here we leave everything outside the mask as the original image and only generate new content inside the mask: after each denoising step we set x_t = m * x_t + (1 - m) * forward(x_orig, t), where m is 1 inside the region we want to regenerate and 0 elsewhere. This resulted in the above results, which combine a generated region with the rest of the original image in a pretty cool way.
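As code, the forcing step inside the denoising loop is a single line (m and x_orig are my names for the mask and the original image):

```python
# keep the un-masked region pinned to a correctly-noised copy of the original image
x_t = m * x_t + (1 - m) * forward(x_orig, t, alphas_cumprod)
```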
Next, I did the same thing as above but with a text prompt, which conditions the image generation on the prompt as well. This just means replacing the previous "a high quality photo" prompt with a text prompt describing what you want the edits to look like. The resulting images are above.
After all of this, we can make visual anagrams, which are generated images that work as optical illusions. The method is as follows: at each denoising step, we compute one noise estimate for the image with a text prompt for what it should look like right-side up, and a second noise estimate for the flipped image with a prompt for what it should look like upside down, flipping that second estimate back afterwards. We then average the two noise estimates and use the average for the diffusion denoising step. This makes it so that the image looks like one prompt right-side up and like the other prompt when flipped upside down, based on the text prompts given. This was super cool, so above are a couple of examples of it working in action.
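Per step, this looks roughly like the following (predict_noise stands in for the CFG noise estimate described earlier):

```python
import torch

eps1 = predict_noise(x_t, t, prompt1_embed)                          # right-side-up prompt
eps2 = predict_noise(torch.flip(x_t, dims=[-2]), t, prompt2_embed)   # upside-down prompt on the flipped image
eps = (eps1 + torch.flip(eps2, dims=[-2])) / 2                       # un-flip the second estimate, then average
```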
Finally, we implement hybrid images like in Project 2. For these images, we run the UNet on the image with two different text prompts. Then we take the low-pass of the first noise estimate and the high-pass of the second, and their sum becomes the new noise estimate that we use for the diffusion step. This should produce an image that looks like the low-passed prompt from far away and the high-passed prompt from close up; the logic is the same as the low/high frequency reasoning behind hybrid images in Project 2. You can see this in the above images.
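A sketch of the noise-estimate combination (predict_noise as above; a Gaussian blur serves as the low-pass filter and the residual as the high-pass; the kernel size and sigma here are placeholders rather than the exact values from the spec):

```python
import torchvision.transforms.functional as TF

eps1 = predict_noise(x_t, t, prompt1_embed)   # visible from far away (low frequencies)
eps2 = predict_noise(x_t, t, prompt2_embed)   # visible up close (high frequencies)
low = TF.gaussian_blur(eps1, kernel_size=33, sigma=2.0)
eps = low + (eps2 - TF.gaussian_blur(eps2, kernel_size=33, sigma=2.0))
```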
Next, we move to Project 5B, which is implementing an actual UNet model. The UNet has many pieces, and the architecture is described well in the project spec; the basic idea is a series of convolutions with downsampling and upsampling, concatenating corresponding layers together via skip connections. In this case, we want to train the model as a denoiser under the L2 loss (aka MSE): we take an image x from the dataset, add some amount of noise to it (scaling pure noise by some sigma), and optimize the L2 loss between our denoised prediction and x. To start, we apply this noising to some images from the dataset using sigmas of 0, 0.2, 0.4, 0.5, 0.6, 0.8, and 1.0, which you can see below.
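The noising itself is just z = x + sigma * epsilon; a quick sketch:

```python
import torch

sigmas = [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]
noised = [x + s * torch.randn_like(x) for s in sigmas]   # z = x + sigma * eps
```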
Next, we do training: we apply noise with sigma = 0.5 to images from the training set, shuffling the training data and using a batch size of 256 for 5 epochs. We optimize the L2 loss between our prediction and the original image, using an Adam optimizer with learning rate 1e-4. This results in the following per-step training curve. You can also see the results after the 1st and 5th epochs below; they are okay, but still a bit blurry.
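The training loop is short; here is a sketch with the settings from the text (unet and train_loader are my placeholder names):

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-4)
for epoch in range(5):
    for x, _ in train_loader:                 # shuffled MNIST batches of 256
        z = x + 0.5 * torch.randn_like(x)     # noise at sigma = 0.5
        loss = F.mse_loss(unet(z), x)         # L2 loss against the clean image
        opt.zero_grad()
        loss.backward()
        opt.step()
```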
Now, we want to see how the model does out of distribution, i.e., on noise levels it was not trained on. We try many different sigmas (levels of noise), as you can see below. It does worse and worse as the noise level moves away from the 0.5 it was trained on, which makes sense: the model is operating under new conditions, and with more noise there is simply more to denoise.
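A sketch of that check (x_test is a test-set digit; same sigmas as the visualization earlier):

```python
import torch

with torch.no_grad():
    denoised = [unet(x_test + s * torch.randn_like(x_test))
                for s in [0.0, 0.2, 0.4, 0.5, 0.6, 0.8, 1.0]]
```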
Now, we're going to iteratively denoise images using our own UNet, turning it into a diffusion model. The equations we use here come from DDPM and are largely the same as in Part A. We start by generating a pure-noise image, predict the added noise epsilon, and subtract it out; we do this iteratively, feeding the new image back in over and over, until we have generated a more realistic image. For this to work well, we need to condition the UNet on the timestep it is currently at, since the variance of x_t changes with t.
To add time conditioning, we take our previous UNet architecture and add two additional FCBlocks that embed the timestep information into the model. We normalize t to the range [0, 1] by passing in t/300. Then, to train, we pick a random image from the training set, pick a random t, noise the image to x_t, and predict the noise in x_t. We do this for many different images at many different timesteps so that our model can handle all of them. This time, instead of 5 epochs, we train for 20 epochs and use an exponential learning-rate decay scheduler with our Adam optimizer. Below is the training curve for the time-conditioned UNet.
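A sketch of this training loop under my naming; the learning rate and decay gamma below are placeholder choices (the text only says Adam with exponential decay), and alphas_cumprod is the cumulative-product schedule over the 300 timesteps:

```python
import torch
import torch.nn.functional as F

opt = torch.optim.Adam(unet.parameters(), lr=1e-3)                           # placeholder lr
sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.1 ** (1 / 20))   # placeholder gamma
for epoch in range(20):
    for x, _ in train_loader:
        t = torch.randint(0, 300, (x.shape[0],))                 # random timestep per image
        eps = torch.randn_like(x)
        abar = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = abar.sqrt() * x + (1 - abar).sqrt() * eps          # DDPM forward process
        loss = F.mse_loss(unet(x_t, t / 300.0), eps)             # predict the injected noise
        opt.zero_grad()
        loss.backward()
        opt.step()
    sched.step()
```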
Then, we sample from the UNet very much like in Part A: we start from pure noise and step backwards through the timesteps, iteratively denoising until we get our samples. The more epochs the model trains, the better this does; as you can see below, these are the sampled outputs after 5 epochs and after 20 epochs.
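A sketch of the sampling loop, written with the standard DDPM reverse step (alphas, betas, and alphas_cumprod are the 300-step schedule from training; names are mine):

```python
import torch

@torch.no_grad()
def sample(unet, num_samples=16):
    x = torch.randn(num_samples, 1, 28, 28)                      # start from pure noise
    for t in range(299, -1, -1):
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        t_batch = torch.full((num_samples,), t)
        eps = unet(x, t_batch / 300.0)                           # predicted noise at this step
        abar, a, b = alphas_cumprod[t], alphas[t], betas[t]
        x = (x - (1 - a) / (1 - abar).sqrt() * eps) / a.sqrt() + b.sqrt() * z
    return x
```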
Now, we want to condition our UNet on the digit class 0-9, since we know 0s tend to look more like other 0s than like other digits, etc., and we can encode that additional information into our model to make it better. We make a one-hot encoded vector that is 1 for the given class and 0 for the other classes, and pair it with a mask that is all 0s with probability 10% (giving us 10% class dropout) and all 1s the rest of the time. The purpose of this dropout is so that our UNet still works even if we don't condition on the class. The process is very similar to time conditioning, except that we also pass in c, and 10% of the time we do unconditional generation. The resulting training curve is below.
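The conditioning and dropout per batch look roughly like this (labels are the MNIST digit labels; the rest follows the time-conditioned loop above):

```python
import torch
import torch.nn.functional as F

c = F.one_hot(labels, num_classes=10).float()          # one-hot class vectors
drop = (torch.rand(c.shape[0], 1) < 0.1).float()       # 1 with probability 10%
c = c * (1 - drop)                                     # dropped rows become all zeros
loss = F.mse_loss(unet(x_t, t / 300.0, c), eps)
```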
Now, we want to sample from the class-conditioned UNet in the same way as before, but now we also need to specify which classes we want to sample. I sampled four of each digit (four 0s, four 1s, four 2s, etc.) in order to see 4 samples of each class. You'll notice that the class conditioning did pretty well, since the output image always matched the class that was conditioned on, even after only 5 epochs of training. The results look even better than the time-conditioned ones, and you can see them above.
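Building the conditioning batch for that is straightforward; a sketch (sample here is a hypothetical class-aware version of the sampling loop above, and since the model was trained with 10% unconditional dropout, the conditional and unconditional estimates could also be combined with classifier-free guidance as in Part A, which I'm omitting here):

```python
import torch
import torch.nn.functional as F

labels = torch.arange(10).repeat_interleave(4)         # 0,0,0,0,1,1,1,1,...,9,9,9,9
c = F.one_hot(labels, num_classes=10).float()
samples = sample(unet, c)                              # hypothetical class-conditioned sampler
```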
Now we're done! We've implemented a class and time conditioned UNet and learned a lot about diffusion models and image generation. Very cool project, thanks for assigning this one!