In this project, we learned more about image homographies and how to stitch images together into mosaics, similar to a phone's "Panorama" mode. It was really exciting to learn how this works, since I've always wondered how our phones can do it.
First, I shot pictures around campus, particularly in the CSUA, in BWW, and in my apartment. I used my iPhone 15 to take the pictures and Adobe to resize them. All relevant pictures are included below alongside their corresponding rectifications/mosaics, so for ease of reading I won't include them again here. I also made sure to follow the center-of-projection advice from lecture: rotating the camera about its own center of projection between shots, rather than pivoting around the person holding it, so that the shots are related by a homography.
I started by computing the homography. I first picked corresponding points in the two images (in this case, the corners of the picture on the wall) using the Correspondence tool. From lecture, we know that a homography H maps a point p in one image to its corresponding point p' in the other image, i.e. p' = Hp in homogeneous coordinates. I set up the resulting system of linear equations, with one pair of equations per correspondence, and then used numpy's linear least squares to solve it and recover the homography.
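To make that setup concrete, here is a minimal sketch of the least-squares homography solve described above; the function and variable names are my own, not the exact project code.

```python
import numpy as np

def compute_homography(src_pts, dst_pts):
    """Estimate H such that dst ~ H @ src in homogeneous coordinates.

    src_pts, dst_pts: (N, 2) arrays of corresponding (x, y) points, N >= 4.
    A sketch of the linear system described above, solved with least squares.
    """
    A, b = [], []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        # Each correspondence gives two rows of the system A h = b, derived
        # from x' = (h1 x + h2 y + h3) / (h7 x + h8 y + 1) and similarly for y'.
        A.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp])
        b.append(xp)
        A.append([0, 0, 0, x, y, 1, -x * yp, -y * yp])
        b.append(yp)
    h, *_ = np.linalg.lstsq(np.array(A), np.array(b), rcond=None)
    return np.append(h, 1).reshape(3, 3)  # fix the scale by setting h9 = 1
```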
From there, I implemented image warping. This was similar to Project 3, except that instead of an affine transformation, I used the homography matrix I had just computed. I first took the four corners of the image and applied the homography to them (warping the corners). From the warped corners I found the minimum and maximum x and y, which let me set up a translation matrix so that all transformed points land inside the visible output image. Then, using the warped corners, the resulting output height and width, and skimage's polygon function, I found the relevant output pixels, mapped them back through the homography in the same way as last project's affine warp, and displayed the output.
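A rough sketch of that warping procedure is below, assuming a color image, nearest-neighbor sampling, and a simple bounding-box translation; the helper structure is mine, not the exact project code.

```python
import numpy as np
from skimage.draw import polygon

def warp_image(im, H):
    """Warp im by homography H; also return the translation applied.

    Returns the warped image plus (tx, ty), the shift that moves the warped
    corners into positive coordinates so the whole result is visible.
    """
    h, w = im.shape[:2]
    # Forward-map the four corners to find the output bounding box.
    corners = np.array([[0, 0, 1], [w, 0, 1], [w, h, 1], [0, h, 1]]).T
    warped = H @ corners
    warped = warped[:2] / warped[2]
    min_xy = np.floor(warped.min(axis=1)).astype(int)
    max_xy = np.ceil(warped.max(axis=1)).astype(int)
    tx, ty = -min_xy
    out_w, out_h = max_xy - min_xy

    out = np.zeros((out_h, out_w, im.shape[2]), dtype=im.dtype)
    # Rasterize the warped quadrilateral, then inverse-map each output pixel.
    rr, cc = polygon(warped[1] + ty, warped[0] + tx, out.shape[:2])
    Hinv = np.linalg.inv(H)
    src = Hinv @ np.stack([cc - tx, rr - ty, np.ones_like(rr)])
    sx = (src[0] / src[2]).round().astype(int).clip(0, w - 1)
    sy = (src[1] / src[2]).round().astype(int).clip(0, h - 1)
    out[rr, cc] = im[sy, sx]  # nearest-neighbor sampling for simplicity
    return out, (tx, ty)
```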
To verify that the warping worked, I then performed rectification, which means warping a known rectangular object in the image onto an actual rectangle. I selected the four corners of a rectangular object in my image, defined a target rectangle by manually entering its corner pixel coordinates, computed the homography between the two point sets, and warped the image onto that rectangle to get a clean rectification. I did this for two images, and below are the results.
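As a small illustration, rectification is just the two helpers sketched above applied to hand-picked points; the coordinates here are made up for the example, and `im` is assumed to be the loaded photo.

```python
import numpy as np

# Four clicked corners of a rectangular object (illustrative values only).
clicked = np.array([[312, 140], [690, 180], [668, 560], [298, 520]], dtype=float)
# Target rectangle, entered manually as pixel coordinates.
target = np.array([[0, 0], [400, 0], [400, 300], [0, 300]], dtype=float)

H = compute_homography(clicked, target)
rectified, _ = warp_image(im, H)
```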
Then, finally, came the most exciting part of the project: blending the images together! This step took a lot of conceptual understanding. Let's say I wanted to blend image1 and image2. First, I warped image1 onto image2 and recorded the translation that the warp required. Then, I applied that same translation to image2, took the maximum extents of the two images to size the overall blended canvas, and placed both images (aligned, since they share the same translation) into that canvas. Unfortunately, simply adding them leaves an overlap region between the two images that looks very strange, so I used a two-band Laplacian-pyramid-style blending scheme to smooth the transition there. To do this, I first computed the area of overlap by finding the pixels in the blended canvas covered by both images. Then, for each image, I applied a 2D Gaussian convolution to get a low-frequency version, and subtracted that from the original image to get the high-frequency version. Next, I computed a distance transform for both images over the region of overlap. For the high-frequency band, each overlap pixel takes the value from whichever image has the larger distance-transform value (i.e., the pixel farthest from that image's edge); for the low-frequency band, I computed a weighted average with weights given by the corresponding distance transforms. Adding the low- and high-frequency bands back together in the overlap region makes the transition between the two images much smoother. Below are the outcomes.
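Here is a condensed sketch of that two-band blending step, assuming both images have already been translated onto a shared canvas (zeros wherever an image is absent); the sigma and epsilon values are illustrative, not the ones I tuned.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt, gaussian_filter

def two_band_blend(canvas1, canvas2):
    """Two-band blend of two same-size float RGB canvases.

    Assumes each canvas is zero outside its image region, so a nonzero pixel
    marks coverage by that image.
    """
    mask1 = canvas1.sum(axis=2) > 0
    mask2 = canvas2.sum(axis=2) > 0
    overlap = mask1 & mask2

    # Distance from each covered pixel to the edge of its image region.
    d1 = distance_transform_edt(mask1)
    d2 = distance_transform_edt(mask2)

    # Split each image into low- and high-frequency bands.
    low1 = gaussian_filter(canvas1, sigma=(2, 2, 0))
    low2 = gaussian_filter(canvas2, sigma=(2, 2, 0))
    high1, high2 = canvas1 - low1, canvas2 - low2

    out = canvas1 + canvas2  # correct outside the overlap, where one canvas is zero

    # Overlap: weighted-average the low band, winner-take-all the high band.
    w1 = (d1 / (d1 + d2 + 1e-8))[..., None]
    low = low1 * w1 + low2 * (1 - w1)
    high = np.where((d1 > d2)[..., None], high1, high2)
    out[overlap] = (low + high)[overlap]
    return out
```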
In Part B of the project, we follow the paper “Multi-Image Matching using Multi-Scale Oriented Patches” by Brown et al. This lets us build mosaics automatically, instead of relying on manually clicking corresponding points in the images and matching them up. The coolest thing I learned from this part was how to use ANMS to pick the corners that were key to the image. Often these corresponded to the corners I had chosen manually, and sometimes they didn't, and it was interesting to see how the algorithm decides which corners to keep based on the relative corner strengths of nearby points. I hadn't really thought of this kind of approach before, and it was fun to implement it to find evenly spaced, strong corners to compare across images.
We start by detecting corner features in the images. We use the provided Harris corner code to generate Harris corners and display them on the BWW image, like so.
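The detector itself came from the provided starter code, but a rough skimage equivalent of this step (not the provided implementation) would look something like this.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max

gray = rgb2gray(im)               # im: the BWW photo, loaded elsewhere
harris = corner_harris(gray, sigma=1)   # Harris corner response map
# Local maxima of the response become the candidate corners (row, col).
coords = peak_local_max(harris, min_distance=1, threshold_rel=0.01)
```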
Then, we apply the ANMS (adaptive non-maximal suppression) algorithm, which condenses the set of corner points we look at. We compute the distance between every pair of points using the provided dist2 function, then compute a suppression radius for each point: the distance to the nearest point whose scaled corner strength is sufficiently larger. We sort the points by suppression radius and keep those with the largest radii; here we use a c_robust value of 0.9 and keep the top 500 interest points, as in the paper. These become the new set of corners used for feature extraction and matching. Because this favors points that are spatially spread out across the image, it is more likely (as explained in the paper) that matching features exist in both images and the points won't be dropped during feature matching.
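A sketch of the ANMS step as described above; here the pairwise distances are computed directly with numpy rather than the provided dist2 helper.

```python
import numpy as np

def anms(coords, strengths, n_keep=500, c_robust=0.9):
    """Adaptive non-maximal suppression.

    coords: (N, 2) corner coordinates; strengths: (N,) Harris responses.
    Keeps the n_keep corners with the largest suppression radii.
    """
    # Pairwise squared distances between all corners.
    diffs = coords[:, None, :] - coords[None, :, :]
    d2 = (diffs ** 2).sum(axis=-1)

    # Corner i is suppressed by corner j if c_robust * strength_j > strength_i.
    stronger = c_robust * strengths[None, :] > strengths[:, None]
    d2_masked = np.where(stronger, d2, np.inf)

    # Suppression radius = distance to the nearest sufficiently-stronger corner.
    radii = d2_masked.min(axis=1)
    keep = np.argsort(-radii)[:n_keep]
    return coords[keep]
```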
Now, for each corner, we take the 40x40 region around the point, resize it to 8x8, and standardize it (subtracting the mean and dividing by the standard deviation). Then we flatten each patch and stack them all together, which gives us the extracted feature descriptor for every corner. Below are some examples of what the first six feature descriptors look like.
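A minimal sketch of this descriptor extraction, assuming integer (row, col) corner coordinates and skipping corners that sit too close to the border.

```python
import numpy as np
from skimage.transform import resize

def extract_descriptors(gray, coords, patch=40, out=8):
    """8x8 axis-aligned descriptors from 40x40 windows, as described above.

    gray: grayscale image; coords: (N, 2) integer (row, col) corner locations.
    Returns the stacked descriptors and the coordinates that were kept.
    """
    half = patch // 2
    descs, kept = [], []
    for r, c in coords:
        if r - half < 0 or c - half < 0 or r + half > gray.shape[0] or c + half > gray.shape[1]:
            continue  # window would fall off the image
        window = gray[r - half:r + half, c - half:c + half]
        small = resize(window, (out, out), anti_aliasing=True)
        small = (small - small.mean()) / (small.std() + 1e-8)  # bias/gain normalize
        descs.append(small.ravel())
        kept.append((r, c))
    return np.array(descs), np.array(kept)
```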
Using these feature descriptors from both images, we then implement feature matching, as described in the paper. We compute the SSD between all pairs of feature descriptors from image 1 and image 2. Then, for each descriptor, we compare its 1-NN and 2-NN distances: if the 1-NN distance divided by the 2-NN distance is below a Lowe's-style ratio threshold of 0.4 (a value I took from the graph in the paper), we consider it a match between the images. This works because the ratio test treats the 1-NN as the potential correct match and the 2-NN as an approximation of the distance to an incorrect (outlier) match, so the ratio indicates how likely the 1-NN is to be correct. For the two images visualized here, these are the points we get from feature matching, i.e. the corners that match across the two images. Reading them from left to right, you can see where the correspondences line up, and they look largely correct.
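A sketch of the ratio-test matching described above, assuming the two descriptor arrays come from the previous step.

```python
import numpy as np

def match_features(desc1, desc2, ratio=0.4):
    """SSD matching with Lowe's ratio test.

    desc1, desc2: (N1, 64) and (N2, 64) descriptor arrays.
    Returns an array of (i, j) index pairs that pass the ratio test.
    """
    matches = []
    for i, d in enumerate(desc1):
        ssd = ((desc2 - d) ** 2).sum(axis=1)  # SSD to every descriptor in image 2
        nn = np.argsort(ssd)[:2]              # indices of the 1-NN and 2-NN
        if ssd[nn[0]] / (ssd[nn[1]] + 1e-12) < ratio:
            matches.append((i, nn[0]))
    return np.array(matches)
```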
Next, we use the RANSAC algorithm. We take the pairs of matched points from the previous step and repeatedly select four pairs at random. For each random sample, we compute a homography using the same least-squares logic as in Project 4A and apply it to the matched points from the first image. If the SSD between a transformed feature point and its target feature point is less than a threshold (which I chose through some parameter tuning to be 1), we count that point as an inlier. If the number of inliers for this homography is greater than the count from the previous best random sample, we keep this homography as the current best. We run this for 100 iterations and output the best homography. There are no images to show here, since the result is applied directly in the next part.
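A sketch of that RANSAC loop, reusing the compute_homography helper from Part A; the inlier threshold of 1 is the tuned value mentioned above.

```python
import numpy as np

def ransac_homography(pts1, pts2, n_iters=100, thresh=1.0):
    """4-point RANSAC over matched points.

    pts1, pts2: (N, 2) matched (x, y) points. thresh is the squared-distance
    inlier cutoff; the homography with the most inliers wins.
    """
    best_H, best_count = None, -1
    homog1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous image-1 points
    for _ in range(n_iters):
        sample = np.random.choice(len(pts1), 4, replace=False)
        H = compute_homography(pts1[sample], pts2[sample])
        proj = (H @ homog1.T).T
        proj = proj[:, :2] / proj[:, 2:3]
        err = ((proj - pts2) ** 2).sum(axis=1)  # SSD reprojection error
        count = int((err < thresh).sum())
        if count > best_count:
            best_H, best_count = H, count
    return best_H
```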
Finally, we use the RANSAC homography to produce a mosaic. We take the homography returned by RANSAC, use it to warp the image (as in Project 4A), and create the blended mosaic from that warped image, exactly as described in Project 4A. The only difference is that the correspondences are now automatically generated rather than manually selected. Below, we show the blended mosaics from the Part B pipeline side by side with the Part A mosaics built from manually selected points.
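Tying the Part B pieces together, a rough end-to-end usage of the helpers sketched earlier might look like this; im1 and im2 are the two photos to mosaic, and the actual project code is organized differently.

```python
from skimage.color import rgb2gray
from skimage.feature import corner_harris, peak_local_max

def detect_and_describe(im):
    """Harris corners -> ANMS -> 8x8 descriptors, using the sketches above."""
    gray = rgb2gray(im)
    harris = corner_harris(gray, sigma=1)
    coords = peak_local_max(harris, min_distance=1, threshold_rel=0.01)
    coords = anms(coords, harris[coords[:, 0], coords[:, 1]])
    return extract_descriptors(gray, coords)

desc1, kept1 = detect_and_describe(im1)
desc2, kept2 = detect_and_describe(im2)
matches = match_features(desc1, desc2)

# Kept coordinates are (row, col); the homography code expects (x, y), so flip.
pts1 = kept1[matches[:, 0]][:, ::-1].astype(float)
pts2 = kept2[matches[:, 1]][:, ::-1].astype(float)
H = ransac_homography(pts1, pts2)
warped1, (tx, ty) = warp_image(im1, H)  # then translate im2 by (tx, ty) and blend as in Part A
```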
From this set of images, there are some interesting observations to make. First, for the CSUA board image, looking at the whiteboard area, the blend produced by Part B's automatically detected corners is actually better than the one from Project 4A: the side of the whiteboard appears sharper and less blurry. The reason is that, looking back at the feature-matching output, some of the automatically matched points lay on the whiteboard side, which let that region align more precisely, whereas when selecting manually I only picked points on the left side, so that part of the output came out blurrier. The algorithm found better alignment points there than I did. However, this is not true for every image. For the box, for example, the mosaic built from manually selected points looks better than the automatic one, since I was able to pick good corner points directly on the box so that it aligns well, whereas the box comes out very blurry in my Project 4B output. Ultimately, I think that manually selecting points, when you have the time and effort to put into choosing many good correspondences, can beat automatic detection, but automatic detection is a much quicker solution when you have less time, and it can sometimes find correspondences that you would miss if you were rushing.