Stitching Panoramas

William Keyes | 15-463 Project 3

A view of the Gates Center from the Pausch bridge

Introduction

When you take a picture with a camera, you are applying a perspective transform to your subject, taking points in three dimensional space and mapping them to a two dimensional image plane. This mapping occurs through a point known as the center of projection. If we have a set of images that were all taken with the same center of projection, we can actually compute perspective transformations from the final images and use the results to combine the images into a panorama, creating the image we would see if our lens had a wider field of view.

Technical Details

The basic idea is that, given at least four correspondences between two images taken with the same center of projection, one can compute a homography representing the perspective transform between the two images. Using this transform, we warp one image into the same space as the other and then blend the two together.

Computing a homography is easy: just solve a (possibly overconstrained) system of equations in eight unknowns. Finding the correspondences is a more difficult and interesting problem. In the first part of this assignment, I manually defined points using MATLAB's cpselect function. In general this works well, but it is time consuming to define enough (at least 10) correspondences, and in some images, like landscapes, it may be hard for humans to find good points.
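As a rough illustration, here is a minimal MATLAB sketch of that computation. The function name computeHomography, the N-by-2 point format, and the choice to fix H(3,3) = 1 are conventions of this sketch, not necessarily how the project code is organized.

```matlab
% Sketch: estimate the 3x3 homography H mapping pts1 to pts2, where
% pts1 and pts2 are N-by-2 matrices of [x y] correspondences, N >= 4.
% Fixing H(3,3) = 1 leaves eight unknowns.
function H = computeHomography(pts1, pts2)
    N = size(pts1, 1);
    A = zeros(2*N, 8);
    b = zeros(2*N, 1);
    for i = 1:N
        x = pts1(i,1);  y = pts1(i,2);
        u = pts2(i,1);  v = pts2(i,2);
        A(2*i-1, :) = [x y 1 0 0 0 -u*x -u*y];
        A(2*i,   :) = [0 0 0 x y 1 -v*x -v*y];
        b(2*i-1) = u;
        b(2*i)   = v;
    end
    h = A \ b;                       % least-squares solution when N > 4
    H = reshape([h; 1], 3, 3)';      % [a b c; d e f; g h 1]
end
```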

The second part of this project explored automatically finding matching points. This takes four steps: finding points, selecting some subset of evenly distributed points, matching points between two images, and eliminating outliers that have no match.

  1. Corners are good points to use as correspondences, because their location is clear and they are relatively easy to match between images. Here, we find corners using a Harris detector (see the sketch after this list). The basic idea is that a region of the image which contains a corner will be very different, in a sum-squared-difference sense, from all other nearby patches. The intuition behind this is that if the patch is on a line, motion along the line will produce little change while motion across it will cause large differences. On a corner, there are differences in all directions.

  2. After step one, we have a lot of points; the Harris detector rates every pixel in the image based on how corner-y it is. We select corners using Adaptive Non-Maximal Suppression (ANMS), as proposed in this paper (sketched after this list). We can't just select the n strongest corners, because they will normally be clustered in the same part of the image. ANMS selects the set of points that are farthest away from stronger points. This gives a set of points that are the strongest in each region of the image, where a "region" is adaptively defined for each image.

  3. After we have points in each image, we need to find how they are related. We use a simplified version of Multi-Scale Oriented Patches. The main observation is that if we select a region around each point, say a 40-pixel square, and then resize the patch so it's an 8-pixel square and normalize for lighting, we have a good descriptor of the feature in the image at that point. Once we have features from both images, we can compare all pairs of features to find the ones that match.

    We know two features match if their difference, again in the SSD sense, is less than some threshold. It turns out that using pure distance makes it too hard to pick a good threshold, one which keeps enough real matches while discarding as many bad matches as possible. Instead, we threshold on the ratio between the nearest neighbor (best match) and the second nearest neighbor (second-best match), as sketched after this list. If the nearest neighbor is much better than the second best, we assume that the match is good. If both neighbors are equally good, or rather equally bad, we assume that there's no match and discard the points.

  4. Even with the process above, there will still be outliers: points that just don't match. We use the well-known RANSAC algorithm to deal with this (see the sketch below). The idea is to select four correspondences at random, compute the homography, and then see how many other points agree with this homography. After several thousand iterations, we use the biggest set of points in agreement to compute the actual homography.
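Here is a rough sketch of step 1 in MATLAB. The Gaussian window size and the constant k = 0.04 are common textbook choices, not necessarily the values used in this project.

```matlab
% Sketch: Harris corner response for a grayscale image im (double).
% Every pixel gets a score; corners are local maxima of R.
function R = harrisResponse(im)
    g = fspecial('gaussian', 9, 1.5);        % Gaussian weighting window
    [Ix, Iy] = gradient(im);                 % image gradients
    Ixx = imfilter(Ix .* Ix, g);             % windowed second-moment terms
    Iyy = imfilter(Iy .* Iy, g);
    Ixy = imfilter(Ix .* Iy, g);
    k = 0.04;                                % standard Harris constant
    R = (Ixx .* Iyy - Ixy.^2) - k * (Ixx + Iyy).^2;   % det(M) - k*trace(M)^2
end
```

Candidate corners are then the local maxima of R (for example via imregionalmax), which are what get fed to ANMS.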
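A sketch of step 2, as I understand ANMS: each corner gets a suppression radius equal to the distance to the nearest clearly stronger corner, and the n corners with the largest radii are kept. The 0.9 robustness factor and the variable names are assumptions of this sketch.

```matlab
% Sketch: Adaptive Non-Maximal Suppression. xy is M-by-2 corner locations,
% strength is M-by-1 Harris responses, n is how many corners to keep.
function keep = anms(xy, strength, n)
    M = size(xy, 1);
    radius = inf(M, 1);
    for i = 1:M
        stronger = strength > 0.9 * strength(i);   % points clearly stronger than i
        if any(stronger)
            d2 = (xy(stronger,1) - xy(i,1)).^2 + (xy(stronger,2) - xy(i,2)).^2;
            radius(i) = min(d2);                   % squared suppression radius
        end
    end
    [~, order] = sort(radius, 'descend');          % most isolated strong points first
    keep = order(1:min(n, M));
end
```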
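A sketch of step 3: extracting the 8x8 descriptor from a 40x40 window and matching with the nearest/second-nearest ratio test. The ratio threshold of 0.5 is only an illustrative value, and handling of points near the image border is omitted.

```matlab
% Sketch: one descriptor at (x, y) -- a 40x40 window downsampled to 8x8,
% then bias/gain normalized (zero mean, unit variance).
function d = extractDescriptor(im, x, y)
    patch = im(y-19 : y+20, x-19 : x+20);      % assumes (x, y) is away from the border
    patch = imresize(patch, [8 8]);
    d = (patch(:)' - mean(patch(:))) / std(patch(:));
end

% Sketch: match descriptors with the nearest/second-nearest ratio test.
% desc1, desc2 are N1-by-64 and N2-by-64; matches is K-by-2 index pairs.
function matches = matchDescriptors(desc1, desc2, ratioThresh)   % e.g. ratioThresh = 0.5
    matches = zeros(0, 2);
    for i = 1:size(desc1, 1)
        diffs = bsxfun(@minus, desc2, desc1(i,:));
        dist  = sum(diffs.^2, 2);                 % SSD to every descriptor in image 2
        [best, order] = sort(dist);
        if best(1) / best(2) < ratioThresh        % clear winner -> accept the match
            matches(end+1, :) = [i, order(1)];    %#ok<AGROW>
        end
    end
end
```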
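Finally, a sketch of step 4, reusing the computeHomography sketch from above. The iteration count nIter and the inlier tolerance tol (in pixels) are assumed parameters.

```matlab
% Sketch: RANSAC for a robust homography. pts1, pts2 are N-by-2 matched points.
function [H, inliers] = ransacHomography(pts1, pts2, nIter, tol)
    N = size(pts1, 1);
    bestInliers = [];
    for iter = 1:nIter
        s  = randperm(N, 4);                                 % minimal random sample
        Hs = computeHomography(pts1(s,:), pts2(s,:));
        p  = Hs * [pts1, ones(N, 1)]';                       % project image-1 points
        proj = [p(1,:) ./ p(3,:); p(2,:) ./ p(3,:)]';        % back to inhomogeneous coords
        err  = sqrt(sum((proj - pts2).^2, 2));               % reprojection error
        idx  = find(err < tol);
        if numel(idx) > numel(bestInliers)
            bestInliers = idx;
        end
    end
    inliers = bestInliers;
    H = computeHomography(pts1(inliers,:), pts2(inliers,:));  % refit on the consensus set
end
```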

Results


Image rectification: a picture on the wall is warped so that it is parallel to the image plane.

A picture of a picture of bunnies on a wall. The same picture, but aligned to the camera.

Image rectification: this picture was taken at a low angle, and we warp it so that we are looking straight down on the 0.

A grid of tiles with numbers. The same picture, but aligned to the camera.

[Manually Stitched] The Gates Center for Computer Science. There is some blurring in the lower left - the images were taken without a tripod, so the center of projection moved slightly. Original image sequence.

The Gates Center for Computer Science

[Manually Stitched] Trees on Flagstaff Hill. Any artifacts are hidden by the density of the leaves. Original image sequence.

Trees on Flagstaff Hill

[Manually Stitched] Dinosaur outside the Carnegie Library. There is considerable blurring in the lower left - it was hard to rotate the camera exactly around the center of projection at this odd angle. The subject is also close to the camera, which doesn't help. Original image sequence.

Dinosaur statue in front of the Carnegie Library

[Automatically Stitched] This is the same dinosaur from above, but stitched automatically this time. There is still a little bit of blurring, but it is much better. It turns out that the right side of the picture is just bad for defining correspondences by hand; there are too many subtle textures which the computer handles well.

Dinosaur statue in front of the Carnegie Library

Here are some of the steps involved in automatically matching the images:

The results of the Harris detector.

The remaining 500 points after suppression.

The matching points (yellow plus signs).

It's surprising that it worked so well with only five correspondences between these two images.

[Automatically Stitched] The same dinosaur outside the Carnegie Library, stitched from 8 images. I'm surprised this worked at all, given that I was sitting on the ground close to the body and using a wide-angle lens when I took the pictures. Also interesting is that stitching this required all 8 GB of RAM in my desktop - my code is not optimized for space efficiency.

Dinosaur statue in front of the Carnegie Library

[Automatically Stitched] View from somewhere high in Yosemite National Park - I can't remember the exact location. Darn people moving in the foreground, being blurry. Original image sequence.

Yosemite National Park