Many computer vision problems, such as object classification, motion estimation or shape registration, rely on solving the correspondence problem. Existing formulations of the correspondence problem are usually NP-hard and difficult to approximate, and they lack a mechanism for feature selection. This proposal addresses the correspondence problem in computer vision: it introduces two new correspondence problems and three algorithms for solving spatial, temporal and spatio-temporal correspondence. The main contributions of the thesis are:
(1) Factorial graph matching (FGM). FGM extends existing work on graph matching (GM) by finding an exact factorization of the affinity matrix. Four benefits follow from this factorization: (a) There is no need to compute the pairwise affinity matrix, which is costly in both space and time; (b) It provides a unified framework that reveals commonalities and differences between GM methods. Moreover, the factorization provides a clean connection with other matching algorithms such as iterative closest point; (c) The factorization allows the use of a path-following optimization algorithm that leads to improved optimization strategies and matching performance; (d) Given the factorization, it becomes straightforward to incorporate geometric transformations (rigid and non-rigid) into the GM problem.
(2) Canonical time warping (CTW). CTW is a technique to temporally align multiple multi-dimensional and multi-modal time series. CTW extends dynamic time warping (DTW) in three ways: it incorporates a feature-weighting layer to adapt different modalities (e.g., video and motion capture data), it allows a more flexible warping as a combination of monotonic functions, and it has linear complexity (unlike DTW, which has quadratic complexity). We applied CTW to align human motion captured with different sensors (e.g., audio, video, accelerometers).
(3) Spatio-temporal matching (STM). STM simultaneously finds the temporal and spatial correspondence between trajectories of multi-dimensional, multi-modal time series. STM is used to solve for spatial and temporal correspondence between 2D videos and 3D data (e.g., motion capture data or Kinect).
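
For reference, the quadratic cost of the DTW baseline mentioned in contribution (2) can be illustrated with a minimal sketch of textbook dynamic time warping. This is not the proposal's CTW algorithm: it has no feature-weighting layer and handles only two sequences of equal feature dimension. The function name dtw, the squared Euclidean frame distance, and the three-step pattern (diagonal, up, left) are assumptions chosen for illustration.

```python
import numpy as np


def dtw(x, y):
    """Textbook DTW between two sequences (illustrative sketch only).

    x: array of shape (n,) or (n, d); y: array of shape (m,) or (m, d).
    Returns (total_cost, path), where path is a list of (i, j) index pairs.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Treat 1-D inputs as sequences of scalar features.
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    n, m = len(x), len(y)

    # Pairwise squared Euclidean distances between frames: an n-by-m table.
    dist = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=2)

    # Accumulated cost table; filling it takes O(n * m) time and space,
    # which is the quadratic complexity referred to in the text.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1],  # advance both sequences
                acc[i - 1, j],      # advance x only
                acc[i, j - 1],      # advance y only
            )

    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps, key=lambda s: s[0])
    path.reverse()
    return acc[n, m], path
```

As a usage example, dtw([0, 1, 2], [0, 1, 1, 2]) returns a cost of 0.0 with the path [(0, 0), (1, 1), (1, 2), (2, 3)]: the repeated value in the second sequence is absorbed by matching it twice to the same frame of the first.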
Fernando De la Torre (Chair)
Anand Rangarajan (University of Florida)