Adam Harley

Robotics PhD student at CMU

I am a second-year PhD student at Carnegie Mellon University, in The Robotics Institute, where I work with Dr. Katerina Fragkiadaki on machine learning and computer vision. Before this, I completed a Master of Science in computer science with Dr. Kosta Derpanis, and earlier still, a Bachelor of Arts in psychology.


July 2017: Two papers accepted to ICCV.

Segmentation-Aware Convolutional Networks Using Local Attention Masks

ICCV 2017

Adam W. Harley, Konstantinos G. Derpanis, and Iasonas Kokkinos
project page paper github bibtex

We introduce an approach to integrate segmentation information within a convolutional neural network (CNN). This counteracts the tendency of CNNs to smooth information across regions and increases their spatial precision. To obtain segmentation information, we set up a CNN to provide an embedding space where region co-membership can be estimated based on Euclidean distance. We use these embeddings to compute a local attention mask relative to every neuron position. We incorporate such masks in CNNs and replace the convolution operation with a "segmentation-aware" variant that allows a neuron to selectively attend to inputs coming from its own region. We call the resulting network a segmentation-aware CNN because it adapts its filters at each image point according to local segmentation cues. We demonstrate the merit of our method on two widely different dense prediction tasks that involve classification (semantic segmentation) and regression (optical flow). Our results show that in semantic segmentation we can match the performance of DenseCRFs while being faster and simpler, and in optical flow we obtain clearly sharper responses than networks that do not use local attention masks. In both cases, segmentation-aware convolution yields systematic improvements over strong baselines.
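The masking idea above can be sketched in a few lines of NumPy. This is a minimal, single-channel illustration rather than the paper's implementation: the function names, the exponential decay rate `alpha`, and the mask renormalization are all assumptions of this sketch, and the actual method operates on learned embeddings inside a CNN.

```python
import numpy as np

def attention_mask(embeddings, cy, cx, k=3, alpha=1.0):
    """Local attention mask around (cy, cx): exp(-alpha * ||e_i - e_center||)
    over a k x k neighborhood of an (H, W, D) embedding map."""
    r = k // 2
    center = embeddings[cy, cx]
    patch = embeddings[cy - r:cy + r + 1, cx - r:cx + r + 1]  # (k, k, D)
    dists = np.linalg.norm(patch - center, axis=-1)
    return np.exp(-alpha * dists)

def segmentation_aware_conv(feat, embeddings, weights, alpha=1.0):
    """'Valid' convolution of a single-channel feature map in which each
    neuron's inputs are re-weighted by its local attention mask; the mask
    is renormalized so responses stay comparable in scale across positions."""
    H, W = feat.shape
    k = weights.shape[0]
    r = k // 2
    out = np.zeros((H - 2 * r, W - 2 * r))
    for y in range(r, H - r):
        for x in range(r, W - r):
            mask = attention_mask(embeddings, y, x, k, alpha)
            mask = mask / (mask.sum() + 1e-8)
            patch = feat[y - r:y + r + 1, x - r:x + r + 1]
            out[y - r, x - r] = np.sum(patch * mask * weights)
    return out
```

When the embeddings are uniform (no boundary nearby), the mask is flat and the operation reduces to an ordinary (normalized) convolution; near a region boundary, inputs from the other side of the boundary are suppressed.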

Adversarial Inverse Graphics Networks: Learning 2D-to-3D Lifting and Image-to-Image Translation from Unpaired Supervision

ICCV 2017

Hsiao-Yu Fish Tung, Adam W. Harley, William Seto, and Katerina Fragkiadaki
paper bibtex

Researchers have developed excellent feed-forward models that learn to map images to desired outputs, such as to the images’ latent factors, or to other images, using supervised learning. Learning such mappings from unlabelled data, or improving upon supervised models by exploiting unlabelled data, remains elusive. We argue that there are two important parts to learning without annotations: (i) matching the predictions to the input observations, and (ii) matching the predictions to known priors. We propose Adversarial Inverse Graphics Networks (AIGNs): weakly supervised neural network models that combine feedback from rendering their predictions with distribution matching between their predictions and a collection of ground-truth factors. We apply AIGNs to 3D human pose estimation and 3D structure and egomotion estimation, and outperform models supervised by only paired annotations. We further apply AIGNs to facial image transformation using super-resolution and inpainting renderers, while deliberately adding biases in the ground-truth datasets. Our model seamlessly incorporates such biases, rendering input faces towards young, old, feminine, masculine or Tom Cruise-like equivalents (depending on the chosen bias), or adding lip and nose augmentations while inpainting concealed lips and noses.
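The two-part objective described above can be sketched as a single training loss for the prediction network. This is a hedged illustration under my own naming: `render` and `discriminator` stand in for the paper's differentiable renderer and adversarial critic, and the weighting `lam` is an arbitrary choice of this sketch.

```python
import numpy as np

def aign_generator_loss(x, pred, render, discriminator, lam=1.0):
    """Weakly supervised objective combining the two AIGN ingredients:
    (i) render the predicted factors back to observation space and match
        the input (reconstruction term), and
    (ii) fool a discriminator trained to separate predictions from a pool
        of unpaired ground-truth factors (distribution-matching term)."""
    recon = np.mean((render(pred) - x) ** 2)             # match the observation
    adv = -np.mean(np.log(discriminator(pred) + 1e-8))   # match the prior
    return recon + lam * adv
```

The key point is that neither term requires paired (input, factor) annotations: the reconstruction term only needs the input image, and the adversarial term only needs an unpaired collection of plausible factors.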

Segmentation-Aware Convolutional Nets

Master's Thesis

project page paper bibtex

This thesis introduces a method to both obtain segmentation information and integrate it uniformly within a convolutional neural network (CNN). This counteracts the tendency of CNNs to produce smooth predictions, which is undesirable for pixel-wise prediction tasks, such as semantic segmentation. The segmentation information is obtained by a form of metric learning, where a CNN learns to compute pixel embeddings that reflect whether any pair of pixels is likely to belong to the same region. This information is then used within a larger network to replace all convolutions with foreground-focused convolutions, where the foreground is determined adaptively at each image point by local embeddings. The resulting network is called a segmentation-aware CNN, because the network can change its behaviour at each image location according to local segmentation cues. The proposed method yields systematic improvements on a standard semantic segmentation benchmark when compared to a strong baseline.

Learning Dense Convolutional Embeddings for Semantic Segmentation

ICLR 2016 (workshop)

Adam W. Harley, Konstantinos G. Derpanis, and Iasonas Kokkinos
project page paper bibtex

This paper proposes a new deep convolutional neural network (DCNN) architecture that learns pixel embeddings, such that pairwise distances between the embeddings can be used to infer whether or not the pixels belong to the same region. That is, for any two pixels on the same object, the embeddings are trained to be similar; for any pair that straddles an object boundary, the embeddings are trained to be dissimilar. Experimental results show that when this embedding network is used in conjunction with a DCNN trained on semantic segmentation, there is a systematic improvement in per-pixel classification accuracy. Our contributions are integrated in the popular Caffe deep learning framework, and consist of straightforward modifications to convolution routines. As such, they can be exploited for any task involving convolution layers.
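One simple form the pairwise training signal could take is a hinge loss on embedding distances: same-region pairs are pulled together, boundary-straddling pairs are pushed apart up to a margin. The thresholds and function name below are illustrative assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def pairwise_embedding_loss(e1, e2, same_region, pull=0.5, push=2.0):
    """Hinge-style loss on a pair of pixel embeddings.
    Same-region pairs are penalized for distances above `pull`;
    cross-boundary pairs are penalized for distances below `push`."""
    d = np.linalg.norm(e1 - e2)
    if same_region:
        return max(0.0, d - pull)    # attract: distance should be small
    return max(0.0, push - d)        # repel: distance should exceed the margin
```

Summed over many sampled pixel pairs, a loss of this shape trains the embedding map so that a simple distance threshold approximates region co-membership.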

Evaluation of Deep Convolutional Nets for Document Image Classification and Retrieval

ICDAR 2015

Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis
project page paper bibtex

This paper presents a new state-of-the-art for document image classification and retrieval, using features learned by deep convolutional neural networks (CNNs). In object and scene analysis, deep neural nets are capable of learning a hierarchical chain of abstraction from pixel inputs to concise and descriptive representations. The current work explores this capacity in the realm of document analysis, and confirms that this representation strategy is superior to a variety of popular handcrafted alternatives. Extensive experiments show that (i) features extracted from CNNs are robust to compression, (ii) CNNs trained on non-document images transfer well to document analysis tasks, and (iii) enforcing region-specific feature-learning is unnecessary given sufficient training data. This work also makes available a new labelled subset of the IIT-CDIP collection, containing 400,000 document images across 16 categories.

RVL-CDIP dataset

Adam W. Harley, Alex Ufkes, and Konstantinos G. Derpanis
project page bibtex

The RVL-CDIP (Ryerson Vision Lab Complex Document Information Processing) dataset consists of 400,000 grayscale images in 16 classes, with 25,000 images per class. There are 320,000 training images and 80,000 test images. The images are sized so their largest dimension does not exceed 1000 pixels.
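The sizing rule above can be expressed as a small helper computing the output dimensions, assuming the aspect ratio is preserved; the function name and rounding choice here are illustrative, not part of the dataset specification.

```python
def rvlcdip_resize_dims(h, w, max_dim=1000):
    """Dimensions after scaling so the largest side is at most `max_dim`,
    preserving aspect ratio; images already within the cap are unchanged."""
    scale = min(1.0, max_dim / max(h, w))
    return int(round(h * scale)), int(round(w * scale))
```

For example, a 2000x1000 scan would come out at 1000x500, while an 800x600 scan would be left at its original size.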

An Interactive Node-Link Visualization of Convolutional Neural Networks

Adam W. Harley
project page paper bibtex

Convolutional neural networks are at the core of state-of-the-art approaches to a variety of computer vision tasks. Visualizations of neural networks typically take the form of static node-link diagrams, which illustrate only the structure of a network, rather than its behavior. Motivated by this observation, this paper presents a new interactive visualization of neural networks trained on handwritten digit recognition, with the intent of showing the actual behavior of the network given user-provided input. The user can interact with the network through a drawing pad, and watch the activation patterns of the network respond in real time.