Learning Large Deep Belief Networks using Graphics Processors
Rajat Raina, Andrew Y. Ng
Stanford University
Recent work on parallelizing learning algorithms has focused on using
multiple CPUs or multiple cores to learn large models.
In this work, we show that modern graphics processors (GPUs) can be
successfully applied to parallelizing a large-scale learning problem,
involving parameter learning in a large, undirected graphical model.
In contrast to general-purpose multicore devices, GPUs are highly
specialized for certain computation-intensive tasks, but on those
tasks they provide massive parallelism at a fraction of the cost of
comparable multicore hardware.
The particular undirected graphical model we consider in this work is
the deep belief network (DBN). DBNs are multi-layer networks that can
be trained layer-by-layer, primarily using unlabeled data. In recent
years, researchers have shown that each layer of a DBN can be trained
efficiently using the contrastive divergence algorithm (Hinton, 2000),
and that a sparse DBN variant can learn features similar to those
observed biologically in visual area V2 (Lee et al., 2007). DBNs have
also been applied successfully to tasks such as handwritten
character recognition and information retrieval (Hinton &
Salakhutdinov, 2006).
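To make the layer-wise training concrete, a single CD-1 step for one restricted Boltzmann machine layer can be sketched as follows (a minimal numpy sketch; the layer sizes, learning rate, and function names are illustrative and not taken from the paper, and biases are omitted for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, v0, lr=0.01, rng=np.random.default_rng(0)):
    """One contrastive-divergence (CD-1) step for an RBM with weights W.

    v0: batch of binary visible vectors, shape (batch, n_visible).
    Returns the updated weight matrix.
    """
    # Positive phase: hidden activations given the data.
    h0_prob = sigmoid(v0 @ W)                        # (batch, n_hidden)
    h0 = (rng.random(h0_prob.shape) < h0_prob) * 1.0
    # Negative phase: one step of Gibbs sampling from the sampled hiddens.
    v1_prob = sigmoid(h0 @ W.T)                      # reconstruction
    h1_prob = sigmoid(v1_prob @ W)
    # Contrastive-divergence weight update (positive minus negative stats).
    grad = (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
    return W + lr * grad

# Toy usage: 6 visible units, 4 hidden units, batch of 8 binary vectors.
rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((6, 4))
v = (rng.random((8, 6)) < 0.5) * 1.0
W = cd1_update(W, v)
```

Note that the update consists almost entirely of dense matrix products, which is what makes the algorithm a good fit for GPU acceleration.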
However, the previously published work on DBNs has been confined to
relatively small networks. For example, when applied to image data,
such networks can be learned for small image "patches" that are, say,
40x40 pixels in size. We would ideally like to apply similar models
to learn deep, hierarchical networks over large images with tens of
thousands of pixels, not just over small image patches. Such models
have tens of millions of independent parameters and can take weeks to
learn on a single processor with current algorithms.
Specifically, we consider a DBN model for large images, in which the
lowermost layer contains one visible unit per image pixel, and each
successive higher layer contains hidden variables that are connected
locally to a subset of the variables in the previous layer. These
subsets overlap with each other, and thus all the parameters for a
layer are coupled together. We show that the contrastive divergence
algorithm can be implemented successfully on the GPU using the Nvidia
CUDA programming model. Crucially, the main bottleneck in applying
GPUs is the overhead incurred in transferring data to and from the
GPU. We show that the data transfer can be drastically reduced by
storing and updating all the parameters permanently on the graphics
card itself. We use custom parallel kernels for computing several
components of the contrastive divergence update on the GPU, and
further accelerate matrix operations using highly optimized BLAS
routines written specifically for the GPU.
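The data-transfer strategy above can be sketched as follows. Plain numpy arrays stand in for GPU-resident buffers (in the actual system these would be device memory updated by CUDA kernels and GPU BLAS); the `to_device` helper and all sizes are illustrative assumptions, included only to show that input batches cross the bus each step while the parameters never do:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# In the real system the weights live permanently in GPU memory;
# here an ordinary array stands in for that device-resident buffer.
rng = np.random.default_rng(0)
W_device = 0.1 * rng.standard_normal((16, 8))  # stays "on device" all run

transfers = 0
def to_device(batch):
    """Stand-in for a host-to-GPU copy; copies are counted to highlight
    that only input batches are transferred, never the parameters."""
    global transfers
    transfers += 1
    return batch.copy()

for step in range(100):
    # The only per-step transfer: one batch of training data.
    v0 = to_device((rng.random((32, 16)) < 0.5) * 1.0)
    # Everything below is dense matrix algebra, the kind of work
    # GPU BLAS routines accelerate; W_device is updated in place.
    h0 = sigmoid(v0 @ W_device)
    v1 = sigmoid(h0 @ W_device.T)
    h1 = sigmoid(v1 @ W_device)
    W_device += 0.01 * (v0.T @ h0 - v1.T @ h1) / v0.shape[0]
```

Under this scheme, 100 training steps cost 100 batch uploads and zero parameter transfers, which is the source of the drastic reduction in bus traffic described above.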
With our method, we can scale DBN learning far beyond previous
experiments. We can learn a 4-layer network over 160x160 pixel images
in 12 hours using a single Nvidia GeForce 8600GT graphics card. This
network has more than 40 million independent parameters, roughly two
orders of magnitude more than in previously published work on deep
belief networks.
We also analyze the parameters learned by our large-scale model to
show that higher layers automatically learn to group similar lower
layer features (e.g., edges) over successively larger areas of the
image, and thus capture several interesting invariances in the input
data.