Learning Large Deep Belief Networks using Graphics Processors

Rajat Raina, Andrew Y. Ng
Stanford University

Recent work on parallelizing learning algorithms has focused on using multiple CPUs or multiple cores to learn large models. In this work, we show that modern graphics processors (GPUs) can be successfully applied to parallelizing a large-scale learning problem: parameter learning in a large, undirected graphical model. In contrast to general-purpose multicore devices, GPUs are highly specialized for only certain types of computation-intensive tasks, but they provide massive parallelism on those tasks at a fraction of the cost of achieving similar parallelism with multicore CPUs.

The particular undirected graphical model we consider in this work is the deep belief network (DBN). DBNs are multi-layer networks that can be trained layer by layer, primarily using unlabeled data. In recent years, researchers have shown that each layer of a DBN can be trained efficiently using the contrastive divergence algorithm (Hinton, 2000), and that a sparse DBN variant can learn features similar to those observed biologically in visual area V2 (Lee et al., 2007). DBNs have also been applied successfully to tasks such as handwritten character recognition and information retrieval (Hinton & Salakhutdinov, 2006).

However, previously published work on DBNs has been confined to relatively small networks. For example, when applied to image data, such networks can be learned for small image "patches" that are, say, 40x40 pixels in size. We would ideally like to apply similar models to learn deep, hierarchical networks over large images with tens of thousands of pixels, not just over small image patches. Such models have tens of millions of independent parameters and can take weeks to learn on a single processor with current algorithms. Specifically, we consider a DBN model for large images in which the lowermost layer contains one visible unit per image pixel, and each successive higher layer contains hidden variables that are connected locally to a subset of the variables in the previous layer. These subsets overlap with each other, so all the parameters for a layer are coupled together.

We show that the contrastive divergence algorithm can be implemented successfully on the GPU using the Nvidia CUDA programming model. Crucially, the main bottleneck in applying GPUs is the overhead incurred in transferring data to and from the GPU. We show that this data transfer can be drastically reduced by storing and updating all the parameters permanently on the graphics card itself. We use custom parallel kernels to compute several components of the contrastive divergence update on the GPU, and further accelerate matrix operations using highly optimized BLAS routines written specifically for the GPU.

With our method, we can scale far beyond previously published DBN experiments. We learn a 4-layer network over 160x160 pixel images in 12 hours using a single Nvidia GeForce 8600GT graphics card. This network has more than 40 million independent parameters, roughly two orders of magnitude more than in previous work on deep belief networks. We also analyze the parameters learned by our large-scale model and show that higher layers automatically learn to group similar lower-layer features (e.g., edges) over successively larger areas of the image, thus capturing several interesting invariances in the input data.
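
To make the GPU recipe above concrete, the following is a minimal CUDA sketch of one contrastive-divergence (CD-1) update for a single, densely connected RBM layer, with the weight matrix kept permanently resident in GPU memory so that only the mini-batch is copied over each step. The function and buffer names (cd1_update, sigmoid_and_sample, d_W, d_v0, and so on), the use of cuBLAS for the matrix products, and the column-major layout are illustrative assumptions rather than the paper's actual code; the locally connected, overlapping receptive fields, the sparsity penalty, and the bias terms are not shown.

// Sketch of one CD-1 step for a single RBM layer (illustrative, not the
// paper's implementation). Weights stay resident on the GPU; only the
// mini-batch d_v0 needs to be transferred each step. Error checking,
// biases, and sparsity are omitted for brevity. Matrices are column-major.
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <curand.h>

// Turn pre-activations into sigmoid probabilities and Bernoulli samples.
__global__ void sigmoid_and_sample(float* act, const float* uniform,
                                   float* sample, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float p = 1.0f / (1.0f + expf(-act[i]));
        act[i]    = p;                               // overwrite with probability
        sample[i] = (uniform[i] < p) ? 1.0f : 0.0f;  // binary sample
    }
}

// Apply the logistic sigmoid in place (used for the mean-field reconstruction).
__global__ void sigmoid_inplace(float* act, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) act[i] = 1.0f / (1.0f + expf(-act[i]));
}

// d_W  : n_vis x n_hid weights, permanently resident on the GPU
// d_v0 : n_vis x batch visible mini-batch (one example per column)
// d_h0, d_h0s, d_h1 : n_hid x batch scratch; d_v1 : n_vis x batch scratch
// d_rand : uniform random numbers, at least n_hid * batch entries
void cd1_update(cublasHandle_t blas, curandGenerator_t rng,
                float* d_W, const float* d_v0,
                float* d_h0, float* d_h0s, float* d_v1, float* d_h1,
                float* d_rand, int n_vis, int n_hid, int batch, float eps) {
    const float one = 1.0f, zero = 0.0f;
    int nh = n_hid * batch, nv = n_vis * batch, threads = 256;

    // Positive phase: H0 = sigmoid(W^T * V0), then sample binary states H0s.
    cublasSgemm(blas, CUBLAS_OP_T, CUBLAS_OP_N, n_hid, batch, n_vis,
                &one, d_W, n_vis, d_v0, n_vis, &zero, d_h0, n_hid);
    curandGenerateUniform(rng, d_rand, nh);
    sigmoid_and_sample<<<(nh + threads - 1) / threads, threads>>>(d_h0, d_rand, d_h0s, nh);

    // Negative phase: mean-field reconstruction V1 = sigmoid(W * H0s),
    // then hidden probabilities H1 = sigmoid(W^T * V1).
    cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_N, n_vis, batch, n_hid,
                &one, d_W, n_vis, d_h0s, n_hid, &zero, d_v1, n_vis);
    sigmoid_inplace<<<(nv + threads - 1) / threads, threads>>>(d_v1, nv);
    cublasSgemm(blas, CUBLAS_OP_T, CUBLAS_OP_N, n_hid, batch, n_vis,
                &one, d_W, n_vis, d_v1, n_vis, &zero, d_h1, n_hid);
    sigmoid_inplace<<<(nh + threads - 1) / threads, threads>>>(d_h1, nh);

    // Gradient step accumulated directly into the resident weights:
    // W += (eps / batch) * (V0 * H0^T  -  V1 * H1^T)
    float pos = eps / batch, neg = -eps / batch;
    cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_T, n_vis, n_hid, batch,
                &pos, d_v0, n_vis, d_h0, n_hid, &one, d_W, n_vis);
    cublasSgemm(blas, CUBLAS_OP_N, CUBLAS_OP_T, n_vis, n_hid, batch,
                &neg, d_v1, n_vis, d_h1, n_hid, &one, d_W, n_vis);
}

In the model described in the abstract, each layer's local, overlapping connectivity would make the weight matrix block-structured rather than dense, but the pattern is the same: the parameters live on the card, the positive and negative phase statistics are computed with GPU BLAS calls and small custom kernels, and only mini-batches of training data cross the CPU-GPU boundary.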