Lecture 16: Building Blocks of Deep Learning

Overview of CNNs, RNNs, and attention.

Convolutional Neural Networks (CNNs)

CNNs are biologically-inspired variants of MLPs that exploit the strong spatial local correlations present in images. The biological concept of the receptive field states that the visual cortex contains a complex arrangement of cells that are sensitive to small sub-regions that are tiled to cover the visual field. CNNs enjoy sparse connectivity, shared weights, and a hierarchy of representation. Stacking multiple layers can result in lower layers learning low-level features, while upper layers learn high-level representations. Using the biological analogue, simple cells detect local features while complex ones pool the outputs of simpler cells.

**Figure 1** Hierarchy of Features in CNNs

One type of layer in a CNN is the convolutional layer. These involve taking taking filter kernels and convolving them over the image. This has the effect of filtering an image and preserves the local connectivity of an image. The use of convolutions also allows for parameter sharing, whereby the same parameters (i.e., each filter kernel) can be applied to multiple parts of the same image to greatly reduce the model’s complexity. This process creates a feature map from the responses for each filter, which can be fed into a pooling layer. The pooling layer helps to reduce the dimensionality of the input space by downsampling the feature maps at each layer. For example, max pooling with a kernel size of 2x2 only takes the max pixel response in each set of 2x2 pixels. This downsamples the image by a factor of 4. Average pooling would instead take the average of the pixels in a kernel. The advantage of pooling is to improve robustness to the exact spatial location of features, as anything that is “close enough” would be pooled into the same output. Many CNNs involve stacking multiple alternating convolution and pooling layers together to build the aforementioned hierarchy of representation.

Examples of ConvNets show a trend towards ever increasing numbers of layers:

AlexNet, 8 layers
VGG, 19 layers
GoogLeNet, 22 layers
ResNet, 152 layers

Recurrent Neural Networks (RNNs)

The temporal (or sequential) analogue to the CNN is the RNN. RNNs can have a variable number of computation steps unlike CNNs. Unlike MLPs and CNNs, RNN outputs depend not only on the current input, but also on the previous states of hidden layers.

LSTMs and the Vanishing/Exploding Gradient Problem

Unrolling an RNN for several steps results in multiple products with W and applying tanh multiple times. The hidden states that are passed on to each successive cell follow this expression:

$h_t = tanh(W^{hh} h_{t-1} + W^{hx} x_t)$

As you backpropagate backward to $h_0$, there will be many repeated factors of W and tanh. If the singular value of the W matrix is greater than 1, this can result in exploding gradients; similarly, singular values under 1 can result in vanishing gradients. This is because product over the W matrices during backprop will result in either exponential growth/decay in value. A solution to this problem is to use gradient clipping . In the case of exploding gradients, this involves checking if the norm of the gradient is greater than some threshold. If so, the gradient is scaled by the threshold divided by the norm.

LSTMs are designed to solve the long-term dependency problem by creating a path with uninterrupted gradient flow during backpropagation . Internally, they are more complicated than a vanilla RNN.

They use linear memory cells and multiplicative gates to store read, write, and reset information.

Forget gate: decides what must be removed from $h_{t-1}$.

$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$

Input gate: whether to write to cell and what information should be stored

\begin{aligned} i_t &= \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \\ \tilde{C}_t &= tanh(W_c \cdot [h_{t-1}, x_t] + b_c) \end{aligned}

Update cell state: $f_t$ controls how much of $C_{t-1}$ is forgotten. $i_t$ scales our new candidate values by how much we want to update each state value.

$C_t = f_t \times C_{t-1} + i_t \times \tilde{C}_t$

Output gate: decides what to output from the cell state

\begin{aligned} o_t &= \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \\ h_t &= o_t \times tanh(C_t) \end{aligned}

The sigmoid in $o_t$ decides which part of the cell state will be outputted.

As can be seen, LSTMs allow for a path with uninterrupted gradient flow, which helps mitigate the long-term dependency problem. There is no need to multiply by the W matrix during backprop, which was the source of the growth/decay; instead, you multiply by the different values of the gates. While this does not totally eliminate the vanishing/exploding gradient problem, it makes it much less likely, as there is usually a path where the gradient does not explode/vanish.

Different Flavors of RNNs

**Figure** Some Conventional Variants of RNNs

Bi-directional RNN: The hidden state is the concatenated result of both forward and backward hidden states so that it can capture past and future information.

Tree-structured RNN: Allows for the hidden states to be conditioned on an input and the hidden states of arbitrarily many child units. The standard LSTM, for example, is a special case conditioned on only one child unit.

RNNs for 2D Sequences: Fixed structures for 2D images that capture different dependencies and hierarchical relations.

RNNs for Graph Structures: Used in image segmentation.

Attention Mechanisms

Attention mechanisms are techniques that are used to focus on particular features in the data. They has been show to drastically improve performance in tasks such as machine translation, image captioning and speech recognition. They allow to accomodate for long-range dependencies, and dealing with the problem of vanishing gradients, seen in RNNs. By allowing for fine-grained localized representations of portions of data, like patches in images or words in sentences, attention improves feature recognition in the model.

Attention Computation

Attention can be computed for a machine translation task using the following procedure:

Encode each token in the input sentence into a key vector.
When decoding, compare the query vector with the encoder states, and generate alignment scores corresponding to each key vector.
Compute the attention weights by normalizing the alignment scores.
Treat the encoder states as value vectors and compute a weighted sum, using the attention weights.
Use this weighted sum in the decoder to generate the next token.

Attention Variants

There are a number of different alignment score functions that may be used to generate scores. Some of these are shown in the table below:

Soft and hard attention are variants of attention that respectively use deterministic and stochastic methods in computing the weights for each token. The computation described above is for soft attention. Instead of using the attention weights to compute a weighted average, hard attention uses these as probabilities and samples from the corresponding features using this distribution. A comparison for attention used on images can be illustrated below. Notice how soft attention can be diffuse, and assign nonzero weights to significant weights to large portions of the image at times, while hard attention focuses on a particular equally-sized part of the image in each case. Soft attention is presently the more popular variant, primarily because it allows for simpler backpropagation in the network.

Applications in Computer Vision

Attention can be used in conjunction with conventional CNNs in image processing. Features extracted from the CNN are used as key vectors and attention is used to sequentialy compute the caption as tokens. It can also be used in image paragraph generation, which is the generation of a long paragraph to describe an image. This is a challenging task because it involves long-term reasoning about language and visual features. Each sentence needs to be grounded on visual features to ensure contentful descriptions. One technique for doing this, presented in Xu et al. (2017) , proceeds as follow:

The image is first segmented into semantic regions, and each region is locally captioned with phrases
Attention is applied on the visual regions and the text phrases and the resulting features fed to a generator which generates sentences
These sentences are then fed to a sentence discriminator and a recurrent topic-transition discriminator for assessing sentence plausibility and topic coherence respectively
A paragraph description corpus is adopted to provide linguistic knowledge about paragraphgeneration, which depicts the true data distribution of the discriminators

The entire pipeline can be seen in the figure below:

Transformers: Multi-Headed Attention

Transformer

Recently, Vaswani et al. (2017) debuted a novel, non-recurrent neural network architecture composed purely of self-attention called the Transformer. The Transformer has attained state-of-the-art results in many sequence-to-sequence natural language processing tasks, such as machine translation. Since the Transformer architecture lacks recurrent structure, it can be more amenable to learning long-range dependencies over sequences while also improving training and inference speed.

As shown below, the Transformer employs multi-headed self-attention, in which multiple attention layers run in parallel. Intuitively, this can enable different heads to focus on different parts of the sequence.

Formally, multiple heads of Queries $Q$ , Keys $K$ of dimension $d_k$ , and Values $V$ can be packed together into separate matrices to allow for attention to be computed efficiently using the scaled dot-product variant, which they suggest prevents diminished gradients during training:

$\text{Attention}(Q, K, V) = \text{softmax}(\frac{QK^\top}{\sqrt{d_k}}) V$

Multi-headed attention can then jointly attend to information from multiple different representations at different positions by:

$\begin{aligned} \text{MultiHead}(Q, K, V) &= \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O \\ \text{where head}_i &= \text{Attention}(QW_i^Q, KW_i^K, VW_i^V) \end{aligned}$

Vaswani et al. (2017) obtain a single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, which is two orders of magnitude less training time than recurrent approaches. They visualize the weights of the multiple attention heads to try to explain that each head learns separate information, such as long-term dependencies. Also, they demonstrate the Transformer’s ability to learn structured outputs for English constituency parsing by beating all other discriminative recurrent sequence-to-sequence methods.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Unsupervised Language Model Pre-training

Recently, language representation via unsupervised language model pre-training has revolutionized the field of natural language processing. Parameters learned during the training of large language models using self-supervision have been shown to be extremely effective when transferred to other NLP prediction tasks. Since language modeling requires the resolving of long-term dependencies, hierarchical relations, and sentiment, it can be seen as an ideal source task for transfer learning in NLP.

ELMo introduce deep contextualized word representations, which are learned functions of the internal states of a deep bidirectional LSTM language model trained to predict both the next word in a sentence given its history and the previous word in a sentence given its future words. These contextualized representations can then be frozen and used as embeddings for other downstream tasks like question answering, textual entailment, and sentiment analysis. Instead of just transferring word embeddings for a new task, Howard and Ruder’s (2018) ULMFiT aims to transfer the language model itself for new tasks. Particularly, the authors train an AWD-LSTM language model on 103 million words of Wikipedia data, fine-tune on a smaller amount of task-specific data using different learning rates for different layers of the model, and add a final classifier on the end of the network for the target task. Thus, while ELMo requires task-specific architectures when transferring to new tasks, ULMFiT simply adds a classifier on top of the language model to obtain state-of-the-art results on six benchmarks. OpenAI then adapted this method (dubbed Generative Pre-training or GPT) to work with the popular Transformer architecture, an auxiliary language modeling loss during fine-tuning to obtain even better results, and adaptation to more difficult tasks such as machine translation. Just this year, OpenAI followed up their work with GPT2, which is the highest performing language model to date. They use a similar but much higher capacity model to GPT as they find that capacity improves performance log-linearly. Due to its high, human-like language generation performance, they controversially decided to not release their largest model: a 1.5B parameter Transformer trained on 8 million documents of web text.

BERT

In BERT (Bidirectional Encoder Representations from Transformers), Devlin et al. (2018) used a bidirectional Transformer architecture to great advantage in order to obtain improved contextualized word embeddings in an extension to OpenAI GPT . The paper introduces two new objectives to adapt the traditional task of predicting the next word in language modeling to benefit from bidirectionality. After encoding each word in a given sentence into a contextualized representation, they have the model both predict a random masked word from the original sentence and perform a binary classification on two sentences to identify whether or not one sentence follows the other. Although the masked language model objective requires more pre-training steps since each prediction is no longer sequential, they find that performance increases over the traditional objective are immediate. They find that the next sentence classification objective is particularly beneficial to tasks like natural language inference and question answering since they require multi-sentence reasoning.

Example of the masked language model objective. Image credit: The Illustrated Transformer by Jay Alammar.

In the paper, the authors report using both Wikipedia data (2.5B words) and eBook data (800M words) for training a Transformer encoder with hundreds of millions of parameters. An ablation study on model size empirically shows that extreme model sizes lead to large improvements on even very small scale tasks, provided that the model has been sufficiently pre-trained. Although training takes many more steps to converge than a traditional language model objective, BERT, with only single output layer modifications, performs at the state-of-the-art for eleven NLP tasks including sentiment, question answering, and natural language inference. While Google is able to train BERT in just 4 days on 4 TPU pods, training is impractical for academics with traditional GPU resources, For example, a standard 4 GPU desktop with an RTX 2080Ti would take almost 99 days to complete training!