Encoder-decoder Models
- formulas (a code sketch follows the equations):
 $$ m_t^{(f)}=M_{\cdot,f_t}^{(f)} $$
 $$ h_t^{(f)}=\begin{cases}RNN^{(f)}(m_t^{(f)},h_{t-1}^{(f)}) & t\geq 1\\ 0 & \text{otherwise}\end{cases} $$
 $$ m_t^{(e)}=M_{\cdot,e_{t-1}}^{(e)} $$
 $$ h_t^{(e)}=\begin{cases}RNN^{(e)}(m_t^{(e)},h_{t-1}^{(e)}) & t\geq 1\\ h_{\left|F\right|}^{(f)} & \text{otherwise}\end{cases} $$
 $$ p_t^{(e)}=\mathrm{softmax}(W_{hs}h_t^{(e)}+b_s) $$
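The equations above translate almost directly into code. A minimal sketch, assuming PyTorch, with GRU cells standing in for the unspecified RNN^(f)/RNN^(e) and made-up vocabulary and hidden sizes; it illustrates the formulas and is not a reference implementation.

```python
# Minimal encoder-decoder sketch (PyTorch assumed; sizes are hypothetical).
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, trg_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.M_f = nn.Embedding(src_vocab, emb_dim)  # lookup m_t^(f) = M^(f)_{.,f_t}
        self.M_e = nn.Embedding(trg_vocab, emb_dim)  # lookup m_t^(e) = M^(e)_{.,e_{t-1}}
        self.rnn_f = nn.GRUCell(emb_dim, hid_dim)    # RNN^(f) (GRU chosen here)
        self.rnn_e = nn.GRUCell(emb_dim, hid_dim)    # RNN^(e)
        self.W_hs = nn.Linear(hid_dim, trg_vocab)    # W_hs and b_s

    def forward(self, f_ids, e_prev_ids):
        # encoder: h_0^(f) = 0, then h_t^(f) = RNN^(f)(m_t^(f), h_{t-1}^(f))
        h = torch.zeros(1, self.rnn_f.hidden_size)
        for f_t in f_ids:
            h = self.rnn_f(self.M_f(f_t).unsqueeze(0), h)
        # decoder: h_0^(e) = h_{|F|}^(f); feed the previous target word e_{t-1}
        logits = []
        for e_prev in e_prev_ids:
            h = self.rnn_e(self.M_e(e_prev).unsqueeze(0), h)
            logits.append(self.W_hs(h))              # pre-softmax scores for p_t^(e)
        return torch.log_softmax(torch.cat(logits, dim=0), dim=-1)

model = EncoderDecoder(src_vocab=1000, trg_vocab=1000)
f = torch.tensor([4, 8, 15])        # source word ids f_1..f_|F|
e_prev = torch.tensor([1, 23, 42])  # previous target words, starting with <s>
log_p = model(f, e_prev)            # (|E|, trg_vocab) log p_t^(e)
```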
Generating Output
- Random Sampling - usage: get a variety of outputs for a particular input (e.g., in a dialogue system)
- Ancestral sampling: sample each word from $$ P(e_t\vert\widehat e_1^{t-1}) $$
- Calculate the sentence probability
 $$ P\left(\widehat E\vert F\right)=\prod_{t=1}^{\left|\widehat E\right|}P({\widehat e}_t\vert F,\widehat E_1^{t-1}) $$
- problem: numerical precision (underflow), so sum the log probabilities instead of multiplying (see the sketch below)
 
 
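A sketch of ancestral sampling together with the log-probability bookkeeping, assuming a hypothetical `step(prefix)` function that returns the model's next-word distribution over the target vocabulary given the source sentence and the prefix so far (the name, the EOS convention, and the seed handling are illustrative):

```python
# Ancestral sampling sketch; `step` is a hypothetical model interface.
import numpy as np

def sample_translation(step, eos_id, max_len=100, seed=0):
    rng = np.random.default_rng(seed)
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        p = step(prefix)                    # P(e_t | F, e_1^{t-1}), shape (vocab,)
        e_t = int(rng.choice(len(p), p=p))  # sample the next word
        log_prob += np.log(p[e_t])          # sum of logs, not product of probabilities
        prefix.append(e_t)
        if e_t == eos_id:                   # stop at the end-of-sentence symbol
            break
    return prefix, log_prob
```

Running it several times with different seeds gives the variety of outputs mentioned above; `log_prob` is the sentence log-probability computed as a sum rather than a product, avoiding the numerical-precision problem.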
- Greedy 1-best Search - just like ancestral sampling, except that instead of sampling we pick the most probable word (sketch below):
 $$ \widehat e_t\;=\;\underset i{argmax}\;p_{t,i}^{(e)} $$
- not guaranteed to find the translation with the highest overall probability (a common shortcoming of greedy algorithms)
 
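Greedy 1-best decoding is the same loop as the sampling sketch above with the sampling step replaced by an argmax (same hypothetical `step` interface):

```python
# Greedy 1-best decoding sketch; `step` is the same hypothetical interface.
import numpy as np

def greedy_translation(step, eos_id, max_len=100):
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        p = step(prefix)
        e_t = int(np.argmax(p))            # \hat e_t = argmax_i p_{t,i}^{(e)}
        log_prob += np.log(p[e_t])
        prefix.append(e_t)
        if e_t == eos_id:
            break
    return prefix, log_prob
```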
- Beam Search - keep only the b best partial hypotheses at each time step (pruning); see the sketch below
- heuristic search - better than greedy, but still not guaranteed to find the best translation
 
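A beam-search sketch under the same assumed `step(prefix)` interface: at every time step each surviving hypothesis is expanded, the candidate list is pruned back to the `beam_size` best, and finished hypotheses (those ending in EOS) are set aside; the best-scoring hypothesis overall is returned.

```python
# Beam search sketch (pruning to the beam_size best hypotheses per step).
import numpy as np

def beam_search(step, eos_id, beam_size=5, max_len=100):
    beams = [([], 0.0)]                     # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = np.log(step(prefix))
            # only the beam_size best continuations of each hypothesis can survive
            for e_t in np.argsort(log_p)[-beam_size:]:
                candidates.append((prefix + [int(e_t)], score + float(log_p[e_t])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:    # prune
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                   # every hypothesis has finished
            break
    return max(finished + beams, key=lambda c: c[1])
```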
- Length normalization - problem: decoding tends to prefer shorter sentences
- beam search with a larger beam size has a significant length bias towards short sentences
 
- solution - use prior knowledge: the length of the target sentence correlates with the length of the source sentence (Tree-to-Sequence Attentional Neural Machine Translation)
 $$ P(\left|E\right|\;\vert\;\left|F\right|) $$
 $$ \widehat E\;=\underset E{\;argmax}\;\left(\log P(\left|E\right|\;\vert\;\left|F\right|)+\log P(E\;\vert\;F)\right) $$
- how to get the prior? Estimate it from relative counts in the training data:
 $$ P(\vert E\vert\;\vert\;\vert F\vert)\;=\;\frac{c(\vert E\vert,\;\vert F\vert)}{c(\left|F\right|)} $$
 or
 $$ \widehat E\;=\underset E{\;argmax}\;\frac{\log P(E\;\vert\;F)}{\left|E\right|} $$ (highest average log probability per word; both variants are sketched below)
 
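Both variants can be written as small re-scoring functions. A sketch, where `log_prob` is the summed log-probability of a finished hypothesis and `length_counts` is a hypothetical dictionary of (target length, source length) counts gathered from the training data (names and smoothing are illustrative):

```python
# Length-normalization sketches; names and data structures are illustrative.
import numpy as np

def score_with_length_prior(log_prob, trg_len, src_len, length_counts):
    # log P(|E| | |F|) + log P(E | F), prior estimated from relative counts
    c_pair = length_counts.get((trg_len, src_len), 0) + 1e-10   # tiny smoothing
    c_src = sum(c for (t, s), c in length_counts.items() if s == src_len) + 1e-10
    return np.log(c_pair / c_src) + log_prob

def score_per_word(log_prob, trg_len):
    # alternative: highest average log-probability per word
    return log_prob / trg_len
```

Either score replaces the raw summed log-probability when picking the final hypothesis out of the beam.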
 
Bidirectional Encoders
- reverse encoder - motivation: helps for language pairs with similar ordering (e.g., English-French)
 
- bi-directional encoder (more robust for typologically distinct languages); a sketch follows the equations
 $$ \overrightarrow h_t^{(f)}=\begin{cases}\overrightarrow{RNN}^{(f)}(m_t^{(f)},\overrightarrow h_{t-1}^{(f)}) & t\geq 1\\ 0 & \text{otherwise}\end{cases} $$
 $$ \overleftarrow h_t^{(f)}=\begin{cases}\overleftarrow{RNN}^{(f)}(m_t^{(f)},\overleftarrow h_{t+1}^{(f)}) & t\leq\left|F\right|\\ 0 & \text{otherwise}\end{cases} $$
- flexible combination of the two final hidden states:
 $$ h_0^{(e)}=\tanh(W_1{\overrightarrow h}_{\vert F\vert}^{(f)}+W_2{\overleftarrow h}_1^{(f)}+b_e) $$
 
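A bidirectional-encoder sketch, assuming PyTorch and hypothetical sizes: a single `nn.GRU` with `bidirectional=True` computes the forward and backward state sequences, and the two final states (forward at position |F|, backward at position 1) are combined with a learned transform into h_0^(e); the single linear layer plays the role of W_1, W_2, and b_e above.

```python
# Bidirectional encoder sketch (PyTorch assumed; sizes are hypothetical).
import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.M_f = nn.Embedding(src_vocab, emb_dim)
        self.birnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * hid_dim, hid_dim)  # stands in for [W_1 W_2], b_e

    def forward(self, f_ids):                 # f_ids: (batch, |F|) source word ids
        _, h_n = self.birnn(self.M_f(f_ids))  # h_n: (2, batch, hid_dim)
        fwd_last, bwd_first = h_n[0], h_n[1]  # forward h_|F| and backward h_1
        return torch.tanh(self.combine(torch.cat([fwd_last, bwd_first], dim=-1)))

enc = BiEncoder(src_vocab=1000)
h0_e = enc(torch.tensor([[4, 8, 15]]))        # initial decoder state, shape (1, 128)
```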
Sentence Embedding Methods
- Auto-encoding - Semi-supervised Sequence Learning
- re-generate the input sentence from its encoding
 
- Language modeling
- Predicting context - Skip-Thought Vectors
- predict the surrounding sentences
- fixed-length embedding - prevents overfitting
- pretraining
- different software
 
- fine-tuning the embedding - increases expressivity
 
 
- Predicting paraphrases - Towards universal paraphrastic sentence embeddings
- similar sentences have similar embeddings (sketch below)
- PARANMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
 
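A sketch of the "similar sentences have similar embeddings" idea as a training objective: a margin loss on cosine similarity that pulls a paraphrase pair together and pushes a sampled negative away (PyTorch assumed; the function name, margin value, and averaged-word-vector encoder are illustrative, not the papers' exact setup).

```python
# Paraphrastic embedding objective sketch (hypothetical names; PyTorch assumed).
import torch
import torch.nn.functional as F

def paraphrase_margin_loss(emb_a, emb_b, emb_neg, margin=0.4):
    """emb_*: (batch, dim) sentence embeddings, e.g. averaged word vectors."""
    pos = F.cosine_similarity(emb_a, emb_b)    # paraphrase pair: should be high
    neg = F.cosine_similarity(emb_a, emb_neg)  # sampled non-paraphrase: should be low
    return torch.clamp(margin - pos + neg, min=0.0).mean()

# toy usage with random vectors standing in for sentence embeddings
a, b, n = torch.randn(8, 300), torch.randn(8, 300), torch.randn(8, 300)
loss = paraphrase_margin_loss(a, b, n)
```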
- Predicting sentence features 
- Contextual embedding 
- Misc 
Further Reading
- Several studies on natural language and back-propagation - first proposed the idea of performing translation using neural networks
 
- Learning recursive distributed representations for holistic computation - further expanded to recurrent networks
 
- Recurrent Continuous Translation Models - first example of fully neural models for translation
 
- Sequence to Sequence Learning with Neural Networks - popularized neural MT due to impressive empirical performance
 
- Learning to Decode for Future Success - about search
 
- On the Properties of Neural Machine Translation: Encoder–Decoder Approaches 
