Encoder-decoder Models
- formulas (a code sketch follows the equations):
 $$ m_t^{(f)}=M_{\cdot,f_t}^{(f)} $$
 $$ h_t^{(f)}=\begin{cases}RNN^{(f)}(m_t^{(f)},h_{t-1}^{(f)}) & t\geq 1\\ 0 & \text{otherwise}\end{cases} $$
 $$ m_t^{(e)}=M_{\cdot,e_{t-1}}^{(e)} $$
 $$ h_t^{(e)}=\begin{cases}RNN^{(e)}(m_t^{(e)},h_{t-1}^{(e)}) & t\geq 1\\ h_{\left|F\right|}^{(f)} & \text{otherwise}\end{cases} $$
 $$ p_t^{(e)}=\mathrm{softmax}(W_{hs}h_t^{(e)}+b_s) $$
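The equations above translate almost directly into code. A minimal sketch, assuming PyTorch, with GRU cells standing in for the unspecified RNN^(f)/RNN^(e) and made-up vocabulary and hidden sizes; it illustrates the formulas and is not a reference implementation.

```python
# Minimal encoder-decoder sketch (PyTorch assumed; sizes are hypothetical).
import torch
import torch.nn as nn

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, trg_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.M_f = nn.Embedding(src_vocab, emb_dim)  # lookup m_t^(f) = M^(f)_{.,f_t}
        self.M_e = nn.Embedding(trg_vocab, emb_dim)  # lookup m_t^(e) = M^(e)_{.,e_{t-1}}
        self.rnn_f = nn.GRUCell(emb_dim, hid_dim)    # RNN^(f) (GRU chosen here)
        self.rnn_e = nn.GRUCell(emb_dim, hid_dim)    # RNN^(e)
        self.W_hs = nn.Linear(hid_dim, trg_vocab)    # W_hs and b_s

    def forward(self, f_ids, e_prev_ids):
        # encoder: h_0^(f) = 0, then h_t^(f) = RNN^(f)(m_t^(f), h_{t-1}^(f))
        h = torch.zeros(1, self.rnn_f.hidden_size)
        for f_t in f_ids:
            h = self.rnn_f(self.M_f(f_t).unsqueeze(0), h)
        # decoder: h_0^(e) = h_{|F|}^(f); feed the previous target word e_{t-1}
        logits = []
        for e_prev in e_prev_ids:
            h = self.rnn_e(self.M_e(e_prev).unsqueeze(0), h)
            logits.append(self.W_hs(h))              # pre-softmax scores for p_t^(e)
        return torch.log_softmax(torch.cat(logits, dim=0), dim=-1)

model = EncoderDecoder(src_vocab=1000, trg_vocab=1000)
f = torch.tensor([4, 8, 15])        # source word ids f_1..f_|F|
e_prev = torch.tensor([1, 23, 42])  # previous target words, starting with <s>
log_p = model(f, e_prev)            # (|E|, trg_vocab) log p_t^(e)
```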
Generating Output
- Random Sampling - usage: get a variety of outputs for a particular input (e.g., in a dialogue system)
- Ancestral sampling: sample each word from $$ P(e_t\vert\widehat e_1^{t-1}) $$
- Calculate the sentence probability
 $$ P\left(\widehat E\vert F\right)=\prod_{t=1}^{\left|\widehat E\right|}P({\widehat e}_t\vert F,\widehat E_1^{t-1}) $$
- problem: numerical precision (underflow), so sum the log probabilities instead of multiplying (see the sketch below)
 
 
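A sketch of ancestral sampling together with the log-probability bookkeeping, assuming a hypothetical `step(prefix)` function that returns the model's next-word distribution over the target vocabulary given the source sentence and the prefix so far (the name, the EOS convention, and the seed handling are illustrative):

```python
# Ancestral sampling sketch; `step` is a hypothetical model interface.
import numpy as np

def sample_translation(step, eos_id, max_len=100, seed=0):
    rng = np.random.default_rng(seed)
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        p = step(prefix)                    # P(e_t | F, e_1^{t-1}), shape (vocab,)
        e_t = int(rng.choice(len(p), p=p))  # sample the next word
        log_prob += np.log(p[e_t])          # sum of logs, not product of probabilities
        prefix.append(e_t)
        if e_t == eos_id:                   # stop at the end-of-sentence symbol
            break
    return prefix, log_prob
```

Running it several times with different seeds gives the variety of outputs mentioned above; `log_prob` is the sentence log-probability computed as a sum rather than a product, avoiding the numerical-precision problem.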
- Greedy 1-best Search - just like ancestral sampling, except that instead of sampling we pick the most probable word (sketch below):
 $$ \widehat e_t\;=\;\underset i{argmax}\;p_{t,i}^{(e)} $$
- not guaranteed to find the translation with the highest overall probability (a common shortcoming of greedy algorithms)
 
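Greedy 1-best decoding is the same loop as the sampling sketch above with the sampling step replaced by an argmax (same hypothetical `step` interface):

```python
# Greedy 1-best decoding sketch; `step` is the same hypothetical interface.
import numpy as np

def greedy_translation(step, eos_id, max_len=100):
    prefix, log_prob = [], 0.0
    for _ in range(max_len):
        p = step(prefix)
        e_t = int(np.argmax(p))            # \hat e_t = argmax_i p_{t,i}^{(e)}
        log_prob += np.log(p[e_t])
        prefix.append(e_t)
        if e_t == eos_id:
            break
    return prefix, log_prob
```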
- Beam Search - keep only the b best partial hypotheses at each time step (pruning); see the sketch below
- heuristic search - better than greedy, but still not guaranteed to find the best translation
 
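A beam-search sketch under the same assumed `step(prefix)` interface: at every time step each surviving hypothesis is expanded, the candidate list is pruned back to the `beam_size` best, and finished hypotheses (those ending in EOS) are set aside; the best-scoring hypothesis overall is returned.

```python
# Beam search sketch (pruning to the beam_size best hypotheses per step).
import numpy as np

def beam_search(step, eos_id, beam_size=5, max_len=100):
    beams = [([], 0.0)]                     # (prefix, summed log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_p = np.log(step(prefix))
            # only the beam_size best continuations of each hypothesis can survive
            for e_t in np.argsort(log_p)[-beam_size:]:
                candidates.append((prefix + [int(e_t)], score + float(log_p[e_t])))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:    # prune
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                   # every hypothesis has finished
            break
    return max(finished + beams, key=lambda c: c[1])
```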
- Length normalization - problem: decoding tends to prefer shorter sentences
- beam search with a larger beam size has a significant length bias towards short sentences
 
- solution - use prior knowledge: the length of the target sentence correlates with the length of the source sentence (Tree-to-Sequence Attentional Neural Machine Translation)
 $$ P(\left|E\right|\;\vert\;\left|F\right|) $$
 $$ \widehat E\;=\underset E{\;argmax}\;\left(\log P(\left|E\right|\;\vert\;\left|F\right|)+\log P(E\;\vert\;F)\right) $$
- how to get the prior? Estimate it from relative counts in the training data:
 $$ P(\vert E\vert\;\vert\;\vert F\vert)\;=\;\frac{c(\vert E\vert,\;\vert F\vert)}{c(\left|F\right|)} $$
 or
 $$ \widehat E\;=\underset E{\;argmax}\;\frac{\log P(E\;\vert\;F)}{\left|E\right|} $$ (highest average log probability per word; both variants are sketched below)
 
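Both variants can be written as small re-scoring functions. A sketch, where `log_prob` is the summed log-probability of a finished hypothesis and `length_counts` is a hypothetical dictionary of (target length, source length) counts gathered from the training data (names and smoothing are illustrative):

```python
# Length-normalization sketches; names and data structures are illustrative.
import numpy as np

def score_with_length_prior(log_prob, trg_len, src_len, length_counts):
    # log P(|E| | |F|) + log P(E | F), prior estimated from relative counts
    c_pair = length_counts.get((trg_len, src_len), 0) + 1e-10   # tiny smoothing
    c_src = sum(c for (t, s), c in length_counts.items() if s == src_len) + 1e-10
    return np.log(c_pair / c_src) + log_prob

def score_per_word(log_prob, trg_len):
    # alternative: highest average log-probability per word
    return log_prob / trg_len
```

Either score replaces the raw summed log-probability when picking the final hypothesis out of the beam.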
 
Bidirectional Encoders
- reverse encoder - motivation: helps for language pairs with similar ordering (e.g., English-French)
 
- bi-directional encoder (more robust for typologically distinct languages); a sketch follows the equations
 $$ \overrightarrow h_t^{(f)}=\begin{cases}\overrightarrow{RNN}^{(f)}(m_t^{(f)},\overrightarrow h_{t-1}^{(f)}) & t\geq 1\\ 0 & \text{otherwise}\end{cases} $$
 $$ \overleftarrow h_t^{(f)}=\begin{cases}\overleftarrow{RNN}^{(f)}(m_t^{(f)},\overleftarrow h_{t+1}^{(f)}) & t\leq\left|F\right|\\ 0 & \text{otherwise}\end{cases} $$
- flexible combination of the two final hidden states:
 $$ h_0^{(e)}=\tanh(W_1{\overrightarrow h}_{\vert F\vert}^{(f)}+W_2{\overleftarrow h}_1^{(f)}+b_e) $$
 
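A bidirectional-encoder sketch, assuming PyTorch and hypothetical sizes: a single `nn.GRU` with `bidirectional=True` computes the forward and backward state sequences, and the two final states (forward at position |F|, backward at position 1) are combined with a learned transform into h_0^(e); the single linear layer plays the role of W_1, W_2, and b_e above.

```python
# Bidirectional encoder sketch (PyTorch assumed; sizes are hypothetical).
import torch
import torch.nn as nn

class BiEncoder(nn.Module):
    def __init__(self, src_vocab, emb_dim=64, hid_dim=128):
        super().__init__()
        self.M_f = nn.Embedding(src_vocab, emb_dim)
        self.birnn = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.combine = nn.Linear(2 * hid_dim, hid_dim)  # stands in for [W_1 W_2], b_e

    def forward(self, f_ids):                 # f_ids: (batch, |F|) source word ids
        _, h_n = self.birnn(self.M_f(f_ids))  # h_n: (2, batch, hid_dim)
        fwd_last, bwd_first = h_n[0], h_n[1]  # forward h_|F| and backward h_1
        return torch.tanh(self.combine(torch.cat([fwd_last, bwd_first], dim=-1)))

enc = BiEncoder(src_vocab=1000)
h0_e = enc(torch.tensor([[4, 8, 15]]))        # initial decoder state, shape (1, 128)
```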
Sentence Embedding Methods
- Auto-encoding - Semi-supervised Sequence Learning
- re-generate the input sentence from its encoding
 
- Language modeling
- Predicting context - Skip-Thought Vectors
- predict the surrounding sentences
- fixed-length embedding - prevents overfitting
- pretraining
- different software
 
- fine-tuning the embedding - increases expressivity
 
 
- Predicting paraphrases - Towards universal paraphrastic sentence embeddings
- similar sentences have similar embeddings (sketch below)
- PARANMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations
 
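A sketch of the "similar sentences have similar embeddings" idea as a training objective: a margin loss on cosine similarity that pulls a paraphrase pair together and pushes a sampled negative away (PyTorch assumed; the function name, margin value, and averaged-word-vector encoder are illustrative, not the papers' exact setup).

```python
# Paraphrastic embedding objective sketch (hypothetical names; PyTorch assumed).
import torch
import torch.nn.functional as F

def paraphrase_margin_loss(emb_a, emb_b, emb_neg, margin=0.4):
    """emb_*: (batch, dim) sentence embeddings, e.g. averaged word vectors."""
    pos = F.cosine_similarity(emb_a, emb_b)    # paraphrase pair: should be high
    neg = F.cosine_similarity(emb_a, emb_neg)  # sampled non-paraphrase: should be low
    return torch.clamp(margin - pos + neg, min=0.0).mean()

# toy usage with random vectors standing in for sentence embeddings
a, b, n = torch.randn(8, 300), torch.randn(8, 300), torch.randn(8, 300)
loss = paraphrase_margin_loss(a, b, n)
```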
- Predicting sentence features 
- Contextual embedding 
- Misc 
Further Reading
- Several studies on natural language and back-propagation - first proposed the idea of performing translation using neural networks
 
- Learning recursive distributed representations for holistic computation - further expanded to recurrent networks
 
- Recurrent Continuous Translation Models - first example of fully neural models for translation
 
- Sequence to Sequence Learning with Neural Networks - popularized neural MT due to impressive empirical performance
 
- Learning to Decode for Future Success - about search
 
- On the Properties of Neural Machine Translation: Encoder–Decoder Approaches 
