CMU 11-731 (MT & Seq2Seq) Algorithms for MT 2: Parameter Optimization Methods

Error Functions and Error Minimization

  • error function over the reference corpus $\varepsilon$ and the hypothesized translations $\widehat\varepsilon$
    $$ error(\varepsilon,\widehat\varepsilon) $$

  • difficulty in directly optimizing the error function

    • a myriad of possible translations
    • the argmax in decoding, and by corollary the error function, is not continuous: the error is piecewise constant in the parameters, so its gradient is zero or undefined (a toy sketch after this list illustrates this)
  • how to overcome?

    • approximate the hypothesis space
    • easily calculable loss functions
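
A minimal toy sketch (not from the lecture; the feature values and errors below are made up) of why the error cannot be optimized directly: under argmax decoding, the corpus error is a step function of the weights, so its gradient is zero almost everywhere and undefined at the jumps.

```python
import numpy as np

# one sentence, two hypotheses, two features each (illustrative values)
features = np.array([[2.0, 1.0],   # hypothesis A (matches the reference)
                     [1.0, 3.0]])  # hypothesis B
errors = np.array([0.0, 1.0])      # sentence-level error of each hypothesis

for lam in np.linspace(0.0, 1.0, 6):
    weights = np.array([lam, 1.0 - lam])
    scores = features @ weights        # S(F, E) = sum_i lambda_i * phi_i(F, E)
    best = int(np.argmax(scores))      # argmax decoding
    print(f"lambda={lam:.1f}  picks {'AB'[best]}  error={errors[best]}")
# The error stays at 1.0 until the two score lines cross (at lambda = 2/3),
# then jumps to 0.0: piecewise constant, with no useful gradient in between.
```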

Minimum Error Rate Training (MERT)

  • assumes a linear model
  • works on only a subset of hypotheses
  • uses an efficient line-search method

$$ \log\;P(F,E)\;\propto\;S(F,E)\;=\;\sum_i\lambda_i\phi_i(F,E) $$

  • outer loop

    • Generating hypotheses

      • using beam search
        $$ F_i\;\overset{n\text{-best list}}{\Rightarrow}\;{\widehat E}_i $$

        where ${\widehat E}_{i,j}$ is the $j$th hypothesis in the n-best list for $F_i$

    • Adjusting parameters

      $$ \widehat\varepsilon^{(\lambda)}=\{\widehat E_1^{(\lambda)},\widehat E_2^{(\lambda)},\ldots,\widehat E_{\vert\varepsilon\vert}^{(\lambda)}\} $$

      where $ \widehat E_i^{(\lambda)}=\underset{\widetilde E\in{\widehat E}_i}{argmax}\;S(F_i,\widetilde E;\lambda) $

      $$ \widehat\lambda=\underset\lambda{argmin}\;error(\varepsilon,\widehat\varepsilon^{(\lambda)}) $$

  • inner loop (finding the optimal parameters $ \widehat \lambda $ via line search)

    • Picking a direction (vector $ d $)
      • one-hot vector
      • random vector
      • vector calculated based on gradient-based methods (e.g. the minimum-risk method)
    • Finding the optimal point along this direction
      $$ \lambda_\alpha=\lambda+\alpha d $$

      then each hypothesis score is linear in $\alpha$:
      $$ S(F_i,{\widehat E}_{i,j};\lambda_\alpha)\;=\;b_{i,j}\;+\;c_{i,j}\alpha $$

      $$ \widehat\alpha=\underset\alpha{argmin}\;error(\varepsilon,\widehat\varepsilon^{(\lambda_\alpha)}) $$

      then update $ \lambda\leftarrow\lambda_{\widehat\alpha}=\lambda+\widehat\alpha d $

      • line sweep algorithm: the argmax (and hence the error) only changes where two score lines cross, so sort the crossing points and evaluate the error once per interval (see the sketch after this list)
    • more tricks for MERT
      • Random restarts
      • Corpus-level measures
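
A minimal sketch of the MERT inner loop, assuming the n-best feature vectors and sentence-level errors have already been collected as NumPy arrays; all names are illustrative. For brevity it enumerates all pairwise crossings per sentence instead of sweeping the upper envelope, but it probes the same intervals.

```python
import numpy as np

def mert_line_search(nbest_feats, nbest_errs, lam, d):
    """nbest_feats[i]: (n_i, k) feature matrix for sentence i's n-best list,
       nbest_errs[i]:  (n_i,)  sentence-level error of each hypothesis,
       lam, d:         current weight vector and search direction (length k)."""
    # Along lambda_alpha = lam + alpha * d every score is linear in alpha:
    # S = b + c * alpha with b = phi . lam and c = phi . d, so the argmax
    # (and hence the error) can only change where two score lines cross.
    crossings = [0.0]
    for phi in nbest_feats:
        b, c = phi @ lam, phi @ d
        for i in range(len(b)):
            for j in range(i + 1, len(b)):
                if c[i] != c[j]:
                    crossings.append((b[j] - b[i]) / (c[i] - c[j]))
    crossings = np.sort(np.unique(crossings))
    # probe the corpus error once inside every interval between crossings
    probes = np.concatenate([[crossings[0] - 1.0],
                             (crossings[:-1] + crossings[1:]) / 2.0,
                             [crossings[-1] + 1.0]])

    def corpus_error(alpha):
        lam_a = lam + alpha * d
        return sum(errs[int(np.argmax(phi @ lam_a))]
                   for phi, errs in zip(nbest_feats, nbest_errs))

    best_alpha = min(probes, key=corpus_error)
    return lam + best_alpha * d
```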

Minimum Risk Training

  • the risk is differentiable and conducive to optimization through gradient-based methods

$$ risk(F,E,\theta)=\sum_{\widetilde E}P(\widetilde E\vert F;\theta)\;error(E,\widetilde E) $$

  • Two things to be careful about

    • summing over the entire hypothesis space is intractable, so we sum over a chosen subset of hypotheses
    • we are not actually optimizing the error itself, only its expectation under the model

      • introduce a temperature parameter $\tau$
        $$ risk(F,E,\theta)=\sum_{\widetilde E}\frac{P(\widetilde E\vert F;\theta)^{1/\tau}}{Z}\;error(E,\widetilde E) $$

        where $ Z=\sum_{\widetilde E}P(\widetilde E\vert F;\theta)^{1/\tau} $

        • $\tau = 1$, regular probability distribution
        • $\tau > 1$, distribution becomes “smoother”
        • $\tau < 1$, distribution becomes “sharper”
      • how to choose $\tau$ -> annealing: start with a high value and decrease it toward zero, so the smoothed risk approaches the error of the argmax output (a sketch of the smoothed risk follows this list)
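
A minimal sketch of the subset-approximated, temperature-smoothed risk; the hypothesis log-probabilities and errors below are illustrative, and in practice the same computation would be done inside an autodiff framework so its gradient can be taken.

```python
import numpy as np

def smoothed_risk(logprobs, errors, tau=1.0):
    """logprobs: model log P(E~|F) for a sampled subset of hypotheses
       errors:   error(E, E~) for the same hypotheses (e.g. 1 - sentence BLEU)"""
    scaled = np.asarray(logprobs) / tau          # log P^{1/tau}
    scaled -= scaled.max()                       # stabilize the exponentiation
    q = np.exp(scaled) / np.exp(scaled).sum()    # renormalize by Z
    return float(np.dot(q, errors))              # expected error under q

logprobs = np.log([0.5, 0.3, 0.2])               # toy subset of 3 hypotheses
errors = [0.0, 0.4, 0.9]
for tau in (1.0, 2.0, 0.5):                      # tau>1 smoother, tau<1 sharper
    print(tau, smoothed_risk(logprobs, errors, tau))
```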

Optimization Through Search

  • structured perceptron

    • for a linearized model, this is just stochastic gradient descent on a perceptron-style loss (a sketch follows this list)
    • variants

      • early stopping (stop the search as soon as the partial hypothesis becomes inconsistent with the reference, and update on the prefixes)

        $$ l_{early\text{-}percep}\;=\;S(F,\widehat e_1^t)\;-\;S(F,e_1^t) $$

  • Search-aware tuning and beam-search optimization (adjust the score of hypotheses in the intermediate search steps)

    • Search-aware tuning -> give a bonus at each time step to partial hypotheses that lead to lower error
    • Beam-search optimization -> apply a perceptron-style penalty at each time step where the best hypothesis falls off the beam
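
A minimal sketch of one structured-perceptron update for a linear model; feature extraction is abstracted away and all names are illustrative. For the early-stopping variant, the feature vectors would be computed over the prefixes up to the point where the hypothesis first diverges from the reference.

```python
import numpy as np

def perceptron_step(lam, phi_hyp, phi_ref, lr=1.0):
    """One SGD step on l_percep = S(F, E_hat) - S(F, E) for S(F, E) = lam . phi(F, E).
       phi_hyp: features of the model's current best hypothesis E_hat,
       phi_ref: features of the reference translation E."""
    if np.array_equal(phi_hyp, phi_ref):
        return lam                      # model already produces the reference
    # gradient of l_percep with respect to lam is phi_hyp - phi_ref
    return lam - lr * (phi_hyp - phi_ref)
```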

Margin-based Loss-augmented Training

  • require the score of the correct output $E$ to exceed the score of the hypothesis $\widehat E$ by a margin $M$

    $$ S(F,E) > S(F,\widehat E) \Rightarrow S(F,E) > S(F,\widehat E) + M $$

    • explanation: the model should have some breathing room in its predictions
  • loss-augmented training

    $$ S(F,E) > S(F,\widehat E) + M\cdot err(E,\widehat E) $$
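
A minimal sketch of hinge-style losses corresponding to the two margin constraints above; the argument names and the form of the error are illustrative assumptions.

```python
# Hinge-style training losses that are zero when the margin constraints hold
# for a single hypothesis E_hat against the reference E.
def margin_loss(score_ref, score_hyp, M=1.0):
    # penalize unless S(F, E) > S(F, E_hat) + M
    return max(0.0, M + score_hyp - score_ref)

def loss_augmented_margin(score_ref, score_hyp, err, M=1.0):
    # hypotheses with larger error err(E, E_hat) must be beaten by a larger margin
    return max(0.0, M * err + score_hyp - score_ref)
```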

Optimization as Reinforcement Learning

  • The story

    • view each word selection as an action
    • the final evaluation score (e.g. BLEU) as the reward
  • policy gradient methods
    key word: the REINFORCE objective

    • self-training
      $$ l_{nll}(\widehat E)=\sum_{t=1}^{\vert E\vert}-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
    • weighting the objective with the value of the evaluation function
      $$ l_{reinforce}(\widehat E,E)=eval(E,\widehat E)\sum_{t=1}^{\vert E\vert}-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
    • add a baseline function estimating the reward we expect: if we expected a good result but got a bad one the update is strongly negative, and if we expected a bad result but got a slightly good one the update is positive (a sketch follows this list)
      $$ l_{reinforce+base}(\widehat E,E)=\sum_{t=1}^{\vert E\vert}-\left(eval(E,\widehat E)-base(F,\widehat e_1^{t-1})\right)\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
  • value-based reinforcement learning

    • learn value/Q function, $Q(H,a)$
      $$ H=\langle F,\widehat e_1^{t-1}\rangle\;\;\text{and}\;\;a=e_t $$
    • actor-critic methods
  • recommended reading
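
A minimal sketch of the three policy-gradient losses above for a single sampled output; the token log-probabilities, reward, and per-step baselines are illustrative inputs, and in practice these losses would be built inside an autodiff framework so gradients flow into the model.

```python
import numpy as np

def nll_loss(token_logprobs):
    # self-training: negative log-likelihood of the sampled output
    return -float(np.sum(token_logprobs))

def reinforce_loss(token_logprobs, reward):
    # weight the log-likelihood by the evaluation score, e.g. sentence BLEU
    return -reward * float(np.sum(token_logprobs))

def reinforce_with_baseline(token_logprobs, reward, baselines):
    # baselines[t] = base(F, e_hat_1^{t-1}): the reward expected before choosing
    # token t; better-than-expected outputs are reinforced, worse ones suppressed
    advantages = reward - np.asarray(baselines)
    return float(-np.sum(advantages * np.asarray(token_logprobs)))
```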

Further Reading

  • Evaluation measures for optimization
  • Efficient data structures and algorithms for optimization

Training tricks

  • start with MLE
  • learning rate/optimizer
  • large batch size
  • self-critical baseline / average baseline