CMU 11-731 (MT & Seq2Seq) Algorithms for MT 2: Parameter Optimization Methods

Error Functions and Error Minimization

  • error function over the reference corpus $\varepsilon$ and the hypothesized translations $\widehat\varepsilon$
    $$ error(\varepsilon,\widehat\varepsilon) $$

  • difficulty in directly optimizing the error function

    • a myriad of possible translations
    • the argmax in decoding, and by corollary the error function, is not continuous: the error is piecewise constant in the parameters, so its gradient is zero or undefined (a toy sketch after this list illustrates this)
  • how to overcome?

    • approximate the hypothesis space
    • easily calculable loss functions
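
A minimal toy sketch (not from the lecture; the feature values and errors below are made up) of why the error cannot be optimized directly: under argmax decoding, the corpus error is a step function of the weights, so its gradient is zero almost everywhere and undefined at the jumps.

```python
import numpy as np

# one sentence, two hypotheses, two features each (illustrative values)
features = np.array([[2.0, 1.0],   # hypothesis A (matches the reference)
                     [1.0, 3.0]])  # hypothesis B
errors = np.array([0.0, 1.0])      # sentence-level error of each hypothesis

for lam in np.linspace(0.0, 1.0, 6):
    weights = np.array([lam, 1.0 - lam])
    scores = features @ weights        # S(F, E) = sum_i lambda_i * phi_i(F, E)
    best = int(np.argmax(scores))      # argmax decoding
    print(f"lambda={lam:.1f}  picks {'AB'[best]}  error={errors[best]}")
# The error stays at 1.0 until the two score lines cross (at lambda = 2/3),
# then jumps to 0.0: piecewise constant, with no useful gradient in between.
```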

Minimum Error Rate Training (MERT)

  • assumes a linear model
  • works on only a subset of hypotheses
  • uses an efficient line-search method

$$ \log\;P(F,E)\;\propto\;S(F,E)\;=\;\sum_i\lambda_i\phi_i(F,E) $$

  • outer loop

    • Generating hypotheses

      • using beam search
        $$ F_i\;\overset{n\text{-best list}}{\Rightarrow}\;{\widehat E}_i $$

        where ${\widehat E}_{i,j}$ is the $j$th hypothesis in the n-best list for $F_i$

    • Adjusting parameters

      $$ \widehat\varepsilon^{(\lambda)}=\{\widehat E_1^{(\lambda)},\widehat E_2^{(\lambda)},\ldots,\widehat E_{\vert\varepsilon\vert}^{(\lambda)}\} $$

      where $ \widehat E_i^{(\lambda)}=\underset{\widetilde E\in{\widehat E}_i}{argmax}\;S(F_i,\widetilde E;\lambda) $

      $$ \widehat\lambda=\underset\lambda{argmin}\;error(\varepsilon,\widehat\varepsilon^{(\lambda)}) $$

  • inner loop (finding the optimal parameters $ \widehat \lambda $ via line search)

    • Picking a direction (vector $ d $)
      • one-hot vector
      • random vector
      • vector calculated based on gradient-based methods (e.g. the minimum-risk method)
    • Finding the optimal point along this direction
      $$ \lambda_\alpha=\lambda+\alpha d $$

      then each hypothesis score is linear in $\alpha$:
      $$ S(F_i,{\widehat E}_{i,j};\lambda_\alpha)\;=\;b_{i,j}\;+\;c_{i,j}\alpha $$

      $$ \widehat\alpha=\underset\alpha{argmin}\;error(\varepsilon,\widehat\varepsilon^{(\lambda_\alpha)}) $$

      then update $ \lambda\leftarrow\lambda_{\widehat\alpha}=\lambda+\widehat\alpha d $

      • line sweep algorithm: the argmax (and hence the error) only changes where two score lines cross, so sort the crossing points and evaluate the error once per interval (see the sketch after this list)
    • more tricks for MERT
      • Random restarts
      • Corpus-level measures
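
A minimal sketch of the MERT inner loop, assuming the n-best feature vectors and sentence-level errors have already been collected as NumPy arrays; all names are illustrative. For brevity it enumerates all pairwise crossings per sentence instead of sweeping the upper envelope, but it probes the same intervals.

```python
import numpy as np

def mert_line_search(nbest_feats, nbest_errs, lam, d):
    """nbest_feats[i]: (n_i, k) feature matrix for sentence i's n-best list,
       nbest_errs[i]:  (n_i,)  sentence-level error of each hypothesis,
       lam, d:         current weight vector and search direction (length k)."""
    # Along lambda_alpha = lam + alpha * d every score is linear in alpha:
    # S = b + c * alpha with b = phi . lam and c = phi . d, so the argmax
    # (and hence the error) can only change where two score lines cross.
    crossings = [0.0]
    for phi in nbest_feats:
        b, c = phi @ lam, phi @ d
        for i in range(len(b)):
            for j in range(i + 1, len(b)):
                if c[i] != c[j]:
                    crossings.append((b[j] - b[i]) / (c[i] - c[j]))
    crossings = np.sort(np.unique(crossings))
    # probe the corpus error once inside every interval between crossings
    probes = np.concatenate([[crossings[0] - 1.0],
                             (crossings[:-1] + crossings[1:]) / 2.0,
                             [crossings[-1] + 1.0]])

    def corpus_error(alpha):
        lam_a = lam + alpha * d
        return sum(errs[int(np.argmax(phi @ lam_a))]
                   for phi, errs in zip(nbest_feats, nbest_errs))

    best_alpha = min(probes, key=corpus_error)
    return lam + best_alpha * d
```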

Minimum Risk Training

  • the risk is differentiable and conducive to optimization through gradient-based methods

$$ risk(F,E,\theta)=\sum_{\widetilde E}P(\widetilde E\vert F;\theta)\;error(E,\widetilde E) $$

  • Two things to be careful about

    • summing over the entire hypothesis space is intractable, so we sum over a chosen subset of hypotheses
    • we are not actually optimizing the error itself, only its expectation under the model

      • introduce a temperature parameter $\tau$
        $$ risk(F,E,\theta)=\sum_{\widetilde E}\frac{P(\widetilde E\vert F;\theta)^{1/\tau}}{Z}\;error(E,\widetilde E) $$

        where $ Z=\sum_{\widetilde E}P(\widetilde E\vert F;\theta)^{1/\tau} $

        • $\tau = 1$, regular probability distribution
        • $\tau > 1$, distribution becomes “smoother”
        • $\tau < 1$, distribution becomes “sharper”
      • how to choose $\tau$ -> annealing: start with a high value and decrease it toward zero, so the smoothed risk approaches the error of the argmax output (a sketch of the smoothed risk follows this list)
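
A minimal sketch of the subset-approximated, temperature-smoothed risk; the hypothesis log-probabilities and errors below are illustrative, and in practice the same computation would be done inside an autodiff framework so its gradient can be taken.

```python
import numpy as np

def smoothed_risk(logprobs, errors, tau=1.0):
    """logprobs: model log P(E~|F) for a sampled subset of hypotheses
       errors:   error(E, E~) for the same hypotheses (e.g. 1 - sentence BLEU)"""
    scaled = np.asarray(logprobs) / tau          # log P^{1/tau}
    scaled -= scaled.max()                       # stabilize the exponentiation
    q = np.exp(scaled) / np.exp(scaled).sum()    # renormalize by Z
    return float(np.dot(q, errors))              # expected error under q

logprobs = np.log([0.5, 0.3, 0.2])               # toy subset of 3 hypotheses
errors = [0.0, 0.4, 0.9]
for tau in (1.0, 2.0, 0.5):                      # tau>1 smoother, tau<1 sharper
    print(tau, smoothed_risk(logprobs, errors, tau))
```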

Optimization Through Search

  • structured perceptron

    • for a linearized model, this is just stochastic gradient descent on a perceptron-style loss (a sketch follows this list)
    • variants

      • early stopping (stop the search as soon as the partial hypothesis becomes inconsistent with the reference, and update on the prefixes)

        $$ l_{early\text{-}percep}\;=\;S(F,\widehat e_1^t)\;-\;S(F,e_1^t) $$

  • Search-aware tuning and beam-search optimization (adjust the score of hypotheses in the intermediate search steps)

    • Search-aware tuning -> give a bonus at each time step to partial hypotheses that lead to lower error
    • Beam-search optimization -> apply a perceptron-style penalty at each time step where the best hypothesis falls off the beam
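
A minimal sketch of one structured-perceptron update for a linear model; feature extraction is abstracted away and all names are illustrative. For the early-stopping variant, the feature vectors would be computed over the prefixes up to the point where the hypothesis first diverges from the reference.

```python
import numpy as np

def perceptron_step(lam, phi_hyp, phi_ref, lr=1.0):
    """One SGD step on l_percep = S(F, E_hat) - S(F, E) for S(F, E) = lam . phi(F, E).
       phi_hyp: features of the model's current best hypothesis E_hat,
       phi_ref: features of the reference translation E."""
    if np.array_equal(phi_hyp, phi_ref):
        return lam                      # model already produces the reference
    # gradient of l_percep with respect to lam is phi_hyp - phi_ref
    return lam - lr * (phi_hyp - phi_ref)
```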

Margin-based Loss-augmented Training

  • require the score of the correct output $E$ to exceed the score of the hypothesis $\widehat E$ by a margin $M$

    $$ S(F,E) > S(F,\widehat E) \Rightarrow S(F,E) > S(F,\widehat E) + M $$

    • explanation: the model should have some breathing room in its predictions
  • loss-augmented training

    $$ S(F,E) > S(F,\widehat E) + M\cdot err(E,\widehat E) $$
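
A minimal sketch of hinge-style losses corresponding to the two margin constraints above; the argument names and the form of the error are illustrative assumptions.

```python
# Hinge-style training losses that are zero when the margin constraints hold
# for a single hypothesis E_hat against the reference E.
def margin_loss(score_ref, score_hyp, M=1.0):
    # penalize unless S(F, E) > S(F, E_hat) + M
    return max(0.0, M + score_hyp - score_ref)

def loss_augmented_margin(score_ref, score_hyp, err, M=1.0):
    # hypotheses with larger error err(E, E_hat) must be beaten by a larger margin
    return max(0.0, M * err + score_hyp - score_ref)
```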

Optimization as Reinforcement Learning

  • The story

    • view each word selection as an action
    • the final evaluation score (e.g. BLEU) as the reward
  • policy gradient methods
    key word: the REINFORCE objective

    • self-training
      $$ l_{nll}(\widehat E)=\sum_{t=1}^{\vert E\vert}-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
    • weighting the objective with the value of the evaluation function
      $$ l_{reinforce}(\widehat E,E)=eval(E,\widehat E)\sum_{t=1}^{\vert E\vert}-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
    • add a baseline function estimating the reward we expect: if we expected a good result but got a bad one the update is strongly negative, and if we expected a bad result but got a slightly good one the update is positive (a sketch follows this list)
      $$ l_{reinforce+base}(\widehat E,E)=\sum_{t=1}^{\vert E\vert}-\left(eval(E,\widehat E)-base(F,\widehat e_1^{t-1})\right)\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
  • value-based reinforcement learning

    • learn value/Q function, $Q(H,a)$
      $$ H=\langle F,\widehat e_1^{t-1}\rangle\;\;\text{and}\;\;a=e_t $$
    • actor-critic methods
  • recommended reading
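
A minimal sketch of the three policy-gradient losses above for a single sampled output; the token log-probabilities, reward, and per-step baselines are illustrative inputs, and in practice these losses would be built inside an autodiff framework so gradients flow into the model.

```python
import numpy as np

def nll_loss(token_logprobs):
    # self-training: negative log-likelihood of the sampled output
    return -float(np.sum(token_logprobs))

def reinforce_loss(token_logprobs, reward):
    # weight the log-likelihood by the evaluation score, e.g. sentence BLEU
    return -reward * float(np.sum(token_logprobs))

def reinforce_with_baseline(token_logprobs, reward, baselines):
    # baselines[t] = base(F, e_hat_1^{t-1}): the reward expected before choosing
    # token t; better-than-expected outputs are reinforced, worse ones suppressed
    advantages = reward - np.asarray(baselines)
    return float(-np.sum(advantages * np.asarray(token_logprobs)))
```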

Further Reading

  • Evaluation measures for optimization
  • Efficient data structures and algorithms for optimization

Training tricks

  • start with MLE
  • learning rate/optimizer
  • large batch size
  • self-critical baseline / average baseline