Error Functions and Error Minimization
error function
$$ error(\varepsilon,\widehat\varepsilon) $$
difficulty in directly optimizing the error function
- a myriad of possible translations
- because of the argmax, the error function is not continuous: it is piecewise constant, so its gradient is zero or undefined
how to overcome?
- approximate the hypothesis space
- easily calculable loss functions
Minimum Error Rate Training (MERT)
- assumes a linear model
- works on only a subset of hypotheses
- uses an efficient line-search method
$$ \log\;P(F,E)\;\propto\;S(F,E)\;=\;\sum_i\lambda_i\phi_i(F,E) $$
outer loop
Generating hypotheses
using beam search
$$ F_i\overset{\text{n-best list}}{\Rightarrow}{\widehat E}_i $$
$$ {\widehat E}_{i,j}\ \text{is the }j\text{th hypothesis in the n-best list} $$
Adjusting parameters
$$ \widehat\varepsilon^{(\lambda)}=\{\widehat E_1^{(\lambda)},\widehat E_2^{(\lambda)},\ldots,\widehat E_{\vert\varepsilon\vert}^{(\lambda)}\} $$
$$ \text{where}\ \widehat E_i^{(\lambda)}=\underset{\widetilde E\in{\widehat E}_i}{argmax}\,S(F_i,\widetilde E;\lambda) $$
$$ \widehat\lambda=\underset\lambda{argmin}\;error(\varepsilon,\widehat\varepsilon^{(\lambda)}) $$
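A minimal sketch of this re-ranking step in Python, before moving to the inner loop; the data layout (each sentence's n-best list as `(feature_vector, error_count)` pairs) and the function name are illustrative assumptions, not from the source.

```python
import numpy as np

def rerank_and_error(nbest_lists, lam):
    """Re-rank each n-best list under weights `lam` and return the summed
    (corpus-level) error of the chosen 1-best hypotheses.

    nbest_lists: one entry per sentence F_i, each a list of
        (feature_vector phi, error_count) pairs for its n-best hypotheses.
    lam: weight vector lambda, one weight per feature.
    """
    total_error = 0.0
    for hyps in nbest_lists:
        # S(F_i, E~; lambda) = sum_k lambda_k * phi_k(F_i, E~)
        scores = [np.dot(lam, phi) for phi, _ in hyps]
        best = int(np.argmax(scores))
        total_error += hyps[best][1]
    return total_error
```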
inner loop (how to find parameter $ \widehat \lambda $ - line search)
- Picking a direction (vector $ d $)
- one-hot vector
- random vector
- vector calculated based on gradient-based methods (e.g. the minimum-risk method)
Finding the optimal point along this direction
$$ \lambda_\alpha=\lambda+\alpha d $$
$$ \text{then we get}\ S(F_i,\widehat E_{i,j};\lambda_\alpha)=b+c\alpha $$
$$ \widehat\alpha=\underset\alpha{argmin}\;error(\varepsilon,\widehat\varepsilon^{(\lambda_\alpha)}),\quad\text{and}\quad{\widehat\lambda}_\alpha=\lambda+\widehat\alpha d $$
update $\lambda\leftarrow{\widehat\lambda}_\alpha$
- line sweep algorithm (a simplified sketch follows this list)
- more tricks for MERT
- Random restarts
- Corpus-level measures
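Below is a simplified sketch of the line search along direction $d$, reusing the `(feature_vector, error_count)` n-best layout from the sketch above. Because $S(F_i,\widehat E_{i,j};\lambda+\alpha d)=\lambda\cdot\phi+\alpha\,(d\cdot\phi)$ is linear in $\alpha$, the corpus error can only change where two such lines cross; this version probes every pairwise crossing (quadratic time) instead of the efficient sweep, and all names are illustrative.

```python
import numpy as np

def mert_line_search(nbest_lists, lam, d):
    """Return updated weights lambda + alpha_hat * d, where alpha_hat minimizes
    the corpus error along direction d (brute-force stand-in for the sweep)."""
    lam, d = np.asarray(lam, float), np.asarray(d, float)
    per_sent, crossings = [], set()
    for hyps in nbest_lists:
        # score of hypothesis j as a function of alpha: b + c * alpha
        lines = [(np.dot(lam, phi), np.dot(d, phi), err) for phi, err in hyps]
        per_sent.append(lines)
        for i in range(len(lines)):
            for j in range(i + 1, len(lines)):
                b1, c1, _ = lines[i]
                b2, c2, _ = lines[j]
                if c1 != c2:
                    crossings.add((b2 - b1) / (c1 - c2))
    if not crossings:            # error is constant along d
        return lam
    xs = sorted(crossings)
    # probe one alpha inside each interval between consecutive crossings
    probes = [xs[0] - 1.0] + [(a + b) / 2 for a, b in zip(xs, xs[1:])] + [xs[-1] + 1.0]

    def corpus_error(alpha):
        return sum(max(lines, key=lambda t: t[0] + t[1] * alpha)[2]
                   for lines in per_sent)

    best_alpha = min(probes, key=corpus_error)
    return lam + best_alpha * d
```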
Minimum Risk Training
- differentiable and conducive to optimization through gradient-based methods
$$ risk(F,E,\theta)=\sum_{\widetilde E}P(\widetilde E\vert F;\theta)\,error(E,\widetilde E) $$
Two things to be careful about
- summing over the entire hypothesis space is intractable, so we choose a subset of hypotheses
- we are then not actually optimizing the true error
- introduce a temperature parameter $\tau$
$$ risk(F,E,\theta)=\sum_{\widetilde E}\frac{P(\widetilde E\vert F;\theta)^{1/\tau}}{Z}\,error(E,\widetilde E) $$
$$ \text{where}\ Z=\sum_{\widetilde E}P(\widetilde E\vert F;\theta)^{1/\tau} $$
- $\tau = 1$, regular probability distribution
- $\tau > 1$, distribution becomes “smoother”
- $\tau < 1$, distribution becomes “sharper”
- how to choose $\tau$ -> annealing (decrease from a high value to zero)
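A small numerical sketch of the sampled, temperature-scaled risk. In practice this would be written inside an autodiff framework so the gradient with respect to $\theta$ can be taken; the standalone NumPy version below only shows the computation, and the argument names are assumptions.

```python
import numpy as np

def sampled_risk(log_probs, errors, tau=1.0):
    """Expected error over a sampled subset of hypotheses, with the model
    probabilities P(E~|F)^(1/tau) renormalized over that subset.

    log_probs: model log P(E~|F) for each sampled hypothesis E~
    errors:    error(E, E~) for each sampled hypothesis
    tau:       temperature (tau > 1 smooths, tau < 1 sharpens)
    """
    scaled = np.asarray(log_probs, float) / tau
    scaled -= scaled.max()              # numerical stability
    weights = np.exp(scaled)
    weights /= weights.sum()            # the Z renormalization over the subset
    return float(np.dot(weights, np.asarray(errors, float)))
```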
Optimization Through Search
structured perceptron
- for a linearized model, this is just stochastic gradient descent
variants
early stopping (stop the search and update as soon as the output becomes inconsistent with the reference)
$$ l_{\text{early-percep}}=S(F,\widehat e_1^t)-S(F,e_1^t) $$
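A sketch of how this loss could be computed, treating "inconsistent with the reference" as the first position where the model's 1-best prefix diverges from the reference; `score_prefix` is a hypothetical callable returning the model score $S(F,\cdot)$ of a partial output.

```python
def early_update_loss(score_prefix, ref, hyp):
    """Perceptron-style early-update loss: at the first time step t where the
    model's prefix diverges from the reference, return
    S(F, hyp[:t]) - S(F, ref[:t]); zero if the prefixes agree."""
    for t in range(1, min(len(ref), len(hyp)) + 1):
        if hyp[t - 1] != ref[t - 1]:
            return score_prefix(hyp[:t]) - score_prefix(ref[:t])
    return 0.0
```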
Search-aware tuning and beam-search optimization (adjust the score of hypotheses in the intermediate search steps)
- Search-aware tuning -> giving a bonus to hypotheses at each time step that get lower error
- Beam-search optimization -> applying a perceptron-style penalty at each time step where the best hypothesis falls off the beam
Margin-based Loss-augmented Training
require the score of the correct output to exceed the score of the best hypothesis by a margin $M$
$$ S(F,E) > S(F,\widehat E) \Rightarrow S(F,E) > S(F,\widehat E) + M $$
- explanation: the model has some breathing room in its predictions
loss-augmented training
$$ S(F,E) > S(F,\widehat E) + M\cdot error(E,\widehat E) $$
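The margin condition can be folded into a hinge-style loss, sketched below under the assumption that the sentence-level scores and error are already computed; setting the error term to 1 for every incorrect hypothesis recovers the plain fixed margin $M$.

```python
def loss_augmented_hinge(score_ref, score_hyp, err, M=1.0):
    """Zero when S(F,E) already beats S(F,E_hat) by the error-scaled margin
    M * error(E,E_hat); otherwise the size of the violation."""
    return max(0.0, score_hyp + M * err - score_ref)
```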
Optimization as Reinforcement Learning
The story
- view each word selection as an action
- the final evaluation score (e.g. BLEU) as the reward
policy gradient methods
key word: REINFORCE objective
- self-training
$$ l_{nll}(\widehat E)=\sum_{t=1}^{\vert\widehat E\vert}-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
- weighting the objective with the value of the evaluation function
$$ l_{reinforce}(\widehat E,E)=eval(E,\widehat E)\sum_{t=1}^{\vert\widehat E\vert}-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1}) $$
- add a baseline function (if we expected a good score but got a bad one, the sampled choices are discouraged; if we expected a bad score but got a slightly better one, they are encouraged)
$$ l_{reinforce+base}(\widehat E,E)=\sum_{t=1}^{\vert\widehat E\vert}\bigl(eval(E,\widehat E)-base(F,\widehat e_1^{t-1})\bigr)\bigl(-\log P({\widehat e}_t\vert F,\widehat e_1^{t-1})\bigr) $$
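A sketch of the baselined REINFORCE loss for one sampled translation $\widehat E$; as with the risk example, the real thing lives inside an autodiff framework, with the advantage term $(eval - base)$ treated as a constant so gradients flow only through the log-probabilities. Argument names are illustrative.

```python
import numpy as np

def reinforce_with_baseline_loss(step_log_probs, eval_score, baselines):
    """step_log_probs: log P(e_hat_t | F, e_hat_1..t-1) for t = 1..|E_hat|
    eval_score:        eval(E, E_hat), one sentence-level value
    baselines:         base(F, e_hat_1..t-1) per step (e.g. a learned predictor)"""
    advantages = eval_score - np.asarray(baselines, float)
    return float(np.sum(advantages * -np.asarray(step_log_probs, float)))
```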
value-based reinforcement learning
- learn value/Q function, $Q(H,a)$
$$ H=\langle F,\widehat e_1^{t-1}\rangle,\quad a=e_t $$
- actor-critic methods
- learn a value/Q function $Q(H,a)$ as the critic and use it to guide updates to the policy (the actor)
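Below is a deliberately simplified sketch of the two losses in such an actor-critic setup: the critic's $Q$ estimates are regressed toward the observed sentence-level reward (a Monte-Carlo target rather than a bootstrapped one), and the actor weights each step's negative log-probability by the critic's estimate. All names are assumptions for illustration.

```python
def actor_critic_losses(step_log_probs, q_estimates, final_eval):
    """step_log_probs: log P(a_t | H_t) for the chosen word at each step t
    q_estimates:      critic outputs Q(H_t, a_t), one per step
    final_eval:       evaluation score actually obtained for the full output"""
    # critic: regress Q(H_t, a_t) toward the reward that was actually observed
    critic_loss = sum((q - final_eval) ** 2 for q in q_estimates)
    # actor: encourage/discourage each word choice according to the critic
    actor_loss = sum(q * -lp for q, lp in zip(q_estimates, step_log_probs))
    return actor_loss, critic_loss
```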
Further Reading
- Evaluation measures for optimization
- Efficient data structures and algorithms for optimization
training tricks
- start with MLE
- learning rate/optimizer
- large batch size
- self-critic/ average baseline
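One way the "start with MLE" trick is often realized is a warm-start schedule that blends the reinforcement-style loss in gradually; a minimal sketch, with the schedule shape and constants purely illustrative.

```python
def mixed_loss(mle_loss, rl_loss, step, warmup_steps=10000, max_mix=0.5):
    """Pure MLE for the first `warmup_steps` updates, then linearly blend in
    the RL-style loss up to a fixed maximum mixing weight."""
    if step < warmup_steps:
        mix = 0.0
    else:
        mix = min(max_mix, (step - warmup_steps) / warmup_steps * max_mix)
    return (1.0 - mix) * mle_loss + mix * rl_loss
```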