for each utterance
    1) load and preprocess the utterance's features
    2) get an alignment path from somewhere
    3) whatever has to be trained, let it accumulate
       the necessary training information
whatever has to be trained, let it update its
parameters according to the accumulated data
Here step 2) can either run a Viterbi or a forward-backward
alignment, or load an already aligned path from a file, which
we call a labels-file. Training along labels is usually much
faster than computing a forced alignment for every utterance,
because the alignment is read from the file instead of being
recomputed.
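
The loop above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the toolkit's actual API: the names StateMeans, preprocess, uniform_align and train_epoch are hypothetical, the "features" are plain floats, and uniform_align is a trivial stand-in for a real Viterbi or forward-backward alignment. The point is the control flow: a stored label path is used when one exists, an alignment is computed otherwise, accumulation happens once per utterance, and the update happens once after all utterances.

```python
from typing import Callable, Dict, List, Sequence


class StateMeans:
    """Toy trainable object (hypothetical): one scalar mean per state."""

    def __init__(self) -> None:
        self.sums: Dict[int, float] = {}
        self.counts: Dict[int, int] = {}
        self.means: Dict[int, float] = {}

    def accumulate(self, features: Sequence[float], path: Sequence[int]) -> None:
        # Step 3: collect sufficient statistics along the alignment path.
        for value, state in zip(features, path):
            self.sums[state] = self.sums.get(state, 0.0) + value
            self.counts[state] = self.counts.get(state, 0) + 1

    def update(self) -> None:
        # Final step: re-estimate parameters, then reset the accumulators.
        for state, total in self.sums.items():
            self.means[state] = total / self.counts[state]
        self.sums.clear()
        self.counts.clear()


def preprocess(raw: Sequence[float]) -> List[float]:
    # Step 1 placeholder: a real system would compute acoustic features here.
    return [float(x) for x in raw]


def uniform_align(features: Sequence[float], n_states: int = 2) -> List[int]:
    # Stand-in for a forced alignment (NOT Viterbi): spread the frames
    # evenly over the states in order.
    n = len(features)
    return [min(i * n_states // n, n_states - 1) for i in range(n)]


def train_epoch(
    utterances: Dict[str, Sequence[float]],
    trainables: Sequence[StateMeans],
    labels: Dict[str, List[int]],
    align: Callable[[Sequence[float]], List[int]],
) -> None:
    for utt_id, raw in utterances.items():
        features = preprocess(raw)           # step 1
        path = labels.get(utt_id)            # step 2: try the stored labels
        if path is None:
            path = align(features)           # step 2: fall back to aligning
        for trainable in trainables:
            trainable.accumulate(features, path)   # step 3
    for trainable in trainables:
        trainable.update()                   # update after all utterances
```

Calling train_epoch with a labels dictionary that covers every utterance never invokes align at all, which is where the speed advantage of training along labels comes from.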