Preface
The Hidden Markov Model Toolkit (HTK) is a portable toolkit for building and manipulating hidden Markov models. It is primarily used for speech recognition, but it can be used for other applications. In the following example, I modified HVite.c slightly and used the OpenExtBuffer() function to demonstrate recognition from streaming keyboard input. Please keep in mind that I deliberately made this demo simple and minimal. If you are considering using HTK, I strongly recommend that you go over The HTK Book before applying what's on this page to your problem domain.
Preparation
% cd ~
% wget ftp://username:password@htk.eng.cam.ac.uk/software/HTK-3.0.tar.gz
% tar xvfz HTK-3.0.tar.gz
% cd ~
% patch -p0 < htk-20020109.diff
% cd ~/htk/HTKLib
% source ~/htk/env/env.linux
% make
Then you'll see HTKLib.linux.a in ~/htk/HTKLib.
% source ~/htk/env/env.linux
% setenv HBIN ~/htk/bin
% mkdir ~/htk/bin
% mkdir ~/htk/bin/bin.$CPU
% cd ~/htk/HTKTools
% make
All the HTKTools should be compiled and placed in ~/htk/bin/bin.linux. Add this directory to your PATH.
% setenv PATH ~/htk/bin/bin.linux:$PATH
The HTKTools executables can be tested using the demonstration script provided by Entropic.
% cd ~/htk
% wget ftp://username:password@htk.eng.cam.ac.uk/software/HTK-samples-3.0.tar.gz
% tar xvfz HTK-samples-3.0.tar.gz
Then you can follow the instructions in ~/htk/samples/HTKDemo/README.
% cd ~/htk/samples/HTKDemo
% ./runDemo configs/monPlainM1S3.dcf
If you see the following message, HTKLib, HTKTools and the environment are set up correctly.
------------------------ Overall Results --------------------------
SENT: %Correct=0.00 [H=0, S=3, N=3]
WORD: %Corr=63.91, Acc=59.40 [H=85, D=35, S=13, I=6, N=133]
===================================================================
Training/Recognizing user defined dataset
I basically followed Chapter 3, "A Tutorial Example of Using HTK", of The HTK Book, and trimmed it down for this simple training/recognition task.
We use HAscii2Bin to convert data from ASCII format to HTK binary format. First, compile HAscii2Bin.
% cd ~/htk/contrib
% gcc HAscii2Bin.c -g -o ../bin/bin.linux/HAscii2Bin -lm
% rehash
Then, prepare the data files and convert them to HTK format using HAscii2Bin. For this session we make HMMs for YAMA (mountain) and TANI (valley) from a set of one-dimensional vectors.
% cd ~/htk/work/tr (a placeholder for training data)
% cat >! yama1
3.2 3 3 3 4.1 4 4 5.3 5.2 5.2 5.5 5.1 5.2 4 4.2 4 4 3.5 3.7 3 3 3.1
^D
% cat >! yama2
2.8 3.2 3.8 3.3 3.5 4.1 4 4.7 5.1 5.2 5.2 5.2 4.7 4.2 4 4 3.5 3.7 3 3 3.1 3.2 2.7
^D
% HAscii2Bin yama1 1 ('1' to specify one-dimension. The program attaches '.ext' to the filename)
opening for data 4 yama1.ext
number of frames 23
% HAscii2Bin yama2 1
opening for data 4 yama2.ext
number of frames 24
% HList -h yama1.ext (use HList to view the HTK format data)
---------------------------------- Source: yama1.ext -----------------------------------
Sample Bytes: 4 Sample Kind: USER
Num Comps: 1 Sample Period: 200.0 us
Num Samples: 23 File Format: HTK
------------------------------------ Samples: 0->-1 ------------------------------------
0: 3.200
1: 3.000
2: 3.000
3: 3.000
4: 4.100
5: 4.000
6: 4.000
7: 5.300
8: 5.200
9: 5.200
10: 5.500
11: 5.100
12: 5.200
13: 4.000
14: 4.200
15: 4.000
16: 4.000
17: 3.500
18: 3.700
19: 3.000
20: 3.000
21: 3.100
22: 3.100
----------------------------------------- END ------------------------------------------
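The HList header above reflects the standard HTK parameter-file layout: a 12-byte header (number of samples, sample period in 100 ns units, bytes per sample, parameter kind) followed by the sample data. The following is a minimal Python sketch of that layout, not a port of HAscii2Bin. The function names are made up for illustration, and the big-endian byte order and the 2000-unit sample period (200.0 us) are assumptions chosen to match the listing above.

```python
import os
import struct
import tempfile

USER_KIND = 9  # HTK parameter-kind code for USER

# The values typed into yama1 on this page.
YAMA1 = [3.2, 3, 3, 3, 4.1, 4, 4, 5.3, 5.2, 5.2, 5.5, 5.1, 5.2,
         4, 4.2, 4, 4, 3.5, 3.7, 3, 3, 3.1]


def write_htk_user(path, values, samp_period_100ns=2000):
    """Write one-dimensional samples as an HTK USER file (hypothetical helper)."""
    n = len(values)
    samp_size = 4  # one 4-byte float per frame
    with open(path, "wb") as f:
        # Header: nSamples (int32), sampPeriod (int32), sampSize (int16), parmKind (int16)
        f.write(struct.pack(">iihh", n, samp_period_100ns, samp_size, USER_KIND))
        f.write(struct.pack(">%df" % n, *values))


def read_htk(path):
    """Read the header and float samples back (hypothetical helper)."""
    with open(path, "rb") as f:
        n, period, size, kind = struct.unpack(">iihh", f.read(12))
        data = struct.unpack(">%df" % (n * size // 4), f.read(n * size))
    return n, period, size, kind, list(data)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "yama1.ext")
    write_htk_user(path, YAMA1)
    print(read_htk(path)[:4])
```

Note that this sketch writes exactly the values it is given, whereas the HAscii2Bin transcript above reports one frame more than the number of values typed.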
Create yama3 and yama4 (mountains) in the same manner, and likewise tani1 through tani4 (valleys).
% cat >! yama3
2.8 3.2 3 3.7 3.9 4.1 4 4.7 5.1 5.2 5.3 5.2 5.2 4.7 4.4 4.2 4 4 3.5 3.7 3.3 3 3.1 3.2 2.9
^D
% cat >! yama4
2 2 2 2 2.8 3.2 3 3.3 3 3.7 3.2 3.9 4.1 4 4.7 5.1 5 5 5 5.2 5.3 5.2 5.2 4.7 4.4 4.2 4 4 3.5 3.7 3.3 3 3.1 3.2 2.9
^D
% cat >! tani1
3.2 3.1 3 3.1 2.7 2.3 2.1 2.1 1.9 1.5 1.6 1.1 1.3 2.0 2.8 2.3 2.9 2.9 3 3.3 3 3
^D
% cat >! tani2
3.0 3.1 3 3.1 2.7 2.5 2.3 2.1 2.1 1.9 1.5 1.6 1.3 1.3 1.1 1.3 2.0 2.8 2.3 2.7 2.9 2.9 3 3.3 3 3 3.2
^D
% cat >! tani3
3.1 3.2 3.1 3 3.1 2.7 2.3 2.3 2.1 2.1 1.9 1.5 1.6 1.3 1.1 1.3 1.7 2.0 2.8 2.3 2.9 2.9 3 3.3
^D
% cat >! tani4
3 3.1 3 2.7 3.1 2.7 2.5 2.3 2.1 2.1 1.9 1.4 1.5 1.6 1.1 1.3 1.7 2.0 2.8 2.3 2.9 2.9 3 3.3 3
^D
% HAscii2Bin yama3 1
opening for data 4 yama3.ext
number of frames 26
% HAscii2Bin yama4 1
opening for data 4 yama4.ext
number of frames 36
% HAscii2Bin tani1 1
opening for data 4 tani1.ext
number of frames 23
% HAscii2Bin tani2 1
opening for data 4 tani2.ext
number of frames 28
% HAscii2Bin tani3 1
opening for data 4 tani3.ext
number of frames 25
% HAscii2Bin tani4 1
opening for data 4 tani4.ext
number of frames 26
% cat >! ~/htk/work/trfiles
tr/tani1.ext
tr/tani2.ext
tr/tani3.ext
tr/tani4.ext
tr/yama1.ext
tr/yama2.ext
tr/yama3.ext
tr/yama4.ext
^D
The label gives the ground truth of what is inside the data. In this case, the label is simply YAMA or TANI, but for speech data, it would be a list of phonemes.
% cd ~/htk/work
% cat >! yamatani.mlf
#!MLF!#
"*/yama*.lab"
YAMA
.
"*/tani*.lab"
TANI
.
^D
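To make the pattern lines in the MLF more concrete, here is a small Python sketch of how a data file name could be mapped to its label. It assumes that the data file's extension is replaced by .lab before matching and it uses Python's fnmatch as a stand-in for HTK's own pattern matcher, whose rules may differ in detail. The function names are hypothetical.

```python
import os
from fnmatch import fnmatchcase

MLF_TEXT = '''#!MLF!#
"*/yama*.lab"
YAMA
.
"*/tani*.lab"
TANI
.
'''


def parse_mlf(text):
    """Return a list of (pattern, labels) pairs from MLF text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines or lines[0] != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")
    entries, pattern, labels = [], None, []
    for ln in lines[1:]:
        if ln.startswith('"') and ln.endswith('"'):
            pattern, labels = ln[1:-1], []   # a quoted line starts a new entry
        elif ln == ".":
            entries.append((pattern, labels))  # a lone dot ends the entry
            pattern = None
        else:
            labels.append(ln)
    return entries


def labels_for(data_file, entries):
    """Find the labels of the first pattern matching data_file's .lab name."""
    lab_name = os.path.splitext(data_file)[0] + ".lab"
    for pattern, labels in entries:
        if fnmatchcase(lab_name, pattern):
            return labels
    return None


if __name__ == "__main__":
    entries = parse_mlf(MLF_TEXT)
    for name in ("tr/yama1.ext", "tr/tani3.ext"):
        print(name, labels_for(name, entries))
```

With wildcard patterns like these, two entries are enough to label all eight training files and both test files.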
The HMMs are going to be placed in ~/htk/work/hmm[0-6]. The initial HMMs will be placed in hmm0, and as training goes on, the better-trained HMMs are placed in hmm1, hmm2, ..., hmm6. First, create the HMM prototype and the list of training files.
% cd ~/htk/work
% cat >! proto
~o <VecSize> 1 <USER>
~h "proto"
<BeginHMM>
<NumStates> 5
<State> 2
<Mean> 1
0.0
<Variance> 1
1.0
<State> 3
<Mean> 1
0.0
<Variance> 1
1.0
<State> 4
<Mean> 1
0.0
<Variance> 1
1.0
<TransP> 5
0.0 1.0 0.0 0.0 0.0
0.0 0.6 0.4 0.0 0.0
0.0 0.0 0.6 0.4 0.0
0.0 0.0 0.0 0.7 0.3
0.0 0.0 0.0 0.0 0.0
<EndHMM>
^D
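In an HTK model definition, the first and last states are non-emitting entry and exit states, which is why the prototype defines output distributions only for states 2 to 4. Every row of the transition matrix should sum to one, except the last row, which is all zeros. The Python sketch below checks that property for the matrix above and derives the expected number of frames spent in a state from its self-loop probability. The helper names are made up for illustration.

```python
# Transition matrix copied from the prototype above.
TRANS_P = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]


def check_transitions(a, tol=1e-6):
    """Return a list of problems found in an N x N transition matrix."""
    n = len(a)
    problems = []
    for i, row in enumerate(a):
        if len(row) != n:
            problems.append("row %d has %d entries, expected %d" % (i + 1, len(row), n))
            continue
        # The exit state (last row) has no outgoing transitions.
        expected = 0.0 if i == n - 1 else 1.0
        if abs(sum(row) - expected) > tol:
            problems.append("row %d sums to %g, expected %g" % (i + 1, sum(row), expected))
        if row[0] != 0.0:
            problems.append("row %d has a transition into the entry state" % (i + 1))
    return problems


def expected_duration(self_loop):
    """Mean number of frames spent in a state with the given self-loop probability."""
    return 1.0 / (1.0 - self_loop)


if __name__ == "__main__":
    print(check_transitions(TRANS_P))
    for state in (2, 3, 4):
        print(state, expected_duration(TRANS_P[state - 1][state - 1]))
```

These initial values are only a starting point; the re-estimation tools update the transition probabilities during training.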
% cd ~/htk/work
% cat >! trfiles
tr/yama1.ext
tr/yama2.ext
tr/yama3.ext
tr/yama4.ext
tr/tani1.ext
tr/tani2.ext
tr/tani3.ext
tr/tani4.ext
^D
% HCompV -o YAMA -S trfiles -f 0.01 -m -M hmm0 proto (compute mean & variance from the entire data)
% HCompV -o TANI -S trfiles -f 0.01 -m -M hmm0 proto
% HRest -v 0.01 -S trfiles -M hmm1 -I yamatani.mlf hmm0/YAMA
% HRest -v 0.01 -S trfiles -M hmm1 -I yamatani.mlf hmm0/TANI
% HERest -v 0.01 -S trfiles -d hmm1 -I yamatani.mlf -M hmm2 wlist (embedded training from here on)
% HERest -v 0.01 -S trfiles -H hmm2/newMacros -I yamatani.mlf -M hmm3 wlist
% HERest -v 0.01 -S trfiles -H hmm3/newMacros -I yamatani.mlf -M hmm4 wlist
% HERest -v 0.01 -S trfiles -H hmm4/newMacros -I yamatani.mlf -M hmm5 wlist
% HERest -v 0.01 -S trfiles -H hmm5/newMacros -I yamatani.mlf -M hmm6 wlist
The final HMMs will appear in hmm6/newMacros. The wlist argument is the HMM list file; the commands above assume it contains the two model names, YAMA and TANI, one per line.
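The first training step is a flat start: HCompV pools all the training data, computes a global mean and variance, and copies them into every state of the prototype. With -f 0.01 it also derives a variance floor equal to 0.01 times the global variance. The Python sketch below illustrates only that arithmetic, on two of the training files from this page. The function names are hypothetical, and dividing by N rather than N-1 is an assumption, not something taken from the HCompV source.

```python
# Two of the training sequences typed in earlier on this page.
YAMA1 = [3.2, 3, 3, 3, 4.1, 4, 4, 5.3, 5.2, 5.2, 5.5, 5.1, 5.2,
         4, 4.2, 4, 4, 3.5, 3.7, 3, 3, 3.1]
TANI1 = [3.2, 3.1, 3, 3.1, 2.7, 2.3, 2.1, 2.1, 1.9, 1.5, 1.6, 1.1, 1.3,
         2.0, 2.8, 2.3, 2.9, 2.9, 3, 3.3, 3, 3]


def global_stats(sequences):
    """Pool all frames of all sequences and return (mean, variance)."""
    values = [x for seq in sequences for x in seq]
    n = len(values)
    mean = sum(values) / n
    # Population variance (divide by N); an assumption for this sketch.
    var = sum((x - mean) ** 2 for x in values) / n
    return mean, var


def variance_floor(var, factor=0.01):
    """Floor derived from the global variance, as with the -f option."""
    return factor * var


if __name__ == "__main__":
    mean, var = global_stats([YAMA1, TANI1])
    print("mean", mean, "variance", var, "floor", variance_floor(var))
```

Because both models start from the same pooled statistics, YAMA and TANI are identical after this step; the differences between them come from the label-driven re-estimation that follows.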
First, create the test data.
% mkdir ~/htk/work/te
% cat >! ~/htk/work/te/yama5
3 3 3 3 3 4 4 4 5 5 5 4 4 4 4 4 4 4 4 3 3 3 3 3 3 3 3
^D
% cat >! ~/htk/work/te/tani5
3 3 3 3 3 3 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3
^D
% cd ~/htk/work/te
% HAscii2Bin tani5 1
opening for data 4 tani5.ext
number of frames 35
% HAscii2Bin yama5 1
opening for data 4 yama5.ext
number of frames 28
% cat >! ~/htk/work/tefiles
te/yama5.ext
te/tani5.ext
^D
Create a word network (wdnet) from the grammar.
% cd ~/htk/work
% cat >! grammar
( YAMA | TANI)
^D
% HParse grammar wdnet
Specify the correspondence between HMM labels and words. It is a one-to-one correspondence in our case, but for a speech example, it would be something like "k ao l" -> "call".
% cat >! ~/htk/work/dict
YAMA YAMA
TANI TANI
^D
Recognize the data listed in ~/htk/work/tefiles (tani5, yama5) using HVite (Viterbi decoding), and check the result with HResults.
% cd ~/htk/work
% HVite -T 1 -S tefiles -H hmm6/newMacros -i results -w wdnet dict wlist
Read 2 physical / 2 logical HMMs
Read lattice with 5 nodes / 5 arcs
Created network with 9 nodes / 9 links
File: te/yama5.ext
YAMA == [28 frames] -0.4559 [Ac=-12.8 LM=0.0] (Act=6.6)
File: te/tani5.ext
TANI == [35 frames] -1.2518 [Ac=-43.8 LM=0.0] (Act=6.7)
% HResults -t -I yamatani.mlf wlist results
====================== HTK Results Analysis =======================
Date: Wed Jan 9 16:21:40 2002
Ref : yamatani.mlf
Rec : results
------------------------ Overall Results --------------------------
SENT: %Correct=100.00 [H=2, S=0, N=2]
WORD: %Corr=100.00, Acc=100.00 [H=2, D=0, S=0, I=0, N=2]
===================================================================
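To give a feel for what the Viterbi decoding step computes, here is a minimal Python sketch that scores a sequence against two left-to-right Gaussian HMMs and picks the better one. It is not HVite's implementation. The transition matrix is the one from the prototype on this page, but the means and variances are hypothetical values chosen by hand for illustration; the trained values are the ones in hmm6/newMacros.

```python
import math

NEG_INF = float("-inf")

# Transition matrix from the prototype (entry and exit states are non-emitting).
TRANS = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.0, 0.6, 0.4, 0.0, 0.0],
    [0.0, 0.0, 0.6, 0.4, 0.0],
    [0.0, 0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]

# Hypothetical model parameters, NOT the trained ones.
MODELS = {
    "YAMA": {"means": [3.0, 5.0, 3.0], "vars": [0.25, 0.25, 0.25]},
    "TANI": {"means": [3.0, 1.5, 3.0], "vars": [0.25, 0.25, 0.25]},
}

# Test sequences typed in on this page.
YAMA5 = [3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4,
         3, 3, 3, 3, 3, 3, 3, 3]
TANI5 = [3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
         1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3]


def log_gauss(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def safe_log(p):
    return math.log(p) if p > 0.0 else NEG_INF


def viterbi_log_prob(obs, means, variances, trans):
    """Log probability of the best state path from entry to exit."""
    n = len(trans)
    emitting = range(1, n - 1)  # indices of the emitting states
    # First frame: leave the entry state and emit obs[0].
    delta = {j: safe_log(trans[0][j]) + log_gauss(obs[0], means[j - 1], variances[j - 1])
             for j in emitting}
    for x in obs[1:]:
        delta = {j: max(delta[i] + safe_log(trans[i][j]) for i in emitting)
                    + log_gauss(x, means[j - 1], variances[j - 1])
                 for j in emitting}
    # Last step: move into the exit state.
    return max(delta[i] + safe_log(trans[i][n - 1]) for i in emitting)


def recognize(obs):
    """Return the name of the model with the highest Viterbi score."""
    return max(MODELS, key=lambda name: viterbi_log_prob(
        obs, MODELS[name]["means"], MODELS[name]["vars"], TRANS))


if __name__ == "__main__":
    for label, obs in (("yama5", YAMA5), ("tani5", TANI5)):
        print(label, "->", recognize(obs))
```

Working in the log domain avoids numerical underflow on longer sequences, which is also why the scores in the HVite trace are negative log-likelihood-style numbers.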
The original HTK cannot handle streaming input other than live audio, so I've added a few lines to HVite.c and HParm.c so that it can handle user-defined data types in streaming mode. Since HParm.c already has a mechanism for handling external data (HParm.c:296-362), you only need to modify or add your own functions in HMyFuncs.c to deal with user-defined data types in streaming mode. The sample functions I wrote in HMyFuncs.c take one floating-point number at a time through scanf to simulate streaming input.
Before using HVite in streaming mode, we need to create a config file that tells the program the data type.
% cd ~/htk/work
% cat >! config
SOURCEKIND = USER
TARGETKIND = USER
^D
The following command runs streaming recognition using the HMMs you have trained. "-h" specifies that the data stream comes from the functions defined in HMyFuncs.c rather than from live audio. "-T" specifies the trace level. Level 1 displays the recognition result after all the data has been read, and level 7 displays the most likely LABEL and WORD at each instant. You'll probably want to use 7 to see the result at every moment of the stream.
% HVite -h -D -T 7 -C config -H hmm6/newMacros -i results -w wdnet dict wlist
HTK Configuration Parameters[2]
Module/Tool Parameter Value
# TARGETKIND USER
# SOURCEKIND USER
Read 2 physical / 2 logical HMMs
Read lattice with 5 nodes / 5 arcs
Created network with 9 nodes / 9 links
READY[1]>
numer? (-1 to exit)> 3
0: 3.000
Optimum @1 HMM: YAMA (YAMA) 2 -0.284
numer? (-1 to exit)> 3.4
1: 3.400
Optimum @2 HMM: YAMA (YAMA) 2 -0.541
numer? (-1 to exit)> 3.4
2: 3.400
Optimum @3 HMM: YAMA (YAMA) 7 -0.627
numer? (-1 to exit)> 3.5
3: 3.500
Optimum @4 HMM: YAMA (YAMA) 7 -0.713
numer? (-1 to exit)> 4
4: 4.000
Optimum @5 HMM: YAMA (YAMA) 7 -1.041
numer? (-1 to exit)> 4.3
5: 4.300
Optimum @6 HMM: YAMA (YAMA) 7 -0.973
numer? (-1 to exit)> 4.5
6: 4.500
Optimum @7 HMM: YAMA (YAMA) 7 -0.907
numer? (-1 to exit)> 4.6
7: 4.600
Optimum @8 HMM: YAMA (YAMA) 7 -0.862
numer? (-1 to exit)> 5
8: 5.000
Optimum @9 HMM: YAMA (YAMA) 7 -0.867
numer? (-1 to exit)> 5
9: 5.000
Optimum @10 HMM: YAMA (YAMA) 7 -0.871
numer? (-1 to exit)> 5.3
10: 5.300
Optimum @11 HMM: YAMA (YAMA) 7 -0.923
numer? (-1 to exit)> 4.32
11: 4.320
Optimum @12 HMM: YAMA (YAMA) 7 -0.891
numer? (-1 to exit)> 3.4
12: 3.400
Optimum @13 HMM: YAMA (YAMA) 7 -0.969
numer? (-1 to exit)> 2
13: 2.000
Optimum @14 HMM: YAMA (YAMA) 7 -1.484
numer? (-1 to exit)> -1
14: 2.000
Optimum @15 HMM: YAMA (YAMA) 7 -1.930
YAMA == [15 frames] -3.0803 [Ac=-46.2 LM=0.0] (Act=6.3)
READY[2]>
numer? (-1 to exit)>
%
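The input side of such a session, one number per prompt with -1 ending the utterance, can be sketched in Python as a generator. This is only an illustration of the idea and not a translation of HMyFuncs.c; the function name and the choice to skip unparsable lines are assumptions. Unlike the session above, where one more frame repeating the last value appears after -1 is entered, this sketch simply stops at the sentinel.

```python
import sys


def stream_values(lines, sentinel=-1.0):
    """Yield one float per input line until the sentinel value is read."""
    for line in lines:
        text = line.strip()
        if not text:
            continue  # ignore blank lines
        try:
            value = float(text)
        except ValueError:
            continue  # ignore anything that is not a number
        if value == sentinel:
            return  # end of this utterance
        yield value


if __name__ == "__main__":
    # Each value becomes available as soon as its line arrives,
    # so a recognizer could be updated once per frame inside this loop.
    for frame, value in enumerate(stream_values(sys.stdin)):
        print("%d: %.3f" % (frame, value))
```

Because the generator hands over values one at a time, whatever consumes it never needs the whole sequence in memory, which is the essential difference from the file-based recognition shown earlier.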
Extension
You should be able to extend this by using the differential components of the data for training/recognition, by using bigrams/n-grams, and so on. A good place to start would be Chapter 3 of The HTK Book.
- soshi
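As a starting point for the differential component mentioned above, here is a Python sketch of the regression formula that The HTK Book gives for delta coefficients: d_t = sum over k = 1..W of k * (c[t+k] - c[t-k]), divided by 2 * sum over k of k^2. The function name is hypothetical, and handling the edges by repeating the first and last values is an assumption made for this sketch.

```python
def delta(values, window=2):
    """Regression-based differential (delta) of a one-dimensional sequence."""
    n = len(values)
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    out = []
    for t in range(n):
        num = 0.0
        for k in range(1, window + 1):
            ahead = values[min(t + k, n - 1)]   # repeat the last value past the end
            behind = values[max(t - k, 0)]      # repeat the first value before the start
            num += k * (ahead - behind)
        out.append(num / denom)
    return out


if __name__ == "__main__":
    # First part of yama1 from this page: the delta turns positive on the way up.
    yama_rise = [3.2, 3, 3, 3, 4.1, 4, 4, 5.3, 5.2, 5.2, 5.5]
    for x, d in zip(yama_rise, delta(yama_rise)):
        print("%.1f %+.3f" % (x, d))
```

Pairing each value with its delta would give two-dimensional vectors, so the prototype's VecSize and the dimension passed to HAscii2Bin would have to change accordingly.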