Important note:

  * Please use Janus/Tcl commands only! No Perl, shell scripts, etc.
  * If you have a question about the homework, feel free to ask Stan at scjou@cs.cmu.edu

* Making a demi-syllable Mandarin dictionary

In Mandarin, the pronunciation of every character is a syllable, which is usually decomposed into an initial-final (I-F) pair. The I-F structure resembles the consonant-vowel structure, but the two are not the same thing. We call the I and F units demi-syllables. For example, the syllable 'zhong' decomposes into the two demi-syllables 'zh' and 'ong', while 'biao' decomposes into 'b' and 'iao'.

Since Mandarin is a tonal language, we often attach a tone marker to the syllable. For example, 'zhong3' means the syllable 'zhong' with the 3rd tone. It is commonly assumed that the tonal information can be ignored in the initial part, so we decompose 'zhong3' into 'zh' and 'ong3', instead of 'zh3' and 'ong3'.

- Task 1.a: Write a Janus Tcl script that converts the syllable-based raw dictionary into a demi-syllable-based dictionary in Janus format.

  Input: /project/Class-11-753/data/CH/dict/train-orig.syl.dict

  For example, an entry of the input raw dictionary reads

      ge4zhong3 ge4 zhong3

  The first column is a Mandarin word in romanized form (Pinyin) and the remaining columns are the syllables of the word. Your Janus Tcl script should convert the entry to the Janus dictionary format using the 'add' command of a Dictionary object (say, 'dict'):

      dict add ge4zhong3 { {g WB} {e T4} zh {ong WB T3} }

  Note that there are six tags here: WB for word boundary, which appears at the beginning and the end of the word, and T1 to T5 for the five tones, respectively.
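  The conversion of one entry can be sketched in plain Tcl as follows. This is only an illustration under stated assumptions, not a reference solution: the proc names splitSyllable and wordToPhones are hypothetical, every syllable is assumed to end in a tone digit 1 to 5, the initial/final cut is assumed to fall just before the first character from the set a, e, i, o, u, v, - (the rule given for this task), and a one-phone word is assumed to carry WB only once.

```tcl
# Cut one tone-marked syllable into {initial final tone}.
# ASSUMPTION: the syllable ends in a tone digit 1-5, and the initial is
# everything before the first character from {a e i o u v -} (may be empty).
proc splitSyllable {syl} {
    if {![regexp {^([^aeiouv-]*)([aeiouv-][^0-9]*)([1-5])$} $syl all ini fin tone]} {
        error "cannot parse syllable '$syl'"
    }
    return [list $ini $fin $tone]
}

# Build the tagged phone list for one word from its list of syllables.
proc wordToPhones {syllables} {
    # First pass: flatten into {phone tone} pairs; tone is "" for initials.
    set units {}
    foreach syl $syllables {
        set parts [splitSyllable $syl]
        set ini  [lindex $parts 0]
        set fin  [lindex $parts 1]
        set tone [lindex $parts 2]
        if {[string length $ini] > 0} { lappend units [list $ini ""] }
        lappend units [list $fin $tone]
    }
    # Second pass: add WB to the first and last phone, Tn to every final.
    set last [expr {[llength $units] - 1}]
    set phones {}
    set i 0
    foreach unit $units {
        set entry [list [lindex $unit 0]]
        if {$i == 0 || $i == $last} { lappend entry WB }
        if {[string length [lindex $unit 1]] > 0} { lappend entry T[lindex $unit 1] }
        lappend phones $entry
        incr i
    }
    return $phones
}

puts [wordToPhones {ge4 zhong3}]
# prints: {g WB} {e T4} zh {ong WB T3}
```

  Inside Janus, the returned list would be passed as the second argument of the 'add' command of the Dictionary object, for example: dict add ge4zhong3 [wordToPhones {ge4 zhong3}]. Keeping the split and the tagging in separate procs lets the split be reused when collecting the INITIAL and FINAL classes.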
  To separate a Mandarin syllable into initial and final, cut the syllable just to the left of the first occurrence of any character in the set { a, e, i, o, u, v, - }. For example:

      zhong -> zh ong
      biao  -> b iao
      sh-i  -> sh -i

  Note that you should also add an entry for the 'silence word', as described in the Session 2 web page.

- Task 1.b: Generate a phonesSet description file based on the demi-syllables you found in Task 1.a.

  The phonesSet description file should contain at least four classes: PHONES, SILENCE, INITIAL, and FINAL. For example:

      PHONES @ SIL zh ong b iao sh -i
      SILENCE SIL
      INITIAL zh b sh
      FINAL ong iao -i

  Use a Janus PhonesSet object to do this task.

  Hint: Tcl has some powerful commands:

      string - Manipulate strings
      regexp - Match a regular expression against a string

  See http://tcl.activestate.com/man/tcl8.3/TclCmd/contents.htm for the Tcl command reference, and search the web for usage examples.

  Submission of Task 1.a and 1.b: Send Stan (scjou@cs.cmu.edu) the following NFS paths (not the files!):

    1. The Tcl script for 1.a and 1.b. Stan should be able to reproduce your result with

           UNIX> Janus YourScript.tcl

    2. The generated demi-syllable dictionary and phonesSet description file.

* Making a Janus database

Here we want to process the data at /project/Class-11-753/data/CH, so we first need to generate a Janus database for data interpretation and organization. Under that directory, the sub-directory adc/ holds the waveform (adc) files and rmn/ holds the romanized transcripts.

- Task 2: Use the method described in the Session 2 web page to generate a Janus dbase containing the following information:

      + A dbase key, which is a unique ID. Usually we use the utterance ID.
      + Speaker ID: SPKID
      + Utterance ID: UTTID
      + Waveform path: ADCPATH
      + Waveform filename: ADCFILE
      + Utterance start time: FROM. The FROM value in this task is always '0'.
      + Utterance end time: TO.
        The TO value in this task is always 'last'.
      + Transcript: TEXT

  For example, with the file

      /project/Class-11-753/data/CH/rmn/CH094.rmn

  we know the SPKID from either the filename or the first line of the file: ";SprecherID 094". The following lines contain the utterance ID number and the romanized transcript alternately, line by line. Your script should therefore do something like:

      db add spk094_utt1 {
          {SPKID spk094}
          {UTTID spk094_utt1}
          {ADCPATH /project/Class-11-753/data/CH/adc/094/}
          {ADCFILE CH094_1.adc.shn}
          {FROM 0}
          {TO last}
          {TEXT wai4jiao1bu4 fa1yan2ren2 da1 ji4zhe3 wen4 zhong1mei3 zhi1shi5chan3quan2 cuo1shang1 da2cheng2 yi1zhi4 you3li4 yu2 shuang1bian1 guan1xi5 gai3shan4 he2 fa1zhan3}
      }

  Run this "db add" command for every speaker-utterance pair.

  Hint: Tcl has some powerful commands:

      glob   - Return names of files that match patterns
      file   - Manipulate file names and attributes
      string - Manipulate strings
      regexp - Match a regular expression against a string

  See http://tcl.activestate.com/man/tcl8.3/TclCmd/contents.htm for the Tcl command reference, and search the web for usage examples.

  Submission of Task 2: Send Stan (scjou@cs.cmu.edu) the following NFS paths (not the files!):

    1. The Tcl script. Stan should be able to reproduce your result with

           UNIX> Janus YourScript.tcl

    2. The dbase files.
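  The per-file part of Task 2 can be sketched in plain Tcl as follows. This is a hedged illustration, not a reference solution: the proc name rmnToEntries is hypothetical, the exact layout of the utterance-ID lines is an assumption (the sketch takes the first number on the line as the ID, e.g. a line such as "; 1:"), and the sample lines at the bottom are made-up input rather than the real contents of CH094.rmn.

```tcl
# Turn the lines of one .rmn file into a list of {key fields} pairs.
# ASSUMPTION: after the ";SprecherID NNN" header, non-empty lines alternate
# between an utterance-ID line (first number on the line is the ID) and the
# transcript belonging to that ID.
proc rmnToEntries {lines adcRoot} {
    set spk ""
    set utt ""
    set entries {}
    foreach line $lines {
        set line [string trim $line]
        if {[string length $line] == 0} { continue }
        if {[regexp {^;SprecherID +([0-9]+)} $line all id]} {
            set spk $id
            continue
        }
        if {[string length $utt] == 0} {
            # Expecting an utterance-ID line here.
            if {![regexp {([0-9]+)} $line all utt]} {
                error "expected an utterance ID, got '$line'"
            }
            continue
        }
        # Transcript line: emit one entry, then wait for the next ID line.
        set key spk${spk}_utt$utt
        lappend entries [list $key [list \
            [list SPKID   spk$spk] \
            [list UTTID   $key] \
            [list ADCPATH $adcRoot/$spk/] \
            [list ADCFILE CH${spk}_$utt.adc.shn] \
            [list FROM    0] \
            [list TO      last] \
            [concat TEXT  $line]]]
        set utt ""
    }
    return $entries
}

# Made-up sample input in the assumed layout.
set lines {{;SprecherID 094} {; 1:} {wai4jiao1bu4 fa1yan2ren2} {; 2:} {zhong1mei3 guan1xi5}}
set entries [rmnToEntries $lines /project/Class-11-753/data/CH/adc]
puts [lindex [lindex $entries 0] 0]
# prints: spk094_utt1
```

  In the real script, the lines of each file returned by glob would come from open, read, split and close, and each pair would be handed to the dbase object from the assignment, for example: db add [lindex $entry 0] [lindex $entry 1]. Keeping the speaker ID as a string avoids Tcl treating a value such as 094 as a malformed octal number.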
Last modified: Tue Dec 20 18:23:09 EST 2005
Maintainer: scjou@cs.cmu.edu.