Homework 2
Out: Dec-20 Due: Jan-4 Wednesday night (12:00)

Important note:
* Please use Janus/Tcl commands only!  No Perl, shell scripts, etc.
* If you have a question about the homework,
  feel free to ask Stan at scjou@cs.cmu.edu

* Making a demi-syllable Mandarin dictionary

In the Mandarin language, every character's pronunciation is a syllable,
which is often decomposed as an initial-final (I-F) pair.
You can imagine the I-F structure as the consonant-vowel structure,
but they are not the same thing.
We call the I and F units demi-syllables.
For example, the syllable 'zhong' can be decomposed into
two demi-syllables, 'zh' and 'ong',
while 'biao' can be decomposed into 'b' and 'iao'.

Since Mandarin is a tonal language,
we often attach the tonal marker to the syllables.
For example, 'zhong3' means the syllable 'zhong' with the 3rd tone.
It is commonly assumed that the tonal information can be ignored
in the initial part,
so we decompose 'zhong3' into 'zh' and 'ong3', instead of 'zh3' and 'ong3'.

- Task 1.a: Write a Janus Tcl script that
  converts the syllable-based raw dictionary
  into a demi-syllable-based Janus-format dictionary.

  Input: /project/Class-11-753/data/CH/dict/train-orig.syl.dict
  For example, an entry of the input raw dictionary reads
    ge4zhong3 ge4 zhong3
  The first column is a Mandarin word in romanized form (Pinyin),
  and the remaining columns are the syllables of the word.
  Your Janus Tcl script should convert the entry to the Janus dictionary format
  using a Dictionary object (say, 'dict') with the 'add' command:
    dict add ge4zhong3 { {g WB} {e T4} zh {ong WB T3} }

  Note that here we have six tags:
  WB for word boundary, which appears at the beginning and the end of the word,
  and T1 to T5 for the five tones, respectively.
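
  The tagging described above can be sketched in plain Tcl as follows.
  This is only an illustration, assuming each syllable has already been
  decomposed into a {initial final tone} triple; the proc name 'tagPhones'
  is hypothetical. Placing WB before the tone tag follows the example entry,
  and giving a single-phone word just one WB tag is an assumption.

```tcl
# Build the tagged phone list for one word from {initial final tone} triples.
# 'tagPhones' is an illustrative name, not a Janus command.
proc tagPhones {syllables} {
    set phones {}
    foreach syl $syllables {
        foreach {initial final tone} $syl break
        # The initial carries no tone tag; skip it if the syllable has none.
        if {[string length $initial] > 0} {
            lappend phones [list $initial]
        }
        # The final carries the tone tag T1..T5 (omitted if no tone is given).
        if {[string length $tone] > 0} {
            lappend phones [list $final T$tone]
        } else {
            lappend phones [list $final]
        }
    }
    if {[llength $phones] == 0} {
        return {}
    }
    # Add WB right after the phone name on the last and the first phone.
    set last [expr {[llength $phones] - 1}]
    set phones [lreplace $phones $last $last \
                    [linsert [lindex $phones $last] 1 WB]]
    if {$last > 0} {
        # Assumption: a word with a single phone gets WB only once.
        set phones [lreplace $phones 0 0 [linsert [lindex $phones 0] 1 WB]]
    }
    return $phones
}

# The word ge4zhong3, already decomposed:
puts [tagPhones {{g e 4} {zh ong 3}}]
```

  The returned list can then be passed as the last argument of 'dict add'.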

  To separate a Mandarin syllable into initial and final,
  cut the syllable immediately to the left of the first occurrence of
  any character in the set { a, e, i, o, u, v, - }.
  For example:
    zhong -> zh ong
    biao  -> b  iao
    sh-i  -> sh -i
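
  A minimal sketch of this cutting rule, using only the Tcl 'regexp' and
  'string' commands. The proc name 'splitSyllable' and the returned
  {initial final tone} layout are illustrative choices. Two cases are not
  specified by the rule and are handled here by assumption: a syllable that
  starts with one of the listed characters gets an empty initial, and a
  syllable containing none of them is kept whole as the final.

```tcl
# Split one Pinyin syllable (optionally ending in a tone digit 1-5)
# into a three-element list: {initial final tone}.
proc splitSyllable {syl} {
    set tone ""
    # Peel off a trailing tone digit, if any (zhong3 -> zhong and 3).
    if {[regexp {^(.*)([1-5])$} $syl -> base digit]} {
        set syl  $base
        set tone $digit
    }
    # Locate the first character from the set a e i o u v -
    if {[regexp -indices {[aeiouv-]} $syl pos]} {
        set cut     [lindex $pos 0]
        # When cut is 0 the range is empty, so the initial is "".
        set initial [string range $syl 0 [expr {$cut - 1}]]
        set final   [string range $syl $cut end]
    } else {
        # Assumption: no such character, so treat the whole syllable as final.
        set initial ""
        set final   $syl
    }
    return [list $initial $final $tone]
}

foreach s {zhong3 biao sh-i} {
    puts "$s -> [splitSyllable $s]"
}
```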

  Note that you should also add an entry for
  the 'silence word', as described in the Session 2 web page.

- Task 1.b: Generate a phonesSet description file
  from the demi-syllables you found in Task 1.a.

  The phonesSet description file should contain at least
  four classes: PHONES, SILENCE, INITIAL, and FINAL.
  For example:
    PHONES     @ SIL zh ong b iao sh -i
    SILENCE    SIL
    INITIAL    zh b sh
    FINAL      ong iao -i
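
  One way to collect the four classes is sketched below in plain Tcl.
  It assumes the syllables are available as {initial final tone} triples
  and keeps phones in first-seen order; the proc name 'phoneClasses' is
  hypothetical. The sketch only builds the class lines and does not show
  the Janus PhonesSet calls that the task asks you to use.

```tcl
# Collect unique initials and finals and format the four class lines.
# Tones are ignored here because they are tags, not part of the phone names.
proc phoneClasses {syllables} {
    set phones   {}
    set initials {}
    set finals   {}
    foreach syl $syllables {
        foreach {initial final tone} $syl break
        # Arrays serve as "already seen" sets.
        if {[string length $initial] > 0 && ![info exists seenI($initial)]} {
            set seenI($initial) 1
            lappend initials $initial
            lappend phones   $initial
        }
        if {![info exists seenF($final)]} {
            set seenF($final) 1
            lappend finals $final
            lappend phones $final
        }
    }
    return [list \
        "PHONES     @ SIL [join $phones]" \
        "SILENCE    SIL" \
        "INITIAL    [join $initials]" \
        "FINAL      [join $finals]"]
}

foreach line [phoneClasses {{zh ong 3} {b iao 1} {sh -i 4} {zh ong 1}}] {
    puts $line
}
```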

  Use a Janus PhonesSet object to do this task.

  Hint:
  Tcl has some powerful commands:
    string - Manipulate strings
    regexp - Match a regular expression against a string
  See http://tcl.activestate.com/man/tcl8.3/TclCmd/contents.htm
  for the Tcl command reference, and try searching the web for usage examples.

Submission of Task 1.a and 1.b:
Send to Stan (scjou@cs.cmu.edu) the following NFS paths (not the files!):
1. the Tcl script for 1.a and 1.b:
   Stan should be able to reproduce your result by running
   UNIX> Janus YourScript.tcl
2. the generated demi-syllable dictionary and phonesSet description file.



* Making a Janus database
  
  Here we want to process the data at
  /project/Class-11-753/data/CH
  so we first need to generate a Janus database
  for data interpretation and organization.
  Under the aforementioned directory,
  the sub-directory adc/ is where the waveform (adc) files are located,
  and rmn/ is where the romanized transcripts are located.

- Task 2: Use the method described in the Session 2 web page
  to generate a Janus dbase containing the following information:
  + A dbase key which is a unique ID.  Usually we use the utterance ID.
  + Speaker ID: SPKID
  + Utterance ID: UTTID
  + Waveform path: ADCPATH
  + Waveform filename: ADCFILE
  + Utterance start time: FROM. The FROM value in this task is always '0'.
  + Utterance end time: TO. The TO value in this task is always 'last'.
  + Transcript: TEXT

  For example, with the file
  /project/Class-11-753/data/CH/rmn/CH094.rmn
  we know the SPKID from either the filename or the first line of the file:
  ";SprecherID 094".
  The following lines alternate, line by line, between
  the utterance ID number and the romanized transcript.
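
  The sketch below shows one way to walk through such a file's contents
  in plain Tcl. The exact layout of the utterance ID lines is not given
  here, so the code assumes that each ID line contains the utterance
  number as its first run of digits and that ID and transcript lines
  strictly alternate; check this against the real .rmn files. The proc
  name 'parseRmn' and the sample text are hypothetical.

```tcl
# Parse the text of one .rmn file into {speakerNumber {{uttNum text} ...}}.
proc parseRmn {text} {
    set spk       ""
    set utts      {}
    set pendingId ""
    foreach line [split $text "\n"] {
        set line [string trim $line]
        if {[string length $line] == 0} continue
        # Header line, e.g. ";SprecherID 094"
        if {[regexp {^;SprecherID\s+(\S+)} $line -> id]} {
            set spk $id
            continue
        }
        if {[string length $pendingId] == 0} {
            # Assumption: the ID line holds the utterance number as digits.
            if {![regexp {[0-9]+} $line pendingId]} {
                error "no utterance number found in line: $line"
            }
        } else {
            # The line after an ID line is that utterance's transcript.
            lappend utts [list $pendingId $line]
            set pendingId ""
        }
    }
    return [list $spk $utts]
}

# Hypothetical sample; a real script would read the file with open/read/close.
set sample ";SprecherID 094\n1\nwai4jiao1bu4 fa1yan2ren2\n2\nda1 ji4zhe3 wen4"
foreach {spk utts} [parseRmn $sample] break
foreach u $utts {
    puts "speaker $spk, utterance [lindex $u 0]: [lindex $u 1]"
}
```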

  Therefore, for example, your script should do something like:
    db add spk094_utt1 { {SPKID spk094} {UTTID spk094_utt1} {ADCPATH /project/Class-11-753/data/CH/adc/094/} {ADCFILE CH094_1.adc.shn} {FROM 0} {TO last} {TEXT wai4jiao1bu4 fa1yan2ren2 da1 ji4zhe3 wen4 zhong1mei3 zhi1shi5chan3quan2 cuo1shang1 da2cheng2 yi1zhi4 you3li4 yu2 shuang1bian1 guan1xi5 gai3shan4 he2 fa1zhan3} }

  Run this "db add" command for every speaker-utterance pair.
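
  Building the key and the record for one utterance might look like the
  sketch below. The naming patterns (spk<N>_utt<M>, adc/<N>/,
  CH<N>_<M>.adc.shn) are inferred from the single example above and should
  be checked against the actual data; the proc name 'makeDbEntry' and its
  'root' parameter are hypothetical.

```tcl
# Return {key record} for one utterance, mirroring the example entry.
proc makeDbEntry {spkNum uttNum text {root /project/Class-11-753/data/CH}} {
    set spk "spk$spkNum"
    set key "${spk}_utt$uttNum"
    set record [list \
        [list SPKID   $spk] \
        [list UTTID   $key] \
        [list ADCPATH $root/adc/$spkNum/] \
        [list ADCFILE CH${spkNum}_$uttNum.adc.shn] \
        [list FROM    0] \
        [list TO      last] \
        [concat TEXT  $text]]
    # concat keeps the transcript words unbraced after TEXT, as in the example.
    return [list $key $record]
}

foreach {key record} [makeDbEntry 094 1 "wai4jiao1bu4 fa1yan2ren2"] break
puts "$key $record"
# Inside Janus, with 'db' being the dbase object, this would become:
#   db add $key $record
```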

  Hint:
  Tcl has some powerful commands:
    glob - Return names of files that match patterns
    file - Manipulate file names and attributes
    string - Manipulate strings
    regexp - Match a regular expression against a string
  See http://tcl.activestate.com/man/tcl8.3/TclCmd/contents.htm
  for the Tcl command reference, and try searching the web for usage examples.

Submission of Task 2:
Send to Stan (scjou@cs.cmu.edu) the following NFS paths (not the files!):
1. the Tcl script:
   Stan should be able to reproduce your result by running
   UNIX> Janus YourScript.tcl
2. the dbase files

Last modified: Tue Dec 20 18:23:09 EST 2005
Maintainer: scjou@cs.cmu.edu.