Welcome to my home page


I am currently a Ph.D. student at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. My research interest is statistical machine translation. You are welcome to look at some of my current work here. My email address: qing at cs dot cmu dot edu.

New release of MGIZA++/PGIZA++


MGIZA++ is now integrated with Chaski and is able to run word alignment on Hadoop clusters. For better documentation, I have moved the software release page here: http://geek.kyloo.net/software . The newest version of MGIZA++ has the following improvements:

As MGIZA++ already has most of the functionality of PGIZA++, the development of PGIZA++ has been discontinued. Please visit http://geek.kyloo.net/software for more details.

Multi-thread GIZA


Multi-thread GIZA++ is an extension to the GIZA++ word alignment tool. It can train much faster than the original GIZA++ if you have more than one CPU. In addition, it fixes some bugs in GIZA++, and the final alignment perplexity is generally lower than with the original GIZA++. You can download the most up-to-date version here:

mgizapp.current.tar.gz

Here is a very brief introduction to the installation and usage of MGIZA++.

Installation

To install, download the latest release file, unpack the tarball, and run
				./configure --prefix=/your/install/dir
				make
				make install
			

The binaries will be in /your/install/dir/bin

Usage

MGIZA++ is fully compatible with the original GIZA++ configuration file. The only difference is that MGIZA++ has an additional parameter, NCPUS, which specifies the number of threads to run. It can be as large as the number of CPUs in your system.
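As a rough illustration of picking a value for NCPUS, here is a minimal Python sketch that clamps a requested thread count to the number of CPUs on the machine. The helper names choose_ncpus and ncpus_config_line are hypothetical and not part of MGIZA++, and the "name value" line layout is only an assumption about the configuration file; please check it against your own configuration file before using it.

```python
import os


def choose_ncpus(requested, available=None):
    """Clamp a requested thread count to the range [1, number of CPUs]."""
    if available is None:
        # os.cpu_count() may return None when the count cannot be determined.
        available = os.cpu_count() or 1
    return max(1, min(requested, available))


def ncpus_config_line(requested, available=None, key="NCPUS"):
    """Render one configuration line for the thread count.

    Assumption: the configuration file holds one whitespace-separated
    "name value" pair per line. Adjust `key` if your file spells the
    parameter differently.
    """
    return f"{key} {choose_ncpus(requested, available)}"
```

For example, asking for 8 threads on a 4-CPU machine gives a line that requests 4 threads, which matches the advice above that NCPUS can be as large as the number of CPUs.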

Output, post-processing and integration with the Moses package

The output of MGIZA++ is almost the same as that of the original GIZA++, except for the alignment file. Because each thread writes its alignments separately, there will be several partial alignment files, such as prefix.A3.final.part0, prefix.A3.final.part1 and so on. If you need a combined alignment (for example, to use it with the Moses package), please use this file

merge_alignment.py

with the following command:
				./merge_alignment.py `ls prefix.A3.final.part*` > prefix.A3.final
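To give an idea of what the merge step involves, here is a minimal Python sketch. It is not the code of merge_alignment.py, only an illustration under two assumptions: each alignment record is three lines long, and its first line starts with "# Sentence pair (N)", where N is the sentence number. The function names read_records, merge_parts and main are hypothetical.

```python
import re
import sys

# Assumption: a record header looks like
# "# Sentence pair (12) source length 5 target length 6 alignment score : 1e-08"
HEADER = re.compile(r"^# Sentence pair \((\d+)\)")


def read_records(lines):
    """Yield (sentence_id, [header, line2, line3]) from an iterable of lines."""
    it = iter(lines)
    for header in it:
        match = HEADER.match(header)
        if match is None:
            raise ValueError(f"expected a record header, got: {header!r}")
        try:
            body = [next(it), next(it)]
        except StopIteration:
            raise ValueError("truncated record at end of input") from None
        yield int(match.group(1)), [header] + body


def merge_parts(parts):
    """Merge several iterables of lines into one list sorted by sentence id."""
    records = []
    for part in parts:
        records.extend(read_records(part))
    records.sort(key=lambda record: record[0])
    return [line for _, record in records for line in record]


def main(paths):
    """Merge the part files named in `paths` and write the result to stdout."""
    # Assumption: the part files are UTF-8 text.
    handles = [open(path, encoding="utf-8") for path in paths]
    try:
        sys.stdout.writelines(merge_parts(handles))
    finally:
        for handle in handles:
            handle.close()
```

Calling main(sys.argv[1:]) would mimic the command above. Note that this sketch keeps all records in memory before sorting, which may be too much for very large corpora.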
A sample Moses script to use mgiza: use -mgiza to enable MGIZA.

PGIZA++


UPDATE : 05/06/2008

Updated the source and scripts, cleaning up unused temporary files. The most current version is here:

UPDATE : 05/02/2008

We just finished an optimization of PGIZA++. The previous package had some problems compiling, so please download the most current package to replace it. Because of the optimization, some scripts became invalid; I did not have time to fix them, so I removed them from the package. For now only the ssh script is available; the Condor and Maui scripts will be available after some testing.


PGIZA++ is another version that can run on clusters. You are welcome to try the current version here: pgiza-0.3.tar.gz.

PGIZA++ relies on the scheduler of the cluster. Scripts are provided for Maui, Condor and plain ssh remote procedure calls. Choose the script for your scheduler from this file: pgiza-scripts.tar.gz.

Here is also the manual for setting up PGIZA++, especially for clusters without schedulers (using ssh + NFS): Manual


PGIZA++ is not only suitable for parallel processing. Because all the functionality is compiled into individual executables, you can also take a previously trained model and continue training on different data. Check it out and have fun. (One last word: for debugging purposes, or because I am lazy, I do not clean up the intermediate results, which can be SUPER LARGE. They are in the "coll" and "tmp" directories, so just remember to clean them. I will add a function to clean them up when I am sure it is stable enough.)