I am currently a Ph.D. student at the Language Technologies Institute, School of Computer Science, Carnegie Mellon University. My research interest is statistical machine translation. You are welcome to look at some of my current work here. My email address is qing at cs dot cmu dot edu.
MGIZA++ is now integrated with Chaski and can run word alignment on Hadoop clusters. For better documentation, I have moved the software release page here: http://geek.kyloo.net/software . The newest version of MGIZA++ has the following improvements:
Multi-threaded GIZA++ (MGIZA++) is an extension of the GIZA++ word alignment tool. It trains much faster than the original GIZA++ when more than one CPU is available. In addition, it fixes some bugs in GIZA++, and the final alignment perplexity is generally lower than that of the original GIZA++. You can download the most up-to-date version here:
Here is a very brief introduction to the installation and usage of MGIZA++.
./configure --prefix=/your/install/dir
make
make install
MGIZA++ is fully compatible with the original GIZA++ configuration file. The only difference is that MGIZA++ has an additional parameter, NCPUS, which specifies the number of threads to run. It can be as large as the number of CPUs in your system.
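As a small illustration of choosing a value for NCPUS, here is a minimal Python sketch that caps a requested thread count at the number of CPUs the operating system reports. The function name pick_ncpus is hypothetical and not part of MGIZA++. The sketch only computes the number; writing it into the configuration file is left out, because the exact syntax should be taken from your existing GIZA++ configuration.

```python
import os
from typing import Optional


def pick_ncpus(requested: Optional[int] = None,
               available: Optional[int] = None) -> int:
    """Return a thread count between 1 and the number of available CPUs.

    Hypothetical helper, not part of MGIZA++.
    """
    if available is None:
        # os.cpu_count() may return None when the count cannot be determined.
        available = os.cpu_count() or 1
    if requested is None:
        # No preference given: use every CPU the system reports.
        return available
    # Never go below one thread or above the number of CPUs.
    return max(1, min(requested, available))
```

For example, pick_ncpus(8, available=4) gives 4, so asking for more threads than CPUs falls back to the CPU count.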
The output of MGIZA++ is almost the same as that of the original GIZA++, except for the alignment file. Because each thread writes its alignments separately, there will be several partial alignment files, such as prefix.A3.final.part0, prefix.A3.final.part1, and so on. If you need a combined alignment (for example, to use it with the Moses package), please use the script merge_alignment.py with the following command:
./merge_alignment.py `ls prefix.A3.final.part*` > prefix.A3.final
A sample Moses script to use mgiza: use the -mgiza option to enable MGIZA.
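To make the merging of the partial alignment files more concrete, here is a minimal Python sketch of the idea. It is not the actual merge_alignment.py. It assumes that each alignment record takes three lines, that the first line starts with "# Sentence pair (N)" where N is the sentence number, and that records from different parts may be interleaved and need to be sorted by N. The names HEADER_RE, read_records, merge_parts and main are hypothetical.

```python
import re
from typing import Iterable, Iterator, List, Tuple

# Assumed header format of a record: "# Sentence pair (N) ..."
HEADER_RE = re.compile(r"^# Sentence pair \((\d+)\)")


def read_records(lines: Iterable[str]) -> Iterator[Tuple[int, List[str]]]:
    """Yield (sentence_id, three_lines) for every record in one part file."""
    it = iter(lines)
    for header in it:
        match = HEADER_RE.match(header)
        if match is None:
            raise ValueError(f"unexpected header line: {header!r}")
        try:
            # Assumption: the header is followed by exactly two more lines.
            body = [next(it), next(it)]
        except StopIteration:
            raise ValueError("truncated record at end of input")
        yield int(match.group(1)), [header] + body


def merge_parts(parts: Iterable[Iterable[str]]) -> List[str]:
    """Combine the records of all parts, sorted by sentence number."""
    records: List[Tuple[int, List[str]]] = []
    for part in parts:
        records.extend(read_records(part))
    records.sort(key=lambda record: record[0])
    return [line for _, record in records for line in record]


def main(paths: List[str]) -> List[str]:
    """Read the given part files and return the merged lines."""
    contents = []
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            contents.append(handle.readlines())
    return merge_parts(contents)
```

Calling main with the list of part file names and writing the returned lines to prefix.A3.final would mirror the command above. Note that this sketch keeps all records in memory, which may not be practical for very large corpora.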
Updated the source and scripts, cleaning up unused temporary files. The most current version is here:
We have just finished optimizing PGIZA++. The previous package had some problems compiling, so please download the most current package to replace it. Because of the optimization, some scripts became invalid; I did not have time to update them, so I removed them from the package. Currently only the ssh script is available; the Condor and Maui scripts will be available after some testing.
PGIZA++ is another version that can run on clusters. You are welcome to try the current version here: pgiza-0.3.tar.gz.
PGIZA++ relies on the scheduler of the cluster. We provide scripts for Maui, Condor, and plain ssh remote procedure calls. Choose the script for your scheduler from this file: pgiza-scripts.tar.gz.
Here is also the manual for setting up PGIZA++, especially for clusters without schedulers (using ssh + NFS): Manual
PGIZA++ is not only suitable for parallel processing. Because all of the functionality is compiled into individual executables, you can take a previously trained model and continue training on different data. Check it out and have fun. (One last note: for debugging purposes, or because I am lazy, I do not clean up the intermediate results, which can be SUPER LARGE. They are in the "coll" and "tmp" directories; just remember to clean them. I will add a function to clean them up once I am sure it is stable enough.)
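Until such a cleaning function exists, the intermediate results can be removed by hand or with a small script. Here is a minimal Python sketch, assuming the "coll" and "tmp" directories sit directly under the working directory of the training run (that location is an assumption). The function name clean_intermediate is hypothetical and not part of PGIZA++. Only run something like this after training has finished and you no longer need the intermediate files.

```python
import shutil
from pathlib import Path
from typing import Iterable, List


def clean_intermediate(workdir: str,
                       names: Iterable[str] = ("coll", "tmp")) -> List[Path]:
    """Delete the named directories under workdir and return those removed.

    Hypothetical helper, not part of PGIZA++.
    """
    removed: List[Path] = []
    for name in names:
        target = Path(workdir) / name
        # Skip names that do not exist or are not directories.
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(target)
    return removed
```

Directories with other names in the working directory are left untouched, so trained models stored elsewhere are not affected.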