| 11-731 Machine Translation: Homework 2 - Questions and Clarifications |
|
Answer
First of all, you are not supposed to filter out punctuation. Instead, separate it from the words so that the alignment can find better matches. Part of understanding MT work is realizing that such decisions have to be made and, yes, that they do affect the performance of the resulting system. We don't expect every student to come back with the same word count for the tokenized corpus. There is no single correct answer for most of these questions. What we are interested in is the kind of decisions you make. For example, which punctuation marks did you split off in tokenizing? How do you deal with punctuation at the start or end of a sentence? |
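As an illustration of separating punctuation instead of deleting it, here is a minimal sketch. The punctuation set `PUNCT` and the function name `tokenize` are assumptions made for this example, not part of the assignment; choosing which marks to split off is exactly the kind of decision discussed above.

```python
import re

# Hypothetical punctuation set; extend or shrink it as you see fit.
PUNCT = r'[.,;:!?()"¿¡]'

def tokenize(line):
    """Separate punctuation marks from words without dropping them."""
    # Surround every punctuation mark with spaces, then split on whitespace.
    spaced = re.sub(f"({PUNCT})", r" \1 ", line)
    return spaced.split()

if __name__ == "__main__":
    print(tokenize("Hello, world!"))
    print(tokenize("¿Cómo estás?"))
```

Note that this simple rule also splits numbers such as 3.5 into three tokens, which you may or may not want.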
|
Answer
These are cases where simply separating punctuation will not suffice, for example compounds with hyphens (e.g. state-of-the-art). |
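One possible way to treat the hyphen case is sketched below: word-internal hyphens stay inside the token, while all other punctuation is split off. The regular expression and the function name `tokenize_keep_hyphens` are assumptions for illustration; other treatments (for example splitting compounds into their parts) are equally defensible.

```python
import re

# A token is either a word, optionally continued by "-word" groups,
# or a single non-space, non-word character (a punctuation mark).
TOKEN_RE = re.compile(r"\w+(?:-\w+)*|[^\w\s]")

def tokenize_keep_hyphens(line):
    """Split off punctuation but keep hyphenated compounds as one token."""
    return TOKEN_RE.findall(line)

if __name__ == "__main__":
    print(tokenize_keep_hyphens("A state-of-the-art system, yes - really."))
```

A hyphen surrounded by spaces is still emitted as its own token, because it is not between two word characters.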
|
Answer
The idea of the exercise is to look at the data and analyze it to see whether there is something unexpected, something we need to worry about. Here, when you compare the two lists, you can check, for example, whether the high-frequency words in the two lists are of a similar type. |
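A comparison of this kind could be set up roughly as follows. This is a sketch under assumptions: the corpus is taken to be already tokenized with tokens separated by spaces, lowercasing is a choice made here, and the name `top_words` and the toy sentences are made up for the example.

```python
from collections import Counter

def top_words(lines, k=10):
    """Return the k most frequent tokens with their counts."""
    counts = Counter()
    for line in lines:
        # Assumes tokens are already separated by whitespace.
        counts.update(line.lower().split())
    return counts.most_common(k)

if __name__ == "__main__":
    # Toy data standing in for the two sides of the parallel corpus.
    english = ["the house is big", "the cat is small"]
    spanish = ["la casa es grande", "el gato es pequeño"]
    print(top_words(english, 3))
    print(top_words(spanish, 3))
```

Printing the two lists side by side makes it easy to see whether, say, function words dominate both sides or whether one list contains unexpected tokens.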
|
Answer
You can use an online translation service, or just look for pairs that are obviously not good translations. English and Spanish have many cognates, which can help a lot. Again, the idea here is to check whether there are obvious cases where things have gone wrong. |
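To show how cognates might help with such a check, the sketch below lists word pairs from an English and a Spanish sentence whose spelling is similar. Everything here is an assumption for illustration: the similarity measure (the standard library's `difflib.SequenceMatcher`), the threshold of 0.7, the minimum word length of 4, and the function name `cognate_pairs`. Spelling similarity is only a rough signal and will also match false friends.

```python
from difflib import SequenceMatcher

def cognate_pairs(en_tokens, es_tokens, threshold=0.7, min_len=4):
    """Return (english, spanish) pairs whose spelling similarity is high."""
    pairs = []
    for e in en_tokens:
        for s in es_tokens:
            # Very short words match by chance too easily, so skip them.
            if min(len(e), len(s)) < min_len:
                continue
            ratio = SequenceMatcher(None, e.lower(), s.lower()).ratio()
            if ratio >= threshold:
                pairs.append((e, s))
    return pairs

if __name__ == "__main__":
    en = "the president visited the university".split()
    es = "el presidente visitó la universidad".split()
    print(cognate_pairs(en, es))
```

A sentence pair with no similar-looking words at all is not necessarily wrong, but it is a reasonable candidate to inspect by hand.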