11-731 Machine Translation: Homework 2 - Questions and Clarifications


Question

In Part 2 Section (i), looking at the list of the most frequent words (on the English side), I found a number of tokens that should be filtered out. Deciding what to filter out, without a given set of defined tokens, could be totally arbitrary, which will affect the results in this homework and the MT system in the following assignments.


Answer

First of all, you are not supposed to filter out punctuation. What you should do is separate it from the words, so that there can be a better match in the alignment. Part of understanding MT work is to realize that such decisions have to be made and, yes, that they do affect the performance of the resulting system. We don't expect each student to come back with the same word count for the tokenized corpus. There is no one correct answer for most of these questions. What we are interested in is what kind of decisions you make. For example, which punctuation marks did you use in tokenizing? Do you deal with punctuation at the start or end of the sentence?
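As an illustration of separating punctuation from words rather than deleting it, here is a minimal sketch. The punctuation set in PUNCT and the function name tokenize are assumptions made for this example, not a prescribed solution; choosing which marks to include is exactly the kind of decision you are asked to make and report.

```python
import re

# Assumed punctuation set for illustration; extend or shrink it as you see fit.
PUNCT = r'[.,;:!?()"¿¡]'

def tokenize(line):
    """Separate punctuation marks from words without removing them."""
    # Surround every punctuation mark with spaces, then split on whitespace.
    spaced = re.sub(f"({PUNCT})", r" \1 ", line)
    return spaced.split()

print(tokenize("Hello, world!"))
print(tokenize("¿Cómo estás?"))
```

Note that this naive version also splits a number such as 3.5 into three tokens, which is one reason to look for special cases.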


Question

In Part 2 Section (i), what do you mean by "Are there any special cases, which might need special treatment?"


Answer

These are cases where a simple separation of punctuation will not suffice, for example compounds with hyphens (e.g. state-of-the-art).
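One possible way to handle such cases is to describe the tokens you want to keep whole with a regular expression. The sketch below is only an assumption about how this could be done: the pattern TOKEN_RE, the function name, and the choice to also keep decimal numbers together are illustrative and not part of the assignment.

```python
import re

# Illustrative token pattern; alternatives are tried from top to bottom.
TOKEN_RE = re.compile(r"""
    \d+(?:[.,]\d+)*      # numbers such as 3.5 or 1,000 stay in one piece
  | \w+(?:-\w+)*         # words, including hyphenated compounds
  | [^\w\s]              # any other single non-space symbol (punctuation)
""", re.VERBOSE)

def tokenize_with_special_cases(line):
    """Split a line into tokens, keeping hyphenated compounds and numbers whole."""
    return TOKEN_RE.findall(line)

print(tokenize_with_special_cases("A state-of-the-art system costs 3.5 million."))
```

Whether a hyphenated compound should stay as one token or be split is itself a judgment call. Keeping it whole makes the vocabulary larger, while splitting it may give the aligner more familiar pieces to match.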


Question

In Part 2 Section (iii), what do you mean by "Do you see any interesting similarities or differences?"


Answer

The idea of the exercise is to look at the data and analyze it to see if there is something unexpected, something we need to worry about. Here, when you compare the two lists, you can check, for example, whether the high-frequency words in the two lists are of similar types.
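To compare the two lists side by side, something along the following lines may help. It is a sketch under the assumption that each side of the corpus is available as a list of already tokenized, space-separated sentences; the function name top_words and the toy sentences are made up for illustration.

```python
from collections import Counter

def top_words(sentences, n=10):
    """Return the n most frequent tokens with their counts."""
    counts = Counter(tok for s in sentences for tok in s.split())
    return counts.most_common(n)

# Toy data standing in for the two sides of the corpus.
english = ["the house is big", "the cat is in the house"]
spanish = ["la casa es grande", "el gato está en la casa de la ciudad"]

# Print the two frequency lists next to each other for inspection.
for (en, en_count), (es, es_count) in zip(top_words(english, 3), top_words(spanish, 3)):
    print(f"{en:10} {en_count:5}    {es:10} {es_count:5}")
```

Looking at the two columns together makes it easier to see whether both lists are dominated by the same kind of words, for instance function words and punctuation.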


Question

In Part 2 Section (v), you ask us to "Select an example where the sentences are not a good translation pair." How do you determine that a pair is not a good translation if you don't know Spanish?


Answer

You can use an online translation service, or just pick pairs that are obviously not good translations. English and Spanish have many cognates, which can help a lot. Again, the idea here is to check if there are obvious cases where things have gone wrong.
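If you want a shortlist of candidates to inspect by hand, one simple heuristic is to flag pairs whose lengths differ a lot. This is not something the assignment requires. The functions length_ratio and suspicious_pairs and the threshold of 2.0 are assumptions chosen for this sketch, and a flagged pair still needs to be checked by reading it.

```python
def length_ratio(en, es):
    """Ratio of the longer sentence length to the shorter one, in tokens."""
    n_en, n_es = len(en.split()), len(es.split())
    if min(n_en, n_es) == 0:
        # An empty side is always suspicious.
        return float("inf")
    return max(n_en, n_es) / min(n_en, n_es)

def suspicious_pairs(pairs, max_ratio=2.0):
    """Return the sentence pairs whose length ratio exceeds max_ratio."""
    return [(en, es) for en, es in pairs if length_ratio(en, es) > max_ratio]

# Toy sentence pairs for illustration only.
pairs = [
    ("the house is big", "la casa es grande"),
    ("yes", "el comité aprobó la propuesta sin cambios"),
]
for en, es in suspicious_pairs(pairs):
    print(en, "|||", es)
```

A length mismatch only hints at a problem. Pairs of similar length can still be bad translations, which is where cognates and an online translation service come in.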