Once the basic CI class is working for all log files, here is the more researchy aspect of manipulating CIs:

0. First, make sure your CI class can read in a bunch of log files at once, and test it by loading two different user sessions (2006-2-13-16-51-29-8333 and 2006-2-13-17-13-55-8336, say).

1. We need to detect equivalent walks through the correction-action space. To do this, you first need to detect and delete (not store) empty/spurious loops. Example: for the following sequence of correction actions:

sl: The great artist
El artista gran
El artista grande (edit: gran->grande)
El grande artista (cwo)
El gran artista (edit: grande->gran)

only the following should be stored in the CI's vActions:

El artista gran
El gran artista (cwo)

Note that these are not necessarily "state repeats" (AI); in this case, for example, the sequence never goes back to the initial state, since the order of "artista" and "gran" is changed in between the spurious corrections.

Another example, which seems too terrible to occur naturally, but which I have actually seen users do, goes as follows. For the following sequence of correction actions:

sl: I saw the girl
Vio la muchacha
Vio a la muchacha (add a word: 'a')
add alignment 'a' with 'saw'
delete alignment from 'a' to 'saw'
add alignment from 'a' to 'woman'
delete alignment from 'a' to 'woman'
Vio la muchacha (delete a word: 'a')
Vi la muchacha (edit: vio->vi)
Vi a la muchacha (add a word: 'a')
add alignment 'a' with 'saw'
delete alignment from 'a' to 'saw'
add alignment from 'a' to 'the'
delete alignment from 'a' to 'the'

only the following should be stored in the CI's vActions:

sl: I saw the girl
Vio la muchacha
Vi la muchacha (edit: vio->vi)
Vi a la muchacha (add a word: 'a')

You can either incorporate this directly when reading the CI in from the log file, or have a method which, given a CI, returns it without any loops, whichever you think is better.

DetectSpuriousLoop(*CI)

2. Once this is in place, we will be in a position to implement a comparison method which, given two different CIs for the same SL-TL-CTL triplet, detects whether the non-spurious actions are the same, namely whether the two CIs are equivalent even though their log files are not identical. If the order in which the correction actions took place is different, this should also be considered somewhat equivalent (even though this might not actually be true for dependent errors, since the order in which refinements take place does determine how the grammar is refined; it is not invariant with respect to the input). Maybe instead of having the compare method return a boolean, it could return an int, where 0 means no equivalence, 1 means exact equivalence (not counting spurious loops that might have occurred in one or both CIs), and 2 means equivalence in terms of the correction actions used, but not in their sequence. A sketch of both methods follows below.

bool ComparableCIs(CI1,CI2) ;; returns 1 if SL-TL-CTL are the same
int EquivalentCIs(CI1,CI2)
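Here is a minimal C++ sketch of what steps 1 and 2 could look like. The CorrectionAction fields, the IsInverseOf() and SameAction() tests, and the affectedWord / vAffectedRules members are assumptions made only for this illustration, not the actual class interface; ComparableCIs (the SL-TL-CTL check) is left out because the minimal structs below don't carry those strings.

    #include <string>
    #include <vector>

    // Minimal stand-ins for the real classes, just to make the sketch self-contained.
    enum ActionType { EDIT, ADD_WORD, DELETE_WORD, ADD_ALIGN, DELETE_ALIGN, CWO };

    struct CorrectionAction {
        ActionType type;
        std::string arg1, arg2;      // e.g. old/new word for EDIT, the word pair for alignments
        std::string affectedWord;    // TL word this action touches (used in the step-7 sketch)

        // True if 'later' undoes the effect of this action.
        bool IsInverseOf(const CorrectionAction& later) const {
            switch (type) {
                case EDIT:      return later.type == EDIT && later.arg1 == arg2 && later.arg2 == arg1;
                case ADD_WORD:  return later.type == DELETE_WORD && later.arg1 == arg1;
                case ADD_ALIGN: return later.type == DELETE_ALIGN && later.arg1 == arg1 && later.arg2 == arg2;
                default:        return false;
            }
        }
    };

    struct CorrectionInstance {
        std::vector<CorrectionAction> vActions;
        std::vector<std::string> vAffectedRules;   // see step 10
    };

    bool SameAction(const CorrectionAction& a, const CorrectionAction& b) {
        return a.type == b.type && a.arg1 == b.arg1 && a.arg2 == b.arg2;
    }

    // Step 1: remove action pairs that cancel each other out (an edit and its reverse
    // edit, an added word that is later deleted, an added alignment that is later
    // deleted), repeating until no such pair is left.  This is only a first
    // approximation: it ignores whatever happens between the two actions of a pair.
    void DetectSpuriousLoop(CorrectionInstance* ci) {
        bool removed = true;
        while (removed) {
            removed = false;
            std::vector<CorrectionAction>& v = ci->vActions;
            for (size_t i = 0; i < v.size() && !removed; ++i)
                for (size_t j = i + 1; j < v.size() && !removed; ++j)
                    if (v[i].IsInverseOf(v[j])) {
                        v.erase(v.begin() + j);   // erase the later action first
                        v.erase(v.begin() + i);   // then the earlier one
                        removed = true;
                    }
        }
    }

    // Step 2: 0 = not equivalent, 1 = same non-spurious actions in the same sequence,
    // 2 = same actions but in a different order.  Assumes DetectSpuriousLoop has
    // already been applied to both CIs.
    int EquivalentCIs(const CorrectionInstance& ci1, const CorrectionInstance& ci2) {
        const std::vector<CorrectionAction>& v1 = ci1.vActions;
        const std::vector<CorrectionAction>& v2 = ci2.vActions;
        if (v1.size() != v2.size()) return 0;

        bool sameOrder = true;
        for (size_t i = 0; i < v1.size(); ++i)
            if (!SameAction(v1[i], v2[i])) { sameOrder = false; break; }
        if (sameOrder) return 1;

        // Same multiset of actions, different order: match every action of ci1
        // to a distinct, equal action of ci2.
        std::vector<bool> used(v2.size(), false);
        for (size_t i = 0; i < v1.size(); ++i) {
            bool matched = false;
            for (size_t j = 0; j < v2.size(); ++j)
                if (!used[j] && SameAction(v1[i], v2[j])) { used[j] = true; matched = true; break; }
            if (!matched) return 0;
        }
        return 2;
    }

With these (admittedly crude) definitions, the two example sequences above reduce to the actions listed as the ones to keep: just the cwo in the first case, and the vio->vi edit plus the final add of 'a' in the second.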
3. Once the comparison method is in place, implement a collection of CIs, indexed by SL-TL-CTL. Each CI should contain a vector of unique IDs, initially containing only one ID (extracted from the log file, or from the name of the directory that generated it). It should be possible to access all CIs that have the same SL-TL sequence (so we need at least two access methods, one given SL-TL-CTL and one given SL-TL).

class CorrectionInstance {
  private:
    CICollection()
    vector UniqueLogFileID
    vector FindCIs(SL,TL)
    vector FindCIs(SL,TL,CTL)
  public:
    StoreCICollection()
  …
}

For each SL-TL, group all CIs affecting a particular SL-TL-CTL triplet together. When storing CIs affecting the same SL-TL-CTL, if it turns out that two correction instances are equivalent (see above), we only want to store one of them: say we decide to store CI1; we then add the unique ID from CI2 into the vectorID, so that it now contains the ID for CI1 and the ID for CI2. If we do this recursively, at the end only unique CIs will be stored under each SL-TL-CTL triplet, and each CI will store a vector with one or more unique IDs.

4. Once all this is in place, add an evidence method which, given a CI, returns the size of that vector. This tells us how many actual log files support the evidence for that CI.

int EvidenceForRefinement(CI) ;; returns the number of LogFiles that support this evidence

5. Another method that should be implemented: given a collection of CIs grouped by SL-TL, return the CI with the most evidence (namely, the largest vectorID). We could call this GetBestCI or something like that.

*CI GetBestCI(CICollectionIndexedBySL-TL) ;;

6. The next step is to store a new collection containing the CIs with the highest evidence. After processing all the CIs, this collection will contain only one CI per SL-TL pair, namely the one with the most evidence (the BestCI).

BestCICollection() ;; for each SL-TL pair in the input, it finds the BestCI and stores it in the BestCICollection

7. We will also need a ranking method which, given this new collection, ranks BestCIs by error complexity, from easier to refine to harder to refine. A first approximation of error complexity could be the number of corrections, which roughly correlates with the number of errors. This means that the method needs to compare each BestCI's Actions vector, and the one with the smaller vector should be ranked higher. There is one caveat here: the error-complexity comparison should not take alignment corrections into account. This is not to say that alignment corrections are irrelevant for further processing, but we don't think they should contribute to error complexity in the same way as other actions, and the easiest way to achieve this is simply not to count them in this method.

There is something else that can be done to approximate true error complexity when there are multiple errors, namely trying to detect whether the errors are independent or not. The reason is that when errors are dependent, it becomes trickier to decide which refinement operation to apply first, so such cases should get a higher complexity score, which would then be ranked lower by the ranking method. For example, if a CI contains multiple errors, all affecting different words, then we assume they are independent errors. If, on the other hand, two or more different corrections affect the same word, then we assume they are dependent. For the following example:

sl: Gaudi was a great artist
tl: Gaudi es un artista grande
Correction 1: edit: grande->gran
temp_ctl: Gaudi es un artista gran
Correction 2: cwo: grande artista
ctl: Gaudi es un gran artista

the new error-complexity comparison method would detect that there were 2 actions involved and that they were not independent. For now, let's say that when a dependency is detected between two or more actions, the error complexity is incremented by 1.

In this case, CIs with two dependent actions would get a score of 3 and would be considered as complex as CIs with three independent actions. This is not ideal, so if you can think of a better scoring mechanism, that would be great. The ideal final ranking of CIs would be as follows:

1st: CIs with one correction action
2nd: CIs with two independent correction actions
3rd: CIs with two dependent correction actions
4th: CIs with three independent correction actions
5th: CIs with three dependent correction actions
etc...

So maybe what is required here is to store the information about error dependency separately from the actual number of corrections. A sketch of the simple increment-by-1 scoring follows below.

int ErrorComplexity(CI) ;; does not take alignment correction actions into account
bool ContainsDependentErrors(CI)
Rank(BestCICollection)
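Continuing the sketch from above (and still assuming the hypothetical affectedWord field on CorrectionAction), the simple increment-by-1 version of this scoring could look like the following; the "same word touched twice" test is just the first approximation of dependency described above.

    #include <set>
    #include <string>

    bool IsAlignmentAction(const CorrectionAction& a) {
        return a.type == ADD_ALIGN || a.type == DELETE_ALIGN;
    }

    // Two or more non-alignment corrections touching the same TL word are taken
    // as evidence of dependent errors.
    bool ContainsDependentErrors(const CorrectionInstance& ci) {
        std::set<std::string> seen;
        for (size_t i = 0; i < ci.vActions.size(); ++i) {
            const CorrectionAction& a = ci.vActions[i];
            if (IsAlignmentAction(a)) continue;
            if (!seen.insert(a.affectedWord).second) return true;   // word already corrected once
        }
        return false;
    }

    // Number of non-alignment correction actions, plus 1 if dependent errors were detected.
    int ErrorComplexity(const CorrectionInstance& ci) {
        int complexity = 0;
        for (size_t i = 0; i < ci.vActions.size(); ++i)
            if (!IsAlignmentAction(ci.vActions[i])) ++complexity;
        if (ContainsDependentErrors(ci)) ++complexity;
        return complexity;
    }

For the Gaudi example this gives 2 + 1 = 3, matching the score discussed above. Keeping ContainsDependentErrors as a separate method also makes it easy to switch later to ranking on two keys (action count first, dependency second) instead of folding the dependency into a single score.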
8. Since the final ranking might be affected by regression-testing results, we also need a reranking method which, given two CIs that either had the same rank or were ranked incorrectly, reranks them so that CI1 >= CI2.

Rerank(CI2,CI1) ;; where CI2 > CI1

9. CIs should also have a couple of book-keeping variables to store: 1. whether they led to an actual change in the system, and 2. whether that refinement (or refinements) increased or decreased the accuracy over a regression test set.

class CorrectionInstance {
  private:
    bool Lead2Refinement ;; 0 by default
    bool IncreasedMTAccuracy ;; 0 by default
  …
}

10. Another thing we are going to want to store is which rules and lexical entries would be affected by the refinement operations triggered by a specific CI action, also in order to determine dependencies between CIs. This information can be extracted from the parse trace, since the candidate rule to be refined is the immediate parent node of the error word. For example, take the parse tree:

<((S,1 (NP,2 (N,7:1 "GAUDÍ") ) (VP,3 (VB,2 (AUX,1:2 "ERA") ) (NP,8 (DET,3:3 "UN") (N,8:5 "ARTISTA") (ADJ,3:4 "GRANDE") ) ) ) )>

Here "grande" is the error (the user edits it into "gran", which is then moved in front of "artista" [see 2006-2-13-17-13-55-8336/9 and 2006-2-13-16-51-29-8333/9 for examples]). So the StoreAffectedRuleIDs method should store the lexical ID for ARTISTA, namely N,8, and then the parent node, NP,8, presumably doing something like this:

vAffectedRules.push_back("N,8");
vAffectedRules.push_back("NP,8");

You can probably use the ParseTree methods to access the right tree nodes. In this way, rule-ID information can be stored for each of the CI's actions (StoreAffectedRuleIDs). Note that grammar-rule and lexical-entry IDs have the same format, so only one data structure is needed. What we need, then, is a method that tells us whether two different RuleID vectors have one or more elements in common (regardless of order). This method loops over each Actions vector and, whenever it detects the same RuleID in one of the actions of the other CI, stores it in the out argument, which is returned at the end, so that we can tell which rules are affected by both CIs (see the sketch below).

class CorrectionInstance {
  private:
    vector AffectedRules ;; for each Action in the CI, store the mother node of the error and its lexical label
  public:
    StoreAffectedRuleIDs(CI) ;; loops through all the Actions
    vector RulesInCommon(CI1, CI2) ;; loops through all the Actions
  …
}
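A possible shape for that intersection test, written here directly over the two AffectedRules vectors (the free-function signature is just an assumption for the sketch; inside the class it would take the other CI, as in the sketch above):

    #include <algorithm>
    #include <string>
    #include <vector>

    // Returns every rule / lexical-entry ID (e.g. "N,8", "NP,8") that appears in
    // the AffectedRules vectors of both CIs, ignoring order and reporting each
    // shared ID only once.
    std::vector<std::string> RulesInCommon(const std::vector<std::string>& rules1,
                                           const std::vector<std::string>& rules2) {
        std::vector<std::string> common;
        for (size_t i = 0; i < rules1.size(); ++i) {
            const std::string& id = rules1[i];
            bool inOther   = std::find(rules2.begin(), rules2.end(), id) != rules2.end();
            bool reported  = std::find(common.begin(), common.end(), id) != common.end();
            if (inOther && !reported)
                common.push_back(id);   // this rule/lexical entry is touched by both CIs
        }
        return common;
    }

If RulesInCommon(ci1.vAffectedRules, ci2.vAffectedRules) comes back non-empty, the two CIs can be treated as dependent for scheduling purposes, since their refinements would touch the same grammar rules or lexical entries.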