Given a sequence of elements of any (uniform) type, remove any duplicates, returning only one element for each key value. Each element consists of a key and possibly auxiliary data, and the implementation must be based on the following three user-supplied functions:
The comparison function on the auxiliary data decides which element to keep when multiple elements have the same key: a maximal value with respect to the partial order must be kept. When there is no auxiliary data, this function should always return false. The code must not take advantage of the specific key and auxiliary types beyond the hash and comparison functions.
The input and output should both be in the sequence file format, with the same element types. The output must contain one element for each key in the input and, if there is auxiliary data, a maximal auxiliary value for that key. The output can be ordered in any way.
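The requirements above can be illustrated with a minimal sequential sketch. It is not the benchmark's reference implementation, and it makes several assumptions: the names `Element`, `removeDuplicates`, `hash`, `keyEq` and `auxLess` are hypothetical, the key comparison is taken to be an equality test, and the table is a simple linear-probing hash table. The sketch touches keys and auxiliary data only through the three supplied functions.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical element type: a key plus auxiliary data.
template <typename K, typename A>
struct Element {
  K key;
  A aux;
};

// Sequential sketch (assumed interface, not the benchmark's own API).
//   hash(k)       : maps a key to a std::size_t
//   keyEq(a, b)   : true when two keys are equal
//   auxLess(a, b) : true when a is strictly below b in the partial order;
//                   always false when there is no auxiliary data
template <typename K, typename A, typename Hash, typename KeyEq, typename AuxLess>
std::vector<Element<K, A>> removeDuplicates(const std::vector<Element<K, A>>& in,
                                            Hash hash, KeyEq keyEq,
                                            AuxLess auxLess) {
  std::vector<Element<K, A>> out;
  if (in.empty()) return out;

  // Table size: a power of two at least twice the input size, so the
  // table is never full and every probe sequence terminates.
  std::size_t m = 1;
  while (m < 2 * in.size()) m *= 2;

  const std::size_t EMPTY = static_cast<std::size_t>(-1);
  std::vector<std::size_t> slot(m, EMPTY);  // slot -> index into out

  for (const auto& e : in) {
    std::size_t i = hash(e.key) & (m - 1);
    // Linear probing: skip slots that hold a different key.
    while (slot[i] != EMPTY && !keyEq(out[slot[i]].key, e.key)) {
      i = (i + 1) & (m - 1);
    }
    if (slot[i] == EMPTY) {
      // First time this key is seen.
      slot[i] = out.size();
      out.push_back(e);
    } else if (auxLess(out[slot[i]].aux, e.aux)) {
      // Same key, strictly larger auxiliary value: replace the stored one.
      out[slot[i]] = e;
    }
  }
  return out;
}
```

Because a stored element is replaced only by a strictly larger one, transitivity of the partial order ensures that the element left for each key is maximal. With an `auxLess` that always returns false, the first occurrence of each key is kept. This sketch happens to emit keys in order of first appearance, although the specification allows any output order.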
randomSeq -t int <n> <filename>
randomSeq -t int -r 100000 <n> <filename>
exptSeq -t int <n> <filename>
trigramSeq <n> <filename>
trigramSeq <n> <tmpname>
addDataSeq -t int <tmpname> <filename>
This project has been funded by the following sources:
Intel Labs Academic Research Office for the Parallel Algorithms for Non-Numeric Computing Program,
National Science Foundation, and
IBM Research.