Problem Based Benchmark Suite : Remove Duplicates

Remove Duplicates (RDUPS):

Given a sequence of elements of any (uniform) type, remove any duplicates returning only one element for each key value. Each element consists of a key and possibly auxiliary data and the implementation must be based on the following three user supplied functions:

A hash function that maps each key to an unsigned integer,

a comparison function on the keys defining a total order, and

a comparison function on the auxiliary data defining a partial order.

The comparison function on the auxiliary data is used to decide which element to keep when multiple elements have the same key--a maximal value with respect to the partial order must be kept. When there is no auxiliary data this function should always returns false. The code must not take advantage of the specific key and auxiliary types beyond the hash and comparison function.

Input and Output File Formats

The input and output should be in the sequence file format both with the same element types. It output must contain one element for each key in the input, and if there is auxiliary data, a maximal auxiliary value for that key. The output can be ordered in any way.

Default Input Distributions

Each distribution should be run for n=10,000,000. The weights used for average time are given in parentheses.

(1) Unsigned integers generated uniformly at random in the range [0:n).
randomSeq -t int <n> <filename>
(1) Unsigned integers generated uniformly at random in the range [0:100000).
randomSeq -t int -r 100000 <n> <filename>
(1) 32-bit unsigned integers generated at random with repeats appearing in an exponential distribution.
exptSeq -t int <n> <filename>
(3) Strings from a tri-gram distribution.
trigramSeq <n> <filename>
(3) Strings from a tri-gram distribution tagged with auxiliary integers from the range [0:n).
trigramSeq <n> <tmpname>
addDataSeq -t int <tmpname> <filename>

Problem Based Benchmark Suite (2020)

Remove Duplicates (RDUPS):

Input and Output File Formats

Default Input Distributions