Problem Based Benchmark Suite (2020)

Remove Duplicates (RDUPS):

Given a sequence of elements of any (uniform) type, remove any duplicates returning only one element for each key value. Each element consists of a key and possibly auxiliary data and the implementation must be based on the following three user supplied functions:

  • A hash function that maps each key to an unsigned integer,
  • a comparison function on the keys defining a total order, and
  • a comparison function on the auxiliary data defining a partial order.
  • The comparison function on the auxiliary data is used to decide which element to keep when multiple elements have the same key--a maximal value with respect to the partial order must be kept. When there is no auxiliary data this function should always returns false. The code must not take advantage of the specific key and auxiliary types beyond the hash and comparison function.

    Input and Output File Formats

    The input and output should be in the sequence file format both with the same element types. It output must contain one element for each key in the input, and if there is auxiliary data, a maximal auxiliary value for that key. The output can be ordered in any way.

    Default Input Distributions

    Each distribution should be run for n=10,000,000. The weights used for average time are given in parentheses.
    last modified 13:53, 07 Mar 2017

    This project has been funded by the following sources:
    Intel Labs Academic Research Office for the Parallel Algorithms for Non-Numeric Computing Program,
    National Science Foundation, and
    IBM Research.