Problem Based Benchmark Suite (2020)

trigramSeq Data Generator:

trigram <n> <filename>

This generator produces a sequence of n strings of characters, written in the sequence file format, based on the trigram distribution of the English language. Each next character is selected at random, conditioned on the previous two characters, with a probability based on the probability of the resulting three-character sequence (the previous two characters followed by the new one). Only the 26 lowercase letters of the alphabet and the space character are used. If the space character is selected, it ends the string. The first two characters of each string are selected based on the 1-gram and 2-gram probabilities.

last modified 17:46, 20 Sep 2020

This project has been funded by the following sources:
Intel Labs Academic Research Office for the Parallel Algorithms for Non-Numeric Computing Program,
National Science Foundation, and
IBM Research.