Cinnamon: Synthetic Data Generation

Principal Investigator

Roy A. Maxion, Carnegie Mellon University (maxion@cs.cmu.edu)

Project Heading

Invictus

Objective

Develop a synthetic environment which will generate realistic, but carefully controlled, datasets for testing anomaly-detection systems.

Overview

Cinnamon is a synthetic environment to generate realistic system performance data that enables:
  • Evaluation of competing anomaly-detection methodologies, either within the Invictus project or across other projects, benchmarked to a common standard.
  • Assessment, in a statistically rigorous fashion, of the capability of the core algorithms (e.g., robustness of Type I and Type II error rates to noise, multidimensionality, and nonstationarity).
  • Measurement of system scalability (i.e., performance degradation as a function of the complexity of the system of systems).
  • Generation of patterns that imitate evolutionary system behavior, as well as patterns that represent attempted intrusion or other kinds of system compromise.
The synthesizer will support designed factorial experiments that make statistical comparisons of different kinds of system monitors and enable tuning of Harbinger to specific applications.
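
As an illustration of what a designed factorial experiment involves, the sketch below enumerates every combination of a few data-generation factors, each of which would correspond to one synthetic dataset. The factor names and levels (noise_sd, dimensions, drift) are hypothetical and chosen only for this example; they are not taken from Cinnamon itself.

```python
from itertools import product


def full_factorial(factors):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical factors and levels, for illustration only.
example_factors = {
    "noise_sd": [0.5, 1.0, 2.0],
    "dimensions": [1, 4],
    "drift": ["none", "linear"],
}

runs = full_factorial(example_factors)
for run_id, settings in enumerate(runs):
    print(run_id, settings)
```

Each run would be used to generate one dataset, and each detector under comparison would be scored on all of them, so that differences in error rates can be attributed to specific factors.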

Work Completed

The following items have been completed:
  • Generate univariate interval data from any one of the following statistical distributions: Binomial, Cauchy, Chi-squared, Exponential, F, Normal, Poisson, T, Uniform, or Weibull.
  • Generate multivariate interval data from either of the following models: ARMA (autoregressive moving average) or Multivariate Normal.
  • Generate univariate nominal data using any one of the following: Binomial, Multinomial, or Markov Model.
  • Add linear, exponential, or sinusoidal drift to generated univariate interval data.
  • Insert perturbations into generated data where a perturbation is data generated from a different statistical distribution.
  • Generate autocorrelated univariate interval data.
  • Generate data that is a probabilistic mixture of two or more statistical distributions.
  • Provide the capability for continual data generation.
  • Write text for About Cinnamon and About Data.
  • Develop a web interface to allow the user to enter a specification of the data to be generated, to view graphs of the generated data, and to retrieve the generated data files.
  • Write help text for each statistical distribution that describes an appropriate application of the distribution.
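
Several of the capabilities listed above (drift, perturbations, autocorrelation, and mixtures) can be sketched in a few lines. The following is a minimal illustration using only the Python standard library; it is not Cinnamon's implementation, and all function names and parameters are assumptions made for this example. Autocorrelation is shown with a first-order autoregressive process, which is one common way to obtain it.

```python
import math
import random


def add_drift(series, kind="linear", rate=0.01, amplitude=1.0, period=100.0):
    """Add linear, exponential, or sinusoidal drift to a univariate series."""
    out = []
    for t, x in enumerate(series):
        if kind == "linear":
            d = rate * t
        elif kind == "exponential":
            d = math.exp(rate * t) - 1.0  # zero drift at t = 0
        elif kind == "sinusoidal":
            d = amplitude * math.sin(2.0 * math.pi * t / period)
        else:
            raise ValueError("unknown drift kind: " + kind)
        out.append(x + d)
    return out


def insert_perturbation(series, start, length, sampler):
    """Replace a window of the series with draws from a different distribution.

    Returns the new series and the (start, end) span that was replaced.
    """
    out = list(series)
    end = min(start + length, len(out))
    for t in range(start, end):
        out[t] = sampler()
    return out, (start, end)


def generate_ar1(n, phi, sd, rng):
    """Autocorrelated data: x[t] = phi * x[t-1] + Normal(0, sd) noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = phi * x + rng.gauss(0.0, sd)
        out.append(x)
    return out


def generate_mixture(n, samplers, weights, rng):
    """Draw each point from one of several samplers, chosen by weight."""
    return [rng.choices(samplers, weights=weights, k=1)[0]() for _ in range(n)]


rng = random.Random(42)  # fixed seed so a run can be reproduced
base = [rng.gauss(0.0, 1.0) for _ in range(1000)]
drifting = add_drift(base, kind="linear", rate=0.005)
# Perturbation: 50 points drawn from an Exponential instead of a Normal.
perturbed, span = insert_perturbation(
    drifting, start=400, length=50, sampler=lambda: rng.expovariate(0.5)
)
```

Returning the replaced span alongside the data keeps a record of where each perturbation was placed, which is what a detector's output would later be scored against.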

Work In Progress

There is currently no work in progress.

Future Plans

The following items are planned for future development:
  • Write text for About Random Numbers.
  • Allow the user to submit a file containing a specification of the data to be generated rather than entering it through the web interface.
  • Add capability to specify intermittent perturbations (e.g., insert perturbation X every 500 time steps).
  • Add capability to specify lagged perturbations (e.g., perturb vector 2 with perturbation X 5 time steps after perturbing vector 1 with perturbation Y).
  • Produce a key which enumerates the perturbations inserted into the data and the details/characteristics for each.
  • Validate the specifications entered by the user to ensure there are no conflicts and that all necessary information has been entered.
  • Modify the web interface to include default values for each user-specified entry that correspond to the example scenario for the selected distribution.
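
The planned intermittent and lagged perturbations, together with the key that enumerates them, could be represented as a simple schedule of events. The sketch below is one possible shape for such a schedule, built around the two examples given in the list above; the function names and record fields are hypothetical and do not describe a planned Cinnamon interface.

```python
def intermittent_schedule(n_steps, every, length, label, vector):
    """One perturbation event every `every` time steps on a given vector."""
    return [
        {"label": label, "vector": vector,
         "start": s, "end": min(s + length, n_steps)}
        for s in range(every, n_steps, every)
    ]


def lagged_schedule(primary, lag, label, vector, n_steps):
    """Events on another vector, each starting `lag` steps after a primary event."""
    events = []
    for e in primary:
        start = e["start"] + lag
        if start < n_steps:
            duration = e["end"] - e["start"]
            events.append({"label": label, "vector": vector,
                           "start": start,
                           "end": min(start + duration, n_steps)})
    return events


def format_key(events):
    """Produce a plain-text key listing every inserted perturbation in time order."""
    ordered = sorted(events, key=lambda e: (e["start"], e["vector"]))
    return [
        "perturbation %s on vector %d, steps %d to %d"
        % (e["label"], e["vector"], e["start"], e["end"] - 1)
        for e in ordered
    ]


# Perturbation Y on vector 1 every 500 steps, then X on vector 2 five steps later.
primary = intermittent_schedule(2000, every=500, length=10, label="Y", vector=1)
lagged = lagged_schedule(primary, lag=5, label="X", vector=2, n_steps=2000)
key = format_key(primary + lagged)
for line in key:
    print(line)
```

Keeping the schedule separate from the data generator means the same list of events can drive the insertion of perturbations and produce the key, so the two cannot disagree.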