======================================================================

-=[ Summary ]=-

This is a set of Perl modules and scripts for 


  1) Estimating univariate probability distributions, including
     a few variations such as truncations or mixtures.
  2) Applying some basic numerical algorithms such as Simpson's rule
     for integration or Gaussian elimination for solving linear 
     equations.
  3) Selecting attributes (when used in conjunction with my fractal
     dimension package).
  4) Argument parsing, in a fairly complicated way that allows
     automatic generation of help text, the use of callbacks for
     validating values, and so on.
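Item (2) can be illustrated with a short sketch.  This package is
written in Perl; the following is only an illustrative Python version
of composite Simpson's rule (the function name and signature are mine,
not the module's):

```python
def simpson(f, a, b, n=100):
    """Composite Simpson's rule on [a, b] with n subintervals.

    n is forced even, since Simpson's rule pairs subintervals."""
    if n % 2:
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        # interior points alternate weights 4, 2, 4, 2, ...
        total += f(a + i * h) * (4 if i % 2 else 2)
    return total * h / 3
```

Simpson's rule is exact for polynomials up to cubics, so integrating
x**2 over [0, 1] recovers 1/3 to within floating-point error.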

This is currently aimed at those /really/ comfortable with examining 
and writing Perl code.  The code is still in flux...



-=[ Sources ]=-

Many of the formulas used were found in "Continuous Univariate
Distributions" by Johnson and Kotz.



-=[ Manifest ]=-

The files involved are the following:

14 Perl5 modules
  AndersonDarling.pm (Anderson-Darling goodness of fit)
  ArrayFoo.pm   (Simple array statistics)
  ChiSquare.pm  (Chi-square goodness of fit tester; table-based)
  GKQuantile.pm (Quantile estimator as per 
    Greenwald, Michael, and Sanjeev Khanna.  "Space-Efficient
    Online Computation of Quantile Summaries", in Proc. SIGMOD
    2001).
  Heap.pm       (Basic minheap implementation)
  kMeans.pm     (Basic k-means clustering module)
  mktemp.pm     (Utility package for making temporary files)
  Numerical.pm  (Simple numerical algorithms, such as a numerical
                 function inverter.  Avoid the polynomial code.)
  ParseArgs.pm  (Overly complicated argument parser)
  SEM.pm        ('Scalable EM' implementation; see module header
                 for relevant citation)
  Transform.pm  (Univariate non-cdf transformation functions.
                 Beware the polynomial code, it's pretty bad.)
  UniRand.pm    (Univariate random distributions; estimates
                 parameters, generates cdfs/pdfs/ppfs; computes
                 log-likelihood, BIC, AIC)
  util.pm       (Tiny utility package)
  Wrapper.pm    (Provides interface to data file; identical to
                 one in fractal dimension package)


5 Perl5 scripts
  apply.pl      (Script for applying some transformations to
                 columns of a data set)
  fit.pl        (Script for fitting non-truncated, non-mixture
                 univariate distributions)
  grab_bag5.pl  (Script for selecting attributes from a data set
                 based on the fractal dimension)
  pca.pl        (Principal components analysis)
  quantile_fit.pl (Like fit.pl, but uses GKQuantile for
                 generating a "sample").

1 data file
  table.chi_square (flat file that's a basic chi-square table)

A simple LICENSE (essentially, do whatever you like -- open, 
  closed -- except claim that it's yours, remove the LICENSE,
  or sell it for more than nominal cost).

This README.



-=[ Dependencies ]=-

The 'grab_bag5.pl' script relies on my fractal dimension 
package, with all the dependencies that entails.  An 
appropriate 'use lib' line will, naturally, need to be added.

A working Perl5 installation is required.



-=[ Immediacies ]=-

Decide where you want to put the code, and change the 'use lib'
lines to point to the new location instead of 
'/usr/lw2j/private/Stat/Perl'.



-=[ Usage ]=-

Well, it's still largely a programmer's package for now, so the
first step would probably be to examine whichever packages interest
you and see what you can use.



AndersonDarling.pm:
  Given data and a CDF, it will compute the Anderson-Darling
  statistic.  However, for actual significance testing, one 
  needs distribution-specific tables.  I've got both tables
  and estimators for normal, lognormal and exponential, so
  this module can give significance levels for those.
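The statistic itself is straightforward to compute from sorted data
and a CDF.  Here is a minimal Python sketch of the standard A-squared
formula (again illustrative only; the function name is mine, and
significance testing still needs the distribution-specific tables
mentioned above):

```python
import math

def anderson_darling(data, cdf):
    """Anderson-Darling A-squared statistic for data against a CDF.

    Uses A^2 = -n - (1/n) * sum_i (2i-1) *
               [ln F(x_(i)) + ln(1 - F(x_(n+1-i)))],
    where x_(i) is the i-th order statistic."""
    xs = sorted(data)
    n = len(xs)
    s = 0.0
    for i, x in enumerate(xs, start=1):
        # xs[n - i] is x_(n+1-i) in 1-based notation
        s += (2 * i - 1) * (math.log(cdf(x)) +
                            math.log(1.0 - cdf(xs[n - i])))
    return -n - s / n
```

The CDF must return values strictly inside (0, 1) for every data
point, or the logarithms blow up.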

ArrayFoo.pm:
  Not much "direct" use; it's more of a supporting package with
  stuff like a 'select' algorithm and estimators for mean and
  standard deviation.


ChiSquare.pm:
  If you need to check goodness of fit according to chi-square,
  and you don't mind needing to use a table (my gamma code is
  off when it comes to extremes), it's pretty simple.

  0.  If you need a better table, replace the table.chi_square
      file.  The comments at the beginning describe the (trivial)
      format.  
  1.  Create a new chi-square object.
  2.  Call the chi_square method, giving it the data, a number of
      buckets, the cumulative distribution function under 
      consideration, and the number of estimated parameters.
  3.  If the return value is undef, it didn't pass; otherwise,
      it's a significance level.


  ** CHANGE:  Newer versions have a built-in table.  You can still
  specify a table via a filename, but you don't have to anymore.
  This should ease using the module from any directory.

  ** CHANGE:  There's now a chi-square random-variable 
  approximator in UniRand.pm.  This approximator calls the 
  approximator for the normal distribution, and, in turn, is 
  called by the revised gamma approximator (instead of the old
  method of numerical integration).
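The steps above boil down to bucketing the data through the CDF and
comparing observed to expected counts.  A hedged Python sketch of the
statistic (not the ChiSquare.pm code; equal-probability buckets and
the function name are my own choices here):

```python
def chi_square_stat(data, cdf, buckets, est_params=0):
    """Chi-square goodness-of-fit statistic with equal-probability
    buckets under the given CDF.

    Returns (statistic, degrees_of_freedom); turning the statistic
    into a significance level still requires a table or the
    chi-square CDF."""
    n = len(data)
    expected = n / buckets
    observed = [0] * buckets
    for x in data:
        # cdf(x) in [0, 1) maps each point to a bucket index
        b = min(int(cdf(x) * buckets), buckets - 1)
        observed[b] += 1
    stat = sum((o - expected) ** 2 / expected for o in observed)
    df = buckets - 1 - est_params
    return stat, df
```

With a perfect fit the observed counts match the expected counts and
the statistic is zero.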

GKQuantile.pm:
  This is my implementation of Greenwald and Khanna's quantile
  approximator -- blame any errors or inefficiencies on me, not
  them.  It's a rather space-efficient means of getting approximate
  quantiles from stream data (you can query at any time, with no
  need to buffer it all), and it deterministically meets the user's
  error bound.

  While there IS quantile code in ArrayFoo, it requires that all
  the data be in memory and then uses an O(n log n) sort.  For
  large or constantly updated data, GKQuantile should be a much
  better choice.
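  For contrast, the ArrayFoo-style approach amounts to the following
  Python sketch (names mine): buffer everything, sort, index by rank.
  A GK-style summary avoids both the full buffer and the sort.

```python
def naive_quantile(data, q):
    """Exact q-quantile via a full in-memory sort.

    This is the O(n log n), everything-in-memory approach; fine for
    small data, poor for streams."""
    xs = sorted(data)
    # nearest-rank style index, clamped to the last element
    idx = min(int(q * len(xs)), len(xs) - 1)
    return xs[idx]
```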

  ** NOTE:  
    This module has undergone a number of fixes and alterations --
    use recent versions as they are more trustworthy.  Also, 
    watch out for the 'compress_min' option, which is used to set
    a minimum sample count before compression.  If you use this
    (default: undef == off), YOUR RESULTS WILL CHANGE.  Keep this
    in mind if you want to compare with previous versions.

    Theoretically, a high threshold should improve accuracy at the
    cost of memory, and will also alter performance (each
    compression will be more expensive, but rarer).

Heap.pm:
   Basic priority queue/minheap.  'new' a Heap object, 'insert'
   priority/item pairs, 'remove' priority/item pairs.  Limits
   can be set, but haven't been extensively tested.
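   The insert/remove-pairs interface described above maps directly
   onto a standard binary heap.  A minimal Python analogue (this is
   the stdlib heapq, not the Heap.pm code, and the class name is
   mine):

```python
import heapq

class MinHeap:
    """Tiny priority queue keyed on numeric priorities, mirroring
    the insert/remove-pairs interface of a basic minheap."""
    def __init__(self):
        self._h = []

    def insert(self, priority, item):
        # tuples compare element-wise, so priority dominates
        heapq.heappush(self._h, (priority, item))

    def remove(self):
        # returns the (priority, item) pair with smallest priority
        return heapq.heappop(self._h)
```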

kMeans.pm:
   Basic k-Means package.  Does not automatically guess k.
   Has built-in random-initialization method.  Can be invoked
   via a single line, e.g.

   my ($means, $assignments, $covariances) = 
      kMeans::cluster($data, $k)
   
   where $data is the data in row-major form -- a list ref
   of list refs, each of which corresponds to a tuple -- and
   $k is the number of clusters to use.  $means is also in
   row-major form, $assignments is a ref to a single array
   giving cluster indices 0..($k-1) in the same order as the
   data, and $covariances contains the covariance matrices.

   Can handle univariate data; just use scalars instead of
   list refs wherever tuples would be.
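   The row-major convention above translates naturally into other
   languages.  Here is a hedged Python sketch of Lloyd-style k-means
   over row-major data (illustrative only -- it is not the kMeans.pm
   code, it skips covariances, and it seeds from the first k rows
   rather than a random draw so the sketch stays deterministic):

```python
def kmeans(data, k, iters=20):
    """Basic k-means on row-major data (a list of tuples).

    Returns (means, assignments): means is row-major, assignments
    gives cluster indices 0..k-1 in the same order as the data."""
    means = [list(data[i]) for i in range(k)]
    assignments = [0] * len(data)
    for _ in range(iters):
        # assignment step: nearest mean by squared Euclidean distance
        for i, row in enumerate(data):
            assignments[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(row, means[c])))
        # update step: each mean becomes the centroid of its members
        for c in range(k):
            members = [data[i] for i in range(len(data))
                       if assignments[i] == c]
            if members:
                means[c] = [sum(col) / len(members)
                            for col in zip(*members)]
    return means, assignments
```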
 

mktemp.pm:
  Unless you need code for creating temporary files, don't worry
  about it.


Numerical.pm:
  Read before you use, and avoid the polynomial guesser, as it's
  very bad at estimating initial coefficients.

  It may help to know that the functions in Numerical which deal
  with matrices and vectors prefer row-major order; the 3x3 matrix

    | 1 2 3 |
    | 4 5 6 |
    | 7 8 9 |
 
  gets represented as a reference to a list of list refs:

  +[ +[ 1,2,3 ],
     +[ 4,5,6 ],
     +[ 7,8,9 ] ]

  That's the format used by quite a few of the functions.

  The code is aimed more at being straightforward than fast.  It
  might be faster if it were rewritten using PDL, or a C/C++ core.
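  The numerical function inverter mentioned above can be sketched in
  a few lines.  This Python version uses plain bisection on a
  monotone-increasing function (an illustration of the idea, not the
  Numerical.pm implementation; names and the bracketing interval are
  my own):

```python
def invert(f, y, lo, hi, tol=1e-10):
    """Solve f(x) = y for x in [lo, hi] by bisection.

    Assumes f is monotone increasing on the interval and that
    f(lo) <= y <= f(hi)."""
    for _ in range(200):
        if hi - lo < tol:
            break
        mid = (lo + hi) / 2.0
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0
```

Inversion like this is robust but slow, which is why the numerically
inverted percentage-point functions elsewhere in the package can drag.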

ParseArgs.pm:
  It's another "support" module.  This one is used by grab_bag5.pl
  to support an excessive number of options.

SEM.pm:
  Scalable EM method.  Handles multiple dimensions.  Can be
  invoked simply, e.g.
 
  my $phi = SEM::cluster($data, $k, $first_ct, $next_ct);

  where
    $data is a row-major ordering of the data,
    $k is the number of components to use,
    $first_ct is the number of tuples to use for choosing
      the initial estimates (via k-means with a random draw)
      [optional; defaults to all of the data], and
    $next_ct is the number of tuples to use per subsequent
      batch update [optional; defaults to the rest of the data],

  and $phi is the estimate array -- ref to list of refs,
  one per component.  Component refs contain mixing
  probability, mean (tuple if multidimensional, scalar
  otherwise), and covariance matrix (if multidimensional)
  or variance (otherwise).

  It also provides some methods of note, such as methods for
  generating the PDFs of multi-dimensional Gaussians, estimating
  their parameters, and generating such random variables.  Note
  that the parameter format here differs from UniRand's -- this
  module uses mean/covariance, while UniRand uses mean/deviation.
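  To make the component layout concrete, here is a hedged Python
  sketch of evaluating a mixture PDF in the univariate case, where
  each component reduces to (mixing probability, mean, variance).
  This only mirrors the shape of the $phi structure described above;
  it is not the SEM.pm code, and the multidimensional case would use
  a covariance matrix instead of a scalar variance:

```python
import math

def mixture_pdf(phi, x):
    """PDF of a univariate Gaussian mixture at x.

    phi is a list of (weight, mean, variance) triples whose
    weights sum to 1."""
    total = 0.0
    for weight, mean, var in phi:
        total += (weight
                  * math.exp(-(x - mean) ** 2 / (2.0 * var))
                  / math.sqrt(2.0 * math.pi * var))
    return total
```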


Transform.pm:
  It's a small module for univariate transformations that
  aren't cumulative distribution functions of probability
  distributions.  Ignore the polynomial estimator unless you
  have a better way of making the initial guess.

  The estimators here are out-of-date compared to the 
  univariate random variables in that they do not generate
  PDFs.


UniRand.pm:
  Univariate random distributions.  Read the comments, as
  they're fairly extensive.  The main thing to know is that
  an estimator takes a reference to the unidimensional 
  array, and gives you four things in return:

     - a cumulative distribution function (CDF)
     - a reference to a list of estimated parameters
       (see @DIST_LIST or the comments for a description).
       In the case of mixtures, this will be hierarchical;
       you can use _flatten() to flatten it if you need to.
     - a percentage-point function (inverse CDF).
       This is often based on Numerical's numerical function
       inverter, and thus might be slow.
     - a probability density function (PDF)
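  The four-return contract above can be sketched for a single simple
  case.  The following Python is illustrative only -- it is not
  UniRand.pm, it hardcodes a normal fit, and the function names are
  mine -- but it returns the same four things in the same order:

```python
import math

def estimate_normal(xs):
    """Fit a normal to a sample; return (cdf, params, ppf, pdf).

    The ppf is bisection-based, like the numerically inverted
    percentage-point functions described above, and correspondingly
    slow."""
    n = len(xs)
    mu = sum(xs) / n
    sd = math.sqrt(sum((x - mu) ** 2 for x in xs) / n) or 1.0

    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

    def pdf(x):
        return (math.exp(-(x - mu) ** 2 / (2.0 * sd * sd))
                / (sd * math.sqrt(2.0 * math.pi)))

    def ppf(p):
        # invert the CDF by bisection over a wide bracket
        lo, hi = mu - 10.0 * sd, mu + 10.0 * sd
        for _ in range(100):
            mid = (lo + hi) / 2.0
            if cdf(mid) < p:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    return cdf, [mu, sd], ppf, pdf
```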

  BIC_wrap and AIC_wrap give you the same, but also return
  log-likelihood and either the Bayesian information 
  criterion or the Akaike information criterion.
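  The two criteria themselves are one-liners.  A sketch, using the
  common "lower is better" sign convention (conventions vary, so
  check which orientation UniRand's values use before comparing):

```python
import math

def bic(loglik, k, n):
    # Bayesian information criterion: ln(n) penalty per parameter
    return -2.0 * loglik + k * math.log(n)

def aic(loglik, k):
    # Akaike information criterion: fixed penalty of 2 per parameter
    return -2.0 * loglik + 2.0 * k
```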
  
  The 'match' functions give you just about all the info
  that the module can compute.  See the @MATCH_INDICES
  table for a list.

  Be warned that the mixture methods can take a _very_ 
  long time, because they're EM-based.  Sampling might be
  advisable.

  Oh, and watch out for the gammas; computation is 
  significantly off at either extreme.

  **CHANGE:  Gamma seems to be better now with the chi-
  square estimator coded up.

util.pm:
  Minor code.  Mostly used for a 'deep copy' recursive
  method.

Wrapper.pm:
  The Wrapper module provides a basic interface for data
  access.  The included version handles flat files, with some
  variations allowed (for instance, CSV and whitespace-separated
  files are both fine, and ';' or '#' may denote comments).  If
  you wanted to handle a known
  binary format, you could do so pretty easily while
  preserving the API.
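  A hedged Python sketch of the per-line behavior described above
  (illustrative only, not the Wrapper.pm code; the function name and
  exact precedence of comment markers are my own choices):

```python
def parse_line(line):
    """Parse one flat-file record.

    Strips ';' and '#' comments, then splits on commas if any are
    present, else on whitespace.  Returns None for blank or
    comment-only lines."""
    for marker in (';', '#'):
        pos = line.find(marker)
        if pos >= 0:
            line = line[:pos]
    line = line.strip()
    if not line:
        return None
    if ',' in line:
        return [field.strip() for field in line.split(',')]
    return line.split()
```

Swapping in a reader for a known binary format while keeping this
record-at-a-time interface is the kind of API-preserving change the
paragraph above has in mind.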


apply.pl:
  It uses Wrapper.pm, so it can handle flat files as well.
  Basically, it applies transformations to the columns and
  produces a new set with many more columns (each being a 
  transformed version of an original column) and a header
  listing what parameters were used, et al.

grab_bag5.pl:
  Ditto re:  Wrapper.pm.  It also does some parsing to see if 
  there's a header of the form generated by 'apply.pl'.  Otherwise, 
  it produces two blocks of output -- a log indicating what 
  choices were made and a block of data consisting only of the
  new columns.

  This script provides a lot of options.  The main ones to worry
  about are --

     --memory      (load all data into memory? performance issue)  
     --max_select  (how many?)
     --min_gain    (how good must they be?)
     --fn_old      (input file)
     --fn_new      (output file; just the selected columns)

fit.pl (and mixfit.pl, which is arguably obsolete -- it's a
  trivial variation of fit.pl):
  Both take a single data column from STDIN and fit a univariate
  distribution to it.  This is mostly a simple example script
  showing one way to fit and to generate a readable report of
  parameters and goodness-of-fit evaluations.

pca.pl:
  This is a very, very basic script that uses the routines in
  Numerical.pm to do some simple dimensionality reduction.  It
  takes data from STDIN, whitens it, computes the pseudoinverse,
  reduces that, and uses the smaller form to project the data
  into a lower-dimensional space.  It is far cleaner code than
  it is efficient (either space or speed); if you have a big
  matrix, beware.
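  The core of the idea -- center the data, form the covariance
  matrix, find its leading directions, project -- can be sketched
  without any of pca.pl's machinery.  This Python version finds just
  the top principal component by power iteration (my own simplified
  illustration; pca.pl goes through Numerical.pm and handles the
  full reduction, whitening, and pseudoinverse):

```python
def top_component(data, iters=200):
    """Leading principal component of row-major data.

    Centers the data, builds the covariance matrix, then runs power
    iteration to get its dominant (unit-length) eigenvector."""
    n, d = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in data]
    cov = [[sum(r[i] * r[j] for r in centered) / n
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        # multiply by cov and renormalize; converges to the
        # eigenvector of the largest eigenvalue
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v
```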


-=[ Contact information ]=-

Questions, complaints, comments, suggestions et al can be sent
to moi --

  Leejay Wu <lw2j@cs.cmu.edu>
