================================================

-=[ Summary ]=-

This package supplies a C++ library, a C++ test
program, and a Perl wrapper for the above.  The
library implements box-counting, easily the most
expensive part of the Perl FracDim set of 
modules and scripts.

This is still a work in progress.  Preliminary
results suggest that it serves its purpose as a
much faster replacement for Pair_Count.p[ml], 
but more testing is necessary.

Berkeley database support is now an option, and
disabled by default.  To enable it, edit the 
Makefile to set 'USE_BERKELEY' to be 1 instead
of 0, and set the library path, include path, 
and library name to appropriate values.   By
default, the driver is now built with a native
in-memory-only hashing algorithm (ExtHash) in 
order to reduce external dependencies.  ExtHash
may be significantly faster on certain data
sets, even when Berkeley DB uses an in-memory
table.

Note that this hashing algorithm can consume 
vast amounts of memory, was written by me, and 
is relatively untested, so Berkeley DB w/ C++
support is still recommended...



-=[ Cross-Product Counting ]=-

Apply the normal grid discretization to not one,
but two input sources.  Then, instead of summing
the squares of occupancies of one source's grid,
as we would for a second-degree fractal-dimension
computation, we sum the products of occupancies of 
corresponding cells in the two grids.  This is 
cross-product counting.
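
The description above can be sketched in a few
lines of C++.  This is only an illustration, not
the library's code: the names Point, Cell, Grid,
occupancies(), sum_of_squares() and cross_product()
are made up for this README, and the in-memory
std::map stands in for whatever counter table the
library actually uses.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative types only; these are not the library's own classes.
typedef std::vector<double> Point;
typedef std::vector<std::int64_t> Cell;
typedef std::map<Cell, std::uint64_t> Grid;

// Discretize: count the points falling in each grid cell of side 'radius'.
Grid occupancies(const std::vector<Point>& points, double radius) {
    assert(radius > 0.0);
    Grid grid;
    for (const Point& p : points) {
        Cell cell;
        cell.reserve(p.size());
        for (double x : p) {
            cell.push_back(static_cast<std::int64_t>(std::floor(x / radius)));
        }
        ++grid[cell];
    }
    return grid;
}

// Single-source, second-degree count: sum of squared occupancies.
std::uint64_t sum_of_squares(const Grid& grid) {
    std::uint64_t total = 0;
    for (const auto& entry : grid) {
        total += entry.second * entry.second;
    }
    return total;
}

// Cross-product count: sum over cells of (occupancy in a) * (occupancy in b).
// Cells occupied in only one grid contribute zero, so they are skipped.
std::uint64_t cross_product(const Grid& a, const Grid& b) {
    std::uint64_t total = 0;
    for (const auto& entry : a) {
        Grid::const_iterator match = b.find(entry.first);
        if (match != b.end()) {
            total += entry.second * match->second;
        }
    }
    return total;
}
```

Note that crossing a grid with itself reduces to
the ordinary sum of squares, which is a handy
sanity check.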

Its utility lies in comparing sets for 
distribution similarities.  If, for instance, the
cross-product of two sets yields a "normal"-looking
graph -- flat regions at either end, with a linear
region in between -- then the two may well both
have the same self-similar distribution.  If the
middle region is bent, however, then that strongly
suggests that either they are not self-similar, or
they have quite different distributions.

This code is relatively new and therefore not
well-tested.



-=[ Manifest ]=-

The files included are the following.

Source for the library, consisting of header 
(.h) and C++ source (.cc) files:

   BerkeleyLayer.h     BerkeleyLayer.cc
   BoxCount.h          BoxCount.cc
   CrossCount.h        CrossCount.cc
   DBLayer.h           
   DataWrapper.h       DataWrapper.cc
   DiskWrapper.h       DiskWrapper.cc
   ExtHash.h           ExtHash.cc
   MemoryWrapper.h     MemoryWrapper.cc
   QuadCount.h         QuadCount.cc



A C++ source file for a sample executable:
   
   driver.cc

A GNU Makefile, for building the library and 
executable:
   
   Makefile

A Perl module and accompanying script, providing
compatibility with the old Pair_Count interface:

   FDC.pm              FDC.pl


Licensing terms, in LICENSE.

This README.



-=[ Immediacies ]=-

1.  Editing the Makefile.

If you want to enable Berkeley support, set 
USE_BERKELEY to 1 and edit the include and library
variables.  If enabled, this will be the default 
database implementation.  Make sure that you built
Berkeley DB w/ the C++ API...

If you want to disable (?) support for the included
hashing code, set USE_EXTHASH to 0.

You may need to change flags if you are using a 
vendor-supplied compiler; the ones supplied are for
GNU's gcc.

By default, PARANOID is defined.  This enforces a
requirement that the geometric series of radii 
multiply by integers, which guarantees monotonicity.
Removing this #define disables this protection.
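
To illustrate the constraint: with an integer
multiplier, every cell of a coarser grid is an
exact union of cells of the finer one, which is
presumably why the counts cannot decrease as the
radius grows.  The sketch below builds such a
series and checks one.  The helper names
integer_radii() and ratios_are_integers() are
invented for this example, and the tolerance is an
arbitrary choice; how the driver itself enforces
PARANOID is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: radii r_min, r_min*k, r_min*k^2, ... for integer k >= 2.
std::vector<double> integer_radii(double r_min, unsigned multiplier,
                                  unsigned count) {
    assert(r_min > 0.0 && multiplier >= 2);
    std::vector<double> radii;
    double r = r_min;
    for (unsigned i = 0; i < count; ++i) {
        radii.push_back(r);
        r *= multiplier;
    }
    return radii;
}

// True if every consecutive ratio is, within 'tol', an integer >= 2.
// The tolerance is an assumption made for this sketch.
bool ratios_are_integers(const std::vector<double>& radii,
                         double tol = 1e-9) {
    for (std::size_t i = 1; i < radii.size(); ++i) {
        const double ratio = radii[i] / radii[i - 1];
        const double nearest = std::round(ratio);
        if (nearest < 2.0 || std::fabs(ratio - nearest) > tol) {
            return false;
        }
    }
    return true;
}
```

For example, integer_radii(0.5, 2, 4) gives
0.5, 1, 2, 4, which passes the check, whereas the
series 1, 1.5, 2.25 does not.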

SPEEDER is currently defined.  This option changes the 
default behavior from minimizing memory
requirements (counter tables on disk, not buffering
the data into memory) to maximizing speed (buffering
both, and using an algorithm tweak).  These are
all controllable via options, as well, but if you
want the original defaults (slower, but less memory
consumption) turn it off.


2.  Compiling.

In theory, this will build with GNU make -- 'gmake'
on some systems, 'make' on others.  If it does 
build properly, there should be a static library

   libbox.a

and a binary executable,

   driver

statically linked with libbox.a and libdb_cxx
(if USE_BERKELEY=1).

This driver allows setting numerous options,
as revealed via 'driver -h'.


3.  Driver usage.

The driver may be faster if you specify '--two_table',
which enables an alternate algorithm for moving to
the next radius.  Instead of iterating over the
original data, it can iterate over the previous
counter set.  This will increase storage 
requirements since it needs to maintain two 
tables at once.
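
One way such a step could work, assuming the radius
grows by an integer factor k and cells are indexed
by integer coordinates, is to fold each old counter
into the coarser cell that contains it.  The sketch
below is a guess at the idea, not the library's
implementation; Cell, Grid, floor_div() and
coarsen() are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

// Illustrative types only; not the library's counter tables.
typedef std::vector<std::int64_t> Cell;
typedef std::map<Cell, std::uint64_t> Grid;

// Integer division rounding toward negative infinity (b must be positive),
// so that negative cell indices land in the correct coarser cell.
std::int64_t floor_div(std::int64_t a, std::int64_t b) {
    assert(b > 0);
    std::int64_t q = a / b;
    if (a % b != 0 && a < 0) {
        --q;
    }
    return q;
}

// Build the table for radius k*r from the table for radius r.
// Both tables exist at once, hence the extra storage.
Grid coarsen(const Grid& fine, std::int64_t k) {
    assert(k >= 2);
    Grid coarse;
    for (const auto& entry : fine) {
        Cell cell;
        cell.reserve(entry.first.size());
        for (std::int64_t index : entry.first) {
            cell.push_back(floor_div(index, k));
        }
        coarse[cell] += entry.second;
    }
    return coarse;
}
```

The loop touches one entry per occupied cell rather
than one per data point, which is where the saving
would come from when many points share cells.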

Another speed boost comes from '--counter_memory',
which stores the occupancy counter Btrees in memory.
You need memory for two tables at a time if you
enable --two_table, and one otherwise.

The largest speed boost comes from '--data_memory',
which loads the entire data set into memory; the
driver then no longer needs to fgets() and parse
a string on every fetch, but can deal natively
with arrays of doubles.

The option '--speed' is an abbreviation for 
specifying all three of the aforementioned options.
None of these options should impact correctness;
activating SPEEDER in the Makefile makes these
options default to true instead of false.

There are other options, as described by 'driver -h'.
Some, such as '--radius_min', '--radius_max', and
'--radius_count', which specify the parameters of 
the geometric series of radii, are fairly 
straightforward.  '--zero_translate' may be less
so; this, which defaults to 1 (true), specifies
whether or not the entire data set is translated
by the wrappers by subtracting the minima from 
every point.  This option requires additional
processing whenever a vector is fetched if the data
has not been loaded into memory; the memory wrapper,
on the other hand, performs this subtraction as
soon as it can rather than lazily.
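
The eager variant of the translation might look
roughly like the following.  This is a minimal
sketch assuming the data set is already held as
vectors of doubles; minima() and zero_translate()
are illustrative names, not the wrappers' actual
methods.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

typedef std::vector<double> Point;

// Per-dimension minimum over the whole data set.
std::vector<double> minima(const std::vector<Point>& points) {
    assert(!points.empty());
    std::vector<double> lo = points[0];
    for (const Point& p : points) {
        assert(p.size() == lo.size());
        for (std::size_t d = 0; d < lo.size(); ++d) {
            if (p[d] < lo[d]) {
                lo[d] = p[d];
            }
        }
    }
    return lo;
}

// Subtract the minima from every point, once, in place, so that every
// coordinate of the translated set is >= 0.
void zero_translate(std::vector<Point>& points) {
    if (points.empty()) {
        return;
    }
    const std::vector<double> lo = minima(points);
    for (Point& p : points) {
        for (std::size_t d = 0; d < lo.size(); ++d) {
            p[d] -= lo[d];
        }
    }
}
```

A lazy version would keep the minima around and
apply the same subtraction each time a vector is
fetched, which is the extra per-fetch cost
mentioned above.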

Cross-product counting can be invoked via --cross.
This is the default behavior when the binary
contains the name 'cross'; otherwise, the default
is --pairs, which invokes standard pair-counting.
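
A name-based default of this kind can be sketched
as below.  The enum and function names are
hypothetical, and looking only at the last path
component (rather than the whole of argv[0]) is an
assumption of this example; the driver's own check
may differ.

```cpp
#include <cassert>
#include <string>

enum Mode { PAIRS, CROSS };

// Choose the default counting mode from the name the binary was run as.
// Only the part after the last '/' is examined (an assumption made here).
Mode default_mode(const std::string& argv0) {
    const std::string::size_type slash = argv0.find_last_of('/');
    const std::string name =
        (slash == std::string::npos) ? argv0 : argv0.substr(slash + 1);
    return (name.find("cross") != std::string::npos) ? CROSS : PAIRS;
}
```

Under this sketch, a copy of the driver or a link
to it with 'cross' in its file name would default
to cross-product counting, while an explicit
--pairs or --cross would still override the default.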

Quad-counting, or more precisely, using the 
pair-wise algorithm (called 'quad' since it's
quadratic), is also now allowed via --quad.  Note
that this ignores many options such as the exponent
(equivalent to 2 for pair counting, also not used
for cross-counting) and database options (there is
no counter database).  It does use two options,
--min_frac and --max_frac, which define the 
fraction of pairs which must be within a given 
radius for that radius and count to be reported.
It is otherwise compatible with both pair-counting
and cross-counting.  Using this is not recommended
except on small sets where box-counting may not
be applicable.
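
For reference, the pair-wise method amounts to
something like the sketch below.  Several things
here are assumptions made for illustration: the
names RadiusCount, chebyshev() and quad_count(),
the use of the maximum-coordinate distance, the
counting of unordered pairs, and the reading of
--min_frac and --max_frac as inclusive bounds on
the fraction of pairs within the radius.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

typedef std::vector<double> Point;

struct RadiusCount {
    double radius;
    std::uint64_t pairs;  // unordered pairs at distance <= radius
};

// Maximum-coordinate distance.  ASSUMPTION: the real metric may differ.
double chebyshev(const Point& a, const Point& b) {
    assert(a.size() == b.size());
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        d = std::max(d, std::fabs(a[i] - b[i]));
    }
    return d;
}

// Quadratic pair count.  A radius is reported only when the fraction of
// pairs within it lies in [min_frac, max_frac] (assumed inclusive).
std::vector<RadiusCount> quad_count(const std::vector<Point>& points,
                                    const std::vector<double>& radii,
                                    double min_frac, double max_frac) {
    std::vector<RadiusCount> report;
    const std::uint64_t n = points.size();
    if (n < 2) {
        return report;
    }
    const std::uint64_t total = n * (n - 1) / 2;
    std::vector<std::uint64_t> within(radii.size(), 0);
    for (std::size_t i = 0; i < points.size(); ++i) {
        for (std::size_t j = i + 1; j < points.size(); ++j) {
            const double d = chebyshev(points[i], points[j]);
            for (std::size_t r = 0; r < radii.size(); ++r) {
                if (d <= radii[r]) {
                    ++within[r];
                }
            }
        }
    }
    for (std::size_t r = 0; r < radii.size(); ++r) {
        const double frac = static_cast<double>(within[r]) /
                            static_cast<double>(total);
        if (frac >= min_frac && frac <= max_frac) {
            RadiusCount rc;
            rc.radius = radii[r];
            rc.pairs = within[r];
            report.push_back(rc);
        }
    }
    return report;
}
```

The double loop over all pairs is what makes this
quadratic in the number of points, and why it is
only practical on small sets.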


4.  Compatibility.

Also enclosed are FDC.pm and FDC.pl, which provide
a Perl wrapper and compatibility layer.

The FDC.pm Perl module is designed to be used as
a supplement to Pair_Count.pm in the original 
FracDim toolkit.  Once edited -- the 'use lib'
path needs to be changed to point to the location
of Pair_Count.pm, and the 'executable' parameter
should be set in set_default_params() to point to
the location of the 'driver' binary -- it should
suffice as a replacement for Pair_Count.pm in
other scripts and modules.  It still needs 
Pair_Count itself, however, since if the user sets
parameters to request something that the library
does not handle -- namely, the quadratic pairwise
method instead of box-counting, or in-wrapper 
hypercube normalization -- it can access Pair_Count
instead of 'driver'.  

<< Quadratic method now implemented. >>

FDC.pm accepts both old-style parameters and new
driver-style ones, such as both '-q' and 
'--exponent'.  It also supports the ones that are
applicable only to the library, such as '--speed'.

FDC.pl is similar to the original Pair_Count.pl,
but uses FDC.pm instead of Pair_Count.pm.  Thus, 
it should be usable as a drop-in replacement for
Pair_Count.pl.

An integrated version (full Perl suite with C++ 
support) will probably be released later, when the
library and FDC code have been tested much more.

There may be slight differences in output between
the library and the Perl implementation, but these
should not be significant since both use the same
algorithm.



-=[ Installation Summary When Using FracDim ]=-

1.  Unpack FracDim and FDC.  They can go in sibling directories.
2.  Configure and compile FDC.
3.  Apply the FracDim-FDC.patch patch.  It's just a unified diff.
    For instance, if your directory resembles

    FDC-2003xxxx/
    FracDim-2003xxxx/
    FracDim-FDC.patch

    you should be able to run

    cd FracDim-2003xxxx
    patch -p1 < ../FracDim-FDC.patch
4.  Load FDC.pm in your favorite text editor.  Search for 'driver'.
    Change the path (it reads '/home/lw2j/private/WORK/FracDim/driver')
    to point to the driver binary in the FDC-2003xxxx directory.

At that point, running 'RUNME.pl' in the FracDim directory should
still work, only faster.
