On pair counting:


There are two basic algorithms used by this package to obtain 
the statistics of how many pairs of objects are within a given 
distance of each other.

One method is the slow, exact one:  compute the distance
between every pair of points.  This is quadratic in the
number of points, and therefore not normally recommended;
the exception is unusually small sets, where an
approximation method may not yield particularly good
results.  For these sets, one can always enable '-s' (for
'slow').
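
As a rough illustration of the slow method (a sketch only,
not the package's own code; the name pair_counts is made up
for this example), one can compute every pairwise distance
once and then count how many fall within each radius:

```python
import math
from bisect import bisect_right
from itertools import combinations

def pair_counts(points, radii):
    """Count, for each radius, the unordered pairs no farther apart than it."""
    # All n*(n-1)/2 distances: this is the quadratic step.
    dists = sorted(math.dist(p, q) for p, q in combinations(points, 2))
    # bisect_right gives the number of sorted distances <= r.
    return [bisect_right(dists, r) for r in radii]

points = [(0, 0), (1, 0), (0, 1), (5, 5)]
print(pair_counts(points, [1, 2, 10]))  # prints [2, 3, 6]
```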

With this method, there arises the question of how many data
points there should be for the line-fitting part of the
problem.  Clearly, one could use one point for every distinct
pairwise distance, but this risks unduly biasing the robust
line-fitter.  Therefore, we divide the distance axis into
equal-length intervals, and from each interval select the
first point it contains (in ascending order).
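
The interval selection might look something like the sketch
below.  It assumes the intervals have equal width along the
plain (not logarithmic) distance axis, which is my reading
rather than something stated above, and the function name
thin_by_intervals is hypothetical:

```python
def thin_by_intervals(values, intervals=20):
    """Keep the first value (in ascending order) from each occupied interval."""
    values = sorted(values)
    lo, hi = values[0], values[-1]
    if hi == lo:
        return values[:1]
    width = (hi - lo) / intervals
    picked, seen = [], set()
    for v in values:
        # The maximum value is folded into the last interval.
        idx = min(int((v - lo) / width), intervals - 1)
        if idx not in seen:
            seen.add(idx)
            picked.append(v)
    return picked

print(thin_by_intervals(range(11), intervals=5))  # prints [0, 2, 4, 6, 8]
```

Empty intervals simply contribute nothing, so the result may
hold fewer values than there are intervals.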

The second, and default, algorithm we use is box-counting.
This algorithm divides a vector space into a hypergrid with
equilength grid sides for every radius in a series.  If the
radii are in a geometric series with an integral ratio, 
then the grids will "nest" since the package aligns them 
with the origin always at a grid "corner".  Nesting
guarantees that any objects together in the same cell at 
one radius will be in the same cell for any larger radius in
the series, which is theoretically useful since it guarantees 
monotonicity -- and, most likely, enhances stability.
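
A small sketch of why an integral ratio matters, assuming a
cell index is just the floor of each coordinate divided by
the radius (the helper cell_of is invented for this example):

```python
import math

def cell_of(point, radius):
    """Grid cell index of a point; the origin sits on a cell corner."""
    return tuple(math.floor(x / radius) for x in point)

a, b = (1.2,), (1.8,)
same_at_1 = cell_of(a, 1.0) == cell_of(b, 1.0)    # both in cell (1,)
same_at_2 = cell_of(a, 2.0) == cell_of(b, 2.0)    # ratio 2: still together
same_at_1_5 = cell_of(a, 1.5) == cell_of(b, 1.5)  # ratio 1.5: split apart
print(same_at_1, same_at_2, same_at_1_5)          # prints True True False
```

With the non-integral ratio, a cell boundary at 1.5 lands
between the two points, so a pair counted together at the
smaller radius is lost at the larger one.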

Box-counting estimates the pairwise counts rather than
computing them.  For instance, the series of the sums of the
second moments of the occupancies of the cells mimics the
series of pairs yielded by the full pairwise method, and the
slopes (in logarithmic space) should be very close.  And
since it iterates over the entire data set once per radius,
and usually only a few radii are used -- certainly far fewer
than there are data points -- it should normally be far
faster than the pairwise method.
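
To make the second-moment idea concrete, here is a minimal
sketch.  The names are hypothetical, and the slope is found
with ordinary least squares purely as a stand-in for the
package's robust line-fitter:

```python
import math
from collections import Counter

def second_moment(points, radius):
    """Sum of squared cell occupancies for one grid radius."""
    cells = Counter(tuple(math.floor(x / radius) for x in p) for p in points)
    return sum(n * n for n in cells.values())

def loglog_slope(xs, ys):
    """Least-squares slope of log2(y) against log2(x)."""
    lx = [math.log2(x) for x in xs]
    ly = [math.log2(y) for y in ys]
    mx, my = sum(lx) / len(lx), sum(ly) / len(ly)
    num = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den

radii = [1, 2, 4, 8]
line = [(float(i),) for i in range(1024)]
grid = [(float(i), float(j)) for i in range(32) for j in range(32)]
print(loglog_slope(radii, [second_moment(line, r) for r in radii]))  # about 1
print(loglog_slope(radii, [second_moment(grid, r) for r in radii]))  # about 2
```

Each radius costs one pass over the points, so the total
work is the number of points times the number of radii.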

One question is how one decides what radii are to be used.
This question is irrelevant for the pairwise method, since we
compute the pairwise distances; for box-counting, we may not 
have any a priori suspicions about a suitable range for box
radii.  I certainly don't when it comes to your data.  
Therefore, the package provides a *very* broad range of 
radii by default, and uses certain termination conditions.

First, it starts at the middle radius in the series, in the
hope that it is a reasonable distance, by which we mean that
the graph will not be unduly affected by outliers.

In one direction, the package systematically reduces the 
radius.  If it reaches the minimum radius -- either specified
by the user, or the default value -- it will stop.  If every
object is in its own cell, it will stop.  A broader version
of that rule also applies:  if an excessive fraction of the
occupied cells hold only a single object, it will stop.  This
singleton rule should help with problems
relating to duplicate (or near duplicate) objects; having 
unusually low pairwise distances can interfere with the 
line-fitting, since pairs of very close points can result in
a flat region (on the pair-count plot) starting at the small
distance and stopping with the distance to the other points.
This flat region could escape being trimmed if the radius 
range extends to the tiny distance, since the flat trimmer
is designed to remove only extrema, leaving other flat 
regions as they may indicate clusters or other non-self-
similarity.  An untrimmed flat region of sufficient size can
then fool the line fitter into concluding that the slope, and
fractal dimension, is exactly zero.  Preventing this could 
be done either with the singleton checking, or by manually
reducing the radius range.
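
The downward scan and its stopping rules could be sketched
as follows.  This is an illustration under assumptions, not
the package's code:  the function names are invented, and I
assume the radius that trips the singleton rule is discarded
rather than kept.

```python
import math
from collections import Counter

def occupancies(points, radius):
    """Map each occupied cell index to the number of points in it."""
    return Counter(tuple(math.floor(x / radius) for x in p) for p in points)

def shrink_radii(points, start, r_min, ratio=2.0, s_max=0.95):
    """Radii kept while stepping down from start, largest first."""
    kept = []
    radius = start
    while radius >= r_min:                      # stop at the minimum radius
        cells = occupancies(points, radius)
        singles = sum(1 for n in cells.values() if n == 1)
        # "Every object in its own cell" is the special case fraction == 1.
        if singles / len(cells) > s_max:
            break
        kept.append(radius)
        radius /= ratio
    return kept

spread = [(0.0,), (10.0,), (20.0,), (30.0,)]
print(shrink_radii(spread, 16.0, 1.0))  # prints [16.0]
```

At radius 8 every point sits alone in its cell, so the scan
stops there instead of running all the way down to r_min.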

In the other direction, it will stop if it reaches the end of
the radius range.  It will also stop if all the objects are
in the same cell.  And again, there is a more general
termination condition:  it will stop if an excessive fraction
of the entire data set is in one cell.  This termination
condition
is also designed to help deal with a possible problem:  an
extreme outlier.  By extreme, I mean an outlier that is so
many orders of magnitude away from the rest of the data --
perhaps it indicates a data entry error, perhaps it's merely
an extreme anomaly -- that it results in a flat region in the
upper part of the graph.  This strikes me as less likely than
the
singleton issue, but still theoretically possible, and 
checking for this likely does not induce a terribly 
significant cost in CPU time.
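
The upward scan can be sketched the same way.  Again the
names are hypothetical, and I assume the radius that trips
the rule is dropped:

```python
import math
from collections import Counter

def occupancies(points, radius):
    """Map each occupied cell index to the number of points in it."""
    return Counter(tuple(math.floor(x / radius) for x in p) for p in points)

def grow_radii(points, start, r_max, ratio=2.0, o_max=0.95):
    """Radii kept while stepping up from start, smallest first."""
    kept = []
    radius = start
    while radius <= r_max:                      # stop at the end of the range
        cells = occupancies(points, radius)
        # Fraction of the whole data set held by the fullest cell.
        if max(cells.values()) / len(points) > o_max:
            break
        kept.append(radius)
        radius *= ratio
    return kept

# Eight clustered points plus one extreme outlier.
points = [(float(i),) for i in range(8)] + [(1000.0,)]
print(grow_radii(points, 1.0, 64.0, o_max=0.8))  # prints [1.0, 2.0, 4.0]
```

In this toy set the fullest cell never holds more than 8 of
the 9 points, so a threshold of 0.95 would not trip at all
and the scan would run on to r_max.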

One note about performance:  perfectly nested radii should
differ from an arbitrary series not only in accuracy, but
also in speed.  The
package will take advantage of perfect nesting when it is
increasing the radius; instead of iterating over the 
entire data set, it iterates over occupied grid cells.  In
theory, at some point there should be significantly fewer
cells than objects, and this should save time.
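
A sketch of that shortcut, assuming integer cell indices and
an integral ratio (coarsen is a made-up name):  the cells for
the next radius are built from the current cells alone, with
no second pass over the objects.

```python
import math
from collections import Counter

def coarsen(cells, ratio=2):
    """Merge occupied cells into the grid whose radius is ratio times larger."""
    merged = Counter()
    for index, count in cells.items():
        # Floor division keeps negative indices on the right side of the origin.
        merged[tuple(i // ratio for i in index)] += count
    return merged

points = [(0.2, 0.3), (1.5, 0.1), (2.7, 3.9), (-0.5, -1.2), (3.1, 3.2)]
fine = Counter(tuple(math.floor(x / 1.0) for x in p) for p in points)
direct = Counter(tuple(math.floor(x / 2.0) for x in p) for p in points)
print(coarsen(fine) == direct)  # prints True
```

The loop runs once per occupied cell, so it gets cheaper as
cells merge.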


Relevant parameters include:
  
* Slow method:
  "method" should be set to slow.  In the scripts, it's '-s'.

  "intervals" indicates how many divisions will be imposed
  before line-fitting.  The default is 20, and it'll
  complain with fewer than 10, since you probably won't
  get good line-fitting with too few.

*  Box-counting:
   "q", the box-counting exponent.  0 gives the capacity
   (box-counting) dimension, often loosely called Hausdorff;
   2 gives the correlation dimension, and so forth.

   "r_min", "r_max", "r_count":  the minimum, maximum, and
   number of radii to use (at most; the occupancy criteria
   limit this).  The radius multiplier -- the ratio between
   two consecutive radii -- will be computed from these.

   "s_max":  the maximum allowed fraction of *occupied* 
   cells that may be singletons.  If too many of these 
   cells contain just one object, then reducing the 
   radius further isn't terribly meaningful, and we run
   the risk of confusing the line-fitter.

   "o_max":  the maximum allowed fraction of the database
   to be put in one cell.  Once there are many objects in
   just one cell, increasing the radius further may be
   counterproductive.
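
As a rough sketch of how the multiplier could follow from
"r_min", "r_max", and "r_count" (this assumes r_count counts
both endpoints of a geometric series; the package's exact
convention may differ, and radius_series is a made-up name):

```python
def radius_series(r_min, r_max, r_count):
    """Return (multiplier, radii) for a geometric series of r_count radii."""
    if r_count < 2:
        raise ValueError("need at least two radii to define a multiplier")
    # r_count radii means r_count - 1 multiplications from r_min to r_max.
    ratio = (r_max / r_min) ** (1.0 / (r_count - 1))
    return ratio, [r_min * ratio ** k for k in range(r_count)]

ratio, radii = radius_series(1.0, 1024.0, 11)
print(round(ratio, 6), len(radii))  # prints 2.0 11
```

A multiplier of 2 comes out whenever r_max / r_min is an
exact power of two with exponent r_count - 1.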

The defaults should, in general, be reasonable values.  In
particular, the radii are set up so that the multiplier is
2, allowing perfect nesting, and the radius range is
deliberately very broad, so the vast majority of data sets
should fit.  Both s_max and o_max default to 0.95; this may
be too high for effective o_max pruning, but I'm not yet
sure how much of a problem extreme outliers are.
