htsnp is a program that uses haplotype motifs to locate "haplotype
tagging" SNP sub-sets that carry all or most of the information found
in a full set of SNPs sequenced in a sample population. The code
uses a dynamic programming algorithm to choose an optimal SNP set
from a set of motifs with measured frequencies. Details of the
algorithms will be described in a forthcoming paper.

The code is called as follows:

Usages:
  ./htsnp [-i ] [-d ]
          [-p [-c ]]
  ./htsnp [-i ] [-a ]
          [-p [-c ]]
  ./htsnp [-i ] [-k ]

  -i : specifies a file of motif frequencies to use in SNP
       selection (default: stdin)
  -d : specifies a tolerated maximum amount of per-base error
       and seeks to minimize SNPs for that maximum (default: 0)
  -a : specifies a tolerated average amount of per-base error and
       seeks to minimize SNPs for that average (default: 0)
  -k : specifies a maximum number of SNPs and seeks to minimize
       expected total error given that maximum (default: 0)
  -p : specifies the population size from which motifs were derived
       (only used if a confidence interval is used)
  -c : specifies the size of the confidence interval if a
       population size is specified (default: 0.0)

The program takes as input a motif file (created with the -s option of
the hapmotif executable). It produces a set of SNPs chosen so that the
remaining SNPs can be inferred from them. There are three versions of
the selection, one for each usage above.

In the first version (-d), the SNPs are chosen to be a minimal set
yielding a particular level of expected prediction accuracy on each
SNP when used with the prediction algorithm of the predictb
executable. In that algorithm, we assign a missing SNP by finding the
motif spanning that site that is most probable given the known sites
in the sequence, and using whatever allele that motif has at that
site. If a non-zero confidence interval is selected, along with the
population size needed to establish it, then SNPs are chosen to
provide the specified accuracy with at least the specified confidence
at each site, considered in isolation.

In the second version (-k), the SNPs are chosen to yield a set of
fixed size with minimum expected error rate.

In the third version (-a), the SNPs are chosen to be a minimal set
yielding a particular level of expected prediction accuracy averaged
over all SNPs.
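
The prediction rule and the maximum-error criterion described above
can be illustrated with a small Python sketch. This is not the htsnp
or predictb code: it assumes every motif spans all sites, uses made-up
motif frequencies, and replaces the dynamic programming algorithm with
a brute-force search that is only practical for toy inputs. The names
MOTIFS, predict, per_site_error and smallest_tag_set are hypothetical.

```python
from itertools import combinations

# Hypothetical toy data: each motif is a string of alleles over all
# sites, mapped to its frequency. Real motifs may span only part of
# the region; that case is not modeled here.
MOTIFS = {"0011": 0.4, "1100": 0.3, "0110": 0.2, "0010": 0.1}


def predict(motifs, known, site):
    """Predict the allele at `site` from `known` ({site index: allele}).

    Picks the most frequent motif that agrees with every known site
    and returns the allele that motif has at `site`.
    """
    consistent = {m: f for m, f in motifs.items()
                  if all(m[i] == a for i, a in known.items())}
    if not consistent:
        return None  # no motif matches the known sites
    best = max(consistent, key=consistent.get)
    return best[site]


def per_site_error(motifs, tags):
    """Expected error at each site when only the `tags` sites are typed."""
    n = len(next(iter(motifs)))
    errors = [0.0] * n
    for truth, freq in motifs.items():
        known = {i: truth[i] for i in tags}
        for site in range(n):
            if predict(motifs, known, site) != truth[site]:
                errors[site] += freq
    return errors


def smallest_tag_set(motifs, max_error):
    """Brute-force stand-in for the -d mode: the smallest set of sites
    whose worst per-site expected error does not exceed `max_error`."""
    n = len(next(iter(motifs)))
    for k in range(n + 1):
        for tags in combinations(range(n), k):
            # Small tolerance guards against floating-point round-off.
            if max(per_site_error(motifs, tags)) <= max_error + 1e-12:
                return tags
    return tuple(range(n))


if __name__ == "__main__":
    print(smallest_tag_set(MOTIFS, 0.0))
    print(smallest_tag_set(MOTIFS, 0.1))
```

In this sketch, the average-error criterion (-a) would compare the
mean of per_site_error against the threshold instead of the maximum,
and the fixed-size criterion (-k) would search only sets of the given
size and keep the one with the lowest total error.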