This document describes the file formats used by the HapMotif code.
See the other README files for information on using the code.

HAPLOTYPE INPUT FILE FORMAT
---------------------------

A haplotype input file consists of a set of lines, each of which has
the format

"<count> <haplotype>\n"

where <count> is a positive integer count of occurrences of a
haplotype in a population sample and <haplotype> is a string
representing the haplotype.

<haplotype> fields contain a string of symbols with no whitespace,
where the symbol in position i represents the allele at the ith
polymorphic site in a particular region. Any single character can
represent an allele except for the special characters '?', 'X', and
'x', which are synonymous and represent an unknown value at a given
polymorphic site. These programs are designed to work primarily with
SNP values, in which case alleles will generally be one of
(A,C,T,G), but other kinds of polymorphisms can be represented by
using a larger symbol set. The probability model used by the code
assumes biallelic data, and it will not generally work correctly if a
single polymorphic site has more than two alleles.

All <haplotype> fields in a given file must have the same length
(i.e., there must be one symbol for each polymorphic site in each
line). A deletion polymorphism still requires a symbol. For example,
if a base is missing in some sequences, you might use a '-' symbol to
represent the missing base in those sequences that lack it.

See sample.hap for an example of a haplotype input file.

HAPLOTYPE OUTPUT FILE FORMATS
-----------------------------

Three output file formats are supported: text, portable anymap (PNM),
and HTML. Text files are mainly useful because they do not require a
browser or graphics utility to view them and because they lend
themselves well to automated parsing. PNM files provide the prettiest
pictures and are generally the easiest way to interpret color
information for large numbers of haplotypes, but they do not
communicate sequence information.
HTML files provide a compact visual display of both sequence and
color information in a portable format.

The text format lists each sequence lined up with its coloring in the
following format:

(+)+\n (+)+\n \n

The integer colors are padded on the left with up to two spaces so as
to line them up with the bases. If there are more than 999 colors,
the alignment may not work.

The PNM format represents each haplotype as a row of colors
displaying the coloring. The height of each row is constant per
observed sequence and the width is constant per SNP. Distinct
haplotype colors map to distinct colors in the image, up to a maximum
of 256 colors. Viewing PNM files requires an external image viewer,
which is not included with this code.

The HTML format prints each sequence as a string with individual
characters colored according to their haplotype coloring. Distinct
haplotype colors map to distinct colors in the HTML file, up to a
maximum of 256 colors. HTML files should be viewable in any common
web browser.

MOTIF FILE FORMAT
-----------------

A complete set of motifs is output by the hapmotif executable in the
format:

(<start> <string> <frequency>\n)+

<start> is the index of the first polymorphic site in the motif
relative to the set of polymorphic sites used to generate the motif.
Indices start at 0. <string> is the string of polymorphic values
found in the motif. <frequency> is the frequency of the motif in the
haplotype data set used to generate it.

This motif format is also used as input to the htsnp, predictb,
predictg, and case-control executables.

DIPLOID INPUT FILE FORMAT
-------------------------

Diploid input files, used to specify case and control populations for
the case-control executable, are similar to haplotype input files
except that lines come in pairs of the form:

"<count> <haplotype>\n<count> <haplotype>\n"

where the <count> parameters must be the same for the two lines. The
format is otherwise identical to that described above.
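
As a rough illustration of the input formats described above, the
Python sketch below reads haplotype lines, reports sites that have
more than two known alleles, and pairs up the lines of a diploid
file. It is not part of HapMotif. The function names are
hypothetical, the count is assumed to be the first field on each
line, blank lines are assumed to be skippable, and the pairing check
assumes that the count is the parameter that must match between the
two lines of a pair.

```python
# Symbols that the format reserves for "unknown allele".
UNKNOWN = {"?", "X", "x"}


def parse_haplotype_lines(lines):
    """Parse 'count haplotype' lines into a list of (count, haplotype)."""
    records = []
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # assumption: blank lines carry no data
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 fields")
        count = int(fields[0])
        if count <= 0:
            raise ValueError(f"line {lineno}: count must be positive")
        hap = fields[1]
        # Every haplotype needs one symbol per polymorphic site.
        if records and len(hap) != len(records[0][1]):
            raise ValueError(f"line {lineno}: haplotype length differs")
        records.append((count, hap))
    return records


def sites_with_more_than_two_alleles(records):
    """Return 0-based site indices that are not biallelic."""
    if not records:
        return []
    bad = []
    for i in range(len(records[0][1])):
        # Unknown symbols do not count as alleles.
        alleles = {hap[i] for _, hap in records} - UNKNOWN
        if len(alleles) > 2:
            bad.append(i)
    return bad


def pair_diploid(records):
    """Group consecutive records into (count, hap1, hap2) triples."""
    if len(records) % 2 != 0:
        raise ValueError("diploid files need an even number of lines")
    pairs = []
    for k in range(0, len(records), 2):
        (c1, h1), (c2, h2) = records[k], records[k + 1]
        if c1 != c2:
            raise ValueError(f"pair {k // 2 + 1}: counts differ")
        pairs.append((c1, h1, h2))
    return pairs
```

A caller might pass an open file object to parse_haplotype_lines,
since iterating over a file yields its lines, and then check the
result with sites_with_more_than_two_alleles before running any
analysis that relies on the biallelic assumption.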