Download the source code for version 2.0 of the toolkit. It can be installed on a Unix machine, or on Windows using Cygwin. Unzip and untar the downloaded file, then build, using the following commands:

    gunzip blmt_v1.0.tar.gz
    tar -xvf blmt_v1.0.tar
    cd blmt
    make

1. Compilation of programs

There is a makefile (called "makefile") in the Final directory.

[ Tutorial help for non-computer scientists:
    -c    creates an object file
    *.o   are the object files
    -o    names the output file (o is for output); given two or more object files as input, the compiler links them to create the executable ]

The following commands can be used with the makefile.

Global compilation of all the programs in the toolkit:

    make [or make all]   removes all *.o files and compiles all programs.
    make clean-all       removes all *.o files and all executables [so that you can recompile them afresh].
    make clean           removes only the *.o files.

Each individual program can also be compiled separately:

    make faa2srt         [Compiles the faa2srt.cpp file and creates the faa2srt executable. Note that this does not remove the *.o files first. The *.o files need to be removed whenever the source code changes and you want the change to be picked up.]
    make srt2lcp
    make ngrams
    make proteinCount
    make proteinNGram
    make yule
    make map2srt
    make langmodel
    make wcngrams

Usage of the programs (input/output options etc.)

    ./programName -help

shows the options available for the program called programName.

Example usage (see below):

    ./ngrams -fsrt bb.faa.srt -flcp bb.faa.lcp -n 5 -printall -sortc

faa2srt: Creates a suffix array from a FASTA-format genome file.

    ./faa2srt -help
    Usage: ./faa2srt
        -ffaa <Genome (Input) filename (.faa)>
        -fsrt <Sorted-Suffix-Array of Genome (Output) filename (default: .faa.srt)>
        -help display this help message

Example usage:

    ./faa2srt -ffaa human.faa

Note: For long genomes, you have to adjust the maximum genome length to suit your file: in mylib.h, find SUPERLEN 12000000 and change it to a larger value if needed.
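For readers new to the term: a suffix array lists the start positions of all suffixes of a sequence, ordered so that the suffixes themselves are in sorted (alphabetical) order. The short Python sketch below builds one for a toy sequence by plain sorting. It is only an illustration. The function name suffix_array and the toy sequence are made up here, and the sketch says nothing about the algorithm or the .srt file format that faa2srt actually uses.

```python
def suffix_array(seq):
    """Return the start positions of all suffixes of seq, in sorted suffix order.

    Naive approach: sort the positions by the suffix that begins at each one.
    This is fine for a toy example but far too slow for a whole genome.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


if __name__ == "__main__":
    toy = "ABFGMAW"  # hypothetical toy sequence, not real data
    for pos in suffix_array(toy):
        # Print each start position next to the suffix that begins there.
        print(pos, toy[pos:])
```

For the toy sequence ABFGMAW the sorted suffixes are ABFGMAW, AW, BFGMAW, FGMAW, GMAW, MAW and W, so the positions come out in the order 0, 5, 1, 2, 3, 4, 6. Identical n-grams start adjacent suffixes in this ordering, which is why a sorted suffix array is a convenient starting point for counting n-grams.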
srt2lcp: Creates the Longest Common Prefix (LCP) and Rank arrays corresponding to a suffix array.

    ./srt2lcp -help
    Usage: ./srt2lcp
        -fsrt <Sorted-Suffix-Array of Genome (Input) filename (.faa.srt)>
        -flcp <LCP (Output) filename (default: .srt.lcp)>
        -frnk <Rank (Output) filename (default: .srt.rnk)>
        -help display this help message

Example:

    ./srt2lcp -fsrt human.faa.srt

ngrams: Finds the n-grams occurring in a genome and the number of times each n-gram occurs. It can also list the n-grams in descending order of their number of occurrences. By default it prints the n-gram counts alone (without the n-gram itself), so that the output can easily be used by other programs (for example, for plots).

    ./ngrams -help
    Usage: ./ngrams
        -fsrt <Sorted Suffix Array (Input) filename>
        -flcp <LCP array file (Input)>
        -fngrams <Output filename to print n-gram counts>
        -n <n-gram length: e.g. "-n 4">
        -top N <print only top N n-grams: e.g. "-top 20">
            (with this option, n-grams are sorted by count)
            (also, if N is 0 or the -top option is not given, all n-grams are printed)
        [-printngram] (default: OFF. Give this flag if you want the n-gram to be printed beside its count)
            (-printall was the old switch for the same action; it is still supported)
        [-sortbycount] (default: OFF. Give this flag to sort n-grams by count instead of alphabetically)
        [-pzn] <Print also n-grams that do not occur in the input>
            (counts for these non-occurring n-grams will be 0)
        -help display this help message

Example:

    ./ngrams -fsrt human.faa.srt -flcp human.faa.lcp -fngrams human.4grams.txt -n 4 -top 20 -sortbycount

proteinCount: Counts the total number of proteins in a genome and lists the lengths and (optionally) the headers of all the proteins.

    ./proteinCount -help
    Usage: ./proteinCount
        -ffaa <Genome (.faa) (Input) filename>
        -fsrt <Sorted Suffix Array (Input) filename>
        -fprot <Protein count (Output) filename>
        [-printall] (default: OFF. Give this flag if you want protein headers to be printed)
        [-nosort] (default: OFF. Give this flag to NOT sort proteins by length)
        -help display this help message

Example:

    ./proteinCount -ffaa human.faa -fsrt human.faa.srt -fprot human.protCount.txt -printall -nosort

proteinNGram: Given a protein sequence, this program lists the frequency of occurrence of each n-gram that appears in the protein when it is scanned with a sliding window. For example, if the input sequence is ABFGMAW and n is 4, the program lists the number of occurrences of ABFG, BFGM, FGMA and GMAW. This program is useful for comparing n-gram preferences across organisms.

    ./proteinNGram -help
    Usage: ./proteinNGram
        -fsrt <Sorted Suffix Array (Input) filename>
        -flcp <LCP filename>
        -fprot <Input Protein Sequence (for n-gram analysis)>
        -fstats <Output file to write n-gram statistics>
        -n <n-gram length; Default: 4>
        -help display this help message

Example:

    ./proteinNGram -fsrt human.faa.srt -flcp human.faa.lcp -fprot prot0157.txt -fstats human_prot0157.stats -n 4

wcngrams: Wildcard-matched n-grams. Input a pattern containing the wildcard characters ?, < and > to match 'any amino acid/nucleotide', 'beginning of sequence' and 'end of sequence' respectively. These wildcards may be combined with specific amino acids/nucleotides.

    ./wcngrams -help
    Usage: ./wcngrams
        -fsrt <Sorted Suffix Array (Input) filename>
        -flcp <LCP array file (Input)>
        -pattern <pattern, such as "A?C?A", "<MA?A" or "MA?A>">
            where ? matches any single character, < means the beginning and > means the end of a sequence.
            Maximum of 999 characters.
        -help display this help message

yule: Finds Yule statistics of patterns such as "A**B" in a database. The computed Yule values are written out to text files.

    Usage: ./yule
        -fsrt <Sorted Suffix Array (Input) filename>
        -flcp <LCP filename>
        -fprot <Input Protein Sequence (for n-gram analysis)>
        -foutprefix <Output file to write n-gram statistics>
        -nfrom, -nto  The n-gram lengths to be considered are specified as a range from -nfrom to -nto.
            For example, -nfrom 2 -nto 4 means lengths 2 to 4.
        -help display this help message

langmodel: An n-gram language model is computed for the training set. Test-set sequence perplexities are then compared with the language model.

    Usage: ./langmodel
        -fsrt <Sorted Suffix Array (Input) filename>
        -flcp <LCP filename>
        -fprot <Input Protein Sequence (for n-gram analysis)>
        -fpsrt <Input Protein Sorted Array filename (optional)>
            ***Note that only one of -fprot or -fpsrt needs to be given
            ***If the sorted array is already computed, it will be faster to give the srt file as input
        -fplcp <Input Protein LCP Array filename (optional)>
        -foutprefix <Output file to write n-gram statistics>
        -nfrom <n-gram from length; Default: 4>
        -nto <n-gram to length; Default: 4>
        -help display this help message

map2srt: Amino acid or nucleotide sequences may be mapped to a reduced alphabet, such as one based on electronic properties or polarity. A suffix array is then computed for the mapped sequences.

    Usage: ./map2srt
        -ffaa <Genome (Input) filename (.faa)>
        -mtype <Maptype [pnp, ep]>
        -help display this help message
</verbatim> </body> </html>