The Consensus-based Likelihood Estimator for AdmiXture (CLEAX) Program

Version 2.0

This page contains links to a preliminary version of the CLEAX program written in C++. The program automatically detects population structures, identifies the population history, and learns divergence time and admixture fraction.

Version History:
Version 1.0: Capable of inferring only three-population admixture/non-admixture scenarios. Assumes that the population with the least supporting weight from the observed data is the admixed population.
Version 2.0: Capable of inferring admixture/non-admixture scenarios for three or more populations. Automatically infers population history by incorporating all possible admixed/non-admixed scenarios into the MCMC chain.

Compatibility: The program has been compiled and tested on both Windows and Linux with the GNU C++ compiler and GNU make. Although it has only been tested on these two platforms, it should work on any machine with an ANSI C++ compiler.

Source Code: cleax-2.0.tgz

Compilation: To compile the program, go to the build directory and type:

make clean
make all

This should produce a program called cleax (or cleax.exe in Windows) in the build directory.

USAGE: To use the program, you need a whitespace-delimited property-value text file that specifies one program option per line. The options the program accepts are listed below:

Mode
Program execution mode. Four modes are currently allowed: Normal, ConsensusOnly, MarkovOnly, and ComputeOnly. Normal mode reads SNP data from a ConsensusInputFile, automatically identifies subpopulations, and infers their history. ConsensusOnly mode reads a ConsensusInputFile and performs only the automatic identification of subpopulations from the SNP dataset. MarkovOnly mode reads a specialized MCMCInputFile consisting of model bipartitions and their associated weights, and performs the history inference. ComputeOnly mode reads SNP data from a ConsensusInputFile and a set of model bipartitions from a ModelPartitionsInputFile, then uses the SNP data to compute the weight associated with each model bipartition. (Default: Normal)
ConsensusInputFile
Required for ConsensusOnly/Normal
Location of the genetic variation data. The program assumes that the input is a space-delimited bi-allelic variation dataset in which 0 represents one allele and 1 represents the other. (See examples/example-0.6-0.05-0.2.hap for an example.)
MCMCInputFile
Required for MarkovOnly
Location of the input file used for running MarkovOnly mode. The file consists of two sections: Weights and Models. The Weights section begins with a line containing the word "Weights", followed by a line of weights associated with the k model bipartitions. Weights are separated by one or more spaces. The Models section begins with a line containing the word "Models", followed by k lines of model bipartitions. Each model bipartition line consists of 0s and 1s without any spaces.
ModelPartitionsInputFile
Required for ComputeOnly
Location of the input file used for running ComputeOnly mode. The file specifies the k model bipartitions whose weights the user wants to compute. Each line in the file represents one model bipartition, written as a string of 0s and 1s without any spaces.
OutputFile
(required)
Location of the file to which the program will write its output.
NumGenealogies
Number of genealogies, m, that the program assumes are sufficient to describe the entire sequence set. The default value is 30. Ideally, the number of genealogies should be at least as large as the number of recombinant sites.
NumEMIters
Number of simulated annealing/expectation-maximization iterations the program will run before returning the best-scoring consensus tree. The default value is 1000.
NumMCMCIters
Number of MCMC iterations the program will sample before returning the average expected parameters. The default is 20,000 iterations.
Penalty
Penalty score added to the tree score to penalize large, complicated consensus trees. The default is the number of samples. A large penalty steers the algorithm toward simple consensus trees with few subpopulations, while a small penalty favors trees with more subpopulations. A small penalty can lead to over-fitting on small datasets. In its current form, the program assumes there will be 3 model bipartitions (assuming a 3-population evolutionary model), so the penalty is not a critical factor in the current iteration.
PopSize*
Effective population size. If the effective population size, the mutation rate, and the sequence length are all specified, the program will use them to estimate the expected number of mutations. Otherwise, the program will incorporate the effective population size, mutation rate, and sequence length into the MCMC chain.
SeqLength*
Sequence length. If the effective population size, the mutation rate, and the sequence length are all specified, the program will use them to estimate the expected number of mutations. Otherwise, the program will incorporate the effective population size, mutation rate, and sequence length into the MCMC chain.
MutationRate*
Neutral mutation rate. If the effective population size, the mutation rate, and the sequence length are all specified, the program will use them to estimate the expected number of mutations. Otherwise, the program will incorporate the effective population size, mutation rate, and sequence length into the MCMC chain.

*These three parameters determine theta, which is used to compute the expected number of variant sites. By default, CLEAX samples theta. To fix theta instead, all three parameters must be specified.
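
The MCMCInputFile and ModelPartitionsInputFile layouts described in the option list can be illustrated with a small helper script. The following Python sketch is not part of CLEAX. The function names are hypothetical, the weights and bipartitions are made-up values chosen only to show the layout, and two points are assumptions rather than documented behavior: that the Weights section comes before the Models section, and that all bipartitions in a file have the same length.

```python
def check_bipartitions(models):
    """Validate that every bipartition is a non-empty string of 0s and 1s."""
    for model in models:
        if not model or set(model) - {"0", "1"}:
            raise ValueError(f"not a 0/1 bipartition string: {model!r}")
    # Assumption (not stated in the CLEAX description): all bipartitions
    # cover the same items, so their lengths must match.
    if len({len(model) for model in models}) > 1:
        raise ValueError("bipartitions differ in length")


def format_model_partitions(models):
    """Text for a ModelPartitionsInputFile: one bipartition per line."""
    check_bipartitions(models)
    return "".join(model + "\n" for model in models)


def format_mcmc_input(weights, models):
    """Text for an MCMCInputFile: a Weights section, then a Models section."""
    if len(weights) != len(models):
        raise ValueError("need exactly one weight per model bipartition")
    # Weights go on a single line, separated by spaces.
    weight_line = " ".join(str(w) for w in weights)
    return ("Weights\n" + weight_line + "\n"
            + "Models\n" + format_model_partitions(models))


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the layout.
    print(format_mcmc_input([12.5, 7.0, 3.25], ["0011", "0101", "0110"]), end="")
```

Writing the returned text to a file and pointing MCMCInputFile (or ModelPartitionsInputFile) at it is then a matter of a single open(...).write(...) call.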

To run the program, execute the following command:

./cleax path-to-property-file
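
As an illustration, a property file for a Normal-mode run might look like the sketch below. The option names, the example input path, and the default values are taken from the option descriptions on this page. The output file name is a placeholder, and the exact spelling the program expects for each value is an assumption, since only the general whitespace-delimited property-value layout is documented here.

```text
Mode Normal
ConsensusInputFile examples/example-0.6-0.05-0.2.hap
OutputFile example-output.txt
NumGenealogies 30
NumEMIters 1000
NumMCMCIters 20000
```

Because PopSize, SeqLength, and MutationRate are left out of this sketch, the program would sample theta rather than use a fixed value.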

Questions, comments, and bug reports may be sent to the authors at mingchit@andrew.cmu.edu or russells@andrew.cmu.edu. Please note, however, that development of this code is a research project which is aimed at creating theoretical methods for computational genomics, not at producing production quality code. This code is being released to allow others to review, experiment with, and improve upon these methods. The code is not suitable for mission critical work and should not be used as if it were. The code and all associated materials are provided as is, with no warranty of any kind, explicit or implicit, and no explicit or implicit promise of support.


Copyright (c) 2012 Russell Schwartz, Department of Biological Sciences, Carnegie Mellon University. Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.