Protocol

Input : A gene cluster with homologous gene pairs between two chromosomes. The gene cluster data is in tabular form.
  • Insulin-IGF1 region
  • IGF1R-IR region
Output : An MSA of homologs for each gene pair and an unrooted tree for each MSA

  • Step 1: Obtaining homologous sequences using BLAST

    For each homologous gene pair in the gene cluster, we want to find homologous sequences across a variety of species ranging from invertebrates to mammals. For example, in the Insulin-IGF1 cluster given in the table, each row corresponds to a homologous gene pair on human chromosomes 11 and 12. The location columns in the table are hyperlinked to the genes in the Ensembl database.

    Getting a query sequence :
    The protein sequence for each gene can be obtained from the Ensembl database, via the "protein information" link in the "Transcript structure" row of the gene's Ensembl page.
    BLASTing GenBank with the query sequence to get similar sequences :
    We can find protein sequences similar to the query sequence obtained from Ensembl using BLAST. BLAST is a heuristic tool that searches a sequence database for sequences similar to a query sequence. The search is run from the Protein-Protein BLAST page.

    We use default values for the BLAST search except for

  • limiting the search to specific species
  • the number of hits to be viewed

    We paste the query sequence or GI id into the "Search" box of the query form.

    We restrict the search to the sequences from the specified species by using the species list as

    Species1[Organism] OR Species2[Organism] OR Taxa1[Organism]

    in the "Limit by entrez query" field under "Options for advanced blasting".

    In general, we are mainly interested in these species: Caenorhabditis elegans (worm), Drosophila melanogaster (fly), Branchiostomidae (lancelets), Tunicata, Myxinidae (hagfishes), Danio rerio (zebrafish), Takifugu rubripes (pufferfish), Xenopus laevis (frog), Gallus gallus (chicken), Mus musculus (mouse) and Homo sapiens (human). Thus, for the BLAST search, we paste

    Caenorhabditis elegans[Organism] OR Drosophila melanogaster[Organism] OR Branchiostomidae[Organism] OR Tunicata[Organism] OR Myxinidae[Organism] OR Danio rerio[Organism] OR Takifugu rubripes[Organism] OR Xenopus laevis[Organism] OR Gallus gallus[Organism] OR Mus musculus[Organism] OR Homo sapiens[Organism]
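
    Typing this limit string by hand is error-prone, so it can help to generate it from a list of taxon names. The following is a minimal sketch, not part of the original protocol; the function name build_organism_query and the variable SPECIES are hypothetical names chosen for illustration.

```python
def build_organism_query(taxa):
    """Join taxon names into an Entrez limit string of the form
    'A[Organism] OR B[Organism] OR ...'."""
    # Strip stray whitespace and skip empty entries
    cleaned = [name.strip() for name in taxa if name.strip()]
    return " OR ".join(f"{name}[Organism]" for name in cleaned)


# Taxa listed in the protocol (hypothetical variable name)
SPECIES = [
    "Caenorhabditis elegans",
    "Drosophila melanogaster",
    "Branchiostomidae",
    "Tunicata",
    "Myxinidae",
    "Danio rerio",
    "Takifugu rubripes",
    "Xenopus laevis",
    "Gallus gallus",
    "Mus musculus",
    "Homo sapiens",
]

query = build_organism_query(SPECIES)
print(query)
```

    The printed string can be pasted into the Entrez limit field as is. Adding or removing a taxon then only requires editing the list.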

  • At the BLAST results page, we can click on "Taxonomy reports" to see more information about the taxa in the matches obtained. We need to look through the hits and choose the right sequences from the species list, using criteria such as complete, non-redundant sequences and a significant E-value, and save them in a file in FASTA format as follows.
  • Choose a sequence and click on its GI. This shows the sequence record in GenBank.
  • Choose FASTA format in the "display" option, then copy and paste the sequence into a separate file.
    Saving the sequences :
    Each gene cluster involves homologous gene pairs between two chromosomal regions. The files are organized as follows.
  • A directory for each gene cluster, containing
    • Two files containing the protein sequences of the gene matches in the two regions, in FASTA format
    • A subdirectory for each gene pair, containing
      • A file containing the orthologs for each protein sequence in the gene pair
      • A file containing the BLAST output for each query
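
    The layout described above can be created in one pass so that every gene cluster follows the same convention. This is only a sketch: the protocol does not prescribe any file names, so make_cluster_layout, region1.fasta, *_orthologs.fasta and *_blast.txt are all hypothetical naming choices.

```python
from pathlib import Path


def make_cluster_layout(root, cluster, gene_pairs):
    """Create the directory skeleton for one gene cluster.

    gene_pairs is a list of (gene_a, gene_b) name tuples.
    All file and directory names are hypothetical conventions.
    """
    cluster_dir = Path(root) / cluster
    cluster_dir.mkdir(parents=True, exist_ok=True)

    # Two FASTA files, one per chromosomal region
    for region in ("region1", "region2"):
        (cluster_dir / f"{region}.fasta").touch()

    # One subdirectory per homologous gene pair
    for gene_a, gene_b in gene_pairs:
        pair_dir = cluster_dir / f"{gene_a}_{gene_b}"
        pair_dir.mkdir(exist_ok=True)
        for gene in (gene_a, gene_b):
            # Orthologs and raw BLAST output for each query protein
            (pair_dir / f"{gene}_orthologs.fasta").touch()
            (pair_dir / f"{gene}_blast.txt").touch()

    return cluster_dir
```

    For example, make_cluster_layout(".", "Insulin-IGF1", [("geneA", "geneB")]) would create an Insulin-IGF1 directory with one subdirectory for the placeholder pair geneA and geneB. The empty files are then filled with the sequences and BLAST reports saved from the web pages.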

  • Step 2: MSA
  • Find the union of the protein sequence hits for each homologous pair and create a FASTA file containing this set of protein sequences.
  • Group the sequences from the same species together in the FASTA file.
  • Build an MSA for each gene family using T-Coffee (or ProbCons).
    Input : Protein sequences in FASTA format in a file
    Output : Multiple sequence alignments in ClustalW and MSF formats
  • Use MEME if necessary to identify sequence patterns. The script for running MEME can be generated by the program makseq in the PSC account.
  • Edit the alignment using GeneDoc. GeneDoc expects the MSA to be in *.msf format. GeneDoc is a Windows-based program, and the modified alignment can be saved as a *.msf file. One can use the output from MEME to curate the alignment in GeneDoc.
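
    The first two items of Step 2 (taking the union of the hits and grouping them by species) can be sketched in a few lines. This sketch makes two assumptions that the protocol does not state: the species name appears in square brackets at the end of each FASTA header, as in GenBank protein records, and two hits count as the same protein when their sequences are identical. The function names are hypothetical.

```python
import re


def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def species_of(header):
    """Return the trailing '[Genus species]' tag, or 'unknown' (assumed header style)."""
    match = re.search(r"\[([^\[\]]+)\]\s*$", header)
    return match.group(1) if match else "unknown"


def union_grouped_by_species(*fasta_texts):
    """Merge several hit sets, drop repeated sequences, and group by species."""
    seen, merged = set(), []
    for text in fasta_texts:
        for header, seq in read_fasta(text):
            if seq not in seen:  # assumption: identical sequence = same protein
                seen.add(seq)
                merged.append((header, seq))
    # list.sort is stable, so the original order is kept within each species
    merged.sort(key=lambda rec: species_of(rec[0]))
    return "".join(f">{h}\n{s}\n" for h, s in merged)
```

    The returned text can be written to the FASTA file that is given to the alignment program. Sequences whose header carries no species tag are collected under "unknown" and are worth checking by hand.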
  • Step 3: Tree
    Build a phylogenetic tree with the neighbor-joining (NJ) method using the tools in PHYLIP.
    Input : A curated MSA
    Output : Consensus unrooted NJ tree with bootstrapping
  • One can create a phylogenetic tree from the *.msf_aln file. It first needs to be converted to a *.phylip file using the readseq command (a web server for readseq is also available).
  • The *.phylip file uses "." dots as gaps, whereas the PHYLIP programs expect "-" dashes as gaps. Dots can be changed to dashes in the vi editor using :%s/\./-/g
  • Compute the pairwise distance matrix using the PHYLIP program protdist. Check that option P is set to the Dayhoff PAM matrix. Rename the output file "outfile" to *.dist
  • Build the neighbor-joining tree with the program neighbor, using the *.dist file as input.
  • In order to build a tree with bootstrap values, one needs to first run seqboot with the *.phylip file as input. Rename the output file "outfile".
  • Then use protdist and neighbor to build trees from the bootstrapped data sets. Finally, one can use consense to produce a consensus tree.
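
    The vi substitution in the gap-conversion step replaces every dot in the file, including any dot that occurs in a sequence name. The following sketch restricts the replacement to the sequence data. It assumes an interleaved PHYLIP file whose first line gives the number of taxa and whose names occupy the first 10 columns of the first block; the function name dots_to_dashes and the name_width parameter are hypothetical.

```python
def dots_to_dashes(phylip_text, name_width=10):
    """Replace '.' gap characters with '-' in the sequence data of an
    interleaved PHYLIP alignment, leaving the taxon names untouched."""
    lines = phylip_text.splitlines()
    if not lines:
        return phylip_text

    # Header line: "<number of taxa> <number of sites>"
    n_taxa = int(lines[0].split()[0])
    out = [lines[0]]

    named_left = n_taxa  # lines of the first block still carrying a name
    for line in lines[1:]:
        if line.strip() and named_left > 0:
            name, seq = line[:name_width], line[name_width:]
            out.append(name + seq.replace(".", "-"))
            named_left -= 1
        else:
            # Later interleaved blocks contain sequence data only
            out.append(line.replace(".", "-"))
    return "\n".join(out) + "\n"
```

    The converted text is then saved as the *.phylip file passed to protdist or seqboot. For a sequential (non-interleaved) file the name handling would need to be adapted.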
  • Special cases
    Predicted sequence with no information
    Can one use the alignments from HOVERGEN?

    Last modified : 12th Jun 2005