Genome Analysis Program:
	Multiple Sequence Alignment by 3-Dimensional DP-matching

1. What is multiple sequence alignment?

Biologists often align DNA and protein sequences to determine how
similar they are. DNA is a chain of four kinds of nucleotides, and a
protein is a chain of twenty kinds of amino acids, translated from a
chain of nucleotides. Strong similarity between sequences may reflect
a common evolutionary origin, and such sequences may have almost the
same function.

Figure 1 shows a typical multiple sequence alignment. Twelve fragments
of enzyme proteins are aligned. Each letter stands for an amino acid:
D is aspartic acid, R is arginine, H is histidine, and P is proline. A
good alignment has the same or similar amino acids in each column. To
improve an alignment, each sequence is shifted, or gaps (dash
characters) are inserted into it.

2. Dynamic programming on sequence matching

Dynamic programming (DP) is a basic method for finding an optimal
alignment. It can be regarded as a best-path search in an
N-dimensional network. For example, given the two sequences ADHE and
AHIE, we form a 2-dimensional network of 25 nodes (a 5-by-5 grid)
connected by arrows. A cost is assigned to each arrow. We search for
the path from the top left node to the bottom right node that
minimizes the total cost of its arrows. In this case, the set of
arrows that connect the white circle nodes is the best path. This best
path corresponds to the optimal alignment ADH-E and A-HIE.
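
The best-path search described above can be sketched in a few lines of
Python. This is only a minimal illustration, not the program used in
the system: the function name align and the costs (0 for identical
letters, 2 for a mismatch, 1 for a gap) are assumptions chosen so that
the example reproduces the alignment given in the text.

```python
def align(a, b, gap=1, mismatch=2):
    """Return (cost, aligned_a, aligned_b) for a minimum-cost global alignment.

    The costs are illustrative assumptions, not values from the source.
    """
    n, m = len(a), len(b)
    # cost[i][j] = cheapest path from the top-left node to node (i, j)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub,  # diagonal arrow: pair two letters
                cost[i - 1][j] + gap,      # vertical arrow: gap in b
                cost[i][j - 1] + gap,      # horizontal arrow: gap in a
            )
    # Trace the best path back from the bottom-right node.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            if cost[i][j] == cost[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return cost[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


print(align("ADHE", "AHIE"))  # (2, 'ADH-E', 'A-HIE')
```

With these assumed costs the ungapped alignment costs 4 (two
mismatches), so the two-gap path with cost 2 is the best one.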

The costs on arrows should reflect the similarity between the compared
characters. For protein sequence alignment, Dayhoff's odds matrix
[Dayhoff 78] is the most popular source of costs. The matrix was
derived by statistical analysis of the mutation probabilities of
amino acids.
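
One common way to turn a similarity matrix into arrow costs is to
subtract each score from the largest score, so that the most similar
pair costs nothing. The sketch below assumes that convention; the text
does not say how the system derives its costs. The table SCORES and
the function arrow_cost are hypothetical, and the numbers are
placeholders, not entries of Dayhoff's matrix.

```python
# Hypothetical similarity scores (higher = more similar).
# Placeholder numbers for illustration only, NOT Dayhoff's values.
SCORES = {
    ("D", "D"): 4,
    ("H", "H"): 6,
    ("D", "H"): 1,
}


def arrow_cost(x, y, scores=SCORES):
    """Cost of a diagonal arrow pairing x with y: best score minus this score."""
    best = max(scores.values())
    # The table stores each unordered pair once, so try both orders.
    s = scores.get((x, y), scores.get((y, x)))
    if s is None:
        raise KeyError(f"no score for pair {x}{y}")
    return best - s
```

Every cost produced this way is non-negative, which keeps the
minimum-cost path search well defined.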

Though DP-matching yields an optimal alignment, it takes a lot of
calculation time. DP-matching in more than three dimensions is too
time-consuming for practical alignment. When several sequences need to
be aligned, DP-matching has therefore been used for partial matching.
For instance, we can produce all pairwise alignments of the given
sequences with 2-dimensional DP, then merge the alignments one by one.
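
The growth in calculation time can be made concrete by counting nodes
and arrows. As a rough estimate, assume N sequences that all have the
same length L (a simplification introduced here, not stated in the
text):

```latex
\[
  \text{nodes} = (L+1)^N, \qquad
  \text{incoming arrows per node} \le 2^N - 1,
\]
\[
  N = 2,\ L = 4:\ (4+1)^2 = 25 \text{ nodes}, \qquad
  N = 3,\ L = 4:\ (4+1)^3 = 125 \text{ nodes}.
\]
```

The first case matches the 25-node network for ADHE and AHIE; each
added sequence multiplies the node count by L+1 and roughly doubles
the number of arrows entering each node.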

3. Parallel pipeline processing of 3-dimensional DP

If 3-dimensional DP can be executed rapidly, it is useful for partial
matching because it tolerates noise better than 2-dimensional DP does. 
We have implemented 3-dimensional DP on the parallel machine Multi-PSI
and improved the speed of three-sequence matching.
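
For readers unfamiliar with 3-dimensional DP, the following sequential
Python sketch shows the recurrence. It is not the KL1 implementation:
the function names, the sum-of-pairs column cost, and the cost values
(mismatch 2, gap 1) are assumptions made for illustration. Each node
has up to seven incoming arrows, one for every non-empty choice of
which sequences advance.

```python
from itertools import product


def pair_cost(x, y, gap=1, mismatch=2):
    """Cost of x and y sharing a column ('-' is a gap). Assumed values."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return 0 if x == y else mismatch


def align3_cost(a, b, c):
    """Minimum sum-of-pairs cost of aligning three sequences by 3-D DP."""
    la, lb, lc = len(a), len(b), len(c)
    inf = float("inf")
    cost = [[[inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    cost[0][0][0] = 0
    # The seven arrows into a node: advance in at least one sequence.
    moves = [m for m in product((0, 1), repeat=3) if any(m)]
    # Lexicographic order guarantees predecessors are computed first.
    for i, j, k in product(range(la + 1), range(lb + 1), range(lc + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            x = a[pi] if di else "-"
            y = b[pj] if dj else "-"
            z = c[pk] if dk else "-"
            step = pair_cost(x, y) + pair_cost(x, z) + pair_cost(y, z)
            cost[i][j][k] = min(cost[i][j][k], cost[pi][pj][pk] + step)
    return cost[la][lb][lc]


print(align3_cost("ADHE", "AHIE", "ADHE"))  # 4
```

The sketch returns only the cost; recovering the alignment itself would
need a traceback like the one used in 2-dimensional DP.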

Our system constructs a 3-dimensional prism network from KL1 processes.
The prism network is divided into 64 subprisms of equal volume, which
are mapped to 64 processing elements (PEs). KL1 is well suited to
constructing such mesh-like process networks, and the network can
easily be used as a data-flow pipeline.
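
The text does not say how the 64 subprisms are shaped or numbered. As
one possible reading, the sketch below assumes a 4 x 4 x 4 block split
with PEs numbered in row-major order, so that the start node falls on
PE0 and the end node on PE63. The names pe_of_node and wavefront_of_pe
are hypothetical.

```python
def pe_of_node(i, j, k, dims, blocks=(4, 4, 4)):
    """Map DP node (i, j, k) to a PE number under an ASSUMED 4x4x4 split.

    dims is the number of nodes along each axis (sequence length + 1).
    Blocks have equal volume when each entry of dims is a multiple of 4.
    """
    bi, bj, bk = (p * nb // d for p, d, nb in zip((i, j, k), dims, blocks))
    return (bi * blocks[1] + bj) * blocks[2] + bk


def wavefront_of_pe(pe, blocks=(4, 4, 4)):
    """Earliest pipeline step at which a PE can start work.

    A block depends only on neighbours with smaller block indices, so it
    can start at step bi + bj + bk (again assuming the split above).
    """
    bi, rest = divmod(pe, blocks[1] * blocks[2])
    bj, bk = divmod(rest, blocks[2])
    return bi + bj + bk


dims = (8, 8, 8)  # e.g. three sequences of length 7
print(pe_of_node(0, 0, 0, dims), pe_of_node(7, 7, 7, dims))  # 0 63
```

Under this assumption a single problem needs ten steps to travel from
PE0 to PE63, and a new problem can enter PE0 as soon as the previous
one has moved on, which is what makes the network a pipeline.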

If many different combinations of three-sequence alignments are
available, we expect that whole sequences can be merged adequately
into a multiple alignment. This system provides optimal three-sequence
alignments by parallel pipeline processing.


4. Demonstration

The demonstration system solves three-sequence alignment problems
continuously by parallel pipeline processing. After several sets of
initial alignment data are fed to PE0, their optimal alignments come
out of PE63 and are displayed at short intervals. During processing,
the performance meter window clearly shows several wavefronts packed
together and propagating from PE0 to PE63.


5. References

\bibitem[Dayhoff\hspace{0.5em} 78]{PAM}
 Dayhoff, Hunt and Hurst-Calderone {\em ``Composition of Proteins''} in 
{\em Atlas of Protein Sequence and Structure 5:3,}
Nat. Biomed. Res. Found., Washington, D. C., 1978, pp.363--373.

\bibitem[Needleman\hspace{0.5em} 70]{Needleman}
 Needleman, S. B. and Wunsch, C. D. {\em ``A General Method Applicable to
the Search for Similarities in the Amino Acid Sequences of Two Proteins'',}
in {\em Journal of Molecular Biology 48,} 1970, pp.443--453.

\bibitem[Waterman\hspace{0.5em} 86]{Waterman}
 Waterman, M. S. {\em ``Multiple Sequence Alignment by Consensus''} in  
{\em Nucleic Acids Research 14:22,} 1986, pp.9095--9102.

\bibitem[Murata 1\hspace{0.5em} 85]{murata1} 
 Murata, M. {\em ``Simultaneous Comparison of Three Protein Sequences''} in
 {\em Proc. Natl. Acad. Sci. USA 82,} 1985, pp.3073--3077.

\bibitem[Murata 2\hspace{0.5em} 90]{murata2}
 Murata, M. {\em ``Three-Way Needleman-Wunsch Algorithm''} in
 {\em Methods in Enzymology 183,} Academic Press, 1990, pp.365--375.

\bibitem[Carrillo\hspace{0.5em} 88]{carrillo} 
 Carrillo, H. and Lipman, D. {\em ``The Multiple Sequence Alignment Problem in Biology''} in
 {\em SIAM J. Appl. Math. 48,} 1988, pp.1073--1082.



