-----Intelligent Refiner for Multiple Sequence Alignment-----

INTRODUCTION

Multiple alignment, which is used for similarity analysis of proteins,
is an important technique for drawing the phylogenetic trees of
organisms and for predicting the function and structure of proteins.
By simultaneously aligning the sequences of similar proteins, we can
identify the regions of the sequences that may have important
functions when the sequences are folded into proteins. Also, by
simultaneously aligning the same protein (hemoglobin, for example)
from different species, we can analyze the similarity between the
sequences of those species and draw their phylogenetic trees.

Because biological expertise is necessary for multiple alignment,
biologists have until recently produced multiple alignments by hand.
However, as protein sequences are determined at an increasing rate,
the number of multiple alignments that biologists must handle has
also increased remarkably. Each multiple alignment has also become
more difficult, because both the number of sequences to be aligned
and their lengths have increased. This situation places a growing
burden on biologists. Computers are therefore now indispensable for
multiple alignment, and research has been carried out to facilitate
multiple alignment by computer.


REVIEW OF MULTIPLE ALIGNMENT

So far, various multiple alignment algorithms have been developed.
These algorithms try to optimize a computationally defined evaluation
function and produce a computationally optimal or semi-optimal
alignment.  The evaluation function is based on a similarity index
between amino acids. The Dayhoff score index [Dayhoff 1978] is one of
the similarity indices generally used.

Needleman and Wunsch [Needleman and Wunsch 1970] introduced Dynamic
Programming (DP) into sequence alignment. With its extension to N
sequences (N-way DP), the computationally optimal multiple alignment
can be produced in theory. However, the problem with N-way DP is its
computational time, which grows in the order of the N-th power of the
sequence length.  If the sequences are short, they can be aligned
within a manageable time by restricting the solution space [Carrillo
and Lipman 1988]. However, when applied to practical problems, the
program takes an extremely long time.
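The growth in computational time can be illustrated with a minimal
sketch of 2-way DP. The scoring values (match, mismatch, gap) are
placeholder assumptions, not the Dayhoff index, and the function names
nw_score and dp_cells are hypothetical. The table filled by 2-way DP
has (L+1)^2 cells; the analogous N-way table has (L+1)^N cells.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of two sequences by 2-way DP.

    The scoring values are illustrative assumptions only.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,        # gap in b
                score[i][j - 1] + gap,        # gap in a
            )
    return score[-1][-1]


def dp_cells(length, n):
    """Number of table cells for n-way DP on sequences of equal length."""
    return (length + 1) ** n
```

For sequences of length 100, dp_cells(100, 2) is about ten thousand
cells, while dp_cells(100, 3) already exceeds one million.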

To keep the computational time manageable, various multiple alignment
algorithms have been developed that produce a semi-optimal alignment
within a limited time. Among the algorithms built on 2-way DP are the
tree-based algorithms [Johnson and Dolittle 1986] [Barton 1990] and
the iterative improvement algorithms [Berger and Manson 1991]
[Ishikawa et al. 1992]. As an iterative improvement method that does
not use 2-way DP, alignment by simulated annealing has also been
developed [Ishikawa et al. 1991].


NEED FOR MULTIPLE ALIGNMENT SYSTEM WITH REFINER

Today, biologists use these programs to produce a temporary alignment.
However, they must then refine it themselves into a biologically
meaningful final alignment, because the alignments that these
programs produce are merely computationally optimal or semi-optimal,
not biologically optimal.

As stated before, the number of alignments that biologists must handle
is also increasing, and the difficulty of each alignment is
increasing. Therefore, biologists now feel the need for computer
assistance even for refining temporary alignments.  And with protein
sequence data accumulated in databases, researchers other than biology
experts now have the chance to make biological discoveries merely by
analyzing sequence data. Automatic refinement of alignments using
biological know-how and knowledge is also necessary for them.

As for the knowledge to be used in the refinement phase, the
heuristics that biologists rely on are rather ambiguous. However,
knowledge on the biologically meaningful portions of sequences has
been accumulated in databases, one of which is Prosite [Bairoch
1991], and further information has been published in various papers.

We have developed an alignment system composed of two modules, namely,
the aligner and the intelligent refiner [Hirosawa 1992]. The aligner
produces a computationally semi-optimal alignment. Then, the
intelligent refiner refines the product of the aligner to produce the
biologically optimal alignment.

To design an intelligent refiner, we interviewed experts on multiple
alignment.  We base the framework of the intelligent refiner on
analysis of the way they align sequences and the knowledge they use
for alignment.
 

BIOLOGICAL RELIABILITY OF COMPUTATIONALLY OPTIMAL ALIGNMENT

It is hard to define biologically optimal alignment, because this
depends on the sequences to be aligned. Here, we select some sequences
as an example, and we define biologically optimal alignment of the
sequences. Then, we investigate the biological reliability of the
computationally optimal alignment.

To define the biologically optimal alignment, we use six sequences
of a protein called endonuclease from retroviruses and their
relatives. The sequences are shown in Figure 1, and one of their
biologically optimal alignments is shown in Figure 2.

<Figure 1: Sequences to be aligned>
17.6   : ILDFHEKLLHPGIQKTTKLFGETYYFPNSQLLIQNIINECSICNLAKTEHRNTDMPTKTT
M-MuLV : LLDFLHQLTHLSFSKMKALLERSHSPYYMLNRDRTLKNITETCKACAQVNASKSAVKQGTR
HTLV   : LTDALLITPVLQLSPAELHSFTHCGQTALTLQGATTTEASNILRSCHACRGGNPQHQMPRGHI
RSV    : VADSQATFQAYPLREAKDLHTALHIGPRALSKACNISMQQAREVVQTCPHCNSAPALEAGVN
MMTV   : ISDPIHEATQAHTLHHLNAHTLRLLYKITREQARDIVKACKQCVVATPVPHLGVN
SMRV   : ILTALESAQESHALHHQNAAALRFQFHITREQAREIVKLCPNCPDWGSAPQLGVN

<Figure 2: One of the biologically optimal alignments>
ILD--F-------HEKLLHPGIQKTTK-LF--GET-YY-FPNSQLLIQNIINECSICNL-AKT-EHR--N-TDMPTKTT
LLD-FL-------HQ-LTHLSFSKM-KALLERSHSPYYMLNRDRTL-KNITETCKACAQ-VNA-SKS--A-VKQGTR--
ITP-VLQLSPAELHS-FTHCGQTAL-T-LQ--------GATTTEA--SNILRSCHACRG-GNPQHQMPRGHI-------
FQAYPLR-EAKDLHT-ALHIGPRAL-S-KA-------CNISMQQA--REVVQTCPHC---NSA-PALEAG-VN------
ISD-PIH-EATQAHT-LHHLNAHTL-R-LL-------YKITREQA--RDIVKACKQCVV-ATPVPHL--G-VN------
ILT-ALE-SAQESHA-LHHQNAAAL-R-FQ-------FHITREQA--REIVKLCPNCPDWGSA-PQL--G-VN------
             *    *                                  *  *

Columns whose elements are identical are marked by '*'. These columns
are called conserved columns. There are four conserved columns, two
of ``H'' and two of ``C''.  Some patterns of conserved columns are
known to have biological meaning and are called motifs. The motif
consisting of the conserved patterns ``HXXXH'' and ``CXXC'' is a
reverse-type motif of the zinc finger. ``X'' in the pattern means any
amino acid. Proteins with the zinc finger are known to have the
capability to bind to DNA/RNA sequences. We define the biologically
optimal alignments as the alignments that identify the zinc finger
pattern.
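The marking of conserved columns can be reproduced with a short
sketch. The rows are copied from Figure 2; the function names
conserved_columns and marker_line are hypothetical, and treating '-'
as the gap character is an assumption taken from the figures.

```python
# Rows of the alignment in Figure 2, top to bottom.
FIGURE_2 = [
    "ILD--F-------HEKLLHPGIQKTTK-LF--GET-YY-FPNSQLLIQNIINECSICNL-AKT-EHR--N-TDMPTKTT",
    "LLD-FL-------HQ-LTHLSFSKM-KALLERSHSPYYMLNRDRTL-KNITETCKACAQ-VNA-SKS--A-VKQGTR--",
    "ITP-VLQLSPAELHS-FTHCGQTAL-T-LQ--------GATTTEA--SNILRSCHACRG-GNPQHQMPRGHI-------",
    "FQAYPLR-EAKDLHT-ALHIGPRAL-S-KA-------CNISMQQA--REVVQTCPHC---NSA-PALEAG-VN------",
    "ISD-PIH-EATQAHT-LHHLNAHTL-R-LL-------YKITREQA--RDIVKACKQCVV-ATPVPHL--G-VN------",
    "ILT-ALE-SAQESHA-LHHQNAAAL-R-FQ-------FHITREQA--REIVKLCPNCPDWGSA-PQL--G-VN------",
]


def conserved_columns(alignment, gap="-"):
    """Return indices of columns in which every row has the same residue."""
    columns = []
    for index, column in enumerate(zip(*alignment)):
        residues = set(column)
        # Conserved: exactly one distinct symbol, and it is not a gap.
        if len(residues) == 1 and gap not in residues:
            columns.append(index)
    return columns


def marker_line(alignment):
    """Build the '*' line printed under the alignment."""
    marked = set(conserved_columns(alignment))
    width = len(alignment[0])
    line = "".join("*" if i in marked else " " for i in range(width))
    return line.rstrip()
```

Applied to FIGURE_2, the sketch marks the two ``H'' columns and the
two ``C'' columns, in agreement with the '*' line of Figure 2.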

The homology score (the index that measures the similarity between
sequences) of the sequences in the alignment is very low, at 26
percent. This indicates that these sequences are difficult to align.


We investigated the biological reliability of alignments produced by
3-way DP, because the computationally optimal alignment of three
sequences is obtained by 3-way DP.

We selected three sequences from the six sequences and aligned them
by 3-way DP. Since six sequences yield 20 distinct triplets, 20
alignments were produced.  Then, we investigated whether each
alignment corresponds to the biologically optimal alignment.
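The count of 20 triplets is the number of ways to choose 3 sequences
out of 6. A small sketch that enumerates them, using the sequence
names from Figure 1 (the variable names are hypothetical):

```python
from itertools import combinations

# Names of the six sequences in Figure 1.
NAMES = ["17.6", "M-MuLV", "HTLV", "RSV", "MMTV", "SMRV"]

# Every unordered choice of three distinct sequences; each one is an
# input to one run of 3-way DP.
TRIPLETS = list(combinations(NAMES, 3))
```

Each element of TRIPLETS is a tuple of three names, for example
("17.6", "M-MuLV", "HTLV").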

Of the twenty alignments, only six identified the zinc finger. This
indicates that a computationally optimal alignment does not
necessarily correspond to a biologically optimal alignment.  However,
biologists, and even computer scientists with biological knowledge,
can make a biologically optimal alignment by refining the
computationally optimal or near-optimal alignment.

We now show an example in which the biologically optimal alignment
can be made by refinement if biological knowledge is used. The
example alignment to be refined is produced by 3-way DP and is
computationally optimal.  The sequences to be aligned are among those
whose biologically optimal alignment is defined above, so the
biologically optimal alignment must identify the zinc finger pattern.

The example alignment (Figure 3) can be refined into a biologically
more optimal alignment if we have biological knowledge.  In the part
of the alignment that corresponds to ``HXXXH'', ``H'' is not properly
aligned but is mis-bridged. Here, mis-bridged means that the ``H''
corresponding to the first letter of ``HXXXH'' (in the third
sequence) is aligned with the ``H''s corresponding to the last letter
of ``HXXXH'' (in the first and second sequences).

<Figure 3: Example Alignment 2>
-------ILDF--HE-KLL-HPGIQKTTKLF--GET-YY-FPNSQLLIQNIINECSICNLAKTEHRNTDMPTKTT
-------LLDF-LHQ---LTHLSFSKMKALLERSHSPYYMLNRDRTL-KNITETCKACAQVNASKSAVKQGTR--
VADSQATFQAYPLREAKDL-HTALHIGPRAL--SKA-CN-ISMQQA--REVVQTCPHCNSAPALEAGVN------
(Evaluation value = 161)

If we keep the mis-bridged ``H'' column fixed while refining the
alignment, we cannot identify the zinc finger. However, if we have
experience in aligning the sequences of proteins that contain the
zinc finger, we can suspect a mis-bridged ``H'' and therefore refine
the alignment into one in which the zinc finger is identified
(Figure 4).

<Figure 4: Biologically more optimal alignment of Example 2>
-------ILDF--------HEKLLHPGIQKTTKLF--GET-YY-FPNSQLLIQNIINECSICNLAKTEHRNTDMPTKTT
-------LLDF-------LHQ-LTHLSFSKMKALLERSHSPYYMLNRDRTL-KNITETCKACAQVNASKSAVKQGTR--
VADSQATFQAYPLREAKDLHT-ALHIGPRAL--SKA-----CN-ISMQQA--REVVQTCPHCNSAPALEAGVN------
(Evaluation value = 156)

The above example indicates that we can infer a biologically optimal
alignment from a computationally optimal alignment.  It also shows
that if we use only the evaluation value as a measure of the
alignment, we cannot make the biologically optimal alignment (in the
example, the evaluation value is reduced from 161 to 156 when we
refine the alignment).


MULTIPLE ALIGNMENT SYSTEM WITH INTELLIGENT REFINER

By analyzing the alignment produced by the aligner and by consulting
biological knowledge, the intelligent refiner can roughly determine
where the conserved column regions are and where further conserved
column regions may be found in the alignment. The intelligent refiner
modifies the alignment in order to increase the biologically
meaningful conserved column regions. The alignment modified in one
cycle becomes the input to the next.  In each iteration, the
intelligent refiner can determine more precisely where the conserved
regions are. Thus, it gradually identifies the conserved column
regions and produces a biologically near-optimal alignment.

We designed the intelligent refiner by analyzing the knowledge used by
experts on multiple alignment.  The intelligent refiner program is
written in Prolog and KL1. The structure and function of the
intelligent refiner are explained below.

The intelligent refiner is composed of a control module, a refinement
rule base and a biological knowledge base.  The control module
iteratively modifies the alignment by calling refinement rules in the
refinement rule base to produce a biologically optimal alignment.
Refinement rules consult the biological knowledge base when
necessary.
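The iterative behavior of the control module can be pictured with the
following sketch. It is not the actual Prolog/KL1 control module: the
function refine, its parameters, and the stopping conditions are
assumptions made for illustration. Each rule is modeled as a function
that returns a modified alignment, or None when it does not fire, and
evaluate stands in for an evaluation routine that returns a number.

```python
def refine(alignment, rules, evaluate, max_cycles=10):
    """Iteratively apply refinement rules (illustrative sketch only).

    rules:    list of functions, alignment -> new alignment or None
    evaluate: function, alignment -> numeric score (higher is better)
    """
    current = alignment
    for _ in range(max_cycles):
        # Collect the temporary alignments proposed in this cycle.
        candidates = []
        for rule in rules:
            proposal = rule(current)
            if proposal is not None and proposal != current:
                candidates.append(proposal)
        if not candidates:
            break                      # no rule fired
        best = max(candidates, key=evaluate)
        if evaluate(best) <= evaluate(current):
            break                      # no proposal improves the alignment
        current = best                 # input to the next cycle
    return current
```

The loop stops when no rule fires, when no proposal scores higher than
the current alignment, or after max_cycles cycles.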


BIOLOGICAL KNOWLEDGE BASE

In our system, biological knowledge is written in Prolog syntax.  The
biological knowledge base contains the biological knowledge we
extracted from Prosite [Bairoch 1991] and other sources.  Prosite is
a database in which knowledge on motifs and related knowledge is
written in natural language.

Besides the knowledge we store, biologists can easily input their own
knowledge into the biological knowledge base, because Prolog has a
syntax similar to natural language.

<Figure 5: biological knowledge>
 motif(name, zinc_finger(reverse),
	``H-X(3,5)-H-X(10,25)-C-X(3,5)-C'').                        (1)
 motif(protein, kinase,``[LIV]-G-X-G-[FY]-[SG]-X-[LIV]'').          (2)
 motif(protein, kinase(tyrosine),
	``[LIVMFYC]-X-[HY]-X-D-[LIVMFY]-K-X(2)-N-[LIVMFC](3)'').    (3)
 motif(protein, kinase(serine,threonine),
	``[LIVMFYC]-X-[HY]-X-D-[LIVMFY]-[RSTA]-X(2)-N-[LIVMFC](3)''). (4)
 upper_concept(kinase(serine,threonine), kinase).                   (5)
 upper_concept(kinase(tyrosine), kinase).                           (6)
 motif(protein,Protein,Motif)  :-
	upper_concept(Protein,UpperProtein),
	motif(protein,UpperProtein,Motif).                          (7)

We show a portion of the biological knowledge in Figure 5. The syntax
and meanings are explained below. "motif" is a predicate whose third
argument is a motif and whose first two arguments give its meaning.
The expression of the motif is similar to that employed in Prosite.
When the first argument is "name", the second argument is the name of
the motif, for example, zinc finger [(1)].  When the first argument
is "protein", the second argument is the name of the protein that
contains the motif, for example, kinase [(2),(3),(4)].  The predicate
"upper_concept" expresses the hierarchical relationship between
proteins. For example, serine/threonine kinase is one class of kinase
[(5)], and tyrosine kinase is another class of kinase [(6)].
Clause (7) is an inference rule: a protein also has every motif of
its upper concept.
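To make the motif notation concrete, the following sketch translates
a Prosite-like pattern into a regular expression. It handles only the
elements that appear in Figure 5 (a single letter, ``X'', a bracketed
set, and an optional repeat count in parentheses); the function name
pattern_to_regex is hypothetical and the sketch is not part of the
system described here.

```python
import re

# One pattern element: X, a bracketed set such as [LIV], or one letter,
# optionally followed by a repeat count such as (3) or (3,5).
_ELEMENT = re.compile(r"^(X|\[[A-Z]+\]|[A-Z])(?:\((\d+)(?:,(\d+))?\))?$")


def pattern_to_regex(pattern):
    """Translate e.g. 'H-X(3,5)-H' into a compiled regex 'H.{3,5}H'."""
    parts = []
    for element in pattern.split("-"):
        found = _ELEMENT.match(element)
        if found is None:
            raise ValueError("unsupported pattern element: %r" % element)
        residue, low, high = found.groups()
        body = "." if residue == "X" else residue   # X means any amino acid
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{%s}" % low
        else:
            repeat = "{%s,%s}" % (low, high)
        parts.append(body + repeat)
    return re.compile("".join(parts))


ZINC_FINGER = pattern_to_regex("H-X(3,5)-H-X(10,25)-C-X(3,5)-C")

# A synthetic sequence (not real data): H at positions 2 and 6,
# C at positions 22 and 26, filler 'A' elsewhere.
SYNTHETIC = "AAHAAAH" + "A" * 15 + "CAAACAAA"
```

ZINC_FINGER.search(SYNTHETIC) then finds the motif between the first
``H'' and the last ``C'' of the synthetic sequence.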


EXAMPLE APPLICATION OF REFINEMENT RULES

The rules are expressed in IF-THEN form. Five representative rules
are shown in Figure 6 and are explained using examples later.  The
rules call several routines, but since their functions are clear
from the context, we do not explain the routines further.

<Figure 6: Representative rules in refinement rule base>
[Rule 1]
[IF] A half conserved column c_{i},
     in which more than the specified percentage (e.g. 80%)
     of the elements are identical amino acids (x_{s}),
     is found. AND
     In a sequence that doesn't have x_{s} in the column,
     x_{s} is found within the specified distance from c_{i}
     (checked by search routine 1).
[THEN] The modified alignment is produced
     under the constraint that the found x_{s} is aligned
     in the column c_{i} (done by modification routine).
     When several alternatives are generated, the modified
     alignment whose evaluation value is the highest is selected.
[Rule 2]
[IF] Sequences in the alignment are grouped into two groups,
     g_{i} and g_{j}, according to similarities between
     the sequences (checked by grouping routine). AND
     Patterns of conserved columns (p_{i} and p_{j}) in each
     group of alignment (a_{i} and a_{j}) are found. AND
     A common sub-pattern p_{ij} is found in both patterns
     (p_{i} and p_{j}) (done by search routine 2).
[THEN] a_{i} and a_{j} are aligned under the constraint that
     p_{ij} in a_{i} and p_{ij} in a_{j} are aligned
     (done by modification routine).
[Rule 3] 
[IF] A discovered conserved column pattern "p" contains some part
     of a motif m_{i} stored in the biological knowledge base
     (checked by motif_check routine).
[THEN] The motif_finding routine is called to identify
     the other part of the motif.
[Rule 4]
[IF] A motif (that the motif_finding routine tries to identify)
     in the biological knowledge base has two amino acids,
     x_{i} and x_{j}, of the same kind of amino acid x. AND
     The motif has no other conserved amino acids between
     x_{i} and x_{j}. AND
     There are a half conserved column c_{i} of x and
     a conserved column c_{j} of x in the alignment
     (checked by motif_finding routine).
[THEN] The modification routine is called to produce an alignment
     under the constraint that x_{j,t} (belonging to the
     sequence s_{t} that doesn't have x in the half conserved
     column c_{i}) is aligned in the half conserved column c_{i}.
[Rule 5] 
[IF] A motif (m_{i}) is identified in the alignment. AND
     The protein that has the motif (m_{i}) has another motif
     m_{j} (checked by knowledge_consulting routine).
[THEN] The motif_finding routine is called to identify the motif
     m_{j}.

The five refinement rules are explained using examples. Figure 7 is
an example of the application of Rule 1.  A half conserved column of
``A'' (with an exception in the first sequence) is found (``_'' in
Figure 7 signifies any character), and search routine 1 finds an
``A'' in the first sequence in the neighborhood of the half conserved
column. Then, Rule 1 is fired. The modification routine is called to
produce a conserved column of ``A'' (a computationally optimal or
semi-optimal alignment is produced under the constraint that all
``A''s are aligned in one column).

<Figure 7: Example application of Rule 1>
  ________A___________________           ______________A_____________
  ______________A_____________           ______________A_____________
  ______________A_____________     =>    ______________A_____________
  ______________A_____________           ______________A_____________
  ______________A_____________           ______________A_____________
  ______________A_____________           ______________A_____________
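The IF part of Rule 1 can be sketched as follows, using the left-hand
alignment of Figure 7 as input. The function names, the default
threshold of 0.8 and the handling of ``_'' and ``-'' as symbols to
ignore are assumptions for illustration; the sketch only detects the
situation and does not perform the constrained re-alignment done by
the modification routine.

```python
from collections import Counter


def half_conserved_columns(alignment, threshold=0.8, ignore="-_"):
    """Find columns that are mostly, but not entirely, one residue.

    Returns a list of (column index, residue, rows lacking the residue).
    """
    found = []
    n_rows = len(alignment)
    for index, column in enumerate(zip(*alignment)):
        counts = Counter(c for c in column if c not in ignore)
        if not counts:
            continue
        residue, count = counts.most_common(1)[0]
        # "More than the specified percentage", but not fully conserved.
        if threshold < count / n_rows < 1.0:
            lacking = [r for r, c in enumerate(column) if c != residue]
            found.append((index, residue, lacking))
    return found


def nearby_candidates(row, residue, column, max_distance):
    """Positions of `residue` in `row` within max_distance of `column`."""
    low = max(0, column - max_distance)
    high = min(len(row), column + max_distance + 1)
    return [j for j in range(low, high) if row[j] == residue and j != column]


# Left-hand side of Figure 7: 'A' at column 8 in the first row and at
# column 14 in the other five rows.
FIGURE_7_LEFT = ["_" * 8 + "A" + "_" * 19] + ["_" * 14 + "A" + "_" * 13] * 5
```

For FIGURE_7_LEFT, column 14 is reported as half conserved with the
first sequence as the exception, and the ``A'' at column 8 of that
sequence is found when the allowed distance is at least 6.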

Figure 8 shows an example of the application of Rule 2.  Here, the
sequences are decomposed into two groups, the first three sequences
and the last three sequences, according to similarities between the
sequences (checked by grouping routine).

<Figure 8: Example application of Rule 2>
  ____________AP_____________               ____________AP_____________
  ____________AP_____________               ____________AP_____________
  ____________AP_____________               ____________AP_____________
  _______________AP__________     =>        ____________AP_____________
  _______________AP__________               ____________AP_____________
  _______________AP__________               ____________AP_____________

Search routine 2 finds that the conserved column pattern ``AP''
appears in both groups. Then, Rule 2 is fired.  The modification
routine is called to produce a conserved column pattern of ``AP''
that extends across all sequences.

The temporary alignment produced by Rule 1 or Rule 2 is sent to the
evaluation routine together with the current alignment (the most
biologically optimal alignment at the time).  The routine evaluates
both alignments by consulting the biological knowledge base and
selects the new current alignment.

An example application of Rule 3 and Rule 4 is explained using
Figure 9.  The motif_check routine identifies that the conserved
column pattern ``CXXXC'' is the latter part of the zinc finger
(reverse type) motif ``H-X(3,5)-H-X(10,25)-C-X(3,5)-C'' by consulting
knowledge (1) in the biological knowledge base. Then, Rule 3 is
fired. The motif_finding routine finds the half conserved pattern
``HXXXH'', in which the latter column of ``H'' is half conserved.
Then, Rule 4 is fired. The modification routine makes an alignment
that has the conserved column pattern ``H-X(3)-H-X(15)-C-X(3)-C''.

<Figure 9: Example application of Rule 3 and Rule 4>
  __H___H_______________C___C___            __H___H_______________C___C___
  ______H___H___________C___C___            __H___H_______________C___C___
  ______H___H___________C___C___    =>      __H___H_______________C___C___
  ______H___H___________C___C___            __H___H_______________C___C___
  ______H___H___________C___C___            __H___H_______________C___C___
  ______H___H___________C___C___            __H___H_______________C___C___

The example application of Rule 5 is explained using sequences
belonging to a protein called tyrosine kinase. When the rule is
applied, knowledge (2), (3), (6) and (7) in the biological knowledge
base is consulted.

If the motif ``[LIVMFYC]-X-[HY]-X-D-[LIVMFY]-K-X(2)-N-[LIVMFC](3)'' is
identified in the alignment, the knowledge_consulting routine finds
that it is the motif of tyrosine kinase (knowledge (3) in the
biological knowledge base). Then, other motifs belonging to tyrosine
kinase are sought in the biological knowledge base. Here, no such
motif is stored explicitly. However, using the inference rule
(knowledge (7)), the knowledge that kinase is the upper concept of
tyrosine kinase (knowledge (6)) and the knowledge on the motif of
kinase (knowledge (2)), it is derived that tyrosine kinase also has
the motif ``[LIV]-G-X-G-[FY]-[SG]-X-[LIV]''. Then, Rule 5 is fired,
and the motif_finding routine is called to identify
``[LIV]-G-X-G-[FY]-[SG]-X-[LIV]''.
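The derivation above can be traced with a small sketch that mirrors
knowledge (2), (3), (5), (6) and (7) of Figure 5 in Python
dictionaries. The names MOTIFS, UPPER_CONCEPT, motifs_of and
other_motifs are hypothetical; the actual knowledge base is written
in Prolog, and this is only an illustration of the inference step.

```python
KINASE = "[LIV]-G-X-G-[FY]-[SG]-X-[LIV]"
TYROSINE = "[LIVMFYC]-X-[HY]-X-D-[LIVMFY]-K-X(2)-N-[LIVMFC](3)"

# Knowledge (2) and (3): motifs stored explicitly for each protein.
MOTIFS = {
    "kinase": [KINASE],
    "kinase(tyrosine)": [TYROSINE],
}

# Knowledge (5) and (6): protein -> its upper concept.
UPPER_CONCEPT = {
    "kinase(serine,threonine)": "kinase",
    "kinase(tyrosine)": "kinase",
}


def motifs_of(protein):
    """Explicit motifs plus those inherited by rule (7)."""
    found = list(MOTIFS.get(protein, []))
    parent = UPPER_CONCEPT.get(protein)
    if parent is not None:
        found.extend(motifs_of(parent))   # inherit from the upper concept
    return found


def other_motifs(identified):
    """Motifs still to be sought once `identified` has been found."""
    others = []
    for protein, stored in MOTIFS.items():
        if identified in stored:
            for motif in motifs_of(protein):
                if motif != identified and motif not in others:
                    others.append(motif)
    return others
```

Calling other_motifs(TYROSINE) yields the kinase motif, which is the
motif that Rule 5 passes to the motif_finding routine in the example.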


REFERENCES

[Bacon and Anderson 1986]
D.J Bacon, W.F.Anderson. J.Mol.Biol., 191, 153-161 (1986) 

[Bairoch 1991]
Bairoch,A. Prosite: A dictionary of protein site and pattern:
 User manual Release 7.00, May 1991.

[Berger and Manson 1991] 
Berger,M. and Manson,P. A novel randomized iterative strategy for 
aligning multiple protein sequences. Computer Application in the
Biosciences, 7, 1991. pp.479-484.

[Barton 1990]
Barton,J.G. Protein Multiple Alignment and Flexible Pattern Matching. 
in Methods in Enzymology Vol.183, Academic Press, 626-645.

[Butler  et al. 1990]
Butler,R., Butler,T., Foster,I., Karonis,N., Olson,R., Overbeek,R.,
Pfluger,N., Price,M. and Tuecke,S. Aligning Genetic Sequences.
in Foster,I. and Taylor,S. Strand -- New concept in parallel
programming. Prentice Hall.

[Carrillo and Lipman 1988]
H. Carrillo and D. Lipman.
The multiple sequence alignment problem in biology.
SIAM J. Appl. Math., 48, 1988, pp.1073--1082.

[Dayhoff,O. et al. 1978]
Dayhoff,M.O., Schwartz,R.M. and Orcutt,B.C.
A model of evolutionary change in proteins. In Dayhoff,M.O.(ed),
Atlas of Protein Sequence and Structure Vol.5, Suppl.3,
Nat. Biomed. Res. Found., Washington, D. C., 363--373.

[Hirosawa et al. 1991]
Hirosawa,M., Hoshida,M., Ishikawa,M. and Toya,T. 
Multiple Alignment System for Protein Sequences employing 
3-dimensional Dynamic Programming. 
Genome Informatics Workshop II, (in Japanese).

[Hirosawa et al. 1992]
Hirosawa,M.,  Ishikawa,M. Hoshida,M.
Protein Multiple Sequence Alignment System using Knowledge
ICOT TR 793, 1992.

[Ishikawa et al. 1991]
Ishikawa,M., Toya,T., Hoshida,M., Nitta,K., Ogiwara,A. and Kanehisa,M.
Multiple Alignment by Parallel Simulated Annealing.
Genome Informatics Workshop II,  (in Japanese).

[Ishikawa et al. 1992]
Ishikawa,M.,  Hoshida,M.,  Hirosawa,M., Toya,T. and Nitta,K. (1992)
Protein Sequence Analysis by Parallel Inference Machine.
Proc. Int. Conf. on Fifth Generation Computer Systems 1992.

[Johnson and Dolittle 1986]
M. S. Johnson and R. F. Doolittle.
A method for the simultaneous alignment of three or more
amino acids sequences.
J. of Mol. Evol., 23, 1986, pp.267--278.

[Murata 1985]
Murata,M.  Simultaneous comparison of three protein sequences.
Proc. Natl. Acad. Sci. USA Vol.32, 1985, pp.3073-3077.

[Needleman and Wunsch  1970]
Needleman,S.B. and Wunsch,C.D. (1970) A General Method Applicable to 
the Search for Similarities in the Amino Acid Sequences of Two Proteins.
J. of Mol. Biol., 48, 443-453.


APPENDIX

A user interface tool is appended to this intelligent refiner. The
interface tool helps to extract biological knowledge from the
alignment process carried out by a human expert. It consists of a
mouse-based manual editor and a constraint-oriented partial aligner.
The editor, written in C with Xlib and OSF/Motif, runs on the X
Window System with Motif. The partial aligner, written in KL1 on PIM,
is triggered by the editor through a Unix socket. When a partial
alignment problem has more than 63 sequences, it is preferable to use
PIM/m with 256 PEs.


