






             PROSITE: A DICTIONARY OF PROTEIN SITES AND PATTERNS


                                 USER MANUAL



                             Release 9.0, June 1992






  Amos Bairoch
  Medical Biochemistry Department
  Centre Medical Universitaire
  1211 Geneva 4
  Switzerland

  Telephone: (+41 22) 361 84 92
  Electronic mail address: bairoch@cmu.unige.ch
                        or bairoch@cgecmu51.bitnet







  This manual and the accompanying data bank may be copied and redistributed
  freely, without  advance  permission,  provided  that  this  statement  is
  reproduced with each copy.




















<PAGE>




                               INTRODUCTION

The use of protein sequence patterns (or motifs) to determine the
function(s) of proteins is rapidly becoming one of the essential tools of
sequence analysis. Many authors have recognized this, as illustrated by
the following quotations from two of the best-known experts in protein
sequence analysis, R.F. Doolittle and A.M. Lesk:

   "There are  many short  sequences  that  are  often  (but  not  always)
   diagnostics of certain binding properties or active sites. These can be
   set into a small subcollection and searched against your sequence (1)".

   "In some  cases, the structure and function of an unknown protein which
   is too  distantly related  to any  protein of known structure to detect
   its affinity  by overall  sequence alignment  may be  identified by its
   possession of  a particular  cluster of  residues types classified as a
   motifs. The  motifs, or  templates, or  fingerprints, arise  because of
   particular  requirements  of  binding  sites  that  impose  very  tight
   constraint on the evolution of portions of a protein sequence (2)."

PROSITE is a compilation of sites and patterns found in protein sequences.
Some of these patterns have been published in the literature, but the
majority have been developed, in the last two years, by the author.
Originally this dictionary was conceived as part of the author's doctoral
dissertation and as an integral part of the PROSITE program in the PC/Gene
sequence analysis software package. But, as many people have expressed
interest in this project, we have decided to make this work available on
computer media as well as in printed form.

     Citation

If you want to refer to PROSITE in a publication you can do so by citing:

   Bairoch A.
   PROSITE: a dictionary of sites and patterns in proteins.
   Nucleic Acids Res. 20:2013-2018(1992).

     Feedback

I welcome  any feedback.  If you find errors, omissions, or if you want to
suggest new  sites or  patterns to be added to this dictionary, please let
me know. You can contact me (by electronic mail preferably) at the address
listed on the cover page of this document.

____________________
1  Doolittle R.F.
   (In) Of  URFs and  ORFs: a  primer on how to analyze derived amino acid
   sequences., University Science Books, Mill Valley, California, (1986).
2  Lesk A.M.
   (In) Computational  Molecular Biology,  Lesk A.M., Ed., pp17-26, Oxford
   University Press, Oxford (1988).




<PAGE>


                              1) METHODOLOGY


1.1) Introduction

In this section we will explain how we selected or developed the signature
patterns described  in this  compilation. Our  first  and  most  important
criterion is  that a  good signature pattern must be as short as possible,
should detect  all or most of the sequences it is designed to describe and
should not  give too  many false  positive results. In other words it must
exhibit both high sensitivity and high specificity.


1.2) Patterns from the literature

A number of the patterns described in this dictionary have been published.
We tested those patterns on the SWISS-PROT data bank to see whether each
signature pattern was still specific to its group or family of proteins,
given the sequences added since the paper was published. If this was the
case we used the published pattern as such; otherwise we updated the
pattern using methods similar to those used to develop a new pattern,
which are described in the following sub-section.


1.3) Steps in the development of a new pattern

We generally start by studying review(s) on a group or family of proteins.
We build an alignment table of the proteins discussed in that review. If
necessary we add to this table newly published sequences relevant to the
subject under consideration. Using such alignment tables we pay
particular attention to the residues and regions thought or proved to be
important to the biological function of that group of proteins. These
biologically significant regions or residues are generally:

-  Enzyme catalytic sites.
-  Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin,
   etc.).
-  Amino acids involved in binding a metal ion.
-  Cysteines involved in disulfide bonds.
-  Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA,
   etc.) or another protein.

We then try to find a short (not more than four or five residues long)
conserved sequence which is part of a region known to be important or
which includes biologically significant residue(s). We call the pattern(s)
created at this stage the `core' pattern(s). The most recent version of
the SWISS-PROT Protein Sequence data bank is then scanned with these core
pattern(s). If a core pattern detects all the proteins under
consideration and none (or very few) of the other proteins, we can stop at
this stage and use the core pattern as a bona fide signature. In most
cases we are not so lucky and we pick up a lot of extra sequences which
clearly do not belong to the group of proteins under consideration. A
further series of scans, involving a gradual increase in the size of the





<PAGE>




pattern, is then necessary. In some cases we never manage to find a good
pattern and we have to retry with a core pattern from a different part of
the sequence. It must also be noted that we take particular care to
avoid `false' patterns. We will use an example to describe what we
call a `false' pattern:

Let us assume that we have a partial alignment of three sequences around
an active site residue (in this example a histidine whose position is
marked with an asterisk) as shown below:

                 *
          ALRDFATHDDF
          SMTAEATHDSI
          ECDQAATHEAS

Here we would start scanning with a core pattern with the sequence
A-T-H-[D or E]. This pattern is short and would probably pick up too many
false positive results. According to the procedure outlined above, we
would then have to extend the core pattern. But in this case, any
extension would be artificial and would group together residues which have
different properties and which are represented only once in a given
position of the alignment. For example, we could scan with the pattern
[R, T or D]-[D, A or Q]-[F, E or A]-A-T-H-[D or E]. This pattern would
probably only pick up the sequences which are in the alignment, but it
would be biologically meaningless; there is no consensus in the first
three positions of the pattern and the pattern does not even group
residues with similar physico-chemical properties. Consequently, this
pattern would probably fail to detect a new sequence containing the same
active site but having a different N-terminal sequence.
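
The behaviour of the two patterns in this example can be checked with a
short Python sketch. The patterns are written directly as regular
expressions, and `new_sequence' is a hypothetical sequence invented for
illustration (same active site, different residues on the N-terminal
side); it is not taken from any data bank.

```python
import re

# The three aligned sequences from the example above.
aligned = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]

# Core pattern A-T-H-[D or E] and the artificially extended pattern,
# both written as Python regular expressions.
core = re.compile(r"ATH[DE]")
extended = re.compile(r"[RTD][DAQ][FEA]ATH[DE]")

# Hypothetical new family member (invented for this sketch).
new_sequence = "GKLMSATHDQV"

candidates = aligned + [new_sequence]
core_hits = [bool(core.search(s)) for s in candidates]
extended_hits = [bool(extended.search(s)) for s in candidates]

print(core_hits)      # [True, True, True, True]
print(extended_hits)  # [True, True, True, False]
```

The extended pattern matches only the sequences it was built from, which
is the symptom of a `false' pattern described above.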



























<PAGE>




                   2) CONVENTIONS USED IN THE DATA BANK


2.1) General structure

The PROSITE data bank is composed of two ASCII (text) files. The first
file (PROSITE.DAT) is a computer-readable file that contains all the
information needed by programs that scan sequence(s) with patterns
and/or matrices. The second file (PROSITE.DOC) contains textual
information that fully documents each pattern and matrix. We strongly
urge software developers to build software tools that make use of both
files. A list of patterns present in a sequence is not very useful to
biologists without the relevant documentation.


2.2) Data file structure

     2.2.1) Structure of an entry

The entries  in the  database data file (PROSITE.DAT) are structured so as
to be  usable by human readers as well as by computer programs. Each entry
in the  database is composed of lines. Different types of lines, each with
its own format, are used to record the various types of data which make up
the entry. The general structure of a line is the following:

Characters Content
---------- -----------------------------------------------------------
1 to 2     Two-character line  code. Indicates the type of information
           contained in the line.
3 to 5     Blank
6 up to 78 Data

The currently used line types, along with their respective line codes, are
listed below:

ID  Identification                    (Begins each entry; 1 per entry)
AC  Accession number                  (1 per entry)
DT  Date                              (1 per entry)
DE  Short description                 (1 per entry)
PA  Pattern                           (>=0 per entry)
MA  Matrix                            (>=0 per entry)
RU  Rule                              (>=0 per entry)
NR  Numerical results                 (>=0 per entry)
CC  Comments                          (>=0 per entry)
DR  Cross-references to SWISS-PROT    (>=0 per entry)
3D  Cross-references to PDB           (>=0 per entry)
DO  Pointer to documentation file     (1 per entry)
//  Termination line                  (Ends each entry; 1 per entry)


Additional line types will be added in future releases.
Each line type is described in section 2.3 of this document.




<PAGE>




     2.2.2) Example of an entry

ID   PPASE; PATTERN.
AC   PS00387;
DT   NOV-1990 (CREATED); DEC-1991 (DATA UPDATE); JUN-1992 (INFO UPDATE).
DE   Inorganic pyrophosphatase signature.
PA   D-[SGN]-D-P-[LIVM]-D-[LIVMC].
NR   /RELEASE=22,25044;
NR   /TOTAL=6(6); /POSITIVE=6(6); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=0(0);
CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;
CC   /SITE=1,magnesium; /SITE=3,magnesium; /SITE=6,magnesium;
DR   P17288, IPYR_ECOLI, T; P13998, IPYR_KLULA, T; P19117, IPYR_SCHPO, T;
DR   P19514, IPYR_THEP3, T; P00817, IPYR_YEAST, T; P21216, IPYR_ARATH, T;
DR   P19371, IPYR_DESVH, P; P21616, IPYR_PHAAU, P;
3D   1PYP;
DO   PDOC00325;
//
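
As an illustration of the line structure described in section 2.2.1, the
following Python sketch groups the data part of each line under its
two-character line code. The function name `parse_entry' is hypothetical,
and the sample is a shortened copy of the entry shown above.

```python
from collections import defaultdict

def parse_entry(lines):
    """Group the data part of each line under its two-character line code."""
    entry = defaultdict(list)
    for line in lines:
        code = line[:2]                 # characters 1 to 2: line code
        if code == "//":                # termination line ends the entry
            break
        entry[code].append(line[5:].rstrip())  # characters 6 onwards: data
    return dict(entry)

sample = [
    "ID   PPASE; PATTERN.",
    "AC   PS00387;",
    "DE   Inorganic pyrophosphatase signature.",
    "PA   D-[SGN]-D-P-[LIVM]-D-[LIVMC].",
    "CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;",
    "CC   /SITE=1,magnesium; /SITE=3,magnesium; /SITE=6,magnesium;",
    "DO   PDOC00325;",
    "//",
]

record = parse_entry(sample)
print(record["ID"])        # ['PPASE; PATTERN.']
print(len(record["CC"]))   # 2
```

Keeping a list per line code preserves the order of repeated line types
such as CC, NR and DR.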


2.3) The different line types

This section  describes in  detail the format of each type of line used in
the database data file (PROSITE.DAT).

     2.3.1) The ID line

The ID  (IDentification) line  is always  the first  line of an entry. The
general form of the ID line is:

ID   ENTRY_NAME; ENTRY_TYPE.

The first item on the ID line is the entry name. This name is a useful
means of identifying an entry. The entry name consists of 2 to 20
uppercase alphanumeric characters. The characters that are allowed in an
entry name are: A-Z, 0-9, and the underscore character "_".

The second item on the ID line indicates the type of PROSITE entry.
Currently this can be one of the following:

               PATTERN
               MATRIX
               RULE

Examples:

ID   ADH_ZINC; PATTERN.
ID   SULFATATION; RULE.
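
The rules above translate fairly directly into a regular expression. The
following Python sketch is one possible check; the names `ID_LINE' and
`parse_id_line' are hypothetical, and the expression assumes exactly
three blanks after the line code and one blank after the semicolon, as in
the examples.

```python
import re

# 2 to 20 characters from A-Z, 0-9 and "_", then one of the entry types.
ID_LINE = re.compile(r"^ID   ([A-Z0-9_]{2,20}); (PATTERN|MATRIX|RULE)\.$")

def parse_id_line(line):
    """Return (entry_name, entry_type) or raise ValueError."""
    match = ID_LINE.match(line)
    if match is None:
        raise ValueError("not a valid ID line: " + line)
    return match.group(1), match.group(2)

print(parse_id_line("ID   ADH_ZINC; PATTERN."))   # ('ADH_ZINC', 'PATTERN')
print(parse_id_line("ID   SULFATATION; RULE."))   # ('SULFATATION', 'RULE')
```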








<PAGE>





     2.3.2) The AC line

The AC  (ACcession number) line lists the accession number associated with
an entry.  It is  always the  second line  of an  entry. Accession numbers
provide a stable way of identifying entries from release to release. It is
sometimes necessary  for reasons of consistency to change the names of the
entries between releases.

An accession number, however, never changes. Accession numbers allow
unambiguous citation  of database  entries. Researchers who wish to cite a
PROSITE entry  in their  publications should  always  cite  the  accession
number of that entry in order to ensure that readers can find the relevant
data in a subsequent release.

The format of the AC line is:

AC   PSnnnnn;

Where `PS' stands for PROSITE and `nnnnn' is a five-digit number.

Example:

AC   PS00123;


     2.3.3) The DT line

The DT  (DaTe) line  shows the  date of  entry or last modification of the
entry. It  is always the third line of an entry. The format of the DT line
is:

DT   MMM-YYYY (CREATED); MMM-YYYY (DATA UPDATE); MMM-YYYY (INFO UPDATE).

where:

-  MMM is the month and YYYY the year.
-  The first  date indicates  when the  entry first  appeared in  the data
   bank.
-  The second date indicates when the `primary' data of the entry was last
   modified. By this we mean the data relevant to the pattern, matrix, or
   rule being described in that entry.
-  The third date indicates when any data other than the `primary' data
   was last modified.

Example:

DT   APR-1990 (CREATED); JUL-1990 (DATA UPDATE); JUN-1992 (INFO UPDATE).
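
A DT line can be split into its three dates with a few lines of Python.
This is only a sketch: the function name `parse_dt_line' is hypothetical,
and the three-letter month abbreviations are assumed to be the English
ones used in the examples.

```python
import re

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

# One "MMM-YYYY (LABEL)" item, e.g. "APR-1990 (CREATED)".
DT_ITEM = re.compile(r"([A-Z]{3})-(\d{4}) \(([A-Z ]+)\)")

def parse_dt_line(line):
    """Return {label: (year, month)} for each date on a DT line."""
    dates = {}
    for month, year, label in DT_ITEM.findall(line):
        dates[label] = (int(year), MONTHS[month])
    return dates

dt = parse_dt_line(
    "DT   APR-1990 (CREATED); JUL-1990 (DATA UPDATE); JUN-1992 (INFO UPDATE)."
)
print(dt["DATA UPDATE"])   # (1990, 7)
```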








<PAGE>



     2.3.4) The DE line

The DE  (DEscription) line  provides  descriptive  information  about  the
content of the entry. It is always the fourth line of an entry. The format
of the DE line is:

DE   Description.

The description is given in ordinary English and is free-format.

Examples:

DE   Myb DNA-binding domain repeat signature 1.
DE   Iron-containing alcohol dehydrogenases signature.
DE   Zinc finger, C2H2 type, domain.


     2.3.5) The PA line

The PA (PAttern) lines contain the definition of a PROSITE pattern. The
patterns are described using the following conventions:

-  The standard IUPAC one-letter codes for the amino acids are used.
-  The symbol `x' is used for a position where any amino acid is accepted.
-  Ambiguities are indicated by listing the acceptable amino acids for a
   given position, between square brackets `[ ]'. For example: [ALT]
   stands for Ala or Leu or Thr.
-  Ambiguities are  also indicated  by listing  between a  pair  of  curly
   brackets `{  }' the  amino acids  that are  not  accepted  at  a  given
   position. For  example: {AM}  stands for  any amino acid except Ala and
   Met.
-  Each element in a pattern is separated from its neighbor by a `-'.
-  Repetition of an element of the pattern can be indicated by following
   that element with a numerical value or a numerical range between
   parentheses. Examples: x(3) corresponds to x-x-x, x(2,4) corresponds to
   x-x or x-x-x or x-x-x-x.
-  When a pattern is restricted to either the N- or C-terminal end of a
   sequence, that pattern either starts with a `<' symbol or ends with a
   `>' symbol.
-  A period ends the pattern.

Examples:

PA   [AC]-x-V-x(4)-{ED}.

This pattern  can be  translated as: [Ala or Cys]-any-Val-any-any-any-any-
{any but Glu or Asp}

PA   <A-x-[ST](2)-x(0,1)-V.

This pattern, which must be at the N-terminal end of the sequence (`<'),
can be translated as: Ala-any-[Ser or Thr]-[Ser or Thr]-(any or none)-Val
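
Because the conventions above map closely onto regular expressions, a
pattern can be converted mechanically. The Python sketch below is one
possible translation, not the method used by any particular program; the
function name `prosite_to_regex' is hypothetical. Note that `x' is mapped
to `.', which also accepts characters that are not amino acid codes, and
that the test sequence is invented for illustration.

```python
import re

# One pattern element: x, a residue, [..] or {..}, optionally followed
# by a repeat count (n) or range (n,m).
ELEMENT = re.compile(
    r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?"
)

def prosite_to_regex(pattern):
    """Translate a PA-line pattern into a Python regular expression."""
    pattern = pattern.strip()
    if pattern.endswith("."):
        pattern = pattern[:-1]               # the period ends the pattern
    prefix = suffix = ""
    if pattern.startswith("<"):              # restricted to the N-terminus
        prefix, pattern = "^", pattern[1:]
    if pattern.endswith(">"):                # restricted to the C-terminus
        suffix, pattern = "$", pattern[:-1]

    parts = []
    for element in pattern.split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError("unrecognised pattern element: " + element)
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any amino acid
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # residues that are not accepted
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return prefix + "".join(parts) + suffix

regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}.")
print(regex)                                 # [AC].V.{4}[^ED]
print(bool(re.search(regex, "MCKVAAAAG")))   # True
```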





<PAGE>



     2.3.6) The MA line

The MA (MAtrix) lines contain the definition of a PROSITE matrix entry.
The exact format and content of this line are not yet defined.

     2.3.7) The RU line

The RU  (RUle) lines  contain the  definition of a PROSITE rule entry. The
format of the RU line is:

RU   Rule_Description.

The rule is described in ordinary English and is free-format.

     2.3.8) The NR line

The NR  (Numerical Results)  lines contain  information  relevant  to  the
results of  the scan  with a pattern on the complete SWISS-PROT data bank.
The format of the NR line is:

NR   /QUALIFIER=data; /QUALIFIER=data; .......

The qualifiers that are currently defined are:

/RELEASE       SWISS-PROT release  number and  total  number  of  sequence
               entries in that release.
/TOTAL         Total number of hits in SWISS-PROT.
/POSITIVE      Number of hits on proteins that are known to belong to the
               set under consideration.
/UNKNOWN       Number of hits on proteins that could possibly belong to
               the set under consideration.
/FALSE_POS     Number of false hits (on unrelated proteins).
/FALSE_NEG     Number of known missed hits.

The syntax of the /RELEASE qualifier is:

                           /RELEASE=nn,seq_num;

where `nn'  is a  SWISS-PROT release number and `seq_num' the total number
of SWISS-PROT entries in that release.

For all other qualifiers the syntax is:

                             /QUALIFIER=x(y);

where `x' represents the number of hits and `y' the number of sequences.
In the majority of pattern entries `x' will be equal to `y', but for those
patterns that are designed to detect domains that can be repeated more
than once in a given sequence (for example: zinc fingers, EF-hand regions,
kringle domains, etc.), `x' can be larger than `y'. Such a situation is
described in the following example:

NR   /RELEASE=22,25044; /TOTAL=123(56); /POSITIVE=115(51);
NR   /UNKNOWN=5(2); /FALSE_POS=3(3); /FALSE_NEG=3(2);



<PAGE>


In the above example the scan for the pattern was done on release 22 of
SWISS-PROT, which contains 25044 sequence entries. The pattern was found
123 times in 56 different sequences (/TOTAL). Out of those 123 `hits', 115
were produced by 51 sequences that belong to the set under consideration
(/POSITIVE), 5 hits were produced by 2 sequences which could possibly
belong to the set (/UNKNOWN) and 3 hits were produced by 3 other sequences
(/FALSE_POS). That particular pattern missed 3 occurrences in 2 different
sequences (/FALSE_NEG).
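
The NR lines of this example can be read and cross-checked with a short
Python sketch. The names `NR_ITEM' and `parse_nr' are hypothetical. The
two ratios computed at the end are only one way of quantifying the
sensitivity and specificity goals of section 1.1; leaving the /UNKNOWN
hits out of both ratios is an assumption of this sketch, not a PROSITE
convention.

```python
import re

# Matches "/RELEASE=nn,seq_num;" as well as "/QUALIFIER=x(y);".
NR_ITEM = re.compile(r"/([A-Z_]+)=(\d+)[(,](\d+)\)?;")

def parse_nr(lines):
    """Return {qualifier: (first_number, second_number)} from NR lines."""
    results = {}
    for line in lines:
        for name, first, second in NR_ITEM.findall(line):
            results[name] = (int(first), int(second))
    return results

nr = parse_nr([
    "NR   /RELEASE=22,25044; /TOTAL=123(56); /POSITIVE=115(51);",
    "NR   /UNKNOWN=5(2); /FALSE_POS=3(3); /FALSE_NEG=3(2);",
])

# /TOTAL should equal the sum of positive, unknown and false positive
# counts, both for hits (x) and for sequences (y).
hits_ok = nr["TOTAL"][0] == sum(
    nr[q][0] for q in ("POSITIVE", "UNKNOWN", "FALSE_POS"))
seqs_ok = nr["TOTAL"][1] == sum(
    nr[q][1] for q in ("POSITIVE", "UNKNOWN", "FALSE_POS"))

# Hit-based ratios (assumption: unknown hits are ignored).
true_hits = nr["POSITIVE"][0]
recall = true_hits / (true_hits + nr["FALSE_NEG"][0])
precision = true_hits / (true_hits + nr["FALSE_POS"][0])

print(hits_ok, seqs_ok)
print(round(recall, 3), round(precision, 3))
```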

Note: for some degenerate patterns (for example, the N-glycosylation
consensus pattern), the NR lines are not provided as they would not yield
any useful information.


     2.3.9) The CC line

The CC (Comments) lines contain various types of comments. The format of
the CC line is:

CC   /QUALIFIER=data; /QUALIFIER=data; .......

The qualifiers that are currently defined are:

/TAXO-RANGE    Taxonomic range.
/MAX-REPEAT    Maximum known number of repetitions of the pattern in a
               single protein.
/SITE          Indication of an `interesting' site in the pattern.
/SKIP-FLAG     Indication of  an entry that can be, in some cases, ignored
               by a program (because it is too unspecific).


         2.3.9.1) The /TAXO-RANGE qualifier

This qualifier  is used  to indicate  the taxonomic  range of a pattern or
matrix. The syntax of that qualifier is the following:

                            /TAXO-RANGE=ABEPV;

where

-  `A' stands for archaebacteria
-  `B' stands for bacteriophages
-  `E' stands for eukaryotes
-  `P' stands for prokaryotes
-  `V' stands for eukaryotic viruses

-  when the  pattern or  matrix entry has no relevance to one of the above
   taxonomic classes  a question  mark (`?')  replaces  the  corresponding
   letter symbol.

Example: /TAXO-RANGE=A?E?? would be used in an entry relevant to proteins
of archaebacterial (`A') and eukaryotic (`E') origin.

Note: the /TAXO-RANGE qualifier does not take into account false positive
hits. For example: if a pattern produces one or more false positive hit(s)



<PAGE>


on bacteriophage protein(s) but no true positive results were
obtained on any bacteriophage proteins, a question mark will be present
instead of the `B' in the second position of the /TAXO-RANGE qualifier.
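
The fixed five-position layout of /TAXO-RANGE makes it easy to decode. A
minimal Python sketch follows; the names `TAXO_CLASSES' and
`decode_taxo_range' are hypothetical.

```python
# Position-by-position meaning of the /TAXO-RANGE value.
TAXO_CLASSES = [
    ("A", "archaebacteria"),
    ("B", "bacteriophages"),
    ("E", "eukaryotes"),
    ("P", "prokaryotes"),
    ("V", "eukaryotic viruses"),
]

def decode_taxo_range(value):
    """Return the list of taxonomic classes flagged in a /TAXO-RANGE value."""
    if len(value) != len(TAXO_CLASSES):
        raise ValueError("expected 5 characters: " + value)
    classes = []
    for (letter, name), symbol in zip(TAXO_CLASSES, value):
        if symbol == letter:
            classes.append(name)
        elif symbol != "?":                  # only the letter or `?' is valid
            raise ValueError("unexpected symbol in: " + value)
    return classes

print(decode_taxo_range("A?E??"))   # ['archaebacteria', 'eukaryotes']
print(decode_taxo_range("??EP?"))   # ['eukaryotes', 'prokaryotes']
```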


         2.3.9.2) The /MAX-REPEAT qualifier

This qualifier  is used  to indicate  the maximum  number of times a given
pattern has  been found  in a  single protein sequence. The syntax of that
qualifier is the following:

                             /MAX-REPEAT=nn;

For example,  in the  CC lines  of the  pattern entry to detect an EF-hand
calcium-binding domain  we have /MAX-REPEAT=8. This indicates that up to 8
copies of  the EF-hand  domain are  known to  be present  in at  least one
protein sequence.

Notes

One should not assume that the value indicated by this qualifier is
equivalent to the maximum number of hits that will be obtained by the
pattern being described; it is not uncommon for a pattern to miss some
occurrences of a repeated domain.

         2.3.9.3) The /SITE qualifier

This qualifier  is used  to indicate the position of an `interesting' site
in a  pattern. For  example, if a pattern includes an active site residue,
the /SITE  qualifier will be used to indicate the position of that residue
in the pattern. The syntax of this qualifier is the following:

                        /SITE=nn,text_description;

where `nn'  is the position in the pattern of the site being described and
`text_description' a textual description of that site.

Examples:

CC   /SITE=3,active_site;
CC   /SITE=5,disulfide;

Notes

The position numbering is expressed in pattern element units. For example,
if we want to indicate that the `C' in the pattern
`<A-[ILMV]-x(2,4)-A-C-P' is involved in a disulfide bond we would indicate
`/SITE=5,disulfide;', the `C' being the fifth element in the pattern.

When necessary, more than one /SITE qualifier can be present in the CC
line(s) of a pattern entry. For example, in the entry specific to proteins
of the cytochrome c family, the pattern `C-{CPWHF}-{CPWR}-C-H-{CFWY}' has
the following /SITE qualifiers in its CC lines:

CC   /SITE=1,heme; /SITE=4,heme; /SITE=5,heme_iron;



<PAGE>



This indicates that the two `C's are the residues that bind the heme
group and that the `H' is an axial ligand to the heme iron.

If the presence of a site is assumed but experimental data are
lacking, a `(?)' is appended to the end of the text description. For
example, if we have the pattern `A-x(2)-C-R' and the cysteine in that
pattern is thought to be involved in a disulfide bond, it would be
indicated as `/SITE=3,disulfide(?);'.
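
Since /SITE positions count pattern elements rather than residues, a
program has to split the pattern on `-' before looking a position up. The
Python sketch below shows one way to do this and to detect the `(?)'
marker; the function names `site_element' and `parse_site' are
hypothetical.

```python
def site_element(pattern, position):
    """Return the pattern element found at a 1-based /SITE position."""
    body = pattern.strip().rstrip(".")       # drop the final period
    body = body.lstrip("<").rstrip(">")      # drop terminal anchors
    elements = body.split("-")
    return elements[position - 1]

def parse_site(value):
    """Split 'nn,text' into (position, description, is_uncertain)."""
    position, text = value.split(",", 1)
    uncertain = text.endswith("(?)")
    if uncertain:
        text = text[:-3]
    return int(position), text, uncertain

print(site_element("<A-[ILMV]-x(2,4)-A-C-P", 5))        # C
print(site_element("C-{CPWHF}-{CPWR}-C-H-{CFWY}", 5))   # H
print(parse_site("3,disulfide(?)"))                     # (3, 'disulfide', True)
```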


         2.3.9.4) The /SKIP-FLAG qualifier

Some PROSITE entries, such as those describing commonly found
post-translational modifications (a typical example is N-glycosylation),
match the majority of known protein sequences. While it is generally
useful to note their presence, some programs may want, in some cases, to
ignore those entries. For this purpose these entries are indicated with
the following qualifier in their CC lines:

CC   /SKIP-FLAG=TRUE;
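
A program that wants to honour this flag first needs to read the CC
qualifiers. The Python sketch below collects them as (name, value) pairs,
so that repeated qualifiers such as /SITE are kept, and then tests for the
flag. The function names `parse_cc' and `should_skip' are hypothetical.

```python
def parse_cc(lines):
    """Collect (qualifier, value) pairs from CC lines, keeping repeats."""
    pairs = []
    for line in lines:
        for item in line[5:].split(";"):     # data starts at character 6
            item = item.strip()
            if item.startswith("/") and "=" in item:
                name, value = item[1:].split("=", 1)
                pairs.append((name, value))
    return pairs

def should_skip(cc_lines):
    """True when the entry carries /SKIP-FLAG=TRUE in its CC lines."""
    return ("SKIP-FLAG", "TRUE") in parse_cc(cc_lines)

ppase_cc = [
    "CC   /TAXO-RANGE=??EP?; /MAX-REPEAT=1;",
    "CC   /SITE=1,magnesium; /SITE=3,magnesium; /SITE=6,magnesium;",
]
flagged_cc = ["CC   /SKIP-FLAG=TRUE;"]

print(parse_cc(ppase_cc)[:2])   # [('TAXO-RANGE', '??EP?'), ('MAX-REPEAT', '1')]
print(should_skip(ppase_cc), should_skip(flagged_cc))   # False True
```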



     2.3.10) The DR line

The DR  (Data bank Reference) lines are used as pointers to the SWISS-PROT
entries that  are picked  up (or missed) by the pattern being described in
the entry. The format of the DR line is:

DR   AC_NB, ENTRY_NAME, C; AC_NB, ENTRY_NAME, C; AC_NB, ENTRY_NAME, C;

where:

-  `AC_NB' is  the SWISS-PROT  primary accession  number of  the entry  to
   which reference is being made.

-  `ENTRY_NAME' is the SWISS-PROT entry name.

-  `C' is a one character flag that can be one of the following:

T  For a true positive.
N  For a  false negative;  a sequence  which  belongs  to  the  set  under
   consideration, but which has not been picked up by the pattern.
P  For a  `potential' hit;  a sequence  that  belongs  to  the  set  under
   consideration, but  which was  not picked  up because  the region  that
   contains the  pattern is  not yet  available in  the data bank (partial
   sequence).
?  For an unknown; a sequence which possibly could belong to the set under
   consideration.
F  For a false positive; a sequence which does not belong to the set
   under consideration.






<PAGE>



Example:

DR   P10807, ADH_DROLE , T; P07162, ADH_DROMA , T; P00334, ADH_DROME , T;
DR   P09370, ADH1_DROMO, T; P09369, ADH2_DROMO, T; P07160, ADH2_DROMU, T;
DR   P12854, ADH1_DRONA, T; P07159, ADH_DROOR , T; P07158, ADH_DROPS , T;
DR   P07163, ADH_DROSI , T; P08074, AP27_MOUSE, T; P08088, BEN5_PSEPU, T;
DR   P07772, BEND_ACICA, T; P08694, BPHB_PSEPS, T; P14061, DHES_HUMAN, T;
DR   P12310, DHG_BACSU , T; P10528, DHGA_BACME, T; P07999, DHGB_BACME, T;
DR   P16232, DHII_RAT  , T; P15047, ENTA_ECOLI, T; P05406, FIXR_BRAJA, T;
DR   P05707, GUTD_ECOLI, T; P06234, NODG_RHIME, T; P06235, NODG_RHIMS, T;
DR   P15428, PGDH_HUMAN, T; P14697, PHBB_ALCEU, T; P00335, RIDH_KLEAE, T;
DR   P13859, TODD_PSEPU, T;
DR   P13203, DHG_THEAC , P;
DR   P14802, YRTP_BACSU, ?;
DR   P07161, ADH1_DROMU, N;
DR   P00805, ASPG_ECOLI, F; P13226, GALX_STRLI, F; P14373, RFP_HUMAN , F;
DR   P02788, TRFL_HUMAN, F; P08071, TRFL_MOUSE, F;

In the  above example,  we have  pointers to 28 SWISS-PROT sequences which
are true  positives (`T'),  one which  is a potential hit (`P'), one for a
sequence that  may belong  to the set under consideration (`?'), one which
has been  missed by  the pattern  (`N'), and five sequences that are false
positives (`F').
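
Counting the flags as done above is straightforward once the DR lines are
split into their three fields. A minimal Python sketch follows, using a
few of the lines from the example; the function name `parse_dr' is
hypothetical.

```python
from collections import Counter

def parse_dr(lines):
    """Return a list of (accession, entry_name, flag) tuples."""
    refs = []
    for line in lines:
        for item in line[5:].split(";"):     # data starts at character 6
            fields = [field.strip() for field in item.split(",")]
            if len(fields) == 3:             # skips the empty trailing item
                refs.append(tuple(fields))
    return refs

refs = parse_dr([
    "DR   P13859, TODD_PSEPU, T;",
    "DR   P13203, DHG_THEAC , P;",
    "DR   P14802, YRTP_BACSU, ?;",
    "DR   P07161, ADH1_DROMU, N;",
    "DR   P02788, TRFL_HUMAN, F; P08071, TRFL_MOUSE, F;",
])

counts = Counter(flag for _, _, flag in refs)
print(refs[1])       # ('P13203', 'DHG_THEAC', 'P')
print(counts["F"])   # 2
```

Stripping each field removes the padding blanks that are used to align
short entry names such as `DHG_THEAC '.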


     2.3.11) The 3D line

The 3D (3D-structure) line is used to list the code(s) of X-ray
crystallography Protein Data Bank (PDB) entries that contain structural
data corresponding to the sequence region described in a PROSITE entry. The
format of the 3D line is:

3D   name; [name2;...]

Example:

3D   7WGA; 9WGA; 1WGC; 2WGC;


     2.3.12) The DO line

The DO (DOcumentation) line contains a pointer to the entry in the PROSITE
documentation file that describes the entry. The format of the DO line is:

DO   PDOCnnnnn;

Where `PDOC' stands for PROSITE DOCumentation and `nnnnn' is a five-digit
number.

Example:

DO   PDOC00128;





<PAGE>



     2.3.13) The termination line

The //  (terminator) line  contains no data or comments. It designates the
end of an entry.


2.4) Documentation file structure

The PROSITE  documentation file  is an ASCII file. The maximum line length
has been set to 78 characters. The general format of a documentation entry
is the following:

{PDOCnnnnn}
{PSmmmmm; ENTRY_NAME}
..
{BEGIN}
Documentation text lines
.
..
...
{END}

-  The first line `{PDOCnnnnn}', where `nnnnn' is a five-digit number, is
   the documentation entry accession number.
-  The following lines `{PSmmmmm; ENTRY_NAME}' list the accession number
   and entry name of the PROSITE data file entry (or entries) that
   correspond to the documentation entry.
-  The documentation  text lines  are in  ordinary English  and are  free-
   format. The  only restriction  is that  they  do  not  start  with  the
   character `{'.

As an  example, we  show here  a section  of the  documentation file  that
contains two entries.

{PDOC00082}
{PS00087; SOD_CU_ZN_1}
{PS00332; SOD_CU_ZN_2}
{BEGIN}
***********************************************
* Copper/Zinc superoxide dismutase signatures *
***********************************************

Copper/Zinc superoxide dismutase (EC 1.15.1.1) (SODC) [1] is  one of the three
forms of an enzyme that catalyzes the dismutation of superoxide radicals. SODC
binds one atom each  of zinc and copper.  Various forms  of  SODC are known: a
cytoplasmic  form in  eukaryotes, an additional chloroplast form in plants, an
extracellular form in some  eukaryotes, and a periplasmic form in prokaryotes.
The metal binding sites are conserved in all the known SODC sequences [2].

We derived two signature  patterns for this family of enzymes:  the  first one
contains two  histidine residues that  bind the copper atom; the second one is
located in the C-terminal section of  SODC  and  contains a  cysteine which is
involved in a disulfide bond.




<PAGE>



-Consensus pattern: [GA]-[IFAT]-H-[LIVF]-H-x(2)-[GP]-[SDG]-x-[STAGD]
                    [The two H's are copper ligands]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: human complement receptor type 1.

-Consensus pattern: G-[GN]-[SGA]-G-x-R-x-[SGA]-C-x(2)-[IV]
                    [C is involved in a disulfide bond]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.

-Last update: June 1992 / Patterns and text revised.

[ 1] Bannister J.V., Bannister W.H., Rotilio G.
     CRC Crit. Rev. Biochem. 22:111-154(1987).
[ 2] Smith M.W., Doolittle R.F.
     J. Mol. Evol. 34:175-184(1992).
{END}

{PDOC00083}
{PS00088; SOD_MN}
{BEGIN}
******************************************************
* Manganese and iron superoxide dismutases signature *
******************************************************

Manganese  superoxide dismutase (EC 1.15.1.1) (SODM)  [1] is  one of the three
forms of an enzyme that catalyzes the dismutation  of superoxide radicals. The
four  ligands of  the manganese atom  are  conserved in  all  the  known  SODM
sequences.  These metal ligands are also conserved in the related iron form of
superoxide  dismutases [2,3].  We selected, as  a signature, a short conserved
region which includes two of the four ligands: an aspartate and an histidine.

-Consensus pattern: D-x-W-E-H-[STA]-[FY](2)
                    [D and H are manganese/iron ligands]
-Sequences known to belong to this class detected by the pattern: ALL.
-Other sequence(s) detected in SWISS-PROT: NONE.
-Last update: June 1992 / Text revised.

[ 1] Bannister J.V., Bannister W.H., Rotilio G.
     CRC Crit. Rev. Biochem. 22:111-154(1987).
[ 2] Parker M.W., Blake C.C.F.
     FEBS Lett. 229:377-382(1988).
[ 3] Smith M.W., Doolittle R.F.
     J. Mol. Evol. 34:175-184(1992).
{END}
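
The documentation file format described in section 2.4 can be read with a
small state machine. The Python sketch below is one possible reader; the
function name `parse_doc' is hypothetical, and the sample keeps only one
text line per entry from the example above.

```python
def parse_doc(lines):
    """Return a list of (doc_accession, [(ps_accession, name), ...], text)."""
    entries = []
    doc_ac, refs, text, in_text = None, [], [], False
    for raw in lines:
        line = raw.rstrip("\n")
        if line == "{BEGIN}":
            in_text = True
        elif line == "{END}":
            entries.append((doc_ac, refs, "\n".join(text)))
            doc_ac, refs, text, in_text = None, [], [], False
        elif in_text:
            text.append(line)                # free-format documentation text
        elif line.startswith("{PDOC"):
            doc_ac = line[1:-1]              # documentation accession number
        elif line.startswith("{PS"):
            accession, name = line[1:-1].split(";")
            refs.append((accession.strip(), name.strip()))
    return entries

docs = parse_doc([
    "{PDOC00082}",
    "{PS00087; SOD_CU_ZN_1}",
    "{PS00332; SOD_CU_ZN_2}",
    "{BEGIN}",
    "* Copper/Zinc superoxide dismutase signatures *",
    "{END}",
    "{PDOC00083}",
    "{PS00088; SOD_MN}",
    "{BEGIN}",
    "* Manganese and iron superoxide dismutases signature *",
    "{END}",
])

print(docs[0][0], docs[0][1])
print(docs[1][0], docs[1][1])
```

Indexing the result by PS accession number gives a program the link from
a data file entry to its documentation.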












<PAGE>


              APPENDIX) ANSWERS TO SOME POTENTIAL QUESTIONS


1) Why did we break up PROSITE into two files (data and documentation)?

There are two main reasons for implementing PROSITE in this fashion.

a) There are a number of cases in PROSITE where more than one pattern
entry can be described by the same documentation. For example, there are
two PROSITE patterns which are specific to the trypsin family of serine
proteases; one of them detects the serine active-site residue, the other
detects the histidine active-site residue. Using a single text entry to
document both patterns makes much more sense than having two separate and
partially redundant documentation entries.

b) We plan to extend the documentation file to describe families or groups
of proteins for which there will not necessarily be any corresponding
pattern or matrix entry. In fact the goal is for the PROSITE documentation
file to evolve slowly into a separate data base and to become the
kernel of a computerized encyclopedia of proteins.


2) Will the pattern description conventions be extended to allow more
complex patterns to be described?

Yes, we intend to allow more complex patterns to be described, but as the
current version of PROSITE does not contain such patterns it was neither
necessary nor desirable to add levels of complexity to the syntax. As a
prerequisite to the extension of the syntax we plan to consult different
groups that are proficient in the use of pattern searches in biomolecular
data banks.


3) Will rule entries always be in free-format?

Once enough  rules have  been identified it will be a worthwhile objective
to find  a way  of describing  these rules  so that they could be read and
used by  computer programs.  It is  conceivable that we shall use computer
languages such as Lisp or Prolog to describe such rules.


4) When will the syntax for matrix entries be defined?

As soon as we have found group(s) willing to contribute and maintain
matrices. Currently we have neither the time nor the manpower required for
such a task.











<PAGE>
