## Basis Comparisons of the Astral 40 Set --- Spring 2006

The 5271 protein chains described on the line weavings page encompass 6646 domains. We list these domains here.

Suppose we focus on those single-chain domains in SCOP classes A, B, C, and D that contain at least six significant secondary structure elements. (By a "secondary structure element" we mean a helix or strand along with its approximating edge, as computed by the method outlined on the line weavings page. By "significant" we mean that a helix must contain at least 5 residues and a strand must contain at least 3 residues.) There are 5032 such domains in 629 folds.
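The selection rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Element` record, the `"helix"`/`"strand"` labels, and the `qualifies` function are hypothetical names, not part of the line weaving code, and the single-chain requirement is assumed to have been checked elsewhere.

```python
from dataclasses import dataclass
from typing import List

# Thresholds taken from the text above.
MIN_HELIX_RESIDUES = 5
MIN_STRAND_RESIDUES = 3
MIN_SIGNIFICANT_ELEMENTS = 6
SCOP_CLASSES = {"A", "B", "C", "D"}


@dataclass
class Element:
    kind: str      # "helix" or "strand" (hypothetical encoding)
    residues: int  # number of residues in the element


def is_significant(element: Element) -> bool:
    """A helix needs at least 5 residues, a strand at least 3."""
    if element.kind == "helix":
        return element.residues >= MIN_HELIX_RESIDUES
    if element.kind == "strand":
        return element.residues >= MIN_STRAND_RESIDUES
    return False


def qualifies(scop_class: str, elements: List[Element]) -> bool:
    """True if the domain is in class A-D with at least six significant elements."""
    significant = sum(1 for e in elements if is_significant(e))
    return scop_class.upper() in SCOP_CLASSES and significant >= MIN_SIGNIFICANT_ELEMENTS
```

For example, a class A domain with six 5-residue helices qualifies, while the same domain with one helix shortened to 4 residues does not.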

For each of these 629 folds we computed a basis of at most 10 domains, chosen adaptively so that a large fraction of the domains in the fold are within some reasonable L2 error radius of a basis domain. This file describes the fold bases found.
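One simple way to picture such a basis selection is a greedy covering procedure, sketched below. This is an assumption made for illustration and is not necessarily the adaptive method actually used: the function `greedy_basis`, the matrix layout `l2[i][j]`, and the `radius` argument are all hypothetical.

```python
from typing import List, Optional, Set, Tuple


def greedy_basis(
    l2: List[List[Optional[float]]],
    radius: float,
    max_basis: int = 10,
) -> Tuple[List[int], Set[int]]:
    """Greedily pick up to max_basis domains that cover as many others as possible.

    l2[i][j] is the L2 error of the best alignment between domains i and j,
    or None if no alignment exists (hypothetical layout).
    Returns the chosen basis indices and the set of domains left uncovered.
    """
    n = len(l2)
    uncovered = set(range(n))
    basis: List[int] = []

    def covered_by(candidate: int) -> Set[int]:
        # Domains still uncovered that lie within the radius of the candidate.
        return {
            j for j in uncovered
            if l2[candidate][j] is not None and l2[candidate][j] <= radius
        }

    while uncovered and len(basis) < max_basis:
        best = max(range(n), key=lambda c: len(covered_by(c)))
        newly_covered = covered_by(best)
        if not newly_covered:
            break  # no candidate covers anything further
        basis.append(best)
        uncovered -= newly_covered
    return basis, uncovered
```

The cap of 10 comes from the text; domains that remain in the returned `uncovered` set correspond to the part of a fold that no basis element reaches within the chosen radius.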

Suppose we further restrict attention to folds containing at least 5 domains that satisfy the requirements above. There are 201 such folds, consisting of 4263 domains. This file describes the union of the basis elements for these folds. There are 686 basis elements "covering" 4154 domains.

This link points to a directory of directories,
one for each of the 629 folds mentioned above.

Each directory in turn contains a set of files that cross-compare all
members of that fold with each other.

These comparisons constitute the data by which we computed the basis
elements for each fold.

This link points to a set of files
that cross-compare all 686 basis elements for SCOP classes A,B,C,D with
each other.

These comparisons represent a snapshot of the structural information
contained in the PDB, from the perspective of our line weaving code.

**Caution:** We used greedy algorithms in
computing these alignments. An inner loop slides the backbones of the
two domains being compared relative to each other, then uses the angle
and distance information contained in the crossing files to filter
alignments. An iterative-closest-point routine in the space of edges
then greedily extends these alignments using bipartite-graph
matching.

Generally, alignments with large `rho12` similarity or low
`L2` error will be good alignments. This means that alignments
of similar domains are likely to be present in our comparisons.

However, the greedy nature of this implementation means
that alignments of small substructures may be missing.
Similarly, symmetries may give rise to suboptimal alignments.

For each given domain of interest, four files
describe the comparison of that domain with all target domains of
interest, either all domains in the same fold or all domains in the
overall basis set. Embedded in a file name is the given domain's index
and a descriptor of the file type, which is one of `log`,
`sim`, `align`, or `star`. The semantics of
entries in the file are roughly as follows:

| File Type | Format of a typical entry | Description |
|---|---|---|
| `log` | `score1 [score2] LinePairings L2 {Bad/Sig:Tot} Breaks` | For each pair of compared domains, the file contains the top 5 alignments, in decreasing order of similarity. ("size requirement" refers to restrictions based on the domain size as well as the secondary structure sizes.) |
| `sim` | (see the file legend in one of the files) | The file presents in a different format the top-ranked alignment for each pair of domains compared. |
| `align` | `TargetDomain rho12 L2 LinePairings` | The entries are abbreviations of the information shown for each top-ranked alignment in the log/sim file. The file's entries appear in order of the TargetDomain indices. |
| `star` | `TargetDomain rho12 L2 LinePairingStars` | This is a slightly different presentation of the information contained in the align file. The entries are now sorted in decreasing order on rho12 and the LinePairings appear as stars `*`, indicating the relative backbone locations of the given domain's aligned secondary structures. |
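The relationship between the `align` and `star` orderings can be sketched as follows. The exact on-disk layout is not specified on this page, so the sketch assumes whitespace-separated fields in the order `TargetDomain rho12 L2 LinePairings` and keeps the pairings as an uninterpreted string; `AlignEntry`, `parse_align_line`, and `star_order` are hypothetical names.

```python
from typing import Iterable, List, NamedTuple


class AlignEntry(NamedTuple):
    target: str    # TargetDomain index, kept as text
    rho12: float   # similarity score
    l2: float      # L2 line alignment error
    pairings: str  # LinePairings, left uninterpreted


def parse_align_line(line: str) -> AlignEntry:
    """Split one entry, assuming whitespace-separated fields (an assumption)."""
    target, rho12, l2, *rest = line.split()
    return AlignEntry(target, float(rho12), float(l2), " ".join(rest))


def star_order(entries: Iterable[AlignEntry]) -> List[AlignEntry]:
    """Re-sort align entries in decreasing order of rho12, as in the star file."""
    return sorted(entries, key=lambda e: e.rho12, reverse=True)
```

Check the legend in an actual file before relying on this parsing; only the field order is taken from the table above.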

Recall:

- `rho12` is the similarity score from the perspective of the given domain within the `TargetDomain`.
- `L2` is a root-mean-sum-of-squares measure of the line similarity taken over the aligned lines given by `LinePairings`.
- `{Bad/Sig:Tot}` is a measure of crossing consistency.

The values "

All comparisons consider only helices with at least 5 residues and strands with at least 3 residues.

Alignments with fewer than 4 aligned secondary structure lines were ignored, including during (pre-filtered) candidate alignment generation.

For more precise definitions, see the paper Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings.

The next several images show the cross-comparison of all 686 basis elements with each other, from a variety of perspectives.

The first two images show the `rho12` and `L2` values
of the top-ranked alignments, using a spectrum of colors.

The color ordering from best to worst is:

` red orange yellow green cyan blue purple black`

Also, `white` means there is no alignment.
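As a rough illustration of such a color coding, the sketch below maps a quality value to one of the eight colors. The color order is taken from the list above; the normalization of the value to the range 0 to 1 (with 1 best) and the equal-width binning are assumptions, since the actual bin boundaries used for the images are not stated here. The name `color_for` is hypothetical.

```python
from typing import Optional

# Best to worst, as listed above.
SPECTRUM = ["red", "orange", "yellow", "green", "cyan", "blue", "purple", "black"]


def color_for(quality: Optional[float]) -> str:
    """Map a quality in [0, 1] (1 = best) to a color; None means no alignment.

    Equal-width bins are an assumption made for illustration.
    """
    if quality is None:
        return "white"
    q = min(max(quality, 0.0), 1.0)  # clamp to [0, 1]
    index = min(int((1.0 - q) * len(SPECTRUM)), len(SPECTRUM) - 1)
    return SPECTRUM[index]
```

For an error-like quantity such as `L2`, where smaller is better, the value would first have to be converted to a quality of this kind, for instance by rescaling against a chosen maximum error.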

The red diagonal in the two figures below reflects the fact that self-comparisons lead to perfect matches.

Matrix of rho12 comparison similarities:

Matrix of L2 line alignment values:

The domains indexing the rows and columns of these matrices appear in
Class.Fold order.

The different classes are roughly visible,
namely:

**One can examine the various alignment files to detect shape
patterns.**

The following is a boolean matrix (with fattened dots for
visibility) indicating those alignments (other than self-alignments)
of length `6` or greater that respect backbone order, for
which `rho12` is at least `0.8` and for which the
maximum `L2` error permitted is `8`. The matrix
evidences some clusters near the diagonal as well as some reoccurring
motifs, a few of which are labeled in the figure.
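A thresholded matrix of this kind can be sketched as below. The data layout is hypothetical: `alignments` maps a pair of basis indices to a tuple `(rho12, l2, pairs)`, where `pairs` lists the matched secondary structure indices as `(given, target)`. "Respects backbone order" is interpreted here as the target indices increasing along with the given indices, which is an assumption about the intended definition.

```python
from typing import Dict, List, Optional, Tuple

Pairs = List[Tuple[int, int]]
Alignment = Tuple[float, float, Pairs]  # (rho12, l2, pairs), hypothetical layout


def respects_backbone_order(pairs: Pairs) -> bool:
    """True if target indices strictly increase when sorted by given index."""
    targets = [t for _, t in sorted(pairs)]
    return all(a < b for a, b in zip(targets, targets[1:]))


def boolean_matrix(
    alignments: Dict[Tuple[int, int], Alignment],
    n: int,
    min_length: int = 6,
    min_rho12: Optional[float] = 0.8,
    max_l2: float = 8.0,
) -> List[List[bool]]:
    """Mark pairs (i, j), i != j, whose top alignment passes all thresholds."""
    matrix = [[False] * n for _ in range(n)]
    for (i, j), (rho12, l2, pairs) in alignments.items():
        if i == j:
            continue  # self-alignments are excluded
        if len(pairs) < min_length or l2 > max_l2:
            continue
        if min_rho12 is not None and rho12 < min_rho12:
            continue
        if respects_backbone_order(pairs):
            matrix[i][j] = True
    return matrix
```

The defaults reproduce the thresholds stated above; other cutoffs, including dropping the `rho12` test by passing `min_rho12=None`, only require different arguments.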

Here is a different boolean matrix (again with fattened dots)
indicating those alignments (other than self-alignments)
of length `12` or greater that respect backbone order for which the
maximum `L2` error permitted is `4.5`.
Some familiar clusters appear.

Permission is granted to any individual or institution to use, copy, and/or distribute this material, provided that the complete contents of this webpage, including but not limited to the disclaimer, copyright, and permission notice, are maintained, intact, in all copies and supporting documentation.

Modified 29-April-2006 by Michael Erdmann