Basis Comparisons of the Astral 40 Set --- Spring 2006


Overview

The 5271 protein chains described on the line weavings page encompass 6646 domains. We list these domains here.

Suppose we focus on those single-chain domains in SCOP classes A, B, C, and D that contain at least six significant secondary structure elements. (By a "secondary structure element" we mean a helix or strand along with its approximating edge, as computed by the method outlined on the line weavings page. By "significant" we mean that a helix must contain at least 5 residues and a strand must contain at least 3 residues.)   There are 5032 such domains in 629 folds.
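
The size filter just described can be sketched in a few lines of Python. This is only an illustration, not the code used to build the data set: the Element record, the function names, and the representation of a domain as a list of elements are all assumptions made for the example, and the single-chain requirement is taken as already checked.

```python
from dataclasses import dataclass

# Thresholds taken from the text above.
MIN_HELIX_RESIDUES = 5
MIN_STRAND_RESIDUES = 3
MIN_SIGNIFICANT_ELEMENTS = 6
ALLOWED_SCOP_CLASSES = {"A", "B", "C", "D"}


@dataclass(frozen=True)
class Element:
    """Hypothetical record for one secondary structure element."""
    kind: str       # "helix" or "strand"
    residues: int   # number of residues in the element


def is_significant(element):
    """A helix needs at least 5 residues, a strand at least 3."""
    if element.kind == "helix":
        return element.residues >= MIN_HELIX_RESIDUES
    if element.kind == "strand":
        return element.residues >= MIN_STRAND_RESIDUES
    return False


def domain_qualifies(scop_class, elements):
    """True if the domain is in class A-D and has >= 6 significant elements."""
    if scop_class.upper() not in ALLOWED_SCOP_CLASSES:
        return False
    significant = sum(1 for e in elements if is_significant(e))
    return significant >= MIN_SIGNIFICANT_ELEMENTS
```

For example, a class A domain with six 5-residue helices qualifies, while the same domain with one helix shortened to 4 residues does not.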

For each of these 629 folds we computed a basis of at most 10 domains, chosen adaptively so that a large fraction of the domains in the fold are within some reasonable L2 error radius of a basis domain. This file describes the fold bases found.
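
The adaptive selection procedure is not spelled out on this page. As a rough sketch of one way such a basis could be chosen, the following Python function greedily picks the domain that covers the most still-uncovered domains within a given error radius, stopping at 10 basis domains. The function name, the l2_error callback, and the radius parameter are assumptions made for illustration, not the actual implementation.

```python
def greedy_basis(domains, l2_error, radius, max_size=10):
    """Greedy cover sketch (an assumed procedure, not the original code).

    domains  -- list of domain identifiers in one fold
    l2_error -- function (a, b) -> L2 error of the best alignment of a and b
    radius   -- a domain counts as covered if its error to some basis
                domain is at most this value
    Returns (basis, uncovered): the chosen basis domains in order of
    selection and the set of domains left uncovered.
    """
    uncovered = set(domains)
    basis = []
    while uncovered and len(basis) < max_size:
        best, best_covered = None, set()
        for candidate in domains:
            covered = {d for d in uncovered
                       if l2_error(candidate, d) <= radius}
            # Strict '>' keeps the earliest candidate on ties.
            if len(covered) > len(best_covered):
                best, best_covered = candidate, covered
        if best is None:
            break  # no candidate covers anything further
        basis.append(best)
        uncovered -= best_covered
    return basis, uncovered
```

With this kind of procedure a fold may end up with some domains outside the radius of every basis domain, which is consistent with a basis that covers only a large fraction of the fold.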

Suppose we further restrict attention to folds with at least 5 domains that satisfy the previous requirements. There are 201 such folds, consisting of 4263 domains. This file describes the union of the basis elements for these folds. There are 686 basis elements "covering" 4154 domains.


Links to Raw Comparison Data

This link points to a directory of directories, one for each of the 629 folds mentioned above.
Each directory in turn contains a set of files that cross-compare all members of that fold with each other.
These comparisons constitute the data by which we computed the basis elements for each fold.

This link points to a set of files that cross-compare all 686 basis elements for SCOP classes A,B,C,D with each other.
These comparisons represent a snapshot of the structural information contained in the PDB, from the perspective of our line weaving code.

Caution:  We used greedy algorithms in computing these alignments. An inner loop slides the backbones of the two domains being compared relative to each other, then uses the angle and distance information contained in the crossing files to filter alignments. An iterative-closest-point routine in the space of edges then greedily extends these alignments using bipartite-graph matching.
Generally, alignments with large rho12 similarity or low L2 error will be good alignments. This means that alignments of similar domains are likely to be present in our comparisons.
However, the greedy nature of this implementation means that alignments of small substructures may be missing. Similarly, symmetries may give rise to suboptimal alignments.


File Descriptions

For each given domain of interest there are four files that describe the comparison of that domain with all target domains of interest, either all domains in the same fold or all domains in the overall basis set. Embedded in each file name is the given domain's index and a descriptor of the file type, which is one of log, sim, align, or star. The semantics of the entries in each file are roughly as follows:

File Type: log
    Format of a typical entry: score1 [score2]  LinePairings  L2  {Bad/Sig:Tot}  Breaks
    Description: For each pair of compared domains, the file contains the top 5 alignments, in decreasing order of similarity.
    ("size requirement" refers to restrictions based on the domain size as well as the secondary structure sizes.)

File Type: sim
    Format of a typical entry: (see the file legend in one of the files)
    Description: The file presents in a different format the top-ranked alignment for each pair of domains compared.

File Type: align
    Format of a typical entry: TargetDomain  rho12  L2  LinePairings
    Description: The entries are abbreviations of the information shown for each top-ranked alignment in the log/sim file.
    The file's entries appear in order of the TargetDomain indices.

File Type: star
    Format of a typical entry: TargetDomain  rho12  L2  LinePairingStars
    Description: This is a slightly different presentation of the information contained in the align file.
    The entries are now sorted in decreasing order on rho12 and the LinePairings appear as stars *,
    indicating the relative backbone locations of the given domain's aligned secondary structures.
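
To make the relation between the align and star presentations concrete, here is a small Python sketch. It is not the code that produced the files, and it does not parse the real file syntax. It assumes each entry is already available as a tuple (target, rho12, l2, aligned_indices), where aligned_indices are the backbone positions of the given domain's aligned secondary structures. The "." filler for unaligned positions is also an assumption; only the "*" marker comes from the description above.

```python
def star_line(num_elements, aligned_indices, filler="."):
    """Render aligned backbone positions of the given domain as stars.

    num_elements    -- number of secondary structure elements in the
                       given domain, in backbone order
    aligned_indices -- 0-based positions that take part in the alignment
    filler          -- character for unaligned positions (assumed)
    """
    aligned = set(aligned_indices)
    return "".join("*" if i in aligned else filler
                   for i in range(num_elements))


def star_order(entries):
    """Re-sort align-style entries by decreasing rho12.

    Each entry is a tuple (target, rho12, l2, aligned_indices).
    Python's sort is stable, so ties keep their TargetDomain order.
    """
    return sorted(entries, key=lambda entry: entry[1], reverse=True)
```

For a given domain with six elements of which the first, third, and fourth are aligned, star_line(6, [0, 2, 3]) yields "*.**..".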


Recall: Breaks refers to the number of times the alignment LinePairings is non-sequential relative to the backbone.
The values "score1" and "[score2]" are used internally by the code; details are irrelevant, except that score1 is the inverse of rho12.
All comparisons consider only helices with at least 5 residues and strands with at least 3 residues.
Alignments with fewer than 4 aligned secondary structure lines were ignored, including during (pre-filtered) candidate alignment generation.
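
As an informal illustration of the Breaks count, the sketch below uses one plausible reading of "non-sequential": the pairings are ordered along the given domain's backbone, and a break is counted each time the paired target element steps backward. The representation of LinePairings as (given_index, target_index) tuples and the function name are assumptions. The precise definition is the one in the paper.

```python
def count_breaks(pairings):
    """Count backward steps in the target order of an alignment.

    pairings -- iterable of (given_index, target_index) tuples, where the
                indices are positions of secondary structure elements
                along each domain's backbone.
    """
    ordered = sorted(pairings)                 # order by given-domain index
    targets = [target for _, target in ordered]
    return sum(1 for prev, nxt in zip(targets, targets[1:]) if nxt < prev)
```

Under this reading, an alignment pairing elements 0, 1, 2 with 0, 1, 2 has no breaks, while pairing them with 2, 0, 1 has one.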

For more precise definitions, see the paper Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings.

Some Observations

The next several images show the cross-comparison of all 686 basis elements with each other, from a variety of perspectives.

The first two images show the rho12 and L2 values of the top-ranked alignments, using a spectrum of colors.
The color ordering from best to worst is:

     red    orange    yellow    green    cyan    blue    purple    black

Also, white means there is no alignment.

The red diagonal in the two figures below reflects the fact that self-comparisons lead to perfect matches.

Matrix of rho12 comparison similarities:


Matrix of L2 line alignment values:

The domains indexing the rows and columns of these matrices appear in Class.Fold order.
The different classes are roughly visible, namely:


One can examine the various alignment files to detect shape patterns.

The following is a boolean matrix (with fattened dots for visibility) indicating those alignments (other than self-alignments)
of length 6 or greater that respect backbone order, with rho12 at least 0.8 and L2 error at most 8.
The matrix shows some clusters near the diagonal as well as some recurring motifs,
a few of which are labeled in the figure.




Here is a different boolean matrix (again with fattened dots) indicating those alignments (other than self-alignments)
of length 12 or greater that respect backbone order and have L2 error at most 4.5.
Some familiar clusters appear.
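
Boolean matrices of this kind can be derived from the top-ranked alignments with a simple filter. The Python sketch below is an illustration under stated assumptions, not the code that produced the figures: the Alignment record and its field names are hypothetical, and "respects backbone order" is interpreted here as a Breaks count of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    """Hypothetical record for one top-ranked alignment."""
    given: int     # row index of the given domain
    target: int    # column index of the target domain
    rho12: float   # similarity
    l2: float      # L2 error
    length: int    # number of aligned secondary structure lines
    breaks: int    # assumed: 0 means the alignment respects backbone order


def boolean_matrix(alignments, size, min_length, max_l2, min_rho12=None):
    """Mark entry (given, target) True for alignments passing the filter.

    Self-alignments are skipped. The rho12 threshold is optional because
    only some of the matrices described above use one.
    """
    matrix = [[False] * size for _ in range(size)]
    for a in alignments:
        if a.given == a.target:
            continue
        if a.length < min_length or a.breaks != 0 or a.l2 > max_l2:
            continue
        if min_rho12 is not None and a.rho12 < min_rho12:
            continue
        matrix[a.given][a.target] = True
    return matrix
```

With these assumptions, a matrix for alignments of length 6 or greater with rho12 at least 0.8 and L2 error at most 8 corresponds to boolean_matrix(alignments, 686, min_length=6, max_l2=8, min_rho12=0.8), and one for length 12 or greater with L2 error at most 4.5 to boolean_matrix(alignments, 686, min_length=12, max_l2=4.5).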





Disclaimer

The data contained in this set of webpages and the files to which they point are approximations, possibly faulty, and are provided "as is", without express or implied warranty of any kind.

Copyright (C) 2006 by Michael A. Erdmann.

Permission is granted to any individual or institution to use, copy, and/or distribute this material, provided that the complete contents of this webpage, including but not limited to the disclaimer, copyright, and permission notice, are maintained, intact, in all copies and supporting documentation.


Modified 29-April-2006 by
Michael Erdmann