Astral40 Weavings and Overlays -- Basis Comparisons Page

Suppose we focus on those single-chain domains in SCOP classes A, B, C, and D that contain at least six significant secondary structure elements. (By a "secondary structure element" we mean a helix or strand along with its approximating edge, as computed by the method outlined on the line weavings page. By "significant" we mean that a helix must contain at least 5 residues and a strand must contain at least 3 residues.) There are 5032 such domains in 629 folds.
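
The selection rule above can be sketched as a small filter. This is a minimal illustration, not the code used to build the data set: the `Element` record and the function names are hypothetical, and it assumes each element has already been labeled as a helix or a strand with a residue count.

```python
from dataclasses import dataclass

# Thresholds quoted in the text above.
MIN_HELIX_RESIDUES = 5
MIN_STRAND_RESIDUES = 3
MIN_SIGNIFICANT_ELEMENTS = 6


@dataclass(frozen=True)
class Element:
    """Hypothetical record for one secondary structure element."""
    kind: str      # "helix" or "strand"
    residues: int  # number of residues in the element


def is_significant(element: Element) -> bool:
    """A helix needs at least 5 residues, a strand at least 3."""
    if element.kind == "helix":
        return element.residues >= MIN_HELIX_RESIDUES
    if element.kind == "strand":
        return element.residues >= MIN_STRAND_RESIDUES
    return False  # anything else is not counted


def domain_qualifies(elements: list[Element]) -> bool:
    """True when the domain has at least six significant elements."""
    significant = sum(1 for e in elements if is_significant(e))
    return significant >= MIN_SIGNIFICANT_ELEMENTS
```

A domain with five long helices and one 2-residue strand would be rejected under this reading, because the short strand does not count toward the six.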

For each of these 629 folds we computed a basis of at most 10 domains, chosen adaptively so that a large fraction of the domains in the fold are within some reasonable L2 error radius of a basis domain. This file describes the fold bases found.
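
The page does not spell out the adaptive selection rule, so the following is only one plausible reading: a greedy cover that repeatedly picks the domain lying within the error radius of the most still-uncovered domains. The function name, the `radius` and `target_fraction` parameters, and the square-matrix layout of the L2 errors are all assumptions made for illustration.

```python
def choose_basis(l2, radius, max_basis=10, target_fraction=1.0):
    """Greedy cover sketch (an assumed rule, not the authors' documented one).

    l2[i][j] is the L2 error of the best alignment between domains i and j
    of one fold, with l2[i][i] == 0. Returns (basis indices, uncovered set).
    """
    n = len(l2)
    uncovered = set(range(n))
    basis = []
    while (uncovered
           and len(basis) < max_basis
           and (n - len(uncovered)) / n < target_fraction):
        # Pick the candidate that covers the most uncovered domains.
        best = max(
            range(n),
            key=lambda c: sum(1 for d in uncovered if l2[c][d] <= radius),
        )
        gained = {d for d in uncovered if l2[best][d] <= radius}
        if not gained:
            break  # no candidate covers anything new
        basis.append(best)
        uncovered -= gained
    return basis, uncovered
```

With two tight pairs of domains that are far from each other, this sketch returns one basis element per pair. Lowering `max_basis` below the number of clusters leaves some domains uncovered, which is consistent with a basis that covers a large fraction of a fold rather than all of it.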

Suppose we narrow our focus to folds with at least 5 domains that satisfy the requirements above. There are 201 such folds, consisting of 4263 domains. This file describes the union of the basis elements for these folds. There are 686 basis elements "covering" 4154 domains.
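
The counts above imply a coverage rate and an average basis size, which the short calculation below makes explicit. It uses only the figures quoted in the text; the variable names are illustrative.

```python
# Figures quoted in the text above.
folds = 201
domains = 4263
basis_elements = 686
covered = 4154

uncovered = domains - covered            # domains not covered by any basis element
coverage = covered / domains             # fraction of domains covered
basis_per_fold = basis_elements / folds  # average basis size per fold

print(f"uncovered domains: {uncovered}")              # 109
print(f"coverage: {coverage:.1%}")                    # 97.4%
print(f"basis elements per fold: {basis_per_fold:.2f}")  # 3.41
```

So roughly 97.4% of the domains are covered, with an average of about 3.4 basis elements per fold, well under the cap of 10.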

Links to Raw Comparison Data

This link points to a directory of directories, one for each of the 629 folds mentioned above.
Each directory in turn contains a set of files that cross-compare all members of that fold with each other.
These comparisons constitute the data by which we computed the basis elements for each fold.

This link points to a set of files that cross-compare all 686 basis elements for SCOP classes A,B,C,D with each other.
These comparisons represent a snapshot of the structural information contained in the PDB, from the perspective of our line weaving code.

Caution: We used greedy algorithms in computing these alignments. An inner loop slides the backbones of the two domains being compared relative to each other, then uses the angle and distance information contained in the crossing files to filter alignments. An iterative-closest-point routine in the space of edges then greedily extends these alignments using bipartite-graph matching.
Generally, alignments with large rho12 similarity or low L2 error will be good alignments. This means that alignments of similar domains are likely to be present in our comparisons.
However, the greedy nature of this implementation means that alignments of small substructures may be missing. Similarly, symmetries may give rise to suboptimal alignments.
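
To see why greedy choices can yield suboptimal pairings, consider the toy comparison below. It is not the alignment routine used to produce the data: the cost matrix is invented, and the two functions are hypothetical stand-ins that contrast a cheapest-first pairing with an exhaustive search over all one-to-one pairings.

```python
from itertools import permutations


def greedy_pairing(cost):
    """Repeatedly take the cheapest remaining (row, column) pair."""
    candidates = sorted(
        (cost[r][c], r, c)
        for r in range(len(cost))
        for c in range(len(cost[0]))
    )
    used_rows, used_cols, chosen = set(), set(), []
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            chosen.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return sorted(chosen)


def optimal_pairing(cost):
    """Exhaustive search over all pairings (square matrix, tiny inputs only)."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[r][p[r]] for r in range(n)),
    )
    return [(r, best[r]) for r in range(n)]


def total_cost(cost, pairs):
    return sum(cost[r][c] for r, c in pairs)


# Invented costs: the cheapest single pair forces an expensive leftover pair.
cost = [[1, 2],
        [2, 100]]
print(total_cost(cost, greedy_pairing(cost)))   # 101
print(total_cost(cost, optimal_pairing(cost)))  # 4
```

The greedy pass commits to the cheapest pair first and is then stuck with the cost-100 pair, while the exhaustive search finds the pairing with total cost 4. The same effect, at a larger scale, is one way an alignment of a small substructure can be missed.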

File Descriptions

For each given domain of interest, four files describe the comparison of that domain with all target domains of interest, which are either all domains in the same fold or all domains in the overall basis set. Each file name embeds the given domain's index and a descriptor of the file type, which is one of log, sim, align, or star. The semantics of entries in each file are roughly as follows:

File Type	Format of a typical entry	Description
`log`	`score1 [score2] LinePairings L2 {Bad/Sig:Tot} Breaks`	For each pair of compared domains, the file contains the top 5 alignments, in decreasing order of similarity. ("size requirement" refers to restrictions based on the domain size as well as the secondary structure sizes.)
`sim`	(see the file legend in one of the files)	The file presents in a different format the top-ranked alignment for each pair of domains compared.
`align`	`TargetDomain rho12 L2 LinePairings`	The entries are abbreviations of the information shown for each top-ranked alignment in the `log`/`sim` file. The file's entries appear in order of the `TargetDomain` indices.
`star`	`TargetDomain rho12 L2 LinePairingStars`	This is a slightly different presentation of the information contained in the `align` file. The entries are now sorted in decreasing order on `rho12` and the `LinePairings` appear as stars `*`, indicating the relative backbone locations of the given domain's aligned secondary structures.
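
The relationship between the `align` and `star` orderings can be sketched as follows. The exact on-disk syntax is not given here, so this assumes whitespace-separated fields in the order `TargetDomain rho12 L2 LinePairings`, with everything after the third field kept as an opaque pairing string. `AlignEntry`, `parse_align_line`, and `star_order` are hypothetical names, and the sample lines in the usage note are made up.

```python
from typing import NamedTuple


class AlignEntry(NamedTuple):
    target: str    # TargetDomain index, kept as text
    rho12: float   # similarity score
    l2: float      # L2 error
    pairings: str  # remaining fields, left unparsed


def parse_align_line(line: str) -> AlignEntry:
    """Split one entry, assuming whitespace-separated fields."""
    target, rho12, l2, *rest = line.split()
    return AlignEntry(target, float(rho12), float(l2), " ".join(rest))


def star_order(entries: list[AlignEntry]) -> list[AlignEntry]:
    """Reorder entries by decreasing rho12, as the star file does."""
    return sorted(entries, key=lambda e: e.rho12, reverse=True)
```

For example, parsing the invented lines `"1 0.42 7.9 a"`, `"2 0.97 1.3 b"`, and `"3 0.80 4.0 c"` and passing the result to `star_order` yields the targets in the order 2, 3, 1.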

Some Observations

The next several images show the cross-comparison of all 686 basis elements with each other, from a variety of perspectives.

The first two images show the rho12 and L2 values of the top-ranked alignments, using a spectrum of colors.
The color ordering from best to worst is:

The red diagonal in the two figures below reflects the fact that self-comparisons lead to perfect matches.

The domains indexing the rows and columns of these matrices appear in Class.Fold order.
The different classes are roughly visible, namely:

The following is a boolean matrix (with fattened dots for visibility) indicating those alignments (other than self-alignments) of length 6 or greater that respect backbone order, have rho12 of at least 0.8, and have L2 error of at most 8.
The matrix shows some clusters near the diagonal as well as some recurring motifs, a few of which are labeled in the figure.

Here is a different boolean matrix (again with fattened dots) indicating those alignments (other than self-alignments) of length 12 or greater that respect backbone order and have L2 error of at most 4.5.
Some familiar clusters appear.
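
Both boolean matrices above amount to thresholding the top-ranked alignment of each pair of basis elements. The sketch below shows that filter under stated assumptions: the `Alignment` record and its field names are hypothetical, a missing alignment is represented by `None`, and "length" is taken to be the number of paired secondary structure elements.

```python
from typing import NamedTuple, Optional


class Alignment(NamedTuple):
    """Hypothetical summary of one top-ranked alignment."""
    length: int                    # number of paired elements (assumed meaning)
    respects_backbone_order: bool
    rho12: float
    l2: float


def passes(a: Alignment, min_length: int, max_l2: float,
           min_rho12: Optional[float] = None) -> bool:
    """Apply the length, backbone-order, L2, and optional rho12 thresholds."""
    return (a.length >= min_length
            and a.respects_backbone_order
            and a.l2 <= max_l2
            and (min_rho12 is None or a.rho12 >= min_rho12))


def boolean_matrix(top, min_length, max_l2, min_rho12=None):
    """top[i][j] is the top-ranked Alignment of i against j, or None."""
    n = len(top)
    return [
        [i != j                      # skip self-alignments
         and top[i][j] is not None   # no alignment recorded
         and passes(top[i][j], min_length, max_l2, min_rho12)
         for j in range(n)]
        for i in range(n)
    ]
```

Under these assumptions, the first matrix corresponds to `boolean_matrix(top, min_length=6, max_l2=8.0, min_rho12=0.8)` and the second to `boolean_matrix(top, min_length=12, max_l2=4.5)`, which applies no rho12 threshold.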

Disclaimer

Permission is granted to any individual or institution to use, copy, and/or distribute this material, provided that the complete contents of this webpage, including but not limited to the disclaimer, copyright, and permission notice, are maintained, intact, in all copies and supporting documentation.

Basis Comparisons of the Astral 40 Set --- Spring 2006
