Language Technologies Institute
Student Research Symposium 2006

Lengths of Antigen Binding Sites across PDB Data

Steve Gardiner with Andrew Walsh and Roni Rosenfeld

Proteins are sequences of symbols over the alphabet of the twenty amino acids, called residues. Protein sequences fold in space, producing complex three-dimensional structures in which some subsequences are entwined in the interior of the structure while other subsequences (words) are presented on the exterior. Antigens are proteins which are targeted by an organism's immune system; for each antigen a recognizer protein, called an antibody, must be constructed. An antibody must be equipped with a hard-coded sequence of residues which will properly bond to some part of the antigen in order to allow the immune system to dispose of the antigen. An antibody does not generally need to recognize (bind to) the entire sequence of symbols in the antigen, but rather to a subset of the words which must be on the exterior of the antigen protein.

Binding sites between antigens and antibodies are of crucial importance to the function of the human immune system. The characteristics of an antigen-antibody binding site have traditionally been examined in the context of a single antigen-antibody complex. The Protein Data Bank (PDB) website [1], a worldwide repository for structural biology data, makes available the specifications of thousands of antigen-antibody complexes. This makes possible a lateral study of all available antibody-antigen binding sites which can take advantage of computational methods over PDB's large and growing corpus.

Here we characterize the antigen-antibody binding site across all complexes available from PDB. Specifically, we analyze the number of contiguous residues which participate in the binding site on the antigen side. We examine the lengths of the words and the distribution of the lengths in the PDB corpus. We begin by culling antibody-antigen complexes from the PDB using queries and post-processing, then detect and eliminate various kinds of noise implicit in the database. We approximate the antigen residues which participate in the binding site by antigen residues which are spatially close to residues in the antibody, i.e. within some distance threshold ε. We discuss bounds on the optimal value for ε, obtained from physical knowledge about the bonds as well as from empirical data from PDB.
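The contact approximation and word-length count described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: each residue is reduced to a single representative coordinate, distance is plain Euclidean distance, and the names `contact_residues`, `run_lengths`, and the toy coordinates are hypothetical.

```python
from collections import Counter
from math import dist


def contact_residues(antigen, antibody, eps):
    """Return the sorted sequence positions of antigen residues that lie
    within eps of at least one antibody residue.

    antigen  -- dict mapping sequence position -> (x, y, z)   (assumed layout)
    antibody -- iterable of (x, y, z) coordinates             (assumed layout)
    eps      -- distance threshold, same units as the coordinates
    """
    return sorted(
        pos
        for pos, xyz in antigen.items()
        if any(dist(xyz, ab_xyz) <= eps for ab_xyz in antibody)
    )


def run_lengths(positions):
    """Return the lengths of maximal runs of consecutive integers in a
    sorted list of positions. Each run is one contiguous 'word'."""
    lengths = []
    run = 0
    prev = None
    for pos in positions:
        if prev is not None and pos == prev + 1:
            run += 1  # extends the current word
        else:
            if run:
                lengths.append(run)  # close the previous word
            run = 1  # start a new word
        prev = pos
    if run:
        lengths.append(run)
    return lengths


# Toy data (invented for illustration only, not taken from PDB).
antigen = {
    1: (0.0, 0.0, 0.0),
    2: (1.0, 0.0, 0.0),
    3: (2.0, 0.0, 0.0),
    4: (10.0, 0.0, 0.0),
    5: (20.0, 0.0, 0.0),
    6: (21.0, 0.0, 0.0),
}
antibody = [(1.0, 1.0, 0.0), (20.0, 1.0, 0.0)]

contacts = contact_residues(antigen, antibody, eps=1.5)
word_lengths = run_lengths(contacts)
length_distribution = Counter(word_lengths)
```

With this toy input, positions 1 to 3 and 5 to 6 fall within the threshold, giving two words of lengths 3 and 2. Varying `eps` changes which residues count as contacts, which is why bounds on the threshold matter for the resulting length distribution.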

Our results indicate that the words on the antigen chain which are recognized by the corresponding antibody are generally quite short. This indicates that the antigen "signature" is not generally local to a subsequence of the linear sequence of residues: antigen structure seems to play a large role in antibody-antigen recognition.

[1] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucl. Acids Res., 28(1):235-242, 2000.

[2] M. Michael Gromiha and S. Selvaraj. Inter-residue Interactions in Protein Folding and Stability. Prog. Biophys. Mol. Biol., 86:235-277, 2004.