Annotated Primary Reference List



Bajorath, J., R. Stenkamp, et al. (1993). "Knowledge-based model building of proteins: Concepts and examples." Protein Science 2: 1798-1810.
 

This is the best overall procedural review and outlines the steps in the process of building a model. The steps are:

1. Structure-based sequence alignment of the unknown with one or more template proteins. This yields a definition of which parts of the unknown sequence constitute the secondary structural core of the protein and which are the loops and turns. Coordinates are then assigned to the backbone of the core. When more than one possible template structure exists, core assignments can be made with somewhat greater confidence. There are problems here, though, some of which are discussed later in the paper. This is far and away the most difficult and most crucial step in the whole process. Some good examples: refs.

2. The backbones of loops and turns are generated next. These are generally less conserved than the protein core. One can search for potential candidates of the same length as that of the unknown in the PDB, rejecting those which collide with the core. (Insight II can do this.) When the loop search yields no good candidates, various types of conformational searches are used: refs.

3. Side chains are added next. Several options exist ranging from using template side chains where they are the same as those of the unknown to combinatorial methods which use rotamer libraries of side chain conformations: refs.

4. Model refinement is used to eliminate steric overlaps and other forms of strain, improve packing, etc. Energy minimization is used for this, although it has some problems and must be used with care.

5. Model evaluation: solvent accessible surface area, electrostatic matters, volume packing, etc. One can usually pick out areas of greater and lesser reliability in the resulting model, and one should....
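
The loop search in step 2 (candidates of the right length, rejected if they collide with the core) can be sketched roughly as follows. This is only an illustration, not the procedure used by Insight II or by the authors. The function names, the one-point-per-residue representation, and the 3.0 angstrom clash cutoff are all assumptions made for the example.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]  # one representative atom per residue (assumed)


def clashes(loop: Sequence[Point], core: Sequence[Point], min_dist: float = 3.0) -> bool:
    """Return True if any loop point lies closer than min_dist to a core point."""
    return any(math.dist(a, b) < min_dist for a in loop for b in core)


def filter_loops(candidates: List[Sequence[Point]], core: Sequence[Point],
                 length: int, min_dist: float = 3.0) -> List[Sequence[Point]]:
    """Keep candidates of the required length that do not collide with the core."""
    return [c for c in candidates
            if len(c) == length and not clashes(c, core, min_dist)]


# Toy coordinates, invented purely for illustration.
core = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
candidates = [
    [(0.0, 5.0, 0.0), (3.8, 5.0, 0.0), (7.6, 5.0, 0.0)],  # right length, clear of core
    [(0.0, 1.0, 0.0), (3.8, 5.0, 0.0), (7.6, 5.0, 0.0)],  # right length, collides
    [(0.0, 5.0, 0.0), (3.8, 5.0, 0.0)],                   # wrong length
]
kept = filter_loops(candidates, core, length=3)
print(len(kept), "candidate(s) survive the filter")
```

A real search would also have to fit the loop ends onto the flanking core residues and would use all backbone atoms; the sketch covers only the length and collision tests mentioned above.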

Helpful supplementary aids are discussed. The inverse folding problem is an example: given a 3D backbone structure, one can ask, "Which sequences are compatible with this structure?" If yours happens to be one of them, you're in good shape. This can be done for parts of the structure which form relatively independent packing units. refs.

The authors discuss two specific examples of models built in their lab, including experimental tests.

Some of their observations:
Homology, strictly speaking, = common evolutionary origin; similarity may include convergent evolution, as well
Methods of comparing 3D structures
Secondary structure predictions, alignments of similar sequences -> sec. str. assignments: Average accuracy/residue ~62-70%
~50% of new 3D structures are related to known folds
Number of families with distinct topology = 500-700
Thornton's group reports 112 "nonanalogous" folds as of 1992. 150 folds with more stringent criteria. Convergent evolution often yields similar 3D structures. Assessment of topological differences among them is important for assignment to fold types.
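
The per-residue accuracy figure quoted above for secondary structure prediction is, in its simplest form, the percentage of positions where the predicted state matches the observed one. A minimal sketch, assuming a three-state alphabet (H = helix, E = strand, C = other) and two invented strings; the function name is hypothetical and the paper may define the measure differently:

```python
def per_residue_accuracy(predicted: str, observed: str) -> float:
    """Percentage of positions where predicted and observed states agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have the same length")
    if not observed:
        raise ValueError("empty input")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Invented example strings, 17 residues each; 14 of the 17 positions agree.
observed = "CCHHHHHHCCEEEEECC"
predicted = "CCHHHHCCCCEEEECCC"
accuracy = per_residue_accuracy(predicted, observed)
print(f"{accuracy:.1f}% of residues predicted correctly")
```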
 


Chothia, C. and A. V. Finkelstein (1990). "The classification and origins of protein folding patterns." Annu. Rev. Biochem. 59: 1007-39.
 
 

Cosenza, L., A. Rosenbach, et al. (2000). "Comparative model building of interleukin-7 using interleukin-4 as a template: A structural hypothesis that displays atypical surface chemistry in helix D important for receptor activation." Protein Science 9: 916-926.
 
 

Eisenhaber, F., B. Persson, et al. (1995). "Protein structure prediction: recognition of primary, secondary, and tertiary structural features from amino acid sequence." Critical Reviews in Biochemistry and Molecular Biology 30(1): 1-94.
 
 

Fasman, G. D., Ed. (1989). Prediction of protein structure and the principles of protein conformation. New York and London, Plenum Press.
 

Gerstein, M. and M. Levitt (1997). "A structural census of the current population of protein sequences." Proc. Natl. Acad. Sci. USA 94: 11911-11916.
 

A good short review.
"Most of the common folds are associated with many families of nonhomologous sequences (i.e. >10 sequence families for each common fold)." This can arise either from widely divergent evolution or from convergent evolution. Therefore, if a fold is identified for an unknown protein, be careful when building a multiple sequence alignment not to include proteins from families that are not homologous to one's own.

As of 1997 there were >140,000 protein sequences known.... They used the 142,737 sequences in the OWL composite database (April, 1996, refs. 17, 18). Removed various sequences for various reasons, ending with 120,068 to analyze.

Used FASTA for all classification & sequence analysis, using "conservative" thresholds to find homologies. Those sequences having structural homologues were analyzed with SCOP, which divided the 4,432 3D structures in the PDB at the time into 8,330 domains and these into 318 fold families. Each sequence was then assigned a "fold identifier" number: a "molecular part number." The distribution of these was examined across the taxa.

It is interesting that other structural classifications also yield ~300 fold families: CATH, FSSP, Entrez-MMDB, LPFC.

Most folds have ~130 homologues. Top 7 have >1,000 (Fig. 2)
Top 25 match 61% of sequences which have structural analogues.
Some of these perform a single function, e.g. Rossmann NAD domain
Others act as "multipurpose parts" for multiple functions. Many folds (125) are associated with more than one sequence family. Convergent evolution?

The most common folds are present in all taxa, but their distributions vary among taxa. (See Fig. 4 for Venn diagram) These differences in distribution can be useful in classifying new sequences in that they may allow one to rule out folds known to be absent from the taxon to which the organism belongs.

Detailed analysis of the Haemophilus influenzae genome shows approximately the same distributions as for the whole set. This result addresses in part the problem of bias in the databases. The H. influenzae genome is at www.tigr.org

Above covers 37,706 sequences having structural analogues in the PDB. 82,362 sequences, however, lack homologues. Distributions for these are very similar to those having analogues with a few exceptions. Implications for extrapolation of findings.... Tables of data available at http://bioinfo.mbb.yale.edu/census
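
The counts quoted in these notes can be cross-checked with a little arithmetic. The sketch below uses only the numbers given above; the variable names are mine.

```python
# Counts as quoted in the notes on Gerstein and Levitt (1997).
total_in_owl = 142_737       # sequences in the OWL composite database
analyzed = 120_068           # sequences left after removals
with_structure = 37_706      # sequences with a structural match in the PDB
without_structure = 82_362   # sequences without one

removed = total_in_owl - analyzed
fraction_with_structure = with_structure / analyzed

# The two subsets should add up to the analyzed set.
assert with_structure + without_structure == analyzed

print(f"removed before analysis: {removed}")
print(f"share with a structural match: {100 * fraction_with_structure:.1f}%")
```

So a little under a third of the analyzed sequences had a structural match, which is why the comparison with the unmatched two thirds matters for extrapolating the findings.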

Refs to FASTA 2.0 usage. Also to other methods of comparing: profiles, hidden Markov, threading.

Class predictions made using sec. str. predictions (GOR method) show approximately 80% agreement with SCOP; results are comparable for the PHD server.

Greer, J. (1990). "Comparative modeling methods; application to the family of the mammalian serine proteases." Proteins: Struct. Funct. Genet. 7: 317-334.
 
 

Hadley, C. and D. T. Jones (1999). "A systematic comparison of protein structure classifications: SCOP, CATH, and FSSP." Structure 7: 1099-1112.
 
 

Honig, B. (1999). "Protein folding: from the Levinthal paradox to structure prediction." J. Mol. Biol. 293: 283-293.
 
 

Jennings, A., C. Edge, et al. (2001). "An approach to improving multiple alignments of protein sequences using predicted secondary structure." Protein Engineering 14: 227-231.
 
 

Kabsch, W. and C. Sander (1983). "Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features." Biopolymers 22: 2577-2637.
 
 

Martin, A., K. Toda, et al. (1995). "Long loops in proteins." Protein Engineering 8: 1093-1101.
 
 

Mosimann, S., R. Meleshko, et al. (1995). "A critical assessment of comparative molecular modeling of tertiary structures of proteins." Proteins: Structure, Function, and Genetics 23: 301-317.
 
 

Nicholas, H., D. Deerfield, et al. (2000). "Strategies for Searching Sequence Databases." BioTechniques 28: 1174-1191.
 
 

Pearson, W. (1995). "Comparison of methods for searching protein sequence databases." Protein Science 4: 1145-1160.
 
 

Pontius, J., J. Richelle, et al. (1996). "Deviations from standard atomic volumes as a quality measure for protein crystal structures." J. Mol. Biol. 264: 121-136.
 
 

Rodriguez, R., G. Chinea, et al. (1998). "Homology modeling, model and software evaluation; three related resources." Bioinformatics 14: 523-528.
 
 

Rost, B. and C. Sander (1994). "Conservation and prediction of solvent accessibility in protein families." Proteins: Struct. Funct. Genet. 20: 216-226.
 
 

Rost, B. and C. Sander (1996). "Bridging the protein sequence-structure gap by structure predictions." Annu. Rev. Biophys. Biomol. Struct. 25: 113-136.
 

The review examines "generic" methods for prediction at three levels: one, two, and three dimensional. It deals only with methods available for automatic prediction, which could analyze large numbers of sequences, e.g. whole chromosomes or whole genomes.
One dimensional methods: Sequence alignment: FASTA, BLAST, Failures
When identity >30%, proteins have homologous 3D structures. Converse is not necessarily true, as sequences with lower identity can still have similar structures.
Use of information from 3D structures can help a lot, especially when sequence homology is low. Profile-based alignments are sensitive and fast enough for large-scale use. See refs. 30, 80, 79 (MAXHOM program).

Authors note that very few papers develop measures of the quality of alignments, making comparisons of methods difficult. Remote homologues can be aligned correctly for large sequence families, but most sequences with <25% identity will be false positives.

They give a good, critical discussion of cross validation of methods, making clear the problems and warning against using a test data set to help optimize the training set. Applies especially to secondary structure prediction, as well as to 3D methods. Test sets and training sets should have < 25% identity, etc.
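
The requirement that test and training sets share less than 25% identity can be enforced with a simple filter. A minimal sketch, assuming pairwise identities have already been computed by some alignment program; the protein names and identity values are invented, and the function name is hypothetical.

```python
from typing import Dict, List, Tuple


def select_test_set(candidates: List[str], training: List[str],
                    identity: Dict[Tuple[str, str], float],
                    cutoff: float = 25.0) -> List[str]:
    """Keep candidates whose identity to every training protein is below cutoff."""
    return [c for c in candidates
            if all(identity[(c, t)] < cutoff for t in training)]


# Invented names and percent identities, for illustration only.
training = ["train_1", "train_2"]
candidates = ["cand_a", "cand_b", "cand_c"]
identity = {
    ("cand_a", "train_1"): 12.0, ("cand_a", "train_2"): 18.5,
    ("cand_b", "train_1"): 41.0, ("cand_b", "train_2"): 9.0,
    ("cand_c", "train_1"): 24.9, ("cand_c", "train_2"): 25.0,
}
test_set = select_test_set(candidates, training, identity)
print(test_set)
```

The filter uses a strict inequality, so a candidate sitting exactly at the cutoff is excluded. Note that this only separates the two sets from each other; it does nothing about redundancy within either set.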

Secondary structure prediction: Good discussion of uses and limitations, and many good references. They do slight earlier methods, whose lower accuracy stems from the use of fewer proteins to train the method, since fewer 3D structures were available to their developers than to those of more recent methods. The authors do make a limited attempt to deal with this question in Figure 4, though.

Solvent accessibility as part of prediction: Accessibility is conserved at each position, which leads to methods of using it for prediction. Several methods are presented in Figure 6, with good references.

Transmembrane helices are discussed with references.

Two dimensional predictions, i.e. whether residues not close in the primary sequence are in contact: Divided into interresidue contacts (including distance geometry), interstrand contacts, and intercysteine contacts (for disulfides). None are terribly good as predictive methods.

Three dimensional methods: homology modeling & threading
Homology modeling: When identity >30%, the folds are identical but loops may vary. When the identity is at least 60%, results are good, and for higher values as good as experimental 3D structure determination. ref. Overall discussion is rather thin.

Threading: Much potential. Too early to be of much use as of this writing (1996).
Analysis & evaluation of models: Brief but useful discussion. Several potentially good references for evaluation procedures.

Rost, B. and A. Valencia (1996). "Pitfalls of protein sequence analysis." Current Opinion in Biotechnology 7: 457-461.
 
An excellent short review. Compulsory reading. Many good references.
"The success of alignment programs relies on evolutionary connections between homologous proteins: if 24 out of 80 aligned residues (i.e. 30%: more for shorter matches) are identical, between two naturally evolved proteins, the two have similar 3D structures and similar functions." Several references are cited. It would be interesting to know the strength of this statement, especially with respect to function. One should also be aware that the converse need not be true: fewer than 30% matches does not necessarily rule out similarity of structure or of function. The strength of the statement may be in that it makes no attempt to rule out the converse....

Note that this review was published in 1996 and therefore written in 1995 or early 1996. It is mostly on target, but there has been progress since then.
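
The "24 out of 80 aligned residues" criterion quoted above is a statement about percent identity over the aligned region. A minimal sketch of that calculation follows, assuming the two sequences are already aligned and that identity is counted over columns where neither sequence has a gap. Conventions for the denominator vary, so this is one choice among several; the function name and the toy sequences are mine.

```python
from typing import Tuple


def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> Tuple[int, int, float]:
    """Return (identical, aligned, percent) over columns with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    identical = sum(a == b for a, b in pairs)
    percent = 100.0 * identical / len(pairs) if pairs else 0.0
    return identical, len(pairs), percent


# Artificial pair built to reproduce the quoted case: 24 identical out of 80 aligned.
seq_a = "A" * 24 + "G" * 56
seq_b = "A" * 24 + "L" * 56
identical, aligned, percent = percent_identity(seq_a, seq_b)
print(f"{identical} of {aligned} aligned residues identical ({percent:.0f}%)")
```

As the quotation itself warns, the threshold rises for shorter matches, so a single fixed cutoff like this should not be applied to short alignments.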
 


Siew, N., A. Elofsson, et al. (2000). "MaxSub: an automated measure for the assessment of protein structure prediction quality." Bioinformatics 16: 776-785.