Helix algorithm:
As described previously (Kahn, 1989a), the helix axis algorithm finds the axis of a helix using vector algebra. An axis segment is generated for each set of four consecutive amino acids. The angle formed by the first three alpha-carbons (CA1, CA2, CA3) is bisected, and then the angle formed by the second three alpha-carbons (CA2, CA3, CA4) is bisected. Since, in an ideal helix, these two bisecting vectors pass through the helix axis at right angles to it, their cross product gives the direction cosines of the axis. The angle between the two vectors gives the angle of rotation per residue, and dividing 360 by this angle (in degrees) gives the number of residues per turn. The points where the two vectors intersect the axis, which lie opposite residues CA2 and CA3, are the end-points of the axis segment for that tetrad. The distance between the two end-points is the rise per residue, while the distance from an alpha-carbon to the helix axis gives the radius of the helix. Multiplying the rise by the residues per turn gives the pitch.
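The vector operations above can be sketched in a few lines of Python. This is only an illustration of the geometry as described in this section, not the GASP source code (which was written in FORTRAN); the function name `helix_tetrad`, the dictionary keys, and the radius formula (derived from the chord CA2-CA3 projected perpendicular to the axis) are assumptions made for the example.

```python
import math

def sub(a, b): return tuple(x - y for x, y in zip(a, b))
def add(a, b): return tuple(x + y for x, y in zip(a, b))
def scale(a, s): return tuple(x * s for x in a)
def dot(a, b): return sum(x * y for x, y in zip(a, b))
def unit(a): return scale(a, 1.0 / math.sqrt(dot(a, a)))
def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def helix_tetrad(ca1, ca2, ca3, ca4):
    """Axis segment and helical parameters for one tetrad (hypothetical helper).

    Degenerate input (e.g. collinear alpha-carbons) raises ZeroDivisionError.
    """
    # Unit bisectors of the angles CA1-CA2-CA3 and CA2-CA3-CA4.
    v1 = unit(add(unit(sub(ca1, ca2)), unit(sub(ca3, ca2))))
    v2 = unit(add(unit(sub(ca2, ca3)), unit(sub(ca4, ca3))))
    # The cross product of the bisectors gives the axis direction cosines.
    axis = unit(cross(v1, v2))
    cos_t = max(-1.0, min(1.0, dot(v1, v2)))
    angle = math.degrees(math.acos(cos_t))      # rotation per residue
    step = sub(ca3, ca2)
    rise = dot(step, axis)                      # negative for a left-handed turn
    # The chord CA2-CA3 seen down the axis has length 2 r sin(angle / 2).
    perp_sq = max(0.0, dot(step, step) - rise * rise)
    radius = math.sqrt(perp_sq / (2.0 * (1.0 - cos_t)))
    residues_per_turn = 360.0 / angle
    return {
        "axis": axis,
        "angle": angle,
        "rise": rise,
        "radius": radius,
        "residues_per_turn": residues_per_turn,
        "pitch": rise * residues_per_turn,
        "start": add(ca2, scale(v1, radius)),   # axis point opposite CA2
        "end": add(ca3, scale(v2, radius)),     # axis point opposite CA3
    }
```

Applied to four points generated on an ideal helix with a radius of 2.3 Å, a rise of 1.5 Å and a rotation of 100 degrees per residue (illustrative textbook values for an alpha-helix), the sketch recovers those inputs together with 3.6 residues per turn and a pitch of 5.4 Å.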
Strand algorithm:
Designed by analyzing the geometry of a perfect helix, the helix algorithm functions well for a range of imperfect geometries. Although a strand is geometrically similar to a helix with only two residues per turn, the unmodified algorithm gives undesirable results when used on strands. For this reason, a modified algorithm was developed for use on strands. This method finds the average position of the first three alpha-carbons (CA1, CA2, CA3) and the average position of the second three alpha-carbons (CA2, CA3, CA4), and uses them as the end-points of the axis segment. The other parameters are generated in a similar fashion to the helix algorithm. Serendipitously, this algorithm also follows aperiodic sections well.
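A minimal sketch of the end-point calculation for the strand algorithm follows. Only the two averaged positions are taken from the description above; the names `centroid` and `strand_tetrad` are hypothetical, and the remaining parameters (radius, angle of rotation and so on) are not reproduced here because their exact derivation for strands is not spelled out in this section.

```python
import math

def centroid(points):
    """Average position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def strand_tetrad(ca1, ca2, ca3, ca4):
    """End-points and length of the strand axis segment for one tetrad."""
    start = centroid([ca1, ca2, ca3])   # average of the first three alpha-carbons
    end = centroid([ca2, ca3, ca4])     # average of the second three alpha-carbons
    return start, end, math.dist(start, end)
```

Because CA2 and CA3 appear in both averages, the segment from `start` to `end` is exactly one third of the vector from CA1 to CA4, which is one way to see why the segments follow the backbone smoothly even through aperiodic sections.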
Axis segments:
Using the simple vector operations described above, GASP computes an axis segment for residues 1, 2, 3 and 4 (a tetrad) then steps up by one residue and computes an axis segment for residues 2, 3, 4 and 5 (another tetrad). It continues stepping by one residue until it has computed an axis segment for every tetrad in the protein. Both the helix algorithm and the strand algorithm are run over the entire protein. The output from the helix axis algorithm is kept for areas with helical geometry, and the output from the strand algorithm is kept for areas with strand geometry or aperiodic geometry. In a geometrically perfect helix or strand, the axis segments would line up end to end and make a straight line. However, most helices in proteins are either curved or kinked (Daffner et al., 1994; Blundell et al., 1983; Geetha, 1996) or simply irregular, and most strands have a right-handed twist (Yang and Honig, 1995b). Even those structures that are relatively straight are not geometrically perfect. Yet, the two algorithms are robust enough that the axis segments follow the backbone geometry very well.
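The stepping procedure described above amounts to a sliding window of width four over the alpha-carbon list. The sketch below illustrates this; the generator name `tetrads` and the zero-based indexing are assumptions for the example.

```python
def tetrads(ca_coords):
    """Yield (index of first residue, four consecutive alpha-carbons).

    A chain of N residues produces N - 3 overlapping tetrads.
    """
    for i in range(len(ca_coords) - 3):
        yield i, tuple(ca_coords[i:i + 4])
```

Each tetrad would then be passed to both the helix and the strand calculation, with the appropriate output kept according to the local geometry.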
Least squares axes:
The program then fits a least-squares axis in three-space to each secondary structure (Kahn, 1989b). If there is any superhelicity, curvature, kink or other deviation (see Perturbations) in the α-helix or β-strand, it can be seen clearly from the relationship of the segments to the least-squares axis and is easily quantified by GASP. Direction cosines are produced for each α-helix or β-strand axis. This enables a rigorous calculation of the angles between secondary structures (see kinemage 1). The angles between secondary structures are output to a separate file for ease of analysis. This is a useful subroutine, as most researchers currently rely on line-of-sight estimates, or on less rigorous automated methods, for the axes used to calculate these angles. This ability has already been used in the description and analysis of a crystallographic structure at high resolution (Helin et al., 1995). Least-squares axes can be calculated from the segment end-points for structures described by the user, DSSP, the PDB file, or GASP. The difference is sometimes significant, especially in curved helices, as changing the description of the helix end-point by even one residue can change the direction of the axis significantly. This can have far-reaching implications for methods that use vectors to compare and classify protein folds, and is one reason that consistent description of end-points is important.
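The fitting and angle calculation can be illustrated as follows. This sketch is not Kahn's published procedure (1989b), whose details are not given in this section; it substitutes a standard technique, taking the line through the centroid of the segment end-points along the principal axis of their scatter matrix, found here by power iteration. The function names `fit_axis` and `angle_between` are hypothetical, and the input is assumed to contain at least two distinct points.

```python
import math

def fit_axis(points, iterations=200):
    """Orthogonal least-squares line through 3-D points.

    Returns (centroid, unit direction). The direction is oriented from the
    first point toward the last point.
    """
    n = len(points)
    c = tuple(sum(p[k] for p in points) / n for k in range(3))
    dev = [tuple(p[k] - c[k] for k in range(3)) for p in points]
    # 3x3 scatter matrix of the deviations from the centroid.
    s = [[sum(q[i] * q[j] for q in dev) for j in range(3)] for i in range(3)]
    # Start from the first-to-last vector, then iterate toward the
    # eigenvector with the largest eigenvalue.
    v = tuple(points[-1][k] - points[0][k] for k in range(3))
    for _ in range(iterations):
        w = tuple(sum(s[i][j] * v[j] for j in range(3)) for i in range(3))
        length = math.sqrt(sum(x * x for x in w))
        if length == 0.0:
            break
        v = tuple(x / length for x in w)
    length = math.sqrt(sum(x * x for x in v))
    v = tuple(x / length for x in v)
    if sum(v[k] * (points[-1][k] - points[0][k]) for k in range(3)) < 0:
        v = tuple(-x for x in v)
    return c, v

def angle_between(u, v):
    """Angle in degrees between two axes given as direction cosines."""
    cos_a = max(-1.0, min(1.0, sum(a * b for a, b in zip(u, v))))
    return math.degrees(math.acos(cos_a))
```

Because the direction cosines are unit vectors, the angle between two secondary structures reduces to the arccosine of their dot product.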
GASP output:
Output of axial segment end-points for α-helical, β-strand, and aperiodic tetrads can be merged to yield a "curvilinear" representation of the entire protein with the α-helices, β-strands and aperiodic sections clearly delineated by color (see kinemage 2). This representation appears as a string threaded through the protein. It clearly follows the axes of the secondary structures and also follows the aperiodic sections very well. Viewing just the curvilinear representation is an easy way to visualize the overall fold of a protein, as it has many of the same applications that a ribbon diagram does (Kraulis, 1991; Richardson, 1981), and more (see below). The entire protein chain is defined so as to present a very accurate description of the protein fold and could be used to compare structures of a similar fold much as other authors have done (Abagyan and Maiorov, 1989; Murthy, 1974). Changes in color for the axis segments of different secondary structures make them more easily visualized, or the chain can be represented in one color to visualize just the basic fold.
As mentioned above, GASP output also includes the following parameters for every tetrad: the rise per residue, which is the length of the axis segment, the radius of the tetrad at the α-carbons, the angle of rotation about the axis per residue, the number of residues per turn (360/angle of rotation), and the pitch (rise × residues per turn). These are properties of the tetrad as a whole (see Appendix A for examples of the output). GASP also creates files containing the lengths of secondary structures, and comparisons of GASP descriptions with PDB, mDSSP and user descriptions. These comparisons are used in describing extensions and perturbations.
GASP flags those values that fall within the 95% confidence limits for helical parameters or β-strand-like parameters. Segments that have all five parameters within the 95% confidence limits are flagged as helical or β-strand-like. We use all five parameters instead of just the three independent ones, since the derived ones are more sensitive to extreme values. In addition, GASP flags the segments described in the PDB files as α-helices, β-strands or turns. It also uses the mDSSP algorithm to flag hydrogen-bonded α-helices and β-strands. These flags make patterns easily visible, facilitate comparisons, and simplify analysis (see Appendix A for an example of the output). GASP describes helices as at least two consecutive tetrads (5 residues) with helical geometry. This was done because, when GASP was allowed to describe a helix as being only four residues long, many α-turns were mistakenly described as short helices. Additionally, although GASP does not look at hydrogen-bonding, most common descriptions of helices include both geometry and hydrogen-bonding. Since the shortest helix (that with one i → i+4 hydrogen-bond) is five residues long, it seemed appropriate to choose this length. Conversely, since short strands are common, four-residue strands, consisting of only one tetrad, are described by GASP. Unfortunately, this misses some short strands described by DSSP that have only two or three residues (see Chapter 3: short structures). The GASP descriptions are used in later analysis for calculation of least-squares axes, secondary structure lengths, direction cosines, comparisons with other descriptions, and output of axis segments into correctly labeled files.
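The flagging and the minimum-length rule can be sketched as below. The confidence limits are taken as given inputs; any numbers passed in would be placeholders for illustration, not the limits reported in this work. The names `within_limits` and `flagged_runs`, the dictionary layout and the zero-based residue indices are assumptions for the example.

```python
def within_limits(tetrad, limits):
    """True if every parameter listed in `limits` lies inside its (low, high) pair."""
    return all(low <= tetrad[name] <= high for name, (low, high) in limits.items())

def flagged_runs(flags, min_tetrads=2):
    """Residue ranges covered by runs of consecutive flagged tetrads.

    Tetrad i spans residues i..i+3, so a run of two tetrads spans five
    residues. Use min_tetrads=2 for helices and min_tetrads=1 for strands.
    """
    runs = []
    start = None
    for i, flag in enumerate(list(flags) + [False]):   # sentinel closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start >= min_tetrads:
                runs.append((start, i + 2))            # last tetrad is i-1, ending at i+2
            start = None
    return runs
```

With `min_tetrads=2`, an isolated flagged tetrad is ignored, which mirrors the decision not to describe single helical tetrads as helices.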
The program then compares the descriptions generated by GASP with the
descriptions input by the user, from mDSSP and from the PDB file. Residues
at the end of or within secondary structures that have correct hydrogen-bonding
(as described by mDSSP) but do not have correct α-carbon
geometry (as described by GASP) are labeled perturbations. Residues at
the end of secondary structures that do not have correct hydrogen-bonding
(as described by mDSSP) but do have correct geometry (as described by GASP)
are labeled extensions. Short structures, such as areas of extended geometry
(described by GASP as having strand-like geometry but not part of a sheet
as described by mDSSP), short helical areas (described by GASP as having
helical geometry but not having α-helical hydrogen-bonds
as described by mDSSP), and single helical tetrads (which do not and cannot
have an α-helical hydrogen-bond described by
mDSSP) are also labeled. These comparisons are output to a file for ease
of analysis. The lengths of perturbations, extensions, and short structures
are compiled and also output in a file.
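The labeling logic can be illustrated with a simplified per-residue sketch. It assumes two boolean lists of equal length, one marking residues with correct hydrogen-bonding (as from mDSSP) and one marking residues with correct geometry (as from GASP), and it ignores the distinction between helix and strand; the function name and label strings are hypothetical, and the actual comparison in GASP may differ in detail.

```python
def label_residues(hbonded, geometric):
    """Label each residue as 'perturbation', 'extension', 'short structure' or ''."""
    n = len(hbonded)
    labels = [""] * n
    # Correct hydrogen-bonding without correct geometry: perturbation.
    for i in range(n):
        if hbonded[i] and not geometric[i]:
            labels[i] = "perturbation"
    # Runs with correct geometry but no hydrogen-bonding: an extension if the
    # run touches a hydrogen-bonded residue, otherwise a short structure.
    i = 0
    while i < n:
        if geometric[i] and not hbonded[i]:
            j = i
            while j + 1 < n and geometric[j + 1] and not hbonded[j + 1]:
                j += 1
            touches = (i > 0 and hbonded[i - 1]) or (j + 1 < n and hbonded[j + 1])
            tag = "extension" if touches else "short structure"
            for k in range(i, j + 1):
                labels[k] = tag
            i = j + 1
        else:
            i += 1
    return labels
```

The lengths of the labeled runs can then be tallied directly from the returned list.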
Summary of analysis
Attempts to understand protein folding through analysis of protein structure have been conducted for many years. As discussed above, there are many methods of analysis, many computer programs, and many ways of displaying the results. We have developed a program, GASP, to analyze the geometry of secondary structures. GASP has several important differences from programs already in use and provides the investigator with several new and useful methods for protein analysis. We have used it to analyze the geometry of a database of 112 high-resolution, independently solved protein structures (see Appendix B). GASP defines four consecutive α-carbons as a tetrad and computes axis segments for all the tetrads in the protein. In the process it yields five parameters for each tetrad (rise per residue, radius, angle of rotation per residue, residues per turn, and pitch). For all 112 proteins we compiled the five parameters generated by GASP for all the tetrads included within the boundaries of the secondary structures described in the HELIX and SHEET records in the PDB files. From these data, we determined rigorous confidence limits for helical and β-strand-like geometry.
Using these confidence limits to define secondary structure geometry, we reran the calculations over the entire length of all 112 proteins and carried out a detailed analysis of protein geometry. This analysis shows that some tetrads within the boundaries of the α-helices and β-strands described by mDSSP (see below) do not fall within the 95% confidence limits for all five geometric parameters. Thus, by our definition, they do not have helical or β-strand-like α-carbon geometry. The out-of-bounds sections are flagged in the output, making it easy to find perturbations in the structure. In contrast, some tetrads outside the hydrogen-bonded secondary structure boundaries described by mDSSP do fall within the 95% confidence limits for all five parameters, and thus, by our definition, do have correct helical or β-strand-like α-carbon geometry. This is also easy to see in the output as flagged tetrads outside the defined structures. Some of these tetrads correspond to turns, some are areas of extended conformation without standard sheet bonding, and some are at the ends of α-helices and β-strands, indicating that their geometry extends beyond the mDSSP descriptions. Thus, one advantage of GASP is that it gives us an easy method of finding ambiguous areas of definition. These areas have the potential to be biologically interesting. Our analysis shows that these ambiguous areas are often conserved in homologous proteins. This suggests that they are probably not artifacts, but that Nature has reasons for putting them there. Although these reasons are not always apparent, in some cases there appear to be readily identifiable forces at work. Several examples of areas with ambiguous propensities were studied.
By way of example, we analyzed in detail the structures of two independently solved homologous proteins, trypsin (PDB file 1TLD), and chymotrypsin (PDB file 5CHA). We compared GASP output with mDSSP descriptions to find ambiguous areas, perturbations and extensions. These analyses have led to several intriguing hypotheses and suggest several avenues of further research.
Description of GASP
Data input:
GASP can read data from a variety of sources. It reads the coordinates in three-space of the α-carbons for the specified chain directly from the PDB file. It can also read the descriptions of secondary structure specified in the HELIX and SHEET records in the PDB file. In addition, GASP reads data on secondary structure descriptions, hydrogen-bonding, and phi-psi angles directly from the output of the DSSP program, and it can read descriptions of secondary structure defined by the user. All of these descriptions can be compared to the secondary structure descriptions generated by the program itself.
DSSP modification:
Since the descriptions in the PDB files are sometimes subjective and variable (Kabsch and Sander, 1983), we wanted to compare GASP definitions with an automated method. We also wished to compare α-carbon geometry and hydrogen-bonding. We chose DSSP (Kabsch and Sander, 1983) for this purpose because many structural laboratories have adopted it. In addition, DSSP looks at hydrogen-bonding and ignores α-carbon geometry, while GASP looks solely at α-carbon geometry and ignores hydrogen-bonding. We hoped that this would show us areas that have correct helical and β-strand-like geometry without correct hydrogen-bonding (Extensions), and areas that have correct hydrogen-bonding without having correct α-carbon geometry (Perturbations).
DSSP calculates all the hydrogen-bonds present in the protein using an energy cut-off. It then looks for patterns of hydrogen-bonds to identify helices, sheets and turns. DSSP reports the helices as one residue shorter on either end than the actual hydrogen-bonding pattern. For instance, if residues 5-15 have correct helical hydrogen-bonding, DSSP reports the helix as including residues 6-14. The authors state (Kabsch and Sander, 1983) that this was done because many helices have non-helical phi-psi angles in the terminal residues, and they wished to restrict their definitions of helices to those residues that are both correctly hydrogen-bonded and have helical phi-psi geometry. Perhaps when DSSP was first released the small number of available structures made this approach necessary, but our investigation of geometry has shown that most of these truncated residues do have helical α-carbon geometry (see phi-psi results in Chapter 3). In addition, an examination of full-backbone helical geometry shows that both the phi and psi of the first and last residues can be non-helical without disrupting either the helical hydrogen-bonding or the position of the α-carbons.
Imagine a geometrically perfect helix running from residues 5-15, correctly hydrogen-bonded along its entire length. While holding residues 5-14 still, rotate the psi bond of residue 15. The C15 atom spins in place, while the =O15 and all the residues from residue 16 on rotate around it. This does not affect the position of CA15 nor the helical hydrogen-bonding, as the last four carbonyls are not involved in the helix hydrogen-bonding anyway. Now, while still holding residues 5-14 still, rotate the phi bond of residue 15. This time CA15 spins in place, while C15, =O15, and all the residues from residue 16 on rotate around it. This, too, does not affect the position of CA15, nor the helical hydrogen-bonding. In both cases, the α-carbon geometry (position of CA15) and the helical hydrogen-bonding (based on the position of NH15) are unchanged.
A similar situation exists at the N-terminal end of the helix. Convention
would have us hold still the atoms to the N-terminal side of the phi-bond
rotation. However, the best way to visualize the situation is contrary
to convention; imagine holding residues 6-15 still while rotating the phi
bond of residue 5. The N5 atom spins in place, while the amide
proton rotates around it. Residues 1-4 also rotate around the bond. The
position of CA5 and the helical hydrogen-bonding are not disturbed,
since, like the last four carbonyls, the first four amide protons are not
bonded into the helix. Now imagine holding residues 6-15 still and rotating
the psi bond of residue 5. This time CA5 spins in place while
R5, N5, the amide proton, and residues 1-4 rotate
around it. In both cases, the hydrogen-bond from the C=O up into the helix
and the position of CA5 are not perturbed at all. Thus, no matter
what the phi-psi angles of residue 5 or residue 15 are, these residues
are still part of the helix both in terms of hydrogen-bonding and in terms
of α-carbon geometry. It is, therefore, unnecessary
to truncate the terminal residues of helices. We have, accordingly, modified
the DSSP algorithm to put the missing residues back into the description
of helices. For the purposes of this paper, when we refer to mDSSP helical
descriptions, we are referring to this modified description and are including
all the residues that are correctly hydrogen-bonded.
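The effect of the modification on a reported helix can be sketched as a simple post-processing step. This is an illustration only: it assumes helices are available as (first, last) residue-number pairs as reported by unmodified DSSP, and it does not reproduce how the in-house mDSSP program is actually implemented. The function name and the chain-limit arguments are hypothetical.

```python
def restore_termini(helices, chain_start, chain_end):
    """Extend each reported helix by one residue at both ends.

    The extension is clamped to the chain limits so that no residue
    number outside the chain is produced.
    """
    return [(max(chain_start, first - 1), min(chain_end, last + 1))
            for first, last in helices]
```

For the example in the text, a helix reported by DSSP as residues 6-14 becomes residues 5-15, matching the full hydrogen-bonded span.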
Comparison with other methods:
ROTFIT:
One recent method (Christopher et al., 1996) generates axes of helices via a rotational least-squares fit. First the helix is superimposed onto itself, displaced by one residue. Thus residue 1 is superimposed onto residue 2 and so forth. Then, the axis of rotation necessary to superimpose the atoms is calculated and assumed to be the helix axis. This method suffers primarily from being capable of handling only straight helices. For a curved helix it generates a single axis that is tangent to the helix at its midpoint, whereas GASP generates axis segments along the entire helix. GASP's segments can easily be fed into routines used to find the radius of curvature, or compared to the least-squares axes to observe superhelicity. It should be noted that ROTFIT is reported to have less error than other methods in finding the axes of both protein and DNA helices. However, this comparison is based on the authors' own version of Kahn's method (Christopher et al., 1996). It is unclear which of Kahn's four published methods the authors are using, which is pertinent as they differ greatly in their accuracy. Additionally, Christopher et al. assert that Kahn's method fails with DNA helices. In fact, Kahn's (best) method works very well, with minimum error, on DNA helices if one uses every other residue. This is described in detail in the published report on the algorithm (Kahn, 1989a). In their comparison with ROTFIT the authors report comparatively large errors for Kahn's method largely because they fail to apply the method as suggested.
HBEND:
This program (Barlow and Thornton, 1988) generates axes of helices using
probes of various lengths and conformations and selecting the one with
the least deviation from that portion of the actual helix. These selected
probes are then strung together to create an axis for the entire helix.
These approximate axes are highly dependent on the length and conformation
of the probes used, and are useful only for helices. In addition, the method
results in curves and twists within the helices so that real curvature,
kinks or superhelicity are difficult to identify. The axes generated by
GASP can be used in many of the same ways as the output of HBEND. However,
if there is any superhelicity, curvature, kink or other deviation in the
helix or strand, it will be clearly seen in GASP output from the relationship
of the axial segments to the least-squares axis and is easily quantified.
In addition to helical axes, GASP also generates rigorous axes for strands.
Ribbon diagrams:
Ribbon diagrams (Kraulis, 1991; Richardson, 1981) are often used as an easy way to depict the overall fold of a protein. Unfortunately, ribbon diagrams are frequently smoothed for comprehensibility and shifts of as much as 1 to 2 Angstroms are sometimes necessary in order to avoid ambiguity at crossing points (Richardson, 1981). The curvilinear representation generated by GASP has many of the same applications that a ribbon diagram does. The GASP representation may even be preferable to a ribbon diagram in some instances due to the rigorous, unsmoothed nature of the representation, and its ability to represent the axes of helices and strands equivalently as sticks.
P-Curve:
P-Curve's helicoidal axis is also useful for representing the overall fold of a protein. Each portion of the axis generated by P-Curve is reliant on a minimum of 9 peptide units. Unfortunately, this smoothing can lead to confusion at the ends of secondary structures, as the peptide units within the secondary structure are smoothed with those in turns. Thus, secondary structure definitions generated by P-Curve are frequently shorter than those of other methods. In contrast, GASP's curvilinear representation is made up of individual axial segments which each represent one tetrad, independent of the surrounding residues. As stated above, this unsmoothed representation may be preferred in some instances. In addition, P-Curve generates extensive output (for rigorous interpretations, 16 parameters for each peptide unit are needed) useful for various comparisons; however, it is often difficult to correlate these parameters with standard helical parameters. In comparison, the GASP algorithm generates five clearly understood parameters for each tetrad.
Vectors:
Abagyan and Maiorov's vector representation (1988) is useful for comparing folds. A vector that approximates the axis is generated for each secondary structure and these are linked by other vectors that run from the end of one to the beginning of the next. There is no effort to represent the aperiodic sections, and all secondary structures are represented as straight lines, making no allowance for curvature or kink. Since the vectors are calculated using definitions supplied by some other method, the output is heavily dependent on those definitions. The authors discuss the importance of using consistent definitions for secondary structures in any comparison of protein folds. Additionally, the direction of the vectors can be changed drastically by minor changes in the definition of the secondary structure end-points, and by defining additional secondary structures. The axes generated by GASP could be used in a similar way to compare folds. In addition, the curvature or kink of axes and aperiodic sections could be taken into account. When the GASP definition of secondary structure (tetrads within the 95% confidence limits for helical or strand geometry) is used to generate the least-squares axes, they tend to follow the axis segments more closely, and are thus a better representation of the axes, than when other definitions are used. Vectors generated using these axes would be a more accurate definition of the overall fold.
COMPUTERS AND PROGRAMS
GASP was developed in FORTRAN on a VAX, and then rewritten for a UNIX host. The axis segments and least-squares axes were analyzed on a Silicon Graphics computer using Insight II (MSI/Biosym Technologies, Inc.). The kinemages were developed for this paper using Prekin and Kinemage (Richardson and Richardson, 1992). An in-house program using the DSSP algorithm, and DSSP (Kabsch and Sander, 1983), were used to analyze hydrogen-bonding. PDBSUM (see Chapter 1) (Laskowski, 2001; Laskowski et al., 1997; Laskowski et al., 1993) was used to analyze strain.
SELECTION OF FILES
All protein data and coordinates used in the creation of the database were from the Brookhaven Protein Data Bank (PDB) release of January 1996. Sections of a general use directory program were modified and used to create a list of PDB files having HELIX or SHEET records. These files were sorted by resolution, and those with a resolution of 1.75 Å or better were kept. We chose to use only well resolved proteins, since it has been shown that well resolved proteins have a better topology and stereochemistry (Morris et al., 1992). Only crystallographically determined structures were used. Structures derived from computer simulations or from solution NMR were not used at this point in the work (see Chapters 1, 4 and 5). This left us with 314 files. These were sorted by name and culled using several criteria.
Although we were not concerned with excluding homologous proteins, we
were careful to use only independently solved structures. Since some PDB
files were derived from the data in another file, using various methods
to refine the coordinates, the file with the best resolution was used.
Some PDB files were mutants of other proteins. Fifty-four lysozyme mutants
met our resolution requirements. The files containing the native forms,
1LZ1 (human) and 3LZM (T4-lysozyme) were used. Not only are these proteins
derived from different organisms, but, more importantly, their structures
were independently solved. The other 52 lysozyme files were omitted. Some
PDB files described different forms of the same protein. Different forms
of a protein were also only kept if their structures were independently
derived. For example, two forms of hemoglobin, 1ECA, the aquomet form, and
1H3G, a carbon monoxy form, were both used. The other (three carbonmonoxy-,
one cyano-met-, six deoxy-, two oxygenated, one ferric, and one mutant)
forms were all culled. Homologous proteins, e.g., trypsin and chymotrypsin,
were kept only if their structures were independently solved. From a total
of 2921 PDB files, 112 files were chosen to make the database (see Appendix
B).
CREATION OF DATABASE
The 112 protein files contained 615 HELIX records (see footnote on previous page), of which 19 had fewer than four residues and could not be used, as the algorithm requires four consecutive α-carbons. This left 596 α-helices that yielded a total of 4696 axis segments. Thirty-nine axis segments had negative rises, corresponding to left-handed helices. These were omitted since they threw off the calculations for the right-handed α-helix. This left 4657 right-handed α-helical axis segments in the database (Table 2.1) that were used in further calculations (see below).
There are 973 SHEET records in the 112 PDB files. Of these, 108 had fewer than four residues and could not be used; this left 865 β-strands, which yielded a total of 2738 axis segments (Table 2.2). Computing statistics for β-strands was more complicated. The first run through the output listing, including all the segments, yielded huge standard deviations for residues per turn and pitch (data not shown). Inspection of the values indicated a significant number of extreme outliers that caused the distributions not to be Gaussian. These were mostly caused by an angle of rotation close to zero (data not shown). Since residues per turn is calculated by dividing 360 by the angle of rotation, this yielded the extreme results. The distribution of pitch, which is calculated as (360 × rise)/(angle of rotation), was also distorted. Since the presence of extreme outliers can create large perturbations in means and standard deviations, it is standard statistical practice to use trimmed means and standard deviations, i.e., to delete a small percentage of the most extreme data-points (Mallows, 1996). An inspection of the data (data not shown) showed that radius and angle of rotation have fairly symmetric distributions. The distribution of radius is short-tailed, while that of angle of rotation is long-tailed. However, rise, residues per turn and pitch have fairly long tails on one side, but short tails on the other. All three asymmetric parameters had a significant number of outliers. It was decided to cull the outliers by a systematic procedure. Since pitch is computed from both rise and angle of rotation, it showed the most outliers. The data were sorted by pitch, and 68 axis segment records (2.5%) were removed from each end of this listing (5% of the total list). This culling left 2602 β-strand axis segments in the database (Table 2.2), which were used in further calculations (see below). Figure 2.2 shows the distribution of the culled data for all five parameters.
Rise and radius are short tailed, while angle, residues per turn and pitch are still long-tailed. The distributions of these culled parameters show that, while still not Gaussian, they are more symmetric than before culling, the means agree with standard published values, and the standard deviations are more reasonable (see Table 2.3).
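The culling step can be sketched as a symmetric trim of a list sorted on one key. The function name `trim_by_key`, the record layout and the choice to round the count down are assumptions for the example; rounding down does reproduce the figures given above, since 2.5% of 2738 records is 68.45, which gives 68 records removed from each end and 2602 retained.

```python
def trim_by_key(records, key, fraction=0.025):
    """Sort records by `key` and drop `fraction` of them from each end."""
    ordered = sorted(records, key=key)
    k = int(len(ordered) * fraction)        # number removed per end, rounded down
    return ordered[k:len(ordered) - k] if k else ordered
```

A call such as `trim_by_key(segments, key=lambda s: s["pitch"])` would return the retained segments, from which means and standard deviations can then be computed.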
STATISTICS
Using the database we created, the means, standard deviations and 95% confidence limits were calculated for each parameter for α-helices and β-strands (Table 2.3). Upon inspection, we discovered that the number of parameters outside the 95% confidence limits differed from the expected 5%. This suggests that the data are non-Gaussian. The data appear to be short-tailed for all helical parameters and for the β-strand parameters (after culling) rise and radius. Quantile-Quantile graphs (data not shown) were created for these parameters and confirm this (Mallows, 1996). The findings suggest that helical geometry and β-strand rise and radius have natural constraints, which keep them within fairly narrow limits rather than a broad continuum of possible geometries. However, the β-strand parameter, angle of rotation (and thus residues per turn and pitch which are derived from it), is significantly long-tailed. This indicates that a wider range of angle can be tolerated within a β-sheet.
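The check described above, comparing the observed fraction of out-of-bounds values with the expected 5%, can be sketched as follows. The text does not state how the 95% limits were computed; the sketch assumes the common Gaussian form, mean plus or minus 1.96 sample standard deviations, and the function names are hypothetical.

```python
import statistics

def limits_95(values, z=1.96):
    """Assumed form of the limits: mean +/- z sample standard deviations."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean - z * sd, mean + z * sd

def fraction_outside(values, low, high):
    """Fraction of values falling outside the interval [low, high]."""
    outside = sum(1 for v in values if v < low or v > high)
    return outside / len(values)
```

For Gaussian data the fraction outside these limits should be close to 0.05; a clearly smaller fraction points to a short-tailed distribution and a clearly larger one to a long-tailed distribution.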
We then used the 95% confidence limits to find the number of tetrads that have an outlier in any one of the five parameters. This is a much more sensitive method of finding perturbations. We found that 272 α-helix axis segments (5.8% of all PDB described helical segments) and 525 β-strand axis segments (19.1% of all PDB described β-strand segments) had at least one of the five parameter values out-of-bounds. If the parameters were truly independent, we would expect the outliers to be additive. Since they are correlated (as a helix is stretched out towards strand geometry, the lengthening rise is accompanied by a shortening radius and a widening angle), we expect to find approximately 5% out-of-bounds. The fact that the number of segments containing one or more outliers is larger than the expected 5% of the database for helices, or 10% (5% outliers + 5% culled before calculations) of the database for strands, reflects, again, that the distribution of the five correlated parameters is non-Gaussian, or could indicate that the data are not perfectly correlated (i.e., one parameter can be perturbed to an out-of-bounds value without necessarily perturbing the others out-of-bounds).
PERTURBATIONS
The analyses indicate that some sections within DSSP-described α-helices and β-strands do not have α-helical or β-strand-like geometry. The flagging makes these perturbations easy to find. Negative rise (left-handedness) and short gaps in the flagging pattern (kinks) within an α-helix or β-strand are common. These could be artifacts of the crystallographic process, or they could be real anomalies in structure. If real, they are of biological interest. Since there is a free energy cost in pushing secondary structures out of plumb, compensation must exist elsewhere in the molecule. One interesting observation is that the out-of-bounds segments often fall into patterns, such that homologous proteins often have out-of-bounds segments in the same places (see below). This supports the idea that they are real perturbations in structure and not artifacts.
EXTENSIONS
After running GASP over the whole length of all 112 proteins, we see many segments NOT within the mDSSP described α-helices and β-strands that have helical or strand-like geometry. Some of these seem to be turns, or areas of extended conformation not involved in sheets, while others seem to be extensions of the α-helices and β-strands described in the PDB file (see Chapter 3). It is well known that α-helices frequently tighten into 3₁₀ helices, or at least tighten enough for the n → n+4 bond to bifurcate into n → n+4 and n → n+3 bonds. This is known as an αII helix, which has the same parameters as an α-helix, but the peptide is tilted enough to make bifurcation easier (Nemethy et al., 1967). Some of the extensions are due to this effect, but not all.
1TLD vs. 5CHA COMPARISON
A detailed analysis of the ambiguous areas, where the definitions differed, was done on trypsin (PDB file 1TLD) and chymotrypsin (PDB file 5CHA). We compared GASP geometry, mDSSP hydrogen-bonds, and PROCHECK strain (all automated methods, done the same way on both proteins). We did not include a detailed comparison of our geometry with PDB descriptions because, as mentioned above, the PDB descriptions are variable and are occasionally wrong. We wrote a program that used the DSSP algorithm to define hydrogen-bonds, but which included side-chain hydrogen-bond donors and acceptors. Side-chain bonding is not examined by DSSP. The output listing from our program was checked against a listing produced by the published version of DSSP (Kabsch and Sander, 1983), and was shown to include all the backbone-backbone bonds indicated by DSSP, and only those backbone-backbone bonds. We worked manually from our listing to analyze hydrogen-bonding of the two proteins.
We found that some extensions did have standard helical or β-strand-like backbone-backbone hydrogen-bonding, suggesting an ambiguity in the DSSP pattern recognition algorithm. This is of considerable methodological interest, but tells us little about folding. However, we also found ample evidence that most of the extensions are not correctly hydrogen-bonded.