Data input:
GASP can read data from a variety of sources. It reads the three-dimensional coordinates of the alpha-carbons for the specified chain directly from the PDB file, and it can read the secondary structure descriptions given in the HELIX and SHEET records of that file. GASP also reads secondary structure descriptions, hydrogen-bonding, and phi-psi angles directly from the output of the DSSP program, and it can read secondary structure descriptions defined by the user. Additionally, the user can choose to use modified DSSP definitions (mDSSP) instead of the direct DSSP output. All of these descriptions can be compared to the secondary structure descriptions generated by the program itself.
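As an illustration of the first input step, the following is a minimal sketch of reading alpha-carbon coordinates for one chain from the fixed-column ATOM records of a PDB file. It is not the GASP code (which was written in FORTRAN); the function name read_ca_coords and the choice to keep only the first alternate location and the first model are assumptions made for this example.

```python
def read_ca_coords(pdb_lines, chain_id):
    """Return [(residue_number, (x, y, z)), ...] for the alpha-carbons of one chain."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ENDMDL"):
            break  # assumption: use only the first model of a multi-model file
        if not line.startswith("ATOM") or len(line) < 54:
            continue  # skips HETATM records, so calcium ions named CA are ignored
        atom_name = line[12:16].strip()  # PDB columns 13-16
        alt_loc = line[16]               # PDB column 17
        chain = line[21]                 # PDB column 22
        if atom_name != "CA" or chain != chain_id:
            continue
        if alt_loc not in (" ", "A"):
            continue  # assumption: keep only the first alternate location
        res_num = int(line[22:26])       # PDB columns 23-26
        xyz = (float(line[30:38]),       # x, columns 31-38
               float(line[38:46]),       # y, columns 39-46
               float(line[46:54]))       # z, columns 47-54
        coords.append((res_num, xyz))
    return coords
```

A call such as read_ca_coords(open("1tld.pdb"), "A") would return the alpha-carbon list for chain A; the file name and chain identifier here are placeholders.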
Helix algorithm:
As described previously (Kahn, 1989a), the helix axis algorithm finds the axis of a helix using vector algebra. An axis segment is generated for each set of four consecutive amino acids. The angle formed by the first three alpha-carbons (CA1, CA2, CA3) is bisected, and then the angle formed by the second three alpha-carbons (CA2, CA3, CA4) is bisected. Since, in an ideal helix, these two bisecting vectors pass perpendicularly through the helix axis, their cross product gives the direction cosines of the axis. The angle between the two vectors gives the angle of rotation per residue, and dividing 360 degrees by this angle gives the number of residues per turn. The points where the two vectors intersect the axis, which lie opposite residues CA2 and CA3, are the end-points of the axis segment for that tetrad. The distance between the two end-points is the rise per residue, while the distance from the alpha-carbon to the helix axis gives the radius of the helix. Multiplying the rise by the residues per turn gives the pitch.
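The construction above can be sketched in a few lines of vector algebra. The code below is an illustrative reading of the description, not the published implementation of Kahn (1989a); the function names and the closed-form expression used for the radius are assumptions of this sketch. The radius follows from requiring that both bisectors, extended by the same distance, land on the axis.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    # Fails with ZeroDivisionError for a zero vector, e.g. the cross product
    # of anti-parallel bisectors in a flat, fully extended strand.
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def bisector(prev_ca, mid_ca, next_ca):
    """Unit vector bisecting the angle prev-mid-next, pointing from mid toward the axis."""
    return unit(add(unit(sub(prev_ca, mid_ca)), unit(sub(next_ca, mid_ca))))

def helix_tetrad(ca1, ca2, ca3, ca4):
    """Axis segment and helical parameters for four consecutive alpha-carbons."""
    v1 = bisector(ca1, ca2, ca3)           # bisector at CA2
    v2 = bisector(ca2, ca3, ca4)           # bisector at CA3
    axis = unit(cross(v1, v2))             # direction cosines of the axis
    cos_angle = max(-1.0, min(1.0, dot(v1, v2)))
    rotation = math.degrees(math.acos(cos_angle))  # rotation per residue
    step = sub(ca3, ca2)
    rise = dot(step, axis)                 # signed; negative for a left-handed turn
    diff = sub(v1, v2)
    radius = dot(step, diff) / dot(diff, diff)
    start = add(ca2, scale(v1, radius))    # axis point opposite CA2
    end = add(ca3, scale(v2, radius))      # axis point opposite CA3
    per_turn = 360.0 / rotation
    return {"axis": axis, "start": start, "end": end, "rise": rise,
            "radius": radius, "rotation": rotation,
            "residues_per_turn": per_turn, "pitch": rise * per_turn}
```

For four points generated on an ideal alpha-helix (radius 2.3 angstroms, rise 1.5 angstroms, 100 degrees per residue), this sketch recovers those input values, with 3.6 residues per turn and a pitch of 5.4 angstroms.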
Strand algorithm:
Designed by analyzing the geometry of a perfect helix, the helix algorithm functions well for a range of imperfect geometries. Although a strand is geometrically similar to a helix with only two residues per turn, the unmodified algorithm gives undesirable results when used on strands. For this reason, a modified algorithm was developed for use on strands. This method finds the average position of the first three alpha-carbons (CA1, CA2, CA3) and the average position of the second three alpha-carbons (CA2, CA3, CA4), and uses them as the end-points of the axis segment. The other parameters are generated in a similar fashion to the helix algorithm. Serendipitously, this algorithm also follows aperiodic sections well.
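The end-point calculation for the strand method is simple enough to sketch directly. The code below covers only the two averaged positions and the resulting segment length; how the remaining parameters are derived is not spelled out in the text and is not attempted here. The function names are illustrative.

```python
import math

def centroid(points):
    """Average position of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))

def strand_tetrad(ca1, ca2, ca3, ca4):
    """Axis segment for one tetrad using the averaged-position method."""
    start = centroid([ca1, ca2, ca3])   # average of the first three alpha-carbons
    end = centroid([ca2, ca3, ca4])     # average of the second three alpha-carbons
    length = math.dist(start, end)      # length of the axis segment
    return start, end, length
```

Because CA2 and CA3 appear in both averages, the segment is exactly one third of the vector from CA1 to CA4. This may be why the method tracks the overall direction of the chain even where the backbone is not periodic.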
Axis segments:
Using the simple vector operations described above, GASP computes an axis segment for residues 1, 2, 3 and 4 (a tetrad), then steps up by one residue and computes an axis segment for residues 2, 3, 4 and 5 (another tetrad). It continues stepping by one residue until it has computed an axis segment for every tetrad in the protein. Both the helix algorithm and the strand algorithm are run over the entire protein. The output from the helix axis algorithm is kept for areas with helical geometry, and the output from the strand algorithm is kept for areas with strand geometry or aperiodic geometry. In a geometrically perfect helix or strand, the axis segments would line up end to end and form a straight line. However, most helices in proteins are either curved or kinked (Daffner et al., 1994; Blundell et al., 1983; Geetha, 1996) or simply irregular, and most strands have a right-handed twist (Yang and Honig, 1995b). Even those structures that are relatively straight are not geometrically perfect. Nevertheless, the two algorithms are robust enough that the axis segments follow the backbone geometry very well.
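The stepping procedure amounts to a sliding window of four residues, so a chain of N residues yields N - 3 tetrads. The sketch below shows the window and the selection between the two outputs. The helix routine, the strand routine and the test for helical geometry are passed in as functions; these three parameters, and the tuple layout of the result, are hypothetical names chosen for this example. Chain breaks are not handled.

```python
def tetrads(ca_coords):
    """Yield (index of first residue, four consecutive alpha-carbon positions)."""
    for i in range(len(ca_coords) - 3):
        yield i, ca_coords[i:i + 4]

def axis_segments(ca_coords, helix_fn, strand_fn, is_helical):
    """Run both algorithms on every tetrad and keep one output per tetrad."""
    kept = []
    for i, tetrad in tetrads(ca_coords):
        helix_out = helix_fn(*tetrad)    # helix algorithm on this tetrad
        strand_out = strand_fn(*tetrad)  # strand algorithm on this tetrad
        if is_helical(helix_out):
            kept.append((i, "helix", helix_out))
        else:
            # strand output is kept for strand and aperiodic geometry alike
            kept.append((i, "strand", strand_out))
    return kept
```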
Least squares axes:
The program then fits a least-squares axis in three-space to each secondary structure (Kahn, 1989b). Any superhelicity, curvature, kink or other deviation in the alpha-helix or beta-strand is clearly seen from the relationship of the segments to the least-squares axis and is easily quantified by GASP. Direction cosines are produced for each alpha-helix or beta-strand axis, which enables a rigorous calculation of the angles between secondary structures. These angles are output to a separate file for ease of analysis. This is a useful subroutine, as most researchers currently use line-of-sight estimates or less rigorous automated methods for the axes used to calculate these angles. This ability has already been used in the description and analysis of a crystallographic structure at high resolution (Helin et al., 1995). Least-squares axes can be calculated from the segment end-points for structures described by the user, DSSP, the PDB file, or GAS-P. The difference between these axes is sometimes significant, especially in curved helices, as changing the description of the helix end-point by even one residue can change the direction of the axis significantly. This can have far-reaching implications in methods that use vectors to compare and classify protein folds, and is one reason consistent description of end-points is important.
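To make the axis fit and the inter-axis angle concrete, the sketch below fits a line to a set of segment end-points and computes the angle between two axes from their direction cosines. The fitting procedure of Kahn (1989b) is not reproduced in this section, so a generic orthogonal least-squares fit (the principal axis of the scatter matrix, found by power iteration) is substituted here as an assumption. The function names are likewise illustrative.

```python
import math

def fit_axis(points, iterations=100):
    """Orthogonal least-squares line through 3-D points: (centroid, unit direction)."""
    n = len(points)
    center = tuple(sum(p[k] for p in points) / n for k in range(3))
    centered = [tuple(p[k] - center[k] for k in range(3)) for p in points]
    # 3x3 scatter matrix of the centered points
    scatter = [[sum(q[i] * q[j] for q in centered) for j in range(3)]
               for i in range(3)]
    # Power iteration for the dominant eigenvector, seeded with first -> last
    chord = tuple(points[-1][k] - points[0][k] for k in range(3))
    d = chord
    for _ in range(iterations):
        d = tuple(sum(scatter[i][j] * d[j] for j in range(3)) for i in range(3))
        length = math.sqrt(sum(x * x for x in d))
        d = tuple(x / length for x in d)
    # Orient the axis from the first point toward the last
    if sum(a * b for a, b in zip(d, chord)) < 0:
        d = tuple(-x for x in d)
    return center, d

def axis_angle(d1, d2):
    """Angle in degrees between two axes given as direction cosines."""
    cos_angle = max(-1.0, min(1.0, sum(a * b for a, b in zip(d1, d2))))
    return math.degrees(math.acos(cos_angle))
```

Because each axis is oriented from its first end-point to its last, the angle reported lies between 0 and 180 degrees and distinguishes parallel from anti-parallel packing.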
Curvature:
The end-points of the axis segments are easily fed into a routine to calculate curvature of secondary structures. An automated version of this is currently under development and will soon be incorporated into the GAS-P algorithm.
Curvilinear Representation:
Output of axial segment end-points for alpha-helical, beta-strand, and aperiodic tetrads can be merged to yield a "curvilinear" representation of the entire protein with the alpha-helices, beta-strands and aperiodic sections clearly delineated by color. This representation appears as a string threaded through the protein. It clearly follows the axes of the secondary structures and also follows the aperiodic sections very well. Viewed on its own, the curvilinear representation is an easy way to visualize the overall fold of a protein, and it has many of the same applications as a ribbon diagram (Kraulis, 1991; Richardson, 1981) and more. Because the entire protein chain is defined, it presents a very accurate description of the protein fold and could be used to compare structures of a similar fold much as other authors have done (Abagyan and Maiorov, 1989; Murthy, 1974). Changes in color for the axis segments of different secondary structures make them more easily visualized, or the chain can be represented in one color to visualize just the basic fold.
GAS-P Output:
As mentioned above, GAS-P output also includes the following parameters for every tetrad: the rise per residue, which is the length of the axis segment; the radius of the tetrad at the alpha-carbons; the angle of rotation about the axis per residue; the number of residues per turn (360 degrees divided by the angle of rotation); and the pitch (rise multiplied by residues per turn). These are properties of the tetrad as a whole. GAS-P also creates files containing the lengths of secondary structures, and comparisons of GASP descriptions with PDB, mDSSP and user descriptions. These comparisons are used in describing extensions and perturbations.
Flagging:
GAS-P flags those values that fall within the 95% confidence limits for helical parameters or beta-strand-like parameters. Segments that have all five parameters within the 95% confidence limits are flagged as helical or beta-strand-like. We use all five parameters instead of just the three independent ones, since the derived ones are more sensitive to extreme values. In addition, GAS-P flags the segments described in the PDB files as alpha-helices, beta-strands or turns. It also uses output from the DSSP algorithm to flag hydrogen-bonded alpha-helices and beta-strands. These flags make patterns easily visible, facilitate comparisons, and simplify analysis. GAS-P defines a helix as at least two consecutive tetrads (five residues) with helical geometry. This was done because, when GAS-P was allowed to describe a helix only four residues long, many alpha-turns were mistakenly described as short helices. Additionally, although GAS-P does not look at hydrogen-bonding, most common descriptions of helices include both geometry and hydrogen-bonding. Since the shortest helix (that with one i → i+4 hydrogen-bond) is five residues long, it seemed appropriate to choose this length. Conversely, since short strands are common, four-residue strands, consisting of only one tetrad, are described by GAS-P. Unfortunately, this misses some short strands described by DSSP that have only two or three residues. The GAS-P descriptions are used in later analysis for calculation of least-squares axes, secondary structure lengths, direction cosines, comparisons with other descriptions, and output of axis segments into correctly labeled files.
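The flagging and minimum-length rules can be sketched as two small functions: one tests whether all five parameters of a tetrad fall inside a set of limits, and one converts runs of flagged tetrads into residue ranges. The numeric limits below are hypothetical placeholders invented for this example; they are not the 95% confidence limits derived in this work. The function names are also assumptions.

```python
# HYPOTHETICAL limits for illustration only; not the limits derived in this work.
EXAMPLE_HELIX_LIMITS = {
    "rise": (1.3, 1.7),
    "radius": (2.1, 2.5),
    "rotation": (90.0, 110.0),
    "residues_per_turn": (3.3, 4.0),
    "pitch": (4.8, 6.2),
}

def within_limits(params, limits):
    """True only if every one of the listed parameters lies inside its limits."""
    return all(lo <= params[name] <= hi for name, (lo, hi) in limits.items())

def residue_ranges(flags, min_tetrads):
    """Convert per-tetrad flags into (first, last) residue indices, 0-based inclusive.

    flags[i] refers to the tetrad that starts at residue i and covers i..i+3.
    Runs shorter than min_tetrads consecutive flagged tetrads are dropped.
    """
    ranges = []
    start = None
    for i, flagged in enumerate(list(flags) + [False]):  # sentinel closes a final run
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_tetrads:
                ranges.append((start, i + 2))  # last tetrad i-1 ends at residue i+2
            start = None
    return ranges
```

With min_tetrads=2 the shortest structure reported is five residues long, matching the helix rule; with min_tetrads=1 a single flagged tetrad yields a four-residue structure, matching the strand rule.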
Comparisons:
The program then compares the descriptions generated by GASP with the descriptions input by the user, from DSSP or mDSSP, and from the PDB file. Residues at the end of or within secondary structures that have correct hydrogen-bonding (as described by DSSP or mDSSP) but do not have correct alpha-carbon geometry (as described by GASP) are labeled perturbations. Residues at the end of secondary structures that do not have correct hydrogen-bonding but do have correct alpha-carbon geometry are labeled extensions. Short structures, such as areas of extended geometry (described by GASP as having strand-like geometry but not part of a sheet as described by DSSP), short helical areas (described by GASP as having helical geometry but not having alpha-helical hydrogen-bonds as described by DSSP), and single helical tetrads (which do not and cannot have an alpha-helical hydrogen-bond described by DSSP) are also labeled. These comparisons are output to a file for ease of analysis. The lengths of perturbations, extensions, and short structures are also compiled and output to the file.
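One way to express this comparison is as a per-residue labeling of two descriptions of the same structure type. The sketch below is an interpretation of the definitions above rather than the GASP procedure. It assumes one Boolean per residue for geometry (from GASP) and one for hydrogen-bonding (from DSSP or mDSSP), and it treats a geometry-only run as an extension when it touches a hydrogen-bonded residue and as a short structure otherwise. The label strings and function name are illustrative.

```python
def compare_descriptions(geometry, hbond):
    """Label each residue by comparing geometric and hydrogen-bond descriptions."""
    n = len(geometry)
    labels = []
    for g, h in zip(geometry, hbond):
        if g and h:
            labels.append("agree")
        elif h:
            labels.append("perturbation")   # hydrogen-bonded, geometry out of bounds
        elif g:
            labels.append("geometry-only")  # resolved in the second pass
        else:
            labels.append("none")
    i = 0
    while i < n:
        if labels[i] != "geometry-only":
            i += 1
            continue
        j = i
        while j < n and labels[j] == "geometry-only":
            j += 1
        # A run that borders a hydrogen-bonded residue extends that structure
        touches = (i > 0 and hbond[i - 1]) or (j < n and hbond[j])
        labels[i:j] = ["extension" if touches else "short structure"] * (j - i)
        i = j
    return labels
```

The lengths of perturbations, extensions and short structures can then be obtained by counting consecutive residues that carry the same label.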
Computers and Programs:
GASP was developed in FORTRAN on a VAX, and then rewritten for a UNIX host. The axis segments and least-squares axes were analyzed on a Silicon Graphics computer using Insight II (MSI/Biosym Technologies, Inc.). The kinemages were developed for this paper using Prekin and Kinemage (Richardson and Richardson, 1992). A modified version of DSSP (Kabsch and Sander, 1983), mDSSP, was used to analyze hydrogen-bonding. PDBSUM (Laskowski, 2001; Laskowski et al., 1997; Laskowski et al., 1993) was used to analyze strain.
Summary of analysis:
Attempts to understand protein folding through analysis of protein structure have been conducted for many years. As discussed above, there are many methods of analysis, many computer programs, and many ways of displaying their results. We have developed a program, GASP, to analyze the geometry of secondary structures. GASP has several important differences from programs already in use and provides the investigator with several new and useful methods for protein analysis. We have used it to analyze the geometry of a database, which we created, of 112 high-resolution, independently solved protein structures. GASP defines four consecutive alpha-carbons as a tetrad and computes axis segments for all the tetrads in the protein. In the process it yields five parameters for each tetrad (rise per residue, radius, angle of rotation per residue, residues per turn, and pitch). For all 112 proteins we compiled the five parameters generated by GASP for all the tetrads included within the boundaries of the secondary structures described in the HELIX and SHEET records in the PDB files. From these data, we determined rigorous confidence limits for helical and beta-strand-like geometry.
Using these confidence limits to define secondary structure geometry, we reran the calculations over the entire length of all 112 proteins and carried out a detailed analysis of protein geometry. This analysis shows that some tetrads within the boundaries of the alpha-helices and beta-strands described by mDSSP do not fall within the 95% confidence limits for all five geometric parameters. Thus, by our definition, they do not have helical or beta-strand-like alpha-carbon geometry. The out-of-bounds sections are flagged in the output, making it easy to find perturbations in the structure. In contrast, some tetrads outside the hydrogen-bonded secondary structure boundaries described by mDSSP do fall within the 95% confidence limits for all five parameters, and thus, by our definition, do have correct helical or beta-strand-like alpha-carbon geometry. These are also easy to see in the output as flagged tetrads outside the defined structures. Some of these tetrads correspond to turns, some are areas of extended conformation without standard sheet bonding, and some are at the ends of alpha-helices and beta-strands, indicating that their geometry extends beyond the mDSSP descriptions. Thus, one advantage of GASP is that it gives us an easy method of finding ambiguous areas of definition. These areas have the potential to be biologically interesting. Our analysis shows that these ambiguous areas are often conserved in homologous proteins. This suggests that they are probably not artifacts, but that Nature has reasons for putting them there. Although these reasons are not always apparent, in some cases there appear to be readily identifiable forces at work. Several examples of areas with ambiguous propensities were studied.
By way of example, we analyzed in detail the structures of two independently solved homologous proteins, trypsin (PDB file 1TLD), and chymotrypsin (PDB file 5CHA). We compared GASP output with mDSSP descriptions to find ambiguous areas, perturbations and extensions. These analyses have led to several intriguing hypotheses and suggest several avenues of further research.