Biology 403, Fourth Lecture
Thursday 29 January 2004

Protein Structure

Outline

Since I study protein structure as my primary research interest, it is difficult for me not to overemphasize its importance as we go through the material in this course. But this week it's our explicit focus, so I'm entitled to spend some time on the subject.

Levels of protein structure

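We can describe protein structure in terms of four levels of organization: primary structure (the amino acid sequence), secondary structure (regular, local patterns of main-chain hydrogen bonding, such as helices and sheets), tertiary structure (the overall three-dimensional fold of a polypeptide chain), and quaternary structure (the association of multiple polypeptide chains into a functional whole).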

Many proteins consist of only a single polypeptide chain, so in effect they have no quaternary structure. Some fibrous proteins have such a uniform hydrogen-bonding pattern that we can describe them entirely in terms of secondary structure; for these proteins the secondary structure subsumes the tertiary structure. Other proteins and peptides, especially short ones, are floppy enough in solution that they cannot be said to have a stable secondary or tertiary structure at all.

We should say a word or two about disulfides. These are covalent linkages between the sulfur atoms in the side-chains of cysteine residues. The cysteine residues that make up the disulfides may be only a few residues apart in the linear sequence, or they may be hundreds of amino acids apart from one another. Sometimes they in fact form between separate polypeptide chains. The formation of a disulfide bond is not a simple hydrolysis reaction, as many of the other biochemical reactions we will study are; instead, it is an oxidation-reduction reaction, represented formally thus:
R-S-H + R'-S-H + (1/2)O2 -> R-S-S-R' + H2O
where we have depicted molecular oxygen as the oxidizing agent that interacts with the reduced sulfhydryl groups (S-H) in the two cysteine side-chains. Molecular oxygen is not the only biochemically relevant oxidizing agent, but it often acts as the ultimate electron acceptor in systems of this kind.

Hydrogen bonds

Hydrogen bonds are weak, largely electrostatic interactions between atoms that are not covalently bonded to one another. In biochemical contexts, hydrogen bonds arise when an electronegative atom (usually N, O, S, or P) that has a hydrogen atom attached to it comes into the vicinity of another electronegative atom that does not have a hydrogen atom attached to it. Under those circumstances the partial positive charge found on the hydrogen atom interacts with the partial negative charge on the second electronegative atom. The extra negative charge on that second atom is shared with the hydrogen atom to a limited degree. This allows an approach of the two electronegative atoms toward one another that is closer than what ordinarily arises, and it reduces the overall enthalpy of the molecule by a few kcal/mol. The hydrogen bond between an amine group and a carbonyl oxygen is represented thus:
N-H...O=C
and the hydrogen bond between a hydroxyl group and a carbonyl is represented thus:
O-H...O=C
The formation of any regular structure is entropically disfavored, but at physiological temperatures the enthalpic advantage afforded by the hydrogen bond generally overcomes that entropic disadvantage, so hydrogen bonds generally form when they can. But the overall energetic favorability is small enough that these interactions can be readily disrupted.
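The balance between enthalpy and entropy described here can be written with the standard Gibbs free energy relation. The symbols (ΔG for the free energy change, ΔH for the enthalpy change, ΔS for the entropy change, T for absolute temperature) are not used elsewhere in these notes and are introduced only for this sketch:

```latex
\Delta G = \Delta H - T\,\Delta S,
\qquad
\Delta H < 0,\ \ \Delta S < 0
\ \ \Longrightarrow\ \
\bigl(\Delta G < 0 \iff |\Delta H| > T\,|\Delta S|\bigr)
```

In words: hydrogen-bond formation is favorable as long as the enthalpy gained exceeds the entropic cost at the temperature in question. Because the enthalpy gain is only a few kcal/mol, the margin is small, which is consistent with these interactions being easy to disrupt.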

Secondary structure

In many texts the distinction is made between fibrous and globular proteins. As Zubay says in his Principles of Biochemistry,

Fibrous proteins aggregate to form highly elongated structures having the shape of fibers or sheets. Each protein unit that makes up these aggregated structures is built from a repeating structural motif, giving the structures a simple shape that is relatively easy to analyze. By contrast, globular proteins are more complex, containing one or more polypeptide chains folded back on themselves many times to give an approximately spherical shape.
Thus we will examine secondary structures in fibrous proteins first because they are simpler.

The first structural motif in fibrous proteins that Linus Pauling and Robert Corey studied in detail was the alpha helix. This is the predominant structural motif found in α-keratin, which is the protein found in hair, wool, and fingernails. We will see throughout the course that globular proteins contain varying quantities of α-helix as well. In α-keratin, the helical structure extends through an entire polypeptide strand.

The alpha helix is characterized by a specific, localized hydrogen bonding pattern among main-chain amine and carbonyl groups. As such it is only loosely dependent on the identities of the side-chains on the amino acids that participate in the hydrogen bonding. The alpha-helix pattern consists of a hydrogen bond between a main-chain carbonyl oxygen and the main-chain amine group located three and one-third residues farther along the chain, counting along the backbone atoms; in terms of whole residues, the carbonyl of residue i is bonded to the amine of residue i+4:

                    ________________________________
                   |                                |
 NH-CHR-CO-NH-CHR-CO-NH-CHR-CO-NH-CHR-CO-NH-CHR-CO-NH-CHR-CO...
         |________________________________|
The protein adopts a roughly linear conformation when the hydrogen bonds in the series line up nearly parallel to one another. Because of the twist imparted to the chain by this hydrogen bonding, the number of residues per full turn is a bit larger than 3.33; it is more like 3.6. As shown in fig. 4.10 in the text, the side chain (R) groups in a helical protein stick out from the axis of the helix. Therefore there is an opportunity for the side chains to be thoroughly separated from one another. This reduces the likelihood of steric clashes between these groups, and helps the alpha helical structure to grow roughly independent of the identities of the side chains. Proline residues introduce a kink in the main-chain because of the bending of the C-N bond that forms part of the five-membered ring in the imino acid group. But for the other nineteen standard amino acids, helical conformations can form readily. The alpha helix is a right-handed structure; that is, the twist in the structure is analogous to the twist in your right hand if you curl your fingers.
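As a rough illustration of the helix geometry, the sketch below lists the hydrogen-bonded residue pairs and the angular position of each side chain around the helix axis. It assumes the diagram above is read as linking the carbonyl of residue i to the amine of residue i+4, and it uses the figure of 3.6 residues per turn quoted in the text. The function names are hypothetical, chosen only for this example.

```python
# Sketch of alpha-helix bookkeeping. Assumptions: residues are numbered
# from 1, the carbonyl of residue i bonds to the NH of residue i + 4,
# and there are 3.6 residues per full turn (the value quoted in the text).

RESIDUES_PER_TURN = 3.6
DEGREES_PER_RESIDUE = 360.0 / RESIDUES_PER_TURN  # about 100 degrees


def helix_hbond_pairs(n_residues, offset=4):
    """Return (carbonyl_residue, amine_residue) pairs for an ideal helix."""
    return [(i, i + offset) for i in range(1, n_residues - offset + 1)]


def wheel_angles(n_residues):
    """Return the angle (degrees) of each side chain around the helix axis."""
    return [round((k * DEGREES_PER_RESIDUE) % 360.0, 1) for k in range(n_residues)]


if __name__ == "__main__":
    print(helix_hbond_pairs(8))  # [(1, 5), (2, 6), (3, 7), (4, 8)]
    print(wheel_angles(8))
```

Successive side chains are about 100 degrees apart around the axis, which is one way to see why they are well separated from one another.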

Another common structure found in fibrous proteins is the beta-pleated sheet. This predominates in β-keratin, which is a fibrous protein found in silk and some other structural proteins. The hydrogen-bonding pattern that stabilizes the beta-pleated sheet is less local than that found in the alpha helix, because it involves interactions between separate segments of polypeptide with no defined spacing between the participating residues. There are two kinds, parallel and antiparallel sheets, distinguished by whether the participating strands are running in the same (N-C and N-C) direction or in opposite (N-C and C-N) directions. Thus in the antiparallel sheet the pattern involves hydrogen bonds between main-chain carbonyl oxygen atoms in a polypeptide chain and main-chain amine NH groups in a strand that is running locally in the opposite direction from the first strand. In a parallel sheet the same kinds of hydrogen bonds are forming (NH to O=C), but the two chains are running locally parallel. The hydrogen bonding pattern in the antiparallel sheet results in a modest amount of kinking of the successive peptide planes with respect to one another, but the kinks in one of the chains participating in the hydrogen bonding conform to the kinks in the other one. This allows for the extension of the hydrogen bonding pattern over multiple strands. A similar system operates with the parallel sheet, but the degree of kinking is a bit stronger, making the structure a bit more scrunched and the hydrogen bonding pattern slightly more strained. In both the parallel and antiparallel sheets, we can think of the average plane that contains several consecutive peptide planes in this arrangement as being horizontal. If we do, the side-chains stick out above and below that average plane. Successive side chains on a single strand will alternate poking up and down, so they will not collide with one another; on the other hand, the side chains on neighboring strands point in the same direction, and can occasionally produce steric clashes if they are large enough.

The kinks we referred to above mean that beta sheet structures are not completely flat: the hydrogen bonds formed between neighboring strands are most stable when there is a certain amount of undulation in the approximately planar structure. The undulation vaguely resembles a pleat in a garment, which is why these structures are known as pleated sheets.

Collagen is an example of a more complex secondary structure. Each strand is a left-handed helix with a regular hydrogen-bonding pattern, and three strands are arranged in parallel and twisted together to form a three-stranded, roughly linear conformation. This arrangement depends on the presence of many glycine residues, because the three strands come close together every third residue, and nothing but glycine will fit into those positions of close approach. Therefore collagen's sequence is predominantly Gly-X-Y, where X is often proline and Y is often hydroxyproline. Hydroxyproline is not produced as such from the ribosomal protein synthetic apparatus; instead, it is incorporated as proline and converted to hydroxyproline enzymatically as or after the polypeptide chain is created. This kind of chemical change in an amino acid residue in a protein is known as a post-translational modification.
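The Gly-X-Y rule lends itself to a simple check: scan a sequence and ask whether glycine occupies every third position in some reading frame. The sketch below is only an illustration. The function name is hypothetical, the test sequences are invented rather than taken from a real collagen, and the use of "O" as a one-letter code for hydroxyproline is a convention assumed here.

```python
def gly_repeat_frame(sequence):
    """Return the frame (0, 1, or 2) in which every third residue is
    glycine ('G'), or None if no frame satisfies the Gly-X-Y pattern.

    Sequences use one-letter amino acid codes.
    """
    seq = sequence.upper()
    for frame in range(3):
        positions = range(frame, len(seq), 3)
        if len(positions) > 0 and all(seq[i] == "G" for i in positions):
            return frame
    return None


if __name__ == "__main__":
    # Invented collagen-like repeat: Gly-Pro-Hyp, Gly-Pro-Hyp, Gly-Ala-Hyp
    print(gly_repeat_frame("GPOGPOGAO"))   # 0
    # Same pattern starting one residue late
    print(gly_repeat_frame("PGPOGPO"))     # 1
    # Invented non-collagen sequence
    print(gly_repeat_frame("MKTAYIAKQR"))  # None
```

A strict check like this rejects any sequence with a single interruption in the repeat; since the text says only that the sequence is predominantly Gly-X-Y, a more tolerant version would report the fraction of third positions occupied by glycine.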

Secondary structure in globular proteins

Alpha helices and beta sheets are fairly common elements of secondary structure in globular proteins. Unlike fibrous proteins, globular proteins usually have short segments of a specific kind of secondary structure. After one of these segments there is often a break, where one to thirty amino acid residues will not be participating in a regular pattern of secondary structure. At some later point another element of secondary structure will begin. Some globular proteins, such as the globins and most cytochromes, are almost entirely helical; but even here the segments of alpha helix extend only for four to twenty amino acids, after which there is a turn or bend, and then another helix starts up.

Other proteins have a predominantly beta structure. A typical mostly-beta globular protein will contain a pleated sheet of two to eight strands, each with four to twelve amino acids in it. At the ends of the sheet there will be interruptions where the main chain in one strand turns through close to 180° and then heads down the sheet in the opposite direction. If the turn is tight enough, the second strand may be the one to which the first is hydrogen-bonded; if it is a wider turn, it may be displaced down by two or more strands on the sheet.

We will encounter many other proteins that include substantial segments of both alpha helix and beta sheet structure. In some cases the helical segments alternate in a regular way with the sheet structures; in other cases the protein will contain a mostly-helical portion that is separate from a mostly-sheet portion. In the transitions between secondary structure elements, there are usually a few (or even many) amino acids that are not involved in regular main-chain hydrogen bonding patterns of any kind.

Tertiary Structure

Tertiary structure is a significant element of how we describe globular proteins and draw conclusions about how these globular proteins perform their biological functions. Tertiary structure is less crucial to our understanding of fibrous proteins, for which a description of the secondary structure often enables us to make sense of the protein's function. In a roughly spherical protein, by contrast, we are intensely interested in how the various pieces of secondary structure, and the amino acids that do not participate in secondary structure, fold up to give the protein its overall shape. These descriptions of tertiary structure generally require examination of three-dimensional models of structures. Since it is difficult to fit three-dimensional models onto a printed page or a flat computer screen, structural biologists have developed a variety of representational tools to help us understand these three-dimensional objects. A general statement one can make about these representations is that our ability to derive information from a model often depends as much on what is not depicted as on what is depicted. With relatively small proteins, like bovine ribonuclease (12 kDa), we have difficulty seeing the overall structure if we include every atom in the structure; the problem of omitting information in order to make sense of a picture becomes even more acute with larger proteins. The antibody molecule depicted in fig. 4.52 has a molecular mass of about 175 kDa, so to draw any useful conclusions about structure-function relationships, one needs to be very selective about the elements of structure that one actually displays.

Determining the overall three-dimensional fold of a protein usually involves experimental techniques. Under what circumstances can we determine the tertiary structure of a protein without doing a (probably lengthy) experiment? There are two cases: small, helical proteins, and proteins that closely resemble known proteins. There are a few small proteins for which entirely theoretical approaches can generate an accurate prediction of the tertiary structure based only on a knowledge of the primary sequence. These theoretical approaches are known as ab initio protein structure prediction methods. Proteins whose structures can be predicted ab initio are usually predominantly alpha-helical, and rarely larger than 10 kDa. If a protein is highly similar in sequence to a protein whose structure has already been experimentally determined, then we can often successfully make the assumption that the unknown protein's structure will be nearly identical to that of the known protein. Computer-modeling approaches that start from the assumption of exact overlap between the known structure and the unknown, and then use theoretical and heuristic approaches to guess how the changes in amino acid sequence would manifest themselves as changes in three-dimensional structure, have had substantial success in recent years. These "homology models" are often accurate enough to enable drug companies to design inhibitors of proteins based on the models.
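Homology modeling starts from a judgment that two sequences are "highly similar." One common way to put a number on that is percent identity over an existing alignment. The sketch below assumes the two sequences have already been aligned (with "-" marking gaps) and simply counts matching positions. The function name and the example sequences are hypothetical, and these notes do not specify how high the identity must be for a homology model to be trustworthy.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of aligned, non-gap positions at which two sequences match.

    Both inputs must be the same length (already aligned); positions where
    either sequence has a gap are skipped.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gaps carry no identity information
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Invented sequences differing at one of six positions
    print(round(percent_identity("ACDEFG", "ACDQFG"), 1))  # 83.3
```

Percent identity ignores the chemical similarity of the residues that differ; scoring schemes that account for conservative substitutions are a refinement beyond this sketch.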

If the protein is long enough or non-helical enough that ab initio modeling fails, and if it is not similar enough to a known protein to allow for successful homology modeling, then the structure must be determined experimentally. Even for structures that have been predicted computationally, biochemists tend to feel much more confident of the prediction if at least some experimental verifications are available. There are two basic approaches to experimental structure determination: macromolecular X-ray crystallography and multi-dimensional nuclear magnetic resonance (NMR) spectroscopy. These methods will be described below.

Domains

Domains are functional units of tertiary structure. As Zubay says,

contiguous portions of the polypeptide chain often fold into compact local units called domains, each of which might consist, for example of a four-helix cluster or a "barrel" or an antiparallel β sheet.

There are several examples of protein domains depicted in chapter 4 of Horton. Examine, for example, figs. 4.17, 4.20, 4.21, and (especially) 4.24. In each case, convince yourself that the structure involves a single contiguous unit of polypeptide, and that the structure formed would have a substantial amount of stability. A single domain is rarely larger than 100 kDa in size (~900 amino acids); often they are 30 kDa or smaller. Domains are often somewhat separated from one another, but in other cases a single domain can be closely associated with its (possibly noncontiguous) neighbor. In some cases the connectors between domains are just as rigid as the structural elements within a domain; in other cases there is a flexible "hinge" between two domains. Immunoglobulins are examples of multi-domain proteins with flexible hinges between domains. Fig. 4.52 shows the various domains in an immunoglobulin in distinct colors.
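The conversion between molecular mass and chain length used above (100 kDa is roughly 900 amino acids) follows from an average residue mass. The sketch below assumes a round figure of 110 Da per residue, which is a common rule of thumb and not a value stated in these notes; the function name is hypothetical.

```python
# Assumption: an average amino acid residue in a protein weighs about 110 Da.
AVERAGE_RESIDUE_MASS_DA = 110.0


def approx_residue_count(mass_kda):
    """Estimate the number of residues in a polypeptide of the given mass."""
    return round(mass_kda * 1000.0 / AVERAGE_RESIDUE_MASS_DA)


if __name__ == "__main__":
    for mass in (30, 100):
        print(mass, "kDa is roughly", approx_residue_count(mass), "residues")
```

With this assumption a 100 kDa domain comes out at about 909 residues and a 30 kDa domain at about 273. The estimate applies to polypeptide mass only and will drift for proteins with unusual compositions or attached non-protein groups.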

Quaternary structure

Quaternary structure is the organization of multiple polypeptide chains into a functional whole. Generally a single gene codes for a single contiguous piece of polypeptide, so a gene product (the protein produced by the DNA transcription and mRNA translation mechanism) is in general a single polypeptide chain. Producing a multiple-chain protein in which the chains are not covalently connected head-to-tail to one another therefore requires the association either of multiple copies of a gene product (as in tetrameric lactate dehydrogenase) or of multiple gene products. Hemoglobin has both: the hemoglobin molecule contains two alpha chains that are identical to one another and two beta chains that are identical to one another. So two genes are involved in producing tetrameric hemoglobin. Two molecules of alpha hemoglobin are synthesized at the ribosome, where they associate with two copies of beta hemoglobin, and the tetrameric whole is the functional unit in most land vertebrates. Immunoglobulins are examples of proteins with multiple domains and quaternary structure: immunoglobulin G, for example, contains two heavy chains and two light chains, so two copies each of two gene products are present. But within each heavy chain there are four domains separated by linkers, and within each light chain there are two domains separated by linkers. Re-examine fig. 4.52(b) in this context: there is a two-fold symmetry axis running from top to bottom in the picture, so one light chain and one heavy chain are found to the left of that axis and one of each is found to the right. The left-hand light chain is the uppermost blob of dark green protein and the blob immediately below and to the right of it. The left-hand heavy chain is all the rest of the left side of the picture.

The individual polypeptide chains in a protein with quaternary structure are called subunits. Mammalian hemoglobin has four subunits, as does immunoglobulin G. Some proteins have a much larger number of subunits; fig. 4.25(f) shows the Rhodopseudomonas photosynthetic reaction center, which contains three kinds of subunits, and two copies of one of them.

X-ray crystallography of proteins

Crystallography is the most useful tool available for determining three-dimensional structures. It is the only experimental technique that reveals overall, yet detailed, structural information on proteins larger than about 20 kDa. For smaller proteins, NMR spectroscopy (see below) offers some advantages, but even in that realm crystallography is often easier and provides more definitive information.

Crystallography involves creating solid, macroscopic objects called crystals in which the constituent molecules are arranged in a regular pattern called a lattice. In a lattice, each molecule is oriented identically relative to the others, and the spacing between the molecules in any given direction is constant. The world is full of crystal lattices of varying degrees of perfection; in fact, most objects that we think of as solids are crystalline. Proteins, on the other hand, generally function either in solution or in a membrane-anchored environment that isn't crystalline. Thus to form a crystal of protein molecules, we must coerce the molecules to interact with one another in a way that makes it energetically favorable for them to arrange themselves into a lattice.

Structural biologists produce protein crystals by purifying a protein and concentrating it, typically to concentrations somewhat higher than any single protein is present in the cytosol. Salts, organic solvents, or polysaccharides are added to the protein solution to bring about conditions under which the protein is just barely insoluble, that is, conditions where the protein is beyond saturation. If the approach to saturation is slow enough, the protein will become supersaturated, and a single crystal of the protein will grow out of the solution rather than thousands of tiny crystals or an amorphous precipitate. Techniques for growing protein crystals were considered art rather than science as recently as the early 1990's, but reasonably systematic approaches are now available.

Once a reasonable-sized (30-500 micrometers in all three dimensions) crystal has been grown, it is exposed to X-rays. Because of the regular arrangements of molecules in the lattice, constructive interference of the X-ray waves is set up in some directions, and destructive interference occurs in all other directions. If we install a two-dimensional detector that registers the arrival of the constructively-interfered X-ray waves, we will obtain a diffraction pattern like that shown in fig. 4.2(b). This pattern consists of spots of varying brightness, distributed in a known way across the surface of the detector. The size of the repeating unit that makes up the crystal determines where the spots occur; the contents of the repeating unit determine how bright the spots are. A single pattern like fig. 4.2(b) contains a small fraction of the entire ensemble of spots that must be measured. To measure the intensities of all the spots we need to know about, we must rotate the crystal about an axis and measure diffraction patterns every half-degree or so around the rotation axis. We measure the intensities of the spots (by eye or by computer) and use those as input to a messy mathematical procedure to determine the electron density in the molecule. Since the electrons are concentrated where the atoms are, this electron density "map" tells us where the atoms are.

The information derivable from a crystallographic experiment is always limited in precision: the finest detail we can distinguish from a diffraction pattern will be equal to half the wavelength of the radiation being diffracted. Copper Kα X-rays (those most commonly used in laboratory X-ray protein diffraction experiments) have a wavelength of 1.54 Ångstroms (0.154 nm), so the finest detail derivable is about 0.77 Å. Most protein crystals have internal disorder that prevents us from seeing details better than 1.5 to 2 Å anyway, so the wavelength limit is rarely a problem.
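The half-wavelength limit can be checked numerically. The sketch below uses Bragg's law, which these notes do not name but which is the standard relation behind the limit: the finest resolvable spacing is the wavelength divided by twice the sine of the largest diffraction angle, and that sine cannot exceed 1. The function name and its parameters are hypothetical.

```python
import math


def resolution_limit(wavelength_angstrom, max_theta_degrees=90.0):
    """Smallest resolvable spacing d (in Angstroms) from Bragg's law,
    d = wavelength / (2 * sin(theta)), at the largest Bragg angle measured.

    theta = 90 degrees gives the theoretical limit of half the wavelength.
    """
    sin_theta = math.sin(math.radians(max_theta_degrees))
    if sin_theta <= 0.0:
        raise ValueError("max_theta_degrees must be between 0 and 180, exclusive")
    return wavelength_angstrom / (2.0 * sin_theta)


if __name__ == "__main__":
    # Copper K-alpha wavelength from the text: 1.54 Angstroms
    print(round(resolution_limit(1.54), 2))  # 0.77
```

In practice data are collected only out to a much smaller angle than 90 degrees, so the working resolution is set by how far out the crystal diffracts rather than by the wavelength, in line with the 1.5 to 2 Å figure given above.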

Traditionally, X-ray crystallography has been the province of professionals, but the techniques are getting easier to use and the probability of success is increasing. Therefore more and more bread-and-butter biochemists are embarking on crystallographic determinations of the structures of the proteins they study rather than handing them off to specialists. Specialists will continue to play the role of methods developers, but the era in which the crystallographer provides results to the grateful biochemist will have ended in another decade.

Nuclear magnetic resonance spectroscopy of proteins

Nuclear magnetic resonance spectroscopy relies on the interaction between an atomic nucleus with an odd number of nucleons and electromagnetic radiation. Depending on the range of radiation input into a molecule, the excited nuclei may be a proton, a 15N nucleus, or certain other choices. The precise energy value at which the resonance arises depends on the distance between one nucleus and another within a molecule. In a small molecule, the number of interatomic distances is small enough that each spectroscopic resonance is well-separated from all the others, and interpreting the resonances in order to characterize the structure can be carried out by hand or with a simple software package. In a protein, where there are thousands of protons and hundreds of nitrogen atoms, the resonances are thoroughly overlapped, and even if one could separate out one of the resonances, assigning it to a specific atom-atom interaction would be impossible. Thus the kind of NMR spectroscopy that proves so useful in small-molecule structure determination has little utility in determining, ab initio, the structures of proteins. This kind of NMR can be used to answer specific questions about details of a structure for which the broad outlines are already known, but not to determine a structure from scratch.

In the 1980's a method was developed by which the initial resonance is coupled with a second, and sometimes a third or even a fourth resonance. NMR spectroscopy then becomes a two-, three-, or four-dimensional technique, and the individual resonances can be separated out in this multi-dimensional space. Assigning those resonances to specific atom-atom pairs then becomes a permutational exercise, and can require a huge amount of computer time. As selection algorithms become more sophisticated and computers get faster, the size of proteins for which this method can produce reliable structural information keeps increasing, but even now 20 kDa is generally the upper limit.
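To get a feel for why the assignment problem is described as permutational, it helps to count how many distinct atom-atom pairs exist in principle. The sketch below is only a counting exercise: the proton numbers are illustrative, the function name is hypothetical, and the count covers every possible pair rather than the subset that a real experiment would detect.

```python
import math


def proton_pair_count(n_protons):
    """Number of distinct unordered pairs among n protons: n * (n - 1) / 2."""
    return math.comb(n_protons, 2)


if __name__ == "__main__":
    for n in (100, 1000, 2000):
        print(n, "protons:", proton_pair_count(n), "possible pairs")
```

Going from 100 protons to 1000 raises the number of candidate pairs from 4950 to 499500, a hundredfold increase for a tenfold increase in size, which gives some sense of why the method becomes harder as proteins get larger.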

There are disadvantages to NMR as a tool in ab initio protein structure determination: it doesn't work with large proteins, it's computationally expensive even when it does work, and it yields not a single structure but a family of structures, each of which is seen to be equally likely compared to all the others. This last phenomenon makes it difficult to make final statements about what the structure really looks like, although at the same time it helps us remember that in fact proteins are somewhat flexible in solution. But the big advantage of NMR compared to crystallography is that it provides a picture of the structure in solution, rather than in the artificial environment of a crystal. Numerous studies have shown that the structures of crystalline proteins resemble very closely the structures in solution. But it still is very useful to have access to a technique that operates in solution.