The paths to the atomic structures of proteins and nucleic acids

Atomic structures of large biological molecules were first established by scattering X-rays in protein crystals and later with crystals of nucleic acids. Good crystals allow for an accuracy of 0.1 Å (10⁻¹¹ m) that may reveal details of catalytic processes. The novel cryo-electron microscopy method does not need crystals and it can establish chain folds confidently. Chain folds can also be derived from NMR data that yield numerous pairwise interatomic distances. Recently, chain folds for a given amino acid sequence were derived by computation alone, based on the presently available large library of proteins that are related by their amino acid sequences and structures.


Introduction
Living material consists mostly of four types of molecules: lipids, saccharides, proteins, and nucleic acids. Lipids associate with membranes or they form long-term energy reservoirs. Saccharides form either short-term energy stores (glucose alpha bonds, starch) or solid scaffolds (glucose beta bonds, cellulose). As lipids and saccharides depend on the local activities of enzymes, they are rarely uniform and, therefore, not discussed here. In contrast, proteins and nucleic acids have defined structures based on genes. The historical pathway to the elucidation of these structures is outlined here.

Protein crystal structures
The story began more than 180 years ago when Hünefeld detected crystals in squashed blood [1]. Probably, the crystals had been formed by hemoglobin, the ubiquitous red dye coloring blood. The identity of these crystals was confirmed 22 years later by Hoppe-Seyler [2], who isolated and crystallized hemoglobin. Today, such crystals would certify the uniformity of all hemoglobin molecules. At that time, however, things were less clear. In 1869 the nucleic acids were detected by Miescher [3]. In contrast to proteins, the nucleic acids did not crystallize. Analyzable crystals of these molecules were produced only almost 100 years later (Fig. 1).
In 1895 Röntgen [4] performed experiments with cathode ray tubes, in which electrons with energies of more than 10,000 eV were shot onto a metal like tungsten, where they produced a penetrating (so-called) X-radiation. This radiation showed, e.g., the bones of a hand on a scintillation screen without hurting the tissue. Consequently, this observation was developed into an important diagnostic medical tool. When in 1912, a crystal was exposed to a thin X-ray beam, a multitude of weak beams split away from the incident beam giving rise to so-called reflections that were documented on a photographic film. This observation was correctly interpreted by von Laue [5] as an interference phenomenon involving the scattering of an electromagnetic wave by the periodically located electrons in the crystal. This confirmed both the wave nature of the X-rays and the periodic arrangement of atoms in the crystal.
Fig. 1 Six lines showing the method developments for structural analyses: red, direct methods for crystals of molecules below about 1500 Da; black, MIR and molecular replacement for protein crystals; green, MIR for crystals of nucleic acids; violet, reconstruction from electron microscope data; orange, calculation from nuclear magnetic resonance distances; blue, calculations based on the available sequences and structures in data banks. Asterisks indicate Nobel prizes. MIR multiple isomorphous replacement, CASP critical assessment of (protein) structure predictions (15 meetings to date)
The wavelength of the X-rays was around 10⁻¹⁰ m = 1 Å, which corresponds to the lengths of interatomic bonds and should therefore allow one to locate individual atoms in a crystal. The reflections are measurable because the scattered waves of millions of crystal unit cells add up in the interference process. Moreover, von Laue [5] showed that the electron density in the crystal (and thus the atom positions) can be calculated from a Fourier synthesis of the reflections. However, such a reconstruction requires the intensities as well as the phases of the reflections. Unfortunately, the phases cannot be measured directly, but only derived indirectly.
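The role of the phases in the Fourier synthesis can be illustrated with a minimal sketch. It assumes a hypothetical one-dimensional unit cell containing two point atoms; the positions, electron counts, and function names are invented for illustration and are not taken from any cited work. The density is synthesized once from the complete reflections and once from their amplitudes alone.

```python
import cmath
import math

def structure_factors(atoms, n_max):
    """Complex reflections F(h), h = -n_max..n_max, for point atoms.

    atoms: list of (fractional position, electron count) pairs.
    """
    return {
        h: sum(z * cmath.exp(2j * math.pi * h * x) for x, z in atoms)
        for h in range(-n_max, n_max + 1)
    }

def fourier_synthesis(factors, n_grid):
    """Electron density on n_grid points across the unit cell."""
    density = []
    for k in range(n_grid):
        x = k / n_grid
        rho = sum(f * cmath.exp(-2j * math.pi * h * x) for h, f in factors.items())
        density.append(rho.real)
    return density

# Hypothetical toy crystal: an oxygen-like atom (8 e) and a carbon-like atom (6 e).
atoms = [(0.20, 8.0), (0.55, 6.0)]
F = structure_factors(atoms, n_max=10)

# Synthesis with amplitudes and phases, as the reconstruction requires.
with_phases = fourier_synthesis(F, n_grid=100)
# Synthesis with the measurable amplitudes only (all phases set to zero).
amplitudes_only = fourier_synthesis({h: abs(f) for h, f in F.items()}, n_grid=100)

peak_with_phases = max(range(100), key=lambda k: with_phases[k])
peak_amplitudes_only = max(range(100), key=lambda k: amplitudes_only[k])
print(peak_with_phases / 100, peak_amplitudes_only / 100)
```

In this toy case the synthesis with phases has its highest peak at x = 0.20, the position of the heavier atom, whereas the amplitude-only synthesis piles its density up at the origin and no longer shows where the atoms are.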
The phases may be established by combining several pieces of information: for instance, the electron density distribution in the crystal has to be positive everywhere, the distribution of electrons around an atomic nucleus is spherically symmetric, the atom radii are known, all atomic bonds are close to a certain specific distance, the internal symmetries of crystals cause restrictions to the phase angles, partial structures may be known and accounted for (e.g., a phenyl ring), etc. In 1986, all these possibilities were compiled by Hauptman [6] under the name direct methods. Moreover, for crystals with fewer than a handful of atoms in the unit cell, a Fourier synthesis of the mere intensities without the phases (a Patterson function) may yield the positions of these atoms in the unit cell.
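The Patterson function mentioned above can be sketched in the same spirit. The example assumes a hypothetical one-dimensional cell with two point atoms (invented values), and the width of the excluded origin region (`origin_cut`) is an arbitrary choice made for this toy case. The synthesis uses only the intensities, and its strongest peak away from the origin marks the vector between the two atoms.

```python
import cmath
import math

def intensities(atoms, n_max):
    """Measurable intensities I(h) = |F(h)|^2; the phases are discarded."""
    result = {}
    for h in range(-n_max, n_max + 1):
        f = sum(z * cmath.exp(2j * math.pi * h * x) for x, z in atoms)
        result[h] = abs(f) ** 2
    return result

def patterson(intens, n_grid):
    """Fourier synthesis of the intensities alone.

    Because I(h) = I(-h), the synthesis reduces to a cosine series.
    """
    return [
        sum(i * math.cos(2 * math.pi * h * k / n_grid) for h, i in intens.items())
        for k in range(n_grid)
    ]

# Hypothetical toy crystal with atoms at x = 0.20 and x = 0.55.
atoms = [(0.20, 8.0), (0.55, 6.0)]
P = patterson(intensities(atoms, n_max=10), n_grid=100)

# The origin peak (every atom paired with itself) is skipped; origin_cut is
# an assumed width for this example. Only half the cell is searched because
# the Patterson function is symmetric, P(u) = P(-u).
origin_cut = 5
peak = max(range(origin_cut, 51), key=lambda k: P[k])
interatomic_vector = peak / 100
print(interatomic_vector)
```

The peak appears at u = 0.35, which is the separation 0.55 − 0.20 of the two atoms. The map gives the vector between atoms but not their absolute positions, which is why this route is practical only for very few atoms per unit cell.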
The first crystal structures, namely those of NaCl and diamond, were determined in 1913 by Bragg and Bragg [7]. They were followed by numerous other structures of larger molecules culminating in the structure of vitamin B12 (Mr = 1355), which was elucidated in 1956 by Crowfoot-Hodgkin [8]. Crystals of molecules smaller than vitamin B12 are usually analyzed by direct methods [6], but these methods do not work for larger molecules.
Around 1920, the intrinsic stability of proteins like hemoglobin was generally accepted, but proteinaceous enzymes remained mysterious. As enzymes catalyze chemical reactions, they should be intrinsically mobile. At that time, they were considered colloids without a stable spatial structure. The puzzle was solved in 1926 when Sumner reported crystals of the enzyme urease [9]. The crystals indicated that enzymes also have a defined spatial structure. Later on, it became clear that enzymes are indeed mobile, but can crystallize in one of their stable states. In 1995, the first movie of all states of an enzyme over a complete catalytic cycle was published [10].
The first structural knowledge on proteins did not come from crystals but from peptide fibers. In 1931, Astbury analyzed such fibers and detected two dominant X-ray scattering patterns, which he named α (observed with wool) and β (characteristic for silk) [11]. The actual structures of the α- and β-fibers remained obscure for 20 years. However, when Pauling studied the crystal structures of small peptides, he recognized that the bonds between the amino acid residues are always in the trans conformation, greatly restricting the structures of longer peptides [12]. Long all-trans-peptides can assume only two regular conformations stabilized by hydrogen bonding, the α-helix and the β-sheet, which actually corresponded to the α- and β-patterns of the fibers analyzed by Astbury [11]. These regular conformations turned out to constitute the dominant substructures (so-called secondary structures) of proteins.
The first serious X-ray diffraction experiment on a protein crystal (Fig. 2) was performed in 1934 by Bernal [13]. The crystals contained the enzyme pepsin and showed defined reflections up to high (about 2 Å) resolution, confirming the proposal of Sumner [9] and indicating that the atomic structure of pepsin could be obtained in principle. Actually, however, the protein structure remained unknown because the phases of the reflections could not be determined. A suitable method for phase determination was invented only 17 years later by Bijvoet [14], who compared the reflection intensities of the isomorphous crystals of strychnine sulfate and strychnine selenate and derived the position of the sulfur (selenium) atom in the unit cell by a Patterson function of the reflection intensity differences. The position helped decisively in determining all phases. Bijvoet named this the method of isomorphous replacement.
Three years later, Perutz [15] used a variation of this idea with hemoglobin crystals. He soaked the crystal with a solution of mercury ions that bound locally in a defined manner at the free cysteines of the protein. Soaking was possible because his protein crystal, like all others, consisted of about 50% water. As usually several cysteines were available, he called this method multiple isomorphous replacement (MIR). The localized 80 electrons of a mercury atom change all reflection intensities measurably. A Fourier synthesis of these differences (difference Patterson) reveals the mercury atom positions, which in turn can be used for determining all phases. The MIR method was applied in almost all following structure analyses of proteins and nucleic acids. Astonishingly, Perutz [15] did not quote Bijvoet [14], the initiator of this method.
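How known heavy-atom positions lead to phases can be sketched for the simplest possible case. The example assumes a hypothetical one-dimensional, centrosymmetric cell, in which every phase reduces to a sign, and perfect isomorphism, so that the derivative reflection is the sum of the native and heavy-atom contributions. All positions and electron counts are invented. Real protein crystals are not centrosymmetric, their phases are general angles, and more than one derivative is usually needed, so this is only an illustration of the principle.

```python
import math

def centro_factor(pairs, h):
    """Real structure factor for atom pairs located at +x and -x."""
    return sum(2.0 * z * math.cos(2.0 * math.pi * h * x) for x, z in pairs)

# Hypothetical native structure (light atoms) and heavy-atom pair (80 e each).
protein = [(0.15, 8.0), (0.33, 6.0), (0.42, 7.0)]
heavy = [(0.10, 80.0)]

def phase_sign(amp_native, amp_derivative, f_heavy):
    """Choose the sign of the native reflection that best reproduces
    the amplitude measured for the heavy-atom derivative."""
    return min(
        (+1, -1),
        key=lambda s: abs(abs(s * amp_native + f_heavy) - amp_derivative),
    )

recovered, true_signs = {}, {}
for h in range(0, 11):
    f_p = centro_factor(protein, h)   # unknown in a real experiment
    f_h = centro_factor(heavy, h)     # computable once the heavy atom is located
    amp_native = abs(f_p)             # "measured" on the native crystal
    amp_derivative = abs(f_p + f_h)   # "measured" on the soaked crystal
    recovered[h] = phase_sign(amp_native, amp_derivative, f_h)
    true_signs[h] = 1 if f_p >= 0 else -1

print(recovered == true_signs)
```

Only the two amplitude sets and the heavy-atom contribution enter the sign decision; the true native structure factor is used solely to generate the simulated measurements and to check the result.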
Using the MIR method, Kendrew [16] produced the electron density map of a myoglobin crystal (Mr = 17,000) 6 years later. During this analysis the phases of around 10,000 reflections had been calculated, which was an extraordinary logistic achievement in those days without versatile computers. It should be noted that the reliable interpretation of the resulting electron density map required the amino acid sequence of myoglobin. After the pioneering work of Sanger [17], that sequence was available in time. It turned out that myoglobin consists exclusively of α-helices, the geometry of which confirmed the substructure proposal of Pauling [12]. Five years after the atomic structure of myoglobin, Phillips [18] determined the first structure of an enzyme, lysozyme, which had crystallized in one of its stable conformations as proposed by Sumner [9] and Bernal [13]. In the beginning, the MIR phasing method was generally applied. However, after numerous protein structures were established, the molecular replacement method, in which phases were determined in a refinement using a resembling (part of the) protein structure, became popular [19].
The 60 years following the determination of the structure of myoglobin saw a multitude of reports on atomic protein and enzyme structures, giving rise to a very large amount of structural data. Since 1971, the protein structure data have been standardized and compiled in an easily accessible data bank, the Protein Data Bank [20,21]. This data bank provided an exceptional stimulus for this field of research.
Until 1985, all structures were from soluble proteins, because membrane proteins failed to crystallize as they associated nonspecifically at hydrophobic surface patches. After tedious experiments, Michel [22] observed in 1982 that membrane proteins can also be crystallized if their hydrophobic surface regions were covered by detergent molecules. This expanded the field of known atomic protein structures appreciably.
The size and the importance of the published protein structures grew with time. A typical structure is shown as a ribbon plot in Fig. 3. It is the membrane channel MspA, which forms the basis of modern DNA sequence analysis [23,24]. Several important atomic structures were rewarded with a Nobel prize, beginning with the first membrane protein [25], followed by the F1-ATPase [26], the potassium channel [27], RNA polymerase [28], and the G-protein-coupled receptor [29]. The analysis of crystallized proteins remains important because only this method allows for positional accuracies of 0.1 Å that are required for the explanation of catalytic processes.

Nucleic acid structures
After the nucleic acids were detected by Miescher [3], it took a long time before their chemical structures were established. Nucleic acids are linear chains of nucleotides that are connected via phosphodiester bonds. Each nucleotide consists of a heteroaromatic ring system (base) and a ribose (RNA) or 2′-deoxyribose (DNA) 5′-phosphate. There are essentially four different bases, the sequences of which constitute the genetic information of all protein and RNA molecules. Sanger [30] and Gilbert [31] separately designed two analytical methods for determining such DNA sequences. Nowadays, the Sanger method has been greatly simplified and extensively applied, giving rise to a very large number of known natural DNA sequences. In analogy to the spatial protein structures in the Protein Data Bank [20,21], the linear DNA sequences were compiled in another easily accessible data bank, GenBank [32], which also brought a great stimulus for the research field.
Fig. 3 Ribbon model of the octameric membrane pore MspA from Mycobacterium smegmatis. The pore was produced efficiently by overexpression into Escherichia coli inclusion bodies and subsequent naturation [23]. As its pore diameter allows for the passage of a single-stranded DNA molecule, MspA was used for converting DNA sequence analysis into an automatic and cheap method [24]
The first report on the spatial structure of a piece of DNA was published only 70 years after Miescher [3], when Astbury [33] drew a thin fiber out of bulk DNA material and subjected it to X-rays. The resulting scattering pattern contained a very strong 3.5 Å reflection that indicated long stacks of bases along the DNA fiber. Thirteen years later in 1951 Chargaff [34] performed a detailed quantitative chemical analysis of DNA, finding that the amount of base G corresponded to that of base C and the amount of base A to that of base T. This indicated the existence of base pairs G-C and A-T in the DNA. Two years later, accounting for base stacks [33], base pairing [34], and for an unauthorized photo from Franklin [35], Watson and Crick built a DNA model that fitted biology (exact duplication via base pairing = inheritance), chemistry (base pairing via hydrogen bonds), physics (hydrophobic inside and polar outside), and informatics (the general structure was independent of the base pair sequence) [36]. Their model turned out to be correct and was a great leap forward.
In the following interim, there was no hint of a folded spatial structure of single- or double-stranded DNA. However, it became clear that RNA exists in more or less stable folded single-stranded structures, as indicated for the numerous transfer RNA molecules. In 1970 Cramer et al. [37] produced the first X-ray-grade crystals from a phenylalanine-specific transfer RNA of yeast. Three years later the crystal structure of this particular transfer RNA was determined by Rich [38] and independently by Klug [39], both using the MIR methods known from protein analyses. They found double helices resembling the Watson-Crick DNA model that interweaved with each other. As with proteins, the electron density distribution in the crystal could only be interpreted using the known base sequence. This sequence, however, had been established long before the crystal analysis and used for base pairing trials that had already indicated where the single-stranded RNA is involved in double-helical interactions.
In 1982 Cech [40] discovered RNA molecules that are active catalysts and named them ribozymes. Several groups then focused on and succeeded in crystallizing ribozymes, giving rise to a number of structures. In particular Yonath [41] tried to crystallize full ribosomes and parts thereof for a long time. After ribosomal crystals appeared, other scientists got interested and joined the endeavor, which in 2000 resulted in the ribosome structure being elucidated in three separate competing analyses by Yonath [41], Steitz [42], and Ramakrishnan [43]. The structure showed that ribosomes are ribozymes despite the large number of associated proteins. The ribosomal proteins do not participate in the formation of the peptide bond, but merely stabilize the RNA structure. This observation corroborated the hypothesis that there existed an original "RNA world", which was superseded by our present more efficient RNA-DNA-protein world.

Cryo-electron microscopy
Besides the MIR analyses of crystals, there exist further methods which, however, in general do not reach the quality of a good crystal structure. In 1939, Ruska [44] designed and built an electron microscope. This apparatus has a theoretical accuracy far below 1 Å, but as a result of the small aperture of the electron beam and the low contrast in the sample, the real resolution remained far above 1 Å. Over the years, however, the electron microscope and the sample preparation have been greatly improved. In 1984, for instance, Dubochet [45] introduced the cryo sample, reducing dramatically the scattering background of supporting material. Following the work of Henderson [46], the sensitivity of the electron detector was greatly improved. Upon these developments Frank [47] was able in 1995 to derive the structures of large molecules from numerous projections from different angles that were appropriately added and averaged. The method reached resolutions below 3 Å that allowed one to trace the polypeptide chain with confidence. It has been applied for numerous proteins and RNAs, all of which are available from the Protein Data Bank [20,21].
Recently, another electron microscopy method became available, micro-crystal electron diffraction, which uses essentially two-dimensional crystals and very weak electron beams [48]. Here, the third dimension is explored by stage tilting. The rate of depositions of electron microscope structures in the Protein Data Bank is presently about half of that of X-ray structures [49].

Nuclear magnetic resonance
A further crystal-free method was introduced by Wüthrich [50]. This method requires a highly concentrated mono-disperse protein solution and an apparatus suitable for the measurement of nuclear magnetic spin resonances. The measurable magnetic interaction between spatially neighboring nuclear spins can be interpreted as their local distances within the large molecule. The three-dimensional structure is then calculated from a multitude of mutual distances using an iterative algorithm. This method does not need crystals; however, the obtained structures do not reach the quality of a good crystal structure. On the other hand, the determined structure is more natural because it is not disturbed by crystal contacts. Unfortunately, the identification of contacting atoms is very tedious so that the number of such structures in the Protein Data Bank [20,21] is rather limited [49].
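The step from a set of mutual distances to a three-dimensional structure can be sketched with a simple iterative refinement. This is an illustration under stated assumptions, not the algorithm used in actual NMR structure determination: the four-atom model, the exact target distances, the starting coordinates, and the plain gradient-descent scheme are all invented for this example, whereas real NMR data provide approximate upper bounds for short distances only.

```python
import math

def pair_distances(coords):
    """All pairwise distances, keyed by atom index pair (i, j) with i < j."""
    n = len(coords)
    return {
        (i, j): math.dist(coords[i], coords[j])
        for i in range(n)
        for j in range(i + 1, n)
    }

def refine(coords, targets, step=0.05, n_iter=2000):
    """Gradient descent on the sum of squared distance violations."""
    coords = [list(p) for p in coords]
    for _ in range(n_iter):
        grads = [[0.0, 0.0, 0.0] for _ in coords]
        for (i, j), d_target in targets.items():
            d = math.dist(coords[i], coords[j])
            if d == 0.0:
                continue  # direction undefined for coincident atoms
            scale = 2.0 * (d - d_target) / d
            for k in range(3):
                g = scale * (coords[i][k] - coords[j][k])
                grads[i][k] += g
                grads[j][k] -= g
        for p, g in zip(coords, grads):
            for k in range(3):
                p[k] -= step * g[k]
    return coords

# Hypothetical "true" structure (coordinates in Å) that supplies the distances.
true_coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (1.9, 3.3, 0.0), (1.9, 1.1, 3.1)]
targets = pair_distances(true_coords)

# Distorted starting model, refined against the distances alone.
start = [(0.2, -0.2, 0.1), (3.5, 0.3, -0.2), (2.2, 3.1, 0.3), (1.7, 1.4, 2.8)]
model = refine(start, targets)

final = pair_distances(model)
worst_violation = max(abs(final[p] - targets[p]) for p in targets)
print(worst_violation)
```

Distances do not fix the position, orientation, or handedness of the model, so the result should be judged by its distance violations rather than by comparing coordinates directly.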

Calculation of atomic protein structures from amino acid sequences
Early on, it became clear that a large number of known spatial protein structures, which are related to each other across millions of organisms, may in the future form an extensive library in which a spatial structure could be derived from a given amino acid sequence alone [51]. As such a sequence can be translated from an easily measurable underlying DNA sequence, the structure analysis should become a simple enterprise. In the beginning, the number of known sequences and structures was rather small. Despite this limitation, several groups developed methods for predicting substructures from sequences. For quite a time these methods remained unreliable. This changed, however, with a combined secondary (sub) structure prediction for the given sequence of the enzyme adenylate kinase [52]. Here, a simple addition of the submitted nine predictions outlined accurately all α-helices, β-strands, and loops. Obviously, at that time, the size of the available data library allowed for an identification of substructures.
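The "simple addition" of several predictions can be pictured as a vote at every residue position. The sketch below assumes five hypothetical prediction strings (H for α-helix, E for β-strand, C for loop) and an assumed tie-breaking rule in favor of the loop state; it does not reproduce the data or the exact procedure of the adenylate kinase study [52].

```python
def consensus(predictions):
    """Per-residue majority vote over equally long prediction strings.

    Ties are resolved in favor of "C" (loop), an assumption of this sketch:
    max() returns the first of several equally frequent states in "CHE".
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same residues")
    result = []
    for column in zip(*predictions):
        result.append(max("CHE", key=column.count))
    return "".join(result)

# Hypothetical predictions for a 12-residue stretch.
predictions = [
    "CHHHHHCCEEEC",
    "CCHHHHCCEEEC",
    "CHHHHCCCEEEC",
    "CHHHHHHCCEEC",
    "CCHHHHCEEEEC",
]
combined = consensus(predictions)
print(combined)  # CHHHHHCCEEEC
```

Individual predictions disagree at the ends of the helix and the strand, while the vote settles on one boundary for each, which is the effect the combined prediction exploited.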
In light of the numerous spatial protein structures published in the following years, the structure prediction methods improved appreciably. In order to establish the status of the field, Moult [53] invited all interested groups to a meeting for the Critical Assessment of Structure Predictions (CASP) in 1994. The meeting was held in Asilomar, California, and was considered a success. Consequently, the participants decided to repeat it every two years, giving rise to the 15th CASP meeting held this year. At the 14th CASP meeting in 2020, Jumper [54] presented the computer program AlphaFold that predicted complete spatial structures with an astonishing accuracy. It was based on the very large data compilations presently available from GenBank [32] and the Protein Data Bank [20,21]. Presently, it requires only about 1 day of computing time for a structure. Moreover, AlphaFold has been followed up by other similar programs [55,56]. As the data banks expand quickly, these programs are bound to improve in the future. Nowadays, any protein structure analysis will start with a DNA sequence (translated to an amino acid sequence) that is applied to one or more of the artificial intelligence programs [54-56]. The resulting initial model is then used for guiding all further experimental analyses.
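The first step of such an analysis, the translation of a DNA sequence into an amino acid sequence, can be sketched with the standard genetic code. The function name and the short example sequence are illustrative assumptions; the sketch reads a single forward frame from the first base and ignores introns, alternative genetic codes, and ambiguous bases.

```python
from itertools import product

# Standard genetic code in the conventional T, C, A, G ordering;
# "*" marks the three stop codons.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding DNA strand, codon by codon, up to the first stop."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

# Short illustrative reading frame ending in the stop codon TAA.
sequence = translate("ATGGTGCACCTGACTCCTGAGTAA")
print(sequence)  # MVHLTPE
```

The resulting one-letter amino acid string is the kind of input that the prediction programs take.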