Keywords

  1. How can information be quantified, and why is this important?

  2. DNA damage—what are the basic repair mechanisms?

  3. What is the relation between the 3D structure of the genetic code and its function?

  4. Discontinuity of genes—its role in biology.

  5. What is the reason for the unidirectionality (5′ → 3′) of DNA synthesis?

  6. Epigenetics—the mechanism and its biological role.

  7. Extragenetic storage of information in the cell.

  8. What processes determine the variability and diversity of embryonal development? What about evolutionary development?

  9. When does the “multiple attempts” (k) requirement become relevant in trying to reach a goal?

  10. What enables the possibility of targeted action through the “p” route?

  11. Detoxification—theoretical premise and mechanism.

  12. The red blood cell does not have a nucleus, but is capable of function—where, then, is the necessary information stored?

  13. What basic information is required for an organism to be formed from autonomous cells?

  14. How are processes which consist of multiple stages encoded?

  15. In general, how does biology solve problems which carry a high degree of uncertainty?

The mechanisms described in this chapter can be experimented with using web applications we provide for the reader’s convenience, including the Information Probability (IP) tool, available at https://ip.sano.science, which simulates the likelihood of achieving a given goal depending on the number of repetitions and the elementary probability.

3.1 Information as a Quantitative Concept

The science of regulatory mechanisms has recently emerged from the shadow of structural and functional research. Observational evidence indicates that most of the information encoded in the genome serves regulatory purposes; hence scientific interest in biological regulation is growing rapidly.

Our emerging knowledge of regulatory mechanisms calls for a quantitative means of describing information. The twentieth century, and the 1940s in particular, brought significant progress in this matter, owing to discoveries by N. Wiener, G. Walter, W. R. Ashby, C. E. Shannon, and others. In order to explain the nature of information and the role it plays in biological systems, we first need to revisit our terminology.

Information is commonly understood as a transferable description of an event or object. Information transfer can be either spatial (communication, messaging, or signaling) or temporal (implying storage).

From a quantitative point of view, information is not directly related to the content of any particular message, but rather to the ability to make an informed choice between two or more possibilities. Thus, information is always discussed in the context of some regulated activity where the need for selection emerges.

Probability is a fundamental concept in the theory of information. It can be defined as a measure of the statistical likelihood that some event will occur.

If the selection of each element from a given set is equally probable, then the following equation applies:

$$ p=1/N $$

where N is the number of elements in the set.

Let us consider a car driving down a straight road. If the layout of the road does not force the driver to make choices, the information content of the driving process is nil. However, when forks appear and a decision has to be made, the probability of choosing the correct exit is equal to p = 1/2, p = 1/3, or p = 1/5, for a two-, three-, or five-way fork, respectively.

Choosing the correct route by accident becomes less probable as the number of exits increases (Fig. 3.1).

Fig. 3.1 Road forks with varying numbers of exits

If more than one exit leads to our intended destination, then the associated probability increases, becoming equal to 2/5, 3/5, or 4/5 for a five-way fork where two, three, or four exits result in the correct direction of travel. Clearly, if all exits are good, the probability of making a correct choice is given as

$$ p=5/5=1.0 $$

The value 1.0 implies certainty: no matter which exit we choose, we are sure to reach our destination.

Probabilities are additive and multiplicative. Total probability is a sum of individual probabilities whenever an alternative is involved, i.e., when we are forced to choose one solution from among many, provided that some of the potential choices are correct and some are wrong. If a four-way fork includes two exits which lead to our destination, the probability of accidentally making the right decision (i.e., choosing one of the two correct exits) is equal to 1/2, according to the following formula:

$$ p={p}_1+{p}_2=1/4+1/4=1/2 $$

which means that, given four exits, the likelihood of choosing an exit that leads to our destination is the sum of the individual probabilities of choosing any of the correct exits (1/4 in each case).

If, in addition to the fork mentioned above, our route includes an additional five-way fork with just one correct exit, we can only reach our destination if we make correct choices on both occasions (this is called a conjunction of events). In such cases, probabilities are multiplicative. Thus, the probability of choosing the right route is given as

$$ p=\left({p}_1+{p}_2\right)\cdot {p}_3=\left(1/4+1/4\right)\cdot 1/5=1/10 $$
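
To make the two rules concrete, here is a minimal Python sketch (our illustration, not part of the chapter or of the IP tool; the helper name fork_probability is ours) that recomputes the route probability:

```python
# Probability of reaching the destination across two consecutive forks:
# correct exits at a single fork ADD; independent forks in sequence MULTIPLY.

def fork_probability(exits: int, correct: int) -> float:
    """Chance of randomly picking a correct exit when all exits are equally likely."""
    return correct / exits

p_fork1 = fork_probability(exits=4, correct=2)  # 1/4 + 1/4 = 1/2
p_fork2 = fork_probability(exits=5, correct=1)  # 1/5
p_route = p_fork1 * p_fork2                     # conjunction of both choices

print(p_fork1, p_fork2, p_route)  # 0.5 0.2 0.1
```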

The larger the set of choices, the lower the likelihood of making the correct choice by accident and—correspondingly—the more information is needed to choose correctly. We can therefore state that an increase in the cardinality of a set (the number of its elements) corresponds to an increase in selection indeterminacy. This indeterminacy can be understood as a measure of “a priori ignorance.” If we do not know which route leads to the target and the likelihood of choosing each of the available routes is equal, then our a priori ignorance reaches its maximum possible value.

The difficulty of making the right choice depends not only on the number of potential choices but also on the conditions under which a choice has to be made. This can be illustrated by a lottery where 4 of 32 numbers need to be picked.

The probability of making one correct selection is 4/32. The probability that two numbers will be selected correctly can be calculated as a product of two distinct probabilities and is equal to

$$ {p}_2=\left(4/32\right)\cdot \left(3/31\right) $$

Similarly, the probability that all of our guesses will be correct is given as

$$ {p}_4=\left(4/32\right)\cdot \left(3/31\right)\cdot \left(2/30\right)\cdot \left(1/29\right)={p}_{\mathrm{WIN}}=0.0000278 $$

The above value denotes our chances of winning the lottery. In contrast, the corresponding probability of a total loss (i.e., not selecting any of the four lucky numbers) is pLOSS = 0.5694. If someone informs us that we have lost and that not a single one of our selections was correct, we should not be surprised, as this is a fairly likely outcome. However, a message telling us that we have won carries a far higher information content—not due to any emotional considerations but because the odds of winning are extremely low.
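
These lottery figures can be checked in a few lines of Python. The sketch below (ours, standard library only) reproduces the sequential products and cross-checks them against the equivalent combinatorial formula:

```python
from math import comb

# Four winning numbers among 32; the player picks 4 without replacement.
p_win = (4 / 32) * (3 / 31) * (2 / 30) * (1 / 29)       # every pick correct
p_loss = (28 / 32) * (27 / 31) * (26 / 30) * (25 / 29)  # no pick correct

# Cross-check: C(4,k) ways to pick k winners, C(28,4-k) ways to pick losers,
# out of C(32,4) equally likely tickets (hypergeometric distribution).
assert abs(p_win - comb(4, 4) * comb(28, 0) / comb(32, 4)) < 1e-12
assert abs(p_loss - comb(4, 0) * comb(28, 4) / comb(32, 4)) < 1e-12

print(f"p_WIN  = {p_win:.7f}")   # 0.0000278
print(f"p_LOSS = {p_loss:.4f}")  # 0.5694
```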

In 1928 R. V. L. Hartley defined information quantity (I) as

$$ I=-{\log}_2(p)\ \left[\mathrm{bit}\right] $$

(For a base-2 logarithm, the result is given in bits.)

Referring to the above example, the information quantity contained in a message indicating that we have won the lottery is given as

$$ {I}_{\mathrm{WIN}}=-{\log}_20.0000278=15.1\ \mathrm{bit} $$

A corresponding message informing us of a total loss carries significantly less information:

$$ {I}_{\mathrm{LOSS}}=-{\log}_20.5694=0.812\ \mathrm{bit} $$

The bit is a basic unit of information, corresponding to the quantity of information required to make a choice between two equally probable events or objects.
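
As a quick illustration (ours, not from the text), Hartley’s formula can be applied directly to the two lottery messages and to a simple binary choice:

```python
from math import log2

def hartley_bits(p: float) -> float:
    """Information quantity I = -log2(p) of a message whose probability is p."""
    return -log2(p)

print(f"win : {hartley_bits(1 / 35960):.2f} bits")  # ~15.13 -- a very surprising message
print(f"loss: {hartley_bits(0.5694):.3f} bits")     # ~0.812 -- an expected one
print(f"coin: {hartley_bits(0.5):.0f} bit")         # exactly 1 bit for a 50/50 choice
```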

In the above example, we focused on two unambiguous scenarios, i.e., hitting the jackpot (all selections correct) or losing entirely (all selections incorrect). If you wish to learn more, consider the following example.

Arriving at a probabilistic measure of entropy requires us to consider all possible outcomes (called realizations). This includes partial wins. For example, the probability of getting exactly two numbers right is

$$ {\displaystyle \begin{array}{c}{p}_{(2)}=4/32\cdot 3/31\cdot 28/30\cdot 27/29\\ {}+4/32\cdot 28/31\cdot 3/30\cdot 27/29\\ {}+4/32\cdot 28/31\cdot 27/30\cdot 3/29\\ {}+28/32\cdot 4/31\cdot 3/30\cdot 27/29\\ {}+28/32\cdot 27/31\cdot 4/30\cdot 3/29\\ {}+28/32\cdot 4/31\cdot 27/30\cdot 3/29\end{array}} $$

This expression corresponds to the likelihood of arriving at the same end result (two correct guesses) in various ways (C, correct guess; W, wrong guess):

$$ \mathrm{CCWW}+\mathrm{CWCW}+\mathrm{CWWC}+\mathrm{WCCW}+\mathrm{WWCC}+\mathrm{WCWC} $$

As the goal can be reached in six different ways and the probability of each sequence is equal, we may calculate the value of p(2) using a simplified formula:

$$ {p}_{(2)}=6\cdot \left[\left(3\cdot 4\cdot 27\cdot 28\right)/\left(32\cdot 31\cdot 30\cdot 29\right)\right] $$
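
A short enumeration confirms the result. The sketch below (ours; the helper name sequence_probability is arbitrary) sums the six orderings explicitly and compares the total with the simplified formula above and with the standard hypergeometric expression:

```python
from itertools import permutations
from math import comb

def sequence_probability(pattern: str, winners: int = 4, total: int = 32) -> float:
    """Probability of a given correct/wrong (C/W) guessing order, drawing without replacement."""
    p, w_left, l_left, t_left = 1.0, winners, total - winners, total
    for ch in pattern:
        if ch == "C":
            p *= w_left / t_left
            w_left -= 1
        else:
            p *= l_left / t_left
            l_left -= 1
        t_left -= 1
    return p

orderings = {"".join(o) for o in permutations("CCWW")}       # the six distinct orders
p2_enum = sum(sequence_probability(o) for o in orderings)

p2_simplified = 6 * (3 * 4 * 27 * 28) / (32 * 31 * 30 * 29)  # formula from the text
p2_hypergeom = comb(4, 2) * comb(28, 2) / comb(32, 4)        # combinatorial equivalent

print(p2_enum, p2_simplified, p2_hypergeom)  # all ~0.06307
```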

In the case of our lottery which has five different outcomes (A, four correct guesses; B, three correct guesses; C, two correct guesses; D, one correct guess; and E, no correct guesses), the measure of uncertainty is the average quantity of information involved in making a selection.

C. E. Shannon was the first to relate statistical uncertainty to physical entropy, arriving at the formula:

$$ H=-\sum \limits_{i=1}^n{p}_i{\log}_2\left({p}_i\right) $$

where H is the information entropy, pi is the probability of the i-th outcome, and n is the number of possible outcomes.

Determining H has practical consequences as it enables us to compare different, seemingly unrelated situations.

H indicates the (weighted) average quantity of information associated with the realization of an event for which the sum of all pi equals 1.

The mathematical formula for H in the case of the presented lottery is

$$ {\displaystyle \begin{array}{l}H=-0.36440\cdot {\log}_2(0.36440)-0.06307\cdot {\log}_2(0.06307)-0.00311\cdot {\log}_2(0.00311)\\ {}\kern1.2em -0.0000278\cdot {\log}_2(0.0000278)-0.56938\cdot {\log}_2(0.56938)=1.271\end{array}} $$

The first component represents the “one correct guess” outcome; the second, “two correct guesses”; and so on until the final component where none of the selected numbers are correct. (Notice the sum of probabilities of all cases is equal to 1.)
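
The same computation can be scripted. The sketch below (ours) derives all five outcome probabilities from the hypergeometric distribution, rather than copying the rounded values, and then evaluates Shannon’s formula:

```python
from math import comb, log2

# Probabilities of k = 0..4 correct guesses in the 4-of-32 lottery.
tickets = comb(32, 4)
probs = [comb(4, k) * comb(28, 4 - k) / tickets for k in range(5)]

assert abs(sum(probs) - 1.0) < 1e-12  # the five outcomes are exhaustive

H = -sum(p * log2(p) for p in probs)
print(f"H = {H:.3f} bits")            # 1.271 bits
```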

Entropy determines the uncertainty inherent in a given system and therefore represents the relative difficulty of making the correct choice. For a set of possible events, it reaches its maximum value if the relative probabilities of each event are equal. Any information input reduces entropy—we can therefore say that changes in entropy are a quantitative measure of information. This can be denoted in the following way:

$$ I={H}_i-{H}_r $$

where Hi indicates the initial entropy and Hr stands for the resulting entropy.

Although physical and information entropy are mathematically equivalent, they are often expressed in different units. Physical entropy is usually given in J/(mol·K), while the standard unit of information entropy is 1 bit.

Physical entropy is highest in a state of equilibrium, i.e., lack of spontaneity (ΔG = 0), which effectively terminates the given reaction. Regulatory processes which counteract the tendency of physical systems to reach equilibrium must therefore oppose increases in entropy. It can be said that a steady inflow of information is a prerequisite of continued function in any organism.

As selections are typically made at the entry point of a regulatory process, the concept of entropy may also be applied to information sources. This approach is useful in explaining the structure of regulatory systems which must be “designed” in a specific way, reducing uncertainty and enabling accurate, error-free decisions.

One of the models which can be used to better illustrate this process is the behavior of social insects which cooperatively seek out sources of food (Fig. 3.2).

Fig. 3.2 Information regarding the location of a food source conveyed using visual (bee dance) or olfactory (ant pheromones—dark strip) cues

Nonrandom pathing is a result of the availability of information, expressed, e.g., in the span of the arc (1/8, 1/16, or 1/32 of the circumference of a circle).

The ability to interpret directional information presented by a bee that has located a source of food means that other bees are not forced to make random decisions. Upon returning to the hive, the bee performs a “dance,” where individual movements indicate the approximate path to the food source.

If no directional information is available, the dance is random, as the source may lie anywhere in relation to the hive. However, if specific information is available, the bee traverses an arc whose width (in relation to the full circumference of a circle) corresponds to the quantity of information. For instance, if the bee traverses 1/16 of the circumference of a circle (22.5°), the quantity of the conveyed information is 4 bits. Corrections can be introduced by widening the radius of the circle along the way to the target.
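
The arc-to-bits relation is simply Hartley’s formula applied to a fraction of the circle; a minimal sketch (ours):

```python
from math import log2

# The narrower the indicated arc, the more information the dance conveys.
for fraction in (1 / 8, 1 / 16, 1 / 32):
    arc_deg = 360 * fraction
    print(f"arc {arc_deg:6.2f} deg -> I = {-log2(fraction):.0f} bits")
# arc  45.00 deg -> I = 3 bits
# arc  22.50 deg -> I = 4 bits
# arc  11.25 deg -> I = 5 bits
```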

The fire ant exudes a pheromone which enables it to mark sources of food and trace its own path back to the colony. In this way, the ant conveys pathing information to other ants. The intensity of the chemical signal is proportional to the abundance of the source. Other ants can sense the pheromone from a distance of several (up to a dozen) centimeters and thus locate the source themselves. Figure 3.2 and Table 3.1 present the demand for information relative to the distance at which information can be detected by an insect, for a given length of its path.

Table 3.1 Comparison of the probability of reaching a given goal and the amount of required information relative to the distance from the starting point

The quantity of information required to locate the path at a distance of 2.5 cm is 3 bits. However, as the distance from the starting point increases and the path becomes more difficult to follow, the corresponding demand for information also grows.

As can be expected, an increase in the entropy of the information source (i.e., the measure of ignorance) results in further development of regulatory systems—in this case, receptors capable of receiving signals and processing them to enable accurate decisions.

Over time, the evolution of regulatory mechanisms increases their performance and precision. The purpose of various structures involved in such mechanisms can be explained on the grounds of information theory. The primary goal is to select the correct input signal, preserve its content, and avoid or eliminate any errors.

3.2 Reliability of Information Sources

An information source can be defined as a set of messages which assist the recipient in making choices. However, in order for a message to be treated as an information source, it must first be read and decoded.

An important source of information is memory which can be further divided into acquired memory and genetic (evolutionary) memory.

Acquired memory is the set of messages gathered in the course of an individual life. This memory is stored in the nervous system and—to some extent—in the immune system, both capable of remembering events and amassing experience.

However, a more basic source of information in the living world is genetic memory. This type of memory is based upon three dissimilar (though complementary) channels, which differ with respect to their goals and properties:

  1. Steady-state genetics, including the “software” required for normal functioning of mature cells and the organism as a whole. This type of information enables biological systems to maintain homeostasis.

  2. Development genetics, which guides cell differentiation and the development of the organism as a whole (also called epigenetics).

  3. Evolutionary genetics, including mechanisms which facilitate evolutionary progress.

Figure 3.3 presents a simplified model of the genome—a single chromosome with all three channels indicated (dark bands represent the DNA available for transcription of genetic material).

Fig. 3.3 Simplified view of the genome (bar) and its basic functions

Dark bands conventionally mark, within the total volume of DNA, the portion available for transcription in a mature, specialized cell.

The role of the genome is to encode and transfer information required for the synthesis of self-organizing structures in accordance with evolutionary programming, thereby enabling biological functions. Information transfer can be primary (as in the synthesis of RNA and directly used proteins) or secondary (as observed in the synthesis of other structures which ensure cell homeostasis).

3.2.1 Steady-State Genetics

Genetic information stored in nucleotide sequences can be expressed and transmitted in two ways:

  A. Via replication (in cell division)

  B. Via transcription and translation (also called gene expression—enabling cells and organisms to maintain their functionality; see Fig. 3.4).

Fig. 3.4 Simplified diagram of replication (a) and transcription/translation (b) processes. Arrows indicate the flow of regulatory signals which control syntheses (see Chap. 4)

Both processes act as effectors and can be triggered by certain biological signals transferred on request.

Gene expression can be defined as a sequence of events which lead to the synthesis of proteins or their products required for a particular function. In cell division, the goal is to generate a copy of the entire genetic material (S phase), whereas in gene expression only selected fragments of DNA (those involved in the requested function) are transcribed and translated. The reply to the trigger comes in the form of the synthesis (and thus activation) of a specific protein. Information is transmitted via complementary nucleic acid interactions, DNA–DNA, DNA–RNA, and RNA–RNA, as well as interactions between nucleic acid chains and proteins (translation). Transcription calls for exposing a section of the cell’s genetic code, and although its product (RNA) is short-lived, it can be recreated on demand, just like a carbon copy of a printed text. On the other hand, replication affects the entire genetic material contained in the cell and must conform to stringent precision requirements, particularly as the size of the genome increases.

3.2.2 Replication and Its Reliability

The magnitude of effort involved in the replication of genetic material can be visualized by comparing the DNA chain to a zipper (Fig. 3.5). Assuming that the zipper consists of three pairs of interlocking teeth per centimeter (300 per meter) and that the human genome is made up of 3 billion (3 × 10⁹) base pairs, the total length of our uncoiled DNA in “zipper form” would be equal to 1 × 10⁴ km, or 10,000 km—roughly one and a half times the distance between Warsaw and New York.

Fig. 3.5 Similarities between unfastening a zipper and uncoiling the DNA helix in the process of replication

If we were to unfasten the zipper at a rate of 1 m/s, the entire unzipping process would take nearly 4 months—the time needed to travel 10,000 km at 1 m/s. This comparison should impress upon the reader the length of the DNA chain and the precision with which individual nucleotides must be picked to ensure that the resulting code is an exact copy of the source. It should also be noted that for each base pair the polymerase enzyme needs to select an appropriate matching nucleotide from among the four types present in the solution and attach it to the chain (clearly, no such problem occurs in zippers).

The reliability of an average enzyme is on the order of 10³–10⁴, meaning that one error occurs for every 1,000–10,000 interactions between the enzyme and its substrate. Given this figure, replication of 3 × 10⁹ base pairs would introduce approximately three million errors (mutations) per genome, resulting in a highly inaccurate copy. Since the observed reliability of replication is far higher, we may assume that some corrective mechanisms are involved.
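
Both the zipper analogy and the expected error count are quick back-of-the-envelope calculations; the sketch below (ours, using the chapter’s round figures as inputs) reproduces them:

```python
# "Zipper" analogy: 3 teeth (base pairs) per cm = 300 per meter.
base_pairs = 3e9
length_km = base_pairs / 300 / 1000
print(f"genome as a zipper: {length_km:,.0f} km")        # 10,000 km

seconds = (length_km * 1000) / 1.0                       # unzipping at 1 m/s
print(f"unzipping time: {seconds / 86400:.0f} days")     # ~116 days, nearly 4 months

# Uncorrected copying at a typical enzymatic error rate of 1e-3..1e-4:
for error_rate in (1e-3, 1e-4):
    print(f"error rate {error_rate:g}: ~{base_pairs * error_rate:,.0f} mutations per genome copy")
# 1e-3 -> ~3,000,000 ; 1e-4 -> ~300,000
```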

Indeed, the remarkable precision of genetic replication is ensured by DNA repair processes and, in particular, by the corrective properties of polymerase itself. Its enzymatic association with an exonuclease acting in the 3′ → 5′ direction increases the fidelity of polymerization. DNA repair works by removing incorrect adducts and replacing them with the proper nucleotide sequences.

The direction of anti-parallel DNA strands is determined by their terminating nucleotides or, more specifically, by their hydroxyl groups (the sole participants of polymerization processes) attached to 3′ and 5′ carbons of deoxyribose. In the 5′ → 3′ direction, a free nucleotide (5′-triphosphate nucleotide) may attach itself to the 3′-carbon hydroxyl group, whereas in the opposite direction, only the 5′ carbon hydroxyl group may be used to extend the chain (Fig. 3.6).

Fig. 3.6 Anti-parallel arrangement of DNA strands, mandating structural complementarity

The proofreading properties of polymerase are an indispensable condition of proper replication of genetic material. However, they also affect the replication process itself, by enforcing one specific direction of DNA synthesis, namely, the 5′ → 3′ direction (Fig. 3.7). For reasons related to the distribution of energy, this is the only direction in which errors in the DNA strand may be eliminated by cleaving the terminal nucleotide and replacing it with a different unit. In this process, the energy needed to create a new diester bond is carried by the free 5′-triphosphate nucleotide. In the 3′ → 5′ direction, by contrast, the required energy would have to come from the triphosphate of the strand’s terminating nucleotide; cleaving that nucleotide for repair would therefore make it impossible to attach another nucleotide to the end of the chain. This is why both prokaryotic and eukaryotic organisms can only accurately replicate their genetic code in the 5′ → 3′ direction. An important consequence of this fact is the observed lack of symmetry in the replication of complementary DNA strands (Fig. 3.8).

Fig. 3.7 Polymerization of DNA strands in the 5′ → 3′ (A) and 3′ → 5′ (B) directions, together with potential repair mechanisms (A1 and B1, respectively). The inset represents a simplified nucleotide model

Fig. 3.8 Looping as a means of achieving unidirectional replication of anti-parallel DNA strands

Linear synthesis is only possible in the case of the 5′ → 3′ copy based on the 3′ → 5′ template. Synthesis of the 3′ → 5′ copy proceeds in a piecemeal fashion, through the so-called Okazaki fragments, which are synthesized sequentially as the replication fork progresses and template loops are formed. This looping mechanism yields the desired 3′ → 5′ direction of synthesis (Fig. 3.8).

The polymerase enzyme itself thus exhibits both polymerase and exonuclease activities. We should also note that the availability of a DNA template is not sufficient to begin polymerization. In order to attach itself to the chromatid and commence the process, the polymerase enzyme must first interface with a complementary precursor strand, which it can then elongate. This short fragment is called a primer (Fig. 3.9). Its ability to bind polymerase enzymes has been exploited in many genetic engineering techniques.

Fig. 3.9 A nucleic acid primer complementary to the original DNA template is required by polymerase, which also exhibits exonuclease activity in the 3′ → 5′ direction

The DNA replication fork involves an unbroken, continuous (though unwound) template. Since polymerase cannot directly attach to either of its strands, the primer must first be synthesized by an RNA polymerase (i.e., a transcriptase). This enzyme does not require a primer to initiate its function and can attach directly to the single-stranded template.

Polymerase carries out DNA synthesis by elongating the RNA fragment supplied to it by RNA polymerase. Therefore, each instance of DNA synthesis must begin with transcription. The transcriptase responsible for assisting DNA replication is called primase.

The synthesized RNA fragments are paired up with complementary nucleotides and attached to the strand being elongated by DNA polymerase. They are then replaced by DNA nucleotides, and the resulting template-complementary fragments are ligated. This seemingly complicated process, and the highly evolved structure of polymerase itself, is necessary to reduce the probability of erroneous copying of genetic information.

Permanent changes introduced into the genetic material are called mutations. They can result from random events associated with the function of the genome or from environmental stimuli, either physical (e.g., UV radiation) or chemical. Mutations which do not compromise the complementarity of DNA often go unrecognized by proofreading mechanisms and are never repaired.

In addition to direct changes in the genetic code, errors may also occur as a result of the imperfect nature of information storage mechanisms. Many mutations are caused by the inherent chemical instability of nucleic acids: for example, cytosine may spontaneously convert to uracil. In the human genome, such an event occurs approximately 100 times per day; however, uracil is not normally encountered in DNA, and its presence alerts defensive mechanisms which correct the error.

Another type of mutation is spontaneous depurination, which also triggers its own, dedicated error correction procedure. Cells employ a large number of corrective mechanisms—some capable of mending double-strand breaks or even recovering lost information based on the contents of the homologous chromatid (Fig. 3.10). DNA repair mechanisms may be treated as an “immune system” which protects the genome from loss or corruption of genetic information.

Fig. 3.10 Repairing double-strand DNA breaks: (a) using the sister chromatid to recover missing information and (b) reattaching severed strands via specific proteins

The unavoidable mutations which sometimes occur despite the presence of error correction mechanisms can be masked by the doubled representation of genetic information (alleles). Thus, most mutations are recessive and not expressed in the phenotype.

As the length of the DNA chain increases, mutations become more probable. It should be noted that the number of nucleotides in a coding sequence is three times the number of amino acids in the resulting polypeptide chain: each amino acid is encoded by exactly three nucleotides—a general principle which applies to all living organisms.

Information theory tells us why, given four available nucleotides, a three-nucleotide codon carries the optimal amount of information required to choose one of 20 amino acids. The quantity of information carried by three nucleotides, each selected from a set of four, equals I3 = −log2(1/4 · 1/4 · 1/4) = 6 bits, whereas choosing one of 20 amino acids requires I20 = −log2(1/20) = 4.32 bits of information.

If the codon were to consist of two nucleotides, it would carry I2 = −log2(1/4 · 1/4) = 4 bits of information, which is insufficient to uniquely identify one of 20 amino acids. This is why nucleotide triplets are used to encode amino acids, even though their full information potential is not exploited (a nucleotide triplet offers 4 · 4 · 4 = 64 possible combinations).

If the DNA were to consist of only two base types, the minimum number of nucleotides required to encode 20 amino acids would be 5:

$$ I=-{\log}_2\left(1/2\cdot 1/2\cdot 1/2\cdot 1/2\cdot 1/2\right)=-{\log}_2\left(1/32\right)=5\ \mathrm{bit} $$

In this case, redundancy would be somewhat reduced, but the DNA chain would become far longer, and the likelihood of harmful mutations would increase accordingly.
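
The codon-length argument condenses to a few lines of arithmetic. The sketch below (ours; the helper name symbols_bits is arbitrary) compares the information demand against what duplets, triplets, and binary quintets supply, assuming equiprobable symbols:

```python
from math import log2

def symbols_bits(alphabet: int, length: int) -> float:
    """Information carried by `length` symbols drawn from an `alphabet`-letter set."""
    return length * log2(alphabet)

print(f"needed to pick 1 of 20 amino acids: {log2(20):.2f} bits")  # 4.32
print(f"duplet,  4 nucleotides: {symbols_bits(4, 2):.0f} bits")    # 4 -- too little
print(f"triplet, 4 nucleotides: {symbols_bits(4, 3):.0f} bits")    # 6 -- enough (64 codons)
print(f"quintet, 2 base types:  {symbols_bits(2, 5):.0f} bits")    # 5 -- enough (32 codons)
```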

Considering the reliability of genetic storage mechanisms, the selected encoding method appears optimal. We should, however, note that despite the presence of many safeguards, errors cannot be completely eliminated.

3.2.2.1 Telomeres

Telomerase is a peculiar polymerase—an enzyme which elongates DNA strands, enabling reconstruction of the terminal fragments of chromosomes called telomeres. These fragments are truncated at each replication as polymerase cannot carry through to the very end of the DNA molecule due to the specific mechanism by which the lagging strand is synthesized.

Thus, telomerase elongates one of the DNA strands in the 5′ → 3′ direction. Acting as a reverse transcriptase, it utilizes its own RNA matrix which binds to the enzyme and consists of repeating AAUCCC fragments (Fig. 3.11). This, in turn, enables attachment of a normal DNA polymerase, in the usual fashion—via an RNA primer—and completion of the chromosome through synthesis of the complementary strand. The need for such a convoluted solution arises due to the anti-parallel arrangement of DNA strands and differences in the way in which each strand is synthesized during replication. Crucially, loss of telomerase activity in differentiated cells and the corresponding limitation in the number of possible divisions are strongly tied to the strategy and philosophy of nature.

Fig. 3.11 Telomere elongation mechanism: (A) uneven, truncated termination of the DNA strands, (B) attachment of telomerase to the leading strand and initiation of synthesis, (C) elongated leading strand with an attached primer, (D) onset of polymerase activity, and (E) completed telomere

3.2.2.2 Function-Oriented Organization of the Genome

Control of intracellular transcription is enabled by the specific, function-related 3D organization of chromatin, which forms loops comprising the promoter region and regulatory fragments, including an enhancer, required to generate a given biological activity. The location of such loops is determined by DNA sequences recognized by the transcription factor CTCF, which anchors loops at a given locus. The loop itself is formed by a protein complex called cohesin, which has the properties of a molecular motor: it uses ATP to migrate along the DNA chain and create an anchor joining two initially separated loci. The resulting structures form topologically associating domains (TADs), which emerge when cohesin terminates its activity upon contact with CTCF. This mechanism links the spatial structure of a chromosome to its biological activity (Fig. 3.12).

Fig. 3.12 Organization of chromatin structure. Formation of TADs: (a) unlooped form (E and P, enhancer and promoter regions, respectively), (b) looped form with conventionally marked regulatory and promoter-associating proteins, and (c) structure-initiating transcription activity

The final complex, ready to initiate transcription, is composed of many protein constituents assembled to produce the coordinated information needed for the proper initiation and reading of the DNA template by RNA polymerase. Besides the enhancer, other necessary components are recruited, including transcription factors, a protein that pulls apart the two DNA strands, and proteins which overcome the resistance arising as transcription proceeds along double-stranded DNA protected by histones. This indicates that extra information is required to make signaling in complex biological systems possible.

3.2.3 Gene Expression and Its Fidelity

Fidelity is, of course, fundamentally important in DNA replication as any harmful mutations introduced in its course are automatically passed on to all successive generations of cells. In contrast, transcription and translation processes can be more error-prone as their end products are relatively short-lived. Of note is the fact that faulty transcripts appear in relatively low quantities and usually do not affect cell functions, since regulatory processes ensure continued synthesis of the required substances until a suitable level of activity is reached.

Nevertheless, it seems that reliable transcription of genetic material is sufficiently significant for cells to have developed appropriate proofreading mechanisms, similar to those which assist replication. RNA polymerase recognizes irregularities in the input chain (which register as structural deformations) and can reverse the synthesis process, cleave the incorrect nucleotide, or even terminate polymerization and discard the unfinished transcript. In the case of translation, the ribosome is generally incapable of verifying the correctness of the polypeptide chain due to encoding differences between the polymer and its template. The specificity of synthesis is associated with the creation of aa-tRNA and is determined by the specific nature of the enzyme involved (aminoacyl-tRNA synthetase). Any errors introduced beyond this stage go uncorrected.

Once a polypeptide chain has been synthesized, it must fold in a prescribed way in order to fulfill its purpose. In theory, all proteins are capable of spontaneous folding and reaching their global energy minima. In practice, however, most proteins misfold and become “stuck” in local minima. (Alternatively, the global minimum may not represent the active form of the protein, and a local minimum may be preferable.) This is why the folding process itself is supervised by special regulatory structures: simple proteins (chaperones) or machine-like mechanisms (chaperonins), which work by attaching themselves to the polypeptide chain and guiding its folding process. Improperly folded structures are broken down via dedicated “garbage collectors” called proteasomes. If, however, the concentration of undesirable proteins reaches critical levels (usually through aggregation), the cell itself may undergo controlled suicide called apoptosis. Thus, the entire information pathway—starting with DNA and ending with active proteins—is protected against errors. We can conclude that fallibility is an inherent property of genetic information channels and that in order to perform their intended function, these channels require error correction mechanisms.

The processes associated with converting genetic information into biologically useful structures are highly complex. This is why polymerases (which are of key importance to gene expression) are large complexes with machine-like properties. Some of their subunits perform regulatory functions and counteract problems which may emerge in the course of processing genetic information. In replication, the main goal of polymerases is to ensure the equivalent synthesis of both anti-parallel DNA strands. In contrast, transcription relies on the proper selection of the DNA fragment to be expressed. Problems associated with polypeptide chain synthesis usually arise as a result of difficulties in translating information from nucleic acids to proteins.

While the cell is in interphase, its DNA is packed in the nucleus and assumes the form of chromatin. The nuclear space is subdivided into so-called chromosome territories, each occupied by an individual chromosome. Densely packed fragments of DNA, unavailable for transcription, constitute the so-called heterochromatin, while regions from which information can potentially be read are called euchromatin.

Expression of genetic information is conditioned by the recognition of specific DNA sequences which encode proteins. The function of a protein is not, however, entirely determined by its coding fragment. Noncoding fragments may also store useful information, related to, e.g., regulatory mechanisms, the structure of the chromatin strand (including its packaging), sites of specific structural features such as palindromes, etc. Proper transfer of genetic information requires precise recognition of its nucleotide arrangement.

Specific DNA sequences may be recognized by RNA or by proteins. RNA recognition is straightforward, as DNA and RNA share the same “language.” Recognition of a nucleotide sequence by proteins poses more problems; however, some proteins have evolved the ability to attach to specific sequences (transcription factors in particular). This mechanism usually operates in the major groove of the DNA double helix, whose breadth admits contact with transcription mediators (it should, however, be noted that the minor groove may also convey useful structural information via relative differences in its width and the distribution of electrostatic charges—see papers by T. Tullius and R. Rohs, 2009). All such information can be exploited by proteins in order to recognize their target sequences. Interaction between proteins and nucleotide sequences often assumes the form of protein-RNA complexes which include a short double-stranded transcript consisting of approximately 20–30 nucleotides. Such noncoding RNA fragments can bind as single-stranded pieces to proteins with defined properties (e.g., enzymes). As a result, the protein acts in a highly targeted manner, seeking out sequences which correspond to the attached RNA strand. Examples include AGO RNase enzymes as well as proteins whose function is to inhibit or destroy mRNA chains that have outlived their usefulness; these are important for maintaining biological balance within the cell. Protein-RNA complexes also participate in untangling chromatin strands, epigenetic processes, antiviral defense, and other tasks.

The double-stranded form assumed by RNA prior to interacting with proteins protects it from rapid degradation. Three groups of interfering RNA fragments (miRNA, siRNA, and piRNA) have been distinguished with respect to their length, means of synthesis, and mechanism of action. The use of short RNA fragments as “target guides” for proteins greatly increases the efficiency of functional expression of genetic information (Fig. 3.13). By the same token, specific RNA chains are also useful in translation processes. The ribosome is a nanomachine whose function depends on precise interaction with nucleic acids—those integrated in the ribosome itself as well as those temporarily attached to the complex during polypeptide chain synthesis (tRNA). Good cooperation between proteins and nucleic acids is a prerequisite of sustained function of the entire biological machinery (Fig. 3.14). The use of RNA in protein complexes is common across all domains of the living world, as it bridges the gap between discrete and continuous storage of genetic information.

Fig. 3.13 Synthesis of complexes by using RNA fragments which guide active proteins to specific sites in the target sequence. The figure shows miRNA being used to degrade redundant mRNA

Fig. 3.14 Translation process—the function of tRNA in recognizing mRNA and facilitating the synthesis of the polypeptide chain

The discontinuous nature of genetic material is an important property which distinguishes eukaryotes from prokaryotes. It enables gene splicing in the course of transcription and promotes evolutionary development. The ability to select individual nucleotide fragments and construct sequences from predetermined “building blocks” results in high adaptability to environmental stimuli and is a fundamental aspect of evolution. The discontinuous nature of genes is evidenced by the presence of fragments which do not convey structural information (introns), as opposed to structure-encoding fragments (exons).

The initial transcript (pre-mRNA) contains introns as well as exons. In order to provide a template for protein synthesis, it must undergo further processing (known as splicing): introns must be cleaved and exon fragments attached to one another. The process is carried out by special complexes called spliceosomes, which consist of proteins and function-specific RNA fragments (Fig. 3.15). These fragments inform the spliceosome where to sever the pre-mRNA strand so that introns may be discarded and the remaining exon fragments reattached to yield the mRNA template. Recognition of intron-exon boundaries is usually very precise, while the reattachment of adjacent exons is subject to some variability. Under certain conditions, alternative splicing may occur, in which the ordering of the final product does not reflect the order in which exon sequences appear in the source chain. This greatly increases the number of potential mRNA combinations and thus the variety of resulting proteins. Alternative splicing explains the clear disparity between the number of genes in the genome and the variety of proteins encountered in living organisms. It also plays a significant role in the course of evolution.

The discontinuous nature of genes is evolutionarily advantageous but comes at the expense of having to maintain a nucleus, where splicing can be safely conducted, along with efficient transport channels allowing transcripts to penetrate the nuclear membrane. While it is believed that at early stages of evolution RNA was the primary repository of genetic information, its present function can best be described as that of an information carrier. Since unguided proteins cannot ensure sufficient specificity of interaction with nucleic acids, protein-RNA complexes are often used in cases where specific fragments of genetic information need to be read.

Fig. 3.15 I, splicing; II, alternative splicing

Long RNA chains usually occur as single strands; however, they can occasionally fold into double-stranded structures which resemble proteins and can even perform protein-like functions (including enzymatic catalysis, as observed in ribozymes). It should be noted, however, that this catalytic activity has nothing in common with the natural RNA activity connected with sequence recognition; the two activities differ essentially.

In summary, we can state that the primary role of the genome and the information contained therein is to sustain living processes and enable cells to convey the mechanics of life to their offspring. This process depends on the accurate expression of DNA information in the form of proteins and on their activity, which maintains the cell in a steady state and thereby stabilizes biological processes in accordance with genetically programmed criteria.

3.2.3.1 Nuclear Pores

An example of a universal technical solution is provided by the so-called nuclear pores, which facilitate transport of substances between the nucleus and cell cytoplasm. This cross-membrane transport is highly diversified and must proceed without interruptions. Proteins are synthesized in the cytoplasm on the basis of RNA fragments synthesized and secreted by the nucleus. Therefore, mutual interaction between the nucleus and the cytoplasm is essential for the cell. Some of the most frequently synthesized proteins—ribosomal proteins—which emerge in the cytoplasm must return to the nucleus where they attach to RNA strands and then, as proper components of ribosomes, are transported back to the cytoplasm. Most molecules which migrate from the nucleus to the cytoplasm are mRNA transcripts; however, the nuclear membrane must also remain permeable for small metabolites. Taken together, these functions place special demands on transport channels. The corresponding transport mechanism must, on the one hand, remain universal, while on the other hand ensuring specificity and directionality (Fig. 3.16). The former is achieved by an elastic barrier composed of polypeptide fibrils which comprise mainly glycine and phenylalanine. Under the influence of the appropriate signals, these fibrils either expand or contract, creating gateways for substances which must be conveyed across the membrane. Transport is therefore controlled by specific signaling molecules and is an active process, requiring energy.

Fig. 3.16 Systemic solution which ensures selectivity and diversity of nuclear membrane transport by employing fibrillar polypeptide structures called nucleoporins (FG/Nup)—a plastic barrier which can be modulated by appropriate signals depending on the needs and properties of transported molecules

3.2.3.2 Protein Synthesis and Degradation

Maintaining protein homeostasis is of critical importance for cells, enabling them to function and respond to stressors. It is also important in the context of senescence and longevity of organisms.

Protein degradation is a particularly sensitive component of homeostasis because, unlike synthesis, it is inherently stochastic. Proteins differ with respect to longevity; what is more, misfolded or structurally deficient proteins may emerge, and the quantity of such proteins increases significantly under stress.

Proteins are degraded either by lysosomal proteases or via the proteasome pathway. Approximately 30% of all synthesized proteins are processed via the endoplasmic reticulum. Environmental conditions in the reticulum are less controllable than in the cytoplasm, which results in locally greater levels of misfolded proteins whose conformational properties may promote unwanted aggregation. Consequently, the reticulum is equipped with a special receptor system which reacts to the presence and concentration of undesirable proteins. This system is referred to as the UPR (unfolded protein response). The role of the UPR is to detect misfolded proteins and activate a stress response by which protein structure may be repaired (owing to chaperones) or, alternatively, the cell may undergo apoptosis. Some misfolded proteins may also penetrate into the cytoplasm, where they are destroyed in the proteasome pathway, through ubiquitination.

Overall, UPR consists of three components. The Ire1 receptor senses misfolded proteins (Fig. 3.17), and its action is linked to PERK kinase and the ATF6 transcription factor. UPR appears to play a key role in maintaining protein homeostasis. In the presence of misfolded proteins, Ire1 undergoes phosphorylation, while its constituent RNase becomes active. This, in turn, triggers an alarm through modification of a specific pre-mRNA and production of an mRNA matrix for the appropriate transcription factor.

Fig. 3.17 Action of Ire1 receptors in the UPR system. Capture of misfolded proteins triggers an alarm signal. K, kinase; R, RNase

3.2.4 Epigenetics

Epigenetics is a branch of science which studies the differentiation of hereditary traits (passed on to successive generations of cells by means of cell division) through persistent activation or inhibition of genes, without altering the DNA sequences themselves. Differentiation is a result of chemical, covalent modification of histones and/or DNA, and the action of non-histone proteins which affect the structure of the chromatin. Differentiation has no bearing on the fidelity of information channels; instead, it determines the information content, i.e., the set of genes released for transcription.

Differentiation plays a key role in the expression of specialized cell functions (as opposed to basic functions encoded by the so-called housekeeping genes, which are relatively similar in all types of cells). Information stored in DNA can be accessed by specific protein complexes which uncoil the chromatin thread and present its content for transcription. This process is guided by markers: modified (usually methylated) histone amino acids and/or methylated DNA nucleotides. Modifications ensure the specificity of binding between DNA and non-histone proteins and therefore guide the appropriate release of genetic information, facilitating biological development and vital functions. Intracellular differentiation processes are initiated at specific stages in cell development via RNA-assisted transcription factors. Their function can be controlled by external signals (hormones), capable of overriding intracellular regulatory mechanisms.

Epigenetic mechanisms are observed in the following processes:

  1. Embryogenesis and regeneration

  2. Stem cell survival and differentiation (e.g., bone marrow function)

  3. Selective (single-allele) inheritance of parental traits (also called paternal and maternal imprinting), including functional inhibition of chromosome X

  4. Epigenetics of acquired traits


3.2.5 Development Genetics (Embryogenesis and Regeneration): The Principles of Cell Differentiation

Epigenetic differentiation mechanisms are particularly important in embryonic development. This process is controlled by a very specific program, itself a result of evolution. Unlike the function of mature organisms, embryonic programming refers to structures which do not yet exist but which need to be created through cell proliferation and differentiation. The primary goal of development is to implement the genetic blueprint—a task which is automated by sequential activation of successive development stages in accordance with chemical signals generated at each stage (Fig. 3.18). Similar sequential processing can be observed in cell division which consists of multiple, clearly defined stages. It should be noted that embryonic development programs control both proliferation and differentiation of cells.

Fig. 3.18 The sequential nature of cell differentiation. Arrows indicate intra- and intercellular signals which trigger each differentiation stage. Top inset: schematic depiction of a chromatin fragment (DNA + protein) in its active (a) and suppressed (b) form (structure and packing of complexes). Bottom inset: emergence of a differentiated group of cells (black layer) as a function of interaction between adjacent cell layers

Differentiation of cells results in phenotypic changes. This phenomenon is the primary difference between development genetics and steady-state genetics. Functional differences are not, however, associated with genomic changes: instead, they are mediated by the transcriptome, where certain genes are preferentially selected for transcription, while others are suppressed.

In a mature, specialized cell, only a small portion of the transcribable genome is actually expressed. The remainder of the cell’s genetic material is said to be silenced. Gene silencing is a permanent condition. Under normal circumstances mature cells never alter their function, although such changes may be forced in a laboratory setting, e.g., by using viral carriers to introduce special transcription factors associated with cellular pluripotency (Nanog, Oct4, Sox2, Klf4, Myc). A similar reversal of differentiation is also observed in neoplastic tissue. Cells which make up the embryo at a very early stage of development are pluripotent, meaning that their purpose can be freely determined and that all of their genetic information can potentially be expressed (under certain conditions). Maintaining the chromatin in a change-ready state is a function of hormonal factors called morphogens (e.g., the sonic hedgehog protein). At each stage of the development process, the scope of pluripotency is reduced until, ultimately, the cell becomes monopotent. Monopotency implies that the final function of the cell has already been determined, although the cell itself may still be immature. This mechanism resembles human education, which is initially generalized; at a certain point (usually prior to college enrollment), the student must choose a specific vocation, even though he or she is not yet considered a professional.

As noted above, functional dissimilarities between specialized cells are not associated with genetic mutations but rather with selective silencing of genes. This process may be likened to deletion of certain words from a sentence, which changes its overall meaning. Let us consider the following adage by Mieczysław Kozłowski: “When the blind gain power, they believe those they govern are deaf.” Depending on which words we remove, we may come up with a number of semantically nonequivalent sentences, such as “When the blind gain power, they are deaf.” or “When the blind gain power, they govern.” or even “The blind are deaf.” It is clear that selective transcription of information enables us to express various forms of content.

3.2.5.1 Molecular Licensing of Genes for Transcription

The “gene licensing” mechanism depends primarily on chemical modifications (mostly methylation of histones but also of DNA itself) and attaching the modified chromatin to certain non-histone proteins. In addition to methylation, the activity of a given gene may be determined by acetylation, ubiquitination, sumoylation, and phosphorylation. However, the nature of the modifying factor is just one piece of the overall puzzle. Equally important is the modification site: for instance, methylation of lysine at position 4 of histone 3 (H3K4) promotes transcription, while methylation of lysine at position 9 (H3K9) results in a different DNA-protein binding and inhibits gene expression (Fig. 3.19). The degree of methylation (the presence of one, two, or three methyl groups) matters as well: triple methylation usually occurs at positions 4, 9, 27, and 36 of histone 3, as well as at position 20 of histone 4. Positions 9 and 27 are particularly important for gene suppression because methylated lysine acts as an acceptor for certain Polycomb proteins which inhibit transcription. On the other hand, position 4 of histone 3 is associated with promotion of gene transcription mediated by trithorax proteins.

Fig. 3.19 Simplified model of histone and DNA methylation and its results

Gene suppression may be reversed through detachment of the coupled proteins and/or demethylation (Fig. 3.20a). Phosphorylation, which introduces a negative charge, promotes dissociation of inhibitors and is therefore useful in regulatory mechanisms (Fig. 3.20b). It is interesting to note that phosphorylation may also occur in histones: for instance, histone 3 includes serine units which bind phosphate residues at positions 10 and 28, i.e., directly adjacent to the lysine units at positions 9 and 27, which (as noted above) inhibit transcription by binding with additional proteins. Serine phosphorylation in histones controls the bonding between DNA and non-histone proteins (which may also undergo phosphorylation). On the other hand, removal of methyl groups is a function of specific demethylases.

Fig. 3.20

Chromatin activation (dissociation of non-histone proteins) through (a) demethylation of histone H3K27me3 and (b) protein phosphorylation (schematic depiction)

In addition to lysine, histone methylation may also affect arginine, while direct DNA methylation usually involves cytosine. Such modifications are often mutually dependent (Fig. 3.21). Methylation is a stable, covalent modification and a convenient way to directly tag the DNA chain as the replication complex progresses. It enables cells to pass epigenetic information to their progeny and thus ensures its persistence (Fig. 3.22). However, the final decision on which genes to silence is a function of non-histone proteins which bind to DNA in places marked via methylation (or other forms of chemical modification).

Fig. 3.21

Schematic depiction of the interactions between a methylated histone (H3K27me3) and DNA

Fig. 3.22

Simplified view of remethylation of freshly synthesized genetic material (DNA and histones) by the replicating complex and its methylating subunits

In contrast to methylation, acetylation usually promotes transcription. Many transcription activators are in fact enzymes (acetyltransferases), and, consequently, many gene suppressor proteins act by deacetylation. Non-histone proteins involved in epigenetic processes either activate or inhibit transcription and may also modify the structure of chromatin. DNA–protein complexes which act as gene suppressors (for instance, those involving Polycomb proteins) result in tighter packing of the heterochromatin chain (the silent, non-transcribed part of the genome). Heterochromatin may be packed in at least four different ways, depending on the activity of attached proteins; however, we usually distinguish two broad categories: constitutive (permanently suppressed) heterochromatin and facultative heterochromatin, which may, under certain conditions, be expressed (note that the structure which allows transcription of genetic material is called euchromatin).

3.2.5.2 Specificity of Epigenetic Processes

The complicated chromatin modification mechanism associated with cellular differentiation is a consequence of the complex selection of genes which need to be expressed or silenced in order to sustain biological activities at the proper level. It can be compared to a piano concerto where the pianist must strike certain keys in a selected order at just the right time. Moreover, each stroke must have an appropriate force, determining the volume and duration of the played note. Some keys are struck separately, while others need to be arranged into chords. All of these decisions are subject to a form of programming, i.e., the notes written down by the composer of a given piece.

The specificity of recognition and activation of certain nucleotide sequences for transcription seems understandable if we assume that genes may be recognized by transcription factors alone or in collaboration with noncoding RNA fragments which belong to the miRNA group. Both types of structures are capable of interfacing with DNA regulatory sequences and can thus selectively induce transcription of certain genes. However, unlike a piano concerto, where the role of the pianist is simply to play back the piece by striking certain keys only, the cell must also proactively manage its silenced genes.

Chromatin methylation and other chemical modifications are a result of enzymatic activity whose substrates (the basic amino acid residues of histones and the cytosine of DNA) reside both within the transcribed parts of DNA and in sections which need to be silenced. Clearly, this property may interfere with the selectivity of gene expression.

A solution emerges in the form of spatial isolation of certain DNA fragments, which exposes selected parts of the chain to enzymatic activity. This is only possible during interphase, when transcription and other enzymatic processes appear to be concentrated in specific areas of the nucleus (sometimes called factories). These areas accept “loosened” DNA coils, recognized and preselected by transcription factors and/or RNA. Compartmentalization also prevents uncontrolled propagation of catalysis (Fig. 3.23).

Fig. 3.23

Spatial restrictions applied to transcription and methylation of DNA (chromatin fragments) through selection of areas where enzymatic contact is maintained. Inset: depiction of the nucleus

Spatial ordering of catalysis is important for epigenetic processes due to the great variety of enzymatic interactions involved in cell differentiation. However, an even more important self-control mechanism associated with enzymatic activity is its division into stages, where only selected types of enzymes are active at each stage. This greatly increases the selectivity of information channels and reduces the potential for error.

3.2.5.3 External Control of Cell Proliferation and Differentiation: Embryonic Development

Each stage of differentiation can be activated automatically; however, all stages obey steering signals which come from outside of the cell, i.e., from other cells. Such signals can be generated directly (by adjacent cells) or indirectly (by specific hormonal markers called morphogens).

The duration of the signal and the concentration of a specific morphogen may affect cell differentiation by triggering internal processes which subsequently operate in accordance with predetermined sequential programs. Morphogen diffusion is, however, somewhat peculiar: morphogens travel through clusters of densely packed embryonic cells and have to maintain a predetermined concentration at a given distance from their originator in order to ensure proper strength of the signal they encode. In order to fulfill these goals, morphogens are inherently short-lived and need to be constantly replenished. They must also possess special means of traversing cell clusters. It should be noted that the boundaries separating various tissues are usually well delineated in spite of the diffusive nature of biological signals. This is due to simultaneous action of contradictory signals, which results in the emergence of unambiguous tissue boundaries (Fig. 3.24).

Fig. 3.24

Formation of a clear boundary between separate cell layers in an evolving embryo, through contradictory action of morphogens. Circles represent various types of specialized cells associated with different concentrations of a given morphogen

Each gene packet activated in the course of differentiation belongs to a certain development stage. This alignment results in staged synthesis of various sets of proteins and enzymes, each responsible for performing different actions. For instance, in stem cells methylation affects CG and CA nucleotide clusters, whereas in mature, specialized cells, no CA methylation is observed. Furthermore, as no cell is completely independent of its adjacent cells, cellular development must proceed in a coordinated fashion.

Coordinated propagation of information follows a hierarchical pathway, meaning that information first reaches key loci in the developing system and only then can be disseminated to wider groups of recipients (genes). This process resembles a human population settling a new territory: initially, settlers decide upon administrative boundaries and elect local authorities. Later on, these agreements may be amended as a result of individual postulates and specific strategies developed in order to resolve emerging problems and adapt to changing conditions.

Cells responsible for triggering new signal pathways need to be created in the course of embryonic development. The emergence of new centers of activity and new tissues results from cooperation of existing, differentiated cells. Cell proliferation and mutual interactions proceed in accordance with the genetic program, progressively giving rise to new structures. The spatial “blueprint” of the embryonic mass is in place even before macroscopic details can be discerned; indeed, the process of differentiation begins with the first asymmetric division of the embryo. Cells which are already undergoing differentiation “remember” their position and place in the development program—thus, they can be said to possess a specific “address” in the overall structure of the organism.

The placement of cells in a developing embryo is determined by the hox gene family. In humans, these genes (of which there are approximately 40) are activated sequentially during successive development stages. Their action is to impose spatial alignment upon the growing mass of cells. Spatial memory and information about the cell’s future role in the developing organism are stored in its chromatin, conditioned to enable certain types of transcription. Generalized biochemical signals trigger specific responses in individual cells, guiding the development process in each part of the embryo. Cell groups gain their epigenetic “addresses” and “assignments” by reacting to signals in different ways, thus enabling coherent growth.

While the spatial arrangement of tissues was subject to some changes in the course of evolution, the general epigenetic control mechanisms governing the differentiation of cell groups remained unchanged. An early strategy, characteristic of invertebrates, is the division of the embryonic mass into segments (see Fig. 3.34). However, as the notochord and (later on) the spine emerged, the differentiation process had to evolve as well. Thus, a vertebrate embryo initially consists of two distinct germ cell layers: endoderm and ectoderm. Their interaction gives rise to a third layer, called mesoderm. Further development and divergence of cell layers result in the formation of a spine as a central core around which development may progress.

Genetic control of cell mobility (involving entire cell layers as well as individual cells) is facilitated by changes in the shape of cells, affecting their mutual adhesion. Cell layers gain mobility by means of locally reduced or increased adhesion, itself a result of the emergence or degradation of surface receptors (cadherins and integrins).

Other examples of epigenetic mechanisms are as follows:

Ad. 2. Stem Cell Survival and Differentiation

A special group of undifferentiated cells, called stem cells, may persist in mature organisms in specific niches formed by adjacent cell layers. One example of such a structure is bone marrow, where new blood cells are constantly being created.

Ad. 3. Selective Inheritance of Parental Traits

Imprinting—Differentiation mechanisms can also be used to ensure monoallelicity.

Most genes which determine biological functions have a biallelic representation (i.e., a representation consisting of two alleles). The remainder (approximately 10% of genes) is inherited from one specific parent, as a result of partial or complete silencing of their sister alleles (called paternal or maternal imprinting) which occurs during gametogenesis. The suppression of a single copy of the X chromosome is a special case of this phenomenon. It is initiated by a specific RNA sequence (XIST) and propagates itself, eventually inactivating the entire chromosome. The process is observed in many species and appears to be of fundamental biological importance.

Ad. 4. Epigenetics of Acquired Traits

Hereditary traits—Cell specialization is itself a hereditary trait. New generations of cells inherit the properties of their parents, though they may also undergo slight (but permanent) changes as a result of environmental factors.

3.2.6 The Genetics of Evolution

In contrast to steady-state genetics and developmental genetics, evolution exploits the gene mutation phenomenon which underpins speciation processes. Duplication and redundancy of genetic material are beneficial, as they enable organisms to thrive and reproduce in spite of occasional mutations. Note that mutations which preclude cross-breeding with other members of a given species can be said to result in the emergence of a new species.

Evolutionary genetics is subject to two somewhat contradictory criteria. On the one hand, there is clear pressure on accurate and consistent preservation of biological functions and structures, while on the other hand, it is also important to permit gradual but persistent changes. Mutational diversity is random by nature; thus evolutionary genetics can be viewed as a directionless process—much unlike steady-state or developmental genetics.

In spite of the above considerations, the observable progression of adaptive traits which emerge as a result of evolution suggests a mechanism which promotes constructive changes over destructive ones. Mutational diversity cannot be considered truly random if it is limited to certain structures or functions. In fact, some processes (such as those associated with intensified gene transcription) reveal increased mutational activity. Rapid transcription may induce evolutionary changes by exposing cell DNA to stimuli which result in mutations. These stimuli include DNA repair processes, particularly those which deal with double-strand damage. In this respect, an important category of processes involves recombination and shifting of mobile DNA segments.

Approximately 50% of the human genome consists of mobile segments, capable of migrating to various positions in the genome. These segments are called transposons and retrotransposons (respectively, DNA fragments and mobile RNA transcripts which resemble retroviruses in their mechanism of action, except that they cannot leave the cell).

The mobility of genome fragments not only promotes mutations (by increasing the variability of DNA) but also affects the stability and packing of chromatin strands wherever such mobile sections are reintegrated with the genome. Under normal circumstances the activity of mobile sections is tempered by epigenetic mechanisms (methylation and the DNA-protein complexes it creates); however in certain situations, gene mobility may be upregulated. In particular, it seems that in prehistoric times such events occurred at a much faster pace, accelerating the rate of genetic changes and promoting rapid evolution.

Cells can actively promote mutations by way of the so-called AID process (activation-induced cytidine deamination). It is an enzymatic mechanism which converts cytosine into uracil, thereby triggering repair mechanisms and increasing the likelihood of mutations. AID is mostly responsible for inducing hypermutations in antibody synthesis, but its activity is not limited to that part of the genome. The existence of AID proves that cells themselves may trigger evolutionary changes and that the role of mutations in the emergence of new biological structures is not strictly passive.

3.2.6.1 Combinatorial Changes as a Diversity-Promoting Strategy

Although the processes mentioned above may contribute to evolutionary changes and even impart them with a certain direction, they remain highly random and thus unreliable. A simple increase in the rate of mutations does not account for the high evolutionary complexity of eukaryotic organisms. We should therefore seek an evolutionary strategy which promotes the variability of DNA while limiting the randomness associated with mutations and preventing undesirable changes.

This problem may be highlighted by considering the immune system, which itself must undergo rapid evolution in order to synthesize new types of recombinant proteins called antibodies. As expected, antibody differentiation is subject to the same deterministic mechanisms which have guided evolution throughout its billion-year course but which remain difficult to distinguish from stochastic evolutionary processes. In the immune system, synthesis of new proteins (i.e., antibodies with new V domains) proceeds by way of changes in amino acid sequences, introduced by recombination of the V, D, and J DNA segments, through a mechanism which owes its function to the high redundancy of certain fragments of the genetic code. DNA sequences which contain the previously mentioned V, D, and J segments may, upon recombination, determine the structure of variable immunoglobulins: their light (V, J) and heavy chains (V, D, J) (see Fig. 3.36).

A key advantage of recombination is that it yields a great variety of antibodies, making it likely that an antibody specific to a particular antigen will ultimately be synthesized. The degree of variability in L and H chains is determined by the number of possible combinations of V/J and V/D/J segments (for L and H chains, respectively). Constructing random genetic sequences via recombination is a process which may occur far more frequently than creating new, complete genes from scratch. It enables great genetic diversity in spite of the limited participating genome information and is therefore preferable to gene differentiation. Antibody differentiation also relies on one additional mechanism which triggers random changes in their active groups: combination of light and heavy chains within the antigen binding site.

Combinatorial differentiation and antibody synthesis may roughly be compared to the work of a cook who has to prepare meals for a large group of gourmands. Two strategies may be applied here: (1) preparing a large number of varied meals, far more than there are customers, and (2) preparing a selection of meal components (A, main courses; B, salads; C, appetizers; etc.) and allowing customers to compose their own sets. Clearly, the latter solution is more efficient and corresponds to strategies which can frequently be observed in nature. The efficiency of combinatorial differentiation is shown in Table 3.2, which presents a quantitative example of constructing antibodies from segments of the heavy chain (H) and two forms of the light chain (Lλ and Lκ). The degree of variability of each form is listed in Table 3.2A. As can be observed, this variability is far greater than in the case of a single, nonrecombinant chain consisting of all the above-mentioned segments.

Table 3.2 Variability of antibodies as a result of recombination in V, D, and J segments and interaction between light (L) and heavy (H) chains

The number of possible antibody sequences, given minimal variability of individual components, is listed in Table 3.2B. Table 3.2C presents values which correspond to the highest possible variability of components.
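
The arithmetic behind this gain is easy to reproduce. The following sketch (Python) uses illustrative, rounded segment counts of the kind often quoted for the human immunoglobulin loci; they are assumptions for the sake of the example, not the values of Table 3.2:

```python
# Illustrative segment counts (rounded figures of the kind often quoted
# for the human immunoglobulin loci; assumptions, not Table 3.2 data).
V_H, D_H, J_H = 40, 23, 6   # heavy-chain V, D, J segments
V_K, J_K = 40, 5            # kappa light-chain V, J segments
V_L, J_L = 30, 4            # lambda light-chain V, J segments

heavy = V_H * D_H * J_H           # V/D/J recombination
light = V_K * J_K + V_L * J_L     # V/J recombination (kappa or lambda)
pairs = heavy * light             # independent pairing of H and L chains

print(f"heavy chains: {heavy}")             # 5520
print(f"light chains: {light}")             # 320
print(f"antibody combinations: {pairs:,}")  # 1,766,400
```

Fewer than 150 gene segments thus yield well over a million distinct binding sites, even before junctional diversity and hypermutation add further variability.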

Combinatorial differentiation yields a huge population of immunoglobulins, making it exceedingly likely that at least some of them will be selectively adapted to their intended purpose. Further structural improvements are possible as a result of hypermutations restricted to the active protein group and induced by the AID process, resulting in incremental synthesis of more specialized antibodies (affinity maturation). Thus, antibody synthesis is itself a microscale model of directed evolution, enabling progressive improvement of its final product.

Thus, although mutations are the basic mechanism by which changes in nucleotide sequences (and, consequently, amino acid sequences) can be introduced, the variability of antibodies is not entirely dependent on mutations. Rather, it is a result of recombinant synthesis of diverse DNA fragments, each contributing to the structure of the final product. Such recombinant fragments emerge in the course of evolution, mostly via duplication of genetic code and also (in a limited scope) via localized mutations which do not affect the fundamental properties of the protein complex. The entire process is similar to a tool with exchangeable parts, although in the case of antibodies there is no “set of available parts”—rather, the given parts are synthesized on the fly from smaller subcomponents.

Combinatorial rearrangement of presynthesized DNA fragments (as opposed to ad hoc mutations) is an evolutionarily favored means of achieving diversity. It can rapidly accelerate DNA diversification while restricting the likelihood of adverse changes and errors associated with random mutations. The mechanism can be compared to the use of numbers and letters in car license plates, which also affords a great number of unique combinations.

In DNA, combinatorial diversification requires that functional fragments of the chain be clearly separated and well-spaced. Long dividers carry information which enables proper folding of the chain and assists in its combinatorial rearrangement. The discontinuity of the genetic code supports combinatorial genetics but also facilitates ongoing gene expression through alternative splicing of exon fragments whenever suitable mRNA chains need to be synthesized. Thus, the number of intracellular proteins far exceeds the number of individual genes which make up the genome. This phenomenon is similar to recombination, although it applies to RNA rather than to DNA. Alternative splicing, itself a result of evolution, is an important contributor to evolutionary progress. Discontinuity is also observed in the so-called cis-regulatory elements of the genetic code, which may be located far away from gene promoters. Such fragments include enhancers and silencers, separated by special sections called insulators.

DNA fragments, recognized by transcription factors, can bind to polymerase and guide its activity (Fig. 3.25). Regulatory fragments act as hooks for transcription factors. Sets of genes associated with a single biological function often share identical (or similar) enhancers and silencers, acted upon by a single transcription factor. Such cooperation of genes can be compared to piano chords which consist of several different notes but are struck by a single hand. The role of regulatory sequences in evolutionary development is more significant than that of actual protein codons (exons).

Fig. 3.25

Differentiation complex synthesis (action of transcriptase). 1, 2, 3, …, n, DNA enhancers and silencers. Transcription factors not shown

Primitive organisms often possess nearly as many genes as humans, despite the essential differences between both groups. Interspecies diversity is primarily due to the properties of regulatory sequences. Evolutionary development promotes clear separation of DNA fragments carrying information concerning structure and function, allowing genetic code to be recombined with ease. In humans, the noncoding separators between coding sequences (introns) are among the longest observed in any organism. It therefore appears likely that diversification of regulatory structures carries significant evolutionary benefits.

As already mentioned, evolutionary progress is associated with the scope and diversity of regulatory sequences rather than with the number of actual genes. This is due to the fact that regulatory sequences facilitate optimization of gene expression. Returning to our metaphor, we can say that the same grand piano can be used either by a master pianist or by an amateur musician, although in the latter case the instrument’s potential will not be fully realized, resulting in a lackluster performance.

The special evolutionary role of regulatory fragments is a consequence of their noncoding properties. Contrary to genes, noncoding fragments are not subject to structural restrictions: they do not need to be verified by the synthesis of specialized proteins, where mutations are usually detrimental and result in negative selection. They also exhibit far greater variability than gene-encoding fragments. The mutability of noncoding DNA fragments is aided by the fact that—owing to their number—each fragment only contributes a small share to the overall regulatory effect. This property reduces the potential impact of unfavorable mutations.

3.2.6.2 Directed Mutability: Hotspot Genes

The recombinant variability of regulatory fragments and of genes themselves is sufficient to explain the progress of evolution. Nevertheless, ongoing research suggests the existence of additional mechanisms which promote evolution by increasing mutability in focused, localized scopes. Not all genes are equally susceptible to evolutionary pressure. Some can be termed “conservative” (i.e., undergoing few changes in the course of evolution), while others are subject to particularly rapid changes. The latter group is colloquially said to consist of hotspot genes. The reason behind this variability is unclear; however, it appears that high mutability may emerge as a result of intense functional involvement or local instabilities in chromatin structure. It is also observed that fragments directly adjacent to retrotransposons are characterized by relatively loose packing, which may accelerate the rate of mutations. However, the most likely explanation involves the presence of special nucleotide sequences which reduce the overall stability of the DNA chain. Certain observations attribute this role to short fragments dominated by a single type of nucleotide (usually T or A) attached to longer sequences which are largely bereft of nucleosomes. Such structures are particularly conducive to random exchange of genetic material between DNA coils, thereby promoting recombination and increasing the rate of mutations.

If such accelerated mutability is restricted to specific DNA fragments, its destructive impact can be minimized, and the mechanism may serve a useful purpose. It seems that the placement of such sequences in the genome may assist in directed evolutionary development. This is somewhat equivalent to the hypermutation process in antibody synthesis, where increased mutational activity (caused by AID) only applies to specific domains, ensuring an effective immune response without significantly altering the core structure of antibodies.

3.2.6.3 Gene Collaboration and Hierarchy

The placement of a given gene in the gene regulatory network may affect its transcriptional activity. Each mechanism which contributes to the overall phenotype requires the collaboration of many genes. The goal of such collaboration is to ensure balanced responses to various stimuli and activate all the required genes. Collaborative systems emerge via coupling of genes which together determine biological signals associated with transcription. Coherent activation of genes encoding transcription factors is a prerequisite for the formation of a so-called kernel.

Automatically regulated collaborative systems may be likened to cybernetic mechanisms. The impact of individual genes on collaboration processes is, however, unequal. Genes which occupy core nodes of regulatory networks (also called input/output genes) are usually tasked with proper routing of biological signals. Their intense functional involvement and interaction with advanced regulatory mechanisms may result in increased susceptibility to mutations.

Regulatory mechanisms which assist in evolutionary development are themselves subject to evolution—for instance, through creation of new enhancers and silencers or by increasing their relative spacing (similarly to introns). Gene regulation and interaction (particularly in the scope of input/output genes) can also be improved. Finally, the number of genes which encode transcription factors tends to increase over time. Such changes can be explained by their positive effect on gene collaboration. Referring to our “musical” example, we can say that using all ten fingers gives the pianist far greater leeway than if he were to tap the melody with just one finger.

As mentioned above, the role of genes in collaborative systems differs from gene to gene. Input/output genes are particularly important: it seems that they are the key members of the so-called hotspot gene set. This observation is further supported by their high involvement in transcription processes. It is theorized that the placement of input/output genes in the genome is intimately tied to their evolutionary role. An unambiguous proof of this theory would further confirm the directed nature of evolution. Mechanisms which accelerate evolutionary development (such as duplication, recombination based on the discontinuity of genetic material, and focused mutability) indicate that evolutionary processes follow specific strategies which may themselves undergo improvement. This, in turn, suggests the possibility of selectively accelerating evolution. One putative example of this phenomenon is the rapid development of the Homo sapiens brain, often described as an “evolutionary leap.”

3.3 Types of Information Conveyed by DNA

Biochemistry explains how genetic information can be used to synthesize polypeptide chains. On the surface it might appear that information transferred to RNA and subsequently to proteins is the only type of information present in the genome. However, an important issue immediately emerges: in addition to what is being synthesized, the living cell must also be able to determine how, where, when, and even to what extent certain phenotypic properties should be expressed.

The questions what? and when? involve structural properties, while terms such as how? and how much? are more closely tied to function. The question how? usually emerges whenever we wish to determine the role of a certain structure, its synthesis process, or its mechanism of action. Each of these aspects may also be associated with the question how much?, i.e., a request for quantitative information. This information is useful in determining the required concentrations of reagents, their level of activity, the size of biological structures, etc. Quantitative assessment is important for any doctor who sends a biological sample to a diagnostics lab. Such properties are static and must therefore have a genetic representation. If we assume that, on a molecular level, structure determines function, we must also accept that the structure of certain proteins determines their quantity.

As can be expected, quantitative regulation is a function of receptor structures, each of which belongs to a regulatory chain. The question how much? is inextricably tied to regulatory processes. It seems clear that the stable concentrations and levels of activity observed in biological systems cannot be maintained without regulation. It is equally evident that such regulation must be automatic, since isolated cell cultures can thrive and maintain their biological properties despite not being part of any organism.

Research indicates that biological regulatory mechanisms rely fundamentally on negative feedback loops. Figure 3.26 presents the structure of such a loop. The heredity of a biological function is not restricted to the specific effector which directly implements it but covers the entire regulatory chain, including receptor systems and information channels.

Fig. 3.26

Schematic depiction of a negative feedback loop, with elements determining the how much? and how? properties. The what? property may relate to each structure separately or to the loop as a whole
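
The stabilizing action of such a loop can be illustrated with a minimal numerical sketch (Python; the gain value and step count are arbitrary illustrative choices, not biological constants):

```python
def regulate(x: float, setpoint: float, gain: float = 0.3, steps: int = 12) -> float:
    """Toy negative feedback loop: at each step the receptor measures
    the deviation from the setpoint and the effector counteracts it."""
    for _ in range(steps):
        error = x - setpoint   # receptor: registers the deviation
        x -= gain * error      # effector: acts against the deviation
    return x

# A perturbed concentration (2.0) is pulled back toward the set level (1.0).
print(regulate(x=2.0, setpoint=1.0))  # ~1.01 after 12 steps
```

Whatever the initial deviation, the loop drives the controlled quantity back toward the set level, which is exactly the automatism required of regulation in isolated cell cultures.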

A biological function cannot be separated from its regulatory loop as long as it remains within the range of physiological regulation.

Quantitative control of the activity of various processes is facilitated by receptors which measure product concentrations or reaction intensity. Each receptor is connected to an effector which counteracts the observed anomalies. The receptor—being a functional protein—contains an active site which it uses to form reversible complexes with elements of the reaction it controls (usually with its products).

Typical receptors are allosteric proteins which undergo structural reconfiguration and release a signal whenever a ligand is bound in their active site. The equilibrium constant—a measure of receptor-ligand affinity—determines the concentration at which saturation occurs and the receptor morphs into its complementary allosteric form. As a consequence, ligand concentration depends on the affinity of its receptor and therefore on its structure. The structure of the receptor protein determines the quantitative properties of the system as a whole, thus providing an answer to the how much? question. In contrast, the how? issue is addressed by the structure of effector proteins.
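
This structural determination of quantity can be written down explicitly. Assuming simple reversible one-site binding (a standard textbook simplification, not a model taken from this chapter):

$$ R + L \rightleftharpoons RL, \qquad K_d = \frac{[R][L]}{[RL]}, \qquad \theta = \frac{[L]}{K_d + [L]} $$

Half-saturation (θ = 1/2) occurs exactly at [L] = K_d, so the concentration at which the receptor flips into its complementary allosteric form is set by the structure of its binding site, which fixes K_d.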

Effectors may be either simple or complex, depending on the task they perform. In a living cell, an effector may consist of a single enzyme, a set of enzymes facilitating synthesis of a specific product, or an even more advanced machine-like structure. In the regulatory mechanisms of the organism, effectors are often specialized tissues or organs. A typical type of effector mechanism is involved in transcription and translation processes.

The role of the effector is to stabilize the controlled process. Its structure may address the what? and (possibly) where? questions associated with any biologically active entity, but it primarily relates to the how? question by determining the mechanism applied for a given task, as requested by the receptor. We can state that the genetic code (i.e., nucleotide sequence) describes the primary structure of receptor, effector, and transfer proteins. A regulatory loop (negative feedback loop) is a self-contained functional unit which performs a specific task in an automated manner.

Self-organization mechanisms determine the location of biological structures both in individual cells and in organisms—thus, they address the where? question. Structures built according to the genetic blueprint and interacting in specific ways may spontaneously generate complexes and associations, as well as sets of cells which recognize one another through appropriate receptor systems. Examples of self-organization include spontaneous formation of the cellular membrane from enzymatically synthesized phospholipids (which arrange themselves into planar micelles in the presence of water) and the mutual recognition of tissue cells (Figs. 3.27, 3.28, 3.29, 3.30, 3.31, 3.32, and 3.33).

Fig. 3.27

Self-organization example: membrane formation

Fig. 3.28

Self-organization example: (a) polypeptide chain folding, (b) formation of quaternary protein structure, and (c) integration of proteins in the cellular membrane

Fig. 3.29

Simplified view of the self-organization of a ribosome subunit through sequential binding of proteins

Fig. 3.30

Simplified view of the formation of fatty acid synthase through self-organization

Fig. 3.31

Simplified view of the self-organization of skeletal muscle—initial stage

Fig. 3.32

Distribution of substances encapsulated in vesicles, surrounded by a membrane with integrated markers assisting the self-organization process

Fig. 3.33

Self-organization of a phage with the use of components synthesized by a bacterial cell

The where? question is particularly important in the development and maturation of organisms. A crucial issue is how to create spatial points of reference in a developing embryo, enabling precise distribution of organs and guiding the development process as a whole. The most frequently applied strategy is to divide the embryo into specific parts, each with a different biological “address,” and to apply a separate control process to each part. Such division occurs in stages and is guided by sequentially activated gene packets, according to a predetermined genetic algorithm.

The ability to assign permanent “addresses” to individual components of the organism is a result of cell differentiation. Following spatial self-orientation of the embryo (mediated by hox genes), each “address” is targeted for signals which either promote or inhibit cell proliferation and further specialization, resulting in development of specific organs.

This strategy is evident in insect embryos, particularly in the oft-studied fruit fly (Drosophila melanogaster). It relies on three basic gene packets whose sequential activation results in structural self-orientation of the embryo and progressive development of the organism. These packets are, respectively, called maternal genes, segmentation genes, and homeotic genes.

The fruit fly egg already exhibits discernible polarity. During development the embryo undergoes further segmentation which clearly defines points of reference and enables precise placement of organs. Segmentation can commence once the frontal, rear, ventral, and dorsal areas of the embryo are determined. This process, in turn, relies on mechanisms activated by the mother inside the egg (this is why the relevant gene packet is called maternal). Following initial self-determination, it becomes possible to delineate boundaries and segments by way of contradictory activity of cells making up each of the preexisting polar regions. This process is mediated by hormones (morphogens) or by direct interaction between adjacent cells. The creation of boundaries is similar to a geopolitical process where two neighboring countries compete to control as much land as possible, ultimately reaching a detente which translates into a territorial border. Differentiated boundary cells generate signals which induce further segmentation. This process continues until suitable precision is reached, under the guidance of segmentation genes. Transcription-dependent expression of these genes results in cell differentiation and determines the final purpose of each segment. Once specific points of reference (i.e., segment boundaries) are in place, the development of organs may commence, as specified by homeotic genes.

To illustrate the need for this strategy, Fig. 3.34 presents how a blind tailor would go about making a dress. He begins by marking the cloth and then uses these marks to recreate the structure which exists in his mind.

Fig. 3.34

Points of reference in a developing embryo compared to the work of a blind tailor

The aim of the example is to visualize the purpose of natural strategies observed in embryonic development (note that this example does not fully reflect the properties of biological processes).

3.4 Information Entropy and Mechanisms Assisting Selection

According to the second law of thermodynamics, any isolated system tends to approach its most probable state which is associated with a relative increase in entropy. Regulatory mechanisms can counteract this process but require a source of information. A steady inflow of information is therefore essential for any self-organizing system.

From the viewpoint of information theory, entropy can be described as a measure of ignorance. Regulatory mechanisms which receive signals characterized by high degrees of uncertainty must be able to make informed choices to reduce the overall entropy of the system they control. This property is usually associated with the development of information channels. Special structures are required within information channels connecting systems of different character, for example, those linking transcription to translation or enabling transduction of signals through the cellular membrane. Examples of structures which convey highly entropic information are the receptor systems associated with blood coagulation and immune responses.

The regulatory mechanism which triggers an immune response relies on relatively simple effectors (complement factor enzymes, phagocytes, and killer cells) coupled to a highly evolved receptor system, represented by specific antibodies and an organized set of cells. Compared to such advanced receptors, the structures which register the concentration of a given product (e.g., glucose in blood) are rather primitive.

Advanced receptors enable the immune system to recognize and verify information characterized by high degrees of uncertainty. The system must be able to distinguish a specific antigen among a vast number of structures, each of which may potentially be treated as a signal.

The larger the set of possibilities, the more difficult it is to make a correct choice—hence the need for intricate receptor systems. The development and evolution of such systems increase the likelihood that each input signal will be recognized and classified correctly.

In sequential processes it is usually the initial stage which poses the most problems and requires the most information to complete successfully. It should come as no surprise that the most advanced control loops are those associated with initial stages of biological pathways. The issue may be roughly compared to train travel. When setting out on a journey, we may go to the train station at any moment we wish and then board any train, regardless of its destination. In practice, however, our decision must take into account the specific goal of our journey. The number of decisions required at this preliminary stage is high: we need to decide whether we wish to travel at all, in which direction, on which train, from which platform, and so on. We also have to make sure that the train waiting at the platform is the one we wish to board. In systems devoid of sentience, such questions must be “posed” by specific protein structures, attached to the control loop and usually discarded once their task has been fulfilled. The complexes formed at these initial stages of biological processes are called initiators. Additional structural elements (usually protein-based) which “pose questions” through specific interactions facilitate correct selections among many seemingly random possibilities. Figure 3.35 presents an example of the formation and degradation of initiation complexes in the synthesis of proteins in prokaryotic cells.

Fig. 3.35

Simplified view of the assembly and disassembly of the initiation complex assisting protein synthesis in prokaryotes

3.5 Indirect Storage of Genetic Information

While access to energy sources is not a major problem, sources of information are usually far more difficult to manage—hence the universal tendency to limit the scope of direct (genetic) information storage. Reducing the length of genetic code enables efficient packing and enhances the efficiency of operations while at the same time decreasing the likelihood of errors. A classic example of this trend is the progressive evolution of alternative splicing of exon fragments. The number of genes identified in the human genome is lower than the number of distinct proteins by a factor of 4, a difference which can be attributed to alternative splicing. Even though the set of proteins which can be synthesized is comparatively large, genetic information may still be accessed in a straightforward manner as splicing occurs in the course of synthesizing final mRNA chains from their pre-mRNA precursors. This mechanism increases the variety of protein structures without affecting core information storage, i.e., DNA sequences.

The information expressed as a sequence of amino acids in the polypeptide chain is initially contained in the genome; however, the final product of synthesis—the protein itself—may be affected by recombination of exon fragments. Thus, there is no direct correspondence between synthesized proteins and their genetic representation.
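
A toy calculation shows how splicing multiplies products without enlarging the genome. In the sketch below (Python), a hypothetical gene with two constitutive exons (A, E) and three optional cassette exons (B, C, D) is spliced in every admissible way; the exon names are invented purely for illustration:

```python
from itertools import combinations

constitutive_start, constitutive_end = "A", "E"
optional = ["B", "C", "D"]  # hypothetical cassette exons

# Each subset of the optional exons (order preserved) yields a distinct
# mRNA, so a gene with n cassette exons offers up to 2**n variants.
variants = [
    constitutive_start + "".join(chosen) + constitutive_end
    for r in range(len(optional) + 1)
    for chosen in combinations(optional, r)
]

print(len(variants), variants)
# 8 ['AE', 'ABE', 'ACE', 'ADE', 'ABCE', 'ABDE', 'ACDE', 'ABCDE']
```

Eight transcripts from one gene, at the cost of storing only five exons: this is the same economy, scaled up, that allows the human proteome to exceed the gene count severalfold.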

3.5.1 Self-Organization as a Means of Exploiting Information Associated with the Natural Direction of Spontaneous Processes

In addition to information contained directly in nucleotide sequences, the cell genome also carries “unwritten” rules, rooted in evolutionary experience. Such experience can be explained as a form of functional optimization, resulting from deletion of detrimental and deadweight solutions from genetic memory. This mechanism also applies to information which proves redundant once a simpler solution to a particular problem has been found.

Evolutionary experience may also free the genome from unnecessary ballast by exploiting certain mechanisms through which a reaction may draw information from sources other than the genome itself. This is possible, e.g., by exploiting the natural direction of spontaneous processes. If the lumberjack knows that the sawmill is located by the river, he does not have to carry a map—he simply needs to follow the riverbed. A similar situation may occur while sailing: if the intended direction of travel is consistent with the wind direction, all we need to do is set a sail—we do not require knowledge of paddling or navigation.

Such “unwritten” information is a classic example of natural self-organization at work. However, it requires a suitable initial structure, synthesized in accordance with a genetic blueprint.

Self-organization can be likened to stones randomly rolling down a hill and accumulating at its foot. We can expect that such stones will be mostly round in shape, as flat or otherwise uneven stones are not as likely to roll. Clearly, roundness is necessary to exploit the force of gravity as a means of propulsion. Flat stones would instead need to be carried down the hill (which, of course, requires an additional source of energy and information).

As mentioned above, biological self-organization is most frequently associated with protein folding and synthesis of cellular membranes.

The primary structure of the polypeptide chain provides a starting point for the emergence of higher-order structures. The folding process is spontaneous and largely independent of information stored in the genome. While the genetic information is present in the amino acid sequence of the polypeptide chain, folding does not directly rely on it. In order to fold properly, the protein must draw information from its environment. The primary source of such information is the presence of water, whose molecules assume a specific structural order in the presence of hydrophobic residues of the polypeptide and then, by spontaneously reverting to their unordered state, transfer this information back to the polypeptide chain, driving its folding.

In terms of energy flow, polypeptide chain folding is a spontaneous process, consistent with the second law of thermodynamics. It is assumed that the final, native form of the protein corresponds to an energy minimum of the protein-water system.

Self-organization of the polypeptide chain is powered primarily by hydrophobic interactions which may be described as a thermodynamically conditioned search for the optimal structure of the chain in the presence of water.

Most researchers believe that only those proteins whose global energy minima correspond to biological activity pass evolutionary selection and become encoded in the genome. Consequently, their secondary and tertiary structure may emerge through spontaneous interaction of the polypeptide chain with its environment.

Although the formation of three-dimensional structures does not directly depend on nucleotide sequences, the cell may nevertheless employ special proteins called chaperones, which assist polypeptide chains in finding their optimal conformation (i.e., their energy minima). The role of chaperones is to prevent aggregation of partly folded chains and to promote correct packing by restricting the chains’ freedom. They do not directly interfere in folding—instead, their contribution may be treated as a form of genetic interference in self-organization.

Owing to evolutionary selection of polypeptide chains which ensure spontaneous synthesis of the required structures in an aqueous environment (corresponding to active forms of proteins), the genome does not need to directly encode information related to the extremely complex folding process.

Self-organization may also yield more advanced structures consisting of multiple proteins—such as ribosomes and other cellular organelles.

Specific reactivity is exhibited by proteins which recognize and bind their specific markers; thus, the determination of localization also occurs in the context of self-organization, although its contribution to this process is often limited. It can be observed, e.g., in intercellular interaction, where the participating receptors are sometimes called topological receptors.

3.5.1.1 Formation of Organized Structures as a Means of Reducing the Necessary Quantity of Information

An organized system is, by definition, more efficient in exploiting information than an unorganized system. The emergence of complex structures through self-organization of genetically programmed components, resulting in improved operational efficiency, can be explained as a form of utilizing information which is not directly contained in the genome. Subunits of protein complexes owe their connectivity to DNA-encoded structural properties, yet their aggregation is a spontaneous process, independent of any genetic representation. It occurs as a consequence of the structural affinity of subunits and does not directly translate into any form of code. This “design concept,” concealed in the structure of subunits, expresses itself via their interactions in protein complexes. Advanced complexes act as biological machines and are capable of operating with no need for large quantities of information (compared to individual subunits). Examples of such structures include ribosomes, DNA and RNA polymerases, proteasomes, chaperones, etc.

The amount of required information can be further reduced by restricting selected functions to specific areas of the cell. An example of this process (also called compartmentalization) is the delegation of fatty acid degradation processes to mitochondria, which allows the cell to separate such processes from the synthesis of new fatty acid molecules. Conducting both actions in a shared space would require additional regulatory mechanisms and therefore additional genetic code. We should also note the clear division of eukaryotic genes into introns and exons, which appears to play an important role in evolutionary development. In prokaryotes, translation is intimately coupled to transcription, and genes cannot exist as discontinuous units (maintaining discontinuous genes would require an unfeasibly large set of additional regulatory mechanisms).

3.5.1.2 Reducing the Need for Genetic Information by Substituting Large Sets of Random Events for Directed Processes

A stochastic (directionless) system may fulfill a specific task purely through randomization and selection. The likelihood of achieving the required effect increases in relation to the number of events. Thus, meeting the stated goal (performing a specific action) can result from a trial-and-error approach, given a large enough number of tries. In biological systems, directed processes (requiring information) are frequently replaced by large pools of random actions which can occur with limited input. This model can be compared to operating a machine gun which fires many (k) bullets, each with a small but non-negligible probability (p) of hitting the target, as opposed to launching a single guided missile which has a very high probability (p ≈ 1.0) of impacting the same target. In the former case, the likelihood of a successful hit increases with the number of bullets fired, whereas in the latter case, it depends on the quality of electronic guiding systems. The guided missile is highly efficient (we only need one), but producing and operating it require a vast quantity of information. In contrast, the machine gun is a relatively primitive weapon, yet given a large enough number of tries, it also offers a good chance of scoring a hit.

It should therefore be quite natural to employ stochastic strategies in directionless (i.e., non-sentient) biological systems, where the cost of increasing the number of attempts is far lower than the cost of obtaining additional information.

If the probability of achieving a hit on each attempt is equal to p and all attempts are mutually independent, the overall likelihood of hitting the target (P) is expressed as

$$ P = 1 - (1 - p)^k $$

where p is the probability of hitting the target on any given attempt (the probability of the elementary event) and k is the number of bullets fired. Figure 3.36 shows the increase in P as a consequence of more accurate targeting (increased p) and a larger number of attempts (increased k).

Fig. 3.36

Association between the probability of hitting the target (P) and (a) the probability associated with each elementary event (p) for a variable number of attempts (k) and (b) the number of attempts (k) for a variable elementary probability (p)
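
The relationship is easy to explore numerically; a minimal sketch (Python, written for illustration here):

```python
def hit_probability(p: float, k: int) -> float:
    """Overall probability P of at least one hit in k independent
    attempts, each succeeding with elementary probability p."""
    return 1.0 - (1.0 - p) ** k

# Reproducing the trends of Fig. 3.36: P grows with both p and k.
for p in (0.01, 0.05, 0.1):
    for k in (10, 100, 1000):
        print(f"p={p:<4}  k={k:<4}  P={hit_probability(p, k):.3f}")
```

Even a very small p can be offset by a large k: with p = 0.001, one thousand attempts give P ≈ 0.63 and five thousand give P ≈ 0.99.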

This problem often emerges in interactions between biological systems and their external environment, where the goal is poorly defined (i.e., p is low). Let us consider the odds that a plant seed will encounter favorable ground in which it can germinate. Clearly, the dominant biological strategy is to produce a large number of seeds, increasing the chance that at least one will be successful; however, increasing p is also possible—for example, by broadening the dispersal radius with the aid of suitable biological structures (Fig. 3.37).

Fig. 3.37

Increasing the likelihood of achieving a biological goal by broadening seed distribution (increased p): dandelion seeds (wind action), linden and maple seeds (rotary motion), and greater burdock seeds (adhesion to animal fur)

Intracellular mechanisms usually involve fewer attempts due to the limited space in which they operate. One example is the search for target objects, e.g., the interaction between microtubules and chromosomes. Microtubules form a dynamic system which fluctuates as each individual microtubule grows or shrinks. Their growth depends on the number and activity (by way of forming complexes with GTP) of tubulin molecules, which exhibit GTPase properties. Given a suitable concentration of active (GTP-bound) molecules, microtubules are capable of random growth. The distribution of tubulin is uneven, as newly formed molecules are quickly integrated into growing chains. Areas characterized by rapid growth of microtubules become devoid of tubulin, which in turn precludes further growth. Only those microtubules which find their targets (by associating with an external object) gain stability. This system is capable of sweeping the cell area and, given a large enough number of steps, locating all chromosomes. We can expect that limiting the search space to the interior of a cell results in relatively high p and therefore the number of tries (k) may be lower than in a system which is forced to act in open, unrestricted space (Fig. 3.38).

Fig. 3.38

Spatial search mechanism, as implemented by microtubules: development of the mitotic spindle. Gray arcs represent chromosomes, while black dots represent tubulin molecules
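
The advantage of a confined search space can be quantified by inverting the formula: solving P = 1 − (1 − p)^k for k gives the minimum number of attempts needed to reach a desired overall success probability. The sketch below is our illustration, with an arbitrarily chosen target of P = 0.99.

```python
import math

def attempts_needed(p: float, target_p: float = 0.99) -> int:
    """Smallest k for which P = 1 - (1 - p)^k reaches target_p."""
    return math.ceil(math.log(1.0 - target_p) / math.log(1.0 - p))

# The higher the elementary probability p (e.g., a search confined
# to the cell interior), the fewer attempts k are required:
for p in (0.001, 0.01, 0.1):
    print(f"p = {p}:  k >= {attempts_needed(p)}")
```

Raising p a hundredfold (from 0.001 to 0.1) cuts the required number of attempts from several thousand to a few dozen, which is why an intracellular search can make do with far fewer tries than a search in open space.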

Another example of substituting a stochastic process for a directed one is the formation of antibodies. Since the probability that any single antibody will match the given antigen (p) is extremely low, the number of randomly generated antibodies (k) must be correspondingly high.

Antibody differentiation clearly relies on the “large numbers” strategy, assisted by the “accelerated evolution” mechanism discussed above.

The V exon of both the light and heavy antibody chains emerges through a recombination process which resembles building a house from toy blocks. The randomness of the DNA fragments making up each V section, together with the large number of elements which participate in recombination, provides the resulting antibodies with random specificity. Given the great number of synthesized antibodies (high k), it is exceedingly likely that any given antigen will be recognized by at least some antibodies (P ≈ 1.0) (Fig. 3.39).

Fig. 3.39

Randomness in antibody synthesis: organization of the V domain at the DNA, RNA (transcription, V + C splicing), and protein (translation) levels

By maintaining an enormous, active receptor repertoire (consisting of antibodies and specialized cells) in which the number of different elements lies in the 10^8–10^11 range, the immune system can respond to a great variety of potential threats. However, such a strategy requires a vast production line, responsible for replacing lost components and ensuring constant alertness. This is akin to a burglar who carries a huge bunch of keys, hoping that at least one of them will match the lock he is trying to open (Fig. 3.40).

Fig. 3.40

Symbolic representation of the “large numbers” strategy—different numbers of keys (1 to n) and the increasing probability that a given door can be opened by at least one key in the bunch
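
The scale of this combinatorial strategy can be sketched numerically. The segment counts below are rough, order-of-magnitude assumptions (not exact gene tallies, which vary by species and locus), chosen only to illustrate how random assembly of gene segments multiplies diversity:

```python
# Illustrative (assumed) numbers of gene segments available for
# random recombination; real counts vary by species and locus.
V_H, D_H, J_H = 40, 25, 6        # heavy-chain V, D, J segments
V_L, J_L = 70, 9                 # pooled light-chain V, J segments

heavy_variants = V_H * D_H * J_H          # random heavy chains
light_variants = V_L * J_L                # random light chains
paired = heavy_variants * light_variants  # independent chain pairing

print(f"heavy-chain variants: {heavy_variants:,}")  # 6,000
print(f"light-chain variants: {light_variants:,}")  # 630
print(f"paired combinations:  {paired:,}")          # 3,780,000
# Junctional insertions and deletions multiply this further,
# toward the 10^8-10^11 repertoire size quoted above.
```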

3.5.1.3 Exploiting Systemic Solutions as a Means of Restricting Information Requirements

This principle corresponds to an adjustable skeleton key which can be used to pick many locks and therefore replaces a large bunch of individual keys.

Finding a generalized operating principle (by determining commonalities among a large number of individual mechanisms) is a good way to reduce the need for directly encoded information. Such generalizations enable the organism to apply a single procedure to a wide variety of situations.

One example of this strategy is organic detoxification, i.e., transforming toxic compounds into their polar metabolites, which are far less active in the organism and can easily be excreted. Here, instead of a large number of specific detoxification processes (as in the immune system, where each antigen has a specific associated antibody), a relatively small group of enzymes can effectively detoxify the organism by applying a common principle: reducing the toxicity of dangerous compounds by increasing their polarity.

Figure 3.41 presents detoxification as applied to aromatic compounds. The enzymatic systems integrated in the endoplasmic reticulum conduct oxidation-reduction processes dominated by monooxygenase activity. Other processes (including hydrolysis, deamination, etc.) occur simultaneously, resulting in synthesis of reactive, polar metabolites which contain R–OH, R–NH2, and R–COOH groups.

Fig. 3.41

Detoxification mechanisms active in the endoplasmic reticulum of hepatocytes. Application of the systemic principle—increased polarity results in decreased toxicity

At the next stage of the detoxification process, such groups can readily associate with other highly polar compounds (glucuronic acid, sulfate groups, taurine, and the like), yielding polar, water-soluble derivatives which are easily captured by the kidneys and excreted in urine.

3.6 The Role of Information in Interpreting Pathological Events

Intuition suggests that controlled (regulated) processes fall within the domain of physiology, while processes that have escaped control should be considered pathological. Loss of control may result from an interruption in a regulatory circuit. Thus, information is the single most important criterion separating physiological processes from pathological ones. Diabetes is caused by insulin deficiency (or insulin resistance), genetic diseases may emerge in the absence of certain effector enzymes, neoplasms occur when cells ignore biological “stop” signals, etc.

When discussing immune reactions, it should be noted that an exaggerated response to certain stimuli may prove just as detrimental as a complete lack of response. Similar hypersensitivity may also involve, e.g., nitric oxide, which can be overproduced in response to LPS (lipopolysaccharide) or TNF (tumor necrosis factor) and cause pathological reactions (Fig. 3.42). Pathological deficiencies sometimes apply to vitamins (which need to be delivered in food) and other substances which the organism expects to have available as a result of evolutionary conditioning. Finally, instances of poisoning (where the function of enzymes and other proteins is disrupted) may result from a breakdown of regulatory loops.

Fig. 3.42

Physiological (controlled) and pathological (uncontrolled) effects of nitric oxide—efficient regulation as a requirement in physiological processes

3.7 Hypothesis

3.7.1 Protein Folding Simulation Hypothesis: Early-Stage Intermediate Form

The correctness of the research presented above, as well as of its conclusions, may be verified by attempting to simulate biological phenomena in silico (a term coined as a counterpart to in vivo and in vitro, which describe research environments; a less popular equivalent is in computro). Suitable computer programs, reflecting the properties of real-life biological systems, may serve as an important tool for verifying scientific hypotheses.

The folding of polypeptide chains into their native three-dimensional forms is a prerequisite of biological activity. The spatial structure of a protein molecule determines its interaction with other molecules: substrates (in the case of enzymes), ligands (in the case of coenzymes), and prosthetic groups. It also determines the protein's structural lability (an inherent flexibility which facilitates biological function: the structure of a protein is not rigid and may undergo changes), as well as any allosteric properties it may exhibit. Under normal conditions, polypeptide chain folding is a directed process which yields a specific, predetermined structure.

While in silico methods are quite effective in genomics, calculation of tertiary and quaternary structures has so far proven elusive. We should note that protein folding is not the only area where computational techniques may be exploited in support of biological research. Equally important is the ability to determine the function of a given protein by identifying its ligand binding sites, active sites (in the case of enzymes), or the ability to form complexes with other proteins.

Simulating three-dimensional structures becomes particularly important now that genomics successfully identifies genes encoding proteins which are unknown to biochemistry. An effective means of simulating protein structures would enable us to determine their role in biological systems.

Thus far, the biennial Critical Assessment of Structure Prediction (CASP) experiment, organized for the tenth time in 2012, has not produced significant progress, despite the involvement of key research centers from around the world (see http://predictioncenter.org).

Theoretical models assessed within the scope of CASP can be divided into two groups: Boltzmann and Darwinian approaches. The former assumes that the polypeptide chain changes its conformation along the energy gradient, according to the so-called thermodynamic hypothesis (which states that protein folding is simply an ongoing “quest” to reach an energy minimum). The Darwinian approach is based on evolutionary criteria, claiming that proteins have evolved to attain their observed structure and function. Thus, Darwinists focus on structural comparisons, especially those involving homologous proteins (i.e., proteins which share common ancestry). Research teams which apply the thermodynamic hypothesis seek global optimization methods as a means of performing structural assessment. In contrast, scientists who follow the Darwinian approach query protein databases in search of similar structures and sequences, trying to determine the evolutionary proximity of various proteins. If an existing protein is found to be structurally similar to a new amino acid sequence (for which only the primary structure is known), the folding properties of the new sequence can be confidently predicted on the basis of the available templates.

The model presented below (as a hypothesis) is based on accurate recreation of the folding process, in contrast to methods which rely on guessing the structure of a given amino acid chain. In addition to finding the spatial conformation of a known sequence of amino acids, a useful research tool should also propose a generalized model of the folding process itself, enabling researchers to perform in silico experiments by interfering with the described mechanism (e.g., by introducing mutations and determining their impact on the biological function of the protein). Such techniques would be particularly useful in drug research, where the aim is to design a drug with specific properties.

Before presenting the model, however, we first need to introduce some general concepts relating to stages which precede translation.

The foundation of biological systems is the flow of information—from genetic code to a three-dimensional protein structure, capable of performing a specific function:

$$ \mathrm{DNA} \to \mathrm{RNA} \to \mathrm{Protein} $$

Or, to be more exact:

$$ \mathrm{DNA} \to \mathrm{mRNA} \to \mathrm{AA} \to \mathrm{3D} \to \mathrm{Biological\ function} $$

AA indicates the primary structure of the protein, while 3D stands for its spatial (three-dimensional) structure.

Numerical (stochastic) analysis of DNA nucleotide sequences enables us to locate genes, i.e., fragments which are subject to transcription and expression in the form of mRNA chains. The sequence of nucleotides in each mRNA chain determines the sequence of amino acids (AA) which make up the protein molecule. This sequence, in turn, determines the structural (3D) form of the protein and therefore its biological function.

While modern in silico sequencing techniques (including gene identification) appear sufficiently reliable, and the translation process is mostly deterministic (we know which sequences correspond to each amino acid), transforming an amino acid sequence into a three-dimensional protein structure remains an exceedingly difficult problem.

Proteins attain their 3D structure through self-organization. From the viewpoint of energy management, this process conforms to the so-called thermodynamic hypothesis which states that a folding polypeptide chain undergoes structural changes which lower its potential energy in search for an energy minimum.

In light of the stated requirement for accurate recreation of experimental conditions, we need to consider the fact that folding occurs in steps. The presented information flow model can be extended with intermediate structural forms:

$$ \mathrm{DNA} \to \mathrm{mRNA} \to \mathrm{AA} \to I_1 \to I_2 \to \dots \to I_n \to \mathrm{3D} \to \mathrm{Biological\ function} $$

I1 through In indicate an arbitrary number of intermediate stages.

A slightly more specific model which assumes two distinct intermediate stages called ES (early stage) and LS (late stage) will be discussed later on:

$$ \mathrm{DNA} \to \mathrm{mRNA} \to \mathrm{AA} \to \mathrm{ES} \to \mathrm{LS} \to \mathrm{3D} \to \mathrm{Biological\ function} $$

In order to determine how ES and LS are generated, let us consider the first stage in the information pathway. Genetic information is encoded by a four-letter alphabet where each letter corresponds to a nucleotide. In contrast, the amino acid “alphabet” consists of 20 separate letters—one for each amino acid.

According to information theory, the quantity of information carried by a single nucleotide is 2 bits (−log2(1/4)), whereas the quantity of information needed to unambiguously select one amino acid out of 20 is approximately 4.32 bits (−log2(1/20)). Clearly, a single nucleotide cannot encode an amino acid. If we repeat this reasoning for nucleotide pairs, we will conclude that two nucleotides are still insufficient: they only carry 4 bits of information, less than the required 4.32 bits. Thus, the minimum number of nucleotides needed to encode a single amino acid is 3, even though such a sequence carries more information than is strictly needed (specifically, 6 bits). This excess information capacity explains the redundancy of the genetic code.
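
These quantities can be checked with a few lines of Python (a minimal sketch of the arithmetic, nothing more):

```python
import math

def information_bits(p: float) -> float:
    """Information content, in bits, of an event with probability p."""
    return -math.log2(p)

bits_per_nucleotide = information_bits(1 / 4)    # 2.0 bits
bits_per_amino_acid = information_bits(1 / 20)   # ~4.32 bits

# Smallest codon length whose capacity covers one amino acid:
codon_length = math.ceil(bits_per_amino_acid / bits_per_nucleotide)
capacity = codon_length * bits_per_nucleotide

print(bits_per_nucleotide, round(bits_per_amino_acid, 2))  # 2.0 4.32
print(codon_length, capacity)  # 3 6.0 -> 6 bits carried vs 4.32 needed
```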

Let us now take this elementary link between probability and the quantity of information required in the translation process and apply it to subsequent stages in the information pathway, according to the central dogma of molecular biology.

If the sequence of amino acids unambiguously determines the three-dimensional structure of the resulting protein, it should contain sufficient information to permit the folding process to take place.

As already noted, one amino acid carries approximately 4.32 bits of information (assuming that all amino acids are equally common). How much information must be provided to describe the conformation of a given amino acid? This property is determined by the so-called conformer, i.e., the values of two dihedral angles which correspond to two degrees of freedom: Φ (Phi), the angle of rotation about the N–Cα bond, and Ψ (Psi), the angle of rotation about the Cα–C′ bond, where N stands for the amide nitrogen, while C′ stands for the carbonyl carbon. Each angle may theoretically assume values in the −180° to 180° range. The combination of both angles determines the conformation of a given amino acid within the polypeptide sequence. All such combinations may be plotted on a planar chart where the horizontal axis corresponds to Φ angles and the vertical axis to Ψ angles. This chart is called the Ramachandran plot, in honor of its inventor. It spans the entire conformational space, i.e., it covers all possible combinations of Φ and Ψ angles.

Let us now assume that it is satisfactory to measure each angle to an accuracy of 5°. The number of possible combinations is then (360/5) × (360/5) = 72 × 72 (note, however, that the −180° and 180° angles are equivalent, so we are effectively dealing with 71 × 71 possible structures).

Information theory tells us how many bits are required to unambiguously encode 71 × 71 possible combinations: we need −log2((1/71) × (1/71)) ≈ 12.29 bits. Thus, from the point of view of information theory, it is impossible to accurately derive the final values of the Φ and Ψ angles starting with a bare amino acid sequence. However, the presented calculation relies on the incorrect assumption that all amino acids are equally common in protein structures. In fact, the probability of occurrence (p) varies from amino acid to amino acid. The Ramachandran plot also shows that some conformations are preferred, while others are excluded due to the specific nature of peptide bonds or their association with high-energy states (which, as already explained, are selected against during the folding process). Column B in Table 3.3 presents the quantity of information carried by each amino acid given its actual probability of occurrence in polypeptide chains (these values are derived from a nonredundant protein database, the PDB, in which multiple data sets obtained by different research institutions for each protein have been merged into a single master set). Columns C and D indicate the expected quantity of information required to determine the dihedral angles Φ and Ψ (to an accuracy of 5° or 10°, respectively) for a specific peptide bond, subject to the probability distribution in the Ramachandran plot as well as the probability of occurrence of each amino acid (p). The values presented in Table 3.3 reflect entropy, which corresponds to the quantity of information (a weighted average over the entire conformational space).
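
The entropy calculation underlying columns C and D can be sketched as follows. The uniform case reproduces the 12.29-bit figure for the 71 × 71 grid, while the peaked distribution is a made-up example standing in for the real, uneven occupancy of the Ramachandran plot:

```python
import math

def entropy_bits(probabilities) -> float:
    """Shannon entropy H = -sum(p_i * log2(p_i)): the average number
    of bits needed to pin down one outcome of the distribution."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Uniform occupancy of the full 71 x 71 grid of (Phi, Psi) bins:
uniform = [1 / (71 * 71)] * (71 * 71)
print(f"uniform grid:        {entropy_bits(uniform):.2f} bits")  # ~12.30

# A hypothetical, strongly peaked occupancy (most bins excluded,
# a few conformations preferred) requires far fewer bits:
peaked = [0.4, 0.3, 0.2, 0.05, 0.05]
print(f"peaked distribution: {entropy_bits(peaked):.2f} bits")   # ~1.95
```

This is why accounting for the actual, non-uniform occupancy of the plot lowers the information demand well below the worst-case 12.29 bits.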

Table 3.3 The quantity of information carried by each amino acid (IA) (column B) in relation to the quantity of information (interpreted as entropy) required to assign its conformation to a specific area on the Ramachandran plot with an accuracy of 5° (column C) or 10° (column D) (denoted I5 and I10, respectively). Columns E, F, and G present the quantity of information required to pinpoint the location of the peptide bond within a specific cell of the limited conformational subspace (defined in the presented model)

The values presented in Table 3.3, comparing the quantity of information carried by each amino acid (column B) with the average quantity of information required to determine its corresponding Φ and Ψ angles (columns C and D), suggest that simply knowing a raw amino acid sequence does not enable us to accurately model its structure. This is why experimental research indicates the need for intermediate stages in the folding process.

A basic assumption concerning the early stage (ES) says that the structure of the polypeptide chain depends solely on its peptide backbone, with no regard for interactions between side chains. In contrast, the late-stage (LS) structure is dominated by interactions between side chains and may also depend on environmental stimuli. Most polypeptide chains fold in the presence of water (although they may also undergo folding in the apolar environment of a cellular membrane).

Explaining the ES structure requires a suitable introduction.

Let us assume that at early stages in the synthesis of a polypeptide chain, when interactions between side chains are not yet possible, the conformation of the chain is strictly determined by the mutual orientation of its peptide bond planes, which is in turn a consequence of the dihedral angles Φ and Ψ. The angle between neighboring peptide bond planes (called the V-angle) is expressed as a value in the 0–180° range. Given these assumptions, a helical structure emerges when the value of the V-angle is close to 0°, because in a helix successive peptide bond planes (treated as dipoles) share roughly the same direction. In contrast, values close to V = 180° give rise to the so-called β-sheet structure, where the peptide bond planes are directed against each other (note that a peptide bond has polarity and can therefore be assigned a sense, akin to a vector). As it turns out, the relative angle between two neighboring bond planes determines the curvature radius of the whole polypeptide chain. The helical structure associated with V ≈ 0° exhibits a low curvature radius, while the sheet-like structure corresponding to V ≈ 180° is characterized by straight lines, with a near-infinite curvature radius. Intermediate values of V result in structures which are more open than a helix but not as flat as a sheet. The functional dependency between V and the curvature radius (Fig. 3.43b), for the specific area of the Ramachandran plot representing low-energy conformations depicted in Fig. 3.43a, is plotted on a logarithmic scale for radius values to avoid dealing with the very large values obtained for the β-sheet.

Fig. 3.43

Elliptical path: (a) low-energy area on the Ramachandran plot, (b) relation between the V-angle and the resultant radius of curvature expressed on a logarithmic scale, ln(R), with the approximation function, (c) the distribution of structures satisfying the relation between the V-angle and ln(R) according to the approximation function, (d) elliptical graph approximating these points, and (e) the elliptical path linking all low-energy areas on the Ramachandran plot, representing all secondary structural forms

The approximation function (i.e., the curve which most closely matches an arbitrary set of points) is given as

$$ \ln (R)=3.4\times {10}^{-4}\times {V}^2-2.009\times {10}^{-2}\times V+0.848 $$
(3.1)
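
Equation (3.1) can be evaluated directly; the short sketch below (our illustration) shows how the curvature radius grows as the V-angle opens from helix-like (V ≈ 0°) to sheet-like (V ≈ 180°) values:

```python
import math

def curvature_radius(v_angle: float) -> float:
    """Radius of curvature R implied by the V-angle (in degrees),
    following the approximation function of Eq. (3.1)."""
    ln_r = 3.4e-4 * v_angle**2 - 2.009e-2 * v_angle + 0.848
    return math.exp(ln_r)

for v in (0, 45, 90, 135, 180):
    print(f"V = {v:3d} deg  ->  R = {curvature_radius(v):8.2f}")
# Small R near V = 0 corresponds to the tightly wound helix; the
# very large R near V = 180 corresponds to the flat beta-sheet.
```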

Figure 3.43c depicts the placement of structures which satisfy the above equation on the Ramachandran plot. It appears that certain conformations are preferred in peptide bonds and that there exists a preferred conformational space for the entire peptide backbone (note that the early-stage structure is determined solely by the arrangement of the backbone). The elliptical path seen in Fig. 3.43d joins all ordered secondary structures and can be treated as a limited conformational subspace for the early-stage (ES) form.

The presented interpretation has one more advantage: if we calculate the quantity of information required to unambiguously select one of the points comprising the ellipse (columns E, F, and G in Table 3.3, for varying degrees of accuracy), we will note that it corresponds to the quantity of information carried by the amino acids (column B in Table 3.3).

As a consequence, it seems that the raw sequence of amino acids contains only enough information to predict the conformation of the early-stage intermediate form within the subspace presented in Fig. 3.43d, e. Predicting the proper ΦES and ΨES values (where the subscript ES denotes membership in the set of points along the elliptical path) appears possible because the quantity of information carried by each amino acid is comparable to the quantity of information necessary to select its ES conformation, once the choice is limited to the elliptical path on the Ramachandran plot. The elliptical path is thus treated as the limited conformational subspace for the ES intermediate in the protein folding process.