Quantitative genetics: past and present
- Cite this article as:
- Narain, P. Mol Breeding (2010) 26: 135. doi:10.1007/s11032-010-9406-4
Most characters of economic importance in plants and animals, and complex diseases in humans, exhibit quantitative variation, the genetics of which has been a fascinating subject of study since Mendel’s discovery of the laws of inheritance. The classical genetic basis of continuous variation based on the infinitesimal model of Fisher and mostly using statistical methods has since undergone major modifications. The advent of molecular markers and their extensive mapping in several species has enabled detection of genes of metric characters known as quantitative trait loci (QTL). Modeling the high-resolution mapping of QTL by association analysis at the population level as well as at the family level has indicated that incorporation of a haplotype of a pair of single-nucleotide polymorphisms (SNPs) in the model is statistically more powerful than a single marker approach. High-throughput genotyping technology coupled with micro-arrays has allowed the expression of thousands of genes with known positions in the genome to be measured and has provided an intermediate step with mRNA abundance as a sub-phenotype in the mapping of genotype onto phenotype for quantitative traits. Such gene expression profiling has been combined with linkage analysis in what is known as eQTL mapping. The first study of this kind was on budding yeast. The associated genetic basis of protein abundance using mass spectrometry has also been attempted in the same population of yeast. A comparative picture of transcript vs. protein abundance levels indicates that functionally important changes in the levels of the former are not necessarily reflected in changes in the levels of the latter. Genes and proteins must therefore be considered simultaneously to unravel the complex molecular circuitry that operates within a cell. One has to take a global perspective on life processes instead of individual components of the system. The network approach connecting data on genes, transcripts, proteins, metabolites etc. indicates the emergence of a systems quantitative genetics. It seems that the interplay of the genotype-phenotype relationship for quantitative variation is not only complex but also requires a dialectical approach for its understanding in which ‘parts’ and ‘whole’ evolve as a consequence of their relationship and the relationship itself evolves.
Keywords: Quantitative characters; Genetic basis; Molecular markers; Quantitative trait loci (QTL); High-resolution mapping; Power of statistical modeling; eQTL; mRNA abundance; Protein abundance; Systems quantitative genetics; Dialectical approach
Most traits of economic importance in plants and animals as well as disease traits in humans have an underlying genetic basis involving several genes and are subject to modification by environmental factors. Statistical considerations have been predominant in dissecting such complex traits into estimable components. The heritability of a trait, defined as the proportion of phenotypic variation that is attributed to genetic causes, has been a prime indicator helpful in making decisions for the genetic improvement of economic traits. The prediction of response to artificial selection based on intensity and accuracy of selection and the existence of genetic variability has been successful across several crop plants, livestock, poultry and fisheries. However, the relationship between phenotype and genotype has been like a black box, where an inferential approach has been the only way to look into it. This scenario is now changing with the advent of modern technologies of gene sequencing, microarray experiments and enormous advances in attempts to understand gene and protein expression within a cell of an organism. Information on molecular markers has been extremely helpful in identifying the regions on chromosomes that bring about variation in the trait (quantitative trait loci; QTL), thereby providing tools that can lead to much more accurate selection procedures for genetic improvement of economic traits. Saturated genetic maps of markers, giving their order along a chromosome and relative distances between them, have been developed. The map distance is based on the total number of crossovers between the two markers, whereas the physical distance between them is in terms of nucleotide base pairs (bp). A centiMorgan (cM), corresponding to a recombination frequency of 1%, can be a span of 10–1,000 kbp and can vary across species.
The gene transcript data from microarray experiments can be integrated with molecular marker information to map expression traits (eQTL) that can possibly lead to causal networks. In this paper we discuss briefly some of these developments and indicate how the evolution of the quantitative genetics from the past to the present is heading towards a systems quantitative genetics.
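The prediction of response to selection mentioned above is conventionally written as R = i · r · σ_A, where i is the selection intensity, r the accuracy of selection and σ_A the additive genetic standard deviation. The following is a minimal sketch of that textbook relation, not taken from the article; the function names and all numerical values are illustrative assumptions. It takes heritability in the narrow sense (additive variance over phenotypic variance) and assumes simple mass selection, for which the accuracy equals the square root of heritability.

```python
import math


def narrow_sense_heritability(var_additive, var_phenotypic):
    """Proportion of phenotypic variance attributed to additive genetic causes."""
    return var_additive / var_phenotypic


def response_to_selection(intensity, accuracy, sd_additive):
    """Expected response per generation, R = i * r * sigma_A."""
    return intensity * accuracy * sd_additive


# Illustrative (hypothetical) values, not data from any study.
var_a, var_p = 4.0, 16.0
h2 = narrow_sense_heritability(var_a, var_p)

# Assumption: mass selection on own phenotype, so accuracy r = sqrt(h2).
r = math.sqrt(h2)
R = response_to_selection(intensity=1.4, accuracy=r, sd_additive=math.sqrt(var_a))

# Under the same assumption this equals the familiar form R = i * h2 * sigma_P.
R_alt = 1.4 * h2 * math.sqrt(var_p)
print(h2, R, R_alt)
```

With other selection schemes (progeny testing, marker-assisted selection) only the accuracy term changes, which is why more accurate selection procedures translate directly into a larger predicted response.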
Since the marker genotypes can be followed in their inheritance through generations, they can serve as molecular tags for following the QTL provided they are tightly linked with the QTL. The first problem is therefore to detect the marker–QTL linkage. Once this is established, the next problem is to estimate the QTL map position on the chromosome and estimate the effect of allelic substitution. However, these problems depend on whether we have data on experimental populations obtained from controlled crosses, as in plants and animals, or on natural populations like humans where controlled crosses cannot be made. It is, however, important to note that the markers chosen for the QTL analysis should not show any segregation distortion, as that may lead to biased marker-trait association. Also, the phenotypic data on the quantitative trait should follow a normal distribution. One has therefore to verify these assumptions for the data under consideration before embarking on the QTL analysis.
The detection of marker–QTL linkage is based on a statistical test of a null hypothesis (H0) against an alternative hypothesis (H1). It is therefore subject to two types of error. H0 postulates that there is no QTL and hence no linkage exists between the marker and the QTL. Rejecting it when it is true is a Type I error, which means that we detect marker–QTL linkage when in fact no QTL is present. This is termed a false positive and the probability of such a contingency (α) is kept as low as 5% or less. On the other hand, if we accept H0 when in fact a QTL is present, we commit a Type II error. This means that our test misses the QTL. As in any statistical test, the strategy is to minimize the probability of committing a Type II error (β) for a fixed value of α. The statistical power for QTL detection is then (1 − β). In QTL studies, such testing is done at several points or intervals where markers are located on each of the several chromosomes across the genome. Such multiple testing poses a challenging problem that is primarily statistical.
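To see why multiple testing is a problem, the sketch below computes how quickly the genome-wide false-positive rate grows with the number of tests, together with two standard per-test corrections (Bonferroni and Šidák). It assumes the tests are independent, which linked markers are not, so the corrections are conservative for real genome scans; the number of tests used is purely hypothetical.

```python
def familywise_error(alpha_per_test, n_tests):
    """Probability of at least one false positive among n independent tests."""
    return 1.0 - (1.0 - alpha_per_test) ** n_tests


def bonferroni_alpha(alpha_genome, n_tests):
    """Per-test level that bounds the genome-wide error rate by alpha_genome."""
    return alpha_genome / n_tests


def sidak_alpha(alpha_genome, n_tests):
    """Per-test level giving exactly alpha_genome under independence."""
    return 1.0 - (1.0 - alpha_genome) ** (1.0 / n_tests)


n_tests = 100  # hypothetical number of marker intervals tested
print(familywise_error(0.05, n_tests))   # uncorrected genome-wide error rate
print(bonferroni_alpha(0.05, n_tests))   # corrected per-test level
print(sidak_alpha(0.05, n_tests))
```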
The most common method of QTL mapping is that of interval mapping. The whole chromosome is divided into short intervals of about 20 cM each and each interval is treated separately for QTL detection and estimation. The maximum likelihood method leading to LOD score statistics is used for this purpose. A LOD score threshold T is chosen for comparing with the observed value. An observed value greater than T indicates significance. The LOD score values obtained for each interval are plotted against the chromosome position to give a Likelihood Map. The maximum value of the significant LOD scores provides a possible position of the QTL for the given genomic region.
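The LOD profile described above can be illustrated with a deliberately simplified sketch. It is an assumption-laden stand-in for interval mapping: instead of evaluating positions between flanking markers by maximum likelihood, it computes a regression-based LOD score only at the marker positions of a backcross, using LOD = (n/2) · log10(RSS0/RSS1) under a normal model. The function names, genotype coding (0/1 for the two backcross classes) and data are all hypothetical.

```python
import math


def lod_at_marker(phenotypes, genotypes):
    """Regression LOD for one marker in a backcross (genotypes coded 0 or 1)."""
    n = len(phenotypes)
    grand_mean = sum(phenotypes) / n
    # Residual sum of squares under H0: a single mean, no QTL.
    rss_null = sum((y - grand_mean) ** 2 for y in phenotypes)
    # Residual sum of squares under H1: one mean per marker genotype class.
    rss_qtl = 0.0
    for g in (0, 1):
        group = [y for y, gi in zip(phenotypes, genotypes) if gi == g]
        if group:
            m = sum(group) / len(group)
            rss_qtl += sum((y - m) ** 2 for y in group)
    # Note: rss_qtl == 0 (a perfect fit) would raise ZeroDivisionError.
    return (n / 2.0) * math.log10(rss_null / rss_qtl)


def scan_markers(phenotypes, genotypes_by_position):
    """Return (position, LOD) of the highest LOD score along a chromosome."""
    profile = {pos: lod_at_marker(phenotypes, g)
               for pos, g in genotypes_by_position.items()}
    peak = max(profile, key=profile.get)
    return peak, profile[peak]


# Hypothetical data: eight backcross plants, two markers at 10 and 30 cM.
pheno = [10, 12, 11, 13, 20, 22, 21, 23]
markers = {10.0: [0, 1, 0, 1, 0, 1, 0, 1], 30.0: [0, 0, 0, 0, 1, 1, 1, 1]}
position, lod = scan_markers(pheno, markers)
threshold = 2.4  # a LOD threshold T, chosen here only for illustration
print(position, lod, lod > threshold)
```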
Although simple interval mapping (SIM) is the method for QTL mapping most widely used with advantage in several practical situations, it ignores the fact that most quantitative traits are influenced by numerous QTL. This is overcome either by adopting a model of multiple QTL mapping (MQM) or by combining SIM with the method of multiple linear regression, a procedure known as composite interval mapping (CIM). In all these methods, one uses the approach of maximum likelihood which produces only point estimates of the parameters such as the number of QTL, their location, and effects. The corresponding confidence intervals are required to be determined separately by re-sampling methods. Further, the correct number of QTL is difficult to determine using traditional methods. Their incorrect specification leads to distortion of the estimates of locations and effects of QTL. To address these problems a Bayesian approach is adopted wherein the joint posterior distribution of all unknown parameters given their prior distributions and the observed data is computed. This is done using iterative simulation procedures on high-speed computers.
The first application of interval mapping in plant breeding has been to an inter-specific backcross in tomato. The parents for the backcross were the domestic tomato Lycopersicon esculentum (E) with fruit mass 65 g and a wild South American green-fruited tomato L. chmielewskii (CL) with fruit mass 5 g. A total of 237 backcross plants were assayed for continuously varying characters like fruit mass, soluble-solids concentration and pH, and 63 RFLP and 20 isozyme markers spaced at approximately 20 cM intervals were selected for QTL mapping. The methods of maximum likelihood and LOD scores were used through the software MAPMAKER-QTL to implement the interval mapping. A threshold T = 2.4, giving the probability of less than 5% that even a single false positive will occur anywhere in the genome, was used. This corresponds approximately to the significance level for any single test of 0.001. The resulting QTL likelihood maps revealed multiple QTL for each trait (6 for fruit weight, 4 for concentration of soluble solids and 5 for fruit pH) and estimated their location to within 20–30 cM.
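The correspondence between the threshold T = 2.4 and a single-test significance level of about 0.001 can be checked with a short calculation. The sketch assumes that the likelihood-ratio statistic 2 · ln(10) · LOD is asymptotically chi-square distributed with one degree of freedom (one QTL effect in a backcross); the function name is illustrative.

```python
import math


def lod_to_pointwise_p(lod):
    """Approximate single-test p-value for a LOD score (chi-square, 1 df)."""
    chi2 = 2.0 * math.log(10.0) * lod
    # Upper tail of a 1-df chi-square: P(X > x) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2.0))


print(lod_to_pointwise_p(2.4))
```

Under this assumption the value for LOD 2.4 is a little below 0.001, in line with the figure quoted above.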
In regard to fruit weight, the above type of investigation was continued, with more and more QTL for this trait being identified. In another study, at least 28 QTL controlling the difference in fruit weight between wild and cultivated tomato were identified, one of them being fw2.2 on chromosome 2. Using refined mapping studies, this QTL was localized to a narrow chromosomal region of the order of 1/10,000 of the genome. Using a map-based approach, fw2.2 was cloned and a 19-kb segment of DNA containing it was sequenced. This made it possible to identify a single gene, ORFX, responsible for the QTL effect. By transforming the wild version of the gene into a cultivated tomato, it was shown that the fruit weight of the transformed plants decreased by around 30%, as predicted, thus confirming that there are no additional fruit weight QTL nearby on the chromosome. In yet another experiment, the population under study was derived from a cross between the wild species L. pimpinellifolium, with average tomato fruit weight of 1 g, and the L. esculentum cultivar Giant Heirloom, with fruit weight in excess of 1,000 g. The same six major loci as in the previous experiments, on chromosomes 1–3 and 11 and accounting for as much as 67% of phenotypic variation in fruit mass, were identified. The two most significant QTL detected in this study are fw11.3 and fw2.1 on chromosomes 11 and 2 respectively.
Linkage disequilibrium or association mapping
Association studies that involve linkage disequilibrium (LD) between markers and genes underlying complex traits are being undertaken in different parts of the world, but mostly in human genetics. The key idea is that a disease mutation assumed to have arisen once on the ancestral haplotype of a single chromosome in the past history of the population of interest is passed on from generation to generation together with markers at tightly linked loci resulting in LD. The usual method adopted in human genetic studies is that of case–control analysis wherein genotype or allele frequencies of candidate genes are compared in unrelated cases and controls. However, when the population is composed of a recent admixture of different ethnic groups that differ in marker allele frequencies and disease frequencies, the method of case–control comparison leads to spurious association between the marker genotypes and the disease traits. Family-based association methods such as the transmission/disequilibrium test (TDT) can circumvent such problems.
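The TDT mentioned above reduces to a simple comparison of counts. Among heterozygous parents, let b be the number of times a given marker allele is transmitted to an affected offspring and c the number of times it is not; under the null hypothesis the statistic (b − c)² / (b + c) is approximately chi-square with one degree of freedom. The sketch below uses hypothetical counts and illustrative function names.

```python
import math


def tdt(transmitted, not_transmitted):
    """Transmission/disequilibrium test from heterozygous-parent counts.

    Returns the chi-square statistic (1 df) and its approximate p-value.
    """
    b, c = transmitted, not_transmitted
    chi2 = (b - c) ** 2 / (b + c)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))  # upper tail, chi-square 1 df
    return chi2, p_value


# Hypothetical counts: allele transmitted 60 times, not transmitted 40 times.
chi2, p_value = tdt(60, 40)
print(chi2, p_value)
```

Because transmitted and untransmitted alleles come from the same parents, the comparison is matched within families, which is what protects the test against spurious association due to population admixture.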
Several studies on modeling the high-resolution mapping of QTL by association analysis at the population level as well as at the family level have been conducted (Spielman et al. 1993; Luo et al. 1997; Luo et al. 2000; Fan et al. 2006 and several others). Because of the difficulty in ascertaining the phase of a haplotype consisting of several single-nucleotide polymorphisms (SNPs), these models considered marker genotypes at each locus separately, thus losing information on their joint characteristics. Narain (2007, 2009) therefore considered the full genotypic model at a pair of flanking diallelic SNPs, in the context of a family-based approach like the TDT for testing the association in the presence of LD. It led to a more powerful test when expressed in terms of non-centrality parameters. This strategy for high-resolution mapping of QTL by association analysis was also investigated at the population level and led to increased power of the corresponding tests.
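The link between a non-centrality parameter and power can be made concrete with a small sketch. It assumes a chi-square test with one degree of freedom, for which the statistic is the square of a normal variable with mean equal to the square root of the non-centrality parameter; the tests in the work cited above may have other degrees of freedom, so this is only an illustration of the principle that a larger non-centrality parameter gives a more powerful test.

```python
from statistics import NormalDist


def power_chi2_1df(ncp, alpha=0.05):
    """Power of a 1-df chi-square test with non-centrality parameter ncp."""
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided normal critical value
    shift = ncp ** 0.5
    # P(|Z + shift| > crit) for a standard normal Z.
    return (1.0 - z.cdf(crit - shift)) + z.cdf(-crit - shift)


# Hypothetical non-centrality parameters for two competing tests.
for ncp in (0.0, 4.0, 8.0, 12.0):
    print(ncp, round(power_chi2_1df(ncp), 3))
```

At ncp = 0 the power equals the significance level α, as it must when H0 is true.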
Joint linkage and LD mapping
While linkage mapping can readily detect chromosomal regions harboring QTL, it is difficult to locate them precisely. Also, since this approach depends on the cross between two true-breeding parents, it captures only a tiny fraction of the genetic diversity in the population. Association mapping, on the other hand, samples genetic diversity widely and requires fewer individuals, but has less power to detect QTL when they are not common. The advantages of the two approaches can, however, be combined by initially detecting QTL using linkage mapping with a moderate number of markers, followed by a second stage of high-resolution association mapping in QTL regions that capitalizes on a high-density marker map.
The benefits of linkage and association mapping have recently been combined in a single population of maize by adopting a nested association mapping (NAM) approach. The maize NAM population was derived by crossing a common reference sequence strain to 25 different maize lines. Individuals resulting from each of the 25 crosses were self-fertilized for four further generations, to produce 5,000 NAM recombinant inbred lines (RILs). This population was first used for initial detection of QTL using a linkage mapping approach. Subsequently, within each diverse strain, high-resolution association mapping was adopted with a high-density marker map. It is significant to note that within each RIL all individuals are genetically nearly identical. This means we can estimate the true breeding value of each line much more accurately by averaging the phenotypic measurements of a given trait taken on several individuals with the same genotype.
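The gain from averaging over genetically identical individuals can be quantified with the standard expression for the heritability of a line mean, σ²_G / (σ²_G + σ²_E / n), where n is the number of individuals measured per line. The sketch below uses hypothetical variance components and an illustrative function name; it is not taken from the NAM studies.

```python
def line_mean_heritability(var_genetic, var_environment, n_individuals):
    """Heritability of the mean of n genetically identical individuals."""
    return var_genetic / (var_genetic + var_environment / n_individuals)


# Hypothetical variance components: environmental noise three times the
# genetic variance, so single-plant heritability is only 0.25.
for n in (1, 3, 10, 30):
    print(n, round(line_mean_heritability(1.0, 3.0, n), 3))
```

As n grows, the environmental term shrinks and the line mean approaches the genotypic value of the line.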
In a recent experiment, the genetic architecture of flowering time in Zea mays (maize) was dissected using NAM. About 1 million plants were assayed in eight environments to map the QTL. About 29–56 QTL were found to affect flowering time. These were small-effect QTL shared among the diverse families. The analysis showed, surprisingly, the absence of any single large-effect QTL. Moreover, no evidence of epistasis or environmental interactions was found. Flowering time controls adaptation of plants to their local environment in the outcrossing species Zea mays. A simple additive genetic model accurately predicting the flowering time in this species is thus in sharp contrast to what has been observed in several plant species which practice self-fertilization.
Mapping of QTL for gene expression profile (eQTL)
The advent of DNA chip technology in the form of cDNA and oligonucleotide microarrays has provided huge and complex datasets on gene expression profiles of different cell lines from different organisms. Such gene expression profiles have recently been combined with linkage analysis based on QTL mapping through molecular markers in what has been termed ‘genetical genomics’ (Jansen and Nap 2001). Gene expression levels for each individual of a segregating population are phenotypes that are correlated with markers, genotyped for that individual, to identify the QTL and their locations on the genome to which the expression traits are linked. Such expression quantitative trait loci (eQTL) studies are similar to traditional multi-trait QTL studies but with thousands of phenotypes. It is also important to note that, underlying the gene expression differences, there are two types of regulatory sequence variation. One is cis-regulatory that affects its own expression and the other is trans-acting or protein coding that affects the expression of other genes. The first study in which transcript abundance was used to study the linkage with the QTL was on budding yeast (Brem et al. 2002) based on a cross between a laboratory strain and a wild strain, the parents being haploid derivatives. The heritability estimation was based on haploid segregants and the linkage with a marker was tested by partitioning the segregants into two groups according to marker genotypes and comparing the expression levels between the groups with the Wilcoxon–Mann–Whitney test. They found eight trans-acting loci, each affecting the expression of a group of 7–94 genes of related function. Since then, several eQTL studies have been published in species like mice, maize, humans, rats and Arabidopsis thaliana (Schadt et al. 2003; Lan et al. 2003; Morley et al. 2004; DeCook et al. 2006). 
These have led to some general principles of genetic mapping of genome-wide gene expression as reviewed by Rockman and Kruglyak (2006).
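The marker test described above for the yeast cross, in which segregants are partitioned by marker genotype and their expression levels compared with the Wilcoxon–Mann–Whitney test, can be sketched as follows. The implementation is a minimal one written for illustration: it uses mid-ranks for ties and a normal approximation for the p-value, without tie or continuity corrections, and the genotype coding (0/1 for the two parental alleles of a haploid segregant) and data are hypothetical.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic for x, two-sided p-value by normal approximation)."""
    pooled = sorted((v, g) for g, grp in enumerate((x, y)) for v in grp)
    n = len(pooled)
    rank_sum_x = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        mid_rank = (i + 1 + j) / 2.0  # average of ranks i+1 .. j for ties
        rank_sum_x += mid_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2.0))


def eqtl_marker_test(expression, marker_genotypes):
    """Split segregants by marker allele (0 or 1) and compare expression."""
    group0 = [e for e, m in zip(expression, marker_genotypes) if m == 0]
    group1 = [e for e, m in zip(expression, marker_genotypes) if m == 1]
    return mann_whitney_u(group0, group1)


# Hypothetical expression levels of one transcript in ten segregants.
expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
marker = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
u, p_value = eqtl_marker_test(expression, marker)
print(u, p_value)
```

In a genome-wide eQTL study this test would be repeated for every transcript and every marker, which is why the multiple-testing issue raised earlier becomes even more acute with thousands of expression phenotypes.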
Conducting experiments to identify QTL for organismal phenotype (P) as well as for the corresponding transcript phenotype (Ps) can indicate the genetic relationship between them, as borne out by the study of Lan et al. (2003) on type 2 diabetes in a population of F2-ob/ob mice from a cross of two mouse strains. There were 8 mRNA traits (several Ps) and 8- and 10-week levels of fasting plasma glucose, insulin and body mass—the six physiological phenotypic traits (several P) for diabetes—and known genotypes of 192 microsatellite markers included in the study. In addition, of course, each transcript had a known position on the genome, as is true for any microarray experiment. The clustering of the two types of phenotypes together led to two groups of 4 each of the 8 mRNA traits due to their mutual correlations, with one of the groups containing SCD1 transcript (Ps), showing strong association with the insulin trait. eQTL mapping of the first principal component of this group revealed two loci DMC1 and DMC2 that were significantly associated with SCD1. The region of the former, on chromosome 2, overlapped with the locus t2dm3 that was found to be associated with fasting insulin levels (P), using traditional QTL mapping. Similarly, the region of the DMC2 gene, on chromosome 5, overlapped with a locus associated with fasting glucose levels (P). Thus SCD1 mRNA expression was shown to be linked to the loci that are associated with type 2 diabetes using both multi- as well as single-trait QTL mapping. This study points out that the phenotypic correlation between P and Ps is due to the genetic correlation between the corresponding genotypes—the DNA sequence variation—and the possible correlation between their corresponding environmental components. As we will see later, such data can develop into causal networks.
QTL for protein levels in yeast
In each cell of an organism, most of the day-to-day work in terms of metabolism and structure is performed by proteins consisting of long polypeptide chains of amino acids that are of 20 types. It is well known that the function of a protein is coded in a 20-letter-alphabet language of amino acids and the type of amino acid is dictated by the genetic code that consists of successive triplets of nucleotides along the DNA. The relationship between DNA and proteins is provided by the manner in which the 4-letter language of DNA is translated into the 20-letter language of protein. It is therefore expected that functionally important changes in transcript levels should be reflected in changes in the levels of the corresponding proteins.
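The passage from the 4-letter language of DNA to the 20-letter language of protein can be illustrated with a short sketch of the standard genetic code. The table is built from the conventional ordering of codons over the bases T, C, A, G; the function name and the example sequence are illustrative, and the sketch reads the coding strand directly rather than modelling transcription.

```python
from itertools import product

BASES = "TCAG"
# Amino acids (one-letter codes, '*' = stop) for codons in TCAG order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(coding_dna):
    """Translate a coding-strand DNA sequence up to the first stop codon."""
    protein = []
    for i in range(0, len(coding_dna) - 2, 3):
        aa = CODON_TABLE[coding_dna[i:i + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate("ATGGCCTAA"))  # a hypothetical three-codon reading frame
```

The 64 triplets map onto only 20 amino acids plus stop signals, so the code is redundant: different DNA sequences can specify the same polypeptide.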
Proteome profiling based on mass spectrometry has been used for quantitative measurement of protein abundance to study the genetic basis of protein level in a cross between two diverse strains of the budding yeast, the two strains differing at 0.6% of base pairs (Foss et al. 2007). The same cross was also used earlier to understand the genetic basis of transcript levels (Brem et al. 2002). This therefore allowed the comparison of the genetics of protein and transcript levels in the same population. Just as transcript levels are compared across samples by measurements of corresponding spot hybridization intensities on micro-arrays, levels of peptides in an output of a mass spectrometry experiment consisting of a matrix of peaks, each of which represents a peptide, are measured in terms of ion intensities after appropriate alignment of the matrices.
Total proteins from eight independent logarithmic-phase cultures of each parent and from two independent cultures of each of 98 segregants were isolated, digested with trypsin and analyzed by mass spectrometry. Only the best peptide for a given protein was selected. This led to 221 unique peptides with high quality data and corresponded to 278 proteins. The genetic contribution to the observed variability in protein abundance was estimated from a subset of 156 of these proteins for which high-quality data from the parent strains were also available. The heritability of protein abundance was found to be 0.62. The comparison between genetic regulation of proteins and that of the transcripts revealed more differences than similarities; the average correlation between them was found to be only 0.186. The parental strains differed in both proteins as well as transcripts to the extent of about 33%. However, only 43% of proteins that differed between the parents corresponded to transcripts that were different between the parents. Linkage analysis detected loci for 156 of 278 transcripts (56%) compared to 85 of 221 peptides (38%). Most loci affected either peptide abundance or transcript abundance but not both. Since traits are not physically located in the region, the corresponding hot spots are trans-acting. Protein linkages were found to be concentrated in fewer hot-spots than the transcript linkages. The overall conclusion of this study was startling in that the loci that influenced protein abundance differed from those that influenced transcript levels, much against expectations.
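The role of the parental replicates in estimating the genetic contribution can be sketched as follows. One common estimator for crosses of this kind, assumed here and not necessarily the one used in the study, treats the pooled variance among replicate cultures of each parent as environmental variance and computes heritability as (V_segregants − V_environment) / V_segregants. All names and numbers are hypothetical.

```python
from statistics import variance


def pooled_within_variance(groups):
    """Pooled sample variance within groups of replicate measurements."""
    numerator = sum((len(g) - 1) * variance(g) for g in groups)
    denominator = sum(len(g) - 1 for g in groups)
    return numerator / denominator


def heritability_from_cross(segregant_values, parent_replicates):
    """Estimate h2 = (V_seg - V_env) / V_seg, truncated at zero."""
    v_seg = variance(segregant_values)
    v_env = pooled_within_variance(parent_replicates)
    return max(0.0, (v_seg - v_env) / v_seg)


# Hypothetical abundance values for one peptide.
segregants = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
parents = [[2.0, 2.5, 3.0], [7.0, 7.5, 8.0]]  # replicate cultures per parent
print(heritability_from_cross(segregants, parents))
```

Traits whose variation among segregants is no larger than the variation among replicate cultures receive an estimate of zero under this scheme.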
Systems quantitative genetics
The relationship between genotype and phenotype is viewed by Rockman (2008) as a reverse engineering process in which observations from segregating populations on genes, transcript abundance, QTL for transcript abundance and molecular markers are used to infer causal networks to understand how the system works as an integrated whole. Based on the premise that genetic variation occurring naturally in a population is a source of multi-factorial perturbation, he reviews the recent literature to show how models of probabilistic causal networks can be built up to establish the genotype–phenotype map. In a way, the review indicates the emergence of a systems quantitative genetics.
[Figure: QTL analysis of 10 hypothetical transcripts]
There have been two major developments in recent times that have changed the way we are accustomed to look at the mapping of genotype onto phenotype for quantitative characters. The first is the advent of molecular markers, their extensive mapping in several species and their incorporation in statistical models as covariates. In addition to classical heritability as the proportion of phenotypic variation in the character that is due to additive effects of QTL, we have now the proportion of additive genetic variation that is associated with the markers. The larger this proportion, the greater is our ability to detect QTL. However, the regions to which the QTL are mapped are usually large, of the order of 10–20 cM or even greater, making candidate gene evaluation impossible. High-resolution mapping based on association genetics must then be undertaken for which various models have been developed, most of which consider a single marker at a time thereby losing valuable information due to linkage between them. For family-based association methods like the TDT, Narain (2007, 2009) developed the theory with haplotypes instead of a single marker and proposed that one can study the putative gene at any given location on the chromosome by considering only a pair of markers around it rather than the whole set of markers.
The second development is high-throughput genotyping technology which, coupled with micro-arrays, has allowed the expression of thousands of genes with known positions in the genome to be measured and has provided an intermediate step with mRNA abundance as a sub-phenotype (Ps) in the mapping of genotype onto phenotype for quantitative traits. Such gene expression profiling has been combined with linkage analysis, termed eQTL mapping. Recently, the associated genetic basis of protein abundance using mass spectrometry has also been attempted. A comparative picture of transcript vs. protein abundance levels in the same population in the case of budding yeast, however, indicates that functionally important changes in the levels of the former are not necessarily reflected in changes in the levels of the latter. It may be worthwhile to discuss this from a conceptual angle.
As we know, the central dogma of molecular biology stipulates that the sequence information flows from DNA to RNA to protein but not in the reverse direction. Rockman (2008) has also indicated that many causal orderings in the network analysis are prohibited by the central dogma, at least within an individual, as phenotype does not feed back to affect genotype, though between individuals phenotypes do feed back by selection to shape genes. But Kimchi-Sarfaty et al. (2007) reported data indicating that a protein’s three-dimensional structure is not necessarily determined by its amino acid sequence, which has been specified by the DNA sequence. An mRNA, if subjected to translational braking, can generate a protein with a different structure than specified by the DNA sequence. This has been termed the ‘translation-dependent folding’ (TDF) hypothesis (Newman and Bhat 2007). Differential gene expression resulting in transcripts as sub-phenotypes could then lead to different proteins and could give results similar to those obtained in the yeast experiment. Genes and proteins are therefore required to be considered simultaneously to unravel the complex molecular circuitry that operates within a cell. One has to take a global perspective on the genotype–phenotype relationship instead of looking at individual components like DNA or proteins of a cellular system.
It seems the interplay of genotype–phenotype relationship for quantitative variation is not only complex but also needs a closer look at how we view this relationship: whether purely at the DNA–RNA level as in the reductionist approach, or at the level of the cell as a whole where DNA–RNA are just parts of the cellular system, with other contextual forces present in the micro-environments of the cell also playing their own important roles. Such situations have also been noticed in agricultural experimentation where a dialectical approach has been advocated (Narain 2006). In the grain production process, it is as important to study how this process affects the soil health and the ecosystem surrounding the plant as it is to study the effect of the inputs on the production. In the dialectical approach, this relationship between the plant and its environment is studied both ways (input to output as well as output to input), a sort of feedback. A similar possibility seems to exist in the genotype–phenotype relationship within a cell. The protein as a phenotype is determined by the DNA sequence as the genotype but the reverse phenomenon of protein affecting the DNA could also take place at the expense of violating the central dogma. In fact, studies are being conducted to explore biochemical signaling pathways that regulate the function of living cells through regulatory networks having positive and negative feedback loops (Ray 2008), though it is unclear how genetics can be incorporated into it. These feedback loops are basically cybernetic concepts that are inherent in the dialectical approach. This approach takes into account the dynamics of the system over time as well, in which the development is a consequence of opposing forces. This is based on the concept of contradiction inherent in the meaning of dialectics. Things change because of the action of the opposing forces on them, and things remain how they are because of the temporary balance of opposing forces.
The opposing forces are seen as contradictory in the sense that each taken separately would have opposite effects, but their joint action may be different from the result of either acting alone. These forces are, however, part of self-regulation and the development of the object is regarded as a network of positive and negative feedback loops the incorporation of which in the genetic context would violate the central dogma. Genes, transcripts, proteins, metabolites, physical components etc. can be regarded as ‘parts’ of the cellular system and the ‘whole’ is regarded as a relation of these parts that acquire properties by virtue of being the parts of a particular whole. As soon as the parts acquire properties by being together, they impart to the whole new properties that are in turn reflected in changes in the parts, and so on. Parts and whole therefore evolve as a consequence of their relationship, and the relationship itself evolves. Genes are fixed but their expression, the transcript, is not. At any given moment of time genes are expressed depending on the requirement of the cell and through the information contained in the DNA. At this moment of time the cellular system is said to have a particular state. At the next moment of time the same genes are expressed but differently, depending upon the then requirement of the cell and based on the feedback, if any, from the system’s state at the previous time point, assuming that the process is Markovian. This gives the next state of the system which might or might not be different from the previous state. And so the process goes on, continually modifying the relationship between the different parts of the system based on the interactions and feedbacks. It seems a dialectical approach could provide the clue for understanding how ‘parts’ of a system and the ‘whole’ system behave in the genetics context. But how to model such a process remains to be seen.