29.1 Introduction

Family studies have provided experimental observations enabling geneticists to recognize many human genetic traits and diseases. Single-gene Mendelian traits are usually deduced by straightforward inspection of the data, but sophisticated statistical methods have had to be developed to analyze phenotypes that have more complex modes of inheritance. An ongoing catalog of these traits has been compiled by Victor McKusick for over 30 years; 4344 traits are listed in the eighth edition (1988) of Mendelian Inheritance in Man (1). The exponential increase in reporting of new human genetic information has led to this data base being computerized, and it is available in daily updated form for geneticists to interrogate via academic networks.

For many years geneticists have been frustrated by being able both to identify an inherited trait or disease by family studies and to propose that it would be caused by mutation in a single gene, but being unable to investigate the underlying genetic pathology of the disorder. Useful genetic-counseling advice could be offered to patients and relatives in some cases, but there arose few opportunities to offer genetic screening, and presymptomatic or prenatal diagnosis.

The recent explosion in molecular genetic technology has provided the tools to extend the analysis of inherited traits from the segregation pattern to cloning of the gene or genes responsible for the trait. In some cases this has provided useful information for genetic counselors helping their clients, as well as insight into the pathophysiology of and potential therapeutic strategies for these conditions. There has been much interest in localizing single-gene mutations to individual chromosomes by researchers aiming to isolate and clone the gene responsible for specific diseases. Frequently there is sparse information as to the underlying biochemical defect in these diseases, and mapping, cloning, and sequencing these genes is one of the few options open for understanding these conditions. These so-called reverse-genetic strategies have had several successes; for example, the gene mutated in cystic fibrosis has recently been cloned, and mutations within the gene have been identified, allowing direct carrier detection and prenatal diagnosis (2).

The first step in isolating by “reverse genetics” the defective gene that is mutated in an inherited disorder is to localize the disease trait to a specific chromosomal region. Human geneticists have two main methods available to them for mapping these traits; genes may be directly mapped when affected individuals carry chromosomal aberrations that physically pinpoint the mutated gene, or indirectly mapped in genetic linkage studies with multiply affected families.

Greig cephalopolysyndactyly syndrome, a condition affecting limb and craniofacial development in humans, was localized directly by examining the karyotypes of two unrelated patients; each was found to carry a different balanced translocation with a common breakpoint at 7p13. For many, perhaps most, inherited traits, no evidence of chromosomal rearrangement is found, and genetic linkage studies provide the sole means of chromosomal localization. This is a practical approach when data from sufficient families with multiply affected members can be collected to provide the raw material for linkage studies, namely informative meioses. This requirement usually limits the linkage approach to relatively common conditions. In contrast, those rare conditions associated with chromosomal translocations have the potential to be profitably analyzed with material from only one patient.

The linkage approach has been successful in mapping many genetic diseases, including heritable cancers (e.g., familial polyposis coli [3], neurofibromatosis [4,5], multiple endocrine neoplasia type 2a [6,7]), neuromuscular diseases (e.g., Duchenne muscular dystrophy [8]), degenerative neurological disorders (e.g., Huntington’s disease [9] and Friedreich’s ataxia [10]), adult polycystic kidney disease (11), and the respiratory and gastrointestinal tract disorder cystic fibrosis (12–14). The next sections discuss the methodology that has been followed in mapping such genes in humans using linkage studies, including both the resources that are necessary in collating the data and statistical topics relevant to the analysis of these data.

29.2 A Brief History of Genetic Linkage Analysis

29.2.1 Sweet-Peas and Fruit Flies

The classic work by Bateson et al. in 1905 provided evidence that Mendelian characteristics (petal color and pollen grain shape in the sweet-pea) did not always segregate independently of each other, since they observed an excess of parental gametic combinations over reassociations (15). The inferred exchange of genetic material between chromosomes caused the authors consternation when they considered the implications in terms of the chromosomal theory of inheritance, which was initiated in 1903 when Sutton proposed that genes were carried on chromosomes. De Vries had anticipated these exchanges of genetic material (16); Morgan and Cattell in 1912 interpreted recombination in terms of “crossing over” between homologous chromosomes (17). Sturtevant in 1913 produced a genetic map of several sex-linked loci in Drosophila, using the recombination fraction as a measure of physical separation (18). This early work has provided the core methodology, which has been followed subsequently by geneticists constructing linkage maps for a diverse range of species including humans.

29.2.2 Humans

The first genetic linkage in humans was reported in 1937 by Bell and Haldane, who found linkage between X-linked color blindness and hemophilia (19). Mohr reported the first autosomal linkage between Lutheran and Lewis blood groups in 1954 (20). It is pertinent to note that, in his original analysis, Mohr failed to detect the linkage between these blood groups and myotonic dystrophy in the original family, which was evident when likelihood methods were used in the analysis. Linkage analysis in humans really blossomed only with the discovery of the abundance of DNA polymorphisms coupled with simple experimental means to detect and follow them as they segregated through families.

29.3 DNA Polymorphisms

For the majority of human DNA (possibly as much as 99%) there is no known function. Mutations that accumulate within this “noncoding” or “anonymous” DNA appear to be, in evolutionary terms, selectively neutral; several classes of DNA polymorphisms have been identified within this DNA, and all segregate as codominant markers.

29.3.1 Restriction Fragment Length Polymorphisms (RFLP)

The simplest of these polymorphisms is a single base change, which has been estimated to occur randomly about once every 150 bp in noncoding, “anonymous” DNA. These point mutations arise by a variety of mechanisms, but the CpG dinucleotide is particularly susceptible to modification. The cytosine in a CpG dinucleotide is liable to methylation outside of HpaII tiny fragment (HTF) islands, and the methylated derivative is frequently converted to TpG. The base substitution may alter a restriction endonuclease recognition site that contains a CpG (e.g., TaqI or MspI, which recognize TCGA and CCGG, respectively); thus, probing a Southern blot made with appropriately digested genomic DNA will reveal a restriction fragment length polymorphism (RFLP). The CpG “hotspot” for point mutation, coupled with the preferential use of CpG restriction enzymes by researchers searching for RFLPs, explains their enrichment in published lists of RFLPs.

Probes detecting RFLPs have been reported for all chromosomes, although several investigators have commented that the X chromosome carries fewer and less informative polymorphisms than do autosomes. RFLPs that detect a base substitution have two alleles, which obviously limits the upper boundary for the level of heterozygosity and informativity with a single polymorphism. However, data from multiple tightly linked markers may be combined into a haplotype, which may well be more informative (unless the alleles detected by tightly linked markers are in linkage disequilibrium, see Section 4).
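The ceiling on heterozygosity for a two-allele marker can be illustrated with a short calculation. The sketch below assumes Hardy-Weinberg proportions, so that expected heterozygosity is one minus the sum of the squared allele frequencies; the function name and the ten-allele example are hypothetical and serve only to contrast a two-allele RFLP with a multiallelic marker.

```python
def expected_heterozygosity(allele_freqs):
    """Expected heterozygosity, 1 - sum(p_i^2), assuming Hardy-Weinberg proportions."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


# A two-allele RFLP is most informative when both alleles are equally common.
two_allele_max = expected_heterozygosity([0.5, 0.5])    # about 0.5

# A skewed two-allele RFLP is less informative still.
two_allele_skewed = expected_heterozygosity([0.9, 0.1])  # about 0.18

# Hypothetical marker with ten equally frequent alleles, for comparison.
ten_allele = expected_heterozygosity([0.1] * 10)         # about 0.9
```

Under these assumptions no two-allele marker can exceed 50% heterozygosity, which is why combining tightly linked markers into haplotypes can be worthwhile.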

29.3.2 Hypervariable DNA Polymorphisms

Alec Jeffreys and coworkers at the University of Leicester have identified a novel set of DNA sequences containing short, simple, repetitive motifs (21). When a “minisatellite core” sequence isolated from the human myoglobin gene is used to probe genomic DNA blots, many restriction fragments are resolved. The complex and highly polymorphic pattern of DNA fragments constitutes a “DNA fingerprint,” which has proved useful in paternity, immigration, and forensic cases (see Chapters 22 and 23, this volume).

Each fragment corresponds to an individual “minisatellite” sequence that is dispersed throughout the genome, the “core” sequence being repeated tandemly within each “minisatellite.” Several types of minisatellites have been identified, which show sequence similarity in their core sequences. These may be crossover “hotspots” analogous to the Chi sequence that initiates recombination in phage lambda. The mechanism leading to such frequent variation in the number of tandem repeats of the core sequence is unknown. The variation is unlikely to be generated by unequal exchange during recombination, but may be generated by slippage during DNA replication (22). The single-copy sequences flanking several minisatellites have been localized on several chromosomes by in situ hybridization, and they are clustered at the telomeres.

Jeffreys’ DNA “fingerprints” are potentially useful in linkage studies, since multiple loci may be analyzed simultaneously; typically 30 or more loci may be resolved on a single blot. Uitterlinden et al. have increased the data yield from a single blot by a factor of 10 by resolving fragments in two dimensions using denaturing gel electrophoresis (23). Both systems share an analytical limitation, since fragments corresponding to both alleles at a locus are not usually identified, and alleles detected by the same locus in different families cannot usually be matched (a direct result of the high degree of polymorphism). This unfortunately results in data being “private” to each individual family, so data may not easily be pooled across unrelated families.

These problems have been overcome, since individual hypervariable probes may be cloned by probing a genomic library with a core sequence. The core sequence plus the unique flanking sequence may then be used as a “single-copy” probe, detecting alleles at a single locus. These variable-number-of-tandem-repeat (VNTR) probes are as technically straightforward to use as a conventional RFLP, since the banding pattern is simple, consisting of one or two fragments per individual. VNTR probes frequently reveal a high degree of heterozygosity (commonly >80%), and may be physically localized by standard methods. Data may also be pooled between unrelated families, since all alleles detected by a single probe map to the same locus.

The variation found with “minisatellite” DNA has prompted a search for variation within other repetitive DNA families. The simple sequence (CA)n is very widely dispersed in mammalian genomes, and shows variation between individuals in the number of CA repeats (i.e., alleles are found with [CA]n, [CA]n+1, [CA]n+2, and so on, ref. 24). These polymorphisms are typically analyzed by using the polymerase chain reaction (PCR) to amplify a short (about 250 bp) sequence encompassing the (CA)n repeat and separating the allelic fragments on a denaturing polyacrylamide gel. The fragments can be detected by autoradiography if radioactively labeled primers are used in the PCR. Several other families of polymorphic simple repetitive sequences have been reported (e.g., [TTA]n, ref. 25, and Alu variable poly [A], ref. 26), which are widely dispersed throughout the genome (including the X chromosome), show a high degree of heterozygosity, and are proving to be very useful for linkage studies.

29.4 Linkage Disequilibrium

Alleles detected by probes that map genetically and physically close to each other are occasionally associated with each other as a direct consequence of their tight genetic linkage. This is detected by counting haplotype frequencies and comparing these observed frequencies with the expected frequencies, which are the product of the individual allele frequencies. For example, consider two loci, with alleles A and a and B and b. If the frequency of both A and B is 0.5, then the expected haplotype frequency for AB chromosomes is 0.5×0.5=0.25. Observation that the haplotype frequency for AB is significantly different from 0.25 would indicate that there is allelic association or disequilibrium between alleles A and B.
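The comparison of observed and expected haplotype frequencies can be sketched as follows. The allele frequencies of 0.5 come from the example above; the observed AB frequency of 0.40 is a hypothetical value chosen for illustration, as are the function and variable names. The sketch reports only the size of the departure and makes no attempt to test its statistical significance.

```python
def haplotype_disequilibrium(freq_a, freq_b, observed_ab):
    """Return (expected AB frequency, D), where D = observed - expected.

    The expected frequency is the product of the individual allele
    frequencies, i.e. the value predicted when A and B are not associated.
    """
    expected_ab = freq_a * freq_b
    return expected_ab, observed_ab - expected_ab


# Allele frequencies from the text; the observed frequency is hypothetical.
expected_ab, d_ab = haplotype_disequilibrium(0.5, 0.5, observed_ab=0.40)
# expected_ab is 0.25; d_ab is about +0.15, an excess of AB chromosomes.
```

A value of D near zero is what would be expected at equilibrium; in practice the departure would be assessed against sampling error in the counted haplotypes.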

Individual alleles of an RFLP arise infrequently by spontaneous mutation, so alleles at two tightly linked loci will remain “coupled” unless recombination between the probes generates new combinations of alleles (haplotypes) on a chromosome. It should be remembered that several other genetic factors, such as admixture and selective pressure, may act at a population level to create and sustain the level of disequilibrium.

This population genetic phenomenon of linkage disequilibrium is usually found only for polymorphisms separated by no more than a few tens of kilobases and has been used to advantage by geneticists in some types of study. Recombination is the principal mechanism that generates new haplotypes that mark the decay of disequilibrium, and, in general, the stronger the disequilibrium, the smaller the recombination fraction and associated genetic distance between the markers.

Following this argument, attempts have been made to use the degree of disequilibrium as a “metric” and deduce the relative order of tightly linked markers and mutations leading to inherited diseases (27–29). The varied degrees of success of these attempts suggest that other genetic factors (admixture and selection), as well as random drift, confound high-resolution genetic mapping with linkage disequilibria.

Linkage disequilibrium between RFLPs and inherited disease has been put to clinical use in modifying risks of individuals being carriers of the cystic fibrosis mutation. The disequilibrium additionally provides “phase” information, which is useful when calculating risks for prenatal diagnosis. Disequilibrium may be a nuisance, however, when it limits the gain in informative capability from typing multiple polymorphisms in a small physical (and genetic) region.

29.5 Construction of the Human Genetic Map

Solomon and Bodmer and Botstein et al. were among the first to suggest that DNA polymorphisms would be sufficiently common to be used both as informative markers dispersed throughout the human genome and to construct a genome-wide linkage map in humans (30,31). It is estimated that 330 RFLPs spaced evenly at 10-centimorgan (cM) intervals would span the human genome.

29.5.1 Progress

In 1973, the genetic map of the human genome compiled at the first international Human Gene Mapping Workshop (HGMW) in Yale comprised 27 Mendelian markers and 55 in vitro markers, with hemoglobin and MNSs incorrectly assigned to chromosome 2 (32).

Recombinant DNA methods have fueled the explosion in human gene mapping in two major ways. Genes that have been cloned may be directly mapped by hybridization to a panel of somatic-cell hybrids or by in situ hybridization. Alternatively, genes are indirectly mapped by genetic linkage to RFLPs. The HGMW reconvened in Yale in June 1989 and reported on a total of 1631 mapped genes, 113 fragile sites, and 3300 DNA segments (33). The ongoing efforts to sequence systematically the entire human genome will build on this framework of genetically mapped genes until a unified gene map for humans is completed.

29.5.2 Resources

Three internationally available resources have played a central role in the overall synthesis of the currently detailed human genetic map.

The Human Gene Mapping Library (HGML, Director Ken K. Kidd, Yale, USA), in close collaboration with the DNA committee of the Human Gene Mapping Workshop (HGMW, President Bob Sparkes), has maintained a catalog of DNA probes, their chromosomal assignments and regional localizations, and any RFLPs identified by these probes. Currently, there are approximately 2000 polymorphic DNA markers, and HGML maintains an internationally accepted system for numbering DNA probes (so-called D numbers).

At the HGMWs, which are held in alternate years, committees responsible for one or two chromosomes edit data submitted by investigators and attempt to derive an overall consensus map integrating diseases and probes. The committeepersons often have to arbitrate between diverse sources and quality of data. The reports are published and provide a key reference that (hopefully) summarizes the state-of-the-art map. The HGML has provided the additional resource of a continuously updated computer data base that may be interrogated interactively over academic networks. The HGMW data has formed the core of the HGML data base, but much additional detail is added, including laboratory details of probes, their availability, addresses of investigators, and literature references.

At HGM10.5 (held in Oxford, September 1990), a new genome data base (GDB) was launched. This data base, developed by the Welch Medical Library (Johns Hopkins University, Baltimore, MD), will be constantly edited by the committee chairpersons at HGMW and is intended to be accessed by geneticists throughout the academic world (see Appendix to this volume).

The Centre d’Etude du Polymorphisme Humain (CEPH, director Jean Dausset, Paris, France) provides another key resource to help map the human genome. CEPH has collected a mapping panel of 40 human nuclear families (usually with grandparents) with at least nine children. DNAs are distributed to members of a “collaborative group” that have expressed an interest in gene mapping. The investigators agree to type completely the families with RFLPs that they become interested in mapping. It is expected that data be returned to CEPH headquarters for pooling, so that consensus (or “consortium”) maps may be deduced; these maps should be more detailed and accurate than those constructed with data from a single group. The RFLP data base is checked for errors (as far as is possible) and distributed to all collaborating groups.

Most groups declare an interest in mapping particular regions of the genome, a concentration that invariably results from the location of a particular inherited disease. For example, the localization of cystic fibrosis to chromosome 7q was the stimulus that has resulted in a highly detailed map being generated for the whole chromosome (34). However, two groups have contributed in a general way to mapping the entire genome, principally with anonymous DNA markers. This has resulted in the publication of “primary” human genome linkage maps. Donis-Keller et al. (35) have reported a 403-locus map with linkage groups on all chromosomes, and White et al. (36) distributed a booklet containing details of 255 loci on 17 chromosomes.

Since these pioneering maps, much detail has been added, and published maps at 5- to 10-cM resolution are available for many genomic regions. Most of these probes are freely available for general mapping purposes, and Collaborative Research Inc. (Bedford, MA, USA) markets the probes that comprise the Donis-Keller genome map (see Appendix to this volume).

29.6 Strategies in Searching for Linkage

29.6.1 Candidate Gene Approach

For some traits there may be a clue as to the location of the gene under investigation; linkage may then be sought with markers that map to this region, or with candidate genes themselves if they have been cloned. For example, a patient was reported with a partial trisomy of chromosome 5q and schizophrenia, and Sherrington et al. (37) reported the linkage of DNA markers that map to chromosome 5q to a putative autosomal schizophrenia locus.

Clues may come from hypotheses generated from comparison of genetic maps across species. For example, porcine stress syndrome (PSS) and malignant hyperthermia (MHS) in humans have many phenotypic similarities, and both are inherited as simple Mendelian traits. PSS was found to be tightly linked to glucose phosphoisomerase (GPI), and GPI maps to chromosome 19q in humans. A recent linkage study has shown that MHS and markers that map in the GPI region of humans are linked (GPI itself was uninformative), confirming the claim that these two diseases are caused by mutations within homologous genes (38).

29.6.2 Genome-Wide Searches for Linkage

For many traits there will be no clues as to which region of the genome to screen first. There have been two broad approaches to the search, each of which has advantages and disadvantages.

29.6.2.1 Systematic / Sequential Searches

This chromosome-by-chromosome approach has succeeded in several instances; the availability of a preexisting map of markers at 10- to 20-cM intervals enables efficient searching with multipoint linkage analysis. The RFLP map is constructed with a number of “intervals” (each spanning 10–20 cM), so that the disease locus will be flanked by a pair of RFLPs wherever it happens to map. The exclusion component of “interval” mapping is particularly efficient, since intervals that do not contain the disease locus will generate apparent double-recombination events, which are unlikely. Only one or two meioses consistent with double-recombination events are necessary to exclude a 10-cM interval.
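The weight carried by a single apparent double recombinant can be made concrete with a rough calculation. The sketch below is an illustration under stated assumptions, not the method used by any particular linkage program: the meiosis is taken to be fully informative and phase-known, crossovers are assumed to occur without interference, and 1 cM is treated as a recombination fraction of 0.01 over these short distances. The function name is hypothetical.

```python
import math


def double_recombinant_lod(theta_left, theta_right):
    """Lod contribution of one meiosis in which the disease locus appears
    recombinant with both flanking markers, while the two markers are
    nonrecombinant with each other.

    theta_left and theta_right are the recombination fractions between the
    tested disease position and the left and right markers.
    """
    # Disease locus inside the interval: a crossover is needed on each side.
    p_inside = theta_left * theta_right
    # Recombination fraction between the two markers (no interference).
    theta_markers = theta_left + theta_right - 2 * theta_left * theta_right
    # Disease locus unlinked: the markers stay together, and the disease
    # allele accompanies either marker haplotype with probability 1/2.
    p_unlinked = (1 - theta_markers) * 0.5
    return math.log10(p_inside / p_unlinked)


# Disease locus tested at the midpoint of a 10-cM interval.
midpoint_lod = double_recombinant_lod(0.05, 0.05)   # about -2.26

# Tested nearer one marker, the same observation is even less likely.
off_center_lod = double_recombinant_lod(0.01, 0.09)
```

Under these assumptions the midpoint is the most favorable position within the interval, so a single such meiosis already falls below a lod score of −2.0 everywhere in the interval. Real data are rarely so clean, which is consistent with the statement that one or two such meioses are needed.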

Problems may arise in regions where markers are sparse or only moderately informative, since data insufficient to exclude or include linkage to an interval will be collected. It is also difficult to ensure that intervals extend to the telomere, although the recent cloning and characterization of human telomeres may soon resolve this.

In general, investigators will choose markers that individually show the highest degree of informativity, but two-allele RFLPs are still useful, since many have been accurately mapped or can be combined to generate informative haplotypes.

29.6.2.2 “Shotgun” Method

Another strategy involves picking at random DNA probes that individually reveal a high degree of informativity and testing for linkage. This pairwise approach will generate substantial regions of exclusion around each marker, but it is difficult to monitor overall progress if the markers are not themselves mapped. Huntington’s disease and adult polycystic kidney disease were mapped by this method (9,11).

An elegant variant of the “shotgun” approach is to use a “minisatellite” probe to “fingerprint” the family and to test simultaneously multiple marker loci for linkage. Jeffreys and coworkers have succeeded in linking hereditary persistence of fetal hemoglobin (HPFH) to a single minisatellite locus; up to 34 loci dispersed throughout the genome could be tracked in one experiment (39). This method is suitable for analyzing only large families with enough informative meioses to prove linkage in isolation, since data cannot be easily pooled.

Few investigators would plan or admit to a purely random search for linkage; rather, they would probably opt to test those highly informative markers that became available, provided they were mapped and dispersed throughout the genome. This work would probably continue in parallel with more systematic searches.

29.6.3 An Example: Friedreich’s Ataxia

The search to localize the gene for Friedreich’s ataxia (FRDA) illustrates the alternative strategies and their interplay during the laborious search for linkage. FRDA is a rare autosomal recessive disorder (incidence of 1 in 50,000 in the United Kingdom) resulting in progressive spinocerebellar degeneration during the second decade. Despite much research, there were few clues to the underlying biochemical defect, no method for presymptomatic diagnosis, and no specific treatment. A “reverse genetic” project to localize, clone, and analyze the gene mutation in FRDA was therefore initiated in 1985 at St. Mary’s Hospital Medical School in London.

A total of 20 multiply affected families were ascertained, principally through consultations at neurology clinics, but also through a patient data base held by a charitable organization, the Friedreich’s Ataxia Group UK. Initially, sibships with at least three affected members were collected. For a recessive disease, meioses from one sibling are “consumed” to establish phase, so a maximum of four informative meioses may be derived from a “3-affected” family at a cost of DNA-typing five individuals (provided both parents are informative). For a “2-affected” family, a maximum of two informative meioses may be deduced after typing four individuals. The efficiency of data collection, as judged by the number of informative meioses deduced for each individual being DNA-typed, is 0.8 for a “3-affected” family and 0.5 for a “2-affected” family. Obviously, larger sibships would yield data more efficiently, but they are rare.
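The efficiency figures quoted above follow from simple counting, which the sketch below reproduces. It assumes that only the two parents and the affected siblings are typed, that both parents are informative, and that each affected sibling other than the one used to set phase contributes two informative meioses (one from each parent). The general formula and the function name are inferred from the two cases given in the text and should be read as an illustration.

```python
def meioses_per_individual_typed(n_affected):
    """Informative meioses per individual typed for a recessive disease.

    Assumes both parents are informative and that only the parents and the
    affected siblings are DNA-typed.
    """
    if n_affected < 2:
        raise ValueError("at least two affected siblings are required")
    informative_meioses = 2 * (n_affected - 1)  # one sibling sets phase
    individuals_typed = n_affected + 2          # affected siblings plus parents
    return informative_meioses / individuals_typed


efficiency_three = meioses_per_individual_typed(3)  # 4 meioses / 5 typed = 0.8
efficiency_two = meioses_per_individual_typed(2)    # 2 meioses / 4 typed = 0.5
```

On the same assumptions a sibship with four affected members would give six informative meioses for six individuals typed, which illustrates why larger sibships are the more efficient source of data.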

Candidate gene: A portion (20%) of FRDA patients develop clinical diabetes mellitus, and pharmacokinetic studies have shown the insulin receptor (INSR) to be present in normal densities, but with a much-reduced binding affinity for insulin. INSR had been previously cloned and mapped to chromosome 19p. Linkage studies with INSR polymorphisms detected obligate recombination events in several FRDA families (40).

Systematic search: The remainder of chromosome 19 was then systematically excluded from being linked to FRDA (40). This region was chosen to commence the structured exclusion study, since premapped probes were readily available for much of the rest of chromosome 19. These probes detected two allele RFLPs, but were sufficiently informative to exclude the majority of chromosome 19. Other chromosomal regions that were reported to be covered with a number of appropriately (10–20 cM) spaced markers were also examined in turn.

“Shotgun” search: While the systematic searches continued, several highly informative markers (e.g., HLA) were tested for linkage as they became available. A panel of polymorphic protein and red-cell antigen markers were also tested in the families by researchers who had semiautomated assays established and could analyze the FRDA samples at a relatively low cost. Most of these markers were only moderately informative; one notable exception was the MNS blood group system. A few VNTR probes that were highly informative were also analyzed in the families.

Markers covering 80% of the genome (117 markers) were excluded from linkage before a large positive lod score (see Section 7, especially 7.2.3) was finally revealed with a probe mapping to chromosome 9 in 1988 (10). There were several instances in which markers showed maximal lod scores of >2.0, and one instance when a lod score nearly reached 3.0, which is broadly consistent with the theoretical false positive rate of 5%, which corresponds to a lod score threshold of +3.0 (see Section 7.2.3).

Before the search for linkage was successfully concluded, some neurologists claimed that the clinical (phenotypic) heterogeneity was likely to be reflected in genetic heterogeneity. This could be either intragenic heterogeneity (a number of different mutations within the same gene) or intergenic heterogeneity (mutations in a number of genes that map to different chromosomal regions). To date all FRDA families have proved to be linked to chromosome 9 markers, which argues against intergenic heterogeneity. The tight and homogeneous linkage has been used to clinical advantage in first-trimester prenatal diagnosis of this condition (41, see also Chapter 30, this volume).

29.7 Statistical Considerations

The cardinal principles of good practice for experimental design apply equally to linkage analysis and to any other type of study that will undergo statistical analysis:

  1. Hypotheses should be declared at the outset of the study.

  2. Appropriate statistical methods and significance levels should be chosen.

  3. The sample size should be adequate to ensure that the study has sufficient power to achieve its objectives.

29.7.1 Hypotheses

In linkage analyses, the null hypothesis (H0) states that alleles at the disease locus and the RFLP under examination segregate independently; in other words, the recombination fraction between the two loci is 50%. The alternative hypothesis (H1) might state that the recombination fraction between the two loci is <50% (e.g., 10%). In practice, most investigators do not wish to be confined by anticipating the recombination fraction, and multiple H1s corresponding to various recombination fractions are implicitly assumed. The H1 that fits best is chosen and the rest forgotten.

29.7.2 Analysis and Thresholds

Likelihood (lod score) methods have proved efficient and reliable for analyzing human genetic data. Methods developed to analyze the offspring of “ideal” matings in experimental organisms are generally of little practical use in analyzing human data; the phase of alleles at multiple loci is rarely known, and data are often missing for key family members.

29.7.2.1 Likelihood Calculations

The likelihoods of pedigrees with arbitrary structures, including multiple marriages and consanguineous loops, segregating with markers may be calculated with the aid of computer programs. Analyses by hand or with the aid of tables of lod scores are really of use only in the simplest of cases. Programs that have had widespread application in linkage analysis include Liped (42), Linkage (43), and Mapmaker (44). These programs permit a flexible specification of the underlying mathematical model for the segregation of loci through the families.

The mode of inheritance is defined, both for discontinuous (simple Mendelian) and continuous (quantitative) traits. Loci may be autosomal, sex-linked, or pseudoautosomal. Multiple alleles detected at the same locus may be specified together with their associated frequencies. Penetrance, the conditional probability that an individual with a known genotype expresses a phenotype, may be defined, and multiple penetrance classes are used to correct for “age of onset.” Phenocopies, individuals with normal genotypes that appear to be affected by nongenetic causes, may also be allowed for. Haplotype frequencies may be incorporated when markers show linkage disequilibrium. Spontaneous mutation rates may also be specified. The Linkage program additionally has an option that calculates genetic risks.

29.7.2.2 Sex Differences in Recombination

There is extensive evidence that the recombination fraction between a pair of linked loci varies with the sex of the parent. For example, a review of linkage data for chromosome 1 loci revealed an overall 2/1 female/male ratio in recombination fractions. This is consistent with data from other species (e.g., mouse and Drosophila), in which a relative excess of recombination is found in females (homogametic sex) over males (heterogametic), which defines Haldane’s law. There are clear exceptions to this rule, with males showing more recombination than females. The ratio may also vary from chromosome to chromosome and between different regions on the same chromosome. Currently available computer linkage-analysis packages allow full specification of male and female recombination rates.

29.7.2.3 Statistical Inference

Likelihoods are calculated at several recombination fractions and compared with the “null” likelihood, calculated with the recombination fraction set at 50%. The lod score represents a likelihood-ratio test and is expressed as the log10 likelihood difference, i.e., the log10 likelihood at the “test” recombination fraction minus the “null” log10 likelihood.

By convention, lod scores are calculated and reported at several recombination fractions, namely 0.00, 0.01, 0.05, 0.10, 0.15, 0.20, 0.30, and 0.40. The maximal lod score (Z) and the corresponding maximal likelihood estimate of the recombination fraction (θ) are also recorded.
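The calculation can be illustrated for the simplest possible case. The sketch below assumes phase-known, fully informative meioses, so that the likelihood depends only on the counts of recombinants and nonrecombinants; real pedigrees need a full likelihood program. The function and variable names are illustrative and do not come from any linkage package.

```python
import math

# Recombination fractions at which lod scores are conventionally reported.
STANDARD_THETAS = (0.00, 0.01, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40)


def lod_score(theta, recombinants, meioses):
    """Two-point lod score for phase-known, fully informative meioses.

    Returns log10 L(theta) - log10 L(0.5), where
    L(theta) = theta**recombinants * (1 - theta)**nonrecombinants.
    """
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    nonrecombinants = meioses - recombinants
    if theta == 0.0 and recombinants > 0:
        # A single recombinant is impossible at zero recombination.
        return float("-inf")
    log_test = nonrecombinants * math.log10(1.0 - theta)
    if recombinants > 0:
        log_test += recombinants * math.log10(theta)
    log_null = meioses * math.log10(0.5)
    return log_test - log_null


def lod_table(recombinants, meioses):
    """Lod scores at the conventional reporting grid."""
    return {t: lod_score(t, recombinants, meioses) for t in STANDARD_THETAS}


if __name__ == "__main__":
    # Hypothetical data: 2 recombinants observed in 20 meioses.
    for theta, z in lod_table(2, 20).items():
        print(f"theta = {theta:.2f}  lod = {z:.2f}")
```

Because the lod score is a difference of log likelihoods, tables produced in this way for independent families can be summed at each recombination fraction.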

A lod score of +3.0 expresses odds of 1000/1 supporting linkage, and is the threshold value generally accepted as adequate evidence to prove linkage between loci (45). The “raw” odds ratio of 1000/1 corresponds to a final (posterior) probability of 95% that the two loci are truly linked. This calculation takes into consideration the modest prior chance (which is conventionally taken as 1 in 50) that any two loci chosen at random will be linked. A threshold of −2.0 is conventionally chosen as sufficient evidence to exclude linkage between loci. This represents a highly stringent exclusion threshold with a false negative rate of 0.02% (remember that the prior chance of linkage is a low 2%).
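The step from “raw” odds to posterior probability is an application of Bayes' theorem. The following sketch assumes only the prior of 1 in 50 quoted above; the function name is illustrative.

```python
def posterior_probability_of_linkage(lod, prior=1 / 50):
    """Posterior probability of linkage given a lod score and a prior.

    The likelihood ratio 10**lod is weighted by the prior chance of
    linkage and compared with the prior chance of no linkage.
    """
    likelihood_ratio = 10.0 ** lod
    linked = prior * likelihood_ratio
    unlinked = 1.0 - prior
    return linked / (linked + unlinked)


if __name__ == "__main__":
    # Acceptance threshold: probability that the loci are truly linked.
    print(f"lod +3.0: {posterior_probability_of_linkage(3.0):.4f}")
    # Exclusion threshold: residual probability that linkage was missed.
    print(f"lod -2.0: {posterior_probability_of_linkage(-2.0):.6f}")
```

With the default prior, the first value is close to 95% and the second close to 0.02%, in line with the figures quoted in the text.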

It may seem surprising that the accepted exclusion threshold is so much more stringent than the threshold for accepting linkage (a false negative rate of 0.02% as against a false positive rate of 5%). It should be remembered that positive linkages will almost certainly be followed up, by collecting data from additional families and by adding in data for new polymorphisms. By contrast, excluded regions are discarded, and the investigator will continue the search for linkage elsewhere. The risk of missing linkage and scanning the rest of the genome unnecessarily strongly supports the choosing of a highly stringent exclusion threshold.

It is interesting to review reports of linkage in the literature to see how many substantial positive lod scores turn out to be false positives. One such report just failed to link the cystic fibrosis locus to a DNA marker on chromosome 21 in an extended Amish kindred group (maximal lod score = 2.48). The same family showed overwhelming evidence of linkage to chromosome 7q markers when they were tested later (46). Another recent example involved a report of linkage between chromosome 11 markers (Harvey ras and insulin) and manic depression, with an original pairwise lod score of 4.08 (47). Reanalysis with new data, namely inclusion of new individuals and two changes in clinical status, markedly reduced the lod score (48). Analysis of an additional branch of the family led to a final exclusion of this region of chromosome 11.

This revelation has prompted the editorial staff of Nature to speculate whether linkages between loci should be published only if lod scores are >6.0 (49). The author would personally favor a threshold of 3.7 (which represents a 1% false positive rate) to be adopted for analysis of simple Mendelian traits. More stringent thresholds are necessary when analyzing traits that present diagnostic difficulties and uncertainties about penetrance or age of onset. In all studies, investigators should try to collect and analyze data from as many families and polymorphisms as possible in an attempt to publish scores that exceed the threshold comfortably, rather than “give up” when the score just exceeds 3.0.

29.7.2.4 Multiple Testing

During the search for linkage, a substantial amount of exclusion data will be collected (unless the investigator is extremely lucky). In statistical terms, this represents multiple tests for linkage; each test (i.e., Is the maximal lod score for this pair of loci greater than +3.0?) is associated with a false positive rate of 5%. Thus, after 20 independent tests for linkage, a false positive result is to be expected! This problem of correcting a “primary” significance level to compensate for multiple tests arises in many statistical fields and has been addressed by Ott in relation to linkage (50).
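The arithmetic behind this warning is brief. The sketch below takes the 5% per-test false positive rate quoted above and assumes, as a simplification, that the tests are fully independent; the function names are illustrative.

```python
def expected_false_positives(per_test_rate, n_tests):
    """Expected number of false positives among independent tests."""
    return per_test_rate * n_tests


def chance_of_any_false_positive(per_test_rate, n_tests):
    """Probability that at least one independent test is a false positive."""
    return 1.0 - (1.0 - per_test_rate) ** n_tests


if __name__ == "__main__":
    rate, tests = 0.05, 20
    print("expected false positives:", expected_false_positives(rate, tests))
    print("chance of at least one:",
          round(chance_of_any_false_positive(rate, tests), 3))
```

Twenty tests give an expected count of one false positive, and a chance of roughly 64% that at least one occurs.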

However, two other factors may be considered that compensate (at least partially) for the reduced significance level associated with repeated testing. First, as regions of the genome are excluded, the remaining genome to be scanned is shrinking, and the prior probability of linkage correspondingly increases. For example, if 50% of the genome is excluded, then the prior probability of two loci being linked is 1/25; a lod score of 3.0 (1000/1 odds supporting linkage) is therefore associated with a false positive rate of 25/1000=2.5%.
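The figure of 2.5% is a convenient approximation. Writing the prior probability of linkage as π (a symbol introduced here for illustration), the exact posterior probability that the loci are unlinked after a lod score Z = 3.0 with π = 1/25 works out slightly lower:

```latex
\[
P(\text{unlinked} \mid Z = 3)
  = \frac{1 - \pi}{\pi \cdot 10^{Z} + (1 - \pi)}
  = \frac{24/25}{(1/25) \cdot 1000 + 24/25}
  = \frac{0.96}{40.96}
  \approx 2.3\%
\]
```

The quoted value of 25/1000 corresponds to the simpler ratio 1/(π · 10^Z), which is adequate whenever the likelihood ratio is large.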

Second, tests for linkage with multiple loci on the same chromosome are statistically interdependent. One test may therefore encompass multiple markers vs the disease locus, so the total number of statistical tests is considerably smaller than the number of markers.

29.7.2.5 Estimation of Recombination

This is conventionally taken as the maximal likelihood estimate (MLE) of the recombination fraction, i.e., the recombination fraction that yields the largest lod score. This may be approximated either by quadratic interpolation or numerically, using an iterative algorithm. The latter method is implemented in the ILINK program from the Linkage package by the Gemini routine (see Chapter 31 for a discussion of linkage software).
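The first of these approaches, quadratic interpolation, can be sketched as follows. A parabola is fitted through the tabulated lod score that is largest and its two neighbors, and the recombination fraction at the vertex is taken as the estimate. This is an illustration only and is not the iterative routine used by ILINK; the function names are hypothetical.

```python
def quadratic_peak(x0, y0, x1, y1, x2, y2):
    """x-coordinate of the vertex of the parabola through three points."""
    numerator = (x1 - x0) ** 2 * (y1 - y2) - (x1 - x2) ** 2 * (y1 - y0)
    denominator = (x1 - x0) * (y1 - y2) - (x1 - x2) * (y1 - y0)
    if denominator == 0:
        # The three points are collinear, so there is no vertex.
        return x1
    return x1 - 0.5 * numerator / denominator


def interpolate_mle(thetas, lods):
    """Approximate the MLE of theta from a table of lod scores."""
    best = max(range(len(lods)), key=lods.__getitem__)
    if best == 0 or best == len(lods) - 1:
        # Maximum at the edge of the table: no neighbor on one side.
        return thetas[best]
    return quadratic_peak(
        thetas[best - 1], lods[best - 1],
        thetas[best], lods[best],
        thetas[best + 1], lods[best + 1],
    )


if __name__ == "__main__":
    thetas = [0.00, 0.01, 0.05, 0.10, 0.15, 0.20, 0.30, 0.40]
    # Hypothetical lod scores, invented for illustration.
    lods = [-2.10, 1.35, 3.02, 3.61, 3.55, 3.20, 2.10, 0.90]
    print(f"interpolated theta: {interpolate_mle(thetas, lods):.3f}")
```

The interpolated value is only as good as the parabolic approximation near the peak, so it is most useful when the tabulated points closely bracket the maximum.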

29.7.2.6 Confidence Limits

It is useful to express the confidence that investigators should associate with the MLE of a recombination fraction, since estimates may depart from true values with sampling error. This can be done following large-sample theory, but the applicability to typical human data is unclear.

An empiric but simple method that claims to provide a confidence limit of approximately 95% is demonstrated in Fig. 1, which shows an illustrative lod score graph for two loci. The MLE of the recombination fraction is 15% with a lod score of 14.15. A line is drawn one lod unit below the maximal score (13.15), lines are dropped perpendicularly from the two points at which this line cuts the likelihood curve, and two recombination fractions are read (8.5 and 21.5%). This “lod -1.0 support” method follows a convention proposed by the HGMW in Helsinki (51) and is gaining acceptance by the scientific community, through frequent application. The original recommendation was that these limits approximated a 95% confidence limit for “large” samples. The author interprets this as applying to tables generated with more than 30 informative meioses.

Fig. 1. Lod-score graph illustrating the “lod -1.0 support” method for deducing confidence limits for recombination fractions.
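The graphical procedure can also be carried out numerically. The sketch below assumes phase-known, fully informative meioses, so that the lod curve has a closed form, and locates by bisection the two recombination fractions at which the curve falls one unit below its maximum. The counts used are hypothetical and are not the data behind Fig. 1.

```python
import math


def lod(theta, recombinants, meioses):
    """Lod score for phase-known meioses; requires 0 < theta <= 0.5."""
    nonrecombinants = meioses - recombinants
    log_test = (recombinants * math.log10(theta)
                + nonrecombinants * math.log10(1.0 - theta))
    return log_test - meioses * math.log10(0.5)


def _bisect(f, lo, hi, tol=1e-10):
    """Root of f between lo and hi, where f changes sign exactly once."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid < 0) == (f_lo < 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def support_interval(recombinants, meioses, drop=1.0):
    """Recombination fractions where the lod falls `drop` below its maximum."""
    theta_hat = recombinants / meioses
    if not 0.0 < theta_hat < 0.5:
        raise ValueError("sketch assumes 0 < recombinants/meioses < 0.5")
    target = lod(theta_hat, recombinants, meioses) - drop

    def excess(theta):
        return lod(theta, recombinants, meioses) - target

    lower = _bisect(excess, 1e-12, theta_hat)
    # If the curve never falls far enough, the upper limit is 0.5.
    upper = 0.5 if excess(0.5) >= 0 else _bisect(excess, theta_hat, 0.5)
    return lower, upper


if __name__ == "__main__":
    low, high = support_interval(15, 100)
    print(f"MLE 0.150, support interval {low:.3f} to {high:.3f}")
```

The interval is generally asymmetric about the estimate, because the lod curve falls away more steeply toward zero recombination than toward 50%.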

29.7.3 Power of the Study

Studies should be designed so that they have a very good chance of detecting linkage when loci are truly linked. A lod score of 3.0 will theoretically be found between 5% of pairs of unlinked loci. If a lod score of 3.0 is found for a study that has only a 5% chance of reaching a significant score, then the chances of a true positive and a false positive are equal. It is prudent to attempt only linkage studies that have at least a 95% chance of detecting linkage (lod score of at least 3.0) when the loci are truly linked. This may present problems for investigating rare diseases for which only a few families are known.

Each phase-known meiosis contributes a lod score of 0.301 (log10 2) when the loci cosegregate; hence, 10 phase-known meioses are the minimum necessary to attain a lod score >3.0, assuming fully informative markers. For many studies, family structure and mode of inheritance preclude direct deduction of phase. Reduced penetrance, correction for age of onset, and missing data further confound attempts to deduce the effective number of informative meioses (ENIM) in the family.

The ENIM may be estimated quickly and simply by the investigator before any family members are sampled or typed with markers. The pedigree structure is drawn, typings are “invented” for an imaginary, totally informative, highly polymorphic marker that cosegregates infallibly with the disease, and only those members that are likely to be available for sampling are “typed.” These data may then be entered into a conventional computer linkage-analysis package for calculation of lod scores, and allowance may be made as appropriate for reduced penetrance, age of onset, phenocopies, and the like. The maximal lod score should be found at zero recombination. This lod is divided by 0.301 to yield the ENIM.
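The final conversion is a single division, shown here together with the minimum number of phase-known meioses needed to reach a chosen threshold. The lod score itself must come from a linkage program run on the simulated typings; the function names are illustrative.

```python
import math

# Lod contributed by one phase-known, cosegregating meiosis (log10 2).
LOD_PER_MEIOSIS = math.log10(2)


def enim(max_lod_at_zero_recombination):
    """Effective number of informative meioses for a simulated pedigree."""
    return max_lod_at_zero_recombination / LOD_PER_MEIOSIS


def min_phase_known_meioses(threshold=3.0):
    """Fewest phase-known meioses whose combined lod reaches the threshold."""
    return math.ceil(threshold / LOD_PER_MEIOSIS)


if __name__ == "__main__":
    print("meioses needed for lod 3.0:", min_phase_known_meioses(3.0))
    # Hypothetical maximal lod score from a simulated pedigree.
    print("ENIM for a lod of 4.50:", round(enim(4.50), 2))
```

Comparing the ENIM of each available family with the number of meioses required gives a quick indication of how many families must be pooled before a search is worthwhile.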

An example of the utility of calculating the ENIM is shown with reference to Fig. 2. Here a pedigree with dominant spinocerebellar ataxia is shown. In this condition, heterozygotes develop symptoms as they grow older, so an age-of-onset correction is necessary. Heterozygotes in each of the four generations have a 100, 90, 75, and 50% chance, respectively, of expressing the “affected” phenotype. The “simulated” genotypings of a highly informative four-allele RFLP are also shown. The maximal lod score (at zero recombination) is 2.06, and the ENIM is therefore 2.06/0.301=6.84. Obviously, data from other families would have to be collected before it would be worthwhile initiating a genome-wide search for linkage. Formal power calculations may be made analytically, but are practical only for simple pedigrees (50). Boehnke has written a computer program for estimating the power of families to detect linkage by repeatedly simulating the family and possible genotypings (52).

Fig. 2. Spinocerebellar ataxia (SCA) pedigree segregating with a highly informative marker. The dominant SCA gene segregates with the marker allele 1.
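For the simplest case, power can be obtained by exact enumeration rather than simulation. The sketch below assumes a set of phase-known, fully informative meioses with complete penetrance, and sums the binomial probabilities of every outcome whose maximal lod score reaches the threshold. It is not Boehnke's program, which handles general pedigrees by simulation; the function names are illustrative.

```python
import math


def max_lod(recombinants, meioses):
    """Maximal lod score over theta in [0, 0.5] for phase-known meioses."""
    theta_hat = min(recombinants / meioses, 0.5)
    nonrecombinants = meioses - recombinants
    log_test = 0.0
    if recombinants > 0:
        log_test += recombinants * math.log10(theta_hat)
    if nonrecombinants > 0:
        log_test += nonrecombinants * math.log10(1.0 - theta_hat)
    return log_test - meioses * math.log10(0.5)


def power(meioses, true_theta, threshold=3.0):
    """Probability that the maximal lod reaches the threshold."""
    total = 0.0
    for k in range(meioses + 1):
        if max_lod(k, meioses) >= threshold:
            # Binomial probability of observing k recombinants.
            total += (math.comb(meioses, k)
                      * true_theta ** k
                      * (1.0 - true_theta) ** (meioses - k))
    return total


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(f"{n} meioses, true theta 0.05: power = {power(n, 0.05):.3f}")
```

Under these idealized assumptions, the result is an upper bound on what a real family with the same number of meioses could achieve, since unknown phase and reduced penetrance both remove information.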

29.8 Heterogeneity

Mutations in different genes may result in very similar phenotypes, and linkage studies have the potential to reveal this genetic heterogeneity. For example, Morton discovered significant linkage heterogeneity between elliptocytosis and the rhesus blood group in 14 families (53).

Likelihood-ratio tests have been devised to test if multiple families are linked to a single locus, and lod scores can be added together. Ott distributes a set of computer programs (HOMOG) that implement these methods (50).
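The idea behind such tests can be sketched with an admixture model, in which a proportion α of families is assumed to be linked to the marker at the test recombination fraction and the remainder unlinked. The code below is an illustration under that assumption and is not the HOMOG implementation; the function names, the grid search over α, and the lod scores used are all hypothetical.

```python
import math


def pooled_lod(family_lods):
    """Lod score under homogeneity: the sum of the family lod scores."""
    return sum(family_lods)


def admixture_lod(family_lods, alpha):
    """Lod score when only a proportion alpha of families is linked.

    For each family the likelihood ratio 10**lod is mixed with the
    ratio of 1 that applies to an unlinked family.
    """
    return sum(math.log10(alpha * 10.0 ** z + (1.0 - alpha))
               for z in family_lods)


def best_alpha(family_lods, steps=100):
    """Grid search for the alpha that maximizes the admixture lod."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda a: admixture_lod(family_lods, a))


if __name__ == "__main__":
    # Hypothetical lod scores for three families at one test theta.
    lods = [3.0, 3.0, -4.0]
    alpha = best_alpha(lods)
    print("pooled lod:", pooled_lod(lods))
    print(f"best alpha {alpha:.2f}, "
          f"admixture lod {admixture_lod(lods, alpha):.2f}")
```

A large gain of the admixture lod over the pooled lod suggests that the families are not all linked to the same locus; a formal test would also maximize over the recombination fraction.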

29.9 Multipoint Linkage Analysis

When family data are available for three or more loci on a chromosome, then attempts may be made to deduce genetic order. For three loci A, B, and C in a line, three recombination fractions (θ_AB, θ_BC, and θ_AC) may be estimated. If these raw recombination fractions are transformed into genetic distances (d) using a mapping function, then d_AC = d_AB + d_BC. It is simple to deduce the genetic order, provided the estimates of the three recombination fractions are accurate and derived from independent samples of chromosomes. However, multipoint crosses can provide more information for deducing order than the pairwise recombination fractions.

29.9.1 Multiple Crossing Over

Geneticists working with three-point crosses in experimental organisms noticed that, as a consequence of multiple crossing over, recombination in adjacent intervals was not additive. For example, for the loci A-B-C, sequential cross-overs in intervals AB and BC will be counted in the estimation of θ_AB and θ_BC, but not in that of θ_AC. Double cross-overs in small intervals are uncommon, and the most probable order for a set of loci will show the fewest multiple cross-overs.

29.9.2 Interference

In crosses in experimental organisms, double cross-overs have been observed less frequently than expected if cross-overs occurred independently of each other. It seems that one cross-over inhibits a second cross-over in the immediate vicinity. This positive genetic interference has been observed in many organisms, including Drosophila and mice, and thus is anticipated to occur in humans.
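The effect of double cross-overs and of interference on a three-point cross can be written compactly. The sketch below uses the coefficient of coincidence, the ratio of observed to expected double cross-overs, a standard quantity that the text does not name. A value of 1 means no interference and 0 means complete interference. The function names are illustrative.

```python
def coincidence(observed_doubles, offspring, theta_ab, theta_bc):
    """Ratio of observed double cross-overs to the number expected
    if cross-overs in the two intervals occurred independently."""
    expected_doubles = offspring * theta_ab * theta_bc
    return observed_doubles / expected_doubles


def outer_recombination(theta_ab, theta_bc, coincidence_value=1.0):
    """Recombination fraction between the flanking loci A and C.

    A double cross-over restores the parental combination of A and C,
    so it counts in both inner intervals but not in the outer one.
    """
    double_fraction = coincidence_value * theta_ab * theta_bc
    return theta_ab + theta_bc - 2.0 * double_fraction


if __name__ == "__main__":
    print("no interference:      ", outer_recombination(0.10, 0.10, 1.0))
    print("complete interference:", outer_recombination(0.10, 0.10, 0.0))
    # Hypothetical cross: 4 doubles observed among 1000 offspring.
    c = coincidence(4, 1000, 0.10, 0.10)
    print("coincidence:", c, " interference:", 1.0 - c)
```

Only with complete interference are the recombination fractions of adjacent intervals strictly additive.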

29.9.3 Mapping Functions

These define a mathematical relationship between recombination and genetic distance (or density of crossing over). They make empiric assumptions as to the frequency of multiple crossovers, which in turn makes assumptions about the degree of interference. Genetic distances are measured in morgans (M); 1 centimorgan (cM) is equivalent to 1% recombination. This equivalence becomes inaccurate for recombination fractions greater than about 15%.
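Two widely used mapping functions, which the text does not name, are those of Haldane (no interference) and Kosambi (moderate interference). They are sketched below as examples, with distances returned in centimorgans; the function names are illustrative.

```python
import math


def haldane_cm(theta):
    """Haldane map distance in cM (assumes no interference)."""
    return -50.0 * math.log(1.0 - 2.0 * theta)


def kosambi_cm(theta):
    """Kosambi map distance in cM (assumes moderate interference)."""
    return 25.0 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))


def haldane_theta(distance_cm):
    """Inverse of the Haldane function: recombination fraction from cM."""
    return 0.5 * (1.0 - math.exp(-2.0 * distance_cm / 100.0))


if __name__ == "__main__":
    for theta in (0.01, 0.05, 0.15, 0.30):
        print(f"theta = {theta:.2f}  Haldane = {haldane_cm(theta):5.1f} cM"
              f"  Kosambi = {kosambi_cm(theta):5.1f} cM")
```

At 1% recombination both functions return very nearly 1 cM, whereas at 15% they return about 17.8 and 15.5 cM, respectively, which shows how the simple equivalence between percentage recombination and centimorgans breaks down.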

29.9.4 Joint-Likelihood Multipoint Linkage Analysis

The lod score method, which has been used successfully with pairwise linkage data, has been extended to analyze data segregating simultaneously for multiple loci. Lathrop has developed the Linkage program for joint-likelihood analysis of an arbitrary number of loci. In many problems, a single marker is not sufficiently informative to “track” all the meioses in a family; however, data from flanking loci may be analyzed jointly and yield more information overall. This efficient extraction of mapping information from the expensive (in terms of time, labor, and money) data allows more accurate mapping and, frequently, more confidence in interpreting the results. Exclusion of a disease locus from a map of linked markers is particularly efficient, since double crossovers will be inferred when the disease is located incorrectly (see also Section 6.2.1).

In the current version of Linkage, likelihoods for four or more loci are calculated assuming no interference. This has been criticized on the theoretical grounds that mathematical modeling with interference would be biologically more accurate and estimates of genetic distances without interference would be exaggerated. In practice, this assumption probably makes little difference. For example, maps constructed by multipoint analysis tend to be slightly larger than those deduced from pairwise data. For investigators attempting to map new loci, the assumption of no interference will minimize the contribution of double cross-overs and make claims of exclusion conservative.

There is no elegant way to tabulate multipoint likelihoods as conveniently as lod scores for pairwise data, in such a way that new data can simply be added in. Usually recalculation with the original pedigree structure and genotypings will be necessary to integrate new data. The support for linkage of a new marker to a preexisting map of marker loci is often graphically expressed as a location map.

One problem faced by all geneticists using joint-likelihood methods for multipoint analysis is the substantial consumption of computer time and memory. Families with genetic diseases frequently have individuals with missing data, who are essential to include since they link informative branches of the family together. Likelihoods have to be calculated for all possible joint genotypes for these individuals. As the number of loci under examination increases, the number of possible joint genotypes increases dramatically. At the present time, the author uses a UNIX workstation with a fast (12-MIPS) RISC processor. There have been many problems that have not been analyzed completely, since they would involve an impractical length of processor time. In these situations, subsets of loci are analyzed jointly and the overall map constructed somewhat empirically from these fragments.
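The scale of the problem can be seen from a simple count. The sketch below assumes that an untyped individual may carry any pair of haplotypes and that genotypes differing only in phase must be distinguished; it counts possibilities only and does not reflect the internal bookkeeping of any particular program. The function names are illustrative.

```python
import math


def haplotype_count(alleles_per_locus):
    """Number of distinct haplotypes across the loci analyzed jointly."""
    return math.prod(alleles_per_locus)


def joint_genotype_count(alleles_per_locus):
    """Number of unordered pairs of haplotypes an individual could carry."""
    h = haplotype_count(alleles_per_locus)
    return h * (h + 1) // 2


if __name__ == "__main__":
    # Hypothetical markers with four alleles each.
    for n_loci in range(1, 6):
        loci = [4] * n_loci
        print(f"{n_loci} loci: {haplotype_count(loci):>5} haplotypes, "
              f"{joint_genotype_count(loci):>7} joint genotypes")
```

Each additional four-allele locus multiplies the number of joint genotypes roughly sixteenfold, and the count applies to every untyped individual in the pedigree.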

29.10 Family Collection

The most important and frequently limiting component of a linkage study is ascertaining and collecting families suitable for detecting linkage.

29.10.1 Autosomal Dominant

Typically, multigeneration families with several affected individuals are sampled. For example, in Huntington’s disease, a single large Venezuelan pedigree was collected with sufficient affected individuals for a powerful study. Dominant disorders occasionally show incomplete or age-dependent penetrance, so individuals may carry the mutant allele, but appear phenotypically unaffected. This is a feature of Huntington’s disease; carriers develop symptoms only in the fourth decade. This reduces the information contribution of younger family members. For common dominant traits, occasional homozygous affected individuals may well be sampled.

29.10.2 Autosomal Recessive

Nuclear families are most typically collected for recessive traits. Grandparents are unaffected and, in the absence of a biochemical carrier test, cannot contribute any phase information for the disease. They may be useful for deducing the phase for markers in a multipoint analysis. Pseudodominant families are reported only infrequently, and it should be remembered that homozygotes are uninformative for linkage. Consanguineous matings classically bring together recessive alleles and may be usefully collected. These families provide “phase-known meioses,” which are unusual in human genetic-linkage analyses. Pedigrees with many inbreeding “loops” present analytic difficulties since each loop dramatically increases the calculation time with currently available algorithms.

29.10.3 X-Linked Traits

Males are hemizygous, which eases deduction of phase.

29.11 Mapping of Complex Traits

Linkage studies have a proven track record in mapping loci that have a well-defined mode of inheritance. Major genes that cause common disorders, such as the low-density-lipoprotein receptor (LDLR) and familial hypercholesterolemia, have been analyzed in families with a dominant, single-gene mode of inheritance.

There are several common conditions of clinical importance that show familial clustering, but do not show an obvious or consistent inheritance pattern (e.g., atherosclerosis, hypertension, diabetes, cancer, and mental illness). This is probably a consequence of an individual’s phenotype being modified by multiple genes (polygenic) as well as nongenetic (environmental) factors.

Methods for statistical analysis that extend the lod score method to map the underlying genes for such traits have been developed, but it is unclear if they will be of practical use with typical human data sets. It seems prudent to attempt to map these complex traits in experimental organisms, for which much larger and controlled data sets can be made available, and then investigate candidate genes or genetic regions in humans.

An alternative analytic approach to searching for linkage to genes involved in complex traits involves affected-relative pair methods. These “identity-by-state” extensions to the “classic” sib-pair method of linkage analysis provide alternative means and strategies for attempts to identify genes that contribute to complex multilocus diseases (54,55).

29.12 Concluding Remarks

Recombinant-DNA technology has provided abundant polymorphic markers that are suitable for genetic-linkage studies in humans. Statistical methods have been developed that can efficiently analyze the data, so scans of the genome are practical for locating disease loci. Presymptomatic and prenatal diagnosis and carrier detection are feasible for mapped diseases (see Chapter 30). Linkage can be used to test for genetic heterogeneity between families. Finally, reverse genetic strategies may then be devised to isolate and clone the underlying gene (see Chapters 18 and 19). Understanding the genetic pathology of a disease is the first step in both development of specific therapies and offering prospects for population-based genetic screening.