DNA Microarrays: a Powerful Genomic Tool for Biomedical and Clinical Research
Among the many benefits of the Human Genome Project are new and powerful tools such as the genome-wide hybridization devices referred to as microarrays. Initially designed to measure gene transcriptional levels, microarray technologies are now used for comparing other genome features among individuals and their tissues and cells. Results provide valuable information on disease subcategories, disease prognosis, and treatment outcome. Likewise, they reveal differences in genetic makeup, regulatory mechanisms, and subtle variations and move us closer to the era of personalized medicine. To convey the power and versatility of this tool and how dramatically it is changing the molecular approach to biomedical and clinical research, this review describes the technology and its applications, walks step by step through a typical microarray protocol, and presents a real experiment. Finally, it calls the attention of the medical community to the importance of integrating multidisciplinary teams to take advantage of this technology and its expanding applications, which, on a single slide, reveal our genetic inheritance and destiny.
Genomic approaches have changed the way we do research in biology and medicine. We can now measure the majority of mRNAs, proteins, metabolites, protein-protein interactions, genomic mutations, polymorphisms, epigenetic alterations, and micro RNAs in a single experiment. The data generated by these methods, together with the knowledge derived from their analysis, were unimaginable just a few years ago. These techniques, however, produce such large amounts of data that making sense of them is a difficult task. So far, DNA microarray technologies are perhaps the most successful and mature methodologies for high-throughput and large-scale genomic analyses.
DNA microarray technologies were initially designed to measure, in a single experiment, the levels of RNA transcripts derived from thousands of genes within a genome. This technology has made it possible to relate physiological cell states to gene expression patterns for studying tumors, disease progression, cellular response to stimuli, and drug target identification. For example, subsets of genes with increased and decreased activities (referred to as transcriptional profiles or gene expression “signatures”) have been identified for acute lymphoblastic leukemia (1), breast cancer (2), prostate cancer (3), lung cancer (4), colon cancer (5), multiple tumor types (6), apoptosis induction (7), tumorigenesis (8), and drug response (9). Moreover, because the volume of published data is increasing every day, integrated analysis of several studies, or “meta-analysis,” has been proposed in the literature (10). These approaches detect generalities and particularities of gene expression in diseases.
More recent uses of DNA microarrays in biomedical research are not limited to gene expression. DNA microarrays are being used to detect single nucleotide polymorphisms (SNPs) in the human genome (HapMap project) (11), aberrations in methylation patterns (12), alterations in gene copy number (13), and alternative RNA splicing (14), as well as for pathogen detection (15,16).
In the last 10 to 15 years, high-quality arrays, standardized hybridization protocols, accurate scanning technologies, and robust computational methods have established DNA microarrays for gene expression as a powerful, mature, easy-to-use, and essential genomic tool. Although identifying the most relevant information from microarray experiments is still an area of active research, well-established methods are available for a broad spectrum of experimental setups. In this publication, we present the most common uses of DNA microarray technologies, provide an overview of their frequent biomedical applications, describe the steps of a typical laboratory procedure, guide the reader through the processing of a real experiment to detect differentially expressed genes, and list valuable web-based microarray data and software repositories.
It is well known that complementary single-stranded nucleic acid sequences form double-stranded hybrids. This property is the basis of very powerful molecular biology tools such as Southern and Northern blots, in situ hybridization, and the Polymerase Chain Reaction (PCR). In these, specific single-stranded DNA sequences are used to probe for their complementary sequences (DNA or RNA), forming hybrids. The same idea is used in DNA microarray technologies. The aim, however, is not only to detect but also to measure the expression levels of not a few but rather thousands of genes in the same experiment. For this purpose, thousands of single-stranded sequences that are complementary to target sequences are bound to, synthesized on, or spotted onto a glass support whose size is similar to that of a typical microscope slide. There are mainly two types of DNA arrays, depending on the type of probes. One uses small single-stranded oligonucleotides (~22 nt) synthesized in situ, whose leading provider is Affymetrix (Santa Clara, CA, USA, http://www.affymetrix.com). The other type uses complementary DNA (cDNA) obtained by reverse transcription of the genes’ messenger RNAs (mRNA), completion of the second strand, cloning of the double-stranded DNAs, and, typically, PCR amplification of their open reading frames (ORF), which become the bound probes. One limitation of using large ORF or cDNA sequences is that their optimal melting temperatures are uneven because of differences in size and GC content. A second problem is cross-hybridization of closely related sequences, overlapping genes, and splicing variants. In oligo-based DNA arrays, the targeted nucleic acid species is detected redundantly by designing several complementary oligonucleotides that together span each target sequence in segments. The oligonucleotides are designed to avoid the drawbacks of cDNA probes and to maximize specificity for the target gene.
Initially, DNA arrays were based on nylon membranes, which are still in use. However, glass provides an excellent support for attaching the nucleotide sequences, is less sensitive to light than membranes, and is non-porous, allowing the use of very small amounts of sample. A more recent and different technology uses designed oligonucleotide probes attached to beads that are deposited randomly on a support. The position of each bead, and hence the sequence it carries, is determined by a complex pseudo-sequencing process. These arrays, provided by Illumina (San Diego, CA, USA, http://www.illumina.com), are mainly used for genotyping, copy-number measurements, sequencing, and detecting loss of heterozygosity (LOH), allele-specific expression, and methylation. A recent review of this technology has been published elsewhere (17). For clinical research, however, the preferred technology so far is the oligo-based microarray, whose leading provider is Affymetrix.
Finally, an important issue that arises when statistical tests are applied to microarray data is the real significance of the results and the concomitant need to correct for multiple testing. For example, a t-test yields the probability (p-value) of observing a difference at least as large as the one measured if it arose by chance alone. Commonly, we call a result significant when this probability is smaller than five percent. For large-scale data, a t-test is performed thousands of times (once for each gene), which means that from 10,000 t-tests at a five percent significance level, we would expect to call about 500 genes differentially expressed merely by chance, a number that is close to or even higher than the number of genes actually selected in many experiments. Therefore, a correction that attempts to control for false positives should be performed. The most common correction method is the False Discovery Rate proposed originally by Benjamini and Hochberg (18) and extended by Storey and Tibshirani (19).
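The multiple-testing arithmetic and the Benjamini and Hochberg step-up adjustment can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `benjamini_hochberg` and the five p-values are hypothetical, and the extension by Storey and Tibshirani is not shown.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Expected number of chance findings for 10,000 tests at the 5% level.
expected_by_chance = 10_000 * 0.05

# Hypothetical p-values for five genes (not taken from any experiment).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted = benjamini_hochberg(p_values)
significant = [q <= 0.05 for q in adjusted]

print(expected_by_chance)
print(adjusted)
print(significant)
```

In this toy example, four genes pass an uncorrected five percent threshold but only two remain after the adjustment.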
Applications in Biomedical Research
The ultimate output of any microarray assay, independent of the technology, is a measure, for each gene or probe, of the relative abundance of the complementary target in the examined sample. In this section, we review the most common applications of the data derived from clinical studies using microarrays, irrespective of the technology employed.
Relating Gene Expression to Physiology: Differentially Expressed Genes
To detect differentially expressed genes, intuitive and formal statistical approaches have been proposed. The best-known intuitive approach, proposed in early microarray studies, is the fold change in fluorescence intensity (20,21), expressed as the logarithm (base 2, or log2) of the ratio of the sample to the reference. On this scale, a value of one means that the expression level has increased two-fold (upregulation), a value of −1 means that the expression level has decreased two-fold (downregulation), and zero means that the expression level has not changed. Larger absolute values correspond to larger fold changes. Genes whose fold change exceeds a certain (arbitrary) value are selected for further analyses. Although fold change is a very useful measure, the weaknesses of this criterion are the overestimation for genes with low expression in the reference (denominators close to zero tend to inflate the ratio), the subjective nature of the value that determines a “significant” change, and the tendency to omit small but significant changes in gene expression levels. For these reasons, the most sensible option currently is to follow formal statistical approaches to select differentially expressed genes. For two groups of samples, the common t-test is the easiest option, though not the best, for analyzing two-dye microarrays whose log2 ratios generate normal-like distributions after normalization (see next section); for more than two groups of samples, the ANOVA (analysis of variance) test is used. These options apply to both one- and two-dye microarrays. If the data are non-standardized, non-parametric tests such as the Wilcoxon or Mann-Whitney tests may be applied. A comparison of statistical tests for differential expression, including the t-test, has been published elsewhere (22).
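As a rough illustration of the two criteria, the following Python sketch computes a log2 fold change and a two-sample Student t statistic with pooled variance. The function names and the example numbers are hypothetical, and converting the t statistic to a p-value (which requires the t distribution) is left out.

```python
import math
from statistics import mean, variance


def log2_fold_change(sample, reference):
    """Log2 of the sample-to-reference intensity ratio."""
    return math.log2(sample / reference)


def student_t(group_a, group_b):
    """Two-sample Student t statistic using the pooled variance."""
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a) + (nb - 1) * variance(group_b)) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled * (1 / na + 1 / nb))


print(log2_fold_change(200, 100))  # two-fold upregulation
print(log2_fold_change(50, 100))   # two-fold downregulation
print(log2_fold_change(100, 100))  # no change

# Hypothetical log2 ratios of one gene in two groups of four samples each.
disease = [1.1, 0.9, 1.2, 1.0]
healthy = [0.1, -0.1, 0.0, 0.2]
t_statistic = student_t(disease, healthy)
print(t_statistic)
```

The fold change uses a single pair of intensities, whereas the t statistic also accounts for the variability within each group, which is why small but consistent changes can still be detected.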
The approaches described so far are univariate; that is, one gene is tested at a time, independently of any other gene. There are multivariate procedures, however, in which genes are tested in combination rather than in isolation. Although more powerful (23, 24, 25, 26), these approaches require a more complex analysis.
Biomarker Detection: Supervised Classification
The fundamental difference between identifying differentially expressed genes and identifying a set of genes of real diagnostic or prognostic value is that a biomarker needs to be predictive of disease class or clinical outcome. For this reason, it must be possible to associate, to a given set of marker genes, a rule that allows identification of an unknown sample. The classification accuracy of the biomarker also needs to be determined with robust statistical procedures. Therefore, during the biomarker selection procedure, a substantial fraction of the samples are set aside in order to evaluate independently the accuracy of the selected biomarkers (in terms of sensitivity and specificity). Thus, such studies require a relatively large number of samples.
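The accuracy measures mentioned above can be computed directly from the samples that were set aside. The sketch below is a minimal example; the label strings, the function name, and the predictions are hypothetical.

```python
def sensitivity_specificity(true_labels, predicted_labels, positive="disease"):
    """Return (sensitivity, specificity) for a two-class prediction."""
    tp = fn = tn = fp = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == positive:
            if predicted == positive:
                tp += 1  # diseased sample correctly identified
            else:
                fn += 1  # diseased sample missed
        else:
            if predicted == positive:
                fp += 1  # healthy sample wrongly flagged
            else:
                tn += 1  # healthy sample correctly identified
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical held-out samples that were not used to select the biomarker.
truth = ["disease"] * 4 + ["healthy"] * 4
predicted = ["disease", "disease", "disease", "healthy",
             "healthy", "healthy", "healthy", "disease"]

sensitivity, specificity = sensitivity_specificity(truth, predicted)
print(sensitivity, specificity)
```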
As explained above, biomarker selection for diagnostics, unlike differential expression, requires a rule to make predictions. This rule is generated by a classifier, a statistical model that assigns a sample to a certain category based on gene expression values. For example, a sensible classifier for diabetes is whether the sugar level in serum reaches a certain value. In statistics, this classifier is referred to as univariate; that is, only one variable (sugar level) is needed in the rule. For DNA microarray studies, however, it is common to obtain a large list of genes useful for disease discrimination. Multiple genes provide robustness in the estimation and take into account potential synergy between genes. Therefore, multivariate classifiers are commonly used. For example, it is well known that obesity and parental predisposition to diabetes, taken together with sugar levels in serum, provide a more precise diagnostic criterion for diabetes. Multivariate classifiers can be designed using genes selected either by a univariate method such as the t-test, ANOVA, Wilcoxon, PAM (27), or Golub’s centroid (1), or by a multivariate method (23, 24, 25, 26).
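To make the idea of a multivariate rule concrete, here is a minimal nearest-centroid classifier in Python. It is a simplified sketch, not the PAM or Golub procedures cited above: each class is summarized by the mean expression of the selected genes, and an unknown sample is assigned to the class with the closest centroid. All names and expression values are hypothetical.

```python
import math
from statistics import mean


def class_centroids(samples, labels):
    """Mean expression per gene for each class (one centroid per class)."""
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    return {label: [mean(column) for column in zip(*rows)]
            for label, rows in by_class.items()}


def classify(sample, centroids):
    """Assign the sample to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda label: math.dist(sample, centroids[label]))


# Hypothetical log2 ratios for three marker genes in four training samples.
training = [[2.0, -1.0, 0.5], [1.8, -1.2, 0.7], [0.1, 0.2, -0.1], [-0.1, 0.0, 0.1]]
labels = ["tumor", "tumor", "normal", "normal"]
centroids = class_centroids(training, labels)

print(classify([1.5, -0.8, 0.4], centroids))
print(classify([0.2, 0.1, 0.0], centroids))
```

Because the rule combines three genes, a moderate deviation in any single gene does not by itself change the assigned class.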
Thus, the possibility to characterize the molecular state of diseased tissues has led to an improvement in prognosis and diagnosis as well as providing evidence of the existence of distinct disease subclasses in previously considered homogeneous diseases.
Describing the Relationship Between the Molecular State of Biological Samples: Unsupervised Classification
One key issue in the analysis of microarray data is finding genes with a similar expression profile across a number of samples. Co-expressed genes have the potential to be regulated by the same transcription factors or to have similar functions (for example, belonging to the same metabolic or signaling pathways). The detection of co-expressed genes may therefore reveal potential clinical targets, genes with similar biological functions, or novel biological connections between genes. On the other hand, we may want to describe the degree of similarity between biological samples at the transcriptional level (28). We may expect such an analysis to confirm that samples with similar biological properties (for example, samples derived from patients affected by the same disease) tend to have a similar molecular profile. Although this is true, it has also been demonstrated that the molecular profile of samples reflects disease heterogeneity and is therefore useful in discovering novel disease subclasses (5). From the methodological perspective, these questions can be addressed using unsupervised clustering methods.
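Most clustering methods start from a measure of similarity between expression profiles. A common choice, shown here only as an assumed example, is the Pearson correlation, with one minus the correlation used as a distance. The gene names and values are hypothetical, and a full clustering algorithm is not implemented.

```python
import math
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length."""
    mx, my = mean(x), mean(y)
    covariance = sum((a - mx) * (b - my) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mx) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - my) ** 2 for b in y))
    return covariance / (spread_x * spread_y)


def correlation_distance(x, y):
    """Distance that is 0 for perfectly co-expressed profiles, 2 for opposite ones."""
    return 1 - pearson(x, y)


# Hypothetical expression profiles of three genes across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],   # same pattern as geneA, different scale
    "geneC": [4.0, 3.0, 2.0, 1.0],   # opposite pattern
}

print(correlation_distance(profiles["geneA"], profiles["geneB"]))
print(correlation_distance(profiles["geneA"], profiles["geneC"]))
```

Swapping genes for samples (comparing columns instead of rows) gives the sample-to-sample similarity discussed in the text.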
Identification of Prognostic Genes Associated with Risk and Survival
Association of Genes with Disease Surrogate Markers: Regression Analysis
Genetic Disorders: Gene Copy Number and Comparative Genomic Hybridization
It is well known that several inherited diseases are a consequence of genetic rearrangements such as gene duplications, translocations, and deletions. Moreover, these alterations are observed in cancer cells as well. A specific microarray technique used to detect these abnormalities in a single hybridization experiment is called Comparative Genomic Hybridization (CGH) (Pollack, 1999) (13). The core concept in CGH is the use of genomic DNA (gDNA) in the hybridization to compare the gDNA from a disease sample versus that of a healthy individual. Hence, a typical microarray design can be used in this approach (see Figure 1).
The signal intensity of all probes in the microarray should, therefore, be very similar for healthy samples. Thus, differences in gene copy number are easily detected as changes in signal intensity. Using this technology, Zhao et al. (2005) (36) recently characterized the variations in gene copy number in several cell lines derived from prostate cancer, and Braude et al. (36) confirmed an alteration in chronic myeloid leukemia.
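The principle can be illustrated with a small Python sketch that compares the signal of a disease sample with that of a healthy reference, probe by probe. The threshold of 0.3 on the log2 scale and the function name are assumptions made for this example only; real analyses typically smooth neighboring probes along the chromosome before calling gains and losses.

```python
import math


def call_copy_number(disease_signal, healthy_signal, threshold=0.3):
    """Classify one probe as gain, loss, or no change from its log2 ratio."""
    log_ratio = math.log2(disease_signal / healthy_signal)
    if log_ratio > threshold:
        return "gain"
    if log_ratio < -threshold:
        return "loss"
    return "no change"


# Hypothetical intensities: three versus two copies gives log2(3/2), about 0.58.
print(call_copy_number(1500, 1000))
# One versus two copies gives log2(1/2) = -1.
print(call_copy_number(500, 1000))
# Small fluctuation around the reference.
print(call_copy_number(1050, 1000))
```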
Genetic Disorders: Epigenetics and Methylation
Genetic Disorders and Variability: Gene Polymorphism and Single Nucleotide Polymorphism
Chromatin Immunoprecipitation: Genetic Control and Transcriptional Regulation
An Overview of a Typical Microarray Experiment
In this section we provide a brief description of the typical workflow of a microarray experiment and its data analysis (see Figure 1).
RNA can be extracted from tissue or cultured cells using molecular biology laboratory procedures (although several commercial kits are available). The amount of mRNA required is about 0.5 µg, which is equivalent to 20 µg of total RNA, though there is some variation depending on the microarray technology. When the amount of mRNA (or DNA) is scarce, an amplification step, for example by PCR amplification of reverse-transcribed cDNA, is needed before labeling.
mRNA is reverse-transcribed using reverse transcriptase to generate cDNA. Labeling is achieved by including in the reaction (or in a separate reaction) modified nucleotides that fluoresce when excited at appropriate wavelengths. The most common fluorescent dyes used are Cy3 (green) and Cy5 (red). The unincorporated dyes are usually removed by column chromatography or ethanol precipitation.
Hybridization is carried out according to conventional protocols. The hybridization solution contains saline sodium citrate (SSC), sodium dodecyl sulphate (SDS) as detergent, non-specific DNA such as yeast DNA, salmon sperm DNA, or repetitive sequences, blocking reagents like bovine serum albumin (BSA) or Denhardt’s reagent, and labeled cDNA from the samples. Hybridization temperatures range from 42°C to 45°C for cDNA-based microarrays and from 42°C to 50°C for oligo-based microarrays. Hybridization volumes vary between 20 µL and 1 mL depending on the microarray technology. A hybridization chamber is usually needed to keep temperature and humidity constant.
After hybridization, the microarray is washed in salt buffers of decreasing concentration and dried by slide centrifugation or by blowing air after immersion in alcohol. The slide is then read by a scanner, a device similar to a fluorescence microscope coupled with a laser, robotics, and a digital camera that records the fluorescence emitted upon excitation. The robotics positions the slide, lens, camera, and laser so that the slide is read row by row, much like a common desktop scanner. The amount of signal (color) detected is presumed to be proportional to the amount of dye at each spot in the microarray and hence proportional to the RNA concentration of the complementary sequence in the sample. The output is, for each fluorescent dye, a monochromatic (non-colored) digital image file, typically in TIFF format. False-color images (red, green, and yellow) are reconstructed by specialized software for visualization purposes only.
The goal of this step is to identify the spots in the microarray image, quantify the signal, and record the quality of each spot. Depending on the software used, this step may need some degree of human intervention. The digital images are loaded into specialized software together with a pre-loaded design of the microarray (grid layout) that tells the software the number, position, shape, and dimension of the spots. The grid is then fitted to the actual image automatically or manually. Fine-tuning of spot positions and shapes is usually performed to compensate for any bias in the robotic construction of the microarray. Human involvement is needed to mark spots that could be artifacts such as bubbles or scratches, which are common. Finally, the software applies an automated integration function to convert the actual spot readings to a numerical value. The integration function considers the signal and background noise for each spot. The output of the image analysis is commonly a tab-delimited text file or a software-specific file format. Common image analysis software includes ScanArray (PerkinElmer, Waltham, MA, USA), GenePix (Axon, Molecular Devices Corporation, Union City, CA, USA), TIGR-SpotFinder/TM4 (www.tigr.org, The Institute for Genomic Research, Rockville, MD, USA), and GeneChip (Affymetrix, Santa Clara, CA, USA). This process ranges from automatic or semi-automatic to manual depending on the microarray technology, scanner, and software used.
Systematic errors are introduced in the labeling, hybridization, and scanning procedures. The main aims of normalization are to correct for these errors while preserving the biological information and to generate values that can be compared between experiments, especially when they were generated at different times or places or with different reagents, microarrays, or technicians. There are two types of normalization, “within” and “between” array normalization. “Within” array normalization refers to normalization applied within the same slide and is generally applicable to two-dye technologies. For this, let us define M = log2(R/G) and A = log2(R*G)/2, where R and G are the red and green readings, respectively. Under the assumption that the majority of genes are not differentially expressed, the majority of the M values should oscillate around zero. “Within” normalization is then performed by shifting the imaginary line traced by the values of M (on the vertical axis) to zero along the values of A (on the horizontal axis). This kind of normalization, sometimes called loess, is usually performed by spatial blocks to avoid any bias from the microarray printing process (in which case it is called print-tip loess). “Between” normalization is necessary when at least two slides are analyzed, to guarantee that both slides are measured on the same scale and that their values are independent of the parameters used to generate the measurements. The goal is to transform the data in such a way that all microarrays have the same distribution of values. For two-dye technologies this is optional and is commonly done by scaling or standardizing the values once within normalization has been performed. For one-dye microarrays, between normalization is usually performed using methods that equalize distributions, such as quantile normalization (55), after log2 transformation. There are, however, a number of normalization methods, and the right choice is usually data-dependent. A comparison of the results of different normalization methods is recommended.
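The definitions of M and A and the two kinds of normalization can be sketched as follows. This is a simplified illustration: the within-array step subtracts the global median of M, which is a crude stand-in for the intensity-dependent loess fit, and the quantile step ignores ties. Function names and intensities are hypothetical.

```python
import math
from statistics import median


def ma_values(red, green):
    """Compute M = log2(R/G) and A = log2(R*G)/2 for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(r * g) / 2 for r, g in zip(red, green)]
    return m, a


def median_center(m):
    """Within-array step: shift M so that its median is zero."""
    shift = median(m)
    return [value - shift for value in m]


def quantile_normalize(arrays):
    """Between-array step: give every array the same distribution of values."""
    n = len(arrays[0])
    sorted_arrays = [sorted(array) for array in arrays]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays) for k in range(n)]
    normalized = []
    for array in arrays:
        order = sorted(range(n), key=lambda i: array[i])
        result = [0.0] * n
        for rank, i in enumerate(order):
            result[i] = reference[rank]
        normalized.append(result)
    return normalized


# Hypothetical slide where the red channel reads twice the green everywhere.
red = [400, 800, 1600, 3200]
green = [200, 400, 800, 1600]
m, a = ma_values(red, green)
print(m, median_center(m))

# Two hypothetical one-dye arrays on the log2 scale.
print(quantile_normalize([[2.0, 4.0, 6.0], [3.0, 1.0, 5.0]]))
```

In the first example the uniform two-fold dye bias disappears after centering, which is the behavior expected when most genes are not differentially expressed.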
The image analysis process (generally in spotted microarrays) does not always generate a value for a gene, because the spot may be defective or manually marked as faulty. This is not a major issue when genes are replicated in several spots in the microarray, because the reading of the gene can still be estimated using the remaining spots. If the value of a spot is systematically missing in several arrays, the spot should be removed from the analysis. If the number of missing values is low, the corresponding spots can simply be excluded from all arrays. However, when the number of arrays is large, this could lead to the removal of many spots. To avoid these problems, one must either use only methods that can deal with missing values or use algorithms to infer those values (30). In the latter case, results should be interpreted bearing in mind that some values were inferred.
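A minimal way to handle missing readings is sketched below: spots missing in too many arrays are dropped, and the remaining gaps are filled with the mean of the observed values for that spot. This row-mean rule is the simplest possible stand-in and is not the inference algorithm of reference (30); the 50 percent cutoff and the function name are assumptions.

```python
def impute_row_mean(values, max_missing_fraction=0.5):
    """Fill missing readings (None) of one spot with the mean of the others.

    Returns None when too many readings are missing, signalling that the
    spot should be removed from the analysis.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    if missing > max_missing_fraction * len(values):
        return None
    fill = sum(present) / len(present)
    return [fill if v is None else v for v in values]


# Hypothetical log2 ratios of two spots across four arrays.
print(impute_row_mean([1.0, None, 3.0, 2.0]))
print(impute_row_mean([None, None, None, 0.5]))
```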
Current microarrays contain more than 10,000 genes, spots, or probes. Dealing with large amounts of data may require expensive computational resources and long processing times. A common practice is to remove genes that do not show significant changes across samples, genes with many missing values, or genes whose average expression is very low (because genes with low expression are more susceptible to noise). The most common filters are based on statistical tests (keeping genes with lower p-values) and on signal-to-noise estimations, variability, and average expression (keeping genes with higher values).
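A filter of this kind can be written in a few lines. The sketch below keeps genes whose average expression and variance across samples both exceed user-chosen cutoffs; the cutoffs, gene names, and values are hypothetical and would need to be tuned for real data.

```python
from statistics import mean, variance


def filter_genes(expression, min_mean, min_variance):
    """Keep genes that are both sufficiently expressed and sufficiently variable."""
    return {
        gene: values
        for gene, values in expression.items()
        if mean(values) >= min_mean and variance(values) >= min_variance
    }


# Hypothetical log2 intensities for three genes across four samples.
expression = {
    "geneA": [8.0, 8.1, 7.9, 8.0],  # well expressed but flat across samples
    "geneB": [6.0, 9.0, 6.5, 9.5],  # well expressed and variable
    "geneC": [2.0, 4.5, 1.5, 4.0],  # variable but low (noise-prone) expression
}

kept = filter_genes(expression, min_mean=5.0, min_variance=0.5)
print(sorted(kept))
```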
The numerical values from image analysis are commonly integers between one and 32,000 for both signal and background. The background is normally subtracted from the signal. The distribution of the resulting values is, however, concentrated in a narrow range and is therefore transformed using logarithms (generally base 2), which generates normal-like distributions. Negative values resulting from the subtraction cause problems in the transformation; these are resolved by restricting the values to a minimum or by using more robust transformations such as the generalized logarithm.
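Both ways of handling negative background-subtracted values can be sketched as follows. The first function restricts values to a floor before taking log2; the second uses one common form of the generalized logarithm, log2((x + sqrt(x^2 + c^2))/2), which is defined for negative x and approaches log2(x) for large x. The floor of 1 and the constant c = 50 are assumptions chosen for illustration.

```python
import math


def log2_with_floor(signal, background, floor=1.0):
    """Background-subtract, restrict to a minimum value, then take log2."""
    return math.log2(max(signal - background, floor))


def generalized_log2(signal, background, c=50.0):
    """Generalized logarithm of the background-subtracted value."""
    x = signal - background
    return math.log2((x + math.sqrt(x * x + c * c)) / 2)


# A bright spot: both transformations agree closely.
print(log2_with_floor(1124, 100), generalized_log2(1124, 100))

# A spot dimmer than its background: subtraction gives -10.
print(log2_with_floor(90, 100), generalized_log2(90, 100))
```

The floor collapses every dim spot to the same value, whereas the generalized logarithm keeps dim spots ordered, at the cost of one extra parameter.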
The procedure after image analysis and data processing depends mainly on the particular biological issue and data available. These procedures have been described in the Applications Section of this review.
Illustrating The Detection of Differentially Expressed Genes: The Case of Term Placenta
Step 1: mRNA Extraction and Microarray Hybridization
Total RNA from human term placenta, isolated using a proteinase K-phenol-based protocol (described in (56)), and a pool of commercially available total RNAs from several human tissues not including placenta were part of the set of reagents used in the EMBO-INER Advanced Practical Course 2005 held in Mexico City (EMBO Courses and Workshops Programme, Heidelberg, Germany, http://www.embo.org/courses_workshops/mexico.html). Their quality was checked by running them on an RNA 6000 Nano Assay from Agilent (Agilent Technologies Inc., Santa Clara, CA, USA). First-strand cDNA was synthesized from each RNA sample (5 µg) by reverse transcription using an oligo-dT primer with a T7-promoter sequence attached to its 5′ end, while the second strand was produced by treating the first strands with RNase H plus DNA polymerase I (Message Amp aRNA kit from Ambion, Austin, TX, USA). Column-purified double-stranded cDNAs were transcribed in vitro with T7 RNA polymerase, and the amplified RNAs (aRNAs) were also purified by column binding and subsequent elution. Fluorescent labels were attached indirectly to the hybridization probes by a two-step procedure. The first step consisted of a reverse transcription of the aRNA, this time using a mixture of all four deoxyribonucleotides plus aminoallyl-dUTP. In the second step, N-hydroxysuccinimide-activated fluorescent dyes (Cy3 and Cy5) were coupled to the cDNAs by reaction with the amino functional groups. Probes were preincubated with blocking reagents (human Cot DNA at 1 pg/mL and poly-dA DNA also at pg/mL) and then hybridized to prehybridized (6X SSC, 0.5 percent SDS, and one percent BSA) slides in hybridization buffer (50 percent formamide, 6X SSC, 0.5 percent SDS, and 5X Denhardt’s solution).
Slides were washed once in 2X SSC/0.1 percent SDS at 65°C for five minutes, twice in 0.1X SSC/0.1 percent SDS (first at 65°C for ten minutes and then at room temperature for two minutes), and finally in isopropanol, also at room temperature, with slide centrifugation between washing steps, and were stored in the dark until scanning. Fluorescent probes were hybridized to cDNA microarrays (a laboratory-made oligo-based microarray containing half of the probes on each of two slides).
Step 2: Microarray Scanning, Spot Finding and Image Processing
Microarrays were scanned using ScanArray Express (PerkinElmer, Waltham, MA, USA). The images obtained were analyzed using ChipSkipper (EMBLEM Technology Transfer GmbH, Heidelberg, Germany, http://www.embl-em.de) to obtain a single value for each spot representing the ratio (on the log2 scale) of the mRNA expression level in placenta to that in the reference mRNA from the pool of non-placenta tissues. A value of zero represents a similar expression level in both mRNA samples. A value of one represents two-fold over-expression in placenta, whereas a value of −1 represents two-fold downregulation in placenta. One placental sample was hybridized in duplicate on two microarrays using a dye-swap design, in which the labeling scheme is reversed between the two microarrays. To gain information on the variability associated with experimental error, two aliquots of the reference pool mRNA were compared on the same microarray. Both the comparison between experimental and control samples and the comparison between the two control samples were performed in duplicate using the dye-swap design. To summarize, the experiment was performed using six microarrays (two placenta samples compared with a reference in duplicate and two reference mRNA as controls, see Figure 11).
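The benefit of a dye-swap design can be seen in a short Python sketch. It assumes, for illustration only, that M is always computed as log2(Cy5/Cy3), that placenta is labeled with Cy5 on the first array and with Cy3 on the second, and that the dye bias adds the same amount to M on both arrays; the source does not state which dye was used for which sample.

```python
def combine_dye_swap(m_forward, m_swapped):
    """Average a dye-swap pair into one log2(placenta/reference) value.

    m_forward: log2(Cy5/Cy3) with placenta in Cy5 (assumed orientation).
    m_swapped: log2(Cy5/Cy3) with placenta in Cy3, so its sign is reversed.
    """
    return (m_forward - m_swapped) / 2


# Hypothetical gene: true log2 ratio of 1.0 and an additive dye bias of 0.3.
true_ratio, dye_bias = 1.0, 0.3
m_forward = true_ratio + dye_bias    # 1.3 on the first array
m_swapped = -true_ratio + dye_bias   # -0.7 on the swapped array

combined = combine_dye_swap(m_forward, m_swapped)
print(combined)
```

Under these assumptions the bias cancels and the combined value recovers the true log2 ratio.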
Step 3: Quality Assessment, Processing and Normalization
Step 4: Detection of Differentially Expressed Genes
Step 5: Validation
To verify the selection process, we made two comparisons. First, as a negative control, we applied the same selection criteria to the control microarrays, which used the reference sample in both channels. No genes matched the criteria. Second, we performed a comparison using the Tissue Expression tool (http://www.t1dbase.org/page/Tissue Home) from T1dbase (59). This tool makes use of Gene Expression Atlas (59), SAGEmap (60), and TissueInfo (58), integrating all measurements into a single score (58). This score, estimated for several tissues, indicates whether the expression of a gene is tissue-specific. Scores closer to one indicate tissue-specific expression, whereas scores closer to zero indicate no tissue specificity. Of the 350 genes resulting from Step 4, we selected only those included in this database, which yielded 201 genes. Several genes that seem to be over-expressed in the placentas processed here (darker colors in Figure 13A) show consistently higher placenta-specific scores in T1dbase (darker colors in Figure 13B). These results suggest that the experiment is coherent and valid.
Step 6: Analysis
Once genes have been selected, further computational, literature, and laboratory analyses are needed to confirm, expand, or restrict the results. Here, the analysis only compared the results with the T1dbase Tissue Specific Expression Tool. However, queries to Gene Ontology, KEGG pathways, PubMed, BLAST, or any other pertinent database resource should be considered a compulsory step.
Conclusions and Trends
DNA microarrays are a powerful, mature, versatile, and easy-to-use genomic tool that can be applied in biomedical and clinical research. The research community is expanding the use of this approach to novel applications. The main advantage is the genome-wide information provided at reasonable cost. Biological interpretation, however, requires the integration of several sources of information. In this context, a new discipline referred to as Systems Biology is emerging that integrates biological knowledge, clinical information, mathematical models, computer simulations, biological databases, imaging, and high-throughput “omic” technologies, such as microarray experiments. Therefore, multidisciplinary groups involving clinicians, biologists, statisticians, and, recently, bioinformaticians are being formed and expanded in all important research institutions. Consequently, virtually all biology-related research areas are moving from merely describing cellular and molecular components in a qualitative manner toward a more quantitative approach. These new teams are generating huge amounts of data and more convincing models to ultimately reveal hidden pieces of the biological puzzle. This new knowledge is having a crucial impact on the treatment of diseases because, among other things, it distinguishes subtypes of pathologies and individualizes disease risk, survival, treatment, prognosis, and outcome, quickly moving biomedical research into the era of personalized medicine.
All supplementary materials are available online at molmed.org.
HABS thanks the Staff of the Microarray Technology EMBO-INER Advanced Practical Course for enjoyable course lessons, materials and results; Peter Davies, Nancy and Greg Shipley of UT Medical School for additional laboratory training; Albert Sasson for critical reading of the manuscript and the offices of the Dean of his school and of the President of his University for support. Victor Trevino thanks Darwin Trust of Edinburgh and CONACyT for his PhD scholarship, and ITESM for support.
18. Benjamini Y, Hochberg Y. (1995) Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J. R. Stat. Soc. Ser. B. 57:289–300.