Community Analysis-Based Methods
Microbial communities are each a composite of populations whose presence and relative abundance in water or other environmental samples are a direct manifestation of environmental conditions, including the introduction of microbe-rich fecal material and factors promoting persistence of the microbes therein. As shown by culture-independent methods, different animal-host fecal microbial communities are distinctive, suggesting that their community profiles can be used to differentiate fecal samples and to potentially reveal the presence of host fecal material in environmental waters. Cross-comparisons of microbial communities from different hosts also reveal relative abundances of genetic groups that can be used to distinguish sources. In increasing order of their information richness, several community analysis methods hold promise for MST applications: phospholipid fatty acid (PLFA) analysis, denaturing gradient gel electrophoresis (DGGE), terminal restriction fragment length polymorphism (TRFLP), cloning/sequencing, and PhyloChip. Specific case studies involving TRFLP and PhyloChip approaches demonstrate the ability of community-based analyses of contaminated waters to confirm a diagnosis of water quality based on host-specific marker(s). The success of community-based MST for comprehensively confirming fecal sources relies extensively upon using appropriate multivariate statistical approaches. While community-based MST is still under evaluation and development as a primary diagnostic tool, results presented herein demonstrate its promise. Community-based MST inherently captures an unprecedented amount of microbiological data relevant to water quality. In addition, the tools for microbial community analysis are increasingly accessible, and community-based approaches have unparalleled potential for translation into rapid, perhaps real-time, monitoring platforms.
Keywords: Community analysis; Multivariate statistical method; Spatial source tracking; TRFLP; PhyloChip; MST on a chip
11.1.1 Challenges in Water-Quality Diagnosis
Microbiological water quality is a serious public health concern for drinking water, recreational swimming, shellfish consumption, and agricultural food production. The microbial pollutants of concern are pathogens, discharged with various fecal sources including human sewage and septage, domestic pet waste, and livestock manure. Pathogen-related coastal water-quality problems are recognized in a National Research Council (NRC) report (1993) where “pathogens and toxins that affect human health” and the “introduction of nonindigenous species” are listed among major US coastal environmental issues.
11.1.2 The MST Toolkit, Including Microbial Community Analysis
Because of the ubiquity of routine FIB monitoring, MST is often employed as a follow-up to determine if FIB signal the presence of fecal material. Frequently used in such follow-up studies, a stepwise MST study approach has been successful in some circumstances (Vogel et al. 2007), including a study revealing a leaking sanitary sewer line as an acute source of FIB near a beach (Boehm et al. 2003), and another discovering that aged sanitary sewers contribute diffuse contamination to storm drains discharging directly to coastal waters (Sercu et al. 2009). In such cases, study steps included: analyzing historical FIB time-course data in a spatial context, performing reconnaissance of existing data and field systems, nominating hypothetical sources based on sewer infrastructure and nearby features, designing a field sampling and sample analysis program to test the hypotheses, and employing one or more source-specific markers in sample analysis. This has been previously referred to as a “tiered approach” (Boehm et al. 2003), but the differences in decay rates between abundances of FIB and source-specific markers (Dick et al. 2010) mean that FIB data may not provide a solid foundation upon which a tiered source diagnosis can be built. FIB data are useful for generally circumscribing the spatial emphases of MST, but not for reliably tracking sources of waste.
Here, we use community analysis to refer to culture-independent characterization of the microbial community in a sample, i.e., what microbes, in what abundance, make up the community in a sample. Fecal input can alter microbial communities in the receiving environmental waters either directly or indirectly, i.e., by fecal source communities acting as inoculants, or by altering the environmental conditions of receiving water such as through changing water chemistry. Community analysis in MST has a strong potential to be useful for waste source assessment for the following reasons. First, gut microbial communities of various hosts vary significantly by host species and diet (Ley et al. 2008); therefore, microbial communities in feces would differ by host animal. Second, microbial communities are sensitive to perturbations and respond rapidly to environmental changes (Hwang et al. 2009), suggesting that their composition may reflect recent or ongoing contamination events. Third, the culture-independent analysis of microbial communities, by now a mature research endeavor (Liu and Jansson 2010), has successfully captured changes of microbial communities in the environment along gradients such as carbon availability (LaMontagne et al. 2003a) and proximity to hydrocarbon seeps (LaMontagne et al. 2004), and changes in microbial community due to perturbations such as inundation (Córdova-Kreylos et al. 2006), and vegetation and pollutant variations (Cao et al. 2006, 2008) in estuaries. Furthermore, community analysis detects many microbial signals and pathogens simultaneously to provide what is potentially a more robust and relevant approach to MST when compared to single indicators (Fig. 11.3).
This chapter examines the use of culture-independent microbial community analysis in MST. Briefly, this chapter reviews technical methods and data analysis approaches, provides two case studies, a critical evaluation of advantages and needs, and a view of community analysis in the future of MST.
11.2 Community Analysis Methods
11.2.1 Description of Community Analysis Techniques
Here, we describe methods that are most widely used. In particular, we focus on techniques that have the greatest potential to be used in MST studies. For each method, brief descriptions are provided regarding the method background, sample processing, and basic method-specific data processing. Multivariate data analysis is discussed in Sect. 11.3.
11.2.1.1 Phospholipid Fatty Acid (PLFA) Analysis
Phospholipids are essential membrane components that control cell permeability. Microbial fatty acids are largely linked to phospholipids, and in most cases, specific types of fatty acids predominate in a given taxon and are commonly associated with groups metabolizing similar substrates (Zelles 1999). Microorganisms also change membrane fatty acid composition in adaptation to environmental conditions including stressors (Loffhagen et al. 2004). Since PLFAs decompose quickly upon cell death, PLFA community analysis is typically regarded as reflecting viable or recently living cells (White 1994).
In performing PLFA analysis, PLFAs are first recovered from a sample by organic solvent extraction or solid-phase extraction (Zelles 1999). The extracted fatty acids and their derivatives are analyzed by gas chromatography to generate a PLFA profile that contains a list of fatty acids and their molar abundances. For data processing, the total mass of all PLFAs is often calculated as an indicator of total biomass, and the total mass of specific groups of PLFAs or mass ratios of certain PLFAs are also calculated as indicators of target groups of microorganisms or of environmental stresses (Zelles 1999). For example, several PLFAs are considered as biomarkers for members of the sulfate-reducer functional group; branched PLFAs are used as biomarkers for Gram-positive bacteria, and certain FAs can indicate heavy metal pollution (Pennanen et al. 1996; Córdova-Kreylos et al. 2006). Depending on downstream data analysis, the PLFA mass (i.e., absolute abundance) is also converted to percentage composition (i.e., relative abundance) so that biomass influences can be down-weighted and sample comparisons will be mainly based on community composition (Cao et al. 2006).
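The conversion from absolute to relative PLFA abundance described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fatty acid amounts are hypothetical, and the function name `to_percent_composition` is introduced here and does not come from any of the cited studies.

```python
def to_percent_composition(plfa_amounts):
    """Convert absolute PLFA amounts to percentages of the sample total."""
    total = sum(plfa_amounts.values())
    if total <= 0:
        raise ValueError("sample has no detectable PLFA")
    return {name: 100.0 * amount / total for name, amount in plfa_amounts.items()}


# Hypothetical amounts for two samples that differ tenfold in biomass
# but have the same community composition.
sample_a = {"16:0": 40.0, "i15:0": 10.0, "18:1w7c": 50.0}
sample_b = {"16:0": 4.0, "i15:0": 1.0, "18:1w7c": 5.0}

total_a = sum(sample_a.values())  # total PLFA, a common biomass indicator
percent_a = to_percent_composition(sample_a)
percent_b = to_percent_composition(sample_b)
```

After conversion the two hypothetical samples have identical profiles, which shows how percentage composition down-weights biomass so that comparisons rest mainly on composition.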
11.2.1.2 Terminal Restriction Fragment Length Polymorphism (TRFLP)
Terminal restriction fragment length polymorphism (TRFLP) produces genotypic fingerprints, or profiles, of the microbial community based on length polymorphism of PCR products of a specific marker gene, most frequently the gene encoding 16S rRNA, which contains regions that are conserved and regions that are variable among microorganisms. Additionally, other functional genes such as nitrite reductase genes (nirS, nirK) have also been used as marker genes for community analysis by TRFLP (Braker et al. 2001).
In TRFLP analysis (Liu et al. 1997), DNA is extracted from an environmental sample and genes from total community DNA (usually 16S rRNA genes) are amplified via PCR. One or both of the (forward and reverse) primers are labeled with a fluorescent dye. The resulting terminally labeled PCR products (amplicons) are digested with restriction enzymes that recognize specific cutting sites. These sites are located at different positions on the amplicons due to differences in gene sequences among microorganisms. Thus, the enzyme digestion generates fluorescently labeled terminal restriction fragments (TRFs) of various sizes (or lengths in base pairs). The sizes and abundances of the TRFs are determined using an automated DNA sequencer, where each TRF is represented as a peak on the electropherogram. The electropherogram is the graphical representation of the TRFLP profile, and the profile itself is a list of TRFs (a.k.a. peaks) in order of their base pair length and their abundance in relative fluorescent units. Both TRF height and area, for each peak, are provided in the profile dataset. Since the estimated sizes of TRFs from the same phylotype differ slightly due to run-to-run variability, TRFLP profiles are aligned across runs before further data analysis (Dunbar et al. 2001). Absolute abundances of TRFs are often converted to relative abundances, expressed as percentages of the total peak height or peak area for all TRFs in a sample. Normalization in this way is necessary to adjust for slight variations in the amount of DNA loaded onto the sequencer. If PCR bias is a concern, TRFLP data can be converted, prior to further analysis, to a simple binary format based on TRF presence or absence (Cao et al. 2006).
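As a rough illustration of the two data treatments just described (relative abundance and presence/absence), the following Python sketch processes two TRFLP profiles that are assumed to be already aligned. The TRF sizes, peak heights, and function names are hypothetical and chosen only for the example.

```python
def relative_abundance(profile):
    """Express each TRF peak as a percentage of the total signal in the sample."""
    total = sum(profile.values())
    if total <= 0:
        raise ValueError("profile has no signal")
    return {size: 100.0 * height / total for size, height in profile.items()}


def presence_absence(profile, all_sizes):
    """Binary profile: 1 if the TRF was detected in the sample, else 0."""
    return [1 if profile.get(size, 0) > 0 else 0 for size in all_sizes]


# Hypothetical aligned profiles: TRF size (bp) -> peak height (fluorescence units).
water = {72: 300.0, 145: 1200.0, 490: 500.0}
sewage = {72: 150.0, 145: 600.0, 220: 250.0}

water_pct = relative_abundance(water)

# Use the union of TRFs so both samples share the same columns.
all_sizes = sorted(set(water) | set(sewage))
water_binary = presence_absence(water, all_sizes)
sewage_binary = presence_absence(sewage, all_sizes)
```

Peak area could be substituted for peak height without changing the code. The binary form discards abundance entirely, which is the point when PCR bias makes peak magnitudes unreliable.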
11.2.1.3 Denaturing Gradient Gel Electrophoresis (DGGE)
Denaturing gradient gel electrophoresis (DGGE) also generates community fingerprints based on profiling PCR products of a specific marker gene. However, the profiling is achieved based on separating the PCR products in a gel formulated with a chemical denaturant such as urea (Muyzer and Smalla 1998). Specially designed primers with a GC clamp are used in PCR so that the GC clamp can hold together the two separated strands of amplicon during denaturation. Briefly, as the PCR products are separated by electrophoresis, they denature in the gel due to exposure to the chemical denaturant. Denaturation converts easily electrophoresed double-stranded DNA into DNA whose migration is sterically hindered by its denatured content. Since sequence differences cause different melting behaviors, the timing and extent of denaturation of the PCR products differ along the gradient, which in turn determines how fast the various PCR products migrate in the gel. A community profile results from the pattern of separately migrated bands where each band represents a different organism or group. A similar technique is known as temperature gradient gel electrophoresis (TGGE) where a temperature gradient is used for denaturation.
After electrophoresis, the gel is stained with a nucleic acid-binding dye and photographed, resulting in an image of the banding pattern that is often imported into computer software for band density analysis (Esseili et al. 2008). Depending on the crispness of the separation, bands can also be excised, PCR-amplified, and sequenced for phylogenetic identification.
11.2.1.4 Cloning and Sequencing
A higher degree of phylogenetic resolution for community analysis is achieved by obtaining the actual sequence of a marker gene (or a metagenome, Fig. 11.4) through cloning followed by sequencing (Nocker et al. 2007) or direct sequencing (Shendure and Ji 2008). Here, a metagenome refers to the entire collection of genetic material recovered directly from environmental samples. During the cloning process, PCR products (or fragmented community DNA, i.e., the pieces of the metagenome) are inserted into plasmid vectors that are then transformed into Escherichia coli. The plasmids harboring specific PCR products (from specific organisms, presumably) are then multiplied by growing the transformed E. coli cells. The insertions are harvested by plasmid preparation and are subsequently sequenced. After quality check and trimming of the vector sequence (i.e., alignment), the marker gene sequences can be compared to sequence databases (Maidak et al. 2001) for phylogenetic identification. Computer programs used for sequence quality check, alignment, and comparison are widely available. Phylogenetic trees can be constructed to reveal similarities between sequences (http://rdp.cme.msu.edu/treebuilder/treeing.spr).
11.2.1.5 PhyloChip Microarray
Microarrays are high-throughput devices that allow for simultaneous detection of multiple DNA fragments (Bodrossy and Sessitsch 2004; Andersen et al. 2010). Detection is based on strand complementation and hybridization of the fluorophore-labeled target DNA with the probes representing known DNA sequences fixed on the array. After washing away the unbound target DNA fragments, the array is scanned at defined excitation wavelengths to image the bound, fluorophore-labeled DNA fragments. The locations of the probes with hybridized target DNA on the array indicate the presence of specific nucleic acid sequences in the query sample. The fluorescence intensity can also be used to quantify the relative abundance of the target when compared to the same probe on a separate array. However, different probes cannot be directly compared to each other due to differences in GC content and hybridization efficiency.
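The same-probe comparison across arrays can be sketched as a per-probe log ratio. The Python below is a minimal illustration, not a description of how any particular array platform processes data: the probe names and intensities are hypothetical, the arrays are assumed to be background-corrected and normalized to each other beforehand, and the `floor` parameter is an assumption added to avoid dividing by zero.

```python
import math


def log2_fold_change(intensity_a, intensity_b, floor=1.0):
    """Per-probe log2 ratio of array B to array A for probes present on both."""
    changes = {}
    for probe, a in intensity_a.items():
        if probe not in intensity_b:
            continue
        b = intensity_b[probe]
        # Each probe is compared only with itself on the other array.
        changes[probe] = math.log2(max(b, floor) / max(a, floor))
    return changes


# Hypothetical intensities for three probes on two separate arrays.
upstream = {"probe_1": 200.0, "probe_2": 800.0, "probe_3": 50.0}
downstream = {"probe_1": 800.0, "probe_2": 800.0, "probe_3": 25.0}

changes = log2_fold_change(upstream, downstream)
```

The sketch deliberately never compares `probe_1` with `probe_2`, since intensities of different probes are not directly comparable.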
The PhyloChip (G2, i.e., the second generation) is an Affymetrix (Santa Clara, CA)-platform microarray designed by researchers at the Lawrence Berkeley National Laboratory (Berkeley, CA). This high-density custom microarray encodes all known Archaea and Bacteria 16S rRNA sequences found in the 2004 public databases, and can identify up to 8,741 operational taxonomic units (OTUs) (Brodie et al. 2006; DeSantis et al. 2006, 2007). Currently, a G3 PhyloChip is being developed, and the probes are designed based on all 16S rRNA sequences available in the 2007 public databases. The G3 PhyloChip can detect up to ∼60,000 OTUs and select pathogen-specific genes. Additionally, the probes on the G2 and G3 PhyloChips are continually re-annotated based on the most current database information.
11.2.2 Applicability and Demonstration of Community Analysis Approaches for MST
Applicability of the various community analysis methods to specific types of questions differs because of their technological differences, state of method development and evaluation, and logistics. The demonstrated and potential usages of the methods for MST also vary.
PLFA analysis has been widely used to study changes in microbial community structure along environmental gradients or in response to environmental perturbations (Frostegard et al. 1993; Macalady et al. 2000; Kaur et al. 2005). Since phenotypic adaptation reflects the microorganisms’ particular habitat, including host intestine and other sources, PLFA is potentially a useful tool for MST. As it mostly reflects living or recently living cells, PLFA may have the added benefit of detecting recent pollution sources, but not older ones. However, extensive knowledge about fatty acid patterns is generally required for interpreting the significance of specific fatty acids or fatty acid groups and for most efficient usage of PLFA data. Accessible databases for relating taxa or environmental stresses with fatty acid patterns are not available, and such interpretation often relies on a researcher’s experience or familiarity with the PLFA literature. Furthermore, PLFA extraction methods influence the types of fatty acids recovered from a sample, and some extraction protocols may liberate fatty acids from nonliving organic matter, in which case PLFA composition could reflect more than the living microbial community (Zelles 1999). Nonetheless, these potentially confounding issues may be alleviated by commercial service laboratories that offer standardized PLFA analysis. Total PLFA abundance and PLFA profiles can be useful for tracking overall biomass and microbial community changes in MST studies without specifically focusing on individual fatty acids or fatty acid groups. For example, similarity and dissimilarity of the PLFA profiles from different samples were utilized to rule out kelp, but imply beach sand, as sources of fecal contamination to beach water (Izbicki et al. 2009).
For genotypic methods based on genes encoding 16S rRNA, abundant sequence data have been generated and are accessible in large databases such as the Ribosomal Database Project (RDP) (Maidak et al. 2001) and the Greengenes Database (greengenes.lbl.gov). Ever-growing databases are increasingly accessible because of developments in computational biology and bioinformatics that provide new and better tools for data handling. Performing most community analysis methods does require specialized equipment and expertise, but many genomic facilities can provide such services. However, it is important to be aware that, like culture-based or phenotypic methods, genotypic methods have their share of technical shortcomings arising from variations in DNA extraction efficiency, PCR bias, and/or sequencing accuracy and comprehensiveness.
Application of genotypic community analysis methods in MST has included: (1) identification of source-specific species as candidates for developing source-specific single indicators and (2) differentiation and/or tracking source of pollution based on the similarity of microbial community profiles from potential sources and sinks. Numerous studies have characterized microbial communities associated with human or animal feces using community analysis methods (Zoetendal et al. 2004). Although MST was not the objective, these studies provided abundant information regarding the host specificity of microbial communities and factors that affect such specificity (Ley et al. 2008); such studies also demonstrated the potential of community analysis for MST (Li et al. 2007). The following paragraphs discuss the application of TRFLP, cloning and sequencing, and PhyloChip in MST.
Since it is a high-throughput, sensitive, and reproducible approach whose data are readily amenable to quantitative statistical comparisons, TRFLP has been frequently used to analyze communities from a wide range of environments, including feces and digestive tracts of insects and mammals, and to characterize microbial community responses to environmental changes (for review, see Thies 2007; Schütte et al. 2008). Furthermore, because of its popularity in microbial ecology, an abundance of literature and many automated data processing software applications are available for adapting TRFLP for MST studies (Kent et al. 2003; Shyu et al. 2007).
TRFLP has been used successfully to develop source-specific single indicators. For example, highly reproducible, host-specific TRFLP patterns were identified in microbial communities from human and cow feces using primers specific to the Bacteroides–Prevotella group (Bernhard and Field 2000a), and subsequent cloning and sequencing of such source-specific TRFs led to designing of human- and cow-specific single indicator (q)PCR assays for MST (Bernhard and Field 2000b; Field et al. 2003a). More recently, TRFLP was used to find a poultry-specific Brevibacterium marker (Weidhaas et al. 2010). Studies also employed TRFLP to differentiate fecal sources based on overall community similarity. TRFLP was first shown to be successful in distinguishing deer fecal samples from sands while demonstrating high similarity between microbial communities in two discrete piles of deer fecal pellets (Clement et al. 1998). Using universal eubacterial primers, TRFLP analysis also clearly differentiated microbial communities from cattle feces, dog feces, and sewage (LaMontagne et al. 2003b). In a more comprehensive study, TRFLP was employed to analyze the Bacteroides–Prevotella community in multiple (10–50) fecal samples from each of nine host species (cattle, chicken, deer, dog, geese, horse, humans, pig, and seagulls) from different geographical locations and times of year (Fogarty and Voytek 2005). While no single TRF was identified as exclusive to a host species, and the previously identified cow- and human-specific TRFs (Bernhard and Field 2000a) were not resolved from their respective sources (Fogarty and Voytek 2005), the Bacteroides–Prevotella TRFLP community profiles were highly reproducible and much more similar within host species as compared to between host species. Attempts to identify sources using single TRFs or total community similarity for mixed-source samples, however, were less successful, perhaps due to biomass dominance from one source (Field et al. 2003b) or higher redundancy in TRFs from eubacterial primers (Liu et al. 1997). It is possible that a combination of TRFs, analyzed as a subset of the consortium, would be more useful for source identification in mixed-source samples. More recent studies utilized TRFLP to conduct source tracking in defined watersheds. Potential pollutant transport pathways were identified via similarity analysis of TRFLP community profiles from different sampling locations (Ibekwe et al. 2008). A case study using this approach will be discussed in a later section of this chapter.
DGGE has been widely used to assess diversity and to monitor dynamics of microbial communities (Ercolini 2004; Dorigo et al. 2005). Although DGGE analysis is often confined to PCR products with limited lengths (<400 bp), it can differentiate a single base pair difference in sequence fragments. Compared to expensive sequencers needed for TRFLP and sequencing, equipment for DGGE analysis is affordable for ordinary laboratories; however, DGGE is technically demanding. Methodological concerns such as inaccuracy and low reproducibility of band patterns, low sensitivity and lack of reliable quantification of band intensity, and in particular the need to optimize experimental conditions may also hinder its application (Dorigo et al. 2005). Nevertheless, DGGE applied to a 126 bp fragment of the β-D-glucuronidase gene (uidA) was successful in discriminating among E. coli phylotypes using DNA from cultured isolates, DNA from mixed culture-enriched E. coli populations, and community DNA extracted directly from environmental samples. Little difference in the DGGE patterns was observed for the latter two DNA sources, indicating that the culture enrichment step may be bypassed (Farnleitner et al. 2000). More recently, DGGE based on enriched E. coli cultures indicated similar E. coli populations for samples originating from the same sampling site (Sigler and Pasutti 2006). A more comprehensive study evaluated the applicability of 15 marker genes for use with DGGE for MST, and three genes (mdh, phoE and uidA) were identified to provide good discrimination among horses, pigs, and goats (Esseili et al. 2008). DGGE profiles from these three genes indicated greater E. coli population similarity (98–100%) between wastewater treatment plant (WWTP) effluents and downstream water samples, and lower similarity between upstream and downstream/effluent samples, providing strong evidence for a dominant contamination source from the WWTP. However, source attribution was less successful for contamination in a pond, presumably due to mixed sources from urban runoff in addition to goose feces deposition (Esseili et al. 2008).
Cloning followed by DNA sequencing has been a widely used tool in molecular microbial ecology from its inception. Although cloning and sequencing offers high phylogenetic resolution, the method is laborious, time consuming, and costly for routine usage. Also, rarely do clone libraries provide complete coverage of entire microbial communities. However, technology advancement in automation and parallel sequencing may greatly improve its speed and lower its cost while also improving its comprehensiveness (Shendure and Ji 2008). Since it provides actual sequence data, cloning followed by sequencing is often employed in developing and evaluating single indicator-based (q)PCR assays. For example, a library of genes encoding 16S rRNA extracted from gull feces revealed the abundance of a sequence closely related to Catellicoccus marimammalium, which was used successfully to develop a gull-specific, SYBR green (q)PCR assay (Lu et al. 2008). Libraries of Bacteroidales genes encoding 16S rRNA extracted from the feces of eight hosts revealed ruminant-, pig-, and horse-specific clusters of sequences, while human, dog, cat, and gull Bacteroidales communities shared greater similarities (Dick et al. 2005). The host-specific sequences were used to design PCR assays specific to pig and horse fecal matter. The analysis of other Bacteroidales clone libraries comprised of genes encoding 16S rRNA extracted from gull, goose, canine, raccoon, and sewage sources revealed concerns regarding instability of source identification assays against geographic or host individual differences (Jeter et al. 2009).
In addition to developing and evaluating single indicator assays, multiple clone libraries from potential contributing sources and environmental samples have also been developed to link pollution sources with environmental sinks. Bacteroidales clone library analysis revealed high similarity between cattle feces and water sample clone libraries, confirming cattle fecal pollution in a small watershed (Lamendella et al. 2007). However, clone libraries constructed from a horse manure pile and water samples from upstream and downstream of the manure pile using both universal eubacterial primers and Bacteroides group-specific primers showed little similarity between microbial communities from the manure pile and the downstream water samples, even though the water at 5 m downstream was visibly contaminated (Simpson et al. 2004). The authors offered two explanations: (1) downstream water was contaminated with the recently deposited surface material of the manure pile which harbored a different microbial community than the older interior manure pile from which the clone library had been constructed, (2) universal eubacterial primers do not offer sufficient sensitivity to detect manure pollution at the dilution level in the study sites. Direct shotgun cloning and sequencing of community DNAs (i.e., metagenomic analysis) was used to characterize a viral community in human feces (Breitbart et al. 2003); however, its direct application in MST has been limited at this time.
Phylogenetic microarrays such as the PhyloChip, which targets the currently known diversity within bacteria and archaea, have been employed to determine the composition of microbial communities in a number of different environments and conditions. When the PhyloChip microarray was applied to urban aerosols, the spatio-temporal distributions of known bacterial groups, including specific pathogens, were determined to be related to meteorologically driven transport processes as well as sources (Brodie et al. 2007). This microarray has been extensively validated and successfully used on a number of complex environmental samples, and the resulting findings have been confirmed by additional methods, including qPCR and 16S rRNA gene clone libraries (Brodie et al. 2006; Flanagan et al. 2007; Chivian et al. 2008; Tsiamis et al. 2008; Wrighton et al. 2008; Cruz-Martinez et al. 2009; DeAngelis et al. 2009; Sagaram et al. 2009; Sunagawa et al. 2009; Yergeau et al. 2009; Rastogi et al. 2010; Wu et al. 2010). Studies using split samples have confirmed that >90% of all 16S rRNA sequence types identified by the more expensive clone library method are also identified by the PhyloChip (DeSantis et al. 2007). In addition, the PhyloChip has demonstrated several-fold increases in detected microbial diversity over the clone library method and metagenomic sequencing with second-generation sequencers. One of the reasons for this is the high sensitivity of the PhyloChip, with the ability to detect organisms present at a proportional fraction of less than 10^-4 abundance compared to the total sample (La Duc et al. 2009). Each sample analysis by the PhyloChip provides detailed information on microbial composition, and the highly parallel and reproducible nature of this array also allows tracking community dynamics over time and treatment. With no prior knowledge, specific microbial taxa may be identified in urban watersheds that are keys to human-associated fecal influence.
The PhyloChip is ideal for characterizing complex microbial communities, and its application for MST is currently being investigated. The comprehensiveness and sensitivity of the PhyloChip allows for better characterization of low-abundance organisms, leading to improved description of microbial diversity (La Duc et al. 2009; Sagaram et al. 2009). The reproducibility of the PhyloChip data on microbial community composition provides the opportunity to obtain results with high levels of statistical confidence (Brodie et al. 2006; DeSantis et al. 2007). A case study with PhyloChip-analyzed bacterial communities from an urban creek with known fecal pollution is discussed in a later section.
11.3 Multivariate Data Analysis, Interpretation, and Presentation
11.3.1 Why Multivariate Techniques?
Multivariate analysis involves simultaneous analysis of multiple, often correlated, variables. Multivariate analysis of community profiles has been developed and is routinely used by ecologists who study animals or plants, yet the application of such tools has been limited in microbial ecology, both in terms of frequency and choice of multivariate methods (Ramette 2007). However, multivariate tools are necessary to analyze the multivariate datasets generated from community analysis-based methods for MST.
Datasets generated by microbial community analysis methods usually contain rows representing samples or sites and columns representing OTUs. OTUs can be fatty acids from PLFA, TRFs from TRFLP, gel band identifiers from either DGGE or TGGE, or sequences or species from clone libraries, direct sequencing, or PhyloChip analyses. Although each OTU could be treated as a single variable and analyzed separately by univariate statistical methods, separate univariate analyses are logistically difficult because there are hundreds to tens of thousands of columns in a community profile. They are also scientifically undesirable because microbial communities evolve and adapt together, so these variables are not independent. Source–sink relationships that are not revealed when a single OTU is evaluated can be distinguished when a consortium of OTUs is analyzed simultaneously in an integrated fashion; this is the rationale for using community-based analysis for microbial source tracking (see Sect. 11.1).
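To make the samples-by-OTUs layout concrete, the sketch below builds a small hypothetical table and compares whole profiles pairwise. Bray-Curtis dissimilarity is used purely as an assumed example of a whole-profile measure (the text above does not prescribe one), and the sample names and abundances are invented for illustration.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors (0 = identical)."""
    if len(x) != len(y):
        raise ValueError("profiles must cover the same OTUs")
    total = sum(x) + sum(y)
    if total == 0:
        return 0.0
    return sum(abs(a - b) for a, b in zip(x, y)) / total


# Hypothetical table: rows are samples, columns are OTU relative abundances (%).
otus = ["OTU_1", "OTU_2", "OTU_3", "OTU_4"]
table = {
    "sewage": [50.0, 30.0, 20.0, 0.0],
    "creek_downstream": [40.0, 30.0, 20.0, 10.0],
    "creek_upstream": [0.0, 10.0, 20.0, 70.0],
}

# Compare every pair of samples using all OTU columns at once.
names = list(table)
dissimilarity = {
    (a, b): bray_curtis(table[a], table[b])
    for i, a in enumerate(names)
    for b in names[i + 1:]
}
```

In this invented dataset OTU_3 is identical in all three samples and would say nothing on its own, whereas the full profile places the downstream sample much closer to sewage than to the upstream sample.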
Many applications of MST are closely tied to TMDL assessment, which is currently based on FIB concentrations (Santo Domingo et al. 2007), and it is often desirable to evaluate correlations between MST and FIB concentration data. Correlating multivariate community profiles with FIB concentrations can only be done through multivariate statistics such as direct gradient analysis (see Sect. 11.3.2). Furthermore, when the goal is to discover OTUs that are indicative of sources, multivariate ordination techniques are more efficient than manually counting OTUs that are shared among samples from the same source.
11.3.2 Selection of Multivariate Techniques and Results Interpretation
Common multivariate techniques for the examination of microbial community structure include cluster analysis, principal components analysis (PCA), correspondence analysis (CA), and nonmetric multidimensional scaling (NMDS). These techniques belong to a group called indirect gradient analysis, which aims to reveal community similarities among sites or samples by grouping or ordering the sites or samples either into dendrograms or on a two-dimensional (2D) or three-dimensional (3D) plot. Direct gradient analysis such as canonical correspondence analysis (CCA), in contrast, aims to correlate the overall multivariate community profile with environmental variables or FIB concentrations. More details on each individual technique can be found in the review by Ramette (2007) and the references therein. However, methods differ in when and how they should be used, and proper selection of the methods is a very important first step in data analysis. Selection of the multivariate methods must be based on data type (binary, compositional, or abundance data), analysis objective, and strengths and limitations of the various multivariate methods (Ramette 2007). Standard statistical software such as R (R Core Development Team 2008) and SAS can be programmed to run multivariate analysis. Specialized multivariate software packages are also available: CANOCO (Microcomputer Power, Ithaca, NY), PC-ORD (MjM Software Design), and Primer (Primer-E Ltd., Plymouth Marine Laboratory, UK).
Microbial ecologists are likely most familiar with cluster analysis, which is historically the basis for constructing phylogenetic trees that reveal similarities between sequences (e.g., OTUs in clone libraries). When discovering similarities between sites or samples is the goal, cluster analysis essentially groups sites or samples according to a similarity coefficient based on OTU data, and its interpretation is mostly intuitive: samples or sites grouped in the same cluster are similar to each other (Ramette 2007). However, because cluster analysis forces the formation of clusters, this method is most appropriate when groupings (i.e., discontinuous changes) of sites or samples are expected, such as when samples are from different known or suspected sources (e.g., animals, sewage, etc.) (Legendre and Legendre 1998). Cluster analysis is not appropriate when changes in communities are either continuous or gradual and discrete groupings are not expected, such as when samples are from upstream to downstream sites.
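The grouping logic of cluster analysis can be sketched in a few lines of Python. The example below assumes Bray-Curtis dissimilarity as the coefficient and average linkage as the merging rule; both are illustrative choices rather than methods prescribed in this chapter, and the fecal profiles are hypothetical relative abundances.

```python
from itertools import combinations

# Hypothetical relative-abundance profiles (four OTUs) for six fecal samples.
profiles = {
    "dog_1": [0.60, 0.30, 0.05, 0.05],
    "dog_2": [0.55, 0.35, 0.05, 0.05],
    "cat_1": [0.10, 0.20, 0.60, 0.10],
    "cat_2": [0.15, 0.15, 0.60, 0.10],
    "septage_1": [0.05, 0.10, 0.15, 0.70],
    "septage_2": [0.05, 0.05, 0.20, 0.70],
}


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: 0 for identical profiles, 1 for no shared OTUs."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def average_linkage(data, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters."""
    clusters = [[name] for name in data]

    def dist(c1, c2):
        pairs = [(p, q) for p in c1 for q in c2]
        return sum(bray_curtis(data[p], data[q]) for p, q in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so index i stays valid
        del clusters[j]
    return [sorted(c) for c in clusters]


clusters = average_linkage(profiles, n_clusters=3)
print(clusters)
```

Note that the caller must request a number of clusters (or cut a dendrogram at some height); the algorithm will always return groups, which is why the method suits discontinuous data such as distinct hosts better than gradual upstream-to-downstream change.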
PCA is a frequently used multivariate technique partially due to its elegant mathematical algorithms (ter Braak 1995); however, it may also be the least appropriate method for analyzing microbial community profiles. A basic assumption for PCA is that OTUs respond to environmental conditions (i.e., environmental gradients) in a linear fashion, which is rarely true because most species have an optimal environmental condition or ecological niche and their response curves to environmental conditions are more similar to unimodal models (ter Braak 1986). However, when the environmental gradient is very short, the unimodal response curve may appear linear, and PCA may be appropriate to use. For example, most bacteria prefer a neutral pH and therefore exhibit a unimodal response to pH; yet, the response may be considered linear if the environmental pH conditions present a very short gradient ranging only from 5.5 to 6.0. Still, the absence of many OTUs in some sites or samples is a clear sign that the linear approximation is not valid and PCA is not an appropriate method. Improper usage of PCA may cause an artifact called the “horseshoe effect” where sites or samples are positioned on the 2D PCA plot resembling the shape of a horseshoe; these positions do not represent either similarity or dissimilarity between sites or samples (Palmer 2006).
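The pH example can be made concrete with a short numerical sketch. It assumes a Gaussian (unimodal) response curve with an optimum at pH 7 and a tolerance of 1 pH unit; these parameters, and the function names, are hypothetical and serve only to contrast a short gradient with a long one.

```python
import math


def gaussian_response(ph, optimum=7.0, tolerance=1.0, maximum=100.0):
    """Hypothetical unimodal abundance response of one OTU to pH."""
    return maximum * math.exp(-((ph - optimum) ** 2) / (2 * tolerance ** 2))


def pearson(xs, ys):
    """Pearson correlation coefficient, a measure of linear association."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


short_ph = [5.5 + 0.05 * i for i in range(11)]  # short gradient: pH 5.5 to 6.0
long_ph = [4.0 + 0.5 * i for i in range(13)]    # long gradient: pH 4.0 to 10.0

r_short = pearson(short_ph, [gaussian_response(p) for p in short_ph])
r_long = pearson(long_ph, [gaussian_response(p) for p in long_ph])

print(f"short gradient r = {r_short:.4f}")
print(f"long gradient  r = {r_long:.4f}")
```

Under these assumptions the short gradient gives a correlation close to 1, so a linear model is a reasonable approximation, whereas the long gradient spanning the optimum gives a correlation near 0 even though abundance depends strongly on pH.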
CA assumes unimodal species response curves which are more appropriate for analyzing ecological data such as microbial community profiles (ter Braak 1986). CA is also considered a flexible method in that it can accommodate a dataset even when the underlying gradient is short, and thus linear, as long as the composition data (i.e., relative abundance in percentages) instead of absolute abundance data are analyzed (ter Braak and Smilauer 2002). CA results are generally displayed in a 2D plot called a “joint plot” where both sites and OTUs can be displayed as points on the plot, or a sample scatter plot where only sites are displayed. Community similarity between sites is indicated by close proximity of the site positions on the plot. The strong association of OTUs to certain sites (or sources) is implied by close proximity of the OTUs to the sites (or sources), and this can be used to reveal indicative OTUs for developing source-specific qPCR assays. Similar to the “horseshoe effect” in PCA, CA sometimes suffers an artifact called the “arch effect” where positioning of sites along the secondary axis could be arbitrary such that it resembles an arch. Removing the arch effect is achieved by a process called detrending and hence the term detrended correspondence analysis (DCA). Note that the axes on CA plots are meaningful, as they represent latent variables or gradients such as the distance to a point source (e.g., a storm drain discharge). This is useful for discovering trends and potential microbial contamination sources.
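A minimal sketch of how CA places sites and OTUs on a shared axis is given below. It uses reciprocal averaging (two-way weighted averaging), a classical iterative route to the first CA axis, rather than the eigenanalysis used by packages such as CANOCO; the count table, site labels, and function name are hypothetical.

```python
# Hypothetical OTU counts: rows are sites along a creek, columns are OTUs.
sites = ["site_9", "site_7", "site_5", "site_3", "site_1"]
table = [
    [8, 4, 0, 0],
    [5, 6, 2, 0],
    [1, 5, 5, 1],
    [0, 2, 6, 4],
    [0, 0, 3, 9],
]


def ca_first_axis(counts, iterations=200):
    """First correspondence analysis axis by reciprocal averaging."""
    n_sites, n_otus = len(counts), len(counts[0])
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(counts[i][j] for i in range(n_sites)) for j in range(n_otus)]
    grand = sum(row_tot)

    def site_from_otu(otu_scores):
        # A site score is the abundance-weighted average of its OTU scores.
        return [
            sum(counts[i][j] * otu_scores[j] for j in range(n_otus)) / row_tot[i]
            for i in range(n_sites)
        ]

    otu_scores = [float(j) for j in range(n_otus)]  # arbitrary starting scores
    for _ in range(iterations):
        site_scores = site_from_otu(otu_scores)
        # An OTU score is the abundance-weighted average of the site scores.
        otu_scores = [
            sum(counts[i][j] * site_scores[i] for i in range(n_sites)) / col_tot[j]
            for j in range(n_otus)
        ]
        # Center and rescale (weighted by OTU totals) so scores do not collapse.
        mean = sum(w * u for w, u in zip(col_tot, otu_scores)) / grand
        sd = (sum(w * (u - mean) ** 2 for w, u in zip(col_tot, otu_scores)) / grand) ** 0.5
        otu_scores = [(u - mean) / sd for u in otu_scores]
    return site_from_otu(otu_scores), otu_scores


site_scores, otu_scores = ca_first_axis(table)
for name, score in zip(sites, site_scores):
    print(f"{name}: {score:+.3f}")
print("OTU scores:", [round(u, 3) for u in otu_scores])
```

Because site and OTU scores lie on the same axis, an OTU whose score falls close to a group of sites is associated with those sites, which is the reasoning behind reading indicative OTUs from a joint plot. The sign of the axis is arbitrary.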
NMDS positions sites, or samples, into a 2D (or 3D) plot in a way that the ranks of dissimilarity between these sites are preserved as well as possible, much like positioning cities on a 2D map where the relative distances, or ranks of distances, between those cities along the Earth’s spherical surface are preserved (Clarke and Warwick 2001). Therefore, sites in closer proximity to each other on an NMDS plot have more similar community profiles than those that are further apart on the plot. However, the distances between sites on the NMDS plot do not reflect the original dissimilarity in community profiles between those sites because only the ranks of the dissimilarity are preserved. NMDS has gained popularity because it assumes neither linear nor unimodal species response curves, does not produce “horseshoe” or “arch” artifacts, and is intuitive to interpret. Drawbacks of NMDS include its great sensitivity to dissimilarity measures, i.e., distance metrics, which must be specified a priori by the user. NMDS also cannot simultaneously display sites and OTUs on one plot; thus, associations between sites and OTUs may not be revealed (Palmer 2006). Therefore, if identifying site- or source-specific OTUs is the objective of the analysis, NMDS would not be the method of choice. NMDS is most useful for quickly assessing relative (dis)similarity among sites, or samples.
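Because NMDS works only from the rank order of pairwise dissimilarities, the choice of dissimilarity measure determines its input. The sketch below does not implement NMDS itself; it only compares the rank order produced by two common measures, Bray-Curtis (abundance-based) and Jaccard (presence/absence), on three hypothetical samples.

```python
from itertools import combinations

# Hypothetical OTU abundances for three samples.
samples = {
    "A": [90, 10, 0, 0],
    "B": [10, 90, 0, 0],
    "C": [70, 20, 10, 0],
}


def bray_curtis(a, b):
    """Abundance-based dissimilarity."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den if den else 0.0


def jaccard(a, b):
    """Presence/absence dissimilarity: 1 - shared OTUs / all OTUs observed."""
    present_a = {i for i, v in enumerate(a) if v > 0}
    present_b = {i for i, v in enumerate(b) if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


def ranked_pairs(measure):
    """Sample pairs ordered from most similar to least similar."""
    pairs = combinations(sorted(samples), 2)
    return sorted(pairs, key=lambda p: measure(samples[p[0]], samples[p[1]]))


bc_order = ranked_pairs(bray_curtis)
jac_order = ranked_pairs(jaccard)
print("Bray-Curtis ranking:", bc_order)
print("Jaccard ranking:    ", jac_order)
```

With these values, Bray-Curtis ranks A and C as the most similar pair, whereas Jaccard ranks A and B as most similar because they contain exactly the same OTUs. An NMDS plot built from either ranking would therefore arrange the same samples differently.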
While indirect gradient analysis is generally considered exploratory, direct gradient analysis offers a means of specific hypothesis testing (ter Braak and Prentice 1988). A popular direct gradient analysis technique is CCA, which is an extension of CA and therefore shares the advantages and disadvantages of CA. CCA has been used to test whether community changes are influenced by environmental variations such as inundation (Cao et al. 2006). In the case of MST, for instance, CCA could be useful for testing whether community profiles correlate with FIB concentrations or with days after a sewage spill. An extension of CCA is partial canonical correspondence analysis (pCCA), where influences from a covariable can be excluded before evaluating effects on community profiles from another environmental variable. For example, pCCA was successfully used to identify the correlation between denitrifying community changes and heavy-metal contamination after adjusting for influences of dissolved carbon (Cao et al. 2008). pCCA is potentially useful for MST when the effects of geographic location need to be accounted for before correlating changes in microbial community with the magnitude of a microbial pollution source, for instance, the volume of a sewage spill.
11.4 Two Case Studies
11.4.1 A Case Study Using TRFLP
This case study examined bacterial communities using TRFLP during dry weather flow in the Arroyo Burro watershed (Santa Barbara, CA) where elevated FIB concentrations and human-specific Bacteroides markers were previously reported (Sercu et al. 2009). A laboratory spike-in experiment for validating the TRFLP technique, and a field study for investigating pollution sources in the Arroyo Burro watershed, were conducted. For the spike-in experiment, fecal samples were collected from suspected fecal sources such as dog, cat, and human (e.g., septage). Dog feces were acquired from three healthy individuals of different breeds, each from a separate household. Cat feces were acquired from three healthy individuals of mixed breeds from two households. Septic solids, representing the composite material from several residential tanks, were obtained from a local pumping company (MarBorg, Santa Barbara). Relatively unimpacted creek water from a reference site in the watershed was collected in order to create spiked samples using the above fecal sources at various doses.
For the field study, nine sites in the lower Arroyo Burro watershed were selected. The nine sites (sites 9 to 1 from upstream to downstream) included a storm drain (site 9) discharging into the Arroyo Burro creek (sites 8 to 3) that flowed through the Arroyo Burro lagoon (site 2), and then into the ocean at Hendry’s Beach, CA (site 1). Water samples were collected from these sites on 3 consecutive days (August 2005) as described previously (Sercu et al. 2009). No rain occurred at least 48 h prior to or during sampling, and the creek flow rate was 0.013 m³ s⁻¹. Four sewage influent samples were also collected from the El Estero Wastewater Treatment Plant (Santa Barbara, CA) during the period October 2004–2005. Microbial communities were analyzed by TRFLP using universal primers targeting the domain Bacteria. Relative abundance TRFLP data were aligned and analyzed using DCA as before (Cao et al. 2006).
11.4.2 A Case Study Using PhyloChip
This case study examined bacterial communities during dry weather flow in the lower Mission Creek and Laguna watersheds (Santa Barbara, CA) where elevated FIB concentrations and human-specific Bacteroides markers were previously reported (Sercu et al. 2009). Communities from creek (including storm drains), lagoon, and ocean sites, along with three fecal samples of human origin, were analyzed by the G2 PhyloChip (Wu et al. 2010). Mission Creek and Laguna Channel flow through an urbanized area of downtown Santa Barbara and discharge at a popular bathing beach. As described previously (Sercu et al. 2009), water column samples from 3 consecutive days were collected during the dry season (June 2005) from nine locations within the Mission Creek and Laguna watersheds. No rain occurred at least 48 h prior to or during the sampling. The creek flow rate in Mission Creek averaged 0.016 m³ s⁻¹. Both watersheds discharged into the same lagoon and then flowed from the lagoon into the ocean.
11.5 Relationship of Community Analysis to Multiple Indicator Approaches
Similarly, methods commonly used for multiple indicators such as ratio and predictive modeling (Blanch et al. 2006) can also be applied to community analysis. In addition to using the overall community data, one can choose to reduce the data to a few OTU groups and investigate the ratios between groups as a source identification tool (e.g., as per the case study in Sect. 11.4.2), or to focus on several specific OTUs that may be indicative of specific sources (Bernhard and Field 2000a). Furthermore, common OTUs (or cosmopolitan OTUs, if any) can be removed from the overall community data, and the community profiles can be focused into a more selected dataset where the most predictive OTUs, which can be considered multiple indicators, are measured simultaneously in just one assay.
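The two data reductions described above, removing cosmopolitan OTUs and computing ratios between OTU groups, can be sketched as follows. The OTU names, group labels, and abundances are hypothetical, and the rule used here for "cosmopolitan" (detected in every sample) is an assumption made for illustration.

```python
import math

# Hypothetical assignment of OTUs to two groups.
otu_group = {
    "otu_1": "fecal_associated",
    "otu_2": "fecal_associated",
    "otu_3": "environmental",
    "otu_4": "environmental",
    "otu_5": "environmental",
}

# Hypothetical OTU abundances per sample.
abundance = {
    "sewage": {"otu_1": 50, "otu_2": 30, "otu_3": 0, "otu_4": 4, "otu_5": 16},
    "creek_impacted": {"otu_1": 20, "otu_2": 10, "otu_3": 30, "otu_4": 20, "otu_5": 20},
    "creek_reference": {"otu_1": 0, "otu_2": 0, "otu_3": 50, "otu_4": 0, "otu_5": 50},
}

# Assumed rule: an OTU is cosmopolitan if it is detected in every sample.
cosmopolitan = sorted(
    otu for otu in otu_group
    if all(profile[otu] > 0 for profile in abundance.values())
)


def group_ratio(profile, numerator, denominator, exclude=()):
    """Ratio of summed abundances of two OTU groups, skipping excluded OTUs."""
    totals = {numerator: 0, denominator: 0}
    for otu, value in profile.items():
        if otu not in exclude and otu_group[otu] in totals:
            totals[otu_group[otu]] += value
    if totals[denominator] == 0:
        return math.inf
    return totals[numerator] / totals[denominator]


ratios = {
    sample: group_ratio(profile, "fecal_associated", "environmental", exclude=cosmopolitan)
    for sample, profile in abundance.items()
}
print("cosmopolitan OTUs:", cosmopolitan)
print("fecal:environmental ratios:", ratios)
```

In this toy dataset the ratio falls from the sewage sample to the impacted creek sample to the reference sample, which is the kind of pattern a ratio-based source identification tool would look for.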
11.6 Summary and Future Directions of Community Analysis for MST
Ideally, MST methodology should include assays targeting specific pathogens that have been identified via epidemiology studies as public health risks in the context of recreational water use (Field and Samadpour 2007). However, to date, epidemiology studies include a very limited number of pathogen measurements. There are two reasons for such limitation. First, prior knowledge about relevant pathogens in a particular water body is often lacking. Second, the cost to perform a complete survey of all possible pathogenic indicators via many single measurements is prohibitive. Nevertheless, sole reliance on specific pathogens could be inadequate for MST, since detection of pathogens would depend both on the presence of fecal material and on the health status of surrounding human or animal populations. Community analyses provide a cost-effective alternative that offers many advantages, most notably: (1) comprehensiveness and relevancy, and (2) data density.
Comprehensiveness and relevancy refer to the inclusion of pathogens, fecal indicators, and other organisms when DNA or other biomarkers are fully extracted and analyzed from a water sample (Fig. 11.3). If a single marker is labile and its environmental fate ill-defined, simultaneous reliance on many singly or interactively relevant markers from a community can enable waste detection even in the absence of the yet-to-be-discovered single, robust marker. Furthermore, when identification of fecal source(s) is based on the entire microbial community, by default it is also based on tracing pathogens within that community (Fig. 11.3). Although resolution for particular pathogens differs with variations of the community analysis techniques (Sect. 11.2.1), the public health relevance of data acquired from community analysis is higher for MST when compared to data from a few individual markers whose transport characteristics and fates are unlikely to mimic fates exhibited by a majority of fecal pathogens.
Data density refers to the inherently multivariate nature of community analysis data, which are extremely versatile in how and what information may be extracted. In an MST study, the multivariate community analysis data also represent multiple lines of evidence, which, as in a trial by jury, increase the certainty of a water-quality diagnosis. While performing many different types of individual assays can also provide the needed multiple lines of evidence, e.g., by various chemical and biochemical host-specific markers, each requires separate procedures and expertise. An additional advantage provided by community analysis includes the capability of using community similarity between sites to conduct spatial source tracking (see the case study in Sect. 11.4.1). This advantage exists because community similarity analysis naturally combines data from source types and loading (Sercu et al. 2009), both of which are often needed for locating the source of contamination in MST studies. Finally, by using community analysis approaches more broadly in MST, more insights across numerous studies and geographical locations can be obtained to define additional individual markers of waste or to define the suite of individual markers within the overall community that best resolves sources of concern.
11.6.2 Critical Issues
Despite its advantages, there are technological, logistical, and implementation-related issues regarding usage of community analysis for MST.
Technologically, the sensitivity and resolution of certain community analysis methods may prevent source identification if the sources contributing to the receiving waters are very diluted or are of complex, mixed origins. Also, more biomass (or DNA) is often needed for community analysis than for a single qPCR assay. Temporal variation in source concentrations, a concern for MST studies in general, can add to the issue. It is, therefore, important to examine the characteristics and complexity of a watershed (or an MST study system) before: (1) selecting a community analysis technique, because the different techniques vary in their sensitivity and resolution (Sect. 11.2) as well as cost and feasibility, and (2) formulating a sampling design that may capture temporal and spatial variation in source contributions. Multiple community analysis techniques may also be used in one MST study to obtain more sensitivity and resolution in a cost-effective manner. For example, less expensive, high-throughput TRFLP may be applied to screen an entire watershed for “hot zones,” where more expensive cloning and sequencing or PhyloChip analysis can then be applied to obtain higher resolution source identification. While the efficiency of concentration (such as water sample filtration) and DNA extraction methods is important to the reliability of all molecular techniques in general, such concerns and PCR bias may matter less for community composition studies performed comparatively among samples or sites. However, such technical issues may hinder quantitative interpretation of community analysis results.
Fate (including die-off, persistence, and growth) and transport of microorganisms contributed from various fecal sources in the environment would also affect application of community analysis for MST. The microbial communities contained in fecal sources such as sewage undergo alterations when migrating through soil or groundwater. Wastewater changes biochemically when passed through reactive porous media (An et al. 2004) such as soil, so that nutrient (Hua et al. 2004; Stogbauer et al. 2004) and microbial (Hua et al. 2003) concentrations may change in predictable ways (Brinkmann et al. 2004). However, how microbial communities in fecal sources change during their mixing and migration in the environment is yet unknown. This is consistent with the state of knowledge for other, host-specific, individual markers (Field and Samadpour 2007) and warrants further research.
Logistically, as most methods need specialized equipment and expertise, community analyses are often more expensive than a PCR assay of a single indicator and are performed in (sometimes very specialized) research laboratories or commercial service laboratories. Routine use of such methods may be limited by the availability of expertise and by cost, although high demand for such analyses (from MST or other fields such as biomedical research) will drive the availability up and the cost down.
A common issue related to implementation in the MST field is the lack of standard protocols for performing MST studies. While lack of standardized sample processing and laboratory protocols is common to most MST methods, for community analysis-based methods there is an even greater need for standard protocols in data processing, analysis, and interpretation, particularly because of data complexity. Research aimed at developing standardized MST protocols for sampling design, sample processing (including DNA extraction), and data analysis is needed. Lastly, and largely of practical concern, is the potential difficulty for water-quality managers in communicating microbial community data to the public. There may be multiple dimensions to this issue, including the fact that the nonscientific community is generally unaware that microbial communities exist in nature, and thus could become unnecessarily alarmed by data that reveal the richness of microbial taxa in water even in the absence of fecal contamination. As with the public consumption of voluminous “personal genomic” data (from genetic testing for susceptibility to disease) for which full interpretation is lacking (Wright and Kroese 2010), there is a possibility for public misunderstanding and data misuse. The evolution of microbial community analysis in microbiological water quality from a research tool into a monitoring tool will require consideration of this and other issues described in this section.
11.6.3 Future Directions
Future directions for community analysis may include: (1) incorporation of these methods into epidemiology studies, (2) assisting in research on indicator persistence and survival, and (3) development of new indicators, customized community analysis, and automation. While advanced molecular technologies did not exist for early epidemiology studies, modern epidemiology studies often archive genetic samples for future analysis. These archived samples can be analyzed by community analysis such that comprehensive water-quality data can be correlated to human health data that are already collected. This would provide a means to discover pathogens or nonpathogenic indicators that are highly predictive of health risk.
Comprehensive community analysis is also used to study the succession of microbial communities along pollution gradients of sewage discharge (Zhang et al. 2009) or after sewage spills (Dubinsky et al. 2009), which can provide insight into indicator survival and persistence after a sewage source is introduced into receiving waters. Similar studies can be conducted on other fecal sources in other environments to provide similar information, which is greatly needed for revealing the age and contributions of various fecal sources in MST (Blanch et al. 2006).
Although limitations such as read length and complexities of data analysis still exist, the next-generation sequencing technologies and bioinformatics are advancing very rapidly to provide truly high-efficiency and low-cost sequencing in the near future (Shendure and Ji 2008). Sequencing the whole microbial community associated with each source covering diverse geographic locations and individuals would soon be feasible, which will lead to identification of more and better source-specific indicators for development of qPCR-based methods and specialized MST microarrays. Because of the versatility of the community analysis techniques, “customized” community analysis can be designed to provide higher sensitivity and resolution for MST. For example, high-throughput techniques such as TRFLP can be paired with more specific and variable marker genes such as functional genes to offer higher sensitivity by tuning down the background that is generated when universal 16S rRNA primers are used. While the PhyloChip is one form of custom microarray, other microarrays can also be designed (Shiu and Borevitz 2008) to specifically target pathogens and pathogenic genes. Source identification microarrays (i.e., “MST on a chip”) can be designed to include thousands of source-specific assays such that each sector on the microarray represents a particular pollution source (human, sewage, gulls, etc.). The “MST on a chip,” combined with automated data analysis software, may give a probability estimate of contributions from each source and enable fast diagnosis in a watershed. Ultimately, a high-throughput, close to real-time (within 6–12 h) pipeline (array or qPCR) for processing water samples and obtaining results may be developed for real-time source tracking.
The authors acknowledge the City of Santa Barbara and the Switzer Foundation for support, and the NSF-funded Santa Barbara Long Term Ecological Research project (NSF OCE 9982105 and OCE 0620276) for assistance including stream flow data in Santa Barbara, and the work of Laurie C. Van De Werfhorst and Bram Sercu in sampling, and sample and data processing for the AB and MC case studies herein. Other attributions for the Arroyo Burro and Mission Creek fecal source and sample acquisition plus analysis are as per Sercu et al. (2009). Part of this work was performed at Lawrence Berkeley National Laboratory under Department of Energy contract number DE-AC02-05CH11231.
- Andersen GL, He Z, DeSantis TZ et al. (2010) The use of microarrays in microbial ecology. In Environmental Molecular Microbiology. Liu W-T, and Jansson JK (eds). Caister Academic Press, Norfolk, UK, pp. 87–109.
- Braker G, Ayala-del-Rio HL, Devol AH et al. (2001) Community structure of denitrifiers, bacteria, and archaea along redox gradients in pacific northwest marine sediments by terminal restriction fragment length polymorphism analysis of amplified nitrite reductase (nirS) and 16S rRNA genes. Appl Environ Microbiol 67: 1893–1901.
- Clarke KR, and Warwick RM (2001) Change in marine communities: An approach to statistical analysis and interpretation, 2nd edition. PRIMER-E, Plymouth.
- Dubinsky E, Wu C, Hulls J et al. (2009) A complete microbial community approach to tracking fecal pollution in coastal waters. In State of the Estuary Conference. Oakland, CA.
- Eaton AD, Clesceri LS, Greenberg AE et al. (1998) Standard Methods for The Examination of Water and Wastewater. American Public Health Association, Washington, DC.
- Farnleitner AH, Kreuzinger N, Kavka GG et al. (2000) Simultaneous detection and differentiation of Escherichia coli populations from environmental freshwaters by means of sequence variations in a fragment of the beta-D-glucuronidase gene. Appl Environ Microbiol 66: 1340–1346.
- Ibekwe AM, Bold RM, Lyon SR et al. (2008) Microbial community composition in middle Santa Ana River watershed impacted by non-point source pollutants. In the 108th general meeting for American Society of Microbiology. Boston, MA.
- Izbicki JA, Swarzenski PW, Reich CD et al. (2009) Sources of fecal indicator bacteria in urban streams and ocean beaches, Santa Barbara, California. Annals of Environmental Sciences 3: 139–178.
- Kaur A, Chaudhary A, Kaur A et al. (2005) Phospholipid fatty acid - A bioindicator of environment monitoring and assessment in soil ecosystem. Curr Sci 89: 1103–1112.
- LaMontagne MG, Griffith JF, and Holden PA (2003b) Comparative analysis of animal fecal bacterial communities using terminal restriction fragment length polymorphisms of bacterial 16S rDNA PCR-amplified from fecal community DNA. In American Society for Microbiology General Meeting. Washington, DC.
- Leadbetter ER (1997) Prokaryotic diversity: form, ecophysiology, and habitat. In Manual of Environmental Microbiology. Hurst CJ, Knudsen GR, McInerney MJ et al. (eds). American Society for Microbiology, Washington, D.C., pp. 14–24.
- Legendre P, and Legendre L (1998) Numerical Ecology. Elsevier Science BV, Amsterdam.
- Liu W-T, and Jansson JK (eds) (2010) Environmental Molecular Microbiology. Caister Academic Press, Norfolk, U.K.
- National Research Council (1993) Managing Wastewater in Coastal Urban Areas. National Academy Press, Washington, D.C.
- Osborn AM, and Smith CJ (2005) Molecular Microbial Ecology. Taylor & Francis, New York, NY, USA.
- Palmer MW (2006) Ordination methods for ecologists: The ordination webpage. http://ordination.okstate.edu/. Accessed May 11, 2011.
- R Core Development Team (2008) R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL http://www.R-project.org.
- ter Braak CJF, and Smilauer P (2002) CANOCO reference manual and CanoDraw for windows user’s guide: Software for canonical community ordination (version 4.5). Microcomputer Power, Ithaca, NY, USA.
- U.S. Environmental Protection Agency (2005) Microbial Source Tracking Guide Document. In. Cincinnati, OH: Office of Research and Development, p. 150.
- Wu CH, Sercu B, Van De Werfhorst LC et al. (2010) Characterization of coastal urban watershed bacterial communities leads to alternative community-based indicators. PLoS One 5(6): e11285. doi: 10.1371/journal.pone.0011285.