Community ecological modelling as an alternative to physiographic classifications for marine conservation planning
Accurate mapping of marine species and habitats is an important yet challenging component of establishing networks of representative marine protected areas. Due to limited biological data, marine classifications based on abiotic data are often used as surrogates to represent biological patterns. We tested the surrogacy of an existing physiographic marine classification using non-metric multidimensional scaling and permutational analysis of variance to determine whether species composition was significantly different among physiographic units. We also present an alternative ecological classification that incorporates biological and environmental data in a community modeling approach. We use data on 174 species of demersal fish and benthic invertebrates to identify mesoscale biological assemblages in a 100,000 km2 study area in the northeast Pacific Ocean. We identified assemblages using cluster analysis then used a random forest model with 12 environmental variables to delineate mesoscale ecological units. Our community modelling approach resulted in five geographically coherent ecological units that were best explained by changes in depth, temperature and salinity. Our model showed high predictive performance (AUC = 0.93) and the resulting ecological units represent more distinct species assemblages than those delineated by physiographic variables alone. A strength of our analysis is the ability to map model uncertainty to identify transition zones at unit boundaries. The output of this study provides a biotic driven classification that can be used to better achieve representativity in the MPA planning process.
Keywords: MPA network; Ecological representation; Random forest; Cluster analysis; IndVal
Biodiversity is rapidly declining as human activities drive global-scale species losses and ecosystem changes (Pimm et al. 2014; Ceballos et al. 2015). Globally, less than 3.5 % of the marine environment currently benefits from protection, compared to 15.4 % of the terrestrial environment (Juffe-Bignoli et al. 2014). Overexploitation, habitat loss, pollution, invasive species expansion and climate change threaten to break down the social, ecological and economic benefits society derives from the world’s oceans (Worm et al. 2006). In response to the increasing threats to biodiversity, the Convention on Biological Diversity (CBD) called upon member states to protect at least 10 % of representative coastal and marine areas, emphasizing those areas of particular importance for biodiversity and ecosystem services (CBD 2010). Implementing an effective network of representative marine protected areas (MPAs) helps to achieve this biodiversity target while conferring ecosystem resilience to environmental change.
In order to design an effective MPA network, several criteria must be met including ecological representativity, connectivity, and the protection of vulnerable habitats and species (e.g., Airamé et al. 2003). The MPA network design criterion of representativity aims to protect examples of the full range of ecosystems and habitat types found within a given planning area (Roberts et al. 2003) and builds ecological resilience to the impacts of human activities into the network by incorporating a proportion of each type of ecosystem and habitat. Achieving representativity in an MPA network requires information on the distribution of ecosystems, species, and habitats across multiple spatial scales (Roff et al. 2003; Last et al. 2010; Harris 2012a).
An ecological classification system that partitions areas into relatively homogeneous spatial units based on a selected set of environmental and/or biological variables can be used to delineate ecological units as a basis for implementing the representativity criterion (Roff and Zacharias 2011). Ideally, marine ecological classifications should reflect the relationships between physical features and the distribution and abundance of species (Gregr et al. 2012); however, due to pervasive limitations in the availability of biological data, most marine ecological classifications are built on the physical geography or the abiotic conditions of the planning area (also referred to as a physiographic classification), and assume that these physical variables are reliable surrogates of biological patterns (e.g., Roff et al. 2003). While abiotic surrogates can reflect biological patterns (e.g., Roff and Taylor 2000), physiographic classifications do not perform as well as biologically informed classifications at fulfilling the representativity criterion in conservation planning (Lombard et al. 2003; Rodrigues and Brooks 2007; Sutcliffe et al. 2015). Sutcliffe et al. (2015) found that abiotic classifications may be used for an initial reserve design when biological information is insufficient, but classifications that are biologically informed, either through weighting the biological importance of abiotic variables (e.g., Pitcher et al. 2012) or by explicitly incorporating biological data, will produce more representative reserves.
On the Pacific coast of Canada, a joint federal-provincial strategy was recently initiated to develop an “ecologically comprehensive, resilient and representative network of marine protected areas” (Canada-British Columbia Marine Protected Area Network Strategy 2014) necessitating a deeper understanding of the spatial distribution of species, ecosystems and habitats in the region. In the late 1990’s, the provincial government of British Columbia (BC) developed a marine ecological classification system, British Columbia Marine Ecological Classification (BCMEC), using a physiographic approach. BCMEC is a hierarchical classification that comprises five nested divisions based on physical properties of the environment from ice regimes at the top of the hierarchy to currents, substrate, relief and exposure at the lowest level (Zacharias et al. 1998, available at http://geobc.gov.bc.ca/).
BCMEC is grounded in expert knowledge; however, the degree to which it represents patterns of biological diversity at lower levels (Ecosections, Ecounits) remains untested. The 12 BCMEC Ecosections on the BC coast are delineated based on ocean currents and stratification (Zacharias et al. 1998). Many studies use the Ecosections in a description of their study site, or use their boundaries to summarize biological or socio-economic information (e.g., Ban and Vincent 2009; Nelson et al. 2011; Robb 2014). However, without biological validation, their utility to fulfil the ecological representativity criterion for conservation planning is uncertain. The BCMEC Ecounits, nested below the Ecosections, attempt to classify the seabed using available physiographic data including currents, depth, bottom substrate, bottom relief, and wave exposure (AXYS Environmental Consulting Ltd 2000, 2001), but the Ecounits have been criticized for the scale and accuracy of the input substrate layer, a methodology that is difficult to repeat, and the lack of biological validation (e.g., Levings and Jamieson 1999; Johannessen et al. 2004). Given these constraints, the BCMEC Ecounits are not examined further in this study.
To delineate ecological units, we used a community modeling “assemble first, predict later” approach described by Ferrier and Guisan (2006). Using this approach, biological survey data from systematic fisheries-independent groundfish and invertebrate surveys are first classified into assemblages irrespective of environmental data using a cluster analysis. Second, we use a random forest model to identify the environmental correlates of the assemblages’ distributions, and then use the model to predict the presence of assemblages in parts of the study area with no biological data. The predictive performance of random forest models, which are regularly used in species distribution modelling studies, is typically equivalent to or better than that of other statistical and machine-learning methods in comparisons relating ecological to covariate data (Prasad et al. 2006; Cutler et al. 2007; Gonzalez-Mirelis and Lindegarth 2012).
Indicator species are commonly used for the analysis of biodiversity change or for defining conservation strategies (Lindenmayer et al. 2000; De Càceres et al. 2010; Hayes et al. 2015). For the final step in our classification approach, we use an indicator species analysis (IndVal; Dufrêne and Legendre 1997), to identify species that are associated with each ecological unit delineated in the classification. The IndVal combines the species’ site fidelity with its relative frequency of occurrence to statistically determine if species are associated with one or several classes. In other words, this analysis identifies species that are well represented by each ecological unit, and therefore enhances the interpretability of the resulting map.
With approximately 36,000 km of complex shoreline, over 6500 islands and over 450,000 km2 of marine waters, Canada’s Pacific Region is a highly diverse and productive part of the ocean (Fig. 1). The waters on the BC coast are located in a transition zone dominated by the Alaska Coastal Current flowing to the north and the California Current flowing to the south. These currents shape recognized zoogeographic provinces in fish fauna where the Aleutian and Oregonian provinces overlap and a transition of algal and invertebrate composition occurs (Allen and Smith 1988; Druehl 2000; Fenberg et al. 2015). Although the location of the boundary between these zoogeographic provinces is spatially and temporally dynamic, in general, a transition zone occurs near Brooks Peninsula on the west coast of Vancouver Island at the dividing point for the two current domains (Lucas et al. 2007). This point is the basis for distinguishing two nationally designated bioregions: the Northern Shelf Bioregion (NSB) and the Southern Shelf Bioregion (SSB; Fig. 1). In this study, our objective is to better understand the distribution of mesoscale (10–1000s km) benthic habitats and ecosystems within bioregions.
To complete our analyses, we gridded our study area into 4 km cells (resulting in 6875 cells) and aggregated the biological data and resampled environmental data to this resolution. Given the inherent errors with some remotely sensed abiotic data near the coast (Tyberghein et al. 2012), the topographic complexity of the coastline, and the presence of unique local processes (such as freshwater inputs, narrow fjords and local currents) we excluded grid cells that intersected with land, and the Strait of Georgia Bioregion from the analysis. An additional reason to remove these cells was to remove unequal sampling area as cells that intersected with land were less than 4 × 4 km of ocean area. The study area ranges in depth from 2–2900 m (mean = 538 m, median = 190 m) and is shown in Fig. 1.
Biotic data sources
We used presence/absence data collected in groundfish and crab biological surveys conducted by Fisheries and Oceans Canada between 2000–2014 to test the biological relevance of the existing physiographic classification in the study region (BCMEC-Ecosections, Fig. 1c) and to build a community based ecological classification. The same biotic dataset was used in both analyses and included data from two biological survey programs: (1) The standardized trawl and long line groundfish biological surveys which were undertaken annually from 2003 to 2013; and (2) The crab biological surveys, conducted using standardized trawl and traps, including data from the Tanner Crab Research survey (2000–2006) and the Crab research survey (2000–2014). Although these research surveys are conducted for specific taxa (i.e. groundfish and crabs), all species encountered are recorded. Ninety percent of the surveys were conducted between April and September with the remaining 10 % occurring between October and March (see Supplementary Material for detailed methodology). A total of 3707 cells (referred to as “sites”) contained catch records used in our analysis, at depths ranging from 7 to 2250 m (mean = 304 m, median = 160 m).
We removed species with low frequency in the dataset because they add noise to multivariate analyses and provide little information in addition to that obtained from more common species (Gauch 1982; McCune and Grace 2002, see Supplementary Material). To maximize the inclusion of species while also reducing noise and potential biases in the analysis, we chose a conservative exclusion threshold and removed species reported in less than 1 % of sites (≤37 sites). In addition, sites with single species can cause distortions in similarity analyses (Koleff et al. 2003) so sites where only one species was recorded were also removed. Our final biotic dataset included 174 species (96 species of demersal fish and 78 species of benthic invertebrate—See Supplementary Material for list) and 3615 sample sites (Fig. 1b). Survey effort was not consistent across all sites but initial models included “number of surveys” as a measure of survey effort. Results of these initial models showed that survey effort was not an accurate predictor of assemblage and accounted for less than 0.2 % mean decrease in model accuracy (see next section). Therefore survey effort was removed from further analyses.
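The two filtering rules above (drop species recorded in fewer than 1 % of sites, then drop sites left with a single species) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the toy site data, and the order in which the two filters are applied are assumptions.

```python
def filter_sites_and_species(site_species, min_site_fraction=0.01, min_richness=2):
    """Filter a presence dataset given as {site_id: set of species names}.

    Assumed order of operations: rare species are removed first, then sites
    with fewer than `min_richness` remaining species are removed.
    """
    n_sites = len(site_species)

    # Count the number of sites at which each species was recorded.
    counts = {}
    for species in site_species.values():
        for sp in species:
            counts[sp] = counts.get(sp, 0) + 1

    # Keep species recorded in at least `min_site_fraction` of all sites.
    kept_species = {sp for sp, c in counts.items() if c / n_sites >= min_site_fraction}

    # Strip rare species from each site, then drop species-poor sites.
    filtered = {}
    for site, species in site_species.items():
        remaining = species & kept_species
        if len(remaining) >= min_richness:
            filtered[site] = remaining
    return filtered, kept_species


# Toy data (hypothetical); a 50 % threshold is used so the effect is visible.
toy_sites = {
    "s1": {"a", "b", "c"},
    "s2": {"a", "b"},
    "s3": {"a", "d"},
    "s4": {"b"},
}
filtered, kept = filter_sites_and_species(toy_sites, min_site_fraction=0.5)
print(sorted(filtered), sorted(kept))
```

In the toy data, species c and d each occur at one of four sites and are dropped, which leaves sites s3 and s4 with a single species and removes them as well.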
Species composition dissimilarity matrix
Biological validation of BCMEC ecosections
To determine if the BCMEC Ecosections (Fig. 1c) represented benthic biological diversity patterns, we assigned each site to the Ecosection in which its centre point fell. Of the 12 marine Ecosections, five (with sample sizes from 281 to 1372) were included in the analysis: Continental Slope, Dixon Entrance, Hecate Strait, Queen Charlotte Sound, and Vancouver Island Shelf. The remaining seven Ecosections fell outside of or had limited overlap with our study area (≤35 sites each) and were not analyzed further.
We used a permutational analysis of variance (PERMANOVA, Anderson 2001; McArdle and Anderson 2001) to test whether the species composition was significantly different among groups (Ecosections) and a test of the homogeneity of multivariate dispersions among groups (PERMDISP, Anderson 2006) to help interpret the PERMANOVA results. A significant PERMANOVA result can be due to differences in centroid location among groups (i.e., differences in species composition), differences in spread (variance), or a combination of the two (Anderson and Walsh 2013). PERMDISP tests if the average within-group dispersion, measured by the average distance to group centroid, is equal among groups. Balanced PERMANOVA tests are more robust (Anderson and Walsh 2013), so we randomly resampled the number of sites in each Ecosection to the smallest sample size (n = 281, Hecate Strait). Each test was run with 999 permutations.
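To make the test statistic concrete, the sketch below computes a one-way PERMANOVA pseudo-F and the proportion of variation explained among groups (R2) directly from a distance matrix, following the sums-of-squares partition of Anderson (2001), with a label-permutation p value. It is a simplified stand-in for the adonis function, not a reimplementation of it; the function names and the one-dimensional toy data are assumptions.

```python
import random


def sums_of_squares(dist, labels):
    """Return (SS_total, SS_within) from a square distance matrix and group labels."""
    n = len(labels)
    # Total SS: sum of squared pairwise distances divided by the number of sites.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in set(labels):
        members = [i for i in range(n) if labels[i] == g]
        ss_g = sum(dist[i][j] ** 2 for k, i in enumerate(members) for j in members[k + 1:])
        ss_within += ss_g / len(members)
    return ss_total, ss_within


def permanova(dist, labels, permutations=999, seed=0):
    """One-way PERMANOVA: returns (pseudo-F, R2, permutation p value)."""
    n, a = len(labels), len(set(labels))

    def pseudo_f(lab):
        ss_t, ss_w = sums_of_squares(dist, lab)
        return ((ss_t - ss_w) / (a - 1)) / (ss_w / (n - a))

    ss_t, ss_w = sums_of_squares(dist, labels)
    f_obs = pseudo_f(labels)
    r2 = (ss_t - ss_w) / ss_t  # share of variation explained among groups

    # Null distribution: shuffle group labels among sites.
    rng = random.Random(seed)
    shuffled = list(labels)
    exceed = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(shuffled) >= f_obs:
            exceed += 1
    return f_obs, r2, (exceed + 1) / (permutations + 1)


# Toy example (hypothetical): two well-separated groups of four sites on a line.
points = [0, 1, 2, 3, 10, 11, 12, 13]
labels = ["A"] * 4 + ["B"] * 4
dist = [[abs(p - q) for q in points] for p in points]
f_obs, r2, p_value = permanova(dist, labels)
print(round(f_obs, 2), round(r2, 3), p_value)
```

Because the p value is computed as (exceedances + 1)/(permutations + 1), the smallest value attainable with 999 permutations is 0.001.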
To aid in the interpretation of the results, we examined the data visually, using nonmetric multidimensional scaling (nMDS). nMDS is an iterative search for a ranking and placement of n entities on k dimensions (axes) that minimizes the stress of the k-dimensional configurations, where stress is a measure of departure from monotonicity in the relationship between the distance in the original matrix and distance in the reduced k-dimensional ordination space (McCune and Grace 2002). The nMDS plot provides a visualization of the differences in species composition among groups. Groups were defined by the Ecosection boundaries, and all sites that fell within the boundary were assigned to that Ecosection. PERMANOVA, PERMDISP, and nMDS were run in R using the adonis, betadisper, and metaMDS functions in the ‘vegan’ package (Oksanen et al. 2014).
To better understand the community structure within each Ecosection we ran an indicator species analysis using the R function indval in the “labdsv” package (Roberts 2015). IndVal calculates an indicator value for each species, ranging from 0 to 1, based on the relative frequency of each species in a group compared to all groups, and can be interpreted as how strongly a specific species is associated with a given group. For presence–absence data, a species’ IndVal in group i is calculated as the product of the species’ specificity (the proportion of sites with that species present in group i, divided by the sum of the proportions of sites with that species present across all groups) and the species’ fidelity (the proportion of sites with presences in group i). High IndVal values show that a species is not only very frequent in a particular group, but that it is also infrequent elsewhere. A permutation test also calculates a p value for each indicator value and species. We report indicators within each group that were significant (p < 0.05) and had an IndVal value >0.25 (following Dufrêne and Legendre 1997). To compare the strength of indicator species in the BCMEC Ecosection classification with the community classification approach described in the next section, we compared the top indicator species for each Ecosection with the top indicator for each ecological unit in the community analysis. Although IndVal was designed for abundance data, it performs well for presence–absence data (Podani and Csányi 2010).
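The presence–absence form of the indicator value can be sketched in a few lines. Following Dufrêne and Legendre (1997), specificity is taken here as a group's frequency divided by the summed frequencies over all groups, and fidelity as the group's frequency itself. The function names, the toy data, and the simple label-shuffling permutation test are assumptions and may differ in detail from the labdsv implementation.

```python
import random


def indval(presence, groups):
    """IndVal per group for one species.

    presence: list of 0/1 values, one per site.
    groups:   list of group labels, one per site.
    """
    labels = sorted(set(groups))
    # Fidelity: proportion of sites in each group where the species is present.
    freq = {}
    for g in labels:
        values = [p for p, lab in zip(presence, groups) if lab == g]
        freq[g] = sum(values) / len(values)
    total = sum(freq.values())
    if total == 0:
        return {g: 0.0 for g in labels}
    # Specificity (freq[g] / total) multiplied by fidelity (freq[g]).
    return {g: (freq[g] / total) * freq[g] for g in labels}


def indval_p_value(presence, groups, group, permutations=999, seed=0):
    """Permutation p value: shuffle group labels and recompute IndVal."""
    observed = indval(presence, groups)[group]
    rng = random.Random(seed)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if indval(presence, shuffled)[group] >= observed:
            hits += 1
    return (hits + 1) / (permutations + 1)


# Toy data (hypothetical): present at 8 of 10 "Slope" sites and 1 of 10 "Shelf" sites.
groups = ["Slope"] * 10 + ["Shelf"] * 10
presence = [1] * 8 + [0] * 2 + [1] + [0] * 9
values = indval(presence, groups)
p_slope = indval_p_value(presence, groups, "Slope")
print({g: round(v, 3) for g, v in values.items()}, p_slope)
```

A species that is frequent in one group and rare elsewhere scores high on both terms; a species that is frequent everywhere scores high on fidelity but low on specificity.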
Defining biological assemblages for community model
The matrix of pairwise βsim values described in the previous section was used to create a dendrogram using average clustering (“UPGMA”, unweighted pair group method with arithmetic mean). To assess the performance of the βsim distance compared to other (dis)similarity measures, we compared the cophenetic correlation coefficients (also referred to as the cluster validity index; Lessig 1972) for the dendrograms produced using the βsim, Sorensen, Jaccard, and Ochiai distances. Determining the appropriate number of clusters (k) is an enduring issue in cluster analysis (Milligan and Cooper 1985). In biogeographic studies different types of stopping rules have been used, for example, a minimum number of grid cells per cluster (Williams et al. 1999), a predetermined level of dissimilarity (e.g., Proches 2005), or the height of the nodes of dendrogram and various metrics of relative endemism within clusters (e.g., Kreft and Jetz 2010). Given the objective of our study was to delineate the study area into biologically relevant ecological units, we wanted to maximize the number of clusters to ensure all assemblages at this scale were captured, while also maximizing the number of sites classified into geographically coherent clusters. To determine the optimal cut-off we examined three metrics: (1) the proportion of sites in the most-populated clusters; (2) the spatial coherence (clumping) of the sites in each cluster; and (3) the variation in cluster size. After the dendrogram was cut, most sites were assigned to a “major cluster”, with a small number of sites that fell in small, spatially scattered clusters considered “unclassified”.
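For reference, the dissimilarities compared above can be written in terms of the number of species shared by two sites (a) and the numbers unique to each site (b, c). The sketch below assumes that βsim is the Simpson dissimilarity as commonly defined, min(b, c) / (a + min(b, c)); the function names and example sites are hypothetical.

```python
def shared_and_unique(site_a, site_b):
    """Return (a, b, c): shared species, species only in site_a, species only in site_b."""
    return len(site_a & site_b), len(site_a - site_b), len(site_b - site_a)


def beta_sim(site_a, site_b):
    """Simpson dissimilarity: turnover only, insensitive to richness differences."""
    a, b, c = shared_and_unique(site_a, site_b)
    if a + min(b, c) == 0:
        raise ValueError("beta_sim is undefined when a site has no species")
    return min(b, c) / (a + min(b, c))


def sorensen_dissimilarity(site_a, site_b):
    a, b, c = shared_and_unique(site_a, site_b)
    return (b + c) / (2 * a + b + c)


def jaccard_dissimilarity(site_a, site_b):
    a, b, c = shared_and_unique(site_a, site_b)
    return (b + c) / (a + b + c)


# A species-poor site nested within a species-rich site (hypothetical data).
rich_site = {"sp1", "sp2", "sp3", "sp4", "sp5", "sp6"}
poor_site = {"sp1", "sp2"}
print(beta_sim(rich_site, poor_site))                # 0.0
print(sorensen_dissimilarity(rich_site, poor_site))  # 0.5
print(round(jaccard_dissimilarity(rich_site, poor_site), 3))
```

The nested pair illustrates why the choice of measure matters: βsim treats the poor site as a subset of the rich one (no turnover), whereas Sorensen and Jaccard register the richness difference as dissimilarity.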
Random forest analysis
We used a random forest analysis to identify environmental correlates of the variation in biological clusters across space, and evaluate whether these relationships could be used to accurately predict cluster membership in areas with no biological data. Random forest is a machine-learning method that creates an ensemble or “forest” of classification trees. It avoids developing a tree model that is over-fit to the training data by using bootstrap aggregation or “bagging” to repeatedly sample the data with replacement (bootstrapping) and developing a tree for each bootstrapped dataset (Cutler et al. 2007). The “out-of-bag” sample (approximately 1/3 of the data) is held out of each bootstrap sample and used to evaluate the model accuracy using a metric analogous to R2, called pseudo R2 (Franklin 2009).
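The "about one third" out-of-bag share follows from sampling with replacement: the chance that a given site is never drawn in n draws is (1 − 1/n)^n, which approaches e^−1 ≈ 0.37 for large n. The sketch below checks this by simulation. It only illustrates the bagging step and does not grow trees; the function name and the number of repeats are assumptions, while 3496 is the number of sites used in the random forest analysis.

```python
import random


def bootstrap_with_oob(n_sites, rng):
    """Draw one bootstrap sample of site indices and return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_sites) for _ in range(n_sites)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_sites) if i not in drawn]
    return in_bag, out_of_bag


rng = random.Random(42)
n_sites = 3496
oob_fractions = []
for _ in range(50):  # 50 bootstrap samples stand in for the trees of a forest
    _, oob = bootstrap_with_oob(n_sites, rng)
    oob_fractions.append(len(oob) / n_sites)

mean_oob = sum(oob_fractions) / len(oob_fractions)
print(round(mean_oob, 3))
```

Each tree is therefore evaluated on sites it never saw during training, which is what allows the out-of-bag error to serve as an internal estimate of predictive accuracy.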
Environmental rasters were resampled from their original resolutions to a 4 km × 4 km cell size to match the spatial resolution of the biological data. Although random forest can handle correlated variables (Breiman 2001), for ease of interpretability we selected a subset of the original 59 environmental variables that were not highly correlated (R2 for each pair of variables <0.7; R package ‘corrplot’, Wei 2013), had coverage for the entire study area, and have been shown to be biologically important (reviewed by Harris and Baker 2012; see Supplementary Material for analysis). We retained 12 variables, including depth, rugosity, flow (summer and winter), tidal direction, tidal speed, bottom salinity (range), bottom temperature (range), sea surface temperature (overall), and concentrations of phosphate, dissolved oxygen, and silicate (see Supplementary Material). Of the original 3615 sampling sites used in the cluster analysis, 3496 were assigned to a major dendrogram cluster and had associated data for all 12 predictor variables. Only these 3496 sites were used in the random forest analysis.
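The pairwise screening rule (R2 < 0.7 for every retained pair) can be sketched as a greedy filter that walks through the candidate variables in priority order and keeps each one only if it is not highly correlated with any variable already kept. This is an illustrative assumption about the procedure, not the authors' workflow, which also weighed spatial coverage and biological relevance; the variable names and values are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # undefined for constant variables


def select_uncorrelated(variables, max_r2=0.7):
    """Greedy selection: keep a variable if its R2 with every kept variable is below max_r2.

    variables: dict of name -> list of values, ordered by priority.
    """
    kept = []
    for name, values in variables.items():
        if all(pearson_r(values, variables[k]) ** 2 < max_r2 for k in kept):
            kept.append(name)
    return kept


# Hypothetical values at five grid cells; "pressure" is an exact linear function of depth.
candidates = {
    "depth": [10, 50, 120, 300, 800],
    "pressure": [2, 6, 13, 31, 81],
    "salinity_range": [0.5, 0.1, 0.4, 0.2, 0.3],
}
selected = select_uncorrelated(candidates)
print(selected)
```

The result of a greedy filter depends on the order of the candidates, so variables of known biological importance would need to be listed first.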
Model parameters and performance metrics
The random forest model was implemented in R, using the ‘randomForest’ package (Liaw and Wiener 2002) with default settings and 10,000 trees for each run. The accuracy of the model was assessed as 100−(% out-of-bag error), as well as with tenfold cross validation. For cross-validation, the input data were randomly divided into ten subsamples; each subsample (10 % of full dataset) was used to test the prediction accuracy of a 10,000-tree model built on the remaining data (90 %). Model fits were quantified using the area under the receiver operating characteristic curve (AUC; function auc in R package ‘pROC’, Robin et al. 2011). AUC values typically range from 0.5 for classifiers that perform no better than random to 1.0 for perfect classification (Fawcett 2006). AUC values from each cross-validation run (n = 10) were averaged to assess overall fit of the model. The relative importance of each predictor variable was also obtained from the cross-validation analysis, by taking the average of the mean decrease in model accuracy for each predictor for each 90–10 split. The variable importance plots were examined to assess the importance of each predictor in the classification of each biological cluster individually, and for the overall model.
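Two pieces of the evaluation described above are easy to show in isolation: the random tenfold partition, and the AUC as a rank statistic (the probability that a randomly chosen positive case scores higher than a randomly chosen negative case). The sketch is binary, for example one cluster versus the rest; how the five-cluster AUC was aggregated is not stated in the text, so that framing, the function names, and the toy scores are all assumptions.

```python
import random


def auc(scores, labels):
    """Rank-based AUC for binary labels (1 = positive, 0 = negative); ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n, k=10, seed=0):
    """Randomly partition indices 0..n-1 into k folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


# Toy scores (hypothetical): proportion of trees voting for the focal cluster.
perfect = auc([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
imperfect = auc([0.9, 0.4, 0.8, 0.3], [1, 1, 0, 0])
folds = k_fold_indices(3496, k=10)
print(perfect, imperfect, [len(f) for f in folds])
```

In a full cross-validation, each fold would in turn be held out, a model fitted to the other nine folds, the AUC computed on the held-out fold, and the ten AUC values averaged.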
Data from the 12 predictor environmental variables were available for 6814 of 6875 (99.7 %) grid cells within our study site. The random forest model was projected onto these layers to get a surface of predicted cluster membership for sites that were not used to create the model (i.e., sites not assigned to a major cluster and sites without biological data). The model input data covered just over half of the study site (3480 of 6586 grid cells). Using the modelled relationships between the environmental data and the biological assemblage data, we delineated mesoscale ecological units. Each identified ecological unit refers to the biological assemblage and the dominant environmental characteristics shaping that assemblage as predicted by our model, for the entire study area. We further examined the uncertainty in the model by mapping the percentage of votes underlying each predicted cluster classification (the output of the random forest model). This evaluation provides a visualization of the underlying uncertainty in the predicted surface to identify areas of poorer model fit in the classification.
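The uncertainty surface described above amounts to recording, for each grid cell, the share of trees that voted for the winning cluster. A minimal sketch, assuming the per-cell vote counts are already available (the function name and the vote counts are hypothetical):

```python
def classify_with_vote_share(votes):
    """Return (predicted cluster, % of trees voting for it) for one grid cell.

    votes: dict of cluster name -> number of trees voting for that cluster.
    """
    total = sum(votes.values())
    winner = max(votes, key=votes.get)
    return winner, 100.0 * votes[winner] / total


# Hypothetical vote counts from a 10,000-tree forest.
core_cell = {"Slope": 9500, "Troughs": 400, "Shelf": 100}
boundary_cell = {"Slope": 4800, "Shelf": 4700, "Troughs": 500}

print(classify_with_vote_share(core_cell))      # ('Slope', 95.0)
print(classify_with_vote_share(boundary_cell))  # ('Slope', 48.0)
```

Both cells are assigned to the same unit, but the low vote share in the second cell flags it as a likely transition zone.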
The species assemblages within ecological units were assessed using the same analysis described for the Ecosections. We determined indicator species using IndVal, and ran balanced PERMANOVA, PERMDISP, and nMDS analyses on the ecological units for comparison with the Ecosection results. PERMANOVA is generally used to test a priori groups such as the Ecosections, therefore its use in testing our units could be considered somewhat circular, given that we defined the clusters using the same data. However, given that the ecological units represent the modelled results (not the clusters themselves), we continued with the analysis for comparative purposes.
Biological validation of ecosections
The PERMANOVA results revealed significant differences in species composition among Ecosections (F = 24.43, df = 4, p < 0.0001). However, the PERMDISP test rejected the null hypothesis of homogeneity of multivariate dispersion among all groups (F = 5.8, df = 4, p < 0.001), indicating that the significant PERMANOVA result could be driven by differences in multivariate spread in the data within groups. The effect size shows that only 32 % of the variation is explained by Ecosections, leaving 68 % of the variation within groups (see Supplementary Material for PERMANOVA tables).
Table: Indicator species for BCMEC Ecosections produced using survey data on demersal fish and invertebrates. Species are listed in order of ascending IndVal metric.
Column headings: Frequency in Ecosection (% grid cells inhabited) | IndVal in Ecosection
Entries: Crab, Grooved Tanner | Crab, Scarlet King | Vancouver Island Shelf | No IndVal > 0.25 | No IndVal > 0.25 | Queen Charlotte Sound | Sole, Pacific Sand
Random forest analysis and delineation of ecological units
The environmental variables included in the random forest model accurately classified each cluster with an out-of-bag misclassification rate of 15.96 % (pseudo R2 = 84.04 %). The predictive power of the model was high, with an AUC for cross-validation of the random forest model of 0.93 ± 0.01. An AUC value above 0.9 is considered high model performance and indicates that the clusters are well explained by the environmental variables included in the model.
Using the relationships between the environmental data and the biological assemblage data, ecological units were delineated across the study area (Fig. 3c). Although the overall model AUC was high (0.93), we mapped uncertainty (quantified by the percentage of votes for the designated cluster, the measure used in the random forest model) to highlight the underlying uncertainty in the model (Fig. 3d). The uncertainty map indicates that, in general, the areas surrounding the boundaries of each ecological unit have a lower percentage of votes in the random forest model than areas in the core of each assemblage. This is particularly true at the southern boundary of Dogfish Bank, around Other Banks, and running along the length of the transition from Shelf to Slope at the shelf break. This suggests that the model does not perform as well in transition zones, where the species composition is changing across the environmental gradient. An additional area with high model uncertainty (lower model performance) is around the southern tip of Vancouver Island (i.e., Juan de Fuca Strait), suggesting that variables included in the model have low predictive power close to land.
Biological patterns of ecological units
Using a balanced design (randomly resampled to n = 119 per ecological unit), the PERMANOVA results showed significant differences in species assemblages among groups delineated using the random forest approach (F = 218.65, df = 4, p < 0.001). The PERMDISP test rejected the null hypothesis of homogeneity of multivariate dispersion among all groups (F = 9.8531, df = 4, p < 0.001), indicating that the significant PERMANOVA result could be driven by the spread in the data within groups. Nevertheless, the PERMANOVA results show that 60 % of the variation is explained by differences among ecological units, with 40 % of the variation occurring within ecological units. This is in contrast to the results of the Ecosections PERMANOVA, where the majority of the variation (68 %) was due to variation within Ecosections and only 32 % was due to variation among Ecosections.
We used an nMDS plot to examine the similarity of resultant ecological units in multidimensional space (Fig. 2b). The plot, with 95 % ellipses, shows an improvement in the distinctness of ecological units compared to the Ecosections’ nMDS, with considerably less overlap among groups. Most of the overlap that does occur is between physically similar groups such as Dogfish Bank and Other Banks. The Shelf shows overlap with all other ecological units except for the Slope. The Slope, similar to the Continental Slope Ecosection, is the most distinct assemblage with only a small amount of overlap with Troughs.
Table: Indicator species for ecological units produced by random forest analysis of species assemblages (96 demersal fish and 78 invertebrates), listed in order of IndVal metric.
Column headings: Frequency in unit (% grid cells inhabited) | IndVal in unit
Entries: Crab, Grooved Tanner | Crab, Scarlet King | Perch, Pacific Ocean | Sea Urchin, Pink | Sole, Pacific Sand
The Slope ecological unit, similar to the Continental Slope Ecosection, had among the highest IndVal values of any unit, with three species having IndVal values over 0.65 (IndVal = 0.71 for Grooved Tanner Crab, Chionoecetes tanneri; IndVal = 0.71 for Giant Grenadier, Albatrossia pectoralis; and IndVal = 0.71 for Pacific Grenadier, Coryphaenoides acrolepis), providing more support that these three species have a strong association with slope habitats. Dogfish Bank had the highest IndVal values of any ecological unit, with the highest being 0.72 for the Pacific Sand Sole (Psettichthys melanostictus), whereas Troughs’ highest IndVal values were 0.55 for Redbanded Rockfish (Sebastes babcocki) and 0.55 for Pacific Ocean Perch (Sebastes alutus). Yelloweye Rockfish (Sebastes ruberrimus) was the strongest indicator for Shelf, occurring in 46 % of Shelf sites; however, its IndVal of 0.28 reflects its occurrence in other ecological units (8 % of Trough sites, 20 % of Other Bank sites). In contrast, the Giant Grenadier (A. pectoralis) has a high frequency in the Slope (75 % of sites) and also a very high IndVal value (0.71), indicating that it is rarely found in other ecological units (also observed in 4 % of Trough sites).
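The two contrasting cases above can be checked by hand from the reported frequencies, taking IndVal as specificity multiplied by fidelity with specificity normalised over all units (Dufrêne and Legendre 1997). The sketch assumes that each species was absent from any unit not mentioned in the text and uses the rounded percentages as reported, so it can only approximate the published values; the function name is hypothetical.

```python
def indval_from_frequencies(freq_by_unit, unit):
    """IndVal for `unit` given the proportion of sites occupied in each unit."""
    fidelity = freq_by_unit[unit]
    specificity = freq_by_unit[unit] / sum(freq_by_unit.values())
    return specificity * fidelity


# Reported frequencies; units not mentioned in the text are assumed to be 0.
giant_grenadier = {"Slope": 0.75, "Troughs": 0.04}
yelloweye_rockfish = {"Shelf": 0.46, "Troughs": 0.08, "Other Banks": 0.20}

grenadier_slope = indval_from_frequencies(giant_grenadier, "Slope")
yelloweye_shelf = indval_from_frequencies(yelloweye_rockfish, "Shelf")
print(round(grenadier_slope, 3), round(yelloweye_shelf, 3))
```

Under these assumptions the Giant Grenadier value comes out at about 0.71, matching the reported figure, and the Yelloweye Rockfish value at just under 0.29, close to the reported 0.28. The difference between the two species is driven mainly by fidelity (75 % versus 46 % of sites) and secondarily by specificity.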
Integrated approach improves ecological classification
The results of this study provide a new mesoscale ecological classification that can be used in marine spatial planning in the Pacific Region of Canada. A current initiative to develop an MPA network on the coast of British Columbia requires information about the distribution of biodiversity in the planning region and this study provides an initial step in understanding the coarse-scale benthic community patterns and associated environmental heterogeneity. Studies have shown that building reserves using biological data produces more representative reserves (e.g., Sutcliffe et al. 2015), yet reserves built solely on abiotic surrogates result in more representative reserves than randomly selected sites (Rodrigues and Brooks 2007; Beier et al. 2015). Here we tested the biological relevance of an existing physiographic classification based solely on abiotic data and found that it reflects significant compositional turnover in benthic species in our study area. However, we also showed that an integrated approach, using biotic and abiotic data, better represents distinct benthic species assemblages and their associated habitat. Our analyses showed there are few species with strong associations to the physiographically-based Ecosections, with the exception of the Continental Slope. The nMDS analysis highlighted the high overlap among Ecosections and only the Continental Slope Ecosection displays a visually distinct assemblage. This suggests that if the physiographic classification was used in MPA planning to fulfil the representativity criterion, the continental slope species would be represented in the network, but the remaining Ecosection boundaries are not representative of turnover in benthic species and habitat diversity. However, by focussing on available data for benthic species of fish and invertebrates, we were only able to test aspects of the biological relevance of the Ecosections. 
Given that the Ecosections were built with information on ocean stratification and mixing, they may better represent pelagic biodiversity patterns. Further work is needed to examine how pelagic diversity is structured in comparison to the Ecosection boundaries. In terms of representing meso-scale patterns of benthic biodiversity, our approach provides an improvement and useful alternative to the Ecosections as a meso-scale benthic habitat layer to fulfil the ecological representativity criterion in MPA network planning. This result strengthens the conclusion that classifying habitats using biological information creates ecologically relevant habitat units that better represent species-environment relationships than classifications built on abiotic variables alone (Hewitt et al. 2004; Eastwood et al. 2006; Rooper and Zimmermann 2007; Shumchenia and King 2010).
Community driven classification
The vast majority of benthic marine organisms are limited by some combination of depth, substrate type, temperature, and salinity, but the complexity of their relationships to these variables is less well understood (Roff and Zacharias 2011; Harris 2012b). A review of 57 studies on mapping marine benthic communities found that water depth, followed by substrate type, was the most useful surrogate for delineating benthic communities (Harris and Baker 2012). Our results support this finding, with depth emerging as the strongest driver structuring the biological assemblages across our study area, followed by temperature and salinity range. Interestingly, Harris and Baker (2012) found that water properties, including temperature and salinity, were poorer surrogates than seabed characteristics such as acoustic backscatter, grain size and rugosity, likely due to the non-linear and complex responses of species to changes in temperature and salinity (Harris 2012b). Unfortunately, a reliable map of substrate type or acoustic backscatter is not currently available at the scale of this study, and the available grain size model did not cover the full extent of our study area. We did include rugosity in the analysis; however, it did not have high predictive power in our model (Fig. 4), perhaps because it was resampled from 100 m to 4 km to match our sample resolution. More local analyses following similar methods could be carried out in parts of the study area where finer-scale data are available, as other research has shown that this community modeling approach performs well at the biotope scale (10–100 m; Gonzalez-Mirelis and Lindegarth 2012).
We identified five coarse-scale habitats in our study area with somewhat distinct biological communities: Shelf, Troughs, Other Banks, Dogfish Bank and the Slope. Interestingly, we found that the group of species found on Dogfish Bank, the largest shallow bank in the region (Clarke and Jamieson 2006), was distinct from those of other banks in the study area. This result supports the identification of Dogfish Bank as an Ecologically and Biologically Significant Area (EBSA; Clarke and Jamieson 2006). An expert-driven process identified Dogfish Bank as an EBSA because it is the largest, shallowest bank in the region, an important area of aggregation for marine birds and Dungeness Crab, and rearing habitat for flatfish and invertebrate larvae (Clarke and Jamieson 2006). Our analysis identified four species of flatfish (Pacific Sand Sole, Rock Sole, English Sole, and Butter Sole) as indicator species for Dogfish Bank based on their high frequency. Although all four flatfish species were also found in other areas, particularly Rock Sole in Other Banks and Shelf, the higher frequency of flatfish in Dogfish Bank than in other ecological units provides empirical evidence of its importance as flatfish habitat. Similarly, although Dungeness Crab were found at low frequencies in other units (2 % of sites in Other Banks and 2 % of sites in Shelf), nearly half of the sites in Dogfish Bank contained Dungeness Crab (49 %), providing empirical evidence that this habitat is important for Dungeness Crab aggregations, as outlined in its EBSA designation (Clarke and Jamieson 2006). An added benefit of our community approach is the ability to develop an associated list of indicator species representative of each ecological unit, information that is important to conservation planners and managers.
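As an illustration of how frequency translates into an indicator value, the sketch below applies the presence-absence form of the IndVal index (Dufrêne and Legendre 1997), in which specificity is the species' frequency in a unit relative to the sum of its frequencies across units and fidelity is its frequency within the unit. The Dungeness Crab frequencies for Dogfish Bank, Other Banks and Shelf are those quoted above; the zero frequencies for Troughs and Slope are an assumption made only for this example, so the resulting number is illustrative and is not a value reported in this study.

```python
def indval(freq_by_unit, unit):
    """Indicator value (0-100) of one species for `unit`.

    `freq_by_unit` maps each ecological unit to the proportion of its sites
    at which the species was present (presence-absence data).
    """
    total = sum(freq_by_unit.values())
    if total == 0:
        return 0.0
    specificity = freq_by_unit[unit] / total  # how concentrated the species is in this unit
    fidelity = freq_by_unit[unit]             # how consistently it occurs within this unit
    return 100.0 * specificity * fidelity


# Frequencies quoted in the text; Troughs and Slope are ASSUMED to be zero
dungeness_crab = {
    "Dogfish Bank": 0.49,
    "Other Banks": 0.02,
    "Shelf": 0.02,
    "Troughs": 0.0,
    "Slope": 0.0,
}

scores = {unit: indval(dungeness_crab, unit) for unit in dungeness_crab}
best_unit = max(scores, key=scores.get)
print(best_unit, round(scores[best_unit], 1))
```

Because the index multiplies specificity by fidelity, a species that is almost restricted to one unit still receives only a moderate score when it occupies about half of that unit's sites.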
Use of spatial autocorrelation patterns
Spatial autocorrelation, a pattern in which observations are related to one another by their geographic distance, is common in georeferenced ecological data (Legendre and Legendre 2012). The presence of spatial autocorrelation can create problems in species distribution models (SDMs; Lennon 2000; Dormann 2007; Crase et al. 2012) such as the random forest approach taken in this paper. Spatial autocorrelation in the residuals of SDMs violates the assumption of independent and identically distributed errors and can inflate type I errors (Legendre 1993; Kühn 2007), which can lead to the selection of unimportant explanatory variables and poorly estimated parameters (Lennon 2000; Dormann 2007). There are several approaches to test for spatial autocorrelation in SDM residuals (reviewed by Keitt et al. 2002; Dormann et al. 2007), but our community modeling approach differs from the typical species distribution model, making it more complicated to test for spatial autocorrelation. For example, we used geographic cohesiveness as one of several criteria for selecting a dissimilarity cut-off in our cluster analysis. We were looking for areas of similar species composition to map coarse-scale benthic biological communities across geographic space, so spatial autocorrelation was inherent in our design (and could be considered a strength; see Gonzalez-Mirelis and Lindegarth 2012).
In many studies, the distribution model is fitted for the specific purpose of mapping its predictions, which involves using the mean, and the distribution of parameters is not often examined (e.g., Gonzalez-Mirelis and Lindegarth 2012). In this study, the map (Fig. 3c) is the output of interest (as opposed to the explanatory variables and parameter estimates), and because autocorrelation largely affects only the variance of effects, spatial autocorrelation is less of a concern. However, because spatial autocorrelation was not addressed in our model, we are unable to explicitly test the effects of the structuring processes of each ecological unit. In other words, although depth, salinity and temperature range are strong predictors of the biological clusters, we are limited in our interpretation of the strength of those correlative relationships.
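For readers who do want to check residuals, one widely used statistic is Moran's I. The sketch below is a generic illustration rather than a step performed in this study: it assumes numeric residuals at point locations and binary distance-band weights, and the transect coordinates, residual values and 1-unit neighbourhood distance are hypothetical.

```python
import math


def morans_i(coords, values, max_dist):
    """Moran's I with binary weights: sites within `max_dist` are neighbours."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    cross = 0.0   # sum of w_ij * dev_i * dev_j over neighbouring pairs
    w_sum = 0.0   # sum of all weights
    for i in range(n):
        for j in range(n):
            if i != j and math.dist(coords[i], coords[j]) <= max_dist:
                cross += dev[i] * dev[j]
                w_sum += 1.0
    denom = sum(d * d for d in dev)
    if w_sum == 0 or denom == 0:
        raise ValueError("no neighbouring pairs, or values have zero variance")
    return (n / w_sum) * (cross / denom)


# Hypothetical transect of six sites spaced one unit apart
coords = [(float(x), 0.0) for x in range(6)]
clustered = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]    # similar values sit next to each other
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # neighbours always differ

i_clustered = morans_i(coords, clustered, max_dist=1.0)
i_alternating = morans_i(coords, alternating, max_dist=1.0)
print(i_clustered, i_alternating)
```

Values near zero suggest little spatial structure, positive values indicate that nearby sites have similar residuals, and negative values indicate that neighbours tend to differ. In practice, significance is usually assessed by permuting the values among locations.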
Conservation planning and uncertainty
Models of natural systems, including predictive ecological models like random forests, inevitably include some degree of uncertainty. Uncertainty is not problematic per se as long as its effects on model projections are not ignored (Gould et al. 2014). However, many correlative models such as species distribution models are spatially projected without explicitly addressing uncertainty, thereby implying a confidence in model outputs that may be misleading (Beale and Lennon 2012; Wenger et al. 2013; Gould et al. 2014). Tulloch et al. (2013) stated that one of the most pervasive forms of uncertainty in data used to make conservation decisions is error associated with the mapping of conservation features. While conservation planners should consider uncertainty associated with ecological data to make informed decisions (e.g., Halpern et al. 2006; Langford et al. 2009), model error is rarely accommodated in the planning process (Tulloch et al. 2013).
To better incorporate uncertainty into the planning process in the Pacific Region, we provided an uncertainty map that clearly highlights areas of lower confidence in model performance. Although our overall model performance metrics were high (pseudo R2 = 0.84 and AUC = 0.93), the output of the model had less support at the boundaries of the ecological units, presumably across environmental gradients. Model uncertainty in transition zones around the edges of ecological units is expected given the potentially steep environmental gradients and associated community turnover. However, these areas would be masked if they were not highlighted by examining the level of support underlying the model prediction. Furthermore, transition zones are important features to consider in conservation planning, often with enhanced diversity (Araujo 2002), and mapping uncertainty allowed us to better identify them.
Model predictions in close proximity to land also show higher uncertainty than surrounding areas, particularly around the southern tip of Vancouver Island. The environmental complexity in this area, including local currents and eddies, is likely not adequately captured in our abiotic data, resulting in low model performance. This result supports the decision to remove sites near land and the Strait of Georgia Bioregion, and provides evidence that these areas should be modeled separately and at a local scale with finer-scale data if possible. Underlying uncertainty, particularly at the boundaries of classification units, is not captured in most rule-based classifications, like the BCMEC physiographic classification, and documentation of such uncertainty is not always made available or may only be found in technical metadata. The ability to examine the variability in model performance in a spatial context is a strength of our analytical approach and allows conservation planners and managers to explicitly consider uncertainty in the decision-making process.
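One simple way to express the level of support behind a random forest classification is the fraction of trees that vote for the winning class in each grid cell. The sketch below assumes that definition, which may differ from the exact calculation behind our uncertainty map; the cell names, vote counts and the 0.6 support threshold are hypothetical.

```python
from collections import Counter


def vote_support(tree_votes):
    """Return the majority unit and the fraction of trees that voted for it."""
    counts = Counter(tree_votes)
    unit, n_votes = counts.most_common(1)[0]
    return unit, n_votes / len(tree_votes)


def flag_transition_cells(votes_by_cell, min_support=0.6):
    """Collect cells whose winning unit has less than `min_support` of the votes."""
    flagged = {}
    for cell, votes in votes_by_cell.items():
        unit, support = vote_support(votes)
        if support < min_support:
            flagged[cell] = (unit, support)
    return flagged


# Hypothetical votes from a 100-tree forest for two grid cells
votes_by_cell = {
    "core_cell": ["Slope"] * 95 + ["Troughs"] * 5,   # well inside a unit
    "edge_cell": ["Shelf"] * 52 + ["Troughs"] * 48,  # near a unit boundary
}

transition = flag_transition_cells(votes_by_cell)
print(transition)
```

Mapping one minus the support value for every cell gives an uncertainty surface in which narrow bands of low support trace the boundaries between units.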
Additional sources of uncertainty in our results are the limitations of our input data. Although we pooled samples over a decade to average inter-annual variation in species presence, the majority of the biological data used in this study was collected from April through September. Therefore, our results best represent spring and summer patterns in benthic diversity and assume that large changes in species composition do not occur seasonally. Many of the species included in the analysis are low-mobility or sessile invertebrates and so are not expected to move, but other species, such as demersal fish and mobile invertebrates, may undertake seasonal movements. For example, certain Sablefish (Anoplopoma fimbria) populations have been shown to make large seasonal migrations, whereas others have been shown to remain resident year round (Maloney and Heifetz 1997; McFarlane and Saunders 2006). An additional limitation of our study is that only non-larval stages of species were included. Further studies examining patterns of pelagic diversity will hopefully be able to better incorporate larval life history stages.
We used spatially explicit data on demersal fish and benthic invertebrate species first to test the biological validity of a physiographic marine classification in the Pacific Region, BC. Second, we maximized the use of available biological data on benthic species to develop a mesoscale classification delineating ecological units that represent distinct biological assemblages for use in MPA network planning. We provided a biological evaluation of the BCMEC Ecosections, a physiographic classification that had not been tested against biological data prior to this study. We also showed that the representativity of a classification system can be greatly improved by integrating biotic and abiotic data into a predictive modeling framework. Our community modeling approach deepened our understanding of the spatial distribution of benthic biological communities in our study area and provides a good alternative to the existing physiographic classification for marine conservation planning. This study highlights the importance of maximizing the use of biological data in the marine conservation planning process, as well as the utility of multi-species stock assessment surveys for community analyses. Although the data used in this study were not collected for this purpose, studies suggest that it is better to move forward with conservation planning even with data limitations than to postpone planning efforts and risk further biodiversity loss (Ban 2009; Ban et al. 2014; Beier et al. 2015). As data become available at finer scales, we can use similar approaches to develop biologically driven classification systems that contribute to building an ecologically representative MPA network.
We are grateful for feedback and discussion from Ed Gregr, Laura Feyrer, Erin McClelland, Greig Oldford, Chris McDougall, Carrie Robb, and Karin Bodtker as well as members of the Canada-British Columbia-First Nations Marine Protected Area Technical Team. The manuscript was greatly improved by two anonymous reviewers. We also would like to thank Kate Rutherford, Leslie Barton, Jason Dunham and others who provided access and answered questions about data sources. Funding for this project was provided by the Canada-British Columbia Marine Protected Area Implementation Team and Fisheries and Oceans Canada’s National Conservation Plan Program and the Strategic Program for Ecosystem Research and Analysis.
- Araujo MB (2002) Biodiversity hotspots and zones of ecological transition. Conserv Biol 16:1662–1663
- AXYS Environmental Consulting Ltd (2000) British Columbia Marine Ecological Classification Update – Method Options. Prepared for Land Use Coordination Office, Government of British Columbia
- AXYS Environmental Consulting Ltd (2001) British Columbia Marine Ecological Classification Update. Ministry of Sustainable Resource Management Decision Support Services
- Allen MJ, Smith GB (1988) Atlas and zoogeography of common fishes in the Bering Sea and northeastern Pacific. NOAA Technical Report NMFS 66. National Marine Fisheries Service, NOAA
- Anderson MJ (2001) A new method for non-parametric multivariate analysis of variance. Austral Ecol 26:32–46
- Breiman L (2001) Random forests. Mach Learn 45:5–32
- Canada-British Columbia Marine Protected Area Network Strategy (2014) Available from https://www.for.gov.bc.ca/tasb/slrp/pdf/ENG_BC_MPA_LOWRES.pdf. Accessed 8 June 2015
- CBD (2010) Aichi Biodiversity Targets, Strategic Plan for Biodiversity 2011-2020. Convention on Biological Diversity, https://www.cbd.int/sp/targets/. Accessed 4 January 2016
- R Core Team (2014) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
- Druehl L (2000) Pacific seaweeds. Harbour Publ, Madeira Park
- Dufrêne M, Legendre P (1997) Species assemblages and indicator species: the need for a flexible asymmetrical approach. Ecol Monogr 67(3):345–366
- Franklin J (2009) Mapping species distributions: spatial inference and prediction. Cambridge University Press, New York
- Jefferis G (2014) dendroextras: Extra functions to cut, label and colour dendrogram clusters. R package version 0.2.1. http://CRAN.R-project.org/package=dendroextras
- Johannessen D, Haggarty D, Pringle J (2004) Boundary definition for the central coast integrated management area. Can Sci Advis Sec Res Doc 2004/050
- Juffe-Bignoli D, Burgess ND, Bingham H et al (2014) Protected Planet Report 2014. UNEP-WCMC, Cambridge
- Jurasinski G, with contributions from Retzer V (2012) simba: a collection of functions for similarity analysis of vegetation data. R package version 0.3-5. http://CRAN.R-project.org/package=simba
- Kühn I (2007) Incorporating spatial autocorrelation may invert observed patterns. Divers Distrib 13(1):66–69
- Legendre P, Legendre L (2012) Numerical ecology, 3rd ed. Developments in environmental modelling, vol 24. Elsevier, Amsterdam
- Levings CD, Jamieson GS (1999) Evaluation of ecological criteria for selecting MPAs in Pacific Region: a proposed semi-quantitative approach. Can Stock Assess Sec Res Doc 99/210
- Liaw A, Wiener M (2002) Classification and regression by randomForest. R News 2(3):18–22
- Lucas BG, Verrin S, Brown R (2007) Ecosystem overview: Pacific North Coast Integrated Management Area (PNCIMA). Can Tech Rep Fish Aquat Sci 2667:xiii + 104 p
- Maloney N, Heifetz J (1997) Movements of tagged sablefish, Anoplopoma fimbria, released in the eastern Gulf of Alaska. NOAA Technical Report NMFS 130:115–121
- McCune B, Grace J (2002) Analysis of ecological communities. MjM Software Design, Gleneden Beach
- McFarlane G, Saunders M (2006) Dispersion of juvenile sablefish, Anoplopoma fimbria, as indicated by tagging in Canadian waters. NOAA Technical Report NMFS 130:137–150
- Oksanen J, Guillaume Blanchet F, Kindt R et al (2014) vegan: community ecology package. R package version 2.3-0. http://CRAN.R-project.org/package=vegan
- Roberts DW (2015) labdsv: ordination and multivariate analysis for ecology. R package version 1.7-0. http://CRAN.R-project.org/package=labdsv
- Roff JC, Zacharias MA (2011) Marine conservation ecology. Earthscan, London, UK
- Tyberghein L, Verbruggen H, Pauly K et al (2012) Bio-ORACLE: a global environmental dataset for marine species distribution modeling. Global Ecol Biogeogr. Supporting information available at http://www.oracle.ugent.be/DATA/Other/Appendix.pdf
- Wei T (2013) corrplot: Visualization of a correlation matrix. R package version 0.73. http://CRAN.R-project.org/package=corrplot
- WoRMS Editorial Board (2015) World register of marine species. Available from http://www.marinespecies.org at VLIZ. Accessed 15 May 2015