Introduction

Predicting changes in species distributions can be a powerful tool for managers and conservationists as they attempt to understand how global change may impact freshwater ecosystems. Freshwater habitats are rich and diverse systems (Lundberg et al., 2000) and among the most threatened on the planet (Reid et al., 2019; Strayer & Dudgeon, 2010). Therefore, understanding the drivers and predicting the distributions of freshwater fishes is a global priority. Historically, species distribution models have often focused on the relationship between a single fish species and measures of abiotic habitat conditions (Olden et al., 2006). The focal species of such modeling efforts are most commonly managed (and therefore modeled) on a species-by-species basis, often focusing on important commercial and recreational species (e.g., Brandt et al., 2022; McKenna Jr & Johnson, 2011). Single-species models that use only abiotic landscape predictors implicitly assume that landscape-derived predictors are reasonable proxies for instream habitat. For example, inclusion of the proportion of agricultural or urban land use in a stream catchment as a predictor variable in a species distribution model assumes that human-dominated landscapes result in degraded instream conditions and therefore are useful for prediction, which is often the case (Allan, 2004). Although such watershed-level metrics are useful in capturing broad-scale patterns in fish distributions (e.g., Kristensen et al., 2012), finer-scale heterogeneity in instream habitat that also affects species occurrences is likely not captured well. For example, DeWeber and Wagner (2015) developed a species distribution model for eastern brook trout Salvelinus fontinalis that focused on understanding abiotic drivers of brook trout occurrence across its native range in the eastern United States. They used the proportion of urban and agricultural land use as proxies for instream habitat and found that brook trout were less likely to occur in human-altered landscapes, a finding that is consistent with expectations for altered instream habitat conditions and brook trout ecology. However, their approach, although providing valuable information about habitat factors influencing brook trout distributions, did not account for species co-occurrences that may also affect where brook trout occur. For example, the presence of non-native brown trout Salmo trutta can also affect the occurrence probability of native brook trout (Wagner et al., 2013). Therefore, the presence or absence of other species, such as brown trout in this example, may provide additional information about instream conditions that could improve predictions beyond those based solely on landscape-derived habitat metrics.

Although a single-species focus may be appropriate given specific research or management objectives, or necessary due to limited data, the paucity of multi-species studies that incorporate both community-wide co-occurrence information and abiotic drivers has been partly due to the lack of analytical frameworks that can accommodate such data. However, recent advancements in statistical methodologies allow for modeling multi-species data (e.g., fish assemblage occurrence) while accounting for species co-occurrences and abiotic environmental variables. For example, joint species distribution models (JSDMs; Clark et al., 2014; Ovaskainen et al., 2017; Wilkinson et al., 2019) that account for residual dependencies between species have seen rapid development in the past several years and are more commonly being applied to freshwater fisheries data (Inoue et al., 2017; Wagner et al., 2020). JSDMs capture species dependencies within a covariance matrix, after accounting for the effects of abiotic predictor variables included in the model, and leveraging this information can lead to improved predictive performance (e.g., Clark et al., 2014; Ovaskainen et al., 2017; Tikhonov et al., 2017; Vallé et al., 2023). Similar to single-species models, using landscape-derived abiotic predictors within a multi-species framework assumes that they will act as reasonable proxies for instream habitat. In the case where these landscape-derived predictors are not sufficient proxies, JSDMs (and other multi-species approaches) may capture this fine-scale information through overlapping (or diverging) habitat requirements within the species dependencies. Nevertheless, JSDMs offer limited ability to make inferences regarding species associations because these pairwise species dependencies are not directly estimated and are instead inferred from the residual correlations (Poggiato et al., 2021). This also prevents their effect sizes from being directly compared to those of the abiotic covariates included in the model (Clark et al., 2018; Harris, 2016).

An alternative statistical approach called conditional random fields (CRF; a type of Markov network) simultaneously models both community-level species co-occurrence information and abiotic predictor effects on species distributions, while also allowing for the effects of species co-occurrences to vary along abiotic gradients (i.e., the effect of species co-occurrences in the community is allowed to be context-dependent). Furthermore, CRFs directly estimate the effect sizes of co-occurrences between species. Because these effect sizes are estimated jointly and at the same scale as those for the abiotic factors, they can then be used to calculate relative importance scores between all effect types. This allows for direct comparisons of effect sizes between abiotic factors, species co-occurrences, and their context dependency. These relative importance scores are calculated within each species of the community, allowing for inferences regarding how different effect types vary across the community. However, as with all modeling efforts, the choice of modeling approach depends on the research questions and available data. If a primary objective is to predict fish occurrence at unsampled locations, then CRF models would not be the preferred option because CRFs cannot predict to unsampled locations where community data are lacking. Even so, CRF models can still be valuable tools for management and conservation efforts. For example, predictions may be leveraged to identify locations that are more or less susceptible to the spread of an invasive species, or those that are prime candidates for the reintroduction of a locally extirpated species, given both the abiotic habitat and occurrences of other species.

Here, we explore two objectives pertaining to the use of assemblage-level co-occurrence data and a relatively novel statistical methodology. First, we fit and compared the predictive performance of an assemblage-level (i.e., the assemblage represents fishes sampled and not an entire aquatic community) CRF model against a more traditional single-species generalized linear model (GLM). Previous research has suggested that including species co-occurrences can improve predictions of species distributions of freshwater fishes (e.g., through the use of JSDMs; Wagner et al., 2020). Thus, our hypothesis is that including species co-occurrence data within the CRF framework will capture additional information about instream habitat conditions compared with single-species GLMs that rely only on landscape-based predictors, which will lead to better predictions of stream fish occurrence across two regional watersheds. For example, species that share overlapping habitat requirements at a finer scale may appear as positive co-occurrences when such habitat information is not captured in coarser-scale abiotic variables (e.g., catchment land use). Second, within the CRF framework, we investigate the relative importance of the different effect types (i.e., species co-occurrence and environmental predictors) to understand their respective contributions to predicting species distributions. Previous research has suggested that the ecological processes structuring assemblages of freshwater stream fishes are non-random and largely governed by abiotic filters (Giam & Olden, 2016). Here, we use relative importance measures to test whether this holds true for data at a relatively large extent and coarse resolution.

Methods

Fish and landscape data

Fish assemblage data were obtained from three agencies that sample within Pennsylvania, USA: the Pennsylvania Department of Environmental Protection (PADEP), the Pennsylvania Fish and Boat Commission (PFBC), and the Susquehanna River Basin Commission (SRBC). Fish assemblage data were collected using standardized survey protocols over a 20-year period (2000–2020), with the majority of the sample sites (hereafter catchments) occurring within Pennsylvania state lines (Fig. 1). Only fishery surveys that sampled the entire fish assemblage were included in the analysis. A number of different gear types were used during the surveys and varied by agency. Gear type varied in order to effectively sample the fish assemblage given stream-specific characteristics (e.g., stream size), but the majority of samples were collected using electrofishing. Backpack and tow barge electrofishers were used for smaller, wadeable streams and boats for larger streams and rivers. Additional gear types included seine nets, trawling, and SCUBA/snorkeling visual identification. The fish data were summarized as presence–absence, where a species was considered present if it ever occurred at the sample site over the time period of record. We summarized site occurrence over time because the abiotic information used within our analysis was stationary (described below). In other words, we were not able to relate changes in environmental conditions to changes in occurrence over time. We also acknowledge that accounting for differences in sampling efficiency across gears and imperfect detection are important considerations when modeling fisheries catch data (e.g., Arreguín-Sánchez, 1996; Ensign et al., 2002; Kennard et al., 2006; Peterson & Rabeni, 1995); however, we assumed that summarizing data into presence–absence helped to reduce the influence of varying catchability on our analysis (i.e., the potential influence of varying catchability across gear types on inferences would likely be greater when modeling relative abundance; King et al., 2023).
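The presence–absence summary described above can be illustrated with a minimal Python sketch. It is not the code used in the analysis (which was carried out in R); the record layout, the function name `presence_absence`, and the toy detections are hypothetical and serve only to show the "present if ever detected" rule.

```python
from collections import defaultdict

def presence_absence(records, species):
    """Collapse repeated surveys into one 0/1 row per catchment.

    records: iterable of (catchment_id, year, species_name) detections
             (hypothetical layout, assumed for illustration).
    A species is scored 1 if it was detected in ANY survey of a catchment.
    Catchments without any detection record do not appear in the output.
    """
    detected = defaultdict(set)
    for catchment, _year, name in records:
        detected[catchment].add(name)
    return {c: [1 if s in found else 0 for s in species]
            for c, found in detected.items()}

# Invented toy detections spanning several survey years
records = [
    ("c1", 2003, "brook trout"),
    ("c1", 2015, "brown trout"),
    ("c2", 2010, "brown trout"),
]
species = ["brook trout", "brown trout"]
pa = presence_absence(records, species)
```

In this toy example, catchment `c1` is scored as occupied by both species even though they were detected in different years, which is the temporal pooling that the stationary abiotic data require.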

Fig. 1

Map of fish assemblage samples (dots) with HUC2 watershed boundaries (solid lines with HUC2 code labels) and Pennsylvania, USA border (red dashed line). Blue lines represent major rivers of Pennsylvania. Inset shows location of Pennsylvania within contiguous United States

Catchments were grouped by location within level two hydrologic unit code (HUC; Seaber et al., 1987) regional watersheds to be modeled independently. The regional watersheds that fall within Pennsylvania and included within the study were the Mid-Atlantic (HUC2 = 02) and the Ohio River (HUC2 = 05). The full fish assemblage data were composed of 2222 samples from 1265 catchments, and included over 145 fish species. For each watershed, a species had to occur at 30 or more catchments to ensure a sufficient sample size of occurrences, reducing the full species list to 52 species for the Mid-Atlantic watershed model (hereafter the Mid-Atlantic model) and 46 species for the Ohio River watershed model (hereafter the Ohio River model). All fishes were identified to the species level when possible. However, if species-level identification was not possible then fish were identified at the genus level and grouped accordingly. For example, all species of the sculpin genus Cottus were grouped. Sculpins can be difficult to identify at the species level, but are quite common in Pennsylvania headwater streams (Stauffer et al., 2016). Both models included sculpins Cottus spp. Two additional genera were removed despite having high enough occurrence: unidentified Lepomis spp. in the Mid-Atlantic watershed, and unidentified lampreys from the Ohio River watershed. Unidentified Lepomis spp. were removed so that we could retain those identified at the species level rather than grouping them to genus. Unidentified lampreys were removed because much of their associated sample data did not include genus information and lampreys can differ significantly in their ecological niche (i.e., some are parasitic).

EcoSHEDS Northeast Catchment Delineation (NECD) hydrological catchments (Walker et al., 2015) were used as the base spatial unit for the analysis. The NECD catchments are similar to those of the National Hydrography Dataset (NHD) Plus product (Buto & Anderson, 2020), but have improved spatial resolution, with catchment delineation based on the high-resolution NHD flowlines. Sample locations were assigned to a catchment based on each sample site’s spatial coordinates, resulting in 817 unique catchments in the Mid-Atlantic model and 412 in the Ohio River model. Most abiotic predictors were obtained from the NECD data product. This spatial dataset includes both local (information about the catchment itself) and upstream (the local and accumulated upstream catchments combined) summaries of each sample site’s ecological context. For this analysis, all abiotic variables from this dataset were defined using the upstream summaries in an effort to capture information regarding the network influences on lotic systems. Importantly, as previously mentioned, the NECD data product is stationary and thus provides a single point estimate for each abiotic predictor within each hydrological catchment. In other words, these data do not capture changes in habitat characteristics over the sampled time period.

Table 1 Summary of abiotic factors within the Mid-Atlantic watershed included in the model of stream fish assemblages. Stream temperature was sourced from DeWeber and Wagner (2014)

Abiotic factors included climate information (e.g., annual precipitation [mm]; PRISM 30-year normals; https://prism.oregonstate.edu/normals/), land use (National Land Cover Database; 2006 NLCD; Jin et al., 2013), nitrate deposition (National Atmospheric Deposition Program; https://nadp.slh.wisc.edu/), and land form (Tables 1 and 2). Stream temperature is known to be an important abiotic factor that affects the distribution of freshwater poikilotherms (Buisson et al., 2008; Wehrly et al., 2003), but is not measured across large spatial extents. Therefore, predicted stream water temperature (maximum summer 14-day mean) was sourced from DeWeber and Wagner (2014). All species presence and abiotic sample site data were joined to the NECD catchment using the sf package in R (Pebesma, 2018; R Core Team, 2022). Because the available abiotic data were landscape-derived variables (as opposed to instream habitat), we attempted to include as many of the NECD covariates as possible. Variable selection was based on Pearson’s pairwise correlations between variables such that no two retained variables had an absolute correlation greater than 0.6. This variable selection process was performed separately within each watershed region, and the final set of covariates included in the model fitting process were those that met this criterion for both watersheds. All abiotic predictor variables shown in Tables 1 and 2 were standardized to have a mean of zero and a standard deviation of one. Predicted stream temperature was the only variable included in the model with missing values (n = 60 and n = 7 for the Mid-Atlantic and Ohio River models, respectively). Missing values for abiotic variables were set to the standardized mean (i.e., zero).
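As a rough illustration of the covariate screening, standardization, and mean imputation described above, the following Python sketch may help. The text does not state how a variable was chosen for removal from a correlated pair, so the greedy order-based rule in `greedy_filter` is an assumption, as are all function names and the toy values.

```python
import math
from statistics import mean, stdev

def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def greedy_filter(columns, threshold=0.6):
    """Assumed rule: visit variables in the given order and keep one only
    if its |r| with every already-kept variable is <= threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept

def standardize_and_impute(values):
    """z-score using observed values; missing (None) becomes 0.0,
    i.e., the mean on the standardized scale."""
    observed = [v for v in values if v is not None]
    m, s = mean(observed), stdev(observed)
    return [0.0 if v is None else (v - m) / s for v in values]

# Invented toy covariates for five catchments
columns = {
    "agriculture": [10, 20, 30, 40, 50],
    "developed":   [12, 19, 33, 41, 48],   # nearly collinear with agriculture
    "slope":       [5, 1, 4, 2, 3],
}
kept = greedy_filter(columns)
z_temp = standardize_and_impute([18.0, None, 22.0, 20.0])
```

Because the result of a greedy filter depends on the order in which variables are visited, a real application would need to justify that order (e.g., by ecological relevance).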

Table 2 Summary of abiotic factors within the Ohio River watershed included in the model of stream fish assemblages. Stream temperature was sourced from DeWeber and Wagner (2014)

Statistical analysis

We used conditional random fields (CRFs) to quantify the effect size and relative importance of species co-occurrences and abiotic factors structuring fish assemblages at the landscape scale and to determine if species co-occurrence patterns were context-dependent (i.e., vary across environmental gradients). CRFs are undirected graphical network models that allow for direct inferences of both species co-occurrence and abiotic factors structuring species distributions using presence–absence data (Harris, 2016). Briefly, the CRF model estimates the log-odds of observing species j given the occurrence of species k and covariate x by:

$$\begin{aligned} \text{log} \left( \frac{P(y_j=1|y_{\backslash j},x)}{1-P(y_j=1|y_{\backslash j},x)}\right) = \alpha _{j0} + \beta ^T_jx + \sum _{k:k \ne j}(\alpha _{jk0}+\beta ^T_{jk}x)y_k \end{aligned}$$
(1)

where \({\textbf {y}}_j\) is a vector of observations for species j (1 if present, 0 otherwise), \({\textbf {y}}_{\backslash j}\) contains the observations \({\textbf {y}}_{k}\) for all other species \(k \ne j\), \(\alpha _{j0}\) is the species-level intercept, \(\beta ^T_j\) (superscript T means transposed) is the vector of coefficients for the effects of covariates x on species j’s occurrence probability, and \(\alpha _{jk0}\) and \(\beta ^T_{jk}\) represent the coefficients associated with species k’s main effect and its interaction with covariates x, respectively (see Clark et al., 2018; Harris, 2016, for more details regarding Markov and conditional random fields). The full analysis was performed using the MRFcov R package (Clark et al., 2018; R Core Team, 2022). The model was run as a spatially-explicit model by providing coordinates (catchment centroid) of each sample site. This allowed the model to account for possible spatial autocorrelation via Gaussian Process spatial regression splines. Within each bootstrap iteration (n = 400), the model was fit across all species. Models were fit separately for each watershed region. This resulted in models with 52 and 46 species (co-occurrence effects), 12 abiotic factors, and 624 and 552 interaction (species co-occurrence \(\times \) abiotic) effects for the Mid-Atlantic and Ohio River models, respectively. To prevent overfitting with so many potential coefficients, the algorithm uses LASSO (least absolute shrinkage and selection operator) penalization to regularize the regressions. This regularization process is optimized automatically through the functions from the MRFcov R package within each species’ model fit and forces a number of coefficients to zero. The coefficient estimates across all bootstrap iterations were then summarized into the mean coefficient value and 90% confidence intervals (i.e., 5% and 95% quantiles).
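To make the structure of Eq. 1 concrete, the following Python sketch evaluates the conditional occurrence probability of a focal species at a single site from a given set of coefficients. It illustrates the linear predictor only and does not reproduce the MRFcov fitting procedure; the function name and all coefficient values are hypothetical.

```python
import math

def conditional_probability(alpha_j0, beta_j, alpha_jk0, beta_jk, x, y_other):
    """P(y_j = 1 | other species, covariates) following Eq. 1.

    alpha_j0  : intercept for focal species j
    beta_j    : abiotic main-effect coefficients (one per covariate)
    alpha_jk0 : co-occurrence main effects (one per other species k)
    beta_jk   : for each other species k, its interaction coefficients
                with the covariates (one per covariate)
    x         : standardized covariate values at the site
    y_other   : 0/1 occurrences of the other species at the site
    """
    log_odds = alpha_j0 + sum(b * xi for b, xi in zip(beta_j, x))
    for a_k, b_k, y_k in zip(alpha_jk0, beta_jk, y_other):
        # The effect of species k shifts with the covariates (context dependence)
        log_odds += (a_k + sum(b * xi for b, xi in zip(b_k, x))) * y_k
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical coefficients: one covariate, one other species
args = dict(alpha_j0=-1.0, beta_j=[0.5], alpha_jk0=[1.0], beta_jk=[[-0.5]], x=[1.0])
p_present = conditional_probability(y_other=[1], **args)  # other species present
p_absent = conditional_probability(y_other=[0], **args)   # other species absent
```

With these invented values, the positive co-occurrence main effect (1.0) is partly offset by the negative interaction (−0.5) at x = 1, so the presence of the other species raises the focal species' log-odds by 0.5 rather than by 1.0.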

Predictive performance

We assessed the predictive performance of the CRF models against the single-species models using a 100-repetition 5-fold cross-validation. This was done by randomly partitioning each dataset into five different subsets. Four of these subsets were used to fit the models and the fifth subset was used to test the predictive performance of that fitted model. This was done five times so that each subset was used as testing data. This entire process was then repeated 100 times. For this portion of the analysis, each CRF model was fit using the MRFcov_spatial function from the MRFcov R package. This function fits a single CRF model with spatial splines to account for spatial autocorrelation. Single-species models were fit using the same datasets and the model structure was developed to closely mirror that of the CRF. We fit generalized linear models (GLMs; logistic regression) with the glm function from the stats package in R using a logit link. Similar to the CRF models, the GLMs also included spatial splines to account for potential spatial autocorrelation, which were constructed using the smooth.construct2 function from the mgcv package in R. These approaches were chosen because they most closely mirror the algorithm used within the CRF model fitting process. A key difference here is that we did not use LASSO regularization for the single-species models. This was reasonable because we were only interested in using the GLM for prediction comparisons and not for making inferences about estimated coefficients.
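The partitioning scheme described above can be sketched in a few lines of Python. This shows only how sites might be split into training and testing sets for repeated 5-fold cross-validation; the function name, the seed, and the round-robin fold assignment are assumptions, and no model fitting is included.

```python
import random

def repeated_kfold(n_sites, k=5, repeats=100, seed=1):
    """Yield (repetition, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for rep in range(repeats):
        idx = list(range(n_sites))
        rng.shuffle(idx)                       # new random partition each repetition
        folds = [idx[i::k] for i in range(k)]  # k near-equal subsets
        for f in range(k):
            test = folds[f]
            train = [i for g in range(k) if g != f for i in folds[g]]
            yield rep, f, train, test

# Example with the 817 Mid-Atlantic catchments and two repetitions
splits = list(repeated_kfold(817, k=5, repeats=2))
```

Each repetition yields five train/test splits in which every site is used for testing exactly once, so 100 repetitions give 500 fitted models per approach and watershed.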

To compare predictive performance between the multi-species CRF and the single-species GLM, the sensitivity (the ability of the model to correctly predict species presence), the estimated receiver operating characteristic (ROC) curve, and Youden index were calculated to determine the overall model’s classification ability and the optimal threshold (the predicted probability cutoff level at which the model classifies a species as present) for classification (Khan & Brandenburger, 2020). The Youden index and ROC curve were then used to calculate the area under the curve (AUC) value. AUC values measure discriminatory capacity of the model where a higher AUC value represents better predictive performance (Jiménez-Valverde, 2012). These metrics were calculated for each species within each fold and repetition using the ROCit R package (Khan & Brandenburger, 2020), then summarized by their mean. The AUC was also summarized by the 5th and 95th percentiles (i.e., 90% bootstrapped confidence intervals). The species-level CRF measures were then compared against their respective GLM measures to compare predictive performance of the two approaches and to determine if including assemblage co-occurrence data did in fact improve predictive performance. Models with AUC values > 0.9 are considered highly accurate and most useful for interpretation and prediction, and models with AUC values between 0.7 and 0.9 are considered moderately accurate (Manel et al., 2001). Note that the closer an AUC score is to 0.5, the closer the model is to determining presence/absence via "coin-flip."
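For readers less familiar with these metrics, the Python sketch below computes AUC, the Youden index, and the sensitivity at the Youden-optimal threshold from observed occurrences and predicted probabilities. It is a stand-alone illustration using standard definitions, not the ROCit implementation used in the analysis; the function names and toy values are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the probability that a randomly chosen presence receives a
    higher score than a randomly chosen absence (ties count one half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def youden_threshold(y_true, scores):
    """Return (J, threshold, sensitivity) maximizing
    J = sensitivity + specificity - 1, classifying a species as present
    when score >= threshold. The first maximum is kept on ties."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best = (-1.0, None, None)
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < t)
        sens, spec = tp / n_pos, tn / n_neg
        j = sens + spec - 1
        if j > best[0]:
            best = (j, t, sens)
    return best

# Invented observations and predicted probabilities for one species
y_obs = [0, 0, 1, 0, 1, 1]
p_hat = [0.1, 0.3, 0.35, 0.6, 0.7, 0.9]
auc_value = auc(y_obs, p_hat)
j_stat, threshold, sensitivity = youden_threshold(y_obs, p_hat)
```

An AUC of 0.5 under this definition corresponds to the "coin-flip" case noted above, where presences and absences are ranked no better than at random.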

Key coefficients

From the full multi-species CRF, estimated coefficients (hereafter referred to as key coefficients because they were the important estimated effects not forced to zero through regularization) for each species were used to make inferences about the relative importance of species co-occurrences and abiotic factors governing species distributions within each watershed. The full assemblage dataset was fit under a bootstrapped conditional random fields framework using the bootstrap_MRF function within the MRFcov R package. This function also models spatial autocorrelation through splines and the bootstrapping allows for the estimation of uncertainty in the parameter estimates. Key coefficients were determined by a relative importance score (calculated as \(B^2/\Sigma B^2\) from all bootstrapped models across iterations, where the vector B contains the regression coefficients for the predictor variables) occurring above a default threshold (> 0.01). We base our inferences on these key coefficients rather than a significance level to focus on the relative strength of the effect size of the estimated associations. Key coefficients were also grouped according to effect type: co-occurrence main effect, abiotic main effect, and biotic-abiotic interaction effect (context-dependent species co-occurrences). Coefficients were ranked by their relative importance score within each species. The coefficient with the highest relative importance score (i.e., strongest absolute effect size) was ranked 1, and so on for each key coefficient. This was done to simplify comparisons of effect type’s relative importance across all species. Network graphs were created using the igraph R package (Csardi & Nepusz, 2006; R Core Team, 2022).
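The relative importance calculation and ranking can be sketched in Python as follows. This sketch assumes that the score is computed from bootstrap-mean coefficients for a single focal species, which may differ in detail from the MRFcov implementation; the function name, predictor names, and coefficient values are hypothetical.

```python
from statistics import mean

def key_coefficients(boot_coefs, threshold=0.01):
    """Rank predictors of one focal species by relative importance.

    boot_coefs: dict mapping predictor name -> list of coefficient
                estimates across bootstrap iterations (assumed layout).
    Relative importance is B^2 / sum(B^2), computed here from the
    bootstrap-mean coefficients; predictors scoring above `threshold`
    are retained and ranked (1 = strongest absolute effect).
    """
    mean_b = {name: mean(vals) for name, vals in boot_coefs.items()}
    total = sum(b ** 2 for b in mean_b.values())
    importance = {name: b ** 2 / total for name, b in mean_b.items()}
    keep = {n: r for n, r in importance.items() if r > threshold}
    ranked = sorted(keep, key=keep.get, reverse=True)
    return [(rank, name, keep[name]) for rank, name in enumerate(ranked, start=1)]

# Invented bootstrap estimates for one focal species
boot_coefs = {
    "creek chub": [2.0, 2.0],            # co-occurrence main effect
    "stream temperature": [-1.0, -1.0],  # abiotic main effect
    "slope": [0.1, 0.1],                 # too weak to pass the threshold
    "open water": [0.0, 0.0],            # shrunk to zero by the LASSO
}
keys = key_coefficients(boot_coefs)
```

Because the score uses squared coefficients, the sign of an effect does not influence its rank; a strong negative association ranks as highly as an equally strong positive one.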

Results

Across both watersheds, the 70 unique species of fish were from 14 families and ranged from brook trout, a coldwater stenotherm found in headwater streams, to warmwater fishes such as bluegill Lepomis macrochirus and channel catfish Ictalurus punctatus found in larger, low gradient rivers (drainage area ranged from < 1 \(\text{km}^2\) to > 70,000 \(\text{km}^2\); Appendix Tables 4 and 5). Sampled stream and river catchments varied in abiotic characteristics across watersheds and land form. For example, agricultural land use in the upstream catchment ranged from 0 to 71% in the Ohio River watershed (median = 22%) and from 0 to 88% (median = 15%) in the Mid-Atlantic watershed (Tables 1 and 2). Percent slope ranged from 3 to 29% (median = 13%) in the Ohio River watershed and from 1 to 40% (median = 14%) in the Mid-Atlantic watershed.

Predictive performance

The CRF model was moderately accurate in classifying occurrence for both watersheds. The mean AUC value (90% bootstrapped confidence interval) across all species, folds, and repetitions was 0.87 (0.65, 0.97) for the Mid-Atlantic watershed and 0.84 (0.50, 0.97) for the Ohio River watershed. Furthermore, 53.6% of cross-validated CRF models had an AUC higher than 0.9 in the Mid-Atlantic watershed and 35.9% of cross-validated CRF models had an AUC higher than 0.9 in the Ohio River watershed. This accuracy is also reflected in the separate measures of predicting species presence (sensitivity = 0.84 and 0.79 for the Mid-Atlantic and Ohio River models, respectively).

Predictive performance of the CRF model varied among individual species across watersheds (Appendix Tables 4 and 5), but predicted occurrence with moderate to high accuracy for almost all species across both watersheds. The CRF model was highly accurate in predicting occurrence for more species in the Mid-Atlantic watershed than the Ohio River watershed. There were 27 (51.9%) species with an AUC value greater than 0.9 from the Mid-Atlantic CRF model, and 11 species (23.9%) from the Ohio River CRF model. There were only 4 (7.7%) species with an AUC value less than 0.7 from the Mid-Atlantic CRF model, and three species (6.5%) from the Ohio River CRF model (Fig. 2). In the Mid-Atlantic model, predictive performance for the CRF model was highest for Sander vitreus and Clinostomus funduloides (AUC = 0.96) and lowest for Clinostomus elongatus (AUC = 0.62). The full CRF model for the Ohio River model performed best for Etheostoma zonale (AUC = 0.96) and had the lowest AUC value for Lota lota (AUC = 0.50).

Fig. 2

Distribution of predictive performance metrics AUC and sensitivity across 70 freshwater fish species for a multi-species model that included both species co-occurrences and abiotic predictor variables (CRF) and single-species models that only included abiotic predictor variables (GLM) for two regional watersheds (Mid-Atlantic and Ohio River). Vertical dashed lines for AUC at 0.7 and 0.9 show cutoffs for low accuracy (AUC < 0.7), moderate accuracy (0.7 < AUC < 0.9), and high accuracy (AUC > 0.9) models

Many of the single-species GLMs performed moderately well as measured by AUC (Fig. 2; Appendix Tables 4 and 5), though none of the cross-validated GLMs averaged higher than 0.9 for either watershed. Collectively, the mean AUC value (90% bootstrapped confidence interval) across all species, folds, and repetitions was 0.71 (0.54, 0.85) for the Mid-Atlantic watershed and 0.66 (0.51, 0.81) for the Ohio River watershed. The GLMs were moderately accurate, with an AUC value between 0.7 and 0.9, for 34 species (65.4%) from the Mid-Atlantic watershed and 13 species (28.3%) from the Ohio River watershed (Fig. 2). The remaining 18 (34.6%) and 33 (71.7%) species from the Mid-Atlantic and Ohio River watersheds, respectively, were predicted with low accuracy. Across the GLMs for the Mid-Atlantic species, the highest AUC value was 0.81 for Micropterus dolomieu, and the lowest was 0.55 for Notemigonus crysoleucas. Across the GLMs for the Ohio River watershed, the highest AUC was 0.78 for Erimystax dissimilis, and the lowest AUC was 0.52 for Lepomis gibbosus.

Fig. 3

Difference between the multi-species conditional random fields model (CRF) and the single-species generalized linear model (GLM) of the mean value of predictive performance metrics from a fivefold cross-validation repeated 100 times for the Mid-Atlantic watershed

When comparing the predictive performance between models within the Mid-Atlantic, the multi-species CRF generally outperformed the single-species GLMs as measured by AUC and sensitivity, where the GLM only did as well as the CRF for some species, but never better (Fig. 2). The CRF model had higher AUC estimates than the GLM for all species (Fig. 3), with 37 (71.2%) having significantly different (i.e., non-overlapping) 90% confidence intervals (Appendix Table 4). Again, this is further reflected in the sensitivity measures, where the CRF also had higher values for predicting species presence for all species. The largest difference in estimated AUC values in favor of the CRF model in the Mid-Atlantic watershed was for Carpiodes cyprinus, where the CRF model was highly accurate and the GLM predicted with low to moderate accuracy (CRF AUC = 0.93 [0.88, 0.96]; GLM AUC = 0.63 [0.54, 0.73]). In the Mid-Atlantic, the smallest difference in AUC point estimates in favor of the CRF was for Clinostomus elongatus, but both models had low accuracy with wide and overlapping 90% confidence intervals (CRF AUC = 0.62 [0.49, 0.86]; GLM AUC = 0.59 [0.47, 0.73]).

Within the Ohio River watershed, the CRF model again generally outperformed the single-species GLMs as measured by AUC and sensitivity, where the GLM did as well as the CRF for some species, but never significantly better. The CRF model had higher AUC estimates for all but two species and 25 (54.3%) had significantly different 90% confidence intervals. The CRF model had higher sensitivity scores than the GLMs for all but one species (Fig. 4). In the Ohio River watershed, the largest difference in estimated AUC values in favor of the CRF model was for Notropis rubellus (CRF AUC = 0.90 [0.84, 0.96]; GLM AUC = 0.61 [0.50, 0.72]), and the largest difference in favor of the GLM was for Lota lota (GLM AUC = 0.69 [0.50, 0.87]; CRF AUC = 0.50 [0.50, 0.50]), suggesting the CRF model was no better than a coin flip for this species. Although the GLM had a higher point estimate for AUC, its wide confidence interval suggests that we cannot claim that the GLM did significantly better at predicting Lota lota occurrence.

Fig. 4

Difference between the multi-species conditional random fields model (CRF) and the single-species generalized linear model (GLM) of the mean value of predictive performance metrics from a fivefold cross-validation repeated 100 times for the Ohio River watershed

CRF modeling: key coefficient summaries

Fig. 5

Frequency of (a) abiotic factors and (b) species co-occurrences appearing as key coefficients from the conditional random fields model in the Mid-Atlantic. Fill color represents proportion of effect type for each factor

Across all species, there were 712 key coefficients identified within the Mid-Atlantic model and 583 key coefficients identified within the Ohio River model (Appendix Tables 6 and 7 provide the full list of key coefficients for every species). Across both watersheds, the species co-occurrence main effects were the most frequently identified key coefficients (410 and 319 for the Mid-Atlantic and Ohio River, respectively; Figs. 5b and 6b), followed by context-dependent co-occurrence effects (297 and 261 for the Mid-Atlantic and Ohio River, respectively) and then abiotic main effects (5 and 3 for the Mid-Atlantic and Ohio River, respectively; Figs. 5a and 6a). When comparing these against their expected counts based on random chance (i.e., their respective percentages of total coefficients included in the model fitting multiplied by their respective total key coefficients), we saw that species co-occurrence effects (54 and 44 expected for the Mid-Atlantic and Ohio River models, respectively) over-performed (i.e., they were identified as key coefficients more often than expected), and abiotic main effects (12 and 11 expected for the Mid-Atlantic and Ohio River models, respectively) and interaction effects (646 and 528 expected for the Mid-Atlantic and Ohio River models, respectively) under-performed (i.e., identified as key coefficients less often than expected).
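The expected counts reported above follow from simple proportional allocation, which the Python sketch below reproduces. The candidate totals (52 or 46 co-occurrence effects, 12 abiotic effects, and their products as interaction effects) and the key coefficient totals are taken from the text; the function and variable names are hypothetical.

```python
def expected_key_counts(n_species, n_abiotic, n_key):
    """Expected number of key coefficients per effect type if key
    coefficients were spread in proportion to the candidate coefficients."""
    candidates = {
        "co-occurrence": n_species,
        "abiotic": n_abiotic,
        "interaction": n_species * n_abiotic,
    }
    total = sum(candidates.values())
    return {k: n_key * v / total for k, v in candidates.items()}

mid_atlantic = expected_key_counts(52, 12, 712)  # observed: 410, 5, 297
ohio_river = expected_key_counts(46, 12, 583)    # observed: 319, 3, 261

for label, expected in (("Mid-Atlantic", mid_atlantic), ("Ohio River", ohio_river)):
    print(label, {k: round(v) for k, v in expected.items()})
```

Comparing these expectations with the observed counts shows the imbalance directly: co-occurrence main effects were identified several times more often than expected, whereas abiotic main effects and interaction effects fell short of their expected counts.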

Fig. 6

Frequency of (a) abiotic factors and (b) species co-occurrences appearing as key coefficients from the conditional random fields model in the Ohio River watershed. Fill color represents proportion of effect type for each factor

Both watershed models estimated a complex network of species associations. In the Mid-Atlantic model, all 52 species appeared as a key coefficient for predicting the occurrence of at least one other species, whereas 45 of the 46 species included in the Ohio River model appeared as a key coefficient for predicting the occurrence of at least one other species (Figs. 5b and 6b). The burbot Lota lota was the only species that did not appear as a key coefficient in the Ohio River model. The species most commonly identified as key coefficients were the creek chub Semotilus atromaculatus in the Mid-Atlantic and western blacknose dace Rhinichthys obtusus in the Ohio River watershed. Conversely, the rosyside dace Clinostomus funduloides in the Mid-Atlantic and brook trout Salvelinus fontinalis in the Ohio River watershed were species important for predicting the occurrence of relatively few other species (Fig. 6b). For species that occurred across both models, their effects as key coefficients varied. For example, the brook trout appeared as a key coefficient only once (as a negative main effect for Pimephales notatus) in the Ohio River model, but it was identified as a key coefficient 10 times (5 main effects and 5 interaction effects) in the Mid-Atlantic model. Species co-occurrences varied in their effect type across watersheds. In the Mid-Atlantic model, the fallfish Semotilus corporalis was the most frequently occurring species co-occurrence effect (n = 14), whereas the greenside darter Etheostoma blennioides and the northern hogsucker Hypentelium nigricans tied for the most frequently occurring species co-occurrence effect (n = 15) in the Ohio River model. Additionally, the most frequently identified species co-occurrences among key context-dependent effects were the channel catfish Ictalurus punctatus and creek chub (n = 14) in the Mid-Atlantic model and the common shiner Luxilus cornutus (n = 14) in the Ohio River model.

Abiotic main effects were infrequently identified across both models. In the Mid-Atlantic model, there were five abiotic main effects, one each for sandy soil cover, developed land cover, drainage area size, predicted stream temperature, and agricultural land cover. In the Ohio River model, only three abiotic main effects were identified, one each for predicted stream temperature, drainage area size, and nitrate deposition. The total frequency of abiotic factors appearing as key coefficients (main and interaction effects combined) varied by watershed. In the Mid-Atlantic model, the amount of open water was the most frequently identified abiotic factor, whereas drainage area size was the most frequently identified abiotic key coefficient in the Ohio River model. Across both models, drainage area size and sandy soil cover were among the three most frequently identified abiotic key coefficients. Abiotic factors positively associated with species occurrence varied across watersheds (Appendix Tables 6 and 7). Across both watershed models, the proportion of all abiotic factors (main and interaction effects) with positive coefficients was approximately 25%. This suggests that species most often had negative associations (i.e., decreased probability of occurrence) with the included abiotic factors. In the Mid-Atlantic model, predicted stream temperature had the highest proportion (68.4%) of positive coefficient values, corresponding to a number of warmwater species within the region. In the Ohio River model, by contrast, slope had the highest proportion (100%) of positive values, suggesting that a number of species used high-gradient stream habitat.

Table 3 Summary of key coefficients for the fathead minnow within the Mid-Atlantic watershed

The effect of co-occurrences on predicting species occurrence also varied along environmental gradients (i.e., context dependency) for 51 (98%) and 46 (100%) species in the Mid-Atlantic and Ohio River watersheds, respectively. This suggests that species co-occurrence effects varied along abiotic gradients and that this was important for almost every species. For example, the model for the fathead minnow Pimephales promelas in the Mid-Atlantic watershed included 11 context-dependent co-occurrences among its key coefficients (Table 3). Figure 7 visualizes the context dependency of species co-occurrences as the amount of open water increases within the sampled catchment. Here, we see a number of species co-occurrences shift from a positive effect to a negative effect on predicting the occurrence of fathead minnow as the amount of open water increases. Of the seven species co-occurrences with the fathead minnow shown, four had positive associations at low levels of open water that became negative at high levels of open water. This suggests that in areas with less water, such as first-order streams, the fathead minnow would be expected to co-occur with these four species, whereas in areas with more water, such as higher-order streams and large rivers, it would no longer be expected to co-occur with them.
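The sign change described above can be illustrated with a short calculation. The following is a minimal sketch that assumes the conditional effect of a co-occurring species is linear in a scaled covariate, that is, a main effect plus an interaction coefficient multiplied by the covariate value. The coefficient values and open-water levels are hypothetical and are not taken from the fitted model or Table 3.

```python
def conditional_effect(main_effect, interaction, covariate):
    """Co-occurrence effect at a given scaled covariate value.

    Assumes a linear interaction: main effect + interaction * covariate.
    """
    return main_effect + interaction * covariate


# Hypothetical coefficients for one co-occurring species (illustration only,
# NOT estimates from the model described in the text).
main_effect = 0.6   # positive association at the covariate's reference value
interaction = -0.5  # association weakens as open water increases

# Hypothetical scaled open-water values at the minimum, midpoint, and maximum.
open_water_levels = {"minimum": -1.0, "midpoint": 0.0, "maximum": 2.0}

effects = {
    label: conditional_effect(main_effect, interaction, value)
    for label, value in open_water_levels.items()
}

# Covariate value at which the association changes sign.
sign_change_at = -main_effect / interaction

for label, effect in effects.items():
    direction = "positive" if effect > 0 else "negative"
    print(f"{label:8s} effect = {effect:+.2f} ({direction})")
print(f"sign change at scaled open water = {sign_change_at:.2f}")
```

With these illustrative values, the association is positive at the minimum and midpoint of the gradient and negative at the maximum, which is the qualitative pattern reported for four of the seven species shown in Fig. 7.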

Fig. 7

An illustration of context dependency for the effects of species co-occurrences when predicting the occurrence of the fathead minnow Pimephales promelas. The co-occurrence network for fathead minnow (center node) is plotted along the open-water gradient within the Mid-Atlantic watershed. The graphs to the left, middle, and right represent the estimated network structure at the minimum, midpoint, and maximum levels of open water where fathead minnows were sampled, respectively. Dashed red and solid blue lines represent negative and positive estimated co-occurrence patterns, respectively, with line weight representing the relative strength of each association. Graphs are filtered to the seven species with the strongest overall estimated co-occurrence effects with the fathead minnow. Co-occurrence patterns between those seven species are transparent to emphasize co-occurrences that would be used to predict the occurrence of the fathead minnow

Discussion

We found that using fish assemblage co-occurrence data can improve predictions of stream fish distributions across regional watersheds compared to relying on remotely sensed landscape-derived environmental data alone. Here, we used a novel CRF modeling framework which offers a number of aforementioned advantages. Although there was variability in the magnitude of the differences among species and watersheds, the predictive performance (as measured by AUC) was consistently higher for the CRF models compared to landscape-based GLMs. Across both the Mid-Atlantic and Ohio River watersheds, the abiotic-only GLMs never outperformed the CRF models in predicting species occurrence. This aligns with previous efforts to model stream fish occurrence with assemblage data using JSDMs (Inoue et al., 2017; Rodríguez et al., 2021; Wagner et al., 2020). For example, Wagner et al. (2020) showed that including fish assemblage data through a JSDM framework improved conditional predictions of species occurrences for fishes in both stream and lake habitats. Their modeling efforts also showed that their abiotic variables performed poorly in predicting species occurrence. Similarly, Inoue et al. (2017) showed that when jointly modeling freshwater mussels and fishes, the residual correlations (i.e., species dependencies) were prevalent among fishes, whereas mussel occurrences were exclusively explained by abiotic factors. These studies, along with our analysis, suggest that modeling stream fish distributions with only remotely sensed abiotic factors, such as land cover, may lead to spurious inferences and incorporating the additional information provided by species co-occurrences can often improve predictions.

Scale dependency has been shown to play a significant role in species distribution models (Geheber & Geheber, 2016; König et al., 2021). Our study, like most regional modeling efforts, used landscape-scale abiotic factors (i.e., land use and cover) instead of habitat measurements at the local or micro-habitat scale. This resulted in our models relying heavily on species co-occurrences to accurately predict occurrence. The high frequency of important species co-occurrences is likely due to their ability to capture overlapping habitat requirements at a finer scale. For example, stream flow is well known to be an important abiotic filter for stream fish (e.g., McManamay & Frimpong, 2015; Poff & Allan, 1995; Van Vliet et al., 2013). However, flow requires physical sampling or modeling and may not be readily available for landscape-scale modeling efforts. Thus, we relied on other factors to act as proxies for flow, such as drainage area size or the amount of open water. Both drainage area and the amount of open water appeared among the most frequently estimated abiotic key coefficients, though they were rarely estimated as main effects. Instead, they were estimated as context-dependent co-occurrences, furthering the notion that the scale of our abiotic data was relatively coarse and not able to represent instream habitat as well as co-occurrence data. Our finding that species co-occurrences were relatively more important effect types within our models does not suggest that abiotic factors are unimportant filters for structuring stream fish assemblages. Instead, the assumption that our remotely sensed landscape variables, particularly land cover, would act as proxies for true instream habitat was shown to be insufficient for many species of stream fish. Context-dependent species co-occurrences were important effect types for improving the accuracy of species distribution predictions. 
Although they were estimated as key coefficients less frequently than expected across both models, they were still an important effect type for almost every species. This is important because even with the high relative importance of species co-occurrences, they were often dependent on the environmental conditions present in the stream’s catchment. In other words, species co-occurrences were often capturing overlapping fine-scale habitat requirements between pairs of species, but the context dependency of these co-occurrences helped capture where these habitat needs diverged. However, these models are scale-dependent, and inferences regarding the ecological processes structuring assemblages should be drawn from them with caution.

Recall that a limitation of the CRF is that it cannot predict to unsampled locations. However, CRF predictions at sites with existing assemblage data are still useful for fisheries management as a means to identify high priority sites for potential invasions or re-introductions. A number of species within our study are invasive, such as flathead catfish Pylodictis olivaris and the banded darter Etheostoma zonale in the Mid-Atlantic watershed. The flathead catfish is of particular concern within the region due to its large size and piscivorous diet (Brown et al., 2005; Smith et al., 2021). Fisheries managers could use the CRF model to identify areas with higher predicted probabilities of occurrence as habitats at risk for flathead catfish or banded darter range expansion. For flathead catfish, however, the analysis of predictive performance from the cross-validation suggested there was no significant difference between the two modeling approaches, so either the CRF or GLM could be used for this purpose. That said, the CRF model for flathead catfish does offer additional information that can help generate hypotheses about the potential effects of invasion in new locations. In the CRF analysis, we see a number of important species co-occurrences (both main and context-dependent effects) with comely shiner Notropis amoenus and channel catfish Ictalurus punctatus (Appendix Table 6). In fact, the frequency of key coefficients with the comely shiner and channel catfish may indicate shared, fine-scale habitat requirements not captured with our environmental data, or perhaps even important potential biotic interactions between these species. Importantly, the key coefficients associated with biotic factors may help identify which species face the biggest threat from this invasive species if they are consistently co-occurring. 
In contrast to the flathead catfish, we did see large improvements in predicting the occurrence of the banded darter using the CRF in the Mid-Atlantic watershed. Again, the occurrence of this species was not associated with a key abiotic main effect, suggesting that remotely sensed landscape data were poor predictors of its occurrence (Appendix Table 6). Such results may provide fisheries managers with an important foundation for developing management and research plans for invasive species. Although just one example, this illustrates how fisheries managers may use CRF predictions and leverage available fish assemblage data to help inform management and conservation efforts. However, if predicting to unsampled locations is a main research objective, the aforementioned JSDMs may be a more suitable modeling framework. JSDMs still simultaneously model species dependencies with abiotic factors, but they do not require a full assemblage sample, which allows them to make unconditional predictions to unsampled locations.

Estimated coefficients from the CRF model may also be used to generate hypotheses about potential biotic interactions. It is important to first emphasize that species co-occurrences do not directly indicate true interactions and such inferences should be avoided (Poggiato et al., 2021). As previously noted, many, if not most, of these estimated co-occurrences are representative of missing abiotic information and capture overlapping habitat requirements (Zurell et al., 2018). However, they do represent patterns seen across assemblages, and biotic interactions play an important role in structuring assemblages (Hutchinson, 1957; Ovaskainen et al., 2017). Therefore, key species co-occurrences can be used, with caution, as a baseline for developing hypotheses regarding potentially important biotic interactions. For example, previous studies have shown that brook trout and brown trout can have negative interactions (Hoxmeier & Dieterman, 2013; Wagner et al., 2013). Our Mid-Atlantic model estimated positive co-occurrences between the two species, an artifact of overlapping habitat requirements. This effect, though, was mediated by negative, context-dependent co-occurrences with both drainage area and amount of open water. These could potentially represent either diverging habitat requirements or context-dependent biotic interactions.

Finally, we clarify key caveats of the assumptions of our analysis. First, our abiotic predictor variables were limited to remotely sensed landscape data and predicted water temperature. It is possible that other landscape-scale variables not included in our analysis, such as lithology, act as better proxies for instream habitat factors structuring stream fish assemblages. Thus, our results should be interpreted within the context of the predictor variables we included in our models. Previous research has suggested that macroscale variables were sufficient proxies of instream characteristics for modeling stream fish distributions, albeit within a significantly different ecosystem (Brazilian Amazon basin; Frederico et al., 2014). Additionally, a study in the Piedmont Plateau region of Georgia, USA (a region that also extends into a portion of southeastern Pennsylvania), found that reach-scale geomorphology factors were the best predictors for species composition (Walters et al., 2003). That same study, however, found that stream slope was a dominant factor, whereas our analysis rarely identified it as a key coefficient across both watersheds. Furthermore, a study by Magalhaes et al. (2002) suggested that coarse-scale factors may better explain fish assemblage variation in Mediterranean streams than micro-habitat factors. Importantly, they hypothesized that these observed patterns could have been due to seasonal changes in water availability, such as summer droughts, an important factor in understanding fish distributions that was not captured within our analysis due to the stationarity of the data. This leads to another important caveat. Both biotic and abiotic data were summarized to single values across time due to the stationarity of the hydrological dataset (NECD). As previously mentioned, this prevented our analysis from relating any potential changes in habitat to potential changes in species occurrence. 
Thus, if a stream catchment underwent rapid changes in habitat, such as human development, the environmental predictors modeled may not accurately reflect the habitat conditions experienced by the fishes when sampled. We also assumed that summarizing the fish data into presence–absence reduced the effect of varying catchability and imperfect detection, and thus we did not directly incorporate detection into our model. It is possible, however, that some species were still not accurately detected, which could introduce bias into our modeling efforts. That said, our assumption was reasonable for this analysis given the breadth of time and gear types used to detect species presence. Most instances of imperfect detection would likely occur for rarer species, which were already excluded from the model due to sample size restrictions. However, if accounting for imperfect detection is required for predicting species occurrence, there are multi-species models that incorporate imperfect detection within their framework (e.g., MacKenzie et al., 2004; Rota et al., 2016). Although our analysis suggested that using fish assemblage co-occurrence data can improve the predictive performance of stream fish species distribution modeling, numerous studies show that abiotic factors can adequately predict stream fish occurrence. Ultimately, the choice of species distribution model should be made with respect to available data and specific research objectives.

One of the most common approaches to modeling species distributions is to use landscape-based (often remotely sensed) habitat data as proxies for instream habitat conditions. We showed that predictions of fish assemblage distributions can be improved for many species by leveraging co-occurrence information. Furthermore, by taking advantage of the CRF methodology, we were able to directly compare the relative importance of species co-occurrences against abiotic factors, while allowing them to interact and be context-dependent. This information can help inform hypotheses about the effects of species range expansions on native fishes and about the relative importance of species co-occurrences and abiotic drivers of species distributions, which can help motivate future research. Future research could also explore how modeling choices related to spatial scale and data resolution affect predictive performance for CRF models of stream fish assemblages. For example, comparing results from this analysis against efforts that include fine-scale, instream habitat data and/or relative abundance data would further improve our understanding of the ecological filters structuring stream fish assemblages. The predictive performance of the CRF, which maintains symmetric relationships between species, could also be compared against alternative multi-species modeling techniques that allow for asymmetric relationships between species.

Author statements

Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.