Introduction

The expansion of cities, both in geographic scope and in resource use, can result in ecological degradation (Elmqvist et al. 2016; Johnson and Munshi-South 2017). However, urban environments are not necessarily just degraded forms of pre-existing ecosystems but may be more accurately considered a class of anthropogenic biome (Pincetl 2015; Fleming and Bateman 2018; Teixeira and Fernandes 2020). Urban environments are shaped by a combination of variables both natural, such as bioclimate and topography (Qian et al. 2020; Kendal et al. 2018), and anthropogenic, such as land use and artificial illumination (Johnson et al. 2018; Pauwels et al. 2019). Through a variety of human activities, ranging from transport to habitat modification, urban environments have come to contain a mix of native and non-native species (Helden et al. 2012; Gaertner et al. 2017; Godefroid and Ricotta 2018; Gruver and CaraDonna 2021). While heavily modified, these environments can contain diverse and functioning ecosystems (Baldock et al. 2019; Beller et al. 2019; Wenzel et al. 2020; Casanelles-Abella et al. 2021). This in turn has motivated interest in the study and management of urban ecosystems in their own right, and in how they are shaped by environmental gradients (Grêt-Regamey et al. 2017; Shaffer 2018; Montero 2020; Uchida et al. 2021).

Research has been carried out on the potential of various measures of biodiversity to serve as environmental indicators in urban environments (Godefroid 2001; Llop et al. 2012; Guilland et al. 2018; Alquezar et al. 2020). Though the presence or abundance of various species has been used as an environmental indicator, there remains the ill-defined problem of selecting such species for a given set of environmental assessment criteria (Siddig et al. 2016). Furthermore, while some urban ecosystems have been systematically assessed (Baldock et al. 2019; Planillo et al. 2021; Casanelles-Abella et al. 2021), there have often been significant limitations in obtaining a sufficient number of observations with which to build and test assessment models for most cities (Cappa et al., 2021).

To address these limitations we propose using machine learning, in combination with community science-based species observations, to select and evaluate both the accuracy and behavior of species for use as indicators of environmental quality. Community science, the collection or analysis of data by non-professional scientists, has shown promise in enabling the collection of sets of species observations on a far larger scale than from individual research projects (McCaffrey 2005; Silvertown 2009; Kobori et al. 2016; Ballard et al. 2017; Spear et al. 2017), especially on private lands which are typically undersampled in urban environments (Ballard et al. 2017). Machine learning, in particular species distribution models (SDMs), can then further extend the geographic extent of our understanding of species distributions by generating predictions on how the presence of species will vary in response to environmental conditions, even from a relatively small set of observations (Elith and Leathwick 2009). The accuracy of SDMs has been used to investigate the potential of various species, both native and non-native, to act as environmental indicators (Sergio and Newton 2003; Growns et al., 2013; Vallecillo et al. 2016). In urban environments, SDMs have been investigated as a means for assessing patterns of biodiversity at spatial resolutions not typically possible with point-based sampling (Milanovich et al. 2012; Stas et al. 2020; Wellmann et al. 2020; Casanelles-Abella et al. 2021; Planillo et al. 2021). Urban SDMs have also enabled comparisons of the effects of socio-ecological factors, driven by anthropogenic activity, versus natural variations in the environment (Rhodes et al. 2006; Le Louarn et al. 2018; Liu et al. 2019). 
Additionally, models predicting the richness of various groups of species, species richness models (SRMs), have also been studied in a similar fashion as SDMs for investigating the impacts of environmental conditions on urban biodiversity (Gavier-Pizarro et al. 2010; Perillo et al. 2017; Fröhlich and Ciach 2019).

In order to investigate the potential of combining the use of SDMs and SRMs with community science-based observations of species, we used the city of Los Angeles as our test case. We selected Los Angeles as there is current interest in assessing it ecologically (Jenerette et al. 2016; Spear et al. 2017; McGlynn et al. 2019; Avolio et al. 2020; Rauser 2021), covers significant variations in both elevation and microclimates (Tayyebi and Jenerette, 2016), lies within one of the 36 most biodiverse terrestrial ecosystems in the world (Myers et al. 2000), and contains a large number of observations from community scientists (Vendetti et al. 2018; Leong and Trautwein 2019; Callaghan et al., 2020). As Los Angeles is a heavily urbanized area within a designated biodiversity ‘hotspot’ (Gillespie et al. 2018), it can also serve as a model city for assessing the potential of various species to act as environmental indicators in an urban context.

Within Los Angeles we then propose to investigate the following in an urban environment:

(1) Compare the accuracy of SDMs constructed from native and non-native species.

(2) Assess the importance of anthropogenic and natural environmental variables in shaping urban biodiversity patterns as described by SDMs.

(3) Identify individual species, given the accuracy of their SDMs, which may act as accurate environmental indicators.

(4) Construct and assess the reliability of SRMs, assembled from species with the most accurate SDMs, to predict species richness across an urban landscape.

Methods

Software and workflow

Our analysis was conducted in R v4.1.2 using RStudio 2021.09.1 + 372 “Ghost Orchid” Release (RStudio Team, 2021). Processing of geospatial data involving the use of Geospatial Data Abstraction Library (GDAL) was done using version 3.4.1 of that software. A diagram of our analysis workflow is illustrated in Fig. 1.

Fig. 1 Workflow diagram for generating and evaluating SDMs and SRMs

Study area

Our study area covered the city of Los Angeles, an urban environment centered at 34.05° N, 118.24° W with approximately 4 million inhabitants covering over 1200 km2 (Kawabata and Shen 2006). The city is part of the greater Los Angeles area, an urban agglomeration containing over 18 million people covering approximately 88,000 km2 of southern California biomes ranging from chaparral to coastal oak forests (Tayyebi and Jenerette, 2016). This region is within a Mediterranean climatic zone, receiving the bulk of its annual rainfall (40 cm) during the winter season (Hill et al. 2016). While the environment of this region is heavily modified through the expansion of impervious surfaces, irrigation, river channelization and transportation networks, it has been found to contain a diverse, albeit heavily altered, set of species (Pataki et al. 2013; Li et al. 2019; Adams et al. 2020; Rogers et al. 2020).

Occurrence data

We sought to obtain as diverse an array of species as possible from public databases in order to assess their behavior as potential environmental indicators. Our initial set of species observational data was obtained from the Global Biodiversity Information Facility (GBIF) (GBIF.org, 2021) with the following query: (1) observations made within the period 2010–2020, (2) within the spatial extent of Los Angeles County, and (3) with a spatial uncertainty of less than 30 m. We selected GBIF data within the decade 2010–2020 as it provided species observations covered by the range of time when our environmental data sets were collected (SI: File 4). This initial set of 120,713 observations was then filtered using the function st_intersection from the R package sf v1.0-5 (Pebesma 2018) to contain only observations within the political boundaries of the city of Los Angeles and with a spatial uncertainty of less than 10 m.

To reduce the effects of spatial clustering of our occurrence data we performed spatial thinning, using the function thin within the R package spThin v0.2.0 (Aiello-Lammens et al. 2015), with a minimum separation distance of 500 m. Species were retained if at least 25 presence points remained after spatial thinning, as this is a conservative minimum number for generating accurate SDMs using MaxEnt (van Proosdij et al. 2016), leaving 20,050 observations covering 122 species for analysis. These species were assigned a native or non-native status using a CalFlora list of plants and fungi native to Los Angeles County (SI: File 1), and a corresponding list of animal species curated by the Los Angeles Sanitation & Environment (LASAN, 2021) (SI: File 2), producing a split of 96 native and 26 non-native species across 10 classes (Table 1).
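The thinning and retention rule described above can be illustrated with a minimal sketch. It is written in Python rather than R, and it uses a simple greedy rule, not the randomized algorithm implemented in spThin; the function names `thin_points` and `thin_and_filter` are hypothetical, and coordinates are assumed to be in a projected system with units of metres.

```python
import math
import random


def thin_points(points, min_dist):
    """Greedy thinning: keep a point only if it lies at least min_dist
    from every point already kept (projected coordinates assumed)."""
    kept = []
    for x, y in points:
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
    return kept


def thin_and_filter(occurrences, min_dist=500.0, min_points=25, seed=0):
    """Thin each species' points, then retain only species that still have
    at least min_points presences (25 in the text above)."""
    rng = random.Random(seed)
    retained = {}
    for species, pts in occurrences.items():
        shuffled = list(pts)
        rng.shuffle(shuffled)  # visiting order affects which points survive
        thinned = thin_points(shuffled, min_dist)
        if len(thinned) >= min_points:
            retained[species] = thinned
    return retained
```

Because the greedy rule depends on the order in which points are visited, repeated runs with different seeds can retain different subsets; spThin addresses this by repeating the procedure and keeping the largest retained set.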

Table 1 Number of species, number of species with fairly performing SDMs (Mean TSS ≥ 0.3), their native / non-native status, organized by class

The number of presences remaining after thinning, which we used to estimate sampling bias for each species, ranged from 25 to 242. Estimating bias per species allows the sampling density of background points to grow or shrink in proportion to that species' sampling effort. This approach is often used to account for spatial bias in non-systematic sampling efforts (Phillips et al. 2009; Syfert et al. 2013; Molloy et al. 2017).

Environmental data

We obtained our initial set of 39 environmental layers, 25 natural and 14 anthropogenic, from a variety of sources (https://doi.org/10.5068/D1W988). We selected these initial layers (SI: File 4) as they cover both variations in bioclimate and topography and anthropogenic modifications to the environment, such as housing density and exposure to light pollution, which have been found to influence the spatial distributions of a wide variety of species (Davies et al. 2008; Santorufo et al. 2012; Chong et al., 2014; Norton et al. 2016; Lin et al. 2021; Simons et al. 2021). We also selected a number of composite measures of anthropogenic disturbance, specifically habitat quality (Brown, 2019), the global human modification index (gHM) (Kennedy et al. 2019), the CalEnviroScreen pollution exposure score (PollutionS) and its composite with a human population vulnerability index (CIscore) (Faust et al. 2017). These measures integrate data on land cover, habitat connectivity, anthropogenic disturbance, impacts to human health, and interactions between human population characteristics and environmental contamination, all of which have been found to influence urban biodiversity (Table 2).

Table 2 Environmental variables used in this project, with attributions

This set of environmental layers was then clipped and aligned to the city boundaries of Los Angeles using GDAL (Warmerdam 2008) with the project coordinate reference system (EPSG:2229) and resolution (30 ft / 9.1 m). From this initial set of environmental layers we retained 21 after removing those with a Pearson’s correlation greater than 0.7 to other layers (Barber et al., 2021). This was done using the function removeCollinearity within the R package virtualspecies v1.5.1 (Leroy et al. 2016) with 100,000 randomly selected points. Of the layers we retained, 12 described anthropogenic variables and 9 natural environmental variables (Table 2). We calculated the variability in these remaining environmental layers by calculating their coefficients of variation using the function cellStats in raster (SI: Table 5).
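As an illustration of the correlation filter, the following Python sketch drops any layer whose absolute Pearson correlation with an already retained layer exceeds the cutoff. This greedy rule is a simplification and is not the clustering procedure used by removeCollinearity in virtualspecies; the function names and the dictionary input (layer name mapped to values sampled at the same random points) are assumptions made for the example.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def drop_collinear(layers, cutoff=0.7):
    """Keep a layer only if |r| with every previously kept layer is at or
    below the cutoff. `layers` maps layer name -> sampled values."""
    kept = []
    for name, values in layers.items():
        if all(abs(pearson(values, layers[k])) <= cutoff for k in kept):
            kept.append(name)
    return kept
```

Which member of a correlated pair survives depends here on the order of the input, so in practice the layers of greatest ecological interest would be listed first.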

Accounting for spatial bias in species observations

To visualize geographic biases in observational data we converted our spatially thinned occurrences into a heatmap using the function stat_density_2d in the R package ggplot2 v3.3.5 (Wickham et al. 2016). Prior to running species distribution models, spatial biases in species observations were accounted for by modifying the sampling of background points using a probability function corresponding to sampling effort for each species (Filazzola et al. 2018; von Takach et al. 2020). This probability density function was generated for each species using a two-dimensional kernel density estimate, with the function kde2d in the R package MASS v7.3-54 (Ripley et al. 2013), using the density of spatially thinned observations for each species. From this kernel density estimate, a spatial bias raster was then generated for each species, as well as for the spatially thinned data set as a whole (SI: Fig. 4), in the project coordinate reference system and resolution, using the function raster in the R package raster v2.5-2 (Hijmans and van Etten, 2015). Each of these spatial bias rasters was then clipped and aligned to the boundaries of Los Angeles using the function gdalwarp within the R package gdalUtils v2.0.3.2 (Greenberg and Mattiuzzi 2015).

Building SDMs

In order to construct SDMs we used the machine learning technique of Maximum Entropy (MaxEnt). We selected MaxEnt for constructing our SDMs as it can work with presence-only data for species observations (Elith et al., 2011), which are commonly found in databases such as the GBIF (Edwards 2004). The accuracy of MaxEnt-based SDMs has also been used in assessing the potential of species to act as environmental indicators (Jose V, 2020), that is, species whose likelihood of presence can be accurately predicted by a set of environmental conditions.

For each species a set of 10 MaxEnt models were run using the function maxent, within the R package dismo v1.3-5 (Hijmans et al. 2017), with its default settings. Species input data consisted of both presence points, as well as 10,000 background points sampled using the function xyFromCell within the package raster and a sampling probability determined using a species specific spatial bias raster. To enable an assessment of accuracy, each model utilized a randomly sampled set of 80% of presence and background points for model training, while the remaining 20% were used for testing.
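The bias-weighted background sampling and the 80/20 partition can be sketched as follows. This is a Python illustration under stated assumptions, not the R workflow itself: the bias surface is represented as a dictionary from cell identifier to a non-negative weight, sampling is done with replacement, and the function names are hypothetical.

```python
import random


def sample_background(bias, n, seed=0):
    """Draw n background cells with probability proportional to the
    species-specific bias weight of each cell (with replacement)."""
    rng = random.Random(seed)
    cells = list(bias)
    weights = [bias[c] for c in cells]
    return rng.choices(cells, weights=weights, k=n)


def train_test_split(records, train_frac=0.8, seed=0):
    """Randomly partition records into training and testing subsets."""
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    cut = int(round(train_frac * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]
```

Changing the seed for each of the 10 runs per species would give each model a different random partition, which is what allows a mean and spread of accuracy to be reported.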

Evaluating SDMs

Comparing native to non-native SDM accuracy: To assess the accuracy of these models, rates of true and false positives and negatives were found using the function evaluate within the dismo package. These rates were then used to calculate the True Skill Statistic (TSS) for each of the 10 models run per species. The TSS was used as it has been found to be a metric with little dependence on species prevalence (Allouche et al. 2006). The value of the TSS ranges from −1 to 1, with a value of 1 corresponding to a perfect agreement between predicted and actual distributions, and values of zero or less indicating a model’s predictions are no better than random (Allouche et al. 2006; Liu et al. 2009). For each species we calculated the mean and standard deviation of the TSS scores for its MaxEnt models (SI: File 3).
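For reference, the TSS is sensitivity plus specificity minus one. The short Python sketch below computes it from confusion-matrix counts and summarizes a set of per-run scores; the function names are hypothetical, and with presence-background data the "negatives" are background points rather than confirmed absences.

```python
import statistics


def true_skill_statistic(tp, fn, tn, fp):
    """TSS = sensitivity + specificity - 1, from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)  # true positive rate
    specificity = tn / (tn + fp)  # true negative rate
    return sensitivity + specificity - 1.0


def summarize_tss(scores):
    """Mean and sample standard deviation of TSS across repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)
```

For example, a model with 40 true positives, 10 false negatives, 70 true negatives and 30 false positives has a sensitivity of 0.8 and a specificity of 0.7, giving a TSS of 0.5.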

To visualize the distribution of mean TSS scores for each species, and how they differ between native and non-native species, the functions geom_violin and facet_grid were used from ggplot2. These functions were also used to visualize the distributions of the percent relative importances of both anthropogenic and natural environmental variables in our SDMs. To test if the distribution of TSS scores differed significantly between SDMs made for native and non-native species, we used a Kruskal-Wallis (K-W) test implemented with the function kruskal.test in the R package stats where α was set to 0.05.
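To make the two-group comparison concrete, the following Python sketch computes the tie-corrected Kruskal-Wallis H statistic for two samples and its p-value from the chi-squared distribution with one degree of freedom. It is a standard-library illustration only; the function name is hypothetical, and it assumes the pooled values are not all identical.

```python
import math
from collections import Counter


def kruskal_wallis_two_groups(a, b):
    """Tie-corrected Kruskal-Wallis H for two samples, with the p-value
    from a chi-squared distribution with 1 degree of freedom."""
    pooled = list(a) + list(b)
    n = len(pooled)
    counts = Counter(pooled)
    # Assign each distinct value the average of the ranks it occupies.
    rank_of, position = {}, 0
    for value in sorted(counts):
        c = counts[value]
        rank_of[value] = position + (c + 1) / 2
        position += c
    h = 0.0
    for group in (a, b):
        rank_sum = sum(rank_of[v] for v in group)
        h += rank_sum ** 2 / len(group)
    h = 12.0 / (n * (n + 1)) * h - 3.0 * (n + 1)
    tie_correction = 1.0 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    h = max(h / tie_correction, 0.0)
    # Survival function of chi-squared with 1 df: erfc(sqrt(h / 2)).
    p_value = math.erfc(math.sqrt(h / 2.0))
    return h, p_value
```

With two fully separated groups of three values each, this gives H ≈ 3.86, just under the α = 0.05 threshold.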

Comparing the importance of natural and anthropogenic environmental variables: The percent relative importance of each of the 21 environmental variables used to generate each SDM was calculated, as a percent contribution to each model, using the function var.importance within the R package ENMeval v2.0.2 (Muscarella et al. 2014). The mean and standard deviation of these relative importance values were calculated for each species (SI: File 3).

To visualize the rank mean relative importance of environmental variables for each SDM, the function geom_tile, within the R package ggplot2, was used to produce a heatmap. To test if the distribution of the percent relative importance for natural and anthropogenic environmental variables differed significantly, for both native and non-native SDMs, we again used a K-W test where α was set to 0.05.

Selecting environmental indicator species

Species were selected for constructing our SRMs following a cutoff on the mean TSS scores for their SDMs (≥ 0.3). While a TSS cutoff of 0.4 has been used to classify SDMs as accurate (Thuiller et al. 2019), only one of our species exceeded this threshold (Ardea herodias, mean TSS of 0.43), and so we used a more relaxed threshold of 0.3 to select SDMs with a fair level of accuracy (Landis and Koch 1977). This selection criterion produced a list of 9 native species in the class Aves, and 13 in the class Magnoliopsida (Table 1). For species in both lists, SDM maps were generated using the function predict within the R package raster. Each SDM map was generated using the maxent function with all available spatially thinned presence points, 10,000 background points, and a presence threshold set as the maximum sum of the specificity and sensitivity.
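The presence threshold described above, the value maximizing the sum of sensitivity and specificity, can be sketched in Python as follows. The function names are hypothetical, the candidate thresholds are simply the observed suitability scores, and ties are resolved in favor of the lowest threshold; none of these choices is taken from the dismo implementation.

```python
def max_sss_threshold(presence_scores, background_scores):
    """Return the threshold that maximizes sensitivity + specificity.
    A location is predicted present when its score is >= the threshold."""
    candidates = sorted(set(presence_scores) | set(background_scores))
    best_threshold, best_sum = None, -1.0
    for t in candidates:
        sensitivity = sum(s >= t for s in presence_scores) / len(presence_scores)
        specificity = sum(s < t for s in background_scores) / len(background_scores)
        if sensitivity + specificity > best_sum:
            best_threshold, best_sum = t, sensitivity + specificity
    return best_threshold


def binarize(scores, threshold):
    """Convert continuous suitability scores to 0/1 predicted presence."""
    return [1 if s >= threshold else 0 for s in scores]
```

Applying `binarize` to every cell of a suitability map with the selected threshold yields a presence/absence map for that species.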

Constructing and evaluating SRMs

The SDMs from these two classes were then summed, using the calc function in raster, to construct a SRM for species in each class (Fig. 2). One SRM represents the richness of nine species within the class Aves and the other represents the richness of thirteen species within the class Magnoliopsida (SI: Table 6). Visualization of these SRM maps, which illustrate the predicted species richness per 900 ft2 / 83.6 m2 map cell, was done using the function leaflet in the R package leaflet v2.0.4.1 (Cheng et al. 2018).
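Summing SDMs into a richness map amounts to a cell-wise addition. The Python sketch below assumes each SDM has already been thresholded to a 0/1 grid of identical dimensions, represented as nested lists; the function name is hypothetical.

```python
def stack_richness(binary_maps):
    """Cell-wise sum of aligned 0/1 presence grids, giving the predicted
    number of species per map cell."""
    rows, cols = len(binary_maps[0]), len(binary_maps[0][0])
    return [
        [sum(grid[r][c] for grid in binary_maps) for c in range(cols)]
        for r in range(rows)
    ]
```

With nine avian SDMs, for instance, each cell of the resulting map takes an integer value from 0 to 9.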

Fig. 2 SRMs covering the city boundaries of Los Angeles, for native species in the class Aves (a) and Magnoliopsida (b), constructed using SDMs with mean TSS scores exceeding 0.3. Color scales reflect the predicted number of species per map cell

To evaluate how accurately our SRMs respond to our environmental variables (Table 2) we used a random forest model, which has been shown to be an accurate method for predicting the spatial distribution of other ecological data sets (Rodriguez-Galiano et al. 2012). For each SRM we constructed a set of 100 random forest models using training data sets and the function tuneRF, within the R package randomForest v4.6-14 (Liaw and Wiener 2002), with stepFactor set to a value of 1 and doBest set to TRUE. Data for each random forest model of a SRM were extracted from 300 randomly selected locations within our study area using the function extract within raster. We chose 300 locations as this was found to be the minimum number needed to consistently model significant (Pearson correlation, p < 10−4) predictions of both SRMs using random forest. A training set was generated by randomly sampling 80% of these extracted data using the function kfold within dismo. Predicted SRM values were calculated using the function predict within randomForest, and were then compared against their actual counterparts with a Pearson correlation coefficient. The root mean square error (RMSE) for each random forest model was calculated using the function rmse in the R package Metrics v0.1.4 (Hamner et al. 2018).

The mean and standard deviation of the 100 Pearson correlation coefficients, generated in evaluating each random forest model of a SRM, were then calculated for each class’ SRM. This was done by first performing a Fisher transformation, using the FisherZ function within the R package DescTools v0.99.44 (Signorelli, 2020), on the set of Pearson correlation coefficients. We used a Fisher transformation as it has been found to produce less biased summary statistics for a set of Pearson correlation coefficients (Corey et al., 1998). The average and standard deviation were then inverse Fisher transformed, using the function FisherZInv within the DescTools package, to produce a single statistic per SRM. For each SRM the median and variance of the RMSE values were also calculated. To account for the percent variance explained by our SRMs we used the following formula for each SRM:

$$\%\,\mathrm{Variance\ explained} = 100 \times \left( 1 - \frac{\sum {\left( \mathrm{Richness}_{actual} - \mathrm{Richness}_{predicted} \right)}^2}{\sum {\left( \mathrm{Richness}_{actual} - \mathrm{mean}(\mathrm{Richness}_{actual}) \right)}^2} \right)$$
(1)
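The summary statistics described in this section can be sketched in Python as follows. This is an illustration under stated assumptions rather than the R code used in the analysis: `math.atanh` and `math.tanh` stand in for FisherZ and FisherZInv, the function names are hypothetical, and `sum_of_squares_ratio` returns the ratio of residual to total sum of squares that appears in Eq. 1, with the conventional proportion of variance explained being one minus that ratio.

```python
import math
import statistics


def fisher_mean_sd(correlations):
    """Mean and standard deviation of Pearson r values, computed on the
    Fisher z scale and back-transformed, as described in the text."""
    z = [math.atanh(r) for r in correlations]
    return math.tanh(statistics.mean(z)), math.tanh(statistics.stdev(z))


def rmse(actual, predicted):
    """Root mean square error between actual and predicted richness."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def sum_of_squares_ratio(actual, predicted):
    """Residual sum of squares divided by total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return ss_res / ss_tot


def variance_explained(actual, predicted):
    """Conventional proportion of variance explained, 1 - SS_res / SS_tot."""
    return 1.0 - sum_of_squares_ratio(actual, predicted)
```

A perfect prediction gives a ratio of 0 and a proportion explained of 1, while predicting the mean richness everywhere gives a ratio of 1 and a proportion explained of 0.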

To compare the relative importance of environmental variables within our models we used the function importance within randomForest. The relative importance of our environmental variables was calculated as their mean decrease in node impurity as quantified by their Gini indices, with a larger value denoting greater relative importance. To generate individual partial dependence plots we used the function partialPlot within randomForest.

To visualize a heat map of the 100 iterations of the partial dependence plots for our environmental variables for each SRM, we then used the function geom_bin2d within ggplot2. For all our partial dependence heat maps, we divided our axes into 20 bins and generated a best-fit curve using the function stat_smooth within ggplot2.

Results

Evaluating SDMs

Comparing native to non-native SDM accuracy

We found native species had SDMs which tended to be more accurate, as assessed by their mean TSS scores (Fig. 3a) (K-W test, χ2 = 7.57, p < 10−2), than their non-native counterparts. For all of our SDMs, the relative importance of anthropogenic environmental variables tended to exceed that of their natural counterparts (K-W test, χ2 = 16.54, p < 10−4). Even within native and non-native species groups (Fig. 3b), anthropogenic environmental variables tended to have a greater relative importance for both native (K-W test, χ2 = 11.58, p < 10−3) and non-native species (K-W test, χ2 = 5.23, p < 0.05).

Fig. 3 Violin plots of the mean TSS scores per SDM split by native / non-native status (a), and the mean percent relative importance of environmental variables for SDMs split both by native / non-native status and anthropogenic / natural category (b)

Comparing the importance of natural and anthropogenic environmental variables: For a majority of species we found their spatial distributions to be strongly influenced by the density of housing units and slope (SI: Fig. 5). Both habitat quality and proximity to streams were found to be of high relative importance in influencing the SDMs of some plants, birds, and insects (SI: Fig. 5a-c). For members of Magnoliopsida and Liliopsida the aspect of the local terrain also tended to be influential on the spatial distribution of most species (SI: Fig. 5a and h).

Selecting environmental indicator species

Of the 122 species in this study, 28 were found to have SDMs considered accurate (Table 1), that is those with a TSS score of at least 0.3 (Landis and Koch 1977). Of these species 26 were native to Los Angeles, and were predominantly in the class Aves (9 species) or Magnoliopsida (13 species) (Table 1 and SI: Table 6).

Constructing and evaluating SRMs

SRMs were assembled by summing SDMs for Los Angeles native species in the class Aves and Magnoliopsida (Fig. 2), which were the only two sets of species in this study to provide for much variation in richness (Table 1 and SI: Table 6). The richness of species in both classes could be reliably predicted across Los Angeles, from the environmental variables used in this study, albeit with a moderate level of uncertainty as described by the RMSEs of SRMs from each class (Table 3).

Table 3 Summary statistics on the accuracy of 100 SRMs, for species in class Aves and Magnoliopsida, generated via random forest. Pearson correlation coefficients calculated between training and testing values of species richness. Percent variance explained is calculated as how well predicted richness explains the target variance of the training richness. RMSE is calculated as the deviance of actual to predicted species richness

For both SRMs the density of housing units was the most influential variable considered (Table 4), and was found to generally have a negative relationship with species richness in both classes (SI: Figures 6a and 7a). Both slope and the CIscore, which is a composite of various measures of environmental contamination along with both human health and socioeconomic factors, tended to be highly important in predicting the local richness of species in the class Aves or Magnoliopsida (Table 4). For both SRMs a decline in species richness tended to be associated with an increase in slope or CIscore (SI: Figures 6d-e and 7c and e). The richness of bird species tended to be more strongly influenced than that of species in the class Magnoliopsida by proximity to either streams or lakes (Table 4), with species richness tending to decline with distance to either type of freshwater (SI: Fig. 6b-c). The richness of species in Magnoliopsida was found to be strongly influenced by both habitat quality and the gHM (Table 4), a measure of anthropogenic modification to the landscape, with a positive trend associated with habitat quality and a negative one with the gHM (SI: Fig. 7b and d).

Table 4 Relative rank importance of environmental variables to our SRMs with variable importance listed in descending order, along with the mean and standard deviation of their relative importance, as measured by the mean decrease in their Gini indices. Values recorded as the mean value (standard deviation on the mean value), with higher values indicating greater importance of the variable to the model

While the relative importance of environmental variables declines quickly with rank order for both SRMs (Table 4), we do note the predicted behavior of both models to be somewhat counterintuitive for a number of the environmental variables considered. For our avian SRM we predict a general increase in species richness with various measures of anthropogenic disturbance ranging from nearby vehicular traffic density, to soil and water contamination, pesticide use, and light pollution (SI: Fig. 6f-i, o-p, s-t). For our Magnoliopsida SRM, species richness was also predicted to increase with various measures of soil and water contamination and pesticide use (SI: Fig. 7n-o, q-s). However, both SRMs tend to respond in the expected direction to variables which are composite metrics of anthropogenic disturbance. In particular, we modeled species richness declining with the gHM and CIscore and increasing with habitat quality for both SRMs (SI: Figures 6e, j and m and 7b, d-e).

Discussion

Evaluating SDMs

Comparing native to non-native SDM accuracy

For the species we studied, SDMs derived from native species tended to be significantly more accurate than SDMs derived from non-native species (Fig. 3a). This may reflect the tendency of non-native species, especially in urban environments, to be able to occupy a broader set of environmental conditions than native species (Le Viol et al. 2012; Cervelli et al., 2013; Concepción et al., 2016; Callaghan et al., 2019; Colléony and Shwartz, 2020). The result of this ecological tendency would be less expected variation in the likelihood of a species being present in response to variations in the environment.

We found that membership in our list of potential environmental indicator species (Table 1), particularly those native to Los Angeles, was biased towards members of the classes Aves and Magnoliopsida. This reflects a bias in the community science data sets found in GBIF towards observations of both classes (Troudet et al. 2017; Petersen et al. 2021). Despite this bias, species from both classes have been used as environmental indicators in a variety of studies of urban areas. For example, avian diversity has been found to respond to human modification of landscapes (Callaghan et al., 2021), the extent and density of the built environment (Pinho et al. 2016), as well as habitat quality and groundwater contamination (Mekonen 2017), while the diversity of flowering plants has been found to respond to the impact of road traffic on air and soil quality (Philips et al., 2021) as well as landscape modifications (Ioja and Breuste 2020).

Comparing the importance of natural and anthropogenic environmental variables: For both native and non-native species, we tend to find the relative importance of anthropogenic environmental variables to be greater in our SDMs than natural variables (Fig. 3b and SI: Fig. 5). This pattern has been observed in other cities (Aronson et al. 2014; Liu et al. 2017), and the greater importance of anthropogenic factors in shaping urban biodiversity (Faeth et al. 2011; Li et al. 2019) may simply reflect the fact of cities being some of the most anthropogenically modified habitats on Earth (Chase and Chase, 2016).

Many of the species we selected as potential environmental indicators also have large geographic ranges, often extending well beyond California (Table 1). A large geographic extent for a species can be associated with a wide climatic tolerance (Slatyer et al. 2013), although this may only weakly hold for various plant species (Bocsi et al. 2016). That many of our potential environmental indicator species have ranges covering multiple climatic zones may partly explain the low relative importance of bioclimatic variables in their SDMs (SI: Fig. 5), which would reduce the relative importance of natural environmental variables across our overall set of SDMs. The lower relative importance of natural environmental variables may in part also result from their lower levels of variation across the landscape of Los Angeles (SI: Table 5).

Selecting environmental indicator species

We find our species with the most environmentally responsive SDMs to be strongly biased toward native species in the class Aves or Magnoliopsida (Table 1). While this in part reflects taxonomic biases in our source data, the presence of species responsive enough to environmental variations to be reliable as indicators in an urban environment may stem from traits conducive to tolerance of anthropogenic environments such as small size and high dispersal (McKinney and Lockwood 1999; Lizée et al. 2011).

Constructing and evaluating SRMs

Our SRMs were able to capture a number of ecological relationships involving simple environmental variables, such as the distance to freshwater or housing density, indicating their potential utility in assessing environmental conditions in an urban environment such as Los Angeles. For example, housing density has been associated with a decline in avian species richness (Gagné and Fahrig 2011), which has also been more broadly observed in relation to the overall fraction of an urban area dedicated to buildings (Aronson et al. 2014). Both this decline in urban species richness and the importance of the role of building density have also been observed in a variety of plant species (Godefroid and Koedam 2007). Our avian SRM also appears to capture, at least on the scale of approximately 10 km (SI: Fig. 6b-c), the expected decline in species richness with distance to freshwater (Chin and Kupfer 2020; de Camargo Barbosa et al. 2020). At a similar geographic scale our Magnoliopsida SRM predicts a decline with the distance to freshwater (SI: Fig. 7f and j), in line with the decline of plant species richness with distance to freshwater observed in Mediterranean climates such as that of Los Angeles (Hawkins et al. 2003).

We also observed less intuitive relationships between species richness and individual anthropogenic variables. For example, variables describing the number of unique pollutants within nearby impaired water bodies (iwb) and the areal density of pesticide use (pesticides) were both positively associated with avian and Magnoliopsida species richness (SI: Figures 6f and i and 7o and s). While pesticide use or impaired water bodies would be expected to reduce biodiversity, in urban environments they are often associated with irrigated and fertilized landscapes, which are often more species rich as a result (Clarke et al., 2013; Avolio et al. 2020). Another measure of anthropogenic disturbance, the density of road traffic (traffic), was also associated with a positive response in species richness in our Aves SRM (SI: Fig. 6h). However, this response may be a result of bias, with community scientists tending to take bird observations in proximity to roads (Keller and Scallan 1999; Mair and Reute, 2016).

We find additional potential in our method for constructing SRMs for urban environmental assessments given the responses of our two SRMs to more integrated measures of anthropogenic disturbance: an index of the quality of habitat for species native to Los Angeles (HabitatQuality), a measure of anthropogenic modification (gHM), an integrated measure of exposure to pollution (PollutionS), and an integrated measure of exposure to pollution weighted by local measures of socioeconomic vulnerability and health outcomes (CIscore). The value of HabitatQuality integrates data on land cover and vegetation found to support native biodiversity in Los Angeles, along with a measure of geographic connectivity between habitats supportive of native biodiversity (Brown 2019). That our two SRMs both predict a positive relationship between HabitatQuality and species richness (SI: Figures 6j and 7b) is not surprising, with urban species richness generally found to increase with related metrics such as the area or connectivity of vegetated habitats (Aronson et al. 2014; Beninde et al. 2015; Callaghan et al. 2018). In a similar fashion the value of the gHM, which integrates the areal fraction dedicated to human activities such as impervious surfaces and electricity transmission (Kennedy et al. 2019), would also be expected to correspond to habitat fragmentation and a decline in species richness (SI: Fig. 6m and 7d).

Another unexpected result was the discrepancy between the responses of predicted species richness in our SRMs to the values of PollutionS and the CIscore. The value of PollutionS incorporates exposure to measures of air pollution, such as ground-level ozone concentration, human health effects, such as asthma hospitalization rates, and additional measures of environmental degradation, such as the density of hazardous waste sites (Meehan August et al. 2012). However, the value of PollutionS is not associated with a predicted decline in species richness in either of our SRMs (SI: Fig. 6g and 7n). When the pollution metrics summarized in PollutionS are multiplied against demographic and socioeconomic characteristics, such as the fraction of the population under 5 or over 65 or people over 25 without a secondary education, to produce the CIscore, the result is associated with a predicted decline in species richness (SI: Figures 6e and 7e). This difference in the predicted responses of species richness to pollution exposure alone and to a composite of pollution exposure with characteristics of the human population may reflect interactions between the physical environment, socioeconomic factors, and biodiversity in an urban context (Schell et al. 2020).
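The distinction between the two indices can be made concrete with a small sketch. It assumes, following the description above, that a composite score is simply the product of a pollution score and a population-characteristics score. The function name, the score scales, and the tract values are hypothetical and are not drawn from CalEnviroScreen data or from our analysis.

```python
def composite_score(pollution: float, population: float) -> float:
    """Multiply a pollution score by a population-characteristics score."""
    return pollution * population


# Hypothetical tracts: (pollution score, population-characteristics score).
tracts = {"A": (8.0, 2.0), "B": (5.0, 6.0), "C": (3.0, 9.0)}

# Rank tracts from highest to lowest under each measure.
by_pollution = sorted(tracts, key=lambda t: tracts[t][0], reverse=True)
by_composite = sorted(tracts, key=lambda t: composite_score(*tracts[t]), reverse=True)

print(by_pollution)  # ['A', 'B', 'C']
print(by_composite)  # ['B', 'C', 'A']
```

In this toy case the tract with the highest pollution score ranks lowest on the composite, so a response variable can plausibly track one index without tracking the other.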

Limitations

Beyond the overrepresentation of species from a few classes in GBIF (Troudet et al. 2017; Petersen et al. 2021), there remain issues with geographic biases in the observational data used to construct SDMs. We found geographic biases in the locations of all the species used in this study (SI: Fig. 4a), as well as those used to construct our SRMs (SI: Fig. 4b-c), and this type of observational bias has been observed in prior studies using community science data (Mair and Ruete 2016; Petersen et al. 2021). While we were able to account for these spatial biases in generating our SDMs, we found our data to be clustered in large urban parks (SI: Fig. 4). This geographic bias toward large urban parks may in part be a result of their accessibility, as well as the expectation among community scientists of finding more diversity to capture in park spaces than in more developed land (Bonney et al. 2009; Callaghan et al. 2020). In urban environments biodiversity tends to be relatively high in large park spaces (Matthies et al. 2013), with the richness of species in parks following a species-area relationship (Nielsen et al. 2014). Additionally, cities are often built in biodiverse areas and many of these park spaces may simply reflect this relic biodiversity (Kühn et al. 2004; Luck 2007).
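A species-area relationship is commonly written as a power law, S = c * A^z, where S is species richness, A is area, and c and z are fitted constants. The sketch below assumes that form, which may differ from the forms used in the studies cited above, and fits it by least squares on log-transformed values. The park areas and species counts are hypothetical, constructed to follow S = 10 * A^0.25 exactly.

```python
import math


def fit_species_area(areas, richness):
    """Fit S = c * A**z by ordinary least squares on log-transformed values."""
    xs = [math.log(a) for a in areas]
    ys = [math.log(s) for s in richness]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    z = sxy / sxx                          # slope in log-log space
    c = math.exp(mean_y - z * mean_x)      # intercept, back-transformed
    return c, z


# Hypothetical park areas (ha) and species counts following S = 10 * A**0.25.
areas = [1, 16, 81, 256]
richness = [10, 20, 30, 40]

c, z = fit_species_area(areas, richness)
print(round(c, 6), round(z, 6))
```

Under such a relationship, observations concentrated in the largest parks would sample the high end of the richness range, which is one way the clustering described above could influence modeled richness.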

Geographic biases in species observations may also stem from the skewed demographics of people collecting community science data, which is biased towards those who are under 65 years of age and have a post-secondary degree (Ganzevoort et al. 2017; Lopez et al. 2020). Such a bias may reduce the number of species observations in areas with a higher proportion of the population under 5 or over 65 years of age, which would bias predicted species richness downwards. This may in part explain the negative relationship between the CIscore, which weights exposure to a variety of environmental pollutants by the presence of more vulnerable young and elderly populations, and species richness as predicted by our SRMs. Such biases in observer demographics may then tend to skew observed biodiversity downwards, enhancing any modeled negative response between species richness and various socio-ecological variables.

Beyond the potential confounding of observer biases with socioeconomic and anthropogenic environmental variables, there is the issue of temporal gaps between the collection of species occurrences and of the environmental variables used to generate SDMs. First, our species observations are recent and only cover a decade, mirroring the rapid but recent growth in community science-based platforms for recording species observations (Di Cecco et al. 2021). We aggregated our observations across this window of time to provide sufficient data density to generate SDMs and SRMs, although this coarse temporal resolution may obscure a number of ecological patterns. A similar issue exists with many of our environmental layers, such as those derived from CalEnviroScreen or bioclimate, which are updated only once every few years and may therefore reduce the predictive power of any SDM or SRM which incorporates them.

We also note that the spatial resolutions of a number of the remotely sensed environmental layers used in this study (SI: File 4) may not be fine enough to sufficiently capture their influence in a highly heterogeneous urban environment. For example, while urban areas tend to produce highly localized microclimates which influence local patterns of biodiversity (Fournier et al. 2020; Casanelles-Abella et al. 2021), the resolution of the bioclimatic variables we used is near a kilometer in scale and may effectively obscure the signal of any potential relationship. Exposure to artificial light at night is also mapped at the same spatial resolution (Falchi et al. 2016), although its illuminance has been found to vary by orders of magnitude at the scale of only hundreds of meters (Simons et al. 2020). Our species observation data then, while large in scale thanks to the efforts of numerous community scientists, may be mapped against a level of environmental variation artificially lowered by spatially coarse remote sensing data.
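The way coarse resolution lowers apparent environmental variation can be illustrated with a deliberately extreme sketch. It assumes that a coarse layer behaves like a block average of an underlying fine-scale layer. The grid, its values, and the `block_mean` helper are hypothetical and are not part of our analysis pipeline.

```python
from statistics import pvariance


def block_mean(grid, size):
    """Average non-overlapping size x size blocks of a square grid."""
    n = len(grid)
    coarse = []
    for i in range(0, n, size):
        row = []
        for j in range(0, n, size):
            cells = [grid[r][c]
                     for r in range(i, i + size)
                     for c in range(j, j + size)]
            row.append(sum(cells) / len(cells))
        coarse.append(row)
    return coarse


# Hypothetical 4 x 4 fine-scale layer: a checkerboard of values 0 and 10.
fine = [[10 * ((r + c) % 2) for c in range(4)] for r in range(4)]
coarse = block_mean(fine, 2)  # every 2 x 2 block averages to 5.0

fine_var = pvariance([v for row in fine for v in row])      # 25
coarse_var = pvariance([v for row in coarse for v in row])  # 0
print(fine_var, coarse_var)
```

Real layers are far less regular than a checkerboard, so the loss of variance would be partial rather than total, but the direction of the effect is the same.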

Future prospects

The volume of data captured through community science has grown rapidly in recent years. Although there are biases in the types of species observed, and in where they are observed, these may be compensated for through the use of environmental DNA (eDNA). Using eDNA from soil, sediment, water, or air samples, we can complement existing species monitoring efforts through the identification of thousands of species at once, including plants, animals, and microbes (Stat et al. 2019; Lin et al. 2021; Nørgaard et al. 2021). Use of eDNA can greatly complement traditional species monitoring by enabling greater taxonomic resolution (Deiner et al. 2017; Ruppert et al. 2019) and the detection of species which tend to avoid the presence of humans (Yonezawa et al. 2020; Mas-Carrió et al. 2021), or of organisms such as bacteria and fungi which can be difficult to monitor using traditional observations (Frøslev et al. 2019; Liddicoat et al. 2022). Comparisons of eDNA with observational methods have also indicated its potential to help capture additional elements of ecologically relevant information, such as the functional diversity of various groups of species (Aglieri et al. 2021; Donald et al. 2021; Sigsgaard et al. 2021), particularly with regard to identifying ecological indicators (Yan et al. 2018; Blattner et al. 2021; Seymour et al. 2021).

With recent developments in routine low-cost hyperspectral imaging, there is the potential to overcome a number of these limitations by acquiring frequent, high-resolution environmental data to improve models of urban biodiversity (Mozgeris et al. 2018; Zhang et al. 2020; Hartling et al. 2021). Such remotely sensed environmental data may be captured using cubesats, which are small and low-cost satellites (Kimm et al. 2020; Grøtte et al. 2021), as well as airborne drones (Räsänen et al. 2020; Dierssen et al. 2021). Of particular use for monitoring highly dynamic and heterogeneous urban environments, such data can be collected at spatial resolutions under 3 m (Salgado-Hernanz et al. 2021) and at daily intervals (Rhodes et al. 2022).

Underlying many SDMs is the assumption that the likely geographic distribution of a species is purely a function of the environmental gradients present. While environmental variation may be an important driver of the geographic distribution of species, a variety of ecological factors ranging from interspecies competition to dispersal will also have some degree of influence (Soberón 2007). One potential method for inferring the components of variation in SDMs which may be attributed to interspecies interactions, or to unknown environmental factors, is the use of joint SDMs (jSDMs) (Pollock et al. 2014). However, as with our study, there is still a large degree of uncertainty in disentangling the contributions to variation in SDMs when a large number of potentially interacting species are involved (Zurell et al. 2018).

Conclusions

As a predominantly urban species, humans need to better understand the ecology, and condition, of their most common habitat. Developing ecological assessments for urban areas presents particular difficulties, as these areas are both highly heterogeneous and, in a global environment rapidly responding to anthropogenic activity, ever more dynamic. Despite potential biases, both in the spatial distribution of sampling efforts and towards particular taxonomic groups, we demonstrate the potential use of species distribution modeling and community science-based observations to both identify potential environmental indicators and assess the response of biodiversity to environmental conditions in an urban environment. We found evidence that SDMs of native species tend to be more accurate than those of non-native species, and that biodiversity patterns in urban environments are driven more by anthropogenic activities than by variations in the natural environment. In constructing SRMs from the most accurate SDMs, we were able to detect a number of plausible responses of urban biodiversity to environmental conditions. Of particular interest is the potential for our SRMs to detect declines in biodiversity associated with measures which integrate both exposure to various pollutants and socioeconomic characteristics on a local basis. However, because of various biases associated with the backgrounds of community scientists, some of the environmental responses of our SRMs may be confounded with socioeconomic variables. We therefore recommend that future development of this methodology incorporate a broader initial set of environmental and socioeconomic variables in order to better correct for potential observer biases, and subsequently improve the ability of the resulting SDMs and SRMs to capture meaningful environmental responses in urban environments.