1 Introduction

Vector-borne pathogens are becoming more prevalent and colonizing new spatial regions as a result of climate change and other human impacts, posing serious threats to human and animal populations. This trend is true for mosquito-borne pathogens because the mosquito life cycle depends on local weather and land use conditions (Bartlow et al. 2019). For instance, Gorris et al. (2021) inferred northward range expansions for several Culex mosquito species in updating the original maps of Darsie and Ward (1981). They found that abiotic factors like warm temperatures, water availability, terrain, and land cover characterized ecological niches for these mosquitoes. Relating environmental conditions to vector disease ecology under a statistical modeling framework can reveal key relationships between vectors and their environments and facilitate projections of disease spread.

Mosquitoes carry the potential to transmit pathogens maintained in reservoir hosts to humans and other animals. Culex mosquitoes, in particular, have been documented as potent disease vectors for West Nile virus (WNV), Eastern equine encephalitis virus (EEV), Zika virus, and other pathogens (Gorris et al. 2021). Turell et al. (2005) confirmed through laboratory experiments that Culex species could spread WNV efficiently as enzootic or bridge vectors. Since its arrival in North America in the late 1990s, WNV has spread across the United States and Canada and become endemic (Hadfield et al. 2019). It is a major public health concern, costing the American and Canadian economies hundreds of millions of dollars (Giordano et al. 2018).

Various public health agencies across the US and Canada maintain records on mosquito abundance and viral testing. These programs involve classifying and counting mosquito species found at trap sites across a geographic region. In many instances, pools of mosquitoes were tested for WNV and/or other pathogens. Although the dataset we are using contains direct information about vector-borne pathogens, most studies have only explored the patterns of vector abundance. There are established methods to model abundance in ecology (Royle and Nichols 2003; Royle and Dorazio 2006), whereas analyzing viral data requires another approach. Considering that WNV is a serious public health concern, we developed a new statistical method to make use of presence/absence indicators from viral tests to create an occupancy model specifically for viral presence in mosquitoes.

One of the challenges in developing an accurate viral occupancy model is that mosquito populations, viral transmission, and human sampling effort vary across time and space. Our method generalizes the occupancy model to allow occupancy to change during a sampling season (MacKenzie et al. 2002, 2003, 2017). Occupancy modeling posits a data-generating process by which a presence arises when a species is both occupying a sampling site and detected during a sampling period. The forces affecting occupancy and detection differ; occupancy may depend on local environmental conditions whereas detection may depend on sampling procedures and effort. Historically, these models have assumed that fauna either occupied or did not occupy a site throughout a sampling season (MacKenzie et al. 2002, 2003; Johnson et al. 2013; Hooten and Hobbs 2015). We relaxed this assumption to adjust for seasonality in vector-borne pathogens. Occupancy and detection probabilities can be construed as the outputs of generalized linear models (GLMs) with predictors belonging to three different dimensions of heterogeneity: site, season, and period. In our example, site is the areal region, season is the yearly pattern, and period is the epidemiological week. Hence, time-varying occupancy models can be implemented by introducing occupancy covariate effects that vary over periods.

The rest of this paper is organized as follows. Section 2 motivates the method development by introducing empirical data on West Nile virus in Ontario, Canada. Section 3 formalizes the time-varying occupancy model and places it alongside its historical counterparts. Markov chain Monte Carlo (MCMC) methods for Bayesian binary regression are laid out in Section 3.3. Models with spatial random effects are discussed in Section 3.4. In Section 4, we present a model to study occupancy and detection patterns for WNV in Ontario. Besides serving as an example for time-varying occupancy, this case study uncovered strong statistical signals that were in line with contemporary scientific knowledge. We conclude with commentary on the time-varying occupancy model and its future use in vector disease ecology.

2 West Nile virus mosquito data

The province of Ontario is 1.076 million \(\hbox {km}^2\) and holds about 20% of Canada’s population. It is apportioned into 34 public health units (PHUs), ranging from rural, large, and sparsely populated PHUs in the northwest to urban, small, and densely populated PHUs in the southeast. Between 2002 and 2017, these PHUs trapped mosquitoes and tested them for WNV. Officials baited miniature light traps and returned 24 hours later to collect mosquitoes and diagnose WNV status (Giordano et al. 2018). Surveillance was conducted at hundreds of trap sites each week from May to October. Most PHUs surveyed traps at least once a week, so weekly aggregation resulted in fewer missing observations. Using the MMWRweek R package (Niemi 2020), we defined epidemiological weeks (epiweeks) according to the Morbidity and Mortality Weekly Report standard of the Centers for Disease Control and Prevention (CDC) and other public health agencies throughout the world. We chose this definition because human cases for infectious diseases like WNV are commonly reported this way.
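The epiweek assignment can be illustrated outside of R. The following Python sketch is not the MMWRweek package; it is a minimal re-implementation of the MMWR convention as we understand it (weeks run Sunday to Saturday, and week 1 is the first week with at least four days in the new calendar year). The function names `mmwr_year_start` and `mmwr_week` are hypothetical.

```python
from datetime import date, timedelta


def mmwr_year_start(year):
    """Return the Sunday that begins MMWR week 1 of `year`."""
    jan1 = date(year, 1, 1)
    # Days elapsed since the most recent Sunday (Python: Monday=0, Sunday=6).
    days_since_sunday = (jan1.weekday() + 1) % 7
    week_start = jan1 - timedelta(days=days_since_sunday)
    # Week 1 must contain at least four days of the new year; otherwise
    # the week containing January 1 belongs to the previous MMWR year.
    if days_since_sunday > 3:
        week_start += timedelta(days=7)
    return week_start


def mmwr_week(d):
    """Map a calendar date to its (MMWR year, MMWR week) pair."""
    year = d.year
    if d >= mmwr_year_start(year + 1):
        year += 1  # late-December dates can fall in week 1 of next year
    elif d < mmwr_year_start(year):
        year -= 1  # early-January dates can fall in the previous year
    week = (d - mmwr_year_start(year)).days // 7 + 1
    return year, week
```

For example, `mmwr_week(date(2016, 8, 21))` returns `(2016, 34)` under this convention, so trap collections from that Sunday through the following Saturday would be aggregated into epiweek 34.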

2.1 Mosquito traps

Trap data included species classifications, abundance counts, and test results for WNV from mosquito seasons in 2002-2017. Pools of mosquitoes were blended together and then evaluated in aggregate for WNV, obscuring true counts for how many mosquitoes harbored the virus (Kesavaraju et al. 2012). Agency protocols and local mosquito abundance and diversity also impacted this detection process. Some PHUs collected and assessed more specimens than others. On the other hand, baiting for potential WNV vectors could have affected the viral testing. For example, the Culex genus is especially relevant to WNV transmission (Turell et al. 2005) and widespread throughout Ontario (Gorris et al. 2021). Zero inflation was a concern as well at finer spatial resolutions. These nuances and limitations in the viral testing informed our decision to model the binary response positive test versus negative test(s). We converted counts of WNV tests to presence-absence observations; namely, 1 corresponded to a positive test and 0 corresponded to no positive tests.

After data aggregation, there were 7396 trap observations, 1054 of which had a positive WNV test. Exploratory data analysis highlighted some interesting trends: (1) the majority of positive cases belonged to the Culex pipiens morphological group, (2) PHUs in the Greater Toronto Area (GTA) tested the most, (3) testing was generally consistent between years, and (4) most positive tests occurred in mid- to late summer. Presences were mainly in southern Ontario and unequally distributed about the metro areas of Toronto, Ottawa, and Detroit. The 2002, 2012, and 2017 seasons had more cases than the average season, whereas 2009 had fewer cases (Figure 1).

Fig. 1 Total number of positive WNV cases by sampling season

2.2 Environmental factors

Prior studies of this dataset have reported temperature, water resources, and land cover types that covary strongly with mosquito abundance (DeMets et al. 2020a, b). Using Poisson GLMs for the GTA data, Yoo (2014) and Yoo et al. (2016) discovered positive associations between temperature, precipitation, and population density and Culex pipiens-restuans abundance. Wang et al. (2011) found that temperature provided a stronger leading indication than precipitation in a Gamma GLM for Culex pipiens-restuans abundance in the Peel region (PEE). In general, mosquito species important to WNV transmission thrive in human-occupied land with standing water available and warm weather conditions. Birds also play a part in the enzootic cycle of WNV. Their competence as reservoir hosts impacts transmission in dynamic ways (Allan et al. 2009; Ciota and Kramer 2013).

Based on this literature review, we gathered environmental predictors to consider in our models. For climate and land type, we collected 19 bioclimatic variables (Vega et al. 2017) and 12 land classification proportions (Tuanmu and Jetz 2014) that had been inferred from satellite imagery. For weather trends and water availability, we assembled temperature, precipitation, and water level statistics from the daily reports of weather (Dunnington 2017) and hydrological stations (Albers 2017). Lastly, we computed Shannon and Gini-Simpson diversity indices (Simpson 1949; Willis and Martin 2020) based on observational data from the citizen-science project eBird (Sullivan et al. 2009). We chose these indices because they were less affected by the sampling effort biases of the eBird user base.
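Both diversity indices have simple closed forms: with species proportions \(q_s\), the Shannon index is \(-\sum_s q_s \log q_s\) and the Gini-Simpson index is \(1 - \sum_s q_s^2\). The Python sketch below computes them from a list of species counts. The input format (one count per species for a given PHU and epiweek) and the function names are assumptions made for illustration, not a description of our eBird processing.

```python
import math


def shannon_index(counts):
    """Shannon diversity: -sum(q * log(q)) over species with nonzero counts."""
    total = sum(counts)
    proportions = [c / total for c in counts if c > 0]
    return -sum(q * math.log(q) for q in proportions)


def gini_simpson_index(counts):
    """Gini-Simpson diversity: 1 - sum(q^2), the probability that two
    individuals drawn with replacement belong to different species."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)
```

Both indices depend only on relative proportions, so doubling every species count leaves them unchanged.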

We aggregated these covariates to align with the PHUs and weekly reporting. In some time periods, some PHUs lacked weather, hydrological, or bird diversity measurements. We imputed values for this small fraction of the covariate data. Missing temperature, precipitation, and water level values were replaced with distance-weighted averages, whereas missing bird diversity indices were filled in with medians.

These environmental covariates varied over space and time. Regions at lower latitudes experienced warmer temperatures, and epiweeks 30 to 40 aligned with the summer heat. Bird diversity was typically strong throughout our study period, but dips did occur and may be predictive. We found water level to be relatively stable for each PHU, which may indicate that its data source consists of managed water resources. Plotting precipitation over time did not uncover any noteworthy trends. Agricultural land was prevalent in southern Ontario outside of downtown Toronto. While land cover and climate characterized sites only, temperature, precipitation, hydrological resources, and bird diversity changed over time and may inform time-varying occupancy.

3 Bayesian time-varying occupancy model

Study designs for occupancy modeling encompass heterogeneity along three different dimensions: site, season, and period. Responses and covariates comprise a fourth dimension. Let i, j, and k index sites, seasons, and periods, respectively. Throughout this paper, we keep track of these indices in variable subscripts. For our motivating WNV data analysis, sites were PHUs, seasons ranged from 2002 to 2017, and periods concerned the epiweeks 18 to 44.

3.1 Standard occupancy models

MacKenzie et al. developed likelihood-based approaches in the early 2000s to accommodate imperfect detection when surveying animals (MacKenzie et al. 2002). We refer to this model and its many extensions (MacKenzie et al. 2017) as occupancy models. These models separate observation probabilities into products of occupancy and detection probabilities. Before generalizing to multiple seasons, we introduce the single-season occupancy model.

For an observation to be made, a species must have been both present in an area and detected during sampling. Let y denote a binary observation and z be a latent occupancy status. For our case study, y is 1 if there was a positive WNV test and 0 otherwise, and z is 1 if WNV was circulating in a mosquito population and 0 otherwise. There are three cases to consider: (a) observed \(((y,z)=(1,1))\), (b) not observed but occupied \(((y,z)=(0,1))\), and (c) not occupied \(((y,z)=(0,0))\). (Detection can only occur if there is occupancy, hence \(((y,z)=(1,0))\) is impossible.) If the species was observed at least once in the season, the third case did not apply. If the species was never observed, this was either because it did not occupy the site or because of repeated failures to detect it. The data-generating process is formally stated as:

$$\begin{aligned} Y_{ik}&\sim \text {Bernoulli}(Z_i \cdot p_{ik}) \\ Z_i&\sim \text {Bernoulli}(\psi _i), \end{aligned}$$

where vectors \(\varvec{\psi }\) and \({\mathbf {p}}\) are occupancy and detection probabilities. Detection probabilities can be different over time and among sites, whereas occupancy probabilities may only change by site. If a site did not have a survey for a given period, no data contributes to the model likelihood. Given presence-absence data \({\mathbf {Y}}\), the likelihood is as follows:

$$\begin{aligned} {\mathcal {L}} (\varvec{\psi }, {\mathbf {p}} \vert {\mathbf {y}} )&= \prod _i \underbrace{\bigg (\psi _i \prod _{k} p_{ik}^{y_{ik}} (1 - p_{ik})^{1-y_{ik}} \bigg )^{I(\sum _k y_{ik} \ge 1)}}_{\text {Observed \, at \, least \, once}} \\&\quad \times \prod _i \underbrace{\bigg (\psi _i \prod _k (1-p_{ik}) + (1-\psi _i)\bigg )^{I(\sum _{k}y_{ik} = 0)}}_{\text {Never \, observed}} \end{aligned}$$
(1)
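To make the two cases in the single-season likelihood concrete, the following Python sketch evaluates it for a small detection history. The nested-list data layout, the use of `None` for unsurveyed periods, and the function name are illustrative assumptions.

```python
import math


def single_season_likelihood(y, psi, p):
    """Evaluate the single-season occupancy likelihood.

    y[i][k] is 1, 0, or None (no survey) for site i and period k,
    psi[i] is the occupancy probability of site i, and
    p[i][k] is the detection probability for site i and period k.
    """
    likelihood = 1.0
    for y_i, psi_i, p_i in zip(y, psi, p):
        # Unsurveyed periods contribute nothing to the likelihood.
        surveyed = [(y_ik, p_ik) for y_ik, p_ik in zip(y_i, p_i)
                    if y_ik is not None]
        history = math.prod(p_ik if y_ik == 1 else 1.0 - p_ik
                            for y_ik, p_ik in surveyed)
        if any(y_ik == 1 for y_ik, _ in surveyed):
            # Observed at least once: the site must be occupied.
            likelihood *= psi_i * history
        else:
            # Never observed: occupied but missed, or not occupied.
            likelihood *= psi_i * history + (1.0 - psi_i)
    return likelihood
```

With two sites, two periods, \(\psi_i = 0.6\), and \(p_{ik} = 0.5\), a site with history (1, 0) contributes \(0.6 \cdot 0.25 = 0.15\) and a site with history (0, 0) contributes \(0.6 \cdot 0.25 + 0.4 = 0.55\).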

A colonization-extinction framework applies to multiple sampling seasons in which site occupancy probabilities may have changed. This model involves additional parameters \(\varvec{\gamma }\) and \(\varvec{\varepsilon }\) for colonization and extinction probabilities (MacKenzie et al. 2003). Occupancy probabilities are calculated recursively:

$$\begin{aligned} \psi _{ij} = \psi _{i(j-1)}(1 - \varepsilon _{i(j-1)}) + (1 - \psi _{i(j-1)})\gamma _{i(j-1)} \end{aligned}$$

With too many parameters, maximum likelihood methods are unlikely to converge, so the standard practice is to model these component probabilities with probit or logistic regression (MacKenzie et al. 2002). An appealing aspect of the colonization-extinction model is that these components may depend on different covariates than the occupancy and detection components.
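The recursion for a single site is short enough to write out directly. In this Python sketch, the initial occupancy probability and the season-specific colonization and extinction probabilities are supplied as plain numbers; in practice each would come from its own regression. The function name is hypothetical.

```python
def occupancy_trajectory(psi_first, gamma, epsilon):
    """Propagate occupancy probabilities across seasons for one site.

    psi_first is the occupancy probability in the first season;
    gamma[j] and epsilon[j] are the colonization and extinction
    probabilities governing the transition from season j to j + 1.
    """
    psi = [psi_first]
    for gamma_j, epsilon_j in zip(gamma, epsilon):
        previous = psi[-1]
        # Occupied sites persist with probability 1 - epsilon;
        # unoccupied sites are colonized with probability gamma.
        psi.append(previous * (1.0 - epsilon_j) + (1.0 - previous) * gamma_j)
    return psi
```

Starting from \(\psi = 0.5\) with \(\gamma = 0.2\) and \(\varepsilon = 0.1\) in every season, the next two seasons have occupancy probabilities 0.55 and 0.585.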

3.2 Time-varying occupancy

Single-season occupancy and colonization-extinction models are restrictive in their assumption that site occupancy was constant throughout a sampling season. We weaken this assumption, letting site occupancy vary by period. Thus, the data-generating process and likelihood are:

$$\begin{aligned} Y_{ijk}&\sim \text {Bernoulli}(Z_{ijk} \cdot p_{ijk}) \\ Z_{ijk}&\sim \text {Bernoulli}(\psi _{ijk}) \\ {\mathcal {L}} (\varvec{\psi }, {\mathbf {p}} \vert {\mathbf {y}} )&= \prod _{ijk} \big (\underbrace{\psi _{ijk} p_{ijk}}_{\text {(a)}}\big )^{y_{ijk}} \big (\underbrace{\psi _{ijk} (1 - p_{ijk})}_{\text {(b)}} + \underbrace{(1 - \psi _{ijk})}_{\text {(c)}}\big )^{1 - y_{ijk}} \end{aligned}$$
(2)

In our methods, we achieve time-varying occupancy \(\psi _{ijk}\) by including covariates \({\mathbf {x}}_{ijk}\) that varied between sampling periods, since \(\psi _{ijk}\) is modeled with sigmoid regression. Excluding period-dependent occupancy covariates means \(\psi _{ijk}\) is the same for all periods k. However, when a species was never observed at a site, the likelihoods (1) and (2) differ slightly. Namely,

$$\begin{aligned} \prod _k (\psi _{ij}(1-p_{ijk}) + (1- \psi _{ij})) \ne (1 - \psi _{ij}) + \psi _{ij} \prod _k (1-p_{ijk}) \end{aligned}$$

Our likelihood includes non-occupancy in the product. While this distinction may be important for likelihood optimization methods, we did not explore it further as it did not affect our MCMC sampling routines. Another difference is that seasonal effects are handled jointly with other occupancy effects. We interpret seasonal effects to have impacted occupancy directly, rather than impacting occupancy indirectly through colonization and extinction. For instance, we might conclude that harsh winters decreased occupancy whereas MacKenzie et al. (2003) would say that harsh winters increased extinction and decreased colonization. In this respect, the time-varying occupancy model appears suitable for a species that is widespread and mobile, responding uniformly to macro environmental changes. This scenario is the case for WNV, as it is maintained in migratory avian hosts and transmitted by flying insect vectors.
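The difference between the two never-observed contributions is easy to check numerically. The Python sketch below computes both for one site and season with a constant occupancy probability; the function names are hypothetical and the numbers are arbitrary illustrations.

```python
import math


def never_observed_single_season(psi, p):
    """Contribution under likelihood (1): one occupancy state per season."""
    return psi * math.prod(1.0 - p_k for p_k in p) + (1.0 - psi)


def never_observed_time_varying(psi, p):
    """Contribution under likelihood (2): an occupancy state per period."""
    return math.prod(psi * (1.0 - p_k) + (1.0 - psi) for p_k in p)
```

With \(\psi = 0.5\) and two periods with \(p = 0.5\), the single-season contribution is 0.625 whereas the time-varying contribution is 0.5625. The two agree when there is only one period.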

3.3 Gibbs samplers for occupancy models

Classical Bayesian methods for modeling binary data take the perspective that the outcomes depend on a latent regression structure; namely,

$$\begin{aligned} u_l&= {\left\{ \begin{array}{ll} 1, &{} {\tilde{u}}_l \ge 0 \\ 0, &{} {\tilde{u}}_l < 0 \end{array}\right. } \\ {\tilde{u}}_l&= \eta _l + \epsilon _l \end{aligned}$$

where \(\eta _l\) is a linear predictor and \(\epsilon _l\) is a random error term. The errors follow normal distributions for probit regression and logistic distributions for logistic regression. Albert and Chib (1993) demonstrated for probit regression that the latent variables could be sampled from truncated-normal distributions. This strategy of sampling latent variables in a hierarchical model is referred to as data augmentation. Conditional on the augmented variables, regression effects can be shown to be a posteriori normally distributed. Polson et al. (2013) showed that such a scheme is possible for Bayesian logistic regression as well. Their method is nearly the same except that their latent variables are sampled from Pólya-Gamma (PG) distributions, where PG random variables can be represented as infinite sums of scaled Gamma random variables. Since the regression effects remain a posteriori multivariate normal (MVN) conditional on the latent PG variables, we can alternate between sampling the latent variables and the regression effects in a two-step Gibbs sampler.

Combining Bayesian sigmoid regressions for occupancy and detection together results in a blocked Gibbs sampler. Dorazio and Rodriguez (2012) and Clark and Altwegg (2019) have shown as much for probit and logistic regression, respectively. For time-varying occupancy models, we have to draw \(z_{ijk}\) at each period. We implemented blocked Gibbs samplers for time-varying occupancy with probit or logit link functions. We added further hierarchy to our samplers with inverse-Wishart (IW) conjugacy for the covariances of the MVN effects. There are many alternative ways to sample or approximate these posterior distributions, from exact MCMC methods (Metropolis et al. 1953; Hoffman et al. 2014) to approximate inference methods (Rue et al. 2009; Blei et al. 2017). Polson et al. (2013) argue in simulated and real data studies that the PG method does not depend on careful hyperparameter tuning for proposal densities and is more efficient than Metropolis-Hastings methods, especially for complex model frameworks like ours. This conclusion has been verified in the case of large spatial occupancy models (Clark and Altwegg 2019, Table 3).
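The step of drawing \(z_{ijk}\) at each period follows from Bayes' rule applied to the data-generating process: a positive observation implies occupancy, and otherwise \(z_{ijk}\) is Bernoulli with probability \(\psi _{ijk}(1 - p_{ijk}) / (\psi _{ijk}(1 - p_{ijk}) + 1 - \psi _{ijk})\). The Python sketch below illustrates only this one block of a Gibbs iteration, with flattened lists over (i, j, k) as an assumed data layout and hypothetical function names; it is not our full sampler.

```python
import random


def occupancy_posterior_prob(y, psi, p):
    """Conditional probability that z = 1 given y, psi, and p."""
    if y == 1:
        # Detection can only occur where there is occupancy.
        return 1.0
    # y = 0: occupied but missed, versus not occupied.
    # (The denominator is zero only if psi = p = 1, which cannot
    # produce y = 0 under the model.)
    missed = psi * (1.0 - p)
    return missed / (missed + (1.0 - psi))


def draw_latent_occupancy(y, psi, p, rng):
    """Draw one latent occupancy status per observation."""
    return [1 if rng.random() < occupancy_posterior_prob(y_n, psi_n, p_n) else 0
            for y_n, psi_n, p_n in zip(y, psi, p)]
```

For instance, with \(\psi = 0.5\) and \(p = 0.5\), a negative test leaves a one-in-three chance that the virus was present but undetected.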

Before presenting our samplers, we lay out some notation. Let \({\mathbf {X}}\) and \({\mathbf {W}}\) be four-dimensional occupancy and detection covariate arrays with column vectors \({\mathbf {x}}_{ijk}\) and \({\mathbf {w}}_{ijk}\). \(\varvec{\beta }\) and \(\varvec{\alpha }\) denote occupancy and detection effects, and multivariate normal means and covariances are \(\varvec{\mu }\) and \(\varvec{\Sigma }\). Using this notation and link function f, the data-generating process is as follows:

$$\begin{aligned} Y_{ijk}&\sim \text {Bernoulli}(Z_{ijk} \cdot p_{ijk})\\ Z_{ijk}&\sim \text {Bernoulli}(\psi _{ijk}) \\ f (\psi _{ijk})&= {\mathbf {x}}^\prime _{ijk} \varvec{\beta } \\ f (p_{ijk})&= {\mathbf {w}}^\prime _{ijk} \varvec{\alpha } \\ \varvec{\beta }&\sim \text {Normal}(\varvec{\mu }_{\varvec{\beta }}, \varvec{\Sigma }_{\varvec{\beta }}) \\ \varvec{\alpha }&\sim \text {Normal}(\varvec{\mu }_{\varvec{\alpha }}, \varvec{\Sigma }_{\varvec{\alpha }}) \\ \varvec{\Sigma }_{\varvec{\beta }}&\sim \text {Inverse-Wishart}(\nu _{\varvec{\beta }}, \varvec{\Lambda }_{\varvec{\beta }}) \\ \varvec{\Sigma }_{\varvec{\alpha }}&\sim \text {Inverse-Wishart}(\nu _{\varvec{\alpha }}, \varvec{\Lambda }_{\varvec{\alpha }}) \end{aligned}$$

The link functions \(f(\eta ) = \text {logit}(\eta ) = \log (\eta / (1 - \eta ))\) and \(f(\eta ) = \Phi ^{-1}(\eta )\), where \(\Phi (\cdot )\) is the cumulative distribution function of a standard normal random variable, characterize Bayesian logistic and probit regression, respectively, for our hierarchical model. We decided to model with the logit link function because interpretation with respect to log-odds is more intuitive and preferred in statistical ecology (Northrup and Gerber 2018).
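For readers who want to experiment with the two links, the following Python sketch implements both and their inverses using only the standard library (`statistics.NormalDist` supplies the normal CDF and quantile function). The function names are our own choices for illustration.

```python
import math
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def logit(prob):
    """Log-odds of a probability in (0, 1)."""
    return math.log(prob / (1.0 - prob))


def inv_logit(eta):
    """Map a linear predictor back to a probability (logistic link)."""
    return 1.0 / (1.0 + math.exp(-eta))


def probit(prob):
    """Standard normal quantile of a probability in (0, 1)."""
    return _STANDARD_NORMAL.inv_cdf(prob)


def inv_probit(eta):
    """Map a linear predictor back to a probability (probit link)."""
    return _STANDARD_NORMAL.cdf(eta)
```

Under the logit link, increasing a covariate by one unit adds its effect to the log-odds, so the odds of occupancy (or detection) are multiplied by the exponential of that effect.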

3.3.1 Logistic regression for time-varying occupancy

We outline an algorithm (Algorithm 1) to perform Gibbs sampling for a Bayesian time-varying occupancy model with logit link functions. (A similar algorithm (Algorithm 2) for probit link functions is available in the Appendix.) The data augmentation strategy of Polson et al. (2013) is to draw continuous augmented variables from PG distributions and then use them to sample regression effects. For occupancy modeling, we employ data augmentation to sample detection effects as well if the latent status is occupied for the current iteration. The supplement of Clark and Altwegg (2019) includes algebraic derivations for this sampler.

For the sampling algorithms, we use superscripts \(^{(t)}\) to keep track of iterations, subscripts \(_{\varvec{\alpha }}\) and \(_{\varvec{\beta }}\) to distinguish between detection and occupancy, and accents to denote augmented variables \({\tilde{y}}_{ijk}\) and \({\tilde{z}}_{ijk}\) associated with observations \(y_{ijk}\) and latent occupancy \(z_{ijk}\). The arrays \({\mathbf {X}}\) and \({\mathbf {W}}\) are collapsed to matrix form by stacking one on top of another the row vectors \({\mathbf {x}}^\prime _{ijk}\) and \({\mathbf {w}}^\prime _{ijk}\) for non-missing presence-absence observations. Intermediate mean \({\mathbf {m}}\) and covariance \({\mathbf {V}}\) terms are computed each time for MVN updates. Lastly, hyperpriors \(\varvec{\mu } = {\mathbf {0}}\), \(\varvec{\Lambda } = {\mathbf {I}}\) (identity matrix), and \(\nu = \text {dim(design matrix)}\) are chosen to be weakly informative. A review of prior specifications for occupancy models can be found in Northrup and Gerber (2018).


3.4 Spatial occupancy models

Dependence among areal units may confound inference in spatial models. Hughes and Haran (2013) recommend incorporating synthetic covariates that are orthogonal to fixed effects covariates to model positive dependence via spatial random effects (SREs) \(\varvec{\theta } \sim \text {Normal}({\mathbf {0}}, \varvec{\Sigma }_{\varvec{\theta }})\). This approach has been taken before by Johnson et al. (2013) and Clark and Altwegg (2019) in fitting large spatial occupancy models.

Let \({\mathbf {X}}_s\) be occupancy covariates that are fixed for each site and \({\mathbf {A}}\) be an adjacency matrix, i.e., \(A_{il} = 1\) if i and l touch and \(A_{il} = 0\) otherwise. The Moran operator for \({\mathbf {X}}_s\) is \({\mathbf {P}}^\perp _{{\mathbf {X}}_s} {\mathbf {A}}{\mathbf {P}}^\perp _{{\mathbf {X}}_s}\), where \({\mathbf {P}}^\perp _{{\mathbf {X}}_s} = {\mathbf {I}} - {\mathbf {X}}_s ({\mathbf {X}}_s^\prime {\mathbf {X}}_s)^{-1} {\mathbf {X}}_s^\prime\). Eigenvectors from the spectral decomposition of the Moran operator become the synthetic covariates attached to the SREs; their corresponding eigenvalues indicate positive and/or negative spatial dependence. Let matrix \({\mathbf {M}}\) contain the orthogonal spatial covariates. The occupancy component is now modeled by the equation \(f(\psi _{ijk}) = {\mathbf {x}}^\prime _{ijk} \varvec{\beta } + {\mathbf {m}}^\prime _{i} \varvec{\theta }\), and the covariance matrix \(\varvec{\Sigma }_{\varvec{\theta }}\) is defined as \(\sigma ^2_{\varvec{\theta }} ({\mathbf {M}}^\prime {\mathbf {Q}} {\mathbf {M}})^{-1}\), where \({\mathbf {Q}}\) is the intrinsic conditionally autoregressive precision matrix (Besag and Kooperberg 1995). Inverse-Gamma (IG) conjugacy is assumed for the variance scalar \(\sigma ^2_{\varvec{\theta }}\). In the Appendix, we present MCMC samplers for time-varying occupancy models with SREs.

4 West Nile virus in Ontario

We analyzed the West Nile virus data in Ontario, Canada using our methodology (Algorithm 1). We used R to draw Gibbs samples from the posterior distribution of the parameters of Bayesian time-varying occupancy models. For the logistic model, we called the R package BayesLogit, which implements an exact and efficient accept/reject sampler for PG random variables (Polson et al. 2019). In addition, we programmed utility tools for data preparation, model diagnostics, model evaluation, and model visualization.

4.1 Model selection procedure

We split our data into training, validation, and testing datasets. We held out seasons 2008, 2012, and 2016 as examples of low, high, and medium case counts. With the remaining 13 years, we trained and validated models on seventy-five and twenty-five percent of the observations.

One diagnostic we checked was the Watanabe-Akaike information criterion (WAIC) as a measurement of out-of-sample predictive accuracy for Bayesian models (Watanabe 2013; Gelman et al. 2014). We implemented WAIC as \(- 2 \cdot (\text {log pointwise predictive density} - p_{\text {WAIC}_1})\) (Gelman et al. 2013, pp. 169,173). We measured WAIC on the training dataset. Additionally, we made note of model effects in which ninety-five percent credible intervals did not overlap with zero. We interpreted such findings to mean that the effect has a strong statistical signal. Lastly, we formulated our own posterior predictive check for Bayesian occupancy models (Algorithm 2). First, we calculate presence probabilities for a given array by sampling posterior effects and applying their link functions. Second, we simulate presence-absence data 1000 times and construct summary statistics. Our main summary statistic is the relative presence, which is the average positive count divided by the true positive count. Relative presence evaluates how many binary 1 observations a model generates normalized by the true observations. It can be computed over subsets of the sites (PHUs), seasons (years), and periods (epiweeks) to diagnose strengths and weaknesses of a model. We looked at this posterior predictive check for select sites and seasons in the validation dataset to assess out-of-sample accuracy.
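Both diagnostics can be sketched compactly. The Python code below assumes that posterior draws have already been converted to occupancy and detection probabilities, stored as `psi_draws[s][n]` and `p_draws[s][n]` for draw s and observation n; this layout, the function names, and the choice to pick one posterior draw at random per simulated dataset are illustrative assumptions rather than a transcription of our R code. The pointwise likelihood uses \(P(y = 1) = \psi p\), and \(p_{\text {WAIC}_1}\) is twice the summed difference between the log of the posterior mean likelihood and the posterior mean log-likelihood.

```python
import math
import random


def pointwise_likelihood(y, psi, p):
    """P(y | psi, p) for a single presence-absence observation."""
    presence = psi * p
    return presence if y == 1 else 1.0 - presence


def waic(y, psi_draws, p_draws):
    """WAIC = -2 * (lppd - p_WAIC1), computed from posterior draws."""
    n_draws = len(psi_draws)
    lppd = 0.0
    p_waic1 = 0.0
    for n, y_n in enumerate(y):
        liks = [pointwise_likelihood(y_n, psi_draws[s][n], p_draws[s][n])
                for s in range(n_draws)]
        log_mean = math.log(sum(liks) / n_draws)
        mean_log = sum(math.log(lik) for lik in liks) / n_draws
        lppd += log_mean
        p_waic1 += 2.0 * (log_mean - mean_log)
    return -2.0 * (lppd - p_waic1)


def relative_presence(y, psi_draws, p_draws, n_sims, rng):
    """Average simulated positive count divided by the observed count."""
    observed = sum(y)
    simulated_total = 0
    for _ in range(n_sims):
        s = rng.randrange(len(psi_draws))  # one posterior draw per dataset
        simulated_total += sum(
            1 for psi, p in zip(psi_draws[s], p_draws[s])
            if rng.random() < psi * p)
    return (simulated_total / n_sims) / observed
```

A relative presence near 1 means the model generates about as many positives as were observed; values well below or above 1 indicate under- or over-prediction for the chosen subset of sites, seasons, or periods.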


4.2 Model building

Given that our model space was large, we chose to construct models by iteratively adding covariates with plausible scientific meaning. All model covariates were derived from the datasets described in Section 2. We applied root and log transformations to covariates with skewed distributions, including land cover proportions (\(\hbox {Agri}_{{i}}\) and \(\hbox {Urban}_{{i}}\) for agricultural and urban land cover), mosquito genus proportions (\(\hbox {Culex}_{{ijk}}\)), survey count (\(\hbox {Surveys}_{{ijk}}\)), catch rate (\(\hbox {CatchRate}_{{ijk}}\)), the yearly count of weeks 1 to 17 with mean temperature below zero Celsius (\(\hbox {Freezing}_{{ij}}\)), and water level (\(\hbox {Water}_{{ijk}}\)). To keep the models parsimonious, we did not explore variable interactions.

For occupancy modeling, standard practice is to put a covariate in either the occupancy or the detection component but not both; otherwise, the effect may oscillate between the two components and fail to converge. We assigned covariates from the trap data to the detection component and those from the environmental literature review to the occupancy component. While mosquito genus proportions and vector abundances could explain occupancy, these were influenced by surveillance decisions at the PHU level.

Initially, we fit some exploratory models to establish intuition for which covariates contribute the most to model improvements. Bioclimatic effects tended to have credible intervals covering zero and did not improve WAIC scores. On closer inspection, the MERRAclim bioclimatic variables displayed little variation over the province. We focused on agricultural and urban land types with a downstream application in mind of relating WNV in mosquito populations to human WNV cases. These land types carry great significance as well because they are habitats frequented by Culex species (Gorris et al. 2021). Population density from the 2016 census was strongly correlated with urban land type, so we did not use it in our models. Survey count, Culex proportion, and catch rate all appeared useful in calibrating detection probabilities.

Based on exploratory model building, we arrived at a good null model without time-varying occupancy. This model improved on an intercept-only model in both WAIC and relative presence.

$$\begin{aligned} \text {Detection}_{ijk}&= \alpha _0 + \alpha _1 \text {Surveys}_{ijk} + \alpha _2 \text {Culex}_{ijk} + \alpha _3 \text {CatchRate}_{ijk} \\ \text {Occupancy}_{i}&= \beta _0 + \beta _1 \text {Agri}_{i} + \beta _2 \text {Urban}_{i} \end{aligned}$$

In an iterative fashion, we added a seasonal covariate, freezing weeks (\(\hbox {Freezing}_{{ij}}\)), and period covariates for temperature (\(\hbox {Temp}_{{ijk}}\)), water level (\(\hbox {Water}_{{ijk}}\)), and Gini-Simpson diversity among birds (\(\hbox {Bird}_{{ijk}}\)) to the occupancy component. Since environmental conditions in winter and spring can impact mosquito development, we explored lags of 2, 4, 6, and 8 weeks for the period covariates, finding that six-week lags substantially improved training and validation WAIC scores. Next, we built in epidemic behavior through a custom covariate, the two-week lagged count of neighboring sites with known infections, including the site itself. That is, a site only contributes to the count if WNV was detected there in the previous epiweek.

$$\begin{aligned} \text {Occupancy}_{ijk}&= \beta _0 + \beta _1 \text {Agri}_{i} + \beta _2 \text {Urban}_{i} \\&\quad + \beta _3 \text {Freezing}_{ij} + \beta _4 \text {Temp}_{ijk} \\&\quad + \beta _5 \text {Water}_{ijk} + \beta _6 \text {Bird}_{ijk} + \beta _7 \text {Neighbors}_{ijk} \end{aligned}$$
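The neighbor covariate can be built from the detection history and the site adjacency structure. The Python sketch below is one possible construction for a single season; the data layout (`detected[i][k]` as 0/1 indicators, `adjacency[i]` as a list of neighboring site indices), the function name, and the treatment of the first `lag` periods as zero are assumptions for illustration. The lag is left as a parameter.

```python
def lagged_infected_neighbors(detected, adjacency, lag):
    """Count known infected sites among each site's neighbors and itself.

    detected[i][k] is 1 if WNV was detected at site i in period k;
    adjacency[i] lists the indices of sites that touch site i.
    Returns counts[i][k], based on detections in period k - lag.
    """
    n_sites = len(detected)
    n_periods = len(detected[0])
    counts = [[0] * n_periods for _ in range(n_sites)]
    for i in range(n_sites):
        neighborhood = set(adjacency[i]) | {i}  # include the site itself
        for k in range(lag, n_periods):
            counts[i][k] = sum(detected[l][k - lag] for l in neighborhood)
    return counts
```

Because the covariate uses only detections from earlier periods, it is available for prediction at period k without knowledge of that period's test results.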

Finally, we included five spatial random effects in the occupancy component. Our spatial design \({\mathbf {X}}_s\) included agricultural and urban land types and site averages for freezing weeks, temperature, water level, and bird diversity. Smaller spatial designs resulted in strong posterior correlations between land type fixed effects and SREs. For each model, we checked trace plots and the Gelman-Rubin diagnostic (Gelman et al. 2013) to ensure convergence of all regression effects.
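As a reference for the convergence check, the following Python sketch computes the basic (non-split) Gelman-Rubin potential scale reduction factor for one scalar effect from several chains of equal length. It is a textbook version written for illustration with a hypothetical function name, not the routine we used.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Potential scale reduction factor for one scalar parameter.

    chains is a list of equal-length lists of post-burn-in draws.
    """
    n = len(chains[0])
    chain_means = [mean(chain) for chain in chains]
    # W: average within-chain sample variance.
    within = mean([variance(chain) for chain in chains])
    # B: between-chain variance, scaled by the chain length.
    between = n * variance(chain_means)
    # Pooled estimate of the marginal posterior variance.
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5
```

Values close to 1 suggest that the chains are sampling the same distribution, whereas values well above 1 (a common rule of thumb is 1.1) indicate that the chains have not mixed.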

We report model comparisons in Table 1. In addition to WAIC scores, for the validation data we computed our custom posterior predictive check, relative presence, for all sites, for the GTA region, and for seven sites above the 46th latitude (northern Ontario). Temperature and infected neighboring sites made the biggest difference in the iterative model build, which supports WNV occupancy varying by period. The first SRE captured a residual effect of longitude whereas the other SREs were less interpretable. Otherwise, including SREs did not impact predictive performance or change the posterior inference of occupancy and detection effects (Appendix, Figure 5). All models had poor predictive performance for northern Ontario. We followed up this finding by fitting these models on northern Ontario data only. We found that all posterior effects overlapped with zero, indicating that these covariates did not explain the observed WNV cases. Based on these diagnostics and our scientific understanding of WNV transmission dynamics, we selected the final model with SREs as our best model. We refit this model with training and validation data for 3 chains of 12,000 iterations with 2,000 iterations of burn-in.

Table 1 Iteratively built models with WAIC applied to the training dataset and relative presence diagnostics applied to the validation dataset

4.3 Model results

Our final model identified associations that have plausible scientific interpretations and have been previously reported in the literature (Table 2). Urban zones were more strongly associated with WNV occupancy than agricultural land, though both land types supported the pathogen. Warmer temperatures appeared to provide favorable environmental conditions for mosquitoes and WNV to thrive. Likewise, colder winters affected WNV occupancy, possibly by decreasing survival during the mosquitoes’ overwintering period. Decreases in eBird species diversity indicated more occupancy. This negative correlation was also observed in a separate analysis by Allan et al. (2009) who surmised that more bird species with low reservoir competence may exist in diverse communities. More traps and more mosquitoes, especially Culex mosquitoes, increased virus detection.

Table 2 Posterior effects for final model. The first eight rows are occupancy effects \(\varvec{\beta }\) and the last four rows are detection effects \(\varvec{\alpha }\)

Temperature and surveillance changes over the sampling season modified occupancy and detection. Figure 2 illustrates this pattern in the model predictions for a subset of PHUs in 2016. Heat waves in the early summer drove up occupancy rates for late summer. These seasonal peaks in occupancy aligned well with the pattern of human WNV cases from the Centers for Disease Control and Prevention (McDonald et al. 2021, Figure 1). More Culex mosquitoes were captured and tested in the late summer as well, increasing detection rates. Figure 2 displays temporal trends and uncertainty quantification, but it hides the inherently spatial nature of this model. Maps for modeled occupancy and detection in the 34th epiweek of 2016 capture this spatial behavior (Figures 3 and 4). Southeastern Ontario, especially GTA, had high occupancy rates. During the peak season, epiweeks 30 to 40, detection generally exceeded 0.50 in GTA and sometimes surpassed this mark in Ottawa (OTT), Windsor-Essex (WEC), and other PHUs. Animated maps for the entire 2016 sampling season are available online as Supplementary Materials.

Fig. 2 (a) Occupancy and (b) detection probabilities for PHUs Ottawa (OTT), Sudbury (SUD), Toronto (TOR), and Windsor-Essex (WEC) in 2016. Solid lines are posterior medians, and bands are 95% credible intervals. (c) Lagged mean temperature and (d) Culex proportion are influential occupancy and detection covariates

Fig. 3 Posterior medians for occupancy probabilities in Ontario on the 34th epiweek of 2016. The inset map juxtaposes southern Ontario relative to northern Ontario

Fig. 4 Posterior medians for detection probabilities in Ontario on the 34th epiweek of 2016. Detection was missing when no surveys were done. The inset map juxtaposes southern Ontario relative to northern Ontario

We also assessed our model’s posterior predictive performance for the test dataset (Table 3). For each PHU, we simulated 200 presences and averaged the results. PHUs with no presences were typically modeled to have fractional average presences. Meanwhile, PHUs with many presences were modeled to have many average presences. Metropolitan Toronto, one of the most surveyed PHUs, was remarkably well estimated by our method. In sum, our model simulated too few presences, especially for 2012; however, regression emphasizes mean behavior, and 2012 had the highest case counts of any season.
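One reading of this posterior predictive check can be sketched as follows: for each simulation, pick a posterior draw, simulate a latent occupancy indicator and a detection indicator for every survey in a PHU, count the resulting presences, and average the counts over 200 simulations. The function name, argument layout, and seeding are hypothetical, and the authors' exact procedure may differ.

```python
import numpy as np

def average_simulated_presences(psi_draws, p_draws, n_sims=200, seed=0):
    """Average simulated presence count for one PHU.

    psi_draws, p_draws: (n_draws, n_surveys) posterior occupancy and
    detection probabilities (assumed to be computed elsewhere).
    """
    rng = np.random.default_rng(seed)
    psi_draws = np.asarray(psi_draws, dtype=float)
    p_draws = np.asarray(p_draws, dtype=float)
    n_draws, n_surveys = psi_draws.shape
    idx = rng.integers(0, n_draws, size=n_sims)        # one posterior draw per simulation
    z = rng.random((n_sims, n_surveys)) < psi_draws[idx]   # latent occupancy
    d = rng.random((n_sims, n_surveys)) < p_draws[idx]     # detection if occupied
    y = z & d                                          # observed presence needs both
    return float(y.sum(axis=1).mean())

# Made-up posterior draws: 1000 surveys with occupancy 0.5 and detection 0.5.
demo_average = average_simulated_presences(np.full((4, 1000), 0.5),
                                           np.full((4, 1000), 0.5))
```

Because the result is an average over simulations, a PHU with no observed presences will usually receive a small fractional value rather than exactly zero.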

Table 3 Model average versus actual presences for test data

5 Discussion

We developed a new approach to solve binary regression problems in which the observed process is the product of a detection process and a latent binary process. Our method is in some sense a generalization of the single-season occupancy model (MacKenzie et al. 2002) but without the assumption of fixed occupancy. Our implementation relies on the state-of-the-art data augmentation strategy of Polson et al. (2013) to generate posterior samples via Gibbs sampling. We suspect that these methods could be thoughtfully applied to analyze presence-absence data for other vector-borne pathogens in which survey detection is imperfect, e.g., the bacteria causing Lyme disease in ticks.
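To make the product structure concrete, one way to write the observation model is given below. The symbols \(y_{ijk}\) (observed test result), \(z_{ijk}\) (latent occupancy), \(d_{ijk}\) (detection), \(\psi _{ijk}\), and \(p_{ijk}\) are notation assumed here for illustration, and the display further assumes that occupancy and detection are conditionally independent given the covariates.

$$\begin{aligned} z_{ijk}&\sim \text {Bernoulli}(\psi _{ijk}), \qquad d_{ijk} \sim \text {Bernoulli}(p_{ijk}), \\ y_{ijk}&= z_{ijk}\, d_{ijk}, \\ \Pr (y_{ijk} = 1)&= \psi _{ijk}\, p_{ijk}, \qquad \Pr (y_{ijk} = 0) = 1 - \psi _{ijk}\, p_{ijk}. \end{aligned}$$

Under these assumptions, an observed absence can arise either because the virus was not present or because it was present but not detected, which is why the two processes must be modeled jointly.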

In an extended case study of West Nile virus in Ontario, Canada, we demonstrated that Bayesian time-varying occupancy models can attain good predictive performance and identify scientifically meaningful relationships between responses and covariates. Notably, we found strong evidence to suggest that warmer temperatures in general foster WNV spread in our study region. Careful data fusion with citizen science and other Internet data resources could provide value to future ecological and environmental studies. Our analysis also suggests that using traps to specifically capture Culex mosquitoes may assist in WNV surveillance efforts. Finally, our modeling is the first among studies on this dataset to examine mosquitoes and WNV across all of Ontario. Our efforts have laid groundwork in the pursuit of an omnibus model for the province.

Models for WNV occupancy in Ontario could be improved and extended in various ways. Compiling and integrating more covariate data of ecological importance may elucidate new associations. Normalized difference vegetation index (NDVI), a greenness measurement from remote sensing, could indicate microchanges to land type at the temporal scale of days/weeks (DeMets et al. 2020a, b). Another useful predictor would be based on standing water and other basins in which mosquitoes breed, given that water level demonstrated a suggestive direction of effect (Shutt et al. 2021). Similar to freezing weeks, a seasonal covariate for precipitation could be used to investigate the effects of droughts and flooding. Alternatively, we could study WNV occupancy at a finer spatial mesh. From the outset, we aggregated to 34 areal sites, some as large as states. While there are existing spatial methods to analyze coordinate locations (Yue and Speckman 2010), working with site locations as points would be challenging, both to collect covariate data and to address sparsity concerns. SREs may also be important to account for spatial confounding. Such a model could provide locally informative updates on epidemics.

From a policy standpoint, we require statistical analysis like that presented in this paper to inform public health efforts and to start to quantify the relative impacts of time-varying occupancy versus time-varying sampling. Under a changing climate, vector-borne pathogens have been colonizing new regions and increasing in frequency, again highlighting the utility of being able to predict viral occupancy in systems that are seasonal. Our methods can be implemented on a personal laptop and employed in real time to accommodate epidemic responses. With adequate vector surveillance, public health officials can now quantify the probability of occupancy and detection of an environmentally driven pathogen and mitigate its spread in a timely manner. Our method can also be used to improve parameterization of other model types, such as differential equation and agent-based disease transmission models. By providing estimates of the probability through time that a virus is present and that it is detected, we can better calibrate the data-fitting process for spatio-temporal mechanistic models fit to mosquito virus data.

Fig. 5 Areal covariate corresponding to the first SRE. Negative (positive) values are in blue (red). The inset map juxtaposes southern Ontario relative to northern Ontario

Table 4 Posterior SREs \(\varvec{\theta }\) for final model