## Abstract

Strong parametric assumptions are often made when formulating statistical models in practice. In the field of ecology, these assumptions have sparked repeated debates about identifiability of species distribution and abundance models. We leverage econometrics literature to broaden the view of the problem. Nonparametric identifiability exists when a model could, in theory, be estimated without parametric assumptions. Even if in practice an ecologist will not fit a nonparametric model, the potential to do so means the data are informative for desired goals. Our approach for determining whether nonparametric identifiability holds in targeted parts of the model is based on relaxing particular parametric assumptions. We approximate a nonparametric relationship as a flexible, unpenalized spline fit to simulated data with increasing sample sizes. We show the importance of semi-parametric identifiability, nonparametric identifiability achieved in part of a model, with presence-only models, single-visit occupancy and abundance models, and capture–recapture models with detection heterogeneity. In each case, we use our simulation approach to illustrate that when nonparametric identifiability holds in a regression relationship, even a mis-specified parametric model may provide a useful approximation of properties of interest like prevalence and average occurrence and abundance, the fit of alternative models can be compared, and parametric assumptions can be checked. When semi-parametric identifiability does not hold, parametric assumptions create artificial identifiability, and alternative models cannot be distinguished empirically. We argue that ecologists, and modelers in general, should be most confident in results when a stronger form of identifiability holds.


## References

Lewbel A (2019) The identification zoo: meanings of identification in econometrics. J Econ Lit 57(4):835–903

Koopmans TC, Reiersol O (1950) The identification of structural characteristics. Ann Math Stat 21(2):165–181

Rothenberg TJ (1971) Identification in parametric models. Econometrica. https://doi.org/10.2307/1913267

Roehrig CS (1988) Conditions for identification in nonparametric and parametric models. Econometrica 56(2):433–447. https://doi.org/10.2307/1911080

Manski CF (2003) Partial identification of probability distributions, 1st edn. Springer, New York. https://doi.org/10.1007/b97478

Slud E, McKeague IW (1992) Nonparametric identifiability of marginal survival distributions in the presence of dependent competing risks and a prognostic covariate. In: Klein JP, Goel PK (eds) Survival analysis: state of the art, 1st edn. Springer, Dordrecht, pp 355–368

Abbring JH, Van den Berg GJ (2003) The nonparametric identification of treatment effects in duration models. Econometrica 71(5):1491–1517. https://doi.org/10.1111/1468-0262.00456

Van der Laan M, Hubbard AE, Jewell N (2010) Learning from data: semiparametric models versus faith-based inference. Epidemiology 21(4):479–481. https://doi.org/10.1097/EDE.0b013e3181e13328

Van der Laan M, Hubbard A, Jewell NP (2007) Estimation of treatment effects in randomized trials with non-compliance and a dichotomous outcome. J R Stat Soc Ser B 69(3):463–482. https://doi.org/10.1111/j.1467-9868.2007.00598.x

Robins JM, Greenland S (1992) Identifiability and exchangeability for direct and indirect effects. Epidemiology 3(2):143–155

Yackulic CB, Chandler RB, Zipkin EF, Royle JA, Nichols JD, Grant EHC, Veran S (2013) Presence-only modelling using maxent: when can we trust the inferences? Methods Ecol Evol 4:236–243. https://doi.org/10.1111/2041-210x.12004

Guillera-Arroita G, Lahoz-Monfort JJ, Elith J, Gordon A, Kujala H, Lentini PE, McCarthy MA, Tingley R, Wintle BA (2015) Is my species distribution model fit for purpose? Matching data and models to applications. Glob Ecol Biogeogr 24(3):276–292. https://doi.org/10.1111/geb.12268

Barker RJ, Schofield MR, Link WA, Sauer JR (2018) On the reliability of N-mixture models for count data. Biometrics 74:369–377. https://doi.org/10.1111/biom.12734

Lele SR, Moreno M, Bayne E (2012) Dealing with detection error in site occupancy surveys: what can we do with a single survey? J Plant Ecol 5(1):22–31. https://doi.org/10.1093/jpe/rtr042

Lele SR, Keim JL (2006) Weighted distributions and estimation of resource selection probability functions. Ecology 87(12):3021–3028. https://doi.org/10.1890/0012-9658(2006)87[3021:WDAEOR]2.0.CO;2

Solymos P, Lele SR, Bayne E (2012) Conditional likelihood approach for analyzing single visit abundance survey data in the presence of zero inflation and detection error. Environmetrics 23:197–205. https://doi.org/10.1002/env.1149

Knape J, Korner-Nievergelt F (2015) Estimates from non-replicated population surveys rely on critical assumptions. Methods Ecol Evol 6:298–306. https://doi.org/10.1111/2041-210X.12329

Knape J, Korner-Nievergelt F (2016) On assumptions behind estimates of abundance from counts at multiple sites. Methods Ecol Evol 7:206–209. https://doi.org/10.1111/2041-210X.12507

Solymos P, Lele SR (2016) Revisiting resource selection probability functions and single-visit methods: clarification and extensions. Methods Ecol Evol 7:196–205. https://doi.org/10.1111/2041-210X.12432

Royle JA, Chandler RB, Yackulic C, Nichols JD (2012) Likelihood analysis of species occurrence probability from presence-only data for modelling species distributions. Methods Ecol Evol 3:545–554. https://doi.org/10.1111/j.2041-210X.2011.00182.x

Ward G, Hastie T, Barry S, Elith J, Leathwick JR (2009) Presence-only data and the EM algorithm. Biometrics 65:554–563. https://doi.org/10.1111/j.1541-0420.2008.01116.x

Hastie T, Fithian W (2013) Inference from presence-only data; the ongoing controversy. Ecography 36:864–867. https://doi.org/10.1111/j.1600-0587.2013.00321.x

Link WA (2003) Nonidentifiability of population size from capture-recapture data with heterogeneous detection probabilities. Biometrics 59:1123–1130. https://doi.org/10.1111/j.0006-341X.2003.00129.x

Holzmann H, Munk A, Zucchini W (2006) On identifiability in capture-recapture models. Biometrics 62:934–939. https://doi.org/10.1111/j.1541-0420.2006.00637_1.x

Catchpole EA, Morgan BJT (1997) Detecting parameter redundancy. Biometrika 84(1):187–196

Cole D (2020) Parameter redundancy and identifiability, 1st edn. CRC Press, New York

Gimenez O, Viallefont A, Catchpole EA, Choquet R, Morgan BJT (2004) Methods for investigating parameter redundancy. Animal Biodiversity Conserv 27(1):561–572

Choquet R, Cole DJ (2012) A hybrid symbolic-numerical method for determining model structure. Math Biosci 236(2):117–125. https://doi.org/10.1016/j.mbs.2012.02.002

Box GEP (1979) Robustness in the strategy of scientific model building. In: Robustness in statistics. https://doi.org/10.1016/B978-0-12-438150-6.50018-2

Renner IW, Warton DI (2013) Equivalence of MAXENT and Poisson Point Process models for species distribution modeling in ecology. Biometrics 69(1):274–281. https://doi.org/10.1111/j.1541-0420.2012.01824.x

Dufour J, Hsiao C (2010) Identification. In: Durlauf SN, Blume LE (eds.) Microeconometrics. Palgrave Macmillan, London. https://doi.org/10.1057/9780230280816_11

Casella G, Berger RL (1990) Statistical inference, 1st edn. Brooks/Cole Publishing Company, Pacific Grove

Parzen E, Tanabe K, Kitagawa G (eds.) (1998) Selected Papers of Hirotugu Akaike. Springer Series in Statistics, pp. 199–213. Springer, New York. Chap. Information theory and an extension of the maximum likelihood principle

Mosher BA, Bailey LL, Hubbard BA, Huyvaert KP (2018) Inferential biases linked to unobservable states in complex occupancy models. Ecography 41(1):32–39. https://doi.org/10.1111/ecog.02849

Dorazio RM, Mukherjee B, Zhang L, Ghosh M, Jelks HL, Jordan F (2008) Modeling unobserved sources of heterogeneity in animal abundance using a Dirichlet Process prior. Biometrics 64(2):635–644. https://doi.org/10.1111/j.1541-0420.2007.00873.x

Turek D, Wehrhahn C, Gimenez O (2020) Bayesian non-parametric detection heterogeneity in ecological models. arXiv:2007.10163

Phillips SJ, Elith J (2013) On estimating probability of presence from use-availability or presence-background data. Ecology 94(6):1409–1419. https://doi.org/10.1890/12-1520.1

Solymos P, Moreno M, Lele SR (2018) Detect: analyzing wildlife data with detection error

Fiske I, Chandler R (2011) Unmarked: an R package for fitting hierarchical models of wildlife occurrence and abundance. J Stat Softw 43(10):1–23

Lele SR, Nadeem K, Schmuland B (2010) Estimability and likelihood inference for generalized linear mixed models using data cloning. J Am Stat Assoc 105(492):1617–1625

O’Hagan A (2003) HSSS model criticism. In: Green PJ, Hjort NL, Richardson S (eds) Highly structured stochastic systems, 1st edn. Oxford University Press, Oxford, pp 423–444

Hubbard AE, Ahern J, Fleischer NL, Van der Laan M, Satariano SA, Jewell N, Bruckner T, Satariano WA (2010) To GEE or not to GEE: comparing population average and mixed models for estimating the associations between neighborhood risk factors and health. Epidemiology 21(4):467–474

Pollock KH (1982) A capture-recapture design robust to unequal probability of capture. J Wildlife Manag 46(3):752–757

Rota CT, Fletcher RJ Jr, Dorazio RM, Betts MG (2009) Occupancy estimation and the closure assumption. J Appl Ecol 46:1173–1181. https://doi.org/10.1111/j.1365-2664.2009.01734.x

Poirier DJ (1998) Revising beliefs in nonidentified models. Econ Theory 5:483–509. https://doi.org/10.1017/S0266466698144043

Knape J, Arlt D, Barraquand F, Berg A, Chevalier M, Part T, Ruete A, Zmihorski M (2018) Sensitivity of binomial N-mixture models to overdispersion: the importance of assessing model fit. Methods Ecol Evol 9(10):2102–2114. https://doi.org/10.1111/2041-210X.13062

Pearce JL, Boyce MS (2006) Modelling distribution and abundance with presence-only data. J Appl Ecol 43:405–412. https://doi.org/10.1111/j.1365-2664.2005.01112.x

Boyce MS, Vernier PR, Nielsen SE, Schmiegelow FKA (2002) Evaluating resource selection functions. Ecol Model 157:281–300. https://doi.org/10.1016/S0304-3800(02)00200-4

Ottaviani D, Lasinio GJ, Boitani L (2004) Two statistical methods to validate habitat suitability models using presence-only data. Ecol Model 179:417–443. https://doi.org/10.1016/j.ecolmodel.2004.05.016

Hirzel AH, Lay GL, Helfer V, Randin C, Guisan A (2006) Evaluating the ability of habitat suitability models to predict species presences. Ecol Model 199:142–152. https://doi.org/10.1016/j.ecolmodel.2006.05.017

Phillips SJ, Elith J (2010) POC plots: calibrating species distribution models with presence-only data. Ecology 91(8):2476–2484. https://doi.org/10.1890/09-0760.1

Dorazio RM (2014) Accounting for imperfect detection and survey bias in statistical analysis of presence-only data. Glob Ecol Biogeogr 23:1472–1484. https://doi.org/10.1111/geb.12216

Fithian W, Elith J, Hastie T, Keith DA (2015) Bias correction in species distribution models: pooling survey and collection data for multiple species. Methods Ecol Evol 6:424–438. https://doi.org/10.1111/2041-210X.12242

Renner IW, Louvrier J, Gimenez O (2019) Combining multiple data sources in species distribution models while accounting for spatial dependence and overfitting with combined penalized likelihood maximization. Methods Ecol Evol 10(12):2118–2128. https://doi.org/10.1111/2041-210X.13297

Elith J, Graham CH, Anderson RP, Dudik M, Ferrier S, Guisan A, Hijmans RJ, Huettmann F, Leathwick JR, Lehmann A, Li J, Lohmann LG, Loiselle BA, Manion G, Moritz C, Nakamura M, Nakazawa Y, Overton JM, Peterson AT, Phillips SJ, Richardson K, Scachetti-Pereira R, Schapire RE, Soberon J, Williams S, Wisz MS, Zimmermann NE (2006) Novel methods improve prediction of species’ distributions from occurrence data. Ecography 29:129–151. https://doi.org/10.1111/j.2006.0906-7590.04596.x

MacKenzie DI, Nichols JD, Lachman GB, Droege S, Royle JA, Langtimm CA (2002) Estimating site occupancy rates when detection probabilities are less than one. Ecology 83(8):2248–2255. https://doi.org/10.1890/0012-9658(2002)083[2248:ESORWD]2.0.CO;2

MacKenzie DI, Nichols JD, Hines JE, Knutson MG, Franklin AB (2003) Estimating site occupancy, colonization, and local extinction when a species is detected imperfectly. Ecology 84(8):2200–2207. https://doi.org/10.1890/02-3090

MacKenzie DI, Royle JA (2005) Designing occupancy studies: general advice and allocating survey effort. J Appl Ecol 42:1105–1114. https://doi.org/10.1111/j.1365-2664.2005.01098.x

Guillera-Arroita G, Ridout MS, Morgan BJT (2010) Design of occupancy studies with imperfect detection. Methods Ecol Evol 1:131–139. https://doi.org/10.1111/j.2041-210X.2010.00017.x

Wood SN (2017) Generalized additive models: an introduction with R, 2nd edn. Chapman & Hall/CRC, London

Huggins R (2001) A note on the difficulties associated with the analysis of capture-recapture experiments with heterogeneous capture probabilities. Statist Probab Lett 54:147–152. https://doi.org/10.1016/S0167-7152(00)00233-9

Pezzott GLM, Salasar LEB, Leite JG, Louzada-Neto F (2019) A note on identifiability and maximum likelihood estimation for a heterogeneous capture-recapture model. Commun Stat Theory Methods. https://doi.org/10.1080/03610926.2019.1615628

Link WA (2006) Rejoinder to On identifiability in capture-recapture models. Biometrics 62(3):936–939

Mao CX (2007) Estimating population sizes for capture-recapture sampling with binomial mixtures. Comput Stat Data Anal 51:5211–5219. https://doi.org/10.1016/j.csda.2006.09.025

Mao CX (2008) On the nonidentifiability of population sizes. Biometrics 64:977–981. https://doi.org/10.1111/j.1541-0420.2008.01078.x

Farcomeni A, Tardella L (2012) Identifiability and inferential issues in capture-recapture experiments with heterogeneous detection probabilities. Electron J Stat 6:2602–2626. https://doi.org/10.1214/12-EJS758

Sanathanan L (1972) Estimating the size of a multinomial population. Ann Math Stat 43:142–152

Raue A, Kreutz C, Maiwald T, Bachmann J, Schilling M, Klingmuller U, Timmer J (2009) Structural and practical identifiability analysis of partially observed dynamical models by exploiting the profile likelihood. Bioinformatics 25(15):1923–1929. https://doi.org/10.1093/bioinformatics/btp358

Eisenberg MC, Jain HV (2017) A confidence building exercise in data and identifiability: Modeling cancer chemotherapy as a case study. J Theor Biol 431:63–78. https://doi.org/10.1016/j.jtbi.2017.07.018

Johndrow JE, Lum K, Manrique-Vallier D (2019) Low-risk population size estimates in the presence of capture heterogeneity. Biometrika. https://doi.org/10.1093/biomet/asy065

## Acknowledgements

The first author was supported by the National Physical Sciences Consortium fellowship, the Gordon and Betty Moore Foundation through Grant GBMF3834, and by the Alfred P. Sloan Foundation through Grant 2013-10-27 to the University of California, Berkeley. Thank you to Steve Beissinger, Peng Ding, Jonas Knape, Andrew Royle, and anonymous reviewers for helpful feedback.

## Ethics declarations

### Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

## Additional information

This article is part of the topical collection “Ecological Statistics” guest edited by Tiago Marques, Ben Stevenson, Charlotte Jones-Todd, and Ben Swallow.

## Appendices

### Appendix A: Simulation Set Up

The code for the simulation and the plots for the presence-only vs. presence–absence prevalence and single-visit vs. double-visit occurrence and detection scenarios can be found in the supplementary information and will be posted on GitHub and linked to here, upon publication [60]. Some details are outlined in this document.

We use unpenalized splines in this diagnostic scenario (not to be confused with the more commonly used regularized splines). We use seven knots; this choice balances sufficient flexibility for the scenarios of interest against computational practicality. Following common practice (e.g., Chapters 3 and 4, [61]), we spaced the knots equally throughout the range of the covariate. The choice of seven knots results in 11 parameters (the sum of the number of knots, the degree, and one for the intercept). One can increase the flexibility of the spline model by increasing the number of knots. In practice, numerical instability in a particular model-fitting implementation will likely prevent one from choosing an extremely large number of knots. However, in our examples, even a seven-knot spline is sufficient to illustrate lack of nonparametric identifiability.
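As a rough illustration of the parameter count above, the following sketch builds one row of a cubic spline design matrix with seven equally spaced knots. The truncated-power basis, the placement of the knots strictly inside the covariate range, and the function names are assumptions made for illustration; the simulation code in the supplementary information may construct its splines differently.

```python
def equally_spaced_knots(lo, hi, n_knots):
    """Return n_knots interior knots equally spaced strictly inside (lo, hi)."""
    step = (hi - lo) / (n_knots + 1)
    return [lo + step * (j + 1) for j in range(n_knots)]


def spline_basis_row(x, knots, degree=3):
    """Truncated-power basis: intercept, x^1..x^degree, then (x - k)_+^degree per knot."""
    row = [1.0]
    row += [x ** d for d in range(1, degree + 1)]
    row += [max(x - k, 0.0) ** degree for k in knots]
    return row


# Covariate range taken from the simulation set-up, Unif(-2.5, 8).
knots = equally_spaced_knots(-2.5, 8.0, 7)
row = spline_basis_row(0.0, knots)
n_params = len(row)  # 1 intercept + 3 polynomial terms + 7 knot terms
```

Because the fit is unpenalized, every added knot adds one free coefficient, which is why flexibility grows directly with the number of knots.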

Covariate values for the presence-only vs. presence–absence scenario are independent and uniformly distributed, \(x_i \sim \text {Unif}(-2.5,8)\). This particular choice is not essential; it was chosen to reveal the whole shape of the occurrence function. Similarly, values for \(x_i\) and \(z_i\) in the single-visit vs. double-visit scenario are independent and uniformly distributed as above, \(x_i \sim \text {Unif}(-2.5,8)\) and \(z_i \sim \text {Unif}(-2.5,8)\). Again, this particular choice is not essential. Occurrence and detection probabilities are \(\psi (x_i, \beta ) = 0.071 x_i + 0.18\) and \(p(z_i, \beta ) = 0.048 z_i + 0.12\), respectively.
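A minimal sketch of how single-visit data could be generated from these settings is given below. The covariate ranges and the two linear probability functions come from the text; the function names, the seed, and the sample size are hypothetical choices, and the latent-occupancy-then-detection structure is the standard occupancy formulation assumed here rather than a transcription of the authors' code.

```python
import random


def occurrence_prob(x):
    """Linear occurrence probability psi(x) from Appendix A."""
    return 0.071 * x + 0.18


def detection_prob(z):
    """Linear detection probability p(z) from Appendix A."""
    return 0.048 * z + 0.12


def simulate_single_visit(n_sites, seed=0):
    """Draw covariates, latent occupancy, and one detection record per site."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_sites):
        x = rng.uniform(-2.5, 8.0)
        z = rng.uniform(-2.5, 8.0)
        occupied = rng.random() < occurrence_prob(x)
        # A species can only be detected at a site it occupies.
        detected = occupied and (rng.random() < detection_prob(z))
        data.append((x, z, int(detected)))
    return data


sites = simulate_single_visit(1000)
```

Over the range \((-2.5, 8)\) both linear functions stay within the unit interval (up to floating-point rounding at the lower endpoint of the detection function), so no truncation is needed.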

### Appendix B: Partial Identification

### Appendix B.1: Occurrence and Detection Probabilities with Single-Visit Data

As a concrete example, if we identify the product of average occurrence and average detection as 0.75, we know that the case where \(\bar{\psi }=0.625\) and \(\bar{p}=1.20\) is not plausible (detection probabilities cannot be above one). With partial identifiability, we can bound \(\bar{\psi }\) and \(\bar{p}\) to both be in the interval [0.75, 1] (where either, but not both, could equal one).
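The arithmetic behind this bound can be sketched as follows. The function names are hypothetical; the only inputs are the numbers quoted in the example above.

```python
def partial_id_bounds(product):
    """Bounds on each of two probabilities whose product is known.

    If a * b = product with a, b in (0, 1], then a = product / b >= product,
    so each factor must lie in [product, 1].
    """
    if not 0.0 < product <= 1.0:
        raise ValueError("product must be in (0, 1]")
    return (product, 1.0)


def implied_detection(product, occurrence):
    """Average detection implied by a candidate average occurrence."""
    return product / occurrence


lower, upper = partial_id_bounds(0.75)
# Candidate occurrence below the lower bound forces detection above one.
p_bar = implied_detection(0.75, 0.625)
```

Any candidate value of one factor below the lower bound pushes the other factor above one, which is what rules it out.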

### Appendix B.2: Prevalence with Presence-Only Data (With Perfect Detection)

For this discussion, it will be useful to denote the occurrence probability \(\psi _i\) at site *i* as \(\psi (x_i)\) to emphasize the dependency on the covariate *x*.

Each \(\psi (x_i) \in (0,1)\) so each \(\alpha \psi (x_i) \in (0,1)\). Therefore, \(\alpha \in \left( 0, \frac{1}{\sup _{x\in X} \psi (x)}\right) \) for the set *X* of possible realizations of the feature *x*. This bound on \(\alpha \) translates into a bound for the overall prevalence \(\rho ^*\), \(\rho ^* \in \left( 0, \frac{\rho }{\sup _{x\in X} \psi (x)}\right) \), where \(\rho \) is the prevalence before scaling. Note that \(\frac{\sup _{x\in X }\psi (x)}{\rho } = \sup _{x\in X} \frac{\psi (x)}{\rho }=\sup _{x\in X} \frac{\pi (x |Y=1)}{\pi (x)}\). Then, the bound can be estimated: \(\rho ^* \in \left( 0, \inf _{x\in X} \frac{\pi (x)}{\pi (x |Y=1)}\right) \). A lower bound exists for \(\sup _{x \in X}\frac{\pi (x |Y=1)}{\pi (x)}\), so an upper bound exists for \(\inf _{x\in X}\frac{\pi (x)}{\pi (x |Y=1)}\) because \(\pi (x)\) is assumed to be known and \(\pi (x |Y=1)\) is the observable distribution of the covariate *x* given \(Y=1\). In context, to get a narrower interval, a big differential between \(\pi (x |Y=1)\) and \(\pi (x)\) would need to exist. This would correspond to *x* being a strong predictor of occurrence. In practice, one could search for a region within the data that had a big difference. There is potential that this partial identifiability would suffice in practice depending on the research goals.
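To make the estimable bound concrete, the sketch below evaluates \(\inf _{x} \pi (x)/\pi (x |Y=1)\) for a hypothetical covariate with three categories. The discrete covariate, the category labels, and the probabilities are all invented for illustration; the argument above is stated for a general covariate *x*.

```python
def prevalence_upper_bound(pi_x, pi_x_given_presence):
    """Smallest ratio pi(x) / pi(x | Y = 1) over covariate values with presences."""
    ratios = [
        pi_x[k] / pi_x_given_presence[k]
        for k in pi_x
        if pi_x_given_presence[k] > 0
    ]
    return min(ratios)


# Hypothetical known covariate distribution across the landscape.
pi_x = {"low": 0.5, "mid": 0.3, "high": 0.2}
# Hypothetical covariate distribution among presence records.
pi_x_given_presence = {"low": 0.2, "mid": 0.3, "high": 0.5}

bound = prevalence_upper_bound(pi_x, pi_x_given_presence)
```

In this toy case the bound is driven by the "high" category, where presences are most over-represented relative to availability; a stronger predictor of occurrence produces a smaller ratio and hence a narrower interval.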

### Appendix C: Single-Visit Abundance Scenario

The single-visit abundance scenario follows analogously to the single-visit occurrence scenario. Parametric identifiability for the site-specific abundance and detection probabilities comes from particular choices of link functions, but these properties lack the stronger nonparametric identifiability. In the single-visit abundance scenario, there is an underlying data-generating process that produces an abundance \(N_i\) at each site *i*. A random sample of *S* sites is visited, and the number of individuals seen at each site is recorded. The \(N_i\) are assumed to be independent Poisson random variables with parameters \(\lambda _i\). Given the abundance \(N_i\) at a site, and the site-specific detection probability \(p_i\), the number \(y_i\) of individuals observed is assumed to follow a binomial distribution with \(N_i\) trials and probability \(p_i\). Then, the \(y_i\) are marginally independent Poisson random variables with parameters \(\lambda _i p_i\). The abundances are typically of interest, and the average parameter \(\frac{1}{S}\sum _{i=1}^S\lambda _i\) of the abundance distribution may be of particular interest.
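For readers who want the intermediate step, the marginal Poisson result follows from summing the latent abundance out of the Poisson and binomial assumptions stated above (a standard thinning argument, written here with the site index suppressed):

\[
P(y) = \sum _{N=y}^{\infty } \frac{e^{-\lambda }\lambda ^{N}}{N!} \binom{N}{y} p^{y}(1-p)^{N-y}
= \frac{e^{-\lambda }(\lambda p)^{y}}{y!} \sum _{m=0}^{\infty } \frac{\left( \lambda (1-p)\right) ^{m}}{m!}
= \frac{e^{-\lambda p}(\lambda p)^{y}}{y!}.
\]

Only the product \(\lambda p\) appears in the observable distribution, which is the source of the identifiability problem discussed in this appendix.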

Solymos et al. [16] proposed an approach that uses covariates *x* to model abundance and *z* to model detection probabilities, and that estimates the parameters \(\lambda _i\) of the site-specific abundance distribution separately from the site-specific detection probabilities \(p_i\) [16].

Their approach is able to parametrically identify the site-specific abundance and detection probabilities, but Knape and Korner-Nievergelt [17] proposed a counter-example model that reveals a lack of nonparametric identifiability for these properties [17]. This counter-example model is similar to that of the single-visit occurrence example: \(\lambda _i = \alpha \frac{e^{\beta _0+\beta 'x_i}}{1+e^{\beta _0+\beta 'x_i}}\); \(p_i= \frac{1}{\alpha }\frac{e^{\theta _0+\theta 'z_i}}{1+e^{\theta _0+\theta 'z_i}}\). The same observable distribution of the \(y_i\) can arise from different components \(\lambda _i\) and \(p_i\).
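The scaling argument can be checked numerically with a short sketch. The coefficient values, the covariate values, and the two choices of \(\alpha \) below are hypothetical; only the functional form of \(\lambda _i\) and \(p_i\) is taken from the counter-example.

```python
import math


def expit(u):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-u))


def lambda_and_p(x, z, alpha, beta0=0.5, beta1=1.0, theta0=-0.3, theta1=0.8):
    """Scaled-logistic abundance and detection from the counter-example model."""
    lam = alpha * expit(beta0 + beta1 * x)
    p = expit(theta0 + theta1 * z) / alpha
    return lam, p


# Same covariates, two different scalings (hypothetical values).
lam_a, p_a = lambda_and_p(1.0, 2.0, alpha=1.0)
lam_b, p_b = lambda_and_p(1.0, 2.0, alpha=2.0)

# The observable Poisson mean is the product lambda * p in both cases.
mean_a = lam_a * p_a
mean_b = lam_b * p_b
```

The two parameterizations imply abundances that differ by a factor of two, yet the mean of the observed counts is unchanged, so count data from a single visit cannot separate them.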

A lack of nonparametric identifiability can be shown in even these idealized conditions, but it should be noted that the assumptions of the Poisson distribution, the independence between sites, and the Binomial distribution may also be tenuous. Barker et al. [13] and Knape et al. [46] provide a discussion of identifiability and robustness with respect to the binomial and Poisson assumptions in the scenario where abundance data are recorded from multiple visits [13, 46].

Again, the scaling counter-example was a convenient way to show that the average abundance lacks nonparametric identifiability, but the breakdown under model mis-specification can be shown using yet another data-generating process that differs from the assumptions of the single-visit model. Figures 5 and 6 present the same identifiability scenarios as in the main manuscript for estimating abundance and detection, respectively. The top rows use counts with imperfect detection from a single visit, while the bottom rows use counts with imperfect detection from two visits. The first columns show estimation via Poisson and logistic regression, respectively, when the true data-generating processes come from quadratic functions of covariates. The code for this simulation and the plots for this scenario can be found in the supporting information and will be posted on GitHub and linked to here upon publication [60].

In the single-visit case, the “best approximation” within the parametric family of the Poisson underestimates the abundance and overestimates detection probabilities. However, with two visits, the abundance and detection probabilities are locally approximated within the Poisson and logistic families, respectively. In the second column, the added flexibility of the spline terms allows both the single-visit and double-visit cases to estimate abundances and detection probabilities closer to the truth, albeit with fairly high variability across simulations. With extra data, the nonparametrically identified double-visit case has decreased variability across simulations and is starting to converge to the truth (although even more data seem to be needed), while the estimates across simulations in the single-visit case still cover too wide a range of potential abundances and detection probabilities to be useful in practice.

In the single-visit abundance example, an upper bound for abundance cannot be found since detection probabilities could be arbitrarily small, but the number of detected individuals (assuming no double-counts) can provide a lower bound.

### Appendix D: Capture–Recapture Abundance Scenario

Capture–recapture is a data collection strategy that requires repeated visits to the same locations over time and the ability to uniquely identify individuals. This way a repeated sighting of a particular individual is recorded. Closure is assumed and replication given by revisiting locations is used to allow for imperfect detection probabilities. The problem in this scenario is how to estimate the abundance when nothing is known about the distribution of the individual detection probabilities.

To estimate an unknown total abundance *N* across a region of interest, *S* sites are visited *T* times. When an individual is spotted, it is marked such that it can be distinguished from others if it is seen again during another visit. The \(X_i\) are the number of times individual *i* is observed, and *n* is the number of individuals that are seen at least once. The \(X_i\) given individual detection probabilities \(p_i\) are independent binomial random variables with *T* trials and probability \(p_i\) of success. The detection probabilities \(p_i\) are identically distributed from some unknown distribution *g*(*p*). The sighting frequencies \(f_x\) are the number of \(X_i\) where \(X_i=x\). However, sighting frequencies are observed only given that the individual is spotted at all, \(f_x^c\). Similarly, the only probability observed is that of seeing an individual *x* times given that it is seen at least once.
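The distinction between the unconditional and conditional sighting probabilities can be illustrated with a small sketch. It assumes, purely for illustration, that *g*(*p*) is a two-point mixture with hypothetical support and weights; \(\pi _g(x)\) denotes the marginal probability that an individual is seen *x* times in *T* visits.

```python
from math import comb


def marginal_sighting_prob(x, T, support, weights):
    """pi_g(x): binomial probability of x sightings averaged over a discrete g(p)."""
    return sum(
        w * comb(T, x) * p ** x * (1 - p) ** (T - x)
        for p, w in zip(support, weights)
    )


def conditional_sighting_probs(T, support, weights):
    """Probabilities of x = 1..T sightings given at least one sighting."""
    p0 = marginal_sighting_prob(0, T, support, weights)
    return [
        marginal_sighting_prob(x, T, support, weights) / (1 - p0)
        for x in range(1, T + 1)
    ]


# Hypothetical two-point mixture for the detection probabilities.
cond = conditional_sighting_probs(T=4, support=[0.1, 0.6], weights=[0.5, 0.5])
never_seen = marginal_sighting_prob(0, 4, [0.1, 0.6], [0.5, 0.5])
```

The data inform only the conditional probabilities in `cond`; the probability of never being seen, `never_seen`, is exactly the piece that has to be extrapolated from assumptions about *g*(*p*).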

With the language of nonparametric and parametric identifiability, previous results in the literature are more easily interpretable. Huggins [62] showed that there are different unconditional distributions of the total abundance (unobserved) that have the same distributions when conditioned on the captured individuals (observed) [62]. Link [23] refined the conclusions of Huggins [62] and showed that if two distributions of detection probabilities conditioned on the captured individuals are close (in function space), their unconditional distributions of the total abundance are not necessarily close [23, 62]. This result shows a lack of nonparametric identifiability; the abundance cannot be identified without restrictions on the detection probability distribution *g*(*p*).

Holzmann et al. [24] showed that if *g*(*p*) is assumed to belong to certain probability distribution families, abundance is identifiable [24]. Abundance is identifiable if it is assumed that the detection probabilities follow a uniform distribution (with more than one visit per site), follow a Beta distribution (with more than two visits per site) or follow a finite mixture model (with at least twice as many visits as mixture components). These are parametric identifiability results. Pezzott et al. [63] later improved the finite mixture parametric identifiability result from Holzmann et al. [24] to allow for one extra parameter in the parameter space [24, 63]. Link [64] responded that even if it is assumed that there are no individuals who are undetectable, there is still no identifiability across different families of assumptions for *g*(*p*), e.g., a Beta distribution can be found that gives an identical observable distribution to a two-point mixture but implies a different overall abundance [64]. This illustrates the gap between a parametrically and nonparametrically identifiable property.

Mao [65, 66] determined a lower bound on the odds of an individual animal not being captured and used it to lower bound the abundance [65, 66]. However, they also showed that an upper bound on abundance cannot be found without placing restrictions on *g*(*p*). These results partially identify abundance.

### Appendix D.1: Lack of Nonparametric Identifiability for Capture–Recapture Data (With Heterogeneous Detection)

Link [23] gave examples that have the same observable distribution but imply different values of total abundance, showing lack of nonparametric identifiability when working with the distribution of the abundance conditional on the sighted individuals [23].

However, Farcomeni and Tardella [67] showed that abundance is technically identifiable when analyzing the unconditional likelihood rather than the conditional likelihood [67]. Huggins [62], Link [23], and Holzmann et al. [24] focused on the conditional likelihood because Sanathanan [68] stated that the conditional likelihood of the abundance (conditional on the captured individuals) is asymptotically equivalent to its unconditional likelihood [23, 24, 62, 68]. However, Farcomeni and Tardella [67] pointed out that the conditions for this statement are not met in the capture–recapture scenario [67]. Despite technical identifiability, Farcomeni and Tardella [67] showed that there is no consistent estimator for the total abundance [67]. Here, we provide some intuition about why the technical identifiability is so tenuous.

Recall that if *N* is considered fixed, \(n \sim \text {Binom}(N, 1-\pi _g(0))\). Technically, *N* and \(\pi _g(0)\) are identifiable because it is not possible to find \(N \ne N^*\) and \(\pi _g(0)\ne \pi _g^*(0)\) such that for all *x*:

\[P(n = x \mid N, \pi _g(0)) = P(n = x \mid N^*, \pi _g^*(0)).\]

However, with a single realization of the data-generating process, these parameters are not practically identifiable, i.e., not identifiable given a particular set of data observed in practice [69, 70]. Importantly, more realizations of the data-generating process would be needed, not a larger sample size.

If the model is broadened to allow *N* to be random, this tenuous identifiability goes away. Suppose \(N \sim \text {Poisson}(\lambda )\). Note that *N* could be chosen to follow a nonparametric distribution, but since *N* and \(\pi _g(0)\) can be proved to be not nonparametrically identifiable in this case, they certainly are not nonparametrically identifiable in a more general case. Then, the number of individuals seen at least once follows a Poisson distribution with parameter \(\lambda (1-\pi _g(0))\).

Multiple data generating processes can have the same observable distribution by scaling up \(\lambda \) and scaling down \((1-\pi _g(0))\) by equal amounts (or vice versa). Therefore, the abundance is not nonparametrically identifiable.
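A quick numerical check of this scaling argument is sketched below. The two parameter pairs are hypothetical; they are chosen only so that \(\lambda (1-\pi _g(0))\) agrees.

```python
import math


def observed_rate(lam, prob_never_seen):
    """Poisson rate for the number of individuals seen at least once."""
    return lam * (1.0 - prob_never_seen)


def poisson_pmf(k, rate):
    """Poisson probability mass, computed on the log scale for stability."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


# Two hypothetical data-generating processes with different expected abundance.
rate_a = observed_rate(100.0, 0.2)  # expected abundance 100, 20% never seen
rate_b = observed_rate(160.0, 0.5)  # expected abundance 160, 50% never seen

# Largest difference between the two observable distributions of n.
max_gap = max(
    abs(poisson_pmf(k, rate_a) - poisson_pmf(k, rate_b)) for k in range(200)
)
```

Expected abundances of 100 and 160 give the same distribution for the number of individuals seen, up to floating-point rounding, so no amount of data on *n* alone can distinguish them.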

Johndrow et al. [71] proposed estimating a different quantity, the abundance of individuals who have detection probabilities above a particular threshold [71]. They provide a risk analysis of their estimator and some guidance on how to choose the required threshold.

## About this article

### Cite this article

Stoudt, S., de Valpine, P. & Fithian, W. Nonparametric Identifiability in Species Distribution and Abundance Models: Why it Matters and How to Diagnose a Lack of it Using Simulation.
*J Stat Theory Pract* **17**, 39 (2023). https://doi.org/10.1007/s42519-023-00336-5

DOI: https://doi.org/10.1007/s42519-023-00336-5