Abstract
We address the multiple testing problem under the assumption that the true/false status of the hypotheses is driven by a hidden Markov model (HMM), a setting recognized as fundamental for modeling multiple testing under dependence since the seminal work of Sun and Cai (J R Stat Soc Ser B (Stat Methodol) 71:393–424, 2009). Previous work has concentrated on deriving specific procedures that control the false discovery rate under this model. Following a recent trend in selective inference, we instead consider the problem of establishing confidence bounds on the false discovery proportion for a user-selected set of hypotheses that can depend on the observed data in an arbitrary way. We develop a methodology to construct such confidence bounds, first when the HMM is known, then when its parameters, including the data distributions under the null and the alternative, are unknown and estimated using a nonparametric approach. In the latter case, we propose a bootstrap-based methodology to account for parameter estimation error. We show that exploiting the assumed HMM structure yields substantially sharper confidence bounds than existing agnostic (structure-free) methods, as demonstrated in both numerical experiments and real data examples.
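As a rough illustration of the known-model case, the sketch below computes a Monte Carlo upper bound on the false discovery proportion of a selected set. It is not the authors' procedure: it assumes a two-state HMM with unit-variance Gaussian emissions, state 0 playing the role of the null, and all function names (`forward_filter`, `sample_states`, `fdp_upper_bound`) are hypothetical. Hidden paths are drawn from their conditional distribution given the data by forward filtering and backward sampling, and the bound is an empirical quantile of the proportion of nulls in the selected set. Because that quantile is taken conditionally on the data, the selected set may be chosen after looking at the data, up to Monte Carlo error.

```python
import numpy as np


def gaussian_pdf(x, mean, sd=1.0):
    """Density of N(mean, sd^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))


def forward_filter(x, init, trans, means):
    """Normalised filtering probabilities P(theta_i = k | x_1, ..., x_i)."""
    n, k = len(x), len(init)
    emis = np.column_stack([gaussian_pdf(x, m) for m in means])  # shape (n, k)
    alpha = np.empty((n, k))
    a = init * emis[0]
    alpha[0] = a / a.sum()
    for i in range(1, n):
        # One-step prediction through the transition matrix, then reweighting
        a = (alpha[i - 1] @ trans) * emis[i]
        alpha[i] = a / a.sum()
    return alpha


def sample_states(alpha, trans, rng):
    """Draw one hidden path from P(theta | x) by backward sampling."""
    n, k = alpha.shape
    theta = np.empty(n, dtype=int)
    theta[-1] = rng.choice(k, p=alpha[-1])
    for i in range(n - 2, -1, -1):
        w = alpha[i] * trans[:, theta[i + 1]]
        theta[i] = rng.choice(k, p=w / w.sum())
    return theta


def fdp_upper_bound(x, selected, init, trans, means,
                    level=0.9, n_draws=200, seed=0):
    """Empirical `level`-quantile of the FDP of a non-empty `selected` set,
    under the conditional law of the hidden states (state 0 = null)."""
    rng = np.random.default_rng(seed)
    alpha = forward_filter(x, init, trans, means)
    selected = np.asarray(selected)
    fdp = np.empty(n_draws)
    for b in range(n_draws):
        theta = sample_states(alpha, trans, rng)
        fdp[b] = np.mean(theta[selected] == 0)
    return float(np.quantile(fdp, level))


# Toy data (assumed for illustration): 50 null-like values, then 50 signal-like
x = np.concatenate([np.zeros(50), np.full(50, 4.0)])
init = np.array([0.5, 0.5])
trans = np.array([[0.95, 0.05], [0.05, 0.95]])
means = (0.0, 4.0)

bound_signal = fdp_upper_bound(x, np.arange(60, 90), init, trans, means)
bound_null = fdp_upper_bound(x, np.arange(10, 40), init, trans, means)
print(bound_signal, bound_null)
```

In this toy setting, a set selected inside the signal segment should receive a bound near 0 and a set selected inside the null segment a bound near 1. With unknown parameters, the sketch would additionally need the estimation and bootstrap steps described in the abstract, which are not shown here.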
References
Abraham K, Castillo I, Gassiat E (2021a) Multiple testing in nonparametric hidden Markov models: an empirical Bayes approach. arXiv:2101.03838
Abraham K, Castillo I, Roquain E (2021b) Empirical Bayes cumulative \(\ell\)-value multiple testing procedure for sparse sequences
Albertson DG, Collins C, McCormick F, Gray JW (2003) Chromosome aberrations in solid tumors. Nat Genet 34:369–376
Alexandrovich G, Holzmann H, Leister A (2016) Nonparametric identification and maximum likelihood estimation for hidden Markov models. Biometrika 103:423–434
Azriel D, Schwartzman A (2015) The empirical distribution of a large number of correlated normal variables. J Am Stat Assoc 110:1217–1228. https://doi.org/10.1080/01621459.2014.958156
Bachoc F, Blanchard G, Neuvial P (2018) On the post selection inference constant under restricted isometry properties. Electron J Stat 12:3736–3757. https://doi.org/10.1214/18-EJS1490
Bachoc F, Leeb H, Pötscher BM (2019) Valid confidence intervals for post-model-selection predictors. Ann Stat 47:1475–1504. https://doi.org/10.1214/18-AOS1721
Benjamini Y, Bogomolov M (2014) Selective inference on multiple families of hypotheses. J R Stat Soc Ser B (Stat Methodol) 76:297–318
Benjamini Y, Hochberg Y (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 57:289–300
Benjamini Y, Yekutieli D (2005) False discovery rate-adjusted multiple confidence intervals for selected parameters. J Am Stat Assoc 100:71–81
Berk R, Brown L, Buja A, Zhang K, Zhao L (2013) Valid post-selection inference. Ann Stat 41:802–837. https://doi.org/10.1214/12-AOS1077
Blanchard G, Neuvial P, Roquain E (2020) Post hoc confidence bounds on false positives using reference families. Ann Stat 48:1281–1303. https://doi.org/10.1214/19-AOS1847
Cai TT, Jin J (2010) Optimal rates of convergence for estimating the null density and proportion of nonnull effects in large-scale multiple testing. Ann Stat 38:100–145. https://doi.org/10.1214/09-AOS696
Cai TT, Sun W (2009) Simultaneous testing of grouped hypotheses: finding needles in multiple haystacks. J Am Stat Assoc 104:1467–1481. https://doi.org/10.1198/jasa.2009.tm08415
Cai TT, Sun W, Wang W (2019) Covariate-assisted ranking and screening for large-scale two-sample inference. J R Stat Soc Ser B (Stat Methodol) 81:187–234. https://doi.org/10.1111/rssb.12304
Cappé O, Moulines E, Rydén T (2006) Inference in hidden Markov models. Springer, Berlin
Castillo I, Roquain E (2020) On spike and slab empirical Bayes multiple testing. Ann Stat 48:2548–2574
Dawid AP (1994) Selection paradoxes of Bayesian inference. Lect Notes Monogr Ser 24:211–220
De Castro Y, Gassiat E, Le Corff S (2017) Consistent estimation of the filtering and marginal smoothing distributions in nonparametric hidden Markov models. IEEE Trans Inf Theory 63:4758–4777
Durand G, Blanchard G, Neuvial P, Roquain E (2020) Post hoc false positive control for structured hypotheses. Scand J Stat 47:1114–1148. https://doi.org/10.1111/sjos.12453
Efron B (2004) Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc 99:96–104. https://doi.org/10.1198/016214504000000089
Efron B (2007) Doing thousands of hypothesis tests at the same time. Metron Int J Stat LXV:3–21
Efron B (2008) Microarrays, empirical Bayes and the two-groups model. Stat Sci 23:1–22. https://doi.org/10.1214/07-STS236
Efron B (2009) Empirical Bayes estimates for large-scale prediction problems. J Am Stat Assoc 104:1015–1028. https://doi.org/10.1198/jasa.2009.tm08523
Efron B (2011) Tweedie’s formula and selection bias. J Am Stat Assoc 106:1602–1614
Efron B, Tibshirani R, Storey JD, Tusher V (2001) Empirical Bayes analysis of a microarray experiment. J Am Stat Assoc 96:1151–1160
Fan J, Han X (2017) Estimation of the false discovery proportion with unknown dependence. J R Stat Soc Ser B (Stat Methodol) 79:1143–1164
Fan J, Ke Y, Sun Q, Zhou W-X (2019) Farmtest: factor-adjusted robust multiple testing with approximate false discovery control. J Am Stat Assoc 1–29
Franke J, Kreiss J-P, Mammen E, Neumann MH (2002) Properties of the nonparametric autoregressive bootstrap. J Time Ser Anal 23:555–585
Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN (2004) Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal 90:132–153
Friguet C, Kloareg M, Causeur D (2009) A factor model approach to multiple testing under dependence. J Am Stat Assoc 104:1406–1415
Gales M, Young S (2008) The application of hidden Markov models in speech recognition. Now Publishers Inc, Hanover
Gassiat É, Cleynen A, Robin S (2016) Inference in finite state space non parametric hidden Markov models and applications. Stat Comput 26:61–71
Genovese CR, Wasserman L (2006) Exceedance control of the false discovery proportion. J Am Stat Assoc 101:1408–1417
Goeman JJ, Solari A (2011) Multiple testing for exploratory research. Stat Sci 26:584–597. https://doi.org/10.1214/11-STS356
Hall P, DiCiccio TJ, Romano JP (1989) On smoothing and the bootstrap. Ann Stat 17:692–704
Heller R, Rosset S (2021) Optimal control of false discovery criteria in the two-group model. J R Stat Soc Ser B (Stat Methodol) 83:133–155
Heller R, Yekutieli D (2014) Replicability analysis for genome-wide association studies. Ann Appl Stat 8:481–498. https://doi.org/10.1214/13-AOAS697
Horowitz JL (2003) Bootstrap methods for Markov processes. Econometrica 71:1049–1082
Jin J, Cai TT (2007) Estimating the null and the proportion of nonnull effects in large-scale multiple comparisons. J Am Stat Assoc 102:495–506. https://doi.org/10.1198/016214507000000167
Katsevich E, Ramdas A (2020) Simultaneous high-probability bounds on the false discovery proportion in structured, regression and online settings. Ann Stat 48:3465–3487. https://doi.org/10.1214/19-AOS1938
Kim C-J, Nelson CR et al (1999) State-space models with regime switching: classical and Gibbs-sampling approaches with applications, vol 1. The MIT Press, Cambridge
Koski T (2001) Hidden Markov models for bioinformatics, vol 2. Springer, Berlin
Lee JD, Sun DL, Sun Y, Taylor JE et al (2016) Exact post-selection inference, with application to the lasso. Ann Stat 44:907–927
Leek JT, Storey JD (2008) A general framework for multiple testing dependence. Proc Natl Acad Sci 105:18718–18723
Luo F (2019) A systematic evaluation of copy number alterations detection methods on real SNP array and deep sequencing data. BMC Bioinform 20:1–16
Nguyen VH, Matias C (2014) Nonparametric estimation of the density of the alternative hypothesis in a multiple testing setup. Application to local false discovery rate estimation. ESAIM PS 18:584–612. https://doi.org/10.1051/ps/2013041
Okamoto A, Sehouli J, Yanaihara N, Hirata Y, Braicu I, Kim B-G, Takakura S, Saito M, Yanagida S, Takenaka M et al (2015) Somatic copy number alterations associated with Japanese or endometriosis in ovarian clear cell adenocarcinoma. PLoS ONE 10:e0116977
Panigrahi S, Taylor J, Weinstein A (2020) Integrative methods for post-selection inference under convex constraints
Pierre-Jean M, Neuvial P (2017) acnr: annotated copy-number regions R package version 1.0.0
Pierre-Jean M, Rigaill G, Neuvial P (2015) Performance evaluation of DNA copy number segmentation methods. Brief Bioinform 16:600–615
Pierre-Jean M, Rigaill G, Neuvial P (2019) jointseg: Joint segmentation of multivariate (copy number) signals R package version 1.0.2
Rebafka T, Roquain E, Villers F (2019) Graph inference with clustering and false discovery rate control
Robin S, Bar-Hen A, Daudin J-J, Pierre L (2007) A semi-parametric approach for mixture models: application to local false discovery rate estimation. Comput Stat Data Anal 51:5483–5493
Roquain E, Verzelen N (2020) False discovery rate control with unknown null distribution: is it possible to mimic the oracle?
Scheffé H (1959) The analysis of variance. Chapman & Hall Ltd, London
Schwartzman A (2010) Comment: correlated \(z\)-values and the accuracy of large-scale statistical estimates. J Am Stat Assoc 105:1059–1063. https://doi.org/10.1198/jasa.2010.tm10237
Senn S (2008) A note concerning a selection “paradox” of Dawid’s. Am Stat 62:206–210
Shah SP, Cheung K-J Jr, Johnson NA, Alain G, Gascoyne RD, Horsman DE, Ng RT, Murphy KP (2009) Model-based clustering of array CGH data. Bioinformatics 25:i30–i38
Stephens M (2017) False discovery rates: a new deal. Biostatistics 18:275–294
Sun W, Cai TT (2007) Oracle and adaptive compound decision rules for false discovery rate control. J Am Stat Assoc 102:901–912. https://doi.org/10.1198/016214507000000545
Sun W, Cai TT (2009) Large-scale multiple testing under dependence. J R Stat Soc Ser B (Stat Methodol) 71:393–424
Sun L, Stephens M (2018) Solving the empirical Bayes normal means problem with correlated noise
Sun Y, Zhang NR, Owen AB (2012) Multiple hypothesis testing adjusted for latent variables, with an application to the agemap gene expression data. Ann Appl Stat 6:1664–1688
Tibshirani RJ, Rinaldo A, Tibshirani R, Wasserman L (2018) Uniform asymptotic inference and the bootstrap after model selection. Ann Stat 46:1255–1287
Weinstein A, Ramdas A (2019) Online control of the false coverage rate and false sign rate
Yekutieli D (2012) Adjusted Bayesian inference for selected parameters. J R Stat Soc Ser B (Stat Methodol) 74:515–541
Zhang NR (2010) DNA copy number profiling in normal and tumor genomes. In: Feng J, Fu W, Sun F (eds) Frontiers in computational and systems biology. Springer, Berlin, pp 259–281. https://doi.org/10.1007/978-1-84996-196-7_14
Acknowledgements
The authors would like to thank an associate editor and the two referees, whose insightful comments led to considerable improvements to this paper. This work has been supported by ANR-16-CE40-0019 (SansSouci), ANR-17-CE40-0001 (BASICS), ANR-19-CHIA-0021-01 (BiSCottE), ANR-21-CE23-0035 (ASCAI), the UPSaclay Excellency Chair REC-2019-044, the DFG CRC 1294 - 318763901 “Data Assimilation”, and by the GDR ISIS through the “projets exploratoires” program (project TASTY).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Perrot-Dockès, M., Blanchard, G., Neuvial, P. et al. Selective inference for false discovery proportion in a hidden Markov model. TEST 32, 1365–1391 (2023). https://doi.org/10.1007/s11749-023-00886-7
DOI: https://doi.org/10.1007/s11749-023-00886-7