Abstract
Estimating the dimension of a model along with its parameters is fundamental to many statistical learning problems. Traditional model selection methods often approach this task with a two-step procedure: first estimate the model parameters under every candidate model dimension, then select the best model dimension according to an information criterion. When the number of candidate models is large, however, this two-step procedure is highly inefficient and not scalable. We develop a novel automated and scalable approach with theoretical guarantees, called mixed-binary simultaneous perturbation stochastic approximation (MB-SPSA), to simultaneously estimate the dimension and parameters of a statistical model. To demonstrate the broad applicability of the MB-SPSA algorithm, we apply MB-SPSA to various classic statistical models, including K-means clustering, Gaussian mixture models with an unknown number of components, sparse linear regression, and latent factor models with an unknown number of factors. We evaluate the performance of MB-SPSA through simulation studies and an application to a single-cell sequencing dataset in terms of accuracy, running time, and scalability. The code implementing the MB-SPSA is available at http://github.com/wanglong24/MB-SPSA.
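At its core, MB-SPSA builds on the simultaneous perturbation gradient estimate of standard SPSA (Spall, 1992), which approximates a gradient from only two loss evaluations per iteration regardless of dimension. The following is a minimal sketch of that continuous SPSA core only; the mixed-binary perturbation scheme for the dimension variables is not shown, and the function names, gain sequences, and toy quadratic loss are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def spsa_step(theta, loss, a_k, c_k, rng):
    """One standard SPSA iteration: estimate the gradient from two
    loss measurements under a random simultaneous perturbation."""
    # Rademacher (+/-1) perturbation of every coordinate at once
    delta = rng.choice([-1.0, 1.0], size=theta.shape)
    # Two-sided measurements, L_k^(+) and L_k^(-) in the paper's notation
    l_plus = loss(theta + c_k * delta)
    l_minus = loss(theta - c_k * delta)
    # Simultaneous perturbation gradient estimate
    g_hat = (l_plus - l_minus) / (2.0 * c_k * delta)
    return theta - a_k * g_hat

# Usage: minimize a toy quadratic ||theta - (1, 2)||^2
rng = np.random.default_rng(0)
target = np.array([1.0, 2.0])
loss = lambda t: float(np.sum((t - target) ** 2))
theta = np.array([5.0, -3.0])
for k in range(5000):
    a_k = 0.2 / (k + 10) ** 0.602  # standard SPSA gain-decay exponents
    c_k = 0.1 / (k + 1) ** 0.101
    theta = spsa_step(theta, loss, a_k, c_k, rng)
print(theta)  # close to the minimizer (1, 2)
```

Only two loss evaluations are needed per iteration, which is what makes the approach attractive when each evaluation requires a pass over a large dataset.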
Acknowledgements
The work of Xu was supported by NSF 1918854 and NSF 1940107.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Appendix: Proof of Theorem 1
Proof
Denote \( L_k^{(+)} = L({\hat{{\varvec{\theta }}}}_k^{(+)}) \) and \( L_k^{(-)} = L({\hat{{\varvec{\theta }}}}_k^{(-)}) \). Before starting the main proof, we first define some useful notation below
where the expectation in (12) is taken over both the perturbation vector \( {\varvec{\Delta }}_k \) and the noise term \( \epsilon _k \). Using (12), (13) and (14), we can write the updating equation as
Let \( \Omega _0 \subset \Omega \) be such that \( P(\Omega _0) = 1 \). For any \( \omega \in \Omega _0 \), since \( \{{\hat{{\varvec{\theta }}}}_k(\omega )\} \) is a bounded sequence by Assumption 2, the Bolzano–Weierstrass Theorem implies that there exists \( \Omega _1 \subset \Omega \) with \( P(\Omega _1) = 1 \) such that for any \( \omega \in \Omega _1 \) there exists a convergent subsequence \( \{{\hat{{\varvec{\theta }}}}_{k_s}(\omega )\} \). Denote the limit point of this convergent subsequence by \( {\varvec{\theta }}'(\omega ) \). For simplicity, the argument \( \omega \) is suppressed below.
According to (15), we can write
Since \( {\varvec{\theta }}' - {\hat{{\varvec{\theta }}}}_{k_s} \rightarrow \varvec{0} \) as \( s \rightarrow \infty \), we show below that all three terms on the right-hand side of (16) must also converge to \( \varvec{0} \).
First note that by Assumption 3 and (13), we have
which implies that \( \{\sum _{i=k}^{m} a_i{\varvec{b}}_i\}_{m\ge k} \) is a martingale sequence as
Given that \( \{\sum _{i=k}^{m} a_i{\varvec{b}}_i\}_{m\ge k} \) is a martingale sequence, Doob’s martingale inequality implies that for any \( \eta > 0 \)
where the last equality is due to Assumption 3, since
By Assumption 1, we have \( b_k > 0 \) and \( c_k \le c_0 \). Hence, there exists a constant \( {\bar{c}} \) such that we can write (17) as
which further implies that
For any \( \eta > 0 \) and all \( k \ge n \), since
we can use (17) to get
As \( n \rightarrow \infty \), for all \( k \ge n \),
and
Therefore, we conclude
and
Similarly, we can also show that
Combining (16) with results in (18) and (19), we have
Suppose \( {\varvec{\theta }}' \ne {\varvec{\theta }}^* \). Given \( \lim _{s\rightarrow \infty } {\hat{{\varvec{\theta }}}}_{k_s} = {\varvec{\theta }}' \), for any \( \delta > 0 \) there exists an \( S \) such that for any \( s > S \), \( \Vert {\hat{{\varvec{\theta }}}}_{k_s} - {\varvec{\theta }}' \Vert \le \delta \). Taking \( \delta \) sufficiently small, we have \( {\hat{{\varvec{\theta }}}}_{k_s} \in B_r({\varvec{\theta }}') \). By Assumptions 1 and 6, we must have \( \sum _{i=s}^\infty a_{k_i} = \infty \), which implies
which contradicts (20). Hence, we conclude that \( {\varvec{\theta }}' = {\varvec{\theta }}^* \). Since \( {\varvec{\theta }}' \) was chosen as the limit point of an arbitrary convergent subsequence, every convergent subsequence converges to the same limit point, and consequently \( {\hat{{\varvec{\theta }}}}_k \rightarrow {\varvec{\theta }}^* \) a.s. as \( k \rightarrow \infty \). \(\square \)
Cite this article
Wang, L., Xie, F. & Xu, Y. Simultaneous Learning the Dimension and Parameter of a Statistical Model with Big Data. Stat Biosci 15, 583–607 (2023). https://doi.org/10.1007/s12561-021-09324-4