A goodness-of-fit test on the number of biclusters in a relational data matrix

Annals of the Institute of Statistical Mathematics

Abstract

Biclustering is a method for detecting homogeneous submatrices in a given matrix. Although many studies estimate the underlying bicluster structure of a matrix, few provide a way to determine the appropriate number of biclusters. Recently, a statistical test on the number of biclusters was proposed for a regular-grid bicluster structure. However, when the latent bicluster structure does not satisfy this regular-grid assumption, that test requires a larger number of biclusters than necessary before the null hypothesis is accepted, which makes the accepted structure harder to interpret. In this study, we propose a new statistical test on the number of biclusters that does not require the regular-grid assumption, and we derive the asymptotic behavior of the proposed test statistic under both the null and alternative hypotheses. We illustrate the effectiveness of the proposed method by applying it to both synthetic and real-world data matrices.
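
To make the regular-grid issue concrete, the following minimal sketch counts how many blocks a regular grid may need to represent a structure that is not grid-aligned. It is an illustration only and does not implement the proposed test. The 6 x 6 label matrix, the placement of the two biclusters, and the choice to count the background as one region are all assumptions made for this example.

```python
import numpy as np

# Hypothetical 6 x 6 label matrix (an assumption for illustration):
# 0 marks the background, 1 and 2 mark two rectangular biclusters
# whose row and column ranges are not aligned on a common grid.
labels = np.zeros((6, 6), dtype=int)
labels[0:3, 0:2] = 1  # bicluster 1: rows 0-2, columns 0-1
labels[2:5, 3:6] = 2  # bicluster 2: rows 2-4, columns 3-5

# Number of regions when the structure is described directly
# (background counted as one region, by assumption).
n_direct = len(np.unique(labels))

# In a regular grid, two rows can share a row cluster only if they carry
# the same label in every column, and likewise for columns. The coarsest
# such grid therefore has one row cluster per distinct row pattern and
# one column cluster per distinct column pattern.
n_row_clusters = len(np.unique(labels, axis=0))
n_col_clusters = len(np.unique(labels.T, axis=0))
n_grid = n_row_clusters * n_col_clusters

print(n_direct, n_row_clusters, n_col_clusters, n_grid)
```

In this toy case the direct description uses 3 regions, whereas the coarsest compatible regular grid has 4 row clusters and 3 column clusters, that is, 12 blocks. This is the kind of inflation in the number of biclusters that the abstract attributes to the regular-grid assumption.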

Notes

  1. Failing to reject a null hypothesis in a statistical test does not amount to positively accepting it, and the same holds when selecting the number of biclusters with the proposed test. For simplicity, however, we use the term “accepted” to mean “not rejected” throughout this paper.

Acknowledgements

We would like to thank Editage (www.editage.com) for English language editing.

Funding

This work was supported by JSPS KAKENHI (18H03201 and 20H00576), Fujitsu Laboratories Ltd., and JST CREST.

Author information

Corresponding author

Correspondence to Chihiro Watanabe.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 290KB)

About this article

Cite this article

Watanabe, C., Suzuki, T. A goodness-of-fit test on the number of biclusters in a relational data matrix. Ann Inst Stat Math 75, 979–1009 (2023). https://doi.org/10.1007/s10463-023-00869-3

  • DOI: https://doi.org/10.1007/s10463-023-00869-3
