Self-semi-supervised clustering for large scale data with massive null group

Ahn, Soohyun; Choi, Hyungwon; Lim, Johan; Lee, Kyeong Eun

doi:10.1007/s42952-019-00005-z

Research Article
Published: 01 January 2020

Volume 49, pages 161–176, (2020)
Journal of the Korean Statistical Society

Abstract

In this paper, we propose self-semi-supervised clustering, a new clustering method for large scale data with a massive null group. Self-semi-supervised clustering is a two-stage procedure: the first stage preselects part of the “null” group from the data, and the second stage applies semi-supervised clustering to the remaining data, which may still be assigned to the null group. We evaluate the performance of the proposed method in a simulation study and demonstrate it on time course gene expression data from a longitudinal study of influenza A virus infection.
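The two-stage idea can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not say how the null group is preselected or which semi-supervised clustering model is used. The sketch assumes, purely for illustration, that the flattest profiles (smallest variance) are preselected as null and that the second stage is a k-means-style nearest-center assignment in which the preselected items keep their null label. The function names, the `fraction` parameter, and the farthest-point initialization are all hypothetical.

```python
from statistics import pvariance


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length profiles."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_profile(rows):
    """Coordinate-wise mean of a non-empty list of profiles."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def preselect_null(profiles, fraction=0.5):
    """Stage 1 (assumed rule): label the lowest-variance profiles as null."""
    order = sorted(range(len(profiles)), key=lambda i: pvariance(profiles[i]))
    n_null = max(1, int(fraction * len(profiles)))
    return set(order[:n_null])


def self_semi_supervised(profiles, k, fraction=0.5, n_iter=20):
    """Stage 2 (assumed rule): nearest-center clustering with a fixed null set.

    Cluster 0 is the null group. Preselected items always stay in cluster 0;
    every other item may join cluster 0 or one of the k non-null clusters.
    """
    null_idx = preselect_null(profiles, fraction)
    free_idx = [i for i in range(len(profiles)) if i not in null_idx]

    # The null center starts from the preselected items only.
    centers = [mean_profile([profiles[i] for i in sorted(null_idx)])]
    # Deterministic farthest-point initialization of the k non-null centers.
    for _ in range(k):
        far = max(free_idx,
                  key=lambda i: min(sq_dist(profiles[i], c) for c in centers))
        centers.append(list(profiles[far]))

    labels = [0 if i in null_idx else None for i in range(len(profiles))]
    for _ in range(n_iter):
        # Only the non-preselected items are reassigned.
        for i in free_idx:
            labels[i] = min(range(k + 1),
                            key=lambda c: sq_dist(profiles[i], centers[c]))
        # Recompute each center from its current members (null included).
        for c in range(k + 1):
            members = [profiles[i] for i, lab in enumerate(labels) if lab == c]
            if members:
                centers[c] = mean_profile(members)
    return labels
```

Calling `self_semi_supervised(profiles, k=2)` on a list of equal-length numeric profiles returns one label per profile, with label 0 reserved for the null group. A flat profile that was not preselected in the first stage can still end up with label 0 in the second stage, which is the behaviour the abstract describes.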


Acknowledgements

This work was supported by the new faculty research fund of Ajou University and the National Research Foundation of Korea (Grant nos. 2012R1A1A3013075 and 2017R1A2B2012264).

Author information

Authors and Affiliations

Department of Mathematics, Ajou University, Suwon, Korea
Soohyun Ahn
Department of Medicine, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
Hyungwon Choi
Department of Statistics, Seoul National University, Seoul, Korea
Johan Lim
Department of Statistics, Kyungpook National University, Daegu, Korea
Kyeong Eun Lee

Corresponding author

Correspondence to Johan Lim.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ahn, S., Choi, H., Lim, J. et al. Self-semi-supervised clustering for large scale data with massive null group. J. Korean Stat. Soc. 49, 161–176 (2020). https://doi.org/10.1007/s42952-019-00005-z

Received: 10 October 2018
Accepted: 19 June 2019
Published: 01 January 2020
Issue Date: March 2020
DOI: https://doi.org/10.1007/s42952-019-00005-z
