Self-semi-supervised clustering for large scale data with massive null group

  • Research Article
Journal of the Korean Statistical Society

Abstract

In this paper, we propose self-semi-supervised clustering, a new clustering method for large-scale data with a massive null group. Self-semi-supervised clustering is a two-stage procedure: the first stage preselects a part of the “null” group from the data, and the second stage applies semi-supervised clustering to the rest of the data, while still allowing those observations to be assigned to the null group. We evaluate the performance of the proposed method in a simulation study and demonstrate it in the analysis of time-course gene expression data from a longitudinal study of influenza A virus infection.
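
The two-stage idea can be illustrated with a short Python sketch. The abstract does not specify the screening rule or the mixture model, so everything below is an assumption made for illustration: the function names `preselect_null` and `semi_supervised_gmm`, the variance-based screening with fraction `frac`, the farthest-point initialisation, and the spherical Gaussian mixture fitted by EM are hypothetical stand-ins, not the authors' procedure.

```python
import numpy as np


def preselect_null(X, frac=0.5):
    """Stage 1 (assumed rule): flag the rows whose variance across columns
    is at or below the `frac` quantile as preselected null."""
    score = X.var(axis=1)
    cutoff = np.quantile(score, frac)
    return score <= cutoff


def semi_supervised_gmm(X, is_null, n_clusters=2, n_iter=50):
    """Stage 2 (assumed model): spherical Gaussian mixture fitted by EM.

    Component 0 is the null group. Preselected rows are pinned to it;
    every other row may join any component, including the null.
    """
    n, d = X.shape
    k = n_clusters + 1
    free = np.flatnonzero(~is_null)

    # Initialise the null mean from the preselected rows, then pick each
    # further mean as the free row farthest from all means chosen so far.
    means = [X[is_null].mean(axis=0)]
    for _ in range(n_clusters):
        dist = np.min([((X[free] - m) ** 2).sum(axis=1) for m in means], axis=0)
        means.append(X[free[dist.argmax()]])
    means = np.vstack(means)
    var = np.full(k, X.var() + 1e-6)
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: posterior membership probabilities under each component.
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        logp = np.log(weights) - 0.5 * d * np.log(2 * np.pi * var) - 0.5 * sq / var
        logp -= logp.max(axis=1, keepdims=True)
        resp = np.exp(logp)
        resp /= resp.sum(axis=1, keepdims=True)

        # Supervision: preselected rows belong to the null with probability 1.
        resp[is_null] = 0.0
        resp[is_null, 0] = 1.0

        # M-step: update mixing weights, means, and per-component variances.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        var = (resp * sq).sum(axis=0) / (nk * d) + 1e-6

    # Hard labels: 0 is the null group, 1..n_clusters are the other clusters.
    return resp.argmax(axis=1)
```

Calling `preselect_null(X)` on a rows-by-time-points matrix and passing the resulting mask to `semi_supervised_gmm(X, mask)` returns one label per row, with label 0 reserved for the null group. Pinning the preselected rows in the E-step is what makes the second stage semi-supervised in this sketch: those rows act as labelled examples that anchor the null component, while the unlabelled rows remain free to join it.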

Acknowledgements

This work was supported by the new faculty research fund of Ajou University and the National Research Foundation of Korea (Grant nos. 2012R1A1A3013075 and 2017R1A2B2012264).

Corresponding author

Correspondence to Johan Lim.

Cite this article

Ahn, S., Choi, H., Lim, J. et al. Self-semi-supervised clustering for large scale data with massive null group. J. Korean Stat. Soc. 49, 161–176 (2020). https://doi.org/10.1007/s42952-019-00005-z
