Abstract
Multi-view clustering is a complex problem that consists in extracting partitions from multiple representations of the same objects. In text mining and natural language processing, such views may come in the form of word frequencies, topic based representations and many other possible encoding forms coming from various vector space model algorithms. From there, in this paper we propose a clustering fusion algorithm that takes clustering results acquired from multiple vector space models of given documents, and merges them into a single partition. Our fusion method relies on an information theory model based on Kolmogorov complexity that was previously used for collaborative clustering applications. We apply our algorithm to different text corpuses frequently used in the literature with results that we find to be very satisfying.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
For the clustering task, the relation could be stated as “has the same label as”.
References
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3(Jan), 993–1022 (2003)
Cornuéjols, A., Wemmert, C., Gançarski, P., Bennani, Y.: Collaborative clustering: why, when, what and how. Inf. Fusion 39, 81–95 (2018)
Fraj, M., HajKacem, M.A.B., Essoussi, N.: Ensemble method for multi-view text clustering. In: Computational Collective Intelligence - 11th International Conference, ICCCI 2019, Hendaye, France, 4–6 September 2019, Proceedings, Part I, pp. 219–231 (2019). https://doi.org/10.1007/978-3-030-28377-3_18
Fred, A.L., Jain, A.K.: Combining multiple clusterings using evidence accumulation. IEEE Trans. Pattern Anal. Mach. Intell. 27(6), 835–850 (2005)
Ghosh, J., Acharya, A.: Cluster ensembles. Wiley Interdisc. Rev.: Data Min. Knowl. Discov. 1(4), 305–315 (2011)
Greene, D., Cunningham, P.: A matrix factorization approach for integrating multiple data views. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) ECML PKDD 2009. LNCS (LNAI), vol. 5781, pp. 423–438. Springer, Heidelberg (2009). https://doi.org/10.1007/978-3-642-04180-8_45
Hussain, S.F., Mushtaq, M., Halim, Z.: Multi-view document clustering via ensemble method. J. Intell. Inf. Syst. 43(1), 81–99 (2014). https://doi.org/10.1007/s10844-014-0307-6
Janssens, F., Glänzel, W., De Moor, B.: Dynamic hybrid clustering of bioinformatics by incorporating text mining and citation analysis. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 360–369. ACM (2007)
Li, T., Ogihara, M., Ma, S.: On combining multiple clusterings. In: Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, pp. 294–303. ACM (2004)
Liu, H., Zhao, R., Fang, H., Cheng, F., Fu, Y., Liu, Y.Y.: Entropy-based consensus clustering for patient stratification. Bioinformatics 33(17), 2691–2698 (2017)
Liu, X., Glänzel, W., De Moor, B.: Hybrid clustering of multi-view data via Tucker-2 model and its application. Scientometrics 88(3), 819–839 (2011). https://doi.org/10.1007/s11192-011-0348-3
Liu, X., Ji, S., Glänzel, W., De Moor, B.: Multiview partitioning via tensor methods. IEEE Trans. Knowl. Data Eng. 25(5), 1056–1069 (2012)
Liu, X., Yu, S., Moreau, Y., De Moor, B., Glänzel, W., Janssens, F.: Hybrid clustering of text mining and bibliometrics applied to journal sets. In: Proceedings of the 2009 SIAM International Conference on Data Mining, pp. 49–60. SIAM (2009)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, pp. 3111–3119 (2013)
Murena, P., Sublime, J., Matei, B., Cornuéjols, A.: An information theory based approach to multisource clustering. In: IJCAI, pp. 2581–2587. ijcai.org (2018)
Rashidi, F., Nejatian, S., Parvin, H., Rezaie, V.: Diversity based cluster weighting in cluster ensemble: an information theory approach. Artif. Intell. Rev. 52, 1341–1368 (2019)
Rissanen, J.: Modeling by shortest data description. Automatica 14(5), 465–471 (1978)
Romeo, S., Tagarelli, A., Ienco, D.: Semantic-based multilingual document clustering via tensor modeling (2014)
Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Inf. Process. Manag. 24(5), 513–523 (1988)
Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Commun. ACM 18(11), 613–620 (1975). The paper where vector space model for IR was introduced
Strehl, A., Ghosh, J.: Cluster ensembles–a knowledge reuse framework for combining multiple partitions. J. Mach. Learn. Res. 3(Dec), 583–617 (2002)
Topchy, A., Jain, A.K., Punch, W.: Clustering ensembles: models of consensus and weak partitions. IEEE Trans. Pattern Anal. Mach. Intell. 27(12), 1866–1881 (2005)
Wallace, C.S., Boulton, D.M.: An information measure for classification. Comput. J. 11(2), 185–194 (1968). https://doi.org/10.1093/comjnl/11.2.185
Wu, J., Liu, H., Xiong, H., Cao, J., Chen, J.: K-means-based consensus clustering: a unified view. IEEE Trans. Knowl. Data Eng. 27(1), 155–169 (2014)
Xie, X., Sun, S.: Multi-view clustering ensembles. In: International Conference on Machine Learning and Cybernetics, ICMLC 2013, Tianjin, China, 14–17 July 2013, pp. 51–56 (2013). https://doi.org/10.1109/ICMLC.2013.6890443
Yi, J., Yang, T., Jin, R., Jain, A.K., Mahdavi, M.: Robust ensemble clustering by matrix completion. In: 2012 IEEE 12th International Conference on Data Mining, pp. 1176–1181. IEEE (2012)
Yu, S., Moor, B., Moreau, Y.: Clustering by heterogeneous data fusion: framework and applications. In: NIPS Workshop (2009)
Zamora, J., Allende-Cid, H., Mendoza, M.: Distributed clustering of text collections. IEEE Access 7, 155671–155685 (2019)
Zhao, Y., Karypis, G.: Criterion functions for document clustering: experiments and analysis. Department of Computer Science, University of Minnesota, Technical Report TR 01-40 (2001)
Zimek, A., Vreeken, J.: The blind men and the elephant: on meeting the problem of multiple truths in data from clustering and pattern mining perspectives. Mach. Learn. 98(1–2), 121–155 (2015). https://doi.org/10.1007/s10994-013-5334-y
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zamora, J., Sublime, J. (2020). A New Information Theory Based Clustering Fusion Method for Multi-view Representations of Text Documents. In: Meiselwitz, G. (eds) Social Computing and Social Media. Design, Ethics, User Behavior, and Social Network Analysis. HCII 2020. Lecture Notes in Computer Science(), vol 12194. Springer, Cham. https://doi.org/10.1007/978-3-030-49570-1_11
Download citation
DOI: https://doi.org/10.1007/978-3-030-49570-1_11
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49569-5
Online ISBN: 978-3-030-49570-1
eBook Packages: Computer ScienceComputer Science (R0)