A New Information Theory Based Clustering Fusion Method for Multi-view Representations of Text Documents

Zamora, Juan; Sublime, Jérémie

doi:10.1007/978-3-030-49570-1_11

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12194)

Included in the following conference series:

International Conference on Human-Computer Interaction

Abstract

Multi-view clustering is a complex problem that consists of extracting partitions from multiple representations of the same objects. In text mining and natural language processing, such views may take the form of word frequencies, topic-based representations, and many other encodings produced by various vector space model algorithms. In this paper, we propose a clustering fusion algorithm that takes the clustering results obtained from multiple vector space models of a given set of documents and merges them into a single partition. Our fusion method relies on an information-theoretic model based on Kolmogorov complexity that was previously used for collaborative clustering applications. We apply our algorithm to several text corpora frequently used in the literature, with results that we find very satisfying.
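To make the fusion setting concrete, the sketch below merges label vectors obtained from several views into one partition. It is not the Kolmogorov-complexity-based method of this paper, whose details are not given in the abstract. It is a much simpler stand-in, a majority vote over pairwise co-assignments, chosen only to illustrate the input and output of a clustering fusion step. The function names, the threshold, and the toy label vectors are all assumptions made for illustration.

```python
from itertools import combinations


def co_association(partitions):
    """Fraction of views in which each pair of documents shares a label."""
    n = len(partitions[0])
    scores = {}
    for i, j in combinations(range(n), 2):
        agree = sum(1 for labels in partitions if labels[i] == labels[j])
        scores[(i, j)] = agree / len(partitions)
    return scores


def fuse(partitions, threshold=0.5):
    """Group documents that are co-clustered in more than `threshold` of the views.

    Illustrative consensus rule (an assumption, not the paper's algorithm):
    link every pair whose agreement exceeds the threshold, then return the
    connected components as the fused partition.
    """
    n = len(partitions[0])
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), score in co_association(partitions).items():
        if score > threshold:
            parent[find(j)] = find(i)

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Hypothetical cluster labels for six documents under three views
# (e.g. word frequencies, a topic model, word embeddings). Label values
# are arbitrary per view; only co-membership matters.
view_tf = [0, 0, 0, 1, 1, 1]
view_topics = [1, 1, 0, 0, 0, 0]
view_embed = [2, 2, 2, 0, 0, 0]

consensus = fuse([view_tf, view_topics, view_embed])
print(consensus)  # → [0, 0, 0, 1, 1, 1]
```

Here the third document is grouped differently by the topic view, but two of the three views place it with the first two documents, so the vote keeps it there. Comparing pairs rather than raw labels sidesteps the fact that cluster identifiers are not aligned across views.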

Notes

1. For the clustering task, the relation could be stated as “has the same label as”.

Author information

Authors and Affiliations

Instituto de Estadística, Pontificia Universidad Católica de Valparaíso, 2340025, Valparaíso, Chile
Juan Zamora
ISEP, Lisite Laboratory – DaSSIP Team, 10 rue de Vanves, 92130, Issy-Les-Moulineaux, France
Jérémie Sublime
LIPN - CNRS UMR 7030, University Paris 13, 99 Avenue J.-B. Clément, 93430, Villetaneuse, France
Jérémie Sublime

Corresponding author

Correspondence to Juan Zamora.

Editor information

Editors and Affiliations

Towson University, Towson, MD, USA
Gabriele Meiselwitz

About this paper

Cite this paper

Zamora, J., Sublime, J. (2020). A New Information Theory Based Clustering Fusion Method for Multi-view Representations of Text Documents. In: Meiselwitz, G. (ed.) Social Computing and Social Media. Design, Ethics, User Behavior, and Social Network Analysis. HCII 2020. Lecture Notes in Computer Science, vol 12194. Springer, Cham. https://doi.org/10.1007/978-3-030-49570-1_11

DOI: https://doi.org/10.1007/978-3-030-49570-1_11
Published: 10 July 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-49569-5
Online ISBN: 978-3-030-49570-1
eBook Packages: Computer Science, Computer Science (R0)
