A Probabilistic Model Based on Uncertainty for Data Clustering

Yu, Yaxin; Zhu, Xinhua; Li, Miao; Wang, Guoren; Luo, Dan

doi:10.1007/978-3-642-36288-0_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 7607)

Included in the following conference series:

International Workshop on Agents and Data Mining Interaction

Abstract

Recently, the volume of real-world data of all kinds has grown explosively. To manage these data, the dataspace has become a universal platform that holds unstructured, semi-structured, and structured data. However, clustering the data in a dataspace efficiently and accurately, so that users can manage and explore them, remains a hard problem. Previous work has not sufficiently considered the uncertain relationship between terms and topics. Among the many techniques available for this problem, probability theory provides an effective way to handle the uncertainty in clustering. We therefore propose a novel probability model based on topic terms, the Probabilistic Term Similarity Model (PTSM), to address the uncertainty between terms and topics. The model considers not only terms drawn from the various kinds of data but also the structure information of semi-structured and structured data. Each term is assigned a probability indicating how relevant it is to the topic. Based on these per-term probabilities, a probabilistic matrix is then built for clustering the various data. Finally, extensive experimental results show that the clustering method based on this probabilistic model performs well and outperforms several classical algorithms.
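
The abstract does not say how PTSM computes its term probabilities or which clustering procedure it applies to the matrix, so the following is only a minimal sketch of the general pipeline it describes (terms, then per-term probabilities, then a probabilistic matrix, then clustering). It assumes, purely for illustration, that a term's probability is its relative frequency within a data item and that items are grouped by a greedy cosine-similarity threshold. The names `probabilistic_matrix`, `cosine`, `cluster`, and the `threshold` parameter are hypothetical and do not come from the paper.

```python
from collections import Counter
import math


def probabilistic_matrix(docs):
    """Build a document-by-term matrix whose rows sum to 1.

    ASSUMPTION: relative term frequency stands in for the paper's
    (unspecified) term-to-topic relevance probability.
    """
    vocab = sorted({term for doc in docs for term in doc})
    rows = []
    for doc in docs:
        counts = Counter(doc)
        total = sum(counts.values())
        rows.append([counts[term] / total for term in vocab])
    return vocab, rows


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster(rows, threshold=0.3):
    """Greedy single-pass clustering (an illustrative stand-in, not PTSM).

    Each row joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster is a list of row indices
    for i, row in enumerate(rows):
        for members in clusters:
            if cosine(row, rows[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy data items, already tokenized into terms.
docs = [
    ["database", "query", "index", "query"],
    ["query", "index", "database"],
    ["cluster", "topic", "term", "topic"],
    ["topic", "term", "probability"],
]
vocab, rows = probabilistic_matrix(docs)
clusters = cluster(rows)
print(clusters)
```

On this toy input the first two items share only database-related terms and the last two share only topic-related terms, so the sketch separates them into two groups. A faithful implementation would replace both the probability estimate and the clustering step with the definitions given in the full paper.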

Author information

Authors and Affiliations

College of Information Science and Engineering, Northeastern University, China
Yaxin Yu, Miao Li & Guoren Wang
QCIS, University of Technology, Sydney, Australia
Xinhua Zhu & Dan Luo

Editor information

Editors and Affiliations

University of Technology, Chippendale, 2007, Sydney, NSW, Australia
Longbing Cao
Teesside University, TS1 3BA, Tees Valley, UK
Yifeng Zeng
Electrical & Computer Engineering Department, Aristotle University of Thessaloniki, 57001, Thessaloniki, Greece
Andreas L. Symeonidis
St. Petersburg Institute for Informatics and Automation, Russian Academy of Sciences, 39, 14th Liniya, 199178, St. Petersburg, Russia
Vladimir I. Gorodetsky
Department of Computer Science, University of Illinois at Chicago, 851 S. Morgan St., Rm 1138 SEO, Chicago, 60607, IL, USA
Philip S. Yu
Department of Computer Science, North Carolina State University, 27695-8206, Raleigh, NC, USA
Munindar P Singh

About this paper

Cite this paper

Yu, Y., Zhu, X., Li, M., Wang, G., Luo, D. (2013). A Probabilistic Model Based on Uncertainty for Data Clustering. In: Cao, L., Zeng, Y., Symeonidis, A.L., Gorodetsky, V.I., Yu, P.S., Singh, M.P. (eds) Agents and Data Mining Interaction. ADMI 2012. Lecture Notes in Computer Science, vol 7607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36288-0_12

DOI: https://doi.org/10.1007/978-3-642-36288-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36287-3
Online ISBN: 978-3-642-36288-0
eBook Packages: Computer Science, Computer Science (R0)
