Abstract
In recent years, the volume of real-world data of all kinds has grown explosively. To manage these data, the dataspace has become a universal platform that holds unstructured, semi-structured, and structured data. However, clustering the data in a dataspace efficiently and accurately, so that users can manage and explore them, remains a challenging problem. Previous work has not sufficiently considered the uncertain relationship between terms and topics. Among the many techniques for handling this problem, probability theory provides an effective way to deal with the uncertainty of clustering. We therefore propose a novel probability model based on topic terms, the Probabilistic Term Similarity Model (PTSM), to address the uncertainty between terms and topics. The model considers not only the terms drawn from the various kinds of data but also the structure information of semi-structured and structured data. Each term is assigned a probability indicating how relevant it is to the topic. From these per-term probabilities, a probabilistic matrix is built for clustering the data. Finally, extensive experimental results show that the clustering method based on this probabilistic model performs well and outperforms several classical algorithms.
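The abstract does not give the PTSM equations, so the following is only a minimal sketch of the general pipeline it describes: assign each term a probability of relevance to each topic, build a probabilistic matrix from those values, and cluster with it. Everything specific here is an assumption made for illustration and not the authors' method: the seed-term sets, the co-occurrence estimate of P(topic | term), the smoothing constant `alpha`, the `field:` prefix used to stand in for structure information, and the arg-max cluster assignment.

```python
from collections import defaultdict


def term_topic_probabilities(docs, topic_seeds, alpha=1.0):
    """Estimate P(topic | term) from co-occurrence with seed terms (assumed estimator)."""
    topics = sorted(topic_seeds)
    counts = defaultdict(lambda: [0.0] * len(topics))
    for doc in docs:
        terms = set(doc)
        for k, topic in enumerate(topics):
            # A document "votes" for a topic if it contains any of its seed terms.
            if terms & topic_seeds[topic]:
                for term in terms:
                    counts[term][k] += 1.0
    probs = {}
    for term in {t for doc in docs for t in doc}:
        row = [c + alpha for c in counts[term]]  # additive smoothing
        total = sum(row)
        probs[term] = [v / total for v in row]
    return topics, probs


def probabilistic_matrix(docs, probs, n_topics):
    """One row per document: the mean of P(topic | term) over the document's terms."""
    matrix = []
    for doc in docs:
        row = [0.0] * n_topics
        for term in doc:
            for k, p in enumerate(probs[term]):
                row[k] += p
        n = len(doc) or 1
        matrix.append([v / n for v in row])
    return matrix


def cluster(matrix):
    """Assign each document to its most probable topic (hypothetical final step)."""
    return [max(range(len(row)), key=row.__getitem__) for row in matrix]


# Toy data: text terms plus "field:" tokens standing in for structure information.
docs = [
    ["database", "query", "index", "field:title"],
    ["query", "index", "sql"],
    ["image", "pixel", "camera"],
    ["pixel", "camera", "field:exif"],
]
topic_seeds = {"data": {"database", "sql"}, "vision": {"image", "camera"}}

topics, probs = term_topic_probabilities(docs, topic_seeds)
matrix = probabilistic_matrix(docs, probs, len(topics))
labels = cluster(matrix)
print(topics, labels)  # ['data', 'vision'] [0, 0, 1, 1]
```

Because each term's probabilities sum to one, every row of the matrix is itself a distribution over topics. That makes the rows usable as soft cluster memberships instead of the hard arg-max labels shown here.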
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Yu, Y., Zhu, X., Li, M., Wang, G., Luo, D. (2013). A Probabilistic Model Based on Uncertainty for Data Clustering. In: Cao, L., Zeng, Y., Symeonidis, A.L., Gorodetsky, V.I., Yu, P.S., Singh, M.P. (eds) Agents and Data Mining Interaction. ADMI 2012. Lecture Notes in Computer Science, vol 7607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36288-0_12
DOI: https://doi.org/10.1007/978-3-642-36288-0_12
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-36287-3
Online ISBN: 978-3-642-36288-0
eBook Packages: Computer Science (R0)