Journal of Computer Science and Technology, Volume 23, Issue 1, pp 112–128

Clustering Text Data Streams

  • Yu-Bao Liu
  • Jia-Rong Cai
  • Jian Yin
  • Ada Wai-Chee Fu
Regular Paper

Abstract

Clustering text data streams is an important problem in the data mining community, with applications such as newsgroup filtering, text crawling, document organization, and topic detection and tracking. However, most existing methods are similarity-based, represent the semantics of text data only with the TF*IDF scheme, and often yield poor clustering quality. Recently, researchers have argued that the semantic smoothing model is more effective than the TF*IDF scheme at improving text clustering quality. The existing semantic smoothing model, however, is not suited to dynamic text data. In this paper, we first extend the semantic smoothing model to the text data stream context. Based on the extended model, we then present two online clustering algorithms, OCTS and OCTSM, for clustering massive text data streams. Both algorithms use a new cluster statistics structure, the cluster profile, which captures the semantics of text data streams dynamically and at the same time speeds up the clustering process. We also give efficient implementations of our algorithms. Finally, we present a series of experimental results illustrating the effectiveness of our technique.
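
To make the setting concrete, the following is a minimal sketch of a generic single-pass, similarity-based clustering loop over a text stream, of the kind the abstract contrasts with. It is not OCTS or OCTSM, and it uses neither TF*IDF weighting nor semantic smoothing: documents are compared by cosine similarity over raw term counts. The names `ClusterSummary`, `cluster_stream`, and the `threshold` parameter are hypothetical, and `ClusterSummary` is only a simplified stand-in for per-cluster statistics, not the paper's cluster profile.

```python
import math
from collections import Counter


class ClusterSummary:
    """Hypothetical running summary of one cluster (not the paper's cluster profile)."""

    def __init__(self):
        self.term_counts = Counter()  # summed term counts of all member documents
        self.size = 0                 # number of member documents

    def add(self, tokens):
        self.term_counts.update(tokens)
        self.size += 1


def cosine(a, b):
    """Cosine similarity between two sparse term-count mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster_stream(docs, threshold=0.3):
    """Assign each arriving document to its most similar cluster, in one pass.

    A new cluster is opened when no existing cluster reaches `threshold`
    (an assumed, illustrative cutoff).
    """
    clusters, labels = [], []
    for text in docs:
        tokens = text.lower().split()
        vec = Counter(tokens)
        best, best_sim = None, 0.0
        for i, summary in enumerate(clusters):
            sim = cosine(vec, summary.term_counts)
            if sim > best_sim:
                best, best_sim = i, sim
        if best is None or best_sim < threshold:
            clusters.append(ClusterSummary())
            best = len(clusters) - 1
        clusters[best].add(tokens)  # update the summary; the document itself is not kept
        labels.append(best)
    return labels, clusters


labels, clusters = cluster_stream([
    "stock market falls",
    "stock market rises",
    "football match tonight",
    "football match result",
])
print(labels)
```

Each document is seen once and only the per-cluster summaries are retained, which is the property that makes such a loop usable on a stream. A smoothing-based method would replace the raw term counts with smoothed term probabilities.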

Keywords

clustering, database applications, data mining, text data streams

Supplementary material

11390_2008_Article_9115_ESM.pdf (62 kb)

Copyright information

© Science Press, Beijing, China and Springer Science + Business Media, LLC, USA 2008

Authors and Affiliations

  • Yu-Bao Liu (1)
  • Jia-Rong Cai (1)
  • Jian Yin (1)
  • Ada Wai-Chee Fu (2)
  1. Department of Computer Science, Sun Yat-Sen University, Guangzhou, China
  2. Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China