Dynamic Pattern Mining: An Incremental Data Clustering Approach

  • Seokkyung Chung
  • Dennis McLeod
Chapter
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3360)

Abstract

We propose a mining framework that supports the identification of useful patterns based on incremental data clustering. Given the popularity of Web news services, we focus on news stream mining. News articles are retrieved from Web news services and processed by data mining tools to produce useful higher-level knowledge, which is stored in a content description database. Instead of interacting with a Web news service directly, an information delivery agent can answer a user request by exploiting the knowledge in the database. A key challenge in news repository management is the high rate of document insertion. To address this problem, we present a sophisticated incremental hierarchical document clustering algorithm that uses a neighborhood search. The novelty of the proposed algorithm is its ability to identify meaningful patterns (e.g., news events and news topics) while reducing the amount of computation by maintaining the cluster structure incrementally. In addition, to overcome the lack of topical relations in conceptual ontologies, we propose a topic ontology learning framework that utilizes the obtained document hierarchy. Experimental results demonstrate that the proposed clustering algorithm produces high-quality clusters, and that the topic ontology provides interpretations of news topics at different levels of abstraction.
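
As a rough illustration of the general idea of maintaining clusters incrementally with a neighborhood search, the following minimal Python sketch may help. It is not the authors' algorithm: the class name `IncrementalClusterer`, the cosine-similarity measure over raw term counts, the `threshold` value, and the merge rule are all assumptions chosen for brevity, and the sketch produces flat clusters rather than a hierarchy.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class IncrementalClusterer:
    """Hypothetical sketch: assigns each new document on arrival
    instead of re-clustering the whole collection."""

    def __init__(self, threshold=0.3):
        self.threshold = threshold  # assumed similarity cut-off
        self.docs = []              # term-count vector per document
        self.labels = []            # cluster id per document
        self._next_id = 0

    def insert(self, text):
        vec = Counter(text.lower().split())
        # Neighborhood search (here a plain linear scan): collect the
        # clusters of all stored documents similar enough to the new one.
        neighbor_labels = {
            self.labels[i]
            for i, doc in enumerate(self.docs)
            if cosine(vec, doc) >= self.threshold
        }
        if neighbor_labels:
            # Join the neighbors' cluster; if the neighbors span several
            # clusters, merge them under the smallest id.
            target = min(neighbor_labels)
            self.labels = [
                target if label in neighbor_labels else label
                for label in self.labels
            ]
        else:
            # No neighbors: the document starts a new cluster.
            target = self._next_id
            self._next_id += 1
        self.docs.append(vec)
        self.labels.append(target)
        return target


clusterer = IncrementalClusterer(threshold=0.3)
first = clusterer.insert("court trial verdict jury")
second = clusterer.insert("jury verdict in court trial")
third = clusterer.insert("earthquake hits coastal city")
```

In this sketch only the clusters touched by the new document's neighborhood are updated on each insertion, which is the kind of saving the abstract attributes to incremental maintenance. A practical system would replace the linear scan with an index and use weighted term vectors.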

Keywords

Cluster Algorithm; Neighborhood Search; News Article; Document Cluster; Court Trial
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Seokkyung Chung
    • 1
  • Dennis McLeod
    • 1
  1. Department of Computer Science, and Integrated Media System Center, University of Southern California, Los Angeles, USA