TopCat: Data Mining for Topic Identification in a Text Corpus

  • Chris Clifton
  • Robert Cooley
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1704)

Abstract

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus showing that this technique is effective in identifying topics in collections of news articles.
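
The frequent-itemset step described in the abstract can be illustrated with a minimal sketch. It assumes each article has already been reduced to a set of entity strings; the toy corpus, the function name `frequent_itemsets`, and the parameters `min_support` and `max_size` are hypothetical and not taken from the paper. The sketch uses brute-force enumeration rather than a pruned algorithm such as Apriori, and it does not attempt the hypergraph partitioning step.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count every itemset of up to max_size entities and keep those
    that occur in at least min_support articles.

    Brute-force enumeration for illustration only; not the paper's
    implementation and not suitable for large corpora.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))  # de-duplicate within one article
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical toy corpus: each article is the set of entities found in it.
articles = [
    {"Acme Corp", "Jane Doe", "Springfield"},
    {"Acme Corp", "Jane Doe", "Shelbyville"},
    {"Red Cross", "Riverton"},
    {"Red Cross", "Riverton", "Jane Doe"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, count in sorted(result.items(), key=lambda kv: (-len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

In this toy corpus the two frequent pairs correspond to two candidate topics. In the approach the abstract describes, such itemsets would then be grouped into clusters by hypergraph partitioning, which this sketch leaves out.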

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Chris Clifton (1)
  • Robert Cooley (2)
  1. The MITRE Corporation, Bedford, USA
  2. University of Minnesota, Minneapolis, USA
