Abstract
TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed from them with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus showing that this technique is effective in identifying topics in collections of news articles.
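As a rough illustration of the pipeline outlined above, the following Python sketch treats each article as a set of entity strings, mines frequent itemsets level by level, and then groups the itemsets into candidate topics. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names, the `min_support` and `max_size` parameters, and the toy articles are all hypothetical, and the grouping step merges itemsets that share an item as a simple stand-in for the hypergraph partitioning scheme the paper uses.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in >= min_support articles.

    Level-wise (Apriori-style) search; max_size is an illustrative cap.
    """
    articles = [frozenset(a) for a in articles]
    counts = Counter(item for a in articles for item in a)
    level = {frozenset([i]): c for i, c in counts.items() if c >= min_support}
    frequent = {}
    size = 1
    while level:
        frequent.update(level)
        if size == max_size:
            break
        size += 1
        items = sorted(set().union(*level))
        # Keep a candidate only if every (size - 1)-subset is frequent.
        candidates = [
            frozenset(c)
            for c in combinations(items, size)
            if all(frozenset(s) in level for s in combinations(c, size - 1))
        ]
        level = {}
        for cand in candidates:
            n = sum(1 for a in articles if cand <= a)
            if n >= min_support:
                level[cand] = n
    return frequent


def merge_into_topics(itemsets):
    """Group itemsets that share at least one item.

    Assumption: this connected-components merge only stands in for
    hypergraph partitioning; it is not the scheme used in the paper.
    """
    topics = []
    for itemset in itemsets:
        merged = set(itemset)
        untouched = []
        for topic in topics:
            if topic & merged:
                merged |= topic
            else:
                untouched.append(topic)
        topics = untouched + [merged]
    return topics


# Made-up toy data: each article is the set of entities extracted from it.
articles = [
    {"mayor", "budget", "council"},
    {"mayor", "budget", "council", "vote"},
    {"mayor", "council"},
    {"storm", "flood", "coast"},
    {"storm", "flood", "rescue"},
    {"storm", "coast", "flood"},
]

itemsets = frequent_itemsets(articles, min_support=2)
topics = merge_into_topics(s for s in itemsets if len(s) >= 2)
for topic in topics:
    print(sorted(topic))
```

Merging on any shared item is deliberately crude: a single entity that appears in many stories would pull unrelated topics together, which is the kind of over-merging a weighted partitioning of the itemset hypergraph is meant to avoid.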
This work was supported by the Community Management Staff’s Massive Digital Data Systems Program.
Copyright information
© 1999 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Clifton, C., Cooley, R. (1999). TopCat: Data Mining for Topic Identification in a Text Corpus. In: Żytkow, J.M., Rauch, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-48247-5_19
DOI: https://doi.org/10.1007/978-3-540-48247-5_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66490-1
Online ISBN: 978-3-540-48247-5
eBook Packages: Springer Book Archive