TopCat: Data Mining for Topic Identification in a Text Corpus

Clifton, Chris; Cooley, Robert

doi:10.1007/978-3-540-48247-5_19

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1704)

Included in the following conference series:

European Conference on Principles of Data Mining and Knowledge Discovery

Abstract

TopCat (Topic Categories) is a technique for identifying topics that recur across articles in a text corpus. Natural language processing techniques identify the key entities in each article, so that an article can be represented as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus, showing that this technique is effective in identifying topics in collections of news articles.
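The frequent-itemset step mentioned in the abstract can be illustrated with a minimal sketch. It assumes each article has already been reduced to a set of entity strings; the function name `frequent_itemsets`, the `min_support` parameter, and the toy corpus are all hypothetical and not taken from the paper. It uses a plain Apriori-style level-wise search, which may differ from the authors' implementation, and it does not attempt the hypergraph partitioning step.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {frozenset: count} for every itemset that occurs in at
    least `min_support` transactions (Apriori-style level-wise search)."""
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count individual items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Build size-k candidates from pairs of frequent (k-1)-itemsets,
        # keeping only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current
                for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Count how many transactions contain each candidate.
        counts = {
            c: sum(1 for t in transactions if c <= t) for c in candidates
        }
        current = {s: c for s, c in counts.items() if c >= min_support}
        result.update(current)
        k += 1
    return result


# Hypothetical toy corpus: one set of extracted entities per article.
articles = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"A", "B", "C", "E"},
    {"X", "Y", "Z"},
    {"X", "Z", "D"},
]
itemsets = frequent_itemsets(articles, min_support=2)
```

In this toy corpus the entities A, B, and C co-occur in two articles, so `{A, B, C}` comes out as a frequent itemset, while X and Z form a separate one. Groups like these are the raw material that a later clustering step would merge into topics.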

This work was supported by the Community Management Staff’s Massive Digital Data Systems Program.

Author information

Authors and Affiliations

The MITRE Corporation, 202 Burlington Rd, Bedford, MA, 01730-1420, USA
Chris Clifton
University of Minnesota, 6-225D EE/CS Building, Minneapolis, MN, 55455, USA
Robert Cooley

Editor information

Editors and Affiliations

Computer Science Department, UNC Charlotte, Charlotte, N.C. 28223 and Institute of Computer Science, Polish Academy of Sciences,
Jan M. Żytkow
Faculty of Informatics and Statistics, University of Economics, Prague, nám. W. Churchilla 4, 130 67, Prague, Czech Republic
Jan Rauch

Cite this paper

Clifton, C., Cooley, R. (1999). TopCat: Data Mining for Topic Identification in a Text Corpus. In: Żytkow, J.M., Rauch, J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-48247-5_19

DOI: https://doi.org/10.1007/978-3-540-48247-5_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-66490-1
Online ISBN: 978-3-540-48247-5
eBook Packages: Springer Book Archive
