TopCat: Data Mining for Topic Identification in a Text Corpus

  • Chris Clifton
  • Robert Cooley
Conference paper

DOI: 10.1007/978-3-540-48247-5_19

Part of the Lecture Notes in Computer Science book series (LNCS, volume 1704)
Cite this paper as:
Clifton C., Cooley R. (1999) TopCat: Data Mining for Topic Identification in a Text Corpus. In: Żytkow J.M., Rauch J. (eds) Principles of Data Mining and Knowledge Discovery. PKDD 1999. Lecture Notes in Computer Science, vol 1704. Springer, Berlin, Heidelberg

Abstract

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This allows us to view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items, and clusters are then formed with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus showing that this technique is effective in identifying topics in collections of news articles.
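As a rough illustration of the frequent-itemset step described in the abstract, the following Python sketch counts sets of entities that co-occur in at least a minimum number of articles. It is not the authors' implementation: the function name `frequent_itemsets`, the parameters `min_support` and `max_size`, and the sample articles are all hypothetical, and the brute-force enumeration stands in for whatever mining algorithm the paper actually uses. The hypergraph partitioning step is not shown.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in >= min_support articles.

    Brute-force enumeration, suitable only for small illustrative inputs.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))
        # Count every subset of this article's entities up to max_size.
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {s: c for s, c in counts.items() if c >= min_support}


# Hypothetical articles, each already reduced to a set of key entities.
articles = [
    {"Acme", "Springfield", "merger"},
    {"Acme", "Springfield"},
    {"Acme", "Springfield", "merger"},
    {"Globex", "lawsuit"},
    {"Globex", "lawsuit", "Shelbyville"},
]

itemsets = frequent_itemsets(articles, min_support=2)
for itemset, count in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), count)
```

In this toy input, entities that repeatedly appear together (such as "Acme" and "Springfield") surface as frequent itemsets, while an entity seen in only one article is dropped. Grouping the resulting itemsets into topics would be the job of the clustering step, which this sketch leaves out.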

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Chris Clifton (1)
  • Robert Cooley (2)
  1. The MITRE Corporation, Bedford, USA
  2. University of Minnesota, Minneapolis, USA
