Chapter

Principles of Data Mining and Knowledge Discovery

Volume 1704 of the series Lecture Notes in Computer Science, pp. 174-183

TopCat: Data Mining for Topic Identification in a Text Corpus

  • Chris Clifton, The MITRE Corporation
  • Robert Cooley, University of Minnesota

Abstract

TopCat (Topic Categories) is a technique for identifying topics that recur in articles in a text corpus. Natural language processing techniques are used to identify key entities in individual articles, allowing us to represent an article as a set of items. This lets us view the problem in a database/data mining context: identifying related groups of items. This paper presents a novel method for identifying related items based on “traditional” data mining techniques. Frequent itemsets are generated from the groups of items and then clustered with a hypergraph partitioning scheme. We present an evaluation against a manually categorized “ground truth” news corpus, showing that this technique is effective in identifying topics in collections of news articles.
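To make the first stage concrete, the following is a minimal sketch of frequent itemset generation over articles represented as sets of entities. It is an illustration under stated assumptions, not the authors' implementation: the function name `frequent_itemsets`, the `min_support` and `max_size` parameters, and the toy corpus are all hypothetical, the counting is brute force rather than an optimized Apriori-style search, and the hypergraph partitioning step is not shown.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(articles, min_support, max_size=3):
    """Count every itemset of up to max_size entities and keep those
    that occur in at least min_support articles.

    Brute-force enumeration for illustration only; it is an assumption
    of this sketch, not the method described in the paper.
    """
    counts = Counter()
    for entities in articles:
        items = sorted(set(entities))  # each entity counts once per article
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[frozenset(combo)] += 1
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical toy corpus: each article is the set of entities found in it.
articles = [
    {"entity_a", "entity_b", "entity_c"},
    {"entity_a", "entity_b"},
    {"entity_a", "entity_b", "entity_c"},
    {"entity_x", "entity_y"},
    {"entity_x", "entity_y", "entity_z"},
]

result = frequent_itemsets(articles, min_support=2)
for itemset, n in sorted(result.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), n)
```

In this toy corpus the entities fall into two groups that never co-occur, which is the kind of structure a later clustering step would be expected to separate into topics.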