Generating Concept Hierarchies from User Queries
Most information retrieval (IR) systems consist of a focused set of domain-specific documents located within a single logical repository. A mechanism is developed by which user queries against a particular type of IR repository, a frequently asked question (FAQ) system, are used to generate a concept hierarchy pertinent to the domain. First, an algorithm is described that selects a set of user queries submitted to the system, extracts terms from the repository documents matching those queries, and then reduces this set of terms to a manageable size. The resulting terms are used to generate a feature vector for each query, and the queries are clustered using a hierarchical agglomerative clustering (HAC) algorithm. The HAC algorithm generates a binary tree of clusters, which is difficult for humans to use and slow to search because of its depth. A subsequent processing step therefore applies min-max partitioning to form a shallower, bushier tree that more naturally represents the hierarchy of concepts inherent in the system. Two alternative versions of the partitioning algorithm are compared to determine which produces a more usable concept hierarchy.
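As an illustration of the query-clustering step only, the following is a minimal sketch, not the authors' implementation. It assumes the term extraction and reduction have already been done, so each query comes with a small set of terms. The sample queries, the binary term weighting, cosine similarity, and average linkage are all assumptions made for the example; the abstract does not specify them. The min-max partitioning step is not shown.

```python
from itertools import combinations
from math import sqrt

# Hypothetical input: each query paired with terms drawn from the
# documents it matched (term extraction and reduction are assumed done).
query_terms = [
    ("reset my password", {"password", "reset", "account"}),
    ("forgot password", {"password", "reset", "login"}),
    ("printer not printing", {"printer", "driver"}),
    ("install printer driver", {"printer", "driver", "install"}),
]

# Fixed term order so every query gets a vector of the same length.
vocabulary = sorted(set().union(*(terms for _, terms in query_terms)))


def feature_vector(terms):
    """Binary feature vector: 1.0 where the vocabulary term is present."""
    return [1.0 if term in terms else 0.0 for term in vocabulary]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hac(vectors):
    """Average-linkage HAC. Returns a binary tree as nested 2-tuples
    whose leaves are indices into `vectors`."""
    # Each cluster is (subtree, member indices).
    clusters = [(i, [i]) for i in range(len(vectors))]

    def linkage(a, b):
        sims = [cosine(vectors[i], vectors[j]) for i in a[1] for j in b[1]]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        # Merge the most similar pair of clusters.
        a, b = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(((a[0], b[0]), a[1] + b[1]))
    return clusters[0][0]


def leaves(tree):
    """Collect the leaf indices under a subtree."""
    if isinstance(tree, int):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


vectors = [feature_vector(terms) for _, terms in query_terms]
tree = hac(vectors)
```

Every internal node of `tree` has exactly two children, which is the binary structure the abstract describes as too deep for convenient use. This naive version recomputes all pairwise linkages on each merge, so it is suitable only for small illustrative inputs.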
The goal is to generate a concept hierarchy that is built from phrases that users actually enter when searching the repository, which should make the hierarchy more usable for all users. While the algorithm presented here is applied to an FAQ system, the techniques can easily be extended to any IR system that allows users to submit natural language queries and that selects documents from the repository that match those queries.
Keywords: Feature Vector, Feature Selection, User Query, Hierarchical Agglomerative Cluster, Concept Hierarchy