Generating Concept Hierarchies from User Queries

  • Bob Wall
  • Neal Richter
  • Rafal Angryk
Part of the Studies in Computational Intelligence book series (SCI, volume 118)

Summary

Most information retrieval (IR) systems consist of a focused set of domain-specific documents located within a single logical repository. A mechanism is developed by which user queries against a particular type of IR repository, a frequently asked question (FAQ) system, are used to generate a concept hierarchy pertinent to the domain. First, an algorithm is described that selects a set of user queries submitted to the system, extracts terms from the repository documents matching those queries, and then reduces this set of terms to a manageable length. The resulting terms are used to generate a feature vector for each query, and the queries are clustered using a hierarchical agglomerative clustering (HAC) algorithm. The HAC algorithm generates a binary tree of clusters, which is not particularly amenable to use by humans and which is slow to search due to its depth, so a subsequent processing step applies min-max partitioning to form a shallower, bushier tree that is a more natural representation of the hierarchy of concepts inherent in the system. Two alternative versions of the partitioning algorithm are compared to determine which produces a more usable concept hierarchy.
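
The first two stages of this pipeline (one feature vector per query, then HAC producing a binary tree) can be illustrated with a minimal sketch. It is not the authors' implementation: the binary term weights, cosine similarity, average linkage, and the sample queries and terms are all assumptions chosen for illustration, since the summary does not specify them. The min-max partitioning step is not shown.

```python
from itertools import combinations
from math import sqrt

def term_vector(query_terms, vocabulary):
    """Binary feature vector (assumed weighting): 1.0 if the term is tied to the query."""
    return [1.0 if term in query_terms else 0.0 for term in vocabulary]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def hac(vectors):
    """Average-linkage HAC (assumed linkage).

    Returns a binary tree as nested tuples: a leaf is a query index,
    an internal node is a (left, right) pair.
    """
    # Each active cluster is (subtree, list of member indices).
    clusters = [(i, [i]) for i in range(len(vectors))]
    while len(clusters) > 1:
        best = None
        for (i, (_, a)), (j, (_, b)) in combinations(enumerate(clusters), 2):
            # Average pairwise similarity between the two clusters' members.
            sim = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
            sim /= len(a) * len(b)
            if best is None or sim > best[0]:
                best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

def depth(tree):
    """Depth of the binary cluster tree (0 for a single leaf)."""
    return 0 if isinstance(tree, int) else 1 + max(depth(tree[0]), depth(tree[1]))

# Hypothetical queries, each mapped to terms taken from the documents it matched.
queries = {
    "reset my password": {"password", "reset", "account"},
    "forgot password": {"password", "reset", "login"},
    "change billing address": {"billing", "address", "account"},
    "update credit card": {"billing", "card", "payment"},
}
vocabulary = sorted(set().union(*queries.values()))
vectors = [term_vector(terms, vocabulary) for terms in queries.values()]
tree = hac(vectors)
print(tree, depth(tree))
```

With these four sample queries the two password queries merge first and the two billing queries second. Because HAC merges two clusters at a time, the tree is always binary and tends to grow deep on larger query sets, which is the motivation for the later partitioning step that produces a shallower, bushier tree.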

The goal is to generate a concept hierarchy that is built from phrases that users actually enter when searching the repository, which should make the hierarchy more usable for all users. While the algorithm presented here is applied to an FAQ system, the techniques can easily be extended to any IR system that allows users to submit natural language queries and that selects documents from the repository that match those queries.

Keywords

Feature Vector, Feature Selection, User Query, Hierarchical Agglomerative Cluster, Concept Hierarchy
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Bob Wall (1)
  • Neal Richter (1)
  • Rafal Angryk (2)
  1. RightNow Technologies, Bozeman, USA
  2. Montana State University, Bozeman, USA