An integrated system for building enterprise taxonomies


Although considerable research has been conducted on hierarchical text categorization, little has been done on automatically collecting labeled corpora for building hierarchical taxonomies. In this paper, we propose an automatic method for collecting training samples to build hierarchical taxonomies. In our method, each category node is initially defined by a few keywords; a web search engine is then used to construct a small set of labeled documents, and a topic tracking algorithm with keyword-based content normalization is applied to enlarge the training corpus on the basis of these seed documents. We also design a method to check the consistency of the collected corpus. These steps produce a flat category structure that contains all the categories for building the hierarchical taxonomy. Next, a linear discriminant projection approach is used to construct more meaningful intermediate levels of the hierarchy from the generated flat set of categories. Experimental results show that the collected training corpus is good enough for statistical classification methods.
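The corpus-enlargement step described above can be illustrated with a minimal sketch. It is not the authors' algorithm: the whitespace tokenizer, the keyword boost factor, the running-centroid update, the similarity threshold, and all function names (`tf_vector`, `cosine`, `centroid`, `enlarge_corpus`) are assumptions made for illustration. The idea is only that documents close to the seed set, measured after category keywords are given extra weight, are added to the training corpus.

```python
import math
from collections import Counter


def tf_vector(text, keywords, boost=2.0):
    """Sparse term-frequency vector; category keywords get extra weight.

    The boost is a stand-in for keyword-based normalization (an assumption).
    """
    counts = Counter(text.lower().split())
    return {t: c * (boost if t in keywords else 1.0) for t, c in counts.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors):
    """Component-wise average of a non-empty list of sparse vectors."""
    total = Counter()
    for vec in vectors:
        total.update(vec)  # Counter.update adds values term by term
    return {t: w / len(vectors) for t, w in total.items()}


def enlarge_corpus(seeds, candidates, keywords, threshold=0.3):
    """Grow the seed set with candidates similar to the running centroid.

    Each accepted candidate is folded into the centroid, so later
    candidates are compared against the enlarged set.
    """
    accepted_vecs = [tf_vector(doc, keywords) for doc in seeds]
    accepted_docs = list(seeds)
    for doc in candidates:
        vec = tf_vector(doc, keywords)
        if cosine(vec, centroid(accepted_vecs)) >= threshold:
            accepted_vecs.append(vec)
            accepted_docs.append(doc)
    return accepted_docs
```

In this sketch a candidate that shares no terms with the seeds has similarity 0 and is rejected, while one that repeats the category keywords is accepted. A fixed threshold is the simplest possible acceptance rule; a real system would tune it and would pair it with a consistency check such as the one the abstract mentions.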



The work of Tao Li is partially supported by NSF IIS-0546280 and NIH/NIGMS S06 GM008205. The authors are grateful to the anonymous reviewers for their useful comments.

Author information


Corresponding author

Correspondence to Tao Li.

About this article

Cite this article

Zhang, L., Li, T., Liu, S. et al. An integrated system for building enterprise taxonomies. Inf Retrieval 10, 365–391 (2007).


Keywords

  • Taxonomy
  • Consistency
  • Discriminant projection