Tailoring the automated construction of large-scale taxonomies using the web

  • Original Paper
  • Language Resources and Evaluation

Abstract

It has long been a dream to have available a single, centralized, semantic thesaurus or terminology taxonomy to support research in a variety of fields. Much human and computational effort has gone into constructing such resources, including the original WordNet and subsequent wordnets in various languages. To produce such resources one has to overcome well-known problems in achieving both wide coverage and internal consistency within a single wordnet and across many wordnets. In particular, one has to ensure that alternative valid taxonomizations covering the same basic terms are recognized and treated appropriately. In this paper we describe a pipeline of new, powerful, minimally supervised, automated algorithms that can be used to construct terminology taxonomies and wordnets, in various languages, by harvesting large amounts of online domain-specific or general text. We illustrate the effectiveness of the algorithms both to build localized, domain-specific wordnets and to highlight and investigate certain deeper ontological problems such as parallel generalization hierarchies. We show shortcomings and gaps in the manually-constructed English WordNet in various domains.

Notes

  1. For the sake of simplicity in this paper, we will use term and concept interchangeably.

  2. The intermediate-level terms are located between the low-level and the root terms.

  3. To compute the longest path we use a standard implementation.

  4. http://www.isi.edu/~kozareva/data/kozareva_taxonomy_data.zip.

  5. The various approaches to such ontological decisions are discussed in Hovy (2002).
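Note 3 mentions computing the longest path with a standard implementation but does not say which one. As an illustration only, the following Python sketch finds a longest root-to-leaf path in a directed acyclic graph of hypernym-to-hyponym edges using memoized depth-first search. The function name, the dictionary representation of the edges, and the toy terms are assumptions made for this example and are not taken from the paper.

```python
from functools import lru_cache


def longest_path(edges, root):
    """Return one longest root-to-leaf path in a DAG as a list of nodes.

    `edges` maps each parent term to a list of child terms. This
    representation is hypothetical; the paper does not specify one.
    The graph is assumed to contain no cycles.
    """
    @lru_cache(maxsize=None)
    def best(node):
        # Longest path starting at `node`, kept as a tuple so it can be cached.
        candidates = [best(child) for child in edges.get(node, ())]
        longest_tail = max(candidates, key=len, default=())
        return (node,) + longest_tail

    return list(best(root))


# Toy taxonomy, invented for illustration.
toy_taxonomy = {
    "animal": ["mammal", "bird"],
    "mammal": ["feline", "dog"],
    "feline": ["lion"],
}
print(longest_path(toy_taxonomy, "animal"))
# → ['animal', 'mammal', 'feline', 'lion']
```

Memoization keeps the work linear in the number of edges even when a term is reachable through several parents, which is the situation that parallel generalization hierarchies would produce.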

References

  • Agirre, E., & Lopez de Lacalle, O. (2004). Publicly available topic signatures for all WordNet nominal senses. In Proceedings of the 4th international conference on language resources and evaluation (LREC). Lisbon, Portugal.

  • Amsler, R. A. (1981). A taxonomy for English nouns and verbs. In Proceedings of the 19th annual meeting on association for computational linguistics, Association for Computational Linguistics, Morristown, NJ, USA, pp. 133–138.

  • Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Carroll, J., Magnini, B., et al. (2004). The MEANING multilingual central repository. In Proceedings of the second international WordNet conference. pp. 80–210.

  • Banko, M. (2009). Open information extraction from the web. Ph.D. Dissertation from University of Washington.

  • Bateman, J. A., Kasper, R. T., Moore, J. D., & Whitney, R. A. (1989). A general organization of knowledge for natural language processing: The penman upper model. Unpublished research report, USC/Information Sciences Institute, Marina del Rey.

  • Cuadros, M., & Rigau, G. (2008). KnowNet: Building a large net of knowledge from the web. In Proceedings of the 22nd international conference on computational linguistics (Coling’08), Manchester, UK.

  • Davidov, D., & Rappoport, A. (2006). Efficient unsupervised discovery of word categories using symmetric patterns and high frequency words. In Proceedings of the 21st international conference on computational linguistics COLING and the 44th annual meeting of the ACL, pp. 297–304.

  • Etzioni, O., Cafarella, M., Downey, D., Popescu, A. M., Shaked, T., Soderland, S., et al. (2005). Unsupervised named-entity extraction from the web: An experimental study. Artificial Intelligence, 165(1), 91–134.

  • Fellbaum, C. (Ed.). (1998). WordNet: An on-line lexical database and some of its applications. Cambridge, MA: MIT Press.

  • Fleiss, J. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.

  • Girju, R., Badulescu, A., & Moldovan, D. (2003). Learning semantic constraints for the automatic discovery of part-whole relations. In Proceedings of the conference of the north american chapter of the association for computational linguistics on human language technology (NAACL-HLT), pp. 1–8.

  • Glickman, O., Dagan, I., & Koppel, M. (2005). A probabilistic classification approach for lexical textual entailment. In Proceedings of the twentieth national conference on artificial intelligence and the seventeenth innovative applications of artificial intelligence conference, pp. 1050–1055.

  • Hearst, M. (1992). Automatic acquisition of hyponyms from large text corpora. In Proceedings of the 14th conference on computational linguistics, pp. 539–545.

  • Hovy, E. H. (1998). Combining and standardizing large-scale, practical ontologies for machine translation and other uses. In Proceedings of the LREC conference.

  • Hovy, E. H. (2002). Comparing sets of semantic relations in ontologies. In R. Green, C. A. Bean, & S. H. Myaeng (Eds.), The semantics of relationships: An interdisciplinary perspective, pp. 91–110.

  • Hovy, E. H., Kozareva, Z., & Riloff, E. (2009). Toward completeness in concept extraction and classification. In Proceedings of the 2009 conference on empirical methods in natural language processing (EMNLP), pp. 948–957.

  • Hovy, E. H., & Nirenburg, S. (1992). Approximating an interlingua in a principled way. In Proceedings of the DARPA Speech and natural language workshop, Arden House, NY.

  • Ide, N., & Veronis, J. (1994). Machine readable dictionaries: What have we learned, where do we go. In Proceedings of the post-COLING 94 intl. workshop on directions of lexical research, Beijing, pp. 137–146.

  • Katz, B., & Lin, J. (2003). Selectively using relations to improve precision in question answering. In Proceedings of the EACL-2003 workshop on natural language processing for question answering, pp. 43–50.

  • Kozareva, Z., Riloff, E., & Hovy, E. H. (2008). Semantic class learning from the web with hyponym pattern linkage graphs. In Proceedings of the NAACL-HLT conference, pp. 1048–1056.

  • Lenat, D. B., & Guha, R. V. (1990). Building large knowledge-based systems. Reading, MA: Addison-Wesley.

  • Lin, D. (1998). Automatic retrieval and clustering of similar words. In Proceedings of the 17th international conference on computational linguistics (COLING), pp. 768–774.

  • Lin, D., & Pantel, P. (2002). Concept discovery from text. In Proceedings of the 19th international conference on computational linguistics (COLING), pp. 1–7.

  • Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38, 39–41.

  • Mitchell, T. M., Betteridge, J., Carlson, A., Hruschka, E., & Wang, R. (2009). Populating the semantic web by macro-reading internet text. In Proceedings of the 8th international semantic web conference (ISWC).

  • Moldovan, D. I., Harabagiu, S. M., Pasca, M., Mihalcea, R., Goodrum, R., Girju, R. et al. (1999). Lasso: A tool for surfing the answer net. In Proceedings of the TREC conference.

  • Moravcsik, J. M. E. (1981). How do words get their meanings? The Journal of Philosophy, 78(1).

  • Navigli, R., & Ponzetto, S. P. (2012). BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193, 217–250.

  • Navigli, R., Velardi, P., Cucchiarelli, A., Neri, F., & Cucchiarelli, R. (2004). Extending and enriching WordNet with OntoLearn. In Proceedings of the second global wordnet conference 2004 (GWC 2004). pp. 279–284.

  • Navigli, R., Velardi, P., & Faralli, S. (2011). A graph-based algorithm for inducing lexical taxonomies from scratch. In Proceedings of the twenty-second international joint conference on artificial intelligence, Volume Three (IJCAI’11), pp. 1872–1877.

  • Pantel, P., & Pennacchiotti, M. (2006). Espresso: Leveraging generic patterns for automatically harvesting semantic relations. In Proceedings of 21st international conference on computational linguistics (COLING) and 44th annual meeting of the association for computational linguistics (ACL).

  • Pantel, P., Crestan, E., Borkovsky, A., Popescu, A. M., & Vyas, V. (2009). Web-scale distributional similarity and entity set expansion. In Proceedings of the conference on empirical methods in natural language processing (EMNLP), pp. 938–947.

  • Pasca, M. (2004). Acquisition of categorized named entities for web search. In Proceedings of the thirteenth ACM international conference on information and knowledge management (CIKM), pp. 137–145.

  • Pease, A., Fellbaum, C., & Vossen, P. (2008). Building the global WordNet grid. In Proceedings of the 18th international congress of linguists (CIL18), Seoul, Republic of Korea, July, pp. 21–26.

  • Pennacchiotti, M., & Pantel, P. (2006). Ontologizing semantic relations. In Proceedings of the international conference on computational linguistics (COLING) and the annual meeting of the association for computational linguistics (ACL), pp. 793–800.

  • Peters, I. (2009). Folksonomies. Indexing and retrieval in web 2.0. Berlin: De Gruyter Saur.

  • Ponzetto, S., & Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling supervised systems. In Proceedings of the 48th annual meeting of the association for computational linguistics (ACL 2010), Uppsala, Sweden.

  • Pustejovsky, J. (1995). The generative lexicon. Cambridge, MA: MIT Press.

  • Richardson, S. D., Dolan, W. B., & Vanderwende, L. (1998). Mindnet: Acquiring and structuring semantic information from text. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics—Volume 2 (ACL ’98), (Vol. 2). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1098–1102.

  • Rigau, G., Rodriguez, H., & Agirre, E. (1998). Building accurate semantic taxonomies from monolingual MRDs. In Proceedings of the 36th annual meeting of the association for computational linguistics and 17th international conference on computational linguistics—Volume 2 (ACL ’98), (Vol. 2). Association for Computational Linguistics, Stroudsburg, PA, USA, pp. 1103–1109.

  • Riloff, E., & Shepherd, J. (1997). A corpus-based approach for building semantic lexicons. In Proceedings of the second conference on empirical methods in natural language processing (EMNLP), pp. 117–124.

  • Riloff, E., & Jones, R. (1999). Learning dictionaries for information extraction by multi-level bootstrapping. In Proceedings of the sixteenth national conference on artificial intelligence (AAAI), pp. 474–479.

  • Ritter, A., Soderland, S., & Etzioni, O. (2009). What is this, anyway: Automatic hypernym discovery. In Proceedings of the AAAI spring symposium on learning by reading and learning to read.

  • Ritter, A., Mausam, & Etzioni, O. (2010). A latent Dirichlet allocation method for selectional preferences. In Proceedings of the association for computational linguistics conference (ACL).

  • Robkop, K., Thoongsup, S., Charoenporn, T., Sornlertlamvanich, V., & Isahara, H. (2010). WNMS: Connecting the distributed WordNet in the case of Asian WordNet. In Proceedings of the 5th international conference of the global WordNet association (GWC-2010), Mumbai, India.

  • Rosch, E. (1978). Principles of categorization. In Cognition and Categorization, pp. 27–48.

  • Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In Proceedings of the international conference on new methods in language processing, pp. 44–49.

  • Snow, R., Jurafsky, D., & Ng, A.Y. (2005). Learning syntactic patterns for automatic hypernym discovery. In L. K. Saul, Y. Weiss, & L. Bottou (Eds.), Advances in neural information processing systems (Vol. 17, pp. 1297–1304).

  • Snow, R., Jurafsky, D., & Ng, A. Y. (2006). Semantic taxonomy induction from heterogenous evidence. In Proceedings of the international conference on computational linguistics (COLING) and the annual meeting of the association for computational linguistics (ACL).

  • Suchanek, F. M., Kasneci, G., & Weikum, G. (2007). Yago: A core of semantic knowledge. In Proceedings of the 16th international conference on World Wide Web (WWW), pp. 697–706.

  • Szpektor, I., Dagan, I., Bar-Haim, R., & Goldberger, J. (2008). Contextual preferences. In Proceedings of the annual meeting of the association for computational linguistics (ACL), pp. 683–691.

  • Velardi, P., Navigli, R., & D’Amadio, P. (2008). Mining the web to create specialized glossaries. IEEE Intelligent Systems, 23(5), 18–25. ISSN: 1541-1672.

  • Vossen, P., Hofmann, K., de Rijke, M., Tjong Kim Sang, E., & Deschacht, K. (2008). The Cornetto database: Architecture and user-scenarios. In Proceedings of the fourth international Global WordNet conference (GWC).

  • Vossen, P. (Ed.). (1998). EuroWordNet: A multilingual database with lexical semantic networks. Dordrecht, The Netherlands: Kluwer.

  • Widdows, D. (2003). Unsupervised methods for developing taxonomies by combining syntactic and statistical information. In Proceedings of the HLT-NAACL conference.

  • Wilks, Y., Fass, D., ming Guo, C., Mcdonald, J. E., Plate, T., & Slator, B. M. (1988). Machine tractable dictionaries as tools and resources for natural language processing. In Proceedings of the 12th conference on computational linguistics, Association for Computational Linguistics, Morristown, NJ, USA, pp. 750–755.

  • Yang, H., & Callan, J. (2009). A metric-based framework for automatic taxonomy induction. In Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP (ACL-IJCNLP) (Vol. 1, pp. 271–279).

Acknowledgments

We acknowledge the support of DARPA contract number FA8750-09-C-3705.

Author information

Correspondence to Zornitsa Kozareva.

About this article

Cite this article

Kozareva, Z., Hovy, E. Tailoring the automated construction of large-scale taxonomies using the web. Lang Resources & Evaluation 47, 859–890 (2013). https://doi.org/10.1007/s10579-013-9229-0

  • DOI: https://doi.org/10.1007/s10579-013-9229-0
