Language Resources and Evaluation, Volume 47, Issue 3, pp 859–890

Tailoring the automated construction of large-scale taxonomies using the web

  • Zornitsa Kozareva
  • Eduard Hovy
Original Paper

Abstract

It has long been a dream to have available a single, centralized, semantic thesaurus or terminology taxonomy to support research in a variety of fields. Much human and computational effort has gone into constructing such resources, including the original WordNet and subsequent wordnets in various languages. To produce such resources one has to overcome well-known problems in achieving both wide coverage and internal consistency within a single wordnet and across many wordnets. In particular, one has to ensure that alternative valid taxonomizations covering the same basic terms are recognized and treated appropriately. In this paper we describe a pipeline of new, powerful, minimally supervised, automated algorithms that can be used to construct terminology taxonomies and wordnets, in various languages, by harvesting large amounts of online domain-specific or general text. We illustrate the effectiveness of the algorithms both in building localized, domain-specific wordnets and in highlighting and investigating certain deeper ontological problems such as parallel generalization hierarchies. We also show shortcomings and gaps in the manually constructed English WordNet in various domains.

Keywords

Hyponym and hypernym learning; Text mining; Ontology induction; Wordnet evaluation

Acknowledgments

We acknowledge the support of DARPA contract number FA8750-09-C-3705.

Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  1. USC Information Sciences Institute, Marina del Rey, USA
