Knowledge and Information Systems, Volume 39, Issue 1, pp 1–39

Taxonomic data integration from multilingual Wikipedia editions

Regular Paper

Abstract

Information systems are increasingly making use of taxonomic knowledge about words and entities. A taxonomic knowledge base may reveal that the Lago di Garda is a lake and that lakes as well as ponds, reservoirs, and marshes are all bodies of water. As the number of available taxonomic knowledge sources grows, there is a need for techniques to integrate such data into combined, unified taxonomies. In particular, the Wikipedia encyclopedia has been used by a number of projects, but its multilingual nature has largely been neglected. This paper investigates how entities from all editions of Wikipedia as well as WordNet can be integrated into a single coherent taxonomic class hierarchy. We rely on linking heuristics to discover potential taxonomic relationships, graph partitioning to form consistent equivalence classes of entities, and a Markov chain-based ranking approach to construct the final taxonomy. This results in MENTA (Multilingual Entity Taxonomy), a resource that describes 5.4 million entities and is one of the largest multilingual lexical knowledge bases currently available.
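
As a rough illustration of what a Markov chain-based ranking over candidate taxonomic links can look like, the following minimal sketch runs a random walk with restart from an entity and ranks the classes it reaches. This is a generic technique, not the formulation used for MENTA: the function name `rank_parents`, the input format, the edge weights, and the restart probability are all assumptions made for this example.

```python
def rank_parents(edges, start, restart=0.15, iterations=100):
    """Rank candidate classes for `start` by a random walk with restart.

    edges: dict mapping node -> {neighbor: weight}, holding weighted
           candidate "is-a" links (hypothetical input format).
    """
    nodes = set(edges)
    for targets in edges.values():
        nodes.update(targets)

    # All probability mass begins at the start entity.
    score = {n: 0.0 for n in nodes}
    score[start] = 1.0

    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for node, mass in score.items():
            targets = edges.get(node, {})
            total = sum(targets.values())
            if total == 0:
                # Dead end: the walker jumps back to the start entity.
                nxt[start] += mass
                continue
            # With probability `restart`, jump back to the start entity.
            nxt[start] += restart * mass
            # Otherwise follow an outgoing link, proportionally to its weight.
            for target, weight in targets.items():
                nxt[target] += (1 - restart) * mass * weight / total
        score = nxt

    score.pop(start)
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)


# Toy graph built around the example in the abstract; the weights and the
# "tourist attraction" node are invented for illustration only.
edges = {
    "Lago di Garda": {"lake": 0.9, "tourist attraction": 0.3},
    "lake": {"body of water": 1.0},
    "tourist attraction": {},
    "body of water": {},
}
ranked = rank_parents(edges, "Lago di Garda")
```

On this toy graph the walk ranks "lake" first, then "body of water", then "tourist attraction", because the strongly weighted link to "lake" carries most of the probability mass and passes it on to its own parent class.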

Keywords

Taxonomy induction, Multilingual, Graph, Ranking

Copyright information

© Springer-Verlag London 2013

Authors and Affiliations

  1. ICSI Berkeley, Berkeley, USA
  2. Max Planck Institute for Informatics, Saarbrücken, Germany
