An approach to measuring and annotating the confidence of Wiktionary translations

Abstract

Wiktionary is an online collaborative project based on the same principle as Wikipedia, where users can create, edit and delete entries containing lexical information. While the open nature of Wiktionary is the reason for its fast growth, it has also brought a problem: how reliable is the lexical information contained in every article? If we are planning to use Wiktionary translations as source content to accomplish a certain use case, we need to be able to answer this question and extract measures of their confidence. In this paper we present our work on assessing the quality of Wiktionary translations by introducing confidence metrics. Additionally, we describe our effort to share Wiktionary translations and the associated confidence values as linked data.

Notes

  1. http://wiktionary.org.

  2. http://en.wiktionary.org/wiki/Wiktionary:Text_of_Creative_Commons_Attribution-ShareAlike_3.0_Unported_License.

  3. http://en.wiktionary.org/wiki/Wiktionary:GNU_Free_Documentation_License.

  4. http://en.wiktionary.org/wiki/Wiktionary:Translations.

  5. http://meta.wikimedia.org/wiki/Wiktionary.

  6. http://mastersofmedia.hum.uva.nl/2008/09/20/wiktionary-and-the-limitations-of-collaborative-sites/.

  7. An example of this can be found for the German translation of the word “banco” in the Spanish language edition (http://es.wiktionary.org/wiki/banco). The available translations include the German term “Bank”; however, it points to an empty page (http://es.wiktionary.org/w/index.php?title=Bank&action=edit&redlink=1), showing that the term does not exist in the Spanish edition. The translation link is correctly created in the opposite direction, i.e., from German to Spanish.

  8. http://ogden.basic-english.org.

  9. http://www.coe.int/t/dg4/linguistic/Cadre1_en.asp.

  10. http://www.lemon-model.net.

  11. The latest developments on lemon by the Ontology Lexicon (Ontolex) community group can be found at http://www.w3.org/2016/05/ontolex/.

  12. For a detailed description of the different lemon components we refer the reader to the official cookbook at http://www.lemon-model.net/lemon-cookbook.

  13. At the time of writing, this modification is being considered as a possible approach to model explicit translations in lemon. Further discussions are available at http://www.w3.org/community/ontolex/wiki/Translation_Module.

  14. The category registry can be found at http://purl.org/net/translation.

  15. The dataset is available at http://kaiko.getalp.org/sparql.

  16. http://www.lexvo.org.

  17. http://dublincore.org/documents/dcmi-terms/.

  18. http://www.w3.org/TR/rdf-sparql-query/.

  19. http://dydra.com.

  20. https://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/.

  21. http://babelnet.org/.

Author information

Corresponding author

Correspondence to Antonio J. Roa-Valverde.

Appendix: Dataset example

In the following, we show an example of how our data model can be used to describe lexical translations. We have taken the English word “able” and built the associated ISG for Spanish as described in Sect. 3. The resulting graph is shown in Fig. 3. Table 3 contains the adjacency matrix with the existing translations and the confidence computed by combining the individual PageRank scores. Note that for this example we take the ISG as the only graph under consideration, and it is therefore equivalent to the USG. Listing 5 depicts the generated model in Turtle notation.

Table 3 Adjacency matrix corresponding to \(\mathit{ISG}_{en,es}(\mathit{able})\) and associated confidence
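
To make the role of PageRank in this example more concrete, the following is a minimal sketch of power-iteration PageRank over a small directed translation graph. It is an illustration under stated assumptions, not the authors' implementation: the function name `pagerank`, the damping factor 0.85, the node labels and the toy edges are all hypothetical and do not reproduce the contents of Table 3. The rule used to combine individual PageRank scores into a single confidence value is not given in this excerpt, so the sketch stops at per-node scores.

```python
def pagerank(adjacency, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on a dict {node: [successor, ...]}.

    Hypothetical helper for illustration; the damping factor and the
    stopping rule are assumptions, not values taken from the paper.
    """
    nodes = sorted(set(adjacency) | {t for ts in adjacency.values() for t in ts})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes without outgoing links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not adjacency.get(v))
        new = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            targets = adjacency.get(src, [])
            for dst in targets:
                # Each node shares its rank equally among its successors.
                new[dst] += damping * rank[src] / len(targets)
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


# Toy English-Spanish translation graph (illustrative edges only).
graph = {
    "en:able": ["es:capaz", "es:hábil"],
    "es:capaz": ["en:able", "en:capable"],
    "es:hábil": ["en:able", "en:skillful"],
    "en:capable": ["es:capaz"],
    "en:skillful": ["es:hábil"],
}

scores = pagerank(graph)
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node}\t{score:.4f}")
```

In this toy graph every translation link is reciprocated, so well-connected terms accumulate more rank than terms reachable through a single link. A translation whose reverse link is missing (as in the “banco”/“Bank” case described in the notes) would pass rank in one direction only.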

Cite this article

Roa-Valverde, A.J., Sanchez-Alonso, S., Sicilia, M. et al. An approach to measuring and annotating the confidence of Wiktionary translations. Lang Resources & Evaluation 51, 319–349 (2017). https://doi.org/10.1007/s10579-017-9384-9

Keywords

  • Linguistics
  • Linked data
  • Ranking
  • Random walks