Mind the (Language) Gap: Generation of Multilingual Wikipedia Summaries from Wikidata for ArticlePlaceholders

  • Lucie-Aimée Kaffee
  • Hady Elsahar
  • Pavlos Vougiouklis
  • Christophe Gravier
  • Frédérique Laforest
  • Jonathon Hare
  • Elena Simperl
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10843)


While Wikipedia exists in 287 languages, its content is unevenly distributed among them. It is therefore of utmost social and cultural importance to focus efforts on languages whose speakers only have access to limited Wikipedia content. We investigate supporting communities by generating summaries for Wikipedia articles in underserved languages, given structured data as an input.

We focus on an important support for such summaries: ArticlePlaceholders, dynamically generated content pages in underserved Wikipedias. They enable native speakers to access existing information in Wikidata. To extend those ArticlePlaceholders, we provide a system which processes the triples of the knowledge base (KB) as they are provided by the ArticlePlaceholder and generates a comprehensible textual summary. This data-driven approach is employed with the goal of understanding how well it matches the communities’ needs in two underserved languages on the Web: Arabic, a language with a large community but disproportionately limited access to knowledge online, and Esperanto, an easily acquired, artificial language whose Wikipedia content is maintained by a small but devoted community. With the help of the Arabic and Esperanto Wikipedians, we conduct a study which evaluates not only the quality of the generated text, but also the usefulness of our end-system to any underserved Wikipedia version.


Multilinguality, Wikipedia, Wikidata, Natural language generation, Esperanto, Arabic, Neural networks



This research is partially supported by the Answering Questions using Web Data (WDAqua) project, a Marie Skłodowska-Curie Innovative Training Network under grant agreement No 642795, part of the Horizon 2020 programme.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. School of Electronics and Computer Science, University of Southampton, Southampton, UK
  2. Laboratoire Hubert Curien, CNRS UJM-Saint-Étienne, Université de Lyon, Lyon, France
