A Matter of Words: NLP for Quality Evaluation of Wikipedia Medical Articles

  • Vittoria Cozza
  • Marinella Petrocchi (email author)
  • Angelo Spognardi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9671)


Automatic quality evaluation of Web information is a task with many fields of application and of great relevance, especially in critical domains such as the medical one. We start from the intuition that the quality of the content of medical Web documents is affected by features related to the specific domain: first, the usage of a specific vocabulary (Domain Informativeness); then, the adoption of specific codes (like those used in the infoboxes of Wikipedia articles) and the type of document (e.g., historical and technical ones). In this paper, we propose to leverage specific domain features to improve the results of the evaluation of Wikipedia medical articles, relying on Natural Language Processing (NLP) and dictionary-based techniques. The results of our experiments confirm that, by considering domain-oriented features, it is possible to improve on existing solutions, mainly for those articles that other approaches classify less accurately.
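As a rough illustration of a vocabulary-based feature such as Domain Informativeness, the sketch below scores a text by the fraction of its tokens found in a domain dictionary. This is only one plausible reading of the idea: the abstract does not give the actual definition, and the tiny `MEDICAL_TERMS` set, the regular-expression tokenizer, and the function names are all hypothetical stand-ins chosen for the example.

```python
import re

# Hypothetical mini-dictionary for illustration only; a real system would
# load a proper medical vocabulary instead of this hand-picked set.
MEDICAL_TERMS = {"diabetes", "insulin", "glucose", "pancreas", "hypertension"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def domain_informativeness(text, vocabulary=MEDICAL_TERMS):
    """Return the fraction of tokens that appear in the domain vocabulary.

    Assumed definition (not taken from the paper): dictionary hits divided
    by the total number of tokens, with 0.0 for an empty text.
    """
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in vocabulary)
    return hits / len(tokens)


if __name__ == "__main__":
    sample = "Insulin regulates glucose levels"
    print(domain_informativeness(sample))
```

A score like this could be used as one numeric feature alongside others when training a quality classifier; normalising by text length keeps long and short articles comparable.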


Natural Language Processing, Minority Class, Unify Medical Language System, Medical Domain, Semantic Group
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Vittoria Cozza (1)
  • Marinella Petrocchi (1), email author
  • Angelo Spognardi (2)
  1. IIT CNR, Pisa, Italy
  2. DTU Compute, Kgs. Lyngby, Denmark
