Size Does Not Matter. Frequency Does. A Study of Features for Measuring Lexical Complexity

  • Rodrigo Wilkens
  • Alessandro Dalla Vecchia
  • Marcely Zanon Boito
  • Muntsa Padró
  • Aline Villavicencio
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8864)


Lexical simplification aims to replace complex words with simpler synonyms or semantically close words. A first step in this task is to decide which words are complex and need to be replaced. Although this decision is subjective and far from trivial, linguists agree on what makes a word more difficult to read and understand. Cues such as the length of a word or its frequency in the language are accepted as informative for determining its complexity. In this work, we study the effectiveness of these cues by using them as features in a classification task that separates words into simple and complex. Interestingly, our results show that word length is not important, while corpus frequency alone is enough to correctly classify a large proportion of the test cases (F-measure over 80%).
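The comparison described above can be illustrated with a minimal sketch that classifies words as simple or complex from a single cue and scores each cue with the F-measure. This is not the authors' experimental setup: the threshold rules, the function names (`classify_by_frequency`, `classify_by_length`, `f_measure`), the thresholds, and the toy frequency counts and labels are all assumptions invented for illustration, and the scores they produce say nothing about the results reported in the paper.

```python
from typing import Dict, List


def f_measure(gold: List[bool], pred: List[bool]) -> float:
    """Balanced F-measure (F1) for the positive class 'complex'."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def classify_by_frequency(words: List[str], freq: Dict[str, int],
                          threshold: int) -> List[bool]:
    """Label a word complex when its corpus count is below the threshold."""
    return [freq.get(w, 0) < threshold for w in words]


def classify_by_length(words: List[str], max_len: int) -> List[bool]:
    """Label a word complex when it has more than max_len characters."""
    return [len(w) > max_len for w in words]


# Hypothetical toy data: counts and gold labels are invented, not taken
# from any corpus or from the paper.
TOY_FREQ = {"house": 900, "use": 1200, "dog": 800, "understand": 700,
            "ubiquitous": 12, "utilize": 40, "canine": 25, "apt": 9}
TOY_GOLD = {"house": False, "use": False, "dog": False, "understand": False,
            "ubiquitous": True, "utilize": True, "canine": True, "apt": True}

words = sorted(TOY_GOLD)
gold = [TOY_GOLD[w] for w in words]

# Thresholds are arbitrary choices for this toy example.
freq_f1 = f_measure(gold, classify_by_frequency(words, TOY_FREQ, threshold=100))
len_f1 = f_measure(gold, classify_by_length(words, max_len=6))

print(f"frequency cue F-measure: {freq_f1:.2f}")
print(f"length cue F-measure:    {len_f1:.2f}")
```

In this hand-built example, short rare words such as "apt" and long common words such as "understand" are the cases where a length rule and a frequency rule disagree. A real study would estimate frequencies from a large corpus and evaluate on annotated data.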


Keywords: Lexical simplification, Lexical complexity, Feature selection





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Rodrigo Wilkens (1)
  • Alessandro Dalla Vecchia (1)
  • Marcely Zanon Boito (1)
  • Muntsa Padró (1)
  • Aline Villavicencio (1)

  1. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
