Analyzing the Adequacy of Readability Indicators to a Non-English Language
Readability is a linguistic feature that indicates how difficult a text is to read. Traditional readability formulas were designed for the English language. This study evaluates their adequacy for the Portuguese language. We applied the traditional formulas to 10 parallel corpora. We verified that Portuguese texts received higher grade scores (lower readability) in the formulas that use the number of syllables per word or the number of complex words per sentence. Formulas that use letters per word instead of syllables per word produced similar grade scores. Considering this, we evaluated how the proportion of complex words correlates with school grade in 65 Portuguese school books covering 12 schooling years. We found that defining a complex word as a word with 4 or more syllables, instead of the 3 or more syllables originally used in traditional formulas applied to English texts, correlates better with the grade of Portuguese school books. Finally, we adapted each traditional readability formula to the Portuguese language by performing a multiple linear regression on the same dataset of school books.
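To make the three families of formulas concrete, the following is a minimal Python sketch of one representative of each: Flesch-Kincaid grade level (syllables per word), the Automated Readability Index (letters per word), and the Gunning Fog index (proportion of complex words). The coefficients are the commonly published English ones, not the Portuguese-adapted coefficients obtained in this study. The function names, the `complex_threshold` parameter, and the assumption that the caller supplies pre-computed counts (syllabification is language-specific and omitted here) are illustrative choices, and the Fog sketch ignores the word-type exclusions of the original definition.

```python
from typing import Sequence


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Grade level from words per sentence and syllables per word."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def automated_readability_index(letters: int, words: int, sentences: int) -> float:
    """Grade level from letters per word and words per sentence."""
    return 4.71 * (letters / words) + 0.5 * (words / sentences) - 21.43


def gunning_fog(
    syllables_per_word: Sequence[int],
    sentences: int,
    complex_threshold: int = 3,
) -> float:
    """Grade level from words per sentence and the share of complex words.

    A word counts as complex when it has at least `complex_threshold`
    syllables (hypothetical parameter: 3 is the usual English definition,
    4 is the alternative this study examines for Portuguese).
    """
    words = len(syllables_per_word)
    complex_words = sum(1 for s in syllables_per_word if s >= complex_threshold)
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))


if __name__ == "__main__":
    # Made-up syllable counts for a 10-word, 2-sentence passage.
    counts = [1, 2, 3, 4, 1, 2, 3, 1, 5, 2]
    print(gunning_fog(counts, sentences=2, complex_threshold=3))
    print(gunning_fog(counts, sentences=2, complex_threshold=4))
```

Raising the threshold from 3 to 4 syllables reduces the number of words counted as complex and therefore lowers the Fog score for the same passage, which is the direction of adjustment the abstract describes for Portuguese.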
Keywords: Readability, Portuguese language, Text simplification, Natural language processing
This work is financed by National Funds through the Portuguese funding agency, FCT - Fundação para a Ciência e a Tecnologia within the project: UID/EEA/50014/2019. We would also like to thank the Master in Informatics and Computing Engineering of the Faculty of Engineering of the University of Porto for supporting the registration and travel costs.