Abstract
Annotated corpora for research on Readability Assessment are still scarce. At the same time, researchers increasingly use the Web as a source of written content for building very large and rich corpora, as in the Web as Corpus (WaC) initiative. This paper proposes a framework for the automatic generation of large corpora classified by readability. It uses supervised learning to add to a crawler a readability filter based on features with low computational cost, so that the crawler collects texts targeted at a specific reading level. We evaluate this framework by comparing a readability-assessed web-crawled corpus to a reference corpus (both corpora are available at http://www.inf.ufrgs.br/pln/resource/CrawlingByReadabilityLevel.zip). The results indicate that these features are good at separating texts of level 1 (initial grades) from texts of other levels. As a result of this work, two Portuguese corpora were constructed: the Wikilivros Readability Corpus, classified by grade level, and a crawled WaC classified by readability level.
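To make the idea of a low-cost readability filter concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not name the features or the learning algorithm, so everything here is an assumption made for illustration: the two surface features (mean words per sentence, mean characters per word), the nearest-centroid classifier, the function names, and the toy training sentences.

```python
import re
from statistics import mean

# Hypothetical toy training data: (text, reading level). Not from the paper.
TRAIN = [
    ("O gato dorme. O sol brilha.", 1),
    ("A menina come pão. O cão corre.", 1),
    ("A fotossíntese converte energia luminosa em energia química "
     "armazenada nas moléculas orgânicas.", 3),
    ("Os processos sedimentares transformam gradualmente fragmentos "
     "minerais em formações rochosas estratificadas.", 3),
]


def surface_features(text):
    """Return (mean words per sentence, mean characters per word).

    Both features are assumed stand-ins for 'low computational cost'
    features; they need only regular expressions, no parsing.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    if not sentences or not words:
        return (0.0, 0.0)
    return (len(words) / len(sentences), mean([len(w) for w in words]))


def train_centroids(labelled_texts):
    """Supervised step: average the feature vectors of each level."""
    by_level = {}
    for text, level in labelled_texts:
        by_level.setdefault(level, []).append(surface_features(text))
    return {
        level: tuple(mean(column) for column in zip(*vectors))
        for level, vectors in by_level.items()
    }


def predict_level(text, centroids):
    """Assign the level whose centroid is closest in feature space."""
    features = surface_features(text)

    def squared_distance(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))

    return min(centroids, key=lambda level: squared_distance(centroids[level]))


def readability_filter(pages, centroids, target_level):
    """Yield only the pages predicted to be at the target level."""
    for page in pages:
        if predict_level(page, centroids) == target_level:
            yield page
```

A crawler could call `readability_filter` on each batch of fetched page texts and keep only what it yields. In practice the features would need scaling and the classifier would be trained on a real graded corpus; the sketch only shows where a cheap filter sits in the loop.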
Keywords
- Readability assessment
- Web as a corpus
- Focused crawling
Notes
- 1.
- 2.
- 3.
- 4.
The toolkit is divided into a web crawling module, several combinable filter modules, a deduplication module, and a post-processing module responsible for annotating and compiling the corpus.
- 5.
- 6.
All correlations were significant at a confidence level above 99%.
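The module layout described in note 4 (crawling, combinable filters, deduplication, post-processing) can be sketched as a simple pipeline. This is an illustrative assumption about how such modules might be chained, not the toolkit's actual code: the function name `run_pipeline`, the exact-duplicate hashing strategy, and the shape of the annotation are all hypothetical.

```python
import hashlib


def run_pipeline(fetched_texts, filters, annotate):
    """Chain filter, deduplication, and post-processing steps.

    fetched_texts: iterable of page texts (stands in for the crawler output).
    filters: list of predicates; a text is kept only if all return True.
    annotate: function applied to each surviving text (post-processing).
    """
    seen_digests = set()
    corpus = []
    for text in fetched_texts:
        # Combinable filters: every predicate must accept the text.
        if not all(keep(text) for keep in filters):
            continue
        # Assumed deduplication: hash of the case-folded,
        # whitespace-normalised text, which only catches exact repeats.
        normalised = " ".join(text.split()).lower()
        digest = hashlib.sha1(normalised.encode("utf-8")).hexdigest()
        if digest in seen_digests:
            continue
        seen_digests.add(digest)
        corpus.append(annotate(text))
    return corpus
```

A readability classifier would enter this design as one more predicate in `filters`, which is what makes the filter modules combinable.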
Acknowledgments
This research was partially developed in the context of the project Text Simplification of Complex Expressions, sponsored by Samsung Eletrônica da Amazônia Ltda. under the terms of Brazilian law no. 8.248/91. This work was also partly supported by CNPq (482520/2012-4, 312114/2015-0) and FAPERGS AiMWEst.
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Filho, J.A.W., Wilkens, R., Zilio, L., Idiart, M., Villavicencio, A. (2016). Crawling by Readability Level. In: Silva, J., Ribeiro, R., Quaresma, P., Adami, A., Branco, A. (eds) Computational Processing of the Portuguese Language. PROPOR 2016. Lecture Notes in Computer Science, vol 9727. Springer, Cham. https://doi.org/10.1007/978-3-319-41552-9_31
DOI: https://doi.org/10.1007/978-3-319-41552-9_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41551-2
Online ISBN: 978-3-319-41552-9
eBook Packages: Computer Science, Computer Science (R0)