A study of readability of texts in Bangla through machine learning approaches

Sinha, Manjira; Basu, Anupam

doi:10.1007/s10639-014-9368-y

Published: 07 December 2014

Volume 21, pages 1071–1094, (2016)

Education and Information Technologies

Manjira Sinha¹ &
Anupam Basu¹

Abstract

In this work, we investigate text readability in the Bangla language. Text readability indicates how suitable a given document is for a target reader group, and it therefore has a large impact on the preparation of educational content. Advances in natural language processing have enabled the automatic identification of the reading difficulty of texts and have contributed to the design and development of suitable educational materials. Although Bangla is one of the major languages of India and the official language of Bangladesh, research on text readability in Bangla is still at a nascent stage. In this paper, we present computational models that determine the readability of Bangla text documents from their syntactic properties. Because Bangla is poor in digital resources, we had to develop a novel dataset suitable for the automatic identification of text properties. Our initial experiments showed that existing English readability metrics are inapplicable to Bangla. Accordingly, we developed new models for analyzing text readability in Bangla that consider language-specific syntactic features of Bangla text. We identified the major structural contributors to text comprehensibility and subsequently developed readability models for Bangla texts. To this end, we used different machine-learning methods such as regression, support vector machines (SVM), and support vector regression (SVR), and compared the performance of the individual models against one another. We conducted detailed user surveys for data preparation, for identifying the important structural parameters of texts, and for validating our proposed models. The work has further implications for educational research and for matching texts to readers.
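As a rough illustration of the kind of pipeline the abstract describes (structural text features fed to a regression model), the following is a minimal sketch only. The two features (average words per sentence, average characters per word), the function names `structural_features` and `fit_line`, and the single-predictor least-squares fit are all assumptions made for this example; the abstract does not specify the paper's actual feature set or model configuration.

```python
import re

# Assumption: sentences end with the Bangla danda (।), '?' or '!'.
SENTENCE_END = re.compile(r"[।?!]+")


def structural_features(text):
    """Return (average words per sentence, average characters per word).

    Word length is counted in Unicode code points, which is only a crude
    proxy for Bangla orthographic units; this is an illustrative choice.
    """
    sentences = [s.split() for s in SENTENCE_END.split(text) if s.strip()]
    words = [w for s in sentences for w in s]
    if not words:
        return 0.0, 0.0
    avg_sentence_len = len(words) / len(sentences)
    avg_word_len = sum(len(w) for w in words) / len(words)
    return avg_sentence_len, avg_word_len


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope
```

In such a setup, one feature value per document would be paired with a reader-assigned difficulty score and passed to `fit_line`; an SVM or SVR model from a machine-learning library could be substituted for the linear fit.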

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
Manjira Sinha & Anupam Basu

Corresponding author

Correspondence to Manjira Sinha.

About this article

Cite this article

Sinha, M., Basu, A. A study of readability of texts in Bangla through machine learning approaches. Educ Inf Technol 21, 1071–1094 (2016). https://doi.org/10.1007/s10639-014-9368-y

Published: 07 December 2014
Issue Date: September 2016
DOI: https://doi.org/10.1007/s10639-014-9368-y
