Education and Information Technologies, Volume 21, Issue 5, pp 1071–1094

A study of readability of texts in Bangla through machine learning approaches

  • Manjira Sinha
  • Anupam Basu


In this work, we investigate text readability in the Bangla language. Text readability indicates how suitable a given document is for a target reader group, and it therefore has a large impact on the preparation of educational content. Advances in natural language processing have enabled automatic identification of the reading difficulty of texts and have contributed to the design and development of suitable educational materials. Although Bangla is one of the major languages of India and the official language of Bangladesh, research on text readability in Bangla is still at a nascent stage. In this paper, we present computational models that determine the readability of Bangla text documents based on syntactic properties. Because Bangla is poor in digital resources, we had to develop a novel dataset suitable for automatic identification of text properties. Our initial experiments showed that existing English readability metrics are inapplicable to Bangla. Accordingly, we developed new models for analyzing text readability in Bangla that consider language-specific syntactic features of Bangla text. We identified the major structural contributors to text comprehensibility and subsequently developed readability models for Bangla texts. We used different machine-learning methods, such as regression, support vector machines (SVM) and support vector regression (SVR), and compared the performance of the individual models against one another. We conducted a detailed user survey for data preparation, identification of important structural parameters of texts, and validation of our proposed models. The work has further implications for educational research and for matching texts to readers.


Bangla text comprehensibility; Text readability; Resource creation; Readability models; Regression; Support vector machines; Support vector regression; User study



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, Kharagpur, India
