Abstract
In this paper we introduce a computational procedure for measuring the semantic readability of a segmented text. The procedure consists of three main steps. First, natural language processing tools and unsupervised machine learning techniques are used to obtain a numerical vector representation of each section or segment of the input text, so that semantically related segments are modeled by nearby points in a vector space. Second, the shortest and longest Hamiltonian paths passing through these points are computed. Lastly, the lengths of these paths and the length of the path induced by the original ordering of the segments are combined into an arithmetic expression to derive an index, which may be used to gauge the semantic difficulty that a reader is expected to experience when reading the text. A preliminary experimental study is conducted on seven classic narrative texts written in English, obtained from Project Gutenberg. The experimental results appear to be in line with our expectations.
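The three steps can be illustrated with a minimal sketch, under several assumptions that are not taken from the paper. The segment vectors are supplied directly as small hypothetical 2-D points rather than produced by an embedding model. Distances are Euclidean. The shortest and longest Hamiltonian paths are found by exhaustive search, which stands in for a dedicated TSP solver and only works for a handful of segments. The abstract does not state the arithmetic expression, so the min-max normalization below is a guess at one plausible form, and the function names `path_length` and `semantic_difficulty` are illustrative.

```python
import itertools
import math


def path_length(points, order):
    """Sum of Euclidean distances between consecutive points visited in `order`."""
    return sum(math.dist(points[a], points[b]) for a, b in zip(order, order[1:]))


def semantic_difficulty(points):
    """Toy index in [0, 1] (assumed normalization, not the paper's formula).

    0 means the original segment order is already a shortest Hamiltonian path;
    1 means it is a longest one.
    """
    n = len(points)
    original = path_length(points, list(range(n)))
    # Exhaustive search over all orderings: O(n!), for illustration only.
    lengths = [path_length(points, p) for p in itertools.permutations(range(n))]
    shortest, longest = min(lengths), max(lengths)
    if math.isclose(shortest, longest):
        return 0.0  # every ordering has the same length
    return (original - shortest) / (longest - shortest)


# Hypothetical 2-D "embeddings" of four text segments, in reading order.
smooth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
jumpy = [(0.0, 0.0), (3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]

print(semantic_difficulty(smooth))  # 0.0
print(semantic_difficulty(jumpy))   # 0.75
```

In this toy setting, a text whose consecutive segments stay close in the vector space scores near 0, while one that jumps between distant topics scores closer to 1.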
Notes
1. glove-wiki-gigaword-100 is available with the Python module Gensim [17] at the URL https://github.com/RaRe-Technologies/gensim-data.
2. Concorde is available from https://www.math.uwaterloo.ca/tsp/concorde.html.
References
Applegate, D., Bixby, R., Chvátal, V., Cook, W.: TSP cuts which do not conform to the template paradigm. In: Jünger, M., Naddef, D. (eds.) Computational Combinatorial Optimization. LNCS, vol. 2241, pp. 261–303. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45586-8_7
Baioletti, M., Milani, A., Santucci, V., Bartoccini, U.: An experimental comparison of algebraic differential evolution using different generating sets. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO 2019, pp. 1527–1534. (2019). https://doi.org/10.1145/3319619.3326854
Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S., Ridgeway, G.: Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learning Res. 6(9), 1345–1382 (2005)
Barvinok, A., Gimadi, E.K., Serdyukov, A.I.: The maximum TSP. In: Gutin, G., Punnen, A.P. (eds.) The Traveling Salesman Problem and Its Variations. Combinatorial Optimization, vol. 12, pp. 585–607. Springer, Boston (2007). https://doi.org/10.1007/0-306-48213-4_12
Calfee, R.C., Curley, R.: Structures of prose in content areas. In: Understanding Reading Comprehension, pp. 161–180 (1984)
Chowdhary, K.: Natural language processing. Fundamentals of Artificial Intelligence, pp. 603–649 (2020)
Church, K.W.: Word2vec. Nat. Lang. Eng. 23(1), 155–162 (2017)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dieng, A.B., Ruiz, F.J., Blei, D.M.: Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguis. 8, 439–453 (2020)
DuBay, W.H.: The principles of readability. Online Submission (2004)
Forti, L., Grego Bolli, G., Santarelli, F., Santucci, V., Spina, S.: MALT-IT2: a new resource to measure text difficulty in light of CEFR levels for Italian L2 learning. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, May 2020, pp. 7204–7211. European Language Resources Association (2020). https://aclanthology.org/2020.lrec-1.890
Forti, L., Milani, A., Piersanti, L., Santarelli, F., Santucci, V., Spina, S.: Measuring text complexity for Italian as a second language learning purposes. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy, August 2019, pp. 360–368. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/W19-4438
Gourru, A., Guille, A., Velcin, J., Jacques, J.: Document network projection in pretrained word embedding space. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 150–157. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_19
Graff, D., Kong, J., Chen, K., Maeda, K.: English Gigaword. Linguis. Data Consortium Philadelphia 4(1), 34 (2003)
Jones, M.J., Shoemaker, P.A.: Accounting narratives: a review of empirical studies of content and readability. J. Acc. Lit. 13, 142 (1994)
Jünger, M., Reinelt, G., Rinaldi, G.: The traveling salesman problem. In: Handbooks in Operations Research and Management Science, vol. 7, pp. 225–330 (1995)
Khosrovian, K., Pfahl, D., Garousi, V.: GENSIM 2.0: a customizable process simulation model for software process evaluation. In: Wang, Q., Pfahl, D., Raffo, D.M. (eds.) ICSP 2008. LNCS, vol. 5007, pp. 294–306. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79588-9_26
Kwolek, W.F.: A readability survey of technical and popular literature. Journalism Q. 50(2), 255–264 (1973). https://doi.org/10.1177/107769907305000206
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine learning, pp. 1188–1196. PMLR (2014)
Li, B., Han, L.: Distance weighted cosine similarity measure for text classification. In: Yin, H., et al. (eds.) IDEAL 2013. LNCS, vol. 8206, pp. 611–618. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41278-3_74
Li, Y., Yang, T.: Word embedding for understanding natural language: a survey. In: Srinivasan, S. (ed.) Guide to Big Data Applications. SBD, vol. 26, pp. 83–104. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-53817-4_4
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Rahman, M.S., Kaykobad, M.: On Hamiltonian cycles and Hamiltonian paths. Inf. Process. Lett. 94(1), 37–41 (2005)
Ruder, S., Peters, M.E., Swayamdipta, S., Wolf, T.: Transfer learning in natural language processing. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pp. 15–18 (2019)
Santucci, V., Baioletti, M., Milani, A.: An algebraic differential evolution for the linear ordering problem. In: Companion Material Proceedings of Genetic and Evolutionary Computation Conference, GECCO 2015, pp. 1479–1480 (2015). https://doi.org/10.1145/2739482.2764693
Santucci, V., Ceberio, J.: Using pairwise precedences for solving the linear ordering problem. Appl. Soft Comput. 87, 105998 (2020). https://doi.org/10.1016/j.asoc.2019.105998
Santucci, V., Forti, L., Santarelli, F., Spina, S., Milani, A.: Learning to classify text complexity for the Italian language using support vector machines. In: Gervasi, O., et al. (eds.) ICCSA 2020. LNCS, vol. 12250, pp. 367–376. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58802-1_27
Santucci, V., Santarelli, F., Forti, L., Spina, S.: Automatic classification of text complexity. Appl. Sci. 10(20) (2020). https://doi.org/10.3390/app10207285, https://www.mdpi.com/2076-3417/10/20/7285
Santucci, V., Spina, S., Milani, A., Biondi, G., Di Bari, G.: Detecting hate speech for Italian language in social media. In: EVALITA 2018, Co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), vol. 2263 (2018)
Schnabel, T., Labutov, I., Mimno, D., Joachims, T.: Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 298–307 (2015)
Smith, E.A., Kincaid, J.P.: Derivation and validation of the automated readability index for use with technical materials. Hum. Factors 12(5), 457–464 (1970). https://doi.org/10.1177/001872087001200505
Stroube, B.: Literary freedom: project Gutenberg. XRDS: Crossroads, ACM Mag. Students 10(1), 3–3 (2003)
Yeoh, J.M., Caraffini, F., Homapour, E., Santucci, V., Milani, A.: A clustering system for dynamic data streams based on metaheuristic optimisation. Mathematics 7(12), 1229 (2019)
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Santucci, V., Bartoccini, U., Mengoni, P., Zanda, F. (2022). A Computational Measure for the Semantic Readability of Segmented Texts. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds) Computational Science and Its Applications – ICCSA 2022 Workshops. ICCSA 2022. Lecture Notes in Computer Science, vol 13377. Springer, Cham. https://doi.org/10.1007/978-3-031-10536-4_8
DOI: https://doi.org/10.1007/978-3-031-10536-4_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10535-7
Online ISBN: 978-3-031-10536-4
eBook Packages: Computer Science, Computer Science (R0)