A Computational Measure for the Semantic Readability of Segmented Texts

Santucci, Valentino; Bartoccini, Umberto; Mengoni, Paolo; Zanda, Fabio

doi:10.1007/978-3-031-10536-4_8

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13377))

Included in the following conference series:

International Conference on Computational Science and Its Applications

1705 Accesses
1 Altmetric

Abstract

In this paper we introduce a computational procedure for measuring the semantic readability of a segmented text. The procedure mainly consists of three steps. First, natural language processing tools and unsupervised machine learning techniques are adopted in order to obtain a vectorized numerical representation for any section or segment of the inputted text. Hence, similar or semantically related text segments are modeled by nearby points in a vector space, then the shortest and longest Hamiltonian paths passing through them are computed. Lastly, the lengths of these paths and that of the original ordering on the segments are combined into an arithmetic expression in order to derive an index, which may be used to gauge the semantic difficulty that a reader is supposed to experience when reading the text. A preliminary experimental study is conducted on seven classic narrative texts written in English, which were obtained from the well-known Gutenberg project. The experimental results appear to be in line with our expectations.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Log in via an institution

Chapter: USD 29.95; Price excludes VAT (USA)

eBook: USD 89.00; Price excludes VAT (USA)

Softcover Book: USD 119.99; Price excludes VAT (USA)

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Notes

1.
glove-wiki-gigaword-100 is available with the Python module Gensim [17] at the url https://github.com/RaRe-Technologies/gensim-data.
2.
Concorde is available from https://www.math.uwaterloo.ca/tsp/concorde.html.

References

Applegate, D., Bixby, R., Chvátal, V., Cook, W.: TSP cuts which do not conform to the template paradigm. In: Jünger, M., Naddef, D. (eds.) Computational Combinatorial Optimization. LNCS, vol. 2241, pp. 261–303. Springer, Heidelberg (2001). https://doi.org/10.1007/3-540-45586-8_7
Chapter Google Scholar
Baioletti, M., Milani, A., Santucci, V., Bartoccini, U.: An experimental comparison of algebraic differential evolution using different generating sets. In: Proceedings of the Genetic and Evolutionary Computation Conference Companion, GECCO 2019, pp. 1527–1534. (2019). https://doi.org/10.1145/3319619.3326854
Banerjee, A., Dhillon, I.S., Ghosh, J., Sra, S., Ridgeway, G.: Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learning Res. 6(9), 1345–1382 (2005)
MathSciNet MATH Google Scholar
Barvinok, A., Gimadi, E.K., Serdyukov, A.I.: The maximum TSP. In: Gutin, G., Punnen, A.P. (eds.) The Traveling Salesman Problem and Its Variations. Combinatorial Optimization, vol. 12, pp. 585–607. Springer, Boston (2007). https://doi.org/10.1007/0-306-48213-4_12
Chapter MATH Google Scholar
Calfee, R.C., Curley, R.: Structures of prose in content areas. In: Understanding Reading Comprehension, pp. 161–180 (1984)
Google Scholar
Chowdhary, K.: Natural language processing. Fundamentals of Artificial Intelligence, pp. 603–649 (2020)
Google Scholar
Church, K.W.: Word2vec. Nat. Lang. Eng. 23(1), 155–162 (2017)
Article Google Scholar
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Dieng, A.B., Ruiz, F.J., Blei, D.M.: Topic modeling in embedding spaces. Trans. Assoc. Comput. Linguis. 8, 439–453 (2020)
Article Google Scholar
DuBay, W.H.: The principles of readability. Online Submission (2004)
Google Scholar
Forti, L., Grego Bolli, G., Santarelli, F., Santucci, V., Spina, S.: MALT-IT2: a new resource to measure text difficulty in light of CEFR levels for Italian L2 learning. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, May 2020, pp. 7204–7211. European Language Resources Association (2020). https://aclanthology.org/2020.lrec-1.890
Forti, L., Milani, A., Piersanti, L., Santarelli, F., Santucci, V., Spina, S.: Measuring text complexity for Italian as a second language learning purposes. In: Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications. Florence, Italy, August 2019, pp. 360–368. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/W19-4438
Gourru, A., Guille, A., Velcin, J., Jacques, J.: Document network projection in pretrained word embedding space. In: Jose, J.M., et al. (eds.) ECIR 2020. LNCS, vol. 12036, pp. 150–157. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-45442-5_19
Chapter Google Scholar
Graff, D., Kong, J., Chen, K., Maeda, K.: English Gigaword. Linguis. Data Consortium Philadelphia 4(1), 34 (2003)
Google Scholar
Jones, M.J., Shoemaker, P.A.: Accounting narratives: a review of empirical studies of content and readability. J. Acc. Lit. 13, 142 (1994)
Google Scholar
Jünger, M., Reinelt, G., Rinaldi, G.: The traveling salesman problem. In: Handbooks in Operations Research and Management Science, vol. 7, pp. 225–330 (1995)
Google Scholar
Khosrovian, K., Pfahl, D., Garousi, V.: GENSIM 2.0: a customizable process simulation model for software process evaluation. In: Wang, Q., Pfahl, D., Raffo, D.M. (eds.) ICSP 2008. LNCS, vol. 5007, pp. 294–306. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-79588-9_26
Chapter Google Scholar
Kwolek, W.F.: A readability survey of technical and popular literature. Journalism Q. 50(2), 255–264 (1973). https://doi.org/10.1177/107769907305000206
Article Google Scholar
Le, Q., Mikolov, T.: Distributed representations of sentences and documents. In: International Conference on Machine learning, pp. 1188–1196. PMLR (2014)
Google Scholar
Li, B., Han, L.: Distance weighted cosine similarity measure for text classification. In: Yin, H., et al. (eds.) IDEAL 2013. LNCS, vol. 8206, pp. 611–618. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-41278-3_74
Chapter Google Scholar
Li, Y., Yang, T.: Word embedding for understanding natural language: a survey. In: Srinivasan, S. (ed.) Guide to Big Data Applications. SBD, vol. 26, pp. 83–104. Springer, Cham (2018). https://doi.org/10.1007/978-3-319-53817-4_4
Chapter Google Scholar
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Google Scholar
Rahman, M.S., Kaykobad, M.: On Hamiltonian cycles and Hamiltonian paths. Inf. Process. Lett. 94(1), 37–41 (2005)
Article MathSciNet MATH Google Scholar
Ruder, S., Peters, M.E., Swayamdipta, S., Wolf, T.: Transfer learning in natural language processing. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Tutorials, pp. 15–18 (2019)
Google Scholar
Santucci, V., Baioletti, M., Milani, A.: An algebraic differential evolution for the linear ordering problem. In: Companion Material Proceedings of Genetic and Evolutionary Computation Conference, GECCO 2015, pp. 1479–1480 (2015). https://doi.org/10.1145/2739482.2764693
Santucci, V., Ceberio, J.: Using pairwise precedences for solving the linear ordering problem. Appl. Soft Comput. 87, 105998 (2020). https://doi.org/10.1016/j.asoc.2019.105998
Article Google Scholar
Santucci, V., Forti, L., Santarelli, F., Spina, S., Milani, A.: Learning to classify text complexity for the Italian language using support vector machines. In: Gervasi, O., et al. (eds.) ICCSA 2020. LNCS, vol. 12250, pp. 367–376. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58802-1_27
Chapter Google Scholar
Santucci, V., Santarelli, F., Forti, L., Spina, S.: Automatic classification of text complexity. Appl. Sci. 10(20) (2020). https://doi.org/10.3390/app10207285, https://www.mdpi.com/2076-3417/10/20/7285
Santucci, V., Spina, S., Milani, A., Biondi, G., Di Bari, G.: Detecting hate speech for Italian language in social media. In: EVALITA 2018, Co-located with the Fifth Italian Conference on Computational Linguistics (CLiC-it 2018), vol. 2263 (2018)
Google Scholar
Schnabel, T., Labutov, I., Mimno, D., Joachims, T.: Evaluation methods for unsupervised word embeddings. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 298–307 (2015)
Google Scholar
Smith, E.A., Kincaid, J.P.: Derivation and validation of the automated readability index for use with technical materials. Hum. Factors 12(5), 457–564 (1970). https://doi.org/10.1177/001872087001200505
Article Google Scholar
Stroube, B.: Literary freedom: project Gutenberg. XRDS: Crossroads, ACM Mag. Students 10(1), 3–3 (2003)
Article Google Scholar
Yeoh, J.M., Caraffini, F., Homapour, E., Santucci, V., Milani, A.: A clustering system for dynamic data streams based on metaheuristic optimisation. Mathematics 7(12), 1229 (2019)
Article Google Scholar

Download references

Author information

Authors and Affiliations

University for Foreigners of Perugia, Perugia, Italy
Valentino Santucci, Umberto Bartoccini & Fabio Zanda
School of Communication and Film, Hong Kong Baptist University, Hong Kong, China
Paolo Mengoni

Authors

Valentino Santucci
View author publications
You can also search for this author in PubMed Google Scholar
Umberto Bartoccini
View author publications
You can also search for this author in PubMed Google Scholar
Paolo Mengoni
View author publications
You can also search for this author in PubMed Google Scholar
Fabio Zanda
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Valentino Santucci .

Editor information

Editors and Affiliations

University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Potenza, Italy
Beniamino Murgante
Østfold University College, Halden, Norway
Sanjay Misra
University of Minho, Braga, Portugal
Ana Maria A. C. Rocha
University of Cagliari, Cagliari, Italy
Chiara Garau

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Santucci, V., Bartoccini, U., Mengoni, P., Zanda, F. (2022). A Computational Measure for the Semantic Readability of Segmented Texts. In: Gervasi, O., Murgante, B., Misra, S., Rocha, A.M.A.C., Garau, C. (eds) Computational Science and Its Applications – ICCSA 2022 Workshops. ICCSA 2022. Lecture Notes in Computer Science, vol 13377. Springer, Cham. https://doi.org/10.1007/978-3-031-10536-4_8

Download citation

DOI: https://doi.org/10.1007/978-3-031-10536-4_8
Published: 23 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-10535-7
Online ISBN: 978-3-031-10536-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

A Computational Measure for the Semantic Readability of Segmented Texts