Abstract
Identifying complex words is a research problem with applications such as text simplification. Existing approaches either use the complete sentence in which a word appears or focus on the word alone. This paper falls into the latter category: it uses intra-word features to classify a word as either simple or complex. We propose a model termed CORDIF (COmplex woRD identification with Intra-word Features). The proposed methodology incorporates 19 intra-word features, which are used to train a machine learning model. A dataset, termed CWIdataset, is built with the proposed set of intra-word features. With this feature set, an accuracy of 84.75% was achieved. We then used the model to identify complex words for non-native speakers. From the results, we conclude that identifying complex words requires personalized systems.
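As a rough illustration of what "intra-word features" can look like, the sketch below computes a handful of features from the word's characters alone, without any sentence context. The abstract does not list the 19 features CORDIF uses, so the function name `intra_word_features` and every feature here (length, vowel and consonant counts, an approximate syllable count, distinct letters, hyphenation) are assumptions chosen for illustration, not the paper's feature set.

```python
import re

VOWELS = set("aeiou")


def intra_word_features(word: str) -> dict:
    """Compute a few illustrative intra-word features.

    Hypothetical feature set: these are NOT the 19 features used by CORDIF.
    """
    w = word.lower()
    letters = [c for c in w if c.isalpha()]
    n_vowels = sum(c in VOWELS for c in letters)
    # Crude syllable estimate: count maximal runs of vowel letters
    # (treating 'y' as a vowel); always at least 1.
    n_syllables = max(1, len(re.findall(r"[aeiouy]+", w)))
    return {
        "length": len(w),
        "vowels": n_vowels,
        "consonants": len(letters) - n_vowels,
        "syllables_est": n_syllables,
        "distinct_letters": len(set(letters)),
        "has_hyphen": int("-" in w),
    }


if __name__ == "__main__":
    for sample in ("cat", "identification"):
        print(sample, intra_word_features(sample))
```

A feature dictionary like this could be turned into a numeric vector and passed, together with simple/complex labels, to any standard classifier; which classifier and which labels to use are left open here.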
Copyright information
© 2019 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Pantula, M., Kuppusamy, K.S. (2019). CORDIF: A Machine Learning-Based Approach to Identify Complex Words Using Intra-word Feature Set. In: Ray, K., Sharan, S., Rawat, S., Jain, S., Srivastava, S., Bandyopadhyay, A. (eds) Engineering Vibration, Communication and Information Processing. Lecture Notes in Electrical Engineering, vol 478. Springer, Singapore. https://doi.org/10.1007/978-981-13-1642-5_26
DOI: https://doi.org/10.1007/978-981-13-1642-5_26
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-1641-8
Online ISBN: 978-981-13-1642-5
eBook Packages: Engineering, Engineering (R0)