Abstract
In widely spoken languages such as English or Spanish, some words are characteristic of a particular region. For example, cooker is typically used in the UK, while stove is preferred for the same concept in the US. Identifying the words particular to a region involves discriminating them from the set of words common to all regions. This poses a problem: a term's frequency should be salient enough for the term to be considered important, yet being common across regions dampens this salience. This is the well-known trade-off between Term Frequency (TF) and Inverse Document Frequency (IDF); nevertheless, typical TF·IDF applications do not include weighting factors. In this work we first propose several alternative formulae empirically and conclude that a broader search space needs to be explored. We therefore propose using Genetic Programming to find a suitable expression composed of TF and IDF terms that maximizes the discrimination of such terms, given a reduced bootstrapping set of examples labeled for each region (400). We present performance examples for the Spanish variations across the Americas and Spain.
Work done with support from CONACyT-SNI, Mexico, and SIP project IPN 20121202.
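The TF versus IDF trade-off described in the abstract can be illustrated with a minimal sketch that treats each region's text as one document. The function name `tf_idf_by_region`, the toy corpora, and the exponents `alpha` and `beta` are assumptions made for illustration only; they are not the formulae proposed in the paper, and setting both exponents to 1.0 gives plain, unweighted TF·IDF.

```python
import math
from collections import Counter

def tf_idf_by_region(corpora, alpha=1.0, beta=1.0):
    """Score each term per region as tf**alpha * idf**beta.

    corpora: dict mapping a region name to its list of tokens.
    alpha, beta: hypothetical non-negative weighting exponents
    (not taken from the paper); 1.0 and 1.0 give plain TF·IDF.
    """
    counts = {region: Counter(tokens) for region, tokens in corpora.items()}
    n_regions = len(corpora)

    # Document frequency: in how many regions does each term occur?
    df = Counter()
    for region_counts in counts.values():
        df.update(region_counts.keys())

    scores = {}
    for region, region_counts in counts.items():
        total = sum(region_counts.values())
        scores[region] = {
            # Relative frequency in the region, damped by how widespread the term is.
            term: (n / total) ** alpha * math.log(n_regions / df[term]) ** beta
            for term, n in region_counts.items()
        }
    return scores

# Toy example: two tiny "regional corpora" built around the abstract's example words.
corpora = {
    "UK": ["the", "cooker", "is", "hot"],
    "US": ["the", "stove", "is", "hot"],
}
scores = tf_idf_by_region(corpora)
uk_ranked = sorted(scores["UK"], key=scores["UK"].get, reverse=True)
```

In this toy setting the words shared by both regions receive an IDF of zero, so only the region-specific word keeps a positive score. Varying `alpha` and `beta` changes how strongly frequency and commonness each influence the ranking, which is one simple way to picture the kind of search space the abstract refers to.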
© 2014 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Calvo, H. (2014). Simple TF·IDF Is Not the Best You Can Get for Regionalism Classification. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2014. Lecture Notes in Computer Science, vol 8403. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-54906-9_8
DOI: https://doi.org/10.1007/978-3-642-54906-9_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-54905-2
Online ISBN: 978-3-642-54906-9
eBook Packages: Computer Science, Computer Science (R0)