Abstract
This paper reports the preliminary results of a large-scale experiment on the extraction of PUs (phraseological units, also called idioms) from large web corpora in four languages (English, Spanish, French and Chinese). A new algorithm based on metric clustering techniques, combined with optimized database storage and with interaction with users and researchers through a web application, made it possible to reach high precision scores for the most common PUs in the four languages; further experimentation is still necessary to establish recall levels for long n-grams. In the meantime, the freely accessible web application makes it possible to visualize the high proportion of phraseology in the broad sense (or of formulaic language): about 30 to 60% of the text of the newspaper articles tested in the experiments consisted of PUs. The most surprising results, however, came from Chinese: because the algorithm had to be adapted to take into account the associations between morphemes, the methodology made it possible to partly confirm, from a statistical point of view, one of the major claims of construction grammar: the existence of a probabilistic network of constructions, from morphemes to idiomatic phrases.
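The proportion of PUs reported above can be read as a token-coverage ratio. The following is only a minimal sketch under stated assumptions, not the IdiomSearch implementation: the function name `pu_coverage`, the whitespace tokenization and the toy PU list are all hypothetical, and the PUs are assumed to have been extracted beforehand.

```python
from typing import Iterable, Sequence, Tuple


def pu_coverage(tokens: Sequence[str], pus: Iterable[Tuple[str, ...]]) -> float:
    """Return the fraction of tokens that lie inside at least one PU match.

    Hypothetical helper: `pus` is assumed to be a collection of already
    extracted phraseological units, each given as a tuple of tokens.
    """
    pu_set = {tuple(p) for p in pus}
    if not tokens or not pu_set:
        return 0.0
    # Try every PU length at every start position.
    lengths = sorted({len(p) for p in pu_set}, reverse=True)
    covered = [False] * len(tokens)
    for start in range(len(tokens)):
        for n in lengths:
            if tuple(tokens[start:start + n]) in pu_set:
                for i in range(start, start + n):
                    covered[i] = True
    return sum(covered) / len(tokens)


# Toy example with an assumed PU list (illustrative only).
sentence = "he kicked the bucket at the end of the day".split()
toy_pus = [
    ("kicked", "the", "bucket"),
    ("at", "the", "end", "of", "the", "day"),
]
ratio = pu_coverage(sentence, toy_pus)
print(f"{ratio:.0%} of the tokens belong to a PU")
```

Overlapping matches are counted once per token, so the ratio never exceeds 1. A real system would need language-specific tokenization, in particular for Chinese.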
Notes
1. The web corpora used in the IdiomSearch experiment were assembled using the WebBootCat tool provided by the Sketch Engine (http://sketchengine.co.uk), on the basis of seed words and following the methodology described in [2].
2. For Chinese, the corpus does not consist of 200 million Chinese characters (hans), but of 200 million Chinese words (as tokens).
3. IdiomSearch is accessible on the web at: http://idiomsearch.LSTI.ucl.ac.be.
4. The Guardian, http://www.theguardian.com, 7 August 2017.
5. Athelstan Homepage, http://www.athel.com/cspatg.html, last accessed 2017/08/09.
6. The computational issue is well known: many web pages contain Unicode errors; the robot assumes that the downloaded web page is in Unicode, but the errors remain and appear in the web corpus.
7. shū zhōng zì yǒu huángjīn wū, "A book holds a house of gold."
8. According to Wikipedia, English accounts for 51.6% of all web pages, Spanish for 5.1%, French for 4.1%, and Chinese for 2.0%. Wikipedia homepage, https://en.wikipedia.org/wiki/Languages_used_on_the_Internet, last accessed 2017/08/17.
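The encoding problem mentioned in note 6 can be illustrated with a short check. This is a hedged sketch and not part of the IdiomSearch pipeline: the function names and the 1% threshold are assumptions chosen for illustration. It decodes the raw bytes as UTF-8 and measures how many characters had to be replaced by U+FFFD.

```python
def replacement_ratio(raw: bytes) -> float:
    """Decode raw bytes as UTF-8 and return the share of U+FFFD characters.

    Invalid byte sequences are replaced by U+FFFD (the Unicode replacement
    character), so a high ratio suggests the page was not really UTF-8.
    """
    text = raw.decode("utf-8", errors="replace")
    if not text:
        return 0.0
    return text.count("\ufffd") / len(text)


def looks_clean(raw: bytes, max_ratio: float = 0.01) -> bool:
    """Hypothetical filter: keep a page only if few characters were replaced."""
    return replacement_ratio(raw) <= max_ratio


# A Latin-1 encoded page wrongly treated as UTF-8 produces replacement characters.
good_page = "café".encode("utf-8")
bad_page = "café".encode("latin-1")
print(looks_clean(good_page), looks_clean(bad_page))
```

Such a check only detects undecodable bytes; text that was decoded with the wrong encoding and then re-saved as valid UTF-8 would pass unnoticed.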
References
Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press/Addison Wesley, New York (1999)
Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. J. Lang. Res. Eval. 43, 209–226 (2009)
Booij, G.: Morphology in construction grammar. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 255–273. Oxford University Press, Oxford/New York (2013)
Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N. (eds.): Phraseologie/Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung/An International Handbook of Contemporary Research. De Gruyter, Berlin/New York (2007)
Colson, J.-P.: The World Wide Web as a corpus for set phrases. In: Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N. (eds.) Phraseologie/Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung/An International Handbook of Contemporary Research, pp. 1071–1077. De Gruyter, Berlin/New York (2007)
Colson, J.-P.: Set phrases around globalization: an experiment in corpus-based computational phraseology. In: Alonso Almeida, F., Ortega Barrera, I., Quintana Toledo, E., Sanchez Cuervo, M.E. (eds.) Input a Word, Analyze the World. Selected Approaches to Corpus Linguistics. Cambridge Scholars Publishing, Newcastle, pp. 141–152 (2016)
Croft, W.: Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press, Oxford (2001)
Croft, W.: Radical construction grammar. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 211–232. Oxford University Press, Oxford/New York (2013)
Fillmore, C.H.: The mechanisms of construction grammar. Berkeley Linguistic Soc. 14, 35–55 (1988)
Goldberg, A.: Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago (1995)
Goldberg, A.: Constructions: a new theoretical approach to language. Trends Cogn. Sci. 7(5), 219–224 (2003)
Goldberg, A.: Constructions at Work: The Nature of Generalization in Language. Oxford University Press, Oxford (2006)
Gries, S.: 50-something years of work on collocations. What is or should be next …. Int. J. Corpus Linguist. 18, 137–165 (2013)
Gries, S.: Data in construction grammar. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 93–108. Oxford University Press, Oxford/New York (2013)
Gries, S., Stefanowitsch, A.: Extending collostructional analysis: a corpus-based perspective on ‘Alternations’. Int. J. Corpus Linguist. 9(1), 97–129 (2004)
Henry, K.: Les chengyu du chinois: caractérisation de phrasèmes hors norme. Yearb. Phraseology 7, 99–126 (2016)
Hoffmann, T.H., Trousdale, G. (eds.): The Oxford Handbook of Construction Grammar. Oxford University Press, Oxford/New York (2013)
Manning, C.H., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009)
Moon, R.: Fixed Expressions and Idioms in English. Clarendon Press, Oxford (1998)
Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)
Stefanowitsch, A.: Collostructional analysis. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 290–306. Oxford University Press, Oxford/New York (2013)
Wray, A.: Formulaic Language: Pushing the Boundaries. Oxford University Press, Oxford (2008)
Wulff, S.: Words and idioms. In: Hoffmann, T.H., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 274–289. Oxford University Press, Oxford/New York (2013)
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Colson, J.-P. (2017). The IdiomSearch Experiment: Extracting Phraseology from a Probabilistic Network of Constructions. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_2
DOI: https://doi.org/10.1007/978-3-319-69805-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69804-5
Online ISBN: 978-3-319-69805-2
eBook Packages: Computer Science, Computer Science (R0)