The IdiomSearch Experiment: Extracting Phraseology from a Probabilistic Network of Constructions

  • Conference paper
Computational and Corpus-Based Phraseology (EUROPHRAS 2017)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10596)

Abstract

This paper reports the preliminary results of a large-scale experiment on the extraction of phraseological units (PUs, also called idioms) from large web corpora in four languages (English, Spanish, French and Chinese). A new algorithm based on metric clustering techniques, combined with optimized database storage and interaction with users and researchers through a web application, made it possible to reach high precision scores for the most common PUs in the four languages; further experimentation is still needed to establish recall levels for long n-grams. In the meantime, the freely accessible web application makes it possible to visualize the high proportion of phraseology in the broad sense (or formulaic language): PUs made up about 30 to 60% of the newspaper articles tested in the experiments. The most surprising results, however, came from Chinese: because the algorithm had to be adapted to take into account the associations between morphemes, the methodology made it possible to partly confirm, from a statistical point of view, one of the major claims of construction grammar: the existence of a probabilistic network of constructions, from morphemes to idiomatic phrases.

Notes

  1. The web corpora used in the IdiomSearch experiment were assembled using the WebBootCat tool provided by the Sketch Engine (http://sketchengine.co.uk), on the basis of seed words and following the methodology described in [2].

  2. For Chinese, the corpus does not consist of 200 million Chinese characters (hans), but of 200 million Chinese words (as tokens).

  3. IdiomSearch is accessible on the web at: http://idiomsearch.LSTI.ucl.ac.be.

  4. The Guardian, http://www.theguardian.com, 7 August 2017.

  5. Athelstan Homepage, http://www.athel.com/cspatg.html, last accessed 2017/08/09.

  6. The computational issue is well known: many web pages contain Unicode errors. The robot assumes that the downloaded web page is in Unicode, but the errors remain and appear in the web corpus.

  7. shū zhōng zì yǒu huángjīn wū, "A book holds a house of gold."

  8. According to Wikipedia, English accounts for 51.6% of all web pages, Spanish for 5.1%, French for 4.1%, and Chinese for 2.0%. Wikipedia homepage, https://en.wikipedia.org/wiki/Languages_used_on_the_Internet, last accessed 2017/08/17.

References

  1. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press/Addison Wesley, New York (1999)

  2. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky wide web: a collection of very large linguistically processed web-crawled corpora. Lang. Resour. Eval. 43, 209–226 (2009)

  3. Booij, G.: Morphology in construction grammar. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 255–273. Oxford University Press, Oxford/New York (2013)

  4. Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N. (eds.): Phraseologie/Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung/An International Handbook of Contemporary Research. De Gruyter, Berlin/New York (2007)

  5. Colson, J.-P.: The World Wide Web as a corpus for set phrases. In: Burger, H., Dobrovol’skij, D., Kühn, P., Norrick, N. (eds.) Phraseologie/Phraseology. Ein internationales Handbuch der zeitgenössischen Forschung/An International Handbook of Contemporary Research, pp. 1071–1077. De Gruyter, Berlin/New York (2007)

  6. Colson, J.-P.: Set phrases around globalization: an experiment in corpus-based computational phraseology. In: Alonso Almeida, F., Ortega Barrera, I., Quintana Toledo, E., Sanchez Cuervo, M.E. (eds.) Input a Word, Analyze the World. Selected Approaches to Corpus Linguistics, pp. 141–152. Cambridge Scholars Publishing, Newcastle (2016)

  7. Croft, W.: Radical Construction Grammar: Syntactic Theory in Typological Perspective. Oxford University Press, Oxford (2001)

  8. Croft, W.: Radical construction grammar. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 211–232. Oxford University Press, Oxford/New York (2013)

  9. Fillmore, C.J.: The mechanisms of construction grammar. Berkeley Linguistic Soc. 14, 35–55 (1988)

  10. Goldberg, A.: Constructions: A Construction Grammar Approach to Argument Structure. University of Chicago Press, Chicago (1995)

  11. Goldberg, A.: Constructions: a new theoretical approach to language. Trends Cogn. Sci. 7(5), 219–224 (2003)

  12. Goldberg, A.: Constructions at Work: The Nature of Generalization in Language. Oxford University Press, Oxford (2006)

  13. Gries, S.: 50-something years of work on collocations. What is or should be next …. Int. J. Corpus Linguist. 18, 137–165 (2013)

  14. Gries, S.: Data in construction grammar. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 93–108. Oxford University Press, Oxford/New York (2013)

  15. Gries, S., Stefanowitsch, A.: Extending collostructional analysis: a corpus-based perspective on ‘Alternations’. Int. J. Corpus Linguist. 9(1), 97–129 (2004)

  16. Henry, K.: Les chengyu du chinois: caractérisation de phrasèmes hors norme. Yearb. Phraseology 7, 99–126 (2016)

  17. Hoffmann, T., Trousdale, G. (eds.): The Oxford Handbook of Construction Grammar. Oxford University Press, Oxford/New York (2013)

  18. Manning, C.D., Raghavan, P., Schütze, H.: An Introduction to Information Retrieval. Cambridge University Press, Cambridge (2009)

  19. Moon, R.: Fixed Expressions and Idioms in English. Clarendon Press, Oxford (1998)

  20. Sinclair, J.: Corpus, Concordance, Collocation. Oxford University Press, Oxford (1991)

  21. Stefanowitsch, A.: Collostructional analysis. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 290–306. Oxford University Press, Oxford/New York (2013)

  22. Wray, A.: Formulaic Language: Pushing the Boundaries. Oxford University Press, Oxford (2008)

  23. Wulff, S.: Words and idioms. In: Hoffmann, T., Trousdale, G. (eds.) The Oxford Handbook of Construction Grammar, pp. 274–289. Oxford University Press, Oxford/New York (2013)

Author information

Correspondence to Jean-Pierre Colson.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Colson, J.-P. (2017). The IdiomSearch Experiment: Extracting Phraseology from a Probabilistic Network of Constructions. In: Mitkov, R. (ed.) Computational and Corpus-Based Phraseology. EUROPHRAS 2017. Lecture Notes in Computer Science, vol 10596. Springer, Cham. https://doi.org/10.1007/978-3-319-69805-2_2

  • DOI: https://doi.org/10.1007/978-3-319-69805-2_2

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-69804-5

  • Online ISBN: 978-3-319-69805-2

  • eBook Packages: Computer Science, Computer Science (R0)
