Mining the Web for Idiomatic Expressions Using Metalinguistic Markers

  • Filip Graliński
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7499)


This paper presents methods for identifying and delimiting idiomatic expressions in large Web corpora. The methods rest on the observation that idiomatic expressions are sometimes accompanied by metalinguistic markers, e.g. the word “proverbial”, the expression “as they say”, or quotation marks. Although such idiom-related markers are not very frequent, a sufficiently large corpus makes it possible to identify new idiomatic expressions with them (only type identification of idiomatic expressions is discussed here, not token identification). We propose combining infrequent but reliable idiom-related markers (such as the word “proverbial”) with frequent but unreliable ones (such as quotation marks): the former can be used to identify idiom candidates, the latter to delimit them. Experiments estimating the upper bound on recall of the proposed methods are also presented. Although the paper is concerned with the identification and delimitation of Polish idiomatic expressions, the proposed approaches should also be feasible for other languages with sufficiently large Web corpora, English in particular.
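The two-stage idea described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the single marker "proverbial", the maximum candidate length of four words, and the rule "keep a candidate only if it also occurs as a whole quoted span" are all assumptions made for illustration, and English toy sentences stand in for a Polish Web corpus.

```python
import re
from collections import Counter

MARKER = "proverbial"  # assumed reliable (but infrequent) marker
MAX_LEN = 4            # assumed maximum candidate length, in words


def candidates_after_marker(sentences, marker=MARKER, max_len=MAX_LEN):
    """Stage 1 (identification): count word n-grams that directly follow the marker."""
    counts = Counter()
    for sentence in sentences:
        words = re.findall(r"\w+", sentence.lower())
        for i, word in enumerate(words):
            if word == marker:
                tail = words[i + 1 : i + 1 + max_len]
                # Every prefix of the tail is a possible idiom boundary.
                for n in range(1, len(tail) + 1):
                    counts[" ".join(tail[:n])] += 1
    return counts


def quoted_spans(sentences):
    """Count spans enclosed in straight or curly double quotation marks."""
    counts = Counter()
    for sentence in sentences:
        for span in re.findall(r'["“]([^"“”]+)["”]', sentence.lower()):
            counts[" ".join(re.findall(r"\w+", span))] += 1
    return counts


def delimit(sentences):
    """Stage 2 (delimitation): keep candidates that also occur as a whole quoted span."""
    candidates = candidates_after_marker(sentences)
    quoted = quoted_spans(sentences)
    return {c: quoted[c] for c in candidates if quoted[c] > 0}


if __name__ == "__main__":
    corpus = [
        "It was like looking for the proverbial needle in a haystack at night.",
        'Finding it is a "needle in a haystack" problem.',
        'She used the word "night" twice.',
    ]
    print(delimit(corpus))
```

In this toy corpus the marker proposes several candidate boundaries ("needle", "needle in", and so on), and the quoted occurrence selects one of them. The quoted word "night" is ignored because it never follows the marker, which shows why an unreliable marker is only used for delimitation here.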


Keywords: idiomatic expressions, Web mining





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Filip Graliński, Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poznań, Poland
