Mining the Web for Idiomatic Expressions Using Metalinguistic Markers

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 7499)

Abstract

This paper presents methods for identifying and delimiting idiomatic expressions in large Web corpora. The methods rest on the observation that idiomatic expressions are sometimes accompanied by metalinguistic markers, e.g. the word “proverbial”, the expression “as they say”, or quotation marks. Although such idiom-related markers are not very frequent, a sufficiently large corpus makes it possible to identify new idiomatic expressions with them (only type identification of idiomatic expressions is discussed here, not token identification). We propose combining infrequent but reliable idiom-related markers (such as the word “proverbial”) with frequent but unreliable ones (such as quotation marks): the former can be used to identify idiom candidates, the latter to delimit them. Experiments estimating an upper bound on the recall of the proposed methods are also presented. Although the paper is concerned with the identification and delimitation of Polish idiomatic expressions, the proposed approaches should also be feasible for other languages with sufficiently large Web corpora, English in particular.
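The abstract gives no algorithmic detail, so the following is only a minimal sketch of how the two kinds of markers might be combined, not the method used in the paper. It assumes English text, straight double quotes, a window of up to five words after “proverbial”, and a longest-quoted-prefix rule for delimitation; all function names and the `MAX_WORDS` constant are hypothetical.

```python
import re
from collections import Counter

MAX_WORDS = 5  # assumed window size after the marker; not taken from the paper

# Reliable but infrequent marker: the word "proverbial" followed by a few words.
MARKER = re.compile(
    r"\bproverbial\s+((?:\w+\s+){0,%d}\w+)" % (MAX_WORDS - 1), re.IGNORECASE
)
# Frequent but unreliable marker: anything between straight double quotes.
QUOTED = re.compile(r'"([^"]+)"')


def candidate_windows(sentences):
    """Collect the word windows that follow the reliable marker."""
    windows = []
    for sentence in sentences:
        for match in MARKER.finditer(sentence):
            windows.append(match.group(1).lower().split())
    return windows


def quoted_spans(sentences):
    """Count every quoted span in the corpus, as a tuple of lowercased words."""
    counts = Counter()
    for sentence in sentences:
        for match in QUOTED.finditer(sentence):
            counts[tuple(match.group(1).lower().split())] += 1
    return counts


def delimit(window, quoted):
    """Return the longest prefix of the window seen in quotes, or None."""
    for end in range(len(window), 0, -1):
        prefix = tuple(window[:end])
        if quoted[prefix] > 0:
            return " ".join(prefix)
    return None


def mine_idioms(sentences):
    """Identify candidates with the reliable marker, delimit them with quotes."""
    quoted = quoted_spans(sentences)
    found = set()
    for window in candidate_windows(sentences):
        idiom = delimit(window, quoted)
        if idiom is not None:
            found.add(idiom)
    return sorted(found)


if __name__ == "__main__":
    corpus = [
        "That was only the proverbial tip of the iceberg that nobody saw.",
        'Analysts called it the "tip of the iceberg" again.',
        'She said "hello there" and left.',
    ]
    print(mine_idioms(corpus))
```

In this sketch the quoted span "hello there" is never proposed on its own, because quotation marks are used only to find the boundary of a candidate that the reliable marker has already flagged.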

Keywords

  • idiomatic expressions
  • Web mining




Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Graliński, F. (2012). Mining the Web for Idiomatic Expressions Using Metalinguistic Markers. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2012. Lecture Notes in Computer Science, vol 7499. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32790-2_13

  • DOI: https://doi.org/10.1007/978-3-642-32790-2_13

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-32789-6

  • Online ISBN: 978-3-642-32790-2

  • eBook Packages: Computer Science (R0)