Unsupervised Learning of Multiword Units from Part-of-Speech Tagged Corpora: Does Quantity Mean Quality?

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNAI, volume 3808)

Abstract

This paper describes an original hybrid system that extracts multiword unit candidates from part-of-speech tagged corpora. Whereas classical hybrid systems rely on manually defined local part-of-speech patterns to identify well-known multiword units (mainly compound nouns), our system automatically identifies relevant syntactic patterns from the corpus. Word statistics are then combined with this endogenously acquired linguistic information to extract the most relevant sequences of words. As a result, (1) human intervention is avoided, which makes the system fully flexible to use, and (2) other kinds of multiword units, such as phrasal verbs, adverbial locutions and prepositional locutions, may be identified. Finally, we propose an exhaustive evaluation of our architecture on the multi-domain, bilingual Slovene-English IJS-ELAN corpus, which yields surprising results. To our knowledge, this challenge has never been attempted before.
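
The general idea of combining word statistics with part-of-speech statistics learned from the same corpus can be illustrated with a minimal sketch. The abstract does not say which association measure the system uses or how the two levels are combined, so the Dice coefficient and the simple product below are assumptions chosen for illustration, and the names `dice_scores`, `rank_candidates` and `toy_corpus` are hypothetical. The sketch also handles only two-word sequences.

```python
from collections import Counter


def dice_scores(sequences):
    """Dice coefficient of every adjacent pair found in the token sequences."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq)
        bigrams.update(zip(seq, seq[1:]))
    return {
        pair: 2 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


def rank_candidates(tagged_sentences):
    """Rank adjacent (word, tag) pairs by word-level Dice times tag-level Dice.

    Assumption: the product is only one possible way to combine the two
    levels; it is not taken from the paper.
    """
    words = [[w for w, _ in s] for s in tagged_sentences]
    tags = [[t for _, t in s] for s in tagged_sentences]
    word_dice = dice_scores(words)  # lexical association
    tag_dice = dice_scores(tags)    # tag-pattern association, learned from the corpus

    scores = {}
    for sentence in tagged_sentences:
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            scores[((w1, t1), (w2, t2))] = word_dice[(w1, w2)] * tag_dice[(t1, t2)]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Tiny hand-made corpus, for illustration only.
toy_corpus = [
    [("machine", "NN"), ("translation", "NN"), ("works", "VB")],
    [("machine", "NN"), ("translation", "NN"), ("fails", "VB")],
    [("the", "DT"), ("machine", "NN"), ("works", "VB")],
]

for candidate, score in rank_candidates(toy_corpus):
    print(candidate, round(score, 3))
```

Because the tag-pair statistics are computed from the tagged corpus itself, no part-of-speech pattern has to be written by hand, which is the property the abstract emphasizes. On a corpus this small the ranking is not meaningful; the sketch shows only the mechanics.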

Keywords

  • Association Rule
  • Normalize Expectation
  • Language Resource
  • Textual Unit
  • Relevant Sequence

These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dias, G., Vintar, Š. (2005). Unsupervised Learning of Multiword Units from Part-of-Speech Tagged Corpora: Does Quantity Mean Quality?. In: Bento, C., Cardoso, A., Dias, G. (eds) Progress in Artificial Intelligence. EPIA 2005. Lecture Notes in Computer Science, vol 3808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11595014_65

  • DOI: https://doi.org/10.1007/11595014_65

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-30737-2

  • Online ISBN: 978-3-540-31646-6

  • eBook Packages: Computer Science, Computer Science (R0)
