Using LocalMaxs Algorithm for the Extraction of Contiguous and Non-contiguous Multiword Lexical Units

  • Joaquim Ferreira da Silva
  • Gaël Dias
  • Sylvie Guilloré
  • José Gabriel Pereira Lopes
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 1695)

Abstract

Including contiguous and non-contiguous multiword lexical units (MWUs) in Natural Language Processing (NLP) lexica enhances parsing precision, helps attachment decisions, improves indexing in information retrieval (IR) systems, and strengthens information extraction (IE) and text mining, among other applications. Their acquisition, however, has long been a significant problem in NLP, IR and IE. In this paper we propose two new association measures, the Symmetric Conditional Probability (SCP) and the Mutual Expectation (ME), for the extraction of contiguous and non-contiguous MWUs. Both measures are used by a new algorithm, LocalMaxs, that requires neither empirically obtained thresholds nor complex linguistic filters. We assess the results obtained with both measures by comparing them with reference association measures (the Specific Mutual Information, φ², Dice and Log-Likelihood coefficients) over a multilingual parallel corpus. An additional experiment was carried out over a part-of-speech tagged Portuguese corpus to extract contiguous compound verbs.
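The abstract names the SCP measure without stating its formula. As a rough illustration only, the sketch below uses the bigram form usually given for it, SCP(x, y) = p(x, y)² / (p(x) · p(y)), which is the product of the two conditional probabilities p(x | y) and p(y | x). The function name `scp_bigrams` is hypothetical, and estimating every probability as a relative frequency over the same corpus size (so that the size cancels) is an assumption of this sketch, not something the abstract specifies. The generalisation to longer n-grams, the ME measure and the LocalMaxs selection step are not covered here.

```python
from collections import Counter


def scp_bigrams(tokens):
    """Return {(x, y): SCP score} for each contiguous bigram in `tokens`.

    Assumption: p(x), p(y) and p(x, y) are relative frequencies over the
    same corpus size N, so N cancels and
    SCP(x, y) = count(x, y)**2 / (count(x) * count(y)).
    """
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), c_xy in bigram_counts.items():
        scores[(x, y)] = c_xy ** 2 / (unigram_counts[x] * unigram_counts[y])
    return scores


if __name__ == "__main__":
    # Toy corpus, invented for illustration.
    toy = "new york is big and new york is old".split()
    for pair, score in sorted(scp_bigrams(toy).items()):
        print(pair, round(score, 3))
```

In this toy corpus "new" and "york" only ever occur together, so the pair scores 1.0, while a looser pair such as ("is", "big") scores 0.5.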

Copyright information

© Springer-Verlag Berlin Heidelberg 1999

Authors and Affiliations

  • Joaquim Ferreira da Silva (1)
  • Gaël Dias (1)
  • Sylvie Guilloré (2)
  • José Gabriel Pereira Lopes (1)

  1. Departamento de Informática, Universidade Nova de Lisboa, Faculdade de Ciências e Tecnologia, Monte da Caparica, Portugal
  2. Laboratoire d’Informatique Fondamentale d’Orléans, Université d’Orléans, Orléans Cédex 2, France
