Abstract
This paper describes an original hybrid system that extracts multiword unit candidates from part-of-speech tagged corpora. Whereas classical hybrid systems rely on manually defined local part-of-speech patterns that identify well-known multiword units (mainly compound nouns), we automatically identify relevant syntactic patterns from the corpus itself. Word statistics are then combined with this endogenously acquired linguistic information in order to extract the most relevant sequences of words. As a result, (1) human intervention is avoided, which makes the system fully flexible to use, and (2) other kinds of multiword units, such as phrasal verbs, adverbial locutions and prepositional locutions, may be identified. Finally, we propose an exhaustive evaluation of our architecture on the multi-domain, bilingual Slovene-English IJS-ELAN corpus, which yields surprising results. To our knowledge, this challenge has never been attempted before.
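The idea of combining word statistics with part-of-speech patterns learned from the corpus can be illustrated with a minimal sketch. It is not the system described in this paper: the association measure (the Dice coefficient), the restriction to adjacent word pairs, the combination by multiplication, and the names `dice_scores` and `rank_candidates` are all assumptions chosen for brevity.

```python
from collections import Counter


def dice_scores(sequence):
    """Dice coefficient of every adjacent pair in a sequence of symbols."""
    unigrams = Counter(sequence)
    bigrams = Counter(zip(sequence, sequence[1:]))
    return {
        pair: 2 * count / (unigrams[pair[0]] + unigrams[pair[1]])
        for pair, count in bigrams.items()
    }


def rank_candidates(tagged_tokens):
    """Rank adjacent word pairs by word association times tag association.

    tagged_tokens is a list of (word, tag) tuples. The tag-level score is
    computed from the corpus itself, so no tag pattern is written by hand.
    Multiplying the two scores is an illustrative assumption.
    """
    words = [word.lower() for word, _ in tagged_tokens]
    tags = [tag for _, tag in tagged_tokens]
    word_scores = dice_scores(words)
    tag_scores = dice_scores(tags)

    combined = {}
    for i in range(len(tagged_tokens) - 1):
        word_pair = (words[i], words[i + 1])
        tag_pair = (tags[i], tags[i + 1])
        score = word_scores[word_pair] * tag_scores[tag_pair]
        # A word pair may occur with several tag pairs; keep its best score.
        combined[word_pair] = max(combined.get(word_pair, 0.0), score)
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy corpus with hypothetical Penn-style tags.
    sample = [
        ("he", "PRP"), ("gave", "VBD"), ("up", "RP"), ("and", "CC"),
        ("she", "PRP"), ("gave", "VBD"), ("up", "RP"), ("too", "RB"),
    ]
    for pair, score in rank_candidates(sample):
        print(pair, round(score, 3))
```

On this toy input the phrasal verb "gave up" ranks first because both its words and its tag pair (verb followed by particle) co-occur consistently, even though no verb-particle pattern was specified in advance.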
Keywords
- Association Rule
- Normalize Expectation
- Language Resource
- Textual Unit
- Relevant Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dias, G., Vintar, Š. (2005). Unsupervised Learning of Multiword Units from Part-of-Speech Tagged Corpora: Does Quantity Mean Quality?. In: Bento, C., Cardoso, A., Dias, G. (eds) Progress in Artificial Intelligence. EPIA 2005. Lecture Notes in Computer Science, vol 3808. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11595014_65
DOI: https://doi.org/10.1007/11595014_65
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-30737-2
Online ISBN: 978-3-540-31646-6
eBook Packages: Computer Science, Computer Science (R0)
