Bridging Collocational and Syntactic Analysis

Seretan, Violeta

doi:10.1007/978-3-319-92582-0_2

Part of the book series: Quantitative Methods in the Humanities and Social Sciences (QMHSS)

Abstract

The advent of the computer era, which enabled the development of large text corpora and of sophisticated corpus processing tools, led to unprecedented advances in the area of collocational analysis. These advances were paralleled by significant achievements in the area of syntactic analysis, with parsing technologies becoming available for an increasing number of languages. But more often than not, these developments have taken place independently. The coupling of collocational and syntactic analyses has seldom been considered, despite the fact that one type of analysis could benefit the other. In this chapter, we focus on the integration of syntactic parsing and collocational analysis. First, we review the literature describing syntactically-informed approaches to collocation extraction. Second, we survey the work devoted to exploiting collocational resources for syntactic parsing. Finally, we refer to more recent work that proposes a joint approach to collocational and syntactic analysis, arguing that the two analyses are interdependent to such a degree that only a simultaneous process, one in which structure decoding and pattern identification go hand in hand, can provide a solid bridge between them.

Notes

1. The same metaphor is used by Uhrig and Proisl (2012).
2. Although there is no generally accepted list of collocation patterns (as it is widely accepted that the parameters of a collocation extraction procedure may vary according to the intended use of results), most authors agree that typical collocation patterns include the ones enumerated in Hausmann’s definition (1989, 1010) – ‘We shall call collocation a characteristic combination of two words in a structure like the following: (a) noun + adjective (epithet); (b) noun + verb; (c) verb + noun (object); (d) verb + adverb; (e) adjective + adverb; (f) noun + (prep) + noun’.
3. This decision is also motivated by statistical considerations, as most statistical methods are unreliable for low-frequency data. However, it is contested by the lexicographic community, because a significant part of lexicographically interesting candidates occurs only once or twice in a corpus (Piao et al., 2005, 379).
4. https://github.com/seretan/collocation-extraction-toy (accessed 1 February 2018).
5. A French verb, for instance, may have as many as 48 forms (Tzoukermann and Radev, 1996).
6. The same approach has been used by Uhrig and Proisl (2012), among others.
7. The extraction system was named FipsCo, as it relies on the output of the Fips parser (Laenzlinger and Wehrli, 1991; Wehrli, 1997; Wehrli and Nerima, 2015). It is available online at http://latlapps.unige.ch (accessed 1 February 2018).
8. PARSEME (2013–2017) was a European COST Action focusing on the link between complex lexical items and a comprehensive linguistic analysis of text. With more than 200 members from 33 countries, the Action fostered research on the integration of complex lexical items in parsing and translation.
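
The pattern list in note 2 and the frequency cut-off discussed in note 3 can be illustrated with a minimal Python sketch. It is not the chapter's extraction system: the tag names, the function candidate_pairs, the min_freq parameter and the sample data are all assumptions made for illustration. The input is taken to be pairs of syntactically related (lemma, part-of-speech) items, and the optional preposition in pattern (f) is ignored.

```python
from collections import Counter

# Patterns (a)-(f) from Hausmann's definition, written as pairs of
# part-of-speech tags. The tag names are illustrative assumptions, and the
# optional preposition of pattern (f) is left out for simplicity.
PATTERNS = {
    ("NOUN", "ADJ"),   # (a) noun + adjective
    ("NOUN", "VERB"),  # (b) noun + verb
    ("VERB", "NOUN"),  # (c) verb + noun (object)
    ("VERB", "ADV"),   # (d) verb + adverb
    ("ADJ", "ADV"),    # (e) adjective + adverb
    ("NOUN", "NOUN"),  # (f) noun + (prep) + noun
}


def candidate_pairs(pairs, min_freq=2):
    """Count lemma pairs whose tag pattern is allowed, then drop rare ones."""
    counts = Counter(
        (lemma1, lemma2)
        for (lemma1, tag1), (lemma2, tag2) in pairs
        if (tag1, tag2) in PATTERNS
    )
    return {pair: n for pair, n in counts.items() if n >= min_freq}


# Hypothetical hand-made input: syntactically related (lemma, tag) pairs.
PAIRS = [
    (("make", "VERB"), ("decision", "NOUN")),
    (("make", "VERB"), ("decision", "NOUN")),
    (("take", "VERB"), ("decision", "NOUN")),
    (("the", "DET"), ("decision", "NOUN")),
]

print(candidate_pairs(PAIRS, min_freq=2))
print(candidate_pairs(PAIRS, min_freq=1))
```

With min_freq=2 only the pair seen twice survives, whereas min_freq=1 also keeps the pair seen once. This is the trade-off raised in note 3: a higher threshold makes the statistics more reliable but discards candidates that occur only once or twice.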

References

Breidt, E. (1993). Extraction of V-N-collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus (pp. 74–83).
Brun, C. (1998). Terminology finite-state preprocessing for computational LFG. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Morristown (pp. 196–200).
Charest, S., Brunelle, E., Fontaine, J., & Pelletier, B. (2007). Élaboration automatique d’un dictionnaire de cooccurrences grand public. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), Toulouse (pp. 283–292).
Choueka, Y. (1988). Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In Proceedings of the International Conference on User-Oriented Content-Based Text and Image Handling, Cambridge, MA (pp. 609–623).
Church, K., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.
Constant, M., & Sigogne, A. (2011). MWU-aware part-of-speech tagging with a CRF model and lexical resources. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World (pp. 49–56). Portland: Association for Computational Linguistics.
Constant, M., Sigogne, A., & Watrin, P. (2012). Discriminative strategies to integrate multiword expression recognition and parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 204–212). Jeju Island: Association for Computational Linguistics.
Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie: Statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.
Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. Ph.D. thesis, University of Stuttgart.
Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language, 19(4), 450–466.
Gross, M. (1984). Lexicon-grammar and the syntactic analysis of French. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, Morristown (pp. 275–282).
Hausmann, F. J. (1989). Le dictionnaire de collocations. In F. Hausmann, O. Reichmann, H. Wiegand, & L. Zgusta (Eds.), Wörterbücher: Ein internationales Handbuch zur Lexikographie. Dictionaries, Dictionnaires (pp. 1010–1019). Berlin: de Gruyter.
Hornby, A. S., Cowie, A. P., & Lewis, J. W. (1948). Oxford advanced learner’s dictionary of current English. London: Oxford University Press.
Huang, C. R., Kilgarriff, A., Wu, Y., Chiu, C. M., Smith, S., Rychly, P., Bai, M. H., & Chen, K. J. (2005). Chinese sketch engine and the extraction of grammatical collocations. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island (pp. 48–55).
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In Proceedings of the Eleventh EURALEX International Congress, Lorient (pp. 105–116).
Korkontzelos, I., & Manandhar, S. (2010). Can recognising multiword expressions improve shallow parsing? In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 636–644). Los Angeles: Association for Computational Linguistics.
Krenn, B. (2000). Collocation mining: Exploiting corpora for collocation identification and representation. In Proceedings of the KONVENS 2000, Ilmenau (pp. 209–214).
Laenzlinger, C., & Wehrli, E. (1991). Fips, un analyseur interactif pour le français. TA Informations, 32(2), 35–49.
Lafon, P. (1984). Dépouillements et statistiques en lexicométrie. Genève/Paris: Slatkine – Champion.
Lea, D., & Runcie, M. (Eds.). (2002). Oxford collocations dictionary for students of English. Oxford: Oxford University Press.
Lin, D. (1998). Extracting collocations from text corpora. In Proceedings of the First Workshop on Computational Terminology, Montreal (pp. 57–63).
Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown (pp. 317–324).
Lü, Y., & Zhou, M. (2004). Collocation translation acquisition using monolingual corpora. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona (pp. 167–174).
Mel’čuk, I. (2003). Collocations: Définition, rôle et utilité. In F. Grossmann, & A. Tutin (Eds.), Les collocations: Analyse et traitement (pp. 23–32). Amsterdam: Editions De Werelt.
Monti, J., Seretan, V., Pastor, G. C., & Mitkov, R. (2018). Multiword units in machine translation and translation technology. In R. Mitkov, J. Monti, G. C. Pastor, & V. Seretan (Eds.), Multiword units in machine translation and translation technology (Current issues in linguistic theory, Vol. 341). Amsterdam/Philadelphia: John Benjamins.
Nivre, J. (2006). Inductive dependency parsing (Text, speech and language technology). Secaucus: Springer.
Nivre, J., & Nilsson, J. (2004). Multiword units in syntactic parsing. In MEMURA 2004 – Methodologies and Evaluation of Multiword Units in Real-World Applications (LREC Workshop) (pp. 39–46).
Nugues, P. M. (2014). Corpus processing tools (pp. 23–64). Berlin/Heidelberg: Springer.
Orliac, B., & Dillinger, M. (2003). Collocation extraction for machine translation. In Proceedings of Machine Translation Summit IX, New Orleans (pp. 292–298).
Pearce, D. (2001). Synonymy in collocation extraction. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh (pp. 41–46).
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, Las Palmas (pp. 1530–1536).
Pecina, P. (2005). An extensive empirical study of collocation extraction methods. In Proceedings of the ACL Student Research Workshop, Ann Arbor (pp. 13–18).
Pecina, P. (2008). Lexical association measures: Collocation extraction. Ph.D. thesis, Charles University.
Piao, S. S., Rayson, P., Archer, D., & McEnery, T. (2005). Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language Special Issue on Multiword Expressions, 19(4), 378–397.
Rani, A., Mehla, K., & Jangra, A. (2015). Parsers and parsing approaches: Classification and state of the art. In Proceedings of the 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), New Delhi (pp. 34–38).
Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), Mexico City (pp. 1–15).
Seretan, V. (2008). Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
Seretan, V. (2011). Syntax-based collocation extraction (Text, speech and language technology, Vol. 44). Dordrecht: Springer.
Seretan, V., & Wehrli, E. (2006). Accurate collocation extraction using a multilingual parser. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney (pp. 953–960).
Seretan, V., Nerima, L., & Wehrli, E. (2003). Extraction of multi-word collocations using syntactic bigram composition. In Proceedings of the Fourth International Conference on Recent Advances in NLP (RANLP-2003), Borovets (pp. 424–431).
Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Madrid (pp. 476–481).
Sinclair, J. (1995). Collins COBUILD English dictionary. London: Harper Collins.
Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.
Tzoukermann, E., & Radev, D. R. (1996). Using word class for part-of-speech disambiguation. In Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen (pp. 1–13).
Uhrig, P., & Proisl, T. (2012). Less hay, more needles – using dependency-annotated corpora to provide lexicographers with more accurate lists of collocation candidates. Lexicographica, 28(1), 141–180.
Villada Moirón, M. B. (2005). Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.
Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague (pp. 1034–1043).
Wehrli, E. (1997). L’analyse syntaxique des langues naturelles: Problèmes et méthodes. Paris: Masson.
Wehrli, E., & Nerima, L. (2015). The Fips multilingual parser. In N. Gala, R. Rapp, & G. Bel-Enguix (Eds.), Language production, cognition, and the lexicon (Text, speech and language technology, Vol. 48, pp. 473–489). Cham: Springer.
Wehrli, E., Seretan, V., & Nerima, L. (2010). Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: From Theory to Applications (MWE 2010), Beijing (pp. 27–35).
Wehrli, E., Seretan, V., & Nerima, L. (to appear). Verbal collocations and pronominalization. In G. C. Pastor & U. Heid (Eds.), Current trends in computational phraseology, research in linguistics and literature. Amsterdam/Philadelphia: John Benjamins.
Wu, H., & Zhou, M. (2003). Synonymous collocation extraction using translation information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo (pp. 120–127).
Zhang, Y., & Kordoni, V. (2006). Automated deep lexical acquisition for robust open texts processing. In Proceedings of LREC-2006, Genoa (pp. 275–280).

Acknowledgements

I am grateful to the anonymous reviewers, whose comments and suggestions allowed me to improve the chapter.

Author information

Authors and Affiliations

University of Geneva, Geneva, Switzerland
Violeta Seretan
University of Lausanne, Lausanne, Switzerland
Violeta Seretan

Corresponding author

Correspondence to Violeta Seretan.

Editor information

Editors and Affiliations

Department of English, University of Murcia, Murcia, Spain
Pascual Cantos-Gómez
Department of English, University of Murcia, Murcia, Spain
Moisés Almela-Sánchez

About this chapter

Cite this chapter

Seretan, V. (2018). Bridging Collocational and Syntactic Analysis. In: Cantos-Gómez, P., Almela-Sánchez, M. (eds) Lexical Collocation Analysis. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-92582-0_2

DOI: https://doi.org/10.1007/978-3-319-92582-0_2
Published: 22 August 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92581-3
Online ISBN: 978-3-319-92582-0
eBook Packages: Mathematics and Statistics; Mathematics and Statistics (R0)