Bridging Collocational and Syntactic Analysis

Lexical Collocation Analysis

Abstract

The advent of the computer era, which enabled the development of large text corpora and of sophisticated corpus processing tools, led to unprecedented advances in the area of collocational analysis. These advances were paralleled by significant achievements in the area of syntactic analysis, with parsing technologies becoming available for an increasing number of languages. But more often than not, these developments have taken place independently. The coupling of collocational and syntactic analyses has seldom been considered, despite the fact that one type of analysis could benefit the other. In this chapter, we focus on the integration of syntactic parsing and collocational analysis. First, we review the literature describing syntactically-informed approaches to collocation extraction. Second, we survey the work devoted to exploiting collocational resources for syntactic parsing. Finally, we refer to more recent work that proposes a joint approach to collocational and syntactic analysis, arguing that the two analyses are interdependent to such a degree that only a simultaneous process, one in which structure decoding and pattern identification go hand in hand, can provide a solid bridge between them.

Notes

  1. The same metaphor is used by Uhrig and Proisl (2012).

  2. There is no generally accepted list of collocation patterns, since the parameters of a collocation extraction procedure may vary according to the intended use of the results. Most authors nevertheless agree that typical collocation patterns include the ones enumerated in Hausmann’s definition (1989, 1010): ‘We shall call collocation a characteristic combination of two words in a structure like the following: (a) noun + adjective (epithet); (b) noun + verb; (c) verb + noun (object); (d) verb + adverb; (e) adjective + adverb; (f) noun + (prep) + noun’.

  3. This decision is also motivated by statistical considerations, as most statistical methods are unreliable for low-frequency data. However, it is contested by the lexicographic community, because a significant share of lexicographically interesting candidates occur only once or twice in a corpus (Piao et al., 2005, 379).

  4. https://github.com/seretan/collocation-extraction-toy (accessed 1 February 2018).

  5. A French verb, for instance, may have as many as 48 forms (Tzoukermann and Radev, 1996).

  6. The same approach has been used by Uhrig and Proisl (2012), among others.

  7. The extraction system was named FipsCo, as it relies on the output of the Fips parser (Laenzlinger and Wehrli, 1991; Wehrli, 1997; Wehrli and Nerima, 2015). It is available online at http://latlapps.unige.ch (accessed 1 February 2018).

  8. PARSEME (2013–2017) was a European COST Action focusing on the link between complex lexical items and a comprehensive linguistic analysis of text. With more than 200 members from 33 countries, the Action fostered research on the integration of complex lexical items in parsing and translation.
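
As an illustration of Notes 2 and 3, the following minimal Python sketch keeps only word pairs whose part-of-speech combination matches one of Hausmann’s patterns and then applies an optional frequency cutoff. It is not the chapter’s extraction system: the tag names (NOUN, VERB, ADJ, ADV), the function name candidate_counts, the min_freq parameter and the toy data are all assumptions made for the example, and the input is assumed to be already lemmatized (cf. Note 5) and paired by a parser.

```python
from collections import Counter

# Hausmann's (1989) patterns (a)-(f), encoded with hypothetical coarse POS tags.
# The tag names and the (first word, second word) ordering are assumptions.
HAUSMANN_PATTERNS = {
    ("NOUN", "ADJ"),   # (a) noun + adjective (epithet)
    ("NOUN", "VERB"),  # (b) noun + verb
    ("VERB", "NOUN"),  # (c) verb + noun (object)
    ("VERB", "ADV"),   # (d) verb + adverb
    ("ADJ", "ADV"),    # (e) adjective + adverb
    ("NOUN", "NOUN"),  # (f) noun + (prep) + noun
}


def candidate_counts(pairs, min_freq=1):
    """Count pattern-matching pairs and drop those seen fewer than min_freq times.

    pairs: iterable of ((lemma1, pos1), (lemma2, pos2)) tuples.
    """
    counts = Counter(
        pair for pair in pairs
        if (pair[0][1], pair[1][1]) in HAUSMANN_PATTERNS
    )
    return {pair: n for pair, n in counts.items() if n >= min_freq}


# Toy input (invented for illustration): lemmatized, POS-tagged pairs.
pairs = [
    (("make", "VERB"), ("decision", "NOUN")),
    (("make", "VERB"), ("decision", "NOUN")),
    (("rain", "NOUN"), ("heavy", "ADJ")),
    (("the", "DET"), ("rain", "NOUN")),  # rejected: not a Hausmann pattern
]

all_candidates = candidate_counts(pairs)          # no frequency cutoff
frequent_only = candidate_counts(pairs, min_freq=2)
print(all_candidates)
print(frequent_only)
```

With min_freq=2 the pair (rain, heavy), which occurs only once, is discarded. This is the kind of low-frequency candidate whose loss Note 3 reports as a concern for lexicographers, so the cutoff is left as a parameter rather than fixed.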

References

  • Breidt, E. (1993). Extraction of V-N-collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives, Columbus (pp. 74–83).

  • Brun, C. (1998). Terminology finite-state preprocessing for computational LFG. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Morristown (pp. 196–200).

  • Charest, S., Brunelle, E., Fontaine, J., & Pelletier, B. (2007). Élaboration automatique d’un dictionnaire de cooccurrences grand public. In Actes de la 14e conférence sur le Traitement Automatique des Langues Naturelles (TALN 2007), Toulouse (pp. 283–292).

  • Choueka, Y. (1988). Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In Proceedings of the International Conference on User-Oriented Content-Based Text and Image Handling, Cambridge, MA (pp. 609–623).

  • Church, K., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.

  • Constant, M., & Sigogne, A. (2011). MWU-aware part-of-speech tagging with a CRF model and lexical resources. In Proceedings of the Workshop on Multiword Expressions: From Parsing and Generation to the Real World (pp. 49–56). Portland: Association for Computational Linguistics.

  • Constant, M., Sigogne, A., & Watrin, P. (2012). Discriminative strategies to integrate multiword expression recognition and parsing. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 204–212). Jeju Island: Association for Computational Linguistics.

  • Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie: Statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. Ph.D. thesis, University of Stuttgart.

  • Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language, 19(4), 450–466.

  • Gross, M. (1984). Lexicon-grammar and the syntactic analysis of French. In Proceedings of the 10th Annual Computational Linguistics and 22nd Meeting of the Association for Computational Linguistics, Morristown (pp. 275–282).

  • Hausmann, F. J. (1989). Le dictionnaire de collocations. In F. Hausmann, O. Reichmann, H. Wiegand, & L. Zgusta (Eds.), Wörterbücher: Ein internationales Handbuch zur Lexikographie. Dictionaries, Dictionnaires (pp. 1010–1019). Berlin: de Gruyter.

  • Hornby, A. S., Cowie, A. P., & Lewis, J. W. (1948). Oxford advanced learner’s dictionary of current English. London: Oxford University Press.

  • Huang, C. R., Kilgarriff, A., Wu, Y., Chiu, C. M., Smith, S., Rychly, P., Bai, M. H., & Chen, K. J. (2005). Chinese sketch engine and the extraction of grammatical collocations. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing, Jeju Island (pp. 48–55).

  • Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The sketch engine. In Proceedings of the Eleventh EURALEX International Congress, Lorient (pp. 105–116).

  • Korkontzelos, I., & Manandhar, S. (2010). Can recognising multiword expressions improve shallow parsing? In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics (pp. 636–644). Los Angeles: Association for Computational Linguistics.

  • Krenn, B. (2000). Collocation mining: Exploiting corpora for collocation identification and representation. In Proceedings of the KONVENS 2000, Ilmenau (pp. 209–214).

  • Laenzlinger, C., & Wehrli, E. (1991). Fips, un analyseur interactif pour le français. TA Informations, 32(2), 35–49.

  • Lafon, P. (1984). Dépouillements et statistiques en lexicométrie. Genève/Paris: Slatkine – Champion.

  • Lea, D., & Runcie, M. (Eds.). (2002). Oxford collocations dictionary for students of English. Oxford: Oxford University Press.

  • Lin, D. (1998). Extracting collocations from text corpora. In Proceedings of the First Workshop on Computational Terminology, Montreal (pp. 57–63).

  • Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, Morristown (pp. 317–324).

  • Lü, Y., & Zhou, M. (2004). Collocation translation acquisition using monolingual corpora. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL’04), Barcelona (pp. 167–174).

  • Mel’čuk, I. (2003). Collocations: Définition, rôle et utilité. In F. Grossmann, & A. Tutin (Eds.), Les collocations: Analyse et traitement (pp. 23–32). Amsterdam: Editions De Werelt.

  • Monti, J., Seretan, V., Pastor, G. C., & Mitkov, R. (2018). Multiword units in machine translation and translation technology. In R. Mitkov, J. Monti, G. C. Pastor, & V. Seretan (Eds.), Multiword units in machine translation and translation technology (Current issues in linguistic theory, Vol. 341). Amsterdam/Philadelphia: John Benjamins.

  • Nivre, J. (2006). Inductive dependency parsing (Text, speech and language technology). Secaucus: Springer.

  • Nivre, J., & Nilsson, J. (2004). Multiword units in syntactic parsing. In MEMURA 2004 – Methodologies and Evaluation of Multiword Units in Real-World Applications (LREC Workshop) (pp. 39–46).

  • Nugues, P. M. (2014). Corpus processing tools (pp. 23–64). Berlin/Heidelberg: Springer.

  • Orliac, B., & Dillinger, M. (2003). Collocation extraction for machine translation. In Proceedings of Machine Translation Summit IX, New Orleans (pp. 292–298).

  • Pearce, D. (2001). Synonymy in collocation extraction. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh (pp. 41–46).

  • Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation, Las Palmas (pp. 1530–1536).

  • Pecina, P. (2005). An extensive empirical study of collocation extraction methods. In Proceedings of the ACL Student Research Workshop, Ann Arbor (pp. 13–18).

  • Pecina, P. (2008). Lexical association measures: Collocation extraction. Ph.D. thesis, Charles University.

  • Piao, S. S., Rayson, P., Archer, D., & McEnery, T. (2005). Comparing and combining a semantic tagger and a statistical tool for MWE extraction. Computer Speech and Language Special Issue on Multiword Expressions, 19(4), 378–397.

  • Rani, A., Mehla, K., & Jangra, A. (2015). Parsers and parsing approaches: Classification and state of the art. In Proceedings of the 2015 International Conference on Futuristic Trends on Computational Analysis and Knowledge Management (ABLAZE), New Delhi (pp. 34–38).

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002), Mexico City (pp. 1–15).

  • Seretan, V. (2008). Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.

  • Seretan, V. (2011). Syntax-based collocation extraction (Text, speech and language technology, Vol. 44). Dordrecht: Springer.

  • Seretan, V., & Wehrli, E. (2006). Accurate collocation extraction using a multilingual parser. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, Sydney (pp. 953–960).

  • Seretan, V., Nerima, L., & Wehrli, E. (2003). Extraction of multi-word collocations using syntactic bigram composition. In Proceedings of the Fourth International Conference on Recent Advances in NLP (RANLP-2003), Borovets (pp. 424–431).

  • Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the Annual Meeting of the Association for Computational Linguistics, Madrid (pp. 476–481).

  • Sinclair, J. (1995). Collins COBUILD English dictionary. London: Harper Collins.

  • Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.

  • Tzoukermann, E., & Radev, D. R. (1996). Using word class for part-of-speech disambiguation. In Proceedings of the Fourth Workshop on Very Large Corpora, Copenhagen (pp. 1–13).

  • Uhrig, P., & Proisl, T. (2012). Less hay, more needles – using dependency-annotated corpora to provide lexicographers with more accurate lists of collocation candidates. Lexicographica, 28(1), 141–180.

  • Villada Moirón, M. B. (2005). Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.

  • Villavicencio, A., Kordoni, V., Zhang, Y., Idiart, M., & Ramisch, C. (2007). Validation and evaluation of automatically acquired multiword expressions for grammar engineering. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague (pp. 1034–1043).

  • Wehrli, E. (1997). L’analyse syntaxique des langues naturelles: Problèmes et méthodes. Paris: Masson.

  • Wehrli, E., & Nerima, L. (2015). The Fips multilingual parser. In N. Gala, R. Rapp, & G. Bel-Enguix (Eds.), Language production, cognition, and the lexicon (Text, speech and language technology, Vol. 48, pp. 473–489). Cham: Springer.

  • Wehrli, E., Seretan, V., & Nerima, L. (2010). Sentence analysis and collocation identification. In Proceedings of the Workshop on Multiword Expressions: From Theory to Applications (MWE 2010), Beijing (pp. 27–35).

  • Wehrli, E., Seretan, V., & Nerima, L. (to appear). Verbal collocations and pronominalization. In G. C. Pastor & U. Heid (Eds.), Current trends in computational phraseology, research in linguistics and literature. Amsterdam/Philadelphia: John Benjamins.

  • Wu, H., & Zhou, M. (2003). Synonymous collocation extraction using translation information. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2003), Sapporo (pp. 120–127).

  • Zhang, Y., & Kordoni, V. (2006). Automated deep lexical acquisition for robust open texts processing. In Proceedings of LREC-2006, Genoa (pp. 275–280).

Acknowledgements

I am grateful to the anonymous reviewers, whose comments and suggestions allowed me to improve the chapter.

Author information

Correspondence to Violeta Seretan.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this chapter

Cite this chapter

Seretan, V. (2018). Bridging Collocational and Syntactic Analysis. In: Cantos-Gómez, P., Almela-Sánchez, M. (eds) Lexical Collocation Analysis. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-92582-0_2
