Abstract
Extracting collocation candidates from dependency-annotated corpora has become increasingly common in collocation research in recent years. In most studies, however, the results of a single parser are compared only to those of relatively “dumb” window-based approaches. To the best of our knowledge, the impact of the parser and its parsing scheme has not yet been studied systematically. This chapter evaluates 8 parsers on 2 corpora with 20 association measures and several frequency thresholds for 6 types of collocations against the Oxford Collocations Dictionary for Students of English (2nd edition; 2009). We find that both the parser and the parsing scheme affect the quality of collocation candidate extraction, and that the performance of different parsers can differ substantially across collocation types. The filters used to extract the different types of collocations from the corpora also play an important role in the observed trade-off between precision and recall. Furthermore, carefully sampled and balanced corpora (such as the BNC) seem to have considerable advantages in precision, whereas larger, less balanced corpora (such as the web corpus used in this study) take the lead in total coverage. Overall, log-likelihood is the best association measure, but for some specific types of collocation (such as adjective-noun or verb-adverb), other measures perform even better.
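To make the evaluation setup concrete, the following minimal Python sketch scores a candidate pair with the log-likelihood ratio (G²) computed from a 2×2 contingency table, and then measures precision and recall of the n best-ranked candidates against a gold standard such as a collocations dictionary. It is an illustration only, not the chapter's implementation: the function names, the argument layout and the toy values are assumptions made for this example.

```python
import math


def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 table of observed frequencies.

    o11: pair occurs; o12: first word without second;
    o21: second word without first; o22: neither word.
    """
    n = o11 + o12 + o21 + o22
    r1, r2 = o11 + o12, o21 + o22  # row totals
    c1, c2 = o11 + o21, o12 + o22  # column totals
    observed = (o11, o12, o21, o22)
    # Expected frequencies under independence of the two words.
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    # Cells with an observed count of 0 contribute 0 to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


def precision_recall(ranked, gold, n):
    """Precision and recall of the n best-ranked candidates against a gold set."""
    top = ranked[:n]
    hits = sum(1 for pair in top if pair in gold)
    return hits / len(top), hits / len(gold)
```

When the two words co-occur exactly as often as expected under independence, for example log_likelihood(10, 10, 10, 10), the score is 0; stronger associations yield larger values. Ranking all candidates by this score and calling precision_recall for increasing n traces out the kind of precision/recall trade-off discussed above.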
Notes
- 1. See Bartsch (2004: 27–39, 58–78) for a detailed overview.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
- 8.
- 9.
- 10. To date, the following revisions have been released: 1.0, 1.1, 1.2, 1.3, 1.4, 2.0.
- 11.
- 12. Unfiltered data can be used to maximize recall, since parsers are generally better at predicting that two items should be connected by a dependency relation than at predicting what type of dependency relation connects them. In the technical terms of parser evaluation, this is the difference between unlabelled and labelled attachment.
- 13.
- 14. There are relatively few candidate pairs for verb-adjective and adverb-adjective collocations; the largest numbers of pairs are found for noun-verb (both subjects and objects) and noun-adjective collocations.
- 15. That is, of course, if the definition of the collocation type is regarded as a lexical phenomenon with the terminology based on the canonical active-declarative structure.
- 16. Except for graphs where the high frequency threshold leads to a coverage of less than 50%.
- 17. CoreNLP produces a parsing error on this sentence so that Grapeshot stores is wrongly analysed as a nominal compound.
- 18. The same is true of the alternative form “peace be upon him”, which occurs more than 10,000 times but does not propel peace + be into the top 1,000 collocation candidates.
- 19. The list for CoreNLP enhanced++ only contains four of them.
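The distinction between unlabelled and labelled attachment mentioned in note 12 can be illustrated with a short Python sketch. It computes the unlabelled attachment score (UAS: share of tokens with the correct head) and the labelled attachment score (LAS: share of tokens with the correct head and the correct relation label). The function name, the (head, label) data layout and the four-token example are invented for illustration and do not come from the chapter.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) for token-aligned lists of (head, label) pairs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and cover the same tokens")
    total = len(gold)
    # Unlabelled: only the head index has to match.
    unlabelled = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # Labelled: head index and relation label both have to match.
    labelled = sum(g == p for g, p in zip(gold, predicted))
    return unlabelled / total, labelled / total


# Invented example: heads are 1-based token indices, 0 marks the root.
gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (4, "compound"), (3, "obj")]
uas, las = attachment_scores(gold, pred)
```

In this example the third token is attached to the right head with the wrong label, so it counts towards UAS but not LAS. A candidate extraction that ignores relation labels would still recover that pair, which is why unfiltered data favours recall.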
References
Ambati, B. R., Reddy, S., & Kilgarriff, A. (2012). Word sketches for Turkish. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC’12) (pp. 2945–2950). Istanbul: European Language Resources Association http://www.lrec-conf.org/proceedings/lrec2012/pdf/585_Paper.pdf.
Andor, D., Alberti, C., Weiss, D., Severyn, A., Presta, A., Ganchev, K., Petrov, S., & Collins, M. (2016). Globally normalized transition-based neural networks. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics (ACL'16) (pp. 2442–2452). Berlin: Association for Computational Linguistics http://aclweb.org/anthology/P16-1231.
Bartsch, S. (2004). Structural and functional properties of collocations in English. A corpus study of lexical and pragmatic constraints on lexical co-occurrence. Tübingen: Narr.
Bartsch, S., & Evert, S. (2014). Towards a Firthian notion of collocation. OPAL – Online publizierte Arbeiten zur Linguistik, 2(2014), 48–61 http://pub.ids-mannheim.de/laufend/opal/pdf/opal2014-2.pdf.
Basili, R., Pazienza, M. T., & Velardi, P. (1994). A ‘not-so-shallow’ parser for collocational analysis. In Proceedings of the 15th conference on computational linguistics (COLING’94) (pp. 447–453). Tokyo: Association for Computational Linguistics http://aclweb.org/anthology/C94-1074.
Blaheta, D., & Johnson, M. (2001). Unsupervised learning of multi-word verbs. In Proceedings of the ACL workshop on collocation: Computational extraction, analysis and exploitation (pp. 54–60). Toulouse. http://web.science.mq.edu.au/~mjohnson/papers/2001/dpb-colloc01.pdf.
Chen, D., & Manning, C. D. (2014). A fast and accurate dependency parser using neural networks. In A. Moschitti, B. Pang, & W. Daelemans (Eds.), Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP'14) (pp. 740–750). Doha: Association for Computational Linguistics http://aclweb.org/anthology/D14-1082.
Choi, J. D., & McCallum, A. (2013). Transition-based dependency parsing with selectional branching. In Proceedings of the 51st annual meeting of the Association for Computational Linguistics (ACL'13) (pp. 1052–1062). Sofia: Association for Computational Linguistics http://aclweb.org/anthology/P13-1104.
Choi, J. D., & Palmer, M. (2011). Getting the most out of transition-based dependency parsing. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics: Human language technologies (ACL'11) (pp. 687–692). Portland: Association for Computational Linguistics http://aclweb.org/anthology/P11-2121.
Choi, J. D., & Palmer, M. (2012). Guidelines for the Clear Style Constituent to Dependency Conversion. Institute of Cognitive Science Technical Report 01-12, University of Colorado Boulder.
Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Parsing, word associations and typical predicate-argument relations. In Speech and natural language: Proceedings of a workshop held at Cape Cod, Massachusetts, October 15–18, 1989 (pp. 75–81). Cape Cod. http://aclweb.org/anthology/H89-2012.
Clark, S., & Curran, J. R. (2007). Wide-coverage efficient statistical parsing with CCG and log-linear models. Computational Linguistics, 33(4), 493–556 http://aclweb.org/anthology/J07-4004.
Evert, S. (2004). The statistics of word cooccurrences. Word pairs and collocations. Ph.D. thesis, Institut für maschinelle Sprachverarbeitung, Universität Stuttgart. Published in 2005 http://elib.uni-stuttgart.de/opus/volltexte/2005/2371/.
Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th annual meeting of the Association for Computational Linguistics (ACL’01) (pp. 188–195). Toulouse: Association for Computational Linguistics http://www.aclweb.org/anthology/P01-1025.
Evert, S., Uhrig, P., Bartsch, S., & Proisl, T. (2017). E-VIEW-alation – A large-scale evaluation study of association measures for collocation identification. In Proceedings of eLex 2017 – Electronic lexicography in the 21st century: Lexicography from Scratch (pp. 531–549). Leiden: Lexical Computing https://elex.link/elex2017/wp-content/uploads/2017/09/paper32.pdf.
Farahmand, M., & Henderson, J. (2016). Modeling the non-substitutability of multiword expressions with distributional semantics and a log-linear model. In Proceedings of the 12th workshop on multiword expressions (pp. 61–66). Berlin: Association for Computational Linguistics https://aclweb.org/anthology/W16-1809.
Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1), 137–165.
Gries, S. T., & Stefanowitsch, A. (2004). Covarying collexemes in the into-causative. In M. Achard & S. Kemmer (Eds.), Language, culture, and mind (pp. 225–236). Stanford, CA: CSLI.
Heid, U., Fritzinger, F., Hauptmann, S., Weidenkaff, J., & Weller, M. (2008). Providing corpus data for a dictionary for German juridical phraseology. In A. Storrer, A. Geyken, A. Siebert, & K.-M. Würzner (Eds.), Text resources and lexical knowledge. Selected papers from the 9th conference on natural language processing, KONVENS 2008, Berlin, Germany (pp. 131–144). Berlin/Boston: Mouton de Gruyter https://doi.org/10.1515/9783110211818.2.131.
Herbst, T. (1996). What are collocations: Sandy beaches or false teeth? In English studies (Vol. 1996/4, pp. 379–393).
Ivanova, K., Heid, U., Schulte im Walde, S., Kilgarriff, A., & Pomikalek, J. (2008). Evaluating a German sketch grammar: A case study on noun phrase case. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, & D. Tapias (Eds.), Proceedings of the sixth international conference on language resources and evaluation (LREC’08) (pp. 2101–2107). Marrakech: European Language Resources Association http://www.lrec-conf.org/proceedings/lrec2008/pdf/537_paper.pdf.
Johansson, R., & Nugues, P. (2007). Extended constituent-to-dependency conversion for English. In Proceedings of NODALIDA 2007 (pp. 105–112). Tartu. http://dspace.ut.ee/bitstream/handle/10062/2560/reg-Johansson-10.pdf.
Johnson, M. (1999). Confidence intervals on likelihood estimates for estimating association strengths. Unpublished technical report.
Katz, G., & Giesbrecht, E. (2006). Automatic identification of non-compositional multi-word expressions using latent semantic analysis. In Proceedings of the workshop on multiword expressions: Identifying and exploiting underlying properties (MWE’06) (pp. 12–19). Sydney: Association for Computational Linguistics http://aclweb.org/anthology/W06-1203.
Kermes, H., & Heid, U. (2003). Using chunked corpora for the acquisition of collocations and idiomatic expressions. In F. Kiefer & J. Pajzs (Eds.), Proceedings of the 7th conference on computational lexicography and corpus research. Budapest: Research Institute for Linguistics, Hungarian Academy of Sciences.
Kiela, D., & Clark, S. (2013). Detecting compositionality of multi-word expressions using nearest neighbours in vector space models. In Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP’13) (pp. 1427–1432). Seattle: Association for Computational Linguistics http://www.aclweb.org/anthology/D13-1147.
Kilgarriff, A., Rychlý, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. In G. Williams & S. Vessier (Eds.), Proceedings of the 11th EURALEX international congress (pp. 105–115). Lorient: Université de Bretagne-Sud, Faculté des lettres et des sciences humaines http://euralex.org/wp-content/themes/euralex/proceedings/Euralex%202004/011_2004_V1_Adam%20KILGARRIFF,%20Pavel%20RYCHLY,%20Pavel%20SMRZ,%20David%20TUGWELL_The%20%20Sketch%20Engine.pdf.
Kilgarriff, A., Rychlý, P., Jakubicek, M., Kovář, V., Baisa, V., & Kocincová, L. (2014). Extrinsic corpus evaluation with a collocation dictionary task. In N. Calzolari, K. Choukri, T. Declerck, H. Loftsson, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the ninth international conference on language resources and evaluation (LREC’14). Reykjavik: European Language Resources Association http://www.lrec-conf.org/proceedings/lrec2014/pdf/52_Paper.pdf.
Klotz, M., & Herbst, T. (2016). English dictionaries: A linguistic introduction. Berlin: Erich Schmidt.
Lin, D. (1998). Extracting collocations from text corpora. In Proceedings of the first workshop on computational terminology (pp. 57–63). Montreal.
Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th annual meeting of the Association for Computational Linguistics (ACL’99) (pp. 317–324). Morristown: Association for Computational Linguistics http://aclweb.org/anthology/P99-1041.
Lü, Y., & Zhou, M. (2004). Collocation translation acquisition using monolingual corpora. In Proceedings of the 42nd meeting of the Association for Computational Linguistics (ACL’04) (pp. 167–174). Barcelona: Association for Computational Linguistics http://aclweb.org/anthology/P04-1022.
Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The Stanford CoreNLP natural language processing toolkit. In Proceedings of the 52nd annual meeting of the Association for Computational Linguistics (ACL'14) (pp. 55–60). Baltimore: Association for Computational Linguistics http://aclweb.org/anthology/P14-5010.
Marcus, M. P., Marcinkiewicz, M. A., & Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313–330 http://aclweb.org/anthology/J93-2004.
Marneffe, M.-C. de & Manning, C. D. (2008). Stanford dependencies manual. https://nlp.stanford.edu/software/dependencies_manual.pd
Nerima, L., Seretan, V., & Wehrli, E. (2003). Creating a multilingual collocations dictionary from large text corpora. In Companion volume to the proceedings of the 10th conference of the European chapter of the Association for Computational Linguistics (EACL’03) (pp. 131–134). Budapest: Association for Computational Linguistics http://aclweb.org/anthology/E03-1022.
Nissim, M., & Zaninello, A. (2013). Modeling the internal variability of multi-word expressions through a pattern-based method. ACM Transactions on Speech and Language Processing (TSLP), 10(2), 7:1–7:26 https://doi.org/10.1145/2483691.2483696.
Nivre, J. (2009). Non-projective dependency parsing in expected linear time. In Proceedings of the 47th annual meeting of the Association for Computational Linguistics and the 4th international joint conference on natural language processing of the AFNLP (ACL'09) (pp. 351–359). Singapore: Association for Computational Linguistics http://www.aclweb.org/anthology/P09-1040.
Pearce, D. (2001). Synonymy in collocation extraction. In Proceedings of the NAACL workshop on WordNet and other lexical resources: Applications, extensions and customizations (pp. 41–46). Pittsburgh: Association for Computational Linguistics.
Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Proceedings of the third international conference on language resources and evaluation (LREC’02) (pp. 1530–1536). Las Palmas: European Language Resources Association http://www.lrec-conf.org/proceedings/lrec2002/pdf/169.pdf.
Pecina, P. (2005). An extensive empirical study of collocation extraction methods. In Proceedings of the ACL student research workshop (pp. 13–18). Ann Arbor: Association for Computational Linguistics http://aclweb.org/anthology/P05-2003.
Pecina, P. (2010). Lexical association measures and collocation extraction. Language Resources and Evaluation, 44, 137–158 https://doi.org/10.1007/s10579-009-9101-4.
Pecina, P., & Schlesinger, P. (2006). Combining association measures for collocation extraction. In Proceedings of the COLING/ACL 2006 main conference poster sessions (pp. 651–658). Sydney: Association for Computational Linguistics http://aclweb.org/anthology/P06-2084.
Rodríguez-Fernández, S., Anke, L. E., Carlini, R., & Wanner, L. (2016). Semantics-driven recognition of collocations using word embeddings. In Proceedings of the 54th annual meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 499–505). Berlin: Association for Computational Linguistics https://doi.org/10.18653/v1/P16-2081.
Sangati, F., & van Cranenburgh, A. (2015). Multiword expression identification with recurring tree fragments and association measures. In Proceedings of the 11th workshop on multiword expressions (pp. 10–18). Denver: Association for Computational Linguistics https://doi.org/10.3115/v1/W15-0902.
Schäfer, R. (2015). Processing and querying large web corpora with the COW14 architecture. In P. Bański, H. Biber, E. Breiteneder, M. Kupietz, H. Lüngen, & A. Witt (Eds.), Proceedings of the 3rd workshop on challenges in the Management of Large Corpora (CMLC-3) (pp. 28–34). Mannheim: IDS Publication Server https://ids-pub.bsz-bw.de/files/3826/Schaefer_Processing_and_querying_large_web_corpora_2015.pdf.
Schäfer, R., & Bildhauer, F. (2012). Building large corpora from the web using a new efficient tool chain. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC’12) (pp. 486–493). Istanbul: European Language Resources Association http://www.lrec-conf.org/proceedings/lrec2012/pdf/834_Paper.pdf.
Schulte im Walde, S. (2003). A collocation database for German verbs and nouns. In Proceedings of the 7th conference on computational lexicography and text research (COMPLEX’03) (pp. 73–81). Budapest. http://www.ims.uni-stuttgart.de/institut/mitarbeiter/schulte/publications/workshop/complex-03.pdf.
Schuster, S., & Manning, C. D. (2016). Enhanced English universal dependencies: An improved representation for natural language understanding tasks. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 2371–2378). Portorož: European Language Resources Association http://www.lrec-conf.org/proceedings/lrec2016/pdf/779_Paper.pdf.
Seretan, V. (2008). Collocation extraction based on syntactic parsing. Ph.D. thesis, Faculté des lettres, Université de Genève http://www.issco.unige.ch/en/staff/seretan/publ/PhDThesis-VioletaSeretan.pdf.
Seretan, V., & Wehrli, E. (2006). Accurate collocation extraction using a multilingual parser. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics (pp. 953–960). Sydney: Association for Computational Linguistics http://aclweb.org/anthology/P06-1120.
Seretan, V., Nerima, L., & Wehrli, E. (2003). Extraction of multi-word collocations using syntactic bigram composition. In Proceedings of the fourth international conference on recent advances in NLP (RANLP-2003) (pp. 424–431). https://archive-ouverte.unige.ch/unige:17034.
Seretan, V., Nerima, L., & Wehrli, E. (2004). Multi-word collocation extraction by syntactic composition of collocation bigrams. In N. Nicolov, K. Bontcheva, G. Angelova, & R. Mitkov (Eds.), Recent advances in natural language processing III. Selected papers from RANLP 2003 (pp. 91–100). Amsterdam/Philadelphia: John Benjamins https://doi.org/10.1075/cilt.260.10ser.
Squillante, L. (2014). Towards an empirical subcategorization of multiword expressions. In Proceedings of the 10th workshop on multiword expressions (MWE 2014) (pp. 77–81). Gothenburg: Association for Computational Linguistics http://www.aclweb.org/anthology/W14-0813.
Steedman, M. (2000). The syntactic process. Cambridge, MA: The MIT Press.
Stefanowitsch, A., & Gries, S. T. (2005). Covarying collexemes. Corpus Linguistics and Linguistic Theory, 1(1), 1–43. https://doi.org/10.1515/cllt.2005.1.1.1.
Stefanowitsch, A., & Gries, S. T. (2009). Corpora and grammar. In A. Lüdeling & M. Kytö (Eds.), Corpus linguistics: An international handbook (Vol. 2, pp. 933–952). Berlin, DE/New York, NY: Walter de Gruyter.
Teufel, S., & Grefenstette, G. (1995). Corpus-based method for automatic identification of support verbs for nominalizations. In Proceedings of the seventh conference of the European chapter of the Association for Computational Linguistics (EACL’95) (pp. 98–103). Dublin: Association for Computational Linguistics http://aclweb.org/anthology/E95-1014.
Tsvetkov, Y., & Wintner, S. (2014). Identification of multiword expressions by combining multiple linguistic information sources. Computational Linguistics, 40(2), 449–468 https://doi.org/10.1162/COLI_a_00177.
Uhrig, P., & Proisl, T. (2012). Less hay, more needles – Using dependency-annotated corpora to provide lexicographers with more accurate lists of collocation candidates. Lexicographica, 28, 141–180 https://doi.org/10.1515/lexi.2012-0009.
Villada Moirón, M. B. (2005). Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen http://www.rug.nl/research/portal/files/9790774/thesis.pdf.
Weller, M., & Heid, U. (2010). Extraction of German multiword expressions from parsed corpora using context features. In N. Calzolari, K. Choukri, B. Maegaard, J. Mariani, J. Odijk, S. Piperidis, M. Rosner, & D. Tapias (Eds.), Proceedings of the seventh international conference on language resources and evaluation (LREC’10) (pp. 3195–3201). Valletta: European Language Resources Association http://lrec-conf.org/proceedings/lrec2010/pdf/428_Paper.pdf.
Wermter, J., & Hahn, U. (2006). You can’t beat frequency (unless you use linguistic knowledge) – A qualitative evaluation of association measures for collocation and term extraction. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the ACL (ACL’06) (pp. 785–792). Sydney: Association for Computational Linguistics http://aclweb.org/anthology/P06-1099.
Wiechmann, D. (2008). On the computation of collostruction strength: Testing measures of association as expressions of lexical bias. Corpus Linguistics and Linguistic Theory, 4(2), 253–290 https://doi.org/10.1515/CLLT.2008.011.
Yazdani, M., Farahmand, M., & Henderson, J. (2015). Learning semantic composition to detect non-compositionality of multiword expressions. In Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP’15) (pp. 1733–1742). Lisbon: Association for Computational Linguistics http://www.aclweb.org/anthology/D15-1201.
Zinsmeister, H., & Heid, U. (2003). Significant triples: Adjective+noun+verb combinations. In Proceedings of the 7th conference on computational lexicography and text research (COMPLEX 2003). Budapest. http://www.ims.uni-stuttgart.de/%7Ezinsmeis/pubs/SigColl-paper.pdf.
Zinsmeister, H., & Heid, U. (2004). Collocations of complex nouns: Evidence for lexicalisation. In Proceedings of KONVENS 2004. Vienna. https://pdfs.semanticscholar.org/3e5d/d62cbe41b8aa4bbdf37231b85b9b7ef7d94e.pdf.
Dictionaries
OALD8 = Oxford Advanced Learner’s Dictionary of Current English, 8th edition (2010). Edited by Joanna Turnbull. Oxford: Oxford University Press.
OCD2 = Oxford Collocations Dictionary for Students of English, 2nd edition (2009). Edited by Colin McIntosh. Oxford: Oxford University Press.
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this chapter
Cite this chapter
Uhrig, P., Evert, S., Proisl, T. (2018). Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences across Parsers and Dependency Annotation Schemes. In: Cantos-Gómez, P., Almela-Sánchez, M. (eds) Lexical Collocation Analysis. Quantitative Methods in the Humanities and Social Sciences. Springer, Cham. https://doi.org/10.1007/978-3-319-92582-0_6
DOI: https://doi.org/10.1007/978-3-319-92582-0_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92581-3
Online ISBN: 978-3-319-92582-0
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)