Collocation Candidate Extraction from Dependency-Annotated Corpora: Exploring Differences across Parsers and Dependency Annotation Schemes

  • Peter Uhrig
  • Stefan Evert
  • Thomas Proisl
Chapter
Part of the Quantitative Methods in the Humanities and Social Sciences book series (QMHSS)

Abstract

Collocation candidate extraction from dependency-annotated corpora has become increasingly mainstream in collocation research in recent years. In most studies, however, the results of a single parser are compared only to those of relatively “dumb” window-based approaches. To the best of our knowledge, the impact of the parser and its parsing scheme has not yet been studied systematically. This chapter evaluates 8 parsers on 2 corpora, using 20 association measures and several frequency thresholds for 6 types of collocations, against the Oxford Collocations Dictionary for Students of English (2nd edition; 2009). We find that both the parser and the parsing scheme affect the quality of collocation candidate extraction, and that the performance of individual parsers can differ substantially across collocation types. The filters used to extract the different types of collocations from the corpora also play an important role in the observed trade-off between precision and recall. Furthermore, carefully sampled and balanced corpora (such as the BNC) seem to have considerable advantages in precision, whereas larger, less balanced corpora (such as the web corpus used in this study) take the lead in total coverage. Overall, log-likelihood is the best association measure, but for some specific types of collocation (such as adjective-noun or verb-adverb), other measures perform even better.
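
To make the evaluation idea concrete, the following is a minimal sketch, not the chapter's actual pipeline: it ranks a few candidate pairs by the log-likelihood statistic (G²) computed from a 2×2 contingency table and scores the top of the ranking against a gold-standard list. All frequencies, the sample size, the gold list, and the function names `log_likelihood` and `precision_recall` are invented for illustration.

```python
import math

def log_likelihood(o11, f1, f2, n):
    """G2 statistic for a pair with joint frequency o11, marginal
    frequencies f1 and f2, and sample size n (hypothetical helper)."""
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [f1 * f2 / n, f1 * (n - f2) / n,
                (n - f1) * f2 / n, (n - f1) * (n - f2) / n]
    # Cells with zero observations contribute nothing to the sum.
    # Note: G2 is two-sided, so pairs rarer than expected also score > 0.
    return 2 * sum(o * math.log(o / e)
                   for o, e in zip(observed, expected) if o > 0)

def precision_recall(ranked_pairs, gold, k):
    """Precision and recall of the top-k candidates against a gold set."""
    hits = sum(1 for pair in ranked_pairs[:k] if pair in gold)
    return hits / k, hits / len(gold)

# Invented toy counts: pair -> (joint frequency, freq. word 1, freq. word 2)
N = 10_000
candidates = {
    ("heavy", "rain"): (30, 100, 80),
    ("strong", "tea"): (12, 150, 40),
    ("big", "house"): (5, 400, 300),
    ("red", "car"): (2, 200, 250),
}
# Invented stand-in for a dictionary-based gold standard.
gold = {("heavy", "rain"), ("strong", "tea"), ("make", "decision")}

scores = {pair: log_likelihood(o11, f1, f2, N)
          for pair, (o11, f1, f2) in candidates.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
precision, recall = precision_recall(ranked, gold, k=2)
print(ranked[:2], precision, recall)
```

In this toy setting the gold pair ("make", "decision") is never extracted, so recall stays below 1 however good the ranking is; this mirrors the coverage limitation of smaller corpora noted above.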

Dictionaries

  1. OALD8 = Oxford Advanced Learner’s Dictionary of Current English, 8th edition (2010). Edited by Joanna Turnbull. Oxford: Oxford University Press.
  2. OCD2 = Oxford Collocations Dictionary for Students of English, 2nd edition (2009). Edited by Colin MacIntosh. Oxford: Oxford University Press.

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany
