Syntax-Based Extraction

  • Violeta Seretan
Chapter
Part of the Text, Speech and Language Technology book series (TLTB, volume 44)

Abstract

In this chapter, the core of the book, we present and evaluate our methodology for collocation extraction based on deep syntactic parsing. We first review previous work that used parsed text for collocation extraction and show that it fell well short of fully fledged syntax-based extraction, primarily because the parsers used lacked robustness, precision, or coverage, and because only a small number of syntactic configurations were taken into account. Our work addresses these deficiencies with a generic extraction procedure that relies on a large-scale multilingual parsing system. After describing the system and the extraction method, we focus on a contrastive evaluation against the sliding window method, a standard syntax-free method based on the linear proximity of words. Cross-language evaluation shows that, despite the inherent errors and the challenges posed by the analysis of large amounts of unrestricted text, deep parsing contributes to a significant increase in performance. A detailed qualitative analysis of the results, including a case-study comparison, allows us to assess the relative strengths and weaknesses of the two methods. The chapter closes with a brief comparison of the current system with systems based on shallow parsing.
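
To make the baseline concrete, the following is a minimal sketch of candidate generation with a sliding window, that is, pairing words purely by linear proximity. It is an illustration only: the function name `window_pairs`, the window size of 5, and the absence of any part-of-speech filtering are assumptions, not the configuration used in the chapter.

```python
from collections import Counter

def window_pairs(tokens, window=5):
    """Count ordered word pairs that co-occur inside a sliding window.

    Hypothetical helper: each token is paired with the tokens that
    follow it, up to `window - 1` positions to its right. No syntactic
    information is used, only linear proximity.
    """
    counts = Counter()
    for i, left in enumerate(tokens):
        # Slice covers positions i+1 .. i+window-1 (clipped at the end).
        for right in tokens[i + 1 : i + window]:
            counts[(left, right)] += 1
    return counts

# Toy input, whitespace-tokenised for simplicity.
sentence = "the committee reached a difficult decision".split()
pairs = window_pairs(sentence, window=5)

# The verb and its object are three positions apart, so they are paired.
print(pairs[("reached", "decision")])
# "the" and "decision" are five positions apart, outside the window.
print(("the", "decision") in pairs)
```

A window this simple also pairs words with no syntactic relation between them (for example "committee" and "a"), which is the kind of noise a parser-based selection of candidate pairs is meant to avoid.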

Keywords

Pair Type, Window Method, Candidate Pair, Slide Window Method, Regular Pair
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer Science+Business Media B.V. 2011

Authors and Affiliations

  1. Department of Linguistics (Office L706), University of Geneva, Geneva, Switzerland
