A Three-Layered Collocation Extraction Tool and Its Application in China English Studies

  • Conference paper
Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data (CCL 2015, NLP-NABD 2015)

Abstract

We design a three-layered collocation extraction tool that integrates syntactic and semantic knowledge, and we apply it to the study of China English. The tool first extracts peripheral collocations in the frequency layer from dependency triples, then extracts semi-peripheral collocations in the syntactic layer using association measures, and finally extracts core collocations in the semantic layer using a thesaurus of similar words. The syntactic constraints filter out much of the noise found in surface co-occurrences, and the semantic constraints are effective in identifying the "core" collocations. We apply the tool to automatically extract collocations from a large corpus of China English that we compiled, in order to explore how China English, as a variety of English, is nativized. We then analyze the similarities and differences among the typical China English collocations of a group of verbs. The tool and its results can be applied in compiling language resources for Chinese-English translation and in corpus-based China studies.
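The three layers described in the abstract can be pictured as successive filters over dependency triples. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not name the association measure or the semantic test, so pointwise mutual information (PMI) and a simple substitution check against a similar-word thesaurus are stand-ins chosen for illustration. All function names, thresholds, the toy triples, and the toy thesaurus are hypothetical.

```python
import math
from collections import Counter

def frequency_layer(counts, min_freq=2):
    """Layer 1 (assumed form): keep triples seen at least min_freq times."""
    return {t for t, c in counts.items() if c >= min_freq}

def pmi(triple, counts):
    """PMI of head and dependent, computed within one dependency relation."""
    head, rel, dep = triple
    total = sum(c for (h, r, d), c in counts.items() if r == rel)
    head_n = sum(c for (h, r, d), c in counts.items() if r == rel and h == head)
    dep_n = sum(c for (h, r, d), c in counts.items() if r == rel and d == dep)
    return math.log2(counts[triple] * total / (head_n * dep_n))

def syntactic_layer(candidates, counts, min_pmi=0.5):
    """Layer 2 (assumed measure): keep candidates whose PMI reaches min_pmi."""
    return {t for t in candidates if pmi(t, counts) >= min_pmi}

def semantic_layer(candidates, counts, thesaurus):
    """Layer 3 (assumed test): keep a triple only if no similar-word
    substitute of its dependent is attested with the same head and relation."""
    core = set()
    for head, rel, dep in candidates:
        substitutes = thesaurus.get(dep, [])
        if not any(counts.get((head, rel, s), 0) > 0 for s in substitutes):
            core.add((head, rel, dep))
    return core

# Toy (head, relation, dependent) triples, invented purely for illustration.
triples = (
    [("make", "dobj", "decision")] * 4
    + [("take", "dobj", "measure")] * 3
    + [("take", "dobj", "step")] * 3
    + [("make", "dobj", "thing")] * 2
    + [("take", "dobj", "thing")] * 2
    + [("see", "dobj", "film")]
)
# Hypothetical similar-word thesaurus.
thesaurus = {
    "measure": ["step", "action"],
    "step": ["measure", "move"],
    "decision": ["choice", "ruling"],
}

counts = Counter(triples)
peripheral = frequency_layer(counts)
semi_peripheral = syntactic_layer(peripheral, counts)
core = semantic_layer(semi_peripheral, counts, thesaurus)

print(len(peripheral), len(semi_peripheral), len(core))
print(core)
```

In this toy run each layer is a subset of the one before it, which mirrors the peripheral, semi-peripheral, and core nesting in the abstract. A real pipeline would obtain the triples from a dependency parser and would likely use a more robust measure than PMI on sparse counts.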

Notes

  1. http://wordnet.princeton.edu/

  2. http://scrapy.org

  3. http://www.chinadaily.cn

  4. http://www.news.cn/english/

  5. http://english.gov.cn/

  6. http://www.fmprc.gov.cn/mfa_eng/

Author information

Corresponding author

Correspondence to Dan Li.

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution-NonCommercial 2.5 International License (http://creativecommons.org/licenses/by-nc/2.5/), which permits any noncommercial use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Cao, J., Li, D., Huang, D. (2015). A Three-Layered Collocation Extraction Tool and Its Application in China English Studies. In: Sun, M., Liu, Z., Zhang, M., Liu, Y. (eds) Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. CCL 2015, NLP-NABD 2015. Lecture Notes in Computer Science, vol 9427. Springer, Cham. https://doi.org/10.1007/978-3-319-25816-4_4

  • DOI: https://doi.org/10.1007/978-3-319-25816-4_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-25815-7

  • Online ISBN: 978-3-319-25816-4

  • eBook Packages: Computer Science (R0)
