Translation Oriented Sentence Level Collocation Identification and Extraction

  • Conference paper
Machine Translation (CWMT 2017)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 787)

Abstract

Identifying and extracting the collocations in a given sentence is important for sentence understanding, analysis, and translation. We propose a sentence-level collocation identification and extraction method that follows the traditional two-phase collocation extraction model. In the candidate generation phase, we use dependency parsing results directly; in the filtering phase, we use similarity based on word embeddings, a recent model of distributional semantics, to filter out noise. For each candidate, three word-embedding-based similarity rankings are obtained and used to decide whether it is a true collocation. Experimental results show that the proposed filtering method outperforms traditional, well-known association measures. Comparison with a baseline system shows that the proposed method retrieves more collocations with higher precision, which is significant for sentence-level natural language processing tasks.
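The two-phase idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the three similarity rankings, so the sketch uses a single, assumed ranking (the rank of the dependent among all vocabulary words, ordered by cosine similarity to the head). The candidate pairs stand in for dependency parser output, and the vectors, the function names, and the `max_rank` threshold are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_rank(head, dep, vectors):
    """1-based rank of `dep` among all other words, by similarity to `head`."""
    others = [w for w in vectors if w != head]
    ranked = sorted(others,
                    key=lambda w: cosine(vectors[head], vectors[w]),
                    reverse=True)
    return ranked.index(dep) + 1

def filter_candidates(candidates, vectors, max_rank):
    """Keep candidate pairs whose dependent ranks within `max_rank` of the head."""
    kept = []
    for head, dep in candidates:
        if head in vectors and dep in vectors:
            if similarity_rank(head, dep, vectors) <= max_rank:
                kept.append((head, dep))
    return kept

# Toy 3-dimensional embeddings (hypothetical values, for illustration only).
TOY_VECTORS = {
    "make": [0.9, 0.1, 0.0],
    "decision": [0.8, 0.2, 0.1],
    "tea": [0.1, 0.9, 0.2],
    "strong": [0.2, 0.8, 0.1],
    "table": [0.0, 0.1, 0.9],
}

# Phase 1 stand-in: (head, dependent) pairs as a dependency parser might emit them.
CANDIDATES = [("make", "decision"), ("strong", "tea"), ("make", "table")]

# Phase 2: filter the candidates with the assumed similarity ranking.
kept = filter_candidates(CANDIDATES, TOY_VECTORS, max_rank=1)
print(kept)
```

In a realistic setting the vectors would come from a trained embedding model and the candidates from a dependency parser; the threshold would need tuning on annotated data.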

Notes

  1. The corpus was compiled specifically for Chinese-English collocation pair extraction; its sentences come from online dictionaries, bilingual language-learning websites, government websites that provide bilingual texts, and similar sources.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61672127.

Author information

Corresponding author

Correspondence to Degen Huang.

Copyright information

© 2017 Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Liu, X., Huang, D. (2017). Translation Oriented Sentence Level Collocation Identification and Extraction. In: Wong, D., Xiong, D. (eds) Machine Translation. CWMT 2017. Communications in Computer and Information Science, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-10-7134-8_8

  • DOI: https://doi.org/10.1007/978-981-10-7134-8_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-10-7133-1

  • Online ISBN: 978-981-10-7134-8

  • eBook Packages: Computer Science, Computer Science (R0)
