Translation Oriented Sentence Level Collocation Identification and Extraction

Liu, Xiaoxia; Huang, Degen

doi:10.1007/978-981-10-7134-8_8


Part of the book series: Communications in Computer and Information Science (CCIS, volume 787)

Included in the following conference series:

China Workshop on Machine Translation


Abstract

Identifying and extracting the collocations in a given sentence is important for understanding, analysing and translating that sentence. We therefore propose a sentence-level collocation identification and extraction method that follows the traditional two-phase collocation extraction model. In the candidate generation phase, we use dependency parsing results directly; in the filtering phase, we use word-embedding-based similarity, a recent model of distributional semantics, to filter out noise. For each candidate, three word-embedding-based similarity rankings are obtained and used to decide whether it is a true collocation. The experimental results show that the proposed filtering method performs better than traditional, well-known association measures. A comparison with the baseline system shows that the proposed method retrieves more collocations with higher precision, which is significant for sentence-related natural language processing tasks.
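The two-phase pipeline described in the abstract might be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how the three similarity rankings are computed, so the sketch uses a single hypothetical ranking (the rank of the dependent among vocabulary words by cosine similarity to the head). The function names, the relation labels, the `max_rank` threshold and the toy two-dimensional embeddings are all invented for the example, and the dependency triples are supplied by hand rather than produced by a parser.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def generate_candidates(dependencies, relations):
    """Phase 1: keep (head, dependent) pairs whose relation label is of interest."""
    return [(head, dep) for rel, head, dep in dependencies if rel in relations]

def similarity_rank(head, dependent, embeddings):
    """Hypothetical ranking: 1 + number of other words more similar to head."""
    if head not in embeddings or dependent not in embeddings:
        return None
    target = cosine(embeddings[head], embeddings[dependent])
    better = sum(
        1
        for word, vec in embeddings.items()
        if word not in (head, dependent)
        and cosine(embeddings[head], vec) > target
    )
    return better + 1

def filter_candidates(candidates, embeddings, max_rank):
    """Phase 2: keep candidates whose rank is within the assumed threshold."""
    kept = []
    for head, dep in candidates:
        rank = similarity_rank(head, dep, embeddings)
        if rank is not None and rank <= max_rank:
            kept.append((head, dep))
    return kept

# Toy data (assumed): hand-written dependency triples and tiny embeddings.
dependencies = [
    ("dobj", "make", "decision"),
    ("advmod", "make", "quickly"),
    ("det", "decision", "a"),
]
relations = {"dobj", "advmod"}
embeddings = {
    "make": [1.0, 0.0],
    "decision": [0.9, 0.1],
    "quickly": [0.0, 1.0],
    "table": [0.5, 0.5],
}

candidates = generate_candidates(dependencies, relations)
collocations = filter_candidates(candidates, embeddings, max_rank=1)
print(collocations)
```

In a fuller setting the triples would come from a dependency parser and the vectors from a trained embedding model; the decision rule that combines three rankings would replace the single threshold used here.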


Notes

1. The corpus was compiled specifically for Chinese-English collocation pair extraction; the sentences come from online dictionaries, bilingual language-learning websites, governmental websites that provide bilingual texts, etc.


Acknowledgments

This work was supported by the National Natural Science Foundation of China under Grant No. 61672127.

Author information

Authors and Affiliations

Dalian University of Technology, No. 2, Linggong Road, Hi-Tech Zone, Dalian, 116024, China
Xiaoxia Liu & Degen Huang


Corresponding author

Correspondence to Degen Huang.

Editor information

Editors and Affiliations

University of Macau, Macau SAR, China
Derek F. Wong
Soochow University, Suzhou, China
Deyi Xiong


About this paper

Cite this paper

Liu, X., Huang, D. (2017). Translation Oriented Sentence Level Collocation Identification and Extraction. In: Wong, D., Xiong, D. (eds) Machine Translation. CWMT 2017. Communications in Computer and Information Science, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-10-7134-8_8


DOI: https://doi.org/10.1007/978-981-10-7134-8_8
Published: 14 November 2017
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7133-1
Online ISBN: 978-981-10-7134-8
eBook Packages: Computer Science (R0)
