Abstract
Identifying and extracting the collocations in a given sentence is important for sentence understanding, analysis and translation. We therefore propose a sentence-level collocation identification and extraction method that follows the traditional two-phase collocation extraction model. In the candidate generation phase, we use the dependency parsing results directly; in the filtering phase, we use similarity based on word embeddings, a recent model of distributional semantics, to filter out noise. For each candidate, three word-embedding-based similarity rankings are obtained and used to decide whether it is a real collocation. The experimental results show that the proposed filtering method performs better than the traditional, well-known association measures. A comparison with the baseline system shows that the proposed method retrieves more collocations with higher precision, which is significant for sentence-related natural language processing tasks.
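The two-phase pipeline described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which three rankings are used or how they are combined, so the sketch makes several assumptions. It uses a single hypothetical ranking computed in both directions, a hypothetical rule that a high rank counts as evidence for a collocation, toy embedding vectors, hand-written dependency triples, and an illustrative relation set and threshold (`CANDIDATE_RELATIONS`, `max_rank`).

```python
import math

# Toy 3-dimensional embeddings (assumption); real vectors would come from a trained model.
EMBEDDINGS = {
    "make":     [0.9, 0.1, 0.0],
    "decision": [0.8, 0.2, 0.1],
    "take":     [0.7, 0.3, 0.0],
    "the":      [0.0, 0.1, 0.9],
    "green":    [0.1, 0.9, 0.2],
}

# Hypothetical set of dependency relations treated as collocation-bearing.
CANDIDATE_RELATIONS = {"dobj", "amod", "nsubj", "advmod"}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_rank(query, target, embeddings):
    """1-based rank of `target` among all other words, ordered by similarity to `query`."""
    others = [w for w in embeddings if w != query]
    ordered = sorted(
        others,
        key=lambda w: cosine(embeddings[query], embeddings[w]),
        reverse=True,
    )
    return ordered.index(target) + 1


def generate_candidates(triples):
    """Phase 1: keep (head, dependent) pairs whose relation is in the allowed set."""
    return [(head, dep) for head, rel, dep in triples if rel in CANDIDATE_RELATIONS]


def filter_candidates(candidates, embeddings, max_rank=2):
    """Phase 2: keep a pair if either directional rank is within `max_rank` (assumed rule)."""
    kept = []
    for head, dep in candidates:
        if head not in embeddings or dep not in embeddings:
            continue  # no vector available, so the pair cannot be scored
        ranks = (
            similarity_rank(head, dep, embeddings),
            similarity_rank(dep, head, embeddings),
        )
        if min(ranks) <= max_rank:
            kept.append((head, dep))
    return kept


# Hand-written (head, relation, dependent) triples standing in for parser output.
triples = [
    ("make", "dobj", "decision"),
    ("decision", "det", "the"),
    ("make", "advmod", "green"),
]
candidates = generate_candidates(triples)
collocations = filter_candidates(candidates, EMBEDDINGS)
print(candidates)
print(collocations)
```

In practice the triples would come from a dependency parser and the vectors from a trained embedding model; both are replaced by small literals here so the sketch runs on its own.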
Notes
- 1. The corpus was compiled specifically for Chinese-English collocation pair extraction; its sentences come from online dictionaries, bilingual language-learning websites, government websites that provide bilingual texts, etc.
Acknowledgments
This work was supported by the National Natural Science Foundation of China under Grant No. 61672127.
Copyright information
© 2017 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Liu, X., Huang, D. (2017). Translation Oriented Sentence Level Collocation Identification and Extraction. In: Wong, D., Xiong, D. (eds) Machine Translation. CWMT 2017. Communications in Computer and Information Science, vol 787. Springer, Singapore. https://doi.org/10.1007/978-981-10-7134-8_8
DOI: https://doi.org/10.1007/978-981-10-7134-8_8
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-7133-1
Online ISBN: 978-981-10-7134-8
eBook Packages: Computer Science, Computer Science (R0)