Automatic Acquisition of Chinese–English Parallel Corpus from the Web
Parallel corpora are a valuable resource for tasks such as cross-language information retrieval and data-driven natural language processing systems. Previously, only small-scale corpora have been available, which has restricted their practical use. This paper describes a system that overcomes this limitation by automatically collecting high-quality parallel bilingual corpora from the web. Previous systems used a single principal feature for parallel web page verification, whereas we use multiple features to identify parallel texts via a k-nearest-neighbor classifier. Our system was evaluated on a manually annotated data set of 6500 Chinese–English candidate parallel pairs. Experiments show that a k-nearest-neighbor classifier with multiple features achieves substantial improvements over systems that use any single one of these features. The system achieved a precision of 95% and a recall of 97%, a significant improvement over earlier work.
Keywords: Candidate Site, Feature Filter, Parallel Corpus, Candidate Pair, Feature Score
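As a rough illustration of the approach summarized in the abstract, the sketch below classifies a candidate pair by majority vote among its k nearest labeled neighbors in feature space, and then computes precision and recall. It is a minimal sketch, not the authors' implementation: the three feature scores (length ratio, structure, content), the toy training data, the Euclidean distance, and the names `knn_predict`, `precision_recall`, and `TRAIN` are all assumptions made for illustration, since the abstract does not specify the features or the distance measure.

```python
from collections import Counter
from math import dist  # Euclidean distance, Python 3.8+


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) tuples.
    """
    neighbours = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def precision_recall(predicted, actual, positive=True):
    """Return (precision, recall) for the `positive` class."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == positive and a == positive for p, a in pairs)
    fp = sum(p == positive and a != positive for p, a in pairs)
    fn = sum(p != positive and a == positive for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical feature vectors for candidate pairs:
# (length-ratio score, structure score, content score).
# True means the pair was annotated as parallel.
TRAIN = [
    ((0.95, 0.90, 0.85), True),
    ((0.90, 0.85, 0.80), True),
    ((0.88, 0.92, 0.75), True),
    ((0.30, 0.40, 0.10), False),
    ((0.45, 0.20, 0.15), False),
    ((0.25, 0.35, 0.05), False),
]

if __name__ == "__main__":
    # Classify one unseen candidate pair from its (made-up) feature scores.
    print(knn_predict(TRAIN, (0.90, 0.90, 0.80), k=3))
```

Combining several scores in one feature vector lets a pair that looks weak on one feature still be accepted when the others agree, which is the intuition behind using multiple features rather than a single threshold.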