Abstract
We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism’s expressiveness suggests that it is particularly well-suited to model ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.
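The "minimal extra degree of ordering flexibility" can be illustrated with a small sketch. It assumes the usual reading of an inversion transduction grammar: each binary constituent is either straight (its children appear in the same order in both languages) or inverted (the children appear in reverse order in the second language only). The names `Leaf`, `Node`, and `yields` are hypothetical, the English/French tree is a toy example that does not come from the chapter, and probabilities are left out entirely.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    """A lexical couple: a language-1 word paired with a language-2 word.

    An empty string stands for a word with no counterpart in that language.
    """
    w1: str
    w2: str


@dataclass
class Node:
    """A binary constituent shared by both languages.

    inverted=False: children read left-to-right in both languages.
    inverted=True:  children read right-to-left in language 2 only.
    """
    left: "Tree"
    right: "Tree"
    inverted: bool = False


Tree = Union[Leaf, Node]


def yields(tree: Tree) -> Tuple[List[str], List[str]]:
    """Return the (language-1, language-2) word sequences generated by a tree."""
    if isinstance(tree, Leaf):
        out1 = [tree.w1] if tree.w1 else []
        out2 = [tree.w2] if tree.w2 else []
        return out1, out2
    l1, l2 = yields(tree.left)
    r1, r2 = yields(tree.right)
    out1 = l1 + r1                                # language 1 always keeps order
    out2 = r2 + l2 if tree.inverted else l2 + r2  # language 2 may be reversed
    return out1, out2


# Toy tree: the adjective-noun pair is inverted, the determiner attaches straight.
example = Node(
    Leaf("the", "la"),
    Node(Leaf("red", "rouge"), Leaf("house", "maison"), inverted=True),
)

if __name__ == "__main__":
    english, french = yields(example)
    print(" ".join(english))
    print(" ".join(french))
```

A single tree therefore brackets both sentences at once, which is what lets one parse supply the segmentation, bracketing, and alignment constraints for both languages.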
Copyright information
© 2000 Springer Science+Business Media Dordrecht
About this chapter
Cite this chapter
Wu, D. (2000). Bracketing and aligning words and constituents in parallel text using Stochastic Inversion Transduction Grammars. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_7
DOI: https://doi.org/10.1007/978-94-017-2535-4_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive