Bracketing and aligning words and constituents in parallel text using Stochastic Inversion Transduction Grammars

Wu, Dekai

doi:10.1007/978-94-017-2535-4_7

Part of the book series: Text, Speech and Language Technology ((TLTB,volume 13))

Abstract

We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence-pairs, and (2) the concept of bilingual parsing with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism’s expressiveness suggests that it is particularly well-suited to model ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.
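
The "minimal extra degree of ordering flexibility" can be made concrete with a small sketch. It assumes the usual binary reading of inversion transduction grammars, in which each constituent's two children appear in the second language either in the same order (straight) or swapped (inverted); the abstract itself does not spell this out. The function name `itg_reachable` and the encoding of a word reordering as a tuple of 0-based positions are illustrative choices, not the chapter's notation.

```python
from functools import lru_cache
from itertools import permutations


def itg_reachable(perm):
    """Return True if the non-empty permutation `perm` of 0..n-1 can be built
    by recursively combining adjacent spans in straight or inverted order
    (an assumed binary-ITG reading of the formalism)."""
    perm = tuple(perm)

    @lru_cache(maxsize=None)
    def ok(i, j):
        # A single word is trivially a constituent.
        if j - i == 1:
            return True
        # Try every split point of the span perm[i:j].
        for k in range(i + 1, j):
            left, right = perm[i:k], perm[k:j]
            straight = max(left) < min(right)   # children keep their order
            inverted = min(left) > max(right)   # children swap their order
            if (straight or inverted) and ok(i, k) and ok(k, j):
                return True
        return False

    return ok(0, len(perm))


# Count how many reorderings of four words are reachable under this reading.
reachable = [p for p in permutations(range(4)) if itg_reachable(p)]
print(len(reachable), "of 24")
```

Under this assumption most, but not all, reorderings of a short sentence are reachable, which is one way to read the abstract's balance between flexibility and complexity.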

Author information

Authors and Affiliations

Hong Kong University of Science and Technology, Hong Kong
Dekai Wu

Editor information

Editors and Affiliations

Université de Provence and CNRS, 29, Avenue Robert Schuman, 13100, Aix-en-Provence, France
Jean Véronis

About this chapter

Cite this chapter

Wu, D. (2000). Bracketing and aligning words and constituents in parallel text using Stochastic Inversion Transduction Grammars. In: Véronis, J. (eds) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_7

DOI: https://doi.org/10.1007/978-94-017-2535-4_7
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-5555-2
Online ISBN: 978-94-017-2535-4
eBook Packages: Springer Book Archive
