Bracketing and aligning words and constituents in parallel text using Stochastic Inversion Transduction Grammars

Part of the book series: Text, Speech and Language Technology (TLTB, volume 13)

Abstract

We introduce (1) a novel stochastic inversion transduction grammar formalism for bilingual language modeling of sentence pairs, and (2) the concept of bilingual parsing, with a variety of parallel corpus analysis applications. Aside from the bilingual orientation, three major features distinguish the formalism from the finite-state transducers more traditionally found in computational linguistics: it skips directly to a context-free rather than finite-state base, it permits a minimal extra degree of ordering flexibility, and its probabilistic formulation admits an efficient maximum-likelihood bilingual parsing algorithm. A convenient normal form is shown to exist. Analysis of the formalism’s expressiveness suggests that it is particularly well suited to modeling ordering shifts between languages, balancing needed flexibility against complexity constraints. We discuss a number of examples of how stochastic inversion transduction grammars bring bilingual constraints to bear upon problematic corpus analysis tasks such as segmentation, bracketing, phrasal alignment, and parsing.
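
The "minimal extra degree of ordering flexibility" mentioned in the abstract can be illustrated with a small sketch. It assumes the usual reading of an inversion transduction grammar, in which each binary production is either straight (both languages read the two constituents left to right) or inverted (the second language reads them in reverse order). The class names `Leaf` and `Node`, the function `yield_pair`, and the placeholder tokens are hypothetical and are not taken from the chapter; probabilities and the parsing algorithm itself are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Leaf:
    src: str  # word in language 1 ("" if nothing is emitted on this side)
    tgt: str  # word in language 2 ("" if nothing is emitted on this side)


@dataclass
class Node:
    left: "Tree"
    right: "Tree"
    inverted: bool  # False: straight orientation; True: inverted orientation


Tree = Union[Leaf, Node]


def yield_pair(tree: Tree) -> Tuple[List[str], List[str]]:
    """Read off the pair of word sequences generated by a bilingual parse tree."""
    if isinstance(tree, Leaf):
        return ([tree.src] if tree.src else [], [tree.tgt] if tree.tgt else [])
    left_src, left_tgt = yield_pair(tree.left)
    right_src, right_tgt = yield_pair(tree.right)
    # Language 1 always concatenates left then right.
    src = left_src + right_src
    # Language 2 swaps the two constituents only under an inverted production.
    tgt = right_tgt + left_tgt if tree.inverted else left_tgt + right_tgt
    return src, tgt


if __name__ == "__main__":
    # Placeholder tokens x1..x3 / y1..y3 stand in for aligned word pairs.
    tree = Node(
        Leaf("x1", "y1"),
        Node(Leaf("x2", "y2"), Leaf("x3", "y3"), inverted=True),
        inverted=False,
    )
    src, tgt = yield_pair(tree)
    print(src)  # ['x1', 'x2', 'x3']
    print(tgt)  # ['y1', 'y3', 'y2']
```

A single tree therefore brackets both sentences at once, and the only reordering it can express is a swap of sibling constituents, which is the sense in which the extra flexibility is minimal.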

Copyright information

© 2000 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Wu, D. (2000). Bracketing and aligning words and constituents in parallel text using Stochastic Inversion Transduction Grammars. In: Véronis, J. (ed.) Parallel Text Processing. Text, Speech and Language Technology, vol 13. Springer, Dordrecht. https://doi.org/10.1007/978-94-017-2535-4_7

  • DOI: https://doi.org/10.1007/978-94-017-2535-4_7

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-5555-2

  • Online ISBN: 978-94-017-2535-4

  • eBook Packages: Springer Book Archive
