Abstract
Recently, much research on learning cross-lingual word embeddings without parallel data has achieved success by exploiting word-level isomorphism across languages. However, unsupervised cross-lingual sentence representation, which aims to learn a unified semantic space without parallel data, has not been well explored. Although many cross-lingual tasks can be solved by learning unified sentence representations of different languages on top of cross-lingual word embeddings, the resulting performance is not competitive with that of supervised counterparts. In this paper, we propose a novel framework for unsupervised cross-lingual sentence representation learning that exploits linguistic isomorphism at both the word and sentence levels. After generating pseudo-parallel sentences based on pre-trained cross-lingual word embeddings, the framework iteratively performs sentence modeling, word embedding tuning, and pseudo-parallel sentence updating. Our experiments show that the proposed framework achieves state-of-the-art results on many cross-lingual tasks and also improves the quality of the cross-lingual word embeddings. The code and pre-trained encoders will be released upon publication.
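The abstract describes the framework only at a high level. The sketch below illustrates one plausible reading of the iterative loop, assuming a bag-of-words sentence encoder, cosine nearest-neighbor mining of pseudo-parallel pairs, and an orthogonal Procrustes fit as a stand-in for the word-embedding tuning step; every function name here is illustrative, not the authors' released implementation.

```python
# Minimal sketch of the iterative framework described in the abstract.
# Assumptions (not from the paper): bag-of-words sentence encoder,
# cosine nearest-neighbor mining, Procrustes-based embedding tuning.
import numpy as np

def embed_sentences(sentences, word_vecs, dim):
    """Average the cross-lingual word vectors of each sentence
    (a simple stand-in for the paper's sentence encoder)."""
    rows = []
    for sent in sentences:
        vecs = [word_vecs[w] for w in sent.split() if w in word_vecs]
        rows.append(np.mean(vecs, axis=0) if vecs else np.zeros(dim))
    return np.vstack(rows)

def mine_pseudo_parallel(src_emb, tgt_emb, threshold=0.6):
    """Pair each source sentence with its nearest target sentence by
    cosine similarity, keeping only sufficiently confident pairs."""
    s = src_emb / (np.linalg.norm(src_emb, axis=1, keepdims=True) + 1e-9)
    t = tgt_emb / (np.linalg.norm(tgt_emb, axis=1, keepdims=True) + 1e-9)
    sims = s @ t.T
    best = sims.argmax(axis=1)
    return [(i, int(j)) for i, j in enumerate(best) if sims[i, j] >= threshold]

def tune_mapping(src_emb, tgt_emb, pairs):
    """Refit an orthogonal source-to-target map (Procrustes) on the mined
    pairs; stands in for the paper's word-embedding tuning step."""
    X = src_emb[[i for i, _ in pairs]]
    Y = tgt_emb[[j for _, j in pairs]]
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

def iterate(src_sents, tgt_sents, src_vecs, tgt_vecs, dim, n_iters=3):
    """Alternate between mining pseudo-parallel pairs and refining the
    cross-lingual mapping, mirroring the loop sketched in the abstract."""
    W = np.eye(dim)
    tgt_emb = embed_sentences(tgt_sents, tgt_vecs, dim)
    pairs = []
    for _ in range(n_iters):
        src_emb = embed_sentences(src_sents, src_vecs, dim) @ W
        pairs = mine_pseudo_parallel(src_emb, tgt_emb)
        if not pairs:
            break
        W = tune_mapping(embed_sentences(src_sents, src_vecs, dim),
                         tgt_emb, pairs)
    return W, pairs
```

The design point the loop illustrates is the feedback cycle: better sentence pairs refine the word-level mapping, and the refined mapping in turn yields better sentence embeddings, and hence better pairs, on the next iteration.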
Acknowledgement
This work is supported by the NSFC key projects (U1736204, 61533018, 61661146007), the Ministry of Education and China Mobile Joint Fund (MCM20170301), and the THUNUS NExT Co-Lab.
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Wang, S., Hou, L., Tong, M. (2019). Unsupervised Cross-Lingual Sentence Representation Learning via Linguistic Isomorphism. In: Douligeris, C., Karagiannis, D., Apostolou, D. (eds) Knowledge Science, Engineering and Management. KSEM 2019. Lecture Notes in Computer Science, vol 11776. Springer, Cham. https://doi.org/10.1007/978-3-030-29563-9_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-29562-2
Online ISBN: 978-3-030-29563-9