Abstract
Unsupervised/supervised SimCSE [5] achieves state-of-the-art sentence-level semantic representation through contrastive learning and dropout-based data augmentation. In particular, supervised SimCSE mines positive pairs and hard-negative pairs from Natural Language Inference (NLI) entailment/contradiction labels, and thereby significantly outperforms other unsupervised/supervised models. Since NLI data is scarce, can we construct pseudo-NLI data to improve the semantic representation of multi-domain sentences? This paper proposes a Chinese-centric Cross Domain Contrastive learning framework (CCDC), which provides a "Hard/Soft NLI Data Builder" that annotates entailment/contradiction pairs via business rules and neural classifiers, in particular treating out-of-domain but semantically similar sentences as hard-negative samples. Experiments show that the CCDC framework achieves both intra-domain and cross-domain improvements. Moreover, with the Soft NLI Data Builder, a single CCDC model achieves the best results across all domains, improving the Spearman correlation coefficient by 34% over the baseline (BERT-base) and by 11% over the strong baseline (unsupervised SimCSE). Empirical analysis further shows that the framework effectively reduces the anisotropy of pre-trained models and exhibits clearer semantic clustering than unsupervised SimCSE.
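The core training signal described above is the supervised SimCSE objective: an InfoNCE loss in which each anchor is contrasted against its entailment positive and a contradiction (or, in CCDC, out-of-domain but semantically alike) hard negative. The following is a minimal sketch of that loss, assuming batch-aligned embedding tensors; it is an illustrative reconstruction, not the authors' released code, and the temperature value is an assumption.

```python
# Sketch of a supervised SimCSE-style contrastive loss with hard negatives.
# anchor / positive / hard_negative are (batch, dim) sentence embeddings;
# names and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negative = F.normalize(hard_negative, dim=-1)

    # Cosine similarity of each anchor to every positive and every hard negative
    sim_pos = anchor @ positive.t() / temperature        # (batch, batch)
    sim_neg = anchor @ hard_negative.t() / temperature   # (batch, batch)
    logits = torch.cat([sim_pos, sim_neg], dim=1)        # (batch, 2 * batch)

    # The i-th anchor's true positive sits in column i of sim_pos
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```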
References
Aitchison, L.: Infonce is a variational autoencoder. arXiv preprint arXiv:2107.02495 (2021)
Chen, J., Chen, Q., Liu, X., Yang, H., Lu, D., Tang, B.: The BG corpus: a large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4946–4951 (2018)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A Simple Framework for Contrastive Learning of Visual Representations, pp. 1597–1607 (2020). http://proceedings.mlr.press/v119/chen20j.html
Ant Financial: Ant Financial artificial intelligence competition (2018)
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021)
Gillick, D., et al.: Learning dense representations for entity retrieval, pp. 528–537 (2019). https://www.aclweb.org/anthology/K19-1049
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17–22 June 2006, New York, pp. 1735–1742. IEEE Computer Society (2006). https://doi.org/10.1109/CVPR.2006.100
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2015)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
Liu, X., et al.: LCQMC: a large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962 (2018)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Meng, Y., et al.: COCO-LM: correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473 (2021)
Reimers, N., Beyer, P., Gurevych, I.: Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity, pp. 87–96 (2016). https://www.aclweb.org/anthology/C16-1009
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks, pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Su, J., Cao, J., Liu, W., Ou, Y.: Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316 (2021)
Sun, X., Sun, S., Yin, M., Yang, H.: Hybrid neural conditional random fields for multi-view sequence labeling. Knowl. Based Syst. 189, 105151 (2020)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Wang, B., et al.: On position embeddings in BERT. In: International Conference on Learning Representations (2020)
Wang, L., Huang, J., Huang, K., Hu, Z., Wang, G., Gu, Q.: Improving neural language generation with spectrum control (2020). https://openreview.net/forum?id=ByxY8CNtvr
Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference, pp. 1112–1122 (2018). https://doi.org/10.18653/v1/N18-1101
Yang, H., Chen, J., Zhang, Y., Meng, X.: Optimized query terms creation based on meta-search and clustering. In: 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 2, pp. 38–42. IEEE (2008)
Yang, H., Deng, Y., Wang, M., Qin, Y., Sun, S.: Humor detection based on paragraph decomposition and BERT fine-tuning. In: AAAI Workshop 2020 (2020)
Yang, H., Xie, G., Qin, Y., Peng, S.: Domain specific NMT based on knowledge graph embedding and attention. In: 2019 21st International Conference on Advanced Communication Technology (ICACT), pp. 516–521. IEEE (2019)
Yang, Y., Zhang, Y., Tar, C., Baldridge, J.: PAWS-X: a cross-lingual adversarial dataset for paraphrase identification. arXiv preprint arXiv:1908.11828 (2019)
Zhang, D., et al.: Pairwise supervised contrastive learning of sentence representations (2021)
Zhang, N., et al.: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark. arXiv preprint arXiv:2106.08087 (2021)
7 Appendix
7.1 CCDC with Different PLM and Different Pooling Layer
To compare different model architectures, model sizes, and pooling types, we follow [27], which provides three ALBERT-based and three RoBERTa-based pre-trained models for Chinese. The pooling type can be mean, cls, or max, denoting average pooling, class-token pooling, and max pooling, respectively. Table 6 lists the results of 19 PLM + pooling-layer combinations (\(6 \times 3 + 1 = 19\)).
As can be seen, ALBERT does not match BERT-base, even with the large model, owing to parameter sharing. RoBERTa-large achieves the best all-domain performance, and mean pooling yields the best results in most domains, while ATEC, BQ, and LCQMC perform best with max pooling.
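For clarity, the three pooling strategies compared in Table 6 differ only in how token-level hidden states are reduced to one sentence vector. Below is a minimal, mask-aware sketch of mean/cls/max pooling over a PLM's last hidden states; the function and variable names are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of the mean / cls / max pooling variants over PLM token embeddings.
import torch

def pool(last_hidden_state, attention_mask, pooling="mean"):
    """last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()             # (batch, seq_len, 1)
    if pooling == "cls":
        return last_hidden_state[:, 0]                      # [CLS] token embedding
    if pooling == "mean":
        summed = (last_hidden_state * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1e-9)     # mask-aware average
    if pooling == "max":
        masked = last_hidden_state.masked_fill(mask == 0, -1e9)
        return masked.max(dim=1).values                     # per-dimension maximum
    raise ValueError(f"unknown pooling type: {pooling}")
```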
7.2 Case Analysis
We calculated the sentence-level similarities for the corresponding three groups of models and found that: (1) for BERT-base, the similarity scores are all above 0.93, regardless of whether the sentences are semantically related; (2) for the unsupervised SimCSE model, the scores are 0.90 vs. 0.86 for semantically related vs. unrelated pairs; (3) for the CCDC model, the scores are 0.93 vs. 0.81. The CCDC model therefore discriminates better than BERT and unsupervised SimCSE, as shown in Table 7.
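The similarity scores above are cosine similarities between pooled sentence embeddings from each encoder. The sketch below shows one plausible way to reproduce such a score with Hugging Face transformers; the checkpoint name and mean-pooling choice are assumptions, and the example sentences are placeholders rather than the paper's actual cases.

```python
# Sketch: cosine similarity between two sentences' mean-pooled BERT embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")

def sentence_similarity(sent_a, sent_b):
    batch = tokenizer([sent_a, sent_b], padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state             # (2, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # mean pooling
    return F.cosine_similarity(emb[0], emb[1], dim=0).item()
```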