CCDC: A Chinese-Centric Cross Domain Contrastive Learning Framework

  • Conference paper
  • First Online:
Knowledge Science, Engineering and Management (KSEM 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13369)

Abstract

Unsupervised/supervised SimCSE [5] achieves state-of-the-art sentence-level semantic representation performance based on contrastive learning and dropout data augmentation. In particular, supervised SimCSE mines positive pairs and hard-negative pairs from Natural Language Inference (NLI) entailment/contradiction labels and significantly outperforms other unsupervised/supervised models. Since NLI data is scarce, can we construct pseudo-NLI data to improve the semantic representation of multi-domain sentences? This paper proposes a Chinese-centric Cross Domain Contrastive learning framework (CCDC), which provides a “Hard/Soft NLI Data Builder” to annotate entailment/contradiction pairs through business rules and neural classifiers, treating out-domain but semantically alike sentences as hard-negative samples. Experiments show that the CCDC framework achieves both intra-domain and cross-domain enhancement. Moreover, with the Soft NLI Data Builder, the CCDC framework achieves the best results across all domains with a single model, improving the Spearman correlation coefficient by 34% and 11% over the baseline (BERT-base) and the strong baseline (unsupervised SimCSE), respectively. Empirical analysis also shows that the framework effectively reduces the anisotropy of the pre-trained models and exhibits better semantic clustering than unsupervised SimCSE.
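
The core training signal is a SimCSE-style supervised contrastive objective over the (anchor, entailment, contradiction) triplets produced by the NLI data builder. The following is a minimal sketch of such a loss, assuming a generic sentence encoder and an illustrative temperature; the function name, batch layout, and hyperparameters are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def supervised_contrastive_loss(anchor_emb, positive_emb, hard_negative_emb, temperature=0.05):
    """SimCSE-style loss; anchor_emb, positive_emb, hard_negative_emb are [batch, dim]."""
    # Cosine similarity of every anchor to every entailment sentence in the batch ...
    sim_pos = F.cosine_similarity(anchor_emb.unsqueeze(1), positive_emb.unsqueeze(0), dim=-1)
    # ... and to every contradiction / out-domain hard negative.
    sim_neg = F.cosine_similarity(anchor_emb.unsqueeze(1), hard_negative_emb.unsqueeze(0), dim=-1)
    logits = torch.cat([sim_pos, sim_neg], dim=1) / temperature            # [batch, 2 * batch]
    labels = torch.arange(anchor_emb.size(0), device=anchor_emb.device)    # diagonal entries are the true positives
    return F.cross_entropy(logits, labels)
```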

References

  1. Aitchison, L.: InfoNCE is a variational autoencoder. arXiv preprint arXiv:2107.02495 (2021)

  2. Chen, J., Chen, Q., Liu, X., Yang, H., Lu, D., Tang, B.: The BQ corpus: a large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4946–4951 (2018)

  3. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations, pp. 1597–1607 (2020). http://proceedings.mlr.press/v119/chen20j.html

  4. Ant Financial: Ant Financial Artificial Intelligence Competition (2018)

  5. Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021)

  6. Gillick, D., et al.: Learning dense representations for entity retrieval, pp. 528–537 (2019). https://www.aclweb.org/anthology/K19-1049

  7. Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17–22 June 2006, New York, pp. 1735–1742. IEEE Computer Society (2006). https://doi.org/10.1109/CVPR.2006.100

  8. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2015)

  9. Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)

  10. Liu, X., et al.: LCQMC: a large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962 (2018)

  11. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  12. Meng, Y., et al.: COCO-LM: correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473 (2021)

  13. Reimers, N., Beyer, P., Gurevych, I.: Task-oriented intrinsic evaluation of semantic textual similarity, pp. 87–96 (2016). https://www.aclweb.org/anthology/C16-1009

  14. Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks, pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410

  15. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  16. Su, J., Cao, J., Liu, W., Ou, Y.: Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316 (2021)

  17. Sun, X., Sun, S., Yin, M., Yang, H.: Hybrid neural conditional random fields for multi-view sequence labeling. Knowl. Based Syst. 189, 105151 (2020)

  18. Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)

  19. Wang, B., et al.: On position embeddings in BERT. In: International Conference on Learning Representations (2020)

  20. Wang, L., Huang, J., Huang, K., Hu, Z., Wang, G., Gu, Q.: Improving neural language generation with spectrum control (2020). https://openreview.net/forum?id=ByxY8CNtvr

  21. Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference, pp. 1112–1122 (2018). https://doi.org/10.18653/v1/N18-1101

  22. Yang, H., Chen, J., Zhang, Y., Meng, X.: Optimized query terms creation based on meta-search and clustering. In: 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 2, pp. 38–42. IEEE (2008)

  23. Yang, H., Deng, Y., Wang, M., Qin, Y., Sun, S.: Humor detection based on paragraph decomposition and BERT fine-tuning. In: AAAI Workshop 2020 (2020)

  24. Yang, H., Xie, G., Qin, Y., Peng, S.: Domain specific NMT based on knowledge graph embedding and attention. In: 2019 21st International Conference on Advanced Communication Technology (ICACT), pp. 516–521. IEEE (2019)

  25. Yang, Y., Zhang, Y., Tar, C., Baldridge, J.: PAWS-X: a cross-lingual adversarial dataset for paraphrase identification. arXiv preprint arXiv:1908.11828 (2019)

  26. Zhang, D., et al.: Pairwise supervised contrastive learning of sentence representations (2021)

  27. Zhang, N., et al.: CBLUE: a Chinese biomedical language understanding evaluation benchmark. arXiv preprint arXiv:2106.08087 (2021)

Author information

Corresponding author

Correspondence to Hao Yang.

7 Appendix

7.1 CCDC with Different PLMs and Different Pooling Layers

For comparison of different model architectures, model sizes, and pooling types, [27] is used as a reference; it provides three ALBERT-based and three RoBERTa-based pre-trained models for Chinese. The pooling type can be mean, cls, or max, indicating average pooling, class-token pooling, and max pooling, respectively. Table 6 lists the 19 different PLM + pooling layer results (\(6 \times 3 + 1 = 19\)).
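
For reference, the three pooling types can be written down generically as below; the tensor names and shapes are illustrative and not tied to any particular PLM from [27].

```python
import torch

def pool(last_hidden_state, attention_mask, pooling="mean"):
    """last_hidden_state: [batch, seq_len, dim]; attention_mask: [batch, seq_len]."""
    mask = attention_mask.unsqueeze(-1).float()               # [batch, seq_len, 1]
    if pooling == "cls":
        return last_hidden_state[:, 0]                        # class-token embedding
    if pooling == "mean":
        summed = (last_hidden_state * mask).sum(dim=1)        # average over non-padding tokens
        return summed / mask.sum(dim=1).clamp(min=1e-9)
    if pooling == "max":
        masked = last_hidden_state.masked_fill(mask == 0, -1e9)  # exclude padding from the max
        return masked.max(dim=1).values
    raise ValueError(f"unknown pooling type: {pooling}")
```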

As can be seen, ALBERT is not as good as BERT-base, even with the large model, due to parameter sharing. RoBERTa-large achieved the best all-domain performance, and mean pooling achieved the best performance in most domains, while ATEC, BQ, and LCQMC performed best with the max pooling layer.

7.2 Case Analysis

The sentence-level similarities of the corresponding three groups of models were calculated, with the following findings: (1) for BERT-base, the similarity scores are all above 0.93, regardless of whether the sentences are semantically related; (2) for the unsupervised SimCSE model, the similarity score is 0.90 vs. 0.86 for semantically related vs. unrelated pairs; (3) for the CCDC model, the similarity score is 0.93 vs. 0.81. The CCDC model thus discriminates better than BERT-base and unsupervised SimCSE, as can be seen in Table 7.

Table 6. CCDC results with different PLMs and different pooling layers
Table 7. Case analysis for BERT-base, unsupervised SimCSE, and CCDC; label = 1 means semantically identical, label = 0 means not
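
The similarity scores in the case analysis are cosine similarities between pooled sentence embeddings. A minimal sketch, assuming a hypothetical `embed` helper that wraps any of the three encoders (BERT-base, unsupervised SimCSE, CCDC) and returns a pooled embedding:

```python
import torch.nn.functional as F

def sentence_similarity(embed, sentence_a: str, sentence_b: str) -> float:
    """Cosine similarity between two pooled sentence embeddings; `embed` returns a [dim] tensor."""
    emb_a = embed(sentence_a)
    emb_b = embed(sentence_b)
    return F.cosine_similarity(emb_a.unsqueeze(0), emb_b.unsqueeze(0)).item()
```

The Spearman correlation figures reported in the abstract are then computed between such predicted similarities and the gold labels over each evaluation set, e.g. with `scipy.stats.spearmanr`.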

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yang, H. et al. (2022). CCDC: A Chinese-Centric Cross Domain Contrastive Learning Framework. In: Memmi, G., Yang, B., Kong, L., Zhang, T., Qiu, M. (eds) Knowledge Science, Engineering and Management. KSEM 2022. Lecture Notes in Computer Science (LNAI), vol. 13369. Springer, Cham. https://doi.org/10.1007/978-3-031-10986-7_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-10986-7_18

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-10985-0

  • Online ISBN: 978-3-031-10986-7

  • eBook Packages: Computer Science, Computer Science (R0)
