Abstract
Unsupervised/supervised SimCSE [5] achieves state-of-the-art sentence-level semantic representation through contrastive learning and dropout-based data augmentation. In particular, supervised SimCSE mines positive pairs and hard-negative pairs from Natural Language Inference (NLI) entailment/contradiction labels, and thereby significantly outperforms other unsupervised/supervised models. Since NLI data is scarce, can we construct pseudo-NLI data to improve the semantic representation of multi-domain sentences? This paper proposes a Chinese-centric Cross Domain Contrastive learning framework (CCDC), which provides a "Hard/Soft NLI Data Builder" that annotates entailment/contradiction pairs via business rules and neural classifiers, in particular treating out-of-domain but semantically similar sentences as hard-negative samples. Experiments show that the CCDC framework achieves both intra-domain and cross-domain improvements. Moreover, with the Soft NLI Data Builder, a single CCDC model achieves the best results across all domains, improving the Spearman correlation coefficient by 34% over the baseline (BERT-base) and by 11% over the strong baseline (unsupervised SimCSE). Empirical analysis further shows that the framework effectively reduces the anisotropy of pre-trained models and exhibits clearer semantic clustering than unsupervised SimCSE.
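The core training signal described above is the supervised SimCSE objective: an InfoNCE loss in which each anchor is contrasted against its entailment positive and a contradiction (or, in CCDC, out-of-domain but semantically alike) hard negative. The following is a minimal sketch of that loss, assuming batch-aligned embedding tensors; it is an illustrative reconstruction, not the authors' released code, and the temperature value is an assumption.

```python
# Sketch of a supervised SimCSE-style contrastive loss with hard negatives.
# anchor / positive / hard_negative are (batch, dim) sentence embeddings;
# names and the temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def supervised_simcse_loss(anchor, positive, hard_negative, temperature=0.05):
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    hard_negative = F.normalize(hard_negative, dim=-1)

    # Cosine similarity of each anchor to every positive and every hard negative
    sim_pos = anchor @ positive.t() / temperature        # (batch, batch)
    sim_neg = anchor @ hard_negative.t() / temperature   # (batch, batch)
    logits = torch.cat([sim_pos, sim_neg], dim=1)        # (batch, 2 * batch)

    # The i-th anchor's true positive sits in column i of sim_pos
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)
```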
References
Aitchison, L.: Infonce is a variational autoencoder. arXiv preprint arXiv:2107.02495 (2021)
Chen, J., Chen, Q., Liu, X., Yang, H., Lu, D., Tang, B.: The BG corpus: a large-scale domain-specific Chinese corpus for sentence semantic equivalence identification. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4946–4951 (2018)
Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A Simple Framework for Contrastive Learning of Visual Representations, pp. 1597–1607 (2020). http://proceedings.mlr.press/v119/chen20j.html
Ant Financial: Ant Financial artificial intelligence competition (2018)
Gao, T., Yao, X., Chen, D.: SimCSE: simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821 (2021)
Gillick, D., et al.: Learning dense representations for entity retrieval, pp. 528–537 (2019). https://www.aclweb.org/anthology/K19-1049
Hadsell, R., Chopra, S., LeCun, Y.: Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), 17–22 June 2006, New York, pp. 1735–1742. IEEE Computer Society (2006). https://doi.org/10.1109/CVPR.2006.100
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2015)
Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., Soricut, R.: ALBERT: a lite BERT for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942 (2019)
Liu, X., et al.: LCQMC: a large-scale Chinese question matching corpus. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1952–1962 (2018)
Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
Meng, Y., et al.: COCO-LM: correcting and contrasting text sequences for language model pretraining. arXiv preprint arXiv:2102.08473 (2021)
Reimers, N., Beyer, P., Gurevych, I.: Task-Oriented Intrinsic Evaluation of Semantic Textual Similarity, pp. 87–96 (2016). https://www.aclweb.org/anthology/C16-1009
Reimers, N., Gurevych, I.: Sentence-BERT: sentence embeddings using Siamese BERT-networks, pp. 3982–3992 (2019). https://doi.org/10.18653/v1/D19-1410
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
Su, J., Cao, J., Liu, W., Ou, Y.: Whitening sentence representations for better semantics and faster retrieval. arXiv preprint arXiv:2103.15316 (2021)
Sun, X., Sun, S., Yin, M., Yang, H.: Hybrid neural conditional random fields for multi-view sequence labeling. Knowl. Based Syst. 189, 105151 (2020)
Vaswani, A., et al.: Attention is all you need. arXiv preprint arXiv:1706.03762 (2017)
Wang, B., et al.: On position embeddings in BERT. In: International Conference on Learning Representations (2020)
Wang, L., Huang, J., Huang, K., Hu, Z., Wang, G., Gu, Q.: Improving neural language generation with spectrum control (2020). https://openreview.net/forum?id=ByxY8CNtvr
Williams, A., Nangia, N., Bowman, S.: A broad-coverage challenge corpus for sentence understanding through inference, pp. 1112–1122 (2018). https://doi.org/10.18653/v1/N18-1101
Yang, H., Chen, J., Zhang, Y., Meng, X.: Optimized query terms creation based on meta-search and clustering. In: 2008 Fifth International Conference on Fuzzy Systems and Knowledge Discovery, vol. 2, pp. 38–42. IEEE (2008)
Yang, H., Deng, Y., Wang, M., Qin, Y., Sun, S.: Humor detection based on paragraph decomposition and BERT fine-tuning. In: AAAI Workshop 2020 (2020)
Yang, H., Xie, G., Qin, Y., Peng, S.: Domain specific NMT based on knowledge graph embedding and attention. In: 2019 21st International Conference on Advanced Communication Technology (ICACT), pp. 516–521. IEEE (2019)
Yang, Y., Zhang, Y., Tar, C., Baldridge, J.: PAWS-X: a cross-lingual adversarial dataset for paraphrase identification. arXiv preprint arXiv:1908.11828 (2019)
Zhang, D., et al.: Pairwise supervised contrastive learning of sentence representations (2021)
Zhang, N., et al.: CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark. arXiv preprint arXiv:2106.08087 (2021)
7 Appendix
7.1 CCDC with Different PLM and Different Pooling Layer
To compare different model architectures, model sizes, and pooling types, we follow [27], which provides three ALBERT-based and three RoBERTa-based pre-trained models for Chinese. The pooling type can be mean, cls, or max, denoting average pooling, class-token pooling, and max pooling, respectively. Table 6 lists the results of 19 PLM + pooling-layer combinations (\(6 \times 3 + 1 = 19\)).
As can be seen, ALBERT does not match BERT-base, even with the large model, owing to parameter sharing. RoBERTa-large achieves the best all-domain performance, and mean pooling yields the best results in most domains, while ATEC, BQ, and LCQMC perform best with max pooling.
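For clarity, the three pooling strategies compared in Table 6 differ only in how token-level hidden states are reduced to one sentence vector. Below is a minimal, mask-aware sketch of mean/cls/max pooling over a PLM's last hidden states; the function and variable names are illustrative assumptions, not taken from the paper's code.

```python
# Sketch of the mean / cls / max pooling variants over PLM token embeddings.
import torch

def pool(last_hidden_state, attention_mask, pooling="mean"):
    """last_hidden_state: (batch, seq_len, dim); attention_mask: (batch, seq_len)."""
    mask = attention_mask.unsqueeze(-1).float()             # (batch, seq_len, 1)
    if pooling == "cls":
        return last_hidden_state[:, 0]                      # [CLS] token embedding
    if pooling == "mean":
        summed = (last_hidden_state * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1e-9)     # mask-aware average
    if pooling == "max":
        masked = last_hidden_state.masked_fill(mask == 0, -1e9)
        return masked.max(dim=1).values                     # per-dimension maximum
    raise ValueError(f"unknown pooling type: {pooling}")
```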
7.2 Case Analysis
We calculated the sentence-level similarities for the corresponding three groups of models and found that: (1) for BERT-base, the similarity scores are all above 0.93, regardless of whether the sentences are semantically related; (2) for the unsupervised SimCSE model, the scores are 0.90 vs. 0.86 for semantically related vs. unrelated pairs; (3) for the CCDC model, the scores are 0.93 vs. 0.81. The CCDC model therefore discriminates better than BERT and unsupervised SimCSE, as shown in Table 7.
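The similarity scores above are cosine similarities between pooled sentence embeddings from each encoder. The sketch below shows one plausible way to reproduce such a score with Hugging Face transformers; the checkpoint name and mean-pooling choice are assumptions, and the example sentences are placeholders rather than the paper's actual cases.

```python
# Sketch: cosine similarity between two sentences' mean-pooled BERT embeddings.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")  # assumed checkpoint
model = AutoModel.from_pretrained("bert-base-chinese")

def sentence_similarity(sent_a, sent_b):
    batch = tokenizer([sent_a, sent_b], padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state             # (2, seq_len, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()
    emb = (hidden * mask).sum(dim=1) / mask.sum(dim=1)        # mean pooling
    return F.cosine_similarity(emb[0], emb[1], dim=0).item()
```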