Interpreting Pretrained Language Models via Concept Bottlenecks

  • Conference paper
  • First Online:
Advances in Knowledge Discovery and Data Mining (PAKDD 2024)

Abstract

Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their “black-box” nature poses challenges for responsible implementation. Although previous studies have attempted to improve interpretability by using, e.g., attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of “Food” and investigate how it influences the prediction of a model’s sentiment towards a restaurant review. We introduce C\(^3\)M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we show that our approach offers valuable insights to interpret PLM behavior, helps diagnose model failures, and enhances model robustness amidst noisy concept labels.


Notes

  1. https://www.kaggle.com/datasets/omkarsabnis/yelp-reviews-dataset.

  2. https://github.com/Zhen-Tan-dmml/CBM_NLP.git.

References

  1. Abraham, E.D., et al.: CEBaB: estimating the causal effects of real-world concepts on NLP model behavior. In: Advances in Neural Information Processing Systems, vol. 35, pp. 17582–17596 (2022)


  2. Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., Raffel, C.A.: Mixmatch: a holistic approach to semi-supervised learning. In: Advances in Neural Information Processing Systems, vol. 32 (2019)


  3. Bills, S., et al.: Language models can explain neurons in language models (2023). https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

  4. Brown, T., et al.: Language models are few-shot learners. In: Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901 (2020)


  5. Cai, H., Xia, R., Yu, J.: Aspect-category-opinion-sentiment quadruple extraction with implicit aspects and opinions. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) (2021)


  6. Devlin, J., Chang, M.-W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)

  7. Diao, S., et al.: Black-box prompt learning for pre-trained language models. arXiv preprint arXiv:2201.08531 (2022)

  8. Englesson, E., Azizpour, H.: Generalized Jensen-Shannon divergence loss for learning with noisy labels. In: Advances in Neural Information Processing Systems, vol. 34, pp. 30284–30297 (2021)


  9. Galassi, A., Lippi, M., Torroni, P.: Attention in natural language processing. IEEE Trans. Neural Netw. Learn. Syst. 32(10), 4291–4308 (2020)


  10. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)


  11. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)


  12. Kim, B., et al.: Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In: International Conference on Machine Learning, pp. 2668–2677. PMLR (2018)


  13. Kim, E., Klinger, R.: Who feels what and why? Annotation of a literature corpus with semantic roles of emotions. In: Proceedings of the 27th International Conference on Computational Linguistics, pp. 1345–1359 (2018)


  14. Koh, P.W., et al.: Concept bottleneck models. In: International Conference on Machine Learning, pp. 5338–5348. PMLR (2020)


  15. Liu, Y., Cheng, H., Zhang, K.: Identifiability of label noise transition matrix. arXiv preprint arXiv:2202.02016 (2022)

  16. Liu, Y., et al.: RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)

  17. Losch, M., Fritz, M., Schiele, B.: Interpretability beyond classification output: semantic bottleneck networks. arXiv preprint arXiv:1907.10882 (2019)

  18. Maas, A., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning word vectors for sentiment analysis. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142–150 (2011)


  19. Madsen, A., Reddy, S., Chandar, S.: Post-hoc interpretability for neural NLP: a survey. ACM Comput. Surv. 55(8), 1–42 (2022)


  20. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

  21. Németh, R., Sik, D., Máté, F.: Machine learning of concepts hard even for humans: the case of online depression forums. Int. J. Qual. Methods 19, 1609406920949338 (2020)


  22. Oikarinen, T., Das, S., Nguyen, L.M., Weng, T.-W.: Label-free concept bottleneck models. In: The Eleventh International Conference on Learning Representations (2023)


  23. OpenAI: GPT-4 technical report (2023)


  24. Paszke, A., et al.: Automatic differentiation in PyTorch. In: NeurIPS (2017)


  25. Radford, A., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)


  26. Ross, A., Marasović, A., Peters, M.E.: Explaining NLP models via minimal contrastive editing (mice). In: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pp. 3840–3852 (2021)


  27. Sohn, K., et al.: Fixmatch: simplifying semi-supervised learning with consistency and confidence. In: Advances in Neural Information Processing Systems, vol. 33, pp. 596–608 (2020)


  28. Vig, J., et al.: Investigating gender bias in language models using causal mediation analysis. In: Advances in Neural Information Processing Systems, vol. 33, pp. 12388–12401 (2020)


  29. Wolf, T., et al.: Huggingface’s transformers: state-of-the-art natural language processing (2020)


  30. Yang, J., Zhang, Y., Li, L., Li, X.: Yedda: a lightweight collaborative text span annotation tool. arXiv preprint arXiv:1711.03759 (2017)

  31. Yin, K., Neubig, G.: Interpreting language models with contrastive explanations. arXiv preprint arXiv:2202.10419 (2022)

  32. Zarlenga, M.E., et al.: Concept embedding models. In: NeurIPS 2022 - 36th Conference on Neural Information Processing Systems (2022)


  33. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)

  34. Zhang, W., Li, X., Deng, Y., Bing, L., Lam, W.: A survey on aspect-based sentiment analysis: tasks, methods, and challenges. IEEE Trans. Knowl. Data Eng. (2022)


  35. Zhu, J., et al.: Incorporating BERT into neural machine translation. In: International Conference on Learning Representations (2020)



Acknowledgements

This work is supported by the National Science Foundation (NSF) under grant IIS-2229461.

Author information


Corresponding author

Correspondence to Zhen Tan.


Appendices

A Definitions of Training Strategies

Given a text input \(x \in \mathbb {R}^d\), concepts \(c\in \mathbb {R}^k\), and the task label y, the strategies for fine-tuning the text encoder \(f_\theta \), the projector \(p_\psi \), and the label predictor \(g_\phi \) are defined as follows:

i) Vanilla fine-tuning a PLM: The concept labels are ignored, and then the text encoder \(f_\theta \) and the label predictor \(g_\phi \) are fine-tuned either as follows:

$$\begin{aligned} \theta , \phi = \textrm{argmin}_{\theta , \phi } L_{CE} (g_\phi (f_\theta (x)), y), \end{aligned}$$

or as follows (frozen text encoder \(f_\theta \)):

$$\begin{aligned} \phi = \textrm{argmin}_{\phi } L_{CE} (g_\phi (f_\theta (x)), y), \end{aligned}$$

where \(L_{CE}\) denotes the cross-entropy loss. In this work, we only consider the former option due to its significantly better performance.
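As a minimal PyTorch-style sketch of this vanilla objective (not the released implementation), the stand-in encoder, dimensions, and learning rate below are illustrative placeholders:

```python
import torch
import torch.nn as nn

# Placeholder stand-ins for f_theta (text encoder) and g_phi (label predictor);
# in the paper the encoder is a PLM such as BERT and x is tokenized text.
f_theta = nn.Sequential(nn.Linear(512, 768), nn.ReLU())  # stand-in encoder over pre-extracted features
g_phi = nn.Linear(768, 5)                                # e.g. 5-way sentiment classification
ce = nn.CrossEntropyLoss()
opt = torch.optim.Adam(list(f_theta.parameters()) + list(g_phi.parameters()), lr=1e-5)

def vanilla_step(x, y):
    """One update of (theta, phi) on L_CE(g_phi(f_theta(x)), y); concept labels are ignored."""
    loss = ce(g_phi(f_theta(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Example usage with random tensors standing in for a batch of 8 encoded reviews.
vanilla_step(torch.randn(8, 512), torch.randint(0, 5, (8,)))
```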

ii) Independently training PLM with the concept and task labels: The text encoder \(f_\theta \), the projector \(p_\psi \), and the label predictor \(g_\phi \) are trained separately with ground-truth concept labels and task labels as follows:

$$\begin{aligned} \begin{aligned} \theta , \psi &= \textrm{argmin}_{\theta , \psi } L_{CE} (p_\psi (f_\theta (x)),c), \\ \phi &= \textrm{argmin}_{\phi } L_{CE} (g_{\phi }(c),y). \end{aligned} \end{aligned}$$

During inference, the label predictor uses the output of the projector rather than the ground-truth concepts.
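Below is a minimal PyTorch-style sketch of the independent strategy under the simplifying assumption of four concepts with three possible values each (e.g. Negative/Unknown/Positive); all module shapes, names, and learning rates are placeholders rather than the released implementation.

```python
import torch
import torch.nn as nn

k, num_values, num_classes = 4, 3, 5                       # assumed sizes: 4 concepts, 3 values each, 5 task labels
f_theta = nn.Sequential(nn.Linear(512, 768), nn.ReLU())    # stand-in text encoder
p_psi = nn.Linear(768, k * num_values)                     # concept projector
g_phi = nn.Linear(k, num_classes)                          # label predictor reads concept values directly
ce = nn.CrossEntropyLoss()
opt_concept = torch.optim.Adam(list(f_theta.parameters()) + list(p_psi.parameters()), lr=1e-5)
opt_label = torch.optim.Adam(g_phi.parameters(), lr=1e-4)

def independent_step(x, c, y):
    """Concept head trained on (x, c); label predictor trained on ground-truth concepts (c, y)."""
    concept_logits = p_psi(f_theta(x)).view(-1, k, num_values)
    concept_loss = ce(concept_logits.reshape(-1, num_values), c.reshape(-1))
    opt_concept.zero_grad(); concept_loss.backward(); opt_concept.step()

    label_loss = ce(g_phi(c.float()), y)                   # inputs are the ground-truth concept values
    opt_label.zero_grad(); label_loss.backward(); opt_label.step()
    return concept_loss.item(), label_loss.item()

# At inference, the label predictor instead consumes the argmax of the concept logits.
independent_step(torch.randn(8, 512), torch.randint(0, num_values, (8, k)), torch.randint(0, num_classes, (8,)))
```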

iii) Sequentially training PLM with the concept and task labels: We first learn the concept encoder as in the independent training strategy above, and then use its output to train the label predictor:

$$\begin{aligned} \begin{aligned} \phi = \textrm{argmin}_{\phi } L_{CE} (g_{\phi }(p_\psi (f_\theta (x))), y). \end{aligned} \end{aligned}$$

iv) Jointly training PLM with the concept and task labels: Learn the concept encoder and label predictor via a weighted sum \(L_{joint}\) of the two objectives described above:

$$\begin{aligned} \begin{aligned} \theta , \psi , \phi &= \textrm{argmin}_{\theta , \psi , \phi } L_{joint}(x, c, y) \\ &= \textrm{argmin}_{\theta , \psi , \phi } [L_{CE} (g_{\phi }(p_\psi (f_\theta (x))), y) \\ &+ \gamma L_{CE} (p_\psi (f_\theta (x)), c)]. \end{aligned} \end{aligned}$$

Note that CBE-PLMs trained jointly are sensitive to the loss weight \(\gamma \). We report the results for the most effective setting; the tested values for \(\gamma \) are given in Table 2 in Appendix D.
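A corresponding sketch of the joint objective is given below (sequential training amounts to first fitting the concept head alone and then fitting only \(g_\phi \) on its frozen outputs); as before, the shapes, names, and the example \(\gamma \) are illustrative assumptions.

```python
import torch
import torch.nn as nn

k, num_values, num_classes, gamma = 4, 3, 5, 0.5            # gamma weights the concept loss
f_theta = nn.Sequential(nn.Linear(512, 768), nn.ReLU())     # stand-in text encoder
p_psi = nn.Linear(768, k * num_values)                      # concept projector
g_phi = nn.Linear(k * num_values, num_classes)              # label predictor reads predicted concept scores
ce = nn.CrossEntropyLoss()
params = list(f_theta.parameters()) + list(p_psi.parameters()) + list(g_phi.parameters())
opt = torch.optim.Adam(params, lr=1e-5)

def joint_step(x, c, y):
    """One update of (theta, psi, phi) on L_CE(task) + gamma * L_CE(concepts)."""
    concept_logits = p_psi(f_theta(x))                      # shape (batch, k * num_values)
    task_loss = ce(g_phi(concept_logits), y)                # label predictor sees predicted concepts
    concept_loss = ce(concept_logits.view(-1, num_values), c.view(-1))
    loss = task_loss + gamma * concept_loss
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

joint_step(torch.randn(8, 512), torch.randint(0, num_values, (8, k)), torch.randint(0, num_classes, (8,)))
```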

B Details of the Manual Concept Annotation for the IMDB Dataset

Our annotation policy follows previous work on annotating NLP datasets [5]. For the IMDB-C dataset, we manually annotate the four concepts (Acting, Storyline, Emotional Arousal, Cinematography). Although the concepts are naturally understandable by humans, two Master's students familiar with sentiment analysis were recruited as annotators and labeled the data independently with the annotation tool introduced in [30]. The strict quadruple matching F1 score between the two annotators is \(85.74\%\), indicating consistent agreement [13]. In cases of disagreement, a third expert was asked to make the final decision.

C Implementation Detail

In this section, we provide more details on the implementation settings of our experiments. We implement our framework with PyTorch [24] and HuggingFace [29] and train it on a single Nvidia A100 GPU with 80 GB of memory. We follow prior work [1] for the backbone implementation. All backbone models use a maximum token number of 512 and a batch size of 8. We use the Adam optimizer to update the backbone, projector, and label predictor as described in Sect. 3.1. The values of the other hyperparameters (Table 2 in Appendix D) for each specific PLM type are determined through grid search.
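As a concrete illustration of this setup, the sketch below loads a BERT backbone with HuggingFace under the stated limits (maximum of 512 tokens, batch size 8, Adam); the checkpoint name and learning rate are placeholders, with the tuned values coming from the grid search in Table 2.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")   # placeholder backbone choice
backbone = AutoModel.from_pretrained("bert-base-uncased")

# Tokenize a toy batch of 8 reviews with the maximum token number of 512 used in the experiments.
texts = ["The food was great but the service was painfully slow."] * 8
batch = tokenizer(texts, truncation=True, max_length=512, padding=True, return_tensors="pt")
cls_repr = backbone(**batch).last_hidden_state[:, 0]              # [CLS] representations

# Adam updates the backbone, projector, and label predictor; the learning rate here is a placeholder.
optimizer = torch.optim.Adam(backbone.parameters(), lr=2e-5)
```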

D Parameters and Notations

In this section, we list the notations used in this paper along with their descriptions. We also list the values evaluated for each parameter and the optimal ones, as shown in Table 2.

Table 2. Key parameters in this paper with their annotations and evaluated values. Note that bold values indicate the optimal ones.

E Statistics of Data Splits

The statistics and split policies of the experimented datasets, including the source concept dataset \(\mathcal {D}_s\), the unlabeled concept dataset \(\mathcal {D}_u\), and their augmented versions, are presented in Table 3.

Table 3. Statistics of experimented datasets. k denotes the number of concepts.

F Statistics of Concepts in Transformed Datasets

The statistics of the concepts in the transformed versions of the experimented datasets are presented in Table 4.

Table 4. Statistics of concepts in transformed datasets (\(\tilde{\mathcal {D}}\)). Human-specified concepts are underlined. Concepts shown in gray are not used in experiments as the portion of the “Unknown” label is too large.

G More Results on Explainable Predictions

Case studies on explainable predictions for the CEBaB and IMDB-C datasets are given in Fig. 5 and Fig. 6, respectively.

Fig. 5. Illustration of the explainable prediction for an example from the CEBaB dataset.

Fig. 6. Illustration of the explainable prediction for an example from the IMDB-C dataset.

H A Case Study on Test-Time Intervention

We present a case study of Test-time Intervention using an example from the transformed unlabeled concept data \(\tilde{\mathcal {D}}_u\) of the CEBaB dataset, as shown in Fig. 7. The first row displays the target concept labels generated by ChatGPT. The second row shows the predictions from the trained CBE-PLM model, which mispredicts two concepts ("Waiting time" and "Waiting area"). The third row demonstrates test-time intervention using ChatGPT as the oracle, which corrects the predicted task labels. Finally, the fourth row implements test-time intervention with a human oracle, rectifying the concept that ChatGPT originally mislabeled.

Fig. 7. Illustration of the explainable prediction for an example from the transformed unlabeled concept data \(\tilde{\mathcal {D}}_{u}\) of the CEBaB dataset. The brown box with dashed lines indicates the test-time intervention on the corresponding concepts. (Color figure online)
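To make the intervention mechanism concrete, the sketch below overwrites selected concept predictions with oracle-provided values before they reach the label predictor; the concept indices and the one-hot replacement scheme are illustrative assumptions rather than the exact implementation.

```python
import torch

def intervene(concept_logits, corrections, num_values=3):
    """Replace chosen concept predictions with oracle values (from ChatGPT or a human).

    concept_logits: tensor of shape (k, num_values) from the projector p_psi.
    corrections:    dict {concept_index: oracle_value_index}.
    """
    fixed = concept_logits.clone()
    for idx, value in corrections.items():
        fixed[idx] = torch.zeros(num_values)
        fixed[idx, value] = 1.0            # hard one-hot overwrite of the mispredicted concept
    return fixed

# Hypothetical example: correct "Waiting time" (index 4) and "Waiting area" (index 5)
# before re-running the label predictor g_phi on the intervened concept scores.
logits = torch.randn(8, 3)                 # 8 concepts, 3 possible values each
corrected = intervene(logits, {4: 2, 5: 1})
```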

I Examples of Querying ChatGPT

In this paper, we query ChatGPT to 1) augment the concept set and 2) annotate missing concept labels. Note that in practice, we query ChatGPT (GPT-4) via the OpenAI API; here we demonstrate examples from the ChatGPT (GPT-4) GUI for better illustration. The illustrations are given in Fig. 8 and Fig. 9, and a sketch of an API query is shown after the figures.

Fig. 8. Illustration of querying ChatGPT for additional concepts for the IMDB-C dataset.

Fig. 9. Illustration of querying ChatGPT for annotating a missing concept label for the IMDB-C dataset.
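For reference, a minimal sketch of such an annotation query through the OpenAI Python client is shown below; the prompt wording, review text, and model identifier are placeholders and do not reproduce the exact prompts used in the paper.

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

review = "The plot dragged in places, but the lead actor's performance was outstanding."
prompt = (
    "For the movie review below, label the concept 'Acting' as Positive, Negative, or Unknown.\n"
    f"Review: {review}\n"
    "Answer with a single word."
)

response = client.chat.completions.create(
    model="gpt-4",                                   # placeholder model identifier
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)           # e.g. "Positive"
```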


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Tan, Z., Cheng, L., Wang, S., Yuan, B., Li, J., Liu, H. (2024). Interpreting Pretrained Language Models via Concept Bottlenecks. In: Yang, D.N., Xie, X., Tseng, V.S., Pei, J., Huang, J.W., Lin, J.C.W. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2024. Lecture Notes in Computer Science, vol. 14647. Springer, Singapore. https://doi.org/10.1007/978-981-97-2259-4_5


  • DOI: https://doi.org/10.1007/978-981-97-2259-4_5

  • Published:

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-2261-7

  • Online ISBN: 978-981-97-2259-4

  • eBook Packages: Computer Science, Computer Science (R0)
