Abstract
While Machine Learning (ML) applications have achieved impressive results in tasks such as computer vision, NLP, and control, those results were obtained, first and foremost, in best-case settings. Unfortunately, settings in which ML applications fail unexpectedly abound, and malicious users or data contributors can deliberately trigger such failures. This problem has become known as adversarial example robustness. Although the field is still developing rapidly, fundamental results have already been established that offer insight into how to make ML methods resilient to input manipulation and data poisoning; models with this property are termed adversarially robust. The current generation of LLMs is not adversarially robust, but results obtained in other branches of ML suggest how they could be made so. Such insight would complement and augment ongoing empirical efforts in the same direction, notably red-teaming.
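To make the notion of an evasion attack concrete, the sketch below applies the fast gradient sign method (FGSM, introduced by Goodfellow et al.) to a toy PyTorch classifier. The model, input, label, and perturbation budget are hypothetical placeholders rather than this chapter's own setup; FGSM is only the simplest instance of the evasion attacks the abstract refers to.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier standing in for any differentiable model (hypothetical setup).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

x = torch.randn(1, 4, requires_grad=True)  # clean input
y = torch.tensor([1])                      # assumed true label
epsilon = 0.25                             # L-infinity perturbation budget

# One gradient step that increases the loss: the fast gradient sign method.
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
x_adv = (x + epsilon * x.grad.sign()).detach()

# The perturbation is bounded by epsilon in every coordinate, yet it may
# already be enough to change the model's prediction.
print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())

A single signed-gradient step already illustrates the failure mode described above: a perturbation that is small in every coordinate can suffice to alter the output, which is why both attacks and defenses in this literature are framed in terms of such a perturbation budget.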
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Guerraoui, R., Pinot, R. (2024). Adversarial Evasion on LLMs. In: Kucharavy, A., Plancherel, O., Mulder, V., Mermoud, A., Lenders, V. (eds) Large Language Models in Cybersecurity. Springer, Cham. https://doi.org/10.1007/978-3-031-54827-7_20
DOI: https://doi.org/10.1007/978-3-031-54827-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54826-0
Online ISBN: 978-3-031-54827-7
eBook Packages: Computer Science, Computer Science (R0)