Abstract
While Machine Learning (ML) applications have achieved impressive results in tasks such as computer vision, NLP, and control, those results were obtained, first and foremost, in best-case settings. Unfortunately, settings in which ML applications fail unexpectedly abound, and malicious users or data contributors can deliberately trigger such failures. This problem has become known as adversarial example robustness. Although the field is still developing rapidly, fundamental results have already been established that offer insight into how to make ML methods resilient to input manipulation and data poisoning; models with this property are termed adversarially robust. The current generation of LLMs is not adversarially robust, but results obtained in other branches of ML suggest how they could be made so. Such insight would complement and augment ongoing empirical efforts in the same direction, notably red-teaming.
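To make the notion of an evasion attack concrete, the sketch below applies the fast gradient sign method (FGSM, introduced by Goodfellow et al.) to a toy PyTorch classifier. The model, input, label, and perturbation budget are hypothetical placeholders rather than this chapter's own setup; FGSM is only the simplest instance of the evasion attacks the abstract refers to.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy classifier standing in for any differentiable model (hypothetical setup).
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

x = torch.randn(1, 4, requires_grad=True)  # clean input
y = torch.tensor([1])                      # assumed true label
epsilon = 0.25                             # L-infinity perturbation budget

# One gradient step that increases the loss: the fast gradient sign method.
loss = nn.CrossEntropyLoss()(model(x), y)
loss.backward()
x_adv = (x + epsilon * x.grad.sign()).detach()

# The perturbation is bounded by epsilon in every coordinate, yet it may
# already be enough to change the model's prediction.
print("clean prediction:      ", model(x).argmax(dim=1).item())
print("adversarial prediction:", model(x_adv).argmax(dim=1).item())

A single signed-gradient step already illustrates the failure mode described above: a perturbation that is small in every coordinate can suffice to alter the output, which is why both attacks and defenses in this literature are framed in terms of such a perturbation budget.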
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Guerraoui, R., Pinot, R. (2024). Adversarial Evasion on LLMs. In: Kucharavy, A., Plancherel, O., Mulder, V., Mermoud, A., Lenders, V. (eds) Large Language Models in Cybersecurity. Springer, Cham. https://doi.org/10.1007/978-3-031-54827-7_20
DOI: https://doi.org/10.1007/978-3-031-54827-7_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54826-0
Online ISBN: 978-3-031-54827-7
eBook Packages: Computer Science, Computer Science (R0)