Abstract
Prompt red-teaming is a form of evaluation that probes machine learning models for vulnerabilities that could result in undesirable behaviors. It resembles adversarial attacks, but red-teaming prompts read like ordinary natural language, and they reveal model limitations that can cause harmful user experiences or facilitate violence. Red-teaming can be resource-intensive because of the large prompt space that must be searched for possible model failures. One possible workaround is to augment the model with a classifier trained to predict potentially undesirable texts. Red-teaming LLMs is a developing research area that still lacks established best practices. Behaviors of concern include persuading people to harm themselves or others, as well as other problematic outputs such as memorization, spam, weapons assembly instructions, and the generation of code with pre-defined vulnerabilities. The challenge in evaluating LLMs for malicious behaviors is that the models are never explicitly trained to exhibit them, so red-teaming methods must be continually developed to keep pace as models become more powerful. Multi-organization collaboration on datasets and best practices can enable smaller entities that release models to red-team them before release, leading to a safer user experience across the board.
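The classifier-based workaround mentioned in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the chapter's method: it screens candidate red-teaming prompts or model outputs with an off-the-shelf toxicity classifier from the Hugging Face Hub. The model name (unitary/toxic-bert) and the 0.5 threshold are illustrative assumptions, and any comparable text-classification model could be swapped in.

```python
# Minimal sketch (assumptions noted above): flag texts a toxicity classifier
# scores as undesirable, so they can be reviewed or filtered during red-teaming.
from transformers import pipeline

# unitary/toxic-bert is one publicly available toxicity classifier;
# substitute any text-classification model trained on undesirable content.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def flag_undesirable(texts, threshold=0.5):
    """Return the texts whose top classifier label scores above the threshold.

    For this particular model every label denotes a flavor of toxicity, so a
    high top-label score is treated as a flag; adapt the logic to the label
    scheme of whichever classifier you use.
    """
    results = classifier(texts, truncation=True)
    return [text for text, result in zip(texts, results) if result["score"] >= threshold]

if __name__ == "__main__":
    candidates = [
        "Here is a short, friendly summary of the article.",
        "You are worthless and everyone would be better off without you.",
    ]
    print(flag_undesirable(candidates))
```

In practice such a classifier is used as a coarse first-pass filter; its false negatives are exactly the cases human red-teamers are needed for.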
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Ruiu, D. (2024). LLMs Red Teaming. In: Kucharavy, A., Plancherel, O., Mulder, V., Mermoud, A., Lenders, V. (eds) Large Language Models in Cybersecurity. Springer, Cham. https://doi.org/10.1007/978-3-031-54827-7_24
DOI: https://doi.org/10.1007/978-3-031-54827-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54826-0
Online ISBN: 978-3-031-54827-7
eBook Packages: Computer Science, Computer Science (R0)