Abstract
Prompt red-teaming is a form of evaluation that probes machine learning models for vulnerabilities that could result in undesirable behaviors. It resembles adversarial attacks, but red-teaming prompts read like ordinary natural language, and they reveal model limitations that can cause harmful user experiences or facilitate violence. Red-teaming can be resource-intensive because of the large prompt space that must be searched for possible model failures. One possible workaround is to augment the model with a classifier trained to predict potentially undesirable texts. Red-teaming LLMs is a developing research area that still lacks established best practices. Behaviors of concern include persuading people to harm themselves or others, as well as other problematic outputs such as memorization, spam, weapons assembly instructions, and the generation of code with pre-defined vulnerabilities. The challenge in evaluating LLMs for malicious behaviors is that the models are never explicitly trained to exhibit them, so red-teaming methods must be continually developed to keep pace as models become more powerful. Multi-organization collaboration on datasets and best practices can enable smaller entities that release models to red-team them before release, leading to a safer user experience across the board.
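The classifier-based workaround mentioned in the abstract can be sketched in a few lines of Python. The snippet below is a minimal illustration, not the chapter's method: it screens candidate red-teaming prompts or model outputs with an off-the-shelf toxicity classifier from the Hugging Face Hub. The model name (unitary/toxic-bert) and the 0.5 threshold are illustrative assumptions, and any comparable text-classification model could be swapped in.

```python
# Minimal sketch (assumptions noted above): flag texts a toxicity classifier
# scores as undesirable, so they can be reviewed or filtered during red-teaming.
from transformers import pipeline

# unitary/toxic-bert is one publicly available toxicity classifier;
# substitute any text-classification model trained on undesirable content.
classifier = pipeline("text-classification", model="unitary/toxic-bert")

def flag_undesirable(texts, threshold=0.5):
    """Return the texts whose top classifier label scores above the threshold.

    For this particular model every label denotes a flavor of toxicity, so a
    high top-label score is treated as a flag; adapt the logic to the label
    scheme of whichever classifier you use.
    """
    results = classifier(texts, truncation=True)
    return [text for text, result in zip(texts, results) if result["score"] >= threshold]

if __name__ == "__main__":
    candidates = [
        "Here is a short, friendly summary of the article.",
        "You are worthless and everyone would be better off without you.",
    ]
    print(flag_undesirable(candidates))
```

In practice such a classifier is used as a coarse first-pass filter; its false negatives are exactly the cases human red-teamers are needed for.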
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2024 The Author(s)
About this chapter
Cite this chapter
Ruiu, D. (2024). LLMs Red Teaming. In: Kucharavy, A., Plancherel, O., Mulder, V., Mermoud, A., Lenders, V. (eds) Large Language Models in Cybersecurity. Springer, Cham. https://doi.org/10.1007/978-3-031-54827-7_24
DOI: https://doi.org/10.1007/978-3-031-54827-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-54826-0
Online ISBN: 978-3-031-54827-7
eBook Packages: Computer Science, Computer Science (R0)