
Semantic anomaly detection with large language models


Abstract

As robots acquire increasingly sophisticated skills and see increasingly complex and varied environments, the threat of an edge case or anomalous failure is ever present. For example, Tesla cars have seen interesting failure modes ranging from autopilot disengagements due to inactive traffic lights carried by trucks to phantom braking caused by images of stop signs on roadside billboards. These system-level failures are not due to failures of any individual component of the autonomy stack but rather system-level deficiencies in semantic reasoning. Such edge cases, which we call semantic anomalies, are simple for a human to disentangle yet require insightful reasoning. To this end, we study the application of large language models (LLMs), endowed with broad contextual understanding and reasoning capabilities, to recognize such edge cases and introduce a monitoring framework for semantic anomaly detection in vision-based policies. Our experiments apply this framework to a finite state machine policy for autonomous driving and a learned policy for object manipulation. These experiments demonstrate that the LLM-based monitor can effectively identify semantic anomalies in a manner that shows agreement with human reasoning. Finally, we provide an extended discussion on the strengths and weaknesses of this approach and motivate a research outlook on how we can further use foundation models for semantic anomaly detection. Our project webpage can be found at https://sites.google.com/view/llm-anomaly-detection.
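
At its core, the framework wraps a vision-based policy with an LLM that is asked whether the current observation could semantically confuse the policy. Below is a minimal sketch of such a monitoring loop; the function names, prompt wording, and output parsing are illustrative assumptions rather than the implementation used in the paper, and the LLM is treated as a generic text-in/text-out callable.

```python
from typing import Callable, List

def build_prompt(scene_objects: List[str], task: str) -> str:
    """Assemble a monitoring prompt from a textual scene description (illustrative wording)."""
    object_list = ", ".join(scene_objects)
    return (
        f"A robot is performing the task: {task}.\n"
        f"It observes the following objects: {object_list}.\n"
        "Could any of these observations cause the robot's policy to behave incorrectly? "
        "Answer 'anomaly' or 'nominal' and briefly explain."
    )

def semantic_anomaly_monitor(
    scene_objects: List[str],
    task: str,
    query_llm: Callable[[str], str],
) -> bool:
    """Return True if the LLM judges the scene to contain a semantic anomaly."""
    response = query_llm(build_prompt(scene_objects, task))
    # Crude keyword parsing for the sketch; a real monitor would parse a structured answer.
    return "anomaly" in response.lower()

if __name__ == "__main__":
    # Stubbed LLM for illustration; a deployed monitor would call an actual LLM service.
    fake_llm = lambda prompt: "anomaly: the truck ahead is carrying inactive traffic lights."
    flagged = semantic_anomaly_monitor(
        ["truck carrying traffic lights", "pedestrian on sidewalk", "stop sign printed on a billboard"],
        "drive through the intersection while obeying traffic signals",
        fake_llm,
    )
    print("raise alert to the autonomy stack" if flagged else "continue nominal operation")
```

In the paper's experiments, the textual scene description comes from the policy's perception stack (an object detector for the driving policy, privileged simulator information for the manipulation policy; see Notes 3 and 5), and the LLM's free-form answer is interpreted as an anomaly judgment.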


Availability of data and materials

Relevant documentation, data, and/or code are available upon request to verify the validity of the presented results.

Notes

  1. https://futurism.com/the-byte/tesla-autopilot-bamboozled-truck-traffic-lights.

  2. https://www.youtube.com/watch?v=-OdOmU58zOw.

  3. Although we use YOLOv8 (Jocher et al., 2023) in our vehicle planner, we find that DETR yields similar performance. We apply the baselines to DETR because it is trained on the same dataset as YOLOv8 while its architecture is more amenable to applying traditional OOD detectors (a generic sketch of one such feature-space detector follows these notes).

  4. This task is adapted from the put-blocks-in-bowl task defined by Shridhar et al. (2021).

  5. In these experiments, we generated these scene descriptions using privileged simulator information. In principle, an object detector could have been used to identify the objects involved in our experiments; however, we found that the simulator visuals were not amenable to pretrained detection models.
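
As a concrete illustration of the feature-space OOD baselines mentioned in Note 3, the sketch below fits a Gaussian to in-distribution feature embeddings and scores a new embedding by its Mahalanobis distance (in the spirit of Lee et al., 2018). It is a generic sketch under stated assumptions, not the paper's exact baseline: the hypothetical `extract_features` helper stands in for pooling DETR encoder embeddings, and the decision threshold must be calibrated on held-out in-distribution data.

```python
import numpy as np

def fit_gaussian(train_features: np.ndarray):
    """Fit a single Gaussian to in-distribution features of shape (N, D)."""
    mean = train_features.mean(axis=0)
    cov = np.cov(train_features, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])  # regularize to keep the covariance invertible
    precision = np.linalg.inv(cov)
    return mean, precision

def mahalanobis_score(feature: np.ndarray, mean: np.ndarray, precision: np.ndarray) -> float:
    """Squared Mahalanobis distance; larger values indicate likely OOD inputs."""
    diff = feature - mean
    return float(diff @ precision @ diff)

# Usage sketch (hypothetical helper `extract_features` pools detector embeddings):
# train_feats = np.stack([extract_features(img) for img in in_distribution_images])
# mean, precision = fit_gaussian(train_feats)
# is_ood = mahalanobis_score(extract_features(test_img), mean, precision) > threshold
```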

References

  • Abdar, M., Pourpanah, F., Hussain, S., Rezazadegan, D., Liu, L., Ghavamzadeh, M., Fieguth, P., Cao, X., Khosravi, A., Acharya, U. R., Makarenkov, V., & Nahavandi, S. (2021). A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76, 243–297.

  • Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al. (2022). Flamingo: A visual language model for few-shot learning. In Advances in neural information processing systems.

  • Amini, A., Schwarting, W., Soleimany, A., & Rus, D. (2020). Deep evidential regression. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Advances in neural information processing systems, (vol. 33, pp. 14927–14937). Curran Associates, Inc.

  • Antonante, P., Spivak, D. I., & Carlone, L. (2021). Monitoring and diagnosability of perception systems. In 2021 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 168–175).

  • Arjovsky, M., Bottou, L., Gulrajani, I., & Lopez-Paz, D. (2019). Invariant risk minimization. arXiv preprint arXiv:1907.02893.

  • Banerjee, S., Sharma, A., Schmerling, E., Spolaor, M., Nemerouf, M., & Pavone, M. (2023). Data lifecycle management in evolving input distributions for learning-based aerospace applications. In L. Karlinsky, T. Michaeli, & K. Nishino (Eds.), Computer vision–ECCV 2022 workshops (pp. 127–142). Cham: Springer.

  • Brohan, A., Chebotar, Y., Finn, C., Hausman, K., Herzog, A., Ho, D., Ibarz, J., Irpan, A., Jang, E., Julian, R., et al. (2023). Do as I can, not as I say: Grounding language in robotic affordances. In Conference on robot learning (pp. 287–318). PMLR.

  • Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. In Advances in neural information processing systems, (Vol. 33, pp. 1877–1901).

  • Bubeck, S., Chandrasekaran, V., Eldan, R., Gehrke, J., Horvitz, E., Kamar, E., Lee, P., Lee, Y. T., Li, Y., Lundberg, S., et al. (2023). Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv preprint arXiv:2303.12712.

  • Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In European conference on computer vision (pp. 213–229). Springer.

  • Chen, W., Hu, S., Talak, R., & Carlone, L. (2022). Leveraging large language models for robot 3D scene understanding.

  • Cui, Y., Niekum, S., Gupta, A., Kumar, V., & Rajeswaran, A. (2022). Can foundation models perform zero-shot task specification for robot manipulation? In Learning for dynamics and control conference (pp. 893–905). PMLR.

  • Daftry, S., Zeng, S., Bagnell, J. A., & Hebert, M. (2016). Introspective perception: Learning to predict failures in vision systems. In 2016 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1743–1750).

  • de Haan, P., Jayaraman, D., & Levine, S. (2019). Causal confusion in imitation learning. In Advances in neural information processing systems (Vol. 32). Curran Associates, Inc.

  • De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., & Tuytelaars, T. (2022). A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7), 3366–3385.

  • Denouden, T., Salay, R., Czarnecki, K., Abdelzad, V., Phan, B., & Vernekar, S. (2018). Improving reconstruction autoencoder out-of-distribution detection with mahalanobis distance. arXiv preprint arXiv:1812.02765.

  • Dosovitskiy, A., Ros, G., Codevilla, F., Lopez, A., & Koltun, V. (2017). CARLA: An open urban driving simulator. In Conference on robot learning (pp. 1–16). PMLR.

  • Downs, L., Francis, A., Koenig, N., Kinman, B., Hickman, R., Reymann, K., McHugh, T. B., & Vanhoucke, V. (2022). Google scanned objects: A high-quality dataset of 3D scanned household items. In 2022 international conference on robotics and automation (ICRA) (pp. 2553–2560). IEEE.

  • Driess, D., Xia, F., Sajjadi, M. S. M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., & Florence, P. (2023). PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.

  • Gal, Y., & Ghahramani, Z. (2016). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In M. F. Balcan & K. Q. Weinberger (Eds.), Proceedings of the 33rd international conference on machine learning, Proceedings of machine learning research (Vol. 48, pp. 1050–1059). PMLR.

  • Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., & Wichmann, F. A. (2020). Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11), 665–673.

  • Gomez-Donoso, F., Castano-Amoros, J., Escalona, F., & Cazorla, M. (2023). Three-dimensional reconstruction using SFM for actual pedestrian classification. Expert Systems with Applications, 213, 119006.

  • Gulrajani, I., & Lopez-Paz, D. (2021). In search of lost domain generalization. In International conference on learning representations.

  • Hendrycks, D., & Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International conference on learning representations.

  • Huang, W., Xia, F., Xiao, T., Chan, H., Liang, J., Florence, P., Zeng, A., Tompson, J., Mordatch, I., Chebotar, Y., et al. (2022). Inner monologue: Embodied reasoning through planning with language models. In 6th annual conference on robot learning.

  • Japkowicz, N., Myers, C. E., & Gluck, M. A. (1995). A novelty detection approach to classification. In International joint conference on artificial intelligence.

  • Jocher, G., Chaurasia, A., & Qiu, J. (2023). YOLO by Ultralytics.

  • Koh, P. W., et al. (2021). WILDS: A benchmark of in-the-wild distribution shifts. In M. Meila & T. Zhang (Eds.), Proceedings of the 38th international conference on machine learning, Proceedings of machine learning research (Vol. 139, pp. 5637–5664). PMLR.

  • Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. In Advances in neural information processing systems.

  • Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems. (Vol. 30). Curran Associates Inc.

  • Lee, M. A., Tan, M., Zhu, Y., & Bohg, J. (2021). Detect, reject, correct: Crossmodal compensation of corrupted sensors. In 2021 IEEE international conference on robotics and automation (ICRA) (pp. 909–916).

  • Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems. (Vol. 31). Curran Associates Inc.

  • Li, J., Li, D., Xiong, C., & Hoi, S. (2022). BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning (pp. 12888–12900). PMLR.

  • Liang, J., Huang, W., Xia, F., Xu, P., Hausman, K., Florence, P., Zeng, A., et al. (2022). Code as policies: Language model programs for embodied control. In Workshop on language and robotics at CoRL 2022.

  • Liang, S., Li, Y., & Srikant, R. (2018). Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th international conference on learning representations, ICLR 2018.

  • Lin, K., Agia, C., Migimatsu, T., Pavone, M., & Bohg, J. (2023). Text2motion: From natural language instructions to feasible plans. arXiv preprint arXiv:2303.12153.

  • Lin, Z., Roy, S. D., & Li, Y. (2021). MOOD: Multi-level out-of-distribution detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 15313–15323).

  • Liu, J., Lin, Z., Padhy, S., Tran, D., Bedrax Weiss, T., & Lakshminarayanan, B. (2020). Simple and principled uncertainty estimation with deterministic deep learning via distance awareness. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Advances in neural information processing systems (Vol. 33, pp. 7498–7512). Curran Associates, Inc.

  • Liu, Z., Bahety, A., & Song, S. (2023). REFLECT: Summarizing robot experiences for failure explanation and correction. arXiv preprint arXiv:2306.15724.

  • Madaan, A., Zhou, S., Alon, U., Yang, Y., & Neubig, G. (2022). Language models of code are few-shot commonsense learners. arXiv preprint arXiv:2210.07128.

  • McAllister, R., Kahn, G., Clune, J., & Levine, S. (2019). Robustness to out-of-distribution inputs via task-aware generative uncertainty. In ICRA (pp. 2083–2089).

  • Michels, F., Adaloglou, N., Kaiser, T., & Kollmann, M. (2023). Contrastive language-image pretrained (CLIP) models are powerful out-of-distribution detectors.

  • Minderer, M., Gritsenko, A., Stone, A., Neumann, M., Weissenborn, D., Dosovitskiy, A., Mahendran, A., Arnab, A., Dehghani, M., Shen, Z., Wang, X., Zhai, X., Kipf, T., & Houlsby, N. (2022). Simple open-vocabulary object detection with vision transformers. In ECCV.

  • OpenAI. (2023). GPT-4 technical report.

  • Osband, I., Wen, Z., Asghari, S. M., Dwaracherla, V., Ibrahimi, M., Lu, X., & Van Roy, B. (2023). Epistemic neural networks.

  • Ovadia, Y., Fertig, E., Ren, J., Nado, Z., Sculley, D., Nowozin, S., Dillon, J. V., Lakshminarayanan, B., & Snoek, J. (2019). Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Proceedings of the 33rd international conference on neural information processing systems, Red Hook, NY, USA. Curran Associates Inc.

  • Oza, P., & Patel, V. M. (2019). C2AE: Class conditioned auto-encoder for open-set recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition (pp. 2307–2316).

  • Rabiee, S., & Biswas, J. (2019). IVOA: Introspective vision for obstacle avoidance. In 2019 IEEE/RSJ international conference on intelligent robots and systems (IROS) (pp. 1230–1235). IEEE Press.

  • Ren, A. Z., Dixit, A., Bodrova, A., Singh, S., Tu, S., Brown, N., Xu, P., Takayama, L., Xia, F., Varley, J., et al. (2023). Robots that ask for help: Uncertainty alignment for large language model planners. arXiv preprint arXiv:2307.01928.

  • Richter, C., & Roy, N. (2017). Safe visual navigation via deep learning and novelty detection. In Robotics: Science and systems (RSS).

  • Ritter, H., Botev, A., & Barber, D. (2018). A scalable Laplace approximation for neural networks. In 6th international conference on learning representations (ICLR 2018), conference track proceedings.

  • Rosinol, A., Violette, A., Abate, M., Hughes, N., Chang, Y., Shi, J., Gupta, A., & Carlone, L. (2021). Kimera: From SLAM to spatial perception with 3D dynamic scene graphs. The International Journal of Robotics Research, 40(12–14), 1510–1546.

  • Ruff, L., Vandermeulen, R., Goernitz, N., Deecke, L., Siddiqui, S. A., Binder, A., Müller, E., & Kloft, M. (2018). Deep one-class classification. In Proceedings of the 35th international conference on machine learning, Proceedings of machine learning research (Vol. 80, pp. 4393–4402). PMLR.

  • Ruff, L., Kauffmann, J. R., Vandermeulen, R. A., Montavon, G., Samek, W., Kloft, M., Dietterich, T. G., & Müller, K.-R. (2021). A unifying review of deep and shallow anomaly detection. Proceedings of the IEEE, 109(5), 756–795.

  • Salehi, M., Mirzaei, H., Hendrycks, D., Li, Y., Rohban, M. H., & Sabokrou, M. (2021). A unified survey on anomaly, novelty, open-set, and out-of-distribution detection: Solutions and future challenges.

  • Shah, D., Osiński, B., Levine, S., et al. (2023). LM-NAV: Robotic navigation with large pre-trained models of language, vision, and action. In Conference on robot learning (pp. 492–504). PMLR.

  • Sharma, A., Azizan, N., & Pavone, M. (2021). Sketching curvature for efficient out-of-distribution detection for deep neural networks. In Uncertainty in artificial intelligence (pp. 1958–1967). PMLR.

  • Shridhar, M., Manuelli, L., & Fox, D. (2021). CLIPort: What and where pathways for robotic manipulation. In Proceedings of the 5th conference on robot learning (CoRL).

  • Sinha, R., Sharma, A., Banerjee, S., Lew, T., Luo, R., Richards, S. M., Sun, Y., Schmerling, E., & Pavone, M. (2022). A system-level view on out-of-distribution data in robotics. arXiv preprint arXiv:2212.14020.

  • Srivastava, M., Goodman, N., & Sadigh, D. (2023). Generating language corrections for teaching physical control tasks. arXiv preprint arXiv:2306.07012.

  • Torralba, A., & Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR, 2011 (pp. 1521–1528).

  • Volk, G., Müller, S., von Bernuth, A., Hospach, D., & Bringmann, O. (2019). Towards robust CNN-based object detection through augmentation with synthetic rain variations. In 2019 IEEE intelligent transportation systems conference (ITSC) (pp. 285–292).

  • Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E. H., Le, Q. V., Zhou, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Advances in neural information processing systems.

  • Wilson, G., & Cook, D. J. (2020). A survey of unsupervised deep domain adaptation. ACM Transactions on Intelligent Systems and Technology, 11(5), 1–46.

  • Yang, J., Zhou, K., Li, Y., & Liu, Z. (2021). Generalized out-of-distribution detection: A survey. arXiv preprint arXiv:2110.11334.

  • Yu, F., Chen, H., Wang, X., Xian, W., Chen, Y., Liu, F., Madhavan, V., & Darrell, T. (2020). BDD100K: A diverse driving dataset for heterogeneous multitask learning. In IEEE/CVF conference on computer vision and pattern recognition (CVPR).

  • Zeng, A., Florence, P., Tompson, J., Welker, S., Chien, J., Attarian, M., Armstrong, T., Krasin, I., Duong, D., Sindhwani, V., & Lee, J. (2020). Transporter networks: Rearranging the visual world for robotic manipulation. In Conference on robot learning (CoRL).

  • Zeng, A., Wong, A., Welker, S., Choromanski, K., Tombari, F., Purohit, A., Ryoo, M., Sindhwani, V., Lee, J., Vanhoucke, V., et al. (2022). Socratic models: Composing zero-shot multimodal reasoning with language. arXiv preprint arXiv:2204.00598.

  • Zong, B., Song, Q., Min, M. R., Cheng, W., Lumezanu, C., Cho, D., & Chen, H. (2018). Deep autoencoding Gaussian mixture model for unsupervised anomaly detection. In International conference on learning representations.

Funding

The NASA University Leadership Initiative (Grant #80NSSC20M0163) provided funds to assist the authors with their research. Amine Elhafsi is supported by a NASA NSTGRO fellowship (Grant #80NSSC19K1143). This article solely reflects the opinions and conclusions of its authors and not any NASA entity.

Author information

Contributions

AE initiated the project, developed the methodology, performed prompt tuning, and implemented and conducted the experiments. RS prepared the structure for the CARLA autonomous vehicle stack, conducted autonomous vehicle experiments, computed autoencoder OOD detector baseline metrics, processed experimental results, and performed data analysis. CA implemented the autoencoder OOD detector baseline for the learned policy experiments. ES implemented the autonomous vehicle traffic light classification, performed data analysis, and advised the project. IADN advised the project. MP was the primary advisor for the project. The manuscript was jointly written by AE, RS, and ES. All authors reviewed and revised the manuscript.

Corresponding author

Correspondence to Amine Elhafsi.

Ethics declarations

Conflict of interest

Not applicable.

Consent to participate

Not applicable.

Consent for publication

The authors unanimously endorsed the content and provided explicit consent for submission. They also ensured that consent was obtained from the responsible authorities at the institute(s)/organization(s) where the research was conducted.

Code availability

Code and data will be provided at https://sites.google.com/view/llm-anomaly-detection.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A Additional details: reasoning-based policy

The following template was designed to prompt an analysis of the autonomous vehicle’s scene observations. Placeholders, indicated by braces, are replaced with the relevant information at each query.

(Figure g: prompt template for the autonomous driving scene analysis; not reproduced here.)
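
Because the template itself appears only in the figure, the snippet below is a hypothetical illustration of how a brace-delimited template of this kind can be filled at query time; the placeholder names and prompt wording are assumptions, not the template used in the paper.

```python
# Hypothetical brace-delimited template; the actual template is shown in the omitted figure.
TEMPLATE = (
    "You are monitoring an autonomous vehicle.\n"
    "Current observation: {observation}\n"
    "Planned maneuver: {maneuver}\n"
    "List anything in the observation that could cause the planned maneuver to fail, "
    "then conclude with 'anomaly' or 'nominal'."
)

def fill_template(observation: str, maneuver: str) -> str:
    # str.format substitutes the brace placeholders with the current query's information
    return TEMPLATE.format(observation=observation, maneuver=maneuver)

print(fill_template(
    observation="a truck directly ahead is carrying several unlit traffic lights",
    maneuver="proceed through the intersection at the speed limit",
))
```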

Appendix B Additional experimental details: learned policy

B.1 Prompt template

The following prompt was designed to elicit from the LLM a comparison between the distractor objects and the blocks and bowls. Placeholders, indicated by braces, are replaced with the relevant information at each query.

(Figures h and i: prompt template for the learned manipulation policy; not reproduced here.)
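
As in Appendix A, the actual prompt appears only in the omitted figures; the following is a hypothetical sketch of how a zero-shot comparison prompt over the scene's blocks, bowls, and distractor objects might be assembled.

```python
from typing import List

def comparison_prompt(blocks: List[str], bowls: List[str], distractors: List[str]) -> str:
    """Hypothetical zero-shot prompt contrasting distractors with the task-relevant objects."""
    return (
        "A robot must place the specified blocks into the specified bowls.\n"
        f"Blocks in the scene: {', '.join(blocks)}.\n"
        f"Bowls in the scene: {', '.join(bowls)}.\n"
        f"Other objects in the scene: {', '.join(distractors)}.\n"
        "For each of the other objects, state whether it could plausibly be mistaken "
        "for a block or a bowl, and explain your reasoning."
    )

print(comparison_prompt(
    blocks=["red block", "blue block"],
    bowls=["green bowl", "yellow bowl"],
    distractors=["red apple", "blue mug"],
))
```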
Table 7: Distractors used in the learned policy experiments (table not reproduced here).

We chose to abstain from few-shot prompting for this set of experiments. The diversity of the common household object classes used as distractors (in contrast to driving-related object classes, such as traffic lights and signs, which exhibit some degree of standardization) necessitated some degree of zero-shot reasoning by the LLM. This zero-shot prompting strategy encouraged the LLM to leverage its inherent knowledge of common objects more effectively. In contrast, when few-shot prompted, we found that the responses tended to overfit to the provided examples, negatively impacting the LLM’s function as a monitor.

B.2 Semantic and neutral distractors

See Table 7.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Elhafsi, A., Sinha, R., Agia, C. et al. Semantic anomaly detection with large language models. Auton Robot 47, 1035–1055 (2023). https://doi.org/10.1007/s10514-023-10132-6

  • DOI: https://doi.org/10.1007/s10514-023-10132-6
