1 Introduction

Over the last decade, we have witnessed remarkable progress in applying language models (LMs) to robotics. This progress spans not only human-like communication but also robots' understanding and reasoning capabilities, significantly improving their effectiveness across tasks ranging from household chores to industrial operations [52, 105]. Early successes stemmed from statistical models that analyze and predict words in linguistic expressions. These models enable robots to interpret human commands [110, 121], understand contexts [3, 5], represent the world [50], and interact with humans [135], albeit with a limited depth of understanding. The adoption of the Transformer architecture with self-attention mechanisms [141], particularly pre-trained LMs such as BERT [26], then elevated the ability to capture complex patterns while fine-tuning models for specific tasks. However, the performance of these models is often contingent on limited datasets, constraining their ability to grasp deeper context and generalize across diverse scenarios.

With the advancement of large language models (LLMs), language-based robotics has introduced innovative changes across various areas, such as information retrieval, reasoning, adaptation to environments, and continuous learning and improvement [61, 64]. These LLMs, characterized by their vast parameter sizes and training on internet-scale datasets, offer zero- and few-shot learning capabilities for downstream tasks without requiring additional parameter updates. These advancements stem from emergent abilities, defined in the literature as 'abilities that are not present in small models but arise in large models' [148]. Such abilities have significantly enhanced robots' performance in understanding, inferring, and responding to open-set instructions by leveraging extensive commonsense knowledge [8]. Furthermore, prompt engineering techniques have enabled LLMs to incorporate richer contextual information through free-form language descriptions or interactive dialogues, facilitating generalized reasoning [149]. In-context learning [8] further leads LLMs to generate outputs in expected formats, such as JSON, YAML, PDDL, or even code, based on instructions or demonstrations provided in prompts [42, 86]. Recent LLMs, such as GPT-4, have expanded these capabilities further by integrating with external robotics tools such as planners or translators [89].

Despite the diverse capabilities of LLMs, their utilization faces several challenges [69]. First, LLMs often generate inaccurate or unexpected responses. Because the safety of robot execution is one of the most important deployment factors, LLM-based robotic applications require filtering and correction mechanisms to ensure safety. Second, emergent abilities such as in-context learning are not yet predictable or consistent [19]; even minor alterations to the input text may lead to unpredictable changes in responses. Third, although a well-designed prompt enables robots to effectively leverage the abilities of LLMs, there is a lack of systematic guidelines covering the key components of robotic systems, hindering seamless integration [35, 54, 164]. Therefore, we need to investigate component-wise LLM engagement in robotics to understand its limitations and safety implications.

Various surveys have begun exploring the intersection of LLMs and robotics [142, 164], primarily focusing on the application or interaction dimensions of LLM-based robotics. However, there remains a gap in providing holistic reviews and actionable insights for integrating LLMs across key elements of robotic systems, including communication, perception, planning, and control. Additionally, researchers have explored the broader field of pre-trained large-capacity models, called foundation models, seeking generalization capabilities across multimodal Transformer-based models [35, 54]. However, this expansive field spans a wide spectrum of robotics and diverse methodologies, leaving emerging researchers without in-depth reviews and guidelines.

In this paper, as shown in Fig. 1, we aim to categorize and analyze how LLMs can enhance core elements of robotic systems and to guide emerging researchers in integrating LLMs within each domain, encompassing communication, perception, planning, and control, toward the development of intelligent robots. We structure this paper around three key questions:

  • Q1: How are LLMs being utilized in each robotics domain?

  • Q2: How can researchers overcome the limitations of integrating LLMs?

  • Q3: What basic prompt structures are required to produce minimal functionality in each domain?

To address these questions, we focus on LLMs developed after the introduction of GPT-3.5 [106]. We primarily consider text-based modalities but also review multimodal approaches in the perception and control areas. For an in-depth review, however, we limit our investigation to LLMs rather than broader foundation models.

In addition, we provide comprehensive guidelines and examples for prompt engineering, aimed at enabling beginners to access LLM-based robotics solutions. Our tutorial-level examples illustrate how fundamental functionalities of robotic components can be augmented or replaced through four types of exemplary prompts: a conversational prompt for interactive grounding, a directive prompt for scene-graph generation, a planning prompt for few-shot planning, and a code-generation prompt for reward generation. By providing rules and tips for prompt construction, we outline the process of generating well-designed prompts that yield outputs in the desired format. These principles enable effective LLM-guided enhancements in robotics applications without parameter adjustments.

The remainder of this paper is organized as follows. Section 2 outlines the historical background of LMs and LLMs in robotics. Section 3 reviews how LLMs empower robots to communicate via language understanding and generation. Section 4 investigates how LLMs perceive various sensor modalities and advance sensing behaviors. Sections 5 and 6 organize LLM-based planning and control studies, respectively. In Sect. 7, we provide comprehensive guidelines for prompt engineering as a starting point for LLM integration in robotics. Finally, Sect. 8 summarizes this survey.

Fig. 1

Overview structure of intelligent robotics research integrated with LLMs in this survey. The rightmost cells show the representative names (e.g., method, model, or authors) of papers in each category

2 Preliminary

We briefly review language models used in robotics, categorizing them by pre- and post-LLM eras. Unlike previous literature [164], we define the pre-LLM era as the period up to the advent of GPT-2 [115], characterized by neural language models such as recurrent neural networks (RNNs) [33] and early Transformer architectures. We then provide a brief explanation of LLMs, introducing terminologies and techniques used in subsequent reviews.

2.1 Language models in robotics

In the pre-LLM era, early studies primarily focused on sequential data processing using RNN-based models [23, 46]. These models were often used to transform linguistic commands into sequences of actions [6, 99] or formal languages [40], leveraging RNNs' sequence-to-sequence modeling capabilities. Researchers have also used RNNs as language encoders that convert textual input into linguistic features, which can then be mapped to visual features for referred-object identification [121, 125]. However, the long-term dependency issue of RNNs restricted the scope of applications. The Transformer architecture [141], a non-sequential model supporting long-range comprehension, subsequently enabled new robotic tasks, such as vision-and-language navigation [14, 16].

Later studies in the pre-LLM era also showed improved application performance compared with earlier methods trained on small, task-specific datasets. Transformer-based models and self-supervised learning techniques, such as masked language modeling, have led to the development of internet-scale pre-trained models, including BERT [26] and GPT-2 [115]. These models exhibit a broad understanding of language, enabling both (1) improved generalization abilities and (2) fine-tuning for specific robotic tasks [74, 75, 124]. In addition, researchers have developed LMs that process multimodal information [116], since robotic applications often require diverse modalities, such as natural language and vision, for interaction with users and the environment [76, 126].

2.2 Large language models in robotics

Recent advancements in LLMs, such as GPT-3 [8], GPT-4 [107], LLaMA [137], Llama 2 [138], and Gemini [2], demonstrate notable improvements in understanding, contextual awareness, generalization capabilities, and knowledge richness, surpassing earlier language models. These improvements stem from training on vast datasets with billion-scale parameters, enabling the models to capture intricate data patterns. Further, advanced learning strategies, such as reinforcement learning from human feedback, have been developed to align the behaviors of LLMs with human values or preferences [108]. However, updating an entire model with this many parameters is computationally expensive. To address this issue, researchers have developed parameter-efficient fine-tuning methods (e.g., adapters [49] and LoRA [51]) for robotic tasks. For example, LLM-POP [132] fine-tunes its model for interactive planning scenarios using adapters, small trainable networks inserted into each layer of an LLM.

Alternatively, prompt engineering with in-context learning (ICL) [8] marks a significant advancement in learning from prompts without additional training. Its effectiveness relies on the design and quality of prompts, which can be enhanced with detailed task descriptions, few-shot examples, or model-friendly formats (e.g., ‘###’ as a stop symbol [167]). Moreover, chain-of-thought (CoT) prompting [149] is another emerging approach that incorporates intermediate reasoning steps in prompts. The CoT method substantially enhances the reasoning and problem-solving capabilities of LLMs, making it a dominant technique in robotics applications [86, 128, 163].
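To make these techniques concrete, the snippet below is a minimal sketch of how a few-shot, chain-of-thought prompt might be assembled in Python for a tabletop instruction; the demonstration, the reasoning wording, and the use of '###' as a stop symbol are illustrative assumptions rather than the prompt of any cited work.

```python
# A minimal sketch of a few-shot, chain-of-thought prompt for a tabletop task.
# The demonstration, the '###' stop symbol, and the reasoning wording are
# illustrative assumptions, not taken from a specific paper.

FEW_SHOT_EXAMPLES = [
    {
        "command": "Put the apple in the bowl.",
        "reasoning": "The apple must be grasped before it can be placed, "
                     "so pick it up first, then place it in the bowl.",
        "plan": "1. pick(apple) 2. place(apple, bowl)",
    },
]

def build_prompt(command: str) -> str:
    """Assemble a task description, worked examples, and the new query."""
    parts = ["You are a robot task planner. Think step by step, then output a plan."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Command: {ex['command']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Plan: {ex['plan']}")
        parts.append("###")  # model-friendly stop symbol separating examples
    parts.append(f"Command: {command}")
    parts.append("Reasoning:")  # chain of thought: ask for reasoning before the plan
    return "\n".join(parts)

print(build_prompt("Put the mug on the shelf."))
```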

3 Communication

We investigate the utilization of LLMs to facilitate human-like communication in robotics, enabling robots to interact effectively with humans and other robotic agents [97]. We categorize communication capabilities into two primary areas: (1) language understanding and (2) language generation. Figure 1 shows the detailed categorization alongside relevant studies, indicated by green cells.

3.1 Language understanding

We review language understanding capabilities, addressing how LLMs handle the variability and ambiguity of linguistic inputs through interpretation and grounding processes.

Interpretation transforms natural-language inputs into semantic representations that are easier for robots to process. These representations include formal languages such as linear temporal logic (LTL) [93, 160] and the planning domain definition language (PDDL) [18, 42, 89, 155], as well as programming languages such as Python [56, 76]. To aid in interpreting free-form sentences, researchers leverage LLMs' ICL capabilities, providing guidelines and demonstrations within prompts [56, 76, 89, 122]. Despite these efforts, LLMs often fail to satisfy syntax or capture precise semantics when translating an input into formal languages. To address this issue, researchers suggest simplifying the vocabulary or fine-tuning LLMs with domain-agnostic data [93, 160]. For example, Lang2LTL [91] translates landmark-referring expressions in navigational commands into LTL symbols. Further improvements often involve using human feedback and syntax checkers to correct generated formal-language translations [18, 42]. For instance, Guan et al. present a human-in-the-loop translation framework, in which human domain experts repeatedly review PDDL descriptions and provide feedback in natural language [42].
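As a concrete illustration of the interpretation step, the sketch below builds a translation prompt that maps a command to a PDDL goal and applies a rough syntax check; the predicate vocabulary and the checking rule are illustrative assumptions, not the pipeline of any cited work.

```python
# Minimal sketch of interpretation: prompting an LLM to translate a natural-language
# command into a PDDL goal, followed by a trivial syntax check. The predicate
# vocabulary and the checking rule are illustrative assumptions.

ALLOWED_PREDICATES = {"on", "in", "holding", "at"}

def build_translation_prompt(command: str) -> str:
    return (
        "Translate the command into a PDDL goal expression.\n"
        f"Use only these predicates: {sorted(ALLOWED_PREDICATES)}.\n"
        "Example:\n"
        "Command: put the book on the table\n"
        "Goal: (on book table)\n"
        f"Command: {command}\n"
        "Goal:"
    )

def rough_syntax_check(goal: str) -> bool:
    """Reject outputs with unbalanced parentheses or unknown predicates."""
    if goal.count("(") != goal.count(")"):
        return False
    tokens = goal.replace("(", " ( ").replace(")", " ) ").split()
    predicates = {tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == "("}
    return predicates <= ALLOWED_PREDICATES | {"and", "not"}

# Example: a hand-written stand-in for the LLM's answer.
candidate = "(and (on cup shelf) (at robot kitchen))"
print(build_translation_prompt("put the cup on the shelf"))
print("syntax ok:", rough_syntax_check(candidate))
```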

Grounding is another process that maps linguistic expressions to reference targets, such as behaviors or objects, recognizable to robots. Early studies identify mappings that maximize the cosine similarity between word embeddings of LLM outputs and real-world targets [58, 76, 93, 117]. Subsequent studies leverage LLMs' commonsense knowledge to capture the context of object text labels for improved grounding [41, 118]. For instance, ConceptGraphs [41] demonstrates how LLMs can ground the expression 'something to use as a paperweight' to a ceramic vase based on size and weight assumptions. However, grounding accuracy depends on the detail and accuracy of the world model. To address this, researchers augment LLMs with multimodal capabilities to directly correlate linguistic inputs with sensory percepts [31, 47, 114, 159], or they enable LLMs to interact with environments [158, 168] or humans [61, 109, 120] for better context gathering. For instance, LLM-Grounder [158], a 3D visual grounding method, actively gathers environmental information using vision tools such as LERF [72] and OpenScene [111].
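The sketch below illustrates the embedding-similarity grounding idea, assuming the sentence-transformers library and an illustrative object list; real systems would use richer world models and task-specific embeddings.

```python
# Minimal sketch of embedding-based grounding: map an LLM's answer to the
# closest known object label by cosine similarity. The embedding model name
# and object list are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
object_labels = ["ceramic vase", "soda can", "paper towel", "stuffed toy"]

def ground(llm_output: str) -> str:
    """Return the object label whose embedding is closest to the LLM output."""
    query = model.encode(llm_output, convert_to_tensor=True)
    candidates = model.encode(object_labels, convert_to_tensor=True)
    scores = util.cos_sim(query, candidates)[0]
    return object_labels[int(scores.argmax())]

print(ground("something heavy to use as a paperweight"))  # likely 'ceramic vase'
```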

3.2 Language generation

Language generation refers to the production of human-like written or spoken language that reflects communicative intents [39]. We categorize language generation into task-dependent and task-independent types based on their communicative intents, diverging from the conventional natural-language generation (NLG) categories of text-to-text and data-to-text [30] because of our focus on the communicative purposes of the studies.

Task-dependent language generation focuses on producing language with specific functional objectives, whether declarative or imperative. To generate open-ended declarative statements, researchers often provide LLMs with contextual information [20, 62, 96]. However, LLMs often produce repetitive and factually inconsistent outputs, constrained by their reliance on previous dialogues and commonsense knowledge [20, 83]. Consequently, researchers augment LLMs with auxiliary knowledge sources to broaden the scope of available information [4, 21, 157]. For instance, Axelsson and Skantze enhance a robot museum guide with knowledge graphs [4]. Furthermore, researchers instruct LLMs to clarify ambiguities by generating imperative instructions requesting human assistance [25, 61]. To improve these inference steps, probabilistic models have been introduced to evaluate the uncertainty of situations [109, 120]. For instance, the KnowNo [120] and CLARA [109] interaction systems assess confidence and semantic variance, respectively, triggering question generation only when these metrics indicate significant uncertainty.

Task-independent language generation involves crafting expressions with social-emotional objectives [11], for example by embedding non-verbal cues (e.g., non-verbal sounds, hand gestures, and facial expressions) within prompts to enhance engagement and empathy [73, 80]. For example, Khoo et al. have developed a conversational robot that generates empathetic responses using transcribed audio and visual cues [73]. However, conversations with LLMs remain superficial due to limited knowledge and dialogue history [65]. To overcome this, researchers integrate memory modules into LLMs, enabling them to distill and store information from conversations in a structured format [22, 63, 65, 162]. For example, the companion robot designed by Irfan et al. continuously updates its memory based on interactions with users to generate personalized dialogues [65].

4 Perception

Perception plays a crucial role in enabling robots to make decisions, plan their actions, and navigate the real world [113]. In the field of LLM-based robotic perception, research primarily focuses on two aspects: sensing modalities and sensing behaviors. In this section, we introduce how LLM-based robots integrate language with sensor modalities and how agents acquire environmental information through passive and active perception behaviors. Figure 1 presents the detailed categorization alongside relevant studies, indicated by pink cells.

4.1 Sensing modalities

Researchers have significantly advanced robots’ comprehension and generalization capabilities through the integration of multimodal language models. We categorize primary sensing modalities into visual, auditory, and haptic modalities, reviewing recent studies leveraging multimodal LLMs for perception tasks.

Visual perception tasks involve the interpretation of visual information such as images or point clouds. Pre-trained visual-language models (VLMs), such as CLIP [116] and InstructBLIP [82], allow LLM-based robots to directly utilize image sources. For instance, recent LLM-based manipulation systems, such as TidyBot [152] and RoCo [96], use image-inferred object labels or scene descriptions generated from CLIP and OWL-ViT [100], respectively. In addition, researchers extend reasoning capabilities by applying VLMs to downstream tasks such as image captioning [41] and visual question answering (VQA) [37, 78, 103]. These downstream tasks enable LLMs to subsequently request VLMs to infer object properties (e.g., material, fragility) [37] or to ground object parts for grasping [103]. However, acquiring spatial-geometric information from images alone is often challenging.

Alternatively, Huang et al. associate visual-language features from a VLM (i.e., LSeg [81]) with three-dimensional (3D) point clouds for 3D map reconstruction [56]. Further, Jatavallabhula et al. improve this association mechanism with RGB-D images by introducing fine-grained and pixel-aligned features from VLMs [66]. However, association with 3D information tends to be memory intensive, limiting scalability for large scenes [56, 66, 158]. As an alternative solution, researchers often associate geometric and semantic features with 3D scene graphs [41].

Auditory perception involves the interpretation of sound. LLM-based studies often leverage pre-trained audio-language models (ALMs), such as AudioCLIP [43] and Wav2CLIP [151], integrating them with visual data to enhance environmental or contextual understanding [55, 94, 123, 163]. For example, AVLMaps [55], a 3D spatial map constructor with cross-modal information, integrates audio, visual, and language signals into 3D maps, enabling agents to navigate using multimodal objectives such as ‘move between the image of a refrigerator and the sound of breaking glass.’ In addition, REFLECT [94], a framework for summarizing robot failures, transforms multisensory observations such as RGB-D images, audio clips, and robot states into textual descriptions to enhance LLM-based failure reasoning.

Haptic perception involves the interpretation of contact information. Researchers introduce multimodal perception modules that interactively incorporate haptic information, either as pre-defined high-level descriptions of haptic interactions [168] or as CLIP-based tactile-image features [48]. For example, MultiPLY [48], a multisensory LLM, converts tactile sensory readings into a heatmap encoded by CLIP. A linear tactile-projector layer then maps the heatmap information into the LLM's feature space.

4.2 Sensing behavior

Based on the type of sensing behavior, we divide this section into passive and active perception.

Passive perception refers to the process of gathering sensory information without actively seeking it out. Despite its limited nature, passive sensing has been extensively employed in LLM-based robotics studies for various tasks: object recognition [37, 53, 152], pose estimation [103, 156], scene reconstruction [41, 59, 122], and object grounding [66, 144, 158]. For example, TidyBot [152] detects the closest object from an overhead view and subsequently recognizes its object category using a closer view captured by the robot's camera. However, the passive nature of sensing limits the ability to perform tasks when information is unobserved or unavailable (e.g., unseen areas or object weights).

On the other hand, active perception refers to the deliberate process of gathering sensory information by taking additional actions. Active information gathering enhances environmental understanding by acquiring new information through sensory observations or by requesting user feedback [78, 129]. For example, LLM-Planner [129] generates seeking actions such as 'open the refrigerator' to locate objects that are not currently visible. Recent studies also focus on collecting sensory data to better understand objects' physical properties [48, 132, 168]. However, LLMs often generate inaccurate or fabricated information, known as hallucinations. To address this issue, Dai et al. introduce a personalized conversational agent designed to ask users for uncertain information [25].

5 Planning

Planning involves organizing actions to solve given problems, typically by generating a sequence of high-level symbolic operators (i.e., task planning) and then executing them using low-level motor controllers [38, 84]. This section investigates how LLM-based planning research addresses limitations in the planning domain, categorizing studies into three key research areas: (1) task planning, (2) motion planning, and (3) task and motion planning (TAMP). Figure 1 presents the detailed categorization along with related planning studies, indicated by purple cells.

5.1 Task planning

LLM-based task planners are capable of generating plans without strict symbol definitions [58], whereas traditional task planners require pre-defined operators with domain knowledge about available actions and constraints [34, 98]. In this field, most planners employ a static planning strategy, which takes fixed descriptions that do not adapt to changes in the environment [163]. An alternative approach, adaptive planning, incorporates environmental feedback into input prompts, enabling adjustments to actions based on observed conditions. This section reviews LLM-based planners in terms of these two strategies: static and adaptive planning.

Static planning approaches are generally zero- or few-shot prediction methods, where zero-shot methods generate a plan based solely on an input command, while few-shot methods leverage learning from a limited set of similar examples [9, 27, 70, 163]. However, LLMs often exhibit poor performance in long-horizon task planning due to limited reasoning ability [89, 140]. To address this limitation, Huang et al. introduce a planner that iteratively selects the most probable action among executable ones generated by LLMs [58].

Alternatively, LLM-based code generators, such as Code as Policies [86] or ProgPrompt [128], produce code that results in actions responsive to observations [56, 57]. Singh et al. demonstrate that code generation outperforms basic task planning with LLMs, since the generated plan aligns closely with the execution environment [128]. Despite their advantages, these methods lack validation and replanning processes.
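The snippet below illustrates the kind of policy code such generators aim to produce; the perception and control functions (detect, pick, place_on) are hypothetical names standing in for the APIs a prompt would expose, and a toy harness is included so the sketch runs on its own.

```python
# Illustrative example of the kind of policy code an LLM might generate in a
# code-generation-style system. The perception/control APIs (detect, pick,
# place_on) are hypothetical names assumed to be exposed in the prompt.

def stack_blocks_by_size(detect, pick, place_on):
    """Stack all detected blocks with the largest at the bottom."""
    blocks = detect("block")                      # -> list of dicts with 'name' and 'size'
    ordered = sorted(blocks, key=lambda b: b["size"], reverse=True)
    for lower, upper in zip(ordered, ordered[1:]):
        pick(upper["name"])                       # grasp the smaller block
        place_on(upper["name"], lower["name"])    # place it on the larger one

# A toy harness standing in for the robot's real API, so the sketch runs as-is.
if __name__ == "__main__":
    scene = [{"name": "red", "size": 3}, {"name": "blue", "size": 5}, {"name": "green", "size": 1}]
    log = []
    stack_blocks_by_size(
        detect=lambda label: scene,
        pick=lambda name: log.append(f"pick({name})"),
        place_on=lambda a, b: log.append(f"place({a} on {b})"),
    )
    print(log)
```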

To validate plans, researchers often augment LLMs with logical programs, either to (1) check whether resulting plans violate logical constraints or (2) generate plans using an external logical planner. For instance, SayPlan [118], a GPT-4-based planner, validates abstract-level actions through a 3D scene-graph (3DSG) simulator [1], while LLM+P [89] feeds a PDDL problem translated by an LLM to a classical task planner, Fast Downward [45]. In addition, Silver et al. demonstrate that a search-based planner seeded with an initial plan from an LLM performs better, exploring fewer nodes [127]. These studies underscore the effectiveness of integrating LLMs with logical programs to increase the success rate and feasibility of generated plans.
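The sketch below outlines an LLM-as-translator, classical-planner-as-solver pipeline in this spirit; the Fast Downward invocation and plan-file name are assumptions about a typical local installation, not code from the cited work.

```python
# Minimal sketch of a translate-then-plan pipeline: the LLM writes a PDDL
# problem and an external classical planner searches for a plan. The Fast
# Downward command line and output file below reflect a typical installation
# and should be treated as assumptions.
import pathlib
import subprocess
import tempfile

def translate_to_pddl_problem(command: str, llm) -> str:
    """Ask the LLM (any callable: str -> str) for a PDDL problem file."""
    return llm(f"Write a PDDL problem for the command: {command}\n"
               "Use the household domain and output only valid PDDL.")

def solve_with_classical_planner(domain_pddl: str, problem_pddl: str) -> str:
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "domain.pddl").write_text(domain_pddl)
    (workdir / "problem.pddl").write_text(problem_pddl)
    # Assumed planner entry point; adjust the path and search options to your setup.
    subprocess.run(
        ["./fast-downward.py", str(workdir / "domain.pddl"),
         str(workdir / "problem.pddl"), "--search", "astar(lmcut())"],
        check=True,
    )
    # Fast Downward conventionally writes the plan to 'sas_plan' in the working directory.
    return pathlib.Path("sas_plan").read_text()
```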

Adaptive planning allows robots to modify their plans or actions in response to feedback, either by generating new plans based on environmental observations [20, 142, 152, 168, 169] or by detecting failures and adjusting accordingly [61]. Chen et al. and Huang et al. introduce adaptation strategies that generate new plans based on observed feedback, enabling robots to respond to a broader range of scenarios [12, 60].

Another adaptation strategy is the detection of failures as feedback. For instance, Inner Monologue [61] retries the initial plan until it succeeds. Furthermore, other studies provide textual explanations of past failures to help avoid recurrent issues [87, 94, 117, 147]. LLM-Planner [129] and COWP [28] improve replanning capabilities by finding alternative plans that leverage context from observations and the commonsense knowledge of LLMs. This flexibility in adapting to new information enhances robot autonomy in dynamic settings.

5.2 Task and motion planning

We outline LLM-based low-level planning, classifying methodologies into motion planning and TAMP areas.

Motion planning refers to the process of generating a path by computing sequential waypoints within configuration or task spaces. Jiao et al. introduce an LLM-based motion planner that directly generates positional sequences for drone choreography [68]. While this work demonstrates LLMs’ spatial reasoning abilities, the scenario presented is relatively simple. Further, planning spaces are often continuous, which presents a challenge for language models that operate with discrete tokens. Alternatively, indirect sequencing approaches, such as VoxPoser [59], generate a potential field code with the help of a VLM and then conduct motion planning within the generated field, augmenting LLMs with a search-based planner.

TAMP refers to integrating high-level task planning with low-level motion planning. Recent studies often use LLMs as TAMP planners, leveraging LLMs’ logical and physical reasoning capabilities [79, 96, 153]. Researchers guide LLMs to generate high-level subgoals, which are then used for low-level trajectory generation [79, 96]. However, the coarse representations of LLMs restrict their applications to simple tasks such as pick-and-place. To address this limitation, researchers use additional prompts or augment LLMs to improve reasoning abilities. For example, Xia et al. enable LLMs to consider kinematic knowledge through kinematic-aware prompting for more complex manipulation tasks, such as articulated object manipulation [153]. Ding et al. introduce a logic-augmented LLM planner that checks the logical feasibility of the task plans generated by LLMs [29]. Meanwhile, others use physics-augmented LLM planners to evaluate physical feasibility [18, 44, 88]. For example, Text2Motion [88] allows an LLM to generate physically feasible high-level actions and combine them with learned skills for low-level actions.

6 Control

Early studies primarily focus on establishing mappings between simple linguistic commands and known motion primitives. With the advent of deep learning, researchers have explored two main approaches in control: direct modeling of control values based on linguistic instructions [7, 119] and indirect interpretation of complex instructions via LLMs to generate actions [154]. We categorize the work in this field into two groups: (1) the direct approach, which generates control commands directly from linguistic instructions, and (2) the indirect approach, which specifies control commands indirectly through linguistic guidance. Figure 1 presents a detailed categorization alongside related papers, indicated by orange cells.

6.1 Direct approach

The direct approach involves using an LLM to interpret and produce executable commands, either by selecting motion primitives [134] or by generating control signals [146, 170]. Early work, such as Gato [119], RT-1 [7], and MOO [131], generates action tokens as a control policy by training Transformer architectures on task-specific expert demonstrations. Researchers linearly map action tokens to discretized end-effector velocities [119] or displacements [7, 131] for continuous motion. While these approaches demonstrate a degree of generalization over unseen tasks, such as new objects or realistic instructions, they often require extensive data collection and training time.

To reduce the collection effort, researchers often leverage existing web-scale vision and language datasets, as in RT-2 [170] and RT-X [143]. For example, Zitkovich et al. train VLMs (e.g., PaLI-X [17] and PaLM-E [31]) on visual-language datasets together with robotic demonstrations [170]. This approach maintains general knowledge of visual-language tasks while training for control tasks. In addition, to reduce the training burden, Chen et al. use a low-rank adaptation (LoRA) [51] method to fine-tune an LLM for control tasks rather than fine-tuning the entire model [15].
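As a minimal sketch of this parameter-efficient route, the snippet below attaches LoRA adapters to a causal LLM with the PEFT library; the base model name, target modules, and hyperparameters are illustrative choices, not those of the cited studies.

```python
# Minimal sketch of parameter-efficient fine-tuning with LoRA via the PEFT
# library, in the spirit of adapting an LLM to a control task without updating
# all weights. The base model name and LoRA hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
# ...then train on (instruction, action-token) pairs with a standard training loop.
```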

LLMs often struggle to generate continuous action-level commands, such as joint positions and torque values, since they generate discrete atomic elements known as tokens [134]. Therefore, researchers instead generate task-level outputs using LLMs [10, 101, 134]. For example, SayTap [134], an LLM-based locomotion controller, generates foot-ground contact patterns instead of directly producing joint positions. Other studies treat the control problem as a natural-language generation task, completing a sequence of end-effector poses [101] or generating Python code [10]. Recent studies often restrict the action space to enhance LLM control outputs. For instance, Wang et al. design a prompt that produces positive integer control values while maintaining a smoothness trend in the outputs [146]. Alternatively, Li et al. demonstrate that incorporating robot kinematics information helps the LLM determine joint values for desired poses [85].

Fig. 2

A conversational prompt for interactive grounding. Through the final "Command" in the prompt, we ask the LLM to ground the underspecified object, referred to as "something" in the "Command", as a "cookie" by interactively asking for personal preferences. The prompt consists of task description, task procedure, and task context parts, guiding the LLM's behavior and contextual understanding. The words in bold indicate subjects of interactions, with LLM responses highlighted in blue

6.2 Indirect approach

LLMs are also useful for generating indirect representations of control commands, such as subgoals or reward functions, based on natural-language instructions. To guide the learning process, researchers leverage goal descriptions that explain desired behaviors in natural language [32, 67, 77]. For example, ELLM [32], an LLM-based reinforcement learning (RL) framework, uses an LLM to generate subgoal descriptions as prior knowledge for the RL policy and computes the reward from the similarity between the current observation and the subgoal description in a text-embedding space. Further, Kumar et al. generate a goal description based on the history of human instructions to reuse previously learned skills [77]. However, as the output of an LLM is a natural-language description, these approaches require an additional step of grounding or interpreting the description.
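A minimal sketch of this similarity-shaped reward idea is shown below; the text encoder is left as a caller-supplied function (a toy bag-of-words stand-in is included so the sketch runs), and the captions are illustrative.

```python
# Minimal sketch of a similarity-shaped reward: the reward is the cosine
# similarity between an LLM-suggested subgoal and a caption of the current
# observation, both mapped to vectors by a caller-supplied text encoder
# (hypothetical `embed`; any sentence-embedding model would do).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def shaping_reward(embed, observation_caption: str, subgoal: str) -> float:
    """Dense reward encouraging states whose description matches the subgoal."""
    return cosine(embed(observation_caption), embed(subgoal))

# Toy stand-in encoder so the sketch runs: hash words into a bag-of-words vector.
def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

print(shaping_reward(toy_embed, "the agent is holding a wooden log", "collect a wooden log"))
```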

Alternatively, researchers often generate code-level reward functions. Yu et al. convert a natural-language goal into a high-level motion description and then generate a corresponding reward function [161]. However, this method requires pre-defined reward formats. Instead, recent work prompts an LLM to infer a new reward function from human-designed examples [71, 145]. Nonetheless, the generated reward functions may not always be accurate or optimal enough for direct use in training [130].

Fig. 3

Directive prompts for generating a scene graph. The table includes two prompts: node creation and edge creation. Given scene images, a multimodal LLM perceives objects in the image and infers relevant relationships using geometric information. The words in bold indicate subjects of outputs, with LLM responses highlighted in blue

To improve accuracy, researchers add a refinement loop to validate both the syntax [112] and semantics [95, 130, 154, 165] of the generated reward functions. For example, Song et al. use an LLM to redesign a reward function based on the convergence of the training process and the resulting robot motion [130]. Chu et al. employ an LLM to directly generate rewards for evaluating robot motion [24]. Other approaches refine a motion by adjusting control parameters based on the error state [133] or by selecting a suitable motion target from human feedback [90].
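The sketch below shows a syntax-level refinement loop of this kind: parse the candidate reward code and, on failure, feed the error back to the model; semantic checks based on training behavior would wrap around this loop in the same spirit. The llm argument is any prompt-to-code callable.

```python
# Minimal sketch of a syntax-level refinement loop for LLM-generated reward
# code: try to parse the candidate and, on failure, feed the error back to the
# model for another attempt. Semantic checks on training behavior would be
# layered on top in the same fashion.
import ast

def generate_valid_reward(llm, task_prompt: str, max_rounds: int = 3) -> str:
    prompt = task_prompt
    for _ in range(max_rounds):
        code = llm(prompt)
        try:
            ast.parse(code)          # cheap syntax validation before any execution
            return code
        except SyntaxError as err:
            prompt = (f"{task_prompt}\nYour previous code failed to parse:\n"
                      f"{err}\nPlease return corrected Python code only.")
    raise RuntimeError("no syntactically valid reward function produced")
```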

7 Prompt guideline

We provide prompt design guidelines for robotic tasks to researchers entering this field. A prompt is a message that directs LLMs to process and generate outputs according to our instructions [92, 150]. Well-designed prompts

  • include clear, concise, and specific statements without using technical jargon,

  • incorporate examples that help anticipate the model's reasoning process,

  • specify the format in which we want the output to be presented, and

  • contain instructions to constrain actions.

Such prompts enable models to generate the desired content, following output formats and constraints, without parameter updates. We provide guidelines for four robotics use cases: (1) interactive grounding, (2) scene-graph generation, (3) few-shot planning, and (4) reward-function generation.

7.1 Conversational prompt: interactive grounding

We detail a conversational prompt design that leverages an LLM as a grounding agent to clarify commands such as 'Bring me something to eat' and to infer the ambiguous target, expressed as 'something,' through logical inference. Figure 2 shows the design in detail; the prompt consists of three key components: task description, task procedure, and task context. We describe each component as follows.

The task description outlines the expected behavior and response format of the LLM. In this example, we particularly emphasize its role as a conversational agent, which fosters dynamic interactions with users, guided by directives such as ‘you should.’ Further, the imperative statements containing ‘keep’ provide task constraints or requirements. We also place behavioral constraints at the end to suppress the LLM’s verbosity.

The task procedure then defines a sequence of inference steps for the LLM to follow, aimed at achieving the task objective. This description employs numbered steps to instruct LLMs to execute the actions step by step. By using logical representations, we also enforce actions to be performed in a logical order; we use ‘iteratively’ to indicate a ‘while loop’ and ‘if’ or ‘when’ to represent conditions.

The task context describes the contextual inputs, such as the 'world model,' that the LLM performs grounding upon. Consistency in terminology across the task description and task procedure is crucial for LLM operation. For example, common expressions such as 'task' and 'world model' allow the LLM to work within the same provided context. Further, by using clear names for objects in the world model, we enable the LLM to apply common knowledge to named entities. Note that although we use a list of objects as the world model, LLMs accept world models in various formats, such as textual descriptions, object lists, and scene graphs.

With these structured components, the prompt invokes an interactive grounding dialogue for precise object identification, as shown in Fig. 2. We obtain the resulting interaction using ChatGPT 3.5 [106].
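Because the figure is easiest to read alongside a concrete template, the snippet below sketches how the three components might be laid out as a single prompt string; the wording and world model are illustrative and not the exact prompt of Fig. 2.

```python
# A sketch of how the three components of the conversational grounding prompt
# (task description, task procedure, task context) might be laid out as one
# string. The wording is illustrative and not the exact prompt shown in Fig. 2.
GROUNDING_PROMPT = """\
Task description: You are a conversational robot assistant. You should ground
underspecified objects in the user's command to one object in the world model.
Keep your questions short. Keep asking until exactly one object matches.

Task procedure:
1. Read the command and the world model.
2. Iteratively ask the user one clarifying question about their preference.
3. When a single object satisfies the preferences, answer with its name only.

Task context:
World model: [cookie, apple, potato chip, energy bar]
Command: Bring me something to eat.
"""
```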

7.2 Directive prompt: scene-graph generation

We introduce directive prompt designs for constructing a scene graph from a scene image using a multimodal LLM, particularly GPT-4 [107]. The scene graph consists of objects as nodes and their relationships as edges [36, 41]. Despite the advancement of multimodal LLMs, their capabilities remain limited when inferring 3D relationships from a 2D image [13]. To mitigate this limitation, we decompose the task into two steps: node creation with multimodal inputs and edge creation with textual information. We describe each step with detailed examples in Fig. 3.

Fig. 4

A planning prompt for few-shot planning. Leveraging input-output example pairs, the LLM improves performance in generating a plan to accomplish the task objective. The prompt consists of task descriptions, examples, and task context. The words in bold indicate subjects of interactions with LLM responses highlighted in blue

The prompt for node creation consists of two parts: (1) task description and (2) task context. The task description includes the multimodal LLM's expected behavior (i.e., role) and response format, similar to Sect. 7.1. For instance, the multimodal LLM's role is to identify objects as nodes in the given images. We then specify the output format as 'ObjectName(ID)' for consistency and simplicity. The task context then presents a sequence of unique object identifiers with corresponding object-centric images. In this scenario, we assume that the object-centric images are obtained using an active perception method to identify objects under occlusion.

The edge-creation prompt consists of (1) task description, (2) examples, and (3) task context. The task description not only specifies the expected behavior and output format but also elucidates how to identify relationships between nodes, leveraging examples. We particularly explain how the LLM uses 3D object coordinates and unit measurements to infer spatial relationships from a pre-defined set such as 'left,' 'right,' etc. Unlike node creation, this prompt allows additional output explanations to accommodate the complexity of discerning spatial relationships.

To enhance the understanding of the input format and corresponding output, we include examples showcasing edge generation. We choose an example similar to the target scenario in terms of objects and their spatial interrelationships, thereby providing richer information for edge identification. Finally, the task context provides source and target node information as inputs and leaves the output empty to obtain responses from the LLM. We also assume that the 3D bounding boxes are obtained through a neural network-based detector [104] or from the extent of point clouds using depth information [41].
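The snippet below sketches how the two prompts and their outputs might be organized in code; the prompt wording, the relation set, and the bounding-box format are illustrative and not the exact prompts of Fig. 3.

```python
# A sketch of the two-step scene-graph prompts described above (node creation,
# then edge creation) and of collecting the answers into a simple graph
# structure. Prompt wording, the 'ObjectName(ID)' format, and the relation set
# are illustrative, not the exact prompts of Fig. 3.
NODE_PROMPT = (
    "Role: identify each object shown in the attached object-centric images.\n"
    "Output format: ObjectName(ID), one per line, using the given IDs."
)

EDGE_PROMPT_TEMPLATE = (
    "Role: infer the spatial relation between two objects from their 3D bounding\n"
    "boxes (metres). Allowed relations: left, right, above, below, on, inside.\n"
    "Source: {src} box={src_box}\nTarget: {dst} box={dst_box}\nRelation:"
)

def build_edge_prompt(src, src_box, dst, dst_box):
    return EDGE_PROMPT_TEMPLATE.format(src=src, src_box=src_box, dst=dst, dst_box=dst_box)

# Example: collect hypothetical LLM answers into node and edge lists.
nodes = ["Mug(1)", "Table(2)"]
edges = [("Mug(1)", "on", "Table(2)")]
print(build_edge_prompt("Mug(1)", (0.1, 0.0, 0.8), "Table(2)", (0.0, 0.0, 0.7)))
```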

Fig. 5

An example prompt for reward-function generation. The prompt consists of task descriptions, available APIs, goals and constraints, and generation rules. The LLM generates a reward function in Python code for RL training.

7.3 Planning prompt: few-shot planning

We present a planning prompt design aimed at predicting subsequent actions to fulfill an instructed objective, integrating contextual elements such as available actions and environmental settings. This design particularly focuses on few-shot planning, enhancing performance through examples. The design comprises four components: (1) task description, (2) examples, (3) task context, and (4) additional interactions, detailed in Fig. 4.

The task description includes task objectives, expected behaviors, and response formats, similar to conventional prompts. However, unlike the previous designs, this prompt specifies the robot's constraints, including initial states and action limitations. For example, the term 'CANNOT' in Fig. 4 emphasizes that the robot can manipulate only one object per action. Moreover, these constraints extend to the rules governing the 'Done' action, which indicates task completion.

The examples demonstrate input–output pairs that guide the LLM in generating the desired action. The examples adapt the generic ‘object’ argument in the allowed actions (e.g., ‘Close (object)’) to specific object names such as ‘drawer’ or ‘paper,’ reinforcing task constraints written in the task description. For instance, the second example returns the ‘Done’ signal instead of further planning after achieving the task objective.

The task context provides information about the current scenario, including 'task,' 'allowed actions,' 'visible objects,' 'executed plans,' and 'next plan,' as shown in the examples. We let the LLM fill in the blank space after 'next plan:', suggesting the next action without adding unnecessary elements such as line breaks, which ensures output precision.

Furthermore, when additional prompts update the executed plans, the LLM generates new plans based on this updated context without reiterating the full task context, enabling a dynamic and iterative planning process that adapts to changes and maintains efficiency.
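The template below sketches how these components might be laid out as a single few-shot planning prompt; the task, actions, and example are illustrative and not the exact prompt of Fig. 4.

```python
# A sketch of the few-shot planning prompt structure described above: task
# description with constraints, a worked example, and a task context whose
# 'next plan:' field is left blank for the LLM to complete. The wording is
# illustrative, not the exact prompt of Fig. 4.
PLANNING_PROMPT = """\
Task description: You are a household robot planner. Output exactly one action
per query. You CANNOT manipulate more than one object per action. Output 'Done'
when the task objective is achieved.

Example:
task: put the paper in the drawer
allowed actions: Open(object), Close(object), Pick(object), Place(object, object), Done
visible objects: drawer, paper
executed plans: Open(drawer), Pick(paper), Place(paper, drawer), Close(drawer)
next plan: Done

Task context:
task: throw away the soda can
allowed actions: Open(object), Close(object), Pick(object), Place(object, object), Done
visible objects: soda can, trash bin
executed plans: Pick(soda can)
next plan:"""
```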

7.4 Code-generation prompt: reward design

We introduce a code-generation prompt design to generate a reward function for the MuJoCo-based Reacher task [136] from Gymnasium [139]. The goal of the Reacher task is to move the end-effector of a robotic arm close to a designated target position from an arbitrary starting configuration. The prompt translates this task objective into reward-specifying code. Figure 5 shows the design in detail, comprising four key elements: (1) task description, (2) available APIs, (3) goals and constraints, and (4) generation rules.

The task description defines the expected robot behavior and task conditions for the LLM, including the robot's control strategies and the action space of the two-joint robot arm. We particularly specify the action space as a continuous 'Box' space using an API from Gymnasium, assuming the LLM's familiarity with well-known library functions. This description then leads the LLM to grasp the overarching RL objective associated with the defined actions.

The available APIs section lists the APIs necessary for designing the reward function, including the name and input–output specification of each API. By providing Python function annotations, we enable the LLM to infer the types of inputs and outputs, given its presumed knowledge of float-like variable types and how the APIs work.

Goals and constraints provide the task objectives and limitations that guide the reward contents. We clearly define the initial setup, goal assignment, and goal conditions, aiming to exclude unnecessary reward components, such as penalizing high velocities for smooth motion. Note that we recommend the use of concise and consistent words, such as ‘torque,’ as used in the task description, instead of ‘power.’ This ensures the generated reward function aligns with the specified task requirements without introducing ambiguities or unintended penalties.

Lastly, the generation rules establish guidelines for producing directly executable code, addressing the tendency of LLMs to generate unnecessary or incorrect variables or functions. These rules restrict such declarations, as written in the second component of the generation rules in Fig. 5, and encourage the use of well-known Python libraries to enhance code quality. Furthermore, because the reward function linearly combines its components, we introduce a rule for scaling reward components to maintain balance.
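For reference, the snippet below sketches the kind of reward function this prompt aims to elicit for the Reacher task: a distance term plus a scaled torque penalty. The helper functions get_fingertip_pos and get_target_pos are hypothetical stand-ins for the APIs assumed to be listed in the prompt.

```python
# A sketch of the kind of reward function the prompt in Fig. 5 aims to elicit
# for the Gymnasium Reacher task: a distance term that pulls the fingertip
# toward the target plus a small torque penalty, combined with scaling weights.
# get_fingertip_pos and get_target_pos are hypothetical helper APIs assumed to
# be listed in the 'available APIs' part of the prompt.
import numpy as np

def reward_fn(get_fingertip_pos, get_target_pos, action: np.ndarray) -> float:
    """Negative distance to the target with a scaled torque penalty."""
    dist = np.linalg.norm(get_fingertip_pos() - get_target_pos())
    torque_penalty = np.sum(np.square(action))
    return -1.0 * dist - 0.1 * torque_penalty  # scaled so the distance term dominates
```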

8 Conclusion

In this survey, we have investigated current robotics research involving large language models in terms of the intelligent robot components of communication, perception, planning, and control. This component-wise investigation reveals how researchers integrate LLMs to overcome challenges inherent in pre-LLM approaches across various tasks, thereby offering a comprehensive understanding of LLMs' impact in this field. Within each component area, we examine methodologies proposed to maximize the utilization of LLMs' capabilities and to enhance the integrity of their responses. Additionally, our survey offers guidelines for prompt engineering in each component area, supplemented with key examples of prompt components, to provide practical insights for researchers entering this field. The core contribution of this paper is to highlight the transformative impact of LLMs in robotics, enabling the development of versatile and intelligent robots. By synthesizing these insights, we aim to guide future research on integrating LLMs into robotic systems.