1 Introduction

Over the last decade, we have witnessed remarkable progress in applying language models (LMs) to robotics. This progress spans not only human-like communication but also robots' understanding and reasoning capabilities, significantly improving their effectiveness across tasks ranging from household chores to industrial operations [52, 105]. Early successes stemmed from statistical models that analyze and predict words in linguistic expressions. These models enable robots to interpret human commands [110, 121], understand contexts [3, 5], represent the world [50], and interact with humans [135], albeit with a limited depth of understanding. The adoption of the Transformer architecture with self-attention mechanisms [141], particularly pre-trained LMs such as BERT [26], then elevated the ability to capture complex patterns while fine-tuning models for specific tasks. However, the performance of these models is often contingent on limited datasets, constraining their ability to grasp deeper context and generalize across diverse scenarios.

With the advancement of large language models (LLMs), language-based robotics has introduced innovative changes across various areas, such as information retrieval, reasoning, adaptation to environments, and continuous learning and improvement [61, 64]. These LLMs, characterized by their vast parameter sizes and training on internet-scale datasets, offer zero- and few-shot learning capabilities for downstream tasks without requiring additional parameter updates. These advancements stem from emergent abilities, defined in the literature as 'abilities that are not present in small models but arise in large models' [148]. Such abilities have significantly enhanced robots' performance in understanding, inferring, and responding to open-set instructions by leveraging extensive commonsense knowledge [8]. Furthermore, prompt engineering techniques have enabled LLMs to incorporate richer contextual information through free-form language descriptions or interactive dialogues, facilitating generalized reasoning [149]. In-context learning [8] further leads LLMs to generate outputs in expected formats, such as JSON, YAML, PDDL, or even code, based on instructions or demonstrations provided in prompts [42, 86]. Recent LLMs, such as GPT-4, have expanded these capabilities further by integrating with external robotics tools such as planners or translators [89].

Despite the diverse capabilities of LLMs, their utilization faces several challenges [69]. First, LLMs often generate inaccurate or unexpected responses. Because the safety of robot execution is one of the most important deployment factors, LLM-based robotic applications require filtering and correction mechanisms to ensure safety. Second, emergent abilities such as in-context learning are not yet predictable or consistent [19]; even minor alterations to the input text may lead to unpredictable changes in responses. Third, although a well-designed prompt enables robots to effectively leverage the abilities of LLMs, there is a lack of systematic guidelines covering the key components of robotic systems, hindering seamless integration [35, 54, 164]. Therefore, we need to investigate component-wise LLM engagement in robotics to understand its limitations and safety implications.

Various surveys have begun exploring the intersection of LLMs and robotics [142, 164], primarily focusing on the application or interaction dimensions of LLM-based robotics. However, there remains a gap in providing holistic reviews and actionable insights for integrating LLMs across key elements of robotic systems, including communication, perception, planning, and control. Additionally, researchers have explored the broader field of pre-trained large-capacity models, called foundation models, seeking generalization capabilities across multimodal Transformer-based models [35, 54]. However, this expansive field spans a wide spectrum of robotics and diverse methodologies, leaving emerging researchers without in-depth reviews and guidelines.

In this paper, as shown in Fig. 1, we aim to categorize and analyze how LLMs can enhance core elements of robotic systems and to guide emerging researchers in integrating LLMs within each domain, encompassing communication, perception, planning, and control, toward the development of intelligent robots. We structure this paper around three key questions:

  • Q1: How are LLMs being utilized in each robotics domain?

  • Q2: How can researchers overcome the limitations of integrating LLMs?

  • Q3: What basic prompt structures are required to produce minimal functionality in each domain?

To address these questions, we focus on LLMs developed after the introduction of GPT-3.5 [106]. We primarily consider text-based modalities but also review multimodal approaches in the perception and control areas. For an in-depth review, however, we limit our investigation to LLMs rather than broader foundation models.

In addition, we provide comprehensive guidelines and examples for prompt engineering, aimed at enabling beginners to access LLM-based robotics solutions. Our tutorial-level examples illustrate how fundamental functionalities of robotic components can be augmented or replaced through four types of exemplary prompts: a conversational prompt for interactive grounding, a directive prompt for scene-graph generation, a planning prompt for few-shot planning, and a code-generation prompt for reward generation. By providing rules and tips for prompt construction, we outline the process of generating well-designed prompts that yield outputs in the desired format. These principles enable effective LLM-guided enhancements in robotics applications without parameter adjustments.

The remainder of this paper is organized as follows. Section 2 outlines the historical background of LMs and LLMs in robotics. Section 3 reviews how LLMs empower robots to communicate via language understanding and generation. Section 4 investigates how LLMs perceive various sensor modalities and advance sensing behaviors. Sections 5 and 6 organize LLM-based planning and control studies, respectively. In Sect. 7, we provide comprehensive guidelines for prompt engineering as a starting point for LLM integration in robotics. Finally, Sect. 8 summarizes this survey.

Fig. 1

Overview structure of intelligent robotics research integrated with LLMs in this survey. The rightmost cells show the representative names (e.g., method, model, or authors) of papers in each category

2 Preliminary

We briefly review language models used in robotics, categorizing them by pre- and post-LLM eras. Unlike previous literature [164], we define the pre-LLM era as the period up to the advent of GPT-2 [115], characterized by neural language models such as recurrent neural networks (RNNs) [33] and early Transformer architectures. We then provide a brief explanation of LLMs, introducing terminologies and techniques used in subsequent reviews.

2.1 Language models in robotics

In the pre-LLM era, early studies primarily focused on sequential data processing using RNN-based models [23, 46]. These models were often used to transform linguistic commands into sequences of actions [6, 99] or formal languages [40], leveraging RNNs' sequence-to-sequence modeling capabilities. Researchers have also used RNNs as language encoders that convert textual input into linguistic features, which can then be mapped to visual features for referred-object identification [121, 125]. However, the long-term dependency issue of RNNs restricted the scope of applications. The Transformer architecture [141], a non-sequential model supporting long-range comprehension, subsequently enabled new robotic tasks, such as vision-and-language navigation [14, 16].

Later studies in the pre-LLM era also showed improved application performance compared with earlier methods trained on small, task-specific datasets. Transformer-based models and self-supervised learning techniques, such as masked language modeling, have led to the development of internet-scale pre-trained models, including BERT [26] and GPT-2 [115]. These models exhibit a broad understanding of language, enabling both (1) improved generalization abilities and (2) fine-tuning for specific robotic tasks [74, 75, 124]. In addition, researchers have developed LMs that process multimodal information [116], since robotic applications often require diverse modalities, such as natural language and vision, for interaction with users and the environment [76, 126].

2.2 Large language models in robotics

Recent advancements in LLMs, such as GPT-3 [8], GPT-4 [107], LLaMA [137], Llama 2 [138], and Gemini [2], demonstrate notable improvements in understanding, contextual awareness, generalization capabilities, and knowledge richness, surpassing earlier language models. These improvements stem from training on vast datasets with billion-scale parameters, enabling the models to capture intricate data patterns. Further, advanced learning strategies, such as reinforcement learning from human feedback, have been developed to align the behaviors of LLMs with human values or preferences [108]. However, updating an entire model with this many parameters is computationally expensive. To address this issue, researchers have developed parameter-efficient fine-tuning methods (e.g., adapters [49] and LoRA [51]) for robotic tasks. For example, LLM-POP [132] fine-tunes its model for interactive planning scenarios using adapters, small trainable networks inserted into each layer of an LLM.

Alternatively, prompt engineering with in-context learning (ICL) [8] marks a significant advancement in learning from prompts without additional training. Its effectiveness relies on the design and quality of prompts, which can be enhanced with detailed task descriptions, few-shot examples, or model-friendly formats (e.g., ‘###’ as a stop symbol [167]). Moreover, chain-of-thought (CoT) prompting [149] is another emerging approach that incorporates intermediate reasoning steps in prompts. The CoT method substantially enhances the reasoning and problem-solving capabilities of LLMs, making it a dominant technique in robotics applications [86, 128, 163].
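To make these techniques concrete, the snippet below is a minimal sketch of how a few-shot, chain-of-thought prompt might be assembled in Python for a tabletop instruction; the demonstration, the reasoning wording, and the use of '###' as a stop symbol are illustrative assumptions rather than the prompt of any cited work.

```python
# A minimal sketch of a few-shot, chain-of-thought prompt for a tabletop task.
# The demonstration, the '###' stop symbol, and the reasoning wording are
# illustrative assumptions, not taken from a specific paper.

FEW_SHOT_EXAMPLES = [
    {
        "command": "Put the apple in the bowl.",
        "reasoning": "The apple must be grasped before it can be placed, "
                     "so pick it up first, then place it in the bowl.",
        "plan": "1. pick(apple) 2. place(apple, bowl)",
    },
]

def build_prompt(command: str) -> str:
    """Assemble a task description, worked examples, and the new query."""
    parts = ["You are a robot task planner. Think step by step, then output a plan."]
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(f"Command: {ex['command']}")
        parts.append(f"Reasoning: {ex['reasoning']}")
        parts.append(f"Plan: {ex['plan']}")
        parts.append("###")  # model-friendly stop symbol separating examples
    parts.append(f"Command: {command}")
    parts.append("Reasoning:")  # chain of thought: ask for reasoning before the plan
    return "\n".join(parts)

print(build_prompt("Put the mug on the shelf."))
```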

3 Communication

We investigate the utilization of LLMs to facilitate human-like communication in robotics, enabling robots to interact effectively with humans and other robotic agents [97]. We categorize communication capabilities into two primary areas: (1) language understanding and (2) language generation. Figure 1 shows the detailed categorization alongside relevant studies, indicated by green cells.

3.1 Language understanding

We review language understanding capabilities, addressing how LLMs handle the variability and ambiguity of linguistic inputs through interpretation and grounding processes.

Interpretation transforms natural-language inputs into semantic representations that are easier for robots to process. These representations include formal languages such as linear temporal logic (LTL) [93, 160] and the planning domain definition language (PDDL) [18, 42, 89, 155], as well as programming languages such as Python [56, 76]. To aid in interpreting free-form sentences, researchers leverage LLMs' ICL capabilities, providing guidelines and demonstrations within prompts [56, 76, 89, 122]. Despite these efforts, LLMs often fail to satisfy syntax or capture precise semantics when translating an input into formal languages. To address this issue, researchers suggest simplifying the vocabulary or fine-tuning LLMs with domain-agnostic data [93, 160]. For example, Lang2LTL [91] translates landmark-referring expressions in navigational commands into LTL symbols. Further improvements often involve using human feedback and syntax checkers to correct generated formal-language translations [18, 42]. For instance, Guan et al. present a human-in-the-loop translation framework, in which human domain experts repeatedly review PDDL descriptions and provide feedback in natural language [42].
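As a concrete illustration of the interpretation step, the sketch below builds a translation prompt that maps a command to a PDDL goal and applies a rough syntax check; the predicate vocabulary and the checking rule are illustrative assumptions, not the pipeline of any cited work.

```python
# Minimal sketch of interpretation: prompting an LLM to translate a natural-language
# command into a PDDL goal, followed by a trivial syntax check. The predicate
# vocabulary and the checking rule are illustrative assumptions.

ALLOWED_PREDICATES = {"on", "in", "holding", "at"}

def build_translation_prompt(command: str) -> str:
    return (
        "Translate the command into a PDDL goal expression.\n"
        f"Use only these predicates: {sorted(ALLOWED_PREDICATES)}.\n"
        "Example:\n"
        "Command: put the book on the table\n"
        "Goal: (on book table)\n"
        f"Command: {command}\n"
        "Goal:"
    )

def rough_syntax_check(goal: str) -> bool:
    """Reject outputs with unbalanced parentheses or unknown predicates."""
    if goal.count("(") != goal.count(")"):
        return False
    tokens = goal.replace("(", " ( ").replace(")", " ) ").split()
    predicates = {tokens[i + 1] for i, t in enumerate(tokens[:-1]) if t == "("}
    return predicates <= ALLOWED_PREDICATES | {"and", "not"}

# Example: a hand-written stand-in for the LLM's answer.
candidate = "(and (on cup shelf) (at robot kitchen))"
print(build_translation_prompt("put the cup on the shelf"))
print("syntax ok:", rough_syntax_check(candidate))
```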

Grounding is another process that maps linguistic expressions to reference targets, such as behaviors or objects, recognizable to robots. Early studies identify mappings that maximize the cosine similarity between word embeddings of LLM outputs and real-world targets [58, 76, 93, 117]. Subsequent studies leverage LLMs' commonsense knowledge to capture the context of object text labels for improved grounding [41, 118]. For instance, ConceptGraphs [41] demonstrates how LLMs can ground the expression 'something to use as a paperweight' to a ceramic vase based on size and weight assumptions. However, grounding accuracy depends on the detail and accuracy of the world model. To address this, researchers augment LLMs with multimodal capabilities to directly correlate linguistic inputs with sensory percepts [31, 47, 114, 159], or they enable LLMs to interact with environments [158, 168] or humans [61, 109, 120] for better context gathering. For instance, LLM-Grounder [158], a 3D visual grounding method, actively gathers environmental information using vision tools such as LERF [72] and OpenScene [111].
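The sketch below illustrates the embedding-similarity grounding idea, assuming the sentence-transformers library and an illustrative object list; real systems would use richer world models and task-specific embeddings.

```python
# Minimal sketch of embedding-based grounding: map an LLM's answer to the
# closest known object label by cosine similarity. The embedding model name
# and object list are illustrative assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
object_labels = ["ceramic vase", "soda can", "paper towel", "stuffed toy"]

def ground(llm_output: str) -> str:
    """Return the object label whose embedding is closest to the LLM output."""
    query = model.encode(llm_output, convert_to_tensor=True)
    candidates = model.encode(object_labels, convert_to_tensor=True)
    scores = util.cos_sim(query, candidates)[0]
    return object_labels[int(scores.argmax())]

print(ground("something heavy to use as a paperweight"))  # likely 'ceramic vase'
```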

3.2 Language generation

Language generation refers to the production of human-like written or spoken language that reflects communicative intents [39]. We categorize language generation into task-dependent and task-independent types based on their communicative intents, diverging from the conventional natural-language generation (NLG) categories of text-to-text and data-to-text [30] because of our focus on the communicative purposes of the studies.

Task-dependent language generation focuses on producing language with specific functional objectives, whether declarative or imperative. To generate open-ended declarative statements, researchers often provide LLMs with contextual information [20, 62, 96]. However, LLMs often produce repetitive and factually inconsistent outputs, constrained by their reliance on previous dialogues and commonsense knowledge [20, 83]. Consequently, researchers augment LLMs with auxiliary knowledge sources to broaden the scope of available information [4, 21, 157]. For instance, Axelsson and Skantze enhance a robot museum guide with knowledge graphs [4]. Furthermore, researchers instruct LLMs to clarify ambiguities by generating imperative instructions requesting human assistance [25, 61]. To improve these inference steps, probabilistic models have been introduced to evaluate the uncertainty of situations [109, 120]. For instance, the KnowNo [120] and CLARA [109] interaction systems assess confidence and semantic variance, respectively, triggering question generation only when these metrics indicate significant uncertainty.

Task-independent language generation involves crafting expressions with social-emotional objectives [11], for example by embedding non-verbal cues (e.g., non-verbal sounds, hand gestures, and facial expressions) within prompts to enhance engagement and empathy [73, 80]. For example, Khoo et al. have developed a conversational robot that generates empathetic responses using transcribed audio and visual cues [73]. However, conversations with LLMs remain superficial due to limited knowledge and dialogue history [65]. To overcome this, researchers integrate memory modules into LLMs, enabling them to distill and store information from conversations in a structured format [22, 63, 65, 162]. For example, the companion robot designed by Irfan et al. continuously updates its memory based on interactions with users to generate personalized dialogues [65].

4 Perception

Perception plays a crucial role in enabling robots to make decisions, plan their actions, and navigate the real world [113]. In the field of LLM-based robotic perception, research primarily focuses on two aspects: sensing modalities and sensing behaviors. In this section, we introduce how LLM-based robots integrate language with sensor modalities and how agents acquire environmental information through passive and active perception behaviors. Figure 1 presents the detailed categorization alongside relevant studies, indicated by pink cells.

4.1 Sensing modalities

Researchers have significantly advanced robots’ comprehension and generalization capabilities through the integration of multimodal language models. We categorize primary sensing modalities into visual, auditory, and haptic modalities, reviewing recent studies leveraging multimodal LLMs for perception tasks.

Visual perception tasks involve the interpretation of visual information such as images or point clouds. Pre-trained visual-language models (VLMs), such as CLIP [116] and InstructBLIP [82], allow LLM-based robots to directly utilize image sources. For instance, recent LLM-based manipulation systems, such as TidyBot [152] and RoCo [96], use image-inferred object labels or scene descriptions generated from CLIP and OWL-ViT [100], respectively. In addition, researchers extend reasoning capabilities by applying VLMs to downstream tasks such as image captioning [41] and visual question answering (VQA) [37, 78, 103]. These downstream tasks enable LLMs to subsequently request VLMs to infer object properties (e.g., material, fragility) [37] or to ground object parts for grasping [103]. However, acquiring spatial-geometric information from images alone is often challenging.

Alternatively, Huang et al. associate visual-language features from a VLM (i.e., LSeg [81]) with three-dimensional (3D) point clouds for 3D map reconstruction [56]. Further, Jatavallabhula et al. improve this association mechanism with RGB-D images by introducing fine-grained and pixel-aligned features from VLMs [66]. However, association with 3D information tends to be memory intensive, limiting scalability for large scenes [56, 66, 158]. As an alternative solution, researchers often associate geometric and semantic features with 3D scene graphs [41].

Auditory perception involves the interpretation of sound. LLM-based studies often leverage pre-trained audio-language models (ALMs), such as AudioCLIP [43] and Wav2CLIP [151], integrating them with visual data to enhance environmental or contextual understanding [55, 94, 123, 163]. For example, AVLMaps [55], a 3D spatial map constructor with cross-modal information, integrates audio, visual, and language signals into 3D maps, enabling agents to navigate using multimodal objectives such as ‘move between the image of a refrigerator and the sound of breaking glass.’ In addition, REFLECT [94], a framework for summarizing robot failures, transforms multisensory observations such as RGB-D images, audio clips, and robot states into textual descriptions to enhance LLM-based failure reasoning.

Haptic perception involves the interpretation of contact information. Researchers introduce multimodal perception modules that interactively incorporate haptic information, either as pre-defined high-level descriptions of haptic interactions [168] or as CLIP-based tactile-image features [48]. For example, MultiPLY [48], a multisensory LLM, converts tactile sensory readings into a heatmap encoded by CLIP. A linear tactile-projector layer then maps the heatmap information into the LLM's feature space.

4.2 Sensing behavior

Based on the type of sensing behavior, we divide this section into passive and active perception.

Passive perception refers to the process of gathering sensory information without actively seeking it out. Despite its limited nature, passive sensing has been extensively employed in LLM-based robotics studies for various tasks: object recognition [37, 53, 152], pose estimation [103, 156], scene reconstruction [41, 59, 122], and object grounding [66, 144, 158]. For example, TidyBot [152] detects the closest object from an overhead view and subsequently recognizes its object category using a closer view captured by the robot's camera. However, the passive nature of sensing limits the ability to perform tasks when information is unobserved or unavailable (e.g., unseen areas or object weights).

On the other hand, active perception refers to the deliberate process of gathering sensory information by taking additional actions. Active information gathering enhances environmental understanding by acquiring new information through sensory observations or by requesting user feedback [78, 129]. For example, LLM-Planner [129] generates seeking actions such as 'open the refrigerator' to locate objects that are not currently visible. Recent studies also focus on collecting sensory data to better understand objects' physical properties [48, 132, 168]. However, LLMs often generate inaccurate or fabricated information, known as hallucinations. To address this issue, Dai et al. introduce a personalized conversational agent designed to ask users for uncertain information [25].

5 Planning

Planning involves organizing actions to solve given problems, typically by generating a sequence of high-level symbolic operators (i.e., task planning) and then executing them using low-level motor controllers [38, 84]. This section investigates how LLM-based planning research addresses limitations in the planning domain, categorizing studies into three key research areas: (1) task planning, (2) motion planning, and (3) task and motion planning (TAMP). Figure 1 presents the detailed categorization along with related planning studies, indicated by purple cells.

5.1 Task planning

LLM-based task planners are capable of generating plans without strict symbol definitions [58], whereas traditional task planners require pre-defined operators with domain knowledge about available actions and constraints [34, 98]. In this field, most planners employ a static planning strategy, which takes fixed descriptions that do not adapt to changes in the environment [163]. An alternative approach, adaptive planning, incorporates environmental feedback into input prompts, enabling adjustments to actions based on observed conditions. This section reviews LLM-based planners in terms of these two strategies: static and adaptive planning.

Static planning approaches are generally zero- or few-shot prediction methods, where zero-shot methods generate a plan based solely on an input command, while few-shot methods leverage learning from a limited set of similar examples [9, 27, 70, 163]. However, LLMs often exhibit poor performance in long-horizon task planning due to limited reasoning ability [89, 140]. To address this limitation, Huang et al. introduce a planner that iteratively selects the most probable action among executable ones generated by LLMs [58].

Alternatively, LLM-based code generators, such as Code as Policies [86] or ProgPrompt [128], produce code that results in actions responsive to observations [56, 57]. Singh et al. demonstrate that code generation outperforms basic task planning with LLMs, since the generated plan aligns closely with the execution environment [128]. Despite their advantages, these methods lack validation and replanning processes.
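The snippet below illustrates the kind of policy code such generators aim to produce; the perception and control functions (detect, pick, place_on) are hypothetical names standing in for the APIs a prompt would expose, and a toy harness is included so the sketch runs on its own.

```python
# Illustrative example of the kind of policy code an LLM might generate in a
# code-generation-style system. The perception/control APIs (detect, pick,
# place_on) are hypothetical names assumed to be exposed in the prompt.

def stack_blocks_by_size(detect, pick, place_on):
    """Stack all detected blocks with the largest at the bottom."""
    blocks = detect("block")                      # -> list of dicts with 'name' and 'size'
    ordered = sorted(blocks, key=lambda b: b["size"], reverse=True)
    for lower, upper in zip(ordered, ordered[1:]):
        pick(upper["name"])                       # grasp the smaller block
        place_on(upper["name"], lower["name"])    # place it on the larger one

# A toy harness standing in for the robot's real API, so the sketch runs as-is.
if __name__ == "__main__":
    scene = [{"name": "red", "size": 3}, {"name": "blue", "size": 5}, {"name": "green", "size": 1}]
    log = []
    stack_blocks_by_size(
        detect=lambda label: scene,
        pick=lambda name: log.append(f"pick({name})"),
        place_on=lambda a, b: log.append(f"place({a} on {b})"),
    )
    print(log)
```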

To validate plans, researchers often augment LLMs with logical programs, either to (1) check whether resulting plans violate logical constraints or (2) generate plans using an external logical planner. For instance, SayPlan [118], a GPT-4-based planner, validates abstract-level actions through a 3D scene-graph (3DSG) simulator [1], while LLM+P [89] feeds a PDDL problem translated by an LLM to a classical task planner, Fast Downward [45]. In addition, Silver et al. demonstrate that a search-based planner seeded with an initial plan from an LLM performs better, exploring fewer nodes [127]. These studies underscore the effectiveness of integrating LLMs with logical programs to increase the success rate and feasibility of generated plans.
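The sketch below outlines an LLM-as-translator, classical-planner-as-solver pipeline in this spirit; the Fast Downward invocation and plan-file name are assumptions about a typical local installation, not code from the cited work.

```python
# Minimal sketch of a translate-then-plan pipeline: the LLM writes a PDDL
# problem and an external classical planner searches for a plan. The Fast
# Downward command line and output file below reflect a typical installation
# and should be treated as assumptions.
import pathlib
import subprocess
import tempfile

def translate_to_pddl_problem(command: str, llm) -> str:
    """Ask the LLM (any callable: str -> str) for a PDDL problem file."""
    return llm(f"Write a PDDL problem for the command: {command}\n"
               "Use the household domain and output only valid PDDL.")

def solve_with_classical_planner(domain_pddl: str, problem_pddl: str) -> str:
    workdir = pathlib.Path(tempfile.mkdtemp())
    (workdir / "domain.pddl").write_text(domain_pddl)
    (workdir / "problem.pddl").write_text(problem_pddl)
    # Assumed planner entry point; adjust the path and search options to your setup.
    subprocess.run(
        ["./fast-downward.py", str(workdir / "domain.pddl"),
         str(workdir / "problem.pddl"), "--search", "astar(lmcut())"],
        check=True,
    )
    # Fast Downward conventionally writes the plan to 'sas_plan' in the working directory.
    return pathlib.Path("sas_plan").read_text()
```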

Adaptive planning allows robots to modify their plans or actions in response to feedback, either by generating new plans based on environmental observations [20, 142, 152, 168, 169] or by detecting failures and adjusting accordingly [61]. Chen et al. and Huang et al. introduce adaptation strategies that generate new plans based on observed feedback, enabling robots to respond to a broader range of scenarios [12, 60].

Another adaptation strategy is the detection of failures as feedback. For instance, Inner Monologue [61] retries the initial plan until it succeeds. Furthermore, other studies provide textual explanations of past failures to help avoid recurrent issues [87, 94, 117, 147]. LLM-Planner [129] and COWP [28] improve replanning capabilities by finding alternative plans that leverage context from observations and the commonsense knowledge of LLMs. This flexibility in adapting to new information enhances robot autonomy in dynamic settings.

5.2 Task and motion planning

We outline LLM-based low-level planning, classifying methodologies into motion planning and TAMP areas.

Motion planning refers to the process of generating a path by computing sequential waypoints within configuration or task spaces. Jiao et al. introduce an LLM-based motion planner that directly generates positional sequences for drone choreography [68]. While this work demonstrates LLMs’ spatial reasoning abilities, the scenario presented is relatively simple. Further, planning spaces are often continuous, which presents a challenge for language models that operate with discrete tokens. Alternatively, indirect sequencing approaches, such as VoxPoser [59], generate a potential field code with the help of a VLM and then conduct motion planning within the generated field, augmenting LLMs with a search-based planner.

TAMP refers to integrating high-level task planning with low-level motion planning. Recent studies often use LLMs as TAMP planners, leveraging LLMs’ logical and physical reasoning capabilities [79, 96, 153]. Researchers guide LLMs to generate high-level subgoals, which are then used for low-level trajectory generation [79, 96]. However, the coarse representations of LLMs restrict their applications to simple tasks such as pick-and-place. To address this limitation, researchers use additional prompts or augment LLMs to improve reasoning abilities. For example, Xia et al. enable LLMs to consider kinematic knowledge through kinematic-aware prompting for more complex manipulation tasks, such as articulated object manipulation [153]. Ding et al. introduce a logic-augmented LLM planner that checks the logical feasibility of the task plans generated by LLMs [29]. Meanwhile, others use physics-augmented LLM planners to evaluate physical feasibility [18, 44, 88]. For example, Text2Motion [88] allows an LLM to generate physically feasible high-level actions and combine them with learned skills for low-level actions.

6 Control

Early studies primarily focus on establishing mappings between simple linguistic commands and known motion primitives. With the advent of deep learning, researchers have explored two main approaches in control: direct modeling of control values based on linguistic instructions [7, 119] and indirect interpretation of complex instructions via LLMs to generate actions [154]. We categorize the work in this field into two groups: (1) the direct approach, which generates control commands directly from linguistic instructions, and (2) the indirect approach, which specifies control commands indirectly through linguistic guidance. Figure 1 presents a detailed categorization alongside related papers, indicated by orange cells.

6.1 Direct approach

The direct approach involves using an LLM to interpret and produce executable commands, either by selecting motion primitives [134] or by generating control signals [146, 170]. Early work, such as Gato [119], RT-1 [7], and MOO [131], generates action tokens as a control policy by training Transformer architectures on task-specific expert demonstrations. Researchers linearly map action tokens to discretized end-effector velocities [119] or displacements [7, 131] for continuous motion. While these approaches demonstrate a degree of generalization over unseen tasks, such as new objects or realistic instructions, they often require extensive data collection and training time.

To reduce the collection effort, researchers often leverage existing web-scale vision and language datasets, as in RT-2 [170] and RT-X [143]. For example, Zitkovich et al. train VLMs (e.g., PaLI-X [17] and PaLM-E [31]) on visual-language datasets together with robotic demonstrations [170]. This approach maintains general knowledge of visual-language tasks while training for control tasks. In addition, to reduce the training burden, Chen et al. use a low-rank adaptation (LoRA) [51] method to fine-tune an LLM for control tasks rather than fine-tuning the entire model [15].
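As a minimal sketch of this parameter-efficient route, the snippet below attaches LoRA adapters to a causal LLM with the PEFT library; the base model name, target modules, and hyperparameters are illustrative choices, not those of the cited studies.

```python
# Minimal sketch of parameter-efficient fine-tuning with LoRA via the PEFT
# library, in the spirit of adapting an LLM to a control task without updating
# all weights. The base model name and LoRA hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections commonly adapted
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the low-rank adapters are trainable
# ...then train on (instruction, action-token) pairs with a standard training loop.
```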

LLMs often struggle to generate continuous action-level commands, such as joint positions and torque values, since they generate discrete atomic elements known as tokens [134]. Therefore, researchers instead generate task-level outputs using LLMs [10, 101, 134]. For example, SayTap [134], an LLM-based locomotion controller, generates foot-ground contact patterns instead of directly producing joint positions. Other studies treat the control problem as a natural-language generation task, completing a sequence of end-effector poses [101] or generating Python code [10]. Recent studies often restrict the action space to enhance LLM control outputs. For instance, Wang et al. design a prompt that produces positive integer control values while maintaining a smoothness trend in the outputs [146]. Alternatively, Li et al. demonstrate that incorporating robot kinematics information helps the LLM determine joint values for desired poses [85].

Fig. 2

A conversational prompt for interactive grounding. Through the final "Command" in the prompt, we ask the LLM to ground the underspecified object, referred to as "something" in the "Command", as a "cookie" by interactively asking for personal preferences. The prompt consists of task description, task procedure, and task context parts, guiding the LLM's behavior and contextual understanding. The words in bold indicate subjects of interactions, with LLM responses highlighted in blue

6.2 Indirect approach

LLMs are also useful for generating indirect representations of control commands, such as subgoals or reward functions, based on natural-language instructions. To guide the learning process, researchers leverage goal descriptions that explain desired behaviors in natural language [32, 67, 77]. For example, ELLM [32], an LLM-based reinforcement learning (RL) framework, uses an LLM to generate subgoal descriptions as prior knowledge for the RL policy and computes the reward from the similarity between the current observation and the subgoal description in a text-embedding space. Further, Kumar et al. generate a goal description based on the history of human instructions to reuse previously learned skills [77]. However, as the output of an LLM is a natural-language description, these approaches require an additional step of grounding or interpreting the description.
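A minimal sketch of this similarity-shaped reward idea is shown below; the text encoder is left as a caller-supplied function (a toy bag-of-words stand-in is included so the sketch runs), and the captions are illustrative.

```python
# Minimal sketch of a similarity-shaped reward: the reward is the cosine
# similarity between an LLM-suggested subgoal and a caption of the current
# observation, both mapped to vectors by a caller-supplied text encoder
# (hypothetical `embed`; any sentence-embedding model would do).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def shaping_reward(embed, observation_caption: str, subgoal: str) -> float:
    """Dense reward encouraging states whose description matches the subgoal."""
    return cosine(embed(observation_caption), embed(subgoal))

# Toy stand-in encoder so the sketch runs: hash words into a bag-of-words vector.
def toy_embed(text: str, dim: int = 64) -> np.ndarray:
    v = np.zeros(dim)
    for w in text.lower().split():
        v[hash(w) % dim] += 1.0
    return v

print(shaping_reward(toy_embed, "the agent is holding a wooden log", "collect a wooden log"))
```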

Alternatively, researchers often generate code-level reward functions. Yu et al. convert a natural-language goal into a high-level motion description and then generate a corresponding reward function [161]. However, this method requires pre-defined reward formats. Instead, recent work prompts an LLM to infer a new reward function from human-designed examples [71, 145]. Nonetheless, the generated reward functions may not always be accurate or optimal enough for direct use in training [130].

Fig. 3

Directive prompts for generating a scene graph. The table includes two prompts: node creation and edge creation. Given scene images, a multimodal LLM perceives objects in the image and infers relevant relationships using geometric information. The words in bold indicate subjects of outputs, with LLM responses highlighted in blue

To improve accuracy, researchers add a refinement loop to validate both the syntax [112] and semantics [95, 130, 154, 165] of the generated reward functions. For example, Song et al. use an LLM to redesign a reward function based on the convergence of the training process and the resulting robot motion [130]. Chu et al. employ an LLM to directly generate rewards for evaluating robot motion [24]. Other approaches refine a motion by adjusting control parameters based on the error state [133] or by selecting a suitable motion target from human feedback [90].
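The sketch below shows a syntax-level refinement loop of this kind: parse the candidate reward code and, on failure, feed the error back to the model; semantic checks based on training behavior would wrap around this loop in the same spirit. The llm argument is any prompt-to-code callable.

```python
# Minimal sketch of a syntax-level refinement loop for LLM-generated reward
# code: try to parse the candidate and, on failure, feed the error back to the
# model for another attempt. Semantic checks on training behavior would be
# layered on top in the same fashion.
import ast

def generate_valid_reward(llm, task_prompt: str, max_rounds: int = 3) -> str:
    prompt = task_prompt
    for _ in range(max_rounds):
        code = llm(prompt)
        try:
            ast.parse(code)          # cheap syntax validation before any execution
            return code
        except SyntaxError as err:
            prompt = (f"{task_prompt}\nYour previous code failed to parse:\n"
                      f"{err}\nPlease return corrected Python code only.")
    raise RuntimeError("no syntactically valid reward function produced")
```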

7 Prompt guideline

We provide prompt design guidelines for robotic tasks to researchers entering this field. A prompt is a message that directs LLMs to process and generate outputs according to our instructions [92, 150]. Well-designed prompts

  • include clear, concise, and specific statements without using technical jargon,

  • incorporate examples that help anticipate the model's reasoning process,

  • specify the format in which we want the output to be presented, and

  • contain instructions to constrain actions.

Such prompts enable models to generate the desired content, following output formats and constraints, without parameter updates. We provide guidelines for four robotics use cases: (1) interactive grounding, (2) scene-graph generation, (3) few-shot planning, and (4) reward-function generation.

7.1 Conversational prompt: interactive grounding

We detail a conversational prompt design that leverages an LLM as a grounding agent to clarify commands such as 'Bring me something to eat' and to infer the ambiguous target, expressed as 'something,' through logical inference. Figure 2 shows the design in detail; the prompt consists of three key components: task description, task procedure, and task context. We describe each component as follows.

The task description outlines the expected behavior and response format of the LLM. In this example, we particularly emphasize its role as a conversational agent, which fosters dynamic interactions with users, guided by directives such as ‘you should.’ Further, the imperative statements containing ‘keep’ provide task constraints or requirements. We also place behavioral constraints at the end to suppress the LLM’s verbosity.

The task procedure then defines a sequence of inference steps for the LLM to follow, aimed at achieving the task objective. This description employs numbered steps to instruct LLMs to execute the actions step by step. By using logical representations, we also enforce actions to be performed in a logical order; we use ‘iteratively’ to indicate a ‘while loop’ and ‘if’ or ‘when’ to represent conditions.

The task context describes the contextual inputs, such as the 'world model,' that the LLM performs grounding upon. Consistency in terminology across the task description and task procedure is crucial for LLM operation. For example, common expressions such as 'task' and 'world model' allow the LLM to work within the same provided context. Further, by using clear names for objects in the world model, we enable the LLM to apply common knowledge to named entities. Note that although we use a list of objects as the world model, LLMs accept world models in various formats, such as textual descriptions, object lists, and scene graphs.

With these structured components, the prompt invokes an interactive grounding dialogue for precise object identification, as shown in Fig. 2. We obtain the resulting interaction using ChatGPT 3.5 [106].
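Because the figure is easiest to read alongside a concrete template, the snippet below sketches how the three components might be laid out as a single prompt string; the wording and world model are illustrative and not the exact prompt of Fig. 2.

```python
# A sketch of how the three components of the conversational grounding prompt
# (task description, task procedure, task context) might be laid out as one
# string. The wording is illustrative and not the exact prompt shown in Fig. 2.
GROUNDING_PROMPT = """\
Task description: You are a conversational robot assistant. You should ground
underspecified objects in the user's command to one object in the world model.
Keep your questions short. Keep asking until exactly one object matches.

Task procedure:
1. Read the command and the world model.
2. Iteratively ask the user one clarifying question about their preference.
3. When a single object satisfies the preferences, answer with its name only.

Task context:
World model: [cookie, apple, potato chip, energy bar]
Command: Bring me something to eat.
"""
```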

7.2 Directive prompt: scene-graph generation

We introduce directive prompt designs for constructing a scene graph from a scene image using a multimodal LLM, particularly GPT-4 [107]. The scene graph consists of objects as nodes and their relationships as edges [36, 41]. Despite the advancement of multimodal LLMs, their capabilities remain limited when inferring 3D relationships from a 2D image [13]. To mitigate this limitation, we decompose the task into two steps: node creation with multimodal inputs and edge creation with textual information. We describe each step with detailed examples in Fig. 3.

Fig. 4

A planning prompt for few-shot planning. Leveraging input-output example pairs, the LLM improves performance in generating a plan to accomplish the task objective. The prompt consists of task descriptions, examples, and task context. The words in bold indicate subjects of interactions with LLM responses highlighted in blue

The prompt for node creation consists of two parts: (1) task description and (2) task context. The task description includes the multimodal LLM's expected behavior (i.e., role) and response format, similar to Sect. 7.1. For instance, the multimodal LLM's role is to identify objects as nodes in the given images. We then specify the output format as 'ObjectName(ID)' for consistency and simplicity. The task context then presents a sequence of unique object identifiers with corresponding object-centric images. In this scenario, we assume that the object-centric images are obtained using an active perception method to identify objects under occlusion.

The edge-creation prompt consists of (1) task description, (2) examples, and (3) task context. The task description not only specifies the expected behavior and output format but also elucidates how to identify relationships between nodes, leveraging examples. We particularly explain how the LLM uses 3D object coordinates and unit measurements to infer spatial relationships from a pre-defined set such as 'left,' 'right,' etc. Unlike node creation, this prompt allows additional output explanations to accommodate the complexity of discerning spatial relationships.

To enhance the understanding of the input format and corresponding output, we include examples showcasing edge generation. We choose an example similar to the target scenario in terms of objects and their spatial interrelationships, thereby providing richer information for edge identification. Finally, the task context provides source and target node information as inputs and leaves the output empty to obtain responses from the LLM. We also assume that the 3D bounding boxes are obtained through a neural network-based detector [104] or from the extent of point clouds using depth information [41].
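The snippet below sketches how the two prompts and their outputs might be organized in code; the prompt wording, the relation set, and the bounding-box format are illustrative and not the exact prompts of Fig. 3.

```python
# A sketch of the two-step scene-graph prompts described above (node creation,
# then edge creation) and of collecting the answers into a simple graph
# structure. Prompt wording, the 'ObjectName(ID)' format, and the relation set
# are illustrative, not the exact prompts of Fig. 3.
NODE_PROMPT = (
    "Role: identify each object shown in the attached object-centric images.\n"
    "Output format: ObjectName(ID), one per line, using the given IDs."
)

EDGE_PROMPT_TEMPLATE = (
    "Role: infer the spatial relation between two objects from their 3D bounding\n"
    "boxes (metres). Allowed relations: left, right, above, below, on, inside.\n"
    "Source: {src} box={src_box}\nTarget: {dst} box={dst_box}\nRelation:"
)

def build_edge_prompt(src, src_box, dst, dst_box):
    return EDGE_PROMPT_TEMPLATE.format(src=src, src_box=src_box, dst=dst, dst_box=dst_box)

# Example: collect hypothetical LLM answers into node and edge lists.
nodes = ["Mug(1)", "Table(2)"]
edges = [("Mug(1)", "on", "Table(2)")]
print(build_edge_prompt("Mug(1)", (0.1, 0.0, 0.8), "Table(2)", (0.0, 0.0, 0.7)))
```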

Fig. 5

An example prompt for reward-function generation. The prompt consists of task descriptions, available APIs, goals and constraints, and generation rules. The LLM generates a reward function in Python code for RL training.

7.3 Planning prompt: few-shot planning

We present a planning prompt design aimed at predicting subsequent actions to fulfill an instructed objective, integrating contextual elements such as available actions and environmental settings. This design particularly focuses on few-shot planning, enhancing performance through examples. The design comprises four components: (1) task description, (2) examples, (3) task context, and (4) additional interactions, detailed in Fig. 4.

The task description includes task objectives, expected behaviors, and response formats, similar to conventional prompts. However, unlike the previous designs, this prompt specifies the robot's constraints, including initial states and action limitations. For example, the term 'CANNOT' in Fig. 4 emphasizes that the robot can manipulate only one object per action. Moreover, these constraints extend to the rules governing the 'Done' action, which indicates task completion.

The examples demonstrate input–output pairs that guide the LLM in generating the desired action. The examples adapt the generic ‘object’ argument in the allowed actions (e.g., ‘Close (object)’) to specific object names such as ‘drawer’ or ‘paper,’ reinforcing task constraints written in the task description. For instance, the second example returns the ‘Done’ signal instead of further planning after achieving the task objective.

The task context provides information about the current scenario, including 'task,' 'allowed actions,' 'visible objects,' 'executed plans,' and 'next plan,' as shown in the examples. We let the LLM fill in the blank space after 'next plan:', suggesting the next action without adding unnecessary elements such as line breaks, which ensures output precision.

Furthermore, when additional prompts update the executed plans, the LLM generates new plans based on this updated context without reiterating the full task context, enabling a dynamic and iterative planning process that adapts to changes and maintains efficiency.
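The template below sketches how these components might be laid out as a single few-shot planning prompt; the task, actions, and example are illustrative and not the exact prompt of Fig. 4.

```python
# A sketch of the few-shot planning prompt structure described above: task
# description with constraints, a worked example, and a task context whose
# 'next plan:' field is left blank for the LLM to complete. The wording is
# illustrative, not the exact prompt of Fig. 4.
PLANNING_PROMPT = """\
Task description: You are a household robot planner. Output exactly one action
per query. You CANNOT manipulate more than one object per action. Output 'Done'
when the task objective is achieved.

Example:
task: put the paper in the drawer
allowed actions: Open(object), Close(object), Pick(object), Place(object, object), Done
visible objects: drawer, paper
executed plans: Open(drawer), Pick(paper), Place(paper, drawer), Close(drawer)
next plan: Done

Task context:
task: throw away the soda can
allowed actions: Open(object), Close(object), Pick(object), Place(object, object), Done
visible objects: soda can, trash bin
executed plans: Pick(soda can)
next plan:"""
```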

7.4 Code-generation prompt: reward design

We introduce a code-generation prompt design to generate a reward function for the MuJoCo-based Reacher task [136] from Gymnasium [139]. The goal of the Reacher task is to move the end-effector of a robotic arm close to a designated target position from an arbitrary starting configuration. The prompt translates this task objective into reward-specifying code. Figure 5 shows the design in detail, comprising four key elements: (1) task description, (2) available APIs, (3) goals and constraints, and (4) generation rules.

The task description defines the expected robot behavior and task conditions for the LLM, including the robot's control strategies and the action space of the two-joint robot arm. We particularly specify the action space as a continuous 'Box' space using an API from Gymnasium, assuming the LLM's familiarity with well-known library functions. This description then leads the LLM to grasp the overarching RL objective associated with the defined actions.

The available APIs section lists the APIs necessary for designing the reward function, including the name and input–output specification of each API. By providing Python function annotations, we enable the LLM to infer the types of inputs and outputs, given its presumed knowledge of float-like variable types and how the APIs work.

Goals and constraints provide the task objectives and limitations that guide the reward contents. We clearly define the initial setup, goal assignment, and goal conditions, aiming to exclude unnecessary reward components, such as penalizing high velocities for smooth motion. Note that we recommend the use of concise and consistent words, such as ‘torque,’ as used in the task description, instead of ‘power.’ This ensures the generated reward function aligns with the specified task requirements without introducing ambiguities or unintended penalties.

Lastly, the generation rules establish guidelines for producing directly executable code, addressing the tendency of LLMs to generate unnecessary or incorrect variables or functions. These rules restrict such declarations, as written in the second component of the generation rules in Fig. 5, and encourage the use of well-known Python libraries to enhance code quality. Furthermore, because the reward function linearly combines its components, we introduce a rule for scaling reward components to maintain balance.
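For reference, the snippet below sketches the kind of reward function this prompt aims to elicit for the Reacher task: a distance term plus a scaled torque penalty. The helper functions get_fingertip_pos and get_target_pos are hypothetical stand-ins for the APIs assumed to be listed in the prompt.

```python
# A sketch of the kind of reward function the prompt in Fig. 5 aims to elicit
# for the Gymnasium Reacher task: a distance term that pulls the fingertip
# toward the target plus a small torque penalty, combined with scaling weights.
# get_fingertip_pos and get_target_pos are hypothetical helper APIs assumed to
# be listed in the 'available APIs' part of the prompt.
import numpy as np

def reward_fn(get_fingertip_pos, get_target_pos, action: np.ndarray) -> float:
    """Negative distance to the target with a scaled torque penalty."""
    dist = np.linalg.norm(get_fingertip_pos() - get_target_pos())
    torque_penalty = np.sum(np.square(action))
    return -1.0 * dist - 0.1 * torque_penalty  # scaled so the distance term dominates
```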

8 Conclusion

In this survey, we have investigated current robotics research involving large language models in terms of the intelligent robot components of communication, perception, planning, and control. This component-wise investigation reveals how researchers integrate LLMs to overcome challenges inherent in pre-LLM approaches across various tasks, thereby offering a comprehensive understanding of LLMs' impact in this field. Within each component area, we examine methodologies proposed to maximize the utilization of LLMs' capabilities and to enhance the integrity of their responses. Additionally, our survey offers guidelines for prompt engineering in each component area, supplemented with key examples of prompt components, to provide practical insights for researchers entering this field. The core contribution of this paper is to highlight the transformative impact of LLMs in robotics, enabling the development of versatile and intelligent robots. By synthesizing these insights, we aim to guide future research on integrating LLMs into robotic systems.