1 Introduction

The widespread adoption of generative AI models has facilitated a broad range of natural language tasks (Gozalo-Brizuela and Garrido-Merchan 2023). As a result, the integration of Natural Language to Code (NL-to-Code) translation has become a sought-after feature in many code-centric tools (Xu et al 2022). Notable examples include GitHub Copilot,Footnote 1 TabNine,Footnote 2 Amazon CodeWhisperer,Footnote 3 and ChatGPT.Footnote 4

There are divergent opinions regarding the advantages and security of these tools, specifically GitHub Copilot. Various studies and surveys have shown the benefits of Copilot in assisting developers (Bird et al 2022; Vaithilingam et al 2022; Dakhel et al 2023). However, another empirical study (Imai 2022) reveals that Copilot increases productivity but lowers code quality. Concerns have been raised about Copilot recommending code that relies on non-existing helper functions or undefined variables (Nguyen and Nadi 2022). Several studies focus on the vulnerability and security of code generation tools (Pearce et al 2022; Asare et al 2022), as well as the uncertainty surrounding the licensing of generated code (Bird et al 2022).

Notably, a study by Bird et al (2022) reveals that developers devote more time to reviewing AI-generated code than to writing it. This emphasizes the need to aid developers in better understanding and evaluating the generated code, which is hindered by the unpredictable behavior of AI models. Furthermore, validating the code’s origin is essential, but this is currently limited to a single IDE development session.

Additionally, users do not always obtain the desired code when using generative AI tools. One of the primary factors is the quality of the prompt, which involves natural language features such as implication, association, and ambiguity (Reynolds and McDonell 2021). Interactive programming has garnered considerable attention as one of the prominent approaches to tackle these issues of natural language (Shin and Nam 2021; Heyman et al 2021; Schlegel et al 2019; Elgohary et al 2021; Su et al 2018; Cai et al 2023).

The concept entails users engaging with models in an iterative manner through low-code approaches until they attain the desired outcome. However, despite leveraging user feedback, its persistence is confined to a single conversational session (Bird et al 2022). This limitation arises from the inherent properties of generative AI models, which require explicit re-training to integrate new data or feedback from users. Figure 1 illustrates a simple scenario of interactive programming, underlining the issue of recalling cross-session user feedback in a current generative AI model.

Fig. 1 A simple scenario of interactive programming, highlighting the issue of utilizing user feedback across sessions

In Fig. 1a, users initially submit the NL query “Function adds two numbers” and receive a Python code snippet representing the add function. Subsequently, users request to rename the function from add to sum and proceed with further unrelated queries. After several inquiries in a new session, users once again input the same query, “Function adds two numbers”. The question arises as to whether the model should return the function named add or the one named sum. We applied this scenario to ChatGPT,Footnote 5 one of the recent prominent tools, and obtained the results illustrated in Fig. 1b. Even though the user corrected the function name in Session 1 (i.e. add_numbers to cal_sum), ChatGPT still returns the original code snippet (i.e. the function named add_numbers) in Session 2. In practice, repeatedly modifying generated code (e.g. renaming, restructuring) can be inefficient and frustrating.

In this paper, we propose a methodology to address the aforementioned challenges by developing a user-feedback-driven NL-to-Code translation model. This method requires no additional training. We aim to provide interpretable, straightforward approaches to enable comprehension of code provenance and facilitate thorough analysis of unexpected results. Our contributions are as follows:

  • A One-shot Correction methodology. We introduce an approach to integrate user feedback into generative AI models without re-training while supporting intensive inspection of incorrect outcomes. An additional memory for user feedback and k-Nearest Neighbors methods are employed to accumulate and retrieve correction information across sessions. To tackle the code’s origin issue, we adopt techniques from decomposition in problem solving. Each natural language query is divided into segments, and the final code snippet is constructed from the sub-snippets obtained for each segment. The NL-to-Code translation of each query chunk is performed through either the additional memory or a generative AI model.

  • A prototype and an extensive comparison. To illustrate the utility of our methodology, we deploy a prototype of One-shot Correction based on GPT-3.5-Turbo-0301 modelFootnote 6 and conduct an extensive comparison between the code generated by GPT-3.5-Turbo-0301 and our prototype. The evaluation results justify the concept of our methodology and provide insights into the behavior of all models when combined with user feedback.

  • A Graphical User Interface application. We develop a preliminary GUI to exhibit the benefit of using the One-shot Correction in customizing and interpreting the generated code snippets. Users can convert a natural language query to a Python function, modify identifier names in the produced code snippet, and preserve the correction information for future reference.

  • Source code and data. To facilitate reproducibility and reuse of our methodology, we publish our source code, along with test suites and evaluation results.Footnote 7

The rest of the paper is organized as follows: Sect. 2 provides an overview of the background and related work. Our methodology is described in Sect. 3. Section 4 presents our experiments in detail, while the evaluation results are analyzed in Sect. 5. We discuss threats to validity and potential enhancements in Sect. 6 and introduce our GUI application in Sect. 7. Finally, we conclude in Sect. 8.

2 Background and related work

In this section, we provide a brief introduction to the relevant background and related work in our study.

2.1 Generative artificial intelligence for code

Generative Artificial Intelligence (AI) encompasses models capable of producing novel content across various formats, including text, image, video, or audio (Gozalo-Brizuela and Garrido-Merchan 2023). In the context of code generation, generative AI leverages the naturalness hypothesis (Hindle et al 2016; Sun et al 2022; Weisz et al 2022), which posits that software can be viewed as a form of human communication. Consequently, techniques applicable to natural language can also be employed for code generation.

Various approaches, spanning from probabilistic (Bielik et al 2016; Li et al 2017; Schumacher et al 2020) to Machine Learning (Kim et al 2021; Svyatkovskiy et al 2020), have been proposed to validate this hypothesis. The Transformer (Vaswani et al 2017) has emerged as the dominant architecture, serving as the foundation for notable models like PLBART (Ahmad et al 2021), CodeBERT (Feng et al 2020), Codex (Chen et al 2021), AlphaCode (Li et al 2022), and GPT-3.5. These models support a wide range of code-related tasks, including code summarization, code translation across programming languages (Ahmad et al 2021), code documentation generation (Feng et al 2020), and code auto-completion based on comments or existing code (Chen et al 2021; Ahmad et al 2021), and can even challenge humans in programming competitions (Li et al 2022).

Several methods have been proposed to enhance generative AI models. These approaches involve expanding input queries with contextual information, such as code token types (Izadi et al 2022), preceding task queries and code outputs (Nijkamp et al 2022). Other methods involve integrating a supplemental retriever (Lu et al 2022; Parvez et al 2021) or incorporating an additional memory component (Wu et al 2022; Fan et al 2021; Khandelwal et al 2020). However, these approaches either require training or overlook the potential of leveraging user feedback as a valuable resource. Furthermore, the non-deterministic and unpredictable nature of the underlying AI model restricts in-depth analysis of their unexpected behaviors (Bird et al 2022).

2.2 Interactive programming

The quality of the natural language (NL) prompt significantly impacts the accuracy of NL translation models (Reynolds and McDonell 2021). Various techniques have been proposed to address the ambiguity of NL and bridge the gap between NL and programming languages. These include heuristic methods, semantic parsing (Shin and Nam 2021), and interactive programming, which has gained notable attention as a prominent approach (Heyman et al 2021).

Methods supporting user interaction comprise binary validation of target results (Iyer et al 2017), multiple-choice questions (Gür et al 2018), selection from a list of options (Schlegel et al 2019), decomposition and modification of sub-components using predefined values (Su et al 2018), feedback through NL queries (Elgohary et al 2021), and workflow updates instead of direct result modification (Cai et al 2023). Although user feedback has been shown to be advantageous in these studies, its persistence is limited to a single interaction session.

2.3 Decomposition in problem solving

Our methodology draws inspiration from a widely known heuristic in problem solving, which involves the decomposition of a problem into manageable sub-problems (Egidi 2006). This approach is valuable in software development (Charitsis et al 2022), and particularly in working with generative AI models (Barke et al 2023). Recent studies aim to elicit the decomposition ability of AI models by enhancing the prompt with a series of intermediate NL reasoning steps, namely chain-of-thought (Wei et al 2022), tree of thoughts (Yao et al 2023), and plan-and-solve prompting (Wang et al 2023). However, due to the unpredictable nature of AI models, it remains challenging to determine which steps described in the prompt contribute to unexpected results.

In addition, users commonly gain proficiency in a new programming language by initially acquainting themselves with basic functions and progressively advancing towards more intricate features (Carpenter 2021). Ordinarily, after decomposing a problem, users leverage acquired knowledge to resolve familiar sub-problems, and reserve the search for novel solutions solely for unfamiliar sub-problems. Our methodology aims to reflect this learning process in NL-to-Code translation models. In particular, we consider user feedback as knowledge that the translation model needs to remember following each interaction. When encountering a new NL query, the model is expected to identify the portions of the query that have been resolved and distinguish them from the segments that necessitate code generation. The resulting composition of sub-knowledge allows for in-depth analysis of which phrases in the query lead to unexpected answers.

2.4 Chunking in natural language processing

Shallow parsing, or chunking, involves dividing text into non-overlapping groups of syntactically or semantically related words (Abney 1992). It is widely used in Natural Language Processing (NLP) for various types of chunks, such as named entities, noun phrases, and verbal groups (Zhang et al 2002).

A reliable text chunker is crucial for extracting vital information from unstructured text, enabling detailed analysis in subsequent processing tasks. Different techniques, including grammar-based methods, statistical models, and machine learning approaches, have been developed for chunking tasks (Ramshaw and Marcus 1999; Mohapatra et al 2021). These approaches utilize features such as part-of-speech tags or grammar templates for training.

The rapid development of Large Language Models has spawned a substantial number of NLP libraries that cover a diverse array of tasks beyond chunking. Prominent libraries in this domain include NLTK,Footnote 8 CoreNLP,Footnote 9 scikit-learn,Footnote 10 and spaCy,Footnote 11 which is utilized in our experiments.

3 Approach

This section presents a thorough explanation of our methodology, including an overview of the One-shot Correction workflow, and descriptions of each primary component within the workflow.

3.1 General workflow

Figure 2 presents the general workflow of One-shot Correction methodology for NL-to-Code translation models with an illustrative example. Our methodology incorporates three components: (i) a correction data-store, (ii) an NL-to-Code generator, and (iii) a code builder.

Fig. 2 One-shot correction workflow for NL-to-Code translation models, exemplified with an illustrative example

The correction data-store collects user feedback paired with the corresponding NL queries. Meanwhile, the NL-to-Code generator is a code translation model that takes natural language queries as inputs and produces code snippets. The code builder is the key component designed to integrate correction information with the code generator model without requiring additional model re-training.

For each NL query, the code builder initially checks if the query already exists in the correction data-store. If it does, the code that was previously corrected by users in past conversations is retrieved and directly returned to the users. If it is the first time users inquire about this NL query, the query undergoes several processing steps before the final code snippet is assembled.

Initially, in the Query chunking step, the query is decomposed into chunks, with each chunk representing a single task or action. Subsequently, the code builder searches for potential code snippets associated with each NL chunk by accessing the correction data-store or utilizing the NL-to-Code generator (if the chunk has no similar stored queries). We call this step Sub-snippets retrieving/generating. Finally, in the Code building step, all the obtained code snippets are utilized to construct the final snippet before providing a reply to the user. If users make modifications to the generated code, the correction information is once again stored in the correction data-store before the next query is requested.

We demonstrate the result of each step using a typical example in Python. Assuming that the NL query is “add two numbers, and then print the result” and no prior modifications have been made by users, the query is decomposed into two chunks: “add two numbers” and “print the result”. In the subsequent step, the code builder retrieves the code snippet return num_1 + num_2 from the NL-to-Code generator for the chunk “add two numbers” since this chunk is not present in the correction data-store. Meanwhile, the snippet for the chunk “print the result”, i.e. print(result), is fetched from the data-store, supposing it was corrected by users in past conversations. Ultimately, in the Code building step, the two code snippets are combined to generate the response, as shown in Fig. 2.
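To make the workflow concrete, the following Python sketch outlines the code builder’s control flow under simplifying assumptions; the helper names (chunk_query, find_knn, extract_sub_snippets, generate_code, build_code) are illustrative placeholders for the components described in Sects. 3.2–3.4, not our actual implementation.

from typing import Callable, Dict, List

def translate(
    query: str,
    data_store: Dict[str, str],                 # NL query/chunk -> corrected code
    chunk_query: Callable[[str], List[str]],    # query chunking (Sect. 3.2)
    find_knn: Callable[[str, Dict[str, str]], List[str]],    # KNN lookup (Sect. 3.3)
    extract_sub_snippets: Callable[[str, List[str]], str],   # Algorithm 1
    generate_code: Callable[[str], str],        # NL-to-Code generator
    build_code: Callable[[str, List[str], List[str]], str],  # code building (Sect. 3.4)
) -> str:
    # 1. Exact hit: return the previously corrected code directly.
    if query in data_store:
        return data_store[query]
    # 2. Decompose the query into verb-noun chunks.
    chunks = chunk_query(query)
    # 3. Retrieve or generate sub-snippets for each chunk.
    sub_snippets = []
    for chunk in chunks:
        if chunk in data_store:
            sub_snippets.append(data_store[chunk])
        elif (neighbors := find_knn(chunk, data_store)):
            sub_snippets.append(extract_sub_snippets(chunk, neighbors))
        else:
            sub_snippets.append(generate_code(chunk))
    # 4. Order, refine, rename, and assemble the final function.
    return build_code(query, chunks, sub_snippets)

For readability, the sketch keys the data-store by query text; our experiments use embedding values as keys (Sect. 4.2).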

To illustrate the applicability of our methodology to different NL-to-Code translation models, we utilize existing NL-to-Code models instead of developing a new one. For simplicity, the correction data-store is structured as a dictionary, with the keys representing the embedding values of NL queries and the corresponding values indicating the corrected code. Further explanation on the NL-to-Code generator and the correction data-store employed in our experiments is presented in Sect. 4.2. The subsequent sub-sections delve into a comprehensive analysis of each main phase in the code builder component.

3.2 Query chunking

As mentioned in Sect. 2.4, text chunking entails grouping adjacent tokens in unstructured text into phrases based on their part-of-speech (POS) tags. In our methodology, we target NL queries representing pipelines of actions, where each main verb in a query indicates a task in the target code. Therefore, our objective in this phase is to identify non-overlapping verb-noun chunks within a query.

We use a rule-based method and a dependency graph to determine the main verbs and to construct a chunk for each verb. There are two types of main verbs considered in our methodology: (i) verbs with the POS value VERB (e.g. print the result, calculate the average), and (ii) auxiliary verbs (AUX) that are not immediately followed by other verbs (e.g. are inputs, is odd or even). Supplementary verbs do not form their own chunks (e.g. using Counter).

Figure 3 depicts a dependency graph generated by spaCy for the query “add two numbers, and then print the result”. The main verbs in this query are add and print. The dependency graph reveals that all the main verbs are interconnected, while other words (e.g. NOUN, ADV) associate with their corresponding verbs. Thus, the main verb functions as the root node of its verb-phrase tree. By applying this rule to analyze the dependency graph, we extract two verb-noun-phrases, namely “add two numbers” and “print the result”. Punctuation and conjunction between main verbs are omitted in this analysis.
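For illustration, the sketch below implements this rule with spaCy under simplifying assumptions; it requires the en_core_web_md model, ignores corner cases such as supplementary verbs, and is not our full chunking implementation.

import spacy

nlp = spacy.load("en_core_web_md")

def verb_noun_chunks(query: str) -> list[str]:
    doc = nlp(query)
    # Main verbs: tokens with POS VERB, or AUX tokens not immediately followed by a verb.
    main_verbs = [
        tok for tok in doc
        if tok.pos_ == "VERB"
        or (tok.pos_ == "AUX"
            and (tok.i + 1 >= len(doc) or doc[tok.i + 1].pos_ != "VERB"))
    ]
    chunks = []
    for verb in main_verbs:
        other_idx = {v.i for v in main_verbs if v.i != verb.i}
        # Keep words attached to this verb, dropping punctuation, conjunctions,
        # and tokens that belong to another main verb's phrase.
        words = [
            tok for tok in verb.subtree
            if tok.pos_ not in ("PUNCT", "CCONJ")
            and tok.i not in other_idx
            and not any(anc.i in other_idx for anc in tok.ancestors)
        ]
        chunks.append(" ".join(tok.text for tok in words))
    return chunks

print(verb_noun_chunks("add two numbers, and then print the result"))
# e.g. ['add two numbers', 'then print the result'], depending on the parse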

Fig. 3 A dependency graph generated by spaCy (https://spacy.io/)

It is worth highlighting the potential benefits of employing Large Language Models (LLMs), such as GPT-3.5,Footnote 12 in this phase. Nonetheless, our objective is to ensure the transparency of the model and the ease of comprehension for developers throughout all steps. Additionally, our evaluation results indicate that even a less complex model, when incorporated as an additional component, can already improve the efficiency of the NL-to-Code model.

3.3 Sub-snippets retrieving/generating

In our methodology, NL chunks are considered as atomic NL queries that represent a single primary task or action. The sub-snippets retrieval and generation process for an NL chunk is displayed in Fig. 4.

Fig. 4 Flowchart of retrieving/generating sub-snippets for a Natural Language chunk

Firstly, if the NL chunk exists in the correction data-store, the related code snippets are retrieved and transferred to the Code building step. If the NL chunk is not present in the data-store, the k-Nearest Neighbors (KNNs) of the chunk are computed under a predetermined threshold (refer to Sect. 4.2). Code snippets from the KNNs are extracted and forwarded to the Code building step. However, if there are no nearest neighbors of the NL chunk, the NL-to-Code generator is activated to generate code for the chunk and proceed to the subsequent step. Further details on code generation for NL queries or NL chunks are provided in Sect. 4.2.

3.3.1 Extracting sub-snippets for an NL chunk

In case the NL chunk has a similar NL query in the correction data-store (i.e. a nearest neighbor), the sub-snippets of the NL chunk are determined based on the sub-snippets of the phrase in the NL query that is most similar to it. Algorithm 1 outlines the process of extracting code snippet for an NL chunk from the corresponding code of a similar NL query.

Algorithm 1 Extracting sub-snippet for an NL chunk from a similar query

Initially, the NL chunk is compared to the similar NL query to identify the most similar phrase, denoted as simi_chunk (line 2). We use the (cosine) similarity feature provided by spaCyFootnote 13 to assess the correlation between the NL chunk and each chunk in the similar query, subject to a predefined threshold (see Sect. 4.2). Additionally, each chunk in the similar query is mapped to sub-snippets in the target code of the query, using the function named MAP_NL_CODE (line 3). The associated sub-snippets of simi_chunk are then extracted and assigned as sub-snippets for the NL chunk (line 5).
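A simplified Python rendering of this logic is shown below; it reuses the verb_noun_chunks helper sketched in Sect. 3.2, takes map_nl_code (Algorithm 2) as a parameter, and uses the 0.5 similarity threshold from Table 2. The names and structure are illustrative, not the exact algorithm lines.

SIM_THRESHOLD = 0.5  # spaCy cosine similarity threshold (Table 2)

def extract_sub_snippet(nl_chunk: str, similar_query: str, similar_code: str,
                        map_nl_code) -> list[str]:
    chunk_doc = nlp(nl_chunk)
    # Find the most similar phrase (simi_chunk) in the similar query.
    simi_chunk, best_score = None, SIM_THRESHOLD
    for phrase in verb_noun_chunks(similar_query):   # Sect. 3.2
        score = chunk_doc.similarity(nlp(phrase))
        if score >= best_score:
            simi_chunk, best_score = phrase, score
    if simi_chunk is None:
        return []
    # Map each phrase of the similar query to sub-snippets of its target code
    # (MAP_NL_CODE, Algorithm 2) and reuse the sub-snippets of simi_chunk.
    mapping = map_nl_code(similar_query, similar_code)
    return mapping.get(simi_chunk, [])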

3.3.2 Mapping NL chunks and sub-snippets

Algorithm 2 represents the pseudo code for the MAP_NL_CODE function. We employ a rule-based approach to establish mappings between chunks in an NL query and sub-snippets in the correlative target code. Before constructing the mapping, the target code is divided into sub-snippets by analyzing its Abstract Syntax Tree (AST) structure (line 3). We utilize the tree-sitterFootnote 14 parser to obtain the AST of the target code. Sub-snippets within the target code consist of statements under the root_node (e.g. import statements) and child statements of function_definition. For simplicity, we require that each NL query is translated to code snippets wrapped in a function_definition and the necessary import statements.

Algorithm 2 Mapping chunks in a query and its target code

Subsequently, the NL query is decomposed into verb-noun chunks (line 4) following the method described in Sect. 3.2. To estimate the analogy between sub-snippets and verb-noun phrases, we developed a straightforward code explanation approach (line 6) that translates programming language operations and abbreviations into natural language.Footnote 15 Afterwards, the explanation of each sub-snippet is compared to the verb-noun phrases utilizing the (cosine) similarity function from spaCy (lines 7–11). The phrase with the highest similarity score is mapped to the current sub-snippet (line 15).
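The sketch below mirrors this mapping under simplifying assumptions: it splits the target code with Python’s built-in ast module rather than tree-sitter, and the explain parameter is a trivial stand-in for our code explanation approach.

import ast

def split_sub_snippets(code: str) -> list[str]:
    # Top-level statements (e.g. imports) plus the direct child statements
    # of each function definition.
    tree = ast.parse(code)
    snippets = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            snippets.extend(ast.unparse(child) for child in node.body)
        else:
            snippets.append(ast.unparse(node))
    return snippets

def map_nl_code(query: str, code: str, explain=lambda snippet: snippet):
    # explain() stands in for the code-explanation step of Algorithm 2.
    phrases = verb_noun_chunks(query)            # Sect. 3.2
    if not phrases:
        return {}
    phrase_docs = {p: nlp(p) for p in phrases}
    mapping = {p: [] for p in phrases}
    for snippet in split_sub_snippets(code):
        expl_doc = nlp(explain(snippet))
        # Map the sub-snippet to the verb-noun phrase with the highest similarity.
        best = max(phrases, key=lambda p: expl_doc.similarity(phrase_docs[p]))
        mapping[best].append(snippet)
    return mapping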

It is worth mentioning again that LLMs could be used for these NLP-related tasks. However, as emphasized above, our goal is to investigate whether an NL-to-Code model can be enhanced by a less sophisticated method. Hence, a rule-based approach is well-suited for this purpose.

3.4 Code building

In this step, the final code is constructed by combining sub-snippets corresponding to each verb-noun phrase in the NL query. The inputs for this step include the NL query and the mapping between each phrase in the query and its respective sub-snippets. The final code encompasses sub-snippets enclosed within a function_definition and any required import statements.

This step consists of the following sub-steps: (i) determining the order of sub-snippets, (ii) refining sub-snippets for each verb-noun phrase, (iii) renaming identifiers in all sub-snippets to ensure data-flow (i.e. the naming progression from definitions to usages of identifiers in a code snippet, following the semantic data-flow of Ren et al (2020)), and (iv) identifying parameters for the final function. Figure 5 demonstrates an example of code construction from sub-snippets, using the example described in Fig. 2.

Fig. 5 An example of building code from sub-snippets. Assuming that each NL chunk retrieves 2-Nearest Neighbors from the correction data-store, resulting in two potential sub-snippets for each chunk

3.4.1 Determining sub-snippet order

Sub-snippets are sorted according to the verb-noun phrase order in the NL query, which corresponds to the order of related verbs in the dependency graph. The arrangement is determined by analyzing the relationship between verbs in the graph. As a result, sub-snippets associated with the root verbFootnote 16 are given priority. In Fig. 3, the verb add precedes the verb print due to a conj dependency from add to print. Therefore, sub-snippets of the verb add (i.e. return num_1 + num_2 and return a + b) are placed before sub-snippets of the verb print (i.e. print(result) and print(‘result = ’, result)) at the end of this sub-step (Fig. 5).

3.4.2 Refining sub-snippets

Relevant sub-snippets of each NL chunk are modified based on a set of rules. To illustrate the utility of our methodology, we initially employed three rules for sub-snippet refinement: (i) no starting return statements, (ii) reducing plural statements, and (iii) refining between return statements. These rules aim to minimize grammatical errors when combining sub-snippets.

(i) No starting return statements. This rule prioritizes non-return statements for non-last NL chunks. By default, each NL chunk corresponds to a list of potential sub-snippets, and the first item (i.e. the sub-snippet(s) extracted from the top-1 nearest neighbor) has the highest priority. This is the output of the Sub-snippets retrieving/generating step (Sect. 3.3).

The preference is maintained if the current NL chunk occupies the last position in the list of chunks obtained from the preceding sub-step (i.e. sub-snippet ordering). In contrast, if the current NL chunk is a non-last chunk, return statements are ranked lower than other statements. This is because, in most programming languages, a return statement cancels the subsequent statements of the same level (e.g. the same indentation) within a scope (e.g. a try statement). However, if the NL chunk corresponds only to return statements, the first return statement is selected and converted into an assignment statement. The left operand is named stmt followed by the index of the current NL chunk.

Table 1 displays four typical cases of refining sub-snippets for a non-last NL chunk, exemplified with the chunk “add two numbers”. In the example depicted in Fig. 5, “add two numbers” occupies the first position in the ordered list, and its potential sub-snippets all start with the keyword return. Hence, the statement selected for this chunk is refined as stmt_0 = num_1 + num_2.

Table 1 Examples of refining sub-snippets for a non-last NL chunk with 2-NNs
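A minimal sketch of rule (i), assuming each sub-snippet is a single-line string; the helper name is illustrative.

def refine_non_last(sub_snippets: list[str], chunk_idx: int) -> list[str]:
    # Prefer non-return statements for a non-last chunk.
    non_returns = [s for s in sub_snippets if not s.lstrip().startswith("return")]
    if non_returns:
        return non_returns
    # Only return statements are available: turn the first one into an assignment.
    expr = sub_snippets[0].lstrip()[len("return"):].strip()
    return [f"stmt_{chunk_idx} = {expr}"]

print(refine_non_last(["return num_1 + num_2", "return a + b"], 0))
# ['stmt_0 = num_1 + num_2'], matching the example in Fig. 5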

(ii) Reducing plural statements. This rule aims to omit redundant sub-snippets of a verb-noun phrase. For simplicity, we implement a preliminary prototype of this rule based on the direct object of the verb in an NL chunk. Nearly identical sub-snippets are reduced if the direct object is a singular noun (i.e. the spaCy tag_ is NN). Conversely, the sub-snippets of the NL chunk are left unchanged if the direct object is a plural noun (i.e. the spaCy tag_ is NNS).

For instance, assuming that the following sub-snippets are obtained for the NL chunk “get an integer input from user”:

(Listing: candidate sub-snippets retrieved for the chunk “get an integer input from user”)

Since the direct object of the verb get is a singular noun (i.e. inputFootnote 17), only the first sub-snippet from the list of highly similar sub-snippets is preserved for building the final code (i.e. num_1 = int(input("Number 1: "))).

It should be emphasized that this is an initial prototype of the rule, intended to exhibit the concept of our approach. The reduction condition can become more complex when the plural noun is described with a specific quantity. In Fig. 5, the chunks “add two numbers” and “print the result” each have only one sub-snippet as a result of rule (i). Therefore, rule (ii) has no effect on these sub-snippets.
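A minimal sketch of rule (ii), reusing the nlp pipeline from Sect. 3.2; the second candidate sub-snippet in the usage example is hypothetical, since the original listing is not reproduced here.

def reduce_plural(nl_chunk: str, sub_snippets: list[str]) -> list[str]:
    doc = nlp(nl_chunk)
    direct_objects = [tok for tok in doc if tok.dep_ == "dobj"]
    if direct_objects and direct_objects[0].tag_ == "NN":
        # Singular direct object: drop the near-identical alternatives.
        return sub_snippets[:1]
    return sub_snippets

print(reduce_plural("get an integer input from user",
                    ['num_1 = int(input("Number 1: "))',
                     'num_2 = int(input("Number 2: "))']))
# expected: ['num_1 = int(input("Number 1: "))']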

(iii) Refining between return statements. The last rule in our primary rule set ensures that a return statement is placed after other statements of the same level in the final assembled code. Namely, a non-last NL chunk should contribute non-return statement(s) to the final code. Otherwise, depending on the expression after the keyword return, the return statement is either omitted or transformed into an assignment statement, using the same technique as in rule (i). The latter case (i.e. modifying the return statement) applies when the part following the keyword return creates new values (e.g. return a + b, return abs(num), or return a[i]). The former case (i.e. omitting the statement) arises if the after-return part is an identifier (e.g. return sum) or a list of identifiers (e.g. return a, b).

In the example exhibited in Fig. 5, the sub-snippets of the NL chunks remain unmodified after applying rule (iii), because no return statements are left in the sub-snippet list after applying rules (i) and (ii). It is essential to mention that return statements nested in other code structures (e.g. if, for) are not affected by rules (i) and (iii), since the considered sub-snippets are statements directly under the root_node of an AST and direct child statements of function_definition (see Sect. 3.3).

Furthermore, our primary rule set is adaptable and can be expanded for intricate cases (e.g. conditional and loop statements). We develop a configuration file to gather all the settings used in our experiments (see Sect. 7, Listing 4) and to conveniently select/deselect each of the refinement rules before running an experiment.

3.4.3 Renaming identifiers

In this sub-step, the propagation of names within the sub-snippets is determined by analyzing code token types in the refined sub-snippets. We simplify the process by assuming that an identifier defined in a given statement should be used directly in the following statement. The sub-snippets are inspected from the last one to the first. The underlying idea is to substitute the identifier definitions in the current sub-snippet with the undefined identifiers in the sub-snippet below it. Pseudo code for our algorithm is provided in Algorithm 3.

Algorithm 3 Renaming identifiers in sub-snippets

The list of undefined identifiers is initialized by taking the set difference between the identifier usages and the identifier definitions in the last statement (lines 2–4). We use tree-sitterFootnote 18 and the Code Token Type Taxonomy (CT3) proposed by Le et al (2023) to analyze the type of each token within the sub-snippets. Identifier definitions encompass variable definitions, argument definitions, and imported libraries, while identifier usages include the utilization of all the specified definitions.

Identifier definitions and usages of each sub-snippet are determined in reverse order of the sub-snippets using the same method (line 6). The identifier definitions of the current sub-snippet are then replaced by the previously computed undefined identifiers (line 7), while the list of undefined identifiers is updated to exclude the replacement (line 8). The REPLACE_ID_DEFS function also considers identifier usages to handle cases where the current sub-snippet lacks identifier definitions for the one directly below it. In this case, the identifier usages serve as identifier definitions.

In Fig. 5, the list of undefined identifiers of the last statement (print(result)) comprises only the token result. Meanwhile, stmt_0 is the only identifier definition in the preceding statement. Accordingly, after the renaming sub-step, stmt_0 is replaced by result.

3.4.4 Identifying parameters for the final code

In the last sub-step, a list of parameters for the final function is assembled from undefined identifiers that are unsubstitutable by any identifier definitions. In Fig. 5, the tokens num_1 and num_2 remain as parameters of the resulting function due to the absence of appropriate identifier definitions for them within the sub-snippets.
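The two sub-steps can be sketched as follows; for brevity, the sketch analyzes single-line sub-snippets with Python’s ast module (our implementation uses tree-sitter and CT3) and handles only the simple case of one definition per statement.

import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))

def defs_and_uses(stmt: str):
    node = ast.parse(stmt).body[0]
    defs = {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    uses = {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)} - BUILTIN_NAMES
    return defs, uses

def rename_and_parameters(sub_snippets: list[str]):
    snippets = list(sub_snippets)
    defs, uses = defs_and_uses(snippets[-1])
    undefined = uses - defs                      # names the last statement still needs
    for i in range(len(snippets) - 2, -1, -1):   # inspect snippets from bottom to top
        defs, uses = defs_and_uses(snippets[i])
        if defs and undefined:
            # Substitute one identifier definition with one undefined name below it.
            old, new = sorted(defs)[0], sorted(undefined)[0]
            snippets[i] = snippets[i].replace(old, new)
            undefined.discard(new)
        undefined |= uses - defs                 # names this snippet itself still needs
    return snippets, sorted(undefined)           # leftovers become function parameters

print(rename_and_parameters(["stmt_0 = num_1 + num_2", "print(result)"]))
# (['result = num_1 + num_2', 'print(result)'], ['num_1', 'num_2'])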

Our methodology adheres to the principles of simplicity, interpretability, and the ability to investigate unexpected outcomes, which is not feasible with AI models. Furthermore, the methodology’s composability allows the generated code to resemble the target code more closely as correction information accumulates.

Given the novelty of our approach, we aim to illustrate the utility of the method and to highlight the main contributions. Therefore, even though our predefined rules are preliminary, they still adequately support the proposed concept. In the sections below, we present our experiments and evaluation results, demonstrating how the inclusion of a relatively simple additional component can already bring benefit to a code translation model.

4 Experiments

In this section, we first reiterate our objective through two research questions. A detailed description of our experimental setup is then provided to ensure reproducibility. Finally, we present the evaluation metrics used in our experiments.

4.1 Research questions

We address the following two research questions:

RQ1: Does an interpretable, non-AI methodology enhance generative AI models? We investigate this question by proposing a rule-based methodology on NL-to-Code translation that incorporates code derived from user feedback with selectively generated code from an AI model (only as needed). Our methodology requires no explicit re-training. We conduct experiments on NL-to-Python code translation and use GPT-3.5-Turbo-0301 model developed by OpenAIFootnote 19 as the generative AI model.

Models for comparison. To ensure a fair evaluation in the absence of existing comparable methods, we introduce an additional method that integrates correction information directly into input queries. This approach is based on the premise that GPT-3 series models tend to yield more accurate results with increased input information. The extended input technique and our proposed methodology function similarly when the query exists in the correction data-store, as the correction information is retrieved and returned to users. However, these models differ in their response when there are similar queries in the correction data-store.

While the extended input approach simply expands the input query with information from the similar queries, our chunking methodology first decomposes the query into chunks, gathers appropriate code snippets for each chunk by examining the correction data-store or activating the NL-to-Code generator, and then constructs the final code using the collected code snippets.

In summary, our main evaluation comprises three variants: (i) CodeGen – code generation without correction information, (ii) CodeGenE – code generation with correction information integrated through extended input queries, and (iii) CodeGenC – code generation with correction information incorporated using our chunking methodology. The CodeGen model serves as the baseline. Additionally, to examine LLM performance with our chunking strategy embedded within prompts as task descriptions, we conducted an additional experiment using GPT-3.5 to directly generate code with the integrated chunking instruction, referred to as GPT35Prompt.

RQ2: Does user feedback improve NL-to-Code models without explicit re-training? To address this question, we perform an ablation study to assess the influence of user feedback on generated code. We compare the code generated solely by the code generator to the code produced when integrating the generator with various states of the correction data-store (see Sect. 4.2). This comparison allows us to determine if incorporating user feedback offers benefits to code translation models without re-training and which state of the data-store would offer the greatest advantage.

4.2 Experimental setup

We perform experiments on translating NL to Python code, using available APIs and libraries as follows:

4.2.1 Test cases and scenarios

For the evaluation, we assess the methodology using a range of NL queries, varying from basic to complex. To simplify the query chunking process, we assume that each chunk in a query describes a single task and that chunks are separated by a comma or the phrase “, and then”. While acknowledging potential artificiality in queries with three or more chunks, our proposed structure addresses NLP ambiguity as an intermediate form between NL and a Domain Specific Language. It involves an inevitable trade-off between flexibility and efficiency.

Due to the unavailability of a suitable test suite or benchmark tailored to our specific requirements, we develop a new test suite comprising queries with one to three chunks along with their corresponding target code. We extract single-chunk queries from online Python examples.Footnote 20 For multi-chunk queries, we utilize ChatGPT,Footnote 21 a well-known model trained on an immense dataset, to form the queries. Although ChatGPT’s responses might lean toward its own biases, they remain closer to human intent and are more objective than our self-composed queries. For instance, we use the following inquiry for creating double-chunk queries, specifically related to dataframe:

(Listing: the inquiry submitted to ChatGPT to create double-chunk queries related to dataframes)

Subsequently, GitHub CopilotFootnote 22 is employed to generate the target code for each query. GitHub Copilot is powered by Codex model,Footnote 23 a descendant of GPT-3, which was trained on both natural language and billions of lines of code. Hence, code generated by GitHub Copilot can serve as a reasonable reference. We thoroughly validate and modify (if necessary) each target code to ensure its validity and executability.Footnote 24

For each NL query or chunk, there are five possible states of the correction data-store: (1) an empty data-store, (2) an identical query in the data-store, (3) a non-empty data-store without similar queries for the inquiry, (4) similar single-chunk queries in the data-store, and (5) similar multi-chunk queries in the data-store. Accordingly, each single-chunk query involves five scenarios, while each multi-chunk query can be associated with up to \((1 + 4^n)\) scenarios, where n represents the number of chunks in the query.

The test suite should cover all the states of the correction data-store and yield sufficient results to analyze the behavior of all models, targeting to highlight the utility of the proposed method. For this reason, we gathered 47 single-chunk queries, nine double-chunk queries, and three triple-chunk queries as main inquiries, alongside 55 single-chunk queries, 20 double-chunk queries, and 14 triple-chunk queries dedicated as similar queries in the correction data-store. These queries cover 401 cases across five states of the correction data-store. Furthermore, each query chunk is guaranteed to have at least one similar single-chunk query.

Accordingly, each test case includes: (i) the correction data-store, (ii) NL query, (iii) target code, (iv) code generated by the NL-to-Code generator only, (v) code obtained with extended input queries, and (vi) code constructed by our chunking methodology.

4.2.2 Correction data-store

For simplicity, the correction data-store is a dictionary whose keys are tuples of embedding values of the NL queries and whose values are the code corrected by users. Given the varying data-store states for each query, we create a data-store containing all collected NL queries and their target code, and provide a snapshot of the data-store for each test case.
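A minimal sketch of this structure, with illustrative helper names:

from typing import Dict, Optional, Tuple

CorrectionStore = Dict[Tuple[float, ...], str]   # embedding tuple -> corrected code

def store_correction(store: CorrectionStore,
                     query_embedding: Tuple[float, ...],
                     corrected_code: str) -> None:
    store[tuple(query_embedding)] = corrected_code

def lookup_exact(store: CorrectionStore,
                 query_embedding: Tuple[float, ...]) -> Optional[str]:
    return store.get(tuple(query_embedding))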

4.2.3 Code generator

We utilize GPT-3.5-Turbo-0301,Footnote 25 a replacement for the Codex model, for NL-to-Python code translation. Since the CodeCompletion feature of the Codex model was replaced by ChatCompletion in GPT-3.5-Turbo-0301, queries for translating NL to code are formalized as messages between the system and users.

To obtain Python code from an NL query with GPT-3.5-Turbo-0301 model solely (i.e. CodeGen model), we structure the messages between system and user as demonstrated in Listing 1.

(Listing 1: system and user messages for generating Python code with the CodeGen model)

The response should exclude both code explanations and Python code-block marks (e.g. ```python) to facilitate code snippet extraction. Besides, all variable names in the generated Python code should adhere to the snake_case convention to enhance the mapping between NL chunks and code snippets.
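For illustration, the request can be issued as sketched below with the legacy openai Python SDK (pre-1.0); the message wording paraphrases the constraints above and is not the verbatim Listing 1, and the hyperparameters follow Table 2.

import openai

def generate_code(nl_query: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[
            {"role": "system",
             "content": "You translate natural language to Python code."},
            {"role": "user",
             "content": f"{nl_query}\n"
                        "Return only the code, without explanations or "
                        "```python marks, and use snake_case variable names."},
        ],
        temperature=0.9, top_p=0.9, n=1,
        frequency_penalty=0.5, presence_penalty=1.5,
    )
    return response["choices"][0]["message"]["content"]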

As the GPT-3.5-Turbo-0301 model generally gives less attention to system messages,Footnote 26 we extend input queries for the CodeGenE model by integrating the correction information into user messages, as illustrated in Listing 2. We refer to the OpenAI documentation for a detailed explanation of each field in the messages.

(Listing 2: user messages extended with correction information for the CodeGenE model)

Similar queries and their corrected code snippets from the correction data-store are provided as examples for the NL query and displayed in sequential order. Alternative prompting methods for user messages may impact the generated response (see Sect. 2.3). However, comparing these prompting methods is beyond the scope of this paper.

Additionally, OpenAI models exhibit non-deterministic behavior, resulting in varying outputs for identical inputs. This poses challenges for our evaluation process, particularly when triggering the model multiple times with the same input due to the dynamic state of the correction data-store in the test cases. To address this issue, we adopt a dictionary-based method to accumulate and store the code generated by GPT-3.5-Turbo-0301. The dictionary uses embedding values of inquiries as keys, enabling retrieval of the corresponding generated code when an identical prompt is submitted.
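A minimal sketch of this caching mechanism, reusing the generate_code() helper sketched above and the embed() helper sketched in Sect. 4.2.4:

from typing import Dict, Tuple

generation_cache: Dict[Tuple[float, ...], str] = {}

def cached_generate(prompt: str) -> str:
    key = tuple(embed(prompt))              # embedding values of the prompt as key
    if key not in generation_cache:
        generation_cache[key] = generate_code(prompt)
    return generation_cache[key]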

4.2.4 Natural language embedding and KNNs

We employ another model from OpenAI, Text-Embedding-ADA-002,Footnote 27 to embed NL queries. KNNs for each query are extracted using cosine similarity under a predefined threshold (see experiment configuration). The accompanying function is developed by OpenAI as well.
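For illustration, the embedding and KNN retrieval can be sketched as follows; plain numpy replaces OpenAI’s helper function, and the threshold is interpreted as a maximum cosine distance, an assumption consistent with the small values in Table 2.

import numpy as np
import openai

def embed(text: str) -> list[float]:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return response["data"][0]["embedding"]

def find_knn(query: str, store: dict, threshold: float = 0.15, k: int = 2) -> list[str]:
    q = np.array(embed(query))
    scored = []
    for key_embedding, corrected_code in store.items():   # keys are embedding tuples
        v = np.array(key_embedding)
        distance = 1.0 - float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        if distance <= threshold:            # keep only sufficiently close neighbors
            scored.append((distance, corrected_code))
    return [code for _, code in sorted(scored)[:k]]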

4.2.5 Experiment configuration

Table 2 displays the configurations for the conducted experiments.

Table 2 Experiment configuration

The hyperparameters for generating Python code from NL queries using the model GPT-3.5-Turbo-0301 are outlined in the top part of the table. Specifically, a temperature of 0.9 and a top_p value of 0.9 are set to encourage the model’s creativity when multiple responses are required (\(n>1\)). A frequency_penalty of 0.5 is assigned to penalize the frequent occurrence of repeated identifiers in code snippets, while a presence_penalty of 1.5 is used to prompt the model to generate a novel response each time for the same query. For simplicity, in our experiments, we consider a single response per query (\(n=1\)). Further information on each hyperparameter is explained in the OpenAI documentation.Footnote 28

Queries of the correction data-store undergo KNN examination using cosine similarity thresholds of 0.15 and 0.2 for single and multi-chunk queries, respectively. Each inquiry obtains two nearest neighbors (\(knn = 2\)). The settings for obtaining sub-snippets and building the final code are specified at the bottom of Table 2. The spaCy model en_core_web_md is utilized, and another cosine similarity threshold of 0.5 is set for comparing the resemblance between chunks or a chunk and its sub-snippets. A threshold of 0.9 is employed to determine mostly identical sub-snippets for the second rule in the rule set of refining sub-snippets (Sect. 3.4).

In addition, we omit stop words and lemmatize verbs to their base form before calculating the similarity. The thresholds in Table 2 are adjusted to ensure that the final code snippet is constructed successfully in a majority of test cases. As mentioned in Sect. 3.4, we gathered setting values to a configuration file to easily fine-tune all the parameters, rules, and options.
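An illustrative configuration gathering these settings is sketched below; the actual file (Listing 4, Sect. 7) may use a different format and key names.

CONFIG = {
    "generator": {
        "model": "gpt-3.5-turbo-0301",
        "temperature": 0.9, "top_p": 0.9, "n": 1,
        "frequency_penalty": 0.5, "presence_penalty": 1.5,
    },
    "knn": {
        "k": 2,
        "threshold_single_chunk": 0.15,
        "threshold_multi_chunk": 0.2,
    },
    "chunking": {
        "spacy_model": "en_core_web_md",
        "chunk_similarity_threshold": 0.5,
        "identical_snippet_threshold": 0.9,   # rule (ii) of sub-snippet refinement
        "remove_stop_words": True,
        "lemmatize_verbs": True,
    },
    "refinement_rules": {
        "no_starting_return": True,
        "reduce_plural_statements": True,
        "refine_between_returns": True,
    },
}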

4.3 Evaluation metrics

Inspired by the study of Su et al (2018), we assume that users modify the generated result in the following order: (i) restructuring the code if necessary (i.e. adding, re-arranging, or removing statements), and (ii) renaming identifiers and updating strings to align with the NL query. Based on this, we evaluate the code obtained by the different approaches using the following criteria, in descending order of priority:

  1. Code validity and executability

  2. Syntax similarity between the attained snippet and the target code

  3. Data-flow correlation among the obtained results

  4. Analogy of identifier names in the code snippets

To ensure the first criterion, we manually evaluate each test case for its correctness. The remaining criteria are assessed using CodeBLEU (Ren et al 2020) with hyperparameters \((\alpha , \beta , \gamma , \delta )\) representing the ngram match, weighted ngram match, syntax match, and data-flow match, respectively. While the ngram match and weighted ngram match target the last criterion, the syntax match depicts syntax similarity and the data-flow match exhibits data-flow equivalence.
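For reference, CodeBLEU combines these four components as a weighted sum (Ren et al 2020): \(\text{CodeBLEU} = \alpha \cdot \text{BLEU} + \beta \cdot \text{BLEU}_{weight} + \gamma \cdot \text{Match}_{ast} + \delta \cdot \text{Match}_{df}\), where the first two terms are the ngram and weighted ngram matches and the last two are the syntax and data-flow matches; the weight values used in our experiments are discussed below.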

Ren et al (2020) recommended using the value set (0.1, 0.1, 0.4, 0.4), as \(\gamma \) and \(\delta \) have a stronger correlation with human evaluation scores. Based on the order of modifying generated code and our evaluation criteria, we adjust the value set to (0.1, 0.1, 0.5, 0.3). Our evaluation results show that both value sets follow the same trend with minimal differences. Section 5 presents the statistics with the amended value set (i.e. 0.1, 0.1, 0.5, 0.3). We refer to our published dataFootnote 29 for the outcomes of the other value set.

5 Evaluation results

In this section, we present and analyze our evaluation results to address the two research questions from Sect. 4.1. We conclude this section by assessing the performance of an LLM with our chunking strategy outlined in the NL text prompt.

5.1 Evaluation results by difficulty level

To analyze the evaluation results, we utilize the correction data-store states defined in Sect. 4.2 to determine the difficulty level for each test case and classify the results based on these levels. Each difficulty level indicates the degree of challenge in achieving the target code. The levels range from 0 to 4, representing a spectrum that includes low, medium-low, medium, medium-high, and high difficulty. Table 3 presents the definition of these levels.

Table 3 Definitions for difficulty levels (diff.)

For instance, difficulty level-2 involves two sub-scenarios: (i) single-chunk NL query linked to multi-chunk queries in the data-store, or (ii) a multi-chunk NL query where each chunk resembles single-chunk queries in the data-store. Meanwhile, difficulty level-3 indicates that each chunk in the multi-chunk NL query is related to queries with multi-chunk in the data-store. Ultimately, difficulty level-4 represents two sub-cases: (i) empty correction data-store, and (ii) no matching queries in the data-store.

Figure 6 presents the patterns of CodeBLEU score by difficulty level for the three models discussed in Sect. 4.1. The scores were computed for three sets: (i) all test cases (Fig. 6a), (ii) correct chunking cases (Fig. 6b), and (iii) incorrect chunking cases (Fig. 6cFootnote 30). The corresponding CodeBLEU scores are provided in Table 4.

Fig. 6 CodeBLEU scores by difficulty level on all test cases (left), on correct test cases (middle) and on incorrect test cases (right) of the chunking methodology

Table 4 CodeBLEU by difficulty level (diff.) across all approaches

CodeGenC demonstrates average improvements of \(1.6\%\) and \(48.6\%\) over CodeGenE and CodeGen (i.e. the baseline model), respectively. Particularly, on test cases of medium-high difficulty level, CodeGenC outperforms CodeGen by \(21.1\%\), whereas CodeGenE improves the baseline performance by \(12.5\%\) (Table 4, columns 2–4, diff.3).

The models CodeGenC and CodeGenE exhibit similar trends in their CodeBLEU scores, with a rapid downward transition from difficulty levels 0 to 4, representing the shift from code generation with correction information to code generation without it (Fig. 6a). Both models significantly outperform the baseline model by a factor of 2.2 at difficulty level-0, where the NL query exists in the correction data-store (Table 4, columns 2–4). Their performance then converges to the baseline’s at difficulty level-4.

In contrast, the standalone code generator (i.e. CodeGen) shows slight improvements from difficulty levels 0 to 3 but a decline at difficulty level-4 (Fig. 6a). Overall, CodeGen performs worse than the other models, except at difficulty level-4, where it slightly exceeds CodeGenC by \(2.4\%\) and lags behind CodeGenE by \(1.3\%\) (Table 4, columns 2–4).


To gain insights into the behavior of CodeGenE and CodeGenC models, and understand the factors contributing to performance differences, we conduct a detailed analysis for difficulty levels 1 to 4 on correct and incorrect chunking cases.

CodeGenC obtains accurate results on \(88.3\%\) of all test cases and consistently outperforms the other models across difficulty levels 1 to 3 (Fig. 6b). Particularly, in the case of medium-low difficulty, where the single-chunk input NL query is similar to single-chunk queries in the correction data-store, CodeGenC surpasses CodeGen by a factor of 1.9 and slightly improves upon CodeGenE by \(5.3\%\) (Table 4, columns 5–7, diff.1). The latter improvement is attributed to CodeGenE occasionally omitting the syntax or identifier names of similar queries. The first four rows of Table 5 present an example of this situation. The target code contains an assignment followed by a return statement, and utilizes variable names like df and input_file. While CodeGenE disregards this information, CodeGenC successfully integrates the suggested syntax and identifier names from the similar query.

Table 5 Examples of CodeGenE overlooking or getting confused by extra information

Difficulty level-2 expresses cases where the single-chunk NL query is associated with multi-chunk queries in the data-store, or each chunk of the multi-chunk NL query resembles single-chunk queries in the data-store. At this level, our model persistently excels over CodeGen by a factor of 1.4 and achieves a slight advantage of \(5.2\%\) over CodeGenE (Table 4, columns 5–7). The latter increment results from CodeGenE getting confused by extra information from similar queries. Rows 5–8 of Table 5 illustrate an example of this case. While CodeGenC achieves identical syntax and variable names, the code generated by CodeGenE includes a redundant statement due to the additional chunk from the similar query (e.g. “get a number as kilometers from users”).

Notably, at the medium-high difficulty level, where each chunk in the multi-chunk NL query is similar to multi-chunk queries in the data-store, CodeGenC shows a \(23.1\%\) increase over CodeGen and an \(11\%\) improvement upon CodeGenE. The latter discrepancy arises because the additional information from similar queries applies to only some chunks in the input NL query. The bottom part of Table 5 provides an example of this instance. The similar query “add two numbers, and then print the result” provides information that pertains to only the first and third chunks in the input query, which causes missing code lines in the snippet produced by CodeGenE. Meanwhile, CodeGenC overcomes this issue since it derives the final code from sub-snippets of each chunk in the NL query.


Around one tenth of all test cases are classified as inaccurate chunking results. On these test cases, our model outperforms the baseline model by an average of \(15\%\), but lags behind the CodeGenE model across the difficulty levels (Fig. 6c). At difficulty level-1, CodeGenC surpasses CodeGen by a factor of 2.0, while experiencing an \(8.3\%\) decrease compared to CodeGenE (Table 4, columns 8–10).

This reduction is attributed to three factors. Firstly, CodeGenC composes function names from the verbs and direct nouns in the NL query, which may not always align with developer preferences. Secondly, the cosine-similarity-based validation of query similarity occasionally accepts similar queries in an unexpected KNN order. Lastly, for simplicity, our model currently does not handle the auto-detection of specific values from similar queries (e.g. two queries are similar but contain different numbers or strings). In contrast, CodeGenE, which is derived from the Large Language Model GPT-3.5-Turbo-0301, has inherent advantages in pure NLP tasks. Additionally, CodeGenE benefits from the target code being obtained through GitHub Copilot, a predecessor of GPT-3.5-Turbo-0301.

Analogously, the decrease of CodeGenC compared to CodeGenE at difficulty level-2 (by \(26.5\%\)) and level-3 (by \(11.2\%\)) is attributed to NLP-related challenges. Tasks at these two levels include finding correct KNNs, accurately extracting the most similar chunks from similar queries, and properly mapping NL chunks to their relevant code snippets. CodeGenC relies on rule-based approaches for these tasks, which face limitations in NLP. Ultimately, at the high difficulty level, CodeGenC slightly underperforms the other models with a \(5.7\%\) reduction, primarily due to the discussed naming convention for functions. It is worth noting that such an intensive inspection of incorrect cases is not feasible for the CodeGen and CodeGenE models because of their unpredictable nature.

Overall, our model, CodeGenC, demonstrates competitive performance compared to other models, despite the challenges encountered in NLP tasks. In contrast to generative AI models, our methodology offers straightforward and interpretable approaches for generating the final code, enabling thorough analysis of unexpected results and facilitating insights for potential improvements. Additionally, utilizing the explicit mapping between generated code snippets and NL chunks in a graphical user interface can simplify assessment of suggested code for users (see Sect. 7).

The extensive analysis of evaluation results on the entire test case dataset, spanning various difficulty levels, provides valuable information to answer the first research question introduced in Sect. 4.1.


5.2 Ablation study

We continue analyzing the evaluation results under two aspects: (i) complexity level and (ii) correct outcome ratio.

5.2.1 Complexity level

To study the significance of user feedback and the influence of each state of the correction data-store on generated code, we categorize the test results by complexity level. Each level describes the components required to attain the final code. These levels, ranging from 0 to 4, represent a spectrum from low to high complexity, determined by the states of the correction data-store (Sect. 4.2).

For example, on test cases of low complexity, the NL query exists in the correction data-store, requiring only the data-store component to obtain the final code. Complexity levels 1 and 2 represent situations where the code generator is activated due to an empty correction data-store or no matching queries in the data-store, respectively. Further details for each complexity level are provided in Table 6.

Table 6 Definitions for complexity levels (comp.)

Figure 7 depicts the CodeBLEU scores of our chunking methodology (CodeGenC) by complexity level, divided into three groups: (i) all test cases, (ii) correct chunking cases, and (iii) incorrect chunking cases.

Fig. 7 CodeBLEU scores by complexity level for our CodeGenC model


The evaluation results show a near-perfect score of 99.9 at the low complexity level, indicating the presence of the NL query in the correction data-store. However, at the medium-low and medium complexity levels, the test cases receive the lowest scores, regardless of correctness, with decrements of \(34.8\%\) and \(26\%\) compared to the average score across all test cases. This decline is attributed to the absence of user feedback for the NL query in the data-store.

Additionally, medium-high complexity test cases slightly surpass the high complexity ones by \(11.4\%\). This can be explained by the increased complexity associated with generating the final code at the high level. For test cases at medium-high level, CodeGenC utilizes various components including the correction data-store, similarity validation between queries, and, if necessary, the NL-to-Code generator to attain the final code. Meanwhile, the high complexity cases require an additional component, namely the NL-Code mapping, to construct the code by identifying suitable code snippets for each chunk in the query.

To examine the influence of user feedback on the CodeGenE model, we compare its CodeBLEU scores across complexity levels for all test cases (Table 7). The results align with the analysis of CodeGenC model discussed earlier. Specifically, complexity levels 1 and 2 encounter the lowest scores, with decrements of \(32.7\%\) and \(25.4\%\) from the average, respectively. Conversely, test cases with low complexity persistently achieve almost the perfect score. In addition, complexity levels 3 and 4 both exceed the average with increments of \(10.7\%\) and \(2\%\), respectively.

Table 7 CodeBLEU by complexity level across all test cases for CodeGenE model
Table 8 Correct outcome ratio and CodeBLEU for each model over all test cases

Ultimately, although we aim to mimic the structure and identifier names of the corrected code when constructing the final code, the validity of the generated code is also an important metric. In the next sub-section, we inspect the obtained code by its execution output.

5.2.2 Correct outcome ratio

As mentioned in Sect. 5.1, for simplicity, CodeGenC composes function names based on the input NL query instead of using LLMs, as CodeGen and CodeGenE do. Consequently, using exact match or accuracy is inappropriate for the evaluation. Instead, we manually examine each obtained code snippet and validate whether it yields the correct output after execution. The percentage of accurate outputs over all test cases forms the correct outcome ratio.


Table 8 displays the ratios for all models alongside their CodeBLEU scores. Overall, CodeGen achieves the highest percentage, but the disparities between models are small. In particular, CodeGen exceeds CodeGenE and CodeGenC by \(2.5\%\) and \(4.2\%\), respectively. In contrast, CodeGen attains the lowest CodeBLEU score, lagging behind CodeGenE and CodeGenC by \(27.0\%\) and \(27.7\%\), respectively. In other words, the standalone AI code generator may yield the correct output, but its generated code lacks substantial alignment with user suggestions.

Notably, Table 8 reveals minimal differences between CodeGenE and CodeGenC when aggregating CodeBLEU scores across all test cases. Nevertheless, at specific difficulty and complexity levels (discussed above), CodeGenC remarkably outperforms CodeGenE, underscoring the importance of our refined evaluation in such instances.

Additionally, Fig. 8 presents a further analysis of the correct outcomes. When CodeGenC obtains accurate results from constructed code snippets, CodeGen and CodeGenE still produce \(5.4\%\) and \(9.3\%\) inaccurate outcomes, respectively. Notably, in cases where CodeGen produces incorrect outcomes, CodeGenE and CodeGenC rectify \(60\%\) and \(63.3\%\) of them, respectively, turning them into correct ones.

Fig. 8 Correct outcome ratio for specific cases over all test cases

In simpler terms, generative models with extra information from user feedback can correct some cases that would otherwise yield incorrect outcomes. The evaluation results clearly indicate the advantages of user feedback for NL-to-Code translation models, even in the absence of explicit re-training. The last two findings address our second research question outlined in Sect. 4.1.


Ultimately, although prompting techniques are not the main focus of our work, it is essential to assess whether an LLM with our chunking strategy integrated into the prompt can outperform our proposed model. The next subsection addresses this matter and presents the results.

5.3 LLM involvement

Primary goal reiteration. It is worth emphasizing again that, besides the goal of integrating user feedback into generative AI models without re-training, we aim to ensure model interpretability throughout all steps for developers (mentioned in Sects. 3.2 and 3.3). The latter also enables a thorough analysis of incorrect outcomes. In addition, we aim to explore the potential enhancement of an NL-to-Code model through a simplified approach. Consequently, we refrained from using complex LLMs for query decomposition and chunk-to-sub-snippet mapping, and only employed them for code generation.

Furthermore, given that the prompting technique can influence the quality of results in generative AI models (mentioned in Sect. 1), our approach is to assist users with standard, straightforward prompts, delegating strategy planning and reasoning to the underlying mechanism. Moreover, most generative AI models impose constraints on prompt length or context window size (i.e. the number of tokens processed simultaneously), restricting the integration of historical corrected codes.

LLM with chunking instruction. However, to complement the preceding evaluation, we conducted an additional experiment utilizing GPT-3.5-Turbo-0125Footnote 31 for translating NL queries to Python, incorporating our decomposition strategy as task descriptions. This experiment assessed the effectiveness of the employed models in analyzing query chunks and NL-code mapping. Consequently, we only considered scenarios with multi-chunk queries and a non-empty correction data-store from the collected test cases (i.e. \(39.4\%\) of the total cases).

Listing 3 displays the query template utilized in this experiment. Queries from the correction data-store, serving as user-approved cases, are appended after the input NL query. The chunking strategy is outlined in lines 5 to 9. We denote the model utilizing GPT-3.5 alongside our chunking strategy as GPT35Prompt.

Listing 3 Query template for GPT35Prompt
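Since Listing 3 is not reproduced verbatim here, the sketch below is a hypothetical reconstruction of how such a prompt could be assembled: the chunking strategy is stated as task instructions and the user-approved queries from the correction data-store are appended after the input NL query. The instruction wording and helper names are assumptions, not the exact template.

# Hypothetical sketch of a GPT35Prompt-style prompt (not the exact Listing 3).
def build_prompt(nl_query, approved_examples):
    instructions = (
        "Translate the request into a single Python function.\n"
        "1. Split the request into chunks, one per sub-task.\n"
        "2. For each chunk, reuse the code of a similar approved query if one exists.\n"
        "3. Keep the identifier names used in the approved code.\n"
        "4. Combine the sub-snippets into the final function.\n"
    )
    examples = "\n".join(
        f"Approved query: {query}\nApproved code:\n{code}\n"
        for query, code in approved_examples
    )
    return f"{instructions}\nRequest: {nl_query}\n\n{examples}"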

Result assessment. Table 9 presents the correct outcome ratios and CodeBLEU scores of all models across test cases featuring multi-chunk queries and a non-empty correction data-store. The results reveal that GPT35Prompt underperforms the other models in terms of correct outcome ratio, lagging behind CodeGen, CodeGenE, and CodeGenC by \(20.9\%\), \(14.5\%\), and \(14.6\%\), respectively. In terms of CodeBLEU score, GPT35Prompt only surpasses CodeGen, by \(10.9\%\), while lagging behind CodeGenE and CodeGenC by \(8.5\%\) and \(10.3\%\), respectively.

Table 9 Correct outcome ratio and CodeBLEU for each model over test cases with multi-chunk queries and non-empty correction data-store

Further analysis, depicted in Fig. 9, confirms the inferior performance of GPT35Prompt compared to the other models. Specifically, where CodeGenC (our approach) achieves correct outcomes, CodeGen, GPT35Prompt, and CodeGenE still produce \(8.6\%\), \(29.3\%\), and \(16.4\%\) incorrect outcomes, respectively. In cases where CodeGen produces incorrect results, GPT35Prompt, CodeGenE, and CodeGenC rectify \(56.2\%\), \(68.7\%\), and \(75\%\) of these instances.

Fig. 9 Correct outcome ratio for certain cases with GPT35Prompt taken into account

Brief analysis. Although the underlying LLMs of GPT35Prompt and CodeGenE differ only slightly (GPT-3.5-Turbo-0125 vs. GPT-3.5-Turbo-0301), the two models employ distinct prompting templates, leading to notable disparities in generating accurate final codes. This underscores the significance of prompting techniques for result quality. However, comparing prompting techniques is beyond the scope of our study.

We briefly examined the failed cases of GPT35Prompt and discovered that it exhibits shortcomings similar to those of CodeGenE (e.g. overlooking or becoming confused by additional information, as shown in Table 5). Additionally, \(55.1\%\) of the incorrect outcomes stem from GPT35Prompt generating code that calls functions defined in queries from the correction data-store without including these function definitions in the final code. Even after adjusting the prompt template in Listing 3 to explicitly address this issueFootnote 32, the incorrect outcomes persist.
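For illustration, a hypothetical instance of this failure mode might look as follows; the query, function names, and code are invented for the example.

# Hypothetical example of the described failure: the generated code calls
# `multiply_list`, a function defined only in a correction data-store query,
# but its definition is never included, so the snippet raises a NameError.
def product_of_squares(numbers):
    squares = [n * n for n in numbers]
    return multiply_list(squares)   # NameError: multiply_list is not defined here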

It is worth noting that we consider NL-to-Code generation individually for each query. The corrected codes refer to preceding corrections, but these are not available to users at the moment the prompts are executed. An enhancement for this matter is discussed in Sect. 6.2. For simplicity, we exclude the analysis of GPT35Prompt results by individual difficulty and complexity levels.

Ultimately, we anticipate that advanced prompting techniques, such as chain-of-thought (Wei et al 2022) and tree of thoughts (Yao et al 2023), could improve the LLM outcomes. Nonetheless, despite detailed strategy descriptions, the inherent black-box nature of LLMs still hinders a thorough analysis of unexpected results, making it challenging to pinpoint which step in the strategy description causes the failed cases.

6 Discussion

In this section, we discuss threats to the validity of our experiments, as well as challenges and potential enhancements for our methodology.

6.1 Threats to validity

We analyze the threats to the validity of our work as follows:

Test suite. A custom test suite was developed for the experiments due to the absence of a suitable existing one. Though our dataset is not as extensive as those used for AI model training, it sufficiently demonstrates the utility of our methodology. Nevertheless, evaluation on an official benchmark would further strengthen the assessment of the proposed approach. In future work, we intend to incorporate more complex test cases, possibly by refining Q&As from programming forums. Furthermore, the lack of token probability logs (as provided by the Codex model) in the responses of the ChatCompletion feature (GPT-3.5-Turbo-0301) raises questions about whether the code returned in the first response is the most probable one.

Language specificity. The algorithm for mapping NL chunks and code snippets in the Code building step is currently implemented exclusively for Python. However, the identification of code token types is based on AST analysis and token relationships, which vary only slightly across programming languages. Moreover, the algorithm focuses on critical token types shared among programming languages, such as variable definition and usage. Determining these token types in other languages (e.g. Java) is even less complicated than for Python, due to Python's dynamic typing. Therefore, we anticipate that our results will be applicable to other programming languages. Additionally, it is worth mentioning that the parser used in the Query chunking step is specific to the English language. Nonetheless, multilingual NLP is outside the scope of this paper.
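As a minimal sketch of how such token types can be derived for Python, the snippet below uses the standard ast module, which marks variable definitions with a Store context and usages with a Load context; the function name is illustrative.

import ast

# Minimal sketch: extract variable definitions (Store) and usages (Load)
# from a Python snippet via AST analysis.
def variable_defs_and_uses(code: str):
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                uses.add(node.id)
    return defs, uses

# Example: ({'total'}, {'a', 'b'})
print(variable_defs_and_uses("total = a + b"))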

Model comparison. Our experiments employ GPT-3.5-Turbo-0301, which has demonstrated significant advancements in NLP tasks. However, since it is a beta version and subject to frequent updates, minor adjustments may be necessary to accommodate changes in its APIs. Furthermore, due to the lack of directly comparable models, we compare our methodology against extended input queries on GPT-3.5-Turbo-0301. We expect that comparisons with other approaches that utilize a chunking method would further validate the concept of our methodology.

Evaluation metrics. Besides manually examining the validity of the generated code, we adopt CodeBLEU as the evaluation metric due to its popularity in code generation models. Although ChrF has been proposed as an alternative (Evtikhiev et al 2023), it does not fully consider the specifics of working with source code. As our experiments prioritize the syntax of the generated code (as discussed in Sect. 4.3), CodeBLEU with the mentioned settings remains suitable for our purposes.

6.2 Challenges and potential enhancements

Given the novelty of our proposed methodology, we outline below the challenges encountered while developing the approach, alongside potential improvements that can make our concept applicable to more intricate use cases.

6.2.1 Scalability support

Multi-users and large datasets. To illustrate the utility of our methodology, we collected user feedback in a dictionary, with the embedding values of the input queries as keys and the corrected code snippets as corresponding values. Similar queries for each input are then retrieved using the KNN technique, by comparing the similarity between the input and all existing queries in the data-store. This simple setup serves its purpose of exhibiting the advantages of integrating user feedback into generative AI models without re-training. However, adapting the method to multi-user systems and large datasets necessitates upgrading the correction data-store structure.
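The following is a minimal sketch of such a data-store with KNN retrieval by cosine similarity; the embed callable stands for any sentence-embedding function, and the k and threshold values are assumptions rather than the paper's settings.

import numpy as np

# Minimal sketch of the correction data-store described above: query embeddings
# as keys, corrected snippets as values, and a KNN lookup by cosine similarity.
class CorrectionDataStore:
    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (embedding, nl_query, corrected_code)

    def add(self, nl_query, corrected_code):
        self.entries.append((self.embed(nl_query), nl_query, corrected_code))

    def nearest(self, nl_query, k=3, threshold=0.8):
        q = self.embed(nl_query)
        scored = []
        for emb, text, code in self.entries:
            sim = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
            scored.append((sim, text, code))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(text, code) for sim, text, code in scored[:k] if sim >= threshold]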

In particular, users usually follow their own naming patterns for identifiers (while adhering to coding conventions), which requires correction information to be stored separately for individual users or shared only within user groups. Furthermore, a function generated from an input query can be adopted multiple times at different locations within a program, each with a distinct set of variable names. Consequently, various versions of function customization should be stored, instead of keeping a single record for each query and overriding previous corrections.

Dynamic Sparse Distributed Memory. The presence of numerous users can result in data expansion, necessitating scalability features in the correction data-store architecture. To address this, a potential solution is to employ Dynamic Sparse Distributed Memory (DSDM), introduced by Pourcel et al (2022) as an extension of Sparse Distributed Memory (Kanerva 1992).

DSDM begins with an empty memory space and incrementally adds new address nodes based on the input patterns, a dynamic write radius, and the current state of the memory space. Query content is retrieved from specific memory nodes using a softmin function that considers the distance between the query and the other query addresses. Integrating DSDM into the One-shot Correction approach may enhance the correction data-store's capacity and mitigate scalability challenges.
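The sketch below illustrates a softmin-style read in the spirit of DSDM: stored contents are blended with weights that decrease as the distance between the query address and each node address grows. The beta parameter and the array layout are illustrative assumptions, not values from Pourcel et al (2022).

import numpy as np

# Softmin-weighted read over memory nodes (illustrative sketch, not DSDM itself).
# node_addresses: (N, d) array, node_contents: (N, c) array.
def softmin_read(query_address, node_addresses, node_contents, beta=1.0):
    distances = np.linalg.norm(node_addresses - query_address, axis=1)
    weights = np.exp(-beta * distances)   # closer nodes receive larger weights
    weights /= weights.sum()
    return weights @ node_contents        # weighted combination of stored contents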

6.2.2 Flexible rule selection for code building

Although we deployed a configuration file (outlined later in Listing 4) to centrally manage the rules for refining sub-snippets, the inclusion of rules for renaming identifiers, determining parameters for the final code, and handling multi-input queries would be beneficial. Moreover, a flexible selection mechanism for these rules should be employed, based on the input query and the corrected codes from similar queries.

Identifier renaming For instance, when renaming identifiers within combined sub-snippets by prioritizing the last statement (Sect. 3.4), situations arise where the final code lacks the desired names compared to the corrected code. This occurs because the desired names initially appear at the top of the statement list but are subsequently replaced by identifier names from the statements below. Hence, the renaming rule (top-down or bottom-up) should be activated flexibly, based on the positions of the chunks in the input query that receive similar queries from the correction data-store, as sketched below.
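A minimal sketch of such a selectable rule follows. It assumes each sub-snippet is summarized by the identifier it defines and the identifier it expects from its predecessor, which is a simplification of the actual code-building step; the data shape and field names are assumptions.

# Illustrative sketch of a direction-selectable renaming rule (not the paper's
# implementation). Each sub-snippet is summarized as a dict with the identifier
# it defines ("defines") and the identifier it expects from the previous
# sub-snippet ("uses").
def unify_identifiers(sub_snippets, direction="bottom_up"):
    renames = {}
    for prev, curr in zip(sub_snippets, sub_snippets[1:]):
        if direction == "bottom_up":
            # prioritize the later statement: its expected name wins
            renames[prev["defines"]] = curr["uses"]
        else:
            # top-down: the earlier statement's defined name wins
            renames[curr["uses"]] = prev["defines"]
    return renames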

Furthermore, to exemplify the proposed chunking concept, we streamlined the renaming process by assuming that identifiers defined in one statement are directly utilized in the subsequent statement. A potential enhancement to relax this assumption involves (i) preserving the data-flow of each variable in every code snippet, (ii) analyzing the purpose of each variable definition and usage, and (iii) bridging the data-flow gaps between code snippets. These steps may require the NL chunks, their associated code snippets, and the input query as inputs, suggesting the need for a more intricate rule or approach.

Parameter determination As we aim to generate final codes comprising code snippets enclosed within a function definition, together with the requisite import statements, the current parameter identification rule for the final function suffices to illustrate the method's concept. However, if the input query requests multiple functions or omits this requirement, the rule should be adjusted accordingly; this is technically feasible by identifying the scope of variables in addition to their definitions and usages.
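As an approximation of such a rule, the sketch below treats names that are read but never assigned (and are neither builtins nor imported modules) as parameters of the final function; this is an illustrative heuristic, not the paper's exact rule.

import ast
import builtins

# Illustrative parameter-determination heuristic: free variables of the combined
# snippet become parameters of the final function.
def infer_parameters(code: str):
    assigned, loaded, imported = set(), set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            (assigned if isinstance(node.ctx, ast.Store) else loaded).add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            imported |= {alias.asname or alias.name.split(".")[0] for alias in node.names}
    return sorted(loaded - assigned - imported - set(dir(builtins)))

# Example: ['prices'] — the only free variable becomes a parameter.
print(infer_parameters("total = sum(prices)\nprint(total)"))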

Multiple input queries Finally, our proposed approach currently addresses NL-to-Code cases individually, as depicted in the GUI in Sect. 7. However, when applying this method to a code file containing existing NL queries and their relevant code snippets, or when dealing with inputs featuring multiple NL queries, previously generated code should be taken into account when constructing the outcomes.

In such instances, a rule should prioritize suggested code snippets that use functions defined by prior queries over code snippets that redefine these functions. Preceding queries and their codes can be directly injected into the input query, forming a multi-turn programming pipeline similar to the one described by Nijkamp et al (2022).

7 One-shot correction GUI

In this section, we briefly introduce our preliminary GUIFootnote 33 built on the One-shot Correction methodology. The GUI exhibits the practicality of our proposed concept in simplifying code customization and assessment for users. The main features of the GUI are demonstrated with examples in Appendix A.1.

We draw inspiration from the work of Su et al (2018) on building an application with fine-grained user interaction for code modification. For each code token in a returned code, we determine its token type and a list of alternative values, extracted from other suggested codes for the same token type. Figure 10 presents the general scenario of using the GUI.

Fig. 10 General scenario of using the One-shot Correction GUI

After initiating a search with an input NL query, users can perform the following actions: (1) choose the displayed code from a list of returned code snippets, (2.1) select a code token under Suggested code by clicking on it and (2.2) change its value using the list of substitute values, (3) type a new value for the code token if the preferred value is not in the list from step (2.2), (4) directly modify the code if restructuring is necessary, and (5) save the modification for subsequent inquiries.

By default, user modification is integrated with both options, GPT-3.5 and One-shot Correction, which correspond to the CodeGenE and CodeGenC models mentioned in the previous sections. Deselecting these options results in a code snippet produced solely by the CodeGen model (i.e. without user feedback). In addition, for each code token, we provide its token type as extra information for users.

Notably, the highlight matching option associates input query chunks with the sub-snippet(s) of the displayed code in the One-shot Correction case. For the other cases (i.e. the standalone code generator and extended input), the whole input query and its code are marked without separation (see Appendix A.1). We expect that this explicit mapping can help users comprehend and validate the generated code.

Additionally, by modifying the configuration file (Listing 4), users can manipulate the state of the correction data-store (line 8), filter important code token types (lines 9–11), and adjust hyperparameters used in each model (lines 2–5). We published these setting values together with our source code.Footnote 34

Listing 4 Configuration file

In particular, possible values for corrt_ds include "all" (all gathered queries), "all_x" (the collection of all x-chunk queries, \(x \in [1, 2, 3]\)), "all_x_excl" (all x-chunk queries excluding the current target query), and "task_x_y" (the x-chunk query with index y). Appendix A.1 presents an example of code generation with two different states of the correction data-store. Furthermore, to prioritize specific token types, users can simply enable or disable the corresponding flag of each token type (Listing 4, line 11). These types are determined based on a study by Le et al (2023).
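As an illustration of how these values could be interpreted, the sketch below resolves a corrt_ds setting against the collected queries; the data structure all_queries (a mapping from a (chunk count, index) pair to an NL query) and the function name are assumptions for this example, not part of Listing 4 or the released code.

# Illustrative sketch: resolving the documented `corrt_ds` values against the
# collected queries. `all_queries` maps (chunk_count, index) -> NL query.
def resolve_corrt_ds(value, all_queries, current_query=None):
    if value == "all":
        return list(all_queries.values())
    if value.startswith("all_") and value.endswith("_excl"):
        x = int(value.split("_")[1])
        return [q for (chunks, _), q in all_queries.items()
                if chunks == x and q != current_query]
    if value.startswith("all_"):
        x = int(value.split("_")[1])
        return [q for (chunks, _), q in all_queries.items() if chunks == x]
    if value.startswith("task_"):
        _, x, y = value.split("_")
        return [all_queries[(int(x), int(y))]]
    raise ValueError(f"Unknown corrt_ds value: {value}")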

8 Conclusions

We proposed a methodology named One-shot Correction to incorporate user feedback into generative AI models without re-training. The evaluation results illustrate competitive performance compared to other models, despite the challenges inherent in NLP tasks. Our methodology enables a thorough examination of unexpected results through straightforward approaches and facilitates insights for potential improvements. Additionally, we demonstrated that user feedback significantly enhances code translation models without re-training. We published the test suite used in our experiments, the evaluation results, and the source code of the methodology.Footnote 35 A preliminary GUI with fine-grained user interaction for code modification was also implemented to demonstrate the utility of our proposed approach in practice. Further work encompasses extending the method to other programming languages and large datasets, which includes upgrading the correction data-store structure for scalability (e.g. using Dynamic Sparse Distributed Memory). Furthermore, exploring flexible rule selection at each step of the methodology for complex inquiries is a promising direction.