1 Introduction

The widespread adoption of generative AI models has facilitated a broad range of natural language tasks (Gozalo-Brizuela and Garrido-Merchan 2023). As a result, the integration of Natural Language to Code (NL-to-Code) translation has become a sought-after feature in many code-centric tools (Xu et al 2022). Notable examples include GitHub Copilot,Footnote 1 TabNine,Footnote 2 Amazon CodeWhisperer,Footnote 3 and ChatGPT.Footnote 4

There are divergent opinions regarding the advantages and security of these tools, specifically GitHub Copilot. Various studies and surveys have shown the benefits of Copilot in assisting developers (Bird et al 2022; Vaithilingam et al 2022; Dakhel et al 2023). However, another empirical study (Imai 2022) reveals that Copilot increases productivity but lowers code quality. Concerns have been raised about Copilot recommending code that relies on non-existing helper functions or undefined variables (Nguyen and Nadi 2022). Several studies focus on the vulnerability and security of code generation tools (Pearce et al 2022; Asare et al 2022), as well as the uncertainty surrounding the licensing of generated code (Bird et al 2022).

Notably, a study by Bird et al (2022) reveals that developers devote more time to reviewing AI-generated code than to writing it. This emphasizes the need to aid developers in better understanding and evaluating the generated code, which is hindered by the unpredictable behavior of AI models. Furthermore, validating the code’s origin is essential, but this is currently limited to a single IDE development session.

Additionally, users do not always obtain the desired code when using generative AI tools. One of the primary factors is the quality of the prompt, which involves natural language features such as implication, association, and ambiguity (Reynolds and McDonell 2021). Interactive programming has garnered considerable attention as one of the prominent approaches to tackle these issues of natural language (Shin and Nam 2021; Heyman et al 2021; Schlegel et al 2019; Elgohary et al 2021; Su et al 2018; Cai et al 2023).

The concept entails users engaging with models in an iterative manner through low-code approaches until they attain the desired outcome. However, despite leveraging user feedback, its persistence is confined to a single conversational session (Bird et al 2022). This limitation arises from the inherent properties of generative AI models, which require explicit re-training to integrate new data or feedback from users. Figure 1 illustrates a simple scenario of interactive programming, underlining the issue of recalling cross-session user feedback in a current generative AI model.

Fig. 1 A simple scenario of interactive programming, highlighting the issue of utilizing user feedback across sessions

In Fig. 1a, users initially submit the NL query “Function adds two numbers” and receive a Python code snippet representing the add function. Subsequently, users request to rename the function from add to sum and proceed with further unrelated queries. After several inquiries in a new session, users once again input the same query, “Function adds two numbers”. The question arises as to whether the model should return the function named add or the one named sum. We applied this scenario to ChatGPT,Footnote 5 one of the recent prominent tools, and obtained the results illustrated in Fig. 1b. Even though the user corrected the function name in Session 1 (i.e. add_numbers to cal_sum), ChatGPT still returns the original code snippet (i.e. the function named add_numbers) in Session 2. In practice, repeatedly modifying generated code (e.g. renaming, restructuring) can be inefficient and frustrating.

In this paper, we propose a methodology to address the aforementioned challenges by developing a user-feedback-driven NL-to-Code translation model. This method requires no additional training. We aim to provide interpretable, straightforward approaches to enable comprehension of code provenance and facilitate thorough analysis of unexpected results. Our contributions are as follows:

  • A One-shot Correction methodology. We introduce an approach to integrate user feedback into generative AI models without re-training while supporting intensive inspection of incorrect outcomes. An additional memory for user feedback and k-Nearest Neighbors methods are employed to accumulate and retrieve correction information across sessions. To tackle the code’s origin issue, we adopt techniques from decomposition in problem solving. Each natural language query is divided into segments, and the final code snippet is constructed from the sub-snippets obtained for each segment. The NL-to-Code translation of each query chunk is performed through either the additional memory or a generative AI model.

  • A prototype and an extensive comparison. To illustrate the utility of our methodology, we deploy a prototype of One-shot Correction based on GPT-3.5-Turbo-0301 modelFootnote 6 and conduct an extensive comparison between the code generated by GPT-3.5-Turbo-0301 and our prototype. The evaluation results justify the concept of our methodology and provide insights into the behavior of all models when combined with user feedback.

  • A Graphical User Interface application. We develop a preliminary GUI to exhibit the benefit of using the One-shot Correction in customizing and interpreting the generated code snippets. Users can convert a natural language query to a Python function, modify identifier names in the produced code snippet, and preserve the correction information for future reference.

  • Source code and data. To facilitate reproducibility and reuse of our methodology, we publish our source code, along with test suites and evaluation results.Footnote 7

The rest of the paper is organized as follows: Sect. 2 provides an overview of the background and related work. Our methodology is described in Sect. 3. Section 4 presents our experiments in detail, while the evaluation results are analyzed in Sect. 5. We discuss threats to validity and potential enhancements in Sect. 6 and introduce our GUI application in Sect. 7. Finally, we conclude in Sect. 8.

2 Background and related work

In this section, we provide a brief introduction to the relevant background and related work in our study.

2.1 Generative artificial intelligence for code

Generative Artificial Intelligence (AI) encompasses models capable of producing novel content across various formats, including text, image, video, or audio (Gozalo-Brizuela and Garrido-Merchan 2023). In the context of code generation, generative AI leverages the naturalness hypothesis (Hindle et al 2016; Sun et al 2022; Weisz et al 2022), which posits that software can be viewed as a form of human communication. Consequently, techniques applicable to natural language can also be employed for code generation.

Various approaches, spanning from probabilistic (Bielik et al 2016; Li et al 2017; Schumacher et al 2020) to Machine Learning (Kim et al 2021; Svyatkovskiy et al 2020), have been proposed to validate this hypothesis. The Transformer (Vaswani et al 2017) has emerged as the dominant architecture, serving as the foundation for notable models like PLBART (Ahmad et al 2021), CodeBERT (Feng et al 2020), Codex (Chen et al 2021), AlphaCode (Li et al 2022), and GPT-3.5. These models support a wide range of code-related tasks, including code summarization, code translation across programming languages (Ahmad et al 2021), code documentation generation (Feng et al 2020), and code auto-completion based on comments or existing code (Chen et al 2021; Ahmad et al 2021), and can even challenge humans in programming competitions (Li et al 2022).

Several methods have been proposed to enhance generative AI models. These approaches involve expanding input queries with contextual information, such as code token types (Izadi et al 2022), preceding task queries and code outputs (Nijkamp et al 2022). Other methods involve integrating a supplemental retriever (Lu et al 2022; Parvez et al 2021) or incorporating an additional memory component (Wu et al 2022; Fan et al 2021; Khandelwal et al 2020). However, these approaches either require training or overlook the potential of leveraging user feedback as a valuable resource. Furthermore, the non-deterministic and unpredictable nature of the underlying AI model restricts in-depth analysis of their unexpected behaviors (Bird et al 2022).

2.2 Interactive programming

The quality of the natural language (NL) prompt significantly impacts the accuracy of NL translation models (Reynolds and McDonell 2021). Various techniques have been proposed to address the ambiguity of NL and bridge the gap between NL and programming languages. These include heuristic methods, semantic parsing (Shin and Nam 2021), and interactive programming, which has gained notable attention as a prominent approach (Heyman et al 2021).

Methods supporting user interaction comprise binary validation of target results (Iyer et al 2017), multiple-choice questions (Gür et al 2018), selection from a list of options (Schlegel et al 2019), decomposition and modification of sub-components using predefined values (Su et al 2018), feedback through NL queries (Elgohary et al 2021), and workflow updates instead of direct result modification (Cai et al 2023). Although user feedback has been shown to be advantageous in these studies, its persistence is limited to a single interaction session.

2.3 Decomposition in problem solving

Our methodology draws inspiration from a widely known heuristic in problem solving, which involves the decomposition of a problem into manageable sub-problems (Egidi 2006). This approach is valuable in software development (Charitsis et al 2022), and particularly in working with generative AI models (Barke et al 2023). Recent studies aim to elicit the decomposition ability of AI models by enhancing the prompt with a series of intermediate NL reasoning steps, namely chain-of-thought (Wei et al 2022), tree of thoughts (Yao et al 2023), and plan-and-solve prompting (Wang et al 2023). However, due to the unpredictable nature of AI models, it remains challenging to determine which steps described in the prompt contribute to unexpected results.

In addition, users commonly gain proficiency in a new programming language by initially acquainting themselves with basic functions and progressively advancing towards more intricate features (Carpenter 2021). Ordinarily, after decomposing a problem, users leverage acquired knowledge to resolve familiar sub-problems, and reserve the search for novel solutions solely for unfamiliar sub-problems. Our methodology aims to reflect this learning process in NL-to-Code translation models. In particular, we consider user feedback as knowledge that the translation model needs to remember following each interaction. When encountering a new NL query, the model is expected to identify the portions of the query that have been resolved and distinguish them from the segments that necessitate code generation. The resulting composition of sub-knowledge allows for in-depth analysis of which phrases in the query lead to unexpected answers.

2.4 Chunking in natural language processing

Shallow parsing, or chunking, involves dividing text into non-overlapping groups of syntactically or semantically related words (Abney 1992). It is widely used in Natural Language Processing (NLP) for various types of chunks, such as named entities, noun phrases, and verbal groups (Zhang et al 2002).

A reliable text chunker is crucial for extracting vital information from unstructured text, enabling detailed analysis in subsequent processing tasks. Different techniques, including grammar-based methods, statistical models, and machine learning approaches, have been developed for chunking tasks (Ramshaw and Marcus 1999; Mohapatra et al 2021). These approaches utilize features such as part-of-speech tags or grammar templates for training.

The rapid development of Large Language Models has spawned a substantial number of NLP libraries that cover a diverse array of tasks beyond chunking. Prominent libraries in this domain include NLTK,Footnote 8 CoreNLP,Footnote 9 scikit-learn,Footnote 10 and spaCy,Footnote 11 which is utilized in our experiments.

3 Approach

This section presents a thorough explanation of our methodology, including an overview of the One-shot Correction workflow, and descriptions of each primary component within the workflow.

3.1 General workflow

Figure 2 presents the general workflow of One-shot Correction methodology for NL-to-Code translation models with an illustrative example. Our methodology incorporates three components: (i) a correction data-store, (ii) an NL-to-Code generator, and (iii) a code builder.

Fig. 2 One-shot correction workflow for NL-to-Code translation models, exemplified with an illustrative example

The correction data-store collects user feedback paired with the corresponding NL queries. Meanwhile, the NL-to-Code generator is a code translation model that takes natural language queries as inputs and produces code snippets. The code builder is the key component designed to integrate correction information with the code generator model without requiring additional model re-training.

For each NL query, the code builder initially checks if the query already exists in the correction data-store. If it does, the code that was previously corrected by users in past conversations is retrieved and directly returned to the users. If it is the first time users inquire about this NL query, the query undergoes several processing steps before the final code snippet is assembled.

Initially, in the Query chunking step, the query is decomposed into chunks, with each chunk representing a single task or action. Subsequently, the code builder searches for potential code snippets associated with each NL chunk by accessing the correction data-store or utilizing the NL-to-Code generator (if the chunk has no similar stored queries). We call this step Sub-snippets retrieving/generating. Finally, in the Code building step, all the obtained code snippets are utilized to construct the final snippet before providing a reply to the user. If users make modifications to the generated code, the correction information is once again stored in the correction data-store before the next query is requested.

We demonstrate the result of each step using a typical example in Python. Assuming that the NL query is “add two numbers, and then print the result” and no prior modifications have been made by users, the query is decomposed into two chunks: “add two numbers” and “print the result”. In the subsequent step, the code builder retrieves the code snippet return num_1 + num_2 from the NL-to-Code generator for the chunk “add two numbers” since this chunk is not present in the correction data-store. Meanwhile, the snippet for the chunk “print the result”, i.e. print(result), is fetched from the data-store, supposing it was corrected by users in past conversations. Ultimately, in the Code building step, the two code snippets are combined to generate the response, as shown in Fig. 2.
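To make the workflow concrete, the following Python sketch outlines the code builder’s control flow under simplifying assumptions; the helper names (chunk_query, find_knn, extract_sub_snippets, generate_code, build_code) are illustrative placeholders for the components described in Sects. 3.2–3.4, not our actual implementation.

from typing import Callable, Dict, List

def translate(
    query: str,
    data_store: Dict[str, str],                 # NL query/chunk -> corrected code
    chunk_query: Callable[[str], List[str]],    # query chunking (Sect. 3.2)
    find_knn: Callable[[str, Dict[str, str]], List[str]],    # KNN lookup (Sect. 3.3)
    extract_sub_snippets: Callable[[str, List[str]], str],   # Algorithm 1
    generate_code: Callable[[str], str],        # NL-to-Code generator
    build_code: Callable[[str, List[str], List[str]], str],  # code building (Sect. 3.4)
) -> str:
    # 1. Exact hit: return the previously corrected code directly.
    if query in data_store:
        return data_store[query]
    # 2. Decompose the query into verb-noun chunks.
    chunks = chunk_query(query)
    # 3. Retrieve or generate sub-snippets for each chunk.
    sub_snippets = []
    for chunk in chunks:
        if chunk in data_store:
            sub_snippets.append(data_store[chunk])
        elif (neighbors := find_knn(chunk, data_store)):
            sub_snippets.append(extract_sub_snippets(chunk, neighbors))
        else:
            sub_snippets.append(generate_code(chunk))
    # 4. Order, refine, rename, and assemble the final function.
    return build_code(query, chunks, sub_snippets)

For readability, the sketch keys the data-store by query text; our experiments use embedding values as keys (Sect. 4.2).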

To illustrate the applicability of our methodology to different NL-to-Code translation models, we utilize existing NL-to-Code models instead of developing a new one. For simplicity, the correction data-store is structured as a dictionary, with the keys representing the embedding values of NL queries and the corresponding values indicating the corrected code. Further explanation on the NL-to-Code generator and the correction data-store employed in our experiments is presented in Sect. 4.2. The subsequent sub-sections delve into a comprehensive analysis of each main phase in the code builder component.

3.2 Query chunking

As mentioned in Sect. 2.4, text chunking entails grouping adjacent tokens in unstructured text into phrases based on their part-of-speech (POS) tags. In our methodology, we target NL queries representing pipelines of actions, where each main verb in a query indicates a task in the target code. Therefore, our objective in this phase is to identify non-overlapping verb-noun chunks within a query.

We use a rule-based method and a dependency graph to determine the main verbs and to construct a chunk for each verb. There are two types of main verbs considered in our methodology: (i) verbs with the POS value VERB (e.g. print the result, calculate the average), and (ii) auxiliary verbs (AUX) that are not immediately followed by other verbs (e.g. are inputs, is odd or even). Supplementary verbs do not form their own chunks (e.g. using Counter).

Figure 3 depicts a dependency graph generated by spaCy for the query “add two numbers, and then print the result”. The main verbs in this query are add and print. The dependency graph reveals that all the main verbs are interconnected, while other words (e.g. NOUN, ADV) associate with their corresponding verbs. Thus, the main verb functions as the root node of its verb-phrase tree. By applying this rule to analyze the dependency graph, we extract two verb-noun-phrases, namely “add two numbers” and “print the result”. Punctuation and conjunction between main verbs are omitted in this analysis.
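For illustration, the sketch below implements this rule with spaCy under simplifying assumptions; it requires the en_core_web_md model, ignores corner cases such as supplementary verbs, and is not our full chunking implementation.

import spacy

nlp = spacy.load("en_core_web_md")

def verb_noun_chunks(query: str) -> list[str]:
    doc = nlp(query)
    # Main verbs: tokens with POS VERB, or AUX tokens not immediately followed by a verb.
    main_verbs = [
        tok for tok in doc
        if tok.pos_ == "VERB"
        or (tok.pos_ == "AUX"
            and (tok.i + 1 >= len(doc) or doc[tok.i + 1].pos_ != "VERB"))
    ]
    chunks = []
    for verb in main_verbs:
        other_idx = {v.i for v in main_verbs if v.i != verb.i}
        # Keep words attached to this verb, dropping punctuation, conjunctions,
        # and tokens that belong to another main verb's phrase.
        words = [
            tok for tok in verb.subtree
            if tok.pos_ not in ("PUNCT", "CCONJ")
            and tok.i not in other_idx
            and not any(anc.i in other_idx for anc in tok.ancestors)
        ]
        chunks.append(" ".join(tok.text for tok in words))
    return chunks

print(verb_noun_chunks("add two numbers, and then print the result"))
# e.g. ['add two numbers', 'then print the result'], depending on the parse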

Fig. 3 A dependency graph generated by spaCy (https://spacy.io/)

It is worth highlighting the potential benefits of employing Large Language Models (LLMs), such as GPT-3.5,Footnote 12 in this phase. Nonetheless, our objective is to ensure the transparency of the model and the ease of comprehension for developers throughout all steps. Additionally, our evaluation results indicate that even a less complex model, when incorporated as an additional component, can already improve the efficiency of the NL-to-Code model.

3.3 Sub-snippets retrieving/generating

In our methodology, NL chunks are considered as atomic NL queries that represent a single primary task or action. The sub-snippets retrieval and generation process for an NL chunk is displayed in Fig. 4.

Fig. 4 Flowchart of retrieving/generating sub-snippets for a Natural Language chunk

Firstly, if the NL chunk exists in the correction data-store, the related code snippets are retrieved and transferred to the Code building step. If the NL chunk is not present in the data-store, the k-Nearest Neighbors (KNNs) of the chunk are computed under a predetermined threshold (refer to Sect. 4.2). Code snippets from the KNNs are extracted and forwarded to the Code building step. However, if there are no nearest neighbors of the NL chunk, the NL-to-Code generator is activated to generate code for the chunk and proceed to the subsequent step. Further details on code generation for NL queries or NL chunks are provided in Sect. 4.2.

3.3.1 Extracting sub-snippets for an NL chunk

In case the NL chunk has a similar NL query in the correction data-store (i.e. a nearest neighbor), the sub-snippets of the NL chunk are determined based on the sub-snippets of the phrase in the NL query that is most similar to it. Algorithm 1 outlines the process of extracting code snippet for an NL chunk from the corresponding code of a similar NL query.

Algorithm 1 Extracting sub-snippet for an NL chunk from a similar query

Initially, the NL chunk is compared to the similar NL query to identify the most similar phrase, denoted as simi_chunk (line 2). We use the (cosine) similarity feature provided by spaCyFootnote 13 to assess the correlation between the NL chunk and each chunk in the similar query, subject to a predefined threshold (see Sect. 4.2). Additionally, each chunk in the similar query is mapped to sub-snippets in the target code of the query, using the function named MAP_NL_CODE (line 3). The associated sub-snippets of simi_chunk are then extracted and assigned as sub-snippets for the NL chunk (line 5).
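A simplified Python rendering of this logic is shown below; it reuses the verb_noun_chunks helper sketched in Sect. 3.2, takes map_nl_code (Algorithm 2) as a parameter, and uses the 0.5 similarity threshold from Table 2. The names and structure are illustrative, not the exact algorithm lines.

SIM_THRESHOLD = 0.5  # spaCy cosine similarity threshold (Table 2)

def extract_sub_snippet(nl_chunk: str, similar_query: str, similar_code: str,
                        map_nl_code) -> list[str]:
    chunk_doc = nlp(nl_chunk)
    # Find the most similar phrase (simi_chunk) in the similar query.
    simi_chunk, best_score = None, SIM_THRESHOLD
    for phrase in verb_noun_chunks(similar_query):   # Sect. 3.2
        score = chunk_doc.similarity(nlp(phrase))
        if score >= best_score:
            simi_chunk, best_score = phrase, score
    if simi_chunk is None:
        return []
    # Map each phrase of the similar query to sub-snippets of its target code
    # (MAP_NL_CODE, Algorithm 2) and reuse the sub-snippets of simi_chunk.
    mapping = map_nl_code(similar_query, similar_code)
    return mapping.get(simi_chunk, [])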

3.3.2 Mapping NL chunks and sub-snippets

Algorithm 2 represents the pseudo code for the MAP_NL_CODE function. We employ a rule-based approach to establish mappings between chunks in an NL query and sub-snippets in the correlative target code. Before constructing the mapping, the target code is divided into sub-snippets by analyzing its Abstract Syntax Tree (AST) structure (line 3). We utilize the tree-sitterFootnote 14 parser to obtain the AST of the target code. Sub-snippets within the target code consist of statements under the root_node (e.g. import statements) and child statements of function_definition. For simplicity, we require that each NL query is translated to code snippets wrapped in a function_definition and the necessary import statements.

Algorithm 2 Mapping chunks in a query and its target code

Subsequently, the NL query is decomposed into verb-noun chunks (line 4) following the method described in Sect. 3.2. To estimate the analogy between sub-snippets and verb-noun phrases, we developed a straightforward code explanation approach (line 6) that translates programming language operations and abbreviations into natural language.Footnote 15 Afterwards, the explanation of each sub-snippet is compared to the verb-noun phrases utilizing the (cosine) similarity function from spaCy (lines 7–11). The phrase with the highest similarity score is mapped to the current sub-snippet (line 15).
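The sketch below mirrors this mapping under simplifying assumptions: it splits the target code with Python’s built-in ast module rather than tree-sitter, and the explain parameter is a trivial stand-in for our code explanation approach.

import ast

def split_sub_snippets(code: str) -> list[str]:
    # Top-level statements (e.g. imports) plus the direct child statements
    # of each function definition.
    tree = ast.parse(code)
    snippets = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            snippets.extend(ast.unparse(child) for child in node.body)
        else:
            snippets.append(ast.unparse(node))
    return snippets

def map_nl_code(query: str, code: str, explain=lambda snippet: snippet):
    # explain() stands in for the code-explanation step of Algorithm 2.
    phrases = verb_noun_chunks(query)            # Sect. 3.2
    if not phrases:
        return {}
    phrase_docs = {p: nlp(p) for p in phrases}
    mapping = {p: [] for p in phrases}
    for snippet in split_sub_snippets(code):
        expl_doc = nlp(explain(snippet))
        # Map the sub-snippet to the verb-noun phrase with the highest similarity.
        best = max(phrases, key=lambda p: expl_doc.similarity(phrase_docs[p]))
        mapping[best].append(snippet)
    return mapping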

It is worth mentioning again that LLMs could be used for these NLP-related tasks. However, as emphasized above, our goal is to investigate whether an NL-to-Code model can be enhanced by a less sophisticated method. Hence, a rule-based approach is well-suited for this purpose.

3.4 Code building

In this step, the final code is constructed by combining sub-snippets corresponding to each verb-noun phrase in the NL query. The inputs for this step include the NL query and the mapping between each phrase in the query and its respective sub-snippets. The final code encompasses sub-snippets enclosed within a function_definition and any required import statements.

This step consists of the following sub-steps: (i) determining the order of sub-snippets, (ii) refining sub-snippets for each verb-noun phrase, (iii) renaming identifiers in all sub-snippets to ensure data-flow (i.e. the naming progression from definitions to usages of identifiers in a code snippet, following the semantic data-flow of Ren et al (2020)), and (iv) identifying parameters for the final function. Figure 5 demonstrates an example of code construction from sub-snippets, using the example described in Fig. 2.

Fig. 5 An example of building code from sub-snippets. Assuming that each NL chunk retrieves 2-Nearest Neighbors from the correction data-store, resulting in two potential sub-snippets for each chunk

3.4.1 Determining sub-snippet order

Sub-snippets are sorted according to the verb-noun phrase order in the NL query, which corresponds to the order of related verbs in the dependency graph. The arrangement is determined by analyzing the relationship between verbs in the graph. As a result, sub-snippets associated with the root verbFootnote 16 are given priority. In Fig. 3, the verb add precedes the verb print due to a conj dependency from add to print. Therefore, sub-snippets of the verb add (i.e. return num_1 + num_2 and return a + b) are placed before sub-snippets of the verb print (i.e. print(result) and print(‘result = ’, result)) at the end of this sub-step (Fig. 5).

3.4.2 Refining sub-snippets

Relevant sub-snippets of each NL chunk are modified based on a set of rules. To illustrate the utility of our methodology, we initially employed three rules for sub-snippet refinement: (i) no starting return statements, (ii) reducing plural statements, and (iii) refining between return statements. These rules aim to minimize grammatical errors when combining sub-snippets.

(i) No starting return statements. This rule prioritizes non-return statements for non-last NL chunks. By default, each NL chunk corresponds to a list of potential sub-snippets, and the first item (i.e. the sub-snippet(s) extracted from the top-1 nearest neighbor) has the highest priority. This is the output of the Sub-snippets retrieving/generating step (Sect. 3.3).

The preference is maintained if the current NL chunk occupies the last position in the list of chunks obtained from the preceding sub-step (i.e. sub-snippet ordering). In contrast, if the current NL chunk is a non-last chunk, return statements are ranked lower than other statements. This is because, in most programming languages, a return statement cancels the subsequent statements of the same level (e.g. the same indentation) within a scope (e.g. a try statement). However, if the NL chunk corresponds only to return statements, the first return statement is selected and converted into an assignment statement. The left operand is named stmt followed by the index of the current NL chunk.

Table 1 displays four typical cases of refining sub-snippets for a non-last NL chunk, exemplified with the chunk “add two numbers”. In the example depicted in Fig. 5, “add two numbers” occupies the first position in the ordered list, and its potential sub-snippets all start with the keyword return. Hence, the statement selected for this chunk is refined as stmt_0 = num_1 + num_2.

Table 1 Examples of refining sub-snippets for a non-last NL chunk with 2-NNs
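A minimal sketch of rule (i), assuming each sub-snippet is a single-line string; the helper name is illustrative.

def refine_non_last(sub_snippets: list[str], chunk_idx: int) -> list[str]:
    # Prefer non-return statements for a non-last chunk.
    non_returns = [s for s in sub_snippets if not s.lstrip().startswith("return")]
    if non_returns:
        return non_returns
    # Only return statements are available: turn the first one into an assignment.
    expr = sub_snippets[0].lstrip()[len("return"):].strip()
    return [f"stmt_{chunk_idx} = {expr}"]

print(refine_non_last(["return num_1 + num_2", "return a + b"], 0))
# ['stmt_0 = num_1 + num_2'], matching the example in Fig. 5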

(ii) Reducing plural statements. This rule aims to omit redundant sub-snippets of a verb-noun phrase. For simplicity, we implement a preliminary prototype of this rule based on the direct object of the verb in an NL chunk. Nearly identical sub-snippets are reduced if the direct object is a singular noun (i.e. the spaCy tag_ is NN). Conversely, the sub-snippets of the NL chunk are left unchanged if the direct object is a plural noun (i.e. the spaCy tag_ is NNS).

For instance, assuming that the following sub-snippets are obtained for the NL chunk “get an integer input from user”:

(Listing: candidate sub-snippets retrieved for the chunk “get an integer input from user”)

Since the direct object of the verb get is a singular noun (i.e. inputFootnote 17), only the first sub-snippet from the list of highly similar sub-snippets is preserved for building the final code (i.e. num_1 = int(input("Number 1: "))).

It should be emphasized that this is an initial prototype of the rule, intended to exhibit the concept of our approach. The reduction condition can become more complex when the plural noun is described with a specific quantity. In Fig. 5, the chunks “add two numbers” and “print the result” each have only one sub-snippet as a result of rule (i). Therefore, rule (ii) has no effect on these sub-snippets.
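A minimal sketch of rule (ii), reusing the nlp pipeline from Sect. 3.2; the second candidate sub-snippet in the usage example is hypothetical, since the original listing is not reproduced here.

def reduce_plural(nl_chunk: str, sub_snippets: list[str]) -> list[str]:
    doc = nlp(nl_chunk)
    direct_objects = [tok for tok in doc if tok.dep_ == "dobj"]
    if direct_objects and direct_objects[0].tag_ == "NN":
        # Singular direct object: drop the near-identical alternatives.
        return sub_snippets[:1]
    return sub_snippets

print(reduce_plural("get an integer input from user",
                    ['num_1 = int(input("Number 1: "))',
                     'num_2 = int(input("Number 2: "))']))
# expected: ['num_1 = int(input("Number 1: "))']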

(iii) Refining between return statements. The last rule in our primary rule set ensures that a return statement is placed after other statements of the same level in the final assembled code. Namely, a non-last NL chunk should contribute non-return statement(s) to the final code. Otherwise, depending on the expression after the keyword return, the return statement is either omitted or transformed into an assignment statement, using the same technique as in rule (i). The latter case (i.e. modifying the return statement) applies when the part following the keyword return creates new values (e.g. return a + b, return abs(num), or return a[i]). The former case (i.e. omitting the statement) arises if the after-return part is an identifier (e.g. return sum) or a list of identifiers (e.g. return a, b).

In the example exhibited in Fig. 5, the sub-snippets of the NL chunks remain unmodified after applying rule (iii), because no return statements are left in the sub-snippet list after applying rules (i) and (ii). It is essential to mention that return statements nested in other code structures (e.g. if, for) are not affected by rules (i) and (iii), since the considered sub-snippets are statements directly under the root_node of an AST and direct child statements of function_definition (see Sect. 3.3).

Furthermore, our primary rule set is adaptable and can be expanded for intricate cases (e.g. conditional and loop statements). We develop a configuration file to gather all the settings used in our experiments (see Sect. 7, Listing 4) and to conveniently select/deselect each of the refinement rules before running an experiment.

3.4.3 Renaming identifiers

In this sub-step, the propagation of names within the sub-snippets is determined by analyzing code token types in the refined sub-snippets. We simplify the process by assuming that an identifier defined in a given statement should be used directly in the following statement. The sub-snippets are inspected from the last one to the first. The underlying idea is to substitute the identifier definitions in the current sub-snippet with the undefined identifiers in the sub-snippet below it. Pseudo code for our algorithm is provided in Algorithm 3.

Algorithm 3 Renaming identifiers in sub-snippets

The list of undefined identifiers is initialized by taking the set difference between the identifier usages and the identifier definitions in the last statement (lines 2–4). We use tree-sitterFootnote 18 and the Code Token Type Taxonomy (CT3) proposed by Le et al (2023) to analyze the type of each token within the sub-snippets. Identifier definitions encompass variable definitions, argument definitions, and imported libraries, while identifier usages include the utilization of all the specified definitions.

Identifier definitions and usages of each sub-snippet are determined in reverse order of the sub-snippets using the same method (line 6). The identifier definitions of the current sub-snippet are then replaced by the previously computed undefined identifiers (line 7), while the list of undefined identifiers is updated to exclude the replacement (line 8). The REPLACE_ID_DEFS function also considers identifier usages to handle cases where the current sub-snippet lacks identifier definitions for the one directly below it. In this case, the identifier usages serve as identifier definitions.

In Fig. 5, the list of undefined identifiers of the last statement (print(result)) comprises only the token result. Meanwhile, stmt_0 is the only identifier definition in the preceding statement. Accordingly, after the renaming sub-step, stmt_0 is replaced by result.

3.4.4 Identifying parameters for the final code

In the last sub-step, a list of parameters for the final function is assembled from undefined identifiers that are unsubstitutable by any identifier definitions. In Fig. 5, the tokens num_1 and num_2 remain as parameters of the resulting function due to the absence of appropriate identifier definitions for them within the sub-snippets.
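The two sub-steps can be sketched as follows; for brevity, the sketch analyzes single-line sub-snippets with Python’s ast module (our implementation uses tree-sitter and CT3) and handles only the simple case of one definition per statement.

import ast
import builtins

BUILTIN_NAMES = set(dir(builtins))

def defs_and_uses(stmt: str):
    node = ast.parse(stmt).body[0]
    defs = {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    uses = {n.id for n in ast.walk(node)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)} - BUILTIN_NAMES
    return defs, uses

def rename_and_parameters(sub_snippets: list[str]):
    snippets = list(sub_snippets)
    defs, uses = defs_and_uses(snippets[-1])
    undefined = uses - defs                      # names the last statement still needs
    for i in range(len(snippets) - 2, -1, -1):   # inspect snippets from bottom to top
        defs, uses = defs_and_uses(snippets[i])
        if defs and undefined:
            # Substitute one identifier definition with one undefined name below it.
            old, new = sorted(defs)[0], sorted(undefined)[0]
            snippets[i] = snippets[i].replace(old, new)
            undefined.discard(new)
        undefined |= uses - defs                 # names this snippet itself still needs
    return snippets, sorted(undefined)           # leftovers become function parameters

print(rename_and_parameters(["stmt_0 = num_1 + num_2", "print(result)"]))
# (['result = num_1 + num_2', 'print(result)'], ['num_1', 'num_2'])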

Our methodology adheres to the principles of simplicity, interpretability, and the ability to investigate unexpected outcomes, which is not feasible with AI models. Furthermore, the methodology’s composability allows the generated code to resemble the target code more closely as correction information accumulates.

Given the novelty of our approach, we aim to illustrate the utility of the method and to highlight the main contributions. Therefore, even though our predefined rules are preliminary, they still adequately support the proposed concept. In the sections below, we present our experiments and evaluation results, demonstrating how the inclusion of a relatively simple additional component can already bring benefit to a code translation model.

4 Experiments

In this section, we first reiterate our objective through two research questions. A detailed description of our experimental setup is then provided to ensure reproducibility. Finally, we present the evaluation metrics used in our experiments.

4.1 Research questions

We address the following two research questions:

RQ1: Does an interpretable, non-AI methodology enhance generative AI models? We investigate this question by proposing a rule-based methodology on NL-to-Code translation that incorporates code derived from user feedback with selectively generated code from an AI model (only as needed). Our methodology requires no explicit re-training. We conduct experiments on NL-to-Python code translation and use GPT-3.5-Turbo-0301 model developed by OpenAIFootnote 19 as the generative AI model.

Models for comparison. To ensure a fair evaluation in the absence of existing comparable methods, we introduce an additional method that integrates correction information directly into input queries. This approach is based on the premise that GPT-3 series models tend to yield more accurate results with increased input information. The extended input technique and our proposed methodology function similarly when the query exists in the correction data-store, as the correction information is retrieved and returned to users. However, these models differ in their response when there are similar queries in the correction data-store.

While the extended input approach simply expands the input query with information from the similar queries, our chunking methodology first decomposes the query into chunks, gathers appropriate code snippets for each chunk by examining the correction data-store or activating the NL-to-Code generator, and then constructs the final code using the collected code snippets.

In summary, our main evaluation comprises three variants: (i) CodeGen – code generation without correction information, (ii) CodeGenE – code generation with correction information integrated through extended input queries, and (iii) CodeGenC – code generation with correction information incorporated using our chunking methodology. The CodeGen model serves as the baseline. Additionally, to examine LLM performance with our chunking strategy embedded within prompts as task descriptions, we conducted an additional experiment using GPT-3.5 to directly generate code with the integrated chunking instruction, referred to as GPT35Prompt.

RQ2: Does user feedback improve NL-to-Code models without explicit re-training? To address this question, we perform an ablation study to assess the influence of user feedback on generated code. We compare the code generated solely by the code generator to the code produced when integrating the generator with various states of the correction data-store (see Sect. 4.2). This comparison allows us to determine if incorporating user feedback offers benefits to code translation models without re-training and which state of the data-store would offer the greatest advantage.

4.2 Experimental setup

We perform experiments on translating NL to Python code, using available APIs and libraries as follows:

4.2.1 Test cases and scenarios

For the evaluation, we assess the methodology using a range of NL queries, varying from basic to complex. To simplify the query chunking process, we assume that each chunk in a query describes a single task and that chunks are separated by a comma or the phrase “, and then”. While acknowledging potential artificiality in queries with three or more chunks, our proposed structure addresses NLP ambiguity as an intermediate form between NL and a Domain Specific Language. It involves an inevitable trade-off between flexibility and efficiency.

Due to the unavailability of a suitable test suite or benchmark tailored to our specific requirements, we develop a new test suite comprising queries with one to three chunks along with their corresponding target code. We extract single-chunk queries from online Python examples.Footnote 20 For multi-chunk queries, we utilize ChatGPT,Footnote 21 a well-known model trained on an immense dataset, to form the queries. Although ChatGPT’s responses might lean toward its own biases, they remain closer to human intent and are more objective than our self-composed queries. For instance, we use the following inquiry for creating double-chunk queries, specifically related to dataframe:

(Listing: the inquiry submitted to ChatGPT to create double-chunk queries related to dataframes)

Subsequently, GitHub CopilotFootnote 22 is employed to generate the target code for each query. GitHub Copilot is powered by Codex model,Footnote 23 a descendant of GPT-3, which was trained on both natural language and billions of lines of code. Hence, code generated by GitHub Copilot can serve as a reasonable reference. We thoroughly validate and modify (if necessary) each target code to ensure its validity and executability.Footnote 24

For each NL query or chunk, there are five possible states of the correction data-store: (1) an empty data-store, (2) an identical query in the data-store, (3) a non-empty data-store without similar queries for the inquiry, (4) similar single-chunk queries in the data-store, and (5) similar multi-chunk queries in the data-store. Accordingly, each single-chunk query involves five scenarios, while each multi-chunk query can be associated with up to \((1 + 4^n)\) scenarios, where n represents the number of chunks in the query.

The test suite should cover all the states of the correction data-store and yield sufficient results to analyze the behavior of all models, targeting to highlight the utility of the proposed method. For this reason, we gathered 47 single-chunk queries, nine double-chunk queries, and three triple-chunk queries as main inquiries, alongside 55 single-chunk queries, 20 double-chunk queries, and 14 triple-chunk queries dedicated as similar queries in the correction data-store. These queries cover 401 cases across five states of the correction data-store. Furthermore, each query chunk is guaranteed to have at least one similar single-chunk query.

Accordingly, each test case includes: (i) the correction data-store, (ii) NL query, (iii) target code, (iv) code generated by the NL-to-Code generator only, (v) code obtained with extended input queries, and (vi) code constructed by our chunking methodology.

4.2.2 Correction data-store

For simplicity, the correction data-store is a dictionary whose keys are tuples of embedding values of the NL queries and whose values are the code corrected by users. Given the varying data-store states for each query, we create a data-store containing all collected NL queries and their target code, and provide a snapshot of the data-store for each test case.
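A minimal sketch of this structure, with illustrative helper names:

from typing import Dict, Optional, Tuple

CorrectionStore = Dict[Tuple[float, ...], str]   # embedding tuple -> corrected code

def store_correction(store: CorrectionStore,
                     query_embedding: Tuple[float, ...],
                     corrected_code: str) -> None:
    store[tuple(query_embedding)] = corrected_code

def lookup_exact(store: CorrectionStore,
                 query_embedding: Tuple[float, ...]) -> Optional[str]:
    return store.get(tuple(query_embedding))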

4.2.3 Code generator

We utilize GPT-3.5-Turbo-0301,Footnote 25 a replacement for the Codex model, for NL-to-Python code translation. Since the CodeCompletion feature of the Codex model was replaced by ChatCompletion in GPT-3.5-Turbo-0301, queries for translating NL to code are formalized as messages between the system and users.

To obtain Python code from an NL query with GPT-3.5-Turbo-0301 model solely (i.e. CodeGen model), we structure the messages between system and user as demonstrated in Listing 1.

(Listing 1: system and user messages for generating Python code with the CodeGen model)

The response should exclude both code explanations and Python code-block marks (e.g. ```python) to facilitate code snippet extraction. Besides, all variable names in the generated Python code should adhere to the snake_case convention to enhance the mapping between NL chunks and code snippets.
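For illustration, the request can be issued as sketched below with the legacy openai Python SDK (pre-1.0); the message wording paraphrases the constraints above and is not the verbatim Listing 1, and the hyperparameters follow Table 2.

import openai

def generate_code(nl_query: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0301",
        messages=[
            {"role": "system",
             "content": "You translate natural language to Python code."},
            {"role": "user",
             "content": f"{nl_query}\n"
                        "Return only the code, without explanations or "
                        "```python marks, and use snake_case variable names."},
        ],
        temperature=0.9, top_p=0.9, n=1,
        frequency_penalty=0.5, presence_penalty=1.5,
    )
    return response["choices"][0]["message"]["content"]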

As the GPT-3.5-Turbo-0301 model generally gives less attention to system messages,Footnote 26 we extend input queries for the CodeGenE model by integrating the correction information into user messages, as illustrated in Listing 2. We refer to the OpenAI documentation for a detailed explanation of each field in the messages.

(Listing 2: user messages extended with correction information for the CodeGenE model)

Similar queries and their corrected code snippets from the correction data-store are provided as examples for the NL query and displayed in sequential order. Alternative prompting methods for user messages may impact the generated response (see Sect. 2.3). However, comparing these prompting methods is beyond the scope of this paper.

Additionally, OpenAI models exhibit non-deterministic behavior, resulting in varying outputs for identical inputs. This poses challenges for our evaluation process, particularly when triggering the model multiple times with the same input due to the dynamic state of the correction data-store in the test cases. To address this issue, we adopt a dictionary-based method to accumulate and store the code generated by GPT-3.5-Turbo-0301. The dictionary uses embedding values of inquiries as keys, enabling retrieval of the corresponding generated code when an identical prompt is submitted.
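A minimal sketch of this caching mechanism, reusing the generate_code() helper sketched above and the embed() helper sketched in Sect. 4.2.4:

from typing import Dict, Tuple

generation_cache: Dict[Tuple[float, ...], str] = {}

def cached_generate(prompt: str) -> str:
    key = tuple(embed(prompt))              # embedding values of the prompt as key
    if key not in generation_cache:
        generation_cache[key] = generate_code(prompt)
    return generation_cache[key]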

4.2.4 Natural language embedding and KNNs

We employ another model from OpenAI, Text-Embedding-ADA-002,Footnote 27 to embed NL queries. KNNs for each query are extracted using cosine similarity under a predefined threshold (see experiment configuration). The accompanying function is developed by OpenAI as well.
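For illustration, the embedding and KNN retrieval can be sketched as follows; plain numpy replaces OpenAI’s helper function, and the threshold is interpreted as a maximum cosine distance, an assumption consistent with the small values in Table 2.

import numpy as np
import openai

def embed(text: str) -> list[float]:
    response = openai.Embedding.create(model="text-embedding-ada-002", input=[text])
    return response["data"][0]["embedding"]

def find_knn(query: str, store: dict, threshold: float = 0.15, k: int = 2) -> list[str]:
    q = np.array(embed(query))
    scored = []
    for key_embedding, corrected_code in store.items():   # keys are embedding tuples
        v = np.array(key_embedding)
        distance = 1.0 - float(q @ v / (np.linalg.norm(q) * np.linalg.norm(v)))
        if distance <= threshold:            # keep only sufficiently close neighbors
            scored.append((distance, corrected_code))
    return [code for _, code in sorted(scored)[:k]]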

4.2.5 Experiment configuration

Table 2 displays the configurations for the conducted experiments.

Table 2 Experiment configuration

The hyperparameters for generating Python code from NL queries using the model GPT-3.5-Turbo-0301 are outlined in the top part of the table. Specifically, a temperature of 0.9 and a top_p value of 0.9 are set to encourage the model’s creativity when multiple responses are required (\(n>1\)). A frequency_penalty of 0.5 is assigned to penalize the frequent occurrence of repeated identifiers in code snippets, while a presence_penalty of 1.5 is used to prompt the model to generate a novel response each time for the same query. For simplicity, in our experiments, we consider a single response per query (\(n=1\)). Further information on each hyperparameter is explained in the OpenAI documentation.Footnote 28

Queries of the correction data-store undergo KNN examination using cosine similarity thresholds of 0.15 and 0.2 for single and multi-chunk queries, respectively. Each inquiry obtains two nearest neighbors (\(knn = 2\)). The settings for obtaining sub-snippets and building the final code are specified at the bottom of Table 2. The spaCy model en_core_web_md is utilized, and another cosine similarity threshold of 0.5 is set for comparing the resemblance between chunks or a chunk and its sub-snippets. A threshold of 0.9 is employed to determine mostly identical sub-snippets for the second rule in the rule set of refining sub-snippets (Sect. 3.4).

In addition, we omit stop words and lemmatize verbs to their base form before calculating the similarity. The thresholds in Table 2 are adjusted to ensure that the final code snippet is constructed successfully in a majority of test cases. As mentioned in Sect. 3.4, we gathered setting values to a configuration file to easily fine-tune all the parameters, rules, and options.
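An illustrative configuration gathering these settings is sketched below; the actual file (Listing 4, Sect. 7) may use a different format and key names.

CONFIG = {
    "generator": {
        "model": "gpt-3.5-turbo-0301",
        "temperature": 0.9, "top_p": 0.9, "n": 1,
        "frequency_penalty": 0.5, "presence_penalty": 1.5,
    },
    "knn": {
        "k": 2,
        "threshold_single_chunk": 0.15,
        "threshold_multi_chunk": 0.2,
    },
    "chunking": {
        "spacy_model": "en_core_web_md",
        "chunk_similarity_threshold": 0.5,
        "identical_snippet_threshold": 0.9,   # rule (ii) of sub-snippet refinement
        "remove_stop_words": True,
        "lemmatize_verbs": True,
    },
    "refinement_rules": {
        "no_starting_return": True,
        "reduce_plural_statements": True,
        "refine_between_returns": True,
    },
}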

4.3 Evaluation metrics

Inspired by the study of Su et al (2018), we assume that users modify the generated result in the following order: (i) restructuring the code if necessary (i.e. adding, re-arranging, or removing statements), and (ii) renaming identifiers and updating strings to align with the NL query. Based on this, we evaluate the code obtained by the different approaches using the following criteria, in descending order of priority:

  1. Code validity and executability

  2. Syntax similarity between the attained snippet and the target code

  3. Data-flow correlation among the obtained results

  4. Analogy of identifier names in the code snippets

To ensure the first criterion, we manually evaluate each test case for its correctness. The remaining criteria are assessed using CodeBLEU (Ren et al 2020) with hyperparameters \((\alpha , \beta , \gamma , \delta )\) representing the ngram match, weighted ngram match, syntax match, and data-flow match, respectively. While the ngram match and weighted ngram match target the last criterion, the syntax match depicts syntax similarity and the data-flow match exhibits data-flow equivalence.
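For reference, CodeBLEU combines these four components as a weighted sum (Ren et al 2020): \(\text{CodeBLEU} = \alpha \cdot \text{BLEU} + \beta \cdot \text{BLEU}_{weight} + \gamma \cdot \text{Match}_{ast} + \delta \cdot \text{Match}_{df}\), where the first two terms are the ngram and weighted ngram matches and the last two are the syntax and data-flow matches; the weight values used in our experiments are discussed below.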

Ren et al (2020) recommended using the value set (0.1, 0.1, 0.4, 0.4), as \(\gamma \) and \(\delta \) have a stronger correlation with human evaluation scores. Based on the order of modifying generated code and our evaluation criteria, we adjust the value set to (0.1, 0.1, 0.5, 0.3). Our evaluation results show that both value sets follow the same trend with minimal differences. Section 5 presents the statistics with the amended value set (i.e. 0.1, 0.1, 0.5, 0.3). We refer to our published dataFootnote 29 for the outcomes of the other value set.

5 Evaluation results

In this section, we present and analyze our evaluation results to address the two research questions from Sect. 4.1. We conclude this section by assessing the performance of an LLM with our chunking strategy outlined in the NL text prompt.

5.1 Evaluation results by difficulty level

To analyze the evaluation results, we utilize the correction data-store states defined in Sect. 4.2 to determine the difficulty level for each test case and classify the results based on these levels. Each difficulty level indicates the degree of challenge in achieving the target code. The levels range from 0 to 4, representing a spectrum that includes low, medium-low, medium, medium-high, and high difficulty. Table 3 presents the definition of these levels.

Table 3 Definitions for difficulty levels (diff.)

For instance, difficulty level-2 involves two sub-scenarios: (i) single-chunk NL query linked to multi-chunk queries in the data-store, or (ii) a multi-chunk NL query where each chunk resembles single-chunk queries in the data-store. Meanwhile, difficulty level-3 indicates that each chunk in the multi-chunk NL query is related to queries with multi-chunk in the data-store. Ultimately, difficulty level-4 represents two sub-cases: (i) empty correction data-store, and (ii) no matching queries in the data-store.

Figure 6 presents the patterns of CodeBLEU score by difficulty level for the three models discussed in Sect. 4.1. The scores were computed for three sets: (i) all test cases (Fig. 6a), (ii) correct chunking cases (Fig. 6b), and (iii) incorrect chunking cases (Fig. 6cFootnote 30). The corresponding CodeBLEU scores are provided in Table 4.

Fig. 6 CodeBLEU scores by difficulty level on all test cases (left), on correct test cases (middle) and on incorrect test cases (right) of the chunking methodology

Table 4 CodeBLEU by difficulty level (diff.) across all approaches

CodeGenC demonstrates average improvements of \(1.6\%\) and \(48.6\%\) over CodeGenE and CodeGen (i.e. the baseline model), respectively. Particularly, on test cases of medium-high difficulty level, CodeGenC outperforms CodeGen by \(21.1\%\), whereas CodeGenE improves the baseline performance by \(12.5\%\) (Table 4, columns 2–4, diff.3).

The models CodeGenC and CodeGenE exhibit similar trends in their CodeBLEU scores, with a rapid downward transition from difficulty levels 0 to 4, representing the shift from code generation with correction information to code generation without it (Fig. 6a). Both models significantly outperform the baseline model by a factor of 2.2 at difficulty level-0, where the NL query exists in the correction data-store (Table 4, columns 2–4). Their performance then converges to the baseline’s at difficulty level-4.

In contrast, the standalone code generator (i.e. CodeGen) shows slight improvements from difficulty levels 0 to 3 but a decline at difficulty level-4 (Fig. 6a). Overall, CodeGen performs worse than the other models, except at difficulty level-4, where it slightly exceeds CodeGenC by \(2.4\%\) and lags behind CodeGenE by \(1.3\%\) (Table 4, columns 2–4).


To gain insights into the behavior of CodeGenE and CodeGenC models, and understand the factors contributing to performance differences, we conduct a detailed analysis for difficulty levels 1 to 4 on correct and incorrect chunking cases.

CodeGenC obtains accurate results on \(88.3\%\) of all test cases and consistently outperforms the other models across difficulty levels 1 to 3 (Fig. 6b). Particularly, in the case of medium-low difficulty, where the single-chunk input NL query is similar to single-chunk queries in the correction data-store, CodeGenC surpasses CodeGen by a factor of 1.9 and slightly improves upon CodeGenE by \(5.3\%\) (Table 4, columns 5–7, diff.1). The latter improvement is attributed to CodeGenE occasionally omitting the syntax or identifier names of similar queries. The first four rows of Table 5 present an example of this situation. The target code contains an assignment followed by a return statement, and utilizes variable names like df and input_file. While CodeGenE disregards this information, CodeGenC successfully integrates the suggested syntax and identifier names from the similar query.

Table 5 Examples of CodeGenE overlooking or getting confused by extra information

Difficulty level-2 expresses cases where the single-chunk NL query is associated with multi-chunk queries in the data-store, or each chunk of the multi-chunk NL query resembles single-chunk queries in the data-store. At this level, our model persistently excels over CodeGen by a factor of 1.4 and achieves a slight advantage of \(5.2\%\) over CodeGenE (Table 4, columns 5–7). The latter increment results from CodeGenE getting confused by extra information from similar queries. Rows 5–8 of Table 5 illustrate an example of this case. While CodeGenC achieves identical syntax and variable names, the code generated by CodeGenE includes a redundant statement due to the additional chunk from the similar query (e.g. “get a number as kilometers from users”).

Notably, at the medium-high difficulty level, where each chunk in the multi-chunk NL query is similar to multi-chunk queries in the data-store, CodeGenC shows a \(23.1\%\) increase over CodeGen and an \(11\%\) improvement upon CodeGenE. The latter discrepancy arises because the additional information from similar queries applies to only some chunks in the input NL query. The bottom part of Table 5 provides an example of this instance. The similar query “add two numbers, and then print the result” provides information that pertains to only the first and third chunks in the input query, which causes missing code lines in the snippet produced by CodeGenE. Meanwhile, CodeGenC overcomes this issue since it derives the final code from sub-snippets of each chunk in the NL query.


Around one tenth of all test cases are classified as inaccurate chunking results. On these test cases, our model outperforms the baseline model by an average of \(15\%\), but lags behind the CodeGenE model across the difficulty levels (Fig. 6c). At difficulty level-1, CodeGenC surpasses CodeGen by a factor of 2.0, while experiencing an \(8.3\%\) decrease compared to CodeGenE (Table 4, columns 8–10).

This reduction is attributed to three factors. Firstly, CodeGenC composes function names from the verbs and direct nouns in the NL query, which may not always align with developer preferences. Secondly, the cosine-similarity-based validation of query similarity occasionally accepts similar queries in an unexpected KNN order. Lastly, for simplicity, our model currently does not handle the auto-detection of specific values from similar queries (e.g. two queries are similar but contain different numbers or strings). In contrast, CodeGenE, which is derived from the Large Language Model GPT-3.5-Turbo-0301, has inherent advantages in pure NLP tasks. Additionally, CodeGenE benefits from the target code being obtained through GitHub Copilot, a predecessor of GPT-3.5-Turbo-0301.

Analogously, the decrease of CodeGenC compared to CodeGenE at difficulty level-2 (by \(26.5\%\)) and level-3 (by \(11.2\%\)) is attributed to NLP-related challenges. Tasks at these two levels include finding correct KNNs, accurately extracting the most similar chunks from similar queries, and properly mapping NL chunks to their relevant code snippets. CodeGenC relies on rule-based approaches for these tasks, which face limitations in NLP. Ultimately, at the high difficulty level, CodeGenC slightly underperforms the other models with a \(5.7\%\) reduction, primarily due to the discussed naming convention for functions. It is worth noting that such an intensive inspection of incorrect cases is not feasible for the CodeGen and CodeGenE models because of their unpredictable nature.

Overall, our model, CodeGenC, demonstrates competitive performance compared to other models, despite the challenges encountered in NLP tasks. In contrast to generative AI models, our methodology offers straightforward and interpretable approaches for generating the final code, enabling thorough analysis of unexpected results and facilitating insights for potential improvements. Additionally, utilizing the explicit mapping between generated code snippets and NL chunks in a graphical user interface can simplify assessment of suggested code for users (see Sect. 7).

The extensive analysis of evaluation results on the entire test case dataset, spanning various difficulty levels, provides valuable information to answer the first research question introduced in Sect. 4.1.


5.2 Ablation study

We continue analyzing the evaluation results under two aspects: (i) complexity level and (ii) correct outcome ratio.

5.2.1 Complexity level

To study the significance of user feedback and the influence of each state of the correction data-store on generated code, we categorize the test results by complexity level. Each level describes the components required to attain the final code. These levels, ranging from 0 to 4, represent a spectrum from low to high complexity, determined by the states of the correction data-store (Sect. 4.2).

For example, on test cases of low complexity, the NL query exists in the correction data-store, requiring only the data-store component to obtain the final code. Complexity levels 1 and 2 represent situations where the code generator is activated due to an empty correction data-store or no matching queries in the data-store, respectively. Further details for each complexity level are provided in Table 6.

Table 6 Definitions for complexity levels (comp.)

Figure 7 depicts the CodeBLEU scores of our chunking methodology (CodeGenC) by complexity level, divided into three groups: (i) all test cases, (ii) correct chunking cases, and (iii) incorrect chunking cases.

Fig. 7 CodeBLEU scores by complexity level for our CodeGenC model


The evaluation results show a near-perfect score of 99.9 at the low complexity level, indicating the presence of the NL query in the correction data-store. However, at the medium-low and medium complexity levels, the test cases receive the lowest scores, regardless of correctness, with decrements of \(34.8\%\) and \(26\%\) compared to the average score across all test cases. This decline is attributed to the absence of user feedback for the NL query in the data-store.

Additionally, medium-high complexity test cases slightly surpass the high complexity ones by \(11.4\%\). This can be explained by the increased complexity associated with generating the final code at the high level. For test cases at medium-high level, CodeGenC utilizes various components including the correction data-store, similarity validation between queries, and, if necessary, the NL-to-Code generator to attain the final code. Meanwhile, the high complexity cases require an additional component, namely the NL-Code mapping, to construct the code by identifying suitable code snippets for each chunk in the query.

To examine the influence of user feedback on the CodeGenE model, we compare its CodeBLEU scores across complexity levels for all test cases (Table 7). The results align with the analysis of CodeGenC model discussed earlier. Specifically, complexity levels 1 and 2 encounter the lowest scores, with decrements of \(32.7\%\) and \(25.4\%\) from the average, respectively. Conversely, test cases with low complexity persistently achieve almost the perfect score. In addition, complexity levels 3 and 4 both exceed the average with increments of \(10.7\%\) and \(2\%\), respectively.

Table 7 CodeBLEU by complexity level across all test cases for CodeGenE model
Table 8 Correct outcome ratio and CodeBLEU for each model over all test cases

Ultimately, although we aim to mimic the structure and identifier names of the corrected code when constructing the final code, the validity of the generated code is also an important metric. In the next sub-section, we inspect the obtained code by its execution output.

5.2.2 Correct outcome ratio

As mentioned in Sect. 5.1, for simplicity, CodeGenC composes function names based on the input NL query instead of using LLMs, as CodeGen and CodeGenE do. Consequently, using exact match or accuracy is inappropriate for the evaluation. Instead, we manually examine each obtained code snippet and validate whether it yields the correct output after execution. The percentage of accurate outputs over all test cases forms the correct outcome ratio.


Table 8 displays the ratios for all models alongside their CodeBLEU scores. Overall, CodeGen achieves the highest percentage, but the disparities between models are small. In particular, CodeGen exceeds CodeGenE and CodeGenC by \(2.5\%\) and \(4.2\%\), respectively. In contrast, CodeGen attains the lowest CodeBLEU score, lagging behind CodeGenE and CodeGenC by \(27.0\%\) and \(27.7\%\), respectively. In other words, the standalone AI code generator may yield the correct output, but its generated code lacks substantial alignment with user suggestions.

Notably, Table 8 reveals minimal differences between CodeGenE and CodeGenC when aggregating CodeBLEU scores across all test cases. Nevertheless, at specific difficulty and complexity levels (discussed above), CodeGenC remarkably outperforms CodeGenE, underscoring the importance of our refined evaluation in such instances.

Additionally, Fig. 8 presents a further analysis of the correct outcomes. When CodeGenC obtains accurate results from constructed code snippets, CodeGen and CodeGenE still produce \(5.4\%\) and \(9.3\%\) inaccurate outcomes, respectively. Notably, in cases where CodeGen produces incorrect outcomes, CodeGenE and CodeGenC rectify \(60\%\) and \(63.3\%\) of them, respectively, turning them into correct ones.

Fig. 8 Correct outcome ratio for specific cases over all test cases

In simpler terms, generative models with extra information from user feedback can correct some cases that would otherwise yield incorrect outcomes. The evaluation results clearly indicate the advantages of user feedback for NL-to-Code translation models, even in the absence of explicit re-training. The last two findings address our second research question outlined in Sect. 4.1.


Ultimately, although prompting techniques are not the main focus of our work, it is essential to assess whether an LLM with our chunking strategy integrated into the prompt can outperform our proposed model. The next subsection addresses this matter and presents the results.

5.3 LLM involvement

Primary goal reiteration. It is worth emphasizing again that, besides the goal of integrating user feedback into generative AI models without re-training, we aim to ensure model interpretability throughout all steps for developers (mentioned in Sects. 3.2 and 3.3). The latter also enables a thorough analysis of incorrect outcomes. In addition, we aim to explore the potential enhancement of an NL-to-Code model through a simplified approach. Consequently, we refrained from using complex LLMs for query decomposition and chunk-to-sub-snippet mapping, and only employed them for code generation.

Furthermore, given that the prompting technique can influence the quality of results in generative AI models (mentioned in Sect. 1), our approach is to assist users with standard, straightforward prompts, delegating strategy planning and reasoning to the underlying mechanism. Moreover, most generative AI models impose constraints on prompt length or context window size (i.e. the number of tokens processed simultaneously), restricting the integration of historical corrected codes.

LLM with chunking instruction. However, to complement the preceding evaluation, we conducted an additional experiment utilizing GPT-3.5-Turbo-0125Footnote 31 for translating NL queries to Python, incorporating our decomposition strategy as task descriptions. This experiment assessed the effectiveness of the employed models in analyzing query chunks and NL-code mapping. Consequently, we only considered scenarios with multi-chunk queries and a non-empty correction data-store from the collected test cases (i.e. \(39.4\%\) of the total cases).

Listing 3 displays the query template utilized in this experiment. Queries from the correction data-store, serving as user-approved cases, are appended after the input NL query. The chunking strategy is outlined in lines 5 to 9. We denote the model utilizing GPT-3.5 alongside our chunking strategy as GPT35Prompt.

Listing 3 Query template for GPT35Prompt
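Since Listing 3 is not reproduced verbatim here, the sketch below is a hypothetical reconstruction of how such a prompt could be assembled: the chunking strategy is stated as task instructions and the user-approved queries from the correction data-store are appended after the input NL query. The instruction wording and helper names are assumptions, not the exact template.

# Hypothetical sketch of a GPT35Prompt-style prompt (not the exact Listing 3).
def build_prompt(nl_query, approved_examples):
    instructions = (
        "Translate the request into a single Python function.\n"
        "1. Split the request into chunks, one per sub-task.\n"
        "2. For each chunk, reuse the code of a similar approved query if one exists.\n"
        "3. Keep the identifier names used in the approved code.\n"
        "4. Combine the sub-snippets into the final function.\n"
    )
    examples = "\n".join(
        f"Approved query: {query}\nApproved code:\n{code}\n"
        for query, code in approved_examples
    )
    return f"{instructions}\nRequest: {nl_query}\n\n{examples}"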

Result assessment. Table 9 presents the correct outcome ratios and CodeBLEU scores of all models across test cases featuring multi-chunk queries and a non-empty correction data-store. The results reveal that GPT35Prompt underperforms the other models in terms of correct outcome ratio, lagging behind CodeGen, CodeGenE, and CodeGenC by \(20.9\%\), \(14.5\%\), and \(14.6\%\), respectively. In terms of CodeBLEU score, GPT35Prompt only surpasses CodeGen, by \(10.9\%\), while lagging behind CodeGenE and CodeGenC by \(8.5\%\) and \(10.3\%\), respectively.

Table 9 Correct outcome ratio and CodeBLEU for each model over test cases with multi-chunk queries and non-empty correction data-store

Further analysis, depicted in Fig. 9, confirms the inferior performance of GPT35Prompt compared to the other models. Specifically, where CodeGenC (our approach) achieves correct outcomes, CodeGen, GPT35Prompt, and CodeGenE still produce \(8.6\%\), \(29.3\%\), and \(16.4\%\) incorrect outcomes, respectively. In cases where CodeGen produces incorrect results, GPT35Prompt, CodeGenE, and CodeGenC rectify \(56.2\%\), \(68.7\%\), and \(75\%\) of these instances.

Fig. 9 Correct outcome ratio for certain cases with GPT35Prompt taken into account

Brief analysis. Although the underlying LLMs of GPT35Prompt and CodeGenE differ only slightly (GPT-3.5-Turbo-0125 vs. GPT-3.5-Turbo-0301), the two models employ distinct prompting templates, leading to notable disparities in generating accurate final codes. This underscores the significance of prompting techniques for result quality. However, comparing prompting techniques is beyond the scope of our study.

We briefly examined the failed cases of GPT35Prompt and discovered that it exhibits shortcomings similar to those of CodeGenE (e.g. overlooking or becoming confused by additional information, as shown in Table 5). Additionally, \(55.1\%\) of the incorrect outcomes stem from GPT35Prompt generating code that calls functions defined in queries from the correction data-store without including these function definitions in the final code. Even after adjusting the prompt template in Listing 3 to explicitly address this issueFootnote 32, the incorrect outcomes persist.
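For illustration, a hypothetical instance of this failure mode might look as follows; the query, function names, and code are invented for the example.

# Hypothetical example of the described failure: the generated code calls
# `multiply_list`, a function defined only in a correction data-store query,
# but its definition is never included, so the snippet raises a NameError.
def product_of_squares(numbers):
    squares = [n * n for n in numbers]
    return multiply_list(squares)   # NameError: multiply_list is not defined here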

It is worth noting that we consider NL-to-Code generation individually for each query. The corrected codes refer to preceding corrections, but these are not available to users at the moment the prompts are executed. An enhancement for this matter is discussed in Sect. 6.2. For simplicity, we exclude the analysis of GPT35Prompt results by individual difficulty and complexity levels.

Ultimately, we anticipate that advanced prompting techniques, such as chain-of-thought (Wei et al 2022) and tree of thoughts (Yao et al 2023), could improve the LLM outcomes. Nonetheless, despite detailed strategy descriptions, the inherent black-box nature of LLMs still hinders a thorough analysis of unexpected results, making it challenging to pinpoint which step in the strategy description causes the failed cases.

6 Discussion

In this section, we discuss threats to the validity of our experiments, as well as challenges and potential enhancements for our methodology.

6.1 Threats to validity

We analyze the threats to the validity of our work as follows:

Test suite. A custom test suite was developed for the experiments due to the absence of a suitable existing one. Though our dataset is not as extensive as those used for AI model training, it sufficiently demonstrates the utility of our methodology. Nevertheless, evaluation on an official benchmark would further strengthen the assessment of the proposed approach. In future work, we intend to incorporate more complex test cases, possibly by refining Q&As from programming forums. Furthermore, the lack of token probability logs (as provided by the Codex model) in the responses of the ChatCompletion feature (GPT-3.5-Turbo-0301) raises questions about whether the code returned in the first response is the most probable one.

Language specificity. The algorithm for mapping NL chunks and code snippets in the Code building step is currently implemented exclusively for Python. However, the identification of code token types is based on AST analysis and token relationships, which vary only slightly across programming languages. Moreover, the algorithm focuses on critical token types shared among programming languages, such as variable definition and usage. Determining these token types in other languages (e.g. Java) is even less complicated than for Python, due to Python's dynamic typing. Therefore, we anticipate that our results will be applicable to other programming languages. Additionally, it is worth mentioning that the parser used in the Query chunking step is specific to the English language. Nonetheless, multilingual NLP is outside the scope of this paper.
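As a minimal sketch of how such token types can be derived for Python, the snippet below uses the standard ast module, which marks variable definitions with a Store context and usages with a Load context; the function name is illustrative.

import ast

# Minimal sketch: extract variable definitions (Store) and usages (Load)
# from a Python snippet via AST analysis.
def variable_defs_and_uses(code: str):
    defs, uses = set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                defs.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                uses.add(node.id)
    return defs, uses

# Example: ({'total'}, {'a', 'b'})
print(variable_defs_and_uses("total = a + b"))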

Model comparison. Our experiments employ GPT-3.5-Turbo-0301, which has demonstrated significant advancements in NLP tasks. However, since it is a beta version and subject to frequent updates, minor adjustments may be necessary to accommodate changes in its APIs. Furthermore, due to the lack of directly comparable models, we compare our methodology against extended input queries on GPT-3.5-Turbo-0301. We expect that comparisons with other approaches that utilize a chunking method would further validate the concept of our methodology.

Evaluation metrics. Besides manually examining the validity of the generated code, we adopt CodeBLEU as the evaluation metric due to its popularity in code generation models. Although ChrF has been proposed as an alternative (Evtikhiev et al 2023), it does not fully consider the specifics of working with source code. As our experiments prioritize the syntax of the generated code (as discussed in Sect. 4.3), CodeBLEU with the mentioned settings remains suitable for our purposes.

6.2 Challenges and potential enhancements

Given the novelty of our proposed methodology, we outline below the challenges encountered while developing the approach, alongside potential improvements that can make our concept applicable to more intricate use cases.

6.2.1 Scalability support

Multi-users and large datasets. To illustrate the utility of our methodology, we collected user feedback in a dictionary, with the embedding values of the input queries as keys and the corrected code snippets as corresponding values. Similar queries for each input are then retrieved using the KNN technique, by comparing the similarity between the input and all existing queries in the data-store. This simple setup serves its purpose of exhibiting the advantages of integrating user feedback into generative AI models without re-training. However, adapting the method to multi-user systems and large datasets necessitates upgrading the correction data-store structure.
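The following is a minimal sketch of such a data-store with KNN retrieval by cosine similarity; the embed callable stands for any sentence-embedding function, and the k and threshold values are assumptions rather than the paper's settings.

import numpy as np

# Minimal sketch of the correction data-store described above: query embeddings
# as keys, corrected snippets as values, and a KNN lookup by cosine similarity.
class CorrectionDataStore:
    def __init__(self, embed):
        self.embed = embed
        self.entries = []  # list of (embedding, nl_query, corrected_code)

    def add(self, nl_query, corrected_code):
        self.entries.append((self.embed(nl_query), nl_query, corrected_code))

    def nearest(self, nl_query, k=3, threshold=0.8):
        q = self.embed(nl_query)
        scored = []
        for emb, text, code in self.entries:
            sim = float(np.dot(q, emb) / (np.linalg.norm(q) * np.linalg.norm(emb)))
            scored.append((sim, text, code))
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(text, code) for sim, text, code in scored[:k] if sim >= threshold]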

In particular, users usually follow their own naming patterns for identifiers (while adhering to coding conventions), which requires correction information to be stored separately for individual users or shared only within user groups. Furthermore, a function generated from an input query can be adopted multiple times at different locations within a program, each with a distinct set of variable names. Consequently, various versions of function customization should be stored, instead of keeping a single record for each query and overriding previous corrections.

Dynamic Sparse Distributed Memory. The presence of numerous users can result in data expansion, necessitating scalability features in the correction data-store architecture. To address this, a potential solution is to employ Dynamic Sparse Distributed Memory (DSDM), introduced by Pourcel et al (2022) as an extension of Sparse Distributed Memory (Kanerva 1992).

DSDM begins with an empty memory space and incrementally adds new address nodes based on the input patterns, a dynamic write radius, and the current state of the memory space. Query content is retrieved from specific memory nodes using a softmin function that considers the distance between the query and the other query addresses. Integrating DSDM into the One-shot Correction approach may enhance the correction data-store's capacity and mitigate scalability challenges.
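The sketch below illustrates a softmin-style read in the spirit of DSDM: stored contents are blended with weights that decrease as the distance between the query address and each node address grows. The beta parameter and the array layout are illustrative assumptions, not values from Pourcel et al (2022).

import numpy as np

# Softmin-weighted read over memory nodes (illustrative sketch, not DSDM itself).
# node_addresses: (N, d) array, node_contents: (N, c) array.
def softmin_read(query_address, node_addresses, node_contents, beta=1.0):
    distances = np.linalg.norm(node_addresses - query_address, axis=1)
    weights = np.exp(-beta * distances)   # closer nodes receive larger weights
    weights /= weights.sum()
    return weights @ node_contents        # weighted combination of stored contents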

6.2.2 Flexible rule selection for code building

Although we deployed a configuration file (outlined later in Listing 4) to centrally manage the rules for refining sub-snippets, the inclusion of rules for renaming identifiers, determining parameters for the final code, and handling multi-input queries would be beneficial. Moreover, a flexible selection mechanism for these rules should be employed, based on the input query and the corrected codes from similar queries.

Identifier renaming For instance, when renaming identifiers within combined sub-snippets by prioritizing the last statement (Sect. 3.4), situations arise where the final code lacks the desired names compared to the corrected code. This occurs because the desired names initially appear at the top of the statement list but are subsequently replaced by identifier names from the statements below. Hence, the renaming rule (top-down or bottom-up) should be activated flexibly, based on the positions of the chunks in the input query that receive similar queries from the correction data-store, as sketched below.
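A minimal sketch of such a selectable rule follows. It assumes each sub-snippet is summarized by the identifier it defines and the identifier it expects from its predecessor, which is a simplification of the actual code-building step; the data shape and field names are assumptions.

# Illustrative sketch of a direction-selectable renaming rule (not the paper's
# implementation). Each sub-snippet is summarized as a dict with the identifier
# it defines ("defines") and the identifier it expects from the previous
# sub-snippet ("uses").
def unify_identifiers(sub_snippets, direction="bottom_up"):
    renames = {}
    for prev, curr in zip(sub_snippets, sub_snippets[1:]):
        if direction == "bottom_up":
            # prioritize the later statement: its expected name wins
            renames[prev["defines"]] = curr["uses"]
        else:
            # top-down: the earlier statement's defined name wins
            renames[curr["uses"]] = prev["defines"]
    return renames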

Furthermore, to exemplify the proposed chunking concept, we streamlined the renaming process by assuming that identifiers defined in one statement are directly utilized in the subsequent statement. A potential enhancement to relax this assumption involves (i) preserving the data-flow of each variable in every code snippet, (ii) analyzing the purpose of each variable definition and usage, and (iii) bridging the data-flow gaps between code snippets. These steps may require the NL chunks, their associated code snippets, and the input query as inputs, suggesting the need for a more intricate rule or approach.

Parameter determination As we aim to generate final codes comprising code snippets enclosed within a function definition, together with the requisite import statements, the current parameter identification rule for the final function suffices to illustrate the method's concept. However, if the input query requests multiple functions or omits this requirement, the rule should be adjusted accordingly; this is technically feasible by identifying the scope of variables in addition to their definitions and usages.
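As an approximation of such a rule, the sketch below treats names that are read but never assigned (and are neither builtins nor imported modules) as parameters of the final function; this is an illustrative heuristic, not the paper's exact rule.

import ast
import builtins

# Illustrative parameter-determination heuristic: free variables of the combined
# snippet become parameters of the final function.
def infer_parameters(code: str):
    assigned, loaded, imported = set(), set(), set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Name):
            (assigned if isinstance(node.ctx, ast.Store) else loaded).add(node.id)
        elif isinstance(node, (ast.Import, ast.ImportFrom)):
            imported |= {alias.asname or alias.name.split(".")[0] for alias in node.names}
    return sorted(loaded - assigned - imported - set(dir(builtins)))

# Example: ['prices'] — the only free variable becomes a parameter.
print(infer_parameters("total = sum(prices)\nprint(total)"))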

Multiple input queries Finally, our proposed approach currently addresses NL-to-Code cases individually, as depicted in the GUI in Sect. 7. However, when applying this method to a code file containing existing NL queries and their relevant code snippets, or when dealing with inputs featuring multiple NL queries, previously generated code should be taken into account when constructing the outcomes.

In such instances, a rule should prioritize suggested code snippets that use functions defined by prior queries over code snippets that redefine these functions. Preceding queries and their codes can be directly injected into the input query, forming a multi-turn programming pipeline similar to the one described by Nijkamp et al (2022).

7 One-shot correction GUI

In this section, we briefly introduce our preliminary GUIFootnote 33 built on the One-shot Correction methodology. The GUI exhibits the practicality of our proposed concept in simplifying code customization and assessment for users. The main features of the GUI are demonstrated with examples in Appendix A.1.

We draw inspiration from the work of Su et al (2018) on building an application with fine-grained user interaction for code modification. For each code token in a returned code, we determine its token type and a list of alternative values, extracted from other suggested codes for the same token type. Figure 10 presents the general scenario of using the GUI.

Fig. 10 General scenario of using the One-shot Correction GUI

After initiating a search with an input NL query, users can perform the following actions: (1) choose the displayed code from a list of returned code snippets, (2.1) select a code token under Suggested code by clicking on it and (2.2) change its value using the list of substitute values, (3) type a new value for the code token if the preferred value is not in the list from step (2.2), (4) directly modify the code if restructuring is necessary, and (5) save the modification for subsequent inquiries.

By default, user modification is integrated with both options, GPT-3.5 and One-shot Correction, which correspond to the CodeGenE and CodeGenC models mentioned in the previous sections. Deselecting these options results in a code snippet produced solely by the CodeGen model (i.e. without user feedback). In addition, for each code token, we provide its token type as extra information for users.

Notably, the highlight matching option associates input query chunks with the sub-snippet(s) of the displayed code in the One-shot Correction case. For the other cases (i.e. the standalone code generator and extended input), the whole input query and its code are marked without separation (see Appendix A.1). We expect that this explicit mapping can help users comprehend and validate the generated code.

Additionally, by modifying the configuration file (Listing 4), users can manipulate the state of the correction data-store (line 8), filter important code token types (lines 9–11), and adjust hyperparameters used in each model (lines 2–5). We published these setting values together with our source code.Footnote 34

Listing 4 Configuration file

In particular, possible values for corrt_ds include "all" (all gathered queries), "all_x" (the collection of all x-chunk queries, \(x \in [1, 2, 3]\)), "all_x_excl" (all x-chunk queries excluding the current target query), and "task_x_y" (the x-chunk query with index y). Appendix A.1 presents an example of code generation with two different states of the correction data-store. Furthermore, to prioritize specific token types, users can simply enable or disable the corresponding flag of each token type (Listing 4, line 11). These types are determined based on a study by Le et al (2023).
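As an illustration of how these values could be interpreted, the sketch below resolves a corrt_ds setting against the collected queries; the data structure all_queries (a mapping from a (chunk count, index) pair to an NL query) and the function name are assumptions for this example, not part of Listing 4 or the released code.

# Illustrative sketch: resolving the documented `corrt_ds` values against the
# collected queries. `all_queries` maps (chunk_count, index) -> NL query.
def resolve_corrt_ds(value, all_queries, current_query=None):
    if value == "all":
        return list(all_queries.values())
    if value.startswith("all_") and value.endswith("_excl"):
        x = int(value.split("_")[1])
        return [q for (chunks, _), q in all_queries.items()
                if chunks == x and q != current_query]
    if value.startswith("all_"):
        x = int(value.split("_")[1])
        return [q for (chunks, _), q in all_queries.items() if chunks == x]
    if value.startswith("task_"):
        _, x, y = value.split("_")
        return [all_queries[(int(x), int(y))]]
    raise ValueError(f"Unknown corrt_ds value: {value}")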

8 Conclusions

We proposed a methodology named One-shot Correction to incorporate user feedback into generative AI models without re-training. The evaluation results illustrate competitive performance compared to other models, despite the challenges inherent in NLP tasks. Our methodology enables a thorough examination of unexpected results through straightforward approaches and facilitates insights for potential improvements. Additionally, we demonstrated that user feedback significantly enhances code translation models without re-training. We published the test suite used in our experiments, the evaluation results, and the source code of the methodology.Footnote 35 A preliminary GUI with fine-grained user interaction for code modification was also implemented to demonstrate the utility of our proposed approach in practice. Further work encompasses extending the method to other programming languages and large datasets, which includes upgrading the correction data-store structure for scalability (e.g. using Dynamic Sparse Distributed Memory). Furthermore, exploring flexible rule selection at each step of the methodology for complex inquiries is a promising direction.