1 Introduction

In recent years, the discipline of Artificial Intelligence (AI) has experienced a surge of research growth. Nowhere is this more evident than in the field of Natural Language Processing (NLP), especially in research involving chatbots and conversational agents [1, 2]. Indeed, generative artificial intelligence (GenAI) systems based on natural language inputs are producing a large range of content types, including text, images, audio, and video [3, 4]. Presently, GenAI systems are being utilised to address common programming tasks such as summarisation, code review and synthesis, as well as error repair and debugging [5]. One such GenAI, ChatGPT 3.5 by OpenAI, is identified as having huge potential for examining source code, proposing changes, and generating code [1, 5]. This development has wide-ranging applications for researchers who wish to augment their research with computational-based methods. There are other GenAIs in use, each with their own strengths and weaknesses (please see [6] for a comparative analysis of OpenAI ChatGPT 3.5, Microsoft Bing Chat, and Google Bard). Additionally, see [7] for an extensive list of the large language models available and their comparisons. Indeed, to highlight the speed of developments within this domain, further evolutions in GenAIs have occurred during the process of writing this article. One significant tool is AutoGPT, which seeks to make the use of ChatGPT autonomous [8]. AutoGPT, differing from other GenAIs, ‘automatically generates prompts in line with the given command and works until it reaches the result, without the need for users to add any input’ [8]. Another significant GenAI development is MetaGPT. On collaborative software engineering benchmarks, MetaGPT is claimed to generate more coherent solutions than previous chat-based systems [9].

Due to space limitations this article will focus on OpenAI’s ChatGPT 3.5. However, regardless of the GenAI employed, it is important to recognise when they are utilised for creating code for computational-based research, as there are consequences when it comes to acknowledging their use.

When GenAIs are utilised in research, many prospective journals have the requirement that (1) their use is recognised, and (2) that the authors acknowledge that what is generated is accurate [10, 11]. Indeed, the International Committee of Medical Journal Editors (ICMJE) outlines that GenAIs ‘should not be listed as authors because they cannot be responsible for the accuracy, integrity, and originality of the work, and these responsibilities are required for authorship’ [12]. That said, recent research in [13] did illustrate that ChatGPT 3.0 could technically pass the criteria to be listed as an author; however, to accommodate Springer Nature policies it was removed prior to publication.

Nevertheless, authors who utilise GenAIs in their research must acknowledge and stand by what is generated. This is because GenAIs have been known to fabricate responses, or to ‘hallucinate.’ To date, a large amount of work has been conducted on how GenAIs, including ChatGPT, can ‘hallucinate’ text-based responses [14,15,16]. In the same vein, GenAIs can ‘hallucinate’ code; however, hallucinations in code take the guise of ‘functioning code’ that does not produce results as intended due to ‘silent errors’ [17]. Therefore, when utilising GenAIs for coding, an unintentional consequence for the author could be that although the code is functional, it may not necessarily be true.

The purpose of this article is to discuss the role that GenAIs have for researchers who wish to augment their research with computer programming-based methods. To this end, ChatGPT 3.5 is used as an example of how GenAIs can be used to review and refine, correct errors, and script new code for research projects. The article will begin with a brief introduction to AI in general and then GenAI systems specifically, and then discuss how they can be utilised for creating code. For the purpose of this article, it will discuss how GenAIs such as ChatGPT 3.5 can be used to compile code to be utilised in machine learning applications such as Latent Dirichlet Allocation Topic Models (LDA-TM), specifically for Hyper-parameter tuning, in this instance the Random State Hyper-parameter. ChatGPT was chosen due to the author's familiarity with it. As evident above, there is an array of GenAIs available to researchers, each with their own strengths and drawbacks. This article does not seek to assign primacy to one above the others, but rather to offer an approach to synthesise code, script code, and correct errors with GenAIs (in this case ChatGPT 3.5), and what to be aware of in this process.

Additionally, an LDA-TM was the machine learning technique (MLT) chosen as this article seeks to extend the code published in [18]. The code published in [18] is part of a new methodology utilising LDA-TMs to synthesise and abstract the data gathered during a systematic literature review (SLR). The importance of defining an appropriate Random State Hyper-parameter is to ensure the repeatability of the work being undertaken. As SLRs are renowned for their rigorous, transparent, and repeatable approaches to gathering, appraising, and synthesising data, the role that an LDA-TM can play in strengthening these elements is an important development [18]. In addition to an LDA-TM, the research conducted in [18] also includes approaches to enhance other stages of an SLR with AI and MLTs, and is part of an emergent trend that is infusing AI and MLTs within the SLR process [19].

The subsequent section considers the technical advantages and pitfalls associated with utilising GenAIs alongside legal and ethical concerns regarding the use of GenAIs for coding. This is with particular regard to the licencing restrictions of source code that GenAIs can utilise in their responses. Next, the methods section will present the code developed with the help of ChatGPT 3.5 for the purpose of identifying an appropriate Random State Hyper-parameter to be employed in a LDA-TM.

Presented as a narrow case study, the methods section illustrates how ChatGPT 3.5 can be utilised to refine code previously published in [18]. In Stage One it is prompted to the task at hand before being told that it will be manually fed snippets of the code published in [18]. ChatGPT 3.5 then provides an explanation of each snippet as it is prompted. In Stage Two it is asked to synthesise this code.

The final stage of the case study illustrates how ChatGPT 3.5 can be asked to script new code and correct the errors encountered in running it. After being presented with the new scripted, and confirmed code from Stage Two, this stage highlights how ChatGPT 3.5 can be used to script code for identifying a Random State Hyper-parameter for use in an LDA-TM.

The ensuing results section will present the results from the newly scripted code. Next, the discussion section will discuss the outcomes from the generated code and contextualise this article in regard to covering issues surrounding submitting GenAI produced code for scrutiny in the context of peer review as well as for post publication. The article will then address the limitations of this work alongside future research opportunities, and end with a concluding section.

2 Theoretical background

2.1 Artificial intelligence and generative artificial intelligence

As evidenced by the industrial revolution, humankind goes through periods of explosive innovation that transform numerous manual tasks that have existed for decades [20]. So transformative has AI been across such a variety of disciplines, that there exist numerous definitions and contexts of what it constitutes and where it is applied [21,22,23]. Defined as ‘the study of agents that receive percepts from the environment and perform actions’ [24], AI has the ability to mimic human cognitive functions such as speech, learning, and problem solving [20]. Far from a new occurrence, AI is already prevalent in modern society. From driverless vehicles to chatbots, gaming, language translation, art and music production, text prediction, and even medical diagnosis, AI technologies permeate society [25, 26]. Indeed, humanity has maintained a focus on AI ever since the question posed by Alan Turing, ‘Can digital computers think?’ [27]. Not only an instrument of data and computer scientists, AI is also playing an increasingly larger role in other areas of academic research, and there are many tools available to aid researchers in not only speeding up projects, but also in reducing their costs through hours saved [28]. A way for researchers to access the tools to do so is through GenAI systems.

GenAI refers to an AI that can produce its own content, in contrast to systems that only analyse or act upon existing data, such as expert systems [29]. GenAI models normally have several billion, sometimes hundreds of billions, of parameters, and training them necessitates vast amounts of data and computing power [30]. Although only released in 2022, one of the most easily identified GenAIs is ChatGPT 3.5.

A GenAI model developed by OpenAI, ChatGPT 3.5 can generate writing that closely matches that of a human [31]. Following its release in November 2022, it reached one million users in five days, illustrating the application potential of GenAIs [32]. Despite the magnitude of forward leaps in their capabilities, the application of GenAIs such as ChatGPT 3.5 is a nascent field in many areas of research [3]. As a sophisticated chatbot, ChatGPT 3.5 is capable of fulfilling a wide range of text-based requests, from writing letters to more complex tasks such as literature review assistance, text generation, data analysis, language translation, automated summarisation, and answering questions [33]. Since an upgrade to include coding was introduced, one area where ChatGPT 3.5 is making drastic changes is in developing MLTs for use in research [34, 35].

2.2 Machine learning techniques and topic modelling

MLTs are part of the field of AI and are employed to automatically ‘learn’ to undertake a specific task through statistical modelling of data sets, usually sizable ones [36]. There are three types of machine learning: unsupervised, supervised, and semi-supervised. In unsupervised learning, which is the focus of the code in this article, the algorithm identifies natural correlations and classes within the uploaded data with no reference to any outcomes [37]. There are many different types of unsupervised machine learning algorithms: k-means, hierarchical clustering, and principal component analysis, to name a few [38, 39]. For a comprehensive and contemporary systematic review of the supervised and unsupervised machine learning algorithms that are available, please see [40]. The unsupervised MLT that is the focus of this article is Topic Modelling, specifically the LDA-TM.
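For readers unfamiliar with unsupervised learning, the sketch below (not drawn from this article's code) illustrates the idea with one of the algorithms named above, k-means: the algorithm groups points purely from structure in the data, with no outcome labels supplied. The data points and cluster count are hypothetical.

```python
# Illustrative sketch: unsupervised k-means clustering with scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

# Six 2-D points forming two natural groups; no outcome labels are provided.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

kmeans = KMeans(n_clusters=2, random_state=42, n_init=10).fit(X)

# The algorithm assigns each point to a cluster purely from the data's structure.
print(kmeans.labels_)           # cluster assignments (label order may swap)
print(kmeans.cluster_centers_)  # approximate group centres
```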

A three-level hierarchical Bayesian model, LDA is a generative probabilistic model utilised to analyse collections of text-based data [41]. In the model, a topic is defined as a distribution over a set vocabulary [42]. As such, each theme or ‘topic’ is a ‘distribution over all observed words in the corpus, such that words that are strongly associated with the document's dominant topics have a higher chance of being selected’ [43]. Therefore, the most frequently occurring words within a topic will present a general overview of the topic [42].

Aside from LDA, there are many different ways to model topics: Non-Negative Matrix Factorization [44]; Latent Semantic Analysis [45]; Parallel Latent Dirichlet Allocation [46]; and the Pachinko Allocation Model [47]. Each of these methods of Topic Modelling creates topics based on patterns of (co-)occurrence of words in the text that is analysed [48]. However, and importantly, although the topics in models are automatically coded, it is up to the researcher to interpret the results and determine whether or not they are useful for the research being conducted [48]. Not solely in use within the computer sciences, Topic Modelling is utilised across a variety of disciplines [49, 50]. More recently, Topic Modelling has been highlighted as a useful tool to abstract the data gathered during an SLR [18]. It is from this work that the original code is first refined and then extended using ChatGPT 3.5 to determine an appropriate Random State to use in the LDA-TM.

2.3 Hyper-parameters and the ‘Random State’

Finding the appropriate Hyper-parameter (or ‘Hyper-parameter tuning’) is a vital step in machine learning practice [51]. In LDA-TMs, Hyper-parameters are not latent variables in the model but instead are the simplest parameters of the Topic Model [52]. There are several Hyper-parameters involved in LDA-TMs, and in Topic Modelling in general, including the Alpha (α), Beta (β), Gamma (γ), and Random State [53]. Alpha directs the distribution of topics over documents, Beta denotes the distribution of words over topics, Gamma is the concentration value in Dirichlet Processes, and Random State is a measure used to improve efficiency and to ensure repeatability [53]. It is very simple to set a Hyper-parameter and forget it, especially if the model is generating good results [52]. However, it is advised to sample different Hyper-parameters to improve the quantity/quality of the results [52]. One such Hyper-parameter that can be easily manipulated is the Random State.

The Random State Hyper-parameter in an LDA-TM is a measure used to improve efficiency and to ensure repeatability [53]. There are many ways to ‘tune’ Hyper-parameters for use in an LDA-TM. Indeed, the popular Python library scikit-learn comes equipped with default parameters [54]. However, as the outcomes and interpretations of LDA-TMs require human interpretation [48], it is easy to scroll through Random States and arbitrarily select one. For the purpose of this article, alongside streamlining the code set out in [18], ChatGPT 3.5 is tasked with devising a method for selecting an optimal Random State for the LDA-TM.

2.4 Reviewing, debugging, and creating code with ChatGPT 3.5

ChatGPT 3.5 has the potential to generate code in several programming languages, and for numerous purposes [55]. It has even been claimed that the coding results obtained through ChatGPT 3.5 are not only outstanding, but will replace Stack Overflow as the place where software developers and coders go for advice [56]. ChatGPT 3.5 can be utilised for several different coding needs: debugging, code review and revision, correcting errors, and scripting new code [5]. Indeed, when used for debugging and error correction, ChatGPT 3.5 can process code in several ways to locate the issues within, and then provide recommendations to resolve the errors found [57]. Recently, [58] sought to compare and contrast the debugging prowess of ChatGPT 3.5 against other benchmark debugging software. They found that ChatGPT 3.5 performs on par with debugging software Codex and Deep Learning based Automated Program Repair (DL-APR) on standard benchmarked sets. Importantly, it greatly outclasses standard APR methods (19 vs. 7 out of 40 bugs fixed) [58]. Another area where ChatGPT 3.5 is helping to automate coding is for code reviewing.

Code review is a day-to-day task for software developers. Recently, large language models have been investigated to determine their ability to automate this process [59]. ChatGPT 3.5 has been highlighted as one such tool to utilise in reviewing and synthesising code in both academic and commercial areas of work [60]. As important as debugging and error corrections are, it is in the area of code scripting, where ChatGPT 3.5 (and other GenAIs) is spurring increased attention.

When generating code, ChatGPT 3.5 has been shown to perform impressively. [61] evaluated the coding ability of ChatGPT 3.5 on both the Mostly Basic Programming Problems (MBPP) [62] and HumanEval [63] datasets and obtained favourable results. In addition, [64] performed a series of tests on the ability of ChatGPT 3.5 to both review and generate code. When tasked with generating code in the computer programming language Python (utilising the NumPy and Pandas libraries), ChatGPT 3.5 produced the correct performing code in ‘eight of the ten cases’ [64].

The makers of ChatGPT 3.5, OpenAI, have partnered with other organisations to infuse AI into the coding process. In 2022, Microsoft-owned GitHub and OpenAI introduced GitHub Copilot, an “AI pair programmer” for Visual Studio Code, Neovim, and JetBrains IDEs [65]. Much like utilising ChatGPT 3.5 for coding, GitHub Copilot is a new innovation in computer programming. However, reviews have returned mixed results. Recently, [66] determined that, ‘Copilot can become an asset for experts, but a liability for novice developers.’ This is due to the view that although Copilot can provide effective solutions, there are still bugs associated with them. However, they are easier to resolve than human-caused coding bugs [66]. Another critique is that the code that Copilot is trained on (the GitHub database) is ‘buggy’ in places, therefore introducing flaws into generated results from the outset [67]. For further research regarding the quality of code produced by different GenAIs, please see [68, 69].

As evident, ChatGPT 3.5 holds great promise for researchers who wish to augment their research with computer programming-based tools and techniques. However, there are several drawbacks and pitfalls that researchers should be aware of.

2.4.1 Downside of coding with ChatGPT 3.5

An issue when discussing the usefulness of AI generated code is the correctness of the generated code [70]. This is partly because, while ChatGPT 3.5 can understand and analyse code, it does not have a deep understanding of the wider setting in which the code is being employed, and so may not have the same awareness as a human programmer [57]. Another problem is that code may be ‘functionally correct’, insofar as it runs; however, it may not be true [70]. Finally, as mentioned in the previous section, training GenAIs on data that is itself potentially ‘buggy’ can lead to generated responses incorporating bugs themselves [67]. Aside from technical issues, researchers should also be aware of the legal and ethical pitfalls of using GenAIs such as ChatGPT 3.5 for coding.

The prevalence with which intelligent systems are currently influencing our society raises progressively more compelling ethical and legal queries [71]. To train GenAI systems, vast amounts of data are taken from the internet. As such, some of the data could be subject to copyright among other protections [72]. Code can be subject to copyright protections. A recent development regarding GenAIs and their use of copyrighted code is currently making its way through court in the United States. The case centres on the training by GitHub, OpenAI, and Microsoft (the defendants) of their Copilot tool on data from the GitHub repository [73]. In May 2023, efforts by the defendants to have the case dismissed were denied [74]. At the time of writing, this legal quandary is yet to be resolved.

Alongside the legal issue of using GenAIs that have been trained on open-source available code is an ethical issue. Many data scientists have made their code available for free for the wider research community to utilise. Therefore, the commercialisation of their code could not only be in breach of copyright law, but also raises the ethical question of whether that code would have been made available in the first instance [75]. Another area to be aware of when prompting all GenAIs (not just ChatGPT) is the emergent discipline of prompt engineering itself. Prompt engineering is a nascent field; as such, the rigor behind it is also nascent [76, 77]. Therefore, there are many pitfalls to be aware of when crafting prompts (bias reinforcement, overfitting, unintended side effects, and model limitations, for example) [78]. Fortunately, there are a number of guides that seek to aid GenAI users in formulating their prompts. Due to space limits, it is not possible to provide a full list of examples in this article. However, for contemporary guides and frameworks please see [76, 79].

There are as many benefits as there are drawbacks for researchers utilising GenAIs such as ChatGPT 3.5 to code in research projects. The next section will present a narrow case study to set out the steps taken to accomplish three coding tasks with ChatGPT 3.5: (Stage One) prompting ChatGPT 3.5 with the accredited and published code from [18]; (Stage Two) prompting ChatGPT 3.5 to streamline the code from [18]; and (Stage Three) prompting ChatGPT 3.5 to script new code to define a Random State to utilise in the LDA-TM and to correct issues that were encountered during the scripting process.

3 Methods

In this section, the prompts that were input into ChatGPT 3.5, alongside the responses generated, will be presented as a three-stage case study, albeit with a narrow focus. In the first stage, the original Python code that was developed in [18] was modified and then used to prompt ChatGPT 3.5. This also provided an opportunity for ChatGPT 3.5 to review the input code. In the second stage of the case study, ChatGPT 3.5 was prompted to streamline the code to make it more concise. Finally, in the third stage, ChatGPT 3.5 was asked to script new code to determine the best Random State to utilise for the LDA-TM and to correct any errors encountered when running the code. As identifying an appropriate Random State in an LDA-TM is essential for the repeatability of the results produced, this is an important Hyper-parameter within the model. The data used for this section is preliminary ‘Policy Problems’ data extracted for an SLR on how governance settings can enhance the resilience and sustainability of energy infrastructures. The prepublished research protocol for this work can be found in [80].

3.1 Case study

3.1.1 Stage one: prompting ChatGPT 3.5

In this first stage, ChatGPT 3.5 is prompted with the modified code from (left blank for peer review). This can be seen in Table 1.

Table 1 Prompting ChatGPT 3.5 with the code as appearing in (left blank for peer review) with modifications

3.1.2 Stage two: prompting ChatGPT 3.5 to streamline code

In this stage, ChatGPT 3.5 is asked to streamline all of the code from Stage One. The prompt and streamlined code can be seen in Table 2.

Table 2 Simplified and streamlined code produced by ChatGPT 3.5

3.1.3 Stage three: prompting ChatGPT 3.5 to create code and fix errors

After building the corpus for the model, the next step is to define its Hyper-parameters. Table 3 lists the prompts and responses from ChatGPT 3.5 for writing the code used to define the Random State used in the model. As errors are encountered, it is prompted to correct them.

Table 3 Defining the Random State Hyper-parameter

4 Results

4.1 Stages one and two: code review and synthesis

As seen in Stage One of the case study, ChatGPT 3.5 provides an accurate description of each piece of code that it was prompted with. The synthesised output in Stage Two of the case study provided a cleaner version of the code. To ensure that this is correct, the first 30 Tuples from both code sets were checked and found to be exactly the same. The first 5 Tuples produced were as follows: [(0, 1), (1, 1), (2, 1), (3, 2), (4, 1)].

4.2 Stage three: creating and correcting

The third task that ChatGPT 3.5 was asked to perform in Stage Three, creating code to determine a Random State for an LDA-TM alongside resolving any errors encountered, can be seen in Table 3. Table 3 highlights the explanatory power of ChatGPT 3.5. As soon as the AI is prompted in #2, it responds with a way to implement the request. Following the prompt in #3, a full set of code is produced to respond to the request. It is then only a matter of further prompting the AI with relevant information in #4. In #5, 6, 8, and 9, the corrective power of ChatGPT 3.5 is put on display. Indeed, following some simple prompts, the AI produces functioning code to determine the best Random State to utilise in the LDA-TM. The results of this can be seen in Table 4. The final prompt in Table 3 (#10) asks the AI to produce a graph that plots each Random State and its Log Likelihood. The graph can be seen in Fig. 1.

Table 4 Generated Random States according to Log Likelihood
Fig. 1

Printout of Log Likelihood of 100 Random States
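The general shape of the search that produced these results can be sketched as follows. This is a hedged illustration only: it uses scikit-learn's LatentDirichletAllocation and its score method (an approximate log likelihood) on synthetic data, whereas the code actually generated by ChatGPT 3.5 appears in Table 3.

```python
# Fit one LDA model per candidate Random State, score each by approximate
# log likelihood, and keep the highest-scoring state. Synthetic data only.
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(20, 12))  # toy document-term count matrix

scores = {}
for state in range(10):  # the article sweeps 100 states; 10 keeps this quick
    lda = LatentDirichletAllocation(n_components=3, random_state=state)
    lda.fit(X)
    scores[state] = lda.score(X)  # approximate log likelihood (higher is better)

best_state = max(scores, key=scores.get)
print(f"Best Random State: {best_state} (log likelihood {scores[best_state]:.1f})")
```

Plotting `scores` against the candidate states (e.g. with matplotlib) gives a Fig. 1-style overview of Log Likelihood per Random State.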

5 Discussion

The above methods and results sections highlight how ChatGPT 3.5 can be utilised to review and synthesise, create, and correct code. For all intents and purposes, the GenAI has completed the tasks that it was asked to do. Firstly, it has reorganised and streamlined the code from [18]. Secondly, ChatGPT 3.5 has been able to resolve the errors encountered with running the new code that it produced. However, what is less firm is the code generated to anchor the LDA-TM to the ‘best’ Random State Hyper-parameter.

As discussed earlier, the Hyper-parameter ‘Random State’ is utilised to provide a fixed point in an LDA-TM to ensure repeatability [53]. However, LDA-TMs also need to be interpreted by humans to determine whether or not the results are useful for the research being conducted [48]. When both of these views are taken together, the GenAI has completed its task, as it has defined a number of Random States for a human to review. Furthermore, should one be interpreted as the ‘best’ Random State to utilise, then a transparent and repeatable means of determining this particular Hyper-parameter has been employed. From this point, an author should be able to confidently state that the code produced by ChatGPT 3.5 works as intended, and that they therefore stand by the results generated. However, as highlighted by [70], just because code is ‘functionally correct’, insofar as it runs, it may not be ‘true.’ This is evidenced in Table 3, where ChatGPT 3.5 points out that one might consider ‘[…] other metrics like perplexity or coherence scores to determine the best Random State.’ A way to increase the trust, transparency, and rigor of this process is to submit code developed with ChatGPT 3.5 alongside results for peer review. However, this is just the first step.

As one of the pillars of scientific communication, peer review is indispensable in the creation of scientific inquiry [81]. However, researchers should not just have peer reviewers in mind when submitting GenAI formulated code for review purposes. By submitting full codes and datasets, researchers are ensuring that their work can be subject to inspection and scrutiny by their peers beyond peer review. This in turn can help with assessing what is ‘functional’ code, what is ‘true’, and what is both.

6 Limitations and future research

This article has presented a case study in a limited context and under limited conditions. As such there are limitations to the research conducted. The first clear limitation of this work is pointed out by ChatGPT 3.5 itself, ‘keep in mind that log likelihood is not the only metric you can use […]. Depending on your specific use case, you might consider using other metrics like perplexity or coherence scores to determine the best Random State.’ Log Likelihood was utilised due to the simple fact that it was arbitrarily chosen by ChatGPT 3.5. However, it should also be noted that there is not an agreed upon method to utilise in determining Random State Hyper-parameters [82]. Future research on different approaches to determine a Random State is currently being undertaken by the author.

Prompt engineering also is a nascent field, as such the rigor behind it is also nascent [76, 77], and there are many pitfalls to be aware of when crafting prompts (bias reinforcement, overfitting, unintended side effects, and model limitations for example) [78]. Fortunately, there are a number of guides that seek to aid GenAI users in formulating their prompts [76, 79]. Future research in this area could include a SLR and Meta-Analysis on the available methods.

Additionally, this article utilises a single case study that only investigated ChatGPT 3.5. Unfortunately, due to article word limits, a deeper comparative case study was not possible; however, this opens up the possibility for future research to be conducted in this area. The use of the Random State as the focal point for the case study is also a limitation. There are numerous other MLTs that could have been investigated; however, determining a Random State to extend the LDA-TM created by [18] provided an opportunity to highlight the role that GenAIs such as ChatGPT 3.5 could play.

Finally, utilising a GenAI with manual prompting over newer GenAIs that automate prompting can be a limiting factor as well. However, by choosing to manually upload prompts into ChatGPT 3.5, an extra layer of openness is created. This then allows for readers to test the prompt patterns utilised, making comparison with other GenAIs easier, and is in line with other research conducted [83].

7 Conclusion

This article has discussed the role that ChatGPT 3.5 has for researchers who wish to augment their research with computer programming-based methods. Specifically, it has illustrated how ChatGPT 3.5 can be used to review and refine, correct errors, and create new code for research projects. By presenting a refined version of the code published in [18] as well as GenAI-produced code that aims to determine the best Random State to use in an LDA-TM, this article has illustrated the speed and ease with which this can be accomplished. Alongside the benefits of this method, this article has also pointed out the technical, legal, and ethical issues surrounding its use. In dealing with the ‘function’ versus ‘truth’ of GenAI-produced code, this article advocates for the full publication of GenAI-produced code alongside completed research. This is not only for the purposes of peer review, but also so that work can be adequately reviewed post publication. To partially address some of the ethical issues regarding author attribution arising from GenAIs being trained on free and open-source code repositories, this article suggests retroactive searches once code has been identified.