Introduction

The scientific method is globally recognized as a systematic and logical approach to discovering new knowledge and understanding the natural world; it is the foundation of all scientific inquiry and is critical for the advancement of human knowledge (Gauch 2002). It involves posing a hypothesis, collecting data through observation and experimentation, and analyzing and interpreting the results to reach a conclusion. It thus helps to ensure that research is objective, replicable, and based on empirical evidence. In the field of epidemiology, the scientific method is essential for conducting rigorous and transparent studies that can accurately address public health issues. As in many other types of research, in a typical epidemiologic study the general flowchart of the scientific method (Elm et al. 2008) includes identifying a research question, conducting a literature review, designing the study, collecting and analyzing data, and interpreting the results and their implications.

Therefore, given the importance and complexity of this process, it appears essential to define shared rules, standards, and methodologies to guide the conduct of studies and mitigate the risk of misconduct or errors. In other words, it is crucial that epidemiological studies are conducted transparently and rigorously, following acknowledged guidelines. The importance of this has been underlined by many authors (Adami et al. 2011; Alba et al. 2020; Arroyave et al. 2021) and led to the development of the most widely used guidelines globally: the STrengthening the Reporting of OBservational studies in Epidemiology (STROBE) statement and guidelines. The STROBE guidelines provide a framework for the transparent reporting of observational studies, ensuring that the study's sections (introduction, methods, results, etc.) are accurately and thoroughly reported (Elm et al. 2008). Adhering to the STROBE guidelines can increase the reliability and reproducibility of the study, as well as facilitate the critical appraisal and synthesis of the evidence (Elm et al. 2008).

Artificial intelligence (AI) is a rapidly growing field with applications across a wide variety of domains (Dwivedi et al. 2021). Among these, one where AI has the potential to make a significant impact is clinical research and epidemiology: it could transform how these fields operate and significantly improve our understanding and management of public health and healthcare issues. Both fields rely on the collection and analysis of large amounts of data, making them well suited to the application of AI technologies (Thiébaut et al. 2018; Topol 2019). One aspect in which AI has yet to show its potential is overall support for conducting epidemiological studies. Given the complexity of epidemiological studies, and the increasingly pressing need to speed up and streamline their execution while preserving scientific quality and integrity, it appears necessary to study applications of AI in supporting the overall conduct of epidemiological studies, i.e., support for study design, data analysis, and interpretation of results. Both the tools and the studies that test and validate them are scarce in this field, and some potentially revolutionary tools, namely AI-based transformers, have only recently been introduced (Castelvecchi 2022; Graham 2022; Stokel-Walker 2022).

ChatGPT and AI-based transformers

In recent years, techniques based on deep learning have been driving the AI revolution. Among them, transformers (Abdel-Aty and Gould 2022; Dai et al. 2021) show prominent results in several applications. A transformer is a type of neural network architecture introduced in the paper “Attention is All You Need” by Vaswani and colleagues in 2017 (Vaswani et al. 2017). It is a deep learning model built around the mechanism of self-attention and is particularly adept at processing sequential data, excelling at tasks such as machine translation, language modeling, and text classification. The transformer incorporates attention mechanisms to weigh the significance of various parts of the input and output sequences, enabling it to effectively capture long-range dependencies and generate more coherent text.
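For illustration, the scaled dot-product self-attention described above can be sketched in a few lines of pure Python. This is a minimal, toy sketch (a single attention head with two-dimensional vectors); real transformers use learned projection matrices, multiple heads, and optimized tensor libraries:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats). Each query attends
    over all keys; the output for a query is a weighted average of the
    value vectors, with weights given by the softmaxed similarity scores.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Weighted average of the value vectors
        out = [sum(w * v[i] for w, v in zip(weights, V))
               for i in range(len(V[0]))]
        outputs.append(out)
    return outputs

# Toy example: two tokens with 2-dimensional embeddings.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
attended = scaled_dot_product_attention(Q, K, V)
```

Each query ends up pulled toward the value whose key it most resembles, which is how the mechanism lets every position weigh every other position regardless of distance in the sequence.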

Chat Generative Pre-trained Transformer (ChatGPT) is a recently developed AI model (Cahan and Treutlein 2023). It is a machine learning system capable of generating high-quality scientific text (Brown et al. 2020); it was developed by a team of researchers at OpenAI and has received significant attention in the scientific community due to its impressive performance and potential applications. AI-based transformers such as ChatGPT may represent a significant advance in the field of AI and could impact the way scientific research is conducted and communicated. However, there are also several potential drawbacks to their use for scientific research and publishing, particularly in public health and epidemiology, and their use to support researchers in conducting epidemiological studies has not yet been studied.

Aim of the study

The aim of this study is to describe how AI-based transformer models can assist a researcher in the various steps of an epidemiological study. To do this, we used OpenAI’s ChatGPT to partially simulate an epidemiological study, reformulating the STROBE framework into a list of questions to be answered by the transformer itself. We then qualitatively evaluated the coherence and relevance of the transformer’s outputs and discussed the pros and cons of this innovation in the field of epidemiology.

Methods

In this exploratory study, we assessed how ChatGPT could support an epidemiological study. To do this, we first chose a study to be used as a basis for the simulation and extracted the research question and the specific aim. We then relied on the STROBE checklist as a basis for formulating the questions to be posed to ChatGPT. Specifically, we asked ChatGPT to transform each STROBE checklist item into a question/prompt for itself. We then asked the transformer to answer the questions, to provide support to the researcher in conducting the epidemiological study. Each specific answer was evaluated by three independent senior researchers in terms of coherence and relevance for the purposes of conducting the epidemiological study.

When formulating the various prompts, we used the selected study as a guideline (Subramanian and Kumar 2021). This study by Subramanian and Kumar (2021) analyzed the relationship between the percentage of the population fully vaccinated and new cases of COVID-19 in 68 countries. The study used publicly available COVID-19 data for cross-country analysis. We chose this type of study because of its relatively simple design and straightforward interpretation.

Starting from this article, we extracted the aim of the study, considering this as a purely human activity related to human curiosity to ask questions and to transform them into research questions. The aim of the study served as the basis for contextualizing the use of AI-based transformers and demonstrating their potential usefulness in supporting the conduct of an epidemiological study.

Using STROBE recommendations prospectively

The STROBE guidelines are a set of recommendations for reporting observational studies, including cohort, case–control, and cross-sectional studies. These guidelines provide a framework for transparent and complete reporting of the design, conduct, and results of an observational study, with the aim of improving the quality and transparency of the research and minimizing biases and confounding factors which can affect the validity and reliability of the results.

The STROBE guidelines can also be used prospectively when conducting an epidemiologic study. By following the STROBE guidelines during the design and conduct of the study, researchers can ensure that the study is conducted in a transparent and rigorous manner and that the results of the study are accurately and clearly reported.

The STROBE guidelines provide specific recommendations for the reporting of various aspects of the study, such as the study design, the sampling and recruitment of participants, the measurement of exposures and outcomes, and the analysis and interpretation of the results.

ChatGPT

GPT-3 (Generative Pre-trained Transformer 3) is a third-generation, autoregressive language model developed by OpenAI that uses deep learning to produce human-like text. It is designed to generate human-like text and engage in conversation with users naturally and intuitively. To train the GPT model, researchers at OpenAI fed it a large dataset of human-generated text and trained the model to predict the next word in a sequence. The model’s parameters were adjusted using backpropagation, a technique for computing how the weights and biases should change to minimize the error between the predicted output and the actual output.

After the model was trained, it was able to generate human-like text by predicting the next word in a sequence based on the context provided by the previous words. The model can be fine-tuned for specific tasks, such as translation or question-answering, by training it on a dataset specific to that task.

It is worth noting that ChatGPT, a variant of GPT specifically designed for generating human-like text in a conversational context such as chatbots or virtual assistants, is not the same as GPT. While ChatGPT is trained on a dataset of human conversation and is able to generate text that is appropriate for use in a chat or messaging context, the underlying technology and principles of how ChatGPT works are the same as those of GPT.

One of the key features of ChatGPT is its ability to generate text that is coherent and follows the structure and style of scientific articles. Its underlying model, which has 175 billion parameters, was trained on large datasets using Microsoft’s Azure AI supercomputing infrastructure and has learned to generate text that is similar to human-written scientific text in terms of grammar, style, and content. This makes it a valuable tool for researchers who need to generate large amounts of scientific text, such as descriptions of experimental methods and results, summaries of research findings, or review articles (Radford et al. 2019).

The use of ChatGPT to support the conduct of the study

First, we identified the purpose of the study as follows: [To examine the relationship between the percentage of population fully vaccinated and the incidence of new COVID-19 cases in a given population, and to assess the impact of vaccination on the incidence of COVID-19 cases.]

To maximize the use of ChatGPT, we adopted the STROBE framework, translating its main items/recommendations into prospective questions for the AI transformer. To standardize the use of the transformer as much as possible, we followed the process listed below:

  • STROBE item > question/prompt for ChatGPT > adaptation to the simulated research questions > final answer

For each STROBE recommendation, we asked the transformer (on December 19–23, 2022) the following question:

  • “Transform the following recommendation from the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) guidelines into a question for ChatGPT. Recommendation: [STROBE recommendation]”

This provided us with Answer#1. The next step was the specific question:

  • “[Answer#1]+[The study has the following aim: To examine the relationship between the percentage of population fully vaccinated and the incidence of new COVID-19 cases in a given population, and to assess the impact of vaccination on the incidence of COVID-19 cases]”
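The two-step prompting procedure above can be sketched as follows. The function names and the abbreviated STROBE item text are illustrative only; in the study, the prompts were submitted manually through the ChatGPT interface, not through code:

```python
STUDY_AIM = (
    "To examine the relationship between the percentage of population fully "
    "vaccinated and the incidence of new COVID-19 cases in a given population, "
    "and to assess the impact of vaccination on the incidence of COVID-19 cases"
)

def build_reformulation_prompt(strobe_recommendation: str) -> str:
    """Step 1: ask the model to turn a STROBE item into a question for itself."""
    return (
        "Transform the following recommendation from the STROBE (Strengthening "
        "the Reporting of Observational Studies in Epidemiology) guidelines into "
        f"a question for ChatGPT. Recommendation: {strobe_recommendation}"
    )

def build_contextualized_prompt(answer_1: str) -> str:
    """Step 2: pose the reformulated question (Answer#1), contextualized
    with the aim of the simulated study."""
    return f"{answer_1}\nThe study has the following aim: {STUDY_AIM}"

# Example with one STROBE item (abbreviated, illustrative wording):
step1 = build_reformulation_prompt(
    "Clearly define all outcomes, exposures, predictors, potential "
    "confounders, and effect modifiers."
)
# Answer#1 would come from the model; here we use a placeholder string.
step2 = build_contextualized_prompt(
    "What should be included in the definition of outcomes and exposures?"
)
```

The point of templating both steps is to keep the procedure identical across all checklist items, so that differences in the answers reflect the items themselves rather than variation in how they were asked.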

Expert assessment

We assessed the outputs of the transformer by using human annotators to rate the quality and relevance of the model's responses. This was accomplished through a process called “human evaluation,” in which annotators (NN, FT, AB) were asked to rate the responses produced by the model on a 1-to-5-point Likert scale. The annotators were asked to consider factors such as the coherence of the response and its relevance to the prompt.

Given that there are no standardized questions that are specifically designed to evaluate the quality and relevance of the responses produced by ChatGPT or any other language model in terms of coherence and relevance, we evaluated the outputs by asking the expert annotators to consider the following questions:

  1. Is the response coherent and well-formed, or is it difficult to understand? (where 5 is the maximum level of coherence)

  2. Is the response relevant and informative, or is it lacking in substance? (where 5 is the maximum level of relevance)

For each STROBE recommendation, we formulated a prompt for ChatGPT contextualizing it with the aim of the study by Subramanian and Kumar (2021). Each response to the prompts was then evaluated by domain experts to assess its consistency and relevance.
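As a sketch of the scoring procedure, the aggregation of the three annotators’ Likert ratings into mean item scores could be computed as follows. The individual ratings shown are hypothetical examples, chosen only to be consistent with the per-item means reported in the Results:

```python
from statistics import mean

# Hypothetical ratings: for each STROBE-derived item, each of the three
# annotators gives a 1-5 Likert score for coherence and for relevance.
ratings = {
    "item_9":  {"coherence": [5, 5, 5], "relevance": [5, 5, 5]},
    "item_14": {"coherence": [2, 3, 2], "relevance": [2, 2, 1]},
}

def item_means(ratings):
    """Mean coherence and relevance per item, rounded to one decimal."""
    return {
        item: {dim: round(mean(scores), 1) for dim, scores in dims.items()}
        for item, dims in ratings.items()
    }

summary = item_means(ratings)
```

Reporting one mean per item and per dimension keeps the evaluation simple while still allowing the heterogeneity across items to be seen directly in the score table.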

Results

The STROBE checklist includes 35 recommendations (Table S1). From these recommendations we derived 35 specific questions for ChatGPT. Three domain experts evaluated the answers independently, assigning each answer a coherence score and a relevance score (Table 1). The mean scores were heterogeneous: across all items, the overall mean was 3.6 out of 5.0 for coherence and 3.3 out of 5.0 for relevance.

Table 1 Final Q&A and respective scores

For example, item #9 (“What should be included in the definition of outcomes, exposures, predictors, potential confounders, and effect modifiers in an observational study, and how should diagnostic criteria be provided, if applicable, according to the STROBE guidelines?”), in the Methods section, received a very satisfactory response in terms of both coherence (5.0/5.0) and relevance (5.0/5.0) (“According to the STROBE guidelines, in order to clearly define all outcomes, exposures, predictors, potential confounders, and effect modifiers in your observational study, you should provide detailed descriptions of each of these components. Outcomes: In this study, the outcome would be the incidence of new COVID-19 cases in a given population. You should specify how you will measure the incidence of COVID-19 cases (e.g., number of new cases per 100,000 population), and provide any relevant diagnostic criteria for COVID-19 (e.g., symptoms, test results). Exposures: The exposure in this study would be the percentage of the population that is fully vaccinated. You should specify how you will measure the percentage of the population that is fully vaccinated (e.g., number of individuals who have received both doses of the vaccine), and provide any relevant information on the specific vaccine(s) used. Predictors: The predictor in this study would be the percentage of the population that is fully vaccinated. [...]”).

Similarly, item #10, in the Methods section (“What should be included in the description of sources of data and methods of assessment for each variable of interest in an observational study, and how should comparability of assessment methods be described if there is more than one group, according to the STROBE guidelines?”), received a satisfactory response in terms of both coherence (4.7/5.0) and relevance (4.7/5.0). Here, too, in the opinion of the evaluators, the transformer’s response was complete and adequate (“According to the STROBE guidelines, in order to provide a clear and transparent description of the sources of data and methods of assessment for each variable of interest in your observational study, you should include the following information: For each variable of interest (e.g., percentage of population fully vaccinated, incidence of new COVID-19 cases), specify the source of data (e.g., administrative records, survey data). For each variable of interest, describe the method of assessment in detail. For example, if you are using administrative records to measure the percentage of population fully vaccinated, you should describe how the records were collected and how you will extract the relevant information. If you are using survey data to measure the incidence of new COVID-19 cases, you should describe the survey instrument and how you will collect and analyze the data. If there is more than one group in your study (e.g., different populations or time periods), describe the comparability of assessment methods for each variable of interest. For example, if you are comparing the percentage of population fully vaccinated in two different populations, you should describe whether the vaccination programs were similar in the two populations and whether the data sources and methods of assessment were comparable. [...]”).

In contrast, some responses received particularly negative ratings from the domain experts. For example, the response to item/recommendation #14, in the Methods section, received negative ratings for both coherence and relevance (2.3/5.0 and 1.7/5.0, respectively). The question was: “What statistical methods should be described, including those used to control for confounding, according to the STROBE guidelines?”. The answer (“According to the STROBE guidelines, you should describe all statistical methods that you used in your study to examine the relationship between vaccination and COVID-19 incidence, including any methods used to control for confounding factors. This may include methods such as multivariate regression analysis or stratified analysis, depending on the specific design and analysis of your study. It is important to provide a clear and detailed description of your statistical methods in order to allow readers to understand and evaluate the validity and reliability of your findings.”) was not deemed adequate in terms of either coherence or relevance.

Similarly, for item #18 (“What should be included in the report of an observational study in order to describe sensitivity analyses according to the STROBE guidelines?”), the answer was also not considered adequate by the evaluators, who gave mean scores of 2.3/5.0 for coherence and 2.0/5.0 for relevance.

Discussion

In this study, we used an open AI-based transformer, namely ChatGPT (Gordijn and Have 2023; Graham 2023), to reformulate and answer the STROBE recommendations in order to support the conduct of an epidemiological study. We then assessed the coherence and relevance of the transformer’s outputs. We used an observational study that analyzed publicly available data to investigate the relationship between the percentage of population fully vaccinated and new COVID-19 cases. It is important to note that we started from the premise that while AI can potentially assist in reproducing a study, the research question, the aim of the study, and all aspects of originality should remain the sole domain of humans. Therefore, we began our simulation by using a research question and study aim devised by a human.

After ChatGPT had answered all the STROBE recommendations, we asked independent experts to qualitatively evaluate the coherence and relevance of the transformer’s outputs. In the last few months, these innovative systems have proven to be fast and intuitive and an important support for researchers. However, the coherence, relevance, and even the correctness of their answers are not always clear, and there are currently no studies, to our knowledge, that critically analyze the use of these tools in epidemiological research (O’Connor and ChatGPT 2023). The choice of the STROBE checklist and the method of turning its recommendations into questions for ChatGPT represent an attempt to standardize our analysis and make it as reproducible as possible. In fact, by following the STROBE guidelines, researchers can help ensure that their results are accurately and clearly reported, which can help other researchers and policymakers understand and interpret the results of the study. This is particularly important in epidemiology, as the results of epidemiological studies can have important implications for public health policy and practice.

From the results of our analysis, we found that ChatGPT can be a valuable support for researchers, both experienced and inexperienced, in setting up an epidemiological study, particularly an observational one, following internationally recognized guidelines and standards. The mean scores attributed by experts to the transformer’s responses were moderately high (3.6/5.0 for coherence and 3.3/5.0 for relevance), but with a fair amount of variability among individual outputs. This is consistent with suggestions and preliminary analyses in the limited literature available on these tools (Castelvecchi 2022; Graham 2022; Huh 2023; Stokel-Walker 2022), which indicate that AI-based transformers are beginning to be used, but with some skepticism, partly due to their suboptimal ability to respond adequately, in terms of coherence and relevance, to questions posed by domain-expert researchers. Therefore, it seems essential that users of these tools have knowledge of the subject matter and a critical mindset when evaluating their outputs. Blindly accepting the answers of these tools may still be too great a risk to the integrity of science and thus to human society.

It is also interesting to note that the researchers rated some ChatGPT outputs very positively and others very negatively. Notably, both the responses rated very positively and those rated very negatively were in the methods part. This suggests that, when properly queried, the transformer may provide more appropriate answers in the methodological and data analysis and reporting domains, which were found to have the highest level of consistency and relevance.

Even in the introduction section, this AI-based transformer can be useful in placing a study in context. As the literature has shown (O’Connor and ChatGPT 2023), these systems can function as true “scientific writers” and can quickly and adequately contribute to the writing of entire paragraphs. However, our study found that the responses to STROBE recommendations related to the introduction/background and discussion/interpretation sections of scientific articles did not receive high scores, indicating that the tool is not yet fully adequate to support (or potentially replace) researchers in these parts of scientific articles, particularly in epidemiological studies. A possible explanation for this drop in performance may be related to the ChatGPT training data: the AI-based transformer was not specifically trained on scientific articles. This issue may be partially addressed by specific training or fine-tuning (Brown et al. 2020; Raffel et al. 2020).

Our analysis showed the ability of AI-based transformers to generate answers and human-like text, which could potentially be used to conduct epidemiological studies or write research articles. This could potentially save time and resources for researchers, as these activities can be time-consuming and labor-intensive.

However, there are also several potential drawbacks to the use of AI-based transformers for scientific research (King and ChatGPT 2023; O’Connor and ChatGPT 2023). One concern is the risk of bias in the data that are used to train these systems. If the data used to train the AI system are biased, the system may produce biased results or make biased recommendations. This could have serious consequences, particularly in the field of public health and epidemiology, where decisions about interventions and policies may be based on the results of research.

Another concern that is often reported in the critical analysis of these systems (Else 2023; Graham 2023), is the potential for AI-based transformers to replace human researchers, potentially leading to job loss and the devaluation of human expertise (Graham 2022). However, our assessment suggests a possible shift in the “study-experiment-analysis-results-writing-publication” paradigm. It seems more likely that, with the support of these systems, we will move towards a model where the researcher can focus more on the study and experimental phase than on methodological problems and questions.

Another risk associated with the use of AI in scientific research is the potential for fraudulent manipulation of scientific publications at scale. An example involves the scientific publisher IOP, which retracted as many as 850 articles in 2022 after a researcher at the University of Cambridge in the United Kingdom discovered that many of them contained nonsensical phrases produced by artificial intelligence programs attempting to evade anti-plagiarism software. These phrases were repeated a staggering number of times in different articles from different groups. When IOP began investigating, the publisher found other similarities suggesting that the articles came from a so-called paper mill, a company that produces and sells pseudo-articles for a fee.

The use of AI-based solutions capable of generating text in scientific research also brings with it many legal implications. However, it is beyond the scope of this article to address in detail each and every possible legal issue that may arise when using AI tools to generate text to be published in scientific studies. For our purposes, we can highlight that one of the main concerns is the issue of plagiarism and copyright infringement. AI-generated text may be similar or identical to existing copyrighted material, which could lead to legal issues for researchers and their institutions. It is important for researchers to be aware of these potential issues and take steps to ensure that their work does not violate any copyright laws. Using anti-plagiarism software to evaluate the generated text can be a helpful tool in reducing the risk of publishing non-original text.

We look forward to the development and refinement of anti-plagiarism systems, which will help preserve scientific integrity and ethics and will be of great assistance to publishers and other stakeholders. These software programs are designed to detect similarities between a given text and existing material. They can identify potential instances of plagiarism or copyright infringement, allowing researchers to make the necessary changes before publishing their work. It is important to note, however, that anti-plagiarism software may not catch all instances of plagiarism. Therefore, researchers should also use their own judgment and carefully review AI-generated text for any potential instance of plagiarism or copyright infringement.

Another legal implication in using AI-based solutions for text generation in scientific research is the accuracy of the information. As mentioned earlier, AI-generated text may contain errors or inaccuracies that can have serious consequences for the scientific community. For example, if an AI-generated article is published in a reputable journal and it contains inaccurate information, it could lead to other researchers basing their work on flawed data. Therefore, it is crucial for researchers to thoroughly check that the final product is sound and reliable.

In short, it is important to stress that even if AI tools are used to generate text, the responsibility for the research still lies with the humans to whom the work is attributed. This means that researchers and their institutions are responsible for ensuring that their work is legally compliant and that it does not infringe on any copyright laws or contain any inaccuracies. This highlights the need for careful consideration and adherence to ethical and legal guidelines when utilizing AI in scientific research.

Limitations of the study

It is important to note that, because ChatGPT is a machine learning model whose outputs are not deterministic, its answers may differ if the same prompts are re-run after some time. This limits the reproducibility of the results of our study.

Furthermore, there is no single way to formulate questions to the transformer; different questions, even on the same topic, produce different answers, which may be more or less consistent and relevant to the objective of the analysis, representing a limitation of the study. However, in the context of the present study, we tried to standardize this process as much as possible by starting with STROBE recommendations and directly asking the AI-based transformer to adapt them into prompts for itself. This may have at least partially mitigated the risk of variability.

Conclusions

In this exploratory study, we evaluated the coherence and relevance of an open AI-based transformer’s answers regarding questions extracted and reformulated from the STROBE recommendations for observational studies.

From our assessment, the transformer can be considered a valuable support for researchers, both experienced and inexperienced, in setting up an epidemiological study following internationally recognized guidelines and standards. The average score attributed by experts to the transformer’s responses was moderately high, but with a fair amount of variability among individual outputs.

Responses related to the methods, data analysis, and reporting domains and recommendations were found to have the highest level of coherence and relevance, emphasizing that these innovative tools may be useful primarily for methodological and statistical support, rather than for building the conceptual frameworks of the study and the narrative structure of the paper.

Therefore, it seems essential that users of these tools have knowledge about the subject matter and a critical mindset when evaluating their outputs. Blindly accepting the answers of these tools may pose a serious threat to the ethics and integrity of scientific research, including epidemiology. Accordingly, we advocate the necessity for taking a proactive stance towards this evolving context, in order to ensure careful governance of this inevitable process. Maintaining rigor and reliability of scientific research implies establishing a shared, and as much as possible, standardized method for conducting studies, analyzing data, and reporting its results. The potential benefits of AI in scientific research are undeniable, but it is crucial that we address the risks, and the ethical and legal consequences associated with its use. This requires a collaborative effort between researchers, policymakers, and other stakeholders to define guidelines and best practices—and monitor their fulfillment—for the ethically and legally correct use of AI in scientific research.