Introduction

In scientific research, predicting a publication’s number of citations and its alternative impact metrics (altmetrics), such as Twitter and blog mentions, is becoming increasingly important. Citations are often used as an indicator of the impact of research contributions, allowing researchers, institutions, funding agencies, and policymakers to make informed decisions (e.g., Aksnes et al., 2019; Caon et al., 2020). Altmetrics, on the other hand, can serve as a valuable index of societal impact, indicating public engagement and interest in research (Bornmann, 2014; Pulido et al., 2018). Nevertheless, their correlation with scientific impact in the form of citations may not be strong, as altmetrics scores could be driven primarily by factors such as public appeal (De Winter, 2015; Hassan et al., 2017; Pandey Akella et al., 2021; Warren et al., 2017).

Mendeley, a reference management tool that enables researchers to organize literature, provides another valuable index of readership which can reflect the degree of interest in a publication (Thelwall, 2018). A higher number of Mendeley readers suggests that a publication is frequently accessed by fellow researchers, which could lead to collaboration and citation opportunities (Bornmann, 2015; Haustein et al., 2014).

The factors influencing citation rates may include the validity of the research methods and conclusions (e.g., Antonakis et al., 2014) and the clarity of the researcher’s writing (Dowling et al., 2018). Estimating these metrics before submission could benefit researchers by allowing them to optimize their work and maximize its impact within both the scientific community and society at large. This predictive ability could also prove valuable for educators and writing instructors who play a role in shaping their students’ research output (e.g., Baldwin & Chandler, 2002; Murray et al., 2008).

The advent of large language models, especially ChatGPT, has enabled a wealth of new research opportunities, which may be of benefit to scientometrics research. ChatGPT has demonstrated that it can successfully tackle a range of tasks, such as summarizing text (Yang et al., 2023), stance detection (Huang et al., 2023; Zhang et al., 2022), and sentiment analysis (Tabone & De Winter, 2023; Zhong et al., 2023). The latest version, ChatGPT-4, has been shown to outperform the average student in a variety of verbally-oriented tests and exams (De Winter, 2023; Katz et al., 2023; Nori et al., 2023; OpenAI, 2023a). The current study used ChatGPT-4 to rate scientific abstracts across multiple characteristics. The ratings generated by ChatGPT were then used to investigate their predictive validity in determining the number of citations and altmetrics scores received by the corresponding papers.

In scholarly literature, various approaches have been investigated for predicting the impact of scientific publications. These include basic textual features, such as paper metadata (Ma et al., 2021) and paper length (Haustein et al., 2015; Xie et al., 2019). Additionally, more refined textual analyses have been proposed as citation predictors, encompassing linguistic complexity and readability metrics (Ante, 2022; Lu et al., 2019; Sienkiewicz & Altmann, 2016), stylometry of titles and abstracts (Jimenez et al., 2020), and sentiment analysis (Liu & Zhu, 2023). In a review regarding the application of artificial intelligence for predicting citations and other research assessment metrics, Kousha and Thelwall (2022) observed that the existing predictors, such as readability scores and text-based analyses, have inconsistent predictive power. They posited that “a risk with text mining to predict citation counts is that it is likely to work best by identifying highly cited topics, predicting higher citation counts for all articles on these topics. A successful prediction model for one year might be invalid for the next one due to topic changes, so text mining may need rebuilding each year to identify the new hot topics” (p. 63). Consequently, there may be a need for more generalizable predictors that do not rely on specific textual elements.

Presently, a research gap exists in determining the extent to which large language models, such as ChatGPT-4, can contribute to the prediction of citation counts and altmetrics scores. Our study aims to address this gap by using ChatGPT-4 to evaluate scientific abstracts across various qualities, and subsequently investigating the predictive value of these evaluations concerning the attention received by the respective papers.

For this investigation, it was decided to concentrate on the open-access mega-journal PLOS ONE. Articles were selected within a specific time frame (January and February 2022) to ensure an approximately equal age and minimize temporal confounding factors when evaluating cumulative citations and altmetrics scores. Early 2022 was deliberately selected because it postdates ChatGPT-4’s knowledge cut-off of September 2021, thereby mitigating the risk of ‘contamination’. This term denotes the possibility that the training data of the base model included the abstracts, citation numbers, and associated data. The concern of contamination in benchmark analyses has also been recognized by other researchers (Aiyappa et al., 2023; Bubeck et al., 2023; De Winter, 2023; OpenAI, 2023a). In the present study, the ChatGPT-4 model was used to evaluate PLOS ONE paper abstracts across a diverse range of qualities (which themselves were proposed by ChatGPT), which were subsequently correlated with citation counts and altmetrics scores.

Methods

On April 2, 2023, a total of 2222 records of articles published in the journal PLOS ONE during January and February 2022, including their respective abstracts, were downloaded using Scopus (Scopus, 2023). The current study focused solely on document types marked as ‘article’, excluding other document types such as review articles, errata, and retractions from our analysis. On January 4, 2024, the number of Scopus citations, altmetrics scores, Dimensions citations, and Google Scholar citations for the same 2222 papers were extracted using Scopus (2023), Altmetric (2023), Dimensions (2023), and Publish or Perish software (Harzing, 2023), respectively. The altmetrics scores included a variety of counts (e.g., news mentions, Twitter mentions, as well as Mendeley readers). Dimensions is a platform that extracts citations from various sources, including journal items, books, conference proceedings, pre-prints, and patents. Because Altmetric only includes the number of Mendeley readers when other altmetrics scores are available, the missing numbers of Mendeley readers were manually inserted based on the Mendeley database (Mendeley, 2023).

Between April 5 and 17, 2023, a custom script in MATLAB R2021b was used to submit each abstract individually to OpenAI’s Application Programming Interface (API), together with a prompt that asked for scores on 60 different items. For each abstract, ChatGPT-4 was instructed to provide numerical ratings between 0 and 100. The abstract records typically included a copyright declaration, commencing with the term ‘Copyright’ or the copyright symbol; this segment was automatically removed before the prompt was submitted to the API. The temperature setting of ChatGPT-4, which controls the level of randomness in its responses, was set to 0 to maximize the repeatability of the results.
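As an illustration of this step, the sketch below shows how a single prompt could be submitted to the chat-completions endpoint at temperature 0. The original study used a MATLAB script; this Python version, including the model identifier, the placeholder strings, and the crude copyright-stripping step, is an illustrative assumption rather than the author’s actual code. The full prompt text is given in the example that follows.

```python
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
API_KEY = os.environ["OPENAI_API_KEY"]  # API key assumed to be available in the environment

def rate_abstract(prompt_text: str) -> str:
    """Submit one rating prompt (instructions + abstract + 60 items) and return the raw reply."""
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": "gpt-4",        # model identifier assumed for this sketch
            "temperature": 0,        # minimize randomness, as described in the text
            "messages": [{"role": "user", "content": prompt_text}],
        },
        timeout=120,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]

# Illustrative usage: strip the trailing copyright declaration, then build and submit the prompt
abstract = "Single nucleotide polymorphisms (SNPs) in the (…) Copyright: © 2022 (…)"
abstract = abstract.split("Copyright")[0].strip()
prompt = "Please rate the following abstract on each of the 60 items (…)\n" + abstract + "\n(…)"
reply = rate_abstract(prompt)
```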

An example prompt is provided below, where (…) refers to omitted text for brevity.

Please rate the following abstract on each of the 60 items from 0 = Not at all to 100 = Very much. Only provide the numbers. For example:

 1. 65

 2. 50

 3. 5

 4. 95

 5. …

This is the abstract:

Single nucleotide polymorphisms (SNPs) in the (…) variants in disease genesis

These are the items:

 1. Original

 2. Accessible

 3. Circumlocutory

 4. Nontechnical

(…)

 60. Uninsightful

The items were generated using the following query in the ChatGPT-4 web interface: “I have to rate a short scientific text according to 25 variables. For example ‘Engaging’, ‘Controversial’, etc. Could you list 25 of such variables? The variables should be as diverse and orthogonal as possible. Only provide the words; do not define their meaning.” Five additional items (Difficult to understand, Exciting, Not well written, Theoretical, and To the point) were incorporated manually, based on prior explorations with abstracts from another journal.

Furthermore, 30 antonyms corresponding to the 30 items were generated using the following prompt in the ChatGPT-4 web interface: “For these 30 items, provide the antonyms in the same order.” The antonyms were included to examine whether negatively worded items would correlate negatively with their counterparts: if the model responds consistently to both positive and negative wordings of a comparable concept, this indicates that ChatGPT-4 is capturing the intended components. Moreover, the combination of the 30 items with their respective 30 antonyms helps control for acquiescence bias. Analogous approaches have been applied in questionnaires designed for human respondents (Weijters & Baumgartner, 2012). For each prompt, the 60 items were presented in a random order to minimize order effects that might influence the performance of ChatGPT. All 60 items are shown in Table 1.

Table 1 Items that were included with each prompt

For every individual abstract, the prompt yielded one corresponding ChatGPT-4 output. Using our MATLAB script, the numerical responses were extracted from the ChatGPT-4 outputs, resulting in a 2222 × 60 matrix of numbers between 0 and 100. Next, considering that some items could be expected to be correlated, a principal component analysis (PCA) was conducted on the standardized scores, i.e., a 2222 × 60 matrix Z with column means of 0 and standard deviations of 1. The number of components to retain was based on a scree plot (Cattell, 1966). To improve the interpretability of the component loadings, the loadings were subjected to orthogonal Varimax rotation. Principal component scores (i.e., a 2222 × 3 matrix) were computed using the standard procedure of multiplying the standardized scores by the pseudoinverse of the transpose of the 60 × 3 Varimax-rotated loading matrix Λ, i.e., Z(Λᵀ)⁺. Because of the Varimax rotation, the three component scores were uncorrelated (r = 0.00).
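A minimal Python sketch of these data-reduction steps is given below (the study itself used MATLAB; the array name `ratings` and the input file are illustrative assumptions). It standardizes the 2222 × 60 matrix, extracts three principal components, Varimax-rotates the loadings, and computes the component scores as Z(Λᵀ)⁺.

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a p x k loading matrix (Kaiser row normalization omitted)."""
    p, k = loadings.shape
    rotation = np.eye(k)
    variance = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        u, s, vt = np.linalg.svd(
            loadings.T @ (rotated ** 3 - (gamma / p) * rotated @ np.diag((rotated ** 2).sum(axis=0)))
        )
        rotation = u @ vt
        new_variance = s.sum()
        if new_variance < variance * (1 + tol):
            break
        variance = new_variance
    return loadings @ rotation

# ratings: 2222 x 60 item scores, averaged over the four runs (illustrative file and name)
ratings = np.loadtxt("chatgpt_item_scores.csv", delimiter=",")
z = (ratings - ratings.mean(axis=0)) / ratings.std(axis=0, ddof=1)  # standardized scores Z

# PCA via eigendecomposition of the 60 x 60 correlation matrix
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
order = np.argsort(eigvals)[::-1]                 # sort components by descending eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Unrotated loadings of the first three components, then orthogonal Varimax rotation
loadings = eigvecs[:, :3] * np.sqrt(eigvals[:3])  # 60 x 3
rotated_loadings = varimax(loadings)

# Component scores: Z (Lambda^T)^+, yielding a 2222 x 3 matrix
component_scores = z @ np.linalg.pinv(rotated_loadings.T)
```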

Next, the predictive value of the component scores was evaluated in relation to the following dependent variables: (1) the abstract length in number of characters, the number of mentions in (2) blogs, (3) news items, (4) Twitter, and (5) Reddit, (6) Mendeley readers, as well as the available citation counts obtained from (7) Dimensions, (8) Scopus, and (9) Google Scholar. Although additional altmetrics scores were available (e.g., the number of patent, Facebook, and Weibo mentions), these were deemed too infrequent (with a mean across articles of less than 0.053) to warrant inclusion as dependent variables.

Furthermore, based on research that demonstrated a correlation between the number of authors and citation count (Kousha & Thelwall, 2022; Sommer & Wohlrabe, 2017; Tahamtan et al., 2016), the number of authors was determined by counting the semicolons present in the ‘Authors’ record and subsequently adding one.
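As a small illustration of this step (the field layout, with author names separated by semicolons, is assumed to match the Scopus export):

```python
def number_of_authors(authors_field: str) -> int:
    """Number of authors = number of semicolons in the 'Authors' record plus one."""
    return authors_field.count(";") + 1

print(number_of_authors("Smith J.; Doe A.; Lee K."))  # 3 (fictitious record)
```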

Additionally, for each abstract, a set of readability indexes was computed based on textual features, such as the number of sentences, words, characters, syllables, and complex words (those with three or more syllables). These features served as building blocks for calculating various classical readability indexes, including the Automated Readability Index (Senter & Smith, 1967), SMOG grade (McLaughlin, 1969), Flesch-Kincaid grade level (Kincaid et al., 1975), Flesch Reading Ease (Flesch, 1948), Coleman-Liau index (Coleman & Liau, 1975), and Gunning fog index (Gunning, 1952). These indexes, which take into account text characteristics such as the number of characters per word, syllables per word, words per sentence, and the number of polysyllabic words, have been widely used in the field of scientometrics (e.g., Hartley, 2016; Wang et al., 2022). The predictive validity of these traditional text-based methods was compared with that of the component scores derived from ChatGPT. For computing the readability scores, an API provided by Ipeirotis (2023) was used.
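The indexes were obtained through the Ipeirotis (2023) API; purely for reference, the sketch below shows three of the standard published formulas, assuming the sentence, word, syllable, and complex-word counts have already been extracted from an abstract.

```python
def flesch_reading_ease(sentences: int, words: int, syllables: int) -> float:
    """Flesch (1948) Reading Ease; higher values indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(sentences: int, words: int, syllables: int) -> float:
    """Flesch-Kincaid grade level (Kincaid et al., 1975); higher values indicate harder text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(sentences: int, words: int, complex_words: int) -> float:
    """Gunning (1952) fog index, based on sentence length and the share of complex (3+ syllable) words."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

# Example for a hypothetical abstract with 10 sentences, 250 words, 420 syllables, 45 complex words
print(flesch_reading_ease(10, 250, 420), flesch_kincaid_grade(10, 250, 420), gunning_fog(10, 250, 45))
```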

Due to the potential presence of outliers (e.g., a small number of articles attracting a large number of tweets), Spearman rank-order correlation coefficients were used for determining associations with citations and altmetrics. The Spearman rank-order correlation is robust against the influence of outliers (Croux & Dehon, 2010; De Winter et al., 2016), making it a more suitable choice for the analysis (for similar approaches in scientometrics studies, see e.g., De Winter, 2015; Jimenez et al., 2020).
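For illustration, the corresponding computation in Python (continuing from the component scores computed in the sketch above; the citations file is an assumed placeholder aligned with the 2222 abstracts):

```python
import numpy as np
from scipy.stats import spearmanr

# Citation counts aligned with the 2222 abstracts (illustrative file name)
citations = np.loadtxt("scopus_citations.csv")

# Rank-order correlation between the first ChatGPT-4 component score and citations
rho, p_value = spearmanr(component_scores[:, 0], citations)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3g})")
```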

In interpreting correlation coefficients, Cohen (1988) proposed that r = 0.10, r = 0.30, and r = 0.50 could be considered small, medium, and large effects, respectively. Later, Ferguson (2009) recommended a minimum correlation of r = 0.20 to deem an effect practically significant. On the other hand, a meta-analysis by Gignac and Szodorai (2016) showed that correlations of 0.50 or stronger are rare in individual-differences research, and proposed a more realistic guideline of r = 0.10, r = 0.20, and r = 0.30 representing small, medium, and large effects. This article adopts the latter guideline for interpreting the strength of the correlation coefficients. A literature review on predictive effects in scientometrics by Kousha and Thelwall (2022) concurs that correlations of r > 0.50 are rare. Such large correlations may point to a hidden causal relationship, reflect variability due to a small sample size, or result from aggregation prior to calculating the correlation. For comparison, in Ante (2022), readability statistics of 135,502 abstracts showed statistically significant correlations with citation counts, but the Spearman correlation coefficients were not higher than 0.1, while a meta-analysis of 24 effect sizes showed a Pearson correlation of 0.31 between paper length and the number of citations (Xie et al., 2019).

Results

Statistical reliability checks

The script developed for this study obtained scores on all 60 items for each of the 2222 abstracts. The duration for ChatGPT-4 to process a single abstract was around 43 s, with fluctuations observed throughout the day (ranging from approximately 35 to 55 s), likely attributable to varying server load at OpenAI. Overall, obtaining responses for all 2222 abstracts required roughly 27 h and incurred a cost of USD 91.

To verify the reliability of the ChatGPT-4 output, the script was run three additional times. One repetition had the items in reverse order compared to the original run, whereas the two other repetitions had the items in random order, with one of these repetitions inadvertently including a few words from another abstract (“Coaching has been described, in part,”) for all abstracts. The item means turned out to be highly consistent across runs (r > 0.998 for all combinations of runs). For example, averaged across all 2222 abstracts, the item ‘Unethical’ consistently received low scores (5.85, 5.86, 6.03, and 5.98 for the four runs), while ‘Relevant’ received high scores (87.70, 87.21, 87.07, and 87.52 for the four runs). However, when examining the scores assigned to individual abstracts (n = 2222), reliability was relatively low, averaging r = 0.27 (mean of 60 correlation coefficients) and ranging from r = 0.00 for ‘Ethical’ to r = 0.57 for ‘Easy to understand’.

Thus, ChatGPT-4 generated scores that are reliable at the population level (i.e., when averaged across all abstracts) yet unreliable for individual abstracts. Some variability can be attributed to ChatGPT's inability to produce fully deterministic outputs, as OpenAI noted: “Setting temperature to 0 will make the outputs mostly deterministic, but a small amount of variability may remain” (OpenAI, 2023b). Moreover, it appears that the ChatGPT-4 output was highly sensitive to the order in which the 60 items were presented. Additionally, it is probable that certain items, like ‘Ethical’, are challenging to score, since none of the abstracts would be expected to be unethical, resulting in minimal variance. The above repetition process highlights the importance of considering multiple items and not relying on a single ChatGPT-4 output score. Due to these reliability concerns, the average of the four scores per item per abstract was calculated before proceeding with further analysis.
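The per-item reliability check and the subsequent averaging can be sketched as follows; the array `runs` holding the four 2222 × 60 score matrices, and the file it is loaded from, are illustrative assumptions.

```python
import numpy as np
from itertools import combinations

# Four runs of item scores, stored as a 4 x 2222 x 60 array (illustrative file and name)
runs = np.load("chatgpt_runs.npy")
n_runs, n_abstracts, n_items = runs.shape

# Per-item reliability: correlate the scores across abstracts for every pair of runs, then average
per_item_reliability = np.array([
    np.mean([np.corrcoef(runs[a, :, item], runs[b, :, item])[0, 1]
             for a, b in combinations(range(n_runs), 2)])
    for item in range(n_items)
])
print(per_item_reliability.mean())   # overall mean of the 60 per-item reliability coefficients

# Average the four scores per item per abstract before further analysis (2222 x 60)
ratings = runs.mean(axis=0)
```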

Data reduction

Figure 1 presents the scree plot derived from the 60 × 60 correlation matrix of item scores. Based on the scree plot, the decision was made to retain three components. Although retaining five components would have seemed appropriate as well, the loadings of the three-component solution were deemed more interpretable. Specifically, when extracting five components, one component was defined primarily by the 30 antonyms, and another component had low and uninterpretable loadings. Table 2 provides the means, standard deviations, and Varimax-rotated component loadings for all 60 items.

Fig. 1 Eigenvalues of the correlation matrix in descending order, also called the ‘scree plot’. Listed in blue is the percentage of variance explained

Table 2 Means and standard deviations of scores assigned to abstracts (n = 2222), and Varimax-rotated component loadings

In general, items with high mean scores (> 50) are associated with positive qualities, such as Engaging, Rigorous, and Accessible, while their antonyms with lower mean scores (< 50) represent negative qualities, such as Disengaging, Lax, and Inaccessible. Furthermore, antonyms display inverse component loadings. Specifically, the loadings of Items 1–30 for each of the three components exhibited strong negative correlations with the loadings of Items 31–60, that is, their corresponding antonyms (r = − 0.91, − 0.92, and − 0.93, respectively).

Component 1 captures the Quality and Reliability of the abstract. High negative loadings correspond to negative aspects (e.g., Lax, Superficial, Unreliable), while high positive loadings signify positive aspects (e.g., Rigorous, Methodical, Objective). Component 2 represents the Accessibility and Understandability of the abstract. High negative loadings are associated with difficulty in understanding (e.g., Difficult to understand, Inaccessible, Technical), while high positive loadings indicate ease of understanding (e.g., Accessible, Easy to understand, Nontechnical). Component 3 captures Novelty and Engagement. High negative loadings are indicative of more conventional and uninteresting aspects (e.g., Unprovocative, Conventional, Dull), while high positive loadings correspond to engaging and innovative aspects (e.g., Engaging, Innovative, Exciting).

Additionally, a supplementary reliability check was performed using a split-half approach. In this method, instead of aggregating the item scores from all four runs as in the primary analysis, component scores were calculated by aggregating the item scores from runs 1 and 2, as well as from runs 3 and 4. The component scores obtained from the two halves correlated at r = 0.56, 0.81, and 0.70 for the three components, respectively.
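A sketch of this split-half check is given below, reusing the `varimax` routine and the `runs` array from the earlier sketches; the helper `score_components` is a hypothetical wrapper around the PCA and scoring steps described in the Methods, and in practice the components obtained from the two halves may need to be matched in order and sign before correlating.

```python
import numpy as np

def score_components(x, n_components=3):
    """Standardize, extract principal components, Varimax-rotate, and compute component scores."""
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)
    eigvals, eigvecs = np.linalg.eigh(np.corrcoef(z, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]
    loadings = eigvecs[:, order] * np.sqrt(eigvals[order])
    return z @ np.linalg.pinv(varimax(loadings).T)   # varimax() as defined in the Methods sketch

half_a = runs[:2].mean(axis=0)   # item scores averaged over runs 1 and 2 (2222 x 60)
half_b = runs[2:].mean(axis=0)   # item scores averaged over runs 3 and 4
scores_a, scores_b = score_components(half_a), score_components(half_b)

# Split-half correlation per component (components may require reordering/sign-flipping first)
split_half_r = [np.corrcoef(scores_a[:, k], scores_b[:, k])[0, 1] for k in range(3)]
print(split_half_r)
```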

Predicting altmetrics scores and citation counts from ChatGPT-4 component scores

The correlations in Table 3 indicate the extent to which the three component scores predicted altmetrics and citation scores. The Quality and Reliability component displayed weak positive correlations with the number of citations (ρ = 0.05 to 0.10 for the three citation measures). Furthermore, weak correlations were observed between Quality and Reliability and the altmetrics.

Table 3 Means, standard deviations (SD), and Spearman rank-order correlations between component scores and altmetrics scores & citations

The Accessibility and Understandability component demonstrated weak to moderate positive correlations with several altmetrics scores, including the number of news mentions (ρ = 0.10), blog mentions (ρ = 0.07), Twitter mentions (ρ = 0.18), and Reddit mentions (ρ = 0.06), as well as a large correlation with the number of Mendeley readers (ρ = 0.40). This component also showed weak to moderate positive correlations with the number of citations (ρ = 0.08 to 0.17). In other words, more accessible and understandable abstracts tend to receive more attention in terms of citations and altmetrics.

The Novelty and Engagement component exhibited weak to moderate positive correlations with altmetrics such as the number of news mentions (ρ = 0.09), blog mentions (ρ = 0.15), Twitter mentions (ρ = 0.23), Reddit mentions (ρ = 0.14), and Mendeley readers (ρ = 0.12). This component also displayed a moderate positive correlation with the number of citations (ρ = 0.18).

In summary, the Accessibility and Understandability and Novelty and Engagement components were found to be positively correlated with various altmetrics and citation scores, including Mendeley readers, while the Quality and Reliability component showed weak correlations with these metrics.

Predicting altmetrics scores and citation counts from classical readability scores

The correlation coefficients presented in Table 4 reveal a distinct pattern: the readability indexes correlated only very weakly with the altmetrics scores and citations. Among the relationships between readability indexes and outcome measures, the strongest Spearman correlation was only 0.13. Furthermore, it is worth highlighting that the text attributes themselves (such as the number of sentences, words, and characters) were more strongly predictive of the number of Mendeley readers (up to ρ = 0.29) than the readability scores.

Table 4 Means, standard deviations (SD), and Spearman rank-order correlations between altmetrics scores and readability scores

Upon further investigation, it was observed that the readability scores displayed weak to moderate correlations with the three ChatGPT-4 component scores (see Table 5). More specifically, abstracts that received higher scores on the Accessibility and Understandability component tended to be more readable according to the readability indexes (ρ up to −0.21), while abstracts that achieved higher scores on Novelty and Engagement tended to be less readable (ρ up to 0.28). Moreover, articles with a higher number of authors had higher Quality and Reliability scores (ρ = 0.20).

Table 5 Spearman rank-order correlations between ChatGPT-4 component scores and readability scores

Discussion

In the current study, a large language model, ChatGPT-4, was used to perform a scientometrics analysis. Prior research has predominantly relied on techniques such as string searches, sentiment analysis, and linguistic complexity analysis (e.g., Hassan et al., 2020; Sienkiewicz & Altmann, 2016). For example, Hu et al. (2021) examined the association between citation counts of scientific articles and the writing styles of abstracts. Their analysis revealed that abstracts of highly cited papers had a more complex vocabulary, longer sentences, and more complex syntactic structures, and were less readable than uncited abstracts. On the other hand, Dowling et al. (2018) found that readability had a positive impact on the citations that Economics Letters articles received, particularly for methods and macroeconomic papers. The discrepancy in predictability between these two studies may be indicative of domain-specific variations, where the importance of readability may be more pronounced in Economics compared to other disciplines (Kousha & Thelwall, 2022). However, it may also imply that the readability indexes that we used, which rely on features such as mean sentence length and the number of syllables per word, might not offer an accurate portrayal of readability, as they do not assess the words of the text in context (Hartley, 2016).

The current study pioneers a language-based evaluation of article abstracts. In contrast to conventional scientometrics methods—which may be prone to overfitting and inconsistent results, as discussed in the Introduction—the present approach represents an innovative direction. By assessing abstracts across a range of characteristics using semantically diverse items, this method may produce more generalizable outcomes. Our findings demonstrated that ChatGPT-4 scores displayed stronger correlations with altmetrics and citation counts in comparison to conventional readability scores, which are calculated based on the number of sentences, words, syllables, and characters.

A set of 30 items, along with their respective antonyms, was used in this study. This approach was anticipated to be robust against potential response biases. For example, relying solely on positively worded items could have led ChatGPT-4 to assess the ‘overall positivity’ of the abstract. Our approach appeared effective, as demonstrated by the strong negative correlations between the loadings of the 30 items and their 30 corresponding antonyms. To mitigate the limited reliability of individual item scores, the scores were aggregated across four script runs and across items by means of principal component analysis, with the items presented in a random order in each prompt. The present approach offers a framework for appraising abstracts and may encourage additional research to further refine this method of prompting large language models.

Our correlational analysis showed several noteworthy findings. The number of authors positively correlated with Quality and Reliability, as could be expected. However, Quality and Reliability did not display a substantial association with altmetrics scores, and only a weak association was observed with the number of citations. These results are disconcerting, considering the worldwide attention on aspects such as methodological rigor and replicability, which are facets that loaded highly on the Quality and Reliability component. Others have similarly remarked that rigorous research, including pre-registered replications, does not necessarily correlate with popularity in terms of citations (Akcan et al., 2013; Hardwicke et al., 2021).

Abstracts assessed by ChatGPT-4 as Accessible and Understandable attracted a higher number of tweets and, more prominently, Mendeley readers, while also showing a weak positive association with citations. The correlation with Mendeley readers was large, at ρ = 0.40. It was also found that abstracts judged by ChatGPT-4 to be more Novel and Engaging tended to receive more citations, with moderate effect sizes. From the perspective of science communication, having exciting and objective abstracts could make it easier for readers to appreciate the work, which in turn may promote knowledge dissemination (Sand-Jensen, 2007). However, there is a risk that a focus on crafting exciting abstracts might encourage sensationalism, which could detract from the pursuit of rigorous research.

This study presented a total of 60 items to the ChatGPT-4 model, incorporating some redundancy. Undoubtedly, further aspects could have been investigated through ChatGPT, such as the sentiment and tonality of the abstracts. While the relationship between the three identified components and citation counts appears to be interpretable, the strength of the correlations was only weak to moderate, with Spearman’s ρ values ranging between 0.08 and 0.18. Various factors might have imposed an upper limit on the correlations. One such factor is that an abstract may not be representative of the paper as a whole. Future research could benefit from analyzing full-text articles, including their titles, instead of just abstracts. Specifically, it is unclear whether the abstract itself is a causal factor or whether it acts as an epiphenomenon, reflecting the overall quality of the research paper, with the latter being the true driving force behind social and scientific impact. In the case of altmetrics, however, it is worth considering whether even the full paper is meaningful, as people may retweet a title without reading the paper (or abstract). A related ambiguity concerns which aspect of the abstract ChatGPT-4 should rate. For example, our exploration of the data revealed that one abstract was rated as controversial, presumably because the studied topic (tax evasion) was controversial; this does not imply that the abstract, its research methods, or its conclusions are controversial.

Upon conducting a series of additional manual assessments, it was noted that many of the ratings provided by ChatGPT were in fact meaningful. For example, abstracts that received the lowest rating in terms of being ‘well written’ or the highest in ‘not well written’ were, according to the present author, hard to read due to complex sentence structures or a frequent use of acronyms/abbreviations. Another illustrative case was an abstract that exhibited the lowest score on ‘replicability’; the present author observed that this abstract lacked a methodology and more closely resembled an essay. Conversely, some scores assigned by ChatGPT appeared to be somewhat superficial. An example of this was an abstract that received a high score on ‘originality’, seemingly because its authors had incorporated the words “We present novel …”. Another abstract was given a low rating on ‘balance’, possibly due to the presence of a recommendation for a balanced approach within the text. Future research could explore different types of prompts, such as prompts about the topic of the abstract, the methods and conclusions of the abstract, and the potential impact of the abstract on readers, in order to provide more precise insights into cause-and-effect relationships.

Another limitation is that, while 2222 abstracts were analyzed, it would have been preferable to process a larger dataset. Our current method was not particularly cost-effective, amounting to approximately 0.04 USD per abstract. In contrast, other scientometrics studies have analyzed millions of abstracts (De Winter & Dodou, 2015; Ioannidis, 2019; Pei et al., 2023; Sienkiewicz & Altmann, 2016), which is currently unattainable through ChatGPT-4 for regular researchers. Our analysis was based on abstracts from January and February 2022, just after the knowledge cut-off of ChatGPT-4. Consequently, the papers have had only 22–24 months to accumulate citations and altmetrics scores. Despite this limitation, our findings still offer valuable insights and may set the stage for further investigations of ChatGPT-4 or similar methods. Future research could consider applying a larger time gap between the moment of publication and the collection of citation counts. The data underlying this research is available in a public data repository, enabling other researchers to recalculate the correlations in a few years, when the articles have garnered more citations. However, it is conceivable that by then the ChatGPT models will have undergone further improvements, making it sensible to reassess the abstracts at that time.

This study demonstrated the potential application of large language models in scientometrics. Our study may also stimulate reflection from a broader perspective. For example, it has been argued that artificial intelligence should play a role in the peer-review process for scientific articles, such as in manuscript pre-screening (Hancock, in press). In this context, our findings hold important implications. In the current study, artificial intelligence evaluated abstracts as subjective, unwieldy, and so forth. It is important to recognize that these assessments arise from a model trained on previously generated human data. Additionally, the GPT model underwent fine-tuning through a process known as reinforcement learning from human feedback (OpenAI, 2022, 2023a). As scientists strive to maintain academic integrity and autonomy, understanding who determines the quality of a scientific work remains a critical topic. It is hoped that this article stimulates a meaningful dialogue on this important issue.