Introduction

Glucocorticoid-induced osteoporosis (GIOP) is a form of osteoporosis caused by long-term or high-dose use of glucocorticoid medications in patients with a variety of inflammatory and autoimmune diseases [1,2,3]. Glucocorticoids decrease bone formation, increase bone resorption, and interfere with calcium absorption and excretion, thereby leading to bone loss [4, 5]. GIOP evolves more rapidly than other types of osteoporosis, with severe bone density loss sometimes observed within a few months of glucocorticoid administration; furthermore, while symptoms might be absent in the early stages of the disease, bone pain, loss of height, or fracture might occur as the disease progresses [6, 7].

The rate of new fractures after one year of glucocorticoid therapy can reach 17%, and GIOP preferentially affects the spine, particularly the vertebral bodies [1, 8]. Therefore, GIOP can lead to compression fractures. Fractures occur in 30–50% of patients receiving long-term glucocorticoid therapy and are usually asymptomatic [1, 8, 9]. These fractures may occur as early as three months after starting steroid therapy, at doses as low as 2.5 mg of prednisone per day [1]. In addition, people of any age and sex can develop osteoporosis with glucocorticoid use, and the risk is higher among older adults, postmenopausal women, and those with other risk factors for osteoporosis [10,11,12]. Consequently, understanding the characteristics of GIOP is essential for the prevention and management of this condition [13, 14].

Natural language processing (NLP) is a branch of artificial intelligence (AI) dedicated to enabling computers to understand, interpret, and respond to human language. NLP combines methods from computer science, AI, and linguistics to analyze, understand, and generate natural language [15, 16]. Large language models (LLMs) are large-scale machine learning models, developed within NLP, that process, understand, and generate natural language [17]. LLMs are typically trained on large amounts of textual data, which enables them to capture the complexity and nuances of language [18]. Currently, among the most advanced LLM chatbots are ChatGPT-3.5 and ChatGPT-4, developed by OpenAI, and Google Gemini [19,20,21].

The use of LLMs in the medical field can help physicians improve the quality and effectiveness of health care by increasing the efficiency of medical information management, patient care, medical research, and clinical diagnosis [22,23,24,25,26]. However, in real applications, different versions and implementations of LLM chatbots may have different levels of performance, so it is essential to choose the right model for a particular task [27,28,29,30]. In the study by Zhi Wei Lim et al., ChatGPT-4 showed excellent accuracy in answering questions about myopia care, with 80.6% of its responses rated as “good,” compared with 61.3% for ChatGPT-3.5 and 54.8% for Google Gemini [31]. ChatGPT has also been found to be reasonably accurate in answering general questions about osteoporosis, but its responses to questions based on the National Osteoporosis Guideline Group guidelines were only 61.3% accurate [32]. The purpose of this study was to evaluate and compare the performance of three publicly available LLMs, namely, OpenAI’s ChatGPT-3.5 and ChatGPT-4, as well as Google Gemini, in answering questions related to GIOP and the 2022 ACR-GIOP Guidelines. These findings will help to determine which model performs better in a particular task or application scenario, thus enabling users to make more informed choices.

Methods

Study design

A set of 34 general questions related to GIOP (Supplementary Table 1a) was curated from reputable online health information sources, including UpToDate, the American College of Rheumatology (ACR), the National Center for Biotechnology Information (NCBI), and Endocrine News. The questions were then refined based on their applicability to common clinical settings. To better characterize the strengths and weaknesses of the different LLM chatbots across topics, we categorized these questions into six fields: clinical manifestations, pathogenesis, diagnosis, treatment, prevention, and risk factors. We also prepared 25 questions based on the 2022 ACR-GIOP Guidelines (Supplementary Table 1b). Answers to these questions were generated from March 13 to March 25, 2024, using two versions of ChatGPT (ChatGPT-3.5 and ChatGPT-4; OpenAI, California) and Google Gemini (Google LLC, Alphabet Inc., California). Each question was entered as a separate conversation, and the conversation was reset after each query before the reply was collected. The content of the LLM chatbot replies was converted to plain text, any information identifying the LLM chatbot was removed, and the responses were rated by three orthopedic surgeons experienced in treating osteoporosis. Figure 1 shows the overall design of this study.
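The querying protocol described above was carried out manually through the chatbots’ web interfaces. For illustration only, the sketch below shows how the same one-question-per-conversation protocol could be scripted against the OpenAI API; the model name, example questions, and output path are hypothetical assumptions, and Google Gemini would require its own client.

```python
# Illustrative sketch only: submitting each question as a fresh, single-turn
# conversation via the OpenAI Python client. The study itself used the web chat
# interfaces; the model name and output path here are hypothetical.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

questions = [
    "What is glucocorticoid-induced osteoporosis?",          # example question
    "How do glucocorticoids affect bone formation and resorption?",
]

answers = []
for q in questions:
    # A new messages list per question keeps each query in its own conversation,
    # mirroring the study's "reset after each query" protocol.
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": q}],
    )
    answers.append(resp.choices[0].message.content)

# Save the plain-text replies for blinded rating (identifying details removed separately)
with open("chatgpt4_responses.txt", "w", encoding="utf-8") as f:
    f.write("\n\n".join(answers))
```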

Fig. 1 Flowchart of the overall study design

Accuracy assessment

Three senior orthopedic surgeons independently rated each answer on a 4-point scale (1 = completely incorrect; 2 = partially correct but containing incorrect information; 3 = correct but inadequate; 4 = correct and adequate). The consistency of the three senior orthopedic surgeons’ ratings of the ChatGPT-3.5, ChatGPT-4, and Google Gemini responses was assessed using Fleiss’ kappa coefficient. The total score (TS) of each response was the sum of the three raters’ scores; a TS > 9 indicated a ‘good’ response, 6 ≤ TS ≤ 9 a ‘moderate’ response, and TS < 6 a ‘poor’ response.
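As a concrete illustration of this scoring scheme, the minimal Python sketch below computes Fleiss’ kappa across the three raters and maps total scores to the study’s grade labels. The ratings shown are hypothetical example values; the published analysis was performed in SPSS.

```python
# Minimal sketch (hypothetical ratings; the published analysis used SPSS): computing
# Fleiss' kappa across the three raters and mapping total scores (TS) to grades.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = questions, columns = the three surgeons' 1-4 ratings (example values only)
ratings = np.array([
    [4, 4, 3],
    [2, 3, 2],
    [4, 4, 4],
    [1, 2, 2],
])

# aggregate_raters converts per-rater scores into a subjects x categories count table
table, _ = aggregate_raters(ratings)
print("Fleiss' kappa:", round(fleiss_kappa(table), 3))

def grade(ts: int) -> str:
    """Map a total score (sum of the three 1-4 ratings) to the study's grade labels."""
    if ts > 9:
        return "good"
    if ts >= 6:
        return "moderate"
    return "poor"

for ts in ratings.sum(axis=1):
    print(ts, grade(ts))
```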

Re-evaluating the accuracy of LLM chatbot self-correction

Questions whose responses were rated as ‘poor’ were subjected to further questioning: an orthopedic specialist identified the incorrect or inaccurate sentences within the responses, and each LLM chatbot was then prompted, within the same conversation, to self-correct with the follow-up prompt “This doesn’t seem quite right. Can you answer it again?”. The self-corrected responses were collected and converted to plain text, information identifying the LLM chatbot was removed, and the order of the responses was shuffled before the corrected content was re-evaluated by the three raters. This reassessment was conducted one week after the initial scoring, and during this round the raters were not informed that the responses were self-corrected versions.
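Procedurally, this self-correction step amounts to appending a follow-up user turn to the existing conversation. Continuing the hypothetical API sketch from the study design section (again illustrative only; the study worked through the web interfaces, and the question and placeholder answer below are assumptions):

```python
# Illustrative continuation: re-prompting within the same conversation to trigger
# self-correction, mirroring the follow-up prompt quoted above. Names are hypothetical.
from openai import OpenAI

client = OpenAI()

question = "When should pharmacologic therapy be started according to the 2022 ACR-GIOP Guidelines?"
first_answer = "..."  # the initial 'poor' response collected earlier

messages = [
    {"role": "user", "content": question},
    {"role": "assistant", "content": first_answer},
    {"role": "user", "content": "This doesn't seem quite right. Can you answer it again?"},
]
corrected = client.chat.completions.create(model="gpt-4", messages=messages)
corrected_text = corrected.choices[0].message.content
```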

Statistical analysis

SPSS version 26 (IBM Corp.) was used for the data analysis. Normally distributed data are expressed as the mean ± standard deviation, and nonnormally distributed data are presented as the median (25th–75th percentile) (M(P25–P75)). The Kruskal‒Wallis H test was used for multiple comparisons to determine the significance of the differences among ChatGPT-3.5, ChatGPT-4, and Google Gemini. Paired t tests were used to compare the initial TS with the self-corrected TS, and Pearson’s chi-squared test was used to compare the initial accuracy ratings with the self-corrected accuracy ratings. P < 0.05 was considered to indicate a statistically significant difference. Fleiss’ kappa was used to assess the consistency of the three senior orthopedic surgeons’ ratings of the ChatGPT-3.5, ChatGPT-4, and Google Gemini responses. Fleiss’ kappa was interpreted as indicating poor agreement (0–0.2), fair agreement (0.2–0.4), moderate agreement (0.4–0.6), strong agreement (0.6–0.8), or very strong agreement (0.8–1.0).
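For readers who prefer an open-source workflow, the same tests are available in SciPy. The snippet below is an illustrative sketch with hypothetical score vectors and a hypothetical contingency table, not a reproduction of the SPSS analysis.

```python
# Illustrative sketch of the study's statistical tests using SciPy (hypothetical data).
import numpy as np
from scipy import stats

# Total scores per question for the three chatbots (example values only)
ts_gpt35 = np.array([8, 7, 10, 6, 9])
ts_gpt4 = np.array([10, 9, 11, 10, 12])
ts_gemini = np.array([9, 8, 10, 7, 11])

# Kruskal-Wallis H test across the three chatbots
h_stat, p_kw = stats.kruskal(ts_gpt35, ts_gpt4, ts_gemini)

# Paired t test: initial vs. self-corrected TS for the same 'poor' responses
initial_ts = np.array([4, 5, 3])
corrected_ts = np.array([7, 9, 6])
t_stat, p_paired = stats.ttest_rel(initial_ts, corrected_ts)

# Pearson's chi-squared test on a ratings contingency table
# (rows: before/after self-correction; columns: counts of good/moderate/poor ratings)
contingency = np.array([
    [0, 1, 5],
    [2, 3, 1],
])
chi2, p_chi2, dof, expected = stats.chi2_contingency(contingency)

print(f"Kruskal-Wallis P = {p_kw:.3f}, paired t test P = {p_paired:.3f}, chi-squared P = {p_chi2:.3f}")
```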

Results

Length of responses from the LLM chatbots

Table 1 shows the length in words and characters of the responses generated by the LLM chatbots to the GIOP-related general questions. The word count (mean ± standard deviation) was 346.50 ± 68.19 for ChatGPT-3.5 and 303.91 ± 43.56 for ChatGPT-4, and the word count (M(P25–P75)) was 308.50 (266.75–350.25) for Google Gemini (Fig. 2a). The number of words generated by ChatGPT-4 and Google Gemini was significantly lower than that generated by ChatGPT-3.5 (P < 0.05). The character count was 2445.65 ± 467.72 for ChatGPT-3.5, 2119.29 ± 300.83 for ChatGPT-4, and 2206.00 (1888.50–2546.25) for Google Gemini (Fig. 2b). The number of characters generated by ChatGPT-4 was significantly lower than that generated by ChatGPT-3.5 (P < 0.05).

Table 1 Length of the LLM chatbots’ responses to general questions about GIOP
Fig. 2 a, b Length of words and characters generated by the LLM chatbots in response to GIOP-related general questions; c, d length of words and characters generated in response to questions related to the 2022 ACR-GIOP Guidelines. *P < 0.05, **P < 0.01, ChatGPT-3.5 vs. ChatGPT-4 and Google Gemini; ^P < 0.05, ^^P < 0.01, ChatGPT-4 vs. Google Gemini

Table 2 shows the length in words and characters of the responses generated by the LLM chatbots to the questions related to the 2022 ACR-GIOP Guidelines. The word counts (M(P25–P75)) were 378.00 (303.00–407.00), 328.00 (257.00–361.00), and 317.00 (269.50–350.00) for ChatGPT-3.5, ChatGPT-4, and Google Gemini, respectively (Fig. 2c). The corresponding character counts were 2783.00 (2173.00–2967.50), 2407.00 (1875.00–2564.00), and 2273.00 (2016.50–2451.50) (Fig. 2d). The numbers of words and characters generated by Google Gemini were significantly lower than those generated by ChatGPT-3.5 (P < 0.05). The numbers of words and characters generated per question by the LLM chatbots are shown in Supplementary Tables 3 and 4a-c.

Table 2 Length of the LLM chatbots’ responses to questions related to the 2022 ACR-GIOP Guideline

Accuracy and grading of the LLM chatbot responses

Table 3 shows the TSs of the LLM chatbot responses to the different topics within the GIOP-related general questions. Regarding pathological mechanisms, the TS of ChatGPT-4 [10.00 (9.00–10.50)] was significantly higher than that of ChatGPT-3.5 [8.00 (6.50–8.00)] (P < 0.05). Table 4 shows the TSs of the LLM chatbot responses to the questions related to the 2022 ACR-GIOP Guidelines; Google Gemini [8.00 (8.00–11.00)] had a significantly lower TS than ChatGPT-4 [10.00 (9.00–11.00)] (P < 0.05).

Table 3 Differences in the LLM chatbots’ TSs for responses to general questions about GIOP
Table 4 Differences in the LLM chatbots’ TSs for responses to questions about the 2022 ACR-GIOP Guideline

Table 5 shows the accuracy ratings of the LLM chatbot responses to the different topics within the GIOP-related general questions. Regarding pathological mechanisms, ChatGPT-3.5 performed significantly worse than ChatGPT-4 and Google Gemini (P < 0.05). Overall, ChatGPT-4 performed excellently in answering the GIOP-related general questions, with no ‘poor’ responses, and it was most effective in addressing clinical manifestations, with 100% of its responses rated ‘good’. Table 6 shows the accuracy ratings of the LLM chatbots’ responses to the questions related to the 2022 ACR-GIOP Guidelines. ChatGPT-4 had the highest percentage of ‘good’ answers (64%), whereas Google Gemini had the lowest (32%). Google Gemini and ChatGPT-3.5 each gave four ‘poor’ answers, but overall, there was no significant difference among the three LLM chatbots (P > 0.05). The raw responses to each question generated by the LLM chatbots are shown in Supplementary Tables 2a-c.

Table 5 Accuracy ratings of the LLM chatbots’ responses to general questions related to GIOP
Table 6 Accuracy ratings of the LLM chatbots’ responses to questions related to the 2022 ACR-GIOP Guideline

Self-correcting capacity of LLM chatbots

Table 7 shows the changes in ChatGPT-3.5’s responses with a TS < 6 after self-correction. The mean TS of the initial responses was 4.00 ± 0.89, and the mean TS of the self-corrected responses was 6.67 ± 2.16, which was significantly higher (P < 0.05). Table 8 shows the changes in ChatGPT-4’s responses with a TS < 6 after self-correction; the initial TS of 4.00 ± 1.00 increased to 11.00 ± 1.00 after self-correction (P < 0.05). Table 9 shows the changes in Google Gemini’s responses with a TS < 6 after self-correction; there was no significant difference in the TS or ratings between the initial and self-corrected responses, suggesting that Google Gemini’s self-correction ability is weaker than that of ChatGPT-3.5 and ChatGPT-4. Supplementary Tables 5a-c show the LLM chatbot responses with TSs < 6; the specific parts of the responses that contain errors are highlighted in yellow, and these tables also provide further explanations of the errors identified by the professional orthopedic physicians.

Table 7 Self-correcting capacity of ChatGPT-3.5
Table 8 Self-correcting capacity of ChatGPT-4
Table 9 Self-correcting capacity of Google Gemini

Discussion

GIOP is caused by long-term use of glucocorticoid medications (usually defined as more than 3 months) in patients who suffer from a variety of inflammatory and autoimmune diseases, such as asthma, rheumatoid arthritis, and lupus erythematosus. It is characterized by a decrease in bone mineral density and susceptibility to fracture, a lack of obvious symptoms in the early stages, and a higher risk of osteoporosis in older patients, females, and patients who use higher doses of glucocorticosteroids [1, 5, 13, 33]. Long-term use of glucocorticoids may also cause or exacerbate other health problems, such as muscle loss, weight gain, high blood pressure, diabetes, and eye problems (e.g., cataracts) [7, 13, 34]. Based on these characteristics, the patient’s quality of life may be affected, and the ability to perform daily activities may be reduced [35]. Regular monitoring of bone density and individualized risk assessments are crucial for patients who are using or need to use glucocorticosteroids for a long period of time [1, 8, 14, 36].

With the development of AI, LLM chatbots such as ChatGPT-3.5, ChatGPT-4, and Google Gemini have been widely applied in the medical field [19, 30, 37]. According to a study by Giovanni Maria Iannantuono et al., LLM chatbots can quickly provide cancer patients with medical knowledge, drug information, disease symptoms and treatments, and other relevant information [38]. According to a study by Giacomo Rossettini et al., LLM chatbots (e.g., ChatGPT, Microsoft Bing, and Google Gemini) play a role in musculoskeletal rehabilitation by providing health counseling, medication management and reminders, and psychological support to patients [39]. In a study by Zhi Wei Lim et al., the ability of ChatGPT-3.5, ChatGPT-4, and Google Gemini to provide accurate responses to common myopia-related queries was evaluated, and the results showed that ChatGPT-4 provided more accurate and comprehensive responses than the other LLMs [31]. According to Cigdem Cinar’s study, ChatGPT was highly accurate in responding to general questions about osteoporosis but less accurate in responding to questions about osteoporosis guidelines [32]. However, no studies have yet tested the performance of LLM chatbots in answering questions related to osteoporosis caused by glucocorticoids.

When answering the general GIOP-related questions, ChatGPT-4 and Google Gemini provided more concise answers than ChatGPT-3.5, and when answering the questions related to the 2022 ACR-GIOP Guidelines, Google Gemini generated shorter responses than ChatGPT-3.5 in terms of both words and characters (Tables 1 and 2). These results suggest that, owing to technical and algorithmic differences, the LLM chatbots perform differently in information processing and question answering, and that Google Gemini and ChatGPT-4 may focus more on providing concise and direct answers to improve the efficiency of information delivery. However, examination of Google Gemini’s specific answers showed that it did not provide a clear answer to some questions, which reduced the length of its responses (Supplementary Tables 2 and 3a-c). This difference may be related to the different algorithms underlying the LLM chatbots [40]. In contrast, ChatGPT-3.5 tends to provide more detailed information, including background information, multiple perspectives, or additional explanations, which increases the number of characters and words. ChatGPT-4, an iteration of ChatGPT-3.5 that incorporates user feedback and improvements, adopts a more advanced linguistic representation and uses a larger and more diverse dataset; thus, it better captures linguistic patterns and meets users’ needs for high-quality answers, reflecting continuous progress in the field of natural language processing [38, 41, 42].

We also commissioned three professional orthopedic experts to rate the accuracy of the responses generated by the different LLM chatbots (Table 3; Supplementary Tables 1 and 2a-c). In terms of pathological mechanisms, the TS of ChatGPT-3.5 was significantly lower than that of ChatGPT-4 (P < 0.05), and the accuracy of ChatGPT-3.5 was also significantly lower than that of ChatGPT-4 and Google Gemini. However, there were no significant differences among the three LLM chatbots’ ratings on the remaining topics. For the questions related to the 2022 ACR-GIOP Guidelines, Google Gemini’s TS was significantly lower than that of ChatGPT-4 (P < 0.05), and there was no significant difference between the ratings of ChatGPT-4 and ChatGPT-3.5 (Table 4; Supplementary Tables 1 and 3a-c). The difference in scores between ChatGPT-3.5 and both ChatGPT-4 and Google Gemini could be due to a number of factors. ChatGPT-3.5 was trained on data available up to January 2022, whereas ChatGPT-4 was launched by OpenAI in March 2023. Building on the foundation of ChatGPT-3.5, ChatGPT-4 adopts a more advanced model architecture and richer training data in the medical domain. Furthermore, ChatGPT-4 is able to perform domain-specific fine-tuning and improve the contextual understanding of specialized medical terminology. More importantly, ChatGPT-4 improves the accuracy of answers in specific medical domains by iteratively incorporating user feedback on ChatGPT-3.5 [43]. Accordingly, in our study, the accuracy of ChatGPT-4 was significantly higher than that of ChatGPT-3.5 in answering questions about the pathological mechanisms of GIOP. Google Gemini, which was developed by Google and leverages Google’s long experience in search, natural language processing, and other AI areas, differs fundamentally from ChatGPT-4 (developed by OpenAI) in the way it processes information and answers questions [44,45,46,47]. In our study, ChatGPT-4 performed better than Google Gemini in answering questions related to the 2022 ACR-GIOP Guidelines. It is possible that ChatGPT-4’s dataset contains more recent professional literature or guidelines related to glucocorticoids and osteoporosis, and it may therefore be more accurate in processing related questions.

In our study, we also compared the self-correction ability of the three LLM chatbots by re-prompting the questions whose answers were rated “poor” and having the professional orthopedic surgeons re-rate the corrected responses. ChatGPT-3.5 had a total of six responses rated “poor” across all questions, two related to general GIOP information and four related to the 2022 ACR-GIOP Guidelines. The mean TS of these responses was 4.00 ± 0.89 before correction and 6.67 ± 2.16 after correction, a significant improvement (P < 0.05); however, the accuracy ratings did not improve significantly after correction (P > 0.05). ChatGPT-4 gave three “poor” responses, all to questions about the 2022 ACR-GIOP Guidelines. The mean TS was 4.00 ± 1.00 before correction; after correction, the responses were rated “good”, with a mean TS of 11.00 ± 1.00 (P < 0.05). Google Gemini gave seven “poor” responses, three related to general information about GIOP and four related to the 2022 ACR-GIOP Guidelines; its scores and grades did not change significantly between pre-correction and post-correction (P > 0.05). All three LLM chatbots performed poorly in answering questions about the 2022 ACR-GIOP Guidelines, each generating 3–4 responses with “poor” ratings. According to the professional orthopedic surgeons, the “poor” responses were mostly due to a lack of specificity in the details, a failure to answer according to the guidelines, and an inability to respond professionally to the questions asked. These findings indicate that these LLM chatbots have limited knowledge of the 2022 ACR-GIOP Guidelines and may therefore be of limited utility for patients with GIOP, and they underscore that appropriate management and timely assessment of GIOP by a professional health care team remain essential.

Strengths and limitations

The questions we selected might not have been comprehensive enough and might have biased the answers generated by the three LLMs. The scoring system used in this study was developed by our own team. The Fleiss’ kappa values for the three orthopedic surgeons’ scoring of the responses generated by ChatGPT-3.5, ChatGPT-4, and Google Gemini were 0.384, 0.350, and 0.340, respectively, indicating fair agreement among the evaluators; this may be related to differences in the evaluators’ professional backgrounds and experience. Our study also has temporal limitations: new, relevant content about GIOP will continue to be uploaded to the internet, and LLMs such as ChatGPT-3.5 and ChatGPT-4 are updated rapidly and may soon outpace the models evaluated in this study. As a result, the benchmark comparisons made here may quickly become outdated. However, the insights we have gained in applying these models to a specific disease remain valuable and provide a foundation for future studies using more advanced models. In addition, our study was conducted in English, and we did not examine dialogs in other languages. Finally, we did not assess the comprehensibility of the three LLMs’ responses, which may vary with readers’ education levels.

There are no published reports on the use of LLMs in the daily care of GIOP patients or their caregivers. However, LLM functions such as disease education, treatment recommendations, and patient counseling can improve the quality and efficiency of care, and LLMs could be extended to other potential applications in GIOP care, such as developing personalized treatment plans and implementing real-time monitoring and alerts. Technical improvements are needed to address the current limitations of LLMs in health care applications, such as improving model accuracy, increasing training on medically specific datasets, ensuring the security and privacy of patient data to avoid data misuse, and ensuring that the technology complies with the ethical and legal frameworks of health care. These improvements would enhance the applicability of LLMs in GIOP care.

Conclusion

Our study showed that ChatGPT-4 and Google Gemini provided more concise and intuitive answers than ChatGPT-3.5, and that ChatGPT-4 performed significantly better than ChatGPT-3.5 in answering general questions related to GIOP and better than Google Gemini in answering questions related to the 2022 ACR-GIOP Guidelines. Our findings also showed that ChatGPT-3.5 and ChatGPT-4 self-corrected better than Google Gemini. These differences might be related to differences in design, training data updates, and algorithms among the LLMs.