Abstract
Purpose
A large-scale language model is expected to have been trained with a large volume of data including cancer treatment protocols. The current study aimed to investigate the use of generative pretrained transformer 4 (GPT-4) for identifying the TNM classification of pancreatic cancers from existing radiology reports written in Japanese.
Materials and methods
We screened 100 consecutive radiology reports on computed tomography scan for pancreatic cancer from April 2020 to June 2022. GPT-4 was requested to classify the TNM from the radiology reports based on the General Rules for the Study of Pancreatic Cancer 7th Edition. The accuracy and kappa coefficient of the TNM classifications by GPT-4 was evaluated with the classifications by two experienced abdominal radiologists as gold standard.
Results
The accuracy values of the T, N, and M factors were 0.73, 0.91, and 0.93, respectively. The kappa coefficients were 0.45 for T, 0.79 for N, and 0.83 for M.
Conclusion
Although GPT is familiar with the TNM classification for pancreatic cancer, its performance in classifying actual cases in this experiment may not be adequate.
Similar content being viewed by others
Explore related subjects
Discover the latest articles, news and stories from top researchers in related subjects.Avoid common mistakes on your manuscript.
Introduction
Advancements in the natural language processing technology have facilitated the categorizing and summarizing of radiology reports [1,2,3,4,5]. Large-scale language models (LLMs), such as generative pretrained transformer 4 (GPT-4) (OpenAI Inc., San Francisco, CA, USA) are anticipated to have been trained on a large amount of various medical resources and tested for many medical applications, including interpretation of medical documents and radiology reports [6,7,8]. One of these applications is determining cancer staging from the natural language text of radiology reports without the need for additional training [9, 10].
The current study aimed to investigate the use of GPT-4 for accurately classifying the TNM stage of pancreatic cancer based on unstructured radiology reports written in Japanese. Previous studies have applied LLMs to the TNM classification of lung cancer [9, 10]. This study is the first to investigate the TNM classification of pancreatic cancer using GPT-4.
Materials and methods
Patients and ethics approval
We screened 100 consecutive patients with pancreatic cancer (including those suspected of cancer) who underwent contrast-enhanced abdominal computed tomography (CT) scan and evaluation of TNM classification on radiology report starting from April 2020 at our institution and who had available radiology report on TNM classification. In cases requiring a second TNM classification during the study period, such as after neoadjuvant chemotherapy, the later report was excluded, and the collection continued until 100 cases were reached. No additional exclusion criteria were applied. These radiology reports were created by modifying shared text templates in most cases; thus, they were written in a similar style. Moreover, pancreatic cancer was selected as the number of cases and radiology reports available at our institution were sufficient based on the clinical situation at our hospital. Furthermore, our team included abdominal radiology specialists.
This study was approved by the Institutional Review Board with a waiver for documented informed consent from patients (approval number 2023–0029). The research title was published on an institutional website, and the patients were allowed to opt-out based on the institutional regulations. Patient identifiers were removed, and this study was conducted in accordance with the Declaration of Helsinki. This was a retrospective study, and we have reviewed our study to eliminate bias according to the Standards for Reporting of Diagnostic Accuracy (STARD) guidelines. Although our sample of 100 consecutive pancreatic cancer cases may not be sufficient to adhere to the guidelines, we have ensured that other criteria are met in our study.
Data processing
Our radiology reports provide an objective description of image findings and impressions from a radiologist’s perspective based on those findings. The TNM classification of cancer is often included by the impressions as the result of interpretations; however, we did not use this original TNM classification in this study. Image findings of 100 cases were reviewed by two experienced abdominal radiologists (with 21 years and 13 years of experience as radiologists) to determine the TNM classification according to the General Rules for the Study of Pancreatic Cancer 7th Edition [11]. In cases where there was a disagreement between the two radiologists, a consensus was reached through discussion to determine the gold standard (TNM by radiologists). Only the descripted image findings were transmitted to GPT-4 to identify the TNM classification (TNM by GPT-4). No personally identifiable data, metadata, such as age and gender, or TNM classifications made by radiologists were transmitted to GPT-4. OpenAI Playground was used to prevent the use of our data used for training by GPT. The latest language model “gpt-4-turbo-2024–04-09” was utilized. The temperature parameter, which controls variability, was set to 0.0. The system prompt was designed to guide GPT-4 in classifying the TNM stage based on the radiology reports and the General Rules for the Study of Pancreatic Cancer 7th Edition. Our prompt uses the chain-of-thought technique as prompt engineering [12]. The prompt tells GPT-4 to determine the T, N, and M factors step-by-step and then summarizes them into the TxNxMx format.
Prompt for GPT-4 (Translation into English, as original prompt was written in Japanese):
“Identify TNM classification from the CT scan findings according to the General Rules for the Study of Pancreatic Cancer 7th Edition.
-
Evaluate the T factor based on the tumor size, local extent of pancreatic cancer (CH, DU, S, RP, PV, A, PL, and OO), and the description about invasion into the surrounding areas.
-
Evaluate the N factor based on the presence or absence of metastasis to the regional lymph nodes.
-
Evaluate the M factor based on the presence or absence of distant metastasis.
Finally, summarize them in the TxNxMx format.”
Evaluation
The performance of TNM by GPT-4 was assessed using the accuracy and kappa coefficient of the T, N, and M factors versus the TNM by radiologists. Microsoft Excel (Microsoft, Redmond, WA, USA) and JMP Pro 17 (SAS Institute, Cary, NC, USA) were used to calculate these indices.
For T1a and T1b classifications by GPT-4, they were considered correct if they corresponded to T1 by the radiologists. If the radiologists marked the cases as Nx or Mx, indicating no clear documentation of metastasis, any GPT-4’s classification of N and M factors was considered correct. Radiologists reserved judgment on the presence of lymph node metastasis in such cases, including lymph node around aorta.
Results
In total, 100 consecutive patients were identified from April 2020 to June 2022. Among them, 62 were men and 38 women and were aged 38–93 (average: 70.3 ± 10.8) years. The TNM classifications made by radiologists for all patients were as follows: T1 (n = 5), T2 (n = 2), T3 (n = 64), T4 (n = 28), and Tx (n = 1). For the N factor, N0 (n = 68), N1 (n = 16), and Nx (n = 16) were reported. For the M factor, M0 (n = 66), M1 (n = 29), and Mx (n = 5) were observed. The classification made by GPT-4 were as follows: T1a (n = 2), T1b (n = 2), T2 (n = 1), T3 (n = 66), and T4 (n = 29). GPT-4 reported N0 (n = 65) and N1 (n = 35) for the N factor and M0 (n = 74) and M1 (n = 26) for the M factor (Tables 1, 2, and 3).
The accuracy values were 0.73 for T (95% confidence interval [CI] 0.64–0.82), 0.91 for N (CI 0.85–0.97), and 0.93 (CI 0.88–0.98), respectively. The kappa coefficients were 0.45 for T (CI 0.28–0.62), 0.79 for N (CI 0.66–0.92), and 0.83 for M (CI 0.71–0.95).
Discussion
The TNM classification is a crucial factor in determining treatment strategies, and it is expected that the classification provided by GPT will closely align with that provided by radiologists. However, the results of our experiment showed that the kappa values did not meet clinical standards. The errors have various causes, making it difficult to identify all of them. Therefore, we focused on identifying common conditions that seemed to be involved in the majority of error cases.
The T factor in the General Rules for the Study of Pancreatic Cancer 7th Edition aligns with the 7th edition of the TNM Classification of Malignant Tumors by the Union for International Cancer Control (UICC classification). In these guidelines, T4 is defined as invasion (including “abutment” and “contact”) into the celiac axis or superior mesenteric artery. In nine cases, GPT-4 did not classify “contact” with these arteries as invasion, incorrectly determining them as T3. In 11 cases, GPT-4 incorrectly classified invasions into other organs, such as the spleen, and invasions into the portal vein or splenic artery as T4. The extent of invasion is represented by symbols, such as CH, DU, S, RP, PV, A, PL, and OO, and GPT-4 has a high accuracy in recognizing symbols and notations representing the extent of invasion. We guided GPT-4 to interpret these symbols; however, their meaning was not explained in the prompt. GPT-4 interpreted these symbols based on its own knowledge.
The low accuracy in N and M factor classification was attributed to the different categories of metastases to the lymph nodes outside the regional lymph nodes. Under the General Rules for the Study of Pancreatic Cancer 7th Edition, metastasis to nonregional lymph nodes, such as the para-aortic lymph nodes, is considered distant metastasis and is classified as M1. However, GPT-4 incorrectly classified it as N1 in six cases. There is a concern that such classification errors could occur with other types of cancer as well.
Another reason might be that Japanese radiology reports are rarely published; thus, GPT-4 may not have sufficiently learned their descriptive expressions. Furthermore, discrepancies between the evaluations by two radiologists were noted in seven cases for the T factor, eight for the N factor, and two for the M factor, and GPT-4 failed most of these subtle cases, correctly identifying only one case each for the T, N, and M factors. This indicates that Japanese language expressions are often ambiguous, resulting in varying interpretations even among radiologists.
LLMs including GPT-4 do not retain memory unlike a database. Their knowledge is fragmented and diffused within the mass of neural network parameters, resulting in a state similar to a vague recollection. Therefore, even though the guidelines were explicitly named, GPT-4 defaulted to a more general manner of classifying. For example, direct invasion to other organs is classified as T4, and lymph node metastases as N1. When asked how pancreatic cancer is classified as T4, GPT-4 can correctly respond that it involves invasion into the celiac axis or the superior mesenteric artery. This phenomenon likely occurred due to insufficient training on the guidelines. This issue persists even when specifying the UICC classification instead of the Japanese guidelines. If this issue is pointed during interactions, GPT-4 can optionally correct it according to the notification by users. Individual rules can be enforced for N and M factors in the prompt. However, such approach is not in accordance with our objective to develop an assistant for TNM classification for all types of cancers or a general-purpose assistant for imaging diagnosis.
The current study had several limitations. That is, the results were not compared with those of pathological examinations. Beyond this study, a hypothesis may exist, which states that LLMs might be able to interpret radiology reports more effectively than radiologists. However, in this study, we focused on the ability of LLMs as natural language processors with medical knowledge. Although the evaluation of only 100 cases in this study may be insufficient, significantly increasing the number of cases beyond this was difficult due to practical conditions. We attempted to eliminate selection bias by including consecutive cases; however, a distribution bias may still exist. This may be due to challenges in detecting early-stage pancreatic cancer and the tendency of our facility to receive advanced cases from neighboring facilities. That is, 64 patients presented with T3 tumors, and only five and two patients had T1 and T2 tumors, respectively. This bias could lead to an apparent increase in accuracy compared with those of previous reports about lung cancer [9, 10]. For better evaluation, the kappa coefficient is used as an alternate indicator that eliminates the effects of coincidental matches, and the kappa coefficient for the T factor remains low at 0.45 (moderate agreement). Since the radiologists reserved judgment on the presence of metastasis to slightly enlarged lymph nodes and used Nx or Mx, it was necessary to consider GPT-4’s responses to be correct. This improved the evaluation of the N and M factors slightly. Further research is needed, including evaluations with improved prompts, cancers of other organs, indicators beyond the TNM classification, and alternative LLMs besides GPT-4.
Another issue was the lack of reproducibility of GPT’s responses. OpenAI provides a parameter called “temperature” to control the variability of GPT’s responses, which can be specified as a decimal between 0 and 1. When using GPT as a chatbot, the temperature is often set to a value higher than 0.0 to make it behave more human-like. However, for scientific experiments requiring reproducibility, the temperature is commonly set to 0.0. Even with the temperature set to 0.0, the variability of GPT-4’s responses may not be sufficiently suppressed. Whether the variability in GPT-4’s responses affects TNM classification requires further investigation. In addition, OpenAI frequently updates the language model under GPT-4, which raises concerns about maintaining the reproducibility of experimental results.
In similar previous studies using lung cancer, the TNM classification was provided in the prompt [9, 10]; however, the prompt we used did not include the definitions of the TNM classification. Although the performance was insufficient, GPT can act as an artificial intelligence with medical knowledge. When the training of LLMs advances further, they will become useful copilots for radiologists.
Conclusion
GPT is familiar with the TNM classification for pancreatic cancer; however, in this experiment, its performance in classifying actual cases may be inadequate. Further investigation, particularly, on how prompts are formulated is needed.
References
Hu D, Zhang H, Li S, Wang Y, Wu N, Lu X. Automatic extraction of lung cancer staging information from computed tomography reports: deep learning approach. JMIR Med Inform. 2021;9(7):e27955. https://doi.org/10.2196/27955.
Nobel JM, Puts S, Weiss J, Aerts HJWL, Mak RH, Robben SGF, et al. T-staging pulmonary oncology from radiological reports using natural language processing: translating into a multi-language setting. Insights Imaging. 2021;12(1):77. https://doi.org/10.1186/s13244-021-01018-1.
Park HJ, Park N, Lee JH, Choi MG, Ryu J-S, Song M, et al. Automated extraction of information of lung cancer staging from unstructured reports of PET-CT interpretation: natural language processing with deep-learning. BMC Med Inform Decis Mak. 2022;22(1):229.
Suzuki K, Shirai Y, Kawaji T, Sakai S. Category classification for lung computed tomography of COVID-19 by natural language processing in Japanese radiology report. Tokyo Women’s Med Univ J. 2023;7:109.
Puts S, Nobel M, Zegers C, Bermejo I, Robben S, Dekker A. How natural language processing can aid with pulmonary oncology tumor node metastasis staging from free-text radiology reports: algorithm development and validation. JMIR Form Res. 2023;7:e38125. https://doi.org/10.2196/38125.
Sun Z, Ong H, Kennedy P, Tang L, Chen S, Elias J, et al. Evaluating GPT-4 on impressions generation in radiology reports. Radiology. 2023;307(5):e231259. https://doi.org/10.1148/radiol.231259.
Gertz RJ, Bunck AC, Lennartz S, Dratsch T, Iuga A-I, Maintz D, et al. GPT-4 for Automated determination of radiologic study and protocol based on radiology request forms: a feasibility study. Radiology. 2023;307(5):e230877.
Nakaura T, Ito R, Ueda D, Nozaki T, Fushimi Y, Matsui Y, et al. The impact of large language models on radiology: a guide for radiologists on the latest innovations in AI. Jpn J Radiol. 2024. https://doi.org/10.1007/s11604-024-01552-0.
Nakamura Y, Kikuchi T, Yamagishi Y, Hanaoka S, Nakao T, Miki S, et al. ChatGPT for automating lung cancer staging: feasibility study on open radiology report dataset. medRxiv. 2023. https://doi.org/10.1101/2023.12.11.23299107.
Nishio M, Matsuo H, Matsunaga T, Fujimoto K, Rohanian M, Nooralahzadeh F, et al. (2023) Zero-shot classification of TNM staging for Japanese radiology report using ChatGPT at RR-TNM subtask of NTCIR-17 MedNLP-SC. The 17th NTCIR Conference on Evaluation of Information Access Technologies. Tokyo, Japan. pp. 155–61
Japan Pancreas Society. The general rules for the study of pancreatic cancer. 7th ed. Tokyo: Kanehara & Co., Ltd.; 2016.
Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain-of-thought prompting elicits reasoning in large language models. ArXiv. 2022;35:24824.
Acknowledgements
We thank Dr. Satoru Morita, an abdominal radiologist who reviewed the TNM classification of 100 cases.
Funding
No funding was received to assist with the preparation of this manuscript.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare no conflicts of interest.
Ethical approval
This study was approved by the Institutional Review Board with a waiver for documented informed consent from patients. The research title was published on an institutional website, and patients were allowed to opt-out based on the institutional regulations. Patient identifiers were removed, and this study was conducted in accordance with the Declaration of Helsinki.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Below is the link to the electronic supplementary material.
Rights and permissions
This article is published under an open access license. Please check the 'Copyright Information' section either on this page or in the PDF for details of this license and what re-use is permitted. If your intended use exceeds what is permitted by the license or if you are unable to locate the licence and re-use information, please contact the Rights and Permissions team.
About this article
Cite this article
Suzuki, K., Yamada, H., Yamazaki, H. et al. Preliminary assessment of TNM classification performance for pancreatic cancer in Japanese radiology reports using GPT-4. Jpn J Radiol (2024). https://doi.org/10.1007/s11604-024-01643-y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s11604-024-01643-y