Effect of differences in O-RADS lexicon interpretation between senior and junior sonologists on O-RADS classification and diagnostic performance

Purpose To assess the consistency of Ovarian-Adnexal Reporting and Data System (O-RADS) lexicon interpretation between senior and junior sonologists and to investigate its impact on O-RADS classification and diagnostic performance. Methods We prospectively studied 620 patients with adnexal lesions, all of whom underwent transvaginal or transrectal ultrasound performed by a senior sonologist (R1), who selected the O-RADS lexicon descriptors and the O-RADS category for each lesion after the examination. A junior sonologist (R2) then analyzed the images retained by R1 and classified each lesion in the same way. Pathological findings were used as the reference standard. Kappa (κ) statistics were used to assess interobserver agreement. Results Of the 620 adnexal lesions, 532 were benign and 88 were malignant. When using the O-RADS lexicon, R1 and R2 had almost perfect agreement regarding lesion category, external contour of solid lesions, presence of papillary projections inside cystic lesions, and fluid echogenicity (κ: 0.81–1.00). Agreement was substantial for solid components, acoustic shadowing, vascularity and O-RADS categories (κ: 0.61–0.80). Agreement in assigning classic benign lesions to O-RADS categories was only moderate (κ = 0.535). There was no significant difference in diagnostic performance between the two sonologists using O-RADS (P = 0.1211). Conclusion There was good agreement between senior and junior sonologists in the interpretation of the O-RADS lexicon and in O-RADS classification, except for moderate agreement in the interpretation and classification of classic benign lesions. Differences in O-RADS category assignment between sonologists had no significant effect on the diagnostic performance of O-RADS.


Introduction
Ultrasound is the first-line imaging technique for the evaluation of adnexal masses (Vara et al. 2023). However, there is a lack of standardized terminology for gynecologic imaging, which may lead to variation in the interpretation of results between institutions, and consequently, patients may miss out on the best clinical management strategies (Andreotti et al. 2018). Because ultrasonography is highly subjective, assessment by an experienced sonologist is still the most accurate method in gynecologic ultrasonography (Vara et al. 2023; Vázquez-Manjarrez et al. 2020). To improve the consistency and accuracy of ultrasound, the American College of Radiology (ACR) published the Ovarian-Adnexal Reporting and Data System (O-RADS) lexicon in 2018 and built on it with the O-RADS ultrasound risk stratification and management system in 2020 (Andreotti et al. 2018, 2020). Prior to the formation of the O-RADS committee, the International Ovarian Tumor Analysis (IOTA) group collected decades of outcome data based on ovarian lesion characteristics, and on this basis published terms and definitions to describe adnexal masses and a series of diagnostic models to classify adnexal masses as benign or malignant (Timmerman et al. 2000, 2008, 2016; Van Calster et al. 2014).

The O-RADS committee retained some of the terms used by the IOTA group, weighed the supporting evidence for the terms used to classify lesions as benign or malignant together with how commonly these terms are used, reached a committee consensus, and ultimately proposed the O-RADS lexicon (Andreotti et al. 2018). The lexicon provides a detailed classification and definition of the main categories of lesions, their size, vascularity, and extra-ovarian findings, as well as specific descriptions of the external and internal features of each category of lesion (Andreotti et al. 2018, 2020). Based on this lexicon, the experts of the O-RADS working group defined six risk categories for adnexal lesions: category 0 for incomplete evaluation; category 1 for normal premenopausal ovaries; category 2 for almost certainly benign lesions with a < 1% risk of malignancy; category 3 for low risk of malignancy (1-10%); category 4 for intermediate risk (10-50%); and category 5 for high risk (≥ 50%) (Andreotti et al. 2020). The classification system also provides corresponding clinical management strategies for the different categories of lesions, establishing a bridge between ultrasound and clinical management (Andreotti et al. 2020).
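The six categories listed above can be held in a small lookup table. The following Python sketch is only illustrative: the dictionary layout and the function names are assumptions made here, the risk ranges are copied from the text above, and the O-RADS > 3 cutoff is the one used later in this study to define a positive (malignant) result.

```python
# O-RADS ultrasound risk categories as summarized in the text (Andreotti et al. 2020).
# The data structure and the helper names are hypothetical, for illustration only.
ORADS_CATEGORIES = {
    0: "incomplete evaluation",
    1: "normal premenopausal ovary",
    2: "almost certainly benign (< 1% risk of malignancy)",
    3: "low risk of malignancy (1-10%)",
    4: "intermediate risk (10-50%)",
    5: "high risk (>= 50%)",
}


def describe(category: int) -> str:
    """Return the textual description of an O-RADS category."""
    if category not in ORADS_CATEGORIES:
        raise ValueError(f"unknown O-RADS category: {category}")
    return ORADS_CATEGORIES[category]


def is_test_positive(category: int, cutoff: int = 3) -> bool:
    """Dichotomize a category: values above the cutoff count as malignant."""
    if category not in ORADS_CATEGORIES:
        raise ValueError(f"unknown O-RADS category: {category}")
    return category > cutoff


if __name__ == "__main__":
    for cat in sorted(ORADS_CATEGORIES):
        print(cat, describe(cat), is_test_positive(cat))
```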
Existing studies (Vara et al. 2022; Xie et al. 2022; Wu et al. 2022; Pi et al. 2021) have confirmed that the O-RADS ultrasound classification system has high sensitivity and diagnostic efficacy when applied by both senior and junior sonologists, but the reported inter-observer agreement of this classification system varies widely, ranging from fair to almost perfect. Differences in the interpretation of the O-RADS ultrasound lexicon can affect the outcome of O-RADS assessment (Antil et al. 2022). To assess the consistency of O-RADS ultrasound lexicon interpretation between senior and junior sonologists and to investigate its impact on O-RADS classification and diagnostic performance, we conducted a prospective study.

Materials and methods
The prospective study was approved by the Ethics Committee of our Hospital and registered on ClinicalTrials.gov (ChiCTR2100054542). Informed consent was obtained from all patients who underwent the examination.

Study participants
We prospectively studied 656 female patients with adnexal masses between June 2021 and May 2023. If a patient had multiple adnexal masses simultaneously (unilateral or bilateral), we selected the lesion with the highest O-RADS category for inclusion in the study. The inclusion criteria were preoperative patients with suspicious adnexal lesions detected by clinical palpation or imaging, available pathology, and lesions that were not postoperative pelvic recurrences of a malignancy. Exclusion criteria were (1) incomplete clinical data (n = 4), (2) uncertain histological diagnosis (n = 26), and (3) substandard image quality (n = 6). Ultimately, we included 620 adnexal lesions from 620 patients. During the study, we recorded the patients' age, clinical symptoms (abdominal pain, abdominal mass, irregular vaginal bleeding, etc.), age at menarche, menopausal status and CA125 level. The flow chart of our study is shown in Fig. 1.

Image acquisition and interpretation
All patients underwent transvaginal ultrasound, or transrectal ultrasound when transvaginal ultrasound was not possible, supplemented with transabdominal ultrasound if the lesion was too extensive. The ultrasound instrument used in this study was a Nuewa R9 (Mindray Medical, Shenzhen, China), and all ultrasound images were acquired and interpreted by a sonologist with more than 10 years of experience in gynecologic ultrasound at our institution (Reader 1, R1). This sonologist studied and was trained in the O-RADS ultrasound lexicon and classification system before the start of the study and passed the final test.
During the examination, the sonologist first used B-mode ultrasound to perform a complete scan and evaluation of the lesion, saving images of the largest section of the lesion and the section perpendicular to it, and measuring the size of the lesion on these sections. Then, Color Doppler Flow Imaging was performed, and the section of the lesion with the most abundant blood flow was retained. Sections of interest to the sonologist and those showing suspicious features of the lesion could also be retained as appropriate. After the examination, the sonologist selected the characteristics and the specific O-RADS category of the lesion based on the descriptions in the O-RADS ultrasound lexicon. To assess the differences between senior and junior sonologists when describing the same lesion using the O-RADS ultrasound lexicon, another sonologist with two years of experience in gynecologic ultrasound (Reader 2, R2), who had received the same training and passed the same test, analyzed the images retained during R1's examination and selected the corresponding features of the lesion and the specific O-RADS category according to the descriptions in the O-RADS ultrasound lexicon. The only patient information available to the two sonologists was the patient's age, clinical symptoms, menopausal status, and CA125 level; the patient's clinical diagnosis and other imaging findings were not available. All images in this study were stored in our hospital's Picture Archiving and Communication System (PACS).
Fig. 1 The flow chart of our study. O-RADS Ovarian-Adnexal Reporting and Data System

Reference standard
The postoperative pathological findings of the patients were used as a reference standard. All patients included in this study underwent surgery within two weeks of ultrasonography, and all pathologies were classified according to the World Health Organization (WHO) guidelines for the classification of female genital tumors (Meinhold-Heerlein et al. 2016). During the study, borderline tumors (BOT) were categorized as malignant (Basha et al. 2021;Hiett et al. 2022).

Data analysis
Statistical analyses were performed using SPSS version 21 (IBM Corporation, Armonk, NY) and MedCalc version 20.022 (MedCalc Software Ltd, Ostend, Belgium). Continuous variables were expressed as mean ± standard deviation, and categorical variables were expressed as numbers and percentages. Independent-samples t tests were used to compare continuous variables, and chi-square tests were used to compare categorical variables. The kappa value (κ) was used to assess inter-observer agreement, with κ equal to or less than 0.20 indicating slight agreement; 0.21-0.40 indicating fair agreement; 0.41-0.60 indicating moderate agreement; 0.61-0.80 indicating substantial agreement; and 0.81-1.00 indicating almost perfect agreement (Landis and Koch 1977). For inter-observer agreement on the internal and external characteristics of solid and cystic lesions, we analyzed only the lesions that both observers had classified as the same lesion type. The area under the receiver operating characteristic (ROC) curve (AUC) for the classification of tumors as benign or malignant was calculated to compare the diagnostic performance of the senior and junior sonologists when using O-RADS. With O-RADS > 3 defined as malignant (Wu et al. 2022; Cao et al. 2021), the O-RADS results were dichotomized accordingly, and the sensitivity, specificity, accuracy and positive predictive value (PPV) of O-RADS classification were calculated for both observers by comparison with pathological findings. A P value < 0.05 was considered statistically significant.
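As a minimal sketch of the agreement statistic described above, the following Python code computes an unweighted Cohen's kappa for two readers and maps it to the Landis and Koch (1977) bands quoted in the text. The study itself used SPSS and MedCalc and does not state whether a weighted kappa was applied, so the unweighted form is an assumption; the function names and the ten paired ratings are hypothetical and are not study data.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa for two raters scoring the same items."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(ratings_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def landis_koch(kappa):
    """Agreement band following the cutoffs quoted in the text."""
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical O-RADS categories assigned by two readers to ten lesions.
    reader_1 = [2, 2, 3, 3, 4, 4, 5, 5, 2, 3]
    reader_2 = [2, 2, 3, 4, 4, 4, 5, 5, 3, 3]
    kappa = cohen_kappa(reader_1, reader_2)
    print(f"kappa = {kappa:.3f} ({landis_koch(kappa)})")
```

Because O-RADS categories are ordinal, a weighted kappa would also be a defensible choice; the unweighted form is shown here only because it is the simplest.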

Results

Characteristics of patients and lesions
The study ultimately included 620 adnexal lesions in 620 patients, comprising 532 benign lesions and 88 malignant lesions (including 19 BOT). Of the 620 patients (mean age 39.02 years, range 12-82 years), 510 (82.3%) were premenopausal, 42 (6.8%) were early menopausal, and 68 (11.0%) were late menopausal. The median maximum diameter of the lesions was 6.5 cm (range 1.5-25.2 cm). There were significant differences between patients with benign and malignant lesions in age, presence of clinical symptoms, age at menarche, menopausal status, CA125 level and maximum diameter of the lesions (p < 0.05). The characteristics of the study population and adnexal lesions are detailed in Table 1.

O-RADS lexicon description of adnexal lesions and inter-observer agreement
Among the 620 adnexal lesions included, the maximum diameter of malignant lesions (9.78 ± 4.43 cm) was significantly larger than that of benign lesions (6.82 ± 2.92 cm) (p < 0.001). When lesions were described using the O-RADS lexicon, the two sonologists had almost perfect agreement (κ: 0.81-1.00) regarding lesion category, external contour of solid predominant lesions, presence of papillary projections inside cystic lesions, and fluid echogenicity. Agreement was substantial (κ: 0.61-0.80) for the presence or absence of solid components, acoustic shadowing and vascularity. However, inter-observer agreement on the assessment of classic benign lesions was only moderate (κ = 0.566). The O-RADS lexicon descriptions of adnexal lesions and the inter-observer agreement between the two observers are presented in Table 2. Table 3 shows the results of O-RADS classification by the two sonologists. In this study, the malignancy rates of lesions classified as O-RADS categories 2, 3, 4, and 5 were 0.0%, 0.0%, 32.3%, and 93.4% (R1) and 0.0%, 2.4%, 24.8%, and 93.8% (R2), respectively, which were consistent with the expected malignancy rates for these categories. There was substantial agreement between the two observers in assigning O-RADS categories (κ = 0.790). However, inter-observer agreement on the O-RADS classification of classic benign lesions was only moderate (κ = 0.535) (Fig. 2). With O-RADS > 3 defined as malignant, R1 had higher diagnostic sensitivity (100.0% vs 96.6%), specificity (87.0% vs 85.0%), accuracy (88.9% vs 86.6%) and AUC (0.980 vs 0.971) than R2. However, the difference in diagnostic performance between the two sonologists was not significant (P = 0.1211) (Table 4 and Fig. 3).
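The sensitivity, specificity and accuracy reported above follow directly from a 2 × 2 table of the dichotomized O-RADS result against pathology. The sketch below shows that arithmetic. The cell counts are not given in the text: they are back-calculated here, as an assumption, from the reported percentages and the 88 malignant and 532 benign lesions, and should be checked against Table 4. The function name is hypothetical.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard 2x2 diagnostic metrics, returned as proportions."""
    total = tp + fn + tn + fp
    return {
        "sensitivity": tp / (tp + fn),  # malignant lesions called O-RADS > 3
        "specificity": tn / (tn + fp),  # benign lesions called O-RADS <= 3
        "accuracy": (tp + tn) / total,
        "ppv": tp / (tp + fp),          # O-RADS > 3 calls that were malignant
    }


# ASSUMPTION: counts reconstructed from the reported percentages, not taken
# from the paper's tables (88 malignant and 532 benign lesions per reader).
r1 = diagnostic_metrics(tp=88, fn=0, tn=463, fp=69)
r2 = diagnostic_metrics(tp=85, fn=3, tn=452, fp=80)

if __name__ == "__main__":
    for name, metrics in (("R1", r1), ("R2", r2)):
        summary = ", ".join(
            f"{key} {100 * metrics[key]:.1f}%"
            for key in ("sensitivity", "specificity", "accuracy")
        )
        print(f"{name}: {summary}")
```

Under these reconstructed counts, the rounded sensitivity, specificity and accuracy match the values reported in the text for both readers. Comparing the two AUCs requires a paired test on the full five-category scores and is not reproduced here.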

Discussion
The O-RADS ultrasound working group selected some terms from the O-RADS lexicon and proposed the O-RADS ultrasound risk stratification and management system in 2020 (Andreotti et al. 2020; Cao et al. 2021). This risk stratification and management system provides a detailed description of each category (Andreotti et al. 2020), but in some studies, the classification system still did not achieve good interobserver agreement (Antil et al. 2022; Guo et al. 2022a, 2022b). Differences in the interpretation of the O-RADS ultrasound lexicon can affect the results of O-RADS assessment (Antil et al. 2022). To explore the sources of inter-observer variation in the O-RADS classification system, we analyzed the consistency with which senior and junior sonologists applied the O-RADS ultrasound lexicon to describe the same lesions, and then explored the impact of differences in O-RADS lexicon interpretation on O-RADS classification results and diagnostic performance.
Consistent with the results of Jha et al. (2022), inter-observer agreement between senior and junior sonologists was almost perfect for lesion type. Inter-observer agreement for blood flow scores was also at the same level as in previous studies (substantial agreement) (Antil et al. 2022; Jha et al. 2022). However, interobserver agreement for the outer contour of the solid component was higher than in previous studies (almost perfect vs moderate) (Antil et al. 2022; Jha et al. 2022). A possible reason is that, in the present study, the external contour was analyzed only for lesions that both observers had classified as solid or solid-appearing, which may have improved interobserver agreement for this term. For the presence or absence of solid components, inter-observer agreement in this study was poorer than in Jha et al. (2022), although it still reached a substantial level. In addition, this study included an interobserver agreement analysis of classic benign lesions, and agreement between the two sonologists was only moderate in 338 pathologically confirmed classic benign lesions (κ = 0.566). The senior sonologist was more accurate than the junior sonologist in identifying and correctly classifying classic benign lesions (77.5% vs. 67.8%). Consistent with some studies (Basha et al. 2021; Cao et al. 2021; Guo et al. 2022; Katlariwala et al. 2022), the inter-observer agreement of O-RADS classification in this study was substantial (κ = 0.790), and O-RADS classification inconsistencies were mostly concentrated in classic benign lesions (κ = 0.535).
When classifying classic benign lesions, the junior sonologist tended to follow the lesion categories in the O-RADS lexicon; as a result, the junior sonologist assigned some classic benign lesions to the subcategories of atypical benign lesions in O-RADS 2 or 3, which in turn affected the overall inter-observer agreement of the O-RADS classification (κ = 0.790). Numerous studies have confirmed that the best cutoff value for the O-RADS classification system is > O-RADS 3 (Wu et al. 2022; Basha et al. 2021; Cao et al. 2021). Therefore, even when classified according to the lesion categories in the O-RADS ultrasound lexicon, most classic benign lesions could still be accurately categorized as benign, and the classification system demonstrated high diagnostic performance when applied by both senior and junior sonologists (AUC: 0.980 and 0.971, p = 0.1211). However, in this study, some classic benign lesions were misclassified as O-RADS category 4, resulting in an overestimation of their risk of malignancy, which in turn affected the diagnostic accuracy of the O-RADS classification system. The most frequently misclassified lesion type was the dermoid cyst. The histological components of dermoid cysts are mixed, which leads to a complex and diverse sonographic presentation (Saida et al. 2021; Saleh et al. 2021).
In the present study, the hyperechoic component in some dermoid cysts was easily mistaken for a solid component by the junior sonologist, which in turn led to their classification in a higher O-RADS category and overestimation of their risk of malignancy. This may also be one of the reasons why interobserver agreement on the presence or absence of a solid component in this study was lower than in previous studies (Jha et al. 2022). Katlariwala et al. (2022) concluded that comparing the lesion echogenicity with the echogenicity of the surrounding pelvic or subcutaneous fat during evaluation may reduce confusion between the hyperechoic component of a dermoid cyst and a solid component. In clinical work, the interobserver agreement and diagnostic efficacy of the O-RADS classification and management system may be further improved if sonologists' identification of classic benign lesions is strengthened.
The main strength of our study is that we analyzed interobserver agreement for each subcategory in the O-RADS ultrasound lexicon and for the O-RADS classification system separately, based on prospective, large-sample data. However, this study has some limitations. First, it was a single-center study, and the findings still need to be validated in a large-scale multicenter study. Second, the junior sonologist evaluated the lesions on the basis of images acquired by the senior sonologist, which may have led to overestimation of inter-observer agreement and of the diagnostic performance of the junior sonologist using the O-RADS classification system. Third, the patients included in this study were preoperative, and data on normal ovaries (O-RADS 1) and non-surgical benign lesions were excluded, which may have led to selection bias.
In summary, there was good agreement between senior and junior sonologists in the interpretation of the O-RADS ultrasound lexicon and in the assignment of O-RADS categories, except for moderate interobserver agreement in the interpretation and classification of classic benign lesions. In addition, differences in O-RADS category assignment among sonologists of different seniority had no significant effect on the diagnostic performance of O-RADS.
Author contributions
All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by YY, HYW, ZZL, NS, LYG, XXT, RZ, YG, LM, RJW, WX, YHX, WJZ, HZ, GQX, TR, QD, JCL and YXJ. The first draft of the manuscript was written by YY and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Competing Interests
The authors have no relevant financial or nonfinancial interests to disclose.

Ethics approval and consent to participate
Our study was approved by the Ethics Committee of Peking Union Medical College Hospital and registered on ClinicalTrials.gov (ChiCTR2100054542). Informed consent was obtained from all patients who underwent the examination.

Open Access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.