Interobserver agreement between eight observers using IOTA simple rules and O-RADS lexicon descriptors for adnexal masses

Purpose To evaluate interobserver agreement in assigning imaging features and classifying adnexal masses using the IOTA simple rules versus O-RADS lexicon and identify causes of discrepancy. Methods Pelvic ultrasound (US) examinations in 114 women with 118 adnexal masses were evaluated by eight radiologists blinded to the final diagnosis (4 attendings and 4 fellows) using IOTA simple rules and O-RADS lexicon. Each feature category was analyzed for interobserver agreement using intraclass correlation coefficient (ICC) for ordinal variables and free marginal kappa for nominal variables. The two-tailed significance level (α) was set at 0.05. Results For IOTA simple rules, interobserver agreement was almost perfect for three malignant lesion categories (M2-4) and substantial for the remaining two (M1, M5) with k-values of 0.80–0.82 and 0.68–0.69, respectively. Interobserver agreement was almost perfect for two benign feature categories (B2, B3), substantial for two (B4, B5) and moderate for one (B1) with k-values of 0.81–0.90, 0.69–0.70 and 0.60, respectively. For O-RADS, interobserver agreement was almost perfect for two out of ten feature categories (ascites and peritoneal nodules) with k-values of 0.89 and 0.97. Interobserver agreement ranged from fair to substantial for the remaining eight feature categories with k-values of 0.39–0.61. Fellows and attendings had ICC values of 0.725 and 0.517, respectively. Conclusion O-RADS had variable interobserver agreement with overall good agreement. IOTA simple rules had more uniform interobserver agreement with overall excellent agreement. Greater reader experience did not improve interobserver agreement with O-RADS.


Graphical abstract
Interobserver Agreement between Eight Observers using IOTA Simple Rules and O-RADS Lexicon Descriptors for Adnexal Masses. Antil and Raghu, et al; 2022
• O-RADS had more variable interobserver agreement compared to IOTA.
• Reader experience did not improve interobserver agreement.
• Findings may be due to binary approach in IOTA versus pattern-based recognition in O-RADS.

Introduction
Numerous ultrasound (US) guidelines have attempted to guide the accurate characterization and subsequent management of adnexal masses. These include the International Ovarian Tumor Analysis (IOTA) simple rules, American College of Radiology Ovarian-Adnexal Reporting and Data System (O-RADS), Society of Radiologists in Ultrasound (SRU) Consensus Guidelines, Gynecologic Imaging Reporting and Data System (GI-RADS), and Morphology Index by the University of Kentucky. These systems rely on subjective assessment, pattern-based recognition, morphologic indexing, simple scoring systems, and/or statistically derived algorithms [1-9]. We aim to evaluate interobserver agreement in classification of adnexal masses using two of the most widely used systems: the pre-existing IOTA Simple Rules and the newer O-RADS lexicon.
In 2008, the IOTA group published evidence-based nomenclature which led to development of the "Simple Rules" [1,2]. These include a set of five US features indicative of benignity (B-rules) and a set of five US features indicative of malignancy (M-rules). Based on these rules, adnexal masses are then categorized into benign, malignant, or inconclusive [1,2]. The system has high diagnostic performance and good risk prediction capability, but still has limited use in clinical practice given the need for further imaging workup of all IOTA inconclusive lesions, which account for approximately 20% of patient cases in one study [9].
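The three-way categorization described above can be sketched in a few lines. This is a minimal illustration, not the IOTA group's implementation: the function name is hypothetical, the feature codes B1-B5 and M1-M5 follow the labels used in this paper, and the decision logic assumes the commonly published convention that a mass is malignant when only M-features apply, benign when only B-features apply, and inconclusive when both or neither apply.

```python
# Hypothetical helper; feature codes follow the B1-B5 / M1-M5 labels in this paper.
B_FEATURES = {"B1", "B2", "B3", "B4", "B5"}
M_FEATURES = {"M1", "M2", "M3", "M4", "M5"}


def classify_iota_simple_rules(observed_features):
    """Return 'benign', 'malignant', or 'inconclusive' for one adnexal mass.

    observed_features: iterable of feature codes a reader assigned to the mass.
    Assumed convention: only M-features -> malignant, only B-features -> benign,
    both or neither -> inconclusive.
    """
    observed = set(observed_features)
    unknown = observed - B_FEATURES - M_FEATURES
    if unknown:
        raise ValueError(f"unknown feature codes: {sorted(unknown)}")
    has_benign = bool(observed & B_FEATURES)
    has_malignant = bool(observed & M_FEATURES)
    if has_malignant and not has_benign:
        return "malignant"
    if has_benign and not has_malignant:
        return "benign"
    return "inconclusive"
```

Because the final category depends only on which feature sets are non-empty, two readers who disagree on a single feature can still reach the same category, which is one way an algorithmic final step can absorb feature-level disagreement.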
In 2019, the American College of Radiology introduced O-RADS to provide an internationally standardized risk-stratified lexicon and to unify various diagnostic and management approaches into a single model [8,9]. The lexicon provides descriptors and definitions for physiologic cysts (i.e. follicle, corpus luteum) as well as non-physiologic benign and malignant adnexal masses. Based on the lesion descriptors, the system further classifies lesions into six risk categories. These include O-RADS 0, an incomplete evaluation; O-RADS 1, healthy normal premenopausal ovaries or physiological simple cysts ≤ 3 cm; O-RADS 2, almost certainly benign with < 1% risk of malignancy; O-RADS 3, lesions with low (1-10%) risk of malignancy; O-RADS 4, lesions with intermediate (10 to < 50%) risk of malignancy; and O-RADS 5, lesions with high (≥ 50%) risk of malignancy [8,9]. Management or follow-up recommendations are also provided for each category as part of O-RADS.
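The risk bands attached to O-RADS 2 through 5 can be written down as a small lookup. This is only a sketch of the thresholds quoted above: in practice the category is assigned from lexicon descriptors rather than from a numeric risk, the function name is hypothetical, and the handling of exactly 10% (placed in O-RADS 4 here) is an assumption, since the text gives both "1-10%" and "10 to < 50%".

```python
# Descriptions paraphrase the category definitions given in the text.
ORADS_DESCRIPTIONS = {
    0: "incomplete evaluation",
    1: "normal premenopausal ovary or physiologic simple cyst <= 3 cm",
    2: "almost certainly benign (< 1% risk of malignancy)",
    3: "low risk (1% to < 10%)",
    4: "intermediate risk (10% to < 50%)",
    5: "high risk (>= 50%)",
}


def orads_band_for_risk(risk):
    """Map a malignancy risk in [0, 1] to the O-RADS band (2-5) it falls in.

    Hypothetical helper: categories 0 and 1 are not defined by risk and are
    never returned. Exactly 0.10 is assigned to category 4 by assumption.
    """
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be a probability between 0 and 1")
    if risk < 0.01:
        return 2
    if risk < 0.10:
        return 3
    if risk < 0.50:
        return 4
    return 5
```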

Study design
A retrospective reader-based diagnostic performance study was performed in women who presented to the radiology department for routine non-obstetric pelvic ultrasound. The research study was Health Insurance Portability and Accountability Act (HIPAA) compliant and received Institutional Review Board (IRB) approval. Due to the retrospective nature of the study, informed consent was waived.

Inclusion and exclusion criteria
Pelvic ultrasound examinations were reviewed on the picture archiving and communication systems (PACS) workstation by a research radiologist (NA) with specialization and expertise in pelvic ultrasound and ovarian cancer. All exams with adnexal masses (cystic, solid, or mixed cystic and solid) were included in the study. Patients with bilateral adnexal masses were recorded separately as two lesions. Normal or incomplete studies (i.e. without transvaginal scanning or color Doppler) were excluded. Additionally, the following were excluded: extra-ovarian lesions, physiologic follicles or corpus luteum, and cystic lesions < 1 cm in postmenopausal women.
The research radiologist (NA) reviewed the electronic medical records and recorded patient age, menopausal status, and final pathologic diagnosis when available. For lesions that were not resected, adnexal masses with adequate follow-up (≥ 2 years of follow-up with documented imaging to show benignity of the lesion) were included in the final analysis. Imaging follow-up for 2 years on any modality was accepted: ultrasound, computed tomography (CT) or magnetic resonance imaging (MRI) to document stability or resolution. In certain cases, follow-up CT or MRI which characterized a classic lesion was also noted (e.g. macroscopic fat seen on CT or MRI to confirm suspected dermoid cyst). In cases where the imaging comparison was not available in our system, clinician notes indicating stability for 2 years were used in lieu of 2-year imaging follow-up. All data were collected and recorded, with final inclusion of 114 patients with 118 adnexal masses (Fig. 1).

Image review and data collection
All pelvic US images of included subjects were then evaluated on PACS by eight radiologists at different levels of clinical expertise: 4 fellows and 4 attendings (with 4 years, 4 years, 7 years, and > 15 years of experience). All readers were blinded to the final diagnosis and were provided identical training materials on the classification systems. Each adnexal mass was evaluated according to the feature categories of the IOTA Simple Rules (Table 1) and O-RADS lexicon (Table 2).

Statistical analysis
Each feature category under O-RADS and IOTA was analyzed for interobserver agreement. O-RADS, IOTA, and their subcomponents can be considered as ordinal or nominal variables. For ordinal variables, we calculated the intraclass correlation coefficient (ICC) using a two-way random-effects model, absolute agreement, single rater type, as previously suggested by a guideline [10]. ICCs were interpreted as follows: < 0.40, poor; 0.40-0.59, fair; 0.60-0.74, good; and 0.75-1.00, excellent [11]. For nominal variables, free-marginal kappa was calculated using an online calculator (http://justusrandolph.net/kappa) [12]. Free-marginal multirater kappa is an alternative to Fleiss' multirater kappa. Calculation of chance agreement in Fleiss' kappa is based on fixed marginal probabilities. Thus, Fleiss' kappa is suitable when raters know beforehand the fixed proportions of cases in each category. However, in our study, the raters were blinded to the numbers of cases in each category [12]. Free-marginal kappa is therefore the suitable statistic in our setting. Free-marginal kappa values were interpreted as follows: < 0, poor; 0.01-0.20, slight; 0.21-0.40, fair; 0.41-0.60, moderate; 0.61-0.80, substantial; and 0.81-1.00, almost perfect agreement [13]. ICCs and kappas, along with 95% confidence intervals (CI), were calculated for agreement amongst all 8 radiologists (4 attendings and 4 fellows). Agreement was also calculated amongst attendings alone and fellows alone. The two-tailed significance level (α) was set at 0.05.
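The two agreement statistics named above can be sketched as follows. This is an illustrative reimplementation, not the online calculator or the software used in the study: function names are hypothetical, the kappa follows the standard free-marginal definition (chance agreement fixed at 1 divided by the number of categories), and the ICC uses the textbook mean-square form for a two-way random-effects, absolute-agreement, single-rater model. Confidence intervals are not computed, and treating a value of exactly 0 as "slight" is an assumption because the published scale leaves it unassigned.

```python
from collections import Counter


def free_marginal_kappa(ratings, n_categories):
    """Free-marginal multirater kappa for nominal ratings.

    ratings: list of cases, each a list with one category label per rater.
    n_categories: number of categories the raters could choose from.
    """
    n_cases = len(ratings)
    n_raters = len(ratings[0])
    # Ordered pairs of raters that agree within each case: sum of c * (c - 1).
    agreeing_pairs = 0
    for case in ratings:
        agreeing_pairs += sum(c * (c - 1) for c in Counter(case).values())
    p_observed = agreeing_pairs / (n_cases * n_raters * (n_raters - 1))
    p_chance = 1.0 / n_categories  # free marginals: all categories equally likely
    return (p_observed - p_chance) / (1.0 - p_chance)


def icc_2_1(scores):
    """ICC, two-way random effects, absolute agreement, single rater.

    scores: list of subjects, each a list of numeric ratings (one per rater).
    """
    n, k = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)                 # between-subject mean square
    ms_cols = ss_cols / (k - 1)                 # between-rater mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square
    denominator = ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    return (ms_rows - ms_error) / denominator


def interpret_icc(icc):
    """Label an ICC using the cutoffs quoted in the text."""
    if icc < 0.40:
        return "poor"
    if icc < 0.60:
        return "fair"
    if icc < 0.75:
        return "good"
    return "excellent"


def interpret_kappa(kappa):
    """Label a kappa using the cutoffs quoted in the text (0 treated as slight)."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"),
                         (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return label
    return "almost perfect"
```

Applied to values reported in this paper, `interpret_icc(0.517)` gives "fair" and `interpret_kappa(0.97)` gives "almost perfect", matching the labels used in the Results.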

Subjects and demographics
A total of 114 women with 118 adnexal masses were included in the study, with inclusion and exclusion criteria summarized in Fig. 1. Median patient age was 41 years, with range 18 to 88 years. Age and menstrual status for the cohort, including for benign versus malignant cases, are detailed in Table 3. The median (IQR) lesion size was 7.0 (6.3) cm. All 118 adnexal masses in our study were included as a small subset of a separate and unrelated multi-institutional study evaluating the diagnostic accuracy of O-RADS [25].
A This is an example of lesion type, which had substantial interobserver agreement. B Color Doppler image of an ovarian cystic lesion with solid component. This is an example of color score 2 (mild), which had good interobserver agreement. C Grayscale image of ovarian cyst with a smooth septation. This is an example of septation type, which had moderate interobserver agreement. D Grayscale image of a complex ovarian cystic lesion with white arrow denoting the irregular inner wall. This is an example of inner wall, which had fair interobserver agreement (Table 3).
O-RADS: Interobserver agreement was almost perfect for two of ten feature categories (presence of ascites and peritoneal nodules), with k-values of 0.89 and 0.97. Agreement for the remaining eight feature categories (lesion type, inner wall type, septation type, number of septa, number of solid components, contour of solid component and color score) was variable, ranging from fair to substantial, with ICC and k-values ranging from 0.39 to 0.61. Agreement on the final O-RADS conclusion was good for all eight readers combined, good for fellows alone, and fair for attendings alone, with ICC values of 0.621, 0.725 and 0.517, respectively (Table 5).
Proper characterization and risk stratification of adnexal masses is important because ovarian cancer is the most lethal of all gynecologic malignancies and is the fifth leading cause of cancer death in women [14,15]. Ultrasound is the first-line imaging modality utilized to evaluate the adnexa and to help differentiate benign from malignant ovarian lesions. Multiple ultrasound-based guidelines and scoring systems have been proposed and validated over the years [1-9, 16, 17]. Of these, IOTA simple rules and O-RADS have gained significant traction. We found that interobserver agreement is overall excellent for IOTA simple rules and good for O-RADS. We hypothesize that the differences in interobserver agreement between the two systems relate to the risk stratification method: while pattern recognition is important in the initial assessment of IOTA simple rules, final delineation into benign, inconclusive, or malignant is based on an algorithmic scoring system. On the other hand, O-RADS is an entirely pattern-based scoring system with potential for some degree of subjectivity and measurement error that may influence this nuanced pattern recognition. We found that two features in particular had lower interobserver agreement in O-RADS (scored as "fair"), primarily due to differences in distinguishing smooth versus irregular inner wall and in counting the number of solid components. A focus of nodularity along the inner wall may be interpreted as an irregular inner wall by some, whereas others may interpret this finding as a solid component. Measurement differences may further contribute to differences in O-RADS score, as a focus less than 3 mm in size is considered an irregular inner wall whereas one 3 mm or greater is considered a solid component. Subjectivity in color Doppler scoring of vascularity can further impact agreement. For example, a solid-appearing mass with color score 2-3 (mild to moderate flow) would be O-RADS 4, whereas a color score 4 would upgrade the mass to O-RADS 5. Finally, some variability in interpretation of what constitutes a 'solid component' under the O-RADS lexicon may lead to differences in categorization (e.g. fat, Rokitansky nodule, normal ovary within a peritoneal inclusion cyst, or tubal or inflammatory tissue in a tubo-ovarian abscess).
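The measurement and color-score cutoffs discussed above for O-RADS can be made concrete with a short sketch showing how small reader differences flip a category. The function names are hypothetical, and the code encodes only the two rules stated in this text (the 3 mm cutoff, and color scores 2-4 for a solid-appearing mass); it is not a full O-RADS scoring implementation.

```python
SOLID_COMPONENT_MIN_MM = 3.0  # cutoff quoted in the text


def describe_wall_focus(size_mm):
    """Label a focus along the inner wall by its measured size in mm."""
    if size_mm < 0:
        raise ValueError("size must be non-negative")
    if size_mm >= SOLID_COMPONENT_MIN_MM:
        return "solid component"
    return "irregular inner wall"


def orads_for_solid_appearing_mass(color_score):
    """O-RADS category for a solid-appearing mass, per the example in the text.

    Only color scores 2-4 are covered by the text; other values raise.
    """
    if color_score in (2, 3):
        return 4
    if color_score == 4:
        return 5
    raise ValueError("color score not covered by the rule quoted in the text")


# Two readers measuring the same focus 0.3 mm apart reach different descriptors.
reader_a = describe_wall_focus(2.8)
reader_b = describe_wall_focus(3.1)
readers_agree = reader_a == reader_b  # False for these two measurements
```

A one-step difference in a subjective color score (3 versus 4) has the same effect on the final category, which is consistent with the lower agreement observed for these descriptors.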
In a study by Basha et al., the diagnostic performance of O-RADS was compared to IOTA and GI-RADS (gynecologic imaging reporting and data system). They found greater sensitivity and similar specificity and reliability with O-RADS compared to the other two [18,19]. They also found interobserver agreement to be similar across all three risk stratification systems. Although their study had a larger sample of adnexal masses, they used only 5 radiologists, all of whom had greater than 15 years of experience with pelvic imaging and were not blinded to the initial ultrasound reports. A smaller study by Pi et al. [20] with 3 readers and 50 adnexal masses found excellent diagnostic accuracy and interobserver agreement for the O-RADS system but did not compare O-RADS against other existing classification systems. There have been a few performance comparison studies based on different US scoring systems, but none have specifically focused on interobserver agreement. Hiett et al. [22] compared IOTA Simple Rules, the ADNEX model, and O-RADS, and reported similar sensitivity for discrimination of malignant from benign pelvic masses, with superior specificity for the IOTA model. Patel-Lippman et al. [21] compared IOTA Simple Rules with the SRU guidelines and demonstrated that IOTA Simple Rules were slightly more accurate (AUC, 0.9805 versus 0.9713; p = 0.0003) and that both were highly sensitive for detection of malignancy. Another study by Xie et al. [23] noted that the area under the curve, sensitivity, and specificity for detection of malignancy under IOTA or O-RADS can be similarly improved by factoring in the patient's CA-125 levels, as is done in the ADNEX model. Although tumor marker information was not provided to our readers and is often not prospectively available at the time of initial ultrasound interpretation, it certainly plays a role for gynecologists and gynecologic oncologists in determining management for indeterminate adnexal masses [24].
We included body imaging fellows and attending radiologists with varying years of experience to determine whether experience affects the degree of interobserver agreement. All participants were given identical training materials and resources to review beforehand. Interestingly, interobserver agreement amongst both fellows and attendings was excellent for the IOTA simple rules. However, better interobserver agreement was seen amongst the fellows with O-RADS compared to attendings. When analyzing the specific O-RADS and IOTA feature categories, attendings had greater agreement for two O-RADS features (lesion type and number of solid components) compared to fellows. On the other hand, fellows had greater agreement on the other eight features within O-RADS. Thus, number of years of experience did not appear to improve interobserver agreement with O-RADS. It is important to recognize that greater interobserver agreement does not necessarily correlate with diagnostic accuracy. Thus, while fellows may have had greater uniformity than attendings, we caution against any inference regarding the diagnostic accuracy of the two groups.
We acknowledge several limitations in our study. First, this was a single institution study. Future larger scale multicenter studies with multiple readers may be warranted to evaluate institutional or regional variability in interpretation and interobserver agreement. Second, we did not evaluate the diagnostic accuracy of each risk stratification system. Thus, while IOTA may have greater interobserver agreement, we do not know if the diagnostic performance of one is better than the other. Indeed, diagnostic performance is of vital importance in determining the merits of each system. Third, images were retrospectively reviewed and therefore may not replicate real-life evaluation of adnexal masses as we were limited to images obtained at the time of examination. As technology and sonographic detail improve, the diagnostic interpretation and accuracy may similarly change. Finally, we did not evaluate the diagnostic performance or accuracy of these two systems when compared to final pathology. Due to the large number of radiologists interpreting each exam, we did not ask radiologists to come to consensus, nor did we determine a "correct" categorization for each ovarian lesion. A separate multi-institutional study with a larger cohort has analyzed the category-specific diagnostic performance of O-RADS [25].
In summary, we found excellent interobserver agreement with IOTA and good interobserver agreement with O-RADS amongst eight blinded observers reviewing 118 adnexal masses. Greater reader experience did not improve interobserver agreement with O-RADS.
Funding None.

Declarations
Conflict of interest All authors declare that they have no conflict of interest.
Ethical approval Specific remark: Aya Kamaya, book royalties, Elsevier. No relevant disclosures for other authors. This research study was HIPAA compliant and received IRB approval.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.