Introduction

Acute appendicitis is a common emergency condition in adults, which can result in severe complications if not managed appropriately. Complicated appendicitis can lead to perforation, abscess formation, peritonitis, and sepsis and require urgent surgical intervention [1]. Conversely, uncomplicated appendicitis can be treated with either appendectomy or nonoperative management with the use of antibiotics [2]. Nonoperative management is a viable option for selected patients with uncomplicated appendicitis, particularly those who are at increased risk for surgical complications or have a preference for a nonsurgical approach. Patient selection is, therefore, crucial in identifying those with uncomplicated appendicitis and avoiding directing complicated cases to a nonsurgical approach. The guideline issued by the World Society of Emergency Surgery emphasizes the importance of patient selection in the management of acute appendicitis [1].

Clinical scoring systems have been developed to aid in diagnosing appendicitis, such as the Alvarado score, Appendicitis Inflammatory Response score, and Adult Appendicitis Score. However, these scores have limited ability to differentiate between uncomplicated and complicated appendicitis [3, 4]. Several scoring systems have been proposed to aid in identifying complicated appendicitis, with varying methods and success [5,6,7,8,9,10,11]. However, only a few studies [10, 12, 13] have externally validated their performance.

Our investigation aimed to evaluate the performance of existing scoring systems for predicting complicated appendicitis in adults diagnosed with acute appendicitis on computed tomography and compare them to a newly proposed system.

Methods

Study design and patient selection

This investigation was conducted at a 2200-bed urban academic hospital. It was approved by the Institutional Review Board (protocol No. 136/2566(IRB2)). Informed consent was not required due to the retrospective nature and minimal risk involved. Figure 1 provides a flow chart of patient inclusion. Eligible patients were identified by searching the pathological database for a diagnosis of appendicitis among all consecutive adult patients aged 18 years or older from October 2016 to March 2021. Patients who had undergone abdominopelvic CT prior to appendectomy, regardless of the timing of appendectomy relative to the diagnosis of acute appendicitis, were included. If a patient had multiple CT exams, only the first CT examination performed for clinical suspicion of acute appendicitis was included. Patients with incomplete clinical data (n = 12) or an appendix that was not identified at CT (n = 1) were excluded. The investigation ultimately included 325 patients (Table 1). Note that 201 of these patients have been described in our previous investigation [14]. Among the 325 patients, 321 underwent CT as their initial imaging modality, while the remaining 4 had an initial ultrasound examination.

Fig. 1 Flow chart of patient inclusion

Table 1 Patient characteristics and comparison between uncomplicated and complicated appendicitis (n = 325)

Image acquisition, reinterpretation, and definitions

CT exams were performed on one of three multidetector CT scanners. With the exception of one scan, all exams were performed with intravenous administration of nonionic iodinated contrast medium, at a volume of 1.5–2.0 mL/kg (equivalent to approximately 80–100 mL) at a rate of 2–3 mL/s. The exams covered the area from either the upper border of the diaphragms or the upper pole of the kidneys to the ischial tuberosities. For each scan, an unenhanced phase was followed by a portovenous phase (approximately 80 s after contrast administration) with an axial slice thickness of 1.25 mm. All images were then transferred to a picture archiving and communication system (PACS) for viewing.

Two fellowship-trained radiologists, specialized in abdominal imaging and emergency imaging with 20 years of experience each, independently re-reviewed all CT scans. They were informed of the patient’s age, sex, and diagnosis of acute appendicitis, but remained unaware of other data. The images were analyzed on standard PACS workstations using Synapse (FujiFilm Inc.). Each radiologist provided their own interpretation of the CT findings based on definitions described in Supplementary Material 1 and previously [14]. Discrepancies between the two radiologists were resolved by an abdominal radiologist with 24 years of experience for the 201 previously reported cases, while the rest were resolved by consensus.

Reference standards

The diagnosis of acute appendicitis was confirmed through histopathological analysis. Cases of complicated appendicitis included those with gangrene or perforation. The diagnosis of gangrene was based on histopathology, while perforation was documented either through histopathology or surgical operative findings.

Scoring systems validated

Eight scoring systems were selected for validation because they include both clinical features and CT findings [5,6,7,8,9,10,11]. Details of these scores are provided in Supplementary Material 2. Of these, 5 included serum C-reactive protein [5, 6, 10], which was documented in only 7 patients in our cohort. Therefore, this laboratory value was removed from those scores. The weighting of the remaining factors was left unchanged, but the appropriate cutoff values for all scores were reselected during statistical analysis.

Statistical analysis

Qualitative and quantitative data were analyzed using descriptive statistics. Categorical variables were presented as numbers or percentages, while continuous variables were reported as mean (standard deviation) or median (range), depending on whether their distribution was normal or skewed.

To compare the two groups (uncomplicated vs. complicated appendicitis), inferential statistics were used. The Pearson chi-square or Fisher’s exact test was used for categorical variables, while the independent t-test or the Mann–Whitney U test was used for continuous variables summarized as means or medians, respectively. Binary logistic regression was used for univariable and multivariable analyses to determine the odds ratios (ORs) and coefficients for independent predictors of complicated appendicitis. Odds ratios with corresponding 95% confidence intervals (95% CIs) were used to describe the strength and direction of each association. Factors with a P value of less than 0.1 in the univariable model were entered into the multivariable model. To prioritize patient safety, we placed a high emphasis on sensitivity for diagnosing complicated appendicitis. With this approach, a misclassification is more likely to result in recommending appendectomy for a patient with uncomplicated appendicitis than in directing a patient with complicated appendicitis to nonoperative management.

The diagnostic performance of the scoring systems in differentiating between uncomplicated and complicated appendicitis was determined using two-by-two tables to calculate sensitivity, specificity, positive likelihood ratio, negative likelihood ratio, positive predictive value, negative predictive value, and accuracy. The ROC curves of these scoring systems were compared through pairwise comparison. These analyses were conducted using the Statistical Package for the Social Sciences (SPSS, version 23, IBM), with a significance level of 0.05.
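
The two-by-two metrics listed above follow directly from the four cell counts. The snippet below is a minimal illustrative sketch in Python, not the SPSS procedure used in this study; the names `TwoByTwo` and `diagnostic_metrics` and the example counts are assumptions made only for illustration.

```python
from dataclasses import dataclass
from math import nan


@dataclass
class TwoByTwo:
    tp: int  # complicated appendicitis, score at or above the cutoff
    fp: int  # uncomplicated appendicitis, score at or above the cutoff
    fn: int  # complicated appendicitis, score below the cutoff
    tn: int  # uncomplicated appendicitis, score below the cutoff


def _ratio(numerator: float, denominator: float) -> float:
    """Divide, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else nan


def diagnostic_metrics(t: TwoByTwo) -> dict:
    """Compute the standard two-by-two diagnostic performance metrics."""
    sens = _ratio(t.tp, t.tp + t.fn)
    spec = _ratio(t.tn, t.tn + t.fp)
    return {
        "sensitivity": sens,
        "specificity": spec,
        "lr_pos": _ratio(sens, 1 - spec),   # positive likelihood ratio
        "lr_neg": _ratio(1 - sens, spec),   # negative likelihood ratio
        "ppv": _ratio(t.tp, t.tp + t.fp),   # positive predictive value
        "npv": _ratio(t.tn, t.tn + t.fn),   # negative predictive value
        "accuracy": _ratio(t.tp + t.tn, t.tp + t.fp + t.fn + t.tn),
    }


if __name__ == "__main__":
    # Made-up counts for illustration only; these are not study data.
    example = TwoByTwo(tp=90, fp=50, fn=10, tn=50)
    for name, value in diagnostic_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

A high-sensitivity cutoff, as prioritized here, lowers the false-negative count `fn` at the cost of a higher false-positive count `fp`, which is why the negative predictive value rises while specificity falls.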

The discrimination of a scoring system describes its ability to give different predictions for complicated and uncomplicated appendicitis. The area under the ROC curve (AUC) with the corresponding 95% confidence interval (95% CI) was used as a summary measure of discrimination. Internal validation of the model was carried out by split-sample estimation and validation: the entire sample was randomly divided into two subsets, with 80% of the data used exclusively for model estimation ("training") and 20% for validation ("testing"). The split was performed using the R program (R Core Team (2022). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/).
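
As a minimal sketch of these two ideas, the Python snippet below computes the AUC as the proportion of complicated/uncomplicated pairs in which the complicated case has the higher score (ties counted as one half) and performs a random 80/20 split. It is an illustration under stated assumptions rather than the R code used in this study: the function names `auc` and `split_sample`, the seed, and the rounding rule are hypothetical, and confidence intervals are not computed.

```python
import random


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    labels: 1 = complicated appendicitis, 0 = uncomplicated appendicitis.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both outcome classes are required")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # complicated case ranked higher
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


def split_sample(n, train_fraction=0.8, seed=0):
    """Randomly split indices 0..n-1 into training and testing subsets."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    cut = round(n * train_fraction)
    return indices[:cut], indices[cut:]


if __name__ == "__main__":
    train, test = split_sample(325)
    print(len(train), len(test))
```

With 325 patients, an 80/20 split yields 260 training and 65 testing cases.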

Results

Baseline characteristics of patients (Table 1)

The mean age of the patients was 51.9 ± 19.6 years. Most of them (60.3%) belonged to the age group of ≥ 45 years, with female predominance (65.2%). They presented to the hospital with a median duration of symptoms of 24 h (range, 2–480) and a median Alvarado score of 7 (range, 1–10). On CT, the mean appendix diameter was 12.0 ± 2.9 mm, and 41.2% of patients had an appendicolith. Periappendiceal fat stranding, periappendiceal fluid, ascites, and extraluminal air were present in 46.8%, 42.5%, 33.8%, and 14.5% of cases, respectively. One hundred twenty-seven patients (39.1%) had complicated appendicitis. Almost all patients (97.2%) had appendectomy at the initial admission of appendicitis. The median length of stay was 3 days (range, 1–44).

Predictive factors of complicated appendicitis (Tables 2 and 3)

Table 2 Univariable and multivariable analyses of predictive factors of complicated appendicitis
Table 3 Weighted score for each factor in the risk prediction of complicated appendicitis

Univariable analysis identified multiple clinical, laboratory, and imaging factors that were significantly associated with complicated appendicitis. After multivariable analysis, five factors remained statistically significant: duration of symptoms > 12 h, appendicolith, periappendiceal fat stranding, periappendiceal fluid, and extraluminal air. Their p values ranged from < 0.001 to 0.037. Each factor was assigned a weighted score based on its odds ratio or coefficient for the risk prediction of complicated appendicitis, as shown in Table 3.
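
To illustrate how a points-based score of this kind is applied, the Python sketch below sums the points of the factors present and compares the total with a cutoff. The weights are placeholders only: the actual points are those in Table 3 and are not reproduced here. The names `HYPOTHETICAL_WEIGHTS`, `risk_score`, and `predicts_complicated` are assumptions for illustration; the default cutoff of ≥ 2 is the one reported for the current scores.

```python
# Placeholder weights (1 point each). These are NOT the Table 3 values;
# substitute the published points before any real use.
HYPOTHETICAL_WEIGHTS = {
    "symptoms_over_12h": 1,
    "appendicolith": 1,
    "periappendiceal_fat_stranding": 1,
    "periappendiceal_fluid": 1,
    "extraluminal_air": 1,
}


def risk_score(findings, weights=HYPOTHETICAL_WEIGHTS):
    """Sum the points of every factor marked present in `findings`."""
    unknown = set(findings) - set(weights)
    if unknown:
        raise KeyError(f"unknown factors: {sorted(unknown)}")
    return sum(w for name, w in weights.items() if findings.get(name, False))


def predicts_complicated(findings, weights=HYPOTHETICAL_WEIGHTS, cutoff=2):
    """Classify as complicated when the total reaches the cutoff."""
    return risk_score(findings, weights) >= cutoff


if __name__ == "__main__":
    patient = {"appendicolith": True, "extraluminal_air": True}
    print(risk_score(patient), predicts_complicated(patient))
```

Passing the weights as a parameter keeps the same function usable for both the odds-ratio-based and the coefficient-based versions of the score.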

Comparison of 9 scoring systems (Table 4)

Table 4 Diagnostic performance of 10 scores

The Atema, Kim, Imaoka, and Lin (models 1 and 2) scores were modified to exclude C-reactive protein, with cutoff values selected at ≥ 5, ≥ 1, ≥ 1, ≥ 4, and ≥ 4, respectively. The cutoff value of the Avanesov score was reduced from ≥ 4 in the original description to ≥ 2 in our analysis. The Khan and Kim HY scores retained their original cutoff values of ≥ 2 and ≥ 3, respectively. Their AUCs are provided in Fig. 2. The two scores based on our multivariable analysis (one derived from odds ratios, the other from coefficients) assigned different points to each predictive factor. With a cutoff of ≥ 2 for both, they demonstrated a sensitivity, specificity, and accuracy of 88.2–89.8%, 48.5–49.0%, and 64.3–64.6%, respectively. The score based on coefficients had slightly better sensitivity and accuracy but slightly lower specificity. Pairwise comparison of these ten scores (Table 5) revealed no significant difference between the modified Atema, Kim HY, and our proposed (identified as “current”) scores (p = 0.110–0.901).

Fig. 2 Comparison of ROC curves of 8 scoring systems and current scores

Table 5 Pairwise comparison of area under the ROC curves of 10 scores

Internal validation of current scores (Fig. 3, Supplementary Material 3)

Fig. 3 ROC curves of current scoring systems based on internal validation

With the split-sample method, 260 cases were allocated for training the model and 65 cases for internal validation of our proposed scores. The scores derived from odds ratios and coefficients both achieved high AUCs (0.826–0.844), with the score using odds ratios showing a sensitivity and a negative predictive value of 100% and a specificity of 46.4% in predicting complicated appendicitis.

Discussion

Our investigation identified factors independently predictive of complicated appendicitis that are crucial to consider in the era of potential nonoperative management of acute appendicitis. We validated the diagnostic performance of 8 existing scoring systems and proposed a new scoring system to predict complicated appendicitis without the need for serum C-reactive protein. Of these, modified Atema, Kim HY, and our proposed scores showed similarly high AUCs with reasonably high sensitivities and modest specificities in the identification of complicated appendicitis.

Since 2015, multiple scoring systems have been proposed to identify appendicitis with complications, utilizing clinical-only [15,16,17,18], imaging-only [19], or both clinical and imaging data [5,6,7,8,9,10,11]. In this study, we validated eight systems that utilized both clinical features and CT findings, as these scores generally performed better than those that utilized only clinical or CT features. Previous investigations have validated these models using traditional statistical methodology [10, 12, 13] and an artificial neural network [20]. Fujiwara et al. [13], Lin et al. [10], and Geerdink et al. [12] used 203 to 678 patients (52 to 175 with complicated appendicitis) for validation. In another study by Lin et al. [20], a dataset of 592 patients was split for training and validation of an artificial neural network.

The Atema score [5] was introduced in 2015, with an original sensitivity of 97% and specificity of 46% in the differentiation of complicated from uncomplicated appendicitis. The score demonstrated sensitivities from 64 to 90% and specificities from 51 to 95% in subsequent studies [10, 12, 13, 20]. Our investigation found that even with C-reactive protein excluded from the equation and a cutoff value reduced to ≥ 5, the Atema score still had the best performance, with a high AUC (0.831; 95% CI 0.787–0.875) and sensitivity (91%; 95% CI 84–95%). However, its specificity was only 61% (95% CI 53–68%).

Another scoring system that demonstrated promising results in our investigation was the Kim HY score [11]. In its original description, this score had an AUC of 0.81, a sensitivity of 93%, and a specificity of 28%. However, subsequent validations reported higher AUCs ranging from 0.84 to 0.92 and specificities between 88 and 100%, but lower sensitivities at 64% [10, 20]. Our study showed a balanced sensitivity and specificity at 73% (95% CI 64–81%), and 71% (95% CI 64–77%), respectively, indicating its potential usefulness. Other validated scoring systems showed varying results, with some demonstrating high specificity (Kim TH, Lin Model 2 scores), and others exhibiting variable performance (Imaoka, Avanesov, Khan, Lin Model 1 scores) [10, 13, 20].

When our proposed scoring system was validated internally, the score that used odds ratios demonstrated 100% sensitivity and 100% negative predictive value, allowing it to avoid misclassifying complicated appendicitis as uncomplicated, albeit with a moderate specificity. It was less complex than the modified Atema score, as it consisted of only 5 factors, did not require C-reactive protein, and had fewer total points.

The performance of other scoring systems in our evaluation was suboptimal. Specifically, the Khan score exhibited a lower AUC of 0.699 (95% CI 0.643–0.756), alongside moderate sensitivity (76%; 95% CI 67–83%) and specificity (48%; 95% CI 41–55%). Similarly, the modified Imaoka score demonstrated a lower AUC of 0.692 (95% CI 0.642–0.741), with moderate sensitivity (80%; 95% CI 72–81%) and specificity (58%; 95% CI 51–65%). Both of these were validated by Lin et al. [10], who reported similar diagnostic performance for predicting complicated appendicitis. Additionally, the Imaoka score had been validated by other studies [13, 20], revealing inconsistent diagnostic performance. The modified Kim score exhibited very high sensitivity (98%; 95% CI 94–100%) but low specificity (23%; 95% CI 17–29%), limiting its utility. Notably, our results diverged markedly from the validation performed by Lin et al. [10, 20], who reported the original score as having much lower sensitivity but higher specificity.

When comparing the elements within the scoring systems that exhibited optimal vs. suboptimal performance, the factors contributing the most to enhanced performance were CT findings. Notably, the presence of extraluminal air, which was found in the modified Atema, Kim HY, and our proposed scores but absent in the modified Imaoka, Kim, or Khan scores, played a significant role. Additionally, the presence of appendicolith, which was included in the modified Atema and our proposed score but excluded from the modified Imaoka and Kim scores, also contributed to improved performance.

While our investigation provided a detailed evaluation of the performance of existing scoring systems, several limitations need to be acknowledged. Firstly, our study was retrospective and conducted in a single center with a small sample size. As appendectomy remained the standard of care for acute appendicitis in our hospital, we were unable to fully evaluate the success rate of nonoperative management. However, our approach allowed us to use pathological results as a reference standard for the diagnosis of complicated appendicitis. Secondly, the absence of C-reactive protein data in most patients prevented us from validating some scores in full. However, this allowed us to test the scores without C-reactive protein and demonstrated that the modified Atema score still performed well. Thirdly, we designed our endpoint to prioritize high sensitivity to detect complicated appendicitis, rather than balancing sensitivity and specificity. This approach ensured patient safety by avoiding sending patients with complicated appendicitis for nonoperative management. Fourthly, we did not validate scores that utilized only clinical factors [16,17,18], as they did not apply to our target population. Cross-sectional imaging is necessary for safe selection of nonoperative management in this condition even in young individuals [3, 21]. The scores proposed by Mahankali et al. [19], which utilized purely CT findings, were not validated in our study due to incomplete data. Additionally, we believe that some data points, including grading of periappendiceal fat stranding [10], may pose a challenge in terms of real-world applicability because they are subjective.

In conclusion, our study demonstrated that the modified Atema, Kim HY, and our proposed scores were effective in predicting complicated appendicitis with high AUC and reasonable sensitivities. These scores have the potential to aid in the safe selection of patients for nonoperative management. However, further validation is required in larger, multicenter studies with a diverse patient population. Recent publications have shown that artificial neural networks may play a crucial role in this regard [20, 22]. Additionally, it is important to note that a prospective trial [23] focused on this issue is currently ongoing, and its results are eagerly awaited to further guide clinical decision-making.