Introduction

Acute abdominal pain is a common emergency that can be caused by a wide variety of conditions, ranging from self-limiting to life-threatening disease. For patients presenting to the Emergency Department (ED) with acute abdominal pain, fast and accurate work-up is needed to plan treatment. Imaging will be used to differentiate between urgent conditions (requiring immediate treatment) and non-urgent conditions, and for determining the diagnosis and extent of disease.

The diagnostic work-up for acute abdominal pain has changed over the last 10 years, with a fourfold increase in the use of computed tomography (CT) [1, 2]. High accuracy of CT has been reported for appendicitis and for acute diverticulitis (98%) [3, 4]. A study evaluating the diagnostic value of abdominal CT in patients with acute abdominal pain in general also showed good results, with an accuracy of 78% for CT, including clinical evaluation [5]. Although the accuracy reported in the literature is good, this does not automatically imply that reproducibility is good as well and that accuracy results reported in the literature can be generalised to the local clinical situation.

Only a few studies evaluated the reproducibility of CT results. These studies focused on selected patients, with a specific suspected diagnosis, such as appendicitis or diverticulitis [6–9]. There were considerable differences in the reported inter-observer agreement in these studies, ranging from fair to excellent [6–9]. Fair inter-observer agreement was also reported for so-called difficult cases [6]. The latter study included CT examinations that were equivocal for the diagnosis of appendicitis after the first analysis.

These previous inter-observer studies all evaluated patients suspected of one specific disease, and the evaluation of the test results focused merely on the presence or absence of that disease. An evaluation of inter-observer agreement in unselected patients with acute abdominal pain presenting at the ED is probably more relevant, as it is closest to clinical practice, where consecutive patients usually have different diagnoses and the most likely diagnosis may not be evident after clinical evaluation in a substantial number of cases.

The purpose of this study was to document the level of inter-observer agreement in abdominal CT in unselected patients with acute abdominal pain presenting at the ED. We evaluated agreement of all diagnoses, as well as agreement on urgent versus non-urgent diagnoses, on general radiological features, and on frequent diagnoses in this patient population.

Method and materials

Patients presenting at the ED with acute abdominal pain for more than 2 h and less than 5 days were eligible for this study. Patients who were discharged by the treating physician at the ED without any diagnostic imaging (including plain radiography and ultrasonography), patients under 18 years, pregnant women, patients with a blunt or penetrating trauma, and patients in haemorrhagic shock caused by gastrointestinal bleeding or acute abdominal aneurysm were excluded. Furthermore, only patients with abdominal pain were eligible for this study; if a patient had just flank pain, that patient was not invited. The included patients were part of a larger trial to document the diagnostic accuracy of imaging modalities in the work-up of patients with acute abdominal pain [10]. In this trial 1,101 consecutive patients underwent plain abdominal and chest radiography, abdominal US and abdominal CT. The first 200 patients of the initiating hospital entered this retrospective inter-observer study.

Eligible patients were informed about the study and asked for consent. Consenting patients underwent CT within a few hours after ED presentation. A multidetector-row four-slice helical CT (SOMATOM Sensation 4; Siemens Medical Systems, Forchheim, Germany) was used in all patients. The CT protocol was as follows: effective mAs level of 165, 120 kV, (4×) 2.5-mm collimation, (4×) 3-mm slice width and 0.5-s rotation time. A total of 125 ml contrast medium (Visipaque 320, General Electric Healthcare, Chicago, Ill.) was injected intravenously at 3 ml/s and the CT was performed after a 60-s delay; no oral or rectal contrast agents were used. The effective dose of this CT protocol was 11 mSv, with a DLP of 640 mGy·cm.

Only patients with known renal failure received an unenhanced CT.

The CT examinations were reviewed 3 months or more after the initial presentation at the ED, to diminish recall bias of the observers, as some of them could have been involved in the initial diagnostic work-up of these patients. All CT examinations were interpreted using a picture archiving and communication system (PACS; Agfa-Gevaert, Brussels, Belgium), on which observers evaluated the axial-CT images; however, they had access to three-dimensional reformats. The CT examinations were independently read by three radiologists, blinded for the results of the co-observers. Two observers both had 12 years of experience in abdominal radiology, in which they had evaluated approximately 5,200 abdominal CT studies for acute abdomen and 13,000 abdominal CT examinations in general. The third observer had 2 years of experience as a general radiologist (fellow interventional radiology), and had evaluated approximately 175 abdominal CT studies indicated for acute abdomen, and 1,000 abdominal CT examinations in general. The observers were blinded for other imaging examinations performed in the diagnostic work-up on the day of presentation of these patients, imaging performed during follow-up and other findings during follow-up.

Observers had access to a summary of the patients’ clinical history, physical examination (both performed by an attending surgical resident) and to the laboratory findings of the day of presentation, as in normal clinical practice observers also would have access to clinical information of the patient [11]. An example of summary patient information is provided in Appendix I.

Image and CT characteristics

For a standardized evaluation of the CT examinations, image characteristics were assessed and recorded on a digital case record form. The following general image findings and specific radiological features were assessed: image quality, fat infiltration, free fluid, fluid collections, free intra-peritoneal air, and whether fistulas could be visualized. Image characteristics were also assessed for abnormalities per organ: gallbladder, bile duct, liver, pancreas, appendix, gastrointestinal tract (without appendiceal abnormalities), lymph nodes, vascular system, kidneys, and if appropriate, female genitalia.

In case of abnormalities further specification on the observed abnormality was warranted. All observers also recorded their final CT diagnosis, and, if applicable, two differential diagnoses. These diagnoses were selected from a list of diagnoses provided with the online case record form. All possible diagnoses in the list of diagnoses on the online case record form had been classified a priori as urgent or non-urgent. Diagnoses were classified as urgent when immediate treatment, within 24 h, was needed, whereas in patients with a non-urgent diagnosis a general consensus exists that treatment, if any, can safely be delayed beyond 24 h.
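The a priori classification means that two observers can disagree on the specific diagnosis and still agree on urgency. The following minimal sketch illustrates this. It assumes a small illustrative subset of diagnoses, not the study's full list, and the observer readings and the names `URGENT`, `NON_URGENT`, `urgency` and `agreement_rates` are hypothetical.

```python
# Illustrative subset only; the study's full diagnosis list is not reproduced here.
URGENT = {"appendicitis", "diverticulitis", "bowel obstruction"}
NON_URGENT = {"NSAP", "gastro-enteritis"}


def urgency(diagnosis):
    """Map a diagnosis to its a priori urgency class."""
    if diagnosis in URGENT:
        return "urgent"
    if diagnosis in NON_URGENT:
        return "non-urgent"
    raise KeyError(f"diagnosis not in the illustrative list: {diagnosis}")


def agreement_rates(obs_a, obs_b):
    """Return (agreement on diagnosis, agreement on urgency) as proportions."""
    if len(obs_a) != len(obs_b) or not obs_a:
        raise ValueError("both observers must rate the same, non-empty set of cases")
    n = len(obs_a)
    same_dx = sum(x == y for x, y in zip(obs_a, obs_b)) / n
    same_urgency = sum(urgency(x) == urgency(y) for x, y in zip(obs_a, obs_b)) / n
    return same_dx, same_urgency


# Hypothetical readings of four cases by two observers.
obs_a = ["appendicitis", "appendicitis", "NSAP", "diverticulitis"]
obs_b = ["appendicitis", "diverticulitis", "gastro-enteritis", "diverticulitis"]
same_dx, same_urgency = agreement_rates(obs_a, obs_b)
```

In this made-up example the observers agree on the exact diagnosis in half of the cases but on the urgency class in all of them.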

Final diagnosis

An independent expert panel, consisting of two experienced gastrointestinal surgeons and an experienced abdominal radiologist, assigned a final diagnosis. The members of this panel individually evaluated all available data for each patient. Final consensus on the final diagnosis was reached in a consensus meeting. Information was provided to the expert panel in a standardized way and consisted of clinical findings, laboratory findings, image findings, surgery (if any), histopathology (if any) and the results from 6 months of outpatient follow-up. Panel members selected the final diagnosis from the same list of diagnoses as provided to the initial observers. None of the panel members had been involved in the work-up of the included patients or in reading the CT images in this study.

Analysis

In the analysis, our focus was on overall inter-observer agreement, on agreement on urgent diagnoses, and on frequently occurring diagnoses. Inter-observer agreement was also evaluated for specific radiological features, such as fat infiltration, free fluid, fluid collections, and free intra-peritoneal air.

A frequent diagnosis within the population under study was defined as a diagnosis with a prevalence >5%. Frequencies of diagnoses or specific features were calculated for each observer, i.e. the number of cases in which that observer recorded a specific diagnosis or feature (e.g. fat infiltration). It is thought that if different observers record a specific diagnosis or feature in a similar number of patients, agreement will be good as well. This evaluation of frequencies is used as a rough measurement to indicate inter-observer agreement.

Agreement was calculated and expressed as percentage observed agreement (i.e. the number of CT examinations for which both observers scored a feature as present or both scored it as absent, divided by the total of 200 CT examinations evaluated by both observers) and with kappa statistics (i.e. the observed agreement adjusted for chance). If prevalence is very high or very low, agreement expected by chance increases, thereby lowering kappa values. Kappa values were calculated for each pair of observers. Median kappa values were calculated across the three observer pairs. Kappa values can be calculated for a 2 × 2 table as well as for more extensive tables [12].
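The two measures can be sketched as follows. This is a minimal illustration of observed agreement and unweighted Cohen's kappa for two observers, not the SPSS procedure used in the study. The function names and the ten example ratings are hypothetical.

```python
from collections import Counter


def observed_agreement(ratings_a, ratings_b):
    """Proportion of cases in which both observers recorded the same category."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both observers must rate the same, non-empty set of cases")
    matches = sum(a == b for a, b in zip(ratings_a, ratings_b))
    return matches / len(ratings_a)


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa; works for 2 x 2 and larger tables."""
    n = len(ratings_a)
    p_o = observed_agreement(ratings_a, ratings_b)
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Chance-expected agreement: sum over categories of the product of marginals.
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / n ** 2
    if p_e == 1.0:
        raise ValueError("kappa is undefined when expected agreement equals 1")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical example: feature present (1) or absent (0) in ten examinations.
observer_1 = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
observer_2 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
p_o = observed_agreement(observer_1, observer_2)
kappa = cohen_kappa(observer_1, observer_2)
```

In this made-up example the observers agree in 8 of 10 cases (observed agreement 0.80), the agreement expected by chance is 0.52, and kappa is (0.80 − 0.52)/(1 − 0.52) ≈ 0.58.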

Kappa (κ) values can be classified according to the level of agreement as κ < 0.20 poor agreement, κ = 0.21–0.40 fair agreement, κ = 0.41–0.60 moderate agreement, κ = 0.61–0.80 good agreement, κ = 0.81–1.00 excellent agreement, according to Landis and Koch [13].
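This classification can be written as a small lookup, sketched below. The bands as stated leave small gaps (e.g. between 0.20 and 0.21). The sketch assumes that each upper bound is inclusive, which is an assumption and not something stated in the text. The function name `classify_kappa` is hypothetical.

```python
def classify_kappa(kappa):
    """Return the agreement label for a kappa value.

    Assumption: upper bounds are inclusive (kappa <= 0.20 is 'poor', etc.).
    """
    if kappa > 1.0:
        raise ValueError("kappa cannot exceed 1")
    bands = [
        (0.20, "poor"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "good"),
        (1.00, "excellent"),
    ]
    for upper_bound, label in bands:
        if kappa <= upper_bound:
            return label


# Median kappa values reported in the Results section.
labels = {k: classify_kappa(k) for k in (0.66, 0.59, 0.84)}
```

Applied to the median values reported in the Results, 0.66 is labelled good, 0.59 moderate and 0.84 excellent.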

For all analyses the statistical software package SPSS 12.0.2 was used (SPSS, Chicago, Ill.).

Results

The mean age of the 200 included patients was 46 years (range 19–94) and 54% (n = 107) of the patients were female. In 17 patients it was not possible to obtain a final diagnosis because of incomplete patient data from initial clinical history and physical examination at the ED (discharge diagnoses were non-specific abdominal pain (NSAP) in six, pneumonia in two, miscellaneous in nine).

The most frequent final diagnoses, assigned by the expert panel, were acute appendicitis in 41 (22%) patients, NSAP in 32 (17%), and acute diverticulitis in 20 (11%) patients (Table 1). Of the 200 patients evaluated in the inter-observer analysis, 193 patients had received intravenous contrast agent, whilst seven (3.5%) had unenhanced CT (five with renal failure; two with inappropriate position of intravenous cannula).

Table 1 Final diagnoses after 6 months follow-up, reference standard

Overall agreement

Overall inter-observer agreement on diagnoses was good, with a kappa value of 0.66 (95% CI: 0.60–0.75) for observers 1 and 2, a kappa value of 0.63 (95% CI: 0.58–0.69) for observers 1 and 3 and a kappa value of 0.67 (95% CI: 0.63–0.75) for observers 2 and 3 (median kappa value of 0.66). The observed proportion of agreement was 0.71, 0.67 and 0.71 for observers 1 and 2, 1 and 3, and 2 and 3, respectively (Table 2).
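The median values follow directly from the three pairwise results. The short check below uses only the figures quoted in this paragraph. The variable names are illustrative.

```python
from statistics import median

# Pairwise values as reported above, keyed by observer pair.
pairwise_kappa = {(1, 2): 0.66, (1, 3): 0.63, (2, 3): 0.67}
pairwise_observed = {(1, 2): 0.71, (1, 3): 0.67, (2, 3): 0.71}

median_kappa = median(pairwise_kappa.values())
median_observed = median(pairwise_observed.values())
```

With three pairs the median is simply the middle value: 0.66 for kappa and 0.71 for the observed proportion of agreement.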

Table 2 Observer agreement of all diagnoses and of urgent diagnoses

An overall cross-classification of all diagnoses is provided for all three observer couples in Appendix II. For urgent versus non-urgent diagnoses agreement was moderate, with a median kappa for all three observers of 0.59 (see also Table 2). The observed agreement for urgent diagnoses was excellent, with a median agreement of 0.83 (Table 2). Furthermore, observers assigned an urgent diagnosis to approximately the same number of patients (Table 3).

Table 3 Frequencies of urgent diagnoses, CT examinations with a high level of confidence, frequent occurring diagnoses and radiological features

Radiological features

Inter-observer agreement for specific radiological features, such as fat infiltration, free fluid, free intra-peritoneal air and fluid collections, is reported in Table 4. Detection of fat infiltration had a good level of agreement, with a median kappa value of 0.70. Agreement on free fluid was moderate, with a median kappa value of 0.58. The frequency with which observers recorded the presence of free fluid differed considerably, ranging from 26% to 46% (Tables 3, 4). Fluid collections and free intra-peritoneal air both had an extremely high observed agreement. Because the prevalence of fluid collections and free intra-peritoneal air within this study population was very low, the corresponding kappa values were low as well.

Table 4 Observer agreement on specific radiological features

Agreement on specific diagnoses

Kappa values and observed agreement are listed in Table 5 for diagnoses with prevalence higher than 5% within this study population. Median kappa values for specific urgent diagnoses, such as appendicitis, diverticulitis and bowel obstruction, were 0.84, 0.90, and 0.81, respectively, which implies excellent agreement (Figs. 1, 2). Median kappa values for non-urgent diagnoses were moderate to fair (Table 5). This difference between urgent and non-urgent diagnoses could not be derived solely from the frequencies with which radiologists assigned these specific diagnoses; frequencies did not differ greatly between urgent and non-urgent diagnoses. The table in Appendix II shows cross tables of diagnoses assigned per observer and, thereby, the difference in agreement between urgent and non-urgent diagnoses.

Fig. 1

Case of agreement. A 52-year-old male with abdominal pain for 2 days in the right lower quadrant. He had complaints of nausea and vomiting and a temperature of 38°C. At physical examination he had right lower quadrant tenderness with guarding. The C-reactive protein was 206. All observers correctly diagnosed this patient with acute appendicitis (arrow). C cecum

Fig. 2

Case of disagreement. A 57-year-old female, with right lower quadrant pain for 3 days and a temperature of 37.6°C. At physical examination both lower abdominal quadrants were tender, with rebound tenderness, but without guarding. C-reactive protein was 112 mg/l and WBC 8 × 10⁹/l. a Observer 1 diagnosed this patient with appendicitis (arrows point at the appendix). b Observer 2 diagnosed her with diverticulitis (arrow points to a diverticulum with adjacent inflammatory changes). Observer 3 diagnosed this patient with inflammation of the sigmoid, but without diverticulitis. The final diagnosis of the expert panel in this patient was acute diverticulitis. The patient was treated conservatively with rest and a liquid diet and recovered uneventfully

Table 5 Observed agreement and agreement according to kappa statistic for frequent occurring disease

Discussion

In this study, in unselected patients with acute abdominal pain presenting to the ED, abdominal CT was found to have good inter-observer agreement, with excellent inter-observer agreement for urgent diagnoses, such as appendicitis, diverticulitis and bowel obstruction. For non-urgent diagnoses, such as hepatic, pancreatic and biliary disorders, gastrointestinal tract disorders and NSAP, agreement was good but not excellent. One should be aware that most of these non-urgent diagnoses, such as gastro-enteritis, cannot be readily made by CT. Therefore, efforts must be made to adequately select patients with acute abdominal pain at the ED who will benefit from CT. Inter-observer agreement was generally moderate for individual radiological features, but that did not negatively affect agreement on urgent diagnoses. It is most important that an urgent diagnosis is recognised by all observers. Patients with an urgent disease need immediate treatment, whereas in patients with a non-urgent cause of acute abdominal pain, treatment can be safely delayed beyond 24 h. In these patients, more time is available to establish the correct diagnosis.

This study has some potential limitations that have to be taken into account. First, we did not evaluate intra-observer agreement. As inter-observer agreement was good overall, information on intra-observer agreement may not be crucial. Another potential limitation of this study was the spectrum of experience of observers. All observers were radiologists; no radiological resident read the CT images for study purposes. Radiological residents are usually supervised by a radiologist in the evaluation of abdominal CT. Thirdly, the CT protocol within this study included intravenous contrast medium only. Oral or rectal contrast medium is not a prerequisite, although helpful in some conditions. In this study, CT examinations were read after 3 months, which may have caused some bias. Although workload and time of day differed between initial review and cold review, methods of review were identical for all three observers. Furthermore, kappa values did not differ significantly between observer 3 and the observer at the ED (data not shown). Finally, because not all patients had a surgically and histopathologically proven final diagnosis, an expert panel was used to assign a final diagnosis. For this reason, a follow-up period of 6 months was chosen to collect additional data.

Our results closely reflect daily clinical practice, as we invited all consecutive patients presenting with acute abdominal pain and made no a priori selection. In the literature [6–8] kappa values have been reported for selected patients with a suspicion of one specific condition, e.g. patients suspected of appendicitis or diverticulitis, or in studies in which abdominal CT examinations were reviewed to identify the appendix [14].

Another study that has evaluated inter-observer variability of abdominal CT in general found a good inter-observer agreement for presence or absence of abdominal pathology, as measured on a five-point scale [15]. However, in that study abdominal pathology was not specified, and no diagnosis was assigned by the observers.

We chose to report level of agreement in kappa values, because they are widely used in the literature and well known to clinicians and radiologists. Cohen’s kappa statistics express agreement adjusted for chance. We also reported observed agreement (percentage agreement) alongside kappa values. Kappa values are influenced by disease prevalence. If the disease prevalence is high in the study population, expected agreement will be high as well, and this will lower the corresponding kappa value. On the other end of the spectrum the same holds true: if disease prevalence is very low in the population under study, expected agreement will be high, thereby lowering the corresponding kappa value. Therefore, it is assumed that kappa values of urgent diagnoses in our study are lowered because of high prevalence of urgent diagnoses instead of actual moderate agreement.
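The effect of prevalence on kappa can be illustrated with two 2 × 2 tables that share the same observed agreement. The counts below are hypothetical and chosen only for illustration; they are not study data. The function name `kappa_2x2` is an assumption.

```python
def kappa_2x2(both_pos, only_a_pos, only_b_pos, both_neg):
    """Observed agreement and Cohen's kappa from the four cells of a 2 x 2 table."""
    n = both_pos + only_a_pos + only_b_pos + both_neg
    p_o = (both_pos + both_neg) / n
    # Marginal proportions of 'present' for observer A and observer B.
    a_pos = (both_pos + only_a_pos) / n
    b_pos = (both_pos + only_b_pos) / n
    # Chance-expected agreement on 'present' plus on 'absent'.
    p_e = a_pos * b_pos + (1 - a_pos) * (1 - b_pos)
    return p_o, (p_o - p_e) / (1 - p_e)


# Hypothetical tables of 200 examinations, each with 20 discordant readings.
# Balanced prevalence: each observer scores the feature present in 100 of 200.
p_o_balanced, kappa_balanced = kappa_2x2(90, 10, 10, 90)
# Very low prevalence: each observer scores the feature present in 12 of 200.
p_o_rare, kappa_rare = kappa_2x2(2, 10, 10, 178)
```

Both tables have an observed agreement of 0.90. With balanced prevalence the expected agreement is 0.50 and kappa is 0.80. With very low prevalence the expected agreement rises to about 0.89 and kappa falls to about 0.11, although the observers disagree in the same number of cases.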

Agreement between radiologists on the CT diagnoses was good, but excellent inter-observer agreement could have been expected, because of the excellent accuracy reported in the literature, which presupposes excellent agreement. In the present study, accuracy was not a primary study aim, and the 200 abdominal CT examinations assessed by three observers were not enough to evaluate accuracy.

Accuracy studies can be prone to observer bias, when only highly experienced observers are used to evaluate the test. In this study no difference in level of agreement was found between all three observer couples, which suggests that agreement does not depend highly on additional years of experience.

CT images were evaluated with information on clinical history, physical and laboratory examination provided to the observers. Reading with clinical information can inflate test accuracy due to clinical review bias [16]. Test reading is influenced by clinical information in the perception of abnormalities and in the interpretation of abnormalities. Our results may have been influenced by clinical review bias, as CT scans were evaluated with knowledge of clinical information, but this situation reflects normal practice.

In conclusion, we can say that overall inter-observer agreement of radiologists for CT is good in patients with acute abdominal pain presenting at the emergency department and, most importantly, excellent agreement was found for urgent diagnoses. Therefore, if CT images suggest an urgent diagnosis in a patient with acute abdominal pain, it can safely be assumed that different radiologists would assign the same diagnosis, but opinions are more likely to differ for non-urgent diagnoses.