The PubMed search yielded 1,587 hits, the EMBASE search 68 and the Cochrane Library search none.
Thirty articles were selected for further review. Of these, two were excluded because they used endpoints other than those specified in the search strategy, one was a narrative review, four were conducted in settings other than those specified, one was a consensus paper, six concerned specific groups of patients and one did not describe a scoring system.
Six articles concerned track and trigger systems (scoring systems normally used to activate in-hospital Medical Emergency Teams to evaluate patients in acute distress). One of these was a review of 33 different systems and was included in our review. Two of the six included only patients who presented to an Emergency Department and were excluded, as the patients were not later admitted to the hospital.
A total of 13 articles were included in this review (see figure 1).
The articles presented in this study are very heterogeneous. They originate from different department types, and the case-mix ranges from patients admitted solely by helicopter to all patients discharged from a medical department. We have therefore chosen to focus only on the parts of the scoring systems that we find important for assessing their relevance: which variables the authors chose to include, which statistical methods were used to design and test the systems, what the discriminatory power and calibration of the systems were (i.e. how usable the systems are) and which level of evidence the systems achieve.
Track and trigger systems
Two of the papers on track and trigger systems were written by Subbe et al. One analyzed the Early Warning Score (EWS) and the other the Modified Early Warning Score (MEWS) in patients admitted to or through a MAU. Neither of these articles presented data on discriminatory power or calibration. When calculating the EWS, the authors found that a maximum score of five was associated with an increased risk of death, ICU admission and HDU admission. When the authors stratified patients into three risk bands according to the MEWS score, they found a statistically significant increase in the incidence of cardiac arrest only in the intermediate risk band, i.e. MEWS 3-4.
Paterson et al. included medical and surgical patients admitted to a combined assessment area in their study. The objective was to evaluate the implementation of a standardized early warning scoring system (SEWS). A total of 848 patients were included, 435 of them after the implementation of SEWS. In the SEWS cohort, they found a significant linear relationship between in-hospital mortality and admission SEWS score (chi-squared 34.3, p < 0.001). Data on discrimination were not presented.
As the review by Smith et al. is included in our article (here referred to as TTS) and itself covered the studies by both Subbe et al. and Paterson et al., these studies will not be presented in further detail.
Variables included in the scoring systems
All but two of the scoring systems used vital signs as variables when calculating the score (see table 1). The Admission Laboratory Tests (ALT) and the Routine Laboratory Data (RLD) both relied mainly on blood tests. Two systems, the Simple Clinical Score (SCS) and the Hypotension, Oxygen saturation, low Temperature, ECG changes and Loss of independence Score (HOTEL), included both subjective and objective parameters (e.g. dyspnoea and abnormal ECG).
Development of the scoring systems
Regression analysis was the most frequently applied method for developing the scoring systems; only the Track and Trigger System (TTS) was developed otherwise (see table 2). Eight of the ten systems included patients admitted to a medical admission unit, whereas the population in the Rapid Acute Physiology Score (RAPS) was patients transported to the hospital by helicopter, and the population in the Goodacre Score (GS) was patients transported to the emergency department by ambulance.
Six systems used in-hospital mortality as their primary endpoint, and only the Early Warning Score (EWS) used a composite endpoint.
Discriminatory power and calibration
Discriminatory power (i.e. the ability to identify patients at increased risk of meeting the endpoint) was specified for eight of the scoring systems (see table 3), but not for RAPS and RLD. It was above 0.8 in five of these, but in the Worthing Physiological Scoring System (WPS) it was 0.74 and in TTS 0.657-0.782.
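To illustrate how a discrimination value of this kind can be read, the following is a minimal sketch assuming that the reported values are areas under the receiver operating characteristic curve (AUROC), which the text above does not state explicitly. Under that assumption, the value is the probability that a randomly chosen patient who met the endpoint has a higher score than a randomly chosen patient who did not. The function name and the data are hypothetical and are not taken from any of the included studies.

```python
def auroc(scores, outcomes):
    """Rank-based AUROC: the fraction of (event, non-event) patient pairs
    in which the event patient has the higher score; ties count as 0.5."""
    with_event = [s for s, y in zip(scores, outcomes) if y == 1]
    without_event = [s for s, y in zip(scores, outcomes) if y == 0]
    if not with_event or not without_event:
        raise ValueError("need patients both with and without the endpoint")

    concordant = 0.0
    for s_event in with_event:
        for s_none in without_event:
            if s_event > s_none:
                concordant += 1.0
            elif s_event == s_none:
                concordant += 0.5
    return concordant / (len(with_event) * len(without_event))


# Hypothetical admission scores and in-hospital deaths (1 = died).
example_scores = [1, 2, 2, 3, 4, 5, 6, 7]
example_deaths = [0, 0, 0, 0, 1, 0, 1, 1]
example_auroc = auroc(example_scores, example_deaths)
```

A value of 0.5 corresponds to a score that discriminates no better than chance, and a value of 1.0 to a score that ranks every patient who met the endpoint above every patient who did not.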
The calibration (i.e. agreement between the predicted and the observed outcome in the model) was specified for only four scoring systems. In the Rapid Emergency Medicine Score (REMS) it was calculated using the chi-square test and was found to be poor, but this was not further specified.
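As an illustration of how calibration can be assessed with a chi-square statistic, the following is a minimal sketch in the style of the Hosmer-Lemeshow test. This is an assumption made for illustration only: the source does not specify which form of the chi-square test was used for REMS. Patients are sorted by predicted risk, split into groups, and the observed number of events in each group is compared with the number expected from the predictions. The function name, the grouping and the data are hypothetical.

```python
def calibration_chi_square(predicted, observed, n_groups=4):
    """Hosmer-Lemeshow-style statistic: sum over risk groups of
    (observed events - expected events)^2 / (n * p * (1 - p)),
    where p is the mean predicted risk in the group."""
    pairs = sorted(zip(predicted, observed))
    n = len(pairs)
    if n_groups < 1 or n < n_groups:
        raise ValueError("need at least one patient per group")

    statistic = 0.0
    for g in range(n_groups):
        group = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        size = len(group)
        expected = sum(p for p, _ in group)   # expected number of events
        events = sum(y for _, y in group)     # observed number of events
        mean_risk = expected / size
        variance = size * mean_risk * (1.0 - mean_risk)
        if variance == 0.0:
            raise ValueError("predicted risks must lie strictly between 0 and 1")
        statistic += (events - expected) ** 2 / variance
    return statistic


# Hypothetical cohort: 10 patients predicted at 10% risk, 10 at 90% risk.
risks = [0.1] * 10 + [0.9] * 10
well_calibrated = [1] + [0] * 9 + [1] * 9 + [0]         # 1 and 9 deaths
poorly_calibrated = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 5  # 5 and 5 deaths

good_fit = calibration_chi_square(risks, well_calibrated, n_groups=2)
poor_fit = calibration_chi_square(risks, poorly_calibrated, n_groups=2)
```

A statistic close to zero indicates that observed and predicted event counts agree within each risk group, whereas a large statistic indicates poor calibration. Converting the statistic to a p-value requires the chi-square distribution with the appropriate degrees of freedom, which is omitted here.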
None of the studies included an impact analysis or an analysis of inter-observer reliability.
The TTS and EWS reached evidence level two according to McGinn et al. The GS only reached level four whereas the other systems all reached level three. None of the systems thus reached level one.