Introduction

The assessment of passenger counts is of paramount importance for public transport agencies in order to plan, manage, and evaluate their transit service. Applications cover many topics, for example short- and long-term forecasting, optimization of passenger behaviour and daily operations, or the sharing of revenue among operators. Issues of passenger demand have a long history (see e.g. Kraft and Wohl 1968), and in recent years the modelling of passenger counts has developed rapidly due to the availability of large-scale automatic data collection. Data on (automatic) passenger counts directly affects both the revenue generated by ticket sales and the state subsidies of public transport companies within a unified ticketing system. To illustrate, in Berlin, buses and underground trains are operated by the BVG, while a complementary rapid transit system is operated by S-Bahn Berlin GmbH. Public transport services in Berlin (and Brandenburg) are provided by around 40 companies organized in the Transport and Tariff Association of Berlin and Brandenburg (VBB), which offers operator-spanning tickets, e.g. single-day or monthly passes. Excluding subsidies, the revenue from ticket sales alone was around 1.4 billion EUR in 2017. Revenue magnitudes in the billions are common in public transport (Armstrong and Meissner 2010) and regularly need to be shared among different operators on the basis of (automatic) passenger data.

APC systems have evolved considerably within the past 40 years. Passenger flow data can be acquired with high accuracy, outperforming manual ride checkers (Hodges 1985; Hwang et al. 2006). Devices that operate on 3D image streams are the industrial state-of-the-art technology; latest-generation devices offer an accuracy of around 99% (iris 2018; Hella Aglaia 2018), and technical progress is ongoing. Like all measurement devices, APC systems are susceptible to error. Objective statistical criteria are therefore required to compare the counting precision of different sensors. Such criteria are needed not only for comparisons between APC systems; decision-making processes also rely on highly accurate APC data (Furth et al. 2005). The rising use of APC systems led to the formulation of precision criteria to ensure validity and reliability. The term APC validation for this type of quality control was used by Strathman (1989), and validation concepts have seen wider use since the early 2000s (see e.g. Kimpel et al. 2003; Strathman et al. 2005; Boyle 2008; Chu 2010; Köhler et al. 2015).

For real-world APC validations the most relevant criterion is to ensure unbiasedness, i.e. it must be ruled out that the APC system makes a relevant systematic error. This is crucial especially regarding revenue sharing, for which APC count data is commonly used (Detig et al. 2014; Hagemann 2017; Nahverkehrs-praxis 2014; Verkehr & Technik 2016; VVS 2016; VMT 2010), since small errors, of whatever origin, can already have a large impact. To illustrate, companies within a shared ticketing system like the above-mentioned BVG or S-Bahn Berlin GmbH have revenues, consisting of ticket sales and subsidies, of roughly one billion EUR each. If one of these companies somehow systematically counted 1% too few passengers and the other one 1% too many, and passenger counts accounted for only 10% of the shared revenue, then two million EUR would already be distributed inappropriately every year, for these two companies operating in the Berlin area alone.
In Germany, it is common that tickets are sold by the transport association and that revenue, as well as subsidies, is distributed among the transport companies (Beck 2011), which together accounted for 12.8 billion EUR in 2017 (VDV 2018). Furthermore, such a revenue sharing scheme itself is currently associated with high costs, which amounted to roughly one million EUR for the VBB in the year 2014 (Baum and Gaebler 2015).

One industrial recommendation regarding APC systems is central to tendering procedures in Germany and is also advertised by manufacturers worldwide (Hella Aglaia 2018; iris 2018): the VDV 457 (Köhler et al. 2015), which regularly becomes part of transportation contracts in unmodified form, sometimes even in the latest, yet unreleased version that has “this is a pre-release” watermarked on every page. Due to the huge impact of the document, all change requests to the VDV 457 must be approved by a committee within the Association of German Transport Companies (Verband Deutscher Verkehrsunternehmen, VDV). The results presented in this manuscript are organized as follows: In the second section we summarize and discuss the development of (automatic) passenger counting. In the third section we introduce the complete statistical model of APC measurements. In the fourth section we define and examine the revised t-test, which is an attempt to modify resp. extend the t-test to account for the type II error accordingly. There were two reasons for this approach: Firstly, the admission process based on the t-test was already established in the VDV 457 v2.0 and its predecessors, so we wanted to change as little as possible to make the impact of the changes foreseeable and manageable by decision makers. Secondly, an unopinionated approach was more likely to succeed than simply insisting on a particular statistical test because it is popular in other fields like biostatistics. Subsequently, we illustrate that this newly introduced test generally suffers from numerical instability, making the approach unsuitable for wide practical use. In the fifth section we introduce the equivalence test, and in the sixth section we normalize the test criteria of both the revised t-test and the equivalence test to show analytically that, after a transposition of parameters, the tests are identical. The so-obtained t-test-induced equivalence test, however, involves only elementary calculations and is therefore generally not susceptible to numerical instability. We close with some concluding remarks and future prospects in the last section.

APC development and current practice

Traditionally, but also nowadays, passenger counts are collected manually via passenger surveys or human ride checkers, which are both expensive and produce only small samples. The former, passenger surveys, are possibly biased and unreliable (Attanucci et al. 1981). For the latter, manual counts by ride checkers, the accuracy is doubtful, since already the first-generation automatic counting systems were regarded as more accurate (Hodges 1985). Ride checking is often done by less qualified personnel with high turnover rates, and Furth et al. (2005) instead suggest the use of video cameras to increase accuracy and reliability. Today, automatic data collection (ADC) systems in public transport are classified into automatic vehicle location (AVL), automatic passenger counting (APC), and automatic fare collection (AFC) systems (Zhao et al. 2007). The AVL system provides data on the position and timetable adherence of the bus, metro, or train, which needs to be merged with APC data (Furth et al. 2004; Strathman et al. 2005; Saavedra et al. 2011). AFC data is based on ticket sales, magnetic strip cards, or smart cards and has become popular since it is often easily available (Zhao et al. 2007; Lee and Hickman 2014). Still, it often only provides information on boarding but not alighting, generally underestimates actual passenger counts, and may therefore be less accurate than APC data (see e.g. Wilson and Nuzzolo 2008; Chu 2010; Xue and Sun 2015).

The first generation of APC systems was deployed in the 1970s (Attanucci and Vozzolo 1983), and usage increased in the following decades. Casey et al. (1998) reported that many local metropolitan transit agencies use APC systems, and Strathman et al. (2005) reported an increase in APC usage of over 445% within seven years. Today APC systems are used worldwide and have found their way into official documents, such as the above-mentioned tendering procedures in Germany. In the United States, transit agencies using APC data have to submit a benchmarking and a maintenance plan for reporting to the FTA’s National Transit Database (NTD) to be eligible for related grant programs (see e.g. Chu 2010).

A wide range of competing APC technologies has been developed. Detection methods include infrared light beam cells, passive infrared detectors, infrared cameras, stereoscopic video cameras, laser scanners, ultrasonic detectors, microwave radars, piezoelectric mats, switching mats, and also electronic weighing equipment (EWE) (Casey et al. 1998; Kuutti 2012; Kotz et al. 2015). Transit agencies usually mount one or multiple sensors to collect APC data in each door area of public transport vehicles like buses, trams, and trains. The numbers of boarding and alighting passengers are counted separately, with 3D video stream and light barrier (infrared beam break) methods being the most commonly used technologies (Kotz et al. 2015). In recent years, weight-based EWE approaches utilizing pressure measurements in the vehicle braking/air bag suspension system have also emerged to estimate passenger numbers (Nielsen et al. 2014; Kotz et al. 2015). These relatively new approaches have proven to provide easy-to-acquire additional information, since modern buses and powertrains are equipped with (intelligent) pressure sensors by default.

First assessments of APC validity, i.e. accuracy, date back to the 1980s when large-scale usage started in the United States and Canada (Hodges 1985). To assess APC systems, several researchers used confidence intervals and tests for paired data to investigate whether any found bias is statistically significant. The most commonly used statistical test is the t-test (Strathman 1989; Kimpel et al. 2003; Köhler et al. 2015), but the nonparametric Wilcoxon test for paired data has also been used on automated count data (Kuutti 2012). Handbooks for reporting to the FTA’s National Transit Database have adopted the t-test, as have industrial recommendations for APC-buying transit agencies like Köhler et al. (2015). To our knowledge, no t-test related APC criterion formulated so far controls the type II error of the statistical test. Some authors report concepts that resemble equivalence testing. Furth et al. (2006) state “A less stringent test would allow a small degree of bias, say, 2% (partly in recognition that the ‘true’ count may itself contain errors); [...]”, which acknowledges the fact that almost no measurement in the real world will have an expected value of exactly zero. In a survey among transit agencies by Boyle (2008) on how they ensure that APC systems meet a specified level of accuracy, it is reported that “Some [agencies] were more specific, for example, with a confidence level of 90% that the observations were within 10% of actual boardings and alightings.”, which is an early occurrence of an equivalence test concept. Conversely, Chu (2010) introduced an “equivalency test” for APC benchmarking, which, however, is not to be confused with the equivalence test but rather is the application of the aforementioned t-test to average passenger trip lengths. Additional adjustment factors on the raw APC counts are given without defining any equivalency criteria, an issue this paper shall address properly.

Various criteria alternative to the t-test also exist to assess accuracy resp. unbiasedness on the one hand and precision resp. reliability on the other. Nielsen et al. (2014) also investigate absolute differences in addition to evaluating the bias when analysing a weight-based APC approach. Restrictions on the absolute deviation from zero also limit the variability of the APC system. Criteria specifically on the variance of the APC system have been formulated indirectly through the error rate or, more specifically, through specifying the allowed distribution of errors, see e.g. criteria b and c in Köhler et al. (2015), appendix E in Furth et al. (2003), or Boyle (2008). To the best of our knowledge, the most comprehensive and maintained industrial recommendation on APC validation and usage is the above-mentioned VDV Schrift 457 (Köhler et al. 2015). The document gives guidance on the most relevant APC topics, including sampling and the standardization of APC validation. One major aspect of APC validation is the demonstration of adequate APC accuracy, regarding which Köhler et al. (2015) state for the approval process: for an APC system to pass the admission process, its systematic error has to be at most 1%, which is verified by (a variant of) the t-test. Worldwide, there are similar formulations for the validation and thus admission of APC systems, e.g. Furth et al. (2006), Boyle (2008), or Chu (2010).

However, scepticism arose within the industry when seemingly well-performing APC systems started to fail the test. In February 2015, with the help of a brute force algorithm, we constructed a proof of concept for a failed (APC) t-test in which the error is almost zero: with 1000 (or arbitrarily many more) boarding passengers, the sample contains three erroneous measurements, one with an error of one and the other two with an error of two passengers each. In that case, the APC system fails the t-test. This proof of concept led the count precision workgroup (Arbeitsgemeinschaft Zählgenauigkeit) of the VDV to add the equivalence test, with additional restrictions, as an exceptional alternative test alongside the t-test in the VDV 457 v2.0 release in June 2015 (Köhler et al. 2015) to account for APC systems with a low error standard deviation. Indeed, the above-mentioned proof of concept would now be accepted by the new, hybrid test, but as it turned out later, current or near-future APC systems would not profit, since the parameter choice was too strict to pass. Further, a remark was added to the VDV 457 v2.0 that “in the advent of technological advance and increased counting precision, the admission process is still subject to change”: at that time, there was still little insight into why a seemingly suitable and popular statistical test exhibited such seemingly arbitrary behaviour, and it was not entirely clear how the equivalence test compared to the long-established t-test.

Detailed investigations showed that the VDV 457 v2.0 t-test variant only accounted for the type I error, defined to be 5% to 10%, which is the risk for an APC system manufacturer to fail the test with a system having a systematic error of zero. In t-test terminology, this parameter is known as the statistical significance \(\alpha\). Conversely, the type II error \(\beta\) is the risk that an APC system with a systematic error greater than 1% obtains admission; it is the complement of the statistical power \(1-\beta\). The type II error, and thus the statistical power, was accounted for neither in the sample size planning nor in the testing procedure. Through the sample size formula it was implicitly 50%, assuming the a priori estimated standard deviation was correct; otherwise, the higher the empirical standard deviation, the greater the type II error, and vice versa. The statistical framework for APC validation and methods to address the current shortcomings are given in the following sections.

Statistical model

Let \(\Omega _0=\{\omega _i\}, i=1,\ldots ,N\) be the statistical population of stop door events (SDE), which are used to summarize all boarding and alighting passengers at a single door during a vehicle (bus, tram, train) stop. Further, let \(\Omega =\{\omega _{i_j}\}, i_j\in \{1,\ldots ,N\}, \ j\in \{1,\ldots ,n\}\) be a sample, which consists of either randomly or structurally selected SDE (e.g. by a given sampling plan). The size of the statistical population N may be considered as the number of all SDE over the relevant time period, which is typically one or more years, so N can be assumed to be unbounded and thus \(N=\infty\). Let n be the sample size, \(M_i\), \(i\in \{1,\ldots ,n\}\), be the manual count and \(K_i\), \(i\in \{1,\ldots ,n\}\), be the automatic count of boarding passengers made by the APC system. The manual count, obtained by multiple ride checkers or, preferably, video camera information (Kimpel et al. 2003), is assumed to be the ground truth to compare against. Alighting passengers are counted as well and results apply analogously, but w.l.o.g. we only consider the boarding passengers. Let \(\overline{M}=\frac{1}{n}\sum _{i=1}^{n} M_i\) be the average manual boarding passenger count. Similar to other authors [see e.g. appendix E in Furth et al. (2003), Furth et al. (2006), Nielsen et al. (2014), Köhler et al. (2015)] we consider the random variables

$$\begin{aligned} D_i := \frac{K_i-M_i}{\overline{M}}{,} \end{aligned}$$
(1)

which we call relative differences: the difference between the automatically and manually counted boarding passengers relative to the average of the manually counted boarding passengers. The average \({\overline{D}}:=\frac{1}{n} \sum _{i=1}^{n} D_i\) is the statistic of interest that is used in both the t-tests as well as the equivalence test. The expected value \(\mu :=E({\overline{D}})\) is the actual systematic error of an APC system (Furth et al. 2005), since it can systematically discriminate against participants of the revenue sharing system; it could also be referred to as the bias of the measurement device, a term frequently used in APC accuracy evaluations (Strathman 1989; Kimpel et al. 2003; Furth et al. 2005; Chu 2010; Nielsen et al. 2014).
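
To make Eq. (1) concrete, the following minimal R sketch (with hypothetical example data; all variable names are our own) computes the relative differences and their average from paired manual and automatic boarding counts per SDE:

```r
# Hypothetical paired boarding counts per stop door event (SDE)
M <- c(5, 0, 3, 12, 7)    # manual counts M_i, assumed ground truth
K <- c(5, 0, 4, 11, 7)    # automatic counts K_i from the APC system

M_bar <- mean(M)          # average manual boarding count
D     <- (K - M) / M_bar  # relative differences D_i, Eq. (1)
D_bar <- mean(D)          # statistic of interest, estimate of the systematic error
```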

The criteria in each APC approval procedure are often reviewed by specially trusted authorities who are entitled to grant admission. They perform their own manual ride check, evaluate the criteria, i.e. the statistical test, and either approve or reject the APC system. There are two conflicting interests that need to be dealt with: acquiring maximally accurate and reliable data on the one hand and approving a high number of APC systems in a fast and cost-efficient process on the other. Shortcomings regarding the former we will call calibration resp. user risk and shortcomings regarding the latter manufacturer risk. We attribute the user risk to public transportation companies and network authorities, who rely on accurate data, even though the motivations for APC data collection might be more complex in the real world. The manufacturer risk relates directly to possible recourse claims and negative market reputation if the respective APC system fails the admission. These two risks relate to the type I and type II errors of statistical tests. For the t-test, the hypotheses are

$$\begin{aligned} H_0:\quad {\mathrm{{There\,is\,no\,systematic\,APC\,measurement\,error}}}\,(\mu = 0)\end{aligned}$$
(2)
$$\begin{aligned} H_1:\quad {\mathrm{{There\,is\,a\,systematic\,APC\,measurement\,error}}}\,(\mu \ne 0). \end{aligned}$$
(3)

Let \(\nu\) be the a priori estimated standard deviation, \({\widehat{\nu }}\) the empirical standard deviation of the sample, \(d_r\) the maximal allowed error (e.g. 1%), \({\alpha _{\mathrm{{t}}}}\) the risk of falsely rejecting the null hypothesis \(H_0\) (type I error, i.e. rejecting an APC system with an actual systematic error of zero) and \({\beta _{\mathrm{{t}}}}\) the risk of falsely accepting the null hypothesis \(H_0\) when a particular value of the alternative hypothesis \(H_1\) is true (type II error, e.g. accepting an APC system with a systematic error of 1%) (see e.g. Guthrie 2010).

The sample size estimation for the t-test is given by

$$\begin{aligned} {n_{\mathrm{{t}}}}= \left( z_{1-{\beta _{\mathrm{{t}}}}}+z_{1-{\alpha _{\mathrm{{t}}}}/2}\right) ^2\ \frac{\nu ^2}{d_r^2}{,} \end{aligned}$$
(4)

with \(z_{(\cdot )}\) being the quantile function of the standard normal distribution, and the test criterion as

$$\begin{aligned} |{\overline{D}}| \le z_{1-{\alpha _{\mathrm{{t}}}}/2} \frac{{\widehat{\nu }}}{\sqrt{{n_{\mathrm{{t}}}}}}{.} \end{aligned}$$
(5)
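
As an illustration, the following R sketch evaluates Eqs. (4) and (5); the parameter values (\(\alpha_t = 5\%\), \(\beta_t = 2.5\%\), \(d_r = 1\%\), \(\nu = 0.15\)) are merely example choices consistent with the values discussed later, not prescribed ones:

```r
# Sample size planning, Eq. (4), for example parameter choices
alpha_t <- 0.05   # type I error (manufacturer risk)
beta_t  <- 0.025  # type II error (user risk)
d_r     <- 0.01   # maximal allowed systematic error, 1%
nu      <- 0.15   # a priori estimated standard deviation of the D_i

n_t <- ceiling((qnorm(1 - beta_t) + qnorm(1 - alpha_t / 2))^2 * nu^2 / d_r^2)

# Test criterion, Eq. (5): D_bar and nu_hat are computed from the collected sample
t_test_passed <- function(D_bar, nu_hat) {
  abs(D_bar) <= qnorm(1 - alpha_t / 2) * nu_hat / sqrt(n_t)
}
```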

Revised t-test

Several discussions about post-hoc power adaptations for the t-test exist; a thorough treatment can be found in Hoenig and Heisey (2001). They argue that approaches referred to as Observed Power, Detectable Effect Size, or Biologically Significant Effect Size are “flawed”. For the latter approach, which is described in Cohen (1988), Hoenig and Heisey criticize the assumption that the actual power is equal to the intended power and is not updated according to experimental results (e.g. sampling variability). Addressing this, we investigate procedures to control the (actual) type II error in order to assess the non-presence of a crucial difference. Schuirmann (1987) originally referred to the approach of using a negative hypothesis test to infer that no inequivalence was present as the Power Approach. Analogously to these thoughts, we will consider variations of the type I error \(\alpha\) to make adaptations to the testing procedure and call this approach post-hoc power calculation. This was explicitly mentioned by Schuirmann but not pursued further, citing a lack of practical interest: “In the case of the power approach, it is of course possible to carry out the test of the hypothesis of no difference at a level other than 0.05 and / or to require an estimated power other than 0.80, but this is virtually never done.” While this approach may not have been used in the world of pharmaceutics, it is of relevance for the validation of devices for automatic passenger counting and likely other applications in industrial statistics. In general, as well as in practice, the a priori estimated standard deviation and the empirical standard deviation after data collection differ to some extent, and we strongly believe that this difference cannot be relied upon to be negligible. Therefore, we want to ensure that the risk of the user (the type II error) does not exceed a prespecified level, which is usually 5%. The only parameter left to adapt is then the type I error \(\alpha\), which is the risk of the device manufacturer. The appropriate \({\widehat{\alpha }}_t\), the revised significance, can thus be determined by solving the equation

$$\begin{aligned} n_t\,{\mathop {=}\limits ^{!}}\,\left( z_{1-{\beta _{\mathrm{{t}}}}}+z_{1-{{\widehat{\alpha }}_{\mathrm{{t}}}}/2}\right) ^2\ \frac{{\widehat{\nu }}^2}{d_r^2}{,} \end{aligned}$$
(6)

and thus

$$\begin{aligned} {{\widehat{\alpha }}_{\mathrm{{t}}}}= 2\left[ 1 - z^{-1}\left( \left( z_{1-{\beta _{\mathrm{{t}}}}}+z_{1-{\alpha _{\mathrm{{t}}}}/2}\right) \frac{\nu }{{\widehat{\nu }}} - z_{1-{\beta _{\mathrm{{t}}}}}\right) \right] {.} \end{aligned}$$
(7)
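
Eq. (7) translates directly into code. The following R sketch is our own transcription, with pnorm playing the role of \(z^{-1}\), i.e. the standard normal distribution function:

```r
# Revised significance alpha_hat_t, Eq. (7); pnorm is the inverse of the
# quantile function qnorm, i.e. z^{-1} in the notation of the text
revised_alpha <- function(alpha_t, beta_t, nu, nu_hat) {
  2 * (1 - pnorm((qnorm(1 - beta_t) + qnorm(1 - alpha_t / 2)) * nu / nu_hat
                 - qnorm(1 - beta_t)))
}
```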

Note that \(n_t\), by choice of the initial \({\alpha _{\mathrm{{t}}}}\) and \(\nu\), is fixed. If the actual sample size does not match the planned sample size, \(n_t\) resp. \(\nu\) needs to be adapted. Analogously to Eq. (5), we define the test criterion for the revised t-test:

$$\begin{aligned} |\overline{D}| \le z_{1-{{\widehat{\alpha }}_{\mathrm{{t}}}}/2} \frac{{\widehat{\nu }}}{\sqrt{{n_{\mathrm{{t}}}}}}{,} \end{aligned}$$
(8)

which, however, can yield a problematic behaviour in practice: First, the term \(z_{1-{{\widehat{\alpha }}_{\mathrm{{t}}}}/2}\) is undefined for \({\widehat{\alpha }}_t > 2\). Combined with Eq. (6), this induces a lower bound on \(n_t\):

$$\begin{aligned} n_t\ge z_{1-{\beta _{\mathrm{{t}}}}}^2\ \frac{{\widehat{\nu }}^2}{d_r^2}{.} \end{aligned}$$
(9)

Second, \(z(1)=\infty\) is the source of a different problem: whenever \(\nu /{\widehat{\nu }}\) exceeds a certain critical value, which is 2.62 for \(\alpha _t=5\%\) and \(\beta _t=2.5\%\), rounding errors yield \(z_{1-{{\widehat{\alpha }}_{\mathrm{{t}}}}/2}=\infty\). This can be relevant in practice: in that case, the test criterion from Eq. (8) is always satisfied and the test is thus always passed, as illustrated in Fig. 1.
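
This effect can be reproduced with the sketch of Eq. (7) given above; the numbers below assume \(\alpha_t = 5\%\), \(\beta_t = 2.5\%\) and \(\nu = 0.15\) resp. \(\nu = 0.393\) as in Fig. 1:

```r
# A matching empirical standard deviation reproduces the planned significance
revised_alpha(alpha_t = 0.05, beta_t = 0.025, nu = 0.15, nu_hat = 0.15)  # approx. 0.05

# For nu / nu_hat = 2.62 the true alpha_hat_t (around 1e-16) is lost to rounding
a_hat <- revised_alpha(alpha_t = 0.05, beta_t = 0.025, nu = 0.393, nu_hat = 0.15)
a_hat                 # 0
qnorm(1 - a_hat / 2)  # Inf: the criterion in Eq. (8) then holds for any D_bar
```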

Fig. 1

Chances (of an APC system) to pass depending on the method used to determine the revised significance \({{\widehat{\alpha }}_{\mathrm{{t}}}}\): obtained from Eq. (7), denoted by dashed lines, or using the function power.t.test from R Core Team (2018), denoted by solid lines. The latter was our initial approach, which we illustrate here for completeness. We notice that for far too low sample sizes (red and dark red lines) the former method is stricter. For the dark red dashed line, n is below the lower bound from Eq. (9) and we thus assume the test to always fail, compare the “t-test-induced equivalence test” section. For (too) large sample sizes, numerical instabilities, which cannot be detected by the user through e.g. error messages, lead to sudden gaps in the function values (green lines for \(\nu =0.310\) and \(\nu =0.311\)) for the power.t.test variant, which relies on fixed-point iteration and has a history of unexpected convergence behaviour. The blue lines denote a practically reachable numeric limit when using Eq. (7): starting with \(\nu \ge 0.393\) resp. \(\nu /{\widehat{\nu }}\ge 2.62\), all systems are always accepted. For the power.t.test approach, the light blue line is slightly above the dark blue one, so it has numerical problems for these values, too. Generally, all values of \(\nu\) have to be seen w.r.t. the ratio \(\nu /{\widehat{\nu }}\), since these numerical effects can already occur with much smaller sample sizes. (Colour figure online)

Equivalence testing

The equivalence test has its origin in the field of biostatistics, where the term bioequivalence testing is often used (e.g. Schuirmann 1987; Berger and Hsu 1996; Wellek 2010). Bioequivalence tests are statistical tools commonly used to compare the performance of generic drugs with that of established drugs using several accepted metrics of drug efficacy. Equivalence between groups means that differences are within certain bounds, as opposed to complete equality. These bounds are application-specific and are usually chosen such that they are below any potential (clinically) relevant effect (Ennis and Ennis 2010). In many publications the problem is referred to as the two one-sided tests (TOST) procedure (Schuirmann 1987; Westlake 1981). TOST procedures were developed under various parametric assumptions, and distribution-free approaches exist as well (Wellek and Hampel 1999; Zhou et al. 2004).

Equivalence tests have begun making their way into psychological research (see e.g. Rogers et al. 1993) and the natural sciences: Hatch (1996) applied them for testing in clinical biofeedback research. Parkhurst (2001) discussed the lack of usage of equivalence testing in biology studies and stated that equivalence tests improve the logic of significance testing when demonstrating similarity is important. Richter and Richter (2002) used equivalence testing in industrial applications and gave instructions on how to easily calculate it with basic spreadsheet computer programs. Applications also involved risk assessment (Newman and Strojan 1998), plant pathology (Garrett 1997; Robinson et al. 2005), ecological modelling (Robinson and Froese 2004), analytical chemistry (Limentani et al. 2005), pharmaceutical engineering (Schepers and Wätzig 2005), sensory and consumer research (Bi 2005), assessment of measurement agreement (Yi et al. 2008), sports sciences (Vanhelst et al. 2009), applications to microarray data (Qiu and Cui 2010), genetic epidemiology in the assessment of allelic associations (Gourraud 2011), and geography (Waldhoer and Heinzl 2011).

For the equivalence test, the hypotheses are (Schuirmann 1987; Julious 2004)

$$\begin{aligned} H_0:\quad {\mathrm{{There\,is\,a\,(relevant)\,systematic\,APC\,measurement\,error}}}\,(|\mu | \ge \Delta )\end{aligned}$$
(10)
$$\begin{aligned} H_1:\quad {\mathrm{{There\,is\,no\,(relevant)\,systematic\,APC\,measurement\,error}}}\,(|\mu | < \Delta ){.} \end{aligned}$$
(11)

We define \(\Delta\) to be the equivalence margin; for the equivalence test, the relevant errors are \({\alpha _{\mathrm{{e}}}}\), referring to (half) the risk of the user, and \({\beta _{\mathrm{{e}}}}\), referring to the risk of the device manufacturer. We will consider two-sided \(1 - 2{\alpha _{\mathrm{{e}}}}\) confidence intervals (symmetric around the mean), where \({\alpha _{\mathrm{{e}}}}\) is often chosen to be 2.5%. The usage of \(1 - {\alpha _{\mathrm{{e}}}}\) confidence intervals is also possible (see e.g. Westlake 1981) but is less frequent in the recent literature on this topic. Note that, by definition, the meanings of \(\alpha\) and \(\beta\) are interchanged between the t-test and the equivalence criterion with respect to which denotes the risk of the user and which the risk of the manufacturer.

For the equivalence test, a sample size estimation exists (see e.g. Liu and Chow 1992; Julious 2004) similar to Eq. (4) for the t-test:

$$\begin{aligned} {n_{\mathrm{{e}}}}= \left( z_{1-{\beta _{\mathrm{{e}}}}/2}+z_{1-{\alpha _{\mathrm{{e}}}}}\right) ^2\ \frac{\nu ^2}{\Delta ^2}{.} \end{aligned}$$
(12)

We define the test criterion for the equivalence test

$$\begin{aligned} |{\overline{D}}| \le \Delta - z_{1-{\alpha _{\mathrm{{e}}}}}\ \frac{{\widehat{\nu }}}{\sqrt{{n_{\mathrm{{e}}}}}}{,} \end{aligned}$$
(13)

which is the formulation of the Two One-Sided Test Procedure for the crossover design in the case of limits that are symmetrical around zero (see Schuirmann 1987).
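
The following R sketch mirrors the t-test sketch above and evaluates Eqs. (12) and (13); again the parameter values (\(\alpha_e = 2.5\%\), \(\beta_e = 5\%\), \(\Delta = 1\%\), \(\nu = 0.15\)) are merely example choices:

```r
# Sample size planning for the equivalence test, Eq. (12)
alpha_e <- 0.025  # (half) the risk of the user
beta_e  <- 0.05   # risk of the device manufacturer
Delta   <- 0.01   # equivalence margin, 1%
nu      <- 0.15   # a priori estimated standard deviation of the D_i

n_e <- ceiling((qnorm(1 - beta_e / 2) + qnorm(1 - alpha_e))^2 * nu^2 / Delta^2)

# TOST criterion, Eq. (13): D_bar and nu_hat come from the collected sample
equivalence_test_passed <- function(D_bar, nu_hat) {
  abs(D_bar) <= Delta - qnorm(1 - alpha_e) * nu_hat / sqrt(n_e)
}
```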

t-test-induced equivalence test

An approach to compare the revised t-test to the equivalence test is to normalize and compare their test criteria from Eqs. (8) resp. (13) as well as their sample size formulas (4) resp. (12). By combining Eqs. (4), (6) and (8), we obtain a normalized test condition for the revised t-test:

$$\begin{aligned} |\overline{D}|&\le z_{1-{{\widehat{\alpha }}_{\mathrm{{t}}}}/2} \frac{{\widehat{\nu }}}{\sqrt{{n_{\mathrm{{t}}}}}} =\left( \frac{\nu }{{\widehat{\nu }}} (z_{1-{\beta _{\mathrm{{t}}}}}+z_{1-{\alpha _{\mathrm{{t}}}}/2}) - z_{1-{\beta _{\mathrm{{t}}}}}\right) \frac{{\widehat{\nu }}}{\sqrt{{n_{\mathrm{{t}}}}}} \nonumber \\&=\left( \frac{\nu }{{\widehat{\nu }}} (z_{1-{\beta _{\mathrm{{t}}}}}+z_{1-{\alpha _{\mathrm{{t}}}}/2}) - z_{1-{\beta _{\mathrm{{t}}}}}\right) \frac{{\widehat{\nu }}}{\nu }\ \frac{1}{z_{1-{\beta _{\mathrm{{t}}}}}+z_{1-{\alpha _{\mathrm{{t}}}}/2}}\ d_r \nonumber \\&=\biggl (1-\frac{{\widehat{\nu }}}{\nu }\ \frac{1}{1+\frac{z_{1-{\alpha _{\mathrm{{t}}}}/2}}{z_{1-{\beta _{\mathrm{{t}}}}}}}\biggr )\ d_r {.} \end{aligned}$$
(14)

Using Eqs. (12) and (13) we obtain

$$\begin{aligned} |\overline{D}|&\le \Delta - z_{1-{\alpha _{\mathrm{{e}}}}}\ \frac{{\widehat{\nu }}}{\sqrt{{n_{\mathrm{{e}}}}}} =\Delta - z_{1-{\alpha _{\mathrm{{e}}}}}\ \frac{{\widehat{\nu }}}{\nu }\ \frac{1}{z_{1-{\beta _{\mathrm{{e}}}}/2}+z_{1-{\alpha _{\mathrm{{e}}}}}}\ \Delta \nonumber \\&=\biggl (1-\ \frac{{\widehat{\nu }}}{\nu }\ \frac{1}{1+\frac{z_{1-{\beta _{\mathrm{{e}}}}/2}}{z_{1-{\alpha _{\mathrm{{e}}}}}}}\biggr )\ \Delta {,} \end{aligned}$$
(15)

which resembles Eq. (14). If we now choose

$$\begin{aligned} {\beta _{\mathrm{{e}}}}:={\alpha _{\mathrm{{t}}}},\ \ {\alpha _{\mathrm{{e}}}}:={\beta _{\mathrm{{t}}}}\quad {\text{and}}\quad \Delta :=d_r \end{aligned}$$
(16)

for Eqs. (12) and (15), they are identical to Eqs. (4) and (14). Therefore, the revised t-test is analytically an equivalence test with error types swapped and an extended domain: since only elementary calculations are made and there is no need to handle a varying quantile function \(z_{(\cdot )}\), there is neither a lower bound as in Eq. (9) nor an upper bound due to numeric instability as illustrated in Fig. 1. We call an equivalence test with parameters chosen as in Eq. (16) a t-test-induced equivalence test. For a visual comparison of the t-test and the equivalence test, see Fig. 2.
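
The identity can also be checked numerically. The following R snippet evaluates the right-hand sides of Eqs. (14) and (15) under the parameter mapping of Eq. (16); the values of \(\nu\) and \({\widehat{\nu }}\) are arbitrary example choices:

```r
# Normalized acceptance thresholds of Eqs. (14) and (15)
alpha_t <- 0.05; beta_t <- 0.025; d_r <- 0.01        # t-test parameters
alpha_e <- beta_t; beta_e <- alpha_t; Delta <- d_r   # mapping of Eq. (16)
nu <- 0.15; nu_hat <- 0.18                           # arbitrary example values

rhs_t <- (1 - (nu_hat / nu) / (1 + qnorm(1 - alpha_t / 2) / qnorm(1 - beta_t))) * d_r
rhs_e <- (1 - (nu_hat / nu) / (1 + qnorm(1 - beta_e / 2) / qnorm(1 - alpha_e))) * Delta
all.equal(rhs_t, rhs_e)  # TRUE: both criteria accept exactly the same samples
```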

Fig. 2

Chances of an APC system with an actual standard deviation of the (relative) errors equal to \({\widehat{\nu }}= 0.15\) to pass (a) the t-test or (b) the revised t-test resp. equivalence test, plotted over the actual systematic error. Different lines denote different sample sizes obtained from different a priori choices of the standard deviation \(\nu\). The golden, solid curve always represents a correctly estimated sample size, the green curve a sample which is too large, and the reddish curves samples which are too small. The dashes in the dark red line in (b) denote the consequences of Eq. (9): only for the equivalence test is the outcome of Eq. (13) defined, and the test may be considered as always failed if \(n=385<z_{1-{\beta _{\mathrm{{t}}}}}^2 \cdot {\widehat{\nu }}^2 / d_r^2=864.33\). Using the original VDV 457 v2.0 with an (implicit) power of 50% yields the dashed golden curve in (a). The vertically striped areas are additionally correctly accepted, the horizontally striped areas are additionally incorrectly accepted. The thick blue lines denote the relative error of \(d_r=1\%\). For comparison, in (c) the incorrect decisions of a reference test are coloured red and the correct decisions cyan. The reason the red areas exist are economic considerations to limit the test costs: further increasing the sample size towards infinity would make the red areas disappear, at least for the revised t-test resp. the equivalence test. For the t-test, the areas with systematic error \(\mu > 1\%\) and \(\mu < -1\%\) remain blue, but the inner area turns red. This behaviour is counterintuitive to the idea that the error of a statistical test goes to zero as the sample size goes to infinity. Note that, in the newly released VDV 457 v2.1, \(\alpha/2\) is used in the place where we use \(\alpha\); therefore, our \(\alpha=2.5\%\) matches \(\alpha=5\%\) in Köhler et al. (2018). (Colour figure online)

Conclusion

We illustrated that the t-test as a criterion for APC approval may exhibit undesirable properties, even as the sample size grows beyond a certain level. Further, we have shown that when trying to compensate for this behaviour by using post-hoc power calculations with a revised t-test, issues of numeric stability and domain limitations arise. Finally, we have proven analytically that the t-test-induced equivalence test, being numerically stable with a practically unlimited domain, can supersede the revised t-test. The equivalence test is popular in various fields and, from a user’s perspective, easier to apply than post-hoc power calculations. Our results thus not only apply to APC systems: every use of the t-test can now comfortably be reconsidered and, on demand, be replaced by a (t-test-induced) equivalence test.

Our work simplifies the decision-making process considerably, especially where it affects worldwide revenue sharing in public transport: 243 billion public transport journeys were made in the year 2015 alone (UITP 2017). For this reason, a large German public transportation company, which was significantly involved in the creation of the original, t-test-based recommendation, commissioned an additional, complementary expert report, which eventually confirmed our findings. With the release of the VDV 457 v2.1 in July 2018, our proposals were accepted, and the use of the equivalence test thus became the new recommendation for the validation of automatic passenger counting systems. Finally, we hope that the long-lasting arguments within the industry about seemingly arbitrary admission results will now end and that our work will enable a broader audience to understand and profit from equivalence testing.