Automatic Passenger Counting: Introducing the t-Test Induced Equivalence Test

Automatic passenger counting (APC) in public transport was introduced in the 1970s and has been rapidly emerging in recent years. Still, real-world applications continue to face events that are difficult to classify. The induced imprecision needs to be handled as statistical noise, and thus methods have been defined to ensure that measurement errors do not exceed certain bounds. Various recommendations for such an APC validation have been made to establish criteria that limit the bias and the variability of the measurement errors. In those works, the misinterpretation of non-significance in statistical hypothesis tests for the detection of differences (e.g. Student's t-test) proves to be prevalent, although existing methods which were developed under the term equivalence testing in biostatistics (i.e. bioequivalence trials, Schuirmann, 1987) would be appropriate instead. This heavily affects the calibration and validation process of APC systems and has been the reason for unexpected results when the sample sizes were not suitably chosen: large sample sizes were assumed to improve the assessment of systematic measurement errors of the devices from a user's perspective as well as from a manufacturer's perspective, but the regular t-test fails to achieve that. We introduce a variant of the t-test, the revised t-test, which addresses both type I and type II errors appropriately and allows a comprehensible transition from the long-established t-test in a widely used industrial recommendation. This test is appealing, but it is still susceptible to numerical instability. Finally, we analytically reformulate it as a numerically stable equivalence test, which is thus easier to use. Our results therefore allow an equivalence test to be induced from a t-test and increase the comparability of both tests, especially for decision makers.

• Introduction of statistical methods to guarantee that the type I and the type II errors are controlled as well as an appropriate sample size calculation.
• Creation of a new comprehensible analytical transition from the t-test to equivalence testing. Decision makers can now precisely compare both methods with the option to replace the t-test, even in high impact applications.
• Successful and fundamental modification of a widely used industrial recommendation that affects revenue sharing in public transportation in the billions.

Introduction
Assessment of passenger counts is of paramount importance for public transport agencies in order to plan, manage and evaluate their transit service. Applications cover many topics, for example short- and long-term forecasting, optimizing passenger behaviour and daily operations, or sharing of revenue among operators. Issues of passenger demand have a long history (see e.g. Kraft and Wohl, 1968). In recent years, modelling of passenger counts has emerged rapidly due to the availability of large-scale automatic data collection. Data on (automatic) passenger counts has a direct impact on both the revenue generated by ticket sales and the state subsidies of public transport companies within one unified ticketing system. To illustrate, in Berlin, buses and the underground are operated by the BVG, and the S-Bahn Berlin, a complementary rapid transit system, by the Deutsche Bahn AG. Most tickets, e.g. a day ticket or monthly pass, are issued by the Transport and Tariff Association of Berlin and Brandenburg (VBB) and allow passengers to use both services among around 40 others. Excluding subsidies, the revenue from ticket sales alone was around 1.4 billion EUR in 2016. Revenue magnitudes within the billions are common in public transport (Armstrong and Meissner, 2010).
APC systems have evolved considerably within the past 40 years. Passenger flow data can be acquired with high accuracy, outperforming manual ride checkers (Hodges, 1985; Hwang et al., 2006). Devices that operate on 3D image streams are the industrial state-of-the-art technology. Latest-generation devices offer an accuracy of around 99% (iris, 2018b; Hella Aglaia, 2018) and technical progress is ongoing. Like all measurement devices, APC systems are susceptible to error. Objective statistical criteria are therefore required for comparing the counting precision of different sensors. These criteria are needed not only for comparisons between APC systems; decision-making processes also rely on highly accurate APC data (Furth et al., 2005). The rising usage of APC systems led to the formulation of some precision criteria to ensure validity and reliability. The term APC validation for this type of quality control was used by Strathman (1989), and validation concepts have seen wider usage since the early 2000s (see e.g. Kimpel et al., 2003; Strathman et al., 2005; Boyle, 2008; Chu, 2010; Köhler et al., 2015).
For real-world APC validations the most relevant criterion is to ensure unbiasedness, i.e. the need to rule out that the APC system makes a relevant systematic error. Especially regarding revenue sharing this is crucial, since small errors, of whatever origin, can already have a large impact: to illustrate, companies with a shared ticketing system like the above-mentioned BVG or the S-Bahn Berlin GmbH have revenues, consisting of ticket sales and subsidies, of roughly a billion EUR each. If one of these companies somehow systematically counted 1% too few and the other one 1% too many passengers, and passenger counts accounted for only 10% of the shared revenue, two million EUR would still be distributed inappropriately, every year, for these two companies operating in the Berlin area alone. In Germany, it is prevalent that tickets are sold by the transport association and that revenue as well as subsidies are distributed among the transport companies (Beck, 2011), which accounted for 12.8 billion EUR in 2017 (VDV, 2018). Furthermore, such a revenue sharing scheme is itself associated with high costs, which were roughly 1 million EUR for the VBB in 2014 (Baum and Gaebler, 2015).
One industrial recommendation regarding APC systems is central to tendering procedures in Germany and also advertised by manufacturers worldwide (Hella Aglaia, 2018; iris, 2018a): the VDV 457 (Köhler et al., 2015), which regularly and in an unmodified form becomes part of transportation contracts, sometimes even in the latest, yet unreleased version that has "this is a pre-release" watermarked on every page. Due to the huge impact of the document, all change requests to the VDV 457 must be approved by a committee within the Association of German Transport Companies (Verband Deutscher Verkehrsunternehmen, VDV).
The results presented in this manuscript are organized as follows: In Section 2 we summarize and discuss the development in (automatic) passenger counting. In Section 3 we introduce the complete statistical model of APC measurements. In Section 4 we define and examine the revised t-test, which is an attempt to modify or extend the t-test so that it accounts for the type II error. There were two reasons for this approach: Firstly, the admission process based on the t-test was already established in the VDV 457, so we wanted to change as little as possible to make the impact of the changes foreseeable and manageable by decision makers. Secondly, being unopinionated was more likely to succeed than simply insisting on a statistical test because it was popular in other fields like biostatistics. Subsequently, we illustrate that this newly introduced test generally suffers from numerical instability, making the approach unsuitable for wide practical use. In Section 5 we introduce the equivalence test, and in Section 6 we normalize the test criteria of both the revised t-test and the equivalence test to see analytically that, after a transposition of parameters, the tests are identical. However, because only elementary calculations are made, the t-test-induced equivalence test obtained in this way is generally not susceptible to numerical instability. We close with some concluding remarks and future prospects in Section 7.

APC development and current practice
Traditionally, but also nowadays, passenger counts are collected manually via passenger surveys or human ride checkers, which are both expensive and produce only small samples. The former, the passenger surveys, are possibly biased and unreliable (Attanucci et al., 1981). For the latter, the manual counts by ride checkers, the accuracy is doubtful, since even the first-generation automatic counting systems have been regarded as more accurate (Hodges, 1985). Ride checking is often done by less qualified personnel with high turnover rates, and Furth et al. (2005) instead suggest the use of video cameras to increase accuracy and reliability for any evaluations. Today, automatic data collection (ADC) systems in public transport are classified into automatic vehicle location (AVL), automatic passenger counting (APC), and automatic fare collection (AFC) systems (Zhao et al., 2007). The AVL system provides data on the position and timetable adherence of the bus, metro, or train, which needs to be merged with APC data (Furth et al., 2004; Strathman et al., 2005; Saavedra et al., 2011). AFC data is based on ticket sales, magnetic strip cards, or smart cards and has become popular since it is often easily available (Zhao et al., 2007; Lee and Hickman, 2014). Still, it often only provides information on boarding but not alighting, generally underestimates actual passenger counts, and may therefore be less accurate than APC data (see e.g. Wilson and Nuzzolo, 2008; Chu, 2010; Xue et al., 2015).
The first generation of APC systems was deployed in the 1970s (Attanucci and Vozzolo, 1983) and usage increased in the following decades. Casey et al. (1998) reported that many local metropolitan transit agencies use APC systems, and Strathman et al. (2005) reported increase rates in APC usage of over 445% within 7 years. Today APC systems are used worldwide and have found their way into official documents, such as the above-mentioned tendering procedures in Germany. In the United States, transit agencies using APC data have to submit a benchmarking and a maintenance plan for reporting to the FTA's National Transit Database (NTD) to be eligible for related grant programs (see e.g. Chu, 2010).
A wide range of competing APC technologies has been developed. Detection methods include infrared light beam cells, passive infrared detectors, infrared cameras, stereoscopic video cameras, laser scanners, ultrasonic detectors, microwave radars, piezoelectric mats, switching mats and also electronic weighing equipment (EWE) (Casey et al., 1998; Kuutti, 2012; Kotz et al., 2015). Transit agencies usually mount one or multiple sensors to collect APC data in each door area of public transport vehicles like buses, trams and trains. The numbers of boarding and alighting passengers are counted separately by converting 3D video streams or by light barrier (infrared beam break) methods, which are the most commonly used technologies (Kotz et al., 2015). In recent years, a weight-based EWE approach that utilizes pressure measurements in the vehicle braking / air bag suspension system has also emerged to estimate passenger numbers (Nielsen et al., 2014; Kotz et al., 2015). This relatively new approach has proved to provide easy-to-acquire additional information, since modern buses and powertrains are equipped with (intelligent) pressure sensors by default.
First assessments of APC validity, i.e. accuracy, date back to the 1980s, when large-scale usage started in the United States and Canada (Hodges, 1985). To assess APC systems, several researchers used confidence intervals and tests for paired data to investigate whether any bias found is statistically significant. The test used most often is the t-test (Strathman, 1989; Kimpel et al., 2003; Köhler et al., 2015), but the nonparametric Wilcoxon test for paired data has also been applied to automated data (Kuutti, 2012). Handbooks for reporting to the FTA's National Transit Database have adopted the t-test, as have industrial recommendations for APC-buying transit agencies like Köhler et al. (2015). To our knowledge, no APC t-test related criterion formulated so far controls the type II error of the statistical test. Some authors report concepts that resemble equivalence testing. Furth et al. (2006) state "A less stringent test would allow a small degree of bias, say, 2% (partly in recognition that the 'true' count may itself contain errors); [...]", which acknowledges the fact that almost no measurement in the real world will have an expected value of exactly zero. In a survey among transit agencies by Boyle (2008) on how they ensure that APC systems meet a specified level of accuracy, it is reported that "Some [agencies] were more specific, for example, with a confidence level of 90% that the observations were within 10% of actual boardings and alightings.", which is an early occurrence of an equivalence test concept. Conversely, Chu (2010) introduced an "equivalency test" for APC benchmarking, which, however, is not to be confused with the equivalence test but rather is the application of the t-test objected to here to average passenger trip length. Additional adjustment factors on the raw APC counts are given without defining any equivalency criteria, an issue this paper shall address properly.
Various alternative criteria to the t-test also exist to assess accuracy or unbiasedness on the one hand and precision or reliability on the other hand. Nielsen et al. (2014) investigate absolute differences in addition to evaluating the bias when analysing a weight-based APC approach. Restrictions on the absolute deviation from zero also limit the variability of the APC system. Criteria specifically on the variance of the APC have been made indirectly through the error rate or, more specifically, through specifying the allowed distribution of errors (see e.g. criteria b and c in Köhler et al., 2015; Appendix E in Furth et al., 2003; Boyle, 2008). To the best of our knowledge, the most comprehensive and maintained industrial recommendation on APC validation and usage is the above-mentioned VDV Schrift 457 (Köhler et al., 2015). The document gives guidance on most relevant APC topics, including sampling and standardization of APC validation. One major aspect of APC validation is the demonstration of adequate APC accuracy, regarding which Köhler et al. (2015) state for the approval process: for an APC system to pass the admission process, its systematic error has to be at most 1%, which is verified by a t-test. However, scepticism within the industry arose when seemingly well-performing APC systems started to fail the test. In February 2015, with the help of a brute force algorithm, we constructed a proof of concept for a failed (APC) t-test in which the error is almost zero: with 1000 (or arbitrarily many more) boarding passengers, the sample has three measurements, one with an error of one passenger, the other two with an error of two passengers. In that case, the APC system fails the t-test.
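The proof of concept can be retraced numerically. The following is a minimal sketch, not the brute force algorithm mentioned above: it assumes the relative differences and the normal-quantile test criterion of the statistical model in Section 3, and the split of the 1000 boarding passengers over the three measurements is a hypothetical choice that does not influence the test statistic.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

# Three door stop events with 1000 boarding passengers in total.
# The split into 333/333/334 is hypothetical; only the errors are from the text.
manual = [333, 333, 334]            # manual counts M_i
errors = [1, 2, 2]                  # counting errors K_i - M_i
automatic = [m + e for m, e in zip(manual, errors)]

# Relative differences: counting error relative to the average manual count.
m_bar = mean(manual)
rel_diff = [(k - m) / m_bar for k, m in zip(automatic, manual)]

d_bar = mean(rel_diff)              # average relative difference
nu_hat = stdev(rel_diff)            # empirical standard deviation
n = len(rel_diff)

alpha = 0.05                        # type I error of the t-test
z = NormalDist().inv_cdf(1 - alpha / 2)
statistic = abs(d_bar) / (nu_hat / sqrt(n))
passes_t_test = statistic <= z

print(f"mean relative error: {d_bar:.2%}")
print(f"test statistic: {statistic:.2f}, critical value: {z:.2f}")
print("t-test passed" if passes_t_test else "t-test failed")
```

The mean relative error is only 0.5%, yet the errors vary so little that the test statistic exceeds the critical value. Using the Student quantile with two degrees of freedom (about 4.30 at the 5% level) instead of the normal quantile leads to the same decision.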
Such a proof of concept led the count precision work group (Arbeitsgemeinschaft Zählgenauigkeit) of the VDV to add the equivalence test with additional restrictions as an exceptional alternative test alongside the t-test in the VDV 457 v2.0 release in June 2015 (Köhler et al., 2015) to account for APC systems with a low error standard deviation. Indeed, the above-mentioned proof of concept would now be accepted by the new, hybrid test, but, as it turned out later, current or near-future APC systems would not profit, since with the chosen parameters the test was too hard to pass. Further, a remark was added to the VDV 457 that "in the advent of technological advance and increased counting precision, the admission process is still subject to change": at that time, there was still little insight into why a seemingly suitable and popular statistical test exhibited such seemingly arbitrary behaviour, and it was not entirely clear how the equivalence test compared to the long-established t-test.
Detailed investigations showed that the VDV 457 t-test variant only accounted for the type I error, defined to be 5% to 10%, which is the risk for an APC system manufacturer of failing the test with a system that has a systematic error of zero. In the t-test terminology, this parameter is known as the statistical significance α. Conversely, the type II error β is the risk of an APC system with a systematic error greater than 1% obtaining admission, which is the complement of the statistical power 1 − β. The type II error, and thus the statistical power, was accounted for neither in the sample size planning nor in the testing procedure. Through the sample size formula it was implicitly 50%, assuming the a priori estimated standard deviation was correct. Otherwise, the higher the empirical standard deviation, the greater the type II error, and vice versa. The statistical framework for APC validation and methods to address the current shortcomings are given in the following sections.

Statistical model
Let $\Omega_0 = \{\omega_i\}$, $i = 1, \ldots, N$, be the statistical population of door stop events (DSE; referred to as Haltestellentürereignis, HTE, in VDV 457), which are used to summarize all boarding and alighting passengers at a single door during a vehicle (bus, tram, train) stop. Further, let $\Omega = \{\omega_{i_j}\}$, $i_j \in \{1, \ldots, N\}$, $j \in \{1, \ldots, n\}$, be a sample, which consists of either randomly or structurally selected DSE (e.g. by a given sampling plan). The size of the statistical population $N$ may be considered as the number of all DSE over the relevant time period, which is typically one or more years, so $N$ can be assumed to be unbounded and thus $N = \infty$. Let $n$ be the sample size and, for $i \in \{1, \ldots, n\}$, let $M_i$ be the manual count and $K_i$ the automatic count of boarding passengers made by the APC system. The manual count obtained by multiple ride checkers or, preferably, video camera information (Kimpel et al., 2003) is assumed to be a ground truth to compare against. Alighting passengers are counted as well and results apply analogously, but w.l.o.g. we only consider boarding passengers. We define
$$D_i := \frac{K_i - M_i}{\overline{M}}, \qquad \overline{M} := \frac{1}{n} \sum_{i=1}^{n} M_i, \tag{1}$$
which we call relative differences, being the difference of the automatically and manually counted boarding passengers relative to the average of the manually counted boarding passengers. The average $\overline{D} := \frac{1}{n} \sum_{i=1}^{n} D_i$ is the statistic of interest that is used in both the t-tests and the equivalence test. The expected value $\mu := E(\overline{D})$ is the actual systematic error (referred to as Verzerrtheit in VDV 457, which means distortedness, as e.g. in market distortion) of an APC system (Furth et al., 2005), since it can systematically discriminate against participants of the revenue sharing system. It could also be referred to as the bias of the measurement device, a term frequently used in APC accuracy evaluations (Strathman, 1989; Kimpel et al., 2003; Furth et al., 2005; Chu, 2010; Nielsen et al., 2014).
Criteria in any APC approval process are often checked by specially trusted authorities authorized to grant admission. They perform their own manual ride check, evaluate the criteria, i.e. the statistical test, and either approve or reject the APC system. There are two conflicting interests that need to be dealt with: acquiring maximally accurate and reliable data on the one hand and approving a high number of APC systems in a fast and cost-efficient process on the other hand. We call a shortcoming of the former the calibration or user risk and a shortcoming of the latter the manufacturer risk. We attribute the user risk to public transportation companies and network authorities, who rely on accurate data, although the motivations for APC data collection might be more complex in the real world. The manufacturer risk relates directly to possible recourse claims and negative market reputation if the APC system fails the admission. The two risks correspond to the type I error and the type II error of statistical tests. For the t-test, the hypotheses are
$$H_0: \text{There is no systematic APC measurement error } (\mu = 0) \tag{2}$$
$$H_1: \text{There is a systematic APC measurement error } (\mu \neq 0). \tag{3}$$
Let $\nu$ be the a priori estimated standard deviation, $\hat{\nu}$ the empirical standard deviation of the sample, $d_r$ the maximal allowed error (e.g. 1%), $\alpha_t$ the risk of falsely rejecting the null hypothesis $H_0$ (type I error, i.e. rejecting an APC system with an actual systematic error of zero) and $\beta_t$ the risk of falsely accepting the null hypothesis $H_0$ when a particular value of the alternative hypothesis $H_1$ is true (type II error, e.g. accepting an APC system with a systematic error of 1%) (Guthrie, 2010).
The sample size estimation for the t-test is given by
$$n_t = \frac{\left(z_{1-\alpha_t/2} + z_{1-\beta_t}\right)^2 \nu^2}{d_r^2}, \tag{4}$$
with $z_{(\cdot)}$ being the quantile function of the normal distribution, and the test criterion, under which $H_0$ is retained and the APC system passes, as
$$|\overline{D}| \le z_{1-\alpha_t/2} \, \frac{\hat{\nu}}{\sqrt{n_t}}. \tag{5}$$
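The sample size planning and the test criterion of the t-test can be sketched as follows. This is a minimal illustration under stated assumptions: it takes the sample size to be $n_t = (z_{1-\alpha_t/2} + z_{1-\beta_t})^2 \nu^2 / d_r^2$ and the criterion to be $|\overline{D}| \le z_{1-\alpha_t/2}\,\hat{\nu}/\sqrt{n_t}$, it treats the empirical standard deviation as equal to the true one when computing the chance to pass, and all function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal distribution


def t_test_sample_size(nu, d_r, alpha_t, beta_t):
    """Planned (unrounded) sample size for the a priori standard deviation nu."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha_t / 2)
    z_beta = STD_NORMAL.inv_cdf(1 - beta_t)
    return (z_alpha + z_beta) ** 2 * nu ** 2 / d_r ** 2


def t_test_passes(d_bar, nu_hat, n, alpha_t):
    """True if the average relative difference is not significantly non-zero."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha_t / 2)
    return abs(d_bar) <= z_alpha * nu_hat / sqrt(n)


def t_test_pass_probability(mu, sigma, n, alpha_t):
    """Chance to pass if D_bar ~ N(mu, sigma^2 / n) and nu_hat equals sigma."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha_t / 2)
    shift = mu * sqrt(n) / sigma
    return STD_NORMAL.cdf(z_alpha - shift) - STD_NORMAL.cdf(-z_alpha - shift)


# Sample size planned without a power requirement: beta_t = 50 % gives z = 0.
n = t_test_sample_size(nu=0.15, d_r=0.01, alpha_t=0.05, beta_t=0.5)

# Type II error for a system whose systematic error equals d_r = 1 %.
beta_planned = t_test_pass_probability(mu=0.01, sigma=0.15, n=n, alpha_t=0.05)
beta_noisy = t_test_pass_probability(mu=0.01, sigma=0.30, n=n, alpha_t=0.05)

print(f"planned sample size: {n:.1f}")
print(f"type II error, standard deviation as planned: {beta_planned:.3f}")
print(f"type II error, standard deviation doubled:    {beta_noisy:.3f}")
```

With a sample size planned in this way, a system with a systematic error of exactly 1% passes in about half of all cases, and it passes more often the noisier it is. This is the behaviour of the type II error described at the end of Section 2.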

Revised t-test
Several discussions about post-hoc power adaptations for the t-test exist. A thorough discussion can be found in Hoenig and Heisey (2001). They argue that approaches referred to as Observed Power, Detectable Effect Size, or Biologically Significant Effect Size are "flawed". For the latter approach, which is described in Cohen (1988), Hoenig and Heisey criticize the assumption that the actual power is equal to the intended power and is not updated according to experimental results (e.g. sampling variability). Addressing this, we investigate procedures to control the (actual) type II error in order to assess the absence of a crucial difference. Schuirmann (1987) initially referred to approaches that use a negative hypothesis test to infer that no inequivalence is present as the Power Approach.
Analogously to these thoughts, we will consider variations of the type I error α to make adaptations to the testing procedure and call this approach post-hoc power calculation. This was explicitly mentioned by Schuirmann but not pursued further, citing a lack of practical relevance: "In the case of the power approach, it is of course possible to carry out the test of the hypothesis of no difference at a level other than 0.05 and / or to require an estimated power other than 0.80, but this is virtually never done." While this approach may not have been used in the world of pharmaceutics, it is of relevance for the validation of devices for automatic passenger counting and likely for other applications in industrial statistics. In general, and in practice as well, the a priori estimated standard deviation $\nu$ and the empirical standard deviation $\hat{\nu}$ differ to some extent after the data collection, and we strongly believe that this difference cannot be relied upon to be negligible. Therefore, we want to ensure that the risk of the user (the type II error) does not exceed a prespecified level, which is usually 5%. So the only parameter to be adapted is the type I error α, which is the risk of the device manufacturer. The appropriate $\tilde{\alpha}_t$, the revised significance, can thus be determined by solving the equation
$$n_t = \frac{\left(z_{1-\tilde{\alpha}_t/2} + z_{1-\beta_t}\right)^2 \hat{\nu}^2}{d_r^2} \tag{6}$$
and thus
$$\tilde{\alpha}_t = 2 \left( 1 - \Phi\!\left( \frac{\sqrt{n_t}\, d_r}{\hat{\nu}} - z_{1-\beta_t} \right) \right), \tag{7}$$
where $\Phi$ denotes the distribution function of the standard normal distribution, i.e. the inverse of the quantile function $z_{(\cdot)}$. Note that $n_t$, by choice of the initial $\alpha_t$ and $\nu$, is fixed. If the actual sample size does not match the planned sample size, $n_t$ resp. $\nu$ need to be adapted. Analogously to equation (5), we define the test criterion for the revised t-test:
$$|\overline{D}| \le z_{1-\tilde{\alpha}_t/2} \, \frac{\hat{\nu}}{\sqrt{n_t}}, \tag{8}$$
which, however, can yield problematic behaviour in practice: First, the term $z_{1-\tilde{\alpha}_t/2}$ is undefined for $\tilde{\alpha}_t > 2$. Combined with equation (6), this induces a lower bound on $n_t$:
$$n_t \ge \frac{z_{1-\beta_t}^2 \, \hat{\nu}^2}{d_r^2}. \tag{9}$$
Second, $z(1) = \infty$ is the source of a different problem: if $\nu/\hat{\nu}$ reaches a certain critical value, which is 2.76 for $\alpha_t = \beta_t = 5\%$, rounding errors yield $z_{1-\tilde{\alpha}_t/2} = \infty$. This could therefore be relevant in practice.
In that case, the test criterion from equation (8) is always true and the test is thus always passed as illustrated in Fig. 1.
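The instability can be reproduced with double precision arithmetic. The sketch below is an illustration under assumptions: it takes the revised significance to be obtained by solving the sample size formula of the t-test for α with the empirical standard deviation inserted, it uses Python's `statistics.NormalDist` instead of the R functions used for Fig. 1, and the function names are hypothetical. Python's `inv_cdf` raises an error for a probability of exactly 1, whereas R's `qnorm(1)` returns `Inf`, which is what makes the criterion always true there.

```python
from math import sqrt
from statistics import NormalDist, StatisticsError

STD_NORMAL = NormalDist()


def planned_n(nu, d_r, alpha_t, beta_t):
    """Planned (unrounded) sample size for the a priori standard deviation nu."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha_t / 2)
    z_beta = STD_NORMAL.inv_cdf(1 - beta_t)
    return (z_alpha + z_beta) ** 2 * nu ** 2 / d_r ** 2


def revised_alpha(n_t, nu_hat, d_r, beta_t):
    """Revised significance for the empirical standard deviation nu_hat."""
    z_beta = STD_NORMAL.inv_cdf(1 - beta_t)
    return 2 * (1 - STD_NORMAL.cdf(sqrt(n_t) * d_r / nu_hat - z_beta))


def revised_critical_value(n_t, nu_hat, d_r, beta_t):
    """Bound on |D_bar| via the quantile function; None if it is not finite."""
    alpha = revised_alpha(n_t, nu_hat, d_r, beta_t)
    try:
        z = STD_NORMAL.inv_cdf(1 - alpha / 2)
    except StatisticsError:  # inv_cdf rejects probabilities of exactly 0 or 1
        return None
    return z * nu_hat / sqrt(n_t)


NU_HAT, D_R, ALPHA_T, BETA_T = 0.15, 0.01, 0.05, 0.05

for nu in (0.15, 0.30, 0.60):  # a priori estimates, ratio nu / nu_hat = 1, 2, 4
    n_t = planned_n(nu, D_R, ALPHA_T, BETA_T)
    alpha = revised_alpha(n_t, NU_HAT, D_R, BETA_T)
    bound = revised_critical_value(n_t, NU_HAT, D_R, BETA_T)
    print(f"nu/nu_hat = {nu / NU_HAT:.0f}: revised alpha = {alpha:.3g}, bound = {bound}")
```

For a ratio of 4 the revised significance is rounded to exactly zero, so the quantile is no longer finite, although the bound itself would be a perfectly ordinary number if it were computed without the detour through the quantile function.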

Equivalence testing
Equivalence testing originated in the field of biostatistics. Often the term bioequivalence testing (e.g. Schuirmann, 1987; Berger and Hsu, 1996; Wellek, 2010) is used. Bioequivalence tests are statistical tools that are commonly used to compare the performance of generic drugs with established drugs using several commonly accepted metrics of drug efficacy. The term equivalence between groups means that differences are within certain bounds, as opposed to complete equality. These bounds are application-specific and are usually to be chosen such that they are below any potential (clinically) relevant effect (Ennis and Ennis, 2010). In many publications the problem was referred to as the two one-sided tests (TOST) procedure (Schuirmann, 1987; Westlake, 1981). TOST procedures were developed under various parametric assumptions, and distribution-free approaches exist in addition (Wellek and Hampel, 1999; Zhou et al., 2004).
Equivalence tests have begun making their way into psychological research (see e.g. Rogers et al., 1993) and natural sciences: Hatch (1996) applied it for testing in clinical biofeedback research. Parkhurst (2001) discussed the lack of usage of equivalence testing in biology studies and stated that equivalence tests improve the logic of significance testing when demonstrating similarity is important. Richter and Richter (2002) used equivalence testing in industrial applications and gave instructions on how to easily calculate it with basic spreadsheet computer programs.
Applications also involved risk assessment (Newman and Strojan, 1998), plant pathology (Garrett, 1997; Robinson et al., 2005), ecological modelling (Robinson and Froese, 2004), analytical chemistry (Limentani et al., 2005), pharmaceutical engineering (Schepers and Wätzig, 2005), sensory and consumer research (Bi, 2005), assessment of measurement agreement (Yi et al., 2008), sports sciences (Vanhelst et al., 2009), applications to microarray data (Qiu and Cui, 2010), genetic epidemiology in the assessment of allelic associations (Gourraud, 2011), and geography (Waldhoer and Heinzl, 2011).
Figure 1: Chances (of an APC system) to pass dependent on the method to determine the revised significance $\tilde{\alpha}_t$, which we have obtained from equation (7), denoted by dashed lines, and by using the function power.t.test from R Core Team (2017), denoted by the solid lines. The latter was our initial approach, which we illustrate here for the purpose of completeness. We notice that for far too low sample sizes (red and dark red lines) the former method is stricter. For the dark red dashed line, $n$ is below the lower bound from equation (9) and we thus assume the test to always fail, compare Section 6. For (too) large sample sizes, numerical instabilities, which cannot be detected by the user through e.g. error messages, lead to sudden gaps in the function values (green lines for $\nu = 0.322$ and $\nu = 0.323$) for the power.t.test variant, which relies on fixed-point iteration and has a history of unexpected convergence behaviour. The blue lines denote a practically reachable numeric limit when using equation (7): starting with $\nu \ge 0.414$ resp. $\nu/\hat{\nu} \ge 2.76$, all systems are always accepted. For the power.t.test approach, the light blue line is slightly above the dark blue one, so it has numerical problems for these values, too. Generally, all values of $\nu$ have to be seen w.r.t. the ratio $\nu/\hat{\nu}$, since these numerical effects can already occur with much smaller sample sizes.
We define $\Delta$ to be the equivalence margin and denote the relevant errors of the equivalence test by $\alpha_e$, referring to (half) the risk of the user, and $\beta_e$, referring to the risk of the device manufacturer. We will consider two-sided $1 - 2\alpha_e$ confidence intervals (symmetric around the mean), where $\alpha_e$ is often chosen to be 2.5%. The usage of $1 - \alpha_e$ confidence intervals is also possible (see e.g. Westlake, 1981) but is used less frequently in the recent literature on this topic. Note that, by definition, the meanings of α and β are interchanged between the t-test and the equivalence criterion in referring to the risk of the user and to the risk of the manufacturer.
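For the equivalence test, a sample size estimation exists (see e.g. Liu and Chow, 1992; Julious, 2004) that is similar to equation (4) of the t-test:
$$n_e = \frac{\left(z_{1-\alpha_e} + z_{1-\beta_e/2}\right)^2 \nu^2}{\Delta^2}. \tag{12}$$
We define the test criterion for the equivalence test as
$$|\overline{D}| + z_{1-\alpha_e} \, \frac{\hat{\nu}}{\sqrt{n_e}} \le \Delta, \tag{13}$$
which is a formulation of the Two One-Sided Test Procedure for the crossover design in the case of limits that are symmetrical around zero (see Schuirmann, 1987).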
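A minimal sketch of the equivalence test may help to see how few steps it involves. It assumes the sample size $n_e = (z_{1-\alpha_e} + z_{1-\beta_e/2})^2 \nu^2 / \Delta^2$ and the criterion that the two-sided $1 - 2\alpha_e$ confidence interval around the average relative difference lies within $[-\Delta, \Delta]$, with normal quantiles; the function names and example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def equivalence_sample_size(nu, delta, alpha_e, beta_e):
    """Planned (unrounded) sample size for the a priori standard deviation nu."""
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha_e)
    z_beta = STD_NORMAL.inv_cdf(1 - beta_e / 2)
    return (z_alpha + z_beta) ** 2 * nu ** 2 / delta ** 2


def equivalence_test_passes(d_bar, nu_hat, n, delta, alpha_e):
    """True if the 1 - 2*alpha_e confidence interval lies within [-delta, delta]."""
    half_width = STD_NORMAL.inv_cdf(1 - alpha_e) * nu_hat / sqrt(n)
    return abs(d_bar) + half_width <= delta


n_e = equivalence_sample_size(nu=0.15, delta=0.01, alpha_e=0.05, beta_e=0.05)
print(f"planned sample size: {n_e:.1f}")

# A precise system with a small bias: interval well inside the margin.
print(equivalence_test_passes(d_bar=0.005, nu_hat=0.0017, n=3,
                              delta=0.01, alpha_e=0.05))

# A noisy system with a bias near the margin: interval exceeds the margin.
print(equivalence_test_passes(d_bar=0.009, nu_hat=0.15, n=1000,
                              delta=0.01, alpha_e=0.05))
```

The first call uses numbers close to the three-measurement proof of concept from Section 2. With such a small sample the normal quantile is only a rough approximation, so the result should be read as an illustration of the criterion rather than as an admission decision.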

t-test-induced equivalence test
An approach to compare the revised t-test from Section 4 to the equivalence test from Section 5 is to normalize and compare their test criteria from equations (8) resp. (13) as well as their sample size formulas (4) resp. (12). By combining equations (4), (6) and (8), we obtain a normalized test condition for the revised t-test:
$$\frac{|\overline{D}|}{d_r} \le 1 - \frac{z_{1-\beta_t}}{z_{1-\alpha_t/2} + z_{1-\beta_t}} \cdot \frac{\hat{\nu}}{\nu}. \tag{14}$$
Using equations (12) and (13) we obtain
$$\frac{|\overline{D}|}{\Delta} \le 1 - \frac{z_{1-\alpha_e}}{z_{1-\alpha_e} + z_{1-\beta_e/2}} \cdot \frac{\hat{\nu}}{\nu}, \tag{15}$$
which resembles equation (14). If we now choose
$$\beta_e := \alpha_t, \qquad \alpha_e := \beta_t, \qquad \Delta := d_r \tag{16}$$
for equations (12) and (15), they are identical to equations (4) and (14). Therefore, the revised t-test analytically is an equivalence test, with error types swapped and an extended domain: since only elementary calculations are made and there is no need to handle a varying quantile function $z_{(\cdot)}$, there is neither a lower bound as in equation (9), nor an upper bound due to numeric instability as illustrated in Fig. 1. We call an equivalence test with parameters chosen as in equation (16) a t-test-induced equivalence test.
Figure 2: Chances of an APC system with both a sample and an actual error standard deviation of $\hat{\nu} = 0.15$ to pass the t-test (a) or the revised t-test resp. equivalence test (b) over the actual systematic error. Different lines denote different sample sizes obtained from different a priori estimated error standard deviations $\nu$. The golden, solid curve always represents a correctly estimated sample size, the green curve a sample which is too large and the reddish curves samples which are too small. The dashes in the dark red line in (b) denote the consequences of equation (9): only for the equivalence test, the outcome is defined for equation (13), and the test may be considered as always failed if $n = 320 < z_{1-\beta_t}^2 \hat{\nu}^2 / d_r^2 = 440.99$. Using the original VDV 457 with an (implicit) power of 50% yields the dashed golden curve in (a). The vertically striped areas are additionally correctly accepted, the horizontally striped areas are additionally incorrectly accepted. The thick blue lines denote the relative error of $d_r = 1\%$. For comparison: in (c) the incorrect decisions of a reference test are red, the correct decisions are coloured cyan. The red areas exist because of economic considerations to limit the test costs: further increasing the sample size towards infinity would make the red areas disappear, at least for the revised t-test resp. the equivalence test. For the t-test, the areas with systematic error $\mu > 1\%$ and $\mu < -1\%$ remain blue, but the inner area turns red. This behaviour is counterintuitive to the idea that the error of a statistical test goes to zero as the sample size goes to infinity.
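The relation between the two tests can also be checked numerically. The sketch below assumes that the revised t-test bounds $|\overline{D}|$ by $z_{1-\tilde{\alpha}_t/2}\,\hat{\nu}/\sqrt{n}$ and that the induced equivalence test bounds it by $d_r - z_{1-\beta_t}\,\hat{\nu}/\sqrt{n}$; it further treats the empirical standard deviation as equal to the true one. Function names and example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def revised_t_threshold(n, nu_hat, d_r, beta_t):
    """Bound on |D_bar| via the revised significance and the quantile function."""
    z_beta = STD_NORMAL.inv_cdf(1 - beta_t)
    alpha_rev = 2 * (1 - STD_NORMAL.cdf(sqrt(n) * d_r / nu_hat - z_beta))
    return STD_NORMAL.inv_cdf(1 - alpha_rev / 2) * nu_hat / sqrt(n)


def induced_equivalence_threshold(n, nu_hat, d_r, beta_t):
    """The same bound computed with elementary operations only."""
    return d_r - STD_NORMAL.inv_cdf(1 - beta_t) * nu_hat / sqrt(n)


def t_test_threshold(n, nu_hat, alpha_t):
    """Bound on |D_bar| for the ordinary t-test."""
    return STD_NORMAL.inv_cdf(1 - alpha_t / 2) * nu_hat / sqrt(n)


def pass_probability(threshold, mu, sigma, n):
    """P(|D_bar| <= threshold) for D_bar ~ N(mu, sigma^2 / n)."""
    if threshold < 0:
        return 0.0  # a negative bound can never be met
    se = sigma / sqrt(n)
    upper = STD_NORMAL.cdf((threshold - mu) / se)
    lower = STD_NORMAL.cdf((-threshold - mu) / se)
    return upper - lower


SIGMA, D_R, MU = 0.15, 0.01, 0.005  # true systematic error of 0.5 % < d_r

# For a moderate sample size both formulations give the same bound.
gap = abs(revised_t_threshold(1000, SIGMA, D_R, 0.05)
          - induced_equivalence_threshold(1000, SIGMA, D_R, 0.05))
print(f"difference between the two bounds at n = 1000: {gap:.2e}")

# Chance to pass for growing sample sizes.
t_pass, eq_pass = [], []
for n in (1_000, 10_000, 100_000):
    t_pass.append(pass_probability(t_test_threshold(n, SIGMA, 0.05), MU, SIGMA, n))
    eq_pass.append(pass_probability(
        induced_equivalence_threshold(n, SIGMA, D_R, 0.05), MU, SIGMA, n))
    print(f"n = {n:>7}: t-test {t_pass[-1]:.3f}, equivalence test {eq_pass[-1]:.3f}")
```

For a system whose systematic error is within the allowed 1%, the chance to pass the t-test shrinks towards zero as the sample grows, while the chance to pass the induced equivalence test approaches one. The closed form remains usable for the largest sample size, where the detour through the revised significance would no longer yield a finite quantile.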

Conclusion
We illustrated that the t-test as a criterion for APC approval may exhibit undesirable properties, even as the sample size grows beyond a certain level. Further, we have shown that when trying to compensate for this behaviour by using post-hoc power calculations with a revised t-test, issues of numeric stability and domain limitations arise. Finally, we have proven analytically that the t-test-induced equivalence test, being numerically stable with a practically unlimited domain, can supersede the revised t-test.
The equivalence test is popular in various fields and, from a user's perspective, easier to apply than post-hoc power calculations. Our results thus not only apply to APC systems: every use of the t-test can now comfortably be reconsidered and, on demand, be replaced by the t-test-induced equivalence test.
Our work simplifies the decision-making process considerably, especially when it affects the worldwide revenue sharing in public transport, where 243 billion public transport journeys were made in 2015 alone (UITP, 2017). For this reason, a large German public transportation company, which was significantly involved in the creation of the original, t-test based recommendation, commissioned an additional, complementary expert report, which eventually confirmed our findings. A change proposal for the VDV 457 was made and its acceptance is soon to come, with the t-test being replaced as the APC validation criterion in 2019. The (t-test-induced) equivalence test is thus in the course of becoming the new recommendation.
Finally, we hope that long-lasting arguments within the industry about seemingly arbitrary admission results will now end, and that our work will enable a broader audience to understand and profit from equivalence testing.