In behavioral science, researchers often deal with continuous data—for example, when comparing kinematic profiles of different movements or of the same movement under different conditions. Differences between such profiles typically extend over a series of consecutive data points and are thus not limited to a single point. Although the general logic of statistically detecting differences between two streams of continuous data is similar to the analysis of single-point data, the concrete methodological steps of the analysis differ in crucial ways. The present article describes and discusses three existing methods for analyzing continuous data. We then contrast these methods with a new method introduced here.

First, we need to elaborate the differences between single-point and continuous data and why it is important to differentiate them in terms of data analysis. In single-point data analysis, the attribute Φsingle is represented by one single data point Y. This is the case when measuring, for example, reaction times, jumping heights, or systolic blood pressure. In contrast to features that can be represented by a single value, other attributes Φmultiple can only be described by a number of sequential or parallel measurements of the observed quantity, producing multiple data points Y(1…n), where n is always greater than 1. Examples of such attributes are the course of the knee angle during the gait cycle or the neural activity associated with processing a certain stimulus (e.g., measured with electroencephalography [EEG] recordings). According to this logic, any multiple data recording stored in a single vector could be considered continuous data (see Footnote 1). However, the methods discussed in the following paragraphs require that the consecutive data points be arranged in an ordered sequence in which each data index refers to the same subfeature of Φmultiple. Together, the ordered data vector describes a measured feature in the world. In an experiment, the manifestation of Φsingle or Φmultiple is usually measured and analyzed in different conditions—that is, for subsamples with different values of an independent variable. Thus, when measuring Φsingle for each subsample, the data set (DS) will consist of k measurements of Y values, where k is the number of recordings: \( {DS}_{single}=\left\{{Y}_1,\dots, {Y}_k\right\} \). When testing Φmultiple, however, the data set will contain k recordings of \( {Y}_{\left(1\dots n\right)} \): \( {DS}_{multiple}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \). In this regard, it is important that the same index j ∈ {1, …, n} of Y(1…n) represent the same aspect of Φmultiple in all k recordings. For example, when capturing the time course of joint angles during walking, the resulting data sets need to be time-locked with respect to appropriate (e.g., kinematic) landmarks of the gait cycle. Analogously, EEG signals are often synchronized with respect to specific trigger events, such as the appearance of a stimulus or the onset of a response, which are located at the same index j in all Y(1…n). Thus, the data must be synchronized and normalized in time before further analyses can be conducted.

Despite the different characteristics of single (Y) and continuous (Y(1…n)) data, the logical sequences of the data analysis are closely related for both types. The initial step in analyzing both single and continuous data is to calculate descriptive parameters (e.g., the mean μ and standard deviation σ) that provide information about the distribution and variation of the measured data. For continuous data, the internal dependence of the data within Y(1…n) is an additional parameter that must be taken into account. It is typically quantified by a correlation coefficient ρ, with 0 ≤ |ρ| ≤ 1.
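To make these definitions concrete, the following minimal Python/NumPy sketch illustrates the two data types and their descriptive parameters. All array names, sizes, and values are hypothetical, and the lag-1 autocorrelation is only one rough proxy for ρ:

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-point data: k recordings, one value Y each (e.g., reaction times)
k = 12
ds_single = rng.normal(0.45, 0.05, size=k)        # shape (k,)

# Continuous data: k recordings of n time-locked samples Y_(1...n) each;
# index j must refer to the same subfeature in every recording
n = 500
ds_multiple = rng.normal(0.0, 1.0, size=(k, n))   # shape (k, n)

# Descriptive parameters at every index j
mu = ds_multiple.mean(axis=0)                      # pointwise mean
sigma = ds_multiple.std(axis=0, ddof=1)            # pointwise SD

# One rough proxy for the internal dependence rho: lag-1 autocorrelation
rho = np.corrcoef(ds_multiple[:, :-1].ravel(),
                  ds_multiple[:, 1:].ravel())[0, 1]
```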

To calculate μ, σ, and ρ, one must consider how much of the data can be used for a reliable estimation. As mentioned before, a natural characteristic of continuous data is that subsequent data points are sampled continuously during data acquisition. The data segments that represent one trial (or one gait cycle, one response to a stimulus, etc.) are defined by start–stop signals or synchronization triggers. The corresponding time window containing these data points can be termed the sampling window (Fig. 1, red). However, using all the data within the sampling window (i.e., all data points of one recording) to calculate the descriptive parameters is not always appropriate, because the sampling window may include data captured at a time when the hypothesized effect of the independent variable(s) is not (yet) observable. A specific time window should be defined to focus the analysis on the relevant part of the data set in which the expected effect will occur. The data within this time window, which we will refer to as the effect window (Fig. 1, dashed lines), can then be analyzed separately. The effect window is defined a priori on the basis of preliminary data or findings reported in the relevant literature. Note that this necessity to define a priori which part of the data to look at is not a particular difficulty of the analyses described here: the deliberate choice of any dependent variable requires the same type of considerations.
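In code, restricting the analysis to the effect window amounts to masking the indices of the sampling window. A sketch under assumed values (the time axis, sampling rate, and window bounds are illustrative only):

```python
import numpy as np

# Hypothetical sampling window: one sample per ms, -200 to 700 ms
# around the synchronization trigger
t_ms = np.arange(-200, 700)

# A-priori effect window, e.g., 200-350 ms after the trigger
effect = (t_ms >= 200) & (t_ms <= 350)        # boolean mask over indices j

rng = np.random.default_rng(1)
data = rng.normal(size=(10, t_ms.size))       # k = 10 synchronized curves
data_effect = data[:, effect]                 # analysis restricted to window
```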

Fig. 1

Two types of time windows are important when analyzing continuous data. The sampling window (t1, …, tn) represents the whole data set, from the start until the end of the recording. The effect window (dashed) is the time range over which the hypothesized effect is expected. The effect window must be defined a priori by the researcher

With the help of the descriptive parameters, the data can be further analyzed. Typically, differences in Φ between the conditions are analyzed on the basis of the corresponding measurements Y by using inferential statistical methods. To this end, the probability (p value) with which the observed distribution of Y values would occur if the null hypothesis were true is calculated from the descriptive parameters μ, σ, and ρ (ρ only for continuous data). On the basis of this probability, a decision to reject or retain the null hypothesis can be made. In addition to hypothesis testing, a power analysis can be used to evaluate the experimental design and the explanatory power of the results.

As we mentioned above, the general logic of statistical hypothesis testing is similar for single and continuous data. A problem arises, however, when choosing adequate inferential statistical methods to estimate the distribution of values under the assumption that the null hypothesis is true. A wide range of statistical methods can be applied to single data points (e.g., Student's t test and analysis of variance), but only a few approaches actually tackle the problem of implementing the corresponding analysis steps for continuous data.

In the following paragraphs, we will characterize four statistical methods that can be applied to continuous data. Two of them adjust commonly used local statistical methods in order to apply them to continuous data. The other two are specifically designed to deal with the characteristics of continuous, as opposed to single-point, data (overview in Fig. 2).

Fig. 2

Overview of different methods to analyze continuous data (local methods, local Gaussian methods, and function-based resampling techniques). The point-based resampling technique is a new approach that will be discussed in this article

The nature of any local method is that single data points Y are analyzed. Accordingly, applying local methods to continuous data implies selecting a single data point at one specific moment in time, or in reference to a predetermined landmark, and thus reducing the problem to a single-data analysis. The major benefit of such an approach is that it allows the use of a wide range of existing inferential statistical methods (e.g., confidence intervals, p-value estimates for different distributions [t, F, χ², etc.], analysis of variance, and power analysis). However, there are critical reasons why simply applying local methods to continuous data is often not reasonable. First, there is a problem of validity, because a single data point is chosen to represent the effect of the independent variable in the continuous data. In some cases it is not easy to adequately select this representative data point, to identify its position in the data stream, or even to find a valid representative at all, meaning that some (truly) multiple feature Φmultiple of the data cannot adequately be described by a single value. Second, reliability may be reduced, since the value of a single-point measure can be affected by stochastic fluctuations. As a consequence, the tested effect might be over- or underestimated.

The local Gaussian method (Fig. 2) is an alternative that combines the statistical options of the local methods (e.g., the t test) with the inclusion of more than one data point in the analysis. More specifically, the local Gaussian method applies the same local test repeatedly at each index within the effect window. As a result, confidence intervals are calculated for each index, which together form a confidence band (Fig. 2, broken curves). Two major issues need to be taken into account when repeatedly applying methods designed for single data points to continuous data. First, every data point of the tested curve must belong to a specific symmetric parametric family (Gaussian distribution). This prerequisite must be assured before testing. However, the data often do not fulfill this requirement of a normal distribution at every index of the data set. As a consequence, the true coverage probability of the confidence bands is, in most cases, not even close to the desired nominal level. Second, this method does not consider the internal dependence ρ of the data points within a continuous data set. If, for instance, the internal dependence of the analyzed data is ρ = 1, any single value of Y1, …, Yn carries the same information, and the analysis is thus equivalent to the single-point case. If, on the other hand, ρ = 0, a Bonferroni correction could be applied. However, with an increasing number of data points included in the confidence band, the width of the band increases, so that down-sampling or up-sampling would severely affect the results. In most cases the local method would be incorrect and Bonferroni corrections would be extremely conservative, because the true internal dependence is, unbeknownst to the observer, located somewhere in between 0 and 1 (Lenhoff et al., 1999). Thus, if not every data point is normally distributed and/or the value of ρ is neither close to 0 nor close to 1, as is commonly the case, continuous data should not be analyzed using local Gaussian methods.
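For illustration, a minimal sketch of such a pointwise band (the function name and the Bonferroni switch are our own; the caveats above apply):

```python
import numpy as np
from scipy import stats

def pointwise_gaussian_band(curves, alpha=0.05, bonferroni=False):
    """Local Gaussian method: an independent t-based CI at every index j.
    Uncorrected bands implicitly assume rho = 1; the Bonferroni variant
    assumes rho = 0. Real data usually lie in between, so the true
    coverage can deviate badly from the nominal level."""
    k, n = curves.shape
    if bonferroni:
        alpha = alpha / n                  # treats all n indices as independent
    mean = curves.mean(axis=0)
    sem = curves.std(axis=0, ddof=1) / np.sqrt(k)
    crit = stats.t.ppf(1 - alpha / 2, df=k - 1)
    return mean - crit * sem, mean + crit * sem
```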

Alternatives to the local methods are methods that use simultaneous confidence bands to test deviations in continuous data curves. For instance, Lenhoff et al. (1999) applied a bootstrap procedure to construct simultaneous confidence bands for continuous gait data. The bootstrap is a nonparametric resampling technique, and hence a Gaussian distribution is not a prerequisite (Efron, 1979; Tukey, 1958). Moreover, it takes into account the inner dependency ρ of continuous data by describing the variance in the data as variation of the coefficients of the fitting function (SEFunc) and evaluating their variance–covariance matrix. Therefore, this method is suitable for computing simultaneous confidence intervals for several subsequent data points of continuous signals. An essential part of the method introduced by Lenhoff and colleagues is that they fit gait cycle curves with mathematical functions, so that the interdependence of the data points is reflected by these functions. With the help of a correction factor C, which is determined using a resampling technique (e.g., the bootstrap), their method can assure that the true coverage probability of the confidence bands comes close to the desired nominal level. Thus, this function-based resampling technique (FBRT), as we call it, is an advanced tool that can be applied even to data with asymmetric distributions. Although the FBRT seems to be an appropriate tool for analyzing continuous data in general, it has limitations that must be considered. One issue can arise from the high computational cost of the procedure. The ideal bootstrap sample consists of \( m^m \) subsamples (where m stands for the sample size). As m becomes large, it becomes infeasible to calculate all possible subsamples. However, an approximation can be used: our simulations show that the estimation variability of the correction factor C saturates with a reasonable number of calculations. A more critical limitation of the FBRT, as the authors described it, is that it cannot be applied to all kinds of continuous data. Lenhoff and colleagues fitted gait data with mathematical functions representing the data as closely as possible. However, recorded data cannot always be represented by adequate fits—for instance, when no function family exists that is capable of fitting all the curves (see Appendix 2). EEG data, for example, are characterized by rapid changes in voltage polarity that cannot reasonably be represented by a single mathematical function prototype. Another limitation of the FBRT, as the authors describe it, is that the C factor is calculated using the data from only one sample (mostly representing the data of a null/control condition). In a single-data-point analysis such as the t test, by contrast, the estimate of the variance parameter integrates the data from all measured samples, which should consequently lead to a more reliable estimation.
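The following is a loose Python/NumPy sketch of the FBRT's core idea, not the exact algorithm of Lenhoff et al. (1999): a polynomial stands in for whatever function family actually fits the data, and all names and defaults are our assumptions.

```python
import numpy as np

def fbrt_band(curves, t, degree=8, n_boot=400, alpha=0.05, seed=None):
    """Loose sketch of the FBRT: represent each curve by the coefficients
    of a fitted function (here a polynomial, standing in for a suitable
    family), then bootstrap the fitted curves to find the correction
    factor C for a simultaneous confidence band."""
    rng = np.random.default_rng(seed)
    k = curves.shape[0]
    # 1. Reduce every curve to fitting-function coefficients
    coefs = np.array([np.polyfit(t, y, degree) for y in curves])
    fitted = np.array([np.polyval(c, t) for c in coefs])
    mean = fitted.mean(axis=0)
    sem = fitted.std(axis=0, ddof=1) / np.sqrt(k)
    # 2. Bootstrap: maximal standardized distance of each pseudo-mean
    d_crit = np.empty(n_boot)
    for b in range(n_boot):
        sample = fitted[rng.integers(0, k, size=k)]
        d_crit[b] = np.max(np.abs(sample.mean(axis=0) - mean) / sem)
    # 3. C cuts off the upper alpha share of the d_crit distribution
    C = np.quantile(d_crit, 1 - alpha)
    return mean - C * sem, mean + C * sem, C
```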

In conclusion, all of the described methods for statistically analyzing continuous data have inherent problems that limit their scope of application. The FBRT of Lenhoff et al. (1999) seems to be the most appropriate and useful approach for continuous data that are not normally distributed and that can be represented by mathematical functions. For cases in which the data cannot be fitted, however, and are therefore not suited for the FBRT procedure, we developed a method combining a point-by-point variance estimate with a resampling technique in order to calculate confidence bands analogous to the FBRT. In addition, the new method calculates a C factor that integrates the data of all samples. We call this method the point-based resampling technique (PBRT); it will be explained in the following sections. For illustration purposes, we will demonstrate the application of this method using exemplary EEG data from an event-related potential (ERP) analysis. Although we demonstrate the PBRT in a between-groups comparison, the procedure can also be used with other continuous data and different experimental designs (e.g., within-subjects comparisons, as described in Appendix 3).

Method

Exemplary data

For our exemplary data, we recorded EEG signals while participants practiced a virtual throwing task. The specific EEG potential of interest was the so-called error-related negativity (ERN, Ne), which is characterized as a negative deflection that appears shortly after the onset of an erroneous motor response to an imperative stimulus. The measured activation voltage of the ERN is significantly more negative than the signal from correct motor responses in the same temporal interval (Falkenstein, Hohnsbein, Hoormann, & Blanke, 1991; Gehring, Goss, Coles, Meyer, & Donchin, 1993).

We recorded data from ten healthy participants. The participants' task was to throw a virtual ball presented on a projection screen around an obstacle in order to hit a target object on the other side of the obstacle (for details, see Müller & Sternad, 2004). They were instructed to hit the target as frequently as possible. EEG data were collected while participants executed the task. In the following sections, we will describe how the PBRT was applied to these exemplary data in order to test whether there was a difference in neural activation between hit \( {DS}_{hit}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \) and miss \( {DS}_{miss}=\left\{{X}_{\left(1\dots n\right)}^1,\dots, {X}_{\left(1\dots n\right)}^k\right\} \) trials. Further information about the data acquisition and processing can be found in Appendix 1.

Selecting an appropriate analysis method

To decide which method is most appropriate for the analysis of a given data set in the context of a particular hypothesis, the specific limitations and benefits of the methods need to be weighed. For our EEG example, local methods could be excluded from further consideration, since the inner dependency ρ was neither 0 nor 1 and thus could not be modeled adequately. When ρ < 1, a reduction of the continuous data (neural activation over time) to one single representative value (e.g., a peak or mean amplitude) goes along with a loss of information and is therefore limited. On the other hand, EEG data are also not completely independent from index to index (|ρ| > 0); hence, applying Bonferroni corrections is inadequate. The methods that adequately integrate all data points would be the FBRT and the PBRT. The application of the FBRT, however, poses the challenge of finding a function prototype whose parameters can be adjusted to represent each of the curves in the data set. Since every data curve can be represented as a sum of sine terms by applying a Fourier transform, one might think that this solves the problem. However, this is not the case, since the coefficients resulting from Fourier transforms of the different EEG curves cannot reasonably be used to estimate the variance within the data set. This can be demonstrated by applying Fourier transforms sequentially to all \( {Y}_{\left(1\dots n\right)}^k \) within \( {DS}_{hit}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \); an example highlighting this issue is shown in Appendix 2. In this case, and in a wide range of comparable situations, no valid fitting function can be found, and hence the FBRT cannot be applied to the data set. The PBRT, however, is still applicable. We will describe the implementation of the PBRT in the following sections using our exemplary EEG data. A more detailed and formalized description can be found in Appendix 3 and in the MATLAB script in the supplementary material.

Application of the point-based resampling technique (PBRT)

Preparation of the data before analysis

First, we synchronized the data—in our case, to the respective moments of ball release. Next, the data from hits and misses were baseline-corrected and cut into segments of equal length (i.e., all resulting segments contained an equal number of data points/indices). An average curve was then calculated for each data set, yielding \( {\overline{DS}}_{\left(1\dots n\right)}^{hit} \) and \( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \). Next, the temporal location and width of the effect window were determined on the basis of preliminary findings (e.g., Joch, Hegele, Maurer, Müller, & Maurer, 2017; Maurer, Maurer, & Müller, 2015). In our example, we set the effect window from 200 to 350 ms after ball release.
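A minimal sketch of this preparation step (the function name, array layout, and baseline range are assumptions for illustration):

```python
import numpy as np

def prepare(epochs, t_ms, baseline=(-100, 0)):
    """Baseline-correct time-locked, equal-length segments and average them.
    epochs: (k, n) array, one row per trial, synchronized so that t = 0
    is the moment of ball release; baseline is a hypothetical pre-release
    range used for the correction."""
    base = (t_ms >= baseline[0]) & (t_ms < baseline[1])
    corrected = epochs - epochs[:, base].mean(axis=1, keepdims=True)
    return corrected, corrected.mean(axis=0)   # segments and average curve
```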

Construction of confidence bands using the PBRT

In our example, the aim was to test whether the average curve of miss trials \( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \) was different from the average curve of hit trials \( {\overline{DS}}_{\left(1\dots n\right)}^{hit} \) within the effect window. To this end, a confidence band for the data points within the effect window was constructed on the basis of the data set that represents the null hypothesis (i.e., \( {DS}_{hit}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \)). To provide evidence that the effect is present in the miss trials, \( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \) should fall outside the confidence band (constructed on the basis of the hit trial data) within the effect window.

Construction of the confidence bands involved two steps: (1) estimation of the local distribution parameters and (2) determination of a factor C to correct for the inner dependencies in the curves.

  • Step 1 (Estimation of local distribution): The calculation of the average (see Footnote 2) curves (e.g., \( {\overline{DS}}_{\left(1\dots n\right)}^{hit} \) from \( {DS}_{hit}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \)) resulted in only one average curve per condition (hit and miss trials, respectively). To estimate the distribution parameters of these average curves \( \left({SE}_{{\overline{DS}}_{\left(1\dots n\right)}^{hit}}\right) \), we simulated pseudosamples using the bootstrap resampling method described by Efron (1979), yielding a sample of pseudo-average curves. For each of the 900 data points within the effect window, the standard deviation was then computed across all pseudo-average curves. A code sketch of this step follows below.
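A possible sketch of Step 1 in Python/NumPy (the full implementation is provided as a MATLAB script in the supplementary material; all names here are illustrative):

```python
import numpy as np

def bootstrap_pseudo_averages(curves, n_boot=400, seed=None):
    """Step 1 (sketch): resample whole curves with replacement (Efron, 1979),
    average each bootstrap sample, and estimate the pointwise SE of the
    average curve from the spread of the pseudo-average curves."""
    rng = np.random.default_rng(seed)
    m, n = curves.shape
    idx = rng.integers(0, m, size=(n_boot, m))     # m curves per pseudo-sample
    pseudo_means = curves[idx].mean(axis=1)        # (n_boot, n) pseudo-averages
    se = pseudo_means.std(axis=0, ddof=1)          # pointwise SE estimate
    return pseudo_means, se
```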

Before applying Step 1, the researcher has to determine the number of required bootstrap samples and decide what measure of central tendency is adequate for the analysis. The ideal number of bootstrap samples is \( m^m \), where m represents the number of data curves available for analysis (here, m = 10). The size of this ideal bootstrap sample becomes a computational issue when m is large. However, a simulation calculating the reliability of the estimates can help in choosing an appropriate number of bootstrap samples. In our example, the standard deviation of the estimates dropped proportionally to the square root of the number of runs (see Fig. 3). We therefore stopped resampling after 400 runs, where the variability of the estimates of the correction factor C dropped below 5%. Alternatively, the number of bootstrap samples can be chosen without a simulation, by calculating as many samples as reasonably possible.

Fig. 3

Development of the estimation variability of the scaling factor C with an increasing number of bootstrap samples (BSs). The coefficient of variation drops from almost 30% when using only ten BSs to under 3% when using 1,600 BSs. Such simulations can be used to determine the minimum number of BSs when using the bootstrap resampling procedure

In many single-point analyses, the central tendency of the data set is represented by the mean. Note, however, that in some cases the mean is not a robust measure of central tendency; alternative estimators are, for example, the median or the trimmed mean (Wilcox, 2012). Which measure of central tendency best fits the data at hand depends on the type of distribution to be analyzed. The PBRT works with any of these measures. To demonstrate the influence of different measures of central tendency on the PBRT, we calculated the α and β error rates while applying the PBRT using the mean, median, and trimmed mean of artificially generated EEG data (see Appendix 4 for details on the artificially generated EEG data). The outcome of this parameter comparison can be found in the Results section.

  • Step 2 (Calculation of scaling factor C): Following the usual logic, the confidence limits (CI) are set at a specific radius from the mean in order to exclude the intended percentage of cases. In the case of normally distributed data, the limits would be set to the mean plus/minus a given factor times the standard deviation (CI =  ± C ⋅ std), with this factor usually being C = 1.96 in order to include 95% of the cases. In addition, the scaling factor needs to be corrected according to the amount of inner dependency ρ of the curves, as we mentioned earlier. Since it is very difficult to quantify ρ appropriately, the necessary correction factor cannot be determined directly. We therefore used the method introduced by Lenhoff et al. (1999), which is based on the distribution of critical distances dcrit. For testing our hypothesis, the crucial concern was whether at least a single index exists for which the curve in question, \( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \), would lie outside the confidence band. In general, a curve Y(1…n) is inside the confidence band if the distance to the mean \( {d}_j=\left|{y}_j-\overline{y_j}\right| \) is smaller than C ⋅ stdj for every j = 1…n—that is, \( {d}_{crit}=\underset{j}{\max}\left(\frac{d_j}{std_j}\right)<C \). In our case, we calculated dcrit values for the instances from the resampling runs within the effect window, according to the procedure of Lenhoff et al. (1999), and then selected the C value that cut off 5% of the distribution of the dcrit values (see the sketch after the next paragraph).

This was done for the effect window only. The scaling factor regulates the location of the upper and lower limits of the band (see Appendix 3); that is, it defines which multiple of the standard error is added to or subtracted from the mean curve so that the upper and lower endpoints of the confidence band meet the desired coverage probability.
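A sketch of Step 2, continuing the hypothetical names from the Step 1 sketch (pseudo_means, se, and the boolean effect mask):

```python
import numpy as np

def confidence_band(pseudo_means, se, effect, alpha=0.05):
    """Step 2 (sketch): for each pseudo-average curve, compute the maximal
    standardized distance d_crit to the grand mean inside the effect window
    (following Lenhoff et al., 1999); C is the value that cuts off the
    upper alpha share of the d_crit distribution."""
    mean = pseudo_means.mean(axis=0)
    d = np.abs(pseudo_means[:, effect] - mean[effect]) / se[effect]
    d_crit = d.max(axis=1)                     # one d_crit per pseudo-curve
    C = np.quantile(d_crit, 1 - alpha)
    return mean - C * se, mean + C * se, C     # band; test inside effect only
```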

Once the confidence band for the null hypothesis (\( {DS}_{hit}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \), in our case) has been constructed, the average curve of the data set representing the alternative hypothesis can be tested against this band. Note that the confidence band was constructed such that if even a single data point at any index within the effect window lies outside the band, a significant difference between the tested data sets is indicated. The result of the test in our example can be found in the Results section.
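The test itself then reduces to a few lines (again a sketch, with names continued from above):

```python
import numpy as np

def band_test(test_curve, lower, upper, effect):
    """Significant difference if the tested average curve leaves the band
    at one or more indices inside the effect window."""
    outside = (test_curve < lower) | (test_curve > upper)
    significant = bool(np.any(outside & effect))
    return significant, np.flatnonzero(outside & effect)
```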

Post-hoc analyses

Validation of the calculated C value

After the confidence band was calculated, the band had to be tested in terms of its true prediction coverage. Such a validation can be achieved with cross-validation techniques. To do so, the average curves of the bootstrap samples were used: starting with the first curve, one curve was removed from the sample and a confidence band was computed with the remaining curves. In each run, we checked whether the removed curve stayed within the statistical limits over all data points 1…n. This procedure was repeated for every curve; in our case, 400 times. The relative number of cases in which the removed curve falls outside of the confidence band should be close to the desired α error probability. To emphasize the advantage of the PBRT over the local Gaussian method, we calculated the true prediction coverage of both methods for data sets with different distributions (varying the skewness), using the artificially generated EEG data (see Appendix 4). The results of this calculation can be found in the Results section and are illustrated in Fig. 6.
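One possible leave-one-out implementation (a sketch; for simplicity, the pointwise SE is reused across folds, whereas a stricter validation might re-estimate it per fold):

```python
import numpy as np

def true_coverage(pseudo_means, se, effect, alpha=0.05):
    """Leave one pseudo-average curve out, rebuild the band from the rest,
    and count how often the held-out curve escapes the band inside the
    effect window; the escape rate should approximate the nominal alpha."""
    n_boot = pseudo_means.shape[0]
    escapes = 0
    for i in range(n_boot):
        rest = np.delete(pseudo_means, i, axis=0)
        mean = rest.mean(axis=0)
        d = np.abs(rest[:, effect] - mean[effect]) / se[effect]
        C = np.quantile(d.max(axis=1), 1 - alpha)
        held_out = np.abs(pseudo_means[i, effect] - mean[effect]) / se[effect]
        escapes += np.any(held_out > C)
    return escapes / n_boot
```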

Power analysis

An important aspect of data analysis, aside from hypothesis testing, is the verification of the statistical test itself. Using local methods (see Fig. 2), it is possible to quantify the power of an applied test in a post hoc analysis; furthermore, a priori calculations of required sample sizes for given effects can be made. Importantly, such analyses need not be limited to single-point data. We now present a way to validate the computed confidence bands and to apply a power analysis to continuous data as well.

Our power analysis for continuous data is based on effects from previous findings and on simulated test curves used to probe the computed confidence band. Concretely, we took the effect (voltage differences between hit and miss trials within the effect window) found in a previous study (Maurer et al., 2015) and added those values to each of the 400 resampled average hit curves, which represent our null hypothesis. In other words, we constructed artificial error curves (AECs) from the hit curves. The power of the confidence band analysis was quantified by testing how many of the AECs fell outside the confidence band that had been constructed on the basis of the hit curves. AECs that stayed inside the confidence band of the hit curves throughout the complete effect window would be categorized incorrectly—that is, a curve carrying the effect would be judged as showing no effect. The relative frequency of this type of error represents the β error probability. Conversely, the relative frequency of AECs leaving the confidence limits for at least one index can be taken as the power (1 − β). For better illustration, the procedure of the power analysis is presented as a schematic diagram in Fig. 4; a detailed description of the power calculation can also be found in Appendix 3.
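A sketch of this procedure (names continued from the earlier sketches; effect_shape is a hypothetical array holding the previously observed per-index voltage differences within the effect window):

```python
import numpy as np

def power_estimate(pseudo_means, se, effect, effect_shape, alpha=0.05):
    """Add a previously observed effect to every resampled hit curve to
    build artificial error curves (AECs), then count the share of AECs
    that leave the band inside the effect window (power = 1 - beta)."""
    mean = pseudo_means.mean(axis=0)
    d = np.abs(pseudo_means[:, effect] - mean[effect]) / se[effect]
    C = np.quantile(d.max(axis=1), 1 - alpha)
    lower, upper = mean - C * se, mean + C * se
    aecs = pseudo_means.copy()
    aecs[:, effect] += effect_shape                # inject the known effect
    out = (aecs < lower) | (aecs > upper)
    detected = np.any(out[:, effect], axis=1)      # AEC escapes the band
    return detected.mean()                         # estimated power (1 - beta)
```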

Fig. 4

Schematic illustration of the confidence band (CB) power analysis. The power analysis starts with the resampled data (A). The effect found in previous experiments (B; i.e., the effect within the effect window) is added to the data supporting the null hypothesis (A) to produce artificial error curves (AECs). (C) Result of the effect addition, A + B = AEC. (D) CBs are computed using the resampled data shown in A. (E) Tests of the AECs (C) within the CB (D), calculated from the resampled data. The relative frequency of AECs that do not leave the CB at any index within the effect window can be used as an estimate of the β error probability

Results

Testing for differences with confidence bands

The aim in our example was to test two EEG data sets for differences in their means within a preset effect window. One data set included the hit trials \( {DS}_{hit}=\left\{{Y}_{\left(1\dots n\right)}^1,\dots, {Y}_{\left(1\dots n\right)}^k\right\} \) and the other the miss trials \( {DS}_{miss}=\left\{{X}_{\left(1\dots n\right)}^1,\dots, {X}_{\left(1\dots n\right)}^k\right\} \), acquired from a goal-oriented throwing task. The null hypothesis was therefore formulated as \( {H}_0:{\overline{DS}}_{\left(1\dots n\right)}^{hit}={\overline{DS}}_{\left(1\dots n\right)}^{miss} \), with \( {H}_1:\neg {H}_0 \). For significance testing, a confidence band was calculated using the data set that was supposed to show no effect in the EEG data (DShit). The test for differences in the mean curves was done by placing the average curve of the data set in which an effect was expected (\( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \)) within the confidence band around \( {\overline{DS}}_{\left(1\dots n\right)}^{hit} \). If \( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \) were to leave the band at even a single index, a significant difference between \( {\overline{DS}}_{\left(1\dots n\right)}^{hit} \) and \( {\overline{DS}}_{\left(1\dots n\right)}^{miss} \) would be indicated at all indices leaving the band.

In the example plotted in Fig. 5, the average curve of the miss trials leaves the confidence band from about 250 to 295 ms. Hence, a significant mean difference was found within the effect window, and we reject H0 in favor of H1. With respect to the exemplary research question, we found a significant negative deflection in the miss trials—that is, an error-related negativity.

Fig. 5

Testing of continuous data (red) in the generated confidence band (black broken lines). Whenever the tested curve falls outside the band for at least a single index, a significant difference between the sample representing the null hypothesis (hit sample) and the test curve is detected

Comparison of different measures of central tendency within the PBRT

As already mentioned in the description of the PBRT (Step 1: Estimation of local distribution), the method can be used with different measures of central tendency. Wilcox (2012) discusses in his book on robust estimation and hypothesis testing how prominent measures such as the mean, median, and trimmed mean can be more or less robust depending on the type of data distribution (e.g., the level of skewness). We therefore tested the influence of each of these three measures on the actual prediction coverage of the resulting confidence bands (with the desired nominal level set to α = 5%) while increasing the level of skewness in our artificial data curves (see Appendix 4 for details on the artificial data).

As can be seen in Fig. 6, increasing skewness does not lead to relevant deviations from the desired nominal level for any of the tested measures of central tendency integrated into the PBRT. Nevertheless, it remains necessary to decide individually which measure best fits the data at hand before using it with the PBRT.

Fig. 6

Comparison of three measures of central tendency in terms of the α error probability of the confidence band with increasing levels of skewness. All tested measures (mean, median, and trimmed mean 20%) using the PBRT seem to be robust to different levels of skewness and deliver the desired α error probability of 5%. The α error probability using the local Gaussian method is clearly different from the desired nominal level and increases slightly with an increasing level of skewness, from 27% to 34%. The distributions below the x-axis visualize the skewness of the analyzed data

Discussion

In this article, we have discussed differences between single and continuous data regarding data analysis. We compared different statistical methods with respect to their applicability to continuous data and introduced a method based on bootstrap resampling. An overview in Fig. 7 helps to visualize the key arguments for or against the methods discussed in this article.

Fig. 7

Overview of the statistical options for analyzing continuous data. In this chart, the methods discussed in the article are compared regarding their testing principles, prerequisites, outputs, handling of the interdependence of acquired data, and computational effort

Local methods are designed to analyze data sets composed of k single data points (Fig. 7, column A; e.g., the reaction times of k participants). These techniques can be adapted to continuous data by treating the data at each index j = 1…n as a single data point and running n single-data analyses in parallel. However, this procedure is only accurate for data with an internal dependence ρ close to either 1 or 0 (using a Bonferroni correction in the latter case), because other ρ values cannot be modeled adequately within the procedure (Fig. 7, column A4). Thus, using local methods on continuous data can lead to extremely liberal (assuming ρ = 1) or conservative (assuming ρ = 0) results. This is supported by the observation that the true prediction coverage in the analysis of continuous EEG data is far from the nominal level when we apply a local Gaussian method, that is, a point-by-point application of a local method to each index j = 1…n (Fig. 7, column B; αnominal = 5%, αtrue ranging from 27% to 34%, depending on the skewness of the analyzed distribution). This finding is in line with the results of Lenhoff et al. (1999), who compared the true coverage probability of their FBRT (Fig. 7, column C) to that of the local Gaussian method and came to a similar result. On the basis of these findings, we do not recommend using either local methods or local Gaussian methods for the analysis of continuous data unless (1) a normal distribution of every data point can be assured and (2) ρ is either 0 or 1.

In contrast, we described the FBRT as a tool that includes mathematical fits and a scaling factor for the confidence bands and that is appropriate for analyzing effects represented by more than one data point (i.e., continuous data; Fig. 7, columns C1 and C4). The major benefit of this technique results from the scaling factor C, which adjusts the prediction coverage so that it closely approaches the desired nominal level, as was confirmed by cross-validation in the study by Lenhoff et al. (1999). Using mathematical functions to fit the data, and thus modeling the inner dependency through the coefficients of the fitting function, has one advantage over bootstrap resampling without functional fits: the resampling is done only on the function coefficients rather than on all data points j ∈ {1, …, n}, which decreases the necessary computational effort substantially (Fig. 7, column C5). If the functions fit the data sufficiently well, the resulting confidence band is only marginally influenced by the fitting error, which will not substantially change the researcher's decision in hypothesis testing. However, the mathematical fitting of the data is also a strong limitation of the FBRT, because the procedure cannot be applied to continuous data sets (e.g., high-frequency data such as EEG signals) for which no family of functions can fit the data curves reasonably well. In most cases, a function can be found that fits one of the k data curves in a data set; this function, however, often does not sufficiently represent the characteristics of the other curves, which makes it unfeasible to apply the FBRT in these cases (for an example of this problem, see Appendix 2).

As an alternative to the existing methods, we introduced a nonparametric approach to analyzing continuous data that combines aspects of the FBRT with the point-by-point analysis of local methods (we call this the point-based resampling technique, PBRT; Fig. 7, column D). Integrating bootstrap resampling allowed us to statistically compare two average curves (which is often needed when, e.g., analyzing ERPs in EEG, or when comparing two population mean curves in general) through the use of confidence bands. This was done by simulating multiple pseudo-average curves on the basis of the real, nonfitted data in order to compute a simultaneous confidence band, which can then be used to estimate the distribution of the average curves. For demonstration purposes, we examined the EEG signals of ten subjects in order to test for differences between two EEG time series (hit vs. miss trials in a throwing task)—more specifically, with respect to the appearance of an ERP that correlates with erroneous motor execution. After calculating the confidence band from the data set representing the null hypothesis (hit trials) for the indices within the preset effect window, we tested the average curve of the miss trials against the confidence band (see Fig. 5). We found that the average curve of the miss trials left the confidence band within the effect window, indicating significant differences between hit and miss trials within this time interval. Hence, the results of the PBRT suggested rejecting the null hypothesis in this example.

In further simulations, we showed that the true prediction coverage of the PBRT was very close to the desired nominal level of αnominal = 5% (5.3% ≤ αtrue ≤ 5.8%), and that this result was robust to the degree of skewness in the analyzed data distribution. With respect to robust estimators, we tested the influence of different measures of central tendency on the prediction coverage of the confidence band and found no relevant differences between the mean, the median, and the 20% trimmed mean (see also Fig. 6), at least in our example. However, the influence of different measures of central tendency can change when analyzing different data sets.

To conclude, the PBRT can be applied to nonnormally distributed continuous data that cannot be fitted by a basic mathematical function. In addition, our method provides hypothesis testing in specific time windows (effect windows). Nevertheless, the definition of an effect window—that is, its location and size—is only reasonable when the onset of the hypothesized effect is clearly known. The generated confidence band can also be used for a power analysis, as described in the Method section; in fact, it can also be used as a tool to specify required sample sizes a priori. Thus, we recommend this confidence band analysis as an alternative to the method of Lenhoff et al. (1999) when analyzing continuous behavioral data curves that cannot be fitted by an appropriate mathematical function family (e.g., EEG data).

Although the PBRT can be applied to many types of continuous data, individual decisions must be made before every analysis: first, the appropriate estimators for the data to be analyzed; second, the minimum number of bootstrap subsamples. Our calculations show that accuracy improves with an increasing number of subsamples, with the error asymptotically approaching zero. Note, however, that the decision of which confidence band accuracy is sufficient depends heavily on the particular data set and the hypothesized effect. The accuracy of the resulting confidence limits can be roughly estimated from multiple calculation runs, but it is still up to the experimenter to decide what level of confidence band accuracy is enough. Finally, the selection of adequate estimation parameters, the location and size of the effect window, and the experimental design itself all depend on informed decisions by the researcher. Note that these requirements are not a particular disadvantage of the method described here, since such decisions are required in any type of statistical testing.