Longitudinal Monitoring of Athletes: Statistical Issues and Best Practices

  • Chris Bailey


Athlete monitoring utilizing strength and conditioning as well as other sport performance data is increasing in practice and in research. While the usage of this data for purposes of creating more informed training programs and producing potential performance prediction models may be promising, there are some statistical considerations that should be addressed by those who hope to use this data. The purpose of this review is to discuss many of the statistical issues faced by practitioners as well as provide best practices recommendations. Single-subject designs (SSD) appear to be more appropriate for monitoring and statistically evaluating athletic performance than traditional group statistical methods. This paper discusses several SSD options available that produce measures of both statistical and practical significance. Additionally, this paper discusses issues related to heteroscedasticity, reliability, validity and provides recommendations for each. Finally, if data are incorporated into the decision-making process, it should be returned and utilized quickly. Data visualizations are often incorporated into this process and this review discusses issues and recommendations related to their clarity, simplicity, and distortion. Awareness of these issues and utilization of some best practice methods will likely result in an enhanced and more efficient decision-making process with more informed athlete development programs.


Keywords: Continuous monitoring · Athlete tracking · Data return · Data visualization

Athlete Monitoring

A desirable result of any strength and conditioning program is an improved level of preparedness and improved ability to perform [28, 31, 54, 68]. Typically this can be evaluated through some form of testing, but continual maximal effort performance testing may not be practical for athletes, especially those in season. As such, regular testing of physical performance at submaximal levels or during regular practice or competition may be a better approach allowing for more frequent measurement such as in an athlete monitoring program [32, 50]. Regardless of the type of testing or the variable of interest, measurement must be completed if one is to evaluate an athlete’s level of preparedness [28, 38, 50].

Along with aiding practitioners in the evaluation of athlete readiness, athlete monitoring also helps in the evaluation of strength and conditioning programs [28]. Data from monitoring programs may substantiate or contradict a strength and conditioning program. This goes beyond simply evaluating a program based on a win/loss record for a season and provides objective data on the success of a particular program for a team or an individual athlete. Direct objective feedback on an athlete’s progression can be given to coaches and other decision makers [48]. This data can also be used to help coaches and practitioners make data driven decisions for program improvement at the team or individual level [28, 50, 54].

Understanding the demands of a particular sport is an ongoing venture in sport performance. The relationship between competition performance data and collected monitoring data will likely help answer some questions and potentially bring about new questions [28, 38]. Data from athlete monitoring may also aid in talent identification or the identification of variables that contribute to optimal performance of a particular sport or task [60].

Data from athlete monitoring programs may also help clarify some of the ambiguity of the training process from a dose response perspective at the individual athlete level [28, 38, 50, 54]. Not all athletes will respond to training the same way and there should be a focus on individual responses to training. While standard pre/post testing may explain the success of a program with a sufficient sample size, that model does not work with individual athletes. Fortunately, athlete monitoring utilizes more frequent data collection, providing many data points for each athlete [50, 54].

Monitoring for the purpose of understanding training can be broken down into two primary areas: (1) dosage or input and (2) response or output. All of the training sessions, practice sessions, competitions, and anything else that results in a reaction of the athlete can be considered dosage [54]. Much of this can be quantified, but there may be issues with differing units of intensity across the different types of dosage [17]. The response or output is often more difficult to quantify, but a change in performance, whether an improvement or a decrement, is a good signal of a response. It is important to note, that a performance decrease may not always be visible by a single marker such as amount of weight lifted and measures in different areas may be required to show the response [18, 19, 20]. It is also important to note that some short-term performance decrement due to fatigue should be expected during training, but that should be logically planned as part of a functional overreach [24].

While testing and monitoring variables of potentially enhanced performance is important, monitoring recovery is just as important for athletes. Each session, whether it be a training session, practice, or competition, can be considered a stimulus for adaptation. Recovery is necessary if adaptation or some form of supercompensation is to occur [31, 54, 68]. Even if adaptation has not occurred, at least returning to normal homeostatic levels of preparedness is desirable for athletes in season [30]. Recovery occurs when the amount of training stressors and stimuli is reduced. Unfortunately, practitioners may often forget or be unable to quantify all stressors and stimuli [31]. An athlete may have an issue in their social life that is reducing their sleep, or a student athlete may be sacrificing sleep to study for an exam. Each of these examples reduces the amount of sleep an athlete might get, and there are many other factors that might alter optimal recovery that sport performance coaches may not be aware of [30]. Athletes who continue to underrecover relative to a given amount of stress/stimuli are likely to face consequences. Short-term underrecovery may result in fatigue and decreased motivation. Long-term underrecovery may result in performance decreases, overtraining, and athlete burnout [40].

There are many areas and variables that can potentially be monitored. Selection of measures to monitor should be done with caution, and the measures should be relevant. Practitioners should remember that monitoring data serve as a representation of performance, not the actual performance itself [50]. Goodhart’s law states that “when a measure becomes a target, it ceases to be a good measure” [55]. This law should also be applied to athlete monitoring: once an individual monitoring variable becomes the objective, it can no longer be considered an adequate monitoring variable. The objective of a monitoring variable should be to predict some future performance, not to serve as an end in itself. This adage is borrowed from economics, but it seems useful for athlete monitoring as well.

While athlete monitoring is increasing in practice and research, it appears that much of the research has focused on specific areas to monitor and not as much on the statistical procedures involved [67]. This may lead to an accumulation of data, but no plan as to what to do with it [50]. A handful of papers have discussed some statistical techniques, but they do not focus much attention on data preparation and evaluation of the quality of data [12, 32, 39, 50]. The assumptions of normality, homogeneity of variance, and reliability of data collection methods can prove problematic if they are not evaluated. If violated, they can often render other statistical significance tests useless, so data screening prior to other analysis is necessary [64]. Even fewer articles discuss the return of the monitoring data to coaches and the visual display of the information [38, 39, 50]. The purpose of this review is to discuss many of the statistical issues and provide some best practices information. If literature is lacking in our field, best practices information from other fields will be borrowed and adapted.

Statistical Concerns and Best Practices

Regular Testing

Sands et al. [50] state that monitoring utilizes assessments as “stand-ins” for competitions to simulate the competition as if it were today. As such, regular testing is necessary if this question is to be answered [49]. Furthermore, mistakes in athletic development can prove costly, further justifying frequent monitoring if it can help reduce some of these mistakes [25, 26, 30, 43]. Testing pre and post season is important, but doing only that may result in missing data that could prove helpful and might even prevent injury or overtraining. Figure 1 may help illustrate the difference between pre/post testing and regular monitoring. Both sides of the figure depict the same athlete data of countermovement jump peak power (in W) over time, but the left side shows only what would be visible with pre/post season testing alone. From this data, we might interpret that our strength and conditioning program was effective. On the right side, all of the test data are visible and tell a different story. It seems there was one outlier data point and most of the other data are similar. From this data, we likely would not consider the strength and conditioning program effective. The correct, more informed interpretation is made possible by having more data points from regular testing.
Fig. 1

A comparison of two time series plots showing countermovement jump peak power in W on the y axis and change in time (by month) on the x axis from the same athlete. The 1st plot only includes the preseason and postseason data, while the 2nd plot shows all data collected weekly

Single-Subject Designs

Group statistical techniques are the dominant method presented in research in our field [32]. They are quite useful and are strengthened by large sample sizes. Athlete monitoring, however, focuses on individual athletes, which means these methods may not be as applicable [32]. Group statistical methods make inferences based upon the mean of data distributions. If decisions are being made based on that data, mistakes are likely being made for those athletes not centered around the mean [32, 47]. For example, understanding that the mean value of our sample has increased from one time period to the next is important when considering the overall team or program, but that is not as important for the individual athletes at either end of the data distribution. Consider Fig. 2, a histogram of a sample of 50 computer-generated, normally distributed squat jump peak power values with a mean of 3088.05 W and a standard deviation of 107.5 W. There is no issue if a practitioner is designing a program for an athlete near the middle of the data distribution. But when considering developing or elite athletes, practitioners may find themselves working with athletes on the tails of data distributions rather than near the mean. It is also important to consider that athlete training response is individualistic and idiosyncratic [50]. As such, the mean of athletes’ training responses may have little actual value.
Fig. 2

A normal distribution of squat jump peak power values in W

Statistical and Practical Significance

Single-subject designs seem to be more appropriate for athlete monitoring as they focus on each athlete individually. A common statistical concern of single-subject designs is the sample size. A sample size of one is likely disastrous for group designs in terms of achieving statistical significance. Single-subject designs overcome this by utilizing repeated measures of the same subject [32, 50]. Statistical significance is similar between group statistics and single-subject designs in that they are both attempting to quantify the probability that some treatment will reliably produce the same result [41, 58].

Even though single-subject designs have a small sample (n = 1), tests of statistical and practical significance still exist, but repeated measures are necessary to establish statistical significance. Instead of using a large sample of measures from different athletes, a single-subject design will likely use numerous data points over a period of time from the same subject. Individual phases may then be identified, and measures of those phases can be compared. The extended celeration line (ECL), improvement rate difference (IRD), percentage of all nonoverlapping data (PAND), nonoverlap of all pairs (NAP), and Tau-U are examples of these procedures [36, 44, 62]. While all are different techniques, each essentially evaluates the amount of data in one phase that overlaps with the data in the other phase or phases. The selection of a particular technique will likely come down to the specific situation encountered. For example, during sport performance assessments, practitioners may be concerned with a potential learning effect. Increases in performance could be due to increases in physical preparedness or due to getting better at the test. The presence of a learning effect in data could lead to a misinterpretation. Furthermore, many of the phase comparison techniques assume that the baseline phase data are not trended. Fortunately, the ECL can control for linear trends and the Tau-U can control for non-linear trends [44]. If practitioners suspect a learning effect might be present, one of these trend-controlling options may be sought. Single-subject design phase comparison techniques are analogous to group means comparison techniques with larger samples. Practitioners looking for a single-subject alternative to a regression or prediction technique should consider the Theil–Sen slope.
To the current author’s knowledge, the Theil–Sen slope has not been used in sport performance research, but it has been used in student academic progress monitoring in a similar manner [61]. It is worthwhile to note that these procedures may not be widely available in statistical software, as single-case research does not make up a large portion of the market share. That being said, these procedures are easy to do by hand, and free web calculators using these methods have been created that will generate P values, confidence intervals, and effect sizes [62].
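To make the phase-overlap logic concrete, the sketch below is a minimal Python implementation of NAP: the proportion of all baseline/treatment pairs in which the treatment value exceeds the baseline value, with ties counted as half. The data values are hypothetical, and the function name is my own; it assumes higher scores indicate improvement.

```python
from itertools import product

def nonoverlap_of_all_pairs(baseline, treatment):
    """Nonoverlap of All Pairs (NAP): the proportion of baseline/treatment
    pairs in which the treatment value exceeds the baseline value, with
    ties counted as half. Assumes higher values indicate improvement."""
    pairs = list(product(baseline, treatment))
    better = sum(1 for b, t in pairs if t > b)
    ties = sum(1 for b, t in pairs if t == b)
    return (better + 0.5 * ties) / len(pairs)

# Hypothetical weekly countermovement jump peak power (W) for one athlete
baseline_phase = [3010, 2985, 3022, 2990, 3005]
training_phase = [3080, 3110, 3065, 3125, 3098]

# Every training-phase value exceeds every baseline value, so NAP = 1.0
print(nonoverlap_of_all_pairs(baseline_phase, training_phase))  # 1.0
```

A NAP of 0.5 indicates chance-level overlap between phases, while 1.0 indicates complete nonoverlap; the web calculators mentioned above [62] additionally supply P values and confidence intervals for this statistic.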

It should be noted that statistical and practical significance are not the same. Statistical significance is generally expressed as a P value and provides information about the reliability of a finding, or the probability that the finding is by chance alone [41, 58]. Practical significance, often referred to as meaningfulness, is generally reported via some type of effect size estimate [14]. Much of scientific writing and publications depends heavily on P values, but there is a movement to rely less on them [1]. The justification for this is that findings are generally accepted as “significant” if a P value of less than 0.05 is achieved, but P values do not indicate the size of the effect. They are also heavily influenced by sample size. So much so, that a small effect with a large enough sample may produce a statistically significant P value. This may lead to practitioners making misinformed decisions. This has become such an issue in larger fields of science that over 800 researchers are now promoting the complete abandonment of P values in a recent publication in Nature [1].

Specifically concerning athlete development, measures of practical significance may be of primary concern. In order to enhance performance, coaches, athletes, and scientists alike are mainly concerned with meaningful change [14, 32]. Furthermore, even if a group statistical method is chosen, sample sizes in sport performance are generally dictated by the size of the team one is working with. Small samples more often than not result in a low likelihood of achieving statistical significance even if meaningful change has occurred [14]. That being said, this does not necessarily mean that the reliability of a finding should be ignored and only the magnitude of difference or relatedness should be considered. Whenever possible, both P values and effect size estimates should be reported [14].
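Since both P values and effect size estimates should be reported, a short sketch of one common effect size, Cohen’s d, may be useful. The pooled-SD formulation and the data below are illustrative assumptions on my part, not values from the text; note that, unlike a P value, d does not grow with sample size.

```python
from statistics import mean, stdev

def cohens_d(pre, post):
    """Cohen's d using the pooled standard deviation: a magnitude-based
    complement to the P value that is not inflated by sample size."""
    pooled_sd = (((len(pre) - 1) * stdev(pre) ** 2 +
                  (len(post) - 1) * stdev(post) ** 2) /
                 (len(pre) + len(post) - 2)) ** 0.5
    return (mean(post) - mean(pre)) / pooled_sd

# Hypothetical squat jump peak power (W) before and after a training block
pre = [2950, 3010, 2980, 3040, 2995]
post = [3060, 3125, 3080, 3150, 3105]
print(round(cohens_d(pre, post), 2))  # 3.15, a very large effect
```

With a small squad, a change of this magnitude might still fail to reach P < 0.05, which is exactly why the magnitude itself should be reported alongside the probability [14].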

Concerning practical significance, there may be many occasions where a data visualization depicts the whole story and measures of probability and effect size are simply icing on the cake [32, 50]. Consider Fig. 3, which depicts changes in an athlete’s jumping peak power across a training macrocycle. When only considering the raw data, it may be observed that the athlete’s peak power is increasing. This notion is further justified by the smoothed trendline in blue along with the shaded 95% confidence interval. Individual training phases are indicated by the dashed vertical lines. The first 8 weeks of training were part of a hypertrophy phase that included a lot of volume in the weight room. As such, some decline in performance should probably be expected in this phase, and that is noticeable in Fig. 3 [22]. The next phase focuses on strength, and intensity has increased. The final phases are strength/power and power, respectively. In the final phases, intensity increases but volume decreases, and an increase in jump performance is noted in Fig. 3. The previously described interpretation was entirely visual, but measures of statistical and practical significance would strengthen the justification, especially if future publication is desired.
Fig. 3

A time series plot of an athlete’s peak power (in W) on the y axis and time (by week) on the x axis of an entire year of data collection. A smoothed trendline (blue) with a 95% confidence interval (shaded area) is added. Training phases are depicted by the vertical dashed lines (hypertrophy, strength, strength/power, and power, respectively)

Pre-analysis Data Screening

The current environment in sport performance and sport science fields provides ample opportunity to utilize “Big Data” techniques within athlete development programs. This is especially true with elite sports, where ample data is available publicly [29, 34, 52]. This data can be explored and manipulated to evaluate relationships and produce predictive models. This information can then be used to make more informed decisions about player development. The data collected in strength and conditioning and sport science programs can often be used in the same way, as an abundance of data is collected through monitoring programs [3, 9, 11, 56]. Unfortunately, not all data and not all data collection instruments are useful. As such, there are some concerns that need to be addressed and considered prior to including any data in the decision-making process.


Testing and monitoring the development of an athlete’s bio-motor abilities is vital to determine the progress, maintenance, or regression associated with training. There are numerous instruments and methods of assessing performance [23, 33, 42]. Sport scientists and strength coaches should be concerned with the reliability and validity of these measurement devices and protocols. Reliability concerns the consistency of results of multiple tests while validity concerns the similarity between the measured value and the actual value [2, 27, 65]. While both can be largely affected by the quality of instrumentation, reliability is also affected by the subject and test protocol. Thus, the standardization of testing protocol is an essential component of reliability.

Although investigations of reliability in sport and exercise science are relatively common, the methods of reliability assessment may be quite diverse. Methods include intraclass correlation coefficients (ICC), the coefficient of variation (CV), the standard error of measurement (SEM), and limits of agreement (LOA) between trials. ICCs provide a relative value of reliability, representing the degree to which subjects maintain their rank in the distribution of scores across repeated measures or trials. Investigators should avoid applying ICC values from previous research to current investigations, as ICCs are specific to the sample tested. The range of data is not accounted for in ICC assessment; therefore, the range of measures could change dramatically while each subject maintains their rank in the sample, still resulting in a high ICC. This could mislead the investigator as to the reliability of the measure if the ICC is the only measure used [2].
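The rank-preservation caveat can be demonstrated numerically. The sketch below computes one common ICC form, the consistency-model ICC(3,1), from a standard two-way ANOVA decomposition (the choice of this form is my own; other ICC forms treat systematic shifts differently). When every athlete improves by exactly the same amount, ranks are unchanged and the ICC is perfect even though every score moved.

```python
from statistics import mean

def icc_consistency(trials):
    """ICC(3,1), consistency model, via a two-way ANOVA decomposition.
    `trials` is a list of per-trial score lists, with subjects in the
    same order in every trial."""
    k, n = len(trials), len(trials[0])
    grand = mean(x for trial in trials for x in trial)
    row_means = [mean(trials[j][i] for j in range(k)) for i in range(n)]
    col_means = [mean(trial) for trial in trials]
    ss_total = sum((x - grand) ** 2 for trial in trials for x in trial)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between subjects
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between trials
    msr = ss_rows / (n - 1)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse)

# Every athlete improves by exactly 150 W between trials: ranks are
# unchanged, so the ICC is perfect despite the shift in every score.
trial_1 = [2800, 2900, 3000, 3100, 3200]
trial_2 = [x + 150 for x in trial_1]
print(icc_consistency([trial_1, trial_2]))  # 1.0
```

This is precisely the scenario described above: a high ICC that, taken alone, would hide a 150 W shift in the whole squad, which is why absolute reliability measures are needed alongside it.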

CV, SEM, and LOA are absolute measures of reliability, in that they quantify the level of variability of repeated measures. Formulas for calculating CV, SEM, and LOA include the standard deviation (SD), which somewhat illustrates the disparity between subjects. CV is commonly derived from the SD with the formula \(CV = SD/mean \times 100\). SEM and LOA can also be derived from the SD (\(SEM = SD \times \sqrt {1 - ICC}\) and \(LOA = 1.96 \times \sqrt{2} \times SEM\)) [2]. Previous authors have justified the usefulness of CV, SEM, and LOA for the exercise or sport scientist, as they provide implications for measurement precision and improve the ability to infer results to other samples [2, 65].
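These SD-based formulas are straightforward to apply in a few lines of code. In the sketch below, the ICC is assumed to come from a separate relative-reliability analysis (the value 0.95 is purely illustrative, as are the scores), and the 95% LOA is taken as 1.96 times the square root of 2 times the SEM.

```python
from statistics import mean, stdev

def absolute_reliability(scores, icc):
    """CV, SEM, and 95% LOA from repeated-trial scores, using SD-based
    formulas. `icc` is assumed to come from a separate relative-
    reliability analysis."""
    sd = stdev(scores)
    cv = sd / mean(scores) * 100    # CV = SD / mean * 100
    sem = sd * (1 - icc) ** 0.5     # SEM = SD * sqrt(1 - ICC)
    loa = 1.96 * 2 ** 0.5 * sem     # 95% LOA = 1.96 * sqrt(2) * SEM
    return cv, sem, loa

# Hypothetical countermovement jump peak power scores (W) across trials
scores = [3010, 3055, 2990, 3030, 3045]
cv, sem, loa = absolute_reliability(scores, icc=0.95)
print(f"CV = {cv:.2f}%, SEM = {sem:.1f} W, LOA = ±{loa:.1f} W")
```

Because the SEM and LOA come back in the measurement’s own units (watts here), they give the practitioner a direct sense of how large a change must be before it exceeds measurement noise.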

CV, SEM, and LOA are not equal measures of absolute reliability, and selecting the appropriate measure is critical. SEM and LOA both assume the data are homoscedastic, meaning that every data point has the same chance of variance regardless of magnitude. CV, on the other hand, assumes the data are heteroscedastic, where the chance of variance depends on measure magnitude. Thus, if heteroscedasticity is present, a CV may be more useful. Heteroscedasticity is commonly present in sport science data, but one should conduct a test of hetero/homoscedasticity prior to applying a measure of reliability to ensure the appropriate measure is used [2]. Therefore, the use of the ICC is still recommended when appropriate, and the proper usage of a measure of absolute reliability should also be considered [2, 65].


Validity is the similarity between the measured value and the actual value. A measure must first be reliable before it can be considered valid. In fact, validity depends on both reliability and relevance [41, 58]. As such, it is possible for a measure to be reliable but not valid if the measure is not relevant to its objectives. For example, a measure can be incorrect but consistently incorrect, producing an acceptable level of reliability without validity.

There are many different types of validity. Logical, ecological, and criterion validity are likely the ones most relevant to athlete monitoring. Logical or face validity refers to the way a test looks on the surface: it should logically measure what it claims to. This is quite important for coaches and athletes alike, as they may not fully participate in or support a test that does not show immediate perceived value [38, 58]. Ecological validity is concerned with the application of the findings to actual competition scenarios. Ecological validity is very important in athlete monitoring, as the application of the findings is desired in a very short period of time [38]. Criterion validity utilizes scores on some criterion measure to establish either concurrent or predictive validity. Concerning the data collection and instrumentation part of athlete monitoring, concurrent validity can be established by examining the measures obtained via a specific method along with those simultaneously measured by a previously validated “gold standard” device [58]. For example, the force plate might be considered the criterion measure for analyzing jump performance, as many variables can be attained from force–time data collected at high sampling frequencies. But, depending on software, force–time curve analysis may take a significant amount of time and there may be difficulties with portability. A switch mat may be a more practical way to measure jump performance, but it should be validated against a force plate, as has been done in research [10]. Predictive validity is concerned with the predictive value of the data obtained with a test. Concerning athlete monitoring, this refers to the ability to predict future sport performance. Sport scientists may be concerned with the predictive validity of a single measure or a combination of measures in a model [38].

Evaluating validity is generally more difficult than evaluating reliability. Evaluating reliability requires multiple trials, but evaluating validity also requires a criterion measure or actual competition data. Practitioners may not have access to “gold standard” equipment, so this may not be as practical as performing their own reliability analysis. Assuming access to criterion measurement equipment is available, validation of concurrent validity is often completed via Pearson’s product-moment (PPM) correlations [38, 41, 58]. The results are then interpreted via the r value. This can prove problematic if it is the only method of measuring validity. Consider the two data sets in Table 1 of theoretical jumping peak power values. Running a PPM correlation yields a perfect r value of 1.0, but these data are not the same. A paired samples t test reveals a P value of < 0.001, and a Cohen’s d effect size estimate reveals a value of 1.43, indicating that the data sets are both statistically and practically different. While these circumstances may be unlikely, it is possible for two measurement devices to be highly correlated but statistically and practically different, as has been seen previously in research [4, 5]. As a result, statistical validation should include multiple methods, such as a PPM correlation and some form of means comparison (ANOVA, t test, limits of agreement, etc.) [6, 41].
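This perfect-correlation pitfall is easy to reproduce. In the hypothetical example below (my own values, not those of Table 1), device B reads exactly 50 W above device A on every jump, yielding r = 1.0 despite an obvious systematic bias, which a simple means comparison exposes.

```python
from statistics import mean, stdev

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) *
           sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

# Hypothetical jump peak power (W): device B reads 50 W above device A
device_a = [2950, 3000, 3050, 3100, 3150]
device_b = [x + 50 for x in device_a]

r = pearson_r(device_a, device_b)  # perfect correlation despite the bias
d = (mean(device_b) - mean(device_a)) / stdev(device_a)  # effect of the bias
print(r, d)
```

Here the correlation alone would suggest the devices are interchangeable, while the mean offset shows they are not: exactly why a PPM correlation should be paired with a means comparison when validating instrumentation.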
Table 1

Two data sets of countermovement jump peak power (PP) represented in watts (W)

Heteroscedasticity and Measurement Error

It is important to remember that all measurement contains some error; even the most valid methods will have some amount of it. Theoretically, we can consider the observed value as the sum of the true value and the error value (observed value = true value + error value). The true value is what one should strive to measure, but it is not actually attainable [41, 58]. There are many sources of potential error, and some, such as methodology- and instrumentation-based error, have been mentioned already. One area of potential error that is often neglected is the magnitude of the measure itself. If athletes who produce extreme values (very high or very low) have a greater chance to vary or produce error, the data are described as heteroscedastic. It is generally desirable that measures be homoscedastic, meaning everyone has the same chance for measure variance regardless of measure magnitude [2, 27, 65]. Many statistical tests have the assumption of homoscedasticity. Along with the SEM mentioned above, linear and nonlinear models also assume homoscedasticity [57]. Heteroscedasticity influences data by increasing the rate of type I errors, rendering the results of statistical significance tests invalid [64].

Heteroscedasticity can be particularly troubling when dealing with sports performance data where extreme values may be seen on a regular basis. If data are heteroscedastic, practitioners should not be very confident in the reliability or validity of their data or their predictive models. As such, evaluation of heteroscedasticity needs to be completed [2, 27, 64, 65]. This can be evaluated in terms of reliability or validity. Either way, the difference (between trials for reliability or between the measured value and the value of the criterion measure for validity) is compared to the means. If the variance is uniform regardless of the means, it is considered homoscedastic. If there is a trend where more variance occurs at either end, it is considered heteroscedastic [2].

Data can be evaluated for the presence of heteroscedasticity visually as well as statistically. Visually, evaluation of heteroscedasticity can be completed by plotting the differences (residuals) against the overall means. Figure 4 shows an example scatter plot of athletes’ countermovement jump landing peak force values, with the trial means on the x axis and the between-trial differences on the y axis. It appears there is a relationship between residual size, or variance, and the measure magnitude. This would indicate the presence of heteroscedasticity. The visual inspection should be aided by statistical inspection whenever possible. Statistically, heteroscedasticity can be evaluated with Levene’s test or the Breusch–Pagan test [8, 37]. Both of these test the null hypothesis that the data are homoscedastic; thus a P value of less than 0.05 is necessary to indicate heteroscedasticity. These tests are routinely included in statistical software and programming languages. A less formal way to statistically test for the presence of heteroscedasticity is to run a correlation between the residuals and the means and to interpret the r value. Heteroscedasticity is indicated if a relationship is present. If not, the data are likely homoscedastic [35].
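The informal correlation check just described can be sketched in a few lines. The function below correlates the absolute between-trial differences with the trial means; the landing peak force values are hypothetical and deliberately constructed so that variability grows with magnitude.

```python
from statistics import mean

def residual_mean_correlation(trial_1, trial_2):
    """Informal heteroscedasticity check: Pearson correlation between the
    absolute between-trial differences (residuals) and the trial means.
    A substantial positive r suggests heteroscedastic data."""
    resid = [abs(b - a) for a, b in zip(trial_1, trial_2)]
    means = [(a + b) / 2 for a, b in zip(trial_1, trial_2)]
    mr, mm = mean(resid), mean(means)
    num = sum((r - mr) * (m - mm) for r, m in zip(resid, means))
    den = (sum((r - mr) ** 2 for r in resid) *
           sum((m - mm) ** 2 for m in means)) ** 0.5
    return num / den

# Hypothetical landing peak force (N): between-trial variability grows
# with the magnitude of the measure, i.e. heteroscedastic data
trial_1 = [4000, 4500, 5000, 5500, 6000]
trial_2 = [4040, 4560, 5090, 5620, 6160]
print(round(residual_mean_correlation(trial_1, trial_2), 2))  # 0.99
```

An r this close to 1.0 mirrors the pattern in Fig. 4 and would warrant a formal follow-up with Levene’s or the Breusch–Pagan test before trusting SEM- or LOA-based reliability statistics.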
Fig. 4

A scatter plot of trial to trial residuals plotted against trial means for vertical jumping peak landing force that indicates heteroscedastic data in athletes

While the assumption of homoscedasticity is not often evaluated in research, it is likely that much of sport performance data violates that assumption [2]. This seems particularly true in accelerometers and inertial measurement units where several studies have presented heteroscedastic data [3, 13, 42, 45]. Given the increase in usage of wireless sensors, some attention and concern should be given to this area and practitioners should evaluate their own data for the presence of heteroscedasticity.

Data Return and Visualization

Much of the data collected, the indicators of performance, and the opportunity to adjust a training program based on them are time sensitive [7, 50]. As a result, it is imperative that collected data be utilized and returned to decision makers rapidly. As such, practitioners will be aided by software that can analyze and represent data quickly. There are many software programs that perform data analysis and produce data visualizations and dashboards (single-screen presentations of the most meaningful information) and do so quickly [16], but not all organizations and institutions will be able to afford such programs. As a result, Microsoft Excel may be the solution of choice for many, but it is limited in its ability to complete complex data analysis, and creating numerous data visualizations for each athlete and team results in time-consuming, redundant behaviors. Free, open-source programming languages such as R and Python may be a solution for this, but users must learn the syntax of the languages and additional packages (R Core Team [46]; van Rossum [63]). R and Python do offer more freedom in analysis and data visualization than other programs, but the initial coding may be time consuming. The ability to loop, or iterate, over a sequence of data instead of repeatedly completing similar tasks for every athlete or test greatly reduces the time required [66]. This will save time over many of the “out of the box” software programs in the long run, but considerable time must be invested early on.
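The time savings from iteration can be sketched simply. Below, a single loop produces a summary for every athlete in a hypothetical monitoring data set (athlete names and values are invented); the same pattern extends naturally to generating one chart per athlete with a plotting library.

```python
from statistics import mean, stdev

# Hypothetical weekly peak power values (W), keyed by athlete
monitoring_data = {
    "Athlete A": [3010, 3045, 3080, 3120],
    "Athlete B": [2890, 2870, 2905, 2860],
}

# One loop replaces repeating the same spreadsheet steps per athlete
reports = {}
for athlete, scores in monitoring_data.items():
    reports[athlete] = {
        "latest": scores[-1],                 # most recent test result
        "mean": mean(scores),                 # average over the block
        "sd": stdev(scores),                  # week-to-week variability
        "change": scores[-1] - scores[0],     # change since first test
    }

for athlete, summary in reports.items():
    print(athlete, summary)
```

Adding a new athlete to `monitoring_data` requires no new code at all, which is where scripted workflows pull ahead of hand-built spreadsheets and dashboards over a full season.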

Regardless of the method chosen, the data given back to coaches or other decision makers likely comes in the form of a collection of charts, plots, or dashboards for each athlete. There are many factors that should be considered when returning data in the form of a visualization. In Edward Tufte’s landmark work on the subject, he promotes several tenets of graphical excellence and best practices. The main ones discussed in this paper are that practitioners should represent as much data as possible in as little space as possible, that graphics should not distort what the data say, and that graphics should be clear in their purpose [59].

Efficient Data Display

Concerning the presentation of large quantities of data in a small space, Tufte presents the concept of “data-ink”: the ratio of ink used to represent actual data, or change in data, to the total ink required to produce the graphic [59]. Essentially, it is the ratio of ink dedicated to the necessary display of information to that which is redundant or unnecessary. Certain chart types, such as the pie chart, have inherently poor data-ink ratios, as the same information can generally be represented by a small table. Examples of a high data-ink ratio are the time series (seen in Fig. 3) and the radar plot [16, 39, 50]. Radar plots (Fig. 5) allow practitioners to display multiple variables in a single graphic along with changes over time or performance comparisons between athletes [16, 38].
Fig. 5

A radar plot comparing athlete performance data of three athletes (JH jump height, Mass body mass, TimetoFirst home to first time, RFD rate of force development, PP peak power)

Fundamentally, a radar plot is just a line graph with multiple data series that have been formed into a round shape [16]. One potential concern with the radar plot is that if one is using different measurement scales [e.g. peak force in Newtons (4982 N) and jump height in meters (0.51 m)], the data will have to be normalized; otherwise the smaller numbers will not be visible on the shared axis when plotted. The most common way to do this is with the z score or t score [38, 41]. Most statistical software has formulas built in for each, but they are easy to calculate by hand if not. The decision about which standardized score to use is generally based on sample size. Both formulas use the standard deviation, but the z score is supposed to use the standard deviation of the population that it represents, not necessarily the standard deviation of the sample being tested. Thus, the general recommendation is to use z scores with sample sizes greater than or equal to 30 and t scores with smaller samples [41]. That said, one could argue that a team of 22 athletes is the population as well as the sample, so a z score may still be appropriate. A second concern is the desired direction of magnitude. For example, Fig. 5 displays baseball monitoring data. For several of the variables [jump height (JH), rate of force development (RFD), peak power (PP)], it is desirable that the data points be further from the center. For other variables, such as the time it takes to reach first base, a smaller value is desired. This may lead to some confusion if not explained well. The final concern is that once data are converted to standardized scores, the units are no longer present and magnitudes may be difficult to interpret. All of these concerns should be considered and addressed, but if the graphic causes too much confusion, it may be time to simplify [16, 59].
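As a sketch of this normalization step, z scores can be computed in a few lines of Python; the values below are hypothetical, and the choice between the sample standard deviation (`stdev`) and the population standard deviation (`pstdev`) reflects the sample-versus-population question raised above:

```python
import statistics

def z_scores(values):
    """Standardize a list of scores: (x - mean) / SD.

    Uses the sample SD; swap in statistics.pstdev if the squad
    is treated as the population.
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical values on very different scales.
peak_force = [4982.0, 4510.0, 5305.0]   # Newtons
jump_height = [0.51, 0.44, 0.58]        # meters

# After standardization, both variables share a unitless common
# scale and can be plotted on one radar-plot axis.
pf_z = z_scores(peak_force)
jh_z = z_scores(jump_height)
```

Note that after this step the original units are gone, which is exactly the interpretability trade-off discussed above.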

Misrepresenting the Data

Unfortunately, data visualizations can be misleading. In athlete monitoring, misleading the viewer is likely accidental, but it can still lead to an incorrect decision. It is important to follow some best practices or guidelines when producing visualizations so that this can be avoided.

One potential way for this to happen in athlete monitoring is by not displaying all the data or by not collecting enough data. If only preseason and postseason data are displayed, one might be misled about what happened along the way, or some effect might seem magnified, as was the case in Fig. 1. Assuming regular testing is occurring, enough data should be available to produce time-series plots that are easily readable for viewers [50].

Misrepresenting the y axis may be the most common issue in data visualizations. For example, in Fig. 6 the same athlete’s countermovement jump data are used to create both plots. The plot on the left looks highly variable and, for viewers not paying attention to the y axis tick marks, seems to show a dramatic increase after the first two measurements. The magnitude of the difference is misleading here. Standardizing the y axis helps avoid this mistake: the plot on the right fixes the y axis at zero, which illustrates the difference in Fig. 6, and starting y axes at zero is good practice for plots in general [16, 53, 59].
Fig. 6

A comparison of two time series plots of the same athlete’s data. The plot on the left has an altered y axis, while the plot on the right has a y axis that starts at 0

Misrepresenting data happens frequently and the magnitude of misrepresentation can be quantified via Tufte’s Lie Factor [59].
$$\text{Tufte's Lie Factor} = \frac{\text{size of effect shown in graphic}}{\text{size of effect in data}}$$
If the change between points 2 and 3 in Fig. 6 is measured at 50 mm in the plot on the left and only 5 mm in the plot on the right, then the Lie Factor of the first plot is 10.
$$\text{Tufte's Lie Factor} = \frac{50\ \text{mm}}{5\ \text{mm}} = 10$$

Data visualizations can distort effects in both directions, so a Lie Factor can be above or below 1.0. According to Tufte, anything outside of the range of 0.95–1.05 represents substantial distortion; thus, the example shown in Fig. 6 is substantially distorted.
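The Lie Factor is simple enough to compute directly. As an illustration in Python, using the measurements from the Fig. 6 example above:

```python
def lie_factor(effect_shown_mm, effect_in_data_mm):
    """Tufte's Lie Factor: the size of the effect shown in the
    graphic divided by the size of the effect in the data."""
    return effect_shown_mm / effect_in_data_mm

# The change between points 2 and 3 measures 50 mm in the distorted
# plot but only 5 mm in the plot with a y axis starting at zero.
lf = lie_factor(50, 5)
print(lf)  # 10.0

# Tufte's acceptable range is roughly 0.95-1.05; this plot is well outside it.
distorted = not (0.95 <= lf <= 1.05)
```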

Sometimes misrepresenting data may not entirely be the fault of the data visualization creator. Consider pie charts, which have understandably fallen out of favor with many data scientists [15, 51]. While pie charts are familiar to many, they force viewers to make comparisons based on chart area, and our visual perception is limited in its ability to perform this task. If a pie chart is rotated so that no side of any slice is directly in line with either the x or y axis, perception is weakened further. Looking at the pie chart in Fig. 7, it may be relatively easy to determine that catchers (C) represent 25% of the data because both sides of that slice lie on the x and y axes. Turning your head or the image slightly increases the difficulty of determining its value [51]. Determining the value of any of the other positions is likely much more difficult. Bar charts are easier to interpret and can illustrate the same information, leading to the recommendation to replace pie charts with bar charts or a simple table whenever possible [15]. Speedometer or gauge plots are popular in performance-based dashboards, but they are fundamentally just a different version of a pie chart, as they represent fractional components. They are often more complicated to produce and perform extremely poorly on Tufte’s data-ink ratio, as they represent only one value (e.g. 85% of total) [16, 59]. While some plots may appear elegant or visually appealing, if they offer little information relative to the amount of ink required to create the graphic, space and time are not being used efficiently. Finally, some attention should be paid to the choice of color palette. The ‘viridis’ color palette (available in R and Python, used in Fig. 7) is accessible to those with different types of colorblindness, so it will be clear to most who view graphics with it [21].
Fig. 7

A comparison of the same theoretical baseball positional data represented as a pie chart, bar chart, and table (C catcher, P pitcher, IF infielder, OF outfielder)
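The pie-to-bar recommendation with the ‘viridis’ palette can be sketched in Python with matplotlib; the positional counts and the output filename below are hypothetical:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display required
import matplotlib.pyplot as plt
from matplotlib import cm

# Hypothetical positional counts in the style of Fig. 7.
positions = ["C", "P", "IF", "OF"]
counts = [5, 8, 4, 3]

# Sample evenly spaced colors from the colorblind-friendly viridis palette.
colors = cm.viridis([i / (len(positions) - 1) for i in range(len(positions))])

fig, ax = plt.subplots()
ax.bar(positions, counts, color=colors)  # bar chart instead of a pie chart
ax.set_ylabel("Number of players")
fig.savefig("positions_bar.png")
```

Because each bar is read against a common zero baseline, relative magnitudes are far easier to judge than the areas of rotated pie slices.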


Data collection during strength and conditioning and sport performance is on the rise, and its use in athlete monitoring is also increasing. While the use of these data for creating more informed training programs and potential performance prediction is promising, there are statistical concerns that should be addressed by those who use them. At minimum, reliability and the assumption of homoscedasticity should be evaluated. This should be done by all practitioners with their own data, rather than relying on published findings from other samples. If possible, the concurrent validity of devices should also be evaluated. Following any analysis, the data return process should not be overlooked. Data should be visualized in a simple and clear manner that does not result in distortion. This will likely result in a more efficient decision-making process and more informed athlete development programs.


  1. Amrhein V, Greenland S, McShane B. Retire statistical significance. Nature. 2019;567(7748):305–7.
  2. Atkinson G, Nevill AM. Statistical methods for assessing measurement error (reliability) in variables relevant to sports medicine. Sports Med. 1998;26(4):217–38.
  3. Bailey C, McInnis T, Batcher J. Bat swing mechanical analysis with an inertial measurement unit: reliability and implications for athlete monitoring. J Trainol. 2016;5(2):43–5.
  4. Bampouras T, Relph N, Orne D, Esformes J. Validity and reliability of the Myotest Pro wireless accelerometer. Br J Sports Med. 2010;44(14):i20.
  5. Batcher J, Nilson K, North T, Brown D, Raszeja N, Bailey C. Validity of jump performance measures assessed with field-based devices and implications for athlete monitoring. J Strength Cond Res. 2017;31:s82–162.
  6. Bland J, Altman D. Statistical methods for assessing agreement between two methods of clinical measurement. Lancet. 1986;1(8476):307–10.
  7. Bosco C, Colli R, Bonomi R, Atko Viru SVD. Monitoring strength training: neuromuscular and hormonal profile. Med Sci Sports Exerc. 2000;32(1):202–8.
  8. Breusch T, Pagan A. A simple test for heteroskedasticity and random coefficient variation. Econometrica. 1979;47(5):1287–94.
  9. Bricker J, Bailey C, Driggers A, McInnis T, Alami A. A new method for the evaluation and prediction of base stealing performance. J Strength Cond Res. 2016;30(11):3044–50.
  10. Buckthorpe M, Morris J, Folland J. Validity of vertical jump measurement devices. J Sports Sci. 2012;30(1):63–9.
  11. Camp C, Tubbs T, Fleisig G, Dines J, Dines D, Altchek D, Dowling B. The relationship of throwing arm mechanics and elbow varus torque: within-subject variation for professional baseball pitchers across 82,000 throws. Am J Sports Med. 2017;45(13):3030–5.
  12. Clubb J, McGuigan M. Developing cost-effective, evidence-based load monitoring systems in strength and conditioning practice. Strength Cond J. 2018;40(6):7–14.
  13. Driggers A, Bingham G, Bailey C. The relationship of throwing arm mechanics and elbow varus torque: letter to the editor. Am J Sports Med. 2018;47(1):1–5.
  14. Ellis P. The essential guide to effect sizes: statistical power, meta-analysis, and the interpretation of research results. 1st ed. Cambridge: Cambridge University Press; 2010.
  15. Few S. Save the pies for dessert. Visual Business Intelligence Newsletter; 2007.
  16. Few S. Information dashboard design: displaying data for at-a-glance monitoring. 2nd ed. Burlingame: Analytics Press; 2013.
  17. Foster C. Monitoring training in athletes with reference to overtraining syndrome. Med Sci Sports Exerc. 1998;30(7):1164–8.
  18. Fry A, Kraemer W, Borselen F, Lynch J, Marsit J, Roy E, Knuttgen H. Performance decrements with high-intensity resistance exercise overtraining. Med Sci Sports Exerc. 1994;26(9):1165–73.
  19. Fry A, Kraemer W, Lynch J, Triplett NT, Koziris L. Does short-term near maximal intensity machine resistance exercise induce overtraining? J Strength Cond Res. 1994;8(3):75–81.
  20. Fry A, Webber J, Weiss L, Fry M, Li Y. Impaired performance with excessive high-intensity free-weight training. J Strength Cond Res. 2000;14(1):54–61.
  21. Garnier S. viridis: default color maps from ‘matplotlib’. 2018. Accessed 9 Jul 2019.
  22. Gonzalez-Badillo J, Gorostiaga E, Arellana R, Izquierdo M. Moderate resistance training volume produces more favorable strength gains than high or low volumes during a short-term training cycle. J Strength Cond Res. 2005;19(3):689–97.
  23. Haff G, Carlock J, Hartman M, Kilgore J, Kawamori N, Jackson J, Stone M. Force–time curve characteristics of dynamic and isometric muscle actions of elite women Olympic weightlifters. J Strength Cond Res. 2005;19(4):741–8.
  24. Halson S, Jeukendrup A. Does overtraining exist? An analysis of overreaching and overtraining research. Sports Med. 2004;34(14):967–81.
  25. Hickey J, Shield A, Williams M, Opar D. The financial cost of hamstring strain injuries in the Australian Football League. Br J Sports Med. 2014;48(8):729–30.
  26. Hoffman J, Kaminsky M. Use of performance testing for monitoring overtraining in youth basketball players. Strength Cond J. 2000;22(6):54–62.
  27. Hopkins W. Measures of reliability in sports medicine and science. Sports Med. 2000;30(1):1–15.
  28. Joyce D, Lewindon D. High-performance training for sports. 1st ed. Champaign: Human Kinetics; 2014.
  29. Kagan D. The anatomy of a pitch: doing physics with PITCHf/x data. Phys Teach. 2009;47(7):412.
  30. Kellman M. Enhancing recovery: preventing underperformance in athletes. 1st ed. Champaign: Human Kinetics; 2002.
  31. Kellman M, Beckmann J. Sport, recovery, and performance: interdisciplinary insights. 1st ed. New York: Routledge; 2018.
  32. Kinugasa T, Cerin E, Hooper S. Single-subject research designs and data analyses for assessing elite athletes’ conditioning. Sports Med. 2004;34(15):1035–50.
  33. Krustrup P, Mohr M, Nybo L, Jensen J, Nielsen N, Bangsbo J. The Yo-Yo IR2 test: physiological response, reliability, and application to elite soccer. Med Sci Sports Exerc. 2006;38(9):1666–73.
  34. Lage M, Ono J, Cervone D, Chiang J, Dietrich C, Silva C. StatCast dashboard: exploration of spatiotemporal baseball data. IEEE Comput Graph Appl. 2016;36(5):28–37.
  35. Lani J. Heteroscedasticity. 2019. Accessed 9 Jul 2019.
  36. Lee J, Cherney L. Tau-U: a quantitative approach for analysis of single-case experimental data in aphasia. Am J Speech Lang Pathol. 2018;27(1S):495–503.
  37. Levene H. Robust tests for equality of variances. In: Olkin I, editor. Contributions to probability and statistics: essays in honor of Harold Hotelling. Palo Alto: Stanford University Press; 1960. p. 278–92.
  38. McGuigan M. Monitoring training and performance in athletes. 1st ed. Champaign: Human Kinetics; 2017.
  39. McGuigan M, Cormack S, Gill N. Strength and power profiling of athletes: selecting tests and how to use information for program design. Strength Cond J. 2013;35(6):7–14.
  40. Meeusen R, Duclos M, Foster C, Fry A, Gleeson M, Nieman D, Urhausen A. Prevention, diagnosis and treatment of the overtraining syndrome: joint consensus statement of the European College of Sport Science (ECSS) and the American College of Sports Medicine (ACSM). Med Sci Sports Exerc. 2013;45(1):186–205.
  41. Morrow J, Mood D, Disch J, Kang M. Measurement and evaluation in human performance. 5th ed. Champaign: Human Kinetics; 2016.
  42. Nuzzo J, Anning J, Scharfenberg J. The reliability of three devices used for measuring vertical jump height. J Strength Cond Res. 2011;25(9):2580–90.
  43. Ozturk S, Kilic D. What is the economic burden of sports injuries? Jt Dis Relat Surg. 2013;24(2):108–11.
  44. Parker R, Vannest K, Davis J. Effect size in single case research: a review of nine nonoverlap techniques. Behav Modif. 2011;35(4):303–22.
  45. Perez-Castilla A, Piepoli A, Delgado-Garcia G, Garrido-Blanca G, Garcia-Ramos A. Reliability and concurrent validity of seven commercially available devices for the assessment of movement velocity at different intensities during the bench press. J Strength Cond Res. 2019;33(5):1258–65.
  46. R Core Team. R: a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2017. Accessed 9 Jul 2019.
  47. Rose T. The end of average. 1st ed. New York: HarperOne; 2016.
  48. Sands W. Monitoring the elite female gymnast. Natl Strength Cond Assoc J. 1991;13(4):66–72.
  49. Sands W, Stone M. Monitoring the elite athlete. Olymp Coach. 2005;17(3):4–12.
  50. Sands W, Kavanaugh A, Murray S, McNeal J, Jemni M. Modern techniques and technologies applied to training and performance monitoring. Int J Sports Physiol Perform. 2017;12(Suppl 2):S263–72.
  51. Schwabish J. An economist’s guide to visualizing data. J Econ Perspect. 2014;28(1):209–34.
  52. Sikka R, Baer M, Raja A, Stuart M, Tompkins M. Analytics in sports medicine: implications and responsibilities that accompany the era of big data. J Bone Jt Surg. 2019;101(3):276–83.
  53. Smith M. Conversations with data #31: bad charts. 2019. Accessed 21 Jul 2019.
  54. Stone M, Stone M, Sands W. Principles and practice of resistance training. 1st ed. Champaign: Human Kinetics; 2007.
  55. Strathern M. ‘Improving ratings’: audit in the British university system. Eur Rev. 1997;5(3):305–21.
  56. Suchomel T, Bailey C. Monitoring and managing fatigue in baseball players. Strength Cond J. 2014;36(6):39–45.
  57. Tabachnick B, Fidell L. Using multivariate statistics. 5th ed. Boston: Pearson; 2015.
  58. Thomas J, Nelson J, Silverman S. Research methods in physical activity. 7th ed. Champaign: Human Kinetics; 2015.
  59. Tufte ER. The visual display of quantitative information. Cheshire: Graphics Press; 2001.
  60. Vaeyens R, Lenoir M, Williams A, Philippaerts R. Talent identification and development programmes in sport: current models and future directions. Sports Med. 2008;38(9):703–14.
  61. Vannest K, Parker R, Davis J, Soares D, Smith S. The Theil–Sen slope for high-stakes decisions from progress monitoring. Behav Disord. 2012;37(4):271–80.
  62. Vannest K, Parker R, Gonen O, Adiguzel T. Single case research: web based calculators for SCR analysis. 2016. Accessed 5 Jul 2019.
  63. Van Rossum G. Python tutorial. Technical Report CS-R9526. Amsterdam: Centrum voor Wiskunde en Informatica (CWI); 1995. Accessed 9 Jul 2019.
  64. Vincent W, Weir J. Statistics in kinesiology. 5th ed. Champaign: Human Kinetics; 2012.
  65. Weir J. Quantifying test–retest reliability using the intraclass correlation coefficient and the SEM. J Strength Cond Res. 2005;19(1):231–40.
  66. Wickham H, Grolemund G. R for data science. 1st ed. Sebastopol: O’Reilly Media; 2017.
  67. Wing C. Monitoring athlete load: data collection methods and practical recommendations. Strength Cond J. 2018;40(4):26–39.
  68. Zatsiorsky V, Kraemer W. Science and practice of strength training. 2nd ed. Champaign: Human Kinetics; 1995.

Copyright information

© Beijing Sport University 2019

Authors and Affiliations

  1. Department of Kinesiology, Health Promotion, and Recreation, University of North Texas, Denton, USA
