Comparison of a computer vision system against three-dimensional motion capture for tracking football movements in a stadium environment

Three-dimensional motion capture systems such as Vicon have been used to validate commercial electronic performance and tracking systems. However, three-dimensional motion capture cannot be used for large capture areas such as a full football pitch due to the need for many fragile cameras to be placed around the capture volume and a lack of suitable depth of field of those cameras. There is a need, therefore, for a hybrid testing solution for commercial electronic performance and tracking systems using highly precise three-dimensional motion capture in a small test area and a computer vision system in other areas to test for full-pitch coverage by the commercial systems. This study aimed to establish the validity of the VisionKit computer vision system against three-dimensional motion capture in a stadium environment. Ten participants undertook a series of football-specific movement tasks, including a circuit, small-sided games and a 20 m sprint. There was strong agreement between VisionKit and three-dimensional motion capture across each activity undertaken. The root mean square difference for speed was 0.04 m·s−1 and for position was 0.18 m. VisionKit had strong agreement with the criterion three-dimensional motion capture system for football-related movements tested in stadium environments. VisionKit can thus be used to establish the concurrent validity of other electronic performance and tracking systems in circumstances where three-dimensional motion capture cannot be used.


Introduction
Electronic performance and tracking systems measure the location and speed of movement of athletes during competition and training. Speed and location information can in turn be used to describe the physical and tactical behaviour of players [1]. The electronic performance and tracking system market is highly competitive and estimated to grow to be worth up to USD 7 billion by 2023 [2]. In this market, where football department staff are often end users, the assumption is that electronic performance and tracking systems would be independently evaluated prior to purchase. However, in many cases electronic performance and tracking systems are not independently evaluated prior to entering the market, or indeed prior to manufacturers gaining lucrative contracts in professional sport. The Fédération Internationale de Football Association (FIFA) introduced a quality standard for electronic performance and tracking systems in 2019, against which commercial tracking systems are tested. This new quality standard should both accelerate research and development in the electronic performance and tracking system industry, and add important accountability for the accuracy of systems.
There are three main types of electronic performance and tracking systems currently used in sport. These are global navigation satellite systems (GNSS) [3,4], local positioning systems (LPS) [5,6] or optical systems [7]. Each system is able to provide the location of a player on Earth, either relative to satellites or nodes in a stadium [3,5,6] or calibrated areas on a pitch [7,8]. From location, speed can be derived, or with GPS the Doppler shift in signal from satellites can be used as an indirect marker of speed [3].
The differences between electronic performance and tracking systems make for potential variations in how validation studies are conducted. For example, with GNSS clear access to the sky to enable satellite reception is required, precluding testing indoors. With LPS an instrumented stadium or laboratory is necessary, and for optical systems sufficient height, line of sight and field of view to place the cameras that derive images is required, typically meaning a stadium-like environment. As a result of the differences in systems, and the improvement in the ability to use motion capture systems outdoors in the past 5 years, electronic performance and tracking systems have been concurrently validated against various other systems. Thus, the concurrent validation of electronic performance and tracking systems has been determined outdoors against timing gates [4,8], or radar [9] in the absence of a true criterion measure.
Three-dimensional motion capture systems, in the cases below Vicon, have been used by us to validate many of the major commercial electronic performance and tracking systems available in today's market including those offered by Hawk-Eye Innovations Limited, Track160 Ltd, Fitogether Inc, Catapult Sports, Realtrack Systems SL, and Chyron Hego AB (for details see [10]). Three-dimensional motion capture has also been used to validate two optical systems (STATS SportVU [7] and Chyron Hego [11]), one local positioning system and one GPS system (Inmotio and GPSports SPI Pro X, respectively [7]). Despite the relatively widespread use of three-dimensional motion capture systems, it is not currently possible to use them on a full football pitch due to the need for many fragile cameras to be placed around the capture volume and a lack of suitable depth of field of those cameras.
There is a need for a method to test electronic performance and tracking systems in stadia to ensure ecological validity of the test environment. With no true criterion measure that allows full-pitch movements and therefore comparisons with electronic performance and tracking systems, a hybrid solution is required. One component of the hybrid solution is a three-dimensional motion capture system operating in a small test area in one section of the pitch. These three-dimensional motion capture systems are considered the gold standard and have been reported to produce sub-millimetre errors [7]. The second component is a computer vision system accurate enough for concurrent validity of electronic performance and tracking systems to be tested, for activities on areas outside the three-dimensional motion capture space. It is important that electronic performance and tracking systems are assessed on a full pitch to reflect how these are required to operate in a football match. A reference system used in an area outside the three-dimensional motion capture space would enable a check of whether systems are trained on the capture space only, and thus potentially misrepresenting the true full-pitch accuracy of these systems. Further, the reference system for comparison should not be commercially available and thus be in commercial conflict with the electronic performance and tracking systems tested. Before a computer vision system could be used, it too should be validated against a highly accurate motion capture system. The aim of this study was therefore to test the accuracy of a bespoke computer vision system [8] against three-dimensional motion capture in a stadium environment.

Test environment and participants
The test was conducted in two separate stadia. The first test venue was a stadium used at the time for national level football competition. The stadium had a regulation football pitch, and grandstands with seating for 15,000 people. The second test venue was a stadium currently used for national and international football matches. This stadium had a regulation football pitch, and grandstands with seating for 100,000 people. Participants (n = 60) were members of an elite youth football academy, attached to a professional football team, or active community-level footballers, each of whom gave written informed consent to participate.
The test area consisted of a 30 × 30-m area in which participant movements were captured simultaneously with a 3-D motion capture system and a computer vision system (each detailed below). The test area was set up in one of four possible quadrants on consecutive days, originating from the centre circle of the pitch; thus the computer vision system was tested against 3-D motion capture across an area of 3600 m², representative of over half the area of a standard football pitch, and four times greater than the largest area any other EPTS has been tested in against 3-D motion capture [7,11].
The authors received human research ethics approval from the Institutional Ethics Committee to conduct this work (HRE 16-278).

Movement activities
Within the test area, participants conducted a series of activities to simulate common movements in football. The activities included a circuit with pre-determined movements, 2v2 and 5v5 small-sided games and a 20-m sprint commencing outside the test area and finishing within it.
The circuit (Fig. 1A), laid out within the 30 × 30-m test area, included the following activities: self-paced walking, self-paced jogging, maximal accelerations and changes of direction. Each participant completed 4 min of circuit activities. Each small-sided game was 4 min in duration.
Data were collected on four consecutive days at the smaller stadium, yielding 7459 samples (individual video frames, each providing the speed and position of players) of circuit data, 4207 samples of 2v2 data and 30,967 samples of 5v5 data. Data were collected on two consecutive days at the larger stadium, yielding 39,486 samples of circuit data, 3707 samples of 2v2 data, 43,518 samples of 5v5 data and 781 samples of sprint data.

Three-dimensional motion capture system
Participant position and movement were determined by a large-scale three-dimensional motion capture system. To track participants, five 38-mm retro-reflective spherical markers were placed on specific landmarks: one on each shoulder, and three on the pelvis (Fig. 1B). The mid-position of the three pelvis markers was determined for each data frame to approximate the centre of mass of the player [7]. Shoulder markers aided in identification of individual participants.
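As an illustration of the per-frame pelvis mid-position, the sketch below takes the arithmetic mean of the three pelvis marker coordinates. Treating "mid-position" as the centroid is an assumption, and the function name and tuple layout are hypothetical rather than taken from the processing software used in the study.

```python
def pelvis_midpoint(markers):
    """Return the centroid of the pelvis markers for one frame.

    `markers` is a sequence of (x, y, z) tuples in metres. Using the
    arithmetic mean as the "mid-position" is an assumption.
    """
    if not markers:
        raise ValueError("at least one marker is required")
    n = len(markers)
    return tuple(sum(m[axis] for m in markers) / n for axis in range(3))


# Hypothetical frame: three pelvis markers at a height of 1 m.
frame = [(0.0, 0.0, 1.0), (3.0, 0.0, 1.0), (0.0, 3.0, 1.0)]
centre = pelvis_midpoint(frame)
```

Only the horizontal (x, y) components of such a point would feed into the position and speed comparisons described later.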
The playing volume was reconstructed into three-dimensional space from 36 Vicon Vantage cameras (Oxford Metrics Group Plc [OMG], Oxford, UK) with a sampling frequency of 100 Hz. The cameras were positioned around the 30 × 30-m test area used for both the circuit and small-sided games (Fig. 2A).
Data for each marker were manually labelled in Vicon Nexus motion capture processing software and then transferred to Visual3D biomechanics analysis software (C-Motion Inc., Germantown, MD, USA). Data were interpolated where necessary using the interpolation function in Visual3D with a maximum window ranging from 10 to 100 frames depending on the section of data missing. Data were then smoothed using a dual-pass Butterworth digital filter. The cutoff frequency of 2.5 Hz was based on the results of wavelet analysis, residual analysis, and visual inspection of the effects of different cutoffs on the data (particularly around the maxima and minima). The lower end of the range indicated by these analyses (between 2.5 and 5.0 Hz) was chosen as it served to reduce the effects of the intra-step velocity fluctuation, thereby providing a better estimate of overall velocity. This approach of overall velocity estimation is directly relevant to the method used by practitioners to quantify running velocities, in which they use bands (e.g. distance run within a certain velocity band). Consistent with this, our experience in broader validation work with the manufacturers indicates that these systems apply smoothing to their data to eliminate arbitrary fluctuations.
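The dual-pass smoothing step can be sketched in plain Python as follows. This is a minimal illustration, not the Visual3D implementation: the filter order is not stated above, so a second-order low-pass section applied forwards and then backwards is assumed, and no cutoff correction for the second pass is applied (a dual pass squares the magnitude response, so the effective cutoff is somewhat lower than the nominal one). All function names are hypothetical.

```python
import math


def butter2_lowpass_coeffs(cutoff_hz, sample_rate_hz):
    """Second-order Butterworth low-pass coefficients (bilinear transform)."""
    k = math.tan(math.pi * cutoff_hz / sample_rate_hz)
    norm = 1.0 / (1.0 + math.sqrt(2.0) * k + k * k)
    b0 = k * k * norm
    b = (b0, 2.0 * b0, b0)
    a = (2.0 * (k * k - 1.0) * norm,
         (1.0 - math.sqrt(2.0) * k + k * k) * norm)
    return b, a


def _single_pass(x, b, a):
    """One causal pass; state is primed with x[0] to limit start-up transients."""
    x1 = x2 = y1 = y2 = x[0]
    y = []
    for xn in x:
        yn = b[0] * xn + b[1] * x1 + b[2] * x2 - a[0] * y1 - a[1] * y2
        y.append(yn)
        x2, x1 = x1, xn
        y2, y1 = y1, yn
    return y


def dual_pass_lowpass(x, cutoff_hz, sample_rate_hz):
    """Filter forwards, then backwards, so the phase lag of each pass cancels."""
    if not x:
        return []
    b, a = butter2_lowpass_coeffs(cutoff_hz, sample_rate_hz)
    forward = _single_pass(x, b, a)
    backward = _single_pass(forward[::-1], b, a)
    return backward[::-1]


# Hypothetical usage: smooth one coordinate sampled at 100 Hz with a 2.5 Hz cutoff.
x_raw = [0.05 * i + 0.02 * math.sin(2 * math.pi * 20.0 * i / 100.0) for i in range(400)]
x_smooth = dual_pass_lowpass(x_raw, 2.5, 100.0)
```

In practice each coordinate of each marker would be filtered independently, and a library routine would normally be preferred over a hand-written filter.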
Data (X,Y coordinates) were reduced from 100 to 25 Hz and cropped to the start and finish line (circuit) or the kick off (small-sided games) to allow for aligning coordinates temporally with those from the computer vision system.
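The reduction from 100 to 25 Hz can be illustrated by keeping every fourth frame. The text does not state how the reduction was performed, so plain decimation of the already smoothed data is an assumption here, and the helper name is hypothetical.

```python
def decimate(samples, factor):
    """Keep every `factor`-th sample, starting from the first."""
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    return samples[::factor]


# 100 Hz -> 25 Hz corresponds to a factor of 4.
frames_100hz = [(i * 0.01, 0.0) for i in range(12)]  # hypothetical (x, y) frames
frames_25hz = decimate(frames_100hz, 100 // 25)
```

Because the data had already been low-pass filtered at 2.5 Hz, well below half of the new sampling rate, decimation of this kind should not introduce appreciable aliasing.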

Computer vision system
Activities were recorded using four stationary high-definition video cameras (Panasonic AW-UE70KEJ) genlocked via a remote-control panel (Panasonic AW-RP50E) that provided a view of the entire pitch for each discrete task (see Fig. 2B for details). The resultant video footage was imported into the tracking software (VisionKit, Australian Institute of Sport, Canberra, Australia) and each camera's video image was calibrated to the capture area via association of known points from a rigid calibration rig in the field of view of each camera, so that a pixel represented a known unit of measurement. A set of player detection observations was then generated in VisionKit where each observation consisted of an X,Y ground location and a timestamp [8,12]. VisionKit samples raw detections at 25 frames per second. Individual detections were then aggregated into temporal sequences using the low- and medium-level hierarchical association methods [13]. A piecewise cubic polynomial was fitted to the continuous player tracking using the midpoint for each 1-s epoch. Coordinates (x,y) for players were then estimated by solving the cubic polynomial at each time point.

Statistical analysis
Three-dimensional motion capture raw position data were differentiated to obtain horizontal plane speed using a three-point finite central difference formula [14]. Three-dimensional motion capture position and speed data were then down-sampled to 10 Hz. VisionKit raw position data were differentiated to obtain horizontal plane speed using a three-point finite central difference formula [14]. VisionKit data were then up-sampled to 50 Hz using linear interpolation and then down-sampled to 10 Hz. The up-sampling was required as this study formed part of a larger project, and we needed to be able to make comparisons between the three-dimensional motion capture data and EPTS at both 10- and 25-Hz sample rates. A fourth-order 1-Hz low-pass Butterworth filter was then applied to position and speed data for both VisionKit and three-dimensional motion capture. This filter was selected after wavelet and residual analyses, as per Vicon data analysis described above, and has been used in previous research examining player movements over a similarly sized capture space [11].
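A three-point central difference estimates the velocity at sample i from its two neighbours, v_i = (p_(i+1) − p_(i−1)) / (2Δt), and horizontal plane speed is the magnitude of the resulting (x, y) velocity. The sketch below shows one way this could be written. The function name is hypothetical, and returning values for interior samples only (so the output is two samples shorter than the input) is an assumption about how the end points are handled.

```python
import math


def horizontal_speed(xy, sample_rate_hz):
    """Horizontal speed (m/s) at interior samples via a three-point central difference.

    `xy` is a sequence of (x, y) positions in metres at a fixed sample rate.
    The first and last samples are skipped because they lack a neighbour.
    """
    dt = 1.0 / sample_rate_hz
    speeds = []
    for i in range(1, len(xy) - 1):
        vx = (xy[i + 1][0] - xy[i - 1][0]) / (2.0 * dt)
        vy = (xy[i + 1][1] - xy[i - 1][1]) / (2.0 * dt)
        speeds.append(math.hypot(vx, vy))
    return speeds


# Hypothetical track: constant velocity of (3, 4) m/s sampled at 100 Hz.
track = [(3.0 * i / 100.0, 4.0 * i / 100.0) for i in range(10)]
speeds = horizontal_speed(track, 100.0)
```

For this constant-velocity track every interior sample has a speed of 5 m/s, which gives a quick sanity check of the formula.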
Three-dimensional motion capture and VisionKit data were time synchronized using cross-correlation of position data [15]. Once synchronized, data were trimmed for time on the field, combined and extracted into individual data files. VisionKit position data were then rotated through 360 degrees to find the lowest mean absolute error for position. Once the closest degree was found, the data were further rotated 2° either side by 0.01° increments to align VisionKit with three-dimensional motion capture that resulted in the lowest mean absolute error. Velocity and position data were then compared by root mean square deviation (RMSD): the sample standard deviation of the differences between three-dimensional motion capture and VisionKit. The stabilization of the error was also calculated to determine if sufficient data were obtained for effective comparison in each activity [16] and is presented in Table 1.
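The coarse-then-fine rotation search and the error measures can be sketched as below. Several details are assumptions rather than statements from the text: rotation is taken about the coordinate origin, the coarse search uses 1° steps, mean absolute error is taken as the mean Euclidean distance between paired points, and RMSD is computed in its conventional root-mean-square form. All function names are hypothetical.

```python
import math


def rotate(points, degrees):
    """Rotate (x, y) points anticlockwise about the origin."""
    t = math.radians(degrees)
    c, s = math.cos(t), math.sin(t)
    return [(c * x - s * y, s * x + c * y) for x, y in points]


def mean_abs_error(ref, test):
    """Mean Euclidean distance between paired points (assumed definition)."""
    return sum(math.hypot(rx - tx, ry - ty)
               for (rx, ry), (tx, ty) in zip(ref, test)) / len(ref)


def rmsd(ref, test):
    """Root mean square of the paired point-to-point distances."""
    total = sum((rx - tx) ** 2 + (ry - ty) ** 2
                for (rx, ry), (tx, ty) in zip(ref, test))
    return math.sqrt(total / len(ref))


def best_rotation(ref, test):
    """Coarse 1-degree search over 360 degrees, then 0.01-degree steps within +/-2 degrees."""
    def error(d):
        return mean_abs_error(ref, rotate(test, d))
    coarse = min(range(360), key=error)
    fine = [coarse + k * 0.01 for k in range(-200, 201)]
    return min(fine, key=error)


# Hypothetical data: the "test" system is the reference rotated by -37.25 degrees.
reference = [(1.0, 0.0), (0.0, 2.0), (-3.0, 1.0), (2.0, 2.0)]
measured = rotate(reference, -37.25)
angle = best_rotation(reference, measured)
aligned = rotate(measured, angle)
```

Restricting the fine search to a narrow window around the coarse optimum keeps the number of error evaluations to a few hundred rather than the 36,000 that a full 0.01° sweep would need.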

Results
The root mean square deviation for speed was 0.04 m·s−1 and the mean absolute error for position was 0.15 m. The distribution of error for speed and position is shown in Fig. 3. The error for speed and position by activity and relative to position in the test area is shown in Fig. 4. Adequate data were collected to determine the error within all velocity bands, with stabilization of the error for position occurring after about 7 s, with the exception of the moderate speed band where stabilization occurred after 24 s (Table 1).

Discussion
This study established the accuracy of a bespoke computer vision system for tracking footballers in large stadia. The computer vision system had strong agreement for both speed and position of participants across various activities that included high-speed movements, rapid changes of direction and speed, and potential for occlusion with multiple participants moving in the relatively small (30 × 30 m) test area.
It is difficult to position the accuracy of the computer vision system tested here in the scientific literature, as only two published studies have tested electronic performance and tracking systems in stadium environments using three-dimensional motion capture as the criterion measure [7,11]. Further, there are no established criteria for what constitutes acceptable accuracy. Compared to EPTS tested in a similar way, VisionKit had better speed accuracy than STATS SportVU (optical), Inmotio (LPS) and GPSports SPI Pro X (GPS), which had reported speed accuracies of 0.41 ± 0.08, 0.25 ± 0.06 and 0.28 ± 0.07 m·s−1 (mean ± SD), respectively [7]. Further, the computer vision system here was more accurate for speed than either the Gen 4 or Gen 5 Chyron Hego TRACAB system (0.09 and 0.08 m·s−1, respectively [11]). It should also be noted that VisionKit was tested over 3600 m² of the pitch, an area four times greater than the area covered by the commercial systems [7,11].
The stated trueness for speed of movement, relative to three-dimensional motion capture, of the systems above would enable each to be used for typical movement "performance" applications. Speed data are typically used to describe the mean or peak movement of players in a given epoch (see for example [17,18]). Speed data can be further differentiated to include measures of acceleration [19,20] or combined with skill measures to aid in understanding and prescription of training [21]. In each of these measures, a speed error of less than 0.5 m·s−1 is likely satisfactory if that error is understood and incorporated into analyses and subsequent applications such as describing player movement in competition [17] or comparisons between levels of competition [18] or in training [21]. The computer vision system here also had better positional trueness relative to three-dimensional motion capture than the STATS SportVU, Inmotio and GPSports SPI Pro X systems (0.56 ± 0.16, 0.23 ± 0.07 and 0.96 ± 0.49 m, respectively) [7]. The Chyron Hego TRACAB Gen 4 and Gen 5 systems were approximately 11 cm better for positional trueness relative to three-dimensional motion capture than the computer vision system here.
The differences in trueness of position listed above raise the question of how positionally accurate an electronic performance and tracking system needs to be in order to be effective for quantifying player movement in matches. Given that optical tracking systems typically track the trunk of a player, and that limb length, and by association the capacity to contact and control the ball, is far larger than trunk width, positional accuracy within approximately 20 cm is likely sufficient. Certainly for common metrics associated with position such as x- and y-axis centroid [22], length, width, and surface area [22], player dyads [23] or occupancy maps [24], this level of accuracy would suffice.
There are many other factors that contribute to a difficulty in placing the results here in the broader electronic performance and tracking system context. First, and importantly, most studies attempting to establish the validity of electronic performance and tracking systems have not used a criterion measure. Most studies have used timing gates to time the movement of participants between two points, rather than directly comparing the position or speed accuracy [4,25–27]. Whilst adding incremental advances in the knowledge of electronic performance and tracking systems, the results of these studies do not truly reflect the accuracy of these systems without comparison to a criterion measure. Second, few studies test systems in the environment in which they will be used. In the case of outdoor team sports, systems should be tested in a stadium used for official competition. Third, most studies do not include game-specific tasks. Fourth, many studies use aggregated measures such as total distance, or distance at a certain velocity, rather than instantaneous position or speed [4,28]. The use of aggregated measures takes the validation away from first principles of what an electronic performance and tracking system measures and also results in lower degrees of freedom for comparison. Finally, there is a lack of agreement on the statistical method for comparison of systems. Many, including some of our previous work, favour typical error expressed as a coefficient of variation [4,5], whilst we have used root mean square deviation here. Presentation of results in the units used in the field is possible with root mean square deviation and mean absolute error, and such results can thus be easily interpreted by end users.
Ideally, the accuracy of electronic performance and tracking systems would be established in a stadium environment during actual competition on full-sized pitches. Unfortunately, three-dimensional motion capture systems are not yet able to be used on a whole pitch, and certainly not in full competition due to the need for fragile infrared cameras to be positioned on the sideline, and lack of depth of field of cameras. A second option for use in validation studies in selected areas of stadia is the use of a non-commercial research quality system that has strong agreement with 3-D motion capture. The VisionKit system tested here has strong levels of agreement with 3-D motion capture for speed and position, is not available commercially, and thus is not in direct conflict with any other commercial electronic performance and tracking system. VisionKit either alone or in conjunction with a smaller test area captured by 3-D motion capture thus offers a viable validation standard.

Conclusions
VisionKit has strong agreement with the criterion three-dimensional motion capture system for football-related movements tested in stadium environments. VisionKit can thus be used to establish the concurrent validity of other electronic performance and tracking systems in circumstances where three-dimensional motion capture cannot be used.

Fig. 1 A Schematic of the circuit; green indicates walking, orange indicates jogging, red indicates maximal acceleration. B Position of the five 3-D motion capture reflective markers

Fig. 2 A Location of the 30 × 30 m, 36-camera 3-D motion capture test area at the large stadium. B Position, field of view and distance to test area of VisionKit cameras at each stadium. A = 10 m; B = 79 m; C = 10 m; D = 57 m; E = field of view approx. 54.6°; F = 21 m; G = 109 m; H = 21 m; I = 89 m; J = field of view approx. 62.6°; K = field of view approx. 53.4°

Fig. 3 Histogram of speed differences (A) and histogram of position differences (B)

This article is a part of Topical Collection in Sports Engineering on Football Research, edited by Dr. Marcus Dunn, Mr. Johsan Billingham, Prof. Paul Fleming, Prof. John Eric Goff and Prof. Sam Robertson.

Table 1
Number of samples required for stabilization of the error compared to the number of samples collected for each velocity band