Automatic data volley: game data acquisition with temporal-spatial filters

Data Volley is one of the most widely used sports analysis software packages for professional volleyball statistics analysis. To develop an automatic Data Volley system, vision-based game data acquisition is a key technology, which includes 3D multiple object tracking, event detection and quality evaluation. This paper combines temporal and spatial features of the game information to achieve the game data acquisition. First, the time-vary fission filter is proposed to generate the prior state distribution for tracker initialization. By using the temporal continuity of image features, the variance of the team state distribution can be approximated so that the initial state of each player can be filtered out. Second, the team formation mapping with sequential motion feature is proposed to detect the event type; it represents the players' distribution from the spatial concept and the temporal relationship. Finally, to estimate the quality, the relative spatial filters are proposed, which extract and describe additional features of the subsequent condition in different situations. Experiments are conducted on game videos from the Semifinal and Final Game of the 2014 Japan Inter High School Games of Men's Volleyball in Tokyo Metropolitan Gymnasium. The results show that 94.1% of rounds are successfully initialized, the event type detection achieves an average accuracy of 98.72%, and the success rate of the events' quality evaluation reaches 97.27% on average.


Introduction
With the high-speed development of the vision sensing devices, the vision-based sports analysis technologies contribute in more and more fields, such as TV broadcasting contents, strategy development and coaching system. In game broadcasting, precise and real-time game data will provide much more attractive contents to the audiences to understand the game status. In the strategy development and coaching system, abundant game data help the teams and players to know their strengths and weaknesses, so that the proper strategy and personal training plans can be developed efficiently.
The performance of sports analytic tools and the reliability of game strategy development depend on the quality and quantity of the game data. Therefore, the big data of the network are expected to support sports analysis and other applications [1,2]. Some research focuses on applications of sports analysis based on big data [3,4], which will undergo major changes thanks to the utilization of big data. However, while the social network for utilizing big data is mature, how to obtain the data has become the key problem in real applications.
Xina Cheng, xncheng@xidian.edu.cn. 1 School of Artificial Intelligence, Xidian University, No. 2 South Taibai Road, Xi'an 710071, China. 2 Graduate School of Information, Production and Systems, Waseda University, Kitakyushu City 808-0135, Japan.
For the data acquisition in real game, only the 3D physical data (especially the position and speed information) make sense in data analysis and strategy development. However, most existing practical products for data acquisition only extract and present the game data (such as the position information) in the image coordinate system. From this point of view, volleyball is a typical object of 3D sports analysis, since both the ball and the players require the 3D concept to describe their motions. Once the volleyball game data can be acquired by computer vision method, other game data can also be obtained in similar way.
For the data acquisition and analysis of volleyball games, Data Volley [5] is the most widely used software for professional statistics analysis. This software can not only record the technical and tactical playing data of the players from both teams through a convenient interface, but also produce a variety of statistical analysis data immediately during the statistical process, which helps the coaches to conduct real-time analysis and on-the-spot guidance for the game. In Data Volley, all input data are observed and judged by people watching the game. This input process not only costs considerable human labor and time but also lacks data accuracy, since human eyes are weak at measuring distance, velocity and time. Therefore, this article targets an automatic and precise game data acquisition method in order to develop an automatic Data Volley system that analyzes volleyball games with high reliability and efficiency.

Introduction of data volley
In order to achieve the development of the automatic Data Volley system, the data categories and functions utilized in Data Volley should be made clear. Thus, the requirement of expected automatic data acquisition method can be decided. The software interface of Data Volley (as Fig. 1 shows) is simple and clear, so that the game data can be quickly input through the keyboard and mouse operations. In Data Volley software, several types of data are available to be recorded and presented by some basic code. For each event, the input format sequence of the basic code is "the player number + event code + the start of the ball's trajectory + the evaluation of the event". The simple code sequence consists of information related to all the players and the ball.
- The location area of each player is required to decide the code of "the player number". As Fig. 2 shows, each half-court is divided into nine zones, which are used to present the position information of each player.
- The "event code" denotes the event information. In Data Volley, seven types are defined to represent the function of every event. Based on some introductions [6] and volleyball game rules, the descriptions of all the events are summarized in Table 1.
- The "start of the ball's trajectory" is the starting position of the rough trajectory of the ball. The number of the zone is used as the basic code.
- The "quality of each event" is a comprehensive judgment based on the related players and the ball. For each type of event, symbols are used to represent the quality level, as shown in Table 2.
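To make the structure of the basic code sequence concrete, the following minimal Python sketch stores one event as "player number + event code + start zone + evaluation" and maps a half-court position to one of nine zones. The names `zone_index` and `EventRecord`, the row-major zone numbering, and the symbols `"S"` and `"+"` are hypothetical placeholders; the actual zone numbering of Fig. 2 and the official Data Volley symbols of Tables 1 and 2 may differ.

```python
from dataclasses import dataclass

HALF_COURT_M = 9.0          # a volleyball half-court is 9 m x 9 m
ZONE_M = HALF_COURT_M / 3   # nine zones are assumed to form a 3 x 3 grid


def zone_index(x: float, y: float) -> int:
    """Map a half-court position (metres) to a zone index in 1..9.

    Assumption: zones are numbered row by row starting at the corner (0, 0).
    The numbering used by Data Volley itself may be different.
    """
    if not (0.0 <= x <= HALF_COURT_M and 0.0 <= y <= HALF_COURT_M):
        raise ValueError("position is outside the half-court")
    col = min(int(x // ZONE_M), 2)   # clamp the far boundary into the last zone
    row = min(int(y // ZONE_M), 2)
    return row * 3 + col + 1


@dataclass(frozen=True)
class EventRecord:
    """One event in the order described above (all symbols are placeholders)."""
    player_number: int
    event_code: str
    start_zone: int
    quality: str

    def encode(self) -> str:
        # Concatenate the four fields into a single code string.
        return f"{self.player_number}{self.event_code}{self.start_zone}{self.quality}"


record = EventRecord(7, "S", zone_index(4.0, 1.0), "+")
print(record.encode())
```

In an automatic system, the fields of such a record would be filled by the tracking, event detection and quality evaluation steps instead of keyboard input.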
Although Data Volley has demonstrated its strengths in its convenient interface, effective cooperation, and comprehensive analysis functions, it still has limitations in practical applications. First, it is unfriendly for beginners, who must learn the input keys and basic codes. Second, the data observed by human eyes are rough, because human eyes are weak at measuring position, velocity and time. Third, the data accuracy is affected by individual differences in the judgment and operations of different analysts. Finally, it is difficult to synchronize the game video with the acquired game data.

Automatic game data acquisition
Based on the statement above, the research target of this article is vision-based game data acquisition for the automatic Data Volley system. With the game videos, computer vision technology will provide more accurate data than manual input, especially for the position and velocity information, so that the reliability and accuracy of the game data analysis will increase. According to the data used in the Data Volley software, and considering other game data which are useful for the analysis of game strategy, the requirements of automatic data acquisition from game video are stated as follows. Firstly, the available game playing period should be detected from the entire game video, which also includes the rest time, pause time and other non-playing parts. Secondly, all the players should be distinguished from each other, and at each event, the locations of the players are required. Thirdly, as the evaluation criteria require, the trajectory of the ball, including its position and velocity, is also important data. Fourthly, the game events are required to present the game status. Finally, based on the information of the ball and players in one event, the evaluation of the event is required.
In this paper, we propose data acquisition methods to automatically collect the game data required by the Data Volley software from complete volleyball games. Our contributions are summarized as follows:
- A new initialization method of multiple player tracking is proposed for team sports videos. The team state distribution and the sub-distribution are designed to represent the entire team and each player, based on game rules and physical basis. Instead of detecting all the players separately based on the image features, the sub-distribution of each player is filtered out through the proposed time-vary fission filter and can be used to initialize the tracker directly. Combining the temporal and spatial features, this algorithm is robust to occlusion and similar appearance of targets.
- An event detection method based on the team formation mapping and the sequential motion feature is proposed. The team formation mapping represents the players' distribution of each event, which reveals the moving tendency of the team, so that the intra-class events are easier to distinguish. The sequential ball motion state feature is designed to describe the relationship between the current event and the former one.
- For the target of evaluating the quality, the event series feature and the relative spatial filter are designed. To describe different situations, the event series feature utilizes the relationship between event types and the game process. The relative spatial filter is proposed to extract the information of the following events so that the overall quality of each event can be evaluated.

Related works
In this section, the related works are discussed for different tasks. The target of this article is automatic Data Volley for volleyball game analysis. Vision-based game analysis methods with a similar target are discussed first. The data acquisition for automatic Data Volley can then be divided into several tasks: the detection of the play scene, the multiple player tracking, the ball tracking, the event detection and the event evaluation. For each task, we use one subsection to discuss the related work. Among these tasks, the detection of the play scene and the ball tracking have been achieved by conventional works, and the remaining ones are the main works in this article.

Vision based sports video analysis
Several studies have targeted the development of automatic game analysis systems. Work [7] proposes a computationally efficient hybrid method for automatic sports highlights generation, contributing to broadcasting applications. Method [8,9] proposes trajectory and action recognition of the player to analyze soccer training videos. This work has strict limitations on the environment and can only be used in a single player training scenario. Work [10] predicts team events by analyzing the player motions and performance in basketball and water polo. This work analyzes the data by transferring the input video to an overhead view, so it can only be applied to 2D team sports. Work [11] estimates the team tactics in soccer game videos based on a deep learning method and the unique characteristics of tactics. Work [12] presents techniques for automatically classifying players and tracking ball movements in game video clips to analyze basketball movements and pass relationships. The targets and results of the above methods are far from the requirements of automatic Data Volley. Therefore, no state-of-the-art method can be used directly to achieve our goal. Since the automatic Data Volley system can be broken down into multiple tasks, the following subsections discuss the related works for each task.

Detection of play scene
With the similar target for detecting the available game playing sequence, work [13] analyzes the soccer game structure to classify shot-views and segment play-back. This work is applied for the broadcasting videos based on the editing in broadcast video without using game status. There is another work utilizing player feature based method to detect the start scenes [14] in actual badminton matches. The input of this system is the video captured by the ceiling camera from the top of the court. By simultaneous extraction of spatial-temporal features from the motion images, features of players' postures and motions are extracted, which is used to detect the serving. However, this work cannot handle the complex situation in volleyball videos, in which the feature of the server player is difficult to extract and the background noise is heavy.

Multiple players tracking
For multiple people and object tracking, there is a large amount of work [15][16][17]. Most of these methods cannot be directly used in sports videos due to the special features of the sports scene, such as the severe occlusion between players with similar appearances (features), the complex background, and the fast motions of the target player. To deal with the occlusion and similar appearance problems, utilization of the temporal feature [18] is a feasible solution. Targeting multiple player tracking in sports video, traditional work in computer vision analyzing sports videos [19] has focused on tracking players [20]. In order to distinguish players from each other, Yamamoto et al. [21] perform brute-force SIFT feature matching between learned features and extracted features of the player's jersey number, which is weak against occlusion and complex background noise. Work [12] proposes a player identification method based on jersey number detection and player tracking based on the Yolo framework [22]. To handle the severe occlusion problem, Ikoma et al. [23] propose a 2D elimination method which removes all other objects' regions in the frame. For tracking after occlusion or overlapping, Huang et al. [24] utilize the motion vector of positions at two previous time points to predict the player's position after occlusion.
In addition, most of the above multiple player tracking algorithms are initialized manually. The research targeting automatic initialization of tracking [25,26] is based on object detection results. These detection-based methods are weak against the occlusion and similar appearance problems, which often occur in volleyball games.

3D ball tracking
Work [27] summarizes the challenges of ball tracking in sports video: fast speed, small size, and the influence of other items on the court. A large number of research works [28][29][30] target ball tracking based on 2D images and present the results in the image coordinate system. As for 3D ball tracking, Chen [31] proposed an automated system to approximate the 3D trajectory of the ball. However, the lack of multiple space information makes the approximated result unreliable and the tracking accuracy is not high. Work [32] and Takahashi [33] proposed multi-view based 3D ball tracking in volleyball games to obtain the physical 3D data. Compared with Takahashi's work, our framework avoids the 3D reconstruction process (which causes large error) in the 3D tracking framework, so that the result achieves higher accuracy.

Event detection
Many similar works for event detection and recognition focus on broadcasting video analysis. These works refer to some post-process information like text feature on the screen and inserted audio in work [34]. Wang [35] proposes a framework using visual feature and audio feature to detect the event for soccer. But for the purpose of automatic data acquisition, event data are expected to be collected based on the game content itself. Based on the pure vision information, [36][37][38] propose methods based on image clustering technologies to detect the events for soccer games. Yang et al. [39] and Guo et al. [40] classify the game videos based on the detection of the players' actions, which cost large calculation on the detail information of individual players. Work [41] proposes a ball event detection method while ball tracking. This work classifies the event into four categories based on the ball trajectory, which is different from the requirements of Data Volley.

Event evaluation
In order to analyze the quality or evaluate the performance in sports, some studies [42,43] focus on methods of statistical analysis. By accessing historical data, these works analyze different factors of the overall games and draw up evaluation reports based on certain criteria. These studies focus on the performance of the whole team and do not pay much attention to a certain action or event. To obtain quality information of the receive event, we proposed a framework [44] for qualitative action recognition for volleyball game analysis. This work evaluates the quality based on the return ball quality and the posture quality, which is different from the definition of event evaluation in Data Volley. In general, there is little research targeting event quality evaluation. Since the event quality is defined according to specific game rules, the very few existing methods cannot be used as a comparison.

Overall framework
The conceptual setup of the entire automatic data volley system is designed as Fig 3. The equipment consists of multiple high resolution cameras and a computer to collect and analyze the game data.
The multiple cameras are used to record the game from different view angles so that we can obtain multi-view videos as the input. Multi-view videos are used in this system because it is difficult to construct a precise 3D coordinate from single view information. Although there are some works [29,31] estimating the 3D coordinate using only a single video, they require heavy algorithms to compensate for the reconstruction error. In addition, the multi-view videos are robust to occlusion. In a volleyball game, there are always twelve players on the court who share the same appearance and overlap each other. In order to ensure high data precision, multiple cameras are used to reduce the difficulties of the occlusion problem.
Fig. 3 The conception of the entire automatic data volley system setup
With the input multi-view game video, the vision based data acquisition algorithms are implemented on the computer. In this system, the overall framework and tasks are shown in Fig 4. First, the preprocessing consists of the multi-video synchronization, camera calibration, and the play scene detection. The video synchronization is for aligning different views to fully utilize the image information. The camera calibration is the key to create the projection relation between each image to the real physical world. With the input videos, the play scene detection method outputs the time at which one round begins and the subsequent data acquisition process starts.
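The projection relation created by camera calibration can be illustrated with the standard pinhole model, in which a 3 x 4 projection matrix maps a 3D court point to a pixel in one view. The Python sketch below is only an illustration under that assumption; the function name `project_point` and the example matrix `P_EXAMPLE` are hypothetical and are not the calibration result of the actual system.

```python
def project_point(P, point3d):
    """Project a 3D point (x, y, z) to a pixel (u, v) with a 3x4 matrix P.

    P is given as three rows of four numbers (pinhole camera model).
    """
    x, y, z = point3d
    hom = (x, y, z, 1.0)                      # homogeneous 3D coordinate
    u, v, w = (sum(P[r][c] * hom[c] for c in range(4)) for r in range(3))
    if w == 0:
        raise ValueError("point cannot be projected (zero depth)")
    return u / w, v / w                       # perspective division


# Hypothetical camera: focal length 1000 px, principal point (640, 360),
# placed at the origin and looking along the z axis.
P_EXAMPLE = [
    [1000.0, 0.0, 640.0, 0.0],
    [0.0, 1000.0, 360.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

print(project_point(P_EXAMPLE, (1.0, 0.5, 10.0)))
```

With one such matrix per camera, the same 3D state can be projected into every view, which is how an observation region for a 3D position is obtained in each image.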
Second, the basic data acquisition consists of the physical data tracking and the event detection. For the physical data tracking, 3D ball tracking [32] and multiple players tracking [45] are implemented to obtain the 3D trajectories of the ball and players. Here, we proposed a time-vary fission filter to approximate the initial distribution of the whole team to automatically initialize the tracker.
Third, for the event detection, based on the collected historical trajectories and the image features, a simultaneous tracking and event detection framework is used. Here, we proposed a team formation mapping with sequence motion based event detection method.
At last, for the evaluation process of each event, we propose a relative spatial filter based quality evaluation method.
By combining the output of all the steps, the required data of automatic Data Volley are collected, and the subsequent data analysis and strategy development can be applied. The detailed algorithm of each proposal is described in the following sections.

Temporal-spatial filters
In this work, three methods are proposed for the acquisition of different game data, based on the same core concept: the fusion of temporal and spatial information. The temporal correlations of the position states and event states contribute to predicting the state and extracting objective features. The spatial information consists of not only the image contents, but also the physical information in the 3D world, such as the player's height and the team formation. Fig. 5 shows how the concept of the temporal-spatial filter works in each proposal.
Fig. 4 The automatic Data Volley system consists of two parts: the automatic game data acquisition and the data analysis/strategy development. The overall framework and tasks of the automatic game data acquisition are marked with a rectangle, in which the proposals are denoted in red
Fig. 5 The concept of the temporal-spatial filters
First, with the target of automatic tracker initialization, the distribution of each player is fissured from the team state distribution. The image feature is extracted to calculate the confidence value of each player, and the temporal feature of the trajectory is utilized to distinguish the players from each other. Second, the event detection is achieved through the combination of the sequential motion and the team formation mapping. The sequential motion refers to the order of the events specified by the game rules. The team formation mapping describes the event feature from the perspective of the whole court space, not just the state of an individual player. Third, the temporal and spatial features are extracted respectively for different evaluation criteria. The event series feature refers to the subsequent events, which represent the consequence of the event, while the relative spatial filter utilizes the additional 3D location information. Therefore, the overall automatic data acquisition system combines the advantages of both the temporal and spatial information.

Time-vary fission filter based automatic initialization
In order to initialize the player tracker, the 3D positions of all the players are required, and the players in each team should be distinguished from each other. To estimate the location states of the players in one team, we propose a time-vary fission filter, which approximates the process of one distribution separating into six sub-distributions, that is, the state of one team separating into the states of six players.
First of all, let us describe the player motions at the game start scene. We name the two teams team-defense and team-offense. One round of the game starts when team-offense serves the ball. Before the serve event, both teams wait in a standby formation, which depends on the strategy of each team, although there are some common formations. In a common formation, each player stands in a fixed area. When the server of team-offense hits the ball, which marks the start of the game, the two teams move and adjust their formations based on the movements of the ball. After the ball moves over the net, team-defense adjusts its formation to organize an attack and team-offense prepares to change its formation to defense.
Based on this fact, we define the team state distribution to present the position state of the team and propose the time-vary fission filter to estimate the distribution of each player in order to initialize the player's tracker. The initial team state distribution is the rough position distribution of the whole team, and the distribution of each player is a sub-distribution fissured from the team state distribution. At the start scene, no matter where the individual players are located, the team state distribution only presents the probability of the team covering the court. As the game goes on, the team adjusts its formation and all the players are in dynamic states. Based on the player features extracted from the multi-view images, the distribution can be filtered by a weight of the player probability. As long as the player feature is reasonable, the distribution in the court space presents several peaks, which represent areas with a high probability of containing players. According to the number of peaks, the team distribution is separated into several sub-distributions, and each sub-distribution represents at least one player. This process is repeated at each time step until the six sub-distributions are filtered out. At this time, one sub-distribution represents the position state of one player, and the sub-distribution can be used directly to initialize the player tracker.
Compared with other tracker initialization methods, the advantage of the proposed time-vary fission filter lies in handling the occlusion and same appearance problems of players. The conventional works detect the targets with a trained model and use the detected results to initialize the tracker. In volleyball games, all the players wear the same uniform and are often occluded by others. The occlusions and similar appearance of targets limit the performance of detection. To deal with these specific problems in volleyball, the proposed time-vary fission filter uses the temporal continuity of the image feature to generate the distribution of each player. The distribution of one player is filtered out at the moment when the player is not occluded, and the filtered order is used as a label to distinguish the player from the others.
The detailed algorithms are introduced as below.

State definition and initialization
First, the team state X is defined as

$$X = \{\mathbf{z}, n, s\},$$

where the coordinate $\mathbf{z} = (x, y, z)$ represents the 3D position in the court space, n represents the number of players and s is the serial number of the sub-distribution. The values of n and s are integers from 1 to 6, since there are six players in each team. It can be assumed that one player, whose center position is located at (x, y, z), belongs to the s-th sub-distribution, and that there are n players in the s-th sub-distribution. Since the target distribution of the team state consists of both continuous and discrete elements, it is difficult to use one simple model to represent the high-dimensional state. In the proposed algorithm, a variety of team state samples are generated to approximate the distribution as precisely as possible. Therefore, we use the sample set $\{X^{(i)}\}$, with positions $\mathbf{z}^{(i)}$, to describe the team state. The initial distribution of the team state at time k = 0 follows

$$\mathbf{z}_0 \sim \mathcal{N}\big(\bar{\mathbf{z}},\ \mathrm{diag}(\tau_x, \tau_y, \tau_z)\big),$$

where $\bar{\mathbf{z}}$ is the given mean value of the distribution, which follows the approximated distribution of the standby team formation, and $\tau_x$, $\tau_y$ and $\tau_z$ are the Gaussian noise variances of the position term in the different directions. The value of $\bar{\mathbf{z}}$ is decided by referring to the game rules and statistical game data. In order to initialize the value of $\bar{\mathbf{z}}$, a sequential integer number m (m ∈ [0, 5]) is generated. Since there are six players in each team, the variable m, which runs from 0 to 5, is able to number them. In this research, we use a Gaussian distribution $H_m$ to represent the spatial probability of the m-th player. As preparation, some 3D player position data are manually collected and divided into six groups based on the different player roles. For the m-th player, the mean coordinate and the variance of the data are defined as the mean value $C_m$ and the variance $[\sigma_x^m, \sigma_y^m, \sigma_z^m]^T$ of the distribution $H_m$:

$$H_m = \mathcal{N}\big(C_m,\ \mathrm{diag}(\sigma_x^m, \sigma_y^m, \sigma_z^m)\big).$$

After the value of $\mathbf{z}_0$ is initialized, we assign $n_0 = 6$ and $s_0 = 0$. The physical meaning of the initialization is that at time k = 0, there is no prior information on the positions and identity labels of the players. In this situation, the players cannot be told apart, and we only know that they belong to the same team, so all of them belong to the 0-th sub-distribution and there is only one sub-distribution. The positions of the players are generated according to the standby team formation.
Up to here, a prior team distribution is obtained.
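As an illustration of how such a prior could be sampled, the Python sketch below draws a mean position from the role distribution H_m and then adds Gaussian position noise, while every sample starts with n = 6 and s = 0. The function name `init_team_samples`, the dictionary layout of a sample, and all numeric values are assumptions made for this example; in particular, the role means and variances are invented, not the statistics collected from game data.

```python
import math
import random


def init_team_samples(role_means, role_vars, taus, samples_per_role, seed=0):
    """Generate prior team state samples at k = 0.

    role_means / role_vars: per-role mean C_m and variance of H_m (x, y, z).
    taus: Gaussian noise variances (tau_x, tau_y, tau_z) of the position term.
    """
    rng = random.Random(seed)
    samples = []
    for mean, var in zip(role_means, role_vars):          # m = 0 .. 5
        for _ in range(samples_per_role):
            # Mean position drawn from the role distribution H_m.
            z_bar = [rng.gauss(c, math.sqrt(v)) for c, v in zip(mean, var)]
            # Position sample around that mean with variance tau.
            z = tuple(rng.gauss(zb, math.sqrt(t)) for zb, t in zip(z_bar, taus))
            # No identity information yet: one sub-distribution, six players.
            samples.append({"z": z, "n": 6, "s": 0})
    return samples


# Hypothetical standby formation on a 9 m x 9 m half-court (metres).
ROLE_MEANS = [(1.5, 2.0, 1.0), (4.5, 2.0, 1.0), (7.5, 2.0, 1.0),
              (1.5, 6.5, 1.0), (4.5, 6.5, 1.0), (7.5, 6.5, 1.0)]
ROLE_VARS = [(0.25, 0.25, 0.01)] * 6

prior = init_team_samples(ROLE_MEANS, ROLE_VARS, (0.04, 0.04, 0.01), 100)
print(len(prior))
```

Note that `random.gauss` expects a standard deviation, so the variances are converted with a square root before sampling.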

Confidence calculation
Our goal is to filter (separate) the prior distribution via the temporal video so that the precise distributions of all the players in the team can be obtained. The distribution, which is represented by a set of team state samples, is filtered based on the confidence value of each sample. The confidence value is calculated according to the image content from different views. We define the observation $I_k$ as the collection of image frames at discrete time k:

$$I_k = \{I_k^1, I_k^2, \dots, I_k^M\},$$

where M is the total number of views. Thus, the confidence $C(X_k^{(i)})$ of each sample is indicated as

$$C(X_k^{(i)}) = g\big(C(X_k^{(i)}; I_k^1), \dots, C(X_k^{(i)}; I_k^M)\big).$$

Here, $C(X_k^{(i)}; I_k^m)$ is defined as the "image confidence" of the i-th sample, estimated from the observation region in the m-th view at the state $X_k^{(i)}$, and g(·) is a function that combines the image likelihoods obtained from the cameras. We use three image features of the players: the color feature, the motion feature, and the body part categorization based feature.
For the color feature and the motion feature, we use the same features as the player tracking. With the prepared image templates, the distance between the observation region and the templates is used to represent the confidence of color. The motion feature $C_{motion}(X_k^{(i)}; I_k^m)$ is obtained according to the background subtraction.
In the team distribution, more than one player exists. To evaluate this confidence, factors related to the number of players are required. To cope with the severe occlusion of players, the number and the probability of the body parts are used.
Therefore, the body part category based confidence is calculated from the number and the probability of the detected body parts. Here, $N_{hand}$, $N_{head}$ and $N_{foot}$ are the total numbers of detected valid hands, heads and feet. The function d(·) calculates the confidence of the count depending on the deviation between the predicted number n and the detected one. These body parts are chosen because the numbers of heads, hands and feet directly determine the number of players. Other body parts, such as the torso, shoulder, arm and leg, are always overlapped and confused with each other when two or more players move close together.
$P_{head}^k$, $P_{hand}^k$ and $P_{foot}^k$ are the probabilities of the detected head, hand and foot, which are calculated from $P_{head}(I_k^m)$, $P_{hand}(I_k^m)$ and $P_{foot}(I_k^m)$, the probabilities of the detected valid head, hand and foot in the m-th view. Body part categorization is based on the hypothesis that the perspective image contents of one body part in different views should be detected as the same body part category, even if their appearances are different. In each view, we can obtain a region of interest for each team state sample by projecting the 3D position to the image.
Here we train a detector using convolutional network to detect the body parts and the probability. 10,000 labeled images are used. Based on the output probability, we could obtain a Heatmap of the body part. Based on the Heatmap, the position and the probability of the detected body part can be obtained. Then, we count the detected body parts in the distribution area to obtain the value N hand , N head , and N foot .
The total confidence $C(X_k^{(i)})$ is calculated from the color confidence, the motion confidence and the body part category based confidence.
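The roles of g(·) and d(·) can be sketched in Python as follows. Both forms are assumptions: the combination over views is taken to be a product, and d(·) is taken to be a Gaussian-shaped penalty on the count deviation. The function names, the `scale` parameter and the expected counts of one head, two hands and two feet per player are likewise illustrative choices, not the definitions used in this work.

```python
import math


def combine_views(per_view_confidences):
    """g(): combine the image confidences of all views (assumed: product)."""
    return math.prod(per_view_confidences)


def count_confidence(predicted, detected, scale=1.0):
    """d(): confidence that decays with the deviation between the predicted
    and the detected number of body parts (assumed: Gaussian shape)."""
    return math.exp(-((predicted - detected) ** 2) / (2.0 * scale ** 2))


def body_part_confidence(n, counts, probs, scale=1.0):
    """Body part based confidence for a sub-distribution holding n players.

    counts / probs: detected number and probability per part, keyed by
    "head", "hand" and "foot". Assumes 1 head, 2 hands, 2 feet per player.
    """
    expected = {"head": n, "hand": 2 * n, "foot": 2 * n}
    confidence = 1.0
    for part, expected_count in expected.items():
        confidence *= count_confidence(expected_count, counts[part], scale)
        confidence *= probs[part]
    return confidence


counts = {"head": 2, "hand": 3, "foot": 4}      # one hand is occluded
probs = {"head": 0.9, "hand": 0.8, "foot": 0.85}
print(body_part_confidence(2, counts, probs))
print(combine_views([0.7, 0.9, 0.8]))
```

Under these assumptions, a sample whose predicted player number matches the detected heads, hands and feet keeps a high confidence, while a mismatch in any part lowers it smoothly instead of rejecting the sample outright.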

Time-vary fission filter
Filtering at k = 0
Once the confidence of each sample is obtained, the posterior distribution can be filtered to approximate the true distribution of the team. Here, the normalized confidence weight of each sample is denoted as

$$\tilde{C}(X_0^{(i)}) = \frac{C(X_0^{(i)})}{\sum_{j} C(X_0^{(j)})}.$$

The filtering model is based on the Monte Carlo method and the Bayesian equation [46]. We randomly generate a number in (0, 1) and look up the corresponding state sample through the cumulative probability distribution of the confidence weights of the team state distribution. The theoretical basis of this step is that a team state sample with larger confidence has a higher probability of being filtered out for the posterior distribution. By repeating the random number generation and the team state sample filtering enough times, we obtain a new distribution $p(X_0; I_0)$. Yet this is not what we expect for the posterior distribution.
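The lookup through the cumulative distribution of the confidence weights is the usual multinomial resampling step of a particle filter. A minimal Python sketch is given below; the function name `resample` and the choice to draw as many samples as were given are assumptions of this example.

```python
import bisect
import itertools
import random


def resample(samples, confidences, rng=None):
    """Draw len(samples) samples with probability proportional to confidence."""
    rng = rng or random.Random()
    total = sum(confidences)
    if total <= 0:
        raise ValueError("at least one confidence must be positive")
    # Cumulative distribution of the normalized confidence weights.
    cumulative = list(itertools.accumulate(c / total for c in confidences))
    resampled = []
    for _ in range(len(samples)):
        u = rng.random()                                  # uniform in [0, 1)
        # First index whose cumulative weight exceeds u; the min() guards
        # against rounding when the last cumulative value is below 1.
        index = min(bisect.bisect_right(cumulative, u), len(samples) - 1)
        resampled.append(samples[index])
    return resampled


states = ["near net", "mid court", "back row"]
print(resample(states, [0.1, 0.7, 0.2], random.Random(42)))
```

Samples with zero confidence are never selected, and samples with large confidence tend to appear several times in the new set.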
Up to here, we have only filtered the position state; the player identity states are filtered by the following algorithm. First, all the samples are clustered by Mean Shift [47] according to their spatial 3D coordinates. Based on the clustering result, the team state distribution is separated into several sub-distributions $D_{K_\alpha}$, where $K_\alpha$ is the serial number of the cluster.
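The clustering step can be illustrated with a small flat-kernel mean shift written in plain Python. This is a simplified sketch, not the implementation of [47]: the function name `mean_shift`, the flat kernel, the rule of merging modes that lie within one bandwidth, and the example bandwidth are all assumptions.

```python
import math


def mean_shift(points, bandwidth, max_iter=50, tol=1e-4):
    """Cluster 3D points; returns (labels, centres).

    Each point is shifted to the mean of its neighbours within `bandwidth`
    until it stops moving; converged modes that lie close together are
    merged into one cluster.
    """
    modes = []
    for point in points:
        mode = list(point)
        for _ in range(max_iter):
            neighbours = [q for q in points if math.dist(mode, q) <= bandwidth]
            if not neighbours:                 # numerical safety guard
                break
            new_mode = [sum(axis) / len(neighbours) for axis in zip(*neighbours)]
            shift = math.dist(mode, new_mode)
            mode = new_mode
            if shift < tol:
                break
        modes.append(mode)

    labels, centres = [], []
    for mode in modes:
        for k, centre in enumerate(centres):
            if math.dist(mode, centre) < bandwidth:
                labels.append(k)               # joins an existing cluster
                break
        else:
            centres.append(mode)               # opens a new cluster
            labels.append(len(centres) - 1)
    return labels, centres


positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
             (5.0, 5.0, 0.0), (5.1, 5.0, 0.0)]
labels, centres = mean_shift(positions, bandwidth=1.0)
print(labels, len(centres))
```

Each resulting label plays the role of the cluster serial number, and the number of clusters is found by the algorithm rather than fixed in advance, which suits a team distribution that splits gradually.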
For the K α th sub-distribution, we assign the value of s (i) as: which means that every sample belonging to the K α th sub-distribution is assigned the s (i) value K α . Also, for the K α th sub-distribution, the projected image region can be determined from the 3D positions. We check the body part classification result, which is described in section Confidence calculation. From the detected numbers of heads, feet and hands, it can be known how many players are located in this area. The player number N K α is then assigned to the value n (i) :

System model at k = k + 1

When the team state distribution moves to the next time step, we simply assume that the players move randomly, and transfer the position state with Gaussian noise. So the system model is where {W k , k ∈ N } is the Gaussian noise term, which is defined as As for the player identity state,

Filtering at k > 0
For a sub-distribution in which n k = 0, we only filter the position state based on the sample confidence and keep the same identity information, since n k = 0 means that this player has already been distinguished from the other players. For a sub-distribution in which n k ≠ 0, both the position state and the identity state are filtered with the same algorithm as at k = 0.
For each frame, the team state distribution is processed iteratively according to the system model, the confidence calculation and the filtering process until the player number n k of every sub-distribution is 0. The set of distributions D K α can then be used to initialize the multiple player tracker.
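The per-frame iteration described above can be sketched as a small control loop. This is only an outline under stated assumptions: `diffuse` stands in for the Gaussian random-walk system model, while `count_players` and `step` are hypothetical callbacks that hide the image-based confidence calculation and the filtering, which are not reproduced here.

```python
import random


def diffuse(positions, sigma, rng):
    """System model: move every 3D position sample by zero-mean Gaussian noise."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in positions]


def run_fission(sub_distributions, count_players, step, max_steps=600):
    """Iterate frame by frame until every sub-distribution is resolved.

    count_players: hypothetical callback returning n_k for a sub-distribution
        (0 means the player has already been distinguished).
    step: hypothetical callback performing prediction, confidence calculation
        and filtering for one frame; returns the new sub-distributions.
    Returns (sub_distributions, number_of_frames_used).
    """
    for t_frame in range(max_steps):
        if all(count_players(d) == 0 for d in sub_distributions):
            return sub_distributions, t_frame
        sub_distributions = step(sub_distributions)
    return sub_distributions, max_steps
```

The returned frame count corresponds to the number of frames consumed before the tracker can be initialized; `max_steps` is an arbitrary safety limit added for the example.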

Simultaneously tracking and event detection
As mentioned in Sect. 3, the multiple players tracking [45] and the 3D ball tracking [32] are implemented to acquire the 3D physical data. Simultaneously, the ball event is also estimated. The algorithm is outlined as follows. First, for the player tracking, the serial number s obtained at player initialization is used to manage multiple targets in one team. C is the set of all numbers. Then each player's state is defined as z c k , and the state is transited as where k is a three-dimensional noise in the prediction model combining a Gaussian model and a least square fitting prediction model. T is the sampling time interval of the video. The observation of each player includes a color likelihood and a Sobel gradient likelihood. The estimated player positions are denoted as ẑ c 0:k . Second, the 3D ball tracking and the event change detection are handled in one model. The 3D ball state (including the event state) at a discrete time k is denoted as where y k is a pair of ball physical states including the three-dimensional position and velocity, d k is the motion of the ball and the last term e k is the event type. The dynamics of the ball physical state are modelled by a time difference equation: where the system noise term w k is a three-dimensional random vector that represents the external force added to the ball, which is a mixture of general noise and abrupt noise. The mixture of the noise w k and the transition of e k are decided by the prediction of d k . The observation is calculated based on the multi-view images and the past ball trajectory features. At each sampling time, the state ŷ 0:k estimated by the particle filter is the tracked 3D ball trajectory and event. The estimated state d̂ 0:k updates when the event changes. As for the observation and estimation method for the event type ê 0:k , we propose the team formation mapping method and the sequential ball motion feature, which are introduced in the following subsections.

Fig. 6 The conception of the team formation mapping method
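To make the role of the noise term w k in the ball dynamics concrete, the sketch below shows one prediction step of a generic constant-acceleration model. This is not the equation used in the paper, which is not reproduced in the text; treating w as an acceleration, including a gravity term, and the function name `predict_ball` are all assumptions of this example.

```python
def predict_ball(position, velocity, w, T, gravity=(0.0, 0.0, -9.81)):
    """One prediction step of an assumed constant-acceleration ball model.

    position, velocity, w: 3-tuples (meters, meters/second, meters/second^2).
    T: sampling time interval in seconds.
    """
    # Total acceleration = assumed gravity + external disturbance w.
    acc = tuple(g + wi for g, wi in zip(gravity, w))
    new_pos = tuple(p + v * T + 0.5 * a * T * T
                    for p, v, a in zip(position, velocity, acc))
    new_vel = tuple(v + a * T for v, a in zip(velocity, acc))
    return new_pos, new_vel
```

In a particle filter, each particle would draw its own w from the noise mixture, small values for ordinary flight and large values when a hit is predicted, before calling such a prediction step.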

Team formation mapping
Team formation refers to how the team is organized to strengthen its defensive or offensive activities. There are two situations for the defensive activity. When the team is going to handle the opponent's serve, the receiver moves to a position facing the ball's moving direction, and the others stand around, ready to save the ball, while the setter stands by in front of the net. When the team tries to stop the opponent's Attack, the blockers gather near the net and the others fill the gaps in the back court to dig the ball if necessary. In the offensive situation, when the team is going to perform the Attack, potential attackers move towards the net, and the setter usually hits the ball near the net to decide the attacker. Based on the discussion above, when different events happen, the team takes certain formations on the court. The team formation mapping method describes the feature of the team formation when a certain event happens. Fig. 6 shows the details of this method. The court is divided into zones of five rows and eight columns on the x − y plane, as Fig. 6(a) shows. The principle of the court division is based on the design of the court lines and the game rules, and is summarized as follows.
- The size of a half court is a square of 9 meters × 9 meters, and a 3-meter line is set to restrict the positions of attackers. The court inside the sidelines and end lines is divided into 6×3 grids with sides of 3 meters.
- Since the volleyball game allows hitting the ball outside the boundary, especially in an emergency, the area outside the court is also taken into consideration and divided every 3 meters along the court line.
- In the vertical scale, the number of players denotes the formation status and the height of a player reflects the player's action. Both of them are important features of the event.
The mapped matrix ϒ k is the descriptor corresponding to the three dimensions of the players' positions, consisting of two parts: the X-Y plane matrix E XY and the Z plane matrix E Z . Each element of the X-Y plane matrix denotes how many players are in the corresponding zone of the court, and each element of the Z plane matrix is the sum of these players' heights, as Fig. 6(b) shows. For the zone (a, b), the two elements are calculated by: As the feature vector that forms part of the input to the classifier for recognizing the event type, the X-Y plane matrix and the Z plane matrix are connected by row. The overall mapped matrix is then reshaped to a one-dimensional vector by connecting the end of one row to the head of the next row. Thus the reshaped feature vector is represented as:

Fig. 7 The definition of event series feature
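The mapping described above can be sketched as follows. The 3-meter zones and the 5 × 8 layout come from the court division in the text; the coordinate origin at the court center, the clamping of positions outside the mapped area, and the choice of stacking E XY above E Z before flattening are assumptions of this example.

```python
import math

ZONE = 3.0                   # zone side length in meters
N_ROWS, N_COLS = 5, 8        # five rows (y direction) by eight columns (x direction)
X_MIN, Y_MIN = -12.0, -7.5   # assumed origin at the court center


def formation_feature(players):
    """Map (x, y, z) player positions in meters to a flattened feature vector."""
    e_xy = [[0] * N_COLS for _ in range(N_ROWS)]     # player count per zone
    e_z = [[0.0] * N_COLS for _ in range(N_ROWS)]    # summed height per zone
    for x, y, z in players:
        col = min(max(int(math.floor((x - X_MIN) / ZONE)), 0), N_COLS - 1)
        row = min(max(int(math.floor((y - Y_MIN) / ZONE)), 0), N_ROWS - 1)
        e_xy[row][col] += 1
        e_z[row][col] += z
    # Connect the two matrices, then flatten row by row.
    stacked = e_xy + e_z
    return [value for r in stacked for value in r]
```

With these assumptions the feature has 2 × 5 × 8 = 80 elements; the first 40 are player counts and the last 40 are summed heights.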

Sequential ball motion feature
According to the rules of volleyball game, the order of each event is fixed. Based on the discussions in [48], the status of volleyball game is divided into three categories: Attack Process, Counterattack Process and Emergency.
1. Attack Process: The process for a team to deal with the opponent's Serve. This process goes from Serve and Reception to Set and then to Attack.
2. Counterattack Process: A series of events that starts from the opponent's Attack and ends when the team returns the ball. In some situations, Block cannot be performed successfully, so the process starts from a Dig.
3. Emergency: A receiver or digger fails to control the direction of the ball. Players usually try to dig the ball when the emergency happens.
The team formations of all the events follow regular patterns: for example, the Attack and Block are always performed near the net, and the hitting position tends to fall in a densely occupied area. The ball motion of a certain event is therefore highly related to that of the former events. Accordingly, the sequential ball motion feature is described as follows.
S j is defined as the ball motion state of the j th event counting from the Serve of each round. Since ŷ represents the estimated physical state of the ball, ŷ k = (x, ẋ), where x and ẋ are the position and velocity of the ball in three-dimensional space, S j is defined as: Thus the sequential ball motion state feature S seq for the j th event of one round is defined as When the target event is the Serve, since it is the first event of a round, the former ball motion states are set to zero.
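A sequential feature of this kind can be sketched as below. The number of former events that are concatenated is not stated in the text, so the parameter `n_former` is a hypothetical choice, as are the function names; only the zero-filling for the Serve follows directly from the description.

```python
def ball_motion_state(position, velocity):
    """S_j: 3D position followed by 3D velocity of the ball at the j-th event."""
    return list(position) + list(velocity)


def sequential_feature(states, j, n_former=1):
    """Concatenate the states of the n_former preceding events with S_j.

    states: list of S_j for one round, index 0 being the Serve.
    Former events that do not exist (before the Serve) are zero-filled.
    """
    dim = len(states[j])
    feature = []
    for idx in range(j - n_former, j + 1):
        feature.extend(states[idx] if idx >= 0 else [0.0] * dim)
    return feature
```

For the Serve (j = 0) the former part of the vector is all zeros, so the classifier still receives an input of fixed length.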

Relative spatial filter based quality evaluation
According to the summary of the evaluation basis in Table 1, although the bases of the events differ from each other, they can be roughly divided into two types. One refers to the subsequent event series to evaluate the event quality; the other requires additional information to make the judgment. Therefore, this work gives two proposals to obtain the overall evaluation results.

Event series feature
For each event, there are three potential results for the target event: hit by the home team, hit by the rival team and hit the ground. Based on these three results, we define two corresponding factors in the event series feature: the side of the court and whether the ball hits the ground. The definition of the event series feature is shown in Fig. 7. In this figure, the target event to be evaluated is the Attack, and the court where the attacker stands is regarded as the positive court. In this situation, the Attack is blocked by the opponent and the ball directly hits the ground, which indicates losing a point for the attacker's team. Here, the event series, which contains the event types and the courts they belong to, describes whether the Attack is blocked. Thus the judgment can be made by referring to the game rules.

Relative spatial filter
The relative spatial filter is designed specifically for the quality evaluation of Receive and Set, since the evaluation of these events requires additional information besides the event series. According to the summary in Table 1, the judgment of Receive is based on the organization of the Set, while the evaluation of Set needs the number of available opponent blockers who play against the following Attack. Therefore, two filters are proposed to deal with these two events: the filter for Receive quality evaluation and the filter for Set quality evaluation.

Filter for Receive quality evaluation
To evaluate the Receive quality, the hitting position of the following Set is the main basis. Two factors affect the setting condition. First, since the setting position is decided by how the receiver passes the ball, it is clearly related to the quality of the Receive. When the ball is hit close to the center of the court, the setter is about equally likely to pass the ball in different directions. Under this condition, the setter's team has a higher possibility of scoring a point, because it is difficult for the blockers of the rival team to judge which Attack they should block until the ball leaves the setter's hands. Second, the pose of the setter is related to the quality level. Two poses are generally used for setting the ball, the Bump and the Overhand. Basically, when using the Overhand pose, the setter has better control of the ball, so the Receive is considered to be better if the setter hits the ball using Overhand. The hitting points of these two poses are usually different and easy to distinguish. Specifically, the hit point of Overhand is commonly higher than that of Bump.
Based on the two factors that are related to the setting organization, a three-step spatial filter is used to evaluate the Receive quality, as shown in Fig. 8. Firstly, the hitting point of the Set is judged by the x−y constraint with three levels, following the principle that the closer the hit is to the court center, the better it is. Secondly, the hitting point is judged by the horizontal constraint, which refers to the relationship between the hitting points of Overhand and Bump. This step also has three levels, decided by the x-z constraint. Thirdly, to get the final judgment result, we take the first judgment as the basis, and if the level in step 2 is worse, the result is decreased by a corresponding level. The final result is the quality of the corresponding Receive.
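The three-step filter can be sketched as below. All threshold values are hypothetical placeholders, the reference point `center_xy` is passed in because the text does not give its coordinates, and step 3 is read here as lowering the result to the step-2 level whenever step 2 is worse; the paper may combine the two levels differently.

```python
import math


def level_by_distance(dist, thresholds=(1.5, 3.0)):
    """Step 1: 0 (best), 1 or 2 (worst); a smaller distance is better."""
    good, fair = thresholds
    if dist <= good:
        return 0
    return 1 if dist <= fair else 2


def level_by_height(z, thresholds=(1.8, 1.2)):
    """Step 2: 0 (best), 1 or 2 (worst); a higher hit (Overhand-like) is better."""
    high, low = thresholds
    if z >= high:
        return 0
    return 1 if z >= low else 2


def receive_quality(set_hit, center_xy):
    """Step 3: start from the step-1 level and lower it if step 2 is worse."""
    x, y, z = set_hit
    dist = math.hypot(x - center_xy[0], y - center_xy[1])
    level_xy = level_by_distance(dist)
    level_z = level_by_height(z)
    return max(level_xy, level_z)   # larger number means lower quality
```

For example, a Set hit close to the reference point but at a low, Bump-like height is downgraded by the second step even though the first step rates it as best.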

Filter for Set quality evaluation
To obtain the quality of Set, the number of opponent blockers against the following Attack is the key factor. The method for finding the available blockers is based on the following assumption: a blocker must be close enough to the ball and jump in front of the net at a certain moment to make it possible to block the ball. If the motion of a player does not meet these conditions, the player is assumed not to be an available blocker.
Based on this assumption, the moment at which to search for the blockers must be determined first. There are two situations after the ball is attacked. In the first, a blocker hits the ball, and the moment is chosen as the Block hit point. In the second, the moment is chosen as the time when the ball has crossed the middle line and is not yet far from the net. After the search moment is decided, the available blockers are filtered out by the height of their torso, which refers to the value of their z-direction position, and by their relative distance to the ball.
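The blocker search at the chosen moment can be sketched as a simple spatial filter. The two thresholds and the use of the horizontal distance to the ball are assumptions of this example; the paper does not list the values it uses.

```python
import math


def available_blockers(players, ball, min_torso_height=1.6, max_distance=1.5):
    """Count opponents who are high enough and close enough to the ball.

    players: list of (x, y, z) torso positions of the opposing team at the
        chosen search moment; ball: (x, y, z) ball position at that moment.
    Both thresholds are hypothetical values, not taken from the paper.
    """
    count = 0
    for px, py, pz in players:
        jumping = pz >= min_torso_height                       # torso height test
        near_ball = math.hypot(px - ball[0], py - ball[1]) <= max_distance
        if jumping and near_ball:
            count += 1
    return count
```

The returned count is the quantity on which the Set evaluation is based: the fewer available blockers the following Attack faces, the better the Set.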

Experiment environment and data set
The source videos of the experiments in this work record the Semifinal Game and Final Game of the 2014 Japan Inter High School Games of Men's Volleyball in Tokyo Metropolitan Gymnasium. The camera views chosen for the test data set are at the four corners of the court, as shown in Fig. 9. The resolution of all videos is 1920×1080, with a frame rate of 60 frames per second and a shutter speed of 0.001 second.
The experiment is executed in the following environment: the CPU is an Intel Core i7-3770, the RAM is 8GB, the compiler is Visual Studio 2017, and the external libraries are OpenCV-3.4.1 and GSL-1.6. In the implementation, we set the same parameters for all the test samples to create a universal method that fits different conditions well.

Data set for player tracker initialization
The automatic initialization is the post-process after the detection of the play scene. Therefore we cut the entire game sequence manually into different rounds, and each round starts at the detected frame of the play scene. Generally, most sequences start at the serve event, when the server throws the ball and is about to hit it. As for the distribution of the player formation, all the players except the server stand by in a fixed team formation. After the server hits the ball, all the players start to move and change the team formation. In our sequences, there are 286 rounds in total from 5 sets (three sets in the semifinal game and two sets in the final game).

Data set for event detection and evaluation
For each round, the trajectory information is acquired by the tracker initialization, the ball tracking [32] and the players tracking [45]. Then, among the correctly tracked rounds, we labeled the event types manually. The overview of the data set is summarized in Table 3. The total numbers of available Serve, Receive, Set, Attack, Block, Dig and Free Ball events are 24, 36, 100, 118, 47, 82, and 8, respectively. Note that the data set includes very few Free Ball events, so it is hard to conduct experiments on them. Besides, as discussed in previous chapters, this kind of event is not typical in general volleyball analysis. So the experiment does not take this event into consideration.
Corresponding to the overall framework, the experimental results contain three parts: the initialization of player tracker, the event type detection and the quality evaluation. Since these three parts are based on different kinds of methods, their evaluation criteria are different as well.

Evaluation criteria
As discussed in Sect. 2, the conventional works on tracker initialization are detection based methods, which process only one frame and are evaluated by the detection accuracy. However, the proposed algorithm in this article should be evaluated on both the time and the space scale. Therefore, the conventional evaluation method of tracker initialization cannot be used to evaluate our algorithm. In the experiments, the evaluation criteria are defined as follows.
First, on the time scale, the time efficiency of the algorithm is evaluated by the number of used frames T frame , which represents how many frames are used before the tracker is initialized. In theory, the smaller the number of used frames, the higher the performance of the algorithm on the time scale. Second, on the space scale, we calculate how many players are correctly initialized after the initialization process has finished. As long as the main body part of a player can be distinguished to initialize the tracker, we assume this player is correctly initialized. Each sequence has 12 participating players, so the total number of players to be initialized is 12 × 286 = 3432.

Experimental result
Table 4 lists the experimental results of the proposed time-vary fission filter based automatic initialization for the player tracker. For each round of the game, the initialization of the tracker is evaluated by two factors. The first is whether the tracker is initialized successfully. If, when the iteration of the time-vary fission filter has finished, the twelve trackers of all the players are initialized and the tracking process continues smoothly, the round is regarded as a "successful round". Otherwise, the round is defined as a "failed round". The second evaluation factor applies only to "successful rounds": according to the value of T frame , the "successful rounds" are classified into four types. Table 4 shows that only 5.9% of the rounds fail, and more than 70% of the rounds can be initialized within 1 second (that is, the value of T frame is less than 60 frames). Due to the high frame rate, the players' positions are almost unchanged. A long processing/iteration time is caused by severe occlusion between players who share a similar appearance. There are two possible ways to reduce the processing time and improve the performance of the automatic tracker initialization. First, since the moving state of the players is not random, there is still much room to improve the current model. Our subsequent study focuses on learning a semantic distribution for representing the players' moving state. Second, we could improve the model of confidence calculation to make it robust to the similar appearances of the targets.
To show the detailed processing and the intermediate results of the time-vary fission filter, we draw all the sampling points on one view. The results are shown in Fig. 10, for which we choose four frames during the process. At T frame = 0, as Fig. 10a shows, there is only one distribution representing the state of the whole team. The different colors represent different teams. As Fig. 10b shows, the distribution is split into sub-distributions according to the confidence value. Fig. 10c shows that even when occlusion happens, the split sub-distributions are not confused with each other. In Fig. 10d, all the players are distinguished, and the tracker then has sufficient conditions to be initialized and to work.
Although a comparison with other algorithms is not shown in the experiments due to the lack of suitable conventional work, the advantages of the proposed algorithm can still be observed. In Fig. 10(a), severe occlusion occurs between players with similar appearance. In this situation, it is hard to distinguish each player even by human eyes, let alone by other algorithms. From the conceptual aspect and this example, the proposed algorithm handles the occlusion and same-appearance problems better than the detection based methods.

Experiment result for event detection
The experiment of this step includes three parts. First, the idea of conventional work [41] is implemented and applied to the data set for recognizing six kinds of event types. Then the experiment combining conventional method and proposed method, the team formation mapping method, is conducted. Finally, the sequential motion state feature is combined with the former two methods to get a better result. The experiment results of the three parts are shown in Table 5.
Compared with the conventional work, which has a low ability to distinguish Receive, Set and Dig, a much better result is achieved by adding the team formation feature. This shows that the team formation mapping method offers a good description of the moving tendency of the players of both teams on the court, which is related to the purpose of each event. However, this method does not fully solve the problem of distinguishing those similar events, so a further experiment is conducted. For the result of combining the two methods, the improvement of each criterion is up to 20% compared with the conventional work. This experiment also achieves a better result than that with formation mapping alone, especially for Set and Block. This means that the sequential feature is able to distinguish events with similar hitting points, because referring to the former events corresponds to the principle used for classifying events.

Experimental result of event quality evaluation
For the event quality evaluation, there is no conventional work even for a similar target, as stated in Sect. 2. To evaluate the proposed algorithm, we define the concept of success rate as: Total_events means the number of patterns for the target event type. As explained before, the quality evaluation not only relies on the following events, but also relates to additional information, including the available blockers and the setting positions. For the cases that can be judged by the following events, the correct detection refers to the number of detected event series. Here, an event series is regarded as a correct detection if all the events of the series, including the target event, are correctly detected. For the quality of Receive, which is judged by the setting point, a case is a correct detection if the judging result is the same as the ground truth. As for the quality of Set, which relates to the number of available blockers, an event is a correct detection when the number of available blockers is correctly detected. The experiment result for the quality evaluation part is shown in Table 6. It can be seen that the success rate of obtaining the quality of each target event is over 95%.
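The success rate described above can be read as the share of correct detections among all patterns of the target event type. The sketch below assumes exactly this ratio, expressed in percent; the function name and the example counts are hypothetical and are not results from the paper.

```python
def success_rate(correct_detections, total_events):
    """Success rate in percent: correct detections over all target events."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 <= correct_detections <= total_events:
        raise ValueError("correct_detections must lie in [0, total_events]")
    return 100.0 * correct_detections / total_events


# Hypothetical example: 97 correctly evaluated events out of 100.
example_rate = success_rate(97, 100)
```

An average over event types can then be obtained by computing this rate per event type and taking the mean of the resulting values.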

Analysis of the algorithm performance
Firstly, for the evaluation of the automatic player initialization method, 5.9% of the sequences fail. We analyzed the failed sequences and summarized them into two main categories. In the first, the round is very short, for example when the served ball does not cross the net or flies out of the court. In this situation, the team formation changes so little that some players standing close to each other cannot be distinguished. For this kind of round, as long as we know who the server is, the analysis of the player trajectories and events is not necessary. In the second, two or three players cannot be distinguished. The main reason is that our algorithm for calculating the player confidence fails, especially when the body part categorization provides a wrong detection result.
Secondly, for the result of event type detection, although the average accuracy reaches over 98%, there are still some abnormal cases that the current methods fail to recognize correctly. These cases mainly happen between Set and Dig. In such a case, the feature of the Dig is very close to that of a Set, as the player only tries to adjust the direction of the ball. The reason to define it as a Dig is that it is followed by a Set. Since the methods of this work only refer to current and historical information, wrong historical information negatively affects the performance. Thus, this problem is not so crucial in the implementation of an automatic system. Thirdly, as the event series feature relies on the event type detection result, a wrong event detection also has a negative effect on the result of the quality evaluation. The basis for judging the type is that, in the image, the player trying to save the ball obviously loses control of it. This kind of case belongs to the emergency situation, which is included in the definition of Dig. Since the current features only refer to the previous trajectories of the ball and the players, it is hard to recognize the potential failure of the events. However, the pose of the player indicates the situation of a forced hit, so the player pose could be used to strengthen the ability to recognize these cases.

Analysis of the processing time performance
Since the automatic Data Volley system is expected to be applied in real games, the time performance is also an important factor of the system efficiency. In this article, all the algorithms are implemented without considering the time efficiency, since the accuracy has the highest priority. On the current platform (no powerful server is used), the processing time of the time-vary fission filter is 20∼30 seconds per time step. At each time step, four images (of size 1920×1080) are processed. For the event detection and evaluation, the processing itself is very fast, but there is a large delay since the features can only be extracted after the round has finished. For this task, there is little room to improve the time performance. Therefore, achieving a real-time implementation of the tracker initialization is a topic for further work. We have implemented real-time multiple-player tracking on GPU, whose processing time is 14.43∼16.01 milliseconds per time step. Based on this, the proposed algorithm has a large potential to achieve a real-time system.
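The real-time claim can be checked with simple arithmetic against the 60 frames-per-second videos used in the experiments: one time step has to finish within 1000/60 ≈ 16.67 milliseconds. The helper names below are hypothetical and the calculation only compares the timings quoted in this section with that budget.

```python
FRAME_RATE = 60.0  # frames per second of the experiment videos


def frame_budget_ms(fps=FRAME_RATE):
    """Time available per frame in milliseconds."""
    return 1000.0 / fps


def is_real_time(processing_ms, fps=FRAME_RATE):
    """True if one time step fits into the per-frame budget."""
    return processing_ms <= frame_budget_ms(fps)


def required_speedup(processing_ms, fps=FRAME_RATE):
    """Factor by which processing must be accelerated to reach real time."""
    return processing_ms / frame_budget_ms(fps)


gpu_tracking_ok = is_real_time(16.01)               # slowest reported GPU tracking step
init_speedup_needed = required_speedup(20_000.0)    # fastest reported initialization step
```

Under these numbers the GPU tracker stays within the frame budget, while the fission filter at 20 seconds per step would need a speed-up on the order of a thousand times to run at the video frame rate.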

Analysis of the generalization performance
Besides the automatic Data Volley, the data acquisition methods proposed in this article can also be generalized to other sports or tasks. Firstly, the automatic tracker initialization method can be used for player tracking in any sport. In addition, by modifying the image feature for the target, the time-vary fission filter can be generalized to other multiple object tracking tasks. The only constraint is that the number of tracked objects must be fixed. Secondly, the idea of team formation mapping can be used for event detection in sports whose event definitions are related to the team formation.

Conclusion
To achieve the automatic data volley system, this paper combines spatial and temporal features and proposes the time-vary fission filter based tracker initialization, the team formation mapping with sequential motion based event detection, and the relative spatial filter based quality evaluation. Firstly, by approximating the temporal variation of the distribution of the team (and of the players belonging to it), the tracker can be initialized automatically without using absolute player positions. Secondly, the team formation mapping with the sequential motion represents the event feature using both the temporal event relation feature and the spatial feature of the team organization. Finally, using the relationship between the event types and the volleyball game process, the event quality is evaluated by referring to relative position relationships. The experimental results show that 94.1% of the rounds are successfully initialized, the event type detection achieves an accuracy of 98.72%, and the success rate of obtaining the events' quality reaches 97.27% on average.
Our future work covers two aspects. First, by combining the current work with game strategy knowledge and rules, a comprehensive automatic Data Volley system consisting of game data acquisition and tactics development is our future topic. Furthermore, based on the achievements of the GPU real-time acceleration [45,49], a real-time and low-delay automatic data volley analysis system is expected to support TV content broadcasting in major international events such as the Olympic Games.