Policy decision of curling in real competition scenes

Policy decision of curling refers to providing strategy suggestions for curling competition with the help of computers. Existing curling agents have achieved good results in digital scenarios but cannot make correct decisions when applied to actual competition and training scenes. In this paper, we propose a strategy decision agent for real scenes. The competition situation is acquired by a Situation-Aware Network and mapped by a Digital Extraction module. We designed Curling MCTS to explore the best strategy in continuous space. The effectiveness of our framework has been verified by experiments and evaluated by China’s wheelchair curling team at China Disabled Sports Management Center. With the help of our system, China’s wheelchair curling team trained effectively and won the championship in the XIII Paralympic Winter Games (2022, Beijing). In addition, a new curling target detection dataset is provided.


Introduction
Curling is an official Winter Olympic Games event that combines physical strength and intelligence. At first, researchers used sensors and other hardware to analyze curling games. Wang Xuefeng et al. [1] and Wang Rencheng et al. [2,3] both proposed obtaining the status of athletes and curling stones in competition with the help of sensors. The former set six sensing nodes on the curling stones, stirrups, players, etc., and the latter only set motion sensors on the curling stone to obtain its position and motion status. The intelligent curling robot "Curly" designed by Won [4,5] and the ice-sweeping robot designed by Choi [6,7] have performed well against humans, which has drawn more attention to the development of curling intelligence. However, the accuracy of sensor-based methods degrades with sensing distance, and the sensors can affect the athletes and the trajectory of the stone. The robot-based methods also strongly affect the ice surface and can only be used in human-computer matches. Therefore, to provide practical strategic suggestions in human-human curling competition, it is urgent to find an intelligent solution that does not interfere with the players and the field.
Meanwhile, some methods of analyzing curling based on computer vision have emerged. This kind of method has no impact on the curling sports scene. Shujun Zhang et al. [8] and Wenjia Li et al. [9] have used object detection methods to restore curling trajectories. Nevertheless, these vision-based works can only extract the state of the stones and the competition but cannot give a hitting strategy. Curling is a strategy-oriented sport, and decision-making plays a decisive role in the final performance.
In recent years, there have been mainly two kinds of research on curling decision-making. One is to have experts provide strategic advice based on statistics, and the other is to use computer algorithms. The curling data collection and analysis system proposed by Wang Xuefeng et al. [1], the curling simulation system named "Digital Curling" designed by Ito et al. [10] in 2015, and the portable digital scorebook (ICE) designed by Fumito Masui et al. [11] can record game information for subsequent curling strategy development [12,13]. However, few professional coaches are available to help athletes make decisions.
Improving the decision-making ability of computers has gone through the following stages. Because the spaces of actions and states are continuous, it is impossible to enumerate all actions. In 2015, Yamamoto et al. [14] discretized the digital curling state-action data before obtaining strategy suggestions, but the computational volume was too large to allow deeper computations. Then, with the development of deep learning and reinforcement learning, discrete zero-sum games such as Go [15,16] and chess [17] were gradually solved with great success. In these games, all possible states and actions are enumerated, and strategy selection is performed by maximizing cumulative rewards with the help of reinforcement learning. In 2016, Yee et al. [18] first attempted to employ MCTS to solve the strategy decision problem of curling. Since then, more and more works have started to use reinforcement learning for strategy analysis of curling [19-21]. These methods are divided into two categories: one is deep reinforcement learning and the other is pure reinforcement learning. However, this research is still in the experimental stage and has not been applied in real competitions. Significant modal differences exist between digital curling data and actual curling data in real curling scenarios. Furthermore, given the difficulty of data collection and the uncertainty (blocking, friction, etc.) of competitions, researchers cannot directly apply the existing strategy decisions for digital curling to the actual curling game. This paper obtains the curling game status with the help of a computer-vision target detection algorithm to perform decision analysis of the real curling game without interfering with the players. We investigate a strategy generation method for actual curling games, which is composed of four modules.
The Situation-Aware Network perceives the competition situation; the Digital Extraction module solves the inapplicability problem caused by the data gap; the Policy-Value Network extracts effective information from previous competition data. Furthermore, we propose a Curling Monte Carlo Tree Search algorithm to solve the problem of deterministic discretization. In addition, we provide a dataset for target positions in real curling scenarios.
This paper is structured as follows: "Introduction" is the introduction, "Related work" introduces the work related to this paper, "Network model design" presents the design of the network, "Methods" introduces the main methods of this framework, "Experiments" introduces the dataset and experimental results, and "Conclusion" is the conclusion.

Curling detection
The complex state of the actual curling field makes positioning difficult, so accurate positioning cannot be achieved with traditional methods. Object detection based on deep learning can identify object classes while obtaining the pixel-level location of object centers by learning abstract representations. It has become the mainstream method for determining the category and location of objects in images.
Shujun Zhang et al. [8] applied YOLO v2 to curling detection in their curling intelligent tracking system. Wenjia Li et al. [9] performed curling position detection by improving the YOLO v3 model. Anguo Wu et al. [38] used TensorRT and network sparsification to optimize YOLO v5 and achieved good results.

Monte Carlo tree search (MCTS)
Monte Carlo is a statistical method invented by von Neumann and Ulam in the 1940s, whose essence is random sampling: the more samples, the closer the estimate is to the optimal solution. In zero-sum games, it is often possible to find a good move by randomly sampling as many playouts as possible with the help of Monte Carlo methods. In 1993, Brügmann et al. proposed Gobble [39] and applied Monte Carlo methods to 9 × 9 Go: a move is played at random and its win rate is estimated through a large number of games. Gobble reached a level of about 25 kyu on the 9 × 9 board with almost no expert knowledge. In 2007, Rémi Coulom et al. [40] proposed Monte-Carlo Tree Search (MCTS) in Crazy Stone, which combines the precision of tree search with the generality of Monte Carlo random sampling, and applied it to the game of Go. AlphaGo [15] trained two neural networks to implement the policy function and the value function according to the moves of human experts, which improved the exploration efficiency and precision of the Monte-Carlo tree search. AlphaGo enabled computers to outperform human players in the game of Go for the first time. Since then, MCTS has become the core method for solving zero-sum decision problems.
At the same time, researchers gradually started to study MCTS in continuous state spaces. In 2011, Couëtoux et al. [41] applied double progressive widening (DPW) to improve state transitions: as the number of visits to a state increases, the number of generated child states increases slowly. Weinstein et al. [42] proposed a weighted upper confidence bound to improve MCTS by expanding adjacent states using sum functions.
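The progressive-widening idea mentioned above can be illustrated with a small sketch. It assumes the commonly used rule that a node visited n times may hold at most ceil(c · n^alpha) children; the function names and the constants `c` and `alpha` are illustrative assumptions, not values taken from [41].

```python
import math


def may_add_child(num_children, num_visits, c=1.0, alpha=0.5):
    """Return True if progressive widening allows one more child.

    Assumed rule: a node with n visits may hold at most ceil(c * n**alpha)
    children, so the child count grows sublinearly with the visit count.
    """
    limit = math.ceil(c * max(num_visits, 1) ** alpha)
    return num_children < limit


def children_over_time(total_visits, c=1.0, alpha=0.5):
    """Simulate how many children a node has after each visit."""
    children = 0
    history = []
    for visit in range(1, total_visits + 1):
        if may_add_child(children, visit, c, alpha):
            children += 1  # expand a newly sampled action or state
        history.append(children)
    return history


if __name__ == "__main__":
    history = children_over_time(100)
    print("children after 4, 25, 100 visits:", history[3], history[24], history[99])
```

With these assumed constants, the number of children grows roughly like the square root of the visit count, which is the slow growth the text describes.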

Digital curling policy recommendation
In curling games, there is a large space of continuous actions, such as the angle, offset, speed, and rotation direction of the stroke. The space of game states is also continuous. So the continuous Monte Carlo tree search is suitable for curling policy recommendations.
In 2015, Yamamoto et al. [14] transformed the continuous decision problem into a discrete problem using a probability distribution. The current curling policy and value were calculated with the help of a game tree to get the best shot in the current state, but no deeper calculations were performed. Ahmad et al. [43] proposed a sampling method based on Delaunay triangulation in a study of the last shot of curling, which significantly reduced the time required to compute a policy.
Since then, with the development of reinforcement learning, more and more work has focused on the analysis of curling strategies. These methods are divided into two categories. One is to explore strategies using only reinforcement learning. In 2016, Yee et al. [18] improved MCTS with kernel regression (KR) and kernel density estimation (KDE), yielding a method named KR-UCT. KR-UCT selects and expands nodes effectively by using neighborhood information in continuous action spaces and was validated on a digital curling dataset. In 2017, Ohto et al. [44] considered the similarity of the best action among similar states and solved the curling decision problem based on MCTS. This kind of method relies on the exploration ability of reinforcement learning and is not limited by data, but it needs enormous computational power and time.
With the development of deep learning, data-driven neural networks were introduced for reinforcement learning in curling. This kind of method uses deep neural networks to sense the environment, which can improve the calculation speed. In 2018, Yamamoto et al. [19] proposed a static evaluation function based on a deep neural network and developed a new digital curling system named Jiritsu-nn, reducing the computational cost of large-scale game tree search. Lee et al. [20] proposed a new self-play reinforcement learning framework, KR-DL-UCT. KR-DL-UCT searches the continuous action space by kernel regression and is trained by supervised learning and self-play reinforcement learning, which improved the speed and accuracy of continuous Monte Carlo tree search and won the international digital curling competition. Han et al. [21] combined NFSP and KR-UCT in a digital curling game. NFSP uses two adversary learning networks to produce supervised data, and KR-UCT can be used for large game tree search in continuous action space. This framework proposed two reward mechanisms to enable fast convergence of reinforcement learning, and the trained model achieved higher win rates in international digital curling competitions. The methods above use neural networks as the parameter structure to optimize the decision tree, which saves rollout time.

Curling situation-aware network
The curling tournament dataset was collected and labeled, and the object detection algorithm was trained on it. Deep neural networks perform well in object detection, so we design the curling Situation-Aware Network to detect the location of curling stones. The ratio of the stone diameter to the field length is only 1:146, so existing detection networks easily miss such small objects because shallow-layer features are lost.
We improved YOLO-v4 to make it more suitable for curling detection. We retain shallow features with the help of residual networks and remove the spatial pyramid pooling to prevent the loss of detailed semantic information. The input data size is 416 × 416 × 3. There are five residual modules after one convolutional layer. The outputs of the last three residual modules each undergo further feature extraction and finally yield the object detection results. The design of the network is shown in Fig. 1. Conv2DLeaky and Conv2DReLU indicate the use of Leaky ReLU and ReLU activation functions after convolution, respectively. The CIoU loss is used as the loss function and is defined as formula 1.
where IoU is the IoU loss proposed by Yu et al. [45], $b$ and $b^{gt}$ denote the central points of the predicted box and the ground-truth box, $\rho(b, b^{gt})$ is the Euclidean distance between them, $r$ is the diagonal length of the smallest enclosing box covering the two boxes, and $\alpha$ is calculated by the following equation.
The term $v$ is calculated by Eq. 3. Here $\omega$, $h$ and $\omega^{gt}$, $h^{gt}$ represent the width and height of the predicted box and the ground-truth box, respectively.
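As a minimal sketch of how these terms combine, the function below computes the CIoU loss for one pair of boxes. It assumes the standard CIoU formulation, 1 − IoU + ρ²/r² + αv with α = v / ((1 − IoU) + v) and v = (4/π²)(arctan(ω^gt/h^gt) − arctan(ω/h))², which may differ in detail from the paper's formulas 1 to 3. The box format (cx, cy, w, h) and the name `ciou_loss` are assumptions for illustration.

```python
import math


def ciou_loss(pred, gt, eps=1e-9):
    """CIoU loss for two axis-aligned boxes given as (cx, cy, w, h).

    Assumes the standard CIoU formulation; not taken from the paper's code.
    """
    px, py, pw, ph = pred
    gx, gy, gw, gh = gt

    # Corner coordinates of both boxes.
    p_x1, p_y1, p_x2, p_y2 = px - pw / 2, py - ph / 2, px + pw / 2, py + ph / 2
    g_x1, g_y1, g_x2, g_y2 = gx - gw / 2, gy - gh / 2, gx + gw / 2, gy + gh / 2

    # Intersection over union.
    inter_w = max(0.0, min(p_x2, g_x2) - max(p_x1, g_x1))
    inter_h = max(0.0, min(p_y2, g_y2) - max(p_y1, g_y1))
    inter = inter_w * inter_h
    union = pw * ph + gw * gh - inter
    iou = inter / (union + eps)

    # Squared distance between centers: rho^2(b, b_gt).
    rho2 = (px - gx) ** 2 + (py - gy) ** 2

    # Squared diagonal of the smallest enclosing box: r^2.
    enc_w = max(p_x2, g_x2) - min(p_x1, g_x1)
    enc_h = max(p_y2, g_y2) - min(p_y1, g_y1)
    r2 = enc_w ** 2 + enc_h ** 2

    # Aspect-ratio consistency term v and its weight alpha.
    v = (4.0 / math.pi ** 2) * (math.atan(gw / gh) - math.atan(pw / ph)) ** 2
    alpha = v / ((1.0 - iou) + v + eps)

    return 1.0 - iou + rho2 / (r2 + eps) + alpha * v


if __name__ == "__main__":
    print(ciou_loss((10, 10, 4, 4), (10, 10, 4, 4)))  # identical boxes, close to 0
    print(ciou_loss((0, 0, 2, 2), (10, 0, 2, 2)))     # disjoint boxes, above 1
```

The center-distance term keeps the loss informative even when the boxes do not overlap, which matters for small objects such as curling stones.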

Policy-value network
We use a convolutional neural network for imitation learning and then calculate the value of the curling state. The feature extraction capability of the neural network helps us reduce the depth and breadth of the search tree. The main structure is a policy-value network: the value network evaluates the state, and the policy network is used to sample actions. The locations of the stones and the related match information in AyumuGAT16 are input into the network, and convolutional layers learn the hidden features. The overall architecture of the network is shown in Fig. 2.
The input data size of the Policy-Value Network is 32 × 32 × 29, containing information that can affect policy. The content and format of the input data are detailed in Appendices 7. The policy-value network contains one shared layer and two output heads, the policy and value heads. The value network shares the parameters in shared layers with the policy network. In the shared layers, there is one convolutional layer and nine residual blocks. Each residual block has two convolutional layers and one normalizing layer. The convolution uses 3 × 3 filters with stride 1, and the activation function is ReLU.
The policy head has three convolutional layers, and the output size is 32 × 32 × 2, representing the probability of throwing at 32 × 32 positions. The two layers indicate whether the curling throwing direction is counterclockwise or clockwise. All the probabilities of the policy sum up to 1. The value head consists of a 3 × 3 convolutional layer, followed by two fully connected layers with 256 and 17 nodes, respectively. In a curling game, both teams throw eight stones in turn. The output of the value network is the probability of getting 17 score results because the scoring range is −8 to 8 according to the rules of curling. When evaluating the state, the highest probability score is the value.
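The output formats described above can be decoded as in the following sketch. It assumes the policy output is a NumPy array of shape (32, 32, 2) and the value output is a vector of 17 probabilities ordered from −8 to 8; the function names and the meaning of each spin index are assumptions, since the text does not fix the channel order.

```python
import numpy as np

GRID = 32                    # 32 x 32 candidate target positions
SPINS = 2                    # two rotation directions (channel order is an assumption)
SCORES = np.arange(-8, 9)    # the 17 possible scores, from -8 to 8


def decode_policy(policy):
    """Return (row, col, spin) of the most probable throw.

    `policy` has shape (32, 32, 2) and sums to 1 over all 2048 entries.
    """
    assert policy.shape == (GRID, GRID, SPINS)
    row, col, spin = np.unravel_index(np.argmax(policy), policy.shape)
    return int(row), int(col), int(spin)


def decode_value(value_probs):
    """Return the score whose predicted probability is highest."""
    assert value_probs.shape == (len(SCORES),)
    return int(SCORES[np.argmax(value_probs)])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    policy = rng.random((GRID, GRID, SPINS))
    policy /= policy.sum()                   # normalize so probabilities sum to 1
    value = rng.random(len(SCORES))
    value /= value.sum()
    print(decode_policy(policy), decode_value(value))
```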

Behavior cloning
When a randomly initialized policy-value network carries out self-play directly, its behavior is highly stochastic and it takes a long time to make reasonable judgments. Therefore, we first train the initial policy-value network by imitation on game records containing 400,000 digital state-action pairs to shorten the training process.
Denote the current state of the match by $s_t$. Following the setting of 3.2, the state is input into the neural network, which outputs a probability vector $p_t = [\pi(1 \mid s_t, \theta), \ldots, \pi(2048 \mid s_t, \theta)] \in (0,1)^{2048}$. $y_t^{policy}$ is the one-hot encoding of the actual position of the stone and serves as the true label. The loss function of the policy network is the cross-entropy loss, denoted by $L_p$.
$v_t$ is the final predicted score, $y_t^{value}$ denotes the actual match score, and the loss function of the value network is denoted by $L_v$.
The value network is trained simultaneously with the policy network, sharing the network parameters, and the overall loss of imitation learning is defined as $L_{bc} = \alpha L_p + \beta L_v + \lambda L_{reg}$. In the actual training process, $\alpha = \beta = 1.0$, and the L2 regularization parameter $\lambda$ is 0.0001. Imitation learning is limited by data. If $s_t$ does not appear in the training data, the action $a_t$ chosen by the policy network will be poor, leading to accumulated errors and eventually a lost game. The action space of curling is continuous, and it is difficult to list all possible actions, so reinforcement learning is needed to explore new actions. The policy-value network trained by behavior cloning was used for self-play, and the results were used to further train the network, improving its discriminative ability by using rewards to guide action selection.
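The combined behavior-cloning loss $L_{bc} = \alpha L_p + \beta L_v + \lambda L_{reg}$ can be sketched for a single training sample as follows. The policy term is the cross-entropy named in the text. The form of $L_v$ is not given in this text, so the sketch assumes a cross-entropy over the 17 score classes; the function names and the toy weights are also illustrative assumptions.

```python
import numpy as np


def cross_entropy(probs, target_index, eps=1e-12):
    """Cross-entropy between a probability vector and a one-hot label."""
    return -float(np.log(probs[target_index] + eps))


def behavior_cloning_loss(policy_probs, policy_target, value_probs, true_score,
                          weights, alpha=1.0, beta=1.0, lam=1e-4):
    """L_bc = alpha * L_p + beta * L_v + lam * L_reg for one sample.

    policy_probs : 2048 probabilities from the policy head
    policy_target: index (0..2047) of the action actually played
    value_probs  : 17 probabilities for scores -8..8
    true_score   : actual score in [-8, 8]
    weights      : list of parameter arrays used for L2 regularization
    """
    l_p = cross_entropy(policy_probs, policy_target)
    # ASSUMPTION: value loss is a cross-entropy over the 17 score classes.
    l_v = cross_entropy(value_probs, true_score + 8)
    l_reg = sum(float(np.sum(w ** 2)) for w in weights)
    return alpha * l_p + beta * l_v + lam * l_reg


if __name__ == "__main__":
    policy = np.full(2048, 1.0 / 2048)   # uniform policy as a toy example
    value = np.full(17, 1.0 / 17)        # uniform score distribution
    toy_weights = [np.ones(10)]          # hypothetical parameters
    print(behavior_cloning_loss(policy, 100, value, 2, toy_weights))
```

With α = β = 1.0 and λ = 0.0001 as stated in the text, the regularization term only mildly penalizes large weights relative to the two prediction losses.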

Self-play reinforcement learning
It is not easy to enumerate all possible actions in the huge continuous space. The action space is usually discretized to narrow the action search space down to a limited order of magnitude. Each simulated throw then requires less time, so the MCTS can search more actions in a limited time, which improves the accuracy of action selection in self-play. However, deterministic discretization of the continuous curling space leads to information loss and a strong bias. Because candidate positions become denser closer to the tee, discretization causes different action choices to be treated as the same action in policy selection, which affects the judgment of the decision. Even a small deviation has a huge impact on the game result. Therefore, when performing reinforcement learning in self-play, we use kernel regression-based upper confidence bounds (KR-UCT) to explore and evaluate action choices in the continuous action space. The curling action-state search tree is constructed from the action selection results during self-play. The Policy-Value Network is trained using MCTS-based self-play and generates the probability of each action and score. MCTS calculates the value of each action node and selects the best action as the current action, after which the parameters of the Policy-Value Network continue to be updated. Our search follows the four stages of Monte Carlo tree search: selection, expansion, simulation, and backpropagation. The specific process is illustrated in Algorithm 1 and described in detail below.

Algorithm 1 Curling MCTS
(1) Selection $s_t$ is the current state (line 1). The trained Policy-Value Network evaluates the policy probability density $WP_t$ and the value probability density $WV_t$ of $s_t$. If the current state is terminal, the score with the highest probability is the current state score (line 5). It is impossible to consider all actions, because the number of alternative actions is huge and the search tree would be infinitely wide. According to the policy network, k actions are selected from the candidate action set according to their probability (line 8). The value of k is 20. When the number of current action selections is less than k, the current action is simulated (line 12), and the score of the new state is calculated (line 13).
When selecting actions, we must explore the more meaningful results first, so we simulate the action with fewer visits first (line 24). Then the new state is obtained by the state transfer function (line 25), and the score of the next state $s_{t+1}$ is calculated (line 26).
(2) Expansion The training data are limited and discretized to specific location points, but we cannot restrict the selection of actions to this finite order of magnitude. Any function can be made linearly separable when extended to an infinite dimension. Therefore, this nonlinear problem can be solved by mapping to an infinite dimension with the help of the Gaussian kernel function. We thus extend the set of candidate actions with the help of kernel regression (line 16). Kernel regression is a nonparametric estimation method that uses the kernel function as weights to estimate the conditional expectation of a random variable. The goal is to increase the number of children and share information between similar actions when the node is repeatedly visited.
The KR-UCT formula 6, the heart of the MCTS, is used to select the action with the maximal estimated reward.
Here a and b are two discrete actions. For action a, the kernel regression estimate $E[v_a \mid a]$ is the expectation of the score, which is defined by formula 7, and the kernel density estimate $W(a)$ is calculated by formula 8. The constant C is set to 0.1 to control the exploration-exploitation tradeoff. $K(a, b)$ is the Gaussian probability density, and $n_b$ is the number of visits of action b.
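The selection rule can be sketched as below. The sketch follows the KR-UCT formulation of Yee et al. [18] as it is commonly stated, which is an assumption about formulas 6 to 8: $E[v \mid a] = \sum_b K(a,b) v_b n_b / \sum_b K(a,b) n_b$, $W(a) = \sum_b K(a,b) n_b$, and the selected action maximizes $E[v \mid a] + C \sqrt{\log \sum_b W(b) / W(a)}$. The unnormalized Gaussian kernel, the `bandwidth` parameter and the function name are illustrative assumptions.

```python
import numpy as np


def kr_uct_select(actions, mean_values, visits, c=0.1, bandwidth=1.0):
    """Pick the index of the action maximizing the KR-UCT score.

    actions     : (n, d) array of candidate actions
    mean_values : (n,) average value observed for each action
    visits      : (n,) visit counts; every candidate is assumed visited >= 1
    """
    actions = np.asarray(actions, dtype=float)
    mean_values = np.asarray(mean_values, dtype=float)
    visits = np.asarray(visits, dtype=float)

    # Gaussian kernel K(a, b) between every pair of actions (unnormalized).
    diff = actions[:, None, :] - actions[None, :, :]
    kernel = np.exp(-0.5 * np.sum(diff ** 2, axis=-1) / bandwidth ** 2)

    # Kernel density W(a): visit counts shared between similar actions.
    density = kernel @ visits
    # Kernel regression E[v | a]: visit-weighted average of nearby values.
    expected = (kernel @ (mean_values * visits)) / density
    # Exploration bonus shrinks as the local density of visits grows.
    bonus = c * np.sqrt(np.log(density.sum()) / density)

    scores = expected + bonus
    return int(np.argmax(scores)), scores


if __name__ == "__main__":
    candidates = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0]]
    best, scores = kr_uct_select(candidates, [0.6, 0.5, 0.55], [30, 20, 2])
    print(best, scores)
```

Because nearby actions share their statistics through the kernel, a rarely tried action next to a well-explored one inherits most of its estimate, while an isolated action keeps a large exploration bonus.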
(3) Simulation The initialized k nodes (line 13) and the extending nodes (line 26) must be estimated. The evaluation process is the simulation process, where the nodes are input to the value network to get 17 different probability values of [−8, 8] points. This probability value is also used to update the value of its parent node. When the node is the root node, the highest value is considered as the score of this point by one-hot coding (line 6).

Decision-making recommendation pipeline
The trained policy-value network cannot directly provide strategies for competition, because detection only gives the pixel position of each stone in the image, not its actual position on the field. We therefore designed a Digital Extraction module to solve this problem.
The mapping matrix is denoted as $M = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{bmatrix}$.
Given the pixel positions of four points and their mapped world coordinates, M and the constant c can be calculated. If the camera stays at the same angle, M and c are computed only once, when processing the first picture. The location (x, y) of a stone in the picture can be mapped to its location (x', y') on the real site through the mapping matrix M and the constant c.
The area between the hog line and the background of the curling field is called the effective area. All points in this area can be projected, and the effective area in the picture is mapped to a size of 4.75 m × 11.28 m. After magnifying on the scale of 1:200 in units of meters: pixels, the effective area in the picture is mapped to the size of 475 × 1128 × 3.
Then (x', y') is projected to the corresponding location on the digitized playground at a scale of 1:200 in units of meters: pixels. The related information, such as the colors, is then extracted structurally and transformed into the form required by the policy-value network.
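The four-point mapping described above is a planar perspective transform, and a minimal NumPy sketch is given here. It assumes the column-vector convention [x'c, y'c, c]ᵀ = M [x, y, 1]ᵀ with $a_{33}$ fixed to 1, which may differ from the convention used in the paper. The function names and the example pixel coordinates are hypothetical.

```python
import numpy as np


def fit_mapping_matrix(pixel_pts, world_pts):
    """Solve the 3x3 mapping matrix M from four point correspondences.

    ASSUMPTION: a33 is fixed to 1, leaving eight unknowns that the four
    (pixel, world) pairs determine through an 8x8 linear system.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(pixel_pts, world_pts):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.append(v)
    params = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(params, 1.0).reshape(3, 3)


def map_point(m, x, y):
    """Map a pixel position (x, y) to field coordinates (x', y')."""
    u, v, c = m @ np.array([x, y, 1.0])
    return u / c, v / c   # divide by the homogeneous constant c


if __name__ == "__main__":
    # Hypothetical pixel corners of the effective area seen from the stands.
    pixel = [(100, 400), (540, 400), (400, 100), (240, 100)]
    # Corresponding corners of the 4.75 m x 11.28 m effective area.
    world = [(0.0, 0.0), (4.75, 0.0), (4.75, 11.28), (0.0, 11.28)]
    m = fit_mapping_matrix(pixel, world)
    print(map_point(m, 320, 250))   # a detected stone center in pixels
```

As the text notes, M only needs to be fitted once per fixed camera angle; afterwards every detected stone center is mapped with a single matrix product.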
The game state is detected by the Situation-Aware Network and then extracted structurally and transformed into the form required by the policy-value network by the Digital Extraction module. We initialize two policy-value networks that have been trained by "Behavior cloning": one is denoted $P_{player}$ and the other $P_{opponent}$. $P_{player}$ is the agent, controlled by the policy network using the latest parameters. The score of each game $u_t$ is used as the reward to update the parameters of $P_{player}$. $P_{opponent}$ is equivalent to the environment and responds to the actions of $P_{player}$. The policy network also controls the moves of $P_{opponent}$, but its parameters are always fixed. The policy gradient is the gradient of the state-value function $V(s; \theta)$, which can be approximated by the unbiased estimate $\frac{\partial \log \pi(a_t \mid s_t, \theta)}{\partial \theta} \cdot Q_\pi(s_t, a_t)$, where $Q_\pi(s_t, a_t)$ is the expectation of future rewards $E[U_t \mid s_t, a_t]$. Replacing the random variable $U_t$ with its observed value $u_t$, the policy gradient is approximated by $\frac{\partial \log \pi(a_t \mid s_t, \theta)}{\partial \theta} \cdot u_t$. The action probability and value are then calculated through the Curling MCTS described in "Self-play reinforcement learning". Curling_UCT expands 500 nodes and finally selects the action with the maximum number of visits as the suggested strategy. The pipeline is shown in Fig. 4.
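The policy-gradient approximation $\frac{\partial \log \pi(a_t \mid s_t, \theta)}{\partial \theta} \cdot u_t$ can be illustrated on a toy model. The sketch below replaces the convolutional policy network with a linear softmax policy, which is purely an assumption made to keep the example short; all names and dimensions are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def policy_gradient_estimate(theta, state, action, u_t):
    """Estimate the policy gradient as grad(log pi(a_t | s_t, theta)) * u_t.

    ASSUMPTION: toy linear softmax policy with logits = state @ theta,
    where theta has shape (n_features, n_actions).
    """
    probs = softmax(state @ theta)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # For a softmax policy, d log pi(a) / d logits = one_hot(a) - probs.
    grad_log_pi = np.outer(state, one_hot - probs)
    return grad_log_pi * u_t


def update_player(theta, state, action, u_t, learning_rate=0.01):
    """One gradient-ascent step on the parameters of the learning player."""
    return theta + learning_rate * policy_gradient_estimate(theta, state, action, u_t)


if __name__ == "__main__":
    theta = np.zeros((3, 4))
    state = np.array([1.0, 0.0, 2.0])
    before = softmax(state @ theta)[2]
    theta = update_player(theta, state, action=2, u_t=3.0)
    after = softmax(state @ theta)[2]
    print(before, after)   # the rewarded action becomes more probable
```

A positive observed score raises the probability of the action that was played, and a negative score lowers it, which is how the game result guides the player network while the opponent's parameters stay fixed.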

Introduction to curling
In a curling game, the two teams throw eight stones each in turn, aiming to place their stones inside the house. The score a team receives in a round is the number of its stones that are closer to the tee than any opponent stone. The score of the whole game is the sum of the scores of the eight rounds. Under these rules, we make assisted decisions for curling games by choosing the best action to play for the given game state. The target position of a stone can be anywhere in the house, so curling throwing is a problem with a continuous action space.
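The scoring rule above can be written as a short function. The sketch places the tee at the origin and returns a signed score (positive for team 0, negative for team 1), which matches the −8 to 8 range used by the value head. The house radius of 1.829 m and stone radius of 0.145 m are assumed default parameters, and the data layout is hypothetical.

```python
import math


def score_round(stones, house_radius=1.829, stone_radius=0.145):
    """Score one round from the final stone positions.

    stones: list of (team, x, y), team in {0, 1}, coordinates in meters
            relative to the tee at (0, 0).
    Returns a signed score: positive for team 0, negative for team 1.
    ASSUMPTION: a stone counts if any part of it touches the house.
    """
    in_house = sorted(
        (math.hypot(x, y), team)
        for team, x, y in stones
        if math.hypot(x, y) <= house_radius + stone_radius
    )
    if not in_house:
        return 0   # blank round: no stone in the house

    winner = in_house[0][1]
    count = 0
    for _, team in in_house:
        if team != winner:
            break   # first opponent stone ends the count
        count += 1
    return count if winner == 0 else -count


if __name__ == "__main__":
    final_stones = [(0, 0.1, 0.0), (0, 0.5, 0.0), (1, 0.8, 0.0), (0, 1.0, 0.0)]
    print(score_round(final_stones))   # team 0 has two stones closer than any team 1 stone
```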

Curling datasets
Game situation analysis for real curling games requires a large amount of data to train the network, but no suitable dataset exists. We therefore collected and labelled a curling dataset to promote progress in curling detection. We used JVC GZ-RY980HAC cameras to record curling competition data at the China Disabled Sports Management Center. The camera installation, chosen according to the structure of the area, is shown in Fig. 5. Three video capture devices were set up in the upper stands to collect oblique overhead view data of the six layers. We collected game records of the 8th Special Olympic Games and practice videos of the national wheelchair curling team. In total, 2k representative curling game images were selected, and more than 10k curling instances were labelled with LabelMe, including categories and precise 2D localization. A dataset example is shown in Fig. 6.
The dataset used for supervised learning in reinforcement learning is GAT-UEC-16. GAT (Game AI Tournament) is a comprehensive academic competition on the theme of game AI, sponsored by the Cognitive Science and Entertainment Research Station at the University of Electro-Communications and the Game Informatics Research Group of the Information Processing Society of Japan. The UEC Cup, held within GAT, is the biggest competition to decide the strongest digital curling AI of the year. We used 400,000 state-action pairs from this dataset for behavior cloning.

Experimental results and discussion
In the curling situation-aware experiments, the default hyperparameters are as follows: the batch size is 4, and the learning rate grows linearly from 1e−4 to 1e−3 and then transitions to a cosine decay. In the training stage of the behavior learning network, the learning rate was initialized at 1e−2 and reduced subsequently.
The accuracy of the Situation-Aware Network limits the quality of the strategy suggestions. Therefore, the Situation-Aware Network was tested and verified on our curling detection dataset, and the experimental results are shown in Fig. 7. The x-axis represents the number of training epochs, and the y-axis represents the change in the corresponding metric as training proceeds. The model converges by the 60th epoch. The dataset was divided into a training set and a testing set at a ratio of 7:3. On the testing set, mAP@0.5 reaches 99.7 and mAP@0.5:0.95 reaches 90.1.
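The learning-rate schedule stated for the Situation-Aware Network (linear growth from 1e−4 to 1e−3 followed by cosine decay) can be sketched as follows. The number of warm-up steps, the total number of steps and the final learning rate are not given in the text, so `warmup_steps`, `total_steps` and `lr_end` are assumed parameters.

```python
import math


def learning_rate(step, total_steps, warmup_steps,
                  lr_start=1e-4, lr_peak=1e-3, lr_end=0.0):
    """Linear warm-up from lr_start to lr_peak, then cosine decay to lr_end.

    ASSUMPTION: warmup_steps, total_steps and lr_end are illustrative;
    only the 1e-4 and 1e-3 endpoints come from the text.
    """
    if step < warmup_steps:
        return lr_start + (lr_peak - lr_start) * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return lr_end + 0.5 * (lr_peak - lr_end) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 50, 100, 550, 1000):
        print(step, learning_rate(step, total_steps=1000, warmup_steps=100))
```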
Our detection network was compared with other curling detection networks, and the comparison results are shown in Table 1. Our method has the highest mAP, and its speed is relatively fast. This is because our method retains shallow features with the help of residual networks. Furthermore, our model removes the spatial pyramid pooling to prevent the loss of detailed semantic information, which improves the accuracy of detecting small curling stones.
Our framework is the first work to make strategic decisions in a real scenario, so there is no existing work against which the full strategy analysis can be compared. We therefore took the strategy suggestion part separately and played a 100-round competition against several other decision algorithms trained on GAT-UEC-16. The results are shown in Table 2. We achieve a higher winning rate against the other two models because our Curling MCTS can explore new action spaces via separate alternative options. It is worth mentioning that our decision-making method has a lower winning rate when compared to NFSP [21]. This is because NFSP uses two adversary learning networks to produce supervised data automatically.
According to the velocity and direction suggestions generated by MCTS, we simulated the route of the stone in "Digital Curling" [10]. The striking process is visualized as a GIF, which can be compared with the strategies chosen by athletes during the actual games. We tested our framework on the competition data of the 11th Paralympic Games and the 8th Special Olympics. Figure 8 shows the analysis data of the test report, giving the percentage of suggested strategies that are better or worse than the real choices of professional curling athletes. Only 12 percent of the recommendations are inferior to the athletes' choices, which is caused by occlusion by the athletes. We have selected several specific evaluation examples for demonstration in Fig. 9. The results show that our model is applicable to curling decisions on different tracks and from different angles. It can also give complex hitting suggestions such as Double Takeout and Hit and Roll.

Conclusion
In this paper, we proposed a curling policy decision model that takes the actual curling competition as its object. To bridge the modality gap between actual curling and digital curling, we proposed the Situation-Aware Network to perceive the competition situation and the Digital Extraction module for location mapping. We designed Curling MCTS to explore actions in continuous space. The experimental results indicate that this algorithm can suggest policies that help athletes make timely and effective tactical adjustments without any interference. In the Policy-Value Network of this work, we treat the state space as discrete for computational efficiency, which leads to identical policies for similar states. In future work, we will introduce polar coordinates to address this challenge.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.