
1 Introduction

RoboCup, introduced by Kitano et al. [15], serves as a central problem in the understanding and development of artificial intelligence. The challenge aims at developing a team of autonomous robots capable of playing soccer in a dynamic environment. It requires the development of collective intelligence and an ability to interact with the surroundings for effective control and decision making. Over the years, several humanoid robots [8, 9, 21] have participated in the challenge.

One of the main hurdles identified within the tournament is perceiving the soccer ball. Efficient detection of the soccer ball relies on how well the vision system tracks the ball. For instance, consider cases where the ball disappears or is occluded from the robot's point of view for a few frames. In such situations, the current frame alone is not useful. However, relying on the history of frames can help in making a proper move. In this work, we propose an approach that effectively utilizes the history of ball movement to improve ball detection. We first take the encoder-decoder architecture of the SweatyNet model and train it to detect the ball in single images. We then use it as part of our proposed layers and learn from temporal sequences of images, thereby developing a more robust detection system. In our approach, we make use of three spatio-temporal models: TCN [2], ConvLSTM [26] and ConvGRU [3].

For this work, we recorded a new dataset for the soccer ball detection task. We make our data as well as our implementation available on GitHub (footnote 1) so that the results can be easily reproduced. We used PyTorch [19] for our implementation.

Fig. 1. The proposed architecture with feed-forward and temporal parts.

2 Related Work

Numerous works address soccer ball detection. Before RoboCup 2015 the ball was orange, and many teams used color information [22]. Since RoboCup 2015 the ball is no longer color coded, which forced teams to use more sophisticated learning-based approaches such as a HOG cascade classifier [7]. In recent years, convolutional approaches, with their innate ability to capture equivariance and hierarchical features in images, have emerged as a favorite choice for the task. In [23] the authors use a CNN to localize the soccer ball by predicting its x and y coordinates. The recent work [17] uses proposal generators to estimate regions containing the soccer ball and then uses a CNN to classify the regions. In [13] the authors compare various CNN architectures, namely LeNet, SqueezeNet, and GoogLeNet, for ball detection by humanoid robots. In [21] the authors, inspired by the work of [20], propose a Fully Convolutional Network (FCN) that offers robust detection with low inference time, which is an essential requirement for the soccer challenge. As the name suggests, an FCN is composed entirely of convolutional layers, which allows it to learn a path from pixels in the first layers to pixels in the deeper layers and to produce an output in the spatial domain, making the FCN architecture a natural choice for pixel-wise problems like object localization or image segmentation. In [12] the authors use geometric properties of the scene to create a graph-structured FCN. In [6] the authors propose a modification of the U-Net [20] architecture that removes the skip connections from encoder to decoder and uses depthwise separable convolutions. This improves inference time and makes the model a good choice for real-time systems.

The existing work uses only the current frame for the detection of the soccer ball. We hypothesize that the history of frames (a coherent sequence of previous frames) could help the model make a better prediction, especially in cases where the ball disappears or is missed for a few frames. To support our hypothesis, we extend our experiments and use temporal sequences of images. A crucial element of processing continuous temporal sequences is to encode consensual information in the spatial and temporal domains simultaneously. Several methods allow extracting spatio-temporal video features, such as the widely used Dense Trajectories [25], where densely sampled points are tracked based on information from the optical flow field and describe local information along the temporal and spatial axes. In [4] the authors propose the Two-Stream Inflated 3D ConvNet (I3D), where convolution filters expanded into 3D let the network learn seamless video features in both domains. For predicting object movement in video, Farazi et al. propose a model based on a frequency domain representation [10]. One of the recent methods for modeling temporal data is the temporal convolutional network (TCN) [16]. The critical advantage of a TCN is the representation gained by applying a hierarchy of dilated causal convolution layers on the temporal domain, which successfully captures long-range dependencies. It also provides a faster inference time than recurrent networks, which makes it suitable for real-time applications.

Additionally, there are successful end-to-end recurrent networks which can leverage correlations within sequential data [5, 11, 26]. ConvLSTM [26] and ConvGRU [3] are recurrent architectures that use convolutions to determine the future state of a cell from its local neighbors instead of the entire input.

In this work, we propose a CNN architecture which utilizes sequences of ball movements in order to improve the task of soccer ball detection in challenging scenarios.

3 Detection Models

3.1 Single Image Detection

In this paper, the task of soccer ball detection is formulated as a binary pixel-wise classification problem, where for a given image, the feed-forward model predicts the heatmap corresponding to the soccer ball. In this part we utilize three feed-forward models namely SweatyNet-1, SweatyNet-2 and SweatyNet-3 as proposed in [21].

All three networks are based on an encoder-decoder design. SweatyNet-1 consists of five blocks in the encoder part and two blocks in the decoder part. In the encoder part, the first block includes one layer, and the number of filters is doubled after every block. In the decoder part, both blocks contain three layers. Each layer comprises a convolutional operator followed by batch normalization and ReLU as the non-linearity. In addition, bilinear upsampling is used twice: after the last block of the encoder and after the first block of the decoder. Skip connections are added between layers of the encoder and decoder to provide high-resolution details of the input to the decoder. Similar approaches have been successfully used in SegNet [1], V-Net [18] and U-Net [20].

All convolutional filters across the layers have a fixed size of \(3\times 3\). The encoder part includes four max-pooling layers, one after each of the first four blocks. The number of filters in the first layer is eight, and it is doubled after every max-pooling layer. In the decoder, the number of filters is reduced by a factor of two before every upsampling layer.

The other two variants of SweatyNet are designed to reduce the number of parameters and speed up inference. In SweatyNet-2, the number of parameters is reduced by removing the first layer in each of the last five blocks of the encoder. In SweatyNet-3, the number of channels is decreased by changing the size of the convolutions to \(1\times 1\) in the first layer of each of the last five encoder blocks and of both decoder blocks.

Fig. 2. The prediction results on the synthetically generated sequences. The network correctly predicts the future position and successfully preserves the size of a slow-moving ball with \(\sigma =4\) even when the history is sparse. Note that a sparse history resembles an occluded ball.

Fig. 3. Visualization of (a) a stack of causal convolutional layers which compose the TCN architecture and (b) a convolutional LSTM cell.

3.2 Detection in a Sequence

Temporal extensions capture spatio-temporal interdependence in the sequence and allow predicting the movement of the ball while correctly capturing its size, direction, and speed. In our experiments, we utilize temporal series of images to further improve soccer ball detection.

Our approach, illustrated in Fig. 1, introduces a temporal layer and a learnable weight w, which make use of a fixed-length history of frames to predict the probability map of the soccer ball. We use a feed-forward layer, TCN, and compare it with the recurrent layers ConvLSTM and ConvGRU. The three approaches differ in the type of connections formed in the network.

We train our model to learn heatmaps of a ball based on the sequence of frames representing the history of its movement. More precisely, if the timestamp of the current frame is t, given the heatmaps from \((t-h)\) to \((t-1)\) the output of the network is the sequence of heatmaps from timestamp t to \((t+p)\), where h is the history length and p is the length of predicted sequence.
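The indexing above can be illustrated with a minimal sketch that cuts one recorded sequence into (history, target) pairs. The function name `make_windows` is hypothetical, integer frame indices stand in for heatmaps, and the target is read inclusively as frames t to t+p (that is, p+1 frames), which is one possible reading of the text rather than a confirmed detail of the implementation.

```python
def make_windows(frames, h, p):
    """Split one sequence into (history, target) training pairs.

    For a current timestamp t, the history covers frames t-h .. t-1 and the
    target covers frames t .. t+p (inclusive reading, an assumption).
    """
    pairs = []
    # t must leave h frames before it and p frames after it.
    for t in range(h, len(frames) - p):
        history = frames[t - h:t]        # h frames: t-h .. t-1
        target = frames[t:t + p + 1]     # p+1 frames: t .. t+p
        pairs.append((history, target))
    return pairs


# Frame indices stand in for heatmaps in this toy example.
pairs = make_windows(list(range(60)), h=20, p=1)
first_history, first_target = pairs[0]
```

With a 60-frame sequence, h = 20 and p = 1, the first pair uses frames 0 to 19 as history and frames 20 and 21 as target.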

The ConvLSTM and ConvGRU layers are stacks of several convolutional LSTM and GRU cells, respectively, which allow capturing spatial as well as temporal correlations. Each ConvLSTM cell acts based on the input, forget and output gates, while the core information is stored in the memory cell controlled by these gates. Each ConvGRU cell adaptively captures time dependencies over various time ranges based on update and reset gates. The convolutional structure avoids the use of redundant, non-local spatial data and reduces computation. Figure 3 depicts the structure of a convolutional LSTM cell, where the input is a set of flattened 1D arrays of image features obtained with the convolutional layers. Likewise, a convolutional GRU cell differs from a standard GRU cell only in how the input is passed to it.

Unlike the two recurrent models, where gated units control hidden states, TCN hidden states are intrinsically temporal. This is attributed to the dilated causal convolutions used in TCN, which generate temporally structured states without explicitly modeling connections between them. Thus, TCN captures long-term temporal dependencies in a simple feed-forward network architecture. This feature further provides the advantage of a faster inference time. Figure 3 shows dilated causal convolutions for dilations 1, 2, 4, and 8. For our work, we replicated the original TCN-ED structure with repeated blocks of dilated convolution layers and normalized ReLU as activation functions.
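To make the effect of stacking dilations 1, 2, 4, and 8 concrete, the following is a minimal one-dimensional sketch in plain Python. It is not the TCN-ED implementation: the kernel size of 2, the all-ones weights, and the helper names `causal_dilated_conv1d` and `receptive_field` are assumptions chosen only to show how far back in time the top layer can see.

```python
def causal_dilated_conv1d(x, weights, dilation):
    """y[t] = sum_k weights[k] * x[t - k * dilation], with zeros before t = 0."""
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:              # causal: never look at future samples
                acc += w * x[idx]
        y.append(acc)
    return y


def receptive_field(kernel_size, dilations):
    """Number of time steps (current one included) seen by the top layer."""
    return 1 + (kernel_size - 1) * sum(dilations)


# An impulse at t = 0 passed through four layers with all-ones kernels of size 2.
signal = [1.0] + [0.0] * 19
for d in (1, 2, 4, 8):
    signal = causal_dilated_conv1d(signal, [1.0, 1.0], d)

rf = receptive_field(2, (1, 2, 4, 8))
```

Under these assumptions the impulse spreads over exactly `rf` consecutive time steps, so four layers already cover a history of 16 frames without any recurrent connection.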

For sequential data, it is challenging to train a network from scratch because of the limited size of the dataset and the difficulties in collecting the real data. Besides, the training process requires more memory to store a batch of sequences, resulting in a choice of smaller batch size. To address this problem, we use transfer learning and finetune the weights of our model on the sequences of synthetic data. We use SweatyNet-1 as the feature extractor and finetune it with the temporal layers.

For the input to the temporal layers (TCN, ConvLSTM, and ConvGRU), we also take advantage of high-resolution spatial information by concatenating the outputs of the \(2^{nd}\) and \(6^{th}\) blocks of SweatyNet-1. To speed up the training process and propagate spatial information, we apply a convolution of size \(7\times 7\) to the combined features. Moreover, we take the element-wise product of the output of this convolution with a learnable weight w and add it to the output of SweatyNet. This combination serves as the input to the temporal layers. The weight w serves as a gate which learns to control how much high-resolution information is transferred from the early layers of SweatyNet, and it helps the network detect the soccer ball with sub-pixel accuracy.
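The gating step can be sketched as follows. This is only an illustration of the arithmetic: w is treated as a single scalar, the feature maps are tiny nested lists, and the function name `fuse` and all numeric values are hypothetical. In the actual model both inputs would be tensors and w would be learned.

```python
def fuse(sweaty_out, hires_features, w):
    """Return sweaty_out + w * hires_features, element by element."""
    return [
        [s + w * f for s, f in zip(s_row, f_row)]
        for s_row, f_row in zip(sweaty_out, hires_features)
    ]


# Hypothetical 2x2 maps: SweatyNet output and the 7x7-convolved early features.
sweaty_out = [[0.0, 0.25], [0.5, 0.125]]
hires_features = [[1.0, 0.0], [0.0, 2.0]]

temporal_input = fuse(sweaty_out, hires_features, w=0.5)
```

With w = 0 the temporal layers would see only the SweatyNet output, and larger values of w let more of the early, high-resolution features through.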

Fig. 4. The result of the temporal part, trained on a dataset with one ball per frame. Note that the network can generalize to detect two moving objects.

Fig. 5. Qualitative results of the trained network in detecting two balls: (a) SweatyNet prediction, (b) residual information, (c) ground truth, (d) temporal prediction, (e) real image.

4 Experiments

In this section, we describe the details of the training process for our two sets of experiments. In the first experiment, we consider the problem of localizing the object in an image. In the second experiment, we evaluate our temporal approach. The evaluation of our experiments is discussed in Sect. 4.3.

4.1 Training

Detection in an Image: For our work, we created a dataset of 4562 images, of which 4152 images contain a soccer ball. We refer to it as SoccerData. The images are extracted from a video recorded from the robot's point of view and are manually annotated using the imagetagger library (footnote 2). The images are from three different fields with different light sources. Note that since the data is recorded on a walking robot, many images are blurry.

Each annotated ball is represented by a bounding box with coordinates \(x_{min}, y_{min}, x_{max}, y_{max}\). As the teaching signal, we generate a binormal probability distribution centered at \(c = 0.5\,(x_{max}+x_{min},\, y_{max}+y_{min})\) with a variance of \(r= 0.5 \min (x_{max}-x_{min},\, y_{max}-y_{min})\). In contrast to the work of [21], where the authors consider a ball of fixed radius, we take the variable radius of the ball into account by computing it from the size of the bounding box.
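A minimal sketch of how such a target map could be generated from one bounding box is given below. The function name `target_heatmap`, the example box, and the choice of an unnormalized Gaussian with peak value 1 are assumptions; the text specifies only the center c and the use of r as the variance. Any rescaling between image and output resolution is left out.

```python
import math


def target_heatmap(box, width, height):
    """Isotropic Gaussian target centered on the box, with variance r."""
    x_min, y_min, x_max, y_max = box
    cx, cy = 0.5 * (x_max + x_min), 0.5 * (y_max + y_min)
    r = 0.5 * min(x_max - x_min, y_max - y_min)
    if r <= 0:
        raise ValueError("bounding box must have positive width and height")
    # Unnormalized Gaussian (peak value 1 at the center); this is an assumption.
    heatmap = [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * r))
         for x in range(width)]
        for y in range(height)
    ]
    return heatmap, (cx, cy), r


# Hypothetical 20x20 pixel box on a 160x120 map.
heatmap, center, radius = target_heatmap((70, 50, 90, 70), width=160, height=120)
```

Because r is derived from the shorter side of the box, a ball that appears larger in the image produces a wider blob in the target map.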

We apply the three variants of the SweatyNet model described in Sect. 3 to SoccerData. For a fair evaluation of the model, we randomly split our data into \(70\%\) training and \(30\%\) testing. In the training phase, the mean squared error (MSE) between the predicted and the target probability map is minimized. We use Adam [14] as the optimizer. We trained all of our models for a maximum of 100 epochs on the Nvidia GeForce GTX TITAN GK110. Similar to [21], we use a learning rate of 0.001 and a batch size of 4. In addition, we experiment with dropout probabilities of 0.0, 0.3 and 0.5.

Detection in a Sequence: We train the temporal part in two ways: (i) we pre-train the temporal model on artificially generated sequences and finetune it on top of the pre-trained SweatyNet-1 on the real sequences; (ii) we finetune the joint model on the real sequences, where pre-trained weights are used only for the SweatyNet-1 model.

Algorithm 2 details the procedure for synthetic data generation. To obtain the heatmaps of a particular sequence \(L_i\), at each time step j we generate a multivariate normal probability distribution centered at \((x_j,y_j)\) with a variance equal to the radius \(R_i\).

To finetune the model on the real sequential data, we extracted a set of real soccer ball frames from bags recorded during RoboCup 2018 Adult-Size games. Since video frames do not always contain a ball in the field of view, we preprocess the videos to make sure that we do not use a sequence of frames without any ball present. With these restrictions, we obtained 20 sequences of consecutive ball frames with an average length of 60 frames. For all of our experiments, we fixed the history size h to 20 and the prediction length p to 1.

For training on real data, we use learning rates of \(10^{-5}\) for the detection task and \(10^{-4}\) for the temporal part after pre-training. When pre-training the temporal network on the artificial sequences, the learning rate is set to \(10^{-5}\). We train for 20 epochs on the synthetic data and for 30 epochs on the real data.

TCN: Encoder and decoder of TCN are two convolutional networks with two layers of 64 and 96 channels, respectively. We set up all parameters following the work of [16] except that we use Adam as an optimizer with MSE loss.

ConvLSTM and ConvGRU: We use four layers of ConvLSTM/ConvGRU cells with respective sizes of 32, 64, 32, and 1, and a fixed kernel size of five across all layers.

Multiple Balls in a Sequence: To verify that our model can generalize, we test it on a more complex scenario with two balls present. Note that the network was only trained on a dataset containing a single ball. The qualitative results can be found in Figs. 4 and 5. These figures show that the model is powerful enough to handle cases not covered by the training data. The temporal part leverages the previous frames and the residual information and can detect a ball which is absent in the SweatyNet output (Fig. 4 (a) vs. (d)).

4.2 Postprocessing

The output of the network is a probability map of size \(160\times 120\). We use the contour detection technique explained in Algorithm 1 to find the center coordinates of a ball. The output of the network has a lower resolution and carries less spatial information than the input image. To account for this effect, we calculate sub-pixel coordinates and return the center of mass of the contour as the center of the detected soccer ball.
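The idea of a sub-pixel center of mass can be sketched as below. This is a simplified stand-in and not Algorithm 1: instead of tracing contours, it thresholds the map and computes the probability-weighted centroid of all pixels above the threshold, which matches a contour-based result only when a single blob is present. The function name `subpixel_center` and the threshold of 0.5 are assumptions.

```python
def subpixel_center(prob_map, threshold=0.5):
    """Probability-weighted centroid (x, y) of pixels at or above threshold."""
    total = sx = sy = 0.0
    for y, row in enumerate(prob_map):
        for x, p in enumerate(row):
            if p >= threshold:
                total += p
                sx += p * x
                sy += p * y
    if total == 0.0:
        return None                   # no ball detected in this map
    return sx / total, sy / total


# Toy 4x5 map with a two-pixel blob at (x=2, y=1) and (x=3, y=1).
toy_map = [
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0],
]
center = subpixel_center(toy_map)
```

The returned x coordinate falls between two pixel columns, which is the kind of sub-pixel estimate that compensates for the reduced output resolution.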

Algorithm 1 (figure a) and Algorithm 2 (figure b).

4.3 Evaluation

To analyze the performance of different networks we use several metrics: false discovery rate (FDR), precision (PR), recall (RC), F1-score (F1) and accuracy (Acc) as defined in Eq. 1, where TP is true positives, FP is false positives, FN is false negatives, and TN is true negatives.

$$\begin{aligned} FDR&= \frac{FP}{FP+TP}, \quad PR = \frac{TP}{TP+FP}, \quad RC = \frac{TP}{TP+FN},\\ F1&= 2\times \frac{PR\times RC}{PR+RC}, \quad Acc = \frac{TP+TN}{TP+FP+TN+FN} \end{aligned} \qquad (1)$$
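As a quick numerical check of Eq. 1, the metrics can be computed directly from the four counts. The function name `detection_metrics` and the example counts are hypothetical and are not results from this work; the sketch also assumes that no denominator is zero.

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute the metrics of Eq. 1 from confusion counts."""
    pr = tp / (tp + fp)               # precision
    rc = tp / (tp + fn)               # recall
    return {
        "FDR": fp / (fp + tp),        # false discovery rate, equal to 1 - PR
        "PR": pr,
        "RC": rc,
        "F1": 2 * pr * rc / (pr + rc),
        "Acc": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical counts, for illustration only.
m = detection_metrics(tp=90, fp=10, fn=30, tn=70)
```

Note that FDR and PR share the same denominator, so they always sum to one.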
Fig. 6. The top row shows part of the input history (frames 18, 19, and 20). The bottom row shows heatmaps: (a) the residual information passed from SweatyNet to the temporal part, (b) the ground-truth ball position, and (c) the output predicted by the temporal part.

Fig. 7. Example of a ball that is correctly detected after finetuning with the temporal model, while the confidence of SweatyNet alone is very low, resulting in a false negative detection. The left image is the real image; the middle is the SweatyNet output without finetuning; the right one is the SweatyNet output after finetuning.

An instance is classified as a TP if the predicted center and the actual center of the soccer ball are within a fixed distance of \(\gamma =5\) of each other (Fig. 6).
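This criterion amounts to a single distance test, sketched below. The function name `is_true_positive` is hypothetical, and two details are assumptions: the distance is taken to be Euclidean, and a prediction lying exactly at distance \(\gamma\) is counted as a match.

```python
import math


def is_true_positive(pred_center, true_center, gamma=5.0):
    """True if the predicted center lies within gamma of the true center."""
    dx = pred_center[0] - true_center[0]
    dy = pred_center[1] - true_center[1]
    return math.hypot(dx, dy) <= gamma   # Euclidean distance (an assumption)


hit = is_true_positive((83.0, 64.0), (80.0, 60.0))    # offset (3, 4)
miss = is_true_positive((84.0, 64.0), (80.0, 60.0))   # offset (4, 4)
```

The first prediction is accepted because its offset has length exactly 5, while the second is rejected because its offset is longer than \(\gamma\).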

Fig. 8. From left to right: input image, the ground truth, prediction of the neural network, and the final output after post-processing.

Table 1. Evaluation of SweatyNets on the task of soccer ball detection. The highlighted numbers are the best performance for a particular dropout probability.

5 Results

The results of our experiments are summarized in Table 1. The performance of all three models is comparable. To improve generalization and prevent overfitting, we further experiment with different dropout [24] probability values. We train all our models on a PC with an Intel Core i7-4790K CPU with 32 GB of memory and an Nvidia GeForce GTX TITAN graphics card with 6 GB of memory. For real-time detection, one major requirement is a fast inference time. We report the inference time of the model on the NimbRo-OP2X robot in Table 2(a). The NimbRo-OP2X robot is equipped with an Intel Core i7-8700T CPU with 8 GB of memory and an Nvidia GeForce GTX 1050 Ti graphics card with 4 GB of memory. Since none of the three models uses the full capacity of the GPU during inference, the bigger models can perform their extra computations in parallel; as a result, all three SweatyNet networks are comparable in real-time inference. Figure 8 demonstrates the effectiveness of the model for the task of soccer ball detection. For this study, we only consider SweatyNet-1.

The results of the sequential part are summarized in Table 2(b). The sequential network successfully captures temporal dependencies and gives an improvement over SweatyNet. Using artificial data to pre-train the temporal network is beneficial given the shortage of real training data, and it boosts performance. Figure 2 illustrates artificially generated ball sequences with the temporal prediction. We observed that when the temporal model is pre-trained on the artificial data, the learnable weight for the residual information takes a value of 0.57 on average, whereas without pre-training the value is 0.49. The performance of TCN is comparable to that of ConvLSTM and ConvGRU, but TCN considerably outperforms both in terms of inference time, which is a critical requirement for a real-time decision-making process. Table 2(a) presents a comparison of the inference times of the temporal models.

To support our proposal of using sequential data, we present in Fig. 7 an example image where SweatyNet alone is uncertain of the prediction, whereas the network gives a strong detection when further processed with the temporal model.

Table 2. (a) Inference time comparison. For sequential models, we report the time on top of the base model. (b) Evaluation of the different tested models. Here, real denotes that the sequential part is trained only on real data, and ft means that a pre-training phase on the artificially generated ball sequences precedes finetuning on real data.

6 Conclusion

In this paper, we address the problem of soccer ball detection using sequences of data. We propose a model which utilizes the history of ball movements for efficient detection and tracking. Our approach makes use of temporal models which effectively leverage the spatio-temporal correlation of sequences of data and keep track of the trajectory of the ball. We present three temporal models: TCN, ConvLSTM, and ConvGRU. The feed-forward nature of TCN allows a faster inference time and makes it an ideal choice for real-time application in RoboCup soccer. Furthermore, we show that with transfer learning, sequential models can leverage knowledge learned from synthetic counterparts. Based on our results, we conclude that our proposed deep convolutional networks are effective in terms of performance as well as inference time and are a suitable choice for soccer ball detection. Note that the presented models can also be used for detecting other soccer objects like goalposts and robots.