1 Introduction

Recent advances in computing and broadcast technology have increased the number of sporting events available to the general public as video. Events ranging from the World Cup finals to the quadrennial Olympic Games draw hundreds of millions of TV and Internet viewers. Automatic analysis of sports video is therefore becoming increasingly important. On the one hand, coaches and athletes need video analysis data to examine overall strategy and specific technical traits. On the other hand, broadcasters want to use video analysis to enhance the audience’s enjoyment of the game. The foundation for these applications is the detection of player targets in game video. Combining deep learning (DL), currently the most popular approach in image processing, with the study of a robust player recognition system for football match scenes is therefore of great practical value.

Because of their fast movement, frequent occlusions, and fluctuating lighting conditions, football players pose substantial challenges for real-time detection and tracking in dynamic game situations. The unpredictable nature of the game and the continual interaction of players on the pitch make these scenes difficult for computer vision systems. Players move quickly across the pitch and frequently overlap or occlude one another, which makes it hard for algorithms to recognize and track them precisely. Additionally, during outdoor matches the lighting can shift quickly, changing the contrast and quality of the image. Overcoming these obstacles requires advanced computer vision techniques that can handle the nuances of football action and provide accurate real-time player monitoring for a variety of applications, including sports analytics and live broadcasting.

Computer technology has recently shown tremendous promise in the world of sports. For instance, target identification has been used in sports analysis [1], computer vision (CV)-based virtual reality has been used for sports posture correction [2], and a CV-driven assessment system has been used for decision-making in sports training [3]. Analyzing sports is essential for improving athletes’ performance. Traditionally, sensors have been placed at strategic locations on athletes to capture raw data, which is then analyzed using data science techniques to deliver data-driven training recommendations [4,5,6,7]. However, adding more sensors raises the cost and can hinder athletes’ performance. Furthermore, a team cannot ask its rivals to wear sensors in order to learn about their strengths and weaknesses.

Traditional image processing techniques and two-stage object detectors have struggled to accurately detect and track soccer players under challenging game conditions of rapid motion, occlusions, and dynamic lighting. Recent work using deep learning models such as YOLOv5 has made progress, but still struggles to accurately localize players under complex scenarios and overlapping objects, as shown by the performance trends in the literature review. This study presents a novel integrated approach within the YOLOv5 architecture that incorporates multiple synergistic modules to improve both accuracy and real-time performance.

The proposed method utilizes the SimSPPF module for efficient multi-scale feature extraction. This module enables robust player detection in the presence of occlusions and varying object sizes. It also utilizes the GhostNet module to reduce computational complexity without compromising accuracy. Most importantly, the inclusion of the slim-scale recognition layer, which is optimized for accurate bounding box prediction, enables reliable player recognition even under overlap conditions that often challenge existing methods.

The proposed integrated approach provides a better solution for real-time soccer player tracking and analysis by combining this ability to effectively capture true positives with the efficient context understanding of the YOLOv5 architecture, while overcoming the pitfalls of existing techniques.

The study’s contributions include the following:

  • The study offers an enhanced YOLOv5 object recognition model that has been specially designed for football and player detection.

  • To improve real-time performance, the model uses fine-tuned features and optimized hyper-parameters that make use of YOLOv5’s precision and speed.

  • The proposed approach exhibits extraordinary effectiveness in detecting both the football and players. It is an effective tool for real-time player tracking in football games due to its capacity to manage quick player movements and occlusions.

  • The improved YOLOv5 model’s effectiveness and resilience are demonstrated by extensive testing on several football match recordings. The system demonstrates excellent application potential in live broadcasting, player tracking, and sports analytics, demonstrating its adaptability and agility to many settings.

2 Related Work

Various approaches to real-time football and player detection have been presented in the literature. One study proposed an approach that uses integrated image processing methods to spot football players on soccer fields [8]. The process has two steps. First, the region of the football pitch is separated from the background, which prevents player detection errors caused by overlap between the cheering crowd and the players. Second, the football players are identified using morphological processing. Another work outlines an MPI and OpenMP-based parallel technique for a football tracking system built on multiple networks [9]. Segmentation and tracking, the two kinds of operations in the tracking model, are combined using a consumer–producer pattern, along with a send-and-receive communication pattern to spread the blob identities. An SVM-based approach that relies on the head and shoulder HOG features of football players is proposed in [10]. This method crops positive samples that contain a moving head and shoulder and negative samples that do not, trains the SVM classifier with the HOG features of the samples, and scans the foreground region at various scales to detect the moving target.

An innovative segmentation approach for detecting football players in broadcast footage is presented in [11]. The system combines histogram of oriented gradients (HOG) descriptors with linear support vector machine (SVM) classification. HOG-based algorithms have already been used successfully for pedestrian recognition, and the experimental findings of that study indicate that combining HOG with SVM is a promising way to detect and segment players in broadcast video. A CamShift and Kalman filtering-based object recognition and tracking approach was proposed in [12]. First, because tracking is easily disrupted by the surroundings, the authors use CamShift to track players and a background-weighted histogram to enhance the CamShift algorithm. The paper then proposes occlusion factors based on search windows and uses them to weight the results of the Kalman filter and CamShift. Finally, the combination of CamShift and the Kalman filter is used to handle player occlusion. Another study tracks the whereabouts of objects in the stadium through video monitoring and deep learning [13]. It additionally uses GPS (Global Positioning System), which has a significant error but can preserve the ID of an object even when objects overlap, to enable accurate tracking under overlap. The experimental findings confirmed that combining deep learning and GPS decreases the object tracking failure rate and raises the precision of the object location. Two algorithms have been proposed for automatic analysis and prediction of the best passes in a football game, based on the identification and monitoring of players and the ball in video footage [14]. In the first algorithm, the playing field is tagged for each frame according to the grass ratio, which is used to check for genuine frames. Following that, an algorithm uses the connection table to automatically determine which team has possession of the ball at any one time during a game. The study also checks for replays, which are often shown after each successful move or goal, using a background check to identify any potential deviation in player position in the final few seconds of a clip.

An automated saliency-based offside detection approach was proposed in [15]. To record player interactions with the ball, four cameras were placed on either side of the centerline. Saliency is used in the scenes to estimate the movement of offensive players who are playing the ball, and the estimated motion is compared with the positions of the defending players to determine offside. A system for monitoring football players’ locations throughout a game is described in [16]: eight video feeds from stationary cameras are used as the input, and each stream is processed to produce measurements of the players for a multi-view tracker. The single-view processing includes foreground detection and a bounding box tracker that separates the measurements of merged players. Another study uses novel detection techniques to identify undervalued players in its data and evaluates its methodology with five machine learning models [17]: support vector machine, Random Forest, Decision Tree, Linear Regression, and XGBoost. XGBoost outperformed the other methods in both external testing and tenfold cross-validation, with RMSE values of 0.0122 and 0.0107, respectively.

A DL-based image segmentation model has been proposed for the pixel-level classification of football game video frames [18]. Each pixel in a football video frame is assigned to one of ten categories, including players, the ball, the goal post, and other background scenes. Various machine learning techniques for forecasting football player positions in space, which are used for sophisticated tactical analysis, are presented in [19]. A player detection approach based on a fully convolutional neural network was proposed in [20]. The novel aspect of that strategy is the creation of the player probability map using the U-Net architecture. An automated saliency-based offside detection approach was proposed in [21]. To capture images of players with the ball, six cameras were placed on both sides of the centerline and behind the goals. Saliency is used in the scenes to estimate the movement of offensive players who are playing the ball, and the estimated motion is compared with the positions of the defending players to determine offside.

Table 1 outlines various real-time football player detection methods, highlighting their strengths and limitations. In contrast, our study presents an improved YOLOv5 model incorporating the SimSPPF, GhostNet, and Slim scale detection modules. The table also compares our method with the related works in terms of detection technique, evaluation criteria, and overall efficiency.

Table 1 Comparison of related works

3 Hypothesis and Limitations of the Proposed Method

3.1 Hypothesis

The proposed YOLOv5 model for real-time football and player detection combines the SimSPPF and GhostNet modules with Slim scale detection in order to improve accuracy, speed, and computational efficiency. This integration aims to outperform the baseline YOLOv5 architecture and other existing methods. The SimSPPF module enhances the model’s ability to capture multi-scale detail, which is necessary to detect both small and large objects, such as the ball and the players, on the dynamic football field. Meanwhile, the GhostNet module reduces the computational complexity, allowing faster inference without significant loss of accuracy.

Moreover, the Slim scale detection layer is expected to predict the bounding boxes of detected players and the ball more accurately. Such an improvement would increase the reliability and robustness of the system for real-time tracking and analysis applications.

3.2 Limitations

The proposed method offers promising solutions to existing challenges. However, it is important to be aware of potential limitations:

  • The current model is optimized for football/soccer players and ball detection. Extending it to other sports such as basketball or American football might require further adaptation and training.

  • Despite improved occlusion handling, detecting heavily occluded players, especially in crowded matches, remains challenging.

  • The performance of the model is highly dependent on the quality and variety of the training data. Variations in player attire, pitch conditions, or camera angles not represented in the training set could affect generalization.

4 Proposed Methods

4.1 YOLOv5 Model

YOLOv5 is a state-of-the-art object detection model known for its efficiency and precision. It is a development of the real-time object detection models of the YOLO series and was created to address the shortcomings of earlier versions and enhance their performance. Unlike two-stage detectors, it has a single-stage architecture, which makes it faster and better suited for real-time applications. YOLOv5 is used extensively in computer vision applications such as object recognition, instance segmentation, and image classification, since it is designed to recognize and localize objects of several classes in images with high accuracy. Owing to its combination of speed and accuracy, it has gained popularity in both research and practical applications, becoming an important tool in the computer vision and machine learning fields. Figure 1 shows the overall structure of the YOLOv5 model. Various applications of YOLOv5 are reported in the literature [22,23,24,25]. The SimSPPF module’s design is shown in Fig. 2.

Fig. 1 The overall model of YOLOv5 structure

Fig. 2 The SimSPPF module’s architecture

4.2 SimSPPF Module

By pooling features at several sizes, the SPPF module in YOLOv5 plays a crucial role in collecting context information at various scales. SPPF in YOLOv5 is built on the CBS block, which consists of a convolution layer followed by batch normalization and a sigmoid linear unit (SiLU). SimSPPF replaces CBS with the CBR block (convolution, batch normalization, and ReLU) that appears in Eq. (1). We found that applying SimSPPF improved the FPS and reduced computational complexity, with only a slight gain in accuracy. The SimSPPF module’s construction is shown in Fig. 2. SimSPPF was therefore included in our model in place of SPPF because of its lower computational complexity and higher FPS at an acceptable object detection accuracy. Equation (1) follows [26].

$$ \left\{ \begin{aligned} F_{1} &= \mathrm{CBR}(F) \\ F_{2} &= \mathrm{MaxPooling}(F_{1}) \\ F_{3} &= \mathrm{MaxPooling}(F_{2}) \\ F_{4} &= \mathrm{MaxPooling}(F_{3}) \\ F_{5} &= \mathrm{CBR}([F_{1};F_{2};F_{3};F_{4}]) \end{aligned} \right. $$
(1)
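The pooling cascade in Eq. (1) can be illustrated with a minimal sketch in plain Python. It covers only the three max-pooling steps for a single channel; the CBR blocks are omitted, the pooling is applied to the output of the first CBR block as in the common SPPF implementation, and the 5 × 5 kernel with stride 1 is an assumption taken from the usual YOLOv5 SPPF configuration rather than from this paper. The helper names `max_pool_same` and `sppf_pooling_branches` are hypothetical.

```python
def max_pool_same(grid, k):
    """Stride-1 max pooling whose output has the same size as the input.

    Border windows are clipped to the grid, which matches padding with -inf.
    """
    h, w = len(grid), len(grid[0])
    r = k // 2
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            window = [
                grid[y][x]
                for y in range(max(0, i - r), min(h, i + r + 1))
                for x in range(max(0, j - r), min(w, j + r + 1))
            ]
            row.append(max(window))
        out.append(row)
    return out


def sppf_pooling_branches(f1, k=5):
    """Return [F1, F2, F3, F4] for one channel (k=5 is an assumed kernel size)."""
    f2 = max_pool_same(f1, k)
    f3 = max_pool_same(f2, k)
    f4 = max_pool_same(f3, k)
    return [f1, f2, f3, f4]


# Toy single-channel feature map with hypothetical values.
feature = [[(7 * i + 3 * j) % 11 for j in range(12)] for i in range(12)]
f1, f2, f3, f4 = sppf_pooling_branches(feature)

# Two cascaded 5x5 pools cover a 9x9 window, three cover a 13x13 window.
print(f3 == max_pool_same(feature, 9))   # True
print(f4 == max_pool_same(feature, 13))  # True
```

Reusing each pooled map as the input of the next pool gives three receptive-field sizes from one small kernel, which is why this serial layout is cheaper than running three separate large-kernel pools in parallel.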

4.3 GhostNet Module

GhostNet is a compact neural network module. Its goal is to improve the efficiency of deep neural networks by decreasing computational overhead while preserving performance. The GhostNet work introduces the idea of “ghost” feature maps, which are inexpensive feature maps derived from a subset of feature maps produced by ordinary convolution. By using these ghost feature maps in addition to conventional convolutional layers, the GhostNet module strikes a good compromise between model size and accuracy. Figure 3 shows a diagram of the ghost technique.
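To give a rough sense of the saving, the sketch below compares the weight count of a standard convolution with that of a Ghost module, in which an ordinary convolution produces only a fraction 1/s of the output maps and cheap depthwise operations with d × d kernels produce the rest. The ratio s = 2, the kernel sizes, and the channel counts are illustrative assumptions, not values reported in this paper, and the function names are hypothetical.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_params(c_in, c_out, k, s=2, d=3):
    """Weights of a Ghost module (bias ignored).

    s: ratio between total and intrinsic feature maps (assumed value).
    d: kernel size of the cheap depthwise operations (assumed value).
    c_out is expected to be divisible by s.
    """
    intrinsic = c_out // s                 # maps from the ordinary convolution
    primary = c_in * intrinsic * k * k     # cost of the ordinary convolution
    cheap = intrinsic * (s - 1) * d * d    # one d x d depthwise filter per ghost map
    return primary + cheap


# Hypothetical layer: 128 input channels, 256 output channels, 3x3 kernels.
standard = conv_params(128, 256, 3)
ghost = ghost_params(128, 256, 3)
print(standard, ghost)
print(f"compression ratio: {standard / ghost:.2f}")
```

With these assumed settings the compression ratio comes out close to s, because the cheap depthwise operations contribute only a small share of the weights.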

Fig. 3 Design of the Ghost technique

4.4 Slim Scale Detection Layer

Single-shot object detection systems such as YOLO and SSD (Single-Shot MultiBox Detector) use the Slim scale detection layer as a component. The main objective of this layer is to predict bounding boxes and the associated objectness scores at various scales.

Removing the P5 feature map may lead to a more efficient model in addition to increasing accuracy. Since the P5 feature map is the biggest, processing it takes a lot of time. Eliminating it lowers the amount of computation needed to run the model and therefore increases its efficiency and speed. This can be very significant in real-time object detection applications, where efficiency and speed are essential (Fig. 4). The removal of the P5 feature map must nevertheless be weighed carefully, since it can make it more challenging to find tiny objects in the image. The P5 feature map, which covers the biggest portion of the input image, has the advantage of being able to collect details about objects that are further from the image’s center. Figure 5 depicts the structure of our upgraded model.

Fig. 4 GhostNet with YOLOv5 model

Fig. 5 Structure of our upgraded model

Main Steps of the Proposed Method:

1. Input image: The system takes in a video frame or image captured during a football match.

2. YOLOv5 backbone: Utilizing the YOLOv5 object detection model renowned for its speed and accuracy, this forms the core of the system, identifying objects within the image.

3. SimSPPF module: Integrated to efficiently extract multi-scale features from the input, the SimSPPF (Simplified Spatial Pyramid Pooling) module enhances context information capture at various scales, crucial for detecting objects of diverse sizes, such as football players and the ball.

4. GhostNet module: Designed to reduce computational complexity, the GhostNet module facilitates faster inference without compromising detection accuracy. It achieves this by generating "ghost" feature maps that are computationally inexpensive, optimizing the balance between model size and performance.

5. Slim scale detection layer: Responsible for predicting bounding boxes and associated confidence scores for detected objects (players and the ball) across different scales. This layer is finely tuned for accurate and efficient bounding box predictions, vital for real-time player and ball tracking.

6. Non-maximum suppression (NMS): Post-detection, the NMS algorithm is employed to eliminate overlapping or redundant bounding boxes, ensuring each detected object is represented by a single, precise bounding box.

7. Output: The final output showcases the input image or video frame with highlighted football players and the ball, marked by respective bounding boxes, class labels, and confidence scores.

This detailed breakdown of the proposed method’s block diagram offers a comprehensive understanding of its core components and their functionalities.
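Step 6 of the list above can be sketched as greedy, per-class non-maximum suppression in plain Python. This is a minimal illustration rather than the implementation used in the experiments: the IoU threshold of 0.45, the `(x1, y1, x2, y2)` box format, and the sample detections are assumptions introduced here.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(detections, iou_threshold=0.45):
    """Greedy per-class NMS over a list of (box, score, label) tuples."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)  # highest-scoring box still in play
        kept.append(best)
        # Drop same-class boxes that overlap the kept box too strongly.
        remaining = [
            d for d in remaining
            if d[2] != best[2] or iou(d[0], best[0]) <= iou_threshold
        ]
    return kept


# Hypothetical detections for one frame.
detections = [
    ((100, 100, 150, 220), 0.90, "player"),
    ((104, 102, 152, 224), 0.75, "player"),  # near-duplicate of the first box
    ((300, 120, 350, 240), 0.80, "player"),
    ((120, 200, 132, 212), 0.60, "ball"),
]
for box, score, label in nms(detections):
    print(label, score, box)
```

In this example the second player box is suppressed because it overlaps the higher-scoring first box, while the ball box survives even though it lies inside a player box, since suppression is applied only within a class.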

5 Experimental Results

This study focuses on enhancing and adapting the YOLOv5 object identification model specifically for football and player detection. Utilizing the YOLOv5 architecture’s blend of precision and speed, our updated model integrates refined features and optimized hyper-parameters to achieve heightened real-time performance. Particularly, the recommended model excels in detecting both the football and players, even within complex environments. Through extensive testing across various football match recordings, we demonstrate the improved YOLOv5 model’s efficacy, robustness, and its potential applications in live broadcasting, player tracking, and sports analytics. Its swift response time and accurate recognition capabilities render it indispensable for augmenting football analysis and comprehending player actions in real-time scenarios.

5.1 Dataset

The dataset’s visualization is shown in Fig. 6. (a) The quantity of annotations for each class. (b) Visualization of each bounding box’s position and dimensions. (c) The bounding box position’s statistical distribution. (d) The bounding box sizes’ statistical distribution.

Fig. 6 The dataset’s illustration

5.2 Experimental Environment

The experimental setting for this investigation relied on two essential elements: the PyTorch framework and a CUDA-capable GPU. The YOLOv5 DL network requires parallel processing to carry out learning and target identification quickly and correctly. Computation was therefore accelerated using a GPU and CUDA, and the models were built and trained using the PyTorch framework. The setup information for this experimental environment is shown in the given table.

5.3 Assessing Criteria

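The following metrics were used to validate the proposed approach, where TP, FP, and FN denote the numbers of true positives, false positives, and false negatives, P(r) is the precision as a function of the recall r, and k is the number of classes.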

$$ \left\{ \begin{aligned} P_{(\text{precision})} &= \frac{\text{TP}}{\text{TP} + \text{FP}}, \\ P_{(\text{recall})} &= \frac{\text{TP}}{\text{TP} + \text{FN}}, \\ \text{AP}_{(\text{average precision})} &= \int_{0}^{1} P(r)\,\mathrm{d}r, \\ \text{mAP}_{(\text{mean average precision})} &= \frac{\sum_{i=1}^{k} \text{AP}_{i}}{k}. \end{aligned} \right. $$
(2)
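As a worked illustration of Eq. (2), the sketch below computes precision, recall, AP, and mAP in plain Python. The integral in AP is approximated by summing rectangles under a monotonically non-increasing precision envelope, which is one common convention and is assumed here; the evaluation code used in the experiments may interpolate differently. The detection lists and ground-truth counts are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision and recall from raw counts, as in Eq. (2)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(matches, num_ground_truth):
    """Area under the precision-recall curve for one class.

    matches: list of (confidence, is_true_positive), one entry per detection.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for _, is_tp in ranked:
        if is_tp:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Make precision non-increasing in recall (precision envelope).
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    # Sum rectangle areas between consecutive recall values.
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical detections: (confidence, matched a ground-truth box?).
player = [(0.9, True), (0.8, True), (0.7, False), (0.6, True)]
ball = [(0.8, True), (0.5, False)]
ap_player = average_precision(player, num_ground_truth=4)
ap_ball = average_precision(ball, num_ground_truth=2)
print(precision_recall(tp=8, fp=2, fn=2))  # (0.8, 0.8)
print(ap_player, ap_ball)                  # 0.6875 0.5
print(mean_average_precision([ap_player, ap_ball]))  # 0.59375
```

A detection counts as a true positive here only if it has already been matched to a ground-truth box; for mAP@0.5 that matching would use an IoU threshold of 0.5.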

5.4 Results and Analysis

The detection results obtained with the proposed enhanced YOLOv5 model on different scenes of the dataset are presented in Fig. 7, showcasing its ability to accurately identify both the football and players across a variety of challenging match scenes. The model’s effectiveness is further demonstrated by the quantitative results shown in Fig. 8 and Table 2.

Fig. 7 Detection results for different scenes of the dataset

Fig. 8 Accuracy comparison between the models

Table 2 Experimental setup details

The outcomes in Fig. 8 shed light on how the models improved over time and whether their performance followed any particular trends or patterns. Table 2 also presents a comparison of the models’ accuracy. In this comparison, our enhanced model fared better than YOLOv5n in terms of accuracy and mAP@0.5. Our enhanced model was also more adept at picking out true positives, which is crucial for reliable player and ball tracking.

Figure 8 illustrates the performance trends of the proposed model and the baseline YOLOv5n model, indicating that the enhanced YOLOv5 achieves consistently higher accuracy over time. The values in Table 2 further substantiate the superiority of the proposed method, with significant improvements in precision, recall, and mean average precision (mAP) compared to the original YOLOv5n model.

Table 2 also outlines the models’ complexity. In this investigation, the complexity of the initial model, YOLOv5n, was compared with that of the upgraded YOLOv5 model. The objective was to identify the model that is more practicable for football and player detection systems by being more effective and less complex.

The results presented in Table 2 demonstrate that the proposed enhanced YOLOv5 model not only achieves superior detection performance but also exhibits a more efficient computational footprint, with a 45% reduction in GFLOPs and a 53% decrease in model size compared to the YOLOv5n baseline. This highlights the effectiveness of the introduced SimSPPF, GhostNet, and Slim scale detection modules in optimizing the model’s speed and complexity without compromising its accuracy.

5.5 Ablation Experiment

In an ablation test, a model’s constituent components are systematically removed or altered to determine how they affect overall performance. To do this, we used the identical dataset, hyper-parameters, and training/testing procedures for all networks throughout the research, varying only the particular components under evaluation. This allowed us to attribute any performance differences between the enhanced and the baseline model to the specific modifications under consideration. The ablation test study is shown in Table 3.

Table 3 Evaluation of the complexity and accuracy of the models

The results in Table 3 show that the incorporation of the SimSPPF module leads to a significant boost in the mean average precision (mAP@0.5) from 0.81 to 0.86, while the addition of the GhostNet module and the Slim scale detection layer further enhances the model’s precision, recall, and computational efficiency. The cumulative effect of these innovations is the superior overall performance of the proposed enhanced YOLOv5 model, as reflected in the final row of the table.

5.6 Comparative Experiments

Our upgraded model compares favorably with other models built on the YOLOv5 architecture for football and player recognition. Its short inference time, high accuracy, and lightweight design make it well suited to real-time football and player identification and tracking systems. Among the compared models, it offers the highest performance and efficiency for this task, and it is a major upgrade over prior generations. Figure 9 shows the performance comparison with different models.

Fig. 9 A graphical comparison of the performance of several models

The results shown in Fig. 9 and the comprehensive analysis provided in the previous table highlight the significant improvements achieved by the proposed enhanced YOLOv5 model in terms of detection accuracy, computational complexity, and overall effectiveness for real-time football and player detection tasks.

6 Discussion

The experimental results validate the effectiveness of the proposed enhanced YOLOv5 model for real-time football and player detection. The model shows significant improvements in precision, recall, and mean average precision (mAP) compared to the baseline YOLOv5n, as shown in Table 2 and Fig. 8. This superior detection accuracy can be attributed to the synergistic integration of the SimSPPF module for multi-scale feature extraction, the GhostNet module for reducing the computational complexity, and the Slim scale detection layer for accurate bounding box prediction.

The results of the ablation study in Table 3 further confirm the individual contributions of these modules. The inclusion of the SimSPPF module alone increased the mAP@0.5 from 0.81 to 0.86, highlighting its role in capturing contextual information at multiple scales, essential for detecting players and the ball amidst occlusions and varying object sizes.

Furthermore, the performance gains achieved by the proposed model are complemented by a significant reduction in computational complexity. Table 2 shows a 45% reduction in GFLOPs and a 53% reduction in model size compared to the YOLOv5n baseline. This computational efficiency can be attributed to the GhostNet module’s ability to generate lightweight ‘ghost’ feature maps, optimizing the trade-off between model size and accuracy.

The qualitative detection results shown in Fig. 7 further validate the model’s robustness in handling challenging game scenarios, including rapid player movement, occlusions, and dynamic lighting conditions. The accurate bounding box predictions facilitate reliable real-time tracking and analysis applications such as live broadcasting and sports analytics (Table 4).

Table 4 Ablation test study

7 Main Advantages

  1. The proposed model outperforms the baseline in terms of precision, recall, and mAP, enabling more reliable detection of true positives (players and football).

  2. The integration of computationally efficient modules (SimSPPF, GhostNet) ensures fast inference times, crucial for real-time applications.

  3. The SimSPPF module’s ability to capture multi-scale features enables efficient detection of objects of varying sizes (players and ball).

  4. The GhostNet module and optimized architecture result in lower GFLOPs and a smaller model size, facilitating deployment on resource-constrained devices.

  5. The model demonstrates resilience to occlusions, rapid movements, and dynamic illumination conditions commonly encountered in football matches.

8 Main Disadvantages

  1. The current model is optimized for football/soccer scenarios. Adapting it to other sports may require additional adjustments and retraining.

  2. The model’s performance relies heavily on the quality and diversity of the training data. Variations in player attire, pitch conditions, or camera angles not represented in the training set could impact generalization.

  3. While the proposed method improves occlusion handling, detecting heavily obscured players, especially in densely crowded matches, may still pose challenges.

9 Conclusions

Computer science has recently shown tremendous promise in the world of sports. For instance, object identification has been used in sports analysis, CV-based virtual reality has been applied to correct sports posture, and a CV-driven assessment scheme has been implemented for decision-making in sports training. Analyzing sports is essential for raising athletes’ performance. In this study, the YOLOv5 object identification model is improved and tailored for football and player detection. By building on the precision and speed of the YOLOv5 architecture, our updated model combines fine-tuned features and optimized hyper-parameters to deliver better real-time performance. The proposed model is particularly good at finding both the football and players even in complex environments. Its effectiveness and resilience, as well as its potential for use in live broadcasting, player tracking, and sports analytics, are demonstrated by extensive tests on a variety of football match recordings. The system’s quick response time and precise recognition capabilities make it a valuable tool for enhancing football analysis and understanding player actions in real-time settings.