1 Introduction

Over the long development of basketball, fierce physical confrontation has become a defining feature of the competitive game. In modern basketball, fouls are among the most important factors deciding the outcome of a game and evaluating a player's ability. Fouls committed in different areas of the court and at different times of the game reflect a team's foul concept, foul awareness, and ability to control fouls. As the offensive level of competitive basketball continues to improve, defense has shifted from passive to active, making the offensive and defensive conflicts more intense and the game more demanding from beginning to end. Players must therefore execute technical actions, whether legal or not, to gain and keep control of the ball.

At present, traditional methods of motion image acquisition and feature extraction generally use a horizontal or overhead camera to capture video of the action. Because a horizontal camera detects target images above a certain height in the horizontal direction, many interfering images appear, and it is difficult to define a clear detection area. Vision-based human behavior recognition addresses the processing and analysis of raw images and image sequences, usually collected by computers through camera sensors, and aims to learn and understand the actions and behaviors of the people in them. It covers many research topics in computer vision, including human detection, pose estimation, tracking in video, and the analysis and understanding of time-series data. With recent changes to the offensive rules, the secondary offense in basketball has become more condensed, and offense–defense transitions have become faster. Basketball is a sport of intense physical confrontation; as contact between players intensifies, fouls become inevitable and their tactical significance changes accordingly. It is therefore a challenge for coaches and players to maintain the intensity of the game and stay motivated on both ends while committing as few fouls as possible. Mastering the characteristics and patterns of players' fouls contributes substantially to improving their overall offensive and defensive performance, supports the team's tactics at both ends of the floor, and thereby increases the probability of winning to a certain extent.

The current research on basketball foul analysis reveals notable gaps in the existing literature. While recognizing the efficacy of machine vision in classifying fouls, there is a need for more comprehensive exploration of its application in understanding and categorizing diverse types of fouls in basketball. Additionally, the study provides insights into foul occurrences in various defensive scenarios but lacks a deeper understanding of the influencing factors. Further investigation is required to delve into player-specific attributes, team strategies, and game situations that contribute to observed foul patterns. Furthermore, the study presents statistics on fouls without sufficient contextualization, highlighting a gap in understanding the broader context behind foul occurrences, including game dynamics, player positions, and team strategies. Addressing these gaps could significantly contribute to a more nuanced and comprehensive understanding of basketball foul dynamics.

The study introduces innovative elements to the field of basketball foul analysis. Firstly, it pioneers the integration of machine vision, employing advanced technology to classify basketball actions as fouls or not. This marks a significant advancement in sports analytics, particularly within the context of basketball foul assessment. Secondly, the research offers a fine-grained analysis of foul occurrences in diverse defensive scenarios, such as dribble defense, shooting defense, defending the ball, and preventing the ball. This meticulous examination provides a novel perspective on the distribution of fouls in specific game situations, contributing to a deeper understanding of defensive strategies and player behaviors. Lastly, the study contributes to the field through a comparative analysis of foul statistics between the Chinese team and their opponents across various defensive categories. This comparative approach introduces novelty by shedding light on specific areas where one team may excel or face challenges in terms of foul management, offering valuable insights for coaches, players, and analysts alike.

In this paper, the characteristics of basketball players' foul actions are studied based on machine vision. The data showed that the Chinese team committed 19 blocking fouls, as did their opponents; 64 fouls for illegal use of hands against 82 by the opponents; eight collision fouls against five; one technical foul against three; and two unsportsmanlike fouls against one. These detection data show that research on the characteristics of basketball players' foul actions based on machine vision is of great significance for advancing current foul monitoring. Some novel aspects of machine vision techniques in extracting features from images during a basketball game are:

  • Efficiency: Compared with manual human analysis, machine vision algorithms can process enormous amounts of visual data in real time, recognizing patterns and characteristics in images captured throughout the game with great speed and accuracy.

  • Objectivity: Machine vision techniques are based on mathematical algorithms rather than human interpretation and are therefore less susceptible to subjective biases, reducing mistakes and discrepancies in the classification of fouls.

  • Consistency: Machine vision techniques can maintain a high degree of accuracy and dependability, since they are not affected by performance-impairing factors such as fatigue or distraction.

  • Scalability: Machine vision algorithms can readily be scaled up to analyze data from many games and teams, revealing patterns and trends across a larger sample.

  • Versatility: Machine vision techniques can analyze various aspects of the game, such as player movement, ball trajectory, and team formations, enabling coaches and referees to gain insights into different facets of play.

2 Related Works

This paper studies some techniques of foul action extraction that can be fully applied to research in this field. Hong et al. [1] proposed a learnable temporal attention mechanism for the automatic selection of important time points from action sequences. Seol et al. [2] proposed an action relation extraction method based on clinical semantic units and event causality patterns. Hashimoto et al. [3] introduced a human behavior modeling method that designs human behavior models from stored data obtained during long-term monitoring of people. Kartmann et al. [4] proposed a method for extracting physically plausible support relationships between objects based on visual information. Zhang and Zhang [5] proposed a two-channel model to decouple spatial and temporal feature extraction. Maity et al. [6] proposed an efficient method for human action recognition from sequences of contour images in videos. The methods presented offer valuable insights for research; however, due to the limited time and small sample sizes of the relevant studies, they have yet to gain widespread recognition. In the pursuit of optimizing foul action extraction research, we reviewed the following related work leveraging machine vision. Dawood et al. [7] aimed to develop an ensemble model based on image-processing techniques and machine learning. Wang et al. [8] developed a dynamic selection machine based on image-processing technology. Tsai and Hsieh [9] proposed a fast image alignment method using the expectation–maximization technique. Nouri-Ahmadabadi et al. [10] developed an intelligent system based on machine vision and support vector machines. Zhang et al. [11] proposed a hybrid convolutional neural network approach for fusion process monitoring. These methods provide a sufficient literature basis for studying the feature extraction of basketball players' foul actions by machine vision.
Du focuses on developing a model for recognizing athletes' incorrect actions using artificial intelligence algorithms and computer vision. The approach constructs a dual-channel 3D convolutional neural network (CNN) for accurate recognition and incorporates a spatial attention (SA) mechanism into the 3D CNN, introducing inter-frame difference information to capture substantial changes in the athletes' motion status. Combined with grayscale video data, this enhances the precision of identifying athletes' incorrect actions [12]. Zhang introduces a novel approach that integrates optical image enhancement with virtual reality to autonomously generate simulated basketball game scenes and training data. The method first enhances the brightness, contrast, and color saturation of the light image, improving the visibility and realism of the basketball game setting. It then leverages virtual reality technology to create a comprehensive virtual basketball game scene, encompassing elements such as the stadium, players, and spectators. This combination of image enhancement and virtual reality provides a more immersive and realistic basketball simulation experience [13]. Yibing primarily relies on time-series statistics of basketball techniques, employing a three-layer feedforward back-propagation neural network with a rotation prediction method to forecast crucial technical and statistical indicators for the team. Based on the forecast data, the average field goal percentage is 46.03% and the three-point percentage is 37.48%, with an average of 12.95 assists and 25.4 backcourt rebounds, showcasing the application of neural networks in predicting key performance metrics for basketball teams [14].

3 Overview of Machine Vision and Motion Feature Extraction

Machine vision converts the target into an image signal and transmits it to a dedicated image-processing system. The color, brightness, and distribution of the image pixels are converted into digital signals, and the target feature information is extracted by computation. Simulation results show that the proposed method can effectively improve the accuracy of extracting features of athletes' foul actions while taking less time. In a fierce basketball game, the players' various legal and illegal technical movements alternate rapidly, and the referee must make accurate and fair rulings quickly; this requires a high degree of concentration, yet distractions are unavoidable. This paper therefore applies machine vision to make the detection of basketball fouls more accurate.

3.1 Overview of Machine Vision

Machine vision is applied more and more in motion feature processing, and feature extraction is an important part of machine vision. Computer vision, also known as machine vision, uses computers to simulate the visual function of the human eye, extracting information from images or image sequences for shape and motion recognition so as to recognize three-dimensional scenes and objects in the objective world. In a computer vision system, the input data are grayscale matrices representing projections of a 3D scene. There can be multiple inputs, providing information from different directions, viewpoints, and points in time. The desired result is a symbolic description of the scene represented by the image: usually descriptions of object classes and the relationships between objects, possibly including information such as spatial surface structure, shape, physical surface properties, colors, materials, textures, shadows, and light source locations [15]. A general overview of the steps of the machine vision technique is shown in Fig. 1.

Fig. 1
figure 1

The working steps of machine vision techniques for feature extraction in a basketball match

With the development of computer technology, network technology and image-processing technology, machine vision has become an integral part of modern industrial production. This includes expertise in pattern recognition, computer vision, digital image processing, machine learning and artificial intelligence. Machine vision systems can quickly capture large amounts of information and process them automatically with ease, facilitating integration with design and process control information, and enabling real-time identification and analysis. Now, machine vision technology has moved from laboratory to reality, which has been widely used in many fields, bringing huge economic and social benefits.

Machine vision was proposed as an independent discipline relatively recently, but it is by no means an isolated research topic: studying and mastering it requires the comprehensive application of knowledge from other disciplines. The field contains many research directions that have long attracted computer vision experts, some of whom have dedicated their careers to particular problems. Important research areas include object tracking, object recognition, image processing, image segmentation, image classification, and stereo vision. The object recognition process is shown in Fig. 2.

Fig. 2
figure 2

Object recognition process

The problem of object recognition can be viewed as a process of reasoning from evidence: the image is the source of the evidence, the conclusion is the determination of the object's existence and location, and the identification method is the bridge from evidence to conclusion. This evidence is often uncertain, because the characteristics of the object are not well defined in the image and errors may occur when extracting image features. In practice, therefore, each piece of evidence is assigned a confidence level reflecting its reliability [16].

Machine vision is a comprehensive technology, encompassing image processing, mechanical engineering, control, lighting, optical imaging, sensors, analog and digital video, and computer software and hardware (image enhancement and analysis algorithms, image capture cards, etc.). A typical machine vision application system includes image capture and light source modules, an image digitization module, a digital image-processing module, an intelligent judgment and decision module, and a mechanical control execution module.

Research in machine vision mainly covers input devices, low-level vision, middle-level vision, high-level vision, and system structure. Input devices include image-processing and digitizing hardware. Low-level vision applies image-processing techniques and algorithms to the original input image, such as image enhancement, filtering, and edge detection. The main task of middle-level vision is to recover depth, surface normals, contours, and other 2.5-dimensional scene information through stereo vision. High-level vision builds on the original input image, the low-level results, and the 2.5-dimensional map: it reconstructs the object in three dimensions in an object-centered coordinate system and creates a three-dimensional description so as to identify the object and determine its position and orientation. The system structure discussed here is not a specific, practical design that can be implemented directly; rather, it is a high-level abstract model of how such a system might be organized. Finally, the original image produced by the imaging system cannot be used directly, because the vision system is subject to various sources of interference; the image must first be preprocessed by noise filtering or grayscale correction.
Basic structure of sports basketball analysis based on machine vision is shown in Fig. 3.

Fig. 3
figure 3

Basic structure of sports basketball analysis using machine vision

In the context of machine vision, the term 'machine' encompasses the equipment and mechanisms involved in basketball movement and control. 'Vision' encompasses both the hardware and software components of the system. The hardware aspect includes devices like cameras and image acquisition tools. Image acquisition is the process of converting visual information and intrinsic properties of the measured object into computer-readable data. This step significantly impacts the stability and reliability of the system. Typically, optical systems, light sources, cameras, and image-processing devices (including image memory cards) are employed to capture images of the object under examination [17].

At its core, machine learning is grounded in the study of statistical theory and data analysis, forming a fundamental part of artificial intelligence. It seeks to mimic human learning processes by deriving rules and patterns from vast datasets, allowing computers to make intelligent decisions based on these learned rules. However, one notable limitation lies in artificial neural networks, which can easily suffer from overfitting issues during the learning process, leading to poor generalization capabilities. In traditional pattern recognition techniques, the primary approach for improving recognition accuracy typically involves increasing the quantity of training samples. This approach primarily focuses on aligning the classifier with the training data but often falls short in accurately classifying new, unseen test samples in real-world scenarios.

Support vector machines handle multi-class problems by combining multiple binary classifiers. Whether a sample set is linearly separable depends on whether it can be divided by a linear function: if it can, it is linearly separable; otherwise, it is nonlinearly separable. The general form of a linear discriminant function in D-dimensional space is shown in Formula (1):

$$g\left(m\right)=\omega \cdot m+b.$$
(1)

The decision function is as in Formula (2):

$$f\left(m\right)={\text{sgn}}\left(\omega \cdot m+b\right).$$
(2)

Normalizing the discriminant function, the constraint is as in Formula (3):

$${n}_{i}\left(\omega {m}_{i}+b\right)\ge 1, \quad i=1,2,...,y.$$
(3)

The Lagrange function is defined as in Formula (4):

$$L\left(\omega ,b,a\right)=\frac{1}{2}{\Vert \omega \Vert }^{2}-\sum_{i=1}^{y}{\alpha }_{i}\left[{n}_{i}\left(\omega \cdot {m}_{i}+b\right)-1\right].$$
(4)

According to Formula (4), the following constraint as in Formula (5) can be obtained:

$$\sum_{i=1}^{y}{\alpha }_{i}{n}_{i}=0, \quad {\alpha }_{i}\ge 0.$$
(5)

If α* is the optimal solution, then Formulas (6) and (7) can be obtained:

$${\omega }^{*}=\sum_{i=1}^{y}{\alpha }_{i}^{*}{n}_{i}{m}_{i},$$
(6)
$${b}^{*}={n}_{i}-{\omega }^{*}\cdot {m}_{i}.$$
(7)

Here, the class of a test sample m is determined by the optimal classification function.
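The recovery of the classifier from the optimal multipliers in Formulas (6) and (7), followed by the decision rule of Formula (2), can be sketched in a few lines of NumPy. The toy support vectors, labels, and multiplier values below are illustrative assumptions, not values from the study:

```python
import numpy as np

def svm_decision(m, support_vectors, labels, alphas):
    """Evaluate the linear SVM decision function f(m) = sgn(w*m + b),
    with w and b recovered from the optimal multipliers."""
    # Formula (6): w* = sum_i alpha_i* n_i m_i
    w = np.sum(alphas[:, None] * labels[:, None] * support_vectors, axis=0)
    # Formula (7): b* = n_i - w*.m_i, evaluated at any support vector
    b = labels[0] - w @ support_vectors[0]
    # Formula (2): sign of the discriminant function
    return int(np.sign(w @ m + b))

# Toy 1-D example: two support vectors at -1 and +1 with labels -1, +1
sv = np.array([[-1.0], [1.0]])
n = np.array([-1.0, 1.0])
alpha = np.array([0.5, 0.5])

print(svm_decision(np.array([2.0]), sv, n, alpha))   # prints 1
print(svm_decision(np.array([-3.0]), sv, n, alpha))  # prints -1
```

With these values, w* = 1 and b* = 0, so the decision boundary sits midway between the two support vectors, as expected for a maximum-margin separator.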

3.2 Overview of Motion Feature Extraction

Motion recognition is a broad field of research encompassing multi-sensor technology, image processing, computer-aided design, pattern recognition, virtual reality, computer vision and graphics, visualization techniques, and intelligent robotic systems. Methods for analyzing and processing human motion perception usually include motion pattern recognition, motion pattern feature extraction, and motion pattern recognition against complex backgrounds. At the high-level processing stage, behavior understanding and identity recognition are active research areas that have received extensive attention in recent years, particularly the analysis and description of human motion and, from it, the recognition of identity. The proliferation of motion detection algorithms has made human feature extraction and identification an important problem for today's systems [18].

A standard motion feature recognition system consists of three main parts: image preprocessing and motion feature detection, motion feature extraction, and classification. Its components include a camera, a host system, and a software package to process and identify people in moving video footage. A typical motion feature recognition system is shown in Fig. 4.

Fig. 4
figure 4

Typical motion feature recognition system

First, a video surveillance system captures a sequence of images of human movement. Second, image preprocessing and binarization are performed on the detected video sequences of moving human objects; after these operations, the moving body image becomes clearer and the background more uniform. Third, typical features of human motion are extracted according to various rules and processed appropriately to ensure their consistency with human motion patterns. Finally, the motion features to be identified are compared with feature templates in the feature database for classification [19].

Traditional feature recognition methods mainly use pattern matching, which is fast but not reliable, with a low recognition rate in complex environments. In recent years, feature recognition has mainly relied on statistical learning methods, that is, first building a set of rules based on knowledge, or using statistical learning methods to automatically learn corresponding rules and knowledge from patterns, and using the built or learned rules to classify patterns. Statistical learning methods overcome the shortcomings of traditional recognition methods and have good recognition performance in complex backgrounds and noisy situations, which is the future development direction of recognition technology.

The representation methods of target features are divided into static and dynamic features. The static representation of target features belongs to the perceptual layer, which mainly includes the shape, outer contour, color information, and texture information of the target image. They can be divided into methods that represent features in terms of boundaries, regions, and changes [20].

Boundary-based feature representations generally fall into three categories. The first category uses boundary points to represent contours, usually using the marker point method. The second category uses typical boundary parameters, such as chain codes, boundary segments, etc. The third category uses curve approximation methods, of which the polygonal method is the most commonly used. The technology classification based on boundary representation is shown in Fig. 5.

Fig. 5
figure 5

Technology classification based on boundary representation

Representing boundaries with chain code features involves defining an initial point by its coordinates and characterizing each subsequent point by the direction of the step connecting it to the previous one; the eight-connected variant, which links each point to one of its eight neighbors, is the most frequently used. This approach is commonly employed to depict the outlines of curves or regions. Boundary features can also represent curves by approximating them as polygons: for a closed curve, the representation is considered valid when the number of points on the curve's edge matches or closely approximates the number of edges of the corresponding polygon [21].
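As an illustration of the eight-connected encoding described above, the following sketch (with a hypothetical `chain_code` helper and a toy square boundary) turns an ordered boundary into a start point plus one direction digit per step. The direction numbering used here is one common convention, not the only one:

```python
# 8-connected chain code: offset (drow, dcol) -> direction digit.
# Direction 0 points right; digits increase counter-clockwise.
DIRECTIONS = {(0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
              (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7}

def chain_code(boundary):
    """Encode an ordered boundary (list of (row, col) points) as its
    start point plus a list of direction digits, one per step."""
    codes = [DIRECTIONS[(r1 - r0, c1 - c0)]
             for (r0, c0), (r1, c1) in zip(boundary, boundary[1:])]
    return boundary[0], codes

# A unit square traced from the top-left corner back to the start
square = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
start, code = chain_code(square)
print(start, code)  # prints (0, 0) [0, 6, 4, 2]
```

The code list is compact and translation-invariant: shifting the whole boundary changes only the stored start point, not the digits.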

Feature recognition typically relies on differences between features of different objects and similarities within the same object. Region-based algorithms are the most common parallel methods for direct region detection. The concept of extracting structural features from moving objects is to first obtain meaningful sub-regions in the distribution structure of objects, where meaningful means that moving objects are divided into several sub-regions with topological or physical significance. Topological meaning refers to the spatial topological relationship between sub-regions, and the physical meaning refers to the independent motion characteristics of each sub-region. The technical classification based on region representation is shown in Fig. 6.

Fig. 6
figure 6

Technology classification based on region representation

There are three prominent region-based approaches used for approximating object characteristics and interior features in the field of image and object representation. The first method, known as region decomposition, involves the use of a spanning tree to break down a target object into simpler units. In this approach, the enclosing region is defined by geometric primitives that fill the space, and the content feature comprises the pixels residing within this region. The second technique, region enclosing, employs methods such as creating an outer bounding box, a minimum bounding rectangle, or a convex graph around the object to represent its outer shape effectively. Finally, the third approach involves harnessing internal region features, with skeleton features being a notable example. These region-based methods offer diverse strategies for capturing both the overarching object characteristics and the finer interior details, rendering them valuable tools in various image and object recognition applications.

Feature extraction algorithms are unreliable in real time due to background noise, shadows caused by lighting changes, errors caused by camera shake, and the mutual occlusion and self-occlusion of moving objects. The frame difference method, the optical flow method, and the background subtraction method are the most commonly used feature extraction methods [22].

The frame difference method computes the absolute value of the brightness difference between consecutive images in a video sequence. It is well suited to highlighting changes between two images, works with multiple moving targets, and can also cope with moving cameras. The mathematical formula of the frame difference method is given in Formula (8):

$$G\left(m,n\right)=\left|{f}_{k+1}\left(m,n\right)-{f}_{k}\left(m,n\right)\right|.$$
(8)

If the absolute value of the brightness difference obtained after discrimination is greater than the predetermined threshold, it is considered as the target area of human motion. Otherwise, it is considered as the background area.
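The frame difference and thresholding steps above can be sketched in NumPy; the toy frames, patch brightness, and threshold value are illustrative assumptions:

```python
import numpy as np

def frame_difference(prev_frame, next_frame, threshold=25):
    """Formula (8): G(m,n) = |f_{k+1}(m,n) - f_k(m,n)|, then threshold
    the result to mark moving regions (1) against background (0)."""
    diff = np.abs(next_frame.astype(np.int16) - prev_frame.astype(np.int16))
    return (diff > threshold).astype(np.uint8)

# Toy grayscale frames: a bright 'player' patch shifts one column right
f_k = np.zeros((4, 4), dtype=np.uint8)
f_k[1:3, 0:2] = 200
f_k1 = np.zeros((4, 4), dtype=np.uint8)
f_k1[1:3, 1:3] = 200

mask = frame_difference(f_k, f_k1)
print(mask)  # 1s mark the vacated and newly covered columns
```

Note the cast to a signed type before subtracting: subtracting unsigned 8-bit arrays directly would wrap around instead of producing negative differences.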

Moving objects on the retina of the human eye form a continuous sequence of images over time, giving the impression that the image "flows" across the retina, and hence the name "optical flow". At each time point, there are corresponding points in the image, which are connected by Formula (9).

$$G\left(m,n,t\right)=G\left(m+\Delta m,n+\Delta n,t+\Delta t\right).$$
(9)

The right-hand side of Formula (9) is expanded with Taylor formula, and the formula for constraining optical flow can be obtained as Formula (10):

$$\frac{\partial {G}_{t}}{\partial m}\frac{{\text{d}}m}{{\text{d}}t}+\frac{{\partial G}_{t}}{\partial n}\frac{{\text{d}}n}{{\text{d}}t}+\frac{{\partial G}_{t}}{\partial t}=0.$$
(10)

Formula (10) gives the spatio-temporal correspondence between the gradients of the moving target in time and space. The error function to be minimized is defined in Formula (11):

$${\varepsilon }_{flow}=\sum_{m}{\left({u}_{x}{e}_{m}+{v}_{x}{e}_{n}+{e}_{t}\right)}^{2}.$$
(11)

Since the optical flow method is based on the Taylor expansion, it is very sensitive to light and noise, and separating some objects is computationally expensive [23].
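Minimizing Formula (11) over a small window reduces to a linear least-squares problem in the flow vector (u, v). The sketch below uses synthetic gradients that satisfy the constraint exactly, so it is an illustration of the estimation step under idealized assumptions, not a full optical flow implementation:

```python
import numpy as np

def solve_flow(e_m, e_n, e_t):
    """Minimize Formula (11), sum of (u*e_m + v*e_n + e_t)^2 over a
    window, by solving the least-squares system for (u, v)."""
    A = np.stack([e_m.ravel(), e_n.ravel()], axis=1)  # spatial gradients
    b = -e_t.ravel()                                  # temporal derivative
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic 5x5 window with true motion (u, v) = (1.0, -0.5); e_t is
# generated to obey the optical flow constraint exactly
rng = np.random.default_rng(0)
e_m = rng.normal(size=(5, 5))
e_n = rng.normal(size=(5, 5))
e_t = -(1.0 * e_m + (-0.5) * e_n)

u, v = solve_flow(e_m, e_n, e_t)
print(round(u, 3), round(v, 3))  # recovers 1.0 -0.5
```

On real footage the constraint holds only approximately, so the residual of this least-squares fit also indicates how trustworthy the estimated flow is for a given window.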

For scenes where the camera position is fixed, the most commonly used object detection method is the background subtraction method. The basic principle of the background subtraction method is to construct a model about the background by analyzing the video sequence and to compare the current frame image with the background model, detecting the moving target by comparing the results. The formula for background subtraction is as in Formula (12):

$${R}_{t}\left(m,n\right)=\left\{\begin{array}{ll}0 & \left|{f}_{t}\left(m,n\right)-{b}_{t}\left(m,n\right)\right|>T\\ 1 & \left|{f}_{t}\left(m,n\right)-{b}_{t}\left(m,n\right)\right|\le T\end{array}\right..$$
(12)

The background subtraction method is a simple real-time algorithm that captures a more complete image of a moving target than the frame difference method. In practice, however, lighting and environmental conditions change continuously over time, causing interference between the current image and the background model. To overcome this, the background model must be continuously updated in practical applications.
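A minimal NumPy sketch of background subtraction with a running background update follows. Note that it marks moving pixels as 1, the complement of the labelling in Formula (12); the frames, threshold, and update rate are illustrative assumptions:

```python
import numpy as np

def background_subtract(frame, background, T=30, alpha=0.05):
    """Threshold |f_t - b_t| against T (cf. Formula (12), with 1 marking
    the moving target here), then blend the frame into the background at
    rate alpha wherever the scene is judged static."""
    diff = np.abs(frame.astype(np.float64) - background)
    foreground = diff > T
    # Update the model only at static pixels so the target is not absorbed
    new_bg = np.where(foreground, background,
                      (1 - alpha) * background + alpha * frame)
    return foreground.astype(np.uint8), new_bg

bg = np.full((3, 3), 50.0)
frame = bg.copy()
frame[1, 1] = 200.0  # a moving object appears at the centre pixel

mask, bg = background_subtract(frame, bg)
print(mask)  # 1 only at the centre pixel
```

Freezing the update at foreground pixels is one simple way to keep a slowly moving player from being blended into the background model; more robust systems use per-pixel Gaussian models as described below.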

In the context of a Gaussian mixture model (GMM), the computation of a data point's probability (responsibility) associated with a specific Gaussian component entails evaluating the probability density function (PDF) for that component based on its mean and standard deviation. The resulting PDF is then multiplied by the weight of the component to obtain the responsibility. Normalizing these responsibilities is crucial to ensure they collectively sum to 1, thereby representing probabilities. Typically, the data point is assigned to the component with the highest normalized responsibility. Throughout the training process for clustering and density estimation in intricate datasets, the expectation–maximization (EM) algorithm refines these probabilities and the parameters of the components iteratively.
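The responsibility computation just described can be sketched for a scalar data point with NumPy; the component means, standard deviations, and weights below are illustrative assumptions:

```python
import numpy as np

def responsibilities(x, means, stds, weights):
    """E-step of EM for a scalar GMM: evaluate each component's Gaussian
    PDF at x, weight it, and normalize so the responsibilities sum to 1."""
    pdf = (1.0 / (stds * np.sqrt(2.0 * np.pi))
           * np.exp(-0.5 * ((x - means) / stds) ** 2))
    weighted = weights * pdf
    return weighted / weighted.sum()

# A two-component mixture with equal weights
means = np.array([0.0, 5.0])
stds = np.array([1.0, 1.0])
weights = np.array([0.5, 0.5])

r = responsibilities(4.8, means, stds, weights)
print(r.argmax())  # the point near 5 is assigned to the second component
```

In the full EM loop, these responsibilities would then drive the M-step, re-estimating each component's mean, variance, and weight from the responsibility-weighted data.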

The Gaussian mixture model is an extension of the single Gaussian model, which is applicable to unimodal scenes where changes in the background can be disregarded. The Gaussian distribution function is expressed mathematically in Formula (13).

$$P\left({M}_{t},{\mu }_{t},\Sigma_{t}\right)=\frac{1}{{\left(2\pi \right)}^{\frac{y}{2}}{\left|\Sigma_{t}\right|}^{\frac{1}{2}}}{e}^{-\frac{1}{2}{\left({m}_{t}-{\mu }_{t}\right)}^{T}\Sigma_{t}^{-1}\left({m}_{t}-{\mu }_{t}\right)}.$$
(13)

In Formula (13), y represents the pixel color dimension; y is 1 for grayscale images. For a single Gaussian model, the background model is usually updated using Formulas (14) and (15).

$${\mu }_{t+1}=\left(1-\alpha \right){\mu }_{t}+\alpha {M}_{t},$$
(14)
$$\Sigma_{t+1}=\left(1-\alpha \right)\Sigma_{t}+\alpha \left({M}_{t}-{\mu }_{t+1}\right){\left({M}_{t}-{\mu }_{t+1}\right)}^{T}.$$
(15)
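The scalar form of this single-Gaussian update can be sketched as follows, with the variance update using the squared deviation from the updated mean (the standard form); the pixel samples and learning rate are illustrative assumptions:

```python
def update_gaussian(mu, var, m_t, alpha=0.05):
    """Single-Gaussian background update for one pixel, scalar versions
    of Formulas (14) and (15): running averages of mean and variance
    with learning rate alpha."""
    mu_new = (1 - alpha) * mu + alpha * m_t
    var_new = (1 - alpha) * var + alpha * (m_t - mu_new) ** 2
    return mu_new, var_new

# A background pixel drifting slowly around its model value of 100
mu, var = 100.0, 4.0
for m_t in [102, 101, 103, 102]:
    mu, var = update_gaussian(mu, var, m_t)
print(round(mu, 2))  # the mean drifts gently toward the new samples
```

A small alpha makes the model adapt slowly, so brief occlusions by a player barely disturb it, while persistent scene changes (for example, new lighting) are eventually absorbed into the background.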

A normal distribution is a statistical concept representing a bell-shaped curve. It is defined by two key parameters: the mean (μ), the center of the curve, and the standard deviation (σ), which measures the spread of data points around the mean. The normal distribution's probability density function describes how data are distributed within this curve. It is a fundamental tool in statistics, used to model a wide range of real-world phenomena, and follows the empirical rule: most data fall within one, two, or three standard deviations of the mean.

The single Gaussian model is suitable for situations where the background scene changes slowly and the scene is simple. If the background is complex or changes rapidly, the background pixel distribution shows a multimodal distribution, and the single Gaussian model cannot accurately describe the background. To effectively describe the background, a Gaussian mixture model can be built for these pixels. The mixed Gaussian model uses multiple weights to represent complex and changing scenes. Compared with the optical flow method, the mixed Gaussian model method has lower computational complexity, showing better results for outdoor target detection. The probability density function of the Gaussian mixture model is set as in Formula (16):

$$p\left(m\right)=\sum_{i=1}^{X}{\omega }_{i}{g}_{i}\left(m\right).$$
(16)

In Formula (16), m represents a point in the Y-dimensional space, ${g}_{i}$ is the ith Gaussian density with weight ${\omega }_{i}$, and X represents the number of components in the mixture. The larger the value of X, the stronger the model's ability to handle background fluctuations. The larger a component's weight, the more dominant that Gaussian is in the mixture model, and the weights sum to 1, as shown in Formula (17):

$$\sum_{i=1}^{X}{\omega }_{i}=1.$$
(17)
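Formulas (16) and (17) can be sketched directly in code. The following minimal illustration, with component values of our own choosing, evaluates a one-dimensional two-component mixture density and checks the weight constraint:

```python
import math

def gaussian_pdf(m, mu, var):
    """1-D Gaussian density g_i(m) with mean mu and variance var."""
    return math.exp(-0.5 * (m - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_pdf(m, components):
    """Formula (16): p(m) = sum_i w_i * g_i(m).

    components is a list of (weight, mean, variance) triples.
    """
    return sum(w * gaussian_pdf(m, mu, var) for w, mu, var in components)

# Two hypothetical background modes (e.g. sunlit pavement and shadow).
comps = [(0.7, 100.0, 25.0), (0.3, 60.0, 16.0)]

# Formula (17): the weights must sum to 1.
assert abs(sum(w for w, _, _ in comps) - 1.0) < 1e-12

print(mixture_pdf(100.0, comps))   # high density near the dominant mode
```

A pixel whose intensity has high density under this mixture is classified as background; intensities far from every mode are treated as foreground.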

The utilization of the Gaussian mixture model involves three distinct stages: defining the model, updating the model, and assessing the background. The expectation-maximization (EM) algorithm, a widely used statistical method for parameter estimation in models with hidden variables, proceeds by iterative optimization in two alternating steps. The expectation (E) step computes, for each observation, the posterior probability that it was generated by each mixture component; the maximization (M) step then re-estimates the weights, means, and covariances from those posteriors. Unlike numerical optimization methods that work only with the observed-data likelihood, EM explicitly accounts for the unobserved component assignments, which makes it particularly well suited to problems requiring the incorporation of incomplete or missing data in parameter estimation.
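A compact sketch of one EM iteration for a one-dimensional two-component mixture may clarify the procedure. This is our own illustration, not the paper's implementation; all names, the synthetic data, and the iteration count are assumptions.

```python
import math
import random

def pdf(x, mu, var):
    """1-D Gaussian density."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def em_step(data, params):
    """One EM iteration; params is a list of [weight, mean, variance]."""
    # E-step: posterior responsibility of each component for each sample.
    resp = []
    for x in data:
        like = [w * pdf(x, mu, var) for w, mu, var in params]
        total = sum(like)
        resp.append([l / total for l in like])
    # M-step: re-estimate each component from weighted sufficient statistics.
    new_params = []
    n = len(data)
    for k in range(len(params)):
        nk = sum(r[k] for r in resp)
        mu = sum(r[k] * x for r, x in zip(resp, data)) / nk
        var = sum(r[k] * (x - mu) ** 2 for r, x in zip(resp, data)) / nk
        new_params.append([nk / n, mu, max(var, 1e-6)])
    return new_params

# Synthetic data: two well-separated clusters around 0 and 5.
random.seed(0)
data = [random.gauss(0, 1) for _ in range(200)] + \
       [random.gauss(5, 1) for _ in range(200)]

params = [[0.5, -1.0, 1.0], [0.5, 6.0, 1.0]]   # rough initial guess
for _ in range(30):
    params = em_step(data, params)

print([round(p[1], 1) for p in params])   # means recovered near 0 and 5
```

In background modeling practice, a cheaper online approximation of this update is usually applied per pixel rather than full batch EM, but the E-step/M-step structure is the same.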

4 Extraction and Discussion on Foul Features in Basketball Games

4.1 Comparison of the Number of Fouls, Time, and Areas in Basketball Games

In this section, video of the basketball matches between the Chinese team and its opponents at the Olympic Games was analyzed. The video was strictly monitored using the feature extraction technology of machine vision. Competitive sports are extremely intense and the form of the game changes rapidly, so only with the help of video can the game situation be analyzed accurately. The statistics of the number of fouls in the Olympic men's basketball team are shown in Table 1.

Table 1 Statistics on the number of fouls in the Olympic men's basketball team

Table 1 illustrates that the Chinese team accrued a total of 90 fouls during the Olympic Games, averaging 18 fouls per game. Its primary players were responsible for 45 fouls in total, averaging 9 fouls per game, and were themselves fouled 75 times, averaging 15 per game; substitutes contributed 46 fouls in total, averaging 10 per game. The opponent teams committed 106 fouls in total, averaging 22 fouls per game. The opponents' main players committed 56 fouls, averaging 12 per game, and were fouled 61 times, averaging 13 per game; the opponents' substitutes were responsible for 51 fouls in total, averaging 11 fouls per game.

The Akaike information criterion (AIC) and the Bayesian information criterion (BIC) are widely used with Gaussian mixture models (GMMs). These metrics assess model fit while penalizing model complexity, aiding the selection of the optimal number of components. Lower AIC and BIC values indicate better model fits, with BIC imposing a stronger complexity penalty than AIC. Researchers and practitioners frequently rely on these criteria for effective model selection in GMMs.

The statistics on the number of fouls by players in different positions in the Olympic men's basketball team are shown in Fig. 7:

Fig. 7
figure 7

Statistics on the a total number of fouls and b average number of fouls by players in different positions in the Olympic men's basketball team

Figure 7a displays the total number of fouls committed by players in different positions, and Fig. 7b depicts the average number of fouls per game for players in various positions. Upon analysis, it is evident that in the forward position of the Chinese team, there were a total of 45 fouls, resulting in an average of 9 fouls per game. Guards committed 25 fouls in total, averaging 5 per game, while centers were responsible for 22 fouls, averaging 4 per game. In contrast, for the opposing team, forwards were charged with 46 fouls, averaging 10 per game, and guards committed 36 fouls, averaging 8 per game. Centers from the opponent's side fouled 26 times, with an average of 6 per game. The data underscores a relatively small disparity in the number of fouls between forwards and centers on both teams, with the most significant difference observed in the guard position. Notably, regardless of team affiliation, forward players consistently exhibited a higher tendency to commit fouls compared to players in other positions. The statistics of the Olympic men's basketball foul time period are shown in Fig. 8.

Fig. 8
figure 8

Statistics of Olympic men's basketball foul time period for a Chinese team and b opponent team

Figure 8a shows the foul time period of the Chinese team, and Fig. 8b shows that of the opponent team. As can be seen from Fig. 8, the Chinese team averaged 5.4 fouls per game in the first quarter, 6.8 in the second, 4.6 in the third, and 5 in the fourth. The opponents averaged 5 in the first quarter, 6.2 in the second, 6.4 in the third, and 7.4 in the fourth. These data show that the number of fouls committed by the Chinese team in the second quarter was significantly higher than in other periods, while the fourth quarter was the peak of the opponents' fouling. The statistics of the foul area of the Olympic men's basketball team are shown in Fig. 9.

Fig. 9
figure 9

Statistics of Olympic men's basketball foul areas of a Chinese team and b opponent team

Figure 9a represents the foul area of the Chinese team, and Fig. 9b that of the opponent team. As can be seen from Fig. 9, the Chinese team averaged 12.8 fouls in the first area, 3.8 in the second, and 4.8 in the third, while the opponent team averaged nine fouls in each of the first and second areas and six in the third. These data show that the Chinese team's foul rate was highest in the first area, and that its fouls in the second area were significantly fewer than the opponent's. Apart from the high foul rate in the first area, the Chinese team committed fewer fouls in the other two areas, where the opponents' fouls were significantly more numerous.

4.2 Comparison of Foul Types in Basketball Games

In this section, a more in-depth analysis of basketball game fouls was conducted, building on the research in the previous section. As before, video monitored by the feature extraction technology under machine vision was used to analyze the types of fouls in the game in detail. These analyses help players become aware of their mistakes and tactical problems. The statistics of the nature of fouls in the Olympic men's basketball team are shown in Table 2.

Table 2 Statistics on the nature of fouls in the Olympic men's basketball team

As can be seen from Table 2, the Chinese team committed 19 blocking fouls, 64 fouls for illegal use of hands, 8 fouls for bumping into opponents, 1 technical foul, and 2 unsportsmanlike fouls. The opponent team committed 19 blocking fouls, 82 fouls for illegal use of hands, 5 fouls for bumping into opponents, 3 technical fouls, and 1 unsportsmanlike foul. The most common foul on both sides was illegal use of hands, accounting for 71.8% of the Chinese team's fouls and 78.1% of the opponents'. The statistics of the types of defensive fouls of the Olympic men's basketball team are shown in Table 3.

Table 3 Statistics of Olympic men's basketball defensive foul types

Table 3 outlines the fouls committed by the Chinese team and the opponent team in various defensive situations. The Chinese team fouled 16 times against dribbling, 41 times against shooting, 5 times against ball holding, 16 times against players without the ball, 5 times against catching, and 4 times against passing. The opponent team fouled 33 times against dribbling, 32 times against shooting, 7 times against ball holding, 12 times against players without the ball, 14 times against catching, and 5 times against passing. An analysis of the data reveals that both teams committed more fouls when defending against shooting, which accounted for 50.4% of the Chinese team's fouls and 33% of the opponents' fouls.

5 Conclusions

With the continuous advancement of basketball techniques and tactics, the game has become more intense, leading to an increase in fouls and turnovers. Player errors are evolving in a more nuanced and professional direction, emerging as a crucial factor in the game. Referees play an indispensable role in basketball games, impacting not only the flow of the game, but also players' technical and tactical performances. To develop the sport, it is imperative not only to elevate the technical proficiency of players and coaches, but also to refine the application of referee rules, a critical aspect in propelling basketball development. Hence, whether referees comprehend the significance of the rules and accurately adjudicate fouls and violations is vital for the fair and smooth course of a game and the normal performance of both teams. Consequently, the research presented in this paper, focusing on machine vision and feature extraction technology, can aid referees in monitoring basketball players' foul actions, and holds significant theoretical and practical value in the realm of sports officiating.