Optimal Extraction Method of Feature Points in Key Frame Image of Mobile Network Animation

To effectively extract the feature points of mobile network animation images and accurately reflect the main content of the video, an optimized method for extracting the feature points of key frame images of mobile network animation is proposed. First, the key frames are selected according to the degree of content change in the animation video. The scale invariant feature transform (SIFT) algorithm is then used to describe the feature points of the key frame images. The local feature points of the image are estimated by a constrained optimization method to realize the optimized extraction of the feature points of the key frame images. The extraction performance is analyzed in terms of the number and effectiveness of the extracted feature points, the time consumed, and the similarity invariance of the feature points. The experimental results show that the proposed method has excellent adaptability and can effectively extract the feature points of mobile network animation images.


Introduction
Mobile network animation images are prone to flicker, shadows and other artifacts [1]. To overcome these problems, reflect the main content of the video and enhance the continuity of the animation, stable and reliable features must be extracted from the images. Because imaging scenes are complex and different sensors have different characteristics, it is very difficult to extract feature points that remain stable across images [2][3][4]. Since the number of image feature points is much smaller than the total number of image pixels, image matching algorithms based on feature points reduce the amount of computation to a certain extent; at the same time, feature points are not very sensitive to noise, distortion and occlusion, so feature point extraction can improve the matching accuracy of the algorithm. Reference [5] proposed a spike-based minimum error entropy framework, which uses entropy theory to establish a gradient-based online meta-learning scheme in a recursive SNN architecture. The model can effectively improve the accuracy and robustness of spike-based meta-learning. Reference [6] proposed a new framework based on entropy theory, namely a spike-driven few-shot online learning model based on heterogeneous integration. Using entropy theory, a gradient-based few-shot learning model is established in the recursive SNN structure. The model can effectively improve the accuracy and robustness of spike-driven few-shot learning.
The scale invariant feature transform (SIFT) algorithm has good invariance and stability under scale, view angle, rotation and other image transformations, so it has certain advantages in image feature extraction. However, its feature extraction is affected on images dominated by non-smooth curves, so it is improved and optimized in this paper.
First, this paper selects key frames to narrow the overall range of image feature extraction in the mobile network. The innovation is to use the SIFT algorithm to preliminarily extract the key feature points of the image and then build a nonlinear model for local optimization. The parameters of the image are optimally estimated to maximize the gray correlation between the feature points of the preceding and following frames. The influence of edge feature points is avoided and the representativeness of the feature points is improved. The method has clear advantages in adaptability: in the number of feature points, the extraction speed and the effectiveness of the feature points, it outperforms the SIFT algorithm alone.

Feature point extraction based on improved scale invariant feature transform algorithm
The scale invariant feature transform (SIFT) algorithm is used to select the feature points of the animation video image. Because the algorithm produces many edge feature points while selecting feature points, the unstable edge feature points must be eliminated. Therefore, before the algorithm is applied, the key frames of the animation video are extracted first and the feature points are selected within the key frames. This prevents edge feature points in non-key frames from degrading the accuracy of the feature point selection and makes the final feature points more representative.

Key frame selection
According to the degree of change of the video content, the algorithm selects 1 to 3 frames as the key frames of the video. The specific number of key frames is determined automatically by the degree of content change within the video. First, the first frame, the middle frame and the last frame are selected as candidate key frames. Let f represent the statistical mean of the difference between all adjacent frames in the video, and let F1, F2 and F3 represent the inter-frame differences between the three candidate key frames. The three values are calculated and each is compared with f. Based on the result, the final key frames are determined according to the following rules. If F1, F2 and F3 are all less than f, the contents of the three candidate key frames are close and can represent each other. For uniformity, this paper then takes the middle frame as the key frame representing the main content of the video. If F1, F2 and F3 are all greater than f, the contents of the three candidate key frames differ greatly and each frame contains content not present in the others, indicating that the video content changes. According to the principle of key frame selection, all three frames are then taken as key frames.
In all other cases, F1, F2 and F3 are compared with each other, the largest is selected, and the corresponding candidate key frame is taken as the final key frame.
One advantage of this algorithm is that the number of key frames is determined by the changes in the animation video content. Compared with traditional key frame extraction methods that use a fixed number of frames, this algorithm greatly reduces the redundancy of the information represented by the key frames of a video and makes the extracted key frames more representative. Consequently, any of the candidate frames may become the only key frame of the video.
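The three selection rules can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_keyframes` and its arguments are hypothetical, and the pairing of each difference F1, F2, F3 with one candidate frame in the mixed case is assumed, since the text does not specify it.

```python
def select_keyframes(f_mean, diffs, candidates=("first", "middle", "last")):
    """Pick key frames from three candidates using the three rules in the text.

    f_mean     -- mean difference between all adjacent frames (f in the text)
    diffs      -- (F1, F2, F3), the inter-frame differences of the candidates
    candidates -- labels or indices of the first, middle and last frame
    Assumption: diffs[i] is paired with candidates[i] in the mixed case.
    """
    if len(diffs) != 3 or len(candidates) != 3:
        raise ValueError("exactly three differences and three candidates are expected")
    # Rule 1: all differences below the mean, the frames represent each other.
    if all(d < f_mean for d in diffs):
        return [candidates[1]]
    # Rule 2: all differences above the mean, the content changes strongly.
    if all(d > f_mean for d in diffs):
        return list(candidates)
    # Rule 3: mixed case, keep the candidate paired with the largest difference.
    largest = max(range(3), key=lambda i: diffs[i])
    return [candidates[largest]]
```

For example, `select_keyframes(1.0, (0.2, 0.5, 1.9))` falls into the mixed case and returns only the candidate paired with the largest difference.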

Scale invariant feature points in scale space
After the key frames of the animation video are determined, the Gaussian scale space is constructed and the scale invariant feature points are found. If the same feature points can be detected in an image at different scales, these feature points have a certain scale invariance. To find points with scale invariance, the key is to construct the scale space first [7,8], which requires a good kernel function, namely the Gaussian kernel function. The scale space is therefore constructed by convolving the original image with the two-dimensional Gaussian kernel function:

$$L(a, b, \sigma) = G(a, b, \sigma) * P(a, b) \qquad (1)$$

where P(a, b) is the original image, G(a, b, σ) is the Gaussian kernel function with scale σ, and L(a, b, σ) is the resulting scale space. The Gaussian kernel function is given in formula (2):

$$G(a, b, \sigma) = \frac{1}{2\pi\sigma^{2}} \exp\left(-\frac{a^{2} + b^{2}}{2\sigma^{2}}\right) \qquad (2)$$

To obtain the feature points at a certain scale, the Gaussian difference scale space D(a, b, σ) is needed, which is the difference of adjacent scale spaces:

$$D(a, b, \sigma) = L(a, b, k\sigma) - L(a, b, \sigma) \qquad (3)$$

where k is the constant ratio between adjacent scales. If the difference value of a pixel is the minimum or maximum among its 26 neighbors, that is, the 8 pixels in its own neighborhood and the 18 pixels at the corresponding positions in the layers above and below, the pixel is selected as an extreme point in the scale space, that is, a feature point with scale invariance.
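The 26-neighbor extremum test can be illustrated with a short Python sketch. The function name and the nested-list layout `dog[layer][row][col]` are assumptions made for illustration, and the strict comparison is also an assumption, since the text does not say how ties are handled.

```python
def is_scale_space_extremum(dog, s, r, c):
    """Return True if dog[s][r][c] is a strict extremum among its 26 neighbors.

    dog is a list of difference-of-Gaussian layers, each a 2-D list of floats.
    The point must be interior: 1 <= s, r, c and one short of each upper bound.
    """
    center = dog[s][r][c]
    neighbors = [
        dog[s + ds][r + dr][c + dc]
        for ds in (-1, 0, 1)      # layer below, same layer, layer above
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (ds, dr, dc) != (0, 0, 0)  # skip the pixel itself
    ]
    # 8 neighbors in the same layer plus 9 + 9 in the adjacent layers.
    assert len(neighbors) == 26
    return center > max(neighbors) or center < min(neighbors)
```

A full detector would apply this test to every interior pixel of every interior layer of the difference-of-Gaussian stack.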

Selection of candidate feature points
Selecting appropriate feature points in the key frame allows the key features of the animation video to be represented effectively, which helps guarantee the continuity of page turning in the animation video. In the process of selecting feature points, the gray value at the extreme point $x_{\max}$ can be used to judge whether the feature point lies in a low-contrast area, where it is easily disturbed by noise. In this case, the feature point is unstable and should be omitted [9,10]. The gray value at the extreme point $x_{\max}$ is:

$$D(x_{\max}) = D + \frac{1}{2} \frac{\partial D^{T}}{\partial x} x_{\max} \qquad (4)$$

In this paper, candidate feature points with $|D(x_{\max})| < 0.008$ are omitted.
The Gaussian difference function also has a strong response along edges, so candidate feature points on edges, which are easily affected by noise, must be removed [11,12]. At such an edge point the principal curvature across the edge is large, while the principal curvature in the perpendicular direction is small. The principal curvatures can be calculated from the Hessian matrix:

$$H = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \qquad (5)$$

The eigenvalues of H are proportional to the principal curvatures of D. Let α be the larger eigenvalue, β the smaller eigenvalue, and γ the ratio of α to β, i.e., α = γβ. The following ratio can then be expressed in terms of γ only:

$$\frac{\mathrm{Tr}(H)^{2}}{\mathrm{Det}(H)} = \frac{(\alpha + \beta)^{2}}{\alpha\beta} = \frac{(\gamma + 1)^{2}}{\gamma} \qquad (6)$$

When the two eigenvalues are equal, this expression takes its minimum value. The greater the difference between the two eigenvalues, the greater its value. Therefore, for a given threshold C, it is only necessary to check:

$$\frac{\mathrm{Tr}(H)^{2}}{\mathrm{Det}(H)} < \frac{(C + 1)^{2}}{C} \qquad (7)$$

Feature points satisfying this condition are selected as candidate feature points.
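The two rejection tests of this subsection (low contrast and edge response) can be sketched as follows. The function names are hypothetical. The contrast threshold 0.008 is the value given in the text; the default `c=10.0` is an assumption borrowed from common SIFT implementations, because the paper does not state its value of C.

```python
def passes_contrast_test(d_at_extremum, threshold=0.008):
    """Keep a candidate only if |D(x_max)| reaches the contrast threshold."""
    return abs(d_at_extremum) >= threshold


def passes_edge_test(dxx, dyy, dxy, c=10.0):
    """Keep a candidate only if Tr(H)^2 / Det(H) < (C + 1)^2 / C.

    dxx, dyy, dxy are second derivatives of D at the candidate point,
    typically estimated by finite differences of neighboring pixels.
    c=10.0 is an assumed threshold, not a value taken from the paper.
    """
    trace = dxx + dyy
    det = dxx * dyy - dxy * dxy
    if det <= 0:
        # Curvatures of opposite sign (or a degenerate Hessian): reject.
        return False
    return trace * trace / det < (c + 1.0) ** 2 / c
```

With equal curvatures (`dxx == dyy`, `dxy == 0`) the ratio is 4, its minimum, so the point is kept; a strongly elongated response such as `dxx = 100, dyy = 1` gives a ratio above 100 and is rejected.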

Assignment of candidate feature directions
Each scale invariant feature point is then assigned a main direction. In general, the pixels in the selected neighborhood are related to each other, so the main direction of a feature point can be determined from the distribution of the gradient directions of the pixels in its neighborhood [13,14]. Specifically, a neighborhood block centered on the selected feature point is chosen [15], the gradient directions of all pixels in this block are accumulated in a histogram, and the gradient direction corresponding to the peak of the histogram is selected as the main direction of the feature point.
First, the Gaussian convolution image L corresponding to the candidate feature point is obtained, which preserves invariance to scaling. For a candidate feature point, the gradient amplitude m(x, y) and gradient direction θ(x, y) are obtained from its four neighbors:

$$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^{2} + \left(L(x, y+1) - L(x, y-1)\right)^{2}}, \quad \theta(x, y) = \arctan\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)} \qquad (8)$$

To highlight the role of the candidate feature point, a Gaussian window function is used to weight the neighborhood. Using the finite difference method, the image gradient amplitude is calculated in the neighborhood centered on the feature point. The gradient amplitude of each sampling point is weighted before it is added to the gradient direction histogram. The weighting uses a circular Gaussian function with a standard deviation of 1.5σ, where σ is the scale of the feature point, and the sampling follows the rule of three standard deviations, so the neighborhood window radius R is 3 × 1.5σ. Since the SIFT algorithm only considers invariance to scale and rotation and does not consider affine invariance, the Gaussian weighting gives a large weight to the gradient amplitudes near the feature point, which partially compensates for the instability of the feature points caused by the lack of affine invariance. The gradient direction ranges over [0°, 360°] and is divided into 36 intervals of 10° each to form a histogram. The peak of the histogram gives the main direction of the gradient in the local neighborhood of the candidate feature point, and its height gives the magnitude of the gradient in this direction.
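The orientation assignment can be sketched in plain Python as follows. The names `gradient` and `main_direction` and the image layout `L[y][x]` are assumptions for illustration. `math.atan2` is used in place of the plain arctangent so that directions cover the full 0° to 360° range, and pixels whose four neighbors fall outside the image are simply skipped.

```python
import math


def gradient(L, x, y):
    """Gradient amplitude and direction (degrees in [0, 360)) from the 4-neighbors."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.hypot(dx, dy)
    direction = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, direction


def main_direction(L, cx, cy, sigma):
    """Return (direction in degrees, histogram peak) for the point (cx, cy).

    Uses a 36-bin histogram (10 degrees per bin), a Gaussian weight with
    standard deviation 1.5 * sigma and a window radius of 3 * 1.5 * sigma.
    """
    weight_sigma = 1.5 * sigma
    radius = int(round(3 * weight_sigma))
    hist = [0.0] * 36
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            # Skip pixels without a complete 4-neighborhood inside the image.
            if not (1 <= y < len(L) - 1 and 1 <= x < len(L[0]) - 1):
                continue
            m, theta = gradient(L, x, y)
            dist2 = (x - cx) ** 2 + (y - cy) ** 2
            weight = math.exp(-dist2 / (2.0 * weight_sigma ** 2))
            hist[int(theta // 10) % 36] += weight * m
    peak = max(range(36), key=lambda b: hist[b])
    return peak * 10, hist[peak]
```

The returned direction is the lower edge of the winning 10° bin; a finer estimate would interpolate between neighboring bins, which this sketch omits.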

Generate feature point descriptor
To ensure rotation invariance, the horizontal coordinate axis is first rotated to coincide with the main direction of the feature point. A neighborhood block centered on the feature point is then selected and divided into equal sub-blocks. In each sub-block, the gradient directions of the pixels are accumulated over 8 directions, and the 8-direction gradient histograms of all sub-blocks are merged to generate the feature descriptor of the scale invariant feature point [16][17][18]. Finally, normalization is used to eliminate the influence of illumination changes on the obtained feature points. The specific description process is as follows. First, the gradient amplitude and gradient direction of the pixels surrounding the feature point in the animation video image are calculated by Eq. (8).
As before, a Gaussian window function is used to weight the gradient amplitudes of the neighborhood. Here W/2 is used as the standard deviation of the Gaussian window function, where W is the neighborhood radius of the feature point, which is represented by a circle in Fig. 1. The neighborhood of the feature point is divided into blocks of 4 × 4 pixels, as shown on the left side of Fig. 1. A gradient direction histogram with eight intervals is established in each 4 × 4 pixel block, and the interval to which the gradient direction of each pixel in the block belongs is counted. The value of an interval is the sum of the Gaussian-weighted gradient amplitudes of all pixels whose gradient direction belongs to this interval, and is represented by the arrow length on the right. Therefore, each 4 × 4 pixel block can be represented by an 8-dimensional feature description vector, each dimension of which corresponds to one interval of the histogram. In this paper, 16 (4 × 4) histograms are generated from the 16 × 16 neighborhood pixels around the feature point, and each histogram contains 8 features. Each feature point is therefore described by a feature description vector of 4 × 4 × 8 = 128 dimensions, as shown on the right side of Fig. 1.
Denote the feature description vector as $N = [n_1, n_2, \ldots, n_{128}]^{T}$. To make it invariant to illumination, the feature description vector is normalized to unit length:

$$n_i \leftarrow \frac{n_i}{\sqrt{\sum_{j=1}^{128} n_j^{2}}}, \quad i = 1, 2, \ldots, 128$$

The local gradient values of the image are calculated in the neighborhood of each feature point, so that a stable feature point descriptor is generated. The resulting descriptor has 16 × 8 = 128 dimensions and is robust to scaling and rotation.
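As a rough illustration of how the 128-dimensional vector is assembled and normalized, the sketch below takes two 16 × 16 arrays as input. The function name `build_descriptor` and the input layout are assumptions: the amplitudes are taken to be already Gaussian-weighted, and the directions to be already expressed relative to the main direction.

```python
import math


def build_descriptor(magnitudes, directions):
    """Build a unit-length 128-D descriptor from a 16 x 16 neighborhood.

    magnitudes -- 16 x 16 nested list of (already weighted) gradient amplitudes
    directions -- 16 x 16 nested list of gradient directions in degrees,
                  measured relative to the main direction of the feature point
    """
    descriptor = [0.0] * 128
    for row in range(16):
        for col in range(16):
            block = (row // 4) * 4 + (col // 4)      # which of the 16 sub-blocks
            angle = directions[row][col] % 360.0
            bin_index = int(angle // 45) % 8         # 8 intervals of 45 degrees
            descriptor[block * 8 + bin_index] += magnitudes[row][col]
    norm = math.sqrt(sum(v * v for v in descriptor))
    if norm == 0.0:
        return descriptor                            # flat patch, nothing to scale
    return [v / norm for v in descriptor]
```

Because of the final division, multiplying all gradient amplitudes by a constant factor, as a uniform change in contrast would, leaves the descriptor unchanged.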

Establishment of mathematical model of optimization theory
Mathematical programming is one of the common methods for solving function extremum problems under equality and inequality constraints [19,20]. Many practical problems can be formulated as programming problems. The general form of a program is

$$\min f(x) \quad \text{s.t.} \quad g_j(x) \leqslant 0, \quad j = 1, 2, \cdots, p$$

In other words, a function is minimized (or maximized) under a set of constraints. If a program has no constraints, it is called unconstrained programming; otherwise, it is called constrained programming. The vector $x = (x_1, x_2, \cdots, x_n)$ is called the decision vector, and the function f(x) of the decision vector x is called the objective function. The set $S = \{x \in R^{n} \mid g_j(x) \leqslant 0;\ j = 1, 2, \cdots, p\}$ is called the feasible set. A solution x satisfying x ∈ S is called a feasible solution. The purpose of the programming problem is to find a solution $x^{*}$ ($x^{*} \in S$) such that

$$f(x^{*}) = \min_{x \in S} f(x)$$

According to the feature point extraction model obtained in Section 2.5, although different key frame feature point extraction models are given for different regions of the image, their matching function definition and optimization method are the same. Without loss of generality, a region is therefore selected at random as the research object, and the theoretical framework of the optimization is presented for it.
Five points are sampled at equal intervals in this region, and the parameter vector of the feature points is $x = (x_1, x_2, \cdots, x_7)^{T} = (a_0, a_1, a_2, a_3, a_4, a_5, c)^{T}$. Assume that the original coordinates of the k-th feature sampling point are (i, j) and that the coordinate positions estimated by the block matching method are (i′, j′) and (i″, j″) respectively. The optimization objective function is then defined from the following gray values: $I_k^{(i', j')}(m, n)$ is the gray value of each pixel in the 8-neighborhood centered on point (i′, j′), and $I_k^{(i'', j'')}(m, n)$ is the gray value of each pixel in the 8-neighborhood centered on (i″, j″). The purpose of establishing the optimization objective function is to use optimization theory to estimate the parameters $a_i$ so that the gray correlation between the feature points of the preceding and following frames is maximized.
In addition, based on the above assumptions, a constraint is set on the spatial position: the Euclidean distance between the coordinate positions (i′, j′) and (i″, j″) satisfies the following condition:

$$d_{\mathrm{Euclidean}}\left((i', j'), (i'', j'')\right) \leqslant X \qquad (12)$$

where X is a constant greater than zero. Thus, the optimal extraction of key frame feature points in mobile network animation images is realized, the influence of edge feature points is avoided, and the representativeness of the feature points is improved.

Fig. 1 Extraction of feature point description
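The idea of matching gray values between frames under a Euclidean distance constraint can be illustrated with a small brute-force search. This is a simplified stand-in, not the optimization model of this paper: it minimizes the sum of squared gray differences over the 3 × 3 patch (the center pixel and its 8-neighborhood) instead of the paper's objective function, it searches integer pixel positions only, and all names (`patch_ssd`, `best_match_within`, `max_dist`) are hypothetical.

```python
import math


def patch_ssd(frame_a, pa, frame_b, pb):
    """Sum of squared gray differences between two 3 x 3 patches."""
    total = 0.0
    for dm in (-1, 0, 1):
        for dn in (-1, 0, 1):
            diff = frame_a[pa[0] + dm][pa[1] + dn] - frame_b[pb[0] + dm][pb[1] + dn]
            total += diff * diff
    return total


def best_match_within(frame_a, frame_b, point, max_dist):
    """Find the position in frame_b, at most max_dist from point, whose patch
    best matches the patch around point in frame_a.

    point must be an interior pixel (row, col) of frame_a.
    Returns (best position, cost); the position is None if nothing is feasible.
    """
    rows, cols = len(frame_b), len(frame_b[0])
    reach = int(math.floor(max_dist))
    best, best_cost = None, float("inf")
    for i in range(max(1, point[0] - reach), min(rows - 1, point[0] + reach + 1)):
        for j in range(max(1, point[1] - reach), min(cols - 1, point[1] + reach + 1)):
            # Spatial constraint: Euclidean distance must not exceed max_dist.
            if math.hypot(i - point[0], j - point[1]) > max_dist:
                continue
            cost = patch_ssd(frame_a, point, frame_b, (i, j))
            if cost < best_cost:
                best, best_cost = (i, j), cost
    return best, best_cost
```

Shrinking `max_dist` tightens the feasible set, so a distant but similar-looking patch can no longer be chosen, which is the role the constant X plays in the constraint above.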

Experiment and analysis
In this paper, different animation videos are selected and the experiments are simulated in MATLAB. To judge the effect of key frame extraction and feature point extraction, two groups of experiments are carried out. The first tests the key frame extraction of this algorithm, taking one animation video as an example. The second compares the method with the Improved SIFT algorithm. The specific steps are as follows: another animation video lasting about seven minutes and composed of about 120 shots is selected. First, the animation video is split into individual frames, and then the key frames are extracted. Based on the extracted key frame set, the algorithm is used to extract the feature points, and the constrained optimization method is used to estimate the local feature points of the image under the constraint $d_{\mathrm{Euclidean}}((i', j'), (i'', j'')) \leqslant X$ of Eq. (12), so as to realize the optimized extraction of the feature points of the key frame images of mobile network animation. The performance of the two methods is then tested and compared.
In order to solve the above noise interference problem, data preprocessing is carried out.
(3) According to formula (6) and formula (7), the qualified feature points are selected using the Hessian matrix H as candidate feature points, which completes the data preprocessing.

The key frame extraction effect of this algorithm
Fig. 2 shows a shot from an animation with obvious character movement. The algorithm in this paper selects three key frames, shown in Fig. 3. In the first key frame one of the characters is outside the room, in the second key frame the character is preparing to fly into the room, and in the third key frame the character is flying into the room. There are obvious differences between the three key frames, and together they accurately express the content of the shot, in which the character flies into the room.

Comparison results of number of feature points extracted
From several groups of animation video images, an image dominated by smooth curves is selected as the research object, and the improved algorithm is compared with the SIFT algorithm. The experimental results are shown in Fig. 4. As can be seen from Fig. 4(b), the original SIFT algorithm extracts only a small number of feature points from the animation video image, and some key feature points are not accurately extracted, which leads to a loss of detail and of continuity of the animation content in the mobile network. Comparing Fig. 4(b) and (c), it can be seen that the algorithm designed in this paper extracts more detailed key frame feature points and fully retains the original image features. This is mainly because the block matching method is used during optimization to estimate the coordinates of the image feature points. This estimation captures more features of the image, so that more complete image features are extracted and the positions of the image feature points are marked accurately.

Effectiveness comparison results of feature point extraction
The effectiveness of the feature points is expressed by the ratio of the number of matched feature point pairs to the total number of feature points extracted from the two images to be matched, as shown in Eq. (13). For the same experimental object, the larger the ratio, the more effective the feature points detected by an algorithm; conversely, a small ratio means that the detected feature points are inefficient. This criterion is also very important for the subsequent image matching.

$$P = \frac{E_p}{E_1 + E_2} \qquad (13)$$

where $E_1$ and $E_2$ represent the numbers of feature points extracted from the two images to be matched, $E_p$ represents the number of matched feature point pairs of the two images, and P represents the effectiveness ratio of the feature points.
In this section, the original image (i.e. the image without rotation and scaling) is taken as the reference image, and the image obtained after a series of rotation and scale transformations is taken as the image to be registered. The experimental results are shown in Tables 1, 2, 3 and 4, where $E_1$ represents the number of feature points extracted from the original image and $E_2$ represents the number of feature points extracted from the image to be registered. Tables 1, 2, 3 and 4 contain selected data from the experimental results. From these tables, it can be seen that for the selected image (an image dominated by facial line features), the feature points extracted by this algorithm adapt better to image rotation and scale transformation. For the same rotation angle and scaling ratio, the effectiveness of the algorithm is above 41.2% under rotation and above 38.8% under scaling. This shows that the algorithm is more effective than the SIFT algorithm in image feature extraction. The reason is that this paper uses optimization theory to estimate the image parameters and optimizes the SIFT algorithm so as to maximize the gray correlation between the feature points of the preceding and following frames. This avoids the influence of edge feature points, improves the representativeness of the feature points, and realizes the optimal extraction of the feature points of key frame images of mobile network animation.
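A minimal Python sketch of this effectiveness ratio, read as the number of matched pairs divided by the total number of extracted feature points, is given below. The function name is hypothetical and the numbers in the usage note are purely illustrative, not experimental data from this paper.

```python
def effectiveness(e1, e2, ep):
    """Effectiveness ratio P: matched pairs over all extracted feature points.

    e1, e2 -- numbers of feature points extracted from the two images
    ep     -- number of matched feature point pairs between the two images
    """
    total = e1 + e2
    if total == 0:
        raise ValueError("no feature points were extracted from either image")
    return ep / total
```

For instance, with 100 feature points in each image and 80 matched pairs, `effectiveness(100, 100, 80)` gives 0.4, i.e. 40%.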

Time consuming comparison results of feature point extraction
The time consumed in extracting the feature points is used to judge which algorithm is more efficient. In the experiment, the time consumed by each of the two algorithms is measured for every rotation angle and every scaling factor of the image. The rotation angle is increased from 0° to 90° in steps of 10°, and the zoom factor from 0.6 to 1.5 in steps of 0.1. For each angle and zoom factor, five time values are recorded for each algorithm and averaged. To make the comparison more reliable, the computational efficiency of the two algorithms is evaluated by both the mean and the standard deviation of these times. Figure 5 shows the relationship between the rotation angle of the image and the time spent; Fig. 6 shows the relationship between the zoom factor of the image and the time spent; the data in Table 5 are calculated from the data in Figs. 5 and 6. From Figs. 5 and 6 it can be clearly observed that, as the rotation angle changes from 0° to 90° and the zoom factor changes from 0.6 to 1.5, the time of the SIFT algorithm at each angle and scaling factor is higher than that of the algorithm in this paper. The feature point extraction time of the proposed algorithm is always below 1.8 s. In addition, the time of this algorithm is approximately stable between 1.2 and 1.6 s, while the time of the SIFT algorithm fluctuates greatly.
This is because the SIFT algorithm is used to extract the key feature points and a nonlinear model is then established for local optimization, which improves the representativeness of the feature points and the efficiency of feature extraction. Because the optimization design improves the gray correlation between the feature points of the preceding and following frames, the method adapts well and its time varies little. In Table 5, t represents the mean value of the time spent, reported together with its standard deviation. From Table 5, the standard deviation of the time of the algorithm in this paper is 0.0196, which is lower than that of the SIFT algorithm; this indicates that the time of the algorithm in this paper is more stable, while the time of the SIFT algorithm fluctuates more. Furthermore, the mean operation time of this algorithm is 1.481 s, which corresponds to a time efficiency about 30% higher than that of the SIFT algorithm. The explanation is the same as for the results in Figs. 5 and 6.
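The measurement protocol (five timed runs per setting, summarized by mean and standard deviation) can be sketched in Python as follows. The helper names are hypothetical, and the use of the population standard deviation is an assumption, since the paper does not say which estimator was used.

```python
import statistics
import time


def time_runs(func, repeats=5):
    """Run func several times and return the elapsed wall-clock times in seconds."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    return times


def summarize(times):
    """Return (mean, population standard deviation) of a list of times."""
    return statistics.mean(times), statistics.pstdev(times)
```

In use, `func` would wrap one feature extraction call for a fixed rotation angle or zoom factor, and `summarize(time_runs(func))` would give one point of a timing curve together with its spread.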

Comparison results of similarity invariance of feature points
The similarity invariance of the feature points mainly refers to whether the algorithm can still detect stable feature points when the image is rotated and scaled. $\bar{E}_2$ refers to the mean value of the number of feature points $E_2$ (from Tables 1, 2, 3 and 4), and the standard deviation of $E_2$ is reported alongside it. The specific experimental data are shown in Table 6.
From Table 6, it can be seen that under image rotation and scaling, for the same rotation angle and scaling factor, the mean number of feature points of the algorithm is 163.4 and the standard deviation is 19.58. The feature points extracted by this algorithm are therefore more stable than those of the SIFT algorithm, which further shows that the similarity invariance of the feature points of this algorithm is higher than that of the SIFT algorithm, by about 40%. This is because this paper optimizes the SIFT algorithm with a constrained optimization method, which reduces the influence of edge feature points and improves the representativeness of the feature points, so that the method outperforms the SIFT algorithm in the number of feature points, the speed and the effectiveness.

Conclusion
This paper studies a feature point extraction method for key frame images of mobile network animation. Based on the SIFT algorithm and its improved design, an optimization model is used to optimize the feature point extraction. The experimental results are as follows. (1) The proposed method can accurately extract the key frames of the animation and narrow the range of feature point extraction. (2) Compared with the SIFT algorithm, the feature points extracted by the improved method are more comprehensive and the extraction is faster. The extraction time is generally 1.2 to 1.6 s and varies little. (3) For both face images and smooth curve images in animation video, the feature points extracted by this algorithm adapt well to image rotation and scale transformation. (4) Compared with the SIFT algorithm, the feature points extracted by this algorithm are more stable, which further shows that the similarity invariance of the feature points of this algorithm is higher than that of the SIFT algorithm, by about 40%.
Future research should focus on the extraction of animation features, such as the concept of a flash animation summary and a key-frame-based method for extracting flash animation summaries, in order to obtain the final matching animation frames, which can provide a reference for related research in this field.
Author contributions Tao Yin provided the algorithm and experimental results and wrote the manuscript; Zhihan Lv revised the paper and supervised and analyzed the experiment.
Funding Open access funding provided by Uppsala University.

Data availability
We also declare that data availability and ethics approval are not applicable to this paper.

Declarations
The authors have no relevant financial or non-financial interests to disclose.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.