1 Introduction

Mobile network environment refers to a mobile network that connects to the Internet of Things through mobile terminal devices such as mobile phones and tablets [1, 2]. Micro animated videos are a short, creative, and entertaining form of video that depicts animated characters. They are usually created by creators based on personal preferences, emotions, or ideas, using simple painting tools and sound materials [3]. A micro animated video typically consists of multiple images, each lasting only a few seconds. Although these videos are short, usually ranging from a few minutes to a few tens of minutes, they can evoke strong resonance among the audience. Micro animated videos can cover a wide range of themes, such as humor, emotion, philosophy, and social hotspots [4]. Creators can express their unique viewpoints and insights through micro animated videos, interact and communicate with the audience, and display the movement posture of the three-dimensional human body.

Due to their small data volume, fast transmission speed, and ease of loading and playback in mobile network environments, micro animated videos have been widely used [5]. However, because a micro animated video contains a large number of highly similar images, efficient and detailed image matching is needed [6] to ensure the quality and accuracy of micro animated videos during propagation over mobile networks.

At present, many scholars have studied image matching algorithms. Mousavi V et al. proposed an image key point matching method based on information content selection [7], which uses entropy, spatial saliency, and texture coefficients to measure image quality. SIFT, SURF, MSER, and BRISK operators are used to extract key points, and a mixed key point selection method completes the matching. However, if the image quality measurement is inaccurate, key point selection and matching accuracy deteriorate; moreover, key point extraction is computationally expensive, which limits the practicality of the algorithm on large-scale image data sets or in real-time applications. Bellavia F et al. studied the HarrisZ+ corner selection algorithm based on Harris image matching optimization [8]. By adjusting the parameters of the Harris algorithm and introducing additional selection criteria, HarrisZ+ extracts more, yet better differentiated, key points that are well distributed over the image and have higher positioning accuracy, enhancing the matching effect. However, the HarrisZ+ algorithm is sensitive to image noise: heavy noise may degrade key point extraction and matching, and global image information is not fully utilized, so the performance of the algorithm is limited. Chiatti A et al. proposed a mobile robot target recognition method based on few-shot image matching [9], comparing different (shallow and deep) few-shot image matching methods on a novel dataset and training on that dataset to complete target image matching. However, this method relies on deep learning models for object recognition and image matching, whose high resource requirements limit its use in real-time applications. Paringer R A et al. proposed an improved fuzzy image matching method [10], which refines the fuzzy image matching (BIM) method using speckle selection and comparison and processes noise before matching, effectively improving matching accuracy. However, this method mainly focuses on speckle noise and matching accuracy while possibly ignoring the image content; in some cases, even when speckle noise is handled well, the matching result may still be inaccurate because the image content is not considered.

Although the above methods can complete image matching, they have certain drawbacks when applied to mobile network environments. Therefore, in view of these shortcomings, a new image matching algorithm is studied. The Harris algorithm has low computational complexity, a wide range of applications, and good adaptability to rotation, scaling, and deformation. SIFT feature extraction offers good uniqueness, strong scalability, and fast speed. Combining these two algorithms, a block-based high-precision matching algorithm for multi-image micro animation videos in a mobile network environment is designed. After applying the proposed algorithm, image noise is reduced, more image feature points are extracted, and image matching can be completed over a larger area, which is conducive to higher-quality playback and transmission of micro animated videos in mobile network environments.

2 Design of High-Precision Matching Algorithm for Multi-Image Segmentation in Micro Animation Videos

2.1 Image Denoising Processing

In micro animated videos, there may be large differences between consecutive frames despite the continuity of the video. Such differences can prevent the matching algorithm from matching accurately. Noise reduction reduces these differences and enhances the robustness of the matching algorithm, allowing it to better adapt to changes between successive frames; in this way the image quality of the micro animation video is ensured and feature matching can be completed more reliably. The traditional two-dimensional wavelet transform involves heavy computation, and its complexity is particularly high for large images. The two-dimensional High Density Discrete Wavelet Transform (HD-DWT) is obtained by generalizing the one-dimensional HD-DWT: the two-dimensional transform is decomposed into a combination of two one-dimensional transforms, which reduces the computational complexity and yields better results when processing two-dimensional images. Compared with the one-dimensional transform, the two-dimensional transform captures horizontal and vertical image features more comprehensively and supports more accurate compression and reconstruction, so it can effectively remove noise while retaining more detail. This paper therefore adopts the two-dimensional HD-DWT method, which uses the wavelet transform to achieve image denoising [11]. Performing a two-dimensional high-density decomposition on a micro animation video image yields one rough (approximate) image and eight detail images. If \({c}_{j}\) denotes the rough image of layer \(j\), and \({d}_{j}^{1}\), \({d}_{j}^{2}\), \({d}_{j}^{3}\), \({d}_{j}^{4}\), \({d}_{j}^{5}\), \({d}_{j}^{6}\), \({d}_{j}^{7}\), and \({d}_{j}^{8}\) denote the detail images of layer \(j\), then the decomposition algorithm from layer \(j\) to layer \(j+1\) is:

$$\left\{\begin{array}{c}{c}_{j+1}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{0}{\mathrm{h}}_{0}\left[2n,2m\right]\\ {d}_{j+1}^{1}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{0}{\mathrm{h}}_{1}\left[2n,2m\right]\\ {d}_{j+1}^{2}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{0}{\mathrm{h}}_{2}\left[2n,m\right]\\ {d}_{j+1}^{3}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{1}{\mathrm{h}}_{0}\left[2n,2m\right]\\ {d}_{j+1}^{4}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{1}{\mathrm{h}}_{1}\left[2n,2m\right] \\ {d}_{j+1}^{5}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{1}{\mathrm{h}}_{2}\left[2n,m\right]\\ {d}_{j+1}^{6}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{2}{\mathrm{h}}_{0}\left[n,2m\right]\\ {d}_{j+1}^{7}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{2}{\mathrm{h}}_{1}\left[n,2m\right]\\ {d}_{j+1}^{8}\left[n,m\right]={c}_{j}*{\mathrm{h}}_{2}{\mathrm{h}}_{2}\left[n,m\right]\end{array}\right.$$
(1)

where \(*\) denotes convolution and \({\mathrm{h}}_{i}{\mathrm{h}}_{j}\left(i,j=0,1,2\right)\) denotes the two-dimensional filter formed by the one-dimensional filters \({\mathrm{h}}_{i}\) and \({\mathrm{h}}_{j}\). After the micro animation video image is decomposed, image reconstruction reverses the decomposition: the sub-images are upsampled step by step and then restored through the filters. To maintain the accuracy and integrity of the reconstruction, the filters must be invertible. In two-dimensional HD-DWT, to preserve the symmetry and invertibility of the filters, the filters are time-reversed so that the image signal can be correctly recovered during the inverse operation, thereby achieving image noise reduction. Let \({\mathrm{h}}_{i}{\prime}\) denote the time-reversed version of \({\mathrm{h}}_{i}\); then the reconstruction algorithm of the two-dimensional HD-DWT is as follows.

$${c}_{j}\left[n,m\right]={\overline{c} }_{j+1}*{\mathrm{h}}_{0}{\prime}{\mathrm{h}}_{0}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{1}*{\mathrm{h}}_{1}{\prime}{\mathrm{h}}_{0}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{2}*{\mathrm{h}}_{2}{\prime}{\mathrm{h}}_{0}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{3}*{\mathrm{h}}_{0}{\prime}{\mathrm{h}}_{1}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{4}*{\mathrm{h}}_{1}{\prime}{\mathrm{h}}_{1}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{5}*{\mathrm{h}}_{2}{\prime}{\mathrm{h}}_{1}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{6}*{\mathrm{h}}_{0}{\prime}{\mathrm{h}}_{2}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{7}*{\mathrm{h}}_{1}{\prime}{\mathrm{h}}_{2}{\prime}\left[n,m\right]+{\overline{d} }_{j+1}^{8}*{\mathrm{h}}_{2}{\prime}{\mathrm{h}}_{2}{\prime}\left[n,m\right]$$
(2)

Here, \({\overline{c} }_{j+1}\), \({\overline{d} }_{j+1}^{1}\), \({\overline{d} }_{j+1}^{3}\), and \({\overline{d} }_{j+1}^{4}\) are obtained from \({c}_{j+1}\), \({d}_{j+1}^{1}\), \({d}_{j+1}^{3}\), and \({d}_{j+1}^{4}\) respectively by inserting a row of zeros between consecutive rows and a column of zeros between consecutive columns; \({\overline{d} }_{j+1}^{2}\) and \({\overline{d} }_{j+1}^{5}\) are obtained from \({d}_{j+1}^{2}\) and \({d}_{j+1}^{5}\) by inserting a column of zeros between consecutive columns; \({\overline{d} }_{j+1}^{6}\) and \({\overline{d} }_{j+1}^{7}\) are obtained from \({d}_{j+1}^{6}\) and \({d}_{j+1}^{7}\) by inserting a row of zeros between consecutive rows; \({\mathrm{h}}_{i}{\prime}{\mathrm{h}}_{j}{\prime}\left(i,j=0,1,2\right)\) denotes the two-dimensional filter formed by \({\mathrm{h}}_{i}{\prime}\) and \({\mathrm{h}}_{j}{\prime}\). After \(J\)-level decomposition of the micro animation video image, \(8J+1\) sub-images are obtained, that is, the expression of the image in the wavelet domain is [12]:

$$\left[{c}_{J},{\left[{d}_{j}^{1},{d}_{j}^{2},{d}_{j}^{3},{d}_{j}^{4},{d}_{j}^{5},{d}_{j}^{6},{d}_{j}^{7},{d}_{j}^{8}\right]}_{1\le j\le J}\right]$$
(3)
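To make the decomposition in Eq. (1) concrete, the following Python sketch performs one decomposition level as a separable convolution followed by the downsampling pattern of Eq. (1). It is a minimal sketch under stated assumptions: the filter bank h0, h1, h2 is supplied by the caller (the actual high-density filters are not specified here), the assignment of the two filter indices to rows and columns is an assumption, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import convolve2d

def hd_dwt_level(c_j, h0, h1, h2):
    """One level of the 2-D HD-DWT decomposition sketched from Eq. (1).

    c_j        : 2-D array, rough image of layer j.
    h0, h1, h2 : 1-D analysis filters (placeholders for the actual filter bank).
    Returns the rough image c_{j+1} and the eight detail images d^1..d^8.
    """
    filters = [np.asarray(h, dtype=float) for h in (h0, h1, h2)]
    # Per Eq. (1), branches using h0/h1 are downsampled by 2; the h2 branch is not.
    step = [2, 2, 1]
    subbands = []
    for i, hr in enumerate(filters):          # filter assumed to act along rows
        for j, hc in enumerate(filters):      # filter assumed to act along columns
            kernel = np.outer(hr, hc)         # separable 2-D filter h_i h_j
            band = convolve2d(c_j, kernel, mode="same", boundary="symm")
            subbands.append(band[::step[i], ::step[j]])
    return subbands[0], subbands[1:]          # c_{j+1}, [d^1, ..., d^8]
```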

The wavelet coefficient threshold formula is:

$${\widehat{\delta }}_{n}^{2}=median\left(\left|{y}_{1}\right|\right)/0.6745$$
(4)
$${\widehat{\delta }}_{y}^{2}=\frac{1}{M}{\sum }_{{y}_{1i}\in N\left(k\right)}{y}_{1i}^{2}$$
(5)

Then,

$$\widehat{\updelta }=\sqrt{\left({\widehat{\updelta }}_{\mathrm{y}}^{2}-{\widehat{\updelta }}_{\mathrm{n}}^{2}\right)}$$
(6)

where \({\widehat{\delta }}_{n}^{2}\) represents the noise variance estimated from the detail image and \({y}_{1}\) represents the wavelet coefficient matrix of the detail image; according to the properties of the wavelet transform, 0.6745 is the constant used to scale the median of the wavelet coefficients. \({\widehat{\delta }}_{y}^{2}\) represents the variance of the noisy image, where \(y\) denotes the wavelet coefficients of the noisy image. \(M\) denotes the window size of the local region \(N\left(k\right)\) adjacent to the \(k\)-th wavelet coefficient. After the wavelet threshold is determined, the wavelet coefficients are processed in the wavelet domain [13], and the denoised micro animation video image can be reconstructed.
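As a minimal sketch of Eqs. (4)–(6), the following code estimates the per-coefficient threshold from the detail coefficients and the noisy coefficients, assuming both are available as NumPy arrays. The window size and the soft-thresholding rule at the end are illustrative choices, not prescribed by the text, and the function names are hypothetical.

```python
import numpy as np

def estimate_threshold(detail_coeffs, noisy_coeffs, window=7):
    """Per-coefficient threshold following Eqs. (4)-(6).

    detail_coeffs : finest detail sub-band used for the noise estimate (Eq. (4)).
    noisy_coeffs  : wavelet coefficients of the noisy image (Eq. (5)).
    window        : side length of the local region N(k); illustrative value.
    """
    noisy = np.asarray(noisy_coeffs, dtype=float)
    # Eq. (4): robust noise estimate from the median of the detail coefficients;
    # squared here so that it is commensurate with the local variance below.
    sigma_n2 = (np.median(np.abs(detail_coeffs)) / 0.6745) ** 2
    # Eq. (5): local second moment over the M coefficients in the window N(k).
    M = window * window
    pad = window // 2
    padded = np.pad(noisy, pad, mode="reflect")
    sigma_y2 = np.zeros_like(noisy)
    for r in range(noisy.shape[0]):
        for c in range(noisy.shape[1]):
            sigma_y2[r, c] = np.sum(padded[r:r + window, c:c + window] ** 2) / M
    # Eq. (6): clipped so the square root stays real.
    return np.sqrt(np.maximum(sigma_y2 - sigma_n2, 0.0))

def soft_threshold(coeffs, thresh):
    """Illustrative soft-thresholding rule applied in the wavelet domain."""
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - thresh, 0.0)
```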

2.2 Design of Micro-Animation Video Image Block Method

In micro animation video image matching, the purpose of dividing the denoised image into blocks is to avoid the influence of a single corner-response threshold applied to the whole image. A block threshold is set within each image sub-block, so that the extracted corners are distributed uniformly and reasonably and the overall structure of the image is fully reflected. A blocking scheme with a fixed block size and a fixed number of blocks is used to divide the micro animation video image. At present, the size of the sub-blocks is mainly chosen according to the gray texture distribution of the micro animation video image, so that sub-blocks with different gray levels can extract corners reasonably under different thresholds; however, this approach easily produces blocks that cross the image boundary. In this paper, a fixed number of blocks is used and out-of-bounds blocks are removed (see the sketch below). After blocking, each sub-image block has its own gray distribution and texture.
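To make the fixed-count blocking concrete, here is a minimal sketch that splits a frame into an n×n grid; the grid size is a hypothetical parameter and the function name is illustrative.

```python
import numpy as np

def split_into_blocks(image, blocks_per_side=4):
    """Divide a frame into a fixed number of equal sub-blocks.

    Border pixels that do not fill a complete block are discarded, which
    corresponds to removing the out-of-bounds blocks mentioned above.
    """
    image = np.asarray(image)
    h, w = image.shape[:2]
    bh, bw = h // blocks_per_side, w // blocks_per_side
    blocks = []
    for r in range(blocks_per_side):
        for c in range(blocks_per_side):
            blocks.append(((r, c), image[r * bh:(r + 1) * bh, c * bw:(c + 1) * bw]))
    return blocks
```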

2.3 Harris Based Corner Detection in Micro-Animation Video Sub-Images

The Moravec operator measures the degree of image change based on the difference within a local window and has good robustness to noise and texture blur. Its main idea is to detect corner points by comparing the difference between a pixel and its neighboring pixels: for each pixel, the operator calculates its difference from adjacent locations and takes the difference value as the corner response of that pixel. The Harris corner detection algorithm performs well in general, but its performance may be greatly affected by noisy images or blurred texture. The robustness of the Moravec operator improves the adaptability of corner detection to these complex situations, so that corner points in micro animation video sub-images can be captured more accurately, improving detection accuracy. The Harris corner detection algorithm [14, 15] is an improvement of the Moravec operator. By comparing the Corner Response Function (CRF) of a pixel with a set threshold, it determines whether the pixel is a feature corner. Assume that the pixel point \(\left(x,y\right)\) of the micro animation video sub-image is the center, and a Gaussian window is moved by \(u\) in the \(x\)-direction and by \(v\) in the \(y\)-direction of the sub-image. The analytical expression of the gray change measure in the Harris corner detection algorithm, expanded by a Taylor series, is:

$$\begin{array}{c}{E}_{x,y}={w}_{x,y}{\left({I}_{x+u,y+v}-{I}_{x,y}\right)}^{2}\\ ={w}_{x,y}{\left(u\frac{\partial I}{\partial X}+v\frac{\partial I}{\partial Y}+o\left(\sqrt{{u}^{2}+{v}^{2}}\right)\right)}^{2}\end{array}$$
(7)

where \({E}_{x,y}\) is the gray change measure of the micro animation video image within the Gaussian window, \({w}_{x,y}\) is the Gaussian convolution function, namely \({w}_{x,y}={e}^{-\left({x}^{2}+{y}^{2}\right)/{\sigma }^{2}}\), \({I}_{x,y}\) is the gray value of the micro animation video image at \(\left(x,y\right)\), and \(o\left(\sqrt{{u}^{2}+{v}^{2}}\right)\) is an infinitesimal term whose influence on the result is negligible. The formula can be rewritten as follows:

$${E}_{x,y}={w}_{x,y}\left[{u}^{2}{\left({I}_{x}\right)}^{2}+{v}^{2}{\left({I}_{Y}\right)}^{2}+2uv{I}_{X}{I}_{Y}\right]$$
(8)

A corner of an image is a point with obvious texture change, so the gray-scale change of adjacent pixels in different directions should be considered in corner detection. Gradients in the horizontal direction (A: x-axis) and vertical direction (B: y-axis) detect horizontal and vertical edges in the image, while the mixed term in the diagonal direction (C) mainly captures changes at corners and in texture direction, further improving the efficiency and accuracy of corner detection. The corresponding terms are:

$$\left\{\begin{array}{c}A={\left({I}_{X}\right)}^{2}\otimes {w}_{x,y}\\ B={\left({I}_{Y}\right)}^{2}\otimes {w}_{x,y}\\ C=\left({I}_{X}{I}_{Y}\right)\otimes {w}_{x,y}\end{array}\right.$$
(9)

Substituting Formula (9) into Formula (8) gives:

$${E}_{x,y}=A{u}^{2}+2Cuv+B{v}^{2}$$
(10)

where \(\otimes\) is the convolution operation, and \({I}_{X}\) and \({I}_{Y}\) represent the gradient values of the micro animation video sub-image pixels in the \(x\) and \(y\) directions respectively. Writing the gray metric function in quadratic form gives:

$${E}_{x,y}=\left[u v\right]M\left[\begin{array}{c}u\\ v\end{array}\right],M={w}_{x,y}\left[\begin{array}{cc}{I}_{X}^{2}& {I}_{X}{I}_{Y}\\ {I}_{X}{I}_{Y}& {I}_{Y}^{2}\end{array}\right]$$
(11)

Through the determinant and trace of the symmetric matrix M, the CRF function is defined as follows:

$$CRF\left(x,y\right)=det\left(M\right)-k{\left(trace\left(M\right)\right)}^{2}=\left(AB-{C}^{2}\right)-k{\left(A+B\right)}^{2}$$
(12)

where \(det\left(M\right)\) is the determinant of the symmetric matrix and \(trace\left(M\right)\) is its trace; \(k\) is an empirical coefficient, usually taken as 0.04–0.06. When the CRF value of a pixel point \(\left(x,y\right)\) of the micro animation video sub-image is greater than the given threshold, the point is considered a feature corner.
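A minimal sketch of per-block Harris corner detection following Eqs. (9)–(12) is given below; the Gaussian window width, k, and the per-block relative threshold are illustrative values, and the function name is hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def harris_corners(block, sigma=1.0, k=0.05, rel_thresh=0.01):
    """Harris corner detection for one sub-block, following Eqs. (9)-(12).

    The per-block threshold is expressed as a fraction of the block's maximum
    response, so every sub-block contributes its own corners.
    """
    block = block.astype(float)
    Iy, Ix = np.gradient(block)               # gradients along y (rows) and x (cols)
    A = gaussian_filter(Ix * Ix, sigma)       # Eq. (9)
    B = gaussian_filter(Iy * Iy, sigma)
    C = gaussian_filter(Ix * Iy, sigma)
    crf = (A * B - C * C) - k * (A + B) ** 2  # Eq. (12)
    corners = np.argwhere(crf > rel_thresh * crf.max())
    return corners, crf
```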

2.4 SIFT Feature Extraction of Micro-Animation Video Sub-Images

To make the extracted corners converge, the SIFT method is used to refine them iteratively and generate SIFT feature vectors. SIFT feature extraction is widely used owing to its good robustness to rotation, illumination, noise, and scaling. The SIFT algorithm uses a progressive screening strategy: a pixel on the image is accepted as a feature point only after four processing steps [16, 17]. The purpose is to exclude most of the unstable pixels in the micro animation video image before matching, thereby reducing the matching time to meet the requirements of mobile networks.

2.4.1 Micro-Animation Video Sub-Image Scale Space Extremum Detection

Extremum detection in scale space obtains feature information at different scales by filtering the image at different scales. In a micro animated video, feature points may appear at different scales and sizes; by detecting local extrema through scale transformation in scale space, detailed feature changes can be detected at different scales, so the algorithm adapts better to feature points of different scales.

The scale space \(L\left(x,y,\sigma \right)\) of a micro-animation video subimage is defined as the convolution of the input function \(J\left(x,y\right)\) of the image and the Gaussian kernel function \(G\left(x,y,\sigma \right)\).

$$L\left(x,y,\sigma \right)=G\left(x,y,\sigma \right)\otimes J\left(x,y\right)$$
(13)
$$G\left(x,y,\sigma \right)=\frac{1}{2\pi {\sigma }^{2}}\mathit{exp}\left(-\frac{{x}^{2}+{y}^{2}}{2{\sigma }^{2}}\right)$$
(14)

where \(\sigma\) denotes the scale factor of Gaussian function; \(\otimes\) denotes the convolution.

In order to extract the SIFT key points of the micro-animation video sub-image, the input sub-image is first convolved with the Gaussian kernel function under different scale factors to obtain the Gaussian pyramid, and then the Difference of Gaussian (DoG) pyramid is obtained by subtracting Gaussian-blurred images of adjacent scales in the Gaussian pyramid, as shown in Eq. (15):

$$\begin{array}{c}D\left(x,y,\sigma \right)=\left(G\left(x,y,k\sigma \right)-G\left(x,y,\sigma \right)\right)\otimes J\left(x,y\right)\\ =L\left(x,y,k\sigma \right)-L\left(x,y,\sigma \right)\end{array}$$
(15)

In addition, extreme value detection in two-dimensional image space can capture detailed features at specific scales. In two-dimensional image space, local extremum points represent features such as texture or edge of the image. By detecting these local extremum points in the two-dimensional image space, the detailed features in the image can be obtained and fused with the feature information in the scale space. By detecting local extreme values in both scale space and two-dimensional image space, feature detection at different scales and levels of detail can be taken into account to ensure detailed feature detection and description.

In summary, detecting local extrema in both scale space and two-dimensional image space combines scale and detail information and ensures detailed feature detection for micro animation video sub-images. During extremum detection, each candidate pixel is compared with its neighboring points at the same scale and with the points at corresponding positions in the two adjacent scales, so that a local extremum is detected in both scale space and two-dimensional image space.
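The following sketch builds the Gaussian and DoG stacks of Eqs. (13)–(15) for one octave and checks a candidate against its neighbours in scale and image space; the base scale, scale ratio, and number of levels are illustrative values, and the function names are hypothetical.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def dog_stack(image, sigma0=1.6, k=2 ** 0.5, levels=5):
    """Gaussian scale space (Eqs. (13)-(14)) and its DoG stack (Eq. (15)) for one octave."""
    gaussians = [gaussian_filter(np.asarray(image, float), sigma0 * k ** i)
                 for i in range(levels)]
    return np.stack([gaussians[i + 1] - gaussians[i] for i in range(levels - 1)])

def is_local_extremum(dogs, s, r, c):
    """Compare a point with its 26 neighbours in scale and image space."""
    cube = dogs[s - 1:s + 2, r - 1:r + 2, c - 1:c + 2]
    centre = dogs[s, r, c]
    return centre == cube.max() or centre == cube.min()
```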

2.4.2 Generate Micro-Animation Video Sub-Image SIFT Feature Vectors

For all the detected extreme points, the following two-step test is also required to obtain the accurately located feature points:

  1. 1)

    The second-order Taylor expansion of scale space function \(D\left(x,y,\sigma \right)\) is used to perform least squares fitting, and the extreme value of the fitting curve is calculated to locate the feature points. At the same time, a threshold is set to eliminate the points with low contrast.

  2. 2)

    The Hessian matrix is used to eliminate unstable edge response points, in order to enhance matching stability and improve noise resistance (a sketch of this test is given after the list).
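A minimal sketch of the edge-response test in step 2), using the 2×2 Hessian of the DoG image at the candidate point. The ratio threshold of 10 follows the standard SIFT formulation rather than a value stated in the text, and the function name is hypothetical.

```python
import numpy as np

def passes_edge_test(dog, r, c, edge_ratio=10.0):
    """Reject edge-like responses using the 2x2 Hessian of the DoG image at (r, c)."""
    dxx = dog[r, c + 1] + dog[r, c - 1] - 2.0 * dog[r, c]
    dyy = dog[r + 1, c] + dog[r - 1, c] - 2.0 * dog[r, c]
    dxy = (dog[r + 1, c + 1] - dog[r + 1, c - 1]
           - dog[r - 1, c + 1] + dog[r - 1, c - 1]) / 4.0
    trace, det = dxx + dyy, dxx * dyy - dxy * dxy
    if det <= 0:                                   # principal curvatures of opposite sign
        return False
    return trace ** 2 / det < (edge_ratio + 1) ** 2 / edge_ratio
```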

After the feature points are precisely located, a direction parameter is assigned to each keypoint according to the distribution of gradient directions of the pixels in its neighborhood, so that the operator remains rotation invariant. The magnitude and direction of the gradient at pixel \(\left(x,y\right)\) are given as follows:

$$m\left(x,y\right)=\sqrt{{\left(L\left(x+1,y\right)-L\left(x-1,y\right)\right)}^{2}+{\left(L\left(x,y+1\right)-L\left(x,y-1\right)\right)}^{2}}$$
(16)
$$\theta \left(x,y\right)=\mathit{arctan}\left(\left(L\left(x,y+1\right)-L\left(x,y-1\right)\right)/\left(L\left(x+1,y\right)-L\left(x-1,y\right)\right)\right)$$
(17)

where the scale used for \(L\) is the scale of each keypoint. In the actual calculation, we sample in the neighborhood window centered at the keypoint and use a histogram to count the gradient directions of the neighborhood pixels. The gradient histogram ranges from 0° to 360°, with one bin every 10°, for a total of 36 bins. The peak of the histogram represents the main direction of the neighborhood gradients at the keypoint, that is, the direction of the keypoint. When another peak in the gradient orientation histogram reaches 80% of the energy of the main peak, that direction is regarded as the secondary direction of the keypoint.
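A sketch of the orientation assignment in Eqs. (16)–(17): gradient magnitudes and directions are accumulated into a 36-bin histogram around the keypoint. The window radius and Gaussian weighting width are illustrative parameters, and the function name is hypothetical.

```python
import numpy as np

def keypoint_orientation(L, r, c, radius=8, sigma=4.0):
    """36-bin gradient-orientation histogram around one keypoint (Eqs. (16)-(17))."""
    hist = np.zeros(36)
    for dr in range(-radius, radius + 1):
        for dc in range(-radius, radius + 1):
            rr, cc = r + dr, c + dc
            if not (0 < rr < L.shape[0] - 1 and 0 < cc < L.shape[1] - 1):
                continue
            dx = L[rr, cc + 1] - L[rr, cc - 1]
            dy = L[rr + 1, cc] - L[rr - 1, cc]
            mag = np.hypot(dx, dy)                               # Eq. (16)
            theta = np.degrees(np.arctan2(dy, dx)) % 360.0       # Eq. (17)
            weight = np.exp(-(dr * dr + dc * dc) / (2.0 * sigma ** 2))
            hist[int(theta // 10) % 36] += weight * mag
    main = int(hist.argmax())
    # secondary directions: other bins reaching 80% of the main peak
    secondary = [b * 10 for b in range(36)
                 if b != main and hist[main] > 0 and hist[b] >= 0.8 * hist[main]]
    return main * 10, secondary, hist
```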

The micro-animation video sub-image SIFT feature vector is generated by using the feature point orientation allocation results of the micro-animation video sub-image. First, the coordinate axes are converted to the orientation of the keypoints to ensure rotation invariance. Next, an 8 × 8 window centered on the keypoint is taken, as shown in Fig. 1. The central black point in the left part of Fig. 1 is the position of the current keypoint. Each small grid represents a pixel in the scale space where the keypoint neighborhood is located. The arrow direction represents the gradient direction of the pixel, and the arrow length represents the gradient magnitude value. The circles represent the Gaussian weighted range (pixels closer to the keypoint contribute more gradient direction information).

Fig. 1 Feature vector generated by the gradient information of the key point neighborhood

Since the key points in the micro animation video sub-images may appear at different rotation angles, the coordinate axes need to be aligned with the direction of each key point so that the extracted features are rotation invariant. Specifically, after the direction of a key point is determined, the image is rotated by rotating the coordinate axes, and the orientation information of pixels near the key point (such as the gradient direction) is unified into the same coordinate system, ensuring that features extracted at different rotation angles are similar. Then a seed point is formed by computing the gradient orientation histogram of eight directions on each 4 × 4 patch and plotting the cumulative value of each gradient direction, as shown in the right part of Fig. 1. In the illustration, a keypoint is described by 2 × 2 = 4 seed points, each carrying 8 direction vectors. To enhance the robustness of matching, a total of 4 × 4 seed points are actually used to describe each keypoint, so that 128-dimensional data are generated for a keypoint, forming the final 128-dimensional SIFT feature vector. By rotating the axes to the direction of the key point and accumulating the gradient orientation histograms in the key point's neighborhood, the SIFT feature is guaranteed to be rotation and scale invariant, and the stability and reliability of the feature are further improved.
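The descriptor construction can be sketched as follows, assuming the input patch (e.g. 16×16 pixels for the 128-dimensional case) has already been rotated to the keypoint direction; the grid and bin counts are the standard 4×4 and 8 values, the normalization step is an illustrative choice, and the function name is hypothetical.

```python
import numpy as np

def sift_descriptor(patch, grid=4, bins=8):
    """Build a grid*grid*bins descriptor (128-D for grid=4, bins=8) from a
    keypoint-centred patch already rotated to the keypoint direction."""
    patch = np.asarray(patch, dtype=float)
    cell = patch.shape[0] // grid
    desc = []
    for gr in range(grid):
        for gc in range(grid):
            block = patch[gr * cell:(gr + 1) * cell, gc * cell:(gc + 1) * cell]
            dy, dx = np.gradient(block)
            mag = np.hypot(dx, dy)
            ang = np.degrees(np.arctan2(dy, dx)) % 360.0
            hist, _ = np.histogram(ang, bins=bins, range=(0, 360), weights=mag)
            desc.append(hist)                       # one seed point = one 8-bin histogram
    desc = np.concatenate(desc)
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc        # normalisation for robustness
```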

2.5 Micro-Animation Image Matching Based on K-Means Clustering

The K-means clustering algorithm is simple, easy to understand, adaptable, and efficient [18,19,20], and it is currently a widely used clustering analysis algorithm. Based on the Euclidean distances between the SIFT feature vectors of the key points of the micro animation video sub-images, the K-means algorithm is used to cluster these feature vectors. The feature vector with the smallest difference from its cluster is selected as the initial centroid, which optimizes the feature points and reduces the computational dimension. The K-means clustering algorithm is solved iteratively, and the specific procedure is as follows:

Step 1: Divide the known data into K groups and randomly select K objects as the initial centroids (cluster centers).

Step 2: Calculate the Euclidean distance between each object and each known cluster center, and assign the object to the nearest cluster.

Step 3: The objects assigned to a cluster center, together with that center, form a cluster. Each time a new sample is assigned, the cluster centroid is recalculated from the objects currently in the cluster.

This process continues until a termination criterion is met: when no objects (or fewer than a minimum number of objects) are reassigned to different clusters, the cluster centers no longer change, and the sum of squared errors reaches a local minimum, the iteration stops.

In the K-means clustering algorithm, the selection of the initial cluster centers and the determination of the number of clusters K are two problems that cannot be ignored [21]. The initial cluster centers can be selected based on Principal Component Analysis (PCA), and the number of clusters K can be determined by a K-means optimal cluster number determination method based on potential stability.

K-means clustering is performed on all SIFT feature vectors in the two sub-images in the manner described above. This process finds K centroids in the feature space and assigns each descriptor to the cluster with the closest centroid. For each cluster in one sub-image, the nearest-neighbor cluster in the other sub-image is found to form a sub-image matching pair. After combining the matching pairs of all sub-images, the matching result of the whole micro animation video image is obtained. The above matching process is parallelized to realize multi-image matching of the micro animation video.
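A minimal sketch of this clustering-based pairing is shown below using scikit-learn's KMeans; the cluster count k is illustrative, the PCA-based initialization mentioned above is replaced by the library's default initialization for brevity, and the function name is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def match_subimages(desc_a, desc_b, k=8, random_state=0):
    """Cluster the SIFT descriptors of two sub-images and pair each cluster in A
    with the cluster in B whose centroid is closest in Euclidean distance."""
    km_a = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(desc_a)
    km_b = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(desc_b)
    pairs = []
    for i, centre in enumerate(km_a.cluster_centers_):
        dist = np.linalg.norm(km_b.cluster_centers_ - centre, axis=1)
        pairs.append((i, int(dist.argmin())))       # nearest-neighbour cluster pair
    return pairs, km_a.labels_, km_b.labels_
```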

2.6 Elimination of Mismatching Pairs in Micro Animation Based on Improved RANSAC Algorithm

In order to ensure matching accuracy, this paper designs an improved Random Sample Consensus (RANSAC) algorithm to eliminate these mismatched point pairs to the greatest extent and realize high-precision matching of multi-image micro animation videos.

2.6.1 Transformation Matrix Estimation

When the RANSAC algorithm [22] calculates the H matrix, it needs to randomly sample a large number of matching points many times and check the matching point pairs one by one to count the number of inliers, which imposes high computational requirements; the repeated resampling and recomputation of the transformation matrix also increase the workload. In this paper, an improved RANSAC algorithm is designed to improve the efficiency of the RANSAC algorithm.

The basic idea of the improved RANSAC algorithm is to randomly select four non-collinear matching point pairs from the rough matching pairs, solve the transformation matrix H, and calculate the Euclidean distance d between the matching points. The idea of the nearest-neighbor algorithm is introduced: the ratio of the closest distance d_nearest to the next closest distance d_next is calculated, these values are divided roughly equally into four groups in descending order, and four matching point pairs are selected from the group with the smallest values to determine a candidate optimal model. If none of the four selected pairs are inliers, the probability of this model being the optimal model is very small, so the model is discarded and the remaining points are not evaluated. Matching points are then selected again and the above steps are repeated until the number of inliers among the selected pairs is greater than 2, after which the inliers among the remaining matching point pairs are determined. The flow chart of the improved RANSAC algorithm is shown in Fig. 2.
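For orientation, a basic RANSAC loop over four-point samples is sketched below using OpenCV's four-point homography solver; the distance-ratio grouping that biases the sampling in the improved algorithm is omitted, the iteration count and inlier threshold are illustrative, and the function name is hypothetical.

```python
import numpy as np
import cv2

def ransac_homography(src_pts, dst_pts, n_iter=500, inlier_thresh=3.0):
    """Basic RANSAC loop over four-point samples (a simplified version of Fig. 2).

    src_pts, dst_pts : (N, 2) float32 arrays of putative matching points.
    """
    best_H, best_inliers = None, np.zeros(len(src_pts), dtype=bool)
    for _ in range(n_iter):
        idx = np.random.choice(len(src_pts), 4, replace=False)
        H = cv2.getPerspectiveTransform(src_pts[idx], dst_pts[idx])
        if not np.all(np.isfinite(H)):              # degenerate (e.g. collinear) sample
            continue
        proj = cv2.perspectiveTransform(src_pts.reshape(-1, 1, 2), H).reshape(-1, 2)
        err = np.linalg.norm(proj - dst_pts, axis=1)   # Euclidean distance d
        inliers = err < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_H, best_inliers = H, inliers
    return best_H, best_inliers
```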

Fig. 2 Improved RANSAC algorithm flow

2.6.2 Accuracy Evaluation

In the mobile network environment, the number of images to be matched in multi-image matching of micro animation videos is very large, ranging from hundreds to thousands of images. Facing such a huge amount of data, the accumulation of matching errors cannot be ignored if matching accuracy is to be ensured. The main sources of micro animation video image matching error are as follows:

  1. (1)

    The original error of the micro-animation video image generally refers to the image error caused by external conditions or internal factors when the micro-animation video image is obtained. The original error is usually processed in the pre-processing of the micro-animation video image.

  2. (2)

    The error in the process of extracting feature points from micro-animation video images;

  3. (3)

    The error generated when establishing the relative orientation model between micro-animation video images, that is, the error in the process of calculating the transformation matrix.

Micro-animation video image matching is usually evaluated in terms of registration accuracy. The median error is a numerical standard for measuring accuracy, and the root mean square error reflects the deviation between data. Therefore, in this paper, the median error \(\tau\) is used as the accuracy evaluation standard, and the root mean square error RMSE is used to reflect the degree of dispersion between data. The formulas are:

$$\left\{\begin{array}{c}\tau =\sqrt{\frac{\sum \left[{\left({x}_{i}-{\overline{X} }_{i}\right)}^{2}+{\left({X}_{i}-{\overline{X} }_{i}\right)}^{2}+{\left({y}_{i}-{\overline{Y} }_{i}\right)}^{2}+{\left({Y}_{i}-{\overline{Y} }_{i}\right)}^{2}\right]}{N}}\\ RMSE=\sqrt{\frac{\sum \left[{\left({x}_{i}-{X}_{i}\right)}^{2}+{\left({y}_{i}-{Y}_{i}\right)}^{2}\right]}{N}}\end{array}\right.$$
(18)

where \(\left({x}_{i},{y}_{i}\right)\) are the coordinates of a point in the micro animation video image, \(\left({X}_{i},{Y}_{i}\right)\) are the coordinates of the corresponding point on the target image after matching, \(\left({\overline{X} }_{i},{\overline{Y} }_{i}\right)\) denotes the average of the two coordinates, and \(i=1,2,\cdots ,N\).
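A short sketch of the two accuracy measures in Eq. (18), assuming the matched point pairs are stored as NumPy arrays; the function name is hypothetical.

```python
import numpy as np

def matching_accuracy(src, dst):
    """Median error tau and RMSE of Eq. (18) for N matched point pairs.

    src : (N, 2) array of points (x_i, y_i) in the source image.
    dst : (N, 2) array of matched points (X_i, Y_i) in the target image.
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    mean = (src + dst) / 2.0                                   # the averaged coordinates
    tau = np.sqrt(np.sum((src - mean) ** 2 + (dst - mean) ** 2) / len(src))
    rmse = np.sqrt(np.sum((src - dst) ** 2) / len(src))
    return tau, rmse
```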

3 Experimental Analysis

To evaluate high-precision multi-image block matching of micro animation videos, the YouTube-8M dataset is used. YouTube-8M is a large-scale video classification dataset developed by Google that contains millions of low-resolution video clips from the YouTube platform; its sources cover a wide range of topics, making it suitable for diverse experimental needs. Each video clip in the YouTube-8M dataset comes with rich tag information representing various features and content of the video, including visual features, audio features, and text descriptions. For the segmentation and high-precision matching task of micro animation videos in particular, the visual features of the videos can be used for analysis and matching. The micro animation video category of this dataset is selected, containing 10,000 videos with an average frame rate of 24 fps. In the experiment, Network Simulator 3 (NS-3) is used to simulate the mobile network environment. NS-3 is a widely used open-source network simulator that can simulate various network environments and protocols; it provides a wealth of tools and modules for simulating mobile network environments and supports customized network topologies, link characteristics, and transmission characteristics. The following experiments are performed in this environment.

3.1 Noise Reduction Effect

In order to enhance the matching effect of micro animation videos, the video images need to be preprocessed. Figure 3 shows the result of preprocessing a micro animation video image using the method in this paper.

Fig. 3 Noise reduction effect. (a) Original micro animated image; (b) micro animation image after noise reduction

Figure 3 shows that the noise reduction effect of the two-dimensional HD-DWT method on the micro animation video image is very obvious. The brightness, clarity, and gray value of the micro animation video image are significantly optimized, improving the image quality visible to the naked eye and providing powerful support for the subsequent high-precision matching of micro animation video images, so that image matching can be completed more easily.

3.2 Matching Effects

To demonstrate the matching effect of the proposed micro animation video matching algorithm, it is compared with the key point matching algorithm in [7], the HarrisZ+ corner selection matching algorithm in [8], the few-shot image matching algorithm in [9], and the improved fuzzy image matching algorithm in [10]. The five algorithms are used to extract image features from 500 micro animation videos, and the results are shown in Table 1.

Table 1 Feature extraction results

According to Table 1, the number of features that can be extracted varies with the length of the micro animation video: more features are extracted when the video is relatively long, and fewer when it is short. The comparison of the five algorithms shows that, regardless of the video length, the algorithm studied in this paper extracts more features than the other algorithms. This is because it is innovative in feature extraction: it utilizes the stable features of key frames in the micro animation video, adopts the Harris algorithm to complete image corner detection, and sets a block threshold in each image sub-block, so that the extracted corner points are evenly and reasonably distributed, fully reflecting the overall image structure and yielding the corner features of the sub-images; therefore more matching points are obtained during matching. A video is randomly selected for micro animation video image matching, and the image matching results of the five algorithms are shown in Table 2:

Table 2 Matching results

Table 2 shows that the improved fuzzy image matching algorithm performs worst in micro animation video image matching, with the lowest correct-match rate and the largest median error and root mean square error. The HarrisZ+ corner selection matching algorithm has the fewest matches but the longest matching time. The algorithm studied in this paper performs best among the five algorithms, with more matches, shorter matching time, and fewer false matches, and is therefore well suited to image matching of micro animations. This is because the improved Random Sample Consensus (RANSAC) algorithm adopted in this paper eliminates mismatched point pairs to the greatest extent, reducing the mismatch probability. The actual matching results of the proposed algorithm, the key point matching algorithm, and the few-shot image matching algorithm (the two comparison algorithms with better matching effect) are shown in Fig. 4.

Fig. 4 Actual effect of image matching of the three algorithms

As can be seen from Fig. 4, the actual matching coverage of the key point matching algorithm and the few-shot image matching algorithm is relatively small because of their relatively small number of matching pairs, and the covered area is further reduced after false matching pairs are removed. The algorithm designed in this paper produces many matching pairs of feature points and adopts the image blocking method, so it can generate matching pairs at every position of the image and cover a larger image area. After applying the designed algorithm, the number of video playback stalls, the change in image clarity, and the sensitivity of the video image are shown in Table 3.

Table 3 Video quality before and after the application of this method

Table 3 shows that after applying the designed algorithm, the number of stalls of the micro animation video is significantly reduced, with a maximum difference of 12. The brightness of the video reaches an appropriate level, and the definition of the video is also significantly improved. Because the proposed method adopts the SIFT algorithm to process each pixel of the image and generate feature points, the detailed changes of each image block in the micro animation video can be effectively captured. It can be seen that the designed algorithm has strong practicability and can promote the development of the micro animation video industry in mobile network environments.

4 Conclusion

This paper presents a high-precision matching algorithm for multi-image segmentation of micro animation videos in mobile network environments. Experiments show that the algorithm can effectively reduce the noise of micro animation video images, improve image quality, and enhance the image matching effect. Compared with other algorithms, the designed algorithm extracts more micro animation video features, produces more matching pairs, yields fewer false matching pairs, and requires less matching time. In actual image matching, the designed algorithm covers a larger image area and produces more matching lines, achieving a better matching effect and better meeting the requirements of mobile communication networks. As future work, we plan to apply the designed algorithm to other computer vision applications, including image generation [23, 24], object detection [25, 26], and depth estimation [27, 28], among others [29].