1 Introduction

The rapid development of artificial intelligence [12, 17, 22, 32, 39] has brought a new source of power to the progress of human society, and artificial intelligence has entered every corner of our daily lives. The fast-growing food delivery industry, for example, has created jobs for tens of millions of people and made meals more convenient for hundreds of millions. At the core of a food delivery system [23, 38], matching couriers with orders, planning the optimal delivery route [2] for each courier, and dynamically adjusting the whole system so that it does not crash during peak periods all depend on the deep involvement of artificial intelligence technology.

At the beginning of 2020, a sudden epidemic affected the entire world. Because the new coronavirus is highly contagious and hard to detect, countries and regions were forced to gradually restrict population movement and close their borders [16, 29]. The virus also has a long incubation period and highly variable symptoms. Therefore, in controlling its spread, carrying out epidemiological investigations of confirmed patients and affected populations and tracing their contact histories became the top priority of epidemic prevention and control. Technical workers used artificial intelligence technology based on big data to build population flow models, judge the likelihood of population movement as scientifically as possible, and improve the efficiency of epidemic prevention and control. On the other hand, the epidemic also changed traditional industries. Owing to the lockdown measures introduced by governments, economic activity in many urban areas came to a standstill. In contrast, the short video platform [19] Douyin achieved a tenfold increase in monthly revenue during the epidemic and became the world's most downloaded mobile application. The epidemic pushed many traditional industries from offline to online, and selling goods on Douyin became a transformation path for many companies; surprisingly, a few hours of live streaming could bring merchants more revenue than the previous several months. The core of Douyin's successful marketing is the use of artificial intelligence algorithms to accurately match a large number of users and to allocate advertising traffic intelligently.

These examples show that applications of artificial intelligence are inseparable from the support of big data: big data is the cornerstone of the development of artificial intelligence. At present, the most popular branch of artificial intelligence is deep learning [9, 27, 35], whose application has greatly improved the performance of algorithms in many fields. The backpropagation (BP) model [10] at the core of deep learning algorithms was actually proposed as early as 1974, yet the rapid development of deep learning has only taken place in recent years, mainly because the amount of data that could be collected and processed at that time was too small to achieve good results.
In recent years, the generation of large-scale data and the gradually increasing ability to process it [37, 49] have made the development of deep learning technology possible. Within the field of machine learning, deep learning is usually categorized as supervised learning [3, 50], which means that each sample needs a clear label. At the same time, the size of the training set has a great impact on the performance of the algorithm: generally, the more correctly labeled samples there are, the better the deep learning model performs.

However, in practice, the large-scale data we obtain are often unlabeled or incorrectly labeled. Therefore, how to label such large-scale data correctly has become a very important research topic in machine learning [40]. To obtain accurate labels, we could ask experts in the relevant fields to annotate small-scale data, but due to time, cost and other factors, hiring experts is not practical for large-scale labeling tasks. To solve this problem, crowdsourcing methods [31, 47] distribute the labeling task over the network to people from all walks of life around the world, obtaining annotations for large-scale data at lower economic cost and in a shorter time. While crowdsourcing makes rapid labeling of large-scale data possible, it also has obvious problems. The labelers are not professionals and may have a very limited understanding of the characteristics of the data. At the same time, the compensation offered by crowdsourcing platforms is not high, so many labelers may not take the work seriously, may label carelessly, or may not label at all. For these reasons, the accuracy of the final sample labels is very likely to be low.

To solve this problem, researchers have proposed some targeted algorithms [8, 33], among which the majority voting (MV) algorithm is the most classic. Its main idea is very simple: following the principle that the minority obeys the majority, the category chosen by the largest number of labelers is selected as the predicted category of the sample. The principle of majority voting is clear, interpretable and relatively easy to implement, so it has become a benchmark algorithm for crowdsourced data problems. However, majority voting has flaws. Crowdsourced labels are produced in a short time by different people from all over the world, who differ in identity, background, education level and way of thinking, so the labeling accuracy of each worker is necessarily different. The majority voting algorithm simply assumes that all workers have the same ability, and its results therefore often suffer from certain defects.

To address the problems of the majority voting algorithm, this paper proposes a new two-step learning algorithm for crowdsourced data classification. In the first step, a worker ability model is proposed to account for the different labeling abilities of different workers. The model first assigns an initial ability weight to every worker, and then obtains the labeling ability weights of different workers by fitting the samples to their self-expressive reconstructions after the worker ability weights are applied. At the same time, the L12 norm [11] of the worker ability weight matrix is added to the objective function. The L12 norm is an improved version of the Lasso method [30, 48]; on top of the variable sparsity of Lasso, it also considers the similarity between attributes. Concretely, when two workers have similar labeling abilities, the weight of one of them is driven to zero to reduce redundancy, so that the differences between workers' abilities are reflected more clearly and a more accurate worker ability weight can be obtained. The second step considers the similarity between samples through the cosine measure [36, 41]: the pairwise sample similarities are computed, and the most similar sample is found for each sample. Finally, the category of each sample is computed by combining the worker ability weights obtained in the first step with the similar sample weights obtained in the second step. The main contributions of the proposed algorithm are as follows:

  • The traditional MV algorithm simply assumes that all labeling workers have the same ability, but in real scenarios this assumption is often unrealistic, which degrades the final results. To address this problem, the proposed algorithm assigns different ability weights to the workers on top of the MV algorithm, making the majority voting process more reasonable and the voting results naturally more accurate.

  • The proposed algorithm adds an l12 regularization term to the model used to train the worker ability weights. l12 regularization is a group sparse method: it groups variables with similar attributes and, through the constraint, makes each group internally sparse while keeping the groups themselves non-sparse. Applied to the worker ability weights, the l12 term keeps only one weight among several workers with similar abilities, which sparsifies the whole worker ability weight matrix, reduces the redundant influence of workers with similar abilities, distinguishes the weights corresponding to workers with different abilities, and finally leads to better results.

  • The proposed algorithm not only considers the uneven abilities of the labeling workers, but also assigns weights to the samples. Real data collected in various industries is not randomly generated; it usually carries certain industry characteristics, and the samples are not disordered but share some similar or contradictory characteristics. Therefore, this paper uses the cosine measure to compute the similarity between samples and combines it with the worker ability weights obtained from the worker ability model to build the final classification model and test it on the data. By considering the two key factors of worker ability weights and sample similarity at the same time, the proposed algorithm achieves better performance.

2 Related work

In this section, we first introduce some similarity measurement methods in the first part, and then introduce the random forest algorithm in the second part.

2.1 Similarity measure

Machine learning algorithms generally process data by classification or clustering, i.e., by separating different data or grouping the same data together [44]. Judging whether two samples are the same or different is therefore very important, and data similarity measurement is a key link in machine learning algorithms. How to measure the difference between samples scientifically and how to choose the right measure for the characteristics of the data are key factors in improving the performance of an algorithm. Commonly used distance measures include Euclidean distance [5], Manhattan distance, Minkowski distance, Hamming distance and cosine distance. Euclidean distance is the actual straight-line distance between two points; because the differences it produces are relatively pronounced, it is used to measure the similarity of samples in high-dimensional data. The meaning of Manhattan distance [4] is very intuitive: it is the distance between Manhattan blocks. Since there are many intersections in a city block, the actual driving distance within the block is the Manhattan distance. Minkowski distance [24] is a distance formula with a parameter: when the parameter is 1 it reduces to Manhattan distance, and when the parameter is 2 it reduces to Euclidean distance. Hamming distance [28] mainly operates on strings; the number of characters that must be changed to convert string 1 into string 2 is the Hamming distance between the two strings, which makes it well suited to cryptography and information compression. The law of cosines in geometry uses the cosine of an angle to measure the difference between the directions of two vectors, and the cosine measure in machine learning uses the same idea to measure the difference between samples. The cosine measure first normalizes the samples, so it is insensitive to the length of a sample and only sensitive to its direction. It is therefore usually suitable for judging the similarity of high-dimensional samples, but not for computing specific distances.
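For concreteness, the following Python sketch (illustrative only, with toy vectors that are not taken from any data set in this paper) evaluates the measures listed above with SciPy:

```python
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 4.0])

euclidean = distance.euclidean(a, b)          # straight-line distance
manhattan = distance.cityblock(a, b)          # "city block" distance
minkowski = distance.minkowski(a, b, p=3)     # p = 1 gives Manhattan, p = 2 gives Euclidean
cosine_sim = 1.0 - distance.cosine(a, b)      # cosine similarity = 1 - cosine distance

# Hamming distance counts the positions at which two equal-length strings differ.
hamming = distance.hamming(list("karolin"), list("kathrin")) * len("karolin")

print(euclidean, manhattan, minkowski, cosine_sim, hamming)
```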

2.2 Random forest

As its name suggests, the random forest algorithm [18] has two main ingredients: randomness and a forest. The forest in the random forest algorithm is composed of decision trees [42, 43, 45], and the way they are composed is random. A decision tree is a supervised machine learning algorithm. It splits from a root node containing all samples, producing one split node after another, until the last layer consists entirely of leaf nodes. Each split node represents a splitting condition that divides the samples in the parent node into two or more groups, and after the splitting is completed each leaf node represents a specific category. To a certain extent the decision tree algorithm overcomes the shortcoming that some algorithms can only partition data linearly: its tree structure classifies the samples more accurately, and at the same time it is highly interpretable, with an intuitive and easy-to-understand procedure. The random forest algorithm first randomly samples both the training examples and the attributes, then grows a fully split decision tree on the sampled data, and repeats these steps to obtain a random forest. The sampling of examples is random sampling with replacement, so the samples used to build each tree are not necessarily the same; as a result, each decision tree is less prone to overfitting and some information about the deviation between models can be obtained. Because of its strong robustness, suitability for high-dimensional data, simple implementation and excellent performance, random forest is widely used in prediction systems, big data modeling and other applications.
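For reference, a minimal scikit-learn sketch of training and evaluating a random forest classifier; the data set and hyperparameters here are illustrative and are not the settings used in the experiments of Section 4:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Each tree is grown on a bootstrap sample of the data and random subsets of the attributes.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```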

3 Methodology

In this section, we first introduce the notation used in this article in Section 3.1, then introduce the MV algorithm and its specific mathematical representation in Section 3.2, and then introduce the Knv algorithm and its specific procedure in Section 3.3. In Section 3.4 we introduce the two-step learning crowdsourcing data classification algorithm in detail. Finally, the objective function of the algorithm is optimized in Section 3.5.

3.1 Notations

In this paper, we use uppercase and lowercase letters to represent matrices and vectors, respectively. \(\mathbf {X} = \{x_{i}\}_{i = 1}^{n} \in \mathbb {R}^{n \times d}\) represents the sample set, where n and d denote the number of samples and the dimension of each sample, respectively. \(\mathbf{X}_{i}\) denotes the i-th row of the matrix X, i.e., the i-th sample of the sample set; \(\mathbf{X}^{j}\) denotes the j-th column of X, i.e., the j-th attribute of all samples; and \(\mathbf{X}_{i,j}\) denotes the j-th attribute of the i-th sample. The Frobenius norm of X is denoted as \(||\mathbf {X}||_{F} = \sqrt {\sum \nolimits _{ij} |x_{ij}|^{2}}\). Furthermore, the trace, inverse, and transpose of the matrix X are denoted by \(tr(\mathbf{X})\), \(\mathbf{X}^{-1}\) and \(\mathbf{X}^{T}\), respectively. We also summarize these notations in Table 1.

Table 1 Details of the notations used in this paper

3.2 Majority voting

Table 2 shows a crowdsourcing data set. The abscissa represents the samples and the ordinate represents the workers, so that the entry L(x, y) represents the label given by the y-th worker to the x-th sample. In practice, because a single worker's understanding of some problems is limited, the accuracy of the labels is not high if only one worker's results are used. To solve this problem, the MV method analyzes the collected labels according to the principle that the minority obeys the majority, and takes the most frequent result as the final classification label of the sample. The specific expression is as follows:

$$ \mathop v(x)=\arg \text{ }\underset{b\in {\Omega} }{\mathop{\max }} v(b|x) $$
(1)

where \(v(b|x)=\frac {1}{\left | {{S}_{x}} \right |}\sum \nolimits _{w\in {{S}_{x}}}{\mathbf {1}(w=b)}\), \(\left | {{S}_{x}} \right |\) denotes the number of workers, \(S_{x}\) denotes all the labels given to the x-th sample, and Ω denotes the label set. The indicator function \(\mathbf {1}(\cdot)\) returns 1 if a worker's label equals the candidate label b and 0 otherwise. Therefore, for a two-class data set, whenever v > 0.5 the MV algorithm recovers the true label. Compared with the traditional single-label method, majority voting yields more accurate labels. However, in the MV algorithm every worker is given the same weight, whereas in practice workers' abilities to understand different problems often differ greatly. Assigning the same weight to every worker is therefore not optimal, which means that the MV algorithm still has certain defects.
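As a minimal illustration of (1) (not the implementation used in the experiments), the following Python sketch takes a toy worker-by-sample annotation matrix, in which -1 marks a missing annotation, and returns the majority vote for each sample:

```python
import numpy as np
from collections import Counter

# Toy annotation matrix: rows are workers, columns are samples; -1 means "not labeled".
Y = np.array([
    [0, 1, 1, -1],
    [0, 1, 0,  1],
    [1, 1, 0,  1],
])

def majority_vote(labels):
    """Return the most frequent label among the given annotations (ties broken arbitrarily)."""
    votes = [l for l in labels if l != -1]
    return Counter(votes).most_common(1)[0][0]

predictions = [majority_vote(Y[:, j]) for j in range(Y.shape[1])]
print(predictions)  # [0, 1, 0, 1]
```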

Table 2 Crowdsourcing data set

3.3 K nearest voting

To address some of the defects of the MV algorithm, this section introduces an improved algorithm based on MV, the Knv method. Knv refers to the k-nearest-neighbor voting algorithm, and its specific mathematical expression is as follows:

$$ \mathop {v_{k}}(x) = \arg {\text{ }}\mathop {\max }\limits_{b \in {\Omega} } {v_{k}}(b|x) $$
(2)

where \( {v_{k}}(b|x) = \frac {1}{{\left | {{S_{x}}} \right | + \bar{\alpha}}}\left [ {\left | {{S_{x}}} \right |v(b|x) + {\alpha _{b}^{x}}} \right ] \) and \( {\alpha _{b}^{x}} = \frac {1}{k}\sum \limits _{i = 1}^{k} {{\alpha _{i}}v(b|{x_{i}})} \) with \(x_{i} \in N_{k}(x)\), the set of the k nearest neighbors of x. The vector α represents the weights of the k nearest neighbors of the sample; in order to reflect the degree of discrimination among the k neighbors it is initialized as \( \alpha = \left [ {k,k - 1,k - 2,\ldots,1} \right ] \), and \( \bar{\alpha} \) denotes the mean of its elements. The MV algorithm simply votes over all workers' labels to obtain the final result, without considering the relationship between neighboring samples. In practice, samples that are close to each other tend to have similar characteristics, and samples with similar characteristics often belong to the same category, so it is both necessary and scientific to take the labels of neighboring samples into account when judging the true label of a single sample. To this end, the Knv algorithm extends the original MV algorithm by considering the labels of the k nearest neighbors of a sample, so that the true label can be judged more scientifically and accurately; experiments also show that the Knv algorithm achieves better performance. Although Knv improves on MV to a certain extent by considering the influence of the k nearest neighbors' labels, it still does not take into account the different abilities of different workers on different samples, so it can be further improved by modeling the workers' understanding of the samples.
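A minimal sketch of the k-nearest voting rule above, assuming the neighbor weights α = [k, k − 1, ..., 1] and a precomputed list of nearest neighbors; the annotation matrix format and helper names are illustrative only:

```python
import numpy as np

def vote_fraction(Y, x, b):
    """v(b|x): fraction of the annotations of sample x that equal label b (-1 = missing)."""
    labels = [l for l in Y[:, x] if l != -1]
    return sum(1 for l in labels if l == b) / len(labels)

def knv_predict(Y, neighbors, x, label_set, k=3):
    """k-nearest voting for sample x; Y is the worker-by-sample annotation matrix and
    neighbors[x] lists the indices of the k nearest neighbors of x (assumed precomputed)."""
    alpha = np.arange(k, 0, -1)                 # neighbor weights [k, k-1, ..., 1]
    alpha_bar = alpha.mean()
    s_x = [l for l in Y[:, x] if l != -1]       # all annotations of sample x
    scores = {}
    for b in label_set:
        alpha_bx = sum(alpha[i] * vote_fraction(Y, n, b)
                       for i, n in enumerate(neighbors[x][:k])) / k
        scores[b] = (len(s_x) * vote_fraction(Y, x, b) + alpha_bx) / (len(s_x) + alpha_bar)
    return max(scores, key=scores.get)
```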

3.4 Proposed method

In view of the shortcomings of the MV and Knv algorithms introduced above, this paper proposes a two-step learning crowdsourcing data classification algorithm built on the traditional MV algorithm. The first step is to build a model of the workers' labeling ability. Specifically, a labeling ability weight matrix β is first assigned to all workers, and the optimal β is obtained by fitting the original samples to their self-expressive reconstruction weighted by β. The specific expression is as follows:

$$ \mathop {\min }\limits_{\boldsymbol{\upbeta}} \left\| {{\mathbf{X}^{T}} - {\mathbf{X}^{T}}{\mathbf{Y}^{T}}\boldsymbol{\upbeta} } \right\|_{F}^{2} $$
(3)

where \(\mathbf {X} \in \mathbb {R}^{n \times d}\) represents the original crowdsourcing data set containing n d-dimensional samples, \(\mathbf {Y} \in \mathbb {R}^{m \times n}\) represents the labels given by the m workers to the n samples, and the matrix \(\boldsymbol {\upbeta } \in \mathbb {R}^{m \times n}\) represents the labeling ability weights of the m workers for the n samples. Since some workers have similar labeling abilities, the weight matrix β contains a certain degree of redundancy. To remedy this, our method adds the l12 norm as a regularization term to the objective function in (3) and sparsifies the ability weight matrix β, so that the redundancy among workers with similar abilities is reduced and workers with different labeling abilities are further distinguished, which improves the performance of the algorithm. The l12 norm is a group Lasso method; its core idea is to group the variables so that the variables within a group become sparse while the groups themselves remain as dense as possible. The specific expression is as follows:

$$ \forall W \in \mathbb{R}^{d \times 1},\quad {{\Omega}^{G}}(W) = \sum\limits_{g \in G} {\left\| {{W_{{G_{g}}}}} \right\|}_{1}^{2} $$
(4)

where W represents the vector of d attributes, G represents the set of all groups, g represents one of the groups, and \(W_{G_{g}}\) denotes the part of W belonging to group g. The l1 norm is used to make the attributes within a group sparse, while the l2 norm (the squaring and summation over groups) keeps the groups themselves non-sparse. Samples in the same group often have great similarity, so making a group internally sparse removes redundant samples within it, whereas samples in different groups are usually different and these useful samples need to be kept. Therefore, after adding the l12 norm, we obtain the final objective function of the worker ability model:

$$ \mathop {\min }\limits_{\boldsymbol{\upbeta}} \left\| {{\mathbf{X}^{T}} - {\mathbf{X}^{T}}{\mathbf{Y}^{T}}\boldsymbol{\upbeta} } \right\|_{F}^{2} + \lambda \left\| {{\boldsymbol{\upbeta}_{g}}} \right\|_{1}^{2} $$
(5)

The parameter λ adjusts the strength of the l12 regularization term. The main idea of the worker ability model is to reconstruct the original samples and to obtain the optimal ability weight matrix β by fitting this reconstruction. At the same time, the group Lasso regularization term performs sparse and non-sparse operations on the worker ability weight matrix simultaneously, so that redundant workers are removed while useful worker information is retained as much as possible.
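A small numpy sketch of how the l12 penalty in (4) and (5) can be evaluated; the grouping of workers shown here is a hypothetical example rather than the grouping used in the paper:

```python
import numpy as np

def l12_penalty(beta, groups):
    """Sum over groups of the squared l1 norm of each group's rows of the
    worker ability weight matrix (exclusive-Lasso-style penalty)."""
    return sum(np.abs(beta[g, :]).sum() ** 2 for g in groups)

beta = np.random.randn(6, 4)           # toy weights: 6 workers, 4 samples
groups = [[0, 1], [2, 3], [4, 5]]      # hypothetical grouping of workers with similar ability
print(l12_penalty(beta, groups))
```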

Unlike the first step, which considers the weights of the workers' labeling abilities, the second step of the algorithm considers the similarity between samples. In crowdsourced data, in addition to the similar or opposite relationships between different workers, there are also certain similarity relationships between samples. Similar samples usually have similar labels, so the labels of similar samples carry useful information about the current sample. Therefore, we take the labels of similar samples into account when computing the label of a single sample, which can reduce, to a certain extent, the adverse effect of random errors made by a small number of workers. This paper uses the cosine measure to quantify the similarity between samples; its mathematical expression is as follows:

$$ \alpha = similarity = \cos (\theta ) = \frac{{\mathbf{A} \cdot \mathbf{B}}}{{\left\| \mathbf{A} \right\|\left\| \mathbf{B} \right\|}} = \frac{{\sum\limits_{i = 1}^{d} {{\mathbf{A}_{i}} \times {\mathbf{B}_{i}}} }}{{\sqrt {\sum\limits_{i = 1}^{d} {{{({\mathbf{A}_{i}})}^{2}}} } \times \sqrt {\sum\limits_{i = 1}^{d} {{{({\mathbf{B}_{i}})}^{2}}} } }} $$
(6)

where A and B represent two samples in the data set, and \(\mathbf{A}_{i}\) and \(\mathbf{B}_{i}\) represent their i-th attribute values, with d the number of attributes. Specifically, the cosine similarity formula above is computed for all samples: all n samples are traversed and the most similar sample of each sample is found (when n is odd, the last sample is kept without calculation). This yields n/2 sample pairs and n/2 similarity values α (α = similarity).

Finally, the workers' labeling ability weight matrix β obtained in the first step is combined with the n/2 similarity values α obtained in the second step to compute the final predicted label. Specifically, for each sample the ability weight matrix β is first applied to the different workers; on this basis, the sample itself is assigned a weight of 1 − 0.5α and its most similar sample a weight of 0.5α, and the predicted label of the sample is computed from this weighted combination. Since the algorithm considers both the similarity between the crowdsourced data samples and the differences in the workers' labeling abilities, the predicted labels it produces are considerably more accurate than the original labels.
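The paper describes this combination at a high level only, so the sketch below is an assumption made for illustration: a β-weighted vote is computed for the sample and for its most similar partner, and the two votes are mixed with weights 1 − 0.5α and 0.5α:

```python
import numpy as np

def beta_weighted_vote(Y, beta, x, label_set):
    """Score each label of sample x by the summed ability weights of the workers choosing it."""
    return {b: sum(beta[w, x] for w in range(Y.shape[0]) if Y[w, x] == b) for b in label_set}

def two_step_predict(Y, beta, x, partner, alpha, label_set):
    """Mix the votes of sample x and its most similar sample `partner`
    with weights (1 - 0.5*alpha) and 0.5*alpha, respectively (assumed aggregation)."""
    own = beta_weighted_vote(Y, beta, x, label_set)
    sim = beta_weighted_vote(Y, beta, partner, label_set)
    combined = {b: (1 - 0.5 * alpha) * own[b] + 0.5 * alpha * sim[b] for b in label_set}
    return max(combined, key=combined.get)
```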

3.5 Optimization

Because the objective function of the first part of the algorithm cannot be solved directly, this section optimizes the objective function; the specific steps are as follows, and the pseudo code is listed in Algorithm 1.

$$ \mathop {\min }\limits_{\boldsymbol{\upbeta}} \left\| {{\mathbf{X}^{T}} - {\mathbf{X}^{T}}{\mathbf{Y}^{T}}\boldsymbol{\upbeta} } \right\|_{F}^{2} + \lambda \left\| {{\boldsymbol{\upbeta}_{g}}} \right\|_{1}^{2} $$
(7)

where \(\mathbf {X} \in \mathbb {R}^{n \times d}\), \(\mathbf {Y} \in \mathbb {R}^{m \times n}\) and \(\boldsymbol {\upbeta } \in \mathbb {R}^{m \times n}\). First, the regularization term is rewritten in an equivalent trace form, giving:

$$ \mathop {\min }\limits_{\boldsymbol{\upbeta}} \left\| {{\mathbf{X}^{T}} - {\mathbf{X}^{T}}{\mathbf{Y}^{T}}\boldsymbol{\upbeta} } \right\|_{F}^{2} + \lambda tr({\boldsymbol{\upbeta}^{T}}\mathbf{F}\boldsymbol{\upbeta} ) $$
(8)

Then, taking the derivative of the above formula with respect to β gives:

$$ - 2\mathbf{Y}\mathbf{X}{\mathbf{X}^{T}} + 2\mathbf{Y}\mathbf{X}{\mathbf{X}^{T}}{\mathbf{Y}^{T}}\boldsymbol{\upbeta} + 2\lambda \mathbf{F}\boldsymbol{\upbeta} $$
(9)

Setting the above expression equal to 0, the final worker ability weight matrix β is obtained:

$$ \mathop {\boldsymbol{\upbeta}} = {(\mathbf{Y}\mathbf{X}{\mathbf{X}^{T}}{\mathbf{Y}^{T}} + \lambda \mathbf{F})^{- 1}}\mathbf{Y}\mathbf{X}{\mathbf{X}^{T}} $$
(10)

where F in the above formula is a diagonal matrix whose diagonal elements are:

$$ {\mathbf{F}_{ii}} = \sum\limits_{g} {\frac{{{{(I_{{G_{g}}})}_{i}}{{\left\| {{\boldsymbol{\upbeta}_{g}}} \right\|}_{1}}}}{{{{\left\| {{\boldsymbol{\upbeta}^{i}}} \right\|}_{1}}}}} \quad (i = 1, \cdots ,m) $$
(11)
Since F itself depends on β, (10) and (11) are updated alternately until convergence; the whole procedure is summarized in Algorithm 1.

Algorithm 1 (pseudo code of the optimization of the worker ability model)
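A compact numpy sketch of this alternating update; it follows (10) and (11) directly, but the grouping of workers, the number of iterations and the small constant added for numerical stability are assumptions rather than settings from the paper:

```python
import numpy as np

def fit_worker_ability(X, Y, groups, lam=0.1, n_iter=50, eps=1e-8):
    """Alternately update beta via (10) and the diagonal matrix F via (11).
    X: n x d sample matrix, Y: m x n worker label matrix,
    groups: list of lists of worker indices (an assumed grouping)."""
    m, n = Y.shape
    A = Y @ X @ X.T @ Y.T            # Y X X^T Y^T  (m x m)
    B = Y @ X @ X.T                  # Y X X^T      (m x n)
    beta = np.zeros((m, n))
    F = np.eye(m)
    for _ in range(n_iter):
        beta = np.linalg.solve(A + lam * F, B)       # closed form of (10)
        diag = np.zeros(m)
        for g in groups:                             # reweighting of (11)
            group_norm = np.abs(beta[g, :]).sum()
            for i in g:
                diag[i] = group_norm / (np.abs(beta[i, :]).sum() + eps)
        F = np.diag(diag)
    return beta
```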

4 Experiments

In this section, the algorithm proposed in this paper and the comparison method are tested on 10 data sets to compare their classification ability. Specifically, the crowdsourcing data sets used in this paper and the parameter settings of the experiments are introduced in Section 4.1. In Section 4.2, the sensitivity of the parameters to the experimental results is analyzed so that suitable parameters can be found, and the specific experimental results are also analyzed in Section 4.2.

4.1 Data set and parameter settings

In this experiment, 10 data sets from the UCI repository are used: Clean, German, Parkinsons, Sonar, Contro, Drift, Ecoli, CCUDS, Movements and Soybean. The details of these data sets are listed in Table 3. Three parameters are set in the experiments. The first is the average number of labelers \( \left | {\overline {{S_{x}}} } \right | \); using an average accounts for the fact that each sample may be labeled by a different number of workers and thus simulates real crowdsourced data. The second is the parameter con of the beta distribution. In this paper, a simulated crowdsourced data set is constructed from the original data using the beta distribution; since the number of labels per sample is not necessarily the same, this simulates real crowdsourced data well. The third is the reliability parameter rel, which represents the average labeling ability of all workers, i.e., the label accuracy of the original data. In the specific experiments, we fix the average number of labelers \( \left | {\overline {{S_{x}}} } \right | = 25 \) and the beta distribution parameter con = 1, and set the reliability parameter to 0.6, 0.7 and 0.8.
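Since the exact generator is not spelled out in the text, the sketch below shows one plausible reading of this setup (worker accuracies drawn from a Beta distribution with mean rel and concentration con, each sample annotated by a pool of workers); it is a hypothetical illustration, not the authors' code:

```python
import numpy as np

def simulate_crowd_labels(true_labels, n_classes, avg_workers=25, rel=0.7, con=1.0, seed=0):
    """Hypothetical crowdsourcing simulator: each worker's accuracy is drawn from a Beta
    distribution with mean `rel` and concentration `con`; a correct worker returns the true
    label, otherwise a uniformly random wrong label."""
    rng = np.random.default_rng(seed)
    n = len(true_labels)
    accuracy = rng.beta(rel * con, (1.0 - rel) * con, size=avg_workers)  # mean rel
    Y = np.full((avg_workers, n), -1)
    for w in range(avg_workers):
        for j in range(n):
            if rng.random() < accuracy[w]:
                Y[w, j] = true_labels[j]
            else:
                Y[w, j] = rng.choice([c for c in range(n_classes) if c != true_labels[j]])
    return Y
```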

Table 3 The summarization of the used data sets

The first part of the experiment creates the crowdsourced data set required for this paper from the original data set. The specific steps are as follows: first, the data are classified by the random forest algorithm introduced in the related work, and the labels obtained from this classification, together with the true labels of the data, are used to generate the matrix M; second, after the matrix M has been generated, the matrix R is constructed, and the crowdsourced labels are then built on the basis of R.

4.2 Experimental result

In this section, we set the average number of labelers \( \left | {\overline {{S_{x}}} } \right | = 25 \) and the reliability parameter rel = 0.6, 0.7 and 0.8, respectively, to simulate the performance of the proposed algorithm on raw data of different quality, and compare it with the traditional MV algorithm. The experiment uses 4 binary-classification data sets and 6 multi-classification data sets, which allows us to test the performance of the algorithm on both simple binary problems and more complex multi-class problems. From Tables 4, 5 and 6 we can see that the proposed algorithm achieves very good results on all 10 data sets and that its accuracy is higher than that of MV: when the reliability parameter rel = 0.8 the algorithm is 7.01% higher than the MV algorithm, when rel = 0.7 it is 8.47% higher, and when rel = 0.6 it is 12.36% higher. The traditional MV algorithm merely counts all the labels given by all the workers and selects the category with the highest frequency as the final prediction. The method proposed here first considers the differences in labeling ability between workers and assigns different weights to different workers; at the same time, it analyzes the data samples on the assumption that similar samples may have similar labels. Combining these two key points, the proposed algorithm achieves better performance than the traditional MV algorithm.

Table 4 Average classification accuracy (rel = 0.8)
Table 5 Average classification accuracy (rel = 0.7)
Table 6 Average classification accuracy (rel = 0.6)

Analyzing the three tables together, it can also be seen that as the accuracy of the original crowdsourced data decreases, the accuracy of the proposed algorithm decreases more slowly than that of the traditional MV algorithm, indicating that even when the quality of the original data is poor, the proposed algorithm can still achieve good results and has strong robustness. Finally, Table 7 shows that the classification accuracy of the proposed algorithm is usually also more stable than that of the traditional MV algorithm.

Table 7 Standard deviation of classification accuracy (rel = 0.6)

5 Conclusion

This paper proposes a new two-step learning crowdsourced data classification algorithm. First, by assigning different labeling weights to each worker, the negative impact of differences in workers' abilities on crowdsourced data labeling is reduced to a certain extent. At the same time, the similarity between samples in the crowdsourced data is analyzed with the cosine measure, and the most similar samples are found and weighted. The proposed algorithm therefore takes into account the two key factors of differences in worker ability and similarity between data samples, and reclassifies the original crowdsourced data, achieving higher accuracy than the traditional MV algorithm on 10 data sets. On this basis, we added three groups of comparative experiments on raw data of different quality and found that the proposed algorithm is stable in its experimental results and is less affected by the quality of the original data, indicating that it is both highly accurate and robust.

In future work, we will further consider how to use the similarity between samples to improve the performance of the classification algorithm.