1 Introduction

The success of supervised learning applications often relies on large-scale, well-labeled datasets. Unfortunately, obtaining high-quality annotations from experts can be costly in both time and money. Alternatively, crowdsourcing (Han et al., 2019) provides an inexpensive approach to data labeling by hiring annotators worldwide on public platforms like Amazon Mechanical Turk (AMT). However, crowdsourced labels are usually noisy due to the existence of inexperienced or malicious annotators. Using these noisy labels in supervised learning may result in an inaccurate classifier. A straightforward way to mitigate this problem is redundant labeling, i.e., obtaining multiple labels for each instance from multiple annotators. This raises one fundamental problem, termed Learning from Crowds (LFC) (Rodrigues & Pereira, 2018): “How can we learn a good classifier from a set of possibly noisy labeled data provided by multiple annotators?”

To address the above issue, a two-stage approach is commonly adopted. First, in the answer aggregation stage (Zheng et al., 2017; Sheng & Zhang, 2019; Jin et al., 2020a), the latent true labels are estimated. Then, a classifier is trained on the estimated true labels. Alternatively, the one-stage approach (Raykar et al., 2009; Tanno et al., 2019) has been shown to be a promising direction; it uses a maximum-likelihood estimator that jointly learns the classifier, the abilities of multiple annotators (Yan et al., 2014), and the latent true labels. Across these research efforts on LFC, the probability transition process from latent true labels to observed crowdsourced labels is usually modeled with the annotators’ confusion matrices, which represent class-level probability transitions. This means that an annotator’s performance is assumed to be consistent across different instances within the same class, i.e., the transition from class j to class l is independent of instance features.

Fig. 1

Top: An example of various incorrect annotations. In the first case, the true label is randomly flipped to one of the other classes; in the second, the true label is corrupted to a related class with a fixed probability; in the third, the true label is corrupted to an unrelated class due to the impact of instance features. Bottom: The graphical model of LFC-\({\mathbf {x}}\), which represents the dependencies among the instance \({\mathbf {x}}_n\), the true label \(t_n\), and the crowdsourced labels \(y_n\). The annotation depends not only on the true label but also on the instance features

However, in the real world, the difficulty of labeling can vary among instances within the same class, and the instance features themselves affect annotators’ performance (Misra et al., 2016). Consider an example from the LabelMe dataset (Rodrigues et al., 2017) in Fig. 1(Top), which illustrates various cases of incorrect annotations given the true label “highway”. The first case indicates an inexperienced or malicious annotator who gives a random label, “coast”; the second indicates an annotator who has a biased understanding of different classes and prefers to label “highway” as “street”, because there is a strong correlation between those two classes. In both cases, the annotator’s class-level confusion matrix can be used to characterize their varying abilities and biases. The third case, however, depicts an instance of class “highway” that contains visual features related to other classes, misleading annotators into labeling it as “forest” even though these two classes are only weakly related. Therefore, class-level confusion matrices cannot cover the diverse noise cases and thus cannot completely characterize the performance of multiple annotators across different instances within the same class. This limits the ability of an LFC model to estimate latent true labels, resulting in sub-optimal classifier performance. It is therefore necessary to consider the impact of instance features when characterizing the performance of multiple annotators for LFC.

To address the aforementioned deficiency, this work proposes a novel LFC framework, LFC-\({\mathbf {x}}\), which learns a classifier directly from the instances and the associated crowdsourced labels provided by multiple annotators. In particular, going beyond confusion matrices, LFC-\({\mathbf {x}}\) models the probability transition process with noise transition matrices obtained by incorporating instance features into confusion matrices. To this end, we need to deal with two practical challenges. One is how to quantify the impact of instance features on the performance of annotators in order to construct the noise transition matrix; the other is how to incorporate the noise transition matrix into an LFC method. To cope with these challenges, we model the correlation among instance features, latent true labels, and crowdsourced labels in a probabilistic graphical model to construct the noise transition matrix. Furthermore, LFC-\({\mathbf {x}}\) consists of two modules: the noise transition matrix module and the classifier module. These two modules are integrated into an end-to-end neural network through a principled combination that maximizes a likelihood function. The graphical model of LFC-\({\mathbf {x}}\) is presented in Fig. 1(Bottom). In a nutshell, the main contributions and results of this work are summarized as follows:

  • We propose a method to construct the noise transition matrix by incorporating the impact of instance features into the confusion matrix for modeling annotators’ performance across instance features.

  • We propose a novel LFC framework, which consists of a classifier module and a noise transition matrix module in an end-to-end neural network architecture. We show that the proposed noise transition matrix is easy to implement and can be directly optimized by the standard SGD.

  • We conduct extensive experiments on crowdsourced datasets which show that our method outperforms the compared methods in terms of test accuracy and robustness. In addition, we also verify that the noise transition matrix is superior to the confusion matrix for modeling noisy labels in a singly-labeled scenario.

2 Related work

There are mainly two lines of efforts on learning a classifier from crowdsourced labels provided by multiple annotators.

Answer aggregation: Two-stage approaches first infer true labels with answer aggregation (Zheng et al., 2017; Sheng & Zhang, 2019), then learn a classifier. One of the pioneering works is the DS model (Dawid & Skene, 1979), which applies the EM algorithm to estimate latent true labels and the confusion matrices of annotators. On this basis, Whitehill et al. (2009) consider the generalized DS model, which involves the difficulty of each instance (Khetan & Oh, 2016; Han et al., 2016). Subsequently, Liu et al. (2012) aggregate the crowdsourced labels by applying approximate variational methods in graphical models. By analogy to ensemble learning, Kim and Ghahramani (2012) and Li et al. (2018) formalize the answer aggregation problem as Bayesian classifier combination, which is capable of capturing correlations between different annotators. Apart from the aforementioned probabilistic frameworks, weighted majority voting (Aydin et al., 2014) adopts weighted aggregation schemes for estimating true labels. Yin et al. (2017) integrate a classifier and a reconstructor into a unified model to estimate labels in an unsupervised manner. Similarly, training deep neural networks directly to aggregate crowdsourced labels also achieves good results (Gaunt et al., 2016). More details about answer aggregation can be found in Zheng et al. (2017), Sheng and Zhang (2019), and Jin et al. (2020a). Nevertheless, two-stage approaches do not realize the full potential of combining answer aggregation and classifier learning (Khetan et al., 2018).

One-stage approaches: Raykar et al. (2009) introduce the one-stage approach, which implements an EM algorithm to jointly model the abilities of annotators and learn a logistic regression classifier. This line of work is further extended to other types of models such as convolutional neural networks (Albarqouni et al., 2016) and supervised latent Dirichlet allocation (Rodrigues et al., 2017). Further, Khetan et al. (2018) allocate the labeling budget to maximize the performance of a classifier by jointly modeling labels and confusion matrices from noisy crowdsourced labels. The above-mentioned methods reduce LFC to a maximum likelihood estimation (MLE) problem and then use the EM algorithm to solve it. Of particular interest, Kajino et al. (2013) notice that annotators form clusters according to their abilities, and use these clusters to address the LFC problem. Closer to our work, Rodrigues and Pereira (2018) propose the Crowd Layer to train deep neural networks end-to-end directly from the noisy crowdsourced labels, using only back-propagation. On this basis, Chen et al. (2020b) present a structured probabilistic model that incorporates the constraints of the probability axioms into the parameters of the Crowd Layer. More recently, Cao et al. (2019) and Li et al. (2020) simultaneously aggregate the crowdsourced labels and learn an accurate classifier via multi-view learning. Chu et al. (2020) decompose the confusion matrix into two parts: a commonly shared confusion matrix and an individual confusion matrix. Unlike our method, those methods rest on a common assumption: the crowdsourced labels and the instance features are conditionally independent given the true labels.

Some methods also focus on the impact of instance features in LFC. Yan et al. (2010) employ logistic regression directly on the original instance content to characterize the confusion matrix in EM iterations, which not only ignores prior class-level probability transition information but is also unsuitable for large-scale data. In contrast, our LFC-\({\mathbf {x}}\) incorporates instance features into the confusion matrix to construct the noise transition matrix in a unified neural network architecture. Zhang et al. (2019) infer true labels by propagating the multiple noisy label distribution of each instance to its nearest neighbors, under the assumption that the multiple noisy label space shares a similar topological structure with the instance feature space. Unlike their work, we aim to characterize the impact of instance features on the performance of multiple annotators, rather than utilizing a KNN graph to reconstruct the instances. Zhong et al. (2017) propose a quality-sensitive LFC method with a robust loss function, which estimates the reliability of crowdsourced labels from the disagreement between crowdsourced labels and the model predictions on instance features, and then applies this term to the loss function in an SVM implementation. Unlike their work, we use noise transition matrices to correct incorrect annotations by considering the impact of instance features when characterizing the performance of multiple annotators.

3 Learning from crowds

In this section, we first present basic notations and the goal of interest. Then, we introduce a typical EM algorithm for LFC to learn a classifier from crowdsourced labels and reveal the deficiency of LFC method using the confusion matrix.

3.1 Notation and problem formulation

We assume that there are N i.i.d. instances \(\{ {\mathbf {x}}_{1},..., {\mathbf {x}}_{N}\}\), and each instance has an unknown true label. Let \(y_{n}^{(r)}\) represent the annotation/label for \({\mathbf {x}}_{n}\) provided by annotator r in a set of R annotators. The labels from individual annotators may not be correct. Formally, we set the matrix \({\mathbf {X}} = [{\mathbf {x}}_{1}^\top ;...;{\mathbf {x}}_{N}^\top ] \in {\mathbb {R}}^{N \times D}\) and \({\mathbf {Y}} = [y_{1}^{(1)},...,y_{1}^{(R)};...;y_{N}^{(1)},...,y_{N}^{(R)}] \in {\mathbb {R}}^{N \times R}\), in which \((\cdot )^\top\) represents matrix transposition. We denote the unknown true labels for \({\mathbf {X}}\) by \({\mathbf {T}} = [t_{1};...;t_{N}]\). Given the observed training data \({\mathbf {X}}\) and \({\mathbf {Y}}\), the goal of interest is to jointly estimate the abilities of multiple annotators and the latent true labels, and to train an accurate classifier.

In existing general methods of LFC, there are two common assumptions: (1) given the input instances, multiple annotators provide crowdsourced labels independently; (2) crowdsourced labels do not depend on the instance features and are determined only by the true labels. Conditioning on the true labels, the probability of the crowdsourced labels given the instance features can be factored as

$$\begin{aligned} p({\mathbf {Y}} \mid {\mathbf {X}}, \Theta )=\prod _{n=1}^{N} \sum _{t_{n}} p(t_{n} \mid {\mathbf {x}}_{n}, {\varvec{w}}) \prod _{r=1}^{R} p(y_{n}^{(r)} \mid t_{n}, \varvec{\Pi }^{(r)}), \end{aligned}$$
(1)

where \(p(t_{n} \mid {\mathbf {x}}_{n}, {\varvec{w}})\) represents true label distribution parameterized by \({\varvec{w}}\), and \(p(y_{n}^{(r)} \mid t_{n}, \varvec{\Pi }^{(r)})\) parameterized by matrix \(\varvec{\Pi }\) depicts the class-level probability transition that the annotator, r, will annotate class \(y_{n}^{(r)}\) given true label \(t_{n}\). Specifically, the matrix \(\varvec{\Pi }^{(r)} = (\pi _{i j}^{(r)})_{C \times C} \in [0,1]^{(C \times C)}\) is called the confusion matrix for representing the r-th annotator’s ability whose \((i, j)^{\mathrm {th}}\) element is parameterized by \(\pi _{i j}^{(r)}\) where \(i, j \in \{1, \ldots , C\}\) and C is the number of classes. To achieve the goal, following (Raykar et al., 2009) and extending it from binary classification to multi-class classification task, the EM algorithm can be used to compute the maximum-likelihood solution, formalized as

$$\begin{aligned} {\hat{\Theta }}_{\mathrm {ML}}=\arg \max _{\Theta }p({\mathbf {Y}} \mid {\mathbf {X}}, \Theta ). \end{aligned}$$
(2)
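As a concrete reading of Eq. (1), the following minimal sketch evaluates the log-likelihood on a toy problem. The function name `log_likelihood`, the nested-list data layout, and the toy numbers are illustrative assumptions and are not taken from the original method description.

```python
import math

def log_likelihood(class_probs, labels, confusions):
    """Log of Eq. (1) for fully observed crowdsourced labels.

    class_probs[n][t]   : p(t_n = t | x_n, w), the classifier output
    labels[n][r]        : class index assigned to instance n by annotator r
    confusions[r][t][l] : pi^(r)_{t l}; every row sums to one
    """
    total = 0.0
    for probs, ys in zip(class_probs, labels):
        marginal = 0.0
        for t, p_t in enumerate(probs):
            joint = p_t
            for r, y in enumerate(ys):
                joint *= confusions[r][t][y]  # annotators are independent given t_n
            marginal += joint                 # sum over the latent true label
        total += math.log(marginal)
    return total

# Toy data (hypothetical): 2 instances, 2 classes, 2 annotators.
class_probs = [[0.8, 0.2], [0.3, 0.7]]
labels = [[0, 0], [1, 0]]
confusions = [
    [[0.9, 0.1], [0.2, 0.8]],  # annotator 0
    [[0.7, 0.3], [0.4, 0.6]],  # annotator 1
]
ll = log_likelihood(class_probs, labels, confusions)
```

For the first instance the marginal is 0.8·0.9·0.7 + 0.2·0.2·0.4 = 0.52, and for the second it is 0.3·0.1·0.7 + 0.7·0.8·0.4 = 0.245, so `ll` equals log(0.52) + log(0.245).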

E-step: Given the observation \({\mathbf {X}}\) and \({\mathbf {Y}}\) and the current estimate of the model parameters, the expected-value of complete-data log-likelihood (a lower bound on the true likelihood) can be computed as

$$\begin{aligned} \begin{aligned} {\mathbb {E}}\{\log p({\mathbf {Y}}, {\mathbf {T}} \mid {\mathbf {X}}, \Theta )\} = \sum _{n=1}^{N} \sum _{t_{n}} q(t_{n}) \log ( p(t_{n} \mid {\mathbf {x}}_{n}, {\varvec{w}}) \prod _{r=1}^{R} p(y_{n}^{(r)} \mid t_{n}, \varvec{\Pi }^{(r)}) ), \end{aligned} \end{aligned}$$
(3)

where the expectation is w.r.t. \(p(t_{n} \mid y_{n}^{(1)}, \ldots , y_{n}^{(R)}, {\mathbf {x}}_{n}, \Theta )\) and we use \(q(t_{n})\) when referring to it for brevity. Given the estimate of the model parameters \(\Theta _{\text{ old } }\) from the most recent M-step, we can compute \(q(t_{n})\) by using Bayes’ rule as follows.

$$\begin{aligned} q(t_{n}) \propto p(t_{n} \mid {\mathbf {x}}_{n}, {\varvec{w}}_{\text{ old } }) \prod _{r=1}^{R} p(y_{n}^{(r)} \mid t_{n}, \varvec{\Pi }_{\mathrm {old}}^{(r)}). \end{aligned}$$
(4)
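The E-step of Eq. (4) can be sketched in a few lines. This is an illustrative implementation under the assumption that all annotators label every instance; the name `e_step` and the toy values are hypothetical.

```python
def e_step(class_probs, labels, confusions):
    """Posterior q(t_n) of Eq. (4), normalized over classes for each instance."""
    posteriors = []
    for probs, ys in zip(class_probs, labels):
        unnormalized = []
        for t, p_t in enumerate(probs):
            joint = p_t                       # p(t_n = t | x_n, w_old)
            for r, y in enumerate(ys):
                joint *= confusions[r][t][y]  # p(y_n^(r) | t_n = t, Pi_old^(r))
            unnormalized.append(joint)
        z = sum(unnormalized)                 # normalizing constant from Bayes' rule
        posteriors.append([u / z for u in unnormalized])
    return posteriors

# Toy data (hypothetical): one instance, two classes, two annotators who both say class 0.
confusions = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
q = e_step([[0.8, 0.2]], [[0, 0]], confusions)
```

Here the unnormalized scores are 0.504 and 0.016, so the posterior puts about 97% of its mass on class 0, more than the classifier alone (80%) because both annotators agree.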

M-step: Based on the observations \({\mathbf {X}}\) and \({\mathbf {Y}}\) and the posterior probabilities of the ground truth estimated in the current E-step, the confusion matrix can be updated by maximizing the expected value of the complete-data log-likelihood. By setting the derivative of Eq. (3) with respect to \(\pi _{j l}^{(r)}\) to zero, subject to each row of the confusion matrix summing to one, we obtain the following estimate for updating the confusion matrix.

$$\begin{aligned} \pi _{j l}^{(r)} = \frac{\sum _{n=1}^{N} q(t_{n}=j) \cdot {{\mathbb {I}}[y_{n}^{(r)}=l]}}{\sum _{n=1}^{N} q(t_{n}=j)}, \end{aligned}$$
(5)

where \({\mathbb {I}}[\cdot ]\) is an indicator function. The update of the parameter set \({\varvec{w}}\) of the classifier depends on which type of classifier is used. If the classifier is a neural network, then we can use the posterior probabilities of the true labels as targets and back-propagate the error with a gradient descent optimization algorithm.
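The closed-form update of Eq. (5) amounts to a posterior-weighted count of how often annotator r answers l when the true class is j. The sketch below is illustrative only; the function name `m_step_confusion`, the argument layout, and the toy posteriors are assumptions.

```python
def m_step_confusion(posteriors, labels, annotator, num_classes):
    """Closed-form update of Eq. (5) for one annotator.

    posteriors[n][j] : q(t_n = j) from the E-step
    labels[n][r]     : class index given by annotator r to instance n
    Assumes every class j has non-zero total posterior mass, so the
    denominator never vanishes.
    """
    pi = []
    for j in range(num_classes):
        denom = sum(q[j] for q in posteriors)
        row = []
        for l in range(num_classes):
            numer = sum(q[j] for q, ys in zip(posteriors, labels)
                        if ys[annotator] == l)
            row.append(numer / denom)
        pi.append(row)
    return pi

# Toy data (hypothetical): three instances, one annotator, two classes.
posteriors = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
labels = [[0], [1], [1]]
pi0 = m_step_confusion(posteriors, labels, annotator=0, num_classes=2)
```

With these values the first row becomes (2/3, 1/3) and the second row (0, 1); each row sums to one by construction.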

3.2 Limitations

In the most popular noise model to date, the noisy crowdsourced labels are assumed to be corrupted from the ground truth by an unknown noise transition matrix (Han et al., 2018a; Yao et al., 2020), which describes the probability transition process. The conventional LFC method makes the simplistic assumption that crowdsourced labels depend only on the ground truth and not on the input instance features. That is, the noise transition matrix is characterized only by the confusion matrices of annotators. However, in the real world, the content of instances (i.e., instance features), including foreground and background, varies among instances within the same class, so the instance features themselves affect the annotator’s judgment on labels. Modeling the probability transition process with only the confusion matrices of multiple annotators limits the ability to infer the latent true labels and leads to sub-optimal performance of the classifier. In summary, the noise transition matrix cannot be completely constructed from the class-level confusion matrices of annotators; it is also necessary to consider how each annotator’s performance depends on instance features.

For the EM-based LFC described above, a potential issue of combining a classifier with the EM algorithm is scalability (Goldberger & Ben-Reuven, 2017). The model requires training a classifier in each iteration of the EM algorithm, and many EM iterations are likely to be needed for convergence. In addition, the other primary criticism of EM-based LFC approaches is that, in practice, each instance is not labeled many times because of the labeling cost. With relatively little redundancy, the standard applications of EM are of limited use (Khetan et al., 2018). Moreover, it is intractable to construct the dedicated noise transition matrix as part of the EM algorithm because EM cannot represent the confusion matrix and the awareness of instance features separately, let alone incorporate the instance features into the LFC framework. Inspired by the Crowd Layer (Rodrigues & Pereira, 2018), which trains deep neural networks end-to-end directly from the crowdsourced labels using only back-propagation, we present a principled solution that incorporates instance features into the model of the probability transition process for LFC.

Fig. 2

General schematic of the LFC-\({\mathbf {x}}\) for classification with 3 classes and R annotators. It integrates a classifier module and a noise transition matrix module into a unified neural network architecture

4 Methodology

4.1 Proposed LFC-\({\mathbf {x}}\)

In our setting, we make the key assumption that the crowdsourced labels depend not only on the true labels but also on the instance features. We propose the noise transition matrix to model the probability transition process of multiple annotators given input instances by incorporating the instance features into the confusion matrix. On this basis, we rewrite the probability of the given crowdsourced labels: we take its log and replace the confusion-matrix term of the probability transition in Eq. (1) with the proposed noise transition matrix, yielding

$$\begin{aligned} \begin{aligned} \log p({\mathbf {Y}} \mid {\mathbf {X}}, \Theta )&= \sum _{n=1}^{N} \sum _{r=1}^{R} \log p(y_{n}^{(r)} \mid {\mathbf {x}}_{n}, \Theta ^{(r)} ) \\&= \sum _{n=1}^{N} \sum _{r=1}^{R} \log \sum _{t_{n}} \underbrace{p(t_{n} \mid {\mathbf {x}}_{n}, {\varvec{w}})}_{\text{ classifier }} \underbrace{p(y_{n}^{(r)} \mid t_{n},{\mathbf {x}}_{n}, \varvec{\Pi }^{(r)}, {\varvec{v}}^{(r)})}_{\text{ noise } \text{ transition } \text{ matrix }}, \end{aligned} \end{aligned}$$
(6)

where \(\Theta\) is the collection \(\{{\varvec{w}}, \{\varvec{\Pi }^{(r)}\}_{r=1}^{R}, \{{\varvec{v}}^{(r)}\}_{r=1}^{R}\}\), and the parameter set \(\{{\varvec{v}}^{(r)}\}_{r=1}^{R}\) represents the impact of the instance features themselves on annotators’ performance, which we discuss later.

To instantiate the probabilistic graphical model shown in Fig. 1(Bottom), we propose LFC-\({\mathbf {x}}\), which minimizes the negative log-likelihood function with respect to the parameters \(\Theta\) within a neural network that consists of two modules: a classifier module and a noise transition matrix module. Specifically, the log-likelihood is computed from the output of LFC-\({\mathbf {x}}\) through a principled combination of the classifier module and the noise transition matrix module. Figure 2 presents the overall design of LFC-\({\mathbf {x}}\). Next, we describe how to jointly optimize the parameters of the classifier module and the noise transition matrix module.

Classifier Module: Without loss of generality, suppose that a softmax neural classifier is given, parameterized by \({\varvec{w}}\), for inferring the true label distribution. We denote the non-linear function applied to an instance \({\mathbf {x}}_n\) by \(h({\mathbf {x}}_n)\) for extracting the instance features. Given the instance features, the softmax layer is adopted to predict the true label \(t_n\) by

$$\begin{aligned} p(t_n=j \mid {\mathbf {x}}_n, {\varvec{w}})=\frac{\exp ({\mathbf {u}}_{j}^{\top } h({\mathbf {x}}_n)+b_{j})}{\sum _{i=1}^{C} \exp ({\mathbf {u}}_{i}^{\top } h({\mathbf {x}}_n)+b_{i})}, \end{aligned}$$
(7)

in which the vector \({\mathbf {u}}\) and scalar b indicate weight and bias, respectively, and \(i, j \in \{1, \ldots , C\}\). However, we do not have access to the true labels and only have access to the observed crowdsourced labels, so the classifier module alone is not enough. Therefore, a noise transition matrix module that models the labeling process of multiple annotators, i.e., the probability transition from true labels and instance features to crowdsourced labels, is required to provide weakly supervised information for the classifier. In doing so, the classifier can be trained by multiplying its output with the estimated noise transition matrices, as shown in Eq. (6).
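The softmax output layer of Eq. (7) can be sketched as follows. This is a minimal illustration that takes the extracted features \(h({\mathbf {x}}_n)\) as given; the names `softmax` and `classifier_head`, the max-subtraction for numerical stability, and the toy weights are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classifier_head(features, U, b):
    """Eq. (7): p(t_n = j | x_n, w) from penultimate-layer features h(x_n).

    U[j] is the weight vector u_j and b[j] the bias b_j of class j.
    """
    logits = [sum(u * h for u, h in zip(u_j, features)) + b_j
              for u_j, b_j in zip(U, b)]
    return softmax(logits)

# Toy values (hypothetical): 2-dimensional features, 2 classes.
probs = classifier_head([1.0, 2.0], U=[[1.0, 0.0], [0.0, 1.0]], b=[0.0, 0.0])
```

With these weights the logits are (1, 2), so the second class receives probability 1 / (1 + e^{-1}).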

Noise Transition Matrix Module: Unlike the annotator-specific confusion matrix, which characterizes class-level probability transition, we propose an annotator-specific noise transition matrix that characterizes instance-level probability transition by incorporating instance features into the confusion matrix. A major challenge is how to construct the noise transition matrix. Naturally, this involves two problems. The first is how to quantify the impact of instance features on annotators’ performance. The second is how to incorporate this impact into the confusion matrix to construct the noise transition matrix.

The first problem amounts to specifying the mapping function from instance features to the annotators’ performance. To this end, we construct an instance impact matrix parameterized by \({\varvec{v}}^{(r)}\) to characterize the impact of instance features on annotators’ performance. More concretely, for each annotator we add a linear layer with \(C^2\) units on top of the instance features in the classifier module. Recall that the instance features are extracted by the non-linear function \(h({\mathbf {x}}_n)\), which is the output of the penultimate layer of the classifier. This linear layer, called the instance impact matrix layer, is parallel to the softmax output layer of the classifier module (shared by the classifier module and the noise transition matrix module) and has only weights and no bias parameters, formalized as

$$\begin{aligned} f({\mathbf {x}}_n)^{(r)} = {\varvec{v}}^{(r)} h({\mathbf {x}}_n), \end{aligned}$$
(8)

where \({\varvec{v}}^{(r)}\) denotes the weights of the instance impact matrix layer for annotator r, and the output \(f({\mathbf {x}}_n)^{(r)}\) has \(C^2\) entries.
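A minimal sketch of the instance impact matrix layer of Eq. (8) is given below. The function name `instance_impact_matrix`, the representation of the weights as a list of \(C^2\) rows, and the row-major reshape into a C × C matrix are assumptions made for illustration.

```python
def instance_impact_matrix(features, V, num_classes):
    """Eq. (8): bias-free linear layer with C^2 outputs, reshaped to C x C.

    features : h(x_n), the penultimate-layer output of the classifier
    V        : C^2 weight rows, one per output unit (assumed layout)
    """
    flat = [sum(w * h for w, h in zip(row, features)) for row in V]
    # Row-major reshape so that entry (i, j) pairs with pi_ij in Eq. (9).
    return [flat[i * num_classes:(i + 1) * num_classes]
            for i in range(num_classes)]

# Toy values (hypothetical): 2-dimensional features, C = 2.
features = [1.0, -1.0]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
impact = instance_impact_matrix(features, V, num_classes=2)

# All-zero weights give an all-zero impact matrix for any input.
zero_impact = instance_impact_matrix(features, [[0.0, 0.0]] * 4, num_classes=2)
```

Because the layer has no bias, zero weights produce a zero matrix regardless of the instance, which is what makes a zero initialization equivalent to using the confusion matrix alone.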

Having obtained the instance impact matrix, the other crucial point is how to construct the noise transition matrix. To solve this issue, we make the intuitive assumption that the noise transition matrix results from the sum of the instance impact matrix \(f({\mathbf {x}}_n)^{(r)}\) and the confusion matrix \(\varvec{\Pi }^{(r)}\). We discuss other variants of constructing noise transition matrices in Sect. 5.5. For the sake of computational feasibility, we convert the output of the instance impact matrix layer from vector form to matrix form with the same shape as the confusion matrix and then add the two together, followed by a softmax operation, yielding

$$\begin{aligned} p(y_n^{(r)}=j \mid t_n=i, {\mathbf {x}}_n, \varvec{\Pi }^{(r)}, {\varvec{v}}^{(r)}) = \frac{\exp (f({\mathbf {x}}_n)_{ij}^{(r)} + \pi _{ij}^{(r)})}{\sum _{k=1}^{C} \exp ( f({\mathbf {x}}_n)_{ik}^{(r)} + \pi _{ik}^{(r)})}. \end{aligned}$$
(9)

Note that in our method \(\pi\) denotes the parameters of the confusion matrix, which are actually logits, i.e., values before the softmax operation in the neural network.
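Eq. (9) can be read as a row-wise softmax over the sum of the instance impact matrix and the confusion logits. The sketch below is illustrative; the names `softmax` and `noise_transition_matrix` and the toy values are assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def noise_transition_matrix(impact, confusion_logits):
    """Eq. (9): row i is softmax_j(f(x_n)_ij + pi_ij) for one annotator."""
    return [softmax([f + p for f, p in zip(f_row, pi_row)])
            for f_row, pi_row in zip(impact, confusion_logits)]

# Confusion logits chosen so that their softmax is [[0.8, 0.2], [0.3, 0.7]].
confusion_logits = [[math.log(0.8), math.log(0.2)],
                    [math.log(0.3), math.log(0.7)]]

# With a zero impact matrix the result is the class-level confusion matrix.
T_class = noise_transition_matrix([[0.0, 0.0], [0.0, 0.0]], confusion_logits)

# A hypothetical instance that pushes true class 0 towards label 1.
T_inst = noise_transition_matrix([[0.0, math.log(4.0)], [0.0, 0.0]],
                                 confusion_logits)
```

In the second call, the added logit of log 4 raises the unnormalized weight of entry (0, 1) from 0.2 to 0.8, so the first row becomes (0.5, 0.5) while the second row is unchanged.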

We propose the LFC-\({\mathbf {x}}\) that integrates a classifier module and a noise transition matrix module into a unified neural network in an end-to-end manner for jointly estimating true labels, abilities of annotators, and learning a classifier.

We take a closer look at the optimization objective of LFC-\({\mathbf {x}}\) in Eq. (6). The penultimate layer (extracted instance features) of the classifier module is shared among all annotators and becomes an “information hub” that connects the noise transition matrix of each annotator. Given a loss function \({\mathcal {L}}(p(y_n^{(r)} \mid {\mathbf {x}}_{n}, \Theta ^{(r)}), y_n^{(r)})\), such as the commonly used cross-entropy loss between the model outputs and the crowdsourced labels, minimizing the negative log-likelihood encourages the outputs of LFC-\({\mathbf {x}}\) to be as close as possible to the observed crowdsourced labels. In doing so, we can perform back-propagation end-to-end to update the parameters of the classifier module and the noise transition matrix module. In addition, labels missing from some annotators can be handled by setting the corresponding gradients to 0.
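The objective of Eq. (6), including the treatment of missing labels, might be sketched as below. Representing a missing label by `None` and skipping its term (so that it contributes no loss and therefore no gradient) is an assumption about how "setting their gradients to 0" could be realized; the function name and toy values are hypothetical.

```python
import math

def negative_log_likelihood(class_probs, transitions, labels):
    """Negative of Eq. (6).

    class_probs[n][i]       : p(t_n = i | x_n, w)
    transitions[n][r][i][j] : p(y_n^(r) = j | t_n = i, x_n, ...), as in Eq. (9)
    labels[n][r]            : observed class index, or None if annotator r
                              did not label instance n
    """
    loss = 0.0
    for probs, per_annotator, ys in zip(class_probs, transitions, labels):
        for T, y in zip(per_annotator, ys):
            if y is None:
                continue  # missing label: no loss term, hence zero gradient
            # Marginalize the latent true label: sum_i p(t=i|x) * T[i][y].
            p_y = sum(p_i * T[i][y] for i, p_i in enumerate(probs))
            loss -= math.log(p_y)
    return loss

# Toy data (hypothetical): one instance, two annotators, two classes.
class_probs = [[0.8, 0.2]]
transitions = [[
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]]
loss_partial = negative_log_likelihood(class_probs, transitions, [[0, None]])
loss_full = negative_log_likelihood(class_probs, transitions, [[0, 1]])
```

For this instance the first annotator's predicted probability of label 0 is 0.8·0.9 + 0.2·0.2 = 0.76, and the second annotator's predicted probability of label 1 is 0.8·0.3 + 0.2·0.6 = 0.36.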

4.2 Training procedure

There are degrees of freedom in the outputs of a classifier. In other words, the outputs of classifier may not semantically correspond to the true labels even if the negative log-likelihood function is minimized (Sukhbaatar et al., 2014). Therefore, a reasonable initialization of noise transition matrix is crucial for successful convergence of the LFC-\({\mathbf {x}}\) for training a high-quality classifier. As for the confusion matrix, the diagonal element corresponds to the probability of correctly labeling a certain class. In this paper, we assume that there is no malicious annotator. We initialize the confusion matrix so that it has relatively large diagonal elements (i.e., \(\pi _{i i}>\pi _{i j} \text{ for } \forall i, j \ne i\)), and small symmetric noise in off-diagonal elements, i.e.,

$$\begin{aligned} \pi _{i j}^{(r)}=\log (\epsilon ^{{\mathbb {I}}[i=j]} \times (\frac{1-\epsilon }{C-1})^{(1-{\mathbb {I}}[i=j])}), \end{aligned}$$
(10)

in which \(\epsilon\) is set to 0.46 for all datasets; this value was chosen via a grid search over the range [0.4, 0.7], in line with the ability of real annotators. The parameters of the instance impact matrix layer are initially set to 0.
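The initialization of Eq. (10) can be written directly. The sketch below is illustrative; the function name `init_confusion_logits` is an assumption, and the choice of 8 classes in the usage lines is only an example.

```python
import math

def init_confusion_logits(num_classes, epsilon=0.46):
    """Eq. (10): log(epsilon) on the diagonal, log((1 - epsilon)/(C - 1)) elsewhere."""
    off_diagonal = (1.0 - epsilon) / (num_classes - 1)
    return [[math.log(epsilon if i == j else off_diagonal)
             for j in range(num_classes)]
            for i in range(num_classes)]

# Example with a hypothetical 8-class problem.
logits = init_confusion_logits(num_classes=8)
row_probs = [math.exp(v) for v in logits[0]]
```

Exponentiating a row recovers probabilities that already sum to one (0.46 on the diagonal and 0.54/7 elsewhere in this example), so the softmax of Eq. (9) with a zero instance impact matrix leaves them unchanged. Note that with \(\epsilon = 0.46\) the diagonal entry exceeds the off-diagonal entries only when \(C \ge 3\), since for \(C = 2\) the off-diagonal value is 0.54.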

Let us finally describe the training procedure, which consists of two stages, illustrated in Fig. 3. The first stage updates the confusion matrices. In detail, we freeze the instance impact matrix layer so that the noise transition matrix degenerates into the confusion matrix, since the instance impact matrix is fixed to 0. Then, we train LFC-\({\mathbf {x}}\) to update the confusion matrices. The second stage updates the noise transition matrices. Concretely, we unfreeze the instance impact matrix layer and retrain LFC-\({\mathbf {x}}\) while keeping the learned confusion matrices fixed. Once LFC-\({\mathbf {x}}\) is trained, the classifier module can be used separately to make predictions for unseen instances.
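The freeze/unfreeze schedule of the two stages can be illustrated with a toy update rule. This is only a sketch of one reading of the text: the group names, the plain SGD step, the assumption that the classifier is updated in both stages, and the dummy gradients are all hypothetical.

```python
def trainable_groups(stage):
    """Which parameter groups receive updates in each stage (as read from the text)."""
    if stage == 1:
        # Instance impact layer frozen at 0; confusion matrices are learned.
        return {"classifier": True, "confusion": True, "impact": False}
    if stage == 2:
        # Instance impact layer unfrozen; learned confusion matrices kept fixed.
        return {"classifier": True, "confusion": False, "impact": True}
    raise ValueError("stage must be 1 or 2")

def sgd_step(params, grads, trainable, lr=0.1):
    """Plain SGD on lists of floats; frozen groups are copied unchanged."""
    return {
        name: [p - lr * g for p, g in zip(values, grads[name])]
        if trainable[name] else list(values)
        for name, values in params.items()
    }

# Dummy parameters and gradients, one scalar per group.
params = {"classifier": [0.5], "confusion": [-0.8], "impact": [0.0]}
grads = {"classifier": [1.0], "confusion": [1.0], "impact": [1.0]}

after_stage1 = sgd_step(params, grads, trainable_groups(1))
after_stage2 = sgd_step(after_stage1, grads, trainable_groups(2))
```

After the first step the impact parameter is still 0, and after the second step the confusion parameter keeps the value it reached in stage one while the impact parameter has moved.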

Fig. 3

Training procedure (see text for details)

4.3 Minimax error analysis

Motivated by previous theoretical works (Imamura et al., 2018; Gao et al., 2016), here we analyze the minimax error of our LFC-\({\mathbf {x}}\). The error rate can be measured by \({\mathcal {L}}(\hat{{\mathbf {T}}},{\mathbf {T}})=\frac{1}{N} \sum _{n=1}^{N} {\mathbb {I}}[{\hat{t}}_{n} \ne t_{n}]\), in which \(\hat{{\mathbf {T}}}\) is the collection of estimated true labels of all instances. For brevity, we use \(\zeta (n)^{r}\) to denote the instance impact matrix \(f({\mathbf {x}}_n)^{(r)}\) of annotator r acting on instance \({\mathbf {x}}_n\). Denote by \(\rho ^n=\{\rho ^n_i\}_{i=1}^C\) the instance-specific class probability distribution of the model output. Given the instances, the crowdsourced labels, and \(\Theta\) representing the collection \(\{\{\varvec{\Pi }^{(r)}\}_{r=1}^{R}, \{{\varvec{v}}^{(r)}\}_{r=1}^{R}\}\), we bound the minimax error rate with respect to LFC-\({\mathbf {x}}\) as follows.

Theorem 1

The minimax error rate of our method is lower bounded by

$$\begin{aligned} \inf _{\hat{{\mathbf {T}}}} \sup _{{\mathbf {T}} \in [C]^{N}} {\mathbb {E}}\left[ {\mathcal {L}}(\hat{{\mathbf {T}}}, {\mathbf {T}})\right] \ge \frac{1}{N^{2} \log C} \sum _{n=1}^{N} F(\rho ^{n}, \Theta ) - \frac{\log 2}{N^{2} \log C}, \end{aligned}$$
(11)

where

$$\begin{aligned} \begin{aligned}&F(\rho ^n, \Theta )=H(\rho ^n) \\&\quad - \sum _{r=1}^R\sum _{i=1}^C\sum _{j=1}^{C}\rho _{i}^n\rho _{j}^nKL \left( (\zeta (n)^{r}_{i*}+\pi ^r_{i*})||(\zeta (n)^{r}_{j*}+\pi ^r_{j*})\right) \end{aligned} \end{aligned}$$
(12)

where \(H\left( \rho ^{n}\right) =-\sum _{c=1}^{C} \rho _{c}^n \log \rho _{c}^n\) is the entropy of the class probability distribution, and \(\zeta (n)^r_{i*}\) and \(\pi ^r_{i*}\) denote the i-th rows of the instance impact matrix and the confusion matrix, respectively.
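To make the quantities in Eqs. (11) and (12) tangible, the sketch below evaluates \(F\) and the resulting lower bound on toy inputs. It assumes that each row \(\zeta (n)^{r}_{i*}+\pi ^r_{i*}\) has already been normalized into a strictly positive probability distribution (for instance by the row-wise softmax of Eq. (9)), which the text does not state explicitly; all function names and numbers are hypothetical.

```python
import math

def kl(p, q):
    """KL divergence between two strictly positive discrete distributions."""
    return sum(a * math.log(a / b) for a, b in zip(p, q))

def F(rho, row_sets):
    """Eq. (12) for one instance.

    rho            : class probability distribution of the model output
    row_sets[r][i] : row i of annotator r's transition matrix for this
                     instance, assumed to be a normalized distribution
    """
    entropy = -sum(p * math.log(p) for p in rho if p > 0.0)
    penalty = 0.0
    for rows in row_sets:
        for i, rho_i in enumerate(rho):
            for j, rho_j in enumerate(rho):
                penalty += rho_i * rho_j * kl(rows[i], rows[j])
    return entropy - penalty

def lower_bound(F_values, num_classes):
    """Right-hand side of Eq. (11) for N = len(F_values) instances."""
    n = len(F_values)
    denom = n * n * math.log(num_classes)
    return (sum(F_values) - math.log(2.0)) / denom

rho = [0.5, 0.5]
uninformative = [[[0.5, 0.5], [0.5, 0.5]]]  # one annotator, identical rows
informative = [[[0.9, 0.1], [0.1, 0.9]]]    # one annotator, distinct rows

F_uninformative = F(rho, uninformative)
F_informative = F(rho, informative)
bound = lower_bound([F_uninformative, F_uninformative], num_classes=2)
```

When the rows are identical, every KL term vanishes and \(F\) equals the entropy log 2, giving a bound of 0.25 for two such instances. The more the rows of a transition matrix differ, the larger the KL terms and the smaller the bound, which can become negative and hence trivial.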

The proof of minimax error rate can be found in previous theoretical analysis (Imamura et al., 2018). The noise transition matrix is decomposed into the sum of the class-level confusion matrix and the instance impact matrix in our method.

Remark 1

Since \(\inf _{\hat{{\mathbf {T}}}} \sup _{{\mathbf {T}} \in [C]^{N}} {\mathbb {E}}[{\mathcal {L}}(\hat{{\mathbf {T}}}, {\mathbf {T}})]\) contains the infimum over the estimate \({\mathbf{\hat{T}}}\), we provide the lower bound of the minimax error rate to analyze the behavior of the model itself, which does not depend on the classifier module that estimates true labels. Theorem 1 sheds light on how LFC-\({\mathbf {x}}\) reduces the error rate through the interaction of the confusion matrix and the instance impact matrix included in the noise transition matrix. As an illustration of the advantage of our method, suppose one annotator prefers to label class “highway” as “street” due to class confusion (Jin et al., 2020b) and his/her own bias, and some instances of class “highway” mislead the annotator into labeling them as “forest”. If the instance impact matrix is not considered in Theorem 1, as in Imamura et al. (2018), the KL divergence between classes “street” and “forest” in the confusion matrix becomes smaller across all instances because their entries on “highway” are close. Theorem 1 suggests that incorrect labels given by annotators under the influence of instance features can be captured by the instance impact matrix, ensuring that the KL divergence between classes “street” and “forest” in the confusion matrix across other instances belonging to the same class does not become smaller. We can observe that considering the instance impact matrix has the potential to reduce the error rate over all instances. Furthermore, the two-stage training procedure introduced in Sect. 4.2 encourages the confusion matrix and the instance impact matrix to be learned separately without interfering with each other.

4.4 Relations among the noise transition matrix and the quality factors of crowdsourced labels and singly-labeled data

Existing methods usually model the factors that influence the quality of crowdsourced labels, including the ability of annotators and the difficulty of instances. For example, the Crowd Layer (Rodrigues & Pereira, 2018) considers only the class-level confusion matrix for characterizing the ability of each annotator. Whitehill et al. (2009) propose the generalized DS model involving the difficulty of instances, in which the difficulty of an instance is modeled implicitly, through access to the crowdsourced labels only. In this paper, we propose to construct the noise transition matrix by incorporating the impact of instance features into the confusion matrix for modeling annotators’ performance across instance features, which can be regarded as a fusion of instance difficulty and annotator ability. Specifically, the difficulty of an instance is modeled by explicitly utilizing instance features.

Considering the impact of instance features is an important issue in learning with noisy singly-labeled data and has received increasing attention recently (Yao et al., 2021; Chen et al., 2020a; Zhang et al., 2021; Zhu et al., 2021; Liu, 2021). However, the impact of instance features on the labeling process of multiple annotators in learning from crowds has not been effectively addressed. Crowdsourced labels differ from the singly-labeled scenario in that (1) each instance may receive multiple annotations from multiple annotators, and the information of the annotators is required; (2) learning from crowds needs to consider the aggregation process of multiple annotators when modeling noisy annotations. Some works (Jiang et al., 2021; Berthon et al., 2021) in the singly-labeled scenario rely on the confusion matrix to model the label noise statistically. Although confusion matrix-based methods possess theoretical guarantees, it is difficult to estimate the confusion matrix for each instance under instance-dependent noise. To ease the estimation, some unrealistic assumptions have to be imposed on the confusion matrix, including a class-level confusion matrix (Liu & Guo, 2020; Li et al., 2021), a symmetric confusion matrix (Menon et al., 2018), an upper-bounded noise rate (Cheng et al., 2020), and part-dependent label noise (Xia et al., 2020). In addition, some works (Jiang et al., 2021; Berthon et al., 2021) consider the impact of the instance on the confusion matrix by re-weighting or correcting the loss term of the instance according to the confidence score of the noisy labels. Unlike these works, our method assumes that the instance-level noise transition matrix can be obtained by a learnable instance impact matrix acting on the class-level confusion matrix.
To the best of our knowledge, the two works most relevant to our method (Goldberger & Ben-Reuven, 2017; Yan et al., 2010) consider that the class-level confusion matrix is affected by the instance features. They directly use a nonlinear mapping from instance features to the instance-level confusion matrix, which makes the class-level and instance-level probability transitions indistinguishable. In contrast, our method adopts a two-stage training procedure that captures both class-level and instance-level transition probability information without confusing the two, thus yielding more stable and superior results.

5 Experiments

We begin by investigating and discussing the behavior of neural-based LFC methods during training in the presence of noisy crowdsourced labels. Afterward, we evaluate the proposed LFC-\({\mathbf {x}}\) by comparing it with representative LFC baselines on both synthetic and real datasets. Moreover, we test our method combined with robust loss functions against noisy crowdsourced labels. Finally, we show that our method is flexible enough to be applied to the noisy singly-labeled scenario. All methods are implemented using the Keras framework.

5.1 Datasets

Synthetic datasets: Most existing public crowdsourcing datasets do not contain instance feature information and are therefore not suitable for the LFC scenario. Following previous works (Yan et al., 2010; Rodrigues & Pereira, 2018), we simulate several annotators to provide crowdsourced labels based on CIFAR-10 and four UCI datasets, where the UCI datasets are from the UC Irvine machine learning repository (Dua & Graff, 2017). Table 1 provides detailed information on the datasets used, which include binary and multiclass classification tasks and represent a wide range of domains and data characteristics.

To create reliable and rational synthetic datasets, we generate two types of synthetic crowdsourced labels, i.e., uniform labels and clustering-based labels, which reflect the overall ability of an annotator and the personal bias on instances with similar features, respectively. To generate uniform labels, we randomly flip a correct label to one of the other incorrect labels uniformly and refer to the proportion of incorrect labels as the ability of the annotator. We simulate five annotators with different abilities varying from 0.3 to 0.7, i.e., \(p\in \{0.3, 0.4, 0.5, 0.6, 0.7\}\). To generate clustering-based labels, we apply data clustering to each dataset, following previous works (Yan et al., 2010; Zhong et al., 2017) that consider instance features in LFC. The procedure is as follows. Firstly, we simulate five annotators and perform k-means clustering on the training data to group the instances into five clusters. Then, we assume that the r-th of the five annotators is good at labeling instances belonging to the r-th cluster, where the annotations coincide with the true labels; meanwhile, the r-th annotator correctly labels the remaining instances, which belong to other clusters, with probability \(p^r \sim U(0.2,0.3)\) (U denotes the uniform distribution). In doing so, we obtain the crowdsourced labels. In practice, each annotator labels only a small subset of all instances, so we introduce a probability \(\eta \sim U(0.5,0.7)\) for each annotator to decide whether to label an instance in the label generation process.
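
The label generation procedure described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the authors' code. The function names, the `MISSING` marker, and the random stand-in for the k-means cluster assignments are hypothetical. The text does not say how a wrong label is chosen for clustering-based labels, so the sketch assumes it is drawn uniformly from the other classes.

```python
import random

MISSING = -1  # hypothetical marker: the annotator did not label this instance


def uniform_labels(true_labels, num_classes, flip_prob, rng):
    """Flip each true label to a uniformly chosen other class with probability flip_prob."""
    noisy = []
    for t in true_labels:
        if rng.random() < flip_prob:
            noisy.append(rng.choice([c for c in range(num_classes) if c != t]))
        else:
            noisy.append(t)
    return noisy


def clustering_labels(true_labels, cluster_ids, annotator, num_classes, rng):
    """Annotator r is always correct inside cluster r and correct with
    probability p^r ~ U(0.2, 0.3) on instances from other clusters."""
    p_correct = rng.uniform(0.2, 0.3)
    noisy = []
    for t, k in zip(true_labels, cluster_ids):
        if k == annotator or rng.random() < p_correct:
            noisy.append(t)
        else:
            # Assumption: a wrong label is uniform over the other classes.
            noisy.append(rng.choice([c for c in range(num_classes) if c != t]))
    return noisy


def sparsify(labels, rng):
    """Keep each label with probability eta ~ U(0.5, 0.7), else mark it missing."""
    eta = rng.uniform(0.5, 0.7)
    return [y if rng.random() < eta else MISSING for y in labels]


rng = random.Random(0)
true_labels = [rng.randrange(10) for _ in range(1000)]
# Stand-in for k-means output; real cluster ids would come from the features.
cluster_ids = [rng.randrange(5) for _ in range(1000)]

crowd_uniform = [
    sparsify(uniform_labels(true_labels, 10, p, rng), rng)
    for p in (0.3, 0.4, 0.5, 0.6, 0.7)
]
crowd_cluster = [
    sparsify(clustering_labels(true_labels, cluster_ids, r, 10, rng), rng)
    for r in range(5)
]
```

Each row of `crowd_uniform` and `crowd_cluster` holds one simulated annotator's labels, with `MISSING` wherever that annotator skipped the instance.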

Real datasets: LabelMe dataset (Rodrigues et al., 2017) is an image classification dataset involving eight classes including “highway”, “inside city”, “tall building”, “street”, “forest”, “coast”, “mountain” and “open country”, and it contains 2,688 images in total. Among them, 1,000 images are annotated by AMT annotators and each is annotated by 2.547 annotators on average. The remaining 1,688 images are used for testing.

Sentiment Polarity dataset (Pang & Lee, 2005) is a textual sentiment analysis dataset containing 10,428 sentences from movie review snippets on Rotten Tomatoes. Rodrigues et al. (2013) provided crowdsourced labels for this dataset on the AMT platform. 4,999 sentences of this dataset are annotated by AMT workers with the sentiment polarity “positive” or “negative”, and each is labeled by an average of 136.68 annotators. The remaining 5,429 sentences are used for testing. Further, Rodrigues et al. (2013) also provided a feature-vector version of the text dataset by applying latent semantic analysis to bag-of-words feature vectors.

Table 1 Characteristics of Datasets

5.2 Understanding the training process of neural-based LFC

Fig. 4 The test accuracy vs. number of epochs on synthetic CIFAR-10. Left: uniform labels. Right: clustering-based labels

Since few works analyze the learning process of neural-based LFC in the presence of noisy crowdsourced labels, we present and analyze the behavior of the proposed LFC-\({\mathbf {x}}\) and of existing one-stage and two-stage neural-based LFC methods on the synthetic CIFAR-10 dataset.

Competing strategies: The two-stage approaches with neural networks include: (1) NN-MV: a neural network classifier trained with the labels inferred by majority voting; (2) NN-DS: a neural network classifier trained with the labels inferred by the DS model (Dawid & Skene, 1979). The one-stage LFC approach is Crowd Layer (Rodrigues & Pereira, 2018), which trains deep neural networks end-to-end directly from the crowdsourced labels using only back-propagation. Note that the Crowd Layer can be seen as a degenerate case of LFC-\({\mathbf {x}}\) in which the instance impact matrix layer is frozen and its parameters are fixed to 0. The comparison of our LFC-\({\mathbf {x}}\) with Crowd Layer can therefore be seen as an ablation study to test the efficacy of the instance impact matrix.
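
For reference, the aggregation step of the NN-MV baseline can be sketched as plain majority voting over the available annotations. This is a minimal sketch, not the authors' implementation. The data layout, the `missing` marker, and the tie-breaking rule (smallest class index wins) are assumptions made for illustration.

```python
from collections import Counter


def majority_vote(crowd_labels, missing=-1):
    """Aggregate crowd_labels[r][n] (annotator r, instance n) by majority vote.

    Entries equal to `missing` are ignored. Ties are broken in favour of the
    smallest class index, which is an arbitrary choice made for this sketch.
    """
    num_instances = len(crowd_labels[0])
    aggregated = []
    for n in range(num_instances):
        votes = Counter(row[n] for row in crowd_labels if row[n] != missing)
        if not votes:
            aggregated.append(missing)  # nobody labeled this instance
            continue
        top = max(votes.values())
        aggregated.append(min(label for label, count in votes.items() if count == top))
    return aggregated


crowd = [
    [0, 1, -1],
    [0, 2, -1],
    [1, 2, -1],
]
estimated = majority_vote(crowd)
```

The resulting `estimated` labels are then treated as fixed targets when training the classifier, which is what distinguishes the two-stage baselines from the one-stage methods.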

Experimental setup: The experiments are conducted on CIFAR-10 with two types of noisy crowdsourced labels. We use a 5-layer CNN architecture combined with ReLU activation and max-pooling, followed by two fully connected layers, to build a classifier, which is a standard test bed (Laine & Aila, 2016; Han et al., 2018b) for CIFAR-10. Unless otherwise specified, for a fair comparison, all compared methods use the same network architecture with the Cross Entropy loss. The model is trained using SGD with a momentum of 0.9, a weight decay of \(10^{-4}\), and an initial learning rate of 0.01. The learning rate is divided by 10 after epochs 40 and 80. The batch size is set to 1024.
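
The step schedule described above can be written as a small function. This is a sketch under the assumption that epochs are counted from 1, so that epochs 1 to 40 use 0.01, epochs 41 to 80 use 0.001, and later epochs use 0.0001. The function name and signature are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, milestones=(40, 80), factor=0.1):
    """Return the learning rate for a 1-indexed epoch under a step schedule.

    The rate is multiplied by `factor` once for every milestone that has
    already been completed, i.e. every milestone m with epoch > m.
    """
    drops = sum(1 for m in milestones if epoch > m)
    return base_lr * factor ** drops


schedule = {epoch: learning_rate(epoch) for epoch in (1, 40, 41, 80, 81)}
```

A function of this shape can be passed to a framework-level scheduler, after adjusting for the fact that some frameworks count epochs from 0.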

In Fig. 4, we report test accuracy vs. number of epochs on uniform labels and clustering-based labels respectively, from which the observations are as follows.

The one-stage approaches Crowd Layer and LFC-\({\mathbf {x}}\) exhibit relatively robust performance on both uniform labels and clustering-based labels as the number of epochs increases. This indicates the potential advantage of jointly learning the classifier and estimating the abilities of annotators in combating label noise. More importantly, LFC-\({\mathbf {x}}\) performs the best and provides a modest but consistent improvement over Crowd Layer, demonstrating the advantage of the noise transition matrix over the confusion matrix in designing the LFC method.

Unlike the one-stage approaches, which learn directly from crowdsourced labels, the two-stage approaches first infer fixed labels by answer aggregation and then train a neural network classifier on them. We observe that the two-stage approaches NN-MV and NN-DS behave differently under different types of crowdsourced labels. On uniform labels, the neural networks in two-stage approaches learn generalizable “patterns” in the early training epochs before overfitting occurs (i.e., before performance declines). In contrast, two-stage approaches are relatively robust to clustering-based labels, probably because clustering-based labels, which are corrupted non-uniformly, are more complicated than uniform labels, which makes it difficult for neural networks to automatically capture meaningful “patterns”.

5.3 Evaluation on synthetic and real datasets

Table 2 Comparison of the LFC-\({\mathbf {x}}\) to other LFC baselines on test accuracy (%)

Next, to further verify the effectiveness of proposed LFC-\({\mathbf {x}}\), we conduct comprehensive experiments on synthetic UCI datasets with two types of noisy crowdsourced labels and two real crowdsourcing datasets. Moreover, the comparative LFC baselines are not limited to neural network-based LFC approaches, and also include EM-based LFC approaches.

Competing strategies: Apart from the neural-based LFC baselines, we also compare LFC-\({\mathbf {x}}\) to EM-based LFC methods, including: (1) Raykar (Raykar et al., 2009): a maximum-likelihood estimator that jointly learns the classifier, the confusion matrices, and the underlying true labels based on the EM algorithm; (2) AggNet (Albarqouni et al., 2016): a generalized version of the Raykar method, in which the classifier is a neural network. In addition, we compare with Max-MIG (Cao et al., 2019), which jointly estimates a neural classifier and a label aggregation network using an information-theoretical loss function.

Experimental setup: For each UCI dataset, we retain 90% of the data as the training set and use the remaining 10% as the test set. Since the dimension of these datasets is relatively small, we employ a two-layer neural network with 32 units in each layer, followed by a softmax output layer, to build a classifier. For the LabelMe dataset, we use a pre-trained VGG-16 as the backbone network and replace the last fully connected layer with a task-specific fully connected layer. For the Sentiment Polarity dataset, we use a two-layer neural network with 500 and 128 units and one softmax output layer on top to build a classifier. We choose ReLU as the activation function and use dropout with parameter 0.5 for all datasets. The batch size is set to 256 and we run 400 epochs. Jiang et al. (2020) point out that early stopping is not always effective under label noise, although some previous studies report only the best results. Therefore, we report not only the best test score during training but also the test score of the last epoch, to assess the robustness of the methods. Each experiment is repeated ten times and we report the mean test accuracy and standard deviation.
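
The reporting protocol (best and last-epoch test accuracy, averaged over repeated runs) can be sketched as follows. This is an illustrative helper, not the authors' evaluation script. The name `summarize_runs` and the input layout are hypothetical, and the sketch uses the sample standard deviation because the text does not state which estimator is used.

```python
from statistics import mean, stdev


def summarize_runs(accuracy_curves):
    """Summarize repeated runs.

    accuracy_curves[i][e] is the test accuracy of run i after epoch e.
    Returns (mean, standard deviation) of the best and of the last-epoch
    accuracy across runs. Assumption: sample standard deviation (n - 1).
    """
    best = [max(curve) for curve in accuracy_curves]
    last = [curve[-1] for curve in accuracy_curves]
    return {
        "best": (mean(best), stdev(best)),
        "last": (mean(last), stdev(last)),
    }


# Toy curves for two runs of three epochs each (made-up numbers).
summary = summarize_runs([[0.5, 0.8, 0.7], [0.6, 0.9, 0.7]])
```

A large gap between the "best" and "last" entries signals that a method overfits the noisy labels late in training, which is the robustness property the last-epoch score is meant to expose.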

Table 2 summarizes the comparisons of LFC-\({\mathbf {x}}\) with other LFC approaches. We have the following observations. Firstly, for the uniform labels shown in Table 2a, LFC-\({\mathbf {x}}\) exhibits competitive results and surpasses the competing methods in most cases. Secondly, Table 2b shows that LFC-\({\mathbf {x}}\) consistently outperforms its competitors by a clear margin across all datasets under the clustering-based labels. Finally, Table 2c presents the test accuracy on the real crowdsourcing datasets. LFC-\({\mathbf {x}}\) is superior to all the compared LFC methods, which empirically demonstrates that we provide a realistic and applicable noise transition matrix for designing an LFC framework. Moreover, LFC-\({\mathbf {x}}\) is very robust, with no large drop between the best score and the last score on almost all datasets.

5.4 Evaluation on combination with robust loss functions

We model the labeling process of multiple annotators for learning from noisy crowdsourced labels, which is also a case of weakly supervised learning (Karimi et al., 2020; Song et al., 2020). Our method is orthogonal to many label-denoising techniques such as instance re-weighting (Jiang et al., 2018) and robust loss functions (Zhang & Sabuncu, 2018; Wang et al., 2019; Patrini et al., 2017), which can be used to enhance our method. We evaluate combinations of LFC-\({\mathbf {x}}\) with two robust loss functions, Symmetric Learning (SL) (Wang et al., 2019) and Generalized Cross Entropy (GCE) (Zhang & Sabuncu, 2018), called “LFC-\({\mathbf {x}}\) (SL)” and “LFC-\({\mathbf {x}}\) (GCE)” respectively. Each combination is readily implemented by replacing the commonly used Cross Entropy loss with the robust loss function. We also report experimental results for the combination of the Crowd Layer with robust loss functions. Table 3 reports the test accuracy of the compared methods on synthetic datasets with uniform labels and on the two real crowdsourcing datasets. We observe that robust loss functions maintain the effectiveness of our method and further boost the performance of LFC-\({\mathbf {x}}\) and Crowd Layer in many cases. For example, the best test accuracy of LFC-\({\mathbf {x}}\) (GCE) reaches 86.75% on the LabelMe dataset. Furthermore, LFC-\({\mathbf {x}}\) combined with robust loss functions performs better than the Crowd Layer combined with robust loss functions, which demonstrates the advantage of the noise transition matrix compared to the confusion matrix.
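
To make the loss replacement concrete, the two robust losses can be sketched for a single prediction and a single (possibly noisy) label. This is a pure-Python illustration of the published formulas, not the code used in the experiments. The hyperparameter values `q=0.7`, `alpha=1.0`, `beta=1.0`, and `log_zero=-4.0` are assumptions chosen for the example; this section does not report the values actually used.

```python
import math


def cross_entropy(probs, label, eps=1e-7):
    """Standard Cross Entropy for one example; probs is a predicted distribution."""
    return -math.log(max(probs[label], eps))


def generalized_cross_entropy(probs, label, q=0.7):
    """GCE (Zhang & Sabuncu, 2018): (1 - p_y^q) / q with q in (0, 1].

    q = 1 gives a loss proportional to mean absolute error, and q -> 0
    recovers Cross Entropy.
    """
    return (1.0 - probs[label] ** q) / q


def symmetric_cross_entropy(probs, label, alpha=1.0, beta=1.0, log_zero=-4.0):
    """SL (Wang et al., 2019): alpha * CE + beta * reverse CE.

    For a one-hot label, reverse CE is -sum_k p_k * log(q_k), where log(0)
    is replaced by the constant `log_zero` and log(1) = 0, which simplifies
    to -log_zero * (1 - p_y).
    """
    reverse_ce = -log_zero * (1.0 - probs[label])
    return alpha * cross_entropy(probs, label) + beta * reverse_ce


prediction = [0.2, 0.8]
gce_value = generalized_cross_entropy(prediction, 1)
sl_value = symmetric_cross_entropy(prediction, 1)
```

In an LFC setting, `probs` would be the model's predicted distribution over an annotator's label and `label` the label that annotator actually gave, so the loss is swapped in wherever Cross Entropy was applied.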

Table 3 Test accuracy of comparative methods combined with robust loss functions on synthetic and real datasets

5.5 Ablation study and analysis

As previously mentioned, the comparison of our LFC-\({\mathbf {x}}\) with Crowd Layer can be seen as an ablation study to test the efficacy of the instance impact matrix in learning from crowds. The experimental comparisons in Fig. 4 and Table 2 demonstrate the advantage of the noise transition matrix over the confusion matrix in designing the LFC method. Moreover, we compare the confusion matrices learned by our algorithm and by Crowd Layer on the LabelMe dataset, which reflects the difference between the noise pattern estimated for annotators and the real noise pattern. Since the instance-level noise transition matrix cannot be visualized on the overall training data, we compare only the confusion matrices learned by LFC-\({\mathbf {x}}\) and Crowd Layer with the true confusion matrix on the LabelMe dataset. Fig. 5 shows the comparison of the confusion matrices for four annotators, where a higher color intensity indicates a larger value, demonstrating that the proposed LFC-\({\mathbf {x}}\) can model the labeling process of multiple annotators.
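
One plausible way to obtain a reference confusion matrix for an annotator is to count, for each true class, how often the annotator chose each label, and then normalize each row. The sketch below illustrates that computation. It is an assumption about how such a matrix could be built from ground-truth labels, not a description of how the matrices in Fig. 5 were produced, and the function name and `missing` marker are hypothetical.

```python
def empirical_confusion(true_labels, annotator_labels, num_classes, missing=-1):
    """Row-normalized confusion matrix of one annotator.

    Entry [j][l] estimates the probability that the annotator answers l
    when the true class is j. Instances the annotator skipped (`missing`)
    are ignored, and rows without any observation are left as zeros.
    """
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, y in zip(true_labels, annotator_labels):
        if y != missing:
            counts[t][y] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


reference = empirical_confusion([0, 0, 1, 1], [0, 1, 1, -1], num_classes=2)
```

A learned confusion matrix can then be compared with `reference` entry by entry, or visually as a heat map.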

Fig. 5 Comparison between ground truth confusion matrices and learned ones by Crowd Layer and LFC-\({\mathbf {x}}\) on LabelMe dataset

Although LFC-\({\mathbf {x}}\) is proposed to learn a classifier from crowdsourced labels, it can also be applied to a more general weakly supervised learning scenario, i.e., the singly-labeled scenario, where each instance is labeled by only one annotator and the information of the annotator is not considered. Note that many LFC methods cannot be applied directly in the singly-labeled scenario. To further verify the effectiveness of accounting for instance features in the singly-labeled scenario, we conduct an ablation study comparing our LFC-\({\mathbf {x}}\) with LFC-0, in which the instance impact matrix layer of LFC-\({\mathbf {x}}\) is frozen or removed. In addition, we compare with a traditional baseline, Noisy Classifier, which trains a classifier directly on the noisy singly-labeled dataset.

Fig. 6 Results with varying percentage of noisy labels

Experimental setup: We conduct experiments on a synthetic singly-labeled dataset in a controlled noise setting, to evaluate the performance of the methods under different noise levels and noise patterns. The synthetic dataset is generated by injecting noisy labels into the CIFAR-10 dataset. Unlike the two types of synthetic noise in the previous section, we generate non-uniform labels of the kind commonly used in the singly-labeled scenario, where true labels are transformed to noisy labels to varying degrees according to a predefined noise pattern (Reed et al., 2014). The noise pattern is a transition between similar classes (visually or semantically). In the CIFAR-10 dataset, we transform the class “aircraft” to class “bird” (class 0 to class 2) and the class “deer” to class “horse” (class 4 to class 7), with different probabilities \(p\in \{0.10,0.12, \ldots , 0.58, 0.60\}\).
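
The noise injection described above amounts to flipping labels of two source classes to their paired target classes with probability p. A minimal sketch follows, using only the standard library. The names `NOISE_PATTERN` and `inject_pair_noise` are hypothetical, and the toy label list stands in for the CIFAR-10 training labels.

```python
import random

# Source class -> target class, as described in the text:
# class 0 ("aircraft") -> class 2 ("bird"), class 4 ("deer") -> class 7 ("horse").
NOISE_PATTERN = {0: 2, 4: 7}


def inject_pair_noise(labels, flip_prob, rng, pattern=NOISE_PATTERN):
    """Flip each label in `pattern` to its paired class with probability flip_prob."""
    return [
        pattern[y] if y in pattern and rng.random() < flip_prob else y
        for y in labels
    ]


# Noise levels 0.10, 0.12, ..., 0.60 (26 values).
noise_levels = [round(0.10 + 0.02 * i, 2) for i in range(26)]

rng = random.Random(0)
clean = [0, 1, 4, 7, 2]  # toy labels; real use would pass the CIFAR-10 training labels
noisy_by_level = {p: inject_pair_noise(clean, p, rng) for p in noise_levels}
```

Labels of all other classes are left untouched, so the overall fraction of corrupted labels is lower than p.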

As shown in Fig. 6, when the percentage of noise increases, the classification accuracy of LFC-\({\mathbf {x}}\) decreases more slowly than that of LFC-0 and Noisy Classifier, which illustrates that LFC-\({\mathbf {x}}\) can better deal with noisy labels. More importantly, LFC-\({\mathbf {x}}\) significantly outperforms LFC-0 since it considers the impact of instance features to construct the noise transition matrix.

Variants of the LFC-\({\mathbf {x}}\): We construct the noise transition matrix through an addition operation between the instance impact matrix and the confusion matrix, and adopt a two-stage training procedure that learns the confusion matrix first and then the noise transition matrix. In addition, we consider two variants regarding the construction of the noise transition matrix: (1) LFC-\({\mathbf {x}}\) (addition w/ one stage): the noise transition matrix is constructed in the same way as in LFC-\({\mathbf {x}}\), but the training procedure is one-stage. Namely, this variant is equivalent to learning the noise transition matrix directly from instance features regardless of the class-level transition probability. It is worth mentioning that LFC-\({\mathbf {x}}\) (addition w/ one stage) can be regarded as an end-to-end neural network version of the method by Yan et al. (2010). (2) LFC-\({\mathbf {x}}\) (dot product): the noise transition matrix is obtained by the dot product between the instance impact matrix and the confusion matrix, where the parameters of the instance impact matrix are initialized from a normal distribution. Experiments on the LabelMe dataset are run 10 times and the average test accuracy is reported in Table 4. The experimental results show that the two-stage addition operation between the confusion matrix and the instance impact matrix yields better results than the other variants in the construction of the noise transition matrix.
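
The addition-based construction can be illustrated with a small numerical sketch. It is not the authors' implementation. It assumes that both matrices are held as unnormalized scores and that the sum is normalized with a row-wise softmax, a detail this section does not specify. All function names are hypothetical. The sketch also shows the degenerate case in which a zero instance impact matrix leaves only the class-level confusion matrix.

```python
import math


def row_softmax(matrix):
    """Normalize each row of a score matrix into a probability distribution."""
    out = []
    for row in matrix:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out


def noise_transition(confusion_scores, impact_scores):
    """Addition variant: elementwise sum of class-level and instance-level scores.

    Assumption: the sum is turned into transition probabilities by a
    row-wise softmax.
    """
    summed = [
        [c + d for c, d in zip(c_row, d_row)]
        for c_row, d_row in zip(confusion_scores, impact_scores)
    ]
    return row_softmax(summed)


def annotator_distribution(class_probs, transition):
    """Predicted distribution of an annotator's label: sum_j p(t=j|x) * T[j][l]."""
    num_labels = len(transition[0])
    return [
        sum(class_probs[j] * transition[j][l] for j in range(len(class_probs)))
        for l in range(num_labels)
    ]


confusion_scores = [[2.0, 0.0], [0.5, 1.5]]   # class-level, one annotator
impact_scores = [[-1.0, 1.0], [0.0, 0.0]]     # instance-specific correction
zero_impact = [[0.0, 0.0], [0.0, 0.0]]

instance_level = noise_transition(confusion_scores, impact_scores)
class_level = noise_transition(confusion_scores, zero_impact)
predicted = annotator_distribution([0.9, 0.1], instance_level)
```

With `zero_impact`, the result equals the softmax of the confusion scores alone, which mirrors the statement that fixing the instance impact matrix to 0 reduces the model to a class-level one.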

Table 4 Comparison of different variants regarding the construction of noise transition matrix on LabelMe dataset. The test accuracy of the best during training (left) and the last epoch (right) are listed

6 Conclusion and future work

In this paper, we propose to learn a classifier from crowdsourced labels provided by multiple annotators. Specifically, we first construct the noise transition matrix by incorporating instance features into the confusion matrix. Furthermore, we propose LFC-\({\mathbf {x}}\), which integrates a classifier module and a noise transition matrix module into a unified neural network trained in an end-to-end manner. Extensive experiments show the advantages of LFC-\({\mathbf {x}}\), confirming the effectiveness of the noise transition matrix compared to the class-level confusion matrix. In addition, our approach can be combined with other techniques to further improve performance. For example, label-denoising techniques (Song et al., 2020) such as symmetric learning can be applied within the LFC-\({\mathbf {x}}\) framework. We also verify the effectiveness of accounting for instance features in the general noisy singly-labeled scenario. In the future, we plan to extend LFC-\({\mathbf {x}}\) to other types of labels, e.g., learning from crowdsourced sequence annotation (Lan et al., 2019) and multi-object bounding box annotation (Acuna et al., 2019).