1 Introduction

Traditionally, supervised machine learning relies on useful feature representations and hand-labeled data. With deep learning techniques, useful feature representations can be learned easily [1]. However, supervised deep learning cannot function without sufficient labeled data [2]. Moreover, the requirements for data labels usually evolve rapidly as applications change [3]. These changes can involve labeling guidelines, labeling granularity [4], application scenarios and so on. What’s more, most training data samples are still labeled manually, which may be extremely expensive and time-consuming [3]. Thus, there is an urgent need for an efficient method to label training data automatically, especially for short text classification. In addition, data sparsity remains a key challenge for short text classification [4], and real-world text classification is usually imbalanced. That is, short text classification usually faces insufficient labeled data, data sparsity and imbalanced classification simultaneously.

To address insufficient labeled data, data sparsity and imbalanced classification in short text classification as a whole, multiple weak supervision [1, 5] was proposed, in which a conditional independent model was introduced to generate probabilistic labels that are as accurate as possible. Specifically, to label short text data automatically, three kinds of weak supervision sources (keywords matching, regular expressions and distant supervision clustering) were introduced. Notably, keywords matching and regular expressions were used to represent explicit knowledge, while distant supervision clustering was specially designed to represent tacit knowledge.

In particular, distant supervision clustering is proposed for the first time in this paper. It can be divided into three steps. The first step is to specify the similarity threshold, which is the criterion of distant supervision clustering. The second step is to calculate the similarity between each sample point and the knowledge base. The third step is to compare the calculated similarity with the similarity threshold. If the calculated similarity is no less than the similarity threshold, the sample point is given the same label as the corpus. Otherwise, the sample point gets the abstain label. The similarity threshold therefore plays a key role in distant supervision clustering. However, since this paper focuses on proposing the multiple weak supervision framework, the similarity threshold is not studied in depth; its impact and other related questions are left to follow-up work.

2 Related work

As stated in the first chapter, short text classification in real scenes usually faces three major challenges: insufficient labeled data, data sparsity and imbalanced classification. Few studies address the labels bottleneck (insufficient labeled data), data sparsity and imbalanced classification together; existing research usually focuses on solving only one of these problems. Therefore, the related work on the labels bottleneck, data sparsity and imbalanced classification is introduced one by one as follows.

2.1 Labels bottleneck

To solve the labels bottleneck (insufficient labeled data) in natural language processing, there are two main solutions: weak supervision and fine tuning. Weak supervision is committed to expanding the scale of labeled data at the data level. By contrast, fine tuning aims to provide an initialization model that is as good as possible, so as to reduce the amount of labeled data required.

(1) Weak Supervision

There are many attempts to label training data in a programmatic way. Generally speaking, these labeling approaches, which generate noisier weak labels based on domain knowledge [1], are referred to as weak supervision. Taking text classification for example, if a text contains any one of certain keywords, it can be assigned to a specific category. Distant supervision, the most popular approach, gets weak labels by aligning the data points with an external knowledge base [7,8,9]. In addition, crowdsourcing labels [10, 11], heuristic rules for labeling data [12, 13] and others [14,15,16,17] are also common sources of weak supervision. That is, weak supervision sources mainly comprise distant supervision [18,19,20], crowdsourcing [10, 11] and heuristic rules [12, 13].

Distant supervision is mainly used for relation extraction [8, 21, 22]. The main idea is to align the sample points with the records in an external database [19]. For example, distant supervision can be used to extract the spouse relation by aligning the sample points with the spouse records of an external knowledge base [1] (such as DBpedia [23] and Wikipedia [22]). Obviously, the external knowledge base needs to have a relatively strong correlation with the target task. However, such a highly relevant knowledge base is usually scarce, which hinders the wider application of distant supervision.

Crowdsourcing, also called human computation [11, 24], is the process that a number of non-experts collectively perform a labeling task [25]. The explosive growth and widespread accessibility of the Internet have led to the surge of crowdsourcing [11]. Crowdsourcing has been widely used in labeling tasks of machine learning, which require a lot of human computation but little domain knowledge. These areas include image and video annotation [26,27,28], named entity annotation [11], relevance evaluation [29], natural language annotation [30,31,32] and others [11, 33]. Crowdsourcing can quickly generate a large number of data labels, but the quality of data labels is relatively poor.

Heuristic rules for labeling data are usually written by users or domain experts [3]. Due to the diverse quality of heuristic rules, the accuracy and correlation of labels might fluctuate widely [13, 34]. Therefore, the efficiency of rules-based labeling strategy depends on the quality of heuristic rules [35]. In view of this, the heuristic rules (or domain knowledge) from domain experts are essential for high quality labels.

However, any single kind of weak supervision is weak and limited, because one kind of weak supervision alone is not sufficient to generate a large number of high-quality data labels. In light of this, to alleviate the shortage of labels, multiple weak supervision was introduced for labeling short text data. Specifically, according to the characteristics of short text classification, we combine keywords matching, regular expressions and distant supervision clustering to label short texts and train the classifier.

(2) Fine Tuning

In natural language processing, the available labeled data is usually too scarce to learn good enough model parameters. Based on this, fine tuning was proposed to reduce the amount of labeled data needed for parameter learning. In short, the pre-training model can provide a good parameter initialization for tasks with insufficient labeled data. Starting from this initialization, model training only needs to fine tune the parameters to approach the optimal solution. For this reason, fine tuning is usually done with a small amount of labeled data.

In conclusion, the pre-training model directly determines the quality of parameter initialization. At present, the pre-training models for text mainly include ELMo [36], GPT (Generative Pre-Training) [37], BERT (Bidirectional Encoder Representation from Transformers) [38], XLNet [39], ZEN [40], ERNIE (Enhanced Language Representation with Generative Entities) [41], etc. In particular, BERT and ERNIE have attracted a lot of attention and have given rise to variants such as RoBERTa [42] and ERNIE2.0 [43].

However, training a pre-training model requires a large amount of computing resources. For example, the training of the BERT model [39], in the Google 64 TPU computing environment, still lasted for nearly 4 days. In addition, as time goes on, fixed pre-training models are prone to problems such as “concept drift” and even a lack of generalization ability. Last but not least, fine tuning relies on strong labeled data, which cannot be provided by weak supervision. Therefore, the pre-training model is not very suitable for short text classification with multiple weak supervision.

2.2 Data sparsity

With the growth of instant messaging on the mobile Internet, the proliferation of short texts highlights the challenges of data sparsity and misspelling (informal writing) [54, 58], which limit the application of machine learning in short text classification. To address these problems, two types of solutions were proposed: feature strategy and algorithm strategy (Table 1 [46]). Notably, in feature selection, the measure of the filter-based approach can be chi-squared (CHI2) [76], information gain (IG) [77, 78], correlation coefficient (CC) [79], accuracy balanced (Acc2) [80], pointwise mutual information (PMI) [61, 81], odds ratio (OR) [82] or multi-class odds ratio (MOR) [69].

Table 1 Solution of data sparsity [46]

Undoubtedly, both feature strategy and algorithm strategy work well for supervised learning with large-scale labeled data. However, none of them considers the case of data sparsity under weak supervision learning. Moreover, even data augmentation cannot really address the simultaneous challenges of insufficient labeled data, data sparsity and imbalanced classification. In particular, data augmentation also brings uncontrollable semantic changes, which further increase the difficulty of classification. Similarly, distributed representations, such as word2vec and GloVe, are difficult to incorporate directly into the multiple weak supervision framework due to their high computational overhead and dependence on strong labeled datasets.

For the sake of simplicity, only N-gram is taken as an example in the experiments. Accordingly, for short text classification with weak supervision, N-gram (feature representation) and Logistic Regression (algorithm) were introduced to address data sparsity and misspelling, and both were embedded into the proposed multiple weak supervision framework. Such a design favors simplicity and practicality. The curse of dimensionality that N-gram may cause is not ruled out in this paper; the related ablation study is left to future research. After all, this article focuses on proposing a solution framework for the classification of unlabeled short texts.
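As a small illustration of why N-gram features tolerate misspelling, the sketch below compares character trigram sets using only the standard library. The function names `char_ngrams` and `jaccard`, the example strings, and the choice of character-level trigrams are assumptions made for illustration; the text does not state whether word-level or character-level N-grams are used.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams of a lower-cased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap between two n-gram sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


correct = char_ngrams("tender announcement")
misspelled = char_ngrams("tender anouncement")   # one letter dropped
unrelated = char_ngrams("football results")

# A misspelled title still shares most trigrams with the correct one,
# whereas an unrelated title shares none.
overlap_typo = jaccard(correct, misspelled)
overlap_other = jaccard(correct, unrelated)
```

Because most trigrams survive a single typo, a linear model such as Logistic Regression trained on these features still receives much of the same signal for the misspelled title.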

2.3 Imbalanced classification

Imbalanced classification is a hotspot in data mining, machine learning and pattern recognition. There are several top-level conferences devoted to discussing and studying imbalanced classification problem, such as ICML 2003 [70], ACM SIGKDD2004 [71] and IJCAI 2017 [72]. In short, there are mainly four factors influencing the imbalanced classification problem: 1) the scale of the training set; 2) category priority; 3) the misclassification costs of different categories; 4) the location of the boundary.

In general, imbalanced classification has two major research directions: data strategy and algorithm strategy. By changing the distribution of original dataset, the data strategy increases the minority samples (over-sampling) [73,74,75] or decreases the majority samples (under-sampling) [76,77,78], so that the imbalanced data tends to balance. This strategy is favored by many researchers because of its advantages in improving the classification performance and being suitable for various classifiers [79]. Although there are more studies on over-sampling than under-sampling, it is still difficult to give a conclusion that over-sampling is better than under-sampling. Therefore, some studies also put forward the mixed sampling method, that is, the method of balancing the training set by synthesizing over-sampling and under-sampling [80].

By contrast, the algorithm strategy mainly makes the classifier focus more on minority classes by means of weighting, voting, iteration and so on. Specifically, common methods include cost-sensitive learning and ensemble learning. Cost-sensitive learning was put forward to focus on the minority classes in imbalanced classification. It mainly increases the misclassification cost of minority classes with a cost-sensitive factor [81]. That is, learning parameters are adjusted to highlight the importance of minority classes. These parameters mainly include data space weighting, the class-dependent cost matrix, and the ROC (receiver operating characteristic curve) threshold. In addition, ensemble learning is also favored [82]. The basic idea [67, 83] is to train a series of base classifiers and then improve the classification accuracy through integration. Bagging, Boosting and Random Forest are the most commonly used ensemble methods. There are two main reasons why there is less research on algorithm strategy than on data strategy. First, the cost matrix is very difficult to determine; second, the cost sensitivity depends on the classifier [81, 84]. As a result, researchers tend to integrate the algorithm strategy into classification research with a specific background rather than treat it as a research point of its own. Consequently, the algorithm strategy is difficult to generalize, and the cost of applying it widely is very high. Based on this, a resolution mechanism based on probabilistic labels generated from the conditional independent model was put forward to handle imbalanced classification.

On the one hand, the data strategy easily destroys the original distribution and requires carefully chosen sampling methods. On the other hand, the algorithm strategy is hard to generalize and costly to apply widely. Motivated by this, a resolution mechanism based on probabilistic labels generated from the conditional independent model was put forward to handle imbalanced classification under weak supervision.

To sum up, none of the existing methods can easily address label shortage, data sparsity and imbalanced classification simultaneously. In other words, there is hardly any effective overall solution for the three challenges. In light of this, an overall methodology based on multiple weak supervision and probabilistic labels was proposed and is elaborated in chapter 5.

3 Domain knowledge in weak supervision

In order to select proper weak supervision combination, dynamic theory was chosen as the guidance [85]. According to [85], domain knowledge can be divided into explicit knowledge and tacit knowledge. Corresponding to weak supervision, the relation between domain knowledge and weak supervision sources was shown in Fig. 1.

Fig. 1 Domain knowledge represented by weak supervision

As shown in Fig. 1, explicit knowledge can be represented by heuristic rules, while tacit knowledge involves distant supervision and crowdsourcing labels. Inspired by this, to combine both explicit knowledge and tacit knowledge [85], we adopt three types of weak supervision sources: simple keywords matching, regular expressions, and distant supervision clustering. Correspondingly, these three types can be boiled down to two categories: heuristic rules and distant supervision clustering, which correspond to explicit knowledge and tacit knowledge respectively.

3.1 Explicit knowledge (heuristic rules)

In order to represent explicit knowledge, two types of heuristic rules were designed to label data automatically. Specifically, simple keywords matching as well as regular expressions were adopted as explicit knowledge sources.

Combining keyword matching with regular expressions, nearly all explicit knowledge for text classification can be represented easily. However, tacit knowledge is hard to represent. Furthermore, it is prohibitively hard to get high recall score with the limited coverage of heuristic rules. In view of this, distant supervision clustering was proposed to represent tacit knowledge and improve recall score.

3.2 Tacit knowledge (distant supervision clustering)

As shown in Table 2, explicit knowledge, though hard to quantify, can be represented formally by heuristic rules. On the contrary, tacit knowledge is easy to quantify but difficult to represent explicitly. In view of this, distant supervision clustering, a novel weak supervision strategy, was proposed to represent tacit knowledge.

Table 2 Representation difference between explicit and tacit knowledge

Notably, distant supervision clustering was inspired by distant supervision, a popular weak supervision source that can be regarded as one of the semi-supervised learning methods. Instead of the alignment strategy of distant supervision, distant supervision clustering gets weak labels based on the cluster assumption. Specifically, the cluster assumption states that the data has a cluster structure and that samples in the same cluster belong to the same category. This is consistent with the clustering hypothesis of semi-supervised learning [4, 6].

figure a

As shown in Algorithm 1, distant supervision clustering can be divided into 3 steps: Determining Threshold, Calculating Similarity, and Assigning Labels.

In particular, the similarity threshold is the maximum similarity between the small-scale labeled dataset and the external corpus. The similarity threshold plays an important role in distant supervision clustering and tacit knowledge representation. On the one hand, a proper threshold ensures the quality (accuracy) of the labels from distant supervision clustering. On the other hand, if the threshold is small enough, the vast majority of samples in the corresponding category will receive labels from it, which means a very high recall score. Most importantly, with distant supervision clustering, tacit knowledge, which is hard to represent formally by heuristic rules, can be represented by a quantitative method. However, since this paper focuses on proposing the weak supervision framework, the similarity threshold is not studied in depth; its impact and other related questions are left to follow-up work.
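The three steps (Determining Threshold, Calculating Similarity, Assigning Labels) can be sketched as follows. This is a minimal illustration rather than the paper's Algorithm 1: the similarity measure (`token_jaccard`, a Jaccard overlap of word sets), the encoding `ABSTAIN = -1`, the function name, and the toy corpus are all assumptions, and the threshold is simply passed in as a parameter instead of being derived from Dev.

```python
ABSTAIN = -1  # hypothetical encoding of the abstain label


def token_jaccard(a, b):
    """Assumed similarity measure: Jaccard overlap of lower-cased word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def distant_supervision_clustering(samples, corpus, corpus_label, threshold):
    """Give a sample the corpus label if it is similar enough to the corpus."""
    labels = []
    for text in samples:
        # Step 2: similarity between the sample and the knowledge base,
        # taken here as the best match over all corpus entries.
        similarity = max((token_jaccard(text, doc) for doc in corpus), default=0.0)
        # Step 3: compare with the threshold fixed in step 1.
        labels.append(corpus_label if similarity >= threshold else ABSTAIN)
    return labels


corpus = ["bridge construction tender notice", "road repair tender announcement"]
samples = ["tender notice for bridge construction", "local team wins the final"]
weak_labels = distant_supervision_clustering(
    samples, corpus, corpus_label=1, threshold=0.5
)
```

Lowering `threshold` in this sketch lets more samples receive the corpus label (higher recall) at the price of more wrong labels, which mirrors the trade-off described above.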

In this way, explicit knowledge can be represented by heuristic rules (simple keywords matching and regular expressions), while tacit knowledge can be captured by distant supervision clustering. Thus, weak labels with adequate coverage and quality can be obtained for the training data and applied to short text classification by machine learning. In the next chapter, the labels integration mechanism and the probabilistic labels suited to the imbalance problem will be introduced in detail.

In particular, LDA (Latent Dirichlet Allocation) [86] was brought in for explicit knowledge (keywords pattern) extraction. LDA is a generative probabilistic model of a corpus. In LDA (Fig. 2), documents are represented as random mixtures over latent topics, while each topic is characterized by a distribution over words. A Dirichlet distribution serves as the prior of the topic distribution parameters. Notably, compared with the common TF-IDF and TextRank models [87], LDA is more suitable for short text classification. Moreover, LDA can also better meet the background constraints, such as data sparsity, simplicity and limited space. Additionally, in the case of multiple weak supervision, the performance comparison of different keyword extraction algorithms is left to future research.

Fig. 2 Graphical Model of LDA (Latent Dirichlet Allocation)

Taking binary classification for example, with LDA and prior (explicit) knowledge, we can get some keywords closely related to the positive and negative classes. With these keywords, we can quickly assign some data points to a category. Specifically, a small-scale labeled dataset (e.g. Dev in Chapter 6) can be used to build the LDA model and extract the keywords of a specific class. Nevertheless, matching single keywords is not always sufficient because of the flexibility and diversity of natural language expressions. Thus, regular expressions were adopted to accommodate more complex expressions. For example, “check.*out” matches “check” and “out” with any character other than the newline character occurring 0 or more times between them.

4 Probabilistic labels for imbalanced classification
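The two heuristic rule types can be sketched as small labeling functions built on Python's standard `re` module. The names `keyword_rule` and `regex_rule`, the label encodings and the example titles are hypothetical; the pattern uses `.*` to express "any character except a newline, zero or more times" between the two words.

```python
import re

ABSTAIN = -1   # hypothetical encoding of the abstain label
POSITIVE = 1   # hypothetical encoding of the positive class


def keyword_rule(text, keywords, label):
    """Return `label` if any keyword occurs in the text, else ABSTAIN."""
    lowered = text.lower()
    return label if any(k in lowered for k in keywords) else ABSTAIN


def regex_rule(text, pattern, label):
    """Return `label` if the regular expression matches anywhere in the text."""
    return label if re.search(pattern, text, flags=re.IGNORECASE) else ABSTAIN


pattern = r"check.*out"
titles = ["Please check it out today", "Checkout counter opens", "Check the schedule"]

regex_labels = [regex_rule(t, pattern, POSITIVE) for t in titles]
keyword_labels = [keyword_rule(t, ["counter", "schedule"], POSITIVE) for t in titles]
```

Each rule either votes for its class or abstains, so stacking the outputs of several rules column by column yields one row of weak labels per text.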

With the traditional method, the data label \( y_{i} \) of binary classification is usually in the following format:

\( y_{i} \in Y=\left \lbrace -1,+1 \right \rbrace , i=1,2, \cdots ,n; \) where -1 and +1 correspond to the negative class and the positive class respectively. Based on this, the labels can also be formally represented as a labels matrix \( L_{n \times 2} \):

$$ L_{n \times 2} = \left[ \begin{array}{cc} y_{11} & y_{12} \\ y_{21} & y_{22} \\ \vdots & \vdots \\ y_{n1} & y_{n2} \end{array} \right] $$
(1)

where each row i corresponds to one piece of data, and each column j corresponds to a category; \( y_{ij} \in Y^{\prime }= \left \lbrace 0,1 \right \rbrace \) indicates whether sample i belongs to category j or not; each row has exactly one value of 1. More generally, the labels of the k-classification problem (\( k \ge 2, k \in \mathbb{Z} \)) form an n × k matrix \( L_{n \times k} \):

$$ L_{n \times k} = \left[ \begin{array}{cccc} y_{11} & y_{12} & \cdots & y_{1k} \\ y_{21} & y_{22} & \cdots & y_{2k} \\ \vdots & \vdots & y_{ij} & \vdots \\ y_{n1} & y_{n2} & \cdots & y_{nk} \end{array} \right] $$
(2)

where each row i corresponds to one piece of data, and each column j corresponds to a category; \( y_{ij} \in Y^{\prime }=\left \lbrace 0,1\right \rbrace \) indicates whether sample i belongs to category j or not; each row has exactly one value of 1.

Even though the labels matrix \( L_{n \times k} \) has n × k elements, only n of them are non-zero. In fact, the sparsity of the labels matrix is rooted in the “black or white” nature of discrete labels. By contrast, labels from weak supervision tend to be gray, or probabilistic, rather than discrete. Therefore, compared with discrete labels, probabilistic labels are more suitable for representing labels from weak supervision. According to [76], imbalanced classification refers to different sample sizes in different categories, where the category refers to the discrete labels. In other words, imbalanced classification is so named because of the imbalanced distribution of discrete labels among categories. That is, imbalanced classification may be alleviated by replacing discrete labels with probabilistic labels. For illustration, take five data labels of binary classification. The discrete labels may be [[0, 1], [0, 1], [0, 1], [1, 0], [0, 1]], while the probabilistic labels might be [[0.2, 0.8], [0.4, 0.6], [0.5, 0.5], [0.7, 0.3], [0.1, 0.9]]. Generally, the imbalance ratio (IR) [88], the ratio between the number of majority class instances and the number of minority class instances, is used to measure the degree of class imbalance. Accordingly, the imbalance ratio of the discrete labels [[0, 1], [0, 1], [0, 1], [1, 0], [0, 1]] can be calculated by (1 + 1 + 1 + 0 + 1) / (0 + 0 + 0 + 1 + 0), which equals 4. Similarly, the imbalance ratio of [[0.2, 0.8], [0.4, 0.6], [0.5, 0.5], [0.7, 0.3], [0.1, 0.9]] is 31/19 (about 1.63), calculated by (0.8 + 0.6 + 0.5 + 0.3 + 0.9) / (0.2 + 0.4 + 0.5 + 0.7 + 0.1). Obviously, 31/19 is smaller than 4, which means that, for the same data, data with probabilistic labels is less imbalanced than data with discrete labels.

In view of this, probabilistic weak labels may provide a novel solution to imbalanced classification. Take the binary classification problem as an example. If the weak label vector of a certain data point is [0.7, 0.3], the data point belongs to the first and second categories with probabilities of 0.7 and 0.3, respectively. In this way, the weak label vector of most data has a probability component in each category, so the imbalance among categories is reduced. Thus, probabilistic labels of multiple weak supervision were proposed and tested, which can be formally represented as (2), too. But different from (2), in probabilistic labels \( 0 \le y_{ij} \le 1 \); \( y_{ij} \) is the probability that the i-th sample belongs to category j; and for each row i, \( {\sum }_{j=1}^{k}{y_{ij}}=1 \).

Notably, the introduction of probabilistic labels also increases the noise, which may hurt training performance. However, the probabilistic labels here are generated from multiple weak supervision. That is to say, to some extent, the quality of the probabilistic labels is supported by the multiple weak supervision framework and the conditional independent model, which makes them quite different from random noise. In this sense, the probabilistic labels in this paper implicitly balance noise against imbalance by means of the proposed framework, and they come with this premise and quality assurance. Probabilistic labels in the general sense do not belong to the research scope of this paper. The tradeoff between imbalanced classification and noise will be examined carefully, theoretically or empirically, in future work, since a more general and concrete study requires a paper of its own.
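The imbalance ratio of both label matrices can be checked with a few lines of Python. The helper name `imbalance_ratio` is an assumption, and the ratio is taken here as the largest column sum divided by the smallest, which reduces to majority count over minority count for discrete labels. Exact fractions are used so that the result is not blurred by floating-point rounding.

```python
from fractions import Fraction


def imbalance_ratio(label_matrix):
    """Largest per-class mass divided by the smallest (column sums).

    Raises ZeroDivisionError if some class receives no mass at all.
    """
    column_sums = [
        sum(Fraction(str(value)) for value in column)
        for column in zip(*label_matrix)
    ]
    return max(column_sums) / min(column_sums)


discrete = [[0, 1], [0, 1], [0, 1], [1, 0], [0, 1]]
probabilistic = [[0.2, 0.8], [0.4, 0.6], [0.5, 0.5], [0.7, 0.3], [0.1, 0.9]]

ir_discrete = imbalance_ratio(discrete)            # 4 / 1
ir_probabilistic = imbalance_ratio(probabilistic)  # 3.1 / 1.9
```

For these five samples the discrete labels put mass 4 on the second class and 1 on the first, while the probabilistic labels put 3.1 and 1.9, so the ratio drops from 4 to 31/19.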

Taking one step further, a bridge from multiple weak supervision to probabilistic labels is needed, which is referred to as the labels integration mechanism. One natural choice is the simple arithmetic mean (SAM). With m weak supervision sources, each sample i generates a label vector \( L_{i} \):

$$ L_{i} = \left[ \begin{array}{cccc} l_{i1} & l_{i2} & \cdots & l_{im} \end{array} \right] $$
(3)

where \( l_{ij} \) denotes the label from weak supervision source j and \( l_{ij} \in \left \lbrace 1,\cdots , k \right \rbrace \); k denotes the number of classes. Based on SAM, the probabilistic label vector \( Y_{i} \) can be generated:

$$ Y_{i} = \left[ \begin{array}{cccc} y_{i1} & y_{i2} & \cdots & y_{ik} \end{array} \right] $$
(4)

where \( Y_{i} \) is the row for the i-th piece of data, and each column j corresponds to a category; \( 0 \le y_{ij} \le 1 \); \( y_{ij} \) is the probability that the i-th sample belongs to category j; and for each row i, \( {\sum }_{j=1}^{k}{y_{ij}}=1 \). Specifically, the arithmetic mean algorithm is shown in Algorithm 2.

figure b
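A minimal sketch of simple arithmetic mean integration is given below; it is one possible reading of SAM, not a transcription of Algorithm 2. Two choices are assumptions: abstain is encoded as 0 (classes are 1 to k), and a sample on which every source abstains receives a uniform distribution.

```python
ABSTAIN = 0  # hypothetical encoding; class labels are 1..k


def simple_arithmetic_mean(label_vector, k):
    """Turn the m weak labels of one sample into a length-k probability vector.

    Each non-abstaining source contributes one vote; the probability of a
    class is its share of the votes.
    """
    votes = [label for label in label_vector if label != ABSTAIN]
    if not votes:
        # Assumption: with no votes at all, fall back to a uniform vector.
        return [1.0 / k] * k
    return [votes.count(c) / len(votes) for c in range(1, k + 1)]


# Three sources voted (classes 1, 2, 2) and one abstained.
y_i = simple_arithmetic_mean([1, 2, 2, ABSTAIN], k=2)
```

SAM treats every source as equally reliable, which is the limitation that weighted integration is meant to address.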

In fact, the multiple weak labels integration based on the conditional independent model is a weighted average label integration, so the integration reduces to determining the weights of the different weak supervision patterns. To show how these weights are determined, this paper takes the “double counting” correlation as an example. Suppose m weak supervision patterns are applied to the unlabeled samples. When an unlabeled sample satisfies a specific weak supervision pattern, it gets a weak label; otherwise it gets the abstain label. Therefore, in order to model the double counting correlation, it is necessary to ensure that the label is not abstain. Accordingly, two indicators need to be defined: whether a sample is labeled, and whether a label is double counted.

Using the above definitions, the label matrix obtained from the m weak supervision patterns is abbreviated as \( L_{n \times m} \) and written as \( {\varLambda } \) in the formulas below. The indicator of whether sample i is labeled by pattern j is

$$ \phi _{i,j}^{label}({\varLambda } ,Y) = 1\{ {{\varLambda }_{i,j}} \ne Abstain\} $$

and the indicator of whether patterns j and k double count on sample i is

$$ \phi_{i,j,k}^{correlation}({\varLambda} ,Y) = 1\{ {{\varLambda}_{i,j}} = {{\varLambda}_{i,k}}\} ,1 \le j \le k \le m $$

Accordingly, for a sample with m weak supervision patterns, stacking these indicators over all patterns and over the set C of possible double-counting pairs gives the vector \( \phi_{i}({\varLambda},Y) \). With the corresponding weight parameter vector \({\text {w}} \in {{\text {R}}^{2m + \left | C \right |}}\), the following conditional independent model can be obtained.

$$ {{\text{p}}_{w}}({\varLambda} ,Y) = \frac{{{e^{\sum\limits_{i = 1}^{n} {{w^{T}}{\phi_{i}}({\varLambda} ,{y_{i}})} }}}}{{{Z_{w}}}} $$

where \( Z_{w} \) is the normalizing constant. Furthermore, given only the label matrix \( {\varLambda } \) and no true label vector Y, the weight parameter vector is learned by minimizing the following negative log marginal likelihood objective function.

$$ \hat w = \mathop {\arg \min }\limits_{w} - \log \sum\limits_{Y} {{p_{w}}({\varLambda} ,Y)} $$

In this way, based on the above objective function and stochastic gradient descent, the weight parameter vector w can be learned. Then the discrete label matrix \( L_{n \times m} \) can be transformed into a more accurate probabilistic label matrix \( L_{n \times k} \).
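The two indicator factors and the exponent term \( w^{T}\phi_{i} \) can be made concrete with a small sketch. It covers only the labeling and double-counting indicators defined above, not the full weight vector or the learning step. The encoding `ABSTAIN = 0`, the function names, the example pair set `pairs` (standing in for C) and the weights are all hypothetical; following the remark that a double-counted label must not be abstain, the correlation indicator here also requires that both patterns voted.

```python
ABSTAIN = 0  # hypothetical encoding; class labels are 1..k


def label_factors(row):
    """Labeling indicator for one sample: 1 where the pattern did not abstain."""
    return [1 if label != ABSTAIN else 0 for label in row]


def correlation_factors(row, pairs):
    """Double-counting indicator: 1 where both patterns of a pair voted
    (did not abstain) and produced the same label."""
    return [
        1 if row[j] != ABSTAIN and row[j] == row[k] else 0
        for j, k in pairs
    ]


def score(row, pairs, w_label, w_corr):
    """Weighted sum of the two factor types for one sample."""
    phi = label_factors(row) + correlation_factors(row, pairs)
    weights = list(w_label) + list(w_corr)
    return sum(w * f for w, f in zip(weights, phi))


row = [1, 1, ABSTAIN, 2]   # labels from m = 4 patterns for one sample
pairs = [(0, 1), (2, 3)]   # assumed pairs of possibly correlated patterns
phi_label = label_factors(row)
phi_corr = correlation_factors(row, pairs)
s = score(row, pairs, w_label=[0.5] * 4, w_corr=[1.0, 1.0])
```

In the model above, such scores are summed over samples, exponentiated and normalized by \( Z_{w} \); learning adjusts the weights so that the observed label matrix becomes likely.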

5 Methodology

As shown in Algorithm 3 and Fig. 3, the process of short text classification with multiple weak supervision mainly has five steps: (1) Knowledge Extraction, (2) Data Labeling, (3) Labels Integration, (4) Model Training and (5) Model Evaluation. Importantly, the heuristic rule mechanism is domain-independent, and regular expressions can be applied to text classification in any domain.

figure c
(1) Knowledge Extraction. Knowledge extraction here refers to keywords extraction and similarity threshold calculation. Both should be based on small-scale labeled data (Dev), which has ground-truth labels. With Dev and LDA (Latent Dirichlet Allocation), keywords related to a specific class (topic) can be extracted effectively. Moreover, Dev is also the reference for screening the distant supervision corpus. This step is carried out from the perspective of weak supervision and is dedicated to extracting the domain knowledge necessary to produce weak labels for the data. Notably, LDA [86] is applied here to knowledge extraction.

(2) Data Labeling. Data labeling formally represents the extracted knowledge and then assigns labels to unlabeled data item by item. To ensure the quality of labels, each weak supervision source assigns a data point exactly one label if the data point satisfies its pattern; otherwise, the data point gets only the abstain label from that source. In this way, with multiple weak supervision, one data point may get more than one label. If abstain is also treated as a kind of label, then with m weak supervision sources one data point gets m labels. Accordingly, after labeling n pieces of data, a noisy n × m discrete labels matrix \( L_{n \times m} \) is generated. However, the discrete labels matrix can neither be fed into a machine learning algorithm directly nor handle the imbalance problem, so it needs to be transformed into a probabilistic labels matrix.

(3) Labels Integration. It is assumed that the discrete label \( l_{ij} \) is generated by the true label \( y_{i} \). That is, given the true label \( y_{i} \), a conditional probability \( P(l_{ij}|y_{i}) \) needs to be learned. Since the latent variable \( y_{i} \) cannot be observed, the labels produced by the other weak supervision patterns are used instead. In this way, with the conditional independent model, the n × m labels matrix \( L_{n \times m} \) can be transformed into an n × k (k denotes the number of classes) probabilistic labels matrix \( L_{n \times k} \).

(4) Model Training. Together with the bag-of-words (term frequency) feature vector, the probabilistic labels vector can be used directly as the input of a neural network for model training. In view of this, we adopt a fully connected layer with a sigmoid/softmax activation function for training, which can make full use of probabilistic labels.

(5) Model Evaluation. To evaluate the performance of the classification model, test experiments were conducted on a small-scale dataset (Test). The test results determine the next step. If the results meet the requirements, the current model is output as the final model; otherwise, the process goes back to step (1) and optimizes the keywords and the distant supervision elements (corpus and threshold) until the model performance meets the requirements. Here, the ultimate goal of weak supervision is classification, so the performance of the classification model reflects the quality of weak supervision. Thus, categorical evaluation indicators such as precision, recall and F1-score, rather than graph-based semi-supervised techniques, are used for model evaluation. Graph-based semi-supervised techniques are left for future research.
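For the Model Training step, the way a softmax output layer consumes probabilistic labels can be sketched with the standard library alone. This is an illustration of cross-entropy with soft targets, not the authors' network: the function names are hypothetical, and the single example uses made-up logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(logits, prob_label):
    """Cross-entropy between a probabilistic label and softmax(logits)."""
    probs = softmax(logits)
    return -sum(y * math.log(p) for y, p in zip(prob_label, probs) if y > 0)


def gradient_wrt_logits(logits, prob_label):
    """For softmax with cross-entropy, and a label that sums to 1,
    the gradient with respect to the logits is p - y."""
    return [p - y for p, y in zip(softmax(logits), prob_label)]


prob_label = [0.7, 0.3]   # probabilistic label from labels integration
logits = [0.0, 0.0]       # an untrained layer predicts [0.5, 0.5]
loss = soft_cross_entropy(logits, prob_label)
grad = gradient_wrt_logits(logits, prob_label)
```

With a discrete label the gradient pushes all probability toward one class; with the label [0.7, 0.3] it vanishes once the prediction reaches [0.7, 0.3], so the graded confidence of the weak labels is preserved during training.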

Fig. 3 Multiple Weak Supervision for Short Text Classification

6 Experiments

6.1 Experimental settings

For simplicity and availability, the proposed method was tested on the task of identifying given topics from the titles of news or tender announcements. Two points should be noted. Firstly, both oversampling and under-sampling require strong labels for large-scale training data, whereas the proposed method is mainly based on weak labels generated by multiple weak supervision. In other words, it is difficult to directly compare data-level solutions such as oversampling and under-sampling with the probabilistic labels mechanism in this paper. Secondly, although multiple weak supervision uses both labeled and unlabeled data, it cannot simply be classified as a semi-supervised learning method, because multiple weak supervision addresses not only insufficient labeled data but also data sparsity and imbalanced classification. Therefore, we do not consider a comparison between semi-supervised learning methods such as co-training and multiple weak supervision to be meaningful. Given the confidentiality of the research, we will consider whether to disclose the source code and the datasets.

Datasets

Public datasets, real datasets and synthetic datasets can all be used for experimental verification. For the sake of completeness and simplicity, experiments were conducted on one public dataset, AG News (AG) [89], two synthetic datasets (the synthetic binary classification dataset SB and the synthetic tri-classification dataset ST) and one real dataset (RD). In particular, AG’s news titles and the titles of tender announcements were used as the sole input of the model. The basic information of AG, SB, ST and RD is listed in Table 3. Among them, SB, ST and AG are balanced datasets, while RD is imbalanced. Moreover, all the experimental datasets consist of short texts with fewer than 50 Chinese characters or 15 English words, which indicates that the data are very sparse. In addition, every dataset includes three small-scale datasets (Dev, Valid, Test) with ground-truth labels and large-scale unlabeled data (Train).

Table 3 Basic information of dataset

Model settings

Firstly, to automatically generate better weak labels, keywords matching, regular expressions and distant supervision clustering were integrated. Secondly, for simplicity and utility, N-gram features of the titles (feature representation) were combined with Logistic Regression (algorithm) to address the challenge of data sparsity. Moreover, to alleviate imbalanced classification, a fully connected neural network with a sigmoid/softmax activation function (Deep Logistic Regression algorithm, DLR) was adopted so that probabilistic labels can be used as input.
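To make the feature representation concrete, the sketch below turns titles into N-gram count vectors. It is a minimal illustration that assumes whitespace tokenization and unigrams plus bigrams; the tokenizer, the N-gram order and the example titles are assumptions rather than the settings used in the experiments (Chinese titles, for instance, would need character-level or segmented tokens).

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of contiguous n-grams of a token sequence."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def build_vocab(titles, max_n=2):
    """Map every 1..max_n-gram seen in the titles to a column index."""
    vocab = {}
    for title in titles:
        tokens = title.lower().split()
        for n in range(1, max_n + 1):
            for gram in ngrams(tokens, n):
                vocab.setdefault(gram, len(vocab))
    return vocab


def vectorize(title, vocab, max_n=2):
    """Return the term-frequency vector of a title over the vocabulary."""
    tokens = title.lower().split()
    counts = Counter(g for n in range(1, max_n + 1) for g in ngrams(tokens, n))
    # dicts preserve insertion order, so iteration follows the column indices
    return [counts.get(gram, 0) for gram in vocab]


# Hypothetical titles used only to build a toy vocabulary.
vocab = build_vocab(["road tender notice", "tender notice issued"])
features = vectorize("tender notice", vocab)
```

Because short titles share few exact words, bigrams such as "tender notice" add a small amount of context to each vector at the cost of a larger vocabulary.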

For simplicity and practicability, the bag-of-words of the titles is the only feature used. To accept probabilistic labels as input, a fully connected neural network with a sigmoid/softmax activation function (Deep Logistic Regression algorithm, DLR) was adopted, together with L2 regularization and a cross-entropy loss function. Owing to limited space, 3 classical algorithms (Logistic Regression (LR), Naïve Bayes (NB) and Support Vector Machine (SVM)) and 6 fine-tuned pre-training models were tested in the HAND comparison (small-scale hand-labeled data Dev as training data). This comparison will be expanded in future research, since this article focuses on proposing and implementing an overall effective solution. Specifically, the 6 pre-training models are BERT Base Chinese (BERT1) [39], BERT Base Multilingual (BERT2) [39], RoBERTa Base Chinese (RoBERTa1) [42], RoBERTa Large Chinese (RoBERTa2) [42], ERNIE Chinese (ERNIE1) [41] and ERNIE2.0 Chinese (ERNIE2) [43].
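The following sketch shows one way a sigmoid output unit can be trained directly on probabilistic labels with a cross-entropy loss and L2 regularization. It covers only the binary (sigmoid) case in plain Python with batch gradient descent; the learning rate, penalty weight, number of epochs and toy data are assumptions for illustration and are not the settings of the DLR model used in the experiments.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_soft_logreg(X, p, lr=0.5, l2=0.01, epochs=500):
    """Fit a single sigmoid unit on soft targets p in [0, 1].

    Minimises the mean cross-entropy between the soft target and the
    predicted probability, plus an L2 penalty on the weights.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, target in zip(X, p):
            pred = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = pred - target  # gradient of the cross-entropy w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (g / n + l2 * wi) for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Predicted probability of the positive class for one feature vector."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Term-frequency vectors over a toy vocabulary ["tender", "bid", "match", "team"].
X = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
p = [0.95, 0.80, 0.10, 0.20]  # probabilistic labels for the positive class
w, b = train_soft_logreg(X, p)
```

With soft targets, an uncertain weak label such as 0.80 pulls the decision boundary less strongly than a confident one such as 0.95, which is the property a discrete 0/1 label cannot express.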

Comparison models

As an overall solution, multiple weak supervision can address insufficient labeled data, data sparsity and imbalanced classification together, whereas none of semi-supervised learning, sampling strategies or weak supervision alone can achieve this. Moreover, according to the No Free Lunch Theorem [91], algorithms that perform well in one domain or under certain assumptions may not necessarily be the “strongest” in another. In view of this, multiple weak supervision is not compared with semi-supervised learning, sampling strategies, weak supervision and so on.

For comparison, we consider four baselines (Table 4): HAND (small-scale hand-labeled data Dev as training data); SWS (single type of weak supervision: only several simple keywords matching rules); DET (discrete labels for training); and NOD (no distant supervision sources for labeling data). HAND is used to illustrate the efficiency of large-scale data with weak labels. The comparison with SWS verifies the strong representation ability of multiple weak supervision. DET highlights the role of probabilistic labels in the imbalanced classification problem, while NOD validates the importance of distant supervision clustering.

Table 4 The main differences between different experiments

6.2 Experimental results

It should be noted that the synthetic datasets SB and ST were strictly selected by keyword matching. Therefore, the heuristic rules of simple keyword matching are consistent with SB and ST, and the experimental results of simple keyword matching on SB and ST may well be similar to those of the multiple weak supervision method. Bold type in Tables 5, 6, 7, 8 and 9 highlights the best experimental results.

Table 5 Results between Dev and Train
Table 6 Results between MWS and Pre-Training Model Fine Tuning on RD
Table 7 Results between SWS and MWS
Table 8 DET experimental results on RD
Table 9 Experimental results between NOD and WD

(1) HAND comparison

From Table 5, the results of the synthetic datasets SB and ST on Dev and Train were similar, with both scoring above 95%. This is because the synthetic datasets SB and ST were strictly selected based on the keyword matching pattern. It also suggests that model training translates weak supervision strategies into a machine learning model, or integrates several weak classifiers into one strong classifier, which is conceptually similar to stacking [90]. Notably, the results on datasets RD and AG illustrate the large advantage of expanding the training samples with multiple weak supervision. In particular, on RD, the F1-score was improved by an average of 32 percentage points.

In addition, considering the relatively poor performance of the 3 classical algorithms on dataset RD, fine tuning experiments were also added. Specifically, 6 pre-training models were adopted: BERT Base Chinese (BERT1) [39], BERT Base Multilingual (BERT2) [39], RoBERTa Base Chinese (RoBERTa1) [42], RoBERTa Large Chinese (RoBERTa2) [42], ERNIE Chinese (ERNIE1) [41] and ERNIE2.0 Chinese (ERNIE2) [43]. The experimental results of fine tuning are shown in Table 6. According to Table 6, the recall and F1-score of MWS are better than the fine tuning results of all six pre-training models. In terms of precision, MWS is also no lower than four of the pre-training models. This does not contradict the effectiveness of fine tuning on small-scale strongly labeled datasets: MWS replaces the small-scale strongly labeled dataset used for fine tuning with a large-scale weakly labeled dataset. After all, fine tuning relies on strongly labeled data, which weak supervision cannot provide.

(2) SWS comparison

Table 6 shows that, with a single type of weak supervision (SWS), the performance on SB and ST is so good that there is little room for improvement. Therefore, multiple weak supervision (MWS) was only tested on RD and AG. From Table 7, the performance of MWS is better than that of SWS, which illustrates the advantage of MWS over SWS and supports the effectiveness of the MWS method. In particular, with the help of MWS, the F1-score on RD increased by 2%.

(3) DET comparison

RD covers a wide variety of topics, but we only try to find the topic we care about, so the task is a binary classification problem. Moreover, compared with the topics of no interest, the proportion of topics we care about is very low. That is, RD is imbalanced, while SB and ST are balanced. To verify the effect of probabilistic labels on the imbalanced classification problem, we carried out a controlled test on the imbalanced dataset RD with probabilistic labels and discrete labels respectively. The results are shown in Table 8.

The results on the imbalanced dataset RD (Table 8) illustrate the advantages of probabilistic labels over discrete labels in imbalanced classification. Specifically, probabilistic labels provide a 9% improvement in F1-score on Test. Table 8 also shows that probabilistic labels improve the classification performance on the minority class remarkably: compared with a 2% improvement for the majority class, the F1-score of the minority class was improved by 16%. In a sense, probabilistic labels, or multiple weak supervision, might provide a new possibility for solving the imbalanced classification problem.
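Since the comparison above is reported per class, the sketch below shows how precision, recall and F1-score can be computed separately for the majority and the minority class. The label vectors are made-up toy data used only to show the calculation; they are not the RD results.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall and F1-score treating `positive` as the target class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy imbalanced example: eight samples of class 0 and two of class 1.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

majority = precision_recall_f1(y_true, y_pred, positive=0)
minority = precision_recall_f1(y_true, y_pred, positive=1)
```

In this toy case the overall accuracy is 80%, yet the minority class reaches an F1-score of only 0.5 against 0.875 for the majority class, which is why the per-class F1-score is the more informative measure on an imbalanced dataset.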

(4) NOD comparison

Experimental results show that, with weak labels from heuristic rules, the performance on SB and ST is good enough for application. Therefore, distant supervision clustering was only tested on datasets RD and AG. The experimental results are listed in Table 9.

In WD, the recall score on RD was improved by 4% without a reduction in precision. This suggests that the similarity threshold can act as the regulator of recall. Therefore, adjusting the similarity threshold can meet different application needs, which is of great significance in academia and industry.
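The role of the similarity threshold can be sketched as follows. The example assumes cosine similarity between bag-of-words vectors and takes the similarity to the knowledge base as the maximum over its entries; both choices, as well as the corpus and titles, are assumptions made for illustration, since the similarity measure is not fixed by the description above.

```python
import math
from collections import Counter

ABSTAIN = -1  # returned when the sample is not similar enough to the corpus


def bow(text):
    """Bag-of-words (term frequency) representation of a whitespace-tokenized text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def distant_label(text, corpus, label, threshold):
    """Vote `label` if the text is close enough to any corpus entry, else abstain."""
    vec = bow(text)
    best = max(cosine(vec, bow(entry)) for entry in corpus)
    return label if best >= threshold else ABSTAIN


# Hypothetical knowledge base for the topic of interest (class 1).
corpus = ["road construction tender notice", "bid invitation for bridge repair"]
samples = [
    "tender notice for road construction",
    "invitation to bid on bridge works",
    "football final ends in a draw",
]

strict = [distant_label(s, corpus, 1, threshold=0.8) for s in samples]
loose = [distant_label(s, corpus, 1, threshold=0.5) for s in samples]
```

With these toy inputs the stricter threshold labels only the first title, while the looser threshold also labels the second; the unrelated third title abstains in both cases. Lowering the threshold therefore increases coverage, and with it recall, as long as the corpus is relevant enough to keep precision stable.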

To sum up, we have the following observations.

  1. (1)

    While multiple weak supervision expands the labeled dataset, it also alleviates data sparsity of short text, thus improving the performance of the classifier.

  2. (2)

    With the conditional independent model, weak labels provided by multiple weak supervision have higher accuracy and coverage than those provided by a single type of weak supervision.

  3. (3)

    Probabilistic labels may provide a new solution for the imbalanced classification problem. Notably, probabilistic labels should be based on reliable multiple weak supervision.

The similarity threshold can be the regulator of recall. That is, distant supervision clustering can be used to represent tacit knowledge and improve recall score.

  4. (4)

    For multiple weak supervision, LDA can be used to extract explicit knowledge (keywords) of heuristic rules efficiently.

Additionally, the comparison results of the above four experiments illustrate the effectiveness of the proposed framework in addressing label shortage, data sparsity and imbalanced classification together. In general, the proposed framework can be used for short text classification in any domain. The main differences among domains are the keywords pattern, the external corpus and the similarity threshold. That is, with proper keywords and a relevant corpus, the performance of the proposed framework is expected to differ little across different areas of short text classification.

7 Conclusion

To address the label bottleneck, data sparsity and imbalanced classification in short text classification simultaneously, multiple weak supervision was designed. With multiple weak supervision, explicit knowledge and tacit knowledge can be used to generate weak labels automatically. Furthermore, based on the weak labels and the conditional independent model, probabilistic labels can be generated and an effective imbalanced classification model can be trained. What makes this reasonable is that explicit knowledge and tacit knowledge provide enough diversity for labels integration. Specifically, our work has the following four contributions:

(1) Multiple Weak Supervision Sources:

Multiple weak supervision sources, covering explicit knowledge and tacit knowledge, were creatively introduced to label training data. Taking short text classification as an example, multiple weak supervision sources can be simple keywords matching, regular expressions and distant supervision clustering.

(2) Probabilistic Labels for Imbalanced Classification:

Experimental results show that the probabilistic labels generated by the conditional independent model can effectively address the imbalanced text classification problem. This may provide a new solution to imbalanced classification, which has troubled industry workers and researchers for years.

(3) Combining Distant Supervision with Clustering:

Unlike the common alignment strategy, distant supervision was combined with clustering to generate weak labels and improve coverage. In this way, distant supervision clustering was proposed, which can make full use of small-scale hand-labeled data and does not need explicit knowledge extraction. With distant supervision clustering, tacit knowledge, which is hard to represent, can easily be captured in the knowledge base (corpus) and the similarity threshold.

Notably, the similarity threshold of distant supervision clustering can be used as the regulator of recall. In practical applications, this is of great significance for applying weak supervision to meet different needs of recall score. That is, if the clustering corpus and similarity threshold can be selected well, the recall and F1-score could be improved with little effect on precision.

(4) LDA for Knowledge Extraction:

Latent Dirichlet Allocation (LDA) was introduced to extract the keywords of a specific topic, which are the foundation of weak supervision. Moreover, LDA is simple and useful, and can effectively prevent over-fitting.

Despite this, there are still many limitations in this paper. In the future, we will further study knowledge extraction methods (such as LDA), expand the weak supervision sources and seek more theoretical analysis to validate the multiple weak supervision method.