1 Introduction

In recent years, with the development of Web 2.0 technologies, short texts such as short messages, microblogs, and news comments have grown at a geometric rate. Unlike traditional texts, short texts exhibit inherent characteristics such as extreme feature sparsity and highly unbalanced samples, which prevent approaches designed for long texts from being applied directly.

To extend short text features, recent research has focused mainly on three directions. First, some researchers use language models and grammatical and syntactic analysis to obtain richer semantic information [13]. Second, some researchers exploit statistical approaches, such as global term context vectors and coupled term-term relations, for short text extension [4, 5]. Third, both semantic information obtained from a hierarchical lexical database and statistical information contained in the corpus are combined for short text extension [6].

This paper proposes a short text feature extension strategy based on improved frequent term sets. We focus on news titles and take news content as background knowledge. The feature extension algorithm extracts frequent term sets and builds a word similarity matrix, which is then used to extend the feature space of short texts. An overview of our framework is shown in Fig. 1. First, by calculating support and confidence, double term sets with a co-occurring relation and an identical class orientation are extracted from the long text corpus. These frequent term sets are then extended with external relations. Meanwhile, information gain is introduced into traditional TF-IDF so that category distribution information is expressed better and the weight of words in each category is enhanced. Finally, the word similarity matrix is constructed from the frequent term sets, and symmetric non-negative matrix factorization (SNMF) is used to extend the feature space.

Fig. 1. Algorithm framework

2 Frequent Term Sets Extraction

We briefly introduce some concepts and notations used in frequent term set extraction in Table 1. Support is defined as the number of documents containing term set T divided by the total number of documents in the data set, while confidence is defined as the number of documents containing term t in class c divided by the number of all documents containing t [7].

Table 1. Notation definition
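As an illustration of these two measures, the following is a minimal sketch, not part of the original paper; documents are represented as sets of tokens paired with class labels, and all names are hypothetical.

```python
from typing import List, Set


def support(term_set: Set[str], docs: List[Set[str]]) -> float:
    """Fraction of documents that contain every term in term_set."""
    hits = sum(1 for d in docs if term_set <= d)
    return hits / len(docs)


def confidence(term: str, cls: str, docs: List[Set[str]], labels: List[str]) -> float:
    """Fraction of documents containing `term` that belong to class `cls`."""
    containing = [lab for d, lab in zip(docs, labels) if term in d]
    if not containing:
        return 0.0
    return containing.count(cls) / len(containing)
```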

To select frequent term sets efficiently, several other concepts are needed:

Definition 1 (Co-occurring relation).

If the support of term set T surpasses the threshold α (0 < α < 1), T is considered a frequent term set and all terms in T have a co-occurring relation.

Definition 2 (Class orientation).

For term t and class \( {\text{c}} \in {\text{C}} \), if conf(t,c) surpasses the threshold β (0.5 ≤ β < 1), term t has a class orientation to c, formulated as Tendency(t) = c.

Definition 3 (Identical Class Orientation).

For two terms t1 and t2, if there is a class c, Tendency(t1) = c and Tendency(t2) = c, then t1 and t2 have an Identical Class Orientation.

In order to obtain more semantic information, we extract frequent term sets with both co-occurring relation and identical class orientation.

Since a co-occurring relation represents the semantic association between terms, and terms with an identical class orientation are likely to come from the same or closely related topics, feature extension based on these terms is expected to have better discriminative ability. Considering that most Chinese phrases consist of two words, this paper focuses on extracting double frequent term sets. The algorithm is described in Table 2 [7], where the feature set F denotes the collection of words in the background knowledge; a minimal sketch follows the table.

Table 2. Double frequent term sets extraction algorithm
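Table 2 is not reproduced in this excerpt; the following sketch only illustrates how double frequent term sets could be extracted under Definitions 1-3, reusing the hypothetical `support` and `confidence` helpers above. The thresholds α and β are the ones introduced in the definitions; everything else is an assumption for illustration.

```python
from itertools import combinations


def class_orientation(term, docs, labels, classes, beta):
    """Return the class c with conf(term, c) >= beta (Definition 2), or None."""
    for c in classes:
        if confidence(term, c, docs, labels) >= beta:
            return c
    return None


def double_frequent_term_sets(feature_set, docs, labels, classes, alpha, beta):
    """Pairs of terms that co-occur frequently and share a class orientation."""
    pairs = []
    for t1, t2 in combinations(sorted(feature_set), 2):
        if support({t1, t2}, docs) < alpha:                      # Definition 1
            continue
        c1 = class_orientation(t1, docs, labels, classes, beta)  # Definition 2
        c2 = class_orientation(t2, docs, labels, classes, beta)
        if c1 is not None and c1 == c2:                          # Definition 3
            pairs.append((t1, t2))
    return pairs
```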

3 Word Similarity Matrix Construction

3.1 Improved Term Weighting Scheme

In the calculation of IDF, the training set is treated as a whole, which ignores how feature terms are distributed among categories. An improved term-weighting scheme is therefore applied in our work.

Assuming that X = {x1:p1, x2:p2,…,xn:pn} is the information probability space, the information entropy of X is formulated as [8]:

$$ {\text{H}}\left( {\text{X}} \right) = {\text{H}}\left( {{\text{p}}_{1} ,{\text{p}}_{2} , \ldots ,{\text{p}}_{\text{n}} } \right) = - \sum\nolimits_{{{\text{i}} = 1}}^{\text{n}} {{\text{p}}_{\text{i}} { \log }_{2} {\text{p}}_{\text{i}} } $$
(1)

Information gain is the reduction in information entropy, represented as:

$$ {\text{I}}\left( {{\text{X}},{\text{y}}} \right) = {\text{H}}\left( {\text{X}} \right) - {\text{H}}({\text{X}}|{\text{y}}) $$
(2)

where H(X) is the entropy of X without any information about y, and H(X|y) is the conditional entropy, i.e., the remaining uncertainty of X once y is known.

From the information-theoretic point of view, the core idea of the improved term-weighting scheme is that, for a training set with a given probability distribution, the categorical information carried by a word is largely captured by its information gain. With this in mind, the improved term weight is defined as:

$$ {\text{Q}}_{\text{ij}} = {\text{TF}}\left( {{\text{t}}_{\text{j}} } \right) \times { \log }\left( {\frac{\text{k}}{{{\text{n}}_{\text{i}} }} + 0.01} \right) \times {\text{IG}} $$
(3)

Here \( {\text{Q}}_{\text{ij}} \) is the weight of \( {\text{t}}_{\text{j}} \) in class \( {\text{c}}_{\text{i}} \), \( {\text{n}}_{\text{i}} \) is the total number of words in \( {\text{c}}_{\text{i}} \), and IG is the information gain of \( {\text{t}}_{\text{j}} \) computed by Eq. (2).
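A minimal sketch of this weighting, with the information gain of a term estimated from its presence/absence across classes. The exact meaning of k in Eq. (3) is not given in this excerpt, so it is passed in as a plain parameter; the log base of the IDF factor is likewise unspecified and the natural log is used here.

```python
import numpy as np


def entropy(p):
    """Shannon entropy of a probability vector, ignoring zero entries (Eq. 1)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))


def information_gain(term, docs, labels, classes):
    """Information gain of a term over the class variable (Eq. 2)."""
    labels = np.asarray(labels)
    prior = np.array([np.mean(labels == c) for c in classes])
    present = np.array([term in d for d in docs])
    ig = entropy(prior)
    for mask in (present, ~present):
        if mask.any():
            cond = np.array([np.mean(labels[mask] == c) for c in classes])
            ig -= mask.mean() * entropy(cond)   # subtract conditional entropy
    return ig


def improved_weight(tf_ij, n_i, k, ig):
    """Eq. (3): Q_ij = TF(t_j) * log(k / n_i + 0.01) * IG."""
    return tf_ij * np.log(k / n_i + 0.01) * ig
```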

3.2 Word Similarity Calculation

With the improved term-weighting scheme, we can construct the word similarity matrix. For two terms \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) in the frequent term sets, their semantic similarity is derived from the Jaccard similarity [9] as follows:

$$ {\text{CoR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) = \frac{1}{{\left| {\bar{C}} \right|}} \times \sum\nolimits_{{{\text{x}} \in \bar{C}}} {\frac{{{\text{Q}}_{\text{xi}} {\text{Q}}_{\text{xj}} }}{{{\text{Q}}_{\text{xi}} + {\text{Q}}_{\text{xj}} - {\text{Q}}_{\text{xi}} {\text{Q}}_{\text{xj}} }}} $$
(7)

Since the frequent term set extraction is based on categories, x here stands for a category and \( {\text{Q}}_{\text{xi}} \) is the improved weight of \( {\text{t}}_{\text{i}} \) in category \( {\text{c}}_{\text{x}} \). \( {\bar{C}} \) is the subset of C satisfying \( \bar{C} = \left\{ {{\text{x}}\,|\,\left( {{\text{Q}}_{\text{xi}} \ne 0} \right) \vee \left( {{\text{Q}}_{\text{xj}} \ne 0} \right)} \right\} \). When \( {\bar{C}} \) is empty, \( {\text{CoR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) = 0 \).

We then normalize the semantic similarity as:

$$ {\text{IaR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) = \left\{ {\begin{array}{*{20}c} 1 & { {\text{i}} = {\text{j}}} \\ {\frac{{{\text{CoR}}({\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} )}}{{\mathop \sum \nolimits_{{{\text{i}} = 1,{\text{i}} \ne {\text{j}}}}^{\text{N}} {\text{CoR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right)}}} & {{\text{i}} \ne {\text{j}}} \\ \end{array} } \right. $$
(8)
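The two equations above translate directly into code. The sketch below assumes Q is a categories-by-terms matrix of the improved weights from Eq. (3), scaled into (0, 1] so that the Jaccard-style denominator stays positive; this scaling is our assumption, not stated in the paper.

```python
import numpy as np


def cor(Q, i, j):
    """Eq. (7): Jaccard-style similarity of terms i and j over the categories
    in which at least one of them has nonzero weight (the subset C-bar)."""
    mask = (Q[:, i] != 0) | (Q[:, j] != 0)
    if not mask.any():
        return 0.0
    qi, qj = Q[mask, i], Q[mask, j]
    return np.mean(qi * qj / (qi + qj - qi * qj))


def iar(Q):
    """Eq. (8): CoR normalized over each column, with 1 on the diagonal."""
    n = Q.shape[1]
    C = np.array([[cor(Q, i, j) for j in range(n)] for i in range(n)])
    np.fill_diagonal(C, 0.0)                 # exclude i == j from the sums
    col_sums = C.sum(axis=0)
    I = np.divide(C, col_sums, out=np.zeros_like(C), where=col_sums > 0)
    np.fill_diagonal(I, 1.0)
    return I
```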

However, over the whole term set, one term may have co-occurring relations and identical class orientations with several other terms. For example, in the frequent term pairs {(computer, mouse), (computer, keyboard), (mouse, keyboard), (cellphone, computer), (cellphone, Internet)}, 'keyboard' and 'mouse' not only co-occur but are also linked through 'computer'. Similarly, 'computer' can be related to 'Internet' through 'cellphone', even though they do not co-occur directly. We therefore define another relation to strengthen the semantic association of such words.

Definition 4 (Inter-relation).

Terms \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) are defined to be inter-related if there exists at least one linking term \( {\text{t}}_{\text{k}} \) that co-occurs with both \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \).

As shown in Fig. 2, \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{k}} \) co-occur, as do \( {\text{t}}_{\text{j}} \) and \( {\text{t}}_{\text{k}} \). We therefore assume a semantic relation between \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \), even though they do not co-occur directly.

Fig. 2. External relation

All term pairs with external relations are extracted and added into R, forming the extended frequent double term set \( {\text{R}}_{\text{e}} \). The words in \( {\text{R}}_{\text{e}} \) are strongly related to each other, which provides a solid foundation for word similarity matrix construction.

External relations must be quantified before the word similarity matrix can be constructed:

$$ {\text{R}}\_{\text{IeR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} |{\text{t}}_{\text{k}} } \right) = { \hbox{min} }\left( {{\text{IaR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{k}} } \right),{\text{IaR}}\left( {{\text{t}}_{\text{j}} ,{\text{t}}_{\text{k}} } \right)} \right) $$
(9)

where \( {\text{IaR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{k}} } \right) \) and \( {\text{IaR}}\left( {{\text{t}}_{\text{j}} ,{\text{t}}_{\text{k}} } \right) \) represent the semantic similarity between \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{k}} \), and between \( {\text{t}}_{\text{j}} \) and \( {\text{t}}_{\text{k}} \), respectively. In quantifying an external relation, we assume that the semantic similarity between \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) is at least the smaller of the two similarities to the linking term, which is a reasonable lower bound.

The final external relation between \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) is calculated over all their linking terms. After normalization, the inter-relation is formalized as:

$$ {\text{IeR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) = \left\{ {\begin{array}{*{20}c} 0 & {{\text{i}} = {\text{j}}} \\ {\frac{1}{{|{\text{L}}|}}\sum\nolimits_{{{\text{t}}_{\text{k}} \in {\text{L}}}} {{\text{R}}\_{\text{IeR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} |{\text{t}}_{\text{k}} } \right)} } & {{\text{i}} \ne {\text{j}}} \\ \end{array} } \right. $$
(10)

Here \( {\text{L}} = \left\{ {{\text{t}}_{\text{k}} \,|\,\left( {{\text{IaR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{k}} } \right) > 0} \right) \wedge \left( {{\text{IaR}}\left( {{\text{t}}_{\text{k}} ,{\text{t}}_{\text{j}} } \right) > 0} \right)} \right\} \) is the set of linking terms and |L| is its size. If L is empty, the inter-relation \( {\text{IeR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) \) is zero; if \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) are the same word, we also set \( {\text{IeR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) \) to zero. When \( {\text{t}}_{\text{i}} \) differs from \( {\text{t}}_{\text{j}} \), there may be one or more linking terms relating them, namely the elements of L, so the influence of all linking terms is taken into account.
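A sketch of Eqs. (9) and (10) on top of the IaR matrix from the previous sketch. Excluding \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) themselves from the linking terms is our reading of Definition 4; the text does not state it explicitly.

```python
def ier(IaR_matrix):
    """Eqs. (9)-(10): inter-relation through linking terms.
    For i != j, average min(IaR(i,k), IaR(j,k)) over all linking terms k;
    0 on the diagonal or when no linking term exists."""
    n = IaR_matrix.shape[0]
    E = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            links = [k for k in range(n) if k not in (i, j)
                     and IaR_matrix[i, k] > 0 and IaR_matrix[k, j] > 0]
            if links:
                E[i, j] = np.mean([min(IaR_matrix[i, k], IaR_matrix[j, k])
                                   for k in links])
    return E
```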

The word similarity matrix S is constructed based on the frequent term sets, in which Sij denotes the semantic similarity between \( {\text{t}}_{\text{i}} \) and \( {\text{t}}_{\text{j}} \) and is defined as follows:

$$ {\text{S}}_{\text{ij}} = \left\{ {\begin{array}{*{20}c} 1 & {{\text{i}} = {\text{j}}} \\ {\left( {1 - \gamma } \right) \cdot {\text{IaR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right) + \gamma \cdot {\text{IeR}}\left( {{\text{t}}_{\text{i}} ,{\text{t}}_{\text{j}} } \right)} & {{\text{i}} \ne {\text{j}}} \\ \end{array} } \right. $$
(11)

where \( \gamma \in \left[ {0,1} \right] \) is a parameter controlling the weight of inter-relations. In our work, we set γ to 0.5.
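Combining the two relations, Eq. (11) reduces to a weighted sum of the two matrices computed above; a minimal sketch:

```python
def similarity_matrix(IaR_matrix, IeR_matrix, gamma=0.5):
    """Eq. (11): blend co-occurrence (IaR) and inter-relation (IeR) similarities,
    with 1 on the diagonal."""
    S = (1 - gamma) * IaR_matrix + gamma * IeR_matrix
    np.fill_diagonal(S, 1.0)
    return S
```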

At this point, the word similarity matrix S has been constructed; the semantic similarity of words captures not only the co-occurring relation but also the extended inter-relation, so semantic associations are further enhanced.

4 Short Text Feature Extension Based on Semantic Similarity Matrix

Non-negative matrix factorization was first proposed by Lee and Seung in Nature in 1999 [10]. Unlike the original algorithm, symmetric non-negative matrix factorization (SNMF) factorizes a non-negative matrix into the product of a non-negative matrix and its transpose. More specifically, for a given non-negative matrix \( {\text{Z}}_{{{\text{n}} \times {\text{n}}}} \), SNMF finds a non-negative factor \( {\text{Y}}_{{{\text{n}} \times {\text{k}}}} \) satisfying:

$$ {\text{Z}} \approx {\text{YY}}^{\text{T}} ,{\text{Y}} \ge 0 $$
(12)

Since the semantic similarity matrix S is symmetric, SNMF factorizes S into P and its transpose. Each element of P is updated iteratively as follows:

$$ {\text{P}}_{{{\text{i}},{\text{j}}}} \leftarrow \frac{1}{2}\left[ {{\text{P}}_{{{\text{i}},{\text{j}}}} \left( {1 + \frac{{({\text{SP}})_{\text{ij}} }}{{({\text{PP}}^{\text{T}} {\text{P}})_{\text{ij}} }}} \right)} \right] $$
(13)
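A minimal sketch of this multiplicative update applied to the similarity matrix S. The rank, iteration count, random initialization, and the small eps guard against division by zero are our choices for illustration, not prescribed by the paper.

```python
def snmf(S, rank, iters=200, eps=1e-9, seed=0):
    """Symmetric NMF sketch: find non-negative P with S ≈ P P^T
    using the multiplicative update of Eq. (13)."""
    rng = np.random.default_rng(seed)
    K = S.shape[0]
    P = rng.random((K, rank))
    for _ in range(iters):
        numer = S @ P                      # (S P)_ij
        denom = P @ (P.T @ P) + eps        # (P P^T P)_ij, guarded against zero
        P = 0.5 * P * (1.0 + numer / denom)
    return P
```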

We build the original feature space W with TF-IDF and factorize S as \( {\text{S}}_{{{\text{K}} \times {\text{K}}}} = {\text{P}}_{{{\text{K}} \times {\text{N}}}} \times {\text{P}}_{{{\text{N}} \times {\text{K}}}}^{\text{T}} \). The extended matrix \( {\text{W}}_{\text{e}} \) is then obtained as:

$$ {\text{W}}_{\text{e}} = {\text{WP}}^{\text{T}} $$
(14)

Since \( {\text{P}}^{\text{T}} \) is obtained by factorizing S, its entries are in general nonzero. Likewise, W is the original matrix in which each row represents a document, so its rows are not all-zero either. Therefore, the extended feature space \( {\text{W}}_{\text{e}} \) is less sparse than before, which is vital to the construction of the short text model. Furthermore, word similarity and categorical information, as well as the original word information in W, are assimilated into the new feature space \( {\text{W}}_{\text{e}} \), which benefits the similarity calculation of short texts.
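As a usage sketch: with W a hypothetical documents-by-terms TF-IDF matrix over the K frequent-set terms and P the K-by-rank factor returned by the snmf sketch above, the extension is a single matrix product. The paper writes it as W P^T under its own dimension convention; with the shapes used here the conforming product is W @ P.

```python
P = snmf(S, rank=50)   # S: K x K word similarity matrix from Sect. 3; rank is a hypothetical choice
W_e = W @ P            # W: num_docs x K TF-IDF matrix; W_e: num_docs x 50 extended feature space
```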

5 Experiments

In this section, we conduct a series of experiments to evaluate the performance of our algorithm and analyze the results.

5.1 Data Set

Experiments are conducted on two datasets: 20-Newsgroups [11] and the Sougou corpus [12]. 20-Newsgroups is composed of 20 different newsgroups and 20,000 short text snippets. The Sougou corpus is a data set of news pages from Sohu news provided by Sougou Lab, covering 18 categories such as International, Sports, Society, and Entertainment. Each page has its URL, page ID, page title, and body content.

The short texts are the titles, while the news contents and descriptions are used as background knowledge. All Chinese documents were pre-processed by word segmentation using ICTCLAS. After pre-processing, we select 10 categories from 20-Newsgroups, each containing 200 documents, and 9 categories from the Sougou corpus, 2000 pages in total. The traditional K-means algorithm is employed to evaluate clustering performance.

5.2 Experiment Results

Since α and β are the most important parameters in our algorithm, we first vary their values to examine the performance of double term set extraction. Then, the original frequent term sets and the improved frequent term sets with external relations are compared. Finally, we compare five short text representation methods for clustering, using Purity and F-measure as evaluation criteria.

The extraction of double term sets is crucial to word similarity matrix construction, and it relies heavily on the support and confidence constraints. The support guarantees the co-occurring relation of terms, while the confidence determines whether the terms have an identical class orientation. We extract the double term sets using different parameter settings, α = 1.0 %, 1.5 %, 2.0 %, 2.5 %, 3.0 % and β = 0.5, 0.6, 0.7, 0.8, 0.9, and the results are listed in Table 3(a).

Table 3(a). Double frequent term sets distribution with different support and confidence

As shown in Table 3(a), the number of extracted double term sets decreases dramatically as the support and confidence increase. This is understandable, since a higher support threshold requires terms to co-occur more often, and the constraints become stricter as the confidence increases. It also shows that when the support and confidence are set too high, the information available as background knowledge is too sparse to have a substantial influence on the original feature space. Therefore, we choose α = 1.0 %, β = 0.5 to construct the background knowledge in the following experiments.

Moreover, we define the external relation on top of the traditional frequent term sets to further extend them and obtain more semantic information. With the support parameter fixed and external relations taken into account, the numbers of extracted frequent term sets are shown in Table 3(b).

Table 3(b). Double frequent term sets distribution with external relation

From Table 3(b), the frequent term sets with external relations differ in quantity from those of the original algorithm, although their distributions are the same. The number of term sets extracted with external relations is clearly much larger than that of the original algorithm. For example, when α = 1.0 %, β = 0.5, the original algorithm extracts 9117 frequent term sets, while the number with external relations is 10942, an increase of about 20 %. The growth ratio decreases as the parameters increase. In addition, the double frequent term sets with external relations carry more semantic information.

Finally, we conduct experiments to compare the clustering performance of five methods on the two data sets. The results on the two evaluation metrics, Purity and F-measure, are presented in Fig. 3.

Fig. 3. Clustering results of different methods on 20-Newsgroups (left) and Sougou corpus (right)

Figure 3 shows that performance depends to some extent on the data set, and that the proposed method performs considerably better than the other methods. Furthermore, the results on the Sougou corpus are slightly better than those on 20-Newsgroups in our experiments. As depicted in Fig. 3, the results of the five methods can be roughly divided into three levels. Traditional TF-IDF performs worst, largely because it ignores semantics in model construction. The coupled term-term relations method and the improved term-weighting method perform similarly, and both are still worse than the methods based on frequent term sets. This can be explained as follows: the coupled term-term relations method extracts intra- and inter-relations of words from co-occurrence to enhance semantic information, while the improved term-weighting method considers information gain and statistical information; each of them captures only one aspect.

By extending the extracted semantic information into the short text space, the original frequent term sets method alleviates, to some extent, the problem of high-dimensional sparsity in short texts. However, as shown in Fig. 3, the best scheme for short text representation is the improved frequent term sets method. Its advantages are summarized as follows: additional semantic information is first revealed through external relations; the word similarity matrix is then built using the improved term-weighting scheme and word categorical information; finally, symmetric non-negative matrix factorization is used to extend the feature space, which alleviates the problem of high-dimensional sparsity.

6 Conclusion

This paper discusses a short text feature extension algorithm based on improved frequent term sets. The external relation is proposed on the basis of frequent term sets, further strengthening the associations between words. Considering the distribution of terms among categories, an improved term-weighting scheme using information gain is presented, which effectively preserves categorical information. Moreover, all term pairs with external relations are extracted and the frequent term sets are expanded. Finally, the word similarity matrix is constructed from the frequent term sets, and symmetric non-negative matrix factorization is used to extend the feature space. Experiments show that the constructed short text model can significantly improve clustering performance.