1 Introduction

With the advent of the 5G era, the rapid development of Internet technology has brought people convenience but also many challenges. The short text data we encounter every day, such as search snippets, microblogs, and news headlines, contains a large amount of valuable information [1]. However, most existing short text classification methods focus on texts of a few dozen words, such as microblogs, and rarely consider short texts with even fewer words, such as news headlines. News headline classification determines the domain to which a headline belongs mainly on the basis of its semantics [2]. Since news headlines are short sentences whose semantics are compressed into weakly related words, conventional short text classification methods do not classify them effectively. Yet high-quality headline classification contributes directly to news content categorization and saves considerable computational overhead; its main applications include domain-specific machine translation [3] and false information detection [4].

Although many machine learning algorithms and deep neural network methods perform well in short text classification, they do not perform well in news headline classification [5]. This is because news headlines generally contain little text whose words are only weakly related to each other, so previous short text classification methods have difficulty handling them effectively. Many scholars have therefore sought new methods for this problem. Fan et al. [6] proposed a co-occurrence-based short text feature expansion method, Hu et al. [7] proposed a method combining CNN with support vector machines, and Zhihao et al. [8] began to use graph convolutional networks for headline classification. All of these methods improve the generalization ability of machine learning in the face of new situations to some extent and also raise classification accuracy, but none of them fundamentally solves the problem of inefficient news headline classification.

Since the degree of association between words within a news headline is low, the implicit associations between words contribute more to classifying headlines, and expanding the feature space and collecting contextual information can help reveal these implicit associations. To address this issue, some studies use external resources to enrich text features. Kanu et al. [9] store information in a language-independent format so that it can be retrieved across multiple geographical locations, alleviating the shortage of corpus data. However, this approach demands high-quality external resources, and inappropriate resources may introduce new noise; it is also less portable and relatively time-consuming. In contrast, expansion based on internal semantic association information has developed rapidly in recent years. The word2vec model proposed in [10] has been widely used. Lilleberg et al. [11] combined word2vec word vectors with a latent Dirichlet allocation (LDA) model to generate corresponding topic features, and Li et al. [12] proposed a topic-based CNN model that combines the LDA topic model with a deep neural network. However, the lack of word co-occurrence information seriously hinders the generation of document topic distributions, so traditional topic modeling methods cannot achieve satisfactory results in domain headline topic modeling.

Based on this, this paper proposes a new method for news headline classification based on keyword feature expansion (KFNHC). The method makes full use of the existing data, considers the feature information that the text carries for each category, and strictly filters the results generated by topic modeling to ensure the accuracy of the expanded topic words. Without using any external data, keyword feature expansion is performed on news headlines at both the sentence and word levels. The main contributions of this paper are as follows:

First, the Huseg method is proposed for the preprocessing stage to reduce the errors and inconsistencies of Chinese word segmentation, and thus reduce the loss of keywords during preprocessing.

Second, a keyword feature expansion method is proposed, which uses an LDA model combined with the TextRank algorithm (TRLDA) and an improved term frequency-inverse document frequency (TF-IDF) algorithm to expand the keyword features of short texts at the sentence and word levels, respectively.

Finally, extensive and complete experiments are conducted on the Sogou News dataset. The results prove that our approach outperforms existing similar models.

The rest of this paper is organized as follows. Related work on news headline classification is reviewed in Sect. 2, the method proposed in this paper is described in detail in Sect. 3, the experimental setup is described in Sect. 4, the experimental results are discussed in Sect. 5, and the conclusions are presented in Sect. 6.

2 Related Work

News headline classification plays an important role in many applications of natural language processing (NLP). Existing technology generally treats it as a short text classification problem. However, due to the unique nature of news headlines, most short text classification techniques cannot handle this task effectively. In this section, we review the technical modules corresponding to the improvements in the proposed model, covering three parts: methods based on deep learning, topic modeling methods based on internal topic feature expansion, and term weighting algorithms.

With the development of deep learning, some deep learning models have been introduced into short text classification [13]. However, the text information extracted by CNN mainly stays at the level of words and characters without considering other semantic features such as topic features. As research has deepened, topic modeling methods such as latent semantic analysis (LSA) [14] and LDA [15] have been applied to short text classification; scholars use these methods to model and extend the topics of texts and to discover the hidden semantic topic information in documents. However, news headlines do not provide enough information to support topic modeling because of their limited sentence length and the weak correlation between their words, which reduces the accuracy of the generated topics and leads to poor classification results. To address this problem, Qiang et al. [16] use a Bi-GRU to resolve the ambiguity of Chinese vocabulary and the complexity of Chinese grammar rules. It is also possible to discover hidden topic structures with external data such as Wikipedia [17], relying mainly on domain-specific vocabularies and synonym lists to obtain the corresponding external knowledge, but this approach is neither universally applicable nor scalable.

Different from previous studies that rely on external knowledge or pre-trained expansions of topic features, researchers have recently focused on designing new topic modeling approaches. These are all fundamentally aimed at improving the expressive power of text semantics, but they rarely guarantee the validity of the expanded words. To address this problem, Li et al. [18] proposed a method that requires no external knowledge: it analyzes short texts with LDA-based topic models, represents each short text with a vector space model, and introduces a new vector adjustment method for short texts. Feng et al. [19] proposed a regularized non-negative matrix factorization topic model for short texts, which uses pre-trained distributed word vector representations to overcome the data sparsity of short texts; it considers only the data itself and does not use external resources to expand short text features. Wang et al. and Hao et al. [20, 21] incorporated attention mechanisms into neural network models to perform semantic expansion at the sentence level and the related-word level of short texts, respectively. Reference [22] proposed a neural topic model in an autoencoding framework, which uses a new quantization method for the topic distribution to generate a peaked distribution better suited to modeling short texts, and also proposed a negative sampling decoder that avoids generating duplicate topics.

In contrast, term weighting methods have long been among the most common processing methods in text classification. According to [23], they can be grouped into two categories: unsupervised term weighting methods that do not consider category information, such as binary representation and TF-IDF; and supervised term weighting methods that take category information into account. Many past studies used TF-IDF-based text classification methods [24], but their simple assumption that terms with low document frequency are more important than terms with high document frequency does not hold for news headline domain classification, where the weak correlation between words and the consistency of texts within a domain pose a more severe challenge. Pascual et al. [25] improved the fuzzy random forest; an advantage of fuzzy rules is the ability to manage uncertainty and to work with linguistic scales. Fuzzy random forests achieve good classification performance on many problems, but their quality decreases when the classes are imbalanced. Building on this, Sun et al. [26] proposed the concept of topic high-frequency words and a grammatical category combination weighting method to distinguish the contributions of different keywords to the classification result. However, these term weighting methods weight the classification only from the perspective of individual words, and because of the weak correlation between the words in news headlines there is still much room to improve classification performance.

In summary, these models only exploit the explicit word co-occurrence information that can be captured in the corpus; they do not fully utilize deep semantic relationships and sentence structure, and therefore cannot accurately reflect how much each feature word contributes to its category, resulting in inaccurate text representations. Therefore, in the keyword feature expansion stage, this paper proposes a new TRLDA topic model and an improved TF-IDF algorithm to expand the keywords of news headlines at both the sentence and word levels. In the preprocessing stage, this paper also proposes the Huseg method to reduce the errors and inconsistencies of Chinese word segmentation and thereby reduce the loss of keywords in news headlines.

3 The Proposed KFNHC Method

The proposed KFNHC method mainly uses the TRLDA model and the improved TF-IDF algorithm to achieve keyword feature expansion for Chinese news headlines. As shown in Fig. 1, the overall framework consists of four main parts. First, the new Huseg method is used in the preprocessing stage to reduce the probability of keyword loss in Chinese news headlines. Second, the TRLDA model expands the keywords of each headline through dynamic construction, and the improved TF-IDF algorithm weights them. Third, the short text and its related words are conceptualized with pre-trained word vectors to generate the corresponding word vector matrices, which are used to expand the keyword features of the preprocessed Chinese news headlines. Finally, the above word vectors are fused, and a classical CNN model extracts the corresponding features and completes the classification.

Fig. 1 The proposed KFNHC method

3.1 Huseg Method

Previous classification methods usually use word sequences to capture latent meanings and obtain better classification results. However, different segmentation tools usually adopt different segmentation granularities, which leads to different segmentation results. In addition, Chinese word segmentation inevitably introduces errors, ambiguities, and inconsistencies, so word-based models often encounter problems caused by segmentation that affect the final results.

The Huseg method proposed in this paper builds on the earlier voting approach. In the preprocessing stage, three common word segmentation tools, Jieba, Pkuseg, and Thuseg, are applied to each text; instead of directly discarding the low-frequency segmentation results, the outputs of the three tools are merged and saved in the same file for subsequent processing. A minimal sketch of this merging step is given below.
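The following Python sketch illustrates the merging idea under our own assumptions; the function name (`huseg_merge`) and the way the three segmenters' outputs are represented are illustrative only, not the paper's exact implementation.

```python
# A minimal sketch of the Huseg merging idea (illustrative, not the paper's exact code).
# It assumes three token lists have already been produced by Jieba, Pkuseg, and Thuseg.
from collections import Counter

def huseg_merge(jieba_tokens, pkuseg_tokens, thuseg_tokens, min_votes=1):
    """Merge the outputs of three segmenters, keeping every token that appears in at
    least `min_votes` of them (min_votes=1 keeps them all, i.e. low-frequency
    segmentation results are retained rather than discarded)."""
    votes = Counter()
    for tokens in (jieba_tokens, pkuseg_tokens, thuseg_tokens):
        for tok in set(tokens):
            votes[tok] += 1
    # Preserve the order of first appearance across the three results.
    merged, seen = [], set()
    for tok in jieba_tokens + pkuseg_tokens + thuseg_tokens:
        if votes[tok] >= min_votes and tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged

# Example with hypothetical segmentations of the same headline:
# huseg_merge(["高考", "分数线", "公布"],
#             ["高考", "分数", "线", "公布"],
#             ["高考", "分数线", "公布"])
```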

3.2 TRLDA Model

Traditional topic modeling methods have achieved great success in several fields of natural language processing. However, the limited length of a news headline does not provide enough information for a model to discover the latent information carried by semantics and syntax. This flaw seriously affects the estimation of the document-topic distribution, lowers the accuracy of the generated topics, and thus degrades the classification accuracy of short texts. To address this problem, the TRLDA model is proposed in this paper, and its structure is shown in Fig. 2.

Fig. 2 TRLDA model

First, we observe that topic modeling directly on a single news headline is not accurate enough, while similar documents share similar patterns in syntactic structure and semantic information. We therefore compute a text vector for each news headline from its word vectors and calculate the cosine similarity between headline vectors \(W_p\) and \(W_q\) according to Eq. (1); the resulting distributed vector representations of news headlines capture semantic information and overcome the inability of the bag-of-words model to reflect the relative importance of words.

$$\begin{aligned} S_{(p,q)}=\frac{W_p \cdot W_q}{\Vert W_p \Vert _2\times \Vert W_q \Vert _2} \end{aligned}$$
(1)
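As a minimal sketch of Eq. (1), assuming headline vectors are obtained by averaging pre-trained word vectors (our assumption; the paper does not fix the aggregation) and that `wv` is a word-to-vector mapping such as gensim KeyedVectors, the similarity and the grouping of similar headlines into longer pseudo-texts could look like this; the threshold value is a placeholder:

```python
import numpy as np

def headline_vector(tokens, wv):
    """Average the pre-trained word vectors of the tokens (assumed aggregation)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine_similarity(wp, wq):
    """Eq. (1): S_(p,q) = (Wp . Wq) / (||Wp||_2 ||Wq||_2)."""
    denom = np.linalg.norm(wp) * np.linalg.norm(wq)
    return float(wp @ wq) / denom if denom else 0.0

def build_pseudo_texts(headlines, wv, threshold=0.7):
    """Greedily merge headlines whose vectors are similar enough into pseudo-texts."""
    groups, centers = [], []
    for tokens in headlines:
        v = headline_vector(tokens, wv)
        sims = [cosine_similarity(v, c) for c in centers]
        if sims and max(sims) >= threshold:
            groups[int(np.argmax(sims))].extend(tokens)
        else:
            groups.append(list(tokens))
            centers.append(v)
    return groups
```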

Next, the LDA model is trained on the training set composed of the long pseudo-texts described above. First, LDA extracts the topics of the dataset and generates the topic-word matrix. Then a word vector model is obtained by training Word2Vec on the dataset, and the similarity between words under each topic is calculated with the cosine measure and used as the edge weight between two nodes. A threshold is set to filter out word pairs with low weights, and the remaining pairs are connected as edges.

Finally, the TextRank method iterates over this graph, and the top n words with the highest scores are taken as the keywords of the topic. The TextRank update is shown in Eq. (2), where \(\textrm{WS}(V_i)\) denotes the weight of node i, the summation on the right-hand side expresses the contribution of each neighboring node to node i, \(W_{ji}\) denotes the similarity between the two nodes, \(\textrm{WS}(V_j)\) denotes the weight of node j from the previous iteration, and d is the damping factor, set to 0.85.

$$\begin{aligned} \textrm{WS}(V_i)=(1-d)+d\times \sum \limits _{V_j\in In(V_i)}\frac{W_{ji}}{\sum _{V_k\in \textrm{Out}(V_j)}W_{jk}}\textrm{WS}(V_j) \end{aligned}$$
(2)
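A minimal sketch of this step, assuming gensim Word2Vec vectors and using networkx's weighted PageRank as the TextRank iteration (Eq. (2) shares its fixed point); the threshold and top-n values are placeholders, not the paper's settings:

```python
import networkx as nx

def topic_keywords(topic_words, wv, threshold=0.5, top_n=20, d=0.85):
    """Build a word graph whose edge weights are Word2Vec cosine similarities,
    drop edges below the threshold, and rank nodes with weighted PageRank."""
    graph = nx.Graph()
    words = [w for w in topic_words if w in wv]
    graph.add_nodes_from(words)
    for i, wi in enumerate(words):
        for wj in words[i + 1:]:
            sim = float(wv.similarity(wi, wj))   # cosine similarity of the two word vectors
            if sim >= threshold:
                graph.add_edge(wi, wj, weight=sim)
    scores = nx.pagerank(graph, alpha=d, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```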

Then the top n words with the highest probability under each latent topic are normalized by Eq. (3), which converts the part-of-speech-weighted probabilities into the weight vector of the latent topic. Here \(C_i\) is the probability value of each candidate keyword and \(\omega\) is the weight assigned to its word class.

$$\begin{aligned} P_{\textrm{LDA}} = \frac{\omega \times C_i}{\sum _{i=1}^T \omega \times C_i} \end{aligned}$$
(3)
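A small sketch of the normalization in Eq. (3); the part-of-speech weights passed in correspond to the \(\omega\) values discussed around Eq. (5):

```python
def normalize_topic_weights(probs, pos_weights):
    """Eq. (3): P_LDA_i = (w_i * C_i) / sum_j (w_j * C_j).
    probs        -- topic-word probabilities C_i for the top-n keywords
    pos_weights  -- part-of-speech weight w_i for each keyword"""
    weighted = [w * c for w, c in zip(pos_weights, probs)]
    total = sum(weighted)
    return [v / total for v in weighted] if total else weighted
```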

3.3 The Improved TF-IDF Algorithm

TF-IDF has frequently been used as a term weighting scheme in text classification and obtains good results in traditional tasks, but the traditional IDF term ignores document category information and therefore cannot reflect how well a word discriminates between categories, so it fails to adjust weights correctly. We propose an improved TF-IDF algorithm that focuses on learning the semantic information of words. While retaining TF-IDF's ability to highlight important words and suppress minor ones, it measures a term's contribution to classification along category boundaries, that is, how strongly the term characterizes a category within the given document set.

For example, in the education category, words such as "score line" and "apply for the examination" often appear many times, while their probability of appearing in other categories is very low. Such words, which are much more likely to occur in one category than in the others, effectively capture domain knowledge and distinguish between categories; they serve as good category identifiers and are given higher weights. In the improved formula, we therefore use the frequency of a word in the current category and in the other categories to calculate its weight. The formula and its derivation are given in Eq. (4), where \(P_t\) denotes the frequency of the word in the current category and \(P_o\) denotes its frequency in the other categories.

$$\begin{aligned} \textrm{TIF}=-\log \left( 1-\frac{P_t}{P_t + P_o}\right) \times P_t=\log \left( 1 + \frac{P_t}{P_o}\right) \times P_t \end{aligned}$$
(4)
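A minimal sketch of the category-aware weight in Eq. (4), computed from per-category term frequencies; the data layout (a dict mapping category to term counts) and the smoothing constant are our own assumptions:

```python
import math

def improved_tf_idf(term, category, category_term_freq):
    """Eq. (4): TIF = log(1 + P_t / P_o) * P_t, where P_t is the term's frequency
    in the current category and P_o its frequency in all other categories.
    category_term_freq: dict mapping category -> {term: frequency}."""
    p_t = category_term_freq[category].get(term, 0)
    p_o = sum(freqs.get(term, 0)
              for cat, freqs in category_term_freq.items() if cat != category)
    if p_t == 0:
        return 0.0
    # Small constant added to P_o to avoid division by zero for category-exclusive terms.
    return math.log(1 + p_t / (p_o + 1e-9)) * p_t
```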

In practice, words of different parts of speech contribute differently to semantic expression. In some texts, words other than nouns and verbs may occur more frequently, yet they identify the topic poorly and add noise to the short text. Considering the sparsity of news headlines, which most previous work ignores, simply filtering by part of speech does not reflect the importance of lexical features for feature selection and may even harm the classification result. We therefore assign different weights to different word classes so that their contribution to semantic expression is better reflected. Since nouns and verbs matter most for the semantics of a sentence, followed by adjectives and adverbs, we define the part-of-speech weights as in Eq. (5), where h is the weight of the corresponding part of speech and \(\alpha\) > \(\beta\) > \(\gamma\) > 0.

$$\begin{aligned} h= \left\{ \begin{array}{ll} \alpha , & \text {n. or v.}\\ \beta , & \text {adj. or adv.}\\ \gamma , & \text {otherwise} \end{array}\right. \end{aligned}$$
(5)
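A sketch of how the part-of-speech weight h in Eq. (5) could be combined with the Eq. (4) weight; the concrete values \(\alpha=1.0\), \(\beta=0.7\), \(\gamma=0.3\) are placeholders that merely respect \(\alpha > \beta > \gamma > 0\), and the tag prefixes assume a Jieba-style POS tag set:

```python
def pos_weight(pos_tag, alpha=1.0, beta=0.7, gamma=0.3):
    """Eq. (5): nouns/verbs get the largest weight, adjectives/adverbs the next,
    all other word classes the smallest (alpha > beta > gamma > 0).
    Assumes Jieba-style tags: 'n*' nouns, 'v*' verbs, 'a*' adjectives, 'd*' adverbs."""
    if pos_tag.startswith(("n", "v")):
        return alpha
    if pos_tag.startswith(("a", "d")):
        return beta
    return gamma

def weighted_term_score(tif_score, pos_tag):
    """Combine the improved TF-IDF weight (Eq. (4)) with the POS weight (Eq. (5))."""
    return pos_weight(pos_tag) * tif_score
```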

3.4 Feature Fusion and Classification

Before applying the CNN, we first concatenate the explicit and implicit feature vectors obtained in the previous sections. Using the pre-trained Word2Vec vectors, we obtain the topic vectors generated by the TRLDA topic model and the weighted news headline vector representations produced by the improved TF-IDF algorithm; then the cosine similarity between each news headline vector and each topic vector is calculated, and the headline vector is concatenated with its most similar topic vector to generate the corresponding vector matrix. The process is shown in Fig. 3, and a sketch follows the figure.

Fig. 3 Topic vector connection process
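A minimal numpy sketch of this connection step, under the assumption that topic vectors and headline vectors have the same dimensionality; the helper names are illustrative:

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two vectors (Eq. (1))."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def fuse_with_best_topic(headline_vecs, topic_vecs):
    """For each weighted headline vector, find the most similar topic vector by
    cosine similarity and concatenate the two into one fused feature vector."""
    fused = []
    for h in headline_vecs:
        sims = [_cos(h, t) for t in topic_vecs]
        best_topic = topic_vecs[int(np.argmax(sims))]
        fused.append(np.concatenate([h, best_topic]))
    return np.stack(fused)   # shape: (num_headlines, 2 * embedding_dim)
```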

After vector concatenation, this paper uses a classical CNN model for feature extraction and classification. First, a convolutional layer with one-dimensional kernels and multiple channels is applied to the vector matrix; each filter produces a feature map from its word window. We then apply max-over-time pooling to each feature map to obtain its maximum value, concatenate these maxima into a vector, and feed the vector to a fully connected layer for classification. Dropout is applied to the fully connected layer to prevent overfitting, and the L2 norm of the weight vectors is constrained. Finally, a softmax layer produces the classification result.
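A compact PyTorch sketch of such a TextCNN classifier, using the kernel sizes and filter count listed in Sect. 4 but otherwise assumed dimensions; it illustrates the classical architecture, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, embed_dim, num_classes=10, num_filters=256,
                 kernel_sizes=(2, 3, 4), dropout=0.5):
        super().__init__()
        # One Conv2d per kernel size; each kernel spans the full embedding dimension.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                    # x: (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)                   # add channel dim: (batch, 1, seq_len, embed_dim)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)   # (batch, num_filters, seq_len - k + 1)
            feats.append(F.max_pool1d(c, c.size(2)).squeeze(2))  # max-over-time pooling
        out = self.dropout(torch.cat(feats, dim=1))
        return self.fc(out)                  # softmax is folded into CrossEntropyLoss at training time
```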

4 Experimental Setup

Dataset: The experimental data come from the news corpus provided by Sogou Laboratory. We extract 200,000 news headlines divided into ten categories: finance and economics, real estate, stock, education, science and technology, society, current politics, sports, games, and entertainment. The task is to assign each news headline to one of the ten categories. There are 20,000 headlines in each domain, and the text length ranges from 5 to 25. The data are split into training, validation, and test sets in a ratio of 8:1:1, with no overlap between the three sets. The statistics of the dataset are summarized in Table 1.

Table 1 Dataset overview

Experimental parameters: The LDA topic model is trained with Gibbs sampling. Its parameters are set as follows: number of topics K = 10, hyperparameters a = 0.01 and b = 0.01, and number of keywords num = 20. Word vectors are trained on the dataset with the skip-gram model in the Word2vec tool. A CNN classifies the feature-expanded text with the following settings: convolution kernel sizes of 2 \(\times\) dim, 3 \(\times\) dim, and 4 \(\times\) dim; 256 convolution kernels; batch size 128; learning rate 0.001. To prevent overfitting, dropout is set to 0.5 during training.
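For reference, these settings can be gathered into a single configuration, sketched below with gensim's skip-gram Word2Vec; the vector dimension and window size are assumptions, since the paper does not report them:

```python
from gensim.models import Word2Vec

# Hyperparameters reported in the paper (LDA and CNN); word2vec dim/window are assumed.
CONFIG = {
    "lda": {"num_topics": 10, "alpha": 0.01, "beta": 0.01, "num_keywords": 20},
    "cnn": {"kernel_sizes": (2, 3, 4), "num_filters": 256,
            "batch_size": 128, "learning_rate": 0.001, "dropout": 0.5},
}

def train_skipgram(tokenized_corpus, dim=300, window=5):
    """Train skip-gram word vectors (sg=1) on the segmented corpus.
    Note: the dimension argument is `vector_size` in gensim >= 4.0 and `size` in gensim 3.x."""
    return Word2Vec(sentences=tokenized_corpus, vector_size=dim,
                    window=window, sg=1, min_count=1)
```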

Evaluation indicators: Accuracy (ACC) and the F1 score are used as evaluation metrics. Accuracy is the proportion of correctly classified samples, and the F1 score is the harmonic mean of precision and recall. The formulas are shown in Eq. (6), where TP is the number of positive samples predicted as positive, TN is the number of negative samples predicted as negative, and P and N are the numbers of positive and negative samples, respectively.

$$\begin{aligned} \begin{aligned}&\textrm{ACC}=\frac{\textrm{TP}+\textrm{TN}}{P+N}\\&F1=\frac{2 \times \textrm{Precision}\times \textrm{Recall}}{\textrm{Precision}+\textrm{Recall}} \end{aligned} \end{aligned}$$
(6)
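In practice, both metrics can be computed with scikit-learn; macro averaging over the ten classes is our assumption, since the paper does not state the averaging mode:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Return accuracy and F1 as in Eq. (6); macro-averaging over classes is assumed."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```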

Baseline model: The comparison baseline model used in this article is as follows:

  • TextCNN: TextCNN uses multiple convolutions to extract multiple features, and max pooling retains the most important information.

  • TextRNN: The LSTM in TextRNN can better capture long-range semantic relations, but it is slow because its recursive structure cannot be computed in parallel.

  • TextRNN_Att: The computation process of attention mechanism in TextRNN_Att is actually a weighted average of the hidden layers of LSTM at each moment.

  • TextRCNN: TextRCNN uses a recurrent structure in which, at each time step, the hidden state of a bidirectional LSTM is concatenated with the word embedding to represent a word; a max pooling layer then filters out the useful feature information.

  • DPCNN: The region embedding of DPCNN stacks convolutional layers after removing the pooling layer of TextCNN, which is equivalent to applying N-grams on top of N-grams; each position in deeper layers fuses more information, so the last layer extracts the semantic information of the whole sequence.

  • Sentence BERT [27]: Sentence BERT uses Siamese and triplet network structures to obtain semantically meaningful sentence embeddings and uses cosine similarity to find semantically similar sentences.

  • Fine-tuned Sentence BERT [28]: The fine-tuned Sentence BERT used in this paper combines entity embeddings with Sentence BERT, adds FastText and other baseline models, and uses voting to integrate the final classifications predicted by the multiple base models.

Experimental environment: The experimental environment of this paper is as follows:

  • The operating system: Ubuntu Linux release 16.04.7 LTS

  • CPU: Intel(R) Xeon(R) Silver CPU @ 2.20 GHz

  • GPU: Quadro P4000

  • Software version: Python 3.7; Pytorch 1.1.0; Numpy 1.16.2; SciPy 1.3.1; Networkx 2.4; Scikit-learn 0.21.3

5 Experimental Results and Analysis

As shown in Fig. 4, we first compared the effect of different word segmentation methods on the classification results, using Jieba, Pkuseg, Thuseg, and the Huseg method proposed in this paper. Huseg outperforms the other methods in both classification accuracy and F1 score, so it is used as the segmentation method in all subsequent experiments.

Fig. 4 ACC and F1 comparison of different word segmentation methods

The experimental results of all models are shown in Table 2. Our model outperforms the other deep neural network baselines on this dataset. Among the baselines, FastText performs best; it trains its own word vectors without pre-training and improves accuracy without sacrificing training or testing speed. Transformer, although the most widely used, performs worst. The method based on Sentence BERT achieves an accuracy of 91.24\(\%\); it maps text into a compact representation that is convenient to store, so its computation is fast. The fine-tuned Sentence BERT method achieves an accuracy of 92.5\(\%\), but due to its underlying architecture it occupies a large amount of memory and runs significantly slower than Sentence BERT.

Table 2 Accuracy and F1 comparison of different classification methods

The comparison of classification accuracy under different epochs is shown in Fig. 5. The horizontal axis is the number of training epochs and the vertical axis is the accuracy of each model. Our method achieves higher accuracy than all other models, reaches its optimum after the sixth epoch, and stabilizes earliest. The proposed Chinese news headline classification method is therefore better than the other methods in both accuracy and stability.

Fig. 5 Comparison of model accuracy under different epochs

6 Conclusions

To address the poor performance of traditional classification methods on news headlines, which are short and contain weakly associated words, this paper proposes the KFNHC method. To improve keyword quality, the Huseg method is used during text preprocessing to reduce the errors, ambiguities, and inconsistencies introduced by word segmentation. At the same time, the notion of term words in news headlines is strengthened with domain knowledge, and the most appropriate topic words are selected for expansion, introducing high-quality expanded words. The experimental results show that the method is effective for the Chinese news headline classification task and significantly improves the classification results. However, keyword expansion involves a large number of word vector distance calculations, which increases computation time. The time complexity will be addressed in future work to further improve the efficiency of news headline classification.