1 Introduction

With the advent of the 5G era, the rapid development of Internet technology has brought people convenience but also many challenges. The short text data we encounter every day, such as search snippets, microblogs, and news headlines, contains a large amount of valuable information [1]. However, most existing short text classification methods focus on texts of a few dozen words, such as microblogs, and rarely consider short texts with even fewer words, such as news headlines. News headline classification determines the domain to which a headline belongs mainly on the basis of its semantics [2]. Since news headlines are short sentences whose semantics are compressed into weakly related words, conventional short text classification methods do not classify them effectively. Yet high-quality headline classification contributes directly to news content categorization and saves considerable computational overhead; its main applications include domain-specific machine translation [3] and false information detection [4].

Although many machine learning algorithms and deep neural network methods perform well in short text classification, they do not perform well in news headline classification [5]. This is because news headlines generally contain little text whose words are only weakly related to each other, so previous short text classification methods have difficulty handling them effectively. Many scholars have therefore sought new methods for this problem. Fan et al. [6] proposed a co-occurrence-based short text feature expansion method, Hu et al. [7] proposed a method combining CNN with support vector machines, and Zhihao et al. [8] began to use graph convolutional networks for headline classification. All of these methods improve the generalization ability of machine learning in the face of new situations to some extent and also raise classification accuracy, but none of them fundamentally solves the problem of inefficient news headline classification.

Since the degree of association between words within a news headline is low, the implicit associations between words contribute more to classifying headlines, and expanding the feature space and collecting contextual information can help reveal these implicit associations. To address this issue, some studies use external resources to enrich text features. Kanu et al. [9] store information in a language-independent format so that it can be retrieved across multiple geographical locations, alleviating the shortage of corpus data. However, this approach demands high-quality external resources, and inappropriate resources may introduce new noise; it is also less portable and relatively time-consuming. In contrast, expansion based on internal semantic association information has developed rapidly in recent years. The word2vec model proposed in [10] has been widely used. Lilleberg et al. [11] combined word2vec word vectors with a latent Dirichlet allocation (LDA) model to generate corresponding topic features, and Li et al. [12] proposed a topic-based CNN model that combines the LDA topic model with a deep neural network. However, the lack of word co-occurrence information seriously hinders the generation of document topic distributions, so traditional topic modeling methods cannot achieve satisfactory results in domain headline topic modeling.

Based on this, this paper proposes a new method for news headline classification based on keyword feature expansion (KFNHC). The method makes full use of the existing data, considers the feature information that the text carries for each category, and strictly filters the results generated by topic modeling to ensure the accuracy of the expanded topic words. Without using any external data, keyword feature expansion is performed on news headlines at both the sentence and word levels. The main contributions of this paper are as follows:

First, the Huseg method is proposed for the preprocessing stage to reduce the errors and inconsistencies of Chinese word segmentation, and thus reduce the loss of keywords during preprocessing.

Second, a keyword feature expansion method is proposed, which uses an LDA model combined with the TextRank algorithm (TRLDA) and an improved term frequency-inverse document frequency (TF-IDF) algorithm to expand the keyword features of short texts at the sentence and word levels, respectively.

Finally, extensive and complete experiments are conducted on the Sogou News dataset. The results prove that our approach outperforms existing similar models.

The rest of this paper is organized as follows. Related work on news headline classification is reviewed in Sect. 2, the method proposed in this paper is described in detail in Sect. 3, the experimental setup is described in Sect. 4, the experimental results are discussed in Sect. 5, and the conclusions are presented in Sect. 6.

2 Related Work

News headline classification plays an important role in many applications of natural language processing (NLP). Existing technology generally treats it as a short text classification problem. However, due to the unique nature of news headlines, most short text classification techniques cannot handle this task effectively. In this section, we review the technical modules corresponding to the improvements in the proposed model, covering three parts: methods based on deep learning, topic modeling methods based on internal topic feature expansion, and term weighting algorithms.

With the development of deep learning, some deep learning models have been introduced into short text classification [13]. However, the text information extracted by CNN mainly stays at the level of words and characters without considering other semantic features such as topic features. As research has deepened, topic modeling methods such as latent semantic analysis (LSA) [14] and LDA [15] have been applied to short text classification; scholars use these methods to model and extend the topics of texts and to discover the hidden semantic topic information in documents. However, news headlines do not provide enough information to support topic modeling because of their limited sentence length and the weak correlation between their words, which reduces the accuracy of the generated topics and leads to poor classification results. To address this problem, Qiang et al. [16] use a Bi-GRU to resolve the ambiguity of Chinese vocabulary and the complexity of Chinese grammar rules. It is also possible to discover hidden topic structures with external data such as Wikipedia [17], relying mainly on domain-specific vocabularies and synonym lists to obtain the corresponding external knowledge, but this approach is neither universally applicable nor scalable.

Different from previous studies that rely on external knowledge or pre-trained expansions of topic features, researchers have recently focused on designing new topic modeling approaches. These are all fundamentally aimed at improving the expressive power of text semantics, but they rarely guarantee the validity of the expanded words. To address this problem, Li et al. [18] proposed a method that requires no external knowledge: it analyzes short texts with LDA-based topic models, represents each short text with a vector space model, and introduces a new vector adjustment method for short texts. Feng et al. [19] proposed a regularized non-negative matrix factorization topic model for short texts, which uses pre-trained distributed word vector representations to overcome the data sparsity of short texts; it considers only the data itself and does not use external resources to expand short text features. Wang et al. and Hao et al. [20, 21] incorporated attention mechanisms into neural network models to perform semantic expansion at the sentence level and the related-word level of short texts, respectively. Reference [22] proposed a neural topic model in an autoencoding framework, which uses a new quantization method for the topic distribution to generate a peaked distribution better suited to modeling short texts, and also proposed a negative sampling decoder that avoids generating duplicate topics.

In contrast, term weighting methods have long been among the most common processing methods in text classification. According to [23], they can be grouped into two categories: unsupervised term weighting methods that do not consider category information, such as binary representation and TF-IDF; and supervised term weighting methods that take category information into account. Many past studies used TF-IDF-based text classification methods [24], but their simple assumption that terms with low document frequency are more important than terms with high document frequency does not hold for news headline domain classification, where the weak correlation between words and the consistency of texts within a domain pose a more severe challenge. Pascual et al. [25] improved the fuzzy random forest; an advantage of fuzzy rules is the ability to manage uncertainty and to work with linguistic scales. Fuzzy random forests achieve good classification performance on many problems, but their quality decreases when the classes are imbalanced. Building on this, Sun et al. [26] proposed the concept of topic high-frequency words and a grammatical category combination weighting method to distinguish the contributions of different keywords to the classification result. However, these term weighting methods weight the classification only from the perspective of individual words, and because of the weak correlation between the words in news headlines there is still much room to improve classification performance.

In summary, these models only exploit the explicit word co-occurrence information that can be captured in the corpus; they do not fully utilize deep semantic relationships and sentence structure, and therefore cannot accurately reflect how much each feature word contributes to its category, resulting in inaccurate text representations. Therefore, in the keyword feature expansion stage, this paper proposes a new TRLDA topic model and an improved TF-IDF algorithm to expand the keywords of news headlines at both the sentence and word levels. In the preprocessing stage, this paper also proposes the Huseg method to reduce the errors and inconsistencies of Chinese word segmentation and thereby reduce the loss of keywords in news headlines.

3 The Proposed KFNHC Method

The proposed KFNHC method mainly uses the TRLDA model and the improved TF-IDF algorithm to achieve keyword feature expansion for Chinese news headlines. As shown in Fig. 1, the overall framework consists of four main parts. First, the new Huseg method is used in the preprocessing stage to reduce the probability of keyword loss in Chinese news headlines. Second, the TRLDA model expands the keywords of each headline through dynamic construction, and the improved TF-IDF algorithm weights them. Third, the short text and its related words are conceptualized with pre-trained word vectors to generate the corresponding word vector matrices, which are used to expand the keyword features of the preprocessed Chinese news headlines. Finally, the above word vectors are fused, and a classical CNN model extracts the corresponding features and completes the classification.

Fig. 1 The proposed KFNHC method

3.1 Huseg Method

Previous classification methods usually use word sequences to capture latent meanings and obtain better classification results. However, different segmentation tools usually adopt different segmentation granularities, which leads to different segmentation results. In addition, Chinese word segmentation inevitably introduces errors, ambiguities, and inconsistencies, so word-based models often encounter problems caused by segmentation that affect the final results.

The Huseg method proposed in this paper builds on the earlier voting approach. In the preprocessing stage, three common word segmentation tools, Jieba, Pkuseg, and Thuseg, are applied to each text; instead of directly discarding the low-frequency segmentation results, the outputs of the three tools are merged and saved in the same file for subsequent processing. A minimal sketch of this merging step is given below.
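The following Python sketch illustrates the merging idea under our own assumptions; the function name (`huseg_merge`) and the way the three segmenters' outputs are represented are illustrative only, not the paper's exact implementation.

```python
# A minimal sketch of the Huseg merging idea (illustrative, not the paper's exact code).
# It assumes three token lists have already been produced by Jieba, Pkuseg, and Thuseg.
from collections import Counter

def huseg_merge(jieba_tokens, pkuseg_tokens, thuseg_tokens, min_votes=1):
    """Merge the outputs of three segmenters, keeping every token that appears in at
    least `min_votes` of them (min_votes=1 keeps them all, i.e. low-frequency
    segmentation results are retained rather than discarded)."""
    votes = Counter()
    for tokens in (jieba_tokens, pkuseg_tokens, thuseg_tokens):
        for tok in set(tokens):
            votes[tok] += 1
    # Preserve the order of first appearance across the three results.
    merged, seen = [], set()
    for tok in jieba_tokens + pkuseg_tokens + thuseg_tokens:
        if votes[tok] >= min_votes and tok not in seen:
            merged.append(tok)
            seen.add(tok)
    return merged

# Example with hypothetical segmentations of the same headline:
# huseg_merge(["高考", "分数线", "公布"],
#             ["高考", "分数", "线", "公布"],
#             ["高考", "分数线", "公布"])
```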

3.2 TRLDA Model

Traditional topic modeling methods have achieved great success in several fields of natural language processing. However, the limited length of a news headline does not provide enough information for a model to discover the latent information carried by semantics and syntax. This flaw seriously affects the estimation of the document-topic distribution, lowers the accuracy of the generated topics, and thus degrades the classification accuracy of short texts. To address this problem, the TRLDA model is proposed in this paper, and its structure is shown in Fig. 2.

Fig. 2 TRLDA model

First, we observe that topic modeling directly on a single news headline is not accurate enough, while similar documents share similar patterns in syntactic structure and semantic information. We therefore compute a text vector for each news headline from its word vectors and calculate the cosine similarity between headline vectors \(W_p\) and \(W_q\) according to Eq. (1); the resulting distributed vector representations of news headlines capture semantic information and overcome the inability of the bag-of-words model to reflect the relative importance of words.

$$\begin{aligned} S_{(p,q)}=\frac{W_p \cdot W_q}{\Vert W_p \Vert _2\times \Vert W_q \Vert _2} \end{aligned}$$
(1)
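As a minimal sketch of Eq. (1), assuming headline vectors are obtained by averaging pre-trained word vectors (our assumption; the paper does not fix the aggregation) and that `wv` is a word-to-vector mapping such as gensim KeyedVectors, the similarity and the grouping of similar headlines into longer pseudo-texts could look like this; the threshold value is a placeholder:

```python
import numpy as np

def headline_vector(tokens, wv):
    """Average the pre-trained word vectors of the tokens (assumed aggregation)."""
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine_similarity(wp, wq):
    """Eq. (1): S_(p,q) = (Wp . Wq) / (||Wp||_2 ||Wq||_2)."""
    denom = np.linalg.norm(wp) * np.linalg.norm(wq)
    return float(wp @ wq) / denom if denom else 0.0

def build_pseudo_texts(headlines, wv, threshold=0.7):
    """Greedily merge headlines whose vectors are similar enough into pseudo-texts."""
    groups, centers = [], []
    for tokens in headlines:
        v = headline_vector(tokens, wv)
        sims = [cosine_similarity(v, c) for c in centers]
        if sims and max(sims) >= threshold:
            groups[int(np.argmax(sims))].extend(tokens)
        else:
            groups.append(list(tokens))
            centers.append(v)
    return groups
```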

Next, the LDA model is trained on the training set composed of the long pseudo-texts described above. First, LDA extracts the topics of the dataset and generates the topic-word matrix. Then a word vector model is obtained by training Word2Vec on the dataset, and the similarity between words under each topic is calculated with the cosine measure and used as the edge weight between two nodes. A threshold is set to filter out word pairs with low weights, and the remaining pairs are connected as edges.

Finally, the TextRank method iterates over this graph, and the top n words with the highest scores are taken as the keywords of the topic. The TextRank update is shown in Eq. (2), where \(\textrm{WS}(V_i)\) denotes the weight of node i, the summation on the right-hand side expresses the contribution of each neighboring node to node i, \(W_{ji}\) denotes the similarity between the two nodes, \(\textrm{WS}(V_j)\) denotes the weight of node j from the previous iteration, and d is the damping factor, set to 0.85.

$$\begin{aligned} \textrm{WS}(V_i)=(1-d)+d\times \sum \limits _{V_j\in In(V_i)}\frac{W_{ji}}{\sum _{V_k\in \textrm{Out}(V_j)}W_{jk}}\textrm{WS}(V_j) \end{aligned}$$
(2)
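A minimal sketch of this step, assuming gensim Word2Vec vectors and using networkx's weighted PageRank as the TextRank iteration (Eq. (2) shares its fixed point); the threshold and top-n values are placeholders, not the paper's settings:

```python
import networkx as nx

def topic_keywords(topic_words, wv, threshold=0.5, top_n=20, d=0.85):
    """Build a word graph whose edge weights are Word2Vec cosine similarities,
    drop edges below the threshold, and rank nodes with weighted PageRank."""
    graph = nx.Graph()
    words = [w for w in topic_words if w in wv]
    graph.add_nodes_from(words)
    for i, wi in enumerate(words):
        for wj in words[i + 1:]:
            sim = float(wv.similarity(wi, wj))   # cosine similarity of the two word vectors
            if sim >= threshold:
                graph.add_edge(wi, wj, weight=sim)
    scores = nx.pagerank(graph, alpha=d, weight="weight")
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```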

Then the top n words with the highest probability under each latent topic are normalized by Eq. (3), which converts the part-of-speech-weighted probabilities into the weight vector of the latent topic. Here \(C_i\) is the probability value of each candidate keyword and \(\omega\) is the weight assigned to its word class.

$$\begin{aligned} P_{\textrm{LDA}} = \frac{\omega \times C_i}{\sum _{i=1}^T \omega \times C_i} \end{aligned}$$
(3)
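A small sketch of the normalization in Eq. (3); the part-of-speech weights passed in correspond to the \(\omega\) values discussed around Eq. (5):

```python
def normalize_topic_weights(probs, pos_weights):
    """Eq. (3): P_LDA_i = (w_i * C_i) / sum_j (w_j * C_j).
    probs        -- topic-word probabilities C_i for the top-n keywords
    pos_weights  -- part-of-speech weight w_i for each keyword"""
    weighted = [w * c for w, c in zip(pos_weights, probs)]
    total = sum(weighted)
    return [v / total for v in weighted] if total else weighted
```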

3.3 The Improved TF-IDF Algorithm

TF-IDF has frequently been used as a term weighting scheme in text classification and obtains good results in traditional tasks, but the traditional IDF term ignores document category information and therefore cannot reflect how well a word discriminates between categories, so it fails to adjust weights correctly. We propose an improved TF-IDF algorithm that focuses on learning the semantic information of words. While retaining TF-IDF's ability to highlight important words and suppress minor ones, it measures a term's contribution to classification along category boundaries, that is, how strongly the term characterizes a category within the given document set.

For example, in the education category, words such as "score line" and "apply for the examination" often appear many times, while their probability of appearing in other categories is very low. Such words, which are much more likely to occur in one category than in the others, effectively capture domain knowledge and distinguish between categories; they serve as good category identifiers and are given higher weights. In the improved formula, we therefore use the frequency of a word in the current category and in the other categories to calculate its weight. The formula and its derivation are given in Eq. (4), where \(P_t\) denotes the frequency of the word in the current category and \(P_o\) denotes its frequency in the other categories.

$$\begin{aligned} \textrm{TIF}=-\log \left( 1-\frac{P_t}{P_t + P_o}\right) \times P_t=\log \left( 1 + \frac{P_t}{P_o}\right) \times P_t \end{aligned}$$
(4)
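A minimal sketch of the category-aware weight in Eq. (4), computed from per-category term frequencies; the data layout (a dict mapping category to term counts) and the smoothing constant are our own assumptions:

```python
import math

def improved_tf_idf(term, category, category_term_freq):
    """Eq. (4): TIF = log(1 + P_t / P_o) * P_t, where P_t is the term's frequency
    in the current category and P_o its frequency in all other categories.
    category_term_freq: dict mapping category -> {term: frequency}."""
    p_t = category_term_freq[category].get(term, 0)
    p_o = sum(freqs.get(term, 0)
              for cat, freqs in category_term_freq.items() if cat != category)
    if p_t == 0:
        return 0.0
    # Small constant added to P_o to avoid division by zero for category-exclusive terms.
    return math.log(1 + p_t / (p_o + 1e-9)) * p_t
```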

In practice, words of different parts of speech contribute differently to semantic expression. In some texts, words other than nouns and verbs may occur more frequently, yet they identify the topic poorly and add noise to the short text. Considering the sparsity of news headlines, which most previous work ignores, simply filtering by part of speech does not reflect the importance of lexical features for feature selection and may even harm the classification result. We therefore assign different weights to different word classes so that their contribution to semantic expression is better reflected. Since nouns and verbs matter most for the semantics of a sentence, followed by adjectives and adverbs, we define the part-of-speech weights as in Eq. (5), where h is the weight of the corresponding part of speech and \(\alpha\) > \(\beta\) > \(\gamma\) > 0.

$$\begin{aligned} h= \left\{ \begin{array}{ll} \alpha , & \text {n. or v.}\\ \beta , & \text {adj. or adv.}\\ \gamma , & \text {otherwise} \end{array}\right. \end{aligned}$$
(5)
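A sketch of how the part-of-speech weight h in Eq. (5) could be combined with the Eq. (4) weight; the concrete values \(\alpha=1.0\), \(\beta=0.7\), \(\gamma=0.3\) are placeholders that merely respect \(\alpha > \beta > \gamma > 0\), and the tag prefixes assume a Jieba-style POS tag set:

```python
def pos_weight(pos_tag, alpha=1.0, beta=0.7, gamma=0.3):
    """Eq. (5): nouns/verbs get the largest weight, adjectives/adverbs the next,
    all other word classes the smallest (alpha > beta > gamma > 0).
    Assumes Jieba-style tags: 'n*' nouns, 'v*' verbs, 'a*' adjectives, 'd*' adverbs."""
    if pos_tag.startswith(("n", "v")):
        return alpha
    if pos_tag.startswith(("a", "d")):
        return beta
    return gamma

def weighted_term_score(tif_score, pos_tag):
    """Combine the improved TF-IDF weight (Eq. (4)) with the POS weight (Eq. (5))."""
    return pos_weight(pos_tag) * tif_score
```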

3.4 Feature Fusion and Classification

Before applying the CNN, we first concatenate the explicit and implicit feature vectors obtained in the previous sections. Using the pre-trained Word2Vec vectors, we obtain the topic vectors generated by the TRLDA topic model and the weighted news headline vector representations produced by the improved TF-IDF algorithm; then the cosine similarity between each news headline vector and each topic vector is calculated, and the headline vector is concatenated with its most similar topic vector to generate the corresponding vector matrix. The process is shown in Fig. 3, and a sketch follows the figure.

Fig. 3 Topic vector connection process
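A minimal numpy sketch of this connection step, under the assumption that topic vectors and headline vectors have the same dimensionality; the helper names are illustrative:

```python
import numpy as np

def _cos(a, b):
    """Cosine similarity between two vectors (Eq. (1))."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

def fuse_with_best_topic(headline_vecs, topic_vecs):
    """For each weighted headline vector, find the most similar topic vector by
    cosine similarity and concatenate the two into one fused feature vector."""
    fused = []
    for h in headline_vecs:
        sims = [_cos(h, t) for t in topic_vecs]
        best_topic = topic_vecs[int(np.argmax(sims))]
        fused.append(np.concatenate([h, best_topic]))
    return np.stack(fused)   # shape: (num_headlines, 2 * embedding_dim)
```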

After vector concatenation, this paper uses a classical CNN model for feature extraction and classification. First, a convolutional layer with one-dimensional kernels and multiple channels is applied to the vector matrix; each filter produces a feature map from its word window. We then apply max-over-time pooling to each feature map to obtain its maximum value, concatenate these maxima into a vector, and feed the vector to a fully connected layer for classification. Dropout is applied to the fully connected layer to prevent overfitting, and the L2 norm of the weight vectors is constrained. Finally, a softmax layer produces the classification result.
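A compact PyTorch sketch of such a TextCNN classifier, using the kernel sizes and filter count listed in Sect. 4 but otherwise assumed dimensions; it illustrates the classical architecture, not the authors' exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, embed_dim, num_classes=10, num_filters=256,
                 kernel_sizes=(2, 3, 4), dropout=0.5):
        super().__init__()
        # One Conv2d per kernel size; each kernel spans the full embedding dimension.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes])
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):                    # x: (batch, seq_len, embed_dim)
        x = x.unsqueeze(1)                   # add channel dim: (batch, 1, seq_len, embed_dim)
        feats = []
        for conv in self.convs:
            c = F.relu(conv(x)).squeeze(3)   # (batch, num_filters, seq_len - k + 1)
            feats.append(F.max_pool1d(c, c.size(2)).squeeze(2))  # max-over-time pooling
        out = self.dropout(torch.cat(feats, dim=1))
        return self.fc(out)                  # softmax is folded into CrossEntropyLoss at training time
```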

4 Experimental Setup

Dataset: The experimental data come from the news corpus provided by Sogou Laboratory. We extract 200,000 news headlines divided into ten categories: finance and economics, real estate, stock, education, science and technology, society, current politics, sports, games, and entertainment. The task is to assign each news headline to one of the ten categories. There are 20,000 headlines in each domain, and the text length ranges from 5 to 25. The data are split into training, validation, and test sets in a ratio of 8:1:1, with no overlap between the three sets. The statistics of the dataset are summarized in Table 1.

Table 1 Dataset overview

Experimental parameters: The LDA topic model is trained with Gibbs sampling. Its parameters are set as follows: number of topics K = 10, hyperparameters a = 0.01 and b = 0.01, and number of keywords num = 20. Word vectors are trained on the dataset with the skip-gram model in the Word2vec tool. A CNN classifies the feature-expanded text with the following settings: convolution kernel sizes of 2 \(\times\) dim, 3 \(\times\) dim, and 4 \(\times\) dim; 256 convolution kernels; batch size 128; learning rate 0.001. To prevent overfitting, dropout is set to 0.5 during training.
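For reference, these settings can be gathered into a single configuration, sketched below with gensim's skip-gram Word2Vec; the vector dimension and window size are assumptions, since the paper does not report them:

```python
from gensim.models import Word2Vec

# Hyperparameters reported in the paper (LDA and CNN); word2vec dim/window are assumed.
CONFIG = {
    "lda": {"num_topics": 10, "alpha": 0.01, "beta": 0.01, "num_keywords": 20},
    "cnn": {"kernel_sizes": (2, 3, 4), "num_filters": 256,
            "batch_size": 128, "learning_rate": 0.001, "dropout": 0.5},
}

def train_skipgram(tokenized_corpus, dim=300, window=5):
    """Train skip-gram word vectors (sg=1) on the segmented corpus.
    Note: the dimension argument is `vector_size` in gensim >= 4.0 and `size` in gensim 3.x."""
    return Word2Vec(sentences=tokenized_corpus, vector_size=dim,
                    window=window, sg=1, min_count=1)
```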

Evaluation indicators: Accuracy (ACC) and the F1 score are used as evaluation metrics. Accuracy is the proportion of correctly classified samples, and the F1 score is the harmonic mean of precision and recall. The formulas are shown in Eq. (6), where TP is the number of positive samples predicted as positive, TN is the number of negative samples predicted as negative, and P and N are the numbers of positive and negative samples, respectively.

$$\begin{aligned} \begin{aligned}&\textrm{ACC}=\frac{\textrm{TP}+\textrm{TN}}{P+N}\\&F1=\frac{2 \times \textrm{Precision}\times \textrm{Recall}}{\textrm{Precision}+\textrm{Recall}} \end{aligned} \end{aligned}$$
(6)
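In practice, both metrics can be computed with scikit-learn; macro averaging over the ten classes is our assumption, since the paper does not state the averaging mode:

```python
from sklearn.metrics import accuracy_score, f1_score

def evaluate(y_true, y_pred):
    """Return accuracy and F1 as in Eq. (6); macro-averaging over classes is assumed."""
    return {
        "acc": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```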

Baseline model: The comparison baseline model used in this article is as follows:

  • TextCNN: TextCNN uses multiple convolutions to extract multiple features, and max pooling retains the most important information.

  • TextRNN: The LSTM in TextRNN can better capture long-range semantic relations, but it is slow because its recursive structure cannot be computed in parallel.

  • TextRNN_Att: The computation process of attention mechanism in TextRNN_Att is actually a weighted average of the hidden layers of LSTM at each moment.

  • TextRCNN: TextRCNN uses a recurrent structure in which, at each time step, the hidden state of a bidirectional LSTM is concatenated with the word embedding to represent a word; a max pooling layer then filters out the useful feature information.

  • DPCNN: The region embedding of DPCNN stacks convolutional layers after removing the pooling layer of TextCNN, which is equivalent to applying N-grams on top of N-grams; each position in deeper layers fuses more information, so the last layer extracts the semantic information of the whole sequence.

  • Sentence BERT [27]: Sentence BERT uses Siamese and triplet network structures to obtain semantically meaningful sentence embeddings and uses cosine similarity to find semantically similar sentences.

  • Fine-tuned Sentence BERT [28]: The fine-tuned Sentence BERT used in this paper combines entity embeddings with Sentence BERT, adds FastText and other baseline models, and uses voting to integrate the final classifications predicted by the multiple base models.

Experimental environment: The experimental environment of this paper is as follows:

  • The operating system: Ubuntu Linux release 16.04.7 LTS

  • CPU: Intel(R) Xeon(R) Silver CPU @ 2.20 GHz

  • GPU: Quadro P4000

  • Software version: Python 3.7; Pytorch 1.1.0; Numpy 1.16.2; SciPy 1.3.1; Networkx 2.4; Scikit-learn 0.21.3

5 Experimental Results and Analysis

As shown in Fig. 4, we first compared the effect of different word segmentation methods on the classification results, using Jieba, Pkuseg, Thuseg, and the Huseg method proposed in this paper. Huseg outperforms the other methods in both classification accuracy and F1 score, so it is used as the segmentation method in all subsequent experiments.

Fig. 4 ACC and F1 comparison of different word segmentation methods

The experimental results of all models are shown in Table 2. Our model outperforms the other deep neural network baselines on this dataset. Among the baselines, FastText performs best; it trains its own word vectors without pre-training and improves accuracy without sacrificing training or testing speed. Transformer, although the most widely used, performs worst. The method based on Sentence BERT achieves an accuracy of 91.24\(\%\); it maps text into a compact representation that is convenient to store, so its computation is fast. The fine-tuned Sentence BERT method achieves an accuracy of 92.5\(\%\), but due to its underlying architecture it occupies a large amount of memory and runs significantly slower than Sentence BERT.

Table 2 Accuracy and F1 comparison of different classification methods

The comparison of classification accuracy under different epochs is shown in Fig. 5. The horizontal axis is the number of training epochs and the vertical axis is the accuracy of each model. Our method achieves higher accuracy than all other models, reaches its optimum after the sixth epoch, and stabilizes earliest. The proposed Chinese news headline classification method is therefore better than the other methods in both accuracy and stability.

Fig. 5 Comparison of model accuracy under different epochs

6 Conclusions

To address the poor performance of traditional classification methods on news headlines, which are short and contain weakly associated words, this paper proposes the KFNHC method. To improve keyword quality, the Huseg method is used during text preprocessing to reduce the errors, ambiguities, and inconsistencies introduced by word segmentation. At the same time, the notion of term words in news headlines is strengthened with domain knowledge, and the most appropriate topic words are selected for expansion, introducing high-quality expanded words. The experimental results show that the method is effective for the Chinese news headline classification task and significantly improves the classification results. However, keyword expansion involves a large number of word vector distance calculations, which increases computation time. The time complexity will be addressed in future work to further improve the efficiency of news headline classification.