
A deep-learning based citation count prediction model with paper metadata semantic features

Published in: Scientometrics

Abstract

Predicting the impact of academic papers can help scholars quickly identify high-quality papers in a field, and how to build efficient predictive models for evaluating the potential of papers has attracted increasing attention in academia. Many studies have shown that early citations improve predictions of a paper's long-term impact, and beyond early citations, bibliometric and altmetric features have also been explored for this task. Furthermore, paper metadata text such as the title, abstract, and keywords contains valuable information that affects citation counts; however, existing studies ignore the semantic information contained in this text. In this paper, we propose a novel citation prediction model based on paper metadata text to predict the long-term citation count; the core of our model is extracting semantic information from the metadata text. We use deep learning techniques to encode the metadata text and then extract high-level semantic features for learning the citation prediction task, and we also integrate early citations to improve the model's prediction performance. We show that our proposed model outperforms state-of-the-art models in predicting the long-term citation counts of papers, and that metadata semantic features are effective for improving the accuracy of citation prediction models.
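The abstract's overall pipeline, encoding metadata text into semantic features and combining them with early citation counts to predict long-term citations, can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses deep-learning text encoders, while the toy data, the hashed bag-of-words stand-in for the semantic encoder, and the linear predictor below are all illustrative assumptions.

```python
import math

# Hypothetical toy data: (title/abstract text, early citation counts for
# years 1-2, long-term citation count). Purely illustrative values.
PAPERS = [
    ("deep learning for citation prediction", [5, 9], 40),
    ("a survey of bibliometric indicators", [2, 3], 12),
    ("neural semantic features of paper metadata", [6, 11], 48),
    ("impact factor and journal rankings", [1, 2], 8),
]

DIM = 16  # size of the toy text-embedding


def encode_text(text: str) -> list:
    """Hashed bag-of-words: a crude stand-in for a deep semantic encoder."""
    vec = [0.0] * DIM
    for w in text.split():
        vec[hash(w) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def features(text: str, early: list) -> list:
    # Concatenate text features with (scaled) early citations, plus a bias.
    return encode_text(text) + [c / 10.0 for c in early] + [1.0]


# Fit a linear predictor on squared error with plain stochastic
# gradient descent; the real model learns this mapping with deep layers.
N_FEAT = DIM + 2 + 1
w = [0.0] * N_FEAT
for _ in range(2000):
    for text, early, target in PAPERS:
        x = features(text, early)
        err = sum(wi * xi for wi, xi in zip(w, x)) - target
        for i in range(N_FEAT):
            w[i] -= 0.05 * err * x[i]

for text, early, target in PAPERS:
    pred = sum(wi * xi for wi, xi in zip(w, features(text, early)))
    print(f"actual {target:3d}  predicted {pred:6.1f}")
```

With more parameters than training papers, the toy predictor fits the sample almost exactly; the interesting part in the paper is replacing the hashed bag-of-words with a learned deep encoding of the metadata text.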


Data availability

The code used for this study is available at https://github.com/MatrixLabDUT/CitationPrediction.

Code availability

The code used for this study is available at https://github.com/MatrixLabDUT/CitationPrediction.


Acknowledgements

This work was supported by the Natural Science Foundation of China (Grant No. 61672128).

Author information

Authors and Affiliations

Authors

Contributions

AM: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data curation, Writing (original draft, review and editing), Visualization. YL: Writing (review and editing), Supervision, Project administration, Funding acquisition. XX: Writing (review and editing), Supervision, Project administration. TD: Methodology, Software, Formal analysis, Writing (review and editing).

Corresponding author

Correspondence to Yu Liu.

Ethics declarations

Conflict of interest

The authors have no conflicts of interest to declare that are relevant to the content of this article.

Rights and permissions

Reprints and permissions

About this article


Cite this article

Ma, A., Liu, Y., Xu, X. et al. A deep-learning based citation count prediction model with paper metadata semantic features. Scientometrics 126, 6803–6823 (2021). https://doi.org/10.1007/s11192-021-04033-7
