Abstract
Predicting the impact of academic papers can help scholars quickly identify high-quality papers in their field. Developing efficient predictive models for evaluating the potential of papers has attracted increasing attention in academia. Many studies have shown that early citations improve the prediction of a paper's long-term impact. Besides early citations, bibliometric and altmetric features have also been explored for predicting the impact of academic papers. Furthermore, paper metadata text such as the title, abstract, and keywords contains valuable information that affects a paper's citation count. However, existing studies ignore the semantic information contained in the metadata text. In this paper, we propose a novel citation prediction model that uses paper metadata text to predict the long-term citation count; the core of the model is extracting semantic information from the metadata text. We use deep learning techniques to encode the metadata text and then extract high-level semantic features for the citation prediction task. We also integrate early citations to improve the model's prediction performance. We show that the proposed model outperforms state-of-the-art models in predicting the long-term citation count of papers, and that metadata semantic features are effective for improving the accuracy of citation prediction models.
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-021-04033-7/MediaObjects/11192_2021_4033_Fig1_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-021-04033-7/MediaObjects/11192_2021_4033_Fig2_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-021-04033-7/MediaObjects/11192_2021_4033_Fig3_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-021-04033-7/MediaObjects/11192_2021_4033_Fig4_HTML.png)
![](http://media.springernature.com/m312/springer-static/image/art%3A10.1007%2Fs11192-021-04033-7/MediaObjects/11192_2021_4033_Fig5_HTML.png)
Data availability
The code used for this study is available at https://github.com/MatrixLabDUT/CitationPrediction.
Code availability
The code used for this study is available at https://github.com/MatrixLabDUT/CitationPrediction.
Acknowledgements
This work was supported by the Natural Science Foundation of China grant 61672128.
Author information
Contributions
AM: Conceptualization, Methodology, Software, Formal analysis, Investigation, Data Curation, Writing (Original Draft), Writing (Review & Editing), Visualization. YL: Writing (Review & Editing), Supervision, Project administration, Funding acquisition. XX: Writing (Review & Editing), Supervision, Project administration. TD: Methodology, Software, Formal analysis, Writing (Review & Editing).
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to declare that are relevant to the content of this article.
About this article
Cite this article
Ma, A., Liu, Y., Xu, X. et al. A deep-learning based citation count prediction model with paper metadata semantic features. Scientometrics 126, 6803–6823 (2021). https://doi.org/10.1007/s11192-021-04033-7
DOI: https://doi.org/10.1007/s11192-021-04033-7