Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature

Scientometrics 85, 257–270 (2010)

Abstract

The most popular measure of the impact of a biomedical article is its citation count, the number of citations the article has received. Its most significant limitation is that it cannot be used to evaluate an article at the time of publication, because citations accumulate over time. This work presents computer models that accurately predict the citation counts of biomedical publications within a deep horizon of 10 years, using only predictive information available at publication time. Our experiments show that it is indeed feasible to accurately predict future citation counts with a mixture of content-based and bibliometric features using machine learning methods. The models pave the way for practical prediction of the long-term impact of publications, and their statistical analysis provides greater insight into citation behavior.
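To illustrate what combining content-based and bibliometric features at publication time could look like, the following is a minimal sketch and not the models described in the article. Everything in it is an assumption made for illustration: the field names (`text`, `journal_impact_factor`, `num_authors`), the two-word vocabulary, the binary "highly cited" label, the four invented toy records, and the use of a small logistic regression as a stand-in learner.

```python
import math
from collections import Counter

def featurize(article, vocabulary):
    """Concatenate content-based and bibliometric features into one vector."""
    words = Counter(article["text"].lower().split())
    content = [float(words[w]) for w in vocabulary]          # bag-of-words counts
    bibliometric = [article["journal_impact_factor"],        # hypothetical field
                    article["num_authors"] / 10]             # hypothetical field, rescaled
    return content + bibliometric

def train_logistic(X, y, epochs=500, lr=0.1):
    """Logistic regression fitted by stochastic gradient descent (stand-in learner)."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi)) + b
            z = max(-30.0, min(30.0, z))                     # keep exp() in range
            err = 1.0 / (1.0 + math.exp(-z)) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b

def predict(w, b, x):
    """Return 1 if the model scores the article as 'highly cited', else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0

# Invented toy records; label 1 means the (hypothetical) 10-year citation
# count exceeded some chosen threshold, 0 means it did not.
articles = [
    {"text": "randomized trial of statin therapy", "journal_impact_factor": 3.0, "num_authors": 8},
    {"text": "multicenter trial of insulin dosing", "journal_impact_factor": 2.5, "num_authors": 6},
    {"text": "case report of rare rash", "journal_impact_factor": 0.5, "num_authors": 2},
    {"text": "single case of unusual fracture", "journal_impact_factor": 0.4, "num_authors": 3},
]
labels = [1, 1, 0, 0]
vocabulary = ["trial", "case"]

X = [featurize(a, vocabulary) for a in articles]
w, b = train_logistic(X, labels)
predictions = [predict(w, b, x) for x in X]
```

All inputs to `featurize` are things known on the day an article appears, which is the constraint the abstract emphasizes. A realistic setup would need a far larger corpus, a held-out evaluation set, and a learner chosen by validation.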

Corresponding author

Correspondence to Lawrence D. Fu.

Cite this article

Fu, L.D., Aliferis, C.F. Using content-based and bibliometric features for machine learning models to predict citation counts in the biomedical literature. Scientometrics 85, 257–270 (2010). https://doi.org/10.1007/s11192-010-0160-5
