Volume 108, Issue 1, pp 183–200

Predicting citation patterns: defining and determining influence

  • David Guy Brizan (email author)
  • Kevin Gallagher
  • Arnab Jahangir
  • Theodore Brown


This work surveys and expands upon definitions of influence in bibliometrics. On a data set formed from the union of DBLP and CiteSeerX, approximately 6 million publications, a relatively small number of features are developed to describe the set, including two novel features: loyalty and community longevity. These features are successfully used to predict the influential set of papers in a series of machine learning experiments, and the most predictive features are highlighted and discussed.


Citation analysis, Bibliometrics, Big data, Machine learning



This research was supported, in part, under National Science Foundation Grants CNS-0958379, CNS-0855217, ACI-1126113 and the City University of New York High Performance Computing Center at the College of Staten Island. The authors also acknowledge the Office of Information Technology at The Graduate Center, CUNY for providing database and server resources that have contributed to the research results reported within this paper. URL:



Copyright information

© Akadémiai Kiadó, Budapest, Hungary 2016

Authors and Affiliations

  • David Guy Brizan (1), email author
  • Kevin Gallagher (2)
  • Arnab Jahangir (3)
  • Theodore Brown (1)

  1. Department of Computer Science, CUNY and CUNY Graduate Center, New York, USA
  2. Department of Computer Science, NYU Tandon School of Engineering, Brooklyn, USA
  3. Department of Computer Science, Hunter College CUNY, New York, USA
