A Comparison of On-Line Computer Science Citation Databases

  • Vaclav Petricek
  • Ingemar J. Cox
  • Hui Han
  • Isaac G. Councill
  • C. Lee Giles
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3652)


This paper examines the differences and similarities between two on-line computer science citation databases, DBLP and CiteSeer. The database entries in DBLP are inserted manually, while the CiteSeer entries are obtained autonomously via a crawl of the Web and automatic processing of user submissions. CiteSeer’s autonomous citation database can be considered a form of self-selected on-line survey. It is important to understand the limitations of such databases, particularly when citation information is used to assess the performance of authors, institutions and funding bodies.

We show that the CiteSeer database contains considerably fewer single-author papers. This bias can be modeled by an exponential process with an intuitive explanation. The model permits us to predict that the DBLP database covers approximately 24% of the entire literature of Computer Science. CiteSeer is also biased against low-cited papers.

Despite their differences, the two databases exhibit citation distributions that are similar to each other and significantly different from those reported in previous analyses of the Physics community. In both databases, we also observe that the number of authors per paper has been increasing over time.





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Vaclav Petricek (1)
  • Ingemar J. Cox (1)
  • Hui Han (2)
  • Isaac G. Councill (3)
  • C. Lee Giles (3)
  1. University College London, London, United Kingdom
  2. Yahoo! Inc., Sunnyvale
  3. The School of Information Sciences and Technology, The Pennsylvania State University, University Park, USA
