Abstract
The CiteSeer digital library is a useful source of bibliographic information. It allows retrieval of citations, co-authorships, addresses, and affiliations of authors and publications. Despite this, it has relatively rarely been used for automated citation analyses. This article describes our findings from extensive mining of the CiteSeer data. We explored citations between authors and determined rankings of influential scientists using various evaluation methods, including citation and in-degree counts, HITS, PageRank, and variations of PageRank based on both the citation and collaboration graphs. We compare the resulting rankings with lists of computer science award winners and find that award recipients are almost always ranked highly. We conclude that CiteSeer is a valuable, yet not fully appreciated, repository of citation data and is appropriate for testing novel bibliometric methods.
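To make two of the ranking methods named above concrete, the following is a minimal sketch of in-degree counting and basic PageRank on an author citation graph. It is an illustration only, not the article's implementation: the author labels, the toy edge list, the damping factor of 0.85, and the uniform handling of authors who cite nobody are all assumptions, and the PageRank variations mentioned in the abstract are not reproduced.

```python
def in_degree(edges):
    """Count how many distinct authors cite each author."""
    nodes = {name for edge in edges for name in edge}
    counts = {name: 0 for name in nodes}
    for _citing, cited in set(edges):  # set() drops repeated author pairs
        counts[cited] += 1
    return counts


def pagerank(edges, damping=0.85, iterations=100):
    """Basic PageRank by power iteration (damping value is an assumption)."""
    nodes = sorted({name for edge in edges for name in edge})
    n = len(nodes)
    cited_by = {name: set() for name in nodes}  # author -> authors they cite
    for citing, cited in edges:
        cited_by[citing].add(cited)

    rank = {name: 1.0 / n for name in nodes}
    for _ in range(iterations):
        # Rank held by authors with no outgoing citations is spread uniformly.
        dangling = sum(rank[name] for name in nodes if not cited_by[name])
        new_rank = {
            name: (1.0 - damping) / n + damping * dangling / n for name in nodes
        }
        for citing in nodes:
            targets = cited_by[citing]
            for cited in targets:
                new_rank[cited] += damping * rank[citing] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy data: (citing author, cited author) pairs.
edges = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "B")]
degrees = in_degree(edges)
ranks = pagerank(edges)

for author in sorted(ranks, key=ranks.get, reverse=True):
    print(f"{author}: in-degree={degrees[author]}, PageRank={ranks[author]:.3f}")
```

In-degree treats every citing author equally, whereas PageRank weights a citation by the rank of the author who makes it, which is why the two methods can order the same authors differently.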
Acknowledgments
This work was supported in part by the Ministry of Education of the Czech Republic under Grant 2C06009. The related software may be found at http://textmining.zcu.cz/downloads/sciento.php. Many thanks go to the anonymous reviewers for their useful hints and comments and to Karel Ježek for his support of this project.
Cite this article
Fiala, D. Mining citation information from CiteSeer data. Scientometrics 86, 553–562 (2011). https://doi.org/10.1007/s11192-010-0326-1
DOI: https://doi.org/10.1007/s11192-010-0326-1