CiteSeerx: A Scholarly Big Dataset

  • Cornelia Caragea
  • Jian Wu
  • Alina Ciobanu
  • Kyle Williams
  • Juan Fernández-Ramírez
  • Hung-Hsuan Chen
  • Zhaohui Wu
  • Lee Giles
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)


The CiteSeerx digital library stores and indexes research articles in Computer Science and related fields. Although its main purpose is to make it easier for researchers to search for scientific information, CiteSeerx has proven to be a powerful resource in many data mining, machine learning, and information retrieval applications that use rich metadata such as titles, abstracts, authors, venues, and reference lists. Metadata extraction in CiteSeerx is done using automated techniques. Although fairly accurate, these techniques still result in noisy metadata. Since the performance of models trained on these data depends heavily on the quality of the data, we propose an approach to CiteSeerx metadata cleaning that incorporates information from an external data source. The result is a subset of CiteSeerx that is substantially cleaner than the entire set. Our goal is to make the new dataset available to the research community to facilitate future work in Information Retrieval.
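The abstract does not say how records are matched against the external source, so the following is only a minimal sketch of one common record-linkage strategy: normalize titles, then link each noisy record to the external record with the highest token-level Jaccard similarity. The function names, the `title` field, and the 0.8 threshold are all hypothetical choices made for illustration, not the authors' method.

```python
import re
import unicodedata


def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", title)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    cleaned = re.sub(r"[^a-z0-9]+", " ", ascii_only.lower())
    return cleaned.strip()


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def link_records(noisy_records, reference_records, threshold=0.8):
    """Pair each noisy record with its best-matching reference record.

    Assumption: both inputs are lists of dicts with a "title" key, and
    `threshold` is an illustrative cut-off, not a tuned value.
    A noisy record with no match at or above the threshold is dropped.
    """
    indexed = [
        (set(normalize_title(ref["title"]).split()), ref)
        for ref in reference_records
    ]
    links = []
    for rec in noisy_records:
        tokens = set(normalize_title(rec["title"]).split())
        best_score, best_ref = 0.0, None
        for ref_tokens, ref in indexed:
            score = jaccard(tokens, ref_tokens)
            if score > best_score:
                best_score, best_ref = score, ref
        if best_ref is not None and best_score >= threshold:
            links.append((rec, best_ref))
    return links
```

Under these assumptions, the linked pairs play the role of the cleaner subset: a noisy record is kept only when a sufficiently similar entry exists in the external source, and its metadata can then be checked against that entry. The pairwise comparison is quadratic, so a collection of realistic size would need blocking or an index rather than this exhaustive loop.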


Keywords: CiteSeerx, Scholarly Big Data, Record Linkage





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Cornelia Caragea
    • 1
    • 4
  • Jian Wu
    • 2
    • 5
  • Alina Ciobanu
    • 3
    • 6
  • Kyle Williams
    • 2
    • 5
  • Juan Fernández-Ramírez
    • 1
    • 7
  • Hung-Hsuan Chen
    • 1
    • 5
  • Zhaohui Wu
    • 1
    • 5
  • Lee Giles
    • 1
    • 2
    • 5
  1. Computer Science and Engineering, University of North Texas, Denton, USA
  2. Information Sciences and Technology, University of North Texas, Denton, USA
  3. Computer Science, University of North Texas, Denton, USA
  4. University of North Texas, Denton, USA
  5. Pennsylvania State University, University Park, USA
  6. University of Bucharest, Bucharest, Romania
  7. University of the Andes, Bogota, Colombia
