Automatic Subject Classification of Scientific Literature Using Citation Metadata

  • Abdulhussain E. Mahdi
  • Arash Joorabchi
Part of the Communications in Computer and Information Science book series (CCIS, volume 194)


This paper describes a new method for automatically classifying scientific literature archived in digital libraries and repositories according to a standard library classification scheme. The method identifies all the references cited in the document to be classified and, using the subject classification metadata of those references as catalogued in existing conventional libraries, infers the most probable class for the document itself with the help of a weighting mechanism. We demonstrate the application of the proposed method and assess its performance by developing a prototype software system that classifies scientific documents according to the Dewey Decimal Classification (DDC) scheme. A dataset of one thousand research articles, papers, and reports from a well-known scientific digital library, CiteSeer, was used to evaluate the classification performance of the system. Detailed results of this experiment are presented and discussed.
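The inference step described above can be illustrated with a minimal sketch. It assumes that each cited reference has already been matched to a DDC number in a library catalogue, and that each match carries a numeric weight. The function name `infer_ddc_class`, the weights, and the truncation to three digits are hypothetical choices made for illustration; the abstract does not specify the weighting mechanism the authors use.

```python
from collections import defaultdict


def infer_ddc_class(reference_classes, digits=3):
    """Return the DDC class with the highest total weight, or None.

    reference_classes: list of (ddc_number, weight) pairs, one per cited
    reference that was found in a library catalogue. The weight is an
    assumed confidence score, not the weighting used in the paper.
    """
    scores = defaultdict(float)
    for ddc, weight in reference_classes:
        # Truncate to the leading digits so that, for example,
        # 006.31 and 006.4 both vote for the broader class 006.
        key = ddc.replace(".", "")[:digits]
        scores[key] += weight
    if not scores:
        return None
    # Iterate keys in sorted order so ties resolve deterministically
    # to the numerically smallest class.
    return max(sorted(scores), key=lambda k: scores[k])


# Hypothetical references: two in 006 (weights 2.0 and 1.0),
# one in 519 (2.5), one in 025 (1.0).
refs = [("006.31", 2.0), ("006.4", 1.0), ("519.5", 2.5), ("025.04", 1.0)]
print(infer_ddc_class(refs))
```

With these made-up inputs the two references in class 006 together outweigh the single, heavier reference in class 519, so the sketch prints `006`. Voting at a truncated level is only one possible design; a hierarchical scheme could instead back off from specific to broad classes.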


Keywords: digital library organization; scientific literature classification; library classification schemes; Dewey Decimal Classification (DDC); library Online Public Access Catalogues (OPACs)





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Abdulhussain E. Mahdi
    • 1
  • Arash Joorabchi
    • 1
  1. Department of Electronic and Computer Engineering, University of Limerick, Ireland
