Harvesting: Broadening the Field of Distributed Information Retrieval

  • Edward A. Fox
  • Marcos A. Gonçalves
  • Ming Luo
  • Yuxin Chen
  • Aaron Krowne
  • Baoping Zhang
  • Kate McDevitt
  • Manuel Pérez-Quiñones
  • Ryan Richardson
  • Lillian N. Cassel
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2924)


This chapter argues that, in addition to federated search and gathering (as by Web crawlers), harvesting is an important approach to meeting the needs of distributed IR. We highlight the Open Archives Initiative Protocol for Metadata Harvesting, illustrating its use in three projects: OAD, NDLTD, and CITIDEL. We explain how traditional services can be extended in a user-centered fashion, providing details of our new ESSEX search engine, multischeming browsing, and quality-oriented filtering (using rules and SVMs). We conclude with an overview of work in progress on logging and component architectures, as well as a summary of our findings.


Digital Library, Virginia Tech, Subject Field, Federate Search, Open Archive Initiative
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Edward A. Fox (1)
  • Marcos A. Gonçalves (1)
  • Ming Luo (1)
  • Yuxin Chen (1)
  • Aaron Krowne (1)
  • Baoping Zhang (1)
  • Kate McDevitt (1)
  • Manuel Pérez-Quiñones (1)
  • Ryan Richardson (1)
  • Lillian N. Cassel (2)

  1. Digital Library Research Laboratory, Virginia Tech, Blacksburg, USA
  2. Dept. of Computing Sciences, Villanova University, Villanova, USA
