Parallelising Harvesting

  • Hussein Suleman
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4312)


Metadata harvesting has become a common technique for transferring a stream of data from one metadata repository or digital library system to another. As collections of metadata, and their associated digital objects, grow in size, the ingest of these items at the destination archive can take a significant amount of time, depending on the type of indexing or post-processing required. This paper discusses an approach to parallelising the post-processing of data in a small cluster of machines or a multi-processor environment without increasing the burden on the source data provider. Performance tests were carried out on varying architectures, and the results indicate that the technique is promising for some scenarios and can be extended to more computationally intensive ingest procedures. In general, the technique presents a new approach to constructing harvest-based distributed or component-based digital libraries, with better scalability than before.
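
The abstract does not spell out the mechanism, so the following is only a minimal sketch of one way to read it, not the paper's implementation: a single harvester reads the source once, sequentially, and only the post-processing of each record is spread across a pool of workers. The names fetch_records, post_process and harvest_and_process, the record layout, and the "oai:example:" identifiers are all hypothetical and do not come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_records():
    """Hypothetical stand-in for one sequential harvest from the source."""
    for i in range(8):
        yield {"id": f"oai:example:{i}", "title": f"Record {i}"}


def post_process(record):
    """Hypothetical post-processing step: tokenise the title for indexing."""
    return record["id"], record["title"].lower().split()


def harvest_and_process(records, process, workers=4):
    # The source is read once by a single harvester, so the data provider
    # sees the same load as in a serial harvest. Only the post-processing
    # is distributed over the worker pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(process, records))


index = harvest_and_process(fetch_records(), post_process)
```

A thread pool keeps the sketch short. For CPU-bound post-processing in Python, a process pool or a job queue spanning the machines of a cluster would take its place, while the single-reader structure stays the same.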


Keywords: Digital Library, Data Provider, Disk Access, Beowulf Cluster, High Computational Load
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Hussein Suleman, Department of Computer Science, University of Cape Town, Rondebosch, South Africa
