MultiCrawler: A Pipelined Architecture for Crawling and Indexing Semantic Web Data

  • Andreas Harth
  • Jürgen Umbrich
  • Stefan Decker
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4273)


The goal of the work presented in this paper is to obtain large amounts of semistructured data from the web. Harvesting semistructured data is a prerequisite for large-scale query answering over web sources. We contrast our approach with conventional web crawlers, and we describe and evaluate a five-step pipelined architecture that crawls and indexes data from both the traditional web and the Semantic Web.
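To make the idea of a pipelined crawl-and-index architecture concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not enumerate the five steps, so the stage names used here (`fetch`, `detect`, `transform`, `index_doc`, `extract`), the in-memory `FAKE_WEB` dictionary, and the whitespace tokenization are all assumptions chosen for illustration. Each stage runs in its own thread and hands results to the next stage through a queue, so one document can be indexed while another is still being fetched.

```python
import queue
import threading
from collections import defaultdict

# Hypothetical in-memory "web": URL -> (content type, body). No network access.
FAKE_WEB = {
    "http://example.org/a": ("application/rdf+xml", "alice knows bob"),
    "http://example.org/b": ("text/html", "bob homepage http://example.org/a"),
}

DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Apply func to every item from inbox; forward non-None results to outbox."""
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)  # propagate shutdown to the next stage
            return
        result = func(item)
        if result is not None:
            outbox.put(result)


def fetch(url):
    # Assumed step 1: retrieve the document; unknown URLs are dropped.
    if url not in FAKE_WEB:
        return None
    content_type, body = FAKE_WEB[url]
    return {"url": url, "type": content_type, "body": body}


def detect(doc):
    # Assumed step 2: classify the document by its content type.
    doc["is_rdf"] = doc["type"] == "application/rdf+xml"
    return doc


def transform(doc):
    # Assumed step 3: stand-in for a real transformation; split into terms and links.
    tokens = doc["body"].split()
    doc["terms"] = [t.lower() for t in tokens if not t.startswith("http://")]
    doc["links"] = [t for t in tokens if t.startswith("http://")]
    return doc


def run_pipeline(seeds):
    index = defaultdict(set)  # inverted index: term -> set of URLs
    discovered = []           # links found during the crawl

    def index_doc(doc):
        # Assumed step 4: add the document's terms to the inverted index.
        for term in doc["terms"]:
            index[term].add(doc["url"])
        return doc

    def extract(doc):
        # Assumed step 5: collect outgoing links for a later crawl round.
        discovered.extend(doc["links"])
        return None

    funcs = [fetch, detect, transform, index_doc, extract]
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(f, queues[i], queues[i + 1]))
        for i, f in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for url in seeds:
        queues[0].put(url)
    queues[0].put(DONE)
    for t in threads:
        t.join()
    return dict(index), discovered
```

Calling `run_pipeline(["http://example.org/a", "http://example.org/b"])` returns the inverted index together with the links discovered along the way. A real crawler would feed those links back into the first queue and would need politeness delays, duplicate detection, and persistent storage, none of which this sketch attempts.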


Keywords: Content Type, Inverted Index, Transformation Language, Pipeline Architecture, XPath Query



Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Andreas Harth (1)
  • Jürgen Umbrich (1)
  • Stefan Decker (1)
  1. Digital Enterprise Research Institute, National University of Ireland, Galway
