International Conference on Web Information Systems Engineering

Web Information Systems Engineering – WISE 2015, pp 554-569

Adaptive Focused Crawling of Linked Data

  • Ran Yu
  • Ujwal Gadiraju
  • Besnik Fetahu
  • Stefan Dietze
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9418)

Abstract

Given the evolution of publicly available Linked Data, crawling and preservation have become increasingly important challenges. Due to the scale of data available on the Web, efficient focused crawling approaches are required that can capture the relevant semantic neighborhood of seed entities. Here, determining the relevant entities for a given set of seed entities is a crucial problem. Since the weight of seeds within a seed list varies significantly with respect to the crawl intent, we argue that an adaptive crawler is required, which considers such characteristics when configuring the crawling and relevance detection approach. To address this problem, we introduce a crawling configuration that considers seed list-specific features as part of its crawling and ranking algorithm. We evaluate it through extensive experiments in comparison to a number of baseline methods and crawling parameters. We demonstrate that configurations which consider seed list features outperform the baselines, and we present further insights gained from our experiments.
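To illustrate the general idea of focused crawling from a seed list, the following is a minimal sketch of a generic best-first crawl over a toy in-memory graph. It is not the configuration evaluated in this paper: the function `focused_crawl`, the `relevance` callback, the `max_hops` and `budget` parameters, the hop-decay score, and the `ex:` entities are all hypothetical and chosen only for illustration. A real crawler would dereference entity URIs instead of reading a dictionary.

```python
import heapq
from typing import Callable, Dict, List, Set, Tuple

# Toy stand-in for dereferencing: entity URI -> URIs of linked entities.
Graph = Dict[str, List[str]]


def focused_crawl(
    graph: Graph,
    seeds: List[str],
    relevance: Callable[[str, int], float],
    max_hops: int = 2,
    budget: int = 10,
) -> List[Tuple[str, float]]:
    """Best-first crawl: always expand the most relevant known entity next."""
    # heapq is a min-heap, so scores are negated to pop the highest first.
    frontier: List[Tuple[float, int, str]] = [(-relevance(s, 0), 0, s) for s in seeds]
    heapq.heapify(frontier)
    seen: Set[str] = set(seeds)
    crawled: List[Tuple[str, float]] = []

    while frontier and len(crawled) < budget:
        neg_score, hops, entity = heapq.heappop(frontier)
        crawled.append((entity, -neg_score))
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for neighbor in graph.get(entity, []):
            if neighbor not in seen:
                seen.add(neighbor)
                heapq.heappush(
                    frontier, (-relevance(neighbor, hops + 1), hops + 1, neighbor)
                )
    return crawled


# Hypothetical toy data: one seed and a small neighborhood around it.
toy_graph: Graph = {
    "ex:Berlin": ["ex:Germany", "ex:Spree"],
    "ex:Germany": ["ex:Europe"],
    "ex:Spree": [],
    "ex:Europe": ["ex:Earth"],
}


def hop_decay(entity: str, hops: int) -> float:
    """Assumed placeholder score: relevance decays with distance from the seed."""
    return 1.0 / (1 + hops)


result = focused_crawl(toy_graph, ["ex:Berlin"], hop_decay, max_hops=2, budget=10)
for entity, score in result:
    print(f"{entity}\t{score:.2f}")
```

In this sketch the relevance function is fixed in advance. An adaptive crawler in the sense argued for above would instead choose the scoring function and the crawl parameters depending on characteristics of the seed list.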

Keywords

Focused crawling, Linked data, Relevance assessment


Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Ran Yu (1)
  • Ujwal Gadiraju (1)
  • Besnik Fetahu (1)
  • Stefan Dietze (1)
  1. L3S Research Center, Leibniz Universität Hannover, Hannover, Germany
