GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications

  • Seyed M. Mirtaheri
  • Gregor von Bochmann
  • Guy-Vincent Jourdan
  • Iosif Viorel Onut
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8593)

Abstract

Crawling web applications is important for indexing, accessibility and security assessment. Crawling traditional web applications is an old problem for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, is an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling a RIA even more time-consuming for the web crawler. One way to reduce the time needed to crawl a RIA is to crawl it in parallel with multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm to crawl RIAs. This paper extends the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn about the speed of the nodes and adapt to changes, and thus to make better use of the resources. Second, it presents a distributed greedy algorithm to crawl a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper illustrates a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype and inspects empirical performance measurements.
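
The abstract does not spell out the greedy algorithm, so the following is only a minimal sketch of one way a server could dispatch jobs greedily, not the authors' implementation. It assumes that "greedy" means sending each client to the nearest known state that still has an unexecuted event, found by breadth-first search over the transitions discovered so far. The class `CrawlServer` and its methods `add_state`, `report` and `next_job` are hypothetical names, and resets to the initial URL and load balancing are left out.

```python
from collections import deque


class CrawlServer:
    """Hypothetical server: tracks the discovered state graph and hands out jobs."""

    def __init__(self):
        self.transitions = {}  # state -> {event: next_state} for executed events
        self.unexecuted = {}   # state -> events not yet executed or assigned

    def add_state(self, state, events):
        """Register a newly discovered state and the events found on it."""
        if state not in self.transitions:
            self.transitions[state] = {}
            self.unexecuted[state] = set(events)

    def report(self, state, event, next_state, next_events):
        """Record the result a client obtained by executing an event."""
        self.add_state(next_state, next_events)
        self.transitions[state][event] = next_state

    def next_job(self, client_state):
        """Greedy choice (assumption): nearest state with an unexecuted event.

        Returns (path, event), where path is the list of known events the
        client must replay to reach that state, or None if nothing is
        reachable without a reset.
        """
        queue = deque([(client_state, [])])
        seen = {client_state}
        while queue:
            state, path = queue.popleft()
            pending = self.unexecuted[state]
            if pending:
                event = min(pending)   # deterministic pick among pending events
                pending.remove(event)  # mark as assigned so no other client gets it
                return path, event
            for ev, nxt in sorted(self.transitions[state].items()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, path + [ev]))
        return None
```

In this sketch a client would call `next_job` with its current state, replay the returned path, execute the event and send the outcome back through `report`. Because events are marked as assigned when they are dispatched, two clients never receive the same job.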

Keywords

Web crawling, Rich internet application, Greedy algorithm, Load-balancing


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Seyed M. Mirtaheri (1)
  • Gregor von Bochmann (1)
  • Guy-Vincent Jourdan (1)
  • Iosif Viorel Onut (2)

  1. School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, Canada
  2. Security AppScan® Enterprise, IBM, Ottawa, Canada
