GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications

  • Conference paper
Part of the book series: Lecture Notes in Computer Science (LNCCN, volume 8593)

Abstract

Crawling web applications is important for indexing, accessibility and security assessment. Crawling traditional web applications is an old problem for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, remains an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling a RIA more time-consuming for the web crawler. One way to reduce the crawl time is to crawl a RIA in parallel with multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm for crawling RIAs. This paper extends the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn the speed of the nodes and adapt to changes, and thus make better use of resources. Second, it presents a distributed greedy algorithm for crawling a RIA in parallel, called the GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper describes a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and examines empirical performance measurements.
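The abstract does not spell out the greedy dispatch rule, so the following is only an illustrative sketch under stated assumptions, not the paper's implementation. It assumes that "greedy" means the server hands each client the unexecuted event closest to that client's current state, measured in already-known transitions. The names `graph`, `unexecuted`, `nearest_state_with_work` and `dispatch_job` are hypothetical and chosen for this example only.

```python
from collections import deque


def nearest_state_with_work(graph, unexecuted, start):
    """Breadth-first search over known transitions, starting at `start`.

    graph: dict mapping state -> {event: next_state} for executed events.
    unexecuted: dict mapping state -> set of events not yet executed.
    Returns (path_of_events, target_state), or None if no work is reachable.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if unexecuted.get(state):
            return path, state
        for event, nxt in graph.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return None


def dispatch_job(graph, unexecuted, client_state):
    """Server side (assumed behaviour): pick the closest unexecuted event."""
    found = nearest_state_with_work(graph, unexecuted, client_state)
    if found is None:
        return None  # nothing reachable is left for this client
    path, target = found
    event = min(unexecuted[target])    # deterministic choice for the sketch
    unexecuted[target].discard(event)  # reserve it so no other client repeats it
    return {"path": path, "event": event}


# Usage: a client sitting in state "s0" is sent along a, b to execute an event in "s2".
graph = {"s0": {"a": "s1"}, "s1": {"b": "s2"}}
unexecuted = {"s0": set(), "s1": set(), "s2": {"c", "d"}}
job = dispatch_job(graph, unexecuted, "s0")
```

Reserving the event on the server before the client runs it is one simple way to keep two clients from executing the same event; a real system would also need to handle client failure and newly discovered states.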

Notes

  1. This paper focuses only on JavaScript events and leaves other client-side events, such as Flash events, to future studies.

  2. http://phantomjs.org/

  3. XMLHttpRequest is the module responsible for asynchronous calls in many popular browsers such as Firefox and Chrome. Microsoft Internet Explorer, however, does not use this module and instead uses ActiveXObject.

  4. Due to space limitations, the rest of the code snippets in this section are omitted.

  5. http://www.abeautifulsite.net/blog/2008/03/jquery-file-tree/

Acknowledgments

This work is largely supported by the IBM® Center for Advanced Studies, the IBM Ottawa Lab and the Natural Sciences and Engineering Research Council of Canada (NSERC). Special thanks to Sara Baghbanzadeh.

Author information

Corresponding author

Correspondence to Seyed M. Mirtaheri.

Trademarks

IBM, the IBM logo, ibm.com and AppScan are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Mirtaheri, S.M., von Bochmann, G., Jourdan, G.-V., Onut, I.V. (2014). GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications. In: Noubir, G., Raynal, M. (eds) Networked Systems. NETYS 2014. Lecture Notes in Computer Science, vol 8593. Springer, Cham. https://doi.org/10.1007/978-3-319-09581-3_14

  • DOI: https://doi.org/10.1007/978-3-319-09581-3_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09580-6

  • Online ISBN: 978-3-319-09581-3

  • eBook Packages: Computer Science (R0)
