Abstract
Crawling web applications is important for indexing, accessibility and security assessment. Crawling traditional web applications is an old problem for which good and efficient solutions are known. Crawling Rich Internet Applications (RIAs) quickly and efficiently, however, remains an open problem. Technologies such as AJAX and partial Document Object Model (DOM) updates make crawling RIAs even more time-consuming for the web crawler. One way to reduce the time needed to crawl a RIA is to crawl it in parallel with multiple computers. The previously published Dist-RIA Crawler presents a distributed breadth-first search algorithm to crawl RIAs. This paper expands the Dist-RIA Crawler in two ways. First, it introduces an adaptive load-balancing algorithm that enables the crawler to learn the speed of the nodes and adapt to changes, and thus to better utilize the resources. Second, it presents a distributed greedy algorithm for crawling a RIA in parallel, called GDist-RIA Crawler. The GDist-RIA Crawler uses a server-client architecture in which the server dispatches crawling jobs to the crawling clients. This paper illustrates a prototype implementation of the GDist-RIA Crawler, explains some of the techniques used to implement the prototype, and inspects empirical performance measurements.
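The greedy, server-dispatched crawl named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that "greedy" means sending an idle client to the closest known state that still has unexecuted events, and that the server holds the graph of discovered states. The names `nearest_unexecuted`, `dispatch`, `graph` and `pending` are hypothetical.

```python
from collections import deque


def nearest_unexecuted(graph, pending, start):
    """Breadth-first search over the transitions discovered so far.

    graph:   state -> {event: next_state}   (assumed server-side model)
    pending: state -> set of events not yet executed in that state
    Returns (path_of_events, state) for the closest state that still has
    pending events, or None if no such state is reachable from `start`.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        state, path = queue.popleft()
        if pending.get(state):
            return path, state
        for event, nxt in graph.get(state, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [event]))
    return None


def dispatch(graph, pending, client_state):
    """Hypothetical server step: choose and reserve one job for a client."""
    found = nearest_unexecuted(graph, pending, client_state)
    if found is None:
        return None
    path, state = found
    event = min(pending[state])    # deterministic pick, for the sketch only
    pending[state].remove(event)   # reserve it so no other client repeats it
    return path, state, event


# Toy model: s0 --a--> s1 --b--> s2, where s2 has two unexecuted events.
graph = {"s0": {"a": "s1"}, "s1": {"b": "s2"}, "s2": {}}
pending = {"s0": set(), "s1": set(), "s2": {"c", "d"}}
job = dispatch(graph, pending, "s0")
```

In this toy model, `job` is `(["a", "b"], "s2", "c")`: the client would replay events `a` and `b` to reach `s2` and then execute `c`. Reserving the event on the server is one simple way to keep two clients from executing the same job.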
Notes
- 1.
This paper focuses only on JavaScript events and leaves other client-side events, such as Flash events, to future studies.
- 2.
- 3.
XMLHttpRequest is the module responsible for asynchronous calls in many popular browsers such as Firefox and Chrome. Older versions of Microsoft Internet Explorer (before version 7), however, do not provide this module and use ActiveXObject instead.
- 4.
Due to space limitations, the rest of the code snippets in this section are omitted.
- 5.
References
Amalfitano, D., Fasolino, A.R., Tramontana, P.: Reverse engineering finite state machines from rich internet applications. In: Proceedings of the 2008 15th Working Conference on Reverse Engineering, WCRE ’08, pp. 69–73. IEEE Computer Society, Washington, DC (2008)
Amalfitano, D., Fasolino, A.R., Tramontana, P.: Experimenting a reverse engineering technique for modelling the behaviour of rich internet applications. In: IEEE International Conference on Software Maintenance, ICSM 2009, pp. 571–574, September 2009
Amalfitano, D., Fasolino, A.R., Tramontana, P.: Rich internet application testing using execution trace data. In: Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 274–283. IEEE Computer Society, Washington, DC (2010)
Benjamin, K., von Bochmann, G., Jourdan, G.-V., Onut, I.-V.: Some modeling challenges when testing rich internet applications for security. In: Proceedings of the 2010 Third International Conference on Software Testing, Verification, and Validation Workshops, ICSTW ’10, pp. 403–409. IEEE Computer Society, Washington, DC (2010)
Benjamin, K., von Bochmann, G., Dincturk, M.E., Jourdan, G.-V., Onut, I.V.: A strategy for efficient crawling of rich internet applications. In: Auer, S., Díaz, O., Papadopoulos, G.A. (eds.) ICWE 2011. LNCS, vol. 6757, pp. 74–89. Springer, Heidelberg (2011)
Boldi, P., Codenotti, B., Santini, M., Vigna, S.: Ubicrawler: a scalable fully distributed web crawler. Proc. Aust. World Wide Web Conf. 34(8), 711–726 (2002)
Boldi, P., Marino, A., Santini, M., Vigna, S.: BUbiNG: massive crawling for the masses. In: WWW (Companion Volume), pp. 227–228 (2014). http://doi.acm.org/10.1145/2567948.2577304
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the Seventh International Conference on World Wide Web 7, WWW7, pp. 107–117. Elsevier Science Publishers B.V., Amsterdam (1998)
Choudhary, S.: M-crawler: crawling rich internet applications using menu meta-model. Master’s thesis, EECS - University of Ottawa (2012). http://ssrg.site.uottawa.ca/docs/Surya-Thesis.pdf
Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., Jourdan, G.-V., Bochmann, G., Onut, I.V.: Building rich internet applications models: example of a better strategy. In: Daniel, F., Dolog, P., Li, Q. (eds.) ICWE 2013. LNCS, vol. 7977, pp. 291–305. Springer, Heidelberg (2013)
Choudhary, S., Dincturk, M.E., von Bochmann, G., Jourdan, G.-V., Onut, I.-V., Ionescu, P.: Solving some modeling challenges when testing rich internet applications for security. In: ICST, pp. 850–857 (2012)
Choudhary, S., Dincturk, M.E., Mirtaheri, S.M., von Bochmann, G., Jourdan, G.-V., Onut, I.-V.: Crawling rich internet applications: the state of the art. In: Proceedings of the 2012 Conference of the Center for Advanced Studies on Collaborative Research, CASCON ’12, IBM Corp., Riverton (2012)
Dincturk, M.E.: Model-based crawling - an approach to design efficient crawling strategies for rich internet applications. Master’s thesis, EECS - University of Ottawa (2013). http://ssrg.eecs.uottawa.ca/docs/Dincturk_MustafaEmre_2013_thesis.pdf
Dincturk, M.E., Choudhary, S., von Bochmann, G., Jourdan, G.-V., Onut, I.V.: A statistical approach for efficient crawling of rich internet applications. In: Brambilla, M., Tokuda, T., Tolksdorf, R. (eds.) ICWE 2012. LNCS, vol. 7387, pp. 362–369. Springer, Heidelberg (2012)
Duda, C., Frey, G., Kossmann, D., Matter, R., Zhou, C.: Ajax crawl: making ajax applications searchable. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, ICDE ’09, pp. 78–89. IEEE Computer Society, Washington, DC (2009)
Edwards, J., McCurley, K., Tomlin, J.: An adaptive model for optimizing performance of an incremental web crawler (2001)
Exposto, J., Macedo, J., Pina, A., Alves, A., Rufino, J.: Geographical partition for distributed web crawling. In: Proceedings of the 2005 workshop on Geographic information retrieval, GIR ’05, pp. 55–60. ACM, New York (2005)
Frey, G.: Indexing ajax web applications. Master’s thesis, ETH Zurich (2007). http://e-collection.library.ethz.ch/eserv/eth:30111/eth-30111-01.pdf
Heydon, A., Najork, M.: Mercator: a scalable, extensible web crawler. World Wide Web 2, 219–229 (1999)
Li, J., Loo, B., Hellerstein, J., Kaashoek, M., Karger, D., Morris, R.: On the feasibility of peer-to-peer web indexing and search. In: Kaashoek, M.F., Stoica, I. (eds.) IPTPS 2003. LNCS, vol. 2735, pp. 207–215. Springer, Heidelberg (2003)
Lo, J., Wohlstadter, E., Mesbah, A.: Imagen: runtime migration of browser sessions for javascript web applications. In: Proceedings of the International World Wide Web Conference (WWW), pp. 815–825. ACM (2013)
Marchetto, A., Tonella, P., Ricca, F.: State-based testing of ajax web applications. In: Proceedings of the 2008 International Conference on Software Testing, Verification, and Validation, ICST ’08, pp. 121–130. IEEE Computer Society, Washington, DC (2008)
Matter, R.: Ajax crawl: making ajax applications searchable. Master’s thesis, ETH Zurich (2008). http://e-collection.library.ethz.ch/eserv/eth:30709/eth-30709-01.pdf
Mesbah, A., Bozdag, E., van Deursen, A.: Crawling ajax by inferring user interface state changes. In: Proceedings of the 2008 Eighth International Conference on Web Engineering, ICWE ’08, pp. 122–134. IEEE Computer Society, Washington, DC (2008)
Mesbah, A., van Deursen, A., Lenselink, S.: Crawling ajax-based web applications through dynamic analysis of user interface state changes. TWEB 6(1), 3 (2012)
Fard, A.M., Mesbah, A.: Feedback-directed exploration of web applications to derive test models. In: Proceedings of the 24th IEEE International Symposium on Software Reliability Engineering (ISSRE), p. 10. IEEE Computer Society (2013)
Mirtaheri, S.M., Zou, D., Bochmann, G.V., Jourdan, G.-V., Onut, I.V.: Dist-ria crawler: a distributed crawler for rich internet applications. In: Proceedings of the 8th International Conference on P2P, Parallel, Grid, Cloud and Internet Computing (2013)
Peng, Z., He, N., Jiang, C., Li, Z., Xu, L., Li, Y., Ren, Y.: Graph-based ajax crawl: Mining data from rich internet applications. In: 2012 International Conference on Computer Science and Electronics Engineering (ICCSEE), vol. 3, pp. 590–594 (2012)
Shkapenyuk, V., Suel, T.: Design and implementation of a high-performance distributed web crawler. In: Proceedings of the International Conference on Data Engineering, pp. 357–368 (2002)
Lee, H.-T., Leonard, D., Wang, X., Loguinov, D.: Irlbot: scaling to 6 billion pages and beyond (2008)
Acknowledgments
This work is largely supported by the IBM® Center for Advanced Studies, the IBM Ottawa Lab and the Natural Sciences and Engineering Research Council of Canada (NSERC). Special thanks to Sara Baghbanzadeh.
Trademarks
IBM, the IBM logo, ibm.com and AppScan are trademarks or registered trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml. Intel and Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Java and all Java-based trademarks and logos are trademarks or registered trademarks of Oracle and/or its affiliates.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Mirtaheri, S.M., von Bochmann, G., Jourdan, G.-V., Onut, I.V. (2014). GDist-RIA Crawler: A Greedy Distributed Crawler for Rich Internet Applications. In: Noubir, G., Raynal, M. (eds) Networked Systems. NETYS 2014. Lecture Notes in Computer Science, vol 8593. Springer, Cham. https://doi.org/10.1007/978-3-319-09581-3_14
DOI: https://doi.org/10.1007/978-3-319-09581-3_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-09580-6
Online ISBN: 978-3-319-09581-3
eBook Packages: Computer Science, Computer Science (R0)