A Structured Approach to Data Reverse Engineering of Web Applications

  • Roberto De Virgilio
  • Riccardo Torlone
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5648)


Most documents on the Web are written in HTML and constitute a huge amount of legacy data. These documents are formatted for visual purposes only, in different styles that reflect diverse authorship and goals, which makes the retrieval and integration of Web content difficult to automate. We contribute to the solution of this problem by proposing a structured approach to the data reverse engineering of data-intensive Web sites. We focus on data content and on the way in which that content is structured on the Web. We use a Web data model to describe abstract structural features of HTML pages, and we propose a method for segmenting HTML documents into special blocks that group semantically related Web objects. We have developed a tool based on this method that supports the identification of the structure, function, and meaning of data organized in Web object blocks. Using this tool, we demonstrate the feasibility and effectiveness of our approach on a set of real Web sites.
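
The abstract does not spell out the segmentation method, so the following is only a minimal sketch of the general idea of grouping Web objects into blocks; it is not the authors' algorithm. It assumes that a fixed set of container tags (the hypothetical `BLOCK_TAGS`) marks block boundaries and that text fragments and link targets are the only Web objects of interest. The names `BlockSegmenter` and `segment` are likewise illustrative.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags treated as block boundaries; not taken from the paper.
BLOCK_TAGS = {"div", "table", "ul", "ol", "form", "section", "nav"}


class BlockSegmenter(HTMLParser):
    """Group text and link objects under their nearest enclosing block tag."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open blocks, innermost last
        self.blocks = []  # finished blocks, in the order they were closed

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.stack.append({"tag": tag, "objects": []})
        elif tag == "a" and self.stack:
            href = dict(attrs).get("href")
            if href:
                self.stack[-1]["objects"].append(("link", href))

    def handle_endtag(self, tag):
        # Only close a block when the end tag matches the innermost open block.
        if tag in BLOCK_TAGS and self.stack and self.stack[-1]["tag"] == tag:
            block = self.stack.pop()
            if block["objects"]:  # drop containers that hold no objects directly
                self.blocks.append(block)

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack:
            self.stack[-1]["objects"].append(("text", text))


def segment(html):
    """Return the list of non-empty blocks found in an HTML string."""
    parser = BlockSegmenter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `segment` on a page fragment returns one dictionary per non-empty container, each holding the container's tag name and the objects found directly inside it. A purely tag-based split like this ignores visual layout and styling, which a real segmentation method would likely need to take into account.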


Keywords: Logical Schema; Cascade Style Sheet; Document Object Model; Page Segmentation; Page Schema



Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Roberto De Virgilio (1)
  • Riccardo Torlone (1)
  1. Università Roma Tre, Italy
