SimiLay: A Developing Web Page Layout Based Visual Similarity Search Engine

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8556)

Abstract

Web page visual similarity has been a trending topic over the last decade, and effective methods are crucial for phishing detection and related problems. In this study, we aim to develop a search engine for web page visual similarity and propose a novel method for capturing and calculating the layout similarity of web pages. To achieve this, web page elements are classified and mapped with a novel technique. An extension of the well-known bag-of-features approach, spatial pyramid matching, is then employed with a histogram intersection scheme to capture and measure both partial and whole-page layout similarity. Promising results demonstrate that the spatial pyramid matching kernel can be used in this field.
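
To make the idea of spatial pyramid matching with histogram intersection concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each page has already been reduced to a list of (x, y, element_type) tuples with coordinates normalized to the page size; that representation, the element type labels, and all function names are hypothetical. The level weights follow the spatial pyramid kernel as commonly formulated by Lazebnik et al.

```python
from collections import Counter


def level_histogram(elements, level):
    """Count elements per (cell_x, cell_y, element_type) on a 2^level x 2^level grid."""
    cells = 2 ** level
    hist = Counter()
    for x, y, kind in elements:
        # Assumption: x and y are normalized to [0, 1]; a value of 1.0 is
        # clamped into the last cell.
        cx = min(int(x * cells), cells - 1)
        cy = min(int(y * cells), cells - 1)
        hist[(cx, cy, kind)] += 1
    return hist


def intersection(h1, h2):
    """Histogram intersection: sum of bin-wise minima."""
    return sum(min(count, h2[key]) for key, count in h1.items())


def spatial_pyramid_match(a, b, levels=2):
    """Weighted sum of histogram intersections over pyramid levels 0..levels."""
    score = 0.0
    for level in range(levels + 1):
        # Coarse levels get small weights, the finest level gets 1/2.
        if level == 0:
            weight = 1 / 2 ** levels
        else:
            weight = 1 / 2 ** (levels - level + 1)
        score += weight * intersection(
            level_histogram(a, level), level_histogram(b, level)
        )
    return score


# Hypothetical pages: two text blocks and one image each.
page_a = [(0.1, 0.1, "text"), (0.6, 0.1, "image"), (0.1, 0.8, "text")]
page_b = [(0.2, 0.2, "text"), (0.9, 0.9, "image"), (0.1, 0.7, "text")]

print(spatial_pyramid_match(page_a, page_a))  # 3.0 (identical layouts)
print(spatial_pyramid_match(page_a, page_b))  # 1.75
```

Because the level weights sum to 1, a page matched against itself scores its own element count, so dividing by the smaller element count gives a similarity in [0, 1]. Restricting the histograms to a sub-region of the grid would be one way to approximate the partial-page matching the abstract mentions, though the abstract does not say how the authors handle it.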

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Bozkır, A.S., Sezer, E.A. (2014). SimiLay: A Developing Web Page Layout Based Visual Similarity Search Engine. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2014. Lecture Notes in Computer Science, vol 8556. Springer, Cham. https://doi.org/10.1007/978-3-319-08979-9_35

  • DOI: https://doi.org/10.1007/978-3-319-08979-9_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08978-2

  • Online ISBN: 978-3-319-08979-9

  • eBook Packages: Computer Science, Computer Science (R0)
