Thumbnail Summarization Techniques for Web Archives

  • Ahmed AlSum
  • Michael L. Nelson
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8416)


Thumbnails of archived web pages as they appear in common browsers such as Firefox or Chrome can be useful to convey the nature of a web page and how it has changed over time. However, creating thumbnails for all archived web pages is not feasible for large collections, both in terms of time to create the thumbnails and space to store them. Furthermore, at least for the purposes of initial exploration and collection understanding, people will likely only need a few dozen thumbnails and not thousands. In this paper, we develop different algorithms to optimize the thumbnail creation procedure for web archives based on information retrieval techniques. We study different features based on HTML text that correlate with changes in rendered thumbnails so we can know in advance which archived pages to use for thumbnails. We find that SimHash correlates with changes in the thumbnails (ρ = 0.59, p < 0.005). We propose different algorithms for thumbnail creation suitable for different applications, reducing the number of thumbnails to be generated to 9% – 27% of the total size.
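As an illustration of the idea in the abstract, the following is a minimal sketch of how SimHash fingerprints of HTML text could be compared to decide which archived captures deserve a new thumbnail. It is not the authors' implementation: the tokenization, the use of MD5 as the per-token hash, the function names `simhash`, `hamming_distance`, and `select_for_thumbnails`, and the default threshold of 8 bits are all assumptions made for this example.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash fingerprint of `text` over lowercase word tokens.

    Assumption: `bits` is a multiple of 8 and at most 128, because each
    token hash is taken from the leading bytes of an MD5 digest.
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_for_thumbnails(captures, threshold=8):
    """Pick the captures whose text differs enough to warrant a thumbnail.

    `captures` is a chronologically ordered list of (timestamp, html_text)
    pairs. The first capture is always kept; a later capture is kept only
    if its fingerprint differs from the last kept one by more than
    `threshold` bits (a hypothetical cutoff, not a value from the paper).
    """
    selected = []
    last_fp = None
    for timestamp, html_text in captures:
        fp = simhash(html_text)
        if last_fp is None or hamming_distance(fp, last_fp) > threshold:
            selected.append(timestamp)
            last_fp = fp
    return selected
```

Comparing each capture against the last selected one, rather than against its immediate predecessor, keeps slow cumulative drift from going unnoticed. A suitable threshold would have to be tuned against rendered thumbnails for a given collection.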


Keywords (added by machine, not by the authors): Digital Library; Levenshtein Distance; Style Sheet; Information Retrieval Technique; Internet Archive





Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Ahmed AlSum (1)
  • Michael L. Nelson (1)
  1. Computer Science Department, Old Dominion University, Norfolk, USA
