Understanding Content Reuse on the Web: Static and Dynamic Analyses

  • Ricardo Baeza-Yates
  • Álvaro Pereira
  • Nivio Ziviani
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4811)

Abstract

In this paper we present static and dynamic studies of duplicate and near-duplicate documents on the Web. The static study analyzes similar content among pages within a given snapshot of the Web; the dynamic study analyzes how pages in an old snapshot are reused to compose new documents in a more recent snapshot. We ran a series of experiments using four snapshots of the Chilean Web. In the static study, we identify duplicates in both parts of the Web graph, the reachable component (connected by links) and the unreachable component (unconnected), aiming to identify where duplicates occur more frequently. We show that the number of duplicates on the Web seems to be much higher than previously reported (about 50% higher), and in our data the number of duplicates in the unreachable Web is 74.6% higher than the number of duplicates in the reachable component of the Web graph. In the dynamic study, we show that some of the old content is used to compose new pages. If a page in a newer snapshot has content of a page in an older snapshot, we say that the source is a parent of the new page. We state the hypothesis that people use search engines to find pages and republish their content as a new document. We present evidence that this happens for part of the pages that have parents. In this case, part of the Web content is biased by the ranking function of search engines.
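
The parent relation described in the abstract can be illustrated with a minimal sketch. The abstract does not say how content reuse is measured, so everything here is an assumption made for illustration: the use of word shingles, the containment measure, the shingle length `k`, the `threshold` value, and the function names `shingles`, `containment`, and `is_parent`.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Text shorter than one window: treat the whole text as one shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(old_text, new_text, k=3):
    """Fraction of the old page's shingles that also appear in the new page."""
    old = shingles(old_text, k)
    new = shingles(new_text, k)
    if not old:
        return 0.0
    return len(old & new) / len(old)


def is_parent(old_text, new_text, k=3, threshold=0.5):
    """Hypothetical rule (an assumption, not the paper's definition):
    the old page counts as a parent of the new page when at least
    `threshold` of its shingles are reused in the new page."""
    return containment(old_text, new_text, k) >= threshold
```

Under these assumptions, calling `is_parent(old_page_text, new_page_text)` on a page from an older snapshot and a page from a newer one flags the older page as a parent when most of its word sequences reappear in the newer page. An exact duplicate is the special case where the containment is 1.0 in both directions.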

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Ricardo Baeza-Yates (1)
  • Álvaro Pereira (2)
  • Nivio Ziviani (2)
  1. Yahoo! Research & Barcelona Media Innovation Centre, Barcelona, Spain
  2. Department of Computer Science, Federal University of Minas Gerais, Belo Horizonte, Brazil
