Restoring Semantically Incomplete Document Collections Using Lexical Signatures

  • Luis Meneses
  • Himanshu Barthwal
  • Sanjeev Singh
  • Richard Furuta
  • Frank Shipman
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8092)

Abstract

Unexpected changes make it difficult to manage missing resources in a digital collection. In decentralized and distributed collections such as Walden’s Paths, a missing point or an incomplete resource is a serious problem because it can interrupt the continuity of the narration and render the collection semantically incomplete. We foresee two possible scenarios when a resource cannot be found. In the first, we have access to a copy of the missing document or to its lexical signature, which allows us to find the missing resource. The second case is more interesting to us: what happens if we have no valid metadata associated with the missing resource? To solve this problem, we used the lexical signatures of the valid documents within a collection to find suitable replacements for absent resources. We found that traditional similarity metrics do not adequately convey the relationships between the elements of a collection. Our analyses also showed that our procedures were able to restore the semantic integrity of incomplete document collections.
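
To make the second scenario concrete, here is a minimal sketch of one plausible reading of the approach. It is not the authors' implementation. It assumes a lexical signature is the top-k terms of a document ranked by TF-IDF, and that the signatures of the surviving documents are pooled into a short query that could be submitted to a search engine to find replacement candidates. The function names, the smoothed IDF formula, and the pooling step are all illustrative assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase alphabetic tokens only; a real system would likely
    # also remove stop words (assumption: not done here).
    return re.findall(r"[a-z]+", text.lower())


def lexical_signature(doc, corpus, k=5):
    """Return the k highest-scoring TF-IDF terms of doc relative to corpus."""
    tf = Counter(tokenize(doc))
    n = len(corpus)
    doc_sets = [set(tokenize(d)) for d in corpus]
    scores = {}
    for term, count in tf.items():
        df = sum(1 for s in doc_sets if term in s)
        # Smoothed IDF (an assumed variant) to avoid division by zero.
        idf = math.log((1 + n) / (1 + df)) + 1
        scores[term] = count * idf
    # Sort by descending score, then alphabetically for determinism.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k]


def collection_query(valid_docs, k=5):
    """Pool the signatures of the valid documents into one k-term query.

    Hypothetical helper: terms that occur in the most signatures are
    taken to describe what the collection as a whole is about.
    """
    pooled = Counter()
    for doc in valid_docs:
        pooled.update(lexical_signature(doc, valid_docs, k))
    ranked = sorted(pooled.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


docs = [
    "walden pond thoreau walden",
    "thoreau essay nature",
    "pond ecology nature",
]
print(lexical_signature(docs[0], docs, k=2))
print(collection_query(docs, k=2))
```

The pooled query stands in for the metadata that the missing resource lacks. How the returned candidates are then ranked and judged as replacements is outside the scope of this sketch.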

Keywords

Semantic replacements; Web resource management; distributed collections

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Luis Meneses (1)
  • Himanshu Barthwal (1)
  • Sanjeev Singh (1)
  • Richard Furuta (1)
  • Frank Shipman (1)

  1. Center for the Study of Digital Libraries and Department of Computer Science and Engineering, Texas A&M University, College Station, USA
