Abstract

The data for Web mining is usually extracted from the WWW server or proxy server log files. The paper examines the advantages and disadvantages of exploiting another source of input data – the browser buffer. The properties of data extracted from different types of sources are compared. The browser buffer contains data about user navigational habits as well as the formal properties and the content of all recently accessed WWW objects. The paper uses the data obtained from this source to examine the statistical properties of different types of texts extracted from HTML pages.

Keywords

Proxy Cache, Local Buffer, Stop List, Link Text, Goal Page
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Andrzej Siemiński (1)
  1. Institute for Applied Informatics, Technical University of Wrocław, Wrocław, Poland
