Abstract

The data for Web mining is usually extracted from the WWW server or proxy server log files. The paper examines the advantages and disadvantages of exploiting another source of input data – the browser buffer. The properties of data extracted from different types of sources are compared. The browser buffer contains data about user navigational habits as well as the formal properties and the content of all recently accessed WWW objects. The paper uses the data obtained from this source to examine the statistical properties of different types of texts extracted from HTML pages.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Siemiński, A. (2006). Local Buffer as Source of Web Mining Data. In: Gabrys, B., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based Intelligent Information and Engineering Systems. KES 2006. Lecture Notes in Computer Science, vol 4253. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11893011_99

  • DOI: https://doi.org/10.1007/11893011_99

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-46542-3

  • Online ISBN: 978-3-540-46544-7

  • eBook Packages: Computer Science (R0)
