Identification of Underestimated and Overestimated Web Pages Using PageRank and Web Usage Mining Methods

Part of the book series: Lecture Notes in Computer Science (TCCI, volume 9240)

Abstract

The paper describes an alternative method of website analysis and optimization that combines web usage mining and web structure mining: discovering web users’ behaviour patterns and discovering knowledge from the website structure. Its primary objective is to identify web pages whose importance, as estimated by the website developers, does not correspond to the actual behaviour of the website visitors. It has been shown previously that the expected visit rate correlates with the observed visit rate of web pages. Consequently, the expected probabilities of a visitor visiting each web page were calculated using the PageRank method, and the observed probabilities were obtained from the web server log files using web usage mining. The observed and expected probabilities were compared using residual analysis. While sequence rule analysis can only uncover potential problems on web pages with a higher visit rate, the proposed residual analysis can also consider web pages with a lower visit rate. The results can be used for website optimization and restructuring, improving website navigation, and implementing adaptive websites.
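
The comparison outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three-page site, the visit list, the damping factor of 0.85, and the function names are all hypothetical, and the residual here is a plain difference between observed and expected probability, whereas the chapter's residual analysis may use a different (for example, standardised) form.

```python
from collections import Counter


def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank over a dict {page: [linked pages]}.

    Assumes every link target also appears as a key of `links`.
    """
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the random-jump share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, targets in links.items():
            # A page without outgoing links spreads its rank over all pages.
            receivers = targets if targets else pages
            share = damping * rank[page] / len(receivers)
            for target in receivers:
                new_rank[target] += share
        rank = new_rank
    return rank


def observed_probabilities(visits, pages):
    """Relative visit frequency of each page in a list of page views."""
    counts = Counter(visits)
    total = sum(counts[p] for p in pages)
    return {p: counts[p] / total for p in pages}


def residuals(expected, observed):
    """Observed minus expected probability for each page."""
    return {p: observed[p] - expected[p] for p in expected}


# Hypothetical site structure and page views (illustrative data only).
links = {
    "home": ["courses", "contact"],
    "courses": ["home"],
    "contact": ["home"],
}
visits = ["home"] * 5 + ["courses"] * 4 + ["contact"] * 1

expected = pagerank(links)
observed = observed_probabilities(visits, list(links))
res = residuals(expected, observed)

for page in sorted(res, key=res.get):
    print(f"{page:8s} expected={expected[page]:.3f} "
          f"observed={observed[page]:.3f} residual={res[page]:+.3f}")
```

In this toy data, "courses" and "contact" receive the same expected probability because they are linked identically, yet "courses" is visited far more often. A positive residual marks a page visited more than the link structure suggests, and a negative residual marks one visited less; how large a residual must be to count as under- or overestimation is left open here.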

Acknowledgements

This paper was published with the financial support of the Scientific Grant Agency (VEGA), project number VEGA 1/0392/13.

Author information

Corresponding author

Correspondence to Jozef Kapusta.

Copyright information

© 2015 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Kapusta, J., Munk, M., Drlík, M. (2015). Identification of Underestimated and Overestimated Web Pages Using PageRank and Web Usage Mining Methods. In: Nguyen, N. (ed.) Transactions on Computational Collective Intelligence XVIII. Lecture Notes in Computer Science, vol 9240. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-48145-5_7

  • DOI: https://doi.org/10.1007/978-3-662-48145-5_7

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-662-48144-8

  • Online ISBN: 978-3-662-48145-5

  • eBook Packages: Computer Science, Computer Science (R0)
