Abstract
Web usage mining inspects the navigation patterns recorded in web access logs and extracts previously unknown, useful information. This information can guide web-oriented applications such as website restructuring, recommender systems, and web page prediction. The current work clusters user sessions of uneven lengths to discover access patterns, and proposes a distance measure for grouping those sessions. The proposed hybrid distance measure uses the access path information to compute the distance between any two sessions without altering the order in which web pages are visited. R² is used to decide the number of clusters to construct. The Jaccard index and the Davies–Bouldin validity index are employed to assess the resulting clustering. The results obtained with these two standard statistical measures are encouraging and indicate the goodness of the clusters created.
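As a rough illustration of an order-preserving distance between sessions of unequal length, the sketch below scores two page sequences with a plain Needleman-Wunsch global alignment and rescales the score to a distance in [0, 1]. It is not the hybrid measure proposed in the paper, whose definition the abstract does not give. The function names, the scoring values (match = 2, mismatch = -1, gap = -1) and the normalisation are all assumptions chosen for the example.

```python
def alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score of two page sequences.

    The scoring values are illustrative assumptions, not taken from the paper.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + pair,   # align a[i-1] with b[j-1]
                            prev[j] + gap,        # gap in b
                            curr[j - 1] + gap))   # gap in a
        prev = curr
    return prev[-1]


def session_distance(a, b, match=2):
    """Rescale the alignment score to [0, 1]; 0 means identical sessions.

    Hypothetical normalisation: divide by the best score the longer
    session could reach, and clamp negative scores to zero.
    """
    if not a and not b:
        return 0.0
    best = match * max(len(a), len(b))
    score = alignment_score(a, b, match=match)
    return 1.0 - max(score, 0) / best


s1 = ["home", "products", "cart", "checkout"]
s2 = ["home", "products", "checkout"]
print(session_distance(s1, s2))                      # 0.375
print(session_distance(["a", "b"], ["b", "a"]))      # 1.0
```

Because the alignment never reorders pages, two sessions that visit the same pages in a different order (the second call) end up far apart, whereas a set-based measure would treat them as identical.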
Notes
What is a good value for R²? http://www.duke.edu/~rnau/rsquared.htm. Accessed June 2011.
How high, R²? http://cooldata.wordpress.com/2010/04/19/how-high-r-squared/. Accessed June 2011.
Jaccard Index, http://en.wikipedia.org/wiki/Jaccard_index. Accessed March 2011.
Cluster validity algorithms, http://machaon.karanagai.com/validation_algorithms.html. Accessed August 2011.
Acknowledgments
The authors wish to thank the anonymous reviewers for their useful and valuable suggestions.
Cite this article
Poornalatha, G., Prakash, S.R. Web sessions clustering using hybrid sequence alignment measure (HSAM). Soc. Netw. Anal. Min. 3, 257–268 (2013). https://doi.org/10.1007/s13278-012-0070-z
DOI: https://doi.org/10.1007/s13278-012-0070-z