Web Pattern Extraction and Storage

  • Víctor L. Rebolledo
  • Gastón L’Huillier
  • Juan D. Velásquez
Part of the Studies in Computational Intelligence book series (SCI, volume 311)

Abstract

Web data provide information and knowledge that can be used to improve a web site’s content and structure. In particular, they may contain knowledge that suggests changes which make a web site more efficient and effective at attracting and retaining visitors. Using a Data Webhouse or a web analytics solution, it is possible to store statistical information about the behaviour of users on a web site. Likewise, by applying web mining algorithms, interesting patterns can be discovered, interpreted and transformed into useful knowledge. On the other hand, web data include large quantities of irrelevant data, and complex preprocessing must be applied in order to model and understand visitor browsing behaviour. Moreover, there are many ways to pre-process web data and to model browsing behaviour, so different patterns can be obtained depending on which model is used. A knowledge representation is therefore necessary to store and manipulate web patterns. Generally, different patterns are discovered by applying distinct web mining techniques to web data that have been treated in dissimilar ways. Consequently, pattern meta-data are needed to manipulate the discovered knowledge. This chapter covers feature selection, web mining techniques, model characterisation and pattern management, with the aim of building a repository that stores patterns’ meta-data: specifically, a Pattern Webhouse that facilitates knowledge management in the web environment.
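As a rough illustration of why pattern meta-data matter, the sketch below records, for each discovered pattern, which mining technique, preprocessing and features produced it, so that patterns obtained under different treatments can be told apart and queried. It is only a minimal in-memory sketch: the names `PatternMetadata`, `PatternRepository` and `build_example`, the fields, and the sample values are all assumptions made for illustration, not the chapter's actual Pattern Webhouse design.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class PatternMetadata:
    """Hypothetical meta-data record describing how one pattern was obtained."""
    pattern_id: str
    technique: str                      # e.g. "k-means", "association rules"
    preprocessing: str                  # how the web data were treated beforehand
    features: Tuple[str, ...]           # features kept after feature selection
    parameters: Tuple[Tuple[str, float], ...] = ()  # (name, value) pairs of the run


class PatternRepository:
    """In-memory stand-in for a pattern repository (illustrative only)."""

    def __init__(self) -> None:
        self._records: Dict[str, PatternMetadata] = {}

    def store(self, record: PatternMetadata) -> None:
        # Index each record by its identifier; a later record with the
        # same identifier replaces the earlier one.
        self._records[record.pattern_id] = record

    def by_technique(self, technique: str) -> List[PatternMetadata]:
        # Return every stored pattern produced by the given technique.
        return [r for r in self._records.values() if r.technique == technique]


def build_example() -> PatternRepository:
    """Populate a repository with two made-up records."""
    repo = PatternRepository()
    repo.store(PatternMetadata(
        "p1", "k-means", "timeout-based sessions",
        ("pages_visited", "time_spent"), (("k", 3.0),)))
    repo.store(PatternMetadata(
        "p2", "association rules", "timeout-based sessions",
        ("pages_visited",), (("min_support", 0.1),)))
    return repo
```

Calling `build_example().by_technique("k-means")` returns only the record stored under that technique, which is the kind of lookup that becomes possible once the treatment of the data is stored alongside each pattern.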

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Víctor L. Rebolledo
    • 1
  • Gastón L’Huillier
    • 1
  • Juan D. Velásquez
    • 1
  1. Department of Industrial Engineering, University of Chile, Santiago, Chile
