Advanced Techniques in Web Data Pre-processing and Cleaning

  • Pablo E. Román
  • Robert F. Dell
  • Juan D. Velásquez

Abstract

Central to successful e-business is the construction of web sites that attract users, capture their preferences, and entice them into making a purchase. Web mining is the application of diverse data mining techniques to categorize both the content and structure of web sites with the goal of aiding e-business. Web mining requires knowledge of the web site structure (the hyperlink graph), the web content (the vector model), and user sessions (the sequence of pages each user visits on a site). Much of the data for web mining can be noisy. The noise has many sources, for example undocumented changes to the web site structure and content, differing interpretations of text and media semantics, and web logs without individual user identification. The logs may also fail to record how many times a specific page was visited in a session, because the page can be served from a proxy or web browser cache. Such noise presents a challenge for web mining. This chapter presents issues with and approaches for cleaning web data in preparation for web mining analysis.

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Pablo E. Román (2)
  • Robert F. Dell (1)
  • Juan D. Velásquez (2)
  1. Operations Research Department, Naval Postgraduate School, Monterey, USA
  2. Department of Industrial Engineering, University of Chile, Santiago, Chile
