
Advanced Techniques in Web Data Pre-processing and Cleaning


Part of the book series: Studies in Computational Intelligence (SCI, volume 311)

Abstract

Central to successful e-business is the construction of web sites that attract users, capture user preferences, and entice them into making a purchase. Web mining applies diverse data mining techniques to categorize both the content and structure of web sites with the goal of aiding e-business. It requires knowledge of the web site structure (the hyperlink graph), the web content (the vector model), and user sessions (the sequence of pages each user visits on a site). Much of the data for web mining can be noisy. The noise comes from many sources, for example undocumented changes to the web site structure and content, differing interpretations of text and media semantics, and web logs without individual user identification. There may also be no record of how many times a specific page was visited in a session, because the page may be served from a proxy or web browser cache. Such noise presents a challenge for web mining. This chapter presents issues with and approaches for cleaning web data in preparation for web mining analysis.
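As a rough illustration of what reconstructing user sessions from anonymous web logs can involve, the sketch below applies a simple inactivity-timeout heuristic. It is not taken from the chapter: the record layout, the use of the (IP address, user agent) pair as a stand-in for a user, the function name `sessionize`, and the 30-minute timeout are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical record layout: (ip, user_agent, timestamp, url).
# Real log formats differ; this tuple shape is an assumption.

def sessionize(records, timeout=timedelta(minutes=30)):
    """Group hits by (ip, user_agent) and split on inactivity gaps.

    The (ip, user_agent) pair only approximates a user, because the
    logs carry no individual identification. Pages served from a
    proxy or browser cache never reach the log, so they are missing
    from the reconstructed sessions.
    """
    by_visitor = defaultdict(list)
    for ip, agent, ts, url in records:
        by_visitor[(ip, agent)].append((ts, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort()  # chronological order within one visitor
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                # Gap longer than the timeout: close the session.
                sessions.append((visitor, current))
                current = []
            current.append(url)
        sessions.append((visitor, current))
    return sessions


if __name__ == "__main__":
    example = [
        ("10.0.0.1", "UA-1", datetime(2010, 1, 1, 9, 0), "/home"),
        ("10.0.0.1", "UA-1", datetime(2010, 1, 1, 9, 5), "/products"),
        ("10.0.0.2", "UA-2", datetime(2010, 1, 1, 9, 6), "/home"),
        ("10.0.0.1", "UA-1", datetime(2010, 1, 1, 10, 0), "/home"),
    ]
    for visitor, pages in sessionize(example):
        print(visitor, pages)
```

A heuristic of this kind is sensitive to exactly the noise sources listed in the abstract: several users behind one proxy collapse into a single visitor, and cached page views leave gaps in the recorded sequence.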




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Román, P.E., Dell, R.F., Velásquez, J.D. (2010). Advanced Techniques in Web Data Pre-processing and Cleaning. In: Velásquez, J.D., Jain, L.C. (eds) Advanced Techniques in Web Intelligence - I. Studies in Computational Intelligence, vol 311. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14461-5_2


  • DOI: https://doi.org/10.1007/978-3-642-14461-5_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14460-8

  • Online ISBN: 978-3-642-14461-5

  • eBook Packages: Engineering, Engineering (R0)
