Advanced Techniques in Web Data Pre-processing and Cleaning

Román, Pablo E.; Dell, Robert F.; Velásquez, Juan D.

doi:10.1007/978-3-642-14461-5_2


Part of the book series: Studies in Computational Intelligence (SCI, volume 311)


Abstract

Central to successful e-business is the construction of web sites that attract users, capture their preferences, and entice them to make a purchase. Web mining applies a diverse set of data mining techniques to categorize both the content and structure of web sites, with the goal of aiding e-business. It requires knowledge of the web site structure (hyperlink graph), the web content (vector model), and user sessions (the sequence of pages each user visits on a site). Much of the data for web mining can be noisy. The noise comes from many sources, for example undocumented changes to the web site structure and content, differing interpretations of text and media semantics, and web logs without individual user identification. The number of times a specific page was visited in a session may also go unrecorded, because the page may be served from a proxy or web browser cache. Such noise presents a challenge for web mining. This chapter presents issues with, and approaches for, cleaning web data in preparation for web mining analysis.
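
To make the notion of a user session concrete, the following is a minimal Python sketch of a common timeout heuristic for rebuilding sessions from web log records that carry no user identifier. It is an illustration only and not necessarily the approach taken in the chapter. The record layout (IP address, user agent, timestamp, URL), the function name sessionize, and the 30-minute timeout are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def sessionize(records, timeout=timedelta(minutes=30)):
    """Group log records into sessions with a simple timeout heuristic.

    records: iterable of (ip, user_agent, timestamp, url) tuples
             (a hypothetical layout chosen for this sketch).
    Returns a list of (visitor_key, [url, ...]) pairs.
    """
    # Without user identification, approximate a visitor by (IP, user agent).
    by_visitor = defaultdict(list)
    for ip, agent, ts, url in records:
        by_visitor[(ip, agent)].append((ts, url))

    sessions = []
    for visitor, hits in by_visitor.items():
        hits.sort(key=lambda hit: hit[0])  # order each visitor's hits by time
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            # A gap longer than the timeout closes the current session.
            if ts - prev_ts > timeout:
                sessions.append((visitor, current))
                current = []
            current.append(url)
        sessions.append((visitor, current))
    return sessions


example_log = [
    ("10.0.0.1", "A", datetime(2010, 1, 1, 10, 0), "/home"),
    ("10.0.0.1", "A", datetime(2010, 1, 1, 10, 5), "/products"),
    ("10.0.0.1", "A", datetime(2010, 1, 1, 11, 0), "/home"),
    ("10.0.0.2", "B", datetime(2010, 1, 1, 10, 1), "/home"),
]
example_sessions = sessionize(example_log)
```

Such a heuristic cannot recover page views that were served from a proxy or browser cache, which is one reason the noise described above remains a challenge.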



Author information

Authors and Affiliations

Operations Research Department, Naval Postgraduate School, Monterey, California, USA
Robert F. Dell
Department of Industrial Engineering, University of Chile, República 701, Santiago, Chile
Pablo E. Román & Juan D. Velásquez


Editor information

Editors and Affiliations

University of Chile Republica, Santiago, Chile
Juan D. Velásquez
University of South Australia, Adelaide, Australia
Lakhmi C. Jain


About this chapter

Cite this chapter

Román, P.E., Dell, R.F., Velásquez, J.D. (2010). Advanced Techniques in Web Data Pre-processing and Cleaning. In: Velásquez, J.D., Jain, L.C. (eds) Advanced Techniques in Web Intelligence - I. Studies in Computational Intelligence, vol 311. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14461-5_2


DOI: https://doi.org/10.1007/978-3-642-14461-5_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14460-8
Online ISBN: 978-3-642-14461-5
eBook Packages: Engineering (R0)
