Machine Learning, Volume 57, Issue 1–2, pp 83–113

Lessons and Challenges from Mining Retail E-Commerce Data

  • Ron Kohavi
  • Llew Mason
  • Rajesh Parekh
  • Zijian Zheng
Article

Abstract

The architecture of Blue Martini Software's e-commerce suite has supported data collection, data transformation, and data mining since its inception. With clickstreams being collected at the application-server layer, high-level events being logged, and data automatically transformed into a data warehouse using meta-data, common problems plaguing data mining using weblogs (e.g., sessionization and conflating multi-sourced data) were obviated, thus allowing us to concentrate on actual data mining goals. The paper briefly reviews the architecture and discusses many lessons learned over the last four years and the challenges that still need to be addressed. The lessons and challenges are presented across two dimensions: business-level vs. technical, and throughout the data mining lifecycle stages of data collection, data warehouse construction, business intelligence, and deployment. The lessons and challenges are also widely applicable to data mining domains outside retail e-commerce.

data mining, data analysis, business intelligence, web analytics, web mining, OLAP, visualization, reporting, data transformations, retail e-commerce, Simpson's paradox, sessionization, bot detection, clickstreams, application server, web logs, data cleansing, hierarchical attributes, business reporting, data warehousing

Copyright information

© Kluwer Academic Publishers 2004

Authors and Affiliations

  • Ron Kohavi (1)
  • Llew Mason (2)
  • Rajesh Parekh (2)
  • Zijian Zheng (3)

  1. Amazon.com, Seattle
  2. Blue Martini Software, San Mateo
  3. Microsoft Corporation, Redmond
