Abstract
Modern businesses routinely capture data on millions of observations across subjects, brand SKUs, time periods, predictor variables, and store locations, thereby generating massive high-dimensional datasets. For example, Netflix has choice data on billions of movies selected, user ratings, and geodemographic characteristics. Similar datasets emerge in retailing (with the potential use of RFID tags), online auctions (e.g., eBay), social networking sites (e.g., MySpace), product reviews (e.g., Epinions), customer relationship marketing, internet commerce, and mobile marketing. We envision massive databases as four-way VAST matrix arrays of Variables × Alternatives × Subjects × Time, where at least one dimension is very large. Predictive choice modeling of such massive databases poses novel computational and modeling issues, and the failure of academic research to address them will result in a disconnect from marketing practice and an impoverishment of marketing theory. To address these issues, we identify and discuss the challenges and opportunities for both practicing and academic marketers. We thus offer an impetus for advancing research in this nascent area and for fostering collaboration across scientific disciplines to improve the practice of marketing in information-rich environments.
Additional information
Prasad Naik and Michel Wedel are co-chairs.
Cite this article
Naik, P., Wedel, M., Bacon, L. et al. Challenges and opportunities in high-dimensional choice data analyses. Mark Lett 19, 201–213 (2008). https://doi.org/10.1007/s11002-008-9036-3
DOI: https://doi.org/10.1007/s11002-008-9036-3