Joint selection of variables and clusters: recovering the underlying structure of marketing data

  • Original Article
  • Journal of Marketing Analytics

Abstract

Clustering observations into groups is perhaps one of the more common marketing analytic techniques. Many variable-selection procedures are available for clustering, and some have exhibited good performance in simulation studies. Unfortunately, the best-performing methods often fail because they emphasize the clustering power of individual variables. For this reason, we recommend extreme caution when using the existing procedures, and we argue that enumeration of all-possible variable subsets is a preferred strategy. We also address a common decision problem—the selection of the number of clusters—and develop an index which can help guide the joint selection of variables and clusters. By way of an empirical example, we illustrate the variable-selection problem and demonstrate the use of the proposed index to jointly select variables and clusters in K-means partitioning.

Notes

  1. The terms partitioning and clustering are often used interchangeably. A partitioning method separates a set of n objects into K nonempty, nonoverlapping, and exhaustive subsets. These K subsets are typically termed clusters or groups. By contrast, clustering methods include partitioning methods, but may also encompass methods that do not directly produce partitions, such as hierarchical clustering, overlapping clustering, and fuzzy clustering methods. In this paper, we limit our focus to partitioning methods, so both partitioning and clustering are valid descriptors of the method.

  2. Marketing applications can easily involve considerably more clustering variables. However, exhaustive enumeration of all-possible subsets becomes impractical for datasets with a large number of candidate variables. When the number of candidate variables, J, exceeds 15, Steinley and Brusco (2008a, b) suggest replacing exhaustive enumeration of subsets with a tree-search heuristic. The tree size is controlled by limiting new branches from j to j + 1 variables to the 10 best candidates at each stage. For further discussion of the technique, see Steinley and Brusco (2008a, b), who tested the approach on a number of actual and synthetic datasets.

  3. Blockbusters commonly are defined as pharmaceutical products garnering at least one billion dollars in sales annually (Li 2014).

  4. We omit all descriptive details of a full cluster analysis, largely because one of our goals is expository. We note that an important step after interpreting the solutions is to validate them using variables not included in the cluster analysis. For instance, product revenue can be used to establish criterion validity of the APS solution (F(5, 281) = 4.06, p < 0.01), with products in Cluster 2 generating significantly more revenue than products in other clusters. This is consistent with our interpretation of this cluster as identifying potential blockbusters.
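
The tree-search heuristic described in Note 2 can be sketched as a beam search over variable subsets. The sketch below is illustrative only and is not Steinley and Brusco's implementation: the names `beam_search_subsets`, `toy_score`, and `TRUE_VARS`, the `beam_width` parameter, and the reading of "the 10 best candidates" as the 10 best-scoring subsets retained at each subset size are all assumptions. The scoring function is a hypothetical stand-in; in practice it would be some measure of clustering quality computed from a K-means solution on the subset.

```python
from itertools import combinations

# Hypothetical "true" clustering variables, used only by the toy score below.
TRUE_VARS = frozenset({0, 3, 5})


def toy_score(subset):
    """Stand-in criterion (an assumption): reward signal variables and
    penalize noise variables. Higher is better."""
    signal = len(subset & TRUE_VARS)
    noise = len(subset - TRUE_VARS)
    return signal - 0.5 * noise


def beam_search_subsets(n_vars, score, beam_width=10, start_size=1, max_size=None):
    """Grow variable subsets one variable at a time, retaining only the
    `beam_width` best-scoring subsets of each size.

    Returns a dict mapping subset size to the best subset found at that size.
    """
    if max_size is None:
        max_size = n_vars
    # First stage: score every subset of the starting size exhaustively.
    frontier = [frozenset(c) for c in combinations(range(n_vars), start_size)]
    frontier = sorted(frontier, key=score, reverse=True)[:beam_width]
    best_by_size = {start_size: frontier[0]}
    for size in range(start_size + 1, max_size + 1):
        # Branch from j to j + 1 variables: add one unused variable to each
        # retained subset (the set comprehension removes duplicates).
        candidates = {s | {v} for s in frontier for v in range(n_vars) if v not in s}
        if not candidates:
            break
        # Prune: keep only the best `beam_width` candidates for the next stage.
        frontier = sorted(candidates, key=score, reverse=True)[:beam_width]
        best_by_size[size] = frontier[0]
    return best_by_size


if __name__ == "__main__":
    for size, subset in beam_search_subsets(8, toy_score).items():
        print(size, sorted(subset), toy_score(subset))
```

For comparison, with J = 15 candidate variables exhaustive enumeration covers 2^15 − 1 = 32,767 nonempty subsets, whereas this sketch scores at most `beam_width` × J extended subsets at each stage after the first. The pruning trades the guarantee of finding the best subset for a search that stays tractable as J grows.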

References

  • Ahlawat, H., G. Chierchia, and P. van Arkel. 2014. The secret of successful drug launches. McKinsey & Company report, March. http://www.mckinsey.com/industries/pharmaceuticals-and-medical-products/our-insights/the-secret-of-successful-drug-launches. Accessed 5 Oct 2018.

  • Arabie, P., and L.J. Hubert. 1994. Cluster analysis in marketing research. In Advanced methods of marketing research, ed. R.P. Bagozzi, 160–189. Oxford: Blackwell.

  • Bishop, C.M. 1995. Neural networks for pattern recognition. New York: Oxford University Press.

  • Bozdogan, H. 1994. Choosing the number of clusters, subset selection of variables, and outlier detection in the standard mixture-model cluster analysis. In New approaches in classification and data analysis, ed. E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy, 169–177. Berlin: Springer.

  • Brusco, M.J., and J.D. Cradit. 2001. A variable-selection heuristic for K-means clustering. Psychometrika 66 (2): 249–270.

  • Brusco, M.J., R. Singh, J.D. Cradit, and D. Steinley. 2017. Cluster analysis in OM research: Survey and recommendations. International Journal of Operations and Production Management 37 (3): 300–320.

  • Brusco, M.J., and D. Steinley. 2007. A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72 (4): 583–600.

  • Caliński, T., and J. Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics 3 (1): 1–27.

  • Carmone, F.J., A. Kara, and S. Maxwell. 1999. HINoV: A new model to improve market segmentation by identifying noisy variables. Journal of Marketing Research 36 (4): 501–509.

  • Cook, A.G. 2006. Forecasting for the pharmaceutical industry. Aldershot: Gower Publishing.

  • Corstjens, M., E. Demeire, and I. Horowitz. 2005. New-product success in the pharmaceutical industry: How many bites at the cherry? Economics of Innovation and New Technology 14 (4): 319–331.

  • DeSarbo, W.S., J.D. Carroll, L.A. Clark, and P.E. Green. 1984. Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika 49 (1): 57–78.

  • Dy, J.G., and C.E. Brodley. 2004. Feature selection for unsupervised learning. Journal of Machine Learning Research 5: 845–889.

  • Fischer, M., P.S.H. Leeflang, and P.C. Verhoef. 2010. Drivers of peak sales for pharmaceutical brands. Quantitative Marketing and Economics 8 (4): 429–460.

  • Fowlkes, E.B., and C.L. Mallows. 1983. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78 (383): 553–584.

  • Friedman, J.H., and J.J. Meulman. 2004. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society B 66 (4): 815–849.

  • Gnanadesikan, R., J.R. Kettenring, and S.L. Tsao. 1995. Weighting and selection of variables for cluster analysis. Journal of Classification 12 (1): 113–136.

  • Grabowski, H., and J. Vernon. 1990. A new look at the returns and risks to pharmaceutical R&D. Management Science 36 (7): 804–821.

  • Green, P.E., F.J. Carmone, and J. Kim. 1990. A preliminary study of optimal variable weighting in K-means clustering. Journal of Classification 7 (2): 271–285.

  • Hair, J.F., W.C. Black, B.J. Babin, and R.E. Anderson. 2014. Multivariate data analysis, 7th ed. Upper Saddle River: Pearson Prentice Hall.

  • Han, J., M. Kamber, and J. Pei. 2012. Data mining: Concepts and techniques, 3rd ed. Amsterdam: Elsevier.

  • Helsen, K., and P.E. Green. 1991. A computational study of replicated clustering with an application to market segmentation. Decision Sciences 22 (5): 1124–1141.

  • Henard, D.H., and D.M. Szymanski. 2001. Why some new products are more successful than others. Journal of Marketing Research 38 (3): 362–375.

  • Hubert, L., and P. Arabie. 1985. Comparing partitions. Journal of Classification 2 (2): 193–218.

  • Jain, A.K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31 (8): 651–666.

  • Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31 (3): 264–323.

  • Jain, P., P. Sharma, and L. Jayaraman. 2014. Behind every good decision: How anyone can use business analytics to turn data into profitable insight. New York: American Management Association.

  • Kalyanaram, G., W.T. Robinson, and G.L. Urban. 1995. Order of market entry: Established empirical generalizations, emerging empirical generalizations, and future research. Marketing Science 14 (3): G212–G221.

  • Kerin, R.A., P.R. Varadarajan, and R.A. Peterson. 1992. First-mover advantage: A synthesis, conceptual framework, and research propositions. Journal of Marketing 56 (4): 33–52.

  • Kim, S.-S. 2015. Variable selection and outlier detection for automated K-means clustering. Communications for Statistical Applications and Methods 22 (1): 55–67.

  • Koubaa, Y., R.S. Tabbane, and M. Hamouda. 2017. Segmentation of the senior market: How do different variable sets discriminate between senior segments? Journal of Marketing Analytics 5 (3–4): 99–110.

  • Law, M.H.C., M.A.T. Figueiredo, and A.K. Jain. 2004. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (9): 1154–1166.

  • Li, J.J. 2014. Blockbuster drugs: The rise and decline of the pharmaceutical industry. New York: Oxford University Press.

  • Mathwick, C. 2002. Understanding the online consumer: A typology of online relational norms and behavior. Journal of Interactive Marketing 16 (1): 40–55.

  • Milligan, G.W. 1989. A validation study of a variable-weighting algorithm for cluster analysis. Journal of Classification 6 (1): 53–71.

  • Milligan, G.W., and M.C. Cooper. 1986. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 21 (4): 441–458.

  • Montanari, A., and L. Lizzani. 2001. A projection pursuit approach to variable selection. Computational Statistics & Data Analysis 35 (4): 463–473.

  • Narayanan, S., R. Desiraju, and P.K. Chintagunta. 2004. Return on investment implications for pharmaceutical promotional expenditures: The role of marketing-mix interactions. Journal of Marketing 68 (4): 90–105.

  • Osinga, E.C., P.S.H. Leeflang, and J.E. Wieringa. 2010. Early marketing matters: A time-varying parameter approach to persistence modeling. Journal of Marketing Research 47 (1): 173–185.

  • Palazzo, M., A. Vollero, and A. Siano. 2016. Identifying new segments from a global branding perspective: A three-country study. Journal of Marketing Analytics 4 (4): 159–171.

  • Raftery, A.E., and N. Dean. 2006. Variable selection for model-based clustering. Journal of the American Statistical Association 101 (473): 168–178.

  • Resney, R., A. Aboshiha, E. Carlisle, and S. Waddell. 2017. Launch for long-term success. Pharmaceutical Executive report, 9 May. http://www.pharmexec.com/launch-long-term-success. Accessed 5 Oct 2018.

  • Shankar, V., G.S. Carpenter, and L. Krishnamurthi. 1998. Late mover advantage: How innovative late entrants outsell pioneers. Journal of Marketing Research 35 (1): 54–70.

  • Steinhaus, H. 1956. Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences, Classe III, IV (12): 801–804.

  • Steinley, D. 2004. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods 9 (3): 386–396.

  • Steinley, D. 2006. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59 (1): 1–34.

  • Steinley, D., and M.J. Brusco. 2008a. A new variable weighting and selection procedure for K-means cluster analysis. Multivariate Behavioral Research 43 (1): 77–108.

  • Steinley, D., and M.J. Brusco. 2008b. Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika 73 (1): 125–144.

  • Steinley, D., M.J. Brusco, and L. Hubert. 2016. The variance of the adjusted Rand index. Psychological Methods 21 (2): 261–272.

  • Urban, G.L., and J.R. Hauser. 1993. Design and marketing of new products. Englewood Cliffs: Prentice-Hall.

  • Wedel, M., and W.A. Kamakura. 2000. Market segmentation: Conceptual and methodological foundations, 2nd ed. Dordrecht: Kluwer.

  • Winegarden, W. 2017. U.S. Pharmaceutical pricing in context. San Francisco: Pacific Research Institute.

Author information

Corresponding author

Correspondence to Susan Brudvig.

About this article

Cite this article

Brudvig, S., Brusco, M.J. & Cradit, J.D. Joint selection of variables and clusters: recovering the underlying structure of marketing data. J Market Anal 7, 1–12 (2019). https://doi.org/10.1057/s41270-018-0045-7

  • DOI: https://doi.org/10.1057/s41270-018-0045-7
