Abstract
Clustering observations into groups is perhaps one of the more common marketing analytic techniques. Many variable-selection procedures are available for clustering, and some have exhibited good performance in simulation studies. Unfortunately, the best-performing methods often fail because they emphasize the clustering power of individual variables. For this reason, we recommend extreme caution when using the existing procedures, and we argue that enumeration of all-possible variable subsets is a preferred strategy. We also address a common decision problem—the selection of the number of clusters—and develop an index which can help guide the joint selection of variables and clusters. By way of an empirical example, we illustrate the variable-selection problem and demonstrate the use of the proposed index to jointly select variables and clusters in K-means partitioning.
Notes
The terms partitioning and clustering are often used interchangeably. A partitioning method separates a set of n objects into K nonempty, nonoverlapping, and exhaustive subsets. These K subsets are typically termed clusters or groups. By contrast, clustering methods include partitioning methods, but may also encompass methods that do not directly produce partitions, such as hierarchical clustering, overlapping clustering, and fuzzy clustering methods. In this paper, we limit our focus to partitioning methods, so both partitioning and clustering are valid descriptors of the method.
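The three conditions in this definition (nonempty, nonoverlapping, exhaustive) can be checked mechanically. The following is a minimal sketch, not part of the original paper; the function name `is_partition` and the representation of clusters as Python sets of object indices 0, ..., n - 1 are assumptions made for illustration.

```python
def is_partition(clusters, n):
    """Check whether `clusters` (a list of sets of object indices)
    is a partition of the n objects labeled 0..n-1.

    Hypothetical helper for illustration only.
    """
    # Nonempty: every cluster must contain at least one object.
    if any(len(c) == 0 for c in clusters):
        return False
    # Nonoverlapping: no object is counted twice across clusters.
    union = set().union(*clusters) if clusters else set()
    if sum(len(c) for c in clusters) != len(union):
        return False
    # Exhaustive: every object belongs to some cluster.
    return union == set(range(n))


valid = is_partition([{0, 1}, {2}, {3, 4}], n=5)      # satisfies all three
overlapping = is_partition([{0, 1}, {1, 2}], n=3)     # object 1 appears twice
incomplete = is_partition([{0}, {1}], n=3)            # object 2 is unassigned
```

Overlapping and fuzzy clustering methods would fail the second check by design, which is one way to see why they fall outside the partitioning methods considered here.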
Marketing applications can easily involve considerably more clustering variables, which places the problem in a high-dimensional space. However, exhaustive enumeration of all-possible subsets becomes impractical for datasets with a large number of candidate variables. For J > 15, Steinley and Brusco (2008a, b) suggest replacing exhaustive enumeration of subsets with a tree-search heuristic. The tree size is controlled by limiting new branches from j to j + 1 variables to the 10 best candidates at each stage. For further discussion of these techniques, see Steinley and Brusco (2008a, b), who tested the approach on a number of actual and synthetic datasets.
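To make the trade-off concrete, the sketch below contrasts the number of nonempty subsets of J variables (2^J - 1) with a width-limited tree search. It is an illustration under stated assumptions, not the Steinley and Brusco procedure itself: the names `beam_search` and `toy_score`, the choice to start from single variables, and the scoring function are all hypothetical. In practice the score would be a clustering quality measure computed from a K-means solution on the subset.

```python
def count_all_subsets(J):
    """Number of nonempty subsets of J candidate variables."""
    return 2 ** J - 1


def beam_search(J, score, beam_width=10):
    """Grow subsets from j to j + 1 variables, keeping only the
    `beam_width` best-scoring subsets at each size.

    Assumption: the search starts from single variables and `score`
    is any function mapping a tuple of variable indices to a number.
    """
    frontier = [(j,) for j in range(J)]
    best_subset, best_score = None, float("-inf")
    n_scored = 0
    while frontier:
        scored = [(score(s), s) for s in frontier]
        n_scored += len(scored)
        # Highest score first; ties broken by subset for determinism.
        scored.sort(key=lambda t: (-t[0], t[1]))
        kept = scored[:beam_width]
        if kept[0][0] > best_score:
            best_score, best_subset = kept[0]
        # Branch: extend each retained subset by one unused variable.
        children = set()
        for _, s in kept:
            for j in range(J):
                if j not in s:
                    children.add(tuple(sorted(s + (j,))))
        frontier = sorted(children)
    return best_subset, best_score, n_scored


# Toy score (made up): variables 0, 1, 2 carry structure, the rest are noise.
TRUE_VARS = {0, 1, 2}

def toy_score(subset):
    hits = len(TRUE_VARS & set(subset))
    return hits - 0.5 * (len(subset) - hits)


subset, value, n_scored = beam_search(20, toy_score)
print(count_all_subsets(15))   # 32767 subsets at the J = 15 threshold
print(subset, n_scored, count_all_subsets(20))
```

With 20 candidate variables the limited search scores a few thousand subsets at most, against more than a million for full enumeration. The cost of that saving is that a subset whose smaller parents all scored poorly is never visited.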
Blockbusters are commonly defined as pharmaceutical products garnering at least one billion dollars in annual sales (Li 2014).
We omit all descriptive details of a full cluster analysis, largely because one of our goals is expository. We note that an important step after interpreting the solutions is to validate them using variables not included in the cluster analysis. For instance, product revenue can be used to establish criterion validity of the APS solution (F(5, 281) = 4.06, p < 0.01), with products in Cluster 2 generating significantly more revenue than products in the other clusters. This is consistent with our interpretation of this cluster as identifying potential blockbusters.
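The validation step described in this note amounts to comparing an external variable across clusters, for example with a one-way analysis of variance. The sketch below computes the F statistic and its degrees of freedom from grouped values. The function name `one_way_f` and the numbers are made up for illustration; they are not the revenue data from the study. Degrees of freedom of (5, 281) would be consistent with six groups and 287 observations, although the note itself does not state these counts.

```python
def one_way_f(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values.

    Returns (F, df_between, df_within). Hypothetical helper for
    illustration; `groups` holds an external validation variable
    (such as revenue) split by cluster membership.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-cluster sum of squares: spread of cluster means.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-cluster sum of squares: spread around each cluster mean.
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up values for three clusters, not data from the paper.
f_stat, df_b, df_w = one_way_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

A large F relative to its reference distribution indicates that the clusters differ on a variable that played no part in forming them, which is the sense of criterion validity used above.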
References
Ahlawat, H., G. Chierchia, and P. van Arkel. 2014. The secret of successful drug launches. McKinsey & Company report, March. http://www.mckinsey.com/industries/pharmaceuticals-and-medical-products/our-insights/the-secret-of-successful-drug-launches. Accessed 5 Oct 2018.
Arabie, P., and L.J. Hubert. 1994. Cluster analysis in marketing research. In Advanced methods of marketing research, ed. R.P. Bagozzi, 160–189. Oxford: Blackwell.
Bishop, C.M. 1995. Neural networks for pattern recognition. New York: Oxford University Press.
Bozdogan, H. 1994. Choosing the number of clusters, subset selection of variables, and outlier detection in the standard mixture-model cluster analysis. In New approaches in classification and data analysis, ed. E. Diday, Y. Lechevallier, M. Schader, P. Bertrand, and B. Burtschy, 169–177. Berlin: Springer.
Brusco, M.J., and J.D. Cradit. 2001. A variable-selection heuristic for K-means clustering. Psychometrika 66 (2): 249–270.
Brusco, M.J., R. Singh, J.D. Cradit, and D. Steinley. 2017. Cluster analysis in OM research: Survey and recommendations. International Journal of Operations and Production Management 37 (3): 300–320.
Brusco, M.J., and D. Steinley. 2007. A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72 (4): 583–600.
Caliński, T., and J. Harabasz. 1974. A dendrite method for cluster analysis. Communications in Statistics 3 (1): 1–27.
Carmone, F.J., A. Kara, and S. Maxwell. 1999. HINoV: A new model to improve market segmentation by identifying noisy variables. Journal of Marketing Research 36 (4): 501–509.
Cook, A.G. 2006. Forecasting for the pharmaceutical industry. Aldershot: Gower Publishing.
Corstjens, M., E. Demeire, and I. Horowitz. 2005. New-product success in the pharmaceutical industry: How many bites at the cherry? Economics of Innovation and New Technology 14 (4): 319–331.
DeSarbo, W.S., J.D. Carroll, L.A. Clark, and P.E. Green. 1984. Synthesized clustering: A method for amalgamating alternative clustering bases with differential weighting of variables. Psychometrika 49 (1): 57–78.
Dy, J.G., and C.E. Brodley. 2004. Feature selection for unsupervised learning. Journal of Machine Learning Research 5: 845–889.
Fischer, M., P.S.H. Leeflang, and P.C. Verhoef. 2010. Drivers of peak sales for pharmaceutical brands. Quantitative Marketing and Economics 8 (4): 429–460.
Fowlkes, E.B., and C.L. Mallows. 1983. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association 78 (383): 553–584.
Friedman, J.H., and J.J. Meulman. 2004. Clustering objects on subsets of attributes. Journal of the Royal Statistical Society B 66 (4): 815–849.
Gnanadesikan, R., J.R. Kettenring, and S.L. Tsao. 1995. Weighting and selection of variables for cluster analysis. Journal of Classification 12 (1): 113–136.
Grabowski, H., and J. Vernon. 1990. A new look at the returns and risks to pharmaceutical R&D. Management Science 36 (7): 804–821.
Green, P.E., F.J. Carmone, and J. Kim. 1990. A preliminary study of optimal variable weighting in K-means clustering. Journal of Classification 7 (2): 271–285.
Hair, J.F., W.C. Black, B.J. Babin, and R.E. Anderson. 2014. Multivariate data analysis, 7th ed. Upper Saddle River: Pearson Prentice Hall.
Han, J., M. Kamber, and J. Pei. 2012. Data mining: Concepts and techniques, 3rd ed. Amsterdam: Elsevier.
Helsen, K., and P.E. Green. 1991. A computational study of replicated clustering with an application to market segmentation. Decision Sciences 22 (5): 1124–1141.
Henard, D.H., and D.M. Szymanski. 2001. Why some new products are more successful than others. Journal of Marketing Research 38 (3): 362–375.
Hubert, L., and P. Arabie. 1985. Comparing partitions. Journal of Classification 2 (2): 193–218.
Jain, A.K. 2010. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 31 (8): 651–666.
Jain, A.K., M.N. Murty, and P.J. Flynn. 1999. Data clustering: A review. ACM Computing Surveys 31 (3): 264–323.
Jain, P., P. Sharma, and L. Jayaraman. 2014. Behind every good decision: How anyone can use business analytics to turn data into profitable insight. New York: American Management Association.
Kalyanaram, G., W.T. Robinson, and G.L. Urban. 1995. Order of market entry: Established empirical generalizations, emerging empirical generalizations, and future research. Marketing Science 14 (3): G212–G221.
Kerin, R.A., P.R. Varadarajan, and R.A. Peterson. 1992. First-mover advantage: A synthesis, conceptual framework, and research propositions. Journal of Marketing 56 (4): 33–52.
Kim, S.-S. 2015. Variable selection and outlier detection for automated K-means clustering. Communications for Statistical Applications and Methods 22 (1): 55–67.
Koubaa, Y., R.S. Tabbane, and M. Hamouda. 2017. Segmentation of the senior market: How do different variable sets discriminate between senior segments? Journal of Marketing Analytics 5 (3–4): 99–110.
Law, M.H.C., M.A.T. Figueiredo, and A.K. Jain. 2004. Simultaneous feature selection and clustering using mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence 26 (9): 1154–1166.
Li, J.J. 2014. Blockbuster drugs: The rise and decline of the pharmaceutical industry. New York: Oxford University Press.
Mathwick, C. 2002. Understanding the online consumer: A typology of online relational norms and behavior. Journal of Interactive Marketing 16 (1): 40–55.
Milligan, G.W. 1989. A validation study of a variable-weighting algorithm for cluster analysis. Journal of Classification 6 (1): 53–71.
Milligan, G.W., and M.C. Cooper. 1986. A study of the comparability of external criteria for hierarchical cluster analysis. Multivariate Behavioral Research 21 (4): 441–458.
Montanari, A., and L. Lizzani. 2001. A projection pursuit approach to variable selection. Computational Statistics & Data Analysis 35 (4): 463–473.
Narayanan, S., R. Desiraju, and P.K. Chintagunta. 2004. Return on investment implications for pharmaceutical promotional expenditures: The role of marketing-mix interactions. Journal of Marketing 68 (4): 90–105.
Osinga, E.C., P.S.H. Leeflang, and J.E. Wieringa. 2010. Early marketing matters: A time-varying parameter approach to persistence modeling. Journal of Marketing Research 47 (1): 173–185.
Palazzo, M., A. Vollero, and A. Siano. 2016. Identifying new segments from a global branding perspective: A three-country study. Journal of Marketing Analytics 4 (4): 159–171.
Raftery, A.E., and N. Dean. 2006. Variable selection for model-based clustering. Journal of the American Statistical Association 101 (473): 168–178.
Resney, R., A. Aboshiha, E. Carlisle, and S. Waddell. 2017. Launch for long-term success. Pharmaceutical Executive report, 9 May. http://www.pharmexec.com/launch-long-term-success. Accessed 5 Oct 2018.
Shankar, V., G.S. Carpenter, and L. Krishnamurthi. 1998. Late mover advantage: How innovative late entrants outsell pioneers. Journal of Marketing Research 35 (1): 54–70.
Steinhaus, H. 1956. Sur la division des corps matériels en parties. Bulletin de l’Académie Polonaise des Sciences, Classe III, IV (12): 801–804.
Steinley, D. 2004. Properties of the Hubert-Arabie adjusted Rand index. Psychological Methods 9 (3): 386–396.
Steinley, D. 2006. K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology 59 (1): 1–34.
Steinley, D., and M.J. Brusco. 2008a. A new variable weighting and selection procedure for K-means cluster analysis. Multivariate Behavioral Research 43 (1): 77–108.
Steinley, D., and M.J. Brusco. 2008b. Selection of variables in cluster analysis: An empirical comparison of eight procedures. Psychometrika 73 (1): 125–144.
Steinley, D., M.J. Brusco, and L. Hubert. 2016. The variance of the adjusted Rand index. Psychological Methods 21 (2): 261–272.
Urban, G.L., and J.R. Hauser. 1993. Design and marketing of new products. Englewood Cliffs: Prentice-Hall.
Wedel, M., and W.A. Kamakura. 2000. Market segmentation: Conceptual and methodological foundations, 2nd ed. Dordrecht: Kluwer.
Winegarden, W. 2017. U.S. Pharmaceutical pricing in context. San Francisco: Pacific Research Institute.
Brudvig, S., Brusco, M.J. & Cradit, J.D. Joint selection of variables and clusters: recovering the underlying structure of marketing data. J Market Anal 7, 1–12 (2019). https://doi.org/10.1057/s41270-018-0045-7
DOI: https://doi.org/10.1057/s41270-018-0045-7