
Fusion Modeling

Handbook of Market Research

Abstract

This chapter introduces readers to applications of data fusion in marketing from a Bayesian perspective. We discuss several such applications, including the classic example of combining data on media viewership for one group of customers with data on category purchases for a different group, a very common problem in marketing. While many missing data approaches focus on creating “fused” data sets that can be analyzed by others, we focus on the overall inferential goal, which, for this classic data fusion problem, is to determine which media outlets attract consumers who purchase in a particular category and are therefore good targets for advertising. The approach we describe is based on a common Bayesian treatment of missing data: data augmentation within MCMC estimation routines. As we will discuss, this approach extends to a variety of other data structures, including mismatched groups of customers, data at different levels of aggregation, and more general missing data problems that commonly arise in marketing. The chapter provides a step-by-step guide to developing Bayesian data fusion applications, including an example fully worked out in the Stan modeling language. Readers who are unfamiliar with Bayesian analysis and MCMC estimation may benefit from first reading the chapter on Bayesian models in this handbook.


References

  • Adigüzel, F., & Wedel, M. (2008). Split questionnaire design for massive surveys. Journal of Marketing Research, 45(5), 608–617.

  • Andridge, R. R., & Little, R. J. (2010). A review of hot deck imputation for survey nonresponse. International Statistical Review, 78(1), 40–64.

  • Bradlow, E. T., & Zaslavsky, A. M. (1999). A hierarchical latent variable model for ordinal data from a customer satisfaction survey with no answer responses. Journal of the American Statistical Association, 94(445), 43–52.

  • Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 76.

  • Chen, Y., & Yang, S. (2007). Estimating disaggregate models using aggregate data through augmentation of individual choice. Journal of Marketing Research, 44(4), 613–621.

  • Cho, J., Aribarg, A., & Manchanda, P. (2015). The value of measuring customer satisfaction. Available at SSRN 2630898.

  • Feit, E. M., Beltramo, M. A., & Feinberg, F. M. (2010). Reality check: Combining choice experiments with market data to estimate the importance of product attributes. Management Science, 56(5), 785–800.

  • Feit, E. M., Wang, P., Bradlow, E. T., & Fader, P. S. (2013). Fusing aggregate and disaggregate data with an application to multiplatform media consumption. Journal of Marketing Research, 50(3), 348–364.

  • Ford, B. L. (1983). An overview of hot-deck procedures. Incomplete Data in Sample Surveys, 2(Part IV), 185–207.

  • Gilula, Z., McCulloch, R. E., & Rossi, P. E. (2006). A direct approach to data fusion. Journal of Marketing Research, 43(1), 73–83.

  • Kamakura, W. A., & Wedel, M. (1997). Statistical data fusion for cross-tabulation. Journal of Marketing Research, 34, 485–498.

  • Kamakura, W. A., & Wedel, M. (2000). Factor analysis and missing data. Journal of Marketing Research, 37(4), 490–498.

  • Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data. Hoboken: Wiley.

  • Musalem, A., Bradlow, E. T., & Raju, J. S. (2008). Who’s got the coupon? Estimating consumer preferences and coupon usage from aggregate information. Journal of Marketing Research, 45(6), 715–730.

  • Novak, J., Feit, E. M., Jensen, S., & Bradlow, E. (2015). Bayesian imputation for anonymous visits in CRM data. Available at SSRN 2700347.

  • Qian, Y., & Xie, H. (2011). No customer left behind: A distribution-free Bayesian approach to accounting for missing Xs in marketing models. Marketing Science, 30(4), 717–736.

  • Qian, Y., & Xie, H. (2014). Which brand purchasers are lost to counterfeiters? An application of new data fusion approaches. Marketing Science, 33(3), 437–448.

  • Raghunathan, T. E., & Grizzle, J. E. (1995). A split questionnaire survey design. Journal of the American Statistical Association, 90(429), 54–63.

  • Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches (Vol. 168). New York: Springer Science & Business Media.

  • Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

  • Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2003). WinBUGS user manual, version 1.4. https://faculty.washington.edu/jmiyamot/p548/spiegelhalter%20winbugs%20user%20manual.pdf

  • Stan Development Team. (2016). RStan getting started. https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

  • Stan Development Team. (2017). Stan modeling language user’s guide and reference manual, version 2.17.0. http://mc-stan.org

  • Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540.

  • Ying, Y., Feinberg, F., & Wedel, M. (2006). Leveraging missing ratings to improve online recommendation systems. Journal of Marketing Research, 43(3), 355–365.


Acknowledgments

We would like to thank the many co-authors with whom we have had discussions while developing and troubleshooting fusion models and other Bayesian missing data methods, especially Andres Musalem, Fred Feinberg, Pengyuan Wang, and Julie Novak.

Author information

Corresponding author

Correspondence to Elea McDonnell Feit.


Appendix

This appendix provides the code used to generate all examples in this chapter. It is also available online at https://github.com/eleafeit/data_fusion. Note that the results in the chapter were obtained with Stan 2.17. If you use a different version of Stan, you may obtain slightly different results even when using the same random number seed.
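Because results can vary across Stan releases, it can be useful to confirm the installed versions before attempting to reproduce the output. The following is a minimal sketch (not part of the original appendix) using standard rstan utilities:

# Minimal sketch (not from the original appendix): confirm which versions
# of rstan and the underlying Stan library are installed, since results
# may differ across Stan releases even with the same random seed.
library(rstan)
packageVersion("rstan")   # version of the rstan R package
stan_version()            # version of the Stan library that rstan was built against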

R Code for Generating Synthetic Data and Running Ex. 1 with Stan

R Commands for Ex. 1 (Requires Utility Functions Below to Be Sourced First)

library(MASS)
library(coda)
library(beanplot)
library(rstan)

# Example 1a: MVN ====================================
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(c( 1.0,  0.3, -0.2,  0.7,
                   0.3,  1.0, -0.6,  0.4,
                  -0.2, -0.6,  1.0,  0.1,
                   0.7,  0.4,  0.1,  1.0), nrow=4)
d1 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                     mu=rep(0, 4), Sigma=Sigma)
str(d1$data)

# Call to Stan to generate posterior draws
m1 <- stan(file="Data_Fusion_MVN.stan", data=d1$data,
           iter=10000, warmup=2000, chains=1, seed=12)

# Summaries of posterior draws for population-level parameters
summary(m1, par=c("mu"))
summary(m1, par=c("tau"))
summary(m1, par=c("Omega"))
plot.post.density(m1, pars=c("mu", "tau"), prefix="Ex1",
                  true=list(d1$true$mu, sqrt(diag(d1$true$Sigma)),
                            cov2cor(d1$true$Sigma)))
draws <- As.mcmc.list(m1, pars=c("Omega"))
png(filename="Ex1PostOmega.png", width=600, height=600)
beanplot(data.frame(draws[[1]][, c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega (correlations)", cex.axis=0.5)
dev.off()

# Summaries of posterior draws for missing data
summary(extract(m1, par=c("y1mis"))$y1mis[, 3, ])
png("Ex1y13mis.png")
plot(density(extract(m1, par=c("y1mis"))$y1mis[, 3, ]),
     main="Posterior of Unobserved y_1", xlab="y_1")
dev.off()
summary(m1, par=c("y"))  # posteriors of observed data place a point mass at the observed value
plot.true.v.est(m1, pars=c("y1mis", "y2mis"), prefix="Ex1",
                true=list(d1$true$y1mis, d1$true$y2mis))

# Example 1b: MVN with zero correlations ===================
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(0, nrow=4, ncol=4)
diag(Sigma) <- 1
d2 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                     mu=rep(0, 4), Sigma=Sigma)

# Call to Stan to generate posterior draws
m2 <- stan(file="Data_Fusion_MVN.stan", data=d2$data,
           iter=10000, warmup=2000, chains=1, seed=12)

# Summarize posteriors of population-level parameters
summary(m2, par=c("mu"))
summary(m2, par=c("tau"))
summary(m2, par=c("Omega"))
plot.post.density(m2, pars=c("mu", "tau"), prefix="Ex2",
                  true=list(d2$true$mu, sqrt(diag(d2$true$Sigma)),
                            cov2cor(d2$true$Sigma)))
draws <- As.mcmc.list(m2, pars=c("Omega"))
png(filename="Ex2PostOmega.png", width=600, height=400)
beanplot(data.frame(draws[[1]][, c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega", cex.axis=0.5)
dev.off()

# Summaries of posterior draws for missing data
plot.true.v.est(m2, pars=c("y1mis", "y2mis"), prefix="Ex2",
                true=list(d2$true$y1mis, d2$true$y2mis))

# Example 1c: MVN with strong positive correlations ==========
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(0.9, nrow=4, ncol=4)
diag(Sigma) <- 1
d3 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                     mu=rep(0, 4), Sigma=Sigma)

# Call to Stan to generate posterior draws
m3 <- stan(file="Data_Fusion_MVN.stan", data=d3$data,
           iter=10000, warmup=2000, chains=1, seed=12)

# Summaries of population-level parameters
summary(m3, par=c("mu"))
summary(m3, par=c("tau"))
summary(m3, par=c("Omega"))
plot.post.density(m3, pars=c("mu", "tau"), prefix="Ex3",
                  true=list(d3$true$mu, sqrt(diag(d3$true$Sigma))))
draws <- As.mcmc.list(m3, pars=c("Omega"))
png(filename="Ex3PostOmega.png", width=600, height=400)
beanplot(data.frame(draws[[1]][, c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega")
dev.off()

# Summaries of posterior draws for missing data
plot.true.v.est(m3, pars=c("y1mis", "y2mis"), prefix="Ex3",
                true=list(d3$true$y1mis, d3$true$y2mis))

Utility Functions for Ex. 1

# Simulate multivariate normal data and split it into two data sets:
# data set 1 observes the first K1 variables, data set 2 observes the
# next K2 variables, and both observe the last Kb variables.
data.mvn.split <- function(K1=2, K2=2, Kb=3, N1=100, N2=100,
                           mu=rep(0, K1+K2+Kb),
                           Sigma=diag(1, K1+K2+Kb)) {
  y <- mvrnorm(n=N1+N2, mu=mu, Sigma=Sigma)
  list(data=list(K1=K1, K2=K2, Kb=Kb, N1=N1, N2=N2,
                 y1=as.matrix(y[1:N1, 1:K1], col=K1),
                 y2=as.matrix(y[N1+1:N2, K1+1:K2], col=K2),
                 yb=as.matrix(y[, K1+K2+1:Kb], col=Kb)),
       true=list(mu=mu, Sigma=Sigma,
                 y1mis=y[1:N1, K1+1:K2],
                 y2mis=y[N1+1:N2, 1:K1]))
}

# Simulate split multivariate probit data: the latent z is multivariate
# normal and y indicates whether each z is positive; the blocks not
# observed in each data set are coded as 0 before being passed to Stan.
data.mvp.split <- function(K1=2, K2=2, Kb=3, N1=100, N2=100,
                           mu=rep(0, K1+K2+Kb),
                           Sigma=diag(1, K1+K2+Kb)) {
  z <- mvrnorm(n=N1+N2, mu=mu, Sigma=Sigma)
  y <- z
  y[y > 0] <- 1
  y[y < 0] <- 0
  y1mis <- y[1:N1, K1+1:K2]
  y2mis <- y[N1+1:N2, 1:K1]
  y[1:N1, K1+1:K2] <- NA
  y[N1+1:N2, 1:K1] <- NA
  true <- list(mu=mu, Sigma=Sigma, z=z, y=y, y1mis=y1mis, y2mis=y2mis)
  y[is.na(y)] <- 0
  data <- list(K1=K1, K2=K2, Kb=Kb, N1=N1, N2=N2, y=y)
  list(data=data, true=true)
}

# Bean plots of the posterior draws for the named parameters.
plot.post.density <- function(m.stan, pars, true, prefix=NULL) {
  for (i in 1:length(pars)) {
    draws <- As.mcmc.list(m.stan, pars=pars[i])
    if (!is.null(prefix)) {
      filename <- paste(prefix, "Post", pars[i], ".png", sep="")
      png(filename=filename, width=600, height=400)
    }
    beanplot(data.frame(draws[[1]]),
             horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
             main=paste("Posterior Density of", pars[[i]]))
    if (!is.null(prefix)) dev.off()
  }
}

# Plot posterior medians (with posterior intervals) against true values.
plot.true.v.est <- function(m.stan, pars, true, prefix=NULL) {
  for (i in 1:length(pars)) {
    draws <- As.mcmc.list(m.stan, pars=pars[i])
    est <- summary(draws)
    if (!is.null(prefix)) {
      filename <- paste(prefix, "TrueVEst", pars[i], ".png", sep="")
      png(filename=filename, width=600, height=400)
    }
    plot(true[[i]], est$quantiles[, 3], col="blue",
         xlab=paste("True", pars[i]),
         ylab=paste("Estimated", pars[i], "(posterior median)"))
    abline(a=0, b=1)
    arrows(true[[i]], est$quantiles[, 3], true[[i]],
           est$quantiles[, 1], col="gray90", length=0)
    arrows(true[[i]], est$quantiles[, 3], true[[i]],
           est$quantiles[, 5], col="gray90", length=0)
    points(true[[i]], est$quantiles[, 3], col="blue")
    if (!is.null(prefix)) dev.off()
  }
}
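If the utility functions above are saved to a separate file, they can be sourced before running the Ex. 1 commands. A brief sketch (the file name here is only a placeholder, not part of the original appendix):

library(MASS)                  # for mvrnorm(), used by data.mvn.split()
source("data_fusion_utils.R")  # hypothetical file containing the utility functions above

# Quick sanity check of the simulated split-data structure:
# y1 and y2 should each have 100 rows, yb should have 200 rows.
d <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100)
sapply(d$data[c("y1", "y2", "yb")], dim)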

Stan Model for Ex. 2 (Split Multivariate Probit Data)

functions {
  // total number of 1s in a 2-D integer array
  int mysum(int[,] a) {
    int s;
    s = 0;
    for (i in 1:size(a))
      s = s + sum(a[i]);
    return s;
  }
}
data {
  int<lower=0> K1;  // number of vars only observed in data set 1
  int<lower=0> K2;  // number of vars only observed in data set 2
  int<lower=0> Kb;  // number of vars observed in both data sets
  int<lower=0> N1;  // number of observations in data set 1
  int<lower=0> N2;  // number of observations in data set 2
  int<lower=0,upper=2> y[N1+N2, K1+K2+Kb];  // should contain zeros in missing positions
}
transformed data {
  int<lower=1, upper=N1+N2> n_pos[mysum(y)];
  int<lower=1, upper=K1+K2+Kb> k_pos[size(n_pos)];
  int<lower=1, upper=N1+N2> n_neg[(N1+N2)*(K1+K2+Kb) - K2*N1 - K1*N2 - mysum(y)];
  int<lower=1, upper=K1+K2+Kb> k_neg[size(n_neg)];
  int<lower=0> N_pos;
  int<lower=0> N_neg;
  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N1) {              // positions in observed y1
      for (k in 1:K1) {
        if (y[n,k] == 1) {
          n_pos[i] = n;
          k_pos[i] = k;
          i = i + 1;
        } else {
          n_neg[j] = n;
          k_neg[j] = k;
          j = j + 1;
        }
      }
      for (k in (K1+K2+1):(K1+K2+Kb)) {
        if (y[n,k] == 1) {
          n_pos[i] = n;
          k_pos[i] = k;
          i = i + 1;
        } else {
          n_neg[j] = n;
          k_neg[j] = k;
          j = j + 1;
        }
      }
    }
    for (n in (N1+1):(N1+N2)) {    // positions in observed y2
      for (k in (K1+1):(K1+K2+Kb)) {
        if (y[n,k] == 1) {
          n_pos[i] = n;
          k_pos[i] = k;
          i = i + 1;
        } else {
          n_neg[j] = n;
          k_neg[j] = k;
          j = j + 1;
        }
      }
    }
  }
}
parameters {
  vector[K1 + K2 + Kb] mu;
  corr_matrix[K1 + K2 + Kb] Omega;
  vector<lower=0>[N_pos] z_pos;   // latent values constrained by observed y = 1
  vector<upper=0>[N_neg] z_neg;   // latent values constrained by observed y = 0
  vector[K2] z1mis[N1];           // unconstrained latent values for vars missing in data set 1
  vector[K1] z2mis[N2];           // unconstrained latent values for vars missing in data set 2
}
transformed parameters {
  vector[K1 + K2 + Kb] z[N1 + N2];
  vector[K2] y1mis[N1];
  vector[K1] y2mis[N2];
  for (i in 1:N_pos)
    z[n_pos[i], k_pos[i]] = z_pos[i];
  for (i in 1:N_neg)
    z[n_neg[i], k_neg[i]] = z_neg[i];
  for (n in 1:N1) {
    for (k in 1:K2) {
      z[n, K1 + k] = z1mis[n, k];
      if (z1mis[n, k] > 0)
        y1mis[n, k] = 1;
      if (z1mis[n, k] < 0)
        y1mis[n, k] = 0;
    }
  }
  for (n in 1:N2) {
    for (k in 1:K1) {
      z[N1 + n, k] = z2mis[n, k];
      if (z2mis[n, k] > 0)
        y2mis[n, k] = 1;
      if (z2mis[n, k] < 0)
        y2mis[n, k] = 0;
    }
  }
}
model {
  mu ~ normal(0, 3);
  Omega ~ lkj_corr(1);
  z ~ multi_normal(mu, Omega);
}
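Before running the full sampler, the Stan program can be checked for syntax errors with rstan's parser. A minimal sketch (not part of the original appendix), assuming the model above has been saved as Data_Fusion_MVP.stan:

library(rstan)
# Parse the model file only (no compilation to a fitted model, no sampling)
# to catch syntax errors early.
sc <- stanc(file="Data_Fusion_MVP.stan")
sc$status   # TRUE if the model parsed successfully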

R Commands for Ex. 2

# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(c( 1.0,  0.3, -0.2,  0.7,
                   0.3,  1.0, -0.6,  0.4,
                  -0.2, -0.6,  1.0,  0.1,
                   0.7,  0.4,  0.1,  1.0), nrow=4)
d1 <- data.mvp.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                     mu=rep(0, 4), Sigma=Sigma)

# Call to Stan to generate posterior draws
m1 <- stan(file="Data_Fusion_MVP.stan", data=d1$data,
           iter=10000, warmup=2000, chains=1, seed=35)

# Summaries of posteriors of population-level parameters
summary(m1, par=c("mu", "Omega"))
plot.post.density(m1, pars=c("mu"), prefix="Ex1MVP", true=list(d1$true$mu))
png(filename="Ex1MVPPostOmega.png", width=600, height=400)
draws <- As.mcmc.list(m1, pars=c("Omega"))
beanplot(data.frame(draws[[1]][, c(2:4, 7:8, 12)]), horizontal=TRUE,
         las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega")
dev.off()

# Summarize the posterior for one of the missing values
y1mis.draws <- extract(m1, par=c("y1mis"))[[1]][, 1, 1]  # draws for the first respondent
mean(y1mis.draws > 0)

# Confusion matrices for the missing data
y1mis.est <- summary(m1, par=c("y1mis"))$summary[, "50%"] > 0
xtabs(~ y1mis.est + (d1$true$y1mis > 0))
y2mis.est <- summary(m1, par=c("y2mis"))$summary[, "50%"] > 0
xtabs(~ y2mis.est + (d1$true$y2mis > 0))

# True versus estimated latent variables
z.est <- data.frame(z.true=as.vector(t(d1$true$z)),
                    y=as.vector(t(d1$true$y)),
                    z.postmed=summary(m1, pars=c("z"))$summary[, "50%"])
png(filename="Ex1MVPTrueVEstz.png", width=600, height=400)
plot(z.est[, c(1, 3)], xlab="True Latent Variable",
     ylab="Posterior Median of Latent Variable")
points(z.est[is.na(z.est$y), c(1, 3)], col="red")
abline(h=0, v=0)
dev.off()
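Before interpreting any of the posterior summaries above, it is good practice to check MCMC convergence. A brief sketch (not part of the original appendix) using rstan's built-in diagnostics on the fitted object m1 from the code above:

# Trace plots and convergence summaries for the population-level parameters.
traceplot(m1, pars=c("mu", "Omega"))
fit.summary <- summary(m1, pars=c("mu", "Omega"))$summary
range(fit.summary[, "Rhat"])    # values near 1 suggest convergence
range(fit.summary[, "n_eff"])   # effective sample sizes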

Copyright information

© 2022 Springer Nature Switzerland AG

About this entry

Cite this entry

Feit, E.M., Bradlow, E.T. (2022). Fusion Modeling. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-57413-4_9
