This chapter introduces applications of data fusion in marketing from a Bayesian perspective. We discuss several applications, including the classic example of combining data on media viewership for one group of customers with data on category purchases for a different group, a very common problem in marketing. While many missing data approaches focus on creating “fused” data sets that can be analyzed by others, we focus on the overall inferential goal, which, for this classic data fusion problem, is to determine which media outlets attract consumers who purchase in a particular category and are therefore good targets for advertising. The approach we describe builds on a common Bayesian treatment of missing data: data augmentation within MCMC estimation routines. As we will discuss, this approach can also be extended to a variety of other data structures, including mismatched groups of customers, data at different levels of aggregation, and more general missing data problems that commonly arise in marketing. The chapter provides a step-by-step guide to developing Bayesian data fusion applications, including an example fully worked out in the Stan modeling language. Readers who are unfamiliar with Bayesian analysis and MCMC estimation may benefit from first reading the chapter on Bayesian Models in this handbook.
We would like to thank the many co-authors with whom we have had discussions while developing and troubleshooting fusion models and other Bayesian missing data methods, especially Andres Musalem, Fred Feinberg, Pengyuan Wang, and Julie Novak.
This appendix provides the code used to generate all examples in this chapter. It is also available online at https://github.com/eleafeit/data_fusion. Note that the results in the chapter were obtained with Stan 2.17. If you use a different version of Stan, you may obtain slightly different results even when using the same random number seed.
R Code for Generating Synthetic Data and Running Ex. 1 with Stan
R Commands for Ex. 1 (Requires Utility Functions Below to Be Sourced First)
library(MASS)      # mvrnorm()
library(coda)      # summary() for mcmc.list objects
library(beanplot)
library(rstan)

# Example 1a: MVN ====================================
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(c( 1.0,  0.3, -0.2, 0.7,
                   0.3,  1.0, -0.6, 0.4,
                  -0.2, -0.6,  1.0, 0.1,
                   0.7,  0.4,  0.1, 1.0), nrow=4)
d1 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100, mu=rep(0,4), Sigma=Sigma)
str(d1$data)

# Call to Stan to generate posterior draws
m1 <- stan(file="Data_Fusion_MVN.stan", data=d1$data,
           iter=10000, warmup=2000, chains=1, seed=12)

# Summaries of posterior draws for population-level parameters
summary(m1, pars=c("mu"))
summary(m1, pars=c("tau"))
summary(m1, pars=c("Omega"))
plot.post.density(m1, pars=c("mu", "tau"), prefix="Ex1",
                  true=list(d1$true$mu, sqrt(diag(d1$true$Sigma)),
                            cov2cor(d1$true$Sigma)))
draws <- As.mcmc.list(m1, pars=c("Omega"))
png(filename="Ex1PostOmega.png", width=600, height=600)
# Columns 2:4, 7:8, 12 are the six distinct off-diagonal elements of the 4x4 Omega
beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega (correlations)", log="",
         cex.axis=0.5)
dev.off()

# Summaries of posterior draws for missing data
summary(extract(m1, pars=c("y1mis"))$y1mis[,3,])
png("Ex1y13mis.png")
plot(density(extract(m1, pars=c("y1mis"))$y1mis[,3,]),
     main="Posterior of Unobserved y_1", xlab="y_1")
dev.off()
summary(m1, pars=c("y")) # posteriors of observed data place a point mass at the observed value
plot.true.v.est(m1, pars=c("y1mis", "y2mis"), prefix="Ex1",
                true=list(d1$true$y1mis, d1$true$y2mis))

# Example 1b: MVN with zero correlations ===================
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(0, nrow=4, ncol=4)
diag(Sigma) <- 1
d2 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100, mu=rep(0,4), Sigma=Sigma)

# Call to Stan to generate posterior draws
m2 <- stan(file="Data_Fusion_MVN.stan", data=d2$data,
           iter=10000, warmup=2000, chains=1, seed=12)

# Summarize posteriors of population-level parameters
summary(m2, pars=c("mu"))
summary(m2, pars=c("tau"))
summary(m2, pars=c("Omega"))
plot.post.density(m2, pars=c("mu", "tau"), prefix="Ex2",
                  true=list(d2$true$mu, sqrt(diag(d2$true$Sigma)),
                            cov2cor(d2$true$Sigma)))
draws <- As.mcmc.list(m2, pars=c("Omega"))
png(filename="Ex2PostOmega.png", width=600, height=400)
beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega", log="", cex.axis=0.5)
dev.off()

# Summaries of posterior draws for missing data
plot.true.v.est(m2, pars=c("y1mis", "y2mis"), prefix="Ex2",
                true=list(d2$true$y1mis, d2$true$y2mis))

# Example 1c: MVN with strong positive correlations ==========
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(0.9, nrow=4, ncol=4)
diag(Sigma) <- 1
d3 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100, mu=rep(0,4), Sigma=Sigma)

# Call to Stan to generate posterior draws
m3 <- stan(file="Data_Fusion_MVN.stan", data=d3$data,
           iter=10000, warmup=2000, chains=1, seed=12)

# Summaries of population-level parameters
summary(m3, pars=c("mu"))
summary(m3, pars=c("tau"))
summary(m3, pars=c("Omega"))
plot.post.density(m3, pars=c("mu", "tau"), prefix="Ex3",
                  true=list(d3$true$mu, sqrt(diag(d3$true$Sigma))))
draws <- As.mcmc.list(m3, pars=c("Omega"))
png(filename="Ex3PostOmega.png", width=600, height=400)
beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega", log="")
dev.off()

# Summaries of posterior draws for missing data
plot.true.v.est(m3, pars=c("y1mis", "y2mis"), prefix="Ex3",
                true=list(d3$true$y1mis, d3$true$y2mis))
Utility Functions for Ex. 1
# Simulate multivariate normal data and split it into two data sets:
# y1 is observed only in data set 1, y2 only in data set 2, yb in both.
# Note that N1+1:N2 parses as N1+(1:N2), i.e., rows (N1+1):(N1+N2).
data.mvn.split <- function(K1=2, K2=2, Kb=3, N1=100, N2=100,
                           mu=rep(0, K1+K2+Kb), Sigma=diag(1, K1+K2+Kb)) {
  y <- mvrnorm(n=N1+N2, mu=mu, Sigma=Sigma)
  # drop=FALSE keeps each block a matrix even when it has a single column
  list(data=list(K1=K1, K2=K2, Kb=Kb, N1=N1, N2=N2,
                 y1=y[1:N1, 1:K1, drop=FALSE],
                 y2=y[N1+1:N2, K1+1:K2, drop=FALSE],
                 yb=y[, K1+K2+1:Kb, drop=FALSE]),
       true=list(mu=mu, Sigma=Sigma,
                 y1mis=y[1:N1, K1+1:K2],
                 y2mis=y[N1+1:N2, 1:K1]))
}

# Simulate multivariate probit data (binary y from latent normal z) and split it
data.mvp.split <- function(K1=2, K2=2, Kb=3, N1=100, N2=100,
                           mu=rep(0, K1+K2+Kb), Sigma=diag(1, K1+K2+Kb)) {
  z <- mvrnorm(n=N1+N2, mu=mu, Sigma=Sigma)
  y <- z
  y[y>0] <- 1
  y[y<0] <- 0
  y1mis <- y[1:N1, K1+1:K2]
  y2mis <- y[N1+1:N2, 1:K1]
  y[1:N1, K1+1:K2] <- NA
  y[N1+1:N2, 1:K1] <- NA
  true <- list(mu=mu, Sigma=Sigma, z=z, y=y, y1mis=y1mis, y2mis=y2mis)
  y[is.na(y)] <- 0   # the Stan model expects zeros in the missing positions
  data <- list(K1=K1, K2=K2, Kb=Kb, N1=N1, N2=N2, y=y)
  list(data=data, true=true)
}

# Bean plots of the posterior densities; writes PNG files if prefix is given
# (the argument true is accepted but not used)
plot.post.density <- function(m.stan, pars, true, prefix=NULL){
  for (i in 1:length(pars)) {
    draws <- As.mcmc.list(m.stan, pars=pars[i])
    if (!is.null(prefix)) {
      filename <- paste(prefix, "Post", pars[i], ".png", sep="")
      png(filename=filename, width=600, height=400)
    }
    beanplot(data.frame(draws[[1]]),
             horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
             main=paste("Posterior Density of", pars[[i]]))
    if (!is.null(prefix)) dev.off()
  }
}

# Plot true values against posterior medians with 2.5%-97.5% intervals
plot.true.v.est <- function(m.stan, pars, true, prefix=NULL){
  for (i in 1:length(pars)) {
    draws <- As.mcmc.list(m.stan, pars=pars[i])
    est <- summary(draws)
    if (!is.null(prefix)) {
      filename <- paste(prefix, "TrueVEst", pars[i], ".png", sep="")
      png(filename=filename, width=600, height=400)
    }
    plot(true[[i]], est$quantiles[,3], col="blue",
         xlab=paste("True", pars[i]),
         ylab=paste("Estimated", pars[i], "(posterior median)"))
    abline(a=0, b=1)
    arrows(true[[i]], est$quantiles[,3], true[[i]], est$quantiles[,1],
           col="gray90", length=0)
    arrows(true[[i]], est$quantiles[,3], true[[i]], est$quantiles[,5],
           col="gray90", length=0)
    points(true[[i]], est$quantiles[,3], col="blue")
    if (!is.null(prefix)) dev.off()
  }
}
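As a reading aid, the split structure that data.mvn.split() simulates can be written compactly. This is only a sketch read off the utility code above: the block notation y_{n,1}, y_{n,2}, y_{n,b} is introduced here for illustration and is an assumption, not notation taken from the chapter text.

```latex
% Sketch of the data-generating process in data.mvn.split().
% The block subscripts (1, 2, b) are illustrative notation, not the chapter's.
\begin{align*}
  y_n = (y_{n,1},\, y_{n,2},\, y_{n,b})
      &\sim \mathcal{N}_{K_1+K_2+K_b}(\mu, \Sigma),
      \qquad n = 1, \dots, N_1 + N_2, \\
  \text{data set 1 } (n \le N_1):\quad
      &(y_{n,1},\, y_{n,b}) \text{ observed},\quad
       y_{n,2} \text{ missing (returned as \texttt{true\$y1mis})}, \\
  \text{data set 2 } (n > N_1):\quad
      &(y_{n,2},\, y_{n,b}) \text{ observed},\quad
       y_{n,1} \text{ missing (returned as \texttt{true\$y2mis})}.
\end{align*}
```

The common block y_{n,b} is the only part observed for every respondent, so it is what links the two data sets.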
Stan Model for Ex. 2 (Split Multivariate Probit Data)
functions {
  // sum of all elements of a two-dimensional integer array
  int mysum(int[,] a) {
    int s;
    s = 0;
    for (i in 1:size(a))
      s = s + sum(a[i]);
    return s;
  }
}
data {
  int<lower=0> K1; // number of vars only observed in data set 1
  int<lower=0> K2; // number of vars only observed in data set 2
  int<lower=0> Kb; // number of vars observed in both data sets
  int<lower=0> N1; // number of observations in data set 1
  int<lower=0> N2; // number of observations in data set 2
  int<lower=0,upper=2> y[N1+N2, K1+K2+Kb]; // should contain zeros in missing positions
}
transformed data {
  int<lower=1, upper=N1+N2> n_pos[mysum(y)];
  int<lower=1, upper=K1+K2+Kb> k_pos[size(n_pos)];
  int<lower=1, upper=N1+N2> n_neg[(N1+N2)*(K1+K2+Kb) - K2*N1 - K1*N2 - mysum(y)];
  int<lower=1, upper=K1+K2+Kb> k_neg[size(n_neg)];
  int<lower=0> N_pos;
  int<lower=0> N_neg;
  N_pos = size(n_pos);
  N_neg = size(n_neg);
  {
    int i;
    int j;
    i = 1;
    j = 1;
    for (n in 1:N1) { // positions in observed y1
      for (k in 1:K1) {
        if (y[n,k] == 1) {
          n_pos[i] = n;
          k_pos[i] = k;
          i = i + 1;
        } else {
          n_neg[j] = n;
          k_neg[j] = k;
          j = j + 1;
        }
      }
      for (k in (K1+K2+1):(K1+K2+Kb)) {
        if (y[n,k] == 1) {
          n_pos[i] = n;
          k_pos[i] = k;
          i = i + 1;
        } else {
          n_neg[j] = n;
          k_neg[j] = k;
          j = j + 1;
        }
      }
    }
    for (n in (N1+1):(N1+N2)) { // positions in observed y2
      for (k in (K1+1):(K1+K2+Kb)) {
        if (y[n,k] == 1) {
          n_pos[i] = n;
          k_pos[i] = k;
          i = i + 1;
        } else {
          n_neg[j] = n;
          k_neg[j] = k;
          j = j + 1;
        }
      }
    }
  }
}
parameters {
  vector[K1 + K2 + Kb] mu;
  corr_matrix[K1 + K2 + Kb] Omega;
  vector<lower=0>[N_pos] z_pos;
  vector<upper=0>[N_neg] z_neg;
  vector[K2] z1mis[N1];
  vector[K1] z2mis[N2];
}
transformed parameters {
  vector[K1 + K2 + Kb] z[N1 + N2];
  vector[K2] y1mis[N1];
  vector[K1] y2mis[N2];
  for (i in 1:N_pos)
    z[n_pos[i], k_pos[i]] = z_pos[i];
  for (i in 1:N_neg)
    z[n_neg[i], k_neg[i]] = z_neg[i];
  for (n in 1:N1) {
    for (k in 1:K2) {
      z[n, K1 + k] = z1mis[n, k];
      if (z1mis[n, k] > 0) y1mis[n, k] = 1;
      if (z1mis[n, k] < 0) y1mis[n, k] = 0;
    }
  }
  for (n in 1:N2) {
    for (k in 1:K1) {
      z[N1 + n, k] = z2mis[n, k];
      if (z2mis[n, k] > 0) y2mis[n, k] = 1;
      if (z2mis[n, k] < 0) y2mis[n, k] = 0;
    }
  }
}
model {
  mu ~ normal(0, 3);
  Omega ~ lkj_corr(1);
  z ~ multi_normal(mu, Omega);
}
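The statistical model that this Stan program encodes can be summarized as follows. The equations are read directly off the model, parameters, and transformed parameters blocks above; the indicator notation and the index ranges are introduced here for illustration and are assumptions, not notation taken from the chapter text.

```latex
% Sketch of the multivariate probit fusion model implied by the Stan code.
% Stan's normal(0, 3) is parameterized by the standard deviation, hence 3^2.
\begin{align*}
  z_n &\sim \mathcal{N}_{K_1+K_2+K_b}(\mu, \Omega),
        \qquad n = 1, \dots, N_1 + N_2, \\
  y_{nk} &= \mathbb{1}\{ z_{nk} > 0 \},
        \qquad k = 1, \dots, K_1 + K_2 + K_b, \\
  \mu_k &\sim \mathcal{N}(0, 3^2), \qquad
  \Omega \sim \mathrm{LKJ}(1).
\end{align*}
% Observed y_{nk} = 1 constrains z_{nk} > 0 (z_pos); observed y_{nk} = 0
% constrains z_{nk} < 0 (z_neg); for missing entries z_{nk} is unconstrained
% (z1mis, z2mis), and the sign of each draw gives the imputed y1mis or y2mis.
```

Because Omega is a correlation matrix, the latent scale is fixed at unit variance, the usual identification restriction for a probit model.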
R Commands for Ex. 2
# Generate synthetic data
set.seed(20030601)
Sigma <- matrix(c( 1.0,  0.3, -0.2, 0.7,
                   0.3,  1.0, -0.6, 0.4,
                  -0.2, -0.6,  1.0, 0.1,
                   0.7,  0.4,  0.1, 1.0), nrow=4)
d1 <- data.mvp.split(K1=1, K2=1, Kb=2, N1=100, N2=100, mu=rep(0,4), Sigma=Sigma)

# Call to Stan to generate posterior draws
m1 <- stan(file="Data_Fusion_MVP.stan", data=d1$data,
           iter=10000, warmup=2000, chains=1, seed=35)

# Summaries of posteriors of population-level parameters
summary(m1, pars=c("mu", "Omega"))
plot.post.density(m1, pars=c("mu"), prefix="Ex1MVP", true=list(d1$true$mu))
png(filename="Ex1MVPPostOmega.png", width=600, height=400)
draws <- As.mcmc.list(m1, pars=c("Omega"))
beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
         horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
         main="Posterior Density of Omega", log="")
dev.off()

# Summarize posteriors for one of the missing values
y1mis.draws <- extract(m1, pars=c("y1mis"))[[1]][,1,1] # draws for respondent 1
mean(y1mis.draws > 0) # posterior probability that the missing value equals 1

# Confusion matrix for missing data
y1mis.est <- summary(m1, pars=c("y1mis"))$summary[, "50%"] > 0
xtabs(~y1mis.est + (d1$true$y1mis > 0))
y2mis.est <- summary(m1, pars=c("y2mis"))$summary[, "50%"] > 0
xtabs(~y2mis.est + (d1$true$y2mis > 0))

# True latent variables versus posterior medians; missing positions in red
z.est <- data.frame(z.true=as.vector(t(d1$true$z)),
                    y=as.vector(t(d1$true$y)),
                    z.postmed=summary(m1, pars=c("z"))$summary[, "50%"])
png(filename="Ex1MVPTrueVEstz.png", width=600, height=400)
plot(z.est[,c(1,3)], xlab="True Latent Variable",
     ylab="Posterior Median of Latent Variable")
points(z.est[is.na(z.est$y), c(1,3)], col="red")
abline(h=0, v=0)
dev.off()
Copyright information
© 2018 Springer Nature Switzerland AG
About this entry
Cite this entry
Feit, E.M., Bradlow, E.T. (2018). Fusion Modeling. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-05542-8_9-1
DOI: https://doi.org/10.1007/978-3-319-05542-8_9-1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-05542-8
Online ISBN: 978-3-319-05542-8
eBook Packages: Springer Reference Business and Management; Reference Module Humanities and Social Sciences; Reference Module Business, Economics and Social Sciences