Fusion Modeling

  • Living reference work entry
Handbook of Market Research

Abstract

This chapter introduces applications of data fusion in marketing from a Bayesian perspective. We discuss several applications, including the classic problem of combining data on media viewership for one group of customers with data on category purchases for a different group, which is very common in marketing. While many missing data approaches focus on creating “fused” data sets that can be analyzed by others, we focus on the overall inferential goal, which, for this classic data fusion problem, is to determine which media outlets attract consumers who purchase in a particular category and are therefore good targets for advertising. The approach we describe builds on a common Bayesian treatment of missing data: data augmentation within MCMC estimation routines. As we discuss, this approach extends to a variety of other data structures, including mismatched groups of customers, data at different levels of aggregation, and more general missing data problems that commonly arise in marketing. The chapter provides a step-by-step guide to developing Bayesian data fusion applications, including an example fully worked out in the Stan modeling language. Readers who are unfamiliar with Bayesian analysis and MCMC estimation may benefit from first reading the chapter on Bayesian Models in this handbook.
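To make the idea of data augmentation concrete, the following is a minimal R sketch of a single imputation step for a multivariate normal model: given current values of the mean vector and covariance matrix, the unobserved variables for one customer are drawn from their conditional distribution given the observed variables. The function name draw_missing_mvn and its arguments are hypothetical and are not taken from the chapter's code. The chapter's own examples let Stan sample the missing values jointly with the parameters, so this is only an illustration of the underlying conditional distribution, not the authors' implementation.

```r
# Sketch of one imputation step in data augmentation for an MVN model.
# All names (draw_missing_mvn, obs_idx, mis_idx) are hypothetical.
draw_missing_mvn <- function(y_obs, obs_idx, mis_idx, mu, Sigma) {
  # Partition the covariance matrix into observed and missing blocks
  S_oo <- Sigma[obs_idx, obs_idx, drop = FALSE]
  S_mo <- Sigma[mis_idx, obs_idx, drop = FALSE]
  S_mm <- Sigma[mis_idx, mis_idx, drop = FALSE]
  # Conditional mean and covariance of the missing block given the observed
  B <- S_mo %*% solve(S_oo)
  cond_mean <- mu[mis_idx] + as.vector(B %*% (y_obs - mu[obs_idx]))
  cond_cov <- S_mm - B %*% t(S_mo)
  # Draw from N(cond_mean, cond_cov) using a Cholesky factor
  L <- t(chol(cond_cov))
  draw <- cond_mean + as.vector(L %*% rnorm(length(mis_idx)))
  list(mean = cond_mean, cov = cond_cov, draw = draw)
}

# Usage with a four-variable covariance matrix in which variable 2 is
# unobserved for this customer (the values of y_obs are made up)
Sigma <- matrix(c(1, 0.3, -0.2, 0.7, 0.3, 1, -0.6, 0.4, -0.2,
                  -0.6, 1, 0.1, 0.7, 0.4, 0.1, 1), nrow = 4)
set.seed(1)
res <- draw_missing_mvn(y_obs = c(0.5, -1.0, 0.2), obs_idx = c(1, 3, 4),
                        mis_idx = 2, mu = rep(0, 4), Sigma = Sigma)
res$draw
```

In a full sampler this step would alternate with draws of the mean and covariance given the completed data, so that uncertainty about the missing values carries through to the parameter estimates.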

References

  • Adigüzel, F., & Wedel, M. (2008). Split questionnaire design for massive surveys. Journal of Marketing Research, 45(5), 608–617.

  • Andridge, R. R., & Little, R. J. (2010). A review of hot deck imputation for survey nonresponse. International Statistical Review, 78(1), 40–64.

  • Bradlow, E. T., & Zaslavsky, A. M. (1999). A hierarchical latent variable model for ordinal data from a customer satisfaction survey with no answer responses. Journal of the American Statistical Association, 94(445), 43–52.

  • Carpenter, B., Gelman, A., Hoffman, M., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M. A., Guo, J., Li, P., & Riddell, A. (2016). Stan: A probabilistic programming language. Journal of Statistical Software, 76.

  • Chen, Y., & Yang, S. (2007). Estimating disaggregate models using aggregate data through augmentation of individual choice. Journal of Marketing Research, 44(4), 613–621.

  • Cho, J., Aribarg, A., & Manchanda, P. (2015). The value of measuring customer satisfaction. Available at SSRN 2630898.

  • Feit, E. M., Beltramo, M. A., & Feinberg, F. M. (2010). Reality check: Combining choice experiments with market data to estimate the importance of product attributes. Management Science, 56(5), 785–800.

  • Feit, E. M., Wang, P., Bradlow, E. T., & Fader, P. S. (2013). Fusing aggregate and disaggregate data with an application to multiplatform media consumption. Journal of Marketing Research, 50(3), 348–364.

  • Ford, B. L. (1983). An overview of hot-deck procedures. Incomplete Data in Sample Surveys, 2(Part IV), 185–207.

  • Gilula, Z., McCulloch, R. E., & Rossi, P. E. (2006). A direct approach to data fusion. Journal of Marketing Research, 43(1), 73–83.

  • Kamakura, W. A., & Wedel, M. (1997). Statistical data fusion for cross-tabulation. Journal of Marketing Research, 34, 485–498.

  • Kamakura, W. A., & Wedel, M. (2000). Factor analysis and missing data. Journal of Marketing Research, 37(4), 490–498.

  • Little, R. J., & Rubin, D. B. (2014). Statistical analysis with missing data. Hoboken: Wiley.

  • Musalem, A., Bradlow, E. T., & Raju, J. S. (2008). Who’s got the coupon? Estimating consumer preferences and coupon usage from aggregate information. Journal of Marketing Research, 45(6), 715–730.

  • Novak, J., Feit, E. M., Jensen, S., & Bradlow, E. (2015). Bayesian imputation for anonymous visits in CRM data. Available at SSRN 2700347.

  • Qian, Y., & Xie, H. (2011). No customer left behind: A distribution-free Bayesian approach to accounting for missing Xs in marketing models. Marketing Science, 30(4), 717–736.

  • Qian, Y., & Xie, H. (2014). Which brand purchasers are lost to counterfeiters? An application of new data fusion approaches. Marketing Science, 33(3), 437–448.

  • Rässler, S. (2002). Statistical matching: A frequentist theory, practical applications, and alternative Bayesian approaches (Vol. 168). New York: Springer Science & Business Media.

  • Raghunathan, T. E., & Grizzle, J. E. (1995). A split questionnaire survey design. Journal of the American Statistical Association, 90(429), 54–63.

  • Rubin, D. B. (1996). Multiple imputation after 18+ years. Journal of the American Statistical Association, 91(434), 473–489.

  • Spiegelhalter, D., Thomas, A., Best, N., & Lunn, D. (2003). WinBUGS User Manual Version 1.4, January 2003 at https://faculty.washington.edu/jmiyamot/p548/spiegelhalter%20winbugs%20user%20manual.pdf.

  • Stan Development Team. (2017). Stan modeling language user’s guide and reference manual, version 2.17.0. http://mc-stan.org

  • Tanner, M. A., & Wong, W. H. (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association, 82(398), 528–540.

  • Stan Development Team (2016). Rstan getting started. https://github.com/stan-dev/rstan/wiki/RStan-Getting-Started

  • Ying, Y., Feinberg, F., & Wedel, M. (2006). Leveraging missing ratings to improve online recommendation systems. Journal of Marketing Research, 43(3), 355–365.

Acknowledgments

We would like to thank the many co-authors with whom we have had discussions while developing and troubleshooting fusion models and other Bayesian missing data methods, especially Andres Musalem, Fred Feinberg, Pengyuan Wang, and Julie Novak.

Author information

Corresponding author

Correspondence to Elea McDonnell Feit.

Appendix

This appendix provides the code used to generate all examples in this chapter. It is also available online at https://github.com/eleafeit/data_fusion. Note that the results in the chapter were obtained with Stan 2.17. If you use a different version of Stan, you may obtain slightly different results even when using the same random number seed.

R Code for Generating Synthetic Data and Running Ex. 1 with Stan

R Commands for Ex. 1 (Requires Utility Functions Below to Be Sourced First)

    library(MASS)
    library(coda)
    library(beanplot)
    library(rstan)

    # Example 1a: MVN ====================================
    # Generate synthetic data
    set.seed(20030601)
    Sigma <- matrix(c(1, 0.3, -0.2, 0.7, 0.3, 1, -0.6, 0.4, -0.2,
                      -0.6, 1, 0.1, 0.7, 0.4, 0.1, 1), nrow=4)
    d1 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                         mu=rep(0,4), Sigma=Sigma)
    str(d1$data)
    # Call to Stan to generate posterior draws
    m1 <- stan(file="Data_Fusion_MVN.stan", data=d1$data,
               iter=10000, warmup=2000, chains=1, seed=12)
    # Summaries of posterior draws for population-level parameters
    summary(m1, pars=c("mu"))
    summary(m1, pars=c("tau"))
    summary(m1, pars=c("Omega"))
    plot.post.density(m1, pars=c("mu", "tau"), prefix="Ex1",
                      true=list(d1$true$mu, sqrt(diag(d1$true$Sigma)),
                                cov2cor(d1$true$Sigma)))
    draws <- As.mcmc.list(m1, pars=c("Omega"))
    png(filename="Ex1PostOmega.png", width=600, height=600)
    # Columns 2:4, 7:8, 12 pick the six distinct off-diagonal correlations
    beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
             horizontal=TRUE, las=1, what=c(0, 1, 1, 0),
             side="second",
             main="Posterior Density of Omega (correlations)",
             log="", cex.axis=0.5)
    dev.off()
    # Summaries of posterior draws for missing data (third respondent)
    summary(extract(m1, pars=c("y1mis"))$y1mis[,3,])
    png("Ex1y13mis.png")
    plot(density(extract(m1, pars=c("y1mis"))$y1mis[,3,]),
         main="Posterior of Unobserved y_1", xlab="y_1")
    dev.off()
    # Posteriors of observed data place a point mass at the observed value
    summary(m1, pars=c("y"))
    plot.true.v.est(m1, pars=c("y1mis", "y2mis"), prefix="Ex1",
                    true=list(d1$true$y1mis, d1$true$y2mis))

    # Example 1b: MVN with zero correlations ===================
    # Generate synthetic data
    set.seed(20030601)
    Sigma <- matrix(0, nrow=4, ncol=4)
    diag(Sigma) <- 1
    d2 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                         mu=rep(0,4), Sigma=Sigma)
    # Call to Stan to generate posterior draws
    m2 <- stan(file="Data_Fusion_MVN.stan", data=d2$data,
               iter=10000, warmup=2000, chains=1, seed=12)
    # Summarize posteriors of population-level parameters
    summary(m2, pars=c("mu"))
    summary(m2, pars=c("tau"))
    summary(m2, pars=c("Omega"))
    plot.post.density(m2, pars=c("mu", "tau"), prefix="Ex2",
                      true=list(d2$true$mu, sqrt(diag(d2$true$Sigma)),
                                cov2cor(d2$true$Sigma)))
    draws <- As.mcmc.list(m2, pars=c("Omega"))
    png(filename="Ex2PostOmega.png", width=600, height=400)
    beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
             horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
             main="Posterior Density of Omega", log="",
             cex.axis=0.5)
    dev.off()
    # Summaries of posterior draws for missing data
    plot.true.v.est(m2, pars=c("y1mis", "y2mis"), prefix="Ex2",
                    true=list(d2$true$y1mis, d2$true$y2mis))

    # Example 1c: MVN with strong positive correlations ==========
    # Generate synthetic data
    set.seed(20030601)
    Sigma <- matrix(0.9, nrow=4, ncol=4)
    diag(Sigma) <- 1
    d3 <- data.mvn.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                         mu=rep(0,4), Sigma=Sigma)
    # Call to Stan to generate posterior draws
    m3 <- stan(file="Data_Fusion_MVN.stan", data=d3$data,
               iter=10000, warmup=2000, chains=1, seed=12)
    # Summaries of population-level parameters
    summary(m3, pars=c("mu"))
    summary(m3, pars=c("tau"))
    summary(m3, pars=c("Omega"))
    plot.post.density(m3, pars=c("mu", "tau"), prefix="Ex3",
                      true=list(d3$true$mu, sqrt(diag(d3$true$Sigma))))
    draws <- As.mcmc.list(m3, pars=c("Omega"))
    png(filename="Ex3PostOmega.png", width=600, height=400)
    beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]),
             horizontal=TRUE, las=1, what=c(0, 1, 1, 0), side="second",
             main="Posterior Density of Omega", log="")
    dev.off()
    # Summaries of posterior draws for missing data
    plot.true.v.est(m3, pars=c("y1mis", "y2mis"), prefix="Ex3",
                    true=list(d3$true$y1mis, d3$true$y2mis))

Utility Functions for Ex. 1

    # Simulate N1+N2 multivariate normal observations and split them so that
    # data set 1 observes the first K1 variables, data set 2 observes the next
    # K2 variables, and both observe the last Kb variables.
    # Note: ":" binds more tightly than "+" in R, so N1 + (1:N2) selects
    # rows N1+1, ..., N1+N2.
    data.mvn.split <- function(K1=2, K2=2, Kb=3, N1=100, N2=100,
                               mu=rep(0, K1+K2+Kb),
                               Sigma=diag(1, K1+K2+Kb))
    {
      y <- mvrnorm(n=N1+N2, mu=mu, Sigma=Sigma)
      list(data=list(K1=K1, K2=K2, Kb=Kb, N1=N1, N2=N2,
                     y1=y[1:N1, 1:K1, drop=FALSE],
                     y2=y[N1 + (1:N2), K1 + (1:K2), drop=FALSE],
                     yb=y[, K1 + K2 + (1:Kb), drop=FALSE]),
           true=list(mu=mu, Sigma=Sigma,
                     y1mis=y[1:N1, K1 + (1:K2)],
                     y2mis=y[N1 + (1:N2), 1:K1]))
    }

    # Same split for multivariate probit data: y is the sign of a latent
    # multivariate normal z, and missing positions are filled with zeros.
    data.mvp.split <- function(K1=2, K2=2, Kb=3, N1=100, N2=100,
                               mu=rep(0, K1+K2+Kb),
                               Sigma=diag(1, K1+K2+Kb))
    {
      z <- mvrnorm(n=N1+N2, mu=mu, Sigma=Sigma)
      y <- z
      y[y>0] <- 1
      y[y<0] <- 0
      y1mis <- y[1:N1, K1 + (1:K2)]
      y2mis <- y[N1 + (1:N2), 1:K1]
      y[1:N1, K1 + (1:K2)] <- NA
      y[N1 + (1:N2), 1:K1] <- NA
      true <- list(mu=mu, Sigma=Sigma, z=z, y=y, y1mis=y1mis,
                   y2mis=y2mis)
      y[is.na(y)] <- 0
      data <- list(K1=K1, K2=K2, Kb=Kb, N1=N1, N2=N2, y=y)
      list(data=data, true=true)
    }

    # Bean plots of the posterior draws for each parameter in pars.
    # The argument true is accepted but not used in the plots.
    plot.post.density <- function(m.stan, pars, true, prefix=NULL){
      for (i in 1:length(pars)) {
        draws <- As.mcmc.list(m.stan, pars=pars[i])
        if (!is.null(prefix)) {
          filename <- paste(prefix, "Post", pars[i], ".png", sep="")
          png(filename=filename, width=600, height=400)
        }
        beanplot(data.frame(draws[[1]]),
                 horizontal=TRUE, las=1, what=c(0, 1, 1, 0),
                 side="second", main=paste("Posterior Density of",
                                           pars[[i]]))
        if (!is.null(prefix)) dev.off()
      }
    }

    # Scatter plots of true values against posterior medians, with
    # gray segments spanning the 2.5% to 97.5% posterior quantiles.
    plot.true.v.est <- function(m.stan, pars, true, prefix=NULL){
      for (i in 1:length(pars)) {
        draws <- As.mcmc.list(m.stan, pars=pars[i])
        est <- summary(draws)
        if (!is.null(prefix)) {
          filename <- paste(prefix, "TrueVEst", pars[i], ".png", sep="")
          png(filename=filename, width=600, height=400)
        }
        plot(true[[i]], est$quantiles[,3], col="blue",
             xlab=paste("True", pars[i]),
             ylab=paste("Estimated", pars[i], "(posterior median)"))
        abline(a=0, b=1)
        arrows(true[[i]], est$quantiles[,3], true[[i]],
               est$quantiles[,1], col="gray90", length=0)
        arrows(true[[i]], est$quantiles[,3], true[[i]],
               est$quantiles[,5], col="gray90", length=0)
        points(true[[i]], est$quantiles[,3], col="blue")
        if (!is.null(prefix)) dev.off()
      }
    }
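The split functions above rely on an R operator-precedence idiom that is easy to misread: because ":" is evaluated before "+", an expression such as N1+1:N2 means N1 + (1:N2), that is, the N2 rows that follow the first N1. The short check below illustrates this with small, made-up sizes; it is only a sanity check and not part of the chapter's code.

```r
# Small, made-up sizes for illustration only
N1 <- 3; N2 <- 2; K1 <- 1; K2 <- 2

rows_set2 <- N1 + 1:N2   # rows belonging to data set 2
cols_set2 <- K1 + 1:K2   # columns observed only in data set 2

# ":" is evaluated first, so these are the shifted sequences
print(rows_set2)                               # rows N1+1, ..., N1+N2
print(all(rows_set2 == (N1 + 1):(N1 + N2)))

# Parenthesizing the other way gives a different (here descending) sequence
wrong_rows <- (N1 + 1):N2
print(wrong_rows)
```

Writing the index as N1 + (1:N2) makes the intent explicit without changing the result.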

Stan Model for Ex. 2 (Split Multivariate Probit Data)

    functions {
      int mysum(int[,] a) {
        int s;
        s = 0;
        for (i in 1:size(a))
          s = s + sum(a[i]);
        return s;
      }
    }
    data {
      int<lower=0> K1;  // number of vars only observed in data set 1
      int<lower=0> K2;  // number of vars only observed in data set 2
      int<lower=0> Kb;  // number of vars observed in both data sets
      int<lower=0> N1;  // number of observations in data set 1
      int<lower=0> N2;  // number of observations in data set 2
      int<lower=0,upper=2> y[N1+N2, K1+K2+Kb];  // should contain zeros in missing positions
    }
    transformed data {
      int<lower=1, upper=N1+N2> n_pos[mysum(y)];
      int<lower=1, upper=K1+K2+Kb> k_pos[size(n_pos)];
      int<lower=1, upper=N1+N2> n_neg[(N1+N2)*(K1+K2+Kb) - K2*N1
                                      - K1*N2 - mysum(y)];
      int<lower=1, upper=K1+K2+Kb> k_neg[size(n_neg)];
      int<lower=0> N_pos;
      int<lower=0> N_neg;
      N_pos = size(n_pos);
      N_neg = size(n_neg);
      {
        int i;
        int j;
        i = 1;
        j = 1;
        for (n in 1:N1) {  // positions in observed y1
          for (k in 1:K1) {
            if (y[n,k] == 1) {
              n_pos[i] = n;
              k_pos[i] = k;
              i = i + 1;
            } else {
              n_neg[j] = n;
              k_neg[j] = k;
              j = j + 1;
            }
          }
          for (k in (K1+K2+1):(K1+K2+Kb)) {
            if (y[n,k] == 1) {
              n_pos[i] = n;
              k_pos[i] = k;
              i = i + 1;
            } else {
              n_neg[j] = n;
              k_neg[j] = k;
              j = j + 1;
            }
          }
        }
        for (n in (N1+1):(N1+N2)) {  // positions in observed y2
          for (k in (K1+1):(K1+K2+Kb)) {
            if (y[n,k] == 1) {
              n_pos[i] = n;
              k_pos[i] = k;
              i = i + 1;
            } else {
              n_neg[j] = n;
              k_neg[j] = k;
              j = j + 1;
            }
          }
        }
      }
    }
    parameters {
      vector[K1 + K2 + Kb] mu;
      corr_matrix[K1 + K2 + Kb] Omega;
      vector<lower=0>[N_pos] z_pos;
      vector<upper=0>[N_neg] z_neg;
      vector[K2] z1mis[N1];
      vector[K1] z2mis[N2];
    }
    transformed parameters {
      vector[K1 + K2 + Kb] z[N1 + N2];
      vector[K2] y1mis[N1];
      vector[K1] y2mis[N2];
      for (i in 1:N_pos)
        z[n_pos[i], k_pos[i]] = z_pos[i];
      for (i in 1:N_neg)
        z[n_neg[i], k_neg[i]] = z_neg[i];
      for (n in 1:N1) {
        for (k in 1:K2) {
          z[n, K1 + k] = z1mis[n, k];
          if (z1mis[n, k] > 0)
            y1mis[n, k] = 1;
          if (z1mis[n, k] < 0)
            y1mis[n, k] = 0;
        }
      }
      for (n in 1:N2) {
        for (k in 1:K1) {
          z[N1 + n, k] = z2mis[n, k];
          if (z2mis[n, k] > 0)
            y2mis[n, k] = 1;
          if (z2mis[n, k] < 0)
            y2mis[n, k] = 0;
        }
      }
    }
    model {
      mu ~ normal(0, 3);
      Omega ~ lkj_corr(1);
      z ~ multi_normal(mu, Omega);
    }
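The transformed data block of the Stan model above records the row and column of every observed 1 (n_pos, k_pos) and every observed 0 (n_neg, k_neg), skipping the two blocks of missing entries, so that the corresponding latent variables can be constrained to be positive or negative. As a rough way to check that bookkeeping outside Stan, the following R sketch computes the same sets of positions. The helper observed_positions is hypothetical, and it lists positions in a different order than the Stan loops, which should not matter because each row index stays paired with its column index.

```r
# Hypothetical helper mirroring the index bookkeeping in the Stan model's
# transformed data block; not part of the chapter's code.
observed_positions <- function(y, K1, K2, Kb, N1, N2) {
  K <- K1 + K2 + Kb
  observed <- matrix(TRUE, nrow = N1 + N2, ncol = K)
  # Data set 1 does not observe variables K1+1, ..., K1+K2
  observed[seq_len(N1), K1 + seq_len(K2)] <- FALSE
  # Data set 2 does not observe variables 1, ..., K1
  observed[N1 + seq_len(N2), seq_len(K1)] <- FALSE
  pos <- which(observed & y == 1, arr.ind = TRUE)  # observed ones
  neg <- which(observed & y != 1, arr.ind = TRUE)  # observed zeros
  list(n_pos = pos[, "row"], k_pos = pos[, "col"],
       n_neg = neg[, "row"], k_neg = neg[, "col"])
}

# Tiny made-up example: 2 + 2 respondents, one variable of each type,
# with zeros stored in the missing positions
y <- matrix(c(1, 0, 0,
              0, 0, 1,
              0, 1, 1,
              0, 0, 0), nrow = 4, byrow = TRUE)
p <- observed_positions(y, K1 = 1, K2 = 1, Kb = 1, N1 = 2, N2 = 2)
length(p$n_pos)  # number of observed ones
length(p$n_neg)  # number of observed zeros
```

The two lengths should agree with the array sizes declared in the Stan model: mysum(y) for the positive positions and (N1+N2)*(K1+K2+Kb) - K2*N1 - K1*N2 - mysum(y) for the negative ones.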

R Commands for Ex. 2

    # Assumes the libraries and utility functions from Ex. 1 are already loaded
    # Generate synthetic data
    set.seed(20030601)
    Sigma <- matrix(c(1, 0.3, -0.2, 0.7, 0.3, 1, -0.6, 0.4, -0.2,
                      -0.6, 1, 0.1, 0.7, 0.4, 0.1, 1), nrow=4)
    d1 <- data.mvp.split(K1=1, K2=1, Kb=2, N1=100, N2=100,
                         mu=rep(0,4), Sigma=Sigma)
    # Call to Stan to generate posterior draws
    m1 <- stan(file="Data_Fusion_MVP.stan", data=d1$data,
               iter=10000, warmup=2000, chains=1, seed=35)
    # Summaries of posteriors of population-level parameters
    summary(m1, pars=c("mu", "Omega"))
    plot.post.density(m1, pars=c("mu"), prefix="Ex1MVP",
                      true=list(d1$true$mu))
    png(filename="Ex1MVPPostOmega.png", width=600, height=400)
    draws <- As.mcmc.list(m1, pars=c("Omega"))
    beanplot(data.frame(draws[[1]][,c(2:4, 7:8, 12)]), horizontal=TRUE,
             las=1, what=c(0, 1, 1, 0), side="second",
             main="Posterior Density of Omega", log="")
    dev.off()
    # Summarize posterior for one of the missing values
    y1mis.draws <- extract(m1, pars=c("y1mis"))[[1]][,1,1]  # draws for first respondent
    mean(y1mis.draws > 0)
    # Confusion matrices for missing data
    y1mis.est <- summary(m1, pars=c("y1mis"))$summary[, "50%"] > 0
    xtabs(~y1mis.est + (d1$true$y1mis > 0))
    y2mis.est <- summary(m1, pars=c("y2mis"))$summary[, "50%"] > 0
    xtabs(~y2mis.est + (d1$true$y2mis > 0))
    # Compare true latent variables with their posterior medians
    z.est <- data.frame(z.true=as.vector(t(d1$true$z)),
                        y=as.vector(t(d1$true$y)),
                        z.postmed=summary(m1, pars=c("z"))$summary[,"50%"])
    png(filename="Ex1MVPTrueVEstz.png", width=600, height=400)
    plot(z.est[,c(1,3)], xlab="True Latent Variable",
         ylab="Posterior Median of Latent Variable")
    points(z.est[is.na(z.est$y), c(1,3)], col="red")  # missing observations
    abline(h=0, v=0)
    dev.off()

Copyright information

© 2018 Springer Nature Switzerland AG

About this entry

Cite this entry

Feit, E.M., Bradlow, E.T. (2018). Fusion Modeling. In: Homburg, C., Klarmann, M., Vomberg, A. (eds) Handbook of Market Research. Springer, Cham. https://doi.org/10.1007/978-3-319-05542-8_9-1

  • DOI: https://doi.org/10.1007/978-3-319-05542-8_9-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-05542-8

  • Online ISBN: 978-3-319-05542-8

  • eBook Packages: Springer Reference Business and Management; Reference Module Humanities and Social Sciences; Reference Module Business, Economics and Social Sciences
