Accurate Estimation of Structural Equation Models with Remote Partitioned Data

  • Conference paper
Privacy in Statistical Databases (PSD 2016)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 9867))

Abstract

This paper focuses on a privacy paradigm centered on giving researchers remote access to carry out analyses on sensitive data stored behind firewalls. We develop and demonstrate a method for accurate estimation of structural equation models (SEMs) for arbitrarily partitioned data. We show that, under a certain set of assumptions, our method for estimation across these partitions achieves results identical to those from estimation with the full data. We consider two situations: (i) a standard setting with a trusted central server and (ii) a round-robin setting in which none of the parties is fully trusted, and extend them in two specific ways. First, we formulate our methods specifically for SEMs, which have become increasingly common models in psychology, human development, and the behavioral sciences. Second, our methods work for horizontal, vertical, and complex partitions without needing different routines. In application, this method will increase opportunities for research by allowing SEM estimation without transfer or combination of data. We demonstrate our methods with both simulated and real data examples.

Notes

  1. See http://www.hhs.gov/ohrp/archive/nhrpac/documents/dataltr.pdf.

  2. See https://www.census.gov/about/adrm/fsrdc/about/secure_rdc.html.

Acknowledgements

This work was supported in part by NSF grants Big Data Social Sciences IGERT DGE-1144860 to Pennsylvania State University, and BCS-0941553 and SES-1534433 to the Department of Statistics, Pennsylvania State University. The work was also in part supported by the National Center for Advancing Translational Sciences, National Institutes of Health, through Grant UL1 TR000127. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.

Author information

Corresponding author

Correspondence to Joshua Snoke.

A Appendix

A.1 RAM Algebra

We briefly describe the method we use to define SEMs and to transform the model parameters into model-implied mean vectors and covariance matrices. These model-implied matrices are then used to calculate log-likelihoods iteratively given the data. Optimizing over these matrices is equivalent to optimizing over the model parameters, which yields our estimates.
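As a rough illustration of the likelihood step, the sketch below evaluates the standard multivariate normal log-likelihood for a data matrix given a mean vector and covariance matrix. It assumes complete data and is not taken from the paper; the function name `mvn_loglik` and the toy inputs are hypothetical.

```python
import numpy as np

def mvn_loglik(X, mu, Sigma):
    """Sum of multivariate normal log densities over the rows of X.

    Hypothetical helper for illustration; assumes complete data (no missing values).
    """
    n, p = X.shape
    diff = X - mu                                # center each row at the implied mean
    sign, logdet = np.linalg.slogdet(Sigma)      # stable log-determinant
    if sign <= 0:
        raise ValueError("Sigma must be positive definite")
    # Quadratic forms diff_i^T Sigma^{-1} diff_i, summed over rows,
    # computed with a linear solve instead of an explicit inverse.
    quad = np.sum(diff * np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (n * p * np.log(2 * np.pi) + n * logdet + quad)

# Toy inputs (made up for the example): two observations of two variables.
X = np.array([[0.0, 0.0], [1.0, 1.0]])
ll = mvn_loglik(X, np.zeros(2), np.eye(2))
```

In an estimation loop, an optimizer would call a function like this repeatedly, each time with the mean vector and covariance matrix implied by the current parameter values.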

The SEM path diagram has a one-to-one relationship with the multivariate normal mean vector and covariance matrix for the manifest variables. We construct this relationship through the use of RAM matrix algebra. For this we define five matrices denoted A, S, F, M, and I. These matrices contain both fixed and free model parameters. The free parameters are to be estimated and change during optimization, while the fixed parameters do not change. In these matrices, free parameters are denoted by a Greek symbol and fixed parameters by a constant number.

Recall the path diagram shown in Fig. 4. For this example model, the RAM algebra proceeds as follows. The A (“asymmetric”) matrix defines all regression parameters, or one-headed arrows, in the path diagram. Its numbers of rows and columns both equal the combined number of latent and manifest variables, with the column designating the path origin and the row designating the destination.

The S (“symmetric”) matrix defines all variance parameters, or two-headed arrows, in the path diagram in the same way as the A matrix.

The F (“filter”) matrix acts as a filter for the manifest variables. Its number of columns equals the combined number of latent and manifest variables, but its number of rows equals only the number of manifest variables. Each row contains a one in the column of the corresponding manifest variable and zeros elsewhere.

The M (“mean”) matrix defines the mean parameters, if any, for the latent and manifest variables. These are not always included in the path diagrams.

Finally, an I (“identity”) matrix is included, with numbers of columns and rows equal to the combined number of latent and manifest variables.
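As a purely illustrative sketch (a hypothetical model, not the model of Fig. 4), consider one latent variable \(\eta \) measured by three manifest variables \(x_1, x_2, x_3\), with the variables ordered \((x_1, x_2, x_3, \eta )\). Assume the first loading is fixed to 1 for identification and the latent mean is fixed to 0. The symbols \(\lambda _j\), \(\theta _j\), \(\psi \), and \(\nu _j\) are names chosen here for illustration only. Under these assumptions the RAM matrices would take the form

$$\begin{aligned} A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & \lambda _2 \\ 0 & 0 & 0 & \lambda _3 \\ 0 & 0 & 0 & 0 \end{pmatrix}, \quad S = \begin{pmatrix} \theta _1 & 0 & 0 & 0 \\ 0 & \theta _2 & 0 & 0 \\ 0 & 0 & \theta _3 & 0 \\ 0 & 0 & 0 & \psi \end{pmatrix} \end{aligned}$$

$$\begin{aligned} F = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}, \quad M = \begin{pmatrix} \nu _1 \\ \nu _2 \\ \nu _3 \\ 0 \end{pmatrix} \end{aligned}$$

with I the \(4 \times 4\) identity matrix. The Greek symbols are free parameters and the constants 0 and 1 are fixed, following the convention stated earlier in this section.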

Using these matrices, we obtain the corresponding model-implied mean vector (\(\mu \)) and covariance matrix (\(\varSigma \)) of the manifest variables based on the chosen parameters. The following equations give this crucial relationship.

$$\begin{aligned} \varSigma = F (I - A)^{-1} S \left( (I - A)^{-1} \right) ^{T} F^{T} \end{aligned} \tag{10}$$
$$\begin{aligned} \mu = F (I - A)^{-1} M \end{aligned} \tag{11}$$
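The two equations translate almost directly into matrix code. The following NumPy sketch evaluates Eqs. (10) and (11) for a hypothetical one-factor model with three manifest variables and one latent variable; the function name `ram_implied_moments`, the variable ordering, and all numeric parameter values are assumptions made for illustration and are not the paper's example.

```python
import numpy as np

def ram_implied_moments(A, S, F, M):
    """Return (mu, Sigma) implied by RAM matrices, following Eqs. (10) and (11)."""
    I = np.eye(A.shape[0])
    inv = np.linalg.inv(I - A)             # (I - A)^{-1}
    Sigma = F @ inv @ S @ inv.T @ F.T      # Eq. (10)
    mu = F @ inv @ M                       # Eq. (11)
    return mu, Sigma

# Hypothetical model; variable order is x1, x2, x3, eta (the latent variable).
A = np.zeros((4, 4))
A[0, 3] = 1.0    # loading eta -> x1, fixed to 1
A[1, 3] = 0.8    # loading eta -> x2 (made-up value)
A[2, 3] = 1.2    # loading eta -> x3 (made-up value)

S = np.diag([0.5, 0.4, 0.3, 2.0])              # residual variances, then latent variance
F = np.hstack([np.eye(3), np.zeros((3, 1))])   # keeps the three manifest variables
M = np.array([1.0, 2.0, 3.0, 0.0])             # manifest intercepts, latent mean fixed to 0

mu, Sigma = ram_implied_moments(A, S, F, M)
```

With these values the implied covariance matrix equals the latent variance times the outer product of the loadings plus the diagonal residual variances, which is a quick check that the filter matrix drops the latent row and column as intended.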

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Snoke, J., Brick, T., Slavković, A. (2016). Accurate Estimation of Structural Equation Models with Remote Partitioned Data. In: Domingo-Ferrer, J., Pejić-Bach, M. (eds) Privacy in Statistical Databases. PSD 2016. Lecture Notes in Computer Science, vol 9867. Springer, Cham. https://doi.org/10.1007/978-3-319-45381-1_15

  • DOI: https://doi.org/10.1007/978-3-319-45381-1_15

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45380-4

  • Online ISBN: 978-3-319-45381-1

  • eBook Packages: Computer Science, Computer Science (R0)
