Software Development for SDC in R

  • M. Templ
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4302)


Producing scientific-use files from economic microdata is a major challenge. Many common methods perturb the data so that the univariate distribution of each variable stays almost the same as in the original data, while the multivariate structure of the data is often destroyed.
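
The contrast between preserved univariate distributions and a damaged multivariate structure can be illustrated with a small sketch. It is not taken from the paper: the data are synthetic, the variable names (`turnover`, `employees`) are hypothetical, and the masking step (independently permuting one column) is a deliberately crude stand-in for a perturbation method.

```python
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Synthetic "enterprise" data (an assumption, not real microdata):
# employees is roughly proportional to turnover.
rng = random.Random(1)
turnover = [rng.lognormvariate(10, 1) for _ in range(500)]
employees = [t / 1000 * rng.uniform(0.8, 1.2) for t in turnover]

# Crude masking: permute one column independently of the other.
masked_employees = employees[:]
rng.shuffle(masked_employees)

# The set of values in the column is untouched, so its univariate
# distribution is identical before and after masking ...
same_marginal = sorted(masked_employees) == sorted(employees)

# ... but the link between the two variables is broken.
r_before = pearson(turnover, employees)
r_after = pearson(turnover, masked_employees)

print("marginal unchanged:", same_marginal)
print("correlation before masking:", round(r_before, 3))
print("correlation after masking: ", round(r_after, 3))
```

Any summary of the masked column alone (mean, quantiles, histogram) matches the original exactly, whereas an analysis relating the two variables would be misleading. This is why evaluating a method only on univariate summaries is not sufficient.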

Which methods are suitable depends strongly on the underlying data. What is needed is a program system that allows different methods to be applied and the results of different algorithms to be evaluated and compared in a flexible way. Using methods for protecting microdata as an exploratory data analysis tool requires a powerful program system that can present the results in a number of easy-to-grasp graphics. For this purpose, some of the most popular procedures for anonymising microdata are implemented in a flexible R package. The R system provides flexible data import/export facilities and advanced development tools for building such disclosure control software.

In addition to algorithms already available in other software (e.g. the MDAV algorithm for microaggregation), some new algorithms for anonymising microdata are implemented, e.g. a fast algorithm for microaggregation based on a projection pursuit approach. This algorithm outperforms the other existing algorithms on most real data sets.
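
To make the idea of microaggregation concrete, here is a minimal sketch of single-axis microaggregation: records are sorted along one projection of the standardised data, cut into groups of at least k records, and each record is replaced by its group mean. This is neither MDAV nor the projection pursuit algorithm of the paper. The fixed projection direction (the sum of the standardised variables), the function name `microaggregate` and the toy data are all assumptions made for illustration.

```python
from statistics import mean, pstdev


def microaggregate(rows, k=3):
    """Replace every record by the mean of its group of at least k records.

    Groups are formed by sorting the records along one fixed projection
    (the sum of the standardised variables) and cutting the sorted
    sequence into consecutive blocks of size k.
    """
    n = len(rows)
    if n < k:
        raise ValueError("need at least k records")

    cols = list(zip(*rows))
    centers = [mean(c) for c in cols]
    scales = [pstdev(c) or 1.0 for c in cols]  # guard constant columns

    def score(row):
        # Projection of the standardised record onto the direction (1, ..., 1).
        return sum((v - m) / s for v, m, s in zip(row, centers, scales))

    order = sorted(range(n), key=lambda i: score(rows[i]))

    # Block start positions; the last block absorbs the remainder so that
    # no group has fewer than k members.
    starts = list(range(0, n - n % k, k))
    masked = [None] * n
    for g, start in enumerate(starts):
        end = n if g == len(starts) - 1 else start + k
        members = order[start:end]
        centroid = tuple(
            mean(rows[i][j] for i in members) for j in range(len(cols))
        )
        for i in members:
            masked[i] = centroid
    return masked


# Toy data: (turnover, employees) for seven hypothetical enterprises.
rows = [
    (10.0, 1.0), (12.0, 2.0), (11.0, 1.5),
    (50.0, 8.0), (52.0, 9.0), (51.0, 8.5),
    (90.0, 20.0),
]
masked = microaggregate(rows, k=3)
for original, protected in zip(rows, masked):
    print(original, "->", protected)
```

Because every published value is shared by at least k records, no single enterprise can be read off directly, and column means are preserved. How much of the multivariate structure survives depends on how homogeneous the groups are, which is where the choice of grouping algorithm matters.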

For all of these algorithms and methods, print, summary and plot methods as well as methods for validation are implemented.

In the field of economics, suppressing cells in marginal tables is likely the method most commonly used by statistical agencies to protect tables. Linear programming for cell suppression seems to be the best way of protecting tables and hierarchical tables.

Some R packages for various fields of disclosure control are currently being developed. Their application is easy to learn even with little previous knowledge, because they come with integrated online help containing examples ready to be executed.


Keywords: Latin Hypercube Sampling, Projection Pursuit, Disclosure Risk, Multivariate Structure, Disclosure Control





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • M. Templ (1, 2)
  1. Statistics Austria, Vienna
  2. Dept. of Statistics & Probability Theory, Vienna University of Technology, Vienna
