MR-RBAT: Anonymizing Large Transaction Datasets Using MapReduce

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9149)


Privacy is a concern when publishing transaction data for applications such as marketing research and biomedical studies. While methods for anonymizing transaction data exist, they are designed to run on a single machine and hence do not scale to large datasets. Recently, MapReduce has emerged as a highly scalable platform for data-intensive applications. In this paper, we consider how MapReduce may be used to provide scalability in transaction anonymization. More specifically, we consider how RBAT may be parallelized using MapReduce. RBAT is a sequential method that has some desirable features for transaction anonymization, but its highly iterative nature makes its parallelization challenging. A direct implementation of RBAT on MapReduce using data partitioning alone can result in significant overhead, which can offset the gains from parallel processing. We propose MR-RBAT, which employs two parameters to control parallelization overhead. Our experimental results show that MR-RBAT can scale linearly to large datasets and can retain good data utility.
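
To make the idea of data partitioning concrete, the following is a minimal sketch of a single map/reduce round over partitioned transaction data. It is not RBAT or MR-RBAT: it only counts how many transactions contain each item, and it runs in one Python process rather than on a MapReduce cluster. The function names (`partition`, `map_support`, `reduce_support`, `item_support`) and the sample transactions are hypothetical and chosen for illustration only.

```python
from collections import Counter
from functools import reduce


def partition(transactions, n_parts):
    """Split the dataset into n_parts chunks (round-robin assignment)."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def map_support(chunk):
    """Map step: per chunk, count the transactions that contain each item."""
    counts = Counter()
    for transaction in chunk:
        # set() ensures an item repeated within one transaction counts once
        counts.update(set(transaction))
    return counts


def reduce_support(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def item_support(transactions, n_parts=2):
    """Run one map/reduce round and return the global item counts."""
    partials = [map_support(chunk) for chunk in partition(transactions, n_parts)]
    return reduce(reduce_support, partials, Counter())


# Hypothetical sample data, for illustration only
transactions = [
    ["bread", "milk"],
    ["bread", "beer", "milk"],
    ["beer"],
    ["bread"],
]
support = item_support(transactions, n_parts=2)
```

Each call to `item_support` stands in for one full pass over the partitioned data. An iterative method that needs many such passes would, on a real cluster, typically pay job startup and data movement costs on every pass, which is one plausible source of the overhead the abstract refers to.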


Generalize Item · Utility Loss · Distribute File System · Privacy Model · Split Phase
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© IFIP International Federation for Information Processing 2015

Authors and Affiliations

  1. School of Computer Science and Informatics, Cardiff University, Cardiff, UK
