Delay-sensitive approaches for anonymizing numerical streaming data

  • Regular Contribution
International Journal of Information Security

Abstract

Streaming data are widely used in today’s world. Data arrive from different sources in streams and must be processed online with minimum delay. These data streams can contain confidential data, such as customers’ purchase information, and need to be mined to reveal other useful information, such as customers’ purchase patterns. Privacy preservation throughout these processes plays a crucial role. K-anonymity is a well-known technique for preserving privacy. The principal issues in k-anonymity are information loss and running time. Although some existing k-anonymity techniques generate anonymized data with acceptable information loss, they are very time-consuming and therefore not applicable in a streaming context, where data are usually very sensitive to delay and must be processed quickly. In [32], we proposed a cluster-based k-anonymity algorithm called fast anonymizing algorithm for numerical streaming data (FAANST), which anonymizes numerical streaming data quickly while keeping information loss admissible. The main drawback of FAANST is that some tuples may remain in the system for a long time and are output only after they might be considered to have expired. In this paper, we propose two extensions to FAANST: a passive solution and a proactive solution. Both put a soft deadline, called \(delay\), on the time each tuple can stay in the system; if a tuple passes this deadline, the algorithms force the tuple to be output. The proactive solution goes one step further and uses a simple heuristic function to predict when a tuple in the system may expire, outputting the tuple if it will expire in the next round of the algorithm’s execution.
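The difference between the passive and the proactive deadline check can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Record` type, the function name `split_expiring`, the integer time unit, and the way the predicted round time enters the check are all assumptions made for illustration. The abstract only states that a tuple past \(delay\) is forced out and that the proactive variant also forces out tuples predicted to expire in the next round.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Record:
    # Hypothetical tuple representation: arrival time plus numerical values.
    arrival: int
    values: List[float]


def split_expiring(
    window: List[Record],
    now: int,
    delay: int,
    predicted_round_time: int = 0,
) -> Tuple[List[Record], List[Record]]:
    """Partition `window` into (force_out, keep).

    With predicted_round_time == 0 this is a passive check: a tuple is
    forced out only once it has already stayed longer than `delay`.
    With predicted_round_time > 0 it is a proactive check (assumed form):
    a tuple is also forced out if its age at the end of the next round
    is predicted to exceed `delay`.
    """
    force_out: List[Record] = []
    keep: List[Record] = []
    for rec in window:
        age_after_next_round = (now - rec.arrival) + predicted_round_time
        if age_after_next_round > delay:
            force_out.append(rec)
        else:
            keep.append(rec)
    return force_out, keep


window = [Record(0, [1.0]), Record(50, [2.0]), Record(90, [3.0])]
passive_out, _ = split_expiring(window, now=100, delay=60)
proactive_out, _ = split_expiring(window, now=100, delay=60, predicted_round_time=20)
```

In this example the passive check forces out only the tuple that arrived at time 0, while the proactive check also forces out the tuple that arrived at time 50, because its age would reach 70 by the end of the next round. How the forced tuples are then published is outside the scope of the sketch.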

Notes

  1. A few approximation algorithms have been proposed for the problem of k-anonymization [1, 2]. As with approximation algorithms in other contexts, such as graph theory, they do not usually find the optimal solution.

  2. A tuple expires when it remains in the system for longer than a pre-specified threshold called \(delay\).

  3. Forcing an expired tuple to be output is done by suppressing it.

  4. A generalization with minimum absolute and relative distance is minimal (please refer to [24] for definitions of absolute and relative distance).

  5. For more information on window processing please refer to [22].

  6. We used our own implementation of \(k\)-means.

  7. We refer to the number of clusters in \(k\)-means as \(k^\prime\) in order to prevent confusion with \(k\) in k-anonymity.

  8. There is only one case in which this requirement may not be satisfied: when fewer than \(k\) tuples exist in the window at the moment we are about to cluster them. So, selecting \(k^\prime = \frac{sizeOfCurrentWindow}{k}\) will cause tuples in at least one cluster to be output.

  9. It is worth mentioning that other heuristics, such as the average processing time of the previous n rounds, can be used instead of our heuristic. In other words, because streaming data in different contexts may have diverse properties, other heuristics can simply be substituted for our simple heuristic to accommodate those properties.

  10. Other solutions for anonymizing data streams do not use window processing and cannot support categorical attributes. Our solution and CASTLE take advantage of window processing and, as will be discussed later, our solution, like CASTLE, can handle categorical as well as numerical attributes.

  11. According to [32], \(\delta \) has a minor effect on the metrics which we desire to measure, so we have set it to 1 in our experiments.

  12. In the previous experiments, we set \(delay\) to 2,000 in the original FAANST and CASTLE to only measure its effect on the number of expired tuples. That is, we simply measured how many of the tuples passed the soft deadline.

  13. This is because the tuples remain in the system for a longer time and the condition is checked in each iteration.
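Notes 8 and 9 can be made concrete with a short Python sketch. It is illustrative only. The names `choose_k_prime` and `RoundTimeAverager` are hypothetical, rounding the fraction in note 8 down (with a floor of one cluster) is an assumption, and the averaging predictor implements the alternative heuristic mentioned in note 9, not the heuristic used in the paper.

```python
from collections import deque


def choose_k_prime(window_size: int, k: int) -> int:
    """Number of k-means clusters for the current window.

    Assumed reading of note 8: floor(window_size / k), but never fewer
    than one cluster, which covers the case of fewer than k tuples.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    return max(1, window_size // k)


class RoundTimeAverager:
    """Predicts the next round's duration as the mean of the last n rounds."""

    def __init__(self, n: int):
        # A bounded deque silently drops the oldest duration when full.
        self._times = deque(maxlen=n)

    def record(self, duration: float) -> None:
        self._times.append(duration)

    def predict(self) -> float:
        # With no history yet, fall back to 0.0 (an arbitrary choice).
        if not self._times:
            return 0.0
        return sum(self._times) / len(self._times)


averager = RoundTimeAverager(n=3)
for duration in (10.0, 20.0, 30.0, 40.0):
    averager.record(duration)
```

With this rounding, a window of 105 tuples and \(k = 10\) gives \(k^\prime = 10\), so by the pigeonhole principle at least one cluster holds 11 or more tuples. A window of 5 tuples gives \(k^\prime = 1\), which is the exceptional case described in note 8. After the four recorded rounds, `averager.predict()` averages only the last three durations.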

References

  1. Aggarwal, G., Feder, T., Kenthapadi, K., Khuller, S., Panigrahy, R., Thomas, D., Zhu, A.: Achieving anonymity via clustering. In: Proceeding of the 25th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 153–162. ACM, New York, NY, USA (2006)

  2. Aggarwal, G., Feder, T., Kenthapadi, K., Motwani, R., Panigrahy, R., Thomas, D., Zhu, A.: Anonymizing tables. In: Proceedings of the International Conference on Database Theory (ICDT’05), pp. 246–258 (2005)

  3. Asuncion, A., Newman, D.J.: UCI machine learning repository. http://www.ics.uci.edu/mlearn/ (2007)

  4. Bayardo, R.J., Agrawal, R.: Data privacy through optimal k-anonymization. In: Proceeding of the 21st International Conference on Data Engineering, pp. 217–228. USA (2005)

  5. Berkhin, P.: Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA (2002)

  6. Blocki, J., Williams, R.: Resolving the complexity of some data privacy problems. In: Proceedings of the 37th International Colloquium Conference on Automata, Languages and Programming: Part II, pp. 393–404 (2010)

  7. Cao, J., Carminati, B., Ferrari, E., Tan, K.L.: Castle: a delay-constrained scheme for ks-anonymizing data streams. In: Proceeding of the 2008 IEEE 24th ICDE, pp. 1376–1378 (2008)

  8. Fung, B.C.M., Wang, K., Yu, P.S.: Top-down specialization for information and privacy preservation. In: Proceeding of the 21st ICDE, pp. 205–216, USA (2005)

  9. Golle, P.: Revisiting the uniqueness of simple demographics in the US population. In: Proceeding of the 5th ACM Workshop on Privacy in Electronic Society (WPES’06) (2006)

  10. Guha, S., Rastogi, R., Shim, K.: Rock: a robust clustering algorithm for categorical attributes. In: International Conference on Data Engineering, p. 512 (1999)

  11. Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceeding of the 1998 ACM SIGMOD International Conference on Management of Data, pp. 73–84. USA (1998)

  12. Guttman, A.: R-trees: a dynamic index structure for spatial searching. In: Proceeding of the 1984 ACM SIGMOD International Conference on Management of Data, pp. 47–57. ACM, New York, NY, USA (1984)

  13. Hundepool, A., Willenborg, L.: mu and tau-Argus: software for statistical disclosure control. In: Proceeding of 3rd International Seminar on Statistical Confidentiality (1996)

  14. Iyengar, V.S.: Transforming data to satisfy privacy constraints. In: Proceeding of the eighth ACM SIGKDD, pp. 279–288 (2002)

  15. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data: An Introduction to Cluster Analysis (Wiley Series in Probability and Statistics). Wiley-Interscience (March 2005)

  16. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Incognito: efficient full-domain k-anonymity. In: Proceeding of ACM SIGMOD, pp. 49–60 (2005)

  17. LeFevre, K., DeWitt, D.J., Ramakrishnan, R.: Mondrian multidimensional k-anonymity. In: Proceeding of the 22nd ICDE, p. 25 (2006)

  18. Li, J., Ooi, B.C., Wang, W.: Anonymizing streaming data for privacy protection. In: Proceeding of the 2008 IEEE 24th International Conference on Data Engineering, pp. 1367–1369. USA (2008)

  19. Li, J., Wong, R.C.-W., Fu, A.W.-C., Pei, J.: Achieving k-anonymity by clustering in attribute hierarchical structures. In: Proceeding of 8th International Conference on Data Warehousing and Knowledge Discovery, pp. 405–416 (2006)

  20. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Le Cam, L.M., Neyman, J. (eds.) Proceeding of the 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, pp. 281–297. University of California Press (1967)

  21. Meyerson, A., Williams, R.: On the complexity of optimal k-anonymity. In: Proceeding of the 23rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pp. 223–228. ACM, New York, NY, USA (2004)

  22. Patroumpas, K., Sellis, T.K.: Window specification over data streams. In: EDBT Workshops, pp. 445–464 (2006)

  23. Samarati, P., Sweeney, L.: Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report (1998)

  24. Samarati, P.: Protecting respondents’ identities in microdata release. IEEE Trans. Knowl. Data Eng. 13(6), 1010–1027 (2001)

  25. Sweeney, L.: Datafly: a system for providing anonymity in medical data. In: Proceeding of the IFIP TC11 WG11.3 11th International Conference on Database Security XI, pp. 356–381. Chapman & Hall, Ltd. (1998)

  26. Sweeney, L.: k-anonymity: a model for protecting privacy. Int. J. Uncertain. Fuzziness Knowl.-Based Syst. 10(5), 557–570 (2002)

  27. Tan, K.-L., Cao, J., Carminati, B., Ferrari, E.: Castle: continuously anonymizing data streams. IEEE Trans. Dependable Secure Comput. 8(3), 337–352 (2011)

  28. Wang, W., Li, J., Ai, C., Li, Y.: Privacy protection on sliding window of data streams. In: Proceeding of the 2007 International Conference on Collaborative Computing: Networking, Applications and Worksharing, pp. 213–221. Washington, DC, USA (2007)

  29. Xiao, X., Tao, Y.: Dynamic anonymization: accurate statistical analysis with privacy preservation. In: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, pp. 107–120 (2008)

  30. Xiao, X., Tao, Y.: M-invariance: towards privacy preserving re-publication of dynamic data sets. In: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, SIGMOD ’07, pp. 689–700 (2007)

  31. Xiao, X., Tao, Y.: Personalized privacy preservation. In: Proceeding of the 2006 ACM SIGMOD International Conference on Management of Data, pp. 229–240. ACM, New York, NY, USA (2006)

  32. Zakerzadeh, H., Osborn, S.L.: FAANST: fast anonymizing algorithm for numerical streaming daTa. In: Proceeding of the 5th International Workshop on Data Privacy Management (DPM’10) (2010)

  33. Zhang, Q., Koudas, N., Srivastava, D., Yu, T.: Aggregate query answering on anonymized tables. In: Proceedings of the International Conference on Data Engineering (ICDE’07), pp. 116–125 (2007)

  34. Zhou, B., Han, Y., Pei, J., Jiang, B., Tao, Y., Jia, Y.: Continuous privacy preserving publishing of data streams. In: EDBT, pp. 648–659 (2009)

Acknowledgments

We express our gratitude to the reviewers for their work and also to the authors of [7, 27] for kindly providing us with their source code. This research was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

Author information

Corresponding author

Correspondence to Hessam Zakerzadeh.

Additional information

Most of this research was done when the author was an M.Sc. student at the University of Western Ontario.

About this article

Cite this article

Zakerzadeh, H., Osborn, S.L. Delay-sensitive approaches for anonymizing numerical streaming data. Int. J. Inf. Secur. 12, 423–437 (2013). https://doi.org/10.1007/s10207-013-0196-7

  • DOI: https://doi.org/10.1007/s10207-013-0196-7
