String Sanitization: A Combinatorial Approach

Part of the Lecture Notes in Computer Science book series (LNAI, volume 11906)

Abstract

String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user’s location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to construct the shortest string preserving the order of appearance and the frequency of all non-sensitive patterns. Such a string allows accurately performing tasks based on the sequential nature and pattern frequencies of the string. Second, we propose a time-optimal algorithm, PFS-ALGO, which preserves a partial order of appearance of non-sensitive patterns but produces a much shorter string that can be analyzed more efficiently. The strings produced by either of these algorithms may reveal the location of sensitive patterns. In response, we propose a heuristic, MCSR-ALGO, which replaces letters in these strings with carefully selected letters, so that sensitive patterns are not reinstated and occurrences of spurious patterns are prevented. We implemented our sanitization approach that applies TFS-ALGO, PFS-ALGO and then MCSR-ALGO and experimentally show that it is effective and efficient.
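
The two goals stated above, concealing sensitive patterns while preserving the frequency of non-sensitive ones, can be illustrated with a small checker. This is only a sketch under assumptions the abstract does not state: patterns are taken to be substrings of a fixed length k, and the sanitized string is allowed to contain a separator letter `#` that belongs to no pattern. The function names and the toy strings are hypothetical, and the check covers frequencies only, not the order of appearance.

```python
from collections import Counter


def kgram_counts(s, k, separator="#"):
    """Count the length-k substrings of s, skipping any that contain the separator."""
    return Counter(
        s[i:i + k]
        for i in range(len(s) - k + 1)
        if separator not in s[i:i + k]
    )


def check_sanitized(original, candidate, sensitive, k, separator="#"):
    """Return (concealed, preserved) for a candidate sanitized string.

    concealed: no sensitive pattern occurs in the candidate.
    preserved: every non-sensitive length-k pattern occurs in the candidate
               exactly as often as in the original, and nothing else occurs.
    """
    before = kgram_counts(original, k, separator)
    after = kgram_counts(candidate, k, separator)
    concealed = all(after[p] == 0 for p in sensitive)
    expected = Counter({p: c for p, c in before.items() if p not in sensitive})
    preserved = after == expected
    return concealed, preserved


# Toy input (assumed for illustration): k = 4 and three sensitive patterns.
original = "aabaaaababbbaab"
candidate = "aabaa#aaababbba#baab"
sensitive = {"aaaa", "baaa", "bbaa"}
print(check_sanitized(original, candidate, sensitive, 4))
```

On this toy input the candidate passes both checks, whereas the unmodified original fails both because it still contains the sensitive patterns. The separators in the candidate mark where sensitive patterns used to be, which is the kind of leak the abstract says MCSR-ALGO addresses by replacing letters.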

Acknowledgments

HC is supported by a CSC scholarship. GR and NP are partially supported by MIUR-SIR project CMACBioSeq grant n. RBSI146R5L. We acknowledge the use of the Rosalind HPC cluster hosted by King’s College London.

Author information

Corresponding author

Correspondence to Grigorios Loukides.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Bernardini, G. et al. (2020). String Sanitization: A Combinatorial Approach. In: Brefeld, U., Fromont, E., Hotho, A., Knobbe, A., Maathuis, M., Robardet, C. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2019. Lecture Notes in Computer Science, vol 11906. Springer, Cham. https://doi.org/10.1007/978-3-030-46150-8_37

  • DOI: https://doi.org/10.1007/978-3-030-46150-8_37

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-46149-2

  • Online ISBN: 978-3-030-46150-8

  • eBook Packages: Computer Science, Computer Science (R0)