Rapid Screening of Big Data Against Inadvertent Leaks

Big Data Concepts, Theories, and Applications

Abstract

Keeping sensitive data away from unauthorized parties in a highly connected world is challenging. Statistics from security firms, research institutions, and government organizations show that the number of data-leak incidents has grown rapidly in recent years. Deliberately planned attacks, inadvertent leaks, and human mistakes constitute the majority of the incidents. In this chapter, we first introduce the threat of data leaks and survey traditional solutions for detecting and preventing leaks of sensitive data. We then point out new challenges in the era of big data and present state-of-the-art data-leak detection designs and algorithms. These solutions leverage big data theories and platforms (data mining, MapReduce, GPGPU, etc.) to harden privacy control for big data. We also discuss open research problems in data-leak detection and prevention.

Notes

  1. The sensitive data collection is typically much smaller than the content collection.

  2. Subsequence (with gaps) is a generalization of substring and allows gaps between characters, e.g., lo-e is a subsequence of flower (- indicates a gap).

  3. Rabin’s fingerprint is min-wise independent.

  4. Rabin’s fingerprint is used for unbiased sampling (Sect. 5.4.2).

  5. The GPU prototype is realized on one NVIDIA Tesla C2050 with 448 GPU cores.
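
The distinction drawn in note 2 between a substring and a subsequence with gaps can be illustrated with a minimal Python sketch. The function names `is_substring` and `is_subsequence` are hypothetical and chosen only for this illustration; this is not the chapter's detection algorithm, just a check of the definition using the note's own example (lo-e and flower).

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if pattern occurs in text as one contiguous run (no gaps)."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` consumes the iterator up to and including the first
    # match, so each later character must be found after the previous one.
    return all(ch in remaining for ch in pattern)


if __name__ == "__main__":
    # "loe" skips the "w" in "flower": a subsequence, but not a substring.
    print(is_subsequence("loe", "flower"))
    print(is_substring("loe", "flower"))
```

Every substring is also a subsequence, but not the other way round, which is the sense in which the note calls subsequence a generalization of substring.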

Acknowledgements

This work has been supported by NSF S2ERC Center (an I/UCRC Center) and ARO YIP grant W911NF-14-1-0535.

Author information

Corresponding author

Correspondence to Xiaokui Shu.

Copyright information

© 2016 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Shu, X., Liu, F., (Daphne) Yao, D. (2016). Rapid Screening of Big Data Against Inadvertent Leaks. In: Yu, S., Guo, S. (eds) Big Data Concepts, Theories, and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-27763-9_5

  • DOI: https://doi.org/10.1007/978-3-319-27763-9_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-27761-5

  • Online ISBN: 978-3-319-27763-9

  • eBook Packages: Computer Science, Computer Science (R0)
