Rapid Screening of Big Data Against Inadvertent Leaks

  • Xiaokui Shu
  • Fang Liu
  • Danfeng (Daphne) Yao


Keeping sensitive data away from unauthorized parties in a highly connected world is challenging. Statistics from security firms, research institutions, and government organizations show that the number of data-leak incidents has grown rapidly in recent years. Deliberately planned attacks, inadvertent leaks, and human mistakes account for the majority of these incidents. In this chapter, we first introduce the threat of data leaks and survey traditional solutions for detecting and preventing the leakage of sensitive data. We then point out new challenges in the era of big data and present state-of-the-art data-leak detection designs and algorithms. These solutions leverage big data theories and platforms (data mining, MapReduce, GPGPU, etc.) to harden privacy control for big data. We also discuss open research problems in data-leak detection and prevention.


Keywords: Alignment Algorithm, Sensitive Data, Bloom Filter, Data Owner, MapReduce Framework



This work has been supported by NSF S2ERC Center (an I/UCRC Center) and ARO YIP grant W911NF-14-1-0535.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Department of Computer Science, Virginia Tech, Blacksburg, USA
