Rapid Screening of Big Data Against Inadvertent Leaks

Shu, Xiaokui; Liu, Fang; Yao, Danfeng (Daphne)

doi:10.1007/978-3-319-27763-9_5

Abstract

Keeping sensitive data from unauthorized parties in a highly connected world is challenging. Statistics from security firms, research institutions, and government organizations show that the number of data-leak incidents has grown rapidly in recent years. Deliberately planned attacks, inadvertent leaks, and human mistakes constitute the majority of the incidents. In this chapter, we first introduce the threat of data leaks and survey traditional solutions for detecting and preventing leaks of sensitive data. We then point out new challenges in the era of big data and present state-of-the-art data-leak detection designs and algorithms. These solutions leverage big data theories and platforms (data mining, MapReduce, GPGPU, etc.) to harden privacy control for big data. We also discuss open research problems in data-leak detection and prevention.

Notes

1. The sensitive data collection is typically much smaller than the content collection.
2. Subsequence (with gaps) is a generalization of substring and allows gaps between characters, e.g., lo-e is a subsequence of flower (- indicates a gap).
3. Rabin’s fingerprint is min-wise independent.
4. Rabin’s fingerprint is used for unbiased sampling (Sect. 5.4.2).
5. The GPU prototype is realized on one NVIDIA Tesla C2050 with 448 GPU cores.
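
The distinction drawn in note 2 between a substring and a subsequence with gaps can be illustrated with a short sketch. This is only an illustration of the definition, not the detection algorithm used in the chapter; the function names `is_substring` and `is_subsequence` are hypothetical and chosen here for clarity.

```python
def is_substring(pattern: str, text: str) -> bool:
    """True if pattern occurs in text as one contiguous run (no gaps)."""
    return pattern in text


def is_subsequence(pattern: str, text: str) -> bool:
    """True if the characters of pattern occur in text in order, gaps allowed."""
    remaining = iter(text)
    # `ch in remaining` advances the iterator until ch is found, so each
    # later character is searched only in the part of text that follows.
    return all(ch in remaining for ch in pattern)


# "lo-e" from note 2: l, o, then a gap over "w", then e.
print(is_subsequence("loe", "flower"))  # subsequence with a gap
print(is_substring("loe", "flower"))    # not contiguous, so not a substring
print(is_substring("low", "flower"))    # contiguous, so a substring
```

Every substring is also a subsequence, but not the reverse, which is why the subsequence notion is described as a generalization.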

Acknowledgements

This work has been supported by NSF S²ERC Center (an I/UCRC Center) and ARO YIP grant W911NF-14-1-0535.

Author information

Authors and Affiliations

Department of Computer Science, Virginia Tech, Blacksburg, VA, 24060, USA
Xiaokui Shu, Fang Liu & Danfeng (Daphne) Yao

Corresponding author

Correspondence to Xiaokui Shu.

Editor information

Editors and Affiliations

School of Information Technology, Deakin University, Burwood, Victoria, Australia
Shui Yu
School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu City, Japan
Song Guo

About this chapter

Cite this chapter

Shu, X., Liu, F., Yao, D. (2016). Rapid Screening of Big Data Against Inadvertent Leaks. In: Yu, S., Guo, S. (eds) Big Data Concepts, Theories, and Applications. Springer, Cham. https://doi.org/10.1007/978-3-319-27763-9_5

DOI: https://doi.org/10.1007/978-3-319-27763-9_5
Published: 04 March 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27761-5
Online ISBN: 978-3-319-27763-9
eBook Packages: Computer Science, Computer Science (R0)
