Skip to main content
Log in

Predicting sensitive information leakage in IoT applications using flows-aware machine learning approach

  • Published:
Empirical Software Engineering Aims and scope Submit manuscript

Abstract

This paper presents an approach for identification of vulnerable IoT applications. The approach focuses on a category of vulnerabilities that leads to sensitive information leakage which can be identified by using taint flow analysis. Tainted flows vulnerability is very much impacted by the structure of the program and the order of the statements in the code, designing an approach to detect such vulnerability needs to take into consideration such information in order to provide precise results. In this paper, we propose and develop an approach, FlowsMiner, that mines features from the code related to program structure such as control statements and methods, in addition to program’s statement order. FlowsMiner, generates features in the form of tainted flows. We developed, Flows2Vec, a tool that transform the features recovered by FlowsMiner into vectors, which are then used to aid the process of machine learning by providing a flow’s aware model building process. The resulting model is capable of accurately classify applications as vulnerable if the vulnerability is exhibited by changes in the order of statements in source code. When compared to a base Bag of Words (BoW) approach, the experiments show that the proposed approach has improved the AUC of the prediction models for all algorithms and the best case for Corpus1 dataset is improved from 0.91 to 0.94 and for Corpus2 from 0.56 to 0.96.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Listing 1
Listing 2
Listing 3
Listing 4
Listing 5
Fig. 1
Listing 6
Listing 8
Listing 9
Listing 10
Listing 11
Listing 7
Listing 12
Listing 13
Listing 14
Listing 15
Listing 16
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9
Fig. 10
Fig. 11
Fig. 12
Fig. 13
Fig. 14
Fig. 15
Fig. 16
Fig. 17
Fig. 18

Similar content being viewed by others

References

  • Alon U, Zilberstein M, Levy O, Yahav E (2018) Code2vec: learning distributed representations of code. CoRR, arXiv:1803.09473

  • Andersen LO (1994) Program analysis and specialization for the C programming language. Ph.D. Dissertation. University of Cophenhagen

  • Arzt S, Rasthofer S, Fritz C, Bodden E, Bartel A, Klein J, Yves LT, Octeau D, McDaniel P (2014) FLOWDROID: precise context, flow, field, object-sensitive and lifecycle-aware taint analysis for Android apps. ACM SIGPLAN Not 49:259–269

    Article  Google Scholar 

  • Avdiienko V, Kuznetsov K, Gorla A, Zeller A, Arzt S, Rasthofer S, Bodden E (2015) Mining apps for abnormal usage of sensitive data. In: 37th IEEE/ACM international conference on software engineering, ICSE 2015, Florence, Italy, vol 1, pp 426–436

  • Boris C, Rakesh V (2018) Machine learning methods for software vulnerability detection, pp 31–39

  • Celik ZB, Babun L, Sikder AK, Aksu H, Tan G, McDaniel PD, Uluagac AS (2018) Sensitive information tracking in commodity IoT. In: 27th USENIX security symposium, USENIX security 2018, Baltimore, MD, USA, pp 1687–1704

  • Dam HK, Tran T, Pham TTM, Ng SW, Grundy J, Ghose A (2018) Automatic feature learning for predicting vulnerable software components. IEEE Trans Softw Eng 1–1

  • Dam HK, Pham T, Ng SW, Tran T, Grundy J, Ghose A, Kim T, Kim C (2019) Lessons learned from using a deep Tree-Based model for software defect prediction in practice. In: 2019 IEEE/ACM 16th international conference on mining software repositories (MSR), pp 46–57

  • Harer JA, Kim LY, Russell RL, Ozdemir O, Kosta LR, Rangamani A, Hamilton LH, Centeno GI, Key JR, Ellingwood PM, McConley MW, Opper JM, Chin SP, Lazovich T (2018) Automated software vulnerability detection with machine learning. CoRR, arXiv:1803.04497

  • Hassan J, Shoaib U (2020) Multi-class review rating classification using deep recurrent neural network. Neural Process Lett 51:1031–1048

    Article  Google Scholar 

  • Irfan MN, Oriat C, Groz R (2010) Angluin style finite state machine inference with non-optimal counterexamples. In: Proceedings of the first international workshop on model inference in testing, pp 11–19

  • Irfan M -N, Oriat C, Groz R (2013) Model inference and testing. Adv Comput 89:89–139

    Article  Google Scholar 

  • Kim H, Choi T, Jung S, Kim H, Lee O, Doh K (2008) Applying dataflow analysis to detecting software vulnerability. In: 2008 10th International conference on advanced communication technology, pp 255–258

  • López V, Fernández A, García S, Palade V, Herrera F (2013) An insight into classification with imbalanced data: empirical results and current trends on using data intrinsic characteristics. Inf Sci 250:113–141

    Article  Google Scholar 

  • Medeiros I, Neves NF, Correia M (2016) DEKANT: a static analysis tool that learns to detect web application vulnerabilitiess. In: Proceedings of the 25th international symposium on software testing and analysis, ISSTA 2016, Saarbrücken, Germany, pp 1–11

  • Mikolov T, Chen K, Corrado G, Dean J (2013a) Efficient estimation of word representations in vector space. In: 1st International conference on learning representations, ICLR 2013, Scottsdale, Arizona, USA, May 2–4, 2013, Workshop Track Proceedings

  • Naeem H, Alalfi MH (2020) Identifying vulnerable IoT applications using deep learning. In: 27th IEEE international conference on software analysis, evolution and reengineering, SANER 2020, London, ON, Canada, pp 582–586

  • Parveen S, Alalfi MH (2020) A mutation framework for evaluating security analysis tools in IoT applications. In: 27th IEEE international conference on software analysis, evolution and reengineering, SANER 2020, London, ON, Canada, pp 587–591

  • Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine Learning in Python. J Mach Learn Res 12:2825–2830

    MathSciNet  MATH  Google Scholar 

  • Sadeghi A, Bagheri H, Malek S (2015) Analysis of android Inter-App security vulnerabilities using COVERT. In: 2015 IEEE/ACM 37th IEEE international conference on software engineering, vol 2, pp 725–728

  • Scandariato R, Walden J, Hovsepyan A, Joosen W (2014) Predicting vulnerable software components via text mining. IEEE Trans Softw Eng 40:993–1006

    Article  Google Scholar 

  • Schmeidl F, Nazzal B, Alalfi MH (2019) Security analysis for SmartThings IoT applications. In: Proceedings of the 6th international conference on mobile software engineering and systems, MOBILESoft@ICSE, Montreal, QC, Canada, pp 25–29

  • Shar LK, Tan HBK (2012) Mining input sanitization patterns for predicting SQL injection and cross site scripting vulnerabilities. In: 34th International conference on software engineering, ICSE 2012, Zurich, Switzerland, pp 1293–1296

  • Shar LK, Tan HBK, Briand LC (2013) Mining SQL injection and cross site scripting vulnerabilities using hybrid program analysis. In: 35th International conference on software engineering, ICSE ’13, San Francisco, CA, USA, pp 642–651

  • Shoaib U, Ahmad N, Prinetto P, Tiotto G (2014) Integrating MultiWordNet with Italian Sign Language lexical resources. Expert Syst Appl 41:2300–2308

    Article  Google Scholar 

  • SmartThings Classic Developer Documentation (2019) https://buildmedia.readthedocs.org/media/pdf/smartthings/latest/smartthings.pdf

  • Sui Y, Cheng X, Zhang G, Wang H (2020) Flow2vec: value-flow-based precise code embedding. Proc ACM Program Lang 4(OOPSLA):233:1-233:27

    Article  Google Scholar 

  • Tai KS, Socher R, Manning CD (2015) Improved semantic representations from Tree-Structured long Short-Term memory networks. CoRR, arXiv:1503.00075

  • The Pandas Development Team (2020) Pandas-dev/pandas. Pandas, Zenodo

    Google Scholar 

  • Towards a definition of the Internet of Things (IoT) (2015) IEEE Internet Initiative and others

  • Walden J, Stuckman J, Scandariato R (2014) Predicting vulnerable components: software metrics vs text mining. In: 25th IEEE International symposium on software reliability engineering, ISSRE 2014, naples, Italy, pp 23–33

  • Wang S, Liu T, Tan L (2016) Automatically learning semantic features for defect prediction, pp 297–308

  • Zhao K, Zhang D, Su X, Li W (2015) Fest: a feature extraction and selection tool for Android malware detection. In: 2015 IEEE Symposium on computers and communication, ISCC 2015, Larnaca, Cyprus, pp 714–720

  • Zheng W, Gao J, Wu X, Xun Y, Liu G, Chen X (2020) An empirical study of high-impact factors for machine Learning-Based vulnerability detection. In: 2020 IEEE 2nd International workshop on intelligent bug fixing (IBF), pp 26–34

  • Zhu D, Jin H, Yang Y, Wu D, Chen W (2017) Deepflow: deep learning-based malware detection by mining Android application for abnormal usage of sensitive data. In: 2017 IEEE Symposium on computers and communications (ISCC), pp 438–443

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Manar H. Alalfi.

Additional information

Communicated by: Foutse Khomh, Gemma Catolino and Pasquale Salza

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Machine Learning Techniques for Software Quality Evaluation (MaLTeSQuE).

Dataset used in the experiments FlowsMiner Webfront

Appendix A: Visual Analysis of sinks in Corpus1 and Coprpus2 datasets

Appendix A: Visual Analysis of sinks in Corpus1 and Coprpus2 datasets

1.1 A.1 Figures for Multiple Sinks in Corpus 1 and 2

Fig. 19
figure 19

SendPush frequency in all apps for Corpus 1

Fig. 20
figure 20

Occurrence of SendPush w.r.t sendSMS in all apps for Corpus 1

Fig. 21
figure 21

Occurrence of SendPush w.r.t sendNotificationToContacts in all apps for Corpus 1

Fig. 22
figure 22

Occurrence of SendPush w.r.t httpGet in all apps for Corpus 1

Fig. 23
figure 23

sendSMS Frequency in all apps for Corpus 1

Fig. 24
figure 24

Occurrence of sendSMS w.r.t sendPushMessage in all apps for Corpus 1

Fig. 25
figure 25

sendNotificationToContacts Frequency in all apps for Corpus 1

Fig. 26
figure 26

httpGet Frequency in all apps for Corpus 1

Fig. 27
figure 27

httpPost Frequency in all apps for Corpus 1

Fig. 28
figure 28

httpPostJson Frequency in all apps for Corpus 1

Fig. 29
figure 29

sendNotification Frequency in all apps for Corpus 1

Fig. 30
figure 30

httpPost Frequency in all apps for Corpus 2

Fig. 31
figure 31

sendSMS Frequency in all apps for Corpus 2

Fig. 32
figure 32

sendPush Frequency in all apps for Corpus 2

Fig. 33
figure 33

httpPostJson Frequency in all apps for Corpus 2

Fig. 34
figure 34

Occurrence of httpPost w.r.t sendSMS in all apps for Corpus 2

Fig. 35
figure 35

Occurrence of httpPost w.r.t sendPush in all apps for Corpus 2

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Naeem, H., Alalfi, M.H. Predicting sensitive information leakage in IoT applications using flows-aware machine learning approach. Empir Software Eng 27, 137 (2022). https://doi.org/10.1007/s10664-022-10157-y

Download citation

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s10664-022-10157-y

Keywords

Navigation