Android Application Analysis Using Machine Learning Techniques

AI in Cybersecurity

Part of the book series: Intelligent Systems Reference Library (ISRL, volume 151)

Abstract

The amount of malware targeting Android terminals is growing. Like other Android applications, malware is distributed to Android terminals in the form of Android Packages (APKs), so analyzing APKs may help identify malware. In this chapter, we describe how machine learning techniques can be used to identify Android malware. We begin by looking at the structure of an APK file and introduce techniques for identifying malware. We then describe how data can be collected, analyzed, and used to prepare a dataset, drawing not only on permission requests and API calls but also on application clusters and descriptions. To demonstrate the effectiveness of machine learning techniques for analyzing Android applications, we analyze the performance of support vector machine classification on our dataset and compare it to that of a scheme that does not use machine learning. We also evaluate the effectiveness of the features used and further improve the classification performance by removing irrelevant features. Finally, we address several issues and limitations of using machine learning techniques to analyze Android applications.
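
As a rough illustration of the kind of classification the abstract describes, the following sketch trains a linear support vector machine on binary permission features. It is a toy example under stated assumptions, not the authors' pipeline: the permission sets and labels are invented, the helper names are hypothetical, and the trainer is a plain full-batch subgradient descent on the regularized hinge loss rather than any particular SVM library.

```python
"""Toy linear SVM over binary permission features (illustrative data only)."""

# Hypothetical samples: (requested permissions, label); +1 = malware, -1 = benign.
SAMPLES = [
    ({"SEND_SMS", "READ_CONTACTS", "INTERNET"}, 1),
    ({"SEND_SMS", "RECEIVE_BOOT_COMPLETED"}, 1),
    ({"READ_CONTACTS", "SEND_SMS"}, 1),
    ({"INTERNET"}, -1),
    ({"INTERNET", "CAMERA"}, -1),
    ({"CAMERA", "ACCESS_FINE_LOCATION"}, -1),
]


def build_vocabulary(samples):
    """Collect every permission seen in the samples, in a stable order."""
    return sorted({perm for perms, _ in samples for perm in perms})


def to_vector(perms, vocab):
    """Encode a permission set as a 0/1 vector, one entry per known permission."""
    return [1.0 if perm in perms else 0.0 for perm in vocab]


def score(x, w, b):
    """Signed decision value w.x + b of the linear classifier."""
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Minimize lam/2 * ||w||^2 + mean hinge loss by subgradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regularizer
        grad_b = 0.0
        for xi, yi in zip(X, y):
            # Only samples inside the margin contribute a hinge subgradient.
            if yi * score(xi, w, b) < 1.0:
                for j, xij in enumerate(xi):
                    grad_w[j] -= yi * xij / n
                grad_b -= yi / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(perms, vocab, w, b):
    """Return +1 (malware) or -1 (benign) for a set of requested permissions."""
    return 1 if score(to_vector(perms, vocab), w, b) >= 0 else -1


vocab = build_vocabulary(SAMPLES)
X = [to_vector(perms, vocab) for perms, _ in SAMPLES]
y = [label for _, label in SAMPLES]
w, b = train_linear_svm(X, y)

for perms, label in SAMPLES:
    print(sorted(perms), "label:", label, "predicted:", predict(perms, vocab, w, b))
```

Because the model is linear, each learned weight can be read as the contribution of one permission to the decision, which is also what makes ranking and removing low-weight features straightforward in this kind of setup.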

Notes

  1. https://www.android.com

  2. A PIN is a numeric or alpha-numeric password or code used for user authentication.

  3. https://developer.android.com/things/

  4. This article is an extended version of work published in [7, 8].

  5. https://ibotpeaches.github.io/Apktool/

  6. https://developer.android.com/studio/

  7. https://source.android.com/devices/tech/dalvik/dex-format

  8. https://github.com/JesusFreke/smali

  9. http://code.google.com/p/dex2jar/downloads/list

  10. https://play.google.com

  11. A stem is a part of a word and is common to all its inflected forms.

  12. We used the language detection library [21] to detect the language, stemmify [22] for the stemming operation, and the stoplist of MALLET [23] as the list of stop words.

  13. We used MALLET for running LDA and considered 300 topics because the MALLET documentation states that “the number of topics should depend to some degree on the size of the collection, but 200–400 will produce reasonably fine-grained results.”

  14. We used the “kmeans” function of Ruby gem [24].

  15. Opera Mobile Store was rebranded and is now called Bemobi Mobile Store [25].

  16. https://www.virustotal.com

  17. http://mobilesec.nict.go.jp

  18. A margin is the distance from the hyperplane to the nearest training data point of either class.

  19. The shapes of the curves were similar for all the runs.
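
The verbal definition of the margin in note 18 can be written in symbols. The notation here is an assumption, since the notes themselves introduce none: the separating hyperplane is taken to be the set of points x with w^T x + b = 0, and x_1, ..., x_n are the training points.

```latex
\gamma \;=\; \min_{1 \le i \le n}
  \frac{\lvert \mathbf{w}^{\top}\mathbf{x}_i + b \rvert}{\lVert \mathbf{w} \rVert}
```

A support vector machine chooses w and b so that this quantity is as large as possible while the two classes stay on opposite sides of the hyperplane.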

References

  1. Van der Meulen R, Forni AA (2017) Gartner says demand for 4G smartphones in emerging markets spurred growth in second quarter of 2017. https://www.gartner.com/newsroom/id/3788963

  2. Sophos (2017) SophosLabs 2018 malware forecast. https://www.sophos.com/en-us/en-us/medialibrary/PDFs/technical-papers/malware-forecast-2018.pdf

  3. Lipovsky R (2014) ESET analyzes Simplocker—first Android file-encrypting, TOR-enabled ransomware. https://www.welivesecurity.com/2014/06/04/simplocker/

  4. Stefanko L (2015) Aggressive Android ransomware spreading in the USA. https://www.welivesecurity.com/2015/09/10/aggressive-android-ransomware-spreading-in-the-usa/

  5. Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  6. Vapnik VN (1998) Statistical learning theory. Wiley, Hoboken

  7. Takahashi T, Ban T, Tien CW, Lin CH, Inoue D, Nakao K (2016) The usability of metadata for Android application analysis. In: Hirose A, Ozawa S, Doya K, Ikeda K, Lee M, Liu D (eds) Proceedings of the 23rd International Conference on Neural Information Processing. Springer, Cham, pp 546–554. https://doi.org/10.1007/978-3-319-46687-3_60

  8. Ban T, Takahashi T, Guo S, Inoue D, Nakao K (2016) Integration of multimodal features for Android malware detection based on linear SVM. In: Proceedings of the 11th Asia Joint Conference on Information Security, IEEE, pp 141–146. https://doi.org/10.1109/AsiaJCIS.2016.29

  9. Moonsamy V, Rong J, Liu S (2014) Mining permission patterns for contrasting clean and malicious Android applications. Future Gener Comp Syst 36:122–132. https://doi.org/10.1016/j.future.2013.09.014

  10. Wang Y, Zheng J, Sun C, Mukkamala S (2013) Quantitative security risk assessment of Android permissions and applications. In: Wang L, Shafiq B (eds) Data and applications security and privacy XXVII. Springer, Heidelberg, pp 226–241. https://doi.org/10.1007/978-3-642-39256-6_15

  11. Sarma BP, Li N, Gates C, Potharaju R, Nita-Rotaru C, Molloy I (2012) Android permissions: a perspective combining risks and benefits. In: Atluri V, Vaidya J (eds) Proceedings of the 17th ACM Symposium on Access Control Models and Technologies. ACM, New York, pp 13–22. https://doi.org/10.1145/2295136.2295141

  12. Demiroz A (2018) Google play crawler JAVA API. https://github.com/Akdeniz/google-play-crawler

  13. Enck W, Gilbert P, Han S, Tendulkar V, Chun BG, Cox LP, Jung J, McDaniel P, Sheth AN (2014) TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM T Comput Syst 32(2), Article 5. https://doi.org/10.1145/2619091

  14. Octeau D, McDaniel P, Jha S, Bartel A, Bodden E, Klein J, Le Traon Y (2013) Effective inter-component communication mapping in Android with Epicc: an essential step towards holistic security analysis. In: Proceedings of the 22nd USENIX Conference on Security. USENIX Association, Berkeley, CA, USA, pp 543–558. https://www.usenix.org/system/files/conference/usenixsecurity13/sec13-paper_octeau.pdf

  15. Android Developers (2018) UI/Application exerciser monkey. https://developer.android.com/studio/test/monkey

  16. Li Y, Yang Z, Guo Y, Chen X (2017) DroidBot: a lightweight UI-guided test input generator for Android. In: Uchitel S, Orso A, Robillard M (eds) Proceedings of the 39th International Conference on Software Engineering Companion. IEEE Computer Society, Los Alamitos, CA, USA, pp 23–26. https://doi.org/10.1109/ICSE-C.2017.8

  17. Harris ZS (1954) Distributional structure. WORD 10(2–3):146–162. https://doi.org/10.1080/00437956.1954.11659520

  18. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297. https://projecteuclid.org/euclid.bsmsp/1200512992

  19. Gorla A, Tavecchia I, Gross F, Zeller A (2014) Checking app behavior against app descriptions. In: Jalote P, Briand L, van der Hoek A (eds) Proceedings of the 36th International Conference on Software Engineering. ACM, New York, pp 1025–1035. https://doi.org/10.1145/2568225.2568276

  20. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  21. Shuyo N (2010) Language detection library for Java. https://github.com/shuyo/language-detection

  22. Pereda R (2011) Stemmify 0.0.2. https://rubygems.org/gems/stemmify

  23. McCallum AK (2002) MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu

  24. RubyGems.org (2013) kmeans 0.1.1. https://rubygems.org/gems/kmeans/

  25. Apps and Games AS (2018) Bemobi mobile store. http://apps.bemobi.com

  26. Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput EC-14(3):326–334. https://doi.org/10.1109/PGEC.1965.264137

  27. Lin CJ, Weng RC, Keerthi SS (2007) Trust region Newton methods for large-scale logistic regression. In: Ghahramani Z (ed) Proceedings of the 24th International Conference on Machine Learning. ACM, New York, pp 561–568. https://doi.org/10.1145/1273496.1273567

  28. Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874

  29. Peiravian N, Zhu X (2013) Machine learning for Android malware detection using permission and API calls. In: Bourbakis N, Brodsky A (eds) Proceedings of the 25th International Conference on Tools with Artificial Intelligence. IEEE Computer Society, Los Alamitos, CA, USA, pp 300–305. https://doi.org/10.1109/ICTAI.2013.53

  30. Aafer Y, Du W, Yin H (2013) DroidAPIMiner: mining API-level features for robust malware detection in Android. In: Zia T, Zomaya A, Varadharajan V, Mao M (eds) Security and privacy in communication networks. Springer, Cham, pp 86–103. https://doi.org/10.1007/978-3-319-04283-1_6

  31. Li W, Ge J, Dai G (2015) Detecting malware for Android platform: an SVM-based approach. In: Qiu M, Zhang T, Das S (eds) Proceedings of the 2nd International Conference on Cyber Security and Cloud Computing. IEEE Computer Society, Los Alamitos, CA, USA, pp 464–469. https://doi.org/10.1109/CSCloud.2015.50

  32. Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422. https://doi.org/10.1023/A:1012487302797

Corresponding author

Correspondence to Takeshi Takahashi.

Copyright information

© 2019 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Takahashi, T., Ban, T. (2019). Android Application Analysis Using Machine Learning Techniques. In: Sikos, L. (ed) AI in Cybersecurity. Intelligent Systems Reference Library, vol 151. Springer, Cham. https://doi.org/10.1007/978-3-319-98842-9_7
