Abstract
The amount of malware targeting Android terminals is growing. Like other Android applications, malware is distributed to Android terminals in the form of Android Packages (APKs), so analyzing APKs can help identify it. In this chapter, we describe how machine learning techniques can be used to identify Android malware. We begin by looking at the structure of an APK file and introduce techniques for identifying malware. We then describe how data can be collected, analyzed, and used to prepare a dataset, drawing not only on permission requests and API calls but also on application clusters and descriptions. To demonstrate the effectiveness of machine learning techniques for analyzing Android applications, we analyze the performance of support vector machine classification on our dataset and compare it with that of a scheme that does not use machine learning. We also evaluate the effectiveness of the features used and further improve the classification performance by removing irrelevant features. Finally, we address several issues and limitations of using machine learning techniques to analyze Android applications.
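As a rough illustration of the dataset preparation the abstract mentions, the sketch below encodes the permission requests and API calls of each application as a binary feature vector that a classifier such as a linear support vector machine could consume. The feature names, the `permission:`/`api:` prefixes, and the helper functions are hypothetical and chosen only for this example; they are not taken from the chapter.

```python
def build_vocabulary(apps):
    """Map every distinct feature seen in the dataset to a column index."""
    features = sorted({feature for app in apps for feature in app})
    return {feature: index for index, feature in enumerate(features)}


def to_vector(app_features, vocabulary):
    """Return a 0/1 vector: 1 where the application exhibits the feature."""
    vector = [0] * len(vocabulary)
    for feature in app_features:
        index = vocabulary.get(feature)
        if index is not None:  # features unseen during training are ignored
            vector[index] = 1
    return vector


# Hypothetical feature sets extracted from two applications.
apps = [
    {"permission:SEND_SMS", "permission:INTERNET", "api:sendTextMessage"},
    {"permission:INTERNET", "api:openConnection"},
]

vocabulary = build_vocabulary(apps)
vectors = [to_vector(app, vocabulary) for app in apps]
```

Each row of `vectors` has one column per feature in the vocabulary, so applications with different numbers of permissions or API calls still yield vectors of equal length.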
Notes
- 2. A PIN is a numeric or alphanumeric password or code used for user authentication.
- 11. A stem is the part of a word that is common to all its inflected forms.
- 13. We used MALLET for running LDA and considered 300 topics because the MALLET documentation states that “the number of topics should depend to some degree on the size of the collection, but 200–400 will produce reasonably fine-grained results.”
- 14. We used the “kmeans” function of a Ruby gem [24].
- 15. Opera Mobile Store was rebranded and is now called Bemobi Mobile Store [25].
- 18. A margin is the distance from the hyperplane to the nearest training data point of either class.
- 19. The shapes of the curves were similar for all the runs.
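The definition of a margin in note 18 can be made concrete with a short sketch. For a hyperplane w·x + b = 0, the distance of a point x to the hyperplane is |w·x + b| / ‖w‖, and the margin is the smallest such distance over the training points. The weights, bias, and points below are made-up values for illustration only, not results from the chapter.

```python
import math


def distance_to_hyperplane(w, b, x):
    """Distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w, b, points):
    """Smallest distance from the hyperplane to any training point."""
    return min(distance_to_hyperplane(w, b, x) for x in points)


# Hypothetical hyperplane and training points (both classes together).
w = [3.0, 4.0]
b = -5.0
points = [[3.0, 4.0], [0.0, 0.0], [1.0, 3.0]]

print(margin(w, b, points))
```

A support vector machine chooses w and b so that this quantity is as large as possible while the two classes stay on opposite sides of the hyperplane.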
References
Van der Meulen R, Forni AA (2017) Gartner says demand for 4G smartphones in emerging markets spurred growth in second quarter of 2017. https://www.gartner.com/newsroom/id/3788963
Sophos (2017) SophosLabs 2018 malware forecast. https://www.sophos.com/en-us/en-us/medialibrary/PDFs/technical-papers/malware-forecast-2018.pdf
Lipovsky R (2014) ESET analyzes Simplocker—first Android file-encrypting, TOR-enabled ransomware. https://www.welivesecurity.com/2014/06/04/simplocker/
Stefanko L (2015) Aggressive Android ransomware spreading in the USA. https://www.welivesecurity.com/2015/09/10/aggressive-android-ransomware-spreading-in-the-usa/
Schölkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge
Vapnik VN (1998) Statistical learning theory. Wiley, Hoboken
Takahashi T, Ban T, Tien CW, Lin CH, Inoue D, Nakao K (2016) The usability of metadata for Android application analysis. In: Hirose A, Ozawa S, Doya K, Ikeda K, Lee M, Liu D (eds) Proceedings of the 23rd International Conference on Neural Information Processing. Springer, Cham, pp 546–554. https://doi.org/10.1007/978-3-319-46687-3_60
Ban T, Takahashi T, Guo S, Inoue D, Nakao K (2016) Integration of multimodal features for Android malware detection based on linear SVM. In: Proceedings of the 11th Asia Joint Conference on Information Security, IEEE, pp 141–146. https://doi.org/10.1109/AsiaJCIS.2016.29
Moonsamy V, Rong J, Liu S (2014) Mining permission patterns for contrasting clean and malicious Android applications. Future Gener Comp Syst 36:122–132. https://doi.org/10.1016/j.future.2013.09.014
Wang Y, Zheng J, Sun C, Mukkamala S (2013) Quantitative security risk assessment of Android permissions and applications. In: Wang L, Shafiq B (eds) Data and applications security and privacy XXVII. Springer, Heidelberg, pp 226–241. https://doi.org/10.1007/978-3-642-39256-6_15
Sarma BP, Li N, Gates C, Potharaju R, Nita-Rotaru C, Molloy I (2012) Android permissions: a perspective combining risks and benefits. In: Atluri V, Vaidya J (eds) Proceedings of the 17th ACM Symposium on Access Control Models and Technologies. ACM, New York, pp 13–22. https://doi.org/10.1145/2295136.2295141
Demiroz A (2018) Google play crawler JAVA API. https://github.com/Akdeniz/google-play-crawler
Enck W, Gilbert P, Han S, Tendulkar V, Chun BG, Cox LP, Jung J, McDaniel P, Sheth AN (2014) TaintDroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM T Comput Syst 32(2), Article 5. https://doi.org/10.1145/2619091
Octeau D, McDaniel P, Jha S, Bartel A, Bodden E, Klein J, Le Traon Y (2013) Effective inter-component communication mapping in Android with Epicc: an essential step towards holistic security analysis. In: Proceedings of the 22nd USENIX Conference on Security. USENIX Association, Berkeley, CA, USA, pp 543–558. https://www.usenix.org/system/files/conference/usenixsecurity13/sec13-paper_octeau.pdf
Android Developers (2018) UI/Application exerciser monkey. https://developer.android.com/studio/test/monkey
Li Y, Yang Z, Guo Y, Chen X (2017) DroidBot: a lightweight UI-guided test input generator for Android. In: Uchitel S, Orso A, Robillard M (eds) Proceedings of the 39th International Conference on Software Engineering Companion. IEEE Computer Society, Los Alamitos, CA, USA, pp 23–26. https://doi.org/10.1109/ICSE-C.2017.8
Harris ZS (1954) Distributional structure. WORD 10(2–3):146–162. https://doi.org/10.1080/00437956.1954.11659520
MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Le Cam LM, Neyman J (eds) Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol 1. University of California Press, Berkeley, pp 281–297. https://projecteuclid.org/euclid.bsmsp/1200512992
Gorla A, Tavecchia I, Gross F, Zeller A (2014) Checking app behavior against app descriptions. In: Jalote P, Briand L, van der Hoek A (eds) Proceedings of the 36th International Conference on Software Engineering. ACM, New York, pp 1025–1035. https://doi.org/10.1145/2568225.2568276
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Shuyo N (2010) Language detection library for Java. https://github.com/shuyo/language-detection
Pereda R (2011) Stemmify 0.0.2. https://rubygems.org/gems/stemmify
McCallum AK (2002) MALLET: a machine learning for language toolkit. http://mallet.cs.umass.edu
RubyGems.org (2013) kmeans 0.1.1. https://rubygems.org/gems/kmeans/
Apps and Games AS (2018) Bemobi mobile store. http://apps.bemobi.com
Cover TM (1965) Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans Electron Comput EC-14(3):326–334. https://doi.org/10.1109/PGEC.1965.264137
Lin CJ, Weng RC, Keerthi SS (2007) Trust region Newton methods for large-scale logistic regression. In: Ghahramani Z (ed) Proceedings of the 24th International Conference on Machine Learning. ACM, New York, pp 561–568. https://doi.org/10.1145/1273496.1273567
Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Peiravian N, Zhu X (2013) Machine learning for Android malware detection using permission and API calls. In: Bourbakis N, Brodsky A (eds) Proceedings of the 25th International Conference on Tools with Artificial Intelligence. IEEE Computer Society, Los Alamitos, CA, USA, pp 300–305. https://doi.org/10.1109/ICTAI.2013.53
Aafer Y, Du W, Yin H (2013) DroidAPIMiner: mining API-level features for robust malware detection in Android. In: Zia T, Zomaya A, Varadharajan V, Mao M (eds) Security and privacy in communication networks. Springer, Cham, pp 86–103. https://doi.org/10.1007/978-3-319-04283-1_6
Li W, Ge J, Dai G (2015) Detecting malware for Android platform: an SVM-based approach. In: Qiu M, Zhang T, Das S (eds) Proceedings of the 2nd International Conference on Cyber Security and Cloud Computing. IEEE Computer Society, Los Alamitos, CA, USA, pp 464–469. https://doi.org/10.1109/CSCloud.2015.50
Guyon I, Weston J, Barnhill S, Vapnik V (2002) Gene selection for cancer classification using support vector machines. Mach Learn 46(1–3):389–422. https://doi.org/10.1023/A:1012487302797
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Takahashi, T., Ban, T. (2019). Android Application Analysis Using Machine Learning Techniques. In: Sikos, L. (eds) AI in Cybersecurity. Intelligent Systems Reference Library, vol 151. Springer, Cham. https://doi.org/10.1007/978-3-319-98842-9_7
DOI: https://doi.org/10.1007/978-3-319-98842-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-98841-2
Online ISBN: 978-3-319-98842-9
eBook Packages: Intelligent Technologies and Robotics (R0)