Real-time Korean voice phishing detection based on machine learning approaches

  • Original Research
Journal of Ambient Intelligence and Humanized Computing

Abstract

Voice phishing, or vishing, is a fraudulent phone call in which an attacker lures recipients into providing their personal information. Damage from vishing is a serious and growing problem worldwide. This study therefore aims to detect vishing in real time. Because research on spam detection in low-resource languages is lacking, we detect vishing in the Korean language using basic machine-learning models. We collected data from actual vishing damage cases and converted the voice files into text so that spam could be detected using natural language processing techniques. The focus is on determining whether vishing can be detected rapidly, rather than on model development. The results suggest that machine learning models can detect vishing in real time and require only a short training time.
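The pipeline described in the abstract (transcribed call text, a basic classifier, and a check on training and prediction time) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy English transcripts, the whitespace tokenization, and the hand-written logistic regression are all assumptions, chosen so that the example runs with the Python standard library alone. A real system would operate on morpheme-analyzed Korean text and would likely use a library classifier.

```python
import math
import time
from collections import Counter

# Hypothetical toy transcripts (English stand-ins for transcribed Korean calls).
# Label 1 = vishing, label 0 = ordinary conversation.
TRAIN = [
    ("your bank account is under investigation send the loan fee", 1),
    ("the prosecutor says your account was used for illegality", 1),
    ("transfer money now to protect your bank account", 1),
    ("shall we watch a movie with people from work", 0),
    ("call me when you get home for dinner", 0),
    ("the movie was fun tell me about your day", 0),
]

def featurize(text):
    """Bag-of-words counts over whitespace tokens."""
    return Counter(text.split())

def predict_proba(weights, bias, feats):
    """Probability that a transcript is vishing under a logistic model."""
    z = bias + sum(weights.get(tok, 0.0) * n for tok, n in feats.items())
    z = max(-30.0, min(30.0, z))  # clamp to keep exp() well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=300, lr=0.1):
    """Stochastic gradient descent on the logistic loss."""
    weights, bias = {}, 0.0
    data = [(featurize(text), label) for text, label in samples]
    for _ in range(epochs):
        for feats, label in data:
            err = predict_proba(weights, bias, feats) - label
            bias -= lr * err
            for tok, n in feats.items():
                weights[tok] = weights.get(tok, 0.0) - lr * err * n
    return weights, bias

# Measure training time.
start = time.perf_counter()
weights, bias = train(TRAIN)
train_seconds = time.perf_counter() - start

# Measure the latency of scoring one unseen transcript.
start = time.perf_counter()
p_spam = predict_proba(
    weights, bias, featurize("bank account under investigation send loan fee")
)
predict_seconds = time.perf_counter() - start

p_ham = predict_proba(weights, bias, featurize("call me about dinner with people"))

print(f"training time:   {train_seconds:.4f} s")
print(f"prediction time: {predict_seconds:.6f} s")
print(f"P(vishing) for suspicious call: {p_spam:.3f}")
print(f"P(vishing) for ordinary call:   {p_ham:.3f}")
```

Timing training and single-transcript prediction separately mirrors the two claims in the abstract: that training is short and that detection is fast enough to run in real time.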

Data availability Statement

The datasets collected for the current study are available at https://anonymous.4open.science/r/vishing-1AF8. Additional information used in this study can be obtained from the corresponding author upon reasonable request.

Notes

  1. https://www.fss.or.kr/fss/vstop/avoid/this_voice_l.jsp.

  2. https://corpus.korean.go.kr/.

  3. https://www.anonymous.4open.science/r/vishing-1AF8.

  4. https://cloud.google.com/speech-to-text?hl=ko.

  5. https://www.ncloud.com/product/aiService/clovaSpeech.

  6. https://konlpy.org/en/latest/.

  7. https://github.com/kakao/khaiii.

  8. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html.

Acknowledgements

This work was supported by an Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean government (MSIT) (No. IITP-2021-0-00358, AI big data-based cyber security orchestration and automated response technology development). This research was also supported by a National Research Foundation (NRF) of Korea grant funded by the Korean government (MSIT) (No. 2021R1A4A3022102).

Author information

Contributions

ML and EP designed the study. ML collected and analyzed the data. EP presented the results. ML and EP wrote and revised the manuscript. All authors reviewed the manuscript.

Corresponding author

Correspondence to Eunil Park.

Ethics declarations

Conflict of interest

The authors have no conflicts or competing interests to declare.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A. Data analysis

Table 6 lists the 100 most frequently used words in the spam and nonspam text content, respectively. The nonspam cases mostly included everyday words, such as us, movies, people, and me, whereas the spam cases included words such as loan, investigation, bank accounts, bank, illegality, and victims (given in bold). Therefore, understanding the meaning of these words is important in vishing detection.

Table 6 Data analysis
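A word-frequency analysis of this kind can be sketched in a few lines. The corpus below is a hypothetical stand-in, and whitespace splitting is an assumption made for brevity; Korean text would first need a morphological analyzer to separate words from particles.

```python
from collections import Counter

# Hypothetical miniature corpus: (label, transcript) pairs.
DOCS = [
    ("spam", "bank loan investigation bank"),
    ("spam", "bank account victims loan"),
    ("nonspam", "movies people me movies"),
    ("nonspam", "us movies people"),
]

def top_words(docs, label, n=100):
    """Return the n most frequent words among documents with the given label."""
    counts = Counter()
    for doc_label, text in docs:
        if doc_label == label:
            counts.update(text.split())
    return counts.most_common(n)

spam_top = top_words(DOCS, "spam", n=2)
nonspam_top = top_words(DOCS, "nonspam", n=2)
print("spam:   ", spam_top)
print("nonspam:", nonspam_top)
```

Counting each class separately is what makes the contrast visible: words that rank highly in only one class are the ones most useful to a classifier.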

Appendix B. Speech-to-text tool examples

We converted the collected .mp3 files of voice phishing speech into text. Table 7 compares the actual voice scripts with the output of the Google speech-to-text API and the Naver Clova Speech speech-to-text conversion tool.

Table 7 Comparison of speech-to-text tools
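One common way to quantify such a comparison is the character error rate (CER) of each tool's output against the reference script. The source does not state which measure, if any, was used, so the sketch below is only an illustration: the metric choice, the tool names `tool_a` and `tool_b`, and the example sentences are all assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def character_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference script."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)

# Hypothetical reference script and tool outputs (not taken from Table 7).
reference = "검찰 수사관입니다"
outputs = {
    "tool_a": "검찰 수사관입니다",
    "tool_b": "검찰 수사관 입니다",
}

scores = {name: character_error_rate(reference, text)
          for name, text in outputs.items()}
for name, score in scores.items():
    print(f"{name}: CER = {score:.3f}")
```

Character-level scoring is used here rather than word-level scoring because Korean spacing varies between transcribers, which would otherwise inflate word error counts.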

About this article

Cite this article

Lee, M., Park, E. Real-time Korean voice phishing detection based on machine learning approaches. J Ambient Intell Human Comput 14, 8173–8184 (2023). https://doi.org/10.1007/s12652-021-03587-x

  • DOI: https://doi.org/10.1007/s12652-021-03587-x
