Knowledge and Information Systems, Volume 26, Issue 2, pp 285–307

Learning to detect spyware using end user license agreements

  • Niklas Lavesson
  • Martin Boldt
  • Paul Davidsson
  • Andreas Jacobsson
Regular Paper


The amount of software that hosts spyware has increased dramatically. To avoid legal repercussions, vendors need to inform users about the inclusion of spyware via end user license agreements (EULAs) during the installation of an application. However, this information is intentionally written in a way that is hard for users to comprehend. We investigate how to automatically discriminate between legitimate software and spyware-associated software by mining EULAs. For this purpose, we compile a data set consisting of 996 EULAs, of which 9.6% are associated with spyware. We compare the performance of 17 learning algorithms with that of a baseline algorithm on two data sets, one based on a bag-of-words model and one on a metadata model. The majority of learning algorithms significantly outperform the baseline regardless of which data representation is used. However, a non-parametric test indicates that the bag-of-words model is more suitable than the metadata model. Our conclusion is that automatic EULA classification can be applied to assist users in making informed decisions about whether to install an application without having read the EULA. We therefore outline the design of a spyware prevention tool and suggest how to select suitable learning algorithms for the tool by using a multi-criteria evaluation approach.
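As a rough illustration of the bag-of-words idea described above, the following minimal sketch trains a multinomial naive Bayes classifier on word counts and contrasts it with a majority-class baseline. It is not the authors' implementation: the choice of naive Bayes, the majority-class baseline, the names `tokenize`, `BagOfWordsNB`, and `majority_baseline`, and the five toy EULA snippets are all assumptions made for illustration. The abstract does not say which 17 algorithms or which baseline were used.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase a document and split it into alphabetic word tokens."""
    cleaned = "".join(ch if ch.isalpha() else " " for ch in text.lower())
    return cleaned.split()


class BagOfWordsNB:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        # The vocabulary is every word seen in any training document.
        self.vocab = set().union(*self.word_counts.values())
        return self

    def predict(self, doc):
        # Words never seen during training carry no evidence and are skipped.
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        total_docs = sum(self.doc_counts.values())
        best_class, best_score = None, -math.inf
        for c in self.classes:
            total_words = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / total_docs)  # class prior
            for t in tokens:
                score += math.log(
                    (self.word_counts[c][t] + 1) / (total_words + len(self.vocab))
                )
            if score > best_score:
                best_class, best_score = c, score
        return best_class


def majority_baseline(labels):
    """Baseline that always predicts the most frequent training label."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical toy snippets, invented for this sketch (not from the paper's data set).
train = [
    ("This software may display advertisements and collect browsing "
     "information for third parties", "spyware"),
    ("We may install additional components that track usage and show "
     "pop-up advertisements", "spyware"),
    ("You may install and use one copy of the software on a single computer",
     "legitimate"),
    ("The software is licensed not sold and is provided without warranty",
     "legitimate"),
    ("This license permits you to use the software for personal purposes",
     "legitimate"),
]
docs, labels = zip(*train)
model = BagOfWordsNB().fit(docs, labels)

suspicious = "The program will collect information and display advertisements"
benign = "You may use one copy of the software on a single computer"
print(model.predict(suspicious), model.predict(benign), majority_baseline(labels))
```

The baseline shows why accuracy alone can mislead on this kind of data: with 9.6% of the 996 EULAs associated with spyware, always predicting the majority class would be correct for about 90.4% of the documents while flagging no spyware at all.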


Keywords

  • End user license agreement
  • Document classification
  • Spyware





Copyright information

© Springer-Verlag London Limited 2009

Authors and Affiliations

  • Niklas Lavesson
    • 1
  • Martin Boldt
    • 1
  • Paul Davidsson
    • 1
    • 2
  • Andreas Jacobsson
    • 2
  1. School of Computing, Blekinge Institute of Technology, Ronneby, Sweden
  2. School of Technology, Malmö University, Malmö, Sweden
