Skip to main content
Log in

Similarity assessment for removal of noisy end user license agreements

  • Regular Paper
  • Published:
Knowledge and Information Systems Aims and scope Submit manuscript

Abstract

In previous work, we have shown the possibility to automatically discriminate between legitimate software and spyware-associated software by performing supervised learning of end user license agreements (EULAs). However, the amount of false positives (spyware classified as legitimate software) was too large for practical use. In this study, the false positives problem is addressed by removing noisy EULAs, which are identified by performing similarity analysis of the previously studied EULAs. Two candidate similarity analysis methods for this purpose are experimentally compared: cosine similarity assessment in conjunction with latent semantic analysis (LSA) and normalized compression distance (NCD). The results show that the number of false positives can be reduced significantly by removing noise identified by either method. However, the experimental results also indicate subtle performance differences between LSA and NCD. To improve the performance even further and to decrease the large number of attributes, the categorical proportional difference (CPD) feature selection algorithm was applied. CPD managed to greatly reduce the number of attributes while at the same time increase classification performance on the original data set, as well as on the LSA- and NCD-based data sets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Similar content being viewed by others

References

  1. Abe N, Kudo M (2006) Non-parametric classifier-independent feature selection. Pattern Recogn 39: 737–746

    Article  MATH  Google Scholar 

  2. Axelsson S (2000) The base-rate fallacy and the difficulty of intrusion detection. ACM Trans Inf Syst Sec 3(3): 186–205

    Article  MathSciNet  Google Scholar 

  3. Axelsson S, Baca D, Feldt R, Sidlauskas D, Kacan D (2009) Detecting defects with an interactive code review tool based on visualisation and machine learning. In: 21st international conference on software engineering and knowledge engineering, Boston, USA

  4. Berry MW, Dumais ST, O’Brien GW (1995) Using linear algebra for intelligent information retrieval. SIAM Rev 37(4): 573–595

    Article  MathSciNet  MATH  Google Scholar 

  5. Boldt M, Carlsson B, Jacobsson A (2004) Exploring spyware effects. In: Eight nordic workshop on secure IT systems, pp 23–30

  6. Cebrian M, Alfonseca M, Ortega A (2007) The normalized compression distance is resistant to noise. IEEE Trans Inf Theory 53(5): 1895–1900

    Article  MathSciNet  Google Scholar 

  7. Cebrian M, Alfonseca M, Ortega A (2005) Common pitfalls using normalized compression distance: what to watch out for in a compressor. Commun Inf Syst 5(4): 367–400

    MathSciNet  MATH  Google Scholar 

  8. Cilibrasi R (2007) Statistical inference through data compression. PhD thesis, Institute for Logic, Language and Computation Universiteit van Amsterdam, Plantage Muidergracht 24, 1018 TV Amsterdam. http://www.illc.uva.nl/

  9. Deerwester S, Dumais S, Furnas G, Landauer T, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6): 391–407

    Article  Google Scholar 

  10. Delany SJ (2009) The Good, the bad and the incorrectly classified: profiling cases for case-base editing. In: 8th international conference on case-based reasoning, pp 135–149

  11. Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7: 1–30

    MathSciNet  MATH  Google Scholar 

  12. Dong Z (2002) Towards web information clustering. PhD thesis, Southeast University, Nanjing, China

  13. Edsberg O, Nytro O, Rost TB (2007) Novelty detection in patient histories: experiments with measures based on text compression. In: Berthold MR, Shawe-Taylor J, Lavrac N (eds) Advances in intelligent data analysis VII. Springer, New York, pp 367–378

  14. Feldman R, Sanger J (2007) The text mining handbook. Cambridge University Press, Cambridge

    Google Scholar 

  15. Ferragina P, Giancarlo R, Greco V, Manzini G, Valiente G (2007) Compression-based classification of biological sequences and structures via the universal similarity metric: experimental assessment. BMC Bioinf 8(1)

  16. Friedman M (1940) A comparison of alternative tests of significance for the problem of m rankings. Ann Math Stat 11: 86–92

    Article  MATH  Google Scholar 

  17. Gansterer WN, Janecek AGK, Neumayer R (2007) Spam filtering based on latent semantic indexing. In: Berry MW, Castellanos M (eds) Survey of Text Mining II. Springer, New York

  18. Good N, Grossklags J, Thaw D, Perzanowski A, Mulligan DK, Konstan J (2006) User choices and regret: understanding users’ decision process about consensually acquired spyware. I/S Law Policy Inf Soc 2(2): 283–344

    Google Scholar 

  19. Granados A, Cebrian M, Camacho D, Rodriguez FB (2008) Evaluating the impact of information distortion on normalized compression distance. In: Barbero A (ed) Coding Theory and Applications. Springer, Berlin, pp 69–79

  20. Hofmann T (1999) Probabilistic latent semantic indexing. In: 22nd annual international ACM SIGIR conference on research and development in information retrieval. ACM Press, pp 50–57

  21. Iman RL, Davenport JM (1980) Approximations of the critical region of the friedman statistic. Commun Stat A 9(6): 571–595

    Article  Google Scholar 

  22. Keogh E, Lonardi S, Ratanamahatana CA (2004) Towards parameter-free data mining. In: Tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM Press, New York, NY, USA, pp 206–215

  23. Keogh E, Lonardi S, Ratanamahatana CA, Wei L, Lee S-H, Handley J (2007) Compression-based data mining of sequential data. Data Min Knowl Discov 14(1): 99–129

    Article  MathSciNet  Google Scholar 

  24. Landauer TK, Foltz PW, Laham D (1998) Introduction to Latent Semantic Analysis. Discourse Process 25: 259–284

    Article  Google Scholar 

  25. Langville AN, Meyer CD (2004) The use of linear algebra by web search engines. Bull Int Linear Algebra Soc 33: 2–6

    Google Scholar 

  26. Lavesson N, Boldt M, Davidsson P, Jacobsson A (2008) Spyware prevention by classifying end user license agreements. In: Nguyen NT, Katarzyniak R (eds) New Challenges in Applied Intelligence Technologies, Studies in Computational Intelligence. Springer, Berlin

  27. Lavesson N, Boldt M, Davidsson P, Jacobsson A (2011) Learning to detect spyware using end user license agreements. Knowl Inf Syst 26(2): 285–307

    Article  Google Scholar 

  28. Leydesdorff L (2005) Similarity measures, author cocitation analysis,and information theory. J Am Soc Inf Sci Technol 56(7): 769–772

    Article  Google Scholar 

  29. Li M, Chen X, Xin ML, Ma B, Vitanyi PMB (2004) The similarity metric. IEEE Trans Inf Theory 50(12): 3250–3264

    Article  Google Scholar 

  30. Lin S-W, Chen S-C, Wu W-J, Chen C-H (2009) Parameter determination and feature selection for back-propagation network by particle swarm optimization. Knowl Inf Syst 21(2): 249–266

    Article  Google Scholar 

  31. Lovins JB (1968) Development of a stemming algorithm. Mech Transl Comput Linguist 11: 22–31

    Google Scholar 

  32. McCallum A, Nigam K (1998) A comparison of event models for naive bayes text classification. In: AAAI-98 workshop on learning for text categorization

  33. Nemenyi PB (1963) Distribution-free multiple comparisons. Ph.D. thesis, Princeton university

  34. Sebastiani F (2002) Machine learning in automated text categorization. ACM Comput Surv 34(1): 1–47

    Article  Google Scholar 

  35. Seward J (2001) Space-time tradeoffs in the inverse B-W transform. Data Compression Conference. Washington DC, USA

    Google Scholar 

  36. Simeon M, Hilderman R (2008) Categorical proportional difference: a feature selection method for text categorization. In: Roddick JF, Li J, Christen P, Kennedy PJ (eds) Seventh Australasian Data Mining Conference, volume 87 of CRPIT. ACS, Glenelg, South Australia, pp 201–208

  37. Telles GP, Minghim R, Paulovich FV (2007) Normalized compression distance for visual analysis of document collections. Comput Graph 31: 327–337

    Article  Google Scholar 

  38. Vitanyi PMB, Balbach FJ, Cilibrasi RL, Li M (2008) Information theory and statistical learning, Chap. 3. Springer, New York

    Google Scholar 

  39. Wang P, Hu J, Zeng HJ, Chen Z (2009) Using wikipedia knowledge to improve text classification. Knowl Inf Syst 19: 265–281

    Article  Google Scholar 

  40. Witten IH, Frank E (2005) Data mining: practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco

    MATH  Google Scholar 

  41. Cleary JG, Witten IH (1984) Data compression using adaptive coding and partial string matching. IEEE Trans Commun 32(4): 396–402

    Article  Google Scholar 

  42. Ye S, Wen J-R, Ma W-Y (2008) A systematic study on parameter correlations in large-scale duplicate document detection. Knowl Inf Syst 14(2): 217–232

    Article  Google Scholar 

  43. Zhang M, Alhajj R (2010) Effectiveness of NAQ-tree as index structure for similarity search in high-dimensional metric space. Knowl Inf Syst 22(1): 1–26

    Article  MATH  Google Scholar 

  44. Japkowicz N, Stephen S (2002) The class imbalance problem: a systematic study. Intell Data Anal 6(5): 429–449

    MATH  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Niklas Lavesson.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Lavesson, N., Axelsson, S. Similarity assessment for removal of noisy end user license agreements. Knowl Inf Syst 32, 167–189 (2012). https://doi.org/10.1007/s10115-011-0438-9

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10115-011-0438-9

Keywords

Navigation