Similarity assessment for removal of noisy end user license agreements

Lavesson, Niklas; Axelsson, Stefan

doi:10.1007/s10115-011-0438-9

Regular Paper
Published: 28 July 2011

Volume 32, pages 167–189, (2012)

Knowledge and Information Systems

Niklas Lavesson¹ &
Stefan Axelsson¹

Abstract

In previous work, we showed that it is possible to automatically discriminate between legitimate software and spyware-associated software by performing supervised learning on end user license agreements (EULAs). However, the number of false positives (spyware classified as legitimate software) was too large for practical use. In this study, we address the false positives problem by removing noisy EULAs, which are identified through similarity analysis of the previously studied EULAs. Two candidate similarity analysis methods are compared experimentally: cosine similarity assessment in conjunction with latent semantic analysis (LSA), and normalized compression distance (NCD). The results show that the number of false positives can be reduced significantly by removing the noise identified by either method. However, the experimental results also indicate subtle performance differences between LSA and NCD. To improve performance further and to reduce the large number of attributes, the categorical proportional difference (CPD) feature selection algorithm was applied. CPD greatly reduced the number of attributes while also increasing classification performance on the original data set, as well as on the LSA- and NCD-based data sets.
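
As a rough illustration of the NCD-based idea mentioned in the abstract, the following Python sketch computes the normalized compression distance between two documents and flags pairs of EULAs that are very similar yet carry different class labels. It is a minimal sketch under stated assumptions, not the procedure used in the paper: the choice of zlib as compressor, the helper `flag_noisy`, its cross-label criterion, and the `threshold` value are all hypothetical and chosen only for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of the zlib-compressed input (zlib is an assumed stand-in compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size. Values near 0 suggest high similarity.
    With a real compressor the result is only approximately symmetric.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def flag_noisy(eulas, labels, threshold=0.3):
    """Hypothetical noise criterion: return the sorted indices of EULAs that lie
    within `threshold` NCD of a EULA carrying a different label."""
    noisy = set()
    for i in range(len(eulas)):
        for j in range(i + 1, len(eulas)):
            if labels[i] != labels[j] and ncd(eulas[i], eulas[j]) < threshold:
                noisy.update((i, j))
    return sorted(noisy)
```

Calling `flag_noisy(texts, labels)` with a list of EULA texts as `bytes` and a parallel list of class labels returns candidate indices to inspect or remove before training. A suitable threshold would have to be tuned on the data at hand, and the pairwise loop is quadratic in the number of documents.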

Author information

Authors and Affiliations

School of Computing, Blekinge Institute of Technology, 371 79, Karlskrona, Sweden
Niklas Lavesson & Stefan Axelsson

Corresponding author

Correspondence to Niklas Lavesson.

About this article

Cite this article

Lavesson, N., Axelsson, S. Similarity assessment for removal of noisy end user license agreements. Knowl Inf Syst 32, 167–189 (2012). https://doi.org/10.1007/s10115-011-0438-9

Received: 02 May 2010
Revised: 29 March 2011
Accepted: 18 July 2011
Published: 28 July 2011
Issue Date: July 2012
DOI: https://doi.org/10.1007/s10115-011-0438-9
