Abstract
Unsafe websites include both malicious sites and inappropriate ones, such as those hosting questionable or offensive content. Website reputation systems are intended to help ordinary users steer away from these unsafe sites. However, the process of assigning safety ratings to websites typically involves humans. Consequently, it is time-consuming, costly, and not scalable. This has resulted in two major problems: (i) a significant proportion of the web space remains unrated and (ii) there is an unacceptable time lag before new websites are rated. In this paper, we show that by leveraging structural and content-based properties of websites, we can reliably and efficiently predict their safety ratings, thereby mitigating both problems. We demonstrate the effectiveness of our approach using four datasets of up to 90,000 websites. We use ratings from Web of Trust (WOT), a popular crowdsourced web reputation system, as ground truth. We propose a novel ensemble classification technique that makes opportunistic use of available structural and content properties of web pages to predict their eventual ratings in the two dimensions used by WOT: trustworthiness and child safety. Ours is the first classification system to predict such subjective ratings. The same approach works equally well in identifying malicious websites. Across all datasets, our classification achieves an average F\(_1\)-score in the 74–90% range.
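To make the two ideas in the abstract concrete, the following is a minimal sketch of (a) combining the outputs of several base classifiers while using only those whose features happen to be available for a site, and (b) computing an F\(_1\)-score for the resulting predictions. It is an illustration under stated assumptions, not the ensemble proposed in the paper: the feature-group names (`html`, `links`, `text`), the mean combination rule, the 0.5 threshold, and the toy scores are all hypothetical.

```python
from statistics import mean


def combine_available(scores):
    """Average the 'unsafe' probabilities of the base classifiers that could run.

    `scores` maps a hypothetical feature-group name to a probability in [0, 1],
    or to None when that group of features could not be extracted for the site.
    Returns None when no base classifier produced a score.
    """
    available = [p for p in scores.values() if p is not None]
    if not available:
        return None
    return mean(available)


def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: per-site scores from three hypothetical base classifiers.
sites = [
    {"html": 0.9, "links": 0.7, "text": None},
    {"html": 0.2, "links": None, "text": 0.4},
    {"html": None, "links": 0.6, "text": None},
    {"html": 0.1, "links": 0.3, "text": 0.2},
]
y_true = [1, 0, 0, 0]  # 1 = unsafe, 0 = safe (made-up labels)

# Label a site unsafe when the combined score reaches the assumed threshold.
y_pred = [1 if combine_available(s) >= 0.5 else 0 for s in sites]
toy_f1 = f1_score(y_true, y_pred)
```

Averaging only over the available scores lets a site with a single extractable feature group still receive a prediction, which is one simple way to read "opportunistic use"; other combination rules, such as those surveyed by Kittler et al., could be substituted.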
Notes
- 1.
- 2.
- 3.
- 4. WOT ratings are obtained using their web API (https://www.mywot.com/wiki/API).
- 5. See [8] for an exhaustive and in-depth description of all the HTML features.
- 6. We also experimented using linear-SVM, SVM, KNN and C4.5 classifiers, and chose Random Forest for its superior performance.
- 7. Feature importance is defined as the total decrease in node impurity averaged over all the trees [7].
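The definition of feature importance in the last note can be sketched in a few lines. This is a minimal illustration, assuming Gini impurity as the impurity measure and a toy representation in which each tree is a list of `(feature_index, impurity_decrease)` pairs, one per split; neither choice is stated in the notes, and all function names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of one split, weighted by the share of samples at the node."""
    n = len(parent)
    children = (len(left) * gini(left) + len(right) * gini(right)) / n
    return (n / n_total) * (gini(parent) - children)


def feature_importance(forest, n_features):
    """Sum each feature's impurity decreases per tree, then average over all trees.

    `forest` is a list of trees; each tree is a list of
    (feature_index, impurity_decrease) pairs, one per split node.
    """
    totals = [0.0] * n_features
    for tree in forest:
        for feature, decrease in tree:
            totals[feature] += decrease
    return [total / len(forest) for total in totals]


# Toy example: a perfect split of four samples on feature 0.
root_decrease = impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1], n_total=4)

# Two hand-made trees over two features (decrease values are made up).
toy_forest = [
    [(0, root_decrease)],
    [(1, 0.25), (0, 0.25)],
]
importances = feature_importance(toy_forest, n_features=2)
```

Library implementations often also normalise the resulting values so that they sum to one; that step is omitted here because the note does not mention it.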
References
Akhawe, D., Felt, A.P.: Alice in warningland: a large-scale field study of browser security warning effectiveness. In: Proceedings of the 22nd USENIX Conference on Security, SEC 2013, pp. 257–272. USENIX Association, Berkeley, CA, USA (2013)
Bhattacharya, S., Huhta, O., Asokan, N.: Lookahead: augmenting crowdsourced website reputation systems with predictive modeling (2015). http://www.arxiv.org/pdf/1504.04730.pdf
Bhattacharya, S., Nurmi, P., Hammerla, N., Plötz, T.: Using unlabeled data in a sparse-coding framework for human activity recognition. Pervasive and Mobile Computing, May 2014
Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2007)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998)
Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
Canali, D., Cova, M., Vigna, G., Kruegel, C.: Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the 20th International Conference on World Wide Web, pp. 197–206. ACM (2011)
Chia, P.H., Knapskog, S.J.: Re-evaluating the wisdom of crowds in assessing web security. In: Danezis, G. (ed.) FC 2011. LNCS, vol. 7035, pp. 299–314. Springer, Heidelberg (2012)
Cova, M., Kruegel, C., Vigna, G.: Detection and analysis of drive-by-download attacks and malicious JavaScript code. In: Proceedings of the 19th International Conference on World Wide Web, pp. 281–290. ACM (2010)
Cox, D.R., Oakes, D.: Analysis of Survival Data. Chapman and Hall/CRC (1984)
Curtsinger, C., Livshits, B., Zorn, B.G., Seifert, C.: Zozzle: fast and precise in-browser JavaScript malware detection. In: USENIX Security Symposium, pp. 33–48 (2011)
Daigle, L.: Whois protocol specification
Daigle, L.: RFC 3912: WHOIS protocol specification, September 2014. http://www.tools.ietf.org/html/rfc3912
Feinstein, B., Peck, D.: Caffeine monkey: automated collection, detection and analysis of malicious JavaScript. In: Proceedings of the Black Hat Security Conference (2007)
Felegyhazi, M., Kreibich, C., Paxson, V.: On the potential of proactive domain blacklisting. In: Proceedings of the 3rd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, LEET 2010, p. 6. USENIX Association, Berkeley, CA, USA (2010)
Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)
Hammerla, N., Kirkham, R., Andras, P., Plötz, T.: On preserving statistical characteristics of accelerometry data using their empirical cumulative distribution. In: Proceedings of the International Symposium on Wearable Computers (ISWC) (2013)
Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
Likarish, P., Jung, E., Jo, I.: Obfuscated malicious JavaScript detection using classification techniques. In: 4th International Conference on Malicious and Unwanted Software (MALWARE), pp. 47–54 (2009)
Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2009, pp. 1245–1254. ACM, New York, NY, USA (2009)
McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153–157 (1947)
Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28(1), 92–122 (2014)
Moore, T., Clayton, R.C.: Evaluating the wisdom of crowds in assessing phishing websites. In: Tsudik, G. (ed.) FC 2008. LNCS, vol. 5143, pp. 16–30. Springer, Heidelberg (2008)
Plötz, T., Hammerla, N.Y., Olivier, P.: Feature learning for activity recognition in ubiquitous computing. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 1729–1734 (2011)
Prakash, P., Kumar, M., Kompella, R., Gupta, M.: Phishnet: predictive blacklisting to detect phishing attacks. In: 2010 Proceedings IEEE INFOCOM, pp. 1–5, March 2010
Rieck, K., Krueger, T., Dewald, A.: Cujo: efficient detection and prevention of drive-by-download attacks. In: Proceedings of the 26th Annual Computer Security Applications Conference, pp. 31–39. ACM (2010)
Ruvolo, J.: WOT statistics, December 2014. https://www.mywot.com/en/community/statistics
Seifert, C., Welch, I., Komisarczuk, P.: Identification of malicious web pages with static heuristics. In: Telecommunication Networks and Applications Conference (ATNAC), pp. 91–96 (2008)
Seifert, C., Welch, I., Komisarczuk, P., Aval, C., Popovsky, B.: Identification of malicious web pages through analysis of underlying DNS and web server relationships. In: 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935–941 (2008)
Truong, H.T.T., Lagerspetz, E., Nurmi, P., Oliner, A.J., Tarkoma, S., Asokan, N., Bhattacharya, S.: The company you keep: mobile malware infection rates and inexpensive risk indicators. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 39–50 (2014)
Acknowledgments
This work was partially supported by the Intel Institute for Collaborative Research in Secure Computing (ICRI-SC) and the Academy of Finland project “Contextual Security” (Grant Number: 274951). We thank Web of Trust for giving us access to the data used in this work. We also thank Timo Ala-Kleemola and Sergey Andryukhin for helping us understand the WOT data, and Jian Liu and Swapnil Udar for helping to develop the web crawler. Finally, we thank Petteri Nurmi, Pekka Parviainen, and Nidhi Gupta for their feedback on an earlier version of this manuscript.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Bhattacharya, S., Huhta, O., Asokan, N. (2015). LookAhead: Augmenting Crowdsourced Website Reputation Systems with Predictive Modeling. In: Conti, M., Schunter, M., Askoxylakis, I. (eds) Trust and Trustworthy Computing. Trust 2015. Lecture Notes in Computer Science, vol 9229. Springer, Cham. https://doi.org/10.1007/978-3-319-22846-4_9
DOI: https://doi.org/10.1007/978-3-319-22846-4_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22845-7
Online ISBN: 978-3-319-22846-4
eBook Packages: Computer Science, Computer Science (R0)