LookAhead: Augmenting Crowdsourced Website Reputation Systems with Predictive Modeling

Bhattacharya, Sourav; Huhta, Otto; Asokan, N.

doi:10.1007/978-3-319-22846-4_9

Sourav Bhattacharya, Otto Huhta & N. Asokan

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 9229)

Included in the following conference series:

International Conference on Trust and Trustworthy Computing

Abstract

Unsafe websites include both malicious sites and inappropriate ones, such as those hosting questionable or offensive content. Website reputation systems are intended to help ordinary users steer away from these unsafe sites. However, the process of assigning safety ratings to websites typically involves humans. Consequently, it is time-consuming, costly, and not scalable. This has resulted in two major problems: (i) a significant proportion of the web space remains unrated and (ii) there is an unacceptable time lag before new websites are rated. In this paper, we show that by leveraging structural and content-based properties of websites, we can reliably and efficiently predict their safety ratings, thereby mitigating both problems. We demonstrate the effectiveness of our approach using four datasets of up to 90,000 websites. We use ratings from Web of Trust (WOT), a popular crowdsourced web reputation system, as ground truth. We propose a novel ensemble classification technique that makes opportunistic use of available structural and content properties of web pages to predict their eventual ratings in two dimensions used by WOT: trustworthiness and child safety. Ours is the first classification system to predict such subjective ratings. The same approach works equally well in identifying malicious websites. Across all datasets, our classification achieves an average \(F_1\)-score in the 74–90% range.
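The abstract does not spell out how the ensemble combines its members, so the following is only a minimal sketch of one plausible reading of "opportunistic use": each feature family gets its own classifier, and only the classifiers whose features could be extracted for a page contribute to the final score. The family names (`html`, `javascript`, `text`), the function `ensemble_unsafe_probability`, the toy classifiers, and the mean-combination rule are all assumptions made for illustration, not the authors' method.

```python
from statistics import mean
from typing import Callable, Mapping, Optional, Sequence

# A per-family classifier maps a feature vector to P(unsafe) in [0, 1].
# Real members would be trained models; these are hypothetical stand-ins.
Classifier = Callable[[Sequence[float]], float]


def ensemble_unsafe_probability(
    classifiers: Mapping[str, Classifier],
    features: Mapping[str, Optional[Sequence[float]]],
) -> Optional[float]:
    """Average the scores of the classifiers whose features are available.

    Families missing from `features`, or present with value None, are
    skipped. Returns None when no family is available for the page.
    """
    scores = [
        clf(features[family])
        for family, clf in classifiers.items()
        if features.get(family) is not None
    ]
    return mean(scores) if scores else None


# Toy classifiers returning fixed scores (assumption: illustration only).
toy_classifiers = {
    "html": lambda x: 0.25,
    "javascript": lambda x: 0.75,
    "text": lambda x: 1.0,
}

# A page for which text features could not be extracted.
page = {"html": [12.0, 3.0], "javascript": [40.0], "text": None}
print(ensemble_unsafe_probability(toy_classifiers, page))
```

Averaging is only one of several standard combination rules; a product or majority-vote rule could be substituted without changing the structure of the sketch.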

Notes

1. http://www.trustedsource.org/
2. http://www.phishtank.com/
3. https://www.mywot.com/
4. WOT ratings are obtained using their web API (https://www.mywot.com/wiki/API).
5. See [8] for an exhaustive and in-depth description of all the HTML features.
6. We also experimented using linear-SVM, SVM, KNN and C4.5 classifiers, and chose Random Forest for its superior performance.
7. Feature importance is defined as the total decrease in node impurity averaged over all the trees [7].
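The definition in note 7 can be made concrete with a small, hedged sketch. It assumes each tree is given as a nested dictionary recording, per node, the sample count `n` and the Gini `impurity`, and that each split's impurity decrease is weighted by the fraction of samples reaching the node (a common convention; the note itself does not state the weighting). The feature names `n_iframes` and `n_scripts` and the two toy trees are hypothetical.

```python
from collections import defaultdict


def tree_importance(tree):
    """Total weighted impurity decrease per feature within one tree."""
    n_total = tree["n"]
    totals = defaultdict(float)

    def visit(node):
        if "feature" not in node:  # leaf: no split, no contribution
            return
        left, right = node["left"], node["right"]
        # Impurity remaining after the split, weighted by child sizes.
        child_impurity = (
            left["n"] * left["impurity"] + right["n"] * right["impurity"]
        ) / node["n"]
        # Weight the decrease by the share of samples reaching this node.
        weight = node["n"] / n_total
        totals[node["feature"]] += weight * (node["impurity"] - child_impurity)
        visit(left)
        visit(right)

    visit(tree)
    return dict(totals)


def forest_importance(trees):
    """Average the per-tree totals over all trees in the forest."""
    summed = defaultdict(float)
    for tree in trees:
        for feature, value in tree_importance(tree).items():
            summed[feature] += value
    return {feature: value / len(trees) for feature, value in summed.items()}


def leaf(n, impurity=0.0):
    return {"n": n, "impurity": impurity}


# Toy forest: 10 pages (7 safe, 3 unsafe), so the root Gini impurity is 0.42.
tree_a = {
    "feature": "n_iframes", "n": 10, "impurity": 0.42,
    "left": leaf(4),
    "right": {
        "feature": "n_scripts", "n": 6, "impurity": 0.5,
        "left": leaf(3), "right": leaf(3),
    },
}
tree_b = {
    "feature": "n_scripts", "n": 10, "impurity": 0.42,
    "left": leaf(3), "right": leaf(7),
}

importances = forest_importance([tree_a, tree_b])
print(importances)
```

In this toy forest `n_scripts` receives the larger importance because it accounts for most of the impurity removed in both trees.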

References

1. Akhawe, D., Felt, A.P.: Alice in warningland: a large-scale field study of browser security warning effectiveness. In: Proceedings of the 22nd USENIX Conference on Security, SEC 2013, pp. 257–272. USENIX Association, Berkeley, CA, USA (2013)
2. Bhattacharya, S., Huhta, O., Asokan, N.: Lookahead: augmenting crowdsourced website reputation systems with predictive modeling (2015). http://www.arxiv.org/pdf/1504.04730.pdf
3. Bhattacharya, S., Nurmi, P., Hammerla, N., Plötz, T.: Using unlabeled data in a sparse-coding framework for human activity recognition. Pervasive and Mobile Computing, May 2014
4. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer, Heidelberg (2007)
5. Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
6. Breese, J.S., Heckerman, D., Kadie, C.: Empirical analysis of predictive algorithms for collaborative filtering. In: Proceedings of the Fourteenth Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998)
7. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
8. Canali, D., Cova, M., Vigna, G., Kruegel, C.: Prophiler: a fast filter for the large-scale detection of malicious web pages. In: Proceedings of the 20th International Conference on World Wide Web, pp. 197–206. ACM (2011)
9. Chia, P.H., Knapskog, S.J.: Re-evaluating the wisdom of crowds in assessing web security. In: Danezis, G. (ed.) FC 2011. LNCS, vol. 7035, pp. 299–314. Springer, Heidelberg (2012)
10. Cova, M., Kruegel, C., Vigna, G.: Detection and analysis of drive-by-download attacks and malicious JavaScript code. In: Proceedings of the 19th International Conference on World Wide Web, pp. 281–290. ACM (2010)
11. Cox, D.R., Oakes, D.: Analysis of Survival Data. Chapman and Hall/CRC (1984)
12. Curtsinger, C., Livshits, B., Zorn, B.G., Seifert, C.: Zozzle: fast and precise in-browser JavaScript malware detection. In: USENIX Security Symposium, pp. 33–48 (2011)
13. Daigle, L.: Whois protocol specification
14. Daigle, L.: RFC 3912: Whois protocol specification, September 2014. http://www.tools.ietf.org/html/rfc3912
15. Feinstein, B., Peck, D.: Caffeine monkey: automated collection, detection and analysis of malicious JavaScript. In: Proceedings of the Black Hat Security Conference (2007)
16. Felegyhazi, M., Kreibich, C., Paxson, V.: On the potential of proactive domain blacklisting. In: Proceedings of the 3rd USENIX Conference on Large-scale Exploits and Emergent Threats: Botnets, Spyware, Worms, and More, LEET 2010, p. 6. USENIX Association, Berkeley, CA, USA (2010)
17. Fukunaga, K.: Introduction to Statistical Pattern Recognition, 2nd edn. Academic Press, San Diego (1990)
18. Hammerla, N., Kirkham, R., Andras, P., Plötz, T.: On preserving statistical characteristics of accelerometry data using their empirical cumulative distribution. In: Proceedings of the International Symposium on Wearable Computers (ISWC) (2013)
19. Kittler, J., Hatef, M., Duin, R., Matas, J.: On combining classifiers. IEEE Trans. Pattern Anal. Mach. Intell. 20(3), 226–239 (1998)
20. Likarish, P., Jung, E., Jo, I.: Obfuscated malicious JavaScript detection using classification techniques. In: 4th International Conference on Malicious and Unwanted Software (MALWARE), pp. 47–54 (2009)
21. Ma, J., Saul, L.K., Savage, S., Voelker, G.M.: Beyond blacklists: learning to detect malicious web sites from suspicious URLs. In: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2009, pp. 1245–1254. ACM, New York, NY, USA (2009)
22. McNemar, Q.: Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika 12, 153–157 (1947)
23. Menardi, G., Torelli, N.: Training and assessing classification rules with imbalanced data. Data Min. Knowl. Disc. 28(1), 92–122 (2014)
24. Moore, T., Clayton, R.C.: Evaluating the wisdom of crowds in assessing phishing websites. In: Tsudik, G. (ed.) FC 2008. LNCS, vol. 5143, pp. 16–30. Springer, Heidelberg (2008)
25. Plötz, T., Hammerla, N.Y., Olivier, P.: Feature learning for activity recognition in ubiquitous computing. In: International Joint Conference on Artificial Intelligence (IJCAI), pp. 1729–1734 (2011)
26. Prakash, P., Kumar, M., Kompella, R., Gupta, M.: Phishnet: predictive blacklisting to detect phishing attacks. In: 2010 Proceedings IEEE INFOCOM, pp. 1–5, March 2010
27. Rieck, K., Krueger, T., Dewald, A.: Cujo: efficient detection and prevention of drive-by-download attacks. In: Proceedings of the 26th Annual Computer Security Applications Conference, pp. 31–39. ACM (2010)
28. Ruvolo, J.: WOT statistics, December 2014. https://www.mywot.com/en/community/statistics
29. Seifert, C., Welch, I., Komisarczuk, P.: Identification of malicious web pages with static heuristics. In: Telecommunication Networks and Applications Conference (ATNAC), pp. 91–96 (2008)
30. Seifert, C., Welch, I., Komisarczuk, P., Aval, C., Popovsky, B.: Identification of malicious web pages through analysis of underlying DNS and web server relationships. In: 33rd IEEE Conference on Local Computer Networks (LCN), pp. 935–941 (2008)
31. Truong, H.T.T., Lagerspetz, E., Nurmi, P., Oliner, A.J., Tarkoma, S., Asokan, N., Bhattacharya, S.: The company you keep: mobile malware infection rates and inexpensive risk indicators. In: Proceedings of the 23rd International Conference on World Wide Web, pp. 39–50 (2014)

Acknowledgments

This work was partially supported by the Intel Institute for Collaborative Research in Secure Computing (ICRI-SC) and the Academy of Finland project “Contextual Security” (Grant Number: 274951). We thank Web of Trust for giving us access to the data used in this work. We also thank Timo Ala-Kleemola and Sergey Andryukhin for helping us understand the WOT data, and Jian Liu and Swapnil Udar for helping to develop the web crawler. Finally, we thank Petteri Nurmi, Pekka Parviainen, and Nidhi Gupta for their feedback on an earlier version of this manuscript.

Author information

Authors and Affiliations

Bell Laboratories, Dublin, Ireland
Sourav Bhattacharya
Aalto University, Espoo, Finland
Otto Huhta & N. Asokan
University of Helsinki, Helsinki, Finland
N. Asokan

Corresponding author

Correspondence to Sourav Bhattacharya.

Editor information

Editors and Affiliations

University of Padua, Padua, Italy
Mauro Conti
Intel Labs, Darmstadt, Germany
Matthias Schunter
Institute of Computer Science (ICS), Foundation for Research & Technology - Hellas (FORTH), Heraklion, Crete, Greece
Ioannis Askoxylakis

Cite this paper

Bhattacharya, S., Huhta, O., Asokan, N. (2015). LookAhead: Augmenting Crowdsourced Website Reputation Systems with Predictive Modeling. In: Conti, M., Schunter, M., Askoxylakis, I. (eds) Trust and Trustworthy Computing. Trust 2015. Lecture Notes in Computer Science, vol 9229. Springer, Cham. https://doi.org/10.1007/978-3-319-22846-4_9

DOI: https://doi.org/10.1007/978-3-319-22846-4_9
Published: 14 August 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-22845-7
Online ISBN: 978-3-319-22846-4
eBook Packages: Computer Science, Computer Science (R0)