LookAhead: Augmenting Crowdsourced Website Reputation Systems with Predictive Modeling

  • Sourav Bhattacharya
  • Otto Huhta
  • N. Asokan
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9229)


Unsafe websites include both malicious sites and inappropriate ones, such as those hosting questionable or offensive content. Website reputation systems are intended to help ordinary users steer away from these unsafe sites. However, the process of assigning safety ratings to websites typically involves humans. Consequently, it is time-consuming, costly, and not scalable. This has resulted in two major problems: (i) a significant proportion of the web space remains unrated and (ii) there is an unacceptable time lag before new websites are rated. In this paper, we show that by leveraging structural and content-based properties of websites, we can reliably and efficiently predict their safety ratings, thereby mitigating both problems. We demonstrate the effectiveness of our approach using four datasets of up to 90,000 websites. We use ratings from Web of Trust (WOT), a popular crowdsourced web reputation system, as ground truth. We propose a novel ensemble classification technique that makes opportunistic use of available structural and content properties of web pages to predict their eventual ratings in the two dimensions used by WOT: trustworthiness and child safety. Ours is the first classification system to predict such subjective ratings. The same approach works equally well in identifying malicious websites. Across all datasets, our classification achieves an average \(F_1\)-score in the 74–90% range.
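The abstract does not spell out how the ensemble combines its inputs, so the following is only a minimal sketch of one plausible reading of "opportunistic use": each feature group (for example, structural or content features) has its own classifier, and the ensemble averages the outputs of whichever classifiers could be run for a given page. The mean rule, the 0.5 threshold, the feature-group names, and the functions `combine_available`, `predict_unsafe`, and `f1_score` are all assumptions made for illustration, not the authors' implementation. The \(F_1\) computation is the standard harmonic mean of precision and recall.

```python
from typing import Dict, Optional, Sequence


def combine_available(probs: Dict[str, Optional[float]]) -> Optional[float]:
    """Mean rule over the per-feature-group outputs that are present.

    `probs` maps a hypothetical feature-group name to that group's
    predicted probability of "unsafe", or None if the group's features
    could not be extracted for this page.
    """
    available = [p for p in probs.values() if p is not None]
    if not available:
        return None  # nothing to combine; the page stays unrated
    return sum(available) / len(available)


def predict_unsafe(probs: Dict[str, Optional[float]],
                   threshold: float = 0.5) -> Optional[int]:
    """Return 1 (unsafe), 0 (safe), or None if no classifier had input."""
    combined = combine_available(probs)
    if combined is None:
        return None
    return 1 if combined >= threshold else 0


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Standard F1: harmonic mean of precision and recall for class 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # A page whose content features are missing still gets a prediction
    # from the remaining (hypothetical) feature groups.
    page = {"structure": 0.8, "content": None, "topics": 0.4}
    print(predict_unsafe(page))
    print(f1_score([1, 1, 0, 0], [1, 0, 1, 0]))
```

Averaging only the available outputs lets a page with partial information still receive a prediction, which is one way to address the coverage problem described above; other combination rules (product, max, weighted vote) would slot into `combine_available` in the same way.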


Keywords: False Negative Rate, Latent Dirichlet Allocation, Combination Rule, Reputation System, Random Forest Classifier
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



This work was partially supported by the Intel Institute for Collaborative Research in Secure Computing (ICRI-SC) and the Academy of Finland project “Contextual Security” (Grant Number: 274951). We thank Web of Trust for giving us access to the data used in this work. We also thank Timo Ala-Kleemola and Sergey Andryukhin for helping us understand the WOT data, and Jian Liu and Swapnil Udar for helping to develop the web crawler. Finally, we thank Petteri Nurmi, Pekka Parviainen, and Nidhi Gupta for their feedback on an earlier version of this manuscript.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  1. Bell Laboratories, Dublin, Ireland
  2. Aalto University, Espoo, Finland
  3. University of Helsinki, Helsinki, Finland
