Twitter Content-Based Spam Filtering

  • Igor Santos
  • Igor Miñambres-Marcos
  • Carlos Laorden
  • Patxi Galán-García
  • Aitor Santamaría-Ibirika
  • Pablo García Bringas
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 239)

Abstract

Twitter has become one of the most widely used social networks and, like every popular medium, it is prone to misuse. Spam on Twitter has emerged in recent years and has become a significant problem for users. Several approaches have been proposed to determine whether a user is a spammer. However, these blacklisting systems cannot filter every spam message, and a spammer can simply create another account and resume sending spam. In this paper, we propose a content-based approach to filtering spam tweets: we apply machine learning and compression algorithms to the text of each tweet to filter out undesired tweets.
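
To make the idea of content-based filtering concrete, the following is a minimal sketch that classifies a tweet from its text alone. It is not the authors' implementation: the abstract does not name the specific models or features, so a multinomial Naive Bayes classifier with add-one smoothing is assumed here as a stand-in, and the class name `NaiveBayesTweetFilter`, the tokenizer, and the four training tweets are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the tweet and split it into word-like tokens (assumed scheme)."""
    return re.findall(r"[a-z0-9#@']+", text.lower())


class NaiveBayesTweetFilter:
    """Hypothetical content-based filter: multinomial Naive Bayes over tokens."""

    LABELS = ("spam", "ham")

    def __init__(self):
        self.word_counts = {label: Counter() for label in self.LABELS}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, text, label):
        """Update token and document counts with one labelled tweet."""
        tokens = tokenize(text)
        self.word_counts[label].update(tokens)
        self.doc_counts[label] += 1
        self.vocab.update(tokens)

    def score(self, text, label):
        """Return log P(label) + sum of log P(token | label), add-one smoothed.

        Assumes at least one training tweet has been seen for `label`.
        """
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        total_words = sum(self.word_counts[label].values())
        denominator = total_words + len(self.vocab)
        for token in tokenize(text):
            log_prob += math.log((self.word_counts[label][token] + 1) / denominator)
        return log_prob

    def classify(self, text):
        """Pick the label with the higher log-probability score."""
        return max(self.LABELS, key=lambda label: self.score(text, label))


# Made-up training tweets, for illustration only.
spam_filter = NaiveBayesTweetFilter()
spam_filter.train("win a free iphone now click here", "spam")
spam_filter.train("free followers click this link now", "spam")
spam_filter.train("having coffee with friends this morning", "ham")
spam_filter.train("great talk on machine learning today", "ham")

print(spam_filter.classify("click here for free followers"))  # spam
print(spam_filter.classify("coffee with friends today"))      # ham
```

Because the decision depends only on the tweet text, a filter of this kind does not rely on the sender's account history, which is the property the abstract contrasts with blacklisting.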

Keywords

spam filtering; Twitter; social networks; machine learning; text classification

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Igor Santos (1)
  • Igor Miñambres-Marcos (1)
  • Carlos Laorden (1)
  • Patxi Galán-García (1)
  • Aitor Santamaría-Ibirika (1)
  • Pablo García Bringas (1)

  1. DeustoTech-Computing, Deusto Institute of Technology (DeustoTech), Bilbao, Spain
