Bridging social media via distant supervision

  • Walid Magdy
  • Hassan Sajjad
  • Tarek El-Ganainy
  • Fabrizio Sebastiani
Original Article


Microblog classification has received a lot of attention in recent years. Different classification tasks have been investigated, most of them focusing on classifying microblogs into a small number of classes (five or fewer) using a training set of manually annotated tweets. Unfortunately, labelling data is tedious and expensive, and finding tweets that cover all the classes of interest is not always straightforward, especially when some of the classes rarely arise in practice. In this paper, we study an approach to tweet classification based on distant supervision, whereby we automatically transfer labels from one social medium to another for a single-label multi-class classification task. In particular, we apply YouTube video classes to tweets linking to these videos. This yields a virtually unlimited supply of labelled training instances at no annotation cost. Our classification experiments show that training a tweet classifier on these automatically labelled data achieves substantially better performance than training the same classifier on a limited amount of manually labelled data; this is advantageous, given that the automatically labelled data come for free. Further investigation shows that the approach is robust when applied with different numbers of classes and across different languages.
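The label-transfer step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the URL pattern, and the `video_category` lookup table (standing in for whatever source supplies each video's YouTube class) are assumptions made for illustration, as is the choice to strip URLs from the tweet text.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed URL shapes: youtube.com/watch?v=<id> and youtu.be/<id>,
# where <id> is the 11-character video identifier.
_VIDEO_ID = re.compile(
    r"(?:youtube\.com/watch\?(?:[^\s#]*&)?v=|youtu\.be/)([A-Za-z0-9_-]{11})"
)


def video_id(tweet: str) -> Optional[str]:
    """Return the YouTube video id linked in a tweet, or None if there is none."""
    match = _VIDEO_ID.search(tweet)
    return match.group(1) if match else None


def transfer_labels(
    tweets: List[str], video_category: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Label each tweet with the class of the video it links to.

    `video_category` is a hypothetical lookup table (video id -> class name).
    Tweets without a recognised video link, or linking to a video of
    unknown class, are skipped.
    """
    labelled = []
    for text in tweets:
        vid = video_id(text)
        if vid is None or vid not in video_category:
            continue
        # Illustrative choice (an assumption): drop URLs so that only the
        # tweet's words remain as classifier input.
        clean = re.sub(r"https?://\S+", "", text).strip()
        labelled.append((clean, video_category[vid]))
    return labelled
```

The resulting `(text, class)` pairs could then be fed to any standard text classifier as training data, in place of manually annotated tweets.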


Twitter; YouTube; Tweet classification; Distant supervision



Copyright information

© Springer-Verlag Wien 2015

Authors and Affiliations

  • Walid Magdy
    • 1
  • Hassan Sajjad
    • 1
  • Tarek El-Ganainy
    • 1
  • Fabrizio Sebastiani
    • 1
  1. Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
