Artificial Intelligence Review, Volume 49, Issue 3, pp 301–338

Short-text feature construction and selection in social media data: a survey

  • Antonela Tommasel
  • Daniela Godoy

Abstract

Social networking sites such as Facebook or Twitter attract millions of users, who every day post an enormous amount of content in the form of tweets, comments and posts. Since social network texts are usually short, learning tasks have to deal with a very high-dimensional and sparse feature space, in which most features have low frequencies. As a result, extracting useful knowledge from such noisy data is challenging, which makes large-scale short-text learning in social environments one of the most relevant problems in machine learning and data mining. Feature selection is one of the best-known and most commonly used techniques for reducing the impact of the high-dimensional feature space in text learning. A wide variety of feature selection techniques applied to traditional long texts and document collections can be found in the literature. However, short texts coming from the social Web pose new challenges to this well-studied problem: their shortness offers limited context from which to extract enough statistical evidence about word relations (e.g. correlation), and instances usually arrive in continuous streams (e.g. the Twitter timeline), so that the number of features and instances is unknown, among other problems. This paper surveys feature selection techniques for dealing with short texts in both offline and online settings. Then, open issues and research opportunities for performing online feature selection over social media data are discussed.
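As a rough illustration of the sparsity problem described above, the following minimal sketch builds a binary bag-of-words view of a handful of short posts and applies a document-frequency threshold, one of the simplest filter-style feature selection criteria. The toy corpus, the whitespace tokenizer, the `min_df` parameter and all function names are assumptions made for this example; they are not taken from the survey, which covers far more elaborate techniques.

```python
from collections import Counter

# Hypothetical toy corpus of short, tweet-like texts (not from the paper).
posts = [
    "great match tonight",
    "traffic is terrible tonight",
    "new phone arrived today",
    "great coffee this morning",
    "rain again this morning",
]


def tokenize(text):
    """Lowercase whitespace tokenizer; a deliberate simplification."""
    return text.lower().split()


def document_frequencies(texts):
    """Count in how many texts each term occurs at least once."""
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return df


def sparsity(texts, vocabulary):
    """Fraction of zero cells in the binary document-term matrix."""
    nonzero = sum(len(set(tokenize(t)) & vocabulary) for t in texts)
    return 1.0 - nonzero / (len(texts) * len(vocabulary))


def select_by_df(df, min_df):
    """Keep terms that occur in at least min_df texts (assumed threshold)."""
    return {term for term, count in df.items() if count >= min_df}


df = document_frequencies(posts)
vocab = set(df)
selected = select_by_df(df, min_df=2)

print(f"vocabulary size: {len(vocab)}")
print(f"sparsity: {sparsity(posts, vocab):.3f}")
print(f"selected terms: {sorted(selected)}")
```

On this toy corpus, 15 distinct terms appear and only 4 of them occur in more than one post, so even five short texts yield a document-term matrix that is mostly zeros. A frequency threshold like this is only a baseline: it cannot adapt to a stream in which the vocabulary keeps growing, which is the online setting the abstract highlights.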

Keywords

Feature selection, Short-text, Social media data, Text learning

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  1. ISISTAN, UNICEN-CONICET, Tandil, Buenos Aires, Argentina
