A review of features for the discrimination of twitter users: application to the prediction of offline influence

Abstract

Many works related to Twitter aim at characterizing its users in some way: role on the service (spammers, bots, organizations, etc.), nature of the user (socio-professional category, age, etc.), topics of interest, and others. However, for a given user classification problem, it is very difficult to select a set of appropriate features, because the many features described in the literature are very heterogeneous, with name overlaps and collisions, and numerous closely related variants. In this article, we review a wide range of such features. To present a clear description of the state of the art, we unify their names, definitions and relationships, and we propose a new, neutral typology. We then illustrate the usefulness of our review by applying a selection of these features to the offline influence detection problem. This task consists of identifying users who are influential in real life, based on their Twitter account and related data. We show that most features deemed effective for predicting online influence, such as the numbers of retweets and followers, are not relevant to this problem. However, we propose several content-based approaches to label Twitter users as influencers or not. We also rank them according to a predicted influence level. Our proposals are evaluated on the CLEF RepLab 2014 dataset, and outperform state-of-the-art methods.

Notes

  1. http://www.clef-initiative.eu/.

  2. http://www.llorenteycuenca.com/.

  3. http://nlp.uned.es/replab2014/.

  4. https://github.com/CompNet/Influence.

  5. http://dx.doi.org/10.6084/m9.figshare.1506785.

  6. http://docs.oracle.com.

References

  1. Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: ICWSM

  2. Aleahmad A, Karisani P, Rahgozar M, Oroumchian F (2014) University of Tehran at replab 2014. In: 4th international conference of the CLEF initiative

  3. Amigó E, Carrillo-de Albornoz J, Chugur I, Corujo A, Gonzalo J, Meij E, de Rijke M, Spina D (2014) Overview of replab 2014: author profiling and reputation dimensions for online reputation management. In: Information access evaluation. Multilinguality, multimodality, and interaction, pp. 307–322

  4. Anger I, Kittl C (2011) Measuring influence on Twitter. i-KNOW, pp. 1–4

  5. Armentano MG, Godoy DL, Amandi AA (2011) A topology-based approach for followees recommendation in Twitter. In: Workshop chairs, p. 22

  6. Bakshy E, Hofman JM, Mason WA, Watts DJ (2011) Everyone’s an influencer: quantifying influence on Twitter. In: WSDM, pp. 65–74

  7. Bavelas A (1950) Communication patterns in task-oriented groups. J Acoust Soc Am 22(6):725–730

  8. Benevenuto F, Magno F, Rodrigues T, Almeida V (2010) Detecting spammers on Twitter. In: CEAS

  9. Bonacich PF (1987) Power and centrality: a family of measures. Am J Soc 92:1170–1182

  10. Bond RM, Fariss CJ, Jones JJ, Kramer ADI, Marlow C, Settle JE, Fowler JH (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489(7415):295–298

  11. Boyd D, Golder S, Lotan G (2010) Tweet, tweet, retweet: conversational aspects of retweeting on twitter. In: HICSS, pp. 1–10

  12. Buckley C, Voorhees EM (2000) Evaluating evaluation measure stability. In: ACM SIGIR, pp. 33–40

  13. Cha M, Haddadi H, Benevenuto F, Gummadi K (2010) Measuring user influence in Twitter: the million follower fallacy. In: ICWSM

  14. Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geo-locating Twitter users. In: CIKM, pp. 759–768

  15. Chu Z, Gianvecchio S, Wang H, Jajodia S (2012) Detecting automation of Twitter accounts: are you a human, bot, or cyborg? IEEE Trans Dependable Secure Comput 9(6):811–824

  16. Conover MD, Goncalves B, Ratkiewicz J, Flammini A, Menczer F (2011) Predicting the political alignment of Twitter users. In: IEEE SocialCom, pp. 192–199

  17. Cossu JV, Dugué N, Labatut V (2015) Detecting real-world influence through Twitter. In: ENIC, pp. 83–90

  18. Cossu JV, Janod K, Ferreira E, Gaillard J, El-Bèze M (2014) Lia@replab 2014: 10 methods for 3 tasks. In: 4th international conference of the CLEF initiative

  19. Cossu JV, Janod K, Ferreira E, Gaillard J, El-Bèze M (2015) Nlp-based classifiers to generalize experts assessments in e-reputation. In: Experimental IR meets multilinguality, multimodality, and interaction

  20. da Fontoura Costa L, Rodrigues FA, Travieso G, Villas Boas PR (2007) Characterization of complex networks: a survey of measurements. Adv Phys 56(1):167–242

  21. Danisch M, Dugué N, Perez A (2014) On the importance of considering social capitalism when measuring influence on Twitter. In: Behavioral, economic, and socio-cultural computing

  22. de-Choudhury M, Diakopoulos N, Naaman M (2012) Unfolding the event landscape on Twitter: classification and exploration of user categories. In: ACM CSCW, pp. 241–244

  23. de Silva L, Riloff E (2014) User type classification of tweets with implications for event recognition. In: Joint workshop on social dynamics and personal attributes in social media, pp. 98–108

  24. Dugué N, Labatut V, Perez A (2014) Identifying the community roles of social capitalists in the twitter network. In: IEEE/ACM ASONAM, Beijing, pp. 371–374

  25. Dugué N, Perez A (2014) Social capitalists on Twitter: detection, evolution and behavioral analysis. Social Network Analysis and Mining, Springer, 4(1):1–15

  26. Dugué N, Perez A, Danisch M, Bridoux F, Daviau A, Kolubako T, Munier S, Durbano H (2015) A reliable and evolutive web application to detect social capitalists. In: IEEE/ACM ASONAM exhibits and demos

  27. Estrada E, Rodriguez-Velazquez JA (2005) Subgraph centrality in complex networks. Phys Rev E 71(5):056103

  28. Fornell C (1992) A national customer satisfaction barometer: the Swedish experience. J Mark, pp. 6–21

  29. Freeman LC, Roeder D, Mulholland RR (1979) Centrality in social networks: II. Experimental results. Soc Netw 2(2):119–141

  30. Garcia R, Amatriain X (2010) Weighted content based methods for recommending connections in online social networks. In: Workshop on recommender systems and the social web, Citeseer, pp. 68–71

  31. Gaussier E, Yvon F (2013) Opinion detection as a topic classification problem. In: Textual information access: statistical models, chap. 9, Wiley, New York, pp. 245–256

  32. Gayo-Avello D (2012) A balanced survey on election prediction using Twitter data. arXiv

  33. Ghosh S, Viswanath B, Kooti F, Sharma N, Korlam G, Benevenuto F, Ganguly N, Gummadi K (2012) Understanding and combating link farming in the Twitter social network. In: WWW, pp. 61–70

  34. Golder SA, Yardi S, Marwick A, Boyd D (2009) A structural approach to contact recommendations in online social networks. In: Workshop on search in social media, SSM

  35. Greenfield R (2014) The latest Twitter hack: talking to yourself. http://www.fastcompany.com/3029748/the-latest-twitter-hack-talking-to-yourself. Accessed 5 Feb 2014

  36. Guimerà R, Amaral LN (2005) Cartography of complex networks: modules and universal roles. J Stat Mech 02:02001

  37. Harary F (1969) Graph theory. Addison-Wesley, Boston

  38. Henseler J (2010) On the convergence of the partial least squares path modeling algorithm. Comput Stat 25(1):107–120

  39. Huang W, Weber I, Vieweg S (2014) Inferring nationalities of Twitter users and studying inter-national linking. In: ACM Hypertext

  40. Java A, Song X, Finin T, Tseng B (2007) Why we twitter: understanding microblogging usage and communities. In: WebKDD/SNA-KDD, pp. 56–65

  41. Kim YM, Velcin J, Bonnevay S, Rizoiu MA (2015) Temporal multinomial mixture for instance-oriented evolutionary clustering. In: Advances in information retrieval

  42. Kred (2015) Kred story. http://www.kred.com. Accessed 12 Feb 2014

  43. Kywe SM, Lim EP, Zhu F (2012) A survey of recommender systems in twitter. In: Social informatics, pp. 420–433. Springer

  44. Laasby G (2014) Blocking fake Twitter followers and spam accounts just got easier. http://www.jsonline.com/blogs/news/280303802.html. Accessed Apr 2014

  45. Lancichinetti A, Kivelä M, Saramäki J, Fortunato S (2010) Characterizing the community structure of complex networks. PLoS One 5(8):e11976

  46. Landherr A, Friedl B, Heidemann J (2010) A critical review of centrality measures in social networks. Bus Inf Syst Eng 2(6):371–385

  47. Lee K, Caverlee J, Webb S (2010) Uncovering social spammers: social honeypots + machine learning. In: ACM SIGIR, pp. 435–442

  48. Lee K, Eoff BD, Caverlee J (2011) Seven months with the devils: a long-term study of content polluters on Twitter. In: ICWSM

  49. Lee K, Mahmud J, Chen J, Zhou M, Nichols J (2014) Who will retweet this? automatically identifying and engaging strangers on twitter to spread information. In: ACM IUI, pp. 247–256

  50. Lee K, Tamilarasan P, Caverlee J (2013) Crowdturfers, campaigns, and social media: tracking and revealing crowdsourced manipulation of social media. In: ICWSM

  51. Mahmud J, Nichols J, Drews C (2012) Where is this tweet from? inferring home locations of Twitter users. In: ICWSM

  52. Makazhanov A, Rafiei D (2013) Predicting political preference of Twitter users. In: IEEE/ACM ASONAM, pp. 298–305

  53. Mena Lomeña JJ, López Ostenero F (2014) Uned at clef replab 2014: author profiling. In: 4th international conference of the CLEF initiative

  54. Messias J, Schmidt L, Oliveira R, Benevenuto F (2013) You followed my bot! transforming robots into influential users in Twitter. First Monday 18(7)

  55. Naaman M, Boase J, Lai CH (2010) Is it really about me?: message content in social awareness streams. In: ACM CSCW, pp. 189–192

  56. Orman GK, Labatut V, Cherifi H (2012) Comparative evaluation of community detection algorithms: a topological approach. J Stat Mech 8:08001

  57. Pennacchiotti M, Popescu AM (2011) A machine learning approach to Twitter user classification. In: ICWSM, pp. 281–288

  58. Pramanik S, Danisch M, Wang Q, Mitra B (2015) An empirical approach towards an efficient “whom to mention?” Twitter app. Twitter for research, 1st international interdisciplinary conference

  59. Ramage D, Dumais S, Liebling D (2010) Characterizing microblogs with topic models. In: ICWSM, pp. 130–137

  60. Rangel F, Celli F, Rosso P, Potthast M, Stein B, Daelemans W (2015) Overview of the 3rd author profiling task at PAN 2015. In: Experimental IR meets multilinguality, multimodality, and interaction

  61. Rangel F, Rosso P, Chugur I, Potthast M, Trenkmann M, Stein B, Verhoeven B, Daelemans W (2014) Overview of the 2nd author profiling task at pan 2014. In: CLEF evaluation labs and workshop

  62. Rao A, Spasojevic N, Li Z, DSouza T (2015) Klout score: measuring influence across multiple social networks. arXiv

  63. Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in Twitter. In: CIKM SMUC workshop, pp. 37–44

  64. Ramírez-de-la Rosa G, Villatoro-Tello E, Jiménez-Salazar H, Sánchez-Sánchez C (2014) Towards automatic detection of user influence in Twitter by means of stylistic and behavioral features. In: Human-inspired computing and its applications, Springer, pp. 245–256

  65. Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Nat Acad Sci 105(4):1118

  66. Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

  67. Sriram B, Fuhry D, Demir E, Ferhatosmanoglu H, Demirbas M (2010) Short text classification in Twitter to improve information filtering. In: ACM SIGIR, pp. 841–842

  68. Suh B, Hong L, Pirolli P, Chi EH (2010) Want to be retweeted? large scale analytics on factors impacting retweet in Twitter network. In: Social computing, pp. 177–184

  69. Tenenhaus M, Amato S, Esposito Vinzi V (2004) A global goodness-of-fit index for PLS structural equation modelling. In: XLII SIS scientific meeting, vol 1, pp. 739–742

  70. Tommasel A, Godoy D (2015) A novel metric for assessing user influence based on user behaviour. In: Soc Inf, pp. 15–21

  71. Torres-Moreno JM (2012) Artex is another text summarizer. arXiv preprint. arXiv:1210.3312

  72. Uddin MM, Imran M, Sajjad H (2014) Understanding types of users on Twitter. arXiv cs.SI, 1406.1335

  73. Vilares D, Hermo M, Alonso MA, Gómez-Rodríguez C, Vilares J (2014) Lys at clef replab 2014: creating the state of the art in author influence ranking and reputation classification on Twitter. In: 4th international conference of the CLEF initiative, pp. 1468–1478

  74. Villatoro-Tello E, Ramirez-de-la Rosa G, Sanchez-Sanchez C, Jiménez-Salazar H, Luna-Ramirez WA, Rodriguez-Lucatero C (2014) Uamclyr at replab 2014: author profiling task. In: 4th international conference of the CLEF initiative

  75. Wang AH (2010) Don’t follow me: spam detection in Twitter. In: International conference on security and cryptography, pp. 1–10

  76. Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):440–442

  77. Weng J, Lim EP, Jiang J, He Q (2010) TwitterRank: finding topic-sensitive influential twitterers. In: WSDM, pp. 261–270

  78. Weren ERD, Kauer AU, Mizusaki L, Moreira VP, de Oliveira JPM, Wives LK (2014) Examining multiple features for author profiling. J Inf Data Manag 5(3):266

  79. Wold H (1982) Soft modeling: the basic design and some extensions. In: Systems under indirect observations: causality, structure, prediction, pp. 36–37

Acknowledgments

This work is a revised and extended version of the article Detecting Real-World Influence Through Twitter, presented at the 2nd European Network Intelligence Conference (ENIC 2015) by the same authors (Cossu et al. 2015). It was partly funded by the French National Research Agency (ANR), through the project ImagiWeb ANR-2012-CORD-002-01.

Author information

Corresponding author

Correspondence to Jean-Valère Cossu.

Centrality measures

In the following descriptions, we denote by \(G=(V,E)\) the considered co-occurrence graph, where V and E are its sets of nodes and links, respectively.

The degree measure d(u) is quite straightforward: it is the number of links attached to a node u. So in our case, it can be interpreted as the number of words co-occurring with the word of interest. More formally, we note \(N(u)=\{v\in V:\{u,v\}\in E \}\) the neighborhood of node u, i.e., the set of nodes connected to u in G. The degree \(d(u)=|N(u)|\) of a node u is the cardinality of its neighborhood, i.e., its number of neighbors.
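The definitions of the neighborhood and the degree can be illustrated with a minimal sketch. The word graph below is a hypothetical toy example, and the helper name `neighborhoods` is an assumption introduced for illustration, not part of the original work.

```python
# Hypothetical toy co-occurrence graph: each pair is an undirected link {u, v}.
edges = [("cat", "dog"), ("cat", "fish"), ("dog", "fish"), ("fish", "bird")]

def neighborhoods(edges):
    """Build the neighborhood N(u) of every node u from a list of undirected links."""
    N = {}
    for u, v in edges:
        N.setdefault(u, set()).add(v)
        N.setdefault(v, set()).add(u)
    return N

N = neighborhoods(edges)
# d(u) = |N(u)|: number of distinct words co-occurring with u.
degree = {u: len(N[u]) for u in N}
```

In this toy graph, "fish" co-occurs with three different words and therefore has the highest degree.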

The betweenness centrality \(C_b(u)\) measures how much a node u lies on the shortest paths connecting other nodes. It is a measure of accessibility (Freeman et al. 1979):

$$\begin{aligned} C_b(u) = \sum _{v < w}\frac{\sigma _{vw}(u)}{\sigma _{vw}}, \end{aligned}$$
(8)

where \(\sigma _{vw}\) is the total number of shortest paths from node v to node w, and \(\sigma _{vw}(u)\) is the number of shortest paths from v to w running through node u.
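As an illustration of Eq. (8), the sketch below computes the betweenness by brute force on a small undirected graph. It relies on the fact that u lies on a shortest path from v to w exactly when \(dist(v,u)+dist(u,w)=dist(v,w)\), in which case \(\sigma_{vw}(u)=\sigma_{vu}\,\sigma_{uw}\). The toy graph and the function names are assumptions made for this example, and this direct approach is only meant for small graphs.

```python
from collections import deque
from itertools import combinations

def bfs_counts(N, s):
    """Return (dist, sigma): geodesic distances from s and numbers of shortest paths from s."""
    dist, sigma = {s: 0}, {s: 1}
    queue = deque([s])
    while queue:
        x = queue.popleft()
        for y in N[x]:
            if y not in dist:            # first time y is reached
                dist[y] = dist[x] + 1
                sigma[y] = 0
                queue.append(y)
            if dist[y] == dist[x] + 1:   # x precedes y on some shortest path
                sigma[y] += sigma[x]
    return dist, sigma

def betweenness(N):
    """C_b(u) = sum over unordered pairs {v, w} of sigma_vw(u) / sigma_vw."""
    info = {s: bfs_counts(N, s) for s in N}
    C_b = {u: 0.0 for u in N}
    for v, w in combinations(sorted(N), 2):
        dist_v, sigma_v = info[v]
        if w not in dist_v:              # v and w are not connected
            continue
        for u in N:
            if u in (v, w) or u not in dist_v:
                continue
            dist_u, sigma_u = info[u]
            if w in dist_u and dist_v[u] + dist_u[w] == dist_v[w]:
                C_b[u] += sigma_v[u] * sigma_u[w] / sigma_v[w]
    return C_b

# Hypothetical toy graph: a square a-b-c-d with a pendant node e attached to a.
N = {"a": {"b", "d", "e"}, "b": {"a", "c"}, "c": {"b", "d"},
     "d": {"a", "c"}, "e": {"a"}}
C_b = betweenness(N)
```

Node a, which connects the pendant node to the rest of the graph, obtains the largest score, while the pendant node itself lies on no shortest path between other nodes.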

The closeness centrality \(C_c(u)\) quantifies how near a node u is to the rest of the network (Bavelas 1950):

$$\begin{aligned} C_c(u) = \frac{1}{\sum _{v \in V} dist(u,v)}, \end{aligned}$$
(9)

where \(dist(u,v)\) is the geodesic distance between nodes u and v, i.e., the length of the shortest path between these nodes.
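Eq. (9) can be sketched with a breadth-first search, which yields geodesic distances in an unweighted graph. The example assumes a connected graph (otherwise some distances are undefined); the toy graph and the helper name `geodesic_distances` are hypothetical.

```python
from collections import deque

def geodesic_distances(N, source):
    """Breadth-first search: dist(source, v) for every node v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in N[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

# Hypothetical toy graph (a path a - b - c), given as neighborhoods N(u).
N = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}

# C_c(u) = 1 / sum of the distances from u to all nodes (dist(u, u) = 0 adds nothing).
closeness = {u: 1.0 / sum(geodesic_distances(N, u).values()) for u in N}
```

The middle node b is at distance 1 from both other nodes, so it gets the highest closeness.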

The eigenvector centrality \(C_e(u)\) measures the influence of a node u in the network based on the spectrum of its adjacency matrix. The eigenvector centrality of each node is proportional to the sum of the centralities of its neighbors (Bonacich 1987):

$$\begin{aligned} C_e(u) = \frac{1}{\lambda }\sum _{v \in N(u)}C_e(v) \end{aligned}$$
(10)

Here, \(\lambda\) is the largest eigenvalue of the graph adjacency matrix.
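One common way to approximate the solution of Eq. (10) is power iteration. The sketch below is an illustration under assumptions, not the procedure used in the original work: it iterates with \(A+I\) instead of \(A\), which keeps the same eigenvectors but avoids the oscillations that plain power iteration shows on bipartite graphs. The star-shaped toy graph and all names are hypothetical.

```python
import math

def eigenvector_centrality(N, iterations=200):
    """Approximate the leading eigenvector of the adjacency matrix by shifted power iteration."""
    x = {u: 1.0 for u in N}
    for _ in range(iterations):
        # One step of x <- (A + I) x, followed by Euclidean normalization.
        y = {u: x[u] + sum(x[v] for v in N[u]) for u in N}
        norm = math.sqrt(sum(val * val for val in y.values()))
        x = {u: val / norm for u, val in y.items()}
    return x

# Hypothetical toy graph: a star with one hub and three leaves.
N = {"hub": {"x", "y", "z"}, "x": {"hub"}, "y": {"hub"}, "z": {"hub"}}
C_e = eigenvector_centrality(N)

# At convergence, Eq. (10) holds, so lambda can be read off any node.
lam = sum(C_e[v] for v in N["hub"]) / C_e["hub"]
```

The resulting scores are defined up to a multiplicative constant; only their relative values are meaningful.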

The subgraph centrality \(C_s(u)\) is based on the number of closed walks containing a node u (Estrada and Rodriguez-Velazquez 2005). Closed walks are used here as proxies to represent subgraphs (both cyclic and acyclic) of a certain size. When computing the centrality, each walk is given a weight that decreases quickly as its length grows (the inverse of the factorial of the length):

$$\begin{aligned} C_s(u) = \sum _{\ell =0}^{\infty }\frac{\left( A^\ell \right) _{uu}}{\ell !}, \end{aligned}$$
(11)

where A is the adjacency matrix of G, and therefore \(\left( A^\ell \right) _{uu}\) corresponds to the number of closed walks of length \(\ell\) starting and ending at u.
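Eq. (11) can be approximated by truncating the series after a fixed number of terms, since the factorial in the denominator makes the remaining terms negligible for small graphs. The sketch below does so with plain integer matrix products; the triangle graph, the number of terms and the function name are assumptions chosen for illustration.

```python
import math

def subgraph_centrality(nodes, edges, terms=30):
    """Truncated series: sum over l < terms of (A^l)_uu / l!."""
    idx = {u: i for i, u in enumerate(nodes)}
    n = len(nodes)
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[idx[u]][idx[v]] = A[idx[v]][idx[u]] = 1
    power = [[int(i == j) for j in range(n)] for i in range(n)]  # A^0 = identity
    C_s = {u: 0.0 for u in nodes}
    for l in range(terms):
        for u in nodes:
            # (A^l)_uu counts the closed walks of length l from u back to u.
            C_s[u] += power[idx[u]][idx[u]] / math.factorial(l)
        # Next power: A^(l+1) = A^l * A.
        power = [[sum(power[i][k] * A[k][j] for k in range(n)) for j in range(n)]
                 for i in range(n)]
    return C_s

# Hypothetical toy graph: a triangle.
nodes = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("a", "c")]
C_s = subgraph_centrality(nodes, edges)
```

For a triangle, the adjacency matrix has eigenvalues 2, -1 and -1, so each node should obtain \((e^{2}+2e^{-1})/3\), which gives a convenient check of the truncated sum.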

The eccentricity E(u) of a node u is its furthest (geodesic) distance to any other node in the network (Harary 1969):

$$\begin{aligned} E(u) = \max _{v \in V}(dist(u,v)) \end{aligned}$$
(12)
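Eq. (12) only requires the geodesic distances from u, which a breadth-first search provides in an unweighted graph. The following sketch assumes a connected graph; the path-shaped toy graph and the helper names are hypothetical.

```python
from collections import deque

def geodesic_distances(N, source):
    """Breadth-first search: dist(source, v) for every node v reachable from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in N[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def eccentricity(N, u):
    """E(u): largest geodesic distance from u to any other node."""
    return max(geodesic_distances(N, u).values())

# Hypothetical toy graph: a path a - b - c - d.
N = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
E = {u: eccentricity(N, u) for u in N}
```

The end nodes of the path have the largest eccentricity, and the two inner nodes the smallest.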

The local transitivity T(u) of a node u is obtained by dividing the number of links existing among its neighbors by the maximal number of links that could exist if all of them were connected (Watts and Strogatz 1998):

$$\begin{aligned} T(u) = \dfrac{|\{\{v,w\}\in E: v \in N(u) \wedge w \in N(u)\}|}{d(u)(d(u)-1)/2}, \end{aligned}$$
(13)

where the denominator corresponds to the binomial coefficient \(\left( {\begin{array}{c}d(u)\\ 2\end{array}}\right)\). This measure ranges from 0 (no connected neighbors) to 1 (all neighbors are connected).
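A minimal sketch of Eq. (13) follows. The formula is undefined for nodes with fewer than two neighbors (the denominator is zero); returning 0 in that case is a convention assumed here, not something stated in the text. The toy graph and the function name are hypothetical.

```python
from itertools import combinations

def local_transitivity(N, u):
    """T(u): fraction of pairs of neighbors of u that are themselves linked."""
    d = len(N[u])
    if d < 2:
        return 0.0  # assumed convention: no pair of neighbors exists
    links = sum(1 for v, w in combinations(N[u], 2) if w in N[v])
    return links / (d * (d - 1) / 2)

# Hypothetical toy graph: a triangle a-b-c, plus a node d attached to a.
N = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
T = {u: local_transitivity(N, u) for u in N}
```

Node a has three neighbors, hence three possible pairs, of which only one (b and c) is linked; nodes b and c each have all their neighbors connected.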

The embeddedness e(u) represents the proportion of neighbors of a node u belonging to its own community (Lancichinetti et al. 2010). The community structure of a network corresponds to a partition of its node set, defined in such a way that a maximum of links are located inside the parts while a minimum of them lie between the parts. We note c(u) the community of node u, i.e., the part that contains u. Based on this, we can define the internal neighborhood of a node u as the subset of its neighborhood located in its own community: \(N^{int}(u)=N(u) \cap c(u)\). Then, the internal degree \(d^{int}(u)=|N^{int}(u)|\) is defined as the cardinality of the internal neighborhood, i.e., the number of neighbors the node u has in its own community. Finally, the embeddedness is the following ratio:

$$\begin{aligned} e(u) = \frac{ d^{int}(u)}{d(u)} \end{aligned}$$
(14)

It ranges from 0 (no neighbors in the node community) to 1 (all neighbors in the node community).
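The internal degree and the embeddedness of Eq. (14) can be sketched as follows, given a partition of the nodes. Both the toy graph and the community assignment are hypothetical and fixed by hand; in practice, the partition would come from a community detection method. Nodes without neighbors are not handled, since the ratio is undefined for them.

```python
# Hypothetical toy graph: a triangle a-b-c linked to a pair d-e through the link c-d.
N = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
     "d": {"c", "e"}, "e": {"d"}}
# Hypothetical partition: community[u] plays the role of c(u).
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}

def internal_degree(N, community, u):
    """Number of neighbors of u located in u's own community."""
    return sum(1 for v in N[u] if community[v] == community[u])

# e(u) = internal degree of u divided by its total degree d(u).
embeddedness = {u: internal_degree(N, community, u) / len(N[u]) for u in N}
```

Nodes c and d, which carry the only link between the two communities, are the only ones whose embeddedness is below 1.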

The last two measures were proposed by Guimerà and Amaral (2005) to characterize the community role of nodes. For a node u, the within-module degree z(u) is defined as the z-score of the internal degree, computed relative to its community c(u):

$$\begin{aligned} z(u) = \frac{ d^{int}(u)-\mu (d^{int},c(u))}{\sigma (d^{int},c(u))}, \end{aligned}$$
(15)

where \(\mu\) and \(\sigma\) denote the mean and standard deviation of \(d^{int}\) over all nodes belonging to the community of u, respectively. This measure expresses how strongly a node is connected to the other nodes of its community, relative to that community. By comparison, the embeddedness is not normalized with respect to the community, but with respect to the node degree.
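The z-score of Eq. (15) can be sketched as below. Two choices are assumptions of this example rather than statements from the text: the population standard deviation is used, and z is set to 0 when all nodes of a community have the same internal degree (the formula is undefined in that case). The toy graph and the partition are hypothetical.

```python
from statistics import mean, pstdev

def within_module_degree(N, community):
    """z(u): z-score of the internal degree of u within its own community."""
    d_int = {u: sum(1 for v in N[u] if community[v] == community[u]) for u in N}
    z = {}
    for c in set(community.values()):
        members = [u for u in N if community[u] == c]
        values = [d_int[u] for u in members]
        mu, sigma = mean(values), pstdev(values)  # population statistics (assumption)
        for u in members:
            # Assumed convention: z = 0 when the community has no degree variation.
            z[u] = (d_int[u] - mu) / sigma if sigma > 0 else 0.0
    return z

# Hypothetical toy graph: a star centered on a (community 0), linked to a pair e-f (community 1).
N = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a", "e"},
     "e": {"d", "f"}, "f": {"e"}}
community = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}
z = within_module_degree(N, community)
```

The center of the star stands out with a positive z-score, whereas its leaves get the same negative value.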

The participation coefficient is based on the notion of community degree, which is a generalization of the internal degree: \(d_{i}(u)=|N(u) \cap C_{i}|\), where \(C_{i}\) denotes community number i. This degree \(d_{i}(u)\) corresponds to the number of links a node u has with nodes belonging to community number i. The participation coefficient is defined as:

$$\begin{aligned} P(u) = 1-\sum _{1 \le i \le k} \left( \frac{d_i(u)}{d(u)}\right) ^{2}, \end{aligned}$$
(16)

where k is the number of communities, i.e., the number of parts in the partition. P characterizes the distribution of the neighbors of a node over the community structure. More precisely, it measures the heterogeneity of this distribution: it approaches 1 when the neighbors are uniformly distributed over all the communities (its maximum is \(1-1/k\)), and equals 0 when they are all gathered in the same community.
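Eq. (16) can be illustrated with a short sketch that counts, for each node, how many of its neighbors fall in each community. The toy graph, the hand-made partition and the function name are hypothetical, and nodes without neighbors are not handled because the ratio is undefined for them.

```python
from collections import Counter

def participation(N, community, u):
    """P(u) = 1 - sum over communities i of (d_i(u) / d(u))^2."""
    counts = Counter(community[v] for v in N[u])  # d_i(u) for each community i
    d = len(N[u])
    return 1.0 - sum((d_i / d) ** 2 for d_i in counts.values())

# Hypothetical toy graph: a star centered on a (community 0), linked to a pair e-f (community 1).
N = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a", "e"},
     "e": {"d", "f"}, "f": {"e"}}
community = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}
P = {u: participation(N, community, u) for u in N}
```

With two communities, the maximum is \(1-1/2=0.5\), reached by nodes d and e, whose neighbors are split evenly between both communities; nodes with all their neighbors in a single community get 0.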

Both community role measures are defined independently from the method used for community detection (provided it identifies mutually exclusive communities). In this work, we applied the InfoMap method (Rosvall and Bergstrom 2008), which was deemed very efficient in previous studies (Orman et al. 2012).

Cite this article

Cossu, JV., Labatut, V. & Dugué, N. A review of features for the discrimination of twitter users: application to the prediction of offline influence. Soc. Netw. Anal. Min. 6, 25 (2016). https://doi.org/10.1007/s13278-016-0329-x

Keywords

  • Twitter
  • Influence
  • Natural language processing
  • Social network analysis