Abstract
Many works related to Twitter aim at characterizing its users in some way: role on the service (spammers, bots, organizations, etc.), nature of the user (socio-professional category, age, etc.), topics of interest, and others. However, for a given user classification problem, it is very difficult to select a set of appropriate features, because the many features described in the literature are very heterogeneous, with name overlaps and collisions, and numerous very close variants. In this article, we review a wide range of such features. In order to present a clear state-of-the-art description, we unify their names, definitions and relationships, and we propose a new, neutral typology. We then illustrate the value of our review by applying a selection of these features to the offline influence detection problem. This task consists in identifying users who are influential in real life, based on their Twitter account and related data. We show that most features deemed efficient to predict online influence, such as the numbers of retweets and followers, are not relevant to this problem. Instead, we propose several content-based approaches to label Twitter users as influencers or not, and to rank them according to a predicted influence level. Our proposals are evaluated on the CLEF RepLab 2014 dataset, and outperform state-of-the-art methods.


References
Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: ICWSM
Aleahmad A, Karisani P, Rahgozar M, Oroumchian F (2014) University of Tehran at replab 2014. In: 4th international conference of the CLEF initiative
Amigó E, Carrillo-de Albornoz J, Chugur I, Corujo A, Gonzalo J, Meij E, de Rijke M, Spina D (2014) Overview of replab 2014: author profiling and reputation dimensions for online reputation management. In: Information access evaluation. Multilinguality, multimodality, and interaction, pp. 307–322
Anger I, Kittl C (2011) Measuring influence on Twitter. i-KNOW, pp. 1–4
Armentano MG, Godoy DL, Amandi AA (2011) A topology-based approach for followees recommendation in Twitter. In: Workshop chairs, p. 22
Bakshy E, Hofman JM, Mason WA, Watts DJ (2011) Everyone’s an influencer: quantifying influence on Twitter. In: WSDM, pp. 65–74
Bavelas A (1950) Communication patterns in task-oriented groups. J Acoust Soc Am 22(6):725–730
Benevenuto F, Magno F, Rodrigues T, Almeida V (2010) Detecting spammers on Twitter. In: CEAS
Bonacich PF (1987) Power and centrality: a family of measures. Am J Soc 92:1170–1182
Bond RM, Fariss CJ, Jones JJ, Kramer ADI, Marlow C, Settle JE, Fowler JH (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489(7415):295–298
Boyd D, Golder S, Lotan G (2010) Tweet, tweet, retweet: conversational aspects of retweeting on twitter. In: HICSS, pp. 1–10
Buckley C, Voorhees EM (2000) Evaluating evaluation measure stability. In: ACM SIGIR, pp. 33–40
Cha M, Haddadi H, Benevenuto F, Gummadi K (2010) Measuring user influence in Twitter: the million follower fallacy. In: ICWSM
Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geo-locating Twitter users. In: CIKM, pp. 759–768
Chu Z, Gianvecchio S, Wang H, Jajodia S (2012) Detecting automation of Twitter accounts: are you a human, bot, or cyborg? IEEE Trans Dependable Secure Comput 9(6):811–824
Conover MD, Goncalves B, Ratkiewicz J, Flammini A, Menczer F (2011) Predicting the political alignment of Twitter users. In: IEEE SocialCom, pp. 192–199
Cossu JV, Dugué N, Labatut V (2015) Detecting real-world influence through Twitter. In: ENIC, pp. 83–90
Cossu JV, Janod K, Ferreira E, Gaillard J, El-Bèze M (2014) Lia@replab 2014: 10 methods for 3 tasks. In: 4th international conference of the CLEF initiative
Cossu JV, Janod K, Ferreira E, Gaillard J, El-Bèze M (2015) Nlp-based classifiers to generalize experts assessments in e-reputation. In: Experimental IR meets multilinguality, multimodality, and interaction
da Fontoura Costa L, Rodrigues FA, Travieso G, Villas Boas PR (2007) Characterization of complex networks: a survey of measurements. Adv Phys 56(1):167–242
Danisch M, Dugué N, Perez A (2014) On the importance of considering social capitalism when measuring influence on Twitter. In: Behavioral, economic, and socio-cultural computing
de-Choudhury M, Diakopoulos N, Naaman M (2012) Unfolding the event landscape on Twitter: classification and exploration of user categories. In: ACM CSCW, pp. 241–244
de Silva L, Riloff E (2014) User type classification of tweets with implications for event recognition. In: Joint workshop on social dynamics and personal attributes in social media, pp. 98–108
Dugué N, Labatut V, Perez A (2014) Identifying the community roles of social capitalists in the twitter network. In: IEEE/ACM ASONAM, Beijing, pp. 371–374
Dugué N, Perez A (2014) Social capitalists on Twitter: detection, evolution and behavioral analysis. Social Network Analysis and Mining, Springer, 4(1):1–15
Dugué N, Perez A, Danisch M, Bridoux F, Daviau A, Kolubako T, Munier S, Durbano H (2015) A reliable and evolutive web application to detect social capitalists. In: IEEE/ACM ASONAM exhibits and demos
Estrada E, Rodriguez-Velazquez JA (2005) Subgraph centrality in complex networks. Phys Rev E 71(5):056103
Fornell C (1992) A national customer satisfaction barometer: the Swedish experience. J Mark. pp. 6–21
Freeman LC, Roeder D, Mulholland RR (1979) Centrality in social networks: II. Experimental results. Soc Netw 2(2):119–141
Garcia R, Amatriain X (2010) Weighted content based methods for recommending connections in online social networks. In: Workshop on recommender systems and the social web, Citeseer, pp. 68–71
Gaussier E, Yvon F (2013) Opinion detection as a topic classification problem. In: Textual information access: statistical models, chap. 9, Wiley, New York, pp. 245–256
Gayo-Avello D (2012) A balanced survey on election prediction using twitter data. arXiv
Ghosh S, Viswanath B, Kooti F, Sharma N, Korlam G, Benevenuto F, Ganguly N, Gummadi K (2012) Understanding and combating link farming in the Twitter social network. In: WWW, pp. 61–70
Golder SA, Yardi S, Marwick A, Boyd D (2009) A structural approach to contact recommendations in online social networks. In: Workshop on search in social media, SSM
Greenfield R (2014) The latest Twitter hack: talking to yourself. http://www.fastcompany.com/3029748/the-latest-twitter-hack-talking-to-yourself. Accessed 5 Feb 2014
Guimerà R, Amaral LN (2005) Cartography of complex networks: modules and universal roles. J Stat Mech 02:02001
Harary F (1969) Graph theory. Addison-Wesley, Boston
Henseler J (2010) On the convergence of the partial least squares path modeling algorithm. Comput Stat 25(1):107–120
Huang W, Weber I, Vieweg S (2014) Inferring nationalities of Twitter users and studying inter-national linking. In: ACM Hypertext
Java A, Song X, Finin T, Tseng B (2007) Why we twitter: understanding microblogging usage and communities. In: WebKDD/SNA-KDD, pp. 56–65
Kim YM, Velcin J, Bonnevay S, Rizoiu MA (2015) Temporal multinomial mixture for instance-oriented evolutionary clustering. In: Advances in information retrieval
Kred (2015) Kred story. http://www.kred.com. Accessed 12 Feb 2014
Kywe SM, Lim EP, Zhu F (2012) A survey of recommender systems in twitter. In: Social informatics, pp. 420–433. Springer
Laasby G (2014) Blocking fake Twitter followers and spam accounts just got easier. http://www.jsonline.com/blogs/news/280303802.html. Accessed Apr 2014
Lancichinetti A, Kivelä M, Saramäki J, Fortunato S (2010) Characterizing the community structure of complex networks. PLoS One 5(8):e11976
Landherr A, Friedl B, Heidemann J (2010) A critical review of centrality measures in social networks. Bus Inf Syst Eng 2(6):371–385
Lee K, Caverlee J, Webb S (2010) Uncovering social spammers: social honeypots + machine learning. In: ACM SIGIR, pp. 435–442
Lee K, Eoff BD, Caverlee J (2011) Seven months with the devils: a long-term study of content polluters on Twitter. In: ICWSM
Lee K, Mahmud J, Chen J, Zhou M, Nichols J (2014) Who will retweet this? automatically identifying and engaging strangers on twitter to spread information. In: ACM IUI, pp. 247–256
Lee K, Tamilarasan P, Caverlee J (2013) Crowdturfers, campaigns, and social media: tracking and revealing crowdsourced manipulation of social media. In: ICWSM
Mahmud J, Nichols J, Drews C (2012) Where is this tweet from? inferring home locations of Twitter users. In: ICWSM
Makazhanov A, Rafiei D (2013) Predicting political preference of Twitter users. In: IEEE/ACM ASONAM, pp. 298–305
Mena Lomeña JJ, López Ostenero F (2014) Uned at clef replab 2014: author profiling. In: 4th international conference of the CLEF initiative
Messias J, Schmidt L, Oliveira R, Benevenuto F (2013) You followed my bot! transforming robots into influential users in Twitter. First Monday 18(7)
Naaman M, Boase J, Lai CH (2010) Is it really about me?: message content in social awareness streams. In: ACM CSCW, pp. 189–192
Orman GK, Labatut V, Cherifi H (2012) Comparative evaluation of community detection algorithms: a topological approach. J Stat Mech 8:08001
Pennacchiotti M, Popescu AM (2011) A machine learning approach to Twitter user classification. In: ICWSM, pp. 281–288
Pramanik S, Danisch M, Wang Q, Mitra B (2015) An empirical approach towards an efficient “whom to mention?” Twitter app. Twitter for research, 1st international interdisciplinary conference
Ramage D, Dumais S, Liebling D (2010) Characterizing microblogs with topic models. In: ICWSM, pp. 130–137
Rangel F, Celli F, Rosso P, Potthast M, Stein B, Daelemans W (2015) Overview of the 3rd author profiling task at PAN 2015. In: Experimental IR meets multilinguality, multimodality, and interaction
Rangel F, Rosso P, Chugur I, Potthast M, Trenkmann M, Stein B, Verhoeven B, Daelemans W (2014) Overview of the 2nd author profiling task at pan 2014. In: CLEF evaluation labs and workshop
Rao A, Spasojevic N, Li Z, DSouza T (2015) Klout score: measuring influence across multiple social networks. arXiv
Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in Twitter. In: CIKM SMUC workshop, pp. 37–44
Ramírez-de-la Rosa G, Villatoro-Tello E, Jiménez-Salazar H, Sánchez-Sánchez C (2014) Towards automatic detection of user influence in Twitter by means of stylistic and behavioral features. In: Human-inspired computing and its applications, Springer, pp. 245–256
Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Nat Acad Sci 105(4):1118
Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21
Sriram B, Fuhry D, Demir E, Ferhatosmanoglu H, Demirbas M (2010) Short text classification in Twitter to improve information filtering. In: ACM SIGIR, pp. 841–842
Suh B, Hong L, Pirolli P, Chi EH (2010) Want to be retweeted? large scale analytics on factors impacting retweet in Twitter network. In: Social computing, pp. 177–184
Tenenhaus M, Amato S, Esposito Vinzi V (2004) A global goodness-of-fit index for PLS structural equation modelling. In: XLII SIS scientific meeting, vol 1, pp. 739–742
Tommasel A, Godoy D (2015) A novel metric for assessing user influence based on user behaviour. In: Soc Inf, pp. 15–21
Torres-Moreno JM (2012) Artex is another text summarizer. arXiv preprint. arXiv:1210.3312
Uddin MM, Imran M, Sajjad H (2014) Understanding types of users on Twitter. arXiv cs.SI, 1406.1335
Vilares D, Hermo M, Alonso MA, Gómez-Rodríguez C, Vilares J (2014) Lys at clef replab 2014: creating the state of the art in author influence ranking and reputation classification on Twitter. In: 4th international conference of the CLEF initiative, pp. 1468–1478
Villatoro-Tello E, Ramirez-de-la Rosa G, Sanchez-Sanchez C, Jiménez-Salazar H, Luna-Ramirez WA, Rodriguez-Lucatero C (2014) Uamclyr at replab 2014: author profiling task. In: 4th international conference of the CLEF initiative
Wang AH (2010) Don’t follow me: spam detection in Twitter. In: International conference on security and cryptography, pp. 1–10
Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):440–442
Weng J, Lim EP, Jiang J, He Q (2010) TwitterRank: finding topic-sensitive influential twitterers. In: WSDM, pp. 261–270
Weren ERD, Kauer AU, Mizusaki L, Moreira VP, de Oliveira JPM, Wives LK (2014) Examining multiple features for author profiling. J Inf Data Manag 5(3):266
Wold H (1982) Soft modeling: the basic design and some extensions. In: Systems under indirect observations: causality, structure, prediction, pp. 36–37
Acknowledgments
This work is a revised and extended version of the article Detecting Real-World Influence Through Twitter, presented at the 2nd European Network Intelligence Conference (ENIC 2015) by the same authors (Cossu et al. 2015). It was partly funded by the French National Research Agency (ANR), through the project ImagiWeb ANR-2012-CORD-002-01.
Centrality measures
In the descriptions below, \(G=(V,E)\) denotes the co-occurrence graph under consideration, where V and E are its sets of nodes and links, respectively.
The degree measure d(u) is straightforward: it is the number of links attached to a node u. In our case, it can be interpreted as the number of words co-occurring with the word of interest. More formally, we denote by \(N(u)=\{v\in V:\{u,v\}\in E \}\) the neighborhood of node u, i.e., the set of nodes connected to u in G. The degree \(d(u)=|N(u)|\) of a node u is the cardinality of its neighborhood, i.e., its number of neighbors.
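As an illustration, the neighborhood and degree definitions can be sketched in a few lines of Python. The edge list below is a hypothetical toy co-occurrence graph, not data from this study, and the helper names are ours.

```python
from collections import defaultdict

# Hypothetical toy co-occurrence graph: each pair is an undirected link {u, v}.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("d", "e")]

def neighborhoods(edges):
    """Return N(u) for every node u, as a dict mapping nodes to sets."""
    N = defaultdict(set)
    for u, v in edges:
        N[u].add(v)
        N[v].add(u)
    return dict(N)

def degree(N, u):
    """d(u) = |N(u)|, the number of neighbors of u."""
    return len(N[u])

N = neighborhoods(edges)
degrees = {u: degree(N, u) for u in N}
print(degrees)  # {'a': 2, 'b': 2, 'c': 3, 'd': 2, 'e': 1}
```

Storing neighborhoods as sets makes the degree a constant-time lookup and silently ignores duplicated links.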
The betweenness centrality \(C_b(u)\) measures how much a node u lies on the shortest paths connecting other nodes. It is a measure of accessibility (Freeman et al. 1979):
\[C_b(u)=\sum _{v<w}\frac{\sigma _{vw}(u)}{\sigma _{vw}}\]
where \(\sigma _{vw}\) is the total number of shortest paths from node v to node w, and \(\sigma _{vw}(u)\) is the number of shortest paths from v to w running through node u.
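The following sketch shows one possible way to compute this quantity on a small graph. It uses the accumulation scheme of Brandes, which is our choice and not necessarily the procedure used in this work. It assumes an unweighted, undirected graph, counts each unordered pair {v, w} once, and excludes paths for which u is an endpoint. The toy graph is hypothetical.

```python
from collections import deque

def betweenness(N):
    """Unnormalized betweenness for an unweighted, undirected graph.
    N maps each node to the set of its neighbors."""
    Cb = {u: 0.0 for u in N}
    for s in N:
        # BFS from s: count shortest paths (sigma) and record predecessors.
        sigma = {u: 0 for u in N}
        dist = {u: -1 for u in N}
        preds = {u: [] for u in N}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in N[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate dependencies from the farthest nodes towards s.
        delta = {u: 0.0 for u in N}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                Cb[w] += delta[w]
    # Every unordered pair was visited twice, once from each endpoint.
    return {u: c / 2.0 for u, c in Cb.items()}

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
N = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
     "d": {"c", "e"}, "e": {"d"}}
Cb = betweenness(N)
print(Cb)  # c lies on 4 shortest paths between other nodes, d on 3
```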
The closeness centrality \(C_c(u)\) quantifies how near a node u is to the rest of the network (Bavelas 1950):
\[C_c(u)=\frac{1}{\sum _{v\in V}\mathrm{dist}(u,v)}\]
where dist(u, v) is the geodesic distance between nodes u and v, i.e., the length of the shortest path between these nodes.
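A minimal sketch of the closeness computation is given below. It assumes the unnormalized form (the reciprocal of the sum of geodesic distances) and a connected, unweighted graph; both are assumptions on our part, and the path graph used as input is hypothetical.

```python
from collections import deque

def bfs_distances(N, source):
    """Geodesic distance from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in N[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def closeness(N, u):
    """Reciprocal of the total distance from u to the other nodes."""
    total = sum(bfs_distances(N, u).values())
    return 1.0 / total if total > 0 else 0.0  # isolated node: 0 by convention

# Hypothetical path graph a - b - c.
N = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(closeness(N, "b"))  # 0.5, since dist(b,a) + dist(b,c) = 2
```

Some implementations multiply the result by the number of other nodes so that values are comparable across graphs of different sizes.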
The eigenvector centrality \(C_e(u)\) measures the influence of a node u in the network based on the spectrum of its adjacency matrix. The eigenvector centrality of each node is proportional to the sum of the centralities of its neighbors (Bonacich 1987):
\[C_e(u)=\frac{1}{\lambda }\sum _{v\in N(u)}C_e(v)\]
Here, \(\lambda\) is the largest eigenvalue of the graph adjacency matrix.
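One common way to obtain such a vector is power iteration, sketched below under the assumption of a connected, undirected graph. The update uses the shifted matrix A + I, which has the same leading eigenvector as A but does not oscillate on bipartite graphs; this shift, the unit-norm scaling and the toy graph are our own illustrative choices.

```python
def eigenvector_centrality(N, iterations=1000, tol=1e-12):
    """Power iteration on A + I; returns (centralities, largest eigenvalue of A)."""
    x = {u: 1.0 for u in N}
    for _ in range(iterations):
        # One step of (A + I) x, followed by Euclidean normalization.
        y = {u: x[u] + sum(x[v] for v in N[u]) for u in N}
        norm = sum(val * val for val in y.values()) ** 0.5
        y = {u: val / norm for u, val in y.items()}
        converged = max(abs(y[u] - x[u]) for u in N) < tol
        x = y
        if converged:
            break
    # Rayleigh quotient x^T A x (x has unit norm) estimates lambda.
    lam = sum(x[u] * sum(x[v] for v in N[u]) for u in N)
    return x, lam

# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
N = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
Ce, lam = eigenvector_centrality(N)
print(max(Ce, key=Ce.get))  # c, the best connected node
```

At convergence, multiplying each score by the returned eigenvalue gives the sum of the scores of the node's neighbors, which is the proportionality stated in the text.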
The subgraph centrality \(C_s(u)\) is based on the number of closed walks containing a node u (Estrada and Rodriguez-Velazquez 2005). Closed walks are used here as proxies to represent subgraphs (both cyclic and acyclic) of a certain size. When computing the centrality, each walk is given a weight that shrinks rapidly as its length grows:
\[C_s(u)=\sum _{\ell =0}^{\infty }\frac{\left( A^\ell \right) _{uu}}{\ell !}\]
where A is the adjacency matrix of G, and therefore \(\left( A^\ell \right) _{uu}\) corresponds to the number of closed walks of length \(\ell\) starting and ending at u.
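The sketch below approximates the subgraph centrality by truncating the series of matrix powers, assuming each closed walk of length l is weighted by 1/l! as in the original definition of Estrada and Rodriguez-Velazquez. The truncation length and the triangle used as input are illustrative assumptions.

```python
import math

def subgraph_centrality(A, max_len=30):
    """Truncated series sum_l (A^l)_uu / l! for every node u.
    A is a square adjacency matrix given as a list of lists."""
    n = len(A)
    # Start from A^0, the identity matrix.
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    Cs = [0.0] * n
    for l in range(max_len + 1):
        for u in range(n):
            Cs[u] += power[u][u] / math.factorial(l)
        # Move on to the next power: A^(l+1) = A^l * A.
        power = [[sum(power[i][k] * A[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return Cs

# Hypothetical toy graph: a triangle, where all nodes are equivalent.
A = [[0, 1, 1],
     [1, 0, 1],
     [1, 1, 0]]
Cs = subgraph_centrality(A)
print(Cs)
```

For the triangle, the number of closed walks of length l at a node is (2^l + 2(-1)^l)/3, so the exact value is (e^2 + 2/e)/3, which the truncated sum reproduces to high precision.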
The eccentricity E(u) of a node u is its largest geodesic distance to any other node in the network (Harary 1969):
\[E(u)=\max _{v\in V}\mathrm{dist}(u,v)\]
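A short sketch of the eccentricity is given below. It expands the graph level by level from u and returns the depth of the last non-empty level, assuming an unweighted graph; in a disconnected graph it only accounts for the nodes reachable from u. The path graph is a hypothetical example.

```python
def eccentricity(N, u):
    """Largest geodesic distance from u to any node reachable from it."""
    seen = {u}
    frontier = {u}
    depth = 0
    while True:
        # Nodes at distance depth + 1: unseen neighbors of the current level.
        nxt = {w for v in frontier for w in N[v]} - seen
        if not nxt:
            return depth
        seen |= nxt
        frontier = nxt
        depth += 1

# Hypothetical path graph a - b - c - d.
N = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
print(eccentricity(N, "a"), eccentricity(N, "b"))  # 3 2
```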
The local transitivity T(u) of a node u is obtained by dividing the number of links existing among its neighbors by the maximal number of links that could exist if all of them were connected (Watts and Strogatz 1998):
\[T(u)=\frac{\left| \left\{ \{v,w\}\in E: v,w\in N(u)\right\} \right| }{d(u)\left( d(u)-1\right) /2}\]
where the denominator corresponds to the binomial coefficient \(\left( {\begin{array}{c}d(u)\\ 2\end{array}}\right)\). This measure ranges from 0 (no connected neighbors) to 1 (all neighbors are connected).
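The ratio can be computed directly from the neighborhoods, as in the sketch below. Returning 0 for nodes with fewer than two neighbors is a convention we assume here, since the ratio is undefined in that case; the toy graph is hypothetical.

```python
from itertools import combinations

def local_transitivity(N, u):
    """Share of pairs of neighbors of u that are themselves linked."""
    d = len(N[u])
    if d < 2:
        return 0.0  # assumed convention: undefined ratio mapped to 0
    links = sum(1 for v, w in combinations(N[u], 2) if w in N[v])
    return links / (d * (d - 1) / 2)

# Hypothetical toy graph: a triangle a-b-c with a tail c-d-e.
N = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
     "d": {"c", "e"}, "e": {"d"}}
print(local_transitivity(N, "a"))  # 1.0: both neighbors of a are linked
print(local_transitivity(N, "d"))  # 0.0: c and e are not linked
```

Node c has three neighbors (a, b, d) and only the pair {a, b} is linked, which gives 1/3.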
The embeddedness e(u) represents the proportion of neighbors of a node u belonging to its own community (Lancichinetti et al. 2010). The community structure of a network corresponds to a partition of its node set, defined in such a way that a maximum of links are located inside the parts while a minimum of them lie between the parts. We denote by c(u) the community of node u, i.e., the part that contains u. Based on this, we can define the internal neighborhood of a node u as the subset of its neighborhood located in its own community: \(N^{int}(u)=N(u) \cap c(u)\). Then, the internal degree \(d^{int}(u)=|N^{int}(u)|\) is defined as the cardinality of the internal neighborhood, i.e., the number of neighbors the node u has in its own community. Finally, the embeddedness is the following ratio:
\[e(u)=\frac{d^{int}(u)}{d(u)}\]
It ranges from 0 (no neighbors in the node community) to 1 (all neighbors in the node community).
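The sketch below computes the embeddedness from a given partition. The graph and the community labels are hypothetical and hand-assigned; in practice the labels would come from a community detection method.

```python
def embeddedness(N, community, u):
    """Proportion of the neighbors of u located in the community of u.
    Assumes u has at least one neighbor."""
    d_int = sum(1 for v in N[u] if community[v] == community[u])
    return d_int / len(N[u])

# Hypothetical toy graph: two triangles joined by the link c-d.
N = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
     "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
# Hand-assigned partition: one community per triangle.
community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

print(embeddedness(N, community, "a"))  # 1.0: all neighbors are internal
```

Node c has two internal neighbors out of three, so its embeddedness is 2/3.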
The last two measures were proposed by Guimerà and Amaral (2005) to characterize the community role of nodes. For a node u, the within-module degree z(u) is defined as the z-score of the internal degree, computed relative to its community c(u):
\[z(u)=\frac{d^{int}(u)-\mu }{\sigma }\]
where \(\mu\) and \(\sigma\) denote the mean and standard deviation of \(d^{int}\) over all nodes belonging to the community of u, respectively. This measure expresses how much a node is connected to other nodes in its community, relative to this community. By comparison, the embeddedness is not normalized with respect to the community, but to the node degree.
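A possible implementation of the within-module degree is sketched below. It uses the population standard deviation and returns 0 when all nodes of a community have the same internal degree; both choices are assumptions on our part, as is the toy graph with its hand-assigned partition.

```python
from statistics import mean, pstdev

def internal_degree(N, community, u):
    """Number of neighbors of u located in the community of u."""
    return sum(1 for v in N[u] if community[v] == community[u])

def within_module_degree(N, community, u):
    """z-score of the internal degree of u within its own community."""
    members = [v for v in N if community[v] == community[u]]
    d_int = [internal_degree(N, community, v) for v in members]
    mu, sigma = mean(d_int), pstdev(d_int)
    if sigma == 0:
        return 0.0  # assumed convention when the community is homogeneous
    return (internal_degree(N, community, u) - mu) / sigma

# Hypothetical toy graph: a star centered on h, linked to a second community.
N = {"h": {"x", "y", "z", "o"}, "x": {"h"}, "y": {"h"}, "z": {"h"},
     "o": {"h", "p"}, "p": {"o"}}
community = {"h": 0, "x": 0, "y": 0, "z": 0, "o": 1, "p": 1}

z_h = within_module_degree(N, community, "h")
z_x = within_module_degree(N, community, "x")
print(z_h > 0 > z_x)  # True: the hub is above its community average
```

In the first community the internal degrees are 3, 1, 1 and 1, so the mean is 1.5 and the hub h stands out with a positive score while the leaves receive a negative one.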
The participation coefficient P(u) is based on the notion of community degree, which is a generalization of the internal degree: \(d_{i}(u)=|N(u) \cap C_{i}|\), where \(C_{i}\) is the i-th community. This degree \(d_{i}(u)\) corresponds to the number of links a node u has with nodes belonging to community number i. The participation coefficient is defined as:
\[P(u)=1-\sum _{i=1}^{k}\left( \frac{d_{i}(u)}{d(u)}\right) ^{2}\]
where k is the number of communities, i.e., the number of parts in the partition. P characterizes the distribution of the neighbors of a node over the community structure. More precisely, it measures the heterogeneity of this distribution: it gets close to 1 if all the neighbors are uniformly distributed among all the communities, and to 0 if they are all gathered in the same community.
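The sketch below computes the participation coefficient as one minus the sum of the squared shares of neighbors falling in each community, following the definition of Guimerà and Amaral (2005). The star graph and its partition are hypothetical.

```python
from collections import Counter

def participation(N, community, u):
    """1 - sum_i (d_i(u) / d(u))^2, assuming u has at least one neighbor."""
    # d_i(u): number of neighbors of u in each community i.
    counts = Counter(community[v] for v in N[u])
    d = len(N[u])
    return 1.0 - sum((c / d) ** 2 for c in counts.values())

# Hypothetical toy graph: u has two neighbors in each of two communities.
N = {"u": {"a", "b", "c", "d"}, "a": {"u"}, "b": {"u"},
     "c": {"u"}, "d": {"u"}}
community = {"u": 0, "a": 0, "b": 0, "c": 1, "d": 1}

print(participation(N, community, "u"))  # 0.5
print(participation(N, community, "a"))  # 0.0
```

With k communities and perfectly even neighbors, the value is 1 - 1/k, which is why two communities yield 0.5 here and why the coefficient only approaches 1 when k is large.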
Both community role measures are defined independently of the method used for community detection (provided it identifies mutually exclusive communities). In this work, we applied the InfoMap method (Rosvall and Bergstrom 2008), which was deemed very efficient in previous studies (Orman et al. 2012).
Cite this article
Cossu, JV., Labatut, V. & Dugué, N. A review of features for the discrimination of twitter users: application to the prediction of offline influence. Soc. Netw. Anal. Min. 6, 25 (2016). https://doi.org/10.1007/s13278-016-0329-x
DOI: https://doi.org/10.1007/s13278-016-0329-x
Keywords
- Influence
- Natural language processing
- Social network analysis