## Abstract

Many works related to Twitter aim at characterizing its users in some way: role on the service (spammers, bots, organizations, etc.), nature of the user (socio-professional category, age, etc.), topics of interest, and others. However, for a given user classification problem, it is very difficult to select a set of appropriate features, because the many features described in the literature are very heterogeneous, with name overlaps and collisions, and numerous very close variants. In this article, we review a wide range of such features. In order to present a clear state-of-the-art description, we unify their names, definitions and relationships, and we propose a new, neutral typology. We then illustrate the usefulness of our review by applying a selection of these features to the offline influence detection problem. This task consists of identifying users who are influential in real life, based on their Twitter account and related data. We show that most features deemed efficient to predict online influence, such as the numbers of retweets and followers, are not relevant to this problem. Instead, we propose several content-based approaches to label Twitter users as influencers or not. We also rank them according to a predicted influence level. Our proposals are evaluated on the CLEF RepLab 2014 dataset, and outperform state-of-the-art methods.

## References

Al Zamal F, Liu W, Ruths D (2012) Homophily and latent attribute inference: inferring latent attributes of Twitter users from neighbors. In: ICWSM

Aleahmad A, Karisani P, Rahgozar M, Oroumchian F (2014) University of Tehran at replab 2014. In: 4th international conference of the CLEF initiative

Amigó E, Carrillo-de Albornoz J, Chugur I, Corujo A, Gonzalo J, Meij E, de Rijke M, Spina D (2014) Overview of replab 2014: author profiling and reputation dimensions for online reputation management. In: Information access evaluation. Multilinguality, multimodality, and interaction, pp. 307–322

Anger I, Kittl C (2011) Measuring influence on Twitter. i-KNOW, pp. 1–4

Armentano MG, Godoy DL, Amandi AA (2011) A topology-based approach for followees recommendation in Twitter. In: Workshop chairs, p. 22

Bakshy E, Hofman JM, Mason WA, Watts DJ (2011) Everyone’s an influencer: quantifying influence on Twitter. In: WSDM, pp. 65–74

Bavelas A (1950) Communication patterns in task-oriented groups. J Acoust Soc Am 22(6):725–730

Benevenuto F, Magno F, Rodrigues T, Almeida V (2010) Detecting spammers on Twitter. In: CEAS

Bonacich PF (1987) Power and centrality: a family of measures. Am J Soc 92:1170–1182

Bond RM, Fariss CJ, Jones JJ, Kramer ADI, Marlow C, Settle JE, Fowler JH (2012) A 61-million-person experiment in social influence and political mobilization. Nature 489(7415):295–298

Boyd D, Golder S, Lotan G (2010) Tweet, tweet, retweet: conversational aspects of retweeting on twitter. In: HICSS, pp. 1–10

Buckley C, Voorhees EM (2000) Evaluating evaluation measure stability. In: ACM SIGIR, pp. 33–40

Cha M, Haddadi H, Benevenuto F, Gummadi K (2010) Measuring user influence in Twitter: the million follower fallacy. In: ICWSM

Cheng Z, Caverlee J, Lee K (2010) You are where you tweet: a content-based approach to geo-locating Twitter users. In: CIKM, pp. 759–768

Chu Z, Gianvecchio S, Wang H, Jajodia S (2012) Detecting automation of Twitter accounts: are you a human, bot, or cyborg? IEEE Trans Dependable Secure Comput 9(6):811–824

Conover MD, Goncalves B, Ratkiewicz J, Flammini A, Menczer F (2011) Predicting the political alignment of Twitter users. In: IEEE SocialCom, pp. 192–199

Cossu JV, Dugué N, Labatut V (2015) Detecting real-world influence through Twitter. In: ENIC, pp. 83–90

Cossu JV, Janod K, Ferreira E, Gaillard J, El-Bèze M (2014) Lia@replab 2014: 10 methods for 3 tasks. In: 4th international conference of the CLEF initiative

Cossu JV, Janod K, Ferreira E, Gaillard J, El-Bèze M (2015) Nlp-based classifiers to generalize experts assessments in e-reputation. In: Experimental IR meets multilinguality, multimodality, and interaction

da Fontoura Costa L, Rodrigues FA, Travieso G, Villas Boas PR (2007) Characterization of complex networks: a survey of measurements. Adv Phys 56(1):167–242

Danisch M, Dugué N, Perez A (2014) On the importance of considering social capitalism when measuring influence on Twitter. In: Behavioral, economic, and socio-cultural computing

de-Choudhury M, Diakopoulos N, Naaman M (2012) Unfolding the event landscape on Twitter: classification and exploration of user categories. In: ACM CSCW, pp. 241–244

de Silva L, Riloff E (2014) User type classification of tweets with implications for event recognition. In: Joint workshop on social dynamics and personal attributes in social media, pp. 98–108

Dugué N, Labatut V, Perez A (2014) Identifying the community roles of social capitalists in the twitter network. In: IEEE/ACM ASONAM, Beijing, pp. 371–374

Dugué N, Perez A (2014) Social capitalists on Twitter: detection, evolution and behavioral analysis. Social Network Analysis and Mining, Springer, 4(1):1–15

Dugué N, Perez A, Danisch M, Bridoux F, Daviau A, Kolubako T, Munier S, Durbano H (2015) A reliable and evolutive web application to detect social capitalists. In: IEEE/ACM ASONAM exhibits and demos

Estrada E, Rodriguez-Velazquez JA (2005) Subgraph centrality in complex networks. Phys Rev E 71(5):056103

Fornell C (1992) A national customer satisfaction barometer: the Swedish experience. J Mark. pp. 6–21

Freeman LC, Roeder D, Mulholland RR (1979) Centrality in social networks: II. Experimental results. Soc Netw 2(2):119–141

Garcia R, Amatriain X (2010) Weighted content based methods for recommending connections in online social networks. In: Workshop on recommender systems and the social web, Citeseer, pp. 68–71

Gaussier E, Yvon F (2013) Opinion detection as a topic classification problem. In: Textual information access: statistical models, chap. 9, Wiley, New York, pp. 245–256

Gayo-Avello D (2012) A balanced survey on election prediction using twitter data. arXiv

Ghosh S, Viswanath B, Kooti F, Sharma N, Korlam G, Benevenuto F, Ganguly N, Gummadi K (2012) Understanding and combating link farming in the Twitter social network. In: WWW, pp. 61–70

Golder SA, Yardi S, Marwick A, Boyd D (2009) A structural approach to contact recommendations in online social networks. In: Workshop on search in social media, SSM

Greenfield R (2014) The latest Twitter hack: talking to yourself. http://www.fastcompany.com/3029748/the-latest-twitter-hack-talking-to-yourself. Accessed 5 Feb 2014

Guimerà R, Amaral LN (2005) Cartography of complex networks: modules and universal roles. J Stat Mech 02:02001

Harary F (1969) Graph theory. Addison-Wesley, Boston

Henseler J (2010) On the convergence of the partial least squares path modeling algorithm. Comput Stat 25(1):107–120

Huang W, Weber I, Vieweg S (2014) Inferring nationalities of Twitter users and studying inter-national linking. In: ACM Hypertext

Java A, Song X, Finin T, Tseng B (2007) Why we twitter: understanding microblogging usage and communities. In: WebKDD/SNA-KDD, pp. 56–65

Kim YM, Velcin J, Bonnevay S, Rizoiu MA (2015) Temporal multinomial mixture for instance-oriented evolutionary clustering. In: Advances in information retrieval

Kred (2015) Kred story. http://www.kred.com. Accessed 12 Feb 2014

Kywe SM, Lim EP, Zhu F (2012) A survey of recommender systems in twitter. In: Social informatics, pp. 420–433. Springer

Laasby G (2014) Blocking fake Twitter followers and spam accounts just got easier. http://www.jsonline.com/blogs/news/280303802.html. Accessed Apr 2014

Lancichinetti A, Kivelä M, Saramäki J, Fortunato S (2010) Characterizing the community structure of complex networks. PLoS One 5(8):e11976

Landherr A, Friedl B, Heidemann J (2010) A critical review of centrality measures in social networks. Bus Inf Syst Eng 2(6):371–385

Lee K, Caverlee J, Webb S (2010) Uncovering social spammers: social honeypots + machine learning. In: ACM SIGIR, pp. 435–442

Lee K, Eoff BD, Caverlee J (2011) Seven months with the devils: a long-term study of content polluters on Twitter. In: ICWSM

Lee K, Mahmud J, Chen J, Zhou M, Nichols J (2014) Who will retweet this? automatically identifying and engaging strangers on twitter to spread information. In: ACM IUI, pp. 247–256

Lee K, Tamilarasan P, Caverlee J (2013) Crowdturfers, campaigns, and social media: tracking and revealing crowdsourced manipulation of social media. In: ICWSM

Mahmud J, Nichols J, Drews C (2012) Where is this tweet from? inferring home locations of Twitter users. In: ICWSM

Makazhanov A, Rafiei D (2013) Predicting political preference of Twitter users. In: IEEE/ACM ASONAM, pp. 298–305

Mena Lomeña JJ, López Ostenero F (2014) Uned at clef replab 2014: author profiling. In: 4th international conference of the CLEF initiative

Messias J, Schmidt L, Oliveira R, Benevenuto F (2013) You followed my bot! transforming robots into influential users in Twitter. First Monday 18(7)

Naaman M, Boase J, Lai CH (2010) Is it really about me?: message content in social awareness streams. In: ACM CSCW, pp. 189–192

Orman GK, Labatut V, Cherifi H (2012) Comparative evaluation of community detection algorithms: a topological approach. J Stat Mech 8:08001

Pennacchiotti M, Popescu AM (2011) A machine learning approach to Twitter user classification. In: ICWSM, pp. 281–288

Pramanik S, Danisch M, Wang Q, Mitra B (2015) An empirical approach towards an efficient “whom to mention?” Twitter app. Twitter for research, 1st international interdisciplinary conference

Ramage D, Dumais S, Liebling D (2010) Characterizing microblogs with topic models. In: ICWSM, pp. 130–137

Rangel F, Celli F, Rosso P, Potthast M, Stein B, Daelemans W (2015) Overview of the 3rd author profiling task at PAN 2015. In: Experimental IR meets multilinguality, multimodality, and interaction

Rangel F, Rosso P, Chugur I, Potthast M, Trenkmann M, Stein B, Verhoeven B, Daelemans W (2014) Overview of the 2nd author profiling task at pan 2014. In: CLEF evaluation labs and workshop

Rao A, Spasojevic N, Li Z, DSouza T (2015) Klout score: measuring influence across multiple social networks. arXiv

Rao D, Yarowsky D, Shreevats A, Gupta M (2010) Classifying latent user attributes in Twitter. In: CIKM SMUC workshop, pp. 37–44

Ramírez-de-la Rosa G, Villatoro-Tello E, Jiménez-Salazar H, Sánchez-Sánchez C (2014) Towards automatic detection of user influence in Twitter by means of stylistic and behavioral features. In: Human-inspired computing and its applications, Springer, pp. 245–256

Rosvall M, Bergstrom CT (2008) Maps of random walks on complex networks reveal community structure. Proc Nat Acad Sci 105(4):1118

Sparck Jones K (1972) A statistical interpretation of term specificity and its application in retrieval. J Doc 28(1):11–21

Sriram B, Fuhry D, Demir E, Ferhatosmanoglu H, Demirbas M (2010) Short text classification in Twitter to improve information filtering. In: ACM SIGIR, pp. 841–842

Suh B, Hong L, Pirolli P, Chi EH (2010) Want to be retweeted? large scale analytics on factors impacting retweet in Twitter network. In: Social computing, pp. 177–184

Tenenhaus M, Amato S, Esposito Vinzi V (2004) A global goodness-of-fit index for PLS structural equation modelling. In: XLII SIS scientific meeting, vol 1, pp. 739–742

Tommasel A, Godoy D (2015) A novel metric for assessing user influence based on user behaviour. In: Soc Inf, pp. 15–21

Torres-Moreno JM (2012) Artex is another text summarizer. arXiv preprint. arXiv:1210.3312

Uddin MM, Imran M, Sajjad H (2014) Understanding types of users on Twitter. arXiv cs.SI, 1406.1335

Vilares D, Hermo M, Alonso MA, Gómez-Rodríguez C, Vilares J (2014) Lys at clef replab 2014: creating the state of the art in author influence ranking and reputation classification on Twitter. In: 4th international conference of the CLEF initiative, pp. 1468–1478

Villatoro-Tello E, Ramirez-de-la Rosa G, Sanchez-Sanchez C, Jiménez-Salazar H, Luna-Ramirez WA, Rodriguez-Lucatero C (2014) Uamclyr at replab 2014: author profiling task. In: 4th international conference of the CLEF initiative

Wang AH (2010) Don’t follow me: spam detection in Twitter. In: International conference on security and cryptography, pp. 1–10

Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393(6684):440–442

Weng J, Lim EP, Jiang J, He Q (2010) TwitterRank: finding topic-sensitive influential twitterers. In: WSDM, pp. 261–270

Weren ERD, Kauer AU, Mizusaki L, Moreira VP, de Oliveira JPM, Wives LK (2014) Examining multiple features for author profiling. J Inf Data Manag 5(3):266

Wold H (1982) Soft modeling: the basic design and some extensions. In: Systems under indirect observations: causality, structure, prediction, pp. 36–37

## Acknowledgments

This work is a revised and extended version of the article *Detecting Real-World Influence Through Twitter*, presented at the 2nd European Network Intelligence Conference (ENIC 2015) by the same authors (Cossu et al. 2015). It was partly funded by the French National Research Agency (ANR), through the project ImagiWeb ANR-2012-CORD-002-01.

## Centrality measures

In the following descriptions, we denote by \(G=(V,E)\) the considered co-occurrence graph, where *V* and *E* are its sets of nodes and links, respectively.

The degree measure *d*(*u*) is quite straightforward: it is the number of links attached to a node *u*. In our case, it can be interpreted as the number of words co-occurring with the word of interest. More formally, we denote by \(N(u)=\{v\in V:\{u,v\}\in E \}\) the neighborhood of node *u*, i.e., the set of nodes connected to *u* in *G*. The degree \(d(u)=|N(u)|\) of a node *u* is the cardinality of its neighborhood, i.e., its number of neighbors.
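The neighborhood and degree definitions translate directly into a few lines of code. The following is a minimal Python sketch, assuming the co-occurrence graph is given as a list of undirected word pairs. The function names `build_neighborhoods` and `degree` and the toy links are hypothetical, chosen only for illustration.

```python
from collections import defaultdict

def build_neighborhoods(edges):
    """Map each node u to its neighborhood N(u) in an undirected graph."""
    neighbors = defaultdict(set)
    for u, v in edges:
        if u != v:  # assumption: self-loops are ignored
            neighbors[u].add(v)
            neighbors[v].add(u)
    return dict(neighbors)

def degree(neighbors, u):
    """d(u) = |N(u)|; a node absent from the graph has degree 0."""
    return len(neighbors.get(u, set()))

# Hypothetical toy co-occurrence links between words
toy_edges = [("twitter", "influence"), ("twitter", "user"),
             ("user", "influence"), ("user", "spam")]
toy_neighbors = build_neighborhoods(toy_edges)
toy_degree = degree(toy_neighbors, "user")
```

Storing neighborhoods as sets means that a pair listed twice is counted only once, which matches the set-based definition of \(N(u)\).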

The betweenness centrality \(C_b(u)\) measures how much a node *u* lies on the shortest paths connecting other nodes. It is a measure of accessibility (Freeman et al. 1979):

\[ C_b(u) = \sum_{v \neq u \neq w} \frac{\sigma _{vw}(u)}{\sigma _{vw}} \]

where \(\sigma _{vw}\) is the total number of shortest paths from node *v* to node *w*, and \(\sigma _{vw}(u)\) is the number of shortest paths from *v* to *w* running through node *u*.
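To make the role of \(\sigma _{vw}\) and \(\sigma _{vw}(u)\) concrete, here is a small brute-force Python sketch. It assumes the usual unnormalized form of the measure (the sum of \(\sigma _{vw}(u)/\sigma _{vw}\) over pairs of other nodes), an undirected unweighted graph stored as a dictionary of neighbor sets, and each unordered pair counted once. The function names and the toy graph are hypothetical, and faster algorithms exist for large graphs.

```python
from collections import deque
from itertools import combinations

def shortest_path_counts(neighbors, source):
    """BFS from source: geodesic distances and numbers of shortest paths."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in neighbors[x]:
            if y not in dist:            # first visit fixes the distance
                dist[y] = dist[x] + 1
                sigma[y] = 0
                queue.append(y)
            if dist[y] == dist[x] + 1:   # x precedes y on a shortest path
                sigma[y] += sigma[x]
    return dist, sigma

def betweenness(neighbors, u):
    """Unnormalized C_b(u), each unordered pair {v, w} counted once."""
    info = {s: shortest_path_counts(neighbors, s) for s in neighbors}
    dist_u, sigma_u = info[u]
    total = 0.0
    for v, w in combinations([x for x in neighbors if x != u], 2):
        dist_v, sigma_v = info[v]
        if u not in dist_v or w not in dist_v:
            continue                     # v cannot reach u or w
        if dist_v[u] + dist_u[w] == dist_v[w]:
            # shortest v-w paths through u = (v-u paths) * (u-w paths)
            total += sigma_v[u] * sigma_u[w] / sigma_v[w]
    return total

# Hypothetical 4-cycle a-b-c-d-a
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
square_score = betweenness(square, "b")
```

In the 4-cycle, only the pair (a, c) has shortest paths passing through b, and b lies on one of its two shortest paths.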

The closeness centrality \(C_c(u)\) quantifies how near a node *u* is to the rest of the network (Bavelas 1950):

\[ C_c(u) = \frac{1}{\sum _{v \in V} \mathit{dist}(u,v)} \]

where *dist*(*u*, *v*) is the *geodesic distance* between nodes *u* and *v*, i.e., the length of the shortest path between these nodes.
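As an illustration, the sketch below computes geodesic distances with a breadth-first search and takes the inverse of their sum. It assumes an undirected, unweighted graph stored as a dictionary of neighbor sets. Summing only over reachable nodes and returning 0 for an isolated node are assumptions of this sketch, and the function names are hypothetical.

```python
from collections import deque

def geodesic_distances(neighbors, source):
    """Length of the shortest path from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        x = queue.popleft()
        for y in neighbors[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return dist

def closeness(neighbors, u):
    """Inverse of the sum of geodesic distances from u (reachable nodes only)."""
    total = sum(geodesic_distances(neighbors, u).values())
    return 1.0 / total if total > 0 else 0.0

# Hypothetical path graph a-b-c
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
middle_score = closeness(path, "b")
```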

The eigenvector centrality \(C_e(u)\) measures the influence of a node *u* in the network based on the spectrum of its adjacency matrix. The eigenvector centrality of each node is proportional to the sum of the centralities of its neighbors (Bonacich 1987):

\[ C_e(u) = \frac{1}{\lambda } \sum _{v \in N(u)} C_e(v) \]

Here, \(\lambda\) is the largest eigenvalue of the graph adjacency matrix.
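One common way to obtain such scores is power iteration, sketched below in Python under stated assumptions: the graph is connected and undirected, and it is stored as a dictionary of neighbor sets. The identity shift (iterating with \(A+I\) instead of \(A\)) is an addition of this sketch to avoid oscillation on bipartite graphs; it does not change the dominant eigenvector. The function name and the toy star graph are hypothetical.

```python
import math

def eigenvector_centrality(neighbors, iterations=200):
    """Power iteration on A + I; returns unit-norm scores and an estimate of lambda."""
    nodes = list(neighbors)
    x = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # One multiplication by (A + I): own score plus the neighbors' scores
        y = {u: x[u] + sum(x[v] for v in neighbors[u]) for u in nodes}
        norm = math.sqrt(sum(value * value for value in y.values()))
        x = {u: value / norm for u, value in y.items()}
    # Rayleigh quotient x^T A x (x has unit norm) estimates the largest eigenvalue
    lam = sum(x[u] * sum(x[v] for v in neighbors[u]) for u in nodes)
    return x, lam

# Hypothetical star graph: one hub linked to three leaves
star = {"hub": {"l1", "l2", "l3"}, "l1": {"hub"}, "l2": {"hub"}, "l3": {"hub"}}
scores, lam = eigenvector_centrality(star)
```

On the returned scores, each value can be checked against the defining relation: it should equal the sum of the neighbors' scores divided by the estimated \(\lambda\).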

The subgraph centrality \(C_s(u)\) is based on the number of closed walks containing a node *u* (Estrada and Rodriguez-Velazquez 2005). Closed walks are used here as proxies to represent subgraphs (both cyclic and acyclic) of a certain size. When computing the centrality, each walk is given a weight that decreases quickly with its length (a walk of length \(\ell\) is weighted by \(1/\ell !\)):

\[ C_s(u) = \sum _{\ell =0}^{\infty } \frac{\left( A^\ell \right) _{uu}}{\ell !} \]

where *A* is the adjacency matrix of *G*, and therefore \(\left( A^\ell \right) _{uu}\) corresponds to the number of closed walks of length \(\ell\) starting and ending at *u*.
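The weighted count of closed walks can be approximated by truncating the sum over walk lengths. The Python sketch below assumes the standard weighting of a walk of length \(\ell\) by \(1/\ell !\), an undirected graph stored as a dictionary of neighbor sets, and a hypothetical cut-off `max_length`. It uses plain nested lists for the matrix powers, which is only practical for small graphs.

```python
import math

def subgraph_centrality(neighbors, max_length=30):
    """Truncated sum over walk lengths of (A^l)_uu / l! for every node u."""
    nodes = sorted(neighbors)
    n = len(nodes)
    adjacency = [[1.0 if nodes[j] in neighbors[nodes[i]] else 0.0
                  for j in range(n)] for i in range(n)]
    # power holds A^length, starting from the identity matrix (length 0)
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    scores = [0.0] * n
    for length in range(max_length + 1):
        for i in range(n):
            scores[i] += power[i][i] / math.factorial(length)
        power = [[sum(power[i][k] * adjacency[k][j] for k in range(n))
                  for j in range(n)] for i in range(n)]
    return {u: scores[i] for i, u in enumerate(nodes)}

# Hypothetical graph made of a single link a-b
single_link = {"a": {"b"}, "b": {"a"}}
link_scores = subgraph_centrality(single_link)
```

For a single link, the only closed walks from a node go back and forth along that link, so exactly one closed walk exists for each even length and the sum converges to \(\cosh (1)\).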

The *Eccentricity* *E*(*u*) of a node *u* is its furthest (geodesic) distance to any other node in the network (Harary 1969):

\[ E(u) = \max _{v \in V} \mathit{dist}(u,v) \]
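The eccentricity only requires the largest geodesic distance from the node, which a breadth-first search provides. The following Python sketch assumes a connected, undirected, unweighted graph stored as a dictionary of neighbor sets; restricting the maximum to reachable nodes is an assumption of the sketch, and the function name is hypothetical.

```python
from collections import deque

def eccentricity(neighbors, u):
    """Largest geodesic distance from u to any node reachable from u."""
    dist = {u: 0}
    queue = deque([u])
    while queue:
        x = queue.popleft()
        for y in neighbors[x]:
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(dist.values())

# Hypothetical path graph a-b-c-d
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
end_score = eccentricity(chain, "a")
```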

The *Local Transitivity* *T*(*u*) of a node *u* is obtained by dividing the number of links existing among its neighbors by the maximal number of links that could exist if all of them were connected (Watts and Strogatz 1998):

\[ T(u) = \frac{\left| \left\{ \{v,w\} \in E : v,w \in N(u) \right\} \right| }{d(u)\left( d(u)-1\right) /2} \]

where the denominator corresponds to the binomial coefficient \(\binom{d(u)}{2}\). This measure ranges from 0 (no connected neighbors) to 1 (all neighbors are connected).
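A direct way to evaluate this ratio is to enumerate the pairs of neighbors and count those that are linked. The Python sketch below assumes an undirected graph stored as a dictionary of neighbor sets. Returning 0 for nodes with fewer than two neighbors, where the ratio is undefined, is an assumption of the sketch, and the function name is hypothetical.

```python
from itertools import combinations

def local_transitivity(neighbors, u):
    """Fraction of pairs of neighbors of u that are themselves linked."""
    d = len(neighbors[u])
    if d < 2:
        return 0.0  # assumption: undefined case mapped to 0
    linked = sum(1 for v, w in combinations(neighbors[u], 2)
                 if w in neighbors[v])
    return linked / (d * (d - 1) / 2)

# Hypothetical graph: triangle a-b-c plus a pendant node d attached to c
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
corner_score = local_transitivity(toy, "a")
```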

The *Embeddedness* *e*(*u*) represents the proportion of neighbors of a node *u* belonging to its own community (Lancichinetti et al. 2010). The community structure of a network corresponds to a partition of its node set, defined in such a way that a maximum of links are located *inside* the parts while a minimum of them lie *between* the parts. We denote by *c*(*u*) the community of node *u*, i.e., the part that contains *u*. Based on this, we can define the *internal neighborhood* of a node *u* as the subset of its neighborhood located in its own community: \(N^{int}(u)=N(u) \cap c(u)\). Then, the *internal degree* \(d^{int}(u)=|N^{int}(u)|\) is defined as the cardinality of the internal neighborhood, i.e., the number of neighbors the node *u* has in its own community. Finally, the embeddedness is the following ratio:

\[ e(u) = \frac{d^{int}(u)}{d(u)} \]

It ranges from 0 (no neighbors in the node community) to 1 (all neighbors in the node community).
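Given a partition of the nodes, the ratio of internal degree to degree is simple to compute. The Python sketch below assumes the graph is a dictionary of neighbor sets and the partition is a dictionary mapping each node to a community label. Both structures, the function name and the toy data are hypothetical, and isolated nodes are assigned 0 by convention in this sketch.

```python
def embeddedness(neighbors, community, u):
    """Internal degree of u divided by its degree."""
    d = len(neighbors[u])
    if d == 0:
        return 0.0  # assumption: isolated nodes get 0
    internal = sum(1 for v in neighbors[u] if community[v] == community[u])
    return internal / d

# Hypothetical graph: two triangles joined by the link c-d
toy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
toy_community = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
bridge_score = embeddedness(toy, toy_community, "c")
```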

The last two measures were proposed by Guimerà and Amaral (2005) to characterize the community role of nodes. For a node *u*, the *Within Module Degree* *z*(*u*) is defined as the *z*-score of the internal degree, computed relative to its community *c*(*u*):

\[ z(u) = \frac{d^{int}(u) - \mu }{\sigma } \]

where \(\mu\) and \(\sigma\) denote the mean and standard deviation of \(d^{int}\) over all nodes belonging to the community of *u*, respectively. This measure expresses how strongly a node is connected to other nodes in its community, relative to this community. By comparison, the embeddedness is normalized not by the community, but by the node degree.
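The *z*-score can be sketched in Python as follows, assuming the graph is a dictionary of neighbor sets and the partition a dictionary from nodes to community labels. Using the population standard deviation and returning 0 when all internal degrees in the community are equal are assumptions of this sketch; the function names and toy data are hypothetical.

```python
import statistics

def internal_degree(neighbors, community, u):
    """Number of neighbors of u located in the community of u."""
    return sum(1 for v in neighbors[u] if community[v] == community[u])

def within_module_degree(neighbors, community, u):
    """z-score of the internal degree of u within its own community."""
    members = [v for v in neighbors if community[v] == community[u]]
    degrees = [internal_degree(neighbors, community, v) for v in members]
    mu = statistics.mean(degrees)
    sigma = statistics.pstdev(degrees)  # assumption: population std deviation
    if sigma == 0:
        return 0.0  # assumption: no spread in the community
    return (internal_degree(neighbors, community, u) - mu) / sigma

# Hypothetical graph: a small star (a, b, c, d) linked to a pair (e, f)
toy = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"},
       "d": {"a", "e"}, "e": {"d", "f"}, "f": {"e"}}
toy_community = {"a": 0, "b": 0, "c": 0, "d": 0, "e": 1, "f": 1}
hub_score = within_module_degree(toy, toy_community, "a")
```

In this toy partition, the internal degrees of community 0 are 3, 1, 1 and 1, so the hub a is the only node with a positive score in its community.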

The participation coefficient *P*(*u*) is based on the notion of community degree, which is a generalization of the internal degree: \(d_{i}(u)=|N(u) \cap C_{i}|\), where \(C_{i}\) denotes community number *i*. This degree \(d_{i}(u)\) corresponds to the number of links a node *u* has with nodes belonging to community number *i*. The participation coefficient is defined as:

\[ P(u) = 1 - \sum _{i=1}^{k} \left( \frac{d_{i}(u)}{d(u)} \right) ^2 \]

where *k* is the number of communities, i.e., the number of parts in the partition. *P* characterizes the distribution of the neighbors of a node over the community structure. More precisely, it measures the heterogeneity of this distribution: it gets close to 1 if all the neighbors are uniformly distributed among all the communities, and to 0 if they are all gathered in the same community.
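The following Python sketch illustrates the computation, assuming the form introduced by Guimerà and Amaral, i.e., one minus the sum over communities of the squared fraction of neighbors located in each community. The graph is a dictionary of neighbor sets and the partition a dictionary from nodes to community labels. The function name and the toy data are hypothetical, and isolated nodes are assigned 0 by convention in this sketch.

```python
from collections import Counter

def participation_coefficient(neighbors, community, u):
    """One minus the sum of squared fractions of neighbors per community."""
    d = len(neighbors[u])
    if d == 0:
        return 0.0  # assumption: isolated nodes get 0
    # Community degrees: number of neighbors of u in each community
    community_degrees = Counter(community[v] for v in neighbors[u])
    return 1.0 - sum((d_i / d) ** 2 for d_i in community_degrees.values())

# Hypothetical hub with two neighbors in each of two communities
toy = {"hub": {"p", "q", "r", "s"},
       "p": {"hub"}, "q": {"hub"}, "r": {"hub"}, "s": {"hub"}}
toy_community = {"hub": 0, "p": 0, "q": 0, "r": 1, "s": 1}
hub_score = participation_coefficient(toy, toy_community, "hub")
```

With neighbors spread perfectly evenly over *k* communities, the value is \(1-1/k\), so it approaches 1 only when the number of communities is large; in the toy example with two communities it is 0.5.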

Both community role measures are defined independently of the method used for community detection (provided it identifies mutually exclusive communities). In this work, we applied the InfoMap method (Rosvall and Bergstrom 2008), which was deemed very efficient in previous studies (Orman et al. 2012).

## About this article

### Cite this article

Cossu, JV., Labatut, V. & Dugué, N. A review of features for the discrimination of twitter users: application to the prediction of offline influence. *Soc. Netw. Anal. Min.* **6**, 25 (2016). https://doi.org/10.1007/s13278-016-0329-x

DOI: https://doi.org/10.1007/s13278-016-0329-x

### Keywords

- Influence
- Natural language processing
- Social network analysis