HURI: Hybrid user risk identification in social networks

The massive adoption of social networks increased the need to analyze users’ data and interactions to detect and block the spread of propaganda and harassment behaviors, as well as to prevent actions influencing people towards illegal or immoral activities. In this paper, we propose HURI, a method for social network analysis that accurately classifies users as safe or risky, according to their behavior in the social network. Specifically, the proposed hybrid approach leverages both the topology of the network of interactions and the semantics of the content shared by users, leading to an accurate classification also in the presence of noisy data, such as users who may appear to be risky due to the topic of their posts, but are actually safe according to their relationships. The strength of the proposed approach relies on the full and simultaneous exploitation of both aspects, giving each of them equal consideration during the combination phase. This characteristic makes HURI different from other approaches that fully consider only a single aspect and graft partial or superficial elements of the other into the first. The achieved performance in the analysis of a real-world Twitter dataset shows that the proposed method offers competitive performance with respect to eight state-of-the-art approaches.


Introduction
In modern society, social networks represent the most common way for millions of users to express their ideas, beliefs and preferences through posts, likes and comments. Such a massive adoption has attracted the interest of many companies and institutions worldwide. Indeed, the analysis of users' interactions and behavior may inspire the design of innovative products and services according to current trends and customer preferences. Users can also exploit social networks to spread information about important (ongoing or upcoming) events, or to raise awareness of emerging social causes and issues.
However, since social networks are also considered an innovative and effective tool for propaganda, they can also be used for harmful or illegal activities, such as i) harassment [1], ii) influencing people to adopt illegal practices, iii) promoting the use of drugs, or iv) spreading religious fundamentalism and political extremism. With regard to the latter, we can also find extreme situations, where terrorist communities exploit social networks, such as Twitter or Facebook, to disseminate their ideas and recruit new people.
Some approaches proposed in the literature rely on the analysis of either the network topology, i.e., the relationships among users [6][7][8][9], or the textual content the users post or interact with [10][11][12]. However, these approaches may be ineffective in the presence of noisy or misleading data. A typical case is that of journalists: they may share a large amount of content on topics involving words that are usually used in high-risk contexts, such as events related to the use of weapons or explosive devices. This aspect may push methods purely based on the analysis of textual content to erroneously label journalists as risky users (see the left side of Figure 1). On the other hand, journalists tend to be linked to (e.g., through the follows relationship on Twitter) users belonging to both safe and risky communities, mainly to stay up-to-date with the latest events. In this respect, methods purely based on the analysis of the network topology (i.e., only considering the users' relationships) will correctly label journalists as safe only if their relationships with other safe users are strongly predominant as compared to those with risky users (see the right side of Figure 1).
In order to overcome the limitations of existing approaches, in this paper, we propose a new method called HURI (Hybrid User Risk Identification in Social Networks) that is able to detect high-risk users in social networks, by analyzing the information conveyed by both the topology of the network and the content posted by the users. This is achieved by a hybrid approach that learns two models (i.e., one for each aspect) that are combined to make the final predictions.
Fig. 1 A graphical representation of common misclassification errors made on a noisy user (e.g., a journalist). On the left, we show a misclassification error commonly made by content-based systems on users who post/interact with apparently risky content, even if they are linked with several safe users. On the right, we show a misclassification error commonly made by topology-based systems on users who establish more relationships with risky users than with safe users, even if they post only safe content.
Hybrid approaches have already been proposed in the literature (see Section 2 for an overview). Among them, we can find approaches falling in the relational data mining field, as well as methods for the analysis of heterogeneous networks [13][14][15], that are able to model input data as entities and relationships. Although this characteristic provides them with the possibility to be applied in multiple domains, they are generally not able to exploit peculiarities exhibited by a specific domain, e.g., the semantics of the content generated by users. Other approaches (e.g., the methods proposed by [16] and [17]) represent one dimension (content or relationships, respectively) and inject additional features summarizing the other (relationships or content, respectively). This strategy usually gives, implicitly, a higher relevance to one dimension with respect to the other, and may introduce spurious or redundant features. In this context, the challenge we face with the proposed method HURI is represented by the explicit modeling of the semantics of the content generated by the users, as well as their relationships, without introducing spurious features and without assuming a higher relevance of one dimension with respect to the other. While this aspect represents one of the major advantages of HURI over other hybrid approaches, its other key contributions are summarized as follows: i) it captures the semantics of the content posted by risky and safe users, by learning two separate models based on AutoEncoders [18], that are able to represent/embed the users into a numerical feature space; ii) it represents the network of relationships established by the users and learns a separate predictive model based on decision trees; iii) it properly combines the contribution coming from both dimensions of analysis, by exploiting a stacked neural network that does not consider only the predictions, as usually done by common approaches based on Stacked Generalization [19], but also the confidence about such predictions. The latter provides HURI with the ability to capture and exploit the uncertainty of the predictions, making it more robust to the noise in the data.
Experiments conducted on a real-world Twitter dataset show that HURI is able to detect high-risk users more accurately than existing approaches and that it is also robust to the presence of noisy data (e.g., journalists).
The rest of the paper is organized as follows. In the next section, we briefly discuss the background and the motivations of the proposed hybrid approach. In Section 3 we review related work in node embedding and classification. In Section 4 we describe the proposed method. In Section 5 we describe the performed experiments and discuss the obtained results. Finally, in Section 6 we draw some conclusions and outline possible future works.

Background and motivations
The task solved in this paper falls in the research area of Social Network Analysis (SNA), that is, the study of social structures exploiting network and graph theory [20,21]. Although SNA has its roots in sociology [22], the concept has evolved over time and is being adopted in multiple fields, such as biology, economics, political science and computer science. Networks studied by SNA consist of nodes, that represent, for example, people or organizations, and edges between nodes, describing social relationships. Examples of tasks addressed by SNA in the literature include the identification of collaborations between academic co-authors [23], the study of the cohesion among political parties [24], the detection of compromised accounts [25], or the prediction of users involved in criminal incidents [26].
Regarding the last examples, some works in the literature proposed SNA approaches tailored for the detection of propaganda activities about terrorism in social networks. In [27], the authors describe SNA as a tool to fight this problem, and highlight the main tasks investigated in the counter-terrorism field, such as key-player identification [28,29], community discovery [30,31], link analysis [32,33] and dynamic network analysis [34,35].
The task we solve in this paper falls in the category of key-player identification, and aims at identifying, more generally, high-risk users, namely, users who may demonstrate any kind of negative behavior, or exercise a negative influence over the community. Therefore, this task can be considered a particular case of the node classification task in network data. In the literature, we can find several approaches to solve this task, that can be categorized into three main classes, depending on the underlying criteria adopted to define the similarity among users (see Figure 2 for a graphical representation):
• topology-based: they consider only the topology of the network of relationships, motivated by the assumption that the similarity among users can be estimated by considering their relationships;
• content-based: they focus on the analysis of the content (e.g., posts, comments) generated by users, assuming that similar users will generate or interact with content regarding similar topics;
• hybrid: they attempt to combine topology-based and content-based approaches, to exploit the advantages of both viewpoints.
As underlined in Section 1, one major challenge in real-world social networks is the effective identification of high-risk users in the presence of "noisy" data, e.g., safe users who can erroneously be classified as high-risk users when solely considering either the content they posted or their relationships. In this case, hybrid approaches should be able to produce more accurate predictions, since they observe two complementary aspects of the social networks. These considerations pushed us towards the design of the hybrid approach proposed in this paper.

Related work
In this section, we briefly discuss existing node embedding techniques, that are commonly adopted to represent network nodes in a numerical feature space, as well as existing approaches for node classification.

Node embedding techniques
In the literature many works address the task of node embedding in networks, namely the identification of a numerical feature space for nodes, that embeds the characteristics and the topological role of each node in the network. Among the most straightforward solutions, we can find dimensionality reduction techniques. In particular, by representing the network of relationships as an adjacency matrix, it is possible to apply methods like Singular Value Decomposition (SVD) [36], Principal Component Analysis (PCA) [37] or Non-negative Matrix Factorizations [38,39]. Such approaches can identify a reduced, numerical feature space, and deal with data sparsity issues, that are typical of adjacency matrices representing relationships in social networks.
On the other hand, in the literature we can also find methods specifically designed to solve node embedding tasks. For example, DeepWalk [40] aims at learning a feature space for nodes that preserves the closeness with their neighboring nodes in the network. The neighborhood of each node is identified by exploiting truncated random walks. A similar approach is adopted by node2vec [41]. Other methods, such as LINE [42] and SDNE [43], perform network embedding using both first-order (i.e., observed links in the network) and second-order (i.e., shared neighborhood among nodes) proximity, with the main goal of preserving both local and global network structure.
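To make the notion of truncated random walks concrete, the following is a minimal sketch of the walk-generation step only. The function names, the toy graph and the parameter values are illustrative assumptions and are not taken from [40]; in DeepWalk the generated walks are subsequently fed to a skip-gram model, which is not shown here.

```python
import random


def truncated_random_walk(adjacency, start, walk_length, rng):
    """Return a random walk of at most `walk_length` nodes starting at `start`."""
    walk = [start]
    while len(walk) < walk_length:
        neighbors = adjacency.get(walk[-1], [])
        if not neighbors:  # dead end: the walk is truncated early
            break
        walk.append(rng.choice(neighbors))
    return walk


def generate_walks(adjacency, walks_per_node, walk_length, seed=0):
    """Generate `walks_per_node` truncated walks rooted at every node."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adjacency)
        rng.shuffle(nodes)  # visit the root nodes in random order at each pass
        for node in nodes:
            walks.append(truncated_random_walk(adjacency, node, walk_length, rng))
    return walks


# Hypothetical toy graph: node -> list of neighbors.
toy_graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
walks = generate_walks(toy_graph, walks_per_node=2, walk_length=4)
```

Each walk can be read as a "sentence" of node identifiers, so that nodes co-occurring in the same walks play the role of words sharing a context.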
A different approach, called Hashtag2Vec [44], performs node embedding by exploiting the information conveyed by both the topological structure and the content. The proposed embedding model is able to learn a hashtag representation by optimizing a loss function that takes into account multiple types of relationships: hashtag-hashtag, hashtag-tweet, tweet-word and word-word. However, this method cannot be directly adapted to learn a representation for users, since it explicitly represents and exploits co-occurrence relationships among hashtags that cannot be mapped to friend or follow relationships among users.

Node classification methods
As mentioned in Section 2, methods for node classification available in the literature can be categorized into three classes: topology-based, content-based and hybrid. Topology-based methods focus on the link structure of the network [45] and exploit it for node classification. A relevant example is the system GNetMine [6], that is able to represent arbitrary, also heterogeneous, information networks, and to classify nodes according to their relationships. In general, methods falling into this category are based on collective inference, i.e., they make simultaneous judgements on the same variables of related nodes. In particular, they exploit the so-called relational autocorrelation, a phenomenon that takes place in relational data when the value of a property of one object is highly correlated with the value of the same property of another object [46]. Within this class of approaches, we can find an interesting work [7] that proposes a node-centric framework that exploits only information on the structure of class linkage in the network, that is, only links and class labels. Another work [8] addresses a challenging scenario falling into the within-network classification setting, in partially-labeled networks. Specifically, the authors combine statistical relational learning and semi-supervised learning to improve the classification performance in sparse networks, by adding "ghost edges" that enable the flow of information from labeled to unlabeled nodes.
The authors of [9] propose an active inference method that learns to identify the cases in which collective classification algorithms make mistakes, and suggests changes to correct such mistakes. The authors demonstrated that the proposed method outperforms several approaches based on network topology.
In [47], the authors aim to identify Sybil attacks in online social networks, where attackers attempt to carry out harmful actions while posing as (multiple) genuine users. To achieve this goal, the authors exploit the topology of the network, focusing on the strength and on the interactions of the users' relationships. They also incorporate graph-based features, such as betweenness-centrality, to enhance the identification of the attacks.
Focusing on content-based approaches, in [10] the authors propose a method to classify posts and users on Twitter into three different classes: positive, negative and neutral. To this aim, the authors exploit two lexicons, containing, respectively, "positive" and "negative" words. For each tweet in the dataset, a feature vector is constructed, where each feature represents the occurrence of each word (belonging to either the positive or the negative lexicons) in the tweet. The vector is then updated through Word2Vec [48] in order to consider both the semantics and the relationships among words. The obtained features are used to cluster tweets into positive, negative and neutral.
Another content-based approach is Doc2Vec [12], which is an extension of Word2Vec. Its goal is to create a numerical representation of a document, regardless of its length, that can be subsequently exploited by any classification approach based on feature vectors. Contrary to Word2Vec, that extracts semantic vector representations at a word level, Doc2Vec extracts semantic vector representations at a document level, learning distributed representations for both words and documents simultaneously.
A different approach is proposed in [11], whose goal is to detect the presence of cyberterrorism and extremism content in textual data. Together with classical weighting methods, like TF-IDF and binary weighting, the authors propose a novel "fuzzy set-based weighting method" that appears to be more appropriate for the specific task.
In [49], the authors present a study on keyword-based indicators and discuss their effectiveness in highlighting frustration and discrimination, and in estimating the risk of radicalization for users of the social network.
The work in [16] focuses on the analysis of a network in which nodes represent tweets, while edges represent hashtags and mentions. The authors show that the adoption of relational probability trees, with features built from both the content and the structure of the network, leads to an accurate user classification. However, social relationships, such as friend or follows, are not explicitly taken into account.
Shifting the focus to hybrid approaches, in [17] the authors propose a method based on Adaboost to analyze both content-based and topology-based features, to automatically detect extremist accounts on Twitter. The considered features include hashtags, the tokens included in hashtags, the harmonic closeness between a target node and the set of known ISIS supporter nodes, and the expected hitting time of a random walk from individual nodes to known ISIS nodes. However, like [16], this approach does not explicitly take into account the network of relationships, but only uses centrality measures.
In [50], the authors analyze a real-world dataset extracted from Instagram to identify influential users who may contribute to the dissemination of harmful information by advertising specific products. They focus on the content rather than the network structure and exploit a combination of high-level features extracted from images such as color scheme, semantics, and advertising aspects. In the experiments, the authors compare their system with a previous study that solely used the textual content and prove that their image-based method is more accurate.
In [51], the authors propose an optimization tool that exploits both the content and the topology of social networks. The authors show that the information conveyed by the topology of the network is usually noisy, and aim to support such a dimension of analysis with the content associated with the nodes. Although the authors considered a different task (i.e., community detection) with respect to that we solve in this paper, they analogously proved that the combined exploitation of content and topology provides better results than those achieved considering only the network topology.
The structure of the network of relationships is fully exploited also by methods working on heterogeneous information networks. Among such methods, it is worth mentioning HENPC [13], that solves the multi-type node classification task by extracting overlapping and hierarchically organized clusters, that are subsequently used for predictive purposes. Analogously, the method Mr-SBC [14], as well as its multi-type counterpart MT-MrSBC [15], adopts the naïve Bayes classification method for the multi-relational network setting, thus allowing the consideration of both the content and the relationships among the involved entities.
One common limitation of the approaches previously mentioned is that the content of the posts is represented implicitly or indirectly, that is, through the relationships between words and posts, and between posts and users, without exploiting the semantics of the textual content. Contrary to other hybrid approaches [14,15,52], HURI is able to explicitly take into account the semantics of the content generated by users, and their role in the network. This is not limited to including topological features together with those depending on the content (as performed by other methods), but it consists in explicitly exploiting the network of relationships as complementary information.

The proposed method HURI
Before describing the proposed method HURI, we briefly formalize the task we are solving. In particular, HURI analyzes a network $G = \langle N, E_N, C, E_C \rangle$, where:
• $N_L = N_L^{(s)} \cup N_L^{(r)}$ is the set of nodes representing users whose label is known, where $N_L^{(s)}$ is the set of safe users (i.e., with label S) and $N_L^{(r)}$ is the set of risky users (i.e., with label R);
• $N_U$ is the set of nodes representing unlabeled users;
• $N = N_L \cup N_U$ is the set of nodes representing all the users (either labeled or unlabeled);
• $E_N \subseteq N \times N$ represents a relationship (e.g., follower) between users;
• $C$ is the set of textual documents;
• $E_C \subseteq N \times C$ represents the relationships among users and textual contents, namely, that a given user generated/posted (or interacted with) a given textual content.
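The formalization above can be mirrored by a small data structure. The sketch below is only illustrative: the class name, the field names, the toy users and the consistency check (which assumes that labeled and unlabeled users are disjoint) are hypothetical choices, not part of the original formalization.

```python
from dataclasses import dataclass


@dataclass
class SocialNetwork:
    safe_users: set       # N_L^(s): labeled users with label S
    risky_users: set      # N_L^(r): labeled users with label R
    unlabeled_users: set  # N_U
    user_edges: set       # E_N: pairs (user, user), e.g. follower relationships
    documents: dict       # C: document id -> text
    content_edges: set    # E_C: pairs (user, document id)

    @property
    def labeled_users(self):
        """N_L, the union of safe and risky labeled users."""
        return self.safe_users | self.risky_users

    @property
    def users(self):
        """N, the union of labeled and unlabeled users."""
        return self.labeled_users | self.unlabeled_users

    def is_consistent(self):
        """Check that the node sets are disjoint and edges refer to known items."""
        users = self.users
        disjoint = (not (self.safe_users & self.risky_users)
                    and not (self.labeled_users & self.unlabeled_users))
        edges_ok = all(a in users and b in users for a, b in self.user_edges)
        content_ok = all(u in users and d in self.documents
                         for u, d in self.content_edges)
        return disjoint and edges_ok and content_ok


# Hypothetical toy network: "jo" is an unlabeled user linked to both communities.
toy = SocialNetwork(
    safe_users={"alice", "bob"},
    risky_users={"mallory"},
    unlabeled_users={"jo"},
    user_edges={("jo", "alice"), ("jo", "mallory")},
    documents={"d1": "report on the attack"},
    content_edges={("jo", "d1")},
)
```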
The task solved by our method is the estimation of the risk and the prediction of the corresponding label for the users in $N_U$. This means that our approach works in the within-network (or semi-supervised transductive) setting [53]: nodes for which the label is known are linked to nodes for which the label must be predicted (see [13,54]). This setting differs from the across-network (or semi-supervised inductive) setting, where learning is performed from one (fully labeled) network and prediction is performed on a separate, presumably similar, unlabeled network (see [55,56]). This provides our method with a significant advantage, since it can fully exploit the textual content and the relationships of unlabeled users during the training phase.
Fig. 3 General workflow of the proposed method
The general workflow of the proposed method consists of three phases (see Figure 3): i) network topology analysis, ii) semantic content analysis and iii) their combination. In particular, we learn a predictive model based on a set of features that represents each user on the basis of her/his relationships with other users (e.g., follows), and a predictive model based on the textual content she/he posted. Finally, we leverage the output of such models in combination to obtain a final model that is less prone to return incorrect classifications, that may possibly derive from the partial analysis of each single aspect in isolation. The adopted combination approach is inspired by the Stacked Generalization framework [19], which aims to reduce the bias of each single task. However, in addition to the classical Stacked Generalization approach, we also exploit the degree of confidence of the returned predictions, making HURI more robust to the possible uncertainty exhibited by the models learned separately from the textual content and from the network of relationships.
Although other methods take into account both aspects (see Section 3.2), they are not able to simultaneously exploit their full potential. In particular, either i) they include simple topological features together with those related to the content (see the work by [16,17]), or ii) although they are able to explicitly represent both the topology and the content (see the work by [13][14][15]), specific peculiarities of textual content (e.g., the semantics) or network relationships are not taken into account.

Network topology analysis
The goal of this phase is to exploit the network structure, i.e., the relationships in which users are involved, for predictive purposes. The most straightforward approach consists in training a prediction model directly from an adjacency matrix built from the network of relationships. In particular, given the network $G = \langle N, E_N, C, E_C \rangle$, and considering $n_i$ as the i-th user of the network, the adjacency matrix $A \in \{0, 1\}^{|N| \times |N|}$ can be easily constructed by setting
$$A_{ij} = \begin{cases} 1 & \text{if } (n_i, n_j) \in E_N \\ 0 & \text{otherwise} \end{cases}$$
However, social networks are usually not densely connected, leading to the construction of highly sparse adjacency matrices. For example, according to the financial results reported in the Q2 2019 IR Statement, on Facebook there were 1.59 billion daily active users in June 2019. Assuming an average number of 1000 friends per user, the sparseness of an adjacency matrix representing the Facebook network would be more than 99.99%.
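The sparseness figure quoted above can be checked with a few lines of arithmetic. The following sketch builds the adjacency matrix for a tiny hypothetical network and then estimates the sparseness for the Facebook-scale numbers reported in the text; the function names and the toy edges are illustrative assumptions.

```python
def adjacency_matrix(num_users, edges):
    """Build A with A[i][j] = 1 if user i is linked to user j, 0 otherwise."""
    A = [[0] * num_users for _ in range(num_users)]
    for i, j in edges:
        A[i][j] = 1
    return A


def sparsity(A):
    """Fraction of zero entries in the matrix."""
    total = sum(len(row) for row in A)
    nonzero = sum(sum(row) for row in A)
    return 1 - nonzero / total


def expected_sparsity(num_users, avg_links_per_user):
    """Each row has about avg_links_per_user ones out of num_users entries."""
    return 1 - avg_links_per_user / num_users


# Toy network with 4 users and 3 directed links (hypothetical data).
A = adjacency_matrix(4, [(0, 1), (1, 0), (2, 3)])
toy_sparsity = sparsity(A)

# Numbers from the text: 1.59 billion users, 1000 friends per user on average.
facebook_sparsity = expected_sparsity(1_590_000_000, 1000)
```

Under these assumptions, only about 1000 out of 1.59 billion entries per row are non-zero, so the estimated sparseness is well above the 99.99% mentioned in the text.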
To deal with this issue, in the literature, we can find several dimensionality reduction techniques, such as SVD [36], PCA [37] and NMF [38], that aim at identifying a new, reduced, feature space, with a lower sparseness rate. Such a task can also be performed by exploiting embedding techniques, such as AutoEncoder bottleneck encodings [57] or Node2vec [58].
We do not bind HURI to a specific approach, but we allow the adoption of any solution that is able to reduce the dimensionality of the adjacency matrix. Indeed, in our experiments (see Section 5), we evaluated the performance exhibited by HURI with different solutions to solve this step.
Formally, given the adjacency matrix $A$, the adoption of a dimensionality reduction technique leads to a new matrix $A' \in \mathbb{R}^{|N| \times k_t}$, where $k_t$ is the desired dimensionality of the reduced feature space for the topology analysis. Once a compact, dense representation of the network has been identified, we train a node classification model $nc$ on the set of labeled users $N_L$. This model is then exploited to predict the label for all the unlabeled users $N_U$ as either S (safe) or R (risky).
Specifically, we require the classification model $nc$ to be able to produce a pair $\langle nc_p(u), nc_c(u) \rangle$ for each user $u \in N_U$, where $nc_p(u)$ represents the predicted label and $nc_c(u)$ represents the confidence of the prediction. This is fundamental in order to provide the final combination step (see Section 4.3) with complete information about the prediction.
In HURI, we specifically rely on tree-based classifiers for this purpose. This choice is mainly motivated by the state-of-the-art performance exhibited by such approaches in semi-supervised settings [59], and specifically on network data [56]. The learned decision trees consist of nodes and branches, identified through a top-down induction procedure that recursively partitions the set of training examples. Each partitioning criterion, also called split, is based on a feature and a value/threshold, which are greedily determined by maximizing a heuristic.
In HURI we maximize the reduction of the classical Gini Index [60], that is based on the purity of each class measured after the split. More formally, the Gini Index is defined as:
$$Gini(n) = 1 - (p_s^2 + p_r^2)$$
where $p_s$ and $p_r$ are the relative frequencies of safe and risky users in the tree node $n$, respectively. Given an unlabeled user $u \in N_U$, the decision tree built by HURI returns the predicted label $nc_p(u)$ as the majority class in the leaf node in which $u$ falls in the learned tree, and the confidence value $nc_c(u)$, which corresponds to the purity of such a leaf computed according to the examples falling in such a leaf during the training phase.
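The split criterion and the leaf output described above can be sketched as follows. This is a simplified illustration on a single numerical feature, not the implementation used in the paper: the function names are hypothetical, and the leaf purity is assumed here to be the relative frequency of the majority class among the training examples in the leaf.

```python
from collections import Counter


def gini(labels):
    """Gini Index of a tree node: 1 minus the sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_reduction(values, labels, threshold):
    """Reduction of the Gini Index obtained by splitting on value <= threshold."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted


def best_threshold(values, labels):
    """Greedily pick the threshold that maximizes the Gini reduction."""
    candidates = sorted(set(values))[:-1]  # the largest value cannot split
    if not candidates:
        return None
    return max(candidates, key=lambda t: gini_reduction(values, labels, t))


def leaf_output(leaf_labels):
    """Return (predicted label, confidence) for a leaf from its training labels."""
    label, count = Counter(leaf_labels).most_common(1)[0]
    return label, count / len(leaf_labels)  # assumption: purity = majority share


# Hypothetical feature values (one reduced topological feature) and labels.
values = [0.1, 0.2, 0.8, 0.9]
labels = ["S", "S", "R", "R"]
threshold = best_threshold(values, labels)
prediction, confidence = leaf_output(["S", "S", "S", "R"])
```

In this toy example, a leaf containing three safe users and one risky user yields the pair (S, 0.75), which is the kind of information passed to the final combination step.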
A graphical overview of the topological analysis performed by our method can be seen in the bottom section of Figure 3.

Semantic content analysis
As already mentioned in Section 1, together with the users' relationships, we leverage the textual content that users interacted with (e.g., posted and commented on). In particular, given the network $G = \langle N, E_N, C, E_C \rangle$ as formalized in Section 4, we first pre-process the textual content in $C$, using a standard Natural Language Processing pipeline, consisting of tokenization, stopword removal, stemming and metadata removal [61]. According to $E_C$, we associate each user in $N$ with a representation that depends on his/her textual content. In particular, we build the dataset $T$ by concatenating the textual content of all the documents associated with each user, according to their timestamp. This approach has two advantages: i) the documents of each user are not considered independently, but in a combined form; ii) the temporal evolution of the topics discussed in different documents can be exploited in the definition of the context. Then we train a Word2Vec model [48] from all the textual documents associated with the labeled users $N_L$ and exploit it to process the dataset $T$. In particular, using this model, we obtain a $k_c$-dimensional numerical vector (embedding) for each word, that represents its semantic meaning. We then use it to associate a $k_c$-dimensional numerical vector to each user, according to all the terms appearing in the textual content the user interacted with.
Formally, let $words(u)$ be the list of words appearing in the textual content the user $u$ interacted with, and $w2v(w)$ the embedding generated by Word2Vec for the word $w$. We exploit the "additive compositionality" property of word embeddings [48], according to which, not only similar words appear close to each other in the feature space, but the sum of vectors in the embedding space resembles an "AND" concatenation. As a result, if two sentences appear in the same context, their vectors obtained as the sum of word embedding vectors will still be close to each other according to a similarity measure. Analogously, in our case, two users whose vectors have been obtained by the sum of word embedding vectors appearing in their documents will be close/similar to each other. Formally, we compute the semantic vector representation $sem(u)$ for each user $u \in N$ as follows:
$$sem(u) = \sum_{w \in words(u)} w2v(w)$$
In this way, we obtain a new dataset $T \in \mathbb{R}^{|N| \times k_c}$, consisting of the semantic vector representation for all the users in $N$.
Since, in this specific context, it is expected that the textual contents are strongly polarized towards the label safe, whereas data for the label risky would be generally scarce, we adopt AutoEncoders [18] to effectively model the different data distributions of the two classes. AutoEncoders work by compressing input data into a latent-space representation and then reconstructing the output from this representation. This characteristic has been exploited in the literature to perform anomaly detection and classification [62][63][64], relying on the analysis of the reconstruction error. For the classification task, the most effective approach consists in training a one-class AutoEncoder model for each possible label and assessing its reconstruction capability on unseen data. If a high reconstruction error of the AutoEncoder is observed, then the given object most likely belongs to a different class than that assumed by training data instances. This solution is preferred with respect to standard multi-class classifiers, since, as observed by [65], the performance of one-class classifiers appears more stable with respect to the level of class imbalance.
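The computation of the semantic vector representation sem(u) described above can be sketched as follows. The embedding table is a hypothetical stand-in for a trained Word2Vec model, the function names are illustrative, and skipping words without an embedding is an assumption that the text does not discuss.

```python
def words_for_user(documents):
    """Concatenate the tokens of a user's documents in timestamp order.

    `documents` is a list of (timestamp, tokens) pairs for one user.
    """
    words = []
    for _, tokens in sorted(documents, key=lambda doc: doc[0]):
        words.extend(tokens)
    return words


def sem(words, w2v, dim):
    """Sum the embeddings of all the words associated with a user."""
    vector = [0.0] * dim
    for word in words:
        embedding = w2v.get(word)
        if embedding is None:  # assumption: out-of-vocabulary words are skipped
            continue
        for i, value in enumerate(embedding):
            vector[i] += value
    return vector


# Hypothetical 2-dimensional embeddings (k_c = 2) standing in for Word2Vec.
toy_w2v = {"peace": [1.0, 0.0], "march": [0.5, 0.5], "bomb": [0.0, 1.0]}

# Pre-processed documents of one user, given out of temporal order.
user_documents = [(2, ["march"]), (1, ["peace", "rally"])]
user_words = words_for_user(user_documents)
user_vector = sem(user_words, toy_w2v, dim=2)
```

Since the representation is a plain sum, users who discuss similar topics with similar vocabulary end up with vectors pointing in similar directions, regardless of the order of the words.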
Following such an approach, starting from the dataset $T$, we build two AutoEncoders, i.e., one for the label S (safe) and one for the label R (risky). More formally, an AutoEncoder aims at learning two functions: the encoding function $enc : X \rightarrow F$ and the decoding function $dec : F \rightarrow X$, such that:
$$enc, dec = \arg\min_{enc, dec} \| X - dec(enc(X)) \|^2$$
where $X$ is the data input space of $T$ (i.e., $X = \mathbb{R}^{k_c}$), and $F$ is the encoding space learned by the AutoEncoder. The functions $enc(\cdot)$ and $dec(\cdot)$ should be parametric and differentiable with respect to a distance function, so that their parameters can be optimized by minimizing the reconstruction loss.
The architecture of an AutoEncoder consists of one or more hidden layers, where the output of the i-th hidden layer represents the i-th encoding level of the input data. The last layer of the AutoEncoder has the same size as the input layer and aims to return the reconstructed input representation after the decoding stage. In this work, we adopt two hidden layers for the encoding stage and two hidden layers for the decoding stage (see Figure 4 for a graphical representation of the architecture).
Without loss of generality, in the following we briefly explain how an AutoEncoder with one hidden layer works. The formalization can then be easily extended to AutoEncoders with multiple hidden layers. In particular, the encoding stage takes the input $sem(u) \in \mathbb{R}^{k_c} = X$ and maps it to a hidden representation $z(u) \in \mathbb{R}^{k_c/2} = F$. Formally:
$$z(u) = \sigma(W \cdot sem(u) + b)$$
where $\sigma$ is a sigmoid activation function, $W$ is a weight matrix, and $b$ is a bias vector, all associated to the encoding part. The decoding stage reconstructs $sem(u)$ from $z(u)$ as:
$$sem'(u) = \sigma'(W' \cdot z(u) + b')$$
where $\sigma'$ is a sigmoid activation function, $W'$ is a weight matrix, and $b'$ is a bias vector, all associated to the decoding part.
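The encoding and decoding stages of a single-hidden-layer AutoEncoder can be illustrated with the forward pass below. This is a minimal sketch with hand-set (untrained) parameters; the function names are hypothetical, and the squared Euclidean distance used to measure the reconstruction error is an assumption made for the example.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine_sigmoid(W, b, x):
    """Compute sigmoid(W x + b) for a weight matrix given as a list of rows."""
    return [sigmoid(sum(w * v for w, v in zip(row, x)) + bias)
            for row, bias in zip(W, b)]


def encode(x, W, b):
    """Encoding stage: map the input to the hidden representation z."""
    return affine_sigmoid(W, b, x)


def decode(z, W_dec, b_dec):
    """Decoding stage: reconstruct the input from the hidden representation."""
    return affine_sigmoid(W_dec, b_dec, z)


def reconstruction_error(x, x_rec):
    """Assumed loss: squared Euclidean distance between input and reconstruction."""
    return sum((a - c) ** 2 for a, c in zip(x, x_rec))


# Toy setting with k_c = 4 inputs and k_c / 2 = 2 hidden units.
W = [[0.0] * 4 for _ in range(2)]      # encoder weights (2 x 4), untrained
b = [0.0, 0.0]                         # encoder biases
W_dec = [[0.0] * 2 for _ in range(4)]  # decoder weights (4 x 2), untrained
b_dec = [0.0] * 4                      # decoder biases

x = [0.5, 0.5, 0.5, 0.5]
z = encode(x, W, b)
x_rec = decode(z, W_dec, b_dec)
error = reconstruction_error(x, x_rec)
```

Note that, with a sigmoid in the decoding stage, every reconstructed value lies between 0 and 1, so this sketch implicitly assumes inputs scaled to that range.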
The process aims at minimizing a reconstruction loss $\phi$ between the input $sem(u)$ and its reconstruction. The learning of the encoding parameters $W$, $b$ and of the decoding parameters $W'$, $b'$ takes place according to the minimization of the reconstruction loss $\phi$ on training data, which computes the difference between the original and reconstructed versions. Given an unlabeled user $u$, we feed both the AutoEncoders $AE_S$ and $AE_R$ with his/her semantic representation $sem(u)$, in order to compute the reconstruction errors, $AE_S(u)$ and $AE_R(u)$ respectively, according to the function $\phi(\cdot, \cdot)$. Therefore, the output of the semantic analysis for a user $u \in N_U$ is threefold:
• the reconstruction error $AE_S(u)$ computed by the AutoEncoder $AE_S$;
• the reconstruction error $AE_R(u)$ computed by the AutoEncoder $AE_R$;
• the predicted label $AE_{label}(u) \in \{S, R\}$ (safe or risky), according to the minimum error measured by the AutoEncoders $AE_S$ and $AE_R$.
More formally, $AE_{label}(u)$ is computed as follows:
$$AE_{label}(u) = \arg\min_{c \in \{S, R\}} AE_c(u)$$
We stress that the adopted strategy allows us to capture and focus on the semantics of the textual contents, and to properly model safe and risky users accordingly, without introducing spurious features based on topological characteristics of the network (as done, for example, by [16,17]). Topological aspects, on the contrary, are specifically considered by the phase described in the previous subsection.
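The labeling rule based on the minimum reconstruction error can be sketched as follows. The squared Euclidean distance used for the error and the tie-breaking in favor of the safe label are assumptions of this sketch, since the text only states that the loss measures the difference between the original and the reconstructed vectors; the example reconstructions are made-up values.

```python
from typing import List


def reconstruction_error(original: List[float], reconstructed: List[float]) -> float:
    """Squared Euclidean distance (an assumed form of the reconstruction loss)."""
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed))


def ae_label(err_safe: float, err_risky: float) -> str:
    """Return the label of the AutoEncoder with the lower error (ties go to 'S')."""
    return "S" if err_safe <= err_risky else "R"


sem_u = [0.2, 0.8, 0.5]        # hypothetical semantic vector of a user
rec_safe = [0.25, 0.75, 0.5]   # made-up output of the safe AutoEncoder
rec_risky = [0.6, 0.3, 0.9]    # made-up output of the risky AutoEncoder

err_s = reconstruction_error(sem_u, rec_safe)
err_r = reconstruction_error(sem_u, rec_risky)
label = ae_label(err_s, err_r)
```

Here the safe AutoEncoder reconstructs the vector more closely, so the user receives the label S, and both errors remain available as features for the later combination step.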
A graphical view of this phase can be seen in the left section of Figure 3.

Combining topology and semantics in textual contents
The final step aims at estimating the final risk score to assign to the unlabeled users $N_U$. This problem is solved by learning a model able to combine the outputs of the network topology analysis (predicted class $nc_p$ and prediction confidence $nc_c$) and of the semantic analysis (safe error $AE_S$, risky error $AE_R$ and label $AE_{label}$).
Methodologically, we exploit a Multi-Layer Perceptron (MLP) [66] in a stacked generalization setting [19]. An MLP is an Artificial Neural Network consisting of an input layer that receives the signal, an output layer that produces the output (i.e., the predicted label) and (possibly) multiple hidden layers. The predicted label $l(u)$ for the user $u$ is obtained by propagating the input values through these layers. The architecture of the MLP adopted in this work is shown at the bottom of Figure 3 and consists of the following layers.
The input layer consists of 5 neurons and receives the values of $nc_p(u)$, $nc_c(u)$, $AE_S(u)$, $AE_R(u)$ and $AE_{label}(u)$.
The hidden layer consists of 3 neurons that use the sigmoid as the activation function. The adoption of the sigmoid function is motivated by its ability to extract non-linear dependencies between input and output values [67], whereas the number of neurons for the hidden layer is heuristically defined between the number of input and output neurons [68].
The output layer consists of 2 neurons that exploit the softmax activation function. This choice is popular in classification problems, since it returns a probability for each class and allows the class to be predicted according to the highest probability. For this purpose, the class attribute for training examples is subject to one-hot encoding [69], which leads to two binary class attributes, only one of which assumes a value of 1. According to this setting, the first neuron returns the probability that the user is safe, whereas the second neuron returns the probability that he/she is risky. The highest probability is chosen to make the final prediction.
In this architecture, our MLP model acts as a stacking meta-model that learns how to effectively combine the predictions of the different analytical steps, thus maximizing the overall predictive accuracy. As previously emphasized, this last step allows us to automatically capture both aspects (topology and semantics), without imposing any user-defined criteria. Moreover, this approach is more expressive than simple averaging approaches (or variants based on majority voting), since it can exploit possible patterns in the output provided by the other two phases as well as additional features, like the confidence and the reconstruction errors.
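The 5-3-2 architecture described above can be sketched as a single forward pass. The weights below are arbitrary placeholders rather than trained values, and the numeric encoding of the two class-valued inputs (0 for safe, 1 for risky), the function names, and the tie-breaking in favor of the safe class are assumptions of this sketch.

```python
import math
from typing import List, Tuple


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mlp_predict(x, w_hidden, b_hidden, w_out, b_out) -> Tuple[str, List[float]]:
    """Forward pass: 5 inputs -> 3 sigmoid neurons -> 2 softmax neurons."""
    hidden = [sigmoid(sum(w * v for w, v in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(w_out, b_out)]
    probs = softmax(logits)  # probs[0]: safe, probs[1]: risky
    return ("S" if probs[0] >= probs[1] else "R"), probs


# Arbitrary placeholder weights; a real model would learn them from training data.
W_H = [[0.2, -0.1, 0.4, -0.3, 0.1],
       [-0.5, 0.3, 0.1, 0.2, -0.2],
       [0.1, 0.1, -0.4, 0.5, 0.3]]
B_H = [0.0, 0.0, 0.0]
W_O = [[0.6, -0.2, 0.1],
       [-0.3, 0.4, 0.2]]
B_O = [0.0, 0.0]

# Inputs: nc_p, nc_c, AE_S, AE_R, AE_label (classes encoded as 0 = safe, 1 = risky).
x_u = [0.0, 0.85, 0.02, 0.40, 0.0]
label, probs = mlp_predict(x_u, W_H, B_H, W_O, B_O)
```

Unlike averaging or majority voting, the learned weights let the meta-model give different importance to the confidence and to the two reconstruction errors.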

Datasets
In this work, we exploit a real-world Twitter dataset², collected using a crawling system compliant with the Twitter policies and the Conditional Independence Coupling (CIC) algorithm, which obtains a representative sample of users with no specific hashtag. The sample produced by CIC is mathematically proven to converge to the stationary distribution of the population. CIC also allows constraining the sampling process to a desired geographic location, on the basis of geo-location information and self-reported location. In our dataset, the geographic location is restricted to users who are resident in the United States [70]. Each tweet is associated with a sentiment value, i.e., an integer value which represents the polarity of the message, computed through the Stanford CoreNLP Toolkit [71] and manually revised by 3 domain experts.
The ground truth for users (i.e., risky (R) or safe (S)) has been defined using two different strategies, leading to the construction of two different datasets:
• Keywords. We consider a tweet as risky if it contains at least one keyword included in two specific manually-curated lists. The first is related to terrorism and threats³, whereas the second contains keywords related to hate against immigrants and women⁴. We assign a score to each user, computed as the ratio between the number of their risky tweets and the total number of their tweets. This strategy assumes that users whose tweets mostly contain words related to terrorism, threats and hate are more likely to be risky.
• Sentiment. We assign a score to each user, calculated as the sum of the sentiment scores of their tweets, which were pre-computed in the original dataset through the CoreNLP toolkit. This strategy assumes that users who post multiple tweets with a negative sentiment are more likely to be risky.
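The two user scores described above can be sketched in a few lines. The keyword set, the whitespace tokenization, the example tweets and the sentiment values are all hypothetical placeholders chosen for illustration; the actual keyword lists and sentiment scale are not reproduced here.

```python
from typing import Iterable, List, Set


def keyword_score(tweets: List[str], risky_keywords: Set[str]) -> float:
    """Ratio between the number of risky tweets and the number of tweets.

    A tweet counts as risky if it contains at least one keyword; lowercase
    whitespace tokenization is an assumption of this sketch.
    """
    if not tweets:
        return 0.0
    risky = sum(
        1 for tweet in tweets
        if any(token in risky_keywords for token in tweet.lower().split())
    )
    return risky / len(tweets)


def sentiment_score(sentiments: Iterable[int]) -> int:
    """Sum of the pre-computed sentiment values of a user's tweets."""
    return sum(sentiments)


# Hypothetical keyword list and tweets, for illustration only.
keywords = {"threat", "bomb"}
tweets = [
    "a threat was made",
    "nice weather today",
    "bomb threat downtown",
    "hello world",
]
kw_score = keyword_score(tweets, keywords)
sent_score = sentiment_score([-1, 0, -2, 1])
```

With these inputs, two of the four tweets contain a keyword, so the keyword-based score is 0.5, while the illustrative sentiment values sum to -2.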
After sorting the users according to their score, three expert reviewers performed a manual inspection of their tweets, which led to the selection of the safest (from the top of the list) and the riskiest (from the bottom of the list) users. This selection allowed us to ensure the correctness of the user labeling procedure, avoiding incorrect labels in the ground truth (more likely to occur for users with intermediate scores) that would have led to misleading conclusions in the evaluation. An additional step was carried out to inject noisy data under controlled conditions. Specifically, we defined borderline users who, in this case, may correspond to journalists who possibly share negative textual contents for informative purposes, but are primarily connected with safe users. For this purpose, users for whom the majority of neighbors in the network were labeled as safe were considered as borderline and relabeled as safe. Finally, we removed users showing no connection with other users. This process led to a dataset of 1467 safe users (including 263 borderline users) and 1470 risky users, described by 7,686,231 tweets, for the strategy based on the keywords, and to a dataset of 2241 safe users (including 304 borderline users) and 1033 risky users, described by 10,016,749 tweets, for the strategy based on the sentiment.
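The borderline relabeling step can be sketched on a toy graph. Several details are assumptions of this sketch rather than statements from the text: only users initially labeled risky are candidates, a strict majority is required, neighbors are judged by their initial labels, and the function name and the example graph are hypothetical.

```python
from typing import Dict, Set, Tuple


def relabel_borderline(
    labels: Dict[str, str], neighbors: Dict[str, Set[str]]
) -> Tuple[Dict[str, str], Set[str]]:
    """Relabel as safe the risky users whose neighbors are mostly safe.

    Users without any connection are dropped from the returned labeling.
    """
    borderline = set()
    for user, label in labels.items():
        if label != "R":
            continue
        linked = [n for n in neighbors.get(user, set()) if n in labels]
        if not linked:
            continue
        safe_count = sum(1 for n in linked if labels[n] == "S")
        if safe_count > len(linked) / 2:  # strict majority (assumption)
            borderline.add(user)
    new_labels = {
        user: ("S" if user in borderline else label)
        for user, label in labels.items()
        if neighbors.get(user)  # remove isolated users
    }
    return new_labels, borderline


# Toy network: 'c' posts risky-looking content but is mostly linked to safe users.
labels = {"a": "S", "b": "S", "c": "R", "d": "R", "e": "R", "f": "S"}
neighbors = {
    "a": {"c"}, "b": {"c"}, "c": {"a", "b", "d"},
    "d": {"c", "e"}, "e": {"d"}, "f": set(),
}
new_labels, borderline = relabel_borderline(labels, neighbors)
```

In this toy network, user c becomes a borderline (safe) user, users d and e stay risky, and the isolated user f is removed.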
Regarding model hyperparameters for all the neural network-based architectures (i.e., the autoencoders for the semantic content analysis, the autoencoder for the network topology analysis, and the MLP for the combination phase), we followed the heuristics proposed by [72]. Specifically, we initially experimented with different configurations for the learning rate (negative powers of 10, starting from a default value of 0.01) and the batch size (powers of 2) using a 20% validation set. Preliminary results suggested that the different configurations did not affect performance metrics significantly. For this reason, the final experiments were performed with the following model configuration: epochs = 500, learning rate = 0.0001, batch size = 32.
The results obtained were compared with those achieved by eight competitor approaches, belonging to different categories (topology-based, content-based and hybrid), namely:
• GNetMine [6], which is a topology-based method able to classify unlabeled nodes organized in (possibly heterogeneous) information networks. This comparison allows us to evaluate the performance of HURI against a state-of-the-art method that properly models users' relationships.
• Doc2Vec [12], which creates a numerical representation of a document. We apply the method to the textual content and exploit the embedding vector for classification. In the experiments, we consider different values for the dimensionality of the Doc2Vec embedding vector, namely, 128, 256, and 512. As for the downstream classifier, we adopted three different methods, namely:
- Support Vector Machines implemented in scikit-learn [73], with the RBF kernel and the adjustment of class weights to compensate for the class imbalance.
- Random Forests implemented in scikit-learn [73], with the adjustment of class weights to compensate for the class imbalance. We set the number of trees equal to 100 and the minimum number of examples per leaf to 2, and adopted the Minimal Cost-Complexity Pruning, considering the optimal value of the α parameter in {0.0, 0.2, 0.5, 1.0, 2.0}.
- Multi-Layer Perceptron (MLP), designed with an input layer whose size depends on the Doc2Vec embedding vector size (128, 256, or 512 neurons), a hidden layer with 128 neurons (corresponding to 100%, 50%, or 25% of the input features, based on the Doc2Vec embedding vector size) using the sigmoid activation function, and an output layer with 2 neurons with softmax activations.
These two systems represent state-of-the-art methods able to properly model the content generated by users. They are able to capture the semantic content thanks to the adoption of Doc2Vec as an embedding approach.
As for hybrid methods, we consider the following approaches:
• Doc2Vec + Node2Vec [12], to jointly extract embedding vector representations for both the textual content (Doc2Vec) and the topology component (Node2Vec) of the data. The same experimental setting used for Doc2Vec is adopted in this approach. Specifically, we extract embedding vectors of dimensionality 128, 256, and 512, for both the textual content and the topology content. The two embedding vectors are concatenated and provided as inputs to Support Vector Machines, Random Forests, or a Multi-Layer Perceptron, which are adopted as base models for the downstream classification task.
• MrSBC [14], that is a state-of-the-art relational classification method, based on a combination of the naïve Bayes classification framework and first-order rules, able to work on data stored on a relational database.The database schema defined for the system consists of: i) the users table, containing the user IDs and their label; ii) the users_users table, containing pairs of user IDs that represent their relationships in the network; iii) the users_posts table, that contains the ID of tweets, each associated to the user who posted it and to the sentiment score; iv) the posts_words table, that represents the words contained in each tweet.
In the experiments, we considered different values of its parameter max_length_path, i.e., the maximum length of the paths considered in the exploration of the relational schema.
In particular, we evaluated the results with max_length_path ∈ {3, 4, 5, 6}.
These methods are capable of considering both the content posted by the users and the network topology, like the proposed method HURI. Therefore, they allow us to directly compare HURI to state-of-the-art hybrid approaches.

Experimental setup
Our experiments were carried out according to a stratified 5-fold cross validation scheme, which subdivides users randomly into 5 different folds and alternately considers users in one fold as $N_U$ and users in the remaining 4 folds as $N_L$. The stratified approach preserves the ratio of safe and risky users. For the evaluation, the workflow shown in Figure 3 was repeated once for each fold and the results obtained were averaged. The metrics used for the evaluation of the performance achieved by the different methods are precision, recall, F1-Score and accuracy, where TP (True Positive), TN (True Negative), FP (False Positive) and FN (False Negative) were computed by considering R (risky) as the positive class.
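The four metrics can be computed from the confusion counts as in the following sketch, which treats R (risky) as the positive class. Returning 0.0 when a denominator is zero is a convention assumed here, and the label sequences are made-up examples.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], positive: str = "R"
) -> Dict[str, float]:
    """Precision, recall, F1-Score and accuracy with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up ground truth and predictions for eight users.
y_true = ["R", "R", "R", "S", "S", "S", "S", "S"]
y_pred = ["R", "R", "S", "R", "S", "S", "S", "S"]
metrics = binary_metrics(y_true, y_pred)
```

In this example TP = 2, FP = 1, FN = 1 and TN = 4, giving precision, recall and F1-Score of 2/3 and an accuracy of 0.75.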
We report separate metric values for all users and for borderline users. This choice provides a dual perspective on our quantitative assessment: a more general one (all users), and a more specific one that focuses on users at the boundary between safe and risky users, who may be more challenging to classify (borderline users).
We recall that, in our study, borderline users are not harmful to the community, but share textual contents on sensitive/risky topics with informative purposes, thus resembling risky users. For borderline users, we only collect the accuracy: since we assume they are all safe users, the accuracy corresponds to the recall of the safe class, and it is not correct to compute the precision. The stratified random sampling that we performed also aimed to preserve the ratio of borderline users within the set of safe users.
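A stratified split that also preserves the share of borderline users can be sketched by treating them as a separate stratum. The round-robin assignment, the stratum names, the function names and the toy users are assumptions of this sketch, not the procedure actually used in the experiments.

```python
import random
from collections import defaultdict
from typing import Dict, List, Set


def stratified_folds(groups: Dict[str, str], n_folds: int = 5, seed: int = 0) -> List[List[str]]:
    """Split users into folds, spreading each stratum evenly (round-robin)."""
    by_group = defaultdict(list)
    for user, group in groups.items():
        by_group[group].append(user)
    rng = random.Random(seed)
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for group in sorted(by_group):
        users = sorted(by_group[group])
        rng.shuffle(users)
        for i, user in enumerate(users):
            folds[i % n_folds].append(user)
    return folds


def borderline_accuracy(predictions: Dict[str, str], borderline: Set[str]) -> float:
    """Fraction of borderline users predicted as safe (the recall of the safe class)."""
    if not borderline:
        return 0.0
    return sum(1 for u in borderline if predictions[u] == "S") / len(borderline)


# Toy population: 10 risky, 10 safe and 5 borderline (safe) users.
groups = {f"r{i}": "R" for i in range(10)}
groups.update({f"s{i}": "S" for i in range(10)})
groups.update({f"b{i}": "S-borderline" for i in range(5)})
folds = stratified_folds(groups)

# Made-up predictions for the borderline users.
preds = {"b0": "S", "b1": "S", "b2": "R", "b3": "S", "b4": "S"}
acc_borderline = borderline_accuracy(preds, set(preds))
```

Each of the five folds here receives two risky, two safe and one borderline user, and four of the five borderline users are predicted as safe, giving a borderline accuracy of 0.8.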

Results and discussion
The first aspect that we discuss concerns the role of the parameters $k_t$ and $k_c$ and the sensitivity of the predictive performance of HURI to their values. In Tables 1 and 2, we report the results obtained by HURI with the best configurations of $k_t$ and $k_c$. From the results, it is possible to observe that the configuration $\langle k_t = 128, k_c = 256 \rangle$ offers the best trade-off for the two tasks: the discrimination between risky and safe users (F1-Score on all the users) and the correct classification of borderline users as safe users (accuracy on borderline users). This result indicates that larger vectors are needed for the representation of the content than for the network structure, where 128-sized vectors appear to be enough. For the correct classification of borderline users, it is apparently necessary to further extend the feature space for the representation of the content. Other configurations, different from $\langle k_t = 128, k_c = 256 \rangle$, can lead to a higher accuracy on borderline users, but lead to a lower overall F1-Score for the task of discriminating between risky and safe users.
This phenomenon is evident for the Keywords dataset (Table 1), where the best configuration achieves an F1-Score on all users of 0.808 and an accuracy on borderline users of 0.868, whereas the other configurations achieve an inferior performance in both aspects (see $\langle k_t = 128, k_c = 512 \rangle$, which achieves in the best case an F1-Score on all users of 0.751 and an accuracy on borderline users of 0.830), or obtain high values of accuracy on borderline users at the cost of a drastic reduction of the F1-Score on all the users (see $\langle k_t = 256, k_c = 128 \rangle$, which yields in the best case an F1-Score on all users of 0.735 and an accuracy on borderline users of 0.947, and $\langle k_t = 512, k_c = 128 \rangle$, which provides an F1-Score on all users of 0.734 and an accuracy on borderline users of 0.943). Similarly, on the Sentiment dataset (Table 2), the best configuration achieves an F1-Score on all users of 0.649 and an accuracy on borderline users of 0.928, whereas the other configurations globally exhibit lower performance (see $\langle k_t = 256, k_c = 128 \rangle$, which achieves in the best case an F1-Score on all users of 0.506 and an accuracy on borderline users of 0.893, and $\langle k_t = 512, k_c = 128 \rangle$, which yields in the best case an F1-Score on all users of 0.519 and an accuracy on borderline users of 0.901) or provide just a small improvement to the accuracy on borderline users, resulting in an excessive penalization of the overall F1-Score on all the users (as seen in $\langle k_t = 128, k_c = 512 \rangle$, which obtains an F1-Score on all users of 0.399 and an accuracy on borderline users of 0.949).
Comparing the results in terms of network representation for HURI, we can observe that, except for some specific cases (see $\langle k_t = 256, k_c = 128 \rangle$ and $\langle k_t = 512, k_c = 128 \rangle$ for the dataset based on sentiment), considering the full adjacency matrix led to the best result on the task of classifying borderline users as safe users. This is an expected behavior, since borderline users have been defined in the dataset according to their relationships, which are fully and explicitly represented by the adjacency matrix. However, as regards the F1-Score on the whole dataset, we can observe that SVD and PCA led to the best results, without significantly affecting the accuracy on borderline users. This means that they were able to effectively represent the general network structure, leading to a better generalization of the learned model and good overall robustness to the presence of noise (i.e., journalists). Overall, HURI leads to satisfactory results with any of the considered approaches for the representation of the network, except for the AutoEncoder, which in some cases leads to significantly lower results. This means that i) the AutoEncoder performs worse than statistical approaches like PCA and SVD when adopted to identify a representation of strongly sparse networks, and leads to underfitting; ii) HURI is generally able to correctly balance the contribution of the information in the content posted by users with that in the network structure, and to effectively discriminate between safe and risky users. These conclusions apply to both the considered datasets and show that HURI is a suitable solution to analyze domains characterized by heterogeneous and noisy data structured as a network, offering better generalization and robustness capabilities than other methods.
In order to further assess the validity of our work, we perform an ablation study that aims to ascertain that all components of the proposed method HURI provide a positive contribution, which translates into an improvement in terms of classification accuracy. Specifically, we allow HURI to analyze the textual content or the relationships among users in isolation, devising two simplified variants of the method that differ in the combination stage, namely:
• HURI only content: The MLP adopted for combining the contribution of the two perspectives is trained without considering the predicted class and the confidence returned by the component for the network topology analysis. Instead, for the latter, during the training the ground truth of the data is used as the label, while the confidence is set to 1.0. During the prediction phase, the majority class of the training set is considered, with a confidence factor set to the ratio between the number of majority class samples and the number of samples in the training set. The rationale is to provide the method with the most reliable source of information for the training stage, without incorporating the potential smoothness introduced by the confidence factor, which is exploited by the full version of HURI.
• HURI only relationships: The MLP adopted for combining the contribution of the two perspectives is trained without considering the actual reconstruction error predicted by the AutoEncoder models, which represent the semantic analysis component of HURI. Instead, the reconstruction error for the true label is replaced such that it is lower than the reconstruction error of the opposite label. During the prediction, the predicted class is set to the majority class of the training set, while the safe and the risky errors are set such that the one of the majority class is lower than that of the minority class. The rationale is to provide the model with reasonable and informative input for the textual component, coherently with the prior knowledge.
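The way the "only content" variant replaces the topology-related inputs can be sketched as follows. The function name and the pair representation (class, confidence) are assumptions of this sketch; only the replacement rule itself follows the description above.

```python
from collections import Counter
from typing import List, Tuple

TopologyInput = Tuple[str, float]  # (predicted class, confidence)


def only_content_topology_inputs(
    train_labels: List[str], n_test: int
) -> Tuple[List[TopologyInput], List[TopologyInput]]:
    """Build the topology inputs for the 'only content' ablation.

    Training: the ground-truth label with confidence 1.0.
    Prediction: the majority class of the training set, with confidence equal
    to the fraction of training samples belonging to that class.
    """
    train_inputs = [(label, 1.0) for label in train_labels]
    majority, count = Counter(train_labels).most_common(1)[0]
    confidence = count / len(train_labels)
    test_inputs = [(majority, confidence)] * n_test
    return train_inputs, test_inputs


# Toy training set with three safe users and one risky user.
train_inputs, test_inputs = only_content_topology_inputs(["S", "S", "S", "R"], n_test=2)
```

Every test user therefore receives the same uninformative topology input, here the safe class with confidence 0.75, so any discrimination at prediction time has to come from the semantic features.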
The results in Table 3 show the performance obtained on the dataset based on keywords. By inspecting the results obtained across all configurations, it is possible to observe that limiting HURI to the analysis of user relationships yields, in the best case, an F1-Score on all users of 0.726 and an accuracy on borderline users of 0.437. On the other hand, the HURI variant that solely analyzes the textual content achieves in the best configuration a close-to-zero F1-Score on all users and an accuracy on borderline users of 0.890. These results are significantly worse than those obtained by HURI analyzing both textual contents and relationships. Indeed, the best results achieved on this dataset using HURI with all active components correspond to an F1-Score of 0.808 on all users and an accuracy of 0.947 on borderline users. A similar situation is observed with the dataset based on sentiment, as shown in Table 4. Specifically, limiting HURI to the analysis of user relationships yields, in the best case, an F1-Score on all users of 0.389, and an accuracy on borderline users of 0.200. On the other hand, the HURI variant that solely analyzes the textual content achieves in the best configuration an F1-Score on all users of 0.172 and an accuracy on borderline users of 0.840. The full version of HURI largely outperforms such results, achieving an F1-Score on all users of 0.649 and an accuracy on borderline users of 0.949.
Overall, these results confirm that the synergic analysis of both textual content and user relationships, provided by the combination of the semantic and topology components of HURI, is the enabling factor allowing HURI to yield the most accurate predictive performance.
In Tables 1 and 2 we also compare the results of HURI with the results obtained by the two best competitor systems, i.e., MrSBC (max_length = 6) and Doc2Vec+Node2Vec+SVM ($k_t = 128$, $k_c = 512$) for the keyword dataset, and MrSBC (max_length = 6) and GNetMine for the sentiment dataset. From the results, it is possible to observe that HURI is able to outperform them on both datasets, in almost all the configurations of the parameters $k_t$ and $k_c$. The only case where MrSBC outperforms HURI is on the whole set of users, in the configuration $\langle k_t = 512, k_c = 128 \rangle$ on the Sentiment dataset. However, we can see that it fails to correctly classify borderline users, since it classifies all of them as risky.
The best competitors were selected by observing the detailed results obtained by all competitor methods (see Tables 5 and 6). A broader analysis of the results obtained by competitor methods reveals that Doc2Vec, with all the considered downstream classifiers (Random Forests, SVMs, and Multi-Layer Perceptron), was able to obtain acceptable overall results only on the dataset based on keywords. This means that, although it was able to capture the semantics of the content, the learned feature space was not able to properly represent the sentiment of the tweets. We observe an opposite behavior when focusing on the accuracy of the classification of borderline users, which indicates that this approach was not able to properly handle the noise in the data. In other words, it was not possible to simultaneously achieve an acceptable overall F1-Score and a good accuracy on borderline users. The system GNetMine, which exclusively analyzes the link structure of the network, generally exhibited a poor classification performance on the whole dataset, but a high accuracy on possible borderline users. However, a close analysis of the prediction results reveals that the method is very prone to classify users as safe, that is, the topology alone does not appear sufficient to accurately detect high-risk users. Consequently, the positive results on borderline users do not imply an actual discriminating ability. The hybrid method MrSBC shows a weak classification performance on the whole dataset and a relatively low accuracy on borderline users in all its parameter configurations. Shifting our focus to the hybrid method Doc2Vec + Node2Vec (with all the considered downstream classifiers), it is possible to observe significantly more accurate results than MrSBC in terms of accuracy on borderline users, but worse results in terms of F1-Score, on both datasets. The highest accuracy on borderline users is achieved with Random Forest on the dataset based on sentiment, and Support Vector Machine on the dataset based on keywords. The situation is reversed when observing results for Doc2Vec + Node2Vec in terms of F1-Score on all users, i.e., Support Vector Machine is the best classifier on the dataset based on sentiment, whereas Random Forest is the leading classifier on the dataset based on keywords. This behavior shows that the adoption of a method able to capture both the content and the relationships does not necessarily guarantee an accurate classification, especially if the method does not explicitly exploit the semantics associated with the content (which is the case of MrSBC).
In conclusion, the proposed method HURI showed the best overall performance, both in discriminating between safe and risky users and in being robust to the presence of noisy data, i.e., borderline users represented by journalists. This superiority is observable with both considered datasets, meaning that HURI was able to correctly leverage the semantics from the content in both situations and to properly combine its contribution with that of the network structure. Finally, in Figure 5 we show an example of words appearing in Twitter posts that HURI classified as safe (Figure 5.a) and risky (Figure 5.b), respectively, according to the semantic content analysis of the system. The figure suggests that the semantic component of our method can accurately classify users. Nevertheless, as previously emphasized, the final user classification still depends on the combined contribution of the content-based and topology-based components, which allows us to accurately label borderline users as safe users.

Conclusion
In this paper, we have proposed HURI, a method for social network analysis that exploits multiple sources of information to accurately classify users as safe or risky. In our method, we have simultaneously leveraged the network topology and the semantics of the content shared by users to analyze in detail the underlying social relationships and interactions. This was possible thanks to the stacked generalization approach proposed in HURI, which learns an adaptive model to combine the two contributions.
The experimental results showed that the proposed method exhibits competitive performance with respect to topology-based, content-based, and hybrid state-of-the-art approaches for social network analysis, especially in the presence of noisy data. We analyzed the performance of the different methods in a complex network scenario that includes borderline users who, in this specific context, may represent journalists who post contents that may appear risky, but who are actually safe users according to their relationships.
We observed that all the competitor methods analyzed provide unsatisfactory performance either in terms of classification accuracy on all the users or specifically on borderline users. On the contrary, our method provides the best results on both the considered tasks. One possible limitation of the proposed method HURI is a potential reduction of accuracy in scenarios characterized by borderline users with an unknown or ill-defined network topology. Another potential challenge for the method is the increased difficulty of the classification task arising when borderline users mimic both the topology and the generated content of risky users. In the future, we aim to assess and improve the robustness of HURI in such situations, and to extend it to address complex applications that involve multi-modal data, including images and videos. Moreover, we will design a distributed variant of HURI able to analyze large-scale networks.

Fig. 2 A graphical representation of topology-based, content-based and hybrid methods for social network analysis

Fig. 4 A graphical representation of the proposed AutoEncoder architecture for semantic content analysis: Three stages of encoding and decoding, that aggregate and reconstruct the aggregated semantic representation of each user. The lowest reconstruction error obtained between the two AutoEncoders (one for the risky label and one for the safe label) is used to perform a content-based classification of the user

Fig. 5 An example of words appearing in Twitter posts that HURI classifies as safe (a) or risky (b) according to the semantic analysis component. The final user classification in any case depends on the combined contribution of the content-based and topology-based components

Table 1
Average performance on the Keywords dataset. The best results for each configuration in terms of F1-Score and accuracy of classification of borderline users are highlighted in bold

Table 2
Average performance on the Sentiment dataset

Table 3
Ablation study considering simplified variants of HURI (only content and only relationships) on the

Table 4
Ablation study considering simplified variants of HURI (only content and only relationships) on the

Table 5
Average performance for all competitor methods on the Keywords dataset

Table 6
Average performance for all competitor methods on the Sentiment dataset