Using Node Identifiers and Community Prior for Graph-Based Classification
Abstract
With large-scale network data now widely available, one hot topic is how to adapt traditional classification algorithms to predict the most probable labels of nodes in a partially labeled network. In this article, we propose a new algorithm called the identifier-based relational neighbor classifier (IDRN) to solve the within-network multi-label classification problem. We use the node identifiers in the egocentric networks as features and propose a within-network classification model that incorporates community structure information to predict the most probable classes for unlabeled nodes. We demonstrate the effectiveness of our approach on several publicly available datasets. First, taking a semi-supervised approach, IDRN without any community prior is applied in community detection experiments, where it outperforms most existing unsupervised community detection algorithms. Then, in large-scale graph-based multi-label classification tasks, our approaches perform well on both fully labeled and partially labeled networks in most cases. We also run a scalability test that measures the running time of our algorithm on different networks. The experiment results show that our approach is efficient and suitable for large-scale real-world classification tasks.
Keywords
Within-network classification, Node classification, Collective classification, Relational learning
1 Introduction
Massive networks exist in various real-world applications. These networks may be only partially labeled because manual labeling can be highly costly in real-world tasks. A critical problem is how to use the network structure and other extra information to build better classifiers that predict labels for the unlabeled nodes. Recently, much attention has been paid to this problem, and various prediction algorithms over nodes have been proposed [1, 2, 3].
In this article, we propose a within-network classifier that uses the node identifiers in the egocentric networks as features, together with a community prior. Traditional relational classification algorithms, such as the WvRN [4] and SCRN [5] classifiers, estimate labels through statistics, class label propagation or relaxation labeling. From a different viewpoint, many real-world networks display useful phenomena, such as the clustering phenomenon [6] and the scale-free phenomenon [7]. Most real-world networks show a high clustering property or community structure, i.e., their nodes are organized into clusters, which are also called communities [6, 8]. The clustering phenomenon indicates that the network can be divided into communities with dense connections internally and sparse connections between them. For example, people sharing the same beliefs and interests tend to connect to each other [9], and queries in the same text cluster often share similar class labels [10]. The scale-free phenomenon indicates the existence of nodes with high degrees [7], and the identifiers of these high-degree nodes can be widely shared by their neighbors. In densely connected communities, hub nodes are connected to many different nodes, so the identifiers of neighbors may be used as features to capture the label patterns of nodes. Because the high clustering property and the scale-free phenomenon are widespread in network data, we expect that the identifiers of nodes can serve as fine-grained features and the community prior as a coarse-grained prior to boost the performance of our approach in node classification tasks. In this article, we first apply IDRN [11] with 10% labeled nodes for training in the community detection experiments, and it improves the metric values over most of the existing unsupervised community detection algorithms. We also demonstrate the effectiveness of our algorithm on tens of public datasets, fully labeled or partially labeled, in multi-label classification tasks. In the experiments, our approach outperforms recently proposed baseline methods in most cases.
Our contributions are as follows. First, to the best of our knowledge, this is the first time that node identifiers in the egocentric networks are used as features to solve the network-based classification problem. Second, we utilize the community prior in a principled way to improve performance on different real-world networks. Finally, our approach is effective and easy to implement, which makes it applicable to different real-world within-network classification tasks. The rest of the article is organized as follows. In the next section, we review related work. Section 3 describes our methods in detail. In Sect. 4, we show the experiment results on different publicly available datasets. Section 5 gives the conclusion and discussion.
2 Related Work
One recent focus of machine learning research is how to extend traditional classification methods to classify nodes in network data, and a body of work for this purpose has been proposed. Bhagat et al. [12] give a survey on the node classification problem in networks. They divide the methods into two categories: one uses the graph information as features, and the other propagates existing labels via random walks. The relational neighbor (RN) classifier provides a simple but effective way to solve node classification problems. Macskassy and Provost [4] propose the weighted-vote relational neighbor (WvRN) classifier, which makes predictions based on the class distribution of a node’s neighbors. It works reasonably well for within-network classification and is recommended as a baseline method for comparison. Wang and Sukthankar [5] propose a multi-label relational neighbor classification algorithm by incorporating a class propagation probability obtained from edge clustering. Macskassy and Provost [13] also argue that the very high cardinality of categorical identifier features may make classifier modeling difficult. Thus there is very little work that has incorporated node identifiers [13]. As we regard node identifiers as useful features for node classification, our algorithm does not depend solely on neighbors’ class labels but also incorporates node identifiers in each node’s egocentric network as features and community structure as a prior.
For the within-network classification problem, a large number of algorithms for generating node features have been proposed. Unsupervised feature learning approaches typically exploit the spectral properties of various matrix representations of graphs. To capture different affiliations of nodes in a network, Tang and Liu [14] propose the SocioDim framework to extract latent social dimensions based on the top-d eigenvectors of the modularity matrix, and then utilize these features for discriminative learning. Rizos et al. [9] study the problem of semi-supervised, multi-label user classification of networked data in social networks. They propose a framework that combines unsupervised community extraction and supervised community-based feature weighting before training a classifier. Using the same feature learning framework, Tang and Liu [15] also propose an algorithm to learn dense features from the d-smallest eigenvectors of the normalized graph Laplacian. Ahmed et al. [16] propose an algorithm to find low-dimensional embeddings of a large graph through matrix factorization. However, the objective of the matrix factorization may not capture the global network structure information. To overcome this problem, Tang et al. [2] propose the LINE model to preserve the first-order and the second-order proximities of nodes in networks. Perozzi et al. [17] present DeepWalk, which uses the Skip-Gram language model [18] to learn latent representations of nodes in a network from a set of short truncated random walks. Grover and Leskovec [19] define a flexible notion of a node’s neighborhood by random walk sampling, and they propose the node2vec algorithm, which maximizes the likelihood of preserving network neighborhoods of nodes. Nandanwar and Murty [1] also propose a novel structural neighborhood-based classifier using random walks, which emphasizes the role of medium-degree nodes in classification.
Most algorithms based on features generated by heuristic methods such as random walks or matrix factorization have high time complexity, so they may not be easily applied to large-scale real-world networks. To be more efficient in node classification, in both the training and prediction phases we extract the community prior and identifier features of each node in linear time, which makes our algorithm much faster. We evaluate the scalability of our approach on different large-scale networks in the following experiments.
Several real-world network-based applications boost their performance by obtaining extra data. McDowell and Aha [20] find that the accuracy of node classification may be increased by including extra attributes of neighboring nodes as features for each node. In their algorithms, the neighbors must contain extra attributes such as the textual contents of web pages. Rayana and Akoglu [21] propose a framework to detect suspicious users and reviews in a user-product bipartite review network, which accepts prior knowledge on the class distribution estimated from metadata. To address the problem of query classification, Bian and Chang [22] propose a label propagation method to automatically generate query class labels for unlabeled queries from click-based search logs. To identify spammer accounts, Fakhraei et al. [23] propose a statistical relational model that makes use of structural features, sequence modeling, and collective reasoning. With the help of the large amount of automatically labeled queries, the performance of the classifiers has been greatly improved. To predict the relevance between queries and documents, Jiang et al. [24] and Yin et al. [25] propose a vector propagation algorithm on the click graph to learn vector representations for both queries and documents in the same term space. Experiments on search logs demonstrate the effectiveness and scalability of the proposed method. Wang et al. [26] study the problem of linked document embedding for classification and propose a linked document embedding framework, LDE, which combines link and label information with content information to learn document representations for classification. Newman and Clauset [27] propose a method that combines a network and its node information to detect communities. Their method learns whether the node information is correlated with the communities, which makes the predictions of community membership more accurate. Tu et al. [28] propose a context-aware embedding algorithm to learn the embeddings of vertices by considering both the structural roles and the text information of nodes simultaneously. Wang et al. [29] present a novel item concept embedding approach that learns the embeddings of both items and words by leveraging the concept of neighborhood proximity in both homogeneous and heterogeneous retrieval tasks. However, as it is hard to find useful extra attributes in many publicly available real-world network datasets, and handling the extra metadata may require domain knowledge, our approach in this article depends only on the structural information in partially labeled networks.
3 Methodology
In this section, we focus on performing multi-label node classification in networks as a within-network classification task, where each node can be assigned multiple labels and only a few nodes have already been labeled. We first present our problem formulation and then describe our algorithm in detail.
3.1 Problem Formulation
The multi-label node classification we address here is related to the within-network classification problem: estimating labels for the unlabeled nodes in partially labeled networks. We are given a partially labeled undirected network \(G=\{{\mathcal {V}}, {\mathcal {E}}\}\), in which a set of nodes \({\mathcal {V}} = \{1,\cdots , n_{max}\}\) are connected by edges \(e(i, j) \in {\mathcal {E}}\), and \({\mathcal {L}}=\{l_1, \cdots , l_{max}\}\) is the label set for nodes.
3.2 Objective Formulation
3.2.1 IDRN Classifier
As shown in Eq. (2), traditional relational neighbor classifiers, such as WvRN [4], only use the class labels in the neighborhood as features. However, as we will show, by taking the identifiers in each node’s egocentric network as features, the classifier often performs much better than most baseline algorithms.
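To make the notion of identifier features concrete, the following minimal Python sketch extracts the identifiers in a node's egocentric network from an edge list. It assumes that the egocentric network consists of the node itself plus its direct neighbors; the function names `build_adjacency` and `identifier_features` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors} from an edge list."""
    adj = defaultdict(set)
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def identifier_features(adj, i):
    """Return the identifiers in node i's egocentric network.

    Assumption: the ego network is node i itself followed by its neighbors.
    """
    return [i] + sorted(adj.get(i, ()))


# Toy undirected graph: a triangle 1-2-3 with a pendant node 4 attached to 3.
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
adj = build_adjacency(edges)
features_of_3 = identifier_features(adj, 3)
features_of_4 = identifier_features(adj, 4)
```

In this toy graph the hub node 3 appears in the feature list of every other node, which is the kind of shared identifier the classifier can exploit.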
3.2.2 Multilabel Classification
The traditional way of addressing the multi-label classification problem is to transform it into a one-vs-rest learning problem [5, 14]. When training the IDRN classifier, for each node i with a set of true labels \(T_i\), we transform it into a set of single-label data points, i.e., \(\{ \langle {\mathbf{X }}_{{\mathcal {N}}_i}, c \rangle \mid c \in T_i\}\). After that, we use the naive Bayes training framework to estimate the class prior \(P(Y_i = c)\) and the conditional probability \(P(k \mid Y_i = c)\) in Eq. (4).
Algorithm 1 shows how to train IDRN to get the maximum likelihood estimates (MLE) of the class prior \(P(Y_i = c)\) and the conditional probability \(P(k \mid Y_i = c)\), i.e., \({\hat{\theta }}_{\mathrm{c}} = P(Y_i = c)\) and \({\hat{\theta }}_{\mathrm{kc}}=P(k \mid Y_i = c)\). As it has been suggested that the multinomial naive Bayes classifier usually performs better than the Bernoulli naive Bayes model in various real-world practices [31], we take the multinomial approach here. Suppose we observe N data points in the training dataset. Let \(N_c\) be the number of occurrences of class c, and let \(N_{\mathrm{kc}}\) be the number of co-occurrences of feature k and class c. In the first 2 lines, we initialize the counts N, \(N_c\) and \(N_{\mathrm{kc}}\). After that, we transform each node i with a multi-label set \(T_i\) into a set of single-label data points and use the multinomial naive Bayes framework to accumulate N, \(N_c\) and \(N_{\mathrm{kc}}\), as shown from line 3 to line 12 in Algorithm 1. Finally, we obtain the estimated probabilities, i.e., \({\hat{\theta }}_{\mathrm{c}} = P(Y_i = c)\) and \({\hat{\theta }}_{\mathrm{kc}}=P(k \mid Y_i = c)\), for all classes and features.
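The counting procedure described above can be sketched in Python as follows. This is an illustrative sketch rather than a reproduction of Algorithm 1: the function name `train_idrn` is hypothetical, and normalizing \(N_{\mathrm{kc}}\) by the total feature count of class c is the textbook multinomial choice, assumed here without smoothing.

```python
from collections import Counter, defaultdict


def train_idrn(features, labels):
    """Count-based estimates for a multinomial naive Bayes model over identifiers.

    features: {node: list of identifiers in its ego network}
    labels:   {node: set of true labels T_i} for the labeled training nodes
    Returns (prior, cond) with prior[c] ~ P(Y = c) and cond[c][k] ~ P(k | Y = c).
    """
    n_points = 0                  # N: number of single-label data points
    n_c = Counter()               # N_c: data points per class
    n_kc = defaultdict(Counter)   # N_kc: occurrences of identifier k with class c

    for node, true_labels in labels.items():
        for c in true_labels:     # one single-label data point per (node, c) pair
            n_points += 1
            n_c[c] += 1
            for k in features[node]:
                n_kc[c][k] += 1

    prior = {c: n_c[c] / n_points for c in n_c}
    cond = {}
    for c, counts in n_kc.items():
        total = sum(counts.values())  # assumed multinomial normalizer for class c
        cond[c] = {k: v / total for k, v in counts.items()}
    return prior, cond


# Toy data: node 2 carries two labels, so it yields two data points.
features = {1: [1, 2, 3], 2: [2, 1, 3], 4: [4, 3]}
labels = {1: {"a"}, 2: {"a", "b"}, 4: {"b"}}
prior, cond = train_idrn(features, labels)
```

Each labeled node is visited once per true label and each visit touches every identifier in its ego network, so the work per node is proportional to its degree times its number of labels.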
3.2.3 Community Prior
3.3 Efficiency
Suppose that the largest node degree of the given network \(G=\{{\mathcal {V}}, {\mathcal {E}}\}\) is K. In the training phase, as shown in Algorithm 1, the time complexity from line 1 to line 12 is about \(O(K \times |{\mathcal {L}}| \times |{\mathcal {V}}|)\), and the time complexity from line 13 to line 18 is \(O(|{\mathcal {L}}| \times |{\mathcal {V}}|)\). So the total time complexity of the training phase is \(O(K \times |{\mathcal {L}}| \times |{\mathcal {V}}|)\). This training procedure is also simple to implement. For each node, the training time is linear in the product of its degree and the size of the class label set \({\mathcal {L}}\).
In the prediction phase, suppose node i has n neighbors. It takes \(O(n + 1)\) time to find its identifier vector \({\mathbf{X }}_{{\mathcal {N}}_i}\). Given the knowledge of i’s community membership \(C_i\), in Eqs. (5) and (8), it only takes O(1) time to get the values of \(P(Y_i = c \mid C_i)\) and \(P(Y_i = c)\), respectively. As it takes O(1) time to get the value of \(P(k \mid Y_i = c)\), for a given class label c, the time complexities of Eqs. (5) and (8) are both O(n). Thus for a given node, the total complexity of predicting the probability scores on all labels \({\mathcal {L}}\) is \(O(|{\mathcal {L}}| \times n)\), even if we compute the precise probabilities in Eq. (6). Each class label prediction takes O(n) time, which is linear in the number of neighbors. Furthermore, the prediction process can be greatly sped up by building an inverted index of node identifiers, as the identifier features of each class label can be sparse.
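The cost argument above can be illustrated with a short Python sketch of naive Bayes scoring over identifier features. It is a sketch under stated assumptions, not the exact form of Eqs. (5), (6) or (8): the functions `predict_scores` and `top_labels` are hypothetical, the parameters are hand-made toy values, and the `floor` probability for unseen identifiers is an assumed stand-in for whatever smoothing the model uses.

```python
import math


def predict_scores(identifiers, prior, cond, floor=1e-6):
    """Log-score every class for one node in O(|L| * n) dictionary lookups.

    identifiers: the node's identifier features (itself plus its n neighbors)
    prior:       {class: P(Y = c)}; a community-specific prior could be passed instead
    cond:        {class: {identifier: P(k | Y = c)}}
    floor:       assumed probability for identifiers unseen with a class
    """
    scores = {}
    for c, p_c in prior.items():
        log_score = math.log(p_c)
        for k in identifiers:                      # O(1) lookup per identifier
            log_score += math.log(cond.get(c, {}).get(k, floor))
        scores[c] = log_score
    return scores


def top_labels(scores, m=1):
    """Return the m classes with the highest log-score."""
    return sorted(scores, key=scores.get, reverse=True)[:m]


# Hand-made toy parameters (illustrative values only).
prior = {"a": 0.5, "b": 0.5}
cond = {
    "a": {1: 1 / 3, 2: 1 / 3, 3: 1 / 3},
    "b": {1: 0.2, 2: 0.2, 3: 0.4, 4: 0.2},
}
scores_hub = predict_scores([3, 1, 2, 4], prior, cond)
scores_pair = predict_scores([1, 2], prior, cond)
```

Working in log space avoids numerical underflow when a node has many neighbors, and the per-class loop makes the linear dependence on the neighborhood size explicit.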
4 Experiments
In this section, we first use IDRN to detect communities with partial cluster labels to show its ‘community-aware’ characteristic; then we introduce the datasets and the evaluation metrics for multi-label classification tasks. After that, we conduct several experiments to show the effectiveness of our proposed algorithm. Code to reproduce our results is available at the authors’ website.^{1}
4.1 Networks with Known Communities
To empirically demonstrate the ‘community-aware’ characteristic of IDRN [17], in this part we apply IDRN to five real-world networks whose community metadata correspond to the underlying ground truth. Using the community metadata as labels, we directly run IDRN without any community prior to predict the labels of the unlabeled nodes in these networks.
4.1.1 Metrics on Comparing Community Partitions
To quantify the similarity between the communities extracted by different algorithms and the ‘ground truth’ communities, we choose the normalized mutual information (NMI) [34], the Jaccard index (Jaccard) and the Rand index (Rand) as the metrics. The details of these metrics can be found in the survey [33]. All these metrics range from 0, when the detected community labels are uninformative, to 1, when the community labels reproduce the original partition completely.
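As an illustration of the two pair-counting metrics, the following Python sketch computes the Rand index and the Jaccard index between a ground-truth partition and a detected partition. It uses the standard pair-counting definitions and a quadratic loop over node pairs, which is adequate only for small networks; the function names are hypothetical.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Classify every node pair by co-membership in the two partitions.

    truth, pred: {node: cluster id}, defined over the same node set.
    Returns (both, only_truth, only_pred, neither).
    """
    both = only_truth = only_pred = neither = 0
    for u, v in combinations(sorted(truth), 2):
        same_t = truth[u] == truth[v]
        same_p = pred[u] == pred[v]
        if same_t and same_p:
            both += 1
        elif same_t:
            only_truth += 1
        elif same_p:
            only_pred += 1
        else:
            neither += 1
    return both, only_truth, only_pred, neither


def rand_index(truth, pred):
    """Fraction of node pairs on which the two partitions agree."""
    a, b, c, d = pair_counts(truth, pred)
    return (a + d) / (a + b + c + d)


def jaccard_index(truth, pred):
    """Pairs grouped together in both partitions over pairs grouped in at least one."""
    a, b, c, _ = pair_counts(truth, pred)
    return a / (a + b + c) if (a + b + c) else 1.0


truth = {1: 0, 2: 0, 3: 1, 4: 1}
pred = {1: 0, 2: 0, 3: 0, 4: 1}
rand_value = rand_index(truth, pred)
jaccard_value = jaccard_index(truth, pred)
```

In the toy example, moving node 3 into the wrong cluster leaves three of the six node pairs in agreement, so the Rand index is 0.5 and the Jaccard index is 0.25.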
4.1.2 Networks with Community Metadata

karate: The Karate club network is a well-known network of friendships between 34 members of a karate club at an American university [35]. After a dispute between the coach and the treasurer, the network split into two communities. There are 34 nodes and 78 edges in this network.

dolphin: The dolphin social network is an undirected social network of frequent associations between 62 dolphins in a community living off Doubtful Sound, New Zealand [36]. The dolphin network splits into two communities as a result of the departure of a key individual. The links between nodes are established by observation of statistically significant frequent associations. The network contains 62 nodes and 159 edges.

football: This is a college football network which represents the schedule of games between American college football teams in the 2000 season [37]. Nodes in the network represent teams, and edges represent regular-season games between teams. There are 115 nodes and 613 edges in the network. This network is partitioned into 12 conferences.

polbook: This is a network of books about US politics published around the time of the 2004 presidential election and sold by the online bookseller Amazon.com [38]. Edges between books represent frequent co-purchasing of books by the same buyers. The community labels are ‘liberal,’ ‘neutral’ and ‘conservative.’ There are 105 nodes and 441 edges in the network.

polblog: This is a network of hyperlinks between weblogs on US politics, recorded in 2005 by Adamic and Glance [39]. The community labels are ‘liberal’ and ‘conservative,’ which are assigned by blog directories or occasional self-evaluation. There are 1490 nodes and 16,715 edges in the network.
4.1.3 Experiments on Networks
Table 1 Experiment comparisons between IDRN (with 10% labeled nodes for training) and other unsupervised community detection algorithms by the metrics of NMI, Rand index and Jaccard index (all values in %)

| Network | Metric | IDRN | GN | CNM | Louvain | MMO | LPA |
|---|---|---|---|---|---|---|---|
| Karate | NMI | 87.62 | 57.98 | 69.25 | 68.73 | 56.12 | 46.58 |
| | Rand | 95.44 | 76.92 | 85.90 | 85.90 | 74.36 | 85.90 |
| | Jaccard | 91.58 | 74.29 | 84.06 | 83.82 | 70.59 | 84.72 |
| Dolphin | NMI | 79.88 | 55.42 | 62.08 | 51.62 | 40.63 | 51.42 |
| | Rand | 92.72 | 82.39 | 83.23 | 76.73 | 69.18 | 79.25 |
| | Jaccard | 88.40 | 81.82 | 83.23 | 76.13 | 68.59 | 79.11 |
| Football | NMI | 83.67 | 87.89 | 76.24 | 85.61 | 91.11 | 82.56 |
| | Rand | 95.47 | 92.01 | 84.18 | 90.38 | 92.33 | 90.05 |
| | Jaccard | 62.25 | 88.84 | 79.49 | 86.89 | 88.92 | 86.50 |
| Polbook | NMI | 59.87 | 55.85 | 53.08 | 53.69 | 40.63 | 56.40 |
| | Rand | 85.75 | 83.67 | 82.77 | 83.22 | 57.60 | 84.35 |
| | Jaccard | 69.92 | 82.90 | 82.16 | 82.38 | 55.16 | 83.61 |
| Polblog | NMI | 53.30 | 30.34 | 37.99 | 37.55 | 34.77 | 39.09 |
| | Rand | 80.36 | 93.70 | 95.33 | 95.12 | 93.47 | 95.66 |
| | Jaccard | 67.63 | 93.28 | 95.02 | 94.79 | 93.03 | 95.37 |
Table 1 shows the average NMI score, the average Jaccard index, and the average Rand index for community clustering results in the datasets. We highlight the best algorithms in each metric of different datasets. As shown in the table, IDRN outperforms most existing unsupervised baselines. These experiments on networks with known community structure show that, by taking a semi-supervised approach with just a few labeled nodes, our algorithm can learn the ‘community-aware’ characteristic better than most widely used unsupervised community detection algorithms.
4.2 Datasets for Classification
 Amazon

The dataset contains a subset of books from the Amazon co-purchasing network data extracted by Nandanwar and Murty [1]. For each book, the dataset provides a list of other similar books, which is used to build a network. The genre of the books gives a natural categorization, and the categories are used as class labels in our experiment.
 CoRA

It contains a collection of research articles in the computer science domain with predefined research topic labels, which are used as the ground-truth labels for each node.
 IMDb

The graph contains a subset of English movies from IMDb,^{3} and the links indicate the relevant movie pairs based on the top 5 billed stars [1]. Genre of the movies gives a natural class categorization, and the categories are used as class labels.
 PubMed

The dataset contains publications from the PubMed database, and each publication is assigned to one of three diabetes classes. So it is a single-label dataset in our learning problem.
 Wikipedia

The network data is a dump of Wikipedia pages from different areas of computer science. Nandanwar and Murty [1] chose 16 top-level category pages and recursively crawled subcategories up to a depth of 3. The top-level categories are used as class labels.
 Youtube

A subset of Youtube users with interest grouping information is used in our experiment. The graph contains the relationships between users, and the user nodes are assigned to multiple interest groups provided by Nandanwar and Murty [1].
 Blogcatalog and Flickr

These datasets are social networks, and each node is labeled by at least one category. The categories can be used as the ground truth of each node for evaluation in the multi-label classification task.
 PPI

It is a protein–protein interaction (PPI) network for Homo sapiens. The labels of nodes represent the biological states.
 POS

This is a co-occurrence network of words appearing in the Wikipedia dump. The node labels represent the Part-of-Speech (POS) tags of each word.
 CiteSeer

The CiteSeer dataset consists of 3312 labeled scientific publications classified into one of six classes. The citation network consists of 4536 undirected links among the labeled nodes.
 WebKB

The WebKB dataset consists of 877 scientific publications classified into one of five classes. The citation network consists of 1608 links. After forming an undirected graph with all nodes labeled, the graph has 877 nodes and 1388 undirected edges.
 SocialSpam

This anonymized dataset was collected from the Tagged.com social network website [23]. It contains 5.2 million users in the graph and 496 million undirected links between them. Each user is manually labeled as ‘spammer’ or ‘not spammer.’
 \(\text {Snow2014}_{\mathrm{all}}\) and Snow2014

We use the mention and retweet social interactions to form the graph edges from the tweet collection introduced in the SNOW 2014 Data Challenge [43], and the labels belong to various types of user attributes. This dataset is partially labeled, and we form two graphs from it. The \(\text {Snow2014}_{\mathrm{all}}\) graph is partially labeled with 10,992 nodes labeled, and the Snow2014 graph is a subgraph of it with all nodes labeled. Both graphs are unweighted and undirected in order to make the method comparisons fair.
 \(\text {Youtube}_{\mathrm{all}}\)

In this graph, vertices represent users of the YouTube^{4} video-sharing website. Apart from uploading videos, users form a subscription graph among themselves and also subscribe to various interest groups. There are 1,138,499 vertices and 2,990,443 edges in the original graph with only 31,703 nodes labeled.
Table 2 Summary of undirected networks used for multi-label classification

| Dataset | #Nodes | #Edges | #Classes | Average category | \(\#\text {Edges}/\#\text {Nodes}\) |
|---|---|---|---|---|---|
| Amazon | 83,742 | 190,097 | 30 | 1.546 | 2.270 |
| CoRA | 24,519 | 92,207 | 10 | 1.004 | 3.782 |
| IMDb | 19,359 | 362,079 | 21 | 2.301 | 18.703 |
| PubMed | 19,717 | 44,324 | 3 | 1.000 | 2.248 |
| Wikipedia | 35,633 | 495,388 | 16 | 1.312 | 13.903 |
| Youtube | 22,693 | 96,361 | 47 | 1.707 | 4.246 |
| Blogcatalog | 10,312 | 333,983 | 39 | 1.404 | 32.387 |
| Flickr | 80,513 | 5,899,882 | 195 | 1.338 | 73.278 |
| PPI | 3890 | 37,845 | 50 | 1.707 | 9.804 |
| POS | 4777 | 92,295 | 40 | 1.417 | 19.320 |
| CiteSeer | 3312 | 4536 | 6 | 1.000 | 1.369 |
| Snow2014 | 9489 | 22,309 | 90 | 2.351 | 2.538 |
| WebKB | 877 | 1388 | 5 | 1.000 | 1.582 |
| SocialSpam | 5,275,125 | 496,691,571 | 2 | 1.000 | 94.157 |
| \(\text {Snow2014}_{\mathrm{all}}\) | 533,874 | 942,226 | 90 | 2.534 | 1.764 |
| \(\text {Youtube}_{\mathrm{all}}\) | 1,138,499 | 2,990,443 | 47 | 1.000 | 2.626 |
4.3 Classification Evaluation Metrics
In this part, we explain the details of the evaluation metrics: Hamming score, \({\text {Micro-F}}_{1}\) score and \({\text {Macro-F}}_{1}\) score, which have also been widely used in many other multi-label within-network classification tasks [1, 5, 14]. Given node i, let \(T_i\) be the true label set and \(P_i\) be the predicted label set; then we have the following scores:
Definition 1
\({\text {Hamming Score}}=\sum _{i=1}^{|{\mathcal {V}}|}{\frac{|T_i \cap P_i|}{|T_i \cup P_i|}},\)
Definition 2
\({\text {Micro-F}}_{1}\ {\text {Score}}=\frac{2\sum _{i=1}^{|{\mathcal {V}}|}|T_i \cap P_i|}{\sum _{i=1}^{|{\mathcal {V}}|}|T_i| + \sum _{i=1}^{|{\mathcal {V}}|}|P_i|},\)
Definition 3
\({\text {Macro-F}}_{1}\ {\text {Score}}=\frac{1}{|{\mathcal {L}}|}\sum _{j=1}^{|{\mathcal {L}}|}\frac{2\sum _{i\in {\mathcal {L}}_j}|T_i \cap P_i|}{ \sum _{i \in {\mathcal {L}}_j}|T_i| + \sum _{i \in {\mathcal {L}}_j}|P_i| },\)
where \(|{\mathcal {L}}|\) is the number of classes and \({\mathcal {L}}_j\) is the set of nodes in class j.
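The three scores can be sketched in Python from per-node label sets. Two choices here are assumptions rather than part of the definitions: the Hamming score is averaged over nodes so that it lies in [0, 1], and the Macro-F1 sketch uses the standard per-class reading, in which the sums for class j count only label j. The function names are hypothetical.

```python
def hamming_score(true_sets, pred_sets):
    """Mean over nodes of |T_i & P_i| / |T_i | P_i| (averaging is an assumption)."""
    total = 0.0
    for node, t in true_sets.items():
        p = pred_sets.get(node, set())
        union = t | p
        total += len(t & p) / len(union) if union else 1.0
    return total / len(true_sets)


def micro_f1(true_sets, pred_sets):
    """2 * sum |T_i & P_i| / (sum |T_i| + sum |P_i|), pooled over all nodes."""
    inter = sum(len(t & pred_sets.get(n, set())) for n, t in true_sets.items())
    size_t = sum(len(t) for t in true_sets.values())
    size_p = sum(len(pred_sets.get(n, set())) for n in true_sets)
    denom = size_t + size_p
    return 2 * inter / denom if denom else 0.0


def macro_f1(true_sets, pred_sets, classes):
    """Unweighted mean of per-class F1 (standard per-class reading, assumed)."""
    scores = []
    for c in classes:
        tp = sum(1 for n, t in true_sets.items()
                 if c in t and c in pred_sets.get(n, set()))
        n_true = sum(1 for t in true_sets.values() if c in t)
        n_pred = sum(1 for n in true_sets if c in pred_sets.get(n, set()))
        denom = n_true + n_pred
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


true_sets = {1: {"a"}, 2: {"a", "b"}, 3: {"b"}}
pred_sets = {1: {"a"}, 2: {"a"}, 3: {"a", "b"}}
h = hamming_score(true_sets, pred_sets)
mi = micro_f1(true_sets, pred_sets)
ma = macro_f1(true_sets, pred_sets, ["a", "b"])
```

Micro-F1 pools all label decisions and is therefore dominated by frequent classes, while Macro-F1 weights every class equally, which is why the two can differ widely on datasets with many rare labels.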
4.3.1 Baseline Methods

WvRN [4]: The Weighted-vote Relational Neighbor classifier is a simple but surprisingly good relational classifier. Given the neighbors \({\mathcal {N}}_i\) of node i, WvRN estimates i’s classification probability \(P(y \mid i)\) of class label y as the weighted mean over its neighbors, as mentioned above. As the WvRN algorithm is not very complex, we implemented it ourselves in Java.

SocioDim [14]: This method is based on the SocioDim framework, which generates a representation in a d-dimensional space from the top-d eigenvectors of the modularity matrix of the network; the eigenvectors encode information about the community partitions of the network. The implementation of SocioDim in Matlab is available on the author’s website.^{10} As the authors preferred in their study, we set the number of social dimensions to 500.

DeepWalk [17]: DeepWalk generalizes recent advancements in language modeling from sequences of words to nodes [44]. It uses local information obtained from truncated random walks to learn latent dense representations by treating random walks as the equivalent of sentences. The implementation of DeepWalk in Python has already been published by the authors.^{11}

LINE [2]: The LINE algorithm embeds networks into low-dimensional vector spaces by preserving both the first-order and second-order proximities in networks. The implementation of LINE in C++ has already been published by the authors.^{12} To enhance the performance of this algorithm, we set the embedding dimension to 256 (i.e., 128 dimensions for the first-order proximities and 128 dimensions for the second-order proximities), as preferred in its implementation.

SNBC [1]: To classify a node, SNBC takes a structured random walk from the given node and makes a decision based on how nodes in the respective \(k^{th}\)-level neighborhood are labeled. The implementation of SNBC in Matlab has already been published by the authors.^{13}

node2vec [19]: It takes an approach similar to DeepWalk, which generalizes recent advancements in language modeling from sequences of words to nodes. With a flexible neighborhood sampling strategy, node2vec learns a mapping of nodes to a low-dimensional feature space that maximizes the likelihood of preserving network neighborhoods of nodes. The implementation of node2vec in Python is available on the authors’ website.^{14}
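For reference, the weighted-mean estimate at the core of the WvRN baseline can be sketched in Python as below. This shows a single estimate for one node, not the full collective inference procedure; the function name `wvrn_estimate`, the default edge weight of 1, and the choice to skip neighbors without an estimate are assumptions made for illustration.

```python
def wvrn_estimate(neighbors, class_probs, weights=None):
    """Weighted mean of the neighbors' class distributions for a single node.

    neighbors:   iterable of neighbor ids
    class_probs: {node: {label: probability}} for neighbors with a known
                 or previously estimated class distribution
    weights:     optional {neighbor: edge weight}; defaults to 1.0 per edge
    """
    totals = {}
    z = 0.0                      # normalizer: sum of the weights actually used
    for j in neighbors:
        if j not in class_probs:
            continue             # assumption: skip neighbors without an estimate
        w = 1.0 if weights is None else weights.get(j, 1.0)
        z += w
        for label, p in class_probs[j].items():
            totals[label] = totals.get(label, 0.0) + w * p
    if z == 0.0:
        return {}
    return {label: value / z for label, value in totals.items()}


# Two neighbors labeled "a", one labeled "b", one unlabeled (node 4).
known = {1: {"a": 1.0}, 2: {"a": 1.0}, 3: {"b": 1.0}}
estimate = wvrn_estimate([1, 2, 3, 4], known)
weighted = wvrn_estimate([1, 3], known, weights={1: 1.0, 3: 3.0})
```

Unlike the identifier-based features, this estimate depends only on the class labels of the neighbors, so two nodes with differently identified but identically labeled neighbors are indistinguishable to it.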
4.4 Different Community Prior
Table 3 Experiment comparisons of IDRN with different community priors by the metrics of Hamming score, \({\text {Micro-F}}_1\) score and \({\text {Macro-F}}_1\) score with 10% nodes labeled for training

| Metric | Network | IDRN | \({\hbox {IDRN}}_{\mathrm{Louvain}}\) | \({\hbox {IDRN}}_{\mathrm{CNM}}\) | \({\hbox {IDRN}}_{\mathrm{MMO}}\) | \({\hbox {IDRN}}_{\mathrm{LPA}}\) |
|---|---|---|---|---|---|---|
| Hamming score (%) | Amazon | 56.20 | 63.86 | 63.54 | 56.09 | 58.65 |
| | Youtube | 37.17 | 39.91 | 39.94 | 38.90 | 39.58 |
| | CoRA | 71.71 | 73.58 | 72.33 | 71.77 | 72.33 |
| | IMDb | 20.33 | 20.79 | 21.20 | 20.97 | 21.00 |
| | PubMed | 73.76 | 78.02 | 78.07 | 74.57 | 76.14 |
| | Wikipedia | 71.67 | 71.12 | 70.54 | 72.02 | 72.02 |
| | Flickr | 28.44 | 29.11 | 29.62 | 29.66 | 29.71 |
| | Blogcatalog | 26.77 | 26.29 | 26.39 | 26.13 | 26.73 |
| | PPI | 11.39 | 12.34 | 11.77 | 12.06 | 12.22 |
| | POS | 39.09 | 39.94 | 39.62 | 39.14 | 39.12 |
| | CiteSeer | 43.55 | 55.76 | 54.42 | 46.22 | 49.21 |
| | Snow2014 | 24.97 | 27.50 | 26.80 | 25.37 | 25.95 |
| | WebKB | 47.17 | 44.25 | 46.01 | 48.37 | 44.01 |
| Micro-\(F_1\) (%) | Amazon | 56.86 | 64.62 | 64.32 | 57.07 | 59.64 |
| | Youtube | 43.08 | 45.20 | 45.20 | 44.27 | 45.03 |
| | CoRA | 71.72 | 73.59 | 72.34 | 71.78 | 72.34 |
| | IMDb | 29.63 | 29.81 | 30.29 | 29.94 | 29.98 |
| | PubMed | 73.76 | 78.02 | 78.07 | 74.57 | 76.14 |
| | Wikipedia | 73.21 | 72.79 | 72.18 | 73.57 | 73.51 |
| | Flickr | 31.99 | 32.56 | 33.09 | 33.03 | 33.09 |
| | Blogcatalog | 29.05 | 28.81 | 28.80 | 28.39 | 29.06 |
| | PPI | 15.82 | 17.15 | 16.21 | 16.64 | 16.74 |
| | POS | 42.36 | 43.71 | 43.29 | 42.48 | 42.56 |
| | CiteSeer | 43.55 | 55.76 | 54.42 | 46.22 | 49.21 |
| | Snow2014 | 28.87 | 30.55 | 30.24 | 29.09 | 28.94 |
| | WebKB | 47.17 | 44.25 | 46.01 | 48.37 | 44.01 |
| Macro-\(F_1\) (%) | Amazon | 53.48 | 61.00 | 61.03 | 53.91 | 57.57 |
| | Youtube | 34.48 | 37.85 | 36.84 | 35.06 | 37.28 |
| | CoRA | 64.57 | 66.36 | 64.60 | 64.54 | 65.80 |
| | IMDb | 19.96 | 20.57 | 20.57 | 20.14 | 20.43 |
| | PubMed | 72.04 | 76.79 | 76.89 | 72.86 | 74.64 |
| | Wikipedia | 64.84 | 65.46 | 64.40 | 65.29 | 64.64 |
| | Flickr | 14.85 | 14.80 | 15.17 | 14.91 | 15.68 |
| | Blogcatalog | 11.39 | 12.08 | 11.74 | 11.32 | 11.56 |
| | PPI | 10.93 | 12.96 | 12.33 | 12.53 | 11.88 |
| | POS | 5.88 | 6.68 | 6.65 | 5.87 | 6.12 |
| | CiteSeer | 40.36 | 52.05 | 50.43 | 43.08 | 46.32 |
| | Snow2014 | 14.00 | 17.82 | 17.51 | 14.42 | 17.02 |
| | WebKB | 26.38 | 26.01 | 26.76 | 25.44 | 25.74 |
To show the performance of IDRN with community priors from various community detection algorithms, we combine IDRN with the CNM algorithm [37], the Louvain algorithm [40], the MMO algorithm [41] and the label propagation algorithm (LPA) [42]. Table 3 shows the metrics of IDRN with 10% of the nodes labeled with the underlying cluster labels for training and the rest used for testing. As shown in the table, IDRN with a community prior improves the metrics over the variant with a global prior in most cases. The results also indicate that there is not much difference between the priors obtained by different community detection algorithms. Although IDRN with the CNM algorithm is comparable with IDRN with the Louvain algorithm, to balance speed and performance we choose the Louvain algorithm to obtain the community prior of IDRN in the following experiments, as suggested by the experiments in other papers [8, 33].
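One plausible way to turn a community partition into a class prior is sketched below in Python. It is an assumption-laden illustration, not the definition used by IDRN: it estimates \(P(Y = c \mid C)\) by counting the labels of training nodes inside each community and falls back to the global prior when a community contains no labeled node. The community assignment is taken as given (for example, from Louvain), and the function names are hypothetical.

```python
from collections import Counter, defaultdict


def community_priors(community_of, labels):
    """Estimate a global class prior and one class prior per community.

    community_of: {node: community id} from any community detection algorithm
    labels:       {node: set of true labels} for the labeled training nodes
    """
    global_counts = Counter()
    per_comm = defaultdict(Counter)
    for node, true_labels in labels.items():
        for c in true_labels:
            global_counts[c] += 1
            per_comm[community_of[node]][c] += 1

    total = sum(global_counts.values())
    global_prior = {c: n / total for c, n in global_counts.items()}
    comm_prior = {}
    for comm, counts in per_comm.items():
        comm_total = sum(counts.values())
        comm_prior[comm] = {c: n / comm_total for c, n in counts.items()}
    return global_prior, comm_prior


def prior_for(node, community_of, global_prior, comm_prior):
    """Community prior if the node's community has labeled members, else global."""
    return comm_prior.get(community_of.get(node), global_prior)


community_of = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "C"}
labels = {1: {"x"}, 2: {"x", "y"}, 4: {"y"}}
global_prior, comm_prior = community_priors(community_of, labels)
prior_node3 = prior_for(3, community_of, global_prior, comm_prior)
prior_node6 = prior_for(6, community_of, global_prior, comm_prior)
```

Because the prior only needs a node-to-community mapping, any of the four detection algorithms compared in Table 3 could supply `community_of` without changing the rest of the pipeline.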
4.5 Performance of Classifiers
Table 4 Experiment comparisons of baselines, IDRN and \({\text {IDRN}}_{\mathrm{c}}\) by the metrics of Hamming score, \({\text {Micro-F}}_1\) score and \({\text {Macro-F}}_1\) score with 10% labeled nodes for training

| Metric | Network | WvRN | SocioDim | DeepWalk | LINE | SNBC | node2vec | IDRN | \({{\hbox {IDRN}}}_{\mathrm{c}}\) |
|---|---|---|---|---|---|---|---|---|---|
| Hamming score (%) | Amazon | 33.76 | 38.36 | 31.79 | 40.55 | 59.00 | 49.18 | 56.20 | 63.86 |
| | Youtube | 22.82 | 31.94 | 36.63 | 33.90 | 35.06 | 33.86 | 37.17 | 39.91 |
| | CoRA | 55.83 | 63.02 | 71.37 | 65.50 | 66.75 | 72.66 | 71.71 | 73.58 |
| | IMDb | 33.59 | 22.21 | 33.12 | 30.39 | 30.18 | 32.97 | 20.33 | 20.79 |
| | PubMed | 50.32 | 65.68 | 77.40 | 68.31 | 79.22 | 79.02 | 73.76 | 78.02 |
| | Wikipedia | 45.10 | 65.29 | 71.10 | 68.81 | 68.78 | 70.69 | 71.67 | 71.12 |
| | Flickr | 21.37 | 29.67 | 28.73 | 30.96 | 24.20 | 30.65 | 28.44 | 29.11 |
| | Blogcatalog | 17.89 | 27.04 | 25.63 | 25.32 | 22.40 | 27.46 | 26.77 | 26.29 |
| | PPI | 6.28 | 8.61 | 8.14 | 9.27 | 7.97 | 8.88 | 11.39 | 12.34 |
| | POS | 23.05 | 21.06 | 31.40 | 38.24 | 37.73 | 34.59 | 39.09 | 39.94 |
| | CiteSeer | 31.89 | 20.04 | 23.80 | 22.79 | 54.89 | 50.99 | 43.44 | 55.76 |
| | Snow2014 | 15.71 | 9.39 | 8.71 | 12.97 | 22.57 | 20.00 | 24.91 | 27.50 |
| | WebKB | 42.67 | 28.06 | 27.58 | 45.57 | 47.97 | 35.79 | 48.34 | 44.25 |
| | SocialSpam | 11.29 | – | – | – | – | – | 98.89 | 98.91 |
| Micro-\(F_1\) (%) | Amazon | 34.86 | 39.62 | 33.06 | 42.42 | 59.79 | 50.55 | 56.86 | 64.62 |
| | Youtube | 27.81 | 36.40 | 40.73 | 38.01 | 39.67 | 38.35 | 43.08 | 45.20 |
| | CoRA | 55.85 | 63.00 | 71.36 | 65.47 | 66.78 | 72.66 | 71.72 | 73.59 |
| | IMDb | 42.62 | 29.99 | 41.82 | 39.89 | 39.53 | 42.36 | 29.63 | 29.81 |
| | PubMed | 50.32 | 65.68 | 77.40 | 68.31 | 79.22 | 79.02 | 73.76 | 78.02 |
| | Wikipedia | 48.51 | 66.95 | 72.19 | 70.21 | 70.68 | 72.07 | 73.21 | 72.79 |
| | Flickr | 25.40 | 32.91 | 31.66 | 34.03 | 27.60 | 33.76 | 31.99 | 32.56 |
| | Blogcatalog | 20.50 | 28.86 | 27.29 | 27.45 | 24.66 | 29.41 | 29.05 | 28.81 |
| | PPI | 18.41 | 12.29 | 11.52 | 13.16 | 11.32 | 12.80 | 15.82 | 17.15 |
| | POS | 26.04 | 24.42 | 35.98 | 42.70 | 41.99 | 39.09 | 42.36 | 43.71 |
| | CiteSeer | 31.89 | 20.04 | 23.80 | 22.79 | 54.89 | 50.99 | 43.44 | 55.76 |
| | Snow2014 | 20.43 | 15.23 | 14.50 | 17.26 | 23.36 | 24.55 | 29.02 | 30.55 |
| | WebKB | 42.67 | 28.06 | 27.58 | 45.57 | 47.97 | 35.79 | 48.34 | 44.25 |
| | SocialSpam | 11.29 | – | – | – | – | – | 98.89 | 98.91 |
| Macro-\(F_1\) (%) | Amazon | 32.00 | 35.95 | 21.64 | 37.52 | 56.84 | 45.85 | 53.48 | 61.00 |
Youtube  18.17  34.19  33.92  33.47  32.07  32.60  34.48  37.85  
CoRA  43.16  56.82  62.68  59.07  55.68  64.79  64.57  66.36  
IMDb  18.89  18.77  18.22  18.83  17.45  18.46  19.96  20.57  
PubMed  41.57  64.85  75.92  66.66  77.16  77.50  72.04  76.79  
Wikipedia  45.58  58.93  62.29  62.17  61.99  64.90  64.84  65.46  
Flickr  15.54  18.28  17.13  21.80  7.36  18.46  14.85  14.80  
Blogcatalog  11.47  18.88  14.65  15.52  8.29  17.16  11.39  12.08  
PPI  7.35  10.59  9.61  10.82  8.27  11.27  10.93  12.96  
POS  3.91  6.05  8.26  8.93  5.92  8.61  5.58  6.68  
CiteSeer  26.13  18.27  21.84  20.26  51.17  46.39  40.50  52.05  
Snow2014  9.67  4.41  4.53  8.84  12.57  14.03  14.06  17.82  
WebKB  15.86  20.19  20.54  24.53  25.53  20.05  26.31  26.01  
SocialSpam  10.68  –  –  –  –  –  98.75  98.77 
Table 4 shows the average metric scores of the multilabel classification results on the datasets. We highlight the best-performing algorithm for each metric in bold. As shown in the table, in most cases IDRN and \(\text {IDRN}_c\) improve the metrics over the existing baselines. Our model with the community prior, \(\text {IDRN}_c\), often performs better than IDRN with the global prior. Across the three metrics, \(\text {IDRN}_c\) performs better than the other algorithms on most of these datasets. Take the IMDb dataset as an example: the Hamming score and \({\text {MicroF}}_1\) score obtained by \(\text {IDRN}_c\) are worse than those obtained by some baseline algorithms, such as node2vec and WvRN, whereas the \(\text {MacroF}_1\) score obtained by \(\text {IDRN}_c\) is the best. As the \(\text {MacroF}_1\) score is an average over classes while the Hamming and \({\text {MicroF}}_1\) scores are averages over all testing nodes, the result may indicate that our algorithms are more accurate across the different classes in the imbalanced IMDb dataset. To show the results more clearly, we also compute the average scores of each algorithm over these datasets, which are shown in the last line of each of the three metrics in Table 4. On average, our approach provides higher Hamming, \({\text {MicroF}}_1\) and \(\text {MacroF}_1\) scores than the competing methods. The results indicate that \(\text {IDRN}_c\) outperforms most baseline methods when networks are sparsely labeled.
We also run our algorithm on an extremely large dataset, SocialSpam. As SocialSpam is very large, most of the competing methods cannot finish within 24 hours, which is indicated by the ‘–’ characters in the table. Among all the competing methods, only the simplest algorithm, WvRN, can classify the nodes in SocialSpam within 24 hours. However, our results are much better than those obtained by WvRN: the Hamming score, \({\text {MicroF}}_1\) score and \(\text {MacroF}_1\) score obtained by \(\text {IDRN}_c\) are 98.91, 98.91 and 98.77%, respectively. We believe the reason is that the SocialSpam network is much denser than the others and can therefore provide more node identifiers as features, so our algorithms work very well in such dense and large networks.
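The contrast between node-averaged and class-averaged metrics observed on IMDb can be reproduced on a toy example. The sketch below is illustrative only: it assumes the Hamming score is the mean per-node overlap between the true and predicted label sets, a common definition that may differ in detail from the one used in the experiments, and all function names are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def hamming_score(y_true: Sequence[Set[str]], y_pred: Sequence[Set[str]]) -> float:
    """Mean over nodes of |true & pred| / |true | pred| (assumed definition)."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        union = t | p
        total += len(t & p) / len(union) if union else 1.0
    return total / len(y_true)


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true: Sequence[Set[str]],
                   y_pred: Sequence[Set[str]],
                   classes: List[str]) -> Tuple[float, float]:
    """Micro-F1 pools counts over all classes; macro-F1 averages per-class F1."""
    counts: Dict[str, List[int]] = {c: [0, 0, 0] for c in classes}  # tp, fp, fn
    for t, p in zip(y_true, y_pred):
        for c in classes:
            if c in t and c in p:
                counts[c][0] += 1
            elif c in p:
                counts[c][1] += 1
            elif c in t:
                counts[c][2] += 1
    macro = sum(_f1(*counts[c]) for c in classes) / len(classes)
    tp, fp, fn = (sum(v[i] for v in counts.values()) for i in range(3))
    return _f1(tp, fp, fn), macro


# Toy imbalanced test set: nine nodes of class "a", one node of class "b".
truth = [{"a"}] * 9 + [{"b"}]
majority = [{"a"}] * 10                         # always predicts the frequent class
balanced = [{"a"}] * 7 + [{"b"}] * 2 + [{"b"}]  # finds "b" but mislabels two "a" nodes

for name, pred in [("majority", majority), ("balanced", balanced)]:
    micro, macro = micro_macro_f1(truth, pred, ["a", "b"])
    print(name, round(hamming_score(truth, pred), 3), round(micro, 3), round(macro, 3))
```

Here the majority-class predictor has the higher Hamming and micro-F1 scores, yet the second predictor has the higher macro-F1 score because it recovers the rare class. This is the same pattern as the IMDb row of Table 4.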
Table 5. Summary of running time (in s) of IDRN and \(\text {IDRN}_\mathrm{c}\) in each validation process by using 10% training data and 90% testing data
Dataset  Wikipedia  \({\hbox {Snow2014}}_{\mathrm{all}}\)  \({\hbox {Youtube}}_{\mathrm{all}}\)  Flickr  Spammer 

IDRN  2  3  14  518  2937 
\(\text {IDRN}_\mathrm{c}\)  2  5  9  482  4145 
As an empirical benchmark, we perform a scalability test to evaluate the running time of our algorithms on large-scale networks; the results are given in Table 5. We report results for the networks with the most edges, using 10% of the data for training and the rest for testing. Both IDRN and \(\text {IDRN}_c\) are written in Java and run in a single thread on a server with 256 GB of memory, an Intel Xeon 2.60 GHz CPU and the Red Hat operating system. As shown in Table 5, IDRN and \(\text {IDRN}_c\) handle large-scale networks quickly, which is in accordance with the complexity analysis mentioned above. Both methods finish their training and testing processes in about one hour on the largest network, SocialSpam, which contains 5.2 million users and 496 million links. In contrast, most of the other compared algorithms may struggle even to load such large-scale networks into memory.
4.6 Performance of Classifiers in Partially Labeled Networks
Table 6. Experiment comparisons of baselines, IDRN and \(\text {IDRN}_{\mathrm{c}}\) by the metrics of Hamming score, \({\text {MicroF}}_1\) score and \(\text {MacroF}_1\) score with 10% labeled nodes for training in partially labeled graphs
Metric  Network  WvRN  DeepWalk  LINE  node2vec  IDRN  \({\hbox {IDRN}}_{\mathrm{c}}\) 

Hamming score (%)  \(\text {Youtube}_{\mathrm{all}}\)  24.05  10.32  27.51  33.21  39.16  35.02 
\(\text {Snow2014}_{\mathrm{all}}\)  15.78  7.24  21.76  20.64  22.49  24.76  
Micro\(F_1\) (%)  \(\text {Youtube}_{\mathrm{all}}\)  37.42  15.18  32.95  38.38  60.46  41.05 
\(\text {Snow2014}_{\mathrm{all}}\)  20.53  12.06  25.62  26.45  25.63  27.23  
Macro\(F_1\) (%)  \(\text {Youtube}_{\mathrm{all}}\)  30.94  8.79  25.90  32.34  60.38  33.95 
\(\text {Snow2014}_{\mathrm{all}}\)  8.67  4.37  15.97  16.05  14.73  18.69 
5 Conclusion and Discussion
In this article, we propose a novel approach for node classification that uses the node identifiers in the egocentric networks as fine-grained likelihood features together with a coarse-grained community prior. Using the community prior, we obtain high-level features of the nodes’ categories. However, these coarse-grained features may not be discriminative enough, especially for nodes linked to different communities. As node identifiers can be shared by neighbors, identifiers retain fine-grained details that are critical for discriminating class labels in nodes’ adjacent communities. Thus, our approach combines the coarse-grained and fine-grained features. To our knowledge, this is the first time that node identifiers in the egocentric networks are used as features. We first show that IDRN can learn the ‘community-aware’ characteristic and that, given a few underlying cluster labels, it stably outperforms most existing unsupervised community detection algorithms. Empirical evaluation then confirms that our proposed algorithm is capable of handling high-dimensional identifier features and achieves better performance in real-world networks. We demonstrate the effectiveness of our approach on many publicly available datasets. Whether networks are sparsely or densely labeled, our approach usually provides higher metric scores than competing methods. Moreover, our method is practical and efficient, since it requires only features extracted from the network structure without any extra data, which makes it suitable for different real-world within-network classification tasks.
There is significant room for future research. We would like to assess our algorithm’s effectiveness in classifying search queries in click-through bipartite graphs by integrating the extra textual features available in web search companies. Furthermore, as many real-world networks, e.g., social networks and click-through networks, evolve over time, it is important to classify newly appearing nodes automatically. We could extend our algorithm to predict node labels in these evolving networks in the future. Besides, as we find that node identifiers are very useful features, we would like to try more effective classification models that can combine sparse, high-dimensional and discrete features, such as node identifiers, with other deep representations in a low-dimensional space to achieve higher classification performance.
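The combination of fine-grained identifier features with a class prior can be illustrated with a naive-Bayes-style stand-in. This is a sketch under stated assumptions rather than the IDRN model itself: it treats the identifiers of a node's direct neighbors as a bag of features, assumes one label per training node, and uses add-one smoothing. The names `count_identifiers`, `predict` and `alpha` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List

Graph = Dict[Hashable, List[Hashable]]


def count_identifiers(adj: Graph, labels: Dict[Hashable, str]) -> Dict[str, Counter]:
    """For each class, count how often each neighbor identifier occurs
    around the labeled nodes of that class."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for node, cls in labels.items():
        for neighbor in adj[node]:
            counts[cls][neighbor] += 1
    return counts


def predict(node: Hashable, adj: Graph, counts: Dict[str, Counter],
            prior: Dict[str, float], alpha: float = 1.0) -> str:
    """Pick the class maximising log prior + sum of smoothed log likelihoods
    of the node's neighbor identifiers (a stand-in for the real model)."""
    vocab = len(adj)  # every node identifier is a potential feature
    best_cls, best_score = None, -math.inf
    for cls, p in prior.items():
        cls_counts = counts.get(cls, Counter())
        total = sum(cls_counts.values())
        score = math.log(p)
        for neighbor in adj[node]:
            score += math.log((cls_counts[neighbor] + alpha) / (total + alpha * vocab))
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls


# Toy graph: hub h1 is shared by class-A nodes, hub h2 by class-B nodes.
adj = {
    "h1": ["a1", "a2", "a3"], "h2": ["b1", "b2", "b3"],
    "a1": ["h1"], "a2": ["h1"], "a3": ["h1"],
    "b1": ["h2"], "b2": ["h2"], "b3": ["h2"],
    "z": [],  # isolated node: only the prior can decide its class
}
labels = {"a1": "A", "a2": "A", "b1": "B", "b2": "B"}
counts = count_identifiers(adj, labels)
uniform = {"A": 0.5, "B": 0.5}
print(predict("a3", adj, counts, uniform), predict("b3", adj, counts, uniform))
print(predict("z", adj, counts, {"A": 0.2, "B": 0.8}))
```

In this toy graph the shared hub identifier decides the class of `a3` and `b3`, while the isolated node `z` has no identifier features and is classified by the prior alone. Passing a community-specific prior instead of a global one is where the coarse-grained information would enter.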
Acknowledgements
The authors would like to thank all the members in ADRS (ADvertisement Research for Sponsered search) group in Sogou Inc. for the help with parts of the data processing and experiments.
References
1. Nandanwar S, Murty MN (2016) Structural neighborhood based classification of nodes in a network. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, San Francisco, CA, USA, August 13–17, 2016, ACM, pp 1085–1094
2. Tang J, Qu M et al (2015) LINE: large-scale information network embedding. In: Proceedings of the 24th international conference on world wide web, WWW’15, pp 1067–1077
3. Wang D, Cui P, Zhu W (2016) Structural deep network embedding. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (ACM, New York, NY, USA), KDD’16, pp 1225–1234
4. Macskassy SA, Provost F (2003) A simple relational classifier. In: Proceedings of the second workshop on multi-relational data mining (MRDM-2003) at KDD-2003, pp 64–76
5. Wang X, Sukthankar G (2013) Multilabel relational neighbor classification using social context features. In: Proceedings of the 19th ACM SIGKDD conference on knowledge discovery and data mining (KDD) (Chicago, USA), pp 464–472
6. Girvan M, Newman MEJ (2002) Community structure in social and biological networks. Proc Natl Acad Sci 99(12):7821–7826
7. Barabási AL, Albert R (1999) Emergence of scaling in random networks. Science 286:509–512
8. Fortunato S, Hric D (2016) Community detection in networks: a user guide. Phys Rep 659:1–44
9. Rizos G, Papadopoulos S, Kompatsiaris Y (2017) Collective spammer detection in evolving multirelational social networks. PLoS ONE 12(3):e0173347
10. Ye Q, Wang F, Bo L (2016) StarrySky: a practical system to track millions of high-precision query intents. In: Proceedings of the 25th international conference companion on World Wide Web (WWW’16 companion). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, pp 961–966. https://doi.org/10.1145/2872518.2890588
11. Ye Q, Zhu C, Li G, Wang F (2017) Combining node identifier features and community priors for within-network classification. In: Asia-Pacific Web (APWeb) and web-age information management (WAIM) joint conference on web and big data part II, Springer, pp 3–17
12. Bhagat S, Cormode G, Muthukrishnan S (2011) Node classification in social networks. CoRR arXiv:1101.3291
13. Macskassy SA, Provost F (2007) Classification in networked data: a toolkit and a univariate case study. J Mach Learn Res 8(May):935–983
14. Tang L, Liu H (2009) Relational learning via latent social dimensions. In: Proceedings of the 15th ACM SIGKDD international conference on knowledge discovery and data mining (ACM, New York, NY, USA), KDD’09, pp 817–826
15. Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimensions. In: The 18th ACM conference on information and knowledge management (ACM, New York, NY, USA), pp 1107–1116
16. Ahmed A, Shervashidze N et al (2013) Distributed large-scale natural graph factorization. In: Proceedings of the 22nd international conference on World Wide Web, ACM, pp 37–48
17. Perozzi B, Al-Rfou R, Skiena S (2014) DeepWalk: online learning of social representations. In: Proceedings of the 20th ACM SIGKDD international conference on knowledge discovery and data mining (ACM, New York, NY, USA), KDD’14, pp 701–710
18. Joulin A, Grave E et al (2016) Bag of tricks for efficient text classification. CoRR arXiv:1607.01759
19. Grover A, Leskovec J (2016) Node2Vec: scalable feature learning for networks. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (ACM, New York, NY, USA), KDD’16, pp 855–864
20. McDowell LK, Aha DW (2013) Labels or attributes? Rethinking the neighbors for collective classification in sparsely-labeled networks. In: International conference on information and knowledge management (ACM Press, San Francisco, CA), pp 847–852
21. Rayana S, Akoglu L (2015) Collective opinion spam detection: bridging review networks and metadata. In: Proceedings of the 21st ACM SIGKDD international conference on knowledge discovery and data mining (ACM), pp 985–994
22. Bian J, Chang Y (2011) A taxonomy of local search: semi-supervised query classification driven by information needs. In: Proceedings of the 20th ACM international conference on information and knowledge management (ACM, New York, NY, USA), CIKM’11, pp 2425–2428
23. Fakhraei S, Foulds J et al (2015) Multilabel user classification using the community structure of online networks. PLoS ONE, KDD’15, pp 1769–1778. ACM
24. Jiang S, Hu Y et al (2016) Learning query and document relevance from a web-scale click graph. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval (ACM, New York, NY, USA), SIGIR’16, pp 185–194
25. Yin D, Hu Y et al (2016) Ranking relevance in yahoo search. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining (ACM, New York, NY, USA), KDD’16, pp 323–332
26. Wang S, Tang J et al (2016) Linked document embedding for classification. In: Proceedings of the 25th ACM international on conference on information and knowledge management (ACM, New York, NY, USA), CIKM’16, pp 115–124
27. Newman ME, Clauset A (2016) Structure and inference in annotated networks. Nat Commun 7:11863
28. Tu C, Liu H, Liu Z, Sun M (2017) CANE: context-aware network embedding for relation modeling. In: Proceedings of the 55th annual meeting of the association for computational linguistics, pp 1722–1731
29. Wang CJ, Wang TH et al (2017) ICE: item concept embedding via textual information. In: Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval (ACM), pp 85–94
30. Marsden PV (2002) Egocentric and sociocentric measures of network centrality. Soc Netw 24(4):407–422
31. Wang SI, Manning CD (2012) Baselines and bigrams: simple, good sentiment and topic classification. In: Proceedings of the ACL, pp 90–94
32. Murphy KP (2012) Machine learning: a probabilistic perspective. The MIT Press, Cambridge
33. Fortunato S (2010) Community detection in graphs. Phys Rep 486(3–5):75–174
34. Danon L, Duch J, Arenas A, Díaz-Guilera A (2005) Comparing community structure identification. J Stat Mech Theory Exp 9008:09008
35. Zachary WW (1977) An information flow model for conflict and fission in small groups. J Anthropol Res 33(4):452–473
36. Lusseau D, Schneider K et al (2003) The bottlenose dolphin community of Doubtful Sound features a large proportion of long-lasting associations. Behav Ecol Sociobiol 54(4):396
37. Clauset A, Newman MEJ, Moore C (2004) Finding community structure in very large networks. Phys Rev E 70(6):066111
38. Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393:440–442
39. Adamic LA, Glance N (2005) The political blogosphere and the 2004 US election: divided they blog. In: Proceedings of the 3rd international workshop on Link discovery (ACM), pp 36–43
40. Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E (2008) Fast unfolding of communities in large networks. J Stat Mech Theory Exp 2008(10):P10008
41. Ye Q, Bin W, Bai W (2013) Detecting communities in massive networks efficiently with flexible resolution. In: The influence of technology on social network analysis and mining, vol 6, chap 16. Springer, pp 373–392
42. Raghavan UN, Albert R, Kumara S (2007) Near linear time algorithm to detect community structures in large-scale networks. Phys Rev E 76(3):036106
43. Papadopoulos S, Corney D, Aiello LM (2014) SNOW 2014 data challenge: assessing the performance of news topic detection methods in social media. In: SNOW-DC@WWW, pp 1–8
44. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. CoRR arXiv:1301.3781
45. Fan RE, Chang KW et al (2008) LIBLINEAR: a library for large linear classification. J Mach Learn Res 9:1871–1874
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.