Evaluations of Similarity Measures on VK for Link Prediction

Lee, JooYoung; Tukhvatov, Rustam

doi:10.1007/s41019-018-0073-5


Open access
Published: 14 September 2018

Volume 3, pages 277–289, (2018)

Data Science and Engineering


Abstract

Recommender systems are among the most important components for many companies and social networks such as Facebook and YouTube. A recommendation system consists of algorithms that predict and recommend friends or products. This paper studies how to make it easier to find like-minded people with the same interests in social networks. In our research, we used real data from the most popular social network in Russia, VK (VKontakte). The study is motivated by the assumption that similarity breeds connection. We evaluate well-known similarity measures on our collected VK datasets and find that their performance is limited. The results show that the majority of users in VK tend not to add users with whom they have common acquaintances. We also propose a topology-based similarity measure to predict future friends. We then compare our results with those of other well-known methods and discuss the differences.


1 Introduction

Social networks help people interact through the Internet. Today, social network services allow users to share interests via text, music, video, etc. Clearly, people want to get more information and new contacts as fast as possible, and the information should be relevant to their preferences.

Link recommendation has become one of the most important features in online social networks and has been an active research area [3, 8, 9, 10, 14, 16]. Well-known examples of link recommendation include “People You May Want to Hire” on LinkedIn, “You May Know” on Google+ and “People You May Know” on Facebook. Given the tremendous academic and practical interest in link recommendation [6], we examined existing approaches and applied a recommendation engine to the VK social network.

We focus especially on the most popular similarity measures used in the literature, namely cosine similarity, common neighbors, Jaccard similarity and Adamic–Adar. We report the performance of each metric using the confusion matrix and the F1 measure, which are the standard baselines. We examine each metric on a set of small online social networks as well as on the collected VK datasets.

A social network can be represented as a graph in which the users are nodes and the users’ friendships (relations) are edges [11]. We aim to understand the factors that may affect the emergence of new edges and try to predict future connections in social networks. In Sect. 2, we introduce some of the major existing methods for link prediction. In Sect. 3, we present the existing methods that we test on the VK datasets and compare our results with. In Sect. 4, we describe in detail our experimental settings and the datasets that we collected and preprocessed. Section 5 then presents how we evaluate the results, as well as the evaluation of the most well-known methods. In Sect. 6, we show our results and discuss their implications. Finally, in Sect. 7, we conclude our research.

2 Related Work

Wu et al. [17] study users’ temporal behaviors on social network platforms. The evolution of social network services is driven by the interplay between users’ preferences and social network structures. The authors argue that users’ future preference behavior is affected by the network around them and by the homophily effect. In [1], the authors present Social Poisson Factorization (SPF), a probabilistic model that incorporates social network information into a traditional factorization method. SPF introduces the social aspect into algorithmic recommendation. They develop a scalable algorithm for analyzing data with SPF and demonstrate that it outperforms competing methods on six real-world datasets.

Zhang et al. [18] studied the social influence problem in a large microblogging network, Weibo.com.^{Footnote 1} They investigate the (re)tweet behaviors of users by considering the users’ ego networks. Our proposed method goes one step further by considering the second-degree neighbors of users. Kooti et al. [5] analyze online consumers’ behavior and try to explain ways to improve advertising targeting systems. The researchers use information in emails, such as purchase logs and communications between users, to find patterns in their interactions. For example, the authors measured the effect of gender and age and showed that a female email user is more likely to be an online shopper than an average male email user. Spending ability also goes up with age until the age of 30, then stabilizes in the early 60s and starts to drop afterward. Such findings help to predict future customers’ behavior and make purchases more pleasant for consumers.

Measuring similarity between nodes is the main task in the link prediction problem. In [15], the authors constructed a new way to measure similarity between nodes based on a game-theoretic interaction index. The basic form of the interaction index is built upon two solution concepts from game theory: the Shapley value and the Banzhaf index. It is also generalized to a wider class of solution concepts. The authors showed that, using their approach, it is possible to improve existing results in link prediction and community detection problems.

In this paper, we consider link prediction problems and, more precisely, we discuss similarity measures for link prediction. One well-accepted hypothesis is that similar users will become future friends. Therefore, it is essential to measure similarity between users so that future links can be accurately predicted. As we discuss throughout the paper, there exist many metrics to measure similarity. Once the similarity is measured, there are many ways to predict future connections. One intuitive way is to set a threshold and predict links between users whose similarity is greater than that threshold. Other simple ways include predicting the top k users and the top l% of users. More advanced techniques use machine learning algorithms to learn how many links to predict.
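The three simple selection rules named above (threshold, top k, top l%) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `scores` dictionary layout and the toy values are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # a candidate (user, user) pair; hypothetical representation


def predict_by_threshold(scores: Dict[Pair, float], threshold: float) -> List[Pair]:
    """Predict every pair whose similarity is strictly greater than the threshold."""
    return [pair for pair, score in scores.items() if score > threshold]


def predict_top_k(scores: Dict[Pair, float], k: int) -> List[Pair]:
    """Predict the k pairs with the highest similarity."""
    ranked = sorted(scores, key=lambda pair: scores[pair], reverse=True)
    return ranked[:k]


def predict_top_percent(scores: Dict[Pair, float], percent: float) -> List[Pair]:
    """Predict the top `percent` % of candidate pairs (rounded down)."""
    k = int(len(scores) * percent / 100)
    return predict_top_k(scores, k)


# Toy similarity scores, invented for illustration only.
scores = {("a", "b"): 0.9, ("a", "c"): 0.4, ("b", "d"): 0.7, ("c", "d"): 0.1}
```

With these toy scores, a threshold of 0.5 and the top 50% select the same two pairs, which will not be true in general: the threshold rule depends on the scale of the measure, while the top-k rules depend only on the ranking.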

2.1 VKontakte

In this section, we introduce VK, the network from which we collected our datasets over a period of 6 months. VK (VKontakte, meaning InContact) is the largest European online social network and social media service. According to SimilarWeb,^{Footnote 2} VK is the fifth most popular website in the world. As of January 2017, the average daily audience was about 87 million visitors,^{Footnote 3} with more than 410 million registered users.^{Footnote 4}

Every user in VK has a profile which contains various information. First and last names are mandatory fields, and other data such as birthday, city and interests are optional. Figure 1 shows an example of a profile in the social network. In addition, users can share interesting information by posting on their profile wall as well as by reposting from other users or groups,^{Footnote 5} and each post may contain attachments (documents or media files).

VK has open APIs for application development. They allow developers to access the server through requests. With the user’s consent, the APIs provide access to some user information, such as photographs, friends, profile and wall.
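As an illustration of what such a request might look like, the sketch below builds the URL for the VK API method `friends.get`. The paper does not say which methods or API version were used, so the choice of method, the version string `5.131` and the placeholder token are assumptions; the sketch only constructs the URL and does not contact the server.

```python
from urllib.parse import urlencode

VK_API_BASE = "https://api.vk.com/method"


def build_friends_request(user_id: int, access_token: str, version: str = "5.131") -> str:
    """Build a request URL for the friend list of one user.

    `access_token` must be obtained with the user's consent; the default
    API version is an assumption, not a value taken from the paper.
    """
    params = {"user_id": user_id, "access_token": access_token, "v": version}
    return f"{VK_API_BASE}/friends.get?{urlencode(params)}"


# "TOKEN" is a placeholder, not a real credential.
example_url = build_friends_request(1, "TOKEN")
```

The resulting URL could then be fetched with any HTTP client; the response handling is left out because its exact shape depends on the API version in use.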

3 Methods

The link prediction problem is closely tied to network structure. In our research, we measure the similarity between users using information from their second-degree neighbors.

3.1 Problem Statement

Given a snapshot of a social network $G^t=(V^t,E^t)$, where $V^t$ is the set of users and $E^t$ is the set of relationships at time $t$, we aim to find relationships $(v_i, v_j)$ such that $(v_i, v_j)\notin E^t$ and $(v_i, v_j)\in E^{t'}$, where $t<t'$.
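To make the target of the prediction concrete, the following sketch computes the set of links that are absent at time $t$ and present at time $t'$, restricted to users already present at $t$. The function name and the edge-list representation are assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, Set, Tuple

Edge = Tuple[str, str]


def future_links(nodes_t: Set[str],
                 edges_t: Iterable[Edge],
                 edges_t2: Iterable[Edge]) -> Set[FrozenSet[str]]:
    """Return undirected edges that are not in E^t but are in E^{t'}.

    Edges are stored as frozensets so that (u, v) and (v, u) are the same
    relationship. Only edges between users of the earlier snapshot are kept.
    """
    old = {frozenset(e) for e in edges_t}
    new = {frozenset(e) for e in edges_t2}
    return {e for e in new - old if e.issubset(nodes_t)}


# Toy snapshots, invented for illustration.
V_t = {"a", "b", "c"}
E_t = [("a", "b")]
E_t2 = [("b", "a"), ("a", "c"), ("c", "d")]  # "d" joined after time t
links = future_links(V_t, E_t, E_t2)
```

Here only the edge between `a` and `c` counts as a link to be predicted: `a`-`b` already existed, and `c`-`d` involves a user who was not in the earlier snapshot.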

The link prediction problem is the problem of predicting connections that may form in a network in the future. Here, we briefly list the measures which we examine in this study, where $f(X)$ denotes the set of neighbors (friends) of user $X$.

$${\rm Cosine}\,{\rm Similarity}\,(A, B) = \cos (\theta ) = \frac{f(A)\cdot f(B)}{\Vert f(A)\Vert \,\Vert f(B)\Vert } $$

(1)

$$ {\rm Common}\,{\rm Neighbors}\,(A, B) = |f(A) \cap f(B)| $$

(2)

$$ {\rm Jaccard}\,{\rm Similarity}\,(A, B)= \frac{|f(A) \cap f(B)|}{|f(A) \cup f(B)|} $$

(3)

$$\begin{aligned}&{\rm Adamic}\,{\rm Adar}\,{\rm Index}\,(A, B) = \sum _{w \in f(A) \cap f(B)} \frac{1}{\log |f(w)|} \\&{\text {where}}\,f(w)\,{\hbox {denotes\,the\,set\,of\,neighbors\,of}}\,w. \end{aligned}$$

(4)
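A minimal sketch of measures (1) to (4) on a graph stored as a dictionary of neighbor sets is given below. Two points are assumptions rather than statements from the paper: cosine similarity is computed on binary neighbor-indicator vectors (so the dot product equals the number of common neighbors and each norm is the square root of the degree), and common neighbors is reported as a count. The toy graph is invented for illustration.

```python
import math
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # user -> set of neighbors f(user)


def common_neighbors(graph: Graph, a: str, b: str) -> int:
    """Number of neighbors shared by a and b."""
    return len(graph[a] & graph[b])


def jaccard(graph: Graph, a: str, b: str) -> float:
    """|f(a) ∩ f(b)| / |f(a) ∪ f(b)|, defined as 0 when both sets are empty."""
    union = graph[a] | graph[b]
    return len(graph[a] & graph[b]) / len(union) if union else 0.0


def cosine(graph: Graph, a: str, b: str) -> float:
    """Cosine of the angle between binary neighbor-indicator vectors (assumption)."""
    denom = math.sqrt(len(graph[a]) * len(graph[b]))
    return len(graph[a] & graph[b]) / denom if denom else 0.0


def adamic_adar(graph: Graph, a: str, b: str) -> float:
    """Sum of 1 / log|f(w)| over common neighbors w.

    A common neighbor of two distinct users has degree >= 2, so the
    logarithm is positive; the guard only protects malformed input.
    """
    return sum(1.0 / math.log(len(graph[w]))
               for w in graph[a] & graph[b] if len(graph[w]) > 1)


# Toy undirected graph with edges a-c, a-d, b-c, b-d, b-e, d-e.
graph = {
    "a": {"c", "d"},
    "b": {"c", "d", "e"},
    "c": {"a", "b"},
    "d": {"a", "b", "e"},
    "e": {"b", "d"},
}
```

For the pair (`a`, `b`) in this toy graph, the common neighbors are `c` and `d`; Adamic–Adar weights `c` (degree 2) more heavily than `d` (degree 3), which is the intended effect of the logarithmic penalty on well-connected intermediaries.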

3.2 Second Common Neighbor Similarity

We propose a new similarity measure between users based on second-degree neighbors [7, 12]. Second neighbors are defined as the nodes that are reachable in two hops.

$$\begin{aligned}&{\rm Second}\,{\rm Common}\,{\rm Neighbors}\,(A, B)= \left| \left( \bigcup _{i \in f(A)} f(i)\right) \cap \left( \bigcup _{j\in f(B)}f(j)\right) \right| \\& {\text {where}}\,f(A)\,{\hbox{is the set of friends of user}}\,A\,{\hbox{and}}\,f(i)\,{\hbox{is the set of neighbors of}}\,i. \end{aligned}$$

(5)
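Equation (5) can be read literally as: take the union of the friend lists of A's friends, do the same for B, and count the overlap. The sketch below follows that reading. Note that under this literal reading each user appears among their own second neighbors; whether the authors filter such cases is not stated, so no filtering is applied here. The function names and toy graph are assumptions.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]  # user -> set of neighbors f(user)


def second_neighbors(graph: Graph, a: str) -> Set[str]:
    """Union of f(i) over all friends i of a (nodes reachable in two hops)."""
    result: Set[str] = set()
    for friend in graph[a]:
        result |= graph[friend]
    return result


def second_common_neighbors(graph: Graph, a: str, b: str) -> int:
    """Size of the overlap between the second neighbors of a and of b."""
    return len(second_neighbors(graph, a) & second_neighbors(graph, b))


# Toy path graph a - x - m - y - b, invented for illustration.
path = {
    "a": {"x"},
    "x": {"a", "m"},
    "m": {"x", "y"},
    "y": {"m", "b"},
    "b": {"y"},
}
```

In this path graph `a` and `b` share no direct neighbor, so common neighbors, Jaccard and Adamic–Adar are all zero for the pair, while the second-neighbor measure still finds the shared node `m`.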

For comparison with the proposed similarity measure, we also use another simple metric, which is based on the shortest distance between users.

$$\begin{aligned} {\rm Shortest}\,{\rm Path}\,{\rm Index}\,(A, B) = {\text{Length\,of\,Shortest\,Path\,Between}}\,A\,{\text{and}}\,B \end{aligned}$$

(6)
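On an unweighted friendship graph, the shortest path length in Eq. (6) can be obtained with breadth-first search. The sketch below is one standard way to do this, not necessarily the authors' implementation; returning `None` for disconnected users is a convention chosen for the example.

```python
from collections import deque
from typing import Dict, Optional, Set

Graph = Dict[str, Set[str]]  # user -> set of neighbors


def shortest_path_length(graph: Graph, a: str, b: str) -> Optional[int]:
    """Number of edges on a shortest path from a to b, or None if unreachable."""
    if a == b:
        return 0
    visited = {a}
    queue = deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph[node]:
            if neighbor == b:
                return dist + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None


# Toy graph: path a - x - m - y - b plus an isolated user z.
toy = {
    "a": {"x"},
    "x": {"a", "m"},
    "m": {"x", "y"},
    "y": {"m", "b"},
    "b": {"y"},
    "z": set(),
}
```

Unlike the other measures, a smaller value here indicates greater closeness, so a ranking based on this index has to sort in ascending order.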

4 Experiments

Ideally, the evaluation of link prediction algorithms should take place at every timestamp to check the performance and adjust the algorithm if necessary. However, such an approach takes a long time, and it is hard to keep track of users who join and leave the initial network. Therefore, it is common practice to delete a small portion of the edges from the network as if they did not exist and then try to predict the deleted edges. This static approach has possible flaws since the similarity between users is not static; thus, deleted edges are easier to predict than future edges.
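The common practice described above can be sketched as a random hold-out of edges. The fraction to hide, the fixed seed and the function name are assumptions; the paper does not state the proportion of edges removed.

```python
import random
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str]


def hold_out_edges(edges: Iterable[Edge], fraction: float, seed: int = 0):
    """Split undirected edges into (observed, hidden) sets.

    `hidden` plays the role of the "future" links to be predicted from
    `observed`. Edges are normalized to sorted tuples and deduplicated so
    that the split is reproducible for a given seed.
    """
    unique = sorted({tuple(sorted(e)) for e in edges})
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_hidden = int(len(unique) * fraction)
    hidden: Set[Edge] = set(unique[:n_hidden])
    observed: Set[Edge] = set(unique[n_hidden:])
    return observed, hidden


# Ten toy edges on a ring of ten users, invented for illustration.
ring = [(f"u{i}", f"u{(i + 1) % 10}") for i in range(10)]
observed, hidden = hold_out_edges(ring, fraction=0.2)
```

A similarity measure is then computed on `observed` only, and its top-ranked non-edges are compared against `hidden`.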

In the experiment, we compare this common practice with real link prediction based on actual future links, without deleting edges. To be able to evaluate the accuracy, we collected datasets from the same set of users at different time intervals, built a learning model on the earlier data and then compared its predictions with the later data.
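Comparing predictions against actual future links reduces to counting the four cells of a confusion matrix and, from them, the F1 measure that the introduction names as the baseline. The sketch below assumes that predictions, actual links and candidate pairs are all given as sets of hashable pair identifiers; the names and toy data are illustrative.

```python
from typing import Hashable, Set, Tuple


def confusion_counts(predicted: Set[Hashable],
                     actual: Set[Hashable],
                     candidates: Set[Hashable]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, FN, TN) for link predictions over the candidate pairs."""
    tp = len(predicted & actual)          # predicted and really formed
    fp = len(predicted - actual)          # predicted but did not form
    fn = len(actual - predicted)          # formed but was not predicted
    tn = len(candidates - predicted - actual)
    return tp, fp, fn, tn


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example with six candidate pairs labeled p1..p6.
candidates = {"p1", "p2", "p3", "p4", "p5", "p6"}
predicted = {"p1", "p2", "p3"}
actual = {"p2", "p3", "p4"}
tp, fp, fn, tn = confusion_counts(predicted, actual, candidates)
```

In real social networks the true negatives dominate by a wide margin, since most pairs of users never connect, which is one reason F1 is preferred over plain accuracy for this task.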

4.1 Collection of Dataset

We collected several snapshots of networks from VK to test link prediction methods. In order to make comparisons easier, we preprocessed the data so that the resulting networks contain the same users, since links of users who joined in later snapshots cannot be predicted from earlier ones. In the next step, we excluded users who are suspected not to be regular individuals, such as groups, commercial accounts and celebrities. We removed users with more than 500 friends, a cutoff chosen in view of the mean number of friends in VK, which was 240. The datasets contain users’ profile information, friends and posts from their walls, as shown in Fig. 2.
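The two preprocessing steps described above can be sketched as follows. The sketch assumes each snapshot is a dictionary from user to friend set, and it drops a user if the friend count exceeds the limit in any snapshot; the paper does not specify at which point the count was measured, so that detail, like the function name, is an assumption.

```python
from typing import Dict, List, Set

Graph = Dict[str, Set[str]]  # user -> set of friends


def preprocess(snapshots: List[Graph], max_friends: int = 500) -> List[Graph]:
    """Keep only users present in every snapshot and with at most
    `max_friends` friends in each of them, then restrict friend sets
    to the surviving users."""
    common = set.intersection(*(set(s) for s in snapshots))
    keep = {u for u in common
            if all(len(s[u]) <= max_friends for s in snapshots)}
    return [{u: s[u] & keep for u in keep} for s in snapshots]


# Toy snapshots with a tiny limit so the filter is visible.
snap1 = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
snap2 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "e"}, "e": {"c"}}
clean1, clean2 = preprocess([snap1, snap2], max_friends=2)
```

In the toy data, `d` and `e` are dropped because they are missing from one snapshot, and `c` is dropped because it exceeds the limit of two friends in the first snapshot.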

We use VK networks with four different timestamps as summarized in Table 1.

Table 1 VK networks with different time stamps
