1 Introduction

The ever-increasing maturity of web techniques [1,2,3,4,5] has promoted the birth of various social networks, such as WeChat, Facebook, and Twitter. In a social network, a user can communicate with friends or neighbors, share ideas, and collaborate in other ways [6,7,8]. Today, an increasing number of users have joined social networks and become part of the web ecosystem [9, 10].

At the very beginning, a user’s social network is often very small, as he or she has only a few neighbors or friends. Such a small network is of limited value to the user because it cannot fulfill the sharing objective of social software. Therefore, a key issue in the social network domain is how to find the prospective friends or neighbors of a user so as to effectively extend the user’s friend network [11, 12]. In other words, making new friends is a practical and significant requirement for social network users, and it is one of the major reasons why users are willing to use social interaction software or tools.

Fortunately, the thumbs-up data left by users provide a promising way to find the prospective friends or neighbors of a user in a social network. Generally, a user leaves a thumbs-up record on a shared article if he or she likes the article. The thumbs-up data of a user are therefore a good basis for evaluating the user’s preferences. Furthermore, if the thumbs-up data of two users A and B are close, we can infer that A and B are possible friends or neighbors, as they share similar preferences. Thus, by analyzing users’ thumbs-up data, we can quickly identify the users who may become friends or neighbors of a target user in the future.

However, thumbs-up data are sensitive, as they may disclose private information about users [13,14,15,16]. Such privacy disclosure discourages information sharing in social networks and may violate the privacy-protection laws enacted by governments. In addition, friend finding based on thumbs-up data is often time-consuming, as a social network usually contains a massive number of users and each user often leaves many thumbs-up records.

In view of these two challenges, we introduce the Simhash technique, which was developed for quick information retrieval, into friend or neighbor finding in social networks. Based on Simhash, we then put forward a privacy-aware and time-efficient friend-finding approach.

In summary, the scientific contributions of our paper are threefold:

  (1) We introduce the Simhash technique from the information retrieval domain into social networks to help find the prospective friends or neighbors of a user.

  (2) We develop a Simhash-based friend- or neighbor-finding approach that can protect the sensitive information contained in thumbs-up data.

  (3) We conduct a series of experiments on the well-known MovieLens dataset. The reported experimental results show the feasibility of our proposed solution.

The rest of the paper is organized as follows. In Section 2, we formulate the thumbs-up data-based friend- or neighbor-finding problem in social networks. Our Simhash-based proposal is presented in Section 3. In Section 4, we conduct a set of experiments to validate the feasibility of our solution; the related comparison analyses are also presented. Finally, in Section 5, we conclude the paper and point out future research directions.

2 Formulation

To clarify the friend- or neighbor-finding problem in a social network, we list the formal symbols used in this paper below:

  (1) MEMBER = {M1, …, Mm}: the user members in a social network.

  (2) ARTICLE = {A1, …, An}: the articles shared in the social network.

  (3) u#: a user who needs to find new friends or neighbors.

  (4) THUMBS-UP: the m × n thumbs-up matrix left by users:

    \( \mathrm{THUMBS}\hbox{-}\mathrm{UP}=\left({T}_1,\dots, {T}_m\right)=\left[\begin{array}{ccc}{t}_{1,1}& \dots & {t}_{1,n}\\ \vdots & \ddots & \vdots \\ {t}_{m,1}& \cdots & {t}_{m,n}\end{array}\right] \)

Each row of the matrix denotes a member of the social network, and each column represents an article shared in the social network. Here, ti,j = 1 if member Mi leaves a thumbs-up record on article Aj, and ti,j = 0 otherwise. In other words, ti,j is a Boolean value.

With the above formulation, the friend- or neighbor-finding problem in a social network can be described as follows: according to the thumbs-up data in the THUMBS-UP matrix associated with the articles in set ARTICLE and the users in set MEMBER, find the prospective neighbors or friends of user u#. As thumbs-up data are often sensitive, we need to protect the private information contained in the THUMBS-UP matrix. This is the main issue addressed in this paper.

Next, we use the example in Fig. 1 to motivate our paper. In the figure, there are three user members in the social network, i.e., M1, M2, and M3, the target user u#, and four articles, i.e., A1, A2, A3, and A4. Each arrow between a member and an article denotes a thumbs-up record left by that member on that article. Concretely, M1 left thumbs-up records on articles A1 and A3; M2 on articles A2 and A3; M3 on articles A1 and A4; and u# on articles A1, A2, and A3.

Fig. 1 Privacy-aware friend-finding based on thumbs-up data: an example
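As a minimal sketch (not part of the original experiments), the example of Fig. 1 can be written as a small Boolean THUMBS-UP structure. The names `LIKES`, `thumbs_up_row`, and `shared_likes` are illustrative assumptions; the member-to-article records are taken from the example above.

```python
# Thumbs-up records from the Fig. 1 example; columns follow A1..A4.
ARTICLES = ["A1", "A2", "A3", "A4"]
LIKES = {
    "M1": {"A1", "A3"},
    "M2": {"A2", "A3"},
    "M3": {"A1", "A4"},
    "u#": {"A1", "A2", "A3"},
}


def thumbs_up_row(liked, articles=ARTICLES):
    """Return the Boolean row (t_{i,1}, ..., t_{i,n}) of one member."""
    return [1 if article in liked else 0 for article in articles]


# One row per member, as in the THUMBS-UP matrix of Section 2.
THUMBS_UP = {member: thumbs_up_row(liked) for member, liked in LIKES.items()}


def shared_likes(row_a, row_b):
    """Count the articles on which both members left a thumbs-up record."""
    return sum(a & b for a, b in zip(row_a, row_b))


for member in ("M1", "M2", "M3"):
    print(member, THUMBS_UP[member], shared_likes(THUMBS_UP[member], THUMBS_UP["u#"]))
```

Comparing rows in this direct way requires every member to release his or her raw thumbs-up data, which is exactly the disclosure that the approach in Section 3 tries to avoid.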

Generally, by analyzing and comparing the thumbs-up data left by the three user members and u#, we can find the friends or neighbors that share similar or identical preferences with user u#, and thus complete the friend or neighbor recommendation process. However, the thumbs-up data are often sensitive to the three members and u#, so they often refuse to release their thumbs-up data to other parties because of the risk of privacy disclosure. It therefore becomes a challenging task to find the friends or neighbors of u# without revealing the real thumbs-up data. This is the major focus of this paper, and we show how to achieve this objective based on the well-known Simhash technique. The details are presented in the next section.

3 Simhash-based friend finding with privacy preservation

Next, we introduce our proposed friend- or neighbor-finding approach (named FFSimhash), which is based on Simhash and the sensitive thumbs-up data. The privacy-preservation strategy of our solution is as follows: first, we generate user member indices offline based on the thumbs-up data and the Simhash technique; second, we search for similar friends or neighbors of u# based on the derived member indices; finally, the selected Top-N friends are returned to u#.

Concretely, the proposed FFSimhash solution consists of the three steps shown in Fig. 2. Here, THUMBS-UP denotes the thumbs-up data set left by the user members of the social network; u# is a user who needs to find new friends or neighbors; and Top-N (where N is a parameter designated by u#) is the size of the neighbor or friend set requested by u#. For example, if N = 3, then the three best-matching friends or neighbors are returned to u#.

Fig. 2 Concrete process of FFSimhash

3.1 Step 1: Generate user member indices offline based on Simhash and thumbs-up data

The THUMBS-UP set introduced in Section 2 records the historical thumbs-up behaviors of all the user members in a social network. If a user member likes an article, a thumbs-up action is triggered; otherwise, it is not. Therefore, the thumbs-up data in THUMBS-UP can be regarded as a suitable basis for evaluating the personalized preferences of a user member. Inspired by this observation, we can extract hidden user preferences by analyzing the thumbs-up behaviors of the user. In other words, we can use the thumbs-up data of a user to represent his or her preferences and to build his or her index.

Concretely, we use the Simhash technique [17] to generate user member indices from the thumbs-up data in the THUMBS-UP set. The index generation process is as follows. Set ARTICLE contains n articles A1, …, An in total, so we can encode these n articles with \( \left\lceil {\log}_2 n\right\rceil \) 0/1 bits (here, ⌈x⌉ denotes the ceiling of x, i.e., the smallest integer not less than x). For example, we can encode 3 articles A1, A2, A3 with \( \left\lceil {\log}_2 3\right\rceil =2 \) Boolean bits:

$$ {A}_1=\left(0\ 1\right),{A}_2=\left(1\ 0\right),{A}_3=\left(1\ 1\right) $$

This way, we can convert the n articles A1, …, An into corresponding 0/1 expressions. Next, in the 0/1 expressions, the “0” values are replaced by “−1”; for example, A1 = (−1 1), A2 = (1 −1), A3 = (1 1).
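The encoding just described can be sketched as follows. This is a minimal illustration under one assumption of ours: because the example codes start at 1 (A1 = 01), the sketch uses `n.bit_length()` bits so that the code of An always fits. This equals ⌈log2 n⌉ for n = 3, but is one bit longer when n is an exact power of two. The function names are hypothetical.

```python
def code_length(n):
    """Number of bits used to encode article indices 1..n.

    Assumption of this sketch: codes start at 1 as in the example (A1 = 01),
    so the largest index n itself must be representable.
    """
    return max(1, n.bit_length())


def encode_article(j, bits):
    """Return the binary code of the 1-based article index j as a +1/-1 vector."""
    # Most significant bit first, e.g. j = 1 with 2 bits -> (0 1).
    binary = [(j >> (bits - 1 - k)) & 1 for k in range(bits)]
    # Replace every 0 by -1, as described in the text.
    return [1 if b == 1 else -1 for b in binary]


bits = code_length(3)
for j in (1, 2, 3):
    print(f"A{j} =", encode_article(j, bits))
```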

In THUMBS-UP, each row denotes a user member; for example, the i-th row is Ti = (ti,1, …, ti,n). Here, ti,j = 1 if user member Mi leaves a thumbs-up record on article Aj, and ti,j = 0 otherwise. Thus, for user member Mi (1 ≤ i ≤ m), we multiply each article vector Aj by the corresponding ti,j (1 ≤ j ≤ n). Let us consider the above example with three articles A1 = (−1 1), A2 = (1 −1), A3 = (1 1). If Mi only leaves thumbs-up records on A1 and A3, i.e., Ti = (1, 0, 1), then we obtain the following products:

$$ {A}_1\ast 1=\left(-1\ 1\right) $$
$$ {A}_2\ast 0=\left(0\ 0\right) $$
$$ {A}_3\ast 1=\left(1\ 1\right) $$

Next, we sum the above vectors column by column, i.e., A1 * 1 + A2 * 0 + A3 * 1 = (0 2). Then, the negative values are replaced by “0” and the positive values are replaced by “1”; as this example shows, a sum of exactly 0 is also mapped to “0”. Thus, we obtain the Boolean vector (0 1). According to the Simhash theory, “01” can be considered as the index of user member Mi based on the thumbs-up data in set THUMBS-UP. Compared to the original thumbs-up data left by Mi, i.e., Ti = (1, 0, 1), the index “01” reveals less about the user; this is one reason why we choose the Simhash technique to perform privacy-aware friend or neighbor finding in social networks. The above process can be repeated for each user member in the social network to generate his or her less-sensitive index.
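The index generation of step 1 can be sketched end to end as below. This is an illustrative sketch rather than the authors' implementation: the function name `simhash_index` is hypothetical, the code length follows our assumption of `n.bit_length()` bits (2 bits for n = 3, matching the example), and a column sum of exactly 0 is mapped to “0” as in the worked example.

```python
def simhash_index(row):
    """Derive the 0/1 index string of one member from a Boolean thumbs-up row."""
    n = len(row)
    bits = max(1, n.bit_length())  # assumption: article codes 1..n must fit
    sums = [0] * bits
    for j, t in enumerate(row, start=1):
        if not t:
            continue  # A_j * 0 contributes the zero vector
        for k in range(bits):
            bit = (j >> (bits - 1 - k)) & 1
            sums[k] += 1 if bit else -1  # code bit 0 counts as -1
    # Positive sums become "1"; negative (and zero) sums become "0".
    return "".join("1" if s > 0 else "0" for s in sums)


print(simhash_index([1, 0, 1]))  # worked example from the text: T_i = (1, 0, 1)
```

For Ti = (1, 0, 1) the column sums are (0, 2), so the sketch yields the index “01” derived above.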

3.2 Step 2: Determine the similar friends or neighbors of u# based on the member indices derived in step 1

In step 1, each user member Mi is assigned a \( \left\lceil {\log}_2 n\right\rceil \)-bit index, denoted by h(Mi). According to the Simhash theory, if the indices of two user members are close enough, then we can conclude that these two members are likely to be friends or neighbors. Concretely, consider two members Ma and Mb whose indices are h(Ma) and h(Mb), respectively. If h(Ma)⊕h(Mb) ≤ 3, then Ma and Mb are deemed similar with high probability. Here, “⊕” denotes the bitwise XOR operation followed by a count of the resulting 1 bits, i.e., the number of positions in which the two indices differ (their Hamming distance); for example, 11111 ⊕ 10111 = 1 because the two indices differ only in the second bit. This way, we can judge whether a user member Mi is a qualified friend or neighbor of u# by evaluating the condition h(Mi)⊕h(u#) ≤ 3. If Mi is a qualified friend of u#, then we put Mi into set Friend (u#).
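The distance test of step 2 can be sketched as follows, assuming that the indices are stored as equal-length 0/1 strings. The helper names and the sample indices are hypothetical and serve only to illustrate the condition h(Mi)⊕h(u#) ≤ 3.

```python
def hamming(h_a, h_b):
    """Count the positions in which two equal-length 0/1 index strings differ."""
    if len(h_a) != len(h_b):
        raise ValueError("indices must have the same length")
    return sum(a != b for a, b in zip(h_a, h_b))


def candidate_friends(indices, target, threshold=3):
    """Return {member: distance} for all members within the threshold of target."""
    h_target = indices[target]
    result = {}
    for member, h in indices.items():
        if member == target:
            continue
        distance = hamming(h, h_target)
        if distance <= threshold:
            result[member] = distance
    return result


# Hypothetical 6-bit indices, for illustration only.
indices = {"u#": "110100", "M1": "110101", "M2": "001011", "M3": "110100"}
print(hamming("11111", "10111"))
print(candidate_friends(indices, "u#"))
```

Only the indices are compared here; the raw thumbs-up rows are not needed at query time.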

3.3 Step 3: Return Top-N friends or neighbors to u#

In step 2, we obtained a set of friends or neighbors of u#, i.e., Friend (u#). If the size of Friend (u#) does not exceed the number N of friends requested by u#, i.e., | Friend (u#) | ≤ N, then all the user members in set Friend (u#) are returned to u# and the friend-finding process for u# ends successfully. Otherwise, if | Friend (u#) | > N, we need to rank the members in set Friend (u#) and pick out the N best ones. Here, we adopt the following simple comparison to evaluate the members in set Friend (u#). Concretely, every member Mi in Friend (u#) falls into one of the following four cases:

$$ h\left({M}_i\right)\oplus h\left({u}^{\#}\right)=0 \tag{1} $$
$$ h\left({M}_i\right)\oplus h\left({u}^{\#}\right)=1 \tag{2} $$
$$ h\left({M}_i\right)\oplus h\left({u}^{\#}\right)=2 \tag{3} $$
$$ h\left({M}_i\right)\oplus h\left({u}^{\#}\right)=3 \tag{4} $$

We argue that the members belonging to case (3) are more similar to u# than those belonging to case (4); likewise, the members belonging to case (2) are more similar than those belonging to case (3), and the members belonging to case (1) are more similar than those belonging to case (2). Therefore, when picking out the N best friends or neighbors of u# from set Friend (u#), we first consider the members belonging to case (1), then case (2), case (3), and case (4). Furthermore, if the number of members belonging to case (1) is larger than N, then we randomly select N members from these candidates. Finally, the picked Top-N friends or neighbors of u# are returned to u#.
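The ranking of step 3 can be sketched as below. The function name `top_n_friends` is hypothetical, and one detail is our assumption: the text specifies random selection only when case (1) alone exceeds N, whereas this sketch applies the same random tie-breaking to whichever case overflows the remaining quota.

```python
import random
from collections import defaultdict


def top_n_friends(distances, n, rng=None):
    """Pick at most n members, preferring smaller index distances.

    distances maps each member of Friend(u#) to its distance from u# (0..3).
    """
    rng = rng or random.Random()
    by_case = defaultdict(list)
    for member, distance in distances.items():
        by_case[distance].append(member)

    selected = []
    for distance in sorted(by_case):  # case (1) first, then (2), (3), (4)
        remaining = n - len(selected)
        if remaining <= 0:
            break
        group = by_case[distance]
        if len(group) <= remaining:
            selected.extend(group)
        else:
            # Assumption: break ties inside the overflowing case at random.
            selected.extend(rng.sample(group, remaining))
    return selected


print(top_n_friends({"a": 0, "b": 2, "c": 1, "d": 3}, n=3))
```

When | Friend (u#) | ≤ N, the loop simply returns every candidate, ordered by distance.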

This is the end of our proposed privacy-aware friend-finding approach FFSimhash. To validate the feasibility of our approach, a series of experiments on a real-world dataset are designed, which will be introduced in detail in the next section.

4 Experiments

4.1 Experimental settings

In this section, a series of experiments are designed to evaluate the effectiveness and efficiency of our suggested FFSimhash approach in handling privacy-aware friend finding in social networks. The experiments are based on the popular MovieLens [18] dataset containing the user-movie watching records of 4021 users and 1571 movies. To show the advantages of our proposal, we compare it with three competing approaches, which are introduced briefly below:

  (1) Random: this approach randomly selects a user member from the social network as the prospective friend or neighbor of user u#.

  (2) Core-user [19]: this approach first searches the user candidates for the core users (i.e., key users) with maximal social influence and then finds the friends of user u# based on the core users.

  (3) Exact-match: this approach compares the feedback records left by each user member with those of u#; if the records are exactly the same, the member is regarded as a similar friend or neighbor.
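For orientation, two of the baselines can be sketched from their one-line descriptions above. These are our readings of those descriptions, not the code used in the experiments, and the function names are hypothetical. Core-user is omitted because its details are given in [19] rather than in this paper.

```python
import random


def exact_match_friends(thumbs_up, target):
    """Return the members whose thumbs-up row is identical to the target's row."""
    target_row = thumbs_up[target]
    return [m for m, row in thumbs_up.items() if m != target and row == target_row]


def random_friend(thumbs_up, target, rng=None):
    """Return one randomly chosen member other than the target."""
    rng = rng or random.Random()
    candidates = [m for m in thumbs_up if m != target]
    return rng.choice(candidates)


# Hypothetical thumbs-up rows, for illustration only.
rows = {
    "u#": [1, 1, 1, 0],
    "M1": [1, 0, 1, 0],
    "M2": [1, 1, 1, 0],
}
print(exact_match_friends(rows, "u#"))
print(random_friend(rows, "u#", rng=random.Random(0)))
```

Exact-match has to scan the full n-dimensional row of every member, which is consistent with the high time cost reported for it in Section 4.2.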

Concretely, we compare the four approaches from the following two perspectives.

  (1) Number of friends: for friend-finding approaches in social networks, the number of friends returned to u# is a key factor in evaluating the performance of different approaches. We expect the number of returned friends to be appropriate, i.e., neither too many nor too few.

  (2) Time cost: the time consumed to find the final similar friends or neighbors of u#. For a friend-finding approach, we expect the time cost to be small enough to support real-time social interactions.

The experiments were run on a laptop with an i7 processor (2.50 GHz), 16.0 GB RAM, Windows 10, and Python 3.7. To reduce the unexpected influence of the computer network or other environmental factors, each test was repeated 100 times and we report the average results.
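The repeat-and-average protocol can be sketched with the standard library as follows. The helper name `average_runtime` is hypothetical, and the use of `time.perf_counter` is our assumption; the paper does not state which timer was used.

```python
import time


def average_runtime(func, *args, repeats=100, **kwargs):
    """Run func(*args, **kwargs) `repeats` times and return the mean time in seconds."""
    total = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        total += time.perf_counter() - start
    return total / repeats


# Usage with a placeholder workload; any friend-finding function could be timed instead.
print(average_runtime(sorted, list(range(1000, 0, -1)), repeats=100))
```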

4.2 Experimental results

Concretely, we run the following two tests and comparisons. The related parameter specifications can be found in Section 2.

4.2.1 Profile 1: Number of friends of u#

The number of returned friends or neighbors of u# is a good criterion for evaluating the performance of a friend search approach in social networks. Here, for the FFSimhash approach, we focus on the number of friends or neighbors of u# returned in step 2 and step 3. More specifically, we list four filtering conditions for friend or neighbor search in step 3. According to the Simhash theory, different search conditions in the FFSimhash approach lead to different numbers of returned friends or neighbors of u#.

We test and compare the number of friends of u# returned by the four approaches with respect to different values of m (varied from 50 to 900) and n (varied from 50 to 1571). Here, the number of expected friends or neighbors is N = 3. The results are presented in Fig. 3. Concretely, in Fig. 3a, n = 1571, while in Fig. 3b, m = 4021.

Fig. 3 a, b Number of returned friends of u#

As the figure shows, the number of friends returned by the FFSimhash approach increases with both m and n. This is because, when the user volume or movie volume increases, the probability of finding a qualified friend of u# also rises. Besides, in these tests our solution always returns a set of friends for u# regardless of the user volume and movie volume in the social network. The experimental results thus suggest that our solution achieves a good balance between friend-finding accuracy and privacy preservation. As for the Random and Core-user methods, the number of returned friends is very small; therefore, the related data are not presented in Fig. 3b.

4.2.2 Profile 2: Time cost comparisons of four approaches

Time cost is another objective criterion for evaluating friend-finding performance in social networks. For a friend-finding approach, we expect the time cost to be small enough to support real-time social interactions. Considering this, we test and compare the friend search efficiency of the four competing approaches. The number of user members in the social network, i.e., m, is varied from 50 to 900; the number of movies in the social network, i.e., n, is varied from 50 to 1571. The experimental results are shown in Fig. 4. Here, we use “−log(t)” to indicate the time costs of different approaches (larger is better).

Fig. 4 a, b Time cost with m and n
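The “−log(t)” scale used in Fig. 4 can be illustrated with the short sketch below. The logarithm base is not stated in the paper, so base 10 is an assumption here, and the function name is hypothetical.

```python
import math


def neg_log_time(seconds, base=10):
    """Return -log_base(t); shorter running times give larger values."""
    if seconds <= 0:
        raise ValueError("running time must be positive")
    return -math.log(seconds, base)


for t in (1.0, 0.1, 0.001):
    print(t, neg_log_time(t))
```

Because −log(t) decreases as t grows, an approach that is ten times faster scores one unit higher under the base-10 assumption, which is why larger values are better on this scale.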

As the results show, the time cost of the Exact-match approach is the highest, as all the thumbs-up data must be taken into consideration when performing exact matching. Our suggested FFSimhash approach requires little time to derive a set of qualified friends of u#, as Simhash is a time-efficient neighbor search technique and the indices of the user members can be generated offline before the friend or neighbor search is performed. As for the Core-user method, the number of returned friends is very small; therefore, the related data are not presented in Fig. 4a.

4.3 Discussions

Several potential shortcomings remain in our experiments.

  (1) Privacy concerns exist in many data-related application areas [20,21,22,23]. Although we adopt the Simhash technique to protect user privacy, the degree of privacy protection is not measured here due to the inherent nature of the Simhash technique. Besides, we only consider the time cost of friend finding and do not consider other costs such as energy consumption [24,25,26,27,28,29] and space granularity [30, 31]. In the future, these other costs should also be discussed.

  (2) In this paper, we only discuss one dimension for friend-finding decision-making, i.e., user members’ thumbs-up data on articles in a social network, whereas multi-dimensional cases are more common in practical business applications [32,33,34,35,36,37]. Besides, a user may leave multiple thumbs-up records on an article, which raises the problem of weighting multiple thumbs-up records for the same article. Therefore, multiple dimensions for friend finding, as well as their weights, need to be studied in future research.

  (3) In data-driven business applications, there is often a tradeoff between accuracy and privacy. Therefore, our proposal cannot guarantee 100% accuracy of the found friends. Besides, recommendation failures are possible, especially when the thumbs-up data for friend finding are very sparse, which may decrease the robustness of friend finding. In the future, it is necessary to explore further improvements in accuracy and robustness while protecting the private information of user members.

5 Conclusions and future work

Social networks provide a promising way for massive numbers of users to share their ideas and communicate with each other. A key issue in social networks is to find the prospective friends of users so as to extend the users’ social circles. Typically, by analyzing the thumbs-up data of different users, we can find the friends or neighbors of a user. However, thumbs-up data are often sensitive, as they can disclose private information about users, which may violate the privacy-protection laws enacted by governments. In view of this challenge, we introduce the Simhash technique from the information retrieval domain into social networks and put forward a privacy-aware prospective friend-finding solution based on the sensitive thumbs-up data. Finally, we conduct a range of experiments on the well-known MovieLens dataset. The experimental results demonstrate the advantages of our solution.

In the future, we will further extend our friend-finding approach to accommodate more general and comprehensive multi-dimensional cases. Besides, how to continue to refine our work for improving the accuracy of found friends is still an open problem that needs intensive study.