# Locality-Sensitive Hashing for Distributed Privacy-Preserving Collaborative Filtering: An Approach and System Architecture

## Abstract

Recommendation systems are currently widely used in domains where abundance of choice is combined with its subjective nature (books, movies, trips, etc.). Most modern recommendation systems are centralized. Although the centralized recommendation system design has some significant advantages, it also bears two primary disadvantages: the necessity for users to share their preferences and a single point of failure. This paper follows a user-centric approach to distributed recommendation system design, and proposes an architecture of a collaborative peer-to-peer recommendation system with limited preference disclosure. Privacy in the proposed design is provided by the fact that exact user preferences are never shared together with the user identity. To achieve that, the proposed architecture employs locality-sensitive hashing of user preferences and an anonymized distributed hash table approach to peer-to-peer design.

### Keywords

Recommendation systems, Distributed collaborative filtering, Locality-sensitive hashing, Peer-to-peer, Anonymization, Privacy

## 1 Introduction

Recommendation systems play an important role in modern e-commerce by helping users find their way through the abundant variety of goods and services on offer. From an architectural point of view, most widely used recommendation systems have a centralized design: the system collects users’ preferences and feedback (in the form of “likes”, ratings, or some other form), stores them in its internal database, and uses this database for making recommendations. An advantage of this design is that it allows employing a broad spectrum of user preference models to predict a user’s attitude to new items. Indeed, many recommendation methods and techniques are derived from machine learning and data mining [1]; therefore, the presence of representative data sets is crucial. Centralization also puts all the relevant user information under the control of the recommendation system maintainer, who can then perform various research activities on this information besides providing online recommendations to users (see, e.g., Netflix Prize [19]).

However, the centralized approach has several drawbacks. First, it raises questions about privacy and, in a wider perspective, about rights to the preference data collected about users. As a rule, a user is not aware of what information the system collects about his/her behaviour and cannot extract this information from the centralized system. Moreover, if a recommendation system’s maintainer abandons it, all the collected user profiles may be lost. Second, centralization usually results in some kind of preference partitioning. A user may communicate with several recommendation systems, sharing with each system some part of his/her preferences; as a result, the user’s preferences become spread over several recommendation systems with no chance of being united. This is not desirable, as a complete preference profile can potentially lead to more accurate recommendations. Third, any centralization usually leads to a single point of failure; in modern computer systems, however, this drawback is usually alleviated by multilevel duplication and replication.

A decentralized design can address some of these drawbacks:

- a decentralized system may improve the users’ privacy, as there exists no central entity owning the users’ private information (however, this topic is subtle due to the inherent security issues of peer-to-peer systems);

- data-processing and recommendation functions can be distributed among all users, thus removing the need for a costly central server and enhancing scalability.

There are several approaches to recommendation system decentralization. In this paper, a user-centric approach is examined. According to this approach, the user holds all his/her preferences on his/her own system. This entirely removes the question of rights, as the user fully controls the storage of his/her preferences. It can also remove preference partitioning, as all the user’s preferences become centralized in a device controlled by the user. When recommendations are needed, the user’s device sends recommendation requests to other devices.

Although the user-centric decentralized design circumvents all the enumerated issues of centralized recommendation systems, it poses several new ones. The main problem addressed in this paper is how to make recommendations based on the collaborative filtering approach while respecting user privacy, that is, without sharing complete profiles among members of the distributed recommendation network. Collaborative filtering is one of the two main approaches to recommendation systems. In collaborative filtering systems, in contrast to content-based systems, recommendations are based solely on users’ attitudes to items (usually expressed as ratings), and not on explicit features of items. The focus of this paper is on the collaborative filtering approach because it is more universal: it allows building a recommendation system without creating a domain-specific item model. Decentralization of content-based systems, on the other hand, is generally simpler and does not pose fundamental difficulties.

In this paper, the recommendation system architecture that follows the user-centric approach is proposed. It is based on a structured peer-to-peer (P2P) network, where each peer corresponds to one user and holds his/her preferences. Recommendations are made by means of anonymized communication between peers. The proposed architecture enforces privacy by providing limited preferences disclosure. It means that there is no way to reliably match ratings and a user’s network address without having global control over the entire P2P network. The proposed architecture is a hybrid P2P as it uses one special node for the data-driven coordination that, however, is not used directly in the recommendation process.

The rest of the paper is structured as follows. Section 2 presents an overview of existing P2P recommendation systems and approaches. In Sect. 3, the locality-sensitive hashing approach to recommendations is discussed. Section 4 contains the description of the proposed recommendation system’s architecture. Section 5 contains an experimental evaluation of the proposed ideas. Main results are summarized in the conclusion.

## 2 Related Work

Peer-to-peer recommendation systems design is already addressed in literature.

Draidi et al. [8, 9] propose the P2Prec system. The idea of this system is to recommend high-quality documents related to query topics and content held by useful friends (or friends of friends) of the users, by exploring friendship networks. To disseminate information about relevant peers, it relies on gossip algorithms. For publishing and discovering services, a distributed hash table is used.

The authors of P2Prec employ two-level Latent Dirichlet Allocation to automatically model topics. At the global level, performed by a bootstrap server, a sample of documents is collected from peers and a set of topics is inferred. Then, at the local level, performed by each peer, the local documents are analysed with respect to the common topics. Each user maintains a friendship network. A user enlarges the friendship network by adding new friends that are relevant to queries and overlap with this user’s friendship network.

To establish friendship, P2Prec uses gossip protocols. Keyword queries are routed recursively through friendship networks, based on user trust and usefulness.

In a number of methods described in literature, an overlay network structure based on a similarity between nodes is built and recommendation algorithm is defined on this network (e.g., [8, 22]). Recommendations are searched for among neighbours up to certain depth or certain similarity threshold.

One of the algorithms for aligning the network structure with peer similarities is T-Man [14]. T-Man relies on the ability of a peer to measure how much it «likes» other peers. Having defined this relation, the T-Man algorithm aligns the structure of the overlay network so that peers that «like» each other are placed next to each other.

The similarity-based overlay network structure is extensively studied in [20], where the authors showed that overlay topologies defined by node similarity have highly unbalanced degree distributions, which have to be taken into account when load-balancing the P2P recommendation network. They also proposed algorithms with favourable convergence speed and prediction accuracy that take load balancing into account, considering a collaborative filtering system where the similarity of users is measured as cosine similarity.

In the proposed architecture, the exact ratings are not exposed together with a node identity, so there is no way to say how similar the two nodes are. Using the locality-sensitive hash values one can possibly say whether they are likely to be close enough or not.

Another approach is to rely on random walk search for similar nodes in the ordinary P2P network using some form of the flooding technique [26]. Similarly, Bakker et al. in [2] show that it is enough to take a random sample of the network and use the closest elements of that sample to make recommendations.

In [15], the random walks approach to collaborative filtering recommendations is examined in the context of P2P systems. The authors argue that the effect of a random walk in a decentralized environment differs from that in a centralized one. They also propose a system where epidemic protocols (gossip protocols) are used to disseminate the user similarity information. Peers start from a random set of peers and then, in a series of random exchanges, compare their local view with the local view of the remote node, leaving only the most similar peers in the local view (clustering gossip protocol). This process converges to form an overlay based on the peers’ similarity. Then peers that are not farther than two hops from the given one are used to make recommendations.

In epidemic protocols, peers have access to a Random Peer Sampling service (RPS) providing them with a continuously changing random subset of the network peers. Each peer maintains a view of the network, which is initialized at random through RPS when a peer joins the network. Gossip protocols are fully decentralized, can handle high churn rates, and require no specific protocol to recover from massive failures.

There are also published research papers where structured P2P networks are used. For example, in [10, 11], distributed hash tables are used to store ratings. The proposed approach is close to this one, except that ratings are not stored in a distributed hash table; instead, the fast lookup capability provided by this kind of P2P architecture is employed for searching for similar peers.

Most of the approaches involve sharing the rating data between nodes, while in the proposed architecture it is avoided.

Privacy concerns are directly addressed in [23]. The authors propose a file sharing network where users exchange their data only with their friends and the recommendation system on the top of it. They propose a privacy-conserving distributed collaborative filtering approach that is based on exchanges of anonymized items’ relevance ranks between peers. Their approach, however, allows only unary ratings (initially, the fact of owning a specific file).

Distributed recommendation systems are also analysed in quite different context, seeking for efficient parallel implementations of centralized recommendation techniques. This research direction is entirely beyond the scope of this paper.

## 3 Locality-Sensitive Hashing for Recommendations

Locality-sensitive hashing (LSH) is a method widely used for a probabilistic solution of k-NN (k Nearest Neighbours) problem. The idea of this method is to hash multidimensional objects in such a way that similar objects (w.r.t. some distance measure defined on them) are likely to have the same hash value.

### 3.1 The Idea of LSH

Let *d*_{1} < *d*_{2} be two distances according to some distance measure *d*. A family *F* of functions is said to be (*d*_{1}, *d*_{2}, *p*_{1}, *p*_{2})-sensitive if for every *f* in *F* and two arbitrary objects *x* and *y* [24]:

- If *d*(*x*, *y*) ≤ *d*_{1}, then the probability that *f*(*x*) = *f*(*y*) is at least *p*_{1}.

- If *d*(*x*, *y*) ≥ *d*_{2}, then the probability that *f*(*x*) = *f*(*y*) is at most *p*_{2}.

An important concept in the locality-sensitive hashing theory is an amplification. Given a (*d*_{1}, *d*_{2}, *p*_{1}, *p*_{2})-sensitive family *F*, a new family *F*’ can be constructed by either AND-construction or OR-construction.

AND-construction of *F*’ is defined as follows. Each member of *F*’ consists of *r* members of *F* for some fixed *r*. If *f* is in *F*’ and *f* is constructed from the set {*f*_{1}, *f*_{2}, …, *f*_{r}} of members of *F*, *f*(*x*) = *f*(*y*) iff *f*_{i}(*x*) = *f*_{i}(*y*) for all *i* ∈{1,…, *r*}. As members of *F*’ are independently chosen from *F*, *F*’ is an (\(d_{1}, d_{2}, {p_{1}}^{r}, {p_{2}}^{r} \))-sensitive family [24].

OR-construction of *F*’ is defined as follows. Each member of *F*’ consists of *b* members of *F* for some fixed *b*. If *f* is in *F*’, and *f* is constructed from the set {*f*_{1}, *f*_{2}, …, *f*_{b}} of members of *F*, *f*(*x*) = *f*(*y*) iff there exists *i* ∈{1,…, *b*}, such that *f*_{i}(*x*) = *f*_{i}(*y*). Similarly, *F*’ is a (*d*_{1}, *d*_{2}, 1 – (1 – *p*_{1})^{b}, 1 – (1 – *p*_{2})^{b})-sensitive family.

Generally, it is desirable that *p*_{1} be as large as possible and *p*_{2} be as small as possible. If *p*_{1} < 1, then there exists some possibility that similar objects will have different hash values. On the other hand, if *p*_{2} > 0, some possibility exists that distant objects will have similar hash values. Therefore, family *F* is chosen in such a way that *p*_{1} is large (close to 1) and *p*_{2} is small (close to 0). There is a finite set of well-studied locality-sensitive function families and the desired levels of *p*_{1} and *p*_{2} cannot always be achieved with one “pure” family, and here the amplification comes into play.

If family *F*^{Ar} is obtained as an AND-construction of *r* functions from family *F*, and *G* is then obtained as an OR-construction of *b* functions from family *F*^{Ar}, then *G* is a (*d*_{1}, *d*_{2}, 1 – (1 – *p*_{1}^{r})^{b}, 1 – (1 – *p*_{2}^{r})^{b})-sensitive family. Informally, the AND-construction lowers both probabilities but pushes the initially low *p*_{2} much closer to zero, and the subsequent OR-construction raises both probabilities but restores the initially high *p*_{1} much more than it raises *p*_{2}.
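The effect of the combined construction can be checked numerically. The following is a minimal sketch that evaluates the formulas above; the function names and the sample values of *p*_{1}, *p*_{2}, *r*, and *b* are illustrative assumptions, not values taken from the paper.

```python
def and_construction(p: float, r: int) -> float:
    """Collision probability after an AND-construction of r functions."""
    return p ** r


def or_construction(p: float, b: int) -> float:
    """Collision probability after an OR-construction of b functions."""
    return 1.0 - (1.0 - p) ** b


def and_then_or(p: float, r: int, b: int) -> float:
    """AND-construction of r functions followed by OR-construction of b."""
    return or_construction(and_construction(p, r), b)


# Illustrative values (assumptions): a "pure" family with p1 = 0.8, p2 = 0.4.
p1, p2 = 0.8, 0.4
r, b = 4, 4
amplified_p1 = and_then_or(p1, r, b)
amplified_p2 = and_then_or(p2, r, b)
print(f"p1: {p1} -> {amplified_p1:.4f}")
print(f"p2: {p2} -> {amplified_p2:.4f}")
```

With these sample values the gap between the two probabilities widens: the collision probability for close objects rises to roughly 0.88, while the one for distant objects falls below 0.1.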

The idea of the nearest neighbours search based on LSH is described in many papers (e.g., [24, 25]). First, a hash family *F* (to be discussed in detail later) is chosen and *b* ordinary hash tables are arranged. For each hash table, a hash function \( f_{i}^{Ar} ,\,i = 1, \ldots ,b \) is defined as an AND-construction of *r* random functions from *F*. Every object *x* is stored in each of the *b* hash tables. The key is \( f_{i}^{Ar} \left( x \right) \) and the value is either some identity of *x* or *x* itself. Naturally, several objects can fall into one hash table bucket.

When searching for the nearest neighbours of an object *y*, first, \( f_{i}^{Ar} \left( y \right),\,i = 1, \ldots ,b \) is calculated and then all values from the corresponding hash tables are retrieved resulting in a set of the nearest neighbour candidates. Precise distance to each of the candidates is then assessed and false positives are removed.
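The storage and search procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the class `LSHIndex`, the helper `bit_sampler`, and the hand-picked sampled positions are hypothetical, and bit sampling over Hamming distance is used only because it keeps the example deterministic.

```python
from collections import defaultdict


def bit_sampler(positions):
    """AND-construction of bit-sampling functions: the key is the tuple of sampled bits."""
    return lambda x: tuple(x[p] for p in positions)


def hamming(x, y):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(x, y))


class LSHIndex:
    def __init__(self, hash_fns):
        self.hash_fns = list(hash_fns)                 # one function per table
        self.tables = [defaultdict(list) for _ in self.hash_fns]
        self.objects = {}                              # id -> stored vector

    def add(self, obj_id, x):
        """Store the object identity in every one of the b tables."""
        self.objects[obj_id] = x
        for table, h in zip(self.tables, self.hash_fns):
            table[h(x)].append(obj_id)

    def candidates(self, y):
        """Union of the buckets that y falls into (may contain false positives)."""
        found = set()
        for table, h in zip(self.tables, self.hash_fns):
            found.update(table.get(h(y), []))
        return found

    def query(self, y, max_dist):
        """Assess the precise distance to each candidate and drop false positives."""
        return sorted(i for i in self.candidates(y)
                      if hamming(self.objects[i], y) <= max_dist)


# b = 3 tables, each keyed by r = 2 sampled bits (positions fixed for the example).
index = LSHIndex([bit_sampler([0, 1]), bit_sampler([2, 3]), bit_sampler([4, 5])])
index.add("a", (1, 0, 1, 1, 0, 0))
index.add("b", (1, 0, 0, 0, 1, 1))
index.add("c", (0, 1, 0, 0, 1, 0))

y = (1, 0, 1, 1, 0, 1)
print(index.candidates(y))   # objects sharing at least one bucket with y
print(index.query(y, 1))     # candidates within Hamming distance 1
```

Note that the final filtering step needs the stored vectors themselves; this is exactly the step that becomes problematic when complete profiles must not be shared.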

Particular choice of the hash function family depends on data representation and distance function *d*. For Hamming distance a bit sampling locality sensitive hash was proposed in [13], for cosine distance a random projections method was proposed in [3], a well-performing hash function for Euclidean distance is proposed in [6].

In the random projections method, each function *f* from *F* corresponds to one random hyperplane and takes the value one if the object being hashed is above the hyperplane, and the value zero if the object is below it. An object is usually represented as an *n*-dimensional vector (*x* ∈ ℜ^{n}), and a hyperplane in that vector space is denoted by its normal *n*-dimensional vector (*v*(*f*) ∈ ℜ^{n}). The relative position of an object and a hyperplane can be found as the sign of the dot product of these two vectors:

\[ f(x) = \begin{cases} 1, & v(f) \cdot x \ge 0, \\ 0, & v(f) \cdot x < 0. \end{cases} \tag{1} \]

Each AND-construction \( f_{i}^{Ar} \) uses *r* functions from *F* (and therefore *r* hyperplanes), and the results of application of these functions are *r*-dimensional vectors of ones and zeroes. Formally, if \( f_{i}^{Ar} (x) = (f_{i,1} ,f_{i,2} , \ldots ,f_{i,r} ) \), then:

\[ f_{i,j} = \begin{cases} 1, & v(f_{i,j}) \cdot x \ge 0, \\ 0, & \text{otherwise}, \end{cases} \qquad j = 1, \ldots, r. \tag{2} \]
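As an illustration of the random hyperplane hash and its AND-construction, the sketch below computes an *r*-bit key from *r* hyperplane normals. The function names, the Gaussian sampling of the normals, and the convention that a zero dot product maps to one are assumptions made for the example.

```python
import random


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def random_hyperplanes(r, n, rng):
    """Draw r random hyperplane normals in an n-dimensional space."""
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(r)]


def hyperplane_bit(v, x):
    """One hash function: 1 if x lies on the positive side of the hyperplane with normal v."""
    return 1 if dot(v, x) >= 0 else 0


def and_hash(planes, x):
    """AND-construction: an r-dimensional vector of ones and zeroes used as a table key."""
    return tuple(hyperplane_bit(v, x) for v in planes)


# Fixed normals make the behaviour easy to follow by hand.
planes = [[1.0, 0.0], [0.0, 1.0]]
print(and_hash(planes, [2.0, -1.0]))   # (1, 0)
print(and_hash(planes, [-2.0, 1.0]))   # (0, 1)

# Randomly drawn normals, as in the actual method; the key depends only on direction.
rng = random.Random(42)
random_planes = random_hyperplanes(r=4, n=3, rng=rng)
x = [0.5, -1.0, 2.0]
same_direction = [2.0 * c for c in x]
print(and_hash(random_planes, x) == and_hash(random_planes, same_direction))
```

Because only the sign of each dot product matters, vectors pointing in the same direction always receive the same key, which is why this family suits cosine-based similarity.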

### 3.2 Recommendations Generation

The problem of finding the nearest neighbours is closely related to the recommendation systems research area, namely neighbourhood-based methods in collaborative filtering systems (see, e.g., [7]). These methods of recommendation are based on an assumption that users that had similar preferences in the past are likely to have similar preferences now (and in the future). Therefore, to make recommendations, users with similar preferences should be found. To do this, user preferences are typically represented as numerical vectors and some measure is introduced in that vector space corresponding to preference similarity. In this setting, the problem of finding similar users translates into the nearest neighbours search. This subsection provides a formal description of collaborative filtering recommendation method based on the locality-sensitive hashing.

A user-based collaborative filtering system is a recommendation system that infers recommendations from the similarity of users, measured by the degree to which their known ratings coincide.

More formally, let *r*_{uj} be the rating assigned to the item *j* by the user *u*, which corresponds to how user *u* liked item *j*, or what was the subjective utility of *j* for *u*. Let *U* be the set of all users, *I* – the set of all items, *I*_{u} – the set of items rated by user *u*, and *I*_{uv} – the set of items rated by both user *u* and user *v*. Usually, a user has ratings for relatively small number of items, |*I*_{u}| << |*I*|. Neighbourhood methods of user-based collaborative filtering employ some similarity measure between users which is calculated based on common ratings (*sim*(*u*, *v*) = *f*_{s}({*r*_{uj}, *r*_{vj} | *j* ∈ *I*_{uv}})) and estimate unknown rating \( r_{uj}^{*} \) based on known ratings *r*_{vj} and estimated similarities *sim*(*u*, *v*).

In this paper, cosine similarity is used as the similarity measure. This choice is motivated mostly by the fact that there exists a known way to approximate this measure by a set of locality-sensitive hash functions [3], which is not the case for other widespread similarity measures (e.g., the Pearson correlation coefficient). It is also supported by the evidence that cosine similarity works well in many recommendation system settings [1].

User ratings are normalized in such a way that *r*_{uj} > 0 corresponds to positive attitude of user *u* to item *j*, *r*_{uj} < 0 corresponds to negative attitude, and zero corresponds to neutral. Absolute value shows strength of the attitude. Probably, the simplest way of such kind of normalization is mean-centring of ratings.
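Mean-centring, mentioned above as the simplest normalization, can be sketched as follows. The function name and the dictionary representation of a profile are assumptions made for the example.

```python
def mean_centre(ratings):
    """Subtract the user's mean rating from each rating.

    ratings: dict mapping an item identifier to a raw rating.
    Positive results mean a positive attitude, negative results a negative one.
    """
    if not ratings:
        return {}
    mean = sum(ratings.values()) / len(ratings)
    return {item: value - mean for item, value in ratings.items()}


raw = {"item_a": 4, "item_b": 3, "item_c": 5}
print(mean_centre(raw))   # {'item_a': 0.0, 'item_b': -1.0, 'item_c': 1.0}
```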

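A profile of user *u* in a pure collaborative filtering system can be understood as a set of *r*_{uj}, where *j* ∈ *I*_{u}. In some cases, it is also convenient to represent the user’s profile as a vector *p*_{u} ∈ ℜ^{|I|}, constructed in the following way:

\[ p_{u} = (p_{u1}, \ldots, p_{u|I|}), \qquad p_{uj} = \begin{cases} r_{uj}, & j \in I_{u}, \\ 0, & \text{otherwise}. \end{cases} \tag{4} \]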

This does not mean that the user’s profile should be stored in this way; that would be inefficient, as most components of *p*_{u} are equal to zero. Rather, this representation makes some mathematical formulas more intuitive.
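For reference, the vector representation makes the link between cosine similarity and the random hyperplane hash explicit. The block below is a sketch rather than a formula reproduced from the paper: the angle symbol \( \varphi \) is introduced here for illustration, and the norms are taken over all items rated by each user, which is an assumption about how the vector form is applied. The collision probability is the known property of the random hyperplane family from [3].

```latex
\[
\cos(p_u, p_v)
  = \frac{p_u \cdot p_v}{\lVert p_u \rVert \, \lVert p_v \rVert}
  = \frac{\sum_{j \in I_{uv}} r_{uj}\, r_{vj}}
         {\sqrt{\sum_{j \in I_u} r_{uj}^{2}} \; \sqrt{\sum_{j \in I_v} r_{vj}^{2}}},
\qquad
\Pr\bigl[f(p_u) = f(p_v)\bigr] = 1 - \frac{\varphi(p_u, p_v)}{\pi},
\quad
\varphi(p_u, p_v) = \arccos\bigl(\cos(p_u, p_v)\bigr).
\]
```

Only items rated by both users contribute to the numerator, since every other component of the product is zero. The smaller the angle between two profiles, the more likely a single hyperplane function assigns them the same bit.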

Estimating \( r_{uj}^{*} \) requires the search of users *v* that are similar to *u*, that is, the nearest neighbours of *u* according to the cosine similarity measure. This is where LSH comes into play. The original algorithm for LSH-based recommendations (e.g., [24]) consists of the following steps:

- (1)
Preparation. Several hash tables *HT*_{i} are organized, and a corresponding number of locality-sensitive functions *f*_{i} are generated. Then, each user’s identifier is put into each table, and its bucket in *HT*_{i} is determined by the value of function *f*_{i}(*p*_{u}), where *p*_{u} is the vector representation of the user’s profile.

- (2)
Recommendation. When searching for recommendations for user *u*, hash values of his/her profile are calculated and looked up in the respective hash tables. Lookups result in a set *C* = {*v* | ∃*i*: *f*_{i}(*p*_{v}) = *f*_{i}(*p*_{u})} of user identifiers that have at least one hash function value in common with the user *u* (and whose interests are likely, due to hash function properties, to be similar to *u*’s). Then, exact similarities are calculated between user *u* and members of *C* and predictions are generated.

However, the recommendation step of the original algorithm does not fulfil the goal pursued in this paper. Namely, it requires calculating the exact similarity of users in *C*, which is impossible without sending complete profiles to the side that performs this calculation. This paper proposes a modification of the original algorithm that does not require exact similarity computation and thus avoids profile sharing.

To this end, an approximate similarity measure *s’*(*u*, *v*) is introduced. It is defined as the number of locality-sensitive hash functions whose values are equal for users *u* and *v*:

\[ s'(u, v) = \left| \{ i \in \{1, \ldots, b\} : f_{i}(p_{u}) = f_{i}(p_{v}) \} \right|. \]

The measure *s’*(*u*, *v*) can be easily integrated into the modified recommendation step of the algorithm. More specifically, instead of the set *C* in the original algorithm, a multiset \( C_{u}^{*} \) can be used. If \( C_{u}^{*} = (U, m_{U}) \) is a multiset of user identifiers retrieved from hash tables using locality-sensitive hash functions on the user *u*’s profile, then:

\[ s'(u, v) = m_{U}(v), \]

where *m*_{U}(*v*) is the multiplicity function of \( C_{u}^{*} \). The proposed recommendation algorithm first retrieves all approximate neighbours *Q*_{u} = {*v* | *m*_{U}(*v*) > 0} of user *u* from hash tables and computes *s’*(*u*, *v*) (where *v* ∈ *Q*_{u}). Then, each of the approximate neighbours *v* ∈ *Q*_{u} is asked for the recommended items *R*_{v}. The proposed algorithm and the system as a whole do not predict ratings; instead, they rank all items that were recommended by approximate neighbours with respect to the attractiveness estimate \( \tilde{a}_{ui} \) of item *i* for user *u*, defined by the following expression:

\[ \tilde{a}_{ui} = \sum_{v \in Q_{u}} s'(u, v)\, \delta_{vi}, \]

where \( \delta_{vi} \) shows whether item *i* is in the list of items recommended by user *v*:

\[ \delta_{vi} = \begin{cases} 1, & i \in R_{v}, \\ 0, & \text{otherwise}. \end{cases} \]

In other words, the attractiveness estimate \( \tilde{a}_{ui} \) is the sum of approximate similarities between user *u* and neighbours that “recommended” item *i* to user *u*.
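The modified recommendation step can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's algorithm listing: the function names, the representation of hash tables as dictionaries, the sample keys, and the exclusion of the requesting user from his/her own neighbour set are all choices made for the example.

```python
from collections import Counter, defaultdict


def approximate_neighbours(tables, keys, self_id=None):
    """Count in how many of the b tables each user shares a bucket with user u.

    tables: list of b dicts mapping a hash key to a list of user identifiers.
    keys:   the b hash keys of user u's profile, one per table.
    The returned Counter plays the role of the multiplicity function, i.e. s'(u, v).
    """
    counts = Counter()
    for table, key in zip(tables, keys):
        for v in table.get(key, ()):
            if v != self_id:          # assumption: u is not its own neighbour
                counts[v] += 1
    return counts


def rank_items(similarity, recommended):
    """Sum s'(u, v) over the neighbours v that recommended each item, then sort."""
    scores = defaultdict(int)
    for v, s in similarity.items():
        for item in recommended.get(v, ()):
            scores[item] += s
    # Highest attractiveness first; ties broken by item identifier.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# b = 3 tables; user "u" has keys k1, k2, k3 (hypothetical values).
tables = [
    {"k1": ["u", "v", "w"]},
    {"k2": ["u", "v"], "k7": ["w"]},
    {"k3": ["u"], "k9": ["v", "w"]},
]
similarity = approximate_neighbours(tables, ["k1", "k2", "k3"], self_id="u")
ranking = rank_items(similarity, {"v": ["A", "B"], "w": ["B", "C"]})
print(dict(similarity))   # {'v': 2, 'w': 1}
print(ranking)            # [('B', 3), ('A', 2), ('C', 1)]
```

No profile vector leaves a node in this sketch: the requester only sees bucket membership and the item identifiers each neighbour chooses to recommend.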

To sum up, in the proposed system architecture, a profile of user *u* is a set of pairs *(i, r*_{ui}*)*, where *i* are item identifiers. To compute the hash function, a profile is normalized and transformed into a vector *p*_{u} (Eq. 4). Each of the *b* locality-sensitive hash functions is represented by *r* vectors, whose dimensionality equals the number of known items (|*I*|). Finding a hash of a profile vector corresponds to computing inner products of the profile vector and the hash function vectors (Eq. 2). After application of all these hash functions, *b* *r*-dimensional binary vectors are obtained and stored in the hash tables. When looking for recommendations, *b* lookups are performed, then each approximate neighbour found is queried for recommended items, and the list of recommended items is sorted according to the \( \tilde{a}_{ui} \) value.

Algorithms 1 and 2 are provided here without taking into account their distributed implementation, which is one of the aims of this paper. However, the analysis of their inputs reveals some challenges that have to be addressed by the recommendation system’s architecture. One of these challenges is that it should be possible to initiate the recommendation algorithm (Algorithm 2) from any node of the peer-to-peer recommendation network. Therefore, each node should have the *fv* matrix filled with the same values that were used to calculate hash values when items were inserted into *HT*_{i}, which leads to the problem of maintaining some shared state in a distributed network. Parameter *θ* controls which items should be considered “recommended” and can be set for each user individually, usually in the range of [0.3, 1]. Parameters *b* and *r* affect the selectivity of neighbours and the quality of recommendations; their impact on recommendation quality is assessed in Sect. 5.

## 4 System Architecture

The proposed hybrid architecture enables the personalized recommendations exchange with the limited user preferences disclosure. In this section, target use cases are discussed, as well as components of the proposed system and scenarios that implement the target use cases.

### 4.1 Use Cases

Recommendation systems may provide for somewhat different end-user features. Specifically, in this paper the following recommendation use cases are considered: (a) attractiveness estimation of a given item (or set of items); (b) recommendations query; (c) rating an item.

Attractiveness estimation of a given item (or a set of items) is involved when a user encounters some item and wants to check if it is potentially interesting or useful for him/her. In this case, the user passes this item (item identity) to recommendation system and the recommendation system should return an expected attitude of this user to this item. Certainly, the user is not required to perform this request intentionally by hand; some other program or GUI element acting on behalf of the user can mediate this action. Attractiveness estimation request may contain several items. Though estimation for multiple items can always be implemented as a series of single item estimations, it is interpreted here as a use case extension, because in some circumstances the estimation for multiple items is potentially more efficient than multiple separate single item requests.

Recommendations query is initiated when a user wants to receive some recommendations – a list of new, previously unseen items matching his/her preferences.

Rating an item is initiated when a user encounters some new item and expresses his/her attitude to it.

### 4.2 Components

- (1)
Peer-to-Peer recommendations network: In the proposed architecture, each user corresponds to exactly one node (or peer – these terms are used here interchangeably). That node holds all the information about one user’s preferences, ratings, browsing history, but does not share this information with the other nodes, instead it shares only the locality-sensitive hash values of this information in order to find similar users to query for recommendations.

The P2P network is based on the Distributed Hash Table (DHT) model (see, e.g., [16]) widely employed in various P2P networks. The general idea of DHT is rather straightforward. It holds a collection of key/value pairs scattered over a distributed set of nodes and supports key/value pair migration when a node disconnects. DHT usually refers to a class of systems rather than to some specific system or algorithm. What all DHT-based systems have in common is some scheme for distributing a keyspace (a set of possible keys) among peers, accompanied by some regular pattern of links between nodes (sometimes called “fingers”). When a node receives a request for some key, it checks if it is “responsible” for holding this key, and either responds with a value or passes the request to the linked node whose identifier is closest to the key being looked for. Keyspace distribution and link pattern ensure that distributed table lookups can be accomplished by contacting no more than *O*(log *n*) nodes.

The original DHT design has some security and privacy vulnerabilities. For example, in original DHT implementations a lookup request contains information about the node that initiated it, and therefore makes this information available to any malicious node that happens to redirect or process this lookup request. In the context of recommendation systems, it means, for example, that a malicious node could associate a value of a locality-sensitive hash function of a preferences profile (a key in the lookup) with a node (user). This does not reveal exact ratings, but it narrows the uncertainty about the user’s preferences, whose distribution is no longer uniform. Moreover, if the P2P network contains several malicious nodes, they can collect several hash values of one user’s profile, which reduces the uncertainty even further. Even in this case, it is impossible to detect exact values of ratings, because they never leave a node, but the overall “flavour” of the preferences may be detected and associated with a physical node of the network, which is undesirable. An even more important vulnerability is that a potential neighbour node makes recommendations by sending identifiers of items that are marked as good by the respective user, therefore allowing its peer (possibly malicious) to match some presumably high ratings with a physical node address.

To overcome these vulnerabilities a variety of secure and anonymous DHT lookup implementations were designed. The proposed architecture relies on one of these anonymized implementations, namely Octopus [27]. The idea behind most of secured DHT implementations is that all the DHT lookups are made through other nodes accessible by anonymous paths through anonymization relays. Each node in the anonymization path knows only the neighbour nodes and does not know whether some request originated in the neighbour node, or was passed over from some other node.

DHT in the proposed system is used as a set of hash tables to perform nearest neighbour search, as described in Sect. 3. Each key/value pair stored in DHT holds information about one locality-sensitive hash value and the list of nodes corresponding to that hash value (potential neighbours). As discussed in the respective section, several (*b*) hash tables are needed to perform the nearest neighbour search. Each of the *b* tables uses its own locality-sensitive hash function. In the proposed architecture, all of these *b* hash tables are stored in one DHT. To achieve this, the key of a DHT pair includes a unique identifier of the locality-sensitive hash function and the value of that function. The keyspace of most DHT implementations consists of 160-bit values. In the proposed implementation, the concatenation of the unique identifier of the function and its value is processed by the SHA-1 algorithm to provide an even distribution of used keys in the keyspace.

Each node of the P2P network has its unique identifier taken from the same keyspace. It is produced by applying SHA-1 to the network address of the node.
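The key construction described above can be sketched with Python's standard `hashlib`. The function names and the exact serialization of the function identifier and hash value (a decimal identifier, a colon, and a bit string) are assumptions made for the example; the text only states that their concatenation is passed through SHA-1.

```python
import hashlib


def dht_key(function_id, hash_bits):
    """Map (LSH function identifier, LSH value) to a 160-bit DHT key.

    function_id: index of one of the b locality-sensitive hash functions.
    hash_bits:   the r-dimensional binary vector produced by that function.
    """
    # Assumed serialization: "<id>:<bits>", e.g. "3:1011".
    payload = f"{function_id}:" + "".join(str(bit) for bit in hash_bits)
    return hashlib.sha1(payload.encode("utf-8")).digest()


def node_id(network_address):
    """Node identifier in the same keyspace: SHA-1 of the node's network address."""
    return hashlib.sha1(network_address.encode("utf-8")).digest()


key_a = dht_key(0, (1, 0, 1, 1))
key_b = dht_key(1, (1, 0, 1, 1))    # same LSH value, different table
print(len(key_a) * 8)               # 160
print(key_a.hex())
print(key_a != key_b)               # True
print(node_id("198.51.100.7:4000").hex())
```

Including the function identifier in the hashed payload keeps the *b* logical tables apart even when two functions happen to produce the same binary vector.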

Before a node starts to advertise itself in DHT, it creates an anonymized path and uses the endpoint specification of this path as the address it shares with other nodes. These anonymized paths are created each time the node connects to the network, resulting in different public identifiers of the same node.

As user preferences expressed in ratings do not change very fast, it is reasonable for each node to locate other nodes with similar profiles through DHT and store links to them. Therefore, a new overlay network of similar users is formed over the P2P network. It is important to differentiate between the three employed connection layers (Fig. 3). The first layer is the underlying network, which provides a physical connection between P2P nodes. The second layer is the DHT connection layer, which provides DHT key search, key redistribution, etc. This layer is formed by links to adjacent nodes in the structured P2P network, the so-called “fingers”. The third layer is formed by connections between similar nodes, where similarity is interpreted as equality of locality-sensitive hashes.

It is important to note that links to neighbour nodes in the third layer are not exactly identifiers of nodes in the P2P network; they are entrances to anonymized paths to these nodes.

- (2)
The Master node: The distributed nature of the proposed system causes one hindrance. LSH-based nearest neighbour search implies that when searching for the neighbours of object *x*, all the locality-sensitive hash functions that were used to hash other objects and fill hash tables are applied to *x*. In the proposed architecture, an object being hashed is a vector of normalized ratings assigned by the user to different items of interest, and the family of hashing functions is represented by random hyperplane projections. Both to represent a user profile in vector form and to define a hyperplane used to calculate a locality-sensitive hash value, the number of items (and their ordering) should be known. It is later referred to as *item space dimensionality*, or just dimensionality. In some cases, for instance, when the rating storage is centralized and/or all possible items are known in advance, knowing the dimensionality is not a problem. However, in the case of distributed rating storage, when each node holds only the ratings of one user, the overall item space dimensionality can be found out only through communication between nodes.

For example, suppose the system initially contains two nodes (of user Alice and user Bob) and no ratings. Then, Alice encounters three movies (items): *Forrest Gump*, *Scary Movie*, and *Sleepless in Seattle*, and rates them 4, 3, 5 respectively. As Alice’s node does not have information about other movies, the vector representation of Alice’s normalized preferences could be (0, −1, 1). At the same time, Bob encounters the same three movies, but in a different order: *Forrest Gump*, *Sleepless in Seattle*, and *Scary Movie*, and rates them 4, 5, 3 respectively. Under the same considerations, the vector representation of Bob’s preferences could be (0, 1, −1). Then the hash values of the users’ preferences using hyperplane (0, 1, 0) would be −1 for Alice and 1 for Bob, and as the signs differ, the profiles are considered to be different, although the ratings match perfectly. Hence, the item space characteristics and random projection hyperplanes need to be synchronized across all nodes.

The problem of maintaining a global shared state in a P2P network is rather complex, and there are numerous papers dedicated to it, e.g. [4, 12, 21]. In the proposed system, this problem is addressed in a way similar to the one presented in [17], sacrificing the P2P-purity of the system. It is the Master node that collects all new items discovered and rated by peers, maintains their ordering, and generates new locality-sensitive hash functions. So, each peer must connect to the Master node in two situations: first, to notify it about some previously unknown item (which should become a new dimension); second, to get a new set of locality-sensitive hash functions. It must be noted that there is no need to generate new hash functions after each new item is rated. Using outdated hash functions with lower dimensionality is still possible, but it gradually decreases the quality of recommendations. So, each user node collects the newly rated items (which have not been assigned identifiers yet) and then sends a batch of these items to the Master node. The Master node, in turn, accumulates new items, assigns them unique ordered identifiers, and, when their number is large enough, issues a new set of locality-sensitive hash functions. It is also important that the new set is not an entire replacement of the previous one, but contains only several new hash functions.
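The Alice and Bob example can be reproduced with a short Python sketch. It is only an illustration under stated assumptions: `MasterRegistry` is a hypothetical stand-in for the ordered item list kept by the Master node, unrated items are represented by 0, and a zero projection is mapped to 1 (the text does not specify these details).

```python
def normalize(ratings):
    """Centre ratings over the user's mean rating."""
    mean = sum(ratings.values()) / len(ratings)
    return {item: value - mean for item, value in ratings.items()}


def to_vector(ratings, item_order):
    """Lay out normalized ratings along an item ordering (0 for unrated items)."""
    centred = normalize(ratings)
    return [centred.get(item, 0.0) for item in item_order]


def hyperplane_sign(vector, hyperplane):
    """Sign of the projection on the hyperplane normal (assumption: 0 maps to 1)."""
    dot = sum(v * h for v, h in zip(vector, hyperplane))
    return 1 if dot >= 0 else -1


class MasterRegistry:
    """Hypothetical stand-in for the Master node's ordered item list."""

    def __init__(self):
        self.known = []

    def register(self, new_items):
        # New items get the next free positions; known items keep theirs.
        for item in new_items:
            if item not in self.known:
                self.known.append(item)
        return list(self.known)


alice = {"Forrest Gump": 4, "Scary Movie": 3, "Sleepless in Seattle": 5}
bob = {"Forrest Gump": 4, "Sleepless in Seattle": 5, "Scary Movie": 3}
hyperplane = [0, 1, 0]

# Each node uses the order in which it encountered the items.
alice_local = hyperplane_sign(to_vector(alice, list(alice)), hyperplane)
bob_local = hyperplane_sign(to_vector(bob, list(bob)), hyperplane)
print(alice_local, bob_local)  # -1 1

# Both nodes use the ordering maintained by the registry.
master = MasterRegistry()
master.register(list(alice))
shared_order = master.register(list(bob))
alice_shared = hyperplane_sign(to_vector(alice, shared_order), hyperplane)
bob_shared = hyperplane_sign(to_vector(bob, shared_order), hyperplane)
print(alice_shared, bob_shared)  # -1 -1
```

With local orderings the two identical profiles fall on different sides of the hyperplane; with the shared ordering they receive the same sign.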

### 4.3 Scenarios

- (1)
Attractiveness estimation of a given item: attractiveness estimation on a node is possible only after the integration of this node into the P2P network: receiving a set of hash functions from the Master node and locating the nodes of the users with similar ratings (hereinafter these nodes are referred to as neighbour nodes). Let the neighbour nodes for the given one be stored in the *Neighbours* list. Then attractiveness estimation for the item is performed by sending requests to each node from the *Neighbours* list, passing the item identifier. Each neighbour node answers with a binary value indicating whether it can recommend this item to others or not. Attractiveness estimation for a set of items is done in mostly the same way, except that the requester node passes a list of item identifiers instead of one identifier, and the answer contains a list of pairs (*itemId*, *recommend_flag*) for all items that the neighbour node is able to recommend from the requested set.

Informally, the attractiveness estimation scenario can be interpreted as asking like-minded people for advice. In centralized systems it is performed in some conceptual way; in the proposed hybrid P2P system it is performed literally, by sending requests to the respective nodes. When answering an attractiveness estimation request, a node can base the response on the rating that is stored for the given item, or infer the rating from some other information. This is an extension point of the proposed system architecture.

These requests are sent and answered through anonymization relays, so the node does not expose both its identity and an exact rating for any item.
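The attractiveness estimation scenario can be sketched in Python as follows. This is an illustration only: neighbours are modelled as local callables rather than nodes reached through anonymization relays, the rule by which a neighbour decides to recommend an item (stored rating not below a threshold) is an assumption, and so is aggregating the binary answers as the share of positive ones. The text leaves both choices open.

```python
from typing import Callable, Dict, List, Optional

# A neighbour is modelled as a callable: item identifier -> recommend flag.
Neighbour = Callable[[str], bool]


def make_neighbour(ratings: Dict[str, float], threshold: float = 4.0) -> Neighbour:
    """Hypothetical neighbour that recommends items it rated at least `threshold`."""
    def answer(item_id: str) -> bool:
        return ratings.get(item_id, 0.0) >= threshold
    return answer


def estimate_attractiveness(item_id: str, neighbours: List[Neighbour]) -> Optional[float]:
    """Share of neighbours recommending the item, or None without neighbours."""
    if not neighbours:
        return None
    flags = [neighbour(item_id) for neighbour in neighbours]
    return sum(flags) / len(flags)


neighbours = [
    make_neighbour({"Forrest Gump": 5, "Scary Movie": 2}),
    make_neighbour({"Forrest Gump": 4}),
    make_neighbour({"Scary Movie": 3}),
]
score = estimate_attractiveness("Forrest Gump", neighbours)
```

Only the binary flag crosses the node boundary in this sketch, which mirrors the requirement that exact ratings are not disclosed in this scenario.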

- (2)
Recommendations query: In this case, the node that needs recommendations simply sends requests to each of the neighbour nodes. The request contains additional information about what kind of recommendations the node is looking for, e.g., any high-rated objects, or only new objects (encountered and rated after a specified time). Each neighbour node answers with a list of (item, rating) pairs. Unlike the previous scenario, here the neighbour node needs to send not just identifiers of the recommended items, but their contents, something that the receiver side can use directly.

Anonymization relays make sure that the recommendations provider does not expose both ratings and its identity.

- (3)
Rating an item: The main issue of rating items is the generation of new locality-sensitive hash functions that must follow it. To address this issue, each node has two lists: *Known* and *New*. The *Known* list holds all the items the Master node is aware of. This list is received from the Master node during the bootstrap process or the periodic synchronization process. The order of items in this list is also important, as it corresponds to the order of dimensions of the locality-sensitive hash functions. The *New* list, on the other hand, holds the items that are discovered by this node and are not yet approved by the Master node. When the user rates an item, the rating is saved and then, if the item is in neither the *Known* nor the *New* list, it is added to the *New* list.

When the *New* list exceeds some predefined size or once in a predefined period of time (whichever happens first), the node sends its *New* list to the Master node and retrieves the global shared state from the Master node. The global shared state includes an up-to-date version of the *Known* list. Each node augments its *Known* list according to the one received from the Master node and removes from the *New* list the items that are present in the *Known* list.

- (4)
Refreshing hash functions (supplementary scenario): Each node periodically queries the Master node for the global shared state. As described earlier, there are *b* functions, and each hash function is a vector of *r* |*I*|-dimensional random vectors (representing random hyperplanes). To reduce the amount of information exchanged and the load on the Master node, each hash function posted by the Master node is represented by three integers: the function's unique identifier (*funcId*), a random seed, and the current number of items |*I*| (i.e. item space dimensionality). When a node gets this information, it generates the random hyperplanes constituting each of the *b* locality-sensitive hash functions as a sequence of *r* · |*I*| (|*I*| dimensions for each of *r* hyperplanes) random numbers from the specified seed using the Mersenne twister [18].

- (5)
The search for similar peers (supplementary scenario): The search for similar, or neighbour, peers is initiated when a node is registered in the P2P network. Then this search is performed regularly. Before searching for neighbours, a node has to refresh the item list and hash functions from the Master node. Then each function from the up-to-date set of hash functions is applied to this node's ratings vector. The results are merged into pairs (*funcId*, *value*), and these pairs are used as keys to look up in the DHT. A DHT lookup returns a list of identifiers of nodes similar to this one according to the respective locality-sensitive function. These lists are then merged and stored as the *Neighbours* list.
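The two supplementary scenarios (regenerating hash functions from a seed and building the lookup keys) can be sketched in Python. Python's standard `random` module uses the Mersenne Twister as its core generator, which is why it is used here. The function names, the Gaussian distribution of hyperplane components, the bit encoding of the hash value, and the sample triples are assumptions of this sketch; the text only fixes the (*funcId*, seed, |*I*|) representation and the *r* · |*I*| random numbers per function.

```python
import random


def generate_hyperplanes(seed: int, r: int, dim: int):
    """Rebuild the r hyperplanes of one hash function from its seed."""
    rng = random.Random(seed)  # Mersenne Twister core generator
    # Assumption: components are standard normal; r * dim numbers are drawn.
    flat = [rng.gauss(0.0, 1.0) for _ in range(r * dim)]
    return [flat[i * dim:(i + 1) * dim] for i in range(r)]


def lsh_value(vector, hyperplanes) -> int:
    """Pack the signs of the r projections into an r-bit integer."""
    value = 0
    for plane in hyperplanes:
        # zip stops at the shorter sequence, so coordinates missing on
        # either side are effectively treated as zero.
        dot = sum(v * p for v, p in zip(vector, plane))
        value = (value << 1) | (1 if dot >= 0 else 0)
    return value


def lookup_keys(vector, functions, r: int):
    """Build (funcId, value) pairs for a ratings vector.

    `functions` is a list of (funcId, seed, dimensionality) triples, as
    they would be received from the Master node.
    """
    keys = []
    for func_id, seed, dim in functions:
        planes = generate_hyperplanes(seed, r, dim)
        keys.append((func_id, lsh_value(vector, planes)))
    return keys


# Hypothetical shared state: two functions over a 4-item space, r = 8.
functions = [(1, 101, 4), (2, 202, 4)]
keys = lookup_keys([0.0, -1.0, 1.0, 0.5], functions, r=8)
```

Every node has to use exactly the same generation procedure: the same seed fed to a different generator, or to a different transformation of its output, yields different hyperplanes and therefore incomparable hash values.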

## 5 Experimental Study

The experimental study of the proposed approach was performed with the MovieLens 100k dataset shared by the GroupLens research lab. This dataset fits well with e-commerce scenarios (specifically, media streaming services), as it contains 100,000 real-life ratings assigned by 943 users to 1682 movies.

The purpose of the experimental study was twofold. First, to gain some insights into the internal quantitative characteristics of the proposed approach and to estimate time and spatial complexity of the DHT-based LSH recommendation system. Second, to evaluate the quality of recommendations with respect to some well-known baselines.

Ratings are normalized by centring over the user’s mean rating.

### 5.1 Time, Space and Network Load

It was already noted that *b* (the number of hash functions) and *r* (the number of hyperplanes in each hash function) are parameters of the LSH-based recommender. The values of these parameters have a significant impact on both system performance and accuracy.

As each node puts itself into the DHT *b* times, the size of the DHT is *n* · *b*, which means that on average only *b* records of the DHT are located on each node. In most cases, this burden is negligible. More important is the fact that the search for the neighbour nodes takes *b* lookups, which amounts to O(*b* log(*n*)) internode communications. Even more important is the number of neighbours, as this number corresponds to the number of network queries performed to obtain recommendations, and it is desirable to keep the number of these queries as small as possible.
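To give a feeling for these quantities, the following Python sketch evaluates them for an illustrative configuration. It assumes that every user of the dataset runs one node (*n* = 943) and that a single DHT lookup takes about log2(*n*) routing hops, as in Chord-like DHTs; the constant factor hidden in the O-notation depends on the actual DHT implementation, and `dht_load` is a hypothetical helper name.

```python
import math


def dht_load(n_nodes: int, b: int):
    """Rough storage and communication estimates for the DHT-based LSH index."""
    total_records = n_nodes * b                # each node is stored b times
    records_per_node = total_records / n_nodes  # average share of one node
    # Assumption: about log2(n) hops per lookup, b lookups per search.
    hops_per_search = b * math.log2(n_nodes)
    return total_records, records_per_node, hops_per_search


total, per_node, hops = dht_load(n_nodes=943, b=35)
print(total, per_node, round(hops))
```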

Figure 4 shows the relationship between the *b* and *r* parameters of the recommender and the average number of neighbours found through hash table lookup. It can be seen that the number of neighbours increases with the growth of *b*, and the speed of growth significantly depends on the dimensionality of the hash functions. This is expected behaviour, as a small dimensionality of hash functions and a large number of “alternative” hash functions make the neighbour search procedure indiscriminate.
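This behaviour agrees with the standard analysis of random hyperplane LSH. For two vectors at angle θ, one random hyperplane assigns them the same sign with probability 1 − θ/π [3]; all *r* signs of one function coincide with probability (1 − θ/π)^*r*, and at least one of *b* functions matches with probability 1 − (1 − (1 − θ/π)^*r*)^*b*. The Python sketch below evaluates this expression; it assumes that the hyperplanes are drawn independently, and the function names are illustrative.

```python
import math


def sign_agreement(cos_sim: float) -> float:
    """Probability that one random hyperplane gives both vectors the same sign."""
    theta = math.acos(max(-1.0, min(1.0, cos_sim)))
    return 1.0 - theta / math.pi


def neighbour_probability(cos_sim: float, r: int, b: int) -> float:
    """Probability that two users collide in at least one of the b tables."""
    p = sign_agreement(cos_sim)
    return 1.0 - (1.0 - p ** r) ** b


# Unrelated profiles (cosine similarity 0) under different settings.
for r, b in [(8, 10), (8, 100), (12, 100)]:
    print(r, b, round(neighbour_probability(0.0, r, b), 4))
```

For a fixed similarity, increasing *b* raises the collision probability and increasing *r* lowers it, so a small *r* combined with a large *b* lets even dissimilar users become neighbours.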

In this experiment, we assume that a reasonable number of hash functions is under 100 and a reasonable number of neighbours is under 50. The numbers differ because the neighbour search is a one-time action (and can therefore bear more overhead), whereas queries to neighbours happen more often.

For each value of *r*, different values of *b* were tried, and the average number of neighbours and the respective recall were evaluated. It can be seen that when the number of neighbours is less than approximately 50, the quality of recommendations grows fast, whereas for larger numbers of neighbours it reaches a plateau.

With this in mind, three configurations were selected to examine recommendation quality: (*r* = 12, *b* = 100), (*r* = 10, *b* = 35), (*r* = 8, *b* = 10). These configurations were selected because each of them gives on average approximately 50 neighbours for a user in the explored dataset (see Fig. 4).

### 5.2 Recommendations Quality

For each high rating in the testing set, it is checked whether the rated item is among the top *n* recommended items for that user. The outcome of this check may be either 1 (if it is in the top *n*) or 0 (if it is not). These outcomes are summed for all high ratings of the testing set to produce the $N_p$ value. Recall is calculated according to the formula:

$$\mathrm{recall}(n) = \frac{N_p}{N_H},$$

where $N_H$ is the number of high ratings. In other words, this value can be interpreted as the probability that a randomly taken high-rated item is in fact recommended by the algorithm.
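The recall measure described above, i.e. the number of hits *N*_p divided by the number of high ratings *N*_H, can be sketched in Python as follows. The function name, the data layout, and the toy data are assumptions for illustration; in particular, what counts as a "high" rating is decided before the test pairs are passed in.

```python
def recall_at_n(high_ratings, top_n_for_user, n):
    """Share of high-rated test items found among the top-n recommendations.

    high_ratings: iterable of (user, item) pairs from the testing set.
    top_n_for_user: callable (user, n) -> list of recommended items.
    """
    hits = 0   # N_p: high-rated items that made it into the top n
    total = 0  # N_H: all high ratings in the testing set
    for user, item in high_ratings:
        total += 1
        if item in top_n_for_user(user, n):
            hits += 1
    return hits / total if total else 0.0


# Toy data: fixed ranked lists stand in for a real recommender.
ranked = {"u1": ["A", "X", "B"], "u2": ["Y", "Z", "C"]}
test_pairs = [("u1", "A"), ("u1", "B"), ("u2", "C")]


def top_n(user, n):
    return ranked[user][:n]


print(recall_at_n(test_pairs, top_n, 2))  # one hit out of three
print(recall_at_n(test_pairs, top_n, 3))  # 1.0
```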

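The proposed approach was compared with two non-personalized baselines: first, a random recommender, which recommends *n* random items to any user; second, a popular items recommender (PopRec), which recommends the items that have the largest number of ratings. Figure 6 shows the recall of each of the recommenders at different values of *n*.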
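For reference, the PopRec baseline can be sketched in a few lines of Python. The data layout (a list of (user, item, rating) triples) and the function name `pop_rec` are assumptions; whether items already rated by the target user are filtered out is not stated in the text, so this sketch does not filter them.

```python
from collections import Counter


def pop_rec(ratings, n):
    """Return the n items with the largest number of ratings."""
    counts = Counter(item for _user, item, _rating in ratings)
    return [item for item, _count in counts.most_common(n)]


ratings = [
    ("u1", "A", 5), ("u2", "A", 3), ("u3", "A", 4),
    ("u1", "B", 2), ("u2", "B", 4),
    ("u3", "C", 5),
]
top_two = pop_rec(ratings, 2)
```

The same list is returned for every user, which is what makes this baseline non-personalized.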

All the tested variants of the LSH recommendation method give similar results. This may be explained by the fact that all of the tested variants have nearly the same number of neighbour nodes (about 50, see Fig. 4). It can also be seen that the proposed recommendation algorithm significantly outperforms the non-personalized recommendation algorithms in terms of recall.

## 6 Conclusions

This paper proposes the architecture of a user-centric hybrid peer-to-peer recommendation system based on locality-sensitive hashing. One of the main distinguishing features of the proposed system is that exact ratings that a user assigns to items are never shared together with the user’s identity (and network address), which provides privacy. This is achieved by employing locality-sensitive hashing technique and building an anonymized overlay in a P2P network.

The paper describes use cases of the recommendation system and shows how these use cases can be implemented via communication of nodes in P2P network and communication of nodes with the Master node responsible for data-driven coordination and holding a shared state of the distributed system.

The proposed approach was evaluated on a widely used dataset from an e-commerce scenario (movie ratings), and it was shown that the estimated recall of the proposed recommendation system is significantly higher than that of the trivial baselines.

The proposed approach has several limitations:

- a principal limitation of a user-centric recommendation system is that a user can receive recommendations only from those other users that are online and connected to the P2P network. It can be alleviated by using some virtual proxies (“avatars”) that are always online, but using these “avatars” blurs the difference between centralized and decentralized systems and needs further thorough examination;
- due to DHT limitations, the proposed approach is not applicable to P2P networks with high churn;
- the proposed approach most likely does not fit highly dynamic domains, such as news recommendation, because of the need to share information about all objects all over the P2P network;
- modern recommendation systems evolve in the direction of context awareness, but context is totally out of the picture in the proposed recommendation technique.

In the future, the authors are planning to consider alternative solutions for sharing the global set of locality-sensitive hash functions among peers, as well as to add contextual awareness to the recommendation engine.

## Notes

### Acknowledgements

The research was partially supported by projects funded by grants # 13-07-00271, # 13-07-00039, and # 14-07-00345 of the Russian Foundation for Basic Research, project 213 (program 8) of the Presidium of the Russian Academy of Sciences, project # 2.2 of the basic research program “Intelligent information technologies, system analysis and automation” of the Nanotechnology and Information technology Department of the Russian Academy of Sciences. This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

### References

- 1. Amatriain, X., Jaimes, A., Oliver, N., Pujol, J.M.: Data mining methods for recommender systems. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P. (eds.) Recommender Systems Handbook. Springer, Heidelberg (2011)
- 2. Bakker, A., Ogston, E., van Steen, M.: Collaborative filtering using random neighbours in peer-to-peer networks. In: Workshop on Complex Networks in Information and Knowledge Management, pp. 67–75 (2009)
- 3. Charikar, M.S.: Similarity estimation techniques from rounding algorithms. In: STOC 2002 Proceedings of the 34th Annual ACM Symposium on Theory of Computing, pp. 380–388 (2002)
- 4. Chen, X., et al.: SCOPE: scalable consistency maintenance in structured P2P systems. In: Proceedings of IEEE INFOCOM 2005, pp. 1502–1513 (2005)
- 5. Cremonesi, P., Koren, Y., Turrin, R.: Performance of recommender algorithms on top-n recommendation tasks. In: Proceedings of the Fourth ACM Conference on Recommender Systems (RecSys 2010), pp. 39–46. ACM, New York, NY, USA (2010)
- 6. Datar, M., et al.: Locality-sensitive hashing scheme based on p-Stable distributions. In: SCG 2004 Proceedings of the 20th Annual Symposium on Computational Geometry, pp. 253–262 (2004)
- 7. Desrosiers, C., Karypis, G.: A comprehensive survey of neighborhood-based recommendation methods. In: Ricci, F., Rokach, L., Shapira, B., Kantor, P. (eds.) Recommender Systems Handbook. Springer, Heidelberg (2011)
- 8. Draidi, F., Pacitti, E., Kemme, B.: P2Prec: a P2P recommendation system for large-scale data sharing. J. Trans. Large-Scale Data Knowl.-Centered Syst. (TLDKS) **3**, 87–116 (2011)
- 9. Draidi, F., et al.: P2Prec: a social-based P2P recommendation system. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, pp. 2593–2596 (2011)
- 10. Han, P., et al.: A scalable P2P recommendation system based on distributed collaborative filtering. Expert Syst. Appl. **27**(2), 203–210 (2004)
- 11. Hecht, F., et al.: Radiommendation: P2P on-line radio with a distributed recommendation system. In: Proceedings of the IEEE 12th International Conference on Peer-to-Peer Computing, pp. 73–74 (2012)
- 12. Hu, Y., Bhuyan, L.N., Feng, M.: Maintaining data consistency in structured P2P systems. IEEE Trans. Parallel Distrib. Syst. **23**(11), 2125–2137 (2012)
- 13. Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: STOC 1998 Proceedings of the 30th Symposium on Theory of Computing, pp. 604–613 (1998)
- 14. Jelasity, M., Montresor, A., Babaoglu, O.: T-Man: gossip-based fast overlay topology construction. Comput. Netw. **53**(13), 2321–2339 (2009)
- 15. Kermarrec, A.-M., et al.: Application of random walks to decentralized recommendation systems. In: Proceeding of the 14th International Conference on Principles of Distributed Systems, pp. 48–63 (2010)
- 16. Korzun, D., Gurtov, A.: Structured Peer-to-Peer Systems. Fundamentals of Hierarchical Organization, Routing, Scaling and Security. Springer, Heidelberg (2013)
- 17. Mastroianni, C., Pirro, G., Talia, D.: Data consistency and peer synchronization in cooperative P2P environments. Technical report (2008, unpublished)
- 18. Matsumoto, M., Nishimura, T.: Mersenne twister: a 623-dimensionally equidistributed uniform pseudo-random number generator. ACM Trans. Model. Comput. Simul. **8**(1), 3–30 (1998)
- 19. Netflix Prize. http://www.netflixprize.com/
- 20. Jelasity, M., Hegedűs, I., Ormándi, R.: Overlay management for fully distributed user-based collaborative filtering. In: D’Ambra, P., Guarracino, M., Talia, D. (eds.) Euro-Par 2010, Part I. LNCS, vol. 6271, pp. 446–457. Springer, Heidelberg (2010)
- 21. Oster, G., et al.: Data consistency for P2P collaborative editing. In: Proceedings of the 20th Anniversary Conference on Computer Supported Cooperative Work, pp. 259–268 (2006)
- 22. Pitsilis, G., Marshall, L.: A trust-enabled P2P recommendation system. In: Proceedings of 15th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, pp. 59–64 (2006)
- 23. Pussep, K., et al.: A peer-to-peer recommendation system with privacy constraints. In: CISIS: IEEE Computer Society, pp. 409–414 (2009)
- 24. Rajaraman, A., Ullman, J.: Mining of Massive Datasets. Cambridge University Press, Cambridge (2012)
- 25. Slaney, M., Casey, M.: Locality-sensitive hashing for finding nearest neighbors. IEEE Signal Process. Mag. **25**(2), 128–131 (2008)
- 26. Tveit, A.: Peer-to-peer based recommendations for mobile commerce. In: Proceedings of 1st International Workshop on Mobile Commerce (WMC 2001), pp. 26–29. ACM (2001)
- 27. Wang, Q., Borisov, N.: Octopus: a secure and anonymous DHT lookup. In: Proceedings of the IEEE 32nd International Conference on Distributed Computing Systems, pp. 325–334 (2012)