Analysis of user keyword similarity in online social networks
How do two people become friends? What role does homophily play in bringing two people closer to help them forge friendship? Is the similarity between two friends different from the similarity between any two people? How does the similarity between a friend of a friend compare to similarity between direct friends? In this work, our goal is to answer these questions. We study the relationship between semantic similarity of user profile entries and the social network topology. A user profile in an on-line social network is characterized by its profile entries. The entries are termed as user keywords. We develop a model to relate keywords based on their semantic relationship and define similarity functions to quantify the similarity between a pair of users. First, we present a ‘forest model’ to categorize keywords across multiple categorization trees and define the notion of distance between keywords. Second, we use the keyword distance to define similarity functions between a pair of users. Third, we analyze a set of Facebook data according to the model to determine the effect of homophily in on-line social networks. Based on our evaluations, we conclude that direct friends are more similar than any other user pair. However, the more striking observation is that except for direct friends, similarities between users are approximately equal, irrespective of the topological distance between them.
Keywords: Online social network, User keywords, User similarity, Homophily measurement, Semantic analysis
The famous experiment conducted by Travers and Milgram on the small world problem (Travers et al. 1969; Milgram 1967) tried to ascertain whether people in society are linked by short chains. They asked people to forward letters to friends who they thought were likely to know the target person. Thus, people implicitly made decisions based on their view of the geographical location and professional associations of their friends, and on the associated likelihood of a successful delivery of the letter through that friend. The results showed that people are able to find other individuals, even in far-off places, fairly quickly, and that the path length connecting such a pair of individuals is small. These conclusions opened up the question of how individuals are connected to each other in spite of living at far-off geographic locations. In other words, what brings a set of individuals together, even when they do not belong to the same geographic location? What role does homophily play here? Do people become friends when they share common interests and passions despite living in different places?
On-line social networks (OSNs) help us study such problems using the rich data available about their users. A typical user profile in an on-line social network is characterized by its profile entries, such as location, hometown, activities, interests, favorite music, professional associations, etc. For example, on sites like Facebook¹ and Orkut,² users establish friendships when they discover similar profile entries. On LinkedIn,³ people connect with each other to build professional networks and find career development opportunities. Using LinkedIn, employers look into the profile information of users to search for potential employees. Similarly, it helps employees find potential employers. Thus, when two people share a common professional field, they come closer, connect to each other and establish friendship.
In this work, our goal is to (1) understand the process of how people connect to each other, i.e., form friendships based on the intersections of their interests and passions, (2) study the similarity across different user profiles and (3) correlate user similarity with the network topology to understand the effect of homophily in on-line social networks.
Consider the scenario where a newcomer to the city, say Bob, a soccer enthusiast, befriends other soccer enthusiasts. On his OSN profile, he enters ‘football’ in his interests field while his friends enter ‘soccer’. When we try to analyze the similarity between Bob and his friends through a similarity analysis of their interests, we do not see any similarity based on a direct matching of the entries. But essentially, the friendship between Bob and his friends evolves because all of them are interested in a sporting activity and their interests match. In other words, homophily plays an important role in bridging friendship between Bob and his friends.
To understand the influence of homophily using the underlying semantic relationship of profile entries, and to successfully extract relationships from the diverse information present, we build models in this work. We term each individual profile entry of a user a ‘keyword’. Our key contributions are summarized next. We study the relationship between semantic similarity of user keywords and the social network topology. First, we define a model to categorize keywords based on their semantic relationships. The model consists of multiple categorization trees to aggregate similar keywords. We formally term the model of multiple categorization trees the ‘Forest Model’. Second, we define the notion of distance between keywords in the ‘forest’ and, based on the keyword distance, we define functions to determine the similarity between a pair of users. Third, we analyze a set of Facebook data according to the model to determine the effect of homophily in on-line social networks.
Based on our evaluations, we conclude that direct friends are more similar than any other user pair. The most striking observation is that, except for direct friends, similarities between users are approximately equal, irrespective of the topological distance between them. The similarity between users who are separated by two hops is nearly equal to the similarity between users placed three, four or more hops apart in the on-line social network. We also observe how different ways of building the ‘forest’ affect the measured similarity between users. Our analysis also shows that an increase in the number of friends and keywords for an individual user lowers the average similarity between the user and his or her friends.
In Sect. 2 we survey related work. We discuss the key challenges and present our findings on keyword usage patterns in Sect. 3. Next, we introduce the ‘Forest Model’ to categorize keywords and discuss its impact on analyzing user keywords in Sect. 4. We propose functions to quantify similarity between users in Sect. 5 and evaluate them in Sect. 6. We conclude in Sect. 7 with a discussion of future work.
In this section, we review some of the related work. First, we discuss work related to the mathematics behind the small world problem and social networks in general. Next, we discuss works that address homophily in social networks and user similarity based on user characteristics.
Kleinberg (2000, 2001) and Sandberg (2007) have developed mathematical models to show how users interact with one another and establish links to build a social network. The lattice model (Kleinberg 2000) is based on the geographical distance between users. It defines a network in which users establish multiple short-range friendships and a few long-range contacts. Based on these definitions, decentralized algorithms are developed to show that users can search for short paths to other users with high probability. The work in Sandberg (2007) also presents mathematical models that extend the decentralized search algorithm to enable searches even when users are unaware of their own and others’ positions in the network. In Kleinberg (2001), a hierarchical network model was developed. Users are arranged at the leaves of a hierarchical structure such that the least common ancestor of two nodes in the tree is the node at which they start differing in their attributes. Thus, the least common ancestor defines the similarity of two nodes, or how likely they are to become friends. The closer the least common ancestor is to the two nodes, the higher the probability of the two nodes being friends. Based on this probability, the social network graph and the decentralized search algorithm are developed.
Homophily, more commonly known through the phrase ‘birds of a feather flock together’, has played an important role in the study of social networks. Sociologists (McPherson et al. 2001) have tried to understand the phenomenon using multiple characteristics like gender, race, ethnicity, age, educational level, etc. Similarity between users due to their association with the same communities has been studied in Crandall et al. (2008). Community associations and user keywords have been used to model user communication in social networks in Banks et al. (2007, 2009). Information exchange between users takes place only when they share a social path, common keywords and community memberships. Decentralized search algorithms using combinations of homophily and node degree parameters have been developed in Şimşek and Jensen (2005, 2008).
Similarity between users as a function of their topological distance was studied in Adamic et al. (2003). The work tried to find out the average fraction of similar users with a common characteristic like year in school, graduate status, etc. to track the number of similar users from a data set of Club Nexus. Their findings reported a gradual decay in similarity with increased topological distance in the social network. The work in Adamic and Adar (2001) developed functions to analyze similarity between users as a function of the frequency of a shared item.
Geographic ties between on-line social network users have been another property used to understand homophily between users. The geographic location and friendship behaviors of bloggers were studied in Kumar et al. (2004). The work in Liben-Nowell et al. (2005) has also studied the relationship between the geographic location of users and the relationships among them. The study showed that one-third of friendships in a social network are independent of geography. This is an interesting conclusion and raises the question of why people at far-off locations become friends and what characteristics bring them together. Will understanding the other key interests or activities of users in on-line social networks explain why people become friends?
In this work we answer these questions by understanding the interest patterns of Facebook users and how similarity between user interests influences friendship. We study the influence of user similarity on the network topology. We use the terms network and social network interchangeably in our work to mean the set of all users and the links between them that represent the friendship between them. We also explain the patterns of characteristics associated with a user, i.e., a user’s profile entries.
We classify the similarity between users through the semantic links between the keywords used by them. Methods like Latent Semantic Indexing (Deerwester et al. 1990) have previously explored the semantics between digital data to explore the relationship between them. Analysis of user similarity through relations between their profile characteristics can also help in furthering the link prediction problem (Liben-Nowell and Kleinberg 2007) in social networks to correctly identify pairs who are likely to forge friendship in future. In the next section, we discuss the keyword usage patterns of Facebook users.
3 Keyword usage patterns
To measure the similarity between keywords and understand the usage scope of keywords as entered by different users in their on-line social network profiles, we analyzed Facebook profiles. We considered keywords that are available in the English dictionary. For this purpose, we used the entries present in the Interests field of a Facebook profile. Users list in this field the activities they are passionate about or the topics in which they are interested. For example, an analysis of the data shows that a large portion of users list Music among their Interests.
Table 1 Top ten keywords in the keyword set, with their occurrence frequencies and percentage values
Table 1 contains the top ten most frequently used keywords. The top ten keywords collectively make up approximately 19% of the entire keyword set. This shows that for an average of 6 keywords per user profile, a user has a high chance of having any of these top ranked keywords. This result also opens the question on how the rest of keywords are distributed among the profiles. To inspect the frequency of keywords in the set, we plot the relationship between the number of keywords found for a given keyword frequency in Fig. 1. The plots show that there are 866 keywords (approximately 66.56% of the number of keywords) that occur only once, i.e., they occur with a frequency of only one. Based on keyword usage frequencies, we see a randomly picked user profile has a high chance of listing a keyword that none of the other users in the dataset have in their profile. These two results show the wide distribution in usage of keywords by users in their profiles.
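The tally behind a frequency plot of this kind can be sketched as a count of counts. The sketch below is illustrative only: the `profiles` dictionary is invented toy data, not the Facebook dataset, and the function name is hypothetical. It counts how often each keyword is listed and then how many distinct keywords share each frequency.

```python
from collections import Counter

def frequency_of_frequencies(profiles):
    """Map each usage frequency to the number of distinct keywords with that frequency."""
    # Count every listed occurrence of every keyword across all profiles.
    keyword_counts = Counter(
        keyword for keywords in profiles.values() for keyword in keywords
    )
    # Tally how many distinct keywords share each frequency value.
    return Counter(keyword_counts.values())

# Toy profiles, invented for illustration only.
profiles = {
    "u1": ["music", "reading", "kayaking"],
    "u2": ["music", "movies"],
    "u3": ["music", "reading", "origami"],
}

freq_of_freq = frequency_of_frequencies(profiles)
singletons = freq_of_freq[1]                      # keywords used exactly once
share = singletons / sum(freq_of_freq.values())   # fraction of distinct keywords
print(dict(freq_of_freq), f"{share:.0%} of distinct keywords occur once")
```

On the toy data, three of the five distinct keywords occur only once, which mirrors in miniature the long tail described above.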
The trend line (solid continuous line) in Fig. 1 shows a steep drop in the number of distinct keywords as the keyword frequency increases. The distribution follows a power law: many keywords occur rarely and only a few occur often. The distribution is also consistent with similar results on tag distribution over web applications (Xu et al. 2006).
We further substantiate this result by aggregating keywords into four categories based on their frequencies. Keywords that occur more than 45 times are put into the ‘High Freq.’ category. Keywords occurring more than 25 times (but not more than 45) are put into ‘Medium Freq.’, and those occurring more than 5 times (but not more than 25) into ‘Low Freq.’. The rest of the keywords, i.e., those occurring less than 5 times, are put into ‘Very Low Freq.’. These results are plotted in Fig. 2.
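The four frequency categories can be expressed as a simple threshold function. This is a minimal sketch under one stated assumption: the text does not say where a keyword occurring exactly 5 times belongs, and the code places it in the lowest category. The function name is hypothetical.

```python
def frequency_category(count):
    """Assign a keyword to a frequency category from its occurrence count."""
    if count > 45:
        return "High Freq."
    if count > 25:
        return "Medium Freq."
    if count > 5:
        return "Low Freq."
    # Assumption: a count of exactly 5 is not assigned by the text;
    # it is placed in the lowest category here.
    return "Very Low Freq."

for count in (60, 30, 10, 1):
    print(count, frequency_category(count))
```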
We observe here that only about 1.15% of the keywords belong to the high frequency category, while more than 87% of the entire dataset comes under the very low frequency category. With such a wide distribution of keywords across user profiles, analyzing the similarity between two user profiles based on exact keyword matches leads to inconsistent and inconclusive results. The key questions now are: how can we aggregate different keywords based on their usage patterns to understand similarity between users? Can models be developed to match keyword pairs when they have semantic relationships? How can we explore the hidden relations and categorize them? For instance, from the previous example, if we can build models to understand the relationship between ‘soccer’ and ‘football’, we can analyze more deterministically the influence of homophily between Bob and his friends. In the next section, we introduce the ‘Forest Model’ to categorize and aggregate keywords effectively to understand the similarity between on-line social network users.
4 Forest model
In this section, we first describe the ‘forest model’ to categorize keywords. The model helps to define the data structure to utilize the underlying relationship amongst keywords. Second, we describe the ‘forest generation’ process. Here, we also describe the heuristics we define to analyze the similarity between users in later sections. In the third subsection, we analyze the entire keyword set and present results of our evaluation.
4.1 Forest model
How do we relate two keywords? How do we keep two keywords separated when they cannot be related? Our goal here is to find a model that can clearly distinguish between related and unrelated keywords. We aim for a simple and intuitive model that helps us achieve this.
What is the underlying hidden relationship between any pair of keywords? Intuitively, keywords can share the same source of origin. This way of relating keywords is based on their source of origin and development. Linguists term this characteristic etymology. For instance, in a language like English, many words have a Latin or Greek root associated with them. Wordinfo⁴ lists 61,362 English words that have either Latin or Greek roots. For example, the words ‘equine’ (horse), ‘equestrian’ (horse rider) and ‘equestrienne’ (female horse rider) can be derived from the Latin root ‘equus’, meaning a horse.
Alternatively, keywords can be said to be related when they are semantically linked, e.g., when they share the same meaning. For example, keywords like ‘football’ and ‘soccer’ are related because they are both a type of sport. These keywords can also be related to the keyword ‘sports’ because of the relationships between their meanings. Thus, continuing Bob’s example, when we look at Bob’s interest in ‘football’ and his new friend’s interest in ‘soccer’, we can say that the two profiles match each other from the fact that ‘soccer’ is a hyponym of ‘football’. Thus, aided by relationships between keywords, we can match user profiles and analyze similarity between users, the effects of homophily, and why friendship links are established.
Once we establish a relation between two keywords, a key requirement of the model is that it must keep unrelated keywords separated. This means that while ‘football’ and ‘soccer’ are related through the model, keywords like ‘soccer’ and ‘equine’ from the previous examples are kept separated.
Next, we describe the model. Each keyword is considered as a node. Nodes are connected when relations exist between the keywords. These nodes are placed in a hierarchical order such that, when a keyword is derived from another keyword, the hierarchy helps in defining the relation between the keywords. The hierarchy thus gives the ability to detect distances and dissimilarity between keywords, and prevents the homogeneity between nodes that can arise from the use of a flat data structure. Hierarchies constructed in this way to define the relationship between keywords lead to the definition of ‘Trees’. To keep unrelated keywords separate from each other, multiple trees are defined. Each such tree contains a set of keywords that are related to each other in the tree but are unrelated to any other keyword in any of the other trees. Formally, let a forest F be declared as a data structure consisting of t trees, (T1, T2,...,Tt).
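One possible in-memory representation of a forest F = (T1, T2,...,Tt) is sketched below. The class names, the child-to-parent dictionary layout and the example trees are all assumptions made for illustration; the source only specifies that keywords are nodes arranged hierarchically in separate trees.

```python
from dataclasses import dataclass, field

@dataclass
class Tree:
    """One categorization tree: a root keyword plus child -> parent links."""
    root: str
    parent: dict = field(default_factory=dict)

    def add(self, keyword, parent):
        # A new keyword may only hang under a keyword already in this tree.
        if parent != self.root and parent not in self.parent:
            raise ValueError(f"unknown parent {parent!r}")
        self.parent[keyword] = parent

    def __contains__(self, keyword):
        return keyword == self.root or keyword in self.parent

@dataclass
class Forest:
    """A forest made of separate trees; unrelated keywords live in different trees."""
    trees: list = field(default_factory=list)

    def related(self, a, b):
        """Two keywords are related if some tree contains both of them."""
        return any(a in tree and b in tree for tree in self.trees)

# Hypothetical example trees built from keywords mentioned in the text.
sports = Tree("sports")
sports.add("football", "sports")
sports.add("soccer", "football")
horses = Tree("equus")
horses.add("equine", "equus")

forest = Forest([sports, horses])
print(forest.related("football", "soccer"))  # True
print(forest.related("soccer", "equine"))    # False
```

Storing only child-to-parent links keeps each tree compact and makes it easy to walk upward from a keyword, which is the direction needed when looking for a common ancestor.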
4.2 Forest generation
We used the underlying semantic relationship between keywords to build the ‘forest’. We used WordNet (Fellbaum 1998) as the database of English words to build the forest structure. We describe the features of WordNet next, together with the heuristics we used during our evaluation process in Sect. 6.
Base: Here, we let the tree be composed of only the initial keyword. Thus, the tree is not allowed to grow its sub-tree. This heuristic thus constitutes the boundary condition where keywords match only if they are exactly similar to each other.
HM: In this heuristic, we grow the tree using the keyword’s holonyms and meronyms. Consequently, we term this heuristic ‘HM’. The ‘Holonyms’ ontology refers to a word that names the whole of which a given word is a part. For example, ‘hat’ is a holonym of ‘brim’. The term ‘Meronyms’ refers to the reverse part/whole relationship. For example, ‘paper’ is a meronym of ‘book’, since paper is a part of a book. We also use the ‘nominalizations’ ontology of WordNet to obtain the set of nominalized terms for all senses of the keyword, i.e., the use of a verb or an adjective as a noun. For example, the WordNet ontology returns the set of keywords ‘happiness, felicity’ for the keyword ‘happy’ and the set ‘happy’ for the keyword ‘happiness’. Thus, the depth of the tree is 2. The root word is the keyword itself and the rest of the terms returned by the WordNet ontologies form the sub-tree.
SS: In this heuristic, we grow the tree using the keyword’s ‘similars’ and ‘synonyms’ ontologies available from WordNet, and thus we term the heuristic ‘SS’. The keywords available from WordNet form the subtree, and the depth of the tree thus formed is also 2. The WordNet ontology ‘Similars’ returns a similar-to list for the given keyword, e.g., it returns the set ‘blessed, blissful, bright, golden, halcyon, prosperous’ for the keyword ‘happy’. These related keywords are obtained only for keywords that are adjectives. In the ‘Synonyms’ ontology, words that have similar meanings are obtained, e.g., ‘glad’ for the keyword ‘happy’.
All: In this heuristic, we use ‘all’ the ontologies present in WordNet to obtain a list of all the related keywords available for a given keyword. The tree depth is 2 and the subtree is formed by keywords that are available using the ‘Nominalizations’, ‘Holonyms’, ‘Meronyms’, ‘Synonyms’, ‘Antonyms’, ‘Similars’, ‘Hypernyms’, ‘Hyponyms’ and ‘Derived Terms’ ontologies. Hypernymy refers to a hierarchical relationship between words. For example, ‘furniture’ is a hypernym of ‘chair’ since every chair is a piece of furniture (but not vice versa). Hyponymy is the opposite of hypernymy. ‘Dog’ is a hyponym of ‘canine’ since every dog is a canine. The ‘Derived Terms’ ontology applies to adverbs and returns derived terms for all senses of a keyword. For example, the set of keywords ‘jubilant, blithe, gay, mirthful, merry, happy’ is returned for the keyword ‘happily’. Thus, this heuristic makes for a boundary case where all related keywords are used to build the tree for evaluation purposes.
The motivation to use multiple heuristics as defined above comes from the observation that keywords can have more than one meaning or context, e.g., according to WordWeb,⁶ the word ‘stern’ could mean ‘severe’ as an adjective and ‘rear part of a ship’ as a noun. Generating the forest with different heuristics helps us capture different scenarios where a keyword may be present in different trees due to varied usage and contextual scopes. Thus, the above-mentioned heuristics not only capture different meanings of a keyword but also help capture the similarity between keywords when they are used in different contexts or belong to different syntactic categories.
To build a ‘forest’, we adopted a more ad hoc approach, allowing each keyword of a keyword pair to build its own tree. For each of the above heuristics, related keywords were pulled from WordNet and aggregated together to form the individual tree. This process was recursively repeated to the desired tree depth. The initial keyword was placed as the root of the tree. Thus, for every keyword pair, two trees were formed. These two trees were checked for any common keyword. If a match was found, the keyword pair was declared as related. Otherwise, the keyword pair was termed not similar. In the next section, we analyze the effectiveness of the ‘forest model’ in matching keywords.
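The pairwise procedure described above (grow one tree per keyword, then check the two trees for a common keyword) can be sketched as follows. To keep the example self-contained, a small hand-written dictionary `RELATED` stands in for the WordNet lookups; its contents and the function names are hypothetical, and the loop is an iterative equivalent of the recursive expansion.

```python
def build_tree(keyword, related_terms, depth=2):
    """Return the set of keywords in the tree rooted at `keyword`.

    A tree of depth 1 holds only the root (the 'Base' case); each further
    level adds the terms related to the keywords of the previous level.
    """
    tree = {keyword}
    frontier = {keyword}
    for _ in range(depth - 1):
        next_level = set()
        for term in frontier:
            next_level.update(related_terms.get(term, ()))
        frontier = next_level - tree   # only expand keywords not seen before
        tree |= frontier
    return tree

def keywords_match(a, b, related_terms, depth=2):
    """Declare two keywords related if their trees share at least one keyword."""
    tree_a = build_tree(a, related_terms, depth)
    tree_b = build_tree(b, related_terms, depth)
    return not tree_a.isdisjoint(tree_b)

# Hypothetical stand-in for the related terms a lexical database would return.
RELATED = {
    "football": ["soccer", "sport"],
    "soccer": ["football", "sport"],
    "equine": ["horse"],
}

print(keywords_match("football", "soccer", RELATED))           # True
print(keywords_match("football", "soccer", RELATED, depth=1))  # False
print(keywords_match("soccer", "equine", RELATED))             # False
```

With `depth=1` the function reduces to exact matching, which corresponds to the ‘Base’ boundary condition; swapping in a different lookup table changes the heuristic without touching the matching logic.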
4.3 Analyzing the user keyword set
In this section, we will analyze the effectiveness of the ‘forest model’ in computing the similarity between user keywords. We will use a set of examples to demonstrate the advantages of the ‘forest model’.
Sample users with keywords
User A: Wakeboarding, softball, fishing, jesus, god, learning, backpacking
User B: Running, hiking, hurricanes, tornadoes
User C: Basketball, dancing, shopping, pictures
User Z: Running, soccer, tennis, football, hiking, knitting, art, tea, lime, pie
Number of matches to keywords of user Z (columns: number of pairs; number of matches and fraction of matches (%) for each of the two heuristics compared)
It can be seen that for the ‘Base’ case, most attempts to match the keywords of the two users fail. The similarity between B and Z is non-zero only because B and Z have 2 keywords in common across their profiles. Under the ‘All’ heuristic, the number of matching keyword pairs between B and Z rises to 25, so the fraction of keyword pairs that are similar to each other stands at 62.5%. This is because their profiles match for keywords that can be derived from ‘athletic sports’ (e.g., pairs formed from running, soccer, tennis, etc.).
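The arithmetic behind these figures can be checked directly. The sketch assumes that user B lists the four keywords beginning with ‘running’ and user Z the ten-keyword list, which is the assignment consistent with the counts quoted in the text; the count of 25 matches is taken from the text, not recomputed.

```python
from itertools import product

# Keyword lists from the sample-user table (assignment to B and Z is inferred).
B = ["running", "hiking", "hurricanes", "tornadoes"]
Z = ["running", "soccer", "tennis", "football", "hiking",
     "knitting", "art", "tea", "lime", "pie"]

pairs = list(product(B, Z))                               # all keyword pairs
base_matches = [(kb, kz) for kb, kz in pairs if kb == kz]  # exact matches only

print(len(pairs), len(base_matches))                 # 40 2
print(f"{len(base_matches) / len(pairs):.1%}")       # 5.0%

all_matches = 25  # number of matches reported in the text for the 'All' heuristic
print(f"{all_matches / len(pairs):.1%}")             # 62.5%
```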
It can also be seen that s(Z, C) is maximum even though Z and C do not have more keyword pairs than Z and A or Z and B. This is because both are interested in arts (C has ‘pictures’ and Z has ‘art’), implying that Z has more common interests with C than with A or B. A and Z are least similar, as A is mostly interested in water sports (and not athletic sports, as Z is) and does not share any other common interest with Z, even though they share a large number of keyword pairs. This shows the effectiveness of characterizing keywords using semantic relationships, and that the content of keywords becomes more important than their number for finding similarity values. We also observe the effectiveness of the ‘forest model’, built using the semantic relations of keywords, in measuring the similarity of users’ keywords. Next, we describe similarity functions based upon the ‘forest model’ to measure the similarity between user profiles and understand the effect of homophily in social networks.
5 User similarity
With keywords present at different hierarchies in a tree, how do we measure the similarity between keywords and correspondingly the similarity between users? How do we differentiate the similarity between two users when all their keyword pairs belong to the same tree but the keywords are positioned at various different heights? In this section, we describe the formulations to answer some of these questions. First, we quantify the distance between two keywords in the ‘forest’. Afterwards we describe two different similarity functions to quantify the similarity between users.
5.1 Keyword distance
Now we define the notion of distance between keywords based on the forest structure. Let there be t trees (T1, T2,...,Tt) in the forest F. Consider two keywords Ka and Kb such that both of them belong to the same tree. Let LCA be the least common ancestor of Ka and Kb. Also, assume d(LCA, Ka) to be the depth of Ka from the LCA.
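For two keywords K1 and K2 that share a tree, let d_LCA be the larger of their two depths measured from the LCA, and let the distance D(K1, K2) be the smallest such value:
$$d_{LCA}(K_1, K_2) = \max\{\, d(LCA, K_1),\; d(LCA, K_2) \,\}, \qquad D(K_1, K_2) = \min_{LCA}\; d_{LCA}(K_1, K_2).$$
If K1 and K2 do not have any relation, then D(K1, K2) is ∞. The minimum of all d_LCA values is used to account for multiple occurrences of keywords in F.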
Thus, from Fig. 3, if Ka = soccer and Kb = racing, then LCA = sports, d(LCA, Ka) = 2, d(LCA, Kb) = 1 and D(Ka, Kb) = 2. When Ka = soccer and Kb = equine, each of the keywords is present in a different tree, so no LCA exists and D(Ka, Kb) = ∞.
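The worked example can be reproduced with a small sketch of the keyword distance, taking the larger of the two depths below the least common ancestor and the minimum over trees. Fig. 3 is not reproduced here, so the tree layout is an assumption chosen only to match the depths quoted in the text (‘soccer’ two levels below ‘sports’, ‘racing’ one level below); the function names are hypothetical.

```python
import math

def ancestors(keyword, parent):
    """Map the keyword and each of its ancestors to its height above the keyword."""
    depths = {}
    node, depth = keyword, 0
    while node is not None:
        depths[node] = depth
        node = parent.get(node)   # roots have no entry, so the walk stops there
        depth += 1
    return depths

def tree_distance(ka, kb, parent):
    """max(d(LCA, ka), d(LCA, kb)) within one tree, or infinity if unrelated."""
    up_a, up_b = ancestors(ka, parent), ancestors(kb, parent)
    common = set(up_a) & set(up_b)
    if not common:
        return math.inf
    lca = min(common, key=lambda node: up_a[node])  # closest common ancestor
    return max(up_a[lca], up_b[lca])

def keyword_distance(ka, kb, forest):
    """Minimum over all trees, which allows keywords that occur in several trees."""
    return min((tree_distance(ka, kb, parent) for parent in forest),
               default=math.inf)

# Assumed trees (child -> parent), consistent with the depths in the text.
sports_tree = {"football": "sports", "soccer": "football", "racing": "sports"}
horse_tree = {"equine": "equus"}
forest = [sports_tree, horse_tree]

print(keyword_distance("soccer", "racing", forest))  # 2
print(keyword_distance("soccer", "equine", forest))  # inf
```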
The separation of keywords into different trees, and the definition of the distance between keywords as ∞ when they do not belong together in a tree, make the model robust enough to handle the aggregation of keywords and yet clearly separate keywords when they do not belong together. The hierarchy inside the trees helps determine the distance when the keywords belong to a single tree. This is an advantage over possible models where all keywords are put together in a single hierarchy, for example by generalizing the model of hierarchy presented in Kleinberg (2001) to relate keywords.
It is also important to note that in the definition of D(K1, K2), when keywords are aggregated together, the distance between keywords is captured from the generic point where an aggregation is possible. For example, in Fig. 3, soccer and racing aggregate at sports, and thus the distance between the keywords is defined as the farthest distance from this generic point. An alternate definition, where the distance between keywords is the sum of the distances of each keyword from the generic point (i.e., D(K1, K2) = d(LCA, K1) + d(LCA, K2)), fails to capture the importance of the distance from the LCA itself. Based on the definition of distance between keywords, we next describe the formulations to define the similarity between a pair of users.
5.2 Similarity functions
Assume that a social network user v has Nv keywords and let Kiv (1 ≤ i ≤ Nv) be his/her keywords. Consider two users u and v on the network. Let k(u, v) = Nu × Nv be the total number of keyword pairs that they have. Also, let n(u, v) be the number of keyword pairs (Kiu, Kjv) such that Kiu and Kjv belong to the same tree in F. How do we measure the similarity between u and v? How will the similarity between u and v vary when the keyword pairs belong to the same tree, compared to the similarity between u and v when keyword pairs also belong to the same tree but at different hierarchical levels? We define two similarity functions to address these questions. We describe these functions next.
Weak similarity: This function defines the similarity between users when keyword pairs belong to the same tree. Thus, for two users, u, v with keywords K1 and K2 respectively, whenever D(K1, K2) ≠ ∞, n(u, v) is incremented by 1. Formally, it is defined as follows.
The position of the keywords inside the tree is not taken into account, i.e., keywords with distinct distance values will contribute equally towards the weak similarity. The word ‘weak’ is used to define the function because conceptually the definition ignores the position of the keywords and only tries to capture the fact whether two keywords belong to the same tree. In order to measure similarity between users with due consideration to position of the keywords we next define ‘strong similarity’.
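A minimal sketch of a weak-similarity computation is given below. It counts n(u, v), the keyword pairs with finite distance, and divides by k(u, v) = Nu × Nv. The normalization by k(u, v) is an assumption made for this illustration and may differ from the formal definition; the toy distance table is also invented.

```python
import math

def weak_similarity(keywords_u, keywords_v, distance):
    """Fraction of keyword pairs whose two keywords share a tree.

    Dividing n(u, v) by k(u, v) is an assumed normalization for illustration.
    """
    k = len(keywords_u) * len(keywords_v)        # k(u, v) = Nu x Nv
    if k == 0:
        return 0.0
    n = sum(1 for ku in keywords_u for kv in keywords_v
            if distance(ku, kv) != math.inf)     # n(u, v)
    return n / k

# Hypothetical keyword distances; any pair not listed is unrelated.
DISTANCES = {
    frozenset(["soccer", "football"]): 1,
    frozenset(["soccer", "tennis"]): 2,
}

def toy_distance(a, b):
    if a == b:
        return 0
    return DISTANCES.get(frozenset([a, b]), math.inf)

print(weak_similarity(["soccer", "art"], ["football", "tennis"], toy_distance))  # 0.5
```

Both related pairs count equally here even though one pair is at distance 1 and the other at distance 2, which is exactly the position-blindness that the word ‘weak’ refers to.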
Strong similarity: We utilize the definition of keyword distance to define this function. We use an exponential function for the definition because it has finite values at the boundary conditions of D(Kiu, Kjv) (as $e^{-0} = 1$ and $e^{-\infty} = 0$ for D(Kiu, Kjv) = 0 and D(Kiu, Kjv) = ∞, respectively). Formally, it is defined as follows.
The function S is called ‘strong similarity’, as it considers the relative position of the keywords in the tree. It may happen that the strong similarity is numerically smaller than the weak similarity, but it is still a relatively stronger definition as it captures more information. Using this definition, keywords at a greater distance contribute less towards the similarity value. The value of S(u, v) decreases as the distance between the keywords increases, implying that u and v share fewer interests or attributes.
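The distance-weighted idea can be sketched by letting each keyword pair contribute exp(−D) instead of a flat count. As with the weak-similarity case, averaging over the k(u, v) = Nu × Nv pairs is an assumption made for this illustration rather than the paper's formal definition, and the toy distance table is invented.

```python
import math

def strong_similarity(keywords_u, keywords_v, distance):
    """Average of exp(-D) over all keyword pairs (assumed normalization)."""
    k = len(keywords_u) * len(keywords_v)   # k(u, v) = Nu x Nv
    if k == 0:
        return 0.0
    # exp(-0) = 1 for identical keywords and exp(-inf) = 0 for unrelated ones.
    total = sum(math.exp(-distance(ku, kv))
                for ku in keywords_u for kv in keywords_v)
    return total / k

# Hypothetical keyword distances; any pair not listed is unrelated.
DISTANCES = {
    frozenset(["soccer", "football"]): 1,
    frozenset(["soccer", "tennis"]): 2,
}

def toy_distance(a, b):
    if a == b:
        return 0
    return DISTANCES.get(frozenset([a, b]), math.inf)

value = strong_similarity(["soccer", "art"], ["football", "tennis"], toy_distance)
print(round(value, 4))
```

On this toy input two of the four pairs are related, but because they sit at distances 1 and 2 they contribute less than a full match each, so the result is smaller than the fraction of related pairs.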
User similarity to user Z
6 Results and discussion
In this section, we describe the results of analyzing the Facebook profiles for similarity according to the ‘forest model’ and the similarity functions. First, we present results from the analysis on the number of keyword pairs the forest model was successful in matching. Second, we present results describing the variations in number of matches between keyword pairs and the variations in weak similarity and strong similarity for different number of keyword pairs between two users. Finally, we present results showing the variation in weak similarity and strong similarity based on different node degree of users and their individual number of keywords.
User pairs across the available network data are divided into the following three categories.
Friend Pairs: The two users are direct friends in the network.
Friend2 Pairs: The two users share a common friend, i.e., they are separated in the network by one intermediate user.
All Pairs: In this category, we consider all user pairs irrespective of the topological distance between them in the network. The ‘All Pairs’ category helps us to compare entries of more than half a million user pairs.
Now, we describe the results obtained by comparing keywords of user pairs belonging to each of the categories. We compare the keywords according to each of the heuristics defined in Sect. 4.2.
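The three pair categories can be derived from a friendship graph with a breadth-first search. The sketch below is one possible reading, under the assumption that pairs are classified by shortest-path length (1 for direct friends, 2 for friends of friends); the toy graph and the function names are hypothetical.

```python
from collections import deque

def hop_distances(graph, source):
    """Breadth-first search: shortest hop count from `source` to reachable users."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def categorize_pairs(graph):
    """Split unordered user pairs into the three categories used in the analysis."""
    categories = {"Friend Pairs": [], "Friend2 Pairs": [], "All Pairs": []}
    users = sorted(graph)
    for i, u in enumerate(users):
        dist = hop_distances(graph, u)
        for v in users[i + 1:]:
            categories["All Pairs"].append((u, v))
            if dist.get(v) == 1:
                categories["Friend Pairs"].append((u, v))
            elif dist.get(v) == 2:
                categories["Friend2 Pairs"].append((u, v))
    return categories

# Toy friendship graph (a path a - b - c - d), invented for illustration.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
result = categorize_pairs(graph)
print({name: len(pairs) for name, pairs in result.items()})
# {'Friend Pairs': 3, 'Friend2 Pairs': 2, 'All Pairs': 6}
```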
Third, it is also interesting to see how the trend lines for the different categories of pairs behave under different heuristics. For example, in ‘Base’ and ‘HM’, the similarity of ‘All Pairs’ crosses the trend line for the similarity of ‘Friend Pairs’ as the count of keyword pairs increases. This trend reverses in the latter two heuristics, where the gap in similarity values between ‘Friend Pairs’ and other pairs keeps increasing as the number of keyword pairs increases. We can see from here how the relations between the keywords play a role in determining the similarity between any user pair, and how a model like the ‘forest model’ is crucial to homophily analysis in social networks. Fourth, it is also interesting to note how the similarity values of ‘Friend2 Pairs’ are always so close to the values of ‘All Pairs’ for each of the heuristics. We conclude from these observations that the similarity between friends of friends is almost equal to the similarity between any pair of users, i.e., topological distance between users does not significantly affect the similarity between users after the first hop. In other words, friend pairs are relatively high in similarity, but beyond that any user is about as similar to any other user, irrespective of the topological distance.
7 Concluding remarks
In this paper, we studied the similarity between users in an online social network. We measured user similarity by evaluating the similarity between user keywords. First, we studied the distribution of user keywords in online social networks. Next, we defined a ‘forest model’ that links related keywords based on their semantic relations. We showed how the model is able to quantify the similarity between seemingly unrelated user profile information available in social networks. Based on the model, we defined two different types of functions to quantify the similarity between users. Finally, we evaluated a dataset containing Facebook user profiles for similarity between the users, using the ‘forest model’ and the similarity functions.
We saw that user keywords can be aggregated effectively, based on the heuristic used to generate the ‘forest’, to evaluate user similarity. Based on our evaluations, we conclude that direct friends are more similar than any other user pair in the social network. Beyond direct friends, the similarity between users remains approximately the same, irrespective of the topological distance between them. Finally, we also observed that as the node degree and the number of keywords for a user increase, the average similarity a node has with its friends decreases.
Future research would augment the social network model based upon user similarity functions that we proposed in our earlier work (Bhattacharyya et al. 2009). The motivation is to generate an online social network model based upon a user’s similarity with other users and establish links when certain levels of similarity are observed. Another direction is to develop social search query models by comparing the similarity among friends.
Facebook is available at http://www.facebook.com.
Orkut is available at http://www.orkut.com.
Linkedin is available at http://www.linkedin.com/.
Wordinfo is available at http://www.wordinfo.info and is copyrighted by Senior Scribe Publications.
WordNet is available at http://wordnet.princeton.edu/.
WordWeb Software is available at http://www.wordwebonline.com.
We are thankful to Matthew Spear, who provided us with the Facebook data. We also thank Lerone Banks for his help during the manuscript preparation. We thank the anonymous reviewers for their insightful comments. This work was supported by the National Science Foundation FIND (Future Internet Design) program under Grant No. 0832202, MURI under ARO (Army Research Office) and Network Science CTA under ARL (Army Research Laboratory).
This article is distributed under the terms of the Creative Commons Attribution Noncommercial License which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and source are credited.
- Adamic LA, Buyukkokten O, Adar E (2003) A social network caught in the web. First Monday 8(6)
- Banks L, Ye S, Huang Y, Wu SF (2007) Davis social links: integrating social networks with internet routing. In: LSAD ’07: Proceedings of the 2007 workshop on large scale attack defense. ACM Press, New York, pp 121–128
- Banks L, Bhattacharyya P, Spear M, Wu SF (2009) Davis social links: leveraging social networks for future internet communication. Ninth annual international symposium on applications and the internet, pp 165–168
- Bhattacharyya P, Garg A, Wu SF (2009) Social network model based on keyword categorization. International conference on advances in social network analysis and mining (ASONAM'09), pp 170–175
- Crandall D, Cosley D, Huttenlocher D, Kleinberg J, Suri S (2008) Feedback effects between similarity and social influence in online communities. In: KDD ’08: Proceeding of the 14th ACM SIGKDD international conference on knowledge discovery and data mining. ACM Press, New York, pp 160–168
- Fellbaum C (1998) Wordnet: an electronic lexical database. Bradford Books, Bradford
- Howe DC (2009) Rita wordnet Java based API to access Wordnet. http://www.rednoise.org/rita. Accessed 26 October
- Kleinberg J (2000) The small-world phenomenon: an algorithm perspective. In: STOC ’00: Proceedings of the 32nd annual ACM symposium on theory of computing
- Kleinberg J (2001) Small-world phenomena and the dynamics of information. In: Advances in neural information processing systems. MIT Press, Cambridge, pp 431–438
- Milgram S (1967) The small world problem. Psychol Today 61:60–67
- Sandberg O (2007) The structure and dynamics of navigable networks. PhD thesis, Chalmers University
- Şimşek Ö, Jensen D (2005) Decentralized search in networks using homophily and degree disparity. In: Nineteenth international joint conference on artificial intelligence (IJCAI 2005)
- Spear M, Lu X, Matloff NS, Wu SF (2009) Inter-profile similarity (ips): a method for semantic analysis of online social networks. In: Complex ’09: Proceedings of the first international conference on complex sciences: theory and applications
- Xu Z, Fu Y, Mao J, Su D (2006) Towards the semantic web: collaborative tag suggestions. In: WWW’06: Proceedings of the collaborative web tagging workshop