1 Introduction

Nowadays, as social networks expand, more and more people join them. In these digital worlds, users freely present themselves, share information about their favorites and passions, or express their personal opinions on economic, social, or cultural issues through various activities on the social network, such as posting entries, sharing video clips, images, or news they read, leaving comments, or liking entries or the comments of others. Consequently, a huge amount of data is created on social networks. This data attracts many researchers and businesses to mine and exploit it. This tendency also brings new challenges to researchers: do users with the same profile or interests show the same behavior? And vice versa, do users with the same behavior have an interest in the same things? One of the basic issues behind these challenges is the problem of estimating the similarity among users on these social networks based on their profile, interests, and behavior.

The problem of detecting the similarity or difference between users is based not only on the user profile on the social network, but also on data about user behavior such as posting entries, commenting, and liking. This problem has attracted many researchers. For instance, Raad et al. [16] and Peled et al. [15] proposed models to measure the similarity between user profiles. Anderson et al. [1] calculated the similarity between user characteristics. Liu et al. [8] estimated the similarity among preferences of user behavior. Liu et al. [9] and Chen et al. [4] measured the similarity among user mobility behaviors. Xu et al. [23] analyzed user posting behavior on a popular social media website. Erlandsson et al. [5] proposed association learning to detect relationships between users. Benevenuto et al. [2] presented an analysis of user workloads in online social networks. Singh et al. [18] formulated a metric based on the common words used in social networks to measure user similarity in textual posting. Zou et al. [26] mined individual behavior patterns and studied user similarity. Sun et al. [19] proposed a mapping method that integrates text and structure information for similarity computation. Guo et al. [6] developed a model to estimate continuous tie strength between users for friend recommendation with heterogeneous data from a social media community. Nguyen et al. [11] aimed to understand the strategies users employ to make retweet decisions. Liu and Terzi [10] approached the privacy issues raised in online social networks from the individual user’s viewpoint: they proposed a framework to compute the privacy score of a user. Tang et al. [20] adopted a “microeconomics” approach to model and predict individual retweet behavior. Xu et al. [22] introduced several methods to identify online communities with similar sentiments in online social networks. Zhao et al. [25] proposed to separately model users’ topical interests that come from these various behavioral signals to construct better user profiles. Vedula et al. [21] detected pairwise and global trust relations between users in the context of emergent real-world crisis scenarios. Jamali and Ester [7] explored social rating networks, which record not only social relations but also user ratings for items. Bhattacharyya et al. [3] studied the relationship between the semantic similarity of user profile entries and the social network topology. In the model of Zhao et al. [24], two social factors, interpersonal rating behavior similarity and interpersonal interest similarity, are fused into a consolidated personalized recommendation model based on probabilistic matrix factorization.

Most of these works try to estimate the similarity among users based on the user profile, user interests or favorites, or user relationships on the social network. However, few works estimate the similarity among social network users based on their activities on the social network.

In line with our previous works [13, 14], this paper introduces a model for measuring the similarity between users based on their behavior on a social network. In this model, the similarity between users is estimated from the similarity of their behaviors such as posting an entry or sharing an existing entry, liking an entry or a comment, commenting on a post, and joining a group or a community. The similarity of these behaviors is in turn estimated from the content of the entries that users post, like, or comment on. The similarity among entries is estimated based on the content, tags, category, sentiment, and emotion included in these entries [14]. The model is then evaluated with a dataset of users collected from Twitter. The results show that the model correctly estimates the similarity among users in the majority of cases.

The paper is organized as follows: Sect. 2 presents the similarity model. Section 3 presents experiments that evaluate the proposed model with empirical data. Section 4 concludes and discusses perspectives.

2 A model for estimation of similarity among social network users based on their behaviors

2.1 Notations

Without loss of generality, we assume that:

  • A social network is a 4-tuple \( N=<U,G,E,B>\), in which:

    • \(U=\{u_1,u_2, \ldots , u_m\}\) is a set of users,

    • \(G=\{g_1,g_2, \ldots , g_n\}\) is a set of communities or groups,

    • \(E=\{e_1,e_2, \ldots , e_k\}\) is a set of entries of users on the social network N,

    • \(B= \{b_1,b_2,\ldots ,b_l\}\) is a set of behaviors of each user \(u\in U\) on a group \(g \in G\) or on an entry \(e \in E\) of the social network N.

  • A user could post a status, an image, or a video clip, which we call an entry e. An entry e could be viewed by a set of users U. Within an entry, each user could like the entry, comment on it, or share it on their homepage.

  • Within an entry, each user u could like a set of comments of that entry. A user could like a page or join a group; in this case, the user is a member of a community or group on the social network. A user could post an entry in a community or a group, like an entry, comment on an entry, like a set of comments in an entry of a community, or share an entry.

2.1.1 Entry

Generally, an entry on a social network can be a video, an image, a text, or a combination of these. However, in this paper, we only consider entries that contain textual content; entries without text, such as pure videos or images, are ignored. Therefore, estimating the similarity of users based on their entries reduces to estimating the similarity between texts.

On a social network, there is a set of users \( U=\left\{ u_1, u_2,\ldots ,u_m\right\} \). Each user \(u_{i}\) is characterized by a set of posted entries E and a set of behaviors B on the social network. Each user \(u_{i}\in U\) has a set of entries \( E_{i}=\left\{ e^i_{1}, e^i_{2},\ldots ,e^i_{k}\right\} \) and each entry \(e_{j}\in E\) has a set of features: \(e_{j}=\left\{ f^j_{1}, f^j_{2},\ldots ,f^j_{p}\right\} \).

An entry could have several features, including explicit features such as the content, and implicit features such as tags, category, sentiment, and emotion. As the implicit features are not directly available in an entry, the model needs a step to extract them before estimating the similarity between entries.

This model considers five features of an entry:

  • Content of entry \(e^j\), noted as \(f^{j}_{\text {cont}}\): content is the whole text part in the entry itself. This is an explicit feature.

  • Tags of entry \(e^j\), noted as \( f^{j}_{\text {tags}}\): an entry could be tagged with a set of tags. Each tag is an independent word or expression. In some cases, tags are directly assigned by the user (explicit); in other cases, they are not explicitly tagged by the user (implicit).

  • Category of entry \(e^j\), noted as \(f^{j}_{\text {cate}}\): an entry could be assigned to a category. Each category is represented by an independent word or expression.

  • Sentiment of entry \(e^j\), noted as \(f^{j}_{\text {sent}}\): an entry could carry a sentiment of the user. A sentiment may be positive (agree), negative (disagree), or neutral.

  • Emotion of entry \(e^j\), noted as \(f^{j}_{\text {emot}}\): an entry could also carry some emotion of the user. Each emotion is represented by an independent word or expression.

As an entry is considered as a set of features and only its textual content is taken into account, the problem of estimating the similarity among entries can be treated as the computation of the similarity among texts or among sets of expressions.

2.1.2 Behavior

In this model, only five popular behaviors are considered: posting an entry, liking an entry, commenting on an entry, sharing an entry, and joining a group on the social network. We assume that a social network has a set of users \( U=\left\{ u_{1}, u_{2},\ldots ,u_{m}\right\} \). Each user \(u_{i}\in U\) posts a set of entries E and acts with a set of behaviors \(B_{i}=\left\{ b^i_{1}, b^i_{2},\ldots ,b^i_{l}\right\} \). Each behavior \(b_{l} \in B\) may have a set of features \(b_{l}=\left\{ f^l_{1}, f^l_{2},\ldots ,f^l_{q}\right\} \):

  • Post an entry, noted as \(b^{l}_{\text {post}}\): the user writes an entry on his/her homepage.

  • Like an entry or a comment, noted as \(b^{l}_{\text {like}}\): the user clicks on the like icon of an entry or a comment.

  • Comment on an entry, noted as \(b^{l}_{\text {comt}}\): the user writes some comments on an entry.

  • Share an entry, noted as \(b^{l}_{\text {shar}}\): the user shares an entry on his/her wall. The shared entry could belong to a different user of the same social network or of another social medium.

  • Join a group, noted as \(b^{l}_{\text {join}}\): the user joins a group or community. A group usually has a name, a description, and other characteristics.

2.1.3 Group or community

As a community or a group is described by its meta-data, the similarity between two communities or groups is, thus, considered as the similarity between two multi-feature objects (Nguyen and Nguyen [12]). Each piece of meta-data of a community or group is considered as a feature of that community or group. In this model, we assume that a social network has a set of users \( U=\left\{ u_{1}, u_{2},\ldots ,u_{m}\right\} \). Each user \(u_{i}\in U\) can join a set of communities or groups \( g_v\in G\) with features \(g_v=\left\{ g^v_{1},g^v_{2},\ldots ,g^v_{w}\right\} \):

  • Name of the community, noted as \(g^{v}_{\text {name}}\): it could be a title or a short sentence. After eliminating all stop words in the title, this feature becomes a set of words to be compared with that of other communities. So, estimating the similarity on the name of the community amounts to estimating the similarity between two sets of words.

  • Categories of the community, noted as \(g^{v}_{\text {camu}}\): on some social networks, each community is assigned to at least one category. Each category is an independent word (or independent expression). So, estimating the similarity on the categories of the community also amounts to estimating the similarity between two sets of expressions.

  • Description of the community, noted as \(g^{v}_{\text {desc}}\): on many social networks, each community is also provided with a short description, which is normally a short text. After eliminating all stop words in the text, this feature becomes a set of words to be compared with that of other communities. So, estimating the similarity on the description of the community also amounts to estimating the similarity between two sets of expressions.

As the comparison between two entries is treated as a comparison between their sets of features and only their textual values are considered, the comparison between two behaviors and between two communities or groups is also made by comparing only their textual values. Therefore, the problem of estimating the similarity among users based on behaviors becomes the computation of the similarity among texts or among sets of expressions.

2.2 General model

The general model is as follows:

  • Input: \(u_1, u_2 \in U\) with their two sets of entries \(E_1, E_2 \subseteq E\) and two sets of behaviors \(B_1, B_2 \subseteq B\).

  • Output: the estimated similarity between the two input users \(u_1, u_2 \in U\), called sim\((u_1,u_2)\). The model consists of four main steps:

    • Step 1: modeling entries E and behaviors B.

    • Step 2: extracting the value for implicit features of entries.

    • Step 3: estimating the similarity on each entry’s features and on each user’s behaviors.

    • Step 4: aggregating the similarity between two sets of entries \(E_1, E_2\) and between two sets of behaviors \( B_1, B_2\) of users \(u_1,u_2.\)

These steps will be described in detail in the next sections.

2.3 Determining the feature values of entries

2.3.1 Evaluating the values of implicit features

Let us consider an example of a status on Twitter: “Thank you @apple for Find My Mac - just located and wiped my stolen Air”. When we see this status, only the content is explicitly presented, which is the whole text of the status. However, we can quickly identify some other features of this status, such as the category (technology), tags (Apple, Mac), sentiment (neutral, i.e., neither agree nor disagree), and emotion (gratitude, joy). Features whose values are not explicitly presented in the entry, but can be extracted from inside the entry, are called implicit features. Our objective in this step is to extract the values of the implicit features of an entry.

In this model, we apply the following method [14] to extract the value of each of the four implicit features (we call it the method to classify texts into classes); a minimal Python sketch of this scoring is given after the steps below:

  • Step 1: Construct a set of labeled samples (texts), called training set, in which each text is assigned to a set of labels. The union of all labels of all texts is called the set of labels L.

  • Step 2: For each label \(l_i \in L\), create two sets of text samples:

    • \(T_{l_i}\) is the set of all texts which are labeled with \(l_i\).

    • \(T_{\lnot l_i}\) is the set of all texts which are not labeled with \(l_i\).

  • Step 3: For each text \(t_k \in T_{l_i}\) (or \(t_k\in T_{\lnot l_i}\)), calculate the label-oriented features as follows:

    • Split \(t_k\) into a set of n-grams or terms (stop words may be removed).

    • Take the union of all terms in all texts of the sets \(T_{l_i}\) and \(T_{\lnot l_i}\).

  • Step 4: Calculate the label-oriented term score of each term in the corresponding set for each label \(l_i\):

    $$\begin{aligned} s_{\text {LOT}}(x,l_i) =\displaystyle \frac{N^x_{l_i}}{N_{l_i}}\times \text {log}\left( \frac{N_{\lnot l_i}}{N^x_{\lnot l_i}}\right) - \displaystyle \frac{N^x_{\lnot l_i}}{N_{\lnot l_i}}\times \text {log}\left( \frac{N_{l_i}}{N^x_{l_i}}\right) , \end{aligned}$$
    (1)

    where \(N_{l}, N_{\lnot l}\) are the numbers of texts in the sets \(T_{l}\) and \(T_{\lnot l}\), respectively, and \(N^x_{l}, N^x_{\lnot l}\) are the numbers of texts in the sets \(T_{l}\) and \(T_{\lnot l}\), respectively, which contain the term x.

  • Step 5: For a new text t, the choice of label to assign to the text is as follows:

    • Split t into a set of n-grams or terms \(X = (x_1,x_2,\ldots x_n)\).

    • Calculate the term frequency for each term \(x_i\) in the text t: \(tf(x_i,t)\).

    • For each label \(l_i \in L\), calculate the label-oriented document score:

      $$\begin{aligned} s_{\text {LOD}}(t,l_i) =&\displaystyle \frac{1}{n_{t}} \times \displaystyle \sum _{x \in t}s_{\text {LOT}}(x,l_i) \times tf(x,t). \end{aligned}$$
      (2)
    • If \(s_{\text {LOD}}(t,l_i) > 0\):

      • In the multi-label problem where a text could be assigned to several labels, the text t will be labeled with \(l_i\).

      • In the single-label problem where a text could be assigned to only one label, it is necessary to calculate the label-oriented document scores of the text t for all labels \(l_i \in L\). The label whose label-oriented document score is the highest will be assigned to the text t.
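
To make the scoring concrete, the following is a minimal Python sketch of the label-oriented scoring in Eqs. (1) and (2) for the single-label case. It assumes simple whitespace tokenization and treats the cases where Eq. (1) is undefined (a term absent from one of the two text sets) as a zero contribution; the function and variable names are illustrative, not part of the original method description.

```python
import math
from collections import Counter

def tokenize(text):
    # Whitespace tokenization; stop-word removal and higher-order n-grams are omitted.
    return text.lower().split()

def lot_score(term, pos_texts, neg_texts):
    """Label-oriented term score, Eq. (1), for one label l_i.
    pos_texts: texts labeled with l_i; neg_texts: texts not labeled with l_i."""
    N_l, N_not = len(pos_texts), len(neg_texts)
    Nx_l = sum(1 for t in pos_texts if term in tokenize(t))
    Nx_not = sum(1 for t in neg_texts if term in tokenize(t))
    if min(N_l, N_not, Nx_l, Nx_not) == 0:
        return 0.0  # Assumption: undefined ratios contribute nothing.
    return (Nx_l / N_l) * math.log(N_not / Nx_not) \
         - (Nx_not / N_not) * math.log(N_l / Nx_l)

def lod_score(text, pos_texts, neg_texts):
    """Label-oriented document score, Eq. (2); the sum runs over the distinct
    terms of the text, each weighted by its term frequency."""
    terms = tokenize(text)
    tf = Counter(terms)
    return sum(lot_score(x, pos_texts, neg_texts) * tf[x]
               for x in tf) / len(terms) if terms else 0.0

def classify_single_label(text, labeled_sets):
    """Single-label case of Step 5: assign the label with the highest score.
    labeled_sets maps each label to the list of training texts carrying it."""
    scores = {}
    for label, pos in labeled_sets.items():
        neg = [t for other, ts in labeled_sets.items() if other != label for t in ts]
        scores[label] = lod_score(text, pos, neg)
    return max(scores, key=scores.get)
```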

2.3.2 Evaluating the value of the sentiment feature

To determine the sentiment of a short text, it is necessary to estimate the author’s point of view expressed in the text. In this paper, we apply the method of classifying texts into classes described in Sect. 2.3.1. Therefore, the value of the sentiment feature of a text is assigned one of three values (classes): positive, negative, or neutral.

2.3.3 Evaluating the value of the emotion feature

The emotions of the user represented in an entry are often expressed by icons or images, each of which is equivalent to a term describing that emotion. Therefore, estimating the similarity between the emotions of two entries amounts to estimating the similarity between two terms. In this paper, we apply the method of classifying texts into classes described in Sect. 2.3.1. Therefore, the emotion value of a text is assigned one of the following values (classes): enjoy; happy for; love; gratitude; admiration; pride; hope; sad; sorry; regret; disappointed; disgust; angry; confused; no emotion.

2.4 Estimating similarity on each feature

In this model, we distinguish two kinds of textual values of a feature:

  • First, the feature value is already in the form of a set of expressions, such as the values of the features tags, category, sentiment, and emotion. Their similarity is considered as the similarity among sets of expressions.

  • Second, the feature value is in the form of a general text, such as the value of the feature content. Their similarity is considered as the similarity among texts.

2.4.1 Estimating the similarity for expression features

Since the feature value is in the form of a set of textual expressions, the similarity is defined as follows: suppose that \(A_1 = (a^1_1,a^2_1,\ldots a^m_1)\) and \(A_2 = (a^1_2,a^2_2,\ldots a^n_2)\) are two sets of expressions or strings, in which m and n are the sizes of the sets \(A_1\) and \(A_2\), respectively. Let v be the size of the intersection of \(A_1\) and \(A_2\). The similarity between \(A_1\) and \(A_2\) is defined by the formula:

$$\begin{aligned} s_{\text {exp}}(A_1,A_2) = \displaystyle \frac{2 \times \mid A_1 \cap A_2 \mid }{\mid A_1\mid + \mid A_2\mid } = \displaystyle \frac{2\times v}{m + n}. \end{aligned}$$
(3)

It is clear that all possible values of \(s_{\text {exp}}(A_1,A_2)\) are in the interval [0, 1]. This formula can be applied to any feature whose value is a set of expressions.

Suppose that \(e^i = (f^i_1,f^i_2,\ldots f^i_n)\), \(e^j = (f^j_1,f^j_2,\ldots f^j_n)\) are two entries represented by their features. Let us consider the feature k whose value is a set of expressions. The similarity between entries \(e^i\) and \(e^j\) on the feature k is defined by the formula:

$$\begin{aligned} s_{k}(e^i,e^j) = s_{\text {exp}}(f^i_k,f^j_k), \end{aligned}$$
(4)

where \(f^i_k,f^j_k\) are the expression values of the feature k of the two entries \(e^i\) and \(e^j\).
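
As an illustration, a direct Python implementation of Eq. (3) could look as follows; treating two empty sets as having zero similarity is our own assumption, since the formula is undefined in that case.

```python
def s_exp(a1, a2):
    """Similarity between two sets of expressions, Eq. (3) (a Dice coefficient)."""
    a1, a2 = set(a1), set(a2)
    if not a1 and not a2:
        return 0.0  # Assumption: the formula is undefined for two empty sets.
    return 2 * len(a1 & a2) / (len(a1) + len(a2))

# Example with two tag sets: 2*2 / (2+3) = 0.8
print(s_exp({"apple", "mac"}, {"apple", "mac", "iphone"}))
```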

2.4.2 Estimating similarity for text features

The problem of estimating the similarity among textual values becomes the estimation of the similarity among texts. We apply the TF–IDF (term frequency–inverse document frequency) technique [17] to characterize the texts as follows (a small Python sketch is given after the steps below):

  • Split the two texts into sets of n-grams \(t^1 = (g^1_1, g^1_2,\ldots g^1_n)\) and \(t^2 = (g^2_1, g^2_2,\ldots g^2_m).\)

  • Calculate the TF–IDF of each n-gram in the text. Then, represent the feature value by a vector in which each element is a pair

    <n-gram, tf–idf>: \(v^1 = (<g^1_1,v^1_1>,<g^1_2,v^1_2>,\cdots <g^1_n,v^1_n>)\) and \(v^2 = (<g^2_1,v^2_1>,<g^2_2,v^2_2>,\cdots <g^2_m,v^2_m>).\)

  • Calculate the distance between the two vectors:

    $$\begin{aligned} D(v^1,v^2) = \displaystyle \frac{1}{N}\sum ^N_1 d_k, \end{aligned}$$
    (5)

    where N is the number of distinct n-grams in the union \(t^1 \cup t^2\) and \(d_k\) is the distance on each element \(<g^1_i,v^1_i>\) of \(v^1\) (or element \(<g^2_j,v^2_j>\) of \(v^2\), respectively):

    • If there is an element \(<g^2_l,v^2_l>\) of \(v^2\) (or an element \(<g^1_l,v^1_l>\) of \(v^1\), respectively) such that \(g^2_l = g^1_i\), then:

      $$\begin{aligned} d_k = \displaystyle \frac{\mid v^1_i - v^2_l \mid }{\max (v^1_i, v^2_l)}. \end{aligned}$$
      (6)
    • Otherwise, \(d_k = 1\).

  • It is clear that the value of \(D(v^1,v^2)\) is in the interval [0, 1]. The similarity between the two features is then:

    $$\begin{aligned} s_{\text {txt}}(t^1,t^2) = 1 - D(v^1,v^2). \end{aligned}$$
    (7)
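
The following Python sketch follows Eqs. (5)–(7) with unigrams. The corpus used for the IDF term and the smoothed IDF variant are our own assumptions, since the paper does not fix them; the sketch also simply lowercases text and splits on whitespace.

```python
import math
from collections import Counter

def tfidf_vector(text, corpus):
    """TF-IDF weights of the unigrams of `text`, computed against `corpus`."""
    terms = text.lower().split()
    tf = Counter(terms)
    weights = {}
    for term, freq in tf.items():
        df = sum(1 for doc in corpus if term in doc.lower().split())
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF (a common choice)
        weights[term] = (freq / len(terms)) * idf
    return weights

def s_txt(t1, t2, corpus):
    """Text similarity, Eqs. (5)-(7): one minus the averaged per-term distance."""
    v1, v2 = tfidf_vector(t1, corpus), tfidf_vector(t2, corpus)
    grams = set(v1) | set(v2)          # the N distinct n-grams of the two texts
    if not grams:
        return 0.0
    dist = 0.0
    for g in grams:
        if g in v1 and g in v2:
            dist += abs(v1[g] - v2[g]) / max(v1[g], v2[g])   # Eq. (6)
        else:
            dist += 1.0                # the n-gram occurs in only one of the texts
    return 1 - dist / len(grams)       # Eq. (7)
```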

2.4.3 Estimating similarity between two entries

We consider an entry via five features: content, tags, category, sentiment, and emotion. Four of them, tags, category, sentiment, and emotion, are expression features, so their similarities are estimated as expression-feature similarities as follows:

$$\begin{aligned} s_{\text {cate}}(e^i,e^j) =&s_{\text {exp}}(f^i_{\text {cate}},f^j_{\text {cate}}), \end{aligned}$$
(8)
$$\begin{aligned} s_{\text {tags}}(e^i,e^j) =&s_{\text {exp}}(f^i_{\text {tags}},f^j_{\text {tags}}), \end{aligned}$$
(9)
$$\begin{aligned} s_{\text {sent}}(e^i,e^j) =&s_{\text {exp}}(f^i_{\text {sent}},f^j_{\text {sent}}), \end{aligned}$$
(10)
$$\begin{aligned} s_{\text {emot}}(e^i,e^j) =&s_{\text {exp}}(f^i_{\text {emot}},f^j_{\text {emot}}). \end{aligned}$$
(11)

The content is the only text feature of an entry, so its similarity is estimated as a text-feature similarity, calculated as follows:

$$\begin{aligned} s_{\text {cont}}(e^i,e^j) = s_{\text {txt}}(f^i_{\text {cont}},f^j_{\text {cont}}). \end{aligned}$$
(12)

Let \(e^i \) and \(e^j \) be two considered entries whose content, tags, category, sentiment, and emotion features are, respectively: \(e^i_{\text {cont}}\), \(e^j_{\text {cont}}\), \(e^i_{\text {tags}}\), \(e^j_{\text {tags}}\), \(e^i_{\text {cate}}\), \(e^j_{\text {cate}}\), \(e^i_{\text {sent}}\), \(e^j_{\text {sent}}\), \(e^i_{\text {emot}}\), \(e^j_{\text {emot}}\). Based on the approach of multi-attribute similarity of two objects [12], the similarity between the two entries \(e^i \) and \(e^j \) is estimated as follows:

$$\begin{aligned} s_{\text {entry}}(e^i,e^j) =&f_{\text {ent}}(s_{\text {cont}}(e^i, e^j), s_{\text {tags}}(e^i, e^j), \nonumber \\&s_{\text {cate}}(e^i, e^j), s_{\text {sent}}(e^i, e^j), s_{\text {emot}}(e^i, e^j)), \end{aligned}$$
(13)

where \(f_{\text {ent}}:{[0,1]}^5\rightarrow [0,1] \) is a similarity function between two entries, which satisfies the following conditions (one admissible choice is sketched after Eq. (14)):

$$\begin{aligned}&(i) \ \ f_{\text {ent}}(v_1, w, x, y, z ) \leqslant f_{\text {ent}}(v_2, w, x, y, z )\ \text {if} \ v_1 \leqslant v_2;\nonumber \\&(ii) \ \ f_{\text {ent}}(v, w_1, x, y, z ) \leqslant f_{\text {ent}}(v, w_2, x, y, z ) \ \text {if} \ w_1 \leqslant w_2;\nonumber \\&(iii) \ \ f_{\text {ent}}(v, w, x_1, y, z ) \leqslant f_{\text {ent}}(v, w, x_2, y, z ) \ \text {if} \ x_1\leqslant x_2;\nonumber \\&(iv) \ \ f_{\text {ent}}(v, w, x, y_1, z ) \leqslant f_{\text {ent}}(v, w, x, y_2, z ) \ \text {if} \ y_1\leqslant y_2;\nonumber \\&(v) \ \ f_{\text {ent}}(v, w, x, y, z_1 ) \leqslant f_{\text {ent}}(v, w, x, y, z_2 ) \ \text {if} \ z_1\leqslant z_2. \end{aligned}$$
(14)
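
Any function that is non-decreasing in each argument satisfies Eq. (14); a weighted average is one admissible choice. The sketch below uses it to assemble Eq. (13) from the per-feature similarities (s_txt and s_exp from the sketches above); the dictionary-based entry representation and the default weights, taken from the best combination reported in Sect. 3.2, are illustrative assumptions.

```python
def f_ent(s_cont, s_tags, s_cate, s_sent, s_emot,
          weights=(0.30, 0.25, 0.10, 0.15, 0.20)):
    """One admissible f_ent for Eq. (13): a weighted average, which is
    non-decreasing in every argument as required by Eq. (14)."""
    return sum(w * s for w, s in zip(weights, (s_cont, s_tags, s_cate, s_sent, s_emot)))

def s_entry(e_i, e_j, corpus):
    """Entry similarity, Eq. (13); entries are dicts holding the five features."""
    return f_ent(
        s_txt(e_i["cont"], e_j["cont"], corpus),   # Eq. (12)
        s_exp(e_i["tags"], e_j["tags"]),           # Eq. (9)
        s_exp(e_i["cate"], e_j["cate"]),           # Eq. (8)
        s_exp(e_i["sent"], e_j["sent"]),           # Eq. (10)
        s_exp(e_i["emot"], e_j["emot"]),           # Eq. (11)
    )
```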

2.4.4 Estimating the similarity between two groups

Once the similarity between two groups on each feature is estimated, the similarity between the two groups is estimated by a weighted average aggregation of the per-feature similarities, as follows (a small sketch in Python follows the list):

  • Let \( w_1, w_2, w_3 \) be the weights of the features Name, Description, and Category, respectively. They have to satisfy the condition \(w_1 + w_2 + w_3 = 1.\)

  • The similarity between group \(g_i\) and group \(g_j\) is:

    $$\begin{aligned} s_{\text {group}}(g_i, g_j) =&w_1 \times s_{\text {exp}}(g^i_{\text {name}},g^j_{\text {name}})\nonumber \\&+ w_2 \times s_{\text {exp}}(g^i_{\text {desc}}, g^j_{\text {desc}}) \nonumber \\&+ w_3 \times s_{\text {exp}}(g^i_{\text {camu}}, g^j_{\text {camu}}), \end{aligned}$$
    (15)

    where \(w_1, w_2, w_3\) are, respectively, the weights of the features Name, Description, and Category, and \(s_{\text {exp}}(A,B)\) is the similarity between the two sets of expressions A and B.
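
A direct reading of Eq. (15) in Python, reusing s_exp from the sketch above; the dict-based group representation and the particular weight values are illustrative assumptions (the paper only requires that the weights sum to 1).

```python
def s_group(g_i, g_j, w_name=0.5, w_desc=0.3, w_camu=0.2):
    """Group similarity, Eq. (15): weighted sum of the expression
    similarities on Name, Description and Category."""
    assert abs(w_name + w_desc + w_camu - 1.0) < 1e-9
    return (w_name * s_exp(g_i["name"], g_j["name"])
            + w_desc * s_exp(g_i["desc"], g_j["desc"])
            + w_camu * s_exp(g_i["camu"], g_j["camu"]))
```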

In the case of two sets of communities, let \(G_1 = \{g^1_1,g^1_2, \ldots ,g^1_m\}\) and \(G_2=\{g^2_1,g^2_2, \ldots ,g^2_n\}\) be the two considered sets of communities. We create the combined set \(G_{12}= G_1+G_2 = \{g_1, g_2, \ldots , g_{m+n}\}\) and then construct its non-ordered semantic vector \( T = (t_1, t_2,\ldots , t_{m+n}) \) as:

$$\begin{aligned} t_i&= \text {min}( \text {max}(s_{\text {group}}(g_i, g^1_k )), \text {max}(s_{\text {group}}(g_i, g^2_v )) )\quad \nonumber \\ k&= 1\ldots m; v = 1\ldots n, \end{aligned}$$
(16)

where \(s_{\text {group}}(x, y)\) is the similarity between two groups x and y. The similarity between two non-ordered sets of communities \(G_1\) and \(G_2\) is defined by the formula:

$$\begin{aligned} s_{\text {css}}(G_1, G_2) = f_{\text {set}}(T) = f_{\text {set}}(t_1, t_2, \dots , t_{m+n}), \end{aligned}$$
(17)

where \(f_{\text {set}}: {[0,1]}^k \rightarrow [0,1]\) is a similarity function between two sets, whose conditions are given in Eq. (20).

2.5 Estimating the similarity on each behavior of the user

In this paper, we consider five behaviors of the user on social networks: posting an entry, liking, commenting, sharing an entry, and joining a group or a community.

2.5.1 The similarity on post or share behavior

In the case of posting or sharing an entry, the behavior similarity is estimated from the similarity of the two sets of posted or shared entries as follows:

Let \( E_1 = \{e^1_1, e^1_2,\ldots ,e^1_p\}\) and \( E_2 = \{e^2_1, e^2_2, \ldots ,e^2_q\}\) be the two considered sets of posted or shared entries. We create the combined set \( E_{12}= E_1 + E_2 =\{e_1, e_2, \ldots ,e_{p+q}\}\) and then construct its non-ordered semantic vector \( T = (t_1, t_2, \ldots ,t_{p+q}) \) as:

$$\begin{aligned} t_i&= \text {min}(\text {max}(s_{\text {entry}}(e_i,e^1_k )), \text {max}(s_{\text {entry}}(e_i,e^2_v ))) \quad \nonumber \\ k&= 1\ldots p; v = 1\ldots q, \end{aligned}$$
(18)

where \(s_{\text {entry}}(x,y) \) is the similarity between the two entries x and y. To measure the similarity between two sets of entries \(E_1\) and \(E_2\), we make use of the following assumption: the bigger the magnitude of the vector T, the higher the similarity between \(E_1\) and \(E_2\). The similarity between two non-ordered sets of entries \(E_1\) and \(E_2\) is defined by the formula:

$$\begin{aligned} s_{\text {ess}}(E_1,E_2) = f_{\text {set}}(T) = f_{\text {set}}(t_1, t_2,\ldots , t_{p+q}), \end{aligned}$$
(19)

where \(f_{\text {set}}: {[0,1]}^k \rightarrow [0,1]\) is a similarity function between two sets, which satisfies the following conditions:

$$\begin{aligned}&(i) \ \ f_{\text {set}}(0,0,\ldots ,0 ) =0; \nonumber \\&(ii) \ \ f_{\text {set}}(1,1,\ldots ,1) =1; \nonumber \\&(iii) \ \ f_{\text {set}}(X_1)\leqslant f_{\text {set}}(X_2)\ \text {if }\ \Vert X_1\Vert \leqslant \Vert X_2\Vert .\end{aligned}$$
(20)

For example, the following functions are valid similarity functions between two sets of entries:

$$\begin{aligned}&(1) \ \ f(x_1,x_2,\ldots ,x_n ) =\frac{\sum _{i=1}^{n}x_i}{n},\nonumber \\&(2) \ \ f(x_1,x_2,\ldots ,x_n ) =\sqrt{\frac{\sum _{i=1}^{n}x^2_i}{n}}. \end{aligned}$$
(21)
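
The construction shared by Eqs. (16)–(19), combine the two sets, score every element against its best match in each set, keep the smaller of the two, then aggregate with \(f_{\text {set}}\), can be sketched as follows, reusing s_entry and s_group from the sketches above. The choice of the arithmetic mean as \(f_{\text {set}}\) corresponds to example (1) of Eq. (21); both element sets are assumed to be non-empty.

```python
import math

def semantic_vector(set1, set2, sim):
    """Non-ordered semantic vector T of Eqs. (16)/(18) for the combined set.
    Assumes both sets are non-empty."""
    combined = list(set1) + list(set2)
    return [min(max(sim(x, a) for a in set1),
                max(sim(x, b) for b in set2)) for x in combined]

def f_set_mean(t):
    """Example (1) of Eq. (21): arithmetic mean."""
    return sum(t) / len(t) if t else 0.0

def f_set_rms(t):
    """Example (2) of Eq. (21): quadratic mean."""
    return math.sqrt(sum(x * x for x in t) / len(t)) if t else 0.0

def s_ess(entries1, entries2, corpus):
    """Similarity between two sets of entries, Eq. (19)."""
    return f_set_mean(semantic_vector(entries1, entries2,
                                      lambda a, b: s_entry(a, b, corpus)))

def s_css(groups1, groups2):
    """Similarity between two sets of communities or groups, Eq. (17)."""
    return f_set_mean(semantic_vector(groups1, groups2, s_group))
```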

Let \(E^{u_i}_{\text {post}}\) and \(E^{u_j}_{\text {post}}\) be, respectively, the two sets of posted (or shared) entries of user \(u_i\) and user \(u_j\). The posting-based (or sharing-based) behavior similarity of user \(u_i\) and user \(u_j\) is defined by the formula:

$$\begin{aligned} s_{\text {post}}(u_i,u_j)=s_{\text {ess}}(E^{u_i}_{\text {post}}, E^{u_j}_{\text {post}}), \end{aligned}$$
(22)

where \(s_{\text {ess}}(A,B) \) is the similarity between two sets of entries A and B.

2.5.2 The similarity on behavior of joining a group

Let \(G^{u_i}_{\text {join}}\) and \(G^{u_j}_{\text {join}}\) be, respectively, the two sets of communities or groups joined by user \(u_i\) and user \(u_j\). The group-joining behavior similarity of user \(u_i\) and user \(u_j\) is defined by the formula:

$$\begin{aligned} s_{\text {join}}(u_i,u_j)=s_{\text {css}}(G^{u_i}_{\text {join}}, G^{u_j}_{\text {join}}), \end{aligned}$$
(23)

where \(s_{\text {css}}(A,B) \) is the similarity between two sets of communities A and B.

2.5.3 The similarity on behavior of liking an entry

Let \(L^{u_i}_{\text {like}}\) and \(L^{u_j}_{\text {like}}\) be, respectively, the two sets of entries liked by user \(u_i\) and user \(u_j\). To measure the like-based behavior similarity of user \(u_i\) and user \(u_j\), the following is assumed: the more similar the two sets \(L^{u_i}_{\text {like}}\) and \(L^{u_j}_{\text {like}}\) are, the higher the like-based behavior similarity of user \(u_i\) and user \(u_j\). The like-based behavior similarity of user \(u_i\) and user \(u_j\) is defined by the formula:

$$\begin{aligned} s_{\text {like}}(u_i,u_j)=s_{\text {ess}}(L^{u_i}_{\text {like}}, L^{u_j}_{\text {like}}), \end{aligned}$$
(24)

where \(s_{\text {ess}}(A,B) \) is the similarity between the two sets of entries A and B.

2.5.4 The similarity on comment and comment-like behaviors

Although commenting on or liking a comment is often a confirmation of what the user already liked or disliked, a user sometimes comments or likes a comment without liking or disliking the entry itself. We take these cases into account to measure the similarity among users according to the following principles (a small sketch follows the list):

  • The polarity of each comment is determined as positive, negative, or neutral. This determination can be done by applying the method of classifying texts into classes in Sect. 2.3.1.

  • In each entry, only the positive or negative comments are counted; neutral comments are removed.

  • In each entry, if the number of positive comments of a user is greater than the number of his/her negative comments, then the entry is considered as positive for the user. Conversely, if the number of positive comments of a user is smaller than the number of negative comments, then the entry is considered as negative for the user.

  • In the case where the numbers of positive and negative comments are equal, we consider the comments liked by the user:

    • If the number of positive comments liked by a user is greater than that of negative comments, then the entry is considered as positive for the user.

    • If the number of positive comments liked by a user is smaller than that of negative comments, then the entry is considered as negative for the user.

    • If the numbers of positive comments and negative comments liked by a user are equal, then the entry is considered as neutral for the user and it will be removed from the considering set of entries for the user.
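
The rules above can be sketched as a small helper that decides, for one entry and one user, whether the entry counts as positive, negative, or is dropped; the `sentiment` callable stands for the classifier of Sect. 2.3.1 and is an assumption of this sketch.

```python
def entry_polarity(comments, liked_comments, sentiment):
    """Classify one entry for one user from his/her comments (and, on a tie,
    from the comments he/she liked). Returns 'pos', 'neg' or None (dropped)."""
    def count(texts):
        labels = [sentiment(t) for t in texts]
        return labels.count('pos'), labels.count('neg')  # neutral comments are ignored

    pos, neg = count(comments)
    if pos == neg:                      # tie: fall back on the liked comments
        pos, neg = count(liked_comments)
    if pos > neg:
        return 'pos'
    if neg > pos:
        return 'neg'
    return None                         # neutral entry: removed from the considered set
```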

Let \(C^{u_i}_p\) and \(C^{u_i}_n\) be, respectively, the sets of positive and negative entries for user \(u_i\), and \(C^{u_j}_p\) and \(C^{u_j}_n\) be, respectively, the sets of positive and negative entries for user \(u_j\). To measure the comment/like-comment-based behavior similarity of user \(u_i\) and user \(u_j\), the following is assumed:

  • The more similar the two sets \(C^{u_i}_p\) and \(C^{u_j}_p\) are, the higher the comment/like-comment-based behavior similarity of user \(u_i\) and user \(u_j\).

  • The more similar the two sets \(C^{u_i}_n\) and \(C^{u_j}_n\) are, the higher the comment/like-comment-based behavior similarity of user \(u_i\) and user \(u_j\).

  • The less similar the two sets \(C^{u_i}_p\) and \(C^{u_j}_n\) are, the higher the comment/like-comment-based behavior similarity of user \(u_i\) and user \(u_j\).

  • The less similar the two sets \(C^{u_i}_n\) and \(C^{u_j}_p\) are, the higher the comment/like-comment-based behavior similarity of user \(u_i\) and user \(u_j\).

The comment/like-comment-based behavior similarity of user \(u_i\) and user \(u_j\) is defined by the formula:

$$\begin{aligned} s_{\text {comt}}(u_i,u_j)&= \text {min}(1, \max (0, s_{\text {ess}}(C^{u_i}_p,C^{u_j}_p)\nonumber \\&\quad +s_{\text {ess}}(C^{u_i}_n,C^{u_j}_n) - s_{\text {ess}}(C^{u_i}_p,C^{u_j}_n) \nonumber \\ {}&\quad - s_{\text {ess}}(C^{u_i}_n,C^{u_j}_p))), \end{aligned}$$
(25)

where \(s_{\text {ess}}(A,B) \) is the similarity between two sets of entries A and B.
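
Eq. (25) then combines the four set similarities and clips the result to [0, 1]; a direct sketch, reusing s_ess from the sketch above:

```python
def s_comt(Cp_i, Cn_i, Cp_j, Cn_j, corpus):
    """Comment/like-comment behavior similarity, Eq. (25)."""
    raw = (s_ess(Cp_i, Cp_j, corpus) + s_ess(Cn_i, Cn_j, corpus)
           - s_ess(Cp_i, Cn_j, corpus) - s_ess(Cn_i, Cp_j, corpus))
    return min(1.0, max(0.0, raw))
```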

2.5.5 Estimating the similarity between two users

Once the similarities between two users on each kind of behavior are estimated, the similarity between the two users is estimated by a weighted average aggregation of their similarities on all considered kinds of behaviors, as follows (a small sketch follows the list):

  • Let \(w_1, w_2, w_3, w_4\) be the weights of the similarities based on posting/sharing, joining a group, liking entries, and commenting/liking comments, respectively. They have to satisfy the condition \(w_1 + w_2 + w_3 + w_4 = 1.\)

  • The similarity between user \(u_i\) and user \(u_ j\) is:

    $$\begin{aligned} s(u_i, u_j) =&w_1 \times s_{\text {post}}(u_i, u_j)+ w_2 \times s_{\text {join}}(u_i, u_j)\nonumber \\&+ w_3 \times s_{\text {like}}(u_i, u_j)+ w_4 \times s_{\text {comt}}(u_i, u_j), \end{aligned}$$
    (26)

    where \(w_1, w_2, w_3, w_4\) are, respectively, the weights of the similarities based on posting/sharing, joining a group, liking entries, and commenting/liking comments, and \( s_{\text {post}}(u_i, u_j)\), \(s_{\text {join}}(u_i, u_j)\), \( s_{\text {like}}(u_i, u_j)\), and \( s_{\text {comt}}(u_i, u_j) \) are the similarities between the two users \(u_i\) and \(u_j\) based on posting entries, joining a group, liking entries, and commenting/liking comments, respectively.
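
Putting the pieces together, Eq. (26) is a weighted sum of the four behavior similarities. The sketch below reuses the functions defined above; the dictionary-based user representation and the default weights, taken from the best behavior-level combination of Sect. 3.2, are illustrative assumptions.

```python
def user_similarity(u_i, u_j, corpus, weights=(0.35, 0.10, 0.25, 0.30)):
    """Overall user similarity, Eq. (26): weights for post/share, join,
    like and comment/like-comment behaviors (they must sum to 1)."""
    w_post, w_join, w_like, w_comt = weights
    return (w_post * s_ess(u_i["posted"], u_j["posted"], corpus)        # Eq. (22)
            + w_join * s_css(u_i["groups"], u_j["groups"])              # Eq. (23)
            + w_like * s_ess(u_i["liked"], u_j["liked"], corpus)        # Eq. (24)
            + w_comt * s_comt(u_i["C_pos"], u_i["C_neg"],
                              u_j["C_pos"], u_j["C_neg"], corpus))      # Eq. (25)
```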

3 Experiments and evaluation

3.1 Method

3.1.1 Collection of data

To evaluate the proposed model, we collected data from Twitter.com (Table 1), so we could directly apply the model to estimate the similarity among Twitter users. Each tweet is considered through the five features of the model: content, tags, category, sentiment, and emotion. The considered activities of a Twitter user are: post/share, like, comment, and the user's lists of groups. Note that Twitter does not have explicit like and join a group activities exactly as defined in the model. Therefore, we have to map similar Twitter activities to these two activities as follows:

  • Like: the like activity of a user in Twitter is considered as the user's list of favorite tweets.

  • Join a group: in the case of Twitter, a group could be considered as a list that some users subscribed to.

Table 1 Collected data from Twitter.com

3.1.2 Construction of sample set

Each sample is constructed as follows:

  • Each sample contains three users collected from Twitter.com, called user A, B, and C, respectively.

  • We ask a number of selected volunteers to answer the question: which user, B or C, is more similar to user A?

  • Then, we compare the number of people who choose B with the number who choose C. If the number of answers for B is greater than that for C, then the value of this sample is 1, meaning that user B is more similar to user A than C is. On the contrary, if the number of answers for C is greater than that for B, then the value of this sample is 2, meaning that user C is more similar to user A than B is. If the numbers of answers for B and C are not significantly different, the sample is removed from the sample set.

After this step, we obtain a set of samples. In the experiments, we use sample sets with the sizes described in Table 2.

Table 2 Sample constructed from Twitter

3.1.3 Scenario

The experiment is performed as follows:

  • For each sample, we use the model proposed in this paper to estimate the similarity between user B and user A, and that between user C and user A.

  • If B is more similar to A than C is, then the result of this sample is 1. On the contrary, if C is more similar to A than B is, then the result of this sample is 2.

  • We then compare the result with the value of each sample. If they are identical, we increase the number of correct samples by 1.

3.1.4 Output parameters

The correct ratio (CR) of the model over the given sample set is calculated as follows:

$$\begin{aligned} \text {CR} = \frac{\text {number of correct samples}}{\text {total number of samples}}\times 100\%. \end{aligned}$$
(27)

The closer the CR value is to \(100\%\), the more accurate the model. We expect the obtained value of CR to be as high as possible.

3.2 Results

The results are presented in Table 3. Overall, the correct ratio of the model over all samples is 438/500 \( (87.60\%)\).

Table 3 Correct ratio CR of the sample set
Table 4 Best weight of the entry criteria for the sample set
Table 5 Best weight of behavior for the sample set

For more details, we run experiments with several combinations of weights for the criteria of an entry and for the behaviors of a user, with the following detailed scenario (a sketch of the weight search follows the list):

  • At the level of entry, we run the experiment with only 1/5, 2/5, 3/5, 4/5, and 5/5 of the criteria to detect the similarity among entries: selecting 1/5 or 4/5 of the criteria gives five possible combinations; 2/5 or 3/5 gives ten possible combinations; 5/5 gives only one combination.

  • For each combination, we run the experiment with different weights for each selected criterion. The changing step for each weight is 0.05. Therefore, each criterion weight runs from 0.05 to 1.00 as long as the sum of all criterion weights in the experiment is equal to 1.

  • The same principle is applied at the level of behavior: we run the experiment with 1/4, 2/4, 3/4, and 4/4 of the behaviors. Each combination is treated in the same manner as at the previous level.
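
The weight search described above can be sketched as an exhaustive enumeration of criterion subsets and of weight tuples on a 0.05 grid that sum to 1. The `evaluate_cr` callable stands for a routine that runs the model over the sample set and returns the correct ratio CR of Eq. (27); it is an assumption of this sketch.

```python
from itertools import combinations

def weight_grid(k, step=0.05):
    """All k-tuples of weights from the grid {step, 2*step, ...} that sum to 1."""
    total = round(1 / step)
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining * step,)
            return
        for first in range(1, remaining - slots + 2):
            for rest in rec(remaining - first, slots - 1):
                yield (first * step,) + rest
    yield from rec(total, k)

def best_weights(criteria, evaluate_cr, step=0.05):
    """Try every subset of criteria and every weight combination; keep the best CR."""
    best_cr, best = 0.0, None
    for r in range(1, len(criteria) + 1):
        for subset in combinations(criteria, r):
            for weights in weight_grid(r, step):
                cr = evaluate_cr(dict(zip(subset, weights)))
                if cr > best_cr:
                    best_cr, best = cr, dict(zip(subset, weights))
    return best_cr, best
```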

The results are presented in Table 4 (for entry) and Table 5 (for behavior). At the level of entry, the best weight combination is: 0.30 for content, 0.25 for tags, 0.20 for emotion, 0.15 for sentiment, and 0.10 for category. These results are reasonable: in Twitter, content and tags are explicit values of an entry, so they are the most important in the results. Meanwhile, the three remaining criteria, emotion, sentiment, and category, are implicit values extracted from the entry. Their values possibly depend on the classification method, and therefore their importance may be reduced in the final results. The criterion category is less important possibly because it has only a small number of possible values: one value of this criterion could represent a large number of different entries, so this criterion does not classify the entries as well as the others.

At the level of behavior, the best weight combination is: 0.35 for post, 0.30 for comment, 0.25 for like, and 0.10 for join a group. As mentioned in the scenario, Twitter has no real data about like and join a group; we had to map the favorite list in Twitter to the like activity and the subscribed lists to the join a group activity of the model. This may be the main reason why, in this experiment, these two activities are less important than the two real activities of post and comment.

4 Conclusions

In this paper, we presented a model for estimating the similarity between users based on their entries and behavior on a social network. The considered behaviors are posting an entry or sharing an existing entry, liking an entry or a comment on an entry, commenting on an entry, and joining a group or community. The model is applied to estimate the similarity among Twitter users. The results show that the model can correctly estimate the similarity among users in the majority of cases.

This model could be applied in several applications, such as predicting the behavior of a social network user in commenting on or liking some kind of status, recommending new entries that could be appropriate for a given user, or clustering users based on some criteria.