1 Introduction

Recommender systems, e-commerce sites, social media platforms, and search engines use personalization to tailor recommendations to their users [13, 29]. Personalization is a core part of content consumption and of companies’ revenue [25].

A recommender system’s (RS’s) personalized recommendation is meant to help users find information of interest by filtering items by their relevance. This reduces users’ information overload and information-seeking effort, saves them time and improves their decision quality [18]. For providers, an RS is useful to keep users engaged and retain a variety of audiences, thereby increasing revenue. Recommendation providers and users therefore have a common interest in meeting users’ needs and preferences using personalized recommendations.

An RS’s personalization can be measured for different ends and from different perspectives. For instance, Hannak et al. [13] measure personalization as the deviation from presenting search results in the exact same order for each query for every user. In doing so, this measure disregards user differences entirely. It is evident, however, that users differ in information needs and interests, even when they use the same search query [28]. They are also able to single their preferences out from a mixed presentation [10, 15, 19, 21]. The challenge for a personalized RS, then, is to capture the natural and healthy differences between users without imposing new and unnecessary differences. This suggests the need for a delicate balance in which users’ differences in interests and tastes are satisfied, where they are neither “overloaded” with content they are not interested in, nor served with content more differentiated than necessary.

How do we then measure an RS’s success in meeting the users’ differences through personalization? Personalized recommendation can be conceived of as a two-stage process: the generation of the recommendation list and a subsequent ranking of the recommendation list. By serving different recommendation lists to users, a recommender system imposes a difference between the users. The users react to the differentiated recommendations, for example by clicking. Users implicitly approve the imposed difference when the difference in their reactions is proportional to the difference in the recommendations. Likewise, they may implicitly show a wish to pull (get closer to each other) by consuming more of the items in common between the two lists, or a desire to push each other (drift apart) by consuming more of the items that are distinct. The differentiated recommendations and the subsequent reactions by the users give us two degrees of content differentiation. Comparing the two degrees of differentiation allows us to measure the degree of pull or push that an RS-imposed difference causes.

We introduce a user-centric metric called pull–push to quantify the discrepancy between the two degrees of content differentiation. Pull–push compares the degree of differentiation in recommendations with the degree of differentiation in the resulting user reactions. The metric’s score can indicate three states: balanced, pull or push. A pull state indicates the users’ tendency to come together (consume common items) despite the imposed differences in the recommendations. A push state, on the other hand, indicates the tendency of users to drift apart more than is imposed by the RS. In this situation, the RS should personalize more. A balanced state indicates a congruence between recommendations and user interests/preferences. A pull or a push state shows a degree of disapproval of the recommendations by the users. A pull score is a measure of over-personalization, and a push score is a measure of under-personalization. How to use the measure to tune an RS is beyond the scope of this work, but a topic of future research. Ideally, we would want every deployed RS to be in a balanced state.

The contributions of our work are: (1) a novel user-centric conceptualization of an RS’s content differentiation success, (2) a generic and versatile user-centric metric for quantifying the gap between personalization and user interests, (3) applications of the metric to simulated and real-world datasets and (4) a discussion of the metric in relation to other metrics on personalization, to popularity bias, and to normative standards. The rest of the paper is organized as follows. In Sect. 2, we present background and related work. In Sect. 3, we present our proposed method, followed by Sect. 4, where we apply the proposed method to simulated and real-world datasets. We discuss the method and the results in a broader context in Sect. 5 and finish with a conclusion in Sect. 6.

2 Background and related work

Here, we review relevant metrics on personalization. To help explain them, we use the recommendation flowchart shown in Fig. 1. The flowchart shows a recommendation flow from item selection from an item pool, to impression and to interaction. This simplified flow applies to both query-based and query-less personalization. Available items can, for instance, be the daily news items for a news recommendation platform, or not-yet-personalized first-page search results. Different recommender system evaluation metrics address different aims and stages in this recommendation flowchart.

The metric proposed by Hannak et al. [13], which quantifies personalization as the variation from an identical presentation, targets the fine-dotted rectangle (Available Items + Shown + Not Shown) in Fig. 1. From this point of view, a score is produced by comparing the personalized recommendations against each other, or against the available items. Accuracy-oriented measures such as RMSE and MAP, and engagement-oriented measures such as CTR and dwell time (see [30]), target the area surrounded by the solid rectangle (Shown + Clicked + Not Clicked).

Fig. 1 The recommendation flowchart from available items to clicks. Available items are either shown (recommended) or not shown. Shown items are either clicked or not clicked

Works by Nguyen et al. [23] and by Teevan et al. [28] target personalization at the level of the production of recommendation lists and at the level of ranking those lists, respectively. Nguyen et al. examined the effect of an item–item recommender system on the diversity of recommended and consumed items [23]. Similar to our approach, this study compares recommendation lists and the resulting reaction lists, but while the pull–push aims to measure how successful the content differentiation is, their aim is only to uncover personalization’s impact on content diversity. Nguyen et al.’s metric targets the Shown and Clicked boxes in Fig. 1 directly, and the Not Clicked box indirectly.

Teevan et al.’s “potential for personalization” [28] concerns itself with the ranking of the recommendation (search) lists. Using normalized DCG [17] as a measure of the quality of the ranking of a recommendation list, they define the difference between the ideal ranking score for an individual and that for a group as the potential for personalization. They concern themselves only with the ranking of the search (recommendation) list, ignoring the personalization done in the selection of the recommendation list. Their study targets the Clicked and Not Clicked boxes in Fig. 1.

While many measures are employed to assess RSs, methods to quantify an RS’s success at content differentiation at the level of generating the recommendation list itself are under-addressed. To address this, we first conceive of personalization as containing two stages, namely content differentiation to generate the recommendation list and a subsequent ranking of the recommendation list. We then offer a novel, user-centric metric, which we call pull–push, that measures the gap between the degree of the RS’s imposed content differentiation and the differentiation in the reaction lists that the users produce given the differentiated recommendations. The gap between the difference in the recommendation lists and the difference in the reaction lists is a degree of the users’ approval or disapproval of the imposed differentiation. A disapproval indicates either under-personalization or over-personalization.

Our pull–push metric differs from “the potential for personalization” in that it deals with the content differentiation and production of recommendation lists, as opposed to the ranking of a recommendation list. In our case, the potential for personalization, if there is one, is a function of the difference in the items of recommendation lists; in Teevan et al.’s case, it is a function of the ranking of the recommendation lists. Our measure and Teevan et al.’s are complementary, covering both the generation of recommendation lists and their subsequent ranking, and we will show later in this work how they can be combined to produce a score for the potential for improvement. The entire area (Available Items + Shown + Not Shown + Clicked + Not Clicked) of Fig. 1 can be targeted by the pull–push metric, as shown later when we discuss the application of the metric.

3 Method

Fig. 2 Under-personalization–Over-personalization. Starting from a balanced position, one can either go in the direction of under-personalization risking information overload, or in the direction of over-personalization risking user isolation

Starting out from a balanced level of personalization in Fig. 2, one can go either in the direction of less personalization or more personalization.

A good recommender system must strive to find the middle ground, the right level of personalization. The pull–push measures an RS’s content differentiation level with respect to the balanced level. The following two concepts underpin our metric.

  • Differentiation through pair-wise difference

    Personalization imposes differences between recommendation lists for different users on the basis of a certain user model. This content differentiation happens in an environment that is in a state of flux, where both items and user interests are dynamic and evolving. We, therefore, cannot construct a stable “reference frame” for the computation of the success of the production of differentiated recommendations. To overcome this, we conceive of content differentiation as a function of the pairwise differences in the user recommendation lists and the resulting reaction lists. This conceptualization abstracts away from the actual recommendations and the resulting reactions, requiring only the preservation of the proportionality of similarities/differences in the recommendation lists and in the reaction lists.

  • User-centricity

    From the user-centric perspective, it is important to satisfy the natural differences in interests and preferences. When the proportion of difference a recommender system imposes during recommendation is approved by the users in their reactions/consumption, we consider the recommender system successful in meeting the actual user differences. When that is not the case, the users are either in a state of pull or push, expressing disapproval, implicitly, by their selective behaviors.

User clicks on recommendations, or the lack thereof, are reactions to the personalized recommendation list. It is fair to assume that an alteration in the recommended items would result in an alteration in the clicked items. This coupled nature of recommendations and subsequent user reactions means that user reactions are at best a tentative representation of the user’s actual interest. In this work, we abstract away from the actual recommendations and the resulting reactions, and view an RS’s differentiation success as the proportionality of the differentiation in the recommendations and the differentiation in the resulting user reactions. Proportionality is easier to observe and likely more stable, as it is independent of the particular recommendation lists and the resulting reaction lists.

Let X and Y represent the two recommendation lists presented to two users. \({X \cup Y}\) is the union of the recommendation lists, and \({X \cap Y}\) is the set of the shared recommendations (the intersection). The proportion of the intersection of the items to the union of the items (see Eq. 1) is the magnitude of similarity that the RS has imposed between the users. From this, we define \({\sigma _{\mathrm{rec}}}\) (see Eq. 2), which is the magnitude of difference that the RS imposed between the pair of users.

$$\begin{aligned}&J(X,Y) = \frac{|X \cap Y|}{|X \cup Y|} \end{aligned}$$
(1)
$$\begin{aligned}&\sigma _{\mathrm{rec}}(X,Y) = 1- J(X,Y) = 1-\frac{|X \cap Y|}{|X \cup Y|} \end{aligned}$$
(2)

The pair of users will react to the differentiated recommendation lists, resulting in corresponding reaction lists \({X'}\) and \({Y'}\). As with the recommendations, we can obtain the set of shared items (the intersection) and the union of the items. The proportion of the shared items to the union of the items is the magnitude of similarity between the two users according to the users themselves, given the differentiated recommendations. Using this, we also define \({\sigma _{\mathrm{react}}}\) as the magnitude of difference observed between the two users, as in Eq. 3.

$$\begin{aligned} \sigma _{\mathrm{react}}(X',Y') = 1- J(X',Y') = 1-\frac{|X' \cap Y'|}{|X' \cup Y'|} \end{aligned}$$
(3)
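
As a minimal illustration of Eqs. 1–3, the following Python sketch computes the Jaccard-based \({\sigma }\) scores for a hypothetical pair of recommendation sets and the resulting reaction sets (the item identifiers are invented for the example):

```python
# Minimal sketch of Eqs. (1)-(3): Jaccard index and the derived distance sigma,
# applied to hypothetical recommendation sets (X, Y) and reaction sets (X', Y').

def jaccard(a: set, b: set) -> float:
    """Jaccard index |a intersection b| / |a union b| (Eq. 1)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0  # treat two empty sets as identical

def sigma(a: set, b: set) -> float:
    """Jaccard distance 1 - J(a, b) (Eqs. 2 and 3)."""
    return 1.0 - jaccard(a, b)

X, Y = {"n1", "n2", "n3", "n4"}, {"n1", "n2", "n5", "n6"}  # recommendations
X_react, Y_react = {"n1", "n2"}, {"n1", "n2"}              # resulting clicks

sigma_rec = sigma(X, Y)                 # difference imposed by the RS (~0.67)
sigma_react = sigma(X_react, Y_react)   # difference observed in the reactions (0.0)
```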

If the magnitude of the difference in the reactions is the same as the magnitude of the difference that was imposed in the recommendations, that is \({\sigma _{\mathrm{rec}}= \sigma _{\mathrm{react}}}\), then the users seem to approve, implicitly, of the RS’s differentiation. We define this condition as the balanced state. We consider this condition ideal, assuming that satisfying the interests of both the recommendation provider and the users is the goal. The magnitudes of difference in the recommendations and in the reactions can, however, diverge. The following are all the possible scenarios.

  • Balanced state \({\sigma _{\mathrm{rec}}= \sigma _{\mathrm{react}}}\)

    This is the situation where the proportion of shared items to the union of items between the pair of users in the recommendation holds in the reactions of the users to the recommendations. Alternatively, this is the state where the distance in the recommendation lists (\({\sigma _{\mathrm{rec}}}\)) and the distance in the resulting user reaction lists (\({\sigma _{\mathrm{react}}}\)) are the same.

  • Push \({\sigma _{\mathrm{rec}}< \sigma _{\mathrm{react}}}\)

    This happens when the proportion of shared items to the union of items in the recommendations is greater than the one in the reactions. Alternatively, this is the state when the distance in the recommendations (\({\sigma _{\mathrm{rec}}}\)) is less than the distance in the resulting user reactions (\({\sigma _{\mathrm{react}}}\)). This happens when users diverge from each other—hence pushing each other away—by consuming a proportionally larger number of the items not in common and fewer of the shared items. This signals that the RS’s differentiation of the recommendation lists is under-personalized. The bigger the magnitude of the difference between the distances, the larger the under-personalization according to the users.

  • Pull \({\sigma _{\mathrm{rec}}> \sigma _{\mathrm{react}}}\)

    It is the opposite of push and happens when the proportion of shared items to the union of items in the recommendations is smaller than the one in the reactions. In terms of distance, this is the state when the distance imposed in the recommendations (\({\sigma _{\mathrm{rec}}}\)) is greater than the distance observed in the resulting reactions (\({\sigma _{\mathrm{react}}}\)). This happens when users come closer to each other—hence pulling each other—by consuming a proportionally larger number of the items in common and fewer of the items not in common. This signals that the RS’s differentiation in the recommendation lists is over-personalized beyond what the pair of users want between them.

We can now delve into a more detailed explanation of the pull–push metric.

Fig. 3 The mapping of the recommendation space to the reaction space. For a balanced content differentiation, the distances between the users in the recommendation space must be preserved in the mapping to the reaction space. That means, for example, \({d(u_{1}, u_{8})} ={d(f(u_{1}), f(u_{8}))}\)

Let the recommendation space be the set of users in the recommendations with a distance metric on the elements, and the reaction space the set of users in the reactions to the recommendations (e.g., user clicks) with the same distance metric on them. We view the recommendation space and the reaction space as metric spaces that are related to each other by a mapping, as shown in Fig. 3. For a recommender system to be in the balanced state, the mapping from the recommendation metric space to the reaction metric space must preserve the distances between users. Mathematically, the mapping from the recommendation metric space to the reaction metric space must be isometric. For our case, it means the distance between each pair of users \({u_{i}}\) and \({u_{j}}\) must remain the same before and after the mapping, i.e., Eq. 4 must hold. It is this state of isometry that we consider a recommender system’s balanced personalization. A good recommender system should be able to impose differences in the recommendation lists that result in proportional, distance-preserving reactions. Deviation from the balanced personalization is then considered a measure of disapproval of the degree of personalized differentiation.

$$\begin{aligned} d(u_{i}, u_{j}) = d(f(u_{i}), f(u_{j})) \end{aligned}$$
(4)
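
A small sketch of the balanced-state condition in Eq. 4, using the Jaccard distance on hypothetical per-user recommendation and reaction sets: the mapping is balanced only if every pair-wise distance is preserved.

```python
# Check whether the mapping from recommendation space to reaction space is
# isometric (Eq. 4), i.e., whether all pairwise distances are preserved.
from itertools import combinations

def jaccard_distance(a: set, b: set) -> float:
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)

# Hypothetical per-user recommendation and reaction sets
rec = {"u1": {"a", "b", "c"}, "u2": {"a", "b", "d"}, "u8": {"e", "f", "g"}}
react = {"u1": {"a", "b"}, "u2": {"a", "b"}, "u8": {"e"}}

def is_isometric(rec, react, d=jaccard_distance, tol=1e-9):
    """True if d(u_i, u_j) == d(f(u_i), f(u_j)) for every pair of users."""
    return all(
        abs(d(rec[i], rec[j]) - d(react[i], react[j])) <= tol
        for i, j in combinations(rec, 2)
    )
```
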
Fig. 4 The imposed and resultant distances of a personalized recommendation using the recommendations and clicks of two users. Arrows show that recommendations influence clicks. The difference between RDistance and CDistance must be 0 in a balanced personalized differentiation

3.1 The pull–push score—\({\delta }\)

Figure 4 shows the relationship between recommendations and clicks for two users, \({u_{1}}\) and \({u_{2}}\). User \({u_{1}}\) is served \({R_{1}}\) and has consumed \({C_{1}}\). Similarly, user \({u_{2}}\) is served \({R_{2}}\) and has consumed \({C_{2}}\). The arrows from Rs to Cs show the coupling—the direction of influence of recommendations on clicks.

If recommendation lists for two users differ, that difference is the result of (personalized) differentiation by the recommender system. Users will click on some items and not on others and thus will have click vectors that differ from their respective recommendation vectors. The difference between the clicks is the result of the difference imposed by the RS and the difference created by the users’ own selection. The discrepancy between the difference in recommendations and the difference in clicks shows the magnitude of push or pull. We call this discrepancy between the difference in the recommendations and the difference in the user reactions the pull–push score.

The pull–push computes (1) the distance between the recommendation vectors (\({\sigma _{\mathrm{rec}}}\)) to quantify the difference imposed, (2) the distance between the click vectors (\({\sigma _{\mathrm{react}}}\)) to quantify the resulting difference and (3) the difference between the distances to obtain the pull–push score. Mathematically, for a pair of users \({u_{i}}\) and \({u_{j}}\), we define \({\sigma _{\mathrm{rec}}}\) in Eq. 5 and \({\sigma _{\mathrm{react}}}\) in Eq. 6.

$$\begin{aligned}&\sigma _{\mathrm{rec}_{u_{i}u_{j}}}= d(R_{u_{i}}, R_{u_{j}}) \end{aligned}$$
(5)
$$\begin{aligned}&\sigma _{\mathrm{react}_{u_{i}u_{j}}}= d(C_{u_{i}}, C_{u_{j}}) \end{aligned}$$
(6)

\({\sigma _{\mathrm{rec}}}\) is the distance that the system estimates and maintains between the two users. Not all recommendations result in clicks. Within the imposed differentiation, users have the freedom to consume some items and ignore others. For example, a certain RS may recommend to \({u_{1}}\) and \({u_{2}}\) a number of (shared) items on Joe Biden and on Donald Trump and a number of unshared items. The two users may ignore the shared items on Joe Biden and Donald Trump and read only their respective unshared items, or they may read only the shared items and ignore the unshared items. In the first case, the users would be pushing each other away, signalling that the items in the overlap are not of interest to them. In the second case, they would be coming together, “telling” the system that they like the shared content more than the other items.

\({\sigma _{\mathrm{react}}}\) is the measure of how different the users are in terms of the content they choose to consume given the distance imposed by the recommender system. Using both \({\sigma }\)s, we define \({\delta }\) (the pull–push score), in Eq. 7, as the difference between \({\sigma _{\mathrm{rec}}}\) and \({\sigma _{\mathrm{react}}}\).

$$\begin{aligned} \delta _{u_{i} u_{j}} = \sigma _{\mathrm{rec}_{u_{i} u_{j}}} - \sigma _{\mathrm{react}_{u_{i} u_{j}}} \end{aligned}$$
(7)
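
The three steps translate directly into code. The following sketch computes the pull–push score of Eq. 7 for one pair of users from any normalized distance function d (for example, the Jaccard distance above) and maps its sign to the three states of Sect. 3; the inputs are assumed to be the users’ recommendation and click vectors (or sets):

```python
# Pull-push score (Eq. 7) for one pair of users, given a normalized distance d
# on the recommendation and click vectors (or sets) of the two users.

def pull_push(d, R_i, R_j, C_i, C_j):
    sigma_rec = d(R_i, R_j)          # Eq. (5): difference imposed by the RS
    sigma_react = d(C_i, C_j)        # Eq. (6): difference observed in the clicks
    return sigma_rec - sigma_react   # Eq. (7)

def state(delta, eps=1e-9):
    """Map the sign of the score to the three states of Sect. 3."""
    if delta > eps:
        return "pull (over-personalization)"
    if delta < -eps:
        return "push (under-personalization)"
    return "balanced"
```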

3.2 Properties of the pull–push metric

When \({\sigma _{\mathrm{rec}}}\) is 1, the pull–push metric is undefined. By definition, a \({\sigma _{\mathrm{rec}}}\) score of 1 means that the pair of users are in a state of absolute isolation from each other. In this state, they have no shared items and we cannot infer a meaningful score. For any \({\sigma _{\mathrm{rec}}}\) score in the interval [0, 1), the \({\sigma _{\mathrm{react}}}\) score falls in the interval [0, 1].

For any \({\sigma _{\mathrm{rec}} \ne 1}\), the \({\delta }\) score falls in the interval \({[\sigma _{\mathrm{rec}}-1, \sigma _{\mathrm{rec}}]}\). Let us consider two special cases, no personalization and extreme personalization, to illustrate this. For no (personalized) content differentiation, \({\sigma _{\mathrm{rec}}=0}\). The \({\delta }\) range for this is \({[0-1, 0]}\) = \({[-1, 0]}\). For extreme content differentiation, the \({\sigma _{\mathrm{rec}}}\) score is close to 1, which means the RS has served nearly completely different content to the pair of users. The \({\delta }\) range for this is \({(1-1, 1] = (0, 1]}\). The \({\sigma _{\mathrm{rec}}}\) score determines the range of \({\delta }\). This is due to the fact that the user reactions are contingent upon the recommendations. If an RS has imposed a distance of 0.8 between a pair of users, the users have a maximum further distance of 0.2 to become more different. They cannot be more different than that since they cannot go beyond mutually exclusive consumption. They can, however, be as similar as they want by the amount of shared content they choose to consume. Conversely, if the RS does little or no content differentiation between two other users, the users have a large possibility of consuming content as different as they want, but a smaller possibility of being more similar.

The potential \({\delta }\) scores for all pairs of users in an RS fall in the interval \({[-1, 1)}\). Let us consider two special cases of \({\sigma _{\mathrm{rec}}}\) and \({\sigma _{\mathrm{react}}}\) to determine the bounds of the interval. For a pair of users with no content differentiation between them, \({\sigma _{\mathrm{rec}}=0}\). Given this, let us assume the user reactions result in \({\sigma _{\mathrm{react}}=1}\), which means the users ended up consuming exclusively different content. The \({\delta }\) score is \({0-1 =-1}\). This means the RS predicted the users to have exactly similar interests; the users, however, showed that their interests are as different as can be. Now, consider the second case, where the \({\sigma _{\mathrm{rec}}}\) score is almost 1, which means the RS has served nearly completely different content to the pair of users. Given that, the maximum possible \({\delta }\) score is close to 1. This is when the RS predicts the users to be nearly mutually exclusive; the users, however, show that they actually have exactly similar interests. When we combine the highest possible and the lowest possible \({\delta }\) scores, we obtain \({[-1, 1)}\), which is the interval for the potential \({\delta }\) scores for all pairs of users. The practical score depends on the maximum and minimum content differentiation that an RS imposes between pairs of users.

3.3 Interpreting pull–push scores

A \({\delta }\) score shows the magnitude of difference that must be either avoided or imposed, depending on whether it is positive or negative, to arrive at the balanced differentiation. A \({\delta }\) score of 0.5 shows the need to reduce the distance between the pair of users in question by 0.5. A \({\delta }\) score of \({-\,0.5}\), on the other hand, shows the need to impose an additional distance of 0.5 between the users. A balanced differentiation is one that results in \({\delta =0}\). While \({\sigma _{\mathrm{rec}}}\) is the current level of differentiation, \({\sigma _{\mathrm{react}}}\) is the differentiation that the users want given the \({\sigma _{\mathrm{rec}}}\), and \({\delta }\) is the amount of differentiation distance that needs to be added to or subtracted from \({\sigma _{\mathrm{rec}}}\) in order to achieve \({\sigma _{\mathrm{react}}}\).

The pull–push score quantifies the tendency of the users’ responses given the RS’s differentiated recommendations. When \({\delta }\) is positive (Pull), the tendency of the pair of users is to pull each other, or come together. It indicates that the users find the distance imposed in the recommendations more than necessary. A positive score, then, is the degree of disapproval, by the users, of the RS’s imposition of an unnecessarily large difference between them. We see this as a measure of protest, by the users, at the imposed difference, expressed by choosing to consume a larger proportion of the shared content.

If \({\delta }\) is negative (Push), the tendency of the pair of users is to drift apart. Drifting apart signals insufficient content differentiation; their drifting apart is their attempt to avoid (protest) it. A Push is a measure of disapproval, by the users, of the lack of enough content differentiation by the RS.

The \({\delta }\) score is defined for a pair of users. We can aggregate \({\delta }\) scores to obtain group- or system-level averages, but we need to be careful not to sum positive and negative scores together, as they would cancel out. By averaging all positive \({\delta }\) scores, we obtain the average degree of disapproval, by the users, of the RS’s unnecessary imposition of difference, and by averaging the negative scores we obtain the average degree of disapproval of the lack of content differentiation. Note that a recommender system can, at the same time, be disapproved of for not doing enough content differentiation for some users and for over-personalizing for other users. A single user can even experience over-personalization with respect to some users and under-personalization with respect to others: a user having a negative \({\delta }\) score with one user and a positive \({\delta }\) score with another is experiencing both. When all pair-wise \({\delta }\) scores are 0, the RS is said to have achieved a balanced level of content differentiation for all users.
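
As a minimal sketch of this aggregation (assuming `deltas` holds all pair-wise \({\delta }\) scores), positive and negative scores are averaged separately so that pull and push do not cancel out:

```python
def aggregate(deltas):
    """Average pull (positive) and push (negative) pull-push scores separately."""
    pulls = [d for d in deltas if d > 0]    # over-personalization
    pushes = [d for d in deltas if d < 0]   # under-personalization
    return {
        "avg_pull": sum(pulls) / len(pulls) if pulls else 0.0,
        "avg_push": sum(pushes) / len(pushes) if pushes else 0.0,
    }
```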

If a certain user-pair has a \({\delta }\) score different from the balanced state of \({\delta =0}\), it shows that there is a potential for improving the recommendation system, by either increasing or decreasing the level of content differentiation.

4 Application on news recommendation datasets

In this section, we demonstrate applications of the metric on news recommendation. Large datasets from news aggregators that provide personalized news recommendations and the resulting user reactions would be especially suited for experimentation, but such datasets are hard to come by. Instead, we chose two news recommendation datasets: one simulated and one real-world. We use the simulated dataset to explore and show some interesting aspects of the pull–push metric. The real-world datasets have some limitations that we explain later. But first, we discuss some of the practical considerations and choices.

4.1 Selection of vector components and users

The components of the recommendation and click vectors can be items served, metadata about the items, named entities or other relevant features. For our experiment, we use items as components in both the simulated and the real-world cases. The values of the vector components are, however, different. For the simulated case, the values are Boolean 0s and 1s. In the real-world case, they are counts of the number of times an item is shown and clicked.

The users can also be of different granularity, such as the individual user, or cohorts of users such as a demographic group or a geographic unit. In the simulated case, we used individual users. In the real-world case, we used the higher-level granularity of geographic units as users. The choice of the geographic unit for the real-world datasets has two advantages. The first is that we wanted to overcome the data sparsity problem, since our dataset is not big enough. News recommendations result in less than 1% of the recommendations being clicked, which, together with a small dataset, can cause a severe sparsity problem, giving the impression that many users have the same reaction lists. The second is that it offers us the opportunity to quantify the RS’s behavior at a higher user granularity than the oft-used level of the individual user. There is evidence that geography [11], demography [20] and educational background [12] affect the consumption of information. Given these limitations and opportunities, we found that applying the pull–push metric at the cohort level both overcomes certain problems of our available dataset and offers an opportunity.

4.2 Selection of distance metrics

One can use different distance metrics depending on one’s needs and goals. It is important to normalize the vectors such that \({\sigma _{\mathrm{rec}}}\) and \({\sigma _{\mathrm{react}}}\) are comparable. In this study, we experimented with two different distance metrics. For the simulated case, we used the Jaccard distance, which measures the difference (dissimilarity) between sets. When aggregating users into groups, one might be interested in viewing recommendations and reactions as distributions instead of as sets. We do exactly that in the real-world application, where we use the Jensen–Shannon divergence (JSD) distance metric, which is suited for comparing distributions.

4.3 Application on simulated datasets

First, consider the user recommendation list and the reaction list as sets. For a distance metric, we use the Jaccard distance, given in Eq. 2. Table 1 shows simulated data of users, sets of recommendations and the resulting sets of user reactions. We use this simulated data to highlight some possible pull–push scores that can show interesting relationships between recommendations and the resulting user reactions.

Table 1 Simulated data of an RS’s recommendations showing users, sample sets of recommendations, and sets of the resulting user reactions

Let us consider a few pairs of users that are interesting. Users \({u_{1}}\) and \({u_{2}}\) are recommended exactly the same set of items. The Jaccard distance between their recommendation lists is, therefore, 0. Given these exact sets of recommendations, the users ended up consuming exclusively different content. The Jaccard distance between their clicks is, therefore, \({\sigma _{\mathrm{react}}=1}\). The pull–push score for this pair of users is \({0-1 =-1}\). The RS predicted that this pair of users has exactly the same interests; the users disagreed by consuming as different content as possible.

By contrast, users \({u_{3}}\) and \({u_{4}}\) are recommended nearly mutually exclusive sets of items, that is, a distance score of \({\sigma _{\mathrm{rec}}=0.8}\). Despite that, they ended up consuming exactly the same sets of items, resulting in \({\sigma _{\mathrm{react}}=0}\). This gives a pull–push score of 0.8. This positive pull–push score is the magnitude of unwanted difference imposed between them. According to the users themselves, less differentiation of content between them would have been better to meet their interests. In this case, the RS predicted a distance of 0.8, but the users wanted a distance of 0, a state of no content differentiation between them.

The cases of the two pairs of users, \({u_{1}}\) and \({u_{2}}\), and \({u_{3}}\) and \({u_{4}}\), show users turning out to be completely the opposite of what the RS predicted them to be. Users can also protest the RS’s differentiation by ending up more different or more similar than the RS makes them out to be. For instance, users \({u_{6}}\) and \({u_{7}}\) are recommended the same sets of items as users \({u_{3}}\) and \({u_{4}}\). In both cases \({\sigma _{\mathrm{rec}}=0.8}\). The user reactions are, however, very different. Users \({u_{3}}\) and \({u_{4}}\) ended up consuming exactly the same sets of items, despite the large content differentiation imposed by the RS. Users \({u_{6}}\) and \({u_{7}}\), by contrast, ended up consuming even more different items, resulting in a pull–push score of \({\delta =-0.2}\). Given exactly the same sets of highly differentiated recommendations, one pair of users ended up consuming exactly the same content, and another pair asked for even more content differentiation, to the point of mutual exclusivity. Similarly, given highly similar recommendations, users can end up consuming either mutually exclusive or even more similar items.

Users \({u_{1}}\) and \({u_{5}}\) are recommended exactly the same sets of items, and they have also consumed exactly the same sets of items. The pull–push score for this pair of users is 0.0.
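
Table 1 itself is not reproduced here, but the following hypothetical item sets are chosen so that the Jaccard-based scores match the pairs discussed above:

```python
def sigma(a: set, b: set) -> float:
    """Jaccard distance (Eq. 2)."""
    union = a | b
    return 1.0 - (len(a & b) / len(union) if union else 1.0)

recs = {
    "u1": {"a", "b", "c", "d"}, "u2": {"a", "b", "c", "d"},  # identical recommendations
    "u3": {"a", "b", "c"},      "u4": {"a", "d", "e"},       # sigma_rec = 0.8
    "u5": {"a", "b", "c", "d"},
    "u6": {"a", "b", "c"},      "u7": {"a", "d", "e"},       # sigma_rec = 0.8
}
clicks = {
    "u1": {"a", "b"}, "u2": {"c", "d"},   # disjoint clicks despite identical recommendations
    "u3": {"a"},      "u4": {"a"},        # identical clicks despite distant recommendations
    "u5": {"a", "b"},
    "u6": {"b"},      "u7": {"d"},        # disjoint clicks on distant recommendations
}

for i, j in [("u1", "u2"), ("u3", "u4"), ("u6", "u7"), ("u1", "u5")]:
    delta = sigma(recs[i], recs[j]) - sigma(clicks[i], clicks[j])
    print(i, j, round(delta, 2))   # -1.0, 0.8, -0.2, 0.0
```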

The simulated examples above show some possible differentiations and some possible resulting user reactions. The magnitude and sign of the pull–push score show the degree to which users agree or disagree with the RS’s differentiation.

4.4 Application on real-world datasets

For this part of our experiment, we use datasets of recommendations and clicks collected from a real-world recommender system. We view the recommendations and the clicks as distributions, as opposed to sets, because (1) the counts of recommendations and clicks are not suited for set-based processing, (2) sometimes there is a need to treat recommendations and clicks as distributions and (3) we want to use this opportunity to show a distribution-based distance metric.

Our datasets were extracted from user interaction history in the Plista platform, a recommendation service provider that offered the Open Recommendation Platform (ORP). The platform brought together online content publishers in need of recommendation services and news recommendation service providers that delivered recommendations by plugging their algorithms into the platform. The recommender we used is a simple recency-based recommender that does little content differentiation, as it recommends the most recent/popular items.

Several media outlets consumed recommendation services on the platform. For our analysis, we chose two popular German news and opinion portals: Tagesspiegel and KStA. For user groups, we chose the 16 states of Germany.

4.4.1 Data pre-processing

For each state of Germany, we prepared recommendation and click vectors. The components of the vectors are items, and the values are the number of times the items have been shown and clicked in the state. A sample of the recommendation and click vectors for two states, Berlin and Bavaria, is shown in Table 2. We prepared such vectors of recommendations and clicks for each of the 16 states of Germany, which results in 16 pairs of recommendation–click vectors, one pair for each German state. The union of all the items that appeared in the recommendations in any of the geographical regions was used as the vector components, thus harmonizing the vector components across all states. We then applied add-1 smoothing to both vectors. The vectors were then converted to conditional probabilities of \(recommendation \mid state\) and \(click \mid state\) by dividing the vectors by the sum of all recommendations and the sum of all clicks, respectively. As conditional probabilities with the same dimensions, the vectors are normalized and directly comparable. Using these conditional probabilities, we compute the pull–push score for each pair of states.

Table 2 A sample of Recommendations (Recom) and click vectors for Berlin and Bavaria before smoothing and normalization
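
As a sketch of this pre-processing (the item identifiers and counts below are hypothetical, and the real vectors have one component per item in the union over all states):

```python
import numpy as np

items = ["item_1", "item_2", "item_3"]                          # union of items over all states
rec_counts = {"Berlin": [120, 30, 0], "Bavaria": [80, 0, 45]}   # times shown per state
click_counts = {"Berlin": [4, 1, 0], "Bavaria": [2, 0, 3]}      # times clicked per state

def to_probability(counts):
    """Add-1 smoothing followed by normalization to P(item | state)."""
    smoothed = np.asarray(counts, dtype=float) + 1.0
    return smoothed / smoothed.sum()

rec_dist = {state: to_probability(v) for state, v in rec_counts.items()}
click_dist = {state: to_probability(v) for state, v in click_counts.items()}
```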

4.4.2 Application

For a distance metric, we use Jensen–Shannon Divergence (JSD), defined in Eq. 8. JSD is a symmetric and normalized distance metric based on KL-divergence, defined in Eq. 9. JSD is suited for calculating distances between distributions. After estimating probability distributions for recommendations and clicks, we use JSD to compare how different they are.

$$\begin{aligned}&\mathrm{JSD}(X,Y) = \sqrt{ \frac{1}{2} \mathrm{KL}(X, \frac{(X+Y)}{2}) + \frac{1}{2} \mathrm{KL}(Y, \frac{(X+Y)}{2})} \end{aligned}$$
(8)
$$\begin{aligned}&\mathrm{KL}(X,Y)=\sum \limits _{i} x_{i}\ln {\frac{x_{i}}{y_{i}}} \end{aligned}$$
(9)
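
The following sketch transcribes Eqs. 8 and 9 and uses them as the distance in Eq. 7; it assumes the smoothed, normalized probability vectors produced in the pre-processing sketch, whose strictly positive entries keep the KL terms well defined:

```python
import numpy as np

def kl(x, y):
    """KL divergence (Eq. 9); x and y are strictly positive probability vectors."""
    return float(np.sum(x * np.log(x / y)))

def jsd(x, y):
    """Jensen-Shannon distance (Eq. 8)."""
    m = (x + y) / 2.0
    return float(np.sqrt(0.5 * kl(x, m) + 0.5 * kl(y, m)))

def pull_push_jsd(rec_dist, click_dist, s1, s2):
    """Pull-push score (Eq. 7) for a pair of states, with JSD as the distance."""
    return jsd(rec_dist[s1], rec_dist[s2]) - jsd(click_dist[s1], click_dist[s2])

# Example (with the hypothetical vectors from the pre-processing sketch):
# pull_push_jsd(rec_dist, click_dist, "Berlin", "Bavaria")
```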

Using JSD as the distance metric, we applied the pull–push metric to the recommendation and click vectors of the 16 German states. We did this for the two news portals, Tagesspiegel and KStA. The pull–push scores for the two publishers are all negative, averaging \({-\,0.224}\) for Tagesspiegel and \({-\,0.213}\) for KStA. Both scores indicate a lack of enough content differentiation, which is unsurprising for a recency-based recommender system. The smaller average magnitude for KStA can be explained by the fact that KStA has a more geographically local readership compared to Tagesspiegel, which is a nationally read portal [11]. A more local readership means less geographical diversity in interests and preferences.

The pair-wise pull–push scores for 11 select German states are presented in Table 3. The upper diagonal shows the pull–push scores for Tagesspiegel and the lower diagonal those for KStA. The diagonal scores are 0 because there is no pull–push between a state and itself. When we compare the pull–push magnitudes for the same pairs of states in Tagesspiegel and in KStA, we find that, in most cases, the scores in Tagesspiegel are greater than those in KStA. For instance, the pull–push score for Berlin–Saarland is \({-\,0.404}\) in Tagesspiegel, while it is \({-\,0.194}\) in KStA. The magnitudes of the scores indicate that the need for more content differentiation between the state of Saarland and the state of Berlin is greater in the news portal of Tagesspiegel than in the news portal of KStA. The two largest pull–push magnitudes are, however, found in KStA: Westphalia–Bremen (\({-\,0.569}\)), followed by a second pairing at \({-\,0.561}\). These pairs of states have shown the largest interest for more content differentiation.

The pair-wise pull–push scores are presented in the multidimensional scaling visualizations in Figs. 5 and 6 for easy viewing. As can be observed from the figures, the states in Tagesspiegel are generally more scattered, indicating a larger gap between the states’ needs for content differentiation and the current differentiation levels. Looking at specific pair-wise scores for Tagesspiegel, we find that Berlin pairings show the greatest push. Next to Berlin, Brandenburg pairings show greater push. A possible explanation is the fact that Tagesspiegel is primarily a local portal for Berlin and Brandenburg, whose mutual pull–push distance is not very big, as can be seen from the figure. As such, Berlin and Brandenburg are showing a need for more differentiated content, probably more local recommendations, as opposed to the other states, which would probably be more interested in national-level news.

For KStA, Westphalia pairings exhibit the tendency to differentiate from each other. This can also be explained by the fact that KStA is based in Cologne and most of its readers come from Westphalia (the state in which Cologne is located) [11]. That means that the scores indicate more need for content differentiation between Westphalia and the other states. As Tagesspiegel is local to Berlin and Brandenburg, so is KStA to Westphalia.

The content differentiation needs indicated by the pair-wise scores in Tagesspiegel and in KStA are consistent with intuitions and previous findings on the impact of geography on news consumption [11]. For example, both Tagesspiegel and KStA have a local news category, and a previous report [11] showed that those categories are mainly read in Berlin (and Brandenburg) and in Cologne (Westphalia), respectively. Apparently, the recommender system has not captured these trends, hence the larger push scores we observe.

Table 3 Adjacency matrix of pull–push scores for 11 select states of Germany
Fig. 5 A multidimensional scaling of the pull–push scores for Tagesspiegel. The visual distance between states is proportional to the magnitude of the pull–push score. We observe that the highest distances are Berlin-pairings followed by Brandenburg-pairings

Fig. 6 A multidimensional scaling of the pull–push scores for KStA. Visual distances between states are proportional to the magnitude of pull–push scores. We observe that the highest distances are between Westphalia and the other states

The pull–push scores give the opportunity to observe the impact of content differentiation at both the aggregate and the pair-wise levels at the chosen user granularity. At the aggregate level, they show the degree of differentiation success for the system or a section of users. At the pair-wise level, they show a fine-grained score for pairs of users, opening the possibility of selective intervention without affecting other pairs. For example, one can decide to fine-tune only the differentiation between Westphalia and the other states, without affecting the remaining pairs.

5 Discussion

The pull–push metric is a generic user-centric metric that measures a RS’s content differentiation success/failure at meeting user content differentiation needs as a function of the differences between user-pair differences in recommendations and in the resulting user reactions. It is a versatile metric allowing the choices of the vector components or the members of the set, the distance metric and the user granularity—one can use demographic levels, geographic cohorts or gender groups. The ability to produce average scores and pair-wise scores provides RS practitioners and researchers with a useful and practical measurement to zoom in and zoom out to gain insights and to subsequently intervene.

The pull–push measure is related to several other measures. The pull–push score concerns itself with content differentiation into recommendation lists. Personalization, however, also covers the ranking of the recommendation lists. Below, we show how pull–push’s measure of the differentiation of recommendation lists and “the potential for personalization” can be combined to produce what we call the “potential for improvement.” We also discuss the pull–push score in relation to popularity bias and how the latter can be penalized, if need be. In measuring personalization, one comes face-to-face with normative standards on news journalism. We therefore discuss the pull–push score in relation to this too. Finally, we discuss the limitations and weaknesses of the metric.

5.1 Potential for improvement

In Sect. 2, we explained the difference between the pull–push metric and Teevan et al.’s “potential for personalization” [28]. The two metrics concern themselves with different stages of the recommendation process: the pull–push metric with the content differentiation into recommendation lists, and the potential for personalization with the ranking of the recommendation list. When using the potential for personalization, if the user clicks the recommended items in the order of their ranking, then the potential for personalization (henceforth \({\gamma }\)) is 0. There are also other differences related to this main difference. The potential for personalization does not penalize an RS for showing more irrelevant items as long as the ranking of the clicked items is correct. Recommending 20 items whose three top-ranked items are clicked and recommending 3 items which are all clicked have the same potential for personalization. In the pull–push case, an RS is penalized for showing items that are not clicked.

In our case, there are two avenues for improvement. One is when the pull–push score is in the push state (a negative pull–push score), indicating that the users find the content differentiation between them falling short of meeting their preferences. This can also be viewed as the amount of potential for personalization at the content differentiation stage, as opposed to the ranking stage. In the pull–push case, there is a potential for personalization for a user wherever the pull–push score with another user is negative. If we sum up all the negative pull–push scores a user has with other users and average them, then we have the average potential for further differentiated recommendations for that user. If we do the same for all users, we can then compute the average potential for personalization for all users.

The two measures of personalization differ, but they complement each other. Both indicate potentials for differentiation (personalization), but at two different stages (the selection of a recommendation list and the ranking of the recommendation list) of the recommender system. They can be combined (assuming equal weight) as in Eq. 10 to obtain the total potential for personalization (pp) at both stages. We use |push| because push is negative and the sign is not needed in this case. By dividing the score by two, it falls in the interval [0, 1], where 0 indicates no potential for personalization at either stage and 1 indicates the highest possible potential for content differentiation. The highest potential is when \({\delta =-1}\) and \({\gamma =1}\). Essentially, therefore, the amount of potential for personalization in this combined sense is the amount of further effort needed to select the items the user wants and to rank them according to how the user would click them.

$$\begin{aligned} \mathrm{pp} = \frac{|\delta _{\mathrm{neg}}|+\gamma }{2} \end{aligned}$$
(10)

The other potential for improvement in the pull–push case is when the pull–push score is in the pull state (a positive pull–push score), indicating that the users find the content differentiation more than necessary. Since this score is not the desired balanced state, it can be seen as a potential for improvement. We refrain from calling it a potential for personalization because a positive score indicates the need for less personalization. In the sense of the effort needed to bring the recommendation list to what the users want, however, it represents a potential for improvement. Since a positive score, a negative score and the potential for reranking all indicate the need for further effort to satisfy user interests, we can combine them all to obtain the total potential for improvement (pi) as in Eq. 11. The potential for improvement again falls in the interval [0, 1]. When one needs to take action, however, one first needs to check the sign of the \({\delta }\) score to see whether to do less or more personalization at the level of the production of the recommendation lists.

$$\begin{aligned} \mathrm{pi} = \frac{|\delta |+\gamma }{2} \end{aligned}$$
(11)
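
A minimal sketch of Eqs. 10 and 11, assuming \({\delta }\) is a pair-wise pull–push score in \({[-1, 1)}\) and \({\gamma }\) a normalized potential-for-personalization score in [0, 1] for the ranking stage:

```python
def potential_for_personalization(delta, gamma):
    """Eq. (10): only the push (negative) part of delta counts."""
    return (abs(min(delta, 0.0)) + gamma) / 2.0

def potential_for_improvement(delta, gamma):
    """Eq. (11): both pull and push count as room for improvement."""
    return (abs(delta) + gamma) / 2.0
```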

5.2 The pull–push metric and popularity bias

The pull–push metric is in part a precision-oriented metric [2] in that the score calculation starts with the recommendation list (top-N) and the resulting user reaction list. A known problem of top-N-based evaluation metrics is popularity bias, which is the tendency to reward a recommender system that recommends popular items, as opposed to rewarding algorithms that personalize content according to user needs [6, 9].

The dominant opinion is that popularity bias is an undesirable bias and should therefore be removed [3, 9, 16, 26]. Cañamares and Castells [6], however, ponder whether we actually want to get rid of popularity bias. They ask: if recommending popular items happens to be the right thing to do, should recommending them not be favored and rewarded? Regardless of the opposing views on whether popularity bias should be removed or not, popularity bias is a prevalent property of recommender systems and their evaluation metrics. We would, therefore, like to discuss the pull–push metric in relation to it.

The pull–push metric is susceptible to popularity bias in that a recommender system that recommends popular items can achieve as good a pull–push score as one that does good personalized recommendation. For instance, if a certain recommender system recommends the top 3 popular items to two users, and both users click on all three of them, then the pull–push score would be 0, indicating a balanced content differentiation. While this may not be a problem for a recommendation provider interested in maximizing clicks, or for a user who is content with being recommended popular items, there are situations where one would like to diversify content recommendations or to penalize popularity bias.

One way to penalize popularity bias in the pull–push metric is to look at the \({\sigma _{\mathrm{rec}}}\) score in addition to the \({\delta }\) score. A higher \({\sigma _{\mathrm{rec}}}\) shows a higher level of content differentiation (personalization). A recommender system with a higher \({\sigma _{\mathrm{rec}}}\) score and a lower \({\delta }\) score shows that the recommendations have been differentiated (personalized) and that users are content with the differentiation.
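
As a sketch of this check (assuming a list of \({(\sigma _{\mathrm{rec}}, \sigma _{\mathrm{react}})}\) pairs for all user pairs), one can report the average imposed differentiation alongside the average deviation from the balanced state:

```python
def differentiation_report(pairs):
    """pairs: iterable of (sigma_rec, sigma_react) tuples for all user pairs."""
    sigma_recs = [r for r, _ in pairs]
    deltas = [r - c for r, c in pairs]
    return {
        "avg_sigma_rec": sum(sigma_recs) / len(sigma_recs),          # how much the RS differentiates
        "avg_abs_delta": sum(abs(d) for d in deltas) / len(deltas),  # how far from balanced
    }
```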

5.3 Pull–push score, normative standards, filter bubble and fairness

Developing a metric to quantify recommendation success is tricky because it touches on the doubly contested area of normative standards for journalism. Normative standards for human journalism are pluralistic and already contested; normative standards for algorithmic journalism are doubly contested [22]. As journalism is a normative activity, scholars state the importance of going beyond descriptive investigation to consider the normative implications (even if contested) of algorithmic recommendation [22]. Encouraged by this call, we attempt to relate the metric to the discussion on normative standards and to ground it in a particular normative framework.

Natali Helberger [14] outlines three democratic models, namely liberal, participatory and deliberative, that are used in assessing media, and she discusses their implications for news recommender systems. In the liberal democratic model, recommender systems put user interests and preferences center stage. Under this model, it is the prerogative of citizens to choose what information they need, and it is fine for a news platform to provide information items customized to the needs of the user. In the participatory model, the participatory recommender needs to make sure recommendations are a fair and inclusive representation of different ideas and opinions in society, in addition to helping the user gain a deeper understanding and feel engaged. This democratic model operates out of principles to nudge users toward “powerful ideas and opinions.” In the deliberative model, the media assume a public forum function where “the different ideas and opinions in a democratic society can be articulated, encountered, debated and weighed.”

The pull–push metric, as a user-centric metric of content differentiation effectiveness, falls under the liberal democratic model. This means that user interest takes center stage, with the attendant implications. For example, issues of filter bubbles, fairness, inclusiveness and diversity will need to be seen from the perspective of user interest and preference. Personalization is a response to information overload. In fact, we can consider the size of the push score as a measure of the information overload yet to be mitigated to arrive at the user-preferred recommendation list. Under the liberal normative standard, mitigating this information overload is a necessity. As we over-personalize, however, we risk isolating the user in a filter bubble [24], which is a societal concern. In the liberal model, as long as the user does not see the filter bubble as a problem, it is fine and does not need to be avoided, since the user decides for themselves.

Similarly, the presently hot issues of bias and fairness [1, 7, 27] will only be considered from the perspective of user interest. Recommender systems are multi-stakeholder environments. Fairness notions may contradict not only utility, but also the fairness notions of the different stakeholders. What is fairness in this context, fairness for whom, and according to whom? Castillo [7] views algorithmic fairness in ranking from the point of view of the people and organizations that are being searched. Edizel et al. [8] view fairness in recommender systems from the point of view of users. Burke [5] proposes that recommender systems have different fairness requirements for the different stakeholders. News recommendation is different from other recommendation tasks because items are ephemeral, and unlike recommendations whose aim is post-click conversion (buying an item or booking), news recommendation ends with clicking and maybe dwell time. Defining fairness in terms of the predictability of sensitive attributes, as did Edizel et al. [8], does not account for discrimination on legitimate grounds, such as on the basis of different base rates [4]. For example, Edizel et al. [8] present two specific Reddit threads, “makeupaddictions” and “cscareerquestions,” in which 97% and 84% of the comments are submitted by females and males, respectively. It does not make sense, in the liberal democratic model, to eliminate the predictability of gender from the recommendation matrix involving these threads. If the pull–push score is 0, meaning a balanced personalized recommendation for each user according to the users themselves, can the recommender system be considered unfair? We surmise that, while a non-balanced pull–push score may not necessarily indicate unfairness, a balanced score does not imply unfairness either, as far as the user perspective is concerned.

5.4 Limitations of the pull–push

The pull–push measure assumes a sufficient presence of shared items between the recommendation lists for the users to be able to diverge or converge. When over-personalization is so high that the generated recommendations get close to disjoint sets, the pull–push scores are highly distorted. Interpreting a pull–push score in this situation can be problematic. We consider this one limitation.

Another limitation is data sparsity. Since reactions (for instance, clicks on items) are a small fraction of the recommendation list, pull–push is susceptible to data sparsity. In our case, we have minimized this by using the higher user granularity of a geographic region, but data sparsity at the individual user level may be a bigger problem. We recommend the use of large datasets to minimize the impact of data sparsity.

Although not infeasible to integrate in future work, at the moment the metric does not consider time. We have used a batch dataset, but a time series analysis might be more appropriate. Finally, we do not consider our dataset the best possible choice for analysis using this metric, because the recommendation was recency-based, meaning it did almost no personalization. That is why our scores are all negative, indicating push. An additional dataset of recommendation lists and the resulting user reactions from a recommender system that actually implements some personalization would have been a valuable comparison to our dataset, but such datasets are hard to come by in the public domain.

6 Conclusion

Personalized recommendation list generation can match users with items of their interest and lead to increased engagement, but it can also mean that users end up more differentiated than they want. To see whether our differentiated recommendations fall short of the ideal content differentiation or go beyond what is necessary, we introduced the pull–push metric to measure the success of an RS at generating user-preferred recommendation lists. The metric quantifies the degree of pull or push between users using the difference in recommendation lists and the difference in the resulting reaction lists.

In the pull–push measure, the production of personalized recommendations is viewed as the act of imposing some difference (distance) between pairs of users in terms of the items they are recommended. This view of content differentiation is carried into the pull–push metric, offering a novel way of measuring an RS’s effectiveness at meeting users’ differences in interests and preferences through content differentiation. The metric is an abstraction from the actual recommendation lists and the resulting user reactions; it concerns itself with the preservation, in the resulting user reactions, of the differences or similarities introduced at the recommendation stage. The pull–push metric is suited to the practical exploitation of a recommender system, as it can be used to gain insight at different granularities and to suggest a course of action at any stage of the deployment of a recommender system. The metric is versatile, allowing choices of distance metric, user granularity, vector components and their values. With appropriate normalization of the vectors, pull–push scores fall in a bounded interval, making it possible to compare different recommender systems’ personalized differentiation success.

We applied the method to simulated and real-world datasets, using different distance metrics and different granularities of users. In the simulated case, we used the set-theoretic Jaccard distance and individual users; in the real-world case, we used the Jensen–Shannon distance (a distance metric for probability distributions) and cohorts of users defined by geographical units. The abstraction to geographical cohorts of users provided us with two advantages: less data sparsity and an opportunity to examine (personalized) content differentiation at a higher user granularity. In the simulated case, we showed several interesting pull–push scores and their interpretations. In the real-world experiment, the pull–push score suggested the presence of under-personalization (explainable by the importance of recency in news RSs), and therefore a potential for more personalization when the desire is to satisfy user interests.

We have discussed the metric in relation to other closely related metrics on personalization and shown how it can be combined with the “potential for personalization” to serve as a metric of the potential for improvement in the content differentiation and ranking stages of an RS. We also discussed how the metric is susceptible to the known popularity bias of recommender systems and offered a way to penalize the bias in case one wants to do so. We further discussed the metric in relation to normative standards and fairness in recommender systems. Finally, we discussed the limitations of the pull–push metric.

The pull–push metric is user-centric and in tune with the liberal normative vision and the fundamental tenet of personalization, namely that users have differences in information interests and preferences. In the pull–push measurement, a balanced (personalized) content differentiation is one where the proportion of differences imposed during recommendation is approved in the resulting user reactions. If that is not the case, it indicates a degree of disapproval of the RS’s under-personalization (push) or over-personalization (pull) by the users themselves.