1 Introduction

Recommender systems (RSs) are software tools aimed at assisting users in their choice-making processes. An RS's performance is often assessed offline, without the explicit involvement of the system's real users. Following a classical data mining and machine learning approach, a pre-existing data set of users' ratings or purchases is split into training and test parts. The novel RS to be tested is trained on the training part, and the generated recommendations are compared with the data in the test set to measure important performance indicators such as precision and recall (Gunawardana and Shani 2015).

Such an evaluation approach makes the implicit assumption that the ratings or purchases/choices recorded in the test data form a proper ground truth for testing the quality of the RS predictions. However, these ratings or choices might have been collected when users were exposed to a different "treatment", e.g. a different RS, or even no recommendations at all. As a matter of fact, this type of evaluation can simulate, in a simple way, the choices or ratings of a target user for items recommended by an RS, if these items are present in the test set. However, the absence of a recommended item from the test set makes it difficult to judge whether the recommendation is correct or not; usually, this situation is interpreted as a sign that the recommended item is not relevant to the test user. Moreover, it is impossible to determine precisely whether a recommended item is absent from the test set because the user deliberately did not choose/rate it, or because the user was not aware of it. For all these reasons, the significance of such an offline evaluation approach has been criticised. Ultimately, offline evaluations can lead to results that do not necessarily correlate with the performance of the RS when it is measured in an online evaluation (Steck et al. 2021).

In fact, real users are never passive in their choices, as offline evaluation schemes implicitly assume; recommendations are always evaluated by the users, who estimate the value or utility of the alternative candidates and then preferably choose those with the highest perceived utility. This process results in a collection of choices that dynamically evolves during the usage of an RS. Moreover, the RS itself should be repeatedly trained on the new interaction data generated by its own usage. The collective choices of a user population are indeed an interesting phenomenon to observe because they determine the global effect of an RS in a real scenario. That global effect is the core motivation for a platform owner to introduce such recommendation technologies (Abdollahpouri et al. 2020). Hence, properly assessing the true and long-term (longitudinal) influence of RSs on users' choices is a fundamental research subject that apparently can only be conducted in live user experiments.

However, a different and more affordable approach (Hazrati and Ricci 2022a; Zhang et al. 2020), aimed at assessing the long-term performance of an RS when a community of users is influenced in their choices by the RS, has recently been proposed; it adopts simulation and multi-agent techniques (Yao et al. 2021; Ekstrand 2021; Zhang et al. 2020; Umeda et al. 2014; Chaney et al. 2018). In this approach, the system recommendations are evaluated by simulated users, who make simulated choices among the items that are presented to them by the RS. The computational model that determines which choice an individual agent, with a given set of preferences, makes when presented with a set of alternative options has been named "choice model" (Hazrati et al. 2020) or "consumption strategy" (Zhang et al. 2020). In contrast to the classical evaluation paradigm, when an RS is tested in a simulation experiment, a choice model is adopted as well. It is, therefore, possible to simulate users that evaluate and possibly choose any of the recommended items, rather than just those present in the test set. In this fashion, a clearer picture of an RS's performance can be obtained.

In this paper, we follow this new line of research and focus on the analysis of the impact of alternative simulated users' choice models and their interaction with the RS, which generates the options the users can choose from. In fact, as has already been observed, in addition to the specific effect of the presented recommendations, users' choice behaviour is determined by the users' tendency to choose items with specific properties, for instance, more popular or more recent items (Szlávik et al. 2011; Eelen et al. 2015). This kind of behaviour is called "atypical" in the literature (Yao et al. 2021), because it differs from the supposedly typical behaviour of choosing the items that have maximal value or utility (the largest rating). In practice, users have a tendency to consider and even maximise multiple criteria when making choices (Adomavicius et al. 2011). We are interested in performing a sound analysis of the combined effect of an RS and the users' atypical and typical choice behaviour on the overall distribution and quality of the choices. The distribution and quality of the simulated choices are measured by precise metrics, such as the Gini index, the coverage of the catalogue and the average predicted rating of the choices.

Hence, first, we try to understand whether users' choices, observed in standard rating data sets, manifest typical and atypical choice behaviours. In particular, we conjecture that users' choices are "correlated" with distinguished properties of the items, such as their popularity, age (time passed since the item was first available), and predicted rating. We find that this is true: for the majority of the users in the three considered data sets (Apps, Video games and Kindle Books), the users' choices are correlated with the popularity, age and rating of the items; most of the users tend to choose items among the more popular and the newer ones, but also those predicted to have the highest ratings. We note that, to perform the correlation analysis between choices and ratings, we have adopted a proper method for producing a debiased rating estimation, namely "Inverse Propensity Score Matrix Factorisation" (Schnabel et al. 2016). Such a method has been widely used to remove observation biases from ratings (Sato et al. 2020; Huang et al. 2020; Yang et al. 2018).
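As an illustration of how such a correlation between a dichotomous choice/no-choice label and a continuous item property (popularity, age or predicted rating) can be tested, the sketch below computes a point-biserial correlation coefficient in plain Python. This is a minimal illustration of the kind of test involved, not the paper's actual analysis pipeline; the function name and the toy data are ours.

```python
from math import sqrt

def point_biserial(property_values, chosen_flags):
    """Correlation between a continuous item property (e.g. popularity)
    and a dichotomous label (1 = choice, 0 = no-choice)."""
    ones = [x for x, c in zip(property_values, chosen_flags) if c == 1]
    zeros = [x for x, c in zip(property_values, chosen_flags) if c == 0]
    n, n1, n0 = len(property_values), len(ones), len(zeros)
    mean1, mean0 = sum(ones) / n1, sum(zeros) / n0
    mean = sum(property_values) / n
    # population standard deviation of the property over all samples
    s = sqrt(sum((x - mean) ** 2 for x in property_values) / n)
    return (mean1 - mean0) / s * sqrt(n1 * n0 / n ** 2)
```

A value close to +1 indicates that the chosen items systematically have larger values of the property (e.g. higher popularity) than the sampled no-choices.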

After having confirmed the existence of these correlations in our data sets, we have formally defined three choice models (CMs): Age-CM, Popularity-CM, and Rating-CM. Age-CM models users that tend to choose the newer items among the recommended ones. Popularity-CM instead models users that tend to choose the more popular items. Finally, Rating-CM models users that choose the items with the higher ratings. By adopting these choice models, whose details are described in the paper, the simulated users are assumed to prioritise those item properties in their choices. Moreover, we have also defined a baseline CM, called Base-CM, where users faithfully accept the recommendations, but with an often-observed position bias (Collins et al. 2018): items at the top of the recommendation list tend to be chosen more frequently. This choice model is used as a baseline representing the default behaviour of users that faithfully follow the RS suggestions.
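To make the four CMs concrete, the sketch below shows one plausible way to turn them into choice probabilities: each CM scores the recommended items by the property it prioritises (or by list position, for Base-CM), and the scores are normalised into a probability of choosing each item. The scoring functions here are illustrative placeholders under our own assumptions, not the paper's exact formulations, which are defined later in the paper.

```python
import random

def choice_probabilities(items, cm):
    """items: recommendation list (top-ranked first), each a dict with
    'age', 'popularity' and 'rating' keys.  Returns one choice
    probability per item under the given choice model."""
    if cm == "Base-CM":            # position bias: top ranks preferred
        scores = [1.0 / (rank + 1) for rank in range(len(items))]
    elif cm == "Age-CM":           # newer items (smaller age) preferred
        scores = [1.0 / (1.0 + it["age"]) for it in items]
    elif cm == "Popularity-CM":    # more popular items preferred
        scores = [it["popularity"] for it in items]
    elif cm == "Rating-CM":        # higher-rated items preferred
        scores = [it["rating"] for it in items]
    total = sum(scores)
    return [s / total for s in scores]

def simulate_choice(items, cm, rng=random):
    """Sample the index of one chosen item according to the CM."""
    probs = choice_probabilities(items, cm)
    return rng.choices(range(len(items)), weights=probs, k=1)[0]
```

Under each CM, the item that is strongest on the prioritised property receives the largest choice probability, while the others still retain a non-zero chance of being chosen.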

We have then adopted the defined choice models in simulation experiments where a population of users is repeatedly exposed to recommendations generated by alternative RSs, and the users' choices are simulated by relying on the above-mentioned choice models. In each simulation experiment, a pre-existing rating data set of observed user preferences, up to a certain time-point, is used to train the considered RSs. Then, the successive month-by-month choices of the users are simulated and used for retraining the RSs on the expanded set of data, i.e. including the simulated choices. We measure important global metrics of the simulated choices, such as the Gini index, the coverage of the catalogue and their average predicted rating. The details of the metrics used to measure the distribution and quality of choices are given in the paper. In this way, the combined effect of an RS and a choice model can be analysed.
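Two of the mentioned metrics can be stated compactly. The following sketch uses the standard definition of the Gini index over item choice counts and a simple catalogue coverage ratio; the paper's precise metric definitions are given later, so these should be read as common textbook formulations rather than the paper's exact ones.

```python
def gini_index(choice_counts):
    """Gini index of the distribution of choices over items:
    0 = choices spread uniformly over items, values near 1 =
    choices concentrated on very few items."""
    xs = sorted(choice_counts)
    n, total = len(xs), sum(xs)
    # standard formula over counts sorted in ascending order
    return sum((2 * (i + 1) - n - 1) * x for i, x in enumerate(xs)) / (n * total)

def catalogue_coverage(chosen_items, catalogue_size):
    """Fraction of the catalogue chosen at least once."""
    return len(set(chosen_items)) / catalogue_size
```

For example, four items chosen once each give a Gini index of 0, whereas all choices falling on a single item out of four gives 0.75 (approaching 1 as the catalogue grows).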

In the simulations, which are conducted on the three above-mentioned application domains (Apps, Video games and Kindle Books), we have analysed three research hypotheses and found interesting results supporting them.

  • H1—Some important metrics measuring the distribution and quality of users' choices, when they are exposed to an RS, are strongly influenced by the prevalent CM of the users, irrespective of the specific RS. We have found several facts supporting this hypothesis, which are described in the paper. For instance, we have discovered that when users tend to prefer more popular items in their choices, the popularity of the chosen items grows even larger than when the choices are supposed to be influenced only by the RS. Hence, the users' choice model may amplify typical effects of RSs.

  • H2—RSs have essential effects on metrics measuring the distribution and quality of users' choices, regardless of the prevalent CM adopted by the users. To cite only one example of the findings that support this hypothesis, we have found that non-personalised RSs have a clear effect on reducing the diversity of the choices, and this influence is observable whatever the users' CM is. For instance, this effect is observed even when the users' choice model prioritises newer items (Age-CM), hence less popular ones.

  • H3—Some effects of an RS on metrics measuring the distribution and quality of users' choices depend on the adoption of a particular CM. In support of this hypothesis, we can mention another result of our simulation experiments: when an RS explicitly recommends relatively less popular items, the average popularity of the chosen items is minimal, and it remains minimal if the users are supposed to be influenced only by the recommendation ranking (Base-CM is adopted). However, when other CMs are adopted, the average popularity immediately increases. This shows how important it is to factor in the choice model to accurately predict the real effect of an RS.

The obtained results shed light on the hidden impact of three types of choice behaviours (Age-CM, Popularity-CM, and Rating-CM), showing how important it is to understand and anticipate the implications of certain choice behaviours. Moreover, the shown effects of alternative RSs indicate the importance and feasibility of understanding, even offline, the implications of deploying an RS, i.e. before it is deployed on an online web platform. Finally, the results of this paper clearly show how the effects of RSs and CMs on the users' choices can be coupled, determining specific and even unexpected choice distributions. Hence, in summary, this study brings two main contributions. First, we introduce a general and reusable simulation approach for investigating the impact of recommender systems and choice models on users' choice distribution and quality. Secondly, the obtained results show that RSs and CMs, both independently and jointly, can affect the distribution of RS users' choices.

The rest of the paper is structured as follows. Section 2 presents relevant literature on the impact of RSs and CMs on users' choices. Section 3 presents the results of our analysis of the correlation of item popularity, age and rating with the users' choices. Section 4 presents the proposed choice simulation framework and the experimental setups. Section 5 discusses the obtained results on the impact of the considered CMs and RSs. Finally, the paper is concluded in Sect. 6.

2 Related work

The goal of this paper is to study the long-term impact of alternative users' choice models and recommender systems on the distribution and quality of users' choices. Hence, we review studies that aim at assessing, with alternative approaches, the influence of different CMs and RSs on users' choice behaviour. In this section, we first discuss choice models and how some item properties (popularity, age and rating) may affect users' choices (Sect. 2.1). Then, we review analyses that investigate the impact of recommender systems on users' choice distribution (Sect. 2.2). Finally, we give an overview of papers that studied the joint impact of recommender systems and choice models on users' choice distribution (Sect. 2.3).

2.1 Consumer choice and modelling

Choice making is a fundamental activity of humans, and it has often been studied in artificial intelligence and related areas (Moins et al. 2020; Tarabay and Abou-Zeid 2021; Márquez et al. 2020). Moreover, discrete choice modelling, i.e. explaining or predicting an agent's choice among two or more discrete alternatives, has been studied extensively in the economics literature (Hensher and Johnson 2018). A well-known discrete choice model is the multinomial logit (MNL), which is used when three or more alternatives are available for an agent to choose from (Luce 2012). MNL is used to determine the probability of purchase in product line problems, e.g. in the retail domain (Berbeglia et al. 2021). In MNL, consumers (users), when exposed to a set of options/items, are supposed to evaluate them by computing the options' utility, and to choose an option with a probability that grows with the utility of that option. An option's utility represents the amount of satisfaction that the user estimates to obtain by consuming that option (Broome 1991). A proper estimation of the options' utility has been an important subject of study. The utility is typically estimated by modelling the users' preferences. In fact, the theory of revealed preference (Samuelson 1938) suggests that the best way to measure users' preferences is to observe their purchasing behaviour. Moreover, in addition to personal preferences, product properties have been observed to have a significant impact on users' choices (Blackwell et al. 2006). In that respect, users' choices can be influenced by the product's age, i.e. the time since the item was first introduced (Eelen et al. 2015), the item's unexpectedness (Adamopoulos and Tuzhilin 2014), its popularity (Jannach et al. 2017) and also promotion offers (Kaveh et al. 2020). In this paper, we concentrate on three product properties: popularity, age and rating.
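The MNL choice rule described above can be written in a few lines. Assuming the option utilities have already been estimated, the probability of choosing option i is exp(u_i) normalised over all options; the sketch below is the textbook formulation:

```python
from math import exp

def mnl_choice_probabilities(utilities):
    """Multinomial logit: P(i) = exp(u_i) / sum_j exp(u_j)."""
    m = max(utilities)                 # subtract the max for numerical stability
    exps = [exp(u - m) for u in utilities]
    z = sum(exps)
    return [e / z for e in exps]
```

As required by the model, options with equal utility are equally likely, and the choice probability grows monotonically with the utility of the option.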

A popular item, by definition, has been chosen by many users, but many users also have the tendency to prefer "popular" items, i.e. items that other users have often selected. Many studies have shown that users' choices are often correlated with item popularity, and the internet economy has often been described as a "winners take all" market (Giridharadas 2019). In particular, Jannach et al. (2017) discovered a high correlation (Chi-square test) of the successful recommendations, i.e. recommendations that are actually selected by the users, with the popularity of the items. In addition, Powell et al. (2017) showed that Amazon's users have the tendency to choose items that have a large number of reviews, even if they have lower ratings compared to the other, less popular, alternatives. Heck et al. (2020) identified the same pattern as well. In fact, many studies found that a larger number of reviews for an item is associated with increased sales of that item (Chevalier and Mayzlin 2006; Sun 2012; Hoffart et al. 2019).

The second item property that is considered in this paper is the "age" of the items, i.e. the time that has passed since the item was introduced in the catalogue. It has frequently been shown that consumers are typically attracted by the "new" label of a product (Bartels and Reinders 2011; Zhang et al. 2014; Gravino et al. 2019). Eelen et al. (2015) conducted a study on supermarket products, and they found that attaching the label "new" to a product translates into a more positive consumer attitude towards the product and a greater purchase intention. Additionally, Gerrath and Biraglia (2021) analysed users' choices when they were exposed to both popular and less popular brands. It was found that some users, due to curiosity, prefer the new items released by a brand, especially if it is a popular one (i.e. known by the users). In another study, Im et al. (2003) analysed households' replies to questionnaires regarding their preferences for durable goods. They showed that the users' tendency to choose newer items, referred to as "New-Product Adoption Behaviour", exists in some households, and this behaviour depends both on the consumer's "Innate Innovativeness" and on some consumer characteristics, e.g. age, income and education. In fact, innovativeness is a fundamental driver of consumers that can only be nurtured by considering recent items (Agarwal and Prasad 1998; Hirschman 1980).

In the more specific context of recommender systems, the effect of innovativeness on choices was explored in an empirical experiment described in Pathak et al. (2010). The authors showed that more recent books, i.e. released sometime during the past year, tend to reinforce the influence of the recommendations on choices. Innovativeness has also been explicitly used in building recommender systems (Wang et al. 2018; Kawamae 2010). For instance, Kawamae (2010) proposed a neighbour-based collaborative filtering RS where the neighbours of a target user can only consist of innovator users and, as a consequence, new items are recommended more often.

The third item property that we consider in this paper is the predicted rating of an item. The rating for an item is an indicator of the rater's satisfaction obtained from the consumption of the item. In that respect, many RSs predict users' ratings and recommend to a target user the items with the largest predicted ratings (Ricci et al. 2022). This implies that a large part of the items that a user is exposed to before making a choice consists of highly rated items, and users actually choose these items either because they like them (i.e. the RS was right in its prediction) or simply because they have been exposed to them. Some of the previously cited studies have therefore considered predicted ratings as input for modelling users' choices. In particular, Burke et al. (2016) considered ratings above 3 as indicators of choice, and Hazrati and Ricci (2022a) used rating prediction as a component of the users' choice model.

2.2 Recommender systems impact on choosers

It has often been shown that recommendations do influence individuals' decisions in consuming items. Such an influence has been inferred by analysing various determinants of users' choices, e.g. sales increase (Jannach and Hegelich 2009), global choice distribution (Lee and Hosanagar 2019), users' preferences (Adomavicius et al. 2013), and the complexity of choice making (Senecal et al. 2005). In that respect, several studies have analysed users' responses when exposed to recommendations (Lee and Hosanagar 2019; Jannach and Hegelich 2009; Dias et al. 2008; Adomavicius et al. 2013; Senecal and Nantel 2004; De et al. 2010). The impact of RSs on sales was analysed by Jannach and Hegelich (2009), who compared personalised and non-personalised RSs on an online platform. The authors found that personalised RSs lead to a larger sales count compared to non-personalised ones. Lee and Hosanagar (2019) also compared the cases when a personalised RS vs. none is used, and observed a major increase in sales in the first case. Many other studies made similar observations (Zhou et al. 2010; Lee and Hosanagar 2014; De et al. 2010).

Senecal and Nantel (2004) analysed the complexity of choice making of users when facing recommendations and showed that users of an online retail store who follow recommendations have a much less complex shopping behaviour (fewer pages visited before their purchase) compared to the users that do not receive recommendations. In fact, this choice facilitation, which is enabled by an RS, has been shown to increase the number of new users following recommendations (Dias et al. 2008). As a side effect, the reliance of users on such personalised systems leads to their preferences being altered over time by RSs, as was observed in Adomavicius et al. (2013). Consequently, RSs influence the global distribution of their users' choices (Lee and Hosanagar 2019; Matt et al. 2013; Lawrence et al. 2001; Gomez-Uribe and Hunt 2015).

Such an important effect of RSs has motivated studies attempting to understand how alternative RSs impact the distribution of the users' choices. To that end, Matt et al. (2013) designed a website offering music tracks and augmented it with alternative RSs. By conducting a user study on a small set of users and for a short time period, the authors discovered that RSs do affect the users' decisions and lead to different levels of choice diversity. They discovered that both collaborative filtering (CF) and content-based (CB) approaches lead to higher choice diversity than when no RS is employed. However, such a diversity effect has not always been observed; in a large-scale online experiment in the retail domain, a traditional CF was shown to produce a decrease in sales diversity compared to when no RS was used (Lee and Hosanagar 2019).

To investigate the impact of recommender systems on users' choice distribution, some studies simulated users' choice making when exposed to recommendations (Fleder and Hosanagar 2007, 2009; Bountouridis et al. 2019; Nadolski et al. 2009; Hazrati et al. 2020; Huang et al. 2020; Yao et al. 2021). The difficulty of running online tests, as well as the importance of conducting more controlled experiments, were among the motivations for adopting simulations. One often-cited simulation is described in Fleder and Hosanagar (2009); it was aimed at studying the effect of alternative RSs on the diversity of users' choices. The authors simulated users iteratively choosing items, among a small set of products, according to a probabilistic multinomial-logit choice model (Brock and Durlauf 2002). The model was based on a randomly generated utility function of the users: the higher the utility of an item, the more likely the item is to be chosen. In fact, each user was assumed to choose from a catalogue of items, and if an item was recommended to a user, its probability of being chosen was increased. To quantify the impact of the RS, the authors estimated the diversity of the choices. They discovered that common RSs, namely CF algorithms, produce choices concentrated on a reduced set of items.

That study was very influential; however, the simulation settings were rather artificial: a random utility function was used for each user, and a small number of (hypothetical) users and items were considered; no real product chosen by any true purchaser was involved. Consequently, their findings provide a limited picture of real consumers' choice making, for instance, in a typical web portal. In our study, we use three data sets containing logs of users' purchases in order to rely on a concrete set of products purchased by a real sample of customers. Then, we simulate the choices of these users under alternative conditions: when they are influenced by six types of RSs, as well as when they use four distinct choice models.

However, Fleder and Hosanagar (2009) inspired other studies, such as the one described by Bountouridis et al. (2019), which was aimed at understanding how recommendations affect the diversity of news consumption. The authors adopted the same simulation framework introduced by Fleder and Hosanagar (2009) but considered two additional diversity metrics: long-tail diversity and unexpectedness (Vargas 2015). They found that the more the users choose popular news topics, the less unexpected the recommendations become. They also discovered that both simple algorithmic strategies (item-based k-Nearest Neighbour) and more sophisticated strategies (Bayesian Personalised Ranking) increase the choice diversity over time. This observation contrasts with the findings of Fleder and Hosanagar (2009), which instead showed that RSs lead to a decrease in the choices' diversity.

Some other attempts to understand the impact of RSs on choices’ distribution also focused on choice diversity. Chaney et al. (2018) performed another simulation of users’ choices under the influence of RSs and found that RSs increase the concentration of the choices over a narrow set of items and create a richer-get-richer effect for popular items. They discovered that this homogeneity reduces the utility of the RSs as it hinders the user-item match, something that was also acknowledged by Ciampaglia et al. (2018). Finally, Hazrati and Ricci (2022a) investigated the effect of recommender systems on diversity through a more articulated simulation framework. The users in this simulation have the chance to accept recommendations from alternative RSs or choose other items from the catalogue. The authors analysed RSs’ impact on the Gini index, as a measure of choice diversity, by using three data sets and one choice model based on the estimated rating of the item. They found that the Gini index varies substantially when alternative RSs are used. However, in this study, the authors assumed that users’ choices are influenced only by the items’ predicted ratings. In this paper, we model four alternative types of behaviour that users may exhibit in their choice making procedure (including a choice model that is influenced by the item’s rating).

2.3 The combined effect of the choice model and the recommender system

Some research works have tried to model and quantify the combined effect of the users' choice behaviour (choice model) and the RS on the distribution of the users' choices. Along this line of research, Szlávik et al. (2011) simulated users' choices under different choice models while the users were exposed to recommendations. The authors refer to a choice model as the combination of a recommendation set and simple criteria adopted by the user to select the recommendations, i.e. they do not sharply distinguish the two concepts, as done in our work. Users are exposed to recommendations (generated by a matrix factorisation RS (Funk 2006)), and they are simulated to choose among the recommended items based on either a deterministic or a probabilistic criterion. The authors discovered that with a CM where all the users choose the same number of items randomly among the recommended items, the choices do not necessarily become more uniform, and their mean rating decreases. Additionally, they evaluated the effect of varying the acceptance probability of the recommendations, i.e. forcing the simulated users to select the recommended item or giving them the chance to select among other items. They found that when users select trending items, the mean rating value of the chosen items is higher than when they choose randomly among the recommended items. These are interesting findings; however, the authors considered only one data set (the Netflix Prize (Bennett et al. 2007)) and only one RS. While the authors of the previously cited work evaluated the interaction between an RS and a choice model by measuring only the average predicted rating and the diversity of the choices, in this article, we consider other important metrics (average predicted rating, age, Gini index, and popularity of the chosen items). Moreover, Szlávik et al. (2011) considered only one RS and observed the effect of CMs that only vary the number of accepted recommendations.
Conversely, in this paper, we simulate six diverse RSs, in order to understand their specific impact on the users’ choices, as well as four CMs that have a well-founded motivation in the consumer behaviour literature (see the first part of this analysis of the state of the art).

In a more recent simulation study of users' rating behaviour, Zhang et al. (2020) considered several user choice models, referred to as consumption strategies. The study aims at understanding the effect that the users' reliance on recommendations has on the performance of an RS. The study contributes some notable findings regarding RSs and users' behaviours; among them, the authors discovered a phenomenon called the "performance paradox". That paradox is observed when a strong reliance of users on recommendations leads to sub-optimal performance of the RS in the long run. It was shown that the RS accuracy improvement over time becomes smaller when users strongly rely on recommendations, compared to when they completely ignore them. The authors also considered a second consumption strategy (hybrid), where users consume items from a recommendation list that includes both personalised and popularity-based recommendations. It was found that the hybrid consumption strategy improves the relevance of the consumed items over time, mainly because the RS popularises the "good quality" items that are preferred by a large number of users. While this study focuses primarily on choice models that simulate the extent of the users' reliance on recommendations, the CMs that are analysed in our study are meant to operationalise users' choice behaviour when the preferences of the users are influenced by distinguished item properties (such as age or popularity). In addition, while Zhang et al. (2020) considered only one RS in their study (eventually combined with a popularity-based RS), we simulate six alternative RSs and analyse their impact on the choices as well. Hence, our study provides a more comprehensive account of the possible outcomes, in terms of the choices' distribution, when users make choices influenced by RSs.

3 Choices correlation with item properties

As mentioned in the introduction, in order to motivate the development and the analysis of alternative and even atypical choice models, we first inspect some choice data sets. In particular, we present here the analysis of the correlation of observed users' choices with three distinguished item properties, namely popularity, age and rating. The age of an item depends on the time the item is chosen by a user; it is defined as the time passed from the introduction of the item in the catalogue to the time a particular choice for it is made. Popularity also depends on time; it is estimated by counting how many times an item was previously chosen at the time a choice for it is made. In order to conduct such an analysis, it is necessary to distinguish the items a user has chosen from those she has not chosen. We use three rating data sets where it is legitimate to assume that a user's rating is present only if the rated item was chosen (purchased) by the user.

3.1 Data sets

To perform the analysis, we looked for data sets that record time-stamped ratings of the users and where the ratings signal users' actual choices. Hence, we have identified three data sets from the Amazon collection that satisfy the mentioned criteria and have quite different characteristics, so that they may also be representative of other domains: Apps (Android applications), Games (video games), and Kindle Books (He and McAuley 2016). We restrict our analysis to subsets of the original data. We first narrow down each rating data set by considering only the ratings of items for which we could find a release date (necessary for computing an item's age at choice time). Then, we select the ratings from the users who rated at least 15 items. This is needed to have the minimum number of samples necessary for the correlation test (Bonett 2020). To calculate the age of an item at rating time, we search the Amazon website for the original release date of the item. Accordingly, we have developed a web crawler that fetches Amazon item pages and extracts the "release date" field. Table 1 shows some important characteristics of the three data sets after the above-mentioned filters are applied. The time span in the table indicates the time between the first and the last choice/rating in the data set.

Table 1 Important characteristics of the considered Amazon data sets

We here define a dichotomous variable for labelling items depending on a considered user; it can take two values:

  • Choice: A choice of a target user u is an item i which the user has rated; a rating is recorded in the data set for that user-item pair. In fact, as mentioned above, in the considered data sets one can assume that user choices correspond to ratings, since Amazon users predominantly rate games, Kindle books and apps after having purchased them.

  • No-choice: A no-choice for a target user u is an item j that the user did not choose; no rating for the user-item pair is recorded in the data set. More precisely, we sample the “no-choices” of each user from the items that the user did not rate.

While, in principle, the “no-choices” are by definition the complement set of the choices, we use a sample of them, since we are only interested in assessing the presence or absence of a correlation between an item property and this dichotomous variable separating a user’s choices from no-choices. Hence, we assume that when the user u made a choice for the item i, at the time-stamp of the corresponding rating \(r_{ui}\), she also did not choose another item (no-choice), say the item j, which is randomly selected among the items that the user u has not chosen (rated) in the data. This sampling approach has been previously used in the literature for generating negative implicit feedback (Pan et al. 2008; Jannach et al. 2018). It is also worth noting that some of the sampled no-choices may be for items that the user could have chosen. The same issue arises in all the “one-class” collaborative filtering approaches (Pan et al. 2008). However, this adds only a minor error, as one can safely assume that such items are a small minority among the full set of items that the user did not rate.
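The per-user sampling of no-choices described above can be sketched as follows (a minimal illustration; the function and variable names are ours, not from the original study):

```python
import random

def sample_no_choices(rated, catalogue, rng=None):
    """Pair each timestamped choice (item, t) of one user with a "no-choice":
    an item sampled uniformly from those the user never rated."""
    rng = rng or random.Random(42)
    chosen = {item for item, _ in rated}
    candidates = [i for i in catalogue if i not in chosen]
    # one no-choice j per choice i, sharing the choice's time-stamp t
    return [(i, rng.choice(candidates), t) for i, t in rated]
```

Each returned triple (i, j, t) then contributes one positive and one negative sample to the dichotomous choice variable used in the correlation tests.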

3.2 Correlation analysis

We formally define three properties of an item i at the time the user u chose it or did not choose it:

  1. Item popularity is equal to the number of times the item i was rated/chosen by all the users in the n days before it was chosen (or not chosen) by u, divided by n. In this analysis, we consider n = 90 (three months). We note that such a measurement of popularity could be made domain-specific. For instance, in the Kindle books domain, the number of days used to measure popularity should be larger than in the Android application domain, since the lifespan of a Kindle book is presumably longer than that of an app. However, we kept the number of days used to assess item popularity equal to 90 in all the data sets, to simplify the interpretation of the results.

  2. Item age is the time difference, in months, between the time-stamp of the choice (or no-choice) of the item i and the release date of the item, i.e. the first time the item could have been chosen. We note that the item age in this analysis is not the exact age of the item, which could instead be defined, for instance, as the time since a Kindle book was first published. However, the time when an item is listed on Amazon is typically the time when it was commercialised and users started to choose it in the considered scenario, which is what matters for our analysis.

  3. Item rating is the predicted rating of the user u for the item i. We note that this property depends on both the user and the item, unlike the popularity and age properties, which depend only on the item. In order to better assess the true preferences of the user, we predict the missing user-item ratings with a debiased prediction method, the “Inverse Propensity Score” Matrix Factorisation model (IPS-MF) (Schnabel et al. 2016), which is fully discussed later in Sect. 4.2, because it is also used in the simulation of users’ choices.
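The first two properties can be computed directly from the timestamped choice log; a short sketch (function names and the day-based timestamp convention are our assumptions):

```python
from bisect import bisect_left

def popularity(choice_times, t, n_days=90):
    """Choices of the item in the n_days before time t (days), divided by n_days.
    choice_times: sorted timestamps of all previous choices of this item."""
    lo = bisect_left(choice_times, t - n_days)
    hi = bisect_left(choice_times, t)
    return (hi - lo) / n_days

def age_in_months(release_day, t, days_per_month=30):
    """Time passed, in months, from the item's release date to time t."""
    return (t - release_day) / days_per_month
```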

We are interested in assessing possible correlations between the dichotomous variable that distinguishes choices from no-choices and the three previously defined properties, namely popularity, age and rating. This analysis could reveal if users may have been influenced in their choices by these three properties. Such an influence could be determined by an explicit preference, e.g. for more novel items, or could be only implicit; the users may not be aware, but they may tend to prioritise items that appear to be popular, e.g. because they have many ratings. However, we stress that we do not hypothesise the existence of any causal dependency between item features and users’ choices; the topic of why users may show specific choice behaviours is out of the scope of this article.

In order to precisely measure the existence of such correlations, we have performed a point biserial correlation test (Tate 1954). This test is commonly used to determine the dependency between a dichotomous and a continuous variable (Mazza et al. 2019; Van Der Heide et al. 2013). We consider the users separately because they can have different choice behaviours. The performed tests produce correlation coefficients; a positive value indicates that a user’s choices are made among items with higher values of the property, and a negative correlation coefficient indicates that the choices are made among items with smaller values of the property.
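The point biserial coefficient is simply Pearson’s r computed with a 0/1 variable; a self-contained version (equivalent to `scipy.stats.pointbiserialr` up to the p-value, which we omit here):

```python
from math import sqrt

def point_biserial(d, y):
    """Correlation between dichotomous d (0 = no-choice, 1 = choice) and a
    continuous item property y; positive means choices favour high values."""
    n = len(d)
    y1 = [v for b, v in zip(d, y) if b == 1]
    y0 = [v for b, v in zip(d, y) if b == 0]
    m1, m0 = sum(y1) / len(y1), sum(y0) / len(y0)
    mean = sum(y) / n
    s = sqrt(sum((v - mean) ** 2 for v in y) / n)  # population std dev
    return (m1 - m0) / s * sqrt(len(y1) * len(y0) / n**2)
```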

Figure 1 shows the histograms (percentage frequency distribution) of the measured correlation coefficients (one for each user) in the Apps data set between the choice variable and the item age (Fig. 1a), the item popularity (Fig. 1b) and the item rating (Fig. 1c). Similarly, Figs. 2 and 3 show the correlation results for the Games and Kindle books data sets.

Fig. 1 Apps data set—histograms of the correlation coefficients (one for each user) between the choice variable and (a) item age, (b) item popularity and (c) item rating

Fig. 2 Games data set—histograms of the correlation coefficients (one for each user) between the choice variable and (a) item age, (b) item popularity and (c) item rating

Fig. 3 Kindle books data set—histograms of the correlation coefficients (one for each user) between the choice variable and (a) item age, (b) item popularity and (c) item rating

The overall percentages of users that have their choices correlated to the considered item properties are shown in Fig. 4. This figure shows the percentage of the users whose choices have a correlation (positive for popularity and rating, and negative for age) with the considered property, as well as the percentage of the users that have a significant correlation. For instance, it is shown that 81% of the users in the Apps data set have a positive correlation in their choices with item popularity, and among them, 48% of the correlation results are significant.

Fig. 4 Percentage of users that have a correlation of their choices with item age (negative), popularity (positive) and item rating (positive), and the percentage of these users for which the correlation is statistically significant

It is clear that a large part of the users in the three considered data sets have a significant (either positive or negative) correlation of their choices with the three considered item properties; their choice behaviour shows this dependency.

Our analysis illustrates three kinds of user choice behaviours. It is also important to understand whether one kind of behaviour encourages the others, or, in other words, whether there are dependencies between these choice behaviours; this analysis can show whether it is valid to model these behaviours in isolation. Hence, for each property, we divided the users into two groups: those having a positive correlation between the property and the choices, and those having a negative correlation. Then, to test whether any dependency between two choice behaviours exists, we performed a Chi-squared test for every pair of properties, namely for the {popularity, age}, {popularity, rating} and {rating, age} data. We have found that for all of the pairs, the obtained p-values are between 0.8 and 1; hence, the null hypothesis (independence of choice behaviours) cannot be rejected. For instance, a user whose choices are positively correlated with item popularity is not more likely to have choices positively correlated with the item predicted rating, or negatively correlated with item age.
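For a 2×2 table of users cross-classified by the signs of two correlations, the Pearson chi-squared statistic has the well-known closed form below (df = 1; independence is rejected at the 5% level when the statistic exceeds 3.84):

```python
def chi2_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 contingency table [[a, b], [c, d]],
    e.g. counts of users with (+,+), (+,-), (-,+), (-,-) correlation signs."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
```

A perfectly balanced table gives a statistic of 0, i.e. no evidence against independence.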

In conclusion, while the results of the point biserial correlation test show that a large fraction of users have their choices correlated with some of the considered properties, the second test shows that the correlation with one property is not dependent on the correlation with another property.

It is also interesting to note that overall the strength of the correlation between item properties and choices depends on the data set. Hence, it is relevant to study how this influence can affect the global distribution of the users’ choices. In order to investigate this effect, in the next section, we use a simulation procedure that explicitly accounts for alternative user’s choice models. We mathematically model how either the items’ popularity, age, or predicted rating can determine the probability that a user chooses an item among those recommended by an RS.

4 Choice simulation

In this section, we aim at simulating how the users’ choice distribution and quality can be influenced by specific user choice models adopted by the simulated users, when they are exposed to recommendations. We are specifically interested in understanding how the user choice model couples with an RS and shapes a specific choice distribution. Below, we present the details of our simulation approach.

4.1 Simulation procedure

The general schema of the simulation procedure is shown in Fig. 5. The simulation generates users’ choices within successive time intervals, while users receive recommendations from an RS. At the beginning of the simulation, a recommender system is trained on an initial log of users’ choices (observed in a real system up to a given point in time). Then, the users’ choices of the first time interval are simulated one after another. When a user is simulated to make a choice, an RS suggests a set of items to her. The considered RSs are presented in detail in Sect. 4.3. The simulated user is assumed to compute the utility of the recommended items and to make one choice among them according to a choice model. The utility calculation and the choice models are discussed in detail in Sect. 4.2. When all the choices in the first interval are simulated, the RS is retrained, also considering the simulated choices in the training set. This procedure is repeated for a number of time intervals.

Let us now give a more formal and precise description of the simulation framework. Let U and I be the full set of users and items in a data set, logging users’ ratings for items in a time span. Let P be the \(\vert U \vert \times \vert I \vert \) choice matrix where an element of this matrix, \(p_{uj}\), is 1 if the user u has chosen the item j, and \(p_{uj}=0\) otherwise. The matrix P is derived from a rating matrix R, which is supposed to be known and shows the real observed evolution of ratings. We here assume that if an item is rated, then it is also chosen. With \(p_u\), we denote the u-th row vector of the matrix P. The users’ choices are time stamped; \(t_{uj}\) is the time when the user u chose the item j. Assume that \(t_0\) is a given time point; we denote with \(P^0\) an initial choice matrix formed by all the real observed choices \(p_{uj}\) with \(t_{uj}\le t_0\). The proposed simulation procedure starts from this initial knowledge of real ratings/choices and aims at simulating users’ choices made after this time point, when users are exposed to alternative recommender systems and they use one of the considered choice models.

We consider successive time intervals, each one spanning a month duration, starting from the time point \(t_0\). So, for instance \(]t_0, t_1]\) denotes the time interval spanning from \(t_0\) (excluded) to \(t_1\) (included), and \(t_1\) is a time point one month after \(t_0\). The simulation iterates over these time intervals to identify \({\hat{P}}^l\), that is, the matrix of the simulated choices in \(]t_{l-1},t_l]\). We assume that in each time interval \(]t_{l-1},t_l]\) the users select items one after another.

Fig. 5 General diagram of the simulation framework

The detailed simulation procedure is shown in Algorithm 1. The parameters of the simulation are:

  • U: set of users.

  • I: set of items.

  • RS: a recommender system.

  • CM: the choice model of the simulated users.

  • L: the number of time intervals in the simulation.

  • \(P^0\): the binary matrix of the real observed users’ choices made up to time \(t_0\).

  • \(z_1, \ldots , z_L\): the lists of the users who will be simulated to make choices in each time interval (from 1 to L). A user can appear in each list \(z_l\) multiple times.

  • \(N^0, N^1, \ldots , N^L\): \(N^l\) is the binary matrix of actual choices for new items introduced in the l-th interval (\(N^0\) is empty because these choices are already included in \(P^0\)).

At the beginning of each time interval l, the RS is trained (Sect. 4.3) on the set of choices in \(T^l = Q^{l-1} + N^{l-1}\). \(Q^{l-1}\) contains the real choices made before \(t_0\), contained in the matrix \(P^0\), plus the simulated choices from the beginning of the simulation until the end of the previous interval \(l-1\): \({\hat{P}}^1 + \cdots + {\hat{P}}^{l-1}\). \(N^{l-1}\) contains the actual choices for the items that were added to the system during the previous time interval \(l-1\). These choices are added to the training set to enable the RS to also recommend new items. We note that, in the data that we use in the simulation, a large number of items and their choices are newly added at every time interval. More details on new items are given later in the paper (see Table 2).

After training the RS, the users in the list \(z_l\) are simulated to make a choice. Users are inserted in \(z_l\) in a random order, with repetitions: each user u appears in \(z_l\) the number of times she really made a choice in the time interval \(]t_{l-1}, t_l]\), as recorded in the log data set. The list \(z_l\) gives the (inessential) order in which the simulated users are going to make their choices in this interval. Before a choice for user u is simulated, the utility of the recommended items is calculated. In Sect. 4.2 (Eqs. 2, 3, 4 or 5), we give the details of this utility estimation. Then, a choice for u is simulated according to a choice model that uses the estimated items’ utilities. Details on that step are also given in Sect. 4.2 (Eq. 1). After all the choices of the users in \(z_l\) are simulated, these choices are inserted in \({\hat{P}}^{l}\) and added to \(Q^{l-1}\) to produce a new aggregated set of simulated choices \(Q^l\). The simulation procedure continues to the next time interval until all the L time intervals are considered.
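The loop structure just described can be condensed into the following skeleton (a sketch only; `train_rs`, `recommend` and `choose` stand for the components described in Sects. 4.2 and 4.3, and all names are ours):

```python
def simulate(P0, z_lists, N_lists, train_rs, recommend, choose):
    """P0: real (user, item) choices up to t0; z_lists[l]: users simulated
    in interval l+1; N_lists[l]: real choices for newly added items used
    when training for that interval (N_lists[0] is empty, like N^0)."""
    Q = set(P0)                               # Q^0 = P^0
    for z_l, N_prev in zip(z_lists, N_lists):
        model = train_rs(Q | N_prev)          # retrain on T^l = Q^{l-1} + N^{l-1}
        P_hat = set()
        for u in z_l:                         # users choose one after another
            S_u = recommend(model, u)         # recommendation set S_u
            P_hat.add((u, choose(u, S_u)))    # one choice per appearance in z_l
        Q |= P_hat | N_prev                   # Q^l aggregates everything so far
    return Q
```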

Algorithm 1 The simulation procedure

4.2 Choice model

When a simulated user is given the chance to make a choice among the recommended items, a multinomial-logit choice model (CM) is used. This type of model has been previously validated, and it will make our results comparable with earlier simulations (Fleder and Hosanagar 2007, 2009; Anas 1983; Hazrati and Ricci 2022a). The CMs differ in how they estimate the utility of the item j for the user u, which is assumed to be assessed by the user before making a choice. In all the CMs, the user u is supposed to choose an item j among those in the recommendation set \(S_u\), with the following probability:

$$\begin{aligned} p(u\ \hbox {chooses}\ j) = \frac{e^{v_{uj}}}{\sum _{k \in S_u} e^{v_{uk}}} \end{aligned}$$
(1)

\(v_{uj}\) is the utility of the item j for user u, and \(S_u\) is the set of recommendations for u. Clearly, items with a larger utility are more likely to be chosen, but users do not necessarily maximise utility.
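Eq. 1 is the standard softmax over utilities; a minimal implementation of the choice step (function names are ours):

```python
import math
import random

def choice_probabilities(utilities):
    """Eq. 1: p(u chooses j) = exp(v_uj) / sum_k exp(v_uk) over S_u."""
    exps = [math.exp(v) for v in utilities]
    z = sum(exps)
    return [e / z for e in exps]

def simulate_choice(items, utilities, rng=None):
    """Sample one item from the recommendation set with Eq. 1 probabilities."""
    rng = rng or random.Random(0)
    return rng.choices(items, weights=choice_probabilities(utilities), k=1)[0]
```

Because the utilities enter through an exponential, equal utilities give a uniform choice, and a constant shift of all utilities leaves the probabilities unchanged.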

We consider three alternative utility functions, one for each choice model (CM), that are influenced by three distinct and relevant item properties. A fourth CM is not based on any item property; it is used to simulate users that are influenced only by the ranking of the recommendations. This CM, too, is defined via a specific utility function.

  1. Rating-CM: the utility of the item j is in this case equal to the predicted rating:

    $$\begin{aligned} v_{uj} = {\hat{r}}_{uj} \end{aligned}$$
    (2)

    This is considered to be the standard choice model, where the users are supposed to prefer items with larger predicted ratings (Hazrati et al. 2019; Szlávik et al. 2011). In order to predict the user’s utility, we use all the ratings actually present in the considered rating data set. However, the observed ratings are generally subject to selection bias, and therefore, any rating estimation based on observed ratings is also biased (Marlin and Zemel 2009; Schnabel et al. 2016). For example, in a movie website, users typically watch and rate movies they like; they more rarely rate movies that they do not like (Pradel et al. 2012). This produces a situation where data is said to be Missing Not At Random (MNAR) (Schnabel et al. 2016; Marlin and Zemel 2009). To debias rating predictions computed with the available data, we use Inverse Propensity Score Matrix Factorisation model (IPS-MF) (Schnabel et al. 2016), which modifies the loss function of a typical matrix factorisation model by taking into account the inverse probability of a user rating an item. This increases the loss when estimating a high rating value for a user-item pair with low probability of being observed. Hence, in practice, lower ratings tend to be predicted for items that have lower probability of being observed.

  2. Popularity-CM: in the second choice model, the utility of the item j for the user u is equal to:

    $$\begin{aligned} v_{uj} = k_f*f_j^{(t)} \end{aligned}$$
    (3)

    where \(f_j^{(t)}\) is the popularity of item j (as defined in Sect. 3.2) at the time t of the simulated choice, and \(k_f\) is a parameter that adjusts the impact of the popularity on the utility.

  3. Age-CM: in the third choice model, the utility of item j is equal to:

    $$\begin{aligned} v_{uj} = k_a * \left( m - a_j^{(t)}\right) \end{aligned}$$
    (4)

    where \(a_j^{(t)}\) is the age, in months, of item j at the time t of the choice (see Sect. 3.2), and m is the maximum age of any chosen item in the entire data set. \(k_a\) is again a parameter that adjusts the impact of the item age on the utility. Hence, in this choice model, more recent items have a larger utility, and they tend to be preferred by the simulated users.

  4. Base-CM: This is a baseline CM that is used to simulate choices in a general scenario where users faithfully adopt the recommender system. Inspired by the natural position bias in consumers’ choices (Collins et al. 2018; Carare 2012), we model the user’s behaviour of prioritising the top recommended items: the probability of choosing a recommended item reduces significantly as the item rank gets lower in the recommendation list. Hence, the utility of a recommended item in the Base-CM is a transformation of the rank of the item. The utility of item j in the Base-CM is equal to:

    $$\begin{aligned} v_{uj} = \log _e \alpha ^{-i} \end{aligned}$$
    (5)

    where i is the rank of the recommended item and \(\alpha \) is the decay factor that determines how sensitive the user is to the rank: a larger value of \(\alpha \) simulates a stronger decay of the utility with the rank of the items. For instance, when \(\alpha \) is set to 1, according to Eq. 1, the probability of making a choice is the same for all of the items (2% if 50 items are recommended). When \(\alpha \) is set to 2, the probabilities of choosing the top-5 recommended items are 0.5, 0.25, 0.12, 0.06, and 0.03, respectively. We set \(\alpha \) to 1.4; according to the exponential multinomial choice model in Eq. 1, the probabilities of choosing the top-5 items then become 0.28, 0.20, 0.14, 0.10, and 0.07, respectively. The items at the bottom of the recommendation list have probabilities very close to zero. A similar probability distribution of real users’ choices is observed by Collins et al. (2018). Additionally, Zhang et al. (2020) have used a similar approach to model the natural choice behaviour of users.
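Plugging the Base-CM utility (Eq. 5) into Eq. 1, the exponentials cancel and the choice probability of the item at rank i reduces to \(\alpha ^{-i}/\sum _k \alpha ^{-k}\). The probabilities quoted above can be reproduced as follows:

```python
def base_cm_probabilities(alpha, n_items=50):
    """Choice probability per rank under the Base-CM:
    p(rank i) = alpha**-i / sum_k alpha**-k over the recommendation list."""
    weights = [alpha ** -i for i in range(1, n_items + 1)]
    z = sum(weights)
    return [w / z for w in weights]
```

With alpha = 1 all 50 items are equally likely (2% each); larger values of alpha concentrate the probability mass on the top ranks.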

We note that items’ popularity and age in the three data sets have different ranges of values. Hence, in order to have a better comparison, we have chosen, in each data set, specific values for \(k_f\) and \(k_a\) such that the three considered utility functions range in the same interval of values. In practice, we set \(k_f\) and \(k_a\) in such a way that all the computed utilities range between 1 and 5, which is the default range of the Rating-CM utility (five stars rating).
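The paper fixes multiplicative constants \(k_f\) and \(k_a\); one simple way to obtain utilities in the common [1, 5] range, shown here only to illustrate the idea (it is not necessarily the authors’ exact procedure), is a linear min-max rescaling of the raw property values:

```python
def rescale(values, lo=1.0, hi=5.0):
    """Linearly map raw utility values into [lo, hi] (the five-star range)."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:                     # degenerate case: all values equal
        return [(lo + hi) / 2] * len(values)
    return [lo + (v - vmin) * (hi - lo) / (vmax - vmin) for v in values]
```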

4.3 Recommender systems

In order to isolate the effect of RSs, our simulation assumes that users make their choices among the recommended items. Hence, we do not consider other information sources that may lead the users to choose other items. We initially ran some simulations with alternative recommendation sizes, i.e. 10, 20, and 50 recommendations. We found that when the number of recommendations is increased from 10 to 50, the values of the considered metrics change, i.e. diversity of the choices increases, and the average predicted rating of the choices decreases. However, the relative effect of the RSs and the CMs remains the same. Hence, since our primary goal is to understand the effect of different RSs and CMs, we did not manipulate the recommendation size, and we set the number of recommendations to 50.

We have selected six recommender systems because they are well-known, it is easy to interpret their behaviour, and they cover diverse types of approaches, both personalised and non-personalised.

  • PCF—Popularity-based Collaborative Filtering—is a neighbourhood-based CF that computes the cosine similarity between the 0/1 choices’ vector of a target user u, and the choice vector of other users to find the nearest neighbours. The most popular items among the choices of the nearest neighbour users are recommended to the target one (Fleder and Hosanagar 2009). For the number of neighbours, we have tested values ranging from 5 to 50 for each data set, and the recommendation precision tends to be better when 10 neighbours are used; this is the value that we finally adopted.

  • LPCF—Low Popularity-based Collaborative Filtering—is similar to PCF, but it penalises the score computed by PCF for popular items by multiplying it by the inverse of their popularity. The highest scored items are recommended (Fleder and Hosanagar 2009). Here too, we set the number of neighbours to 10.

  • FM—Factor Model—is a latent factor RS that generates recommendations with the approach proposed by Hu et al. (2008).

  • NCF—Neural network-based Collaborative Filtering—is a model that leverages a multi-layer perceptron to learn the user-item interaction function that is used to recommend top-k items to a target user (He et al. 2017).

  • POP—Popularity-based—recommends the most popular items, i.e. those that were selected by the users the largest number of times in the past.

  • AR—Average Rating: The items are scored with a variation of their average rating. IMDB.com uses this method to calculate the adjusted average ratings for the movies. A weighted average is calculated for each item as follows:

    $$\begin{aligned} \hbox {WR} = \frac{v}{v+m}\times R + \frac{m}{v+m} \times C \end{aligned}$$
    (6)

    where R is the average rating for the item among all the available ratings, v is the number of times this item is rated, m is the minimum number of ratings required to be considered by the RS (10 in our experiment), and C is the average of all of the ratings in the data set. The highest scored items are recommended.
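Eq. 6 is a shrinkage estimator: with few ratings the score stays close to the global mean C, and it approaches the item’s own average R as v grows. A direct transcription:

```python
def weighted_rating(R, v, C, m=10):
    """Eq. 6: WR = v/(v+m) * R + m/(v+m) * C, with m the minimum number of
    ratings required (10 in our experiments) and C the global mean rating."""
    return v / (v + m) * R + m / (v + m) * C
```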

We note that when new users enter the simulation in a given month, the personalised RSs cannot generate recommendations for them (users with no ratings). In this case, similarly to what is done in production RSs, we use a non-personalised RS, namely POP.

4.4 Data sets

We have used the same data sets that were introduced in Sect. 3. To simulate choices in each data set, the first n months of choices observed in the data set are considered as the starting point of the simulation, i.e. they are included in \(P^0\). The value of n is 31 for Apps, 169 for Games and 35 for Kindle books. Then, we simulate the users’ choices in the successive 10 months. Some characteristics of the data sets related to the simulation are shown in Table 2. We note that during the 10 months of simulated choices, new users and items enter the simulation; especially in the Apps data set, a considerable percentage of new users, items and choices is added every month.

Table 2 Characteristics of the considered data sets

4.5 Evaluation metrics

By running the described simulation procedure, we are interested in exploring the conjoint effect of the considered RSs and choice models on the users’ choices. We introduce here metrics that capture important global characteristics of a collection of choices, those performed, month by month by the simulated users.

  • Gini index: We measure the diversity of the users’ choices with the Gini index, which is often used to quantify inequality and has been previously adopted in related studies to measure sales diversity (Matt et al. 2013; Fleder and Hosanagar 2007, 2009; Szlávik et al. 2011; Lee and Hosanagar 2019; Adamopoulos et al. 2015). The Gini index measures the inequality of a distribution with a single value \(G \in [0,1]\). Higher Gini index values represent lower diversity of the choices and vice versa: G is 0 when the choices are perfectly uniformly distributed across items, while it is close to 1 when a high inequality is observed. The value of the Gini index typically depends on the data: in domains where users’ choices are recorded, it usually varies between 50 and 80%, but in data sets where users’ choices are biased towards more popular items, it can even exceed 90%. Dorfman (1979) discusses the Gini index in detail. We also observe that diversity is considered a positive feature; indeed, the lack of diversity is perceived as an adverse effect of RSs (Matt et al. 2013; Fleder and Hosanagar 2007, 2009; Szlávik et al. 2011; Lee and Hosanagar 2019; Adamopoulos et al. 2015).

  • Choice Coverage: Choice Coverage captures another critical aspect of the users’ choice diversity; it is the percentage of the items in the catalogue that have been chosen at least once by some user. It is worth noting that the catalogue changes each month, as new items are added at the beginning of each month. Choice Coverage can show the ability of an RS to recommend the full potential set of available items. We note that Choice Coverage reveals a different facet of choice diversity compared to the Gini index: while Choice Coverage captures the spread of the users’ choices over the catalogue, the Gini index measures how uniformly the choices are distributed over the chosen items.

  • Recommendation Coverage: This metric measures the fraction of items that have been recommended at least once (until a time point) among those that are available.

  • Chosen Items Popularity: This metric is the average popularity of the chosen items, where, as previously defined, the popularity of an item is the number of times it was chosen in the preceding 90 days, divided by 90.

  • Chosen Items Age: This metric is equal to the average, on the chosen items, of the time passed from when the items were first available in the catalogue (see Sect. 3 for the details).

  • Choice’s Rating: This metric is computed as the average of the predicted ratings of the chosen items. Namely, for each user, we compute the average of the unbiased prediction of the ratings of the chosen items. Then, we average the obtained values over all of the users. This metric shows whether the users select more valuable items or not. In fact, the predicted rating of u for item i (\({\hat{r}}_{ui}\)) is the only measure that we have at our disposal to assess the quality of the choices.
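The Gini index over per-item choice counts can be computed with the standard sorted-cumulative formula (a minimal sketch):

```python
def gini(choice_counts):
    """Gini index of the choice distribution over items:
    0 = perfectly uniform choices, values near 1 = highly concentrated."""
    xs = sorted(choice_counts)
    n, total = len(xs), sum(xs)
    cum = sum((i + 1) * x for i, x in enumerate(xs))
    # G = 2 * sum_i(i * x_i) / (n * total) - (n + 1) / n, counts sorted ascending
    return 2 * cum / (n * total) - (n + 1) / n
```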

5 Simulation results

The main objective of this study is to understand how specific user choice models affect the choice distribution in the presence of an RS, and especially how a user choice model couples with an RS, and they jointly produce a specific choice distribution.

The discussion is divided into three parts, each discussing one of our research hypotheses and its supporting results. We group the results in five tables (Tables 3, 4, 5, 6 and 7); each one depicts the resulting distribution of the simulated choices according to a specific metric, for every scenario determined by a CM and RS combination (there are 24 scenarios). The values in the tables are calculated over all the simulated choices, which are in the matrix \({\hat{P}}^1+\cdots +{\hat{P}}^{10}\), i.e. at the end of the 10 months of simulated choices. Additionally, Figs. 6, 7 and 8 show the evolution of some metrics over the simulation intervals (months) in some data sets. We have limited this visualisation to particularly interesting combinations. In these figures, each metric of interest is calculated over the set of users’ choices from the beginning of the simulation until the end of each month considered in the simulation. For instance, Fig. 6 shows the evolution of Recommendation Coverage: at the x-axis value l, the l-th simulation interval, the y-axis shows the Recommendation Coverage calculated over \({\hat{P}}^1+\cdots +{\hat{P}}^l\).

5.1 Effect of the users’ choice model

The results confirm our first hypothesis, namely that some important metrics measuring the distribution and quality of users’ choices, when they are exposed to an RS, are strongly influenced by the prevalent CM of the users, irrespective of the specific RS. The most evident results supporting this hypothesis are included in Table 3, which shows the Popularity of the chosen items. By comparing Popularity-CM with Age-CM and Rating-CM, we can observe that, for all the RSs and in all three data sets, when Popularity-CM is adopted by the users’ population, the popularity of the chosen items is maximal. This can also be immediately seen by observing the values in the rows named “CM Average”, where we report the average of the metric values computed over all the RSs: the average values of Popularity-CM, in the three data sets, are always larger than the corresponding values of Age-CM and Rating-CM. However, when the Base-CM is adopted, i.e. when users choose the top recommended items, the Popularity of the chosen items can also be quite large, especially in the Apps and Games data sets for FM, POP and AR, which are RSs with a strong tendency to recommend popular items. Note that in the Base-CM, users are supposed to stick with the top recommendations; hence, this CM produces popular choices if the RS suggests them. An additional observation can be made by considering the Items’ Age metric. In Table 4, one can see that when the users adopt the Age-CM, this metric is minimised in every combination of RS and data set. These results clearly demonstrate the direct effect of a CM on its “corresponding” metric, whatever the RS is.

Table 3 Chosen items’ popularity calculated over all of the simulated choices for each data set
Table 4 Items’ age calculated over all of the simulated choices for each data set

In addition to the direct effect of a CM on its corresponding metric, we can find some, perhaps less obvious, effects of the CMs on the users’ choice distribution. For instance, by looking at the Choice Coverage metric (Table 5), one can note that when Age-CM or Rating-CM is adopted, Choice Coverage tends to be higher than with Base-CM and Popularity-CM. This can be seen by looking again at the average of the metric computed over all the considered RSs (CM Average rows). For instance, in Games and Kindle Books, Age-CM and Rating-CM both score 0.07 on average across the various RSs, while Base-CM and Popularity-CM score 0.05 in Games, and 0.06 and 0.05, respectively, in Kindle Books. Hence, in a population of users that tend to select more recent or highly rated items, one can observe a larger Choice Coverage compared to what can be observed in a user population that more faithfully accepts the recommendations or prefers more popular items.

Table 5 Choice coverage calculated over all of the simulated choices for each data set

However, Table 6 shows that when the age of the item is prioritised by the user (Age-CM), in addition to the larger coverage of the choices observed above, the Choice’s Rating tends to be the smallest. That is, if users prioritise newer items in their choices, they may select items of lower quality. Conversely, when Rating-CM is adopted, which produces a similarly large coverage of the choices as Age-CM, the Choice’s Rating is the largest. This can be clearly seen in the average values of the metric obtained by Rating-CM (4.07 in Apps, 4.30 in Games and 4.07 in Kindle Books): all the other CMs have lower values on average. In conclusion, we recognise the important role of the users’ CM in determining the effect of the RSs on the users’ choices.

Table 6 Choice’s rating calculated over all of the simulated choices for each data set

5.2 Essential and universal effects of the RS

We have shown in the previous section that the choice model may have a substantial effect on some metrics of the users’ choice distribution and quality. Now, we focus on the second research hypothesis: RSs have essential effects on metrics measuring the distribution and quality of users’ choices, regardless of the prevalent CM adopted by the users. We discuss here some results that support our hypothesis.

We start with a simple observation (see Table 3): the non-personalised RSs, namely POP and AR, produce users’ choices with a very high Popularity. For instance, under Base-CM, this metric scores 0.479 for POP and 0.511 for AR in Apps, while all the other RSs score lower. While it is evident that Age-CM and Rating-CM do have some effect in reducing the Popularity of the choices influenced by these two RSs, it is also clear that, even if the users have a tendency to select newer or highly rated items (Age-CM and Rating-CM), the tendency of POP and AR to produce choices for highly popular items remains. This can be seen in the average Popularity of the users’ choices under these RSs, averaged across the CMs (the column named “RS Average” in Table 3). Such a tendency of POP and AR results in an extremely high concentration of the choices over a small set of items, which is also demonstrated by the Choice Coverage and Gini index metrics (Tables 5 and 7): low Choice Coverage and high Gini index can be observed. Another observation on these non-personalised RSs concerns their effect on the Choice’s Rating metric (Table 6). The choices produced by these two RSs are for highly rated items (see the “RS Average” columns), and this holds whatever the CM is: irrespective of the adopted CM, POP and AR produce choices with the largest ratings.
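
The Gini index over the choice distribution quantifies this concentration: 0 when choices are spread evenly over the items, approaching 1 when they concentrate on very few. A minimal sketch using one common formulation (the paper’s exact definition is not reproduced here):

```python
def gini_index(choice_counts):
    """Gini index of a choice distribution.

    choice_counts: number of times each catalogue item was chosen.
    Returns 0 for a perfectly uniform distribution and values
    near 1 for highly concentrated choices.
    """
    n = len(choice_counts)
    counts = sorted(choice_counts)  # ascending order of choice counts
    total = sum(counts)
    # Standard rank-weighted formula over the sorted counts.
    cum = sum((2 * (i + 1) - n - 1) * c for i, c in enumerate(counts))
    return cum / (n * total)

uniform = gini_index([10, 10, 10, 10])  # → 0.0 (evenly spread choices)
skewed = gini_index([0, 0, 0, 40])      # all choices on one item
```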

5.3 Combined effect of the choice model and the recommender system

Although there are some prominent and unavoidable effects of the RS, as discussed in the previous section, we now focus on the third research hypothesis, namely: some effects of an RS on metrics measuring the distribution and quality of users’ choices depend on the adoption of a particular CM.

The first result that we want to stress is the evident effect of Age-CM on the Age of the chosen items when PCF, LPCF or NCF is considered, especially in the Games data set (Table 4). Under this CM, the Age of the choices is particularly small. This effect may be explained by observing that these three RSs (PCF, LPCF and NCF) have the largest coverage (as shown in Fig. 6); hence, they are able to recommend a more diverse set of items than FM, POP and AR. This relatively higher RS Coverage of PCF, LPCF and NCF exposes more new items that can be selected by users adopting Age-CM, producing a reduction in the Age metric.

Fig. 6 Evolution of RS Coverage when Base-CM is adopted

The second notable result concerns the FM recommender system. As shown in Table 3, FM is a personalised RS with a high tendency to recommend popular items: when Base-CM is adopted, thus showing the sole impact of the recommender system, the Chosen Items Popularity generated by this RS is the highest among the personalised RSs in all three data sets. For instance, in the Apps data set, the Chosen Items Popularity generated by FM (Base-CM) is 0.133, while PCF scores 0.034, LPCF 0.015, and NCF 0.038. However, in a user population adopting Age-CM or Rating-CM, one observes a significant reduction of this tendency. For instance, in the Games data set, FM’s Chosen Items Popularity is 0.019 when Base-CM is adopted, but drops to 0.005 when Age-CM or Rating-CM is adopted. Although the Chosen Items Popularity of FM never reduces to the values observed for PCF, LPCF and NCF, a proper CM can still strongly alter this metric. Fig. 7, which shows the evolution of the Chosen Items Popularity for the considered RSs in the Apps data set, clearly displays the effect of Age-CM and Rating-CM on FM. We note that FM is not the only RS whose effect on the Popularity of the chosen items can be reduced when the users adopt Age-CM or Rating-CM: Fig. 7 shows that these two CMs also mitigate, although much less than for FM, the tendency of PCF, NCF and the non-personalised RSs POP and AR to produce popular choices.

A completely different scenario can be observed for LPCF, an RS that explicitly penalises popular items when ranking the recommendations. Here, if the users choose the top item recommended by LPCF (Base-CM), the Chosen Items Popularity is minimal; if the users adopt other choice models, one observes a higher Chosen Items Popularity. Moreover, while LPCF has a relatively high Choice Coverage and a low Gini index when Base-CM is adopted, its Choice’s Rating is among the lowest in most cases (Table 6). Interestingly, in a different scenario, i.e. when users adopt Popularity-CM, Rating-CM or even Age-CM, the Choice’s Rating grows to values closer to those of PCF and NCF. However, as a side effect, under these CMs users choose less diverse items, so diversity decreases (the Gini index increases and Choice Coverage decreases).

Fig. 7 Apps data set: the evolution of the Popularity for each RS

Finally, we found that the CM can even influence some facets of the RS itself, by altering the training data used by the RS. As an example, Fig. 8 shows the evolution of the RS Coverage on the Apps data set for three RSs: (a) PCF, (b) LPCF, and (c) NCF. This metric measures the RS’s capacity to cover the catalogue items with its recommendations. It is clear that PCF’s Coverage is smaller when a population of users adopting Popularity-CM is considered, compared to Base-CM, whereas in populations adopting the other two CMs the RS Coverage of PCF remains unchanged. Interestingly, the RS Coverage of LPCF also differs from the Base-CM scenario when Rating-CM, Age-CM or Popularity-CM is adopted by the users. Additionally, although NCF’s Coverage appears to vary across user populations with different CMs, no clear pattern emerges in this case.

Fig. 8 Apps data set: the evolution of RS Coverage for three RSs, PCF, LPCF and NCF

Table 7 Gini index calculated over all of the simulated choices for each data set

6 Conclusions and future works

In this article, we have first analysed the correlation of three distinct item properties, namely item popularity, age and rating, with users’ choices. We measured these correlations in three data sets containing logs of ratings for items, which signal purchases on an online eCommerce website (Amazon Apps, Games and Kindle Books). We have shown that, in these three data sets, many users have choices correlated with these properties, and that these behaviours are actually independent. This result supports the significance of the subsequent analysis, aimed at understanding how the users’ choices may be globally distributed when the users’ inclination to make a choice is influenced by the above-mentioned properties. Hence, motivated by this correlation analysis, we have simulated month-by-month choices of users adopting four alternative choice models (CMs) for items suggested by six alternative RSs.
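
The month-by-month simulation just summarised can be sketched as the following loop. This is an illustrative skeleton, not the actual framework: the stand-in recommender, the Base-CM stub and the method names (`train`, `recommend`, `choose`) are all hypothetical.

```python
class _PopularityRS:
    """Toy stand-in recommender: always recommends items 0..k-1."""
    def train(self, interactions):
        pass  # a real RS would refit its model on the interaction log
    def recommend(self, user, k=10):
        return list(range(k))

class _BaseCM:
    """Toy stand-in for Base-CM: pick the top recommended item."""
    def choose(self, user, slate):
        return slate[0] if slate else None

def simulate(rs, choice_model, interactions, months, users):
    """Month-by-month loop: retrain, recommend, simulate choices."""
    for month in range(months):
        rs.train(interactions)                # retrain on the growing log
        for user in users:
            slate = rs.recommend(user, k=10)  # top-k recommendation slate
            choice = choice_model.choose(user, slate)
            if choice is not None:
                interactions.append((user, choice, month))
    return interactions

log = simulate(_PopularityRS(), _BaseCM(), [], months=2, users=["u1"])
# Each month "u1" accepts the top recommendation (item 0).
```

The key point the sketch illustrates is the feedback loop: the simulated choices are appended to the log that the RS is retrained on, so the CM ends up shaping the RS’s future training data.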

6.1 Findings

We have found several interesting facts relating CMs and RSs in the considered data sets. We have identified the substantial influence of the adopted choice model on the choice distribution of the simulated users. For instance, when users tend to choose more popular items (Popularity-CM), the choices become even more concentrated over a small set of items, while choosing newer items (Age-CM) can lead to more diverse choices, but of lower quality.

We have also found that, besides the choice model, the RS itself has a significant impact on the choice distribution of the users. For instance, the strong popularity bias of the non-personalised RSs is clearly visible also in the simulated choices, and even a users’ tendency to prefer newer or highly rated items, i.e. a CM that prioritises items with these characteristics, has only a marginal effect on the popularity of the chosen items.

Finally, we discovered that some important effects of the RS may depend on the adoption of a particular CM in the users’ population. For instance, when an RS recommends relatively less popular items (as LPCF does), the average popularity of the chosen items is minimal when users are influenced in their choices only by the recommendations (Base-CM), but increases when any other CM is adopted.

The major findings of our study are summarised in Table 8. We highlight their relations with the considered metrics of the choices’ distribution and quality, and our three research hypotheses.

Table 8 Summary of the most important and evident findings supporting the research hypotheses

6.2 Contribution

In this paper, a novel method for evaluating recommender systems is proposed. It focuses on the long-term effect of RSs on the users’ choice distribution when users adopt a given choice model. The users’ choice model is largely user-dependent; however, the interaction context, including the RS’s graphical user interface, may lead users to make their choices by prioritising certain properties of the presented items, for instance their popularity. Analysing the combined effect of the recommendations and the users’ choice model is particularly important when deploying a novel RS or changing a system parameter; such changes can have unexpected consequences that could even damage the involved stakeholders. For instance, in an online music streaming service, the business might want to introduce a new feature where a special label marks recommended items that are recently released or popular. In this case, users are nudged into choosing tracks with these specific properties. These types of changes to the user/system interaction affect how users make choices (i.e. their choice model); hence, they have an impact on the choice distribution. To tame a potential negative impact on the users’ purchases (choices), or to enhance a positive one, the proposed simulation approach can be used in operational systems to quantitatively estimate both types of effects before the actual system deployment.

Another use case of the proposed simulation framework arises when a novel recommender system is trained for deployment. For instance, if an RS with a known popularity bias is used to recommend articles on a news platform, the real impact of such a recommender system should be anticipated with regard to users with alternative choice models. Such simulations can in general help build fairness-aware recommender systems that capture the long-term impact of new features or updates (Fu et al. 2020; D’Amour et al. 2020).

The proposed simulation framework has general applicability and can be relatively easily adjusted to specific settings. For instance, one can conduct simulations with other data sets, modify the simulation parameters, such as the number of recommendations and the time intervals, or even simulate the effect of the GUI on the users’ CMs. The presented analysis can give other researchers insights into how RSs and CMs, independently and combined, may determine the long-term and collective distribution of the choices of a population of users.

6.3 Limitations and future works

Some open issues remain and could be addressed in future works. First of all, it is important to measure the reliability of the results obtained by our simulation study. As discussed in the literature (Hazrati and Ricci 2022b), an important analysis must assess whether the various components of a simulation framework, e.g. the choice model, correctly match real behaviours and observations. Assessing the reliability of a choice model would be possible with an additional data set where, for each user, both the slate of recommendations and the choice made by the user are available. In that case, one can test how accurate a choice model is in predicting the observed choices.

Moreover, in our study we have made the assumption that the choices of the users are influenced only by the RS and a rather simple CM, which considers a single feature of the items. This assumption is motivated by the desire to isolate the effects of these components. While most simulation-based studies in the literature make similar assumptions (Zhang et al. 2020; Szlávik et al. 2011), users may follow a more complex decision-making procedure when selecting items (Chaney 2021). In fact, users’ choices could be modelled more accurately by considering their prior knowledge of the catalogue, referred to as the awareness set in some studies (Fleder and Hosanagar 2009; Hazrati and Ricci 2022a). Additionally, as observed in the correlation analysis (Sect. 3.2), users may be differently influenced by the item properties when making choices. While we have used parameters to control the influence of age (\(k_a\)) and popularity (\(k_f\)), it is important to investigate how these parameters affect the choice distribution in the long term.
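
As an illustration of how such influence parameters might enter a choice model, consider a hypothetical multinomial-logit sketch: \(k_a\) and \(k_f\) weight the item’s novelty and popularity in a utility that is turned into choice probabilities via a softmax. The functional form below is an assumption for exposition, not the paper’s actual model.

```python
import math

def choice_probabilities(slate, k_a=1.0, k_f=1.0):
    """Hypothetical multinomial-logit choice model over a slate.

    slate: list of (novelty, popularity) pairs, both scaled to [0, 1],
    where novelty is high for recent items. k_a and k_f control how
    strongly item age (novelty) and popularity influence the choice.
    """
    utilities = [k_a * novelty + k_f * popularity
                 for novelty, popularity in slate]
    z = sum(math.exp(u) for u in utilities)  # softmax normaliser
    return [math.exp(u) / z for u in utilities]

# With k_f = 0 this hypothetical user ignores popularity entirely,
# so the newer first item is preferred over the popular second one.
probs = choice_probabilities([(0.9, 0.1), (0.2, 0.8)], k_a=2.0, k_f=0.0)
```

Sweeping \(k_a\) and \(k_f\) in such a model is one concrete way to study how the strength of these influences shapes the long-term choice distribution.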

An additional interesting analysis could focus on other types of RSs, in particular those systems that leverage the same item properties that influence the CMs in our analysis. For instance, it could be interesting to simulate the choices of users exposed to a content-based RS that recommends the most novel items, and to assess its effects when users are assumed to adopt Age-CM. This could possibly generate choices for even more novel and less popular items. A final extension of our study should focus on simulating other types of user behaviour, not only choices of product items. In fact, the proposed simulation framework is flexible and could be used to simulate other types of feedback, such as likes or clicks on items.

In conclusion, we hope that the limitations of our work could open new and interesting lines of future research that will further clarify the global effects of the interaction of a larger variety of recommender systems with real diverse users.