Dimensions and Metrics for Evaluating Recommendation Systems
Recommendation systems help users and developers of various computer and software systems to overcome information overload, perform information discovery tasks, and approximate computation, among others. They have recently become popular and have been applied in a wide variety of scenarios, ranging from business process modeling to source code manipulation. Due to this wide variety of application domains, different approaches and metrics have been adopted for their evaluation. In this chapter, we review a range of evaluation metrics and measures as well as some approaches used for evaluating recommendation systems. The metrics presented in this chapter are grouped under sixteen different dimensions, e.g., correctness, novelty, and coverage. We review these metrics according to the dimensions to which they correspond. We also present a brief overview of approaches to comprehensive evaluation using collections of recommendation system dimensions and associated metrics, and provide suggestions for key future research and practice directions.
Due to the complexity of today’s software systems, modern software development environments provide recommendation systems for various tasks. These ease the developers’ decisions or warn them about the implications of their decisions. Examples are code completion, refactoring support, or enhanced search capabilities during specific maintenance activities. In recent years, research has produced a variety of these recommendation systems and some of them have similar intentions and functionalities [24, 60]. One obvious question is, therefore, how can we assess quality and how can we benchmark different recommendation systems?
In this chapter, we provide a practical guide to the quantitative evaluation techniques commonly used to compare recommendation systems. As a first step, we have identified a set of dimensions (e.g., the correctness or diversity of the results) that may serve as a basis for an evaluation of a recommendation system. The different dimensions are explained in detail and different metrics are presented to measure and quantify each dimension. Furthermore, we explore interrelationships between dimensions and present a guide showing how to use the dimensions in an individual recommendation system validation.
The rest of the chapter is organized as follows: Sect. 10.2 introduces the evaluation dimensions for recommendation systems and presents common metrics for them. Section 10.3 explores relationships between the different dimensions. Section 10.4 provides a description of some evaluation approaches and their practical application and implications. Finally, conclusions are drawn in Sect. 10.5.
The multi-faceted characteristics of recommendation systems lead us to consider multiple dimensions for recommender evaluation. Just one dimension and metric for evaluating the wide variety of recommendation systems and application domains is far too simplistic to obtain a nuanced evaluation of an approach as applied to a particular domain.
In this chapter, we investigate a variety of dimensions that may play a significant role in evaluating a recommendation system. We list these dimensions below according to our view of their relative evaluative importance, along with the characteristics that each dimension is used to measure. Some of these dimensions describe qualitative characteristics while others are more quantitative.
- Correctness.
How close are the recommendations to a set of recommendations that are assumed to be correct?
- Coverage.
To what extent does the recommendation system cover a set of items or user space?
- Diversity.
How diverse (dissimilar) are the recommended items in a list?
- Trustworthiness.
How trustworthy are the recommendations?
- Recommender confidence.
How confident is the recommendation system in its recommendations?
- Novelty.
How successful is the recommendation system in recommending items that are new or unknown to users?
- Serendipity.
To what extent has the system succeeded in providing surprising yet beneficial recommendations?
- Utility.
What is the value gained from this recommendation for users?
- Risk.
How much user risk is associated with accepting each recommendation?
- Robustness.
How tolerant is the recommendation system to bias or false information?
- Learning rate.
How fast can the system incorporate new information to update its recommendation list?
- Usability.
How usable is the recommendation system? Will it be easy for users to adopt it in an appropriate way?
- Scalability.
How scalable is the system with respect to number of users, underlying data size, and algorithm performance?
- Stability.
How consistent are the recommendations over a period of time?
- Privacy.
Are there any risks to user privacy?
- User preference.
How do users perceive the recommendation system?
We have grouped these dimensions into four broad categories, depending on different aspects of the recommendation system they address: recommendation-centric, user-centric, system-centric, and delivery-centric. Table 10.1 summarizes how each of the above dimensions can be grouped inside each category.
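Table 10.1 Categorization of dimensions
Category | Dimensions
Recommendation-centric | Correctness, coverage, diversity, recommender confidence
User-centric | Trustworthiness, novelty, serendipity, utility, risk
System-centric | Robustness, learning rate, scalability, stability, privacy
Delivery-centric | Usability, user preference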
Recommendation-centric dimensions primarily assess the recommendations generated by the recommendation system itself: their coverage, correctness, diversity, and the level of confidence in the produced recommendations. User-centric dimensions, on the other hand, allow us to assess the degree to which the recommendation system under evaluation fulfills the needs of its target end users. This includes how trustworthy the recommendations are, their degree of novelty, whether serendipitous recommendations are a feature, the overall utility of the recommendations from the users’ perspective, and the risks associated with the recommendations produced, again from the users’ perspective. System-centric dimensions, in contrast, principally provide ways to gauge the recommendation system itself, rather than the recommendations or the user perspective. These include assessment of the robustness of the recommendation system, its learning rate given new data, its scalability given data size, its stability under data change, and its degree of privacy support in the context of shared recommendation system datasets. Finally, delivery-centric dimensions primarily focus on the recommendation system in the context of use, including its usability (broadly assessed) and its support for user configuration and preferences.
The following subsections describe each of these dimensions in detail.
10.2.1 Correctness
In order to be of real value, recommendation systems must provide useful results that are close to users’ interests, intentions, or applications, without overwhelming them with unwanted results. A key measure of this is the correctness of the set of recommendations produced. Correctness provides a measure of how close the recommendations given to a user are to a predefined set of expected, or assumed correct, recommendations. This predefined set of correct recommendations is sometimes referred to as the gold standard. The correctness of a recommendation may refer to its alignment with a benchmark (e.g., each recommended link is in the predefined set of correct links), or to how well it adheres to desired qualities (e.g., an increase in developer productivity).
Depending on the type of recommendations the system is generating, different methods can be used for measuring correctness. A recommender might predict how users rate an item, the order (ranking) of most interesting to least interesting items for a user in a list, or which item (or list of items) is of interest to the user. In the following subsections, we describe the most commonly used metrics for evaluating recommendation approaches for correctness in each scenario.
Predicting User Ratings
If the items to be tested represent an unbalanced distribution, the root mean squared error (RMSE) and the mean absolute error (MAE) can be used in averaged form, depending on the evaluation (e.g., per-user or per-item). If the RMSE of each item can be calculated separately, then the average of all calculated RMSEs represents the average RMSE of the recommendation system.
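As a minimal sketch of these error measures, the following Python fragment computes RMSE and MAE over (predicted, actual) rating pairs, together with a per-item averaged RMSE. It assumes the usual textbook definitions of RMSE and MAE; the function names and the small rating set are hypothetical and serve only as an illustration.

```python
import math
from collections import defaultdict


def rmse(pairs):
    """Root mean squared error over (predicted, actual) rating pairs."""
    return math.sqrt(sum((p - a) ** 2 for p, a in pairs) / len(pairs))


def mae(pairs):
    """Mean absolute error over (predicted, actual) rating pairs."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)


def average_rmse_per_item(ratings):
    """Average of per-item RMSEs; ratings holds (item_id, predicted, actual)."""
    by_item = defaultdict(list)
    for item_id, predicted, actual in ratings:
        by_item[item_id].append((predicted, actual))
    return sum(rmse(pairs) for pairs in by_item.values()) / len(by_item)


# Hypothetical, unbalanced test set: item "a" has three ratings, item "b" one.
ratings = [("a", 4.0, 5.0), ("a", 3.0, 3.0), ("a", 2.0, 2.0), ("b", 1.0, 3.0)]
overall_rmse = rmse([(p, a) for _, p, a in ratings])
per_item_rmse = average_rmse_per_item(ratings)
```

In this example the per-item average gives the sparsely rated item "b" the same weight as item "a", which is the effect the averaged form is meant to achieve.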
Ranking Items
Ranking measures are used when an ordered list of recommendations is presented to users according to their preferences. Typically, the most important, or “most relevant,” items appear at the top and the least important, or “least relevant,” items at the bottom. For example, when recommending links between architecture documents and code artifacts in a source code traceability recommendation system, the most closely related links should be shown first. Similarly, when recommending code snippets for reuse from a source code repository in a code reuse recommendation system, the code snippet most appropriate to the current reuse context should be shown first.
When checking the correctness of a ranking, if a reference ranking (benchmark) is available, correctness can be measured by the normalized distance-based performance measure (NDPM). The value returned by NDPM is between 0 and 1, with any acceptable ranking having a distance of 0. A ranking farthest away from an ideal ranking would have a normalized distance of 1.
NDPM penalizes a prediction that contradicts the reference ordering twice as much as a prediction that leaves the ordering of two items unresolved. It also does not penalize the system for ranking one item over another when the two are tied in the reference ranking. In some situations, however, a tie indicates that the tied items are of equal value to the user, so ranking one of them above the other produces an inaccurate ranking. In situations where ties between recommended items are to be considered, rank correlation measures such as Spearman’s ρ or Kendall’s τ can be used [30, 31].
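The behavior described above can be sketched in Python as follows. The sketch assumes the commonly cited pairwise formulation NDPM = (2·C⁻ + Cᵘ) / (2·C), where C is the number of item pairs ordered by the reference ranking, C⁻ the number of those pairs the system orders the other way, and Cᵘ the number the system leaves tied. The function name and the score dictionaries are hypothetical.

```python
from itertools import combinations


def ndpm(reference, predicted):
    """NDPM between two score dictionaries (item -> score, higher is better)."""
    contradictory = unresolved = total = 0
    for a, b in combinations(reference, 2):
        ref_diff = reference[a] - reference[b]
        if ref_diff == 0:
            continue  # tie in the reference: any predicted order is acceptable
        total += 1
        pred_diff = predicted[a] - predicted[b]
        if pred_diff == 0:
            unresolved += 1  # system does not order the pair: half penalty
        elif (ref_diff > 0) != (pred_diff > 0):
            contradictory += 1  # system reverses the pair: full penalty
    if total == 0:
        return 0.0
    return (2 * contradictory + unresolved) / (2 * total)


reference = {"x": 3, "y": 2, "z": 1}
same_order = ndpm(reference, {"x": 9, "y": 5, "z": 1})
reversed_order = ndpm(reference, {"x": 1, "y": 2, "z": 3})
all_tied = ndpm(reference, {"x": 1, "y": 1, "z": 1})
```

Because pairs tied in the reference are skipped, this measure does not capture the case in which a tie is itself meaningful, which is where a rank correlation measure would be used instead.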
In some cases, the position of recommended items in the list is important for the application of the recommendation. For example, in a software documentation retrieval environment, since not all documentation artifacts are of equal relevance to their users, highly relevant documents, or document components, should be identified and ranked first for presentation. Therefore, the correctness of an item in the ranking list should be weighted by its position in the ranking. A frequently used metric for measuring ranking correctness, considering item ranking position, is the normalized discounted cumulative gain (NDCG). It is calculated by measuring the discounted cumulative gain (DCG) and then comparing that to the DCG of the ideal ranking. DCG measures the correctness of a ranked list based on the relevance of items discounted by their position in the list. Higher values of NDCG indicate better ranked lists and therefore better correctness. Various approaches have been introduced to optimize NDCG and ranking measures. Examples of these approaches can be found in Weimer et al. and Le and Smola.
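The following Python sketch illustrates one common way of computing DCG and NDCG. It assumes that the gain of an item is its raw relevance score and that the discount at position i (counting from 1) is log2(i + 1); other gain and discount functions are in use, so these choices, like the function names, are assumptions made for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(
        rel / math.log2(position + 1)
        for position, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances):
    """DCG of the list divided by the DCG of its ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of three retrieved documents, in ranked order.
well_ranked = ndcg([3, 2, 1])
badly_ranked = ndcg([1, 2, 3])
```

Here the list that places the most relevant document first reaches the maximum value of 1, while the reversed list scores lower.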
Recommending Interesting Items
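If a recommendation system is providing the items that users may like to use, a common approach to evaluating it is to use classification metrics such as precision, recall (also called true positive rate), accuracy, false positive rate, and specificity (also called true negative rate). These metrics have been used extensively across different domains [e.g., 15, 17, 18, 43, 47, 80]. They classify recommendations into groups, as indicated by Table 10.2.
Table 10.2 Categorization of all possible recommendations (also called the confusion matrix)
 | Recommended | Not recommended
Preferred | True positive (TP) | False negative (FN)
Not preferred | False positive (FP) | True negative (TN)
Once the categories are defined, these metrics can be calculated as follows:
Precision = TP / (TP + FP)
Recall (true positive rate) = TP / (TP + FN)
Accuracy = (TP + TN) / (TP + TN + FP + FN)
False positive rate = FP / (FP + TN)
Specificity (true negative rate) = TN / (FP + TN)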
It is also important to consider the cost associated with false positives (FP) and false negatives (FN). For example, users may find false positives relatively easy to identify and discard. If this is the case, false positives are less costly than false negatives, and recall is preferred over precision. The F-measure, the harmonic mean of precision and recall, assumes an equal cost for both FP and FN.
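As an illustration, the classification metrics discussed here can be computed from sets of recommended and preferred items, as in the Python sketch below. It assumes the standard definitions (for example, precision as TP / (TP + FP) and the F-measure as the harmonic mean of precision and recall) and a finite item catalog. The function names, the catalog, and the convention of reporting an undefined ratio as 0.0 are assumptions of this sketch.

```python
def confusion_counts(recommended, preferred, catalog):
    """Count TP, FP, FN, and TN for one user over a finite item catalog."""
    recommended, preferred, catalog = set(recommended), set(preferred), set(catalog)
    tp = len(recommended & preferred)
    fp = len(recommended - preferred)
    fn = len(preferred - recommended)
    tn = len(catalog - recommended - preferred)
    return tp, fp, fn, tn


def ratio(numerator, denominator):
    """Division that reports an undefined ratio as 0.0 (a convention)."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, fn, tn):
    """Classification metrics derived from the four confusion-matrix counts."""
    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": ratio(tp + tn, tp + fp + fn + tn),
        "false_positive_rate": ratio(fp, fp + tn),
        "specificity": ratio(tn, fp + tn),
        "f_measure": ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical catalog of ten items, four recommended, three preferred.
catalog = list("abcdefghij")
counts = confusion_counts(recommended="abcd", preferred="abe", catalog=catalog)
metrics = classification_metrics(*counts)
```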
Sometimes it is desirable to provide multiple recommendations to users. In this case, these metrics can be adapted to measure correctness for the number of items being provided to the user. For example, consider a code completion recommender that can recommend hundreds of items while the user is typing program code. Showing one item at a time would be too limited; similarly, showing all recommendations would not be useful. If five items are shown to the user for each recommendation, the precision of this code completion recommender can be calculated as precision at 5.
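A minimal sketch of precision at k in Python is given below. It assumes that the denominator is always k, even when fewer than k items are returned; some evaluations divide by the number of items actually shown instead. The completion proposals are hypothetical.

```python
def precision_at_k(ranked, preferred, k):
    """Fraction of the top-k ranked items that the user actually prefers."""
    if k <= 0:
        raise ValueError("k must be positive")
    preferred = set(preferred)
    hits = sum(1 for item in ranked[:k] if item in preferred)
    return hits / k


# Hypothetical completion proposals, in the order shown to the developer.
proposals = ["toString", "equals", "hashCode", "clone", "wait", "notify"]
accepted = {"equals", "clone", "notify"}
p_at_5 = precision_at_k(proposals, accepted, k=5)
```

In this example two of the first five proposals are relevant; the relevant sixth proposal does not count because it falls outside the top five.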
When recommendations are evaluated over a range of recommendation list lengths, one can plot precision versus recall (the precision–recall curve) or true-positive rate versus false-positive rate (the receiver operating characteristic, or ROC, curve). Both curves measure the proportion of preferred items that are actually recommended. Precision–recall curves emphasize the proportion of recommended items that are preferred, while ROC curves emphasize the proportion of items that are not preferred but are recommended.
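The points of both curves can be obtained by sweeping the list length, as in the following Python sketch. It assumes a single ranked list, a known set of preferred items, and a known catalog size containing at least one preferred and one non-preferred item; the function name and data are hypothetical.

```python
def pr_roc_points(ranked, preferred, catalog_size):
    """Precision, recall (TPR), and FPR for every list length k = 1..len(ranked)."""
    preferred = set(preferred)
    positives = len(preferred)
    negatives = catalog_size - positives
    points = []
    true_positives = 0
    for k, item in enumerate(ranked, start=1):
        if item in preferred:
            true_positives += 1
        false_positives = k - true_positives
        points.append(
            {
                "k": k,
                "precision": true_positives / k,
                "recall": true_positives / positives,
                "false_positive_rate": false_positives / negatives,
            }
        )
    return points


# Hypothetical ranked list covering a four-item catalog.
points = pr_roc_points(["a", "b", "c", "d"], preferred={"a", "c"}, catalog_size=4)
```

Plotting precision against recall gives the precision–recall curve, while plotting recall against the false-positive rate gives the ROC curve.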
10.2.2 Coverage
Recommendation systems make recommendations by searching available information spaces. Making a recommendation is not always possible, for example when new items or users are introduced, or when insufficient data is available for particular items or users. Coverage refers to the proportion of available information (items, users) for which recommendations can be made.
Consider a code maintenance recommendation system that guides developers on where to look in a large code base to apply modifications [e.g., 59]. If such a recommender is not capable of covering the whole codebase at hand, developers might not be able to find the actual artifact that requires alteration. Hence, the information overload problem and complexity of finding faults in the codebase will still exist to a greater or lesser degree. Sometimes this is acceptable, such as when alternative techniques, like visualization, can assist users. Sometimes this is unacceptable, for example when the search space is too large for developers or important parts of the code base remain un-searched, thus hindering maintenance effort.
Coverage usually refers to catalog coverage (item-space coverage) or prediction coverage (user-space coverage). Catalog coverage is the proportion of available items that the recommendation system recommends to users. Prediction coverage refers to the proportion of users or user interactions for which the recommendation system is able to generate predictions.
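Both notions of coverage can be sketched in a few lines of Python. The sketch assumes that the recommendations produced during an evaluation are available as a mapping from user to recommended items, and that a user counts as covered when at least one item is recommended to them. The names and data are hypothetical.

```python
def catalog_coverage(recommendations, catalog):
    """Proportion of catalog items recommended to at least one user."""
    catalog = set(catalog)
    recommended = set()
    for items in recommendations.values():
        recommended.update(items)
    return len(recommended & catalog) / len(catalog)


def prediction_coverage(recommendations, users):
    """Proportion of users who receive at least one recommendation."""
    users = set(users)
    served = {user for user, items in recommendations.items() if items and user in users}
    return len(served) / len(users)


# Hypothetical evaluation run: user u3 gets an empty list, user u4 gets none.
recommendations = {"u1": ["a", "b"], "u2": ["b"], "u3": []}
item_coverage = catalog_coverage(recommendations, catalog=["a", "b", "c", "d"])
user_coverage = prediction_coverage(recommendations, users=["u1", "u2", "u3", "u4"])
```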
The situation in which a new item is added to the system and sufficient information (like ratings by other users for that item) does not yet exist is referred to as the cold start problem. Cold start can also refer to situations where new users have joined the system and their preferences are not yet known. For example, consider a recommendation system, similar to DebugAdvisor, that recommends solutions for fixing a bug. In such a recommender, the developer submits a query describing the defect. The system then searches for bug descriptions, functions, or people that can help the developer fix the bug. If the bug, or a similar bug, has not been previously reported, there is no guarantee that the returned results will help resolve the situation. Similarly, if the system has been newly implemented in a development environment with few bug reports or code repositories, the recommendations would not be very helpful.
Cold start is seen more often in collaborative filtering recommenders, as they rely heavily on input from users. Therefore, these recommenders can be used in conjunction with other non-collaborative techniques. Such a hybrid mechanism was proposed by Schein et al., who used two variations of ROC curves to evaluate their method, namely global ROC (GROC) and customer ROC (CROC). GROC was used to measure performance when the recommender is allowed to recommend more often to some users than to others. CROC was used to measure performance when the system was constrained to recommend the same number of items to each user.
10.2.3 Diversity
In some cases, having similar items in a recommendation list does not add value from the users’ perspective. The recommendations will seem redundant and it takes longer for users to explore the item space. For example, in an API recommendation system, showing two APIs with the same non-functional characteristics may not be useful unless it helps users gain confidence in the recommendation system. Showing two APIs with (say) diverse performance, memory overheads, and providers could be more desirable for the developer.
A recommendation list should display some degree of diversity in the presented items. Candillier et al. performed a case study on recommending documents to users in which they showed that users prefer a system providing document diversity. This allows users to get a more complete map of the information.
To measure diversity in a recommendation list, one approach is to compute the distance of each item from the rest of the list and average the result to obtain a diversity score. By such a measure, however, a random recommender may also produce diverse recommendations. Therefore, this needs to be accompanied by some measure of precision. Plotting precision–diversity curves helps in selecting the algorithm with the dominating curve. Combining correctness metrics with diversity has an added advantage, as correctness metrics do not take into account the entire recommendation list. Instead, they consider the correctness of individual items. For instance, the intra-list similarity metric can help to improve the process of topic diversification for recommendation lists. In this way, the returned lists can be checked for intra-list similarity and altered to either increase or decrease the diversity of items on that list as desired or required. Lists diversified in this way have been shown to perform worse than unchanged lists according to correctness measures, but users preferred the altered lists.
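A distance-based diversity score of this kind can be sketched as the average pairwise distance between the items of a list. The Python fragment below assumes that each item is described by a set of features and uses the Jaccard distance between feature sets; both the item representation and the choice of distance are assumptions of this sketch, and any domain-specific distance could be substituted.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two feature sets."""
    a, b = set(a), set(b)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def list_diversity(items, distance=jaccard_distance):
    """Average pairwise distance between the items of a recommendation list."""
    pairs = list(combinations(items, 2))
    if not pairs:
        return 0.0
    return sum(distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical API recommendations described by feature tags.
recommended_apis = [{"http", "json"}, {"http", "xml"}, {"crypto"}]
diversity_score = list_diversity(recommended_apis)
```

Averaging the similarity instead of the distance over the same pairs would give an intra-list similarity score, where lower values indicate a more diverse list.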
Diversity of rating predictions can be measured by well-known diversity measures used in ensemble learning. These approaches try to increase the diversity of the classifications returned by individual learning algorithms in order to improve the overall performance. For example, Q-statistics can be used to find diversity between two recommender algorithms. Q-statistics are based on a modified confusion matrix, confronting two classifiers as correctly classified versus incorrectly classified. As a result, the confusion matrix displays the overlap of those itemsets. Q-statistic measures are then defined to combine the elements in the modified confusion matrix, ultimately arriving at a measure for the diversity of the two recommender algorithms. Kille and Albayrak used this approach and introduced a difficulty measure to help with personalizing recommendations per user. They measured a user’s difficulty by means of the diversity of rating predictions (RMSE) and item rankings (NDCG), and adapted the pairwise Q-statistic diversity metric to fit the item ranking scenario.
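The pairwise Q-statistic can be sketched as follows, assuming the formulation commonly used in ensemble learning: Q = (N11·N00 − N01·N10) / (N11·N00 + N01·N10), where N11 counts cases both algorithms get right, N00 cases both get wrong, and N10 and N01 cases only one of them gets right. What counts as a "correct" recommendation (for example, a predicted rating within some tolerance of the true rating) is an assumption left to the evaluator; this is not the specific procedure of Kille and Albayrak.

```python
def q_statistic(correct_a, correct_b):
    """Pairwise Q-statistic from two parallel lists of correctness flags."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1
        elif a:
            n10 += 1
        elif b:
            n01 += 1
        else:
            n00 += 1
    numerator = n11 * n00 - n01 * n10
    denominator = n11 * n00 + n01 * n10
    return numerator / denominator if denominator else 0.0


# Hypothetical correctness flags of two recommenders on the same four test cases.
recommender_a = [True, True, False, False]
identical = q_statistic(recommender_a, [True, True, False, False])
opposite = q_statistic(recommender_a, [False, False, True, True])
unrelated = q_statistic(recommender_a, [True, False, True, False])
```

Values near 1 indicate that the two algorithms succeed and fail on the same cases (low diversity), while values near −1 indicate that they err on different cases.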
Lathia et al. introduced a measure of diversity for recommendations in two lists of varying lengths. In their approach, given two lists L1 and L2, the items of L2 that are not in L1 are first determined as their set-theoretic difference. Then, the diversity between the two lists (at depth N) is defined as the size of this set-theoretic difference divided by N. This way, diversity returns 0 if the two lists are the same, and 1 if the two lists are completely different at depth N.
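The definition above translates directly into a short Python sketch. It assumes that both lists are truncated to their top N items before the difference is taken; the function name and the example lists are hypothetical.

```python
def diversity_at_depth(l1, l2, n):
    """Size of the set difference L2 \\ L1 at depth n, divided by n."""
    if n <= 0:
        raise ValueError("n must be positive")
    top1, top2 = set(l1[:n]), set(l2[:n])
    return len(top2 - top1) / n


# Hypothetical recommendation lists produced for the same user at two times.
earlier = ["a", "b", "c", "d"]
later = ["a", "b", "e", "f", "g"]
changed = diversity_at_depth(earlier, later, n=4)
```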
10.2.4 Trustworthiness
A recommendation system is expected to provide trustworthy suggestions to its users. It has been shown that perceived usefulness correlates most highly with good and useful recommendations. If the system is continuously producing incorrect recommendations, users’ trust in the recommender will be lost. Lack of trustworthiness will encourage users to ignore recommendations and so decrease the usefulness of the recommendation system. For example, in an IDE being used for a refactoring scenario, a wrong suggestion made by the refactoring task recommender may adversely impact large amounts of application code. If users of such a refactoring recommendation system use a faulty recommendation and experience the consequences, they will be less likely to use it again.
Some users will not build trust in the recommendations unless they see a well-known item, or an item they were already aware of, being recommended. Also, explanations regarding how the system comes up with its recommendations can encourage users to use them and build trust [72, 77].
A common approach to measuring trust is to ask users in a user study whether the recommendations are reasonable [7, 14, 25]. Depending on the usage scenario of the recommendation system, it might be possible to check how frequently users use recommendations in order to gain an understanding of their trust. For example, in a code reuse recommender, one can measure how often the user selects and applies one of the recommended code snippets, or similarly, how often users select the recommendations of a code completion recommender.
10.2.5 Recommender Confidence
Recommender confidence is the certainty the system has in its own recommendations or predictions. In online scenarios, it is possible to calculate recommender confidence by observing environmental variables. For example, a refactoring recommendation system can build confidence scores by observing how frequently users use and apply suggested refactoring recommendations to their application.
Some prediction models can be used in calculating confidence scores. For example, Bell et al. used a neighborhood-aware similarity model that considers similarities between items and users for generating recommendations. In their model, a recommendation that maximizes the similarity between the item being recommended and similar items, and the user to whom a recommendation is to be presented and similar users, defines the most suitable recommendation. They showed how such a metric can help identify the most suitable recommendations, according to the RMSE of the predicted rating and the user’s true rating.
Cheetham and Price provided an approach for calculating confidence in case-based reasoning (CBR) systems. They proposed to identify multiple indicators such as “sum of similarities for retrieved cases with best solution” or “similarity of the single most similar case with best solution.” Once possible indicators were defined, their effect on the CBR process was determined using “leave-one-out” testing. Finally, they used Quinlan’s C4.5 algorithm on the leave-one-out test results to identify indicators that are best at determining confidence.
Recommender confidence scores can be expressed as confidence intervals [e.g., 61] or as the probability that the predicted value is true. They have also been used in hybrid recommendation systems for switching between recommender algorithms.
10.2.6 Novelty
A novel recommendation is one that the user did not know about. Novelty is very much related to the emotional response of users to a recommendation; as a result, it is a difficult dimension to measure.
A possible approach for building a novel recommender is to remove items that the user has already rated or used before from the recommendation list. If this information is available, the novelty of the recommender can be measured easily by comparing recommendations against items that have already been used or rated. This requires keeping user profiles so that it is possible to know which user chose and rated which items. User profiles can then be used to calculate the set of familiar items. For example, CodeBroker is a development environment that promotes reuse by enabling software developers to reuse available components. It integrates a user model for capturing methods that the developer already knows and that thus do not need to be recommended again.
An alternative approach for measuring novelty is to count the number of popular items that have been recommended. This metric is based on the assumption that highly rated and popular items are likely to be known to users and therefore not novel. A good measure for novelty might be to look more generally at how well a recommendation system made the user aware of previously unknown items that subsequently turn out to be useful in context.
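Both approaches to measuring novelty can be sketched in Python as shown below. The first function assumes that a user profile of already known items is available; the second assumes that usage counts per item are available and that an item counts as popular above a chosen threshold. The function names, the threshold, and the data are hypothetical.

```python
def unseen_fraction(recommended, known_items):
    """Fraction of recommended items that are absent from the user's profile."""
    if not recommended:
        return 0.0
    known = set(known_items)
    return sum(1 for item in recommended if item not in known) / len(recommended)


def popular_item_count(recommended, usage_counts, popularity_threshold):
    """Number of recommended items whose usage count reaches the threshold."""
    return sum(
        1 for item in recommended
        if usage_counts.get(item, 0) >= popularity_threshold
    )


# Hypothetical method recommendations, user profile, and usage statistics.
recommended = ["parseInt", "readLine", "mapMulti", "splitAsStream"]
profile = {"parseInt", "readLine"}
usage = {"parseInt": 900, "readLine": 700, "mapMulti": 12, "splitAsStream": 40}
novelty = unseen_fraction(recommended, profile)
popular = popular_item_count(recommended, usage, popularity_threshold=500)
```

A higher unseen fraction and a lower count of popular items both suggest a more novel list, although neither says whether the novel items are useful.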
10.2.7 Serendipity
Serendipity by definition is “the occurrence and development of events by chance in a happy or beneficial way”. In the context of recommendation systems, this has been referred to as an unexpected and fortuitous recommendation. Serendipity and novelty differ in that there is an element of correctness present in serendipity, which prevents random recommenders from being serendipitous. Novel unexpected items may, or may not, turn out to be serendipitous. While a random recommender may be novel, if a surprising recommendation does not have any utility to the user it will not be classified as serendipitous, but rather as erroneous and distracting. Therefore, correctness and serendipity must be balanced and considered together.
As with novelty, to have a serendipitous recommender, similar recommendations should be avoided, since their expected appearance in the list will generally not benefit the user. Therefore, user profiles or an automatic or manual labeling of pairs of similar items can help filter out similar items. The definition of this similarity, however, should depend on the context in which the recommender is being used. For example, an API recommender presenting completely unusable APIs in the current code context is highly unlikely to promote serendipitous reuse. A document recommender, showing unlikely but still possibly related artifacts in a traceability recommender, may very well present the user with serendipitously useful artifacts.
Ratability is a feature defined in relation to serendipity. It is considered mostly in machine learning approaches. Given that the system has some understanding of the user profile, the ratability of a recommended item to a user is the probability that the item will be the next item the user will consume. It is assumed that items with higher ratability are the items that the user has not consumed yet but is likely to use in the future, or the items the user has consumed but that have not been added to the user profile. In other words, ratability defines the obviousness of a “user rating an item.” Since machine-learning approaches calculate the probability of the item being chosen next, if the recommendation system is using a leave-one-out approach to train the learning procedure, it is possible to calculate the ratability based on that probability.
10.2.8 Utility
Utility is the value that the system or user gains from a recommendation. For example, PARSEWeb aims to help developers find sequences of method calls on objects of a specific type. This helps to match an object with a specific method sequence. In that context, the evaluation can be based on the amount of time saved in finding such a method sequence using recommendations. Therefore, the value of a correct recommendation is based on the utility of that item. A possible evaluation in this context is to consider utility from a cost/benefit ratio analysis.
It is noteworthy that precision cannot measure the true usefulness of a recommendation. For example, recommending an already well-known and used API call, document link, code snippet, data map, or algorithm will increase precision but has very low utility, since such an item will probably already be known to the user. On the other hand, for memory-intensive applications, it is sometimes beneficial to recommend well-known items. Thus, it is fair to align the recommender evaluation framework with utility measures in real-world applications rather than to optimize only for correctness.
Depending on the application domain of the recommendation system, the utility of a recommendation can be specified by the user (e.g., in user-defined ratings) or computed by the application itself (e.g., a profit-based utility function). The utility might be calculated by observing subsequent actions of the user, for example, interacting with the recommendation or using recommended items.
For some applications, the position of a recommendation in a list is a deciding factor. For example, RASCAL uses a recommender agent to track usage histories of a group of developers and recommends components that are expected to be needed by individual developers. The components that are believed to be most useful to the current developer will appear first in the recommendation list. If we assume that developers are more likely to choose a recommendation among the top recommended items than to explore the whole list, the utility of each recommendation is then the utility of the recommended item in relation to its position in the list of recommendations.
10.2.9 Risk
Depending on where the recommendation system is being used and what its application domain is, the recommendations can be associated with various potential risks. For example, recommending a list of movies to watch is usually less risky than recommending refactoring solutions in complex coding situations (unless the movies might include inappropriate material for some audiences). Therefore, high-risk recommendation systems must obey a set of constraints on a valid solution. This is because false positive recommendations are less tolerable and users need more convincing before they use a recommendation.
Users may also approach risk differently. For example, different users might be prepared to tolerate different levels of risk. One user might prefer using a component that is no longer maintained but has all required features. Another user might prefer a component that has fewer features but is under heavy development. In such cases, a standard way to evaluate risk is to consider utility variance in conjunction with the measures of utility, and to parameterize in the evaluation the degree of risk that users will tolerate.
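One way to combine utility and its variance is sketched below. It assumes a score of the form mean utility plus q times the utility variance, where the hypothetical parameter q encodes risk tolerance: a negative q favors recommenders with consistent utility (risk-averse users), and a positive q favors recommenders with occasional high payoffs (risk-seeking users). The observed utilities are invented for illustration.

```python
from statistics import mean, pvariance


def risk_adjusted_utility(utilities, q):
    """Mean utility plus q times the (population) variance of the utilities."""
    return mean(utilities) + q * pvariance(utilities)


# Hypothetical per-recommendation utilities observed for two recommenders.
steady = [5.0, 5.0, 5.0, 5.0]     # consistent, moderate value
volatile = [0.0, 10.0, 0.0, 10.0]  # same mean, much higher variance

risk_averse_steady = risk_adjusted_utility(steady, q=-0.1)
risk_averse_volatile = risk_adjusted_utility(volatile, q=-0.1)
risk_seeking_volatile = risk_adjusted_utility(volatile, q=0.1)
```

Although both recommenders have the same mean utility, the risk-averse setting ranks the steady recommender higher, while the risk-seeking setting prefers the volatile one.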
Another aspect of risk involves privacy. If the system is working according to user profiles, collecting information from users to create those profiles introduces the risk of breaching users’ privacy. Therefore, it should be ensured that users are aware of that risk and willing to take it. For example, when recommending developers based on expertise for outsourcing tasks, many other factors will also need to be considered. Privacy will be discussed further in Sect. 10.2.15.
10.2.10 Robustness
Robustness is the ability of a recommendation system to tolerate false information intentionally provided by malicious users or, more commonly, to tolerate mistaken information accidentally provided by users. Mistakes made by users may include asking the recommender to analyze documents in incorrect formats, mistakenly rating items, making mistakes in the user profile specification, and using the recommender in the wrong context or for the wrong tasks.
10.2.11 Learning Rate
Learning rate is the speed at which a recommendation system learns new information or trends and updates the recommended item list accordingly. A system with a high learning rate will be able to adapt to new user preferences or to the changing interests of existing users and provide useful recommendations within a short period of learning time. For example, an API recommendation system may have a high learning rate if the ranking index and calculations are updated immediately every time a user rates a recommended item. In comparison, a code recommendation system may have a low learning rate if the indexing of the code repository can only be undertaken sporadically due to high overheads.
Although a fast learning rate can cope with quick shifts in trends, it may also give up some prediction correctness since the new trend that the system recommends might not perfectly match a user’s interests. A slow learning rate can also affect the system utility if it fails to catch up with trends and cannot provide a new set of useful recommendations.
The evaluation of learning rate can be done by measuring (1) the time that it takes the system to regain its prediction correctness when user interests drift, (2) the time to reach a certain level of correctness for new users, or (3) the prediction correctness that the system can achieve within a limited learning time. Koychev and Schwab measured and plotted the prediction correctness of a recommendation system over time and assessed how fast their algorithm adapted to changes. To evaluate the learning rate for new users, Rashid et al. evaluated different algorithms that learn user preferences during the sign-up process. Each algorithm presents users with a list of initial items to be rated and learns from the given ratings. After the sign-up process and the learning phase are completed, predictions for other items are made and the accuracies of the algorithms are measured and compared.
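Measurements (2) and (3) above can be sketched in Python, assuming that prediction correctness has been sampled at several points in time and recorded as (elapsed time, correctness) pairs. The function names, the time unit, and the sample values are hypothetical.

```python
def time_to_reach(correctness_over_time, threshold):
    """Earliest sampled time at which correctness reaches the threshold, or None."""
    for elapsed, correctness in sorted(correctness_over_time):
        if correctness >= threshold:
            return elapsed
    return None


def best_within_budget(correctness_over_time, budget):
    """Highest correctness observed within the learning-time budget, or None."""
    scores = [c for elapsed, c in correctness_over_time if elapsed <= budget]
    return max(scores) if scores else None


# Hypothetical samples: minutes since a new user signed up, and precision.
observations = [(0, 0.40), (10, 0.55), (20, 0.70), (30, 0.72)]
minutes_needed = time_to_reach(observations, threshold=0.70)
precision_after_15 = best_within_budget(observations, budget=15)
```

Measurement (1) can be obtained in the same way by starting the clock at the moment the user interests are known to drift.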
10.2.12 Usability
In order for recommendation systems to be effective, their target end users must be able to use them in appropriate ways. They must also adhere to the general principles of usability: they must be effective, efficient, and provide some degree of satisfaction for their target end users.
Recommendation systems typically manifest in some way via a user interface. The content presented by this user interface plays an important role in the acceptance of the recommendation. This user interface may simply be an in situ suggestion to the user in the containing application. More commonly, a list of recommendations, often ranked, is provided to the user on demand. Additionally, many recommendation systems require configuration parameters, user preferences, and some form of user profile to be specified. All of these interfaces greatly affect the usability of the recommendation system as a whole. For example, presenting the user with an overwhelmingly large list of unranked or improperly ordered items is ineffective and inefficient. Presenting the user with very complicated or hard-to-understand information is also ineffective and reduces satisfaction. Satisfaction and efficiency are also reduced if users cannot interact with recommended items, for example to navigate directly to a target document, or if the system is slow in producing a set of recommendations. These factors are generally evaluated through user studies [55, 71, 72].
10.2.13 Scalability
One of the most important goals of a recommendation system is to provide online recommendations for users to navigate through a collection of items. When the system scales up to the point where there are thousands of components, bug reports, or software experts to be recommended, the system must be able to process and make each recommendation within a reasonable amount of time. If the system cannot handle a large amount of data, other dimensions will have to be compromised. For instance, the algorithm might generate recommendations based on only a subset of items instead of using the whole database. This reduces the processing time but consequently also reduces coverage and correctness. Many examples exist of recommendation systems that work well on small datasets but struggle with large item sets or large numbers of users. These include most early API and code recommenders, many existing code or database search and rank result recommenders, and complex design or code refactoring recommenders.
The scalability problem can be divided into two parts: (1) the training time of the recommendation algorithm and (2) the performance of the system or throughput when working with a large item database. The time that is required to train the algorithm can be evaluated by training different algorithms with the same dataset or by training them until they reach the same level of prediction correctness [21, 29]. The performance of the system can be evaluated in terms of throughput—the number of recommendations that the system can generate per second [16, 23, 65]. Performance (in terms of number of recommendations) can also adversely impact the usability of the recommendation system as response time may become too slow to be effective for its users.
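Throughput as described above can be estimated with a simple timing harness. The sketch below is illustrative only: `measure_throughput` and `toy_recommender` are hypothetical names, and the toy scoring function merely stands in for a real recommendation algorithm so that the harness has something to time.

```python
import time
from typing import Callable, Iterable, List


def measure_throughput(recommend: Callable[[int], List[int]], user_ids: Iterable[int]) -> float:
    """Return the number of recommendations generated per second over the given users."""
    produced = 0
    start = time.perf_counter()
    for user_id in user_ids:
        produced += len(recommend(user_id))
    elapsed = time.perf_counter() - start
    return produced / elapsed if elapsed > 0 else float("inf")


def toy_recommender(user_id: int, catalog_size: int = 1000, top_n: int = 10) -> List[int]:
    """Hypothetical stand-in: score every catalog item and return the top_n item ids."""
    scores = [((user_id + 1) * (item + 7)) % 101 for item in range(catalog_size)]
    return sorted(range(catalog_size), key=scores.__getitem__, reverse=True)[:top_n]


throughput = measure_throughput(toy_recommender, range(200))
```

Repeating the measurement while increasing `catalog_size` or the number of users gives a rough picture of how throughput degrades as the item database grows.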
10.2.14 Stability
Stability refers to the prediction consistency of the recommendation system over a period of time, assuming that new ratings or items added during that period are in agreement with the ones already existing in the system. A stable recommender can help increase user trust, as users will be presented with consistent predictions. Predictions that change and fluctuate frequently can confuse users and, consequently, cause distrust in the system.
Stability can be measured by comparing a prediction made at a certain point in time with the prediction made after new ratings are added. Adomavicius and Zhang [2, 3] carried out a stability evaluation by training the recommendation algorithm with the existing ratings and making a first prediction. After new ratings from the next period are added, the algorithm is retrained with this new dataset. It then makes a second prediction. Similar to robustness, the prediction shift (10.1) can be calculated after a new set of ratings is added.
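The two-round procedure can be sketched as follows. Equation (10.1) is defined earlier in the chapter and is not repeated in this section, so the mean absolute form used here is an assumption made for illustration; the name `mean_absolute_shift` and the sample predictions are hypothetical.

```python
from typing import Dict, Tuple

# (user, item) -> predicted rating
Predictions = Dict[Tuple[str, str], float]


def mean_absolute_shift(first: Predictions, second: Predictions) -> float:
    """Average |second - first| over the (user, item) pairs predicted in both rounds."""
    shared = first.keys() & second.keys()
    if not shared:
        raise ValueError("no (user, item) pairs were predicted in both rounds")
    return sum(abs(second[pair] - first[pair]) for pair in shared) / len(shared)


# Predictions before and after retraining with the newly added ratings (illustrative values).
first_round = {("u1", "i1"): 4.0, ("u1", "i2"): 3.0, ("u2", "i1"): 2.5}
second_round = {("u1", "i1"): 4.5, ("u1", "i2"): 3.0, ("u2", "i1"): 2.0}

shift = mean_absolute_shift(first_round, second_round)
```

A value close to zero indicates that retraining on ratings that agree with the existing ones left the predictions largely unchanged.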
10.2.15 Privacy
Recommendation systems often record and log user interaction into historical user profiles. This helps personalize recommendations and improve understanding of user needs. Recording this information introduces a potential threat to users’ privacy. Therefore, some users might request that their personal data be kept private and not disclosed. To secure data, some approaches have proposed cryptographic solutions, or removing the single trusted party having access to the collected data [e.g., 4, 12]. Despite these efforts, it has been demonstrated that it is possible to infer user histories by passively observing a recommender’s recommendations.
In the context of recommendation systems, however, privacy should be measured in conjunction with correctness, since withholding information from the system, or from a third-party recommendation system, has a direct effect on the correctness of its recommendations. This effect can be shown by plotting correctness against the options available for preserving privacy. For example, McSherry and Mironov [46] demonstrated their privacy-preserving application by plotting RMSE versus differential privacy.
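To make the idea of plotting correctness against a privacy setting concrete, the sketch below computes RMSE for a toy item-mean predictor at several noise levels. It is not the method of McSherry and Mironov: the synthetic ratings, the Laplace noise added to each item's rating sum, the noise scale `4.0 / epsilon`, and all function names are assumptions chosen for illustration, and no formal privacy guarantee is claimed.

```python
import math
import random
from typing import List


def laplace_noise(rng: random.Random, scale: float) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def rmse(predicted: List[float], actual: List[float]) -> float:
    """Root mean squared error between two equally long lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def rmse_at_privacy_level(epsilon: float, seed: int = 0) -> float:
    """RMSE of a noisy item-mean predictor; a smaller epsilon means more noise."""
    rng = random.Random(seed)
    predicted, actual = [], []
    for _ in range(200):  # hypothetical items
        true_mean = rng.uniform(1.0, 5.0)
        # Synthetic ratings on a 1-5 scale scattered around the item's true mean.
        ratings = [min(5.0, max(1.0, rng.gauss(true_mean, 0.5))) for _ in range(20)]
        # Assumed privacy mechanism: perturb the rating sum before averaging.
        noisy_sum = sum(ratings) + laplace_noise(rng, scale=4.0 / epsilon)
        predicted.append(noisy_sum / len(ratings))
        actual.append(true_mean)
    return rmse(predicted, actual)


# Points of a correctness-versus-privacy curve: (epsilon, RMSE).
curve = [(eps, rmse_at_privacy_level(eps)) for eps in (0.1, 1.0, 10.0)]
```

Plotting the pairs in `curve` shows the tradeoff directly: the stronger the privacy setting, the larger the error of the predictor.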
There are still open questions and areas to explore regarding how privacy can affect recommendation systems and how to measure its effects. Consider multi-user and multi-organizational situations such as open source applications where API, bug triage, code reuse, document/code trace, and expertise recommenders may share repositories. Capturing user-recommender interactions may enhance recommender performance for all of these domains; however, exposing the recommended items, user ratings, and recommender queries has the potential to seriously compromise developer and organizational privacy.
10.2.16 User Preferences
We have presented a number of measures to evaluate the performance of recommendation systems. The bottom line of any recommendation system evaluation, however, is the perception of the users of that system. Therefore, depending on the application domain, an effective evaluation scenario could be to present recommendations generated by a selection of algorithms and ask users which one they prefer. Moreover, it has been shown that some metrics, although useful for comparison, are not good measures of user preference. For example, what MAE measures differs from what really matters to users: because recommendation systems support decisions, the exact predicted value is of far less importance to a user than the fact that an item is recommended. A number of recent document/code link recovery recommenders run several algorithms concurrently, generating multiple sets of recommendations that can be presented either separately or combined. Many systems allow users to configure the presentation of results, ranking scales, filters on results, number of results provided, and relative weighting of multiple item features.
It should be taken into consideration, however, that user preferences are not binary values. Users might prefer one algorithm to another to varying degrees. Therefore, when testing user preferences regarding a group of algorithms, a non-binary measure should be used before the scores are calibrated. Also, new users should be separated in the evaluation from more experienced users. New users may need to establish trust and rapport with a recommender before taking advantage of the recommendations it offers. Therefore, they might benefit from an algorithm that generates highly ratable items.
10.3 Relation Between Dimensions
To have an effective evaluation, relationships between dimensions should also be considered. These relationships describe whether changing one dimension affects other dimensions. We have captured these relationships in Table 10.3, which depicts the relationships between dimensions for the overall performance of the recommendation system. Each cell in this table depicts the relationship between one dimension and another. If changes to a dimension are in accordance with another dimension, i.e., if improving that dimension improves the other, this is shown as ⊙. If a dimension tends to adversely impact another, this is shown as ×. Dimensions that tend to be independent are shown with blank cells. Below we summarize some of these interrelations that are not already mentioned in previous sections.
Table 10.3 Relationships between metrics
Coverage can directly affect correctness, since the more data available for generating recommendations, the more meaningful the recommendations are. Hence correctness increases with increasing coverage. Coverage is also closely related to serendipity. Not every increase in coverage increases serendipity; however, an increase in serendipity will lead to higher catalog coverage. On the other hand, greater correctness dictates more constraints and therefore decreases serendipity. The same is true for risk, i.e., if recommendations are being used in high-risk environments, more constraints should be considered. This decreases serendipity, novelty, and diversity but increases correctness, trust, and utility.
High usability increases the amount of trust that users have in the recommendation system, especially when recommendations are transparent and accompanied by explanations. Improving privacy forces recommendation systems to hide some user data and hence affects the correctness of the recommendation.
Novel recommendations are generally recommendations that are not known to the user. It is not always a requirement for a novel recommendation to be accurate. Improving novelty by introducing randomness may decrease correctness. Also, improving novelty by omitting well-known items will affect correctness. Therefore, increasing novelty may decrease correctness. The same is true for diversity.
Scalability and learning rate directly affect correctness, since improving them allows faster adaptation to new items and users, thus resulting in better correctness. Improving scalability also improves coverage.
Improving robustness prevents mistaken information from affecting recommendations and hence improves user trust. It will, however, result in true recommendations being adopted more slowly, therefore reducing short-term correctness.
It is noteworthy that from the metrics presented in this table, risk could have been categorized separately. Regardless of how the recommendation system performs, risks involved with the application are the same, i.e., although having a better performing recommendation system helps to minimize the risk associated with “selecting a recommendation,” it does not change the fact that risks for that particular application exist in general.
The true relationships between metrics are more nuanced than can be represented in a two-dimensional table. For example, improving coverage directly improves correctness and increasing novelty might improve coverage. Thus, improving novelty can be considered to indirectly improve correctness, contradicting the table. Therefore, a better framework or standard for understanding these relationships is needed and should be considered for future research.
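The contradiction described above can be reproduced mechanically by composing pairwise relationships. The sketch below encodes only the three relationships named in the preceding prose, not the contents of Table 10.3; the sign encoding (+1 for "improves", -1 for "adversely impacts") and the names `DIRECT` and `indirect_effect` are assumptions made for illustration.

```python
from typing import Dict, Optional, Tuple

# Signs taken from the prose of this section only: +1 means improving the first
# dimension tends to improve the second, -1 means it tends to harm it.
DIRECT: Dict[Tuple[str, str], int] = {
    ("coverage", "correctness"): +1,
    ("novelty", "coverage"): +1,      # stated only as "might improve" in the text
    ("novelty", "correctness"): -1,
}


def indirect_effect(source: str, via: str, target: str) -> Optional[int]:
    """Compose two direct effects; return None if either link is unspecified."""
    first = DIRECT.get((source, via))
    second = DIRECT.get((via, target))
    if first is None or second is None:
        return None
    return first * second


direct = DIRECT[("novelty", "correctness")]
indirect = indirect_effect("novelty", "coverage", "correctness")
conflict = indirect is not None and indirect != direct
```

Here `conflict` is true: the direct entry says novelty harms correctness, while the path through coverage suggests the opposite, which is exactly the kind of nuance a flat pairwise table cannot express.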
Table 10.4 Summary of metrics
10.4 Evaluation Approaches and Frameworks
Table 10.4 summarizes the evaluation metrics and techniques described earlier according to their corresponding dimension and type(s). Some of the dimensions are qualitative assessments while others are quantitative.
The most basic evaluation of a recommendation system is to use just one or two metrics covering one or two dimensions. For example, one may choose to evaluate and compare a recommender using the correctness and diversity dimensions. When possible, the selected dimensions can be plotted to allow better analysis. The dimensions can be chosen according to the particular recommender application. As mentioned in Sect. 10.3, however, there is always a tradeoff between the dimensions of a recommendation system that should be considered when evaluating its effectiveness. Also, the multi-faceted characteristics of these systems, the unavailability of a standard framework for evaluation, and in many cases the lack of suitable performance benchmarks have directly affected the evaluation of different systems by dimensions. In addition, many metrics require significant time and effort to properly design experiments and to capture and analyze results. Availability of end users, suitable datasets, suitable reference benchmarks, and multiple implementations of different approaches are all often challenging issues.
However, some new approaches are beginning to emerge to help developers and users decide between different recommender algorithms and systems. An example is an approach that helps users define which metrics can be used to evaluate the recommendation system at hand. It proposes considering evaluation goals to ensure the selection of an appropriate metric. An analysis of a collection of correctness metrics is provided as evidence of how different goals can affect the outcome of the evaluation.
Hernández del Olmo and Gaudioso [27] propose an objective-based framework for the standardization of recommendation system evaluations. Their framework is based on the concept that a recommendation system is composed of interactive and non-interactive subsystems (called guides and filters, respectively). The guide decides when and how each recommendation is to be shown to users. The filter selects interesting items to recommend. Accordingly, a performance metric P has been introduced to quantify the final performance of a recommendation system over a set of sessions. P is defined as the number of selected relevant recommendations that have been followed by the user over a recommendation session.
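Read literally, P for one session is a count over the recommendations that were shown, were relevant, and were followed. The sketch below follows that reading only; the `Recommendation` record, its field names, the sample items, and the per-session counting are illustrative assumptions rather than the authors' formalization.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Recommendation:
    item: str
    shown: bool      # the guide chose to display it to the user
    relevant: bool   # the item selected by the filter was relevant to the user
    followed: bool   # the user acted on the recommendation


def session_performance(session: List[Recommendation]) -> int:
    """Count shown, relevant recommendations that the user followed in one session."""
    return sum(1 for r in session if r.shown and r.relevant and r.followed)


# One hypothetical recommendation session (item names are made up).
session = [
    Recommendation("Parser.parse", shown=True, relevant=True, followed=True),
    Recommendation("Lexer.next", shown=True, relevant=True, followed=False),
    Recommendation("Cache.clear", shown=True, relevant=False, followed=True),
    Recommendation("Tree.walk", shown=False, relevant=True, followed=False),
    Recommendation("Node.visit", shown=True, relevant=True, followed=True),
]

p_session = session_performance(session)
```

How the per-session values are combined over a set of sessions is not specified in this section, so no aggregation is attempted here.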
A more recent approach introduced a multi-faceted model for recommender evaluation that proposes evaluation along three axes: users, technical constraints, and business models. This approach considers user, technical, and business aspects together and evaluates the recommender accordingly. However, considerable further work is needed to enable detailed evaluation of recommendation systems against many of the potential metrics itemized in Table 10.4.
10.5 Conclusion
In this chapter, we have presented and explained a range of common metrics used for the evaluation of recommendation systems in software engineering. Based on a review of current literature, we derived a set of dimensions that are used to evaluate an individual recommendation system or to compare it against the current state of the art. For each dimension, we have provided a description as well as a set of commonly used metrics, and we have explored relationships between the dimensions.
We hope that our classification and description of this range of available evaluation metrics will help other researchers to develop better recommendation systems. We also hope that our taxonomy will be used to improve the validation of newly developed recommendation systems and clearly show in specific ways how a new recommendation system is better than the current state of the art. Finally, the content of this chapter can be used by practitioners in understanding the evaluation criteria for recommendation systems. This can thus improve their decisions when selecting a specific recommendation system for a software development project.
References
2. Adomavicius, G., Zhang, J.: Iterative smoothing technique for improving stability of recommender systems. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE. CEUR Workshop Proceedings, vol. 910, pp. 3–8 (2012a)
3. Adomavicius, G., Zhang, J.: Stability of recommendation algorithms. ACM Trans. Inform. Syst. 30(4), 23:1–23:31 (2012b). doi:10.1145/2382438.2382442
5. Ashok, B., Joy, J., Liang, H., Rajamani, S.K., Srinivasa, G., Vangala, V.: DebugAdvisor: a recommender system for debugging. In: Proceedings of the European Software Engineering Conference/ACM SIGSOFT International Symposium on Foundations of Software Engineering, pp. 373–382 (2009). doi:10.1145/1595696.1595766
6. Bell, R., Koren, Y., Volinsky, C.: Modeling relationships at multiple scales to improve accuracy of large recommender systems. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 95–104 (2007). doi:10.1145/1281192.1281206
7. Bonhard, P., Harries, C., McCarthy, J., Sasse, M.A.: Accounting for taste: using profile similarity to improve recommender systems. In: Proceedings of the ACM SIGCHI Conference on Human Factors in Computing Systems, pp. 1057–1066 (2006). doi:10.1145/1124772.1124930
10. Calandrino, J.A., Kilzer, A., Narayanan, A., Felten, E.W., Shmatikov, V.: “You might also like”: privacy risks of collaborative filtering. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 231–246 (2011). doi:10.1109/SP.2011.40
11. Candillier, L., Chevalier, M., Dudognon, D., Mothe, J.: Diversity in recommender systems: bridging the gap between users and systems. In: Proceedings of the International Conference on Advances in Human-Oriented and Personalized Mechanisms, Technologies, and Services, pp. 48–53 (2011)
12. Canny, J.: Collaborative filtering with privacy. In: Proceedings of the IEEE Symposium on Security and Privacy, pp. 45–57 (2002). doi:10.1109/SECPRI.2002.1004361
16. Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: Proceedings of the International Conference on the World Wide Web, pp. 271–280 (2007). doi:10.1145/1242572.1242610
17. De Lucia, A., Fasano, F., Oliveto, R., Tortora, G.: Recovering traceability links in software artifact management systems using information retrieval methods. ACM Trans. Software Eng. Methodol. 16(4), 13:1–13:50 (2007). doi:10.1145/1276933.1276934
18. Dolques, X., Dogui, A., Falleri, J.R., Huchard, M., Nebut, C., Pfister, F.: Easing model transformation learning with automatically aligned examples. In: Proceedings of the European Conference on Modelling Foundations and Applications. Lecture Notes in Computer Science, vol. 6698, pp. 189–204 (2011). doi:10.1007/978-3-642-21470-7_14
20. Ge, M., Delgado-Battenfeld, C., Jannach, D.: Beyond accuracy: evaluating recommender systems by coverage and serendipity. In: Proceedings of the ACM Conference on Recommender Systems, pp. 257–260 (2010). doi:10.1145/1864708.1864761
21. George, T., Merugu, S.: A scalable collaborative filtering framework based on co-clustering. In: Proceedings of the IEEE International Conference on Data Mining (2005). doi:10.1109/ICDM.2005.14
22. Good, N., Schafer, J.B., Konstan, J.A., Borchers, A., Sarwar, B., Herlocker, J., Riedl, J.: Combining collaborative filtering with personal agents for better recommendations. In: Proceedings of the National Conference on Artificial Intelligence and the Conference on Innovative Applications of Artificial Intelligence, pp. 439–446 (1999)
24. Happel, H.J., Maalej, W.: Potentials and challenges of recommendation systems for software development. In: Proceedings of the International Workshop on Recommendation Systems for Software Engineering, pp. 11–15 (2008). doi:10.1145/1454247.1454251
25. Herlocker, J.L., Konstan, J.A., Riedl, J.: Explaining collaborative filtering recommendations. In: Proceedings of the ACM Conference on Computer Supported Cooperative Work, pp. 241–250 (2000). doi:10.1145/358916.358995
27. Hernández del Olmo, F., Gaudioso, E.: Evaluation of recommender systems: a new approach. Expert Syst. Appl. 35(3), 790–804 (2008). doi:10.1016/j.eswa.2007.07.047
29. Karypis, G.: Evaluation of item-based top-N recommendation algorithms. In: Proceedings of the International Conference on Information and Knowledge Management, pp. 247–254 (2001). doi:10.1145/502585.502627
32. Kille, B., Albayrak, S.: Modeling difficulty in recommender systems. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE. CEUR Workshop Proceedings, vol. 910, pp. 30–32 (2012)
33. Kitchenham, B.A., Pfleeger, S.L.: Principles of survey research. Part 3: constructing a survey instrument. SIGSOFT Software Eng. Note. 27(2), 20–24 (2002). doi:10.1145/511152.511155
35. Koychev, I., Schwab, I.: Adaptation to drifting user’s interests. In: Proceedings of the Workshop on Machine Learning in the New Information Age, pp. 39–46 (2000)
36. Krishnamurthy, B., Malandrino, D., Wills, C.E.: Measuring privacy loss and the impact of privacy protection in web browsing. In: Proceedings of the Symposium on Usable Privacy and Security, pp. 52–63 (2007). doi:10.1145/1280680.1280688
38. Lam, S.K., Riedl, J.: Shilling recommender systems for fun and profit. In: Proceedings of the International Conference on the World Wide Web, pp. 393–402 (2004). doi:10.1145/988672.988726
39. Lam, S.K.T., Frankowski, D., Riedl, J.: Do you trust your recommendations?: an exploration of security and privacy issues in recommender systems. In: Proceedings of the International Conference on Emerging Trends in Information and Communication Security. Lecture Notes in Computer Science, vol. 3995, pp. 14–29 (2006). doi:10.1007/11766155_2
40. Lathia, N., Hailes, S., Capra, L., Amatriain, X.: Temporal diversity in recommender systems. In: Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 210–217 (2010). doi:10.1145/1835449.1835486
41. Le, Q.V., Smola, A.J.: Direct optimization of ranking measures. Technical Report (2007) [arXiv:0704.3359]
42. Massa, P., Avesani, P.: Trust-aware recommender systems. In: Proceedings of the ACM Conference on Recommender Systems, pp. 17–24 (2007). doi:10.1145/1297231.1297235
43. McCarey, F., Ó Cinnéide, M., Kushmerick, N.: RASCAL: a recommender agent for agile reuse. Artif. Intell. Rev. 24(3–4), 253–276 (2005). doi:10.1007/s10462-005-9012-8
44. McNee, S.M.: Meeting user information needs in recommender systems. Ph.D. thesis, University of Minnesota (2006)
45. McNee, S.M., Riedl, J., Konstan, J.A.: Being accurate is not enough: how accuracy metrics have hurt recommender systems. In: Extended Abstracts of the ACM SIGCHI Conference on Human Factors in Computing Systems, pp. 1097–1101 (2006). doi:10.1145/1125451.1125659
46. McSherry, F., Mironov, I.: Differentially private recommender systems: building privacy into the net. In: Proceedings of the ACM SIGKDD Conference on Knowledge Discovery and Data Mining, pp. 627–636 (2009). doi:10.1145/1557019.1557090
47. Melnik, S., Garcia-Molina, H., Rahm, E.: Similarity flooding: a versatile graph matching algorithm and its application to schema matching. In: Proceedings of the International Conference on Data Engineering, pp. 117–128 (2002). doi:10.1109/ICDE.2002.994702
48. Meyer, F., Fessant, F., Clérot, F., Gaussier, E.: Toward a new protocol to evaluate recommender systems. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE. CEUR Workshop Proceedings, vol. 910, pp. 9–14 (2012)
49. Mobasher, B., Burke, R., Bhaumik, R., Williams, C.: Toward trustworthy recommender systems: an analysis of attack models and algorithm robustness. ACM Trans. Inter. Tech. 7(4), 23:1–23:38 (2007). doi:10.1145/1278366.1278372
50. Mockus, A., Herbsleb, J.D.: Expertise Browser: a quantitative approach to identifying expertise. In: Proceedings of the ACM/IEEE International Conference on Software Engineering, pp. 503–512 (2002). doi:10.1145/581339.581401
52. O’Donovan, J., Smyth, B.: Trust in recommender systems. In: Proceedings of the International Conference on Intelligent User Interfaces, pp. 167–174 (2005). doi:10.1145/1040830.1040870
54. Oxford Dictionaries: Oxford Dictionary of English. 3rd edn. Oxford: Oxford University Press, UK (2010)
56. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA (1993)
58. Rashid, A.M., Albert, I., Cosley, D., Lam, S.K., McNee, S.M., Konstan, J.A., Riedl, J.: Getting to know you: learning new user preferences in recommender systems. In: Proceedings of the International Conference on Intelligent User Interfaces, pp. 127–134 (2002). doi:10.1145/502716.502737
59. Robillard, M.P.: Topology analysis of software dependencies. ACM Trans. Software Eng. Methodol. 17(4), 18:1–18:36 (2008). doi:10.1145/13487689.13487691
62. Said, A., Tikk, D., Shi, Y., Larson, M., Stumpf, K., Cremonesi, P.: Recommender systems evaluation: a 3D benchmark. In: Proceedings of the Workshop on Recommendation Utility Evaluation: Beyond RMSE. CEUR Workshop Proceedings, vol. 910, pp. 21–23 (2012)
63. Salfner, F., Lenk, M., Malek, M.: A survey of online failure prediction methods. ACM Comput. Surv. 42(3), 10:1–10:42 (2010). doi:10.1145/1670679.1670680
64. Sandvig, J.J., Mobasher, B., Burke, R.: Robustness of collaborative recommendation based on association rule mining. In: Proceedings of the ACM Conference on Recommender Systems, pp. 105–112 (2007). doi:10.1145/1297231.1297249
65. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Application of dimensionality reduction in recommender system: a case study. Technical Report 00-043, Department of Computer Science & Engineering, University of Minnesota (2000)
66. Sarwar, B., Karypis, G., Konstan, J., Riedl, J.: Item-based collaborative filtering recommendation algorithms. In: Proceedings of the International Conference on the World Wide Web, pp. 285–295 (2001). doi:10.1145/371920.372071
67. Schein, A.I., Popescul, A., Ungar, L.H., Pennock, D.M.: Methods and metrics for cold-start recommendations. In: Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval, pp. 253–260 (2002). doi:10.1145/564376.564421
68. Schroder, G., Thiele, M., Lehner, W.: Setting goals and choosing metrics for recommender system evaluation. In: Proceedings of the Workshop on Human Decision Making in Recommender Systems and User-Centric Evaluation of Recommender Systems and Their Interfaces. CEUR Workshop Proceedings, vol. 811, pp. 78–85 (2011)
69. Seminario, C.E., Wilson, D.C.: Robustness and accuracy tradeoffs for recommender systems under attack. In: Proceedings of the Florida Artificial Intelligence Research Society Conference, pp. 86–91 (2012)
71. Simon, F., Steinbrückner, F., Lewerentz, C.: Metrics based refactoring. In: Proceedings of the European Conference on Software Maintenance and Reengineering, pp. 30–38 (2001). doi:10.1109/.2001.914965
72. Sinha, R., Swearingen, K.: The role of transparency in recommender systems. In: Extended Abstracts of the ACM SIGCHI Conference on Human Factors in Computing Systems, pp. 830–831 (2002). doi:10.1145/506443.506619
75. Su, X., Khoshgoftaar, T.M.: A survey of collaborative filtering techniques. Adv. Artif. Intell. 2009, 421425:1–421425:19 (2009). doi:10.1155/2009/421425
76. Thummalapenta, S., Xie, T.: PARSEWeb: a programmer assistant for reusing open source code on the web. In: Proceedings of the IEEE/ACM International Conference on Automated Software Engineering, pp. 204–213 (2007). doi:10.1145/1321631.1321663
77. Tintarev, N., Masthoff, J.: A survey of explanations in recommender systems. In: Proceedings of the IEEE International Workshop on Web Personalisation, Recommender Systems and Intelligent User Interfaces, pp. 801–810 (2007). doi:10.1109/ICDEW.2007.4401070
78. Weimer, M., Karatzoglou, A., Le, Q.V., Smola, A.: CoFi RANK: maximum margin matrix factorization for collaborative ranking. In: Proceedings of the Annual Conference on Neural Information Processing Systems, pp. 222–230 (2007)
81. Ziegler, C.N., McNee, S.M., Konstan, J.A., Lausen, G.: Improving recommendation lists through topic diversification. In: Proceedings of the International Conference on the World Wide Web, pp. 22–32 (2005). doi:10.1145/1060745.1060754