How to reverse-engineer quality rankings
Abstract
A good or bad product quality rating can make or break an organization. However, the notion of “quality” is often defined by an independent rating company that does not make the formula for determining the rank of a product publicly available. In order to invest wisely in product development, organizations are starting to use intelligent approaches for determining how funding for product development should be allocated. A critical step in this process is to “reverse-engineer” a rating company’s proprietary model as closely as possible. In this work, we provide a machine learning approach for this task, which optimizes a certain rank statistic that encodes preference information specific to quality rating data. We present experiments on data from a major quality rating company, and provide new methods for evaluating the solution. In addition, we provide an approach to use the reverse-engineered model to achieve a top ranked product in a cost-effective way.
Keywords
Supervised ranking, Quality ratings, Discrete optimization, Reverse-engineering, Applications of machine learning
1 Introduction
Many organizations depend on the top ratings given to their products or services by quality rating companies. For instance, the reputations of undergraduate and graduate programs at colleges and universities depend heavily on their U.S. News and World Report rankings. Similarly, the mortgage industry hinges on the models of credit rating agencies like Standard & Poor’s, Moody’s, Dun & Bradstreet, and Fitch Ratings. Mutual funds rely on Morningstar and Lipper ratings. For electronics, rating companies include CNET and PCMag; and for vehicles, they include What Car?, J.D. Power, Edmunds, Kelley Blue Book, and Car and Driver. Most of these rating companies use a formula to score products, and few of them make their complete rating formulas public. Moreover, the exact values of the input data to the formula are also often kept confidential. If organizations were able to recreate the formulas for quality rating models, they would better understand the standards by which their products were being judged, which would potentially allow them to produce better products. Furthermore, rating companies that are aware of reverse-engineering may be motivated to re-evaluate the accuracy and fairness of their formulas in representing the quality of products.
Point 1 (Linear scoring functions): The rating company states publicly that its product rankings are based on real-valued scores given to each product, and that the score is a weighted linear combination of a known set of factors. The precise values for some factors can be obtained directly, but other factors have been discretized into a number of “stars” between 1 and 5 and are thus noisy versions of the true values. For example, the National Highway Traffic Safety Administration discretizes factors pertaining to vehicle safety ratings.
Point 2 (Category structure): Products are organized into categories, and within each category there are one or more subcategories. For example, a computer rating company may have a laptop category with subcategories such as netbooks and tablets. Products within a category share the same scoring system, but the ranking of each product is with respect to its subcategory.
Point 3 (Ranks over scores): The scores themselves are not as meaningful as the ranks, since consumers pay more attention to product rankings than to scores or to differences in score. Moreover, sometimes the scores are not available at all, and only the ranks are available.
Point 4 (Focus on top products): Consumers generally focus on top-ranked products, so a model that can reproduce the top of each subcategory’s ranked list accurately is more valuable than one that better reproduces the middle or bottom of the list.
Note that even though Point 1 makes the assumption of known factors, it is also possible to use our method for problems in which the factors are unknown. As long as the factors in our model encompass the information used for the rating system, our algorithm can be applied regardless of whether or not the factors are precisely the same as those used by the rating company. For instance, a camera expert might know all of the potential camera characteristics that could contribute to camera quality, which we could then use as the factors in our model.
After the model has been reverse-engineered, we can use it to determine the most cost-effective way to increase product rankings, and we present discrete optimization algorithms for this task. These algorithms can be used independently of the reverse-engineering method. That is, if the reverse-engineered formula were obtained using a different method from ours, or if the formula were made public, we could still use these algorithms to cost-effectively increase a product’s rank.
We describe related work in Sect. 2. In Sect. 3, we derive a ranking quality objective that encodes the preference relationships discussed above. In Sect. 4 we provide the machine learning algorithm, based on discrete optimization, that exactly maximizes the ranking quality objective. In Sect. 5, we establish new measures that can be used to evaluate the performance of our model. In Sect. 6, we derive several baseline algorithms for reverse-engineering, all involving convex optimization. Section 7 contains results from a proof-of-concept experiment, and Sect. 8 provides experimental results using rating data from a major quality rating company. Section 9 discusses the separate problem of how to cost-effectively increase the rank of a product. Finally, we conclude in Sect. 10. The main contributions of the paper are: the application of machine learning to reverse-engineering product quality rankings; our method of encoding the preference relationships in accordance with Points 1 through 4 above; using data from other product categories as regularization; the design of novel evaluation measures; and the mechanism to cost-effectively achieve a highly ranked product.
2 Related work
Reverse-engineering and approximation of rating models have been done in a number of industries, albeit not applied to rankings for consumer products with the category/subcategory structure. The related work we have found is published mostly within blogs. These works deal mostly with the problem of approximating the ranking function with a smaller number of variables, rather than using the exact factors in the rating company’s formula. For instance, Chandler (2006) approximated the U.S. News and World Report Law School rankings using symbolic regression to obtain a formula with four factors, and another with seven factors; currently the formula for the law school rankings is completely public and based on survey results, but the approximated versions are much simpler. In the sports industry, there has been some work in reverse-engineering Elias Sports Bureau rankings, which are used to determine compensation for free agents (Bajek 2008). The search engine optimization (SEO) industry aims to be able to boost the search engine rank of a web page by figuring out which features have high influence in the ranking algorithm. For instance, Su et al. (2010) used a linear optimization model to approximate Google web page rankings. As a final example, Hammer et al. (2007) approximated credit rating models using Logical Analysis of Data. As far as we know, our work is the first to present a specialized machine learning algorithm to reverse-engineer product ratings.
If the ratings are accurate measures of quality, then making the ratings more transparent could have a uniformly positive impact: it would help companies to make better rated products, it would help consumers to have these higher quality products, and it would encourage rating companies to receive feedback as to whether their rating systems fairly represent quality. If the ratings are not accurate measures of quality, many problems could arise. Unethical manipulation of reverse-engineered credit rating models heavily contributed to the 2007–2010 financial crisis (Morgenson and Story 2010). These ratings permitted some companies to sell “junk bonds” with very high ratings. Rating companies were blamed for “performing the alchemy that converted the securities from F-rated to A-rated.”^{1}
Rating systems can also be arbitrary—even some well-established, heavily trusted rating systems can be inconsistent from product to product. There has been some controversy also over the Motion Picture Association of America movie rating system, discussed in the documentary “This Film Is Not Yet Rated.”^{2} The MPAA rating system sorts movies into categories based on how appropriate they are for certain audiences. The documentary demonstrates that the rating system was inconsistent between different types of films, and that the MPAA directly lied to the public regarding the way these ratings are constructed. This can be difficult for movie makers, whose profits may depend on getting an “R” rating rather than an “NC-17” rating, and it also causes problems for moviegoers, who want to know whether the movie is suitable for them.
Our reverse-engineering problem could potentially be useful in the area of conjoint analysis in marketing (Green et al. 2001). Conjoint analysts aim to model how a consumer chooses one brand over another, with the goal of learning which product characteristics are most important to consumers.
We have considered the reverse-engineering task as a problem of supervised ranking. Supervised ranking originated to handle problems that occur mainly in the information retrieval domain (see, for instance, the LETOR compilation of works^{3}). The vast majority of work on supervised ranking considers problems that are specific to information retrieval (e.g., Cao et al. 2007; Matveeva et al. 2006; Lafferty and Zhai 2001; Li et al. 2007) or give insight into how to approximately solve versions of extremely large ranking problems quickly (Tsochantaridis et al. 2005; Freund et al. 2003; Cossock and Zhang 2006; Joachims 2002; Burges et al. 2006; Xu et al. 2008; Le and Smola 2007; Ferri et al. 2002; Ataman et al. 2006). For the task of reverse-engineering ranking models, fast computational speed is not essential, and the extra time needed to compute a better solution is worthwhile. This, coupled with the fact that the size of the dataset is not extremely large, permits us to use mixed-integer optimization (MIO). MIO preserves our encoding of exactly the desired preference structure, where we have incorporated membership into categories and subcategories. If we remove regularization and do not concentrate on the top ranks, then the problem is a generalization of Area Under the Curve (AUC) maximization (Freund et al. 2003; Joachims 2002). Most works on AUC maximization use a smoothed approximation of the 0-1 loss within the AUC. If we were to use a smoothed approximation for the reverse-engineering problem, it is possible that the algorithm would miss the best solutions to the 0-1 optimization problem. The ℓ_{p}RE relaxation algorithm we introduce in Sect. 6 is one such approximation. The work of Bertsimas et al. (2010, 2011) also discusses in depth the benefits of exact solutions over relaxations.
Clearly, reverse-engineered ranking models can affect design decisions in a variety of applications. To the best of our knowledge, our work is the first to show the most cost-effective way to increase the rank of a new product.
3 Encoding preferences for quality ranking data
We derive a specialized rank statistic that serves as our objective for reverse-engineering. Maximizing this objective yields estimates of the weights on each of the factors in the rating company’s model. Our starting point is the case of one category with one subcategory, that is, there is only a single ranked list. Then, we generalize this statistic to handle multiple categories and subcategories. Our method can be used to reverse-engineer quality rankings whether or not the underlying scores are made available; we need only to know the ranks.
3.1 One category, one subcategory
3.2 Multiple categories and subcategories
We assume from Sect. 1 that different categories have different ranking models. Even so, these models may be similar enough that knowledge obtained from other categories can be used to “borrow strength” when there are limited data in the category of interest. Thus, as we derive the objective for reverse-engineering the model f for one prespecified category, we use data from all of its subcategories as well as from the subcategories in other categories.
Let \(S_{\text{sub}}\) be the set of all subcategories across all categories, including the category of interest, and let there be n_{s} products in subcategory s. Similar to our previous notation, \(x_{i}^{s}\in\mathcal{R}^{d}\) represents product i in subcategory s, \(\zeta_{i}^{s}\in\mathcal{R}\) is the score assigned to product i in subcategory s, and \(\pi^{s}_{ik}\) is 1 if \(\zeta_{i}^{s}>\zeta_{k}^{s}\) and is 0 otherwise. The threshold T_{s} defines the top of the list for subcategory s.
We assume a linear form for the model, in accordance with Point 1 in Sect. 1. That is, the scoring function has the form f(x)=w^{T}x, so that \(w\in\mathcal{R}^{d}\) is a vector of variables in our formulation, and the objective in (2) is a function of w. Note that we can capture relatively complex nonlinear rating systems using a linear model with nonlinear factors. For instance, we could introduce extra factors to accommodate “necessity” constraints, where products that do not have a certain property will always get a low score. To do this, we would add a binary factor to the model that is 1 if the product does not possess the property, and the learning algorithm should discover a large negative weight for that factor.
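The "necessity" construction can be illustrated with a minimal sketch. The weights and factor values below are hypothetical and chosen only to show the mechanism; in the method described here the weights would be learned rather than set by hand.

```python
def score(w, x):
    """Linear scoring function f(x) = w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

# Hypothetical model: two ordinary factors plus one binary "necessity" factor
# that equals 1 when the product LACKS the required property.
w = [2.0, 1.5, -50.0]  # large negative weight on the necessity factor (assumed)

has_property = [4, 5, 0]    # possesses the property, so the binary factor is 0
lacks_property = [5, 5, 1]  # better on factor 1, but lacks the property

print(score(w, has_property))    # 15.5
print(score(w, lacks_property))  # -32.5
```

Even though the second product is at least as good on the ordinary factors, the binary factor pushes its score far below that of the first, which is the behavior a necessity constraint is meant to capture.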
4 Optimization
We now provide an algorithm to reverse-engineer quality rankings that exactly maximizes (2). The algorithm is called MIO-RE—Mixed Integer Optimization for Reverse-Engineering, and expands on a technique due to Bertsimas et al. (2010, 2011) for supervised ranking in machine learning. In this work, the authors develop a type of approach with an advantage over other machine learning techniques in that it exactly optimizes the objective. This advantage is counterbalanced by a sacrifice in computational speed, but for the rating problem, new data come out occasionally (e.g., yearly, monthly, weekly) whereas the computation time is generally on the order of hours, depending on the number of products in the training data and the number of factors. In this case, the extra computation time needed to produce a better solution is worthwhile.
In MIO, it is important to note that even though there are often various correct formulations to solve the same problem, not all valid formulations are equally strong. In fact, the ability to solve an MIO problem depends critically on the choice of formulation (see Bertsimas and Weismantel 2005, for details). This is not true of linear optimization, where a good formulation is simply one that correctly captures the model and is small in terms of the number of variables and constraints. In linear optimization, the choice of formulation is not crucial for solving a problem. However, when there are integer variables, it is typical to reformulate multiple times to achieve the best model. Essentially, a formulation is stronger if it cuts off extra unnecessary parts of the region of feasible solutions. Below we present a strong MIO formulation that we have found to work well empirically, and we discuss the logic behind its derivation.
Modern solvers typically produce a bound (upper for maximization problems, lower for minimization problems) as they search for better integer feasible solutions, and when the bound matches the objective value of an integer solution, the solution has reached provable optimality. However, it is common for a solver to find an optimal solution relatively quickly, but to take much longer in proving optimality, that is, in bringing the bound closer to the optimal objective value. See Bertsimas et al. (2011) for an introduction to MIO that discusses in particular the strength of a formulation and also the progress in MIO technology over the last few decades. Due to advances in both hardware and MIO algorithms, computational speed has been increasing exponentially, allowing us today to solve large scale MIO problems that would have been impossible only a few years ago. MIO will be progressively more powerful as this exponential trend continues.
In Sects. 7 and 8, our experimental results show that our MIO algorithm performs well on both training and test data. Considering generalization bounds from statistical learning theory, there are two ways to achieve better test performance: one is to decrease the training error, and the other is to decrease the complexity term to prevent overfitting. Using MIO, we can decrease the training error, and we control the complexity by using regularization across categories as shown in Sect. 3.2.
4.1 Model for reverse-engineering
After the optimization problem (13) is solved for our category of interest, we use the maximizing weights w^{∗} to determine the score f(x)=w^{∗T}x of a new product x within the same category.
5 Evaluation metrics
In the case of our rating data, one goal is to predict, for instance, whether a new product that has not yet been rated would be among the top-k products that have already been rated. That is, the training data are included in the assessment of test performance. This type of evaluation is contrary to common machine learning practice in which evaluations on the training and test sets are separate, and thus it is not immediately clear how these evaluations should be performed.
In this section, we define three measures that are useful for supervised ranking problems in which test predictions are gauged relative to the training set. These measures are intuitive, and more closely represent how most industries would evaluate ranking quality than conventional rank statistics. The measures are first computed separately for each subcategory and then aggregated over the subcategories to produce a concise result. We focus on the top \(\bar{T}_{s}\) products in subcategory s, and use the following notation, where f(x)=w^{T}x is a given scoring function.
Measure 1: fraction of correctly ranked pairs among top of ranked list
Measure 2: fraction of correctly ranked pairs over entire ranked list
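As a rough sketch of a pairwise measure of this kind, the function below counts the fraction of product pairs whose true order is reproduced by the model scores. It is an assumption-laden simplification: it treats one ranked list in isolation and does not reproduce the exact definition, which gauges test products relative to the training set.

```python
def fraction_correct_pairs(true_scores, model_scores):
    """Fraction of pairs (i, k) with true_scores[i] > true_scores[k]
    for which the model also scores i strictly above k."""
    n = len(true_scores)
    correct = total = 0
    for i in range(n):
        for k in range(n):
            if true_scores[i] > true_scores[k]:
                total += 1
                if model_scores[i] > model_scores[k]:
                    correct += 1
    return correct / total if total else 0.0

# Hypothetical example: the model swaps the second and third products.
print(fraction_correct_pairs([3, 2, 1], [0.9, 0.1, 0.5]))  # 2 of 3 pairs correct
```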
Measure 3: fraction of correctly classified products
Aggregation of measures
6 Other methods for reverse-engineering
We developed several baseline algorithms for our experiments that also encode the points in the introduction. The first set of methods are based on least squares regression, and the second set are convex relaxations of the MIO method. These algorithms could be themselves useful, for instance, if a fast convex algorithm is required.
6.1 Least squares methods for reverse-engineering
1. the true score \(\zeta^{s}_{i}\) for product \(x^{s}_{i}\) (method LS1),
2. the rank over all training products, that is, the number of training products that are within subcategories r such that C_{r}>0 and are ranked strictly below \(x^{s}_{i}\) according to the true scores \(\zeta^{s}_{i}\) (method LS2),
3. the rank within the subcategory, that is, the number of training products in the same subcategory as \(x^{s}_{i}\) that are ranked strictly below \(x^{s}_{i}\) according to the true scores \(\zeta^{s}_{i}\) (method LS3).
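The three regression targets can be computed as in the following sketch. The data layout (dictionaries keyed by subcategory) and the example values are assumptions made for illustration, and the regression step itself is omitted.

```python
def ls_targets(scores, C):
    """Build the LS1, LS2, and LS3 regression targets.

    scores: {subcategory: [true score of each product]}
    C:      {subcategory: weight C_r}; only subcategories with C_r > 0
            are pooled for the LS2 rank.
    """
    pooled = [z for r, zs in scores.items() if C[r] > 0 for z in zs]
    targets = {}
    for s, zs in scores.items():
        targets[s] = [
            {
                "LS1": z,                                    # true score
                "LS2": sum(1 for o in pooled if o < z),      # rank over all pooled products
                "LS3": sum(1 for o in zs if o < z),          # rank within the subcategory
            }
            for z in zs
        ]
    return targets

# Hypothetical scores for two subcategories.
example = ls_targets({"a": [3.0, 1.0], "b": [2.0, 5.0]}, {"a": 1.0, "b": 0.5})
print(example["a"][0])  # {'LS1': 3.0, 'LS2': 2, 'LS3': 1}
```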
6.2 The ℓ_{p} reverse-engineering algorithm
7 Proof of concept
Table 1: Training and test values for M1, M2, and M3 on artificial dataset (top 60)
Algorithm | Split | M1 | M2 | M3 |
---|---|---|---|---|
LS1, ℓ_{1}RE, ℓ_{2}RE | train | 0.878 | 0.912 | 0.780 |
LS1, ℓ_{1}RE, ℓ_{2}RE | test | 0.892 | 0.909 | 0.770 |
LS2 | train | 0.909 | 0.923 | 0.780 |
LS2 | test | 0.915 | 0.918 | 0.770 |
MIO-RE | train | 0.925 | 0.928 | 0.780 |
MIO-RE | test | 0.943 | 0.929 | 0.770 |
Table 2: Training and test values for M1, M2, and M3 on artificial dataset (top 45)
Algorithm | Split | M1 | M2 | M3 |
---|---|---|---|---|
LS1, ℓ_{1}RE, ℓ_{2}RE | train | 0.880 | 0.912 | 0.920 |
LS1, ℓ_{1}RE, ℓ_{2}RE | test | 0.898 | 0.909 | 0.930 |
LS2 | train | 0.935 | 0.923 | 0.920 |
LS2 | test | 0.942 | 0.918 | 0.930 |
MIO-RE | train | 0.964 | 0.928 | 0.920 |
MIO-RE | test | 0.994 | 0.929 | 0.930 |
Table 3: Training and test values for M1, M2, and M3 on artificial dataset (top 25)
Algorithm | Split | M1 | M2 | M3 |
---|---|---|---|---|
LS1, ℓ_{1}RE, ℓ_{2}RE | train | 0.907 | 0.912 | 1.000 |
LS1, ℓ_{1}RE, ℓ_{2}RE | test | 0.899 | 0.909 | 0.980 |
LS2 | train | 0.907 | 0.923 | 1.000 |
LS2 | test | 0.899 | 0.918 | 0.980 |
MIO-RE | train | 1.000 | 0.928 | 1.000 |
MIO-RE | test | 1.000 | 0.929 | 1.000 |
The methods all performed similarly according to the classification measure M3. MIO-RE had a significant advantage with respect to M2, no matter the definition we used for top of the list (top 60 in Table 1, top 45 in Table 2, or top 25 in Table 3). For M1, MIO-RE performed substantially better than the others, and its advantage over the other methods was more pronounced as the evaluation measure concentrated more on the top of the list. One can see this by comparing the M1 column in Tables 1, 2, and 3. In Table 3, MIO-RE performed better than the other methods by 10.3 % on training and 11.3 % on testing. Using exact optimization rather than approximations, the MIO-RE method was able to find solutions that none of the other methods could find. This study demonstrates the potential of MIO-RE to substantially outperform other methods.
8 Experiments on rating data
For our main experiments, the dataset contains approximately a decade’s worth of rating data from a major rating company, compiled by an organization that is aiming to reverse-engineer the ranking model. The values for most of the factors are discretized versions of the true values, that is, they have been rounded to the nearest integer. The rating company periodically makes ratings for new products available, and our goal is to predict, with respect to the products that are already rated: where each new product is within the top-k (M1), where it is in the full list, even if not in the top-k (M2), and whether each new product falls within the top-k (M3). We generate a scoring function for one category, “Category A,” regularizing with data from “Category B.” Category A has eight subcategories with a current total of 209 products, and Category B has eight subcategories with a total of 212 products. There are 19 factors.
This dataset is small and thus challenging to deal with from a machine learning perspective. The small size of the training sets causes problems with accurate reverse-engineering. The small size of the test sets causes problems with evaluating generalization ability. That is, for all algorithms, the variance of the test evaluation measures is high compared to the difference in training performance, so it is difficult to evaluate which algorithm is better in a robust way. The worst performing algorithm in training sometimes has the best test performance, and vice versa. What we aim to determine is whether MIO-RE has consistently good performance, as compared with other algorithms that sometimes perform very poorly.
8.1 Experimental setup
1. For each set of parameters, perform three-fold cross-validation using the first three folds as follows:
   a. Train using folds 1 and 2, and Category B, and validate using fold 3. Compute M1, M2, and M3 for training and validation.
   b. Train using folds 1 and 3, and Category B, and validate using fold 2. Compute M1, M2, and M3 for training and validation.
   c. Train using folds 2 and 3, and Category B, and validate using fold 1. Compute M1, M2, and M3 for training and validation.
   d. Compute the average over the three folds of the training and validation values for each of M1, M2, and M3.

   Note that when we compute M1, M2, and M3 on validation data, this also takes into account the training data, as in Sect. 5.
2. Sum the three average validation measures, and choose the parameters corresponding to the largest sum.
3. Train using folds 1, 2, and 3, and Category B, together with the parameters chosen in the previous step, and test using fold 4. Compute M1, M2, and M3 for training and testing.
4. Repeat steps 1 through 3 using folds 1, 2, and 4 for cross-validation and fold 3 for the final test set.
5. Repeat steps 1 through 3 using folds 1, 3, and 4 for cross-validation and fold 2 for the final test set.
6. Repeat steps 1 through 3 using folds 2, 3, and 4 for cross-validation and fold 1 for the final test set.
Table 4: Parameter values tested for each algorithm
Algorithm | Parameter 1 | Parameter 2 |
---|---|---|
LS1 | C=0, 0.1, or 0.2 | |
LS2 | C=0, 0.1, or 0.2 | |
LS3 | C=0, 0.025, or 0.05 | |
ℓ_{1}RE | C=0 or 0.1 | \(C_{\text{high}}=0\) |
ℓ_{2}RE | C=0 or 0.1 | \(C_{\text{high}}=0\), 0.5, or 1 |
MIO-RE | C=0 or 0.5 | θ=0 or 9 |
In total, for the cross-validation step, there were 6×3=18 problems to solve for each of LS1, LS2, and LS3; 6×2=12 problems for ℓ_{1}RE; 6×2×3=36 problems for ℓ_{2}RE; and 6×2×2=24 problems for MIO-RE. (For each method, the total number of problems was the number of different parameter settings times six, which is the number of ways to choose two out of four folds for training.) For the test step, there were an additional four problems for each method. This set of experiments required approximately 163 hours of computation time.
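The problem counts can be checked with a short sketch that enumerates the four rounds of the protocol and the parameter grid. The dictionary layout and the ASCII method names (`l1RE`, `l2RE`) are illustrative choices, not part of the original experiments.

```python
from itertools import product

FOLDS = [1, 2, 3, 4]

# Parameter grid as listed in the parameter table.
GRID = {
    "LS1": {"C": [0, 0.1, 0.2]},
    "LS2": {"C": [0, 0.1, 0.2]},
    "LS3": {"C": [0, 0.025, 0.05]},
    "l1RE": {"C": [0, 0.1], "C_high": [0]},
    "l2RE": {"C": [0, 0.1], "C_high": [0, 0.5, 1]},
    "MIO-RE": {"C": [0, 0.5], "theta": [0, 9]},
}

def settings(grid):
    """All parameter combinations for one method."""
    return [dict(zip(grid, values)) for values in product(*grid.values())]

def rounds():
    """Yield (test_fold, [(training_folds, validation_fold), ...]) per round."""
    for test_fold in reversed(FOLDS):
        cv_folds = [f for f in FOLDS if f != test_fold]
        splits = [(tuple(f for f in cv_folds if f != v), v) for v in reversed(cv_folds)]
        yield test_fold, splits

# Each pair of training folds recurs in two rounds, so only six distinct
# training problems arise per parameter setting.
unique_training_sets = {train for _, splits in rounds() for train, _ in splits}
cv_problems = {m: len(settings(g)) * len(unique_training_sets) for m, g in GRID.items()}

print(len(unique_training_sets))  # 6
print(cv_problems)
```

Counting distinct training sets rather than (training, validation) pairs is what gives the factor of six: a model trained on a given pair of folds can be validated on either remaining fold without retraining.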
8.2 Results
1. Let M1_{m} be the value of M1 for method m, where m is either LS1, LS2, LS3, ℓ_{1}RE, ℓ_{2}RE, or MIO-RE. Note that these are the M1 values from training on folds 1, 2, and 3.
2. Let \(\text{M1}_{\text{min}}\) be the minimum of the six M1_{m} values.
3. The bar height for method m is the percentage increase of M1_{m} from \(\text{M1}_{\text{min}}\):
$$ \frac{\text{M1}_m-\text{M1}_{\text{min}}}{\text{M1}_{\text{min}}}. $$
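As a small illustration of this computation, the sketch below applies the formula to the average training M1 values reported for each algorithm in this section. The bars themselves are defined on per-round values that are not listed here, so the averages serve only as stand-in inputs.

```python
def bar_heights(m1_by_method):
    """Relative increase of each method's M1 over the smallest M1."""
    m1_min = min(m1_by_method.values())
    return {m: (v - m1_min) / m1_min for m, v in m1_by_method.items()}

# Average training M1 per algorithm (stand-in inputs, see note above).
avg_train_m1 = {
    "LS1": 0.767, "LS2": 0.792, "LS3": 0.752,
    "l1RE": 0.797, "l2RE": 0.792, "MIO-RE": 0.836,
}

heights = bar_heights(avg_train_m1)
for method, h in heights.items():
    print(f"{method}: {100 * h:.1f}%")
```

By construction the worst method has height zero, so the chart shows only how far each other method sits above it.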
Table 5: Average of M1 metric over four rounds for each algorithm
Algorithm | M1 (train) | M1 (test) |
---|---|---|
LS1 | 0.767 | 0.794 |
LS2 | 0.792 | 0.798 |
LS3 | 0.752 | 0.811 |
ℓ_{1}RE | 0.797 | 0.820 |
ℓ_{2}RE | 0.792 | 0.814 |
MIO-RE | 0.836 | 0.840 |
Table 6: Sums of ranks over four rounds for each algorithm
Split | Measure | LS3 | LS1 | LS2 | ℓ_{2}RE | ℓ_{1}RE | MIO-RE |
---|---|---|---|---|---|---|---|
Train | M1 | 4 | 3 | 9 | 9 | 11 | 17 |
Train | M2 | 0 | 8 | 4 | 13 | 13 | 20 |
Train | M3 | 0 | 8 | 8 | 7 | 8 | 18 |
Train | Total | 4 | 19 | 21 | 29 | 32 | 55 |
Test | M1 | 7 | 6 | 5 | 7 | 8 | 15 |
Test | M2 | 0 | 14 | 8 | 11 | 18 | 8 |
Test | M3 | 0 | 5 | 11 | 8 | 3 | 8 |
Test | Total | 7 | 25 | 24 | 26 | 29 | 31 |
Note that LS1 has an inherent advantage over the other five methods in that it uses information—namely the true scores—that is not available to the other methods that use only the ranks. As discussed earlier, in many cases the true scores may not be available if the rating company does not provide them. Even if the scores are available, our experiment demonstrates that it is possible for methods that encode only the ranks, such as MIO-RE, to have comparable or better performance than methods that directly use the scores. For example, in all but the third round of our experiment, it appears that there was a particularly good solution that none of the approximate methods found, but that MIO-RE did, similar to the results in Sect. 7. This is the major advantage of exactly optimizing the objective function rather than using a convex proxy.
8.3 Example of differences between methods on evaluation measures
Table 7: Example of ranked lists produced by different algorithms, corresponding to metrics in Table 8
True | MIO-RE | LS3 |
---|---|---|
LakeCounty | ||
Brassfield | Brassfield | Wildhurst |
Langtry | Langtry | Langtry |
Wildhurst | Wildhurst | Brassfield |
NorthCoast | ||
Alpen | Alpen | Alpen |
Fieldbrook | Fieldbrook | Fieldbrook |
Winnett | Winnett | Winnett |
SouthCali | ||
Faulkner | Lenora | Lenora |
Lenora | Faulkner | Faulkner |
Peralta | Peralta | Peralta |
Salerno | Salerno | Thompkin |
Thompkin | Thompkin | Salerno |
Mendocino | ||
Baxter | Navarro | Navarro |
Goldeneye | Baxter | Baxter |
Navarro | Goldeneye | Goldeneye |
Skylark | Skylark | Skylark |
CentralCoast | ||
Blackstone | Blackstone | Morgan |
Estancia | Estancia | Blackstone |
Jenkins | Morgan | Ronan |
Morgan | Parsonage | Estancia |
Newell | Newell | Ventana |
Parsonage | Jenkins | Jenkins |
Ronan | Ronan | Newell |
Ventana | Ventana | Parsonage |
CentralVal | ||
Accardi | Accardi | Accardi |
Baywood | Baywood | Mariposa |
Cantiga | Mariposa | Trimble |
Harmony | Cantiga | Harmony |
Mariposa | Omega | Cantiga |
Omega | Watts | Omega |
Trimble | Harmony | Watts |
Watts | Trimble | Baywood |
SierraFoot | ||
Auriga | Auriga | Auriga |
Chevalier | Chevalier | Paravi |
Dillian | Paravi | Chevalier |
Fitzpatrick | Dillian | Solomon |
Hatcher | Fitzpatrick | Oakstone |
Montevina | Hatcher | Hatcher |
Oakstone | Montevina | Fitzpatrick |
Paravi | Oakstone | Dillian |
Renwood | Solomon | Renwood |
Solomon | Renwood | Montevina |
Venezio | Venezio | Venezio |
NapaValley | ||
Carter | Falcor | Falcor |
Falcor | Carter | Carter |
Ilsley | Ilsley | Kelham |
Kelham | Kelham | Ilsley |
Mason | Mason | Mason |
Oberon | Oberon | Oberon |
Quintessa | Relic | Quintessa |
Relic | Quintessa | Trefethen |
Sawyer | Sawyer | Relic |
Trefethen | Varozza | Sawyer |
Varozza | Trefethen | Varozza |
Table 8: Comparison of MIO-RE and LS3 (train on folds 2, 3, and 4; test on fold 1), corresponding to ranked lists in Table 7
Algorithm | M1 | M2 | M3 |
---|---|---|---|
MIO-RE | 0.967 | 0.904 | 0.887 |
LS3 | 0.867 | 0.796 | 0.868 |
9 Determining a cost-effective way to achieve top rankings
9.1 Two formulations
We directly provide the formulations for, first, achieving a cost-effective increase in score, and, second, minimizing cost for a fixed target score.
9.1.1 Maximizing score on a fixed budget
9.1.2 Minimizing cost with a fixed target score
By solving the first formulation for a range of budgets, or by solving the second formulation for a range of target scores, we can map out an efficient frontier of maximum score for minimum cost. This concept is best explained through an example, which we present in the next section.
9.2 Practical example
Point-and-shoot digital camera factors
Factor | Name |
---|---|
1 | Resolution |
2 | Weight |
3 | Photo quality |
4 | Video quality |
5 | Response time |
6 | Handling shake |
7 | Versatility |
8 | LCD quality |
9 | Widest angle |
10 | Battery life |
Coefficients of scoring function for digital cameras
Coefficient | Value |
---|---|
w_{1} | 0.584 |
w_{2} | −0.571 |
w_{3} | 4.342 |
w_{4} | 2.926 |
w_{5} | 3.769 |
w_{6} | 1.137 |
w_{7} | 1.442 |
w_{8} | 2.896 |
w_{9} | 0.005 |
w_{10} | 0.001 |
Scores of two example cameras
Camera | x_{1} | x_{2} | x_{3} | x_{4} | x_{5} | x_{6} | x_{7} | x_{8} | x_{9} | x_{10} | Score |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 14 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 35 | 500 | 88.38 |
2 | 12 | 5 | 4 | 4 | 4 | 3 | 4 | 4 | 30 | 300 | 69.41 |
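The following sketch recomputes the two scores as w^{T}x from the tabulated coefficients and breaks the gap between the cameras down by factor. Because the printed coefficients are rounded, the recomputed scores are close to, but not exactly equal to, the tabulated 88.38 and 69.41.

```python
# Coefficients w_1..w_10 and factor names as printed in the tables above.
W = [0.584, -0.571, 4.342, 2.926, 3.769, 1.137, 1.442, 2.896, 0.005, 0.001]
FACTORS = ["Resolution", "Weight", "Photo quality", "Video quality",
           "Response time", "Handling shake", "Versatility", "LCD quality",
           "Widest angle", "Battery life"]
CAMERA_1 = [14, 5, 5, 5, 5, 5, 5, 5, 35, 500]
CAMERA_2 = [12, 5, 4, 4, 4, 3, 4, 4, 30, 300]

def score(w, x):
    """Linear score w^T x."""
    return sum(wj * xj for wj, xj in zip(w, x))

def gap_by_factor(w, x_a, x_b):
    """Contribution of each factor to score(x_a) - score(x_b)."""
    return {name: wj * (a - b) for name, wj, a, b in zip(FACTORS, w, x_a, x_b)}

score_1 = score(W, CAMERA_1)
score_2 = score(W, CAMERA_2)
gaps = gap_by_factor(W, CAMERA_1, CAMERA_2)
largest = max(gaps, key=gaps.get)

print(round(score_1, 2), round(score_2, 2))
print(largest, round(gaps[largest], 3))
```

A per-factor breakdown of this kind indicates where the lower-scoring camera loses the most points, which is the information the cost-effectiveness formulations exploit.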
Change information for a digital camera
No. | Change | δ_{1} | δ_{2} | δ_{3} | δ_{4} | δ_{5} | δ_{6} | δ_{7} | δ_{8} | δ_{9} | δ_{10} | Cost |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Larger battery | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 50 | 2 |
2 | Add 1 megapixel | 1 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 3 |
3 | Better LCD | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 0.5 | ⋅ | ⋅ | 4 |
4 | More modes | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 1 | ⋅ | ⋅ | ⋅ | 4 |
5 | Wider angle | ⋅ | ⋅ | 0.5 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 2 | ⋅ | 5 |
6 | Add 2 megapixels | 2 | ⋅ | 0.5 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 5 |
7 | Heavier material | ⋅ | 1 | ⋅ | ⋅ | ⋅ | 1 | ⋅ | ⋅ | ⋅ | ⋅ | 5 |
8 | Better video | ⋅ | ⋅ | ⋅ | 1 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 6 |
9 | Faster response | ⋅ | ⋅ | ⋅ | ⋅ | 0.5 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 6 |
10 | Better lens | ⋅ | ⋅ | 0.5 | 1 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 7 |
11 | Fastest response | ⋅ | ⋅ | ⋅ | ⋅ | 0.5 | 1 | ⋅ | ⋅ | ⋅ | ⋅ | 7 |
12 | Most modes | ⋅ | ⋅ | 1 | ⋅ | 0.5 | ⋅ | 1 | ⋅ | ⋅ | ⋅ | 9 |
Conflict sets (M=6)
m | S_{m} |
---|---|
1 | {2,6} |
2 | {5,6,10,12} |
3 | {8,10} |
4 | {9,11,12} |
5 | {7,11} |
6 | {4,12} |
Conflicts between changes
No. | Change | Conflicts |
---|---|---|
1 | Larger battery | ⋅ |
2 | Add 1 megapixel | 6 |
3 | Better LCD | ⋅ |
4 | More modes | 12 |
5 | Wider angle | 6, 10, 12 |
6 | Add 2 megapixels | 2, 5, 10, 12 |
7 | Heavier material | 11 |
8 | Better video | 10 |
9 | Faster response | 11, 12 |
10 | Better lens | 5, 6, 8, 12 |
11 | Fastest response | 7, 9, 12 |
12 | Most modes | 4, 5, 6, 9, 10, 11 |
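The pairwise conflicts listed here follow from the conflict sets if each set S_{m} is read as "at most one of these changes may be selected". The sketch below, with the hypothetical helper `pairwise_conflicts`, derives the per-change conflict lists from the six sets under that assumption.

```python
from itertools import combinations

# Conflict sets S_1..S_6, transcribed from the conflict-set table.
CONFLICT_SETS = [{2, 6}, {5, 6, 10, 12}, {8, 10}, {9, 11, 12}, {7, 11}, {4, 12}]


def pairwise_conflicts(conflict_sets, n_changes=12):
    """Map each change to the set of changes it cannot be combined with."""
    conflicts = {k: set() for k in range(1, n_changes + 1)}
    for s in conflict_sets:
        # Every pair inside one conflict set is mutually exclusive.
        for a, b in combinations(sorted(s), 2):
            conflicts[a].add(b)
            conflicts[b].add(a)
    return conflicts


conflicts = pairwise_conflicts(CONFLICT_SETS)

for k in sorted(conflicts):
    print(k, sorted(conflicts[k]))
```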
Lookup table for fixed budget
Max cost | Optimal change(s) | Score diff | Actual cost |
---|---|---|---|
2 | Larger battery | 0.030 | 2 |
3 | Add 1 megapix | 0.584 | 3 |
4 | Better LCD | 1.448 | 4 |
5 | Add 2 megapix | 3.339 | 5 |
6 | Better video | 2.926 | 6 |
7 | Better lens | 5.097 | 7 |
8 | Better lens | 5.097 | 7 |
9 | Most modes | 7.669 | 9 |
10 | Most modes | 7.669 | 9 |
⋮ | ⋮ | ⋮ | ⋮ |
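For an instance of this size, the fixed-budget problem can be explored by brute-force enumeration rather than with an integer optimization solver. The following sketch is not the formulation used in the paper: it assumes score gains are additive across changes, uses the rounded coefficients printed above, and allows at most one change per conflict set. The names `gain`, `feasible_subsets`, and `best_for_budget` are illustrative.

```python
from itertools import combinations

W = [0.584, -0.571, 4.342, 2.926, 3.769, 1.137, 1.442, 2.896, 0.005, 0.001]

# Change number -> (name, {factor index j: delta_j}, cost), from the change table.
CHANGES = {
    1: ("Larger battery", {10: 50}, 2),
    2: ("Add 1 megapixel", {1: 1}, 3),
    3: ("Better LCD", {8: 0.5}, 4),
    4: ("More modes", {7: 1}, 4),
    5: ("Wider angle", {3: 0.5, 9: 2}, 5),
    6: ("Add 2 megapixels", {1: 2, 3: 0.5}, 5),
    7: ("Heavier material", {2: 1, 6: 1}, 5),
    8: ("Better video", {4: 1}, 6),
    9: ("Faster response", {5: 0.5}, 6),
    10: ("Better lens", {3: 0.5, 4: 1}, 7),
    11: ("Fastest response", {5: 0.5, 6: 1}, 7),
    12: ("Most modes", {3: 1, 5: 0.5, 7: 1}, 9),
}

CONFLICT_SETS = [{2, 6}, {5, 6, 10, 12}, {8, 10}, {9, 11, 12}, {7, 11}, {4, 12}]


def gain(k):
    """Score difference of change k under the assumed linear model."""
    return sum(W[j - 1] * d for j, d in CHANGES[k][1].items())


def feasible_subsets():
    """Yield every non-empty subset with at most one change per conflict set."""
    for r in range(1, len(CHANGES) + 1):
        for combo in combinations(CHANGES, r):
            chosen = set(combo)
            if all(len(chosen & s) <= 1 for s in CONFLICT_SETS):
                yield combo


def best_for_budget(budget):
    """Return (score gain, cost, changes) maximizing gain with cost <= budget."""
    best = (0.0, 0, ())
    for combo in feasible_subsets():
        cost = sum(CHANGES[k][2] for k in combo)
        total = sum(gain(k) for k in combo)
        if cost <= budget and total > best[0]:
            best = (total, cost, combo)
    return best


total, cost, chosen = best_for_budget(7)
print([CHANGES[k][0] for k in chosen], round(total, 3), cost)
```

With only 12 changes there are 4095 non-empty subsets, so enumeration is immediate; for larger instances an integer optimization solver would be the natural choice. For a budget of 7 the sketch selects "Better lens", and for a budget of 9 it selects "Most modes".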
Lookup table for target score
Min diff | Optimal change(s) | Cost | Actual diff |
---|---|---|---|
1 | More modes | 4 | 1.442 |
1 | Better LCD | 4 | 1.448 |
2 | Wider angle | 5 | 2.182 |
2 | Add 2 megapix | 5 | 3.339 |
3 | Add 2 megapix | 5 | 3.339 |
4 | Better lens | 7 | 5.097 |
5 | Better lens | 7 | 5.097 |
6 | Most modes | 9 | 7.669 |
7 | Most modes | 9 | 7.669 |
⋮ | ⋮ | ⋮ | ⋮ |
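The target-score lookup can be sketched in the same brute-force style: among all conflict-free subsets of changes whose total score gain reaches the required minimum, keep those of least cost. As before, this is an illustration under assumptions (additive gains, rounded coefficients, at most one change per conflict set), and `cheapest_for_target` is a hypothetical name. Returning every minimum-cost subset captures the cases in the table where two alternatives share the same cost.

```python
from itertools import combinations

W = [0.584, -0.571, 4.342, 2.926, 3.769, 1.137, 1.442, 2.896, 0.005, 0.001]

# Change number -> (name, {factor index j: delta_j}, cost), from the change table.
CHANGES = {
    1: ("Larger battery", {10: 50}, 2),
    2: ("Add 1 megapixel", {1: 1}, 3),
    3: ("Better LCD", {8: 0.5}, 4),
    4: ("More modes", {7: 1}, 4),
    5: ("Wider angle", {3: 0.5, 9: 2}, 5),
    6: ("Add 2 megapixels", {1: 2, 3: 0.5}, 5),
    7: ("Heavier material", {2: 1, 6: 1}, 5),
    8: ("Better video", {4: 1}, 6),
    9: ("Faster response", {5: 0.5}, 6),
    10: ("Better lens", {3: 0.5, 4: 1}, 7),
    11: ("Fastest response", {5: 0.5, 6: 1}, 7),
    12: ("Most modes", {3: 1, 5: 0.5, 7: 1}, 9),
}

CONFLICT_SETS = [{2, 6}, {5, 6, 10, 12}, {8, 10}, {9, 11, 12}, {7, 11}, {4, 12}]


def gain(k):
    """Score difference of change k under the assumed linear model."""
    return sum(W[j - 1] * d for j, d in CHANGES[k][1].items())


def feasible_subsets():
    """Yield every non-empty subset with at most one change per conflict set."""
    for r in range(1, len(CHANGES) + 1):
        for combo in combinations(CHANGES, r):
            chosen = set(combo)
            if all(len(chosen & s) <= 1 for s in CONFLICT_SETS):
                yield combo


def cheapest_for_target(min_diff):
    """Return (minimum cost, all subsets at that cost) with total gain >= min_diff."""
    hits = []
    for combo in feasible_subsets():
        if sum(gain(k) for k in combo) >= min_diff:
            hits.append((sum(CHANGES[k][2] for k in combo), combo))
    if not hits:
        return None, []  # target not reachable with the available changes
    min_cost = min(cost for cost, _ in hits)
    return min_cost, [combo for cost, combo in hits if cost == min_cost]


cost, options = cheapest_for_target(2)
print(cost, [[CHANGES[k][0] for k in combo] for combo in options])
```

For a minimum difference of 2, the sketch returns a cost of 5 with two alternatives, "Wider angle" and "Add 2 megapixels", matching the corresponding rows of the lookup table.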
10 Conclusion
We presented a machine learning approach to reverse-engineering ranking models, and an experiment on data from a rating company. The formulation encodes a specific preference structure and categorical organization of the products. Another contribution of our work is the introduction of evaluation measures that take into account the rank of a new product, relative to the products that have already been ranked. Finally, we showed how to use a reverse-engineered ranking model to achieve a high rank for a product in a cost-effective way.
This work suggests several avenues for future research. For instance, it would be useful to develop an algorithm that solves the ranking problem while also locating potential errors in the data. Another direction is to quantify the uncertainty in each coefficient of the reverse-engineered model.
Footnotes
- 1. http://www.bloomberg.com/apps/news?sid=ah839IWTLP9s&pid=newsarchive by Elliot Blair Smith, September 24, 2008.
- 2. Directed by Kirby Dick, 2006.
- 3.
- 4. Dataset available at: http://web.mit.edu/rudin/www/ReverseEngineering_Flex_Data.csv.
- 5. All least-squares methods were implemented using R 2.8.1, and all ℓ_{p}RE methods were implemented using MATLAB 7.8.0, on a computer with an Intel Core 2 Duo 2 GHz processor with 1.98 GB of RAM. MIO-RE was implemented using ILOG AMPL 11.210 with the Gurobi 3.0.0 solver on a computer powered by two Intel quad core Xeon E5440 2.83 GHz processors with 32 GB of RAM. We always used ε=10^{−6} for MIO-RE.
- 6.
Notes
Acknowledgements
This material is based upon work supported by the MIT-Ford Alliance and the National Science Foundation under Grant No IIS-1053407. We would like to thank Dimitris Bertsimas from MIT, Brian Jahn and Larry Kummer from Ford, and Elaine Savage, John Leonard, and Ed Krause from the MIT-Ford Alliance.
References
- Ataman, K., Street, W. N., & Zhang, Y. (2006). Learning to rank by maximizing AUC with linear programming. In International joint conference on neural networks.
- Bajek, E. (2008). (Almost) How the Elias Sports Bureau rankings work. Unpublished blog entry at http://tigers-thoughts.blogspot.com/2008/07/almost-how-elias-sports-bureau-rankings.html.
- Bertsimas, D., & Weismantel, R. (2005). Optimization over integers. Charlestown: Dynamic Ideas.
- Bertsimas, D., Chang, A., & Rudin, C. (2010). A discrete optimization approach to supervised ranking. In Proceedings of the 5th INFORMS workshop on data mining and health informatics (DM-HI 2010).
- Bertsimas, D., Chang, A., & Rudin, C. (2011). Integer optimization methods for supervised ranking. MIT DSpace, Operations Research Center. Working paper available at http://dspace.mit.edu/handle/1721.1/67362.
- Burges, C. J., Ragno, R., & Le, Q. (2006). Learning to rank with nonsmooth cost functions. In Proceedings of neural information processing systems (NIPS) (pp. 395–402).
- Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., & Li, H. (2007). Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on machine learning (ICML) (pp. 129–136).
- Chandler, S. J. (2006). Analyzing US News and World Report rankings using symbolic regression. Unpublished blog entry at http://taxprof.typepad.com/taxprof_blog/files/analyzing_u.S. News & World Report Rankings Using Symbolic Regression.pdf.
- Cossock, D., & Zhang, T. (2006). Subset ranking using regression. In Proceedings of the 19th conference on learning theory (COLT) (pp. 605–619).
- Ferri, C., Flach, P., & Hernández-Orallo, J. (2002). Learning decision trees using the area under the ROC curve. In Proceedings of the 19th international conference on machine learning (ICML) (pp. 139–146).
- Freund, Y., Iyer, R., Schapire, R. E., & Singer, Y. (2003). An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4, 933–969.
- Green, P. E., Krieger, A. M., & Wind, Y. (J.) (2001). Thirty years of conjoint analysis: Reflections and prospects. Interfaces, 31(3), S56–S73.
- Hammer, P. L., Kogan, A., & Lejeune, M. A. (2007). Reverse-engineering banks’ financial strength ratings using logical analysis of data. Working paper available at http://www.optimization-online.org/DB_FILE/2007/02/1581.pdf.
- Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the ACM conference on knowledge discovery and data mining (KDD). New York: ACM Press.
- Kendall, M. G. (1938). A new measure of rank correlation. Biometrika, 30(1/2), 81–93.
- Lafferty, J., & Zhai, C. (2001). Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval (pp. 111–119). New York: ACM Press.
- Le, Q., & Smola, A. (2007). Direct optimization of ranking measures. arXiv:0704.3359v1 [cs.IR].
- Li, P., Burges, C. J., & Wu, Q. (2007). McRank: learning to rank using multiple classification and gradient boosting. In Proceedings of neural information processing systems (NIPS).
- Matveeva, I., Laucius, A., Burges, C., Wong, L., & Burkard, T. (2006). High accuracy retrieval with multiple nested ranker. In Proceedings of the 29th annual international ACM SIGIR conference on research and development in information retrieval (pp. 437–444). New York: ACM Press.
- Morgenson, G., & Story, L. (2010). Rating agency data aided Wall Street in deals. New York Times, Business section, April 24, 2010. Article at http://www.nytimes.com/2010/04/24/business/24rating.html.
- Rudin, C. (2009). The P-Norm Push: a simple convex ranking algorithm that concentrates at the top of the list. Journal of Machine Learning Research, 10, 2233–2271.
- Su, A.-J., Hu, Y. C., Kuzmanovic, A., & Koh, C.-K. (2010). How to improve your Google ranking: myths and reality. In IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology (WI-IAT) (Vol. 1, pp. 50–57).
- Tsochantaridis, I., Joachims, T., Hofmann, T., Altun, Y., & Singer, Y. (2005). Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6, 1453–1484.
- Xu, J., Liu, T. Y., Lu, M., Li, H., & Ma, W. Y. (2008). Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval. New York: ACM Press.