Abstract
A good or bad product quality rating can make or break an organization. However, the notion of “quality” is often defined by an independent rating company that does not make the formula for determining the rank of a product publicly available. In order to invest wisely in product development, organizations are starting to use intelligent approaches for determining how funding for product development should be allocated. A critical step in this process is to “reverse-engineer” a rating company’s proprietary model as closely as possible. In this work, we provide a machine learning approach for this task, which optimizes a certain rank statistic that encodes preference information specific to quality rating data. We present experiments on data from a major quality rating company, and provide new methods for evaluating the solution. In addition, we provide an approach to use the reverse-engineered model to achieve a top-ranked product in a cost-effective way.
Keywords
Supervised ranking; Quality ratings; Discrete optimization; Reverse-engineering; Applications of machine learning
1 Introduction
Many organizations depend on the top ratings given to their products or services by quality rating companies. For instance, the reputations of undergraduate and graduate programs at colleges and universities depend heavily on their U.S. News and World Report rankings. Similarly, the mortgage industry hinges on the models of credit rating agencies like Standard & Poor’s, Moody’s, Dun & Bradstreet, and Fitch Ratings. Mutual funds rely on Morningstar and Lipper ratings. For electronics, rating companies include CNET and PCMag; and for vehicles, they include What Car?, J.D. Power, Edmunds, Kelley Blue Book, and Car and Driver. Most of these rating companies use a formula to score products, and few of them make their complete rating formulas public. Moreover, the exact values of the input data to the formula are also often kept confidential. If organizations were able to recreate the formulas for quality rating models, they would better understand the standards by which their products were being judged, which would potentially allow them to produce better products. Furthermore, rating companies that are aware of reverse-engineering may be motivated to reevaluate the accuracy and fairness of their formulas in representing the quality of products.

Point 1 (Linear scoring functions): The rating company states publicly that its product rankings are based on real-valued scores given to each product, and that the score is a weighted linear combination of a known set of factors. The precise values for some factors can be obtained directly, but other factors have been discretized into a number of “stars” between 1 and 5 and are thus noisy versions of the true values. For example, the National Highway Traffic Safety Administration discretizes factors pertaining to vehicle safety ratings.

Point 2 (Category structure): Products are organized into categories, and within each category there are one or more subcategories. For example, a computer rating company may have a laptop category with subcategories such as netbooks and tablets. Products within a category share the same scoring system, but the ranking of each product is with respect to its subcategory.

Point 3 (Ranks over scores): The scores themselves are not as meaningful as the ranks, since consumers pay more attention to product rankings than to scores or to differences in score. Moreover, sometimes the scores are not available at all, and only the ranks are available.

Point 4 (Focus on top products): Consumers generally focus on top-ranked products, so a model that can reproduce the top of each subcategory’s ranked list accurately is more valuable than one that better reproduces the middle or bottom of the list.
Note that even though Point 1 makes the assumption of known factors, it is also possible to use our method for problems in which the factors are unknown. As long as the factors in our model encompass the information used for the rating system, our algorithm can be applied regardless of whether or not the factors are precisely the same as those used by the rating company. For instance, a camera expert might know all of the potential camera characteristics that could contribute to camera quality, which we could then use as the factors in our model.
After the model has been reverse-engineered, we can use it to determine the most cost-effective way to increase product rankings, and we present discrete optimization algorithms for this task. These algorithms can be used independently of the reverse-engineering method. That is, if the reverse-engineered formula were obtained using a different method from ours, or if the formula were made public, we could still use these algorithms to cost-effectively increase a product’s rank.
We describe related work in Sect. 2. In Sect. 3, we derive a ranking quality objective that encodes the preference relationships discussed above. In Sect. 4, we provide the machine learning algorithm, based on discrete optimization, that exactly maximizes the ranking quality objective. In Sect. 5, we establish new measures that can be used to evaluate the performance of our model. In Sect. 6, we derive several baseline algorithms for reverse-engineering, all involving convex optimization. Section 7 contains results from a proof-of-concept experiment, and Sect. 8 provides experimental results using rating data from a major quality rating company. Section 9 discusses the separate problem of how to cost-effectively increase the rank of a product. Finally, we conclude in Sect. 10. The main contributions of the paper are: the application of machine learning to reverse-engineering product quality rankings; our method of encoding the preference relationships in accordance with Points 1 through 4 above; using data from other product categories as regularization; the design of novel evaluation measures; and the mechanism to cost-effectively achieve a highly ranked product.
2 Related work
Reverse-engineering and approximation of rating models have been done in a number of industries, albeit not applied to rankings for consumer products with the category/subcategory structure. The related work we have found is published mostly within blogs. These works deal mostly with the problem of approximating the ranking function with a smaller number of variables, rather than using the exact factors in the rating company’s formula. For instance, Chandler (2006) approximated the U.S. News and World Report Law School rankings using symbolic regression to obtain a formula with four factors, and another with seven factors; currently the formula for the law school rankings is completely public and based on survey results, but the approximated versions are much simpler. In the sports industry, there has been some work in reverse-engineering Elias Sports Bureau rankings, which are used to determine compensation for free agents (Bajek 2008). The search engine optimization (SEO) industry aims to be able to boost the search engine rank of a web page by figuring out which features have high influence in the ranking algorithm. For instance, Su et al. (2010) used a linear optimization model to approximate Google web page rankings. As a final example, Hammer et al. (2007) approximated credit rating models using Logical Analysis of Data. As far as we know, our work is the first to present a specialized machine learning algorithm to reverse-engineer product ratings.
If the ratings are accurate measures of quality, then making the ratings more transparent could have a uniformly positive impact: it would help companies to make better rated products, it would help consumers to have these higher quality products, and it would encourage rating companies to receive feedback as to whether their rating systems fairly represent quality. If the ratings are not accurate measures of quality, many problems could arise. Unethical manipulation of reverse-engineered credit rating models heavily contributed to the 2007–2010 financial crisis (Morgenson and Story 2010). These ratings permitted some companies to sell “junk bonds” with very high ratings. Rating companies were blamed for “performing the alchemy that converted the securities from F-rated to A-rated.”^{1}
Rating systems can also be arbitrary; even some well-established, heavily trusted rating systems can be inconsistent from product to product. There has also been some controversy over the Motion Picture Association of America movie rating system, discussed in the documentary “This Film Is Not Yet Rated.”^{2} The MPAA rating system sorts movies into categories based on how appropriate they are for certain audiences. The documentary demonstrates that the rating system was inconsistent between different types of films, and that the MPAA directly lied to the public regarding the way these ratings are constructed. This can be difficult for movie makers, whose profits may depend on getting an “R” rating rather than an “NC-17” rating, and it also causes problems for moviegoers, who want to know whether the movie is suitable for them.
Our reverse-engineering problem could potentially be useful in the area of conjoint analysis in marketing (Green et al. 2001). Conjoint analysts aim to model how a consumer chooses one brand over another, with the goal of learning which product characteristics are most important to consumers.
We have considered the reverse-engineering task as a problem of supervised ranking. Supervised ranking originated to handle problems that occur mainly in the information retrieval domain (see, for instance, the LETOR compilation of works^{3}). The vast majority of work on supervised ranking considers problems that are specific to information retrieval (e.g., Cao et al. 2007; Matveeva et al. 2006; Lafferty and Zhai 2001; Li et al. 2007) or gives insight into how to approximately solve versions of extremely large ranking problems quickly (Tsochantaridis et al. 2005; Freund et al. 2003; Cossock and Zhang 2006; Joachims 2002; Burges et al. 2006; Xu et al. 2008; Le and Smola 2007; Ferri et al. 2002; Ataman et al. 2006). For the task of reverse-engineering ranking models, fast computational speed is not essential, and the extra time needed to compute a better solution is worthwhile. This, coupled with the fact that the size of the dataset is not extremely large, permits us to use mixed-integer optimization (MIO). MIO preserves our encoding of exactly the desired preference structure, where we have incorporated membership into categories and subcategories. If we remove regularization and do not concentrate on the top ranks, then the problem is a generalization of Area Under the Curve (AUC) maximization (Freund et al. 2003; Joachims 2002). Most works on AUC maximization use a smoothed approximation of the 0-1 loss within the AUC. If we were to use a smoothed approximation for the reverse-engineering problem, it is possible that the algorithm would miss the best solutions to the 0-1 optimization problem. The \(\ell_{p}\)RE relaxation algorithm we introduce in Sect. 6 is one such approximation. The work of Bertsimas et al. (2010, 2011) also discusses in depth the benefits of exact solutions over relaxations.
Clearly, reverse-engineered ranking models can affect design decisions in a variety of applications. To the best of our knowledge, our work is the first to show the most cost-effective way to increase the rank of a new product.
3 Encoding preferences for quality ranking data
We derive a specialized rank statistic that serves as our objective for reverse-engineering. Maximizing this objective yields estimates of the weights on each of the factors in the rating company’s model. Our starting point is the case of one category with one subcategory, that is, there is only a single ranked list. Then, we generalize this statistic to handle multiple categories and subcategories. Our method can be used to reverse-engineer quality rankings whether or not the underlying scores are made available; we need only to know the ranks.
3.1 One category, one subcategory
3.2 Multiple categories and subcategories
We assume from Sect. 1 that different categories have different ranking models. Even so, these models may be similar enough that knowledge obtained from other categories can be used to “borrow strength” when there are limited data in the category of interest. Thus, as we derive the objective for reverse-engineering the model f for one prespecified category, we use data from all of its subcategories as well as from the subcategories in other categories.
Let \(S_{\text{sub}}\) be the set of all subcategories across all categories, including the category of interest, and let there be \(n_{s}\) products in subcategory s. Similar to our previous notation, \(x_{i}^{s}\in\mathcal{R}^{d}\) represents product i in subcategory s, \(\zeta_{i}^{s}\in\mathcal{R}\) is the score assigned to product i in subcategory s, and \(\pi^{s}_{ik}\) is 1 if \(\zeta_{i}^{s}>\zeta_{k}^{s}\) and is 0 otherwise. The threshold \(T_{s}\) defines the top of the list for subcategory s.
We assume a linear form for the model, in accordance with Point 1 in Sect. 1. That is, the scoring function has the form \(f(x)=w^{T}x\), so that \(w\in\mathcal{R}^{d}\) is a vector of variables in our formulation, and the objective in (2) is a function of w. Note that we can capture relatively complex nonlinear rating systems using a linear model with nonlinear factors. For instance, we could introduce extra factors to accommodate “necessity” constraints, where products that do not have a certain property will always get a low score. To do this, we would add a binary factor to the model that is 1 if the product does not possess the property, and the learning algorithm should discover a large negative weight for that factor.
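As a minimal sketch of the notation above (not the authors' implementation), the pairwise indicators \(\pi^{s}_{ik}\) and the linear score \(f(x)=w^{T}x\) could be computed as follows. The scores, the weights, and the binary "lacks property" factor are hypothetical values chosen only for illustration.

```python
import numpy as np

def preference_matrix(zeta):
    """pi[i, k] = 1 if zeta[i] > zeta[k], else 0, for one subcategory."""
    zeta = np.asarray(zeta, dtype=float)
    return (zeta[:, None] > zeta[None, :]).astype(int)

def score(w, x):
    """Linear scoring function f(x) = w^T x."""
    return float(np.dot(w, x))

# Hypothetical true scores for three products in one subcategory.
pi = preference_matrix([3.0, 1.0, 2.0])

# Hypothetical weights: two ordinary factors plus one binary "necessity"
# factor that is 1 when the product lacks the required property.
w = np.array([2.0, 1.5, -50.0])
with_property = np.array([4.0, 3.0, 0.0])
without_property = np.array([5.0, 5.0, 1.0])

print(pi)
print(score(w, with_property), score(w, without_property))
```

In this toy example the product lacking the property receives a much lower score even though its other factors are higher, which is the behavior the "necessity" factor is meant to capture.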
4 Optimization
We now provide an algorithm to reverse-engineer quality rankings that exactly maximizes (2). The algorithm is called MIORE (Mixed Integer Optimization for Reverse-Engineering), and expands on a technique due to Bertsimas et al. (2010, 2011) for supervised ranking in machine learning. In that work, the authors develop a type of approach with an advantage over other machine learning techniques in that it exactly optimizes the objective. This advantage is counterbalanced by a sacrifice in computational speed, but for the rating problem, new data come out occasionally (e.g., yearly, monthly, weekly) whereas the computation time is generally on the order of hours, depending on the number of products in the training data and the number of factors. In this case, the extra computation time needed to produce a better solution is worthwhile.
In MIO, it is important to note that even though there are often various correct formulations to solve the same problem, not all valid formulations are equally strong. In fact, the ability to solve an MIO problem depends critically on the choice of formulation (see Bertsimas and Weismantel 2005, for details). This is not true of linear optimization, where a good formulation is simply one that correctly captures the model and is small in terms of the number of variables and constraints. In linear optimization, the choice of formulation is not crucial for solving a problem. However, when there are integer variables, it is typical to reformulate multiple times to achieve the best model. Essentially, a formulation is stronger if it cuts off extra unnecessary parts of the region of feasible solutions. Below we present a strong MIO formulation that we have found to work well empirically, and we discuss the logic behind its derivation.
Modern solvers typically produce a bound (upper for maximization problems, lower for minimization problems) as they search for better integer feasible solutions, and when the bound matches the objective value of an integer solution, the solution has reached provable optimality. However, it is common for a solver to find an optimal solution relatively quickly, but to take much longer in proving optimality, that is, in bringing the bound closer to the optimal objective value. See Bertsimas et al. (2011) for an introduction to MIO that discusses in particular the strength of a formulation and also the progress in MIO technology over the last few decades. Due to advances in both hardware and MIO algorithms, computational speed has been increasing exponentially, allowing us today to solve large scale MIO problems that would have been impossible only a few years ago. MIO will be progressively more powerful as this exponential trend continues.
In Sects. 7 and 8, our experimental results show that our MIO algorithm performs well on both training and test data. Considering generalization bounds from statistical learning theory, there are two ways to achieve better test performance: one is to decrease the training error, and the other is to decrease the complexity term to prevent overfitting. Using MIO, we can decrease the training error, and we control the complexity by using regularization across categories as shown in Sect. 3.2.
4.1 Model for reverse-engineering
After the optimization problem (13) is solved for our category of interest, we use the maximizing weights \(w^{*}\) to determine the score \(f(x)=w^{*T}x\) of a new product x within the same category.
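The scoring step can be sketched as below, assuming the weights have already been obtained. The weights, the rated products, and the helper name `predicted_position` are hypothetical; the position is taken to be one plus the number of already-rated products that score strictly higher.

```python
import numpy as np

def predicted_position(w_star, x_new, X_rated):
    """1-based position of a new product among already-rated products,
    counting how many rated products score strictly higher."""
    w_star = np.asarray(w_star, dtype=float)
    rated_scores = np.asarray(X_rated, dtype=float) @ w_star
    new_score = float(np.dot(w_star, np.asarray(x_new, dtype=float)))
    return 1 + int(np.sum(rated_scores > new_score))

w_star = [1.0, 2.0]                   # hypothetical learned weights
X_rated = [[5, 5], [3, 4], [1, 1]]    # hypothetical rated products (scores 15, 11, 3)
print(predicted_position(w_star, [4, 4], X_rated))  # new score is 12
```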
5 Evaluation metrics
In the case of our rating data, one goal is to predict, for instance, whether a new product that has not yet been rated would be among the top-k products that have already been rated. That is, the training data are included in the assessment of test performance. This type of evaluation is contrary to common machine learning practice in which evaluations on the training and test sets are separate, and thus it is not immediately clear how these evaluations should be performed.
In this section, we define three measures that are useful for supervised ranking problems in which test predictions are gauged relative to the training set. These measures are intuitive, and more closely represent how most industries would evaluate ranking quality than conventional rank statistics. The measures are first computed separately for each subcategory and then aggregated over the subcategories to produce a concise result. We focus on the top \(\bar{T}_{s}\) products in subcategory s, and use the following notation, where \(f(x)=w^{T}x\) is a given scoring function.
Measure 1: fraction of correctly ranked pairs among top of ranked list
Measure 2: fraction of correctly ranked pairs over entire ranked list
Measure 3: fraction of correctly classified products
Aggregation of measures
6 Other methods for reverse-engineering
We developed several baseline algorithms for our experiments that also encode the points in the introduction. The first set of methods is based on least squares regression, and the second set consists of convex relaxations of the MIO method. These algorithms could themselves be useful, for instance, if a fast convex algorithm is required.
6.1 Least squares methods for reverse-engineering
 1. the true score \(\zeta^{s}_{i}\) for product \(x^{s}_{i}\) (method LS1),
 2. the rank over all training products, that is, the number of training products that are within subcategories r such that \(C_{r}>0\) and are ranked strictly below \(x^{s}_{i}\) according to the true scores \(\zeta^{s}_{i}\) (method LS2),
 3. the rank within the subcategory, that is, the number of training products in the same subcategory as \(x^{s}_{i}\) that are ranked strictly below \(x^{s}_{i}\) according to the true scores \(\zeta^{s}_{i}\) (method LS3).
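The three quantities listed above can be sketched as follows, under the assumption that each least squares method regresses the factors onto one of them. This sketch treats every subcategory as having \(C_{r}>0\), omits the regularization controlled by C, and uses hypothetical scores, subcategory labels, and factor values.

```python
import numpy as np

def ls_targets(zeta, subcat):
    """Return the LS1, LS2, and LS3 targets for each training product."""
    zeta = np.asarray(zeta, dtype=float)
    subcat = np.asarray(subcat)
    # below[i, k] is True when product k scores strictly below product i.
    below = zeta[None, :] < zeta[:, None]
    same = subcat[:, None] == subcat[None, :]
    ls1 = zeta.copy()                    # true score
    ls2 = below.sum(axis=1)              # rank over all training products
    ls3 = (below & same).sum(axis=1)     # rank within the subcategory
    return ls1, ls2, ls3

def fit_least_squares(X, y):
    """Ordinary least squares fit of weights w so that X @ w approximates y."""
    w, *_ = np.linalg.lstsq(np.asarray(X, dtype=float),
                            np.asarray(y, dtype=float), rcond=None)
    return w

zeta = [9.0, 7.0, 8.0, 3.0]          # hypothetical true scores
subcat = ["a", "a", "b", "b"]        # hypothetical subcategory labels
ls1, ls2, ls3 = ls_targets(zeta, subcat)

X = [[5, 4], [4, 3], [4, 4], [1, 2]]  # hypothetical factor values
print(ls2, ls3, fit_least_squares(X, ls3))
```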
6.2 The \(\ell_{p}\) reverse-engineering algorithm
7 Proof of concept
Table 1: Training and test values for M1, M2, and M3 on artificial dataset (top 60)

| Algorithm | Data | M1 | M2 | M3 |
|---|---|---|---|---|
| LS1, \(\ell_{1}\)RE, \(\ell_{2}\)RE | train | 0.878 | 0.912 | 0.780 |
| | test | 0.892 | 0.909 | 0.770 |
| LS2 | train | 0.909 | 0.923 | 0.780 |
| | test | 0.915 | 0.918 | 0.770 |
| MIORE | train | 0.925 | 0.928 | 0.780 |
| | test | 0.943 | 0.929 | 0.770 |

Table 2: Training and test values for M1, M2, and M3 on artificial dataset (top 45)

| Algorithm | Data | M1 | M2 | M3 |
|---|---|---|---|---|
| LS1, \(\ell_{1}\)RE, \(\ell_{2}\)RE | train | 0.880 | 0.912 | 0.920 |
| | test | 0.898 | 0.909 | 0.930 |
| LS2 | train | 0.935 | 0.923 | 0.920 |
| | test | 0.942 | 0.918 | 0.930 |
| MIORE | train | 0.964 | 0.928 | 0.920 |
| | test | 0.994 | 0.929 | 0.930 |

Table 3: Training and test values for M1, M2, and M3 on artificial dataset (top 25)

| Algorithm | Data | M1 | M2 | M3 |
|---|---|---|---|---|
| LS1, \(\ell_{1}\)RE, \(\ell_{2}\)RE | train | 0.907 | 0.912 | 1.000 |
| | test | 0.899 | 0.909 | 0.980 |
| LS2 | train | 0.907 | 0.923 | 1.000 |
| | test | 0.899 | 0.918 | 0.980 |
| MIORE | train | 1.000 | 0.928 | 1.000 |
| | test | 1.000 | 0.929 | 1.000 |

The methods all performed similarly according to the classification measure M3. MIORE had a significant advantage with respect to M2, no matter the definition we used for top of the list (top 60 in Table 1, top 45 in Table 2, or top 25 in Table 3). For M1, MIORE performed substantially better than the others, and its advantage over the other methods was more pronounced as the evaluation measure concentrated more on the top of the list. One can see this by comparing the M1 column in Tables 1, 2, and 3. In Table 3, MIORE performed better than the other methods by 10.3 % on training and 11.3 % on testing. Using exact optimization rather than approximations, the MIORE method was able to find solutions that none of the other methods could find. This study demonstrates the potential of MIORE to substantially outperform other methods.
8 Experiments on rating data
For our main experiments, the dataset contains approximately a decade’s worth of rating data from a major rating company, compiled by an organization that is aiming to reverse-engineer the ranking model. The values for most of the factors are discretized versions of the true values, that is, they have been rounded to the nearest integer. The rating company periodically makes ratings for new products available, and our goal is to predict, with respect to the products that are already rated: where each new product is within the top-k (M1), where it is in the full list, even if not in the top-k (M2), and whether each new product falls within the top-k (M3). We generate a scoring function for one category, “Category A,” regularizing with data from “Category B.” Category A has eight subcategories with a current total of 209 products, and Category B has eight subcategories with a total of 212 products. There are 19 factors.
This dataset is small and thus challenging to deal with from a machine learning perspective. The small size of the training sets causes problems with accurate reverse-engineering. The small size of the test sets causes problems with evaluating generalization ability. That is, for all algorithms, the variance of the test evaluation measures is high compared to the difference in training performance, so it is difficult to evaluate which algorithm is better in a robust way. The worst performing algorithm in training sometimes has the best test performance, and vice versa. What we aim to determine is whether MIORE has consistently good performance, as compared with other algorithms that sometimes perform very poorly.
8.1 Experimental setup
 1. For each set of parameters, perform three-fold cross-validation using the first three folds as follows:
    a. Train using folds 1 and 2, and Category B, and validate using fold 3. Compute M1, M2, and M3 for training and validation.
    b. Train using folds 1 and 3, and Category B, and validate using fold 2. Compute M1, M2, and M3 for training and validation.
    c. Train using folds 2 and 3, and Category B, and validate using fold 1. Compute M1, M2, and M3 for training and validation.
    d. Compute the average over the three folds of the training and validation values for each of M1, M2, and M3.
    Note that when we compute M1, M2, and M3 on validation data, this also takes into account the training data, as in Sect. 5.
 2. Sum the three average validation measures, and choose the parameters corresponding to the largest sum.
 3. Train using folds 1, 2, and 3, and Category B, together with the parameters chosen in the previous step, and test using fold 4. Compute M1, M2, and M3 for training and testing.
 4. Repeat steps 1 through 3 using folds 1, 2, and 4 for cross-validation and fold 3 for the final test set.
 5. Repeat steps 1 through 3 using folds 1, 3, and 4 for cross-validation and fold 2 for the final test set.
 6. Repeat steps 1 through 3 using folds 2, 3, and 4 for cross-validation and fold 1 for the final test set.
Parameter values tested for each algorithm

| Algorithm | Parameter 1 | Parameter 2 |
|---|---|---|
| LS1 | C=0, 0.1, or 0.2 | |
| LS2 | C=0, 0.1, or 0.2 | |
| LS3 | C=0, 0.025, or 0.05 | |
| \(\ell_{1}\)RE | C=0 or 0.1 | \(C_{\text{high}}=0\) |
| \(\ell_{2}\)RE | C=0 or 0.1 | \(C_{\text{high}}=0\), 0.5, or 1 |
| MIORE | C=0 or 0.5 | θ=0 or 9 |

In total, for the cross-validation step, there were 6×3=18 problems to solve for each of LS1, LS2, and LS3; 6×2=12 problems for \(\ell_{1}\)RE; 6×2×3=36 problems for \(\ell_{2}\)RE; and 6×2×2=24 problems for MIORE. (For each method, the total number of problems was the number of different parameter settings times six, which is the number of ways to choose two out of four folds for training.) For the test step, there were an additional four problems for each method. This set of experiments required approximately 163 hours of computation time.
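The fold bookkeeping and the parameter selection rule described above can be sketched as follows. This is an illustrative outline rather than the experimental code: the method keys such as `l1RE` are plain-text stand-ins, and the validation values passed to `select_parameters` are hypothetical.

```python
FOLDS = (1, 2, 3, 4)

def cv_training_sets():
    """For each final test fold, list the training-fold pairs used in
    three-fold cross-validation over the remaining three folds."""
    plan = {}
    for test_fold in FOLDS:
        cv_folds = [f for f in FOLDS if f != test_fold]
        plan[test_fold] = [tuple(f for f in cv_folds if f != val)
                           for val in cv_folds]
    return plan

def select_parameters(validation):
    """validation maps a parameter setting to a list of (M1, M2, M3)
    tuples, one per validation fold. Pick the setting with the largest
    sum of the three fold-averaged measures."""
    def summed_average(rows):
        return sum(sum(col) / len(col) for col in zip(*rows))
    return max(validation, key=lambda p: summed_average(validation[p]))

plan = cv_training_sets()
unique_pairs = {pair for pairs in plan.values() for pair in pairs}

# Number of parameter settings per method, read off the parameter table.
param_settings = {"LS1": 3, "LS2": 3, "LS3": 3,
                  "l1RE": 2 * 1, "l2RE": 2 * 3, "MIORE": 2 * 2}
cv_problems = {m: len(unique_pairs) * k for m, k in param_settings.items()}
print(cv_problems)

# Hypothetical validation measures for two MIORE settings (C, theta).
example = {(0, 0): [(0.80, 0.85, 0.90)] * 3, (0.5, 9): [(0.82, 0.86, 0.90)] * 3}
print(select_parameters(example))
```

Because the same pair of training folds recurs across rounds, only six distinct training problems arise per parameter setting, which matches the counts quoted in the text.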
8.2 Results
 1. Let \(\text{M1}_{m}\) be the value of M1 for method m, where m is either LS1, LS2, LS3, \(\ell_{1}\)RE, \(\ell_{2}\)RE, or MIORE. Note that these are the M1 values from training on folds 1, 2, and 3.
 2. Let \(\text{M1}_{\text{min}}\) be the minimum of the six \(\text{M1}_{m}\) values.
 3. The bar height for method m is the percentage increase of \(\text{M1}_{m}\) from \(\text{M1}_{\text{min}}\):
$$ \frac{\text{M1}_{m}-\text{M1}_{\text{min}}}{\text{M1}_{\text{min}}}. $$
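A small sketch of this bar-height computation is given below. The input values are only sample inputs (they happen to be the four-round average training M1 values, whereas the steps above use the values from a single round), and the method keys are plain-text stand-ins.

```python
def bar_heights(m1):
    """Relative increase of each method's M1 over the minimum M1."""
    m1_min = min(m1.values())
    return {method: (value - m1_min) / m1_min for method, value in m1.items()}

# Sample inputs for illustration only.
sample_m1 = {"LS1": 0.767, "LS2": 0.792, "LS3": 0.752,
             "l1RE": 0.797, "l2RE": 0.792, "MIORE": 0.836}
heights = bar_heights(sample_m1)
for method, h in heights.items():
    print(f"{method}: {100 * h:.1f}%")
```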
Average of M1 metric over four rounds for each algorithm

| Algorithm | M1 (train) | M1 (test) |
|---|---|---|
| LS1 | 0.767 | 0.794 |
| LS2 | 0.792 | 0.798 |
| LS3 | 0.752 | 0.811 |
| \(\ell_{1}\)RE | 0.797 | 0.820 |
| \(\ell_{2}\)RE | 0.792 | 0.814 |
| MIORE | 0.836 | 0.840 |

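Sums of ranks over four rounds for each algorithm

| Data | Measure | LS3 | LS1 | LS2 | \(\ell_{2}\)RE | \(\ell_{1}\)RE | MIORE |
|---|---|---|---|---|---|---|---|
| Train | M1 | 4 | 3 | 9 | 9 | 11 | 17 |
| | M2 | 0 | 8 | 4 | 13 | 13 | 20 |
| | M3 | 0 | 8 | 8 | 7 | 8 | 18 |
| | Total | 4 | 19 | 21 | 29 | 32 | 55 |
| Test | M1 | 7 | 6 | 5 | 7 | 8 | 15 |
| | M2 | 0 | 14 | 8 | 11 | 18 | 8 |
| | M3 | 0 | 5 | 11 | 8 | 3 | 8 |
| | Total | 7 | 25 | 24 | 26 | 29 | 31 |
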
Note that LS1 has an inherent advantage over the other five methods in that it uses information—namely the true scores—that is not available to the other methods that use only the ranks. As discussed earlier, in many cases the true scores may not be available if the rating company does not provide them. Even if the scores are available, our experiment demonstrates that it is possible for methods that encode only the ranks, such as MIORE, to have comparable or better performance than methods that directly use the scores. For example, in all but the third round of our experiment, it appears that there was a particularly good solution that none of the approximate methods found, but that MIORE did, similar to the results in Sect. 7. This is the major advantage of exactly optimizing the objective function rather than using a convex proxy.
8.3 Example of differences between methods on evaluation measures
Table 7: Example of ranked lists produced by different algorithms, corresponding to metrics in Table 8

| Subcategory | True | MIORE | LS3 |
|---|---|---|---|
| LakeCounty | Brassfield | Brassfield | Wildhurst |
| | Langtry | Langtry | Langtry |
| | Wildhurst | Wildhurst | Brassfield |
| NorthCoast | Alpen | Alpen | Alpen |
| | Fieldbrook | Fieldbrook | Fieldbrook |
| | Winnett | Winnett | Winnett |
| SouthCali | Faulkner | Lenora | Lenora |
| | Lenora | Faulkner | Faulkner |
| | Peralta | Peralta | Peralta |
| | Salerno | Salerno | Thompkin |
| | Thompkin | Thompkin | Salerno |
| Mendocino | Baxter | Navarro | Navarro |
| | Goldeneye | Baxter | Baxter |
| | Navarro | Goldeneye | Goldeneye |
| | Skylark | Skylark | Skylark |
| CentralCoast | Blackstone | Blackstone | Morgan |
| | Estancia | Estancia | Blackstone |
| | Jenkins | Morgan | Ronan |
| | Morgan | Parsonage | Estancia |
| | Newell | Newell | Ventana |
| | Parsonage | Jenkins | Jenkins |
| | Ronan | Ronan | Newell |
| | Ventana | Ventana | Parsonage |
| CentralVal | Accardi | Accardi | Accardi |
| | Baywood | Baywood | Mariposa |
| | Cantiga | Mariposa | Trimble |
| | Harmony | Cantiga | Harmony |
| | Mariposa | Omega | Cantiga |
| | Omega | Watts | Omega |
| | Trimble | Harmony | Watts |
| | Watts | Trimble | Baywood |
| SierraFoot | Auriga | Auriga | Auriga |
| | Chevalier | Chevalier | Paravi |
| | Dillian | Paravi | Chevalier |
| | Fitzpatrick | Dillian | Solomon |
| | Hatcher | Fitzpatrick | Oakstone |
| | Montevina | Hatcher | Hatcher |
| | Oakstone | Montevina | Fitzpatrick |
| | Paravi | Oakstone | Dillian |
| | Renwood | Solomon | Renwood |
| | Solomon | Renwood | Montevina |
| | Venezio | Venezio | Venezio |
| NapaValley | Carter | Falcor | Falcor |
| | Falcor | Carter | Carter |
| | Ilsley | Ilsley | Kelham |
| | Kelham | Kelham | Ilsley |
| | Mason | Mason | Mason |
| | Oberon | Oberon | Oberon |
| | Quintessa | Relic | Quintessa |
| | Relic | Quintessa | Trefethen |
| | Sawyer | Sawyer | Relic |
| | Trefethen | Varozza | Sawyer |
| | Varozza | Trefethen | Varozza |

Table 8: Comparison of MIORE and LS3 (train on folds 2, 3, and 4; test on fold 1), corresponding to ranked lists in Table 7

| Algorithm | M1 | M2 | M3 |
|---|---|---|---|
| MIORE | 0.967 | 0.904 | 0.887 |
| LS3 | 0.867 | 0.796 | 0.868 |

9 Determining a cost-effective way to achieve top rankings
9.1 Two formulations
We directly provide the formulations for, first, achieving a cost-effective increase in score, and, second, minimizing cost for a fixed target score.
9.1.1 Maximizing score on a fixed budget
9.1.2 Minimizing cost with a fixed target score
By solving the first formulation for a range of budgets, or by solving the second formulation for a range of target scores, we can map out an efficient frontier of maximum score for minimum cost. This concept is best explained through an example, which we present in the next section.
9.2 Practical example
Point-and-shoot digital camera factors

| Index | Factor |
|---|---|
| 1 | Resolution |
| 2 | Weight |
| 3 | Photo quality |
| 4 | Video quality |
| 5 | Response time |
| 6 | Handling shake |
| 7 | Versatility |
| 8 | LCD quality |
| 9 | Widest angle |
| 10 | Battery life |

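Coefficients of scoring function for digital cameras

| Coefficient | Value |
|---|---|
| \(w_{1}\) | 0.584 |
| \(w_{2}\) | −0.571 |
| \(w_{3}\) | 4.342 |
| \(w_{4}\) | 2.926 |
| \(w_{5}\) | 3.769 |
| \(w_{6}\) | 1.137 |
| \(w_{7}\) | 1.442 |
| \(w_{8}\) | 2.896 |
| \(w_{9}\) | 0.005 |
| \(w_{10}\) | 0.001 |
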
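Scores of two example cameras

| Camera | \(x_{1}\) | \(x_{2}\) | \(x_{3}\) | \(x_{4}\) | \(x_{5}\) | \(x_{6}\) | \(x_{7}\) | \(x_{8}\) | \(x_{9}\) | \(x_{10}\) | Score |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14 | 5 | 5 | 5 | 5 | 5 | 5 | 5 | 35 | 500 | 88.38 |
| 2 | 12 | 5 | 4 | 4 | 4 | 3 | 4 | 4 | 30 | 300 | 69.41 |
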
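As a quick check of how the linear scoring function is applied, the sketch below recomputes \(w^{T}x\) for the two example cameras from the tabulated coefficients and factor values. It is only an illustration. The recomputed values (about 88.56 and 69.51) differ slightly from the tabulated scores, which would be consistent with the printed coefficients being rounded to three decimals.

```python
import numpy as np

# Coefficients w_1..w_10 as printed in the coefficient table (rounded values).
w = np.array([0.584, -0.571, 4.342, 2.926, 3.769,
              1.137, 1.442, 2.896, 0.005, 0.001])

# Factor values x_1..x_10 for the two example cameras.
cameras = {
    1: [14, 5, 5, 5, 5, 5, 5, 5, 35, 500],
    2: [12, 5, 4, 4, 4, 3, 4, 4, 30, 300],
}

scores = {cam: float(w @ np.array(x, dtype=float)) for cam, x in cameras.items()}
for cam, s in scores.items():
    print(f"camera {cam}: {s:.3f}")
```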
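Change information for a digital camera

| # | Change | \(\delta_{1}\) | \(\delta_{2}\) | \(\delta_{3}\) | \(\delta_{4}\) | \(\delta_{5}\) | \(\delta_{6}\) | \(\delta_{7}\) | \(\delta_{8}\) | \(\delta_{9}\) | \(\delta_{10}\) | Cost |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Larger battery | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 50 | 2 |
| 2 | Add 1 megapixel | 1 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 3 |
| 3 | Better LCD | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 0.5 | ⋅ | ⋅ | 4 |
| 4 | More modes | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 1 | ⋅ | ⋅ | ⋅ | 4 |
| 5 | Wider angle | ⋅ | ⋅ | 0.5 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 2 | ⋅ | 5 |
| 6 | Add 2 megapixels | 2 | ⋅ | 0.5 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 5 |
| 7 | Heavier material | ⋅ | 1 | ⋅ | ⋅ | ⋅ | 1 | ⋅ | ⋅ | ⋅ | ⋅ | 5 |
| 8 | Better video | ⋅ | ⋅ | ⋅ | 1 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 6 |
| 9 | Faster response | ⋅ | ⋅ | ⋅ | ⋅ | 0.5 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 6 |
| 10 | Better lens | ⋅ | ⋅ | 0.5 | 1 | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | ⋅ | 7 |
| 11 | Fastest response | ⋅ | ⋅ | ⋅ | ⋅ | 0.5 | 1 | ⋅ | ⋅ | ⋅ | ⋅ | 7 |
| 12 | Most modes | ⋅ | ⋅ | 1 | ⋅ | 0.5 | ⋅ | 1 | ⋅ | ⋅ | ⋅ | 9 |
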
Conflict sets (M=6)

| m | \(S_{m}\) |
|---|---|
| 1 | {2,6} |
| 2 | {5,6,10,12} |
| 3 | {8,10} |
| 4 | {9,11,12} |
| 5 | {7,11} |
| 6 | {4,12} |

Conflicts between changes

| # | Change | Conflicts |
|---|---|---|
| 1 | Larger battery | ⋅ |
| 2 | Add 1 megapixel | 6 |
| 3 | Better LCD | ⋅ |
| 4 | More modes | 12 |
| 5 | Wider angle | 6, 10, 12 |
| 6 | Add 2 megapixels | 2, 5, 10, 12 |
| 7 | Heavier material | 11 |
| 8 | Better video | 10 |
| 9 | Faster response | 11, 12 |
| 10 | Better lens | 5, 6, 8, 12 |
| 11 | Fastest response | 7, 9, 12 |
| 12 | Most modes | 4, 5, 6, 9, 10, 11 |

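The pairwise conflicts can be derived mechanically from the conflict sets, assuming that two changes conflict exactly when they appear together in at least one set \(S_{m}\). The following sketch, with a hypothetical helper name, illustrates this derivation for the twelve changes.

```python
# Conflict sets S_1..S_6, indexed by m, as listed in the conflict-set table.
conflict_sets = {
    1: {2, 6},
    2: {5, 6, 10, 12},
    3: {8, 10},
    4: {9, 11, 12},
    5: {7, 11},
    6: {4, 12},
}

def pairwise_conflicts(conflict_sets, n_changes=12):
    """Map each change to the set of changes it shares a conflict set with."""
    conflicts = {c: set() for c in range(1, n_changes + 1)}
    for members in conflict_sets.values():
        for c in members:
            conflicts[c] |= members - {c}
    return conflicts

conflicts = pairwise_conflicts(conflict_sets)
for change, others in conflicts.items():
    print(change, sorted(others))
```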
Lookup table for fixed budget

| Max cost | Optimal change(s) | Score diff | Actual cost |
|---|---|---|---|
| 2 | Larger battery | 0.030 | 2 |
| 3 | Add 1 megapix | 0.584 | 3 |
| 4 | Better LCD | 1.448 | 4 |
| 5 | Add 2 megapix | 3.339 | 5 |
| 6 | Better video | 2.926 | 6 |
| 7 | Better lens | 5.097 | 7 |
| 8 | Better lens | 5.097 | 7 |
| 9 | Most modes | 7.669 | 9 |
| 10 | Most modes | 7.669 | 9 |
| ⋮ | ⋮ | ⋮ | ⋮ |

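For a problem this small, the fixed-budget question can be explored by brute force instead of an integer optimization solver. The sketch below is not the paper's formulation: it assumes that score gains add up across the chosen changes, that at most one change may be taken from each conflict set, and it uses the rounded coefficients from the coefficient table.

```python
from itertools import combinations

W = [0.584, -0.571, 4.342, 2.926, 3.769, 1.137, 1.442, 2.896, 0.005, 0.001]
# change index -> (name, {factor index: delta}, cost)
CHANGES = {
    1: ("Larger battery", {10: 50}, 2),
    2: ("Add 1 megapixel", {1: 1}, 3),
    3: ("Better LCD", {8: 0.5}, 4),
    4: ("More modes", {7: 1}, 4),
    5: ("Wider angle", {3: 0.5, 9: 2}, 5),
    6: ("Add 2 megapixels", {1: 2, 3: 0.5}, 5),
    7: ("Heavier material", {2: 1, 6: 1}, 5),
    8: ("Better video", {4: 1}, 6),
    9: ("Faster response", {5: 0.5}, 6),
    10: ("Better lens", {3: 0.5, 4: 1}, 7),
    11: ("Fastest response", {5: 0.5, 6: 1}, 7),
    12: ("Most modes", {3: 1, 5: 0.5, 7: 1}, 9),
}
CONFLICT_SETS = [{2, 6}, {5, 6, 10, 12}, {8, 10}, {9, 11, 12}, {7, 11}, {4, 12}]

def gain(subset):
    """Score increase w^T delta, summed over the chosen changes."""
    return sum(W[j - 1] * d for c in subset for j, d in CHANGES[c][1].items())

def cost(subset):
    return sum(CHANGES[c][2] for c in subset)

def feasible(subset):
    """At most one chosen change per conflict set."""
    return all(len(s & set(subset)) <= 1 for s in CONFLICT_SETS)

def best_for_budget(budget):
    """Feasible subset with the largest gain; ties go to the cheaper subset."""
    candidates = [s for r in range(len(CHANGES) + 1)
                  for s in combinations(sorted(CHANGES), r)
                  if feasible(s) and cost(s) <= budget]
    return max(candidates, key=lambda s: (gain(s), -cost(s)))

for budget in (5, 7, 9):
    chosen = best_for_budget(budget)
    print(budget, [CHANGES[c][0] for c in chosen], round(gain(chosen), 3), cost(chosen))
```

With these rounded coefficients the sketch selects "Add 2 megapixels" for budgets 5 and 6, "Better lens" for budgets 7 and 8, and "Most modes" for budgets 9 and 10. Small numerical differences from the tabulated score differences (for example for the larger battery) are to be expected because the printed coefficients are rounded.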
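Lookup table for target score
| Min diff | Optimal change(s) | Cost | Actual diff |
|---|---|---|---|
| 1 | More modes | 4 | 1.442 |
| 1 | Better LCD | 4 | 1.448 |
| 2 | Wider angle | 5 | 2.182 |
| 2 | Add 2 megapix | 5 | 3.339 |
| 3 | Add 2 megapix | 5 | 3.339 |
| 4 | Better lens | 7 | 5.097 |
| 5 | Better lens | 7 | 5.097 |
| 6 | Most modes | 9 | 7.669 |
| 7 | Most modes | 9 | 7.669 |
| ⋮ | ⋮ | ⋮ | ⋮ |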
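The target-score problem is the mirror image of the fixed-budget one: find the cheapest conflict-free set of changes whose total score gain reaches a required minimum. The Python sketch below illustrates this by exhaustive enumeration, under the same assumptions as before: gains of compatible changes are taken to be additive, only the nine changes with gains listed in the lookup tables are included (changes 7, 9, and 11 are omitted), and the names `CHANGES`, `feasible`, and `cheapest_for_target` are hypothetical. Ties in cost are broken in favor of the larger gain, which is an arbitrary choice made for this sketch.

```python
from itertools import combinations

# change id -> (cost, score gain), taken from the change and lookup tables.
# Changes 7, 9 and 11 are omitted because their gains are not listed.
CHANGES = {
    1: (2, 0.030), 2: (3, 0.584), 3: (4, 1.448),
    4: (4, 1.442), 5: (5, 2.182), 6: (5, 3.339),
    8: (6, 2.926), 10: (7, 5.097), 12: (9, 7.669),
}
CONFLICT_SETS = [{2, 6}, {5, 6, 10, 12}, {8, 10}, {9, 11, 12}, {7, 11}, {4, 12}]


def feasible(subset):
    """True if the subset uses at most one change from every conflict set."""
    chosen = set(subset)
    return all(len(chosen & s) <= 1 for s in CONFLICT_SETS)


def cheapest_for_target(target):
    """Return (subset, cost, gain) of minimal cost with gain >= target.

    Returns None when no conflict-free subset reaches the target.
    """
    best = None
    ids = sorted(CHANGES)
    for r in range(1, len(ids) + 1):
        for subset in combinations(ids, r):
            if not feasible(subset):
                continue
            cost = sum(CHANGES[k][0] for k in subset)
            gain = sum(CHANGES[k][1] for k in subset)  # additivity assumed
            if gain < target:
                continue
            # Prefer lower cost; among equal costs prefer the larger gain.
            if best is None or (cost, -gain) < (best[1], -best[2]):
                best = (subset, cost, gain)
    return best


for target in (1, 3, 4, 6):
    print(target, cheapest_for_target(target))
```

For the targets shown, the minimal costs found by this sketch are 4, 5, 7, and 9, in line with the Cost column of the table above.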
10 Conclusion
We presented a machine learning approach to reverse-engineering ranking models, and an experiment on data from a rating company. The formulation encodes a specific preference structure and categorical organization of the products. Another contribution of our work is the introduction of evaluation measures that take into account the rank of a new product, relative to the products that have already been ranked. Finally, we showed how to use a reverse-engineered ranking model to achieve a high rank for a product in a cost-effective way.
This leads to many avenues for future work. For instance, it would be useful to develop an algorithm that solves the ranking problem while locating potential errors in the data. Another idea is to quantify the uncertainty in each of the coefficients in the reverse-engineered model.
http://www.bloomberg.com/apps/news?sid=ah839IWTLP9s&pid=newsarchive by Elliot Blair Smith, September 24, 2008.
All least-squares methods were implemented using R 2.8.1, and all ℓ_{p}RE methods were implemented using MATLAB 7.8.0, on a computer with an Intel Core 2 Duo 2 GHz processor with 1.98 GB of RAM. MIORE was implemented using ILOG AMPL 11.210 with the Gurobi 3.0.0 solver on a computer powered by two Intel quad core Xeon E5440 2.83 GHz processors with 32 GB of RAM. We always used ε=10^{−6} for MIORE.
Acknowledgements
This material is based upon work supported by the MIT-Ford Alliance and the National Science Foundation under Grant No. IIS-1053407. We would like to thank Dimitris Bertsimas from MIT, Brian Jahn and Larry Kummer from Ford, and Elaine Savage, John Leonard, and Ed Krause from the MIT-Ford Alliance.
Appendix
Tables 17, 18, 19, and 20 show the training and test values of M1, M2, and M3 in each of the four rounds of the experiment in Sect. 8. The highest training and test measures are highlighted in bold. The integer number next to each measure is the rank of the method, that is, the number of other methods below it for the particular measure and dataset (training or test). Note that 0 is the lowest possible rank by this definition.
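Training and test values of M1, M2, and M3 on ratings data, and ranks of algorithms (train on folds 1, 2, and 3; test on fold 4)
| Algorithm | Data | M1 | Rank | M2 | Rank | M3 | Rank |
|---|---|---|---|---|---|---|---|
| LS1, C=0 | train | 0.686 | 0 | 0.920 | 2 | 0.930 | 2 |
| LS1, C=0 | test | 0.714 | 0 | 0.922 | 5 | 0.865 | 3 |
| LS2, C=0.1 | train | 0.706 | 2 | 0.911 | 1 | 0.924 | 1 |
| LS2, C=0.1 | test | 0.762 | 2 | 0.911 | 1 | 0.885 | 5 |
| LS3, C=0.05 | train | 0.725 | 4 | 0.832 | 0 | 0.866 | 0 |
| LS3, C=0.05 | test | 0.738 | 1 | 0.797 | 0 | 0.827 | 0 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | train | 0.706 | 2 | 0.921 | 3 | 0.930 | 2 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | test | 0.762 | 2 | 0.919 | 4 | 0.846 | 1 |
| ℓ_{2}RE, C=0.1, C_{high}=1 | train | 0.686 | 0 | 0.921 | 3 | 0.930 | 2 |
| ℓ_{2}RE, C=0.1, C_{high}=1 | test | 0.762 | 2 | 0.918 | 3 | 0.865 | 3 |
| MIORE, C=0.5, θ=0 | train | 0.765 | 5 | 0.932 | 5 | 0.955 | 5 |
| MIORE, C=0.5, θ=0 | test | 0.786 | 5 | 0.916 | 2 | 0.846 | 1 |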
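Training and test values of M1, M2, and M3 on ratings data, and ranks of algorithms (train on folds 1, 2, and 4; test on fold 3)
| Algorithm | Data | M1 | Rank | M2 | Rank | M3 | Rank |
|---|---|---|---|---|---|---|---|
| LS1, C=0 | train | 0.833 | 1 | 0.925 | 2 | 0.904 | 2 |
| LS1, C=0 | test | 0.784 | 4 | 0.922 | 5 | 0.846 | 1 |
| LS2, C=0.2 | train | 0.833 | 1 | 0.917 | 1 | 0.892 | 1 |
| LS2, C=0.2 | test | 0.773 | 2 | 0.918 | 3 | 0.885 | 5 |
| LS3, C=0.05 | train | 0.792 | 0 | 0.850 | 0 | 0.879 | 0 |
| LS3, C=0.05 | test | 0.750 | 0 | 0.831 | 0 | 0.808 | 0 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | train | 0.854 | 3 | 0.930 | 4 | 0.904 | 2 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | test | 0.761 | 1 | 0.919 | 4 | 0.846 | 1 |
| ℓ_{2}RE, C=0, C_{high}=0 | train | 0.854 | 3 | 0.930 | 3 | 0.904 | 2 |
| ℓ_{2}RE, C=0, C_{high}=0 | test | 0.773 | 2 | 0.915 | 2 | 0.865 | 4 |
| MIORE, C=0.5, θ=0 | train | 0.875 | 5 | 0.937 | 5 | 0.917 | 5 |
| MIORE, C=0.5, θ=0 | test | 0.784 | 4 | 0.914 | 1 | 0.846 | 1 |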
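Training and test values of M1, M2, and M3 on ratings data, and ranks of algorithms (train on folds 1, 3, and 4; test on fold 2)
| Algorithm | Data | M1 | Rank | M2 | Rank | M3 | Rank |
|---|---|---|---|---|---|---|---|
| LS1, C=0.1 | train | 0.843 | 1 | 0.913 | 2 | 0.841 | 1 |
| LS1, C=0.1 | test | 0.778 | 0 | 0.925 | 2 | 0.942 | 1 |
| LS2, C=0 | train | 0.902 | 4 | 0.908 | 1 | 0.866 | 3 |
| LS2, C=0 | test | 0.822 | 1 | 0.929 | 3 | 0.942 | 1 |
| LS3, C=0.05 | train | 0.804 | 0 | 0.860 | 0 | 0.828 | 0 |
| LS3, C=0.05 | test | 0.889 | 5 | 0.896 | 0 | 0.904 | 0 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | train | 0.882 | 2 | 0.919 | 3 | 0.866 | 3 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | test | 0.822 | 1 | 0.933 | 5 | 0.942 | 1 |
| ℓ_{2}RE, C=0.1, C_{high}=1 | train | 0.902 | 4 | 0.920 | 4 | 0.854 | 2 |
| ℓ_{2}RE, C=0.1, C_{high}=1 | test | 0.822 | 1 | 0.931 | 4 | 0.942 | 1 |
| MIORE, C=0.5, θ=0 | train | 0.882 | 2 | 0.928 | 5 | 0.866 | 3 |
| MIORE, C=0.5, θ=0 | test | 0.822 | 1 | 0.923 | 1 | 0.942 | 1 |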
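Training and test values of M1, M2, and M3 on ratings data, and ranks of algorithms (train on folds 2, 3, and 4; test on fold 1)
| Algorithm | Data | M1 | Rank | M2 | Rank | M3 | Rank |
|---|---|---|---|---|---|---|---|
| LS1, C=0 | train | 0.706 | 1 | 0.932 | 2 | 0.929 | 3 |
| LS1, C=0 | test | 0.900 | 2 | 0.902 | 2 | 0.868 | 0 |
| LS2, C=0.2 | train | 0.725 | 2 | 0.925 | 1 | 0.929 | 3 |
| LS2, C=0.2 | test | 0.833 | 0 | 0.894 | 1 | 0.868 | 0 |
| LS3, C=0.05 | train | 0.686 | 0 | 0.839 | 0 | 0.878 | 0 |
| LS3, C=0.05 | test | 0.867 | 1 | 0.796 | 0 | 0.868 | 0 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | train | 0.745 | 4 | 0.933 | 3 | 0.917 | 1 |
| ℓ_{1}RE, C=0.1, C_{high}=0 | test | 0.933 | 4 | 0.906 | 5 | 0.868 | 0 |
| ℓ_{2}RE, C=0.1, C_{high}=0.5 | train | 0.725 | 2 | 0.933 | 3 | 0.917 | 1 |
| ℓ_{2}RE, C=0.1, C_{high}=0.5 | test | 0.900 | 2 | 0.902 | 2 | 0.868 | 0 |
| MIORE, C=0.5, θ=0 | train | 0.824 | 5 | 0.944 | 5 | 0.942 | 5 |
| MIORE, C=0.5, θ=0 | test | 0.967 | 5 | 0.904 | 4 | 0.887 | 5 |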