How to reverse-engineer quality rankings

Chang, Allison; Rudin, Cynthia; Cavaretta, Michael; Thomas, Robert; Chou, Gloria

doi:10.1007/s10994-012-5295-6

How to reverse-engineer quality rankings

Published: 03 June 2012

Volume 88, pages 369–398, (2012)
Cite this article

Download PDF

Machine Learning Aims and scope Submit manuscript

How to reverse-engineer quality rankings

Download PDF

Allison Chang¹,
Cynthia Rudin²,
Michael Cavaretta³,
Robert Thomas³ &
…
Gloria Chou³

3150 Accesses
7 Citations
Explore all metrics

Abstract

A good or bad product quality rating can make or break an organization. However, the notion of “quality” is often defined by an independent rating company that does not make the formula for determining the rank of a product publicly available. In order to invest wisely in product development, organizations are starting to use intelligent approaches for determining how funding for product development should be allocated. A critical step in this process is to “reverse-engineer” a rating company’s proprietary model as closely as possible. In this work, we provide a machine learning approach for this task, which optimizes a certain rank statistic that encodes preference information specific to quality rating data. We present experiments on data from a major quality rating company, and provide new methods for evaluating the solution. In addition, we provide an approach to use the reverse-engineered model to achieve a top ranked product in a cost-effective way.

How artificial intelligence will change the future of marketing

Article Open access 10 October 2019

Artificial intelligence in recommender systems

Article Open access 01 November 2020

Introduction to Design Science Research

1 Introduction

Many organizations depend on the top ratings given to their products or services by quality rating companies. For instance, the reputations of undergraduate and graduate programs at colleges and universities depend heavily on their U.S. News and World Report rankings. Similarly, the mortgage industry hinges on the models of credit rating agencies like Standard & Poor’s, Moody’s, Dun & Bradstreet, and Fitch Ratings. Mutual funds rely on Morningstar and Lipper ratings. For electronics, rating companies include CNET and PCMag; and for vehicles, they include What Car? J.D. Power, Edmunds, Kelley Blue Book, and Car and Driver. Most of these rating companies use a formula to score products, and few of them make their complete rating formulas public. Moreover, the exact values of the input data to the formula are also often kept confidential. If organizations were able to recreate the formulas for quality rating models, they would better understand the standards by which their products were being judged, which would potentially allow them to produce better products. Furthermore, rating companies that are aware of reverse-engineering may be motivated to re-evaluate the accuracy and fairness of their formulas in representing the quality of products.

In this work, we introduce a method for reverse-engineering product ranking models, and apply it to over a decade’s worth of data from a major quality rating company. Our method integrates knowledge about the way many such models are commonly constructed, which can be summarized as follows:

Point 1 (Linear scoring functions): The rating company states publicly that its product rankings are based on real-valued scores given to each product, and that the score is a weighted linear combination of a known set of factors. The precise values for some factors can be obtained directly, but other factors have been discretized into a number of “stars” between 1 and 5 and are thus noisy versions of the true values. For example, the National Highway Traffic Safety Administration discretizes factors pertaining to vehicle safety ratings.
Point 2 (Category structure): Products are organized into categories, and within each category there are one or more subcategories. For example, a computer rating company may have a laptop category with subcategories such as netbooks and tablets. Products within a category share the same scoring system, but the ranking of each product is with respect to its subcategory.
Point 3 (Ranks over scores): The scores themselves are not as meaningful as the ranks, since consumers pay more attention to product rankings than to scores or to differences in score. Moreover, sometimes the scores are not available at all, and only the ranks are available.
Point 4 (Focus on top products): Consumers generally focus on top-ranked products, so a model that can reproduce the top of each subcategory’s ranked list accurately is more valuable than one that better reproduces the middle or bottom of the list.

Reverse-engineering product quality rankings is a new application for machine learning, and the algorithm we provide for this task matches the application in conforming to the points above. We use linear combinations of the same factors used by the rating company, and generate a separate model for each category, in accordance with Points 1 and 2. The reverse-engineered model for a given category is provided by a supervised ranking algorithm that uses discrete optimization to force the ranks produced by our algorithm to be similar to the ranks from the rating company; note that the algorithm reproduces the ranks, not the scores, as in Point 3. Specifically, the model is constructed to obey certain preference relationships in accordance with Point 4, that is, within each subcategory, the rankings of the rating companies’ top-k products should match the top-k rankings from our model. When there are not enough data within a category to completely determine the ranking model for that category, our algorithm draws strength across categories, by using data from other categories as a type of regularization. Our experimental results on product quality ratings data indicate an advantage in sharing information across product categories, modeling ranks rather than scores, and using discrete optimization to maximize the exact rank statistic of interest rather than a convex proxy as is typical of conventional machine learning methods.

Note that even though Point 1 makes the assumption of known factors, it is also possible to use our method for problems in which the factors are unknown. As long as the factors in our model encompass the information used for the rating system, our algorithm can be applied regardless of whether or not the factors are precisely the same as those used by the rating company. For instance, a camera expert might know all of the potential camera characteristics that could contribute to camera quality, which we could then use as the factors in our model.

After the model has been reverse-engineered, we can use it to determine the most cost-effective way to increase product rankings, and we present discrete optimization algorithms for this task. These algorithms can be used independently of the reverse-engineering method. That is, if the reverse-engineered formula were obtained using a different method from ours, or if the formula were made public, we could still use these algorithms to cost-effectively increase a product’s rank.

We describe related work in Sect. 2. In Sect. 3, we derive a ranking quality objective that encodes the preference relationships discussed above. In Sect. 4 we provide the machine learning algorithm, based on discrete optimization, that exactly maximizes the ranking quality objective. In Sect. 5, we establish new measures that can be used to evaluate the performance of our model. In Sect. 6, we derive several baseline algorithms for reverse-engineering, all involving convex optimization. Section 7 contains results from a proof-of-concept experiment, and Sect. 8 provides experimental results using rating data from a major quality rating company. Section 9 discusses the separate problem of how to cost-effectively increase the rank of a product. Finally, we conclude in Sect. 10. The main contributions of the paper are: the application of machine learning to reverse-engineering product quality rankings; our method of encoding the preference relationships in accordance with Points 1 through 4 above; using data from other product categories as regularization; the design of novel evaluation measures; and the mechanism to cost-effectively achieve a highly ranked product.

2 Related work

Reverse-engineering and approximation of rating models has been done in a number of industries, albeit not applied to rankings for consumer products with the category/subcategory structure. The related work we have found is published mostly within blogs. These works deal mostly with the problem of approximating the ranking function with a smaller number of variables, rather than using the exact factors in the rating company’s formula. For instance, Chandler (2006) approximated the U.S. News and World Report Law School rankings using symbolic regression to obtain a formula with four factors, and another with seven factors; currently the formula for the law school rankings is completely public and based on survey results, but the approximated versions are much simpler. In the sports industry, there has been some work in reverse-engineering Elias Sports Bureau rankings, which are used to determine compensation for free agents (Bajek 2008). The search engine optimization (SEO) industry aims to be able to boost the search engine rank of a web page by figuring out which features have high influence in the ranking algorithm. For instance, Su et al. (2010) used a linear optimization model to approximate Google web page rankings. As a final example, Hammer et al. (2007) approximated credit rating models using Logical Analysis of Data. As far as we know, our work is the first to present a specialized machine learning algorithm to reverse-engineer product ratings.

If the ratings are accurate measures of quality, then making the ratings more transparent could have a uniformly positive impact: it would help companies to make better rated products, it would help consumers to have these higher quality products, and it would encourage rating companies to receive feedback as to whether their rating systems fairly represent quality. If the ratings are not accurate measures of quality, many problems could arise. Unethical manipulation of reverse-engineered credit rating models heavily contributed to the 2007–2010 financial crisis (Morgenson and Story 2010). These ratings permitted some companies to sell “junk bonds” with very high ratings. Rating companies were blamed for “performing the alchemy that converted the securities from F-rated to A-rated.”^{Footnote 1}

Rating systems can also be arbitrary—even some well-established, heavily trusted rating systems can be inconsistent from product to product. There has been some controversy also over the Motion Picture Association of America movie rating system, discussed in the documentary “This Film Is Not Yet Rated.”^{Footnote 2} The MPAA rating system sorts movies into categories based on how appropriate they are for certain audiences. The documentary demonstrates that the rating system was inconsistent between different types of films, and that the MPAA directly lied to the public regarding the way these ratings are constructed. This can be difficult for movie makers, whose profits may depend on getting an “R” rating rather than an “NC-17” rating, and it also causes problems for moviegoers, who want to know whether the movie is suitable for them.

Our reverse-engineering problem could potentially be useful in the area of conjoint analysis in marketing (Green et al. 2001). Conjoint analysts aim to model how a consumer chooses one brand over another, with the goal of learning which product characteristics are most important to consumers.

We have considered the reverse-engineering task as a problem of supervised ranking. Supervised ranking originated to handle problems that occur mainly in the information retrieval domain (see, for instance, the LETOR compilation of works^{Footnote 3}). The vast majority of work on supervised ranking considers problems that are specific to information retrieval (e.g., Cao et al. 2007; Matveeva et al. 2006; Lafferty and Zhai 2001; Li et al. 2007) or give insight into how to approximately solve versions of extremely large ranking problems quickly (Tsochantaridis et al. 2005; Freund et al. 2003; Cossock and Zhang 2006; Joachims 2002; Burges et al. 2006; Xu et al. 2008; Le and Smola 2007; Ferri et al. 2002; Ataman et al. 2006). For the task of reverse-engineering ranking models, fast computational speed is not essential, and the extra time needed to compute a better solution is worthwhile. This, coupled with the fact that the size of the dataset is not extremely large, permits us to use mixed-integer optimization (MIO). MIO preserves our encoding of exactly the desired preference structure, where we have incorporated membership into categories and subcategories. If we remove regularization and do not concentrate on the top ranks, then the problem is a generalization of Area Under the Curve (AUC) maximization (Freund et al. 2003; Joachims 2002). Most works on AUC maximization use a smoothed approximation of the 0-1 loss within the AUC. If we were to use a smoothed approximation for the reverse-engineering problem, it is possible that the algorithm would miss the best solutions to the 0-1 optimization problem. The ℓ _pRE relaxation algorithm we introduce in Sect. 6 is one such approximation. The work of Bertsimas et al. (2010, 2011) also discusses in depth the benefits of exact solutions over relaxations.

Clearly, reverse-engineered ranking models can affect design decisions in a variety of applications. To the best of our knowledge, our work is the first to show the most cost-effective way to increase the rank of a new product.

3 Encoding preferences for quality ranking data

We derive a specialized rank statistic that serves as our objective for reverse-engineering. Maximizing this objective yields estimates of the weights on each of the factors in the rating company’s model. Our starting point is the case of one category with one subcategory, that is, there is only a single ranked list. Then, we generalize this statistic to handle multiple categories and subcategories. Our method can be used to reverse-engineer quality rankings whether or not the underlying scores are made available; we need only to know the ranks.

3.1 One category, one subcategory

Let n denote the number of products to be ranked. We represent product i by a vector of d factors $x_{i}\in\mathcal{X}$, where $\mathcal{X}\subset\mathcal{R}^{d}$. The rating company gives a score $\zeta_{i}\in\mathcal{R}$ to each product i, which translates into a rank. Higher scores imply higher ranks, so that a product with rank 0 is at the bottom of the list with the lowest quality. For all pairs of products, let the preference function $\pi:\mathcal{X}\times\mathcal{X}\rightarrow\{0,1\}$ capture the true pairwise preferences according to the scores ζ _i. That is, let:

$$ \pi(x_i,x_k):=\pi_{ik} := \mathbf{1}_{[\zeta_i>\zeta_k]}, $$

where 1 _q is the indicator function that equals 1 if condition q holds and 0 otherwise. In other words, if product i is ranked higher than product k by the rating company, then π _ik is 1. Even if the ζ _i’s are not available, we can derive the π _ik’s because we know which products are ranked higher than which other products. Our goal is to generate a scoring function $f:\mathcal{X}\rightarrow\mathcal{R}$ that assigns real-valued scores f(x _i) to each product x _i such that the π _ik values match as closely as possible with our model preferences $\mathbf{1}_{[f(x_{i})>f(x_{k})]}$.

Let $\varPi= \sum_{i=1}^{n}\sum_{k=1}^{n}\pi_{ik}$. We first consider a rank statistic that generalizes the area under the ROC curve (AUC):

$$ \mathrm{AUC}_{\pi}(f):=\frac{1}{\varPi}\sum _{i=1}^n\sum_{k=1}^n \pi_{ik}\mathbf{1}_{[f(x_i)>f(x_k)]}. $$

(1)

This statistic is related to the disagreement measure introduced by Freund et al. (2003), as well as Kendall’s τ coefficient (Kendall 1938). That is, in the absence of ties, the disagreement measure is 1−AUC_π(f) and Kendall’s τ is 2AUC_π(f)−1. The highest possible value of AUC_π(f) is 1, which is achieved if the scoring function f satisfies f(x _i)>f(x _k) for all pairs (x _i,x _k) such that π _ik=1.

AUC_π(f) does not put any emphasis on the top of the ranked list; a product at the bottom of the ranked list can contribute the same amount to AUC_π(f) as a product at the top. However, as noted in Point 4 in Sect. 1, it is often more important to accurately reproduce rankings at the top of the list than in the middle or at the bottom. Suppose we want to concentrate on the top $\bar{T}$ products within the subcategory. In particular, we want to weigh the top $\bar{T}$ products 1+θ times more than the rest of the list, where θ≥0. To do this, we first define the rank of product i, with respect to scoring function f, to be the number of products it is scored strictly above:

$$ \mathrm{rank}_f(x_i) := \sum _{k=1}^n \mathbf{1}_{[f(x_i) > f(x_k)]}. $$

The top $\bar{T}$ products have rank at least $T:=n-\bar{T}$. For example, if n=10, then assuming no ties in rank, the top $\bar{T}=4$ products have ranks at least T=6, that is, their ranks are 6, 7, 8, and 9. We consider the objective function:

$$ \text{AUC}_{\pi}^{\text{top}}(f) := \frac{1}{\varPi(\theta)}\sum _{i=1}^n\sum_{k=1}^n \pi_{ik}\mathbf{1}_{[f(x_i) > f(x_k)]} (1+\theta\mathbf{1}_{[\mathrm{rank}_f(x_i)\geq T]} ), $$

where we normalize by

$$ \varPi(\theta)=\sum_{i=1}^n\sum _{k=1}^n\pi_{ik} (1+\theta \mathbf{1}_{[\sum_{k=1}^n \pi_{ik}\geq T]} ). $$

Note that $\text{AUC}_{\pi}^{\text{top}}(f)$ varies between 0 and 1 since the largest possible value of the summation in $\text{AUC}_{\pi}^{\text{top}}(f)$ is Π(θ), which is achieved if f ranks all pairs (x _i,x _k) correctly. Each pair of products (x _i,x _k) contributes $\frac{1}{\varPi(\theta)}\pi_{ik}\mathbf{1}_{[f(x_{i}) > f(x_{k})]}(1+\theta)$ to the objective if the rank of x _i is at least T, and contributes $\frac{1}{\varPi(\theta)}\pi_{ik}\mathbf{1}_{[f(x_{i}) > f(x_{k})]}$ otherwise. If either θ=0 or T=0, then maximizing this objective is equivalent to maximizing AUC_π(f), which does not focus at the top.

3.2 Multiple categories and subcategories

We assume from Sect. 1 that different categories have different ranking models. Even so, these models may be similar enough that knowledge obtained from other categories can be used to “borrow strength” when there are limited data in the category of interest. Thus, as we derive the objective for reverse-engineering the model f for one prespecified category, we use data from all of its subcategories as well as from the subcategories in other categories.

Let $S_{\text{sub}}$ be the set of all subcategories across all categories, including the category of interest, and let there be n _s products in subcategory s. Similar to our previous notation, $x_{i}^{s}\in\mathcal{R}^{d}$ represents product i in subcategory s, $\zeta_{i}^{s}\in\mathcal{R}$ is the score assigned to product i in subcategory s, and $\pi^{s}_{ik}$ is 1 if $\zeta_{i}^{s}>\zeta_{k}^{s}$ and is 0 otherwise. The threshold T _s defines the top of the list for subcategory s.

The general objective we will optimize is a weighted sum of the $\text{AUC}_{\pi}^{\text{top}}(f)$ values over all subcategories:

(2)

where

$$ \mathrm{rank}^s_f \bigl(x_i^s\bigr) = \sum_{k=1}^{n_s} \mathbf{1}_{[f(x_i^s) > f(x_k^s)]}. $$

(3)

The normalization constants are

$$ \varPi_s(\theta) = \sum _{r\in\text{cat}(s)}\sum_{i=1}^{n_r}\sum _{k=1}^{n_r} \pi^r_{ik} (1+\theta\mathbf{1}_{[\sum_{k=1}^{n_r}\pi ^r_{ik}\geq T_r]} ), $$

(4)

where cat(s) denotes the category to which subcategory s belongs. The values C _s determine how much influence each subcategory has on the model. It is logical in general for C _s to be the same for all subcategories within a certain category. If there is a sufficient number of rated products in the category of interest, relative to the total number d of factors, then we can train the model with only these data. In that case, we would set C _s=1 for subcategories within the category of interest and C _s=0 for subcategories in all other categories. On the other hand, if the number of products in the category of interest is too small to permit the model to generalize, then we can regularize by setting C _s∈(0,1] for subcategories of other categories, choosing the values of C _s by cross-validation or heuristics.

Note that Π _s(θ) is the same for all subcategories s within the same category, instead of being proportional to the size of the subcategory. This is because we want each pair of products within the same category to have the same influence on the objective function. Consider if the normalization constants were alternatively

$$ \varPi_s(\theta) = \sum_{i=1}^{n_s} \sum_{k=1}^{n_s} \pi^s_{ik} (1+\theta\mathbf{1}_{[\sum_{k=1}^{n_s}\pi ^s_{ik}\geq T_s]} ). $$

Then for a particular category, there may be subcategories with large values of Π _s(θ) and others with small values. But in this case, assuming C _s is the same for all subcategories in this category, a misranked pair lowers the objective by much more in the subcategories with small Π _s(θ) values than with large values (since Π _s(θ) is in the denominator). Thus, in some sense, normalizing this way puts more weight on accurately ranking within the smaller subcategories. To avoid this issue, we use (4) to normalize. Conventional ranking methods do not address the subcategory/category structure of our product ranking problem in this manner; in fact it can be difficult to take the normalization into account accurately if the learning algorithm is limited to convex optimization. We show in Sect. 4 how our algorithm incorporates, in an exact way, this form of normalization.

We assume a linear form for the model, in accordance with Point 1 in Sect. 1. That is, the scoring function has the form f(x)=w ^T x, so that $w\in\mathcal{R}^{d}$ is a vector of variables in our formulation, and the objective in (2) is a function of w. Note that we can capture relatively complex nonlinear rating systems using a linear model with nonlinear factors. For instance, we could introduce extra factors to accommodate “necessity” constraints, where products that do not have a certain property will always get a low score. To do this, we would add a binary factor to the model that is 1 if the product does not possess the property, and the learning algorithm should discover a large negative weight for that factor.

4 Optimization

We now provide an algorithm to reverse-engineer quality rankings that exactly maximizes (2). The algorithm is called MIO-RE—Mixed Integer Optimization for Reverse-Engineering, and expands on a technique due to Bertsimas et al. (2010, 2011) for supervised ranking in machine learning. In this work, the authors develop a type of approach with an advantage over other machine learning techniques in that it exactly optimizes the objective. This advantage is counterbalanced by a sacrifice in computational speed, but for the rating problem, new data come out occasionally (e.g., yearly, monthly, weekly) whereas the computation time is generally on the order of hours, depending on the number of products in the training data and the number of factors. In this case, the extra computation time needed to produce a better solution is worthwhile.

In MIO, it is important to note that even though there are often various correct formulations to solve the same problem, not all valid formulations are equally strong. In fact, the ability to solve an MIO problem depends critically on the choice of formulation (see Bertsimas and Weismantel 2005, for details). This is not true of linear optimization, where a good formulation is simply one that correctly captures the model and is small in terms of the number of variables and constraints. In linear optimization, the choice of formulation is not crucial for solving a problem. However, when there are integer variables, it is typical to reformulate multiple times to achieve the best model. Essentially, a formulation is stronger if it cuts off extra unnecessary parts of the region of feasible solutions. Below we present a strong MIO formulation that we have found to work well empirically, and we discuss the logic behind its derivation.

Modern solvers typically produce a bound (upper for maximization problems, lower for minimization problems) as they search for better integer feasible solutions, and when the bound matches the objective value of an integer solution, the solution has reached provable optimality. However, it is common for a solver to find an optimal solution relatively quickly, but to take much longer in proving optimality, that is, in bringing the bound closer to the optimal objective value. See Bertsimas et al. (2011) for an introduction to MIO that discusses in particular the strength of a formulation and also the progress in MIO technology over the last few decades. Due to advances in both hardware and MIO algorithms, computational speed has been increasing exponentially, allowing us today to solve large scale MIO problems that would have been impossible only a few years ago. MIO will be progressively more powerful as this exponential trend continues.

In Sects. 7 and 8, our experimental results show that our MIO algorithm performs well on both training and test data. Considering generalization bounds from statistical learning theory, there are two ways to achieve better test performance: one is to decrease the training error, and the other is to decrease the complexity term to prevent overfitting. Using MIO, we can decrease the training error, and we control the complexity by using regularization across categories as shown in Sect. 3.2.

4.1 Model for reverse-engineering

The variable $v^{s}_{i}$ represents the model’s score $w^{T}x^{s}_{i}$ of product i in subcategory s, and the binary variable $z^{s}_{ik}$ captures the decision $\mathbf{1}_{[v^{s}_{i}>v^{s}_{k}]}$, as in (3). The strict inequality is numerically defined using a small positive constant ε, that is:

$$ z^s_{ik}=\mathbf{1}_{[v^s_i-v^s_k\geq\varepsilon]}. $$

(5)

Thus $\mathrm{rank}^{s}_{f}(x^{s}_{i}) = \sum_{k=1}^{n_{s}} z^{s}_{ik}$. To keep track of which products are in the top, we want the binary variable $t^{s}_{i}$ be 1 only if $\mathrm{rank}^{s}_{f}(x^{s}_{i})$ is at least T _s:

$$ t^s_i=\mathbf{1}_{[\sum_{k=1}^{n_s} z^s_{ik}\geq T_s]}. $$

(6)

Also, we want the binary variable $u^{s}_{ik}$ to be 1 only if both $v^{s}_{i}-v^{s}_{k}\geq\varepsilon$ and $\mathrm{rank}^{s}_{f}(x^{s}_{i})\geq T_{s}$, which is equivalent to:

$$ u^s_{ik} = \min\bigl \{z^s_{ik},t^s_i\bigr\}. $$

(7)

Using these decision variables, we can rewrite the objective (2) as

$$ \sum_{s\in S_{\text{sub}}} \frac{C_s}{\varPi_s(\theta)}\sum_{i=1}^{n_s}\sum _{k=1}^{n_s} \pi^s_{ik} \bigl(z^s_{ik}+\theta u^s_{ik}\bigr). $$

(8)

To capture (5) through (7), we use the following constraints:

(9)

(10)

(11)

(12)

If $v^{s}_{i}-v^{s}_{k}\geq\varepsilon$, then the right-hand side of (9) is at least 1, so the solver sets $z^{s}_{ik}=1$ because we are maximizing $z^{s}_{ik}$. Otherwise, the right-hand side is strictly less than 1, so the solver sets $z^{s}_{ik}=0$. Similarly, if $\sum_{k=1}^{n_{s}} z^{s}_{ik}\geq T_{s}$, then (10) implies $t^{s}_{i}=1$; note that since we are maximizing $u^{s}_{ik}$ in (8), we are also automatically maximizing $t^{s}_{i}$ because of (12). And if both $v^{s}_{i}-v^{s}_{k}\geq\varepsilon$ and $\mathrm{rank}^{s}_{f}(x^{s}_{i})\geq T_{s}$, then $z^{s}_{ik}=t^{s}_{i}=1$, so (11) and (12) imply $u^{s}_{ik}=1$. We do not need to explicitly specify $u^{s}_{ik}$ as a binary variable because $u^{s}_{ik}$ is the minimum of two binary variables; if either $z^{s}_{ik}$ or $t^{s}_{i}$ is 0, then $u^{s}_{ik}$ is 0, and otherwise it is 1. Here is the MIO formulation that maximizes (2):

(13)

We enforce that the weights $\{w_{j}\}_{j=1}^{d}$ are nonnegative, in accordance with our knowledge of how most quality ratings are constructed. If there is a case in which a factor is negatively correlated with rank, then we would simply use the negative of the factor, so that the corresponding weight would be positive. Also, if w ^∗ maximizes (2), then so does γw ^∗, for any constant γ>0; thus we can constrain each w _j to be in the interval [0,1] without loss of generality. The primary purpose of this constraint is to reduce the size of the region of feasible solutions, which is intended to speed up the computation. There is a single parameter ε>0 that the user specifies. Since increasing ε tends to increase runtimes, we choose ε to be just large enough to be recognized as nonzero by the solver.

After the optimization problem (13) is solved for our category of interest, we use the maximizing weights w ^∗ to determine the score f(x)=w ^∗T x of a new product x within the same category.

5 Evaluation metrics

In the case of our rating data, one goal is to predict, for instance, whether a new product that has not yet been rated would be among the top-k products that have already been rated. That is, the training data are included in the assessment of test performance. This type of evaluation is contrary to common machine learning practice in which evaluations on the training and test sets are separate, and thus it is not immediately clear how these evaluations should be performed.

In this section, we define three measures that are useful for supervised ranking problems in which test predictions are gauged relative to the training set. These measures are intuitive, and more closely represent how most industries would evaluate ranking quality than conventional rank statistics. The measures are first computed separately for each subcategory and then aggregated over the subcategories to produce a concise result. We focus on the top $\bar{T}_{s}$ products in subcategory s, and use the following notation, where f(x)=w ^T x is a given scoring function.

Note that ζ ^s and $f^{s}_{w}$ are computed from only the training data.

Measure 1: fraction of correctly ranked pairs among top of ranked list

This is the most useful and important of the three measures because it specifically captures ranking quality at the top of the list. Using the same notation as in (13), let π _ik=1 if $\zeta^{s}_{i}>\zeta^{s}_{k}$ and 0 otherwise, and z _ik=1 if w ^T x _i>w ^T x _k and 0 otherwise. The evaluation measures for the training and test data are:

The M1 metric does not require the actual values of the true scores $\zeta^{s}_{i}$; it suffices to know the pairwise preferences π _ik. Note that M1_test(s) is the fraction of correctly ranked pairs among both training and test products, excluding pairs for which both products are in the training set.

Measure 2: fraction of correctly ranked pairs over entire ranked list

This measure is similar to Measure 1, except that instead of considering only the top of the ranked list, it considers the entire list.

Note that M2_train is the same as AUC_π in (1).

Measure 3: fraction of correctly classified products

This evaluation metric is the fraction of products that are correctly classified in terms of being among the top of the list:

Although M3_test(s) measures quality on the test set, the values ζ ^s and $f^{s}_{w}$ depend on the true scores and model scores from the training set. If the true scores $\zeta^{s}_{i}$ are not available, then it suffices to know the rank of each product relative to the product in position $\bar{T}_{s}$ in the training set in order to compute this metric.

Aggregation of measures

To produce a single numerical evaluation for each of the three measures, we aggregate by taking a weighted sum of the measures over subcategories in a given category, where the weights are proportional to the sizes of the subcategories. The three evaluation measures defined above all have the form:

$$ \mathrm{M}(s) = \frac{\text{numer}(s)}{\text{denom}(s)}. $$

The version of evaluation measure M aggregated over subcategories for either the training or test set is:

$$ \mathrm{M} = \frac{\sum_s \text{numer}(s)}{\sum_s \text {denom}(s)} = \frac{\sum_s \text{denom}(s)\mathrm{M}(s)}{\sum_s \text{denom}(s)}. $$

6 Other methods for reverse-engineering

We developed several baseline algorithms for our experiments that also encode the points in the introduction. The first set of methods are based on least squares regression, and the second set are convex relaxations of the MIO method. These algorithms could be themselves useful, for instance, if a fast convex algorithm is required.

6.1 Least squares methods for reverse-engineering

The organization that provides our rating data currently uses a proprietary method to reverse-engineer the ranking model, the core of which is very similar to least squares regression on the scores. If the scores were not available—for instance, when working with data from a different rating company—the organization would conceivably use least squares regression on the ranks. Thus, our baselines are variations on least squares regression, minimizing:

$$ \sum_{s\in S_{\text{sub}}}\frac{C_s}{N_s}\sum _{i=1}^{n_s} \bigl(y^s_i - \bigl(w_0+w^Tx^s_i\bigr) \bigr)^2, $$

where N _s is the number of products in the category to which subcategory s belongs:

$$ N_s = \sum_{r\in\text{cat}(s)} n_r, $$

and $y^{s}_{i}$ can be one of three quantities:

1.
the true score $\zeta^{s}_{i}$ for product $x^{s}_{i}$ (method LS1),
2.
the rank over all training products, that is, the number of training products that are within subcategories r such that C _r>0 and are ranked strictly below $x^{s}_{i}$ according to the true scores $\zeta^{s}_{i}$ (method LS2),
3.
the rank within the subcategory, that is, the number of training products in the same subcategory as $x^{s}_{i}$ that are ranked strictly below $x^{s}_{i}$ according to the true scores $\zeta^{s}_{i}$ (method LS3).

6.2 The ℓ _p reverse-engineering algorithm

As another point of comparison, we introduce a new method called “ℓ _p Reverse-Engineering” (ℓ _pRE) that generalizes the P-Norm Push algorithm for supervised ranking, developed by Rudin (2009). This algorithm minimizes an objective with two terms, one that “pushes” low-quality products to the bottom of the list, and another that “pulls” high-quality products to the top. To derive this algorithm, we first consider the following loss function:

$$ \text{Loss}_{s,p,\text{low},0-1}(f) := \Biggl(\sum_{k=1}^{n_s} \Biggl(\sum_{i=1}^{n_s} \pi^s_{ik} \mathbf {1}_{[f(x^s_i)\leq f(x^s_k)]} \Biggr)^p \Biggr)^{1/p}. $$

In order to interpret $\text{Loss}_{s,p,\text{low},0-1}(f)$, consider that $\sum_{i=1}^{n_{s}} \pi^{s}_{ik}\mathbf{1}_{[f(x^{s}_{i})\leq f(x^{s}_{k})]}$ is the number of products i that should be ranked higher than k (that is, $\pi^{s}_{ik} =1$), but are ranked lower by f (that is, $\mathbf{1}_{[f(x^{s}_{i})\leq f(x^{s}_{k})]}$). This quantity is large when k is a low-quality product that is near the top of the ranked list. In other words, the largest terms in the sum $\sum_{k=1}^{n_{s}} (\sum_{i=1}^{n_{s}} \pi^{s}_{ik}\mathbf{1}_{[f(x^{s}_{i})\leq f(x^{s}_{k})]} )^{p}$ correspond to low quality products that are highly ranked. Thus, minimizing $\text{Loss}_{s,p,\text{low},0-1}(f)$ tends to “push” low-quality products towards the bottom of the list.

Instead of minimizing $\text{Loss}_{s,p,\text{low},0-1}(f)$ directly, we can minimize the following convex upper bound:

$$ \text{Loss}_{s,p,\text{low}}(f) := \Biggl(\sum_{k=1}^{n_s} \Biggl(\sum_{i=1}^{n_s} \pi^s_{ik}e^{- (f(x^s_i)-f(x^s_k) )} \Biggr)^p \Biggr)^{1/p}. $$

We reverse the sums over i and k to define another quantity:

$$ \text{Loss}_{s,p,\text{high}}(f) := \Biggl(\sum_{i=1}^{n_s} \Biggl(\sum_{k=1}^{n_s} \pi^s_{ik}e^{- (f(x^s_i)-f(x^s_k) )} \Biggr)^p \Biggr)^{1/p}. $$

Minimizing $\text{Loss}_{s,p,\text{high}}(f)$ tends to “pull” high-quality products towards the top of the list. The ℓ _pRE method uses both $\text{Loss}_{s,p,\text{low}}(f)$ and $\text{Loss}_{s,p,\text{high}}(f)$. The loss function minimized by ℓ _pRE is:

$$ \sum_{s\in S_{\text{sub}}}\frac{C_s}{N_{s,p}} \bigl(\text {Loss}_{s,p,\text{low}}(f)+ C_{\text{high}}\cdot\text{Loss}_{s,p,\text{high}}(f) \bigr), $$

where the normalization factor N _s,p is:

$$ N_{s,p} = \sum_{r\in\text{cat}(s)} \Biggl( \Biggl(\sum _{k=1}^{n_r} \Biggl(\sum _{i=1}^{n_r} \pi^r_{ik} \Biggr)^p \Biggr)^{1/p} + C_{\text{high}} \Biggl(\sum _{i=1}^{n_r} \Biggl(\sum _{k=1}^{n_r} \pi^r_{ik} \Biggr)^p \Biggr)^{1/p} \Biggr), $$

and C _s and $C_{\text{high}}$ are user-specified parameters that control the relative importance of each subcategory, and the importance of $\text{Loss}_{s,p,\text{high}}(f)$ relative to $\text{Loss}_{s,p,\text{low}}(f)$ respectively. We use p=1 and p=2, and denote the corresponding methods ℓ ₁RE and ℓ ₂RE respectively.

7 Proof of concept

As a preliminary experiment, we tested the methods using an artificial dataset that we have made publicly available.^{Footnote 4} Figure 1 shows for each of the five factors of this dataset, a scatterplot of the factor values versus the scores. The sixth plot in the figure shows all five factors versus the scores in the same window. For each factor, there is one set of products for which there is perfect correlation between the factor values and scores, another set for which there is perfect anti-correlation, and the remainder for which the factor value is constant. By constructing the dataset in this manner, we expect there to be significant variation in the ranking performance of the different methods.

There is only one category with one subcategory. There are 200 products total, and we randomly divided the data into 100 products for training and 100 products for testing. We tested five methods: LS1, LS2, ℓ ₁RE, ℓ ₂RE, and MIO-RE; LS3 is equivalent to LS2 since there is only one subcategory.^{Footnote 5} We ran the methods for three cases: concentrating on the top 60, the top 45, and the top 25, that is, T=40, T=55, and T=75 respectively. We ran ℓ ₁RE with $C_{\text{high}}=0$; ℓ ₂RE with $C_{\text{high}}=0, 0.5$, and 1; and MIO-RE with θ=9. MIO-RE found the final solutions within three minutes for each case, and the other methods ran within seconds. Tables 1, 2, and 3 show the results. The highest training and test measures across the methods are highlighted in bold. LS1, ℓ ₁RE, and ℓ ₂RE (with $C_{\text{high}}=0,0.5$, and 1) always produced the same values for the three evaluation measures.

Table 1 Training and test values for M1, M2, and M3 on artificial dataset (top 60)

How to reverse-engineer quality rankings

Abstract

Similar content being viewed by others

How artificial intelligence will change the future of marketing

Artificial intelligence in recommender systems

Introduction to Design Science Research

1 Introduction

2 Related work

3 Encoding preferences for quality ranking data

3.1 One category, one subcategory

3.2 Multiple categories and subcategories

4 Optimization

4.1 Model for reverse-engineering

5 Evaluation metrics

6 Other methods for reverse-engineering

6.1 Least squares methods for reverse-engineering

6.2 The ℓ p reverse-engineering algorithm

7 Proof of concept

8 Experiments on rating data

8.1 Experimental setup

8.2 Results

8.3 Example of differences between methods on evaluation measures

9 Determining a cost-effective way to achieve top rankings

9.1 Two formulations

9.1.1 Maximizing score on a fixed budget

9.1.2 Minimizing cost with a fixed target score

9.2 Practical example

10 Conclusion

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendix

Appendix

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation

6.2 The ℓ _p reverse-engineering algorithm