1 Introduction

The ranking of candidates or objects is widely used in modern societies and new technologies, involving tasks ranging from preference voting and social opinion analysis to recommender systems and omics data integration. Let us assume that no metric measurements, such as object scores, are available. Only ordinal information can be observed, and it can be incomplete. Our interest goes beyond rank aggregation. Our goal is to recover the latent signals that inform the observed ranks and to use the signal variabilities for inference on the estimated consolidated rank order, though not on the level of individual assessors. Standard rank aggregation algorithms cannot serve this purpose because they lack probabilistic assumptions. As a direct consequence, rank stability cannot be studied. Their advantage is high computational efficiency. More advanced aggregation techniques (e.g. Lin and Ding 2009) adopt highly versatile stochastic algorithms at the risk of rank order instability, even in the critical top range. In contrast, our indirect inference approach, when rerun, always yields the same consolidated ranked list. We take advantage of mathematical instead of stochastic optimization. Our optimization approach is based on sets of constraints representing the pairwise object rank comparisons. We propose several optimization techniques embedded in a rather general non-parametric statistical framework.

Traditionally, the number of rankers n exceeds by far the number of objects or candidates p. In any such assessment, the resulting orderings and the thus obtained ranked lists differ across the rankers. For the consolidation of these individual orderings, rank aggregation techniques have been devised. Probably the first of its kind is known as Borda count (de Borda 1781). The motivation is to construct a more informative and robust ranked list compared with the individual rankings. Such aggregation approaches do not imply any signal plus noise model or error distribution behind the observed ranks. They are fully adequate for ordinal information that does not have a numerical underpinning. A popular, more recent concept is the median rank (Fagin et al. 2003). The increase in computer power has given rise to rather demanding stochastic aggregation techniques, some of them developed in the context of machine learning. The typical workhorse is Markov chain Monte Carlo (McMC). Fundamental work was contributed by Dwork et al. (2001) with a focus on Web applications such as spam control. Other relevant work is due to Lin and Ding (2009) with a focus on high-throughput technologies in genomic analysis. Most of these techniques, based on deterministic or stochastic algorithms, are implemented in the "R" package "TopKLists" (Schimek et al. 2015). A disadvantage of most rank aggregation approaches is the fact that discordances of rankings at different, even far distant, positions are treated equally. Recently, there have been proposals for the identification of those top-ranked objects that are characterized by high concordance of their rankings (Hall and Schimek 2012; Sampath and Verducci 2013). The thus obtained subset of objects, usually named top-k, is rather small and robust compared to the total number of objects. These objects make a stable input for rank aggregation techniques.

Another class of techniques for \(n > > p\), with a long tradition, comprises probabilistic ranking models. Among the first are the Thurstone model (Thurstone 1927), the Mallows model (Mallows 1957) and its multistage generalization (Fligner and Verducci 1988), and finally, the Plackett-Luce model (Luce 1959; Plackett 1975). The Thurstone model is based on order statistics and assumes that there is a latent score for each object, which, together with noise, determines its rank. The Mallows model is built on pairwise comparisons applying a distance function between rankings, whereas the Plackett-Luce model is a stagewise probabilistic model based on permutations. For the latter, maximum likelihood estimation is feasible. However, maximum likelihood inference for the consensus ranking in the Mallows case can be NP-hard depending on the choice of the distance function. Another problem of maximum likelihood evaluation is over-fitting of the consensus ranking model. The Bradley-Terry-Luce model (Bradley and Terry 1955; Luce 1959) is a special case of the Plackett-Luce model, also known as the multinomial logit model (McFadden 1973), applied to pairwise comparisons.

Apart from enormous computational demands, common shortcomings of probabilistic models are their limited capability to handle large numbers p of objects and their need for many more rankers (i.e. assessments) than objects being ranked. A detailed review of probabilistic models for ranking data can be found in Alvo and Yu (2014). Some most recent proposals can overcome the above limitations, also making it possible to handle \(n < < p\)-problems, but remain time-consuming: Vitelli et al. (2018) have developed a rather general Bayesian Mallows model for right-invariant distances such as the popular Kendall tau. They also provide solutions for incomplete rankings, for heterogeneous assessments, and for rank positions missing at random. A generalized Mallows model approach was published most recently by Li et al. (2020). It is assumed that the ranked data are produced by a multistage ranking process. The process towards consensus is governed by parameters of overall ranking quality and stability. This model even allows for input rankings of heterogeneous agreement, an unsolved problem in standard rank aggregation. The Mallows model was further generalized to permit partitions into two disjoint ranker groups of different abilities by Zhu et al. (2023).

In a very recent article, the Thurstone model has been generalized in a Bayesian framework (Li et al. 2022). The focus of this work is on the inclusion of covariate information concerning the ranked objects and the integration of rankers of varying abilities. Other recent papers introduce angle-based models for ranking data (Cucuringu 2016; Xu et al. 2018). A consensus score vector is calculated for all objects reflecting the associated preferences. The probability of observing a ranking is proportional to the cosine of the angle from the consensus score vector. Spearman distance-based models are a special case of Xu et al.’s approach. Angle-based models can handle incomplete rankings as well as relatively large numbers of objects because the high computational demand of other, mostly McMC-based probabilistic model approaches can be avoided.

Machine learning approaches are usually constructed to handle both \(n > > p\) and \(n < < p\)-problems with large numbers of n and of p. Such an approach based on pairwise comparisons is rank centrality by Negahban et al. (2017). The connection to our proposal is that its authors are interested in finding ‘scores’ (we call them ‘signals’) in addition to obtaining a consolidated ranking typically produced by rank aggregation algorithms. Their iterative ranking algorithm has a random walk interpretation, and the scores are the stationary probabilities of this random walk. The motivation of most pairwise comparison machine learning approaches differs from what we aim at. Wauthier et al. (2013) distinguish between (i) actively measured, (ii) repeatedly measured, and (iii) fully measured comparisons, up to some noise. In our proposal, the focus is on (i) in combination with (iii). The latter does not exclude incomplete rankings. Instead of repeated measurements, we assume independent assessments, having input from n assessors rating the same p objects in settings \(n > > p\) or \(n < < p\). To be precise, this does not mean that we require a large n or that we have to calculate all \(p(p-1)/2\) comparisons in our algorithm. However, we do not aim at situations where \(n=1\) or where comparisons can only be measured passively (e.g. in click-through data), as often addressed in machine learning algorithms.

Currently, there are many challenging applications beyond preference and recommendation in social and marketing research: examples are omics data analysis and large-scale medical data integration that confront researchers with thousands or tens of thousands of objects in rank order. Our intention is to propose a scalable method that is as general as possible with respect to the type of rank data and at the same time indifferent to the relative size of n and p. Despite the rich statistical and machine learning literature, such a method is still in demand. Moreover, there is a methodological as well as a computational motivation for our research. We are aiming at a mathematical concept in which rank order relations are modelled directly without need for a distance measure. One should keep in mind that the choice of distance measure is usually driven by technical requirements and not by the characteristics of the data. From a computational perspective, our goal is to provide rather robust results compared to stochastic algorithms and to outperform related machine learning approaches in terms of computing time.

The overall research question we intend to answer is the estimation of the metric latent signals from multiple rank observations for a given set of objects, in a way that is symmetric (and therein novel) with respect to the dimensions n and p. As will be pointed out later, this can be achieved via a specific construction of the objective function in the mathematical optimization problem. There is no other input than a matrix of ranks representing multiple assessments by human experts or machines of a fixed set of objects. This matrix does not need to be complete; missing assignments are allowed. Neither specific distributional assumptions nor a distance measure should be required to estimate the latent signals that inform the observed rankings. We aim at an approach that overcomes the shortcomings of aggregation techniques by estimating the signals behind the observed rankings and by quantifying the uncertainty of the signal estimates. To be able to scale to large or huge data sets, the algorithm needs to be computationally highly efficient. In summary, we have two aims: (i) to estimate the vector of signals that best explains the consensus ranking and (ii) to deliver an aggregate ranking under the assumption that each individual rank assignment is noisy. This noise assumption distinguishes our method from established rank aggregation algorithms.

We have developed an indirect inference approach for estimating the latent signals underlying observed ranking data. In terms of methodology, this novel approach interfaces statistical resampling and mathematical optimization techniques. In contrast to probabilistic ranking models based on order statistics, distance measures or permutation statistics, we characterize the observed ranking data by sets of order constraints. To the best of our knowledge this has not been studied so far. The obtained solutions are the basis for computationally efficient bootstrap estimates of the latent signals and associated errors. From the signal estimates of the individual objects, we can derive a consolidated list of ranks. The error estimates allow us to address the stability of the obtained rank order.

The article is organized as follows. Section 2 describes the statistical model and its assumptions. The methodology is set out in Sect. 3. There, the statistical estimation problem is formulated in terms of convex mathematical optimization. Next, the representation of order relations by constraints is introduced. Based on the transitivity property of rank scales, a computationally less demanding characterization of the order constraints is developed. Then the concept of signal estimation via indirect inference is formally motivated. Finally in this section, the use of the numerically highly efficient Poisson bootstrap for non-parametric signal estimation is worked out. In Sect. 4, the settings for numerical and comparative analyses are introduced. The proposed estimation variants are studied via simulations in various data scenarios, including the cases of incomplete and heterogeneous rankings, in Sect. 5. In Sect. 6, the indirect inference approach is compared to the machine learning method rank centrality. The article describes two conceptually different applications in Sects. 7 and 8. Both applications address the \(n < < p\) setting, but the first one comprises only \(n=3\) rankers (i.e. rating institutions) and the second a much larger number of \(n=50\) rankers (i.e. cancer patients). In Sect. 7, world university ranking data are analyzed with the proposed approach, and the obtained results are compared with conventional rank aggregation findings. In Sect. 8, kidney cancer gene expression ranking data are studied with respect to patient survival versus non-survival, applying various exploratory techniques. We conclude with a general discussion and plans for future research in Sect. 9.

2 The statistical model and its assumptions

Let us consider n rankers (humans or machines) assessing p objects. We assume that the \(j^{\text {th}}\) ranker either implicitly or explicitly observes random variables \(X_{1j},X_{2j},\ldots ,X_{pj}\), which we name scores. The variable \(X_{ij}\) denotes the value of the score for the \(i^{\text {th}}\) object as perceived by the \(j^{\text {th}}\) ranker. The permuted order of these variables,

$$\begin{aligned} X_{\pi {(1j)}}>X_{\pi {(2j)}}>\ldots > X_{\pi {(pj)}}, \end{aligned}$$
(1)

defines the rankings that the \(j^{\text {th}}\) ranker assigned to the objects. Let \(R_{1j},R_{2j},\ldots ,R_{pj}\) denote the ranks of \(X_{1j},X_{2j},\ldots , X_{pj}\) according to (1), where \(R_{ij}\in {\mathbb {N}}^{+}\) and \(R_{ij}\le p\). Then we define \(\{R_{ij}\}\) as the column rank matrix of the scores \(\{X_{ij}\}\).
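
As a small illustration (not part of the original derivation), the column rank matrix can be obtained from a given score matrix \({\textbf{X}}\) by ranking each column in decreasing order of the scores; the sketch below assumes a NumPy array X of dimension \(p \times n\) and assigns rank 1 to the largest score, in line with ordering (1).

```python
import numpy as np

def ranks_from_scores(X):
    """Column rank matrix of the scores X (p x n): within each column,
    rank 1 goes to the largest score and rank p to the smallest,
    matching the decreasing order in (1). Ties are assumed absent."""
    return (-X).argsort(axis=0).argsort(axis=0) + 1
```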

Further, we assume that the values \(X_{1j},X_{2j},\ldots ,X_{pj}\) follow the simple statistical model

$$\begin{aligned} X_{ij}=\theta _{i}+Z_{ij}, \qquad i=1,\ldots ,p, \qquad j=1,\ldots ,n, \end{aligned}$$
(2)

where the \(\theta _i\) denote positive real-valued parameters and the \(Z_{ij}\) are random variables. Without loss of generality, we assume \(Z_{ij} \in {\mathbb {R}}^{+}\) because the informed rank order is invariant under shift operations on the scale of scores.

Let us write \({\textbf{R}}=\{R_{ij}\}\), \({\textbf{X}}=\{X_{ij}\}\), \(\mathbf {\theta }=\{\theta _i\}\), and \({\textbf{Z}}=\{Z_{ij}\}\). The parameters \(\mathbf {\theta }\) represent the latent values that all rankers aim to assess during the ranking process, but due to the random errors \({\textbf{Z}}\), their best approximation of the latent values \(\mathbf {\theta }\) are the scores \({\textbf{X}}\). For instance, the score can be interpreted as an intensity of preference. Such an assessment model was first introduced by Thurstone (1927). It implies that the \(\mathbf {\theta }\)’s are implicitly shared by all rankers, independent of their abilities, which contribute to the random (error) variables \(Z_{ij}\).

In the following, we call the parameters \(\theta _i\) the unobservable latent signals we wish to estimate via \({\hat{\theta }}_i\). Let \(\mathbf {\pi }(\theta _i)\) denote the rank position \(\pi _i \in \{1,2,\ldots ,p\}\) of \(\theta _i\). From the complete ordered set of estimates \({\hat{\theta }}_i\) for \(i=1,2,\ldots ,p\) we obtain the consolidated ranked list. The latent signals are measured on an arbitrary scale and represent the true relative distances between the ranked objects. The distances that can be derived from the point estimates \({\hat{\theta }}_i\), in combination with their standard errors \(SE({\hat{\theta }}_i)\), allow us to assess the stability of the consolidated rank \(\pi _i\) of an object. Here we should emphasize that the smaller the distance between \({\theta }_i\) and \({\theta }_{i+1}\) for an arbitrary \(i\in \{1,2,\ldots ,p\}\), the smaller the perturbation needed to swap the rank positions of objects i and \(i+1\). For the ranks \(\pi _i\) informed by the true signals \(\theta _i\) to be stable in a ranked list, there should be sufficient relative distance between neighbouring signal estimates \({\hat{\theta }}_i\) and \({\hat{\theta }}_{i+1}\).

Finally, in clear contrast to most probabilistic rank estimation approaches in the literature, we do not impose distributional restrictions on the latent signals \(\theta _i\) or on the random errors \(Z_{ij}\).

3 Methodology

Our aim is to estimate the vector of signals \(\theta _i\) from the scores \([X_{i1}=\theta _i+Z_{i1},\ldots ,X_{in}= \theta _i+Z_{in}], \quad i=1,2,\ldots ,p\). In situations where the assessment scale differs from ranker to ranker (e.g. in omics platform data), or where the measurements \(X_{ij}\) are completely lacking (e.g. in preference data), we are restricted to the rank representation of the scores \({\textbf{X}}\) in the input rank matrix \({\textbf{R}}\).

The mathematical formulation of the signal estimation problem is based on pairwise comparisons of all objects with respect to their rank positions. Pairwise comparisons also underlie ratings or preferences in many probabilistic and machine learning ranking models proposed in the literature. What is novel is the form in which the order relations are represented. One set of constraints is built for each ranker; it captures the relations between the hidden scores in the sense of Thurstone (i.e. the ‘true’ signals) as anticipated by the assessors, observable only through their individual orderings of the objects. These constraints allow us to formulate the statistical estimation problem in terms of mathematical optimization.

3.1 Quadratic optimization with order relation constraints

We propose to use quadratic optimization. Let t denote the number of variables and m the number of constraints. In standard form we have

$$\begin{aligned} \begin{aligned} \min _{{{\textbf{x}}} \in {\mathbb {R}}^t} \quad&\frac{1}{2} {{\textbf{x}}}^{\intercal } {{\textbf{Q}}}{{\textbf{x}}} + {{\textbf{c}}}^{\intercal }{{\textbf{x}}}\\ \text {s.t.} \quad&{{\textbf{A}}} {{\textbf{x}}} \le {{\textbf{b}}}\\&{{\textbf{x}}} \ge {{\textbf{0}}}, \\ \end{aligned} \end{aligned}$$

where \({\textbf {Q}}\) is a real symmetric (\(t \times t\)) matrix, \({\textbf {c}}\) a real t-dimensional vector, \({\textbf {A}}\) a real (\(m \times t\)) matrix, and \({\textbf{b}}\) a real m-dimensional vector. The latter has the role of scaling the latent scores. Without loss of generality, its elements can take the value of some positive constant. Quadratic optimization globally reduces the assessor-induced error realizations \(z_{(i,j)}\) of the random variables \({\textbf{Z}}\) while aiming at the ‘true’ signals.

Let us recall the statistical model in (2), which is

$$\begin{aligned} X_{ij}=\theta _{i}+Z_{ij}, \qquad i=1,\ldots ,p, \qquad j=1,\ldots ,n, \end{aligned}$$

where we have parameters \(\theta _i \in {\mathbb {R}}^{+}\) and \(Z_{ij} \in {\mathbb {R}}^{+}\). In this respect, we mimic the rank-based distribution function approach of Svendova and Schimek (2017). Applying our novel constraint-based rank representation under slightly different assumptions, we can completely avoid computationally highly demanding stochastic optimization.

Under the assumption of stochastically independent assessments, we aim to represent each assessor by a set of constraints specific to his/her ranking of the available objects. The essential idea is to compare the position of each object ranked higher with the positions of all other objects ranked lower for each of the assessors.

The construction of the constraints can be illustrated in the following way: Given a fixed ranker j and starting with the object ranked first, we force this object, with the latent parameter \(\theta _{\pi (1,j)}\) plus error \(z_{(\pi (1,j),j)}\), into a greater-or-equal relation with the object ranked second, with \(\theta _{\pi (2,j)}\) plus \(z_{(\pi (2,j),j)}\). Thus, the first constraint for ranker j is \(\theta _{\pi (1,j)} + z_{(\pi (1,j),j)} - \theta _{\pi (2,j)} - z_{(\pi (2,j),j)} \ge b\), where \(b > 0\) is a scaling constant that can be arbitrarily chosen. In the second constraint for the same ranker, we require \(\theta _{\pi (1,j)}\) plus \(z_{(\pi (1,j),j)}\) to obey a greater-or-equal relation to the object ranked third, represented by \(\theta _{\pi (3,j)}\) plus \(z_{(\pi (3,j),j)}\). In formal notation, \(\theta _{\pi (1,j)} + z_{(\pi (1,j),j)} - \theta _{\pi (3,j)} - z_{(\pi (3,j),j)} \ge b\). Analogous constraints are constructed for all other objects, moving from one ranker to the next. This procedure allows us to infer each underlying latent signal across the rankers via convex optimization.
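
As a small hypothetical illustration, suppose \(p=3\) and ranker j orders the objects as \(2 \succ 1 \succ 3\). The full set of constraints for this ranker is then \(\theta _{2} + z_{(2,j)} - \theta _{1} - z_{(1,j)} \ge b\), \(\theta _{2} + z_{(2,j)} - \theta _{3} - z_{(3,j)} \ge b\), and \(\theta _{1} + z_{(1,j)} - \theta _{3} - z_{(3,j)} \ge b\); the restricted approach introduced below keeps only the first and the third of these.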

For convenience, let us introduce a function \({\mathcal {I}}(g,h)\) that returns the index of the object in position g for assessor h, and write

$$\begin{aligned} \begin{aligned} i_{1j} = {\mathcal {I}}(1,j) \\ i_{2j} = {\mathcal {I}}(2,j) \\ \vdots \\ i_{pj} = {\mathcal {I}}(p,j). \\ \end{aligned} \end{aligned}$$

The mathematical formulation (in non-standard form for reasons of comprehensibility) of the full approach considering all pairwise comparisons for the observed individual rankings is then

$$\begin{aligned} \begin{aligned}&\underset{z}{\text {min}}\quad \quad \sum _{i=1}^{p}{\sum _{j=1}^n{z^2_{(i,j)}}} \\&\text {s. t.} \quad \quad \theta _{i_ {1j}} + z_{(i_ {1j},j)} - \theta _{i_ {tj}} - z_{(i_ {tj},j)} \ge b \quad \wedge \quad t = 2, 3 \dots , p,\quad j = 1, \dots , n\\&\quad \quad \qquad \theta _{i_ {2j}} + z_{(i_ {2j},j)} - \theta _{i_ {tj}} - z_{(i_ {tj},j)} \ge b \quad \wedge \quad t = 3, 4 \dots , p,\quad j = 1, \dots , n\\&\quad \quad \qquad \vdots \\&\quad \quad \qquad \theta _{i_ {p-1,j}} + z_{(i_ {p-1,j},j)} - \theta _{i_ {pj}} - z_{(i_ {pj},j)} \ge b \quad \wedge \quad t=p, \quad j = 1, \dots , n\\&\quad \quad \qquad ----- \\&\quad \quad \qquad \theta _i \ge 0 \quad i=1,\dots , p \\&\quad \quad \qquad z_{(i,j)} \ge 0 \quad i=1, \dots , p, \quad j = 1, \dots , n, \end{aligned} \end{aligned}$$

where \(\wedge\) is the logical and operation. The input rank matrix has p rows and n columns. The number of constraints is equal to \(n \times \frac{(p-1)p}{2}\), and the number of variables is equal to \((n \times p) + p\). Certainly, a limitation of this approach is the fact that the time demand increases almost quadratically in p.

For large or huge ranking data, the full approach, which considers all possible comparisons, is computationally rather demanding. As stated in Sect. 1, we try to do better than the current stochastic optimization approaches. To overcome the above limitation, we apply the transitivity property of rank scales to the full comparison approach. Transitivity simply means that when \(x>y\) and \(y>z\), then \(x>z\). Consequently, it should be sufficient to compare only those objects that are next to each other in terms of their rank positions, i.e. neighbouring objects. We call this simplified procedure for the construction of constraints the restricted approach. For simulation evidence that the full and the restricted approach are equivalent, see Sect. 4. The numerical gain is substantial because we need far fewer combinations of the terms \((\theta _{\pi (i,j)} + z_{(\pi (i,j),j)})\) to characterize the input rank matrix \({\textbf{R}}\). The number of variables \((n \times p)+p\) remains the same, but the number of constraints is reduced from \(n \times \frac{(p-1)p}{2}\) to \(n \times (p-1)\). Hence, the numerical complexity is reduced from quadratic to linear in p when the restricted approach is applied. The mathematical formulation of the latter is

$$\begin{aligned} \begin{aligned}&\underset{z}{\text {min}} \quad \sum _{i=1}^{p}{\sum _{j=1}^n{z^2_{(i,j)}}} \\&\text {s. t.} \quad \theta _{i_ {sj}} + z_{(i_ {sj},j)} - \theta _{i_ {tj}} - z_{(i_ {tj},j)} \ge b \; \wedge \; (s,t) = \left\{ (1, 2), (2,3), \dots , (p-1,p)\right\} \\&\quad \qquad ----- \\&\quad \qquad \theta _i \ge 0 \quad i=1,\dots , p \\&\quad \qquad z_{(i,j)} \ge 0 \quad i=1, \dots , p, \quad j = 1, \dots , n. \end{aligned} \end{aligned}$$
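
To make the restricted construction concrete, the following sketch builds the adjacent-pair constraints from an input rank matrix and solves the resulting program with an off-the-shelf LP routine; for simplicity it uses the linear (\(L^{1}\)) objective introduced in Sect. 3.2 rather than the quadratic one shown above. The variable layout follows the vector \({\textbf{x}}\) of Sect. 3.2; all function and variable names are our own illustrative choices.

```python
import numpy as np
from scipy.optimize import linprog

def restricted_linear_fit(R, b=0.1):
    """Restricted approach with the linear (L1) objective.
    R is a (p x n) rank matrix, R[i, j] = rank of object i by ranker j (1 = best).
    One constraint per ranker and per pair of adjacently ranked objects.
    Returns estimated signals theta_hat (length p) and errors Z_hat (p x n)."""
    p, n = R.shape
    t = p + n * p                       # variables x = (theta_1..theta_p, z_11..z_pn)

    def zcol(i, j):                     # position of z_(i,j) in x (layout of Sect. 3.2)
        return p + j * p + i

    rows, cols, vals, n_con = [], [], [], 0
    for j in range(n):
        order = np.argsort(R[:, j])     # object indices from rank 1 down to rank p
        for s in range(p - 1):
            hi, lo = order[s], order[s + 1]
            # theta_hi + z_(hi,j) - theta_lo - z_(lo,j) >= b, rewritten as A_ub x <= -b
            for col, val in ((hi, -1.0), (lo, 1.0),
                             (zcol(hi, j), -1.0), (zcol(lo, j), 1.0)):
                rows.append(n_con); cols.append(col); vals.append(val)
            n_con += 1

    A_ub = np.zeros((n_con, t))
    A_ub[rows, cols] = vals
    b_ub = np.full(n_con, -b)
    c = np.concatenate([np.zeros(p), np.ones(n * p)])   # minimize the sum of the errors

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=(0, None), method="highs")
    theta_hat = res.x[:p]
    Z_hat = res.x[p:].reshape(n, p).T
    return theta_hat, Z_hat
```

The full approach would differ only in the inner loop, comparing each object with every lower-ranked one and thereby producing \(n \times \frac{(p-1)p}{2}\) constraints instead of \(n \times (p-1)\).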

Both the full and the restricted approach also cover the situation of incomplete rankings. The technical handling of incomplete rankings follows the space-oriented rank-based data integration approach of Lin (2010), which is implemented in the R package TopKLists (Schimek et al. 2015) for data aggregation. In the space-oriented setting, missing rank assignments do not cause additional computational costs because the number of involved constraints does not differ from the case of complete rankings.

The variables \(z_{(i,j)}\) are central to the proposed methodology. The \(z_{(i,j)}\)’s in the constraints constitute the matrix of the error structure

$$\begin{aligned} {\textbf{Z}}_{(p,n)} = \begin{pmatrix} z_{1,1} &{} z_{1,2} &{} \cdots &{} z_{1,n} \\ z_{2,1} &{} z_{2,2} &{} \cdots &{} z_{2,n} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ z_{p,1} &{} z_{p,2} &{} \cdots &{} z_{p,n} \end{pmatrix}. \end{aligned}$$

Violations of the assumption of a common ability of all rankers to contribute to the estimation of the signals \(\theta\) can be identified via the error structure. At the end of the estimation process, the \({\hat{{\textbf{Z}}}}\) matrix can be used to check the above mentioned assumption or for exploratory purposes. Typical tasks are the search for outliers among the rankers, the identification of distinct groups of rankers or of top-k objects of highest agreement among the assessors.

3.2 Non-parametric signal estimation via indirect inference

The optimization problem we have introduced in Sect. 3.1 is

$$\begin{aligned} \begin{aligned} \min _{{\textbf{x}} \in {\mathbb {R}}^t} \quad&\frac{1}{2} {\textbf{x}}^{\intercal } \textbf{Q x} + {\textbf{c}}^{\intercal }{\textbf{x}}\\ \text {s.t.} \quad&{{\textbf{A}}}{{\textbf{x}}} \le {\textbf{b}}\\&{\textbf{x}} \ge {\textbf{0}}. \\ \end{aligned} \end{aligned}$$
(3)

Moreover, let us define the set of constraints \(\Psi\) representing all order relations produced by a complete permutation of the involved objects. Note that these constraints are all linear. Our model assumptions imply a column vector \({\textbf{x}}\) of size \(t=(p+q)\), the total number of variables,

$$\begin{aligned} {\textbf{x}} = (\theta _1,\theta _2,\dots ,\theta _p,z_{1,1},z_{2,1},\dots ,z_{p,1},z_{1,2},z_{2,2}, \dots ,z_{p,2},\dots ,z_{1,n},z_{2,n},\dots ,z_{p,n}), \end{aligned}$$

where \(q=np\). Then \({\textbf{Q}}\) is a square t-dimensional block-diagonal matrix

$$\begin{aligned} {\textbf{Q}}:= \begin{bmatrix} {\textbf{0}}_{p\times p} &{} {\textbf{0}}_{p \times q} \\ {\textbf{0}}_{q \times p} &{} {\textbf{I}}_{q} \end{bmatrix}_{t \times t}, \end{aligned}$$

where \({\textbf{0}}\) denotes the null and \({\textbf{I}}\) the identity matrix. Then \({\textbf{c}}=[{\textbf{0}}_{p},{\textbf{1}}_{q}]\), where \({\textbf{0}}\) is a subvector of zeros and \({\textbf{1}}\) is a subvector of ones. As a direct consequence of this block-diagonal structure, \({\textbf {Q}}\) is positive semi-definite, and the mathematical optimization problem is convex.
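
As a brief sketch (with illustrative names of our own), the block-diagonal \({\textbf{Q}}\) and the vector \({\textbf{c}}\) defined above can be assembled as follows.

```python
import numpy as np
from scipy.sparse import block_diag, csr_matrix, identity

def build_Q_c(p, n):
    """Q is (t x t) block-diagonal with a (p x p) zero block for the signals
    and an identity block for the q = n*p error variables; c = [0_p, 1_q]."""
    q = n * p
    Q = block_diag((csr_matrix((p, p)), identity(q, format="csr")), format="csr")
    c = np.concatenate([np.zeros(p), np.ones(q)])
    return Q, c
```

For the linear objective discussed below, only \({\textbf{c}}\) is needed and \({\textbf{Q}}\) is dropped.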

The size of the matrix of constraints \({\textbf{A}}\) is determined by the number of variables t and the number of constraints m. Compared to the full approach, the number m is substantially reduced in the restricted approach resulting in a significant gain of computational efficiency. Let us define

$$\begin{aligned} {\textbf{A}}:=\left[ {\textbf{A}}_{\theta } \, {\textbf{A}}_{z}\right] = \begin{bmatrix} \Phi _{\theta }(x_{1}) &{} \Phi _{\theta }(x_{2}) &{} \cdots &{} \Phi _{z}(x_{t-1}) &{} \Phi _{z}(x_{t}) \\ \vdots &{} \vdots &{} &{} \vdots &{} \vdots \\ \Phi _{\theta }(x_{1}) &{} \Phi _{\theta }(x_{2}) &{} \cdots &{} \Phi _{z}(x_{t-1}) &{} \Phi _{z}(x_{t}) \\ \end{bmatrix} _{m \times t}, \end{aligned}$$

where \({\textbf{A}}_{\theta }\) is a latent signal-specific submatrix and \({\textbf{A}}_{z}\) an error-specific submatrix. Each row of matrix \({\textbf{A}}\) characterizes one constraint. There are two functions \(\Phi\) which completely define the numerical entries \(a_{ij}\) of \({\textbf{A}}\):

$$\begin{aligned} \Phi _{\theta }(a_{ij} \in A_{\theta })= {\left\{ \begin{array}{ll} -1, &{} \theta _{i} < \theta _{j} \\ 1, &{} \theta _{i} > \theta _{j} \\ 0, &{} \theta _{i}, \theta _{j} \not \in \psi _\ell \\ \end{array}\right. } \end{aligned}$$

and

$$\begin{aligned} \Phi _{z}(a_{ij} \in A_{z})= {\left\{ \begin{array}{ll} -1, &{} \theta _{i} < \theta _{j} \\ 1, &{} \theta _{i} > \theta _{j} \\ 0, &{} \theta _{i}, \theta _{j} \not \in \psi _\ell , \\ \end{array}\right. } \end{aligned}$$

where \(\psi _{\ell } \in \Psi\) for \(\ell =1,2,\ldots ,m\). All the entries of the m-dimensional vector \({\textbf{b}}\) are fixed to 0.01. As an alternative to the introduced quadratic objective function

$$\begin{aligned} \sum _{i=1}^{p}{\sum _{j=1}^n{z_{(i,j)}^2}} \end{aligned}$$

we also propose a linear objective function

$$\begin{aligned} \sum _{i=1}^{p}{\sum _{j=1}^n{z_{(i,j)}}}. \end{aligned}$$

In the first case an \(L^{2}\)-norm, and in the second case an \(L^{1}\)-norm is applied to the errors \(z_{(i,j)}\). Under both criteria the system of inequalities remains the same and a convex optimization problem is solved. This ensures that all local optima are global optima. For the linear objective function, \({\textbf{Q}}\) is obsolete, and the quadratic optimization problem reduces to a linear one.

Both objective functions comprise two sums, one over the set of p ranked objects and the other over the set of n assessors, which can be interchanged due to the commutative law. Therefore, we have a ‘symmetry’ between the dimensions n and p. This makes an essential difference for parameter estimation in comparison to probabilistic ‘asymmetric’ approaches, where the estimator converges to the population parameter with increasing n. The advantage of our approach is that \(n > > p\) and \(n < < p\)-problems are handled equally well with a similar computational demand. However, this comes at the cost that we cannot establish statistical convergence results. In Sect. 6 we compare our approach with the pairwise comparison rank centrality method, for which some convergence results could be established. Both methods are score-oriented and work without distributional assumptions. The theoretically established optimality relationship of rank centrality with the probabilistic Bradley-Terry-Luce model can be generalized to our indirect inference approach because of formal commonalities.

3.3 The bootstrap sampling distribution of the signal estimates

The sampling distribution of the estimated signals can be derived by using the bootstrap. Classical bootstrap was, for instance, used in Svendova and Schimek (2017). There, the bootstrap ranking matrix was obtained as a random sample of size n drawn from \(R(\theta )\) by simple random sampling with replacement from the set of assessors. The actual rank matrix, containing the evaluation of the n assessors, is used to create a number, say B, of perturbed data sets (the bootstrap samples) of size n, \(R^1, \dots , R^B\). Then, the signal estimators are applied to each bootstrap sample and the variation among these B estimators is used to describe the sampling variation of each signal. However, it is well known that simple random sampling with replacement does not result in samples with equal information content. That is due to the randomness in the number of distinct observations that occur in different bootstrap samples. Consequently, the expected number of distinct columns in a given bootstrap sample is approximately equal to 0.632n with a standard deviation of circa \(0.482\sqrt{n}\).

To avoid this problem, we employ here a sequential resampling scheme (Rao et al. 1997). It is a re-sampling method in which sampling is carried out one by one with replacement until \((m + 1)\) distinct original observations appear, where m denotes the largest integer not exceeding \((1-e^{-1}) n\) and n is the sample size, i.e. the number of assessors. The information content of each bootstrap sample is kept constant by requiring that the number of distinct observations in each re-sample is fixed at approximately 0.632n. The sequential resampling scheme is equivalent to a weighted bootstrap with Poisson weights, which is second-order correct and consistent under general assumptions (Babu et al. 1999). In this scheme \((n-m)\) weights are randomly chosen and set to zero. The remaining m weights are independently chosen from the Poisson distribution with mean one and censored at zero. Note that the usual bootstrap scheme can be seen as a weighted bootstrap, where the weights are drawn from a multinomial distribution of size n with probabilities \((1/n,1/n, \ldots , 1/n)\).

The Poisson bootstrap is a weighted scheme which fits nicely into the optimization problem in Eq. (3), where the Poisson bootstrap weights are included in the main diagonal of the matrix \({\textbf {Q}}\) for the quadratic objective function, and in the vector \({\textbf {c}}\) for the linear objective function.

Recently, the great potential of the Poisson bootstrap was discovered for resampling algorithms that can scale to large or huge data problems in a big data framework. For the implementation of our procedure, it is not necessary to know the sample size n in advance; this property makes the Poisson resampling scheme a natural complement to the MapReduce paradigm (Chamandy et al. 2012), which can be considered one of the most popular computational tools used in big data analysis. When applied to signal estimation, as discussed here, the Poisson bootstrap can further reduce the size of the convex optimization problem in addition to the gain due to the restricted approach. This fact makes the non-parametric indirect inference procedure numerically highly efficient.

The Poisson bootstrap algorithm can be described as follows: Select a random sample of size m without replacement from \(\{1, 2, \dots , n\}\), say \(I =\{i_1, i_2, \dots , i_m\}\), and generate a random sample of size m from a truncated Poisson distribution with mean one, say \((V_{i_1}, V_{i_2}, \dots , V_{i_m})\). The Poisson bootstrap sample is defined by selecting each position \(i_j\) of the original sample, appearing in the set I, exactly \(V_{i_j}\) times. So the final bootstrap weighting vector is \({\textbf {v}}^* = (V_{1}^*, V_{2}^*, \dots , V_{n}^*)\) with \(V_{j}^* = V_{i_j}\) for \(j \in I\) and \(V_j^* = 0\) for \(j \notin I\).
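
A minimal sketch of one such draw of Poisson bootstrap weights, assuming n rankers and NumPy's random generator; the zero-valued Poisson draws are simply redrawn, which amounts to sampling from the zero-truncated Poisson distribution with mean-one parameter.

```python
import numpy as np

def poisson_bootstrap_weights(n, rng=None):
    """One weighting vector v* of length n: m = floor((1 - 1/e) * n) randomly
    chosen rankers receive a weight drawn from a Poisson(1) distribution
    truncated at zero; all remaining rankers receive weight zero."""
    rng = np.random.default_rng() if rng is None else rng
    m = int(np.floor((1.0 - np.exp(-1.0)) * n))
    idx = rng.choice(n, size=m, replace=False)   # the m retained rankers
    w = rng.poisson(lam=1.0, size=m)
    while np.any(w == 0):                        # redraw zeros -> zero-truncated Poisson(1)
        w[w == 0] = rng.poisson(lam=1.0, size=int(np.sum(w == 0)))
    v = np.zeros(n, dtype=int)
    v[idx] = w
    return v
```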

The weights \({\textbf {v}}^*\) can be easily incorporated in the objective function of the optimization problem. In the case of quadratic optimization we have

$$\begin{aligned} \sum _{i=1}^{p}{\sum _{j=1}^{n} V_j^* \times {z_{(i,j)}^2}}, \end{aligned}$$

and in the case of linear optimization

$$\begin{aligned} \sum _{i=1}^{p}{\sum _{j=1}^{n} V_j^* \times {z_{(i,j)}}}. \end{aligned}$$

The new objective functions use Poisson resampling weights, multiplying the error variables of the assessors by the respective weights and thus reducing the number of constraints and variables significantly. The main computational advantage of the Poisson bootstrap with respect to the classical bootstrap is therefore obtained by aggregating the repeated rankers via specific weights inside the objective functions. It is worth remarking that the reduction of the overall computational burden does not imply any reduction of the accuracy of the signal estimates obtained from the Poisson bootstrap, as is shown in the simulation results in Appendix A.

The signal and the error matrix are estimated from each bootstrap sample. All the involved values are positive. The final signal estimates are the means of the normalized signal estimates over the bootstrap samples \({\textbf {v}}^{*}_{b}\), \(b = 1,\dots , B\), where B is the number of bootstrap runs. Then we have

$$\begin{aligned} {\hat{\theta }}_{i}^* = \frac{\sum _{b=1}^{B} \hat{\theta }_{i,b}^* }{B}, \end{aligned}$$

where \({\hat{\theta }}_{i}^*\) is the signal mean of the \(i^{\text {th}}\) object and \({\hat{\theta }}_{i,b}^*\) is the estimate of the normalized signal \(\theta _{i}^{*}\) for bootstrap run b. The standard error for the \(i^{\text {th}}\) object is

$$\begin{aligned} SE_i = \sqrt{\frac{\sum _{b=1}^{B} {(\hat{\theta }_{i,b}^* - \hat{\theta }_{i}^*)}^2 }{B-1}}. \end{aligned}$$

The final error matrix is evaluated using all available bootstrap estimates. Finally, the sampling distribution generated by the Poisson bootstrap can be used to construct confidence regions for the estimated signals, among other inferential tasks.
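
The summaries above translate directly into a few lines of code; the sketch assumes the B per-run normalized signal estimates are collected row-wise in an array theta_boot (an illustrative name of our own) and takes the standard error as the sample standard deviation over the runs.

```python
import numpy as np

def bootstrap_summary(theta_boot):
    """theta_boot: (B x p) array, row b holding the normalized signal
    estimates of bootstrap run b. Returns the mean signal per object and
    its bootstrap standard error (sample SD over the B runs)."""
    return theta_boot.mean(axis=0), theta_boot.std(axis=0, ddof=1)
```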

4 Outline of the simulation experiments

We investigated several simulation scenarios focussing on \(p > > n\) for various combinations of numbers of rankers n and numbers of objects p. The estimation methods studied were the full approach with linear and with quadratic optimization, as well as the restricted, computationally more appealing, approach with linear and with quadratic optimization. In our numerical experiments, n takes the values 10, 20, 40, 80 and p the values 25, 50, 100, 200, 400, 800. We aimed to compare the signal estimation quality and the computer time consumption of the introduced estimation procedures under both convex optimization techniques. It was necessary to set the elements of the vector \({\textbf{b}}\) in Eq. (3) to a single constant b. To verify our assumption that b is a scaling constant in our estimation problem, we carried out a simulation experiment in which we studied the effect of a set of values \(b \in \{\log (1.01), \log (2),\ldots , \log (9), \log (10)\}\) on the estimation results for a variety of combinations of n and p, when applying the restricted linear and the restricted quadratic method. These results are summarized in Fig. 1. There is clearly no effect on the estimates \({\hat{\theta }}\) and the consolidated ranks. As assumed, the relative size of the metric signal scale is preserved. We finally settled on \(b=0.1\) in all our numerical experiments. Other values of b would only modify the range of the obtained estimates \({\hat{\theta }}\), which might be of interest in practical data analysis.

Fig. 1
figure 1

The effect of the choice of b on the signal estimates and the consolidated ranks for the restricted linear and the restricted quadratic method. Displayed are the boxplots of the obtained Pearson and Kendall’s \(\tau\) correlations

Starting from the statistical model in (2), for all simulation scenarios, the latent signals in the vector \(\theta =(\theta _1,\ldots ,\theta _p)\) were generated according to a standard half-normal distribution, \(\theta _i \sim \mid {\mathcal {N}}(0,1)\mid\), because its domain is restricted to \(x \in [0,\infty )\). To account for the diverse assessment ability of the rankers, we varied the standard deviations characterizing the distributions of the random errors \(Z_{ij}\) by drawing them from the uniform distribution \(\sigma \sim {\mathcal {U}}(0.4, 0.6)\). Then the random errors \(Z_{ij}\) were drawn from the half-normal distribution \(\mid {\mathcal {N}}(0,\sigma ^2)\mid\). Finally, we could calculate the measurement matrices \({\textbf{X}}\) and derive from them the input rank matrices \({\textbf{R}}\) for the different scenarios of interest.
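
A sketch of one simulated data set under this scenario (assuming, as the text suggests, one standard deviation per ranker):

```python
import numpy as np

def simulate_rank_matrix(p, n, rng=None):
    """One simulated input rank matrix R for the scenario of Sect. 4:
    half-normal signals, ranker-specific sigma ~ U(0.4, 0.6), half-normal errors."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.abs(rng.standard_normal(p))            # theta_i ~ |N(0, 1)|
    sigma = rng.uniform(0.4, 0.6, size=n)             # one sd per ranker
    Z = np.abs(rng.standard_normal((p, n))) * sigma   # Z_ij ~ |N(0, sigma_j^2)|
    X = theta[:, None] + Z                            # scores, model (2)
    R = (-X).argsort(axis=0).argsort(axis=0) + 1      # rank 1 = largest score
    return R, theta
```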

In a small simulation study, we compared the estimation quality of standard bootstrap and Poisson bootstrap (see Section A in the Appendix). For the evaluation of the introduced full and restricted procedures, 200 Monte Carlo (MC) repetitions with 500 Poisson bootstraps were calculated. For each Monte Carlo repetition, a new sample of the input matrix \({\textbf {R}}\) was drawn. As expected, there is no relevant outcome difference between these two resampling schemes. In the following, all reported simulation results refer to Poisson bootstrap.

The impact of incomplete rankings on the signal estimates and the obtained consolidated ranked lists was studied in simulations for all four proposed methods. We looked into three settings: \(p=20\) with \(n=50\), its permutation \(p=50\) with \(n=20\), and in addition a setting with \(n=80\). The considered percentages of randomly allocated missing ranks across all rankers were 5, 10, 20, and 40, compared to the situation of complete ranked lists (i.e. zero missing assignments). Each evaluation of all combinations of methods and settings was based on 50 runs.

Last but not least, we studied the effect on the obtained error matrix \({\hat{\textbf{Z}}}\) when two disjoint ranker groups are present, one producing complete and the other incomplete rank assignments. The settings were those from above, but now only half of the n rankers had missing rank assignments, at rates of 5, 10, 20, and 40 percent. Each evaluation was again based on 50 runs.

Metrics for the quality of the estimation results are the following: (i) Pearson correlation of the estimated signals \(\hat{\theta }\) versus the constructed latent (true) signals \(\theta\), and (ii) Kendall’s \(\tau\) of the consolidated (estimated) ranked list versus the ranked list due to the latent (true) signals. Moreover, for each approach in the main scenarios, the execution time in seconds is reported.
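
For completeness, a minimal sketch of these two quality metrics, assuming SciPy is available:

```python
from scipy.stats import kendalltau, pearsonr

def quality_metrics(theta_hat, theta_true, rank_hat, rank_true):
    """(i) Pearson correlation of estimated vs. true signals and
    (ii) Kendall's tau of the consolidated vs. true ranked list."""
    r, _ = pearsonr(theta_hat, theta_true)
    tau, _ = kendalltau(rank_hat, rank_true)
    return r, tau
```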

Fig. 2
figure 2

Comparison of all four methods for typical combinations of n and p. Displayed are the boxplots of the obtained Pearson and Kendall’s \(\tau\) correlations

5 Numerical evidence from the simulation experiments

Our first goal was to learn whether the full approach applying linear or quadratic convex optimization can be replaced by the computationally much more efficient restricted approach. To limit the overall computational burden of these simulations, we studied the following settings: \(n \in \{10, 20, 40, 80\}\) in all combinations with \(p \in \{25, 50, 100\}\).

Fig. 3
figure 3

Comparison of all four methods for typical combinations of n and p. Displayed are the barplots of computer time consumption

In Fig. 2 boxplots represent the Pearson correlation coefficients of the estimated signal values \(\hat{\theta }\) with the true signals \(\theta\), and the Kendall’s \(\tau\) coefficients of the estimated ranked list with the true ranked list. The full methods started to fail due to numerical overload for the setting \(n=80\) with \(p=100\) (the corresponding boxplots are missing in Fig. 2). What we can immediately see is that the median correlation across all runs is always higher than 0.9, thus very strong throughout all settings and all four approaches. Of course, increasing the number of rankers (i.e. more available information about the objects) brings about higher correlation coefficients.

Moreover, the median of the signal reconstruction metric tends towards one, not only for the full methods, as expected, but also for the restricted methods. Last but not least, there is virtually no difference between the linear and the quadratic optimization results. In summary, we can conclude that (i) all convex optimization approaches work reasonably well, and that (ii) the loss in estimation quality is negligible when restricted methods are used instead of full methods. In essence, the consequence is that the use of the full methods is not imperative. How drastic the gain in computer time is when one of the restricted approaches is applied can be seen in Fig. 3. There we see barplots of computer time consumption in seconds for each of the settings and methods displayed in Fig. 2. The more parameters are involved in the estimation process, the higher the numerical effort. Moreover, there is a clear indication that full methods require substantially more computer time, a fact that becomes critical when both n and p are large (see the right bottom part of Fig. 3). The full methods even failed for the setting \(n=80\) and \(p=100\).

Fig. 4
figure 4

Comparison of the restricted methods for typical combinations of n and p. Displayed are the boxplots of the obtained Pearson and Kendall’s \(\tau\) correlations

Fig. 5
figure 5

Comparison of the restricted methods for typical combinations of n and p. Displayed are the barplots of computer time consumption

In an additional simulation study, we also evaluated the robustness of the indirect inference approach (see Section B in the Appendix). Its aim was to quantify the impact of heterogeneous assessment abilities of rankers on the overall quality of the estimates. For the assumed 25% of random assessments, we observed sufficient robustness of all considered methods. However, there is an indication that linear methods are slightly less influenced by inappropriate rankers than quadratic methods.

Our next focus is on the numerical behavior of the restricted approach in large data sets. The following settings were studied: \(n \in \{10, 20, 40, 80\}\) in all combinations with \(p \in \{200, 400, 800\}\). The reason to limit p to 800 is the fact that for longer ranked lists it is unlikely that one and the same generating (assessment) process prevails. It does not make sense to estimate signals from lists where the rank positions of objects after some index value k (notion of top-k) are subject to random allocation. For a detailed discussion see Hall and Schimek (2012).

In Fig. 4 boxplots represent the Pearson correlation coefficients of the estimated signal values \(\hat{\theta }\) with the true signals \(\theta\), and the Kendall’s \(\tau\) coefficients of the estimated ranked list with the true ranked list. As in Fig. 2, the median correlation across all repetitions is always higher than 0.9. However, there is one limitation: the restricted quadratic method fails due to numerical problems in the convex optimization routine in three instances, \(n=80\) with \(p=400\) and \(p=800\), and \(n=40\) with \(p=800\). The restricted linear approach is stable throughout and provides excellent estimates, reflected in Pearson correlation coefficients of the signal very near to one. As already known from Fig. 2, an increase in the number of rankers brings about higher correlation coefficients. For those settings where we could obtain estimation results, there is virtually no difference between the linear and the quadratic optimization procedure. The conclusion is that convex optimization also works well for large data sets. However, certain settings with very large p require the numerically less demanding restricted linear approach, as can be seen in Fig. 5 showing the computer time consumption for all settings. There is a substantial difference in time demand between the two restricted methods.

Fig. 6
figure 6

Comparison of all methods for three combinations of n and p under different levels of incomplete ranking. Displayed are the boxplots of the obtained Pearson and Kendall’s \(\tau\) correlations over 50 runs

The impact of incomplete rankings on the signal estimates and the obtained consolidated ranked lists was studied in simulation experiments for all four proposed methods. In Fig. 6 boxplots represent the Pearson correlation coefficients of the estimated signal values \(\hat{\theta }\) with the true signals \(\theta\), and the Kendall’s \(\tau\) coefficients of the estimated ranked list with the true ranked list for 5, 10, 20, and 40 percent of randomly allocated missing ranks across all rankers. When we compare the correlation results across the different methods and parameter settings, there is no marked difference between them. For 5 and 10 percent of missing assignments, the Pearson correlations range approximately between 0.75 and 1. The Kendall’s \(\tau\) coefficients are slightly lower. Incomplete rankings have a stronger impact in the setting \(n < < p\), as we expected. In all considered settings, the correlation coefficients never drop below 0.5. We can conclude that all methods perform equally well, despite the extreme assumption of randomly assigned missing ranks.

Table 1 Hierarchical cluster separation of two disjoint ranker groups in restricted quadratic and restricted linear optimization

In a final simulation experiment, we studied the effect on the error matrix \({\hat{{\textbf{Z}}}}\) obtained from the restricted approaches when two disjoint ranker groups are present, one producing complete and the other incomplete rank assignments. The amount of unspecified object ranks varied between 5 and 40 percent of the total number p of objects. A hierarchical cluster analysis of the ranker-specific errors was performed for each setting, and the adjusted Rand index of cluster dissimilarity was calculated. As can be seen in Table 1, for 20 and 40 percent incomplete ranks, the heterogeneous two-group error structure was easily detected, independent of the approach or setting, reflected by adjusted Rand indices between 0.84 and 1. Moreover, for \(n > > p\), even the 5 percent incomplete rank group could be separated. We can conclude that the estimated \({\hat{{\textbf{Z}}}}\) matrix is highly informative for data exploratory purposes with respect to the involved assessors.

6 Comparison with the rank centrality method

In Sect. 1 we have already mentioned the rank centrality method of Negahban et al. (2017), which had been developed over years (see e.g. their proceedings paper of 2012). They proposed an iterative ‘aggregation’ algorithm for discovering scores for objects from pairwise comparisons. Sticking to the usual terms in the ranking literature, their procedure goes beyond aggregation techniques. Its focus is on the score of an object, which is named ‘rank centrality’. The algorithm has a natural random walk interpretation over the graph of objects with edges present between two objects if they are compared. The scores are the stationary probabilities of this random walk.

Fig. 7
figure 7

Comparison of the restricted linear and the restricted quadratic methods with the rank centrality algorithm for typical combinations of n and p. Displayed are the boxplots of the obtained Pearson and Kendall’s \(\tau\) correlations

The efficacy of the rank centrality method was established against the popular Bradley-Terry-Luce model, in which each object has an associated score. These scores are determined by the probabilistic outcomes of pairwise comparisons between the objects. Negahban et al. (2017) could prove that the finite sample error rates between the scores assumed by the Bradley-Terry-Luce model and those estimated by the rank centrality algorithm are bounded. The consequence is an order-optimal dependence on the number of samples (i.e. rankers) required to obtain valid scores. In simulations, it could be demonstrated that the rank centrality method performs as well as the maximum likelihood estimator of the Bradley-Terry-Luce model.

Fig. 8
figure 8

Comparison of the restricted linear and the restricted quadratic methods with the rank centrality algorithm for typical combinations of n and p. Displayed are the barplots of computer time consumption

Why is a comparison of our indirect inference approach with the rank centrality method of particular interest? First, like ours, it is a pairwise comparison approach; second, it aims at scores (i.e. intensities of preference), which we call signals; and third, it does not require distributional assumptions about the scores. Other points in favor of rank centrality are its novelty and its established optimality relationship with the classical and most popular probabilistic Bradley-Terry-Luce model. The latter is of central interest because in machine learning it is a benchmark model for pairwise comparison procedures and also under current theoretical investigation (Gao et al. 2021).

To compare our two restricted indirect inference methods with the rank centrality algorithm, we carried out simulations applying the same parameter settings as used for our results in Fig. 2. The rank centrality calculations were performed with the Python package choix, freely available from GitHub (https://github.com/lucasmaystre/choix). For the rank centrality method, we applied the default parameters, aside from \(\alpha\) controlling the regularization. The value \(\alpha =0.01\) performed best for the data under consideration.
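
As an illustration of this comparison setup, the sketch below converts a rank matrix into the pairwise comparisons implied by each ranker's ordering and passes them to choix; we assume here that the package exposes a function rank_centrality(n_items, data, alpha) taking a list of (winner, loser) index pairs, and the conversion itself is our own.

```python
import numpy as np
import choix  # https://github.com/lucasmaystre/choix

def rank_centrality_scores(R, alpha=0.01):
    """Scores from the rank centrality algorithm for a (p x n) rank matrix R,
    where R[i, j] is the rank given to object i by ranker j (1 = best)."""
    p, n = R.shape
    data = [(i, k)                      # object i beats object k under ranker j
            for j in range(n)
            for i in range(p)
            for k in range(p)
            if R[i, j] < R[k, j]]
    return choix.rank_centrality(p, data, alpha=alpha)
```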

As can be seen in Fig. 7, the signal estimates (Pearson correlation) and rank estimates (Kendall’s \(\tau\) correlation) are the same for all three methods apart from random fluctuations. As a direct consequence, we can assume that the restricted methods perform as well as rank centrality, and thus the optimality properties of the latter with respect to the Bradley-Terry-Luce model can be generalized to our approach. However, Fig. 8 makes one important difference apparent: computer time consumption. Our methods outperform the rank centrality algorithm throughout all settings. The more demanding the setting, the larger the difference in time demand.

7 World university ranking application

Nowadays, the world-wide academic status of universities receives much attention. Various institutions provide higher education rankings to serve potential students, university decision-makers, and industrial partners. In our application, we focus on three rating institutions with a long history and a high reputation: the Times Higher Education World University Ranking (THE), the Quacquarelli Symonds World University Ranking (QS), and the Academic Ranking of World Universities (ARWU, also known as Shanghai Ranking). Their rankings are based on various quality indicators and comprise 400 universities.

We have chosen this application because the ranking data of the three rating institutions are routinely summarized by the well-known Aggregate Ranking of Top Universities (ARTU). The University of New South Wales applies a conventional rank aggregation procedure to produce the ARTU consolidated ranking. For details see https://research.unsw.edu.au/artu/methodology. We wanted to use their aggregation result as ground truth for our indirect inference approach. Moreover, we had the opportunity to perform an in-depth analysis based on the individual indicators adopted by the rating institutions.

The THE ranking is based on five primary indicators comprising subindicators: (1) the Teaching indicator (contributes 30% to the overall ranking) is composed of the Reputation in Surveys (15%), Staff-to-Student Ratio (4.5%), Doctorate-to-Bachelor’s Ratio (2.25%), Doctorates-Awarded-to-Academic-Staff Ratio (6%), and Institutional Income (2.25%); (2) the Research indicator (contributes 30% to the overall ranking) is based on a Reputation Survey on Scientific Research (18%), the Research Income (6%), and Research Productivity (6%); (3) the Citations indicator (contributes 30% to the overall ranking) is based on the university’s influence in spreading new knowledge and ideas; (4) the International Outlook indicator (contributes 7.5% to the overall ranking) is composed of the Proportion of International Students (2.5%), the Proportion of International Staff (2.5%), and the Intensity of International Collaborations (2.5%); (5) the Industry Income indicator (contributes 2.5% to the overall ranking) characterizes the knowledge transfer from academia to companies. Detailed information about the THE ranking methodology is available at https://www.timeshighereducation.com.

The QS ranking comprises six different indicators: (1) The Academic Reputation indicator (contributes 40% to the overall ranking) is based on expert opinions of around 100,000 individuals in higher education institutions regarding teaching and research quality worldwide; (2) the Employer Reputation indicator (contributes 10%) is based on almost 50,000 responses to the QS Employer Survey asking for universities turning out the most competent, innovative, and effective graduates; (3) the Faculty to Student Ratio indicator (contributes 20%); (4) the Citations per Faculty indicator (contributes 20%); (5) the International Faculty Ratio (contributes 5%), and (6) the International Student Ratio (contributes 5%). Detailed information about the QS approach is available at https://www.topuniversities.com.

The ARWU ranking is built upon six different indicators: (1) the Alumni indicator (contributes 10% to the overall ranking) is based on the number of alumni of an institution that won a Nobel Prize or Fields Medal; (2) the Award indicator (contributes 20%) is the number of staff members that won a Nobel Prize or Fields Medal; (3) the HiCi indicator (contributes 20%) is the number of highly cited researchers selected by Clarivate Analytics; (4) the N&S indicator (contributes 20%) is the number of articles published in Nature and Science, currently between 2015 and 2019; (5) the PUB indicator (contributes 20%) is the total number of articles indexed in the Science Citation Index Expanded and the Social Science Citation Index, currently in 2019; (6) the PCP (contributes 10%) is the per capita academic performance of a higher education institution. Further information about the ARWU rating methodology is available at http://www.shanghairanking.com/.

As can be seen, the rating systems of THE, QS, and ARWU show substantial differences in the selection of indicators and their weighting. The current practice is to consolidate the rankings produced by different agencies by means of rank aggregation techniques. The Aggregate Ranking of Top Universities (ARTU) is based on THE, QS, and ARWU. ARTU’s aggregation technique is simple: the individual rank positions of each of the 400 universities are summed to form a numerical score, and all scores are then ranked. Such an aggregation technique lacks any probabilistic assumptions such as a statistical error model. In our suggested approach, we assume a latent signal variable plus some error reflecting the variation in the ability of rankers (i.e. indicators). This can make a substantial difference between a signal-based rank estimate and an aggregated rank. It is certainly not realistic to assume ‘correct’ rankings in empirical research.
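For illustration only, the ARTU-style aggregation can be reproduced in a few lines of R; the matrix R below is a simulated stand-in for the real 400 x 3 matrix of THE, QS, and ARWU rank positions.

```r
## Hypothetical input: a 400 x 3 matrix of rank positions
## (rows = universities, columns = THE, QS, ARWU); smaller ranks are better.
set.seed(1)
R <- replicate(3, sample(1:400))
rownames(R) <- paste0("univ_", 1:400)
colnames(R) <- c("THE", "QS", "ARWU")

score <- rowSums(R)                        # ARTU-style score: sum of the three rank positions
artu  <- rank(score, ties.method = "min")  # final aggregated rank
head(sort(artu), 10)                       # the ten best-placed universities
```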

We applied the restricted quadratic signal estimation approach with Poisson bootstrap for reasons of numerical efficiency. For a first analysis, we selected the same input as used for the ARTU aggregation technique, i.e. THE, QS, and ARWU as rankers (assessors of academic quality). The available data for the year 2020 are of size \(n=3\) and \(p=400\). For such a small number n of rankers, we can still reconstruct the latent consensus signals from the rank matrix \({\textbf{R}}\). However, it is not feasible to obtain bootstrap standard errors of the signals for rank stability considerations.

Fig. 9 Profile plot of university ranks ordered according to SB-WUR versus THE, QS, ARWU, and ARTU. The universities are divided into five quality groups (ranks 1-20, 21-40, 41-60, 61-80, 81-100)

For the year 2020, in Fig. 9 we display the university ranks resulting from our signal-based approach (abbreviated SB-WUR), the THE ranks, the QS ranks, the ARWU ranks, and the ARTU aggregate ranks as ground truth (from left to right). The Kendall’s \(\tau\) correlation between SB-WUR and the ground truth is 0.697. For reasons of visibility, the profile plot is restricted to the top-100 universities. It is interesting to see that top-100 SB-WUR universities can occupy ARTU rank positions up to almost 200.

Fig. 10 Notched boxplots of the signal estimates. The top-20 universities ranked according to the mean bootstrap signals (red solid line)

Table 2 The latent signal estimates, standard errors, associated ranks, and the ARTU aggregated ranks of the top-20 universities

A substantial overlap in rank order exists only for the top-40 institutions. The input lists from THE, QS, and ARWU are rather heterogeneous, which is no surprise given the diversity of the assessment concepts of the agencies. Their Kendall’s \(\tau\) correlations take the following values: \(\tau _{THE-QS}=0.560\), \(\tau _{THE-ARWU}=0.576\) and \(\tau _{QS-ARWU}=0.473\). This rather limited agreement between the input ranked lists motivated us to combine as much evaluation information as possible. Consequently, for our second analysis, we increased the number of rankers from three to the number of all suitable indicators in terms of Kendall’s \(\tau\) cross-correlations (see Fig. 17 in the Appendix C). We had to eliminate 7 out of the 22 reported indicators (i.e. rankers) because of insufficient ranker quality (see Table 6 in the Appendix C for the remaining 15 indicators). The data had to be cleaned for name conflicts, tied observations, and a substantial number of missing observations. Finally, we could run our restricted quadratic signal estimation procedure applying the Poisson bootstrap with \(n=15\) and \(p=300\). The obtained bootstrap signal estimation results are summarized in Table 2. The latent signal estimates \(\hat{\theta }\), standard errors \(SE(\hat{\theta })\), associated ranks \(\hat{\pi }\), and, for comparison, the aggregation results of ARTU are presented for the top-20 universities. As can be immediately seen, there is an enormous discrepancy between the rank positions obtained by simple rank aggregation and those produced by latent signal estimation. One should note that the information input in terms of indicators is the same but without pre-specified indicator weighting. Aggregation techniques ignore the implicit error structure of the input rank matrix. The indirect inference latent signal approach takes advantage of it.
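As a minimal R sketch of the Kendall’s \(\tau\) screening step, assume a hypothetical matrix ind_ranks with one column per indicator-based ranking; the simulated input and the screening threshold shown are purely illustrative and not the criterion actually applied.

```r
## Hypothetical input: 300 universities ranked by 22 indicators (one column per indicator).
set.seed(2)
ind_ranks <- replicate(22, sample(1:300))
colnames(ind_ranks) <- paste0("indicator_", 1:22)

## Kendall's tau cross-correlations between all indicator rankings (cf. Fig. 17, Appendix C).
tau <- cor(ind_ranks, method = "kendall")

## Illustrative screening rule: retain indicators with sufficient median agreement
## with the remaining ones (the actual selection rule may differ).
med_agree <- apply(tau, 1, function(x) median(x[x < 1]))
keep <- colnames(ind_ranks)[med_agree > 0.2]
```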

In Fig. 10 we see the top-20 universities of the second analysis ranked according to their mean bootstrap signals. For quality assessment, we have also displayed the notched boxplots. Asymmetric boxplots with large ranges, and sometimes a substantial number of outliers, give a clear indication of assessment heterogeneity. Also, the frequent discrepancies between mean and median tell us that there is no agreement among agencies on how to rank the respective universities. Only for the top-4 institutions (MIT, University of Oxford, Imperial College London, and ETH Zurich) do the means and medians coincide, pointing to full agreement. Only those universities whose interquartile ranges do not overlap are rather stable in terms of rank positions. As a consequence, MIT, University of Oxford, and Imperial College London could easily swap their ranks, whereas ETH Zurich is well separated in position four. The same is true for other institutions which ‘cluster’ at lower positions.

In summary, there is enormous volatility even in the top-20 range of the 300 signal-based ranked universities. This needs to be seriously considered in future evaluation exercises, especially when policy decisions are based on specific rank positions. A signal-based rank estimation approach certainly contributes to more fairness.

8 Kidney cancer gene expression application

The availability of large genomic data repositories has opened new perspectives for medical research. Genomic data sets always comprise large or even huge numbers p of relevant objects (i.e. genes) in relation to small or mid-size numbers n of rankers (i.e. patients). In this application, our interest is the search for genes that might be informative with respect to the kidney cancer survival of patients. For that purpose, we retrieved gene expression (mRNA) kidney cancer profiles from The Cancer Genome Atlas (TCGA) as provided by Rappoport and Shamir (2018). The obtained data matrix consists of 604 patients, of whom 202 did not survive and 402 survived. A total of 20531 genes was available for these patients. In the application in Sect. 7 the focus was on the objects in terms of the academic performance of higher education institutions. Here, the focus is on the genomic characterization of the rankers with respect to their survival, not on the objects themselves. The \(n < < p\) setting remains the same, but the available data set is substantially larger compared to the previous application.

In a first dimension-reduction step, we removed all zero-valued genes across the two patient groups. In a second step, we filtered by known cancer genes as specified and suggested by Schulte-Sasse et al. (2021). We then sampled \(n=50\) patients from each group (survival and non-survival). After the necessary transformations of their sequencing data (applying ”voom” of the ”R”-package ”limma”; Ritchie et al. 2015), we kept the 5% highest-variance genes and the 5% lowest-variance genes. Finally, we ended up with 906 genes for the downstream analysis of the two patient groups. For laboratory and study technical reasons, the patient-specific sequencing counts are not observed on the same scale. Therefore, we transformed them to a common ordinal scale. Each patient can be imagined as an independent ‘ranker’ of the set of 906 genes. For each group (survival and non-survival) of patients, we could thus form an input rank matrix for signal estimation.
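A minimal R sketch of this preprocessing pipeline is given below, assuming a raw count matrix counts (genes in rows, the sampled patients in columns) and a group factor; the simulated input and the omission of the cancer-gene filter of Schulte-Sasse et al. (2021) make this an illustration only.

```r
library(limma)  # provides voom() (Ritchie et al. 2015)

## Hypothetical raw counts (genes x patients) and survival status of the sampled patients.
set.seed(4)
counts <- matrix(rpois(2000 * 100, lambda = 20), nrow = 2000,
                 dimnames = list(paste0("gene_", 1:2000), paste0("patient_", 1:100)))
group <- factor(rep(c("non_survived", "survived"), each = 50))

counts <- counts[rowSums(counts) > 0, ]                 # step 1: remove all-zero genes
v      <- voom(counts, design = model.matrix(~ group))  # step 2: voom transformation
expr   <- v$E                                           # log2-CPM expression values

## step 3: keep the 5% highest-variance and the 5% lowest-variance genes
gene_var <- apply(expr, 1, var)
keep     <- gene_var >= quantile(gene_var, 0.95) | gene_var <= quantile(gene_var, 0.05)
expr     <- expr[keep, ]

## step 4: each patient (column) ranks the retained genes, yielding the input rank matrix
R <- apply(expr, 2, rank)
```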

Fig. 11 Non-survival group vs. survival group. Top-20 highest ranked genes ordered according to the signal estimates \(\hat{\theta }\) of the two groups. Gene names which are not shared by the two groups are marked in red

For estimation, we utilized the restricted quadratic approach with 1000 bootstrap iterations. For each patient group, the resulting consensus estimation of gene expression signals reflects the genome-wide gene importance ordering, pointing at potential differences in survival. We aimed at identifying group-specific top-ranked as well as bottom-ranked genes and their associated rank stabilities, which can be derived from the signal standard errors. The estimation was carried out in a fully non-parametric manner as outlined in Sect. 3.2.

As expected, in both groups the mean consensus signals show marked differences in their standard errors, with low variation in the top and in the bottom ranges (see Figs. 18 and 19 in the Appendix C). In Fig. 11 the largest signal estimates are displayed, ordered for the non-surviving patients on the left side and for the surviving patients on the right side. The ranges of the notched boxplots are larger when there are non-conforming assessments. Most of the top-ranked genes are related to vital functions. That is why they show up in both patient groups. However, there are two genes (marked in red), CDC42 and SET, which are not in the intersection of the two top-ranked object sets. For gene SET we know that it is a tumour-promoting factor (Kohyanagi et al. 2022). For an interpretation beyond this method’s application example, in-depth biomedical research would be required.

Fig. 12 Separation of a cluster of top genes with highly overlapping standard error regions in the survival group, supported by the elbow plot (top) and the dendrogram (bottom)

The 20 smallest (bottom-ranked) signal estimates are more heterogeneous in both groups, but still show a reasonable intersection (see Fig. 20 in Appendix C). The aspect of rank stability can be addressed in Fig. 12 for the surviving patients. The top subgraph shows the elbow plot of the top-20 genes represented by their signal estimates \(\hat{\theta }\) and standard errors \(SE(\hat{\theta })\). There is a clear separation between the first eight ranked genes and the rest. Among the top-8 genes, position swaps are likely due to overlapping standard error regions, but swaps between them and the genes ranked lower than position eight are most unlikely. This cutoff of eight was also found by hierarchical clustering (based on Euclidean distance and average linkage for merging) of the complete ensemble of the Poisson bootstrap sample results. The same exploration of the signal estimates of the non-surviving patients is displayed in Fig. 21 in Appendix C. There we can see a cut-off at position seven. The two group-specific \({\hat{\textbf{Z}}}\) matrices did not reveal any specific patterns of error structure. Therefore, we concluded that there is sufficient patient homogeneity for the application of our approach. For a conclusive interpretation of our sequencing-based omics findings, additional information about the patients, including clinical subgroups, would be needed.
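The clustering-based confirmation of such a cutoff can be sketched in R as follows; boot_theta is a hypothetical matrix of Poisson bootstrap signal estimates (replicates in rows, top-20 genes in columns), so the numbers are illustrative only.

```r
## Hypothetical bootstrap results: 1000 replicates of the signal estimates of 20 top genes.
set.seed(5)
boot_theta <- matrix(rnorm(1000 * 20,
                           mean = rep(seq(2, 0.1, length.out = 20), each = 1000),
                           sd = 0.05),
                     nrow = 1000)
colnames(boot_theta) <- paste0("gene_", 1:20)

## Hierarchical clustering of the genes based on their bootstrap signal profiles,
## using Euclidean distance and average linkage.
hc <- hclust(dist(t(boot_theta)), method = "average")
plot(hc)               # dendrogram; a clear split indicates the cutoff position
cutree(hc, k = 2)      # two-cluster partition separating the leading genes from the rest
```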

9 Conclusion

Beyond rank aggregation for consensus ranking, it is certainly advantageous to estimate the, usually unobservable, latent signals that inform a consensus ranking. Under the usual assumption of independent assessments, we have introduced an indirect inference approach that allows us to estimate the signals and their standard errors from multiple ranked lists. Our intention was to propose a scalable method which is most general with respect to the type of rank data and at the same time indifferent to the relative size of n and p. Moreover, we aimed at handling incomplete input rankings. From a computational point of view, our goal was the development of a less demanding technique compared to most probabilistic or distribution function approaches. In addition, for the estimation process we wished to keep the number of assumptions as low as possible. Finally, we did not want to involve stochastic optimization techniques because of the variability of their results.

In response to these ambitious goals, we have come up with a novel mathematical formulation of the signal estimation problem. It builds on the widely used concept of pairwise object rank comparisons, but without the need for a distance measure. Pairwise comparisons underpin ratings or preferences in many probabilistic ranking as well as machine learning models. We could demonstrate that the order relations can be represented by sets of constraints, enabling the use of strictly convex optimization. The latent signals are then recovered via global minimization of the errors induced by the assessors. As shown, linear as well as quadratic objective functions can serve this purpose. The transitivity property of rank scales allows us to substantially reduce the number of constraints associated with the full set of object comparisons.
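To make the constraint representation concrete, the following display is a schematic sketch of such a strictly convex program; the margin \(\delta\), the slack variables \(z\), and the normalization are illustrative choices and not necessarily the exact formulation of Sect. 3.

\[
\begin{aligned}
\min_{\theta,\,z}\ & \sum_{j=1}^{n}\ \sum_{(a,b)\in \mathcal{C}_j} \bigl(z_{ab}^{(j)}\bigr)^2 \\
\text{subject to}\ & \theta_a - \theta_b + z_{ab}^{(j)} \ge \delta, \qquad (a,b)\in \mathcal{C}_j,\ j=1,\dots,n, \\
& z_{ab}^{(j)} \ge 0, \qquad \sum_{i=1}^{p} \theta_i = 1,
\end{aligned}
\]

where \(\mathcal{C}_j\) denotes the (transitivity-reduced) set of pairs for which assessor \(j\) ranks object \(a\) above object \(b\).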

Instead of the classical bootstrap, we apply the Poisson bootstrap to estimate the signals and their error bounds. The Poisson bootstrap is a weighted scheme which fits nicely into our optimization problem, with the effect of drastically reducing the computational burden. In the context of signal reconstruction from ranking data, the bootstrap methodology allows for various kinds of evaluation of the obtained signal estimates and consensus ranks. This makes it easy, for instance, to explore ranker homogeneity with respect to their assessments or to explore rank order instabilities.
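As a minimal sketch of the weighting idea, assume each ranker receives an independent Poisson(1) weight per bootstrap replicate; estimate_signal() is a hypothetical placeholder for the restricted optimization routine, not a function of the TopKSignal package.

```r
## Poisson bootstrap sketch: instead of resampling rankers, each ranker's
## contribution to the objective is reweighted by an independent Poisson(1) draw.
poisson_bootstrap <- function(R, B = 1000, estimate_signal) {
  n <- ncol(R)  # number of rankers (columns of the rank matrix)
  replicate(B, {
    w <- rpois(n, lambda = 1)          # ranker-specific Poisson(1) weights
    estimate_signal(R, weights = w)    # hypothetical weighted re-estimation of the signals
  })
}
## Bootstrap means and standard errors of the signals are then obtained from the
## B replicate estimates, e.g. via rowMeans() and apply(..., 1, sd).
```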

Due to the numerical efficiency of the proposed approach, it scales to large or even huge data, as demonstrated in our simulation experiments and applications. Moreover, it allows the handling of \(n < < p\) problems and of incomplete ranking data. Various data scenarios were studied in simulation experiments. The achieved results fully support the proposed approach with respect to the estimation quality of the signals and the derived consensus ranks. The order of the values of the estimated signals represents a consensus ranking of the relative importance or preference intensity of the objects across the observed ranked lists, as an alternative to standard rank aggregation techniques. We not only produce consensus lists, as in rank aggregation, but can also quantify the relative stability of object ranks through exploratory tools when the number n of assessors is limited, or through inferential tools for sufficiently large n.

In numerical experiments, the estimation precision and the numerical efficiency of the indirect inference approach were demonstrated in comparison to the rank centrality method by Negahban et al. (2017). Apart from random fluctuations, the estimates of both methods coincide. However, in larger data problems, the rank centrality algorithm is clearly outperformed by our proposed approach. By analogy, we can assume that the theoretical efficacy results of rank centrality with respect to the popular Bradley-Terry-Luce benchmark model also apply to our approach.

Let us finally consider the limitations and advantages of our proposed methodology against the background of state-of-the-art signal (score) estimation methods. The strongest competitors are Bayesian probability approaches and machine learning approaches.

Let us compare with the most versatile Bayesian Mallows model by Vitelli et al. (2018). It is limited to right-invariant distances and requires various distributional assumptions. We do not need to specify any distance measure or probability distribution. Both approaches allow for incomplete rankings. Vitelli et al. (2018) can handle and classify heterogeneous assessments. Our approach cannot do that but is rather robust against ranker heterogeneity. Further, our approach is indifferent with respect to the relative size of n and p, whereas the Bayesian Mallows model is not. On one point, however, we clearly outperform this and any other stochastic optimization-based approach: numerical speed and the size of estimation problems we can process.

A machine learning state-of-the-art approach is the rank centrality algorithm by Negahban et al. (2017), which is formally related to the popular Bradley-Terry-Luce model. Thus they share some characteristics. Apart from the Euclidean distance (between the estimated and the true underlying score vectors), no distance measures are involved. A restricted number of comparisons (i.e. incomplete ranking data) is permissible. As in our approach, heterogeneous assessments and covariate information cannot be handled. In small to mid-size data problems, the numerical performance of the rank centrality algorithm and of our approach is about the same. However, in larger data problems we substantially outperform the rank centrality algorithm while obtaining the same quality of estimated scores (signals). Unlike our proposal, rank centrality does not provide inferential information for rank quality evaluation.

For the sake of practical evidence, we have worked out two completely different applications, one from higher education evaluation and the other from molecular cancer research. In both cases the obtained findings are most relevant to their respective research fields. In the first application, the world university rankings, our approach can contribute to a better understanding of results published by higher education ranking agencies and might stimulate a critical discussion of stand-alone aggregation evidence and its interpretation. In the second application, the analysis of gene expression sequencing data of surviving versus non-surviving patients, the motivation is completely different. Although we can observe metric data, analyzing them in a comparative fashion is precluded because we cannot obtain a common scale of measurement for laboratory technical reasons. Consequently, we have proposed to convert the measurements into ranks and then to estimate the consensus signals independently for each patient group. This strategy makes a comparative analysis feasible.

In conclusion, this novel approach has the potential to overcome major numerical limitations of recent probabilistic as well as certain machine learning approaches. Moreover, it can replace conventional rank aggregation techniques, with the advantage of providing a formal handle for the critical judgement of the obtained consensus rank positions. In future work, we hope to develop additional inferential tools and wish to explore ways of developing non-parametric test statistics for the comparison of subgroups of rankers. The software package TopKSignal, comprising the R procedures we have implemented for this article, is available on CRAN. Any future extensions of our approach will also be made publicly available in TopKSignal.