1 Introduction

Quantification is a supervised learning task that consists of training a predictor, on a set of labeled data items, that estimates the relative frequencies \(p_{\sigma }(y_{i})\) (a.k.a. prevalence values, or prior probabilities, or class priors) of the classes of interest \(\mathcal {Y}=\{y_{1}, \dots , y_{n}\}\) in a bag (or multi-set) \(\sigma = \{\textbf{x} \in \mathcal {X}\}\) of unlabeled data items \(\textbf{x}\) (Forman 2005)—see also (González et al. 2017; Esuli et al. 2023) for recent surveys. In other words, a trained quantifier (i.e., an estimator of class prevalence values) must return a predicted distribution \(\hat{\textbf{p}}_{\sigma }=(\hat{p}_{\sigma }(y_{1}), \dots , \hat{p}_{\sigma }(y_{n}))\) of the classes for the unlabeled bag \(\sigma\), where \(\hat{\textbf{p}}_{\sigma }\) must coincide as much as possible with the true, unknown distribution \(\textbf{p}_{\sigma }\). Quantification is also known as “learning to quantify”, “supervised class prevalence estimation”, and “class prior estimation”.

Quantification is important in many disciplines, e.g., market research, political science, ecological modeling, the social sciences, and epidemiology. By their own nature, these disciplines are only interested in aggregate (as opposed to individual) data. Hence, classifying individual unlabeled instances is usually not a primary goal in these fields, while estimating the prevalence values \(p_{\sigma }(y_{i})\) of the classes of interest is. For instance, when classifying the tweets about a certain entity (e.g., about a political candidate) as displaying either a Positive or a Negative stance towards the entity, political scientists or market researchers are usually not interested in the class of a specific tweet, but in the fraction of these tweets that belong to each class (Gao and Sebastiani 2016).

A predicted distribution \(\hat{\textbf{p}}_{\sigma }\) could, in principle, be obtained by means of the “classify and count” method (CC), i.e., by training a standard classifier, classifying all the unlabeled data items in \(\sigma\), and computing the fractions of data items that have been assigned to each class in \(\mathcal {Y}\). However, it has been shown that CC delivers poor prevalence estimates, and especially so when the application scenario suffers from prior probability shift (Moreno-Torres et al. 2012), the (ubiquitous) phenomenon according to which the distribution \(\textbf{p}_{U}\) of the unlabeled test data items U across the classes is different from the distribution \(\textbf{p}_{L}\) of the labeled training data items L. As a result, a plethora of quantification methods have been proposed in the literature—see e.g., Bella et al. (2010), Esuli et al. (2018), González-Castro et al. (2013), Pérez-Gállego et al. (2019), González and del Coz (2021), Saerens et al. (2002)—whose goal is to generate accurate class prevalence estimations even in the presence of prior probability shift.

The vast majority of the methods proposed so far deals with quantification tasks in which \(\mathcal {Y}\) is a plain, unordered set. Very few methods, instead, deal with ordinal quantification (OQ), the task of performing quantification on a set of \(n>2\) classes on which a total order “\(\prec\)” is defined. Ordinal quantification is important, though, because totally ordered sets of classes (“ordinal scales”) arise in many applications, especially ones involving human judgments. For instance, in a customer satisfaction endeavor, one may want to estimate how a set of reviews of a certain product is distributed across the set of classes \(\mathcal {Y}=\){1Star, 2Stars, 3Stars, 4Stars, 5Stars}, while a social scientist might want to find how inhabitants of a certain region are distributed in terms of their happiness with health services in the area, i.e., how they are distributed across the classes in \(\mathcal {Y}=\){VeryUnhappy, Unhappy, Happy, VeryHappy}.

As a field, quantification is inherently related to the field of classification. This is especially true of the so-called “aggregative” family of quantification algorithms, which, in order to return prevalence estimates for the classes of interest, rely on the output of an underlying classifier. As such, a natural and straightforward approach to ordinal quantification might simply consist of replacing, within a multi-class aggregative quantification method, the standard multi-class classifier with an ordinal classifier, i.e., with a classifier specifically devised for classifying data items according to an ordered scale. However, the experiments we have run (see Sect. 6.3) show that this simple solution does not suffice; instead, actual OQ methods are required.

This paper is an extension of an initial study on OQ that we conducted recently (Bunse et al. 2022). It contributes to the field of OQ in four ways.

First, we develop and make publicly available two datasets for evaluating OQ algorithms, one consisting of textual product reviews and one consisting of telescope observations. Both datasets stem from scenarios in which OQ arises naturally, and they are generated according to a strong, well-tested protocol for generating datasets oriented to the evaluation of quantifiers. This contribution fills a gap in the state of the art because the datasets that have previously been used for the evaluation of OQ algorithms were inadequate, for reasons we discuss in Sect. 2.

Second, we perform the most extensive experimental comparison of OQ algorithms proposed in the literature to date, using the two previously mentioned datasets. This contribution is important because some algorithms (e.g., the ones of Sect. 4.3.1 and 4.3.2) have so far been evaluated only on an arguably inadequate test-bed (see Sect. 2) and because other algorithms (e.g., the ones of Sect. 4.3 and 4.4) have been developed by authors from very different research fields, such as data mining and astrophysics, who were unaware of each other's developments.

Third, we formulate an ordinal plausibility assumption, i.e., the assumption that ordinal distributions that appear in practice tend to be “smooth”. Here, a smooth distribution is one that can be represented by a histogram with at most a limited number of (upward or downward) “humps”. We informally show that this assumption holds in many real-world applications.

Fourth, we propose a class of new OQ algorithms, which introduces ordinal regularization into existing quantification methods. The effect of this regularization is to discourage the prediction of distributions that are not smooth and, hence, would tend to lack plausibility in OQ tasks. Using the datasets mentioned above, we run extensive experiments which show that our algorithms, which are based on ordinal regularization, outperform their state-of-the-art competitors. In the interest of reproducibility, we make publicly available all the datasets and all the code that we use.

This paper is organized as follows. In Sect. 2 we review past work on ordinal quantification. Sect. 3 is devoted to presenting preliminaries, including an illustration of the evaluation measures that we are going to use in the paper (Sect. 3.2) and our formulation of the ordinal plausibility assumption (Sect. 3.3). In Sect. 4 we present previously proposed ordinal quantification algorithms, while in Sect. 5 we detail the ones that we propose in this work. Section 6 is devoted to our experimental comparison of new and existing OQ algorithms. In Sect. 7 we look back at the work we have done and discuss alternative notions of ordinal plausibility. We finish in Sect. 8 by giving concluding remarks and by discussing future work. The Appendix includes a discussion on how reasonable it is to postulate the smoothness of real-life ordinal distributions (Appendix 1), and additional experimental results obtained by using alternative measures of the prediction error of ordinal quantifiers or by using alternative datasets (Appendix 2).

2 Related work

Quantification, as a task in its own right, was first proposed by Forman (2005), who observed that some applications of classification only require the estimation of class prevalence values and that better methods than “classify and count” can be devised for this purpose. Since then, many methods for quantification have been proposed (González et al. 2017; Esuli et al. 2023). However, most of these methods tackle the binary and/or multi-class problem with unordered classes. Ordinal quantification was first discussed in Esuli and Sebastiani (2010), where an evaluation measure (the Earth Mover’s Distance—see Sect. 3.2) was proposed for it. However, it was not until 2016 that the first true OQ algorithms were developed, the Ordinal Quantification Tree (OQT—see Sect. 4.3.1) by Da San Martino et al. (2016) and Adjusted Regress and Count (ARC—see Sect. 4.3.2) by Esuli (2016). In the same years, the first data challenges that involved OQ were staged (Nakov et al. 2016; Rosenthal et al. 2017; Higashinaka et al. 2017). However, except for OQT and ARC, the participants in these challenges used “classify and count” with highly optimized classifiers, instead of true OQ methods; this attitude persisted also in later challenges (Zeng et al. 2019, 2020), likely due to a general lack of awareness in the scientific community that more accurate methods than “classify and count” existed.

Unfortunately, the data challenges in which OQT and ARC were evaluated (Nakov et al. 2016; Rosenthal et al. 2017) tested each quantification method only on a single bag of unlabeled data items, which consisted of the entire test set. This evaluation protocol is not adequate for quantification because quantifiers issue predictions for sets of data items, not for individual data items as in classification. Measuring a quantifier’s performance on a single bag is thus akin to, and as insufficient as, measuring a classifier’s performance on a single data item. As a result, our current knowledge of the relative merits of OQT and ARC lacks solidity.

However, even before the previously mentioned developments had taken place, methods that we would now call OQ algorithms had been proposed within experimental physics. In this field, one often needs to estimate the distribution of a continuous physical quantity. However, physicists consider a histogram approximation of a continuous distribution sufficient for many physics-related analyses (Blobel 2002). This conventional simplification essentially maps the values of a continuous target quantity into a set of classes endowed with a total order, and the problem of estimating the continuous distribution becomes one of OQ (Bunse 2022b). Early on, physicists termed this problem “unfolding” (Blobel 1985; D’Agostini 1995), a term that was unfamiliar to data mining / machine learning researchers and that, hence, prevented them from realizing that the “ordinal quantification” algorithms they used and the “unfolding” algorithms that physicists used were actually addressing the very same task. This connection was discovered only recently by Bunse (2022b), who argued that OQ and unfolding are in fact the same problem. In the following, we deepen these connections, finding that ordinal regularization techniques proposed in the physics literature are able to improve the performance of well-known quantification methods at OQ.

Castaño et al. (2024) have recently proposed a different approach to OQ. This approach does not rely on regularization, but on loss functions tailored to the OQ setting. The two approaches are orthogonal, in the sense that they target different characteristics of quantification algorithms and can hence be combined. In this paper, we therefore extend our initial study (Bunse et al. 2022) with combinations of the two approaches, i.e., with algorithms that use ordinal loss functions in conjunction with ordinal regularization.

3 Preliminaries

In this section, we introduce our notation, discuss measures for evaluating the prediction error of OQ methods, and provide a measure for evaluating the smoothness of ordinal distributions. These measures will help us better understand the OQ methods presented in Sects. 4 and 5.

3.1 Notation

By \(\textbf{x} \in \mathcal {X}\) we indicate a data item drawn from a domain \(\mathcal {X}\), and by \(y \in \mathcal {Y}\) we indicate a class drawn from a set of classes \(\mathcal {Y}=\{y_{1}, \dots , y_{n}\}\), also known as a code frame; in this paper we will only consider code frames with \(n>2\), on which a total order “\(\prec\)” is defined. The symbol \(\sigma\) denotes a bag, i.e., a non-empty set of unlabeled data items in \(\mathcal {X}\), while \(L\subset \mathcal {X}\times \mathcal {Y}\) denotes a set of labeled data items \((\textbf{x},y)\), which we use to train our quantifiers.

By \(p_{\sigma }(y)\) we indicate the true prevalence of class y in \(\sigma\), by \(\hat{p}_{\sigma }(y)\) we indicate an estimate of this prevalence, while by \(\hat{p}_{\sigma }^{Q}(y)\) we indicate an estimate of \(p_{\sigma }(y)\) as obtained by a quantification method Q that receives \(\sigma\) as input. By \(\textbf{p}_{\sigma }=(p_{\sigma }(y_{1}), \dots , p_{\sigma }(y_{n}))\) we indicate a distribution of the elements of \(\sigma\) across the classes in \(\mathcal {Y}\); \(\hat{\textbf{p}}_{\sigma }\) and \(\hat{\textbf{p}}_{\sigma }^{Q}\) can be interpreted analogously. All of \(\textbf{p}_{\sigma }\), \(\hat{\textbf{p}}_{\sigma }\), \(\hat{\textbf{p}}_{\sigma }^{Q}\), are probability distributions, i.e., are elements of the unit (n-1)-simplex \(\varDelta ^{n-1}\) (aka probability simplex, or standard simplex), defined as

$$\begin{aligned} \varDelta ^{n-1}=\left\{ (p_{1}, \ldots ,p_{n}) \in \mathbb {R}^n : p_{i} \ge 0, \sum _{i=1}^{n} p_{i}=1\right\} \end{aligned}$$
(1)

In other words, \(\varDelta ^{n-1}\) is the domain of all vectors that represent probability distributions over \(\mathcal {Y}\).

As customary, we use lowercase boldface letters (\(\textbf{p}\), \(\textbf{q}\), ...) to denote vectors, and uppercase boldface letters (\(\textbf{M}\), \(\textbf{C}\), ...) to denote matrices or tensors; we use subscripts to denote their elements and projections, e.g., we use \(\textbf{p}_{i}\) to denote the i-th element of \(\textbf{p}\), \(\textbf{M}_{ij}\) to denote the element of \(\textbf{M}\) at the i-th row and j-th column, and bullets to indicate projections (with, e.g., \(\textbf{M}_{i\bullet }\) indicating the i-th row of \(\textbf{M}\)). We indicate distributions in boldface in order to stress the fact that they are vectors of class prevalence values and because we will formulate most of our quantification methods by using matrix notation. We will often write \(\textbf{p}\), \(\hat{\textbf{p}}\), \(\hat{\textbf{p}}^{Q}\), instead of \(\textbf{p}_{\sigma }\), \(\hat{\textbf{p}}_{\sigma }\), \(\hat{\textbf{p}}_{\sigma }^{Q}\), thus omitting the indication of \(\sigma\) when clear from context.

3.2 Measuring quantification error in ordinal contexts

The main function for measuring quantification error in ordinal contexts that we use in this paper is the Normalized Match Distance (NMD), defined by Sakai (2018) as

$$\begin{aligned} \begin{aligned} {{\,\textrm{NMD}\,}}(\textbf{p},\hat{\textbf{p}}) =&\frac{1}{n-1}{{\,\textrm{MD}\,}}(\textbf{p},\hat{\textbf{p}}) \end{aligned} \end{aligned}$$
(2)

where \(\frac{1}{n-1}\) is just a normalization factor that allows NMD to range between 0 (best prediction) and 1 (worst prediction). Here, MD is the well-known Match Distance (Werman et al. 1985), defined as

$$\begin{aligned} \begin{aligned} {{\,\textrm{MD}\,}}(\textbf{p},\hat{\textbf{p}}) =&\sum _{i=1}^{n-1} d(y_{i},y_{i+1})\cdot |\hat{P}(y_{i})-P(y_{i})| \end{aligned} \end{aligned}$$
(3)

where \(\smash {P(y_{i})=\sum _{j=1}^{i}p(y_{j})}\) is the prevalence of \(y_{i}\) in the cumulative distribution of \(\textbf{p}\), \(\hat{P}(y_{i})= \sum _{j=1}^{i}\hat{p}(y_{j})\) is an estimate of it, and \(d(y_{i},y_{i+1})\) is the “semantic distance” between consecutive classes \(y_{i}\) and \(y_{i+1}\), i.e., the cost we incur in mistaking \(y_{i}\) for \(y_{i+1}\) or vice versa. Throughout this paper, we assume \(d(y_{i},y_{i+1}) = 1\) for all \(i\in \{1, 2, \dots , n-1\}\).

MD is a widely used measure in OQ evaluation (Esuli and Sebastiani 2010; Nakov et al. 2016; Rosenthal et al. 2017; Da San Martino et al. 2016; Bunse et al. 2018; Castaño et al. 2024), where it is often called Earth Mover’s Distance (EMD); in fact, MD is a special case of EMD as defined by Rubner et al. (1998). Since NMD and MD differ only by a fixed normalization factor, our experiments closely follow the tradition in OQ evaluation. The use of NMD is advantageous because the presence of the normalization factor \(\frac{1}{n-1}\) allows us to compare results obtained on different datasets characterized by different numbers n of classes; this would not be possible with MD or EMD, whose scores tend to increase with n.

To obtain an overall score for a quantification method Q on a dataset, we apply Q to each test bag \(\sigma\). The resulting estimated distribution \(\hat{\textbf{p}}_\sigma ^{Q}\) is then compared to the true distribution \(\textbf{p}_\sigma\) via NMD, which yields one NMD value for each test bag. The final score for method Q is the average NMD value across all bags \(\sigma\) in the test set, which characterizes the average prediction error of Q. We test for statistically significant differences between quantification methods using a paired Wilcoxon signed-rank test.
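By way of illustration, under our assumption \(d(y_{i},y_{i+1})=1\), NMD can be computed in a few lines of Python (a minimal sketch; the function name is ours):

```python
import numpy as np

def nmd(p_true, p_hat):
    """Normalized Match Distance (Eq. 2) with unit semantic distances.

    With d(y_i, y_{i+1}) = 1, MD (Eq. 3) is the sum of absolute differences
    between the two cumulative distributions over the first n-1 classes;
    dividing by n-1 normalizes the score to the range [0, 1].
    """
    p_true = np.asarray(p_true, dtype=float)
    p_hat = np.asarray(p_hat, dtype=float)
    n = len(p_true)
    md = np.abs(np.cumsum(p_hat) - np.cumsum(p_true))[:-1].sum()
    return md / (n - 1)
```

For instance, nmd((0.2, 0.3, 0.5), (0.5, 0.3, 0.2)) evaluates to 0.3, while placing all probability mass on one extreme class when the truth lies entirely on the other extreme yields the worst possible score of 1.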

3.3 Measuring the plausibility of distributions in ordinal contexts

Any probability distribution over \(\mathcal {Y}\) is a legitimate ordinal distribution. However, some ordinal distributions, though legitimate, are hardly plausible, i.e., they hardly occur in practice. For instance, assume that we are dealing with how a set of book reviews is distributed across the set of classes \(\mathcal {Y}=\){1Star, 2Stars, 3Stars, 4Stars, 5Stars}; a distribution such as

$$\textbf{p}_{\sigma _{1}}=(0.20, 0.10, 0.05, 0.20, 0.45)$$

is both legitimate and plausible, while a distribution such as

$$\textbf{p}_{\sigma _{2}}=(0.02, 0.47, 0.02, 0.47, 0.02)$$

is legitimate but hardly plausible.

What makes \(\textbf{p}_{\sigma _{2}}\) lack plausibility is the fact that it describes a highly dissimilar behavior of neighboring classes, despite the semantic similarity that ordinality imposes on the class neighborhood. As shown in Fig. 1, the dissimilarity of neighboring classes in \(\textbf{p}_{\sigma _{2}}\) manifests in sharp “humps” of prevalence values. For instance, a sequence (0.02, 0.47, 0.02) of prevalence values, such as the one that occurs in \(\textbf{p}_{\sigma _{2}}\) for the last three classes (an “upward” hump), hardly occurs in practice. Sequences such as (0.47, 0.02, 0.47), such as the one that occurs in \(\textbf{p}_{\sigma _{2}}\) for the middle three classes (a “downward” hump), also hardly occur in practice.

Fig. 1

Two ordinal distributions \(\textbf{p}_{\sigma _{1}}\) (blue circles) and \(\textbf{p}_{\sigma _{2}}\) (red triangles). The interpolating lines are displayed only for establishing a visual coherence among the dots (Color figure online)

In the rest of this paper, a smooth ordinal distribution is one that tends not to exhibit (upward or downward) humps of prevalence values across consecutive classes; conversely, a jagged ordinal distribution is one that tends to exhibit such humps. We will thus take smoothness to be a measure of ordinal plausibility, i.e., a measure of how likely it is, for a distribution with a certain form, to occur in real-life applications of OQ.

As a measure of the jaggedness (the opposite of smoothness) of an ordinal distribution we propose using

$$\begin{aligned} \xi _{1}(\textbf{p}_{\sigma }) = \ \frac{1}{\min (6,n+1)}\sum _{i=2}^{n-1}(-p_{\sigma }(y_{i-1})+2\cdot p_{\sigma }(y_{i})-p_{\sigma }(y_{i+1}))^{2} \end{aligned}$$
(4)

where \(\frac{1}{\min (6,n+1)}\) is just a normalization factor to ensure that \(\xi _{1}(\textbf{p}_{\sigma })\) ranges between 0 (least jagged) and 1 (most jagged); therefore, \(\xi _{1}(\textbf{p}_{\sigma })\) is a measure of jaggedness and (1-\(\xi _{1}(\textbf{p}_{\sigma })\)) a measure of smoothness.

The intuition behind Eq. 4 is that, for an ordinal distribution to be smooth, the prevalence of a class \(y_{i}\) should be as similar as possible to the average prevalence of its two neighboring classes \(y_{i-1}\) and \(y_{i+1}\); \(\xi _{1}(\textbf{p}_{\sigma })\) is nothing else than a (normalized) sum of these (squared) differences across the classes in the code frame. In our example above, \(\xi _{1}(\textbf{p}_{\sigma _{1}})=0.009\) indicates a very smooth distribution and \(\xi _{1}(\textbf{p}_{\sigma _{2}})=0.405\) indicates a fairly jagged distribution.
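In code, Eq. 4 amounts to a normalized sum of squared discrete second differences; the following sketch (the function name is ours) reproduces the two scores above:

```python
import numpy as np

def jaggedness(p):
    """Jaggedness xi_1 of an ordinal distribution (Eq. 4): the normalized
    sum of the squared second differences -p(y_{i-1}) + 2 p(y_i) - p(y_{i+1})
    over the interior classes i = 2, ..., n-1."""
    p = np.asarray(p, dtype=float)
    n = len(p)
    second_diff = -p[:-2] + 2 * p[1:-1] - p[2:]
    return np.sum(second_diff ** 2) / min(6, n + 1)
```

Applied to the book-review example, jaggedness((0.20, 0.10, 0.05, 0.20, 0.45)) yields 0.00875 (0.009 after rounding) and jaggedness((0.02, 0.47, 0.02, 0.47, 0.02)) yields 0.405.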

By way of example, Fig. 2 displays the class distributions for each of the 28 product categories in the ordinal dataset of 233.1M Amazon product reviews made available by McAuley et al. (2015) (see also Sect. 6.1.2), while Fig. 3 displays the class distribution of the ordinal dataset of the FACT telescope (see also Sect. 6.1.3). It is evident from these figures that all these ordinal distributions are fairly smooth, in the sense indicated above. For instance, the 28 class distributions from the Amazon dataset tend to exhibit a moderate downward hump in the first three classes (or in the last three classes), but tend to be smooth elsewhere, with their value of \(\xi _{1}(\textbf{p}_{\sigma })\) ranging in [0.007,0.037]; likewise, the class distribution for the FACT telescope also tends to exhibit an upward hump in classes 4 to 6 but to be smooth elsewhere, with a value of \(\xi _{1}(\textbf{p}_{\sigma })=0.0115\). Appendix 1 presents other real-life examples, which show that smoothness is a pervasive phenomenon in ordinal distributions.

Fig. 2

The class distribution \(\textbf{p}_\sigma\) of each of the 28 product categories in the Amazon dataset (see Sect. 6.1.2). The categories are ordered (from left to right, then from top to bottom) in terms of their \(\xi _{1}(\textbf{p}_{\sigma })\) score

Fig. 3

The class distribution \(\textbf{p}_\sigma\) of the ordinal dataset of the FACT telescope (see Sect. 6.1.3), along with its \(\xi _{1}(\textbf{p}_{\sigma })\) score

It is easy to see that the most jagged distribution (\(\xi _{1}(\textbf{p}_{\sigma })\)=1) is not unique; for instance, assuming a 7-point scale, the distributions

$$\begin{aligned} (0.000, 0.000, 1.000, 0.000, 0.000, 0.000, 0.000)\\(0.000, 0.000, 0.000, 1.000, 0.000, 0.000, 0.000)\\ (0.000, 0.000, 0.000, 0.000, 1.000, 0.000, 0.000)\end{aligned}$$

are the most jagged distributions (\(\xi _{1}(\textbf{p}_{\sigma })\)=1). The least jagged distribution is also not unique; examples of least jagged distributions (\(\xi _{1}(\textbf{p}_{\sigma })\)=0) on a 5-point scale are

$$\begin{aligned} (0.200, 0.200, 0.200, 0.200, 0.200)\\(0.198, 0.199, 0.200, 0.201, 0.202)\\ (0.000, 0.100, 0.200, 0.300, 0.400)\\ (0.202, 0.201, 0.200, 0.199, 0.198)\\\ldots\qquad\qquad\quad\quad\end{aligned}$$

Luckily enough, uniqueness of the most jagged distribution and uniqueness of the least jagged distribution turn out not to be required properties as far as our work is concerned. Indeed, jaggedness plays a central role both in the (regularization-based) methods that we propose (see Sect. 5) and in the data sampling protocol that we use for testing purposes (see Sect. 6.1.1), but neither of these contexts requires these uniqueness properties.

4 Existing multi-class quantification methods

In this section we introduce a number of known (non-ordinal and ordinal) multi-class quantification methods that we use as baselines in our experiments. Our novel OQ methods from Sect. 5 build upon a selection of these baselines.

4.1 Problem setting

In the multi-class quantification setting we want to estimate a distribution \(\textbf{p} \in \varDelta ^{n-1}\), where \(n>2\), where \(\varDelta ^{n-1}\) is the probability simplex from Eq. 1, and where \(\textbf{p}\) represents the class prevalence values within a test bag \(\sigma\). At our disposal is a validation dataset V, where we denote by \(V_i\) those data items that belong to class \(y_i \in \mathcal {Y}\), i.e.,

$$\begin{aligned} V_i \;=\; \big \{\textbf{x} \in \mathcal {X} : (\textbf{x},y)\in V,\, y=y_i\big \} \end{aligned}$$
(5)

Let \(f:\mathcal {X}\rightarrow \mathbb {R}^d\) be a transformation function that embeds any data point into a d-dimensional vector. For example, f might be a soft classifier, so that each data point is represented as a d-dimensional vector of posterior probabilities, with d equal to the number of classes n; or f may instead be a binning function, in which case f returns one-hot d-dimensional vectors with d the number of bins. Many alternative choices for f exist, each of which gives rise to a different quantification method; see, e.g., those of Sect. 4.2.

Moreover, let \(S \in \mathbb {N}^\mathcal {X}\) be any bag (or multi-set) of an arbitrary number of data items, where each data item is drawn from the feature space \(\mathcal {X}\). For any choice of f and S, we denote by

$$\begin{aligned} \phi _f(S) \;=\; \frac{1}{|S |}\sum _{\textbf{x} \in S}f(\textbf{x}) \end{aligned}$$
(6)

the mean embedding of S, as represented by f.
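Eq. 6 amounts to averaging the item-wise embeddings of a bag; a minimal sketch (the function name is ours):

```python
import numpy as np

def mean_embedding(S, f):
    """phi_f(S) of Eq. 6: the average of the embedding f(x) over all x in S."""
    return np.mean([f(x) for x in S], axis=0)
```

For instance, if f one-hot encodes hard classifier predictions, then \(\phi _f(S)\) is exactly the vector of predicted class fractions.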

With embeddings of this kind, the multi-class quantification problem can be framed as solving for \(\textbf{p}\in \varDelta ^{n-1}\) the system of linear equations

$$\begin{aligned} \textbf{q}=\textbf{M}\textbf{p} \end{aligned}$$
(7)

where the vector \(\textbf{q}=\phi _f(\sigma )\in \mathbb {R}^d\) is a mean embedding of the test bag and the columns of the matrix \(\textbf{M}=[\phi _f(V_1), \cdots , \phi _f(V_n)]\in \mathbb {R}^{d\times n}\) contain the class-wise mean embeddings of the validation set. Note that V coincides with our training set L if k-fold cross-validation is employed.

Multiple quantification algorithms have been proposed in the literature, and many of them can be seen, as conceptualized by Firat (2016) and formally proven by Bunse (2022b), as different ways of solving Eq. 7. In the next sections, when introducing previously proposed quantification algorithms, we indeed present them as different means of solving Eq. 7, even if their original proposers did not present them as such. Since we will also formulate our novel algorithms in this way, Eq. 7 will act as a unifying framework for quantification methods of different provenance.

A naive solution of Eq. 7 would be \(\textbf{M}^\dagger \textbf{q}\), where \(\textbf{M}^\dagger\) is the Moore-Penrose pseudo-inverse, which exists for any matrix \(\textbf{M}\), even if \(\textbf{M}\) is not invertible. This solution is known to be a minimum-norm least-squares solution (Mueller and Siltanen 2012), which unfortunately is not guaranteed to be a distribution, i.e., it is not guaranteed to be an element of the probability simplex \(\varDelta ^{n-1}\).
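A toy numeric sketch of this naive solution (the values of \(\textbf{M}\) and \(\textbf{p}\) are ours, chosen for illustration; here \(\textbf{q}\) is noise-free, so the pseudo-inverse recovers \(\textbf{p}\) exactly, whereas with sampling noise in \(\textbf{q}\) the solution can fall outside the simplex):

```python
import numpy as np

# Toy setup with n = 3 classes and f a soft classifier, so that d = n = 3.
# The columns of M are the class-wise mean embeddings phi_f(V_i).
M = np.array([[0.7, 0.2, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.2, 0.7]])
p_true = np.array([0.5, 0.3, 0.2])
q = M @ p_true  # mean embedding of a (noise-free) test bag

# minimum-norm least-squares solution of Eq. 7
p_naive = np.linalg.pinv(M) @ q
```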

A recent and fairly general proposal is to minimize a loss function \(\mathcal {L}\) and use a soft-max operator in order to guarantee that the result is indeed a distribution (Bunse 2022a), i.e.,

$$\begin{aligned} \hat{\textbf{p}} = \textrm{softmax}\big (\textbf{l}^*\big ) \in \varDelta ^{n-1} \hspace{3.5em} \end{aligned}$$
(8)

where

$$\begin{aligned} \textbf{l}^*= \mathop {\mathrm {arg\,min}}\limits _{\textbf{l} \in \mathbb {R}^n} \mathcal {L}\big (\,\textrm{softmax}(\textbf{l}); \textbf{M}, \textbf{q}\big ) \end{aligned}$$
(9)

is a vector of latent quantities and where the i-th output of the soft-max operator in Eq. 9 is \(\textrm{softmax}_i(\textbf{l}) = \textrm{exp}(\textbf{l}_i) / (\sum _{j=1}^n \textrm{exp}(\textbf{l}_j))\). Due to the soft-max operator, these latent quantities lend themselves to being interpreted as (translated) log-probabilities. In our implementation, we establish the uniqueness of \(\textbf{l}^*\) by fixing the first dimension to \(\textbf{l}_1 = 0\), which reduces the minimization of \(\mathcal {L}\) to \((n-1)\) dimensions without sacrificing the optimality of \(\textbf{l}^*\).

What remains to be detailed in the following subsections are the different choices of loss functions \(\mathcal {L}\) and feature transformations f that the different multi-class quantification methods employ.

4.2 Non-ordinal quantification methods

In the following, we introduce some important multi-class quantification methods which do not take ordinality into account. These methods provide the foundation for their ordinal extensions, which we develop in Sect. 5.

4.2.1 Classify and Count and its adjusted and/or probabilistic variants

The basic Classify and Count (CC) method (Forman 2005) employs a “hard” classifier \(h: \mathcal {X} \rightarrow \mathcal {Y}\) to generate class predictions for all data items \(\textbf{x} \in \sigma\). The fraction of predictions for a given class is directly used as its prevalence estimate, i.e.,

$$\begin{aligned} \hat{p}_{\sigma }^{\textrm{CC}}(y_i) = \ \frac{1}{|\sigma |} \cdot \big |\{\textbf{x} \in \sigma : h(\textbf{x}) = y_i\}\big | \end{aligned}$$
(10)

In the probabilistic variant of CC, called Probabilistic Classify and Count (PCC) by Bella et al. (2010), the hard classifier is replaced by a “soft” classifier \(s:\mathcal {X} \rightarrow \varDelta ^{n-1}\) (with \(\varDelta ^{n-1}\) the probability simplex from Eq. 1) that returns a vector of (ideally well-calibrated) posterior probabilities \(s_i(\textbf{x})\equiv \Pr (y_{i}|\textbf{x})\), i.e.,

$$\begin{aligned} \hat{p}_{\sigma }^{\textrm{PCC}}(y_i) = \ \frac{1}{|\sigma |} \cdot \sum _{\textbf{x} \in \sigma } s_i(\textbf{x}) \end{aligned}$$
(11)

CC and PCC are two simplistic quantification methods: they do not attempt to solve Eq. 7 for \(\textbf{p}\) and, hence, are biased towards the class distribution of the training set. Despite this inadequacy, these two methods are often used by practitioners, usually because they are unaware that more suitable quantification methods exist.
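Eqs. 10 and 11 can be sketched as follows (a minimal sketch; the function names are ours, and we assume class labels encoded as integer indices):

```python
import numpy as np

def cc(hard_predictions, n_classes):
    """Classify and Count (Eq. 10): the fraction of items in the bag
    that the hard classifier h assigns to each class."""
    return np.bincount(hard_predictions, minlength=n_classes) / len(hard_predictions)

def pcc(posteriors):
    """Probabilistic Classify and Count (Eq. 11): the average of the
    soft classifier's posterior probabilities s(x) over the bag."""
    return np.mean(posteriors, axis=0)
```

Here, hard_predictions holds the class indices \(h(\textbf{x})\) and posteriors holds the rows \(s(\textbf{x})\) for all items of the bag.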

Adjusted Classify and Count (ACC) by Forman (2005) and Probabilistic Adjusted Classify and Count (PACC) by Bella et al. (2010) are based on the idea of applying a correction to the estimates \(\hat{\textbf{p}}^{\text {CC}}_\sigma\) and \(\hat{\textbf{p}}^{\text {PCC}}_\sigma\), respectively. These two methods estimate the (hard or soft, respectively) misclassification rates of the classifier on a validation set V; the correction of the estimates \(\hat{\textbf{p}}^{\text {CC}}_\sigma\) and \(\hat{\textbf{p}}^{\text {PCC}}_\sigma\) is then obtained by solving Eq. 7 for \(\textbf{p}\), where \(\textbf{q} = (\hat{p}_\sigma (y_1), \dots , \hat{p}_\sigma (y_n))\) is the distribution as estimated by CC or by PCC, respectively (see Eqs. 10 and 11), and where

$$\begin{aligned} \textbf{M}_{ij} = \frac{1}{|V_j |} \cdot \big |\{\textbf{x} \in V_j : h(\textbf{x}) = y_i\}\big | \end{aligned}$$
(12)

in the case of ACC, or where

$$\begin{aligned} \textbf{M}_{ij} = \frac{1}{|V_j |} \cdot \sum _{\textbf{x} \in V_j} s_i(\textbf{x}) \end{aligned}$$
(13)

in the case of PACC, and where \(V_i\) is the set of validation data items that belong to class \(y_i\); see Eq. 5. In other words, the feature transformation \(f(\textbf{x})\) of ACC is a one-hot encoding of hard classifier predictions \(h(\textbf{x})\), and the feature transformation \(f(\textbf{x})\) of PACC is the output \(s(\textbf{x})\) of a soft classifier (Firat 2016; Bunse 2022b).

Both ACC and PACC use a least-squares loss

$$\begin{aligned} \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}) = \Vert \textbf{q} - \textbf{M}\textbf{p} \Vert _2^2 \end{aligned}$$
(14)

to solve Eq. 7 for \(\textbf{p}\) (Bunse 2022a). We implement this solution as a minimization in terms of Eq. 8.
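A minimal sketch of this minimization for the ACC correction (the toy values are ours; scipy's general-purpose optimizer stands in for whatever solver an actual implementation might use):

```python
import numpy as np
from scipy.optimize import minimize

def softmax(l):
    e = np.exp(l - np.max(l))  # subtract the max for numerical stability
    return e / np.sum(e)

def solve_least_squares(M, q):
    """Solve q = M p (Eq. 7) under the least-squares loss of Eq. 14,
    re-parameterized through soft-max (Eqs. 8 and 9) so that the result
    is guaranteed to be a distribution; l_1 is fixed to 0 for uniqueness."""
    n = M.shape[1]
    loss = lambda l_rest: np.sum(
        (q - M @ softmax(np.concatenate(([0.0], l_rest)))) ** 2)
    result = minimize(loss, np.zeros(n - 1), tol=1e-9)
    return softmax(np.concatenate(([0.0], result.x)))

# Toy ACC example: the columns of M hold the class-wise fractions of
# hard predictions, estimated on the validation set.
M = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.2],
              [0.1, 0.1, 0.7]])
p_true = np.array([0.6, 0.3, 0.1])
q = M @ p_true  # the (noise-free) CC estimate on the test bag
p_hat = solve_least_squares(M, q)
```

Unlike the pseudo-inverse solution, the returned estimate always lies on the probability simplex, by construction of the soft-max operator.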

4.2.2 The HDx and HDy distribution-matching methods

For other choices of feature transformations and loss functions, we obtain other quantification algorithms. Two other popular and non-ordinal quantification algorithms are HDx and HDy (González-Castro et al. 2013), which compute feature-wise (HDx) or class-wise (HDy) histograms and minimize the average Hellinger distance across all histograms.

Let d be the number of histograms and let b be the number of bins in each histogram. To ease our notation, we now describe \(\textbf{q} \in \mathbb {R}^{d \times b}\) and \(\textbf{M} \in \mathbb {R}^{d \times b \times n}\) as tensors. Note, however, that a simple concatenation

$$\begin{aligned} (\textbf{q}_{11}, \textbf{q}_{12}, \dots , \textbf{q}_{1b}, \textbf{q}_{21}, \dots , \textbf{q}_{db}) \in&\ \mathbb {R}^{db} \\ (\textbf{M}_{11\bullet }, \textbf{M}_{12\bullet }, \dots , \textbf{M}_{1b\bullet }, \textbf{M}_{21\bullet }, \dots , \textbf{M}_{db\bullet }) \in&\ \mathbb {R}^{db \times n} \end{aligned}$$

yields again Eq. 7, the system of linear equations that uses vectors and matrices instead of tensor notation.

The HDx algorithm computes one histogram for each feature in \(\sigma\), i.e.,

$$\begin{aligned} \textbf{q}_{ij} = \frac{1}{|\sigma |} \cdot \big |\{\textbf{x} \in \sigma \;:\; b_i(\textbf{x}) = j\}\big | \end{aligned}$$
(15)

where \(b_i(\textbf{x}): \mathcal {X} \rightarrow \{1, \dots , b\}\) returns the bin of the i-th feature of \(\textbf{x}\). Accordingly, the tensor \(\textbf{M}\) counts how often each bin of each histogram co-occurs with each class, i.e.,

$$\begin{aligned} \textbf{M}_{ijk} = \frac{1}{|V_k |} \cdot \big |\{\textbf{x} \in V_k : b_i(\textbf{x}) = j\}\big | \end{aligned}$$
(16)

As a loss function, HDx employs the average of all feature-wise Hellinger distances, i.e.,

$$\begin{aligned} \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}) \;=\; \frac{1}{d} \sum _{i=1}^d \textrm{HD}(\textbf{q}_{i\bullet }, \, \textbf{M}_{i \bullet \bullet }\textbf{p}) \end{aligned}$$
(17)

where

$$\begin{aligned} \textrm{HD}(\textbf{a}, \, \textbf{b}) \;=\; \sqrt{ \sum _{i = 1}^b \left( \sqrt{\textbf{a}_i} - \sqrt{\textbf{b}_i} \right) ^2} \end{aligned}$$
(18)

is the Hellinger distance between two histograms of a feature.

The HDy algorithm uses the same loss function, but operates on the output of a “soft” classifier \(s: \mathcal {X} \rightarrow \varDelta ^{n-1}\), as if this output were the original feature representation of the data. Hence, we have

$$\begin{aligned} \begin{aligned} \textbf{q}_{ij} =&\ \frac{1}{|\sigma |} \cdot \big |\{\textbf{x} \in \sigma \;:\; b_i(s(\textbf{x})) = j\}\big |\\ \textbf{M}_{ijk} =&\ \frac{1}{|V_k |} \cdot \big |\{\textbf{x} \in V_k : b_i(s(\textbf{x})) = j\}\big |\end{aligned} \end{aligned}$$
(19)

where s is a soft classifier that returns posterior probabilities \(s_i(\textbf{x}) \equiv \Pr (y_{i}|\textbf{x})\) (or some monotone transformation thereof). Like ACC and PACC, we implement HDx and HDy as a minimization in terms of Eq. 8.
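As an illustration, the following sketch (NumPy; the naming and the toy data are ours) evaluates the averaged Hellinger-distance loss of Eqs. 17 and 18; when \(\textbf{q}\) is generated from a known prevalence vector, the loss vanishes at that vector:

```python
import numpy as np

def hellinger(a, b):
    # Hellinger distance between two histograms of a feature (Eq. 18)
    return np.sqrt(np.sum((np.sqrt(a) - np.sqrt(b)) ** 2))

def hd_loss(p, M, q):
    # Average feature-wise Hellinger distance (Eq. 17);
    # M has shape (d, b, n), q has shape (d, b)
    return np.mean([hellinger(q[i], M[i] @ p) for i in range(M.shape[0])])

# toy check: d=3 histograms, b=4 bins, n=2 classes (random class-wise
# histograms, each normalized to sum to 1 over its bins)
rng = np.random.default_rng(0)
M = rng.random((3, 4, 2))
M /= M.sum(axis=1, keepdims=True)
p_true = np.array([0.7, 0.3])
q = np.einsum("ijk,k->ij", M, p_true)  # bag histograms induced by p_true
```

Here, `hd_loss(p_true, M, q)` is zero by construction, and larger for any other prevalence vector; HDx and HDy minimize this quantity over the simplex.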

4.2.3 The Saerens–Latinne–Decaestecker EM-based method (SLD)

The Saerens–Latinne–Decaestecker (SLD) method (Saerens et al. 2002), also known as “EM-based quantification”, follows an iterative expectation–maximization (EM) approach, which (i) leverages Bayes’ theorem in the E-step, and (ii) updates the prevalence estimates in the M-step. Both steps can be combined into the single update rule

$$\begin{aligned} \hat{p}_\sigma ^{(k)}(y_i) = \displaystyle \frac{1}{|\sigma |} \sum _{\textbf{x} \in \sigma } \displaystyle \frac{ \displaystyle \frac{\hat{p}_\sigma ^{(k-1)}(y_i)}{\hat{p}_\sigma ^{(0)}(y_i)} \cdot s_i(\textbf{x}) }{ \sum _{j=1}^n \displaystyle \frac{\hat{p}_\sigma ^{(k-1)}(y_j)}{\hat{p}_\sigma ^{(0)}(y_j)} \cdot s_j(\textbf{x}) } \end{aligned}$$
(20)

which is applied until the estimates converge. Here, the “(k)” superscript indicates the k-th iteration of the process and \(p_\sigma ^{(0)}(y)\) is initialized with the class prevalence values of the training set.
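The update rule of Eq. 20 can be sketched in a few lines (NumPy; a minimal illustration with a fixed iteration count, not the reference implementation, which would instead iterate until convergence):

```python
import numpy as np

def sld(S, p_train, n_iter=100):
    # SLD / EM-based quantification (Eq. 20).
    # S is a (|sigma|, n) matrix of posteriors s_i(x) for the bag;
    # p_train holds the training prevalences used as the initial prior.
    p0 = np.asarray(p_train, dtype=float)
    p = p0.copy()
    for _ in range(n_iter):
        w = S * (p / p0)                   # re-weight posteriors by the prior ratio
        w /= w.sum(axis=1, keepdims=True)  # renormalize per data item (E-step)
        p = w.mean(axis=0)                 # M-step: average the adjusted posteriors
    return p
```

For instance, if the bag posteriors are one-hot (i.e., the classifier is certain), the method simply returns the fraction of items assigned to each class, as CC would.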

4.3 Ordinal quantification methods from the data mining literature

In this section and in Sect. 4.4 we describe existing ordinal quantification methods. This section covers methods proposed in the data mining / machine learning / NLP literature, which their proposers indeed call “quantification” methods; Sect. 4.4 covers methods introduced in the physics literature, which their proposers call “unfolding” methods.

4.3.1 Ordinal Quantification Tree (OQT)

The OQT algorithm (Da San Martino et al. 2016) trains a quantifier by arranging probabilistic binary classifiers (one for each possible bipartition of the ordered set of classes) into an ordinal quantification tree (OQT), which is conceptually similar to a hierarchical classifier. Two characteristic aspects of training an OQT are that (a) the loss function used for splitting a node is a quantification loss (and not a classification loss), e.g., the Kullback–Leibler Divergence, and (b) the splitting criterion is informed by the class order. Given a test data item, one generates a posterior probability for each of the classes by having the data item descend all branches of the trained tree. After the posteriors of all data items in the test bag have been estimated this way, PCC is invoked in order to compute the final prevalence estimates.

The OQT method was only tested in the ordinal quantification sub-task of the SemEval 2016 “Sentiment analysis in Twitter” shared task (Nakov et al. 2016). While OQT was the best performer in that sub-task, its true value still has to be assessed, since the sub-task evaluated the participating algorithms on one test bag only. In our experiments, we test OQT in a much more robust way. Since PCC (the final step of OQT) is known to be biased, we do not expect OQT to exhibit competitive performance.

4.3.2 Adjusted Regress and Count (ARC)

The ARC algorithm (Esuli 2016) is similar to OQT in that it trains a hierarchical classifier where (a) the leaves of the tree are the classes, (b) these leaves are ordered left-to-right, and (c) each internal node partitions an ordered sequence of classes into two sub-sequences. One difference between OQT and ARC is the criterion used to decide where to split a given sequence of classes, which for OQT is based on a quantification loss (KLD), and for ARC is based on the principle of minimizing the imbalance (in terms of the number of training examples) of the two sub-sequences. A second difference is that, once the tree is trained and used to classify the test data items, OQT uses PCC, while ARC uses ACC.

Concerning the quality of ARC, the same considerations made for OQT apply, since ARC, like OQT, has only been tested in the Ordinal Quantification sub-task of the SemEval 2016 “Sentiment analysis in Twitter” shared task (Nakov et al. 2016); despite the fact that it worked well in that context, the experiments that we present here are more conclusive.

4.3.3 The Match Distance in the EDy method

Castaño et al. (2024) have recently proposed EDy, a variant of the EDx method (Kawakubo et al. 2016) which employs the MD from Eq. 3 to measure the distance between soft predictions \(s(\textbf{x})\). Since MD addresses the order of classes, we regard EDy as a true OQ method.

The underlying idea of EDy, following the idea of EDx, is to choose the estimate \(\textbf{p}\) such that the energy distance between \(\textbf{q}\) and \(\textbf{M}\textbf{p}\) is minimal. This distance can be written as

$$\begin{aligned} \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}) \;=\; 2 \textbf{p}^\top \textbf{q} - \textbf{p}^\top \textbf{M} \textbf{p} \end{aligned}$$
(21)

where

$$\begin{aligned} \begin{aligned} \textbf{q}_i \;&=\; \frac{1}{|\sigma |\cdot |V_i |} \sum _{\textbf{x} \in \sigma } \sum _{\textbf{x}' \in V_i} \textrm{MD}\big (s(\textbf{x}), s(\textbf{x}')\big ) \\ \textbf{M}_{ij} \;&=\; \frac{1}{|V_j |\cdot |V_i |} \sum _{\textbf{x} \in V_j} \sum _{\textbf{x}' \in V_i} \textrm{MD}\big (s(\textbf{x}), s(\textbf{x}')\big ) \end{aligned} \end{aligned}$$
(22)

describe the average MD between data items of different classes (in case of \(\textbf{M}\)) and between data items of \(\sigma\) and individual classes (in case of \(\textbf{q}\)). In other words, the feature representation of the MD-based variant of EDy is

$$\begin{aligned} f_i(\textbf{x}) \;=\; \frac{1}{|V_i |} \sum _{\textbf{x}' \in V_i} \textrm{MD}\big (s(\textbf{x}), s(\textbf{x}')\big ) \end{aligned}$$
(23)

Alternatively, the distance between data items could be measured in ways other than \(\textrm{MD}(s(\textbf{x}), s(\textbf{x}'))\), e.g., in terms of the Euclidean distance \(\Vert \textbf{x} -\textbf{x}'\Vert _2\). However, with the MD being a suitable measure for ordinal problems, we regard Eq. 21 as the best-fitting and most promising variant of EDx and EDy. In experiments with ordinal data, this variant has recently been shown to exhibit state-of-the-art performance (Castaño et al. 2024).
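Assuming that MD denotes the match distance of Eq. 3, which we read as the L1 distance between cumulative distributions, the construction of Eq. 22 can be sketched as follows (NumPy; names and toy values are ours, not from the original codebase):

```python
import numpy as np

def match_distance(a, b):
    # MD between two distributions over ordered classes: the L1 distance
    # between their cumulative distributions (our reading of Eq. 3)
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

def edy_terms(S_bag, S_val, y_val, n):
    # Build q (first line of Eq. 22) and M (second line) from soft
    # predictions; S_bag holds posteriors for the bag, S_val and y_val
    # the posteriors and labels of the validation set V.
    q = np.zeros(n)
    M = np.zeros((n, n))
    parts = [S_val[y_val == i] for i in range(n)]
    for i in range(n):
        q[i] = np.mean([match_distance(s, t) for s in S_bag for t in parts[i]])
        for j in range(n):
            M[i, j] = np.mean([match_distance(s, t)
                               for s in parts[j] for t in parts[i]])
    return q, M

def edy_loss(p, M, q):
    return 2 * p @ q - p @ M @ p   # Eq. 21

# toy illustration with n=2 classes (hypothetical posteriors)
S_val = np.array([[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]])
y_val = np.array([0, 0, 1, 1])
S_bag = np.array([[0.85, 0.15], [0.15, 0.85]])
q, M = edy_terms(S_bag, S_val, y_val, n=2)
```

Note that \(\textbf{M}\) is symmetric by construction, since MD is a distance.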

4.3.4 The Match Distance in the PDF method

Another proposal by Castaño et al. (2024) is PDF, an OQ method that minimizes the MD between two ranking histograms. In this method, a ranking function \(r: \mathcal {X} \rightarrow \mathbb {R}\) is required. Such a function can be obtained from any multi-class soft-classifier \(s: \mathcal {X} \rightarrow \varDelta ^{n-1}\) by taking

$$\begin{aligned} r(\textbf{x}) \;=\; \sum _{i=1}^n i \cdot s_i(\textbf{x}) \end{aligned}$$
(24)

such that \(r(\textbf{x})\) is a real value between 1 and n.

Having a ranking function, we can compute a one-dimensional histogram of the ranking values of \(\sigma\) and another one-dimensional histogram of the ranking values of the training set, weighted by an estimate \(\textbf{p}\). Castaño et al. (2024) choose \(\textbf{p}\) such that it minimizes the MD between these two histograms, i.e.,

$$\begin{aligned} \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}) \;=\; \textrm{MD}(\textbf{q}, \textbf{M}\textbf{p}) \end{aligned}$$
(25)

where

$$\begin{aligned} \begin{aligned} \textbf{q}_i \;=&\; \frac{1}{|\sigma |} \cdot \big |\{\textbf{x} \in \sigma \;:\; b(r(\textbf{x})) = i\}\big |\\ \textbf{M}_{ij} \;=&\; \frac{1}{\left|V_j \right|} \cdot \big |\{\textbf{x} \in V_j : b(r(\textbf{x})) = i\}\big |\end{aligned} \end{aligned}$$
(26)

and where \(b: \mathbb {R} \rightarrow \{1, 2, \dots , B\}\) returns the bin index of the ranking value \(r(\textbf{x})\). In other words, the feature transformation of PDF is a one-hot encoding of \(b(r(\textbf{x}))\).
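Under the same reading of MD as the L1 distance between cumulative histograms, and assuming equal-width bins over [1, n] (a binning choice of ours, which the text above leaves open), the ingredients of PDF can be sketched as:

```python
import numpy as np

def ranking(S):
    # Eq. 24: expected class index under the posteriors, a value in [1, n]
    n = S.shape[1]
    return S @ np.arange(1, n + 1)

def pdf_terms(S_bag, S_val, y_val, n, B=4):
    # Histograms of ranking values (Eq. 26), with B equal-width bins
    # over [1, n] (our assumption)
    edges = np.linspace(1, n, B + 1)
    def hist(scores):
        h, _ = np.histogram(scores, bins=edges)
        return h / len(scores)
    q = hist(ranking(S_bag))
    M = np.column_stack([hist(ranking(S_val[y_val == j])) for j in range(n)])
    return q, M

def md(a, b):
    # Match distance: L1 distance between cumulative histograms (Eq. 3)
    return np.abs(np.cumsum(a) - np.cumsum(b)).sum()

# toy illustration (hypothetical posteriors for n=3 classes)
S_val = np.array([[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8],
                  [0.7, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.7]])
y_val = np.array([0, 1, 2, 0, 1, 2])
S_bag = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
q, M = pdf_terms(S_bag, S_val, y_val, n=3)
loss_at = lambda p: md(q, M @ p)   # the loss of Eq. 25
```

Minimizing `loss_at` over the simplex (e.g., via the soft-max parametrization of Eq. 8) then yields the PDF estimate.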

4.4 Ordinal quantification methods from the physics literature

Similar to some of the methods discussed in Sects. 4.2 and 4.3, experimental physicists have proposed additional adjustments that solve, for \(\textbf{p}\), the system of linear equations from Eq. 7. These “unfolding” methods have two particular aspects in common.

The first aspect is that the feature transformation f is assumed to be a partition \(c: \mathcal {X} \rightarrow \{1, \dots , t\}\) of the feature space, and

$$\begin{aligned} \textbf{q}_i =&\ \frac{1}{|\sigma |} \cdot \big |\{\textbf{x} \in \sigma : c(\textbf{x}) = i\}\big | \end{aligned}$$
(27)
$$\begin{aligned} \textbf{M}_{ij} =&\ \frac{1}{\left|V_j \right|} \cdot \big |\{\textbf{x} \in V_j : c(\textbf{x}) = i\}\big | \end{aligned}$$
(28)

with \(\textbf{M} \in \mathbb {R}^{t \times n}\); here, i indexes the representation for the i-th partition in \(\textbf{q}\) and \(\textbf{M}\), while j indexes the class being modeled in \(\textbf{M}\). In other words, these methods were defined without supervised learning in mind, which differentiates them from all the methods introduced in the previous sections. However, note that, once we replace partition c with a trained classifier h, Eqs. 27 and 28 become exactly Eqs. 10 and 12, which define the ACC method.

Another possible choice for c is to partition the feature space by means of a decision tree; in this case, (i) it typically holds that \(t>n\), and (ii) \(c(\textbf{x})\) represents the index of a leaf node (Börner et al. 2017). Here, we choose \(c=h\) (i.e., we plug in supervised learning) for performance reasons and to establish a high degree of comparability among quantification methods.

The second aspect of “unfolding” quantifiers, which is central to our work, is the use of a regularization component that promotes what we have called (see Sect. 3.3) “ordinally plausible” solutions. Specifically, these methods employ the assumption that ordinal distributions are smooth (in the sense of Sect. 3.3); depending on the algorithm, this assumption is encoded in different ways, as we will see in the following paragraphs.

4.4.1 Regularized unfolding (RUN)

Regularized Unfolding (RUN) (Blobel 2002, 1985) has been used by physicists for decades (Nöthe et al. 2017; Aartsen et al. 2017). Here, the loss function \(\mathcal {L}\) consists of two terms, a negative log-likelihood term to model the error of \(\textbf{p}\) and a regularization term to model the plausibility of \(\textbf{p}\).

The negative log-likelihood term in \(\mathcal {L}\) builds on a Poisson assumption about the distribution of the data. Namely, this term models the counts \(\bar{\textbf{q}}_i = |\sigma |\cdot \textbf{q}_i\), which are observed in the bag \(\sigma\), as being Poisson-distributed with the rates \(\lambda _i = \textbf{M}_{i\bullet }^\top \bar{\textbf{p}}\). Here, \(\bar{\textbf{p}}_i = |\sigma |\cdot \textbf{p}_i\) are the class counts that would be observed under a prevalence estimate \(\textbf{p}\).

The second term of \(\mathcal {L}\) is a Tikhonov regularization term \(\frac{1}{2}\left( \textbf{C}_{1}\textbf{p}\,\right) ^2\), where

$$\begin{aligned} \textbf{C}_{1} = \begin{pmatrix} -1 &{} \; \phantom {-}2 &{} \; -1 &{} \; \phantom {-}0 &{} \; \cdots &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 \\ \phantom {-}0 &{} \; -1 &{} \; \phantom {-}2 &{} \; -1 &{} \; \cdots &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 \\ \cdots &{} \; \cdots &{} \; \cdots &{} \; \cdots &{} \; \cdots &{} \; \cdots &{} \; \cdots &{} \; \cdots &{} \; \cdots \\ \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \cdots &{} \; -1 &{} \; \phantom {-}2 &{} \; -1 &{} \; \phantom {-}0 \\ \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \phantom {-}0 &{} \; \cdots &{} \; \phantom {-}0 &{} \; -1 &{} \; \phantom {-}2 &{} \; -1 \\ \end{pmatrix} \in \mathbb {R}^{(n-2) \times n} \end{aligned}$$
(29)

This term introduces an inductive bias towards smooth solutions, i.e., solutions which are (following the assumption we have made in Sect. 3.3) ordinally plausible. The choice of the Tikhonov matrix \(\textbf{C}_{1}\) ensures that \(\frac{1}{2}\left( \textbf{C}_{1}\textbf{p}\,\right) ^2\) measures the jaggedness of \(\textbf{p}\), i.e.,

$$\begin{aligned} \frac{1}{2}\left( \textbf{C}_{1}\textbf{p}\,\right) ^2 = \frac{1}{2} \sum _{i = 2}^{n-1} \left( -\textbf{p}_{i-1} + 2\textbf{p}_i - \textbf{p}_{i+1} \right) ^2 \end{aligned}$$
(30)

which only differs from \(\xi _{1}(\textbf{p}_{\sigma })\), our measure of ordinal plausibility from Eq. 4, in terms of a constant normalization factor.Footnote 4 (Indeed, subscript “1” in \(\textbf{C}_{1}\) is there to indicate that the goal of \(\textbf{C}_{1}\) is to minimize \(\xi _{1}(\textbf{p}_{\sigma })\).) Combining the likelihood term and the regularization term, the loss function of RUN is

$$\begin{aligned} \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}, \tau ) \;=\; \sum _{i = 1}^t \left( \textbf{M}_{i\bullet }^\top \bar{\textbf{p}} - \bar{\textbf{q}}_i \cdot \ln (\textbf{M}_{i\bullet }^\top \bar{\textbf{p}})\right) \;+\; \frac{\tau }{2}\left( \textbf{C}_{1}\textbf{p}\,\right) ^2 \end{aligned}$$
(31)

and an estimate \(\hat{\textbf{p}}\) can be chosen in terms of Eq. 8. Here, \(\tau \ge 0\) is a hyper-parameter which controls the impact of the regularization.
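The loss of Eq. 31 can be sketched as follows (NumPy; the naming is ours, with `n_sigma` denoting \(|\sigma |\), and the constant term of the Poisson log-likelihood omitted as in Eq. 31):

```python
import numpy as np

def tikhonov_matrix(n):
    # Eq. 29: second-difference matrix C_1 of shape (n-2, n)
    C = np.zeros((n - 2, n))
    for i in range(n - 2):
        C[i, i:i + 3] = [-1.0, 2.0, -1.0]
    return C

def run_loss(p, M, q, tau, n_sigma):
    # RUN loss (Eq. 31): Poisson negative log-likelihood plus
    # tau/2 times the squared jaggedness of p (Eq. 30)
    p_bar = n_sigma * p           # class counts implied by the estimate p
    q_bar = n_sigma * q           # observed partition counts
    lam = M @ p_bar               # Poisson rates, one per partition
    nll = np.sum(lam - q_bar * np.log(lam))
    C = tikhonov_matrix(len(p))
    return nll + 0.5 * tau * np.sum((C @ p) ** 2)
```

Note that the regularizer vanishes for any linear prevalence vector (e.g., uniform), since its second differences are all zero; only jagged solutions are penalized.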

4.4.2 Iterative Bayesian unfolding (IBU)

Iterative Bayesian Unfolding (IBU) by D’Agostini (2010, 1995) is still popular today (Aad et al. 2021; Nachman et al. 2020). This method revolves around an expectation maximization approach with Bayes’ theorem, and thus has a common foundation with the SLD method. The E-step and the M-step of IBU can be written as the single, combined update rule

$$\begin{aligned} \hat{p}_\sigma ^{(k)}(y_i) = \sum _{j = 1}^t \frac{ \textbf{M}_{ji} \cdot \hat{p}_\sigma ^{(k-1)}(y_i) }{ \sum _{l = 1}^n \textbf{M}_{jl} \cdot \hat{p}_\sigma ^{(k-1)}(y_l) } \, \textbf{q}_j \end{aligned}$$
(32)

One difference between IBU and SLD is that, in IBU, \(\textbf{q}\) and \(\textbf{M}\) are defined via counts of hard assignments to partitions \(c(\textbf{x})\) (see Eq. 27), while SLD operates on individual soft predictions \(s(\textbf{x})\) (see Eq. 20).

Another difference between IBU and SLD is regularization. In order to promote solutions which are ordinally plausible, IBU smooths each intermediate estimate \(\smash {\hat{\textbf{p}}^{(k)}}\) by fitting a low-order polynomial to \(\smash {\hat{\textbf{p}}^{(k)}}\). A linear interpolation between \(\smash {\hat{\textbf{p}}^{(k)}}\) and this polynomial is then used as the prior of the next iteration in order to reduce the differences between neighboring prevalence estimates. The order of the polynomial and the interpolation factor are hyper-parameters of IBU through which the regularization is controlled.
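The following sketch (NumPy) combines the Bayesian update with a polynomial smoothing step; the use of `np.polyfit` over the class indices and a fixed interpolation factor are our illustrative choices for the regularization just described:

```python
import numpy as np

def ibu(M, q, p0, n_iter=100, order=2, alpha=0.5):
    # IBU sketch: Bayesian update (Eq. 32) with polynomial smoothing of
    # each intermediate estimate; order and alpha are the hyper-parameters
    # that control the regularization.
    _, n = M.shape
    p = np.asarray(p0, dtype=float)
    x = np.arange(n)
    for _ in range(n_iter):
        # combined E/M-step: P(y_i | partition j), averaged with weights q_j
        joint = M * p                                # M[j, i] * p_i
        cond = joint / joint.sum(axis=1, keepdims=True)
        p_new = q @ cond
        # regularization: interpolate with a low-order polynomial fit
        fit = np.polyval(np.polyfit(x, p_new, order), x)
        p = alpha * fit + (1 - alpha) * p_new
        p = np.clip(p, 1e-12, None)
        p /= p.sum()
    return p
```

When \(\textbf{q}\) is generated exactly from a smooth prevalence vector, that vector is a fixed point of this iteration, since the polynomial fit then reproduces the estimate unchanged.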

4.4.3 Other quantification methods from the physics literature

Other methods from the physics literature that perform what we here call “quantification” go under the name of “unfolding” methods, and are based on concepts similar to those of RUN and IBU. We focus on RUN and IBU due to their long-standing popularity within physics research. In fact, they are among the first methods that have been proposed in this field, and are still widely adopted today, in astro-particle physics (Nöthe et al. 2017; Aartsen et al. 2017), high-energy physics (Aad et al. 2021), and, more recently, in quantum computing (Nachman et al. 2020). Moreover, RUN and IBU already cover the most important aspects of unfolding methods with respect to OQ.

Several other unfolding methods are similar to RUN. For instance, the method proposed by Hoecker and Kartvelishvili (1996) employs the same regularization as RUN, but assumes different Poisson rates, which are simplifications of the rates that RUN uses; in preliminary experiments, omitted here for the sake of conciseness, we have found this simplification to typically deliver less accurate results than RUN. Two other methods (Schmelling 1994; Schmitt 2012) employ the same simplification as Hoecker and Kartvelishvili (1996) but regularize differently. Schmelling (1994) regularizes with respect to the deviation from a prior, instead of regularizing with respect to ordinal plausibility; we thus do not perceive this method as a true OQ method. Schmitt (2012) adds to the RUN regularization a second term which enforces prevalence estimates that sum to one; however, implementing RUN in terms of Eq. 8 already solves this issue. Another line of work revolves around the algorithm by Ruhe et al. (2013) and its extensions (Bunse et al. 2018). We perceive this algorithm to lie outside the scope of OQ because it does not address the order of the classes, as the other “unfolding” methods do. Moreover, it was shown to exhibit a performance comparable to, but not better than, that of RUN and IBU (Bunse et al. 2018).

5 New ordinal versions of multi-class quantification algorithms

In the following, we develop algorithms which modify ACC, PACC, HDx, HDy, SLD, EDy, and PDF with the regularizers from RUN and IBU. Through these modifications, we obtain o-ACC, o-PACC, o-HDx, o-HDy, and o-SLD, the OQ counterparts of these well-known non-ordinal quantification algorithms, as well as o-EDy and o-PDF, which combine ordinal loss functions and feature representations with an ordinal regularizer. Since we employ the regularizers of RUN and IBU, but no other aspect of these methods, we preserve the general characteristics of the original algorithms; in particular, we change neither their feature representations nor their loss functions. Our extensions are therefore “minimal”, in the sense that they directly target ordinality without introducing undesired side effects into the original methods.

5.1 Tikhonov regularization in multi-class algorithms

The OQ counterparts of most algorithms—ACC, PACC, HDx, HDy, EDy, and PDF—are constructed by defining a novel, OQ-oriented loss function that adds the Tikhonov regularizer from Eq. 30 to the original loss function of each algorithm. This ordinal extension is defined through the regularized loss

$$\begin{aligned} \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}, \tau ) \;=\; \mathcal {L}(\textbf{p} ;\, \textbf{M}, \textbf{q}) \;+\; \frac{\tau }{2}\left( \textbf{C}_{1}\textbf{p}\,\right) ^2 \end{aligned}$$
(33)

where \(\mathcal {L}(\textbf{p};\, \textbf{M}, \textbf{q})\) is the original loss function of any existing (not necessarily ordinal) quantification algorithm. The hyper-parameter \(\tau \ge 0\) and the Tikhonov matrix \(\textbf{C}_1\) are the ones introduced by physicists to address ordinality in the RUN method of Sect. 4.4.1. Like before, we minimize Eq. 33 with the soft-max operator from Eq. 8.
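The construction of Eq. 33 can be sketched as a generic wrapper around any existing loss (NumPy/SciPy; all names are ours, and the least-squares instantiation shown at the bottom corresponds to o-ACC):

```python
import numpy as np
from scipy.optimize import minimize

def tikhonov(p):
    # squared jaggedness of p: half the sum of squared second
    # differences (Eq. 30)
    return 0.5 * np.sum((p[:-2] - 2 * p[1:-1] + p[2:]) ** 2)

def regularize(loss, tau):
    # Eq. 33: augment any quantification loss with the ordinal regularizer
    return lambda p, M, q: loss(p, M, q) + tau * tikhonov(p)

def minimize_on_simplex(loss, M, q):
    # Eq. 8: parametrize p = softmax(z) and minimize unconstrained
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    res = minimize(lambda z: loss(softmax(z), M, q), np.zeros(M.shape[1]))
    return softmax(res.x)

# o-ACC sketch: the least-squares loss of ACC (Eq. 14) plus the regularizer
least_squares = lambda p, M, q: np.sum((q - M @ p) ** 2)
o_acc_loss = regularize(least_squares, tau=0.01)
```

Swapping `least_squares` for any of the other losses discussed above yields the corresponding ordinal extension.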

If we apply the above definition of a regularized loss to ACC and PACC (see Sect. 4.2.1), we obtain o-ACC and o-PACC, the ordinal counterparts of these methods. The respective feature transformation and loss function of ACC and PACC are maintained, such that the only novelty is the regularization term that promotes ordinally plausible solutions.

Similarly, if we apply the above definition to HDx and HDy (see Sect. 4.2.2), we obtain o-HDx and o-HDy; if we apply the definition to EDy and PDF (see Sects. 4.3.3 and 4.3.4), we obtain o-EDy and o-PDF. In all of these cases, the only novelty is the regularization term.

Among the extended methods, o-EDy and o-PDF stand out in the sense that they combine multiple approaches to addressing ordinality. In the case of o-EDy, an ordinal feature transformation (the one of EDy) is combined with an ordinal regularizer (the one of RUN). In the case of o-PDF, an ordinal loss function (the one of PDF) is regularized to further promote solutions that are ordinally plausible. In all other extensions—o-ACC, o-PACC, o-HDx, and o-HDy—the one and only aspect concerning ordinality is the regularizer.

5.2 o-SLD

Unlike the other methods, SLD does not explicitly minimize a loss function. Hence, our ordinal extension o-SLD uses, instead of a Tikhonov regularization term, the ordinal regularization approach of IBU in SLD. Namely, our method does not use the latest estimate directly as the prior of the next iteration, but a smoothed version of this estimate. To this end, we fit a low-order polynomial to each intermediate estimate \(\smash {\hat{\textbf{p}}^{(k)}}\) and use a linear interpolation between this polynomial and \(\smash {\hat{\textbf{p}}^{(k)}}\) as the prior of the next iteration. Like in IBU, we consider the order of the polynomial and the interpolation factor as hyper-parameters of o-SLD.

6 Experiments

The goal of our experiments is to uncover the relative merits of OQ methods originating from different fields. We pursue this goal by carrying out a thorough comparison of these methods on representative OQ datasets. In the interest of reproducibility we make all the code publicly available.Footnote 5

6.1 Datasets and pre-processing

We conduct our experiments on two large datasets that we have generated for the purpose of this work, and that we make available to the scientific community. The first dataset, named Amazon-OQ-BK, consists of product reviews labeled according to customers’ judgments of quality, ranging from 1Star to 5Stars. The second dataset, Fact-OQ, consists of telescope observations each labeled by one of 12 totally ordered classes. These datasets originate in practically relevant and very diverse applications of OQ.

6.1.1 The data sampling protocol

We start by dividing each data set into a set L of training data items, a pool of validation (i.e., development) data items, and a pool of test data items. These three sets are disjoint from each other, and we obtain each of them through stratified sampling from the original data source. We set the size of the training set to 20,000 data items, use half of the remaining items for the validation pool, and use the other half for the testing pool.

From both the validation pool and the test pool, we separately extract bags (i.e., multi-sets of data items) to be predicted during quantifier evaluation. Following Esuli et al. (2022), each bag \(\sigma\) is generated in two steps. First, we randomly draw a ground-truth vector \(\textbf{p}_\sigma\) of class prevalence values; we realize this step in three different ways, which we detail in the following paragraphs. Second, we draw from the pool of data (be it our validation pool or our test pool) a fixed-size bag \(\sigma\) of data items that realizes the class prevalence values of \(\textbf{p}_\sigma\). We set the size of \(\sigma\) to 1,000 data items, drawing 1,000 such bags for validation and 5,000 bags for testing. All data items in a pool are replaced after the generation of each bag; our initial split into a training set, a validation pool, and a test pool already ensures that each validation bag is disjoint from each test bag, and that the training set is disjoint from all other bags.

Through the above approach, we can predict the prevalence values of each \(\sigma\) through quantification methods and compare the outcomes with the ground-truth vector \(\textbf{p}_\sigma\). By drawing many \(\textbf{p}_\sigma\) at random, we can test the quantification methods in many different instances of prior probability shift.

Real prevalence vectors The most realistic way of drawing \(\textbf{p}_\sigma\) is to draw it uniformly at random from the set of those prevalence vectors that are exhibited by bags that naturally occur in the data. We call these vectors real prevalence vectors due to their natural occurrence.

For Amazon-OQ-BK (to be detailed in Sect. 6.1.2), each natural bag consists of all reviews that address one individual product. Hence, each \(\textbf{p}_\sigma\) corresponds to the prevalence of customer ratings for a single product. For Fact-OQ (to be detailed in Sect. 6.1.3), each natural bag consists of telescope observations that are distributed according to a parametrization of the Crab Nebula (Aleksić et al. 2015) and are thus representative of data that physicists expect to handle in practice.

While real prevalence vectors provide the most realistic (and therefore the most sensible) setting for quantifier evaluation, they also bear two shortcomings. First, they are not available for standard classification data sets, which prevents these sets from being used for quantifier evaluation with real prevalence vectors; for this reason, we make available Amazon-OQ-BK and Fact-OQ as actual quantification data sets with real prevalence vectors. Second, since the distribution of real prevalence vectors differs between data sets, quantifiers cannot easily be compared across multiple data sets. Due to these shortcomings, we evaluate not only in terms of real prevalence vectors, but also in terms of two other evaluation protocols.

Artificial Prevalence Protocol (APP) Perhaps the most common way of drawing \(\textbf{p}_\sigma\) is to draw it uniformly at random from \(\varDelta ^{n-1}\), the set of all possible prevalence vectors (Forman 2005).

By picking all prevalence vectors with the same probability and without any dependence on the data, APP allows us to compare performance across multiple datasets. Moreover, it is capable of re-purposing any standard classification data set for the evaluation of quantifiers, and it demands high performance from quantification methods throughout \(\varDelta ^{n-1}\), which is another desirable property. However, this demand is made without any consideration of whether some \(\textbf{p}_\sigma\) is realistic or “ordinally plausible”, in the sense of Sect. 3.3. Therefore, APP tends to over-emphasize performance in regions of \(\varDelta ^{n-1}\) which are unlikely to ever appear in practice.

APP-OQ for ordinal plausibility Since we take smoothness (in the sense of Sect. 3.3) as a criterion for ordinal plausibility, we counteract this shortcoming of APP by further devising APP-OQ(\(x\%\)), a protocol that is identical to APP except that only the \(x\%\) smoothest bags are retained. Hence, when evaluating a quantifier, we perform hyper-parameter optimization on the x% smoothest validation bags and test on the x% smoothest test bags generated by APP. While this decrease in the number of bags might increase the variance of the prediction performance that is averaged over all bags, we will see that no such effect can be observed in the results of our experiments.
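The protocol can be sketched as follows (NumPy; drawing uniformly from the simplex corresponds to a flat Dirichlet distribution, and our `jaggedness` helper is proportional to \(\xi _{1}\), the plausibility measure of Eq. 4, up to its normalization constant):

```python
import numpy as np

def jaggedness(p):
    # half the sum of squared second differences, proportional to xi_1
    return 0.5 * np.sum((p[:-2] - 2 * p[1:-1] + p[2:]) ** 2)

def app_oq_prevalences(n_classes, n_bags, x_percent, seed=0):
    # Draw APP prevalence vectors uniformly from the simplex and retain
    # only the x% smoothest ones (a sketch of APP-OQ, not the actual
    # experimental code)
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(n_classes), size=n_bags)  # uniform on the simplex
    xi = np.array([jaggedness(p) for p in P])
    keep = int(np.ceil(n_bags * x_percent / 100))
    return P[np.argsort(xi)[:keep]]

P = app_oq_prevalences(n_classes=5, n_bags=1000, x_percent=5)  # 50 smoothest bags
```

Each retained vector still sums to one; only the jagged draws are discarded, so the coverage of the simplex is filtered rather than re-weighted.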

To use the above approach, we need to decide on a percentage x to use. To make this choice, we characterize the \(\textbf{p}_\sigma\) that result from different choices of x in terms of their average jaggedness \(\xi _{1}(\textbf{p}_\sigma )\) and in terms of the average amount of prior probability shift \(\textrm{NMD}(\textbf{p}_L, \textbf{p}_\sigma )\) that they generate. We compare these characteristics with those of the real prevalence vectors and choose the value of x that yields the most realistic values of \(\xi _{1}(\textbf{p}_\sigma )\).

Table 1 Characteristics of ground-truth class prevalence distributions \(\textbf{p}_\sigma\), which are sampled through different protocols and for both datasets

The results from Table 1 show that APP-OQ, while becoming smoother with smaller values of x, produces constant amounts of prior probability shift. In this sense, the quantification tasks of APP-OQ become more ordinally plausible, but not simpler. Hence, APP-OQ retains the beneficial coverage of \(\varDelta ^{n-1}\) that APP exhibits. The most suitable percentage for Amazon-OQ-BK turns out to be 50%, while the percentage for Fact-OQ turns out to be 5%. This difference stems from the smoother distributions that Fact-OQ exhibits in its real prevalence vectors.

In a nutshell, each of the above protocols provides a different perspective, which we combine by always reporting the results of all three protocols side by side. Real prevalence vectors provide the most realistic evaluation but do not allow performance comparisons across multiple data sets, APP is the most common approach and provides a bridge to previous works, and APP-OQ seeks to balance these two perspectives for ordinal quantification in particular.

6.1.2 The Amazon-OQ-BK dataset

We make available the Amazon-OQ-BK dataset,Footnote 6 which we extract from an existing dataset by McAuley et al. (2015), consisting of 233.1M English-language Amazon product reviewsFootnote 7; here, a data item corresponds to a single product review. As the labels of the reviews, we use their “stars” ratings, and our code frame is thus \(\mathcal {Y}=\){1Star, 2Stars, 3Stars, 4Stars, 5Stars}, which represents a sentiment quantification task (Esuli and Sebastiani 2010).

The reviews are subdivided into 28 product categories, including “Automotive”, “Baby”, “Beauty”, etc. We restrict our attention to reviews from the “Books” product category, since it is the one with the highest number of reviews. We then remove (a) all reviews shorter than 200 characters, because recognizing sentiment from such short reviews may be nearly impossible, and (b) all reviews that have never been recognized as “useful” by any user, since such reviews often comment on, say, Amazon’s speed of delivery rather than on the product itself.

We convert the reviews into vectors by using the RoBERTa transformer (Liu et al. 2019) from the Hugging Face hub. To this aim, we truncate the reviews to the first 256 tokens and fine-tune RoBERTa via prompt learning for a maximum of 5 epochs on our training data, retaining the model parameters from the epoch with the smallest validation loss, monitored on 1000 held-out reviews sampled from the training set in a stratified way. For training, we set the learning rate to \(2 \cdot 10^{-5}\), the weight decay to 0.01, and the batch size to 16, leaving the other hyper-parameters at their default values. For each review, we generate features by first applying a forward pass over the fine-tuned network, and then averaging the embeddings produced for the special token [CLS] across all 12 layers of RoBERTa. In our initial experiments, this approach yielded slightly better results than using the [CLS] embedding of the last layer alone. The embedding size of RoBERTa, and hence the dimensionality of our vectors, is 768.
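The layer-averaging step can be sketched as follows. This is a minimal numpy illustration of the aggregation logic only, with random arrays standing in for the per-layer outputs of the fine-tuned RoBERTa model; it is not our actual extraction pipeline.

```python
import numpy as np

def cls_embedding(hidden_states: list) -> np.ndarray:
    """Average the [CLS] embedding (token position 0) across layers.

    hidden_states: one (seq_len, 768) array per transformer layer,
    e.g. the 12 layer outputs of a RoBERTa-base forward pass.
    """
    cls_per_layer = np.stack([h[0] for h in hidden_states])  # (n_layers, 768)
    return cls_per_layer.mean(axis=0)                        # (768,)

# toy stand-in for the 12 layer outputs of one tokenized review
rng = np.random.default_rng(0)
layers = [rng.normal(size=(256, 768)) for _ in range(12)]
features = cls_embedding(layers)
print(features.shape)  # (768,)
```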

6.1.3 The Fact-OQ dataset

We extract our second dataset, called Fact-OQ, from the open dataset of the FACT telescope (Anderhub et al. 2013); here, a data item corresponds to a single telescope recording. We represent each data item in terms of the 20 dense features that are extracted by the standard processing pipeline of the telescope. Each of the 1,851,297 recordings is labeled with the energy of the corresponding astro-particle, and our goal is to estimate the distribution of these energy labels via OQ. While the energy labels are originally continuous, astro-particle physicists have established a common practice of dividing the range of energy values into ordinal classes, as argued in Sect. 4.4. Based on discussions with astro-particle physicists, we divide this range into an ordered set of 12 classes. As a result, our quantifiers predict histograms of the energy distribution with 12 equal-width bins.

Note that, since we are using NMD as our evaluation measure, we can meaningfully compare the results we obtain on Amazon-OQ-BK (which uses a 5-class code frame) with the results we obtain on Fact-OQ (which uses a 12-class code frame); this would not have been possible if we had used MD, which is not normalized by the number of classes in the code frame.
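For reference, this normalization can be sketched as follows, assuming the usual formulation in which MD amounts, for ordered classes with unit distance between adjacent classes, to the sum of absolute differences between the two cumulative distributions, and NMD divides MD by \(n-1\):

```python
import numpy as np

def nmd(p_true: np.ndarray, p_hat: np.ndarray) -> float:
    """Normalized Match Distance between two distributions over n ordered
    classes: the sum of absolute cumulative differences (MD with unit
    ground distance), divided by n - 1."""
    n = len(p_true)
    md = np.abs(np.cumsum(p_true - p_hat)[:-1]).sum()
    return md / (n - 1)

# identical distributions -> 0; maximally distant ones -> 1, for any n,
# which is what makes 5-class and 12-class results comparable
print(nmd(np.array([0.2] * 5), np.array([0.2] * 5)))  # 0.0
print(nmd(np.eye(12)[0], np.eye(12)[-1]))             # 1.0
```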

6.1.4 The UCI and OpenML datasets

In addition to our experiments on Amazon-OQ-BK and Fact-OQ, we also carry out experiments on a collection of public datasets from the UCI repository and OpenML. To identify these datasets, we first select all regression datasets (i.e., datasets consisting of data items labeled by real numbers) in UCI or OpenML that contain at least 30,000 data items. We then try to apply equal-width binning to each such dataset (i.e., we bin the data according to their label, constraining the resulting bins to span equal-width intervals of the label range), in such a way that the binning produces 10 bins (which we view as ordered classes) of at least 1000 data items each. We only retain the datasets for which such a binning is possible. In these cases, in order to retain as many data items as possible, we maximize the distance between the leftmost and the rightmost bin boundaries (which implies, among other things, using exactly 10 bins), and we remove all the data items that lie outside the resulting bins. From this protocol, we obtain the 4 datasets UCI-blog-feedback-OQ, UCI-online-news-popularity-OQ, OpenMl-Yolanda-OQ, and OpenMl-fried-OQ, which we make publicly available.
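The binning step can be sketched as follows. This is a simplified illustration with synthetic labels; the actual protocol additionally searches for the widest admissible bin boundaries before discarding out-of-range items.

```python
import numpy as np

def equal_width_ordinal_bins(y: np.ndarray, n_bins: int = 10,
                             min_per_bin: int = 1000):
    """Bin real-valued labels y into n_bins equal-width ordinal classes;
    return the class labels, or None if any bin holds too few items."""
    edges = np.linspace(y.min(), y.max(), n_bins + 1)
    classes = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
    counts = np.bincount(classes, minlength=n_bins)
    return classes if (counts >= min_per_bin).all() else None

# hypothetical regression labels of a sufficiently large dataset
rng = np.random.default_rng(0)
y = rng.uniform(0.0, 10.0, size=30_000)
classes = equal_width_ordinal_bins(y)
print(None if classes is None else np.bincount(classes))
```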

We present the results obtained on these datasets in “Results on other datasets” section in Appendix 2. The reason why we confine these results to an appendix is that, unlike Amazon-OQ-BK and Fact-OQ, these datasets do not consist of “naturally ordinal” data. In other words, in order to create these datasets we take data that were originally labeled by real numbers (i.e., data suitable for metric regression experiments), bin them by their label, and view the resulting bins as ordinal classes. The ordinal nature of these datasets is thus somewhat questionable, and we thus prefer not to consider them as being on a par with Amazon-OQ-BK and Fact-OQ, which instead originate from data that their users actually treat as being ordinal.

6.2 Results: non-ordinal quantification methods with ordinal classifiers

In our first experiment, we investigate whether OQ can be solved by non-ordinal quantification methods built on top of ordinal classifiers. To this end, we compare the use of a standard multi-class logistic regression (LR) with the use of several ordinal variants of LR. In general, we have found that LR models, trained on the deep RoBERTa embeddings of the Amazon-OQ-BK dataset, are extremely powerful in terms of quantification performance. Embedding ordinal LR variants in non-ordinal quantifiers would therefore be a straightforward solution to OQ, and is hence worth investigating.

The ordinal LR variants we test are the “All Threshold” variant (OLR-AT) and the “Immediate Threshold” variant (OLR-IT) of Rennie and Srebro (2005). In addition, we try two ordinal classification methods based on discretizing the outputs generated by regression models (Pedregosa et al. 2017); the first is based on ridge regression (ORidge) while the second, called Least Absolute Deviation (LAD), is based on linear SVMs.
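To illustrate the regression-based family of methods, the following is a minimal sketch of the ORidge idea (fit a regressor on the integer class indices, then discretize its continuous outputs by rounding). This is our own simplification for illustration, not the implementation of Pedregosa et al. (2017); the function name and toy data are ours.

```python
import numpy as np

def ordinal_ridge_fit_predict(X, y, X_test, alpha=1.0, n_classes=5):
    """Fit ridge regression on the integer class indices 0..n_classes-1,
    then discretize the continuous predictions by rounding."""
    # closed-form ridge solution (no intercept): w = (X'X + aI)^-1 X'y
    d = X.shape[1]
    w = np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
    y_cont = X_test @ w
    return np.clip(np.round(y_cont), 0, n_classes - 1).astype(int)

# toy data whose single feature grows with the class index
rng = np.random.default_rng(0)
y = rng.integers(0, 5, size=200)
X = y[:, None] + rng.normal(scale=0.1, size=(200, 1))
pred = ordinal_ridge_fit_predict(X, y, X, alpha=1e-6)
print((pred == y).mean())
```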

Table 2 reports the results of this experiment, using the non-ordinal quantifiers of Sect. 4.2 and following the APP-OQ protocol (the results for other protocols were by and large similar and are omitted for conciseness). The fact that the best results are almost always obtained by using, as the embedded classifier, non-ordinal LR shows that, in order to deliver accurate estimates of class prevalence values in the ordinal case, it is not sufficient to equip a multi-class quantifier with an ordinal classifier. Moreover, the fact that PCC obtains worse results when equipped with the ordinal classifiers (OLR-AT and OLR-IT) than when equipped with the non-ordinal one (LR) suggests that the posterior probabilities computed under the ordinal assumption are of lower quality.

Table 2 Performance of classifiers in terms of average NMD (lower is better) in the Amazon-OQ-BK dataset for the APP-OQ protocol

Overall, these results suggest that, in order to tackle OQ, we cannot simply rely on ordinal classifiers embedded in non-ordinal quantification methods. Instead, we need proper OQ methods.

6.3 Results: ordinal quantification methods

In our main experiment, we compare our proposed methods o-ACC, o-PACC, o-HDx, o-HDy, o-SLD, o-EDy, and o-PDF with several baselines, i.e.,

  1. the non-ordinal quantification methods CC, PCC, ACC, PACC, HDx, HDy, and SLD (see Sect. 4.2);

  2. the ordinal quantification methods OQT, ARC, EDy, and PDF (see Sect. 4.3); and

  3. the ordinal quantification methods IBU and RUN from the “unfolding” tradition (see Sect. 4.4).

We compare these methods on the Amazon-OQ-BK and Fact-OQ datasets, using real prevalence vectors and the APP and APP-OQ protocols.

Table 3 Average performance in terms of NMD (lower is better) for the Amazon-OQ-BK data
Table 4 Same as Table 3 but using Fact-OQ in place of Amazon-OQ-BK

Each method is allowed to tune the hyper-parameters of its embedded classifier on the bags of the validation set. We use logistic regression on Amazon-OQ-BK and random forests on Fact-OQ; this choice of classifiers is motivated by common practice in the fields from which these datasets originate, and by our own experience that these classifiers work well on the respective types of data. To estimate the quantification matrix \(\textbf{M}\) of a logistic regression consistently, we use k-fold cross-validation with \(k=10\), by now a standard procedure in quantification learning (Forman 2005). Since random forests produce out-of-bag predictions at virtually no extra cost, they do not require additional held-out predictions from cross-validation to estimate the generalization error of the forest (Breiman 1996). Therefore, we use the out-of-bag predictions of the random forest to estimate \(\textbf{M}\) in a consistent manner, without further cross-validating these classifiers.
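The estimation of \(\textbf{M}\) from held-out predictions can be sketched as follows, assuming \(\textbf{M}\) collects the class-conditional prediction rates \(P(\hat{y}=y_{i} \mid y=y_{j})\), as in ACC-style methods; the toy labels below are ours, standing in for cross-validated or out-of-bag predictions.

```python
import numpy as np

def estimate_M(y_true: np.ndarray, y_pred: np.ndarray, n_classes: int):
    """Estimate M[i, j] = P(prediction = i | true class = j) from
    held-out predictions (k-fold CV or out-of-bag)."""
    M = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            M[i, j] = np.mean(y_pred[y_true == j] == i)
    return M  # each column sums to 1

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 0, 1])  # hypothetical CV predictions
M = estimate_M(y_true, y_pred, 2)
print(M)
```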

After the hyper-parameters of the quantifier, including those of the embedded classifier, are optimized, we apply each method to the bags of the test set. The results of this experiment are summarized in Tables 3 and 4. These results convey that our proposed methods outperform the competition on both datasets and under all protocols or, at the very least, perform on par with it. Under each protocol, o-SLD is the best method on Amazon-OQ-BK, while o-PACC and o-SLD are the best methods on Fact-OQ.

For all methods, we observe that the ordinally regularized variant is always better than, or equal to, the original, non-regularized variant of the same method. This observation also holds for EDy and PDF, the two recent OQ methods that address ordinality through ordinal feature transformations (EDy) and loss functions (PDF). We further observe that the non-regularized EDy and PDF often lose even against non-ordinal baselines, such as SLD and HDy. From this outcome, we conclude that, in addressing ordinality, regularization is indeed a more important aspect than the feature transformations and loss functions that have been proposed so far.

Regularization even improves performance in the standard APP protocol, where the sampling does not enforce any smoothness. First of all, this finding demonstrates that regularization leads to a performance improvement that cannot be dismissed as a mere byproduct of simply having smooth ground-truth prevalence vectors (such as in APP-OQ and with real prevalence vectors). Instead, regularization appears to result in a systematic improvement of OQ predictions. We attribute this outcome to the fact that, even if no smoothness is enforced, neighboring classes are still hard to distinguish in ordinal settings. Therefore, an unregularized quantifier can easily tend to over- or under-estimate one class at the expense of its neighboring class. Regularization, however, effectively controls the difference between neighboring prevalence estimates, thereby protecting quantifiers from a tendency towards the over- or under-estimation of particular classes. This effect persists even if the evaluation protocol, like APP, does not enforce smooth ground-truth prevalence vectors. Hence, the performance improvement due to regularization can be attributed (at least in part) to the similarity between neighboring classes, a ubiquitous phenomenon in ordinal settings.

Experiments carried out on the UCI and OpenML datasets reinforce the above conclusions. We provide these results in the appendix.

Fig. 4

Each point represents one hyper-parameter combination in the space of the average validation error (y axis) and the average ratio between the jaggedness of the predictions \(\hat{\textbf{p}}\) and the jaggedness of the ground-truth vectors \(\textbf{p}\) (x axis) during APP-OQ. Colors and shapes represent the regularization parameters of the hyper-parameter combinations. Our proposed ordinal regularization is beneficial for configurations that are otherwise too jagged, i.e., for configurations that are located to the right of the vertical line at \(\frac{\xi _1(\hat{\textbf{p}})}{\xi _1(\textbf{p})} = 1\)

6.4 Results: limitations of ordinal regularization

Table 3 lists several cases in which, if evaluated on the Amazon-OQ-BK data, some of our ordinal variants (e.g., o-ACC, o-PACC, o-HDx, and o-HDy) perform only on par with (and do not outperform) the non-ordinal methods they extend; hence, regularization is not able to improve quantification performance in these particular cases.

The reason for this observation is that our embedding representation of the Amazon-OQ-BK data often leads to predictions that are already smooth without any regularization. Due to this smoothness property of the data, any additional smoothing through regularization bears the danger of over-smoothing (i.e., of predictions that tend to be smoother than the ground-truth) which, in turn, can increase the prediction error.

Figure 4 illustrates this issue by plotting the average validation NMD over the average ratio \(\frac{\xi _1(\hat{\textbf{p}})}{\xi _1(\textbf{p})}\) between the jaggedness of the predictions, \(\xi _1(\hat{\textbf{p}})\), and the jaggedness of the ground-truth vectors, \(\xi _1(\textbf{p})\). Here, ratios smaller than one indicate that the predictions tend to be less jagged than the ground truth; in other words, they tend to be too smooth and, hence, often exhibit high NMD values. Since regularization adds smoothness to predictions, we expected a benefit in NMD only for those predictions that are otherwise too jagged, with ratios above one. Examples of improvements are o-SLD with the Amazon-OQ-BK data (sub-plot b in Fig. 4) and o-PACC with the Fact-OQ data (sub-plot d). However, PACC with Amazon-OQ-BK (sub-plot a) turns out to be already too smooth, even without any regularization. Therefore, adding regularization cannot further decrease the NMD on this dataset.

The high smoothness within sub-plot (a) is a consequence of the powerful embedding representation that we employ for the Amazon-OQ-BK data (see Sect. 6.1.2). To demonstrate this claim, we repeat the same experiment with the same data and the same classifier, but employ a weaker TF-IDF representation instead of the embeddings. As we can see in sub-plot (c), the weaker representation leads again to predictions that are too jagged and, hence, can benefit from regularization. The complete results of the TF-IDF representation can be found in Appendix 2.

We conclude that smoothness can be achieved not only through regularization but also through suitable data representations, although the latter direction remains open for future research. Regularization benefits quantification performance only if the predictions are otherwise too jagged, a condition that can be verified by evaluating \(\frac{\xi _1(\hat{\textbf{p}})}{\xi _1(\textbf{p})}\). Regularization parameters provide fine-grained control over the smoothness that predictions exhibit.

7 Other notions of smoothness for ordinal distributions

In Sect. 3.3 we have introduced the notion of “jaggedness” (and that of smoothness, its opposite), and we have proposed the \(\xi _{1}(\textbf{p}_{\sigma })\) function as a measure of how jagged an ordinal distribution \(\textbf{p}_{\sigma }\) is. We have then proposed ordinal quantification methods that use a Tikhonov matrix \(\textbf{C}_{1}\) whose goal is to minimize this measure, as in the regularization term of Eq. 30. The assumption behind \(\xi _{1}(\textbf{p}_{\sigma })\) and \(\textbf{C}_{1}\) is the key assumption of ordinality: that neighboring classes are similar.

However, note that \(\xi _{1}(\textbf{p}_{\sigma })\) is by no means the only conceivable function for measuring jaggedness, and that other alternatives are possible in principle. For instance, one such alternative might be

$$\begin{aligned} \xi _{0}(\textbf{p}_{\sigma }) = \ \frac{1}{2}\sum _{i=1}^{n-1}(p_{\sigma }(y_{i})-p_{\sigma }(y_{i+1}))^{2} \end{aligned}$$
(34)

where \(\frac{1}{2}\) is a normalization factor to ensure that \(\xi _{0}(\textbf{p}_{\sigma })\) ranges between 0 (least jagged distribution) and 1 (most jagged distribution). For instance, the two distributions in the example of Sect. 3.3 yield the values \(\xi _{0}(\textbf{p}_{\sigma _{1}})=0.0375\) and \(\xi _{0}(\textbf{p}_{\sigma _{2}})=0.4050\).

A matrix analogue to the \(\textbf{C}_{1}\) matrix of Sect. 4.4.1, whose goal is to minimize \(\xi _{0}(\textbf{p}_{\sigma })\) instead of \(\xi _{1}(\textbf{p}_{\sigma })\), would be

$$\begin{aligned} \textbf{C}_0 = \begin{pmatrix} 1 &{} -1 &{} 0 &{} \cdots &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} -1 &{} \cdots &{} 0 &{} 0 &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots &{} \vdots \\ 0 &{} 0 &{} 0 &{} \cdots &{} 1 &{} -1 &{} 0 \\ 0 &{} 0 &{} 0 &{} \cdots &{} 0 &{} 1 &{} -1 \\ \end{pmatrix} \in \mathbb {R}^{(n-1) \times n} \end{aligned}$$
(35)

By using \(\textbf{C}_0\), one could build regularization-based ordinal quantification methods based on \(\xi _{0}(\textbf{p}_{\sigma })\) rather than on \(\xi _{1}(\textbf{p}_{\sigma })\).

The main difference between \(\xi _{0}(\textbf{p}_{\sigma })\) and \(\xi _{1}(\textbf{p}_{\sigma })\) is that, for each class \(y_{i}\), in \(\xi _{1}(\textbf{p}_{\sigma })\) we look at the prevalence values of both its right neighbor and its left neighbor, while in \(\xi _{0}(\textbf{p}_{\sigma })\) we look at the prevalence value of its right neighbor only. Unsurprisingly, \(\xi _{0}(\textbf{p}_{\sigma })\) has a different behavior than \(\xi _{1}(\textbf{p}_{\sigma })\). For example, unlike for \(\xi _{1}(\textbf{p}_{\sigma })\), for \(\xi _{0}(\textbf{p}_{\sigma })\) there is a unique least jagged distribution, namely, the uniform distribution \(p_\sigma (y) = \frac{1}{n} \;\forall \,y \in \mathcal {Y}\).
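These properties of \(\xi _{0}\) can be checked directly from Eq. 34; the sketch below verifies that the uniform distribution is the (unique) least jagged distribution, and that a single spike on an inner class attains the maximum value of 1:

```python
import numpy as np

def xi0(p: np.ndarray) -> float:
    """Jaggedness xi_0 (Eq. 34): half the sum of squared differences
    between the prevalence values of adjacent classes."""
    return 0.5 * float(np.sum(np.diff(p) ** 2))

n = 5
uniform = np.full(n, 1 / n)
spike = np.zeros(n)
spike[n // 2] = 1.0  # all probability mass on one inner class

print(xi0(uniform))  # 0.0 (the unique least jagged distribution)
print(xi0(spike))    # 1.0 (a most jagged distribution)
```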

More importantly, \(\xi _{0}(\textbf{p}_{\sigma })\) and \(\xi _{1}(\textbf{p}_{\sigma })\) are not monotonic functions of each other; for instance, given the distributions \(\textbf{p}_{\sigma _{2}}\) (from Sect. 3.3) and \(\textbf{p}_{\sigma _{3}} = \ (0.00, 0.00, 0.00, 0.00, 1.00)\), it is easy to check that \(\xi _{1}(\textbf{p}_{\sigma _{2}})>\xi _{1}(\textbf{p}_{\sigma _{3}})\) but \(\xi _{0}(\textbf{p}_{\sigma _{2}})<\xi _{0}(\textbf{p}_{\sigma _{3}})\). Hence, the choice of the jaggedness measure indeed makes a difference in methods that regularize with respect to jaggedness. Ultimately, it seems reasonable to have the designer choose which function ideally reflects the notion of “ordinal plausibility” in the specific application being tackled.

While the particular mathematical form of \(\xi _{0}(\textbf{p}_{\sigma })\), as given in Eq. 34, may seem arbitrary, a mathematical justification comes from the following observation: \(\xi _{0}(\textbf{p}_{\sigma })\) measures the amount by which our predicted distribution \(\hat{\textbf{p}}_{\sigma }\) deviates from a polynomial of degree 0 (i.e., from a constant line). This observation also reveals the meaning of the subscript “0” in \(\xi _{0}(\textbf{p}_{\sigma })\). In contrast, \(\xi _{1}(\textbf{p}_{\sigma })\) measures the amount by which \(\hat{\textbf{p}}_{\sigma }\) deviates from a polynomial of degree 1 (i.e., from any straight line). Indeed, all of the least jagged distributions (according to \(\xi _{1}\)) listed at the end of Sect. 3.3 are perfect fits to a straight line (assuming equidistant classes). For instance,

$$\begin{aligned} \textbf{p}_{\sigma _{4}}&= \ (0.0, 0.1, 0.2, 0.3, 0.4) \end{aligned}$$
(36)

represents the sequence of points ((1, 0.0), (2, 0.1), (3, 0.2), (4, 0.3), (5, 0.4)) that lies on the straight line \(y=\frac{1}{10}x-\frac{1}{10}\).

Yet another notion of jaggedness might be implemented by the function

$$\begin{aligned} \xi _2(\textbf{p}_{\sigma }) = \ \frac{1}{8}\sum _{i=1}^{n-3}(3p_{\sigma }(y_{i+1})-3p_{\sigma }(y_{i+2})+p_{\sigma }(y_{i+3})-p_{\sigma }(y_{i}))^{2} \end{aligned}$$
(37)

which measures the amount of deviation from a polynomial of degree 2 (i.e., a parabola); while \(\xi _{1}(\textbf{p}_{\sigma })\) penalizes the presence of any hump in the distribution, \(\xi _{2}(\textbf{p}_{\sigma })\) would penalize the presence of more than one hump. For instance, the distribution

$$\begin{aligned} \textbf{p}_{\sigma _{5}}&= \ (0.129, 0.093, 0.127, 0.231, 0.405) \end{aligned}$$
(38)

would be a perfectly smooth distribution according to \(\xi _{2}(\textbf{p}_{\sigma })\), because it produces points that lie on the parabola \(y=0.035x^{2}-0.141x+0.235\) displayed in Fig. 5; it is not perfectly smooth according to \(\xi _{0}(\textbf{p}_{\sigma })\) or \(\xi _{1}(\textbf{p}_{\sigma })\). A matrix analogue of \(\xi _2(\textbf{p}_{\sigma })\) would be

$$\begin{aligned} \textbf{C}_2 = \begin{pmatrix} -1 &{} 3 &{} -3 &{} 1 &{} 0 &{} \cdots &{} 0 &{} 0 &{} 0 &{} 0 \\ 0 &{} -1 &{} 3 &{} -3 &{} 1 &{} \cdots &{} 0 &{} 0 &{} 0 &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots &{} \vdots &{} \vdots \\ 0 &{} 0 &{} 0 &{} 0 &{} 0 &{} \cdots &{} -1 &{} 3 &{} -3 &{} 1 \\ \end{pmatrix} \in \mathbb {R}^{(n-3) \times n} \end{aligned}$$
(39)
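As a sanity check, \(\xi _{2}\) can be computed directly from Eq. 37; applied to the distribution \(\textbf{p}_{\sigma _{5}}\) of Eq. 38, it vanishes (up to floating-point error in the rounded prevalence values), while a spiked distribution yields a strictly positive value:

```python
import numpy as np

def xi2(p: np.ndarray) -> float:
    """Jaggedness xi_2 (Eq. 37): scaled sum of squared third-order
    differences between the prevalence values of consecutive classes."""
    s = sum((3 * p[i + 1] - 3 * p[i + 2] + p[i + 3] - p[i]) ** 2
            for i in range(len(p) - 3))
    return s / 8

p_sigma5 = np.array([0.129, 0.093, 0.127, 0.231, 0.405])  # Eq. 38
print(xi2(p_sigma5))                     # ~0: points lie on a parabola
print(xi2(np.array([0, 0, 1, 0, 0.])))   # > 0: a spike is jagged
```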

In fact, we can produce matrices that penalize deviations from polynomials of any chosen degree. To achieve this goal, we first repeatedly multiply—each time with the transpose of the previous product—a square variant of \(\textbf{C}_0\),

$$\begin{aligned} \textbf{C}' = \begin{pmatrix} 1 &{} -1 &{} 0 &{} \cdots &{} 0 &{} 0 &{} 0 \\ 0 &{} 1 &{} -1 &{} \cdots &{} 0 &{} 0 &{} 0 \\ \vdots &{} \vdots &{} \vdots &{} \ddots &{} \vdots &{} \vdots &{} \vdots \\ 0 &{} 0 &{} 0 &{} \cdots &{} 1 &{} -1 &{} 0 \\ 0 &{} 0 &{} 0 &{} \cdots &{} 0 &{} 1 &{} -1 \\ 0 &{} 0 &{} 0 &{} \cdots &{} 0 &{} 0 &{} 1 \\ \end{pmatrix} \in \mathbb {R}^{n \times n} \end{aligned}$$
(40)

which is the original \(\textbf{C}_0\) matrix with one additional row appended at the end. Second, we omit the outermost rows of each such product. That is, omitting the last row of \(\textbf{C}'\) yields \(\textbf{C}_0\); omitting the first and the last row of \((\textbf{C}')^\top \textbf{C}'\) yields \(\textbf{C}_1\); and omitting the first row and the last two rows of \(((\textbf{C}')^\top \textbf{C}')^\top \textbf{C}'\) yields \(\textbf{C}_2\), up to a constant factor. This procedure provides us with matrices \(\textbf{C}_3\), \(\textbf{C}_4\), ... that correspond to jaggedness measures \(\xi _3(\textbf{p}_{\sigma })\), \(\xi _4(\textbf{p}_{\sigma })\), ... and penalize deviations from polynomials of degree 3, 4, and so on.
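This construction can be verified directly. The sketch below builds \(\textbf{C}'\) for \(n=6\) and recovers the first- , second-, and third-difference stencils of \(\textbf{C}_0\), \(\textbf{C}_1\), and \(\textbf{C}_2\):

```python
import numpy as np

def tikhonov_matrices(n: int):
    """Construct C_0, C_1, C_2 by repeatedly multiplying the transpose of
    the previous product with the square variant C' (Eq. 40), and then
    omitting the outermost rows of each product."""
    C_prime = np.eye(n) - np.eye(n, k=1)  # 1 on diagonal, -1 above it
    C0 = C_prime[:-1]                     # omit the last row
    P = C_prime.T @ C_prime
    C1 = P[1:-1]                          # omit the first and last rows
    P = P.T @ C_prime
    C2 = P[1:-2]                          # omit first and last two rows
    return C0, C1, C2

C0, C1, C2 = tikhonov_matrices(6)
print(C1[0])  # second-difference stencil: [-1.  2. -1.  0.  0.  0.]
print(C2[0])  # third-difference stencil:  [-1.  3. -3.  1.  0.  0.]
```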

Fig. 5

The ordinal distributions \(\textbf{p}_{\sigma _{4}}\) (blue circles) and \(\textbf{p}_{\sigma _{5}}\) (red triangles). The lines display perfect polynomial fits of degree 1 (blue) and degree 2 (red) (Color figure online)

In this article, we have chosen \(\xi _{1}\) as our primary measure of jaggedness because \(\xi _{1}\) reflects the assumption of ordered classes in a minimal sense. In contrast to \(\xi _{0}\), it permits many different distributions that are all least jagged. Using \(\xi _{0}\) would instead promote the uniform distribution exclusively, which would remain the least jagged distribution even if the order of the classes was randomly shuffled and was, hence, meaningless in terms of OQ. In contrast to \(\xi _{2}\) (or \(\xi _{3}\), \(\xi _{4}\), ...), our chosen \(\xi _{1}\) is more general, in the sense that it does not impose any specific shape (like parabolas, third-order polynomials, etc.) beyond the simplest shape that exhibits small differences between consecutive classes. Hence, we consider \(\xi _{1}\) to be the most suitable notion of jaggedness for studying the general value of regularization in OQ: it reflects the minimal OQ assumption that neighboring classes are similar, in the sense that they have similar prevalence values. We leave other notions of jaggedness, which reflect the needs of particular OQ applications, for future work.

8 Conclusions

We have carried out a thorough investigation of ordinal quantification, which includes (i) making available two datasets for OQ, generated according to the strong extraction protocols APP and APP-OQ and according to real prevalence vectors, which overcome the limitations of existing OQ datasets, (ii) showing that OQ cannot be profitably tackled by simply embedding ordinal classifiers into non-ordinal quantification methods, (iii) proposing seven OQ methods (o-ACC, o-PACC, o-HDx, o-HDy, o-SLD, o-EDy, and o-PDF) that combine intuitions from existing, ordinal and non-ordinal quantification methods and from existing, physics-inspired “unfolding” methods, and (iv) experimentally comparing our newly proposed OQ methods with existing non-ordinal quantification methods, ordinal quantification methods, and “unfolding” methods, which we have shown to be OQ methods under a different name. Our newly proposed OQ methods outperform the competition, a finding that our appendix confirms with additional error measures and datasets.

At the heart of the success of our newly proposed methods lies regularization, which is motivated by the ordinal plausibility assumption, i.e., the assumption that typical OQ class prevalence vectors are smooth. In future work, we plan to investigate other ways of achieving ordinal plausibility, to address different notions of smoothness, and to develop regularization terms that address characteristics of other quantification problems outside of OQ.