Introduction

Extensive data collections are considered valuable resources for hypothesis generation as well as theory building and confirmation. Providing samples of major concepts or categories, they can be seen as the foundation of data-driven learning and reasoning. Classical machine learning techniques focus on the discrimination of individual concepts. Depending on the chosen model type, they allow an interpretation of the underlying discrimination rule and the generation of hypotheses on its intrinsic characteristics [10, 27, 45]. For example, features with a high impact on a decision boundary can be reported when screening for potential causes of class differences [21, 34, 36]. Dense regions can be extracted when the definition of prototypic cases is of interest [16]. Combined with external domain knowledge classification models can also give hints to higher-level processes involved [35, 47].

Hypotheses on the relations of the categories are only rarely provided [9, 22, 52]. Nevertheless, they have tremendous explanatory potential as they link different concepts collected for a specific subject. This information is complementary to the intrinsic characteristics mentioned above. Together they allow a holistic overview of the matter. One type of classifier that utilises interclass relations are ordinal classifiers [11]. An ordinal classifier relies on an ordinal relationship of the kind \(a \prec b \prec c\), which is assumed for the semantic concepts (class labels) of a dataset but not guaranteed for its feature representation. The assumed semantic ordering guides the training process of an ordinal classifier. If it is reflected in the feature space, the ordinal classifier can achieve high sensitivities. Otherwise, it will show a diminished performance [33].

A dependency on a predefined class order can be established in different ways. The overall structure, as well as the training algorithm of a classification model, can be modified. The first one is most prominently used for ordinal adaptations of hierarchical multi-class architectures based on sequences of binary classifiers [40]. The order of these sequences can be designed to reflect assumed class orders. Examples can be ordinal classifier cascades [18], which are modifications of general decision lists [45], ordinal versions [23] of nested dichotomies [19] or systems based on directed acyclic graphs [44]. More specific modifications can be applied to different types of base classifiers [11, 12]. A training algorithm can be adapted for ordinal classification by utilising specific performance measures [50] or cost-sensitive base classifiers [31, 39].

Here, we now do not assume any order on the class labels but instead, screen for potential candidate orders on the semantic level. Nevertheless, the assumption of a total order of classes might not be appropriate in the presence of non-ordinal categories. We propose a novel screening procedure for generating hypotheses on ordinal relations among subsets of these categories. It extensively replaces de novo modelling by memoisation techniques bringing exhaustive evaluation into range. We utilise ordinal classifier cascades to screen through all class combinations and highlight those sub-cascades that achieve a predefined minimal class-wise sensitivity which is more difficult for longer sub-cascades. We, therefore, focus on those that cover as many classes as possible. We found that accurate sub-cascades can also result in a compact assignment of the remaining categories and conclude that they hence allow for hypothesis generation on these remaining non-ordinal classes.

The remaining article is organised as follows. Section 2 provides the underlying notation and concepts of ordinal classification. In Sect. 3, we describe our screening method. We outline the design of our experiments and results in Sects. 4 and 5 and discuss them in Sect. 6.

Methods

The basic building block of our screening experiments is the training and evaluation of classification models, the classifiers. In its basic version, a classifier is a function designed for categorising an object into one out of \(|\mathcal {Y}|\) predefined and fixed classes \(y\in \mathcal {Y}\)

$$\begin{aligned} c:\mathbb {R}^{n} \longrightarrow \mathcal {Y}. \end{aligned}$$
(1)

The corresponding classification task is typically named binary (\(|\mathcal {Y}|=2\)) or multi-class (\(|\mathcal {Y}|>2\)). The object is presented as a vector \(\mathbf {x} = (x_{1},\ldots ,x_{n})^T\) of measurements which we assume to be embedded in the n-dimensional real-valued space \(\mathbb {R}^{n}\). A classifier is adapted to a classification task via a set of labeled training examples \(\mathcal {T}=\{(\mathbf {x}_{i},y_{i})\}_{i=1}^{|\mathcal {T}|}\), \(y_{i} \in \mathcal {Y}\). The training algorithm and the concept class of the classifier is chosen a priori.

The performance of a classifier is typically evaluated on an independent set of validation samples \(\mathcal {V}=\{(\mathbf {x}'_{i}, y'_{i})\}_{i=1}^{|\mathcal {V}|}\). Although a classifier is trained to predict a predefined set of classes \(\mathcal {Y}\), it is applied in altered scenarios in which a subset \(y'_{i}\in \mathcal {Y}' \subset \mathcal {Y}\) or a superset \(y'_{i}\in \mathcal {Y}' \supset \mathcal {Y}\) of classes might be used. We will therefore define our performance measures in dependency on the label set of interest.

All performance measures will be based on conditional prediction rates of type

$$\begin{aligned} p_{c}(y' | \mathcal {X}_{y}) = \frac{1}{|\mathcal {X}_{y}|}\sum _{x \in \mathcal {X}_{y}} \mathbb {I}_{\left[ c(x)= y'\right] }. \end{aligned}$$
(2)

The symbol \(\mathbb {I}_{\left[ \cdot \right] }\) denotes the indicator function and \(\mathcal {X}_{y}\) denotes a set of samples of class y. Other (re-)sampling schemes might be applied but will not change the theoretical characteristics discussed in this work. Three types of conditional prediction rates can be distinguished:

$$\begin{aligned} {sensitivities} \quad&\text {if}&\quad y=y' \quad \text {and}\quad y,y' \in \mathcal {Y} \end{aligned}$$
(3)
$$\begin{aligned} {confusions} \quad&\text {if}&\quad y\ne y' \quad \text {and}\quad y,y' \in \mathcal {Y} \end{aligned}$$
(4)
$$\begin{aligned} {external rates} \quad&\text {if}&\quad y \not \in \mathcal {Y} \quad \text {and}\quad y' \in \mathcal {Y}. \end{aligned}$$
(5)

While (class-wise) sensitivities and confusions occur for the learned set of classes \(\mathcal {Y}\), the external rates describe the classifiers behaviour on samples of foreign classes \(y \in \mathcal {Y}'\setminus \mathcal {Y}\).

A standard performance measure for classification tasks is the empirical accuracy. In dependency on a label set \(\mathcal {Y}\) it can be formulated as

$$\begin{aligned} \mathrm {Acc}_{c}(\mathcal {Y}) = \frac{\sum _{y \in \mathcal {Y}}p_{c}(y \mid \mathcal {X}_{y})|\mathcal {X}_{y}|}{\sum _{y \in \mathcal {Y}}|\mathcal {X}_{y}|}. \end{aligned}$$
(6)

In its standard version, the empirical accuracy is applied to the full set of classes but it might also be applied to a sub- or superset of classes. In the following, we focus on the minimal class-wise sensitivity as a quality measure for a multi-class classifier system

$$\begin{aligned} \mathrm {Smin}_{c}(\mathcal {Y}) = \underset{y \in \mathcal {Y}}{\mathrm {min}} \quad p_{c}(y \mid \mathcal {X}_{y}). \end{aligned}$$
(7)

It can be seen as a lower bound on the overall accuracy

$$\begin{aligned} \mathrm {Acc}_{c}(\mathcal {Y})= & {} \frac{\sum _{y \in \mathcal {Y}}p_{c}(y \mid \mathcal {X}_{y})|\mathcal {X}_{y}|}{\sum _{y \in \mathcal {Y}}|\mathcal {X}_{y}|} \end{aligned}$$
(8)
$$\begin{aligned}\ge & {} \frac{\sum _{y \in \mathcal {Y}} {\mathrm {min}}_{y' \in \mathcal {Y}} \, p_{c}(y' \mid \mathcal {X}_{y'})|\mathcal {X}_{y}|}{\sum _{y \in \mathcal {Y}}|\mathcal {X}_{y}|} \end{aligned}$$
(9)
$$\begin{aligned}= & {} {\mathrm {min}}_{y' \in \mathcal {Y}} \, p_{c}(y' \mid \mathcal {X}_{y'}) \end{aligned}$$
(10)
$$\begin{aligned}= & {} \mathrm {Smin}_{c}(\mathcal {Y}). \end{aligned}$$
(11)

In the context of analysing subsets of classes, \(\mathrm {Smin}_{c}\) shows a monotonicity on the number of classes. For a non empty subset of classes

$$\begin{aligned} \mathcal {Y}' \subseteq \mathcal {Y},\, \mathcal {Y}' \ne \emptyset : \,\mathrm {Smin}_{c}(\mathcal {Y}') \ge \mathrm {Smin}_{c}(\mathcal {Y}). \end{aligned}$$
(12)

For external classes \(y'\not \in \mathcal {Y}\), we define the percentage of the most frequently predicted class label \(y \in \mathcal {Y}\) as the purity of \(y'\)

$$\begin{aligned} \mathrm {pur}_{c}(y') = \underset{y \in \mathcal {Y}}{\mathrm {max}} \quad p_{c}(y \mid \mathcal {X}_{y'}). \end{aligned}$$
(13)

Ordinal Classification

Ordinal classification is a multi-class classification task (\(|\mathcal {Y}|>2\)), in which we hypothesise that the classes are related via an ordinal relationship

$$\begin{aligned} y_{(1)} \prec \ldots \prec y_{(|\mathcal {Y}|)}, \end{aligned}$$
(14)

where the subscript \(y_{(i)}\) indicates the position of the class within the ordering (ith class). The symbol \(\prec \) indicates that this ordering is only known (or assumed) for the verbal concepts. Its reflection in feature space cannot be guaranteed. Nevertheless, ordinal classifiers are constrained to this ordering and try to identify an embedding that reflects this ordinality. If such an embedding can be found, the chosen feature representation might be seen as evidence for the hypothesised ordinal relationship; otherwise the ordinal classifier will suffer from a decreased generalisation performance (Fig. 1). The classifier will typically not be able to predict all classes of \(\mathcal {Y}\). Its performance might be measured by the minimal class-wise sensitivity \(\mathrm {Smin}_{c}(\mathcal {Y})\). The susceptibility of ordinal classifiers to incorrect class orders is therefore an important characteristic for the screening processes, in which we try to select reflected/relevant class orders among all possible class orders.

Fig. 1
figure 1

Influence of class order on ordinal classifier cascades. The figure shows four ordinal classifier cascades trained under different assumptions of the class order. Panel a: A class order that is reflected by the data. The inverse order is shown in Panel b. Note, that the decision regions differ in both cases. Panels c, d: Examples for class orders that are not reflected by the data. Here, decision regions of individual classes can be empty or even do not exist

Ordinal Classifier Cascades

Our screening procedure is based on ordinal classifier cascades, which can be seen as ensemble multi-class classifiers

$$\begin{aligned} h: \mathbb {R}^n \longrightarrow \mathcal {Y} \end{aligned}$$
(15)

based on an ensemble \(\mathcal {E}= \{c_{(i)}\}_{i=1}^{|\mathcal {Y}|-1}\) of \(|\mathcal {Y}|-1\) binary base classifiers

$$\begin{aligned} c_{(i)}: \mathbb {R}^{n} \longrightarrow \{y_{(i)}, y_{(i+1)}\} \end{aligned}$$
(16)

designed for predicting a pair of two consecutive classes within the assumed class ordering. We utilise pairwise training in the following meaning that each base classifier \(c_{(i)}\) is trained on the samples of two consecutive classes

$$\begin{aligned} \mathcal {T}_{(i)} = \left\{ (\mathbf {x},y) \mid (\mathbf {x},y) \in \mathcal {T}, y \in \{y_{(i)}, y_{(i+1)}\} \right\} . \end{aligned}$$
(17)

The cascade itself can be seen as an untrainable fusion architecture, which evaluates the base classifiers sequentially according to the assumed class order

$$\begin{aligned} h(\mathbf {x}) = {\left\{ \begin{array}{ll} y_{(1)} &{}\text {if}\quad c_{(1)}(\mathbf {x}) = y_{(1)}\\ y_{(i)} &{}\text {if}\quad \left( \underset{j<i}{\bigwedge }\left( c_{(j)}(\mathbf {x})=y_{(j+1)}\right) \right) \wedge c_{(i)}(\mathbf {x})=y_{(i)}\\ y_{(|\mathcal {Y}|)} &{}\text {else.} \end{array}\right. } \end{aligned}$$
(18)

Pairwise Training of Base Classifiers

The pairwise training of the base classifiers improves the training and the evaluation time of ordinal classifier cascades when multiple class orders are tested. As the pairwise training of a base classifier \(c_{(i)}\) does not take into account any foreign classes \(\mathcal {Y} \setminus \{y_{(i)}, y_{(i+1)}\}\), a classifier designed for predicting classes \(y_{i}\) and \(y_{j}\)

$$\begin{aligned} c_{i,j}: \mathbb {R}^{n} \longrightarrow \{y_{i}, y_{j}\} \end{aligned}$$
(19)

will be identical for all class orders. The classifier \(c_{i,j}\) is not based on the class order. The screening process through all class orderings can therefore be based on a common set of \(\left( {\begin{array}{c}|\mathcal {Y}|\\ 2\end{array}}\right) \) base classifiers and their predictions, which can be memoized beforehand. We propose to construct a prediction table as shown in Table 1, which allows the evaluation of all cascades without training and evaluation of the base classifiers.

Table 1 Scheme of predicted class labels

Error Bounds

As a consequence of their invariant predictions, also the conditional prediction rates of pairwise trained base classifiers are invariant against the order of classes. They can again be precalculated and looked up if necessary (Table 2). Previously, it has been shown that the class-wise sensitivities of an ordinal classifier cascade are upper bounded by a set of conditional prediction rates of the corresponding base classifiers [37].

Theorem 1

Let h denote an ordinal classifier cascade

$$\begin{aligned} h:\mathbb {R}^n \longrightarrow \mathcal {Y} = \left\{ y_{(1)}, \ldots , y_{(|\mathcal {Y}|)} \right\} \end{aligned}$$
(20)

with base classifiers \(\mathcal {E} = \left\{ c_{(1)},\ldots ,c_{(|\mathcal {Y}|-1)}\right\} \). Let furthermore \(\mathcal {X}_{(i)}\) be a non-empty set of samples of class \(y_{(i)}\). Then the sensitivity of h for \(y_{(i)}\) is limited by

$$\begin{aligned} p_{h}(y_{(i)} \,|\,\mathcal {X}_{y_{(i)}})\le & {} p_{c_{(i)}}(y_{(i)} \,|\,\mathcal {X}_{y_{(i)}}) \end{aligned}$$
(21)
$$\begin{aligned} p_{h}(y_{(i)} \,|\,\mathcal {X}_{y_{(i)}})\le & {} \min _{k<i} p_{c_{(k)}}(y_{(k+1)} \,|\,\mathcal {X}_{y_{(i)}}). \end{aligned}$$
(22)

Note that neither the construction nor the evaluation of an ordinal classifier cascade is required for providing these upper limits. A limit to the minimal class-wise sensitivity of an ordinal classifier cascade can be given by a set of lookups. In this way a large proportion of those cascades that will not pass a predefined threshold \(t\le \mathrm {Smin}_{c}(\mathcal {Y})\) can be rejected before starting the more expensive calculations on the prediction table (Table 1). We utilise the CASCADES algorithms [37], which realises this strategy as preprocessing step in order to reduce the computational burden. Only those cascades, which pass this filter will be evaluated for their real minimal class-wise sensitivity.

Table 2 Scheme of prediction rates

Screening for Ordinal (Sub-)structures

Here, we focus on the analysis of multi-class datasets that are not originally designed for ordinal classification (Fig. 2). That is, we do not assume a unique ordinal relationship for all classes in \(\mathcal {Y}\). The classes in \(\mathcal {Y}\) can instead be seen as a loose collection of known categories within a common context of interest. Nevertheless, subsets of classes \(\mathcal {Y}' \subseteq \mathcal {Y}\) can fulfil our quality criteria for ordinal classification. The corresponding decision regions can be interpreted as an ordinal structure, and we might hypothesise an ordinal relation between the classes in \(\mathcal {Y}'\). Here, memoization tables (Tables 1, 2) proposed for the full set of classes \(\mathcal {Y}\) can directly be repurposed for the analysis of any subset of classes \(\mathcal {Y}'\).

Fig. 2
figure 2

Schematic drawing of ordinal cascades. a A scheme of a full ordinal cascade that comprises all available classes \(y_{(1)}, \ldots , y_{(6)}\). b An ordinal subcascade. Here classes \(y_{(1)}, \ldots , y_{(6)}\) are embedded in a larger collection of classes (black circles), which can not be linked via one single ordinal relationship

Size of the Search Space

The number of candidate cascades is dependent on the chosen screening task. In all cases, it can be decomposed in the number of class orders \(|\mathcal {Y}|!\) and the number of subsets \(\left( {\begin{array}{c}|\mathcal {Y}|\\ k\end{array}}\right) \) of a predefined size kFootnote 1:

  • Full cascades: \(o(|\mathcal {Y}|)=|\mathcal {Y}|!\)

  • Subcascades of size k: \(s_{k}(|\mathcal {Y}|)=\left( {\begin{array}{c}|\mathcal {Y}|\\ k\end{array}}\right) k!\)

  • All subcascades: \(s_{all}(|\mathcal {Y}|)=\sum _{k=1}^{|\mathcal {Y}|} \left( {\begin{array}{c}|\mathcal {Y}|\\ k\end{array}}\right) k!\)

It might be noteworthy that \(s_{k}(|\mathcal {Y}|)< s_{|\mathcal {Y}|-k}(|\mathcal {Y}|)\) for \(k < |\mathcal {Y}|/2\).

The ratio of the number of all subcascades and the number of full cascades is approximately given by the Euler number as

$$\begin{aligned} s_{all}(|\mathcal {Y}|)&=\sum _{k=1}^{|\mathcal {Y}|} \left( {\begin{array}{c}|\mathcal {Y}|\\ k\end{array}}\right) k! \end{aligned}$$
(23)
$$\begin{aligned}&\ldots \nonumber \\&=|\mathcal {Y}|! \sum _{k=1}^{|\mathcal {Y}|} \frac{1}{k!} \end{aligned}$$
(24)

and the ratio itself

$$\begin{aligned} \frac{s_{all}(|\mathcal {Y}|)}{o(|\mathcal {Y}|)}=\sum _{k=1}^{|\mathcal {Y}|} \frac{1}{k!} \xrightarrow {|\mathcal {Y}| \rightarrow \infty } e\sim 2.718. \end{aligned}$$
(25)

Examples for numbers of candidate cascades can be found in Table 3. Note that for other (traditional) types of ordinal classifiers the evaluation of each candidate cascade requires the de novo training of the corresponding model. As these training phases by far exceed the time complexity of evaluating the memoization tables (Tables 1, 2), the proposed screenings are infeasible for traditional algorithms and implementations.

Table 3 Number of candidate cascades

Properties of Subcascades

The combinatorics show that the overall number of subcascades exceeds the overall number of full cascades by only a constant factor. Nevertheless, we expect a much higher number of candidate subcascades to pass a threshold on the minimal class-wise sensitivity than a number of candidate full cascades.

Theorem 2

Let \(y_{(1)} \prec \ldots \prec y_{(i)} \prec \ldots \prec y_{(|\mathcal {Y}|)}\) denote an arbitrary but fixed class order of the classes in \(\mathcal {Y}\) and \(h: \mathbb {R}^{n} \rightarrow \mathcal {Y}\) an ordinal classifier cascade designed for this class order and based on an ensemble of pairwise trained base classifiers \(\mathcal {E} =\{c_{(i)}\}_{i=1}^{|\mathcal {Y}|-1}\). Let furthermore \(\mathcal {Y}' = \mathcal {Y}\setminus \{y_{(i)}\}\) and \(h': \mathbb {R}^{n} \rightarrow \mathcal {Y}'\) denote an ordinal classifier cascade designed for class order \(y_{(1)} \prec \ldots \prec y_{(i-1)} \prec y_{(i+1)} \prec \ldots \prec y_{(|\mathcal {Y}|)}\) based on \(\mathcal {E}' =\mathcal {E} \setminus \{c_{(1)}\} \), if \(i=1\), \(\mathcal {E}' =\mathcal {E} \setminus \{c_{(|\mathcal {Y}|-1)}\} \), if \(i=|\mathcal {Y}|\) and

$$\begin{aligned} \mathcal {E}' = \left( \mathcal {E} \setminus \{c_{(i-1)},c_{(i)}\} \right) \cup \{c'_{(i-1)}\} \end{aligned}$$
(26)

otherwise, where

$$\begin{aligned} c'_{(i-1)}(\mathbf {x}) := {\left\{ \begin{array}{ll} y_{(i-1)} &{} \text {if}\quad c_{(i-1)}(\mathbf {x}) = y_{(i-1)}\\ y_{(i+1)} &{}\text {if}\quad c_{(i-1)}(\mathbf {x}) = y_{(i)} \end{array}\right. }. \end{aligned}$$
(27)

In this case

$$\begin{aligned} \mathrm {Smin}_{h}(\mathcal {Y}) \le \mathrm {Smin}_{h'}(\mathcal {Y}'). \end{aligned}$$
(28)

Proof

From the monotonicity of the minimal class-wise sensitivity (Equation 12) we get

$$\begin{aligned} \mathrm {Smin}_{h}(\mathcal {Y}) \le \mathrm {Smin}_{h}(\mathcal {Y}'). \end{aligned}$$
(29)

Cascade \(h'\) is constructed from the same base classifiers as h. Only one classifier is removed, which closes the option of predicting class label \(y_{(i)}\). In case \(i=1\), all samples that would have been classified as \(y_{(1)}\) by h are directly sent to \(c_{(2)}\) by \(h'\). This is of course also true for all remaining samples with \(c_{(1)}(\mathbf {x})=y_{(2)}\) in h. Therefore for each class label \(y\in \mathcal {Y}'\), the set of samples that receive this class label y by \(h'\) comprises at least all samples that receive class label y by h. The corresponding class-wise sensitivities and their minimum can only be increased.

In case \(1<i<|\mathcal {Y}|\), additionally classifier \(c_{(i-1)}\) and \(c_{(i)}\) are replaced by \(c'_{(i-1)}\), which redirects those samples that would be sent from \(c_{(i-1)}\) to \(c_{(i)}\). The common preamble of base classifiers of h and \(h'\) leads to identical class-wise sensitivities for classes \(y_{(1)}, \ldots , y_{(i-1)}\). Those classes that received class label \(y_{(i)}\) by h are again reclassified into one of the subsequent classes leading to equivalent or increased class-wise sensitivities.

In case \(i=|\mathcal {Y}|\), the cascades h and \(h'\) share the longest possible preamble of base classifiers. Only the samples receiving class label \(y_{(|\mathcal {Y}|)}\) by h will be reclassified as \(y_{(|\mathcal {Y}|-1)}\) by \(h'\) leading to at least the class-wise sensitivity for class \(y_{(|\mathcal {Y}|)}\).

We therefore get for all cases

$$\begin{aligned} \mathrm {Smin}_{h}(\mathcal {Y}') \le \mathrm {Smin}_{h'}(\mathcal {Y}'). \end{aligned}$$
(30)

\(\square \)

Theorem 2 states that a subcascade \(h'\) constructed from a full cascade h by removing a single class \(y_{(i)}\) and by reconnecting the corresponding base classifiers is guaranteed to achieve at least the same minimal class-wise sensitivity on \(\mathcal {Y}'\) than h on \(\mathcal {Y}\). It can recursively be applied to \(h'\) and its own subcascades leading to a transitive relationship between a full cascade and all its subcascades. The more classes a cascade comprises the more difficult it will be to reach a predefined minimal class-wise sensitivity. We will therefore focus on longer cascades.

Experiments

We evaluated our screening procedure empirically in 10 x 10 cross-validation experiments [24]. All ordinal classifier cascades were based on linear SVMs (cost=1) [49]. The TunePareto Software was used for the evaluation of base classifiers [41]. All time measure experiments were performed on an AMD Opteron(tm) Processor 6276 with 2.6 GHz (32 cores with HT) and 512 GB RAM.

As mentioned above, alternative systems are unlikely to cope with the computational burden of exhaustive screenings (Sect. 3). Reference classifiers were therefore directly applied to those orders identified by our screening procedure.

We compared our results to those of other multi-class architectures based on independently trained linear SVMs (cost=1). As ordinal reference architectures we used a splitted ordinal classifier cascade (Scc, pairwise training) denoted as “voter 3” by Jiang et al. [26], the ensembles of nested dichotomies (END, \(|\mathcal {E}|=20\)) [19] and directed acyclic graphs (DAG) [44]. As non-ordinal references the one-against-one (OaO) and the one-against-all (OaA) architectures were applied [40].

Additionally, our results were compared to a 1D self-organising map (SOM) [30]. This mapping was performed using the standard settings of the R-package kohonen  [53, 54] (learning rate: [0.05,0.01], but no layer normalisation). The nodes and consequently the clusters were labelled based on the cascade under investigation.

Datasets

The screening procedure is evaluated on eight multi-class datasets from different domains. The main characteristics of the analysed datasets can be found in Table 4. We used data, which comprises different representations of alphanumeric characters (written and spoken) from the machine learning database UCI [13], and gene expression and methylation profiles which were measured either by microarrays or by deep sequencing.

Table 4 Overview of the analysed datasets

\(d_{1}\): Handwritten digits dataset. The handwritten digits dataset was made publicly available by Alpaydin and Kaynak [3] as part of the UCI repository [13]. This dataset is a collection of bitmaps of digits written by 43 different people. The classes \(y_{1}, \ldots , y_{10}\) correspond to the labels \(0, \ldots , 9\).

\(d_{2}\): Isolet dataset. The Isolet (Isolated Letter Speech Recognition) dataset by Fanty and Cole [17] and downloaded from the UCI repository [13] consists of spoken letters of the alphabet. The classes \(y_{1}, \ldots , y_{26}\) correspond to the labels \(a, \ldots , z\).

\(d_{3}\): Cancer cell lines. The NCI-60 dataset (GSE32474) was collected by Pfister et al. [43]. It consists of gene expression profiles from cell lines that derived from different cancer tissue types. The different cancer types are: \(y_{1}\): Leukemia, \(y_{2}\): Breast cancer, \(y_{3}\): Ovarian cancer, \(y_{4}\): Melanoma, \(y_{5}\): Central nervous system (CNS), \(y_{6}\): Colon cancer, \(y_{7}\): Renal cancer, \(y_{8}\): Non-small cell lung cancer, \(y_{9}\): Prostate cancer.

\(d_{4}\): Leukemia. As a prephase to the MILE Study (Microarray Innovations In LEukemia) program expression data from blood and bone marrow samples from acute and chronic leukemia patients was collected (GSE13159). Kohlmann et al. [29] could hereby show that standardised experimental protocols can lead to comparable results across different laboratories. The samples have been categorised in different leukemia subgroups and assigned to class labels as follows: \(y_{1}\): ALL (acute lymphocytic leukaemia) with hyperdiploid karyotype, \(y_{2}\): ALL with t(1;19), \(y_{3}\): ALL with t(12;21), \(y_{4}\): AML (acute myeloid leukaemia) complex aberrant karyotype, \(y_{5}\): AML with inv(16)/t(16;16), \(y_{6}\): AML with normal karyotype and other abnormalities, \(y_{7}\): AML with t(11q23)/MLL, \(y_{8}\): AML with t(15;17), \(y_{9}\): AML with t(8;21), \(y_{10}\): c-ALL/Pre-B-ALL without t(9;22), \(y_{11}\): c-ALL/Pre-B-ALL with t(9;22), \(y_{12}\): CLL (chronic lymphocytic leukaemia), \(y_{13}\): CML (chronic myeloid leukaemia), \(y_{14}\): mature B-ALL with t(8;14), \(y_{15}\): MDS (myelodysplastic syndromes), \(y_{16}\): Non-leukemia and healthy bone marrow, \(y_{17}\): Pro-B-ALL with t(11q23)/MLL, \(y_{18}\): T-ALL.

\(d_{5}\): TCGA. This dataset is a collection of datasets that was generated by the TCGA Research Network (http://cancergenome.nih.gov/). The feature space consists of the intersection of all features of the single datasets. The different cancer types are assigned to class labels as follows: \(y_{1}\): Cholangiocarcinoma (CHOL), \(y_{2}\): Liver hepatocellular carcinoma (LIHC), \(y_{3}\): Pancreatic adenocarcinoma (PAAD), \(y_{4}\): Colon adenocarcinoma (COAD), \(y_{5}\): Rectum adenocarcinoma (READ), \(y_{6}\): Kidney renal papillary cell carcinoma (KIRP), \(y_{7}\): Kidney renal clear cell carcinoma (KIRC), \(y_{8}\): Kidney chromophobe carcinoma (KICH).

\(d_{6}\): C. elegans. Baugh et al. [6] analysed the influence of the homeodomain protein PAL-1 of the C-lineage-specific gene regulatory network in the model organism C. elegans. They gathered gene expression data of samples of wild-type embryos and mutant embryos at 10 points in time after the 4-cell-stage of the embryo (GSE2180). We normalised (mas5) and labelled these samples based on the points in time: \(y_{1}\): 0 minutes, \(y_{2}\): 23 minutes, \(y_{3}\): 41 minutes, \(y_{4}\): 53 minutes, \(y_{5}\): 66 minutes, \(y_{6}\): 83 minutes, \(y_{7}\): 101 minutes, \(y_{8}\): 122 minutes, \(y_{9}\): 143 minutes, \(y_{10}\): 186 minutes.

\(d_{7}\): Wound healing. Xiao et al. [56] analysed the total cellular RNA of blood samples taken from severe blunt trauma patients (GSE36809). They performed a time scale analysis and took samples at different points in time after the injury. We normalised (mas5) and summarised the different points in time, measured in hours and labelled them as follows: \(y_{1}\): (0,50), \(y_{2}\): [50,100), \(y_{3}\): [100,150), \(y_{4}\): [150,200), \(y_{5}\): [200,400), \(y_{6}\): [400,600), \(y_{7}\): [600,800), \(y_{8}\): control.

\(d_{8}\): HPA-axis. This dataset by Agba et al. [1] comprises measurements of DNA methylation patterns of different promoters of the glucocorticoid receptor gene (NR3C1) and the imprinting control region of IGF2/H19 of 7 tissues of rats. These profiles especially enclose measurements of the hypothalamic-pituitary-adrenal-axis (HPA-axis), a neuroendocrine system which is known to be involved in the regulation of stress reactions. In our experiments, the classes denote the different tissue types \(y_{1}\): Cortex, \(y_{2}\): Hippocampus, \(y_{3}\): Hypothalamus, \(y_{4}\): Pituitary, \(y_{5}\): Adrenal, \(y_{6}\): Skin, \(y_{7}\): Liver.

Experimental Results and Interpretation

In this section we will present the results on the utilised benchmark data sets and give some interpretation on these results. Table 5 provides an overview on the extracted longest cascades. Between one and five cascades were detected per dataset passing the dataset specific highest threshold. They achieved minimal class-wise sensitivity thresholds ranging from 0.75 to 0.9. The theoretical maximum number of cascades is for all datasets more than 800 times larger than the number of returned cascades. A general overview about the numbers of cascades that pass the first threshold in comparison to the number that pass the second threshold (dataset specific) is given in the Supplementary Information.

Table 5 The characteristics of the longest cascades

Longest cascades. Graph representations of the longest cascades can be found in Figs. 3 and 4. It can be seen that all found cascades overlap in at least one class except for the TCGA data \(d_5\). The extrema in the amount of overlap are \(d_{5}\) which shows no overlap and cascades of \(d_{2}\) and \(d_6\) which overlap in all except one class. Further structures that can be observed are that cascades differ in one class at the beginning (\(d_{2}\)) in the middle (\(d_{6}\)) or at the end (\(d_{1}\)), but especially if more than two cascades pass the minimal sensitivity criterion the cascades also align to more complex graphs (\(d_{3}, d_{8}\)).

Fig. 3
figure 3

Longest cascades detected for datasets \(d_{1} - d_{4}\). Descriptions of the datasets are given in the text. For each dataset the threshold (t) and the length of the longest cascades (l) is stated. The individual cascades are color encoded differently. For datasets \(d_1\) and \(d_2\) the left out classes (others) are not shown

Fig. 4
figure 4

Longest cascades detected for datasets \(d_{5} - d_{8}\). Descriptions of the datasets are given in the text. For each dataset the threshold (t) and the length of the longest cascades (l) is stated. The individual cascades are color encoded differently. Additionally, the left out classes, which are classes that are not included in neither of the cascades are depicted as others

Classes that are not included into the subcascade. Extended confusion tables for the longest cascades are shown in the Supplementary Information. Figure 5 shows exemplarily the confusion table for the TCGA dataset \(d_5\). As focusing on the detection of subcascades, classes exist that are not part of the longest cascades, termed Others in Figs. 3 and 4. These classes can be split in classes that are not part of any of the longest subcascades (uncovered classes) and classes that are not part of the specific subcascade but part of another of the longest subcascades returned (alternative class). Presenting a sample of one of these classes to a specific subcascade, these samples are necessarily classified as one of the cascade classes as there is no rejection possibility. A general overview of how distinct the allocation of these additional classes to cascade classes is, provides the mean of the purity values (Eq: 13) (Table 5).

De novo calculation vs memoization scheme. In order to demonstrate the impact of the proposed memoization techniques we performed additional experiments with a traditional implementation of an ordinal classifier cascade. That is each training phase is realised as a de novo adaptation of the classification model. Both versions were based on the same implementation of base classifiers (SVM) [41] and did not use any parallelisation. The runtime of both approaches was compared on dataset \(d_{4}\) (\(|\mathcal {Y}| = 18\)), which corresponds to \(s_{all}(|\mathcal {Y}|)>10^{18}\) subcascades. Again a \(10\times 10\) CV was conducted for each subcascade. The memoization techniques allowed to accomplish this screening in about 1868 minutes, while the de novo calculation of a single \(10\times 10\) CV on all \(|\mathcal {Y}| = 18\) classes (fixed class order) required 798 minutes. The de novo calculation is therefore not able to perform this screening in a feasible amount of time.

Performance comparison. As the complexity of an ordinal classifier cascade is rather low in comparison to the complexity of the chosen reference classifiers we have chosen to run these algorithms only for the longest subcascades identified by the initial screening. The minimal class-wise sensitivities achieved by the reference classifiers on the longest subcascades are shown in Table 6. An overview on all sensitivities is given in the Supplementary Information. The comparison to the non-ordinal architectures (OaO, OaA) demonstrate the impact of (correct) ordinal information by a higher performance of the ordinal classifier cascade (Occ). In five out of eight datasets (\(d_{4}-d_{8}\)) cascades exist for which the OaO achieved lower minimal sensitivities and did not even pass the minimal threshold (t). For the OaA this was observed only twice (\(d_{6}, d_{7}\)). The ordinal architectures, the ensembles of nested dichotomies (END) and directed acyclic graphs (DAG), perform well on these datasets and constrained to the selected orders. Both approaches did not pass the minimal threshold t on \(d_{7}\). The END did also slightly pass the threshold on \(d_{4}\). The splitter ordinal classifier cascade (Scc) achieved high minimal class-wise sensitivities only on \(d_{8}\) dataset and \(d_{5}\). It would have deselected individual classes in our experiments. With the exception of the first cascade of \(d_{8}\) SOM was also not able to separate all classes of the cascades.

Table 6 The table shows the minimal class-wise sensitivity performance as % of various methods on the selected subcascades

Handwritten digits dataset. The two longest cascades of length five which are found in \(d_{1}\) the digit dataset overlap at three classes. A meaningful interpretation of the direction is not obvious, but the returned cascades show possible structural properties. There are three digits (0, 4, 8) that are uncovered classes and in both cascades the 8 is mainly allocated to the second cascade position, corresponding once to the digit 2 and once to 1. If 1 is placed at the second position, 0 is mainly classified as 6, whereas in the other cascade (2 at position two), 0 is classified as 5 (the third position of the cascade), which means that in the first case the first decision region is placed in a way that the samples of class 6 and 0 are on the same side, whereas in the second case they lay on opposite sides and hence the directions of the first relation in the feature space for both cascades differ. The two cascades split at the second and at the last position of the cascade and the purity values for the alternative classes show that on average half of the samples of the alternative classes are assigned to one intra-cascade class.

Isolet dataset. Dataset \(d_{2}\) corresponds to a spoken representation of letters. Two cascades of length 10 pass a minimal class-wise sensitivity threshold of 0.9. The uncovered classes are classified to around 70% on average to the same class and the heatmaps (Supplementary Information) show that they are not well-distributed as many entries are zero. The two cascades differ only in their first position (s vs. y). If not part of the cascade the class s is completely assigned to the second cascade class f, whereas in the case in which y is not part of the cascade its assignment fades out till the fifth intra-cascade class. The different classes might be grouped based on the phonetic alphabet. In this case s, f, m and n share the sound [\(\varepsilon \)], but y does not, which might be one possible interpretation of what is observed. The last five classes (b, d, g, z, c) of both cascades share the sound . The further letters (e, p, t, v) that are pronounced using this sound are also mainly classified as one of these classes. Why they are not included in the main cascade is not clear from the semantic level. The cascade, however, reveals that the samples and hence the classes are located in a way that including one of those classes seem to change the direction of the decision region sequence in a way that would lead to shorter sequences.

Cancer cell lines. The cancer cell line dataset \(d_{3}\) shows four cascades of length five. There is only one class, \(y_{2}\): breast cancer, that is not part of any cascade and reveals a low purity. All four cascades share their first class, \(y_{1}\): leukemia and their center class, \(y_{3}\): ovarian cancer. If not part of the returned cascade no sample of \(y_{6}\): colon cancer or \(y_{8}\): lung cancer is classified as \(y_{9}\): prostate cancer, however the other way round \(y_{9}\): prostate cancer splits half-half between those classes, which reveals that the decision regions differ depending on the assumed order and outlines the importance of the fact that the assumed order has to be appropriately reflected in the feature representation. The correlation within this dataset can be explained by messenger molecules. Both prostate as well as ovarian are affected by sex hormones [28]. The fact that leukemia is twice as prevalent in men as in women [2] does not rule out the influence of sex hormones and may explain why leukemia is closer to prostate cancer. A similar relationship can also be found between ovarian and kidney cancer. Here, it is assumed that estrogen has a protective effect on the kidney, while testosterone damages the kidney [7, 48]. Likewise, an effect of sex hormones on the CNS is clearly confirmed [57]. Here, a decrease in hormone concentration during aging can be attributed to loss of neuronal functions. Various studies support a neuroprotective role of estrogens on this neuronal decline [58]. The other cascade connects neurotransmitter-controlled tissues. The central nerve system is controlled by a diversity of neurotransmitters [42]. Non-adrenergic, non-cholinergic nerves are known to control the gut as well as the urogenital tract [5]. Since the lung develops embryonically seen from the foregut, it is not surprising that non-adrenergic, non-cholinergic nerves are also present [5]. In addition, secreted neurotransmitter such as glutamate are known to be involved in T-cell leukemia [20].

Leukemia. Both cascades found in \(d_{6}\) share exactly one class \(y_{5}\): AML with inv(16) /t(16,16). In both cascades classes that correspond to lymphoid leukemia (LL) are placed before this class and afterwards classes that correspond to myeloid leukemia (ML) or syndromes that might develop into a myeloid leukemia, with the exception of the healthy bone marrow. It can be observed that the AML classes that are not included into one of the cascades are not classified as an own entity according to the WHO classification system [4]. The samples of those classes might be considered as not distinct enough to be included. Furthermore, it was shown that certain subtypes of LL show a higher incidence for the expression of myeloid antigens and these are pro B-All and early T-ALL [55] which are the LL classes at the found ML-LL transition. This might lead to the hypothesis that the found subcascades represent a sequence of similar but still distinct subtypes. The cascade showing a higher average purity for both, the alternative and the uncovered classes (others), reflects the grouping of ML and LL also in these classes. The observed characteristic that classes of similar concepts form subcascades within the context of a larger research topic, reveals that the proposed screening procedure is able to find orders within such a group but is also at the same time able to relate these groups using the same consecutive and hence overarching pattern.

TCGA data. For the TCGA dataset \(d_{5}\) consisting of eight classes, a cascade of length four achieved a minimal class-wise sensitivity higher than 90%. For \(d_{5}\) all uncovered classes are assigned with a purity of at least 80%. An order for most tissue derived tumors of endodermal or mesodermal origin can be found here. Only the data for kidney renal clear cell carcinoma (KIRC) cannot be assigned to the mesodermal germ layer. Similar findings were made by Berman [8] in his classification analysis. Here, he found that epithelial tumors with mesodermal origin in particular exhibit class independent behavior [8]. In line with this, there are various differentiation protocols to derive functional pancreatic or liver cells from human pluripotent stem cells (hPSCs) while few protocols are able to induce effective kidney differentiation [32]. Thus, differentiation of renal cells and associated tumors is somehow more complex in comparison to other tissues. One reason for this might be the complex architecture of the kidney with all its functional units [32] that require a variety of differentiation factors not only specific for kidney.

Fig. 5
figure 5

Longest cascade of the TCGA dataset \(d_{5}\) (extended confusion matrix). The sensitivity values and the conditional prediction rates for the classes belonging to the selected cascades as well as the external rates of classes that do not belong to the cascade are shown. The dataset specific threshold (t) an the minimal class-wise sensitivity is given in the heading. The columns give the order of classes predicted by the cascade (from left to right). The rows give all classes of a dataset. They are split in two blocks (separated by a red line): classes covered by the own cascade (top), other classes (bottom)

C.elegans and wound healing data. In contrast to the other datasets, the class labels of \(d_{6}\) and \(d_{7}\) are given by consecutive time points and therefore might be seen as hypothesised ordinal class labels. These timelines were not completely reconstructed in our experiments. Nevertheless the identified subcascades followed the assumed order.

For C.elegans \(d_{6}\) two cascades passed a minimal class-wise sensitivity threshold of 75% that included all classes despite one. Both orderings correspond to a progression in time and either the class corresponding to samples taken at minute 66 (\(y_{5}\)) or the ones taken at minute 83 (\(y_{6}\)) are skipped. In both cases the uncovered classes were assigned to their assumed ordinal neighbours with a purity over 90%. The high similarity of classes \(y_{5}\) and \(y_{6}\) are also indicated by a decreased classification performance of the corresponding base classifier (74% minimal class-wise sensitivity, data not shown) suggesting that in this period the expression does not change a lot. The detected cascades therefore rather constructs an ordinal sequence of discriminable events than a reconstruction of the timeline.

Similar observations can be gained for the wound healing dataset \(d_{7}\). Here two cascades of length four achieved a minimal class-wise sensitivity higher than 80%. They differ in their second class. In the first cascade \(y_{6}\) is chosen, which corresponds to the samples taken between 400 and 600 minutes after injury. In the second one, \(y_{7}\) is covered, which comprises to the samples taken between 600 and 800 minutes after injury. If uncovered, \(y_{6}\) receives the class label of \(y_{7}\) (and vice versa) with a purity of over 80%. Both cascades were not able to cover three classes that reflect earlier time points (\(y_{2}\), \(y_{4}\), \(y_{5}\)). These external classes are assigned to their assumed ordinal neighbours (within the cascade) with a purity of at least 90%. As the class definitions are again based on prior assumptions they most likely do not correspond to large changes in the process. This can be also seen in the minimal class-wise sensitivity of the binary classifiers, which are below 0.8 for time-based neighbouring classes, except for the first class \(y_{1}\) and the control class \(y_{8}\). Based on the classification tendency of the uncovered classes one might conclude that the expression profile of the first 50 minutes after injury and the control one is quite distinct from the process in between. Furthermore the assignment of the uncovered and alternative classes are consistent with the findings that wound healing consists of overlapping phases [51].

HPA-axis methylation data. Not considering both directions and the classes at the two last cascade positions the subcascades of \(d_{8}\) reduce to two orderings:

(\(liver {\prec } skin {\prec }hippocampus/hypothalamus {\prec }pituitary {\prec }adrenal \)). One part of one cascade \(hypothalamus {\prec }pituitary {\prec }adrenal \) is reported as HPA-axis and known to control stress reaction. The role of the hippocampus in the control of stress reaction and its interconnection towards the hypothalamus might lead to the effect that hippocampus and hypothalamus are not discriminable anymore if a relation reflecting the interplay of stress response is considered. The neighbourhood between skin and the HPA-axis within the cascade reflects the findings by Agba et al. [1] that methylation patterns in skin are closely related to the patterns within the HPA-axis.

Discussion and Conclusion

Although the task of classification is mainly based on the assumption of pairwise distinct concepts, it is highly unlikely that no other relationship among the classes exists. Nevertheless, these relations might be implicit or even unknown. They might be revealed by analysing suitable data representations.

In this work, we utilise ordinal classifier cascades for detecting ordinal relations in each subset of classes. This multi-class architecture is highly sensitive to the order of classes as decision regions of early classes overlay decision regions of later ones. Interestingly, this property causes more rejections than the performance of the corresponding base classifiers. In our study, almost all classes would have been separable in pairwise classification experiments.

From a combinatorial point of view the number of candidate cascades increases exponentially in the number of classes. The proposed exhaustive screening is therefore out of range for approaches that require a de novo generation of all classification models. It would require the training of an ordinal classifier for each subset of classes and each permutation thereof. In our approach we circumvent this complexity by the extensive use of memoization techniques and theoretical error bounds. This allowed us to perform exhaustive screens for datasets with 26 classes, which corresponds to the evaluation of more than \(10^{27}\) candidate orders. The runtime of our approach is mainly determined be the training of the required number of base classifiers which is quadratic in the overall number classes. It therefore brings into range ordinal classifier cascades based on more sophisticated but also more complex base classifiers [14, 15, 25, 38, 46, 59]. To our knowledge, our screening is the first one that applies memoization techniques to ordinal classification. It might be a blue print for other ordinal classifiers or fusion architectures.

Due to the transitive relationship between the minimal class-wise sensitivities of a cascade and its subcascades, the probability of larger ordinal structures decreases. Long ordinal subcascades must be seen as rare events. As such, we consider the longest cascades that pass the highest threshold on the minimal class-wise sensitivity as being the most informative ones. Of course, both length and threshold are clearly dependent on the classification task and the chosen data representation. There is no guarantee that cascades of suitable length will be found.

The identified subcascades can be seen as ordered subsets of well separable classes. They might be considered as a roadmap of axes organising the complete collection of classes. However, not all classes are connected in this network. The labels of these external classes cannot be predicted by the ordinal classifier cascades. As most base classifiers do not possess a rejection option, the samples of the external classes will in general be assigned to cascade classes. Although the classification of these samples is incorrect, the association to one of the ordinal classes can reveal properties of the external classes. In our experiments, we observed a large set of external classes that are assigned to a small set of consecutive classes of an ordinal classifier cascade. They might be a hint to a too fine granular classification (e.g. caused by inappropriate sampling), which cannot be separated in feature space. The whole process of identifying subcascades can also be seen as a selection of a maximal number of classes that can collectively be discriminated. This in turn can give rise to new hypothesis about the processes involved like “Is this entity really a distinct new group?”, or “There could be some stagnation in the development after a certain time”.

Overall, our experiments show that ordinal substructures of classes can be detected in feature space and that they corroborate existing findings. These structures might be seen as hypotheses for ordinal relations and also for discriminative entities among the corresponding class concepts, which can be investigated in a more detailed analysis. From a theoretical point of view, the aggregation of ordinal subcascades remains an open research question. It might be addressed in form of multi-class architectures that directly address partial orders of classes.