1 Introduction

Neural networks have proven to be incredibly powerful at processing low-level inputs, and for this reason they have been extensively applied to computer vision tasks, such as image classification, object detection, and action detection. However, they can exhibit unexpected behaviors, contradicting known requirements expressing background knowledge. This can have dramatic consequences, especially in safety-critical scenarios such as autonomous driving. To address the problem, models should (i) be able to learn from the requirements, and (ii) be guaranteed to be compliant with the requirements themselves. Indeed, as suggested in Amodei et al. (2016), in such settings it is of primary importance to create models that are able to operate within boundaries specified by the requirements written by domain experts. Unfortunately, the development of such models is hampered by the lack of real-world datasets equipped with formally specified requirements. A notable exception is given by hierarchical multi-label classification (HMC) problems (see, e.g., (Vens et al. 2008; Schietgat et al. 2010; Wehrmann et al. 2018)) in which datasets are provided with simple binary constraints of the form \((A \rightarrow B)\) stating that label B must be predicted whenever label A is predicted.

In this paper, we generalize HMC problems by introducing multi-label classification problems with (full) propositional logic requirements. Thus, given a multi-label classification problem with labels A, B, and C, we can, for example, write the requirement:

$$\begin{aligned} (\lnot A \wedge B) \vee C, \end{aligned}$$

stating that for each data point in the dataset, either the label C is predicted, or B is predicted and A is not. Then, we present the ROad event Awareness Dataset with logical Requirements (ROAD-R), the first publicly available dataset for autonomous driving with requirements expressed as logical constraints. ROAD-R extends the ROAD dataset (Singh et al. 2022), which was built on top of the Oxford RobotCar Dataset (Maddern et al. 2017) and consists of 22 relatively long (\(\sim 8\) minutes each) videos annotated with road events. A road event corresponds to a tube/tubelet, i.e., a sequence of frame-wise bounding boxes linked in time. Each bounding box is labeled with a subset of the 41 labels specified in Table 1. The goal is to predict the set of labels associated with each bounding box. We manually annotated ROAD-R with 243 constraints expressing which combinations of labels are admissible. We verified that the constraints hold for all bounding boxes’ ground truth annotations appearing in the dataset using the SAT solver MiniSat (Eén and Sörensson 2004; see Footnote 1). An example of a constraint is “a traffic light cannot be red and green at the same time”, while there are no constraints like “pedestrians should cross at crossings”, which should always be satisfied in theory, but which might not be in real-world scenarios.

Table 1 ROAD labels

Given ROAD-R, we considered 6 current state-of-the-art (SOTA) models, and we showed that they are not able to learn the requirements just from the data points, as more than 90% of their predictions violate the constraints. Then, we faced the problem of how to leverage the additional knowledge provided by constraints with the goal of (i) improving their performance, measured by the frame mean average precision (f-mAP) at intersection over union (IoU) thresholds 0.5 and 0.75; see, e.g., (Kalogeiton et al. 2017; Li et al. 2018), and (ii) guaranteeing that they are compliant with the constraints. To achieve the above two goals, we propose the following new models:

  1. CL models, i.e., models with a constrained loss allowing them to learn from the requirements,

  2. CO models, i.e., models with a constrained output enforcing the requirements on the output, and

  3. CLCO models, i.e., models with both a constrained loss and a constrained output.

In particular, we consider three different ways to build CL (resp., CO, CLCO) models. More specifically, we run the \(9 \times 6\) models obtained by equipping the 6 current SOTA models with a constrained loss and/or a constrained output, and we show that it is always possible to

  1. Improve the performance of each SOTA model, and

  2. Be compliant with (i.e., strictly satisfy) the constraints.

Overall, the best performing model (for both IoU = 0.5 and IoU = 0.75) is CLCO-RCGRU, i.e., the SOTA model RCGRU equipped with both constrained loss and constrained output: CLCO-RCGRU (i) always satisfies the requirements and (ii) has f-mAP = 31.81 for IoU = 0.5, and f-mAP = 17.27 for IoU = 0.75. On the other hand, the standard RCGRU model (i) produces predictions that violate the constraints at least 92% of the time, and (ii) has f-mAP = 30.78 for IoU = 0.5 and f-mAP = 15.98 for IoU = 0.75.

The main contributions of this paper are thus as follows:

  1. We introduce multi-label classification problems with propositional logic requirements,

  2. We introduce ROAD-R, which is the first publicly available dataset whose requirements are expressed in full propositional logic,

  3. We consider 6 SOTA models and show that on ROAD-R, they produce predictions violating the requirements more than 90% of the time,

  4. We propose new models with a constrained loss and/or constrained output, and

  5. We conduct an extensive experimental analysis and show that, with our new models, it is always possible to improve the performance of the SOTA models and satisfy the requirements.

The rest of this paper is organized as follows. After the introduction to the problem of learning with requirements (Sect. 2), we present ROAD-R (Sect. 3), followed by the evaluation on ROAD-R of the SOTA models (Sect. 4) and of the SOTA models incorporating the requirements (Sect. 5). We end the paper with the related work (Sect. 6) and the summary and outlook (Sect. 7).

2 Learning with requirements

In ROAD, the detection of road events requires the following tasks: (i) identify the bounding boxes, (ii) associate with each bounding box a set of labels, and (iii) form a tube from the identified bounding boxes with the same labels. Here, we focus on the second task, and we formulate it as a multi-label classification problem with requirements.

A multi-label classification (MC) problem \(\mathcal {P} \,{=}\, (\mathcal {C},\mathcal {X})\) consists of a finite set \(\mathcal {C}\) of labels, denoted by \(A_1, A_2,\ldots\), and a finite set \(\mathcal {X}\) of pairs (x, y), where \(x \in {\mathbb {R}}^D\) \((D \ge 1)\) is a data point, and \(y \subseteq \mathcal {C}\) is the ground truth of x. The ground truth y associated with a data point x characterizes both the positive and the negative labels associated with x, defined to be y and \(\{\lnot A: A \in \mathcal {C} \setminus y\}\), respectively. In ROAD-R, a data point corresponds to a bounding box, and each box is labeled with the positive labels representing (i) the agent performing the actions in the box, (ii) the actions being performed, and (iii) the locations where the actions take place. See Appendix A for a detailed description of each label.

Fig. 1 Example of violation of \(\lnot {\text {RedTL}}\,{\vee }\, \lnot {\text {GreenTL}}\)

Consider an MC problem \(\mathcal {P} = (\mathcal {C},\mathcal {X})\). A prediction p is a set of positive and negative labels such that for each label \(A \in \mathcal {C}\), either \(A \in p\) or \(\lnot A \in p\). A model m for \(\mathcal {P}\) is a function \(m(\cdot ,\cdot )\) mapping every label A and every data point x to [0, 1]. A data point x is predicted by a model m to have label A if its output value m(A, x) is greater than a user-defined threshold \(\theta \in [0, 1]\). The prediction of model m for data point x is the set \(\{A: A \in \mathcal {C}, m(A,x) > \theta \} \cup \{\lnot A: A \in \mathcal {C}, m(A,x) \le \theta \}\) of positive and negative labels.
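The thresholding rule that turns model outputs into a prediction can be sketched in a few lines. This is only an illustration: the function name `predict`, the dictionary representation, the output values, and the label name `Ped` are assumptions introduced here (only RedTL and GreenTL appear in the text).

```python
def predict(outputs, theta):
    """Threshold the outputs m(A, x) of one data point.

    outputs: dict mapping each label A to a score in [0, 1] (hypothetical values).
    Returns a dict mapping each label to True (positive label A)
    or False (negative label, i.e. "not A").
    """
    # A label is predicted positive only if its score is strictly above theta.
    return {label: score > theta for label, score in outputs.items()}


# Illustrative scores for a single bounding box.
outputs = {"RedTL": 0.81, "GreenTL": 0.64, "Ped": 0.07}
prediction = predict(outputs, theta=0.5)
```

With these made-up scores both RedTL and GreenTL end up positive, which is exactly the kind of prediction that the requirements are meant to rule out.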

An MC problem with propositional logic requirements \((\mathcal {P},\Pi )\) consists of an MC problem \(\mathcal {P}\) and a finite set \(\Pi\) of propositional logic constraints on the labels of \(\mathcal {P}\). Consider an MC problem with propositional logic requirements \((\mathcal {P},\Pi )\). Each constraint in \(\Pi\) delimits the set of predictions that can be associated with each data point by ruling out those that violate it. A prediction p is admissible if each constraint r in \(\Pi\) is satisfied by p. A model m for \(\mathcal {P}\) satisfies (resp., violates) the constraints on a data point x if the prediction of m for x is (resp., is not) admissible.

Example 1

The requirement that a traffic light cannot be both red and green corresponds to the constraint \(\lnot {\text {RedTL}} \vee \lnot {\text {GreenTL}}\). Any prediction that includes \(\{\text {RedTL}, \text {GreenTL}\}\) is non-admissible. An example of such a prediction made by the SOTA models is shown in Fig. 1.
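To make the notion of admissibility concrete, the following sketch checks a prediction against a set of constraints written as disjunctions of literals. The encoding of a literal as a `(label, polarity)` pair and the function names are assumptions made for this illustration, not part of the dataset tooling.

```python
def satisfies(clause, prediction):
    """A clause is a list of (label, polarity) literals; polarity False means "not label".

    The clause is satisfied if at least one literal agrees with the prediction.
    """
    return any(prediction[label] == polarity for label, polarity in clause)


def is_admissible(constraints, prediction):
    """A prediction is admissible if it satisfies every constraint."""
    return all(satisfies(clause, prediction) for clause in constraints)


# The constraint of Example 1: not RedTL or not GreenTL.
constraints = [[("RedTL", False), ("GreenTL", False)]]

bad = {"RedTL": True, "GreenTL": True}
good = {"RedTL": True, "GreenTL": False}
```

Here `is_admissible(constraints, bad)` is false, while `is_admissible(constraints, good)` is true.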

Given an MC problem with propositional logic requirements, it is possible to take advantage of the constraints in two different ways: (i) they can be exploited during learning to teach the model the background knowledge that they express, and (ii) they can be used as post-processing to turn a non-admissible prediction into an admissible one. Models in the first and second category have a constrained loss (CL) and constrained output (CO), respectively. Constrained loss models have the advantage that the constraints are deployed during the training phase, and this should result in models (i) with a higher understanding of the problem and a better performance, but still (ii) with no guarantee that no violations will be committed. On the other hand, constrained output models (i) do not exploit the additional knowledge during training, but (ii) are guaranteed to have no violations in the final outputs. These two options are not mutually exclusive (i.e., can be used together), and which one is to be deployed depends also on the extent to which a system is available. For instance, there can be companies that already have their own models (which can be black boxes) and want to make them compliant with a set of requirements without modifying the model itself. On the other hand, the exploitation of the constraints in the learning phase can be an attractive option for those who have a good knowledge of the model and want to further improve it.

3 ROAD-R

ROAD-R extends the ROAD dataset (Singh et al. 2022; see Footnote 2) by introducing a set \(\Pi\) of 243 constraints that specify the space of admissible outputs.

In order to improve the usability of our dataset, we write each constraint as a disjunction of positive and negative labels, i.e., as expressions having the form:

$$\begin{aligned} l_1 \vee l_2 \vee \cdots \vee l_n, \end{aligned} \tag{1}$$

where \(n \ge 1\), and each \(l_i\) is either a negative label \(\lnot A\) or a positive label A. Thus, \(\Pi\) can be equivalently seen as a formula in conjunctive normal form (CNF), which is the standard form used by propositional logic solvers. Notice that for any propositional formula there is an equivalent one in CNF.
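As a small worked illustration of the last remark (not taken from the dataset), the requirement \((\lnot A \wedge B) \vee C\) from the introduction can be brought into the form (1) by distributing the disjunction over the conjunction:

$$\begin{aligned} (\lnot A \wedge B) \vee C \;\equiv \; (\lnot A \vee C) \wedge (B \vee C), \end{aligned}$$

which yields two constraints of the form (1), namely \(\lnot A \vee C\) and \(B \vee C\).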

The requirements have been manually specified following three steps:

  1. An initial set of constraints \(\Pi _1\) was manually created,

  2. A subset \(\Pi _2 \subset \Pi _1\) was retained by eliminating all those constraints that were entailed by the others,

  3. The final subset \(\Pi \subset \Pi _2\) was retained by keeping only those requirements that were always satisfied by the ground-truth labels of the entire ROAD-R dataset.
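Steps 2 and 3 can be sketched on a toy example as follows. This is a hedged illustration only: entailment is decided here by enumerating all truth assignments, which is feasible for a handful of labels but would have to be replaced by a SAT solver at the scale of 41 labels; the function names, the `(label, polarity)` encoding, and the toy constraints are all assumptions of this sketch. The result of the elimination may also depend on the order in which constraints are visited.

```python
from itertools import product


def satisfies(clause, assignment):
    """clause: list of (label, polarity) literals; assignment: dict label -> bool."""
    return any(assignment[label] == polarity for label, polarity in clause)


def entailed(clause, others, labels):
    """True if every assignment satisfying all clauses in `others` satisfies `clause`."""
    for values in product([False, True], repeat=len(labels)):
        assignment = dict(zip(labels, values))
        if all(satisfies(c, assignment) for c in others) and not satisfies(clause, assignment):
            return False
    return True


def drop_entailed(constraints, labels):
    """Step 2: greedily remove constraints entailed by the remaining ones."""
    kept = list(constraints)
    for clause in constraints:
        others = [c for c in kept if c is not clause]
        if entailed(clause, others, labels):
            kept = others
    return kept


def keep_satisfied_by_ground_truth(constraints, ground_truths):
    """Step 3: keep only constraints that hold for every ground-truth annotation."""
    return [c for c in constraints if all(satisfies(c, y) for y in ground_truths)]


labels = ["A", "B", "C"]
a_implies_b = [("A", False), ("B", True)]
b_implies_c = [("B", False), ("C", True)]
a_implies_c = [("A", False), ("C", True)]  # follows from the two clauses above
pi_2 = drop_entailed([a_implies_b, b_implies_c, a_implies_c], labels)

ground_truths = [
    {"A": True, "B": True, "C": True},
    {"A": False, "B": True, "C": False},  # violates "B implies C"
]
pi = keep_satisfied_by_ground_truth(pi_2, ground_truths)
```

In this toy run, the third clause is removed in step 2 because it follows from the first two, and the second clause is removed in step 3 because one ground-truth annotation violates it.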

Considering the above procedure, a few considerations are in order:

  1. The requirement specification process (i) is a standard step in the development of any software, necessary to characterize the expected behavior of the system and then verify that the system functions as expected; and (ii) deeply involves the stakeholders/designers of the system (see, e.g., (Sommerville 2011)). As a consequence, the set \(\Pi _1\) is not guaranteed to be complete from every possible point of view. Indeed, with a different set of labels and/or in different contexts, other constraints may hold. For instance, some roads can be closed to “large vehicles”, and in some countries it is possible to have traffic lights with both the green and amber lights on;

  2. The elimination of the constraints that are violated by the ground-truth labels of the entire dataset, despite their validity, is a necessary step in order to maintain (i) consistency between the knowledge provided by the constraints and by the data points, and (ii) backward compatibility with the ROAD dataset. Indeed, some constraints in \(\Pi _1\), like “it is not possible for an agent to both move towards and move away”, have been discarded, since they were not satisfied by all the data points because of errors in the ground-truth labels;

  3. Following the standard practice adopted in software development, the requirement specification process should come before the software development begins and before the annotation of the dataset. Indeed, this would have allowed us to (i) simplify the annotation process, and then (ii) validate the annotated dataset.

Given the above, ROAD-R, along with the presented models, is a first step in the direction of a new generation of machine learning models (i) whose design starts with the specification of the requirements that they should satisfy, and (ii) that are able to learn from and then obey the constraints. This will help in the deployment of machine learning models in all application domains, including safety-critical ones. Indeed, as stated in Amodei et al. (2016) and Hoernle et al. (2022), to be applied in such settings, models need to be guaranteed to operate within boundaries specified by domain experts.

Table 2 Constraint statistics

Tables 2 and 3 give a high-level description of the properties of the set \(\Pi\) of constraints. Notice that, with a slight abuse of notation, in the tables, we use a set-based notation for the requirements. Each requirement of the form (1) thus becomes \(\{l_1, l_2, \ldots , l_n\}\). Such notation allows us to express the properties of the requirements in a more succinct way. From Table 2, we can see that:

  • All the constraints have between 2 and 15 positive and negative labels, with an average of 2.86,

  • All the labels appear positively in \(\Pi\),

  • Of the 41 labels, 38 appear negatively in \(\Pi\), and

  • Each label appears either positively or negatively between 2 and 31 times in \(\Pi\), with an average of 16.95.

Table 3 gives a close-up view of the structure of the constraints, showing the number of rules having n positive and negative labels, together with the average number of negative and positive labels in such rules. As witnessed by Table 3, in the 243 constraints, there are two in which all the labels are positive (expressing that there must be at least one agent and that every agent but traffic lights has at least one location), and 214 in which all the labels are negative (expressing mutual exclusion between two labels). All the constraints with more than two labels have at most one negative label, as they express a one-to-many relation between actions and agents (like “if something is crossing, then it is a pedestrian or a cyclist”). Constraints like “pedestrians should cross at crossings”, which might not be satisfied in practice, are not included. The list with all the 243 requirements, with their natural language explanations, is in Appendix D, Tables 9, 10, and 11. Overall, the 243 requirements restrict the number of admissible predictions to \(4985868 \approx 5 \times 10^6\), thus ruling out \((2^{41} - 4985868) \approx 2.2 \times 10^{12}\) non-admissible predictions (see Footnote 3).
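The size of the prediction space quoted above can be checked with a short computation. The variable names are chosen here for illustration; the two input numbers are the ones stated in the text.

```python
# Each of the 41 labels is either positive or negative in a prediction.
total_predictions = 2 ** 41

# Number of admissible predictions reported in the text.
admissible = 4985868

non_admissible = total_predictions - admissible
admissible_fraction = admissible / total_predictions

print(total_predictions)   # 2199023255552
print(non_admissible)      # 2199018269684
```

The admissible predictions therefore make up only about two in a million of all possible label combinations.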

Table 3 Constraint statistics. \(\Pi _n\) is the set of constraints r in \(\Pi\) with \(\mid r \mid = n\), i.e., with n positive and negative labels. \(\overline{\mathcal {C}} = \{\lnot {A}: A \in \mathcal {C}\}\)

4 ROAD-R and SOTA models

As a first step, we ran 6 SOTA temporal feature learning architectures as part of a 3D-RetinaNet model (Singh et al. 2022) (with a 2D-ConvNet backbone made of ResNet50 (He et al. 2016)) for event detection and evaluated the extent to which the constraints are violated. Each SOTA model takes as input a sequence of frames, and it returns: (i) a set of bounding boxes for each frame, and (ii) a vector \(v \in [0,1]^{\mid \mathcal {C} \mid }\) for each bounding box. For each bounding box, the final prediction is then the set of positive and negative labels obtained by thresholding v as described in Sect. 2. We considered:

  1. 2D-ConvNet (C2D) (Wang et al. 2018): a ResNet50-based architecture with an additional temporal dimension for learning features from videos. The extension from 2D to 3D is done by adding a pooling layer over time to combine the spatial features.

  2. Inflated 3D-ConvNet (I3D) (Carreira and Zisserman 2017): a sequential learning architecture extendable to any SOTA image classification model (2D-ConvNet based), able to learn continuous spatio-temporal features from the sequence of frames.

  3. Recurrent Convolutional Network (RCN) (Singh and Cuzzolin 2019): a 3D-ConvNet model that relies on recurrence for learning the spatio-temporal features at each network level. During the feature extraction phase, RCNs exploit both 2D convolutions across the spatial domain and 1D convolutions across the temporal domain.

  4. Random Connectivity Long Short-Term Memory (RCLSTM) (Hua et al. 2018): an updated version of LSTM in which the neurons are connected in a stochastic manner, rather than fully connected. In our case, the LSTM cell is used as a bottleneck in ResNet50 for learning the features sequentially.

  5. Random Connectivity Gated Recurrent Unit (RCGRU) (Hua et al. 2018): an alternative version of RCLSTM where the GRU cell is used instead of the LSTM one. GRU makes the process more efficient with fewer parameters than the LSTM.

  6. SlowFast (Feichtenhofer et al. 2019): a 3D-CNN architecture that contains both slow and fast pathways for extracting the sequential features. A slow pathway computes the spatial semantics at a low frame rate, while a fast pathway processes a high frame rate for capturing the motion features. Both pathways are fused in a single architecture by lateral connections.

We trained 3D-RetinaNet (see Footnote 4) using the same hyperparameter settings for all the models: (i) batch size equal to 4, (ii) sequence length equal to 8, and (iii) image input size equal to \(512\times 682\). All the models were initialized with the Kinetics pre-trained weights. An SGD optimizer (LeCun et al. 2012) with step learning rate was used. The initial learning rate was set to 0.0041 for all the models except SlowFast, for which it was set to 0.0021 due to the diverse nature of slow and fast pathways. All the models were trained for 30 epochs, and the learning rate was made to drop by a factor of 10 after 18 and 25 epochs. The machine used for the experiments has 64 CPUs (2.2 GHz each) and 4 Titan RTX GPUs having 24 GB of RAM each.

To measure the models’ performance, we used the frame mean average precision (f-mAP), which is the standard metric used for action detection (see, e.g., (Kalogeiton et al. 2017; Li et al. 2018)) and is obtained by calculating for each class the mean average precision over all frames, averaging the final results as shown in Eq. (2). In our experiments, we set IoU threshold equal to 0.5 and 0.75, indicated as f-mAP@0.5 and f-mAP@0.75, respectively.

$$\begin{aligned} \text {f-mAP@}\tau = \frac{1}{|\mathcal {C}|} \frac{1}{F} \sum _{i=1}^{|\mathcal {C}|} \sum _{j=1}^{F} {\text {AP}_{ij}}, \end{aligned} \tag{2}$$

where F is the number of frames, and \(\text {AP}_{ij}\) is the average precision for class i at frame j at IoU \(\tau\). The results for the SOTA models at IoU threshold 0.5 and 0.75 are reported in Table 4, column “SOTA”.
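Read literally, Eq. (2) is a plain average of the per-class, per-frame average precisions. A minimal sketch, assuming the values \(\text {AP}_{ij}\) have already been computed at the chosen IoU threshold and are stored as a nested list (the function name and the numbers are illustrative, not results from the paper):

```python
def f_map(ap):
    """Average ap[i][j] (AP of class i at frame j) over all classes and frames, as in Eq. (2)."""
    num_classes = len(ap)
    num_frames = len(ap[0])
    total = sum(sum(row) for row in ap)
    return total / (num_classes * num_frames)


# Two classes, two frames, made-up AP values.
example_ap = [
    [0.5, 1.0],
    [0.0, 0.5],
]
score = f_map(example_ap)  # (0.5 + 1.0 + 0.0 + 0.5) / 4
```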

To measure the extent to which each system violates the constraints, we used the following metrics:

  • The percentage of non-admissible predictions,

  • The average number of violations committed per prediction, and

  • The percentage of constraints violated at least once,

while varying the threshold \(\theta\) from 0.1 to 0.9 with step 0.1. The results are in Fig. 2, where (to improve readability) we do not plot the values corresponding to \(\theta =0.0\) and \(\theta =1.0\). For \(\theta =0.0\) (resp., \(\theta = 1.0\)), all the predictions are positive (resp., negative), and thus the corresponding values are (in order) 100%, 214, and 214/243 (resp., 100%, 2, and 2/243).
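The three metrics can be computed from a batch of thresholded predictions as in the sketch below. The `(label, polarity)` clause encoding, the function names, and the placeholder labels `X` and `Y` are assumptions of this illustration.

```python
def satisfies(clause, prediction):
    """clause: list of (label, polarity) literals; prediction: dict label -> bool."""
    return any(prediction[label] == polarity for label, polarity in clause)


def violation_metrics(constraints, predictions):
    """Return (% non-admissible predictions, average violations per prediction,
    % constraints violated at least once)."""
    violated = [
        [i for i, clause in enumerate(constraints) if not satisfies(clause, p)]
        for p in predictions
    ]
    n = len(predictions)
    non_admissible = sum(1 for v in violated if v)
    avg_violations = sum(len(v) for v in violated) / n
    violated_at_least_once = set().union(*violated)
    return (
        100 * non_admissible / n,
        avg_violations,
        100 * len(violated_at_least_once) / len(constraints),
    )


constraints = [
    [("RedTL", False), ("GreenTL", False)],  # not RedTL or not GreenTL
    [("X", True), ("Y", True)],              # X or Y (placeholder labels)
]
predictions = [
    {"RedTL": True, "GreenTL": True, "X": True, "Y": False},
    {"RedTL": True, "GreenTL": False, "X": True, "Y": False},
    {"RedTL": True, "GreenTL": True, "X": False, "Y": True},
    {"RedTL": False, "GreenTL": False, "X": True, "Y": True},
]
metrics = violation_metrics(constraints, predictions)
```

In this toy batch, half of the predictions are non-admissible, there are 0.5 violations per prediction on average, and one of the two constraints is violated at least once.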

Fig. 2 ROAD-R and SOTA models. The x-axis shows the threshold \(\theta \in [0.1,0.9]\), with step 0.1

Consider the results in Table 4, column “SOTA”, and in Fig. 2. First, note that performance is not an indicator of the ability of a model to satisfy the constraints. Indeed, higher f-mAPs do not correspond to lower curves in the plots of Fig. 2b. For example, RCGRU performs better than C2D for both IoU = 0.5 and IoU = 0.75; however, its curve is above C2D’s in both Fig. 2a and b. Then, note that the percentage of non-admissible predictions is always very high for every model: at its minimum, for \(\theta =0.1\), more than 90% of the predictions are non-admissible, and this percentage reaches 99% for \(\theta = 0.9\) (see Fig. 2a). In addition, most predictions violate roughly two constraints, as shown by Fig. 2b. Considering that we are in an autonomous vehicle setting, such results are critical: one of the constraints that is violated by all the baseline models is \(\{\lnot {\text {RedTL}}, \lnot {\text {GreenTL}}\}\), whose violation corresponds to predictions stating that there is a traffic light with both the red and the green lights on. Figure 1 shows an image where such a prediction is made by C2D. Appendix C contains images with all the models making predictions violating \(\{\lnot {\text {RedTL}}, \lnot {\text {GreenTL}}\}\) and other constraints.

Table 4 f-mAP@0.5 (top table) and f-mAP@0.75 (bottom table) of the (i) current SOTA models, (ii) CL models, (iii) CO models, and (iv) CLCO models. P, G, and L stand for the Product, Gödel, and Łukasiewicz t-norm, respectively. MD, AP, and AP \(\times\) O indicate the Minimal Distance policy, Average Precision-based policy, and Average Precision and Output-based policy, respectively. In parentheses we report the difference in performance between each model and the corresponding SOTA model. The values of the threshold \(\theta\) (for the CO and CLCO models) and of the hyperparameter \(\alpha\) (for the CL and CLCO models) are given in Table 8 in Appendix B. Best results are in bold

5 ROAD-R and CL, CO, and CLCO models

We now show how it is possible to build CL, CO, and CLCO models. In particular, we show how to equip the 6 considered SOTA models with a constrained loss and/or a constrained output. As anticipated in the introduction, we introduce (i) three different methods to build the constrained loss, (ii) three different methods to obtain the constrained output, and (iii) three combinations of constrained loss and constrained output. Thus, we get 9 models for each SOTA model, for a total of 54. In order to get an overall view of the performance gains produced by each method, we also report the average ranking of the 9 proposed methods and SOTA (Demsar 2006), computed as follows: (i) for each row in Table 4, we rank the performances of the 9 CL, CO, and CLCO models and of the SOTA model separately: the best performing model gets rank 1, the second best gets rank 2, etc., and in case of ties, the rank is split (e.g., the assigned rank is 1.5 if two models have the best performance), and (ii) for each column, we take the average of the rankings computed in step (i). See Table 4 for f-mAP@0.5, f-mAP@0.75, and average rankings, where, for each row, the best results are in bold. The details of the implemented models with constrained loss, constrained output, and both constrained loss and constrained output are given in the three subsections below.
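The average-ranking procedure with split ranks for ties can be sketched as follows. The function names and the scores in the example are made up for illustration and are not taken from Table 4.

```python
def ranks_with_ties(scores):
    """Rank scores in one row: highest score gets rank 1, tied entries share the mean rank."""
    order = sorted(range(len(scores)), key=lambda k: -scores[k])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next entry has the same score.
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = ((pos + 1) + (end + 1)) / 2  # mean of the 1-based positions in the group
        for k in order[pos:end + 1]:
            ranks[k] = shared
        pos = end + 1
    return ranks


def average_ranks(table):
    """table[r][c]: score of method c on row r. Returns the average rank of each method."""
    per_row = [ranks_with_ties(row) for row in table]
    return [sum(r[c] for r in per_row) / len(per_row) for c in range(len(table[0]))]


# Two rows (e.g. two base models), three methods, illustrative scores.
table = [
    [30.0, 31.0, 31.0],
    [29.0, 28.0, 30.0],
]
avg = average_ranks(table)
```

In the first row the two tied best methods each receive rank 1.5, matching the tie rule described above.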

Fig. 3 Comparison of the behaviour of RCLSTM and CL-RCLSTM (with Product, Gödel and Łukasiewicz loss) with respect to the requirements. The x-axis shows the threshold \(\theta \in [0.1,0.9]\), with step 0.1

5.1 Constrained loss

To constrain the loss, we take inspiration from the approaches proposed in Diligenti et al. (2017a, 2017b), and we train the models using the standard localization and classification losses, to which we add a regularization term. This last term represents the degree of satisfaction of the constraints in \(\Pi\) and has the form:

$$\begin{aligned} \mathcal {L}_{\Pi } = \alpha \sum \nolimits _{i=1}^{\mid \Pi \mid } (1 - t(r_i)), \end{aligned}$$

where \(r_i\) represents the ith constraint in \(\Pi\), \(t(r_i)\) represents the fuzzy logic relaxation of \(r_i\), and \(\alpha\) is a hyperparameter ruling the weight of the regularization term (the higher \(\alpha\) is, the more relevant the term corresponding to the constraints becomes, up to the limit case in which \(\alpha \rightarrow \infty\), and the constraints become hard (Diligenti et al. 2017b)). We considered \(\alpha \in \{1, 10, 100\}\) and the three fundamental t-norms: (i) Product t-norm, (ii) Gödel t-norm, and (iii) Łukasiewicz t-norm as fuzzy logic relaxations (Hájek 1998). The best results for f-mAP@0.5 and f-mAP@0.75 while varying \(\alpha\) are in Table 4, columns Product, Gödel, and Łukasiewicz. As can be seen, SOTA never achieves the best average ranking, even when compared with only the three CL methods. Of these, Łukasiewicz (for IoU = 0.5) and Product (for IoU = 0.75) have the best ranking, though for some models and IoU thresholds, the best performance is obtained with Gödel. In only one case (for RCLSTM at IoU = 0.75), the SOTA model performs better than the CL models. Furthermore, we measured the extent to which the CL models violate the constraints using the metrics introduced in the previous section, and we never obtained any significant reduction in the number of predictions violating the constraints. As an example, we plot the resulting charts in Fig. 3 for the SOTA model RCLSTM and the CL-RCLSTM models with Product, Gödel, and Łukasiewicz loss. As can be seen from Fig. 3a, the CL models’ predictions also violate the constraints at least 90% of the time.
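The regularization term can be sketched for a single data point as follows. This is an assumption-laden illustration rather than the training code: since each constraint is a disjunction of literals, the sketch evaluates \(t(r_i)\) with the t-conorm dual to each t-norm (probabilistic sum for Product, maximum for Gödel, bounded sum for Łukasiewicz) and takes the value of a negative literal \(\lnot A\) to be one minus the output for A; the function names and the output values are hypothetical.

```python
import math


def literal_value(literal, outputs):
    """Fuzzy truth value of a literal: the output for A, or 1 - output for "not A"."""
    label, polarity = literal
    return outputs[label] if polarity else 1.0 - outputs[label]


def clause_truth(clause, outputs, relaxation):
    """Fuzzy truth value t(r) of a disjunction of literals (assumed t-conorm semantics)."""
    values = [literal_value(lit, outputs) for lit in clause]
    if relaxation == "product":
        return 1.0 - math.prod(1.0 - v for v in values)
    if relaxation == "godel":
        return max(values)
    if relaxation == "lukasiewicz":
        return min(1.0, sum(values))
    raise ValueError(f"unknown relaxation: {relaxation}")


def constraint_loss(constraints, outputs, relaxation, alpha):
    """alpha * sum_i (1 - t(r_i)) over all constraints."""
    return alpha * sum(1.0 - clause_truth(c, outputs, relaxation) for c in constraints)


constraints = [[("RedTL", False), ("GreenTL", False)]]
outputs = {"RedTL": 0.9, "GreenTL": 0.8}  # both lights predicted "on" with high confidence
losses = {
    name: constraint_loss(constraints, outputs, name, alpha=1.0)
    for name in ("product", "godel", "lukasiewicz")
}
```

For these outputs the penalty is roughly 0.72 (Product), 0.8 (Gödel), and 0.7 (Łukasiewicz); in every case it drops to zero once either output reaches 0.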

5.2 Constrained output

We now consider the problem of how to correct a prediction p whose admissibility is evaluated at a given threshold \(\theta\). The first observation is that determining the existence of an admissible prediction is an intractable problem: indeed, this is just a reformulation of the satisfiability problem in propositional logic, which is well known to be NP-complete. Despite this, we want to correct any non-admissible prediction p in such a way that (i) the final prediction is admissible, and (ii) the performance of the final model either improves or remains unaltered.

In order to achieve the above, we first test the policy of trying to correct as few labels as possible. More precisely, for each prediction q, \((p \setminus q)\) is the set of positive and negative predictions on which q differs from p. Then, we can compute the admissible prediction q with the minimum number of differences, i.e., such that \(\mid p \setminus q\mid\) is minimal. We call this policy Minimal Distance (MD). Unfortunately, no polynomial-time algorithm is known to solve this problem.

Theorem 1

Let \((\mathcal {P},\Pi )\) be an MC problem with requirements. Let p be a prediction. For each positive d, determining the existence of an admissible prediction q such that \(\mid p {\setminus } q \mid \le d\) is an NP-complete problem.

The theorem is an easy consequence of Proposition 1 in Bailleux and Marquis (2006). In order to be able to solve the problem in practice, we formulate the problem of finding an admissible prediction with minimal \(\mid p \setminus q \mid\) as a weighted partial maximum satisfiability (PMaxSAT) problem of a set of clauses (see, e.g., (Li and Manyà 2009)) in which

  1. Each constraint in \(\Pi\) corresponds to a clause marked as hard, and

  2. Each positive and negative prediction in p corresponds to a unit clause marked as soft with unitary weight.

In our setting, a clause is a disjunction of literals, and a literal is either a positive label or its negation, representing the corresponding negative label. A clause is unit if it consists of a single literal. This allows us to use the very efficient solvers publicly available for PMaxSAT problems. In particular, in our experiments, we used MaxHS (Hickey and Bacchus 2019), and running times were in the order of \(10^{-3}\)s at most. As intended, since we assign all labels unitary weight, MaxHS returns the admissible prediction q with as few labels flipped as possible. Notice that flipping the ith label amounts to changing its output value \(o_i\) from a value not exceeding the threshold to another value \(f(o_i)\) above the threshold, or vice versa. In all our experiments, we considered (i) \(f(o_i) = \theta + \epsilon\), if \(o_i \le \theta\), and (ii) \(f(o_i) = \theta - \epsilon\), otherwise (\(\epsilon = 10^{-3}\)). In this way, assuming \(o_i \le \theta\) (i.e., the label is negatively predicted and then flipped positive), we expect \(f(o_i)\) to be lower (but still higher than \(\theta\)) than the output values of the non-flipped, positively predicted labels. The case \(o_i > \theta\) is handled analogously. We tested this approach with all the thresholds \(\theta\) from 0.1 to 0.9 with step 0.1, and the resulting f-mAP@0.5 and f-mAP@0.75 for the best threshold are reported in Table 4, column MD. As we can see from the table, despite the fact that we are minimizing the number of corrections, the results obtained by the CO-MD model are always worse than the ones obtained by the SOTA models. We can hence conclude that adding such post-processing has a detrimental effect on the models’ performance.
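The Minimal Distance policy and the output adjustment f can be illustrated on a toy instance. The sketch below replaces the PMaxSAT solver with exhaustive enumeration, which is viable only for a handful of labels, and treats an output equal to \(\theta\) as a negative prediction, following the thresholding convention of Sect. 2. All function names are assumptions of this sketch, and ties between equally close admissible predictions are broken arbitrarily by the enumeration order.

```python
from itertools import product


def satisfies(clause, prediction):
    """clause: list of (label, polarity) literals; prediction: dict label -> bool."""
    return any(prediction[label] == polarity for label, polarity in clause)


def minimal_distance_correction(constraints, prediction):
    """Return an admissible prediction that flips as few labels as possible (brute force)."""
    labels = list(prediction)
    best, best_dist = None, None
    for values in product([False, True], repeat=len(labels)):
        q = dict(zip(labels, values))
        if not all(satisfies(c, q) for c in constraints):
            continue  # q is not admissible
        dist = sum(q[label] != prediction[label] for label in labels)
        if best_dist is None or dist < best_dist:
            best, best_dist = q, dist
    return best  # None if the constraints are unsatisfiable


def flip_output(o, theta, eps=1e-3):
    """New output value for a flipped label: just above or just below the threshold."""
    return theta + eps if o <= theta else theta - eps


constraints = [[("RedTL", False), ("GreenTL", False)]]
prediction = {"RedTL": True, "GreenTL": True}
corrected = minimal_distance_correction(constraints, prediction)
```

Here exactly one of the two labels is flipped to negative; which one is chosen is not determined by the MD policy itself, which is one way to see why weighting the labels, as done next, can matter.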

Thus, we need alternative policies to correct a non-admissible prediction p. We generalize the problem by assigning to each label a positive weight \(w_i\) (representing the cost of correcting the ith label in p) and then computing the admissible prediction q that minimizes the total weight of the corrected labels. More precisely, for every prediction q, \(cost(p,q) = \sum _{i \in \mathcal {I}} w_i\), \(\mathcal {I}\) being the set of indexes of the labels in \((p \setminus q)\). Then, we can compute the admissible prediction q such that \(cost(p,q)\) is minimal. As it is a generalization of the problem above, no polynomial-time algorithm is known to solve this problem.

Theorem 2

Let \((\mathcal {P},\Pi )\) be an MC problem with requirements. Let p be a prediction, and let \(w_i\) be the cost of correcting the ith label in p. For each positive d, determining the existence of an admissible prediction q such that \(cost(p,q) \le d\) is an NP-complete problem.

This is an easy consequence of Theorem 1. We can again formulate the problem as a PMaxSAT problem in which:

  1. Each constraint in \(\Pi\) corresponds to a clause marked as hard, and

  2. For each i (\(1 \le i \le \mid \mathcal {C} \mid\)), the prediction in p for the ith label corresponds to a unit clause having weight \(w_i\).
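One possible way to hand such an instance to a solver is to serialize it in the classic DIMACS-style WCNF layout (a `p wcnf` header followed by one weighted clause per line, with hard clauses carrying a "top" weight larger than the sum of all soft weights). The sketch below is an assumption-heavy illustration: the function name, the scaling of real-valued weights to integers, and the choice of this legacy layout are not taken from the paper, and whether a given solver version accepts this exact format should be checked against its documentation.

```python
def to_wcnf(labels, constraints, prediction, weights, scale=1000):
    """Serialize hard constraints and soft unit clauses as WCNF text.

    labels: ordered list of label names (variable i+1 stands for labels[i]).
    constraints: list of clauses, each a list of (label, polarity) literals.
    prediction: dict label -> bool, the prediction p to be corrected.
    weights: dict label -> positive cost w_i of flipping that label.
    scale: factor used to turn real weights into integers (an assumption of this sketch).
    """
    index = {label: i + 1 for i, label in enumerate(labels)}
    soft = []
    for label in labels:
        literal = index[label] if prediction[label] else -index[label]
        soft.append((max(1, round(weights[label] * scale)), [literal]))
    top = sum(w for w, _ in soft) + 1  # weight that marks a clause as hard
    hard = [
        (top, [index[l] if polarity else -index[l] for l, polarity in clause])
        for clause in constraints
    ]
    lines = [f"p wcnf {len(labels)} {len(hard) + len(soft)} {top}"]
    for weight, literals in hard + soft:
        lines.append(" ".join(str(x) for x in [weight, *literals, 0]))
    return "\n".join(lines)


wcnf = to_wcnf(
    labels=["RedTL", "GreenTL"],
    constraints=[[("RedTL", False), ("GreenTL", False)]],
    prediction={"RedTL": True, "GreenTL": True},
    weights={"RedTL": 0.6, "GreenTL": 0.4},  # illustrative costs
)
print(wcnf)
```

With unit weights for every label, the same encoding reduces to the Minimal Distance formulation.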

Given the above formulation, we tested two different policies for choosing the weight associated with each label:

  1. Average Precision-based (AP), in which each \(w_i\) is equal to the average precision \(\text {AP}_i\) of the ith label, and

  2. Average Precision and Output-based (AP\(\times\)O), in which each \(w_i = \text {AP}_i \times c_i\), where \(c_i\) is equal to (i) the output \(o_i\) of the model for the ith label if \(o_i > \theta\), and (ii) \((1-o_i)\), otherwise.
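The two weighting policies can be written down directly from their definitions. In the sketch below, the function names, the per-label average precisions, and the outputs are illustrative assumptions.

```python
def ap_weights(ap):
    """AP policy: the cost of flipping label i is its average precision AP_i."""
    return dict(ap)


def ap_times_output_weights(ap, outputs, theta):
    """AP x O policy: w_i = AP_i * c_i, with c_i = o_i if o_i > theta, else 1 - o_i."""
    weights = {}
    for label, o in outputs.items():
        confidence = o if o > theta else 1.0 - o
        weights[label] = ap[label] * confidence
    return weights


ap = {"RedTL": 0.5, "GreenTL": 0.25}      # made-up per-label average precisions
outputs = {"RedTL": 0.8, "GreenTL": 0.6}  # made-up model outputs for one box
w_ap = ap_weights(ap)
w_apo = ap_times_output_weights(ap, outputs, theta=0.5)
```

Under both policies GreenTL is the cheaper label to flip in this toy case (0.25 versus 0.5 for AP, about 0.15 versus 0.4 for AP\(\times\)O), so a minimum-cost correction of the prediction {RedTL, GreenTL} would turn GreenTL negative.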

Unlike the MD policy, these policies take into account the reliability of the output \(o_i\) for the ith label when deciding which labels to flip. We again tested the two policies with all the thresholds \(\theta\) from 0.1 to 0.9 with step 0.1, and the resulting f-mAP@0.5 and f-mAP@0.75 for the best threshold are reported in Table 4, columns AP and AP\(\times\)O. Comparing the predictions of the SOTA models with the CO-AP and the CO-AP\(\times\)O models, we can see that flipping the variables taking into account the average precision (i) never leads to worse performance than that of the SOTA models, and (ii) for IoU = 0.75, correcting the output of RCLSTM with AP and AP\(\times\)O gives the best and second best performance in the row. Notice that the differences in performance between AP and AP\(\times\)O are negligible, with AP\(\times\)O being better than AP more often. The average rankings are in line with the above statements.

5.3 Constrained loss and output

Given the results presented in the previous subsection, of the 9 possible combinations of a constrained loss and a constrained output, we consider only the ones with AP\(\times\)O as constrained output. The results are shown in Table 4, last three columns. Given the consistent improvement produced by the constrained output AP\(\times\)O over the SOTA models discussed in the previous subsection, the results of the CLCO models are not surprising: post-processing the output with the AP\(\times\)O policy again produces relatively small but almost always consistent improvements over the corresponding CL models. The average rankings are in line with the above statements.

Considering the results in Table 4 all together, we see that

  1. Constraining the output alone guarantees compliance with the constraints, but the improvements in performance, while consistent, are limited,

  2. Constraining the loss alone does not guarantee the satisfaction of the requirements but can lead to non-marginal improvements in performance,

  3. The best performances (the numbers in bold) are always obtained by constraining the output, and thus it is always possible to (i) improve the performance of each SOTA model, and (ii) guarantee compliance with the requirements,

  4. On average, the best performances are obtained by CLCO models, as witnessed by the average rankings,

  5. The best performing model is CLCO-RCGRU, i.e., RCGRU with Łukasiewicz constrained loss and AP\(\times\)O constrained output: this model (i) is compliant with the constraints by construction, and (ii) has f-mAP \(=\) 31.81 for IoU \(=\) 0.5, and f-mAP \(=\) 17.27 for IoU \(=\) 0.75. RCGRU (without CL and CO) (i) produces predictions that violate the constraints at least 92% of the time, and (ii) has f-mAP \(=\) 30.78 for IoU \(=\) 0.5, and f-mAP \(=\) 15.98 for IoU \(=\) 0.75.

6 Related work

The approach proposed in this paper generalizes HMC problems, in which requirements are binary and have the form \((A \rightarrow B)\), corresponding to our \((\lnot A \vee B)\). Many models have been developed for HMC; see, e.g., (Vens et al. 2008; Wehrmann et al. 2018; Giunchiglia and Lukasiewicz 2020).

Interestingly, when dealing with more complex logical requirements on the output space, researchers have in the past mostly focused on exploiting the background knowledge that they express to improve performance and/or to deal with data scarcity, curiously neglecting the problem of guaranteeing their satisfaction. Many works go in this direction, such as Hu et al. (2016a, 2016b), where an iterative method to embed structured logical information into the neural networks’ weights is introduced: at each step, the authors consider a teacher network based on the set of logical rules to train a student network to fit both supervisions and logic rules. Another neural model is considered in Li and Srikumar (2019), in which some neurons are associated with logical predicates, and their activation is modified on the basis of the activation of the neurons corresponding to predicates that co-occur in the same rules. An entire line of research is dedicated to embedding logical constraints into loss functions; see, e.g., (Diligenti et al. 2017b; Donadello et al. 2017; Xu et al. 2018). These works consider a fuzzy relaxation of FOL formulas to get a differentiable loss function that can be minimized by gradient descent. However, in all the above methods, there is no guarantee that the constraints will actually be satisfied. Recently, this problem has gained more relevance, and a few works now propose novel ways of addressing it. One such work is (Giunchiglia and Lukasiewicz 2021), which presents a novel neural model called coherent-by-construction network (CCN). CCN not only exploits the knowledge expressed by the constraints, but is also able to guarantee the constraints’ satisfaction. However, that model is only able to guarantee the satisfaction of constraints written as normal logic rules with at least one positive label, and thus is not able to deal with all of ROAD-R’s requirements. Another work that goes in this direction is (Dragone et al. 2021), in which NESTER is proposed. In this case, the constraints are not mapped into the last layer of the network (as in CCN), but they are enforced by passing the outputs of the neural network to a constraint program, which enforces the constraints. The most recent work is given by Hoernle et al. (2022), where the authors propose MultiplexNet. MultiplexNet can impose constraints consisting of any quantifier-free linear arithmetic formula over the rationals (thus, involving “\(+\)”, “\(\ge\)”, “\(\lnot\)”, “\(\wedge\)”, and “\(\vee\)”). In order to train the model with such constraints, the formulas are first expressed in disjunctive normal form (DNF), and then the output layer of the network is augmented to include a separate transformation for each term in the DNF formula. Thus, the network’s output layer can be viewed as a multiplexor in a logical circuit that permits a branching of logic. For a general overview of deep learning with logical constraints, see the survey (Giunchiglia et al. 2022).

In the video understanding field, some recent works have started to argue for the importance of being able to extract structured information from videos and to incorporate background knowledge in the models. For example, Curtis et al. (2020) propose a challenge to test the models’ ability to extract knowledge graphs from videos. In Mahon et al. (2020), the authors develop a model that is able to exploit the knowledge expressed in logical rules to extract knowledge graphs from videos. However, ROAD-R is the first dataset that proposes the incorporation of logical constraints into deep learning models for videos, and thus represents a truly novel challenge.

7 Summary and outlook

In this paper, we proposed a new learning framework, called learning with requirements, and a new dataset for this task, called ROAD-R. We showed that SOTA models violate the requirements most of the time, and that it is possible to exploit the requirements to create models that are compliant with (i.e., strictly satisfy) the requirements while improving their performance.

ROAD-R opens up a number of research possibilities. The most straightforward open problem is how to create neural-based models that are compliant by design with the given requirements, i.e., without the need for any post-processing step. However, other directions are also possible. For example, it is an open question whether the annotated constraints can help in alleviating the data greediness characteristic of the large deep learning models usually deployed in the autonomous driving setting. Indeed, we can now use the requirements to train models on both labelled and unlabelled data. Another open question is whether neural models, in addition to bounding boxes and labels, can also learn the requirements that we annotated. In this case, the annotated requirements could be used to measure the coverage of the learned ones.

Finally, in the future, we will further extend ROAD-R. In particular, we plan to annotate ROAD with temporal constraints stating facts like “a traffic light becomes red after being green”, and with soft constraints stating likely facts like “pedestrians should cross at crossings”.