ROAD-R: the autonomous driving dataset with logical requirements

Neural networks have proven to be very powerful at computer vision tasks. However, they often exhibit unexpected behaviors, acting against background knowledge about the problem at hand. This calls for models (i) able to learn from requirements expressing such background knowledge, and (ii) guaranteed to be compliant with the requirements themselves. Unfortunately, the development of such models is hampered by the lack of real-world datasets equipped with formally specified requirements. In this paper, we introduce the ROad event Awareness Dataset with logical Requirements (ROAD-R), the first publicly available dataset for autonomous driving with requirements expressed as logical constraints. Given ROAD-R, we show that current state-of-the-art models often violate its logical constraints, and that it is possible to exploit the constraints to create models that (i) achieve better performance, and (ii) are guaranteed to be compliant with the requirements themselves.


Introduction
Neural networks have proven to be incredibly powerful at processing low-level inputs, and for this reason they have been extensively applied to computer vision tasks, such as image classification, object detection, and action detection (see, e.g., [Krizhevsky et al., 2012; Redmon et al., 2016]). However, they can exhibit unexpected behaviors that contradict known requirements expressing background knowledge. This can have dramatic consequences, especially in safety-critical scenarios such as autonomous driving. To address the problem, models should (i) be able to learn from the requirements, and (ii) be guaranteed to be compliant with the requirements themselves. Unfortunately, the development of such models is hampered by the lack of datasets equipped with formally specified requirements. A notable exception is given by hierarchical multi-label classification (HMC) problems (see, e.g., [Vens et al., 2008]), in which datasets are provided with binary constraints of the form (A → B), stating that label B must be predicted whenever label A is predicted.
In this paper, we introduce multi-label classification problems with propositional logic requirements, in which datasets are provided with requirements ruling out non-admissible predictions, expressed in propositional logic. In this new formulation, given a multi-label classification problem with labels A, B, and C, we can, for example, write the requirement (¬A ∧ B) ∨ C, stating that for each data point in the dataset either the label C is predicted, or B is predicted and A is not. Obviously, any constraint written for HMC problems can be represented in our framework, and thus our problem formulation is a generalisation of HMC problems. Then, we present the ROad event Awareness Dataset with logical Requirements (ROAD-R), the first publicly available dataset for autonomous driving with requirements expressed as logical constraints. ROAD-R extends the ROAD dataset [Singh et al., 2021], which consists of 22 relatively long (∼8 minutes each) videos annotated with road events. A road event corresponds to a tube, i.e., a sequence of frame-wise bounding boxes linked in time. Each bounding box is labeled with a subset of the 41 labels specified in Table 1. The goal is to predict the set of labels associated with each bounding box. We manually annotated ROAD-R with 243 constraints, each verified to hold for each bounding box. A typical constraint is thus "a traffic light cannot be red and green at the same time", while there are no constraints like "pedestrians should cross at crossings", which should always be satisfied in theory, but which might not be in real-world scenarios.
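To make the requirement format concrete, here is a minimal Python sketch (using the abstract label names A, B, C from the example above) checking that the formula (¬A ∧ B) ∨ C is equivalent to its clausal form (¬A ∨ C) ∧ (B ∨ C), and that an HMC constraint A → B is just the one-clause special case:

```python
from itertools import product

def admissible(pred):
    """The example requirement: (not A and B) or C."""
    return (not pred["A"] and pred["B"]) or pred["C"]

def admissible_cnf(pred):
    """The same requirement in CNF: (not A or C) and (B or C)."""
    return (not pred["A"] or pred["C"]) and (pred["B"] or pred["C"])

# The two formulations agree on all 2^3 label assignments.
for a, b, c in product([False, True], repeat=3):
    pred = {"A": a, "B": b, "C": c}
    assert admissible(pred) == admissible_cnf(pred)

# An HMC constraint A -> B corresponds to the single clause (not A or B).
hmc_ok = lambda pred: (not pred["A"]) or pred["B"]
```

This is only a truth-table illustration of the formalism, not part of the dataset tooling.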
Given ROAD-R, we considered 6 current state-of-the-art (SOTA) models, and we showed that they are not able to learn the requirements just from the data points, as more than 90% of the time they produce predictions that violate the constraints. Then, we faced the problem of how to leverage the additional knowledge provided by the constraints with the goal of (i) improving their performance, measured by the frame mean average precision (f-mAP) at intersection over union (IoU) thresholds 0.5 and 0.75 (see, e.g., [Kalogeiton et al., 2017; Li et al., 2018]), and (ii) guaranteeing that they are compliant with the constraints. To achieve the above two goals, we propose the following new models: 1. CL models, i.e., models with a constrained loss allowing them to learn from the requirements, 2. CO models, i.e., models with a constrained output enforcing the requirements on the output, and 3. CLCO models, i.e., models with both a constrained loss and a constrained output.
In particular, we consider three different ways to build CL (resp., CO, CLCO) models. More specifically, we ran the 9 × 6 models obtained by equipping each of the 6 current SOTA models with a constrained loss and/or a constrained output, and we show that it is always possible to 1. improve the performance of each SOTA model, and 2. be compliant with (i.e., strictly satisfy) the constraints.
The main contributions of the paper thus are: • we introduce multi-label classification problems with propositional logic requirements, • we introduce ROAD-R, which is the first publicly available dataset whose requirements are expressed in full propositional logic, • we consider 6 SOTA models and show that, on ROAD-R, they produce predictions violating the requirements more than 90% of the time, and • we propose new models with a constrained loss and/or constrained output, and show that with them it is always possible to improve the performance of the SOTA models and satisfy the requirements.
The rest of this paper is organized as follows. After formalizing learning with requirements (Section 2), we present ROAD-R (Section 3), followed by the evaluation on ROAD-R of the SOTA models (Section 4) and of the SOTA models incorporating the requirements (Section 5). We end the paper with the related work (Section 6) and the summary and outlook (Section 7).

Learning with Requirements
In ROAD, the detection of road events requires the following tasks: (i) identifying the bounding boxes, (ii) associating with each bounding box a set of labels, and (iii) forming a tube from the identified bounding boxes with the same labels. Here, we focus on the second task, and we formulate it as a multi-label classification problem with requirements.
A multi-label classification (MC) problem P = (C, X) consists of a finite set C of labels, denoted A_1, A_2, ..., and a finite set X of pairs (x, y), where x ∈ R^D (D ≥ 1) is a data point, and y ⊆ C is the ground truth of x. The ground truth y associated with a data point x characterizes both the positive and the negative labels associated with x, defined to be y and {¬A : A ∈ C \ y}, respectively. In ROAD-R, a data point corresponds to a bounding box, and each box is labeled with the positive labels representing (i) the agent performing the actions in the box, (ii) the actions being performed, and (iii) the locations where the actions take place. See Appendix A for a detailed description of each label. Consider an MC problem P = (C, X). A prediction p is a set of positive and negative labels such that for each label A ∈ C, either A ∈ p or ¬A ∈ p. A model m for P is a function m(·, ·) mapping every label A and every data point x to [0, 1]. A data point x is predicted by m to have label A if its output value m(A, x) is greater than a user-defined threshold θ ∈ [0, 1]. The prediction of m for x is thus the set {A : A ∈ C, m(A, x) > θ} ∪ {¬A : A ∈ C, m(A, x) ≤ θ} of positive and negative labels.
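The thresholding step above can be sketched as follows (label names are illustrative, and "!" marks a negative literal):

```python
def predict(scores, theta=0.5):
    """Build the prediction set for one data point: a positive literal for each
    label whose output m(A, x) exceeds theta, a negative literal ("!A") otherwise."""
    return ({a for a, s in scores.items() if s > theta}
            | {"!" + a for a, s in scores.items() if s <= theta})

# Hypothetical outputs of a model m for one bounding box:
scores = {"Pedestrian": 0.91, "Car": 0.12, "MovTow": 0.64}
prediction = predict(scores)
# positives: Pedestrian, MovTow; negative: !Car
```

Every label thus appears in the prediction exactly once, either positively or negatively, as required by the definition.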
An MC problem with propositional logic requirements (P, Π) consists of an MC problem P and a finite set Π of constraints ruling out non-admissible predictions, expressed in propositional logic.
Consider an MC problem with requirements (P, Π). Each requirement delimits the set of predictions that can be associated with each data point. Example 2.1: The requirement that a traffic light cannot be both red and green corresponds to the constraint {¬RedTL, ¬GreenTL}. Any prediction containing {RedTL, GreenTL} is non-admissible. An example of such a prediction is shown in Fig. 1.
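In code, admissibility against clausal constraints reduces to a set intersection per clause; a minimal sketch using Example 2.1 ("!" marks a negative literal):

```python
# Each requirement is a disjunctive clause, written as a set of literals.
CONSTRAINTS = [{"!RedTL", "!GreenTL"}]  # a light cannot be red and green at once

def is_admissible(prediction, constraints=CONSTRAINTS):
    # A clause is satisfied iff the prediction contains at least one of its literals.
    return all(clause & prediction for clause in constraints)

assert not is_admissible({"RedTL", "GreenTL"})   # the prediction of Example 2.1
assert is_admissible({"RedTL", "!GreenTL"})
```

The same check, applied clause by clause, scales directly to the full set Π.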

Given an MC problem with requirements, it is possible to take advantage of the constraints in two different ways: • they can be exploited during learning to teach the model the background knowledge that they express, and • they can be used as post-processing to turn a non-admissible prediction into an admissible one. Models in the first and second category are said to have a constrained loss (CL) and a constrained output (CO), respectively. Constrained loss models have the advantage that the constraints are deployed during the training phase, and this should result in models (i) with a deeper understanding of the problem and better performance, but still (ii) with no guarantee that no violations will be committed. On the other hand, constrained output models (i) do not exploit the additional knowledge during training, but (ii) are guaranteed to produce no violations in the final outputs. These two options are not mutually exclusive (i.e., they can be used together), and which one to deploy also depends on the degree of access one has to the model. For instance, some companies already have their own models (which may be black boxes) and want to make them compliant with a set of requirements without modifying the models themselves. On the other hand, exploiting the constraints in the learning phase can be an attractive option for those who have a good knowledge of the model and want to further improve it.

ROAD-R
ROAD-R extends the ROAD dataset [Singh et al., 2021] by introducing a set Π of 243 constraints that specify the space of admissible outputs. (All the code will be released upon publication. ROAD is available at: https://github.com/gurkirt/road-dataset.) In order to improve the usability of our dataset, we write the constraints in a way that allows us to easily express Π as a single formula in conjunctive normal form (CNF). This can be done without any loss of generality, as any propositional formula can be expressed in CNF, and it is important because many solvers expect formulas in CNF as input. Thus, each requirement in Π has the form:

l_1 ∨ l_2 ∨ ... ∨ l_n,    (1)

where n ≥ 1, and each l_i is either a negative label ¬A or a positive label A. The requirements have been manually specified following three steps: 1. an initial set of constraints Π_1 was manually created, 2. a subset Π_2 ⊂ Π_1 was retained by eliminating all those constraints that were entailed by the others, and 3. the final subset Π ⊂ Π_2 was retained by keeping only those requirements that were always satisfied by the ground-truth labels of the entire ROAD-R dataset. Finally, redundancy in the constraints was automatically checked with RELSAT. Note that our process of gathering and further selecting the logical requirements follows the software engineering paradigm more closely than the machine learning view. To this end, we ensured that the constraints were consistent with the labels provided by the ROAD dataset, in the sense that they act as strict conditions to be satisfied by the ground-truth labels, as emphasized in the third step of the annotation pipeline above. Tables 2 and 3 give a high-level description of the properties of the set Π of constraints. Notice that, with a slight abuse of notation, in the tables we use a set-based notation for the requirements. Each requirement of form (1) thus becomes {l_1, l_2, ..., l_n}.
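Step 2 of the pipeline above (dropping entailed constraints) can be sketched by brute force over a tiny label set: a clause is redundant iff every assignment satisfying the remaining clauses also satisfies it. RELSAT performs this check at scale; the labels and clauses below are illustrative:

```python
from itertools import product

LABELS = ["RedTL", "GreenTL", "TrafLight"]
C1 = {"!RedTL", "!GreenTL"}               # red and green are mutually exclusive
C2 = {"!RedTL", "TrafLight"}              # hypothetical: a red light implies a traffic light
C3 = {"!RedTL", "!GreenTL", "TrafLight"}  # weaker than C1, hence entailed by it

def holds(clause, assignment):
    # A positive literal holds if its label is True, a "!" literal if it is False.
    return any(assignment[l.lstrip("!")] != l.startswith("!") for l in clause)

def entailed(clause, others):
    """True iff every assignment satisfying `others` also satisfies `clause`."""
    return all(holds(clause, dict(zip(LABELS, vs)))
               for vs in product([False, True], repeat=len(LABELS))
               if all(holds(c, dict(zip(LABELS, vs))) for c in others))

assert entailed(C3, [C1])       # C3 can be dropped once C1 is kept
assert not entailed(C2, [C1])   # C2 carries new information
```

For the 41 labels of ROAD-R the same check is done with a SAT solver rather than enumeration.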
Such notation allows us to express the properties of the requirements more succinctly. In addition to the information in the tables, we report that, of the 243 constraints, there are two in which all the labels are positive (expressing that there must be at least one agent and that every agent but traffic lights has at least one location), and 214 in which all the labels are negative (expressing mutual exclusion between two labels). All the constraints with more than two labels have at most one negative label, as they express a one-to-many relation between actions and agents (like "if something is crossing, then it is a pedestrian or a cyclist"). Constraints like "pedestrians should cross at crossings", which might not be satisfied in practice, are not included. Embedding such constraints would require, e.g., modal operators, and while it would be interesting to study the impact on the models' predictions of adding more expressive layers to our logic, we opted for a simpler logic in this first instance. This also provides more transparency to the wider research community, as full propositional logic covers a vast range of applications that do not require extra logical operators. The list of all 243 requirements, with their natural language explanations, is in Appendix B, Tables 8, 9, and 10. Notice that the 243 requirements restrict the number of admissible predictions to 4,985,868 (∼5 × 10^6), thus ruling out 2^41 − 4,985,868 (∼10^12) non-admissible predictions.
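The admissible-prediction count can be reproduced in miniature: enumerate all assignments over a small label set and keep those satisfying every clause. For the full 41 labels one would need a model counter rather than enumeration; the labels and clauses below are illustrative:

```python
from itertools import product

LABELS = ["RedTL", "GreenTL", "AmberTL"]
# Pairwise mutual exclusion among the three light states.
CONSTRAINTS = [{"!RedTL", "!GreenTL"},
               {"!RedTL", "!AmberTL"},
               {"!GreenTL", "!AmberTL"}]

def holds(clause, assignment):
    return any(assignment[l.lstrip("!")] != l.startswith("!") for l in clause)

def count_admissible(labels, constraints):
    return sum(all(holds(c, dict(zip(labels, vs))) for c in constraints)
               for vs in product([False, True], repeat=len(labels)))

print(count_admissible(LABELS, CONSTRAINTS))  # 4 of the 2^3 = 8 assignments survive
```

Here the three mutual exclusions leave exactly the assignments with at most one light on, mirroring how Π cuts 2^41 down to ∼5 × 10^6.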
In principle, the set of admissible predictions can be further reduced by adding other constraints. Indeed, the 243 requirements are not guaranteed to be complete from every possible point of view: as is standard in the software development cycle, the requirement specification process deeply involves the stakeholders of the system (see, e.g., [Sommerville, 2011]). For example, we decided not to include constraints like "it is not possible to both move towards and move away", which were not satisfied by all the data points because of errors in the ground-truth labels. In these cases, we decided to dismiss the constraint in order to maintain (i) consistency between the knowledge provided by the constraints and by the data points, and (ii) backward compatibility.
As an additional point, we underline that, even though the annotation of the requirements introduces some overhead in the annotation process, the effort of manually writing 243 constraints (i) is negligible when compared to the effort of manually annotating the 22 videos, and (ii) can improve the latter process, e.g., by helping to prevent errors in the annotation of the data points.

GRU makes the process more efficient, with fewer parameters than the LSTM. 6. SlowFast [Feichtenhofer et al., 2019]: a 3D-CNN architecture that contains both slow and fast pathways for extracting sequential features. The Slow pathway computes spatial semantics at a low frame rate, while the Fast pathway processes a high frame rate to capture motion features. The two pathways are fused into a single architecture by lateral connections. We trained 3D-RetinaNet using the same hyperparameter settings for all the models: (i) batch size equal to 4, (ii) sequence length equal to 8, and (iii) image input size equal to 512 × 682. All the models were initialized with the Kinetics pre-trained weights. An SGD optimizer [LeCun et al., 2012] with a step learning rate was used. The initial learning rate was set to 0.0041 for all the models except SlowFast, for which it was set to 0.0021 due to the diverse nature of the slow and fast pathways. All the models were trained for 30 epochs, and the learning rate was dropped by a factor of 10 after epochs 18 and 25. The machine used for the experiments has 64 CPUs (2.2 GHz each) and 4 Titan RTX GPUs with 24 GB of RAM each.

ROAD-R and SOTA Models
To measure the models' performance, we used the frame mean average precision (f-mAP), which is the standard metric for action detection (see, e.g., [Kalogeiton et al., 2017; Li et al., 2018]), with IoU thresholds equal to 0.5 and 0.75, indicated as f-mAP@0.5 and f-mAP@0.75, respectively. The results for the SOTA models at IoU thresholds 0.5 and 0.75 are reported in Table 4, column "SOTA".
To measure the extent to which each system violates the constraints, we used the following metrics: • the percentage of non-admissible predictions, • the average number of violations committed per prediction, and • the percentage of constraints violated at least once, while varying the threshold θ from 0.1 to 0.9 with step 0.1. The results are in Fig. 2, where (to improve readability) we do not plot the values corresponding to θ = 0.0 and θ = 1.0.
For θ = 0.0 (resp., θ = 1.0), all the predictions are positive (resp., negative), and thus the corresponding values are (in order) 100%, 214, and 214/243 (resp., 100%, 2, and 2/243). Consider the results in Table 4, column "SOTA", and in Fig. 2. First, note that performance is not an indicator of a model's ability to satisfy the constraints. Indeed, higher f-mAPs do not correspond to lower curves in the plots of Fig. 2b. For example, RCGRU performs better than C2D for both IoU = 0.5 and IoU = 0.75; however, its curve is above C2D's in both Figs. 2a and 2b. Second, note that the percentage of non-admissible predictions is always very high for every model: at its minimum, for θ = 0.1, more than 90% of the predictions are non-admissible, and this percentage reaches 99% for θ = 0.9 (see Fig. 2a). In addition, most predictions violate roughly two constraints, as shown in Fig. 2b. Considering that we are in an autonomous vehicle setting, such results are critical: one of the constraints violated by all the baseline models is {¬RedTL, ¬GreenTL}, corresponding to predictions according to which a traffic light has both the red and the green lights on. Fig. 3 shows, for each of the SOTA models, an image where such a prediction (for θ = 0.5) is made. Appendix C contains qualitative examples of all the SOTA models making predictions violating other constraints.
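The three metrics above can be computed directly from predictions in the clausal set representation; a sketch with made-up predictions and a two-constraint Π:

```python
def violation_stats(predictions, constraints):
    """The three metrics of Fig. 2: percentage of non-admissible predictions,
    average number of violations per prediction, and percentage of
    constraints violated at least once ("!" marks a negative literal)."""
    n_bad, total_viol, violated = 0, 0, set()
    for p in predictions:
        viol = [i for i, c in enumerate(constraints) if not (c & p)]
        n_bad += 1 if viol else 0
        total_viol += len(viol)
        violated.update(viol)
    n = len(predictions)
    return (100 * n_bad / n, total_viol / n, 100 * len(violated) / len(constraints))

constraints = [{"!RedTL", "!GreenTL"}, {"!Ped", "!Car"}]
preds = [{"RedTL", "GreenTL", "Ped", "!Car"},   # violates the first constraint
         {"!RedTL", "GreenTL", "!Ped", "Car"}]  # admissible
print(violation_stats(preds, constraints))      # (50.0, 0.5, 50.0)
```

Sweeping θ and recomputing these statistics over the test set reproduces curves of the kind shown in Fig. 2.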

ROAD-R and CL, CO, and CLCO Models
We now show how it is possible to build CL, CO, and CLCO models. In particular, we show how to equip the 6 considered SOTA models with a constrained loss and/or a constrained output. As anticipated in the introduction, we introduce (i) three different methods to build the constrained loss, (ii) three different methods to obtain the constrained output, and (iii) three combinations of constrained loss and constrained output. Thus, we get 9 models for each SOTA model, for a total of 54. In order to get an overall view of the performance gains produced by each method, we also report the average ranking of the 9 proposed methods and of the SOTA models [Demšar, 2006], computed as follows: 1. for each row in Table 4, we rank the performances of the 9 CL, CO, and CLCO models and of the SOTA model separately: the best-performing model gets rank 1, the second best gets rank 2, etc., and in case of ties, the rank is split (e.g., the assigned rank is 1.5 if two models have the best performance), and 2. for each column, we take the average of the rankings computed in the previous step.
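The tie-splitting average ranking can be sketched as follows (higher score = better, rank 1 = best; the two rows and three models are made up):

```python
def average_ranks(rows):
    """Average ranking over rows: in each row the best score gets rank 1,
    and tied scores get the mean of the positions they span."""
    n = len(rows[0])
    sums = [0.0] * n
    for row in rows:
        order = sorted(range(n), key=lambda j: -row[j])  # best first
        i = 0
        while i < n:
            j = i
            while j + 1 < n and row[order[j + 1]] == row[order[i]]:
                j += 1                     # extend over the tie block
            mean_rank = (i + j) / 2 + 1    # split rank across the tie block
            for k in range(i, j + 1):
                sums[order[k]] += mean_rank
            i = j + 1
    return [s / len(rows) for s in sums]

# Two metric rows (e.g., f-mAP@0.5 and f-mAP@0.75) for three hypothetical models:
print(average_ranks([[0.5, 0.7, 0.7], [0.6, 0.4, 0.6]]))  # [2.25, 2.25, 1.5]
```

In the first row the two 0.7 scores share ranks 1 and 2, so each gets 1.5, exactly as in the procedure described above.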
See Table 4 for f-mAP@0.5, f-mAP@0.75, and average rankings. In the table, for each row, the best results are in bold. The details of the implemented models with constrained loss, constrained output, and both constrained loss and constrained output are given in the three paragraphs below.
Constrained Loss. To constrain the loss, we take inspiration from the approaches proposed in [Diligenti et al., 2017b; Diligenti et al., 2017a], and we train the models using the standard localization and classification losses, to which we add a regularization term. This last term accounts for the degree of satisfaction of the constraints in Π and has the form:

α Σ_{i=1}^{|Π|} (1 − t(r_i)),

where r_i represents the ith constraint in Π, t(r_i) represents the fuzzy logic relaxation of r_i, and α is a hyperparameter ruling the weight of the regularization term (the higher α is, the more relevant the term corresponding to the constraints becomes, up to the limit case in which α → ∞ and the constraints become hard [Diligenti et al., 2017a]). We considered α ∈ {1, 10, 100} and the three fundamental t-norms as fuzzy logic relaxations (see, e.g., [Metcalfe, 2005]): 1. the Product t-norm, 2. the Gödel t-norm, and 3. the Łukasiewicz t-norm. The best results for f-mAP@0.5 and f-mAP@0.75 while varying α are in Table 4, columns Product, Gödel, and Łukasiewicz. As can be seen, SOTA never achieves the best average ranking, even when compared with only the three CL methods. Of these, Łukasiewicz (for IoU = 0.5) and Product (for IoU = 0.75) have the best ranking, though for some models and IoUs, the best performance is obtained with Gödel. In only one case (RCLSTM at IoU = 0.75) does the SOTA model perform better than the CL models. Also notice that the best performances are never obtained with α = 100, and that we never get any significant reduction in the number of predictions violating the constraints.

Constrained Output. We now consider the problem of how to correct a prediction p whose admissibility is evaluated at a given threshold θ. The first observation is that determining the existence of an admissible prediction is an intractable problem: indeed, this is just a reformulation of the satisfiability problem in propositional logic, which is well known to be NP-complete. Despite this, we want to correct any non-admissible
prediction p in such a way that (i) the final prediction is admissible, and (ii) the performance of the final model either improves or remains unaltered.
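The three fuzzy relaxations used in the constrained loss above can be sketched as follows: a disjunctive clause is relaxed through the t-conorm dual to each t-norm, and a negated label with output o contributes 1 − o (the outputs below are made up):

```python
from functools import reduce

# t-conorms dual to the Product, Godel, and Lukasiewicz t-norms,
# used to relax a disjunctive clause l1 v ... v ln:
prob_sum  = lambda a, b: a + b - a * b    # dual of Product
godel_max = lambda a, b: max(a, b)        # dual of Godel
luka_sum  = lambda a, b: min(1.0, a + b)  # dual of Lukasiewicz

def clause_truth(literal_values, tconorm):
    return reduce(tconorm, literal_values)

# Truth degree of {!RedTL, !GreenTL} given outputs RedTL=0.9, GreenTL=0.8;
# each negated label with output o contributes 1 - o:
lits = [1 - 0.9, 1 - 0.8]
print(round(clause_truth(lits, prob_sum), 2))   # 0.28
print(round(clause_truth(lits, godel_max), 2))  # 0.2
print(round(clause_truth(lits, luka_sum), 2))   # 0.3
# A regularization term of the kind described above would then add,
# per constraint r_i, a penalty proportional to 1 - t(r_i), so fully
# satisfied clauses contribute nothing to the loss.
```

This is a sketch of the relaxation scheme only; in the actual models the term is added to the localization and classification losses and backpropagated.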
In order to achieve the above, we first test the policy of correcting as few labels as possible. More precisely, for each prediction q, (p \ q) is the set of positive and negative predictions on which q differs from p. Then, we can compute the admissible prediction q with the minimum number of differences, i.e., such that |p \ q| is minimal. We call this policy Minimal Distance (MD). Unfortunately, no polynomial-time algorithm is known to solve this problem.
Theorem 5.1 Let (P, Π) be an MC problem with requirements, and let p be a prediction. For each positive d, determining the existence of an admissible prediction q such that |p \ q| ≤ d is an NP-complete problem.
The theorem is an easy consequence of Proposition 1 in [Bailleux and Marquis, 2006]. In order to solve the problem in practice, we formulate the problem of finding an admissible prediction with minimal |p \ q| as a weighted partial maximum satisfiability (PMaxSAT) problem over a set of clauses (see, e.g., [Li and Manyà, 2009]) in which: 1. each constraint in Π corresponds to a clause marked as hard, and 2. each positive and negative prediction in p corresponds to a unit clause marked as soft with unitary weight.
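At toy scale, the MD policy can be emulated without a MaxSAT solver by exhaustive search over candidate predictions; MaxHS replaces this enumeration in practice, and the labels below are illustrative:

```python
from itertools import product

LABELS = ["RedTL", "GreenTL", "TrafLight"]
CONSTRAINTS = [{"!RedTL", "!GreenTL"}]   # the hard clause of Example 2.1

def admissible(pred):
    # every hard clause must share at least one literal with the prediction
    return all(c & pred for c in CONSTRAINTS)

def correct_md(pred):
    """Minimal Distance policy: return the admissible prediction that flips
    as few labels of `pred` as possible (exhaustive search stands in for
    the PMaxSAT encoding solved by MaxHS at scale)."""
    best = None
    for vs in product([False, True], repeat=len(LABELS)):
        q = {l if v else "!" + l for l, v in zip(LABELS, vs)}
        if admissible(q):
            flips = len(pred - q)          # literals on which q differs from pred
            if best is None or flips < best[0]:
                best = (flips, q)
    return best[1]

p = {"RedTL", "GreenTL", "TrafLight"}      # non-admissible
q = correct_md(p)
assert admissible(q) and len(p - q) == 1   # a single flip restores admissibility
```

The soft unit clauses of the PMaxSAT encoding play exactly the role of the `flips` count here: each flipped literal costs one unit of weight.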
This allows us to use the very efficient solvers publicly available for PMaxSAT problems. In particular, in our experiments we used MaxHS [Hickey and Bacchus, 2019], and running times were on the order of 10^-3 s at most. As intended, since we assign all labels unitary weight, MaxHS returns the admissible prediction q with as few labels flipped as possible. Notice that flipping the ith label amounts to changing its output value o_i from a value below the threshold to another value f(o_i) above the threshold, or vice versa. In all our experiments, we chose f so that, assuming o_i ≤ θ (i.e., the label is negatively predicted and then flipped positive), f(o_i) is lower (but still higher than θ) than the output values of the non-flipped, positively predicted labels; the case o_i > θ is handled analogously. We tested this approach with all the thresholds θ from 0.1 to 0.9 with step 0.1, and the resulting f-mAP@0.5 and f-mAP@0.75 for the best threshold are reported in Table 4, column MD. As we can see from the table, despite the fact that we are minimizing the number of corrections, the results obtained by the CO-MD model are always worse than those obtained by the SOTA models. We can hence conclude that adding such post-processing has a detrimental effect on the models' performance. Thus, we need alternative policies to correct a non-admissible prediction p. We generalize the problem by assigning to each label a positive weight w_i (representing the cost of correcting the ith label in p) and then computing the admissible prediction q that minimizes the total cost of the corrections. More precisely, for every prediction q, cost(p, q) = Σ_{i ∈ I} w_i, I being the set of indexes of the labels in (p \ q). Then, we can compute the admissible prediction q such that cost(p, q) is minimal. Since this is a generalization of the problem above, no polynomial-time algorithm is known to solve it.
Theorem 5.2 Let (P, Π) be an MC problem with requirements, let p be a prediction, and let w_i be the cost of correcting the ith label in p. For each positive d, determining the existence of an admissible prediction q such that cost(p, q) ≤ d is an NP-complete problem. This is an easy consequence of Theorem 5.1. Luckily, we can again formulate the problem as a PMaxSAT problem in which: 1. each constraint in Π corresponds to a clause marked as hard, and 2. for each i (1 ≤ i ≤ 41), the prediction in p for the ith label corresponds to a unit clause having weight w_i. Given the above two formulations, we tested two policies: 1. Average Precision based (AP), in which each w_i is equal to the average precision AP_i of the ith label, and 2. Average Precision and Output based (AP×O), in which each w_i = AP_i × c_i, where c_i is equal to (i) the output o_i of the model for the ith label if o_i > θ, and (ii) (1 − o_i) otherwise. Unlike the MD policy, these policies take into account the reliability of the output o_i of the model for the ith label. We again tested the two policies with all the thresholds θ from 0.1 to 0.9 with step 0.1, and the resulting f-mAP@0.5 and f-mAP@0.75 for the best threshold are reported in Table 4, columns AP and AP×O. Comparing the predictions of the SOTA models with those of the CO-AP and CO-AP×O models, we can see that flipping the variables taking into account the average precision (i) never leads to worse performance than that of the SOTA models, and (ii) for IoU = 0.75, correcting the output of RCLSTM with AP and AP×O gives the best and second-best performance in the row. Notice that the differences in performance between AP and AP×O are always negligible, with AP×O being better than AP more often. The average rankings are in line with the above statements.
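The two weighted policies can be sketched in the same toy setting; the AP values and model outputs below are made up for illustration:

```python
from itertools import product

LABELS = ["RedTL", "GreenTL"]
CONSTRAINTS = [{"!RedTL", "!GreenTL"}]     # mutual exclusion
AP = {"RedTL": 0.9, "GreenTL": 0.3}        # hypothetical per-label average precisions
OUT = {"RedTL": 0.8, "GreenTL": 0.6}       # hypothetical model outputs
theta = 0.5

def all_predictions():
    for vs in product([False, True], repeat=len(LABELS)):
        yield {l if v else "!" + l for l, v in zip(LABELS, vs)}

def min_cost_correction(p, w):
    admissible = [q for q in all_predictions() if all(c & q for c in CONSTRAINTS)]
    # cost of q = sum of the weights of the labels flipped w.r.t. p
    return min(admissible, key=lambda q: sum(w[l.lstrip("!")] for l in p - q))

p = {"RedTL", "GreenTL"}                   # non-admissible prediction
w_ap = AP                                                        # AP policy
w_apo = {l: AP[l] * (OUT[l] if OUT[l] > theta else 1 - OUT[l])   # AP x O policy
         for l in LABELS}

# Both policies keep the more reliable RedTL and flip GreenTL:
assert min_cost_correction(p, w_ap) == {"RedTL", "!GreenTL"}
assert min_cost_correction(p, w_apo) == {"RedTL", "!GreenTL"}
```

As in the MD case, the exhaustive search stands in for the weighted PMaxSAT encoding solved by MaxHS over the full 41 labels.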

Related Work
The approach proposed in this paper generalizes HMC problems, in which requirements are binary and have the form (A → B), corresponding to our (¬A ∨ B). Many models have been developed for HMC; see, e.g., [Vens et al., 2008; Wehrmann et al., 2018; Giunchiglia and Lukasiewicz, 2020].
Interestingly, when dealing with more complex logical requirements on the output space, researchers have mostly focused on exploiting the background knowledge that such requirements express to improve performance and/or to deal with data scarcity, curiously neglecting the problem of guaranteeing their satisfaction. Many works go in this direction, such as [Hu et al., 2016a; Hu et al., 2016b], where an iterative method to embed structured logical information into the neural network's weights is introduced: at each step, the authors use a teacher network based on the set of logical rules to train a student network to fit both the supervisions and the logic rules. Another neural model is considered in [Li and Srikumar, 2019], in which some neurons are associated with logical predicates, and their activation is modified based on the activation of the neurons corresponding to predicates that co-occur in the same rules.
An entire line of research is dedicated to embedding logical constraints into loss functions; see, e.g., [Diligenti et al., 2017a; Donadello et al., 2017; Xu et al., 2018]. These works consider a fuzzy relaxation of FOL formulas to get a differentiable loss function that can be minimized by gradient descent. However, in all the above methods, there is no guarantee that the constraints will actually be satisfied. Luckily, this problem has recently gained more relevance, and a few works now propose novel ways of addressing it. One such work is [Giunchiglia and Lukasiewicz, 2021], which presents a novel neural model called coherent-by-construction network (CCN). CCN not only exploits the knowledge expressed by the constraints, but is also able to guarantee the constraints' satisfaction. However, that model can only guarantee the satisfaction of constraints written as normal logic rules with at least one positive label, and is thus not able to deal with all of ROAD-R's requirements.
Another work that goes in this direction is [Dragone et al., 2021], in which NESTER is proposed. In this case, the constraints are not mapped into the last layer of the network (like in MultiplexNet or CCN), but are enforced by passing the outputs of the neural network to a constraint program. The most recent work is [Hoernle et al., 2022], where the authors propose MultiplexNet. MultiplexNet can impose constraints consisting of any quantifier-free linear arithmetic formula over the rationals (thus involving "+", "≥", "¬", "∧", and "∨"). In order to train the model with such constraints, the formulas are first expressed in disjunctive normal form (DNF), and then the output layer of the network is augmented to include a separate transformation for each term of the DNF formula. Thus, the network's output layer can be viewed as a multiplexor in a logical circuit that permits a branching of logic. For a general overview of deep learning with logical constraints, see the survey [Giunchiglia et al., 2022].
In the video understanding field, some recent works have started to argue for the importance of extracting structured information from videos and of incorporating background knowledge into the models. For example, [Curtis et al., 2020] propose a challenge to test models' ability to extract knowledge graphs from videos. Mahon et al. (2020) develop a model that is able to exploit the knowledge expressed in logical rules to extract knowledge graphs from videos. However, ROAD-R is the first dataset that proposes the incorporation of logical constraints into deep learning models for videos, and it thus represents a truly novel challenge.

Summary and Outlook
We proposed a new learning framework, called learning with requirements, and a new dataset for this task, called ROAD-R. We showed that SOTA models violate the requirements most of the time, and how it is possible to exploit the requirements to create models that are compliant with (i.e., strictly satisfy) the requirements while improving their performance.
We envision that requirement specification will become a standard step in the development of machine learning models, to guarantee their safety, as it is in any software development process.In this sense, ROAD-R may be followed by many other datasets with formally specified requirements.

Ethical Considerations
ROAD-R consists of the set of logical constraints in Appendix B on top of the existing ROAD dataset, which is publicly available and linked above. Thus, ethical issues related to person identification do not apply to ROAD-R.
Figure 2: ROAD-R and SOTA models, varying the threshold θ ∈ [0.1, 0.9] with step 0.1. (a) Percentage of predictions violating at least one constraint. (b) Average number of violations committed per prediction. (c) Percentage of constraints violated at least once.
As a first step, we ran 6 SOTA temporal feature learning architectures as part of a 3D-RetinaNet model [Singh et al., 2021] (with a 2D-ConvNet backbone made of Resnet50 [He et al., 2016]) for event detection, and we evaluated the extent to which the constraints are violated. We considered: 1. 2D-ConvNet (C2D) [Wang et al., 2018]: a Resnet50-based architecture with an additional temporal dimension for learning features from videos. The extension from 2D to 3D is done by adding a pooling layer over time to combine the spatial features. 2. Inflated 3D-ConvNet (I3D) [Carreira and Zisserman, 2017]: a sequential learning architecture extendable to any SOTA image classification model (2D-ConvNet based), able to learn continuous spatio-temporal features from the sequence of frames. 3. Recurrent Convolutional Network (RCN) [Singh and Cuzzolin, 2019]: a 3D-ConvNet model that relies on recurrence for learning spatio-temporal features at each network level. During the feature extraction phase, RCNs exploit both 2D convolutions across the spatial domain and 1D convolutions across the temporal domain. 4. Random Connectivity Long Short-Term Memory (RCLSTM) [Hua et al., 2018]: an updated version of LSTM in which the neurons are connected in a stochastic manner, rather than fully connected. In our case, the LSTM cell is used as a bottleneck in Resnet50 for learning the features sequentially. 5. Random Connectivity Gated Recurrent Unit (RCGRU) [Hua et al., 2018]: an alternative version of RCLSTM where the GRU cell is used instead of the LSTM one.

Table 3: Constraint statistics. Π_n is the set of constraints r in Π with |r| = n, i.e., with n positive and negative labels; C̄ = {¬A : A ∈ C}. Each row shows the number of rules r with |r| = n, and the average number of negative and positive labels in such rules.

Table 4: f-mAP@0.5 (top table) and f-mAP@0.75 (bottom table) for (i) the current SOTA models; (ii) the CL models (in parentheses, the value of α); (iii) the CO models (in parentheses, the threshold used to evaluate the admissibility of the predictions); and (iv) the CLCO models (in parentheses, the threshold used to evaluate the admissibility of the predictions). P, G, and L stand for the Product, Gödel, and Łukasiewicz t-norms, respectively.