
1 Introduction

Today, “average precision” (AP) is the de facto standard for performance evaluation in object detection competitions [8, 14, 28] and in studies on still-image object detection [6, 13, 16, 24], video object detection [9, 12, 36] and online video object detection [17, 34]. Not only does AP enjoy this vast acceptance, it also appears to be unchallenged: except for a small number of papers that perform ablation studies [13, 24], AP appears to be the sole criterion used to compare object detection methods.

Fig. 1. Three different object detection results (for an image from ILSVRC [28]) with very different RP curves but the same AP. AP is unable to identify the difference between these curves. (a, b, c) Red, blue and green colors denote ground truth, true positives and false positives, respectively. Numbers are detection confidence scores. (d, e, f) RP curves, AP and LRP results for the corresponding detections in (a, b, c). Red crosses indicate Optimal LRP points. (Color figure online)

Despite its popularity, AP has certain deficiencies. First, AP cannot distinguish between very different RP curves: In Fig. 1, we present the detection results of three hypothetical object detectors. The detector in (a) detects only half of the objects but with full precision; it is a low-recall-high-precision detector. In contrast, the detector in (b) detects all objects; however, for each correct detection it also produces a close-to-duplicate detection that escapes non-maxima suppression. Hence, detector (b) is a high-recall-low-precision detector. The detector in (c) is in between; it has higher precision at lower recall and vice versa. Despite their very different characteristics, the APs of these detectors are exactly the same (AP = 0.5). One needs to inspect the RP curves in order to understand the differences in behavior, which can be time-consuming and impractical with a large number of classes, as in the ImageNet object detection challenge [28] with 200 classes.

Another deficiency of AP is that it does not explicitly include localization accuracy: one cannot infer from AP the tightness of the bounding box detections. Nevertheless, since extracting tighter bounding boxes is a desired property, nearly every paper on the topic discusses the issue, mostly qualitatively [6, 9, 16, 17, 24] and some quantitatively by computing AP scores for different intersection-over-union (IoU) thresholds [13, 16, 24]. However, this quantitative approach does not directly measure localization accuracy either, and in the qualitative approach the displayed sample boxes are likely to be limited and biased. We discuss other, less severe deficiencies of AP in Sect. 3.

A desirable performance metric is expected to include all of the factors related to performance. In object detection, the three most important factors are (i) the localization accuracy of the true positives (TP), (ii) the false positive (FP) rate and (iii) the false negative (FN) rate. Being able to evaluate a detector based on these factors is another desirable property for a performance measure since it can reveal directions for improvement. Furthermore, a performance metric should reveal the RP characteristics of a detector (as LRP achieves in Fig. 1). This ability would benefit certain applications. For instance, using a high-precision detector while initializing the tracker, known as tracking by detection, is common in visual tracking methods [3, 4, 31, 32, 37], since faster response times are required. Also, in online video object detection, the current approach is to use a still-image object detector with a general threshold (e.g., Association-LSTM [17] uses SSD [16] detections with confidence scores above 0.8). A desirable performance measure should help in setting an optimal confidence score threshold per class.

In this paper, we propose a new metric called the “Localization-Recall-Precision Error” (LRP, for short). LRP involves components closely related to precision, recall and IoU, and each parametrization of LRP corresponds to a point on the RP curve. We propose the “Optimal LRP”, the minimum achievable LRP error, as an alternative performance metric to AP. Optimal LRP alleviates the drawbacks of AP, represents the tightness of the bounding boxes and the shape of the RP curve via its components, and is more suitable for ablation studies. Finally, based on Optimal LRP, a confidence score thresholding method is proposed to decrease the number of detections in an optimal manner. Our extensive experiments confirm that LRP is a highly capable metric for comparing object detectors thoroughly.

2 Related Work

Information Theoretic Performance Measures: Several performance measures have been derived from the confusion matrix. Among them, the most relevant one is the F-measure [25], defined as the harmonic mean of precision and recall. However, the F-measure violates the triangle inequality and is therefore not suitable as a metric [20]; it is also not symmetric in the positive and negative classes. These violations, together with its inability to measure bounding box tightness, prevent its use for consistent comparison among detectors. Moreover, [5] points out that, except for accuracy, all information theoretic measures have undefined intervals. For example, the F-measure is undefined when the number of TPs is 0 even if there are detections. AP is an information theoretic measure, too, with the deficiencies discussed in Sects. 1 and 3.

Point Multi-target Tracking Performance Metrics: Object detection is very similar to the multi-target tracking problem. In both problems, there are multiple instances to detect, and localization, FN and FP rates are common criteria for success. Currently, component-based performance metrics are the accepted way of evaluating point multi-target tracking filters. The first metric to combine localization and cardinality (both FP and FN) errors was the Optimal Subpattern Assignment (OSPA) [29]. Following OSPA, several measures and metrics have been proposed as its variants [19, 23, 26, 27, 29, 30, 35]. Similarly, the CLEAR multi-object tracking metrics [1] consider only the FP and mismatch rates while ignoring the localization error. However, similar measures and metrics are lacking in the object detection literature, even though similar performance evaluation problems arise there.

Setting the Thresholds of the Classifiers: Research on optimizing a precision-recall balanced performance measure is mostly concentrated around the F-measure. [7] considers maximizing the F-measure at inference time using plug-in rules, while [18, 33] maximize it during training for support vector machines and conditional random fields. Similarly, [15] aims to find optimal thresholds for a probabilistic classifier by maximizing the F-measure. Finally, [21] presents a theoretical analysis of F-measure optimization, which also confirms the threshold-F-measure relationship depicted in [15, 22].

In summary, existing methods mostly focus on the F-measure for optimizing classifier thresholds, which, however, has the aforementioned drawbacks. Moreover, the F-measure is shown to be concave with respect to its inputs, the numbers of TPs and FPs [15], which makes analytical optimization impossible. In addition, none of these studies considers the object detection problem in particular, so no localization error is directly included in these measures. Therefore, different from previous work, we are specifically interested in performance evaluation and optimal thresholding of deep object detectors. Moreover, we directly optimize a well-behaved function over a domain that is small in practice in order to identify the class-specific thresholds.

3 Average Precision: An Analysis and Its Deficiencies

Due to space constraints, we omit the definition of AP and refer the reader to the accompanying supplementary material or [8]. There are minor differences in how AP is computed in practice. For example, AP is computed by integrating over 11 points (which divide the recall domain into equal pieces) in the PASCAL VOC 2007 challenge [8], whereas MS COCO [14] uses 101 points. Precision values at intermediate points are interpolated to prevent wiggles in the curve, by setting each to the maximum precision obtained at any higher recall. While a single intersection-over-union (IoU) threshold of 0.5 is used in PASCAL VOC [8], a range of IoU thresholds (from 0.5 to 0.95) is used in MS COCO; the AP averaged over this range of IoU thresholds is also called mAP.
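For reference, the sketch below shows one minimal way to compute interpolated AP for a single class at a fixed IoU threshold (101 recall points as in MS COCO; function and variable names are illustrative and not taken from any official evaluation toolkit):

```python
import numpy as np

def interpolated_ap(scores, is_tp, num_gt, recall_points=101):
    """Interpolated AP for one class at a fixed IoU threshold.

    scores : confidence scores of all detections of this class
    is_tp  : boolean flags, True if the detection was matched to a ground truth
    num_gt : number of ground-truth boxes of this class
    """
    order = np.argsort(-np.asarray(scores))            # sort detections by score, descending
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / np.maximum(cum_tp + cum_fp, 1e-12)

    ap = 0.0
    for r in np.linspace(0.0, 1.0, recall_points):      # 101 points as in MS COCO, 11 in VOC 2007
        mask = recall >= r
        # interpolation: precision at recall r is the max precision at any recall >= r
        ap += (precision[mask].max() if mask.any() else 0.0) / recall_points
    return ap
```

Note that the confidence scores enter only through the sort; this underlies the score-insensitivity of AP discussed below.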

AP aims to evaluate the precision of the detector over the entire recall domain. Thus, it favors methods that maintain precision over the entire recall domain rather than detectors whose RP curves are nearer to the top-right corner. In other words, AP compares the overall capability of detectors rather than their maximum performance. The two most important deficiencies of AP are discussed in Sect. 1. In the following, we list other, more minor deficiencies.

AP is not Confidence-Score Sensitive. Since calculating AP requires only the sorted list of detections, a detector producing scores in a narrow interval still yields the same AP. For example, consider only two detections (with the same confidence score) for the four ground truths in Fig. 1; setting this confidence score to any value (e.g., 0.01) leads to the same AP as long as the ordering is preserved.

AP does not suggest a confidence score threshold for the best setting of the object detector. However, in a practical application, detections usually need to be filtered owing to time limitations. For example, the state-of-the-art online object detector [17] applies a confidence score threshold of 0.8 on the SSD method [16] and obtains 12 fps in this fashion.

AP uses interpolation between neighboring recall values, which is especially problematic for classes with very few instances. For example, the “toaster” class of [14] has only 9 instances in the validation 2017 set.

4 Localization-Recall-Precision (LRP) Error

Let X be the set of ground truth boxes and Y be the set of boxes returned by an object detector. To compute \(\mathrm {LRP}(X,Y_s)\), the LRP error of \(Y_s\) against X at a given score threshold s (\(0 \le s \le 1\)) and IoU threshold \(\tau \) (\(0 \le \tau <1\)), we first construct \(Y_s\), the set of detections with confidence scores larger than s, and assign the detections in \(Y_s\) to the ground-truth boxes in X, as done for AP. Once the assignments are made, the following values are computed: (i) \(N_{TP}\), the number of true positives; (ii) \(N_{FP}\), the number of false positives; (iii) \(N_{FN}\), the number of false negatives. Using these quantities, the LRP error is:

$$\begin{aligned} \mathrm {LRP}(X,Y_s) := \frac{1}{Z} \left( w_{IoU} \mathrm {LRP}_{IoU}(X,Y_s)+ w_{FP} \mathrm {LRP}_{FP}(X,Y_s) + w_{FN} \mathrm {LRP}_{FN}(X,Y_s) \right) , \end{aligned}$$
(1)

where \(Z=N_{TP}+N_{FP} +N_{FN}\) is the normalization constant; and the weights \(w_{IoU}=\frac{N_{TP}}{1-\tau }\), \(w_{FP}=|Y_s|\), and \(w_{FN}=|X|\) control the contributions of the terms. The weights make each component easy to interpret, provide further information about the detector and prevent the total error from being undefined whenever the denominator of a single component is 0. \(\mathrm {LRP}_{IoU}\) represents the IoU tightness of valid detections as follows:

$$\begin{aligned} \mathrm {LRP}_{IoU}(X,Y_s):=\frac{1}{N_{TP}}\sum \limits _{i=1}^{N_{TP}} (1-IoU(x_i, y_{x_i})), \end{aligned}$$
(2)

which measures the mean bounding box localization error resulting from correct detections. Another interpretation is that \(1-\mathrm {LRP}_{IoU}(X,Y_s)\) is the average IoU of the valid detections.
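The quantities above require the detection-to-ground-truth assignment mentioned at the beginning of this section. The sketch below shows one common greedy variant of that assignment; names are illustrative and boxes are assumed to be in (x1, y1, x2, y2) format.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def match_detections(gt_boxes, detections, tau=0.5):
    """Greedily assign detections (box, score) to ground-truth boxes:
    detections are processed in decreasing score order and each is matched to
    the unmatched ground truth with the highest IoU >= tau.
    Returns the IoUs of the TPs, N_FP and N_FN."""
    unmatched = set(range(len(gt_boxes)))
    tp_ious, n_fp = [], 0
    for box, _score in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_gt = 0.0, None
        for g in unmatched:
            o = iou(box, gt_boxes[g])
            if o >= tau and o > best_iou:
                best_iou, best_gt = o, g
        if best_gt is None:
            n_fp += 1                      # no ground truth left with IoU >= tau
        else:
            unmatched.remove(best_gt)
            tp_ious.append(best_iou)
    return tp_ious, n_fp, len(unmatched)   # TP IoUs, N_FP, N_FN
```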

The second component, \(\mathrm {LRP}_{FP}\), in Eq. 1 measures the false-positives:

$$\begin{aligned} \mathrm {LRP}_{FP}(X,Y_s):= 1-Precision=1- \frac{N_{TP}}{|Y_s|}=\frac{N_{FP}}{|Y_s|}, \end{aligned}$$
(3)

and false negatives are measured by \(\mathrm {LRP}_{FN}\):

$$\begin{aligned} \mathrm {LRP}_{FN}(X,Y_s):= 1-Recall=1- \frac{N_{TP}}{|X|}=\frac{N_{FN}}{|X|}. \end{aligned}$$
(4)

The FP and FN components together represent the precision and recall of the corresponding \(Y_s\) via \(1-\mathrm {LRP}_{FP}(X,Y_s)\) and \(1-\mathrm {LRP}_{FN}(X,Y_s)\), respectively. Denoting the IoU between \(x_i \in X\) and its assigned detection \(y_{x_i} \in Y_s\) by \(IoU(x_i, y_{x_i})\), the LRP error can be equivalently written in a more compact form as:

$$\begin{aligned} \mathrm {LRP}(X,Y_s):= \frac{1}{{N_{TP}+N_{FP} +N_{FN}}}\left( \sum \limits _{i=1}^{N_{TP}} \frac{1-IoU(x_i, y_{x_i})}{1-\tau }+N_{FP} +N_{FN} \right) . \end{aligned}$$
(5)

\(\mathrm {LRP}\) penalizes each TP by its localization error, normalized by \(1-\tau \) into the [0, 1] interval, and each FP and FN by 1, the upper bound of the penalty. This total error is averaged over the number of its contributors, i.e., \(N_{TP}+N_{FP} +N_{FN}\). With this normalization, \(\mathrm {LRP}\) yields a value in the [0, 1] interval representing the average error per bounding box, where each component contributes equally to the error. When necessary, the relative importance of the IoU, FP and FN terms can be changed for different applications. To this end, the prominent component can be multiplied by a factor (say C) both in the numerator and the denominator [19], which amounts to counting C artificial errors for each error of the prominent type.
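To make the penalty structure concrete, the following sketch evaluates Eq. (5) together with the components of Eqs. (2)-(4), using the output of the matching sketch above (names are again illustrative):

```python
def lrp_error(tp_ious, n_fp, n_fn, tau=0.5):
    """LRP error of Eq. (5) and its components (Eqs. 2-4), computed from the
    output of match_detections above."""
    n_tp = len(tp_ious)
    z = n_tp + n_fp + n_fn
    if z == 0:
        return None                                        # undefined: nothing to evaluate
    loc_error = sum(1.0 - o for o in tp_ious)              # summed (1 - IoU) over the TPs
    total = (loc_error / (1.0 - tau) + n_fp + n_fn) / z    # Eq. (5)
    lrp_iou = loc_error / n_tp if n_tp else None           # Eq. (2)
    lrp_fp = n_fp / (n_tp + n_fp) if n_tp + n_fp else None # Eq. (3): 1 - precision
    lrp_fn = n_fn / (n_tp + n_fn) if n_tp + n_fn else None # Eq. (4): 1 - recall
    return total, lrp_iou, lrp_fp, lrp_fn
```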

Overall, the total error and its components range over [0, 1], and lower values imply better performance. At the extremes, \(\mathrm {LRP}=0\) means that every ground truth item is detected with perfect localization, and \(\mathrm {LRP}=1\) means that no valid detection matches the ground truth (i.e., \(|Y_s|=N_{FP}\)). \(\mathrm {LRP}\) is undefined only when the ground truth and detection sets are both empty (i.e., \(N_{TP}+N_{FP} +N_{FN}=0\)), that is, when there is nothing to evaluate.

As for the parameters, s is the confidence score threshold and \(\tau \) is the IoU threshold. Since the RP pair is directly identified by the FP and FN components, each detection set \(Y_s\) corresponds to a specific point on the RP curve; decreasing s therefore corresponds to moving along the RP curve in the direction of increasing recall. \(\tau \) defines the minimum overlap for a detection to be validated as a TP; in other words, a higher \(\tau \) requires tighter bounding boxes. Overall, both parameters are related to the RP curve: a \(\tau \) value sets the RP curve and an s value selects the point on that curve at which the LRP error is evaluated.

In the supplementary material, we prove that LRP is a metric.

5 Optimal LRP (oLRP) Error: The Performance Metric and Thresholder

Optimal LRP (oLRP) is defined as the minimum achievable LRP error with \(\tau =0.5\), which makes \(\mathrm {oLRP}\) parameter independent:

$$\begin{aligned} \mathrm {oLRP}:= \min _s \mathrm {LRP}(X,Y_s). \end{aligned}$$
(6)

For ablation studies and practical requirements, different \(\tau \) values can be adopted. In such cases, \(\mathrm {oLRP}@\tau \) can be used to denote the Optimal LRP error at \(\tau \).

oLRP searches over the confidence scores to find the best balance among the competing precision, recall and IoU terms. The RP point identified by oLRP lies on the top-right part of the RP curve, where the optimal balanced setting resides. We call one RP curve sharper than another if its peak on the top-right part is nearer to the (1, 1) RP pair. To illustrate, the RP curves in Fig. 1(d) and (e) are sharper than the one in Fig. 1(f).

The components of \(\mathrm {oLRP}\) are called the optimal box localization (\(\mathrm {oLRP}_{IoU}\)), optimal FP (\(\mathrm {oLRP}_{FP}\)) and optimal FN (\(\mathrm {oLRP}_{FN}\)) components. \(\mathrm {oLRP}_{IoU}\) describes the average tightness for a class, while \(\mathrm {oLRP}_{FP}\) and \(\mathrm {oLRP}_{FN}\) together pertain to the sharpness of the curve, since the corresponding RP pair is the maximum achievable performance of the detector for this class. One can directly pinpoint this sharpness point via \(1-\mathrm {oLRP}_{FP}\) and \(1-\mathrm {oLRP}_{FN}\). Overall, different from AP, oLRP aims to find the best class-specific setting of the detector, and it favors sharper curves that also exhibit better BB tightness.

Denoting oLRP error of class \(c \in C\) by \(\mathrm {oLRP}_c\), Mean Optimal LRP (moLRP) is defined as follows:

$$\begin{aligned} {\mathrm {moLRP}}:= \frac{1}{|C|} \sum \limits _{c\in {C}} \mathrm {oLRP}_c. \end{aligned}$$
(7)

As with mAP, \({\mathrm {moLRP}}\) is the performance metric for the entire detector. The mean optimal box localization, FP and FN components, denoted by \(\mathrm {moLRP}_{IoU}\), \(\mathrm {moLRP}_{FP}\) and \(\mathrm {moLRP}_{FN}\) respectively, are similarly defined as the means of the class-specific components. Different from the components of oLRP, the mean optimal FP and FN components do not necessarily lie on the average of the RP curves of all classes, because the \(\mathrm {oLRP}_{FP}\) (i.e., precision-related) values are averaged across classes with different \(\mathrm {oLRP}_{FN}\) (i.e., recall-related) values; nevertheless, they still provide information on the sharpness of the RP curves, as shown in the experiments.

Owing to its filtering capability, \(\mathrm {oLRP}\) can also be used for thresholding purposes. If a method needs a still-image object detector as its backbone and processing must be completed within a limited time budget, then only a small subset of the detections should be kept. For such methods, using a single overall confidence score threshold for the object detector is the common approach [17]; oLRP instead identifies the best class-specific confidence score thresholds. One possible drawback of this method is that the validated detections may still be too numerous to be processed within the desired time; in that case, higher confidence scores can be set by accepting larger LRP errors, again in a class-specific manner. A second practical use of oLRP concerns deploying the devised object detector on a platform where confidence scores are to be discarded for user-friendliness; in such a case, one sets the \(\tau \) threshold according to the application requirements while optimizing for the best confidence score.

In essence, calculating \(\mathrm {oLRP}\) is an optimization problem. However, thanks to the small search space, we simply discretize the s domain into 0.01-spaced intervals and search exhaustively over this limited set.
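A minimal sketch of this search, reusing the matching and LRP routines sketched in Sect. 4 (names and data layout are illustrative):

```python
import numpy as np

def optimal_lrp(gt_boxes, detections, tau=0.5):
    """Exhaustive search over the confidence threshold s in 0.01 steps;
    returns the minimum LRP error, the optimal threshold s*, and the
    components (oLRP_IoU, oLRP_FP, oLRP_FN) at that threshold."""
    best_err, best_s, best_comp = float("inf"), None, None
    for s in np.arange(0.0, 1.0, 0.01):
        y_s = [d for d in detections if d[1] > s]          # Y_s for this threshold
        tp_ious, n_fp, n_fn = match_detections(gt_boxes, y_s, tau)
        result = lrp_error(tp_ious, n_fp, n_fn, tau)
        if result is not None and result[0] < best_err:
            best_err, best_s, best_comp = result[0], s, result[1:]
    return best_err, best_s, best_comp
```

moLRP (Eq. 7) then follows by averaging the per-class oLRP values (and, analogously, their components) over the classes.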

6 Experimental Evaluation

In this section, we analyze the parameters of LRP, demonstrate its discriminative power on common object detectors and finally show that class-specific thresholds increase the performance of a simple online video object detector.

Evaluated Object Detectors: We evaluate commonly used deep object detectors; namely, Faster R-CNN, RetinaNet, and SSD. For Faster R-CNN and RetinaNet variants, we use the models by [11], and for SSD variants, the models of [10]. For the variants, R50, R101 and X101 refer to the ResNet-50, ResNet-101 and ResNeXt-101 backbones respectively, and FPN to the feature pyramid network. All models are tested on the MS COCO 2017 validation set, comprising 80 classes and 5000 images.

Fig. 2. For each class, the LRP components and total error of Faster R-CNN (X101+FPN) are plotted against s. The optimal confidence scores are marked with crosses.

6.1 Analyzing Parameters s and \(\tau \)

Using the Faster R-CNN (X101+FPN) results of the first 10 classes and the mean error for clarity, the effects of s and \(\tau \) are analyzed in Figs. 2 and 3. We observe that the box localization component is not significantly affected by increasing s, except for large s, where the error slightly decreases since the remaining results tend to be more “confident”. The FP and FN components behave inversely to precision and recall respectively, as expected; therefore, lower curves imply better performance for these components. Finally, the total error (oLRP) has a second-order shape. Since the localization error is not affected significantly by s, the behavior of the total error is mainly determined by the FP and FN components, so the global minimum of the total error occurs at a good precision-recall balance.

Fig. 3. For each class, oLRP and its components for Faster R-CNN (X101+FPN) are plotted against \(\tau \). The mean represents the mean over 80 classes.

In Fig. 3, oLRP and moLRP are plotted against different \(\tau \) values. As expected, larger \(\tau \) values yield a lower box localization component (\(\mathrm {oLRP_{IoU}}\)). On the other hand, increasing \(\tau \) causes the FP and FN components to grow rapidly, leading to a higher total error (oLRP). This is intuitive since at the extreme case, i.e., when \(\tau =1\), there are hardly any valid detections and all detections are false positives, which makes oLRP approximately 1. Therefore, oLRP allows measuring the performance of a detector designed for an application requiring a different \(\tau \), while also providing additional information. In addition, investigating oLRP for different \(\tau \) values is a useful extension for ablation studies.

Table 1. Performance comparison of common object detectors. R50, R101 and X101 represent the ResNet-50, ResNet-101 and ResNeXt-101 backbone networks, respectively, and FPN refers to the feature pyramid network. \(s^*_{min}\) and \(s^*_{max}\) denote the minimum and maximum class-specific thresholds for oLRP. Note that unlike AP, lower scores are better for LRP.

6.2 Evaluating Common Image Object Detectors

General Overview: Table 1 compares the detectors using mAP (COCO’s standard metric), mAP@0.5, moLRP and the class-specific threshold ranges. We observe that the moLRP values are indicative of the known performances of the detectors. For each type of detector, every added property (i.e., including FPN, increasing depth, using ResNeXt for Faster R-CNN and RetinaNet, increasing the input size to 512 for SSD) decreases moLRP, as expected. Moreover, the overall ordering is consistent with mAP except for RetinaNet (X101+FPN) and Faster R-CNN (R101+FPN), which are equal in terms of mAP; however, Faster R-CNN (R101+FPN) surpasses RetinaNet (X101+FPN) in terms of moLRP, which is discussed below. Note that the \(\mathrm {moLRP_{FP}}\) and \(\mathrm {moLRP_{FN}}\) values in Table 1 are also consistent with the sharpness of the RP curves of the methods presented in Fig. 4. To illustrate, Faster R-CNN (X101+FPN) has the best \(\mathrm {moLRP_{FP}}\), \(\mathrm {moLRP_{FN}}\) combination, corresponding to the sharpest RP curve. Another interesting example concerns the RetinaNet (X101+FPN) and Faster R-CNN (R50+FPN) curves: for these methods, the \(\mathrm {moLRP_{FP}}\) and \(\mathrm {moLRP_{FN}}\) comparison slightly favors Faster R-CNN (R50+FPN), which is justified by their RP curves in Fig. 4.

Fig. 4. Average RP curves of the common detectors.

Class-Based Comparison and Interpreting the Components: We now analyze oLRP on a class basis and look at the individual components to better understand the characteristics of the methods – see Fig. 5. For all three classes, oLRP is attained at the RP pairs where there is a sharp precision decrease on the top-right part of the curve; intuitively, these pairs provide a good balance between precision and recall. Considering the FP and FN components, one can infer the structure of the curve. For all methods, the “zebra” class has the sharpest RP curves, which correspond to lower FP and FN error values. For example, Faster R-CNN has FP and FN errors of 0.069 and 0.188, respectively; thus, without looking at the curve, one can tell that its peak resides at \(1-0.069=0.931\) precision and \(1-0.188=0.812\) recall. For the “broccoli” curve, a less sharp one, the optimal point is at \(1-0.498=0.502\) precision and \(1-0.484=0.516\) recall. As in the “zebra” example, these values pinpoint the peak of the curve, this time around the center of the RP range. The localization component (\(\mathrm {oLRP_{IoU}}\)) shows that the tightness of the boxes for the “bus” class is better than that of “zebra” for all detectors, even though “zebra” has a sharper RP curve. For RetinaNet, the average IoU is \(1-0.106=0.894\) and \(1-0.122=0.878\) for the “bus” and “zebra” classes respectively. This analysis also shows that it is easy to compare the tightness of the boxes across methods and classes.

Fig. 5. Example RP curves with the optimal configurations marked with crosses. The curves are drawn for \(\tau =0.5\). The tables in the figures report the performance of the methods with respect to AP and moLRP; the rows correspond to SSD-512, RetinaNet (X101+FPN) and Faster R-CNN (R101+FPN) respectively. Note that unlike AP, lower scores are better for LRP.

Same mAP but Different Behaviors, Faster R-CNN vs. RetinaNet: We now compare two detectors with equal mAP in order to identify their characteristics using the components of moLRP: RetinaNet (X101+FPN), a single-shot detector, and Faster R-CNN (R101+FPN), a two-stage detector. Firstly, we use the box localization component (\(\mathrm {moLRP_{IoU}}\)) in Table 1 to discriminate between these two detectors. The standard MS COCO metric aims to include the localization error by averaging over 10 mAP values. Since the two detectors differ by \(1.8\%\) in mAP@0.5 despite having equal mAP, one can infer that RetinaNet produces tighter boxes; however, this inference is only possible by examining all 10 mAP results one by one, and it still does not quantify the tightness. In contrast, \(\mathrm {moLRP_{IoU}}\) directly shows that, among all detectors in Table 1, RetinaNet (X101+FPN) produces the tightest bounding boxes, with an average tightness of \(1-0.161=0.839\) in IoU terms.

Secondly, we compare the sharpness of the same two detectors, which evidently differ (Fig. 4). RetinaNet (X101+FPN) produces 486,108 bounding boxes for 36,781 annotations, whereas Faster R-CNN (R101+FPN) yields only 127,039 thanks to its RPN. For RetinaNet, the confidence scores of \(57\%\) of the detections are under 0.1 and \(87\%\) are under 0.25 (these values are \(29\%\) and \(56\%\) for Faster R-CNN), which generally causes RetinaNet to have lower or equal precision than Faster R-CNN throughout the recall domain except for the tail of the RP curve. In the tail, owing to its large number of detections, RetinaNet retains some precision even after that of Faster R-CNN drops to 0. Figure 5 illustrates this phenomenon, best observed in the “zebra” curve: even though RetinaNet has a higher AP than Faster R-CNN (0.899 vs. 0.880), this AP difference originates from the large number of RetinaNet detections, which produce the better tail of the RP curve. This shallow-curve-longer-tail phenomenon is observed, to varying degrees, for more than 50 classes, including the ones in Fig. 6. On the other hand, oLRP and thus moLRP do not favor such detectors but the sharper ones, as shown in Fig. 5, which is why Faster R-CNN (R101+FPN) has a lower Optimal LRP error for the “zebra” class.

Overall, even though RetinaNet has the best bounding box localization, Faster R-CNN (R101+FPN), with the same mAP, has a lower mean oLRP error. Moreover, considering the RP curves of these variants, Faster R-CNN is sharper than RetinaNet, as shown in Fig. 4. This is also confirmed by the components: \(\mathrm {moLRP_{FP}}\) is nearly equal while \(\mathrm {moLRP_{FN}}\) differs in favor of Faster R-CNN. Similarly, both \(\mathrm {moLRP_{FP}}\) and \(\mathrm {moLRP_{FN}}\) of RetinaNet (R50+FPN) are greater than those of Faster R-CNN (R50) due to the same shallow-curve-longer-tail phenomenon, which prevents its RP curves from being sharper. Again, what makes RetinaNet (R50+FPN) perform better with respect to both mAP and moLRP is its strength in producing tight bounding boxes, as shown in Table 1.

6.3 Better Threshold, Better Performance

In this experiment, we demonstrate a use case where oLRP helps us set class-specific optimal thresholds as an alternative to the naive approach of using one general threshold for all classes. To this end, we developed a simple online video object detection framework built on an off-the-shelf still-image object detector (RetinaNet-50 [13] trained on MS COCO [14]) and used it in three different versions. The first version, denoted by B, uses the still-image object detector to process each frame of the video independently. The second and third versions, denoted by G and S respectively, also process each frame with the still-image detector and, in addition, link bounding boxes across subsequent frames using the Hungarian matching algorithm [2] and update the scores of these linked boxes using a simple Bayesian rule (details of this simple online video object detector are given in the supplementary material; a sketch of the linking step is shown below). The only difference between G and S is that while G uses a validated threshold of 0.5 (see \(s^*\) of B in Table 2 and Fig. 1 in the supplementary material for validation) as the confidence score threshold for all classes, S uses the class-specific optimal threshold that achieves the oLRP error. We test these three detectors on 346 videos of the ImageNet VID validation set [28] for the 15 object classes that are also included in MS COCO.
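The sketch below only illustrates the frame-to-frame linking step with the Hungarian algorithm (via SciPy) on an IoU-based cost; the gating value and the score update shown are illustrative placeholders, not the Bayesian rule used by the detector, which is detailed in the supplementary material. The iou() helper from Sect. 4 is reused.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment  # Hungarian algorithm

def link_frames(prev_dets, cur_dets, iou_gate=0.3):
    """Link detections of consecutive frames (lists of (box, score) for one
    class) by Hungarian matching on an IoU-based cost. iou_gate is an assumed
    gating value, not a number from the paper."""
    if not prev_dets or not cur_dets:
        return []
    cost = np.ones((len(prev_dets), len(cur_dets)))
    for i, (pb, _) in enumerate(prev_dets):
        for j, (cb, _) in enumerate(cur_dets):
            cost[i, j] = 1.0 - iou(pb, cb)        # low cost for overlapping boxes
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] <= 1.0 - iou_gate]

def propagate_scores(prev_dets, cur_dets, links):
    """Placeholder score update for linked boxes; the detector's actual
    Bayesian update is described in the supplementary material."""
    updated = [list(d) for d in cur_dets]
    for i, j in links:
        updated[j][1] = max(updated[j][1],
                            0.5 * (prev_dets[i][1] + updated[j][1]))
    return [(box, score) for box, score in updated]
```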

Fig. 6. Example RP curves of the methods. Optimal RP pairs are marked with crosses.

AP vs. oLRP: We compare G with B in order to illustrate the different evaluation perspectives of AP and oLRP – see Fig. 6 and Table 2. B is a conventional object detector with conventional RP curves, as illustrated in Fig. 6. On the other hand, in order to be faster, G discards some of the detections, causing its maximum recall to be lower than that of B. This limited recall range is heavily penalized in the AP evaluation: quantitatively, B surpasses G by \(7.5\%\) AP. On the other hand, despite its limited recall coverage, G obtains higher precision than B, especially toward the end of its RP curve. To illustrate, for the “boat” class in Fig. 6, G has significantly better precision between approximately 0.5 and 0.9 recall even though its AP is lower by \(6\%\). Since oLRP compares methods at their best configurations (i.e., the peaks of their RP curves), this difference is clearly captured by their oLRP errors, in which G surpasses B by \(4.1\%\). Furthermore, the superiority of G lies in its higher precision: the FN components of G and B are very close, while the FP component of G is \(8.6\%\) better, which is exactly the difference between the precisions at the peaks of their RP curves.

Therefore, while G seems to perform much worse in terms of AP, for 12 classes G reaches better peaks than B, as shown by the oLRP values in Table 2. This suggests that oLRP is better than AP at capturing the performance details of the methods.

Table 2. Comparison among B, G and S with respect to AP, oLRP and their best class-specific configurations. The mean class threshold of S is marked as N/A since its thresholds are set per class and their mean is not used. Note that unlike AP, lower scores are better for LRP.

Effect of the Class-Specific Thresholds: Compared to G, owing to the class-specific thresholds, S has \(2.3\%\) better mAP and \(0.6\%\) better moLRP, as shown in Table 2. However, since the mean is dominated by \(s^*\) values around 0.5, it is more informative to focus on classes with low or high \(s^*\) values in order to grasp the effect of the approach. The “bus” class has the lowest \(s^*\) with 0.27; for this class, S surpasses G by \(8.7\%\) in AP and \(4.1\%\) in oLRP. This performance increase is also observed for other classes with very low thresholds, such as “airplane”, “bicycle” and “zebra”. For these classes, the effect of the class-specific threshold on the RP curve is to stretch the curve along the recall axis (possibly by accepting some loss in precision), as shown for the “bus” example in Fig. 6. Not surprisingly, “cow” is one of the two classes for which the AP of S is lower, since its threshold is the highest, thereby limiting recall further. On the other hand, regarding oLRP, the result is not worse since this time the RP curve is stretched toward higher precision, as shown in Fig. 6, allowing lower FP error. Thus, whether the threshold is lower or higher, the thresholding method aims to discover the best RP curve. There are four classes in total for which G is better than S in terms of oLRP; however, the maximum difference is \(0.2\%\) in oLRP and these are classes with thresholds around 0.5. These observations suggest that choosing class-specific thresholds rather than a common general threshold increases the performance of the detector, especially for classes with low or high class-specific thresholds.
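In implementation terms, S differs from G only in the thresholding rule applied to the per-frame detections; a minimal sketch with a hypothetical data layout:

```python
def filter_by_class_threshold(dets_per_class, thresholds, default=0.5):
    """Keep detections above a class-specific threshold (detector S).
    Passing the same value for every class reproduces the single global
    threshold of detector G. Data layout is hypothetical:
    dets_per_class = {class_name: [(box, score), ...]},
    thresholds     = {class_name: s*} from the oLRP search."""
    return {cls: [d for d in dets if d[1] >= thresholds.get(cls, default)]
            for cls, dets in dets_per_class.items()}
```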

7 Conclusion

We introduced a novel performance metric, LRP, as an alternative to the dominantly used AP. LRP has a number of advantages over AP, which we demonstrated in the paper: (i) AP cannot distinguish between very different RP curves, whereas LRP, through its error components, provides a richer evaluation in terms of FP, FN and localization errors. (ii) AP does not have a localization component, and one needs to calculate AP@\(\tau \) with different \(\tau \) values; in contrast, LRP explicitly includes a localization error component (\(1- \mathrm {oLRP_{IoU}}\) gives the mean localization accuracy of a detector). (iii) There are many practical use cases where one needs to set a detection threshold in order to obtain detections to be used in a subsequent stage; Optimal LRP provides a practical solution to this problem, which we demonstrated for online video object detection.

Supplementary Material. Supplementary material contains a detailed definition of AP, a result on the distribution of confidence thresholds, a description of the online detector and the proof that LRP is a metric.