As the field of computer vision grows and matures, performance evaluation has become essential. Most sub-areas of computer vision now have established datasets and benchmarks that allow quantitative evaluation and comparison of current methods. In addition, new benchmarks often stimulate research into the particular challenges presented by the data. Conversely, important areas that lack high-quality datasets and benchmarks may not receive adequate attention from researchers.

The deep learning revolution has made datasets and performance evaluation even more important. Learning-based methods require not only large, well-designed training datasets but also well-defined loss functions, which are usually designed to optimize established performance measures. This creates an implicit bias based on the availability of datasets and the definition of performance metrics.

In this special issue we sought all types of contributions broadly relating to performance evaluation in computer vision, including:

  • Manuscripts introducing new performance evaluations or benchmarks, in particular in areas where quantitative evaluation is challenging or subjective

  • Manuscripts surveying, evaluating, or comparing existing benchmarks or datasets

  • Manuscripts addressing the pros and cons of performance evaluations and datasets, or offering recommendations for best practices

We received 38 initial submissions for this special issue. Of these, 30 were considered within scope and underwent a rigorous peer-review process. Each paper received at least three reviews, for a total of 148 reviews by experts in the field; we warmly thank all reviewers for their effort! In the end, we accepted 18 papers for publication, all of which went through at least one round of revisions.

The accepted papers span the following topics: datasets, benchmarks and evaluations, robustness, novel metrics, and bias in datasets.

Several papers introduce new benchmarks for various domains, including wide-baseline matching (Jin et al. 2021), non-rigid structure from motion (Jensen et al. 2021), single-object tracking (Fan et al. 2021), multi-object tracking (Dendorfer et al. 2021), deraining (Li et al. 2021), and inpainting of censored images (Black et al. 2021).

Other papers contribute novel datasets for action recognition (Weinzaepfel and Rogez 2021) and anomaly detection (Bergmann et al. 2021), as well as a technique for improved reference-pose generation for benchmarking visual localization methods (Zhang et al. 2021).

Two papers address the evaluation of robustness (Kamann and Rother 2021; Shekar et al. 2021).

Several papers contribute novel metrics and taxonomies for the evaluation of different tasks, including multi-object tracking (Luiten et al. 2021), semantic segmentation (Yan et al. 2021), visual place recognition (Zaffar et al. 2021), image quality assessment (Ding et al. 2021), conditional image generation (Benny et al. 2021), and embodied exploration (Ramakrishnan et al. 2021).

Finally, one paper addresses the problem of bias in facial datasets (Georgopoulos et al. 2021).

We thank all authors for their contributions to this special issue. We hope that the datasets, benchmarks, and metrics developed in the accepted papers, as well as the general methodologies for creating benchmarks, will be of value to the community.