1 Introduction

Due to the popularity of Internet-based video sharing services, the volume of video content on the Web has reached unprecedented scales. For instance, YouTube reports that more than 500 h of content are uploaded every minute.Footnote 1 This poses considerable challenges for all video analysis problems, such as video classification, action recognition, and video retrieval, which need to achieve high performance at low computational and storage cost in order to cope with data at this scale. The problem is particularly hard in the case of content-based video retrieval, where, given a query video, one needs to calculate its similarity with all videos in a database in order to retrieve and rank the videos by relevance. Such a scenario requires efficient indexing, i.e., storage of the representations extracted from the videos in the dataset, and fast calculation of the similarity between pairs of them.

Depending on whether the spatio-temporal structure of videos is stored/indexed and subsequently taken into consideration during similarity calculation, research efforts fall into two broad categories, namely coarse- and fine-grained approaches. Coarse-grained approaches address this problem by aggregating frame-level features into single video-level vector representations (that are estimated and stored at indexing time) and then calculating the similarity between them by using a simple function such as the dot-product or the Euclidean distance (at retrieval time). The video-level representations can be global vectors (Gao et al., 2017; Kordopatis-Zilos et al., 2017b; Lee et al., 2020), hash codes (Song et al., 2011, 2018; Yuan et al., 2020), Bag-of-Words (BoW) (Cai et al., 2011; Kordopatis-Zilos et al., 2017a; Liao et al., 2018), or concept annotations (Markatopoulou et al., 2017, 2018; Liang & Wang, 2020). These methods have very low storage requirements, allow rapid similarity estimation at query-time, but they exhibit low retrieval performance, since they disregard the spatial and temporal structure of the videos and are therefore vulnerable to clutter and irrelevant content. On the other hand, fine-grained approaches extract (and store at indexing time) and use in the similarity calculation (at retrieval time) representations that respect the spatio-temporal structure of the original video, i.e., they have a temporal or a spatio-temporal dimension/index. Typically, such methods consider the sequence of frames in the similarity calculation and align them, e.g., by using Dynamic Programming (Chou et al., 2015; Liu et al., 2017), Temporal Networks (Tan et al., 2009; Jiang & Wang, 2016), or Hough Voting (Douze et al., 2010; Jiang et al., 2014); or consider spatio-temporal video representation and matching based on Recurrent Neural Networks (RNN) (Feng et al., 2018; Bishay et al., 2019), Transformer-based architectures (Shao et al., 2021), or in the Fourier domain (Poullot et al., 2015; Baraldi et al., 2018). These approaches achieve high retrieval performance but at considerable computation and storage cost.

In an attempt to exploit the merits of both fine- and coarse-grained methods, some works combine them in a single framework (Wu et al., 2007; Chou et al., 2015; Liang & Wang, 2020), leading to methods that offer a trade-off between computational efficiency and retrieval performance. Typically, these approaches first rank videos with a coarse-grained method, in order to filter out the videos with similarity lower than a predefined threshold, and then re-rank the remaining ones based on the similarity calculated by a computationally expensive fine-grained method. However, setting the threshold is by no means a trivial task. In addition, in these approaches, both the coarse- and fine-grained components are typically built on hand-crafted features with traditional aggregations (e.g., BoW) and heuristic/non-learnable approaches for similarity calculation, which results in sub-optimal performance. We will be referring to such approaches as re-ranking methods.

Fig. 1

Performance of our proposed DnS framework and its variants for several dataset percentages sent for re-ranking (denoted in bold) evaluated on the DSVR task of FIVR-200K in terms of mAP, computational time per query in seconds, and storage space per video in megabytes (MB), in comparison to state-of-the-art methods. Coarse-grained methods are in blue, fine-grained in red, and re-ranking in orange (Color figure online)

Figure 1 illustrates the retrieval performance, time per query, and storage space per video of several methods from the previous categories. Fine-grained approaches achieve the best results but with a significant allocation of resources. On the other hand, coarse-grained approaches are very lightweight but with considerably lower retrieval performance. Finally, the proposed re-ranking method provides a good trade-off between accuracy and efficiency, achieving very competitive performance with low time and storage requirements.

Fig. 2

Overview of the proposed framework. It consists of three networks: a coarse-grained student \({\textbf {S}}^\mathrm{c}\), a fine-grained student \({\textbf {S}}^\mathrm{f}\), and a selector network \({\textbf {SN}}\). Processing is split into two phases, Indexing and Retrieval. During indexing (blue box), given a video database, three representations needed by our networks are extracted and stored in a video index, i.e., for each video, we extract a 3D tensor, a 1D vector, and a scalar that captures video self-similarity. During retrieval (red box), given a query video, we extract its features, which, along with the indexed ones, are processed by the \({\textbf {SN}}\). It first sends all the 1D vectors of query-target pairs to \({\textbf {S}}^\mathrm{c}\) for an initial similarity calculation. Then, based on the calculated similarity and the self-similarity of the videos, the selector network judges which query-target pairs have to be re-ranked with the \({\textbf {S}}^\mathrm{f}\), using the 3D video tensors. Straight lines indicate continuous flow, i.e., all videos/video pairs are processed, whereas dashed lines indicate conditional flow, i.e., only a number of selected videos/video pairs are processed. Our students are trained with Knowledge Distillation based on a fine-grained teacher network, and the selector network is trained based on the similarity difference between the two students (Color figure online)

Knowledge Distillation is a methodology in which a student network is trained to approximate the output of a teacher network, either on the labelled dataset on which the teacher was trained or on other, potentially larger, unlabelled ones. Depending on the student's architecture and the size of the dataset, different efficiency-performance trade-offs can be reached. These methods have been used extensively in the domain of image recognition (Yalniz et al., 2019; Touvron et al., 2020; Xie et al., 2020); in the domain of video analysis, however, they are limited to video classification methods (Bhardwaj et al., 2019; Garcia et al., 2018; Crasto et al., 2019; Stroud et al., 2020), typically performing distillation at feature level across different modalities. Those methods typically distill the features of a network stream operating on a computationally expensive modality (e.g., optical flow or depth) into the features of a cheaper modality (e.g., RGB images), so that only the latter need to be stored/extracted and processed at test time. This approach does not scale well to large datasets, as it requires storage or re-estimation of the intermediate features. Furthermore, current works arrive at fixed trade-offs of performance and computational/storage efficiency.

In this work, we propose to address the problem of high retrieval performance and computationally efficient content-based video retrieval in large-scale datasets. The proposed method builds on the framework of Knowledge Distillation, and starting from a well-performing, high-accuracy-high-complexity teacher, namely a fine-grained video similarity learning method (ViSiL) (Kordopatis-Zilos et al., 2019b), trains (a) both fine-grained and coarse-grained student networks on a large-scale unlabelled dataset and (b) a selection mechanism, i.e., a learnable re-ranking module, that decides whether the similarity estimated by the coarse-grained student is accurate enough, or whether the fine-grained student needs to be invoked. By contrast to other re-ranking methods that use a threshold on the similarity estimated by the fast network (the coarse-grained student in our case), our selection mechanism is a trainable, lightweight neural network. All networks are trained so as to extract representations that are stored/indexed, so that each video in the database is indexed by the fine-grained spatio-temporal representation (3D tensor), its global, vector-based representation (1D vector), and a scalar self-similarity measure that is extracted by the feature extractor of the selector network, and can be seen as a measure of the complexity of the videos in question. The latter is expected to be informative of how accurate the coarse-grained, video-level similarity is, and together with the similarity rapidly estimated by the coarse-grained representations, is used as input to the selector. We note that, by contrast to other Knowledge Distillation methods in videos that address classification problems and typically perform distillation at intermediate features, the students are trained on a similarity measure provided by the teacher—this allows training on large scale datasets as intermediate features of the networks do not need to be stored, or estimated multiple times. Due to the ability to train on large unlabeled datasets, more complex models, i.e., with more trainable parameters, can be employed leading to even better performance than the original teacher network. An overview of the proposed framework is illustrated in Fig. 2.

The main contributions of this paper can be summarized as follows:

  • We build a re-ranking framework based on a Knowledge Distillation scheme and a Selection Mechanism, which allows training our student and selector networks on large unlabelled datasets. We employ a teacher network that is very accurate but requires a lot of computational resources in order to train several student networks and selector networks, and we use them to achieve different performance-efficiency trade-offs.

  • We propose a selection mechanism that, given a pair of a fine- and a coarse-grained student, learns whether the similarity estimated by the fast, coarse-grained student is accurate enough, or whether the slow, fine-grained student needs to be invoked. To the best of our knowledge, we are the first to propose such a trainable selection scheme based on video similarity.

  • We propose two fine-grained and one coarse-grained student architectures. We develop: (i) a fine-grained attention student, using a more complex attention scheme than the teacher's, (ii) a fine-grained binarization student that extracts binarized features for the similarity calculation, and (iii) a coarse-grained attention student that exploits region-level information and the intra- and inter-video relations of frames for the aggregation.

  • We evaluate the proposed method on five publicly available datasets and compare it with several state-of-the-art methods. Our fine-grained student achieves state-of-the-art performance on two out of four datasets, and our DnS approach retains competitive performance with more than 20 times faster retrieval per query and 99% lower storage requirements compared to the teacher.

The remainder of the paper is organised as follows. In Sect. 2, the related literature is discussed. In Sect. 3, the proposed method is presented in detail. In Sect. 4, the datasets and implementation are presented. In Sect. 5, the results and ablation studies are reported. In Sect. 6, we draw our conclusions.

2 Related Work

This section gives an overview of some of the fundamental works that have contributed to content-based video retrieval and knowledge distillation.

2.1 Video Retrieval

Video retrieval methods can be roughly classified, based on the video representations and the similarity calculation processes they employ, into three categories: coarse-grained, fine-grained, and re-ranking approaches.

2.1.1 Coarse-Grained Approaches

Coarse-grained approaches represent videos with a global video-level signature, such as an aggregated feature vector or a binary hash code, and use a single operation for similarity calculation, such as a dot product. A straightforward approach is the extraction of global vectors as video representations combined with the dot product for similarity calculation. Early works (Wu et al., 2007; Huang et al., 2010) extracted hand-crafted features from video frames, e.g., color histograms, and aggregated them into a global vector. More recent works (Gao et al., 2017; Kordopatis-Zilos et al., 2017b; Lee et al., 2018, 2020) rely on CNN features combined with aggregation methods. Also, other works (Cai et al., 2011; Kordopatis-Zilos et al., 2017a) aggregate video content into a Bag-of-Words (BoW) representation (Sivic and Zisserman, 2003) by mapping frames to visual words and extracting global representations with tf-idf weighting. Another popular direction is the generation of hash codes for entire videos combined with Hamming distance (Song et al., 2011, 2018; Liong et al., 2017; Yuan et al., 2020). Typically, the hashing is performed via a network trained to preserve relations between videos. Coarse-grained methods provide very efficient retrieval, covering the scalability needs of web-scale applications; however, their retrieval performance is limited, as they are typically outperformed by fine-grained approaches.

2.1.2 Fine-Grained Approaches

Fine-grained approaches extract video representations, ranging from video-level to region-level, and calculate similarity by considering spatio-temporal relations between videos based on several operations, e.g., a dot product followed by a max operation. Tan et al. (2009) proposed a graph-based Temporal Network (TN) structure, used for the detection of the longest shared path between two compared videos, which has also been combined with frame-level deep learning networks (Jiang & Wang, 2016; Wang et al., 2017). Additionally, other approaches employ Temporal Hough Voting (Douze et al., 2010) to align matched frames by means of a temporal Hough transform. Another solution is based on Dynamic Programming (DP) (Chou et al., 2015), where the similarity matrix between all frame pairs is calculated, and then the diagonal blocks with the largest similarity are extracted. Another direction is to generate spatio-temporal representations with the Fourier transform in a way that accounts for the temporal structure of video similarity (Poullot et al., 2015; Baraldi et al., 2018). Finally, some recent works rely on attention-based schemes to learn video comparison and aggregation by training either attentional RNN architectures (Feng et al., 2018; Bishay et al., 2019), transformer-based networks for temporal aggregation (Shao et al., 2021), or multi-attentional networks that extract multiple video representations (Wang et al., 2021). Fine-grained methods achieve high retrieval performance; however, they do not scale well to massive datasets due to their high computational and storage requirements.

2.1.3 Video Re-Ranking

Re-ranking is a common practice in retrieval systems. In the video domain, researchers have employed it to combine methods from the two aforementioned categories (i.e., coarse- and fine-grained) to overcome their respective bottlenecks and achieve efficient and accurate retrieval (Wu et al., 2007; Douze et al., 2010; Chou et al., 2015; Yang et al., 2019; Liang & Wang, 2020). Typical methods deploy a coarse-grained method as an indexing scheme to quickly rank and filter videos, e.g., using global vectors (Wu et al., 2007) or BoW representations (Chou et al., 2015; Liang & Wang, 2020). Then, a fine-grained algorithm, such as DP (Chou et al., 2015), Hough Voting (Douze et al., 2010), or frame-level matching (Wu et al., 2007), is applied on the videos that exceed a similarity threshold in order to refine the similarity calculation. Another re-ranking approach employed for video retrieval is Query Expansion (QE) (Chum et al., 2007). It is a two-stage retrieval process where, after the first stage, the query features are re-calculated based on the most similar videos retrieved, and the query process is executed again with the new query representation. This has been successfully employed with both coarse-grained (Douze et al., 2013; Gao et al., 2017; Zhao et al., 2019) and fine-grained (Poullot et al., 2015; Baraldi et al., 2018) approaches. Also, an attention-based trainable QE scheme has been proposed in Gordo et al. (2020) for image retrieval. However, even though the retrieval performance is improved with QE, the total computational time needed for retrieval is doubled, as the query process is applied twice.

2.2 Knowledge Distillation

Knowledge Distillation (Hinton et al., 2015) is a training scheme that involves two networks, a teacher network that is usually trained with supervision and a student network that leverages the teacher's predictions for improved performance. A thorough review of the field is provided in Gou et al. (2021). Knowledge Distillation has been employed on various computer vision problems, e.g., image classification (Yalniz et al., 2019; Touvron et al., 2020; Xie et al., 2020), object detection (Li et al., 2017; Shmelkov et al., 2017; Deng et al., 2019), metric learning (Park et al., 2019; Peng et al., 2019), action recognition (Garcia et al., 2018; Thoker & Gall, 2019; Stroud et al., 2020), video classification (Zhang & Peng, 2018; Bhardwaj et al., 2019), video captioning (Pan et al., 2020; Zhang et al., 2020), and representation learning (Tavakolian et al., 2019; Piergiovanni et al., 2020).

Relevant works in the field of Knowledge Distillation distill knowledge based on the relations between data samples (Park et al., 2019; Tung & Mori, 2019; Liu et al., 2019; Lassance et al., 2020; Peng et al., 2019). Student networks are trained based on the distances between samples calculated by a teacher network (Park et al., 2019), the pairwise similarity matrix between samples within-batch (Tung & Mori, 2019), or by distilling graphs constructed based on the relations of the samples, using the sample representations as vertices and their distance as the edges to build an adjacency matrix (Liu et al., 2019; Lassance et al., 2020).

In the video domain, several approaches have been proposed to improve the computational efficiency of networks (Bhardwaj et al., 2019; Zhang & Peng, 2018; Garcia et al., 2018). Some works proposed a Knowledge Distillation setup for video classification where the student uses only a fraction of the frames processed by the teacher (Bhardwaj et al., 2019), or where multiple teachers are employed to construct a graph based on their relations and a smaller student network is then trained (Zhang & Peng, 2018). Also, a popular direction is to build methods for distillation from different modalities and learn with privileged information to increase the performance of a single network, e.g., using depth images (Garcia et al., 2018), optical flow (Crasto et al., 2019; Stroud et al., 2020; Piergiovanni et al., 2020), or multiple modalities (Luo et al., 2018; Piergiovanni et al., 2020). In video retrieval, Knowledge Distillation has been employed for frame-level feature representation learning on the evaluation datasets (Liang et al., 2019).

2.3 Comparison to Previous Approaches

In this section, we draw comparisons of the proposed approach to the related works from the literature with respect to the claimed novelties.

Proposed Framework There is no similar prior work in the video domain that builds a re-ranking framework based on Knowledge Distillation and a trainable Selection Mechanism based on which the re-ranking process is performed. Other works (Chou et al., 2015; Yang et al., 2019; Liang & Wang, 2020) rely on outdated hand-crafted methods using simple re-ranking approaches based on similarity thresholding, the selection of which is a non-trivial task. By contrast, in this work, a framework is proposed that starts from an accurate but heavy-weight teacher to train (a) both a fine-grained and coarse-grained student network on a large unlabelled dataset and (b) a selection mechanism, i.e., a learnable module based on which the re-ranking process is performed.

Knowledge Distillation To the best of our knowledge, there is no prior work in the video domain that trains a pairwise function that measures video similarity with distillation. Works that use a similar loss function for distillation are Park et al. (2019) and Tung and Mori (2019); however, these approaches have been proposed for the image domain. Video-based approaches (Bhardwaj et al., 2019; Zhang & Peng, 2018; Garcia et al., 2018; Liang et al., 2019) distill information between intermediate representations, e.g., video/frame activations or attention maps—this is costly due to the high computational requirements of the teacher. By contrast, in our training scheme the teacher’s similarities of the video pairs used during training can be pre-computed—this allows training in large datasets in an unsupervised manner (i.e., without labels). Finally, these distillation methods end up with a single network that either offers compression or better performance—by contrast, in the proposed framework, we are able to arrive at different accuracy/speed/storage trade-offs.

Network architectures We propose three student network architectures that are trained with Knowledge Distillation in an unsupervised manner on large unannotated datasets, thereby avoiding overfitting (cf. Sect. 5.1.2). Two fine-grained students are built based on our prior work in Kordopatis-Zilos et al. (2019b), with some essential adjustments to mitigate its limitations. A fine-grained attention student is developed using a more complex attention mechanism, which outperforms the teacher when trained on the large unlabelled dataset. Also, a fine-grained binarization student is introduced with a binarization layer that has significantly lower storage requirements. Prior works have used binarization layers with coarse-grained approaches (Liong et al., 2017; Song et al., 2018; Yuan et al., 2020), but none learns a fine-grained similarity function based on binarized region-level descriptors. Furthermore, a coarse-grained student is built. Its novelties are the use of a trainable region-level aggregation scheme, unlike other works that extract frame-level descriptors, and the combination of two frame-level aggregation components that consider intra- and inter-video relations between frames. Prior works have employed a transformer encoder to capture intra-video frame relations (Shao et al., 2021) or NetVLAD to capture inter-video ones (Miech et al., 2017); however, none combines the two components.

3 Distill-and-Select

This section presents the Distill-and-Select (DnS) method for video retrieval. First, we describe the developed retrieval pipeline, which involves a fine-grained and a coarse-grained student network trained with Knowledge Distillation, and a selector network, acting as a re-ranking mechanism (Sect. 3.1). Then, we discuss the network architectures/alternatives employed in our proposed approach that offer different performance-efficiency trade-offs (Sect. 3.2). Finally, the training processes followed for the training of the proposed networks are presented (Sect. 3.3).

Fig. 3

Illustration of ViSiL (Kordopatis-Zilos et al., 2019b) architecture used for the teacher and fine-grained students, i.e., \({\textbf {X}}\) can take three values: \({\textbf {T}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), and \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\). During indexing, a 3D video tensor is extracted based on a Feature Extraction (FE) process, applying regional pooling, whitening, and \(\ell ^2\)-normalization on the activations of a CNN. Then, a modular component is applied according to the employed network, i.e., an attention scheme for \({\textbf {T}}\) and \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), and a binarization layer for \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\). During retrieval, the Tensor Dot (TD) followed by Chamfer Similarity (CS) are applied on the representations of a video pair to generate their frame-to-frame similarity matrix, which is propagated to a Video Comparator (VC) CNN that captures the temporal patterns. Finally, CS is applied again to derive a single video-to-video similarity score (Color figure online)

3.1 Approach Overview

Figure 2 depicts the DnS framework. It consists of three networks: (i) a coarse-grained student (\({\textbf {S}}^\mathrm{c}\)) that provides very fast retrieval speed but with low retrieval performance, (ii) a fine-grained student (\({\textbf {S}}^\mathrm{f}\)) that has high retrieval performance but with high computational cost, and (iii) a selector network (\({\textbf {SN}}\)) that routes the similarity calculation of the video pairs and provides a balance between performance and time efficiency.

Each video in the dataset is stored/indexed using three representations: (i) a spatio-temporal 3D tensor \(f^{{\textbf {S}}^\mathrm{f}}\) that is extracted (and then used at retrieval time) by the fine-grained student \({\textbf {S}}^\mathrm{f}\), (ii) a 1D global vector \(f^{{\textbf {S}}^c}\) that is extracted (and then used at retrieval time) by the coarse-grained student \({\textbf {S}}^c\), and (iii) a scalar \(f^{{\textbf {SN}}}\) that summarises the similarity between different frames of the video in question and is extracted (and then used at retrieval time) by the selector network \({\textbf {SN}}\). The indexing process that includes the feature extraction is illustrated within the blue box in Figs. 2, 3, 4 and 5 and is denoted as \(f^{{\textbf {X}}}(\cdot )\) for each network \({\textbf {X}}\). At retrieval time, given an input query-target video pair, the selector network sends the global 1D vectors to the coarse-grained student \({\textbf {S}}^\mathrm{c}\) so that their similarity \(g^{{\textbf {S}}^\mathrm{c}}\) is rapidly estimated (i.e., as the dot product of the representations). This coarse similarity and the self-similarity scalars of the videos in question are then given as input to the selector \({\textbf {SN}}\), which takes a binary decision \(g^{{\textbf {SN}}}\) on whether the calculated coarse similarity needs to be refined by the fine-grained student. For the small percentage of videos for which this is needed, the fine-grained network calculates the similarity \(g^{{\textbf {S}}^\mathrm{f}}\) based on the spatio-temporal representations. The retrieval process that includes the similarity calculation is illustrated within the red box in Figs. 2, 3, 4 and 5 and is denoted as \(g^{{\textbf {X}}}(\cdot , \cdot )\) for each network \({\textbf {X}}\).

In practice, we apply the above process on every query-target video pair derived from a database, and a predefined percentage of videos with the largest confidence score calculated by the selector is sent to the fine-grained student for re-ranking. With this scheme, we achieve very fast retrieval with very competitive retrieval performance.
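The routing logic described above can be summarized in a short sketch. This is an illustrative implementation, not the authors' code: the similarity and selector functions are assumed to be available as callables over the indexed representations, and the 20% re-ranking budget is only an example value.

```python
# Illustrative sketch of the DnS retrieval routing; names, the callable interfaces,
# and the 20% re-ranking budget are assumptions, not the authors' implementation.
from typing import Callable, Dict, List, Tuple

def dns_retrieve(query: str,
                 targets: List[str],
                 g_coarse: Callable[[str, str], float],              # dot product of indexed 1D vectors
                 g_fine: Callable[[str, str], float],                # fine-grained student similarity
                 sn_confidence: Callable[[str, str, float], float],  # selector score for (q, p, coarse sim)
                 rerank_fraction: float = 0.2) -> List[Tuple[str, float]]:
    # 1) Coarse similarity for every query-target pair (cheap dot products).
    coarse: Dict[str, float] = {p: g_coarse(query, p) for p in targets}
    # 2) Selector confidence that the coarse estimate needs refinement.
    confidence = {p: sn_confidence(query, p, coarse[p]) for p in targets}
    # 3) Re-rank the fraction of pairs with the highest confidence scores
    #    using the fine-grained (spatio-temporal) student.
    n_rerank = int(len(targets) * rerank_fraction)
    to_rerank = sorted(targets, key=lambda p: confidence[p], reverse=True)[:n_rerank]
    final = dict(coarse)
    for p in to_rerank:
        final[p] = g_fine(query, p)
    # 4) Rank all targets by the (possibly refined) similarity.
    return sorted(final.items(), key=lambda kv: kv[1], reverse=True)
```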

3.2 Network Architectures

In this section, the architectures of all networks included in the DnS framework are discussed. First, the teacher network, which is based on the ViSiL architecture, is presented (Sect. 3.2.1). Then, we discuss our student architectures, which we propose under a Knowledge Distillation framework that addresses the limitations introduced by the teacher, i.e., high resource requirements, both in terms of memory space for indexing, due to the region-level video tensors, and in terms of computational time for retrieval, due to the fine-grained similarity calculation. More precisely, three students are proposed, two fine-grained and one coarse-grained variant, each providing different benefits. The fine-grained students both use the ViSiL architecture. The first fine-grained student simply introduces more trainable parameters, leading to better performance with similar computational and storage requirements to the teacher (Sect. 3.2.2). The second fine-grained student optimizes a binarization function that hashes features into a Hamming space and has very low storage requirements for indexing with little performance sacrifice (Sect. 3.2.3). The coarse-grained student learns to aggregate the region-level feature vectors into a global video-level representation and needs considerably fewer resources for indexing and retrieval, but at a notable performance loss (Sect. 3.2.4). Finally, we present the architecture of the selector network for indexing and retrieval (Sect. 3.2.5). Our framework operates with a specific combination of a fine-grained student, a coarse-grained student, and a selector network. Each combination achieves a different trade-off between retrieval performance, storage space, and computational time.

3.2.1 Baseline Teacher (\({\textbf {T}}\))

Here, we will briefly present the video similarity learning architecture that we employ as the teacher and which builds upon the ViSiL (Kordopatis-Zilos et al., 2019b) architecture (Fig. 3).

Feature Extraction/Indexing (\(f^{{\textbf {T}}}\)): Given an input video, we first extract region-level features from the intermediate convolution layers (Kordopatis-Zilos et al., 2017a) of a backbone CNN architecture by applying region pooling (Tolias et al., 2016) on the feature maps. These are further PCA-whitened (Jégou & Chum, 2012) and \(\ell ^2\)-normalized. We denote the aforementioned process as Feature Extraction (FE), and we employ it in all of our networks. FE is followed by a modular component, as shown in Fig. 3, that differs for each fine-grained student. In the case of the teacher, an attention mechanism is employed that weights frame regions based on their saliency, using an \(\ell ^2\)-normalized context vector over the region vectors. The context vector is a trainable vector \({\textbf {u}}\in {\mathbb {R}}^{D}\), learned through the training process, that weights each region vector independently based on their dot product. Also, no fully-connected layer is employed to transform the region vectors for the attention calculation. We refer to this attention scheme as \(\ell ^2\)-attention. The output representation of an input video x is a region-level video tensor \({\mathcal {X}}\in {\mathbb {R}}^{N_x\times R_x\times D}\), where \(N_x\) is the number of frames, \(R_x\) is the number of regions per frame, and D is the dimensionality of the region vectors; this is the output of the indexing process, and we denote it as \(f^{{\textbf {T}}}(x)\).

Similarity Calculation/Retrieval (\(g^{{\textbf {T}}}\)): At retrieval time, given two videos, q and p, with \(N_q\) and \(N_p\) number of frames and \(R_q\) and \(R_p\) regions per frame, respectively, for every pair of frames, we first calculate the frame-to-frame similarity based on the similarity of their region vectors. More precisely, to calculate the frame-to-frame similarity on videos q and p, we calculate the Tensor Dot combined with Chamfer Similarity on the corresponding video tensors \(f^{{\textbf {T}}}(q)={\mathcal {Q}}\in {\mathbb {R}}^{N_q\times R_q\times D}\) and \(f^{{\textbf {T}}}(p)={\mathcal {P}}\in {\mathbb {R}}^{N_p\times R_p\times D}\) as follows

$$\begin{aligned} {\mathcal {M}}_f^{qp} = \frac{1}{R_q}\sum _{i=1}^{R_q} \max _{1\le j\le R_p} {\mathcal {Q}} {\varvec{\cdot }} _{(3,1)} {\mathcal {P}}^\top (\cdot ,i,j,\cdot ), \end{aligned}$$
(1)

where \({\mathcal {M}}_f^{qp}\in {\mathbb {R}}^{N_q\times N_p}\) is the output frame-to-frame similarity matrix, and the Tensor Dot axes indicate the channel dimension of the corresponding video tensors. Also, the Chamfer Similarity is implemented as a max-pooling operation followed by an average-pooling on the corresponding dimensions. This process leverages the geometric information captured by region vectors and provides some degree of spatial invariance. Also, it is worth noting that this frame-to-frame similarity calculation process is independent of the number of frames and region vectors; thus, it can be applied on any video pair with arbitrary sizes and lengths.
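The frame-to-frame similarity of Eq. (1) translates directly into a few tensor operations. The following is a minimal sketch in PyTorch, assuming \(\ell ^2\)-normalized region-level tensors; the example dimensionality (D = 512, 9 regions) is illustrative, and this is not the authors' exact code.

```python
# Minimal sketch of Eq. (1): Tensor Dot over the channel axis, then Chamfer
# Similarity (max over target regions, mean over query regions).
import torch
import torch.nn.functional as F

def frame_to_frame_similarity(Q: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    """Q: [Nq, Rq, D], P: [Np, Rp, D] region-level video tensors (l2-normalized)."""
    # Region-to-region similarities for all frame pairs -> [Nq, Rq, Np, Rp].
    sim = torch.tensordot(Q, P, dims=([2], [2]))
    # Chamfer Similarity: max over the target regions, average over the query regions.
    return sim.max(dim=3).values.mean(dim=1)          # -> [Nq, Np]

# Works for arbitrary numbers of frames and regions:
Q = F.normalize(torch.randn(30, 9, 512), dim=-1)      # 30-frame query video
P = F.normalize(torch.randn(45, 9, 512), dim=-1)      # 45-frame target video
M_f = frame_to_frame_similarity(Q, P)                 # [30, 45]
```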

To calculate the video-to-video similarity, the generated similarity matrix \({\mathcal {M}}_f^{qp}\) is fed to a Video Comparator (VC) CNN module (Fig. 3), which is capable of learning robust patterns of within-video similarities. The output of the network is the refined similarity matrix \({\mathcal {M}}_v^{qp}\in {\mathbb {R}}^{N'_q\times N'_p}\). In order to calculate the final video-level similarity for two input videos qp, i.e., \(g^{{\textbf {T}}}(q,p)\), the hard tanh (\({\text {Htanh}}\)) activation function is applied on the values of the aforementioned network output followed by Chamfer Similarity in order to obtain a single value, as follows

$$\begin{aligned} g^{{\textbf {T}}}(q,p) = \frac{1}{N'_q}\sum _{i=1}^{N'_q}\max _{1\le j\le N'_p} {\text {Htanh}}\left( {\mathcal {M}}_v^{qp}(i,j)\right) . \end{aligned}$$
(2)

In that way, the VC takes temporal consistency into consideration by applying learnable convolutional operations on the frame-to-frame similarity matrix. Those enforce local temporal constraints while the Chamfer-based similarity provides invariance to global temporal transformations. Hence, similarly to the frame-to-frame similarity calculation, this process is a trade-off between respecting the video-level structure and being invariant to some temporal differences.
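A sketch of the video-to-video similarity of Eq. (2) follows. The exact layer configuration of the Video Comparator is not reproduced here; the small CNN below is an assumed stand-in that only illustrates its role of refining the frame-to-frame matrix before the hard tanh and Chamfer Similarity.

```python
# Sketch of Eq. (2); the Video Comparator layout (filters, pooling) is illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoComparator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(                     # illustrative CNN on similarity matrices
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 1, 1),
        )

    def forward(self, M_f: torch.Tensor) -> torch.Tensor:
        """M_f: [Nq, Np] frame-to-frame similarity matrix -> refined matrix [Nq', Np']."""
        return self.net(M_f[None, None])[0, 0]

def video_similarity(M_f: torch.Tensor, vc: VideoComparator) -> torch.Tensor:
    M_v = F.hardtanh(vc(M_f))                         # Htanh on the VC output
    return M_v.max(dim=1).values.mean()               # Chamfer Similarity -> scalar score

score = video_similarity(torch.rand(30, 45), VideoComparator())
```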

3.2.2 Fine-Grained Attention Student (\({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\))

The first fine-grained student adopts the same architecture as the teacher (Sect. 3.2.1, Fig. 3), but uses a more complex attention scheme in the modular component, employed for feature weighting, as proposed in Yang et al. (2016).

Feature Extraction/Indexing (\(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {A}}}\)): The Feature Extraction (FE) process is used to extract features, similar to the teacher. In the modular component shown in Fig. 3, we apply an attention weighting scheme as follows. Given a region vector \({\textbf {r}}:{\mathcal {X}}(i,j,\cdot )\in {\mathbb {R}}^D\), where \(i=1,\ldots ,N_x\), \(j=1,\ldots ,R_x\), a non-linear transformation is applied, which is implemented as a fully-connected layer with tanh activation function, to form a hidden representation \({\textbf {h}}\). Then, the attention weight is calculated as the dot product between \({\textbf {h}}\) and the context vector \({\textbf {u}}\), followed by the sigmoid function, as

$$\begin{aligned} \begin{aligned}&{\textbf {h}} = \text {tanh}({\textbf {r}}\cdot W_a + b_a), \\&\alpha = \text {sig}({\textbf {u}}\cdot {\textbf {h}}), \\&{\textbf {r}}' = \alpha {\textbf {r}}, \end{aligned} \end{aligned}$$
(3)

where \(W_a\in {\mathbb {R}}^{D\times D}\) and \(b_a\in {\mathbb {R}}^{D}\) are the weight and bias parameters of the hidden layer of the attention module, respectively, and \(\text {sig}(\cdot )\) denotes the element-wise sigmoid function. We will be referring to this attention scheme as h-attention. The resulting 3D representation is the indexing output \(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {A}}}(x)\) for an input video x.
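For concreteness, a minimal sketch of the h-attention weighting of Eq. (3); the layer and parameter names are chosen for illustration only.

```python
# Minimal sketch of the h-attention of Eq. (3).
import torch
import torch.nn as nn

class HAttention(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.hidden = nn.Linear(dim, dim)                 # W_a, b_a
        self.context = nn.Parameter(torch.randn(dim))     # context vector u

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        """regions: [N, R, D] -> attention-weighted region vectors of the same shape."""
        h = torch.tanh(self.hidden(regions))              # hidden representation
        alpha = torch.sigmoid(h @ self.context)           # [N, R] weights in (0, 1)
        return alpha.unsqueeze(-1) * regions
        # The teacher's l2-attention instead weights each region by its dot product
        # with an l2-normalized context vector, with no hidden layer.

weighted = HAttention(512)(torch.randn(30, 9, 512))       # [30, 9, 512]
```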

Similarity Calculation/Retrieval (\(g^{{\textbf {S}}^\mathrm{f}_{\mathcal {A}}}\)): To calculate similarity between two videos, we build the same process as for the teacher, i.e., we employ a Video Comparator (VC) and use the same frame-to-frame and video-to-video functions to derive \(g^{{\textbf {S}}^\mathrm{f}_{\mathcal {A}}}(q,p)\) for two input videos qp (Fig. 3).

In comparison to the teacher, this student (a) has very similar storage requirements, since in both cases, the videos are stored as non-binary spatio-temporal features, (b) has similar computational cost, since the additional attention layer introduces only negligible overhead, and (c) typically reaches better performance, since it has slightly higher capacity and can be trained in a much larger, unlabelled dataset.

3.2.3 Fine-Grained Binarization Student (\({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\))

The second fine-grained student also adopts the same architecture as the teacher (Sect. 3.2.1, Fig. 3), except for the modular component where a binarization layer is introduced, as discussed below.

Feature Extraction/Indexing (\(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}\)): \(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}\) is the part of the indexing of the student \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\) that extracts a binary representation for an input video that will be stored and used at retrieval time. It uses the architecture of the teacher, where the modular component is implemented as a binarization layer (Fig. 3). This applies a binarization function \(b:{\mathbb {R}}^D\rightarrow \{-1, 1\}^L\) that hashes the region vectors \({\textbf {r}}\in {\mathbb {R}}^D\) to binary hash codes \({\textbf {r}}_{\mathcal {B}}\in \{-1,1\}^L\) as

$$\begin{aligned} b({\textbf {r}}) = \text {sgn}\left( {\textbf {r}}\cdot W_{\mathcal {B}}\right) , \end{aligned}$$
(4)

where \(W_{\mathcal {B}}\in {\mathbb {R}}^{D\times L}\) denote the learnable weights and \(\text {sgn}(\cdot )\) denotes the element-wise sign function.

However, since \(\text {sgn}\) is not a differentiable function, learning the binarization parameters via backpropagation is not possible. To address this, we propose an approximation of the sign function under the assumption of small uncertainty in its input. More specifically, let \(\text {sgn}:x\mapsto \{\pm 1\}\), where x is drawn from a uni-variate Gaussian distribution with given mean \(\mu \) and fixed variance \(\sigma ^2\), i.e., \(x\sim {\mathcal {N}}(\mu ,\sigma ^2)\). Then, since \({{\,\mathrm{\mathbb {E}}\,}}[\text {sgn}(x)] = P(x>0) - P(x<0)\), the expected valueFootnote 2 of the sign of x is given analytically as follows

$$\begin{aligned} {{\,\mathrm{\mathbb {E}}\,}}[\text {sgn}(x)] = {\text {erf}}\left( \frac{\mu }{\sqrt{2\sigma ^2}}\right) , \end{aligned}$$
(5)

where \(\text {erf}(\cdot )\) denotes the error function. This is differentiable and therefore can serve as an activation function on the binarization parameters, that is,

$$\begin{aligned} b({\textbf {r}}) = {\text {erf}}\left( \frac{{\textbf {r}}\cdot W_\mathcal {B}}{\sqrt{2\sigma ^2}}\right) , \end{aligned}$$
(6)

where we use as variance an appropriate constant value (empirically set to \(\sigma =10^{-3}\)). During training, we use (6), while during evaluation and hash code storage we use (4). After applying this operation to an arbitrary video x with \(N_x\) frames and \(R_x\) regions, we arrive at a binary tensor \({\mathcal {X}}_{\mathcal {B}}\in \{\pm 1\}^{N_x\times R_x\times L}\), which is the indexing output \(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}(x)={\mathcal {X}}_{\mathcal {B}}\) used by this student.
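The layer can be sketched as follows, using the erf surrogate of Eq. (6) during training and the hard sign of Eq. (4) at indexing time; the code length L and the weight initialization are assumptions, while \(\sigma =10^{-3}\) follows the text.

```python
# Sketch of the binarization layer: erf surrogate of Eq. (6) while training,
# hard sign of Eq. (4) at indexing/evaluation time.
import torch
import torch.nn as nn

class BinarizationLayer(nn.Module):
    def __init__(self, dim: int = 512, bits: int = 512, sigma: float = 1e-3):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, bits) / dim ** 0.5)   # W_B
        self.sigma = sigma

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        """regions: [N, R, D] -> region codes in [-1, 1]^L."""
        proj = regions @ self.W
        if self.training:
            # Differentiable surrogate: E[sgn(x)] for x ~ N(proj, sigma^2).
            return torch.erf(proj / (2 * self.sigma ** 2) ** 0.5)
        # Hard codes for storage/retrieval (sign(0) = 0 is a negligible corner case).
        return torch.sign(proj)

codes = BinarizationLayer().eval()(torch.randn(30, 9, 512))           # [30, 9, 512]
```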

Similarity Calculation/Retrieval (\(g^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}\)): In order to adapt the similarity calculation process to the binarization operation, the Hamming Similarity (HS) combined with Chamfer Similarity is employed as follows. Given two videos qp with \(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}(q)={\mathcal {Q}}_{\mathcal {B}}\in \{\pm 1\}^{N_q\times R_q\times L}\) and \(f^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}(p)={\mathcal {P}}_{\mathcal {B}}\in \{\pm 1\}^{N_p\times R_p\times L}\) their binary tensors, respectively, we first calculate the HS between the two tensors with the use of Tensor Dot, to calculate the similarity of all region pair combinations of the two videos, and then apply Chamfer Similarity to derive the frame-to-frame similarity matrix \({\mathcal {M}}_{\mathcal {B}}^{qp}\in {\mathbb {R}}^{N_q\times N_p}\). That is,

$$\begin{aligned} \begin{aligned}&HS^{qp} = {\mathcal {Q}}_{\mathcal {B}} {\varvec{\cdot }} _{(3,1)} {\mathcal {P}}_{\mathcal {B}}^\top / L, \\&{\mathcal {M}}_{\mathcal {B}}^{qp} = \frac{1}{R_q}\sum _{i=1}^{R_q} \max _{1\le j\le R_p} HS^{qp} (\cdot , i, j, \cdot ), \end{aligned} \end{aligned}$$
(7)

Finally, a Video Comparator (VC) is applied on the frame-to-frame similarity matrices in order to calculate the final video-to-video similarity, similarly to (2) in the original teacher (Fig. 3)—this is denoted as \(g^{{\textbf {S}}^\mathrm{f}_{\mathcal {B}}}(q,p)\) for two input videos qp.
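A sketch of Eq. (7) for completeness: for \(\pm 1\) codes, the Hamming Similarity reduces to a dot product scaled by L, followed by the same Chamfer aggregation as in Eq. (1); dimensions are illustrative.

```python
# Sketch of Eq. (7): Hamming Similarity of +-1 codes via a scaled dot product,
# then Chamfer aggregation over regions.
import torch

def binary_frame_similarity(Qb: torch.Tensor, Pb: torch.Tensor) -> torch.Tensor:
    """Qb: [Nq, Rq, L], Pb: [Np, Rp, L] binary (+-1) region codes."""
    L = Qb.shape[-1]
    hs = torch.tensordot(Qb, Pb, dims=([2], [2])) / L     # [Nq, Rq, Np, Rp], values in [-1, 1]
    return hs.max(dim=3).values.mean(dim=1)               # Chamfer Similarity -> [Nq, Np]

Qb = torch.sign(torch.randn(30, 9, 512))
Pb = torch.sign(torch.randn(45, 9, 512))
M_b = binary_frame_similarity(Qb, Pb)                     # then fed to the VC as in Eq. (2)
```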

In comparison to the teacher, this student (a) has remarkably lower storage requirements, since the binary spatio-temporal representations are 32 times smaller than the corresponding float (full precision) ones, (b) has similar computational cost, as the architecture is very similar, and (c) reaches better performance, since it is trained on a larger (albeit unlabelled) dataset. Note that this student uses binarized representations as input to the similarity calculation, but it is not a binarized network.

Fig. 4

Illustration of the architecture of the coarse-grained student \({\textbf {S}}^\mathrm{c}\), consisting of three main components. During indexing, the FE process with attention weighting and average pooling is applied to extract frame-level features. Then, they are processed by a Transformer network and aggregated to 1D vectors by a NetVLAD module. During retrieval, the video similarity derives from a simple dot product between the extracted representations (Color figure online)

3.2.4 Coarse-Grained Student (\({\textbf {S}}^\mathrm{c}\))

The coarse-grained student introduces an architecture that extracts video-level representations that are stored and can be subsequently used at retrieval time so as to rapidly estimate the similarity between two videos as the cosine similarity of their representations. An overview of the coarse student is shown in Fig. 4.

Feature Extraction/Indexing (\(f^{{\textbf {S}}^\mathrm{c}}\)): The proposed coarse-grained student comprises three components. First, we extract weighted region-level features with Feature Extraction (FE), using the attention module given by (3), and then average pooling is applied across the spatial dimensions of the video tensors, leading to frame-level representations for the videos; i.e., \({\textbf {x}}_i=\frac{1}{R_x} \sum _{k=1}^{R_x} \alpha _k{\textbf {r}}_k\), where \({\textbf {x}}_i\in {\mathbb {R}}^D\) is the frame-level vector of the i-th video frame, \(R_x\) is the number of regions, and \(\alpha _k\) is the attention weight calculated by (3). In that way, we apply a trainable scheme to aggregate the region-level features that focuses on the information-rich regions. Second, a Transformer (Vaswani et al., 2017) network architecture is used to derive frame-level representations that capture long-term dependencies within the frame sequence, i.e., it captures the intra-video relations between frames. Following Shao et al. (2021), the encoder part of the Transformer architecture is used, which is composed of a multi-head self-attention mechanism and a feed-forward network. Finally, a NetVLAD (Arandjelovic et al., 2016) module aggregates the entire video to a single vector representation (Miech et al., 2017). This component learns a number of cluster centers and a soft assignment function through the training process, considering all videos in the training dataset; therefore, it can be viewed as encoding the inter-video relations between frames. Given an input video x, the output \(f^{{\textbf {S}}^\mathrm{c}}(x)\) is a 1D video-level vector that is indexed and used by the coarse-grained student during retrieval.

Fig. 5

Illustration of the Selector Network architecture. During indexing, the self-similarity of the videos is calculated according to the following scheme. First, region-level attention-weighted features are extracted. Then, the frame-to-frame self-similarity matrix is derived with a Tensor Dot (TD) and Average Pooling (AP), and is propagated to a VC module to capture temporal patterns. The final self-similarity is calculated with an AP on the VC output. During retrieval, given a video pair, a 3-dimensional vector is composed of the self-similarity of each video and their similarity calculated by the \({\textbf {S}}^\mathrm{c}\). This feature vector is fed to an MLP to derive a confidence score (Color figure online)

Similarity Calculation/Retrieval (\(g^{{\textbf {S}}^\mathrm{c}}\)): Once feature representations have been extracted, the similarity calculation is a simple dot product between the 1D vectors of the compared videos, i.e., \(g^{{\textbf {S}}^\mathrm{c}}(q,p)=f^{{\textbf {S}}^\mathrm{c}}(q)\cdot f^{{\textbf {S}}^\mathrm{c}}(p)\) for two input videos qp.
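The indexing path of the coarse-grained student can be sketched as below. The attention, Transformer, and NetVLAD hyper-parameters, as well as the simplified NetVLAD implementation, are illustrative assumptions; only the overall pipeline (attention-weighted region pooling, Transformer encoder over frames, NetVLAD aggregation, dot-product retrieval) follows the text.

```python
# Sketch of the coarse-grained student; hyper-parameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleNetVLAD(nn.Module):
    def __init__(self, dim: int = 512, clusters: int = 64):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(clusters, dim))
        self.assign = nn.Linear(dim, clusters)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: [N, D] -> l2-normalized video-level vector [clusters * D]."""
        a = F.softmax(self.assign(frames), dim=-1)                    # soft assignments [N, K]
        residuals = frames[:, None, :] - self.centers[None]           # [N, K, D]
        vlad = (a[..., None] * residuals).sum(dim=0)                  # [K, D]
        return F.normalize(F.normalize(vlad, dim=-1).flatten(), dim=0)

class CoarseStudent(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        self.att_hidden = nn.Linear(dim, dim)                         # h-attention, Eq. (3)
        self.att_context = nn.Parameter(torch.randn(dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)     # intra-video relations
        self.vlad = SimpleNetVLAD(dim)                                # inter-video (codebook) aggregation

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        """regions: [N, R, D] region-level tensor -> 1D video-level vector."""
        alpha = torch.sigmoid(torch.tanh(self.att_hidden(regions)) @ self.att_context)
        frames = (alpha.unsqueeze(-1) * regions).mean(dim=1)          # weighted region pooling -> [N, D]
        frames = self.encoder(frames[None])[0]                        # [N, D]
        return self.vlad(frames)

student = CoarseStudent()
vec_q, vec_p = student(torch.randn(30, 9, 512)), student(torch.randn(45, 9, 512))
similarity = vec_q @ vec_p                                            # retrieval-time dot product
```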

In comparison to the original teacher, this student (a) has remarkably lower storage requirements for indexing, since it stores video-level representations instead of spatio-temporal ones, (b) has significantly lower computational cost at retrieval time, since the similarity is calculated with a single dot-product between video-level representations, and (c) has considerably lower performance, since it does not model spatio-temporal relations between videos during similarity calculation.

3.2.5 Selector Network (\({\textbf {SN}}\))

In the proposed framework, at retrieval time, given a pair of videos, the role of the selector is to decide whether the similarity that is calculated rapidly based on the stored coarse video-level representations is accurate enough (i.e., similar to what a fine-grained student would give), or whether a fine-grained similarity, based on the spatio-temporal, fine-grained representations needs to be used, and a new, refined similarity measure needs to be estimated. Clearly, this decision needs to be taken rapidly and with a very small additional storage requirement for each video.

The proposed selector network is shown in Fig. 5. At retrieval time, a simple Multi-Layer Perceptron (MLP) takes as input a three-dimensional vector, \(\mathbf {z}\in {\mathbb {R}}^3\), with the following features: (a) the similarity between a pair of videos qp, as calculated by \({\textbf {S}}^\mathrm{c}\) (Sect. 3.2.4), and (b) the fine-grained self-similarities \(f^{{\textbf {SN}}}(q)\) and \(f^{{\textbf {SN}}}(p)\), calculated by a trainable NN (Fig. 5). Since \(f^{{\textbf {SN}}}(x)\) depends only on video x, it can be stored together with the representations of video x at negligible storage cost. Having \(f^{{\textbf {SN}}}(q)\) and \(f^{{\textbf {SN}}}(p)\) pre-computed, and \(g^{{\textbf {S}}^\mathrm{c}}\) rapidly computed by the coarse-grained student, the use of the selector at retrieval time comes at negligible storage and computational cost. Both the self-similarity function \(f^{{\textbf {SN}}}\) that extracts features at indexing time and the MLP that takes the decision at retrieval time, which are parts of the Selector Network \({\textbf {SN}}\), are trained jointly.

In what follows, we describe the architecture of the selector, starting from the network that calculates the fine-grained self-similarity \(f^{{\textbf {SN}}}\). This is a modified version of the ViSiL architecture that aims to derive a measure capturing whether there is large spatio-temporal variability in a video's content. This is expected to be informative on whether the fine-grained student needs to be invoked. The intuition is that for videos with high \(f^{{\textbf {SN}}}\), i.e., low spatio-temporal variability, the video-level representations are sufficient to calculate their similarity, i.e., the similarity estimated by the coarse-grained student is accurate enough.

Feature Extraction/Indexing (\(f^{{\textbf {SN}}}\)): Given a video x as input, features are extracted based on the Feature Extraction (FE), using the attention module as in (3), to derive a video tensor \({\mathcal {X}}\in {\mathbb {R}}^{N_x\times R_x\times D}\). Then, the frame-to-frame self-similarity matrix is calculated, as

$$\begin{aligned} {\mathcal {M}}_f^{x} = \frac{1}{R_x^2}\sum _{i=1}^{R_x} \sum _{j=1}^{R_x} {\mathcal {X}} {\varvec{\cdot }} _{(3,1)} {\mathcal {X}}^\top (\cdot ,i,j,\cdot ), \end{aligned}$$
(8)

where \({\mathcal {M}}_f^{x}\in {\mathbb {R}}^{N_x\times N_x}\) is the symmetric frame-to-frame self-similarity matrix. Note that (8) is a modified version of (1), where the Chamfer Similarity is replaced by the average operator. In this case, we calculate the average similarity of a region with all other regions in the same frame; using Chamfer Similarity would have resulted in estimating the similarity of a region with the most similar region in the current frame, which is the region itself.

Similarly, a Video Comparator (VC) CNN is employed (same as in ViSiL, Fig. 3), which is fed with the self-similarity matrix in order to extract the temporal patterns and generate a refined self-similarity matrix \({\mathcal {M}}_v^{x}\in {\mathbb {R}}^{N'_x\times N'_x}\). To extract a final score (indexing output) that captures self-similarity, we modify (2) as

$$\begin{aligned} f^{{\textbf {SN}}}(x) = \frac{1}{N'_x{}^2}\sum _{i=1}^{N'_x}\sum _{j=1}^{N'_x} {\mathcal {M}}_v^{x}(i,j), \end{aligned}$$
(9)

that is, the average of the pair-wise similarities of all video frames. Note that we also do not use the hard tanh activation function, as we empirically found that it is not needed.

Confidence Calculation/Retrieval (\(g^{{\textbf {SN}}}\)): Given a pair of videos and their similarity predicted by the \({\textbf {S}}^\mathrm{c}\), we retrieve the indexed self-similarity scores, and then we concatenate them with the \({\textbf {S}}^\mathrm{c}\) similarity, forming a three-dimensional vector \({\textbf {z}}\in {\mathbb {R}}^3\) for the video pair, as shown in Fig. 5. This vector is given as input to a two-layer MLP using Batch Normalization (Ioffe & Szegedy, 2015) and ReLU (Krizhevsky et al., 2012) activation functions. For an input video pair qp, the retrieval output \(g^{{\textbf {SN}}}(q, p)\) is the confidence score of the selector network that the fine-grained student needs to be invoked.
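A sketch of the selector follows, covering the indexed self-similarity of Eqs. (8) and (9) and the retrieval-time MLP over the three-dimensional input. The Video Comparator layout and the MLP hidden size are assumptions; in practice, pairs would be processed in batches.

```python
# Sketch of the selector network: self-similarity indexing and confidence MLP.
import torch
import torch.nn as nn

class SelectorNetwork(nn.Module):
    def __init__(self):
        super().__init__()
        self.vc = nn.Sequential(                          # illustrative Video Comparator
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, 1),
        )
        self.mlp = nn.Sequential(                         # two-layer decision MLP
            nn.Linear(3, 100), nn.BatchNorm1d(100), nn.ReLU(),
            nn.Linear(100, 1),
        )

    def self_similarity(self, X: torch.Tensor) -> torch.Tensor:
        """Indexing: X is an attention-weighted region tensor [N, R, D] -> scalar f_SN(x)."""
        sim = torch.tensordot(X, X, dims=([2], [2]))      # [N, R, N, R]
        M_f = sim.mean(dim=(1, 3))                        # Eq. (8): average over both region axes
        M_v = self.vc(M_f[None, None])[0, 0]              # refined self-similarity matrix
        return M_v.mean()                                 # Eq. (9): average over all entries

    def confidence(self, z: torch.Tensor) -> torch.Tensor:
        """Retrieval: z is [B, 3] = (coarse similarity, f_SN(q), f_SN(p)) per pair."""
        return torch.sigmoid(self.mlp(z)).squeeze(-1)     # probability that re-ranking is needed

sel = SelectorNetwork().eval()
f_q = sel.self_similarity(torch.randn(30, 9, 512))        # stored at indexing time
```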

3.3 Training Process

Fig. 6

Illustration of the training process of the teacher network. It is trained with supervision on video triplets derived from a labelled dataset, minimizing a triplet loss

In this section, we go through the details of the procedure followed for the training of the underlying networks of the proposed framework, i.e., the teacher, the students, and the selector.

3.3.1 Teacher Training

The teacher network is trained with supervision on a labelled video dataset \({\mathcal {V}}_l\), as shown in Fig. 6. The videos are organized in triplets \((v,v^+,v^-)\) of an anchor, a positive (relevant), and a negative (irrelevant) video, respectively, where \(v,v^+,v^-\in {\mathcal {V}}_l\), and the network is trained with the triplet loss

$$\begin{aligned} {\mathcal {L}}_{tr} = \max \left( 0, g^{\textbf {T}}(v, v^-) - g^{\textbf {T}}(v, v^+) + \gamma \right) , \end{aligned}$$
(10)

where \(\gamma \) is a margin hyperparameter. In addition, a similarity regularization function is used that penalizes high values in the input of hard tanh that would lead to saturated outputs. Following other works, we use data augmentation (i.e., color, geometric, and temporal augmentations) on the positive samples \(v^+\).
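For reference, the triplet objective of Eq. (10) in code form; the margin value below is illustrative, and the similarity regularization term mentioned above is omitted.

```python
# The triplet loss of Eq. (10) on teacher similarities.
import torch

def triplet_loss(sim_anchor_pos: torch.Tensor,
                 sim_anchor_neg: torch.Tensor,
                 margin: float = 0.5) -> torch.Tensor:
    """sim_anchor_pos = g_T(v, v+), sim_anchor_neg = g_T(v, v-), both scalar tensors."""
    return torch.clamp(sim_anchor_neg - sim_anchor_pos + margin, min=0.0)
```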

3.3.2 Student Training

An overview of the student training process is illustrated in Fig. 7. Let \({\mathcal {V}}_u=\{v_1,v_2,\ldots ,v_n\}\) be a collection of unlabelled videos and \(g^{\textbf {T}}(q,p)\), \(g^{\textbf {S}}(q, p)\) be the similarities between videos \(q,p\in {\mathcal {V}}_u\), estimated by a teacher network \({\textbf {T}}\) and a student network \({\textbf {S}}\), respectively. \({\textbf {S}}\) is trained so that \(g^{\textbf {S}}\) approximates \(g^{\textbf {T}}\), with the \(L_1\) loss,Footnote 3 that is,

$$\begin{aligned} {\mathcal {L}}_{TS} = \left\Vert g^{\textbf {T}}(q, p) - g^{\textbf {S}}(q, p)\right\Vert _1. \end{aligned}$$
(11)
Fig. 7

Illustration of the training process of the student networks. They are trained on an unlabelled dataset by minimizing the difference between their video similarity estimations and the ones calculated by the teacher network

Note that the loss is defined on the output of the teacher. This allows for a training process in which the scores of the teacher are calculated only once for a number of pairs in the unlabelled dataset and are then used as targets for the students. This is in contrast to methods where the loss is calculated on intermediate features of \({\textbf {T}}\) and \({\textbf {S}}\), which thus cannot scale to large datasets, as they have considerable storage and/or computational/memory requirements. In this setting, the selection of the training pairs is crucial. Since it is very time-consuming to apply the teacher network \({\textbf {T}}\) to every pair of videos in the dataset (\(O(n^2)\) complexity), and randomly selecting videos would result mostly in pairs with low similarity scores, we follow Kordopatis-Zilos et al. (2019a) and generate a graph whose connected components are extracted and considered as video clusters. Each video included in a video cluster is considered as an anchor, and we form pairs with the videos belonging to the same cluster, which are treated as positive pairs. Also, based on the anchor video, we form pairs with the 50 most similar videos that belong to other clusters and the 50 most similar videos that belong to no cluster, which are treated as negative pairs. At each epoch, one positive and one negative pair are selected for each anchor video to balance their ratio.
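A minimal sketch of one distillation step under Eq. (11) is given below, assuming the teacher similarities have been pre-computed and stored with the mined pairs; the data handling is deliberately simplified.

```python
# One distillation step: L1 loss between student similarity and stored teacher score.
import torch

def distillation_step(student, optimizer, batch):
    """batch: iterable of (query_features, target_features, teacher_similarity)."""
    optimizer.zero_grad()
    losses = []
    for q_feat, p_feat, teacher_sim in batch:
        student_sim = student(q_feat, p_feat)                   # g_S(q, p)
        losses.append(torch.abs(student_sim - teacher_sim))     # L1 to the stored teacher score
    loss = torch.stack(losses).mean()
    loss.backward()
    optimizer.step()
    return loss.item()
```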

3.3.3 Selector Training

Typically, the similarity between two videos q and p estimated by a fine-grained student \({\textbf {S}}^\mathrm{f}\) leads to better retrieval scores than the one estimated by the coarse-grained student \({\textbf {S}}^\mathrm{c}\). However, for some video pairs, the difference between them (i.e., \(\Vert g^{{\textbf {S}}^\mathrm{c}}(q,p) - g^{{\textbf {S}}^\mathrm{f}}(q,p)\Vert _1\)) is small and therefore has a negligible effect on the ranking and on whether the video will be retrieved or not. The selector is a network that is trained to distinguish between such video pairs and pairs of videos that exhibit large similarity differences. For the former, only the coarse-grained student \({\textbf {S}}^\mathrm{c}\) will be used; for the latter, the fine-grained student \({\textbf {S}}^\mathrm{f}\) will be invoked.

The selector network is trained as a binary classifier, with binary labels obtained by setting a threshold t on \(\Vert g^{{\textbf {S}}^\mathrm{c}}(q,p) - g^{{\textbf {S}}^\mathrm{f}}(q,p)\Vert _1\), that is,

$$\begin{aligned} l(q,p) = {\left\{ \begin{array}{ll} 1 &{} \text {if}\ \left\Vert g^{{\textbf {S}}^\mathrm{c}}(q,p) - g^{{\textbf {S}}^\mathrm{f}}(q,p)\right\Vert _1 > t, \\ 0 &{} \text {otherwise}. \end{array}\right. } \end{aligned}$$
(12)

Video pairs are derived from \({\mathcal {V}}_u\), and Binary Cross-Entropy is used as a loss function, as shown in Fig. 8. We use the same mining process used for the student training, and at each epoch, a fixed number of video pairs is sampled for the two classes. We reiterate here that the selector is trained in an end-to-end manner, i.e., both the self-similarity feature extraction network \(f^{{\textbf {SN}}}\), given by (9), and the decision-making MLP (Fig. 5) are optimized jointly during training.
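The training signal of Eq. (12) with binary cross-entropy can be sketched as follows; the threshold value shown is only an example, and the selector is assumed to output logits (pre-sigmoid scores).

```python
# Selector training signal: binary labels from Eq. (12), optimized with BCE.
import torch
import torch.nn.functional as F

def selector_loss(selector_logits: torch.Tensor,
                  coarse_sims: torch.Tensor,
                  fine_sims: torch.Tensor,
                  t: float = 0.2) -> torch.Tensor:
    """All inputs are [B]-shaped tensors for a batch of video pairs."""
    labels = (torch.abs(coarse_sims - fine_sims) > t).float()   # Eq. (12)
    return F.binary_cross_entropy_with_logits(selector_logits, labels)
```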

Fig. 8

Illustration of the training process of the selector network. It is trained on an unlabelled dataset, exploiting the similarities calculated by a coarse- and fine-grained student. Note that the fine-grained student is applied on all video pairs only during training time. During retrieval, only a portion of the dataset is sent to it

4 Evaluation Setup

In this section, we present the datasets (Sect. 4.1), evaluation metrics (Sect. 4.2), and implementation details (Sect. 4.3) adopted during the experimental evaluation of the proposed framework.

4.1 Datasets

4.1.1 Training Datasets

VCDB (Jiang et al., 2014) was used as the training dataset to generate triplets for the training of the teacher model. The dataset consists of videos derived from popular video platforms (i.e., YouTube and Metacafe) and has been developed and annotated as a benchmark for partial copy detection. It contains two subsets, namely the core and the distractor subset. The former contains 28 discrete sets composed of 528 videos with over 9000 pairs of copied segments. The latter is a corpus of approximately 100,000 randomly collected videos that serve as distractors.

DnS-100K is the dataset collected for the training of the students. We followed the collection process from our prior work (Kordopatis-Zilos et al., 2019a) for the formation of the FIVR-200K dataset in order to collect a large corpus of videos with various relations between them. First, we built a collection of the major news events that occurred in recent years by crawling Wikipedia’s “Current Event” page.Footnote 4 To avoid overlap with FIVR-200K, whose crawling period spanned 2013 to 2017, we only considered news events from the years 2018–2019. Then, we retained only the news events associated with armed conflicts and natural disasters by filtering them based on their topic. Afterwards, the public YouTube APIFootnote 5 was used to collect videos by providing the event headlines as queries. The results were filtered to contain only videos published between the corresponding event start date and up to one week after the event. At the end of this process, we had collected a corpus of 115,792 videos. Following the mining scheme described in Sect. 3.3.2, we arrived at 21,997 anchor videos with approximately 2.5M pairs.

4.1.2 Evaluation Datasets

FIVR-200K (Kordopatis-Zilos et al., 2019a) was used as a benchmark for Fine-grained Incident Video Retrieval (FIVR). It consists of 225,960 videos collected based on 4687 news events, and 100 video queries. It contains video-level annotation labels for near-duplicate (ND), duplicate scene (DS), complementary scene (CS), and incident scene (IS) videos. FIVR-200K includes three different tasks: a) the Duplicate Scene Video Retrieval (DSVR) task, where only videos annotated with ND and DS are considered relevant, b) the Complementary Scene Video Retrieval (CSVR) task, which accepts only the videos annotated with ND, DS, or CS as relevant, and c) the Incident Scene Video Retrieval (ISVR) task, where all labels are considered relevant. For quick comparisons of different configurations, we also used FIVR-5K, a subset of FIVR-200K, as provided in Kordopatis-Zilos et al. (2019b).

CC_WEB_VIDEO (Wu et al., 2007) simulates the Near-Duplicate Video Retrieval (NDVR) problem. It consists of 24 query sets and 13,112 videos. The collection comprises videos retrieved by submitting 24 popular text queries to popular video-sharing websites, i.e., YouTube, Google Video, and Yahoo! Video. For every query, a set of video clips was collected, and the most popular video was considered to be the query video. Subsequently, all videos in the video set were manually annotated based on their near-duplicate relation to the query video. We also use the ‘cleaned’ version, as provided in Kordopatis-Zilos et al. (2019b).

SVD (Jiang et al., 2019) was used for the NDVR problem, tailored for short videos in particular. It consists of 562,013 short videos crawled from a large video-sharing website, namely, Douyin.Footnote 6 The average length of the collected videos is 17.33 s. The videos with more than 30,000 likes were selected to serve as queries. Candidate videos were selected and annotated based on a three-step retrieval process. A large number of probably negative unlabelled videos were also included to serve as distractors. Hence, the final dataset consists of 1206 queries with 34,020 labelled video pairs and 526,787 unlabelled videos. The queries were split into two sets, i.e., training and test set, with 1000 and 206 queries, respectively. In this paper, we only use the test set for the evaluation of the retrieval systems.

EVVE (Revaud et al., 2013) was designed for the Event Video Retrieval (EVR) problem. It consists of 2,375 videos and 620 queries. The main task on this dataset is the retrieval of all videos that capture the event depicted by a query video. The dataset contains 13 major events that were provided as queries to YouTube. Each event was annotated by one annotator, who first produced a precise definition of the event. However, we managed to download and process only 1906 videos and 504 queries (that is, \(\approx \)80% of the initial dataset) due to the unavailability of the remaining ones.

4.2 Evaluation Metric

To evaluate retrieval performance, we use the mean Average Precision (mAP) metric, as defined in Wu et al. (2007), which captures the quality of video rankings. For each query, the Average Precision (AP) is calculated as

$$\begin{aligned} AP = \frac{1}{n} \sum \limits _{i=1}^{n} \frac{i}{r_i}, \end{aligned}$$
(13)

where n is the number of relevant videos to the query video and \(r_i\) is the rank of the i-th retrieved relevant video. The mAP is calculated by averaging the AP scores across all queries. Also, for the evaluation of the selector, we use the plot of mAP with respect to the total dataset percentage sent to the fine-grained student. The objective is to achieve high retrieval performance (in terms of mAP) with low dataset percentage.
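
For illustration, Eq. (13) can be computed from a ranked result list as in the sketch below, assuming binary relevance labels and that the full database is ranked, so that all n relevant videos appear in the list.

```python
import numpy as np

def average_precision(ranked_relevance) -> float:
    """Eq. (13): AP = (1/n) * sum_i i / r_i, where r_i is the 1-based rank of the
    i-th retrieved relevant video and n is the number of relevant videos."""
    ranks = np.where(np.asarray(ranked_relevance, dtype=bool))[0] + 1  # ranks r_i
    if len(ranks) == 0:
        return 0.0
    i = np.arange(1, len(ranks) + 1)
    return float(np.mean(i / ranks))

def mean_average_precision(all_ranked_relevance) -> float:
    """mAP: the AP averaged over all queries."""
    return float(np.mean([average_precision(r) for r in all_ranked_relevance]))

# Example: relevant videos retrieved at ranks 1, 3, and 6 -> AP = (1/1 + 2/3 + 3/6) / 3
print(average_precision([1, 0, 1, 0, 0, 1]))  # ~0.722
```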

4.3 Implementation Details

All of our models have been implemented with the PyTorch (Paszke et al., 2019) library. For the teacher, we have re-implemented ViSiL (Kordopatis-Zilos et al., 2019b) following the same implementation details, i.e., for each video, we extracted 1 frame per second and used ResNet-50 (He et al., 2016) for feature extraction, using the output maps of the four residual blocks, resulting in \(D=3840\). The PCA-whitening layer was learned from 1M region vectors sampled from VCDB. In all of our experiments, the weights of the feature extraction CNN and the whitening layer remained fixed. We sampled 2000 triplets in each epoch. The teacher was trained for 200 epochs with 4 videos per batch using the raw video frames. We employed Adam optimization (Kingma & Ba, 2015) with a learning rate of \(10^{-5}\). The remaining parameters were set to \(\gamma =0.5\), \(r=0.1\) and \(W=64\), following Kordopatis-Zilos et al. (2019b).
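
The following sketch illustrates how region-level frame descriptors of dimensionality \(D=3840\) can be assembled from the output maps of ResNet-50’s four residual blocks (256 + 512 + 1024 + 2048 channels); the region grid size and pooling choices shown here are illustrative assumptions and not the exact extraction code.

```python
import torch
import torchvision

def frame_region_features(frames: torch.Tensor, grid: int = 3) -> torch.Tensor:
    """Sketch: pool the outputs of ResNet-50's four residual blocks over a
    grid x grid set of regions and concatenate channels (256+512+1024+2048 = 3840)."""
    # weights=None keeps the sketch self-contained; in practice the
    # ImageNet-pretrained weights would be loaded.
    backbone = torchvision.models.resnet50(weights=None).eval()
    maps = []

    x = backbone.conv1(frames)
    x = backbone.maxpool(backbone.relu(backbone.bn1(x)))
    for block in (backbone.layer1, backbone.layer2, backbone.layer3, backbone.layer4):
        x = block(x)
        # max-pool each feature map into a grid x grid set of region vectors
        maps.append(torch.nn.functional.adaptive_max_pool2d(x, grid))

    regions = torch.cat(maps, dim=1)            # (frames, 3840, grid, grid)
    return regions.flatten(2).transpose(1, 2)   # (frames, grid*grid regions, 3840)

with torch.no_grad():
    feats = frame_region_features(torch.randn(4, 3, 224, 224))
print(feats.shape)  # torch.Size([4, 9, 3840])
```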

For the students, we used the same feature extraction process as for the teacher, and the same PCA-whitening layer was used for whitening and dimensionality reduction. We empirically set \(D=512\) as the dimensionality of the reduced region vectors. The students were trained with a batch size of 64 video pairs for 300 epochs, using only the extracted video features. Also, during training, we applied temporal augmentations, i.e., random frame drop, fast forward, and slow motion, each with 0.1 probability. We employed Adam optimization (Kingma & Ba, 2015) with learning rates of \(10^{-5}\) and \(10^{-4}\) for the coarse- and fine-grained students, respectively. For the fine-grained binarization student, we used \(L=512\) bits, and the binarization layer was initialized with the ITQ algorithm (Gong et al., 2012), learned on 1M region vectors sampled from our dataset, as we observed better convergence than with random initialization. For the coarse-grained student’s training, the teacher’s similarities were rescaled to [0, 1], which led to better performance. Also, we used one transformer layer with 8 heads for multi-head attention and 2048 dimensions for the feed-forward network. For the NetVLAD module, we used 64 clusters and a fully-connected layer with 1024 output dimensions and Layer Normalization (Ba et al., 2016). For the fine-grained students’ training, we employed the similarity regularization loss from Kordopatis-Zilos et al. (2019b), weighted with \(10^{-3}\), which yielded marginal performance improvements.
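
A minimal sketch of the temporal augmentations (random frame drop, fast forward, slow motion), each applied with probability 0.1, is given below; the drop ratio and speed factors used here are illustrative assumptions.

```python
import random
import torch

def temporal_augment(frames: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    """Sketch of the temporal augmentations applied to a (T, ...) frame-feature
    sequence; each augmentation is applied independently with probability p."""
    T = frames.shape[0]
    if random.random() < p and T > 4:                     # random frame drop
        keep = sorted(random.sample(range(T), k=int(0.8 * T)))
        frames = frames[keep]
    if random.random() < p:                               # fast forward: keep every 2nd frame
        frames = frames[::2]
    if random.random() < p:                               # slow motion: repeat each frame twice
        frames = frames.repeat_interleave(2, dim=0)
    return frames
```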

For the selector, we used the same feature extraction scheme as for the students. It was trained with a batch size of 64 video pairs for 100 epochs, using only the extracted video features. At each epoch, we sampled 5000 video pairs from each class. We employed Adam optimization (Kingma & Ba, 2015) with a learning rate of \(10^{-4}\). For the fully-connected layers of the MLP, we used 100 hidden units and a 0.5 dropout rate. For the training of the selector model, the similarities of the fine-grained student were rescaled to [0, 1] to match the similarities calculated by the coarse-grained student. Finally, we used a threshold of \(t=0.2\) for the class separation, unless stated otherwise.

5 Experiments

In this section, the experimental results of the proposed approach are provided. First, a comprehensive ablation study on the FIVR-5K dataset is presented, evaluating the proposed students and the overall approach under different configurations to gain better insight into its behaviour (Sect. 5.1). Then, we compare the performance and requirements of the developed solutions against several methods from the literature on the four benchmark datasets (Sect. 5.2).

5.1 Ablation Study

Table 1 Comparison of the teacher \({\textbf {T}}\) and the students \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), in terms of mAP on FIVR-5K and computational requirements, i.e., storage space in KiloBytes (KB) per video and computational time in seconds (s) per query

5.1.1 Retrieval Performance of the Individual Networks

In Table 1, we show the performance and storage/time requirements of the teacher \({\textbf {T}}\) and the three proposed student networks, namely, \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), trained with the proposed scheme. The fine-grained attention student \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\) achieves the best results on all evaluation tasks, outperforming the teacher \({\textbf {T}}\) by a large margin. Also, the fine-grained binarization student \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\) reports performance very close to the teacher’s on the DSVR and CSVR tasks and outperforms the teacher on the ISVR task, using only quantized features of lower dimensionality than the teacher’s and therefore requiring up to 240 times less storage space. This highlights the effectiveness of the proposed training scheme and the high quality of the collected dataset. Furthermore, both fine-grained students have similar time requirements and are three times faster than the teacher because they process lower-dimensional features. Finally, as expected, the coarse-grained student \({\textbf {S}}^\mathrm{c}\) exhibits the worst performance compared to the other networks, but it has the lowest requirements in terms of both storage space and computational time.

Table 2 Comparison of the teacher \({\textbf {T}}\) and the students \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\) trained on different datasets and training schemes in terms of mAP on FIVR-5K

5.1.2 Distillation Versus Supervision

In Table 2, we show the performance of the teacher \({\textbf {T}}\), trained with supervision on VCDB (as proposed by Kordopatis-Zilos et al. (2019b) and used for our teacher training), and the three proposed students, namely, \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), trained under various setups: (i) with supervision on VCDB (same as the original teacher), (ii) with distillation on VCDB, and (iii) with distillation on the DnS-100K dataset. It is evident that the proposed training scheme using a large unlabelled dataset leads to considerably better retrieval performance than the other setups for all students. Also, it is noteworthy that training the students with supervision, in the same way as the teacher, results in a considerable drop in performance compared to distillation on either dataset. The students achieve better results when trained with DnS-100K instead of VCDB. An explanation for this is that our dataset contains various video relations (not only near-duplicates, as in VCDB) and represents a very broad and diverse domain (in contrast to VCDB, which consists of randomly selected videos), resulting in better retrieval performance for the students.

Table 3 Comparison of students \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), in terms of mAP, trained with different amounts of training data on FIVR-5K

5.1.3 Impact of Dataset Size

In Table 3, we show the performance of the proposed students, namely, \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), in terms of mAP, when they are trained with different percentages of the collected DnS-100K dataset (that is, 25%, 50%, 75%, and 100%). We observe large differences in performance for the fine-grained binarization student \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\) and the coarse-grained student \({\textbf {S}}^\mathrm{c}\): the more data used for training, the better their retrieval results. On the other hand, the performance of the fine-grained attention student \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\) remains relatively steady, regardless of the amount of data used for training. We attribute this behaviour to the fact that \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\) learns to weigh the input features without transforming them; hence, a smaller set of real video pairs with diverse relations, as in our collected dataset, is adequate for robust performance.

5.1.4 Student Performance with Different Teachers

In Table 4, we show the performance of the proposed students, namely, \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), in terms of mAP, when they are trained/distilled using different teachers. More specifically, we use as a teacher: (i) the original teacher \({\textbf {T}}\), leading to the student \({\textbf {S}}^\mathrm{f(1)}_{\mathcal {A}}\), (ii) the fine-grained attention student \({\textbf {S}}^\mathrm{f(1)}_{\mathcal {A}}\), leading to the student \({\textbf {S}}^\mathrm{f(2)}_{\mathcal {A}}\) (first iteration), and (iii) the fine-grained attention student \({\textbf {S}}^\mathrm{f(2)}_{\mathcal {A}}\) as teacher (second iteration). In the case of the fine-grained students, training with \({\textbf {S}}^\mathrm{f(1)}_{\mathcal {A}}\) and \({\textbf {S}}^\mathrm{f(2)}_{\mathcal {A}}\) yields a large performance boost in comparison to the original teacher \({\textbf {T}}\). More precisely, the fine-grained attention student \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\) exhibits a total improvement of about 0.006 mAP between training with the teacher \({\textbf {T}}\) (i.e., 0.893 mAP on the DSVR task) and with \({\textbf {S}}^\mathrm{f(2)}_{\mathcal {A}}\) (i.e., 0.899 mAP on the DSVR task). The improvement is even more considerable for the fine-grained binarization student: training with \({\textbf {S}}^\mathrm{f(1)}_{\mathcal {A}}\) gives a performance increase of almost 0.01 mAP on the DSVR task, which further improves by 0.007 when training with \({\textbf {S}}^\mathrm{f(2)}_{\mathcal {A}}\). On the other hand, using a better teacher does not improve the performance of the coarse-grained student \({\textbf {S}}^\mathrm{c}\).

Table 4 Comparison of students \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), and \({\textbf {S}}^\mathrm{c}\), in terms of mAP, trained with different teachers on FIVR-5K

5.1.5 Student Performance with Different Settings

In this section, the retrieval performance of the proposed students is evaluated under different design choices.

Fine-Grained Attention Student In Table 5, we show how the adopted attention scheme (\(\ell ^2\)-attention, Sect. 3.2.1, or h-attention, Sect. 3.2.2) affects the performance of the student \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\). Using h-attention leads to considerably better results than the \(\ell ^2\)-attention that was originally used in ViSiL (Kordopatis-Zilos et al., 2019b).

Table 5 Comparison of different attention schemes of the fine-grained attention student \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\), in terms of mAP on FIVR-5K
Table 6 Comparison of different activation functions for fine-grained binarization student \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\), in terms of mAP on FIVR-5K

Fine-Grained Binarization Student In Table 6, we report the retrieval results of the fine-grained binarization student \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\) implemented with different activation functions in the binarization layer, i.e., \(\text {sgn}({\textbf {x}})\), which is not differentiable so the layer weights remain fixed, \(\text {tanh}(\beta {\textbf {x}})\), as proposed in Cao et al. (2017) with \(\beta =10^3\), and the proposed \({{\,\mathrm{\mathbb {E}}\,}}[\text {sgn}({\textbf {x}})]\) (Sect. 3.2.3). The binarization student with the proposed function achieves notably better results on all tasks, especially on ISVR. Moreover, we experimented with different numbers of bits for the region vectors and report the results in Table 7. As expected, larger region hash codes lead to better retrieval performance. Nevertheless, the student achieves competitive retrieval performance even with a low number of bits per region vector.
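
For illustration, a binarization layer using the \(\text {tanh}(\beta {\textbf {x}})\) relaxation of Cao et al. (2017) during training and \(\text {sgn}({\textbf {x}})\) at inference could look as follows; the proposed \({{\,\mathrm{\mathbb {E}}\,}}[\text {sgn}({\textbf {x}})]\) activation (Sect. 3.2.3) and the ITQ-based initialization are not reproduced in this sketch.

```python
import torch
import torch.nn as nn

class BinarizationLayer(nn.Module):
    """Sketch of a binarization layer: a linear projection to L bits followed by
    tanh(beta * x) during training (Cao et al., 2017) and sgn(x) at inference."""
    def __init__(self, dim: int = 512, bits: int = 512, beta: float = 1e3):
        super().__init__()
        self.proj = nn.Linear(dim, bits, bias=False)  # would be ITQ-initialized in practice
        self.beta = beta

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.proj(x)
        # differentiable surrogate while training, hard binary codes at inference
        return torch.tanh(self.beta * x) if self.training else torch.sign(x)
```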

Table 7 Comparison of the fine-grained binarization student \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\) implemented with different numbers of bits per region vector, in terms of mAP on FIVR-5K
Table 8 Comparison of different design choices of our coarse-grained student \({\textbf {S}}^\mathrm{c}\), in terms of mAP on FIVR-5K

Coarse-Grained Student In Table 8, we report the performance of the coarse-grained student \({\textbf {S}}^\mathrm{c}\) implemented with various combinations of its components. The proposed setup with all three components achieves the best results compared to the other configurations. The single component that provides the best results is the transformer network, followed by NetVLAD. Also, the attention mechanism provides a considerable boost in performance when applied. The second-best performance is achieved by the combination of the transformer module with NetVLAD.

5.1.6 Selector Network Performance

In this section, the performance of the proposed selector network is evaluated in comparison with the following approaches: (i) a selection mechanism that applies naive similarity thresholding for choosing between the coarse-grained and the fine-grained student, (ii) an oracle selector, where the similarity difference between the fine-grained and coarse-grained student is known and used for the re-ranking of video pairs, and (iii) a random selector that sends videos with a fixed probability to either the coarse-grained or the fine-grained student. Figure 9 illustrates the performance of the DnS approach in terms of mAP with respect to the percentage of video pairs from the evaluation dataset sent to the fine-grained student; the closer a curve is to the upper left corner, the better the performance. For this experiment, we used the proposed fine-grained attention student \({\textbf {S}}^\mathrm{f}_{\mathcal {A}}\) and the coarse-grained student \({\textbf {S}}^\mathrm{c}\). All three approaches outperform the random selector by a large margin at all dataset percentages. The oracle selector performs best by a considerable margin, highlighting that the similarity difference between the two students (Sect. 3.3.3) is a good optimization criterion. Furthermore, the proposed selector network outperforms similarity thresholding on all tasks and percentages, with a large margin at lower dataset percentages (\(<25\%\)). It achieves more than 0.85 mAP on the DSVR task with only 10% of the video pairs in FIVR-5K sent to the fine-grained student.
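
For clarity, the sketch below outlines how the selector can be used at query time under a fixed re-ranking budget: every candidate receives a coarse-grained similarity, and only the fraction of pairs with the highest selector scores is re-scored by the fine-grained student. The function and variable names are hypothetical.

```python
import numpy as np

def dns_retrieval(coarse_sims, selector_scores, fine_similarity_fn, candidates, budget=0.1):
    """Sketch of Distill-and-Select retrieval for one query: all candidates keep
    their coarse-grained similarity, and the fraction `budget` with the highest
    selector scores is re-ranked with the (expensive) fine-grained student."""
    sims = np.array(coarse_sims, dtype=float)
    k = int(round(budget * len(candidates)))
    refine = np.argsort(-np.asarray(selector_scores))[:k]   # pairs deemed "unreliable"
    for idx in refine:
        sims[idx] = fine_similarity_fn(candidates[idx])      # overwrite with fine-grained score
    return np.argsort(-sims)                                  # final ranking of candidate indices
```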

5.1.7 Impact of Threshold on the Selector Performance

In this section, we assess the impact of the threshold parameter t that is used to obtain binary labels for the selector network [see Sect. 3.3.3, Eq. (12)], on the retrieval performance. To do so, we report the mAP as a function of the dataset percentage sent to the fine-grained student for re-ranking—we do so for selectors trained with different values of t in order to compare the curves. The results are shown in Fig. 10. The best results are obtained for \(t=0.2\); however, the performance is rather stable for thresholds between 0.1 and 0.4, as well. For threshold values \(>0.4\), the performance drops considerably on all evaluation tasks.

Fig. 9 mAP with respect to the dataset percentage sent to the fine-grained student for re-ranking based on four selectors: (i) the proposed selector network, (ii) a selector with naive similarity thresholding, (iii) an oracle selector, ranking videos based on the similarity difference between the two students, and (iv) a random selector

Fig. 10 mAP with respect to the dataset percentage sent to the fine-grained student for re-ranking based on our selector network trained with different values of the threshold t

Table 9 mAP comparison of our proposed students and re-ranking method against several video retrieval methods on four evaluation datasets

5.2 Comparison with State of the Art

In this section, the proposed approach is compared with several methods from the literature on four datasets. In all experiments, the fine-grained attention student \({\textbf {S}}^\mathrm{f(2)}_{\mathcal {A}}\) is used as the teacher. We report the results of our re-ranking DnS scheme using both fine-grained students and sending 5% and 30% of the dataset videos per query for re-ranking based on our selector score. We compare its performance with several coarse-grained, fine-grained, and re-ranking approaches: ITQ (Gong et al., 2012) and MFH (Song et al., 2013) are two unsupervised and CSQ (Yuan et al., 2020) a supervised video hashing method using Hamming distance for video ranking; BoW (Cai et al., 2011) and LBoW (Kordopatis-Zilos et al., 2017a) extract video representations based on BoW schemes with tf-idf weighting; DML (Kordopatis-Zilos et al., 2017b) extracts a video embedding based on a network trained with DML; R-UTS-GV and R-UTS-FRP (Liang et al., 2019) are a coarse- and a fine-grained method trained with a teacher–student setup distilling feature representations; TCA\(_c\) and TCA\(_f\) (Shao et al., 2021) are a coarse- and a fine-grained method using a transformer-based architecture trained with contrastive learning; TMK (Poullot et al., 2015) and LAMV (Baraldi et al., 2018) extract spatio-temporal video representations based on the Fourier transform, which are also combined with QE (Douze et al., 2013); TN (Tan et al., 2009) employs a temporal network to find video segments with large similarity; DP (Chou et al., 2015) is a dynamic programming scheme for similarity calculation; A-DML (Wang et al., 2021) assesses video similarity by extracting multiple video representations based on a multi-head attention network; PPT (Chou et al., 2015) is a re-ranking method with a BoW-based indexing scheme combined with DP for re-ranking; HM (Liang & Wang, 2020) is also a re-ranking method, using a concept-based similarity and a BoW-based method for refinement; and, finally, we include our re-implementation of ViSiL (Kordopatis-Zilos et al., 2019b). Of the aforementioned methods, we have re-implemented BoW, TN, and DP, and we use the publicly available implementations of ITQ, MFH, CSQ, DML, TMK, and LAMV. For the rest, we provide the results reported in the original papers. Also, for a fair comparison, we have implemented (where possible) the publicly available methods using our extracted features.

Table 10 Performance in mAP, storage requirements in KiloBytes (KB), and time requirements in seconds (s) of our proposed students and re-ranking method and of several video retrieval methods implemented with the same features

In Table 9, the mAP of the proposed method is reported in comparison to video retrieval methods from the literature. The proposed students achieve very competitive performance, with state-of-the-art results in several cases. First, the fine-grained attention student achieves the best results on the two large-scale datasets, i.e., FIVR-200K and SVD, outperforming ViSiL (our teacher network) by a large margin, i.e., 0.022 and 0.021 mAP, respectively. It reports almost the same performance as ViSiL on the CC_WEB_VIDEO dataset, and it is slightly outperformed on the EVVE dataset. Additionally, it is noteworthy that the fine-grained binarization student demonstrates very competitive performance on all datasets. It achieves performance similar to ViSiL and the fine-grained attention student on CC_WEB_VIDEO, the second-best results on all three tasks of FIVR-200K, and the third-best on SVD, with a small margin from the second-best. However, its performance is lower than the teacher’s on the EVVE dataset, highlighting that feature reduction and hashing have a considerable impact on the student’s retrieval performance on this dataset. Another possible explanation for this performance difference could be that the training dataset does not cover the included events sufficiently.

Second, the coarse-grained student exhibits very competitive performance among coarse-grained approaches on all datasets. It achieves the best mAP on two out of the four evaluation datasets, i.e., SVD and EVVE, reporting performance close to or even better than several fine-grained methods. On FIVR-200K and CC_WEB_VIDEO, it is outperformed by the BoW-based approaches, which are trained with samples from the evaluation sets. However, when they are built with video corpora other than the evaluation one (which simulates a more realistic scenario), their performance drops considerably (Kordopatis-Zilos et al., 2017b, 2019a). Also, their performance on the SVD and EVVE datasets is considerably lower.

Third, our DnS runs maintain competitive performance. Re-ranking only 5% of the dataset with the fine-grained students improves the performance of the coarse-grained student by more than 0.2 mAP on FIVR-200K and 0.02 on SVD. However, on the other two datasets, i.e., CC_WEB_VIDEO and EVVE, the re-ranking has a negative effect on performance. A possible explanation is that the performance of the coarse- and fine-grained students is very close, especially on the EVVE dataset. Also, this dataset consists of longer videos than the rest, which may impact the selection process. Nevertheless, the performance drop on these two datasets is mitigated when 30% of the dataset is sent to the fine-grained students for re-ranking, while on FIVR-200K and SVD, the DnS method reaches the performance of the corresponding fine-grained students, or even outperforms them, i.e., \({\textbf {DnS}}^{30\%}_{\mathcal {B}}\) outperforms \({\textbf {S}}^\mathrm{f}_{\mathcal {B}}\) on the SVD dataset.

Additionally, Table 10 displays the storage and time requirements and the reference performance of the proposed method on each dataset. For comparison, we include the video retrieval methods that are implemented with the same features and run on GPU. For the FIVR-200K and CC_WEB_VIDEO datasets, we display the DSVR and cc_web\(^*_c\) runs, respectively. We have excluded the TN and DP methods, as they have been implemented on CPU and their transfer to GPU is non-trivial. Also, the requirements of the TCA runs from Shao et al. (2021) are approximated based on features of the same dimensionality. All times were measured on a Linux machine with an Intel i9-7900X CPU and an Nvidia 2080Ti GPU.

First, the individual students are compared against the competing methods in their corresponding category. The fine-grained binarization student has the lowest storage requirements among the fine-grained approaches on all datasets, requiring 240 times less storage than the ViSiL teacher. The fine-grained attention student has the second-highest storage requirements but still needs 7.5 times less space than ViSiL, while achieving considerably better retrieval performance on two out of the four evaluation datasets. However, the retrieval time is high for all fine-grained approaches, especially in comparison with the coarse-grained ones. The coarse-grained student, which employs global vectors, has high storage requirements compared to the hashing and BoW methods, which need notably less storage space. In terms of time, all coarse-grained methods require approximately the same time on all datasets, which is several orders of magnitude less than the fine-grained ones.
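
As a back-of-envelope check of the reported 240-fold reduction, assuming the teacher stores \(D=3840\) real-valued components per region vector as 32-bit floats while the binarization student stores \(L=512\) bits per region vector:

```python
# Storage per region vector (float32 for the teacher is an assumption of this check).
teacher_bytes = 3840 * 4      # D = 3840 floats  -> 15,360 bytes
binary_bytes = 512 // 8       # L = 512 bits     -> 64 bytes
print(teacher_bytes / binary_bytes)  # 240.0, matching the reported ~240x reduction
```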

Second, we benchmark our DnS approach with the two fine-grained students and two dataset percentages sent for refinement. An excellent trade-off between time and performance comes with \({\textbf {DnS}}^{5\%}_{\mathcal {B}}\), which offers an acceleration of more than 17 times in comparison to the fine-grained students at a small cost in terms of performance. Combined with the fine-grained binarization student, on FIVR-200K it offers 55 times faster retrieval and 240 times lower storage requirements compared to the original ViSiL teacher, while providing comparable retrieval performance, i.e., a 0.041 relative drop in terms of mAP. The performance of DnS increases considerably when 30% of the video pairs are sent for re-ranking, outperforming ViSiL on two datasets with considerable margins. However, this performance improvement comes with a corresponding increase in retrieval time.

6 Conclusion

In this paper, we proposed a video retrieval framework based on Knowledge Distillation that addresses the performance-efficiency trade-off, with a focus on large-scale datasets. In contrast to typical video retrieval methods that rely either on a high-performance but resource-demanding fine-grained approach or on a computationally efficient but low-performance coarse-grained one, we introduced a Distill-and-Select approach. Several student networks were trained via a Teacher–Student setup at different performance-efficiency trade-offs. We experimented with two fine-grained students, one with a more elaborate attention mechanism that achieves better performance and one using a binarization layer that offers high performance with significantly lower storage requirements. Additionally, we trained a coarse-grained student that provides very fast retrieval with low storage requirements but at a high cost in performance. Once the students were trained, we combined them using a selector network that directs samples to the appropriate student in order to achieve high performance with high efficiency. It was trained based on the similarity difference between a coarse-grained and a fine-grained student so as to decide at query-time whether the similarity calculated by the coarse-grained one is reliable or whether the fine-grained one needs to be applied. The proposed method has been benchmarked on a number of content-based video retrieval datasets, where it improved the state of the art in several cases and achieved very competitive performance with a remarkable reduction in computational requirements.

The proposed scheme can be employed in several setups, depending on the requirements of the application. For example, when small-scale databases are involved, with no strict restrictions on storage space and computation time, the fine-grained attention student could be employed, since it achieves the best retrieval performance. For mid-scale databases, where the storage requirements increase, the fine-grained binarization student would be a reasonable option, since it achieves very high retrieval performance with a remarkable reduction in storage space requirements. Finally, for large-scale databases, where both storage space and computation time are an issue, the combination of the fine-grained binarization student and the coarse-grained student with the selector network would be an appropriate solution, offering high retrieval performance and high efficiency.

In the future, we plan to investigate alternatives for the better selection and re-ranking of video pairs based on our selector network, by exploiting the ranking of videos derived from the two students. Also, we will explore better architectural choices for the development of the coarse-grained student to further improve the system’s scalability with little compromise in retrieval performance.