1 Introduction

Vehicle retrieval aims to find a target vehicle in a large image gallery given a query image; this image-to-image matching technique is also known as vehicle re-identification [2,3,4,5]. It has promising prospects for building intelligent transportation systems (ITS) [6,7,8,9] in smart cities [10]. However, an image-based vehicle retrieval system also has practical drawbacks. For example, such a system needs an image to provide the characteristics of the target vehicle, which is not always easy to obtain in the real world. Its performance may also be limited because a single modality has to provide all spatial and temporal information.

To alleviate these problems, natural language (NL), another essential modality in the real world, has received increasing attention from researchers in recent years. A natural language-based vehicle retrieval system aims to identify the target vehicle from an NL description. Such a system covers a broader range of application scenarios, such as finding a vehicle when a bystander can provide only an informal description. Most current natural language vehicle retrieval methods build a text encoder and a visual encoder to extract features from the two data types, and then project the resulting text and visual embeddings into the same latent space to compare their similarity. In addition, these methods carefully preprocess both the visual and the NL data to obtain more effective representations. For example, vehicle track images are cropped to generate a global motion image [11,12,13,14], while keywords related to vehicle attributes (e.g., colour, vehicle type and orientation) are extracted from the given NL query [11, 12, 14, 15]. Although these works achieve acceptable performance on the CityFlow-NL [16] benchmark, they can still be improved, especially on the NL side. We find that existing methods usually rely on dependency analysis (e.g., using the NLTK package) or semantic role labelling to decide whether a word is a keyword. These techniques only assign a part of speech to each word in the sentence, so pre-determined rules and post-processing are required to map the extracted keywords to the corresponding vehicle attributes, making the whole process complex [15, 17]. Such methods may also extract the wrong keywords when the NL description is complex, which leads to error propagation in subsequent modules and reduces model performance.

Table 1 Datasets of Vehicle Retrieval

In fact, keyword extraction is already a mature technology in natural language processing (NLP), where it is known as named entity recognition (NER). The main obstacle to applying a state-of-the-art NER model to the above problem is the lack of a domain-specific corpus with high-quality annotations. To alleviate this problem, we propose FindVehicle, a named-entity-labelled natural language dataset focused on the traffic domain. It consists of descriptions of vehicles from the point of view of urban traffic surveillance cameras. Some example descriptions from our dataset are shown in Table 1, where we also compare them with instances selected from other natural language datasets in the traffic domain, namely Talk2Car [18] and CityFlow-NL [16]. We carefully construct the vehicle descriptions to match real traffic scenarios and to enrich the detailed information about the target vehicles. Our dataset covers eight types of vehicle features: vehicle location, orientation, brand, model, type, colour, distance from the traffic surveillance camera, and velocity. In contrast, Talk2Car [18] only records vehicle type, and CityFlow-NL [16] has only four types of information: vehicle colour, type, action, and scene. Richer vehicle information in the description text means that the data reflect real-life traffic scenes more accurately while reducing the ambiguity of natural language that makes NL-based vehicle retrieval challenging. Both FindVehicle and CityFlow-NL [16] describe the relationship with other (surrounding) vehicles, so we do not treat the surrounding vehicle as a separate feature. Furthermore, FindVehicle is annotated with multi-granularity named entity labels in order to meet further requirements in the future.

To verify the effectiveness of the proposed dataset, we construct a simple and highly efficient cross-modal vehicle retrieval system called VehicleFinder. Unlike current transformer-based models [19, 20], which have huge numbers of parameters and slow inference, VehicleFinder has only 8.81 million parameters. It can therefore run in real time in practical scenarios and is more friendly to edge devices. VehicleFinder is trained and tested on our homemade text-to-image dataset, Vehicle-TI, built from the training set of UA-DETRAC [1]. The keywords fed into VehicleFinder are extracted by an NER model pre-trained on FindVehicle. The experimental results show that VehicleFinder achieves 87.7% precision and 89.4% recall when identifying target vehicles, with a latency of 279.35 ms on one 8-core ARM v8.2 CPU.

To conclude, the main contributions of this paper include:

  1. We propose the first NER dataset (benchmark) in the traffic domain, called FindVehicle, which has 42.3 thousand sentences, 1.361 million tokens, 202.5 thousand entities and 21 entity types. FindVehicle contains both flat and overlapped entities, as well as both coarse-grained and fine-grained entity types.

  2. We propose a text-image cross-modal vehicle retrieval system called VehicleFinder to prove the effectiveness of our proposed NER dataset. VehicleFinder is a highly efficient model with favourable performance that can run in real time and be deployed on edge devices.

  3. For the experiments, we construct a text-image vehicle matching dataset called Vehicle-TI, which has 335,040 training samples, 179,520 test samples and 83,776 validation samples.

The rest of this paper is organized as follows: Section 2 presents the related work; Section 3 describes FindVehicle and how we construct it; Section 4 presents the statistics of FindVehicle; Section 5 presents VehicleFinder, our text-image cross-modal vehicle retrieval system; Section 6 reports the baselines of FindVehicle; Section 7 presents the experimental details of VehicleFinder; Section 8 concludes the paper and outlines our future work; Section 9 discusses the challenges of FindVehicle and the limitations of VehicleFinder.

2 Related work

2.1 Named entity recognition

Named entity recognition (NER) is a classical sequence tagging task in NLP. Its goal is to locate and classify the words or phrases of specific types in a text. The input of an NER model is a sequence with part-of-speech (POS) tags, as shown in Equation 1,

$$\begin{aligned} WT = (w_1, t_1), (w_2, t_2), \dots , (w_i, t_i), \dots , (w_n, t_n) \end{aligned}$$
(1)

where n denotes the number of words produced by the word segmentation program and \(t_i\) is the POS tag of the word \(w_i\).

Based on word segmentation and POS tagging, NER splits, combines (determining entity boundaries) and reclassifies (determining entity categories) words. The output is an optimal sequence \(WC^*, TC^*\) of (word category (WC), tagging category (TC)) pairs, as shown in Equation 2,

$$\begin{aligned} WC^*, TC^* = (wc_1, tc_1), (wc_2, tc_2), \dots , (wc_i, tc_i), \dots , (wc_m, tc_m) \end{aligned}$$
(2)

where \(m \le n\), \(wc_i = [w_j,\dots ,w_{j+k}]\), \(tc_i=[t_j, \dots , t_{j+k}]\), \(1 \le k\), \(j + k \le n\).
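For instance (a hypothetical worked example using Penn Treebank POS tags), the fragment "a red BMW X5" would enter the model as POS-tagged word pairs and leave it with |BMW X5| combined into one unit:

$$\begin{aligned} WT&= (\text {a}, \text {DT}), (\text {red}, \text {JJ}), (\text {BMW}, \text {NNP}), (\text {X5}, \text {NNP}) \\ WC^*, TC^*&= ([\text {a}], [\text {DT}]), ([\text {red}], [\text {JJ}]), ([\text {BMW}, \text {X5}], [\text {NNP}, \text {NNP}]) \end{aligned}$$

The reclassification step then assigns entity categories to the combined spans, for example mapping |red| to vehicle_color and |BMW X5| to vehicle_type-suv in FindVehicle's label set.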

In brief, the NER model can be written as Equation 3 shows,

$$\begin{aligned} (WC^*, TC^*) = \mathop {\arg \max }\limits _{(WC, TC)} P(WC, TC \mid W, T) \end{aligned}$$
(3)

where W is the word sequence while T is the tagging sequence. \(P(\cdot )\) is a conditional probability model.

Hidden Markov Models [21] and Conditional Random Fields [22] are two typical machine learning models for NER. Deep learning models such as convolutional neural networks [23], recurrent neural networks [24], transformers [25] and graph neural networks [26] all achieve state-of-the-art results in NER.

Moreover, many NER datasets have been proposed in past years; [27,28,29,30,31] are well-known NER datasets (benchmarks). These datasets mainly contain three kinds of named entities: flat, overlapped and discontinuous entities. [32] proposed a unified neural framework to solve the three NER problems concurrently.

2.2 Text-image vehicle retrieval

Vehicle retrieval based on text-image cross-modal learning has been a research hot spot in recent years [11, 13,14,15, 33,34,35,36,37]. Given a textual description, the model finds the vehicle that matches it best. There are mainly two architectural paradigms. The first is the end-to-end neural network based on early retrieval, where image and text features are fused at an early stage. The second is the non-end-to-end system based on late retrieval, where image and text features are extracted individually and then loaded into a decision module.

2.3 Contrastive language image pretraining

Contrastive language image pretraining (CLIP) combines the language and image modalities in one neural network and mainly targets multi-modal tasks that span natural language and computer vision. Before CLIP, most computer vision models were trained on pre-defined labels, and this form of supervision limited the generalization and usefulness of neural networks. In the field of NLP, a large body of work uses massive corpora for self-supervised learning, and such models have surpassed those trained on manually labelled datasets [38, 39]. In computer vision, the mainstream method is still to pre-train on large-scale labelled datasets [40]. Vanilla CLIP [19] creatively uses text as a supervision signal to train a vision model and achieves conspicuous results on ImageNet [40]; it is also very good at zero-shot tasks. [20] proposes a CLIP framework called DenseCLIP, which excels at dense prediction tasks such as semantic segmentation and dense object detection. [41] proposes a new contrastive loss to normalize the location and geometric information of image and text features in the semantic space.

3 The construction of FindVehicle

3.1 Brief introduction

FindVehicle is the first NER dataset in the traffic domain. It is built on the image samples of UA-DETRAC [1] and contains various descriptions of traffic participants on the road, mainly vehicles, from the view of traffic surveillance cameras. A description contains many attributes of one or several vehicles, all of which can be detected by traffic sensors such as surveillance cameras, lidar and radar. Moreover, FindVehicle incorporates a great deal of real-world prior knowledge, such as vehicle brands and models. Furthermore, FindVehicle contains both coarse-grained and fine-grained entities, as well as both flat and overlapped entities.

Fig. 1

Entity types and an annotated sample of FindVehicle. Images are from UA-DETRAC [1]

3.2 Entity types

As Fig. 1 shows, there are 21 entity types in FindVehicle: 8 coarse-grained and 13 fine-grained. These entity types are all vehicle attributes, and their values follow real-world distributions. Moreover, FindVehicle contains both flat and overlapped entities.

3.2.1 Coarse-grained entity

There are 8 kinds of coarse-grained entities, including vehicle_location, vehicle_orientation, vehicle_brand, vehicle_model, vehicle_type, vehicle_color, vehicle_range and vehicle_velocity.

vehicle_location indicates the locations of vehicles from the view of the traffic surveillance cameras, such as |bottom right|, |top-left|, etc.

vehicle_orientation indicates the directions of vehicles’ heads from the view of the traffic surveillance cameras, such as |this way|, |away|, etc.

vehicle_brand indicates the brands of vehicles. FindVehicle contains 65 vehicle brands all over the world.

vehicle_model indicates the models of vehicle brands. There are 4793 models of different vehicle brands in FindVehicle. For example, |Q7| is one of the models of |Audi|.

vehicle_type indicates the types of vehicles, such as |sedan|, |suv|, etc.

vehicle_color indicates the colors of vehicles, such as |silver grey|, |rose red|, etc.

vehicle_range indicates the distance between the vehicle and the traffic surveillance camera, such as |18m|, |123 meters|, etc.

vehicle_velocity indicates the speed of the moving vehicle on the road, such as |50 kilometres per hour|, |120 km/h|, etc.

3.2.2 Fine-grained entity

As shown in Fig. 1, there are 13 kinds of fine-grained entities in FindVehicle, all of which belong to the coarse-grained entity vehicle_type; for example, |BMW X5| is a fine-grained entity of vehicle_type-suv. Fine-grained entities capture human prior knowledge about cars.

3.2.3 Flat and overlapped entity

Overlapped entities exist among the coarse-grained entities vehicle_brand and vehicle_model and the fine-grained entities vehicle_type-*. For example, as Fig. 2 shows, |BMW| is labelled vehicle_brand and |X5| is labelled vehicle_model, while for a car enthusiast the whole span |BMW X5| is labelled vehicle_type-suv.

Fig. 2

An example of flat and overlapped entities

3.3 Corpus collection

As Fig. 4 shows, corpus collection includes two parts: the corpus with simple context and the corpus with complex context. The corpus with simple context consists of short sentences, such as those presented in the Data Samples column of Table 1. As Fig. 3 presents, we first sample some target vehicles with bounding boxes and labels from UA-DETRAC [1]. Based on these samples, we create a relational table to store the attributes of the corresponding vehicles, where each item represents one vehicle with several attributes. To increase the complexity of the dataset, we replace some formal phrase-type and word-type entities with informal expressions that reflect everyday habits, and add some rare entities that do not exist in UA-DETRAC [1]. Moreover, to generate the entities vehicle_brand, vehicle_model and vehicle_type-*, we invite three car enthusiasts to collect and integrate data based on their extensive car knowledge and Wikipedia search results; they write data with different expressions and curate 65 vehicle brands, 4793 vehicle models and 13 vehicle types in total. Secondly, we recruit four volunteers, all well-educated with adequate English linguistic knowledge, to write descriptive sentence patterns in their own tone and expression habits. Thirdly, our sentence auto-generation framework inserts the target vehicles and their attributes into these patterns.
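The sketch below illustrates the idea of such template-based sentence auto-generation. It is only an illustration under assumed attribute names and patterns, not the authors' released framework, which uses many more volunteer-written patterns.

```python
# A minimal sketch of template-based sentence auto-generation as described above.
# The patterns, attribute names and values are illustrative assumptions only.
import random

PATTERNS = [
    "There is a {color} {brand} {model} {type} in the {location}, heading {orientation}.",
    "A {color} {type} about {range} from the camera is driving {orientation} at {velocity}.",
]

vehicle = {  # one row of the relational attribute table (hypothetical values)
    "location": "top-left", "orientation": "away", "brand": "Audi", "model": "Q7",
    "type": "suv", "color": "silver grey", "range": "35 meters", "velocity": "60 km/h",
}

def generate_sentence(vehicle: dict) -> str:
    """Fill a randomly chosen sentence pattern with one vehicle's attributes."""
    return random.choice(PATTERNS).format(**vehicle)

print(generate_sentence(vehicle))
```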

Fig. 3

The generation of corpus with simple context

As the sample in Fig. 1 presents, the corpus with complex context consists of long narrative sentences or paragraphs coloured by the writers' subjective emotions and imagination. Unlike the corpus with simple context, which is generated by combining human labour and computers, the corpus with complex context is written entirely by humans: four members of our team write the corresponding sentences and paragraphs in their own styles while observing the images in UA-DETRAC [1].

Fig. 4

The framework of corpus collection and annotation of FindVehicle

Fig. 5

The two annotation formats of FindVehicle

3.4 NER annotation

As Fig. 4 shows, our NER annotation framework has two processes, one for the corpus with simple context and one for the corpus with complex context. Annotations of the corpus with simple context are produced together with the sentences by our annotation auto-generation framework. After that, a correction framework automatically checks whether the auto-generated NER annotations contain errors. If an error is found, the annotation process is interrupted and the error location is reported so that we can check and fix it; otherwise, the annotated corpus is loaded into the dataset directly.

The annotations of the corpus with complex context are totally manual. They are based on the common sense and knowledge of annotators. Annotators are all volunteers who are knowledgeable about vehicles and good at narrative writing.

As Fig. 5 shows, we organize the data in two formats, JSON and CoNLL-style [27]. The value of the key ner_label holds the annotated named entities; each element is [entity type, start index of char span, end index of char span, start index of token span, end index of token span]. Our annotation therefore covers both the char level and the token level, meeting the different needs of NER models. The key re_label lists the indexes of the ner_label entries that refer to the same target within the context of a sentence.
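For illustration, a JSON record in this format might look as follows. The exact key names other than ner_label and re_label, the index conventions (here 0-based with inclusive ends) and all values are assumptions for illustration; the released FindVehicle files are authoritative.

```python
# Illustrative record following the schema described above (values are hypothetical).
record = {
    "text": "A silver grey BMW X5 is driving away.",
    "ner_label": [
        # [entity type, char start, char end, token start, token end]
        ["vehicle_color",        2, 12, 1, 2],
        ["vehicle_brand",       14, 16, 3, 3],
        ["vehicle_model",       18, 19, 4, 4],
        ["vehicle_type-suv",    14, 19, 3, 4],   # overlapped entity spanning "BMW X5"
        ["vehicle_orientation", 32, 35, 7, 7],
    ],
    "re_label": [[0, 1, 2, 3, 4]],  # all five entities refer to the same vehicle
}
```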

4 Data statistics of FindVehicle

4.1 Size and distribution of FindVehicle

FindVehicle is the first NER dataset in the traffic domain whose annotations combine automatic and manual labelling. As Table 2 shows, we compare the statistics of FindVehicle with other widely used NER datasets, including CoNLL'03 [27], WikiGold [28], WNUT'17 [29], I2B2 [42] and OntoNotes [30]. FindVehicle has 42.3 thousand sentences, 1.361 million tokens, 202.5 thousand entities and 21 entity classes. As Fig. 6 presents, the entity types follow a long-tail distribution that reflects real-world traffic scenarios.

Table 2 Statistics of FindVehicle and other well-known NER datasets
Fig. 6

Statistics by entities in FindVehicle

4.2 Dataset split

FindVehicle is a hybrid NER dataset containing both flat and overlapped entities. We split it into a training set and a test set, whose details are shown in Table 3. The training set contains 84.6k coarse-grained entities and 18.2k fine-grained entities, or equivalently 84.2k flat entities and 18.6k overlapped entities. The test set contains 82.5k coarse-grained entities and 17.4k fine-grained entities, or equivalently 82.7k flat entities and 17.2k overlapped entities.

Table 3 Data Split of FindVehicle

5 VehicleFinder

VehicleFinder is a lightweight text-image cross-modal vehicle retrieval system. Users can retrieve the target vehicle by describing its type, colour and orientation. As Fig. 7 presents, VehicleFinder has two branches: one extracts proposals with a vision detector, and the other extracts named entities with a text detector. We adopt NanoDet [43] as the vision detector and BiLSTM-CRF [24] as the text detector; NanoDet [43] is pretrained on UA-DETRAC [1] while BiLSTM-CRF [24] is pretrained on our FindVehicle. The proposals and named entities are then loaded into the contrastive text-image module (CTIM), which compares the semantic similarity of the two modalities.

As Fig. 8 shows, CTIM has two encoder branches that encode the image and text modalities, respectively. The output of CTIM is the similarity between the image and the text, whose value lies between 0 and 1: an output below 0.5 indicates that the image and text are unrelated, while an output above 0.5 indicates that they are related. CTIM is a fully convolutional module in which all convolution operations are depthwise separable convolutions [44], dramatically reducing the number of parameters, especially in the deep layers of the network. CTIM can also serve as a plug-and-play module in other cross-modal systems.

The image encoder branch consists of five identical encoder units. An encoder unit first feeds the input feature map \(x_i \in R^{c \times h \times w}\), where c, h and w denote the channel, height and width of the feature map, into three branches: a \(3 \times 3\) convolution, a \(1 \times 1\) convolution and a batch-normalized identity path, which extract features with different receptive fields. Their sum \(\hat{x}_i \in R^{c \times h \times w}\) is activated by ReLU [45]; a depthwise separable convolution with a \(3 \times 3\) kernel then increases the channels and reduces the spatial size, and after batch normalization the output feature map is \(x_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}}\). To alleviate gradient vanishing and explosion during training, \(x_{i+1}\) is activated by ReLU and a long residual path with a depthwise separable convolution applied to \(x_i\) is added, yielding the final output of the encoder unit \(\hat{x}_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}}\). The whole process is presented in Equation 4.

$$\begin{aligned} \begin{aligned}&\hat{x}_i = BN(Conv_{3\times 3}(x_i)) + BN(Conv_{1\times 1}(x_i)) + BN(x_i), \hat{x}_i \in R^{c \times h \times w} \\&x_{i+1} = BN(Conv_{3\times 3}(ReLU(\hat{x}_i))), x_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}} \\&\hat{x}_{i+1} = ReLU(x_{i+1}) + Conv_{3\times 3}(x_i), \hat{x}_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}} \end{aligned} \end{aligned}$$
(4)
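The following PyTorch sketch reconstructs one encoder unit from Equation 4. It is an illustration based on the text above rather than the authors' implementation; the exact stride, padding and layer naming are assumptions.

```python
# Minimal sketch of one image-encoder unit following Equation 4 (assumptions noted above).
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class EncoderUnit(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.branch3 = nn.Sequential(DSConv(c, c), nn.BatchNorm2d(c))      # 3x3 branch
        self.branch1 = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c))  # 1x1 branch
        self.branch_id = nn.BatchNorm2d(c)                                  # identity branch
        self.down = nn.Sequential(DSConv(c, 2 * c, stride=2), nn.BatchNorm2d(2 * c))
        self.residual = DSConv(c, 2 * c, stride=2)                          # long residual path

    def forward(self, x):
        x_hat = self.branch3(x) + self.branch1(x) + self.branch_id(x)   # first line of Eq. 4
        x_next = self.down(torch.relu(x_hat))                           # second line of Eq. 4
        return torch.relu(x_next) + self.residual(x)                    # third line of Eq. 4

unit = EncoderUnit(c=16)
print(unit(torch.randn(1, 16, 64, 64)).shape)   # -> torch.Size([1, 32, 32, 32])
```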
Fig. 7

The architecture of VehicleFinder

Fig. 8

The architecture of the Contrastive Text-Image Module (CTIM). All convolution operations are depthwise separable convolutions, except for those with a \(1 \times 1\) kernel, since a depthwise separable convolution already contains a \(1 \times 1\) convolution. \(c_i\) denotes the channel number of the feature map whose kernel height is i, and \(w_i\) denotes the width of the feature map whose kernel height is i

In the text encoder branch, named entities are first embedded with the pretrained FastText embeddings (wiki-news-300d-1M) [46]. FastText can infer embeddings for words outside its dictionary from subword information, which makes the system more robust than with Word2vec [47] or GloVe [48]. The shape of the embedding matrix is \(d \times 300\), where d is the number of named entities and 300 is the vector length of each named entity. We then apply four groups of multi-scale depthwise separable convolution operations to extract features at different scales concurrently. The first group consists of n convolutions with kernel size \(1 \times w_1\), which extract the features of single words in the named entities. The second group has one convolution with kernel size \(2 \times w_2\) and \(n-1\) convolutions with kernel size \(1 \times w_1\); the \(2 \times w_2\) kernel extracts the associated features of adjacent words, while the remaining \(1 \times w_1\) convolutions enhance the non-linear representation. The third group first applies a convolution with a \(3 \times w_3\) kernel, which extracts adjacent-word features with a word window of three, followed by the same operations as the second group. The fourth group is a single convolution with kernel size \(d \times w_d\), which extracts the feature of the global context. Finally, the outputs of the four groups are added to obtain a comprehensive representation of the named entities. The four convolution operations are shown in Fig. 9.

Fig. 9

The four multi-scale convolution operations in our text encoder

After obtaining the representations of the proposal and the named entities, we align their shapes and calculate their cosine distance, which measures the distance between the proposal vector and the named-entity vector. It maintains the same similarity in high-dimensional cases as in low-dimensional cases and is a robust indicator of the relative difference in direction. Equation 5 shows the cosine distance.

$$\begin{aligned} CosineD = \frac{{\textbf {A}} \cdot {\textbf {B}}}{\Vert {\textbf {A}} \Vert \Vert {\textbf {B}} \Vert } = \frac{\sum ^n_{i=1}A_iB_i}{\sqrt{\sum ^n_{i=1}A_i^2}\sqrt{\sum ^n_{i=1}B_i^2}}, CosineD \in [-1, 1] \end{aligned}$$
(5)

where n is the number of vector components, and \(A_i\) and \(B_i\) denote the ith components of the text and image vectors, respectively.

However, the value domain of cosine distance is \([-1, 1]\), which means its result cannot be fed directly to the binary cross entropy loss (BCE loss), because BCE loss (Equation 6) cannot process negative numbers.

$$\begin{aligned} L_{BCE} = - \sum ^N_{i=1} [y_iln(\hat{y}_i) + (1-y_i)ln(1-\hat{y}_i)] \end{aligned}$$
(6)

where N indicates the number of samples in a batch, \(y_i \in \{0, 1\}\) is the ground truth, and \(\hat{y}_i \in [-1, 1]\) is the cosine distance predicted by the neural network. Clearly, \(\hat{y}_i\) falls outside the domain of \(ln(\cdot )\) whenever \(\hat{y}_i\) is below zero.

Therefore, as Equation 7 presents, we use a linear compression function to map the cosine distance from \([-1, 1]\) to [0, 1] so that it can be fed to the BCE loss. The linear compression function is monotonically increasing with value domain [0, 1] and is differentiable everywhere. Monotonicity ensures that the relative ordering of the values does not change when they are mapped from \([-1, 1]\) to [0, 1], and differentiability everywhere ensures that the function participates well in backpropagation.

$$\begin{aligned} Comp(x) = \frac{1}{2}x + \frac{1}{2}, Comp(x) \in [0, 1] \end{aligned}$$
(7)

where \(x \in [-1, 1]\) denotes the result calculated by cosine distance.

Therefore, the complete form of the loss function is presented in Equation 8.

$$\begin{aligned} L(y_i, \hat{y}_i) = - \sum ^N_{i=1} [y_iln(Comp(\hat{y}_i)) + (1-y_i)ln(1-Comp(\hat{y}_i))] \end{aligned}$$
(8)
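A minimal PyTorch sketch of Equations 5-8 is given below: cosine similarity between the aligned text and image embeddings is linearly compressed to [0, 1] and fed to the BCE loss. Variable names are illustrative; the authors' training code may differ.

```python
# Sketch of the CTIM training objective (Equations 5-8); names are assumptions.
import torch
import torch.nn.functional as F

def ctim_loss(text_emb, img_emb, labels):
    """text_emb, img_emb: (N, D) aligned embeddings; labels: (N,) in {0, 1}."""
    cosine = F.cosine_similarity(text_emb, img_emb, dim=1)   # Equation 5, in [-1, 1]
    score = 0.5 * cosine + 0.5                               # Equation 7, compress to [0, 1]
    return F.binary_cross_entropy(score, labels.float())     # Equations 6 and 8

# Toy usage with random embeddings.
text_emb = torch.randn(4, 128)
img_emb = torch.randn(4, 128)
labels = torch.tensor([1, 0, 1, 0])
print(ctim_loss(text_emb, img_emb, labels))
```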

Finally, VehicleFinder calculates the similarities between the named entities extracted from the command and the object proposals extracted by the vision detector, and sorts the proposals by this similarity in descending order. The ranked proposals are then fed to a decision module. Considering that users cannot always describe vehicle characteristics in detail, the decision module handles two command patterns separately to enhance the system's robustness and user-friendliness. As Fig. 10 presents, the first is the no-missing-entity pattern, in which the command contains all three named entities vehicle_type, vehicle_color and vehicle_orientation; the second is the missing-entity pattern, in which the command contains only one or two of them.

Fig. 10

Two patterns of commands

As Algorithm 1 presents, we first set two thresholds, \(th_{nm}\) for the no-missing-entity pattern and \(th_{m}\) for the missing-entity pattern. The variable proposals, containing vehicle: sim pairs, holds the object proposals and their similarities with the named entities extracted from the command. For the no-missing-entity pattern, every vehicle: sim pair whose sim is larger than \(th_{nm}\) is appended to retainVehicle. If no pair exceeds \(th_{nm}\), the decision module searches instead for pairs whose sim is larger than \(th_m\) and appends those to retainVehicle. \(th_{nm}\) and \(th_{m}\) are set based on experimental results, and we assume by default that \(th_{nm}\) is greater than \(th_m\).
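Since Algorithm 1 itself is provided as a figure, the sketch below reconstructs the decision logic from the surrounding text only; in particular, applying \(th_m\) directly to the missing-entity pattern is an assumption.

```python
# Sketch of the decision module described above (a reconstruction, not the authors' code).
def decide(proposals, entities, th_nm=0.70, th_m=0.30):
    """proposals: list of (vehicle_id, sim) pairs sorted by sim descendingly.
    entities: dict of extracted entities, e.g. {"vehicle_type": "suv", ...}."""
    required = {"vehicle_type", "vehicle_color", "vehicle_orientation"}
    no_missing = required.issubset(entities)          # no-missing-entity pattern?
    retain = [(v, s) for v, s in proposals if s > th_nm] if no_missing else []
    if not retain:                                    # fall back to the looser threshold
        retain = [(v, s) for v, s in proposals if s > th_m]
    return retain

# Example: keep proposals above 0.70 for a full description, else relax to 0.30.
print(decide([("car_3", 0.82), ("car_7", 0.41)],
             {"vehicle_type": "suv", "vehicle_color": "black",
              "vehicle_orientation": "away"}))
```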

Algorithm 1

Decision Module

6 Experiments of FindVehicle

In this section, we establish the baselines of FindVehicle.

6.1 Settings of training and evaluation

We select three representative state-of-the-art models to train and test on FindVehicle: BiLSTM-CRF [24], BERT-CRF [49] and FLERT [50].

BiLSTM-CRF [24] combines BiLSTM and CRF: the BiLSTM acts as the encoder and takes word embeddings as input, while the CRF serves as a decoder that determines the tag of each token based on the hidden states output by the encoder.

BERT-CRF [49] replaces the word embeddings of BiLSTM with subword embeddings learned from BERT and changes the encoder from BiLSTM to Transformer.

FLERT [50] is a NER model that additionally takes document-level features into account. By adding context on both sides (left and right) of the query sentence, FLERT captures document-level features and yields better predictions than previous models.

For each model, we use the most suitable hyperparameters that make the model converge smoothly. We train and test these models on one TITAN RTX GPU. Table 4 shows the implementation details.

Table 4 Implementation Details of Models on The Training Set of FindVehicle
Table 5 Confusion Matrix

Furthermore, as Equations 9, 10 and 11 present, we choose precision, recall and F1 score, which are computed from the confusion matrix (Table 5), as the evaluation metrics for the test.

$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
(9)
$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
(10)
$$\begin{aligned} F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(11)
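For reference, the following small helper computes Equations 9-11 from confusion-matrix counts; the counts passed in the example are purely illustrative.

```python
# Helper computing precision, recall and F1 (Equations 9-11) from TP/FP/FN counts.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf(tp=90, fp=10, fn=20))  # illustrative counts, not results from the paper
```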

6.2 Baselines of FindVehicle

Table 6 shows the evaluation results of the models on the test set of FindVehicle. It is apparent that the Transformer-based models perform better than the RNN-based model. BiLSTM-CRF [24] obtains a 49.5% F1 score, the lowest among the models, while FLERT [50] achieves the highest F1 score of 80.9%, 3% higher than BERT-CRF [49].

Table 6 Evaluation Results of Three Models on The Test Set of FindVehicle

Furthermore, we break down the evaluation results over all 21 classes of named entities. Taking the results of FLERT [50] as an example (Table 7), all evaluation metric values of fine-grained entities are much lower than those of coarse-grained entities, which indicates that recognizing fine-grained entities is harder for neural networks than recognizing coarse-grained ones. Moreover, we also compute the evaluation results of FLERT [50] on flat and overlapped entities. As Table 8 shows, the three metric values for flat entities are about 20% higher than those for overlapped entities; the recognition of overlapped entities thus remains a challenge in FindVehicle.

Table 7 Evaluation Results of FLERT [50] for All The Classes of FindVehicle
Table 8 Evaluation Results of FLERT [50] for Flat and Overlapped Entities of FindVehicle

6.3 Comparison of models on different NER datasets

We also compare the performance of the models on different NER datasets (Table 9), including CoNLL'03 (4 classes) [27], WNUT'17 (6 classes) [29], OntoNotes (18 classes) [30] and our FindVehicle (21 classes), using F1 score (Equation 11) as the evaluation metric. The F1 scores of the three models on FindVehicle are all lower than their scores on CoNLL'03 [27] and OntoNotes [30], which indicates that our dataset poses certain challenges.

Table 9 Performances of Models on Test Sets of Different NER Datasets

7 Experiments of VehicleFinder

The experiments of VehicleFinder consist of four parts: the vision detector, the text detector, CTIM and the complete VehicleFinder system.

7.1 Experiments of vision detector

The vision detector extracts vehicle proposals from the image. We adopt NanoDet-m [43], a lightweight detector with only 0.95 million parameters, and train it on the training set of UA-DETRAC [1]. The implementation details are shown in Table 10.

Moreover, we want the vision detector to miss as few targets as possible, so we use recall as the evaluation metric instead of precision. As Table 11 presents, NanoDet-m [43] achieves an 86.7% recall rate on the test set.

Table 10 Implementation Details of NanoDet-m on The Training Set of UA-DETRAC
Table 11 Evaluation of NanoDet-m on UA-DETRAC [1]

7.2 Experiments of text detector

The text detector extracts keywords (named entities) from the user command. Among all NER models mentioned in Section 6.2, BiLSTM-CRF has relatively few parameters and fast inference, so we train a BiLSTM-CRF on our FindVehicle as the text detector to extract named entities of the types vehicle_type, vehicle_color and vehicle_orientation. The implementation details are shown in Table 4.

As Table 12 shows, BiLSTM-CRF has 4.02 million parameters. It spends 148.57 ms extracting all named entities from a sample in FindVehicle on the 8-core ARM v8.2, and 87.19 ms and 51.73 ms when tested on an i7-12700 and an RTX A4000, respectively (Table 13).

7.3 Experiments of CTIM

7.3.1 Settings of training and evaluation

Table 12 Inference Speed Evaluation of BiLSTM-CRF [24]
Table 13 F1 Scores of Different Kinds of Named Entities by BiLSTM-CRF [24] on FindVehicle

We construct a text-image-pair dataset called Vehicle-TI, based on the training set of UA-DETRAC [1], to train and test our CTIM. As Fig. 11 shows, each data sample in Vehicle-TI has a triple keyword (text modality), a proposal (image modality) and a label, all extracted and reconstituted from UA-DETRAC [1]. A triple keyword contains the type, colour and orientation of the vehicle. The label indicates whether the proposal is consistent with the description of the triple keyword, where 1 means consistent (positive sample) and 0 means inconsistent (negative sample). Positive samples pull the text and image feature encodings closer together, while negative samples push them apart. There are 598,336 samples in Vehicle-TI: 335,040 for training, 179,520 for testing and 83,776 for validation.
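For illustration, two Vehicle-TI samples might look as follows; the field names and file paths are hypothetical, and only the overall structure follows Fig. 11.

```python
# Hypothetical Vehicle-TI samples: a triple keyword, a vehicle proposal crop and a label.
positive_sample = {
    "keywords": ("suv", "black", "away"),                 # (type, colour, orientation)
    "proposal": "crops/MVI_20011_img00042_obj03.jpg",     # crop from UA-DETRAC
    "label": 1,                                           # description matches the crop
}
negative_sample = {
    "keywords": ("sedan", "red", "this way"),
    "proposal": "crops/MVI_20011_img00042_obj03.jpg",     # same crop, mismatched text
    "label": 0,
}
```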

As Table 14 shows, we train CTIM for 50 epochs with a batch size of 64. The initial learning rate is 0.001 and CTIM is optimized by AdamW [51]. The learning rate is scheduled by the Step scheduler.

Furthermore, we set a threshold of 0.7 as the boundary for the consistency of a vehicle proposal and a triple keyword: if the output of CTIM is above 0.7, the vehicle proposal and the triple keyword are considered consistent (strongly related); otherwise, we consider them unrelated or weakly related.

7.3.2 Evaluation results

Table 15 presents the evaluation results of CTIM on the test set of Vehicle-TI. CTIM has only 3.84 million parameters and achieves 97.7% accuracy (Equation 12) in identifying the consistency between vehicle images and triple keywords.

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$
(12)

Moreover, we also test the inference speed of CTIM on different devices. CTIM spends 131.42 ms identifying one sample on an 8-core ARM v8.2 of an NVIDIA Jetson AGX Xavier, 67.43 ms on an i7-12700, and 39.47 ms on one RTX A4000. This shows that CTIM maintains high performance on both edge and host devices.

7.3.3 Comparison of CTIM with other models

As mentioned above, all convolution operations in CTIM are depthwise separable convolutions, and we use the cosine distance together with the linear compression function to measure and process the similarity between the text and image modalities. We refer to this default configuration simply as CTIM.

Fig. 11

The data format of Vehicle-TI, which is for the training and test of CTIM

Table 14 Implementation Details of CTIM on The Training Set of Vehicle-TI
Table 15 Comparison of Various Text-Image Siamese Network on Test Set of Vehicle-TI

We first replace all depthwise separable convolution operations in CTIM with normal convolution operations. We call this variant CTIM-Conv-CosineD-Linear.

Secondly, we replace the cosine distance and linear compression function in CTIM with fully connected layers, as in a standard Siamese neural network that fits the similarity with fully connected layers. We call this variant CTIM-DSConv-Siamese.

Thirdly, we adopt the most well-known contrastive language-image pretraining model, CLIP [19], with ResNet-50 as the image encoder and a Transformer as the text encoder; this CLIP has 102.58 million parameters in total.

Last but not least, we fine-tune a BERT-based Siamese neural network [52] to adapt it to our task. Its architecture is Transformer-based, totally different from the aforementioned neural networks. We call it Bert-Siamese.

As Table 15 presents, CTIM performs best in both accuracy and inference speed. In contrast, CTIM-Conv-CosineD-Linear has 36.6 million parameters, 32.83 million more than CTIM, and its speed on every device is slower than CTIM's.

Secondly, CTIM-DSConv-Siamese has the largest parameter count among all CNN-based neural networks, 99.72 million; its inference speed on every device is also the slowest, and its accuracy is only 25.7%.

Thirdly, CLIP with ResNet-50 and Transformer has 102.58 million parameters. It gets 96.5% accuracy on the test set.

Last but not least, although Bert-Siamese [52] performs close to our CTIM, it has a huge number of parameters (189.53 million) and spends nearly 3 seconds identifying one sample on the 8-core ARM v8.2, which is far too slow to deploy on edge devices.

Based on the performance of the above models, we find that cosine distance is a much better choice than fully connected layers for measuring the similarity of text and image features: the accuracy of CTIM is 72 percentage points higher than that of CTIM-DSConv-Siamese. Furthermore, transformer-based encoders do not behave as expected, as CLIP and Bert-Siamese both obtain lower accuracy than CTIM. Because named entities carry limited features, transformer-based encoders cannot play to their strengths.

7.4 Evaluation of VehicleFinder

7.4.1 Settings of evaluation

We randomly sample 2000 images from the test set of UA-DETRAC [1] as a homemade test set for VehicleFinder. For each image, we write a piece of retrieval text that corresponds to one or more vehicles in the image. The format of the test set is presented in Fig. 12: each item includes the image path img_path, the target id target_id, the upper-left abscissa of the bounding box left, the upper-left ordinate of the bounding box top, the width of the bounding box width, the height of the bounding box height, and the retrieval content retrieval_text. There are 3917 target vehicles referenced by the retrieval texts in these 2000 images. We adopt precision, recall and F1 score as the evaluation metrics, as presented in Equations 13, 14 and 15, and test VehicleFinder on three different devices.

$$\begin{aligned} Precision_{V} = \frac{num(\textit{detected vehicles} \cap \textit{annotated vehicles in the test set})}{num(\textit{detected vehicles})} \end{aligned}$$
(13)
$$\begin{aligned} Recall_{V} = \frac{num(\textit{detected vehicles} \cap \textit{annotated vehicles in the test set})}{num(\textit{all annotated vehicles in the test set})} \end{aligned}$$
(14)
$$\begin{aligned} F1 = \frac{2 \times Precision_{V} \times Recall_{V}}{Precision_{V} + Recall_{V}} \end{aligned}$$
(15)
Fig. 12

The format of the homemade test set for VehicleFinder
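For illustration, one row of this test set might look as follows; only the column names follow Fig. 12, while the path, id, coordinates and text are hypothetical.

```python
# Hypothetical row of the homemade test set; column names come from Fig. 12.
test_item = {
    "img_path": "test_images/MVI_40871_img00305.jpg",
    "target_id": 7,
    "left": 412, "top": 188, "width": 96, "height": 74,   # bounding box in pixels
    "retrieval_text": "a white van driving away in the top right",
}
```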

Table 16 Evaluation of VehicleFinder on Our Homemade Test Set

7.4.2 Evaluation results

Table 16 shows that our VehicleFinder(CTIM), comprising the vision detector, the text detector and CTIM, has 8.81 million parameters. After setting the two thresholds \(th_{nm}\) and \(th_{m}\) to 0.70 and 0.30 respectively, VehicleFinder(CTIM) achieves 87.7% precision, 89.4% recall and an 88.5% F1 score. Fig. 13 presents test results of VehicleFinder(CTIM); we can observe that the target vehicles are precisely retrieved based on the descriptions.

Furthermore, we also collect the results of the control group. VehicleFinder(CTIM-Conv-CosineD-Linear) achieves 87.4% precision, 87.9% recall and an 87.6% F1 score with 41.64 million parameters. VehicleFinder(CTIM-DSConv-Siamese) has 104.69 million parameters and its F1 score is only 12.5%, the lowest of all. VehicleFinder(CLIP) has 107.55 million parameters in total and obtains an 87.5% F1 score. VehicleFinder(Bert-Siamese) has almost the same F1 score (88.4%) as our VehicleFinder(CTIM), but far more parameters. This further suggests that BERT may not be a better choice than RNNs for encoding named entities, because named entities are mainly short text with few contextual features.

We calculate the latency of our VehicleFinder(CTIM) from the moment the command is loaded into VehicleFinder(CTIM) to the moment it completes the identification of one vehicle. As Equation 16 presents, \(T_{ner}\) is the time for named entity recognition and \(T_{cti}\) is the time for identifying the consistency between the named entities and one vehicle proposal; the latency thus covers the inference time of the text detector and CTIM, while the time for the system to schedule the different models is ignored. Table 17 shows the inference speed evaluation of our VehicleFinder(CTIM). The longest latency is 279.35 ms on one 8-core ARM v8.2 and the shortest is 93.72 ms on one RTX A4000, which implies that VehicleFinder(CTIM) can be deployed on both edge and host devices, although host devices are the better choice.

Moreover, according to the experimental results in Table 16, the transformer-based CLIP and BERT variants perform worse than our CTIM. This implies that huge transformer-based text encoders do not outperform lightweight LSTMs on short-text feature extraction, since short text offers few features to extract.

Last but not least, we also test VehicleFinder(CTIM) on images we collected in several traffic scenes. Based on these images, we recruited two volunteers to describe the vehicles that they wanted to find. As Fig. 14 presents, VehicleFinder(CTIM) can still accurately find the target vehicles based on the volunteers' descriptions. In addition, inference on corner cases is shown in Fig. 15, which indicates that VehicleFinder(CTIM) remains robust to some extent when confronted with adverse conditions.

$$\begin{aligned} Latency = T_{ner} + T_{cti} \end{aligned}$$
(16)
Fig. 13

Samples of inference results by VehicleFinder on UA-DETRAC [1]

Fig. 14

Inference results of VehicleFinder on our collected images

Fig. 15

Inference results of corner cases: occluded targets, dark environment and strong light interference

Table 17 Inference Speed Evaluation of VehicleFinder

8 Conclusion and future work

We propose FindVehicle, the first NER dataset in the traffic domain, which contains diverse sentences describing vehicles in different traffic scenes. The named entities cover several vehicle attributes that can be detected by perception sensors. FindVehicle contains both flat and overlapped entities, all annotated by a combination of machine annotation algorithms and human annotators, with both coarse-grained and fine-grained entity annotation. FindVehicle can assist text-image cross-modal tasks in traffic scenes and serve as a pretraining corpus for the traffic domain. Furthermore, we propose an efficient text-image cross-modal vehicle retrieval system called VehicleFinder. VehicleFinder achieves 87.7% precision when identifying target vehicles from text commands, spending 279.35 ms on one 8-core ARM v8.2 CPU and 93.72 ms on one RTX A4000 GPU. VehicleFinder can help traffic supervisors find a target vehicle among a large number of images or videos using natural language. Last but not least, we construct a text-to-image vehicle-matching dataset called Vehicle-TI.

In the future, we will first continue to maintain FindVehicle. Secondly, we will extend it by adding corpora for special traffic scenes and connecting FindVehicle samples to images of real traffic scenes, which would form a new dataset (benchmark). Thirdly, we will explore text-video cross-modal vehicle retrieval.

9 Discussion

The discussion is divided into two parts: the challenges of FindVehicle and the limitations of our cross-modal vehicle retrieval system VehicleFinder.

In FindVehicle, the long-tail data distribution, the recognition of vehicle brands outside the distribution, and the recognition of fine-grained and overlapped entities are three challenges worth exploring. Moreover, as Fig. 16 shows, identifying whether the extracted named entities refer to the same vehicle is a considerable challenge, equivalent to clustering named entities according to context.

The first limitation of VehicleFinder is that a description can only contain the attributes of one vehicle: it is not adapted to contexts with multiple vehicles, because we only adopt NER for keyword extraction instead of combining NER with relation extraction, which remains a challenge for future work. The second limitation is that the keywords used to describe vehicle attributes are not fine-grained enough, owing to the human cost of annotation. We will continue to pay attention to and research this field in the future.

Fig. 16

The challenge of multiple entity clustering