FindVehicle and VehicleFinder: A NER dataset for natural language-based vehicle retrieval and a keyword-based cross-modal vehicle retrieval system

Natural language (NL) based vehicle retrieval aims to retrieve the vehicle that is most consistent with a given NL query from among all candidate vehicles. Because NL queries are easy to obtain, the task has promising prospects for building interactive intelligent traffic systems (ITS). Current solutions mainly extract text and image features and map them into the same latent space to compare their similarity. However, existing methods usually rely on dependency analysis or semantic role labelling to find keywords related to vehicle attributes. These techniques can require substantial pre-processing and post-processing, and they may extract the wrong keywords when the NL query is complex. To tackle these problems and simplify the pipeline, we borrow the idea of named entity recognition (NER) and construct FindVehicle, an NER dataset in the traffic domain. It has 42.3k labelled NL descriptions of vehicle tracks, containing information such as the location, orientation, type and colour of the vehicle. FindVehicle also includes both overlapping entities and fine-grained entities to meet further requirements. To verify its effectiveness, we propose a baseline NL-based vehicle retrieval model called VehicleFinder. Our experiments show that, using text encoders pre-trained on FindVehicle, VehicleFinder achieves 87.7% precision and 89.4% recall when retrieving a target vehicle by text command on our homemade dataset based on UA-DETRAC. The time cost of VehicleFinder is 279.35 ms on one ARM v8.2 CPU and 93.72 ms on one RTX A4000 GPU, which is much faster than the Transformer-based system. The dataset is open-source at https://github.com/GuanRunwei/FindVehicle, and the implementation is available at https://github.com/GuanRunwei/VehicleFinder-CTIM.


Introduction
Vehicle retrieval aims to find a target vehicle in a large image gallery given a query image. It is an image-to-image matching technique, also known as vehicle re-identification [2][3][4][5]. It has promising prospects in building ITS [6][7][8][9] for smart cities [10]. However, an image-based vehicle retrieval system also has defects in practice. For example, such a system needs an image to provide the characteristics of the target vehicle, which is not always easy to obtain in the real world. Its performance may also be limited because only one modality provides spatial and temporal information.
To alleviate these problems, natural language (NL), as another essential modality in the real world, has received increasing attention from researchers in recent years. A natural language-based vehicle retrieval system aims to identify the target vehicle from an NL description. Such a system has a broader range of application scenarios, such as finding a vehicle when a bystander provides only an informal description. Most current NL-based vehicle retrieval implementations construct a text encoder and a visual encoder to extract features from both data types. They then project the text and visual embeddings into the same latent space to compare their similarity. In addition, these methods carefully modify both the visual and the NL data for more effective representation. For example, vehicle track images are cropped to generate a global motion image [11][12][13][14]. As for NL, keywords related to vehicle attributes (e.g., colour, vehicle type and orientation) are extracted from the given NL query [11][12][14][15]. Although these works achieve acceptable performance on the CityFlow-NL [16] benchmark, they can still be improved, especially in how they handle NL. We find that existing methods usually implement keyword extraction with dependency analysis (e.g., using the NLTK package) or semantic role labelling to determine whether a word is a keyword. These techniques only assign a part of speech to each word in the sentence. Pre-determined rules and post-processing are therefore required to assign the extracted keywords to the corresponding vehicle attributes, which makes the whole process complex [15][17]. Such methods can also extract the wrong keyword if the NL description is complex. This can lead to error propagation in subsequent modules and reduce model performance.
In fact, keyword extraction is already a mature technology in natural language processing (NLP), where it is known as named entity recognition (NER). The main obstacle to applying state-of-the-art NER models to the above problem is the lack of a domain-specific corpus with high-quality annotations. To alleviate this problem, we propose FindVehicle, a natural language dataset with named entity labels focused on the traffic domain. It consists of descriptions of vehicles from the point of view of urban traffic surveillance cameras. Example descriptions from our dataset are shown in Table 1, alongside instances selected from other traffic-domain datasets using natural language, namely Talk2Car [18] and CityFlow-NL [16]. We carefully construct the vehicle descriptions to match real traffic scenarios and to include more detailed information about the target vehicles. Our dataset includes eight types of vehicle features, namely vehicle location, orientation, brand, model, type, colour, distance from the traffic surveillance camera, and velocity. In contrast, Talk2Car [18] only records vehicle type, and CityFlow-NL [16] has only four types of information: vehicle colour, type, action, and scene. More vehicle information in the description text means that the data can more accurately reflect real-life traffic scenes while reducing the difficulty that the ambiguity of natural language causes in NL-based vehicle retrieval. Both FindVehicle and CityFlow-NL [16] describe the relationship with other (surrounding) vehicles. Therefore, we do not treat the surrounding vehicle as a separate feature. Furthermore, FindVehicle is annotated with multi-granularity named entity labels so that it can meet further requirements in the future.
To verify the effectiveness of the proposed dataset, we construct a simple and highly efficient cross-modal vehicle retrieval system called VehicleFinder. Unlike current transformer-based models [19][20], which have very large parameter counts and slow inference, VehicleFinder has only 8.81 million parameters. This means that it can achieve real-time performance in practical scenarios and is better suited to edge devices. VehicleFinder is trained and tested on Vehicle-TI, our homemade text-to-image dataset based on the training set of UA-DETRAC [1]. The keywords fed into VehicleFinder are extracted by an NER model pre-trained on FindVehicle. The experimental results show that VehicleFinder achieves 87.7% precision and 89.4% recall when retrieving the target vehicle. Its latency is 279.35 ms on one 8-core ARM v8.2 CPU.
To conclude, the main contributions of this paper include:
1. We propose FindVehicle, the first NER dataset (benchmark) in the traffic domain, which has 42.3 thousand sentences, 1.361 million tokens, 202.5 thousand entities and 21 entity types. FindVehicle contains both flat and overlapping entities, and both coarse-grained and fine-grained entity types.
2. We propose a text-image cross-modal vehicle retrieval system called VehicleFinder to demonstrate the effectiveness of our proposed NER dataset. VehicleFinder is a highly efficient model with favourable performance that can run in real time and be deployed on edge devices.
The rest of this paper is organized as follows: Section 2 presents the related work; Section 3 presents the key information about FindVehicle and how we construct it; Section 4 presents the detailed statistics of FindVehicle; Section 5 presents VehicleFinder, our text-image cross-modal vehicle retrieval system; Section 6 presents the baselines of FindVehicle; Section 7 presents the experimental details of VehicleFinder; Section 8 presents the conclusion and our future work; Section 9 presents some challenges of FindVehicle.

Named Entity Recognition
Table 1 (recovered rows)
Talk2Car [2] | "Yeah that would be my son on the stairs next to the bus." "My mum is on the right! Park near her, she might want a lift." | Low | 11959 | No
CityFlow-NL [3] | "A cargo truck drives down an intersection with many smaller cars." "The large green flatbed 18 wheeler is going straight." "A green truck drives through an intersection, followed by a sedan." | Medium | 9374 | No

Named entity recognition (NER) is a classical sequence tagging task in NLP. It aims to locate and classify words or phrases of specific types in text. The input of an NER model is a sequence with part-of-speech (POS) tags, as shown in Equation 1, where $n$ denotes the number of words produced by the word segmentation program and $t_i$ is the POS tag of the word $w_i$.
In brief, the NER model can be written as Equation 3 shows, where $W$ is the word sequence, $T$ is the tagging sequence, and $P(\cdot)$ is a conditional probability model.
Hidden Markov Models [21] and Conditional Random Fields [22] are two typical machine learning models for NER. Deep learning models, including convolutional neural networks [23], recurrent neural networks [24], transformers [25] and graph neural networks [26], have all achieved state-of-the-art results in NER.
Moreover, many NER datasets have been proposed in recent years, and [27][28][29][30][31] are well-known benchmarks. These datasets mainly contain three kinds of named entities: flat, overlapped and discontinuous entities. [32] proposed a unified neural framework to solve the three NER problems concurrently.

Text-Image Vehicle Retrieval
Vehicle retrieval based on text-image cross-modal learning has been a research hot spot in recent years [11][33][13][34][35][14][36][15][37]. Such a model finds the vehicle that best matches a textual description. There are mainly two kinds of architecture. The first is the end-to-end neural network based on early retrieval, where the features of images and text are fused at an early stage. The second is the non-end-to-end system based on late retrieval, where image and text features are extracted separately and then passed to a decision module.

Contrastive Language Image Pretraining
Contrastive language-image pretraining (CLIP) combines the language and image modalities in one neural network and mainly targets multi-modal tasks based on natural language and computer vision. Before CLIP, most computer vision models were trained on pre-defined labels, and this form of supervision limited the generalization and usefulness of neural networks. In NLP, much work has used large corpora for self-supervised learning, and the resulting models have surpassed those trained on manually labelled datasets [38][39]. In CV, the mainstream approach is still pre-training on large-scale labelled datasets [40]. Vanilla CLIP [19] uses text as a supervision signal to train a vision model and achieves strong results on ImageNet [40]. Vanilla CLIP [19] also performs very well on zero-shot tasks. [20] proposes a CLIP framework called DenseCLIP, which is good at dense prediction tasks such as semantic segmentation and dense object detection. [41] proposes a new contrastive loss to normalize the location and geometric information of image and text features in the semantic space.
3 The Construction of FindVehicle

Brief Introduction
FindVehicle is the first NER dataset in the traffic domain. It is based on the image samples of UA-DETRAC [1]. FindVehicle contains various descriptions of traffic participants on the road, mainly vehicles, from the view of traffic surveillance cameras. A description contains many attributes of one or several vehicles. All of these attributes can be detected by traffic sensors such as surveillance cameras, lidar and radar. FindVehicle also incorporates real-world prior knowledge, such as the vehicle brand and model. Furthermore, FindVehicle contains both coarse-grained and fine-grained entities, and both flat and overlapped entities.

Entity Types
As Fig. 1 shows, there are 21 entity types in FindVehicle: 8 coarse-grained and 13 fine-grained. These entities are all attributes of vehicles, and they follow the real-world distribution. FindVehicle also contains both flat and overlapped entities.

Coarse-grained Entity
There are 8 kinds of coarse-grained entities, including vehicle location, vehicle orientation, vehicle brand, vehicle model, vehicle type, vehicle color, vehicle range and vehicle velocity.
vehicle location indicates the locations of vehicles from the view of the traffic surveillance cameras, such as bottom right, top-left, etc.
vehicle orientation indicates the directions of vehicles' heads from the view of the traffic surveillance cameras, such as this way, away, etc.
vehicle brand indicates the brands of vehicles. FindVehicle contains 65 vehicle brands from all over the world.
vehicle model indicates the models of vehicle brands. There are 4793 models of different vehicle brands in FindVehicle. For example, Q7 is one of the models of Audi.
vehicle type indicates the types of vehicles, such as sedan, suv, etc.
vehicle color indicates the colors of vehicles, such as silver grey, rose red, etc.
vehicle range indicates the distance between the vehicle and the traffic surveillance camera, such as 18m, 123 meters, etc.
vehicle velocity indicates the speed of the moving vehicle on the road, such as 50 kilometres per hour, 120 km/h, etc.

Fine-grained Entity
As Fig. 1 shows, FindVehicle has 13 kinds of fine-grained entities, all of which belong to the coarse-grained entity vehicle type. For example, BMW X5 is a fine-grained entity of vehicle type-suv. Fine-grained entities encode human prior knowledge of cars.

Flat and Overlapped Entity
Overlapped entities occur among the coarse-grained entities vehicle brand and vehicle model and the fine-grained entities vehicle type-*. For example, as Fig. 2 shows, BMW is labelled vehicle brand and X5 is labelled vehicle model, while a car enthusiast would label BMW X5 as vehicle type-suv.
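To make the distinction concrete, the following minimal sketch represents the BMW X5 example as token spans and reports which labels overlap. The `(label, start, end)` tuple layout, the underscore label names and the helper `find_overlaps` are assumptions made for illustration; they are not FindVehicle's actual annotation format.

```python
from itertools import combinations

# Hypothetical token-level spans (label, start, end) with an exclusive end,
# for the phrase "BMW X5" tokenized as ["BMW", "X5"].
spans = [
    ("vehicle_brand", 0, 1),     # BMW
    ("vehicle_model", 1, 2),     # X5
    ("vehicle_type-suv", 0, 2),  # BMW X5
]


def find_overlaps(spans):
    """Return label pairs whose token ranges share at least one token."""
    overlaps = []
    for (label_a, start_a, end_a), (label_b, start_b, end_b) in combinations(spans, 2):
        # Two half-open intervals intersect when each starts before the other ends.
        if start_a < end_b and start_b < end_a:
            overlaps.append((label_a, label_b))
    return overlaps


overlapping_pairs = find_overlaps(spans)
```

In this sketch the brand and model spans are flat with respect to each other, while the fine-grained type span overlaps both of them.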

Corpus Collection
As Fig. 4 shows, the corpus collection includes two parts, the corpus with simple context and the corpus with complex context.
The corpus with simple context consists of short sentences, which are presented in the Data Samples column of Table 1. As Fig. 3 presents, we first sample target vehicles with bounding boxes and labels from UA-DETRAC [1]. Based on these samples, we create a relational table to store the attributes of the corresponding vehicles. Each item in the table represents one vehicle with several attributes. To increase the complexity of the dataset, we replace some formal phrase-type and word-type entities with our own informal expressions and add some rare entities that do not exist in UA-DETRAC [1]. For the three entities vehicle brand, vehicle model and vehicle type-*, we invite three car enthusiasts to collect and integrate data based on their extensive car knowledge and Wikipedia search results. They write the data with different expressions and curate 65 vehicle brands, 4793 vehicle models and 13 vehicle types in total. Secondly, we recruit four volunteers to write descriptive sentence patterns in their own tone and expression habits. All volunteers are well-educated and have adequate English linguistic knowledge. Thirdly, we insert the target vehicles with their attributes into these patterns using our sentence auto-generation framework.
As the sample in Fig. 1 presents, the corpus with complex context consists of long narrative sentences or paragraphs that include the writers' subjective emotions and imagination. Unlike the corpus with simple context, which is generated by combining human labour and computers, the corpus with complex context is written entirely by humans. Four members of our team write the sentences and paragraphs with their own writing habits and imagination while observing the images in UA-DETRAC [1].

NER Annotation
As Fig. 4 shows, in our NER annotation framework, there are two processes for the corpus with simple and complex contexts, respectively.
The annotations of the corpus with simple context are produced together with sentence auto-generation by our annotation auto-generation framework. After that, the correction framework automatically checks whether the auto-generated NER annotations contain errors. If an error is found, the annotation process is interrupted and the location of the error is reported, and we then check and fix it. If no error is found, the annotated corpus is loaded into the dataset directly.
The annotations of the corpus with complex context are entirely manual. They are based on the common sense and knowledge of the annotators, all of whom are volunteers who are knowledgeable about vehicles and good at narrative writing.
As Fig. 5 shows, we organize the data in two formats, JSON and CoNLL-style [27]. The value of the key ner label holds the annotated named entities. Each element of ner label is [entity type, start index of char span, end index of char span, start index of token span, end index of token span]. Our annotation covers both the char level and the token level to meet the different needs of NER models. The key re label holds the indexes of the ner label values that refer to one target in the context of a sentence.
As Table 2 shows, we present the statistics of FindVehicle and other widely used NER datasets, including CoNLL'03 [27], WikiGold [28], WNUT'17 [29], I2B2 [42] and OntoNotes [30]. FindVehicle has 42.3 thousand sentences, 1.361 million tokens, 202.5 thousand entities and 21 entity classes. As Fig. 6 presents, the entity types follow a long-tail distribution that reflects real-world traffic scenarios.
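The five-element ner label layout described in this section can be read as in the sketch below. The record is invented for illustration: the key names `data` and `ner_label`, the sentence, the underscore label names and the exclusive end indices are all assumptions, so the released files should be checked before reusing it.

```python
import json

# Hypothetical record; each ner_label entry is
# [entity type, char start, char end, token start, token end].
record_json = """
{
  "data": "A white BMW X5 is at the bottom right.",
  "ner_label": [
    ["vehicle_color", 2, 7, 1, 2],
    ["vehicle_brand", 8, 11, 2, 3],
    ["vehicle_model", 12, 14, 3, 4],
    ["vehicle_type-suv", 8, 14, 2, 4],
    ["vehicle_location", 25, 37, 7, 9]
  ]
}
"""


def extract_entities(record):
    """Return (entity type, surface text) pairs using the char-level spans."""
    text = record["data"]
    return [
        (label, text[char_start:char_end])
        for label, char_start, char_end, _tok_start, _tok_end in record["ner_label"]
    ]


record = json.loads(record_json)
entities = extract_entities(record)
```

Here the overlapped entity "BMW X5" reuses the characters already covered by the brand and model entries, which is why both char-level and token-level indices are stored per entity rather than one tag per token.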

Dataset Split
FindVehicle is a hybrid NER dataset containing both flat and overlapped entities. We split it into a training set and a test set. The details of these two sets are shown in Table 3. The training set contains 84.6k coarse-grained entities and 18.2k fine-grained entities, or equivalently 84.2k flat entities and 18.6k overlapped entities. The test set contains 82.5k coarse-grained entities and 17.4k fine-grained entities, or equivalently 82.7k flat entities and 17.2k overlapped entities.
Fig. 7 The architecture of VehicleFinder.

VehicleFinder
VehicleFinder is a lightweight text-image cross-modal vehicle retrieval system. Users can find the target vehicle by describing its type, color and orientation. As Fig. 7 presents, VehicleFinder has two branches. One extracts proposals with a vision detector, while the other extracts named entities with a text detector. We adopt NanoDet [43] as the vision detector and BiLSTM-CRF [24] as the text detector. NanoDet [43] is pretrained on UA-DETRAC [1], while BiLSTM-CRF [24] is pretrained on our FindVehicle. The proposals and named entities are passed to the contrastive text-image module (CTIM), which compares the semantic similarity of the two modalities. As Fig. 8 shows, CTIM has two encoder branches that encode the image and text modalities, respectively. The output of CTIM is the similarity of the image and text, with values between 0 and 1. An output below 0.5 indicates that the image and text are unrelated, while an output above 0.5 indicates that they are related. CTIM is a fully convolutional module whose convolution operations are all depthwise separable convolutions [44], which dramatically reduces the number of parameters, especially in the deep layers of the network. CTIM can serve as a plug-and-play module in other cross-modal systems.
The image encoder branch consists of five identical encoder units. An encoder unit first feeds the input feature map $x_i \in \mathbb{R}^{c \times h \times w}$ into three branches, where $c$, $h$ and $w$ denote the channel number, height and width of the feature map, respectively. The three branches use different convolution kernel sizes to extract features with different receptive fields. The resulting feature map, also in $\mathbb{R}^{c \times h \times w}$, is activated by ReLU [45]; a depthwise separable convolution with a $3 \times 3$ kernel then increases the channels and reduces the spatial size. After batch normalization and a ReLU activation, the output feature map has shape $2c \times \frac{h}{2} \times \frac{w}{2}$. To alleviate vanishing and exploding gradients during training, a long residual path with a depthwise separable convolution is connected to the output feature map. The final output of the encoder unit is $x_{i+1} \in \mathbb{R}^{2c \times \frac{h}{2} \times \frac{w}{2}}$. The whole process is presented in Equation 4.
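Since each encoder unit doubles the channels and halves the spatial size, the output shape of the five stacked units follows from simple arithmetic. The sketch below only tracks shapes; the starting values (16 channels, 224 × 224) and the use of floor division for odd sizes are assumptions for illustration, not settings reported in this paper.

```python
def encoder_output_shape(c, h, w, num_units=5):
    """Track (channels, height, width) through stacked encoder units.

    Each unit maps c x h x w to 2c x h/2 x w/2, as described for the
    image encoder. Floor division for odd sizes is an assumption.
    """
    for _ in range(num_units):
        c, h, w = 2 * c, h // 2, w // 2
    return c, h, w


# Hypothetical input: 16 channels at 224 x 224 resolution.
shape = encoder_output_shape(16, 224, 224)
```

With five units the channel count grows by a factor of 32 and each spatial dimension shrinks by the same factor.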
Fig. 8 The architecture of the Contrastive Text-Image Module (CTIM). All convolution operations are depthwise separable convolutions, except for those with a kernel size of 1 × 1, because a depthwise separable convolution already contains a convolution with a 1 × 1 kernel. $c_i$ denotes the channel number of the feature map whose kernel height is $i$. $w_i$ denotes the width of the feature map whose kernel height is $i$.
In the text encoder branch, named entities are first embedded with the pretrained Fasttext embeddings (wiki-news-300d-1M) [46]. Fasttext can infer embeddings for words that are not in the dictionary based on the existing words, which makes it more robust than Word2vec [47] and GloVe [48] for the system. The shape of the embedding matrix is $d \times 300$, where $d$ is the number of named entities and 300 is the vector length of each named entity. After that, we adopt four groups of multi-scale depthwise separable convolutions to extract features at different scales concurrently. The first group consists of $n$ convolutions with kernel size $1 \times w_1$, which extract the features of single words in the named entities. The second group has one convolution with kernel size $2 \times w_2$ and $n - 1$ convolutions with kernel size $1 \times w_1$; the $2 \times w_2$ convolution extracts the associated features of adjacent words, and the remaining $1 \times w_1$ convolutions enhance the non-linear representation. The third group starts with a convolution with a $3 \times w_3$ kernel, which also extracts the features of adjacent words with a window of three words, and is followed by the same operations as in the second group. The fourth group is a single convolution with kernel size $d \times w_d$, which extracts the features of the global context. Finally, the outputs of these four groups are added to obtain a comprehensive representation of the named entities. The four convolution operations are shown in Fig. 9.
After we obtain the representations of the proposal and the named entities, we align their shapes and calculate their cosine distance. Cosine distance here refers to the cosine of the angle between the proposal vector and the named-entity vector (commonly called cosine similarity). It behaves consistently in high-dimensional and low-dimensional spaces, which makes it a robust indicator of the relative difference in direction. Equation 5 shows the cosine distance.

$$\cos(A, B) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}} \quad (5)$$

where $n$ is the number of vector components, and $A_i$ and $B_i$ denote the $i$-th components of the text and image vectors, respectively.
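The computation can be sketched in a few lines of plain Python. This is an illustrative implementation of the standard formula, not the code used in VehicleFinder; the function name and the error handling for zero vectors are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of components")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # The direction of a zero vector is undefined.
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
```

The three calls cover the extremes of the value range: vectors pointing the same way, orthogonal vectors, and vectors pointing in opposite directions.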
However, the value range of cosine distance is $[-1, 1]$. This means that the result cannot be fed directly into the binary cross entropy loss (BCE loss), because BCE loss (Equation 6) cannot process negative inputs.

$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln \hat{y}_i + (1 - y_i)\ln(1 - \hat{y}_i)\right] \quad (6)$$

where $N$ is the number of samples in a batch, $y_i \in \{0, 1\}$ is the ground truth, and $\hat{y}_i \in [-1, 1]$ is the cosine distance predicted by the neural network. $\hat{y}_i$ is outside the domain of $\ln(\cdot)$ whenever it is below zero.
Therefore, we use the linear compression function in Equation 7 to map the cosine distance from $[-1, 1]$ to $[0, 1]$, so that it can be used as the input to BCE loss. The linear compression function is monotonically increasing, its value range is $[0, 1]$, and it is differentiable everywhere. Monotonicity ensures that the relative order of values does not change under the mapping from $[-1, 1]$ to $[0, 1]$. Differentiability everywhere ensures that the function works well with backpropagation in neural networks.

$$f(x) = \frac{x + 1}{2} \quad (7)$$

where $x \in [-1, 1]$ denotes the result calculated by cosine distance. The complete form of the loss function is therefore presented in Equation 8.

$$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \ln f(\hat{y}_i) + (1 - y_i)\ln\left(1 - f(\hat{y}_i)\right)\right] \quad (8)$$
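A small numerical sketch can make the role of the compression step clearer. It assumes the linear compression is f(x) = (x + 1) / 2, the straightforward linear map from [-1, 1] to [0, 1]; the function names and the `eps` clamp that keeps the logarithm finite at the boundaries are additions for illustration, not part of the method described here.

```python
import math


def linear_compress(x):
    """Map a cosine value from [-1, 1] to [0, 1] (assumed form of the compression)."""
    return (x + 1.0) / 2.0


def bce_loss(targets, cosines, eps=1e-7):
    """Mean binary cross entropy over compressed cosine values.

    targets: ground-truth labels, each 0 or 1.
    cosines: predicted cosine values, each in [-1, 1].
    eps: clamp that avoids log(0); an assumption, not from the paper.
    """
    total = 0.0
    for y, cos in zip(targets, cosines):
        p = min(max(linear_compress(cos), eps), 1.0 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(targets)


uncertain = bce_loss([1], [0.0])                # cosine 0 maps to 0.5
confident = bce_loss([1, 0], [1.0, -1.0])       # correct at both extremes
```

A cosine of 0 is compressed to 0.5, so the loss for a positive sample equals ln 2, while correct predictions at the extremes give a loss close to zero.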
Finally, VehicleFinder calculates the similarities between the named entities extracted from the command and the object proposals extracted by the vision detector. We sort the object proposals in descending order of their similarity to the named entities, and the ranked list of proposals is then fed to a decision module. Because the user cannot always describe the vehicle characteristics in detail, the decision module considers two patterns of commands and processes them separately, which makes the system more robust and more user-friendly. As Fig. 10 presents, the first is the no-missing-entity pattern and the second is the missing-entity pattern. In the no-missing-entity pattern, the command contains all three named entities: vehicle type, vehicle color and vehicle orientation. In the missing-entity pattern, the command contains only one or two of these named entities, and the others are not mentioned.
As Algorithm 1 presents, we first set two thresholds, $th_{nm}$ and $th_m$, for the no-missing-entity pattern and the missing-entity pattern, respectively. The variable proposals contains vehicle: sim pairs, that is, object proposals and their similarity to the named entities extracted from the command. For the no-missing-entity pattern, every vehicle: sim pair whose sim is larger than $th_{nm}$ is appended to retainVehicle. If no pair has a sim larger than $th_{nm}$, the decision module continues to search for pairs whose sim is larger than $th_m$, and any such pair is also appended to retainVehicle. $th_{nm}$ and $th_m$ are set based on experimental results. We assume by default that $th_{nm}$ is greater than $th_m$.

Algorithm 1 Decision Module
Input
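One possible reading of the two-threshold logic described in the prose above is sketched here. It is not a transcription of Algorithm 1: the function name `decide`, the dictionary input and the example threshold values are assumptions for illustration.

```python
def decide(proposals, th_nm, th_m):
    """Two-stage threshold filter over {vehicle_id: similarity} pairs.

    First keep proposals whose similarity exceeds th_nm (no-missing-entity
    threshold). If none qualifies, fall back to th_m (missing-entity
    threshold). Results are sorted by similarity in descending order.
    """
    if th_nm <= th_m:
        raise ValueError("th_nm is assumed to be greater than th_m")
    ranked = sorted(proposals.items(), key=lambda pair: pair[1], reverse=True)
    retain_vehicle = [(vehicle, sim) for vehicle, sim in ranked if sim > th_nm]
    if not retain_vehicle:
        retain_vehicle = [(vehicle, sim) for vehicle, sim in ranked if sim > th_m]
    return retain_vehicle


# Hypothetical similarities and thresholds.
strict_match = decide({"car_1": 0.92, "car_2": 0.65, "car_3": 0.30}, 0.8, 0.6)
fallback_match = decide({"car_2": 0.65, "car_3": 0.30}, 0.8, 0.6)
no_match = decide({"car_3": 0.30}, 0.8, 0.6)
```

The fallback stage only runs when the stricter threshold retains nothing, so a fully specified command with a strong match never returns weaker candidates.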

Experiments of FindVehicle
In the experiments of FindVehicle, we establish baselines for the dataset.

Settings of Training and Evaluation
We select three representative, state-of-the-art models to train and test on FindVehicle, namely BiLSTM-CRF [24], BERT-CRF [49] and FLERT [50].
BiLSTM-CRF [24] combines BiLSTM and CRF. The BiLSTM acts as the encoder layer and takes word embeddings as input, while the CRF serves as a decoder that determines the tag for each token based on the hidden states output by the encoder. BERT-CRF [49] replaces the word embeddings of BiLSTM-CRF with subword embeddings learned by BERT, and changes the encoder from a BiLSTM to a Transformer.
FLERT [50] is an NER model that additionally takes document-level features into account. By adding context on both sides (left and right) of the query sentence, FLERT captures document-level features and produces better predictions than the previous models.
For each model, we use the most suitable hyperparameters that make the model converge smoothly.We train and test these models on one TITAN RTX GPU.Table 4 shows the implementation details.
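Furthermore, as Equations 9, 10 and 11 present, we choose precision, recall and F1 score as the evaluation metrics of the test, which are based on the confusion matrix (Table 5).

$$\text{Precision} = \frac{TP}{TP + FP} \quad (9)$$
$$\text{Recall} = \frac{TP}{TP + FN} \quad (10)$$
$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \quad (11)$$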
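As a quick worked example, the three metrics can be computed from confusion-matrix counts as follows. The function name, the zero-division convention and the counts in the example are assumptions for illustration; the formulas are the standard definitions of precision, recall and F1.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall and F1 from true/false positive and false negative counts.

    Returns 0.0 for a metric whose denominator is zero (a convention chosen here).
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2.0 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 correct entities, 2 spurious, 2 missed.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With equal numbers of spurious and missed entities, precision and recall coincide and F1 takes the same value.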

Baselines of FindVehicle
Table 6 shows the evaluation results of models on the test set of FindVehicle.
Transformer-based models clearly perform better than the RNN-based model. BiLSTM-CRF [24] obtains a 49.5% F1 score, the lowest among the models. FLERT [50] achieves an 80.9% F1 score, the highest value and 3% higher than BERT-CRF [49]. Furthermore, we compute the evaluation results for all 21 classes of named entities. Taking the results of FLERT [50] as an example, Table 7 shows that all evaluation metrics for fine-grained entities are much lower than those for coarse-grained entities. This indicates that fine-grained entities are harder for neural networks to recognize than coarse-grained entities. We also calculate the evaluation results of FLERT [50] on flat and overlapped entities. As Table 8 shows, the three metrics for flat entities are about 20% higher than those for overlapped entities. The recognition of overlapped entities is still a challenge in FindVehicle.

Table 9 (recovered rows; F1 score, %)
Model | CoNLL'03 | WNUT'17 | OntoNotes | FindVehicle
BERT-CRF [49] | 93.4 | 59.8 | 92.0 | 77.9
FLERT [50] | 94.1 | 61.1 | 92.3 | 80.9

Fig. 11 The data format of Vehicle-TI.

Comparison of Models on Different NER Datasets
We also compare the performance of the models on different NER datasets (Table 9), including CoNLL'03 (4 classes) [27], WNUT'17 (6 classes) [29], OntoNotes (18 classes) [30] and our FindVehicle (21 classes). We use the F1 score (Equation 11) as the evaluation metric. The F1 scores of all three models on FindVehicle are lower than their scores on CoNLL'03 [27] and OntoNotes [30], which indicates that our dataset poses some challenges.

Experiments of VehicleFinder
There are four parts in the experiments of VehicleFinder, which are CTIM, vision detector, text detector and VehicleFinder.

Settings of Training and Evaluation
We construct a text-to-image dataset called Vehicle-TI based on the training set of UA-DETRAC [1] to train and test our CTIM. As Fig. 11 shows, each data sample in Vehicle-TI has a triple keyword (text modality), a proposal (image modality) and a label, which are extracted and reconstituted from UA-DETRAC [1]. A triple keyword contains the type, color and orientation of the vehicle. The label indicates whether the proposal is consistent with the description given by the triple keyword, where 1 means consistent (positive sample) and 0 means inconsistent (negative sample). Positive samples pull the feature encodings of text and image closer together, while negative samples push them apart. There are 598,336 samples in Vehicle-TI: 335,040 for training, 179,520 for testing and 83,776 for validation. As Table 10 shows, we train CTIM for 50 epochs with a batch size of 64. The initial learning rate is 0.001 and CTIM is optimized by AdamW [51]. The learning rate is scheduled by the Step scheduler.
Furthermore, we set a threshold of 0.7 as the boundary for the consistency of the vehicle proposal and the triple keyword. If the output of CTIM is above 0.7, the vehicle proposal and the triple keyword are considered consistent (strongly related); otherwise, we consider them unrelated or weakly related.

Evaluation Results
Table 11 presents the evaluation results of CTIM on the test set of Vehicle-TI. CTIM has only 3.84 million parameters and achieves 97.7% accuracy (Equation 12) in identifying the consistency between vehicle images and triple keywords.
Moreover, we also test the inference speed of CTIM on different devices. CTIM spends 131.42 ms identifying one sample on the 8-core ARM v8.2 CPU of an NVIDIA Jetson AGX Xavier. On an i7-12700, CTIM has a latency of 67.43 ms, and on one RTX A4000 it takes 39.47 ms. These results indicate that CTIM can maintain high performance on both edge and host devices.
As mentioned above, all convolution operations in CTIM are depthwise separable convolutions. In addition, we use cosine distance and the linear compression function to measure and process the similarity between the text and image modalities. We still call this configuration CTIM.
Firstly, we replace all depthwise separable convolution operations with normal convolution operations. This variant is called CTIM-Conv-CosineD-Linear.
Secondly, we replace the cosine distance and linear compression function in CTIM with fully connected layers, which is how normal Siamese neural networks fit the similarity. We call it CTIM-DSConv-Siamese.
Thirdly, we adopt the most well-known contrastive language-image pre-training model, CLIP [19], with ResNet-50 as the image encoder and Transformer as the text encoder. The total parameter number of CLIP is 102.58 million.
Last but not least, we fine-tune a BERT-based Siamese neural network [52] to adapt it to our task. Its architecture is Transformer-based, totally different from the aforementioned neural networks. We call it Bert-Siamese.
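To make the difference between CTIM's similarity head and the fully connected Siamese head more concrete, the following sketch shows a parameter-free head built from cosine similarity and a linear rescaling. The exact form of the linear compression function is not given in this section, so the mapping from [-1, 1] to [0, 1] used here is an assumption, as are all function names.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equally sized, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def linear_compress(cos_sim: float) -> float:
    """Assumed compression: map [-1, 1] linearly onto [0, 1]."""
    return (cos_sim + 1.0) / 2.0


def consistency_score(text_vec: Sequence[float], image_vec: Sequence[float]) -> float:
    """Score in [0, 1]; higher means text and image encodings agree more."""
    return linear_compress(cosine_similarity(text_vec, image_vec))
```

Such a head has no trainable weights, which is one plausible reason why replacing it with fully connected layers increases the parameter count of the Siamese variant.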
As Table 12 presents, CTIM is the best on all indicators. In contrast, CTIM-Conv-CosineD-Linear has 36.6 million parameters, which is 32.83 million more than CTIM. Furthermore, it is slower than CTIM on all tested devices.
Secondly, the parameter number of CTIM-DSConv-Siamese is the largest among all CNN-based neural networks, at 99.72 million. Its inference speed on different devices is also the slowest and its accuracy is only 25.7%. Thirdly, CLIP with ResNet-50 and Transformer has 102.58 million parameters. It spends about 2.2 seconds inferring one text-image pair sample on the 8-core ARM v8.2 and gets 96.5% accuracy on the test set. Last but not least, although Bert-Siamese [52] performs close to our CTIM, it has a huge parameter count of 189.53 million. It spends nearly 3 seconds identifying one sample on the 8-core ARM v8.2, which is far too slow to deploy on edge devices.
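The parameter gap between normal and depthwise separable convolutions follows directly from how the two layers are built. The sketch below counts weights for a single layer; the channel sizes and kernel size in the usage note are arbitrary illustrative values, not the actual CTIM configuration.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Weights of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


def depthwise_separable_params(c_in: int, c_out: int, k: int, bias: bool = False) -> int:
    """Weights of a k x k depthwise convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise
```

For a hypothetical layer with 64 input channels, 128 output channels and a 3 x 3 kernel, the standard convolution needs 73,728 weights while the depthwise separable version needs 8,768, a reduction of roughly 8.4 times for that layer.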

Experiments of Vision Detector
The vision detector extracts proposals of vehicles from the image. We adopt NanoDet-m [43] as the vision detector, which is a lightweight detector with only 0.95 million parameters. It is trained on the training set of UA-DETRAC [1]. The implementation details are shown in Table 13. Moreover, we want the vision detector to miss as few targets as possible, so we use recall as the evaluation metric instead of precision. As Table 14 presents, NanoDet-m [43] achieves an 86.7% recall rate on the test set.

Experiments of Text Detector
The text detector extracts keywords (named entities) from the user command. Among all NER models mentioned in Section 6.2, BiLSTM-CRF has relatively few parameters and fast inference, so we train a BiLSTM-CRF on our FindVehicle as the text detector. It extracts named entities of three types: vehicle type, vehicle color and vehicle orientation. The implementation details are shown in Table 4.
As Table 15 shows, BiLSTM-CRF has 4.02 million parameters. It spends 148.57 ms extracting all named entities from a sample in FindVehicle on the 8-core ARM v8.2. In addition, it spends 87.19 ms and 51.73 ms when tested on the i7-12700 and the RTX A4000, respectively.
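Once a sequence tagger such as BiLSTM-CRF has assigned a tag to every token, the tags still have to be grouped into entities before they can serve as a triple keyword. The following sketch shows one common way to do this for flat BIO tags. The tag names (`vehicle_type`, `vehicle_color`, `vehicle_orientation`) and the example sentence in the usage note are hypothetical, and the sketch does not handle the overlapping entities that FindVehicle also annotates.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def decode_bio(tokens: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, str]]:
    """Group flat BIO tags into (entity_type, entity_text) pairs."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close the one in progress, if any.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O" or an inconsistent "I-" tag ends the current entity.
            if current_type is not None:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_type is not None:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


def triple_keyword(entities: Sequence[Tuple[str, str]]) -> Dict[str, Optional[str]]:
    """Keep the first entity of each of the three attribute types."""
    triple: Dict[str, Optional[str]] = {
        "vehicle_type": None,
        "vehicle_color": None,
        "vehicle_orientation": None,
    }
    for entity_type, text in entities:
        if entity_type in triple and triple[entity_type] is None:
            triple[entity_type] = text
    return triple
```

For instance, the tokens `["the", "dark", "blue", "bus"]` with tags `["O", "B-vehicle_color", "I-vehicle_color", "B-vehicle_type"]` yield the color "dark blue" and the type "bus", with the orientation left unset.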

Settings of Evaluation
We randomly sample 2000 images from the test set of UA-DETRAC [1] as our homemade test set for VehicleFinder. For each image, we write a piece of retrieval text, which corresponds to one or more vehicles in the image. The format of the test set is presented in Fig. 12. Each item includes the following columns: the image path (img_path), the target ID (target_id), the upper-left abscissa of the bounding box (left), the upper-left ordinate of the bounding box (top), the width of the bounding box (width), the height of the bounding box (height) and the retrieval content (retrieval_text). There are 3917 target vehicles based on retrieval text in these 2000 images. We adopt precision, recall and F1 score to evaluate our VehicleFinder, which are presented in Equations 13, 14 and 15. We also test our VehicleFinder on three different devices.
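$$\mathrm{Precision}_V = \frac{\mathrm{num}(\text{detected vehicles} \;\&\; \text{detected vehicles in the testset})}{\mathrm{num}(\text{detected vehicles})} \tag{13}$$

$$\mathrm{Recall}_V = \frac{\mathrm{num}(\text{detected vehicles} \;\&\; \text{detected vehicles in the testset})}{\mathrm{num}(\text{all vehicles in the testset})} \tag{14}$$

$$\mathrm{F1}_V = \frac{2 \cdot \mathrm{Precision}_V \cdot \mathrm{Recall}_V}{\mathrm{Precision}_V + \mathrm{Recall}_V} \tag{15}$$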
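The three metrics can be computed from simple counts, as in the sketch below. The function and argument names are illustrative, the F1 score is taken in its standard form as the harmonic mean of precision and recall, and the counts used in the usage note are toy values rather than results from the test set.

```python
def precision(n_correct: int, n_detected: int) -> float:
    """Share of detected vehicles that are true targets."""
    return n_correct / n_detected if n_detected else 0.0


def recall(n_correct: int, n_targets: int) -> float:
    """Share of target vehicles in the test set that were detected."""
    return n_correct / n_targets if n_targets else 0.0


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r) if (p + r) else 0.0
```

With toy counts of 8 correct detections out of 10 detections and 16 targets, precision is 0.8 and recall is 0.5. Applying `f1_score` to the reported 87.7% precision and 89.4% recall gives about 0.885, in line with the reported 88.5% F1 score.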

Evaluation Results
Table 17 shows that our VehicleFinder(CTIM) has 8.81 million parameters in total, covering the vision detector, the text detector and CTIM. After setting the two thresholds th_nm and th_m to 0.70 and 0.30 respectively, our VehicleFinder(CTIM) achieves 87.7% precision, 89.4% recall and 88.5% F1 score. Fig. 13 presents the test results of VehicleFinder(CTIM). We can observe that the targeted vehicles are precisely retrieved based on the description.
Furthermore, we also collect the results of the control group. VehicleFinder(CTIM-Conv-CosineD-Linear) achieves 87.4% precision, 87.9% recall and 87.6% F1 score with 41.64 million parameters. VehicleFinder(CTIM-DSConv-Siamese) has 104.69 million parameters and its F1 score is only 12.5%, which is the lowest among all. VehicleFinder(CLIP) has 107.55 million parameters in total and gets 87.5% F1 score. VehicleFinder(Bert-Siamese) has almost the same F1 score (88.4%) as our VehicleFinder(CTIM), but its parameter count is far larger.
We calculate the latency of our VehicleFinder(CTIM) from the moment that the command is loaded into VehicleFinder(CTIM) to the moment that VehicleFinder(CTIM) completes the identification of one vehicle. As Equation 16 presents, T_ner denotes the time of named entity recognition and T_cti denotes the time to identify the consistency between the named entities and one vehicle proposal. The latency therefore includes the inference time of the text detector and CTIM. We ignore the time for the system to schedule different models. Table 18 shows the inference speed evaluation of our VehicleFinder(CTIM). The longest latency is 279.35 ms on one 8-core ARM v8.2 while the shortest latency is 93.72 ms on one RTX A4000. It implies that our VehicleFinder(CTIM) could be deployed on both edge devices and host devices, but host devices are the better choice.
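Equation 16 itself is not reproduced in this text. Judging only from the definitions of the two terms given above, it presumably takes an additive form. The following is a reconstruction under that assumption, where the symbol on the left-hand side is a placeholder label:

```latex
T_{total} = T_{ner} + T_{cti}
```

As a rough consistency check, the separately reported latencies of the text detector and CTIM on the 8-core ARM v8.2 (148.57 ms and 131.42 ms) sum to 279.99 ms, which is close to the 279.35 ms end-to-end latency.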
In contrast, firstly, VehicleFinder(CTIM-Conv-CosineD-Linear) is 86.42 ms slower than VehicleFinder(CTIM) on one ARM v8.2 CPU and 40.85 ms slower on one RTX A4000 GPU. Secondly, VehicleFinder(CTIM-DSConv-Siamese) is 119.05 ms slower than VehicleFinder(CTIM) on one ARM v8.2 CPU and 73.73 ms slower on one RTX A4000 GPU. Thirdly, VehicleFinder(CLIP) spends more than 2 seconds inferring one text-image pair on the ARM v8.2 CPU. Fourthly, VehicleFinder(Bert-Siamese) has the slowest inference among all: its latency exceeds 3 seconds (3091.33 ms) on one ARM v8.2 CPU and is almost 1 second on one RTX A4000 GPU. This is far too slow for deployment in actual traffic scenes, whether on edge devices or on host devices.
Last but not least, we also test VehicleFinder(CTIM) on images that we collected in several traffic scenes. Based on these images, we recruited two volunteers to describe the vehicles that they wanted to find. As Fig. 14 presents, VehicleFinder(CTIM) can still accurately find the targeted vehicles based on the volunteers' descriptions. In addition, inference results for corner cases are shown in Fig. 15, which suggests that VehicleFinder(CTIM) remains robust to some extent under adverse conditions.
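The relative slowdowns above can be turned into implied absolute latencies by adding them to the VehicleFinder(CTIM) baseline. The sketch below only performs that arithmetic on the figures quoted in the text; the dictionary keys and the function name are illustrative.

```python
# End-to-end latency of VehicleFinder(CTIM) in milliseconds, from the text.
BASELINE_MS = {"arm_v8.2": 279.35, "rtx_a4000": 93.72}

# Additional latency of each variant relative to the baseline, from the text.
EXTRA_MS = {
    "CTIM-Conv-CosineD-Linear": {"arm_v8.2": 86.42, "rtx_a4000": 40.85},
    "CTIM-DSConv-Siamese": {"arm_v8.2": 119.05, "rtx_a4000": 73.73},
}


def implied_latency_ms(variant: str) -> dict:
    """Baseline latency plus the reported slowdown, per device."""
    return {
        device: round(BASELINE_MS[device] + extra, 2)
        for device, extra in EXTRA_MS[variant].items()
    }
```

Under this reading, VehicleFinder(CTIM-Conv-CosineD-Linear) would take about 365.77 ms on the ARM v8.2 CPU and 134.57 ms on the RTX A4000, and VehicleFinder(CTIM-DSConv-Siamese) about 398.40 ms and 167.45 ms.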

Conclusion and Future Work
We propose FindVehicle, the first NER dataset in the traffic domain, which contains sentences that describe vehicles in different traffic scenes. Named entities include several attributes of vehicles that could be detected.
In the future, firstly, we will continue to maintain our FindVehicle. Secondly, we will extend FindVehicle by adding corpora of special traffic scenes and by connecting samples of FindVehicle to images of real traffic scenes, which would form a new dataset (benchmark). Thirdly, we will explore text-video cross-modal vehicle retrieval.

Discussion
The discussion is divided into two parts: the challenges of FindVehicle and the limitations of our cross-modal vehicle retrieval system VehicleFinder.
In FindVehicle, the long-tail data distribution, the recognition of out-of-distribution vehicle brands, and the recognition of fine-grained and overlapped entities are three challenges worth exploring. Moreover, as Fig. 16 shows, identifying whether the extracted named entities refer to the same vehicle is a considerable challenge, equivalent to clustering named entities according to context.
The first limitation of our VehicleFinder is that a description can only contain the attributes of one vehicle. Our VehicleFinder does not adapt to contexts with multiple vehicles because we only adopt NER in keyword extraction instead of combining NER with relation extraction, which remains a challenge for future work. The second limitation is that the granularity of the keywords used to describe the attributes of the vehicles is not fine enough, owing to the human cost of the annotation effort. We will continue to research this field in the future.

Fig. 2 An example of flat and overlapped entities.

Fig. 3 The generation of corpus with simple context.

Fig. 4 The framework of corpus collection and annotation of FindVehicle.

FindVehicle is the first NER dataset in traffic with the annotations of automatic labeling and manual labelling together. As Table

Fig. 9 The four multi-scale convolution operations in our text encoder.

Fig. 12 The format of the homemade test set for VehicleFinder.

Fig. 14 Inference results of VehicleFinder on our collected images.

Fig. 15 Inference results of corner cases: occluded targets, dark environment and strong light interference.

Table 1 Datasets of Vehicle Retrieval

Table 2 Statistics of FindVehicle and other well-known NER datasets

Table 3 Data Split of FindVehicle

Table 4 Implementation Details of Models on The Training Set of FindVehicle

Table 5 Confusion Matrix

Table 6 Evaluation Results of Three Models on The Test Set of FindVehicle

Table 7 Evaluation Results of FLERT [50] for All The Classes of FindVehicle

Table 9 Performances of Models on Test Sets of Different NER Datasets

Table 10 Implementation Details of CTIM on The Training Set of Vehicle-TI. BS means batch size; ILR means initial learning rate; Opt means optimizer; Sch means scheduler.

Table 11 Evaluation of CTIM on The Test Set of Vehicle-TI

Table 12 Ablation and Comparison Experiments on The Test Set of Vehicle-TI

Table 17 Evaluation of VehicleFinder on Our Homemade Test Set

Table 18 Inference Speed Evaluation of VehicleFinder