1 Introduction

Vehicle retrieval aims to find a target vehicle in a large image gallery given a query image; this image-to-image matching technique is also known as vehicle re-identification [2,3,4,5]. It has promising prospects for building intelligent transportation systems (ITS) [6,7,8,9] in smart cities [10]. However, an image-based vehicle retrieval system also has practical drawbacks. For example, such a system needs an image to provide the characteristics of the target vehicle, which is not always easy to obtain in the real world. Its performance may also be limited because a single modality has to provide all spatial and temporal information.

To alleviate these problems, natural language (NL), another essential modality in the real world, has received increasing attention from researchers in recent years. A natural language-based vehicle retrieval system aims to identify the target vehicle from an NL description. Such a system covers a broader range of application scenarios, such as finding a vehicle when a bystander can provide only an informal description. Most current natural language vehicle retrieval methods build a text encoder and a visual encoder to extract features from the two data types, and then project the resulting text and visual embeddings into the same latent space to compare their similarity. In addition, these methods carefully preprocess both the visual and the NL data to obtain more effective representations. For example, vehicle track images are cropped to generate a global motion image [11,12,13,14], while keywords related to vehicle attributes (e.g., colour, vehicle type and orientation) are extracted from the given NL query [11, 12, 14, 15]. Although these works achieve acceptable performance on the CityFlow-NL [16] benchmark, they can still be improved, especially on the NL side. We find that existing methods usually rely on dependency analysis (e.g., using the NLTK package) or semantic role labelling to decide whether a word is a keyword. These techniques only assign a part of speech to each word in the sentence, so pre-determined rules and post-processing are required to map the extracted keywords to the corresponding vehicle attributes, making the whole process complex [15, 17]. Such methods may also extract the wrong keywords when the NL description is complex, which leads to error propagation in subsequent modules and reduces model performance.

Table 1 Datasets of Vehicle Retrieval

In fact, keyword extraction is already a mature technology in natural language processing (NLP), where it is known as named entity recognition (NER). The main obstacle to applying a state-of-the-art NER model to the above problem is the lack of a domain-specific corpus with high-quality annotations. To alleviate this problem, we propose FindVehicle, a named-entity-labelled natural language dataset focused on the traffic domain. It consists of descriptions of vehicles from the point of view of urban traffic surveillance cameras. Some example descriptions from our dataset are shown in Table 1, where we also compare them with instances selected from other natural language datasets in the traffic domain, namely Talk2Car [18] and CityFlow-NL [16]. We carefully construct the vehicle descriptions to match real traffic scenarios and to enrich the detailed information about the target vehicles. Our dataset covers eight types of vehicle features: vehicle location, orientation, brand, model, type, colour, distance from the traffic surveillance camera, and velocity. In contrast, Talk2Car [18] only records vehicle type, and CityFlow-NL [16] has only four types of information: vehicle colour, type, action, and scene. Richer vehicle information in the description text means that the data reflect real-life traffic scenes more accurately while reducing the ambiguity of natural language that makes NL-based vehicle retrieval challenging. Both FindVehicle and CityFlow-NL [16] describe the relationship with other (surrounding) vehicles, so we do not treat the surrounding vehicle as a separate feature. Furthermore, FindVehicle is annotated with multi-granularity named entity labels in order to meet further requirements in the future.

To verify the effectiveness of the proposed dataset, we construct a simple and highly efficient cross-modal vehicle retrieval system called VehicleFinder. Unlike current transformer-based models [19, 20], which have huge numbers of parameters and slow inference, VehicleFinder has only 8.81 million parameters. It can therefore run in real time in practical scenarios and is more friendly to edge devices. VehicleFinder is trained and tested on our homemade text-to-image dataset, Vehicle-TI, built from the training set of UA-DETRAC [1]. The keywords fed into VehicleFinder are extracted by an NER model pre-trained on FindVehicle. The experimental results show that VehicleFinder achieves 87.7% precision and 89.4% recall when identifying target vehicles, with a latency of 279.35 ms on one 8-core ARM v8.2 CPU.

To conclude, the main contributions of this paper include:

  1. We propose the first NER dataset (benchmark) in the traffic domain, called FindVehicle, which has 42.3 thousand sentences, 1.361 million tokens, 202.5 thousand entities and 21 entity types. FindVehicle contains both flat and overlapped entities, as well as both coarse-grained and fine-grained entity types.

  2. We propose a text-image cross-modal vehicle retrieval system called VehicleFinder to prove the effectiveness of our proposed NER dataset. VehicleFinder is a highly efficient model with favourable performance that can run in real time and be deployed on edge devices.

  3. For the experiments, we construct a text-image vehicle matching dataset called Vehicle-TI, which has 335,040 training samples, 179,520 test samples and 83,776 validation samples.

The rest of this paper is organized as follows: Section 2 presents the related work; Section 3 describes FindVehicle and how we construct it; Section 4 presents the statistics of FindVehicle; Section 5 presents VehicleFinder, our text-image cross-modal vehicle retrieval system; Section 6 reports the baselines of FindVehicle; Section 7 presents the experimental details of VehicleFinder; Section 8 concludes the paper and outlines our future work; Section 9 discusses the challenges of FindVehicle and the limitations of VehicleFinder.

2 Related work

2.1 Named entity recognition

Named entity recognition (NER) is a classical sequence tagging task in NLP. Its goal is to locate and classify the words or phrases of specific types in a text. The input of an NER model is a sequence with part-of-speech (POS) tags, as shown in Equation 1,

$$\begin{aligned} WT = (w_1, t_1), (w_2, t_2), \dots , (w_i, t_i), \dots , (w_n, t_n) \end{aligned}$$
(1)

where n denotes the number of words produced by the word segmentation program and \(t_i\) is the POS tag of the word \(w_i\).

Based on word segmentation and POS tagging, NER splits, combines (determining entity boundaries) and reclassifies (determining entity categories) words. The output is an optimal sequence \(WC^*, TC^*\) of (word category (WC), tagging category (TC)) pairs, as shown in Equation 2,

$$\begin{aligned} WC^*, TC^* = (wc_1, tc_1), (wc_2, tc_2), \dots , (wc_i, tc_i), \dots , (wc_m, tc_m) \end{aligned}$$
(2)

where \(m \le n\), \(wc_i = [w_j,\dots ,w_{j+k}]\), \(tc_i=[t_j, \dots , t_{j+k}]\), \(1 \le k\), \(j + k \le n\).
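For instance (a hypothetical worked example using Penn Treebank POS tags), the fragment "a red BMW X5" would enter the model as POS-tagged word pairs and leave it with |BMW X5| combined into one unit:

$$\begin{aligned} WT&= (\text {a}, \text {DT}), (\text {red}, \text {JJ}), (\text {BMW}, \text {NNP}), (\text {X5}, \text {NNP}) \\ WC^*, TC^*&= ([\text {a}], [\text {DT}]), ([\text {red}], [\text {JJ}]), ([\text {BMW}, \text {X5}], [\text {NNP}, \text {NNP}]) \end{aligned}$$

The reclassification step then assigns entity categories to the combined spans, for example mapping |red| to vehicle_color and |BMW X5| to vehicle_type-suv in FindVehicle's label set.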

In brief, the NER model can be written as Equation 3 shows,

$$\begin{aligned} (WC^*, TC^*) = \mathop {\arg \max }\limits _{(WC, TC)} P(WC, TC \mid W, T) \end{aligned}$$
(3)

where W is the word sequence while T is the tagging sequence. \(P(\cdot )\) is a conditional probability model.

Hidden Markov Models [21] and Conditional Random Fields [22] are two typical machine learning models for NER. Deep learning models such as convolutional neural networks [23], recurrent neural networks [24], transformers [25] and graph neural networks [26] all achieve state-of-the-art results in NER.

Moreover, many NER datasets have been proposed in past years; [27,28,29,30,31] are well-known NER datasets (benchmarks). These datasets mainly contain three kinds of named entities: flat, overlapped and discontinuous entities. [32] proposed a unified neural framework to solve the three NER problems concurrently.

2.2 Text-image vehicle retrieval

Vehicle retrieval based on text-image cross-modal learning has been a research hot spot in recent years [11, 13,14,15, 33,34,35,36,37]. Given a textual description, the model finds the vehicle that matches it best. There are mainly two architectural paradigms. The first is the end-to-end neural network based on early retrieval, where image and text features are fused at an early stage. The second is the non-end-to-end system based on late retrieval, where image and text features are extracted individually and then loaded into a decision module.

2.3 Contrastive language image pretraining

Contrastive language image pretraining (CLIP) combines the language and image modalities in one neural network and mainly targets multi-modal tasks that span natural language and computer vision. Before CLIP, most computer vision models were trained on pre-defined labels, and this form of supervision limited the generalization and usefulness of neural networks. In the field of NLP, a large body of work uses massive corpora for self-supervised learning, and such models have surpassed those trained on manually labelled datasets [38, 39]. In computer vision, the mainstream method is still to pre-train on large-scale labelled datasets [40]. Vanilla CLIP [19] creatively uses text as a supervision signal to train a vision model and achieves conspicuous results on ImageNet [40]; it is also very good at zero-shot tasks. [20] proposes a CLIP framework called DenseCLIP, which excels at dense prediction tasks such as semantic segmentation and dense object detection. [41] proposes a new contrastive loss to normalize the location and geometric information of image and text features in the semantic space.

3 The construction of FindVehicle

3.1 Brief introduction

FindVehicle is the first NER dataset in the traffic domain. It is built on the image samples of UA-DETRAC [1] and contains various descriptions of traffic participants on the road, mainly vehicles, from the view of traffic surveillance cameras. A description contains many attributes of one or several vehicles, all of which can be detected by traffic sensors such as surveillance cameras, lidar and radar. Moreover, FindVehicle incorporates a great deal of real-world prior knowledge, such as vehicle brands and models. Furthermore, FindVehicle contains both coarse-grained and fine-grained entities, as well as both flat and overlapped entities.

Fig. 1

Entity types and an annotated sample of FindVehicle. Images are from UA-DETRAC [1]

3.2 Entity types

As Fig. 1 shows, there are 21 entity types in FindVehicle: 8 coarse-grained and 13 fine-grained. These entity types are all vehicle attributes, and their values follow real-world distributions. Moreover, FindVehicle contains both flat and overlapped entities.

3.2.1 Coarse-grained entity

There are 8 kinds of coarse-grained entities, including vehicle_location, vehicle_orientation, vehicle_brand, vehicle_model, vehicle_type, vehicle_color, vehicle_range and vehicle_velocity.

vehicle_location indicates the locations of vehicles from the view of the traffic surveillance cameras, such as |bottom right|, |top-left|, etc.

vehicle_orientation indicates the directions of vehicles’ heads from the view of the traffic surveillance cameras, such as |this way|, |away|, etc.

vehicle_brand indicates the brands of vehicles. FindVehicle contains 65 vehicle brands all over the world.

vehicle_model indicates the models of vehicle brands. There are 4793 models of different vehicle brands in FindVehicle. For example, |Q7| is one of the models of |Audi|.

vehicle_type indicates the types of vehicles, such as |sedan|, |suv|, etc.

vehicle_color indicates the colors of vehicles, such as |silver grey|, |rose red|, etc.

vehicle_range indicates the distance between the vehicle and the traffic surveillance camera, such as |18m|, |123 meters|, etc.

vehicle_velocity indicates the speed of the moving vehicle on the road, such as |50 kilometres per hour|, |120 km/h|, etc.

3.2.2 Fine-grained entity

As shown in Fig. 1, there are 13 kinds of fine-grained entities in FindVehicle, all of which belong to the coarse-grained entity vehicle_type; for example, |BMW X5| is a fine-grained entity of vehicle_type-suv. Fine-grained entities capture human prior knowledge about cars.

3.2.3 Flat and overlapped entity

Overlapped entities exist among the coarse-grained entities vehicle_brand and vehicle_model and the fine-grained entities vehicle_type-*. For example, as Fig. 2 shows, |BMW| is labelled vehicle_brand and |X5| is labelled vehicle_model, while for a car enthusiast the whole span |BMW X5| is labelled vehicle_type-suv.

Fig. 2

An example of flat and overlapped entities

3.3 Corpus collection

As Fig. 4 shows, corpus collection includes two parts: the corpus with simple context and the corpus with complex context. The corpus with simple context consists of short sentences, such as those presented in the Data Samples column of Table 1. As Fig. 3 presents, we first sample some target vehicles with bounding boxes and labels from UA-DETRAC [1]. Based on these samples, we create a relational table to store the attributes of the corresponding vehicles, where each item represents one vehicle with several attributes. To increase the complexity of the dataset, we replace some formal phrase-type and word-type entities with informal expressions that reflect everyday habits, and add some rare entities that do not exist in UA-DETRAC [1]. Moreover, to generate the entities vehicle_brand, vehicle_model and vehicle_type-*, we invite three car enthusiasts to collect and integrate data based on their extensive car knowledge and Wikipedia search results; they write data with different expressions and curate 65 vehicle brands, 4793 vehicle models and 13 vehicle types in total. Secondly, we recruit four volunteers, all well-educated with adequate English linguistic knowledge, to write descriptive sentence patterns in their own tone and expression habits. Thirdly, our sentence auto-generation framework inserts the target vehicles and their attributes into these patterns.
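The sketch below illustrates the idea of such template-based sentence auto-generation. It is only an illustration under assumed attribute names and patterns, not the authors' released framework, which uses many more volunteer-written patterns.

```python
# A minimal sketch of template-based sentence auto-generation as described above.
# The patterns, attribute names and values are illustrative assumptions only.
import random

PATTERNS = [
    "There is a {color} {brand} {model} {type} in the {location}, heading {orientation}.",
    "A {color} {type} about {range} from the camera is driving {orientation} at {velocity}.",
]

vehicle = {  # one row of the relational attribute table (hypothetical values)
    "location": "top-left", "orientation": "away", "brand": "Audi", "model": "Q7",
    "type": "suv", "color": "silver grey", "range": "35 meters", "velocity": "60 km/h",
}

def generate_sentence(vehicle: dict) -> str:
    """Fill a randomly chosen sentence pattern with one vehicle's attributes."""
    return random.choice(PATTERNS).format(**vehicle)

print(generate_sentence(vehicle))
```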

Fig. 3

The generation of corpus with simple context

As the sample in Fig. 1 presents, the corpus with complex context consists of long narrative sentences or paragraphs coloured by the writers' subjective emotions and imagination. Unlike the corpus with simple context, which is generated by combining human labour and computers, the corpus with complex context is written entirely by humans: four members of our team write the corresponding sentences and paragraphs in their own styles while observing the images in UA-DETRAC [1].

Fig. 4

The framework of corpus collection and annotation of FindVehicle

Fig. 5

The two annotation formats of FindVehicle

3.4 NER annotation

As Fig. 4 shows, our NER annotation framework has two processes, one for the corpus with simple context and one for the corpus with complex context. Annotations of the corpus with simple context are produced together with the sentences by our annotation auto-generation framework. After that, a correction framework automatically checks whether the auto-generated NER annotations contain errors. If an error is found, the annotation process is interrupted and the error location is reported so that we can check and fix it; otherwise, the annotated corpus is loaded into the dataset directly.

The annotations of the corpus with complex context are totally manual. They are based on the common sense and knowledge of annotators. Annotators are all volunteers who are knowledgeable about vehicles and good at narrative writing.

As Fig. 5 shows, we organize the data in two formats, JSON and CoNLL-style [27]. The value of the key ner_label holds the annotated named entities; each element is [entity type, start index of char span, end index of char span, start index of token span, end index of token span]. Our annotation therefore covers both the char level and the token level, meeting the different needs of NER models. The key re_label lists the indexes of the ner_label entries that refer to the same target within the context of a sentence.
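For illustration, a JSON record in this format might look as follows. The exact key names other than ner_label and re_label, the index conventions (here 0-based with inclusive ends) and all values are assumptions for illustration; the released FindVehicle files are authoritative.

```python
# Illustrative record following the schema described above (values are hypothetical).
record = {
    "text": "A silver grey BMW X5 is driving away.",
    "ner_label": [
        # [entity type, char start, char end, token start, token end]
        ["vehicle_color",        2, 12, 1, 2],
        ["vehicle_brand",       14, 16, 3, 3],
        ["vehicle_model",       18, 19, 4, 4],
        ["vehicle_type-suv",    14, 19, 3, 4],   # overlapped entity spanning "BMW X5"
        ["vehicle_orientation", 32, 35, 7, 7],
    ],
    "re_label": [[0, 1, 2, 3, 4]],  # all five entities refer to the same vehicle
}
```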

4 Data statistics of FindVehicle

4.1 Size and distribution of FindVehicle

FindVehicle is the first NER dataset in the traffic domain whose annotations combine automatic and manual labelling. As Table 2 shows, we compare the statistics of FindVehicle with other widely used NER datasets, including CoNLL'03 [27], WikiGold [28], WNUT'17 [29], I2B2 [42] and OntoNotes [30]. FindVehicle has 42.3 thousand sentences, 1.361 million tokens, 202.5 thousand entities and 21 entity classes. As Fig. 6 presents, the entity types follow a long-tail distribution that reflects real-world traffic scenarios.

Table 2 Statistics of FindVehicle and other well-known NER datasets
Fig. 6

Statistics by entities in FindVehicle

4.2 Dataset split

FindVehicle is a hybrid NER dataset containing both flat and overlapped entities. We split it into a training set and a test set, whose details are shown in Table 3. The training set contains 84.6k coarse-grained entities and 18.2k fine-grained entities, or equivalently 84.2k flat entities and 18.6k overlapped entities. The test set contains 82.5k coarse-grained entities and 17.4k fine-grained entities, or equivalently 82.7k flat entities and 17.2k overlapped entities.

Table 3 Data Split of FindVehicle

5 VehicleFinder

VehicleFinder is a lightweight text-image cross-modal vehicle retrieval system. Users can retrieve the target vehicle by describing its type, colour and orientation. As Fig. 7 presents, VehicleFinder has two branches: one extracts proposals with a vision detector, and the other extracts named entities with a text detector. We adopt NanoDet [43] as the vision detector and BiLSTM-CRF [24] as the text detector; NanoDet [43] is pretrained on UA-DETRAC [1] while BiLSTM-CRF [24] is pretrained on our FindVehicle. The proposals and named entities are then loaded into the contrastive text-image module (CTIM), which compares the semantic similarity of the two modalities.

As Fig. 8 shows, CTIM has two encoder branches that encode the image and text modalities, respectively. The output of CTIM is the similarity between the image and the text, whose value lies between 0 and 1: an output below 0.5 indicates that the image and text are unrelated, while an output above 0.5 indicates that they are related. CTIM is a fully convolutional module in which all convolution operations are depthwise separable convolutions [44], dramatically reducing the number of parameters, especially in the deep layers of the network. CTIM can also serve as a plug-and-play module in other cross-modal systems.

The image encoder branch consists of five identical encoder units. An encoder unit first feeds the input feature map \(x_i \in R^{c \times h \times w}\), where c, h and w denote the channel, height and width of the feature map, into three branches: a \(3 \times 3\) convolution, a \(1 \times 1\) convolution and a batch-normalized identity path, which extract features with different receptive fields. Their sum \(\hat{x}_i \in R^{c \times h \times w}\) is activated by ReLU [45]; a depthwise separable convolution with a \(3 \times 3\) kernel then increases the channels and reduces the spatial size, and after batch normalization the output feature map is \(x_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}}\). To alleviate gradient vanishing and explosion during training, \(x_{i+1}\) is activated by ReLU and a long residual path with a depthwise separable convolution applied to \(x_i\) is added, yielding the final output of the encoder unit \(\hat{x}_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}}\). The whole process is presented in Equation 4.

$$\begin{aligned} \begin{aligned}&\hat{x}_i = BN(Conv_{3\times 3}(x_i)) + BN(Conv_{1\times 1}(x_i)) + BN(x_i), \hat{x}_i \in R^{c \times h \times w} \\&x_{i+1} = BN(Conv_{3\times 3}(ReLU(\hat{x}_i))), x_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}} \\&\hat{x}_{i+1} = ReLU(x_{i+1}) + Conv_{3\times 3}(x_i), \hat{x}_{i+1} \in R^{2c \times \frac{h}{2} \times \frac{w}{2}} \end{aligned} \end{aligned}$$
(4)
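The following PyTorch sketch reconstructs one encoder unit from Equation 4. It is an illustration based on the text above rather than the authors' implementation; the exact stride, padding and layer naming are assumptions.

```python
# Minimal sketch of one image-encoder unit following Equation 4 (assumptions noted above).
import torch
import torch.nn as nn

class DSConv(nn.Module):
    """Depthwise separable convolution: depthwise 3x3 followed by pointwise 1x1."""
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, 3, stride=stride, padding=1, groups=c_in)
        self.pointwise = nn.Conv2d(c_in, c_out, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

class EncoderUnit(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.branch3 = nn.Sequential(DSConv(c, c), nn.BatchNorm2d(c))      # 3x3 branch
        self.branch1 = nn.Sequential(nn.Conv2d(c, c, 1), nn.BatchNorm2d(c))  # 1x1 branch
        self.branch_id = nn.BatchNorm2d(c)                                  # identity branch
        self.down = nn.Sequential(DSConv(c, 2 * c, stride=2), nn.BatchNorm2d(2 * c))
        self.residual = DSConv(c, 2 * c, stride=2)                          # long residual path

    def forward(self, x):
        x_hat = self.branch3(x) + self.branch1(x) + self.branch_id(x)   # first line of Eq. 4
        x_next = self.down(torch.relu(x_hat))                           # second line of Eq. 4
        return torch.relu(x_next) + self.residual(x)                    # third line of Eq. 4

unit = EncoderUnit(c=16)
print(unit(torch.randn(1, 16, 64, 64)).shape)   # -> torch.Size([1, 32, 32, 32])
```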
Fig. 7

The architecture of VehicleFinder

Fig. 8

The architecture of the Contrastive Text-Image Module (CTIM). All convolution operations are depthwise separable convolutions, except for those with a \(1 \times 1\) kernel, since a depthwise separable convolution already contains a \(1 \times 1\) convolution. \(c_i\) denotes the channel number of the feature map whose kernel height is i, and \(w_i\) denotes the width of the feature map whose kernel height is i

In the text encoder branch, named entities are first embedded with the pretrained FastText embeddings (wiki-news-300d-1M) [46]. FastText can infer embeddings for words outside its dictionary from subword information, which makes the system more robust than with Word2vec [47] or GloVe [48]. The shape of the embedding matrix is \(d \times 300\), where d is the number of named entities and 300 is the vector length of each named entity. We then apply four groups of multi-scale depthwise separable convolution operations to extract features at different scales concurrently. The first group consists of n convolutions with kernel size \(1 \times w_1\), which extract the features of single words in the named entities. The second group has one convolution with kernel size \(2 \times w_2\) and \(n-1\) convolutions with kernel size \(1 \times w_1\); the \(2 \times w_2\) kernel extracts the associated features of adjacent words, while the remaining \(1 \times w_1\) convolutions enhance the non-linear representation. The third group first applies a convolution with a \(3 \times w_3\) kernel, which extracts adjacent-word features with a word window of three, followed by the same operations as the second group. The fourth group is a single convolution with kernel size \(d \times w_d\), which extracts the feature of the global context. Finally, the outputs of the four groups are added to obtain a comprehensive representation of the named entities. The four convolution operations are shown in Fig. 9.

Fig. 9

The four multi-scale convolution operations in our text encoder

After obtaining the representations of the proposal and the named entities, we align their shapes and calculate their cosine distance, which measures the distance between the proposal vector and the named-entity vector. It maintains the same similarity in high-dimensional cases as in low-dimensional cases and is a robust indicator of the relative difference in direction. Equation 5 shows the cosine distance.

$$\begin{aligned} CosineD = \frac{{\textbf {A}} \cdot {\textbf {B}}}{\Vert {\textbf {A}} \Vert \Vert {\textbf {B}} \Vert } = \frac{\sum ^n_{i=1}A_iB_i}{\sqrt{\sum ^n_{i=1}A_i^2}\sqrt{\sum ^n_{i=1}B_i^2}}, CosineD \in [-1, 1] \end{aligned}$$
(5)

where n is the number of vector components, and \(A_i\) and \(B_i\) denote the ith components of the text and image vectors, respectively.

However, the value domain of cosine distance is \([-1, 1]\), which means its result cannot be fed directly to the binary cross entropy loss (BCE loss), because BCE loss (Equation 6) cannot process negative numbers.

$$\begin{aligned} L_{BCE} = - \sum ^N_{i=1} [y_iln(\hat{y}_i) + (1-y_i)ln(1-\hat{y}_i)] \end{aligned}$$
(6)

where N indicates the number of samples in a batch, \(y_i \in \{0, 1\}\) is the ground truth, and \(\hat{y}_i \in [-1, 1]\) is the cosine distance predicted by the neural network. Clearly, \(\hat{y}_i\) falls outside the domain of \(ln(\cdot )\) whenever \(\hat{y}_i\) is below zero.

Therefore, as Equation 7 presents, we use a linear compression function to map the cosine distance from \([-1, 1]\) to [0, 1] so that it can be fed to the BCE loss. The linear compression function is monotonically increasing with value domain [0, 1] and is differentiable everywhere. Monotonicity ensures that the relative ordering of the values does not change when they are mapped from \([-1, 1]\) to [0, 1], and differentiability everywhere ensures that the function participates well in backpropagation.

$$\begin{aligned} Comp(x) = \frac{1}{2}x + \frac{1}{2}, Comp(x) \in [0, 1] \end{aligned}$$
(7)

where \(x \in [-1, 1]\) denotes the result calculated by cosine distance.

Therefore, the complete form of the loss function is presented in Equation 8.

$$\begin{aligned} L(y_i, \hat{y}_i) = - \sum ^N_{i=1} [y_iln(Comp(\hat{y}_i)) + (1-y_i)ln(1-Comp(\hat{y}_i))] \end{aligned}$$
(8)
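A minimal PyTorch sketch of Equations 5-8 is given below: cosine similarity between the aligned text and image embeddings is linearly compressed to [0, 1] and fed to the BCE loss. Variable names are illustrative; the authors' training code may differ.

```python
# Sketch of the CTIM training objective (Equations 5-8); names are assumptions.
import torch
import torch.nn.functional as F

def ctim_loss(text_emb, img_emb, labels):
    """text_emb, img_emb: (N, D) aligned embeddings; labels: (N,) in {0, 1}."""
    cosine = F.cosine_similarity(text_emb, img_emb, dim=1)   # Equation 5, in [-1, 1]
    score = 0.5 * cosine + 0.5                               # Equation 7, compress to [0, 1]
    return F.binary_cross_entropy(score, labels.float())     # Equations 6 and 8

# Toy usage with random embeddings.
text_emb = torch.randn(4, 128)
img_emb = torch.randn(4, 128)
labels = torch.tensor([1, 0, 1, 0])
print(ctim_loss(text_emb, img_emb, labels))
```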

Finally, VehicleFinder calculates the similarities between the named entities extracted from the command and the object proposals extracted by the vision detector, and sorts the proposals by this similarity in descending order. The ranked proposals are then fed to a decision module. Considering that users cannot always describe vehicle characteristics in detail, the decision module handles two command patterns separately to enhance the system's robustness and user-friendliness. As Fig. 10 presents, the first is the no-missing-entity pattern, in which the command contains all three named entities vehicle_type, vehicle_color and vehicle_orientation; the second is the missing-entity pattern, in which the command contains only one or two of them.

Fig. 10

Two patterns of commands

As Algorithm 1 presents, we first set two thresholds, \(th_{nm}\) for the no-missing-entity pattern and \(th_{m}\) for the missing-entity pattern. The variable proposals, containing vehicle: sim pairs, holds the object proposals and their similarities with the named entities extracted from the command. For the no-missing-entity pattern, every vehicle: sim pair whose sim is larger than \(th_{nm}\) is appended to retainVehicle. If no pair exceeds \(th_{nm}\), the decision module searches instead for pairs whose sim is larger than \(th_m\) and appends those to retainVehicle. \(th_{nm}\) and \(th_{m}\) are set based on experimental results, and we assume by default that \(th_{nm}\) is greater than \(th_m\).
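Since Algorithm 1 itself is provided as a figure, the sketch below reconstructs the decision logic from the surrounding text only; in particular, applying \(th_m\) directly to the missing-entity pattern is an assumption.

```python
# Sketch of the decision module described above (a reconstruction, not the authors' code).
def decide(proposals, entities, th_nm=0.70, th_m=0.30):
    """proposals: list of (vehicle_id, sim) pairs sorted by sim descendingly.
    entities: dict of extracted entities, e.g. {"vehicle_type": "suv", ...}."""
    required = {"vehicle_type", "vehicle_color", "vehicle_orientation"}
    no_missing = required.issubset(entities)          # no-missing-entity pattern?
    retain = [(v, s) for v, s in proposals if s > th_nm] if no_missing else []
    if not retain:                                    # fall back to the looser threshold
        retain = [(v, s) for v, s in proposals if s > th_m]
    return retain

# Example: keep proposals above 0.70 for a full description, else relax to 0.30.
print(decide([("car_3", 0.82), ("car_7", 0.41)],
             {"vehicle_type": "suv", "vehicle_color": "black",
              "vehicle_orientation": "away"}))
```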

Algorithm 1

Decision Module

6 Experiments of FindVehicle

In this section, we establish the baselines of FindVehicle.

6.1 Settings of training and evaluation

We select three representative state-of-the-art models to train and test on FindVehicle: BiLSTM-CRF [24], BERT-CRF [49] and FLERT [50].

BiLSTM-CRF [24] combines BiLSTM and CRF: the BiLSTM acts as the encoder and takes word embeddings as input, while the CRF serves as a decoder that determines the tag of each token based on the hidden states output by the encoder.

BERT-CRF [49] replaces the word embeddings of BiLSTM with subword embeddings learned from BERT and changes the encoder from BiLSTM to Transformer.

FLERT [50] is a NER model that additionally takes document-level features into account. By adding context on both sides (left and right) of the query sentence, FLERT captures document-level features and yields better predictions than previous models.

For each model, we use the most suitable hyperparameters that make the model converge smoothly. We train and test these models on one TITAN RTX GPU. Table 4 shows the implementation details.

Table 4 Implementation Details of Models on The Training Set of FindVehicle
Table 5 Confusion Matrix

Furthermore, as Equations 9, 10 and 11 present, we choose precision, recall and F1 score, which are computed from the confusion matrix (Table 5), as the evaluation metrics for the test.

$$\begin{aligned} Precision = \frac{TP}{TP + FP} \end{aligned}$$
(9)
$$\begin{aligned} Recall = \frac{TP}{TP + FN} \end{aligned}$$
(10)
$$\begin{aligned} F1 = \frac{2 \times Precision \times Recall}{Precision + Recall} \end{aligned}$$
(11)
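For reference, the following small helper computes Equations 9-11 from confusion-matrix counts; the counts passed in the example are purely illustrative.

```python
# Helper computing precision, recall and F1 (Equations 9-11) from TP/FP/FN counts.
def prf(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(prf(tp=90, fp=10, fn=20))  # illustrative counts, not results from the paper
```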

6.2 Baselines of FindVehicle

Table 6 shows the evaluation results of the models on the test set of FindVehicle. It is apparent that the Transformer-based models perform better than the RNN-based model. BiLSTM-CRF [24] obtains a 49.5% F1 score, the lowest among the models, while FLERT [50] achieves the highest F1 score of 80.9%, 3% higher than BERT-CRF [49].

Table 6 Evaluation Results of Three Models on The Test Set of FindVehicle

Furthermore, we break down the evaluation results over all 21 classes of named entities. Taking the results of FLERT [50] as an example (Table 7), all evaluation metric values of fine-grained entities are much lower than those of coarse-grained entities, which indicates that recognizing fine-grained entities is harder for neural networks than recognizing coarse-grained ones. Moreover, we also compute the evaluation results of FLERT [50] on flat and overlapped entities. As Table 8 shows, the three metric values for flat entities are about 20% higher than those for overlapped entities; the recognition of overlapped entities thus remains a challenge in FindVehicle.

Table 7 Evaluation Results of FLERT [50] for All The Classes of FindVehicle
Table 8 Evaluation Results of FLERT [50] for Flat and Overlapped Entities of FindVehicle

6.3 Comparison of models on different NER datasets

We also compare the performance of the models on different NER datasets (Table 9), including CoNLL'03 (4 classes) [27], WNUT'17 (6 classes) [29], OntoNotes (18 classes) [30] and our FindVehicle (21 classes), using F1 score (Equation 11) as the evaluation metric. The F1 scores of the three models on FindVehicle are all lower than their scores on CoNLL'03 [27] and OntoNotes [30], which indicates that our dataset poses certain challenges.

Table 9 Performances of Models on Test Sets of Different NER Datasets

7 Experiments of VehicleFinder

The experiments of VehicleFinder consist of four parts: the vision detector, the text detector, CTIM and the complete VehicleFinder system.

7.1 Experiments of vision detector

The vision detector extracts vehicle proposals from the image. We adopt NanoDet-m [43], a lightweight detector with only 0.95 million parameters, and train it on the training set of UA-DETRAC [1]. The implementation details are shown in Table 10.

Moreover, we want the vision detector to miss as few targets as possible, so we use recall as the evaluation metric instead of precision. As Table 11 presents, NanoDet-m [43] achieves an 86.7% recall rate on the test set.

Table 10 Implementation Details of NanoDet-m on The Training Set of UA-DETRAC
Table 11 Evaluation of NanoDet-m on UA-DETRAC [1]

7.2 Experiments of text detector

The text detector extracts keywords (named entities) from the user command. Among all NER models mentioned in Section 6.2, BiLSTM-CRF has relatively few parameters and fast inference, so we train a BiLSTM-CRF on our FindVehicle as the text detector to extract named entities of the types vehicle_type, vehicle_color and vehicle_orientation. The implementation details are shown in Table 4.

As Table 12 shows, BiLSTM-CRF has 4.02 million parameters. It spends 148.57 ms extracting all named entities from a sample in FindVehicle on the 8-core ARM v8.2, and 87.19 ms and 51.73 ms when tested on an i7-12700 and an RTX A4000, respectively (Table 13).

7.3 Experiments of CTIM

7.3.1 Settings of training and evaluation

Table 12 Inference Speed Evaluation of BiLSTM-CRF [24]
Table 13 F1 Scores of Different Kinds of Named Entities by BiLSTM-CRF [24] on FindVehicle

We construct a text-image-pair dataset called Vehicle-TI, based on the training set of UA-DETRAC [1], to train and test our CTIM. As Fig. 11 shows, each data sample in Vehicle-TI has a triple keyword (text modality), a proposal (image modality) and a label, all extracted and reconstituted from UA-DETRAC [1]. A triple keyword contains the type, colour and orientation of the vehicle. The label indicates whether the proposal is consistent with the description of the triple keyword, where 1 means consistent (positive sample) and 0 means inconsistent (negative sample). Positive samples pull the text and image feature encodings closer together, while negative samples push them apart. There are 598,336 samples in Vehicle-TI: 335,040 for training, 179,520 for testing and 83,776 for validation.
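For illustration, two Vehicle-TI samples might look as follows; the field names and file paths are hypothetical, and only the overall structure follows Fig. 11.

```python
# Hypothetical Vehicle-TI samples: a triple keyword, a vehicle proposal crop and a label.
positive_sample = {
    "keywords": ("suv", "black", "away"),                 # (type, colour, orientation)
    "proposal": "crops/MVI_20011_img00042_obj03.jpg",     # crop from UA-DETRAC
    "label": 1,                                           # description matches the crop
}
negative_sample = {
    "keywords": ("sedan", "red", "this way"),
    "proposal": "crops/MVI_20011_img00042_obj03.jpg",     # same crop, mismatched text
    "label": 0,
}
```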

As Table 14 shows, we train CTIM for 50 epochs with a batch size of 64. The initial learning rate is 0.001 and CTIM is optimized by AdamW [51]. The learning rate is scheduled by the Step scheduler.

Furthermore, we set a threshold of 0.7 as the boundary for the consistency of a vehicle proposal and a triple keyword: if the output of CTIM is above 0.7, the vehicle proposal and the triple keyword are considered consistent (strongly related); otherwise, we consider them unrelated or weakly related.

7.3.2 Evaluation results

Table 15 presents the evaluation results of CTIM on the test set of Vehicle-TI. CTIM has only 3.84 million parameters and achieves 97.7% accuracy (Equation 12) in identifying the consistency between vehicle images and triple keywords.

$$\begin{aligned} Accuracy = \frac{TP + TN}{TP + FP + TN + FN} \end{aligned}$$
(12)

Moreover, we also test the inference speed of CTIM on different devices. CTIM spends 131.42 ms identifying one sample on an 8-core ARM v8.2 of an NVIDIA Jetson AGX Xavier, 67.43 ms on an i7-12700, and 39.47 ms on one RTX A4000. This shows that CTIM maintains high performance on both edge and host devices.

7.3.3 Comparison of CTIM with other models

As mentioned above, all convolution operations in CTIM are depthwise separable convolutions, and we use the cosine distance together with the linear compression function to measure and process the similarity between the text and image modalities. We refer to this default configuration simply as CTIM.

Fig. 11

The data format of Vehicle-TI, which is for the training and test of CTIM

Table 14 Implementation Details of CTIM on The Training Set of Vehicle-TI
Table 15 Comparison of Various Text-Image Siamese Network on Test Set of Vehicle-TI

We first replace all depthwise separable convolution operations in CTIM with normal convolution operations. We call this variant CTIM-Conv-CosineD-Linear.

Secondly, we replace the cosine distance and linear compression function in CTIM with fully connected layers, as in a standard Siamese neural network that fits the similarity with fully connected layers. We call this variant CTIM-DSConv-Siamese.

Thirdly, we adopt the most well-known contrastive language-image pretraining model, CLIP [19], with ResNet-50 as the image encoder and a Transformer as the text encoder; this CLIP has 102.58 million parameters in total.

Last but not least, we fine-tune a BERT-based Siamese neural network [52] to adapt it to our task. Its architecture is Transformer-based, totally different from the aforementioned neural networks. We call it Bert-Siamese.

As Table 15 presents, CTIM performs best in both accuracy and inference speed. In contrast, CTIM-Conv-CosineD-Linear has 36.6 million parameters, 32.83 million more than CTIM, and its speed on every device is slower than CTIM's.

Secondly, CTIM-DSConv-Siamese has the largest parameter count among all CNN-based neural networks, 99.72 million; its inference speed on every device is also the slowest, and its accuracy is only 25.7%.

Thirdly, CLIP with ResNet-50 and Transformer has 102.58 million parameters. It gets 96.5% accuracy on the test set.

Last but not least, although Bert-Siamese [52] performs close to our CTIM, it has a huge number of parameters (189.53 million) and spends nearly 3 seconds identifying one sample on the 8-core ARM v8.2, which is far too slow to deploy on edge devices.

Based on the performance of the above models, we find that cosine distance is a much better choice than fully connected layers for measuring the similarity of text and image features: the accuracy of CTIM is 72 percentage points higher than that of CTIM-DSConv-Siamese. Furthermore, transformer-based encoders do not behave as expected, as CLIP and Bert-Siamese both obtain lower accuracy than CTIM. Because named entities carry limited features, transformer-based encoders cannot play to their strengths.

7.4 Evaluation of VehicleFinder

7.4.1 Settings of evaluation

We randomly sample 2000 images from the test set of UA-DETRAC [1] as a homemade test set for VehicleFinder. For each image, we write a piece of retrieval text that corresponds to one or more vehicles in the image. The format of the test set is presented in Fig. 12: each item includes the image path img_path, the target id target_id, the upper-left abscissa of the bounding box left, the upper-left ordinate of the bounding box top, the width of the bounding box width, the height of the bounding box height, and the retrieval content retrieval_text. There are 3917 target vehicles referenced by the retrieval texts in these 2000 images. We adopt precision, recall and F1 score as the evaluation metrics, as presented in Equations 13, 14 and 15, and test VehicleFinder on three different devices.

$$\begin{aligned} Precision_{V} = \frac{num(\textit{detected vehicles} \cap \textit{annotated vehicles in the test set})}{num(\textit{detected vehicles})} \end{aligned}$$
(13)
$$\begin{aligned} Recall_{V} = \frac{num(\textit{detected vehicles} \cap \textit{annotated vehicles in the test set})}{num(\textit{all annotated vehicles in the test set})} \end{aligned}$$
(14)
$$\begin{aligned} F1 = \frac{2 \times Precision_{V} \times Recall_{V}}{Precision_{V} + Recall_{V}} \end{aligned}$$
(15)
Fig. 12

The format of the homemade test set for VehicleFinder
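For illustration, one row of this test set might look as follows; only the column names follow Fig. 12, while the path, id, coordinates and text are hypothetical.

```python
# Hypothetical row of the homemade test set; column names come from Fig. 12.
test_item = {
    "img_path": "test_images/MVI_40871_img00305.jpg",
    "target_id": 7,
    "left": 412, "top": 188, "width": 96, "height": 74,   # bounding box in pixels
    "retrieval_text": "a white van driving away in the top right",
}
```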

Table 16 Evaluation of VehicleFinder on Our Homemade Test Set

7.4.2 Evaluation results

Table 16 shows that our VehicleFinder(CTIM), comprising the vision detector, the text detector and CTIM, has 8.81 million parameters. After setting the two thresholds \(th_{nm}\) and \(th_{m}\) to 0.70 and 0.30 respectively, VehicleFinder(CTIM) achieves 87.7% precision, 89.4% recall and an 88.5% F1 score. Fig. 13 presents test results of VehicleFinder(CTIM); we can observe that the target vehicles are precisely retrieved based on the descriptions.

Furthermore, we also collect the results of the control group. VehicleFinder(CTIM-Conv-CosineD-Linear) achieves 87.4% precision, 87.9% recall and an 87.6% F1 score with 41.64 million parameters. VehicleFinder(CTIM-DSConv-Siamese) has 104.69 million parameters and its F1 score is only 12.5%, the lowest of all. VehicleFinder(CLIP) has 107.55 million parameters in total and obtains an 87.5% F1 score. VehicleFinder(Bert-Siamese) has almost the same F1 score (88.4%) as our VehicleFinder(CTIM), but far more parameters. This further suggests that BERT may not be a better choice than RNNs for encoding named entities, because named entities are mainly short text with few contextual features.

We calculate the latency of our VehicleFinder(CTIM) from the moment the command is loaded into VehicleFinder(CTIM) to the moment it completes the identification of one vehicle. As Equation 16 presents, \(T_{ner}\) is the time for named entity recognition and \(T_{cti}\) is the time for identifying the consistency between the named entities and one vehicle proposal; the latency thus covers the inference time of the text detector and CTIM, while the time for the system to schedule the different models is ignored. Table 17 shows the inference speed evaluation of our VehicleFinder(CTIM). The longest latency is 279.35 ms on one 8-core ARM v8.2 and the shortest is 93.72 ms on one RTX A4000, which implies that VehicleFinder(CTIM) can be deployed on both edge and host devices, although host devices are the better choice.

Moreover, according to the experimental results in Table 16, the transformer-based CLIP and BERT variants perform worse than our CTIM. This implies that huge transformer-based text encoders do not outperform lightweight LSTMs on short-text feature extraction, since short text offers few features to extract.

Last but not least, we also test VehicleFinder(CTIM) on images we collected in several traffic scenes. Based on these images, we recruited two volunteers to describe the vehicles that they wanted to find. As Fig. 14 presents, VehicleFinder(CTIM) can still accurately find the target vehicles based on the volunteers' descriptions. In addition, inference on corner cases is shown in Fig. 15, which indicates that VehicleFinder(CTIM) remains robust to some extent when confronted with adverse conditions.

$$\begin{aligned} Latency = T_{ner} + T_{cti} \end{aligned}$$
(16)
Fig. 13

Samples of inference results by VehicleFinder on UA-DETRAC [1]

Fig. 14

Inference results of VehicleFinder on our collected images

Fig. 15

Inference results of corner cases: occluded targets, dark environment and strong light interference

Table 17 Inference Speed Evaluation of VehicleFinder

8 Conclusion and future work

We propose FindVehicle, the first NER dataset in the traffic domain, which contains diverse sentences describing vehicles in different traffic scenes. The named entities cover several vehicle attributes that can be detected by perception sensors. FindVehicle contains both flat and overlapped entities, all annotated by a combination of machine annotation algorithms and human annotators, with both coarse-grained and fine-grained entity annotation. FindVehicle can assist text-image cross-modal tasks in traffic scenes and serve as a pretraining corpus for the traffic domain. Furthermore, we propose an efficient text-image cross-modal vehicle retrieval system called VehicleFinder. VehicleFinder achieves 87.7% precision when identifying target vehicles from text commands, spending 279.35 ms on one 8-core ARM v8.2 CPU and 93.72 ms on one RTX A4000 GPU. VehicleFinder can help traffic supervisors find a target vehicle among a large number of images or videos using natural language. Last but not least, we construct a text-to-image vehicle-matching dataset called Vehicle-TI.

In the future, we will first continue to maintain FindVehicle. Secondly, we will extend it by adding corpora for special traffic scenes and connecting FindVehicle samples to images of real traffic scenes, which would form a new dataset (benchmark). Thirdly, we will explore text-video cross-modal vehicle retrieval.

9 Discussion

The discussion is divided into two parts: the challenges of FindVehicle and the limitations of our cross-modal vehicle retrieval system VehicleFinder.

In FindVehicle, the long-tail data distribution, the recognition of vehicle brands outside the distribution, and the recognition of fine-grained and overlapped entities are three challenges worth exploring. Moreover, as Fig. 16 shows, identifying whether the extracted named entities refer to the same vehicle is a considerable challenge, equivalent to clustering named entities according to context.

The first limitation of VehicleFinder is that a description can only contain the attributes of one vehicle: it is not adapted to contexts with multiple vehicles, because we only adopt NER for keyword extraction instead of combining NER with relation extraction, which remains a challenge for future work. The second limitation is that the keywords used to describe vehicle attributes are not fine-grained enough, owing to the human cost of annotation. We will continue to pay attention to and research this field in the future.

Fig. 16

The challenge of multiple entity clustering