1 Introduction

Single-modality person re-identification (Re-ID) is a pedestrian matching problem between query and gallery images captured by separate cameras and has received much attention in computer vision [15]. Visible cameras play a limited role in night-time monitoring and security work; therefore, visible-infrared person re-identification (VI-ReID) [6] has been proposed to match images of people captured by visible and infrared cameras. This task aims to solve not only the intra-modality discrepancies caused by different camera perspectives and human poses, as in single-modality visible person Re-ID, but also the inter-modality discrepancies caused by the different spectral cameras.

Reducing the modality discrepancy and learning modality-invariant representations is a significant challenge. Cross-modality matching using grayscale images is a popular way to eliminate the color discrepancy [6, 7]. However, modality-specific representations are discarded along with the color information, even though specific representations provide crucial decision-making cues for retrieval and can help widen the inter-class discrepancy. Using generative adversarial networks (GANs) to synthesize images of the opposite modality [8, 9] reduces the image-level gap and brings the task close to a single-modality one, but it inevitably introduces generation noise. Another line of work builds complex networks [10, 11] to learn modality-invariant representations, but the large modality discrepancy prevents heterogeneous modality features from being well projected into a unified space. In summary, extracting and decoupling specific and shared representations remains a challenge for current methods; although Lu et al. [12] made a preliminary attempt, the specific and shared representations could not be completely decoupled.

This paper proposes a specific and shared representations learning (SSRL) model for VI-ReID to mitigate the discrepancy between heterogeneous pedestrian images while better capturing modality-invariant and modality-specific representations. Specifically, we employ the traditional visible-infrared branch as our specific branch to learn modality-specific representations. The color information in red-green-blue (RGB) images is an important decision-making cue for retrieval, as it can produce large inter-class differences and slight intra-class differences; it is therefore necessary to decompose modality-specific representations. The specific branch takes visible and infrared images as inputs and uses feature extractors with independent parameters to ensure that the extracted features are modality-specific representations. We also design a shared branch, as shown in Fig. 1(a). Grayscale images are obtained directly from visible images by channel conversion, which converts three-channel visible images into single-channel grayscale images and replicates them across the three channels. The converted grayscale images alleviate the modality discrepancy with the infrared images and are fed into the feature extractor. The shared-branch feature extractor is parameter-shared and learns modality-invariant representations.

Our dual-branch structure has two main benefits. First, the specific branch improves the extraction of modality-specific representations by preserving the color information of visible images. As illustrated in Fig. 1(b), when the infrared images of different IDs are similar but their visible images differ, the specific representations of the visible light images are needed to separate them. Second, the grayscale operation in the shared branch alleviates the modality discrepancy, allowing the parameter-shared feature extractor to capture modality-shared representations more effectively. As illustrated in Fig. 1(c), when the visible images of different IDs are similar but their infrared images differ, the shared representations are vital for bringing the features of different modalities of the same ID closer.

In addition, we design effective intra-class aggregation and inter-class separation learning strategies to optimize the distribution of feature embeddings at a fine-grained level. They constrain the distances between class centers from both the same modality and across modalities, minimizing the intra-class distance and expanding the inter-class distance. In the early stages of training, substantial disparities exist between modality-specific features, rendering them unsuitable for cross-modality matching. We mitigate this issue with the intra-class aggregation learning (IAL) loss, which constrains the differences between modality-specific and modality-shared features. As training progresses, these differences gradually diminish, while both types of features continue to contribute to the network's accurate identification of individual identities. During training, we employ the cross-entropy loss and the inter-class separation learning (ISL) loss to enhance the representation capability of the final features, which are obtained by concatenating the modality-specific and modality-shared features. This increases the distinction between features of different IDs and improves, to some extent, the ability of the feature extraction network to extract ID-related information.

Figure 1

(a) The conversion of visible light images to grayscale images. After the conversion, the large gap between visible and infrared images shrinks to a small gap between grayscale and infrared images. (b) Different IDs are similar in the infrared images but differ in the visible images. The discriminative information of the visible light images can help the network distinguish between the different IDs. (c) Different IDs are similar in the visible images but differ in the infrared images. The shared representations can help match the same identity across modalities

Our main contributions are summarized below:

  1) A novel specific and shared representations learning model, termed SSRL, is proposed for the VI-ReID task; it contains a specific branch and a shared branch to learn modality-specific and modality-shared representations.

  2) Intra-class aggregation and inter-class separation learning strategies are further developed to optimize the distribution of feature embeddings on a fine-grained level.

  3) Extensive experimental results on the SYSU-MM01 and RegDB datasets show that our proposed method achieves new state-of-the-art performance.

2 Related work

2.1 Single-modality person re-identification

Single-modality person Re-ID aims at matching person images captured by different cameras in the daytime, where all the images come from the same visible modality. Existing works have shown desirable performance on widely used datasets with deep learning techniques [13–15]. Lin et al. [16] presented an attribute-person recognition (APR) network, a multi-task network that learned a Re-ID embedding and simultaneously predicted pedestrian attributes. Sun et al. [17] proposed a network named part-based convolutional baseline (PCB), which conducted uniform partition on the conv-layer for learning part-level features, together with an adaptive pooling method named refined part pooling (RPP) to improve the uniform partition. Li et al. [18] formulated a harmonious attention convolutional neural network (CNN) model for joint learning of pixel-level and region-level attention to optimize Re-ID performance with misaligned images. Zhao et al. [19] introduced a general framework, namely JoT-GAN, to jointly train a GAN and the Re-ID model. However, due to the tremendous discrepancy between visible and infrared images, single-modality solutions are not suitable for cross-modality person re-identification, which created a demand for the development of VI-ReID solutions.

2.2 Visible-infrared person re-identification

The main challenge in VI-ReID is the appearance discrepancy, including large intra-class and slight inter-class variations. Existing VI-ReID methods are divided into three categories: representation learning, metric learning and image generation.

Representation learning approaches based on feature extraction mainly explore how to construct a reasonable network architecture that can extract robust and discriminative features shared by the two modalities. Wu et al. [6] first defined the VI-ReID problem, contributed a new multi-modality Re-ID dataset, SYSU-MM01, and introduced deep zero-padding for training a one-stream network toward automatically evolving domain-specific nodes for cross-modality matching. Hao et al. [20] offered a novel modality confusion learning network (MCLNet) to confuse the two modalities, which ensured that the optimization was explicitly concentrated on the modality-irrelevant perspective. Ye et al. [21] presented a novel hierarchical cross-modality matching model that could simultaneously handle the cross-modality discrepancy, cross-view variations and intra-modality intra-person variations. To explore the potential of both the modality-shared information and the modality-specific characteristics for boosting re-identification performance, Lu et al. [12] proposed a novel cross-modality shared-specific feature transfer algorithm. A dual-path network with a novel bi-directional dual-constrained top-ranking loss was developed in Ref. [22] to learn discriminative feature representations.

Metric learning uses the network to learn the similarity of two images, and the key is to design a reasonable distance measure or loss function. Zhu et al. [23] designed a novel loss function, called hetero-center loss, to constrain the distance between the two centers of the heterogeneous modalities. Liu et al. [24] proposed the hetero-center triplet loss to constrain the distances of different class centers from both the same modality and cross-modality. Hao et al. [25] used sphere softmax to learn a hypersphere manifold embedding and constrained the intra-modality and cross-modality variations on this hypersphere. Liu et al. [26] presented the dual-modality triplet loss to constrain both the cross-modality discrepancy and intra-modality variations. Zhao et al. [27] introduced a novel loss function called hard pentaplet loss, which could simultaneously handle cross-modality and intra-modality variations.

Image generation methods use GANs or other techniques to reduce the discrepancy by transforming one modality into another. Wang et al. [9] proposed generating cross-modality paired images and performing global set-level and fine-grained instance-level alignments. Choi et al. [28] presented a hierarchical cross-modality disentanglement method that extracted pose-invariant and illumination-invariant features for cross-modality matching. Wang et al. [8] translated an infrared image into its visible counterpart and a visible image into its infrared version, which unified the representations of images from different modalities. Wang et al. [29] presented a novel end-to-end alignment generative adversarial network (AlignGAN) for the visible-infrared Re-ID task. Li et al. [30] introduced an auxiliary X modality as an assistant and reformulated infrared-visible dual-mode cross-modality learning as an X-infrared-visible three-mode cross-modality learning problem.

3 Method

This section details the proposed SSRL model for visible-infrared person Re-ID, which is illustrated in Fig. 2. We first describe the dual-branch structure, which contains a shared branch that alleviates the modality discrepancy to learn modality-invariant representations and a specific branch that learns more accurate modality-specific representations. We then introduce the intra-class aggregation and inter-class separation learning strategies, which further optimize the feature distribution and aggregate instances with the same identity.

Figure 2

Framework of the proposed SSRL model. The cross-modality images are fed into a dual-branch structure, in which one branch learns specific representations and the other learns shared representations; the specific and shared representations are then fused. For the fusion feature, the identity (ID) loss is leveraged to enhance the discriminative power of the embedding features, and the hetero-center triplet (HCT) loss is leveraged to constrain the distances of different class centers from both the same modality and cross-modality. For the specific and shared features, the intra-class aggregation learning (IAL) loss and the inter-class separation learning (ISL) loss are further developed to optimize the distribution of feature embeddings on a fine-grained level

3.1 Baseline

We adopt ResNet-50 [31] as the backbone, and each branch contains a pre-trained model. We use max pooling to obtain fine-grained features. Inspired by PCB [17, 32] for extracting discriminative features, we divide the feature map horizontally into parts and feed each part into a classifier to learn local cues, with the number of parts set to 6. Following state-of-the-art methods, we utilize the identity loss [33] \(\mathcal{L}_{\mathrm{ID}}\) and the hetero-center triplet loss [24] \(\mathcal{L}_{\mathrm{HCT}}\) to constrain the network. The baseline learning loss is denoted as \(\mathcal{L}_{\mathrm{base}}\):

$$\begin{aligned} \mathcal{L}_{\mathrm{base}}=\mathcal{L}_{\mathrm{ID}}+\mathcal{L}_{ \mathrm{HCT}}. \end{aligned}$$
(1)
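As a minimal PyTorch sketch of the part-level baseline described above, the module below runs a ResNet-50 trunk (last-stage stride changed from 2 to 1, as in Sect. 4.2), splits the feature map into p = 6 horizontal stripes, max-pools each stripe, and classifies it separately; the identity and hetero-center triplet losses of Eq. (1) would then be applied on top of these part features. The class name `PartPooledBackbone` and the feature dimension are our own illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torchvision

class PartPooledBackbone(nn.Module):
    """ResNet-50 trunk with PCB-style horizontal part pooling (p = 6)."""

    def __init__(self, num_ids: int, num_parts: int = 6, feat_dim: int = 2048):
        super().__init__()
        resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        # Last-stage stride 2 -> 1 to keep a finer feature map (Sect. 4.2).
        resnet.layer4[0].conv2.stride = (1, 1)
        resnet.layer4[0].downsample[0].stride = (1, 1)
        self.trunk = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool/fc
        self.num_parts = num_parts
        self.bnnecks = nn.ModuleList(nn.BatchNorm1d(feat_dim) for _ in range(num_parts))
        self.classifiers = nn.ModuleList(
            nn.Linear(feat_dim, num_ids, bias=False) for _ in range(num_parts))

    def forward(self, x):
        fmap = self.trunk(x)                      # (B, 2048, H, W), H divisible by p
        stripes = fmap.chunk(self.num_parts, dim=2)
        feats, logits = [], []
        for bn, cls, stripe in zip(self.bnnecks, self.classifiers, stripes):
            f = torch.amax(stripe, dim=(2, 3))    # global max pooling per part
            feats.append(f)
            logits.append(cls(bn(f)))
        return feats, logits                      # consumed by L_ID and L_HCT in Eq. (1)
```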

3.2 Dual-branched structure

3.2.1 Specific branch

The color information in the visible light image is crucial discriminative information, which can help the network expand the differences among classes as much as possible. The specific branch takes the original RGB and near-infrared (NIR) images as input and employs two parameter-independent feature extractors to ensure that the network learns specific characteristics more effectively. The images pass through the CNN, global max pooling (GMP) and batch normalization (BN) operations, and the resulting feature vectors are fed into a fully connected layer for identity classification. Since the input to the specific branch includes RGB images, the network pays more attention to learning discriminative color information. For this branch, the identity loss is formulated as follows:

$$\begin{aligned} \mathcal{L}_{\mathrm{ID}}^{\mathrm{specific}}=-\frac{1}{N} \sum _{n=1}^{N} \log \frac{\mathrm{e}^{\boldsymbol{W}_{y_{n}}^{\mathrm{T}} \boldsymbol{f}_{n}^{V}}}{\sum_{u=1}^{U} \mathrm{e}^{\boldsymbol{W}_{u}^{\mathrm{T}} \boldsymbol{f}_{n}^{V}}}, \end{aligned}$$
(2)

where N denotes the sample number in a batch, \({y}_{n}\) and \({\boldsymbol{f}}_{n}^{V}\) are the identity and the feature vector of the n-th pedestrian image, U is the number of identities, and \(\boldsymbol{W}_{u}\) denotes the u-th column of the weights. Under the supervision of identity loss, the specific branch can extract information about a specific identity for classification.
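As a concrete reading of Eq. (2), the short sketch below implements the identity loss as a bias-free linear classifier over the U identities followed by softmax cross-entropy; the tensor shapes and names are illustrative assumptions rather than the released code.

```python
import torch
import torch.nn.functional as F

def identity_loss(features: torch.Tensor, labels: torch.Tensor,
                  classifier_weight: torch.Tensor) -> torch.Tensor:
    # features: (N, D) feature vectors f_n; labels: (N,) identities y_n;
    # classifier_weight: (U, D), whose u-th row corresponds to W_u in Eq. (2).
    logits = features @ classifier_weight.t()   # W_u^T f_n for every identity u
    return F.cross_entropy(logits, labels)      # negative log-softmax, averaged over N
```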

3.2.2 Shared branch

The visible image has three channels, which contain the visible-light color information of red, green and blue, while the infrared image has only one channel, which contains the intensity of near-infrared light. From the perspective of imaging principles, the wavelength ranges of the two are also different, and different sharpness and lighting conditions can produce very different effects on the two types of images. The large modality discrepancy makes it very challenging to extract modality-shared features directly with feature extractors. We therefore introduce an additional shared branch. The shared branch first performs a grayscale transformation on the visible light image to reduce the modality discrepancy. For a given visible image \(x_{v}^{i}\) with three channels, we take the \(\mathcal{R}(x)\), \(\mathcal{G}(x)\) and \(\mathcal{B}(x)\) values of each pixel x. The corresponding grayscale value \(\mathrm{Gray}(x)\) is then calculated as

$$\begin{aligned} \mathrm{Gray}(x)=\alpha _{1} \mathcal{R}(x)+\alpha _{2} \mathcal{G}(x)+ \alpha _{3} \mathcal{B}(x), \end{aligned}$$
(3)

where \(\alpha _{1}\), \(\alpha _{2}\) and \(\alpha _{3}\) are set to 0.299, 0.587 and 0.114, respectively. The grayscale image is then restored to a three-channel image and sent to the shared branch along with the infrared image. The two feature extractors of the shared branch share parameters. The shared branch removes color information from the network, allowing it to focus on learning modality-invariant representations such as texture and structural information. The identity loss of the shared branch, \(\mathcal{L}_{\mathrm{ID}}^{\mathrm{shared}}\), is computed in the same way as that of the specific branch.
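A minimal PyTorch sketch of this channel conversion, assuming input tensors of shape (B, 3, H, W): the grayscale value of Eq. (3) is computed per pixel and replicated to three channels so that the converted image matches the backbone input.

```python
import torch

def rgb_to_gray3(x_v: torch.Tensor) -> torch.Tensor:
    """Convert a batch of visible images (B, 3, H, W) to 3-channel grayscale."""
    r, g, b = x_v[:, 0], x_v[:, 1], x_v[:, 2]
    gray = 0.299 * r + 0.587 * g + 0.114 * b      # Eq. (3)
    return gray.unsqueeze(1).repeat(1, 3, 1, 1)   # replicate to three channels
```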

3.2.3 Fusion features

We employ concatenation to fuse the specific and shared features obtained through dual-branch cooperative learning on the visible and infrared images. The discriminative information of the specific features in the fusion feature can assist the model in separating different identities, while the invariant information of the shared features can assist the model in identifying the same identity across modalities. We therefore apply an identity classification layer to the fusion features for synchronous classification learning. The identity loss of the fusion feature, \(\mathcal{L}_{\mathrm{ID}}^{\mathrm{fusion}}\), is computed in the same way as that of the specific branch. Adding the above identity losses yields the final identity loss function:

$$\begin{aligned} \mathcal{L}_{\mathrm{ID}}^{\mathrm{final}}=\mathcal{L}_{\mathrm{ID}}^{ \mathrm{specific}}+ \mathcal{L}_{\mathrm{ID}}^{\mathrm{shared}}+ \mathcal{L}_{\mathrm{ID}}^{\mathrm{fusion}}. \end{aligned}$$
(4)

The hetero-center triplet loss \(\mathcal{L}_{\mathrm{HCT}}\) is also used for metric learning of the fusion feature.
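The sketch below shows one plausible wiring of the dual-branch structure and the fusion by concatenation: two parameter-independent extractors for the specific branch, one parameter-shared extractor for the shared branch, and a classifier on the concatenated feature. The class and argument names (e.g. `DualBranchSSRL`, `make_backbone`) are our own assumptions rather than the released code, and the part-level pooling of Sect. 3.1 is folded into the backbones for brevity.

```python
import torch
import torch.nn as nn

class DualBranchSSRL(nn.Module):
    def __init__(self, make_backbone, num_ids: int, feat_dim: int = 2048):
        super().__init__()
        self.vis_specific = make_backbone()   # specific branch, visible stream
        self.ir_specific = make_backbone()    # specific branch, infrared stream
        self.shared = make_backbone()         # shared branch, parameters shared by both modalities
        self.cls_specific = nn.Linear(feat_dim, num_ids, bias=False)
        self.cls_shared = nn.Linear(feat_dim, num_ids, bias=False)
        self.cls_fusion = nn.Linear(2 * feat_dim, num_ids, bias=False)

    def forward(self, x_vis, x_ir, x_gray):
        f_vspe = self.vis_specific(x_vis)     # visible-specific features
        f_ispe = self.ir_specific(x_ir)       # infrared-specific features
        f_vsha = self.shared(x_gray)          # grayscale image -> shared features
        f_isha = self.shared(x_ir)            # infrared image  -> shared features
        # Fusion by concatenation (Sect. 3.2.3); L_ID^fusion and L_HCT act on these.
        f_fus_v = torch.cat([f_vspe, f_vsha], dim=1)
        f_fus_i = torch.cat([f_ispe, f_isha], dim=1)
        logits = {
            "specific": self.cls_specific(torch.cat([f_vspe, f_ispe], dim=0)),
            "shared":   self.cls_shared(torch.cat([f_vsha, f_isha], dim=0)),
            "fusion":   self.cls_fusion(torch.cat([f_fus_v, f_fus_i], dim=0)),
        }
        return (f_vspe, f_ispe, f_vsha, f_isha), (f_fus_v, f_fus_i), logits
```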

3.3 Intra-class aggregation learning

The distance between sample pairs within a class is frequently larger than the distance between sample pairs from different classes, owing to camera viewpoint, clothing, posture and other factors. We apply an intra-class aggregation learning strategy that limits the distance within each class to increase cross-modality intra-class similarity, thereby aggregating the different modalities of the same identity. It is difficult to directly constrain the within-class distributions of the visible-specific, infrared-specific, visible-shared and infrared-shared features; we therefore take the centers of these feature distributions and penalize the distances between the centers. As shown in Fig. 3, suppose there are P × K images of P identities in a mini-batch, where each identity contains K images. The feature distribution center of an identity in the visible-specific features is calculated as follows:

$$\begin{aligned} \boldsymbol{c}_{\mathrm{vspe}}^{p}=\frac{1}{K} \sum _{k=1}^{K} (vspe)_{k}^{p},\quad p \in [1, P], \end{aligned}$$
(5)

where \((vspe)_{k}^{p}\) denotes the visible-specific feature vector of the k-th image of identity p. The centers of the infrared-specific, visible-shared and infrared-shared feature distributions, \(\boldsymbol{c}_{\mathrm{ispe}}^{p}\), \(\boldsymbol{c}_{\mathrm{vsha}}^{p}\) and \(\boldsymbol{c}_{\mathrm{isha}}^{p}\), are calculated in the same way. We introduce an intra-class aggregation constraint loss to handle the distances among different modalities of the same identity, which can be written as

$$\begin{aligned} \begin{aligned} \mathcal{L}_{\mathrm{IAL}}={}&\sum _{p=1}^{P} \bigl[ \bigl\Vert \boldsymbol{c}_{\mathrm{vspe}}^{p}- \boldsymbol{c}_{\mathrm{ispe}}^{p} \bigr\Vert _{2}^{2}+ \bigl\Vert \boldsymbol{c}_{\mathrm{vspe}}^{p}- \boldsymbol{c}_{\mathrm{vsha}}^{p} \bigr\Vert _{2}^{2}+ \\ &{} \bigl\Vert \boldsymbol{c}_{\mathrm{vspe}}^{p}- \boldsymbol{c}_{ \mathrm{isha}}^{p} \bigr\Vert _{2}^{2}+ \bigl\Vert \boldsymbol{c}_{ \mathrm{ispe}}^{p}-\boldsymbol{c}_{\mathrm{vsha}}^{p} \bigr\Vert _{2}^{2}+ \\ &{} \bigl\Vert \boldsymbol{c}_{\mathrm{ispe}}^{p}- \boldsymbol{c}_{ \mathrm{isha}}^{p} \bigr\Vert _{2}^{2}+ \bigl\Vert \boldsymbol{c}_{ \mathrm{vsha}}^{p}-\boldsymbol{c}_{\mathrm{isha}}^{p} \bigr\Vert _{2}^{2} \bigr]. \end{aligned} \end{aligned}$$
(6)
Figure 3

Illustration of intra-class aggregation learning strategy. The distribution of the four features is shown by four different colors of diamonds. The circle represents the center of the feature distribution of a modality. The orange line represents intra-class aggregation

As illustrated in Fig. 3, the visible-specific, infrared-specific, visible-shared and infrared-shared features represented by the four colors are pulled closer together by the intra-class aggregation learning strategy. Our goal is to reduce the distance between the different modality centers of the same identity, thereby suppressing cross-modality variations.
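A minimal implementation sketch of Eqs. (5) and (6): per-identity centers are computed for the four feature groups, and all six pairwise squared Euclidean distances between them are summed. Identity labels are used explicitly instead of assuming a fixed P × K batch layout; the tensor and function names are illustrative assumptions.

```python
import itertools
import torch

def ial_loss(f_vspe, f_ispe, f_vsha, f_isha, labels_vis, labels_ir):
    """Intra-class aggregation loss of Eq. (6)."""
    loss = f_vspe.new_zeros(())
    for pid in labels_vis.unique():
        centers = [                                  # Eq. (5) for each feature group
            f_vspe[labels_vis == pid].mean(dim=0),   # c_vspe^p
            f_ispe[labels_ir == pid].mean(dim=0),    # c_ispe^p
            f_vsha[labels_vis == pid].mean(dim=0),   # c_vsha^p
            f_isha[labels_ir == pid].mean(dim=0),    # c_isha^p
        ]
        # Sum the squared L2 distances over all six pairs of centers.
        for c_a, c_b in itertools.combinations(centers, 2):
            loss = loss + ((c_a - c_b) ** 2).sum()
    return loss
```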

3.4 Inter-class separation learning

The intra-class aggregation learning strategy only ensures that cross-modality samples of the same identity are aggregated together; the model also needs to ensure dissimilarity among different identities. The extracted modality-specific representations can help different IDs achieve inter-class separation. Through the inter-class separation learning strategy, we separate different identities. We first calculate the center of all samples of each identity, and then constrain the distance between the feature distributions of different identities using the cosine distance. The inter-class separation loss is calculated as

$$\begin{aligned} \mathcal{L}_{\mathrm{ISL}}=\sum_{p=1}^{P} \sum_{{j}\neq p}^{P}\max \bigl[0,1-\cos \bigl( \boldsymbol{c}^{p},\boldsymbol{c}^{j} \bigr)-m \bigr], \end{aligned}$$
(7)

where \(\boldsymbol{c}^{p}\) and \(\boldsymbol{c}^{j}\) denote the overall centers of identities p and j, computed from (\(\boldsymbol{c}_{\mathrm{vspe}}^{p}\), \(\boldsymbol{c}_{\mathrm{ispe}}^{p}\), \(\boldsymbol{c}_{\mathrm{vsha}}^{p}\), \(\boldsymbol{c}_{\mathrm{isha}}^{p}\)) and (\(\boldsymbol{c}_{ \mathrm{vspe}}^{j}\), \(\boldsymbol{c}_{\mathrm{ispe}}^{j}\), \(\boldsymbol{c}_{\mathrm{vsha}}^{j}\), \(\boldsymbol{c}_{\mathrm{isha}}^{j}\)), respectively, cos(⋅, ⋅) denotes the cosine distance between the centers of different identities, and m is a margin. As illustrated in Fig. 4, the inter-class separation learning strategy separates the feature distributions of different identities.
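The sketch below reads Eq. (7) with cos(⋅, ⋅) taken as the cosine distance, as stated above, so that different-identity centers are pushed apart until their cosine distance reaches at least 1 − m. The aggregated identity centers are assumed to be the means of the four modality centers, and the default margin value is our own placeholder rather than a value reported in the paper.

```python
import torch
import torch.nn.functional as F

def isl_loss(identity_centers: torch.Tensor, m: float = 0.2) -> torch.Tensor:
    """Inter-class separation loss of Eq. (7).

    identity_centers: (P, D), one aggregated center c^p per identity,
    e.g. the mean of (c_vspe^p, c_ispe^p, c_vsha^p, c_isha^p).
    """
    c = F.normalize(identity_centers, dim=1)
    cos_dist = 1.0 - c @ c.t()                         # cosine distance between centers
    P = c.size(0)
    off_diag = ~torch.eye(P, dtype=torch.bool, device=c.device)
    # Hinge of Eq. (7): penalize pairs whose cosine distance is still below 1 - m.
    return torch.clamp(1.0 - cos_dist[off_diag] - m, min=0).sum()
```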

Figure 4

Illustration of the inter-class separation learning strategy. Different dotted circles represent the feature distributions of different identities. The blue line represents inter-class separation

3.5 Objective function

As mentioned above, the goal of SSRL is to jointly learn the two branches, obtain modality-specific representations and modality-invariant representations, and make full use of them. Combined with the losses mentioned above, we finally define the total loss of the overall network as follows:

$$\begin{aligned} \mathcal{L}_{\mathrm{total}}=\mathcal{L}_{\mathrm{ID}}^{ \mathrm{final}}+\lambda _{1}\mathcal{L}_{\mathrm{HCT}}+\lambda _{2} \mathcal{L}_{\mathrm{IAL}}+\lambda _{3}\mathcal{L}_{\mathrm{ISL}}, \end{aligned}$$
(8)

where \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) are the weights of \(\mathcal{L}_{\mathrm{HCT}}\), \(\mathcal{L}_{\mathrm{IAL}}\) and \(\mathcal{L}_{\mathrm{ISL}}\), respectively.
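Putting the pieces together, a training step would combine the losses as in Eq. (8). The small sketch below uses the weight values reported in Sect. 4.2 (λ1 = 1.0, λ2 = 1.0, λ3 = 2.0) and the loss sketches given earlier, which are our own assumptions about the implementation.

```python
def total_loss(l_id_final, l_hct, l_ial, l_isl,
               lambda1: float = 1.0, lambda2: float = 1.0, lambda3: float = 2.0):
    """Overall objective of Eq. (8)."""
    return l_id_final + lambda1 * l_hct + lambda2 * l_ial + lambda3 * l_isl
```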

4 Experiments

4.1 Experimental settings

Datasets. We evaluate our proposed method on three publicly available VI-ReID datasets: SYSU-MM01 [6], RegDB [34] and LLCM.

The SYSU-MM01 dataset contains 287,628 visible images and 15,729 infrared images captured by 4 RGB cameras and 2 thermal imaging cameras, with a total of 491 valid IDs, of which 296 identities are used for training, 99 for validation and 96 for testing. In the testing phase, infrared images are used to search for visible images: samples from the visible cameras form the gallery set, and samples from the infrared cameras form the probe set. The dataset has two modes: all search mode and indoor search mode. For the all search mode, visible cameras 1, 2, 4 and 5 are used for the gallery set, and infrared cameras 3 and 6 are used for the probe set. For the indoor search mode, visible cameras 1 and 2 are used for the gallery set, and infrared cameras 3 and 6 are used for the probe set. A detailed description of the experimental settings can be found in Ref. [6].

The RegDB dataset contains 412 person identities, each with 10 visible-light and 10 infrared images, for a total of 4120 visible images and 4120 infrared images. The 10 images of each individual vary in body pose, capture distance and lighting conditions. However, within the 10 images of the same person, the weather conditions, viewing angles and shooting angles are the same, and the pose of the same identity varies little, so the RegDB task is less complicated. Following the evaluation protocol of Ref. [21], we randomly sample 206 identities for training and use the remaining 206 identities for testing. The training/testing split is repeated 10 times.

The LLCM dataset was collected with a 9-camera network deployed in low-light environments and contains 46,767 bounding boxes of 1064 identities. The training set contains 30,921 bounding boxes of 713 identities (16,946 bounding boxes from the visible (VIS) modality and 13,975 from the NIR modality), and the test set contains 13,909 bounding boxes of 351 identities (8680 bounding boxes from the VIS modality and 7166 from the NIR modality).

Evaluation metrics. For performance evaluation, we employed the widely known cumulative matching characteristic (CMC) [35] curve and mean average precision (mAP).
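For reference, the sketch below computes Rank-k accuracy (from the CMC curve) and mAP given a query-to-gallery distance matrix. It is a simplified, illustrative implementation; the camera-ID filtering and gallery sampling of the official SYSU-MM01 protocol [6] are omitted.

```python
import numpy as np

def cmc_map(dist: np.ndarray, q_ids: np.ndarray, g_ids: np.ndarray, max_rank: int = 10):
    """dist: (num_query, num_gallery) distance matrix; returns (CMC curve, mAP)."""
    order = np.argsort(dist, axis=1)                 # nearest gallery samples first
    matches = g_ids[order] == q_ids[:, None]         # boolean match matrix
    cmc = np.zeros(max_rank)
    aps = []
    for m in matches:
        if not m.any():                              # query identity absent from gallery
            continue
        first_hit = int(np.argmax(m))                # rank of the first correct match
        if first_hit < max_rank:
            cmc[first_hit:] += 1
        hit_positions = np.where(m)[0]
        precisions = (np.arange(len(hit_positions)) + 1) / (hit_positions + 1)
        aps.append(precisions.mean())                # average precision of this query
    num_valid = len(aps)
    return cmc / num_valid, float(np.mean(aps))
```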

4.2 Implementation details

The proposed method is implemented with the PyTorch framework and trained on an NVIDIA Tesla V100 GPU. Following existing person Re-ID work, a pre-trained ResNet-50 model is used as the backbone for a fair comparison. Specifically, the stride of the last convolutional block is changed from 2 to 1 to obtain fine-grained feature maps. In the training phase, the batch size is set to 32, containing 16 visible and 16 infrared images from 8 identities; for each identity, 2 visible images and 2 infrared images are selected randomly. For infrared images, the single channel is replicated to three channels before being fed into the network. The input images are resized to 288 × 144, padded by 10 pixels, randomly left-right flipped, cropped back to 288 × 144 and randomly erased [36] for data augmentation. We use the stochastic gradient descent (SGD) optimizer with the momentum set to 0.9. We set the initial learning rate to 0.1 and adopt a warmup learning rate strategy [37]. For \(\lambda _{1}\), \(\lambda _{2}\) and \(\lambda _{3}\) in Eq. (8), we set them to 1.0, 1.0 and 2.0, respectively.
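The following sketch mirrors this training configuration: the data augmentation pipeline, an identity-balanced batch of 8 identities with 2 visible and 2 infrared images each, and SGD with a linear warmup. The helper names and the warmup length are assumptions rather than values taken from the paper (beyond those stated above).

```python
import random
import torch
from torchvision import transforms

# Data augmentation for training images (Sect. 4.2).
train_transform = transforms.Compose([
    transforms.Resize((288, 144)),
    transforms.Pad(10),
    transforms.RandomCrop((288, 144)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.RandomErasing(),                      # random erasing [36]
])

def sample_identity_batch(vis_index, ir_index, num_ids=8, per_modality=2):
    """vis_index / ir_index: dict mapping person ID -> list of image paths."""
    pids = random.sample(sorted(vis_index.keys()), num_ids)
    vis_paths, ir_paths = [], []
    for pid in pids:
        vis_paths += random.sample(vis_index[pid], per_modality)
        ir_paths += random.sample(ir_index[pid], per_modality)
    return vis_paths, ir_paths, pids                 # 16 visible + 16 infrared images

def make_optimizer(model):
    # SGD with momentum 0.9 and initial learning rate 0.1.
    return torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

def warmup_lr(epoch, base_lr=0.1, warmup_epochs=10):
    # Linear warmup [37] for the first epochs; later decay steps are omitted here.
    return base_lr * (epoch + 1) / warmup_epochs if epoch < warmup_epochs else base_lr
```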

4.3 Comparison with state-of-the-art methods

Results on SYSU-MM01 dataset. Table 1 presents the comparison results on the SYSU-MM01 dataset. The proposed method reaches state-of-the-art results under both settings: our model achieves 72.68% Rank-1 accuracy and 68.28% mAP in the all search setting, and 77.58% Rank-1 accuracy and 79.50% mAP in the indoor search setting, with most of the remaining indicators reaching at least the second-highest accuracy.

Table 1 Comparison with the state-of-the-art on SYSU-MM01 and RegDB datasets. R1 and R10 denote the Rank-1 and Rank-10 accuracies, respectively. Rank-k accuracy (%) and mAP (%) are reported. Herein, the best and second best results are indicated by bold and italic font, respectively. † represents that this paper reports a higher accuracy by using the transferred graph features. We use backbone features for inference for a fair comparison

Results on RegDB dataset. Table 1 depicts the comparison results on the RegDB dataset. Our SSRL achieves superior performance under both the visible-to-infrared and infrared-to-visible settings. Specifically, we achieve a Rank-1 accuracy of 93.64% and an mAP of 82.55% in the infrared-to-visible mode, and a Rank-1 accuracy of 93.52% and an mAP of 82.43% in the visible-to-infrared mode. Hence, our SSRL model is robust across different datasets and query settings.

Results on LLCM dataset. Table 2 illustrates the comparison results on the LLCM dataset. The images in the LLCM dataset are captured in complex low-light environments with severe illumination variations. Although SSRL includes no measures specifically designed for low-light environments, it still achieves decent performance.

Table 2 Comparison with the state-of-the-art on LLCM dataset. R1 and R10 denote the Rank-1 and Rank-10 accuracies, respectively. Rank-k accuracy (%) and mAP (%) are reported. Herein, the best and second best results are indicated by bold and italic font, respectively. The symbol ∗ represents the methods that we reproduced with the random erasing technique

4.4 Ablation study

Effectiveness of each component. We evaluate each component on the SYSU-MM01 dataset to verify its effectiveness within SSRL. The ablation experiment is conducted on the SYSU-MM01 dataset in the all search single-shot mode, and the results are reported in Table 3. The dual-branch structure improves the Rank-1 accuracy and mAP by 10.12% and 10.89%, respectively, compared with the baseline model, mainly because it fully extracts the specific and shared features. With the intra-class aggregation and inter-class separation learning strategies, the Rank-1 accuracy is further improved by 3.87%, reaching 72.68%.

Table 3 Ablation study of different components on SYSU-MM01 (all search). Base: baseline; DB: dual-branch structure; IAL: intra-class aggregation learning strategy; ISL: inter-class separation learning strategy. R1 and R10 denote the Rank-1 and Rank-10 accuracies. Rank-k accuracy (%) and mAP (%) are reported

Effectiveness of the dual-branch structure. We evaluate the dual-branch structure to verify that it outperforms a single-branch structure. The ablation experiment is conducted on the SYSU-MM01 dataset in the all search single-shot mode. For a fair comparison, we keep the other structures the same and use only \(\mathcal{L}_{\mathrm{ID}}\) and \(\mathcal{L}_{\mathrm{HCT}}\) in the training phase. The results are depicted in Table 4. The accuracy is inferior when only the shared branch or only the specific branch is used, but outstanding performance gains are achieved when the specific and shared features of the dual-branch structure are fully utilized.

Table 4 Analysis of effectiveness of the dual-branch structure on SYSU-MM01 dataset under the all search mode. Specific branch: use only specific branch; shared branch: use only shared branch; dual-branch: use both specific branch and shared branch. R1 and R10 denote the Rank-1 and Rank-10 accuracies. Rank-k accuracy (%) and mAP (%) are reported

Comparison of different augmented modalities. To verify the superiority of the dual-branch structure augmentation, we compare it with other existing modality augmentation strategies, including the X modality [30] and grayscale modality [55]. The ablation experiment is conducted on SYSU-MM01 dataset in the all search single-shot mode. The results are shown in Table 5. Our dual-branch modality augmentation method outperforms the other augmentation methods.

Table 5 Comparison of different augmentation modalities on SYSU-MM01 dataset under the all search settings. R1 and R10 denote the Rank-1 and Rank-10 accuracies. Rank-k accuracy (%) and mAP (%) are reported

4.5 Visualization

Retrieval result. We compare SSRL with the baseline on the SYSU-MM01 dataset under the single-shot setting and the all search mode to further highlight the advantages of the proposed model. The Rank-10 retrieval results are displayed in Fig. 5. In general, the ranking lists are greatly improved by the proposed SSRL method, with more correctly retrieved images placed in the top positions.

Figure 5

Rank-10 retrieval results obtained by the baseline and the proposed SSRL model on the SYSU-MM01 dataset. For each retrieval case, the query images in the first column are NIR images, and the gallery images are VIS images. The retrieved VIS images with green bounding boxes have the same identities as the query images, and those with red bounding boxes have different identities from the query images

Attention to patterns. One of the key goals of the VI-ReID task is to improve the discriminability of features. We visualize the pixel-level pattern mapping learned by SSRL to further illustrate that it can learn modality-specific and modality-invariant features. We apply Grad-CAM [56] to highlight these areas on the image. Since our approach uses the PCB method, we separate the feature map into 6 parts, as presented in Fig. 6. Comparing each part in the RGB/NIR modalities, the baseline model focuses only on a few uninformative areas, whereas our SSRL model attends to specific and shared information separately through the dual-branch structure.

Figure 6

Visualization of feature response maps of the baseline and the proposed SSRL model. For each example, the first column shows the RGB and NIR images, the second column shows the feature response maps of the baseline, and the third and fourth columns show the feature response maps of the specific branch and the shared branch, respectively

Feature distribution. To further analyze the effectiveness of the IAL and ISL losses, we use t-SNE [57] to project the high-dimensional feature vectors into two dimensions. As shown in Fig. 7, compared with the result in Fig. 7(a), the features extracted by SSRL with the IAL and ISL losses are better clustered. The distances between centers and the boundaries among different identities are more obvious, verifying that our features are more discriminative.
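A minimal sketch of how such a plot can be produced with scikit-learn and matplotlib, assuming `features` is an (N, D) array of extracted embeddings with identity labels `ids` and modality flags `is_ir`; this is illustrative plotting code, not the script used to generate Fig. 7.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features: np.ndarray, ids: np.ndarray, is_ir: np.ndarray):
    emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
    colors = plt.cm.tab10(np.linspace(0, 1, 10))
    for i, pid in enumerate(np.unique(ids)):
        # Circles: visible features; triangles: infrared features; one color per identity.
        for infrared, marker in ((False, "o"), (True, "^")):
            sel = (ids == pid) & (is_ir == infrared)
            plt.scatter(emb[sel, 0], emb[sel, 1], marker=marker, s=12,
                        color=colors[i % len(colors)])
    plt.axis("off")
    plt.show()
```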

Figure 7

Visualization of learned features, where each color represents an identity in the testing set. The circles and triangles indicate the features extracted from the visible and infrared modalities. A total of 10 persons are selected from the test set. The samples with the same color are from the same person. (a) Features extracted by our SSRL model excluding the IAL loss and ISL loss. (b) Features extracted by our SSRL model

4.6 Parameters analysis

The proposed SSRL involves two key parameters: the intra-class aggregation loss weight \(\lambda _{2}\) and the inter-class separation loss weight \(\lambda _{3}\). As seen in Fig. 8 and Fig. 9, the two parameters are investigated by setting them to various values; setting \(\lambda _{2}\) to 1.0 and \(\lambda _{3}\) to 2.0 achieves the best performance for \(\mathcal{L}_{\mathrm{IAL}}\) and \(\mathcal{L}_{\mathrm{ISL}}\), respectively. We then analyze the performance of our SSRL model with different numbers of parts p, since our method adopts the PCB method; p denotes the number of horizontal stripes. As illustrated in Fig. 10, the performance is optimal when p is 6. The proposed SSRL network includes a specific branch and a shared branch, so we compare it with the baseline in terms of floating-point operations (FLOPs) and the number of parameters. As shown in Table 6, the computation and parameter counts increase compared with the baseline, but Table 3 shows that the accuracy improves substantially.

Figure 8

The effect of parameter \(\lambda _{2}\) on SYSU-MM01 dataset under the all search mode. R1 denotes the Rank-1 accuracy. Rank-1 accuracy (%) and mAP (%) are reported

Figure 9

The effect of parameter \(\lambda _{3}\) on SYSU-MM01 dataset under the all search mode. R1 denotes the Rank-1 accuracy. Rank-1 accuracy (%) and mAP (%) are reported

Figure 10

The effect of partition strips p on SYSU-MM01 dataset under the all search settings. R1 denotes the Rank-1 accuracy. Rank-1 accuracy (%) and mAP (%) are reported

Table 6 Comparison of our model with the baseline method in terms of FLOPs and the number of parameters

5 Conclusion

In this paper, we propose a novel SSRL model, which consists of a shared branch that learns modality-invariant representations by bridging the gap at the image level and a specific branch that learns modality-specific representations while retaining the discriminative information of visible light images. Intra-class aggregation and inter-class separation learning reduce the intra-class distance and increase the inter-class distance, optimizing the distribution of feature embeddings at a fine-grained level. Comprehensive experiments on the SYSU-MM01 and RegDB datasets demonstrate that the proposed method outperforms state-of-the-art methods.