Intra-Camera Supervised Person Re-Identification

Existing person re-identification (re-id) methods mostly exploit a large set of cross-camera identity-labelled training data. This requires a tedious data collection and annotation process, leading to poor scalability in practical re-id applications. On the other hand, unsupervised re-id methods need no identity label information, but they usually suffer from markedly inferior model performance. To overcome these fundamental limitations, we propose a novel person re-identification paradigm based on the idea of independent per-camera identity annotation. This eliminates the most time-consuming and tedious inter-camera identity labelling process, significantly reducing the amount of human annotation effort. It gives rise to a more scalable and more feasible setting, which we call Intra-Camera Supervised (ICS) person re-id, for which we formulate a Multi-tAsk mulTi-labEl (MATE) deep learning method. Specifically, MATE is designed to self-discover the cross-camera identity correspondence within a per-camera multi-task inference framework. Extensive experiments demonstrate the cost-effectiveness superiority of our method over alternative approaches on three large person re-id datasets. For example, MATE yields an 88.7% rank-1 score on Market-1501 in the proposed ICS person re-id setting, significantly outperforming unsupervised learning models and closely approaching conventional fully supervised learning competitors.


Introduction
Person re-identification (re-id) aims to retrieve the target identity class in detected person bounding-box images captured by non-overlapping camera views (Gong et al., 2014; Prosser et al., 2010; Farenzena et al., 2010; Li et al., 2014; Zheng et al., 2013). It is a challenging task due to the non-rigid structure of the human body, highly unconstrained appearance variation across cameras, and the low resolution and low quality of the observations (Fig. 1(a)). While deep learning methods (Chen et al., 2017; Li et al., 2018b; Sun et al., 2018; Hou et al., 2019; Zheng et al., 2019; Zhou et al., 2019) have demonstrated remarkable performance advances, they rely on supervised model learning from a large set of cross-camera identity-labelled training samples. This paradigm needs an exhaustive and expensive training data annotation process (Fig. 1(b)), dramatically lowering the usability and scalability of these methods for large-scale deployment in real-world applications.
Specifically, to label a conventional training dataset, a human annotator often needs to manually match a given person identity from one camera view against all the persons from every other camera view. This cost is quadratic in both the number of camera views and the number of person identities. An illustration of this labelling process is given in Fig. 2(a). Assume an ideal case with M cameras and N identities in each camera view. The cost of labelling a single camera view is O(N), because for most people, re-appearing in the same camera view is rare during a limited time period. An acceleration is possible via the fact that people co-occurring at the same time in a single camera view but at different spatial locations must have distinct identity labels. The cross-camera identity association, by contrast, has complexity O(M²N²), much more significant than per-camera labelling. This is because, in real-world scenarios, generic (unframed) people usually take a-priori unknown pathways in open public spaces. While camera topology knowledge (if available) may help to reduce the search space, cross-camera association remains the dominant labelling cost. Altogether, the annotation complexity is O(MN + M²N²), which is expensive and hence non-scalable given many different deployment domains.
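The cost gap above can be made concrete with a toy calculation; the camera and identity counts below are purely illustrative, not taken from any benchmark:

```python
def annotation_cost(M, N):
    """Rough annotation complexity (in comparison units) for a network of
    M cameras with N identities per camera.

    - intra-camera labelling: O(M*N), one independent pass per camera view
    - cross-camera association: O(M^2 * N^2), matching every identity in
      every camera against every identity in every other camera
    """
    intra = M * N
    cross = M * M * N * N
    return intra, cross

# Illustrative example: 6 cameras, 750 identities per camera.
intra, cross = annotation_cost(6, 750)
print(intra)            # 4500 intra-camera labels
print(cross)            # 20250000 cross-camera comparisons
print(cross // intra)   # the cross-camera term dominates by a factor of 4500
```

As the numbers show, eliminating the quadratic cross-camera term removes almost the entire annotation budget, which is precisely the motivation for the ICS setting below.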
The problem of expensive training data collection has received significant attention. Representative attempts at minimising the annotation cost include: (1) domain-generic feature design (Gray and Tao, 2008; Farenzena et al., 2010; Zheng et al., 2015; Liao et al., 2015; Matsukawa et al., 2016), (2) unsupervised domain adaptation (Peng et al., 2016; Deng et al., 2018a; Wang et al., 2018; Lin et al., 2018; Zhong et al., 2018a; Yu et al., 2019a; Chen et al., 2019), (3) unsupervised image/tracklet model learning (Wang et al., 2016a; Chen et al., 2018a; Lin et al., 2019; Li et al., 2019; Wu et al., 2020), and (4) weakly supervised learning (Meng et al., 2019). By hand-crafting generic appearance features with prior knowledge, the first paradigm of methods can perform re-id matching universally. However, their performance is often inferior due to the limited knowledge encoded in such image representations. This can be addressed by transferring the labelled training data of a source dataset (domain), as demonstrated by the second paradigm of methods. Implicitly, these methods assume that the source and target domains share reasonably similar camera viewing conditions, ensuring sufficient transferable knowledge. The heavy reliance on the relevance and quality of source datasets (Zhu et al., 2019a) renders this approach less practically useful, since this assumption is often invalid. The third paradigm of methods is more scalable, as they need only unlabelled target-domain data. While having high potential, unsupervised re-id methods usually yield the weakest performance, failing to meet deployment requirements. In contrast, the fourth paradigm of methods considers a weakly supervised learning setting, where the person identity labels are annotated at the video level without fine-grained bounding boxes. Apart from insufficient re-id accuracy, this paradigm is mostly sensible only when such weak labels can be cheaply obtained from certain domain knowledge, which however is not generically accessible.
In this work, we suggest another novel person re-identification paradigm for scaling up the model training process, called Intra-Camera Supervised (ICS) person re-id (Fig. 2(b)). As the name indicates, ICS eliminates the sub-process of cross-camera identity association during annotation, and with it the corresponding O(M²N²) complexity, which is the dominant component of the standard annotation cost, as discussed above. Under the ICS paradigm, the training data involve only intra-camera annotated identity labels, with each camera view labelled independently. The labelling complexity is hence only O(MN), with M the number of camera views and N the average number of per-camera person identities, and is therefore significantly more affordable. Importantly, ICS naturally enables a parallel annotation process across camera views without labelling conflict, since no cross-camera identity association is needed (Fig. 3(b)). This desirable merit is lacking in conventional training data labelling due to the difficulty of obtaining disjoint labelling tasks, e.g. subsets of person identity classes without overlap (Fig. 3(a)). While similar to the concurrent work of Meng et al. (2019) in that both explicitly consider the training data labelling process, the proposed ICS paradigm does not assume specific domain knowledge and is therefore more generally applicable. To solve the ICS re-id problem, we propose a Multi-tAsk mulTi-labEl (MATE) deep learning model. Unlike conventional fully supervised re-id methods using inter-camera identity labels, MATE is designed specifically to overcome two ICS challenges: (1) how to learn effectively from per-camera independently labelled training data, and (2) how to reliably discover the missing identity association across camera views. Specifically, MATE integrates two complementary learning components into a unified model: (a) Per-camera multi-task learning, which models individual camera views separately, capturing both their specificity and the implicit shared information in a multi-task learning manner (Sec. 4.1). This assigns a specific network branch (i.e. a learning task) to each camera view while constraining all the per-camera tasks to share a feature representation space. (b) Cross-camera multi-label learning, which associates the identity labels across camera views in a multi-label learning strategy (Sec. 4.2). This is based on an idea of curriculum cyclic association that reliably associates multiple cross-camera identity classes from self-discovered identity matches for multi-label model optimisation.
The contributions of this work are: (1) We present a novel person re-identification paradigm for scaling up the model training process, dubbed Intra-Camera Supervised (ICS) person re-id. ICS requires no exhaustive cross-camera identity matching during training data annotation, whilst naturally allowing parallel labelling across camera views without conflict. Consequently, it makes training data collection substantially cheaper and faster than standard cross-camera identity labelling, offering a more scalable mechanism for large re-id deployments. (2) We formulate a Multi-tAsk mulTi-labEl (MATE) deep learning method for solving the proposed ICS person re-id problem. In particular, MATE combines the strengths of multi-task learning and multi-label learning in a unified framework, accounting for independent camera-specific identity label information whilst concurrently self-discovering the cross-camera association relationships. This represents a natural strategy for fully leveraging the ICS supervision with per-camera independent identity label spaces. (3) Through extensive benchmarking and comparisons on the ICS variants of three large re-id datasets (Market-1501 (Zheng et al., 2015), DukeMTMC-reID (Zheng et al., 2017; Ristani et al., 2016), and MSMT17 (Wei et al., 2018)), we demonstrate the cost-effectiveness advantages of the ICS re-id paradigm using our MATE model over existing representative solutions, including supervised learning, semi-supervised learning, unsupervised learning, unsupervised domain adaptation, and tracklet learning.
A preliminary version of this work was published in (Zhu et al., 2019b). Compared with this earlier study, there are a number of key differences: (i) This study presents a more comprehensive investigation into the proposed ICS person re-id paradigm in terms of training data annotation complexity, along with a comparison to the standard cross-camera identity labelling method. This provides a more accurate measurement of training data collection cost, revealing explicitly the intrinsic obstacles to scaling up model training suffered by the conventional supervised learning re-id paradigm. (ii) We propose a more principled Multi-tAsk mulTi-labEl learning method that self-discovers the cross-camera identity associations in a curriculum learning spirit. This dramatically improves the accuracy of cross-camera identity matching and therefore the final model generalisation, compared to the earlier method. Besides, the new model performs unified end-to-end training without the two-stage learning required in the earlier version. (iii) We provide more comprehensive evaluations and analyses of ICS person re-id, giving holistic and useful insights in comparison to the existing alternative re-id paradigms.

Related Work
Supervised person re-id. Most existing person re-id models are created by supervised learning methods on a separate set of cross-camera identity-labelled training data (Wang et al., 2014b, 2016b; Zhao et al., 2017; Chen et al., 2017; Li et al., 2017; Chen et al., 2018b; Li et al., 2018b; Song et al., 2018; Chang et al., 2018; Sun et al., 2018; Shen et al., 2018a; Wei et al., 2018; Hou et al., 2019; Zheng et al., 2019; Zhang et al., 2019; Wu et al., 2019; Quan et al., 2019; Zhou et al., 2019). Relying on the strong supervision of cross-camera identity-labelled training data, these methods have achieved remarkable performance boosts. However, collecting such training data for each target domain is highly expensive, limiting their usability and scalability in real-world deployments at scale.
Semi-supervised person re-id. A typical strategy for supervision minimisation is semi-supervised learning. The key idea is to self-mine supervision from unlabelled training data based on the knowledge learned from a small proportion of labelled training data. A few attempts have been made in this research direction (Figueira et al., 2013; Liu et al., 2014; Wang et al., 2016a; Xin et al., 2019). However, this paradigm not only suffers significant performance degradation but also still needs a fairly large proportion of expensive cross-view pairwise labelling.
Weakly supervised person re-id. Recently, Meng et al. (2019) proposed a weakly supervised person re-id paradigm where the identity labels are annotated at the untrimmed video level. This setting makes sense mainly when such identity labels are readily available from certain domain knowledge, which may not be generally provided. Moreover, the major annotation cost of re-id training data comes from matching identity classes across camera views rather than drawing person bounding boxes; person images are often directly detected from the raw videos by an off-the-shelf person detection model. Therefore, this paradigm is not sufficiently general.
Unsupervised person re-id. Unsupervised model learning is an intuitive solution to avoid exhaustively collecting a large amount of labelled training data for every application domain. Early hand-crafted feature based unsupervised learning methods (Wang et al., 2014a; Kodirov et al., 2015, 2016; Khan and Bremond, 2016; Ma et al., 2017; Ye et al., 2017; Liu et al., 2017) offer significantly inferior re-id matching performance compared to their supervised learning counterparts. Deep learning based methods (Lin et al., 2019; Wu et al., 2020) reduce this performance gap. Besides, two lines of research on unsupervised re-id learning have become increasingly topical recently.
(1) Unsupervised domain adaptation. The key idea of domain adaptation based methods (Wang et al., 2018; Fan et al., 2018; Peng et al., 2018; Yu et al., 2017; Zhu et al., 2017; Deng et al., 2018b; Zhong et al., 2018b) is to exploit the knowledge from labelled data in related source domains, with model adaptation on the unlabelled target domain data. Typical strategies include appearance style transfer (Zhu et al., 2017; Deng et al., 2018b; Chen et al., 2019), semantic attribute knowledge transfer (Peng et al., 2018; Wang et al., 2018), and progressive source appearance information adaptation (Fan et al., 2018; Yu et al., 2017). Although performing better than earlier unsupervised learning methods, they implicitly require similar data distributions between the labelled source domain and the unlabelled target domain. This limits their scalability to arbitrarily diverse (unknown) target domains in real-world deployments.
(2) Unsupervised tracklet learning. Instead of assuming transferable source-domain training data, a small number of methods (Li et al., 2018a, 2019; Chen et al., 2018a; Wu et al., 2020) leverage auto-generated tracklet data with rich spatio-temporal information for unsupervised re-id model learning. In many cases this is a feasible solution as long as video data are available. However, it remains highly challenging to achieve good model performance due to noisy tracklets with unconstrained dynamics.
In this work, we introduce a new, more scalable person re-id paradigm characterised by intra-camera supervised (ICS) learning, complementing the existing re-id scenarios mentioned above. In comparison, ICS provides a superior trade-off between model accuracy and annotation cost, i.e. higher cost-effectiveness. This makes it a favourable choice for large-scale re-id applications with high accuracy requirements and a reasonably limited annotation budget.

Problem Formulation
We formulate the Intra-Camera Supervised (ICS) person re-identification problem. As illustrated in Fig. 2(b), ICS only requires annotating intra-camera person identity labels independently, eliminating the most expensive inter-camera identity association required in the conventional fully supervised re-id setting.
Suppose there are M camera views in a surveillance camera network. For each camera view p ∈ {1, 2, ..., M}, we independently annotate a set of training images D^p = {(x_i^p, y_k^p)}, where each person image x_i^p is associated with an identity label y_k^p ∈ {y_1^p, y_2^p, ..., y_{N^p}^p}, and N^p is the total number of unique person identities in D^p. For clarity, we place the camera view index in the superscript, reflecting the per-camera independent labelling nature of the ICS setting. By combining all the camera-specific labelled data D^p, we obtain the entire training set D = {D^1, D^2, ..., D^M}. For any two camera views p and q, their k-th person identities y_k^p and y_k^q usually describe two different people, i.e. the two identity label spaces are independent (Fig. 2(b)). This means exactly that the cross-camera identity association is not available, in contrast to the fully supervised re-id data annotation (Fig. 2(a)).
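To make the setting concrete, the ICS training data can be pictured as one independent label space per camera; the nested-dict layout and all camera/identity/file names below are hypothetical, used only to illustrate the formulation:

```python
# Each camera view carries its own independent identity label space:
# label 0 under cam1 and label 0 under cam2 are (in general) two different
# people, and no cross-camera association is given at annotation time.
ics_train_set = {
    "cam1": {0: ["cam1_img_001.jpg", "cam1_img_002.jpg"],
             1: ["cam1_img_003.jpg"]},
    "cam2": {0: ["cam2_img_001.jpg"],
             1: ["cam2_img_002.jpg", "cam2_img_003.jpg"]},
}

M = len(ics_train_set)                                    # number of camera views
N_p = {p: len(ids) for p, ids in ics_train_set.items()}   # per-camera identity counts
print(M, N_p)   # 2 {'cam1': 2, 'cam2': 2}
```

The key property is that each per-camera dictionary can be produced by a different annotator, in parallel, with no coordination.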
The ICS re-id problem presents two new modelling challenges: (1) how to effectively exploit the per-camera person identity labels, and (2) how to automatically and reliably associate the independent identity label spaces across camera views. Existing fully supervised re-id methods do not apply, as they require identity annotation in a single label space shared across camera views. A new learning method tailored to the ICS setting is therefore needed.

Method
We introduce a novel ICS deep learning method capable of Multi-tAsk mulTi-labEl (MATE) model learning, fully exploiting the independent per-camera person identity label spaces. In particular, MATE solves the two aforementioned challenges by integrating two complementary learning components into a unified solution: (i) per-camera multi-task learning, which assigns a separate learning task to each individual camera view for dedicated modelling of its identity space (Sec. 4.1); (ii) cross-camera multi-label learning, which associates the independent identity label spaces across camera views in a multi-label strategy (Sec. 4.2). Combining the two capabilities with a unified objective function, MATE explicitly optimises their mutual compatibility and complementary benefits via end-to-end training. An overview of MATE is depicted in Fig. 4.

Per-Camera Multi-Task Learning
To maximise the use of multiple camera-specific identity label spaces with some underlying correlation (e.g. partial identity overlap) in the ICS setting, multi-task learning is a natural choice for model design (Argyriou et al., 2007). It allows the model not only to mine the common knowledge among all the camera views, but also to improve per-camera learning concurrently given the augmented (aggregated) training data.
Specifically, given the independent label spaces, we treat each camera view as a separate learning task, all of which share a feature representation network for extracting the common knowledge in a multi-branch architecture; each branch is in charge of one camera view. This forms per-camera multi-task learning in the ICS context. Through such multi-task learning, our method can favourably derive a person re-id representation with implicit cross-camera identity discriminative capability, facilitating cross-camera identity association (Li et al., 2019). This is because, during training, all the branches concurrently propagate their camera-specific identity label information through the shared representation network f_θ (Fig. 4(b)), leading to a camera-generic representation. This process is driven by minimising the softmax cross-entropy loss.
Formally, for a training image (x_i^p, y_k^p) ∈ D^p from camera view p, the training loss is the softmax cross-entropy:

  \mathcal{L}_{ce}(x_i^p, y_k^p) = -\mathbf{1}(y_k^p)\,\log\mathrm{softmax}\big(g^p(f_\theta(x_i^p))\big)   (1)

where, given the camera-shared feature vector f_θ(x_i^p) ∈ R^{d×1}, the classifier g^p(·): R^{d×1} → R^{N^p×1} for camera view p predicts an identity class distribution over its own label space of N^p classes. The Dirac delta function 1(·): R → R^{1×N^p} returns a one-hot vector with "1" at the specified index. By aggregating the loss over training samples from all camera views, we formulate the per-camera multi-task learning objective as:

  \mathcal{L}_{mt} = \sum_{p=1}^{M} \frac{1}{B^p} \sum_{i=1}^{B^p} \mathcal{L}_{ce}(x_i^p, y_k^p)   (2)

where B^p denotes the number of training images from camera view p in a mini-batch.
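Under these definitions, the per-camera multi-task objective can be sketched in NumPy; the linear classifier heads, shapes, and random values below are illustrative stand-ins for the real network, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def per_camera_loss(feature, W_p, label):
    """Cross-entropy of one sample under its camera-specific classifier W_p."""
    probs = softmax(W_p @ feature)       # g^p(f_theta(x)) modelled as a linear head
    return -np.log(probs[label])

def multi_task_loss(batches, heads):
    """Sum over cameras of the batch-averaged per-camera cross-entropy.

    batches: {camera: [(feature, label), ...]} drawn from each camera view
    heads:   {camera: classifier weight matrix W_p of shape (N_p, d)}
    """
    total = 0.0
    for p, samples in batches.items():
        total += sum(per_camera_loss(f, heads[p], y) for f, y in samples) / len(samples)
    return total

rng = np.random.default_rng(0)
d = 8
heads = {"cam1": rng.normal(size=(3, d)), "cam2": rng.normal(size=(4, d))}
batches = {"cam1": [(rng.normal(size=d), 0), (rng.normal(size=d), 2)],
           "cam2": [(rng.normal(size=d), 1)]}
loss = multi_task_loss(batches, heads)
```

Note that each camera head only ever sees labels from its own label space, while the features (here fixed vectors, in the model the output of the shared backbone) are what the cameras implicitly share.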

Cross-Camera Multi-Label Learning
Cross-camera person appearance variation is a key challenge for re-id. Whilst this is implicitly modelled by the multi-task learning detailed above, per-camera multi-task learning alone is insufficient to fully capture the underlying identity correspondence relationships across camera-specific label spaces. However, associating identity classes across camera views is non-trivial. One major reason is that a different set of persons may appear in each camera view, so there is no one-to-one identity matching between camera views. Conceptually, this gives rise to a very challenging open-set recognition problem, where a rejection strategy is often additionally required (Scheirer et al., 2013, 2014). Compared to generic object recognition in natural images, open-set modelling in re-id is more difficult due to small training data, large intra-class variation, subtle inter-class differences, and ambiguous visual observations in surveillance person imagery. Besides, existing open-set methods often assume accurately and completely labelled training data, with unseen classes appearing only at test time. In contrast, we need to discover cross-camera identity correspondences during training, with small (unknown) overlap across the different label spaces. This is a harder learning scenario, with a higher risk of error propagation from noisy cross-camera associations. An intuitive solution for open-set recognition is to find an operating threshold, e.g. by Extreme Value Theory (De Haan and Ferreira, 2007) based statistical analysis. This relies on optimal supervised model learning from a sufficiently large training dataset, which is unavailable in the ICS setting.
To circumvent the above problems, we design a cross-camera multi-label learning strategy for robust cross-camera identity association. This is realised by (i) designing a curriculum cyclic association constraint to find reliable cross-camera identity associations, and (ii) forming a multi-label learning algorithm that incorporates the self-discovered cross-camera identity associations into discriminative model learning (Fig. 4(c)).

Curriculum Cyclic Association
For more reliable identity association across camera views, we form a cyclic prediction consistency constraint. Specifically, given an identity class y_k^p ∈ {y_1^p, y_2^p, ..., y_{N^p}^p} from a camera view p ∈ {1, 2, ..., M}, we need to determine whether a true matching identity (i.e. the same person) exists in another camera view q. We achieve this via the following process.
(i) We first project all the images of each person identity y_k^p from camera view p to the classifier branch of camera view q, obtaining a cross-camera prediction ỹ_k^{p→q} via averaging:

  \tilde{y}_k^{p\rightarrow q} = \frac{1}{S_k^p} \sum_{x_i \in X_k^p} \mathrm{softmax}\big(g^q(f_\theta(x_i))\big)   (3)

where S_k^p is the number of images of identity y_k^p. Each element of ỹ_k^{p→q}, denoted ỹ_k^{p→q}(l), gives the probability that y_k^p (an identity from camera view p) matches y_l^q (an identity from camera view q) in a cross-camera sense.
(ii) We then nominate the person identity y_{l*}^q from camera view q with the maximum matching probability as the candidate matching identity:

  l^* = \arg\max_{l}\ \tilde{y}_k^{p\rightarrow q}(l)   (4)

With such one-way (p → q) association alone, the matching accuracy is unsatisfactory, since it cannot handle the no-true-match cases typical of the ICS setting. To boost the matching robustness and correctness, we further design a curriculum cyclic association constraint.
(iii) Specifically, in the opposite direction of the above steps, we project all the images of identity y_{l*}^q from camera view q to the classifier branch of camera view p in the manner of Eq. (3), and obtain the best candidate matching identity y_{t*}^p with Eq. (4). Given this back-and-forth matching between camera views p and q, we subsequently filter the candidate pair (y_k^p, y_{l*}^q) by a cyclic constraint:

  (y_k^p, y_{l*}^q) is a candidate match if y_{t*}^p = y_k^p, and is not a candidate match otherwise.   (5)

This removes non-cyclic association pairs. While more reliable, the cyclic association of Eq. (5) alone is observed to be insufficient for hard cases (e.g. different people with very similar clothing appearance), leading to false associations.
(iv) To overcome this problem, inspired by findings of cognitive studies suggesting that a better learning strategy is to start small (Elman, 1993; Krueger and Dayan, 2009), we design a curriculum association constraint based on the cross-camera identity matching probability. Formally, we define a cyclic association degree as:

  A(y_k^p, y_{l*}^q) = \tilde{y}_k^{p\rightarrow q}(l^*)\cdot \tilde{y}_{l*}^{q\rightarrow p}(t^*)   (6)

which measures the joint probability of a cyclic association between the two identities y_k^p and y_{l*}^q. Given this unary measurement, we deploy a curriculum threshold τ ∈ [0, 1] for selecting candidate matching pairs:

  (y_k^p, y_{l*}^q) is selected as a match if A(y_k^p, y_{l*}^q) ≥ τ.   (7)

This filtering determines whether a cyclically associated identity pair (y_k^p, y_{l*}^q) is accepted as a match.

Curriculum threshold. The design of the curriculum threshold τ has a crucial influence on the quality of cross-camera identity association. In the spirit of curriculum learning, we consider τ as an annealing function of the model training time to enable progressive selection. Meanwhile, we must take into account that the magnitude of the maximum prediction usually increases along the training process as the model matures. Taking these into consideration, we formulate the curriculum threshold as a function (Eq. (8)) of the current training round r, out of a total of R rounds, interpolating between a lower bound τ_l and an upper bound τ_u. Both bounds can be estimated by cross-validation.
Summary. We perform the above curriculum cyclic association process for every pair of camera views, which outputs a set of associated identity pairs across camera views. This self-discovered pairwise information is then used to improve model training, as detailed in the following.
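The association procedure above can be sketched as follows. The product form of the association degree and the linear annealing of τ between its bounds are our reading of the description, not verbatim reproductions of the paper's equations:

```python
import numpy as np

def avg_prediction(images_feats, head):
    """Mean softmax prediction of one identity's image features (n, d)
    under another camera's linear classifier head (N, d)."""
    logits = images_feats @ head.T
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs = e / e.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)

def cyclic_match(feats_p, feats_q_by_id, head_p, head_q, k, tau):
    """Try to associate identity k of camera p with an identity of camera q.

    feats_p:        image features of identity k in camera p, shape (n, d)
    feats_q_by_id:  {identity: (m, d) image features} for camera q
    Returns the matched camera-q identity, or None if rejected.
    """
    y_pq = avg_prediction(feats_p, head_q)                 # forward: p -> q
    l_star = int(np.argmax(y_pq))
    y_qp = avg_prediction(feats_q_by_id[l_star], head_p)   # backward: q -> p
    t_star = int(np.argmax(y_qp))
    if t_star != k:                                        # cyclic consistency check
        return None
    degree = y_pq[l_star] * y_qp[t_star]                   # assumed joint-probability form
    return l_star if degree >= tau else None

def curriculum_tau(r, R, tau_l, tau_u):
    """Assumed linear annealing of the threshold over training rounds."""
    return tau_l + (tau_u - tau_l) * r / R
```

Running `cyclic_match` for every identity of every camera pair, with `tau` updated per round by `curriculum_tau`, reproduces the overall selection loop described in this subsection.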

Multi-Label Learning
To leverage the above identity association results for improving discriminative model learning, we introduce a multi-label learning scheme from a cross-camera perspective. It consists of (i) multi-label annotation and (ii) multi-label training.
(i) Multi-label annotation. For ease of presentation and understanding, we assume two camera views; the extension to more camera views is straightforward. Given an associated identity pair (y_k^p, y_{l*}^q) obtained as above, we annotate all the images X_k^p of y_k^p from camera view p with the extra label y_{l*}^q of camera view q. We do the same for all the images X_{l*}^q of y_{l*}^q in the inverse direction. Both image sets are therefore annotated with the same two identity labels, i.e. these images are associated. See the illustration in Fig. 4(c). Given M camera views, for each identity y_k^p we perform such annotation at most M−1 times, whenever a cross-camera association is found, resulting in a multi-label set of size at most M; the maximal size means an identity association has been found in every other camera view.
(ii) Multi-label training. Given such cross-camera multi-label annotation, we formulate a multi-label training objective for an image x_i^p as:

  \ell_{ml}(x_i^p) = \sum_{c} -\mathbf{1}(y_c)\,\log\mathrm{softmax}\big(g^c(f_\theta(x_i^p))\big)   (9)

where c indexes the camera views in the multi-label set Y_i^p, with the corresponding identity label simplified as y_c. For mini-batch training, we design the cross-camera multi-label learning objective as:

  \mathcal{L}_{ml} = \frac{1}{B} \sum_{i=1}^{B} \ell_{ml}(x_i)   (10)

which averages the multi-label training loss over all B training images in a mini-batch.
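A minimal sketch of the multi-label training term described above, assuming each image's loss sums the cross-entropy under every camera head for which it carries a label (our reading of the description; the linear heads and shapes are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # subtract max for numerical stability
    return e / e.sum()

def multi_label_loss(feature, heads, label_set):
    """Sum of cross-entropy terms over an image's multi-label set.

    heads:     {camera: classifier weight matrix W_c of shape (N_c, d)}
    label_set: {camera: identity label under that camera's label space},
               i.e. the image's original label plus any labels gained via
               the self-discovered cross-camera associations.
    """
    loss = 0.0
    for c, y_c in label_set.items():
        probs = softmax(heads[c] @ feature)
        loss += -np.log(probs[y_c])
    return loss
```

When no association was found for an image, `label_set` contains only its own camera and the term reduces to the per-camera cross-entropy; each extra associated camera adds one more supervising term.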
Remarks. It is worth pointing out that, in contrast to conventional single-task multi-label learning (Tsoumakas and Katakis, 2007), we jointly formulate multi-label learning and multi-task learning in a unified framework, with the unique objective of associating different label spaces and merging independently annotated labels that share the same semantics.

Final Objective Loss Function
By combining the per-camera multi-task (Eq. (2)) and cross-camera multi-label (Eq. (10)) learning objectives, we obtain the final model loss function as:

  \mathcal{L} = \mathcal{L}_{mt} + \lambda \mathcal{L}_{ml}   (11)

where the weight parameter λ ∈ [0, 1] trades off the two loss terms. With this formula as the model training supervision, our method can effectively learn a discriminative re-id model using both the camera-specific identity label spaces available under the ICS setting (L_mt) and the cross-camera identity associations self-discovered by MATE itself (L_ml) concurrently. The MATE model training process is summarised in Algorithm 1.
Algorithm 1 The MATE model training procedure.

Experiments
Datasets. Since no existing re-id datasets target the proposed scenario, we introduced three ICS re-id benchmarks. We simulated the ICS identity annotation process on three existing large person re-id datasets: Market-1501 (Zheng et al., 2015), DukeMTMC-reID (Ristani et al., 2016; Zheng et al., 2017), and MSMT17 (Wei et al., 2018). Specifically, for the training data of each dataset, we independently perturbed the original identity labels for every individual camera view, ensuring that the same class labels from any pair of different camera views correspond to two distinct persons (i.e. no labelled cross-camera association). We used the original test data of each dataset for model performance evaluation.
Performance metrics. Following common person re-id practice, the Cumulative Matching Characteristic (CMC) and mean Average Precision (mAP) metrics were used for model performance measurement.
Implementation details. The ImageNet pre-trained ResNet-50 (He et al., 2016) was selected as the backbone network of our MATE model. As shown in Fig. 4, each branch in MATE was formed by a fully connected (FC) classification layer. We set the dimension of the re-id feature representation to 512. Person images were resized to 256×128 pixels. The standard stochastic gradient descent (SGD) optimiser was adopted. The initial learning rates of the backbone network and the classifiers were set to 0.005 and 0.05, respectively. We used a total of 10 rounds to anneal the curriculum threshold τ (Eq. (8)), each round covering 20 epochs (except the last round, trained for 50 epochs to guarantee convergence). We empirically estimated τ_l = 0.5 (the lower bound of τ) and τ_u = 0.95 (the upper bound) for Eq. (8). To balance model training across camera views, we randomly selected from each camera the same number of identities (2) and the same number of images per identity (4) to construct a mini-batch. Unless stated otherwise, we set the loss weight λ = 0.5 in Eq. (11). At test time, the Euclidean distance was applied to the camera-generic feature representations for re-id matching.
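The balanced mini-batch sampling described above can be sketched as follows; the data layout and helper name are hypothetical, and per-identity images are drawn with replacement so identities with fewer images than requested still fill their quota:

```python
import random

def sample_balanced_batch(images_by_cam_id, ids_per_cam=2, imgs_per_id=4, seed=0):
    """Draw the same number of identities per camera and images per identity.

    images_by_cam_id: {camera: {identity: [image paths]}}
    Returns {camera: [(identity, image), ...]} with
    ids_per_cam * imgs_per_id entries per camera.
    """
    rng = random.Random(seed)
    batch = {}
    for cam, ids in images_by_cam_id.items():
        chosen_ids = rng.sample(sorted(ids), ids_per_cam)
        batch[cam] = [(pid, img)
                      for pid in chosen_ids
                      for img in rng.choices(ids[pid], k=imgs_per_id)]
    return batch
```

This mirrors the stated setting of 2 identities × 4 images per camera, giving every camera head the same amount of supervision per training step.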

Benchmarking the ICS Person Re-ID
Since there are no dedicated methods for solving the proposed ICS person re-id problem, we formulated and benchmarked three baseline methods based on generic learning algorithms:

1. Multi-Camera Single-Task (MCST) learning (Fig. 5(a)): Given no identity association across camera views, we simply assume that identity classes from different camera views belong to distinct people and merge all the per-camera label spaces into a joint space cumulatively. This enables conventional supervised model learning based on identity classification. We therefore train a single re-id model, as in the common supervised learning paradigm. At test time, we extract the re-id feature vectors and apply the Euclidean distance as the matching metric.

2. Ensemble of Per-Camera Supervised (EPCS) learning (Fig. 5(b)): Without inter-camera identity labels, for each camera view we train a separate re-id model with its own single-camera training data. During deployment, given a test image we extract the feature vectors of all the per-camera models, concatenate them into a single representation vector, and use the Euclidean distance as the matching metric.

3. Per-Camera Multi-Task (PCMT) learning (Fig. 5(c)): This is a variant of our MATE model without the cross-camera multi-label learning component; we also consider it a baseline due to its use of the multi-task learning strategy.
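The cumulative label-space merging used by the MCST baseline can be sketched as below; the function name and input format are illustrative assumptions. Each camera's labels are offset by the number of classes contributed by earlier cameras, so the same person seen under two cameras receives two distinct joint labels.

```python
def merge_label_spaces(per_camera_labels):
    """Merge per-camera identity label spaces into one joint space.

    `per_camera_labels` maps camera id -> list of per-camera identity
    labels (one entry per training image), assumed to be contiguous
    integers 0..K-1 within each camera. Labels from different cameras
    are treated as distinct people.
    """
    merged = {}
    offset = 0
    for cam in sorted(per_camera_labels):
        labels = per_camera_labels[cam]
        merged[cam] = [offset + l for l in labels]
        offset += len(set(labels))  # classes contributed by this camera
    return merged, offset  # offset is now the joint class count
```

This makes explicit why MCST underperforms: every cross-camera duplicate identity is encoded as two conflicting classes in the joint space.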
To implement the baseline methods fairly, we used the same ResNet-50 backbone as our method, a widely used architecture in the re-id literature. We trained each of these models with the softmax cross-entropy loss function in their respective designs.
Results. We compared our MATE model with the three baseline methods in Table 1. Several observations can be made:

1. By simply concatenating the per-camera identity label spaces, MCST yields the weakest re-id performance. This is not surprising because a large (unknown) proportion of duplicated identities are mistakenly labelled as different classes, misleading the model training process.

2. This problem can be addressed by independently exploiting camera-specific identity class annotations, as EPCS does. This method consistently produces better re-id model generalisation. However, the overall accuracy is still rather low, due to the inability to leverage the knowledge shared between camera views and to mine the inter-camera identity matching information.

3. To address this cross-camera association issue, PCMT provides an implicit solution and significantly improves the model performance.

4. Moreover, the proposed MATE model further boosts the re-id matching accuracy by explicitly associating the identity classes across camera views in a reliable formulation. This verifies the efficacy of our model in capitalising on such cheaper and more scalable per-camera identity labelling.
To further examine the model performance, in Fig. 6 we visualised the feature distributions of a randomly selected person identity with images captured by all the camera views of Market-1501. It is shown that the feature points of our model present the best camera-invariance property, qualitatively validating its superior re-id performance over the other competitors.

Comparing Different Person Re-ID Paradigms
As a novel person re-id scenario, it is informative and necessary to compare ICS with other existing scenarios from the perspectives of problem solving and supervision cost.

Further Evaluation of Our Method
We conducted a sequence of in-depth component evaluations for the MATE model on the Market-1501 dataset.

Ablation Study
We started by evaluating the three components of our MATE model: Per-Camera Multi-Task (PCMT) learning, Cross-Camera Multi-Label (CCML) learning, and Curriculum Thresholding (CT). [...] verifying the capability of our cross-camera identity matching strategy in discovering the underlying image pairs. (3) With the help of CT, a further performance gain is realised, validating the idea of exploiting curriculum learning and the design of our curriculum threshold.

Curriculum Thresholding (CT). The results in [...]
As a key performance contributor, we further examined CCML by evaluating its essential part: cross-camera identity association. To this end, we tracked the statistics of self-discovered identity pairs across camera views over the training rounds, including the precision and recall measurements. As shown in Fig. 7, our model mines an increasing number of identity association pairs whilst maintaining very high precision, which effectively limits the risk of error propagation and its disastrous consequences. This explains the efficacy of our cross-camera multi-label learning. On the other hand, while failing to identify around 40% of the identity pairs, our model still achieves very competitive performance compared to fully supervised learning models. This suggests that our method has already discovered the majority of the re-id discrimination information from the associated identity pairs, missing only a small fraction embedded in those hard-to-match pairs. In this regard, we consider that the proposed model makes a satisfactory trade-off between identity association error and knowledge mining.

To examine the impact of cross-camera identity association together with per-camera learning, we visualised the feature distribution change for a set of multi-camera images of a single person. It is observed from Fig. 8 that the same-person images are gradually associated in the re-id feature space, reaching a distribution similar to that of the supervised learning case. This is consistent with the numerical performance evaluation above.
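A minimal sketch of thresholded mutual-nearest-neighbour association between two cameras' identity class centroids is given below. This is our reading of the curriculum cyclic association described in Fig. 4; the exact matching rule in MATE may differ, and the function name is our own.

```python
import numpy as np

def associate_identities(feats_a, feats_b, tau):
    """Cyclically associate identity class centroids of two cameras.

    `feats_a`, `feats_b`: L2-normalised per-identity feature matrices
    (one row per identity class). A pair (i, j) is accepted only if
    i and j are mutual nearest neighbours AND their cosine similarity
    exceeds the curriculum threshold `tau`.
    """
    sim = feats_a @ feats_b.T                 # cosine similarity matrix
    nn_ab = sim.argmax(axis=1)                # best match in b for each a
    nn_ba = sim.argmax(axis=0)                # best match in a for each b
    pairs = []
    for i, j in enumerate(nn_ab):
        if nn_ba[j] == i and sim[i, j] >= tau:  # cyclic check + threshold
            pairs.append((i, int(j)))
    return pairs
```

The mutual (cyclic) check is what keeps precision high in Fig. 7: a one-directional nearest neighbour is rejected unless the match also holds in the reverse direction.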

Hyper-Parameter Analysis
We examined the performance sensitivity of three parameters of MATE: the loss weight λ (default value 0.5) in Eq. (11), and the lower (default value 0.5) and upper (default value 0.95) bounds of the curriculum threshold in Eq. (8). The evaluation in Fig. 9 shows that all these parameters have a wide range of satisfactory values in terms of performance. This suggests the ease and convenience of setting up model training and the good accuracy stability of our method.
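For concreteness, one plausible schedule for the curriculum threshold is a linear anneal between the two bounds over the 10 training rounds, starting strict and relaxing so that harder pairs are admitted later. Both the linear form and the annealing direction are assumptions on our part (Eq. (8) is not reproduced here), though a decreasing threshold is consistent with the growing pair count in Fig. 7.

```python
def curriculum_tau(round_idx, num_rounds=10, tau_l=0.5, tau_u=0.95):
    """Linearly anneal the curriculum threshold from tau_u down to tau_l.

    Starting strict (only highly confident cross-camera pairs are
    associated) and relaxing over rounds admits progressively harder
    pairs. NOTE: this linear form is an assumption, not Eq. (8).
    """
    frac = round_idx / (num_rounds - 1)   # 0.0 at the first round, 1.0 at the last
    return tau_u - frac * (tau_u - tau_l)
```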

Conclusions
In this work, we presented a novel person re-identification paradigm, i.e. intra-camera supervised (ICS) learning, characterised by training re-id models with only per-camera independently annotated person identity labels, without the conventional cross-camera identity labelling. The key motivation lies in eliminating the tedious and expensive process of manually associating identity classes across every pair of camera views in a surveillance network, which makes training data collection too costly to be affordable in large real-world applications. To address the ICS re-id problem, we formulated a Multi-tAsk mulTi-labEl (MATE) learning model capable of fully exploiting per-camera re-id supervision whilst simultaneously self-discovering cross-camera identity associations. We conducted extensive comparative evaluations on three re-id benchmarks to demonstrate the cost-effectiveness advantages of the ICS re-id paradigm over a wide range of existing representative re-id settings, and the performance superiority of our MATE model over the state-of-the-art alternative learning methods in the proposed ICS setting. Detailed ablation analyses were also provided to give insights into our model design.

Fig. 1 (a) Person re-identification challenges. Each triplet bounded by a dashed box shows the images of a single person from different camera views. (b) Illustration of manually associating identities across camera views. The dashed arrow denotes the comparison between two identities. The associated identities are bounded with red boxes.

Fig. 2 Labels in person re-id data. (a) Conventional fully supervised training data needs both per-camera and cross-camera identity annotation in a unified class space. (b) Intra-camera supervised (ICS) training data only needs per-camera identities annotated independently in each camera view with a separate class space. The camera-view index is encoded as the superscript of the identity label in ICS person re-id data. Solid and dashed arrows denote intra-camera and inter-camera association, respectively.

Fig. 3 Illustration of the data annotation process. (a) Conventional fully supervised person re-id vs. (b) ICS person re-id in the process of training data collection. Suppose each annotator labels the training data from a different camera view. To minimise labelling conflicts, an annotator may have to check whether a person has already been labelled by others. This gives rise to expensive communication costs, which are totally eliminated in the proposed ICS re-id paradigm due to the independence between camera views.

Fig. 4 Overview of the proposed Multi-tAsk mulTi-labEl (MATE) deep learning method. (a) Given per-camera independently labelled training images, MATE aims to learn an identity-discriminative feature representation model. This is achieved by designing two learning components: (b) per-camera multi-task learning, where we consider each individual camera view as a separate learning task with its own identity class space and optimise these camera-specific tasks on a common feature representation (Sec. 4.1), and (c) cross-camera multi-label learning, where we self-discover the underlying identity matching relationships across camera views via curriculum cyclic association and design a multi-label optimisation algorithm to exploit the discovered cross-camera association information during model training. The two components are integrated in a single MATE formulation, resulting in an end-to-end trainable model.

Fig. 7 Dynamic statistics of cross-camera identity association over the training rounds. Dataset: Market-1501.

Table 1 Benchmarking the ICS person re-id performance.
Fig. 6 Visualisation of a randomly selected person identity appearing under all six camera views of the Market-1501 dataset, made by t-SNE (Maaten and Hinton, 2008). Camera views are colour-coded. Best viewed in colour.

Fig. 8 The feature distribution evolution of a set of multi-camera images of a single random person over the training rounds, in comparison to (e) the feature distribution obtained by supervised learning. Dataset: Market-1501. Best viewed in colour.

Table 3 Evaluating the model components of MATE: Per-Camera Multi-Task (PCMT) learning, Cross-Camera Multi-Label (CCML) learning, and Curriculum Thresholding (CT).