1 Introduction

With the rapid development of online shopping, counterfeit products have become an increasingly severe global problem. For example, in 2021 the Chinese police cracked a case involving more than 340,000 counterfeit items, such as clothing and luggage, valued at more than 300 million RMB [1]. Counterfeit goods harm the legitimate interests of consumers and should be detectable in a simple way. So far, however, the detection task, especially for expensive goods, can only be carried out by experts, which is inefficient when a tremendous number of goods must be checked.

There have been a few studies on automatic counterfeit detection. Traditional machine learning methods rely on features designed manually from domain knowledge [2], which costs considerable time and labor. Deep learning methods require a large-scale dataset, and their performance cannot be guaranteed with a small sample size. Moreover, counterfeit detection differs from defect detection, fake face detection, and similar tasks: the difficulty of sample collection prevents the construction of a large-scale dataset, and the variety of product series weakens multi-series detection performance. To address these problems, we propose a counterfeit detection network based on key area guidance and multi-task learning, and conduct experiments on counterfeit luxury goods as an example. The authenticity of a sample is judged from the shape differences between genuine and counterfeit samples. The main contributions of this paper are summarized as follows:

Firstly, a multi-task learning mechanism is introduced into counterfeit detection for the first time. A single-task method requires large amounts of data to find the tiny differences between genuine and counterfeit samples, whereas a well-designed multi-task method can facilitate the learning process. We design a simple auxiliary task that is easy to learn from key areas and helps the attention of the main task quickly focus on key locations, thus reducing the amount of data required.

Secondly, a lightweight multi-task architecture is designed for counterfeit detection. To keep the architecture compact, the supporting relationship between the tasks is exploited and the feature extraction network is shared. The parameters are optimized to simultaneously minimize a disturbance detection loss and an authenticity identification loss. In addition, a sample generation algorithm, called KAG, is designed that disturbs key areas. As a result, a super-dataset is constructed for training both the main and auxiliary tasks without collecting additional samples.

Finally, an image preprocessing strategy named FWD is proposed to avoid deformation interference. Images are usually normalized to the same size, which introduces heavy distortion when the original images have diverse aspect ratios. This significantly affects performance because fine-grained differences in key objects are blurred. With the FWD strategy, the sample image is first padded to a square, which is shown to improve learning.

The rest of this paper is organized as follows. Section 2 discusses related work that can be used for counterfeit detection. The proposed method and the performance analysis are detailed in Sections 3 and 4, respectively. Section 5 presents the conclusion and future work.

2 Related work

Some related studies on counterfeit detection have been published in recent years. The Entrupy team was the first to apply deep learning to counterfeit detection [3], exploiting the fact that the microscopic features of genuine products have unique attributes that can be used for identification. Its application was constrained by the dependence on specialist equipment. Tang’s team used object detection and text recognition to identify samples [4] and developed the “Bao Xiaojian” counterfeit detection system. Wang et al. designed a lightweight CNN authentication model to identify the texture material and font print of Gucci’s black labels [5]. Arguing that both global and local information should be exploited for better performance, a two-stage method [6] extracts features both from a whole word and from its separate characters. All these works keep improving the ability to identify non-significant differences. According to our research, however, counterfeit detection should rather be regarded as a fine-grained classification task, for two reasons. Firstly, the difference between genuine and counterfeit goods becomes smaller as counterfeit craftsmanship improves. Secondly, the style of genuine goods varies among different series.

Fig. 1 The overall network architecture of the proposed method. The original images and the disturbed versions generated by the KAG algorithm are sampled randomly, and each sample carries two attributes. The corresponding feature is obtained by the feature extraction network, and the main branch and the auxiliary branch then judge whether the sample is counterfeit and whether it is disturbed, respectively. No auxiliary branch is needed at test time, so the deployed model is lighter

Given the similarity to fine-grained classification, techniques including attention mechanisms, fine-grained classification, and large vision models are also analyzed. Wang et al. proposed ECA Net [7], which enhances the recognition ability of the model by introducing an attention module. Fine-grained classification networks [8,9,10,11,12,13] design various mechanisms that push models to focus on critical visual areas, thus addressing the issue of large intra-class differences and tiny inter-class differences. Since the Vision Transformer [14], many ViT variants have been put forward and have achieved impressive improvements. In particular, the SwinTransformer [15], which introduces a shifted window mechanism to improve performance while saving computational cost, is widely chosen as a backbone network to improve visual representation. Although these methods are useful for learning the tiny differences between genuine and counterfeit goods, none of them copes well with a small dataset.

In pursuit of better fine-grained discrimination, multi-task methods [16,17,18,19,20,21,22,23,24,25,26,27,28] are also taken into account. Multi-task learning focuses on parameter-sharing strategies across tasks and on improving task correlation. As demonstrated by studies [29,30,31], appropriate auxiliary tasks can effectively promote the learning process of the other tasks. Although the multi-task learning mechanism has the potential to solve counterfeit detection with insufficient samples, current research is mostly limited to recommendation systems, and its application to counterfeit detection remains unexplored.

The proposed method addresses the aforementioned issues. It is a counterfeit detection network tailored to the characteristics of the task and is able to learn fine-grained differences on a challenging dataset with multiple categories and few samples in each class.

3 The proposed approach

Counterfeit detection usually involves goods from multiple series with similar but varied visual characteristics. The need for multi-class classification conflicts with the small number of available samples, and it is almost impossible to train a separate model for each series. Therefore, we build a multi-class model and leverage an auxiliary task to promote fine-grained discrimination. The auxiliary task should be easy to learn, should help attract attention to the critical visual areas, and should not increase the number of samples required.

The overall architecture, shown in Fig. 1, comprises five stages: key area guidance, feature extraction, main task branch, auxiliary task branch, and loss aggregation. The labels of the main task and the auxiliary task are denoted as \(\{(y_i^1,y_i^2)|y_i^k\in \{0,1\},k\in \{1,2\}\}\).

3.1 Key area guidance

As analyzed above, the auxiliary task needs to guide the attention of the main task. Inspired by image augmentation techniques [32,33,34], we propose a simple but effective algorithm named KAG that sharpens the focus of the main task by disturbing image patches. We also note that DCL-Net [35] uses a similar technique to drive the model to focus on detailed differences, which supports the validity of our method.

figure a

The KAG algorithm takes unprocessed raw images as input. The input image is first divided into patches of the same size: the segment degree N controls the granularity of the segmentation, and the input image is adaptively segmented into \(N\times N\) patches. Adjacent patches are then randomly replaced, where the replacement decision follows a Bernoulli distribution with a probability of 0.5.

The effects of the KAG algorithm are mainly reflected in two aspects: sample quantity and sample attributes. In terms of quantity, the algorithm alters the character morphology of the generated new samples, increasing the available samples and reducing the model’s reliance on sample quantity. Regarding attributes, regardless of whether the original sample is genuine or counterfeit, it will become counterfeit after being perturbed by the algorithm. Importantly, the non-core authentication attributes such as sample color, material texture, and brightness remain largely unchanged before and after KAG processing, while the character morphology changes, prompting the network to focus on the core areas.

The principle of the algorithm is shown in Fig. 2. In the KAG algorithm, adjacent patches of original samples are shuffled to simulate various counterfeit samples. This allows the network to recognize multiple patterns of counterfeit samples instead of being limited to a single pattern, so the problem of poor model performance caused by large intra-class differences can be relieved. Simultaneously, it indirectly increases the sample size.
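The patch disturbance described above can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function name `kag_disturb` is hypothetical, "adjacent" is taken to mean the right-hand neighbour in the same patch row, and the image height and width are assumed to be divisible by the segment degree.

```python
import random


def kag_disturb(image, n, p=0.5, rng=None):
    """Return a copy of `image` with adjacent patches of an n x n grid swapped.

    `image` is a list of pixel rows. Height and width are assumed to be
    divisible by n (a simplification; the paper segments adaptively).
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    ph, pw = height // n, width // n  # patch height and width
    out = [list(row) for row in image]
    for gy in range(n):
        gx = 0
        while gx + 1 < n:
            # Bernoulli(p) decision: swap patch (gy, gx) with its right neighbour.
            if rng.random() < p:
                x0, x1 = gx * pw, (gx + 1) * pw
                for y in range(gy * ph, (gy + 1) * ph):
                    row = out[y]
                    row[x0:x0 + pw], row[x1:x1 + pw] = row[x1:x1 + pw], row[x0:x0 + pw]
                gx += 2  # both patches are consumed by the swap
            else:
                gx += 1
    return out


# A 4 x 4 toy "image" split into 2 x 2 patches; p=1.0 forces every swap.
toy = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
disturbed = kag_disturb(toy, n=2, p=1.0)
```

Because patches are only moved and never altered, pixel statistics such as color and brightness are preserved, while the local layout of the characters changes. A disturbed sample would then be labelled as disturbed for the auxiliary task.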

Fig. 2 The schematic diagram of the KAG algorithm

Fig. 3 The sample produced by KAG

The LV leather tag is taken as an example in Fig. 3, in which (a) is the original image and (b) is the scrambled image generated by the KAG algorithm. The algorithm magnifies the differences between characters and simulates a rough counterfeiting process, while ensuring that the spatial positions of the characters are not excessively disturbed.

3.2 Feature extraction stage

In counterfeit detection, fine-grained visual differences need to be learned, which requires a deep network. As in many recent designs, ResNet50 [36] is chosen for feature extraction, as shown in Fig. 4. The residual structure of ResNet50 passes shallow features to deep layers through skip connections and thus combines the shallow texture and the deep semantics of the input image, which benefits this task. More complex networks do not necessarily yield better performance: the experiments in Section 4.4.4 show that ResNet50 outperforms the other backbone networks tested.

Fig. 4 The feature extractor

Due to the high similarity between counterfeit and genuine products, relying on the residual network alone is inadequate. We therefore incorporate multi-scale features through spatial pyramid pooling [37] to enhance the model’s ability to capture information at diverse scales. Different pooling windows are designed to capture different details of the characters. In the implementation, adaptive pooling at three scales is applied to each feature map. The formulas are as follows:

$$\begin{aligned} f_{i,j}^{t,l+1} = \textrm{average}\Big (&f_{i,j}^{t,l}[\textrm{i}\times \frac{w^{l}}{w^{l+1}};(\textrm{i}+1)\times \frac{w^{l}}{w^{l+1}},\\&\textrm{j}\times \frac{h^{l}}{h^{l+1}};(\textrm{j}+1)\times \frac{h^{l}}{h^{l+1}}]\Big ) \end{aligned}$$
(1)
$$\begin{aligned} F=\textrm{concat}\bigl (f_{i,j}^{1,l+1},f_{i,j}^{2,l+1},f_{i,j}^{3,l+1}\bigr ) \end{aligned}$$
(2)

where \(f \in \mathbb {R}^{w \times h}\) represents the input feature map, \(t\in \{1,2,4\}\) represents three pooling levels, and i and j represent the element coordinates of the output feature map. \(w^l\) and \(h^l\) represent the width and height of the current input, and \(w^{l+1}\) and \(h^{l+1}\) those of the pooled output. Finally, the three output feature maps are flattened into 1D vectors and concatenated to obtain the output feature vector.
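To make the pooling and concatenation in Eqs. (1) and (2) concrete, the following is a minimal sketch for a single 2D feature map. It is written under assumptions: the function names are hypothetical, the three levels are taken to produce 1×1, 2×2, and 4×4 outputs, and the bin edges use a floor/ceil convention so that non-divisible sizes are also covered.

```python
def adaptive_avg_pool(fmap, bins):
    """Average-pool a 2D feature map (list of rows) to a bins x bins grid."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for i in range(bins):
        # Floor/ceil bin edges so that every input cell is covered.
        r0, r1 = (i * h) // bins, ((i + 1) * h + bins - 1) // bins
        for j in range(bins):
            c0, c1 = (j * w) // bins, ((j + 1) * w + bins - 1) // bins
            cells = [fmap[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            pooled.append(sum(cells) / len(cells))
    return pooled  # already flattened, row-major


def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Concatenate the flattened pooled maps of all levels into one vector."""
    feature = []
    for bins in levels:
        feature.extend(adaptive_avg_pool(fmap, bins))
    return feature


fmap = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
vector = spatial_pyramid_pool(fmap)  # length 1 + 4 + 16 = 21
```

The output length depends only on the chosen levels and not on the input size, which is what lets a fixed-size classifier follow the pooling stage.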

3.3 Task branch

In the task branch stage, the feature vectors are sent to the main and auxiliary branches. The auxiliary task recognizes the character disturbance generated by the KAG algorithm, and the main task discriminates whether a sample is genuine or counterfeit. Because the disturbance caused by the KAG algorithm is much more significant than the differences between genuine and counterfeit samples, the auxiliary task is easily optimized and thus guides the attention of the main task.

Fig. 5 The classifier

In the proposed architecture, a simple classifier is found to be sufficient. The two task branches are implemented with the same classifier design, shown in Fig. 5, which mainly consists of two fully connected layers with a ReLU in between. The output is finally mapped by softmax to a probability distribution that sums to 1. Softmax is computed as follows:

$$\begin{aligned} y_{i}=\frac{e^{p_{i}}}{\sum _{j=1}^{n}e^{p_{j}}}\in (0,1) \end{aligned}$$
(3)

where n denotes the total number of categories, \(p_i\) is the ith component of the model output \(p \in \mathbb {R}^n\), that is, the unnormalized score of the ith category, and \(y_i\) represents the probability of the ith category calculated by softmax, with \(\sum _{i=1}^n y_i = 1\).
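The classifier head and the softmax of Eq. (3) can be illustrated with a small, framework-free sketch. The helper names and the toy weights are hypothetical; the layer sizes of the actual model are not given here, so this only shows the order of operations (fully connected, ReLU, fully connected, softmax).

```python
import math


def linear(x, weights, bias):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def softmax(p):
    """Eq. (3), with the maximum subtracted first for numerical stability."""
    m = max(p)
    exps = [math.exp(v - m) for v in p]
    total = sum(exps)
    return [e / total for e in exps]


def classify(feature, w1, b1, w2, b2):
    """Two fully connected layers with a ReLU in between, then softmax."""
    hidden = relu(linear(feature, w1, b1))
    return softmax(linear(hidden, w2, b2))


# Toy 2-d feature and identity weights, purely for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
probs = classify([1.0, -2.0], identity, [0.0, 0.0], identity, [0.0, 0.0])
```

Subtracting the maximum score before exponentiation does not change the result of Eq. (3) but avoids overflow for large scores.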

3.4 Loss aggregation

The auxiliary task seeks the parameters that minimize its prediction error on the perturbation. Accordingly, the cross-entropy defined below is chosen as the disturbance detection loss. For the main task, the same type of loss function is used as the authenticity identification loss.

$$\begin{aligned} L(s,y)=-\sum _{i=1}^{c}s_{i}\textrm{log}y_{i} \end{aligned}$$
(4)

where c denotes the total number of categories: in the main task it equals the number of categories over all genuine and counterfeit samples, and in the auxiliary task it is two. \(s_i\) is the one-hot encoded label and \(y_i\) is the prediction normalized by softmax. The closer the prediction is to the ground truth, the smaller the loss.

As the two tasks differ in importance, their aggregation needs to be carefully designed. So that the model learns thoroughly from both the main and the auxiliary task, the two losses are combined with a weight coefficient \(\alpha \) as follows:

$$\begin{aligned} L_{\textrm{fusion}}=\alpha \cdot L_{\textrm{main}}+(1-\alpha )\cdot L_{\textrm{aux}} \end{aligned}$$
(5)

where \(L_{\text {main}}\) and \(L_{\text {aux}}\) denote the authenticity identification loss and the perturbation detection loss, respectively. The choice of the weight coefficient is discussed in Section 4.4.3.
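Eqs. (4) and (5) translate almost directly into code. The sketch below uses hypothetical function names and adds a small `eps` clamp, which is an assumption of this example to avoid taking the logarithm of zero and is not part of the formulas.

```python
import math


def cross_entropy(one_hot, probs, eps=1e-12):
    """Eq. (4): negative log-likelihood of the one-hot label under `probs`."""
    return -sum(s * math.log(max(y, eps)) for s, y in zip(one_hot, probs))


def fused_loss(loss_main, loss_aux, alpha=0.5):
    """Eq. (5): convex combination of the main and auxiliary losses."""
    return alpha * loss_main + (1.0 - alpha) * loss_aux


# Illustrative values only: a 4-class main prediction and a 2-class auxiliary one.
l_main = cross_entropy([0, 1, 0, 0], [0.1, 0.7, 0.1, 0.1])
l_aux = cross_entropy([1, 0], [0.9, 0.1])
total = fused_loss(l_main, l_aux, alpha=0.5)
```

With `alpha=1.0` the auxiliary term vanishes and the objective reduces to the single-task loss, which matches the single-task case mentioned in the ablation on the weight coefficient.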

3.5 Preprocessing strategy

Images are usually resized to a uniform square shape before being fed to a deep learning network, and the resulting deformation of objects may erase the subtle visual differences between genuine and counterfeit samples. To address this issue, a preprocessing strategy named FWD is proposed for counterfeit detection. FWD pads an image to a square before scaling it and thus maintains its aspect ratio.
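The pad-then-resize idea can be sketched as follows for a single-channel image stored as a list of rows. The function names are hypothetical, and two details are assumptions of this sketch rather than statements about FWD itself: the original content is centred inside the square, and nearest-neighbour sampling stands in for whatever interpolation is actually used.

```python
def pad_to_square(image, fill=128):
    """Centre `image` on a square canvas filled with `fill` (gray by default)."""
    h, w = len(image), len(image[0])
    side = max(h, w)
    top, left = (side - h) // 2, (side - w) // 2
    canvas = [[fill] * side for _ in range(side)]
    for y in range(h):
        for x in range(w):
            canvas[top + y][left + x] = image[y][x]
    return canvas


def resize_nearest(square, size):
    """Nearest-neighbour resize of a square image to size x size."""
    src = len(square)
    return [[square[y * src // size][x * src // size] for x in range(size)]
            for y in range(size)]


def fwd(image, size, fill=128):
    """Pad first, then resize, so the content keeps its aspect ratio."""
    return resize_nearest(pad_to_square(image, fill), size)


wide = [[1, 2, 3, 4], [5, 6, 7, 8]]  # a 2 x 4 toy image
square = fwd(wide, size=8)           # 8 x 8, letters not stretched vertically
```

Resizing the 2×4 toy image directly to 8×8 would stretch it four times vertically but only twice horizontally, whereas after padding both axes are scaled by the same factor.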

Fig. 6 The distribution of the three datasets

Figure 8 illustrates the strategy with different padding colors. From left to right are the original images, the directly resized ones, and those padded with black, gray, white, random colors, and the adaptive average color adjacent to the edge, respectively. Directly resizing the sample elongates the letters and reduces the letter spacing. Such deformations are usually enough to challenge a state-of-the-art classification model, whereas with the FWD strategy the impact of deformation is mitigated.

4 Result and analysis

To evaluate the proposed method, we compare our model with attention networks, fine-grained classification algorithms, and ViT models. The impact and selection of important design choices are also discussed. To gain insight into the presented mechanism, t-SNE and GradCAM are applied.

4.1 Experimental setup

In our method, the network is trained for 200 epochs with a batch size of 24. The SGD optimizer is used with a momentum of 0.9 and a weight decay of 0.001. The initial learning rate is 0.001; a cosine annealing strategy is employed with an attenuation cycle of ten epochs and a minimum learning rate of 0.0005. \(\alpha \) is set to 0.5. In the KAG algorithm, the segment degree is 20 by default. The images are resized to 448×448 before being fed into the model. In the FWD strategy, gray padding is applied by default.
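As a rough illustration of the learning-rate schedule, the sketch below evaluates the standard closed-form cosine annealing curve with the values given above. It assumes that the ten-epoch attenuation cycle is the half period of the cosine, so the rate falls from 0.001 to 0.0005 over ten epochs and rises back over the next ten; whether the actual training used restarts instead is not stated, and the function name is hypothetical.

```python
import math


def cosine_lr(epoch, lr_init=0.001, lr_min=0.0005, cycle=10):
    """Closed-form cosine annealing: lr_init at epoch 0, lr_min at epoch `cycle`."""
    cosine = math.cos(math.pi * epoch / cycle)
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + cosine)


# Learning rate for each of the 200 training epochs.
schedule = [cosine_lr(epoch) for epoch in range(200)]
```

Under this assumption the rate stays within the interval from 0.0005 to 0.001 for the whole run, so it never decays to zero.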

4.2 Dataset

The data need preprocessing before further analysis. Firstly, techniques such as Gamma correction and automatic contrast enhancement are applied to the data captured by various devices to improve visual quality. Secondly, an object localization network, YOLOv7 [38] in our case, is used to crop out objects including leather tags, metal buckles, and metal round labels. The object localization network was pretrained with manually cropped samples. Through these steps, we extract sample images from the entire bag image for subsequent experiments.

We conducted experiments on the leather tag, metal buckle, and metal round label datasets. Each dataset contains real-world images of luxury goods from different series of the LV brand. Each series includes two categories: genuine and counterfeit. Because collection is difficult, the number of samples varies across categories, which tests the ability of the models under class imbalance. The three datasets contain 3095, 1836, and 2300 samples, respectively. The distribution over series in each dataset is shown in Fig. 6, and some samples are illustrated in Fig. 7. Each dataset is split into three disjoint subsets with a ratio of 8:1:1, namely the training set, the validation set, and the test set.
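A disjoint 8:1:1 split can be sketched as below. The function name and the fixed seed are hypothetical, and the sketch shuffles all samples globally; whether the actual split was stratified by series or category is not stated, so that choice is an assumption of this example.

```python
import random


def split_indices(n, seed=0):
    """Shuffle range(n) and split it 8:1:1 into train, validation and test."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_train = n * 8 // 10
    n_valid = n // 10
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder, so no sample is dropped
    return train, valid, test


# Leather tag dataset size from the text above.
train, valid, test = split_indices(3095)
```

For the 3095 leather tag samples this gives 2476 training, 309 validation, and 310 test indices under the integer rounding used here.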

4.3 Comparison and analysis

We carried out experiments on the three datasets and evaluated the performance with three metrics: precision, recall, and F1-score. The results in Table 1 indicate that the method works well under almost all conditions. The F1-score shows that it recognizes counterfeit metal buckles better than genuine ones. Although the performance varies across series, the metal round labels are still well recognized.
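For reference, the three metrics can be computed per category from true and predicted labels as in the sketch below. The function name and the toy labels are hypothetical and do not reproduce any value from Table 1.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1-score of one category treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy labels for illustration only.
y_true = ["fake", "fake", "real", "real", "fake"]
y_pred = ["fake", "real", "real", "fake", "fake"]
fake_scores = precision_recall_f1(y_true, y_pred, positive="fake")
```

In the multi-series setting each genuine or counterfeit category of each series would be evaluated in turn as the positive class.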

Fig. 7 Some samples in the three datasets

Fig. 8 The FWD strategy with various colors

Table 1 Performance metrics on three datasets

Fig. 9 The confusion matrix of three datasets

Table 2 The comparison of the average accuracy between KAG-MTLN and other advanced methods on three datasets, where “+FWD” means that the FWD strategy is used

The confusion matrices on the test sets are illustrated in Fig. 9. The results indicate the good performance of our model in multi-series detection: it accurately recognizes the authenticity of multi-series samples, and only a few samples are incorrectly assigned to other categories.

Because research on the automated detection of luxury goods is scarce, our comparison focuses on the key technologies involved in luxury detection and selects state-of-the-art algorithms from them. Table 2 shows the average accuracy of each method on the three datasets. KAG-MTLN outperforms the other methods, with a highest accuracy of 98.8%. Furthermore, the FWD strategy was evaluated on various models. The strategy works well on the leather tag dataset, but not on the others, where square-shaped samples are predominant. Our experiments indicate that the FWD strategy is better suited to samples with a larger aspect ratio.

The training costs of these methods were also compared. We extracted 100% (3095), 80% (2476), 60% (1857), 40% (1238), and 20% (619) of the samples from the leather tag dataset for experiments. In Fig. 10, the horizontal and vertical axes represent the sample size and the test accuracy, respectively. When the sample size is reduced by 80%, all networks except KAG-MTLN show notable accuracy drops due to limited data. KAG-MTLN outperforms MMAL Net by 4.8%, ECA Net by 46.1%, CBAM by 41.7%, API-Net by 9.6%, MHEM by 14.3%, and SwinTransformer by 31.8%. These findings show that our method can significantly reduce the training cost while maintaining performance.

Fig. 10 The comparison of various models under different sample sizes. Each sample set contains multiple series

4.4 Ablation experiment

We further examined the effectiveness of our method through ablation experiments on the leather tag dataset.

4.4.1 Analysis over segment degree

To assess the effect of the segment degree in the KAG algorithm, we experimented with different values. Table 3 shows that the optimum value is 20. In Fig. 11, we display samples with segment degrees of 20 and 50. A higher segment degree yields smaller patches, and excessively large values destroy letter detail, leading to reduced performance.

Table 3 Performance of the KAG algorithm with different segment degrees

Fig. 11 The impact of different segment degrees. The three images are the original image and the generated images with segment degrees of 20 and 50, respectively

4.4.2 Different colors in FWD

To examine the impact of different image padding colors, we conducted experiments using black, white, gray, random, and uniform colors. The results, summarized in Table 4, indicate that gray and uniform colors yield the best performance.

Table 4 The performance comparison of different padding colors

4.4.3 Weight coefficient of loss function

To analyze the effect of weight coefficients in the loss function, we explored different values, as shown in Table 5. Optimal performance is achieved with a main coefficient of 0.5, striking a balance between the main and auxiliary tasks. When the main coefficient is set to 1, the model becomes a single-task architecture.

Fig. 12 The visualization of t-SNE under different training epochs

Table 5 The performance under different main weight coefficients

4.4.4 Selection of backbone network

To investigate the impact of different backbone networks, we utilized a range of networks for feature extraction, as detailed in Table 6. These results confirm that our method’s performance remains robust across diverse backbones, underscoring its reliability.

Table 6 Accuracy of different backbone models

4.4.5 Effectiveness analysis of feature

To validate feature distinguishability, we employed t-SNE [42] to visualize the output of the final layer of the feature extraction stage at different training epochs. In Fig. 12, each color signifies a distinct category. Because different counterfeiters use different processes, goods within the same category exhibit considerable variation. The purple dots, for example, are distinct from the other colors but form multiple clusters. Overall, the features extracted by KAG-MTLN are highly differentiated, effectively mitigating interference from intra-class differences.

4.4.6 Effectiveness analysis of key area

To assess the model’s attention on key areas, GradCAM [43] was utilized to highlight focus regions. Figure 13 compares heatmaps from different models on the same sample. Authenticity indicators, such as the circular R mark for leather tags, end serif and inflection point for metal buckles, and circular font for metal round labels, are discerned. The heatmaps demonstrate our method’s ability to discriminate based on specific key areas.

4.4.7 Effectiveness analysis of multi-task

To validate the effect of the auxiliary task, we conducted experiments by excluding the auxiliary branch. The results in Table 7 demonstrate that the auxiliary branch significantly improves the learning performance of the main branch, resulting in a 2.1% higher accuracy in our multi-task architecture compared to the single-task counterpart.

In addition, the visualization results before and after removing the auxiliary branch are compared. In Fig. 14, the features extracted by the multi-task method are more clustered and better separated, and the model attends to more details.

Fig. 13 Partial heatmaps of the models on the test set. The columns show, in order, the original image, ECA Net, CBAM, MMAL Net, SwinTransformer, and KAG-MTLN

Table 7 The effectiveness of the auxiliary branch. "-auxiliary" represents the removal of the auxiliary branch

Fig. 14 The two rows show the t-SNE graph and heatmap before and after removing the auxiliary branch, respectively

5 Conclusion

In conclusion, this paper proposes a counterfeit detection network based on key area guidance and multi-task learning. The experiments indicate that our method achieves superior performance with reduced training cost. Visual analysis shows that the method effectively highlights the key areas of a sample. Furthermore, our method is not confined to the presented architecture; it can readily extend existing single-task methods to multi-task methods. Future work will focus on refining fusion strategies between diverse tasks and enhancing the key area guidance algorithm. Our method provides an efficient deep learning solution for intelligent counterfeit detection, contributing to the fight against counterfeit products and to safeguarding the legitimate rights and interests of consumers.