
1 Introduction

The increased availability of affordable mass-produced goods, coupled with rapidly changing consumer fashion trends, has resulted in a sharp increase in the consumption of products in many industrial sectors. The production of textiles for clothing and footwear is expected to increase by 2.7% each year until 2030 (Footnote 1). In [8], the authors claim that approximately 5% of the twenty billion pairs of shoes produced worldwide every year are recycled or reused. In the European Union, the waste that results from postconsumer shoes is estimated to reach several million tonnes per year. Politicians and members of civil society are beginning to implement the Zero Waste to Landfill policy, addressing one of the major twenty-first-century challenges for the footwear sector. This ambitious goal requires extensive support from intelligent information systems.

The problem considered in this article results from a practical need. VIVE Textile Recycling is the leader of the textile recycling industry in Poland and Europe. The firm regularly processes massive deliveries of shoes. A conveyor belt transports the shoes to a darkroom, where their profile photos are taken (of the vamp, sole, and heel). Next, a management system, ShoeSelector, consumes information from many external intelligent systems and decides, for example, where the shoes should be thrown off the conveyor belt and whether they should be paired with their counterparts or remain single. Pairing is crucial for the shoes to be resold.

The primary goal of this article is to present a novel shoe-pairing method that can be used effectively in the process of analysing large quantities of shoes that are transported on a conveyor belt. We have verified various well-known classical approaches, including image descriptors like Central/Hu moments, HOG, and SSIM-based approaches, as well as a novel deep-learning-based method that outperforms others.

In our deep learning approach to shoe pairing, we introduced two sequential stages: a) multi-image shoe embedding that uses a deep neural network; and b) clustering of the shoes’ embeddings with a fixed similarity threshold to return pairs (clusters of size 2) and singles (unpaired shoes). Each shoe in our pipeline is represented by three images (vamp, sole, and heel) that are collected in the darkrooms. Sample images are presented in Fig. 1. Our approach is evaluated on diverse custom industrial datasets provided by Vive Textile Recycling.

Fig. 1.

Sample shoe images. The top row presents all three images (sole, vamp, and heel) for three distinct shoes. The bottom row presents the corresponding shoes found by the system.

2 Related Work

Shoe pairing can be modelled as clustering shoes’ compact representations with a maximum cluster size of 2 and a minimum similarity threshold that discriminates correct pairs (two-element clusters) from single shoes without counterparts.

Clustering is a fundamental task in machine learning whose objective is to divide subjects into a number of groups in such a way that subjects in the same group are more similar to each other than to those in other groups. Traditional methods cluster subjects on the basis of a single set of features per subject [11].

Multiview clustering (MVC) is a variant of clustering in which each subject is represented by multiple sets of features [2]. MVC has been applied successfully to various applications, including computer vision, natural language processing, healthcare, and social media. Given that our project involves only images assigned to shoes, our focus lies in computer vision.

MVC has been used widely in image clustering [7] and motion segmentation [5] tasks. Chi et al. [3] conducted MVC for web image retrieval ranking. Xin et al. [14] successfully applied MVC for person reidentification. Typically, several feature types, such as HOG [4], LBP [10], and SIFT [9], can be extracted from images prior to cluster analysis.

Since 2012, deep learning has proved outstandingly efficient in a variety of applications, such as image classification, speech recognition, object detection, and natural language processing. Multiview deep clustering methods have demonstrated better performance than traditional multiview shallow clustering methods. Most of the literature presents multiview deep clustering as clustering performed on the representations obtained from multiview representation models built in a supervised manner [2].

Multiview representation learning (MVRL) has recently become popular through its exploitation of the complementary information in multiple features or modalities. Owing to the remarkable performance of deep models, deep MVRL has been adopted in many domains, including computer vision and signal processing. One article [15] presents a comprehensive review of deep MVRL from two perspectives: (1) deep extensions of traditional MVRL methods; and (2) MVRL methods that lie fully within the deep learning scope. The first group introduces the advancements of deep learning models into traditional MVRL methods, such as multiview canonical correlation analysis and matrix factorisation. The second group represents pure deep learning MVRL methods, such as multiview autoencoders, convolutional neural networks, and deep belief networks [15].

The deep neural networks used to learn representations in multiview clustering methods have superior expressive ability and can reflect multiview data comprehensively. Multiview deep representation learning is a method of transforming a collection of inputs (in our case, profile images) into compact, fixed-size, floating-point representations that exhibit the desired properties. Such an embedding can be subject to further processing, such as clustering. However, the authors of [2] claim that separating representation learning from clustering entails limitations, such as representation learning being unaware of the clustering goal. Our approach resolves this problem by learning representations that are tailored to the purpose of our clustering, which is pairing.

The outstanding efficiency of multiview deep-based clustering and its ability to make representation learning aware of the clustering’s goal inspired us to develop a novel shoe-pairing process based on the clustering of representations obtained from a trained deep multiview representation model (hereinafter referred to as deep multiview embedding-based clustering, DMVEC).

3 The Proposed Approach

The solution is a web service that is capable of responding to a variety of shoe-related requests via a REST API. This article describes only the pairing aspect of the system. Answering pairing-related questions requires the system to perform a series of processing steps, some of which are shared with other tasks. The first (1) is input preprocessing (e.g. image decoding from base64 encoding, image scaling, and normalisation). Later (2), shoe detection is performed (empty hangers occasionally get photographed). Next, (3) deep multiview embedding of a shoe is performed (this step assumes that the embedding model is already trained and available for inference). Last (4), clustering of the shoes’ embeddings into pairs can be requested. In the two sections below, we describe (3) and (4) in greater detail.
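Steps (1)–(3) can be sketched as a minimal pipeline. Everything below is illustrative: the function names, the stub detector, and the stub embedding are hypothetical stand-ins for the real components (the actual embedding model is described in Sect. 3.1), written only to show how the stages connect.

```python
import base64
from dataclasses import dataclass

import numpy as np


@dataclass
class ShoeRequest:
    # base64-encoded image bytes for each profile photo
    vamp: str
    sole: str
    heel: str


def preprocess(b64_image: str, size: int = 224) -> np.ndarray:
    """Step 1: decode base64 and normalise to a float array in [0, 1].
    A real service would decode JPEG/PNG with an image library; here
    the raw bytes are reshaped directly, purely for illustration."""
    raw = np.frombuffer(base64.b64decode(b64_image), dtype=np.uint8)
    return np.resize(raw, (size, size, 3)).astype(np.float32) / 255.0


def contains_shoe(images: list[np.ndarray]) -> bool:
    """Step 2: stub detector that rejects near-uniform images, such as
    photos of empty hangers."""
    return all(img.std() > 0.01 for img in images)


def embed(images: list[np.ndarray]) -> np.ndarray:
    """Step 3: stub for the trained multiview embedding model; returns
    a unit-length 64-dimensional vector, like the real model."""
    v = np.concatenate([img.mean(axis=(0, 1)) for img in images])
    v = np.resize(v, 64)
    return v / np.linalg.norm(v)
```

Step (4), the clustering of the resulting embeddings into pairs, is covered in Sect. 3.2.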

3.1 Deep Multiview Representation Learning

The deep multiview embedding approach is a supervised method of transforming the representation of a shoe, based on its three images, into a single fixed-size vector (the shoe embedding). The process is based on a deep neural network that transforms the three images of a shoe into a compact vector of floating-point numbers. This transformation is trained in such a way that each shoe in a pair is located close to its counterpart but away from unrelated shoes. The Euclidean distances between the embeddings can then be used by the clustering method as a distance metric.
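Because the embeddings are normalised to unit length (the norm layer in Fig. 2), Euclidean distance and cosine similarity are interchangeable notions of closeness: for unit vectors, \(\Vert a-b\Vert ^2 = 2 - 2\,a\cdot b\). A quick numerical check of this identity:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two random unit vectors standing in for shoe embeddings.
a = rng.normal(size=64)
a /= np.linalg.norm(a)
b = rng.normal(size=64)
b /= np.linalg.norm(b)

# For unit vectors, squared Euclidean distance and cosine similarity
# carry the same information: ||a - b||^2 = 2 - 2 * cos(a, b).
d2 = np.sum((a - b) ** 2)
cos = np.dot(a, b)
assert np.isclose(d2, 2.0 - 2.0 * cos)
```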

Fig. 2.

The deep multiview shoe embedding model. vgg_preproc – a preprocessing layer in which images are converted from RGB to BGR, before each colour channel is zero-centred relative to the ImageNet dataset; vgg – the convolutional base of the original VGG-16 network pretrained on ImageNet; gap – global average pooling, the layer used for dimensionality reduction; dense – a fully connected (dense) layer; relu – an activation layer in which all negative values are replaced with zero and the remaining values are passed to the output; sum – a layer for aggregating the output by adding tensors from multiple layers; norm – a layer that normalises the input vector to a unit vector.

This method is a multibranch combination of convolutional layers and dense layers. The input comprises three photographs of a shoe (vamp, sole, and heel). The output is a single 64-dimensional floating-point unit vector. In the initial part of the proposed deep neural network, we used the convolutional base of the VGG-16 network [12], pretrained on the ImageNet dataset (Footnote 2). The results of the convolutional base are globally pooled by channels. The use of the convolutional base of the VGG-16 network in our method is an example of transfer learning: the nonconvolutional part of the network is trained from scratch while the VGG-16 weights are kept frozen. The vector obtained from global average pooling of the output of the convolutional base is processed using a fully connected layer. Such a network fragment, called a branch, is repeated between one and three times, depending on the number of image types desired. Then, all branches are aggregated over the corresponding indices (element-wise) and transformed once more using a fully connected layer. Last, the 64-dimensional output is normalised so that it lies on a multidimensional sphere with a unit radius. Figure 2 presents the entire architecture of the deep neural network. To train our embedding model, we used triplet learning (namely, the TripletSemiHardLoss and TripletHardLoss loss functions), in which each reference (anchor) shoe requires a positive example (a corresponding shoe from a pair) and a negative example (a nonsimilar shoe). The deep neural network is trained in a regime that forces similarity (e.g. Euclidean-distance-based similarity) between the representations of the reference and the compatible shoe, while reducing the similarity of the representations of the reference example and the negative example.
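The triplet objective described above can be illustrated with a minimal single-triplet hinge loss in NumPy. This is only a sketch: the margin value is a hypothetical hyperparameter, and the batched semi-hard/hard negative mining performed by TripletSemiHardLoss and TripletHardLoss is omitted.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Single-triplet hinge loss: pull the anchor towards its paired
    shoe (positive) and push it away from a non-matching shoe
    (negative) by at least `margin` in squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)
```

Since the embeddings live on the unit sphere, squared distances lie in [0, 4], which bounds the range of useful margin values.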

3.2 Clustering

Given the trained deep multiview representation model (the shoe embedding model), a batch of shoes, represented by their images, can be transformed into a set of embeddings. The most similar embeddings can then be identified to pair the shoes. The desired pairing should connect the elements so that the distance (and, therefore, the difference in appearance) is as small as possible, or, equivalently, so that the similarity is as high as possible.

Although we evaluated many methods, considering the heavy deep multiview embedding model and specific nonfunctional requirements (such as the number of shoes in the clustered population being limited by the capacity of the conveyor belt, which is around 1,000–2,000 hangers), the most robust and comprehensive proved to be a greedy method based on agglomerative clustering with an additional termination condition. It works as follows:

  1. Create a collection of unpaired shoes L;

  2. Create an empty collection of result pairs P;

  3. Find the two shoes \(l_a\) and \(l_b\) in L that are separated by the shortest Euclidean distance between their embeddings, provided that this distance does not exceed the threshold t;

  4. Remove them from L and add the pair (\(l_a\), \(l_b\)) to P;

  5. If at least two items remain in L, go to step 3;

  6. Return the pairing P plus the remaining unpaired shoes as singles.
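Assuming the embeddings arrive as an \(n \times d\) array, the greedy procedure above can be sketched directly (the threshold value used in the test data is hypothetical):

```python
import numpy as np


def greedy_pairing(embeddings: np.ndarray, threshold: float):
    """Greedily pair the two closest remaining embeddings while their
    Euclidean distance stays within `threshold`; everything left over
    is returned as singles. `embeddings` is an (n, d) array."""
    unpaired = list(range(len(embeddings)))   # collection L (steps 1-2)
    pairs = []                                # collection P
    while len(unpaired) >= 2:
        # Step 3: find the closest admissible pair among the remainder.
        best = None
        for i, a in enumerate(unpaired):
            for b in unpaired[i + 1:]:
                d = np.linalg.norm(embeddings[a] - embeddings[b])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:     # termination: no pair under the threshold
            break
        _, a, b = best
        unpaired.remove(a)   # step 4
        unpaired.remove(b)
        pairs.append((a, b))
    return pairs, unpaired   # step 6: pairs plus singles
```

The quadratic inner search is acceptable here because, as noted above, the population is bounded by the conveyor-belt capacity of roughly 1,000–2,000 hangers.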

4 Results

In the initial phase of our experiments, we had limited access to labelled data. Approximately 1,000 hand-labelled pairs of shoes were split into training (80%) and test (20%) sets. Using these relatively small datasets, we evaluated various methods to identify the most promising one. At this point, we suspected that we were using too little data to fully unlock the benefits of deep learning. We therefore opted for transfer learning, which is helpful when data are scarce.

We verified four unsupervised classical methods: central moments, Hu moments, HOG, and SSIM. To evaluate them, we used a dedicated measure called pairing accuracy: the number of shoes with a correct assignment (properly paired or properly unpaired) divided by the total number of shoes considered.
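Pairing accuracy, as defined above, can be computed with a small helper; the representation chosen here (a mapping from shoe id to partner id, with None marking a single) is hypothetical, introduced only for illustration:

```python
def pairing_accuracy(predicted: dict, reference: dict, n_shoes: int) -> float:
    """Fraction of shoes with a correct assignment: a shoe counts as
    correct when it is paired with its true counterpart, or correctly
    left single. Both arguments map shoe id -> partner id (or None
    for singles)."""
    correct = sum(1 for s in range(n_shoes)
                  if predicted.get(s) == reference.get(s))
    return correct / n_shoes
```

Note that a single wrongly matched pair penalises two shoes at once, since both members of the pair receive an incorrect assignment.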

Table 1. The quality evaluation of the shoe-pairing methods on the test set measured by pairing accuracy; VAMP/SOLE/BOTH indicates from which images the shoe-pairing model was inferring.

Table 1 presents the results of the shoe-pairing experiments on our initial test set. The first row presents the proposed DMVEC approach (deep multiview embedding-based clustering). The subsequent rows present the results obtained by our unsupervised classical baselines:

  1. Central moments – the shoe images, separately or joined horizontally into one image, are described by image descriptors called central moments [6] before clustering is applied;

  2. Hu moments – the shoe images, separately or joined horizontally into one image, are described by image descriptors called Hu moments [6] before clustering is applied;

  3. HoG – the shoe images, separately or joined horizontally into one image, are described by histogram of oriented gradients (HoG) descriptors [13] before clustering is applied;

  4. SSIM – the shoe images, separately or joined horizontally into one image, are compared one vs one using the structural similarity index measure (SSIM) [1]; the SSIM distances are then used for clustering.

Fig. 3.

Adjustment of the similarity threshold is a crucial task in shoe pairing. The image on the right shows the similarity between shoes in proper and incorrect pairs in a set of 34,000 pairs returned by a prototype system and then labelled by human taggers. The image on the left demonstrates how we tuned the similarity-threshold hyperparameter using the 1,000-shoe validation set. We counted partial errors and accuracy (acc) for each threshold setting. The legend presents the errors as a ratio of TP/FP/TN/FN to all shoes in the paired set. The vertical line (Thr) corresponds to the threshold value that achieves the highest accuracy on the 1,000-shoe validation set.

The initial experiment revealed that DMVEC obtained the best results; however, the test set consisted solely of pairs, i.e. all shoes in the test set had their counterparts within it. This is a borderline situation; usually, the ratio of paired shoes in a real-world population under pairing is between 50% and 90%. The presence of singles in a population demands the introduction of a threshold, which prevents further pairing and returns the remaining shoes as singles. We received a 1,000-shoe validation set from Vive Textile Recycling that represented the most common distribution of pairs in a population. In Fig. 3, we present how we tuned the similarity threshold value based on the validation set.
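The threshold tuning described above amounts to a grid search over candidate thresholds on labelled candidate pairs: a pair is accepted when its embedding distance does not exceed the threshold, and the TP/FP/TN/FN counts yield the accuracy for each setting. A minimal sketch, with illustrative data and candidate values:

```python
import numpy as np


def tune_threshold(distances, is_true_pair, candidates):
    """Sweep candidate thresholds over labelled candidate pairs.
    A pair is accepted when its distance <= t; returns the threshold
    with the highest accuracy, together with that accuracy."""
    distances = np.asarray(distances)
    labels = np.asarray(is_true_pair, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in candidates:
        accepted = distances <= t
        tp = np.sum(accepted & labels)    # proper pairs kept
        tn = np.sum(~accepted & ~labels)  # improper pairs rejected
        acc = (tp + tn) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```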

Fig. 4.

Training and validation (on the test set) metrics for our best pairing model. The model was trained using the maximum number of true pairs collected over two years of labelling efforts. We accumulated 108,000 shoes in pairs, and validated/tested after each epoch using 2,000 shoes, of which 1,000 were paired and 1,000 were singles (the x-axis represents epochs; the y-axis represents accuracy on the left and loss score on the right).

The next phases of the project involved experiments on diverse test populations that contained significant proportions of singles. These experiments revealed that accuracy dropped sharply, from an initial 99% to around 80–90% (scores were lower when more singles were in the population; 80% was reported when paired shoes and singles were equally numerous). After investigating the drop in pairing accuracy, we reached two conclusions. (1) Finding pairs in a population that does not contain singles is a significantly easier task than pairing a general population. The threshold is challenging to discover; embeddings are typically not separable by any single threshold, which introduces a tradeoff between precision and recall. (2) The difficulty of shoe pairing depends inherently on the size of the population. This differs from other tasks, such as classification, in which scores are independent of the validation set size. This happens because, in pairing, the examples in the dataset are not independent of each other. It can also be seen from the perspective of the solution space, in which the number of possible pairings grows combinatorially with the population size.

We decided that our training dataset was insufficiently representative for deep multiview representation learning and that a more robust embedding model was needed. We manually labelled tens of thousands of pairs returned by a prototype system in a preindustrial environment. After more than a year of labelling, we had collected approximately 54,000 proper pairs. In Fig. 4, we present the training and validation process using our largest training dataset and data augmentation (cropping, saturation, and hue disturbance). The trained model is highly robust and reports high accuracy scores during clustering on the test set; even when tested on a very difficult set (1,000 shoes in pairs and 1,000 singles), it achieves accuracy above 97% after the 10th epoch.

5 Conclusions

This article presents a novel dual-stage approach to shoe pairing that comprises deep multiview shoe embedding and clustering. We evaluated different approaches to shoe pairing, from classical unsupervised ones based on image descriptors to the proposed supervised one that applies deep neural networks. The best-performing model in this task is the proposed supervised method, DMVEC. It reports almost 100% accuracy on test sets that exclusively comprise shoes in pairs, and at least 97% when singles cover almost half of the test population. Across a broad range of tests, we evaluated different test sets (of different sizes and with various distributions of singles). We also demonstrated how our selected method can be improved by hyperparameter tuning (similarity-threshold tuning), massive increases in training data, and data augmentation.

In the future, we plan to conduct research on further optimisations for DMVEC. We suspect that multiple factors can impact the pairing accuracy of our model, such as the margin hyperparameter in triplet learning, the schedule of TripletHardLoss and TripletSemiHardLoss during training, or the GAP layer discarding too much information. There are also other necessary tasks apart from pairing, which will be addressed soon.