
1 Introduction

The increased availability of affordable mass-produced goods, coupled with rapidly changing consumer fashion trends, has resulted in a sharp increase in the consumption of products in many industrial sectors. The production of textiles for clothing and footwear is expected to increase by 2.7% each year until 2030 (Footnote 1). In [8], the authors claim that approximately 5% of the twenty billion pairs of shoes produced worldwide every year are recycled or reused. In the European Union, the waste that results from postconsumer shoes is estimated to reach several million tonnes per year. Politicians and members of civil society are beginning to implement the Zero Waste to Landfill policy, addressing one of the major twenty-first-century challenges for the footwear sector. This ambitious goal requires extensive support from intelligent information systems.

The problem considered in this article results from a practical need. VIVE Textile Recycling is the leader of the textile recycling industry in Poland and Europe. The firm regularly processes massive deliveries of shoes. A conveyor belt transports the shoes to a darkroom, where their profile photos are taken (of the vamp, sole, and heel). Next, a management system, ShoeSelector, consumes information from many external intelligent systems and decides, for example, where the shoes should be thrown off the conveyor belt and whether they should be paired with their counterparts or remain single. Pairing is crucial for the shoes to be resold.

The primary goal of this article is to present a novel shoe-pairing method that can be used effectively in the process of analysing large quantities of shoes that are transported on a conveyor belt. We have verified various well-known classical approaches, including image descriptors like Central/Hu moments, HOG, and SSIM-based approaches, as well as a novel deep-learning-based method that outperforms others.

In our deep learning approach to shoe pairing, we introduced two sequential stages: a) multi-image shoe embedding that uses a deep neural network; and b) clustering of the shoes’ embeddings with a fixed similarity threshold to return pairs (clusters of size 2) and singles (unpaired shoes). Each shoe in our pipeline is represented by three images (vamp, sole, and heel) that are collected in the darkrooms. Sample images are presented in Fig. 1. Our approach is evaluated on diverse custom industrial datasets provided by Vive Textile Recycling.

Fig. 1.

Sample shoe images. The top row presents all three images (sole, vamp, and heel) for three distinct shoes. The bottom row presents the corresponding shoes found by the system.

2 Related Work

Shoe pairing can be modelled as clustering shoes’ compact representations with a maximum cluster size of 2 and a minimum similarity threshold that discriminates correct pairs (two-element clusters) from single shoes without counterparts.

Clustering is a fundamental task in machine learning whose objective is to divide subjects into a number of groups in such a way that subjects in the same group are more similar to each other than to those in other groups. Traditional methods cluster subjects on the basis of a single set of features per subject [11].

Multiview clustering (MVC) is a variant of clustering in which each subject is represented by multiple sets of features [2]. MVC has been applied successfully to various applications, including computer vision, natural language processing, healthcare, and social media. Given that our project involves only images assigned to shoes, our focus lies in computer vision.

MVC has been used widely in image clustering [7] and motion segmentation [5] tasks. Chi et al. [3] conducted MVC for web image retrieval ranking. Xin et al. [14] successfully applied MVC for person reidentification. Typically, several feature types, such as HOG [4], LBP [10], and SIFT [9], can be extracted from images prior to cluster analysis.

Since 2012, deep learning has proved outstandingly efficient in a variety of applications, such as image classification, speech recognition, object detection, and natural language processing. Multiview deep clustering methods have demonstrated better performance than traditional multiview shallow clustering methods. Most of the literature presents multiview deep clustering as clustering performed on the representations obtained from multiview representation models built in a supervised manner [2].

Multiview representation learning (MVRL) has recently become popular through its exploitation of the complementary information in multiple features or modalities. Owing to the remarkable performance of deep models, deep MVRL has been adopted in many domains, including computer vision and signal processing. One article [15] presents a comprehensive review of deep MVRL from two perspectives: (1) deep extensions of traditional MVRL methods; and (2) MVRL methods that lie fully within the deep learning scope. The first group introduces the advancements of deep learning models into traditional MVRL methods, such as multiview canonical correlation analysis and matrix factorisation. The second group represents pure deep learning MVRL methods, such as multiview autoencoders, convolutional neural networks, and deep belief networks [15].

The deep neural networks used to learn representations in multiview clustering methods have superior expressive ability and can reflect multiview data comprehensively. Multiview deep representation learning is a method of transforming a collection of inputs (in our case, profile images) into compact, fixed-size, floating-point representations that exhibit the desired properties. Such an embedding can be subject to further processing, such as clustering. However, the authors of [2] claim that separating representation learning from clustering entails limitations, such as representation learning being unaware of the clustering goal. Our approach resolves this problem by learning representations that are tailored to the purpose of our clustering, which is pairing.

The outstanding efficiency of multiview deep-based clustering and its ability to make representation learning aware of the clustering’s goal inspired us to develop a novel shoe-pairing process based on the clustering of representations obtained from a trained deep multiview representation model (hereinafter referred to as deep multiview embedding-based clustering, DMVEC).

3 The Proposed Approach

The solution is a web service that is capable of responding to a variety of shoe-related requests via a REST API. This article describes only the pairing aspect of the system. Answering pairing-related questions requires the system to perform a series of processing steps, some of which are shared with other tasks. The first (1) is input preprocessing (e.g. image decoding from base64 encoding, image scaling, and normalisation). Later (2), shoe detection is performed (empty hangers occasionally get photographed). Next, (3) deep multiview embedding of a shoe is performed (this step assumes that the embedding model is already trained and available for inference). Last (4), clustering of the shoes’ embeddings into pairs can be requested. In the two sections below, we describe (3) and (4) in greater detail.
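Steps (1)–(3) can be sketched as a minimal pipeline. Everything below is illustrative: the function names, the stub detector, and the stub embedding are hypothetical stand-ins for the real components (the actual embedding model is described in Sect. 3.1), written only to show how the stages connect.

```python
import base64
from dataclasses import dataclass

import numpy as np


@dataclass
class ShoeRequest:
    # base64-encoded image bytes for each profile photo
    vamp: str
    sole: str
    heel: str


def preprocess(b64_image: str, size: int = 224) -> np.ndarray:
    """Step 1: decode base64 and normalise to a float array in [0, 1].
    A real service would decode JPEG/PNG with an image library; here
    the raw bytes are reshaped directly, purely for illustration."""
    raw = np.frombuffer(base64.b64decode(b64_image), dtype=np.uint8)
    return np.resize(raw, (size, size, 3)).astype(np.float32) / 255.0


def contains_shoe(images: list[np.ndarray]) -> bool:
    """Step 2: stub detector that rejects near-uniform images, such as
    photos of empty hangers."""
    return all(img.std() > 0.01 for img in images)


def embed(images: list[np.ndarray]) -> np.ndarray:
    """Step 3: stub for the trained multiview embedding model; returns
    a unit-length 64-dimensional vector, like the real model."""
    v = np.concatenate([img.mean(axis=(0, 1)) for img in images])
    v = np.resize(v, 64)
    return v / np.linalg.norm(v)
```

Step (4), the clustering of the resulting embeddings into pairs, is covered in Sect. 3.2.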

3.1 Deep Multiview Representation Learning

The deep multiview embedding approach is a supervised method of transforming the representation of a shoe, based on its three images, into a single fixed-size vector (the shoe embedding). The process is based on a deep neural network that transforms the three images of a shoe into a compact vector of floating-point numbers. This transformation is trained in such a way that each shoe in a pair is located close to its counterpart but away from unrelated shoes. The Euclidean distances between the embeddings can then be used by the clustering method as a distance metric.
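Because the embeddings are normalised to unit length (the norm layer in Fig. 2), Euclidean distance and cosine similarity are interchangeable notions of closeness: for unit vectors, \(\Vert a-b\Vert ^2 = 2 - 2\,a\cdot b\). A quick numerical check of this identity:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two random unit vectors standing in for shoe embeddings.
a = rng.normal(size=64)
a /= np.linalg.norm(a)
b = rng.normal(size=64)
b /= np.linalg.norm(b)

# For unit vectors, squared Euclidean distance and cosine similarity
# carry the same information: ||a - b||^2 = 2 - 2 * cos(a, b).
d2 = np.sum((a - b) ** 2)
cos = np.dot(a, b)
assert np.isclose(d2, 2.0 - 2.0 * cos)
```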

Fig. 2.

The deep multiview shoe embedding model. vgg_preproc – a preprocessing layer in which images are converted from RGB to BGR, before each colour channel is zero-centred relative to the ImageNet dataset; vgg – the convolutional base of the original VGG-16 network pretrained on ImageNet; gap – global average pooling, the layer used for dimensionality reduction; dense – a fully connected (dense) layer; relu – an activation layer in which all negative values are replaced with zero and the remaining values are passed to the output; sum – a layer for aggregating the output by adding tensors from multiple layers; norm – a layer that normalises the input vector to a unit vector.

This method is a multibranch combination of convolutional layers and dense layers. The input comprises three photographs of a shoe (vamp, sole, and heel). The output is a single 64-dimensional floating-point unit vector. In the initial part of the proposed deep neural network, we used the convolutional base of the VGG-16 network [12], pretrained on the ImageNet dataset (Footnote 2). The results of the convolutional base are globally pooled by channels. The use of the convolutional base of the VGG-16 network in our method is an example of transfer learning: the nonconvolutional part of the network is trained from scratch while the VGG-16 weights are kept frozen. The vector obtained from global average pooling of the output of the convolutional base is processed using a fully connected layer. Such a network fragment, called a branch, is repeated between one and three times, depending on the number of image types desired. Then, all branches are aggregated over the corresponding indices (element-wise) and transformed once more using a fully connected layer. Last, the 64-dimensional output is normalised so that it lies on a multidimensional sphere with a unit radius. Figure 2 presents the entire architecture of the deep neural network. To train our embedding model, we used triplet learning (namely, the TripletSemiHardLoss and TripletHardLoss loss functions), in which each reference (anchor) shoe requires a positive example (a corresponding shoe from a pair) and a negative example (a nonsimilar shoe). The deep neural network is trained in a regime that forces similarity (e.g. Euclidean-distance-based similarity) between the representations of the reference and the compatible shoe, while reducing the similarity of the representations of the reference example and the negative example.
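The triplet objective described above can be illustrated with a minimal single-triplet hinge loss in NumPy. This is only a sketch: the margin value is a hypothetical hyperparameter, and the batched semi-hard/hard negative mining performed by TripletSemiHardLoss and TripletHardLoss is omitted.

```python
import numpy as np


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Single-triplet hinge loss: pull the anchor towards its paired
    shoe (positive) and push it away from a non-matching shoe
    (negative) by at least `margin` in squared Euclidean distance."""
    d_pos = np.sum((anchor - positive) ** 2)
    d_neg = np.sum((anchor - negative) ** 2)
    return max(d_pos - d_neg + margin, 0.0)
```

Since the embeddings live on the unit sphere, squared distances lie in [0, 4], which bounds the range of useful margin values.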

3.2 Clustering

Given the trained deep multiview representation model (the shoe embedding model), a batch of shoes, represented by their images, can be transformed into a set of embeddings. The most similar embeddings can then be identified to pair the shoes. The desired pairing should connect the elements so that the distance (and, therefore, the difference in appearance) is as small as possible, or, equivalently, so that the similarity is as high as possible.

Although we evaluated many methods, considering the heavy deep multiview embedding model and specific nonfunctional requirements (such as the number of shoes in the clustered population being limited by the capacity of the conveyor belt, which is around 1,000–2,000 hangers), the most robust and comprehensive proved to be a greedy method based on agglomerative clustering with an additional termination condition. It works as follows:

  1. Create a collection of unpaired shoes L;

  2. Create an empty collection of result pairs P;

  3. Find the two shoes \(l_a\) and \(l_b\) in L that are separated by the shortest Euclidean distance between their embeddings, provided that this distance does not exceed the threshold t;

  4. Remove them from L and add the pair (\(l_a\), \(l_b\)) to P;

  5. If at least two items remain in L, go to step 3;

  6. Return the pairing P plus the remaining unpaired shoes as singles.
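Assuming the embeddings arrive as an \(n \times d\) array, the greedy procedure above can be sketched directly (the threshold value used in the test data is hypothetical):

```python
import numpy as np


def greedy_pairing(embeddings: np.ndarray, threshold: float):
    """Greedily pair the two closest remaining embeddings while their
    Euclidean distance stays within `threshold`; everything left over
    is returned as singles. `embeddings` is an (n, d) array."""
    unpaired = list(range(len(embeddings)))   # collection L (steps 1-2)
    pairs = []                                # collection P
    while len(unpaired) >= 2:
        # Step 3: find the closest admissible pair among the remainder.
        best = None
        for i, a in enumerate(unpaired):
            for b in unpaired[i + 1:]:
                d = np.linalg.norm(embeddings[a] - embeddings[b])
                if d <= threshold and (best is None or d < best[0]):
                    best = (d, a, b)
        if best is None:     # termination: no pair under the threshold
            break
        _, a, b = best
        unpaired.remove(a)   # step 4
        unpaired.remove(b)
        pairs.append((a, b))
    return pairs, unpaired   # step 6: pairs plus singles
```

The quadratic inner search is acceptable here because, as noted above, the population is bounded by the conveyor-belt capacity of roughly 1,000–2,000 hangers.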

4 Results

In the initial phase of our experiments, we had limited access to labelled data. Approximately 1,000 hand-labelled pairs of shoes were split into training (80%) and test (20%) sets. Using these relatively small datasets, we evaluated various methods to identify the most promising one. At this point, we suspected that we were using too little data to fully unlock the benefits of deep learning. We therefore opted for transfer learning, which is helpful when data are scarce.

We verified four unsupervised classical methods: central moments, Hu moments, HOG, and SSIM. To evaluate them, we used a dedicated measure called pairing accuracy: the number of shoes with a correct assignment (properly paired or properly unpaired) divided by the total number of shoes considered.
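Pairing accuracy, as defined above, can be computed with a small helper; the representation chosen here (a mapping from shoe id to partner id, with None marking a single) is hypothetical, introduced only for illustration:

```python
def pairing_accuracy(predicted: dict, reference: dict, n_shoes: int) -> float:
    """Fraction of shoes with a correct assignment: a shoe counts as
    correct when it is paired with its true counterpart, or correctly
    left single. Both arguments map shoe id -> partner id (or None
    for singles)."""
    correct = sum(1 for s in range(n_shoes)
                  if predicted.get(s) == reference.get(s))
    return correct / n_shoes
```

Note that a single wrongly matched pair penalises two shoes at once, since both members of the pair receive an incorrect assignment.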

Table 1. The quality evaluation of the shoe-pairing methods on the test set measured by pairing accuracy; VAMP/SOLE/BOTH indicates from which images the shoe-pairing model was inferring.

Table 1 presents the results of the shoe-pairing experiments on our initial test set. The first row presents the proposed DMVEC approach (deep multiview embedding-based clustering). The subsequent rows present the results obtained by our unsupervised classical baselines:

  1. Central moments – the shoe images, separately or joined horizontally into one image, are described by image descriptors called central moments [6] before clustering is applied;

  2. Hu moments – the shoe images, separately or joined horizontally into one image, are described by image descriptors called Hu moments [6] before clustering is applied;

  3. HoG – the shoe images, separately or joined horizontally into one image, are described by histogram of oriented gradients (HoG) descriptors [13] before clustering is applied;

  4. SSIM – the shoe images, separately or joined horizontally into one image, are compared one vs one using the structural similarity index measure (SSIM) [1]; the SSIM distances are then used for clustering.

Fig. 3.

Adjustment of the similarity threshold is a crucial task in shoe pairing. The image on the right shows the similarity between shoes in proper and incorrect pairs in a set of 34,000 pairs returned by a prototype system and then labelled by human taggers. The image on the left demonstrates how we tuned the similarity-threshold hyperparameter using the 1,000-shoe validation set. We counted partial errors and accuracy (acc) for each threshold setting. The legend presents the errors as a ratio of TP/FP/TN/FN to all shoes in the paired set. The vertical line (Thr) corresponds to the threshold value that achieves the highest accuracy on the 1,000-shoe validation set.

The initial experiment revealed that DMVEC obtained the best results; however, the test set consisted solely of pairs, i.e. all shoes in the test set had their counterparts within it. This is a borderline situation; usually, the ratio of paired shoes in a real-world population under pairing is between 50% and 90%. The presence of singles in a population demands the introduction of a threshold, which prevents further pairing and returns the remaining shoes as singles. We received a 1,000-shoe validation set from Vive Textile Recycling that represented the most common distribution of pairs in a population. In Fig. 3, we present how we tuned the similarity threshold value based on the validation set.
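The threshold tuning described above amounts to a grid search over candidate thresholds on labelled candidate pairs: a pair is accepted when its embedding distance does not exceed the threshold, and the TP/FP/TN/FN counts yield the accuracy for each setting. A minimal sketch, with illustrative data and candidate values:

```python
import numpy as np


def tune_threshold(distances, is_true_pair, candidates):
    """Sweep candidate thresholds over labelled candidate pairs.
    A pair is accepted when its distance <= t; returns the threshold
    with the highest accuracy, together with that accuracy."""
    distances = np.asarray(distances)
    labels = np.asarray(is_true_pair, dtype=bool)
    best_t, best_acc = None, -1.0
    for t in candidates:
        accepted = distances <= t
        tp = np.sum(accepted & labels)    # proper pairs kept
        tn = np.sum(~accepted & ~labels)  # improper pairs rejected
        acc = (tp + tn) / len(labels)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t, best_acc
```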

Fig. 4.

Training and validation (on the test set) metrics for our best pairing model. The model was trained using the maximum number of true pairs collected over two years of labelling efforts. We accumulated 108,000 shoes in pairs, and validated/tested after each epoch using 2,000 shoes, of which 1,000 were paired and 1,000 were singles (the x-axis represents epochs; the y-axis represents accuracy on the left and loss score on the right).

The next phases of the project involved experiments on diverse test populations that contained significant proportions of singles. These experiments revealed that accuracy dropped sharply, from an initial 99% to around 80–90% (scores were lower when more singles were in the population; 80% was reported when paired shoes and singles were equally numerous). After investigating the drop in pairing accuracy, we reached two conclusions. (1) Finding pairs in a population that does not contain singles is a significantly easier task than pairing a general population. The threshold is challenging to discover; embeddings are typically not separable by any single threshold, which introduces a tradeoff between precision and recall. (2) The difficulty of shoe pairing depends inherently on the size of the population. This differs from other tasks, such as classification, in which scores are independent of the validation set size. This happens because, in pairing, the examples in the dataset are not independent of each other. It can also be seen from the perspective of the solution space, in which the number of possible pairings grows combinatorially with the population size.

We decided that our training dataset was insufficiently representative for deep multiview representation learning and that a more robust embedding model was needed. We manually labelled tens of thousands of pairs returned by a prototype system in a preindustrial environment. After more than a year of labelling, we had collected approximately 54,000 proper pairs. In Fig. 4, we present the training and validation process using our largest training dataset and data augmentation (cropping, saturation, and hue disturbance). The trained model is highly robust and reports high accuracy scores during clustering on the test set; even when tested on a very difficult set (1,000 shoes in pairs and 1,000 singles), it achieves accuracy above 97% after the 10th epoch.

5 Conclusions

This article presents a novel dual-stage approach to shoe pairing that comprises deep multiview shoe embedding and clustering. We evaluated different approaches to shoe pairing, from classical unsupervised ones based on image descriptors to the proposed supervised one that applies deep neural networks. The best-performing model in this task is the proposed supervised method, DMVEC. It reports almost 100% accuracy on test sets that exclusively comprise shoes in pairs, and at least 97% when singles cover almost half of the test population. Across a broad range of tests, we evaluated different test sets (of different sizes and with various distributions of singles). We also demonstrated how our selected method can be improved by hyperparameter tuning (similarity-threshold tuning), massive increases in training data, and data augmentation.

In the future, we plan to conduct research on further optimisations for DMVEC. We suspect that multiple factors can impact the pairing accuracy of our model, such as the margin hyperparameter in triplet learning, the schedule of TripletHardLoss and TripletSemiHardLoss during training, or the GAP layer discarding too much information. There are also other necessary tasks apart from pairing, which will be addressed soon.