Abstract
The heterogeneous data produced in agricultural supply chains can be divided into three main systems: (i) product identification and traceability, related to identifying production batches and locations of the product throughout the supply chain; (ii) environmental monitoring, considering environmental variables during production, storage, and transportation; and (iii) process monitoring, related to the data describing the production processes and inputs used. Labeling the data from these different systems can improve decision-making, traceability, and coordination in the chains; nevertheless, it is a labor-intensive task. The objective of this chapter was to evaluate whether unsupervised machine learning techniques can identify patterns in the data, cluster it, and generate labels for an unlabeled agricultural supply chain dataset. A dataset was generated by merging seven datasets containing information from the three systems, and the k-means and self-organizing map (SOM) models were evaluated for clustering the data and generating labels. The use of principal component analysis (PCA) together with the k-means model was also evaluated, and several supervised and unsupervised learning metrics were computed. The SOM model with the Gaussian neighborhood function provided the best results, with an F1-score of 0.91 and a more clearly defined cluster map. A series of recommendations for the use of unsupervised learning techniques on supply chain data is discussed. The methodology used in this chapter can be applied to other supply chains and to other unsupervised machine learning research. Future work involves improving the dataset and implementing other clustering models and dimensionality reduction techniques.
Notes
- 1. For a thorough description of the steps of each clustering model, we refer the reader to the work by Mehta et al. [17].
- 4. Data preprocessing and processing techniques focus on eliminating faulty data, dealing with missing values, and preparing the data to be used by the knowledge extraction and pattern recognition models. For an in-depth review of data processing and preprocessing techniques, the reader is referred to the work by Van den Broeck et al. [28]. If the reader is interested in processing time series data, please refer to the work by Wang and Wang [29]. If the reader is interested in processing natural language data, please refer to the work by Sun et al. [30].
- 5. Data fusion can be described as the combination of multiple datasets or data sources to improve the quality of the information [31]. It is important to note that these datasets may contain different features or variables. For an in-depth review of the different techniques used for data fusion, the reader is referred to the work by Castanedo [31].
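As a minimal illustration of feature-level data fusion (the batch identifier, column names, and values below are hypothetical, not taken from the chapter's datasets), two sources carrying different variables can be aligned on a shared key with pandas:

```python
import pandas as pd

# Hypothetical example: two sources sharing a batch identifier but
# carrying different features (traceability vs. environmental data).
traceability = pd.DataFrame({
    "batch_id": [1, 2, 3],
    "location": ["farm_a", "farm_b", "farm_a"],
})
environment = pd.DataFrame({
    "batch_id": [1, 2, 3],
    "temperature_c": [22.5, 19.0, 24.1],
})

# Feature-level fusion: align records on the shared key so each row
# combines variables from both systems.
fused = traceability.merge(environment, on="batch_id", how="inner")
print(fused.columns.tolist())  # ['batch_id', 'location', 'temperature_c']
```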
- 6. Data normalization is an essential step for most machine learning models, especially for artificial neural networks. According to Singh and Singh [32], it can be defined as transforming the features to a specific range. Two of its main advantages are faster model training and avoiding problems caused by features with very distinct value ranges. For more information on the use of data normalization for classification, please refer to the work by Singh and Singh [32].
- 7. For implementation aspects of the standard scaler technique, the reader is referred to the scikit-learn library documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
- 8. For implementation aspects of the MinMax scaler technique, the reader is referred to the scikit-learn library documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
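A small sketch of both scaling techniques with scikit-learn (the toy feature matrix below is illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy feature matrix with very different value ranges (e.g. a
# temperature in degrees Celsius and a batch weight in kilograms).
X = np.array([[20.0, 1000.0],
              [25.0, 1500.0],
              [30.0, 2000.0]])

# Standard scaler: zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# MinMax scaler: each feature rescaled to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std.mean(axis=0))   # ~[0, 0]
print(X_minmax.min(axis=0), X_minmax.max(axis=0))  # [0, 0] [1, 1]
```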
- 9. Agglomerative clustering is a commonly used variation of the hierarchical clustering models. Initially, each data point is considered a cluster. The algorithm then joins neighboring clusters, using a linkage method to evaluate the dissimilarity between them and to choose which clusters should be aggregated [12]. A dendrogram can then be used to visualize the results of each step of the algorithm.
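A minimal agglomerative clustering sketch with scikit-learn, using illustrative two-dimensional data (the points and the Ward linkage choice are assumptions for the example, not the chapter's configuration):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of points (illustrative data).
X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.0], [5.1, 5.2], [5.2, 5.1]])

# Ward linkage merges, at each step, the pair of clusters whose union
# least increases the within-cluster variance.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print(labels)  # cluster ids are arbitrary, e.g. [1 1 1 0 0 0]
```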
- 10. For an in-depth description of the different methods that are commonly used to estimate the optimal number of clusters for hierarchical clustering models, please refer to the work by Zambelli [33].
- 11. For a detailed analysis of the k-means model, the reader is referred to the works by Jain [9] and Steinley [10]. For implementation aspects of the k-means model, the reader is referred to the scikit-learn library tutorials on its implementation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html and https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py
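A minimal k-means sketch with scikit-learn (the data and hyperparameters below are illustrative, not the chapter's actual settings):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated synthetic groups.
X = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
              [8.0, 8.0], [8.1, 8.2], [7.9, 8.1]])

# k-means++ seeding (the scikit-learn default) followed by Lloyd's
# iterations; n_init repeats the run and keeps the lowest-inertia result.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print(kmeans.cluster_centers_)
```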
- 13. For a detailed analysis of the PCA method, as well as examples of uses, please refer to the work by Jolliffe and Cadima [35]. For implementation aspects related to PCA, the reader is referred to the scikit-learn library tutorials on its implementation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html and https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py
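Since the chapter evaluates PCA together with k-means, a minimal sketch of that combination (on synthetic data with two latent factors, an illustrative assumption) could look like:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
# 100 samples, 5 correlated features: a synthetic stand-in for a
# merged supply chain dataset with redundant variables.
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])

# Reduce to 2 principal components, then cluster in the reduced space.
pipeline = make_pipeline(
    PCA(n_components=2),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(X)

pca = pipeline.named_steps["pca"]
# With only 2 latent factors, 2 components capture nearly all variance.
print(pca.explained_variance_ratio_.sum())
```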
- 14. For a detailed analysis of the SOM model, the reader is referred to the works by Kohonen [11], Yin [36], and Liu and Weisberg [37]. For implementation aspects of the SOM model, the reader is referred to the MiniSom library tutorials on its GitHub repository: https://github.com/JustGlowing/minisom
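As a minimal NumPy sketch of the idea (not the MiniSom implementation; the grid size, sigma, and learning rate are illustrative choices, not the chapter's hyperparameters), one SOM training step with a Gaussian neighborhood function can be written as:

```python
import numpy as np

rng = np.random.default_rng(42)
grid_x, grid_y, n_features = 5, 5, 3
# Each neuron on the 5x5 grid holds a weight vector in feature space.
weights = rng.random((grid_x, grid_y, n_features))
coords = np.stack(
    np.meshgrid(np.arange(grid_x), np.arange(grid_y), indexing="ij"),
    axis=-1,
)

def train_step(x, weights, sigma=1.0, lr=0.5):
    # Best matching unit (BMU): the neuron whose weights are closest to x.
    dists = np.linalg.norm(weights - x, axis=-1)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighborhood: update strength decays with the squared
    # grid distance to the BMU.
    grid_dist2 = ((coords - np.array(bmu)) ** 2).sum(axis=-1)
    h = np.exp(-grid_dist2 / (2 * sigma ** 2))
    # Pull every neuron toward x, weighted by the neighborhood function.
    return weights + lr * h[..., None] * (x - weights)

x = rng.random(n_features)
new_weights = train_step(x, weights)
```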
- 16. For implementation aspects as well as descriptions of the supervised learning metrics used in this chapter, please refer to the scikit-learn library documentation: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
- 17. For implementation aspects as well as descriptions of the supervised and unsupervised clustering metrics used in this chapter, please refer to the scikit-learn library documentation on clustering performance evaluation: https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
- 21. For an in-depth analysis of classification metrics, including the F1-score, we refer the reader to the work by Goutte and Gaussier [42]. For implementation purposes, the reader can refer to the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
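A short sketch of computing macro-averaged precision, recall, and F1-score with scikit-learn (the true and predicted labels below are hypothetical):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical cluster-derived predictions vs. manually assigned labels.
y_true = [0, 0, 1, 1, 2, 2, 2, 1]
y_pred = [0, 0, 1, 2, 2, 2, 2, 1]

# Macro-averaging weights each class equally, which matters when the
# label distribution is imbalanced.
print(precision_score(y_true, y_pred, average="macro"))
print(recall_score(y_true, y_pred, average="macro"))
print(f1_score(y_true, y_pred, average="macro"))
```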
- 25. For an in-depth exploration of the data imbalance problem, refer to the work by Kotsiantis et al. [45].
- 26. The strategy of partially labeling the dataset and then using the partially labeled subset to train a model that predicts the labels for the whole dataset is a form of semi-supervised learning. For a detailed description of semi-supervised learning models, please refer to the work by Bagherzadeh and Asil [46].
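A minimal sketch of this partial-labeling strategy (synthetic data; the choice of a k-NN classifier is an illustrative assumption, not the chapter's method):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Synthetic dataset: two groups of 50 points each.
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(3, 0.3, (50, 2))])

# Suppose the first 10 points of each group were labeled manually.
X_labeled = np.vstack([X[:10], X[50:60]])
y_labeled = np.array([0] * 10 + [1] * 10)

# Train on the small labeled subset, then propagate labels to all rows.
clf = KNeighborsClassifier(n_neighbors=3).fit(X_labeled, y_labeled)
y_full = clf.predict(X)  # labels for the whole dataset
```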
- 27. For a full list of papers that used the MiniSom implementation of SOM, we refer the reader to its official GitHub repository: https://github.com/JustGlowing/minisom
References
Chopra, S., & Meindl, P. (2013). Supply chain management: Strategy, planning, and operation (5th ed., 528pp). Pearson Education.
Corella, V. P., Rosalen, R. C., & Simarro, D. M. (2013). SCIF-IRIS framework: A framework to facilitate interoperability in supply chains. International Journal of Computer Integrated Manufacturing, 26(1–2), 67–86.
Huang, C. C., & Lin, S. H. (2010). Sharing knowledge in a supply chain using the semantic web. Expert Systems with Applications, 37(4), 3145–3161.
Pang, Z., Chen, Q., Han, W., & Zheng, L. (2015). Value-centric design of the internet-of-things solution for food supply chain: Value creation, sensor portfolio and information fusion. Information Systems Frontiers, 17(2), 289–319.
Verdouw, C. N., Vucic, N., Sundmaeker, H., & Beulens, A. (2013). Future internet as a driver for virtualization, connectivity and intelligence of agri-food supply chain networks. International Journal on Food System Dynamics, 4(4), 261–272.
Verdouw, C. N., Wolfert, J., Beulens, A. J. M., & Rialland, A. (2016). Virtualization of food supply chains with the internet of things. Journal of Food Engineering, 176, 128–136.
Verdouw, C., Sundmaeker, H., Tekinerdogan, B., Conzon, D., & Montanaro, T. (2019). Architecture framework of IoT-based food and farm systems: A multiple case study. Computers and Electronics in Agriculture, 165(104939), 1–26.
Harris, I., Wang, Y., & Wang, H. (2015). ICT in multimodal transport and technological trends: Unleashing potential for the future. International Journal of Production Economics, 159, 88–103.
Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.
Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1), 1–34.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.
Hartigan, J. A. (2000). Statistical clustering. In International Encyclopedia of the Social and Behavioral Sciences (pp. 15014–15019). Yale University. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.1277
Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. The Computer Journal, 9(4), 373–380.
Ghahramani, Z. (2003). Unsupervised learning. In Summer school on machine learning (pp. 72–112). Springer.
Liakos, K., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in agriculture: A review. Sensors, 18(8), 1–29.
Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms (pp. 1027–1035). ACM.
Mehta, P., Shah, H., Kori, V., Vikani, V., Shukla, S., & Shenoy, M. (2015). Survey of unsupervised machine learning algorithms on precision agricultural data. In 2015 international conference on innovations in information, embedded and communication systems (ICIIECS) (pp. 1–8). DRDO.
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd international conference on knowledge discovery and data mining (pp. 226–231). AAAI Press.
Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49–60.
Gowda, K. C., & Ravi, T. V. (1995). Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition, 28(8), 1277–1282.
Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), 139–172.
Ramesh, V., Ramar, K., & Babu, S. (2013). Parallel k-means algorithm on agricultural databases. International Journal of Computer Science Issues (IJCSI), 10(1), 710–713.
Kind, M. C., & Brunner, R. J. (2014). SOMz: Photometric redshift PDFs with self-organizing maps and random atlas. Monthly Notices of the Royal Astronomical Society, 438(4), 3409–3421.
Samsonova, E. V., Kok, J. N., & IJzerman, A. P. (2006). TreeSOM: Cluster analysis in the self-organizing map. Neural Networks, 19(6–7), 935–949.
Jeong, K. S., Hong, D. G., Byeon, M. S., Jeong, J. C., Kim, H. G., Kim, D. K., & Joo, G. J. (2010). Stream modification patterns in a river basin: Field survey and self-organizing map (SOM) application. Ecological Informatics, 5(4), 293–303.
Ruß, G., & Kruse, R. (2011). Exploratory hierarchical clustering for management zone delineation in precision agriculture. In Industrial conference on data mining (pp. 161–173). Springer.
Mingoti, S. A., & Lima, J. O. (2006). Comparing SOM neural network with fuzzy c-means, k-means and traditional hierarchical clustering algorithms. European Journal of Operational Research, 174(3), 1742–1759.
Van den Broeck, J., Cunningham, S. A., Eeckels, R., & Herbst, K. (2005). Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med, 2(10), e267, pp.966–970.
Wang, X., & Wang, C. (2019). Time series data cleaning: A survey. IEEE Access, 8, 1866–1881.
Sun, W., Cai, Z., Li, Y., Liu, F., Fang, S., & Wang, G. (2018). Data processing and text mining technologies on electronic medical records: A review. Journal of Healthcare Engineering, 2018, 1–9.
Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal, 2013, 1–19.
Singh, D., & Singh, B. (2019). Investigating the impact of data normalization on classification performance. Applied Soft Computing, 105524, 1–23.
Zambelli, A. E. (2016). A data-driven approach to estimating the number of clusters in hierarchical clustering. F1000Research, 5, 1–11.
Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of cluster in k-means clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6), 90–95.
Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374, 1–16.
Yin, H. (2008). The self-organizing maps: Background, theories, extensions and applications. In Computational intelligence: A compendium (pp. 715–762). Springer.
Liu, Y., & Weisberg, R. H. (2011). A review of self-organizing map applications in meteorology and oceanography. In Self-organizing maps: Applications and novel algorithm design (pp. 253–272). InTech. https://books.google.com.br/books?hl=en&lr=&id=k-SgDwAAQBAJ&oi=fnd&pg=PA253
Vettigli, G. (2013). MiniSom: minimalistic and NumPy-based implementation of the Self Organizing Map. Available at: https://github.com/JustGlowing/minisom/. Accessed on: October 15th, 2020.
Scribner, E. A., Battaglin, W. A., Dietze, J. E., & Thurman, E. M. (2003). Reconnaissance data for glyphosate, other selected herbicides, their degradation products, and antibiotics in 51 streams in nine Midwestern States (Open-File Report 03–217, 102pp). U.S. Geological Survey.
Silva, A. M. D., Degrande, P. E., Suekane, R., Fernandes, M. G., & Zeviani, W. M. (2012). Impacto de diferentes níveis de desfolha artificial nos estádios fenológicos do algodoeiro. Revista de Ciências Agrárias, 35(1), 163–172.
Omer, S. O., Abdalla, A. W. H., Mohammed, M. H., & Singh, M. (2015). Bayesian estimation of genotype-by-environment interaction in sorghum variety trials. Communications in Biometry and Crop Science, 10, 82–95.
Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European conference on information retrieval (ECIR) (pp. 345–359). Springer.
Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86–97.
Mohamad, I. B., & Usman, D. (2013). Standardization and its effects on k-means clustering algorithm. Research Journal of Applied Sciences, Engineering and Technology, 6(17), 3299–3303.
Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1), 25–36.
Bagherzadeh, J., & Asil, H. (2019). A review of various semi-supervised learning models with a deep learning and memory approach. Iran Journal of Computer Science, 2(2), 65–80.
Mansha, S., Babar, Z., Kamiran, F., & Karim, A. (2016). Neural network based association rule mining from uncertain data. In International conference on neural information processing (pp. 129–136). Springer.
Stoean, C., Stoean, R., Becerra-García, R. A., García-Bermúdez, R., Atencia, M., García-Lagos, F., Velázquez-Pérez, L., & Joya, G. (2019). Unsupervised learning as a complement to convolutional neural network classification in the analysis of saccadic eye movement in spino-cerebellar ataxia type 2. In International work-conference on artificial neural networks (pp. 26–37). Springer.
Riese, F. M., Keller, S., & Hinz, S. (2020). Supervised and semi-supervised self-organizing maps for regression and classification focusing on hyperspectral data. Remote Sensing, 12(1), 1–23.
Acknowledgments
This work was supported by Itaú Unibanco S.A. through the Itaú Scholarship Program, at the Centro de Ciência de Dados (C2D), Universidade de São Paulo, Brazil, by the National Council for Scientific and Technological Development (CNPq), and also by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001.
© 2022 Springer Nature Switzerland AG
Cite this chapter
Silva, R.F., Mostaço, G.M., Xavier, F., Saraiva, A.M., Cugnasca, C.E. (2022). Use of Unsupervised Machine Learning for Agricultural Supply Chain Data Labeling. In: Bochtis, D.D., Moshou, D.E., Vasileiadis, G., Balafoutis, A., Pardalos, P.M. (eds) Information and Communication Technologies for Agriculture—Theme II: Data. Springer Optimization and Its Applications, vol 183. Springer, Cham. https://doi.org/10.1007/978-3-030-84148-5_11
Print ISBN: 978-3-030-84147-8
Online ISBN: 978-3-030-84148-5
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)