Use of Unsupervised Machine Learning for Agricultural Supply Chain Data Labeling

Chapter in: Information and Communication Technologies for Agriculture—Theme II: Data

Abstract

The heterogeneous data produced in agricultural supply chains can be divided into three main systems: (i) product identification and traceability, related to identifying production batches and the locations of the product throughout the supply chain; (ii) environmental monitoring, covering environmental variables during production, storage, and transportation; and (iii) process monitoring, related to the data describing the production processes and the inputs used. Labeling the data from these systems can improve decision-making, traceability, and coordination in the chains; nevertheless, it is a labor-intensive task. The objective of this chapter was to evaluate whether unsupervised machine learning techniques can identify patterns in the data, cluster it, and generate labels for an unlabeled agricultural supply chain dataset. A dataset was generated by merging seven datasets containing information from the three systems, and the k-means and self-organizing map (SOM) models were evaluated for clustering the data and generating labels. The use of principal component analysis (PCA) together with the k-means model was also evaluated, and several supervised and unsupervised learning metrics were computed. The SOM model with the Gaussian neighborhood function provided the best results, with an F1-score of 0.91 and a more clearly defined cluster map. A series of recommendations for the use of unsupervised learning techniques on supply chain data is discussed. The methodology used in this chapter can be applied to other supply chains and to other unsupervised machine learning research. Future work includes improving the dataset and implementing other clustering models and dimensionality reduction techniques.


Notes

  1. For a thorough description of the steps of each clustering model, we refer the reader to the work by Mehta et al. [17].

  2. https://jupyter.org/

  3. https://www.libreoffice.org/discover/calc/

  4. Data preprocessing and processing techniques focus on eliminating faulty data, dealing with missing values, and preparing the data to be used by the knowledge extraction and pattern recognition models. For an in-depth review of data processing and preprocessing techniques, the reader is referred to the work by Van den Broeck et al. [28]. For processing time series data, please refer to the work by Wang and Wang [29]; for processing natural language data, please refer to the work by Sun et al. [30].
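
As an illustration of these preprocessing steps, the following minimal sketch (using pandas; the column names, value ranges, and imputation choice are hypothetical, not the chapter's actual pipeline) removes implausible readings and imputes missing values:

```python
import pandas as pd

# Hypothetical supply chain sensor records; columns are illustrative only.
df = pd.DataFrame({
    "batch_id": ["A1", "A2", "A3", "A4"],
    "temperature_c": [4.2, None, 95.0, 5.1],   # 95.0 is an implausible reading
    "humidity_pct": [80.0, 78.5, None, 81.2],
})

# Eliminate clearly faulty readings (domain-specific plausibility check).
df = df[df["temperature_c"].isna() | df["temperature_c"].between(-10, 40)]

# Deal with missing values by imputing the column median.
for col in ["temperature_c", "humidity_pct"]:
    df[col] = df[col].fillna(df[col].median())

print(df)
```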

  5. Data fusion can be described as the combination of multiple datasets or data sources to improve the quality of the information [31]. It is important to note that these datasets may contain different features or variables. For an in-depth review of the different techniques used for data fusion, the reader is referred to the work by Castanedo [31].
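
As a minimal sketch of this idea (hypothetical tables and column names, not the seven datasets used in the chapter), two sources describing the same batches can be combined on a shared key with pandas:

```python
import pandas as pd

# Hypothetical traceability and environmental-monitoring tables sharing a batch key.
traceability = pd.DataFrame({
    "batch_id": ["A1", "A2", "A3"],
    "origin_farm": ["farm_01", "farm_02", "farm_01"],
})
environment = pd.DataFrame({
    "batch_id": ["A1", "A2", "A3"],
    "avg_temperature_c": [4.2, 5.0, 4.8],
})

# Feature-level fusion: join the datasets on the shared identifier.
fused = traceability.merge(environment, on="batch_id", how="inner")
print(fused)
```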

  6. Data normalization is an essential step for most machine learning models, especially for artificial neural networks. According to Singh and Singh [32], it can be defined as transforming the features to a specific range. Two of its main advantages are faster model training and avoiding problems caused by features with very different value ranges. For more information on the use of data normalization for classification, please refer to the work by Singh and Singh [32].

  7. For implementation aspects of the standard scaler technique, the reader is referred to the scikit-learn library documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html

  8. For implementation aspects of the MinMax scaler technique, the reader is referred to the scikit-learn library documentation: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
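
A minimal sketch of both scalers on synthetic data (assuming scikit-learn is installed; the feature matrix is illustrative only):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Synthetic feature matrix with very different value ranges per column.
X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]])

# Standard scaler: zero mean and unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# MinMax scaler: rescales each feature to the [0, 1] range.
X_minmax = MinMaxScaler().fit_transform(X)

print(X_std)
print(X_minmax)
```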

  9. Agglomerative clustering is a commonly used variation of the hierarchical clustering models. Initially, each data point is considered a cluster. The algorithm then merges neighboring clusters, using a linkage method to evaluate the dissimilarity between them and to choose which clusters should be aggregated [12]. A dendrogram can then be used to visualize the results of each step of the algorithm.
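
The following sketch (synthetic data; scikit-learn, SciPy, and matplotlib assumed available, and Ward linkage chosen only for illustration) shows agglomerative clustering and the corresponding dendrogram:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import linkage, dendrogram

# Two loose synthetic groups of points.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (10, 2)), rng.normal(5, 0.5, (10, 2))])

# Agglomerative clustering with Ward linkage.
labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)
print(labels)

# Dendrogram of the same hierarchy, built from SciPy's linkage matrix.
dendrogram(linkage(X, method="ward"))
plt.show()
```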

  10. For an in-depth description of the different methods that are commonly used to estimate the optimal number of clusters for hierarchical clustering models, please refer to the work by Zambelli [33].

  11. For a detailed analysis of the k-means model, the reader is referred to the works by Jain [9] and Steinley [10]. For implementation aspects of the k-means model, the reader is referred to the scikit-learn library tutorials on its implementation: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html and https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_digits.html#sphx-glr-auto-examples-cluster-plot-kmeans-digits-py
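
A minimal k-means sketch on synthetic data (scikit-learn assumed; the number of clusters and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data with three compact groups.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(c, 0.3, (20, 2)) for c in (0.0, 3.0, 6.0)])

# Fit k-means with k = 3 (k-means++ initialization is the default).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.labels_[:5])        # cluster index assigned to each sample
```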

  12. For a detailed description of the elbow method, as well as other methods for determining the optimal number of clusters for applying the k-means model, please refer to the works by Kodinariya and Makwana [34] and Zambelli [33].
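
A simple sketch of the elbow method, plotting the k-means inertia (within-cluster sum of squared distances) over a range of k values on synthetic data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 3.0, 6.0)])

# Inertia for several candidate values of k.
ks = range(1, 9)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# The "elbow" is the k after which the inertia decreases only marginally.
plt.plot(list(ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```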

  13. For a detailed analysis of the PCA method, as well as examples of uses, please refer to the work by Jolliffe and Cadima [35]. For implementation aspects related to PCA, the reader is referred to the scikit-learn library tutorials on its implementation: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html and https://scikit-learn.org/stable/auto_examples/decomposition/plot_pca_iris.html#sphx-glr-auto-examples-decomposition-plot-pca-iris-py
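
A minimal PCA sketch (scikit-learn assumed; synthetic data and the choice of two components are illustrative), standardizing the data before projecting it:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic data with five features, two of which are correlated.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] * 0.9 + rng.normal(scale=0.1, size=100)

# Standardize, then project onto the first two principal components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(pca.explained_variance_ratio_)  # variance captured by each component
print(X_2d.shape)                     # (100, 2)
```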

  14. For a detailed analysis of the SOM model, the reader is referred to the works by Kohonen [11], Yin [36], and Liu and Weisberg [37]. For implementation aspects of the SOM model, the reader is referred to the Minisom library tutorials on its Github repository: https://github.com/JustGlowing/minisom
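
A minimal MiniSom sketch (synthetic, already-normalized data; the grid size and hyperparameters are illustrative, not those used in the chapter):

```python
import numpy as np
from minisom import MiniSom

# Synthetic, already-normalized data with four features.
rng = np.random.default_rng(7)
data = rng.random((200, 4))

# 6 x 6 SOM grid with a Gaussian neighborhood function.
som = MiniSom(6, 6, input_len=4, sigma=1.0, learning_rate=0.5,
              neighborhood_function="gaussian", random_seed=7)
som.random_weights_init(data)
som.train_random(data, num_iteration=1000)

# Each sample is mapped to the coordinates of its best-matching unit.
winners = [som.winner(x) for x in data]
print(winners[:5])
```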

  15. https://github.com/JustGlowing/minisom

  16. For implementation aspects as well as descriptions of the supervised learning metrics used in this chapter, please refer to the scikit-learn library documentation: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics

  17. For implementation aspects as well as descriptions of the supervised and unsupervised clustering metrics used in this chapter, please refer to the scikit-learn library documentation on clustering performance evaluation: https://scikit-learn.org/stable/modules/clustering.html#clustering-evaluation
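
A sketch of two such metrics on synthetic data, one unsupervised (silhouette score) and one supervised (adjusted Rand index, which requires ground-truth labels); the specific metrics and data are illustrative choices:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 3.0)])
y_true = np.repeat([0, 1], 30)  # ground truth, needed only for the supervised metric

labels = KMeans(n_clusters=2, n_init=10, random_state=3).fit_predict(X)

# Unsupervised metric: how compact and well separated the clusters are.
print(silhouette_score(X, labels))

# Supervised metric: agreement between predicted clusters and true labels.
print(adjusted_rand_score(y_true, labels))
```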

  18. https://www.kaggle.com

  19. https://datahub.io

  20. https://github.com

  21. For an in-depth analysis of classification metrics, including the F1 score, we refer the reader to the work by Goutte and Gaussier [42]. For implementation purposes, the reader can refer to the scikit-learn documentation: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html
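
A minimal example of computing the F1 score with scikit-learn (the label vectors are illustrative only):

```python
from sklearn.metrics import f1_score

# Hypothetical ground-truth and predicted labels for three classes.
y_true = [0, 1, 2, 0, 1, 2, 0, 1]
y_pred = [0, 2, 2, 0, 1, 1, 0, 1]

# Macro-averaged F1: unweighted mean of the per-class F1 scores.
print(f1_score(y_true, y_pred, average="macro"))

# Weighted F1: per-class F1 weighted by the number of samples in each class.
print(f1_score(y_true, y_pred, average="weighted"))
```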

  22. https://colab.research.google.com

  23. https://github.com/rfsilva1/unsupervised-learning-agriculture

  24. Cluster maps are used with the SOM model to graphically illustrate the clusters’ spatial relations [24]. These maps can be drawn using different formats. For a thorough exploration of different cluster maps, as well as best practices, please refer to the work by Samsonova et al. [24].
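
One common format is the distance map (U-matrix) provided by MiniSom; the sketch below (synthetic data, illustrative grid size) trains a small SOM and plots its distance map, where lighter regions suggest boundaries between clusters:

```python
import numpy as np
import matplotlib.pyplot as plt
from minisom import MiniSom

# Train a small SOM on synthetic data (see the earlier MiniSom sketch).
rng = np.random.default_rng(7)
data = rng.random((200, 4))
som = MiniSom(6, 6, input_len=4, neighborhood_function="gaussian", random_seed=7)
som.train_random(data, num_iteration=1000)

# Distance map (U-matrix): mean distance of each neuron to its neighbors.
plt.pcolor(som.distance_map().T, cmap="bone_r")
plt.colorbar()
plt.show()
```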

  25. For an in-depth exploration of the data imbalance problem, refer to the work by Kotsiantis et al. [45].

  26. The strategy of partially labeling the dataset and then using the partially labeled subset to train the model and predict the labels for the whole dataset is a form of semi-supervised learning. For a detailed description of semi-supervised learning models, please refer to the work by Bagherzadeh and Asil [46].
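
A sketch of this strategy using scikit-learn's self-training wrapper (one possible semi-supervised approach, not necessarily the chapter's; synthetic data, with unlabeled samples marked as -1):

```python
import numpy as np
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data; only about 20% of the samples keep their labels.
rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.repeat([0, 1], 50)
y_partial = y.copy()
y_partial[rng.random(100) > 0.2] = -1  # -1 marks unlabeled samples

# Train on the partially labeled data, then predict labels for every sample.
model = SelfTrainingClassifier(LogisticRegression()).fit(X, y_partial)
predicted_labels = model.predict(X)
print(predicted_labels[:10])
```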

  27. For a full list of papers that used the Minisom implementation of SOM, we refer the reader to its official Github repository: https://github.com/JustGlowing/minisom

References

  1. Chopra, S., & Meindl, P. (2013). Supply chain management: Strategy, planning, and operation (5th ed., 528pp). Pearson Education.

  2. Corella, V. P., Rosalen, R. C., & Simarro, D. M. (2013). SCIF-IRIS framework: A framework to facilitate interoperability in supply chains. International Journal of Computer Integrated Manufacturing, 26(1–2), 67–86.

  3. Huang, C. C., & Lin, S. H. (2010). Sharing knowledge in a supply chain using the semantic web. Expert Systems with Applications, 37(4), 3145–3161.

  4. Pang, Z., Chen, Q., Han, W., & Zheng, L. (2015). Value-centric design of the internet-of-things solution for food supply chain: Value creation, sensor portfolio and information fusion. Information Systems Frontiers, 17(2), 289–319.

  5. Verdouw, C. N., Vucic, N., Sundmaeker, H., & Beulens, A. (2013). Future internet as a driver for virtualization, connectivity and intelligence of agri-food supply chain networks. International Journal on Food System Dynamics, 4(4), 261–272.

  6. Verdouw, C. N., Wolfert, J., Beulens, A. J. M., & Rialland, A. (2016). Virtualization of food supply chains with the internet of things. Journal of Food Engineering, 176, 128–136.

  7. Verdouw, C., Sundmaeker, H., Tekinerdogan, B., Conzon, D., & Montanaro, T. (2019). Architecture framework of IoT-based food and farm systems: A multiple case study. Computers and Electronics in Agriculture, 165(104939), 1–26.

  8. Harris, I., Wang, Y., & Wang, H. (2015). ICT in multimodal transport and technological trends: Unleashing potential for the future. International Journal of Production Economics, 159, 88–103.

  9. Jain, A. K. (2010). Data clustering: 50 years beyond k-means. Pattern Recognition Letters, 31(8), 651–666.

  10. Steinley, D. (2006). K-means clustering: A half-century synthesis. British Journal of Mathematical and Statistical Psychology, 59(1), 1–34.

  11. Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1), 59–69.

  12. Hartigan, J. A. (2000). Statistical clustering. International Encyclopedia of the Social and Behavioral Sciences, Yale University, 15014–15019. https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.1277

  13. Lance, G. N., & Williams, W. T. (1967). A general theory of classificatory sorting strategies: 1. Hierarchical systems. The Computer Journal, 9(4), 373–380.

  14. Ghahramani, Z. (2003). Unsupervised learning. In: Summer School on Machine Learning. Springer: Berlin, 72–112.

  15. Liakos, K., Busato, P., Moshou, D., Pearson, S., & Bochtis, D. (2018). Machine learning in agriculture: A review. Sensors, 18(8), 1–29.

  16. Arthur, D., & Vassilvitskii, S. (2007). K-means++: The advantages of careful seeding. In Proceedings of the 18th annual ACM-SIAM symposium on discrete algorithms (pp. 1027–1035). ACM.

  17. Mehta, P., Shah, H., Kori, V., Vikani, V., Shukla, S., & Shenoy, M. (2015). Survey of unsupervised machine learning algorithms on precision agricultural data. In 2015 international conference on innovations in information, embedded and communication systems (ICIIECS) (pp. 1–8). DRDO.

  18. Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd international conference on knowledge discovery and data mining, AAAI Press, pp. 226–231.

  19. Ankerst, M., Breunig, M. M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. ACM SIGMOD Record, 28(2), 49–60.

  20. Gowda, K. C., & Ravi, T. V. (1995). Divisive clustering of symbolic objects using the concepts of both similarity and dissimilarity. Pattern Recognition, 28(8), 1277–1282.

  21. Fisher, D. H. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2(2), 139–172.

  22. Ramesh, V., Ramar, K., & Babu, S. (2013). Parallel k-means algorithm on agricultural databases. International Journal of Computer Science Issues (IJCSI), 10(1), 710–713.

  23. Kind, M. C., & Brunner, R. J. (2014). SOMz: Photometric redshift PDFs with self-organizing maps and random atlas. Monthly Notices of the Royal Astronomical Society, 438(4), 3409–3421.

  24. Samsonova, E. V., Kok, J. N., & IJzerman, A. P. (2006). TreeSOM: Cluster analysis in the self-organizing map. Neural Networks, 19(6–7), 935–949.

  25. Jeong, K. S., Hong, D. G., Byeon, M. S., Jeong, J. C., Kim, H. G., Kim, D. K., & Joo, G. J. (2010). Stream modification patterns in a river basin: Field survey and self-organizing map (SOM) application. Ecological Informatics, 5(4), 293–303.

  26. Ruß, G., & Kruse, R. (2011). Exploratory hierarchical clustering for management zone delineation in precision agriculture. In Industrial conference on data mining (pp. 161–173). Springer.

  27. Mingoti, S. A., & Lima, J. O. (2006). Comparing SOM neural network with fuzzy c-means, k-means and traditional hierarchical clustering algorithms. European Journal of Operational Research, 174(3), 1742–1759.

  28. Van den Broeck, J., Cunningham, S. A., Eeckels, R., & Herbst, K. (2005). Data cleaning: detecting, diagnosing, and editing data abnormalities. PLoS Med, 2(10), e267, pp.966–970.

  29. Wang, X., & Wang, C. (2019). Time series data cleaning: A survey. IEEE Access, 8, 1866–1881.

  30. Sun, W., Cai, Z., Li, Y., Liu, F., Fang, S., & Wang, G. (2018). Data processing and text mining technologies on electronic medical records: A review. Journal of Healthcare Engineering, 2018, 1–9.

  31. Castanedo, F. (2013). A review of data fusion techniques. The Scientific World Journal, 2013, 1–19.

  32. Singh, D., & Singh, B. (2019). Investigating the impact of data normalization on classification performance. Applied Soft Computing, 105524, 1–23.

  33. Zambelli, A. E. (2016). A data-driven approach to estimating the number of clusters in hierarchical clustering. F1000Research, 5, 1–11.

  34. Kodinariya, T. M., & Makwana, P. R. (2013). Review on determining number of cluster in k-means clustering. International Journal of Advance Research in Computer Science and Management Studies, 1(6), 90–95.

  35. Jolliffe, I. T., & Cadima, J. (2016). Principal component analysis: A review and recent developments. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374, 1–16.

  36. Yin, H. (2008). The self-organizing maps: Background, theories, extensions and applications. In Computational intelligence: A compendium (pp. 715–762). Springer.

  37. Liu, Y., & Weisberg, R. H. (2011). A review of self-organizing map applications in meteorology and oceanography. Self-organizing maps: Applications and Novel Algorithm Design, InTech:Croatia, 253–272. https://books.google.com.br/books?hl=en&lr=&id=k-SgDwAAQBAJ&oi=fnd&pg=PA253

  38. Vettigli, G. (2013). MiniSom: minimalistic and NumPy-based implementation of the Self Organizing Map. Available at: https://github.com/JustGlowing/minisom/. Accessed on: October 15th, 2020.

  39. Scribner, E. A., Battaglin, W. A., Dietze, J. E., & Thurman, E. M. (2003). 'Reconnaissance data for glyphosate, other selected herbicides, their degradation products, and antibiotics in 51 streams in nine Midwestern States', open-file report 03–217 (102pp). U.S. Geological Survey.

  40. Silva, A. M. D., Degrande, P. E., Suekane, R., Fernandes, M. G., & Zeviani, W. M. (2012). Impacto de diferentes níveis de desfolha artificial nos estádios fenológicos do algodoeiro. Revista de Ciências Agrárias, 35(1), 163–172.

  41. Omer, S. O., Abdalla, A. W. H., Mohammed, M. H., & Singh, M. (2015). Bayesian estimation of genotype-by-environment interaction in sorghum variety trials. Communications in Biometry and Crop Science, 10, 82–95.

  42. Goutte, C., & Gaussier, E. (2005). A probabilistic interpretation of precision, recall and F-score, with implication for evaluation. In European conference on information retrieval (ECIR) (pp. 345–359). Springer.

  43. Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: An overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86–97.

  44. Mohamad, I. B., & Usman, D. (2013). Standardization and its effects on k-means clustering algorithm. Research Journal of Applied Sciences, Engineering and Technology, 6(17), 3299–3303.

  45. Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering, 30(1), 25–36.

  46. Bagherzadeh, J., & Asil, H. (2019). A review of various semi-supervised learning models with a deep learning and memory approach. Iran Journal of Computer Science, 2(2), 65–80.

  47. Mansha, S., Babar, Z., Kamiran, F., & Karim, A. (2016). Neural network based association rule mining from uncertain data. In International conference on neural information processing (pp. 129–136). Springer.

  48. Stoean, C., Stoean, R., Becerra-García, R. A., García-Bermúdez, R., Atencia, M., García-Lagos, F., Velázquez-Pérez, L., & Joya, G. (2019). Unsupervised learning as a complement to convolutional neural network classification in the analysis of saccadic eye movement in spino-cerebellar ataxia type 2. In International work-conference on artificial neural networks (pp. 26–37). Springer.

  49. Riese, F. M., Keller, S., & Hinz, S. (2020). Supervised and semi-supervised self-organizing maps for regression and classification focusing on hyperspectral data. Remote Sensing, 12(1), 1–23.

Acknowledgments

This work was supported by Itaú Unibanco S.A. through the Itaú Scholarship Program, at the Centro de Ciência de Dados (C2D), Universidade de São Paulo, Brazil, by the National Council for Scientific and Technological Development (CNPq), and also by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil (CAPES) - Finance Code 001.

Copyright information

© 2022 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Silva, R.F., Mostaço, G.M., Xavier, F., Saraiva, A.M., Cugnasca, C.E. (2022). Use of Unsupervised Machine Learning for Agricultural Supply Chain Data Labeling. In: Bochtis, D.D., Moshou, D.E., Vasileiadis, G., Balafoutis, A., Pardalos, P.M. (eds) Information and Communication Technologies for Agriculture—Theme II: Data. Springer Optimization and Its Applications, vol 183. Springer, Cham. https://doi.org/10.1007/978-3-030-84148-5_11
