A big data smart agricultural system: recommending optimum fertilisers for crops

Nutrients are important to promote plant growth and nutrient deficiency is the primary factor limiting crop production. However, excess fertilisers can also have a negative impact on crop quality and yield, cause an increase in pollution and decrease producer profit. Hence, determining the suitable quantities of fertiliser for every crop is very useful. Currently, the agricultural systems with internet of things make very large data volumes. Exploiting agricultural Big Data will help to extract valuable information. However, designing and implementing a large scale agricultural data warehouse are very challenging. The data warehouse is a key module to build a smart crop system to make proficient agronomy recommendations. In our paper, an electronic agricultural record (EAR) is proposed to integrate many separate datasets into a unified dataset. Then, to store and manage the agricultural Big Data, we built an agricultural data warehouse based on Hive and Elasticsearch. Finally, we applied some statistical methods based on our data warehouse to extract fertiliser information such as a case study. These statistical methods propose the recommended quantities of fertiliser components across a wide range of environmental and crop management conditions, such as nitrogen (N), phosphorus (P) and potassium (K) for the top ten most popular crops in EU.


Introduction
In the last few years, there have been approximately 124 million people in 53 countries experiencing acute food insecurity. Besides, in another 42 countries, there are additional 143 million people were at the edge of facing acute hunger [12]. The world population in 2021 is 7.9 billion [41] and will increase to 8.6 billion by 2030 and 9.8 billion in 2050 [38]. So, the major urgent challenge for humans is the growing food demands of the annually increasing population [8,42]. The problem is exacerbated by resources for crop production which are really limited, such as available freshwater and cropland [13]. There is an urgent need to increase crop yield by using new agricultural technologies, such as smart farming also called digital agriculture.
Today, farmers often enhance soil nutrients through fertilisers to improve crop yields. The fertiliser application is crucial to protect global food and enhance the yield of cereals [15]. However, excess fertilisers will make toxic and negative impacts to crop quality and yield [7,43]. Besides, fertilisers are very expensive and fertiliser waste will reduce producer incomes. Further, redundancy of fertilisers makes pollution of air, soil and water [34]. It can create high salt concentration to hurt beneficial soil microorganisms. It also makes eutrophication of surface water and groundwater, and greenhouse gas (GHG). Agricultural production contributes to global climate change. In 2008, it accounted for about 25% of global anthropogenic GHG emissions, such as carbon dioxide ( CO 2 ), methane ( CH 4 ) and nitrous oxide ( N 2 O ). Fertiliser production is the main source because it not only consumes large amounts of energy, but also produces N 2 O and CH 4 in the manufacture of nitrate ( NO − 3 ) and ammonia ( NH 3 ) fertilisers [39,40]. So, fertiliser inputs need to be optimised to increase crop yield, farmer income and environmental quality [9,36]. Fertilisers are composed of many elements, such as nitrogen (N), phosphorus (P), potassium (K), sulfur trioxide ( SO 3 ) and magnesium oxide (MgO). An excess element can affect uptake other elements and cause both redundancy and deficiency of fertilisers. So the recommended quantity of very element in fertilisers also needs to be determined.
Smart farming applies statistical algorithms and data mining methods on historical data to discover new agricultural knowledge or build expert systems for improving farm productivity or being used tools for farmers [1,24]. The global agricultural analytic market will increase more than 110% from $580 million in 2018 to $1.236 million in 2023 [22]. For example, the Bayer company collects data from farms, processes and analyses it, and then sells it back to the producers [28]. In this paper, firstly, we analyse many separated agricultural datasets by describing their Entity Relationship Diagrams (ERDs) to determine useful attributes, entities and objects. Secondly, a constellation schema, called Electronic Agricultural Record (EAR), is modeled and built. The EAR is adjustable to combine any agricultural data in making a united representation and used to build an agricultural data warehouse (DW). Thirdly, the information of the separated datasets is standardized, extracted, transferred and loaded into the EAR schema to make our EAR dataset which is an agricultural Big Data dataset. Fourthly, to store, process and manage the EAR dataset, we propose and implement an agricultural Big Data system on top of Hive and Elasticsearch. Finally, the proposed analytic methods use data about crops, their yields and their used fertiliser elements which are extracted from the unified EAR dataset. The methods are applied to discover the suitable quantities of fertiliser components for adapting criteria, such as quality advancing, yield increasing, profit improvement and environment protection. For example, we extract the suitable quantities of fertilisers N-P-K being 100-92-123 (kg/ha) for Spring Dried Beans and 126-106-112 (kg/ha) for Winter Oats. We study the three most popular fertiliser components (i.e. N, P, K) to adapt efficiently nutrients for the top ten most popular crops in EU (i.e. Spring Barley, Winter Barley, Spring Dried Beans, Winter Dried Beans, Spring Linseed, Forage Maize, Winter Oats, Winter Rape, Winter Rye, and Winter Wheat).
The rest of this paper is organised as follows: in the next section, we reviewed related work on agricultural systems which propose useful information about fertilisers. In Sect. 3, the original datasets and their ERDs are presented and analysed. Specially, we build and propose an electronic agricultural record for agricultural data integration. In Sect. 4, the agricultural Big Data system is designed and implemented through the data warehouse. Section 5 presents a statistical methodology about fertiliser based on agricultural Big Data system as a case study. It proposes suitable quantities of fertiliser components for ten crops cross a wide range of environmental and crop management conditions. Finally, we present conclusion and future work in Sect. 6.

Related work
Many research papers used machine learning or analysis methods for fertiliser optimisation in digital agriculture. For instance, Barrett et al. [4] applied regression analysis algorithm to determine a suitable quantity fertilise N for cabbage. Cambouris [5] investigated the effect of soil texture and fertiliser N on corn yield. In [9], the back propagation neural network model was proposed to determine the recommended quantity of fertiliser N for maize. In [34], the authors experimented and analysed on a trial dataset about wheat to evaluate a long-term N management strategy which maintains a base level of fertiliser N rather than attempting to match N inputs to seasonal conditions. They concluded that a long-term N management strategy was potential to increase wheat yields, improve soil reserves and decrease environmental damage. In [43], a randomized complete block design on foxtail millet was conducted with four different rates of fertilisers N and P: no fertiliser, low, medium and high. The authors discovered fertiliser application at a medium rate (i.e. 180 kg/ha for N and 120 kg/ha for P) which would be suitable to improve yield and water use efficiency of foxtail millet in the semiarid regions. However, these papers only used and analysed trial data which is not real data collected from different farms. Especially, they did not adapt to Big Data in agriculture, where diverse external and internal factors have been combined, analysed and exploited together to give exact information or decisions to farmers or companies. Besides, they just researched fertiliser N, fertiliser P and one crop.
Moreover, some other papers analysed their datasets to support some decisions about fertilisers. For instance, Kaizzi et al. [19] developed a fertiliser optimization tool by the linear programming algorithm. The tool was used by smallholders to select fertiliser and the amount of each nutrient for their crops. In [37], the authors implemented an E-Water system used a multi-objective genetic algorithm. The system provided efficient management functions to find suitable amount of fertiliser and water. In [17], a system was built for developing optimum crop plan which exploited data regarding cropping pattern, rainfall, water status, land use etc. In [31], the authors presented a soil nitrate sensor technology based on spectroscopy analysis to manage and improve the used fertiliser N. In [16], the authors implemented the smart weather prediction using the internet of things and statistical models. The algorithm used data about temperature, rainfall, humidity, soil moisture and air pressure. However, the papers did not adapt Big Data exploitation and integration. Their datasets just include a few of agricultural information, similar to trial datasets of the papers mentioned in the above paragraph. Besides, they used available agricultural knowledge to decide on fertilisers; they did not propose recommended fertiliser quantities from agricultural Big Data.
Hence, to understand the importance of the data scale for agricultural analysis, the authors in [2,18] and [33] analysed pretty large datasets. The paper [18] contains information about 10 crops and 5 cultivation strategies. The information was collected from crawling webs and interviewing selected farmers by face-to-face. The authors proposed suitable planting strategies for some crops to get high economic benefit. In [2], the authors presented some steps to build a data warehouse in agriculture based on Microsoft SQL to facilitate accessibility and explorations of open datasets. While, the paper [33] monitored N performance for different 7 crops across different arable farms and over time in Dutch. However, the papers did not present how to organise agricultural Big Data to deal with large scale and high performance in data analysis. In addition, their datasets are not considered as real Big Data.
Finally, to fix restrictions of all the above papers, a data warehouse constellation schema is used in [24,26,27] to  [26], e.g. testing of crops, soils and nutrients, and management actions of treatments, spray, fertilisers, nutrients and inspection. The papers [26,27] are about data warehouse design and implementation. While, in [24], the authors did not build an agricultural data warehouse. They just proposed information about insecticides, herbicides, soil properties and soil pH. Specially, the three papers did not present how to design a suitable schema, and use data mining algorithms to extract recommended quantities of fertilisers.

Original datasets
We study and assess 29 datasets supplied by a leading agronomy company from 2014 to 2018. In that, each dataset is about 1.4 gigabytes in textual format and has 18 tables of records on average. The agronomy company collected these datasets from its technology centres, operational systems, iFarms, field trials and research results [29]. The company has real agricultural data in 103 distribution centres, 70 demonstration farms, 12 million hectares of direct farm customer footprints, 34 input formulation and processing facilities, 45, 000 trial units and 800 sale forces at 7 countries being Belgium, Poland, Romania, Ukraine, Ireland United Kingdom and Brazil (Figs. 1 and 2) Each dataset just contains a few of farming information. For example, the information in the crop dataset is almost about crops, such as crop name, season, crop condition, estimated yield, diameter, height and crop coverage percent and BBCH growth stage index. Besides, in the treatment dataset, there is treatment information for crop diseases, such as treatment name, form type, lot code, rate, applied date, description and comment.

Data standardization and analysis
Data standardization and integration are important tasks and need to be built in large scientific projects by the big enterprises which have various data sources. The above separate datasets will be cleaned, standardized and combined into a united dataset to analyse information and extract agricultural significant knowledge. To do this, studying and understanding the datasets are necessary and useful for not only designing a suitable unified schema, but also extracting, loading and transferring information from the separate datasets into a unified dataset. So, the ERDs of the complex original datasets need to be designed and explored.
A part of ERD of Dataset 1 is presented in Fig. 3 which has seven main entities being Activity, Cropping, Field, Inspection, Plan, Site and Treatment.
The dataset contains information about farming operations on fields with detail plans, such as inspections, sprays and treatments for each crop. While, a part of ERD of Dataset 2 is presented in Fig. 4 which has six main entities being Cropping, NutrientApplication, Observation, Seeding, SoilTesting and Zone. It focuses on information about crops, soils and nutrients on zones. There are relationships between Field, Site in Dataset 1 and Zone in Dataset 2. Because, in agriculture, a site has some fields and a field has a few of zones. For example, Fig. 5 presents a field divided into 5 zones.
In Fig. 6, a part of ERD of Dataset 3 is presented. It has seven main entities being Company, Cropping, Diar-yNote, Fertiliser, Operation, Order and Soil-Sampling. Among them, the Company and Cropping entities have the most relationships, i.e., 5 for every entity. It contains mainly information about agricultural companies/ farmers and their operations on farms such as cropping, fertilisers, soils and orders. Finally, Fig. 7 shows a part of ERD of Dataset 4. It has six main entities being Crop, Crop-State, Field, Pest, SoilAirStation and Treatment. The dataset contains information about pests on fields In the original datasets, there is a lot of overlapped information which needs to be integrated carefully. For example, all four datasets contain information about crops through entities Cropping, Crop and CropState. While, datasets 1 & 4 contain information about fields which need to connect to information about sites and zones in datasets 1 & 2. Treatment information for crops is contained in datasets 1 & 4. Further, there are some overlapped information between the activity table in Dataset 1 and the operation table in Dataset 3. So, an agricultural constellation schema is designed and proposed to integrate information from our 29 separate datasets as below.

Electronic agricultural record
Our original datasets are collected from different sources and almost raw and semi-processed data. Specifically, the data is very complex, diverse, large, unstructured, conflicting and non-standardised. So, data in agriculture has all the attributes and criteria of Big Data [25]: (1) Volume: The quantity of agricultural data is fast increasing and is explosively made by external and internal sources, such as sensors, farming company operations, retail agronomists, satellites, intelligent equipment, government agencies, research centres and farmers. The external sources can help to supply information about local market accessing, pest and disease outbreak tracking, treatment and food price; (2) Variety: The data in agriculture has various formats and types which are structured data, text, imagery, multimedia, video, equations, metrics and models; (3) Velocity: The data in agriculture is being generated, collected and stored at very high rates. Because the sensing and mobile devices become cheaper and more efficient; (4) Veracity: the characteristics of agricultural data are inaccuracy, ambiguous, uncertain and inconsistent. Because the data is collected from various systems, sensors, operations and manual processes. Hence, agronomic Big Data harmonisation and integration are very difficult and challenge missions.
We need to propose and implement a suitable schema for integrating various separate datasets. Specially, this schema must adapt the criteria of data warehouse and the analysis on agricultural Big Data. So, firstly, in three kinds of DW schema models (i.e. star, snowflake and constellation), we select constellation schema for our agricultural enterprise DW which needs many fact tables and their dimension tables. Secondly, the ideas of agronomists and the ERDs of original agricultural datasets are reviewed and selected carefully to choose suitable attributes, entities and subjects for the schema. The proposed EAR schema (Electronic Agricultural Record) is presented in Fig. 8

Hive and elasticsearch
A data warehouse is a unified repository system for various heterogeneous data sources that a big company can collect from its business systems, research results and external inputs. The DW should adapt all the criteria of agricultural Big Data being volume, variety, velocity and veracity. Redshift 1 , Cassandra 2 , MongoDB 3 , and Hive 4 are popular databases supporting efficiently the DW. Hence, we analyse them on data management, DW and technical features, and see Hive as be the best suited for our data problem. Hive is a data warehouse system built on Hadoop Distributed File System (HDFS) 5 for processing, writing and storing large datasets and running distributed applications [3,20]. Elasticsearch [35] is an open-source, distributed search engine server built on top of Apache Lucene [14]. So, it is high scalable and high performance. Besides, Elasticsearch can support agricultural information and documents, such as JSON, text, images, figures, geo-spatial, multi-media. It uses the JSON over HTTP API and gets back a JSON reply for indexing and searching data. It is built on the Java programming language and hence it can run on different platforms. Finally, Elasticsearch supports functions to visualise, analyse and search easily.

System architecture
Our system architecture for agricultural Big Data is illustrated in Fig. 9 which contains three modules, namely In DW Storage, Elasticsearch and Hive receive data in Operational Databases and External Data from Raw Data module through the ETL tool. Products module also sends information to Elasticsearch and Hive. In that, Elasticsearch receives real-time data in dashboard and web application. Otherwise, Elasticsearch sends analysed answers which need to be retrieved in real-time to Products module. With queries having multiplex calculation, the Hive component will receive from and process for the Products module directly. Hive also stores the online data from Products module through Elasticsearch and sends processed data to store in Elasticsearch.

Our primary storage performance evaluation
The reading performance of our primary DW storage (i.e. Hive) needs to be evaluated because a DW is used primarily for reporting and analysing data, not for writing data. In addition, the secondary DW storage (i.e. Elasticsearch) is near real-time in indexing and searching [11]. So it does not need to be evaluated the performance. We use Hadoop 2.6.5, Our database in Hive is copied to MySQL to evaluate and compare run-time performance. The popular HQL/ SQL commands, namely Where, Group by, Having, Right (left) Join, Order by and Union, are used to create 10 query groups for testing. Each query group uses a few of commands and includes five queries (see Table 2). In addition, the queries also applies operations, e.g. Sum, Count, Or, And, Like, Min and ≤ , to the commands to express complex queries. Each query is evaluated the runtime in three times and taken its average runtime.
The mean executive times of the 10 query groups on MySQL and our primary storage (Hive) are shown in respectively. So, MySQL is lower 3.2 times than the primary storage. If our DW is deployed on a distributed Where and Group by G 7 Where, Group by and Having G 3 Where and Right (left) Join G 8 Where, Group by, Having and Order by G 4 Where and Union G 9 Where, Group by, Having, Right (left) Join and Order by G 5 Where and Order by G 10 Where, Group by, Having, Union and Order by  system or a cloud storage, we believe that its runtime performance will be faster than MySQL many times.

Classification based on yield
From the EAR dataset, we analyse information related to fertiliser and crop yield in every field. This includes crop name, yield, field identification, year, season, the quantities of total fertiliser and main elements of fertiliser, being N, P and K. We classify each EAR record into one of the five yield groups of every crop type that is based on crop type and yield of each record. Each yield group includes 20% of the amount of records of each crop type. Among them, based on yield, group 1 is the highest 20% group, group 3 is medium 20% group and group 5 is the lowest 20% group. After that, in each group, the mean values of yield, total fertiliser, NPK group, and fertilisers N, P, and K are calculated. Table 3  In addition, in each crop yield, the different percentages between yield group 5 (the lowest yield group) and other yield groups are also presented clearly. Specifically, in Barley S., group 5 has mean yield of 4.26 ton/ha. While, group 1 and 2 have mean yield of 8.16 ton/ha and 7.32 ton/ ha, and are higher than group 5 about 91.5% and 71.8% , respectively. Besides, group 3 and 4 have mean yield of 6.52 ton/ha and 5.81 ton/ha, and are higher than group 5 about 53.1% and 36.4% , respectively. Specially, in Linseed S. and Beans S.D., group 1 is higher than group 5 about 430.2% and 382.4%.
Fertilisers have been used since the start of agriculture to supply one or more essential nutrients to the growth of crops. Today, farmers often use fertilisers being either mined or manufactured. However, fertilisers are very expensive and can harm the environment. Besides, excess fertilisers will badly impact crop quality and yield. So, the right fertiliser quantities for every crop should be used. The fertilisers are composed of many major, secondary and trace elements.  [32]. So, the "Big 3" nutrients will be detected and analysed more careful.

Crop and NPK group correlation
With classifying through the five yield groups of ten crops, we extract information about the quantities of fertiliser in Table 4 and Fig. 11. They show the mean quantities of total fertiliser, NPK group and other group of each crop in each In addition, they also present the percentage of NPK group in total fertiliser. From Table 4 and Fig. 11, there are not significant differences among yield groups in Barley W., Beans S.D., Rape W. and Rye W. While, in the remain 6 crops being Barley S., Beans W.D., Linseed S., Maize F., Oats W. and Wheat W., the percentages of NPK groups between high-yield groups and low-yield groups are clearly separate. Hence, we recommend the suitable percentages of NPK groups are about 47% for Barley S., 75% for Bean W.D., 40% for Linseed S., 57% for Maize F., 48% for Oats W. and 41% for Wheat W. Besides, the suitable total fertilisers are about 882 kg/ha for Barley S., 300 kg/ha for Bean W.D., 580 kg/ha for Linseed S., 812 kg/ha for Maize F., 724 kg/ha for Oats W. and 1,403 kg/ha for Wheat W. Moreover, the ratio of N, P and K in the NPK group is also important for developing crops. So we continue to analyse these ratios in next section.

Crop and N-P-K ratio correlation
Nitrogen is very important because it is a major component of chlorophyll, amino acids (being the building blocks of proteins) and nucleic acids (e.g. DNA). Without nitrogen, plants wither and die. Moreover, nitrogen deficiency will limit plant growth, make yellow leaves and be easily attacked by diseases and insects. On other hand, nitrogen redundancy can cause excessive growth of aquatic plants and algae which use up dissolved oxygen and clog water intake to affect growth of crops. Besides, nitrogen can pervade in drinking water, environmental damage, and be harmful to human or livestock [6]. So, in each crop, we need to determine the suitable quantity of nitrogen fertiliser to make the highest yield. Figure 12 presents mean quantities of fertiliser N used in yield groups of crops that are extracted from the data warehouse.
Phosphorus is a vital component of DNA and RNA. Especially, it captures and converts the sun's energy into useful plant compounds. P deficiency makes a stunting of the plant in the early growth, and affects both seed development and normal crop maturity in the late growth [30]. While, too much P can be toxic. Because, waste P can easily flow into water and cause algal blooms and excessive vegetative growth. Besides, it also impedes the uptake of Fe and Zn. The mean quantities of fertiliser P in yield groups for crops are presented in Fig. 13.
Potassium is essential in photosynthesis, enzyme activation and protein increment to sustain growth and reproduction of plants. K deficient plants are less resistant to drought, excess water, disease, insect attack, frost and cold [21]. Specially, K deficiency makes yellow firing leaf and poor root development. While, excess K can affect uptake other nutrients, such as N and Mg. We extract the mean quantities of fertiliser K in yield groups of crops from the data warehouse and present in Fig. 14.
The N-P-K ratio in fertiliser is important for developing crops. So we use ternary graphs to analyse correlation between the ratio of N, P and K and crop yield. In Fig. 15a, c, e, we don't see the separation among yield groups of Barley S., Barley W., Lindseed S. and Rye W. However, there are significant differences between high-yield groups and low-yield groups of Beans S.D., Beans W.D., Maize F., Oats W., Rape W. and Wheat W. in Fig. 15b-e. So we can propose the suitable ratio of N, P and K in the NPK group for the 6 crops based on information of their group 1. Combining with information extracted in Sect. 5.2 and information in Figs. 12, 13, 14 and 15, we propose the suitable ratio of N, P and K in the NPK group, and the suitable percentage of NPK group in fertiliser total for crops in Table 5. The suitable quantities of N, P and K for Beans S.D., Beans W.D., Maize F., Oats W., Rape W. and Wheat W. are extracted and recommended. While, we can only propose the suitable quantity of group NPK for Barley S. Besides, we do not have enough information to recommend the fertiliser quantities for Barley W., Linseed S. and Rye W.

Conclusion and future work
In this paper, we analysed and integrated many original agricultural datasets to determine useful dimensional and fact tables, and their attributes and relationships for proposing an EAR. Based on the EAR, also being a fact constellation schema, various separate datasets are extracted, transferred and loaded into a unified crop dataset. The EAR is adjustable and scalable to new datasets and variety standards of Big Data analytics in agriculture. Besides, we also designed and implemented an agricultural DW based on Hive and Elasticsearch which adapted criteria about DW and Big Data, such as security, high performance, high storage, variety data and data science support. Specially, from the unified EAR dataset, we presented a data analysis method based on crop yield classification with fertiliser components. The studied results showed that in some crops, the more fertilisers used, the more yield increased. However, in many other crops, they are suitable to medium fertiliser quantities and their yield decreased as using more fertilisers. We proposed the suitable quantities of the NPK group, N, P and K in various season and farms on the top ten famous crops in continental Europe, Ireland, United Kingdom and Brazil.
With the scope of the paper, we exploited information about fertiliser as a case study. However, the crop yield improvement is affected not only the fertiliser components, but also available soil properties, soil texture and nutrient translocation. So, in the future, we will apply our deep learning [10] and machine learning [23] algorithms to discovery relation of fertiliser components, soil properties, adjuvants and water requirements on increasing crop yield. Besides, time series and the weather factors, such as temperature of air and soil, sunshine, rain fall, humidity and wind speech, are powerfully affect to crop yield and also will be studied. n/a n/a n/a n/a n/a n/a Barley W. n/a n/a n/a n/a n/a n/a n/a n/a n/a Beans S. D. 436  315  72  100  92  123  32  29  39  Beans W.D. 300  224  75  102  60  62  46  27  27  Linseed S.  580  230  40 n/a n/a n/a n/a n/a n/a Maize F.  812  460  57  227  68  165  49  15  36  Oats W.  724  344  48  126  106  112  37  31  32  Rape W.  920  330  36  204  59  67  62  18  20  Rye W. n/a n/a n/a n/a n/a n/a n/a n/a n/a Wheat W.