1 Introduction

In the last few years, there have been approximately 124 million people in 53 countries experiencing acute food insecurity. Besides, in another 42 countries, there are additional 143 million people were at the edge of facing acute hunger [12]. The world population in 2021 is 7.9 billion [41] and will increase to 8.6 billion by 2030 and 9.8 billion in 2050 [38]. So, the major urgent challenge for humans is the growing food demands of the annually increasing population [8, 42]. The problem is exacerbated by resources for crop production which are really limited, such as available freshwater and cropland [13]. There is an urgent need to increase crop yield by using new agricultural technologies, such as smart farming also called digital agriculture.

Today, farmers often enhance soil nutrients through fertilisers to improve crop yields. The fertiliser application is crucial to protect global food and enhance the yield of cereals [15]. However, excess fertilisers will make toxic and negative impacts to crop quality and yield [7, 43]. Besides, fertilisers are very expensive and fertiliser waste will reduce producer incomes. Further, redundancy of fertilisers makes pollution of air, soil and water [34]. It can create high salt concentration to hurt beneficial soil microorganisms. It also makes eutrophication of surface water and groundwater, and greenhouse gas (GHG). Agricultural production contributes to global climate change. In 2008, it accounted for about 25% of global anthropogenic GHG emissions, such as carbon dioxide (\(CO_2\)), methane (\(CH_4\)) and nitrous oxide (\(N_2O\)). Fertiliser production is the main source because it not only consumes large amounts of energy, but also produces \(N_2O\) and \(CH_4\) in the manufacture of nitrate (\(NO^-_3\)) and ammonia (\(NH_3\)) fertilisers [39, 40]. So, fertiliser inputs need to be optimised to increase crop yield, farmer income and environmental quality [9, 36]. Fertilisers are composed of many elements, such as nitrogen (N), phosphorus (P), potassium (K), sulfur trioxide (\(SO_3\)) and magnesium oxide (MgO). An excess element can affect uptake other elements and cause both redundancy and deficiency of fertilisers. So the recommended quantity of very element in fertilisers also needs to be determined.

Smart farming applies statistical algorithms and data mining methods on historical data to discover new agricultural knowledge or build expert systems for improving farm productivity or being used tools for farmers [1, 24]. The global agricultural analytic market will increase more than 110% from $580 million in 2018 to $1.236 million in 2023 [22]. For example, the Bayer company collects data from farms, processes and analyses it, and then sells it back to the producers [28]. In this paper, firstly, we analyse many separated agricultural datasets by describing their Entity Relationship Diagrams (ERDs) to determine useful attributes, entities and objects. Secondly, a constellation schema, called Electronic Agricultural Record (EAR), is modeled and built. The EAR is adjustable to combine any agricultural data in making a united representation and used to build an agricultural data warehouse (DW). Thirdly, the information of the separated datasets is standardized, extracted, transferred and loaded into the EAR schema to make our EAR dataset which is an agricultural Big Data dataset. Fourthly, to store, process and manage the EAR dataset, we propose and implement an agricultural Big Data system on top of Hive and Elasticsearch. Finally, the proposed analytic methods use data about crops, their yields and their used fertiliser elements which are extracted from the unified EAR dataset. The methods are applied to discover the suitable quantities of fertiliser components for adapting criteria, such as quality advancing, yield increasing, profit improvement and environment protection. For example, we extract the suitable quantities of fertilisers NPK being 100–92–123 (kg/ha) for Spring Dried Beans and 126–106–112 (kg/ha) for Winter Oats. We study the three most popular fertiliser components (i.e. N, P, K) to adapt efficiently nutrients for the top ten most popular crops in EU (i.e. Spring Barley, Winter Barley, Spring Dried Beans, Winter Dried Beans, Spring Linseed, Forage Maize, Winter Oats, Winter Rape, Winter Rye, and Winter Wheat).

The rest of this paper is organised as follows: in the next section, we reviewed related work on agricultural systems which propose useful information about fertilisers. In Sect. 3, the original datasets and their ERDs are presented and analysed. Specially, we build and propose an electronic agricultural record for agricultural data integration. In Sect. 4, the agricultural Big Data system is designed and implemented through the data warehouse. Section 5 presents a statistical methodology about fertiliser based on agricultural Big Data system as a case study. It proposes suitable quantities of fertiliser components for ten crops cross a wide range of environmental and crop management conditions. Finally, we present conclusion and future work in Sect. 6.

2 Related work

Many research papers used machine learning or analysis methods for fertiliser optimisation in digital agriculture. For instance, Barrett et al. [4] applied regression analysis algorithm to determine a suitable quantity fertilise N for cabbage. Cambouris [5] investigated the effect of soil texture and fertiliser N on corn yield. In [9], the back propagation neural network model was proposed to determine the recommended quantity of fertiliser N for maize. In [34], the authors experimented and analysed on a trial dataset about wheat to evaluate a long-term N management strategy which maintains a base level of fertiliser N rather than attempting to match N inputs to seasonal conditions. They concluded that a long-term N management strategy was potential to increase wheat yields, improve soil reserves and decrease environmental damage. In [43], a randomized complete block design on foxtail millet was conducted with four different rates of fertilisers N and P: no fertiliser, low, medium and high. The authors discovered fertiliser application at a medium rate (i.e. 180 kg/ha for N and 120 kg/ha for P) which would be suitable to improve yield and water use efficiency of foxtail millet in the semiarid regions. However, these papers only used and analysed trial data which is not real data collected from different farms. Especially, they did not adapt to Big Data in agriculture, where diverse external and internal factors have been combined, analysed and exploited together to give exact information or decisions to farmers or companies. Besides, they just researched fertiliser N, fertiliser P and one crop.

Moreover, some other papers analysed their datasets to support some decisions about fertilisers. For instance, Kaizzi et al. [19] developed a fertiliser optimization tool by the linear programming algorithm. The tool was used by smallholders to select fertiliser and the amount of each nutrient for their crops. In [37], the authors implemented an E-Water system used a multi-objective genetic algorithm. The system provided efficient management functions to find suitable amount of fertiliser and water. In [17], a system was built for developing optimum crop plan which exploited data regarding cropping pattern, rainfall, water status, land use etc. In [31], the authors presented a soil nitrate sensor technology based on spectroscopy analysis to manage and improve the used fertiliser N. In [16], the authors implemented the smart weather prediction using the internet of things and statistical models. The algorithm used data about temperature, rainfall, humidity, soil moisture and air pressure. However, the papers did not adapt Big Data exploitation and integration. Their datasets just include a few of agricultural information, similar to trial datasets of the papers mentioned in the above paragraph. Besides, they used available agricultural knowledge to decide on fertilisers; they did not propose recommended fertiliser quantities from agricultural Big Data.

Hence, to understand the importance of the data scale for agricultural analysis, the authors in [2, 18] and [33] analysed pretty large datasets. The paper [18] contains information about 10 crops and 5 cultivation strategies. The information was collected from crawling webs and interviewing selected farmers by face-to-face. The authors proposed suitable planting strategies for some crops to get high economic benefit. In [2], the authors presented some steps to build a data warehouse in agriculture based on Microsoft SQL to facilitate accessibility and explorations of open datasets. While, the paper [33] monitored N performance for different 7 crops across different arable farms and over time in Dutch. However, the papers did not present how to organise agricultural Big Data to deal with large scale and high performance in data analysis. In addition, their datasets are not considered as real Big Data.

Finally, to fix restrictions of all the above papers, a data warehouse constellation schema is used in [24, 26, 27] to combine diverse datasets and comply with standards of agriculture Big Data. Nevertheless, there is not much important information about farming operations in the schema of [26], e.g. testing of crops, soils and nutrients, and management actions of treatments, spray, fertilisers, nutrients and inspection. The papers [26, 27] are about data warehouse design and implementation. While, in [24], the authors did not build an agricultural data warehouse. They just proposed information about insecticides, herbicides, soil properties and soil pH. Specially, the three papers did not present how to design a suitable schema, and use data mining algorithms to extract recommended quantities of fertilisers.

3 Electronic agricultural record and data integration

3.1 Original datasets

We study and assess 29 datasets supplied by a leading agronomy company from 2014 to 2018. In that, each dataset is about 1.4 gigabytes in textual format and has 18 tables of records on average. The agronomy company collected these datasets from its technology centres, operational systems, iFarms, field trials and research results [29]. The company has real agricultural data in 103 distribution centres, 70 demonstration farms, 12 million hectares of direct farm customer footprints, 34 input formulation and processing facilities, 45, 000 trial units and 800 sale forces at 7 countries being Belgium, Poland, Romania, Ukraine, Ireland United Kingdom and Brazil (Figs. 1 and 2)

Fig. 1
figure 1

Data sources in Belgium, Poland, Romania and Ukraine

Fig. 2
figure 2

Data sources in Ireland, United Kingdom and Brazil

Each dataset just contains a few of farming information. For example, the information in the crop dataset is almost about crops, such as crop name, season, crop condition, estimated yield, diameter, height and crop coverage percent and BBCH growth stage index. Besides, in the treatment dataset, there is treatment information for crop diseases, such as treatment name, form type, lot code, rate, applied date, description and comment.

3.2 Data standardization and analysis

Data standardization and integration are important tasks and need to be built in large scientific projects by the big enterprises which have various data sources. The above separate datasets will be cleaned, standardized and combined into a united dataset to analyse information and extract agricultural significant knowledge. To do this, studying and understanding the datasets are necessary and useful for not only designing a suitable unified schema, but also extracting, loading and transferring information from the separate datasets into a unified dataset. So, the ERDs of the complex original datasets need to be designed and explored.

A part of ERD of Dataset 1 is presented in Fig. 3 which has seven main entities being Activity, Cropping, Field, Inspection, Plan, Site and Treatment. The dataset contains information about farming operations on fields with detail plans, such as inspections, sprays and treatments for each crop. While, a part of ERD of Dataset 2 is presented in Fig. 4 which has six main entities being Cropping, NutrientApplication, Observation, Seeding, SoilTesting and Zone. It focuses on information about crops, soils and nutrients on zones. There are relationships between Field, Site in Dataset 1 and Zone in Dataset 2. Because, in agriculture, a site has some fields and a field has a few of zones. For example, Fig. 5 presents a field divided into 5 zones.

Fig. 3
figure 3

A part of the ERD of Dataset 1

Fig. 4
figure 4

A part of the ERD of Dataset 2

Fig. 5
figure 5

Examples about a field divided into zones

Fig. 6
figure 6

A part of the ERD of Dataset 3

In Fig. 6, a part of ERD of Dataset 3 is presented. It has seven main entities being Company, Cropping, DiaryNote, Fertiliser, Operation, Order and SoilSampling. Among them, the Company and Cropping entities have the most relationships, i.e., 5 for every entity. It contains mainly information about agricultural companies/farmers and their operations on farms such as cropping, fertilisers, soils and orders. Finally, Fig. 7 shows a part of ERD of Dataset 4. It has six main entities being Crop, CropState, Field, Pest, SoilAirStation and Treatment. The dataset contains information about pests on fields and how to treatment for crops. Besides, it also has information about soils and air temperatures.

In the original datasets, there is a lot of overlapped information which needs to be integrated carefully. For example, all four datasets contain information about crops through entities Cropping, Crop and CropState. While, datasets 1 & 4 contain information about fields which need to connect to information about sites and zones in datasets 1 & 2. Treatment information for crops is contained in datasets 1 & 4. Further, there are some overlapped information between the activity table in Dataset 1 and the operation table in Dataset 3. So, an agricultural constellation schema is designed and proposed to integrate information from our 29 separate datasets as below.

Fig. 7
figure 7

A part of the ERD of Dataset 4

3.3 Electronic agricultural record

Our original datasets are collected from different sources and almost raw and semi-processed data. Specifically, the data is very complex, diverse, large, unstructured, conflicting and non-standardised. So, data in agriculture has all the attributes and criteria of Big Data [25]: (1) Volume: The quantity of agricultural data is fast increasing and is explosively made by external and internal sources, such as sensors, farming company operations, retail agronomists, satellites, intelligent equipment, government agencies, research centres and farmers. The external sources can help to supply information about local market accessing, pest and disease outbreak tracking, treatment and food price; (2) Variety: The data in agriculture has various formats and types which are structured data, text, imagery, multimedia, video, equations, metrics and models; (3) Velocity: The data in agriculture is being generated, collected and stored at very high rates. Because the sensing and mobile devices become cheaper and more efficient; (4) Veracity: the characteristics of agricultural data are inaccuracy, ambiguous, uncertain and inconsistent. Because the data is collected from various systems, sensors, operations and manual processes. Hence, agronomic Big Data harmonisation and integration are very difficult and challenge missions.

We need to propose and implement a suitable schema for integrating various separate datasets. Specially, this schema must adapt the criteria of data warehouse and the analysis on agricultural Big Data. So, firstly, in three kinds of DW schema models (i.e. star, snowflake and constellation), we select constellation schema for our agricultural enterprise DW which needs many fact tables and their dimension tables. Secondly, the ideas of agronomists and the ERDs of original agricultural datasets are reviewed and selected carefully to choose suitable attributes, entities and subjects for the schema.

The proposed EAR schema (Electronic Agricultural Record) is presented in Fig. 8 which can handle high performance and high scalable. The EAR contains 5 fact tables being FieldFact, Management Action, Order, Sale and Testing. Among them, the FieldFact fact table describes data about fields, soils, fertilisers, nutrients, treatments, weather and pests. The Management Action fact table presents management operations on nutrients, fertilisers, inspection, treatments and spray. While, the Order and Sale fact tables include information about business operations. Finally, the Testing fact table includes testing operations on crops, soils and nutrients on zones.

In EAR, there also are 22 dimension tables, such as CropState, Fertiliser, Field, Inspection, Soil, Treatment, Weather Station, etc. Every dimension table includes information in detail about instances which are related framing operations. Some representative attributes of the dimension tables are described in Table 1. To exploit information, the HQL (Hive Query Language) or SQL (Structured Query Language) queries need to combine fact tables and dimension tables.

Fig. 8
figure 8

A part of our EAR for smart farming

Table 1 The 22 dimension tables and their representative attributes

For example: Listing the information of fertiliser and treatment for each crop. The crops were harvested in spring or summer 2018, attached by ’black twitch’ or whitefly pests, and have yield \(> 8\) tons/ha. Besides, the soil in field has pH \(> 5.5\), potassium \(>= 100\) mg/l and magnesium (Mg) \(<=80\) mg/l. To answer this requirement, the HQL/SQL query needs to use the FieldFact fact table and the 6 dimension tables, being Crop, Fertiliser, OperationTime, Pest, Soil and Treatment, as folows.

figure a

4 Big data system implementation and design

4.1 Hive and elasticsearch

A data warehouse is a unified repository system for various heterogeneous data sources that a big company can collect from its business systems, research results and external inputs. The DW should adapt all the criteria of agricultural Big Data being volume, variety, velocity and veracity. RedshiftFootnote 1, CassandraFootnote 2, MongoDBFootnote 3, and HiveFootnote 4 are popular databases supporting efficiently the DW. Hence, we analyse them on data management, DW and technical features, and see Hive as be the best suited for our data problem. Hive is a data warehouse system built on Hadoop Distributed File System (HDFS)Footnote 5 for processing, writing and storing large datasets and running distributed applications [3, 20]. Hive supports many main features: (1) Online analytical processing (OLAP); (2) Storage capacity; (3) Data extract - transform - load (ETL); (4) Governance and data lifecycle management (via Hadoop); (5) Data science; (6) Security and monitoring; (7) Workload management; (8) Hive query language, similar to SQL; and (9) Replication-recovery.

However, Hive was not built for: (1) Real-time queries; (2) Data variety adaptation; (3) Online transaction processing (OLTP); (4) Iterative execution; and (5) Row-level update. So, it needs to be combined with Elasticsearch to overcome its disadvantages.

Elasticsearch [35] is an open-source, distributed search engine server built on top of Apache Lucene [14]. So, it is high scalable and high performance. Besides, Elasticsearch can support agricultural information and documents, such as JSON, text, images, figures, geo-spatial, multi-media. It uses the JSON over HTTP API and gets back a JSON reply for indexing and searching data. It is built on the Java programming language and hence it can run on different platforms. Finally, Elasticsearch supports functions to visualise, analyse and search easily.

4.2 System architecture

Our system architecture for agricultural Big Data is illustrated in Fig. 9 which contains three modules, namely Data Integration, Management and Storage, and Information Extraction. The Data Data Integration Module has components being Raw Data (External and Operational Databases), ERDs and Electronic Agricultural Record (see more in Sect. 3). The Management and Storage Module is DW Storage including the Elasticsearch component and the Hive component (see more in 4.1 and below). The Information Extraction Module presents Web Application and Dashboard products which apply data mining algorithms to extract statistical information about fertiliser components corresponding to each crop (see more in Sect. 5).

Fig. 9
figure 9

Agricultural data warehouse

In DW Storage, Elasticsearch and Hive receive data in Operational Databases and External Data from Raw Data module through the ETL tool. Products module also sends information to Elasticsearch and Hive. In that, Elasticsearch receives real-time data in dashboard and web application. Otherwise, Elasticsearch sends analysed answers which need to be retrieved in real-time to Products module. With queries having multiplex calculation, the Hive component will receive from and process for the Products module directly. Hive also stores the online data from Products module through Elasticsearch and sends processed data to store in Elasticsearch.

4.3 Our primary storage performance evaluation

The reading performance of our primary DW storage (i.e. Hive) needs to be evaluated because a DW is used primarily for reporting and analysing data, not for writing data. In addition, the secondary DW storage (i.e. Elasticsearch) is near real-time in indexing and searching [11]. So it does not need to be evaluated the performance. We use Hadoop 2.6.5, Hive 2.3.3, JDK 1.8.0_171 and MySQL 5.7.22 for evaluation. The software are installed on Ubuntu Bash 16.04.2 on Windows 10 and a Dell laptop having 16 GB memory and Intel Core i7 CPU (2.40 GHz).

Table 2 The query groups with combined commands

Our database in Hive is copied to MySQL to evaluate and compare run-time performance. The popular HQL/SQL commands, namely Where, Group by, Having, Right (left) Join, Order by and Union, are used to create 10 query groups for testing. Each query group uses a few of commands and includes five queries (see Table 2). In addition, the queries also applies operations, e.g. Sum, Count, Or, And, Like, Min and \(\le\), to the commands to express complex queries. Each query is evaluated the runtime in three times and taken its average runtime.

Fig. 10
figure 10

Mean executive times of MySQL and Hive in every query group

The mean executive times of the 10 query groups on MySQL and our primary storage (Hive) are shown in Fig. 10. The average runtimes of a query on the primary storage and MySQL are 216 seconds and 688 seconds, respectively. So, MySQL is lower 3.2 times than the primary storage. If our DW is deployed on a distributed system or a cloud storage, we believe that its runtime performance will be faster than MySQL many times.

5 Case study: fertiliser knowledge extraction

5.1 Classification based on yield

From the EAR dataset, we analyse information related to fertiliser and crop yield in every field. This includes crop name, yield, field identification, year, season, the quantities of total fertiliser and main elements of fertiliser, being N, P and K. We classify each EAR record into one of the five yield groups of every crop type that is based on crop type and yield of each record. Each yield group includes \(20\%\) of the amount of records of each crop type. Among them, based on yield, group 1 is the highest \(20\%\) group, group 3 is medium \(20\%\) group and group 5 is the lowest \(20\%\) group. After that, in each group, the mean values of yield, total fertiliser, NPK group, and fertilisers N, P, and K are calculated.

Table 3 describes the top ten most popular crops in EU, which are Barley S. (Spring Barley), Barley W. (Winter Barley), Beans S.D. (Spring Dried Beans), Beans W.D. (Winter Dried Beans), Linseed S. (Spring Linseed), Maize F. (Forage Maize), Oats W. (Winter Oats), Rape W. (Winter Rape), Rye W. (Winter Rye), and Wheat W. (Winter Wheat). The mean yield of each yield group of each crop type is shown in this table. In addition, in each crop yield, the different percentages between yield group 5 (the lowest yield group) and other yield groups are also presented clearly. Specifically, in Barley S., group 5 has mean yield of 4.26 ton/ha. While, group 1 and 2 have mean yield of 8.16 ton/ha and 7.32 ton/ha, and are higher than group 5 about \(91.5\%\) and \(71.8\%\), respectively. Besides, group 3 and 4 have mean yield of 6.52 ton/ha and 5.81 ton/ha, and are higher than group 5 about \(53.1\%\) and \(36.4\%\), respectively. Specially, in Linseed S. and Beans S.D., group 1 is higher than group 5 about \(430.2\%\) and \(382.4\%\).

Table 3 Mean yield (ton/ha) in each yield group of every crop type

Fertilisers have been used since the start of agriculture to supply one or more essential nutrients to the growth of crops. Today, farmers often use fertilisers being either mined or manufactured. However, fertilisers are very expensive and can harm the environment. Besides, excess fertilisers will badly impact crop quality and yield. So, the right fertiliser quantities for every crop should be used. The fertilisers are composed of many major, secondary and trace elements. Among them, the trace elements are ions of Chlorine (Cl), Iron (Fe), Manganese (Mn), Zinc (Zn) and Copper (Cu). The secondary elements are Calcium (Ca), Magnesium (Mg) and Sulfur (S). While, the primary elements, being Nitrogen (N), Phosphorus (P) and Potassium (K), are used in large quantities by plants and play a key role in plant nutrition [32]. So, the “Big 3” nutrients will be detected and analysed more careful.

5.2 Crop and NPK group correlation

With classifying through the five yield groups of ten crops, we extract information about the quantities of fertiliser in Table 4 and Fig. 11. They show the mean quantities of total fertiliser, NPK group and other group of each crop in each yield group. In addition, they also present the percentage of NPK group in total fertiliser.

Table 4 The mean quantities of general fertiliser and NPK group (kg/ha)

From Table 4 and Fig. 11, there are not significant differences among yield groups in Barley W., Beans S.D., Rape W. and Rye W. While, in the remain 6 crops being Barley S., Beans W.D., Linseed S., Maize F., Oats W. and Wheat W., the percentages of NPK groups between high-yield groups and low-yield groups are clearly separate. Hence, we recommend the suitable percentages of NPK groups are about 47% for Barley S., 75% for Bean W.D., 40% for Linseed S., 57% for Maize F., 48% for Oats W. and 41% for Wheat W. Besides, the suitable total fertilisers are about 882 kg/ha for Barley S., 300 kg/ha for Bean W.D., 580 kg/ha for Linseed S., 812 kg/ha for Maize F., 724 kg/ha for Oats W. and 1,403 kg/ha for Wheat W. Moreover, the ratio of N, P and K in the NPK group is also important for developing crops. So we continue to analyse these ratios in next section.

Fig. 11
figure 11

Mean fertiliser quantities of the NPK groups and the other groups

5.3 Crop and N–P–K ratio correlation

Nitrogen is very important because it is a major component of chlorophyll, amino acids (being the building blocks of proteins) and nucleic acids (e.g. DNA). Without nitrogen, plants wither and die. Moreover, nitrogen deficiency will limit plant growth, make yellow leaves and be easily attacked by diseases and insects. On other hand, nitrogen redundancy can cause excessive growth of aquatic plants and algae which use up dissolved oxygen and clog water intake to affect growth of crops. Besides, nitrogen can pervade in drinking water, environmental damage, and be harmful to human or livestock [6]. So, in each crop, we need to determine the suitable quantity of nitrogen fertiliser to make the highest yield. Figure 12 presents mean quantities of fertiliser N used in yield groups of crops that are extracted from the data warehouse.

Phosphorus is a vital component of DNA and RNA. Especially, it captures and converts the sun’s energy into useful plant compounds. P deficiency makes a stunting of the plant in the early growth, and affects both seed development and normal crop maturity in the late growth [30]. While, too much P can be toxic. Because, waste P can easily flow into water and cause algal blooms and excessive vegetative growth. Besides, it also impedes the uptake of Fe and Zn. The mean quantities of fertiliser P in yield groups for crops are presented in Fig. 13.

Fig. 12
figure 12

Mean nitrogen fertiliser quantities

Fig. 13
figure 13

Mean phosphorus fertiliser quantities

Fig. 14
figure 14

Mean potassium fertiliser quantities

Potassium is essential in photosynthesis, enzyme activation and protein increment to sustain growth and reproduction of plants. K deficient plants are less resistant to drought, excess water, disease, insect attack, frost and cold [21]. Specially, K deficiency makes yellow firing leaf and poor root development. While, excess K can affect uptake other nutrients, such as N and Mg. We extract the mean quantities of fertiliser K in yield groups of crops from the data warehouse and present in Fig. 14.

The N–P–K ratio in fertiliser is important for developing crops. So we use ternary graphs to analyse correlation between the ratio of N, P and K and crop yield. In Fig. 15a, c, e, we don’t see the separation among yield groups of Barley S., Barley W., Lindseed S. and Rye W. However, there are significant differences between high-yield groups and low-yield groups of Beans S.D., Beans W.D., Maize F., Oats W., Rape W. and Wheat W. in Fig. 15b–e. So we can propose the suitable ratio of N, P and K in the NPK group for the 6 crops based on information of their group 1. Combining with information extracted in Sect. 5.2 and information in Figs. 121314 and 15, we propose the suitable ratio of N, P and K in the NPK group, and the suitable percentage of NPK group in fertiliser total for crops in Table 5. The suitable quantities of N, P and K for Beans S.D., Beans W.D., Maize F., Oats W., Rape W. and Wheat W. are extracted and recommended. While, we can only propose the suitable quantity of group NPK for Barley S. Besides, we do not have enough information to recommend the fertiliser quantities for Barley W., Linseed S. and Rye W.

Fig. 15
figure 15

N–P–K ratio

Table 5 The proposed quantities of the NPK group and its elements in each crop are extracted from Big Data

6 Conclusion and future work

In this paper, we analysed and integrated many original agricultural datasets to determine useful dimensional and fact tables, and their attributes and relationships for proposing an EAR. Based on the EAR, also being a fact constellation schema, various separate datasets are extracted, transferred and loaded into a unified crop dataset. The EAR is adjustable and scalable to new datasets and variety standards of Big Data analytics in agriculture. Besides, we also designed and implemented an agricultural DW based on Hive and Elasticsearch which adapted criteria about DW and Big Data, such as security, high performance, high storage, variety data and data science support. Specially, from the unified EAR dataset, we presented a data analysis method based on crop yield classification with fertiliser components. The studied results showed that in some crops, the more fertilisers used, the more yield increased. However, in many other crops, they are suitable to medium fertiliser quantities and their yield decreased as using more fertilisers. We proposed the suitable quantities of the NPK group, N, P and K in various season and farms on the top ten famous crops in continental Europe, Ireland, United Kingdom and Brazil.

With the scope of the paper, we exploited information about fertiliser as a case study. However, the crop yield improvement is affected not only the fertiliser components, but also available soil properties, soil texture and nutrient translocation. So, in the future, we will apply our deep learning [10] and machine learning [23] algorithms to discovery relation of fertiliser components, soil properties, adjuvants and water requirements on increasing crop yield. Besides, time series and the weather factors, such as temperature of air and soil, sunshine, rain fall, humidity and wind speech, are powerfully affect to crop yield and also will be studied.