Introduction

Shrimp is a valuable natural resource that is widely used in the food industry around the world (Li et al. 2016). The recent increase in the fatal spread of White Spot Syndrome Virus (WSSV) among shrimp worldwide has affected the economies of coastal regions and countries that rely on seafood consumption and export. The disease was first reported in 1992 in cultured shrimp (Penaeus japonicus) in China (Tuyen et al. 2014). Many factors are responsible for the spread of WSSV disease in shrimp. The quality of the seawater (pH, salinity and other chemical parameters) has a significant effect on shrimp production. For example, the accumulation of nitrogenous waste and ammonia is an important toxic limiting factor for shrimp because these substances severely pollute the seas, which are the main habitat of shrimp (Chaijarasphong et al. 2019). In other words, environmental conditions can have a major impact on the physiological characteristics of organisms. Physical parameters such as temperature, hypoxia, salinity and ammonia can severely disrupt the physiological processes of shrimp as well as their immune system (Hasan and Haque 2020). Global production statistics indicate that more than 4.5 million tons of shrimp production are attributed to the shrimp farming industry (Schleder et al. 2020). Aquaculture can be very effective in meeting many needs and in addressing the pressures of an increasing human population. Unfortunately, the aquaculture system is ignored in the community strategy (Caipang et al. 2008). In aqueous surroundings, total ammonia nitrogen is present in two forms: the un-ionized form (ammonia, NH3) and the ionized form (ammonium, NH4+). The un-ionized form is highly toxic to shrimp because it can diffuse across cell membranes (Lu et al. 2016). Shrimp farming faces several diseases around the world, including WSSV (Verbruggen et al. 2016). WSSV has affected all major shrimp-producing countries, especially Southeast Asian countries such as Vietnam. The economic loss caused by this disease is estimated at between $8 billion and $15 billion worldwide (Millard et al. 2020), and the average annual cost of mitigating the disease is around $1 billion (Millard et al. 2020). These shrimp diseases are often caused by interactions between the host environment and the outside environment. Recently, researchers have found that 70% of shrimp are diseased after being caught and harvested (Millard et al. 2020).

The symptoms of WSSV include discoloration of the shrimp, reduced feeding, and white spots on the shrimp (Wang et al. 1999; Nunan and Lightner 1997). However, in some shrimp the disease is asymptomatic and still eventually leads to death (Sun et al. 2013; Meng et al. 2010). Recent research has shown that feeding shrimp a diet containing a mixture of two seaweeds affects the gut microbiota and can reduce mortality due to white spot disease, and can also mitigate the harmful effects of environmental stressors (Zacarias et al. 2021). A study of WSSV in four different groups (AHPND-Vibrio parahaemolyticus infection) showed significant differences among the groups in the histopathology of the hepatopancreas of surviving shrimp. Shrimp in the co-infection group showed signs of normal histopathology in the hepatopancreas (Han et al. 2019). The toxicity of ammonia, which causes disease in shrimp, is influenced by water pH, salinity and temperature, which also significantly affect the immune system of shrimp (Kathyayani et al. 2019). Another study found that other water quality factors (dissolved oxygen concentration, nitrogen, partial pressure of carbon dioxide and pH) played a key role in determining the severity of White Spot Disease (Millard et al. 2020).

In this study, we have developed a disease susceptibility map considering spatial parameters (distances to roads and factories) and physio-chemical factors (Chemical Oxygen Demand (COD), Dissolved Oxygen (DO), Salinity, NO3, P3O4, pH) of the water at shrimp aquaculture sites. Three Decision Tree (DT) based ML models, namely Random Tree (RT), Extra Tree (ET), and J48, have been used for the data analysis and the generation of disease susceptibility maps. These DT models were selected because their results can be interpreted easily, quickly and efficiently by decision makers (Chien and Chen 2008). The shrimp aquaculture area of Quynh Luu district, Nghe An province, Vietnam, which is severely affected by WSSV, has been selected as the study area. This is the first study in which spatial parameters have been considered for the prediction of White Spot disease susceptibility using ML methods. Standard statistical measures, including the Area Under the Curve (AUC), were used to evaluate the performance of the models. Weka and ArcGIS software were used for the data analysis and model development.

Experimental

Decision tree (DT) based methods

Random tree (RT)

The RT search model is an ML model that builds a random path that can effectively explore non-convex regions. In this model, continuous computation requires control decisions to steer the system from the initial state x_init to the goal state x_goal. This model is also usually characterized by a high variance (Ajayram et al. 2021). In the workflow of the RT classifier, the input feature vectors are passed to, and classified by, each tree in the ensemble forest (Díaz et al. 2020). The class label that receives the majority of the votes is the output of the model (Zhang et al. 2020). Trees with the same parameters are trained on different training sets, which are generated by a bootstrap process. At each node, a randomly generated subset of the variables, of a specified size, is used to find the best split (Keivani and Sinha 2021). An important characteristic of RT is that the classification error is estimated using the out-of-bag (OOB) data (Kamiński and Prałat 2019; Rustam et al. 2020; Zito and Cooper 2006).
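The bootstrap and out-of-bag idea described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the Weka implementation used in the study; the function names `bootstrap_with_oob` and `random_feature_subset`, the sample size of 160, the subset size of 3 and the seed are all assumptions made for the example.

```python
import random


def bootstrap_with_oob(n_samples, rng):
    """Draw one bootstrap sample (with replacement) and return the
    in-bag indices plus the out-of-bag (OOB) indices left out of it."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


def random_feature_subset(n_features, subset_size, rng):
    """Pick the random subset of features examined at one node."""
    return sorted(rng.sample(range(n_features), subset_size))


rng = random.Random(0)                        # fixed seed, for repeatability only
in_bag, oob = bootstrap_with_oob(160, rng)    # 160 is a hypothetical training-set size
features = random_feature_subset(8, 3, rng)   # 8 factors as in this study; size 3 is arbitrary
print(len(in_bag), len(oob), features)
```

Each tree would be trained on its own `in_bag` sample, and the samples listed in `oob` could then serve as an internal test set for that tree.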

Extra tree (ET)

The ET model, or "Extremely randomized trees" model, is an advanced variant of the random forest model. It is a relatively recent method within the family of ML models. ET is an ensemble learning method that combines decision tree predictions to increase accuracy and decrease computational complexity (Geurts et al. 2006). In this model, a set of trees is generated randomly and their predictions are then aggregated: the arithmetic mean is used for regression and majority voting for classification. In the ET model, the splitting procedure at the nodes uses a completely random selection of cut-points, and the trees are grown using all the training instances rather than a bootstrap replica (John et al. 2016). This is one of the principal differences between the ET model and other tree-based ensemble models. In other words, the extra tree follows the principle of random forest. It also uses a random subset of features to train each base estimator (Okoro et al. 2021). ET has two main parameters for regression problems. The first, denoted by K, is the number of random splits, i.e., the number of randomly selected input parameters and cut-points considered at each node. The second, denoted by nmin, is the minimum sample size required to split a node. Tree growth in the ET ensemble continues, using K at each node, until leaves are reached in which all subsets have a pure output or the number of learning samples falls below nmin. ET reduces variance through explicit randomization of input parameters and cut-points combined with ensemble averaging (Ahmad et al. 2018). In addition, the use of all the original learning samples helps to minimize the bias. The number of trees in the ET model controls the extent to which the variance of the ensemble can be reduced (Hou et al. 2020).
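The role of K and nmin can be illustrated with a minimal sketch of how a single ET node might choose its split. This is not the Weka code used in the study: the function names, the use of Gini impurity as the split score and the toy [pH, salinity] values are assumptions introduced only for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of 0/1 class labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def pick_extra_tree_split(rows, labels, k, n_min, rng):
    """Return (feature_index, cut_point) for one node, or None for a leaf.

    k     : number of randomly chosen candidate features (K in the text)
    n_min : minimum number of samples required to split a node
    """
    if len(rows) < n_min or len(set(labels)) == 1:
        return None                       # too small or already pure: leaf
    n_features = len(rows[0])
    best, best_score = None, None
    for f in rng.sample(range(n_features), min(k, n_features)):
        values = [r[f] for r in rows]
        lo, hi = min(values), max(values)
        if lo == hi:
            continue                      # constant feature, cannot split
        cut = rng.uniform(lo, hi)         # cut-point drawn at random, not optimised
        left = [y for r, y in zip(rows, labels) if r[f] < cut]
        right = [y for r, y in zip(rows, labels) if r[f] >= cut]
        # weighted impurity after the split (lower is better)
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if best_score is None or score < best_score:
            best, best_score = (f, cut), score
    return best


rng = random.Random(1)
rows = [[7.8, 15.0], [8.1, 22.0], [7.2, 30.0], [8.4, 12.0]]  # hypothetical [pH, salinity]
labels = [0, 0, 1, 0]                                        # 1 = diseased (toy data)
split = pick_extra_tree_split(rows, labels, k=2, n_min=2, rng=rng)
print(split)
```

The only optimisation step is the comparison among the K random candidates; the cut-points themselves are never tuned, which is where the extra randomization comes from.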

J48

The J48 model is one of the most popular ML models and has been widely used in recent decades (Pham et al. 2017b). The method is based on a tree hierarchy that builds classification trees with a simple tree structure. In this structure, the non-terminal nodes represent the attributes, while the terminal nodes represent the decision outcomes (Sridhar Raj and Nandhini 2018). The J48 model is an example of basic classification because it can support the next class. As a result, it produces a type of hybrid classification that uses a test sample for validation. Predicted results should be compared using training and testing samples. Here, the results of each classification will be different (Chen et al. 2020). The J48 classifier creates a decision tree from the attribute values of the training samples and uses it to classify new instances. The construction of the J48 model consists of five main steps: (1) preparing the input layers represented by the independent and dependent variables; (2) creating the first internal node of the tree structure, i.e., the root node; (3) subdividing the training data into subsets, based on the root node values, to create the sub-nodes; (4) evaluating the gain ratio at the sub-nodes; and (5) repeatedly selecting the most effective attributes (Hong et al. 2018; Madhusudana et al. 2018).
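J48 is Weka's implementation of the C4.5 algorithm, which ranks candidate splits by their gain ratio. Assuming that step (4) above refers to this criterion, the following sketch shows how the gain ratio of a single binary split could be computed. The function names and the toy DO readings are hypothetical and are not taken from the study data.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold` (C4.5 style)."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    if not left or not right:
        return 0.0                        # the split separates nothing
    n = len(labels)
    w_left, w_right = len(left) / n, len(right) / n
    info_gain = entropy(labels) - (w_left * entropy(left) + w_right * entropy(right))
    split_info = -(w_left * math.log2(w_left) + w_right * math.log2(w_right))
    return info_gain / split_info


do_values = [3.0, 3.5, 5.0, 6.0]   # hypothetical DO readings
diseased = [1, 1, 0, 0]            # 1 = disease point, 0 = non-disease point
print(gain_ratio(do_values, diseased, 4.0))   # threshold that separates the classes
print(gain_ratio(do_values, diseased, 3.2))   # a less informative threshold
```

In this toy case the threshold of 4.0 separates the two classes perfectly and therefore receives the higher gain ratio, so it would be preferred when creating the sub-nodes.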

Evaluation methods of model’s performance

Statistical indicators

The data are usually split randomly in a 70:30 ratio, where 70% is used for training and 30% for validation of the models (Nguyen et al. 2021). The training dataset (70%) is used to evaluate how well the models fit the data, while the testing dataset (30%) is used to evaluate the predictive capability of the models. In the present study, the following statistical metrics were used to evaluate the accuracy of the DT-based ML models (RT, ET, and J48): Positive Predictive Value (PPV), Negative Predictive Value (NPV), Root Mean Square Error (RMSE), Accuracy (ACC), Sensitivity (SST), Specificity (SPF), and Kappa (K). Of these, PPV and NPV are the proportions of pixels correctly classified as "Shrimp’s White Spot Disease" and "non-Shrimp’s White Spot Disease" susceptibility, respectively. SST is the proportion of Shrimp’s WSSV pixels that are correctly classified, and SPF is the proportion of non-Shrimp’s WSSV pixels that are correctly classified. The K index is used to check and analyze the accuracy of the models. The value of K ranges between 0 and 1 (Tangirala 2020). The closer K is to 1, the higher the accuracy of the model. ACC is the proportion of correct predictions among all predictions for Shrimp’s White Spot Disease. Finally, the RMSE criterion shows the difference between the estimated dataset and the observed dataset. Values close to zero indicate high accuracy of the model (Tangirala 2020; Pham et al. 2017a; Van Phong et al. 2020). The equations of these criteria are given below:

$${\text{SST}} = \frac{{\text{TP}}}{{\text{TP}} + {\text{FN}}}$$
(1)
$${\text{SPF}} = \frac{{\text{TN}}}{{\text{TN}} + {\text{FP}}}$$
(2)
$${\text{PPV}} = \frac{{\text{TP}}}{{\text{FP}} + {\text{TP}}}$$
(3)
$${\text{NPV}} = \frac{{\text{TN}}}{{\text{FN}} + {\text{TN}}}$$
(4)
$${\text{Kappa}} = \frac{{\text{P}}_{{\text{p}}} - {\text{P}}_{{\text{exp}}}}{1 - {\text{P}}_{{\text{exp}}}}$$
(5)
$${\text{ACC}} = \frac{{\text{TP}} + {\text{TN}}}{{\text{TP}} + {\text{TN}} + {\text{FP}} + {\text{FN}}}$$
(6)
$${\text{RMSE}} = \sqrt{\frac{1}{{\text{N}}}\sum {\left( {\text{X}}_{{\text{predicted}}} - {\text{X}}_{{\text{actual}}} \right)^{2}}}$$
(7)

In the above formulas, TP (true positive) and TN (true negative) are the numbers of pixels that are correctly classified as Shrimp’s WSSV and non-Shrimp’s WSSV, respectively, while FP (false positive) and FN (false negative) are the numbers of pixels that are incorrectly classified as Shrimp’s WSSV and non-Shrimp’s WSSV, respectively. Pp is the proportion of pixels that are correctly classified as Shrimp’s WSSV or non-Shrimp’s WSSV (the observed agreement), and Pexp is the agreement expected by chance. Finally, Xpredicted and Xactual are the predicted and actual values in the training or testing instances of the models, and N is the total number of instances in the training or testing dataset.
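As a worked illustration of Eqs. (1) to (7), the sketch below computes the indicators from a confusion matrix. The counts are hypothetical: the paper does not list them, and they were chosen here only because they give values consistent with the training percentages reported for RT and ET in the results. Pp is taken as the observed agreement (equal to ACC) and Pexp is computed from the marginal totals, which is the usual definition of Cohen's kappa and is an assumption about how Eq. (5) was applied.

```python
import math


def confusion_metrics(tp, tn, fp, fn):
    """Indicators of Eqs. (1)-(6) from confusion-matrix counts."""
    n = tp + tn + fp + fn
    acc = (tp + tn) / n                   # observed agreement, used as P_p
    # chance agreement P_exp from the row and column totals
    p_exp = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return {
        "SST": tp / (tp + fn),
        "SPF": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
        "ACC": acc,
        "Kappa": (acc - p_exp) / (1 - p_exp),
    }


def rmse(predicted, actual):
    """Root mean square error of Eq. (7)."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n)


# Hypothetical counts, for illustration only
m = confusion_metrics(tp=73, tn=63, fp=7, fn=17)
for name, value in m.items():
    print(f"{name}: {value:.4f}")

# Hypothetical predicted susceptibility indexes against 0/1 observations
print(round(rmse([0.9, 0.2, 0.6], [1, 0, 1]), 4))
```

Keeping all indicators in one function makes it easy to see that they are different ratios of the same four counts.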

Receiver operating characteristic (ROC) curve

The ROC curve was also used to evaluate the performance of the models. The ROC curve is a plot of the trade-off between the true positive rate and the false positive rate for each possible cut-off threshold (Depina et al. 2020). The area under the ROC curve (AUC) indicates the predictive value of the system by describing its ability to accurately estimate the occurrence and non-occurrence of disease. Its value ranges from 0.5 to 1 (Tien Bui et al. 2019). The closer the AUC is to 1, the better the accuracy of the zoning map. The AUC is classified as excellent if it is between 0.9 and 1, very good if it is between 0.8 and 0.9, good if it is between 0.7 and 0.8, medium if it is between 0.6 and 0.7, and poor if it is between 0.5 and 0.6 (Tripepi et al. 2009). The numerical value of the AUC is obtained from the following formula (Dou et al. 2020; Myung et al. 1998):

$${\text{AUC}} = \frac{\sum {\text{TP}} + \sum {\text{TN}}}{{\text{P}} + {\text{N}}}$$
(8)

where TP and TN are the numbers of pixels correctly categorized as disease and non-disease, and P and N are the total numbers of disease and non-disease pixels, respectively.
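For readers who want to reproduce an AUC value from model scores, the sketch below uses the standard rank-based formulation, in which the AUC is the share of (disease, non-disease) pairs that the model ranks correctly. This is a swapped-in, commonly used way of computing the area under the ROC curve and not a literal transcription of Eq. (8). The second function applies the qualitative classes quoted above; the function names, the toy scores and the handling of boundary values are assumptions.

```python
def auc_pairwise(scores_pos, scores_neg):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.
    Ties count as half a correct pair."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def auc_class(auc):
    """Qualitative class of an AUC value; boundary values are assigned
    to the higher class, and anything below 0.6 is reported as poor."""
    if auc >= 0.9:
        return "excellent"
    if auc >= 0.8:
        return "very good"
    if auc >= 0.7:
        return "good"
    if auc >= 0.6:
        return "medium"
    return "poor"


disease_scores = [0.9, 0.8, 0.4]       # hypothetical indexes at disease points
non_disease_scores = [0.7, 0.3, 0.2]   # hypothetical indexes at non-disease points
auc = auc_pairwise(disease_scores, non_disease_scores)
print(round(auc, 3), auc_class(auc))
```

With these toy scores, eight of the nine pairs are ranked correctly, so the AUC is 8/9 and falls in the "very good" class.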

Description of study area

The shrimp farming study area is located in the north-central coast region of Vietnam, in the Quynh Luu district of Nghe An province, covering an area of approximately 445.1 km2 with a population of 307,000 (Fig. 1). The topography of the district is divided from west to east into hills, plains and coastal areas. The territory has many natural lakes and a 9.5 km long coastline passing through nine communes and two estuaries (Quen and Thoi canals).

Fig. 1 Location map of the study area showing the spread of WSSV disease in shrimp

The dominant vegetation in the hilly and mountainous area is represented by planted forests, fruit trees, delta crops, rice and vegetables. The coastal areas are occupied by shrimp ponds and mangroves. The climate is tropical monsoon, with two main seasons: warm summers and cold winters. The average temperature and average annual cumulative precipitation of the area are 25°C and 1600 mm, respectively. The warm season starts in May and ends in October, with an average temperature of 30°C. The highest temperature, recorded in July, was 40°C. The cold season starts in November and ends in April due to the influence of the Northeast Monsoon. The humid monsoon climate and a long coastline with brackish water at the river mouths are considered favorable geographical conditions for aquaculture. Aquaculture in Nghe An province includes shrimp, shellfish, crab, and fish over a surface area of about 2555 ha. Quynh Luu is the largest area of intensive shrimp farming in the province, with 465 ha producing 2–3 crops per year and yielding 3.5–4 tons per ha. This economic sector accounts for 67% of the workforce and contributes 72% of the region's GDP.

Data used

Inventory of white spot syndrome virus (WSSV) affected Shrimps

The inventory of WSSV infection in shrimp was developed from both secondary and primary disease statistics. Shrimp disease statistics from 2010 to 2020 were obtained from the Department of Agriculture of Quynh Luu District and the Fisheries Department of Nghe An province. In the study area, the shrimp ponds were most affected by WSSV in 2011, 2017, and 2020. Data for 114 disease points were collected by directly interviewing shrimp pond owners about the history of white spot disease outbreaks on their farms. The locations of the shrimp ponds were recorded by GPS field surveys (Fig. S2).

Shrimp white spot disease influencing factors

Many spatial and physio-chemical factors affect the spread of WSSV in shrimp. The physio-chemical parameters of water affecting shrimp diseases include Chemical Oxygen Demand (COD) (Fig. 2a), Dissolved Oxygen (DO) (Fig. 2b) and temperature (Khiem et al. 2020). In addition, salinity (Fig. 2c) has a great impact on the health and growth of shrimp (Falconer et al. 2016). Besides, spatial factors such as distance to the roads (access and transport networks) (Fig. 2h) and distance to factory (including hatchery and medicine) (Fig. 2d) influence shrimp farming (Giap et al. 2005). Therefore, in the present study, we have selected eight factors affecting WSSV disease, namely: pH, Salinity, COD, DO, NO3, P3O4, distance to the factory (shrimp storage) and distance to the road (farming site).

Fig. 2 Thematic maps of the study area (a COD map; b DO map; c Salinity map; d Distance to factory map; e NO3 map; f P3O4 map; g pH map; h Distance to road map)

To monitor the parameters of the water environment, 17 water sampling stations were set up on the river, shrimp ponds and estuaries, and samples were collected twice a year, in January and July 2020. At those locations, salinity, pH, DO and COD were measured directly with portable instruments such as a salinity meter (pH—9909SP), a COD meter (YSI 910) and a multi-indicator meter (Hanna HI9829-01042) for NO3 and P3O4.

Thematic layers of distances to farms and factories were extracted from the cadastral maps of the communes of 1:10,000 scale obtained from the Department of Natural Resources and Environment of Nghe An province in conjunction with Google Earth image.

Methodological flowchart and steps

The general methodology for developing an accurate susceptibility map of Shrimp’s WSSV disease, based on the spatial and physio-chemical parameters of water and using the DT-based models, is described below (Fig. 3). In the first step, eight factors which affect Shrimp’s WSSV disease were selected and their spatial database was collected from various sources. In the second step, the inventory of Shrimp’s WSSV disease, which includes 114 disease points and 114 non-disease points, was collected from the field investigation. In the third step, the disease inventory was split into two parts in a 70:30 ratio: 70% of the inventory was used to generate the training dataset for building the models, and the remaining 30% was used to generate the testing dataset for validating the models. In the fourth step, the training dataset was used to construct the DT-based models, namely RT, ET, and J48. In the final step, the testing dataset was used to validate the performance of the models using various indicators, namely PPV, NPV, RMSE, ACC, SST, SPF, AUC and K.
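The 70:30 split in the third step can be sketched as follows. The paper states only the ratio, so splitting each class separately (a stratified split), the function name `stratified_split`, the placeholder point IDs and the random seed are assumptions made for this illustration.

```python
import random


def stratified_split(points, labels, train_fraction, rng):
    """Split points into training and testing parts, class by class,
    so that both parts keep the same disease / non-disease balance."""
    by_class = {}
    for p, y in zip(points, labels):
        by_class.setdefault(y, []).append(p)
    train, test = [], []
    for y, members in by_class.items():
        members = members[:]              # copy, so the input is not modified
        rng.shuffle(members)
        n_train = round(train_fraction * len(members))
        train += [(p, y) for p in members[:n_train]]
        test += [(p, y) for p in members[n_train:]]
    return train, test


rng = random.Random(42)                   # fixed seed, for repeatability only
points = list(range(228))                 # placeholder IDs for the 228 inventory points
labels = [1] * 114 + [0] * 114            # 114 disease and 114 non-disease points
train, test = stratified_split(points, labels, 0.7, rng)
print(len(train), len(test))              # 160 68
```

Under these assumptions, each class contributes 80 points to the training dataset and 34 points to the testing dataset.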

Fig. 3 Methodological flowchart used for Shrimp WSSV disease susceptibility prediction

Results and discussion

Validation of the models

Using the training dataset, the accuracy of the three DT models in the prediction of WSSV disease was evaluated. The results show that the RT and ET models have identical PPV, NPV, SST, SPF and ACC values (91.25, 78.75, 81.11, 90.00 and 85.00, respectively), whereas the J48 model shows lower values (66.25, 65.00, 65.43, 65.82 and 65.63, respectively) (Table 1). The Kappa coefficient on the training data is also the same for the RT and ET models (K: 0.7) and lower for the J48 model (0.313) (Fig. 4). Likewise, the RT and ET models share the same RMSE value (0.299), which is lower than that of the J48 model (RMSE: 0.463) (Fig. 5a–c). However, the AUC value of the RT model (0.949) is the highest, followed by the ET (0.941) and J48 (0.706) models.

Table 1 Accuracy analysis of the models
Fig. 4 Kappa analysis of the models using training dataset and testing dataset

Fig. 5 RMSE analysis of the models using training dataset (a RT; b ET; and c J48)

Using the testing dataset, the RT and ET models again gave the same PPV, NPV, SST, SPF and ACC values (67.65, 58.82, 62.16, 64.52 and 63.24, respectively), whereas the J48 model gave 70.59, 55.88, 61.54, 65.52 and 63.24, respectively, i.e., higher PPV and SPF, lower NPV and SST, and the same ACC. Moreover, the Kappa index is the highest for the RT and J48 models (0.265) (Fig. 4). The RMSE results also showed that J48 (0.483) is the most accurate of the three models (Fig. 6a–c). However, the AUC values on the testing dataset show that the ET model (0.713) is the best, followed by RT (0.701) and J48 (0.641), in predicting the susceptibility of Shrimp’s WSSV disease in this area (Fig. 7a and b).

Fig. 6 RMSE analysis of the models using validating dataset (a RT; b ET; and c J48)

Fig. 7 AUC analysis of the models using training dataset (a) and testing dataset (b)

In general, all three models (RT, ET and J48) have shown good performance in the development of WSSV susceptibility maps. However, the ET model has shown the highest accuracy on the testing dataset (AUC: 0.713) in comparison to the other two models (RT: 0.701 and J48: 0.641), which means that it has a better predictive capability. One of the advantages of the ET method is that it combines individual trees for global approximation and generalization. It is also more robust, with smaller errors, because outliers have a limited effect on Extra Tree predictions. Moreover, ET can be computationally more efficient than single trees and more robust to irrelevant variables. Further, ET has much lower variance than single trees, and the accuracy of ET predictions increases with the number of trees in the model, depending on the type of problem (Seyyedattar et al. 2020). The findings of this study are also in line with previously published works, which stated that ML-based models are good tools for the prediction and identification of WSSV disease (Edeh et al. 2022; Ramachandran et al.).

Construction of Shrimp’s WSSV disease susceptibility maps

The Shrimp’s WSSV disease susceptibility maps were finally constructed using the trained DT models. In the first step, all pixels of the study area were assigned disease susceptibility indexes generated by the trained models. Thereafter, the Natural Breaks classification method was used to classify these indexes into three classes, namely low, moderate and high susceptibility (Fig. 8). Figure 9 shows the validation of the susceptibility maps using disease frequency ratio analysis. Most of the disease points fall in the high susceptibility class, which indicates that the generated susceptibility maps are reliable. In practice, these maps can provide shrimp farmers and aquaculture managers with valuable information on the areas with higher susceptibility to WSSV disease, helping them make informed decisions about the location and conditions for shrimp farming. In addition, farmers can implement preventive measures, adjust farming practices, and even consider alternative locations to minimize the risk of disease outbreaks.
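The idea behind Natural Breaks classification is to place the class boundaries so that values within each class are as similar as possible. The sketch below illustrates this for three classes with a brute-force search over all possible break positions. It is a stand-in suitable only for small samples, not the ArcGIS routine used in the study, and the function names and susceptibility index values are hypothetical.

```python
def sum_sq_dev(values):
    """Sum of squared deviations of the values from their mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def natural_breaks_3(values):
    """Try every pair of break positions in the sorted data and keep the
    pair with the smallest total within-class squared deviation.
    Returns the upper bounds of the low and moderate classes."""
    data = sorted(values)
    n = len(data)
    best = None
    for i in range(1, n - 1):             # first class is data[:i]
        for j in range(i + 1, n):         # second is data[i:j], third is data[j:]
            cost = sum_sq_dev(data[:i]) + sum_sq_dev(data[i:j]) + sum_sq_dev(data[j:])
            if best is None or cost < best[0]:
                best = (cost, data[i - 1], data[j - 1])
    return best[1], best[2]


def classify(index, breaks):
    """Assign a susceptibility class to one index value."""
    low_max, moderate_max = breaks
    if index <= low_max:
        return "low"
    if index <= moderate_max:
        return "moderate"
    return "high"


# Hypothetical susceptibility indexes for nine pixels
indexes = [0.05, 0.08, 0.10, 0.45, 0.50, 0.55, 0.90, 0.92, 0.95]
breaks = natural_breaks_3(indexes)
print(breaks, [classify(v, breaks) for v in (0.07, 0.50, 0.93)])
```

For a full raster, an optimised implementation of the same criterion would be needed, since the exhaustive search grows quickly with the number of values.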

Fig. 8 Disease Susceptibility Maps (DSM) using: a RT model; b ET model; c J48 model

Fig. 9 Analysis of disease density on the susceptibility maps using the models

Conclusions

White spot syndrome is a serious problem for shrimp farming because shrimp affected by WSSV can spread the disease to healthy shrimp within ponds and farms when they come into contact with each other, reducing the effectiveness of shrimp farming. In this study, we have used three DT models (RT, ET and J48) to develop WSSV susceptibility maps, which might help in effective shrimp farming management.

The results showed that the DT-based ML models, especially ET (AUC: 0.713), are promising tools for the prediction of WSSV and the production of reliable WSSV susceptibility maps, which will help decision makers to better plan and control the spatial spread of WSSV disease in shrimp. A limitation of this work is that only a limited number of factors affecting WSSV disease were used for the susceptibility assessment. Therefore, in future studies we propose to incorporate other disease-affecting parameters, such as the temperature of pond, farm and sea water; local chemicals used and present in the water; and mixing with other diseased crustaceans from the land and sea, which might improve the accuracy of the susceptibility maps of the area. Moreover, other ML models will also be used to check the possibility of further improving the quality of the maps.