Regularization Methods for High-Dimensional Data as a Tool for Seafood Traceability

Seafood traceability, which is needed to regulate food safety, control fisheries, combat fraud, and protect public health from harvesting in polluted locations, depends heavily on predicting the geographic origin of seafood. When the datasets available to study traceability are high-dimensional, standard classical statistical models fail. Under these circumstances, suitable alternative methods are needed to accurately predict the geographic origin of seafood. In this study, we propose an analytical approach that combines regularization methods and resampling techniques to overcome the high-dimensionality problem. In particular, we comparatively analyze the Ridge regression, LASSO and Elastic net penalty-based approaches. These methods were applied to predict the origin of the saltwater clam Ruditapes philippinarum, a non-indigenous and commercially very relevant marine bivalve species that occurs commonly in European estuaries. Further, Monte Carlo Cross-Validation was implemented as a resampling method to overcome challenges related to the small sample size, and the results of the three regularization methods were compared. For full reproducibility, an R Markdown file and the dataset used are provided. We conclude by highlighting the insights that this methodology may bring to the modelling of a multi-categorical response based on a high-dimensional dataset with highly correlated explanatory variables, and to the fight against the mislabeling of the geographic origin of seafood.


Introduction
As a result of the growing globalization of seafood markets, traceability has become paramount to safeguard food safety. With consumers being increasingly aware of the potential hazards that contaminated seafood may cause to their health, ensuring that the geographic origin of seafood is not mislabeled is a first step to fight fraudulent practices that aim to cover up illegal fishing [1][2][3]. Features associated with the traceability of the place of origin of seafood have been reviewed by Leal et al. [2]. Among these, biochemical and biogeochemical fingerprints, which may comprise tens of features, have been singled out as particularly useful for tracing the geographic origin of seafood.
Data analytic techniques typically applied for detecting fish and seafood authenticity and adulteration were recently reviewed by Kotsanopoulos et al. [4]. These authors identified the approaches adopted so far, which include Multivariate Analysis of Variance, Principal Component Analysis, Canonical Correlation Analysis, Multidimensional Scaling and some analytical variants of these methods (for a complete description, see [4]). In their paper, the authors point out some difficulties in using these methods, including the strict assumptions of normality, homoscedasticity and/or linearity. Generalized linear models are mentioned as a strategy to accommodate variance heterogeneity and/or non-normality. However, there is no reference on how to deal with high-dimensional datasets, which are very likely to arise in these studies because the number of species fingerprints is commonly high (increasing the number of variables) and the number of captures tends to be kept small to minimize invasive sampling (decreasing the sample size). Further, dimensionality reduction by variable selection is seldom addressed in the field literature, although, in this context, identifying the features that are most relevant for predicting the outcome is essential, as obtaining information on each of these features is commonly highly time- and cost-demanding. In addition, not all fingerprints necessarily have to be used when assembling models to predict the geographic origin of a seafood species. In fact, if not properly selected, some elements may even act as confounding variables between origins [5].
The analysis of high-dimensional datasets may be challenging, especially with a multi-categorical outcome, as the model will typically include at least one parameter per category. In addition, as the number of covariates increases relative to the sample size, problems with the convergence of parameter estimates arise and the usual maximum likelihood estimates may not exist [6]. Regularization penalty-based models, including Ridge regression, LASSO (Least Absolute Shrinkage and Selection Operator) and Elastic net, overcome these limitations. Despite not being new, these techniques are still relatively little used by statisticians, who traditionally tend to adopt more classical approaches.
In this paper we propose a methodological path for the problem of predicting a multi-categorical response variable, based on regularized models. This approach allows a sparse representation of the link between predictors and response, solving, at one go, multicollinearity and feature selection issues. In addition, we propose the resampling method of Monte Carlo Cross-Validation to study the models' performance and overcome the difficulties related to the small sample size.
To exemplify the use of the proposed methodology, these methods were applied to predict the origin of the Manila clam (Ruditapes philippinarum), a non-indigenous marine bivalve species that commonly occurs in European estuaries. This species was chosen because, when illegally harvested from places where capture has been forbidden due to public health issues, it can pose a threat to consumers if traded [7]. A good example of this scenario is the illegal harvesting of Manila clams from the Tagus estuary in Portugal. The capture of this highly priced bivalve from this specific ecosystem has been prohibited by national authorities due to public health concerns [5]. Nonetheless, this prohibition has not deterred poachers from illegally capturing these bivalves and supplying them to organized crime networks that make them available to consumers. This criminal practice is made possible through the mislabeling of the place of origin of Manila clams, taking advantage of loopholes in traceability protocols that are currently incapable of safeguarding the labeling of seafood geographic origin from fraud. Thus, we assembled regularization-based methods to counter the fraudulent practice of mislabeling the place of harvesting of seafood.
The structure of the paper is as follows. Section 2 summarizes the regularization approaches and the model validation strategies. In Sect. 3 we describe the dataset and present and discuss the results, comparing the three regularization methods regarding model fitting and predictive performance. Finally, in Sect. 4 we summarize the main conclusions and comment on how the proposed method may be further used and contribute toward seafood traceability.

Methodology
In this section we summarize the proposed methodological path. All procedures were coded in the R software [8]. Penalty-based regularization approaches were implemented using the glmnet package [9], with the assistance of the packages ensr [10] and glmnetUtils [11]. The K-fold Cross-Validation approach was run at each iteration of the Monte Carlo Cross-Validation using the glmnet package [9]. Simultaneous optimization of the penalization and mixing parameters was implemented using the caret package [12]. Finally, ROC curves were constructed with the packages ROCR [13] and PRROC [14].
Furthermore, to avoid difficulties in reproducing the proposed methodology and to ensure transparency, we have created a dedicated GitHub repository (https://github.com/rb1970/RM4Traceability) in which we make available an R Markdown file and the data used, to assist in replicating the methodology.
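To make the path concrete, the snippet below is a minimal sketch of the core fitting step with glmnet, assuming a numeric predictor matrix `x` (one row per clam, one column per standardized fingerprint) and a factor `origin` holding the harvesting locations; these object names are illustrative and do not come from the repository.

```r
library(glmnet)

set.seed(1)
# Multinomial regularized model with fivefold CV to choose lambda;
# alpha = 1 gives LASSO, alpha = 0 gives Ridge, intermediate values give Elastic net
cv_fit <- cv.glmnet(x, origin,
                    family       = "multinomial",
                    type.measure = "class",   # misclassification error
                    alpha        = 1,
                    nfolds       = 5)

# Predicted geographic origin for new observations `x_new`
predict(cv_fit, newx = x_new, s = "lambda.min", type = "class")
```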

Regularization
Regularization is generally obtained by maximizing the penalized log-likelihood of the model. Different types of penalties result in different methods. In this section we briefly describe the approaches known as Ridge regression, LASSO and Elastic net.

Ridge regression
Ridge regression [15] minimizes the residual sum of squares plus the sum of the squared coefficients weighted by a parameter λ (λ ≥ 0) that regulates the strength of the penalization. Given the data $\{(\mathbf{x}_i, y_i)\}$ (i = 1, ..., n), with Y the response variable and $\mathbf{x}$ the vector of predictors, the model coefficients β are estimated by

$$\hat{\boldsymbol{\beta}}^{\text{ridge}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^{\top}\boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}.$$

This method regulates the dimensionality by shrinking the coefficients of the less relevant features toward 0, but it never actually removes features from the model. It can achieve good performance measures but, given that it does not perform feature selection, it is commonly seen as less convenient when dealing with high-dimensional datasets [16].

LASSO
LASSO [17] also imposes a penalty on the residual sum of squares, but in this case given by the sum of the absolute values of the coefficients, i.e.,

$$\hat{\boldsymbol{\beta}}^{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \mathbf{x}_i^{\top}\boldsymbol{\beta} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\}.$$

This method has proven to be a flexible tool that performs variable selection, shrinking some coefficients exactly to zero for suitably chosen values of λ. Despite its popularity, it is generally inefficient in dealing with groups of highly correlated predictors: for each group of correlated variables, LASSO tends to arbitrarily select one predictor into the model and discard the others [18]. Another drawback of the LASSO method is that it cannot select more variables than the sample size. In addition, some studies show that the aggressiveness of LASSO may eliminate interesting features and lead it to be outperformed by Ridge regression [19].

Elastic net
The Elastic net penalty [20] combines the LASSO and Ridge regression approaches through the penalty term [21]

$$\lambda \left[ \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1-\alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right],$$

where the parameter α (0 ≤ α ≤ 1) controls the mixture between the two types of penalties. As described above, the parameter λ is responsible for controlling the strength of the penalty: as this non-negative parameter increases, the regularization effect is strengthened and the model increasingly favors keeping the coefficients small over minimizing the loss function. The α parameter controls the proximity between the Ridge regression, Elastic net and LASSO penalties: α = 0 corresponds to Ridge regression and α = 1 implements the LASSO approach. The Elastic net method is suited to dealing with highly correlated predictors [18] while, at the same time, performing feature selection. Further, it can retain more than n explanatory variables [19].
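As a hedged sketch (not the paper's exact code), one way to optimize α and λ simultaneously with the caret package is a cross-validated grid search; the grid values and object names below are illustrative.

```r
library(caret)

set.seed(1)
# Candidate mixing (alpha) and penalization (lambda) values
enet_grid <- expand.grid(alpha  = seq(0, 1, by = 0.1),
                         lambda = 10^seq(-4, 0, length.out = 50))

enet_fit <- train(x = x, y = origin,
                  method    = "glmnet",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid  = enet_grid)

enet_fit$bestTune   # alpha/lambda pair with the best cross-validated performance
```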

Resampling Techniques
The K-fold Cross-Validation method is one of the most common approaches to optimize the penalization and mixing parameters. Generically, it selects as optimal the values that lead to the model with the smallest error [22,23]. In its simplest form (K = 1), this procedure consists of a single partition of the data into two complementary sets, training and testing, fitting the model on the training set and evaluating its predictive quality on the testing set. For more statistically meaningful results, multiple rounds (K > 1) of cross-validation can be executed, in which case the validation results of the iterations are combined to assess the overall predictive performance of the model. However, this validation method can only be performed within the span of K iterations, making it dependent on the size of the dataset. Monte Carlo Cross-Validation overcomes this limitation. This method also combines the results of several iterations to assess the overall performance of a model but, unlike K-fold Cross-Validation, which divides the original dataset into K complementary partitions, it creates, at each iteration, a random split of the dataset into training and testing sets [24]. For this reason, some observations may never be selected into a testing set, whereas others may be selected more than once throughout the different iterations of this process. Hence, contrary to K-fold Cross-Validation, the number of iterations is not limited by the sample size.
In this study, the optimal hyperparameters, λ and α, were estimated through fivefold Cross-Validation run for each regularization method at each iteration of the Monte Carlo Cross-Validation. For the Ridge and LASSO approaches, the optimal λ was chosen as the value that led to the model with the smallest misclassification error while, for LASSO, also selecting the smallest number of predictors into the model. For the Elastic net, the optimal mixing and penalization parameters were estimated jointly using fivefold Cross-Validation, selecting the pair that produced the model with the lowest error.
To validate the results, we implemented Monte Carlo Cross-Validation with 1000 iterations, constructing 1000 training and testing sets at random and hence fitting 1000 distinct models for each regularization approach. We fitted and tested each model separately and, ultimately, for each regularization approach, combined the results of the 1000 fitted models. At each iteration, the dataset was partitioned into training and testing sets at a ratio of 80%/20%, respectively.
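The sketch below outlines this Monte Carlo Cross-Validation loop (LASSO shown), assuming a data frame `dat` containing the standardized fingerprints plus an `origin` column; the object names and the error summary are illustrative and not taken from the repository.

```r
library(glmnet)

set.seed(1)
n_iter <- 1000
errors <- numeric(n_iter)

for (b in seq_len(n_iter)) {
  # Random 80%/20% split into training and testing sets
  idx_train <- sample(nrow(dat), size = floor(0.8 * nrow(dat)))
  train <- dat[idx_train, ]
  test  <- dat[-idx_train, ]

  x_train <- as.matrix(train[, setdiff(names(train), "origin")])
  x_test  <- as.matrix(test[,  setdiff(names(test),  "origin")])

  # Fivefold CV inside each iteration to pick lambda
  cv_fit <- cv.glmnet(x_train, train$origin, family = "multinomial",
                      type.measure = "class", alpha = 1, nfolds = 5)

  pred <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "class")
  errors[b] <- mean(as.vector(pred) != as.character(test$origin))
}

summary(errors)   # distribution of test misclassification rates over the 1000 splits
```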

Model Validation
To validate the models, we analyzed cross-entropy values, confusion matrices and ROC (receiver operating characteristic) curves.

Cross-entropy
Cross-entropy (also named Log-Loss, LL) measures how close a model is to predicting the correct class with probability 1. It is defined as

$$LL = -\frac{1}{n} \sum_{i=1}^{n} \sum_{q=0}^{J} a_{iq} \log(p_{iq}),$$

where $a_{iq}$ (i = 1, ..., n; q = 0, ..., J) is a binary indicator of whether or not label q is the correct classification for instance i, and $p_{iq}$ is the predicted probability of assigning label q to instance i.
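For illustration, a small helper such as the one below (hypothetical, not from the repository) computes this quantity from a matrix of predicted class probabilities `prob` (rows = observations, columns = classes) and the vector of true labels `truth`.

```r
log_loss <- function(prob, truth, eps = 1e-15) {
  prob <- pmin(pmax(prob, eps), 1 - eps)            # guard against log(0)
  idx  <- cbind(seq_len(nrow(prob)),                # (row, column) of the true class
                match(as.character(truth), colnames(prob)))
  -mean(log(prob[idx]))                             # average negative log-probability
}
```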

Confusion matrices
Confusion matrices contrast the model predictions with the known class affiliations. Each entry of these matrices gives the number of predictions made by the model for a pair of observed and predicted classes, showing which observations were classified correctly or incorrectly. Based on these tables, several measures can be computed to evaluate model performance, including the Accuracy, Precision, Recall, Micro-F1, Macro-F1 and Weighted F1 indicators (definitions in "Appendix B").
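For a single fitted model, a confusion matrix can be obtained as sketched below, reusing the hypothetical `pred` and `test$origin` objects from the Monte Carlo loop above.

```r
pred_f   <- factor(as.vector(pred), levels = levels(test$origin))
conf_mat <- table(Predicted = pred_f, Observed = test$origin)
conf_mat

sum(diag(conf_mat)) / sum(conf_mat)   # Accuracy: fraction of correct predictions
```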

ROC curves
ROC curves allow the evaluation of the discriminatory ability of a binary classification model by plotting the True positive rate against the False positive rate (definitions in "Appendix B"). An uninformative diagnostic tool is represented by the line with unit slope. A ROC curve is typically summarized by the Area Under the Curve (AUC), defined as the integral of the curve over the [0, 1] range of the False positive rate. A perfect diagnostic tool has an AUC of 1, whereas the diagonal line, corresponding to an uninformative tool, has an AUC of 0.5. Model A is said to be a better diagnostic tool than model B if the respective AUC statistics are ordered such that AUC_A ≥ AUC_B.
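For the multi-class setting considered here, one option is a one-vs-rest curve per class. The sketch below, using the ROCR package, draws the ROC curve and computes the AUC for a single class; `prob` and `truth` are the hypothetical probability matrix and label vector introduced earlier.

```r
library(ROCR)

class_q   <- colnames(prob)[1]                          # class under evaluation
rocr_pred <- prediction(prob[, class_q], as.integer(truth == class_q))

perf <- performance(rocr_pred, measure = "tpr", x.measure = "fpr")
plot(perf)                                              # ROC curve for class_q
performance(rocr_pred, measure = "auc")@y.values[[1]]   # corresponding AUC
```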

Application
In this section we present a case study. We start by describing the data (Sect. 3.1) and then discuss the model fitting (Sect. 3.2) and model performance (Sect. 3.3) results.

Data
Ruditapes philippinarum, a saltwater clam, is a marine bivalve that is commercially harvested for human consumption, being one of the most important bivalve species grown in aquaculture worldwide [25][26][27][28]. The place of origin of bivalve species is expected to be traceable by features such as their biochemical and geochemical fingerprints [5,7,29].
The available dataset contained detailed information on the biochemical and geochemical compositions of, respectively, the adductor muscle and the shell of Ruditapes philippinarum, comprising 44 variables: 26 concerning the quantification of fatty acids in the adductor muscle of the clams (biochemical fingerprints, Table 3) and the remaining 18 concerning the quantification of chemical elements in the composition of their shells (geochemical fingerprints, Table 4).
As the performance of the regularization methods depends on the characteristics of the data, namely on the existence of groups of correlated variables, we present the main descriptive statistics of the dataset in "Appendix A". Two traits are worth highlighting: (1) magnitudes and standard deviations differ considerably across variables, so the data were standardized (Table 5), and (2) the data contain several groups of highly correlated variables (Fig. 5).
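The standardization step can be reproduced as in the short sketch below, assuming the data frame `dat` holds the 44 fingerprint columns plus the `origin` factor (illustrative names).

```r
fingerprints <- setdiff(names(dat), "origin")
dat[fingerprints] <- scale(dat[fingerprints])   # center and scale each variable

round(sapply(dat[fingerprints], sd), 2)         # all standard deviations are now 1
```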

Models Fitting
Model fitting for each of the regularization methods was performed considering the optimal penalization and mixing parameters estimated at each iteration. In this section, we comment on the hyperparameter estimates and the coefficient shrinkage obtained across the 1000 models. Table 1 summarizes the ranges obtained for the penalization parameter, and Fig. 1 depicts the observed distributions of the parameters. Ridge regression consistently attained much higher penalization values than the other methods (Fig. 1A), which indicates a stronger shrinkage effect. As this method does not perform variable selection, strong shrinkage is its only means of distinguishing the least from the most important predictors. The distribution of the Elastic net mixing parameter shows a clear mode at 0.1 (Fig. 1B). This means that, in most cases, this model prioritized shrinking the coefficients toward zero without eliminating them entirely. Thus, these models tend to retain relatively more variables, favoring less sparse solutions (note that the maximum number of coefficients included in the Elastic net models was 74, Table 1).

Coefficients Shrinkage and Variable Selection
Without dropping any of the 44 predictors, Ridge regression required 135 (44 × 3 + 3) coefficients at each iteration (Table 1). LASSO selected a very small number of variables to predict each class, ranging from 6 to 20, despite the low penalty values (Table 1). In fact, 80% of the LASSO models included fewer than 10 coefficients. This aggressive feature selection was most likely due to how this approach handles groups of correlated variables: multicollinearity was a marked trait of the dataset (see Fig. 5), and LASSO tends to keep just one variable representing each group, which limited the number of coefficients included in the models. The Elastic net models kept between 6 and 74 coefficients (Table 1), with nearly 35% of the models having between 60 and 70 coefficients. In summary, the LASSO approach produced the most parsimonious models, followed by the Elastic net and Ridge regression methods.

Table 1 also summarizes the residual ranges obtained for the three approaches, and Fig. 2 depicts the respective estimated densities. The LASSO and Elastic net methods present quite similar ranges and estimated residual densities. These curves display a flattened shape, with values relatively scattered, ranging from 0.0001 to approximately 0.7. In contrast, the density of the residuals from the Ridge regression has a completely different shape and central tendency: the curve is narrower and centered around higher residual values, ranging from approximately 0.2 to 0.8. Thus, Ridge regression has the highest residual deviance, while LASSO and Elastic net display similar fits.
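As an illustration of how the retained predictors can be inspected for one fitted model, the snippet below lists the fingerprints with a nonzero coefficient for at least one origin class; `cv_fit` is the hypothetical cv.glmnet object from the sketches above.

```r
coefs <- coef(cv_fit, s = "lambda.min")   # one sparse coefficient vector per class

selected <- unique(unlist(
  lapply(coefs, function(m) rownames(m)[m[, 1] != 0])
))
setdiff(selected, "(Intercept)")          # names of the selected fingerprints
```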

Models Performance
In this section we discuss the models' performance when applied to the test sets, through the analysis of cross-entropy values, confusion matrices and ROC curves. Table 2 summarizes the performance measures for the three regularization methods studied.

Cross-entropy
Based on the predicted probabilities of affiliation, we computed the cross-entropy values for each iteration of the Monte Carlo Cross-Validation and each regularization method (Table 2, Fig. 3). In general, the three approaches performed similarly. However, the results show that the cross-entropy values varied considerably more across the LASSO models than among the Ridge regression or Elastic net models, suggesting that the performance of LASSO is less stable.

Confusion matrices
Metrics based on confusion matrices (definitions in "Appendix B") are summarized in Table 2 for each regularization approach. Note that the Ridge regression misclassification rates reach values up to 12%, whereas Elastic net reaches a maximum of around 6% and LASSO a maximum of only 3%. All the other performance metrics consistently indicate a lower performance for Ridge regression. The other two methods show very similar results; however, LASSO never attains performances as low as the poorest ones reached by the Elastic net models. These results suggest that the LASSO models present the best overall performance, followed closely by Elastic net and lastly by Ridge regression.

ROC curves
ROC curves are depicted in Fig. 4. The predictive ability was remarkably good for all classes regardless of the regularization approach (Fig. 4). Ridge regression showed the poorest results, with AUC values between approximately 0.86 and 0.99 (Table 2). The LASSO method performed better than Ridge regression, with AUC values between 0.96 and 0.9990 (Table 2). Overall, LASSO also performed better than Elastic net, whose AUC values ranged from 0.89 to 0.99 (Table 2). In summary, based on the ROC curves, LASSO performed best, followed by Elastic net and then Ridge regression.

Conclusion
In this paper we propose a methodological path for the problem of predicting a multi-categorical response variable based on high-dimensional data. This methodology uses regularized models to deal with multicollinearity and to allow variable selection. Further, we propose the resampling method of Monte Carlo Cross-Validation to overcome the difficulties related to the small sample size while analyzing the models' performance. The regularization methods included the Ridge regression, LASSO and Elastic net penalty-based approaches. The dimensionality reduction ability and the predictive quality of the models were studied by applying complementary methods of model validation, including the analysis of cross-entropy values, confusion matrices and ROC curves. Overall, Ridge regression was the poorest performing method: besides being incapable of removing variables from the models, it also systematically presented the worst performance metrics. LASSO and Elastic net clearly outperformed Ridge regression and performed equivalently to each other, making it hard to single out one as better, since each outperformed the other in different aspects. In fact, although the presence of severe multicollinearity favored the LASSO method in terms of feature selection, this method showed the least stable cross-entropy values, attaining, in some cases, very high prediction losses. Nonetheless, we stress that the number of features selected by LASSO was much smaller than the number selected by the Elastic net, and LASSO consistently showed high prediction quality.

We have applied this methodological path to the problem of predicting the geographic origin of Manila clams, under a scenario of a severely small sample size and highly correlated predictors. This example has shown that it is possible to select predictors without compromising the models' prediction performance. Our application also demonstrated the adequacy of the proposed method in dealing with high-dimensional data to study traceability. We believe and hope that our study may represent a valuable contribution to the fight against illegal, unreported and unregulated (IUU) fishing, one of the major threats to achieving United Nations Sustainable Development Goal 14 (Life Below Water) [30].
Although the establishment of this methodology was initially motivated by this specific problem, its application potential evidently goes beyond this particular context. Our study presents results that may be used to guide the modelling of a multi-categorical response based on a high-dimensional dataset with highly correlated explanatory variables, which is relevant in diverse contexts. Further, by providing the code and data in an open-source manner, we aim to eliminate barriers that often hinder the reproducibility of research. Researchers and practitioners can easily access the code, gaining a deeper understanding of the steps involved in implementing the methodology, and can use the provided data to validate and compare their own results, fostering a collaborative and iterative approach. With this open sharing of resources, we intend not only to promote transparency but also to empower others to build upon our work and contribute to the collective knowledge in the field.
Funding Open access funding provided by FCT|FCCN (b-on). This work is funded by national funds through the FCT -Fundação para a Ciência e a Tecnologia, I.P., under the scope of the projects UIDB/00297/2020 and UIDP/00297/2020 (Center for Mathematics and Applications).

Conflict of interest On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Appendix A Explanatory Variables Description
See Tables 3, 4 and 5 and Fig. 5.

Appendix B Performance Metrics
Table 7 showcases the formulas for the direct metrics of a 3-class classification problem.
• Precision: Fraction of positive class predictions that were actually positive.

$$\text{Precision} = \frac{TP}{TP + FP}$$

If, for a certain class, TP + FP = 0, then we consider Precision = 1, since the model did not actually fail to predict the class.

• Recall/True Positive Rate: Fraction of all positive samples that are correctly predicted as positive.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Once more, if, for a certain class, TP + FN = 0, then we consider Recall = 1, since the model did not actually fail to predict the class.

• Specificity/True Negative Rate: Fraction of all negative samples that are correctly predicted as negative.

$$\text{Specificity} = \frac{TN}{FP + TN} \qquad (9)$$

In this case, if, for a certain class, FP + TN = 0, then we consider Specificity = 1, since the model did not actually fail to predict any negative samples.

Table 7 Formulas for the metrics of a 3-class classification problem

• F1-Score: Merging Precision and Recall into a single measure, it is, mathematically, the harmonic mean of Precision and Recall.

$$\text{F1-Score} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2TP}{2TP + FP + FN} \qquad (10)$$

If 2TP + FP + FN = 0 for a particular class, signifying that no sample from that class was selected into the testing set, we consider F1-Score = 1, since the model did not actually fail to predict the class.

Additional performance measures can be computed for the overall model, instead of considering the performance of each individual class. Considering the totals Total TP, Total FP and Total FN, obtained by summing TP, FP and FN over the three classes:

• Total Precision:

$$\text{Total Precision} = \frac{\text{Total TP}}{\text{Total TP} + \text{Total FP}} \qquad (11)$$

• Total Recall:

$$\text{Total Recall} = \frac{\text{Total TP}}{\text{Total TP} + \text{Total FN}} \qquad (12)$$

• Micro-F1: Assesses the quality of multi-class classification problems. Simply put, it measures the F1-score of the aggregated contributions of all classes.

$$\text{Micro-F1} = \frac{2 \times \text{Total Precision} \times \text{Total Recall}}{\text{Total Precision} + \text{Total Recall}} \qquad (13)$$

• Macro-F1: Calculates the F1-Score for each class individually and then takes the unweighted mean of the measures.

$$\text{Macro-F1} = \frac{1}{3} \sum_{k \in \{A, B, C\}} \text{F1-Score}_k, \quad \text{where } \text{F1-Score}_k = \frac{2TP_k}{2TP_k + FP_k + FN_k} \qquad (14)$$

• Weighted F1: Takes the weighted mean of the per-class F1-Scores, where the weight of each class is its total number of samples ($T_A$, $T_B$, $T_C$).

$$\text{Weighted F1} = \frac{T_A \times \text{F1-Score}_A + T_B \times \text{F1-Score}_B + T_C \times \text{F1-Score}_C}{T_A + T_B + T_C} \qquad (15)$$
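The per-class and aggregated F1 metrics defined above can be computed from a confusion matrix as sketched below (illustrative code, with rows of `conf_mat` holding predictions and columns holding observed classes; the zero-denominator special cases described above are not handled).

```r
tp <- diag(conf_mat)
fp <- rowSums(conf_mat) - tp   # predicted as the class, but observed as another
fn <- colSums(conf_mat) - tp   # observed as the class, but predicted as another

f1_class <- 2 * tp / (2 * tp + fp + fn)                        # per-class F1-Score
micro_f1 <- 2 * sum(tp) / (2 * sum(tp) + sum(fp) + sum(fn))    # Micro-F1
macro_f1 <- mean(f1_class)                                     # Macro-F1 (unweighted mean)
wt_f1    <- weighted.mean(f1_class, w = colSums(conf_mat))     # Weighted F1 (class sizes)

c(Micro = micro_f1, Macro = macro_f1, Weighted = wt_f1)
```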