Introduction

Colorectal cancer (CRC) is the 3rd most diagnosed cancer in the world, with more than 1.931 million new cases in 2020 worldwide and 4558 new cases in Morocco, representing almost 7.7 percent of all new cancer cases in the country1. Several factors are significantly associated with the prognosis of CRC patients2,3,4. Whether these factors are related to patient characteristics, treatments, or even the healthcare system in general, it is essential to study their effects in order to develop a care strategy adapted to the local context.

Moreover, CRC imposes an economic burden that translates into direct medical expenses (cost of screening, hospitalization, treatment, transportation, etc.) and indirect expenses such as loss of productivity5,6,7.

Financial and insurance barriers have a major effect on the survival of patients with cancer. They are generally identified as directly related to health care utilization. For example, financial concerns have been found to prevent uninsured people from seeking care unless they were in severe pain or thought they were going to die8,9. In a study on CRC patients, 18% reported untreated rectal bleeding and 20% reported a change in bowel habit but never sought medical attention. The main reason was that they did not consider this symptom to be serious10.

Furthermore, there is considerable epidemiological and observational evidence that the risk of CRC is closely related to lifestyle, particularly diet and physical activity11,12,13.

CRC, being one of the leading causes of cancer-related deaths worldwide, presents intricate challenges to researchers and clinicians alike. Its multifaceted nature is evident in its diverse etiology, varied genetic mutations, and the influence of both internal and environmental factors on its progression. Patients with seemingly similar clinical presentations can have starkly different outcomes, making prognosis and treatment a challenging endeavor. In addition, the interplay between genetic factors, lifestyle choices, environmental exposures, and gut microbiota adds layers of complexity to the disease's understanding14. Conventional epidemiological methodologies frequently face challenges in effectively disentangling the intricate and interdependent factors involved in disease etiology. Therefore, there's a pressing need for sophisticated analytical tools that can sift through this complexity, discern patterns, and provide actionable insights to improve patient outcomes.

Advancements in precision medicine will necessitate more personalized prognostic evaluations for patients to guide appropriate treatments. Traditional statistical techniques for creating prognostic models involve substantial human input, encompassing the selection of prognostic factors based on disease understanding, variable manipulation and filling, and validating model assumptions15.

In this sense, machine learning provides a promising alternative to conventional statistical models for identifying prognostic factors within vast and intricate datasets. Contrary to the standard approach of pre-selecting prognostic factors grounded in disease understanding, machine learning employs a non-deterministic strategy, letting the data itself uncover pivotal characteristics essential for precise predictions. This evolution from a pre-determined selection to a data-driven exploration signifies a substantial shift in predictive and prognostic modeling16,17.

Specifically, machine learning methods include automatic variable selection and non-parametric modeling strategies. These techniques can adeptly manage a multitude of predictors while making fewer presumptions about the relationships between particular variables and the desired outcomes. Consequently, such methodologies potentially reduce the extent of human input needed in crafting prognostic models18.

For instance, random survival forests can accommodate nonlinearities and interactions among variables. They are not confined to a uniform baseline hazard for all patients, circumventing the inherent assumption in Cox proportional hazards models19. Machine learning techniques can also make use of more data, which in different settings may give rise to further improved performance over conventional models. In addition, their built-in variable selection allows them to be used in contexts where risk factors are unknown20.

Furthermore, in the survival analysis framework, parametric models rest upon assumptions that might not be upheld in real-world scenarios. The semi-parametric Cox proportional hazards models (Cox PH), for example, operate under the assumption that changes in predictor variables lead to a multiplicative effect on the baseline hazard that is constant over the period of observation21.

Additionally, addressing missing data requires a distinct approach, such as multiple imputation. Many imputation methods assume that missing data occur randomly (missing at random). While this assumption holds in many clinical settings, it's not always guaranteed, especially in prognostic studies22.

In an optimal research setting, a study based on the entire population would be the preferred approach, providing a broad range of insights. However, due to limited resources and the lack of a nationwide information system, using data from hospitals becomes necessary. While population-based data offers a lot of information, hospital data has its own unique characteristics. Hospital datasets contain a variety of details, including clinical information, biomarkers, and healthcare expenses23.

This research focuses on a specific situation where various factors interact and affect the prognosis of CRC patients. The main goal of this study was to carefully explore the factors that shaped the risks, prognosis, and survival of CRC, particularly within the healthcare system of Morocco. As this investigation delves into the complex set of factors related to CRC and its various causes, its importance goes beyond just academic research. The significance of the study lies in its potential to provide practical insights that can guide specific healthcare policies and actions, thus filling gaps in healthcare delivery and improving patient outcomes. The aim of our study was to assess the overall survival rates for CRC at 3 years and to identify associated strong prognostic factors among patients in Morocco through an interpretable machine learning approach based on a fully non-parametric random survival forest with variable importance and partial dependence effects.

Methods

Study design and data collection

We performed a retrospective analysis of 346 patients diagnosed and followed at Hassan II University Hospital. The date of diagnosis of CRC indicated the start of the observation period. Patients were followed from January 2009 to January 2015 until death or censoring at the end of the study. The end-point was set at 36 months from the date of diagnosis. All included patients were incident cases whose dates of diagnosis occurred within the active study period, ensuring that potential biases associated with disease duration prior to study entry were minimized. This approach, focusing exclusively on incident cases, effectively reduces the risk of biases related to left truncation.

Data about patients’ characteristics were extracted from the patients’ medical records and supplemented with an active follow-up to record vital status and observed survival times. Eligible patients had a histologically confirmed diagnosis of CRC. Patients with histological types other than adenocarcinoma (N = 200) and diffuse lattice type (N = 50) were excluded. Similarly, patients with incomplete medical records were removed (N = 150).

For each patient, information on sex, age, insurance, residency, delay to treatment, personal history, tumor site, stage, tumor differentiation, histological type, surgery, and MSI/MSS status, was extracted.

Ethical considerations

This study, conducted under the strict observance of relevant guidelines and regulations, was retrospectively designed and only involved the collection of medical history data. Due to this retrospective nature, the requirement for informed consent was waived by the Local Ethics Committee of the Hassan II University Hospital of Fes. The approval for this study was granted by the aforementioned Ethics Committee, as attested by the reference number 05/18.

Statistical analysis

A descriptive analysis of the study sample was conducted.

While multiple imputation remains the gold standard in prognostic studies, it is difficult to combine with machine learning algorithms. Specifically, multiple imputation requires generating a distribution for the missing values, fitting the analysis model on each of the imputed datasets, and pooling the results using Rubin's rules. This process is feasible with regression models, as they yield coefficient estimates, allowing the pooled coefficients and their standard errors to incorporate the uncertainty surrounding the missing data imputation.

However, since the majority of machine learning algorithms take a single dataset as input, the pooling procedure becomes challenging to implement. Confronted with this limitation, we adopted a method of single imputation. Single imputation of missing data was done using the missRanger algorithm. It implements an imputation approach based on the random forest algorithm combined with the predictive mean matching method24,25. This is a non-parametric imputation method that makes no prior assumptions about the distribution of the data. It directly predicts missing values using a random forest trained on the observed parts of the dataset. The imputation is performed iteratively until a convergence criterion is reached.

Before settling on our final model based on single imputation, a comparison was conducted among the complete cases, multiple imputation relying on Random Forest (mice RF) with 10 datasets, and single imputation based on Random Forest (missRanger).

Overall survival rates (1, 2 and 3 years) and corresponding 95% CIs were calculated using the nonparametric Kaplan–Meier estimator26. Statistical comparison of the survival curves was performed using the log-rank test when stratification was performed on categorical variables.
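As a minimal illustration of the product-limit calculation behind these survival rates (a sketch only; the analyses in this study were run in R with the survival package), the Kaplan–Meier estimate multiplies, at each observed death time, the proportion of at-risk patients who survive that time. The function name `kaplan_meier` and the toy follow-up data below are hypothetical and are not taken from the cohort.

```python
from collections import Counter


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct death time.

    times  : follow-up time of each patient (e.g. months since diagnosis)
    events : 1 if death was observed, 0 if the patient was censored
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    exits = Counter(times)  # patients leaving the risk set at t (death or censoring)
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d > 0:
            # Conditional probability of surviving past t, given being at risk at t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]
    return curve


# Toy data (hypothetical): 7 patients, 3 observed deaths
toy_times = [6, 12, 12, 20, 30, 36, 36]
toy_events = [1, 1, 0, 1, 0, 0, 0]
km_curve = kaplan_meier(toy_times, toy_events)
for t, s in km_curve:
    print(f"S({t}) = {s:.3f}")
```

Censored patients never lower the curve directly; they only shrink the risk set used at later death times, which is why the estimator remains valid under right censoring.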

The Cox PH model and RSF were both used to identify factors that affect CRC patient survival in our study. The well-known Cox PH model was used to estimate the effect of prognostic factors on survival time. The multivariate Cox PH model with all the potential baseline predictors was estimated in order to compute the hazard ratios (HRs) and their associated 95% CIs21. Baseline predictors were: sex, age, insurance, residency, delay to treatment, personal history, tumor site, stage, tumor differentiation, histological type, surgery, and MSI/MSS status.

Beyond the Kaplan–Meier estimator, the Cox PH model allows the inclusion of covariates, which is useful for refining the information on survival time. In this case, the statistical significance of the adjusted covariates on survival times is tested.

Cox proportional hazard model

The Cox PH is a semi-parametric model with two components: a parametric one related to the predictor variables, and a fully nonparametric one, the baseline hazard, which makes no assumptions about the underlying distribution of survival times.

In practice, the Cox model is specified by a hazard function:

$$h\left(t\mid {X}_{i}\right)={h}_{0}\left(t\right)\exp\left({\beta }^{T}{X}_{i}\right)$$

where \({h}_{0}\left(t\right)\) is the baseline hazard function (i.e. the instantaneous risk of death at time \(t\) for a patient whose covariates are all at their reference values). This baseline hazard is scaled multiplicatively by the covariates: \({X}_{i}=\left({X}_{i1},...,{X}_{ip}\right)\) is the vector of covariates of patient \(i\), which do not depend on time, and \(\beta =\left({\beta }_{1},...,{\beta }_{p}\right)\) is the vector of regression coefficients associated with \({X}_{i}\). The parameters of the vector \(\beta\) are estimated by maximizing the partial likelihood, expressed as follows:

$$L\left(\beta \right)={\prod }_{i=1}^{m}\frac{\exp\left({\beta }^{T}{X}_{i}\right)}{{\sum }_{j\in {R}_{i}}\exp\left({\beta }^{T}{X}_{j}\right)}$$

where the product runs over the \(m\) observed event times \({t}_{i}\) and \({R}_{i}\) is the set of subjects still at risk at time \({t}_{i}\), i.e. those with observed time \({y}_{j}\ge {t}_{i}\).
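To make the role of the risk set concrete, the partial likelihood can be evaluated directly for a given coefficient vector. The sketch below is illustrative only (the models in this study were fitted with the R survival package); the function name, the toy data, and the simple handling of tied times are assumptions made for the example.

```python
import math


def cox_log_partial_likelihood(beta, covariates, times, events):
    """Log partial likelihood of a Cox PH model.

    beta       : list of p regression coefficients
    covariates : list of n rows, each a list of p time-fixed covariates
    times      : observed follow-up time of each subject
    events     : 1 if the event was observed, 0 if censored
    Tied event times are handled naively: each event uses the full risk set.
    """
    # Linear predictor beta^T X_i for every subject
    eta = [sum(b * x for b, x in zip(beta, row)) for row in covariates]
    loglik = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i != 1:
            continue  # censored subjects contribute only through the risk sets
        # Risk set R_i: subjects still under observation at time t_i
        risk = [math.exp(eta[j]) for j, t_j in enumerate(times) if t_j >= t_i]
        loglik += eta[i] - math.log(sum(risk))
    return loglik


# Toy data (hypothetical): one binary covariate, 4 subjects, 1 censored
toy_X = [[1], [0], [1], [0]]
toy_times = [2, 4, 6, 8]
toy_events = [1, 1, 0, 1]
ll_null = cox_log_partial_likelihood([0.0], toy_X, toy_times, toy_events)
ll_alt = cox_log_partial_likelihood([0.5], toy_X, toy_times, toy_events)
print(ll_null, ll_alt)
```

With all coefficients at zero every subject in a risk set is equally likely to fail, so each event contributes minus the log of the risk-set size. In practice the coefficients are found by numerically maximizing this function.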

Several approaches can be used for model selection. A simple method is to estimate univariate regression models, then a multivariate model with all predictors that were statistically significant in univariate analysis (p-value < \(\alpha\), \(\alpha =0.05\)). However, this univariate statistical significance filtering approach does not take into account interactions between covariates.

Recognizing the multitude of statistical tools available, we implemented a comprehensive regression modeling strategy to evaluate the robustness of the final Cox PH model fitted. This included diagnostic checks for the proportionality of hazards, evaluation of linearity for continuous variables, and investigation into potential outliers or influential cases.

It should be noted that, upon identifying deviations from the Cox model hypotheses, we did not employ typical modifications such as stratification or the inclusion of time-dependent covariates. This was a deliberate choice, aligning with our study's ultimate aim. Our primary objective was not to fine-tune the Cox PH model for optimal integrity and applicability, but rather to present an approach centered on the interpretability of a machine learning-based prognostic algorithm.

Random survival forests

In order to identify the most influential predictors of survival outcome, we extended our data analysis by fitting an RSF. The RSF is an extension of the classical random forest framework to right-censored observations11,27, 29.

The main advantage of this approach is that it does not require any restrictive assumptions on the distribution of the data, unlike the proportional hazard assumption for the Cox model31. As a first step, binary survival trees are grown on bootstrap samples using all predictors included in the analysis, by recursive partitioning similar to CART32. Each bootstrap sample leaves out about 37% of the observations; these out-of-bag (OOB) data are used to estimate the prediction error. The log-rank test statistic is then used as the default splitting rule for growing the survival trees33.
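The figure of about 37% follows from the bootstrap itself: a sample of size n drawn with replacement misses a given patient with probability (1 − 1/n)^n, which tends to e^(−1) ≈ 0.368 as n grows. The short check below is only an illustration; the function name is hypothetical, and the cohort size of 346 is the number of patients reported in the Results.

```python
import math


def oob_probability(n):
    """Probability that a given patient is left out of a bootstrap sample of size n."""
    # Each of the n draws (with replacement) misses the patient with probability 1 - 1/n
    return (1.0 - 1.0 / n) ** n


cohort_size = 346  # number of patients reported in the Results
p_oob = oob_probability(cohort_size)
limit = math.exp(-1)  # large-sample limit of the out-of-bag fraction
print(f"OOB fraction for n={cohort_size}: {p_oob:.4f} (limit {limit:.4f})")
```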

The final ensemble estimate is calculated by averaging the terminal node statistics over the bootstrapped trees, using the Nelson–Aalen and Kaplan–Meier estimators29.

Variable importance

Another approach to covariate selection is the variable importance method (VIMP) based on permutation. With this method, the attributable prediction error of each predictor \({X}_{i}\) is calculated. This approach is defined by34 as the difference in model predictive performance between datasets with and without permuted values for the associated variable.

Permutation-based VIMP as implemented in RSF permutes the OOB data of a variable and compares its OOB prediction error with the original one. The intuition behind this method is that large importance values indicate variables with strong predictive potential11,27, 28.
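The permutation idea can be sketched independently of any particular forest implementation. The example below is a simplified, model-agnostic version and not the randomForestSRC procedure used in this study: it permutes one column of a dataset and reports the average increase in prediction error, whereas RSF does this on the OOB data of each tree with an error based on the C-index. The functions `permutation_importance` and `mse`, the toy "model" and the toy data are all hypothetical.

```python
import random


def mse(y_true, y_pred):
    """Mean squared error, used here as a stand-in for the prediction error."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)


def permutation_importance(predict, error, X, y, column, n_repeats=20, seed=0):
    """Average increase in prediction error after permuting one covariate."""
    rng = random.Random(seed)
    baseline = error(y, [predict(row) for row in X])
    increases = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in X]
        rng.shuffle(shuffled)  # break the link between this covariate and the outcome
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, shuffled)]
        increases.append(error(y, [predict(row) for row in X_perm]) - baseline)
    return sum(increases) / n_repeats


def toy_predict(row):
    """Toy 'fitted model' (hypothetical): depends on column 0 only."""
    return 2.0 * row[0]


toy_X = [[float(i), float((i * 7) % 5)] for i in range(10)]
toy_y = [2.0 * row[0] for row in toy_X]

imp_used = permutation_importance(toy_predict, mse, toy_X, toy_y, column=0)
imp_unused = permutation_importance(toy_predict, mse, toy_X, toy_y, column=1)
print(imp_used, imp_unused)
```

A covariate the model never uses gets an importance of zero, because permuting it leaves every prediction unchanged; a covariate the model relies on gets a positive value.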

Partial dependence plots (PDP)

Furthermore, PDP were displayed to explore in depth the relationship between the estimated partial effect of a given predictor and survival rates29.

PDP can reveal the shape of the relationship between a covariate and the target variable. Its values are constructed by drawing a subset of patients at random and then predicting their survival with the random forest many times, each time setting the covariate of interest to a different value while holding all the other covariates at their observed values. This gives one risk curve per individual in the subset, and these curves are averaged to obtain a single risk curve. For categorical covariates, PDP gives the risk associated with each class of the covariate30. Only the final set of strong predictors obtained by the VIMP procedure was considered in our interpretation.
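The construction described above can be written in a few lines. This is a minimal sketch under stated assumptions, not the randomForestSRC implementation used in the study: `partial_dependence` is a hypothetical helper, and `toy_predict` is an invented linear formula standing in for a fitted forest's predicted 3-year survival, with made-up coefficients.

```python
def partial_dependence(predict, X, column, grid):
    """Average prediction when one covariate is forced to each value in grid."""
    curve = []
    for value in grid:
        # Overwrite the covariate of interest, keep the others as observed
        preds = [predict(row[:column] + [value] + row[column + 1:]) for row in X]
        curve.append((value, sum(preds) / len(preds)))
    return curve


def toy_predict(row):
    """Hypothetical survival prediction; row = [age in years, uninsured (0/1)]."""
    age, uninsured = row
    return 0.9 - 0.004 * age - 0.10 * uninsured


# Toy patients (hypothetical)
toy_X = [[40, 0], [55, 1], [70, 0], [62, 1]]

pd_insurance = partial_dependence(toy_predict, toy_X, column=1, grid=[0, 1])
pd_age = partial_dependence(toy_predict, toy_X, column=0, grid=[30, 50, 70])
print(pd_insurance)
print(pd_age)
```

For a binary covariate the difference between the two averaged predictions is the quantity read off a PDP as an effect in percentage points. For a continuous covariate such as age, the grid traces out the shape of the marginal relationship.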

Predictive accuracy

Finally, we evaluated the predictive performance of our two models, Cox PH and RSF. The predictive performance was measured by two metrics, the Concordance Index (C-index) and the Brier Score (BS). The C-index is the proportion of concordant pairs among all comparable pairs of subjects. It can be used to assess and compare the discriminative power of risk prediction survival models35. A pair of patients \(\left(i,j\right)\) is called concordant if the model predicts a lower risk for the patient who experiences the event at a later time point.

$$\text{C-index}=\frac{{\sum }_{i,j}{1}_{{T}_{j}<{T}_{i}}\cdot {1}_{{\eta }_{j}>{\eta }_{i}}\cdot {\delta }_{j}}{{\sum }_{i,j}{1}_{{T}_{j}<{T}_{i}}\cdot {\delta }_{j}}$$

with \({\eta }_{i}\) the risk score of a unit \(i\) and \({1}_{{T}_{j}<{T}_{i}}=1\) if \({T}_{j}<{T}_{i}\) else 0 and \({1}_{{\eta }_{j}>{\eta }_{i}}=1\) if \({\eta }_{j}>{\eta }_{i}\) else 0. C-index takes values between 0 and 1, with 1 corresponding to the best discriminative power of the model.
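The formula translates directly into a double loop over pairs. The sketch below is a literal implementation of the expression above, with a hypothetical function name and toy data. One consequence of following the formula as written is that pairs with tied risk scores are counted as discordant, whereas some software implementations give them half credit.

```python
def concordance_index(times, events, risk_scores):
    """C-index: concordant pairs divided by comparable pairs.

    A pair (i, j) is comparable when subject j had an observed event
    (delta_j = 1) strictly before the observed time of subject i.
    """
    concordant = 0
    comparable = 0
    n = len(times)
    for j in range(n):
        if events[j] != 1:
            continue  # delta_j = 0: j cannot be the earlier failure of a pair
        for i in range(n):
            if times[j] < times[i]:
                comparable += 1
                if risk_scores[j] > risk_scores[i]:
                    concordant += 1  # earlier failure had the higher predicted risk
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): 4 patients, the third is censored
toy_times = [5, 10, 15, 20]
toy_events = [1, 1, 0, 1]
toy_risk = [0.9, 0.4, 0.5, 0.1]
c_index = concordance_index(toy_times, toy_events, toy_risk)
print(c_index)
```

In this toy example five pairs are comparable and four of them are correctly ordered by the risk score, giving a C-index of 0.8.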

The BS is used to evaluate the accuracy of a predicted survival function at a given time point \(t\). It is a squared prediction error at that time point, weighted by the inverse probability of censoring to account for censored observations36.

$$BS\left(t\right)=\frac{1}{N}{\sum }_{i=1}^{N}\left[\frac{{\left(\widehat{S}\left(t\mid {z}_{i}\right)\right)}^{2}}{\widehat{G}\left({X}_{i}\right)}\cdot I\left({X}_{i}<t,{\delta }_{i}=1\right)+\frac{{\left(1-\widehat{S}\left(t\mid {z}_{i}\right)\right)}^{2}}{\widehat{G}\left(t\right)}\cdot I\left({X}_{i}\ge t\right)\right]$$

where \(t\) is the time point at which BS is calculated, \(N\) is the sample size, \({z}_{i}\) is the covariate vector of subject \(i\), \({X}_{i}\) is the observed time and \({\delta }_{i}\) the event indicator of subject \(i\); \(\widehat{S}\left(\cdot \right)\) is the survival function predicted by the model and \(\widehat{G}\left(\cdot \right)\) is the estimated survival function of the censoring distribution. BS takes values between 0 and 1, with 0 the best possible value corresponding to the best predictive accuracy of the model.
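A worked sketch may help to see how the censoring weights enter the score. The code below follows the formula term by term, estimating the censoring survival function with a Kaplan–Meier estimator in which censoring plays the role of the event. It is an illustration under simplifying assumptions (the weight is evaluated at the observed time rather than just before it, and the function names and toy data are hypothetical); it is not the implementation used in this study.

```python
def censoring_survival(times, events):
    """Kaplan-Meier estimate G(t) of the censoring distribution."""
    steps = []
    g = 1.0
    at_risk = len(times)
    for t in sorted(set(times)):
        censored = sum(1 for u, e in zip(times, events) if u == t and e == 0)
        if censored:
            g *= 1.0 - censored / at_risk  # censoring is treated as the 'event'
        steps.append((t, g))
        at_risk -= sum(1 for u in times if u == t)

    def G(t):
        value = 1.0
        for step_time, step_value in steps:
            if step_time > t:
                break
            value = step_value
        return value

    return G


def brier_score(t, times, events, predicted_survival_at_t):
    """Inverse-probability-of-censoring weighted Brier score at time t."""
    G = censoring_survival(times, events)
    total = 0.0
    for x_i, d_i, s_i in zip(times, events, predicted_survival_at_t):
        if x_i < t and d_i == 1:
            total += s_i ** 2 / G(x_i)        # died before t: ideal prediction is 0
        elif x_i >= t:
            total += (1.0 - s_i) ** 2 / G(t)  # still alive at t: ideal prediction is 1
        # subjects censored before t contribute nothing directly
    return total / len(times)


# Toy data (hypothetical): 5 patients, 2 censored, predictions of S(12 | z_i)
toy_times = [5, 8, 10, 15, 20]
toy_events = [1, 0, 1, 1, 0]
toy_pred = [0.2, 0.5, 0.4, 0.7, 0.9]
bs_toy = brier_score(12, toy_times, toy_events, toy_pred)
print(round(bs_toy, 4))
```

The patient censored at time 8 has an unknown status at time 12 and is dropped from the sum; the remaining patients observed beyond that point are up-weighted by 1/0.75 to compensate. Without censoring the score reduces to the ordinary mean squared error between the predicted survival probability and the observed status.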

All statistical analyses were performed using R37. Specifically, Kaplan–Meier curves and Cox models were computed using the survival package1. The finalfit package38 was used to display the univariate and multivariate Cox regression tables. The survminer package39,40 was used to draw the KM curves. The missRanger package24 was used to impute missing values from the original data set. The randomForestSRC package41,42 was used to build random survival forests, perform variable importance and display PDP.

Results

A total of 346 patients were included for analysis. Details of demographic characteristics are described in Table 1. Overall, 181 were female (52.31%), with a mean age of 56.5 years (SD = 13.4).

Table 1 Socio-demographic characteristics of CRC patients.

Table 2 describes the clinical characteristics of CRC patients. There was almost an equal proportion of patients with colon (N = 186, 53.8%) and rectum (N = 160, 46.2%) cancers. The left colon was the most common tumor subsite (143, 41.3%), followed by the inferior (84, 24.3%) and middle rectum (58, 16.8%). More than half of the patients were at a distant stage at the time of inclusion (189, 54.6%). Furthermore, the predominant histological type was mucinous (313, 90.5%).

Table 2 Clinical characteristics of CRC patients.

Table 3 presents the overall survival rates at 1, 2 and 3 years, which are, respectively, 87% (SE = 0.02; CI-95% 0.84–0.91), 77% (SE = 0.02; CI-95% 0.73–0.82) and 60% (SE = 0.03; CI-95% 0.54–0.66).

Table 3 Overall survival rates at 1 year, 2 years and 3 years.

The difference between the survival curves was statistically significant by sex (log-rank test, p = 0.0098). The difference was also statistically significant when the survival curves were stratified by stage (log-rank test, p = 0.0001). However, no statistically significant difference was observed when stratifying by site (log-rank test, p = 0.28) (Fig. 1).

Figure 1 Kaplan–Meier survival curves.

Only MSI/MSS status, delay to treatment and histological type had missing values (4.6%, 2.89% and 0.28%, respectively).

Cox proportional hazard models

The results of the multivariate Cox PH model illustrate significant disparities in mortality risks associated with surgical intervention, cancer stage, insurance status, residential location, and tumor location. Specifically, patients who did not have surgery were 3.21 times more likely to die (HR 3.21; CI 1.83–5.63; p < 0.001) than those who did, and those with a distant stage were 6.64 times more likely to die (HR 6.64; CI 2.80–15.72; p < 0.001). Patients with no health insurance had a 2.85 times higher risk of mortality (HR 2.85; CI 1.63–4.98; p < 0.001) than patients with health insurance. Similarly, patients living in rural areas had a 1.88 times greater risk of death (HR 1.88; CI 1.18–2.98; p = 0.008) compared to those living in urban areas. In addition, patients with a tumor in the rectum had a 1.86 times greater risk of death (HR 1.86; CI 1.21–2.88; p = 0.005) than patients with a tumor in the colon (Table 4).

Table 4 Cox regression models.

After the sensitivity analysis performed with RSF, we decided to keep all predictors in the multivariate Cox PH model in order to preserve the interaction between covariates. We interpret only those covariates that have a statistically significant effect in the Cox model and whose effect was confirmed by the RSF.

RSF variable importance

The RSF was fitted using the same covariates as in the Cox PH models. A variable importance permutation-based approach was used to identify the most important prognostic factors related to the overall survival of our group of patients, and confidence intervals were calculated using subsampling.

The variable importance obtained from RSF confirms that surgery, stage, insurance, residency, and age were the most important prognostic factors (Fig. 2).

Figure 2 Variable importance of random survival forest through permutation.

These results are in agreement with the results obtained from fitting the multivariate Cox PH model (Table 4), as far as statistically significant effects are concerned.

Figure 3 depicts PDP. For age, PDP shows that the survival rate increases from 18 to 30 years, decreases very slowly between 30 and 75 years, and drops after that age. For insurance, we observed a positive effect for the ‘Yes’ modality compared to the ‘No’ modality on the survival rate, with a difference of almost 15 percentage points. For residency, we observed a positive effect on the survival rate for the ‘Urban’ modality compared to the ‘Rural’ modality, with a difference of almost 9 percentage points. For the site, a positive effect on the survival rate for the ‘Colon’ modality compared to the ‘Rectum’ modality was shown, with a difference of almost 3 percentage points. Furthermore, the ‘Distant’ stage had a negative effect on the survival rate when compared to the ‘Local’ stage, with a difference of nearly 7 percentage points. For surgery, the ‘Yes’ modality had a positive effect on the survival rate compared to the ‘No’ modality, with a difference of almost 15 percentage points.

Figure 3 Partial dependence plots.

Predictive performance

The predictive performance of the Cox PH and RSF was evaluated using the C-index and the Integrated BS. The C-index of the Cox PH and RSF was 0.771 and 0.798, respectively, while the BS was 0.257 and 0.207, respectively. This shows that RSF had both better discriminative capacity and better predictive accuracy (Fig. 4).

Figure 4 Predictive performance using C-index and BS.

Discussion

CRC is one of the most common cancers diagnosed globally. However, due to earlier detection and more effective treatment, high-income countries have seen a significant decrease in CRC mortality over the past decades. Personalizing treatment, focusing on high-risk patients, and improving access to health-care systems are critical to achieving the best possible health outcome.

In Morocco, there are two Population-Based Cancer Registries (PBCR): one in Casablanca that covers about 12% of the national population (36 million people) and has 38 data sources, and another in Rabat that covers about 2% of the national population (642,000 people) and has 65 data sources43,44. However, PBCR make a trade-off between exhaustiveness in terms of incidence and the availability of variables about patients (i.e., prognostic factors). Therefore, the study of factors influencing survival times is a difficult exercise with this type of data. Nevertheless, we have attempted for the first time in Morocco to study the survival of CRC and associate it with both socio-demographic factors and histopathological characteristics.

We found only one Moroccan study in the literature on the effect of predictive factors, particularly surgery, on survival of patients with mid and low rectal adenocarcinoma45. The overall survival rate at 3 years was found to be 82%. However, this survival rate should be viewed with caution because of the small sample size (81 patients) and the specificity of the patients, who all had the same histological type.

The overall survival at 3 years observed in our cohort was 0.60 [SE = 0.03; CI 95% 0.54–0.66], which is close to the results found in the literature. In a meta-analysis of individual studies conducted in Iran, the pooled 3-year survival rate estimated by a random-effects model was 0.64 [CI 95% 0.59–0.7]12. This result is close to the rate found by46, which is 70.67% [CI 95% 66.4–74.93] in EMRO countries.

The literature on prognostic factors for CRC is extremely rich47,48. Regarding age, our results reveal a high-risk period after 72 years and a low-risk period before 30 years (Fig. 3). This result is interesting because the majority of studies on CRC predictive factors use age as a categorical variable through age groups, thus losing the estimate of its prognostic effect at specific ages47,49,50.

We also identified factors that are considered to be barriers to health care seeking for CRC patients. Our results reveal that health insurance, pathological stage, and surgery were the main prognostic factors. Age, residency, and site also have a significant impact on the predicted survival rate.

In fact, the health coverage of patients is a condition for their regular access to care and thus has an important impact on their survival probability after diagnosis. Financial barriers affect perceptions of the severity, importance, and attribution of symptoms. When patients believe they cannot afford treatment, they may be more likely to downplay the severity of their symptoms51.

Patients must first detect and interpret CRC symptoms as requiring medical attention before approaching the health system. When this does not happen, it is referred to as an appraisal delay51. Appraisal delay is defined as the time between when patients first detect their symptoms and when they first disclose them to a health professional. This delay is even more important when patients do not have health insurance and access to care is difficult.

In addition, other factors such as geographical distance and unfavourable economic conditions may also delay diagnosis. In a study done on a large panel of patients, the risk of death increased significantly when the interval from diagnosis to treatment exceeded 31 days52. We found that living in a rural area has a negative effect on CRC survival. This is closely related to the large distance between the health care centers and the places of residence in rural areas.

Moreover, a prevention strategy based on the individual risk of CRC should be systematically adopted by health professionals53. Primary prevention for the general population is important. However, in order to be more effective, it should be targeted at the high-risk population54. Our results show that people aged 30 years or older, living in rural areas, without health insurance, at a distant stage and who have not had surgery, constitute a subgroup of patients with poor prognosis. Therefore, screening programs should be offered systematically to those patients.

The good predictive performance of RSF is an interesting result, especially in the context of identifying prognostic factors. RSF are flexible and make no restrictive assumptions about the data at hand. Even though they are considered black-box models, the methodological development of interpretability techniques has made their results more intelligible. Therefore, they are a good alternative model for analyzing survival data for cancer studies55,56.

The RSF offers a fresh perspective and methodology in survival analysis, leveraging the power of ensemble machine learning. Unlike the classical Cox PH model, which assumes proportional hazards and relies heavily on predefined assumptions about variable relationships, RSF offers a more flexible approach. It inherently accounts for interactions between variables, non-linear effects, and complex relationships within the data without needing explicit specification. This adaptability and robustness make RSF particularly advantageous in handling intricate datasets where traditional methods might fall short. In this regard, the RSF model, being a tree-based ensemble non-parametric algorithm, can address the limitations of the Cox model while also identifying and ranking the most important variables affecting survival time57.

It is necessary to combine several methods to ensure consistency of results. This is the case in our study. We combined conventional methods such as the Kaplan–Meier estimator and the Cox PH model with the RSF method and the PDP interpretation approach. The agreement between the methods used confirms the low degree of sensitivity of our results15, even though, in our case, the assumptions of the Cox model were not met (see supplementary_file.docx). As a result, we cannot draw definitive conclusions based on the Cox model in our case; however, we utilized it as a reference, a baseline model, in order to shed light on the value of a data-driven approach. This comparison allows us to underscore the potential enhancements a machine learning methodology can offer, especially when handling the complexity and volume of data typically encountered in electronic health records (EHR). It also provides a framework for evaluating the degree to which advanced analytical techniques can complement or surpass traditional statistical models in the realm of prognostic modeling.

This study has certain limitations that warrant consideration. Primarily, the assumptions of the Cox PH model, central to our analysis, were not fully met, limiting the validity of the Cox model results. Additionally, the absence of a preliminary simulation study to validate the Cox PH model and RSF methods may limit the generalizability of our findings58. Furthermore, challenges related to the sample size and potential selection bias, inherent in observational studies using EHR, also affect our study. These factors could influence the representativeness of our results, necessitating a cautious interpretation and highlighting areas for future research improvement. Our study is the first in Morocco based on an innovative and robust statistical approach. The sample studied was therefore subjected to extensive data quality verification. Even though our data come from patient files, they were completed where necessary by an active follow-up, giving more precision for the endpoint.

It is clear that our cohort recruited at Hassan II University Hospital cannot claim to be representative of the Moroccan population, but this will be improved in our future studies, where we intend to design a multicenter study to reflect the profile of Moroccan CRC patients as much as possible. Furthermore, adding data on the genetic profile of the included patients is an excellent way to obtain very revealing results59.

Also, a clinical decision support platform for CRC is needed in order to make clinical information easily usable by practitioners. Unfortunately, we are limited by the retrospective nature of our study. Ideally, the physician should have access to such a tool at each suspicious consultation to predict a risk score of developing CRC in order to perform real-time clinical prevention. Besides, the risk of information bias is not totally discarded, especially for disadvantaged patients with difficulties accessing care.

Conclusion

Our research has highlighted that the RSF approach demonstrates better performance in scenarios where the assumptions of the Cox PH model are not valid. Thus, it is more appropriate to view RSF as a better option when the Cox PH model's assumptions are challenged, rather than as universally superior. This highlights the significance of choosing the right method based on the particular conditions and assumptions of each study.

Utilizing data-driven techniques not only streamlines the model-building process for researchers but also paves the way for the discovery of new predictors with substantial epidemiological importance. Intriguingly, these methods are not tethered to specific diseases, making them adaptable for studying conditions whose origins are not yet fully understood. Managing and addressing the multifaceted challenges of CRC requires more than just traditional therapeutic interventions. There's an imperative to embrace predictive and personalized medical strategies at policy-making levels for comprehensive and effective disease control.

To enhance the efficacy of such strategies, we must emphasize the preventable aspects of CRC and prioritize ensuring healthcare accessibility for the most susceptible sections of the population. Expanding health insurance coverage emerges as a pressing national requirement. Universal health coverage can significantly mitigate cancer risks by fostering an environment conducive to early detection and minimizing the financial burdens primarily affecting those with limited means.

From a methodological perspective, machine learning offers a transformative potential in prognostic studies. It facilitates the identification of complex relationships between predictors and outcomes, leading to intriguing epidemiological findings. In contrast, traditional methods often operate within deterministic boundaries.

In line with this research, the development of a clinical decision support system would be immensely beneficial. Such a system would empower clinicians to formulate prognoses in a more personalized and dynamic manner, aligning with the individualized needs of patients. By harnessing the insights from machine learning and predictive analytics, clinicians can make informed decisions, optimizing treatment pathways and potentially improving patient outcomes. This shift towards a data-driven, patient-centric approach is the future of medical prognosis and can redefine how we address the challenges of diseases like CRC in real-time clinical settings.