Background

Colorectal cancer

Colon cancer is the third most common cancer in both men and women, after prostate cancer in men, breast cancer in women, and lung cancer in both sexes. Colon cancer starts in the large intestine (colon), the final part of the digestive tract [1, 2]. It usually affects older adults, although it can occur at any age. It typically begins as small, noncancerous clumps of cells called polyps, which form inside the colon. Over time, some of these polyps can turn into colon cancers [3]. Colon cancer is sometimes called colorectal cancer, a term that merges colon cancer with rectal cancer, which begins in the rectum [4].

Recent research

In recent decades, researchers have used genomic signal processing (GSP) methods to solve a range of bioinformatics problems. This research can be characterized along five broad dimensions. First, the application: it may involve cluster analysis of deoxyribonucleic acid (DNA) sequences [5], breast cancer diagnosis and detection using the Wisconsin Diagnostic Breast Cancer database [6,7,8,9], cancer diagnosis and classification using DNA microarray technology [8, 10, 11], or classifying a gene sequence into a diseased/non-diseased state based on trinucleotide repeat disorders using DNA sequences [12].

Second, the mapping method applied to the DNA sequences may be the Voss representation [5, 12, 13] or the EIIP method [13, 14].

Third, the GSP algorithm used may be the discrete Fourier transform (DFT) [5, 13], power spectral density (PSD) [5, 13], discrete wavelet transform (DWT) [12], moment invariants [14], statistical parameters [12], or the fast Fourier transform [12].

Fourth, the classifier may be the k-means algorithm [5, 7], linear discriminant analysis or a support vector machine (SVM) [6, 7, 15], a naive Bayes (NB) classifier [7, 10], a deep convolutional neural network (CNN) [16, 17], a multilayer perceptron (MLP) [7, 9], an inception recurrent residual convolutional neural network [18, 19], a probabilistic neural network [8], a classification and regression tree (CART) [7], simple linear iterative clustering (SLIC), or an optimal deep neural network (ODNN) [20].

Finally, the evaluation method may be plotting [5], comparison [7], improved binary particle swarm optimization (iBPSO) [10], the area under the receiver operating characteristic curve [17, 18], or the calculation of sensitivity, specificity, and accuracy [17, 20].

In this study, a combination of these ideas was used, along with methods not listed above, such as using DNA sequences as the classification database and the k-nearest neighbor algorithm as a classifier. The block diagram in Fig. 1 depicts the study steps, which are explained in detail below.

Fig. 1 Block diagram of the presented method

Methods

Database sequence

A vital source of genomic data is the search and retrieval system of NCBI GenBank [21] at the National Institutes of Health. In this research, 55 healthy genes and 55 cancerous genes of the colon were used. Each DNA sequence has a length of 400 nucleotides. The following is an example of a cancer-data record read by the fastaread function in MATLAB R2017b:

Sequence = 'GCGATCGCCATGGCGGTGCAGCCGAAGGAGACGCTGCAGTTGGAGAGCGCGGCCGAGGTCGGCTTCGTGCGCTTCTTTCA…' (truncated)

Description = 'AB489153.1 Synthetic construct DNA, clone: pF1KB3091, Homo sapiens MSH2 gene for mutS homolog 2, colon cancer, nonpolyposis type 1, without stop codon, in Flexi system',
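As an illustration, a record such as the one above can be read in MATLAB as follows. This is a minimal sketch assuming a local FASTA file with the hypothetical name 'colon_sequences.fasta' (fastaread requires the Bioinformatics Toolbox):

    % Read all records from a FASTA file into a struct array
    % (the file name is a placeholder, not the study's actual file)
    data = fastaread('colon_sequences.fasta');   % fields: Header, Sequence
    firstSeq = data(1).Sequence;                 % 400-nucleotide character string
    disp(data(1).Header);                        % description line of the record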

Mapping method

The electron–ion interaction pseudopotential (EIIP) numerical method is the most common representation rule used by many researchers [22,23,24,25]. The EIIP numerical values for A, G, C, and T in a DNA string are 0.1260, 0.0806, 0.1340, and 0.1335, respectively. These values represent the free-electron energy distribution along the DNA sequence [26]. For example, if Y[n] = TATGGATCC, the corresponding EIIP numerical sequence, Ye[n], will be:

$$Y_{e}\left[n\right]=\left[0.1335\quad 0.1260\quad 0.1335\quad 0.0806\quad 0.0806\quad 0.1260\quad 0.1335\quad 0.1340\quad 0.1340\right]$$
(1)
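The mapping itself is straightforward. The following is a minimal MATLAB sketch of it; the function name eiipMap is illustrative, not taken from the original code:

    function Ye = eiipMap(seq)
        % Illustrative helper, not from the original study code.
        % Map a DNA character string onto its EIIP values.
        lut = zeros(1, 256);            % lookup table indexed by character code
        lut(double('A')) = 0.1260;
        lut(double('G')) = 0.0806;
        lut(double('C')) = 0.1340;
        lut(double('T')) = 0.1335;
        Ye = lut(double(upper(seq)));   % vectorized lookup, one value per base
    end

For example, eiipMap('TATGGATCC') reproduces the vector in Eq. 1.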

Genomic signal processing techniques

  1. Discrete Wavelet Transform

    DWT transforms a signal into a group of basis functions called wavelets, converting a discrete-time signal to its wavelet representation [27]. Various wavelets are available for the DWT, broadly divided into orthogonal and biorthogonal wavelets [28]. The orthogonal type was introduced by the Hungarian mathematician Alfréd Haar [29]. The Haar DWT of a signal (S) is generated by passing it through a bank of filters [30]. First, the signal is passed through a low-pass filter with impulse response (g), resulting in a convolution, as follows:

    $$F\left[m\right]=\left(S*g\right)\left[m\right]=\sum_{k=-\infty }^{\infty }S\left[k\right]g\left[m-k\right]$$
    (2)

    The signal is also passed through a high-pass filter (h). The result is two sets of components: the output of the high-pass filter gives the detail coefficients, and the output of the low-pass filter gives the approximation coefficients [31, 32]. The two filters shown in Fig. 2 are known as quadrature mirror filters and are related to each other.

    According to the Nyquist rule, half of the signal's frequencies are removed by each filter. The output of the low-pass filter in Fig. 2 is therefore downsampled by two and processed again through a new low-pass filter, g, and a new high-pass filter, h, each with half the cutoff frequency, as follows (a minimal code sketch of this decomposition is given after Fig. 2):

    $${F}_{low}\left[m\right]=\sum_{k=-\infty }^{\infty }S\left[k\right]g\left[2m-k\right]$$
    (3)
    $${F}_{high}\left[m\right]=\sum_{k=-\infty }^{\infty }S\left[k\right]h\left[2m-k\right]$$
    (4)
  2. Statistical Features

Fig. 2 Single-level 1D discrete wavelet transform
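To make the decomposition step concrete, the following is a minimal MATLAB sketch of the single-level Haar DWT applied to an EIIP-mapped sequence. It assumes the Wavelet Toolbox and uses the hypothetical eiipMap helper and data variable from the sketches above:

    % Single-level Haar DWT of an EIIP-mapped DNA sequence
    Ye = eiipMap(data(1).Sequence);   % numerical sequence (see the EIIP sketch)
    [cA, cD] = dwt(Ye, 'haar');       % cA: approximation (low-pass) coefficients
                                      % cD: detail (high-pass) coefficients
    % Both outputs are downsampled by two, as in Eqs. 3 and 4.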

After obtaining the DWT coefficients, some statistical features are extracted as follows:

Mean

The arithmetic mean, also called the average or the mathematical expectation, is the central value of a set of numbers [33, 34]. To calculate the mean value µ of a sequence S = [s1, s2, s3, …, sM] with length M, divide the sum of all sequence values by the length, as in the following equation:

$$\mu =\frac{\sum_{i=1}^{M}{s}_{i}}{M}$$
(5)

Variance

Variance (σ2) in probability theory and statistics is defined as the expectation of the squared deviation of a random variable from its mean. Informally, it quantifies how far a sequence of arbitrary numbers diverges from the mean value of the sequence [35]. It is determined by taking the difference between each number in the set and the mean, squaring the differences to make them positive, and dividing the sum of the squares by the number of values in the set, as follows:

$${\sigma }^{2}= \frac{\sum_{i=1}^{M}{({s}_{i}-\mu )}^{2}}{M}$$
(6)

where si is the ith data point, µ is the mean of all data points, and M is the number of data points.

Standard deviation

The standard deviation (σ) is the square root of the variance (σ2) [35].

$$\sigma =\sqrt{\frac{\sum_{i=1}^{M}{({s}_{i}-\mu )}^{2}}{M}}$$
(7)

where si is the ith data point, µ is the mean of all data points, and M is the number of data points.
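As a minimal sketch, Eqs. 5 to 7 can be computed in MATLAB as follows. Note that Eqs. 6 and 7 divide by M, so the population-normalized forms var(s, 1) and std(s, 1) are used rather than MATLAB's default sample (M − 1) normalization; the variable s is illustrative, e.g., the coefficients from the DWT sketch above:

    s  = cA;             % e.g., approximation coefficients from the DWT step
    mu = mean(s);        % Eq. 5
    v  = var(s, 1);      % Eq. 6: sum((s - mu).^2) / numel(s)
    sd = std(s, 1);      % Eq. 7: square root of the population variance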

Autocorrelation

Autocorrelation, also called serial correlation, is the correlation of a signal with a delayed copy of itself. It is a mathematical representation of the similarity between a given time series and a lagged version of itself over successive time intervals [36]. The method of calculation is the same as that used to compute the correlation between two different time series, except that the same time series is used twice: once in its original form and once in a lagged form [37]. The equation for the autocorrelation function is:

$${\rho }_{k}= \frac{\sum_{t=k+1}^{T}({r}_{t}-{\mu }_{r})({r}_{t-k}-{\mu }_{r}) }{\sum_{t=1}^{T}{({r}_{t}-{\mu }_{r})}^{2}}$$
(8)

where ρk are the autocorrelation coefficients, rt is a data set sorted by ascending date, rt-k is the same data set shifted by k units, and µr is the average of the original data set.
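A minimal MATLAB sketch of Eq. 8, written directly from the formula rather than using a toolbox function (the name autocorrLagK is illustrative):

    function rho = autocorrLagK(r, k)
        % Illustrative helper: lag-k autocorrelation coefficient of Eq. 8.
        mu  = mean(r);
        num = sum((r(k+1:end) - mu) .* (r(1:end-k) - mu));  % numerator of Eq. 8
        den = sum((r - mu).^2);                             % denominator of Eq. 8
        rho = num / den;
    end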

Entropy

Originally, Claude Shannon defined entropy as an aspect of his communication theory [38]. Shannon entropy provides vital information about repetitive sequences in whole chromosomes and is beneficial in finding evolutionary differences between organisms [39].

Shannon introduced the entropy, E, of a discrete random variable, Y, with possible values {y1, y2, y3, …, yn} and probability mass function M(Y), as illustrated in [40, 41]:

$$E\left(Y\right)=-\sum_{i=1}^{n}M\left({y}_{i}\right){\mathrm{log}}_{h}M\left({y}_{i}\right)$$
(9)

where h is the base of the logarithm used.
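A minimal MATLAB sketch of Eq. 9, with the probability mass function estimated from the relative frequencies of the unique values in the input, which is one reasonable reading of Eq. 9 (the name shannonEntropy is illustrative):

    function E = shannonEntropy(y, h)
        % Illustrative helper: Shannon entropy of Eq. 9 with logarithm base h.
        [~, ~, idx] = unique(y);                % map each value to its bin
        p = accumarray(idx(:), 1) / numel(y);   % empirical probabilities M(y_i)
        E = -sum(p .* (log(p) / log(h)));       % log base h via change of base
    end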

Skewness and kurtosis

In statistics, skewness is a measure of the asymmetry of the probability distribution of a variable around its mean; a symmetrical data set has a skewness of 0. It can be calculated as the average cubed deviation from the mean divided by the cubed standard deviation [42]. For data X1, X2, …, Xn, the equation for skewness, which represents the third moment, is as follows:

$$Skewness= \frac{\sum_{j=1}^{n}{({X}_{j}-\mu )}^{3}/n}{{\sigma }^{3}}$$
(10)

where σ is the standard deviation, µ is the mean, and n is the number of data points.

Skewness is used as a measure of a variable's asymmetry and deviation from the normal distribution. If the skewness value is greater than zero, the distribution is positively skewed (right-skewed), with most values located to the left of the mean. If the value is less than zero, the distribution is negatively skewed (left-skewed), with most values located to the right of the mean. For a value of zero (the mean equals the median), the distribution is symmetrical about the mean.

A misconception that has appeared in various reports, despite statisticians' efforts to set the record straight, is that kurtosis somehow measures the peakedness (flatness, pointiness, or modality) of a distribution. In statistics, kurtosis is a measure of the tailedness of a variable's probability distribution [43]; the kurtosis value relates to the heaviness of the distribution's tails, not its peak. For data X1, X2, …, Xn, the equation for kurtosis, which represents the fourth moment, is as follows:

$$Kurtosis= \frac{\sum_{j=1}^{n}{({X}_{j}-\mu )}^{4}/n}{{\sigma }^{4}}$$
(11)

where σ is the standard deviation, µ is the mean, and n is the number of data points.

The result is usually compared to the kurtosis of the normal (mesokurtic) distribution, which equals three. A distribution is called leptokurtic if its kurtosis value is greater than three; in this case, it has heavier tails than the mesokurtic distribution. A distribution is called platykurtic if its kurtosis value is less than three; it has lighter tails than the normal distribution.
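A minimal MATLAB sketch of Eqs. 10 and 11, written from the formulas, followed by the assembly of the seven statistical features into one vector. The autocorrelation lag (k = 1) and the entropy base (h = 2) are illustrative assumptions, not values stated in the text:

    X   = cA;                          % e.g., coefficients from the DWT step
    mu  = mean(X);
    sd  = std(X, 1);                   % population standard deviation (Eq. 7)
    skw = mean((X - mu).^3) / sd^3;    % Eq. 10: third standardized moment
    krt = mean((X - mu).^4) / sd^4;    % Eq. 11: fourth standardized moment

    % Seven-element feature vector for one DNA sequence
    % (lag 1 and base 2 below are assumptions for illustration)
    features = [mu, var(X, 1), sd, ...
                autocorrLagK(X, 1), shannonEntropy(X, 2), skw, krt];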

Classifier

In this research, two kinds of classifiers, k-nearest neighbors (KNN) and the support vector machine (SVM), were used, and their results were compared.

  1. K-nearest neighbors

    The KNN algorithm is a supervised, instance-based machine learning algorithm and one of the most widely used classification methods. Because KNN is a case-based (lazy) algorithm, it does not require an explicit learning step: it handles the training samples directly, using a distance function to compare items and a decision rule to separate the classes, and it bases its output on the categories of the closest neighbors [7, 8]. When a new item is to be classified, it is compared to the stored items using a similarity scale; the k nearest neighbors are then considered, with the distance between the new item and each neighbor used as the weight [44]. Various methods are used to calculate this distance. The most common is the Euclidean distance between two vectors yi and yj, which can be measured as stated in [45]:

    $$d\left({y}_{i},{y}_{j}\right)=\sqrt{\sum_{r=1}^{n}{({y}_{ir}-{y}_{jr})}^{2}}$$
    (12)

    The performance of the method depends on the value of K selected and the distance function used. The K value represents the number of neighbors considered when assigning the class of the new element.

  2. Support vector machines

In machine learning, SVMs are supervised learning models with associated algorithms that analyze data for detection and classification tasks [6]. An SVM constructs a hyperplane as a decision surface to separate the input data in a high-dimensional feature space. The hyperplane differentiates between the patterns of the different classes while maximizing the margin between them. Patterns are sets of points grouped so that distinct boundaries can separate the various categories; each point is assigned and classified according to which side of the boundary it falls on [7, 46]. This process yields a linear classifier, while the use of a kernel produces a nonlinear classifier [15].

Each algorithm was used separately for classification and provided parameters for comparison. In this study, 35 normal colorectal genes and 35 cancerous genes were used as training data, and the testing data included 20 normal colorectal genes and 20 cancerous genes.

MATLAB R2017b was used to perform the analysis. The fitcknn function was used to create the KNN classifier with the default number of neighbors, k = 1, while the fitcsvm function was used to generate the SVM.
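A minimal sketch of this classification step is given below. The feature matrices Xtrain (70-by-7) and Xtest (40-by-7) and the 0/1 label vectors ytrain and ytest are assumed to have been built as described above; these variable names are illustrative, not from the original code:

    % Train the two classifiers on the same features
    % (Xtrain, ytrain, Xtest, ytest are assumed/illustrative names)
    knnModel = fitcknn(Xtrain, ytrain, 'NumNeighbors', 1);  % Euclidean distance by default
    svmModel = fitcsvm(Xtrain, ytrain);                     % linear kernel by default

    % Predict the labels of the 40 test genes
    knnPred = predict(knnModel, Xtest);
    svmPred = predict(svmModel, Xtest);

    % Confusion matrix; for 0/1 labels the layout is [TN FP; FN TP]
    knnConfusion = confusionmat(ytest, knnPred);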

Results

Three important parameters were calculated to evaluate the performance of the proposed method: the Matthews correlation coefficient (MCC), the F1 score, and the accuracy (ACC). They can be estimated as follows:

$$ACC=\frac{TP+TN}{TP+TN+FP+FN} \times 100\%$$
(13)

(ACC: 0 is the worst value; 100 is the best)

$${F}_{1score}=\frac{2TP}{2TP+FP+FN}\times 100 \%$$
(14)

(F1 Score: 0 is the worst value; 100 is the best)

$$MCC=\frac{\left(TP\times TN\right)-\left(FP\times FN\right)}{\sqrt{\left(TP+FP\right)\left(TP+FN\right)\left(TN+FP\right)\left(TN+FN\right)}}$$
(15)

(MCC: − 1 is the worst value; + 1 is the best), where the four confusion-matrix entries (FP, FN, TP, and TN) stand for the false positive, false negative, true positive, and true negative counts, respectively.

The MCC is more informative than the ACC or the F1 score in evaluating the performance of a binary classifier because it takes into account the balance among the FP, FN, TP, and TN rates [47]. For example, consider a set of 200 elements in which 180 are positive and only 20 are negative. After applying the classifier, the following results are obtained:

FP = 20, TP = 180, FN = 0, TN = 0.

These inputs give an F1 score of 94.74% and an accuracy of 90%. Although these results look impressive, the MCC would be undefined: since FN and TN are both zero, the denominator of Eq. 15 is zero. The F1 score also depends on which class is designated positive, whereas the MCC does not [49].
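A minimal MATLAB sketch of Eqs. 13 to 15, applied to the worked example above; for these counts MATLAB evaluates the MCC as NaN (a 0/0 division), reflecting the undefined value:

    TP = 180; TN = 0; FP = 20; FN = 0;       % counts from the example above

    ACC = (TP + TN) / (TP + TN + FP + FN) * 100;     % Eq. 13 -> 90
    F1  = 2*TP / (2*TP + FP + FN) * 100;             % Eq. 14 -> 94.74
    MCC = (TP*TN - FP*FN) / ...
          sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN));     % Eq. 15 -> NaN (0/0)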

The extracted features were used as input to the KNN and the SVM separately, and the results of each were compared, using the training and testing data described in the Methods (35 normal and 35 cancerous colorectal genes for training; 20 of each for testing).

Table 1 shows the TP, FP, TN, and FN values obtained from the two classifiers.

Table 1 Results of the two classifiers

From the previous values, Table 2 can be created using Eqs. 13–15.

Table 2 Comparison of calculated parameters for the two classifiers

Discussion

The KNN algorithm identified 19 cancer genes and 20 normal genes out of a total of 20 each (TP = 19, FP = 1, TN = 20, and FN = 0), while the SVM network recognized 18 cancer genes and 20 normal genes (TP = 18, FP = 2, TN = 20, and FN = 0) (Table 1).

The results of both methods were satisfactory. KNN gave 97.5% accuracy, a 97.44% F1 score, and an MCC of 0.9512, while the SVM gave 95% accuracy, a 94.74% F1 score, and an MCC of 0.9045 (Table 2).

Higher ACC, F1 score, and MCC values are evidence that the classification process is more successful and the classifier more effective, indicating that the classifier can recognize the required target with minimal error. By these measures, the KNN classifier achieved the research purpose of differentiating between normal and cancerous colorectal genes using GSP methods.

The results indicate the success of using GSP methods for cancer recognition and diagnosis. Table 3 compares the results obtained in the current work with those of other studies according to the database used, method, classifier, and output.

Table 3 Comparison between different studies of classifications

From Table 3, the best accuracy obtained from the related studies was 96.7% [7], and this study reached 97.5% accuracy.

Conclusions

Many researchers worldwide have studied cancer in the hope of detecting the disease at an early stage and thereby reducing its risk, which often leads to death. The basic concept of the presented study is that cancer is considered a genetic disease. The EIIP method was used to convert the DNA sequences from strings into numerical values so that GSP could be applied in the feature extraction step, and suitable classifiers were selected. A single-level DWT was applied using Haar wavelets. Then, the statistical features mean, variance, standard deviation, autocorrelation, entropy, skewness, and kurtosis were obtained from the wavelet domain. Finally, the resulting values were input into the KNN and SVM classifiers. The KNN results were the best, with a low classification error, although the results of the SVM were also acceptable. An automated system was thus developed for the detection and classification of colorectal cancer with good results, avoiding the disadvantages of traditional methods, which include collecting a blood, urine, or stool sample from the patient and testing it in the laboratory; these take a long time, require experienced examiners, and carry a relatively high probability of error. In future work, other GSP features can be used and different classifiers chosen to improve the results.