1 Introduction

Resistance to the coronavirus is still a challenge because limited information is available about this virus. The available coronavirus protein sequences have encouraged research on the taxonomic classification of the COVID-19 virus [1]. Early differentiation of this virus helps to limit its massive spread, especially in rapidly growing populations. Many factors affect the probability that a patient develops the infection, such as age, hygienic behaviors, location, environment, and health status [2]. The World Health Organization (WHO) reported that the COVID-19 virus, which originated in China, can spread in many countries [3]. The WHO also confirmed that this virus, referred to as severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), emerged in bats and was transmitted to humans [4]. This means that exposure to specific wild environments may lead to more rapid infections. The infection risk of COVID-19 is controlled by strict quarantine schedules because there are no specific drugs or vaccines for this virus [5]. COVID-19 has had a major influence on the financial and social aspects of society, creating a critical need to reduce the spread of this virus. In general, analysis of viral protein sequences gives a good indication of which proteins are responsible for pharmacological interactions.

Moreover, classifying the virus protein sequences with machine learning algorithms supports treatment plans for COVID-19 [6]. Additionally, the National Center for Biotechnology Information (NCBI) [7] provides a list of genomic datasets targeting COVID-19 as a public resource to aid coronavirus variant analysis. To combat this virus, the genetic features of COVID-19 should be well understood. It is a single-stranded RNA virus with a diameter ranging from 65 to 125 nm and a genome of around 27–32 kb [8].

Generally, protein–protein interactions (PPI) are used to study how virus proteins replicate and interact [9]. The traditional methods for identifying PPI rely on laboratory tests such as tandem affinity purification [10], the yeast two-hybrid system [11], and mass spectrometry [12]. However, these tests suffer from high costs, long processing times, human errors, and restricted coverage. Computational algorithms have been sought as an alternative for predicting protein interactions from sequences, and computational prediction of PPI [13] is a more effective solution than traditional methods. Computational prediction models [14,15,16,17] have been developed to extract features from amino acids, with varying levels of accuracy. Recent studies showed that the conjoint triad (CT) method [18] achieves high prediction accuracy for PPI. The success of the CT representation comes from capturing the influence of neighboring residues by treating any three contiguous amino acids as a unit [19].

Zhang et al. [20] presented a protein sequence analysis for host identification of COVID-19 and studied the similarity between COVID-19 and HIV (human immunodeficiency virus). Randhawa et al. [6] applied machine learning algorithms to DNA sequences to classify COVID-19 as a beta coronavirus. Qiang et al. [21] proposed an observation model of COVID-19 protein sequences to study its genomic evolution via machine learning algorithms. Zhou et al. [22] suggested a PPI network for recognizing candidate drugs for COVID-19. The prediction of PPI for HIV, SARS, and the pandemic influenza A (H1N1) virus [23] from amino acid composition was shown in previous research [24]. Host-virus PPI identified by the co-immunoprecipitation method has also been shown in [25].

On the other hand, some researchers presented an effective model for determining mutation locations in the DNA sequences of this coronavirus [26]. Desautels et al. [27] suggested a machine learning–based prediction model of antibody structures capable of targeting the COVID-19 receptor-binding domain. Recently, Dey et al. [28] focused on predicting the interactions between COVID-19 and human proteins using machine learning approaches to encourage COVID-19 drug discovery.

The numbers of infected cases and the fatality rates associated with COVID-19 vary from country to country. Toyoshima et al. [29] noted that the reasons for these genomic variations among infected countries had not been explained. Also, temperature differences among countries [30] have been reported to drive mutations in COVID-19 protein sequences and have prompted researchers to study geographical variations from a bioinformatics perspective. Therefore, classifying COVID-19 protein sequences by country is a significant way to track the virus. Transmission of this virus across different countries might affect its potential for mutation and the severity of COVID-19. Each country has created a database of COVID-19 protein sequences, which is an important resource for virus recognition through classification algorithms. This means that the COVID-19 protein sequences fall into as many categories as there are endemic countries. Hence, COVID-19 treatment based on antivirals and vaccines can be adjusted according to virus recognition based on the affected country.

In this study, the proposed model applies the CT method to classify COVID-19 human protein sequences by country using machine learning techniques. This is an initial investigation of the recently recorded infection areas of COVID-19. The proposed model also explores the different forms of COVID-19 protein sequences and their relationships to mortality rates in 27 countries. This study aims to help control the spread of COVID-19 based on protein sequences and to discover other mutation types of this pandemic virus, serving as a tool to support drug production in the health sector.

2 Materials and methods

The proposed procedure consists of data preprocessing, data labeling, and classification algorithms that categorize COVID-19 protein sequences by country, as shown in the block diagram in Fig. 1.

Fig. 1 Block diagram of the classification of COVID-19 protein sequences based on the country

2.1 Dataset description

The dataset used for SARS-CoV-2 (COVID-19) was obtained from NCBI Virus [7]. It contains 9238 sequences; each sequence is identified by its accession number and comes with information such as the protein sequence in a FASTA file, the geographical location, and the protein sequence length. The dataset has two forms. In the first form, comma-separated values (CSV) files include the accession numbers of the protein sequences and other information, such as species, genus, family, length, sequence type, and geographical location (Geo_location), as shown in Fig. 2. The longest sequence in this dataset is QIX12193, which has 7098 amino acids. The shortest sequence is YP_009725312, which has 13 amino acids. In the second form, FASTA files contain the accessions and the protein sequences, as shown in Fig. 3. In the bioinformatics domain, FASTA is a text format, originating from the FASTA alignment software package, that represents protein sequences using single-letter codes, as described by the European Bioinformatics Institute (EBI). The FASTA format is widely used as input when searching for similarities between protein sequences.

Fig. 2 Data samples of COVID-19 protein sequences in CSV format that include some information such as accession number, release date, species, genus, family, length, sequence type, and geographical location for each protein sequence

Fig. 3 Data samples of COVID-19 protein sequences in FASTA format
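As a rough illustration of how the FASTA form of the dataset might be read, the sketch below parses FASTA text into a dictionary keyed by accession. The function name `parse_fasta`, the header layout, and the two sample records are hypothetical and are not taken from the dataset; real NCBI headers usually carry more fields after the accession.

```python
def parse_fasta(text):
    """Return {accession: sequence} from FASTA-formatted text.

    Assumes each record starts with a '>' header line and that the
    accession is the first token of the header (before any space or '|').
    """
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            accession = tokens[0].split("|")[0] if tokens else ""
            records[accession] = []
        elif accession is not None:
            # Sequence lines may be wrapped, so collect and join them later.
            records[accession].append(line)
    return {acc: "".join(parts) for acc, parts in records.items()}


# Made-up example records (illustrative only).
sample = """>ACC0001 |hypothetical protein A
MKTAYIAK
QRQISFVK
>ACC0002 |hypothetical protein B
MESLVPGF
"""
sequences = parse_fasta(sample)
```

Joining the parsed sequences with the CSV metadata would then be a lookup by accession number.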

2.2 Data preprocessing stage by conjoint triad (CT) method

Generally, protein sequences are composed of twenty amino acids. The frequency of every three adjacent amino acids is treated as a feature (a triad). Thus, \(20^3 = 8000\) features can be extracted from each protein sequence. Such raw data are difficult to analyze with machine learning approaches [31]. In the data preprocessing phase, the conjoint triad (CT) method [18] classifies the amino acids by their side chains into seven classes ({AGV}, {ILFP}, {YMTS}, {HNQW}, {RK}, {DE}, {C}). The CT method transforms the amino acids in each sequence into numbers according to the volumes and dipoles of their side chains. Most PPI approaches [32,33,34] use the CT method to reduce the number of extracted features to \(7^3 = 343\) according to the seven classes.

Mathematically, consider a protein sequence \(P\) of length \(L\), \(P = P_1 P_2 P_3 \cdots P_L\). The adjacent triples of amino acids in \(P\) are \(P_1P_2P_3\), \(P_2P_3P_4\), \(P_3P_4P_5\), ..., \(P_{L-2}P_{L-1}P_L\). The CT feature of the protein sequence is then the vector of normalized triad frequencies, defined as follows.

$$\mathrm{CT}={\left[{f}_{1},{f}_{2},{f}_{3},\ldots ,{f}_{343}\right]}^{T}$$
(1)

where \(f_i\) is the normalized frequency of the \(i\)-th triad.
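A minimal sketch of how the 343-element CT vector of Eq. 1 might be computed is given below. The text does not specify the ordering of \(f_1, \ldots, f_{343}\) or the normalization, so two assumptions are made here: triads are ordered lexicographically by class number, and counts are divided by the number of counted triads. Triads containing a residue outside the seven classes are skipped, which is also an assumption.

```python
from itertools import product

# Seven conjoint-triad classes as listed in the text.
CT_CLASSES = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_CLASS = {
    aa: idx for idx, group in enumerate(CT_CLASSES, start=1) for aa in group
}

# Assumed fixed ordering of the 7^3 = 343 possible triads of class numbers.
TRIADS = list(product(range(1, 8), repeat=3))
TRIAD_INDEX = {triad: i for i, triad in enumerate(TRIADS)}


def ct_features(sequence):
    """Return the 343-element list of normalized triad frequencies."""
    classes = [AA_TO_CLASS.get(aa) for aa in sequence.upper()]
    counts = [0] * len(TRIADS)
    total = 0
    # Slide a window of three adjacent residues along the sequence.
    for triad in zip(classes, classes[1:], classes[2:]):
        if None in triad:
            continue  # assumption: skip triads with unclassified residues
        counts[TRIAD_INDEX[triad]] += 1
        total += 1
    if total == 0:
        return [0.0] * len(TRIADS)
    # Assumed normalization: divide by the number of counted triads.
    return [c / total for c in counts]


features = ct_features("ALGCERQSK")
```

For the nine-residue example, seven distinct triads occur once each, so seven entries equal 1/7 and the rest are zero.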

The volumes and dipoles of the amino acid side chains were calculated with the Gaussian03 system [35] using molecular modeling and density-functional theory approaches. Gaussian03 [35] is computational chemistry software that avoids the cost of traditional computations. The functions of the Gaussian03 system are as follows:

  • (1) Calculation of one- and two-electron integrals over any general contracted Gaussian functions. The basis functions can either be Cartesian Gaussians or pure angular momentum functions, and a variety of basis sets are stored in the program and can be requested by name.

  • (2) Transformation of the atomic orbital (AO) integrals to the molecular orbital basis.

  • (3) Molecular mechanics calculations.

  • (4) Correlation energy calculations using Møller-Plesset perturbation theory with high accuracy.

  • (5) Acceleration of density functional theory (DFT) calculations.

This CT method converts the protein sequences to numbers from zero to seven. The classification of the amino acids based on their side-chain dipoles and volumes is shown in Table 1.

Table 1 Amino acid classification according to dipoles and volumes using the CT method

The symbols used in Table 1 are explained as follows:

  • Dipole scale: (−), dipole < 1.0; (+), 1.0 < dipole < 2.0; (++), 2.0 < dipole < 3.0; (+++), dipole > 3.0; (+′+′+′), dipole > 3.0 with opposite orientation

  • Volume scale: (−), volume < 50; (+), volume > 50

  • Cysteine (C) amino acids are separated from class 3 because of their ability to form disulfide bonds

As shown in Table 2, a new amino acid class, numbered zero, was added to the coding for amino acids labeled (X). The protein sequence “ALGCERQSKXTP” would be represented as “121765435032” according to the eight amino acid classes. A sample of COVID-19 protein sequences after conversion is displayed in Fig. 4. The CT method thus encodes each protein sequence through the frequency distribution of three contiguous amino acids drawn from a 7-letter alphabet, which reduces the number of amino acid characters.

Table 2 Amino acid coding according to eight classes

Fig. 4 Protein sequences after conversion using eight amino acid classes

After converting the protein sequences to numbers, an array of eight elements is filled with the count of each amino acid class for each sequence in the dataset. Finally, the frequencies are normalized by dividing each one by the length of the sequence to avoid bias toward longer sequences.

2.3 Data labeling stage

The classification of protein sequences by country is implemented on the COVID-19 dataset. The country names and their code numbers are listed in Table 3. After city names are removed from the geographical location field, 27 unique countries remain in this dataset.

Table 3 Code numbers of countries

Two methods are used to prepare these countries as labels for the classifier:

  • Each country has a number from 0 to 26, and each record has only one number as a label according to its country. A sample of the labeled sequences is displayed in Table 4.

  • Each record has an array of 27 binary elements in which 1 represents its country; otherwise, it is zero, as shown in Table 5.
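The two labeling methods above can be sketched as below. The country list is made up for illustration, and codes are assigned in alphabetical order, which is an assumption; the actual code numbers are those of Table 3.

```python
def build_country_codes(countries):
    """Map each distinct country to a code from 0 to n-1.

    Assumption: codes follow alphabetical order of the country names.
    """
    return {name: code for code, name in enumerate(sorted(set(countries)))}


def one_number_labels(countries, codes):
    """One integer label per record."""
    return [codes[c] for c in countries]


def binary_array_labels(countries, codes):
    """One binary array per record, with a single 1 at the country's code."""
    labels = []
    for c in countries:
        row = [0] * len(codes)
        row[codes[c]] = 1
        labels.append(row)
    return labels


# Made-up records; the real dataset has 27 countries.
records = ["USA", "China", "USA", "India"]
codes = build_country_codes(records)
numeric = one_number_labels(records, codes)
binary = binary_array_labels(records, codes)
```

With 27 countries, the first method yields labels from 0 to 26 and the second yields arrays of 27 binary elements.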

Table 4 Samples of the one number format

Table 5 Samples of the binary array format

2.4 Data division

After the data labeling stage, all labeled data are randomly divided into a training set and a testing set in a proportion of 8:2. In machine learning algorithms, the training–testing ratio is a significant factor in the performance level [36]. If the features and the label are highly correlated, a training–testing ratio of 50%–50% can be used, meaning that half of the data is used for training and half for testing. If there is a risk of poor performance, the training proportion may be increased. In previous research, the training–testing ratio varies according to the data structure and is chosen by trial and error, changing the ratio until a high accuracy rate is achieved [37]. Through many experiments, it was found that 80% for training and 20% for testing achieved a high accuracy rate in the proposed model. Ratios above or below this one were not preferred because they adversely affected the test results. Subsequently, the classifiers were trained on the training data, and the final results were calculated on the testing data.
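A random 8:2 division might look like the sketch below, written with the standard library only. The function name `split_dataset`, the fixed seed, and the toy data are assumptions for illustration; the text does not state which splitting routine was used or whether the split was stratified by country.

```python
import random


def split_dataset(samples, labels, test_fraction=0.2, seed=0):
    """Randomly split paired samples and labels into training and testing sets."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is repeatable
    n_test = round(len(indices) * test_fraction)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([samples[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([samples[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test


# Toy data standing in for feature vectors and country codes.
samples = [[i / 10] for i in range(10)]
labels = [i % 3 for i in range(10)]
(train_x, train_y), (test_x, test_y) = split_dataset(samples, labels)
```

With strongly unbalanced classes, a purely random split can leave rare countries out of the testing set, so a stratified split may be worth considering.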

2.5 Classification stage

After converting the COVID-19 protein sequences to numbers belonging to eight amino acid classes, six different classifiers were applied in turn to predict the country of each protein sequence. Each amino acid class was considered a feature in this classification model. The one number format and the binary array format are the two methods of labeling the predicted class (country). The six classifiers are linear regression (LR) [38], K-nearest neighbor (KNN) [39] with various numbers of neighbors, support vector machine (SVM) [40] with different kernel functions, naive Bayesian (NB) [41], decision tree (DT) [42], and random forest (RF) [43] with different numbers of estimators. The main objective of the classifier structure is to reduce learning complexity so that unknown samples can be examined accurately.

2.5.1 Machine learning algorithms

These six machine learning algorithms are applied to the problem of classifying COVID-19 protein sequences by country as follows:

  1. The LR classifier [38] models the relationship between the dependent and independent variables with a straight line. The best-fitting line is called the regression line and is represented by the linear equation Y = a * X + b, where Y is the dependent variable, a is the slope, X is the independent variable, and b is the intercept. The values a and b are obtained by minimizing the sum of squared distances between the data points and the regression line.

  2. The KNN classifier [39] relies on distance measurements rather than on decision boundary calculations. Each sample is assigned the majority class among its K closest neighbors. In this paper, K values of 20, 50, 150, and 201 were tested to find the K giving the most accurate results.

  3. The SVM classifier [40] plots each data item as a point in n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate; the support vectors are the data points closest to the decision boundary. In this paper, the SVM classifier computed decision boundaries with four kernel functions (linear, sigmoid, polynomial, and radial basis function (RBF)) to choose the most effective function for discrimination.

  4. The NB classifier [41] is based on Bayes' theorem and maximum likelihood calculation. This classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature.

  5. The DT classifier [42] separates the data into groups according to low entropy measures. It splits the data into two or more homogeneous sets, based on the most significant attributes (independent variables), to make the groups as distinct as possible.

  6. The RF classifier [43] combines a large number of decision trees, and its performance depends on the correlation between any two trees. The forest chooses the classification having the most votes (over all the trees in the forest). If the number of cases in the training set is N, then a sample of N cases is taken at random but with replacement. This sample is the training set for growing the tree. If there are M input variables, a number m ≪ M is specified such that at each node, m variables are selected at random out of the M and the best split on these m is used to split the node. The value of m is held constant during forest growth. Each tree is grown to the largest extent possible. There is no pruning.
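To make the majority-vote rule of the KNN classifier concrete, a small standard-library sketch follows. Euclidean distance and the toy two-feature points are assumptions for illustration; the text does not state which distance measure or implementation was used.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k):
    """Predict the label of `query` by majority vote among its k nearest
    training points, using Euclidean distance (an assumption)."""
    neighbors = sorted(
        ((math.dist(x, query), label) for x, label in zip(train_x, train_y)),
        key=lambda pair: pair[0],
    )
    votes = Counter(label for _, label in neighbors[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors with two classes (illustrative only).
train_x = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (1.0, 1.0), (0.9, 1.0), (1.0, 0.9)]
train_y = [0, 0, 0, 1, 1, 1]

near_origin = knn_predict(train_x, train_y, (0.05, 0.05), k=3)
near_corner = knn_predict(train_x, train_y, (0.95, 0.95), k=3)
```

Large K values, such as those tested in this paper, smooth the vote but also pull predictions toward the most frequent class when the data are unbalanced.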

2.5.2 Performance metrics

The performance metrics for the proposed model are based on a multi-label confusion matrix and receiver operating characteristic (ROC) curve to evaluate COVID-19 classification algorithms according to each country’s protein sequence.

The multi-label confusion matrix is a common way to record results in multi-class classification problems because it exposes the relations between the classifier outputs and the true classes. For all classification algorithms, the percentages of accuracy, sensitivity, and specificity derived from the multi-label confusion matrix describe the performance of each classifier. Accuracy is the ratio of correctly classified sequences to the total number of sequences in the protein data. Sensitivity is the proportion of positives that are correctly identified. Specificity is the proportion of negatives that are correctly rejected. These three metrics are given by Eqs. 2–4.

$$\mathrm{Accuracy}=\frac{TN+TP}{TN+TP+FN+FP}$$
(2)
$$\mathrm{Sensitivity}=\frac{TP}{TP+FN}$$
(3)
$$\mathrm{Specificity}=\frac{TN}{TN+FP}$$
(4)

where TP is true positive (correctly classified), TN is true negative (correctly rejected), FP is false positive (incorrectly classified), and FN is false negative (incorrectly rejected).
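Eqs. 2–4 can be evaluated per country by treating one country as the positive class and all others as negative. This one-vs-rest treatment is an assumption, since the text does not state how the counts were aggregated over the 27 countries; the labels in the example are made up.

```python
def one_vs_rest_counts(y_true, y_pred, positive):
    """Count TP, TN, FP, FN with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Return (accuracy, sensitivity, specificity) following Eqs. 2-4."""
    accuracy = (tn + tp) / (tn + tp + fn + fp)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return accuracy, sensitivity, specificity


# Made-up true and predicted country codes.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
counts = one_vs_rest_counts(y_true, y_pred, positive=1)
accuracy, sensitivity, specificity = classification_metrics(*counts)
```

Here class 1 gives TP = 2, TN = 3, FP = 1, and FN = 0, so the accuracy is 5/6, the sensitivity is 1, and the specificity is 3/4.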

In the proposed model, the results cover the accuracy, sensitivity, and specificity rates of the six algorithms under the two labeling methods, to assess the usefulness of the prediction model for COVID-19 according to country. This is a preliminary study to verify the accuracy of COVID-19 prediction from protein sequences. The performance is calculated as the average of the accuracy rates obtained from each classification run. The accuracy rate is therefore a critical indicator of training, testing, and classification success. In machine learning algorithms, accuracy is expected to decrease as the share of testing data increases [36].

The ROC curve [37] is another way to estimate the classification performance of machine learning algorithms. It plots the true positive rate (sensitivity) as a function of the false positive rate (1 − specificity) for different cut-off points. Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.
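The construction of the curve can be sketched by sweeping the decision threshold over the classifier scores of a single one-vs-rest problem. The scores and labels below are made up, and the function names are illustrative; the sketch assumes that both positive and negative samples are present.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) points obtained by
    lowering the threshold over all distinct scores.

    `labels` holds 1 for positive and 0 for negative samples.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Made-up classifier scores and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
points = roc_points(scores, labels)
auc = area_under_curve(points)
```

For this toy example, 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.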

3 Experimental results

The dataset is separated into 80% for training and 20% for testing to perform the COVID-19 classification by country in a Python computational environment. The training–testing ratio is selected according to data correlation and performance criteria to achieve high algorithm accuracy [36]. It was found that the performance criteria can be maximized when the training data is greater than the testing data [37].

The CT method [18] is used to encode protein sequences as numbers according to seven classes, which serves as feature extraction. Each feature group is evaluated by tenfold cross-validation. Two methods of class labeling are used for predicting the COVID-19 country. The six machine learning approaches are compared comprehensively on the COVID-19 protein sequences. The adjustable factor is the K value for the KNN classifier, the kernel function for the SVM classifier, and the number of estimators for the RF classifier. Across twelve configurations of the six classifiers, the classification results for the one number labeling method are given in Table 6.
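Tenfold cross-validation can be sketched as a partition of sample indices into ten folds, each used once for testing. How the folds were formed, and how cross-validation relates to the 80%/20% division, are not specified in the text; the shuffling, the seed, and the function name `k_fold_indices` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded so folds are repeatable
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(25, k=10))
```

The reported score would then be the average of the metric over the ten test folds.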

Table 6 Performance comparison of different classifiers for the one number labeling method

For the one number labeling method, the best results were obtained by the DT classifier, with an accuracy of 79.5%, a specificity of 82%, and a sensitivity of 89%. The worst classifier, NB, achieved an accuracy of 13.2%, a specificity of 45%, and a sensitivity of 30%.

Across twelve configurations of the six classifiers, the results of COVID-19 protein sequence–assisted detection for the binary labeling method are given in Table 7. For the binary array method, the best classifier, linear SVM, achieved an accuracy of 100%, a sensitivity of 100%, and a specificity of 90%.

Table 7 Performance comparison of different classifiers for the binary array labeling method

For the binary array labeling method, the linear SVM, KNN, and LR classifiers achieved an accuracy of 100%, and the linear SVM classifier also achieved a specificity of 100%. The best sensitivity, 100%, was obtained by the LR, KNN (K ≥ 201), and SVM (linear, polynomial) classifiers. The worst classifier for the binary array labeling method, KNN with K = 20, achieved an accuracy of 93.4%, a specificity of 81%, and a sensitivity of 89%.

Additionally, the best K value for the one number labeling method is K ≥ 50, with 77.7% accuracy. The best K value for the binary array labeling method is K ≥ 201, with 100% accuracy. The choice of K has only a slight impact on the classification accuracy for the one number labeling method but a strong impact for the binary labeling method.

In the SVM classifier, the best kernel function for the one number labeling method was polynomial, with 78.2% accuracy. The best kernel function for the binary array labeling method was linear, with 100% accuracy. The SVM classifier’s accuracy is altered only by changing the kernel functions for the binary labeling method.

Figure 5 displays the ROC curve for the best classifier for two labeling methods. The area under the ROC curve (AUC) is a measure of how well a parameter can distinguish between classes. It can be noted that the binary array labeling method is an effective method for the classification of COVID-19 protein sequences according to the country.

Fig. 5 Average ROC curve for the best classifier. a DT classifier based on the one number labeling method. b Linear SVM classifier based on the binary array labeling method

Fig. 6 displays the frequency of each of the eight amino acid classes in the COVID-19 protein sequences, computed over all samples in the dataset (9238 sequences). This figure shows that COVID-19 protein sequences are richest in amino acids of the second class, which consists of isoleucine (I), leucine (L), phenylalanine (F), and proline (P).

Fig. 6 COVID-19 samples with respect to amino acid classes

4 Discussion

The analysis of virus genome sequences is an important tool for early diagnosis and for identifying infected geographical areas. Coronavirus [44] has the longest genome among RNA viruses and is categorized into four genera: alpha, beta, gamma, and delta. The alpha and beta viruses are known to infect humans. Bats are a natural host of coronaviruses but not the only animal able to transmit the infection to humans. The WHO reported that SARS-CoV-2 is a bat coronavirus that attacks human cells. COVID-19 is the novel disease caused by SARS-CoV-2, which is leading to increased morbidity and mortality rates in several countries. The explosive growth of data and its availability enable new analyses of COVID-19 by machine learning algorithms.

This paper investigated state-of-the-art machine learning algorithms for addressing pandemic challenges related to COVID-19 protein sequences. The main objective of this paper is to classify COVID-19 protein sequences according to infected countries. The approach can also be used for forecasting COVID-19 sequences in different countries. Altogether, the results revealed that the binary array labeling method with a linear SVM classifier had the highest performance for the classification of COVID-19 protein sequences from the 27 countries.

The limitations of the current work are related to the unbalanced distribution of COVID-19 protein sequences among countries. The recorded data for COVID-19 protein sequences are not sufficient for the classification process or for early prediction of COVID-19 patients according to the infection area. Therefore, COVID-19 protein sequence data are required for all countries. Countries that have reported COVID-19 cases should provide a database of virus genome sequences in public sources to encourage primary research.

Also, the number of features needs to be chosen carefully if the model is applied to a very large number of countries and cities. In particular, it was found that the fatality rates of COVID-19 differ between regions or cities in the same country [45]. In the future, this point will encourage researchers to classify COVID-19 protein sequences by region or city within the same country.

So far, the mortality rates related to COVID-19 vary among countries according to many factors such as age distribution, virus genomic types, and genetic backgrounds. The reasons for the high mortality rates of COVID-19 in the USA are not clear. It was noted that each country has a unique structure of COVID-19 protein sequence. Thus, a genomic COVID-19 database can be created for every country and used for COVID-19 diagnosis.

Another limitation of the current work is that it uses only accuracy, sensitivity, and specificity as evaluation metrics for COVID-19 classification. Therefore, the evaluation metrics for classifying COVID-19 protein sequences should be expanded in the future.

In the future, analysis of COVID-19 genomic sequence should extend to the classification between COVID and other infectious sequences as well as a study of genomic mutations in a 3D protein structure to support the clinical trials to find drugs. Finally, the fight against the COVID-19 pandemic with scientific weapons will continue by gathering a huge database for this virus, developing the contributions of machine learning algorithms to the COVID-19 sequences. To explore this COVID-19 virus, there are main targets including therapeutics, diagnosis, and prevention that need more attention from researchers around the world.

5 Conclusion

In this paper, a dataset of 9238 COVID-19 protein sequences was used to evaluate the proposed model’s capability to predict the country of the protein sequences. The proposed model extracted suitable features from the protein sequences by replacing the amino acid characters in each sequence with the normalized frequencies of the eight amino acid classes. After that, the model applied six different classifiers to predict the country of the virus protein sequence.

This model indicated that the classification using the binary array labeling method has more promising results than the one number labeling method. Hence, the preliminary proposed model elucidated the utility of the country binary array labeling method for discriminating COVID-19 protein sequences within an appropriate time using a linear SVM classifier.

Better classification accuracy was observed for US protein sequences than for those of other countries because of their large number of records, approximately 7020 of the 9238 records (76% of all occurrences). It is worth noting that the unbalanced data lower the quality of the COVID-19 classification results. This imbalance leads to increased misclassifications on the testing set because the COVID-19 sequences of all countries except the USA account for only 24% of all sequences. Furthermore, the classifiers were biased toward US records compared with countries with a small number of occurrences.

Finally, the proposed model could be extended to large stores of COVID-19 protein sequences and could then be used to develop prediction algorithms for unbalanced sequences across several countries. The proposed algorithm will also be extended to calculate further evaluation criteria, such as precision, to validate the classification of COVID-19 protein sequences according to country.