Introduction

Bioinformatics [13] has emerged as a forefront research area in the recent past because biological data are accumulating at an accelerating rate. In particular, the number and sizes of genome databases have grown rapidly over the last few years. One of the most important problems is automatically determining the group to which a previously unseen genome sequence belongs [4].

Classifying organisms from their genomic data into groups within a taxonomical hierarchy has several applications, including identification of an unknown organism, study of evolutionary characteristics, and study of the mutual relationships between organisms [5]. More than a million organisms have been discovered so far, and a large number remain to be discovered. A systematic study of an organism is possible only once it has been assigned to a particular group. Thus genome identification finds wide application in evolutionary studies of organisms. Classification and species identification have also been associated with practical applications such as bio-diversity studies [6], forensic investigations [7] and food and meat authentication [8], to name a few.

Pattern matching can be considered as either exact matching or approximate matching for sequential data [9]. In exact sequence pattern matching problems, we aim to find a substring in text T that is exactly the same as the searching pattern P. In real world biological applications, most of the sequences are “similar” instead of exactly the same. Most fundamental operations like repeat pattern mining, similarity between two sequences, sequence alignment, etc., can be modeled as searching for given “patterns” in a “text.” However, exact searching is of little use for this application, since the patterns rarely match the text exactly. Thus searching in sequence repositories often requires going beyond exact matching to determine the sequences which are similar [10, 11]. This gave a motivation to “search allowing errors” or approximate match. Approximate matching/fuzzy matching is the finding of the most similar match of a particular pattern within a sequence.

Existing approaches extract dinucleotide or trinucleotide composition from the sequences using exact pattern matching and use it as features for a classifier. In most real world biological problems, exact matching may not give the desired results because sequences are similar rather than identical, so the patterns rarely match the sequences exactly. A further problem with n-grams is the exponential growth in the number of features as n increases, and many n-gram algorithms are computationally too expensive [4].

This paper considers the identification of an organism from its genomic database. We propose an approach for genome identification based on approximate pattern matching. Because genome data are very large, we sample the data into different sizes. Given a database of randomly selected samples of genomic sequences, our proposed work includes two algorithms: an algorithm for finding fuzzy occurrences based on Levenshtein distance, and an algorithm for finding the total number of fuzzy matching patterns by varying the candidate length so as to allow both positive and negative tolerance from the genome data sequences. These fuzzy matching patterns are used as features for a classifier. Since the candidate length is varied to allow both positive and negative tolerance, the length and number of subsequences used as features for the classifier also change. Classification has been done using the data mining techniques naïve Bayes, support vector machine, backpropagation and nearest neighbor. Experimental results are reported for 100 randomly selected samples (size varying from 2,000 to 10,000 bp) from each of the complete genomes of Yeast and E. coli. To extract fuzzy matching patterns from genome sequences, we used a query of length 10 and allowed tolerance from 10 to 70 %. The proposed model is tested separately for fuzzy matching patterns extracted at each fault tolerance, and the classification accuracies are monitored. The experimental results vary according to the tolerance allowed as well as the sampling/sequence size.

The article is arranged as follows. In Sect. 2 we give some background information. Related work is explained in Sect. 3. In Sect. 4, proposed approach is explained. The experimental results obtained by using genome data of two species namely Yeast and E. coli are explained in Sect. 5. Conclusion section summarizes the results.

Background

Relevant terms

Genome

A genome is the complete genetic material of an organism. Its size is generally given as its total number of base pairs [12].

Base pair

A base pair consists of two nitrogenous bases (adenine and thymine or guanine and cytosine) held together by weak bonds. Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs [13].

Base sequence

Base sequence is the order of nucleotide bases in a DNA molecule [13].

Nucleotide

Nucleotide is a subunit of DNA or RNA consisting of a nitrogenous base (adenine, guanine, thymine, or cytosine in DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate molecule, and a sugar molecule (deoxyribose in DNA and ribose in RNA). Thousands of nucleotides are linked to form a DNA or RNA molecule [13].

n-Gram

An n-gram is a subsequence of n characters extracted from a larger sequence. For a given sequence, the set of n-grams is obtained by sliding a window of n characters over the whole sequence. The window moves one character at a time, and with each movement a subsequence of n characters is extracted. This process is repeated for all the analysed sequences. The n-grams can be represented in binary form [14, 15] or as dinucleotide [16] or trinucleotide [17, 18] frequencies.

Example: consider the generation of 3-grams, represented in binary form, from the following two sequences

  • Seq1: AVADEK

  • Seq2: QAVALGYVS

For n = 3, a total of 11 motifs are extracted from seq1 and seq2 by the sliding window procedure. Sequence 1 results in 4 motifs, i.e., AVA, VAD, ADE and DEK, while sequence 2 gives 7 motifs: QAV, AVA, VAL, ALG, LGY, GYV and YVS. Out of these 11 motifs, 10 are distinct (the motif AVA occurs in both sequences). These 10 distinct motifs are used as attributes/features to construct a binary table where each row corresponds to a sequence. The presence or absence of an attribute in a sequence is denoted by 1 or 0 as shown in Table 1.

Table 1 n-Gram based sequence encoding in binary form
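
The sliding-window extraction and binary encoding described above can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation; the function names `ngrams` and `binary_encoding` are hypothetical, and the columns are ordered alphabetically, which is an assumption about the table layout.

```python
def ngrams(seq, n):
    """Return all overlapping n-grams, sliding the window one character at a time."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def binary_encoding(sequences, n):
    """Return the distinct n-grams (sorted) and one 0/1 row per sequence."""
    per_seq = [set(ngrams(s, n)) for s in sequences]
    features = sorted(set().union(*per_seq))
    table = [[1 if f in grams else 0 for f in features] for grams in per_seq]
    return features, table


features, table = binary_encoding(["AVADEK", "QAVALGYVS"], 3)
print(features)
for row in table:
    print(row)
```

Each row marks with 1 the motifs present in the corresponding sequence; the two example sequences share only the motif AVA.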

Dinucleotide composition

DNA sequences are usually long sequences consisting of only four characters: A, T, C and G. The dinucleotide composition is the frequencies of “AC”, “TG”, “AA”, “TT”… set of 16 subsequences of length two [16].

Example: consider seq1 given as AATGAATC. The 2-grams that can be generated from the sequence are AA, AT, TG, GA, AA, AT, TC. Hence the dinucleotide frequencies for the example sequence can be represented as in Table 2.

Table 2 n-Gram based sequence encoding in dinucleotide frequencies
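
As a small sketch of how such a frequency vector could be computed, the Python snippet below counts overlapping 2-grams and reports all 16 dinucleotides, including those with zero count. The function name `dinucleotide_counts` and the A, C, G, T column order are illustrative assumptions.

```python
from collections import Counter
from itertools import product


def dinucleotide_counts(seq):
    """Count overlapping 2-grams and return counts for all 16 dinucleotides."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    all_pairs = ["".join(pair) for pair in product("ACGT", repeat=2)]
    return {pair: counts[pair] for pair in all_pairs}


counts = dinucleotide_counts("AATGAATC")
print({pair: c for pair, c in counts.items() if c})
```

For the example sequence this gives AA = 2, AT = 2, GA = 1, TC = 1 and TG = 1, with the remaining 11 dinucleotides at zero.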

Data mining

Data mining, or knowledge discovery from data refers to the process of extracting interesting, non-trivial, implicit, previously unknown and potentially useful information or patterns from data [19]. Mining of sequence data has many real world applications [19]. Transaction history of a bank customer, product order history of a company, performance of the stock market [20] and biological DNA data [21] are all sequence data, where sequence data mining techniques are applied.

Soft computing

Soft computing is a collection of techniques in artificial intelligence, which can be used to handle uncertainty, imprecision and partial truth. The guiding principle is to provide the computation method that leads to an approximate solution at low cost, thereby speeding up the process. Fuzzy sets, which constitute the oldest component of soft computing, are suitable for handling issues related to understandability of patterns, incomplete/noisy data and can provide approximate solutions faster [22].

Related work

Patil et al. [14] proposed a method for species identification based on approximate pattern matching. The novelty of the work was the feature extraction technique for genome data. The existing n-gram based methods extract frequencies of \(4^{n}\) features from genome data. In [14] the authors extracted all candidates/subsequences that satisfy three conditions: length greater than or equal to a given minimum length, a given number of mismatches, and support greater than or equal to a user threshold. These frequent subsequences are used as features to construct a binary table where the presence or absence of an attribute/feature in a sequence is represented by 1 or 0, respectively. Classification of genome sequences has been done using the data mining techniques naive Bayes, support vector machine and k-nearest neighbor. In the experiments, data mining techniques with approximate patterns showed better results. In that work very short sequences were analyzed, and all the frequent subsequences represented in binary form were used as features for the classifier.

In [16], a set of 16 kinds of dinucleotide compositions was used to analyze the protein-encoding nucleotide sequences in nine complete genomes: Escherichia coli, Haemophilus influenzae, Helicobacter pylori, Mycoplasma genitalium, Mycoplasma pneumoniae, Synechocystis sp., Methanococcus jannaschii, Archaeoglobus fulgidus, and Saccharomyces cerevisiae. The dinucleotide composition was significantly different between the organisms. The distribution of genes from an organism was clustered around its center in the dinucleotide composition space. The genes from closely related organisms such as Gram-negative bacteria, mycoplasma species and eukaryotes showed some overlap in the space. The genes from nine complete genomes together with those from humans were discriminated into respective clusters with 80 % accuracy using the dinucleotide composition alone.

Classification of organisms into two classes, Bacteria and Archaea, based on the dinucleotide frequencies in their DNA sequences using a naive Bayesian approach was discussed in [23]. The methodology is based on scanning all genomes for the occurrences of all possible overlapping motifs with a length of n nucleotides. Then a genomic sequence is chosen at random from anywhere inside a genome. From this genomic sequence, all overlapping motifs are extracted. The naive Bayesian classifier uses the extracted motifs to predict their most probable genomic origin by comparing the frequencies of the extracted motifs with the motif frequencies of the different genomes. The average accuracy obtained was 85 %.

Protein classification into domains of life was attempted and the test protein was predicted to be of bacterial or eukaryotic origin with 85 % accuracy using a Markov model for compositional bias analysis [24].

Norashikin et al. [25] used 2-gram encoding, which is the frequency of occurrence of two consecutive amino acids in a protein sequence, to investigate the effect of the spread factor value on cluster separation in the growing self-organizing map (GSOM). They used the simple k-means algorithm to identify clusters in the GSOM. Using the Davies–Bouldin index, clusters formed by different values of the spread factor were obtained and the resulting clusters of protein sequences were analyzed.

Andrija et al. [4] have addressed the problem of automated classification of isolates, i.e., the problem of determining the family of genomes to which a given genome belongs and also the problem of automated unsupervised hierarchical clustering of isolates, according only to their statistical substring properties. For both of these problems, they presented novel algorithms based on nucleotide n-grams, with no required preprocessing steps such as sequence alignment.

In [26] an approach for genome data clustering based on approximate matching is proposed. The work includes finding the total number of approximate matches to a query with a specified fault tolerance in the genome data sequences. The number of matches is used as a feature value for clustering. Clustering has been done using the soft computing techniques Fuzzy C-means (FCM) and Possibilistic C-means (PCM), and the results are compared with a hard clustering technique, K-means. Experimental results are reported for 100 randomly selected samples (size 10,000 bp) from two different complete genome data sets, namely Yeast and E. coli, and Drosophila and Mouse. It is also verified that the method outperforms the existing n-gram frequency based method for both data sets. Overall performance comparison of the clustering techniques shows that FCM performed better than K-means and PCM for both data sets. PCM is also a good technique for genome data clustering, except for the undesirable effect of coincident centroids formed at lower tolerances.

Narasimhan et al. [17] designed a scheme for automatic identification of a species from its genome sequence. A set of 64 3-tuple keywords was first generated using the 4 types of bases A (for Adenine), T (for Thymine), C (for Cytosine) and G (for Guanine). These keywords were searched for in a randomly sampled genome sequence of a given length (10,000 elements), and the frequency count for each of the \(4^{3} = 64\) keywords was determined to obtain a DNA-descriptor. The process was repeated for N such sampled genome sequences, and then Principal Component Analysis was employed on the frequency counts for the N sampled instances to obtain a unique feature descriptor which identifies the species from its genome sequence. It was shown that the feature descriptors were effective representatives of the structural signature of the species. No quantitative measures of accuracy were reported.

Narasimhan and coworkers [18] also proposed an alternative approach to automatic classification and identification of species using a self-organizing feature map. The computational map was trained by using the DNA-descriptors (frequency count for each of the \(4^{3} = 64\) keywords from N sampled genome sequences of length 10,000) of different species as the training inputs. The maps for different dimensions were constructed and analyzed for optimum performance. The scheme presented a novel method for identifying a species from its genome sequence with the help of a two dimensional map of neuron clusters, where each cluster represents a particular species. The map has been shown to provide an easier technique for recognition and classification of a species based on its genomic data. Once again, however, no quantitative measures of accuracy were reported.

Wei et al. [27] attempted classification of several DNA sequences of bacteria using the artificial neural network model. The “dinucleotides compositions” method was used to characterize the DNA sequences which transform every DNA sequence to a 16-dimension vector. A back-propagation artificial neural network was developed and trained using “leave-one-out” method. Results showed that the accuracy of classification was 84.3 %, which proved that the model was satisfactory in summary. However, the author stated that the applicability of the characterization strategy needs to be improved to reflect the features of the DNA sequences.

All the works mentioned above take DNA sequences as input for classification. The possibility of species classification using CGR images of DNA sequences, with different distance metrics [28] and with neural networks [29], has also been investigated. In both of these works, only species identification is done, for a few different species. A detailed classification problem, addressing 6 categories in the taxonomical hierarchy of eukaryotic organisms, using a combination of FCGR and a naive Bayesian approach has also been attempted [30]. The average classification accuracy is reported as 85.63 %.

Thus, the drawbacks of existing approaches for genome identification are as follows:

  1. Existing methods extract dinucleotide composition or trinucleotide composition from the sequences using exact pattern matching. In most real world biological problems, exact matching may not give desired results because the sequences are similar. Thus searching in sequence repositories often requires going beyond exact matching [11].

  2. In n-gram frequency based classification, the number of features (\(4^{n}\) for DNA sequences and \(20^{n}\) for protein sequences) grows exponentially with n, and classifier performance is sometimes hampered by the many redundant features in the dataset [31]. Further, algorithms with n-grams are computationally too expensive [4].

  3. Sequence alignment algorithms and techniques for estimating homologies and mismatches among DNA sequences, which are used for comparing sequences of relatively small sizes, are not applicable to sequences with sizes varying between a few thousand base pairs and a few hundred thousand base pairs. Even for comparison of small sequences, the standard alignment and matching algorithms are known to be time consuming. There is a need for procedures that may be somewhat approximate in nature, yet useful in producing quick and significant results. To fill this gap, the authors of [17, 18] clustered complete genome data of Yeast and E. coli by sampling it into sequences of 10,000 elements and choosing N such random samples. However, no quantitative measures of accuracy are reported.

In the present work, we propose a novel approach for genome identification based on approximate pattern matching. Given a database of randomly selected samples of genomic sequences, our proposed work extracts the total number of fuzzy matching patterns with a given fault tolerance. These totals are used as features for a classifier. Classification of genome sequences has been done by data mining techniques, and the number of subsequences used as features for the classifier depends on the tolerance allowed. Experimental results are reported for randomly selected samples from complete genome data of Yeast and E. coli. The proposed approach is compared with the n-gram sequence encoding method in binary form, which resulted in a highest accuracy of 53 %, whereas the proposed approach based on approximate matching resulted in a highest accuracy of 99 % with 70 % tolerance and a sampling size of 10,000 bp.

Methods

Our proposed approach for species identification from genome data is shown in Fig. 1. First, we sample the complete genome data of the species into samples of 2,000–10,000 bp and then randomly choose 200 such samples. In the next step we extract approximate matches for a given query with a specified fault tolerance, varying the candidate length so as to allow both positive and negative tolerance from the given sequences. The total numbers of approximate matches are used as features for classification algorithms, and then the results are analyzed. Figure 2 shows the extraction of the total number of approximate matching patterns from a genome sequence database with a query pattern of length p and percentage of fault tolerance allowed k.

Fig. 1 Proposed approach for genome data classification based on approximate matching

Fig. 2 Extraction of total number of approximate matches from genome samples
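
The sampling step can be sketched as below. The text does not state how start positions are drawn or whether samples may overlap, so uniform random start positions with possible overlap are an assumption here, and the name `sample_genome` and its parameters are hypothetical.

```python
import random


def sample_genome(genome, sample_size, n_samples, seed=0):
    """Draw n_samples substrings of length sample_size at random start positions."""
    if sample_size > len(genome):
        raise ValueError("sample_size exceeds genome length")
    rng = random.Random(seed)          # fixed seed only for reproducibility
    last_start = len(genome) - sample_size
    starts = [rng.randint(0, last_start) for _ in range(n_samples)]
    return [genome[s:s + sample_size] for s in starts]


# Toy stand-in for a real genome string.
toy_genome = "ACGT" * 100
samples = sample_genome(toy_genome, sample_size=50, n_samples=5)
print([len(s) for s in samples])
```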

In the next section, two algorithms namely, algorithm for finding approximate occurrences based on Levenshtein distance and algorithm for finding total number of approximate matches by varying candidate length so as to allow both positive and negative tolerance from the genome data sequences are explained. Finally, we explain classification techniques used in our method.

Algorithm for finding approximate occurrences

Given a pattern P, a text string T \( \left( {\left| P \right| = m\;{\text{and}}\;\left| T \right| = n} \right) \) and fuzziness factor k, the task is to find all positions j in T such that there exists a suffix of \( T\left[ {1 \ldots ,j} \right] \) that has edit distance of less than k with P. We first define:

$$ D\left( {i,j} \right) = {\text{edit distance between }}P\left[ {1 \ldots i} \right]{\text{ and }}T\left[ {1 \ldots j} \right]. $$

D(i, j) will hold the Levenshtein distance between the first i characters of P and the first j characters of T for all i, j.

We then use the following recursion to compute the table \( D\left( {i,j} \right) \) for all i and j:

$$ \begin{aligned} D\left( {i,0} \right) & = i \\ D\left( {0,j} \right) & = j \\ \end{aligned} $$
$$ D\left( {i,j} \right) = { \hbox{min} }\left\{ \begin{aligned} & D\left( {i - 1,j} \right) + 1 \\ & D\left( {i,j - 1} \right) + 1 \\ & D\left( {i - 1,j - 1} \right) + \left( {{\text{if }}P_{i} = T_{j} ,\quad {\text{then }}0\,{\text{else }}1} \right) \\ \end{aligned} \right. $$
(1)

The positions of interest are those j’s for which \( D\left( {m,j} \right) < \, k. \)
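
The recursion in Eq. 1 can be written directly as a dynamic-programming table. The following Python sketch is an illustration under the boundary conditions stated above (D(i, 0) = i and D(0, j) = j); the function name `levenshtein` is chosen here for convenience.

```python
def levenshtein(p, t):
    """Levenshtein distance between strings p and t using the table D(i, j)."""
    m, n = len(p), len(t)
    # D[i][j] holds the distance between the first i characters of p
    # and the first j characters of t.
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i
    for j in range(n + 1):
        D[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p[i - 1] == t[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # match or substitution
    return D[m][n]


print(levenshtein("AGCG", "AGCT"))  # 1
```

The table needs O(mn) time and space; keeping only the previous row reduces the space to O(n) when only the final distance is required.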

Algorithm for finding total number of approximate matches for a specified fault tolerance from sequence database

Given complete genome data of length l, sample size s. Sample the complete genome data into N number of sequences of size s. Now, given a sequence database with N number of sequences, query pattern of length p and percentage of fault tolerance allowed k, we want to find, for every sequence in the database, the total number of candidate patterns that will approximately match the query pattern by varying candidate length, so as to allow both positive and negative tolerance (at most k). The algorithm is given below.

1 for all sequences in database do step 2.
  2 for all candidate lengths from (p − p*k/100)
    to (p + p*k/100) do steps 3–8
    3 Generate candidates of the length chosen in step 2.
    4 Count = 0.
    5 Find the Levenshtein distance between candidate pattern and query pattern.
    6 If distance ≤ p*k/100, then increment count.
    7 Repeat steps 5 and 6 for all generated candidates.
    8 Output count as total number of approximate matches.
  End for
End for

In step 2, we vary the candidate length so as to allow both positive and negative tolerance, i.e., if the query pattern length is p and the percentage of tolerance allowed is k, then the candidate length is varied from (p − p*k/100) to (p + p*k/100). For example, if the query length is 10 and the tolerance allowed is 50 %, then the candidate length is varied from (10 − 10*50/100) to (10 + 10*50/100), i.e., from 5 to 15.

In the next step, candidates of length as in step 2 are generated. A candidate is a subsequence extracted from a larger sequence. For a given sequence, the set of the candidates with length n [in our approach n is from (p − p*k/100) to (p + p*k/100)] can be generated by sliding a window of n characters on the whole sequence. This movement is carried out character by character. With each movement a subsequence of n characters is extracted.

Once candidates are generated, we match each candidate to the query by using Levenshtein distance. If the distance is less than or equal to the specified tolerance, then the candidate pattern is counted as an approximate match. This process is repeated for the entire database. We select a query pattern that occurs frequently in the given database of sequences.

Once all the approximate matching patterns are extracted, we build a feature table where each row corresponds to a sequence and each column corresponds to a candidate length from (p − p*k/100) to (p + p*k/100). Therefore, the number of columns/features depends on the fault tolerance allowed. The value in each column for a particular sequence is the number of approximate matches among the candidates of that length. This feature table is called a learning context. It represents the result of the preprocessing step and the new sequence encoding format. In the mining step, classification algorithms are applied to the learning context to assign genome data to different groups.

Example 1

Consider the sequence Seq1: AGCTTGCAAT

Let the query be AGCG of length 4 and tolerance allowed is 50 %. Therefore, candidate length varies from 2 to 6. To encode the given sequences, we first generate candidates and then find the Levenshtein distance between each candidate and query as shown in Table 3. Finally the number of approximate matches within the given tolerance is counted.

Table 3 Candidates generated and their respective distance with query AGCG for the example

Approximate matches within the given tolerance are marked as bold in Table 3. Total approximate matches for candidates generated with length 2 (L2) for the given query and for the given tolerance are 3. Similarly total approximate matches for candidates with L3, L4, L5 and L6 are 4, 2, 1 and 1, respectively. Table 4 shows the sequence encoding/feature table in which a row indicates sequence and every column value for the sequence indicates total approximate matches.

Table 4 Sequence encoding for the given example
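
The counting procedure and Example 1 can be reproduced with the short Python sketch below. It is an illustration rather than the authors' code: the function names are hypothetical, and the distance threshold is taken as p*k/100 rounded down, an assumption that agrees with the counts reported for Example 1.

```python
def levenshtein(a, b):
    """Levenshtein distance using a rolling row of the dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def approximate_match_counts(sequence, query, tolerance_pct):
    """Return {candidate_length: number of windows within the allowed distance}."""
    p = len(query)
    max_dist = p * tolerance_pct // 100  # assumption: tolerance rounded down
    counts = {}
    for length in range(p - max_dist, p + max_dist + 1):
        windows = (sequence[i:i + length]
                   for i in range(len(sequence) - length + 1))
        counts[length] = sum(1 for w in windows
                             if levenshtein(w, query) <= max_dist)
    return counts


print(approximate_match_counts("AGCTTGCAAT", "AGCG", 50))
# {2: 3, 3: 4, 4: 2, 5: 1, 6: 1}
```

The returned dictionary is one row of the feature table: each key is a candidate length (L2 to L6 in the example) and each value is the corresponding total of approximate matches.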

Algorithms used for classification

We have used four classifiers namely back propagation (BP), naive Bayes (NB), support vector machine (SVM) and K-nearest neighbor (KNN). These classifiers are briefly explained below (Table 5).

Table 5 Confusion matrix

Back propagation

Figure 3 shows the structure of the back-propagation neural network model. The artificial neural network (ANN) model we designed includes three layers: one input layer, one hidden layer and one output layer. The input layer has as many nodes as there are candidates/subsequences for a specified tolerance; the number of nodes in the hidden layer is determined in the training process; the output layer has two nodes, which represent the kind of genome data, i.e., E. coli and Yeast. The back-propagation algorithm with a momentum term was used in training the ANN model [32]. During training, the predicted output is compared with the desired output, and the error is calculated. If the error is more than a prescribed limiting value, it is back propagated from output to input, and the weights are further modified until the error falls within the prescribed limit or the maximum number of iterations is reached.

Fig. 3 A multilayer feed-forward neural network

The general rule for updating weights is:

$$ \Delta w_{ji} = \eta \delta_{j} o_{i} $$
(2)

η is a positive number (called the learning rate), which determines the step size in the gradient descent search. A large value enables back propagation to move faster toward the target weight configuration, but it also increases the chance of never reaching this target. \( o_{i} \) is the output computed by neuron i. \( \delta_{j} = o_{j} \left( {1 - o_{j} } \right)\left( {T_{j} - o_{j} } \right) \) for the output neurons, where \( T_{j} \) is the desired output of neuron j, and \( \delta_{j} = o_{j} \left( {1 - o_{j} } \right)\sum\nolimits_{k} {\delta_{k} w_{kj} } \) for the internal neurons. In our experiment, the learning rate of the model is 0.3, the coefficient of the momentum term is 0.2, and the number of iterations is 500.
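
The update rule in Eq. 2 and the two error terms can be sketched in Python as follows. The function names are hypothetical, and the momentum form (adding the momentum coefficient times the previous update) is the common textbook variant, assumed here because the text does not spell out how the momentum term enters the update.

```python
def output_delta(o_j, t_j):
    """Error term of an output neuron with sigmoid activation."""
    return o_j * (1.0 - o_j) * (t_j - o_j)


def hidden_delta(o_j, downstream):
    """Error term of a hidden neuron; downstream is a list of (delta_k, w_kj)."""
    return o_j * (1.0 - o_j) * sum(d_k * w_kj for d_k, w_kj in downstream)


def weight_update(eta, delta_j, o_i, alpha=0.0, previous_update=0.0):
    """Delta-rule update; alpha * previous_update is the assumed momentum term."""
    return eta * delta_j * o_i + alpha * previous_update


# Illustrative values only: learning rate 0.3 and momentum 0.2 as in the text,
# neuron outputs and targets made up for the example.
d_out = output_delta(o_j=0.5, t_j=1.0)
print(weight_update(eta=0.3, delta_j=d_out, o_i=1.0, alpha=0.2, previous_update=0.0))
```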

Naive Bayes

The simple naive Bayes (NB) algorithm [33] is used in this study. The main advantage of Bayesian classifiers is that they are probabilistic models, robust to real data noise and missing values [34]. In addition, they have advantages in terms of simplicity, learning speed, classification speed and storage space [35]. Naive Bayes is a simplified application of Bayes' theorem that is used to classify unknown instances into the relevant class. The posterior probability of each class is calculated from the attribute values associated with each tuple.

$$ p\left( {C_{i} \mid v_{1} ,v_{2} , \ldots ,v_{n} } \right) = \frac{{p\left( {C_{i} } \right)\prod\nolimits_{j = 1}^{n} {p\left( {v_{j} \mid C_{i} } \right)} }}{{p\left( {v_{1} ,v_{2} , \ldots ,v_{n} } \right)}} $$
(3)

Equation 3 gives the posterior probability of class \( C_{i} \) given the attribute values \( \langle v_{1}, v_{2}, \ldots, v_{n} \rangle \). Learning with the naive Bayes classifier involves estimating the probabilities on the right-hand side of Eq. 3 from the training tuples.
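
As a small numerical sketch of Eq. 3, the Python function below combines class priors with per-attribute likelihoods and normalizes by the evidence. The function name and all probability values are hypothetical and chosen only to make the arithmetic easy to follow; in practice the likelihoods would be estimated from the training tuples.

```python
from math import prod


def naive_bayes_posterior(priors, likelihoods):
    """priors: {class: p(C)}; likelihoods: {class: [p(v_1|C), ..., p(v_n|C)]}."""
    joint = {c: priors[c] * prod(likelihoods[c]) for c in priors}
    evidence = sum(joint.values())  # p(v_1, ..., v_n), summed over classes
    return {c: joint[c] / evidence for c in joint}


# Made-up probabilities for two classes and two attributes.
posterior = naive_bayes_posterior(
    priors={"E. coli": 0.5, "Yeast": 0.5},
    likelihoods={"E. coli": [0.8, 0.5], "Yeast": [0.2, 0.5]},
)
print(posterior)
```

The predicted class is the one with the largest posterior; since the evidence is the same for every class, it can be omitted when only the ranking matters.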

Support vector machine (SVM)

The SVM is a supervised classification algorithm that learns by example to discriminate among two or more given classes of data. Given a training set in a vector space, SVMs find the best decision hyper plane that separates two classes. The quality of a decision hyper plane is determined by the distance between two hyper planes defined by support vectors. The best decision hyper plane is the one that maximizes this margin. SVM extends its applicability on the linearly non-separable data sets by either using soft margin hyper planes or by mapping the original data vectors into a higher dimensional space in which the data points are linearly separable [36]. The mapping to higher dimensional spaces is done using appropriate kernel functions.

For a binary classification problem, assume that we have a series of feature vectors \( x_{i} \) and class labels \( y_{i} \) (i = 1, 2, …, N, where N is the number of samples), where \( x_{i} \in R^{n} \) and \( y_{i} \in \left\{ { - 1, + 1} \right\} \). The SVM requires the solution of the following optimization problem [37]:

$$ \mathop {\hbox{min} }\limits_{w,\,b,\,\xi } \; \frac{1}{2}w^{T} w + C\sum\limits_{i = 1}^{N} {\xi_{i} } $$
(4)

subject to \( y_{i} (w^{T} \phi (x_{i} ) + b) \ge 1 - \xi_{i} ,\;\xi_{i} \ge 0. \)

Here, the feature vectors \( x_{i} \) are mapped into a higher dimensional space by the function ϕ(x). Then the SVM constructs an optimal separating hyper plane (OSH), which maximizes the margin in the higher dimensional space. C > 0 is the penalty factor of the error term. Furthermore, \( K(x_{i} ,x_{j} ) = \phi (x_{i} )^{T} \phi (x_{j} ) \) is called the kernel function.

There are several typical kernel functions. In this work, we have adopted SVM with polynomial kernel function. The polynomial kernel has strong generalization ability [38].

Polynomial kernel function: \( K\left( {x,y} \right) = \left( {x \cdot y + 1} \right)^{p}.\)
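
The polynomial kernel can be evaluated directly from the dot product of two feature vectors, without constructing the higher dimensional mapping explicitly. The sketch below is illustrative only; the function name and the default degree p = 2 are assumptions, since the degree used in the experiments is not stated.

```python
def polynomial_kernel(x, y, p=2):
    """K(x, y) = (x . y + 1)^p for two equal-length feature vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return (sum(a * b for a, b in zip(x, y)) + 1) ** p


print(polynomial_kernel([1, 2], [3, 4], p=2))  # (3 + 8 + 1)^2 = 144
```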

K-nearest neighbor (KNN)

The KNN classification algorithm assumes that all instances correspond to points in an n-dimensional space. The nearest neighbors of an instance are determined by a distance/similarity measure. When a new sample arrives, a KNN classifier searches the training dataset for the k samples closest to the new sample according to the distance/similarity measure. These k samples are known as the k nearest neighbors of the new sample. The new sample is assigned the most common class among its k nearest neighbors. KNN is a good choice of classifier when simplicity and accuracy are important [39, 40]. In the nearest neighbor model, the choice of a suitable distance function and of the number of nearest neighbors (k) is crucial. The value of k represents the complexity of the nearest neighbor model: the model is less adaptive with higher k values. We used the Euclidean distance measure with k = 1 in our experiment; these are the default parameters in WEKA, and they resulted in the highest accuracy for the experimental data used.
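
A nearest neighbor classifier with Euclidean distance is short enough to sketch in full. The code below is an illustrative stand-in, not the WEKA implementation used in the experiments; the function name and the toy feature vectors are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_X, train_y, query, k=1):
    """Return the majority label among the k training points closest to query."""
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


# Toy feature vectors (e.g., match counts per candidate length); values made up.
train_X = [(0, 0), (1, 1), (10, 10)]
train_y = ["E. coli", "E. coli", "Yeast"]
print(knn_predict(train_X, train_y, (9, 9), k=1))  # Yeast
```

With k = 1 the prediction follows the single closest training sample; larger k smooths the decision by majority vote.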

Results and discussion

In our experiment, we used complete genome data of two different species, namely the bacterium Escherichia coli (E. coli) [41] and Saccharomyces cerevisiae (Yeast) [42, 43]. The E. coli data were downloaded from NCBI [44], and the total length is 4,639,675 bp. The total length of the complete genome data of Yeast is 12,136,020 bp (with mitochondrial genome). Since the complete genome data of a species are very large, the data are sampled into different sizes. We classified sequences of five different lengths, 2,000, 4,000, 6,000, 8,000 and 10,000 bp, and monitored the classification accuracy. In each case, the proposed model is tested with a total of 200 samples, of which 100 samples are from E. coli and 100 samples are from Yeast.

The experiments were done on an Intel Pentium 4 processor-based machine with a clock frequency of 2.66 GHz and 1 GB RAM. In the classification process we use k-fold cross-validation, in which the data are randomly partitioned into k subsets (folds) of approximately equal size. Training and testing are performed k times, and each time one of the subsets is held out in turn. The classifier is trained on the remaining k − 1 subsets to build a classification model, and the classification error of the iteration is calculated by testing the model on the held-out set. Finally, the k error counts are summed to yield an overall error estimate. At the end of cross-validation, every sample has been used exactly once for testing.
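
The partitioning and error summation can be sketched as follows. The number of folds is not stated in the text, so k = 10 in the usage lines is only an example, and the function names and the `train_and_test` callback are hypothetical.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k folds of nearly equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(n_samples, k, train_and_test):
    """Sum the error counts returned by train_and_test(train_idx, test_idx)."""
    folds = k_fold_indices(n_samples, k)
    total_errors = 0
    for held_out in folds:
        train = [i for fold in folds if fold is not held_out for i in fold]
        total_errors += train_and_test(train, held_out)
    return total_errors


folds = k_fold_indices(200, 10)
print([len(f) for f in folds])
```

Here `train_and_test` stands for any routine that fits a classifier on the training indices and returns the number of misclassified held-out samples.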

We used the following classifiers from the WEKA workbench [45]: SVM, NB, KNN and a multilayer neural network with BP. We generated and tested the classification models and report the classification accuracies (the rate of correctly classified sequences).

To extract approximate matching patterns from genome sequences, we used a query of length 10 and allowed tolerances of 10–70 %. The proposed model was tested separately on the fuzzy matching patterns extracted at each tolerance, and the classification accuracies were monitored. The experimental results vary with the tolerance allowed as well as with the sampling/sequence size.

Tables 6, 7, 8, 9 and 10 show the performance of the different classifiers for each sample size (sequence length) at the specified fault tolerance. The same query of fixed length 10 is used in all the experiments. The results show that classification accuracy increases with fault tolerance as well as with the sampling size of the sequences. The highest accuracy obtained at each sample size is marked in bold in Tables 6, 7, 8, 9 and 10. At a sampling/sequence size of 10,000 bp and an allowed tolerance of 50 %, the classification accuracy achieved is 98.5 % by BP and 96.5 % by the other classifiers, i.e., NB, SVM and KNN.

Table 6 Accuracy (in %) of different classifiers with 10 % tolerance
Table 7 Accuracy (in %) of different classifiers with 20 % tolerance
Table 8 Accuracy (in %) of different classifiers with 30 % tolerance
Table 9 Accuracy (in %) of different classifiers with 40 % tolerance
Table 10 Accuracy (in %) of different classifiers with 50 % tolerance

Tables 11, 12, 13, 14 and 15 show a detailed performance comparison. For every classification technique in these tables, the confusion matrix column contains four values: the upper left is the number of true positives and the upper right the number of false negatives; similarly, the lower left is the number of false positives and the lower right the number of true negatives. The confusion matrix is used to calculate the accuracy, sensitivity and specificity of a classifier. At a sampling/sequence size of 10,000 bp and an allowed tolerance of 50 %, the Kappa value is 0.97 for BP and 0.93 for the other classifiers, i.e., NB, SVM and KNN. Under the same conditions, the area under the curve (AUC) for BP is 0.999, the largest of the four; NB, SVM and KNN have AUCs of 0.996, 0.965 and 0.965, respectively.
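The measures derived from the confusion matrix can be computed as in the following sketch, which uses the standard definitions of accuracy, sensitivity, specificity and Cohen's Kappa. The function name `confusion_metrics` and the four counts in the usage example are hypothetical; the counts were chosen only so that they sum to 200 balanced samples and are not taken from the tables.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Accuracy, sensitivity, specificity and Cohen's Kappa from a 2x2 confusion matrix."""
    total = tp + fn + fp + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    # Agreement expected by chance, from the row and column totals.
    p_expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / total ** 2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "kappa": kappa,
    }

# Hypothetical counts for 100 positive and 100 negative samples.
for name, value in confusion_metrics(tp=98, fn=2, fp=1, tn=99).items():
    print(f"{name}: {value:.3f}")
```

When the two classes are balanced, as with 100 samples per species, the chance agreement is 0.5 and Kappa reduces to 2 × accuracy − 1. This is consistent with the reported pairs of 98.5 % accuracy with Kappa 0.97 and 96.5 % accuracy with Kappa 0.93.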

Table 11 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 10 % tolerance
Table 12 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 20 % tolerance
Table 13 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 30 % tolerance
Table 14 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 40 % tolerance
Table 15 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 50 % tolerance

Tables 16 and 17 show the performance comparison of the classification methods at 60 and 70 % tolerance, respectively, with a sampling size of 10,000 bp. At 60 % tolerance, the classification accuracy achieved is 98.5 % by BP and NB, 98 % by SVM and 97 % by KNN. At 70 % tolerance, all the classification methods used in the experiment achieve the highest classification accuracy of 99 %. The AUC for NB and BP is 1 at 70 % tolerance, which corresponds to a model that ranks the two classes perfectly.

Table 16 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 60 % tolerance and sample size 10,000 bp
Table 17 Confusion matrix, accuracy, sensitivity, specificity, AUC and Kappa of four classification methods at 70 % tolerance and sample size 10,000 bp

Effect of tolerance on classification accuracy

The proposed model was tested separately with tolerances from 10 to 70 %. When the allowed tolerance is only 10 %, the maximum allowed distance is 1, and since our query pattern is of length 10, we varied the candidate length (candidates are the subsequences used as features for the classifier) from 9 to 11. For candidates of length 9, all 9 characters must match the query pattern. Similarly, for candidates of length 10 and 11, any 9 and any 10 characters, respectively, must match the query pattern. Hence, for a given sequence, few candidates match the query, i.e., the total number of fuzzy matches is small. Only 3 features are available to the classifier (candidate lengths 9 to 11). As a result, all the classifiers used in our work give low accuracy.

When the allowed tolerance is 20 %, since our query is of length 10, we varied the candidate length from 8 to 12, giving 5 features for classification. A tolerance of 20 % means that the maximum allowed distance in matching a candidate to the query is 2. For candidates of length 8, all 8 characters must match the query pattern. Similarly, for candidates of length 9, 10, 11 and 12, any 8, any 8, any 9 and any 10 characters, respectively, must match the query pattern. Hence the number of fuzzy matches between the candidates and the query increases, and so does the number of features available to the classifier. Accordingly, our results show an increase in classification accuracy compared with 10 % tolerance. The maximum classification accuracy achieved at 20 % tolerance with a sample/sequence size of 10,000 bp is 66.5 % by SVM; NB, BP and KNN give 66, 63.5 and 60.5 % accuracy, respectively, at the same sample size.

When the tolerance allowed is 30 and 40 %, the number of features is 7 (lengths 7 to 13) and 9 (lengths 6 to 14), respectively, and the maximum allowed distance in matching is 3 and 4, respectively. In these two cases, for a given sequence, the proportion of candidates matching the query is slightly higher than at 20 % tolerance, i.e., there are more approximate matches, and the feature values start to become distinguishable between the Yeast and E. coli genome data. Therefore, classification accuracy increases from 20 % to 30 % tolerance. At 40 % tolerance, SVM gives the maximum accuracy of 95.5 % for a sample size of 10,000 bp.

When the allowed tolerance is 50 %, 11 candidate lengths, from 5 to 15, are used as features for the classifier. In this case, the maximum distance allowed in matching a candidate to the query is 5. Since a mismatch of 50 % is allowed in matching a candidate to the query, the number of fuzzy matches increases. Our experimental results show a classification accuracy of 98.5 % by BP and 96.5 % by the other classifiers, i.e., NB, SVM and KNN, with a sampling/sequence size of 10,000 bp.

Similarly, when the tolerance allowed is 60 and 70 %, the number of features is 13 and 15, with candidate lengths from 4 to 16 and from 3 to 17, respectively. At 60 % tolerance and a sampling/sequence size of 10,000 bp, the classification accuracy achieved is 98.5 % by BP and NB, 98 % by SVM and 97 % by KNN. At 70 % tolerance and a sampling size of 10,000 bp, all the classification methods used in the experiment achieve the highest classification accuracy of 99 %.
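The relationship between tolerance, maximum allowed distance and candidate lengths walked through above can be summarized in a short sketch. The function name `candidate_lengths` is hypothetical; the rule it encodes (maximum distance equal to the tolerance percentage of the query length, and candidate lengths ranging symmetrically around the query length) is inferred from the cases described in the text.

```python
def candidate_lengths(query_len, tolerance_pct):
    """Return (max allowed distance, list of candidate lengths) for a tolerance in percent."""
    # Integer arithmetic avoids floating-point rounding, e.g. 10 * 30 // 100 == 3.
    max_dist = query_len * tolerance_pct // 100
    lengths = list(range(query_len - max_dist, query_len + max_dist + 1))
    return max_dist, lengths

# Query of length 10, as used in the experiments.
for tol in range(10, 80, 10):
    max_dist, lengths = candidate_lengths(10, tol)
    print(f"{tol} %: max distance {max_dist}, "
          f"{len(lengths)} features, lengths {lengths[0]} to {lengths[-1]}")
```

Each additional 10 % of tolerance adds one shorter and one longer candidate length, so the number of features grows from 3 at 10 % tolerance to 15 at 70 % tolerance.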

Figures 4, 5, 6, 7 and 8 show the change in accuracy with tolerance for the classification methods. It can be observed that as tolerance increases, classification accuracy also increases.

Fig. 4 Classification accuracy for sample size 2,000 at different tolerance
Fig. 5 Classification accuracy for sample size 4,000 at different tolerance
Fig. 6 Classification accuracy for sample size 6,000 at different tolerance
Fig. 7 Classification accuracy for sample size 8,000 at different tolerance
Fig. 8 Classification accuracy for sample size 10,000 at different tolerance

Effect of sampling size on classification accuracy

The experimental results show that classification accuracy increases with sampling size. We tested our model with 200 samples (100 from E. coli and 100 from Yeast) of size 2000, 4000, 6000, 8000 and 10,000 bp separately and monitored the classification accuracy. As the sample/sequence size increases, a very large number of candidates is generated, and every candidate is compared with the query for an approximate match. For a candidate length c and a sample size s, the sliding window procedure generates (s − c + 1) candidates. Out of these (s − c + 1) candidates, we count those that approximately match the query within the given tolerance. We observed that, as the number of candidates increases, the number of fuzzy matches also increases. The feature values of the E. coli and Yeast genome sequence data become more clearly distinguishable as the sampling size increases, which results in an increase in classification accuracy with sampling size.
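The candidate generation and matching step can be sketched as below. The sketch assumes a stride-1 sliding window and assumes that each feature is the number of windows of one candidate length whose Levenshtein distance to the query is within the allowed maximum; the function names and the short toy sequence are hypothetical, and the code is not the authors' implementation.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def sliding_windows(sequence, length):
    """All stride-1 windows of the given length; there are len(sequence) - length + 1."""
    return [sequence[i:i + length] for i in range(len(sequence) - length + 1)]

def fuzzy_match_features(sequence, query, max_dist):
    """One feature per candidate length: count of windows within max_dist of the query."""
    features = []
    for length in range(len(query) - max_dist, len(query) + max_dist + 1):
        windows = sliding_windows(sequence, length)
        features.append(sum(levenshtein(w, query) <= max_dist for w in windows))
    return features

# Toy example with a 4-character query (the experiments use a query of length 10).
print(fuzzy_match_features("ACGTACGT", "ACGT", max_dist=1))
```

Because the number of windows grows linearly with the sequence length, longer samples give larger and more stable match counts for every candidate length.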

Figure 9 shows that, at an allowed fault tolerance of 10 %, sample sizes of 2000, 4000, 6000, 8000, and 10,000 bp give maximum accuracies of 49, 52, 54, 55.5, and 60 %, achieved by BP, BP and KNN, NB, BP, and BP, respectively. As tolerance increases, the performance of the classifiers improves further with sampling size, as shown in Figs. 10, 11, 12 and 13. Figure 13 shows that, at 50 % tolerance, sample sizes of 2000, 4000, 6000, 8000, and 10,000 bp give maximum accuracies of 88.5, 91, 93.5, 96, and 98.5 %, achieved by BP, SVM, BP and SVM, BP, and BP, respectively.

Fig. 9 Performance of different classifiers with 10 % tolerance
Fig. 10 Performance of different classifiers with 20 % tolerance
Fig. 11 Performance of different classifiers with 30 % tolerance
Fig. 12 Performance of different classifiers with 40 % tolerance
Fig. 13 Performance of different classifiers with 50 % tolerance

Comparison with n-gram based method

The proposed approach is compared with the n-gram sequence encoding method in binary form [14, 15]. In this method, preprocessing consists of extracting motifs from a set of sequences. These motifs are used as attributes/features to construct a binary table in which each row corresponds to a sequence. The presence or absence of an attribute in a sequence is denoted by 1 or 0, respectively. This binary table is called a learning context. It represents the result of the preprocessing step and the new sequence encoding format. In the mining step, a classifier is applied to the learning context to generate a classification model. This model is then used to classify other sequences in the post-processing step.

Table 18 shows the performance of the classifiers (classification accuracy in %) using the binary sequence encoding method for 3-grams. A set of 64 3-tuple keywords is first generated from the 4 bases A (adenine), T (thymine), C (cytosine) and G (guanine). These keywords are searched for in the same 200 randomly selected genome sequences of a given length that were used to verify our proposed model, and a binary feature table is constructed in which the presence or absence of an attribute in a sequence is denoted by 1 or 0, respectively. The binary encoding method for 3-grams gives very low classification accuracy. This is because, when the database is very large and the sequences are similar, each keyword appears at least once in almost every sequence, so the feature value is 1 for almost every keyword. This leads to a reduced classification accuracy of 50–53 %. Binary encoding therefore works when each family has its own motifs which characterize it and distinguish it from the others.
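The binary 3-gram encoding and its saturation on long sequences can be illustrated with a small sketch. The function names `all_ngrams` and `binary_encode` are hypothetical, and the input is a synthetic uniformly random sequence rather than genome data, so the sketch only shows the mechanism and does not reproduce the results in Table 18.

```python
import random
from itertools import product

def all_ngrams(n, alphabet="ATCG"):
    """All len(alphabet)**n keywords of length n, e.g. 64 keywords for n = 3."""
    return ["".join(p) for p in product(alphabet, repeat=n)]

def binary_encode(sequence, keywords):
    """Row of the binary table: 1 if the keyword occurs in the sequence, else 0."""
    return [1 if kw in sequence else 0 for kw in keywords]

keywords = all_ngrams(3)
print(len(keywords))  # 4**3 keywords

# Synthetic 2000-character sequence (an assumption for illustration only).
rng = random.Random(0)
sequence = "".join(rng.choice("ATCG") for _ in range(2000))
row = binary_encode(sequence, keywords)
print(sum(row), "of", len(row), "keywords present")
```

On sequences of thousands of bases, nearly every 3-gram is expected to occur at least once, so the rows of the binary table become almost identical across classes and carry little information for the classifier.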

Table 18 Accuracy (in %) of different classifiers based on 3-grams represented in binary form

We therefore conclude that, when the sequence database is very large and the sequences are partially similar, our proposed model based on fuzzy matching compares favorably with the existing n-gram based method.

Conclusions and future work

Genomic data mining and knowledge extraction is an important problem in bioinformatics. Only a few attempts in the literature focus on unknown genome identification, using either dinucleotide or trinucleotide composition, and these existing approaches are based on exact pattern matching. In most real-world biological problems, exact matching may not give the desired results because biological sequences are “similar” rather than exactly the same. In this paper, a novel approach for identification of species based on fuzzy patterns is proposed. Genome data of two species, Yeast and E. coli, is sampled into different sizes, and fuzzy patterns within a given tolerance are extracted using Levenshtein distance. The candidate length is varied so as to allow both positive and negative tolerance. The fuzzy/approximate matches for these candidates/subsequences are used as feature values for the classifiers. Classification is performed using four data mining techniques: Naïve Bayes, support vector machine, backpropagation and nearest neighbor. To extract fuzzy matching patterns from genome sequences, we used a query of length 10 and allowed tolerances from 10 to 70 %. The proposed model was tested separately on the fuzzy matching patterns extracted at each tolerance, and the classification accuracies were monitored. The experimental results vary with the tolerance allowed as well as with the sampling/sequence size. A total of 200 samples were used to test the model (100 from E. coli and 100 from Yeast). For a sampling/sequence size of 10,000 bp and an allowed tolerance of 50 %, the classification accuracy achieved is 98.5 % by BP and 96.5 % by the other classifiers, i.e., NB, SVM and KNN; at 70 % tolerance, all classifiers achieve 99 % accuracy. The results show that classification accuracy increases with tolerance and with sampling size.
We used a query of length 10 in our experiment. In future work, the experimental results are to be verified with different query lengths, and a relationship between query length and tolerance values is to be established.