Genome data classification based on fuzzy matching

Patil, Nagamma; Toshniwal, Durga; Garg, Kumkum

doi:10.1007/s40012-012-0001-1

Genome data classification based on fuzzy matching

Original Research
Published: 13 October 2012

Volume 1, pages 9–28, (2013)
Cite this article

Download PDF

CSI Transactions on ICT Aims and scope Submit manuscript

Genome data classification based on fuzzy matching

Download PDF

Nagamma Patil¹,
Durga Toshniwal¹ &
Kumkum Garg²

3143 Accesses
3 Citations
Explore all metrics

Abstract

Genomic data mining and knowledge extraction is an important problem in bioinformatics. Some research work has been done on unknown genome identification and is based on exact pattern matching of n-grams. In most of the real world biological problems exact matching may not give desired results and the problem in using n-grams is exponential explosion. In this paper we propose a method for genome data classification based on approximate matching. The algorithm works by selecting random samples from the genome database. Tolerance is allowed by generating candidates of varied length to query from these sample sequences. The Levenshtein distance is then checked for each candidate and whether they are k-fuzzily equal. The total number of fuzzy matches for each sequence is then calculated. This is then classified using the data mining techniques namely, naive Bayes, support vector machine, back propagation and also by nearest neighbor. Experiment results are provided for different tolerance levels and they show that accuracy increases as tolerance does. We also show the effect of sampling size on the classification accuracy and it was observed that classification accuracy increases with sampling size. Genome data of two species namely Yeast and E. coli are used to verify proposed method.

A Novel Approach for Genome Data Classification Using Hadoop and Spark Framework

Feature clustering and feature discretization assisting gene selection for molecular classification using fuzzy c-means and expectation–maximization algorithm

Article 06 November 2020

Microarray Filtering-Based Fuzzy C-Means Clustering and Classification in Genomic Signal Processing

Article 19 June 2019

1 Introduction

Bioinformatics [1–3], has emerged as a forefront research area in the recent past since biological data is accumulating at an accelerated rate. In particular, the number and sizes of genome databases have grown rapidly over the last few years. One of the most important problems is automatically determining the group to which a previously unseen genome sequence belongs [4].

Classifying organisms from its genomic database into groups within a taxonomical hierarchy has several applications which include specific identification of any unknown organism, study of evolutionary characteristics, and study of mutual relationship existing between organisms [5]. Currently more than a million organisms have been discovered, but a large number are yet to be discovered. Any systematic study on an organism can be done only when it is identified to be in a particular group. Thus genome identification finds wide application in evolutionary studies of organisms. Classification and species identification have also been associated with practical applications such as bio-diversity studies [6], forensic investigations [7] and food and meat authentication [8], to name a few.

Pattern matching can be considered as either exact matching or approximate matching for sequential data [9]. In exact sequence pattern matching problems, we aim to find a substring in text T that is exactly the same as the searching pattern P. In real world biological applications, most of the sequences are “similar” instead of exactly the same. Most fundamental operations like repeat pattern mining, similarity between two sequences, sequence alignment, etc., can be modeled as searching for given “patterns” in a “text.” However, exact searching is of little use for this application, since the patterns rarely match the text exactly. Thus searching in sequence repositories often requires going beyond exact matching to determine the sequences which are similar [10, 11]. This gave a motivation to “search allowing errors” or approximate match. Approximate matching/fuzzy matching is the finding of the most similar match of a particular pattern within a sequence.

The existing approaches extract dinucleotide composition or trinucleotide composition from the sequences using exact pattern matching method and are used as features for classifier. In most of the real world biological problems exact matching may not give desired results because sequences are similar and the patterns rarely match the sequences. The problem which can appear in using n-grams is exponential explosion. It is clear that many of algorithms with n-grams are computationally too expensive [4].

In this paper identification of organism from its genomic database is considered. In the present work, we propose an approach for genome identification based on approximate pattern matching. Since genome data is very huge, we sample the data into different sizes. Given a database of randomly selected samples of genomic sequences, our proposed work includes two algorithms viz. algorithm for finding fuzzy occurrences based on Levenshtein distance and algorithm for finding total number of fuzzy matching patterns by varying candidate length so as to allow both positive and negative tolerance from the genome data sequences. These fuzzy matching patterns are used as features for a classifier. Since, candidate length is varied so as to allow both positive and negative tolerance, the length of subsequences and number of subsequences that are used as feature for classifier also changes. Classification has been done using data mining techniques namely, Naïve Bayes, support vector machine, backpropagation and also by nearest neighbor. Experimental results are reported for 100 randomly selected samples (size varying from 2,000 to 10,000 bp) from each of complete genome data of Yeast and E. coli. To extract fuzzy matching patterns from genome sequences, we used query of length 10 and allowed tolerance from 10 to 70 %. The proposed model is tested separately for fuzzy matching patterns extracted with each of the fault tolerance and the classification accuracies are monitored. The experimental results vary according to the tolerance allowed as well as according to sampling/sequence size.

The article is arranged as follows. In Sect. 2 we give some background information. Related work is explained in Sect. 3. In Sect. 4, proposed approach is explained. The experimental results obtained by using genome data of two species namely Yeast and E. coli are explained in Sect. 5. Conclusion section summarizes the results.

2 Background

2.1 Relevant terms

2.1.1 Genome

A genome is the complete genetic material of an organism. Its size is generally given as its total number of base pairs [12].

2.1.2 Base pair

A base pair consists of two nitrogenous bases (adenine and thymine or guanine and cytosine) held together by weak bonds. Two strands of DNA are held together in the shape of a double helix by the bonds between base pairs [13].

2.1.3 Base sequence

Base sequence is the order of nucleotide bases in a DNA molecule [13].

2.1.4 Nucleotide

Nucleotide is a subunit of DNA or RNA consisting of a nitrogenous base (adenine, guanine, thymine, or cytosine in DNA; adenine, guanine, uracil, or cytosine in RNA), a phosphate molecule, and a sugar molecule (deoxyribose in DNA and ribose in RNA). Thousands of nucleotides are linked to form a DNA or RNA molecule [13].

2.1.5 n-Gram

The n-gram is a subsequence composed of n characters, extracted from a larger sequence. For a given sequence, the set of the n-grams which can be generated is obtained by sliding a window of n characters on the whole sequence. This movement is carried out character by character. With each movement, a subsequence of n characters is extracted. This process is repeated for all the analysed sequences. The n-gram can be represented in binary form [14, 15] or either in dinucleotide [16] or trinucleotide [17, 18] frequencies.

Example: consider the generation of 3-g and representing them in binary form, from the following two sequences

Seq1: AVADEK
Seq2: QAVALGYVS

For n = 3, a total of 11 motifs are extracted from seq1 and seq2 by the sliding window procedure. Sequence1 results in 5 motifs, i.e., AVA, VAD, ADE and DEK while sequence 2 gives 7 motifs: QAV, AVA, VAL, LGY, GYV and YVS. Out of these 11 motifs, 10 are distinct (the motif AVA is repeated in both the sequences). These 10 distinct motifs are used as attributes/features to construct a binary table where each row corresponds to a sequence. The presence or absence of an attribute in a sequence is denoted by 1 or 0 as shown in Table 1.

Table 1 n-Gram based sequence encoding in binary form

1 for all sequences in database do step 2.
2 for all candidate length from (p − p*k/100)
to (p + p*k/100) do steps 3–8
3 Generate candidates of length in step 2.
4 Count = 0.
5 Find the Levenshtein distance between candidate pattern and query pattern.
6 If distance ≤k, then increment count.
7 Repeat steps 3 and 4 for all generated candidates.
8 Output count as total number of approximate matches.
End for
End for

Genome data classification based on fuzzy matching

Abstract

Similar content being viewed by others

A Novel Approach for Genome Data Classification Using Hadoop and Spark Framework

Feature clustering and feature discretization assisting gene selection for molecular classification using fuzzy c-means and expectation–maximization algorithm

Microarray Filtering-Based Fuzzy C-Means Clustering and Classification in Genomic Signal Processing

1 Introduction

2 Background

2.1 Relevant terms

2.1.1 Genome

2.1.2 Base pair

2.1.3 Base sequence

2.1.4 Nucleotide

2.1.5 n-Gram

2.1.6 Dinucleotide composition

2.1.7 Data mining

2.1.8 Soft computing

3 Related work

4 Methods

4.1 Algorithm for finding approximate occurrences

4.2 Algorithm for finding total number of approximate matches for a specified fault tolerance from sequence database

Example 1

4.3 Algorithms used for classification

4.3.1 Back propagation

4.3.2 Naive Bayes

4.3.3 Support vector machine (SVM)

4.3.4 K-nearest neighbor (KNN)

5 Results and discussion

5.1 Effect of tolerance on classification accuracy

5.2 Effect of sampling size on classification accuracy

5.3 Comparison with n-gram based method

6 Conclusions and future work

References

Author information

Authors and Affiliations

Corresponding author

Appendix

Appendix

1.1 Definitions

1.2 Measures for performance evaluation

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation