Background

Phylogenetic analysis is a technique used to perform species classification through DNA sequences analysis [1]. It examines sequence divergence or similarity through the alignment of DNA sequences. Recently, a new technique, named logic data mining, has been introduced and applied in species classification through DNA barcode, a short fragment of mitochondrial DNA composed of few hundreds of bases from which it is possible to extract the information needed to classify living species [2]. This technique allows finding patterns of nucleotides in the DNA barcode of a given species that characterizes it and allows distinguishing one species from the others. A pattern is defined as a set of positions of the DNA sequence whose corresponding nucleotides completely characterize the species. The logic data mining technique provides a sort of fingerprinting of the species and has been shown to allow a correct classification of new individuals. In this work we apply this technique to the classification of polyomaviruses. Human polyomaviruses are small double stranded DNA viruses of about 5 kb in length which belong to the Polyomaviridae family. Up to 2008, five human polyomaviruses have been identified and characterized: BK (BKPyV), JC (JCPyV), KI (KIPyV), WU (WUPyV) and MC (MCPyV) polyomaviruses. BKPyV and JCPyV were both uncovered in 1971. BKPyV was first identified in the urine of a kidney transplant patient [3], while JCPyV was uncovered in the brain tissue of a patient affected by progressive multifocal leucoencephalopathy (PML) [4]. The novel KIPyV and WUPyV have been identified in respiratory secretions of children with signs of acute respiratory disease [5, 6], though there is little evidence that these viruses are the causative agents of respiratory disease. MCPyV was found integrated monoclonally in a rare skin cancer named Merckel cell carcinoma, strongly suggesting that viral infection may be an early event in the pathogenesis of Merkel cell carcinoma [7]. KIPyV, WUPyV and MCPyV share most of the genomic characteristics of BKPyV and JCPyV, with a noncoding control region (NCCR) separating the early and late coding regions on opposite strands [57]. However, unlike JCPyV and BKPyV, the three novel polyomaviruses lack of the gene region encoding for the agnoprotein.

The genome of each polyomavirus is about 5 Kb in length. It is important to adopt methods that are able to identify a subset of positions that characterize the polyomaviruses and the gene regions among the five analyzed polyomaviruses. The aim of this study is to identify nucleotide positions that allow distinguishing the five human polyomaviruses. The nucleotides, and their combinations, that determine an effective separation between the different polyomaviruses will be examined in a wider study for an effective biological analysis.

Methods

Sequences collection and alignments

All sequences of BKPyV, JCPyV, KIPyV, WUPyV and MCPyV deposited in Genbank until the date of submission of this paper were retrieved and analyzed by a logic data mining method (1982 total sequences). Of these, 40 sequences have been obtained in our laboratory and then deposited in GenBank. Before analysis, the sequences of each human polyomavirus type were aligned against the following reference sequences: BKPyV, NC_001538; JCPyV, NC_001699; KIPyV, NC_009238; WUPyV, NC_009539; MCPyV, NC_010277. Each gene fragment was aligned using Clustal × [8] and the sequences were manually edited with the Bioedit software [9]. Positions containing gaps were removed from the final alignment (each gene fragment was in frame and coded for the corresponding protein).

The logic mining software system

The logic data mining software used in this study has been derived from previous works on the analysis of biological and genetic data [2, 1012] and it has been customized for the polyomavirus genome analysis. Its purpose is to identify logic rules, expressed as combination of the nucleotide positions, which are able to characterize univocally the different types of viruses in the same family. A standard classification paradigm is adopted, where the classification rules are extracted from a portion of the data (training set) and then tested on the remaining samples (test and verification data). Hereafter, we refer to the different types of viruses as classes to which the samples belong.

The input of the program is a FASTA format file of DNA sequences containing the training and testing sets. It is converted into an internal format named DMB that best organizes all the information needed for this analysis tool.

The main outputs of the software are:

  • The logic rules that are able to individuate each type of virus, expressed as a combination of the nucleotide positions (e.g., the rule IF (pos437 = A) and (pos486 = C) THEN BK is to be interpreted as "if the nucleotide in position 437 is A and the nucleotide in position 486 is C then the sequence belongs to virus type BK").

  • The classification statistics (confusion matrices, average and variances of error rates obtained with different sampling strategies).

The formulas are determined solving a series of optimization problems with the aim of minimizing their dimension and maximizing their precision.

The flow diagram of the logic mining software system in Figure 1 shows a schematic view of the architecture, representing the system flows and the fundamental modules. The main steps of the method are briefly described in the following paragraphs.

Figure 1
figure 1

Flow diagram of the logic mining software.

The feature selection step

In this study the features are associated with a position that has a determinate nucleotide value (A, C, G, T). The aim of the feature selection step is to extract a small set of positions or species specific bases that are able to distinguish between the different species that are present in the data set. The selection of the most relevant features is a typical issue in data mining and data analysis [10, 13]. We use an integer optimization model formulation to address the selection of the most relevant features. We consider a particular distance among samples and then select a given number of features that maximizes the smallest distances between each pair of samples of different classes. This optimization problem is of large size and grows quadratically with the number of sequences; for its solution a fast heuristic algorithm based on greedy randomized search is adopted. A detailed description of the method used to solve the underlying optimization problem can be found in references [2, 10, 11].

The formula extraction step

The selected features in the previous step are taken as input to compute the logic separating formulas by the Lsquare method [12]. A logic formula is constituted by conjunctions ("and") and disjunctions ("or") of features. The classification formulas are extracted through the solution of an integer optimization problem based on a minimum cost satisfiability problem (MINSAT), which is solved by decomposition techniques. See reference [12] for further details. After obtaining the first set of logic formulas, the feature selection and the formula extraction steps are repeated to acquire all the separating formulas, adding a constraint that is able to avoid the selection of the previously chosen features. These two steps are iterated until no more separation is found.

The application of the logic mining software

A typical application of the software is organized as follows. Each virus is defined as a class, indicated as v. For each v, a 2-class classification problem is defined, where class A contains the individuals of virus v, and class B the individuals of the other viruses. Then, the training data is analyzed in order to formulate the feature selection problem and to compute the optimal set of feature. Thereafter, the system identifies the logic formulas that separate the individuals in class A from those in class B. These formulas are then applied to the virus sequences, and, if a virus sequence is recognized as positive by the formula of species v, its predicted class is v and the prediction will be verified. When a virus is recognized by more than one formula (or by none of them) such an event is registered as a non classification. If a formula recognizes a virus of a class different from v into class v, we register such an event as a recognition error. Finally, the analysis is performed and a subset of patterns is produced. The component diagram in Figure 2 gives a final compact overview of the system.

Figure 2
figure 2

Component diagram of the logic mining software.

Results and discussion

Logic mining analysis on each virus class

In order to understand the biological role of nucleotide sequences in a specific gene region of each polyomavirus, the analysis described in methods was carried out considering each virus as a class, in order to extract more general information. In this way, the focus is on positions that can discriminate between viruses. The number of sequences that compose the different gene regions and viruses is summarized in Table 1.

Table 1 Number of sequences

The feature selection method was applied in order to select a small set of relevant positions. In the first runs a single position is selected only if it is able-just by itself-to distinguish between the groups. After a position is chosen and the logic formulas are obtained, the selected position is removed from the data, reiterating this procedure until no separation with only one feature is possible. A 100 fold cross validation sampling of the different sequences was performed to validate the results. The original sequences were randomly partitioned into 100 folds and the validation was performed 100 times, each on different sequences.

The adoption of this approach results in the discovery of the most characteristic base pairs of the data set. The method led to the identification of formulas each with the property of being able to correctly classify a polyomavirus.

In this step the main purpose of the analysis was to distinguish the five different polyomaviruses (BK, JC, KI, MC, WU) in all the 1982 sequences (here all the genes are considered together). The classification has a perfect recognition rate (100% in training and in test set). The logic formulas are listed in Table 2, where we note that they are composed by few literals (from three to five) and all formulas have only one clause in conjunctive normal form (i.e., a sequence of conjunctions).

Table 2 Logic formulas for virus classification.

We recall the straight-forward interpretation of the formulas of Table 2: in the fourth row of the Table, we read that "if position 199 is A, and if position 286 is different from T, it is a MC polyomavirus". The last column of Table 2 reports a measure of the coverage of the formula; it indicates the proportion of samples of that class that are correctly classified. When it is equal to 1, as it is the case for all formulas but two, it means that that single formula performs exact classification of the training data.

Logic mining analysis over 21 gene regions

Another classification analysis was done by distinguishing the different 21 gene regions and polyomaviruses in all the 1982 sequences. The available gene regions were Small t antigen (ST), Large t antigen (LT), VP1, VP2 and VP3. For performing the analysis and to validate the results a 100 fold cross validation sampling of the different sequences was applied. Also in this analysis the recognition rate is very high (99% both in training and testing set). The logic separating formulas are listed in Table 3. We point out that the error rate of 1% is caused by the very small amount of sequences available in MCPyV-VP1, MCPyV-VP2 and MCPyV-VP3: 3, 2 and 2 respectively. In this case the logic formulas are very compact (from one to three literals) and precise.

Table 3 Logic formulas for gene-virus classification

Logic mining analysis in the same gene region

We also performed the virus classification within the same gene regions, adopting a 100 fold cross validation approach, and found the separating formulas between the different viruses in that particular gene region. The viruses differ in multiple positions and we found all the discriminating base pairs. We remand to the appendix [Additional file 1] for further explanation of the experiments and to Additional file 2 and Additional file 3 for visualizing the logic formulas.

Conclusions

In this work the human polyomaviruses genome was analyzed using a logic data mining approach in the attempt to get insights into the specific features of the viral sequence. The objective of the analysis was to identify very small portions of the DNA sequence, composed by one or few nucleotides, which are able to characterize the different viral types. The approach appears to be successful as it identifies several small "rules" that effectively characterize the different classes of polyomaviruses and that are able to separate one class from the other with utter precision. The analysis was carried out on both the virus sequences as a whole and the different gene regions, obtaining in both cases extremely high correct recognition rates. In particular, for each polyomavirus (BK, JC, KI, MC, WU), a single logic formula was able to provide a perfect classification; in addition, within the same gene region we discovered the logic formulas that can distinguish all the different polyomaviruses with a 99% accuracy.

The first classification experiment was based on all the gene regions of all five different polyomaviruses sequences aligned from the ATG starting codon. This experiment led to the identification of some particular positions of a polyomavirus for all the gene regions; these positions enable to distinguish that polyomavirus from the others. In other words, all the gene regions of a given polyomavirus exhibit a prevalence of an allele with respect to the same positions in all the gene regions of the other viruses. This evidence directs further studies towards the understanding of the biological meaning of those particular positions.

Recently, taxonomical developments in the family Polyomaviridae have been proposed. According to these suggestions the family Polyomaviridae, which is currently constituted as a single genus polyomavirus, will include three genera: two containing mammalian viruses and one containing avian viruses. The two mammalian genera are named Orthopolyomavirus and Wukipolyomavirus, while the avian genus is named Avipolyomavirus.

The demarcation criterion used for the new proposed polyomavirus species is based on the evaluation of the whole-genome nucleotide sequence identity respect to members of known species with less than 81% similarity [1]. With the method proposed in this work, it is possible to distinguish the five described human polyomaviruses focusing only on few nucleotide positions. This approach may thus help defining new viral species (i.e., when interpreting sequences obtained by high-throughput sequencing) in conjunction with the classical method based on sequence identity.

Overall, this is a promising approach that needs to be validated on a larger sample size. For instance, it will be interesting to verify how mutations (polymorphisms, deletions and insertions) in different genomic regions can affect this analysis. To this regard, we are going to select and analyze sequences obtained from patients with known mutations in particular regions or domains of the viral genome [14].