Background

The bacterium Staphylococcus aureus secretes α-hemolysin monomers that bind to the outer membrane of susceptible cells. Seven monomers can oligomerize to form a very stable, water-filled transmembrane channel [1]. The channel can kill the target cell by rapidly discharging vital molecules (such as ATP) and disturbing the membrane potential.

Suspended in a lipid bilayer, as shown in Figure 1, the α-hemolysin channel can be used as a sensor (nanopore detector) when large molecules interact with the channel environment under an applied potential (the open channel passes approximately 120 pA of ionic current under normal conditions). When a 9 bp DNA hairpin enters the pore, the loop is caught at the vestibule mouth, leaving the stem terminus perched to readily bind to the amino acid residues near the limiting aperture, resulting in a consistent toggle for thousands of milliseconds, as shown in Figure 2.

Figure 1

α-hemolysin nanopore with captured hairpin.

Figure 2

Upper Level Toggler (ULT) with profile example.

Many approaches to characterizing nucleic acid analyte–channel interactions use 2-D scatter plot analysis [2, 3]. A recently proposed method for discriminating the orientation of translocating RNA polynucleotides [4] uses a combination of six sigmoid phenomenological functional forms to approximate possible blockades. A hybrid method of automated analyte classification was used in [5, 6] that discriminates among 8GC, 9GC, 9CG, 9TA and 9AT molecules by first extracting features with Expectation Maximization (EM) learning on a single 50-state fully connected Hidden Markov Model (HMM). A feature vector based on the HMM parameters is then passed to a Support Vector Machine (SVM) for classification (with the binary decision tree shown in Figure 3). Although the process shown in Figure 3 is scalable and has high classification accuracy, it can also involve high data-rejection rates (acceptable when performing solution assays). This motivates effort toward a less scalable method with a lower data-rejection rate (such as is needed during genomic sequencing). A later study of the data examined in [5, 6], with PCA reduction on states followed by a simple, uninformed AdaBoost classification (not SVM, see [7]), led to similar improvement in zero-rejection accuracy, and thus similar reductions in the data rejection needed for high-accuracy classification [7]. That approach, however, did not begin with the stronger (but non-scalable in class number) feature extraction method described here. This is the first test of what is expected to be a highly accurate feature extraction method (better than those employed previously), whose critical limitation in general use, however, is its scalability in the number of classes to discriminate.

Figure 3

Existing classification process with HMM feature extraction followed by SVM binary tree decision.

In an interview [In Focus, January 2002], one of the pioneers in the development of nanopore technology, Dr. Mark Akeson, states that getting a machine to learn base pair or nucleotide signatures and report the results automatically will be a key feature of a nanopore sequencing instrument. Here we propose a new method of unsupervised learning of ionic flow blockades with Mixture of Hidden Markov Models (MHMM) profiles that has a number of attractive attributes, at the expense of restricting learning to a smaller state space. For genome sequencing, the problem reduces to identifying the classes {A, C, G, T}, i.e., there are only four classes to discriminate. Thus, for some important problems the non-scaling constraint is not an issue and this approach may offer the best performance.

The Maximum a Posteriori (MAP) molecular classification with our model opens the possibility of making distributed decisions in real time. The EM algorithms running on our model are computationally expensive, thus an important part of this work involves computational speed-up via distributed processing implementations.

Results

We have learned blockade signal clusters for five different types of molecules; two such profile mixtures, learned in 50 iterations, are shown in Figure 4. The classification accuracy is shown in Figure 5, where we used 10-fold resampling of 500-sample subsets of labeled toggles from our test set [see Methods] (the 10-fold resampling is needed to perform majority-vote classification stabilization). The resampling offers a similar stabilization of classifications, at similar computational expense, to what is done via data rejection in [5, 6]. Accuracy here is defined as

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN},$$
(1)

where True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) are counted among the classified data samples. We have systematically investigated how model complexity affects accuracy, as shown in Figure 6, where average accuracy does not improve for models with more than 12 components and more than 4 blockade levels, although some individual molecules take advantage of the increased model complexity as their classification becomes more accurate. We have also investigated the blockade signal duration needed for proper classification, as shown in Figure 7, and for the data sets examined found that samples of more than 100 ms duration yield little gain in either average classification accuracy or classification time. We tried using ionic flow blockade samples of 200 ms in the MHMM training, for example, with no apparent improvement in classification accuracy over the 100 ms samples. This behavior was not observed with the non-MAP, large-state (but scalable) approach used in [5, 6], where greater observation times led to improved classification (although there is agreement that there were diminishing returns on learning sets for signal durations greater than 100 ms and, especially, greater than 500 ms).

Figure 4

Toggle clusters for 9GC and 9CG molecules. The mixture proportions correspond to the frequency of a certain toggle mode. The sixteen possible transitions corresponding to the profile shown in Figure 9(a) are shown as a chessboard; the darker the area of a cell, the more probable the transition. Emissions corresponding to each of the four hidden HMM states are shown below the transition matrix. A MAP-classified 100 ms toggle sample from the learning set corresponding to each profile is also shown.

Figure 5

MAP classification accuracy with 10-fold resampling on split-sample data (with 4 levels and 15 components).

Figure 6

Increasing model complexity affects accuracy.

Figure 7

Accuracy of molecular classification depends on sample duration.

The accuracy of consecutive same-analyte toggle sample classification is shown in Figure 8(b), where we reach 100% performance within 14 classifications, except for the 9GC molecule, which underperformed when compared with [5, 6]. The difficulty with 9GC classification accuracy convergence can be explained by substantial confusion with 9AT toggles, which reaches ~17% in the first classification round and slowly reduces to ~3% after 21 classification rounds.

Figure 8

Proposed and existing process classification accuracy.

The accuracy improvement is consistent with the accuracy of the previously reported classification process [6], as shown in Figure 8(a) (except for the 9GC molecule). The failure to discern 9GC from 9AT in the approach described here, and not in prior efforts [5, 6], may simply be the result of the better blockade-level 'fine-structure' resolution of the prior model.

The better resolution between 9GC and 9AT channel blockades obtained with the 50-state single HMM (used in [5, 6]) may simply be due to the fixed 1 pA resolution (the state quantization bin size) providing a critical resolving capability between very similar blockade signals. If true, a hybrid solution may be to directly incorporate fine structure into the 4-state multiple-HMM processing model used here, by adding fine-structure states at 1 pA distances on either side of the 4 states identified by EM. Efforts along these lines are ongoing (see Discussion).

The MHMM analysis framework was first implemented in a concurrent fashion on a quad-core Sun Ultra 40 M2 machine, with a speedup factor of 3.66 compared to a sequential implementation, and then distributed to five machines of the same type with Java RMI, with an additional speedup of 4.02, for a total speedup of 3.66 × 4.02 ≈ 14.71.

Methods

In our approach we used unsupervised distributed learning of nanopore ionic flow blockade toggles with an MHMM. MHMMs have a long record of successful implementations, starting in speech recognition [8] and later in clustering protein families [9] and sequences [10], and in the search for splicing enhancers [11]. We use the HMM profile shown in Figures 2 and 9(a) to model the channel blockade process, using MHMM components as shown in Figure 9(b). Justification for using such profiles is provided in [12], where we found the durations of ionic flow blockade levels to be distributed with a simple geometric distribution. The noise at a fixed blockade level is typically found to be Gaussian, consistent with the overall thermal and shot noise background for the transient-binding, fixed-flow-geometry environments formed by channel and blockading elements.
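To make this generative picture concrete, the following minimal sketch draws a synthetic toggle trace with geometrically distributed level dwell times (via self-transitions) and Gaussian noise around each level. The level means, noise scales, and dwell probability are illustrative values, not the fitted parameters from this study:

```java
import java.util.Random;

// Sketch: synthetic blockade trace from a 4-level profile. Self-transitions
// give geometrically distributed dwell times at each level; emissions are
// Gaussian around the level mean. All numbers are illustrative only.
public class BlockadeSimulator {
    public static void main(String[] args) {
        double[] mean = {0.62, 0.49, 0.41, 0.28}; // level means, fraction of open-channel current
        double[] sd   = {0.02, 0.02, 0.02, 0.03}; // Gaussian noise scale at each level
        double stay   = 0.95;                     // self-transition probability
        Random rng = new Random(42);
        int state = 0;
        for (int t = 0; t < 1000; t++) {
            double pA = 120.0 * (mean[state] + sd[state] * rng.nextGaussian());
            System.out.printf("%d %.2f%n", t, pA);
            if (rng.nextDouble() > stay)          // leave the level with probability 1 - stay
                state = rng.nextInt(mean.length); // uniform jump among levels (a simplification)
        }
    }
}
```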

Figure 9

HMM profile and mixture of profiles.

The ionic flow blockade records were obtained from previous studies [6]. Two Axon binary files (each containing 500 blockade samples of 300 ms) were used to learn the probabilistic profiles for each hairpin molecule. The first 100 ms of each channel blockade is the basis of the first test set. Four other Axon binary files with uninterrupted recordings (non-sweep data), for each hairpin molecule and recorded on the same day, are then used for testing. The test set was formed by equiprobable sampling of 500 labeled blockade samples from the pool of test files.

Another test set was constructed from the above data files to measure the accuracy of consecutive same-analyte toggle sample classification. In this instance we take all the available blockade signal coming from the test files of a certain molecule (not just the first 100 ms) and use multiple sample draws from the same signal blockade (i.e., consecutive 100 ms segments). With the 100 ms signal samples drawn from the same blockade event, we perform MAP scoring followed by majority vote (with random resolution of ties), as sketched below. (Note: the data rejection employed in [6] could be made roughly equivalent to the signal resampling approach described here by simply collecting consecutive 100 ms samples, as done here, with classification of a given blockade occurring once a signal sample is not rejected; the only difference in the classification post-processing is that majority vote is employed here instead.) Accuracy is calculated as the ratio of the number of correct classifications matching the known molecule type to the total number of classification events. As in [6], the ionic flow in each record was normalized to the open channel current of 120 pA prior to learning and testing.
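A minimal sketch of this post-processing, assuming the per-segment log-likelihoods have already been computed by the trained mixtures (the scoring function itself is not shown):

```java
import java.util.*;

// Sketch: MAP-classify each consecutive 100 ms segment, then label the whole
// blockade event by majority vote with random tie resolution.
public class MajorityVote {
    // logLik[s][m] = log p(segment s | class m) + log prior of class m
    public static int classify(double[][] logLik, Random rng) {
        int M = logLik[0].length;
        int[] votes = new int[M];
        for (double[] seg : logLik) {             // MAP decision per segment
            int best = 0;
            for (int m = 1; m < M; m++)
                if (seg[m] > seg[best]) best = m;
            votes[best]++;
        }
        int max = Arrays.stream(votes).max().getAsInt();
        List<Integer> winners = new ArrayList<>();
        for (int m = 0; m < M; m++)
            if (votes[m] == max) winners.add(m);  // classes tied at the top
        return winners.get(rng.nextInt(winners.size()));
    }
}
```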

For our distributed MHMM system implementation we used a cluster of five Sun Ultra 40 M2 workstations, each equipped with two dual-core AMD Opteron 2220 SE processors (2.8 GHz), connected through a gigabit Ethernet switch.

HMM definition and EM learning

The following parameters describe the conventional HMM implementation according to [13]:

  • A set of states $S = \{S_1, \ldots, S_N\}$, with $q_t$ being the state visited at time $t$,

  • A set of PDFs $B = \{b_1(o), \ldots, b_N(o)\}$, describing the emission probabilities $b_j(o_t) = p(o_t \mid q_t = S_j)$ for $1 \le j \le N$, where $o_t$ is the observation at time point $t$ from the sequence of observations $O = \{o_1, \ldots, o_T\}$,

  • The state-transition probability matrix $A = \{a_{i,j}\}$ for $1 \le i, j \le N$, where $a_{i,j} = p(q_{t+1} = S_j \mid q_t = S_i)$,

  • The initial state distribution vector $\Pi = \{\pi_1, \ldots, \pi_N\}$.

A set of parameters $\lambda = (\Pi, A, B)$ completely specifies an HMM. Here we describe the HMM parameter update rules for the EM learning algorithm, rigorously derived in [14]. When training the HMM using the Baum-Welch algorithm (an Expectation Maximization procedure), we first need to find the expected probabilities of being in a certain state at a certain time point using the forward-backward procedure, as shown in Table 1.
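As a minimal illustration of this parameterization (a sketch only; the Gaussian emission form stands in for the PDFs $B$, following the blockade-level noise model described above):

```java
// Container for the HMM parameters lambda = (Pi, A, B), with Gaussian
// emissions b_j(o) = N(o; mu_j, sigma_j) standing in for the PDFs B.
public class Hmm {
    final double[] pi;        // initial state distribution, length N
    final double[][] a;       // state-transition matrix, N x N
    final double[] mu, sigma; // per-state Gaussian emission parameters

    Hmm(double[] pi, double[][] a, double[] mu, double[] sigma) {
        this.pi = pi; this.a = a; this.mu = mu; this.sigma = sigma;
    }

    // Emission density b_j(o) for state j and observation o
    double b(int j, double o) {
        double z = (o - mu[j]) / sigma[j];
        return Math.exp(-0.5 * z * z) / (sigma[j] * Math.sqrt(2 * Math.PI));
    }
}
```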

Table 1 Forward and backward procedures.
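A sketch of the scaled forward-backward recursions summarized in Table 1; per-time scaling is the standard guard against numerical underflow on long sequences, and emis[t][j] = $b_j(o_t)$ is assumed precomputed from the emission densities:

```java
// Forward-backward with per-time scaling. Under this scaling,
// gamma_t(i) = alpha[t][i] * beta[t][i] and log p(O|lambda) = sum_t log scale[t].
public class ForwardBackward {
    public static double[][][] run(double[] pi, double[][] a, double[][] emis) {
        int T = emis.length, N = pi.length;
        double[][] alpha = new double[T][N], beta = new double[T][N];
        double[] scale = new double[T];
        for (int j = 0; j < N; j++) alpha[0][j] = pi[j] * emis[0][j];
        scale[0] = normalize(alpha[0]);
        for (int t = 1; t < T; t++) {             // forward recursion
            for (int j = 0; j < N; j++) {
                double s = 0;
                for (int i = 0; i < N; i++) s += alpha[t - 1][i] * a[i][j];
                alpha[t][j] = s * emis[t][j];
            }
            scale[t] = normalize(alpha[t]);
        }
        for (int j = 0; j < N; j++) beta[T - 1][j] = 1.0;
        for (int t = T - 2; t >= 0; t--) {        // backward recursion
            for (int i = 0; i < N; i++) {
                double s = 0;
                for (int j = 0; j < N; j++)
                    s += a[i][j] * emis[t + 1][j] * beta[t + 1][j];
                beta[t][i] = s / scale[t + 1];    // reuse the forward scaling factors
            }
        }
        return new double[][][]{alpha, beta, {scale}};
    }
    private static double normalize(double[] v) { // scale a row to sum 1, return its mass
        double s = 0;
        for (double x : v) s += x;
        for (int i = 0; i < v.length; i++) v[i] /= s;
        return s;
    }
}
```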

Let us define $\xi_t(i,j)$ as the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$, given the model and the observation sequence

$$\xi_t(i,j) = p(q_t = S_i, q_{t+1} = S_j \mid O, \lambda) = \frac{\alpha_t(i)\, a_{i,j}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{p(O \mid \lambda)},$$

(2)

and $\gamma_t(i)$ as the probability of being in state $i$ at time $t$, given the observation sequence and the model

$$\gamma_t(i) = p(q_t = S_i \mid O, \lambda) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\,\beta_t(i)} = \sum_{j=1}^{N} \xi_t(i,j).$$

(3)

The HMM maximization step using these probabilities is shown in Table 2.

Table 2 Maximization step in HMM learning.
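A sketch of how (2) and (3) combine into the transition update of Table 2, using the scaled alpha/beta arrays from the forward-backward sketch above:

```java
// E-step quantities and the transition re-estimate
// a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i). With the scaling used above,
// gamma_t(i) = alpha[t][i] * beta[t][i] and
// xi_t(i,j) = alpha[t][i] * a[i][j] * emis[t+1][j] * beta[t+1][j] / scale[t+1].
public class BaumWelchStep {
    public static double[][] reestimateA(double[][] alpha, double[][] beta,
                                         double[][] a, double[][] emis, double[] scale) {
        int T = alpha.length, N = a.length;
        double[][] num = new double[N][N];
        double[] den = new double[N];
        for (int t = 0; t < T - 1; t++)
            for (int i = 0; i < N; i++) {
                den[i] += alpha[t][i] * beta[t][i];          // gamma_t(i), Eq. (3)
                for (int j = 0; j < N; j++)                  // xi_t(i,j), Eq. (2)
                    num[i][j] += alpha[t][i] * a[i][j] * emis[t + 1][j]
                                 * beta[t + 1][j] / scale[t + 1];
            }
        double[][] aNew = new double[N][N];
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                aNew[i][j] = num[i][j] / den[i];
        return aNew;
    }
}
```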

EM learning of HMM mixture

The objective of mixture learning is to maximize the likelihood function $p(\mathcal{O} \mid \Theta) = \prod_{k=1}^{K} p(O_k \mid \Theta) = \mathcal{L}(\Theta \mid \mathcal{O})$, i.e., we wish to find the locally optimal set of parameters $\Theta^* = \arg\max \mathcal{L}(\Theta \mid \mathcal{O})$ by using the Expectation Maximization (EM) iterative procedure and the set of observation sequences $\mathcal{O} = \{O_1, \ldots, O_K\}$.

The Expectation step in the mixture fitting algorithm is done by computing the responsibility matrix of the components given the data points:

$$\left.\underbrace{\begin{pmatrix} p(\lambda_1 \mid O_1, \Theta) & \cdots & p(\lambda_M \mid O_1, \Theta) \\ p(\lambda_1 \mid O_2, \Theta) & \cdots & p(\lambda_M \mid O_2, \Theta) \\ p(\lambda_1 \mid O_3, \Theta) & \cdots & p(\lambda_M \mid O_3, \Theta) \\ \vdots & \ddots & \vdots \\ p(\lambda_1 \mid O_K, \Theta) & \cdots & p(\lambda_M \mid O_K, \Theta) \end{pmatrix}}_{M \text{ mixture components}}\right\} K \text{ data points}$$

We use Bayes' rule to find the posterior probability (responsibility) of a mixture component with parameters $\lambda_m$ given emission sequence $O_k$:

$$p(\lambda_m \mid O_k, \lambda) = \frac{\alpha_m\, p(O_k \mid \lambda_m)}{\sum_{j=1}^{M} \alpha_j\, p(O_k \mid \lambda_j)}.$$

(4)
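In practice (4) is best computed in log space, since $\log p(O_k \mid \lambda_m)$ for long blockade sequences is far too negative to exponentiate directly; the sketch below uses the standard log-sum-exp normalization:

```java
// Responsibilities p(lambda_m | O_k, lambda) from per-mixture log-likelihoods,
// normalized per sequence with the log-sum-exp trick.
public class Responsibilities {
    // logLik[k][m] = log p(O_k | lambda_m); mixAlpha[m] = mixture proportion alpha_m
    public static double[][] compute(double[][] logLik, double[] mixAlpha) {
        int K = logLik.length, M = mixAlpha.length;
        double[][] r = new double[K][M];
        for (int k = 0; k < K; k++) {
            double max = Double.NEGATIVE_INFINITY;
            for (int m = 0; m < M; m++) {
                r[k][m] = Math.log(mixAlpha[m]) + logLik[k][m];
                max = Math.max(max, r[k][m]);
            }
            double sum = 0;                       // log-sum-exp normalization
            for (int m = 0; m < M; m++) sum += Math.exp(r[k][m] - max);
            for (int m = 0; m < M; m++) r[k][m] = Math.exp(r[k][m] - max) / sum;
        }
        return r;
    }
}
```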

The Expectation step is followed by the Maximization step, where we re-estimate the parameters (a code sketch of the weighted updates follows the list).

  • Mixture proportions

    $$\hat{\alpha}_m = \frac{1}{K} \sum_{k=1}^{K} p(\lambda_m \mid O_k, \lambda),$$
    (5)

  • Initial probabilities

    $$\hat{\Pi}_m = \frac{\sum_{k=1}^{K} \hat{\Pi}_m^k\, p(\lambda_m \mid O_k, \lambda)}{\sum_{k=1}^{K} p(\lambda_m \mid O_k, \lambda)},$$
    (6)

where $\hat{\Pi}_m^k$ is an estimate of the initial probabilities for component $m$ given sequence $O_k$,

  • Transitions

    $$\hat{A}_m = \frac{\sum_{k=1}^{K} \hat{A}_m^k\, p(\lambda_m \mid O_k, \lambda)}{\sum_{k=1}^{K} p(\lambda_m \mid O_k, \lambda)},$$
    (7)

where $\hat{A}_m^k$ is an estimate of the transition probabilities for component $m$ given sequence $O_k$,

  • Emissions

    $$\hat{B}_m = \frac{\sum_{k=1}^{K} \hat{B}_m^k\, p(\lambda_m \mid O_k, \lambda)}{\sum_{k=1}^{K} p(\lambda_m \mid O_k, \lambda)},$$
    (8)

where $\hat{B}_m^k$ is an estimate of the emission parameters for component $m$ given sequence $O_k$.
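The sketch below shows the weighting pattern shared by (5)-(7); the emission update (8) is analogous. The per-sequence estimates $\hat{\Pi}_m^k$ and $\hat{A}_m^k$ are assumed to have been produced by per-sequence Baum-Welch passes:

```java
// Mixture M-step: average per-sequence estimates, weighted by the
// responsibilities r[k][m] = p(lambda_m | O_k, lambda).
public class MixtureMStep {
    // piHatK[k][m][i] = Pi-hat_m^k, aHatK[k][m][i][j] = A-hat_m^k;
    // results are written into mixAlpha, pi and a in place.
    public static void update(double[][] r, double[][][] piHatK, double[][][][] aHatK,
                              double[] mixAlpha, double[][] pi, double[][][] a) {
        int K = r.length, M = mixAlpha.length, N = pi[0].length;
        for (int m = 0; m < M; m++) {
            double w = 0;
            for (int k = 0; k < K; k++) w += r[k][m];
            mixAlpha[m] = w / K;                            // Eq. (5)
            for (int i = 0; i < N; i++) {
                double piSum = 0;
                for (int k = 0; k < K; k++) piSum += r[k][m] * piHatK[k][m][i];
                pi[m][i] = piSum / w;                       // Eq. (6)
                for (int j = 0; j < N; j++) {
                    double aSum = 0;
                    for (int k = 0; k < K; k++) aSum += r[k][m] * aHatK[k][m][i][j];
                    a[m][i][j] = aSum / w;                  // Eq. (7)
                }
            }
        }
    }
}
```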

Distributed EM implementation

As discussed in [15], the computational gain of a parallel implementation can greatly depend on model topology. In the speech recognition community, researchers are able to use highly parallel HMM architectures for phoneme and dictionary word recognition. Typically, when a large number of Processing Elements (PEs) is used, the utilization of each element drops due to communication overheads. Therefore, the communication overhead in any parallel architecture must be strictly managed, ideally reduced to a constellation of PEs with shared memory [15]. In recent work [16] we describe the performance of the following HMM EM algorithms (we studied the last on the list):

  • Conventional EM, due to Leonard E. Baum and Lloyd R. Welch [17], takes $O(TN)$ memory and $O(2TNQ_{\max} + T(Q + E))$ time, where $T$ is the length of the observed sequence, $N$ is the number of HMM states, $Q_{\max}$ is the maximum HMM node out-degree, $E$ is the number of free emission parameters, and $Q$ is the number of free transition parameters,

  • Checkpointing EM [18-20] takes $O(\sqrt{T}\,N)$ memory and $O(3TNQ_{\max} + T(Q + E))$ time,

  • Linear memory EM [16, 21] takes only $O(N(Q + ED))$ memory and $O(TNQ_{\max}(Q + ED))$ time.

Similar improvements are also described for the HMM Viterbi implementation in linear memory [16]. In actual usage, with the comparatively small sequence durations generally examined, the checkpointing algorithm was found to be the most memory efficient.
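As a hypothetical illustration of the memory trade-off (the numbers here are illustrative, not from the cited studies): for a sequence of $T = 10{,}000$ observations and $N = 4$ states, conventional EM stores $TN = 40{,}000$ dynamic programming entries, whereas checkpointing stores only on the order of $\sqrt{T}\,N = 400$, at the cost of roughly one extra forward pass.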

Distributed checkpointing algorithm for learning from large data samples

The distributed checkpointing EM algorithm is shown in Figure 10. The steps in our distributed checkpointing algorithm implementation are as follows (a sketch of the data-splitting step appears after the list):

Figure 10

Distributed Checkpointing algorithm.

  1. The client machine splits the data sequence $O$ into subsequences $O_1, \ldots, O_t, \ldots, O_{\sqrt{T}}$, each of size $\sqrt{T}$, and distributes them across the servers along with $\lambda$,

  2. Find the $\sqrt{T}$ Forward and Backward checkpoints in sequential manner at the corresponding servers, where the emission matrices for $O_t$ were calculated and stored,

  3. Reconstruct the dynamic programming tables of size $N\sqrt{T}$ at the servers according to the locally stored checkpoints to make the local parameter estimate $\hat{\lambda}_t = (\hat{\Pi}_t, \hat{A}_t, \hat{B}_t)$,

  4. After calculating the local parameter estimate, communicate $\hat{\lambda}_t$ back to the client machine and calculate $\hat{\lambda} = (\hat{\Pi}, \hat{A}, \hat{B})$,

  5. Redistribute the newly found $\hat{\lambda}$ among the server machines for another EM round.
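A sketch of the data-splitting step (step 1), assuming the transport of the chunks and of $\lambda$ (Java RMI in our system) is handled elsewhere:

```java
import java.util.Arrays;

// Split the observation sequence into ceil(sqrt(T)) blocks of about sqrt(T)
// observations each; the client ships one block per server along with lambda.
public class CheckpointSplit {
    public static double[][] split(double[] o) {
        int T = o.length;
        int block = (int) Math.ceil(Math.sqrt(T));
        int nBlocks = (T + block - 1) / block;
        double[][] chunks = new double[nBlocks][];
        for (int c = 0; c < nBlocks; c++) {
            int from = c * block, to = Math.min(from + block, T);
            chunks[c] = Arrays.copyOfRange(o, from, to);
        }
        return chunks;
    }
}
```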

Distributed MHMM parameter estimate

An MHMM can easily split the responsibility calculations between several cluster nodes with minimum communication overhead in the following way (a sketch of the client-side orchestration appears after the list):

  1. For each parameter set $\lambda_1, \ldots, \lambda_m, \ldots, \lambda_M$ and sequence $O_1, \ldots, O_k, \ldots, O_K$, calculate the likelihood $p(O_k \mid \lambda_m)$ on the server nodes and communicate them back to the client,

  2. The client finds the responsibilities for each mixture component and sequence according to formula (4),

  3. The estimated mixture proportions $\hat{\alpha}_1, \ldots, \hat{\alpha}_m, \ldots, \hat{\alpha}_M$ are found on the client node according to (5),

  4. The server nodes find the $\hat{\lambda}_m^k$ estimates for parameter set $\lambda_m$ and sequence $O_k$ and send them back to the client,

  5. On the client node these newly computed parameters are weighted according to the responsibilities (6), (7), (8),

  6. The newly found HMM parameters $\hat{\lambda}_1, \ldots, \hat{\lambda}_M$ are disbursed back to the server nodes for the next round of EM training.
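A sketch of the client side of step 1, with a hypothetical Server interface standing in for the Java RMI stubs of the actual system; likelihood requests are dispatched round-robin and gathered into the $K \times M$ matrix consumed by (4):

```java
import java.util.concurrent.*;

// Gather log p(O_k | lambda_m) from the server nodes for one EM round.
public class DistributedMhmmRound {
    interface Server {                               // stand-in for an RMI remote interface
        double logLikelihood(int m, int k);          // step 1: log p(O_k | lambda_m)
    }
    public static double[][] gatherLogLik(Server[] servers, int M, int K)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(servers.length);
        Future<?>[][] f = new Future<?>[K][M];
        for (int k = 0; k < K; k++)
            for (int m = 0; m < M; m++) {
                final Server s = servers[(k * M + m) % servers.length]; // round-robin dispatch
                final int mm = m, kk = k;
                Callable<Double> task = () -> s.logLikelihood(mm, kk);
                f[k][m] = pool.submit(task);
            }
        double[][] logLik = new double[K][M];
        for (int k = 0; k < K; k++)
            for (int m = 0; m < M; m++)
                logLik[k][m] = (Double) f[k][m].get();   // blocks until each result arrives
        pool.shutdown();
        return logLik;
    }
}
```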

Discussion and conclusion

There are several advantages to our approach:

  • Classification is highly accurate with no data dropped from consideration,

  • Model parameters may have an intuitive physical interpretation (though not exploited in this study),

  • The MHMM implementation is distributed, such that:

    • Learning can take a larger number of samples (for improved accuracy),

    • Real-time analyte classification is enabled (it currently takes only 0.411 s to classify a 100 ms sample),

    • The checkpointing algorithm keeps the memory profile low on both the server and client sides without compromising the running time [16].

The need for a mixture model beyond a simple HMM comes from the observation that generally no more than half of hairpin blockades come from the same mode of hairpin molecule interaction with the nanopore (the modes correspond to principal components in the channel blockade stationary statistics profile). Other mode contributions require different probabilistic profiles for classification, which naturally leads to a mixture analysis problem. The method shown in Figure 3 does not introduce such modes at the HMM-processing stage, relying instead on the strengths of the SVM classifier directly.

Increasing the EM-learning model complexity beyond 4 levels and 12 mixture components increases the log-likelihood of the fully trained model, but does not lead to better prediction accuracy, as shown in Figure 6. The likelihood increase is caused by the model overfitting the data. Overfitting with HMM-profile models, however, is not found to be as detrimental to generalization performance as with other learning methods; the main penalty is that learning and classification times increase dramatically, as we need to estimate a progressively increasing number of parameters.

Since we did not computationally exhaust all possible parameter settings (number of components, number of levels and sample duration), we provide a rationale for the parameter choice we believe is optimal. In preliminary experiments learning on 9CG toggle samples with an MHMM of 15 toggle clusters, we consistently exhausted the number of components, many of them converging to the same simple blockade shown in Figure 4(a) at the top right. This observation prompted us to use no more than 12 components in the channel blockade signal-mode mixture model.

The choice of four blockade levels corresponds to the physical model of a DNA hairpin interacting with the nanopore [5]. From the physical perspective, the hairpin molecule can undergo different modes of capture blockade, such as the Intermediate Level (IL), Upper Level (UL) and Lower Level (LL) conductance states and spikes (S) [6]. When a 9 bp DNA hairpin initially enters the pore, the loop is perched in the vestibule mouth and the stem terminus binds to amino acid residues near the limiting aperture. This results in the IL conductance level. When the terminal base pair desorbs from the pore wall, the stem and loop may realign, resulting in a substantial current increase to UL. Interconversion between the IL and UL states may occur numerous times, with UL possibly switching to the LL state. The LL state corresponds to binding of the stem terminus to amino acids near the limiting aperture, but in a different manner from IL. From the LL bound state, the duplex terminus may fray, resulting in extension and capture of one strand in the pore constriction, which produces the short-term S state. Transitions between the levels IL ↔ UL ↔ LL → S are allowed at any time during the analysis procedure. The spikes model, as described in [16], could possibly be used to increase prediction accuracy. However, in the scenario discussed in this manuscript, use of such additions did not lead to higher performance, since the primary blockade modes shown in Figures 4(a) and 4(b) are void of spikes.

A demo program implementing the distributed MHMM analysis framework is available free of charge on our web site at http://logos.cs.uno.edu/~achurban.