Appendix
Analysis of MBM as a voting rule
Motivation
In this section, we justify our use of MBM as a voting rule in SBA to combine the assignments for each structure in the NMA ensemble. To that end, we show that MBM is a maximum likelihood estimator (MLE). Maximum likelihood estimation is a general technique to estimate the unknown parameters of a distribution, given a set of observed data values derived from the distribution. It returns the parameters that maximize the likelihood of observing the set of data values.
Our proof demonstrates that our voting scheme is sound and optimal by showing that our algorithm returns the assignment that maximizes the likelihood. In our case, the set of observed data values comprises the assignments for each model in our NMA ensemble, and the unknown parameters of the distribution are the unknown correct assignments. An MLE has many desirable properties: in particular, it is consistent, which means that it converges to the true value of the estimated parameter (Wasserman 2004). This means that, as the number of models in our NMA ensemble increases, the assignments returned by our voting scheme converge to the correct assignments. This proof depends on our assumption that the assignments computed for each model are independent and identically distributed, according to our noise model (which is described below).
First, we formulate our algorithm as a voting scheme. In voting, there are multiple voters and multiple candidates. Each voter may vote for one of the candidates (or a subset of them), or may rank the alternatives. In our setting, a vote is the set of resonance assignments for a structure in our NMA ensemble. Our voting scheme aggregates these preferences to compute “consensus” assignments, which are returned by our algorithm.
The idea of using MLE in voting was first proposed by de Caritat (Marquis de Condorcet) (1785), who analyzed 2- and 3-candidate elections; it was extended two centuries later to an arbitrary number of candidates by Young (1995). However, none of the voting rules studied in these works corresponds to a widely used voting rule. Conitzer and Sandholm (2005) then studied which of the well-known voting rules can be viewed as MLEs. For this purpose, they adopted the following model and assumptions: there exists an (unknown) ground truth winner (or ranking) of the election, \(w\), and each voter’s vote is a noisy measure of this ground truth. Due to noise, each voter’s vote may differ from the ground truth. The noise model gives the probability of observing a vote \(v_i\) for voter \(i\), given the ground truth winner \(w\). The votes are independent given \(w\), and identically distributed. Under these assumptions, given a set of votes \(v_1, \ldots, v_m\), where \(m\) is the number of votes, a voting rule is an MLE of the correct winner \(w\) if it returns a winner \(w_o\) that maximizes the likelihood of the observed votes. That is, it returns:
$$ \hbox{arg}\mathop{\hbox{max}}\limits_{w_o} p(v_1,v_2,\ldots,v_m|w_o) = \hbox{arg}\mathop{\hbox{max}}\limits_{w_o} {p(v_1|w_o) p(v_2|w_o) \ldots p(v_m|w_o)} $$
where \(p(v_1, v_2, \ldots, v_m | w_o)\) is the probability of observing \(v_1, v_2, \ldots, v_m\) if the (unknown) ground truth were \(w_o\).
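As a toy illustration of this decision rule (the votes, candidate labels, and value of \(p\) below are hypothetical, not drawn from our data), the MLE winner under a symmetric noise model is simply the candidate maximizing the product of per-vote likelihoods:

```python
import math

def mle_winner(votes, candidates, p):
    """Return the winner w_o maximizing p(v_1|w_o) * ... * p(v_m|w_o).

    Assumed noise model: a vote equals the ground truth with probability p,
    and any particular wrong candidate with probability
    q = (1 - p) / (len(candidates) - 1), so that p > q.
    """
    q = (1.0 - p) / (len(candidates) - 1)

    def log_likelihood(w_o):
        # Log of the product of independent per-vote likelihoods.
        return sum(math.log(p if v == w_o else q) for v in votes)

    return max(candidates, key=log_likelihood)

votes = ["A", "A", "B", "A", "C"]
print(mle_winner(votes, ["A", "B", "C"], p=0.7))  # → A
```

Because \(p > q\), the likelihood is maximized by the candidate that agrees with the largest number of votes; here the rule reduces to plurality, matching the Conitzer–Sandholm analysis for this simple noise model.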
Proof that MBM is an MLE
We now show that our voting rule, MBM, is an MLE of the correct assignments. In our setting, the ground truth winner is the correct (and unknown) set of (peak, residue) assignments. The individual votes correspond to the assignments computed separately for each structure in the NMA ensemble by an SBA algorithm (Fig. 2). The MBM is computed on a BPG in which one set of nodes corresponds to peaks and the other set to residues. Each edge weight is the number of structures that assign (“vote for”) the corresponding (peak, residue) pair.
We assume the following noise model: for each template in the ensemble (“voter”), each peak is assigned correctly with probability p and incorrectly with probability q, where p > q, independently of the other peaks. If the resulting assignments map more than one peak to the same residue, each such peak is reassigned (again with probability p to the correct residue and probability q to an incorrect one). We further assume that the assignments corresponding to individual models are independent, given the correct assignments. The noise is therefore independent and identically distributed.
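A minimal simulation of this noise model may make it concrete (the ensemble size, the value of p, and the uniform choice among incorrect residues are illustrative assumptions; the model itself only constrains p > q):

```python
import random

def noisy_vote(truth, p, rng):
    """Sample one model's assignment under the assumed noise model.

    Each of the n peaks is assigned its correct residue with probability p,
    or (as an illustrative assumption) a uniformly random incorrect residue
    otherwise; any peaks that end up sharing a residue are redrawn until
    the assignment is one-to-one.
    """
    n = len(truth)
    residues = list(range(n))
    assign = [None] * n
    pending = list(range(n))  # peaks still needing a residue
    while pending:
        for peak in pending:
            if rng.random() < p:
                assign[peak] = truth[peak]
            else:
                assign[peak] = rng.choice([r for r in residues if r != truth[peak]])
        # Collect peaks that collide on the same residue and redraw them.
        by_residue = {}
        for peak, r in enumerate(assign):
            by_residue.setdefault(r, []).append(peak)
        pending = [pk for pks in by_residue.values() if len(pks) > 1 for pk in pks]
    return assign

rng = random.Random(7)
truth = list(range(6))
vote = noisy_vote(truth, p=0.8, rng=rng)
```

The reassignment loop mirrors the collision-resolution step of the model: the sampled vote is always a one-to-one (peak, residue) assignment, and for p close to 1 most peaks agree with the ground truth.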
With this noise model, the probability of a given assignment (vote) \(i\) in which \(k_i\) of the \(n\) peaks are matched correctly is (proportional to) \(p^{k_i} q^{n-k_i}\). The joint probability of all \(m\) votes corresponding to all \(m\) templates together is proportional to
$$ \begin{aligned} \prod_i p^{k_i} q^{n-k_i} &= p^a q^b \\ a &= \sum_i k_i \\ b &= nm - \sum_i k_i \end{aligned} $$
(2)
where the product and the sums in (2) are from i = 1,…,m.
An MLE of the correct assignment chooses an assignment \(w_o\) such that (2) is maximized. Fix a particular protein and its NMA ensemble, so that n and m are constants. Then, since p > q and nm is a constant, (2) is maximized when \(\sum_i k_i\) is maximized. \(\sum_i k_i\) counts, over all structural models, the number of (peak, residue) assignments that coincide with the (peak, residue) assignments in \(w_o\); that is, it is exactly the total weight of the matching \(w_o\) in the BPG. This quantity is maximized by MBM. Therefore, MBM is an MLE of the correct assignment.\(\hfill\square\)
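The final step can be checked numerically on a toy instance (the ensemble of votes and the values of p and q below are hypothetical, and an exhaustive search over matchings stands in for a proper MBM algorithm): the maximum-weight matching on the vote-count BPG coincides with the brute-force likelihood maximizer.

```python
import math
from itertools import permutations

def vote_weights(votes, n):
    """BPG edge weight W[peak][res] = number of models voting (peak, res)."""
    W = [[0] * n for _ in range(n)]
    for assign in votes:
        for peak, res in enumerate(assign):
            W[peak][res] += 1
    return W

def max_weight_matching(W):
    """Maximum-weight bipartite matching by exhaustive search (tiny n only)."""
    n = len(W)
    return max(permutations(range(n)),
               key=lambda perm: sum(W[i][perm[i]] for i in range(n)))

def log_likelihood(w_o, votes, p, q):
    """log(p^a * q^b), where a counts agreements between w_o and the votes."""
    n, m = len(w_o), len(votes)
    a = sum(1 for v in votes for i in range(n) if v[i] == w_o[i])
    return a * math.log(p) + (n * m - a) * math.log(q)

# Hypothetical ensemble: m = 3 votes over n = 3 peaks.
votes = [(0, 1, 2), (0, 2, 1), (0, 1, 2)]
mbm = max_weight_matching(vote_weights(votes, 3))
mle = max(permutations(range(3)),
          key=lambda w: log_likelihood(w, votes, p=0.8, q=0.1))
print(mbm == mle)  # → True
```

Because the matching weight of \(w_o\) equals \(\sum_i k_i\), the two maximizations range over the same objective (up to the monotone map \(a \mapsto p^a q^{nm-a}\) with p > q), so they select the same assignment.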