OpenKnowledge for peer-to-peer experimentation in protein identification by MS/MS
- First Online:
- 6.2k Downloads
Traditional scientific workflow platforms usually run individual experiments with little evaluation and analysis of performance as required by automated experimentation in which scientists are being allowed to access numerous applicable workflows rather than being committed to a single one. Experimental protocols and data under a peer-to-peer environment could potentially be shared freely without any single point of authority to dictate how experiments should be run. In such environment it is necessary to have mechanisms by which each individual scientist (peer) can assess, locally, how he or she wants to be involved with others in experiments. This study aims to implement and demonstrate simple peer ranking under the OpenKnowledge peer-to-peer infrastructure by both simulated and real-world bioinformatics experiments involving multi-agent interactions.
A simulated experiment environment with a peer ranking capability was specified by the Lightweight Coordination Calculus (LCC) and automatically executed under the OpenKnowledge infrastructure. The peers such as MS/MS protein identification services (including web-enabled and independent programs) were made accessible as OpenKnowledge Components (OKCs) for automated execution as peers in the experiments. The performance of the peers in these automated experiments was monitored and evaluated by simple peer ranking algorithms.
Peer ranking experiments with simulated peers exhibited characteristic behaviours, e.g., power law effect (a few dominant peers dominate), similar to that observed in the traditional Web. Real-world experiments were run using an interaction model in LCC involving two different types of MS/MS protein identification peers, viz., peptide fragment fingerprinting (PFF) and de novo sequencing with another peer ranking algorithm simply based on counting the successful and failed runs. This study demonstrated a novel integration and useful evaluation of specific proteomic peers and found MASCOT to be a dominant peer as judged by peer ranking.
The simulated and real-world experiments in the present study demonstrated that the OpenKnowledge infrastructure with peer ranking capability can serve as an evaluative environment for automated experimentation.
This study demonstrates the use of peer ranking of multi-agents (peers) on a peer-to-peer scientific experimentation platform. Experiments with simulated peers were conducted to show that sophisticated peer ranking algorithms are applicable to the OpenKnowledge infrastructure. Experimental demonstration of peer ranking, albeit with a simple algorithm, of actual computational services for tandem mass spectrometry (MS/MS) protein identification were still useful in monitoring the performance of peers during their interactions. The rest of this background section will describe the concepts and technologies applicable to this study.
In Figure 1, frame 1 (top left) depicts the peer to peer network of computers, some being operated by humans others being automated programs (such as database services). Protocols are designed by humans and contributed to the peer to peer network by an advertisement mechanism (using tags to give a simple keyword-based characterisation of the protocol). In frame 1, peer p1 contributes an experimental protocol with three interacting roles (depicted as black circles in the diagram). Other peers acquire experimental protocols by posting a query to the peer network and receiving a list of protocols with (approximately) matching tags. This is analogous to searching for a Web page except that routing through the network replaces a centralised system of query servers. In frame 2, peer p2, acquires the experimental protocol contributed by p1. If a peer finds an experiment that is useful to it then it may offer it for subscription by appropriate collaborating peers. If the choice of peers is already determined (e.g. when the experiment involves coordinating a known collection of Web services) then there is no need for subscription by peers other than the one initiating the protocol. If, however, the initiator does not know which peers might undertake some of the roles in the experimental protocol then it may offer it for subscription (and in turn it will be offered subscribers as these become available). In frame 3, peer p2 offers the experimental protocol for subscription and peers p3 and p4 subscribe to appropriate roles. Once a protocol is fully subscribed it can be used (via the OpenKnowledge interpreter) to coordinate the interaction between subscribed peers. This can be done by maintaining the coordination on a server (the traditional method); or by distributing the roles to the peers (a "pure" peer to peer method); or by selecting one of the peer group to act as host. The latter method is used in the current OpenKnowledge system but the experimental protocol is defined independently of this choice (so the protocols don't have to change if the infrastructure for coordination changes). In frame 4, peer p4 is hosting the coordination between p2, p3 and itself.
Crucial to this view of experimentation (Figure 1) is the idea that a single, formal specification describes the abstract experiment design and is also capable of being automated in a simple, uniform way to perform actual experiments. This basic idea is not new-the idea of executable specifications (explained in the context of LCC in ) permeates declarative programming (giving a basis for logic programming and functional programming) while in bioinformatics there exist systems such as Taverna  that provide design tools for scientific workflow specifications and, in the MyExperiment  system, allow these to be tagged with meta-data (analogous to frame 1 in Figure 1) while also supplying infrastructure for executing these workflows (analogous to frame 4 in Figure 1). Where OpenKnowledge differs from Taverna (and similar bioinformatics workflow and pipeline systems) is that we generalise these principles to a peer to peer setting and for generic (not only bioinformatics) applications such as emergency response (see the OpenKnowledge Web site at http://www.openk.org for example applications and scenarios). Specifically:
OpenKnowledge is not specifically targeted at workflow. Although it can be used to implement workflows, its core language (LCC, described later) is a generic, declarative language for describing coordination between processes that synchronise through message passing. In later sections of this paper we show practical bioinformatics experiments that require the facilities of this sort of declarative programming language.
OpenKnowledge is targeted at sharing of detailed experimental protocols. By analogy to the current practice of writing scripts for combining programs in bioinformatics experiments, we assume that a large scale system for sharing experimental protocols would have to deal with large numbers of quite specific protocols, often designed independently by peers. Effective methods of finding, reviewing and repeating experimental protocols then become an important component of the basic science because without such methods it would be unclear what science had been done.
When conducting experimental studies (or amassing information to support experimental studies) from Internet sources, each scientist (or group) may adopt a variety of roles as information providers, consumers or modifiers. Often these roles are narrowly specific, as for example the role one adopts when canvassing trusted sources for information about specific proteins and applying assessment metrics to these that are appropriate to a particular style of experimentation. In science, the roles we adopt and the specific ways in which we discharge the obligations of those roles are fundamental to establishing peer groups of "like minded" scientists in pursuit of related goals by compatible means. The need to be precise about such obligations is strongly felt in traditional science-hence the use of rigid conventions for description of experimental method and monitoring of its execution via laboratory notebooks, enabling experiments to be monitored, replicated and re-used. Analogous structure is beginning to emerge in Internet based science. For example the structure of Web service composition in Taverna provides a record of the associations between services when using these to manipulate scientific data. Like Taverna, we describe interactions. Unlike Taverna, our interaction models are part of a system for peer-to-peer communication in which specifications of complementary roles in experimentation are shared between peers as a means of communicating and coordinating experiments.
The lightweight coordination calculus (LCC)
An interaction model in LCC is a set of clauses each of the form R :: D, where R denotes the role in the interaction and D is the definition of the role. Roles are of the form a(T, P), where T gives the type of role and P is an identifier for the individual peer undertaking that role. The definition of performance of a role is constructed using combinations of the sequence operator (then) or choice operator (or) to connect messages and changes of role. Messages are either outgoing to another peer in a given role (= >) or incoming from another peer in a given role (< =). Message input/output or change of role can be governed by constraints (connected by the "←" operator) which may be conjunctive or disjunctive. Constraints can be satisfied via shared components registered with http://www.openk.org, so that complex (possibly interactive) solving methods can be shared along with interaction models; or they can be calls to services with private data and reasoning methods. Variables begin with upper case characters.
The model in Figure 2 defines an interaction between peers in two roles, r1 and r2. In the interaction some peer, A, in role r1, would send the message, m1, to peer B in role r2; then it would wait for a reply from peer B with message m2. Conversely, peer, B, in role r2, would wait for the message, m1, from peer A in role r1; then it would send a reply to peer A with message m2 if it can satisfy the constraint c.
Protein Identifications by MS/MS
Tandem mass spectrometry (MS/MS) is a technique to measure the mass of sample, and has been used for protein sequence analysis for more than two decades . A number of informatics algorithms have been developed to identify protein sequences from the MS spectra with various degrees of accuracy for different types of proteins [5, 6, 7]. The basic idea to correlate the spectrum data with theoretical sequences derived from known sequences and their spectra is complicated by the events happened before, during, and after protein synthesis in the cells, i.e., pre-, co-, and post-translational modifications. As such, MS/MS protein identification techniques have some specialisation for handling specific issues. For instance, some techniques are better to reduce complexity of MS spectra and others are better handle modifications or mutations.
The two principal approaches for protein identifications by MS/MS include peptide fragment fingerprinting (PFF) and de novo sequencing. PFF algorithms, is performed specifically for candidate peptides extracted from a database by building theoretical model spectra from theoretical peptides and measuring the similarity between the experimental spectra and the modelled ones. Most of the search engines, including MASCOT , OMSSA , SEQUEST , are available both as standalone programs enquiring local copy of public genomic-translated databases (GTDB), or as web services connected to online GTDBs. The main drawback of this approach is that it can only be used in situations where the genome has been sequenced and all predicted proteins for the genome are known. This approach is not suitable for the proteins with missing post-translational modifications (PTMs) and from unsequenced genomes.
The de novo sequencing approach infers knowledge about the peptide sequence independently of any information extracted from a pre-existing protein or DNA database . Then, the inferred complete or partial sequences are compared to theoretical sequences according to sequence similarity. It is noted that de novo can take a pseudo PFF approach to building a pseudo sequence database on the fly: the sequences are generated by determining all possible amino acid compositions with a total mass matching the experimental precursor mass, and then, for each composition, by determining all possible amino acid permutations. Subsequently, theoretical spectra are computed from the pseudo sequences and common peaks between experimental and theoretical spectrum are counted to find the best matched sequences. Some information, e.g., more accurate precursor masses and range in the number of amino acids, would help reduce the combinational complexity.
For improving the accuracy of the MS/MS protein identification, combining knowledge of applicable methods is plausible . However, this idea of combining methods is yet to take on seriously, let alone to implement and validate. And due to the lack of peer performance monitoring in current experimentation platforms, it would be difficult to evaluate plausible combinations of methods. Thus, a peer ranking facility was implemented and experimented under the peer-to-peer OpenKnowledge infrastructure for performance monitoring. With the peer ranking facility, we also test the feasibility of a simple interaction model that combines both PFF and de novo methods of MS/MS protein identification.
Experimentation with peer ranking
Our peer ranking algorithm for monitoring the simulated peers follows the style of the PageRank algorithm . It works by assigning a ranking to a peer at any given time as a function of its previous ranking modified by the rankings of each peer with which it has interacted (we refer below to such peers as supporting peers). The modification made by each of these peers is spread equally across the set of peers which it supports. In the algorithm, we assume that we have available minimal information about interactions-only knowing for each interaction which peers were involved in it and whether it completed successfully. This information is less than that we can obtain routinely (since the coordinating peer has access to all the structure of a completed interaction) but we make this simplification to make the implementation less domain-dependent and to explore how far we can go with the bare minimum of data. For proteomic experiments, we implemented an even simpler algorithm which should be adequate to demonstrate that certain peers are dominant after a period of interactions in real-world experiments.
This study aims to test peer ranking under OpenKnowledge infrastructure using simulated experiments and real-world bioinformatics experiments. The simulated experiments involved simulated peers assigned with different roles for interactions. The real-world bioinformatics experiments were conducted with multiple peers performing different methods of peptide fragment fingerprinting (PFF) and de novo sequencing under OpenKnowledge infrastructure, in which the interaction among these peers was coordinated according to an interaction model.
Peer ranking algorithm
The peer ranking algorithm used for monitoring simulated peers in this study was based on PageRank , which notations and formulation were specified and implemented as follows:
A data set of initial rankings for each peer, each of the form cr(P, T) = R
A data set of records of successful or failed interactions, each of the form im(T, I, PI, CP)
P is the identifier of a peer
T is either "pos" (denoting the positive ranking) or "neg" (denoting the negative ranking)
R is the numerical magnitude of the (positive or negative) ranking, Rp is positive ranking, Rn is negative ranking
I is the identifier of the interaction model
Pi is the initiating peer for an interaction
CP is the set of peers that subscribe to the interaction initiated by Pi
- ◦ For i iterations:
- ◦ For each peer, P:
▪ Calculate rank(P) = (Rp, Rn)
▪ assign cr(P, pos) = Rp
▪ assign cr(P, neg) = Rn
r(S, T) is the sum of the current ranks for all the peers in peer set, S, each peer's rank being divided by the number of peers it supports (to apportion the influence of the rank evenly across those supported peers). If T is "pos" then this is the sum of ranks from positive associations or if T is "neg" this is the sum of ranks from negative associations. In other words, r(S, T) is the sum for all P in peer set S of cr(P, T)/L, where L is the number of peers supported by P.
Experiments on simulated peers
This interaction model had no representation of the provenance of the data but there was a point in the interaction, when the data seeker sent its request to a data source, when an appropriate data source needed to be chosen. At this point the data seeker could, if it so chose, use peer rank information to select the highest ranking peer. The rank of a data source peer in this interaction, in turn, depended on how frequently the data seekers with which it interacted found the quality of the data it supplied to be acceptable (otherwise the acceptable constraint in the data seeker role failed and consequently the interaction overall failed). This gave a feedback loop from supply to ranking via the peers that requested data.
To make our simulation as simple as possible we represented data quality as a number ranging from 0 (lowest quality) to 1 (top quality). Instead of storing actual data elements, the peers in our simulation stored these quality values (representing the quality of a data item). Although in the OpenKnowledge framework it is possible for any peer to coordinate interactions, and therefore to have any blend of control over service orchestration from a highly centralised system to a pure peer-to-peer arrangement, in the simulation a single peer represented the central database and (to keep it even simpler) it generated a set of 10 data elements with quality chosen randomly to be between 0 and 1, so the data elements that were obtained from it by peers tended to have a normal distribution with mean of 0.5. We assumed that the quality that was acceptable to peers was higher than this mean, setting it at 0.8 for all peers, indicating that peers seeking data from the database obtained a mixture of acceptable (mean quality greater than 0.8) and unacceptable results, so the database built up a negative ranking as well as a positive one.
Experiments on MS/MS protein identification
Roles and LCC specifications
As depicted in Figure 4, a peer spectra_input, the interaction initiator, uploaded the mass spectrum, and randomly selected an interaction model to execute. If the "de novo + similarity search" interaction model was selected, peers assigned to denovo_approach and similarity_search were invoked. The analysis result was passed to peer output_interface for evaluation. Similarly, if the PFF analysis interaction model was executed, peers responsible for pff_approach were invoked. The analysis result was passed to peer output_interface for evaluation as well. Evaluation result, success or failure of interaction model, was passed to spectra_input in the end.
Figure 5 specified the LCC specification for the PFF interaction pathway and Figure 6 listed LCC codes for the de novo sequencing + database searching pathway. Two roles, spectra_input and output_interface, were specified in both interaction pathways with the same arguments and interactions. Peers subscribed to the roles of spectra_input were responsible for uploading MS/MS spectra and selecting preferred routes for the spectrum interpretation. Peers subscribed to the role output_interface were in charge of the re-formating, filtering and displaying of the final result yielded by the interaction model executed. Other roles specified in the two LCC interaction models included pff_approach, denovo_approach, and similarity_search, which performed PFF, de novo sequencing, and database searching, respectively.
OpenKnowledge Components (OKCs) were developed to access and manipulate web servers and/or local programs for MS/MS identification, including algorithms OMSSA and MASCOT subscribed to the role pff_approach, PepNovo Win32 Executable  and Lutefisk XPv1.0  performed the role denovo_approach, and MS-BLAST  subscribed to the role similarity_search. In this experiment, parameters for MASCOT was set with database NCBI nr, enzyme trypsin, and all the other parameters were set to default settings. Similarly, OMSSA was run with database nr, enzyme trypsin, maximum missed cleavages set as "2", minimum charge to start using multiply charged products set to "2", all the optional species being selected, and all the other default parameters for ion trap spectrometers. Both Lutefisk and PepNovo were run with their default parameters for doubly charged tryptic peptides on ion trap MS. MS-BLAST was run with default parameters and database nrdb95.
The scores and threshold values for MASCOT, OMSSA, and MS-BLAST.
Name of score
Perkins et al., 1999
Geer et al., 2004
Habermann et al., 2004
The experiment was based on a benchmark dataset with doubly charged tryptic peptides obtained from low-energy ion trap LC/MS/MS runs . Each round went through only one of the possible routes (Figure 4) with one peer to perform the roles specified in the associated LCC codes (Figures 5 and 6).
A single round started with uploading the MS spectra data of peptides to the spectra_input peer. In addition to spectrum uploading, the OKC developed for this peer allowed the peer to select the analysis approach, which was either (1) PFF approach or (2) de novo sequencing and database similarity searching approach, for MS/MS protein identification. In the cases (e.g., the present study) when none of the approaches was preferred, the system randomly selected one approach. In each route, one of the peers subscribed to the role pff_approach, denovo_approach, or similarity_search, was randomly selected by the system in the experiment. The result of protein identification was sent to the peer performing the role output_interface for reformatting and filtering according to the criteria as shown in Table 1. The failure or success message of the execution was finally sent to the peer spectra_input in the end of the interaction model. The peer ranking for monitoring the performance of the peers was simply based on counting the numbers of failures and successes.
Results and discussion
The present study obtained results from experiments on the basic behaviours of simulated peers monitored by PageRank-style peer ranking under OpenKnowledge infrastructure and on the feasibility of combining and ranking real-world services of peptide fragment fingerprinting (PFF) and de novo sequencing for MS/MS protein identification.
PageRank-style ranking of simulated peers
This is an interesting finding as in many data sharing applications we consider the information about the source of the data (e.g., provenance or accuracy) to influence which source we use. Traditionally, provenance was incorporated into data-sharing interactions by devising ontologies for provenance and supplying task-/domain-specific methods for propagating provenance information between peers. We could do this in the OpenKnowledge infrastructure by writing interaction models in LCC that specify provenance propagation along with the message passing needed for "raw" data transmission. This, however, takes engineering effort so it is useful if peer ranking could, alone, take some of the strain of assigning reputation to peers in a data sharing context.
Initially, the central database would become the most popular peer as the other peers had no data so if any of them was asked for data the interaction definitely would fail, increasing their negative ranking while their positive ranking remained unchanged. As the other peers gained data from the database through interaction then those peers lucky enough to get high quality data could then offer it to other peers, raising the positive ranking of the supplier and giving the recipient high quality data with which it could increase its rating.
Peer ranking of peers for MS/MS protein identification
Peer ranking in MS/MS protein identification experiment.
Peer ranking score
Number of successful runs
Number of failed runs
output_interface and spectra_input
PFF algorithm MASCOT was rated better in peer ranking than the two de novo sequencing tools. This result was consistent to the previous finding that PFF analysis was more accurate than de novo sequencing . For de novo sequencing approach, the ratings of PepNovo were higher than that of Lutefisk. Even though this experiment did not evaluate how accurate was each round of protein identification for each sequence output, the peer ranking results in this experiment were consistent with the performance evaluation result of different de novo sequencing programs by Frank and Pevzner .
As seen in the experiments of the present study, peer ranking is applicable to monitoring peer-to-peer interactions under the OpenKnowledge infrastructure. With more OKCs built for more practical tasks (not limited to protein identification) and various performance feedback mechanisms like peer ranking, we would be able to include more peers for large-scale experimentation on various other issues such as trust .
This study demonstrated a use of peer ranking to support automated experimentation in open and peer-to-peer environments. Both simulated and real-world experiments in the present study showed that the OpenKnowledge infrastructure with peer ranking capability can serve as an evaluative environment for automated experimentation.
This work was a part of the OpenKnowledge project http://www.openk.org and supported by the Sixth Framework of European Union grant FP6-027253. The work of SL was also supported by the University of Macau research grant MYRG190(Y1-L3)-ICMS11-LSW.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.