Encompassing new use cases - level 3.0 of the HUPO-PSI format for molecular interactions
Systems biologists study interaction data to understand the behaviour of whole cell systems, and their environment, at a molecular level. In order to effectively achieve this goal, it is critical that researchers have high quality interaction datasets available to them, in a standard data format, and also a suite of tools with which to analyse such data and form experimentally testable hypotheses from them. The PSI-MI XML standard interchange format was initially published in 2004, and expanded in 2007 to enable the download and interchange of molecular interaction data. PSI-XML2.5 was designed to describe experimental data and to date has fulfilled this basic requirement. However, new use cases have arisen that the format cannot properly accommodate. These include data abstracted from more than one publication such as allosteric/cooperative interactions and protein complexes, dynamic interactions and the need to link kinetic and affinity data to specific mutational changes.
The Molecular Interaction workgroup of the HUPO-PSI has extended the existing, well-used XML interchange format for molecular interaction data to meet new use cases and enable the capture of new data types, following extensive community consultation. PSI-MI XML3.0 expands the capabilities of the format beyond simple experimental data, with a concomitant update of the tool suite which serves this format. The format has been implemented by key data producers such as the International Molecular Exchange (IMEx) Consortium of protein interaction databases and the Complex Portal.
PSI-MI XML3.0 has been developed by the data producers, data users, tool developers and database providers who constitute the PSI-MI workgroup. This group now actively supports PSI-MI XML2.5 as the main interchange format for experimental data, PSI-MI XML3.0 which additionally handles more complex data types, and the simpler, tab-delimited MITAB2.5, 2.6 and 2.7 for rapid parsing and download.
Keywords: Molecular interactions; Protein-protein interaction; Protein complexes; Data standards; XML; HUPO-PSI; PSI-MI
HUPO: Human Proteome Organization
IMEx Consortium: International Molecular Exchange Consortium
PSI: Proteomics Standards Initiative
Understanding the interaction networks that govern biological systems is essential to fully decipher the molecular mechanisms ensuring cellular biology and tissue homeostasis. Interactions between molecules result in both the assembly of stable functional protein complexes, which form the molecular machinery of the cell, and transient, often regulatory, networks of weakly associating molecules. Together these drive and regulate cellular processes, cell-cell interactions and cell-matrix interactions. The capture and curation of published interaction data has been the work of interaction databases for many years, and many of these resources have collaborated through the Molecular Interaction workgroup of the Human Proteome Organization Proteomics Standards Initiative (HUPO-PSI) to create and maintain community data formats and standards. These formats and standards have enabled the systematic capture, reuse and exchange of these data and the building of tools to enable network contextualization and analysis of -omics data.
Version 1.0 of PSI-MI XML was published in 2004 and enabled the description of simple protein interaction data. The format was widely implemented and supported by both software tool developers and data providers, but was soon found to be too limited in its scope. To facilitate rich, integrative analyses, many databases wished to describe and exchange the full wealth of data generated by interaction experiments, including a detailed description of experimental conditions and features such as binding sites or affinity tags on participating molecules. In order to make this possible, the Molecular Interactions working group of the HUPO-PSI further extended the XML schema to enable the annotation of a wider range of data. PSI-MI XML2.5 expanded the type of interactors to encompass any molecule or complex of molecules which can be described in the ‘interactor type’ branch of the accompanying controlled vocabulary (PSI-MI CV). Sequence or positional features on a participant molecule that are relevant for the interaction can be described in a featureList, again using an appropriate controlled vocabulary term. The PSI-MI XML2.5 schema allows two different representations of interactions. The compact format was designed for larger datasets. In this, the repetitive elements of a larger set of interactions, such as the interactors and experiments, are only described once, in the respective list elements, and subsequently referred to. The extended format groups all related data closely together and was designed to simplify parsing. This version of the schema also supports the hierarchical build-up of complexes from component sub-complexes.
Version 2.5 has proven to be, and will continue to be, capable of capturing the vast majority of molecular interaction data, generated by techniques such as protein complementation assays, affinity capture, biophysical measurements and enzyme assays. It successfully describes genetic as well as physical interactions, and can also be used to hold predicted interactions or the results of text-mining exercises, all clearly described as such by appropriate controlled vocabulary terms. Consequently, this version of the format will continue to be supported by the PSI-MI community for the foreseeable future. However, use cases have arisen which cannot be adequately described within this XML schema, and in 2013 it was decided that the field had advanced sufficiently to justify moving to the next level in this deliberately tiered approach to describing interaction data, and to produce PSI-MI XML3.0.
A community standard will only remain of use to that community if it meets the needs of current and future users, and if these users have bought into, and contributed to, the update process. Prior to creating any changes in the schema, a questionnaire was sent out to known users of the format to establish how PSI-MI XML2.5 was currently being utilised, and to identify cases in which the format was not meeting user needs. Once an initial list of requirements had been established, use cases and examples of each were collated. Initial proposals or, in some cases, multiple proposals for tackling each case were drawn up and circulated to mailing lists and known format users. Each proposal, and any subsequent feedback, was then discussed in detail at the 2014 HUPO-PSI meeting by attendees of the MI work track. The final list of use cases was agreed upon and the changes to PSI-MI XML2.5 described below approved and subsequently implemented. Additional file 1 contains an example file showing the representation of the molecular interaction data from a single publication in PSI-MI XML3.0.
Enhancements to the description of molecule features
The position attribute type and interval attribute type for featureRange have been updated. In PSI-MI XML2.5 these are of the type ‘unsignedLong’, which means that features described in this version can only have positive range positions. This has been updated to ‘long’ in PSI-MI XML3.0 to enable negative positions, for example designated gene promoter regions, to be captured (Fig. 1, Additional file 2).
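The effect of moving from ‘unsignedLong’ to ‘long’ can be sketched in a few lines of Python. This is a minimal illustration, not part of the PSI-MI tool suite: the fragment is simplified (no schema namespace, no startStatus/endStatus children), the positions are invented, and the helper names parse_position and read_range are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment only: a region upstream of position 1, e.g. a promoter.
# The schema namespace and the startStatus/endStatus children are omitted.
FRAGMENT = """
<featureRange>
  <begin position="-120"/>
  <end position="-35"/>
</featureRange>
"""


def parse_position(text, signed=True):
    """Parse a position; signed=False mimics the XML2.5 'unsignedLong' type."""
    value = int(text)
    if not signed and value < 0:
        raise ValueError(f"negative position {value} is not a valid unsignedLong")
    return value


def read_range(xml_text, signed=True):
    """Return the (begin, end) positions of a simplified featureRange."""
    root = ET.fromstring(xml_text)
    begin = parse_position(root.find("begin").get("position"), signed)
    end = parse_position(root.find("end").get("position"), signed)
    return begin, end
```

Calling read_range(FRAGMENT) accepts the negative coordinates, whereas read_range(FRAGMENT, signed=False) raises a ValueError, which is the behaviour a strict XML2.5 validator would be expected to show for the same values.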
The position and effect of a mutation can be systematically captured using the featureRange positions and the featureType element. However, in PSI-MI XML2.5 there is no defined way to capture the actual sequence change. In PSI-MI XML3.0, a new element named resultingSequence has been added at the level of the featureRange element (Fig. 2, Additional file 3). The resultingSequence element contains an originalSequence element to describe the original sequence, a newSequence element which contains the mutated sequence and an xref element, which would be optional, and could be used to add external cross references such as Ensembl cross references to single nucleotide polymorphisms (SNPs). The newSequence and originalSequence are not required if an xref element is provided.
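As a rough sketch of how a consumer might read the new element, the Python below parses a simplified featureRange carrying a resultingSequence. The element names come from the description above; the fragment itself (positions, residues, missing namespace) is invented for illustration, and describe_mutation is a hypothetical helper rather than an existing API.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment: a single-residue substitution at position 42.
# The values are made up and the schema namespace is omitted.
FRAGMENT = """
<featureRange>
  <begin position="42"/>
  <end position="42"/>
  <resultingSequence>
    <originalSequence>Y</originalSequence>
    <newSequence>F</newSequence>
  </resultingSequence>
</featureRange>
"""


def describe_mutation(xml_text):
    """Summarise the sequence change in a featureRange, or None if absent."""
    feature_range = ET.fromstring(xml_text)
    resulting = feature_range.find("resultingSequence")
    if resulting is None:
        return None
    original = resulting.findtext("originalSequence")
    new = resulting.findtext("newSequence")
    has_xref = resulting.find("xref") is not None
    # The sequences may be left out only when an xref is supplied instead.
    if (original is None or new is None) and not has_xref:
        raise ValueError("resultingSequence needs both sequences or an xref")
    return {
        "begin": int(feature_range.find("begin").get("position")),
        "end": int(feature_range.find("end").get("position")),
        "original": original,
        "new": new,
        "has_xref": has_xref,
    }
```

The check in the middle mirrors the rule stated in the text: originalSequence and newSequence are only dispensable when an xref is present.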
It is now possible to add several feature detection methods in the feature element by making the featureDetectionMethod element repeatable in the feature element (Additional file 4). This will enable users to describe cases in which a feature has been recognized by more than one method, for example a post-translational modification (PTM) being identified by both a specific antibody and by mass spectrometry. The change was made to maintain backwards compatibility with earlier versions of the schema, a goal that was set by the work group when version 1.0 was published. When several feature detection methods are described in a file, most existing parsers will simply use the last feature detection method they have parsed.
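The difference between a parser that is aware of the repeatable element and one that keeps only the last value can be illustrated as follows. This is a sketch under assumptions: the fragment is simplified and un-namespaced, the shortLabel values are placeholders rather than checked controlled vocabulary terms, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative feature with two detection methods; labels are placeholders.
FRAGMENT = """
<feature id="1">
  <featureDetectionMethod>
    <names><shortLabel>antibody detection</shortLabel></names>
  </featureDetectionMethod>
  <featureDetectionMethod>
    <names><shortLabel>mass spectrometry</shortLabel></names>
  </featureDetectionMethod>
</feature>
"""


def all_detection_methods(feature):
    """XML3.0-aware reading: collect every featureDetectionMethod label."""
    return [
        method.findtext("names/shortLabel")
        for method in feature.findall("featureDetectionMethod")
    ]


def last_detection_method(feature):
    """Mimic an older single-value parser: each new method overwrites the last."""
    label = None
    for child in feature:
        if child.tag == "featureDetectionMethod":
            label = child.findtext("names/shortLabel")
    return label


FEATURE = ET.fromstring(FRAGMENT)
```

Here all_detection_methods(FEATURE) returns both labels, while last_detection_method(FEATURE) returns only the second one, so an older reader loses information but does not fail.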
The feature element has been extended in PSI-MI XML3.0 to capture the dependency of an interaction on a particular feature, for example the presence of a specific PTM, and also the effect of an interaction, such as the phosphorylation of a tyrosine residue by a protein kinase. In PSI-MI XML2.5 this information is stored as an attribute of a feature. An optional featureRole element has been added to the feature element, which can be used to describe PTMs existing in/resulting from the context of the interaction. This element would be populated from a list of new controlled vocabulary terms added to the PSI-MI ontology, such as ‘prerequisite-PTM (MI:0638)’ or ‘observed-PTM (MI:0925)’.
The equilibrium dissociation constant or kinetic parameters, such as kon or koff, can be added at the interaction level in PSI-MI XML2.5; however, this does not enable the systematic capture of changes in these parameters when a sequence is mutated at the feature level. In PSI-MI XML3.0, the kinetic and equilibrium dissociation constant parameters that are linked to a specific mutation have been moved from the interaction parameterList to the feature parameterList (Fig. 3, Additional file 5). The kinetic and equilibrium dissociation constant parameters associated with the wild type protein remain at the interaction level in PSI-MI XML3.0.
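The split between wild-type and mutation-specific parameters can be sketched as below. Several things here are assumptions and should be checked against the schema: the parameter attributes (term, unit, base, exponent, factor) and the reading of the value as factor × base^exponent, the simplified un-namespaced nesting, and the invented numbers. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative interaction: wild-type Kd at the interaction level,
# mutant Kd on the feature that describes the mutation. Values are made up.
FRAGMENT = """
<interaction id="1">
  <participantList>
    <participant id="2">
      <featureList>
        <feature id="3">
          <names><shortLabel>example mutation</shortLabel></names>
          <parameterList>
            <parameter term="kd" unit="molar" base="10" exponent="-7" factor="4"/>
          </parameterList>
        </feature>
      </featureList>
    </participant>
  </participantList>
  <parameterList>
    <parameter term="kd" unit="molar" base="10" exponent="-9" factor="2"/>
  </parameterList>
</interaction>
"""


def parameter_value(parameter):
    """Assumed encoding: value = factor * base ** exponent."""
    factor = float(parameter.get("factor"))
    base = float(parameter.get("base", "10"))
    exponent = float(parameter.get("exponent", "0"))
    return factor * base ** exponent


def kd_by_level(xml_text):
    """Return (wild-type Kd, {feature label: mutant Kd})."""
    interaction = ET.fromstring(xml_text)
    wild_type = None
    # Only the parameterList that is a direct child of the interaction.
    for parameter in interaction.findall("parameterList/parameter"):
        if parameter.get("term") == "kd":
            wild_type = parameter_value(parameter)
    mutants = {}
    for feature in interaction.iter("feature"):
        for parameter in feature.findall("parameterList/parameter"):
            if parameter.get("term") == "kd":
                mutants[feature.findtext("names/shortLabel")] = parameter_value(parameter)
    return wild_type, mutants
```

With both values available in a predictable place, a fold change in affinity per mutation can be computed directly, which is the kind of systematic comparison the interaction-level parameter alone did not allow.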
Description of new data types
Dynamic interactions: interaction sub-networks may be rewired in response to changes in the environmental conditions in which the experiment is performed. Examples of such changes include applying increasing concentration of an agonist onto a cell or a single concentration for an increasing amount of time, or merely sampling the interactome at different stages of the cell cycle. In PSI-MI XML3.0 an optional variableParameterList element has been added to the experiment element, which contains one-to-many variableParameter elements. Each variableParameter element contains the required description element to define the variable condition, an optional unit element to describe the unit of the different parameters in the variableValueList and a required variableValueList element to list all the existing variable parameter values used in the experiment. A variableValueList contains one-to-many variableValue elements, which may themselves contain an optional order attribute, an integer defining the position of the given variableValue within its containing variableValueList parent element (Fig. 3, Additional file 6). The format can also handle multiple changes in condition, such as parallel time courses of an increasing concentration of an agonist. The example given in Additional file 4 shows the changing profile of proteins that interact with STAT6 as the number of hours post-Sendai viral infection increases.
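The nesting described above can be sketched with a small time-course example. The element names variableParameterList, variableParameter, description, unit, variableValueList, variableValue and the order attribute are taken from the text; everything else is an assumption made for illustration, including the value child that holds each value, the names/shortLabel content of unit, the id attributes, the numbers, and the helper name read_variable_parameters.

```python
import xml.etree.ElementTree as ET

# Illustrative experiment with one variable condition (a time course).
# Values are listed out of order on purpose; namespace omitted.
FRAGMENT = """
<experimentDescription id="1">
  <variableParameterList>
    <variableParameter>
      <description>hours post-infection</description>
      <unit><names><shortLabel>hour</shortLabel></names></unit>
      <variableValueList>
        <variableValue id="3" order="2"><value>6</value></variableValue>
        <variableValue id="2" order="1"><value>2</value></variableValue>
        <variableValue id="4" order="3"><value>12</value></variableValue>
      </variableValueList>
    </variableParameter>
  </variableParameterList>
</experimentDescription>
"""


def read_variable_parameters(xml_text):
    """Return [(description, unit or None, [values sorted by 'order'])]."""
    experiment = ET.fromstring(xml_text)
    parameters = []
    for parameter in experiment.findall("variableParameterList/variableParameter"):
        values = parameter.findall("variableValueList/variableValue")
        # 'order' is optional: values without it are kept after ordered ones,
        # in document order (sorted() is stable).
        values = sorted(
            values,
            key=lambda v: (v.get("order") is None, int(v.get("order", 0))),
        )
        parameters.append((
            parameter.findtext("description"),
            parameter.findtext("unit/names/shortLabel"),
            [v.findtext("value") for v in values],
        ))
    return parameters
```

A second variableParameter in the same list would represent an additional condition varied in parallel, such as agonist concentration alongside time.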
Abstracted interactions: The PSI-MI XML2.5 schema was designed to represent experimental interactions, therefore an experiment description is required for each interaction. However, groups are increasingly looking to capture and exchange data collated from several publications. Examples of these include reference protein complexes described in the Complex Portal (www.ebi.ac.uk/complexportal, Additional file 7) and the descriptions of cooperative binding when distinct molecular interactions influence each other either positively or negatively (Additional file 8). A version of the XML2.5 schema (PSI-PAR) was created to describe the production of protein binders such as antibodies, including detail such as antibody cross-reactivity – data that also cannot be described by a single experiment, and often not even in a single publication. In order to describe such cases, the ‘interactionDetectionMethod’ element within an ‘experimentDescription’ element does not have a specific method assigned as a value in entries in the PSI-MI XML2.5 format. Instead the CV terms ‘inferred by author’ (MI:0363) or ‘inferred by curator’ (MI:0364) are used to indicate that the interaction was inferred from multiple experiments or from several publications, respectively. Within the ‘experimentDescription’ element, the ‘bibref’ element refers to a related publication. In PSI-MI XML3.0, a new optional abstractInteraction element has been added within the interactionList. This element can now be used to describe ‘abstract’ or ‘modelled’ interactions such as stable complexes or allosteric interactions. This element contains many optional elements, for example a participantList, bindingFeaturesList, an interactorType element to describe the type, such as a protein complex, a protein-RNA or an antibody-antigen complex and an interactionType element to differentiate between a stable or transient complex, a cooperative interaction, or an enzymatic reaction.
PSI-PAR was designed to fulfil three anticipated use cases: 1) affinity reagent and target protein production data, 2) characterisation/quality control results, and 3) complete summaries of end products. In practice, there has been no requirement for the format to exchange reagent and target production data. The ability to describe abstracted data in PSI-MI XML3.0 format fulfils use cases 2 and 3, by enabling the capture of quality control and reagent specificity data which are rarely described in a single publication. It has therefore been decided to merge PSI-PAR back into the parent PSI-MI XML, and XML3.0 will be regarded as the standard format for exchanging binder-target data from this point onwards. The PAR CV which was created to populate PSI-PAR will be merged back into the PSI-MI CV, thus minimising both schema and CV maintenance overheads.
Co-operative interactions: in a cellular and tissue context, interactions between biomolecules are rarely independent. Instead, distinct molecular binding events affect each other positively or negatively, i.e. they are cooperative. The two main mechanisms underlying cooperative binding are allostery and pre-assembly [8, 9]. Allostery involves a change in binding or catalytic properties of a biomolecule at one site of the molecule by an event at a different distinct site of the same molecule [10, 11]. Pre-assembly involves the generation or abrogation of a binding site through an interaction or enzymatic modification [12, 13, 14]. This includes (i) complex assembly resulting in the formation of a continuous binding site spanning multiple subunits; (ii) competitive binding to overlapping or adjacent, mutually exclusive binding sites; (iii) enzymatic modification that changes the physicochemical compatibility for a binding partner; or (iv) configurational pre-organization involving multivalent ligands that engage in multiple discrete interactions with one or more binding partners for high-avidity binding.
As cooperative binding is common between many molecules in vivo, and the number of experimentally validated, interdependent interactions reported in the literature is increasing, it should be possible to represent and exchange these data in a standard format. Previously, however, cooperativity was only captured by the PSI-MI XML2.5 format by using annotations at the interaction level. This has several shortcomings, including difficulties with parsing and automatic validation, repetition and redundancy, and lack of experimental details. Because the data required to describe cooperative interactions rarely come from a single experiment, or may even need to be assembled from many distinct publications, they are treated as abstract interactions and, in PSI-MI XML3.0, captured using the abstractInteraction element. Within this element, an optional cooperativeEffectList allows listing the cooperative effects a specific interaction has on one or more other interactions. The effect will be described in the allostery or preassembly child element, as appropriate. Within these elements, additional details are captured, including the experimental methods and publications from which the data were inferred, references to the interactions that are affected, and the outcome of the effect.
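A reader for this structure might look like the sketch below. Only abstractInteraction, cooperativeEffectList, allostery and preassembly are named in the text; the child elements used here for the affected interactions and the outcome (affectedInteractionList, affectedInteractionRef, cooperativeEffectOutcome) are assumptions for illustration and should be checked against the schema, as should the placeholder labels. The evidence details are left out and cooperative_effects is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative abstract interaction with two cooperative effects.
# Child element names below cooperativeEffectList/* are assumed, not confirmed.
FRAGMENT = """
<abstractInteraction id="10">
  <cooperativeEffectList>
    <allostery>
      <affectedInteractionList>
        <affectedInteractionRef>11</affectedInteractionRef>
      </affectedInteractionList>
      <cooperativeEffectOutcome>
        <names><shortLabel>positive cooperative effect</shortLabel></names>
      </cooperativeEffectOutcome>
    </allostery>
    <preassembly>
      <affectedInteractionList>
        <affectedInteractionRef>12</affectedInteractionRef>
        <affectedInteractionRef>13</affectedInteractionRef>
      </affectedInteractionList>
      <cooperativeEffectOutcome>
        <names><shortLabel>negative cooperative effect</shortLabel></names>
      </cooperativeEffectOutcome>
    </preassembly>
  </cooperativeEffectList>
</abstractInteraction>
"""


def cooperative_effects(xml_text):
    """Return [(mechanism, [affected interaction ids], outcome label)]."""
    interaction = ET.fromstring(xml_text)
    effects = []
    for effect in interaction.findall("cooperativeEffectList/*"):
        if effect.tag not in ("allostery", "preassembly"):
            continue  # ignore anything that is not one of the two mechanisms
        affected = [
            int(ref.text)
            for ref in effect.findall("affectedInteractionList/affectedInteractionRef")
        ]
        outcome = effect.findtext("cooperativeEffectOutcome/names/shortLabel")
        effects.append((effect.tag, affected, outcome))
    return effects
```

Because the mechanism is carried by the element name and the affected interactions by explicit references, the result can be validated and traversed automatically, which free-text annotations at the interaction level did not permit.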
Description of new molecule types
Molecule sets: PSI-MI XML2.5 contains a key element interactorType, to describe the type of molecule involved in an interaction. This qualifies an interactor with a term from the PSI-MI controlled vocabulary, for example ‘protein’ (MI:0326) or ‘polysaccharide’ (MI:0904). However, there are cases when the exact molecule cannot be described, where it may be one of several possible entities. Examples of such cases include a peptide identified as the result of a mass spectrometry experiment which can be redundantly assigned to any one of a family of closely related molecules, and a non-specific antibody which cannot distinguish between two proteins with a high degree of sequence homology. There are also cases when the products of one or more genes cannot be distinguished at the protein level, for example human calmodulin is an identical protein produced by three genes (CALM1, CALM2, CALM3). In these cases it may be necessary to describe a ‘set’ of molecules. This is not a new concept – it has been common practice in pathway databases such as Reactome for some years, and indeed the required CV terms have been taken from the Reactome definition. However, this cannot be a simple addition to the Participant type CV as the ability to add a feature to a specific molecule within that set may be necessary. In PSI-MI XML3.0, the participant element will now contain a choice between interactor, interactorRef, interactionRef and interactorCandidateList. The interactorCandidateList element would contain a moleculeSetType element (PSI-MI CV Type) followed by one to many interactorCandidate elements. The interactorCandidate node contains a required id attribute, a required interactor or interactorRef element to describe or reference an interactor and an optional featureList element with one to many features to describe binding features for each interactor candidate (Additional file 9).
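The choice inside participant and the layout of a candidate list can be sketched as follows, loosely modelled on the calmodulin case. The element names are those given in the text; the ids, the interactor references, the set type label and the feature label are invented placeholders, the namespace is omitted, and participant_kind and candidates are hypothetical helpers.

```python
import xml.etree.ElementTree as ET

# Illustrative participant that may be any one of three referenced interactors.
FRAGMENT = """
<participant id="1">
  <interactorCandidateList>
    <moleculeSetType><names><shortLabel>defined set</shortLabel></names></moleculeSetType>
    <interactorCandidate id="2"><interactorRef>101</interactorRef></interactorCandidate>
    <interactorCandidate id="3">
      <interactorRef>102</interactorRef>
      <featureList>
        <feature id="5"><names><shortLabel>binding region</shortLabel></names></feature>
      </featureList>
    </interactorCandidate>
    <interactorCandidate id="4"><interactorRef>103</interactorRef></interactorCandidate>
  </interactorCandidateList>
</participant>
"""

CHOICES = ("interactor", "interactorRef", "interactionRef", "interactorCandidateList")


def participant_kind(participant):
    """Return which of the four alternatives this participant uses."""
    present = [child.tag for child in participant if child.tag in CHOICES]
    if len(present) != 1:
        raise ValueError(f"expected exactly one of {CHOICES}, found {present}")
    return present[0]


def candidates(participant):
    """Return [(candidate id, interactorRef, [feature labels])]."""
    result = []
    for candidate in participant.findall("interactorCandidateList/interactorCandidate"):
        features = [
            feature.findtext("names/shortLabel")
            for feature in candidate.findall("featureList/feature")
        ]
        result.append((candidate.get("id"), candidate.findtext("interactorRef"), features))
    return result


PARTICIPANT = ET.fromstring(FRAGMENT)
```

Keeping the featureList inside each interactorCandidate is what allows a feature to be attached to one member of the set only, which a plain participant type term could not express.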
Stoichiometry: in PSI-MI XML2.5 the stoichiometry of a molecule can only be described as free-text annotation or as an attribute of the participant. In PSI-MI XML3.0 the participant element has been updated to add an optional XML Schema Definition (XSD) choice sub-element, which provides a choice between a stoichiometry element to describe the mean stoichiometry for this participant and a stoichiometryRange element to describe a stoichiometry range for this participant. If the stoichiometry element is selected, a value attribute is required to describe the stoichiometry as a decimal value. If the stoichiometryRange element is chosen, both minValue and maxValue attributes are required to describe the stoichiometry range as decimal values (Additional file 10).
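A consumer can normalise the two alternatives to a single (minimum, maximum) pair, as in this sketch. The element and attribute names are those given above; the participant fragments are simplified and un-namespaced, the numbers are invented, and stoichiometry_bounds is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from decimal import Decimal

# Two illustrative participants: one exact value, one range.
EXACT = '<participant id="1"><interactorRef>101</interactorRef><stoichiometry value="2"/></participant>'
RANGE = '<participant id="2"><interactorRef>102</interactorRef><stoichiometryRange minValue="1" maxValue="3.5"/></participant>'


def stoichiometry_bounds(participant):
    """Return (min, max) as Decimals, or None if no stoichiometry is given."""
    exact = participant.find("stoichiometry")
    value_range = participant.find("stoichiometryRange")
    if exact is not None and value_range is not None:
        # The schema offers a choice, so both at once would be invalid.
        raise ValueError("stoichiometry and stoichiometryRange are mutually exclusive")
    if exact is not None:
        value = Decimal(exact.get("value"))
        return value, value
    if value_range is not None:
        return Decimal(value_range.get("minValue")), Decimal(value_range.get("maxValue"))
    return None
```

Decimal is used rather than float so that values such as 0.5 or 3.5 are carried through exactly as written in the file.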
Update of the bibref element: the bibref element refers to a publication. PSI-MI XML2.5 allows either a cross reference (xref) element (to describe PubMed primary reference if it exists) or an attributeList element (to describe publication details such as publication title and publication date). To export both PubMed primary reference and publication details, the PubMed primary reference is added in bibref and the publication details attributes in the attributeList of the experimentDescription. In PSI-MI XML 3.0 the bibref element has been updated to accept both xref and attributeList so that the publication can be entirely described within bibref.
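With both children allowed, a publication can be read from bibref alone, as sketched here. The xref/primaryRef layout with db and id attributes and the attribute element with a name attribute are assumptions based on common PSI-MI usage and should be checked against the schema; the PubMed identifier is a placeholder, the attribute names and values are invented, and read_bibref is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Illustrative bibref carrying both a cross reference and publication details.
# "00000000" is a placeholder, not a real PubMed identifier.
FRAGMENT = """
<bibref>
  <xref><primaryRef db="pubmed" id="00000000"/></xref>
  <attributeList>
    <attribute name="publication title">An illustrative title</attribute>
    <attribute name="publication year">2018</attribute>
  </attributeList>
</bibref>
"""


def read_bibref(xml_text):
    """Return (PubMed id or None, {attribute name: value})."""
    bibref = ET.fromstring(xml_text)
    primary = bibref.find("xref/primaryRef")
    pubmed_id = None
    if primary is not None and primary.get("db") == "pubmed":
        pubmed_id = primary.get("id")
    details = {
        attribute.get("name"): attribute.text
        for attribute in bibref.findall("attributeList/attribute")
    }
    return pubmed_id, details
```

In XML2.5 the same information would have to be gathered from two places, the bibref xref and the attributeList of the enclosing experimentDescription.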
All data resources using the IntAct database as their data storage repository, i.e., members of the IMEx Consortium including IntAct, IID, InnateDB, MINT, DIP, MatrixDB and HPIDB, routinely make their data available in PSI-MI XML3.0 in addition to the existing PSI-MI XML2.5 and MITAB 2.7 formats. Manually curated protein complexes from the Complex Portal are also made available in PSI-MI XML3.0. The PSI-MI maker software (https://github.com/MICommunity/psimi-maker-flattener), a desktop application that helps users to create PSI-MI XML documents and extract data from them, has been updated to support PSI-MI XML3.0. In addition, the new features included in PSI-MI XML3.0 are currently being used to extend an existing tool suite, the MI Bundle, that integrates molecular, structural and genomics data and that already relies on the PSI-MI standard.
PSI-MI XML3.0 will enable the molecular interaction community to meet the demands of new data types and increase our ability to systematically describe important biological events such as the composition, topology and stoichiometry of protein complexes, the cooperative binding of molecules to form new binding sites, and the modulation of enzyme activity through allosteric binding. The accompanying PSI-MI controlled vocabulary used to populate this schema is also constantly being updated and expanded to more fully describe new ways of measuring molecular interactions and meet the needs of novel data types. We have developed a Java library, JAMI, that is capable of both reading and writing all the PSI-MI formats (PSI-MI XML, MI-JSON and MITAB), to ensure that software developers are not faced with having to create multiple versions of a program to address all versions of the interchange formats. The PSICQUIC web service is also being improved, to handle the increased volume of data traffic as we move towards a comprehensive understanding of the interactomes of model organism species.
Availability and requirements
Project name: PSI-MI XML3.0.
Operating system(s): Platform independent.
Programming language: XML.
Any restrictions to use by non-academics: None.
Availability: All example files are available in both Supplementary Materials and in GitHub, as listed in the article. The data used in the example files is also freely available from the IntAct or Complex Portal databases, as appropriate, with the exception of the cooperative interaction described in Additional file 8, which is not available in any public repository.
MD, MK, AS, JS, JH and YY were funded by BBSRC MIDAS grant (BB/L024179/1); this grant provided the funds for the design of PSI-MI XML3.0 and its implementation by the IntAct database. KVR was funded by European Commission (FP7-HEALTH-2009-242129 SyBoSS), LL by ELIXIR-IIB, the Italian Node of the European ELIXIR infrastructure, IJ was funded by Ontario Research Fund (GL2–01-030, #34876) and Canada Research Chair Program (#225404), DJL by EMBL Australia and FP7-HEALTH-2011-278568, SRB and NTM by Fondation pour la Recherche Médicale (grant n° DBI20141231336) and by the French Institute of Bioinformatics (2015 call), NHC and RCL by British Heart Foundation (RG/13/5/30112), GC by the European Research Council (Grant Agreement 32274), CC was funded by the Wellcome Trust (grant numbers 103139, 063412, 203149) and LS by National Institutes of Health (R01GM071909). These monies funded input by these groups into the design of the format, its subsequent adoption by members of the IMEx Consortium and the update of the tools described in the paper.
MS(D), ND-T, MK, AS, JS, JH, HH and YY designed and implemented the PSI-MI XML format, DA-L, JDLR, AC, CC updated and designed tools to use the new format, SO, BM, GB, NC, SR-B, KVR, SP, NT-M provided use cases and example files, MA, NC, GC, HH., IJ, LL, RCL, DJL, PP, BR, LS provided IMEx data implemented in the format. SO, PP, LL, LS, SR-B, KVR contributed to the controlled vocabulary development. SO drafted the manuscript with input from all authors, YY designed the figures. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
- 19. Sivade (Dumousseau) M, Koch M, Shrivastava A, Alonso-Lopez D, De Las Rivas J, et al. JAMI: a Java library for molecular interactions and data interoperability. BMC Bioinformatics. 2018 [in press].
Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.