A public repository for mass spectrometry imaging data
- First Online:
- 1.3k Downloads
Mass spectrometry (MS) imaging is a very active field of research, and has seen impressive progress in recent years [1, 2]. The number of groups that are working on this topic is constantly increasing. However, the field is still very heterogeneous in terms of applied instrumentation and data processing methods. In addition, complex datasets are reduced to a set of two-dimensional “images,” which inevitably results in information loss. This simplified graphical representation also strongly depends on processing options such as color scale, intensity normalization, and spatial interpolation. Consequently, experimental data are presented in very diverse ways, and published results can therefore be difficult to evaluate and compare. With a growing number of published studies, the issue of standardization and quality control of MS imaging data is becoming more important. This is a natural process for any new field that is maturing. The MS-based proteomics community has been facing similar issues in the last decade, and this discipline is therefore discussed as a “role model” herein. Since its inception in 2002, the Proteomics Standards Initiative (http://www.psidev.info) has driven the development of a number of minimum reporting guidelines (called “minimum information about a proteomics experiment” documents)  and several standard data formats for the different data types relevant in proteomics. For example, for raw and processed MS data, the data standard is called mzML .
In addition, several data repositories were established about 10 years ago to address the demand for storage and availability of MS data in the public domain [5, 6, 7, 8, 9]. A big step forward in this area has been the establishment of the ProteomeXchange (PX; http://www.proteomexchange.org/) consortium , led by the PRIDE  and PeptideAtlas  resources. The overall aim of PX is to provide a common framework and infrastructure for the cooperation of proteomics resources by defining and implementing consistent, harmonized, user-friendly data deposition and exchange procedures among the members. Thanks to the guidelines promoted by several scientific journals and funding agencies, and the general perception that sharing data is good scientific practice, the culture in the proteomics community has evolved toward data deposition as part of the publication process.
In analogy to these activities in the MS proteomics field, similar mechanisms have been discussed and to some extent already implemented in the MS imaging community in recent years. A common data format for MS imaging—imzML—has been established . This format is being used more and more, and the number of available tools is constantly growing (see http://www.imzml.org for more details). Reporting guidelines have been discussed for several years, and a first suggestion of those is included in this topical collection .
Nevertheless, owing to the lack of suitable resources, a missing element so far has been the possibility to make MS imaging datasets available in the public domain. Earlier attempts to develop a data repository were abandoned mainly because of the large size of MS imaging datasets. However, nowadays very large datasets (i.e., file size on the order of a few terabytes) can also be generated in MS-based proteomic and metabolomic studies, and can be submitted to established repositories.
From a purely technical point of view, the infrastructure available in existing MS repositories is also suited for MS imaging data. Therefore, the missing step is to define and adopt a submission procedure in order to be compatible with MS-imaging-specific parameters. Here we describe the newly implemented way of submitting MS imaging data to PX via the PRIDE database. We also describe how to retrieve these data and to reproduce the MS images.
Experimental procedures/submission process
Procedure for submitting datasets
For MS proteomics, there are two different PRIDE/PX submission modes: “complete” and “partial.” For both types of submission, a set of common metadata (agreed by all PX partners) and MS raw data are mandatory for each dataset. The difference is in the way the processed results are provided. After a “complete” submission has been performed it is possible for the repository to directly connect the processed peptide/protein identification results with the mass spectra. This can be achieved if the processed results are available in an open data format supported by the repository. The alternative is to perform a “partial” submission, and in this case, the connection between the spectra and the identification results cannot be done in a straightforward way. For “partial” submissions, the processed results are not available in a format supported by the repository. Instead, the corresponding analysis software’s output files (in heterogeneous formats) are made available for download .
The submission procedure has been adapted for MS imaging data using the “partial” submission mechanism. The PX Submission Tool  is the main tool used to perform the submissions. It is developed in Java and provides a user-friendly graphical user interface for performing the actual data submission, through a series of stages: (1) select all the files needed for the submission; (2) interactively group related files; (3) ensure the minimum level of metadata annotation; and (4) transfer the files via Aspera (http://asperasoft.com/) or FTP. Aspera can perform much faster transfers than FTP, resulting in a very convenient way of transferring potentially very large datasets.
In addition to the PX Submission Tool, datasets containing a high number of files can also be submitted using a command-line-based alternative .
An exemplary dataset has been used to demonstrate the submission process. The data describe a matrix-assisted laser desorption ionization imaging experiment of mouse urinary bladder tissue sections acquired with high mass accuracy at a pixel size of 10 μm which was reported previously . Experimental details are described in the electronic supplementary material (the results published originally are shown in Fig. 2, panels A and B). The dataset was deposited in PRIDE/PX (accession number PXD001283).
The submission process is described below; screenshots of all submission steps are provided in Figs. S1–S10. For a detailed description of the submission process, a tutorial for MS proteomics data is available . Here, we will mainly focus on the changes made to accommodate the MS imaging datasets.
Before starting, users must first register at PRIDE (http://www.ebi.ac.uk/pride/archive/register). The default assumption is that all of the files belonging to one study or manuscript will be uploaded at once and handled as a unit, although there is some flexibility in the process. After the PX Submission Tool has been launched, in step 1 the type of submission must be indicated. In this case, “partial” must always be chosen.
Submitted data files in the example dataset (accession number PXD001283)
HR2MSI mouse urinary bladder S096.ibd
Binary part of imzML data
HR2MSI mouse urinary bladder S096.imzml
XML part of imzML data
HR2MSI mouse urinary bladder S096 - optical image.tiff
Optical image of measured sample (tissue section)
HR2MSI mouse urinary bladder S096 - results.txt
Information on MS images which have been presented in the corresponding manuscript (e.g., m/z values)
It is mandatory to provide the MS raw data (labeled as “RAW”). It is recommended to submit MS imaging data in imzML format as it offers the most flexible options for viewing, but proprietary data formats are also accepted. The default submission process has been modified, and it now includes the possibility to submit two different mass-spectral-related files for one dataset, as required for several MS imaging data formats (e.g., imzML and Analyze). The mass spectral data file (*.ibd file for imzML or *.img file in Analyze format) must be labeled as “RAW.” The file that contains metadata (such as pixel dimensions and additional information) must be labeled as “MS_IMAGE_DATA” (e.g., *.imzml file for imzML or *.hdr file for Analyze). If an “ibd” file (imzML format) is submitted as “RAW,” an “MS_IMAGE_DATA” file (*.imzml) is mandatory. However, in the case of “RAW” proprietary formats that consist of only one file, an “MS_IMAGE_DATA” file is not required.
In addition, PRIDE requires a mandatory “SEARCH” file for “partial” submissions, which corresponds to the processed results. This file typically contains a list of identified peptides and proteins, and related metadata. There is currently no strict definition for the format of this mandatory file, but it should contain a list of m/z values, names of (tentatively) identified compounds, and additional information that was used to generate the MS images in the published work. The “SEARCH” file for the example dataset is given in the electronic supplementary material. These data should be available from the manuscript, but including them in a concise way in the data submission facilitates the reproduction of the published MS images.
We acknowledge that the file tag “SEARCH” might not be the best option to describe these data, but this file is still mandatory for consistency with the overall PX submission framework. Another more specific file tag might be added in the future if this turns out to be necessary.
Since MS imaging data contain spatial information, the data submission also supports the inclusion of an optical image (“OPTICAL_IMAGE”) of the measured sample, which can allow validation and/or interpretation. The “OPTICAL_IMAGE” file could contain a photograph of the imaged sample or an adjacent section that shows comparable spatial features. Native samples, classic histological techniques (hematoxylin and eosin, toluidine), or immunohistochemistry staining (antibody staining) can be provided for this purpose.
The rest of the submission steps are identical to the MS proteomics “partial” submission process  (see Figs. S1–S10). An updated tutorial is available at http://www.proteomexchange.org/submission. Briefly, in step 4 (“Mapping Files”), the relationships between the different files can be captured. Each “RAW” file requires at least one “SEARCH” file. In addition, for MS imaging data, “RAW” files and “MS_IMAGE_DATA” files need to be connected. Step 5 is devoted to the annotation of biological and technical metadata. Information about species, tissues, and instrumentation is mandatory. In step 6, contact details for the principal investigator need to be provided. Step 7 is devoted to additional details. For instance, there it is possible to provide a PubMed identifier if the corresponding manuscript has already been published at submission time (as is the case in the example dataset). Step 8 (“Summary Screen”) provides an overview of all the files and file mappings, for the submitters to perform a final review before the actual file upload occurs. Step 9 is the final step where the files are uploaded to PRIDE. When the transfer is finished, the submitter will receive a confirmation e-mail. The actual time required to upload a dataset depends on the size of the dataset and the bandwidth available. In our example, it took approximately 1 h to prepare the data and collect the information for the submission, and the actual transfer of 830 MB data from Giessen to Cambridge, UK, took less than 5 min.
The data submission is then processed by the PRIDE team, and a “PXD identifier” is issued for each dataset. The submitter also receives by e-mail a username and password to allow private access to the data.
Retrieve and display data from the repository
All submitted datasets are private by default. Each dataset becomes publicly available on acceptance or publication of the corresponding manuscript, or when the submitters tell PRIDE to do so. All public PX datasets (including those in PRIDE) are accessible via the portal ProteomeCentral (http://proteomecentral.proteomexchange.org/). There it is possible to search for PX datasets, independently from the receiving repository (Fig. S11). However, although the datasets are private, reviewers/editors need to go to PRIDE (http://www.ebi.ac.uk/pride) and log in using the username and password provided by the authors. Detailed information about how to access private datasets is provided in the electronic supplementary material.
The imzML data can be downloaded and viewed in freely available software tools such as MSiReader , Datacube Explorer , and OmniSpect . Alternative commercial tools include Quantinetix  and MALDIVision . An updated list of available tools is available at http://www.imzml.org.
Discussion and outlook
In this article we have provided an overview of the newly implemented process for submitting MS imaging datasets to PX via PRIDE. The procedure allows the evaluation of data during the manuscript reviewing process. In the MS proteomics field, there is a growing trend toward public data reuse which is triggering the assessment, reuse, comparative analysis, and extraction of new findings from already published data. As an example, one very prominent case of reuse of PX datasets occurred in the elaboration of one of the recently published “drafts” of the human proteome .
Likewise, submitted datasets of MS imaging data could be used as input for bioinformatics reanalysis by other groups. One particular type of data reuse, popular already in other disciplines, is to analyze data coming from a large number of publications/datasets in a combined way. This is a so-called meta-analysis study, which could also be applicable to MS imaging.
A possible next step would be to establish procedures for making possible PX “complete” submissions for MS imaging data. For this it would be necessary to agree on a standard procedure that can make online/direct data visualization possible and, if applicable, make possible the direct connection between the mass spectra and the compounds reported in a given study. The submitted exemplary dataset represents the first fully functional (independent of vendor software) MS imaging dataset that is publically available and that can be used freely for subsequent reprocessing and interpretation (the original publication  should be referenced in this case). For example, this dataset could be used as a reference and test dataset for developers of data processing software, i.e., to evaluate procedures for mass recalibration or spatial segmentation. We have been asked to provide data for such activities in the past, which triggered a discussion about how to share and exchange data and what the conditions for their reuse would be. Now interested parties can directly download the data and use the data as necessary. This submitted example also acts as a template for the submission of larger datasets, since the procedure and requirements are identical. Currently, there is no upper limit on the size of the submitted datasets. MS datasets with sizes up to a few terabytes have been submitted successfully to PX.
Our main objective here has been to show that a data repository is now available for MS imaging data. A relatively small dataset was chosen in order to facilitate data retrieval and processing, but the mechanism is also applicable for larger-scale data. We are not proposing this process as the only and default procedure for publication of MS imaging data in the immediate future, but rather see this as a demonstration that there are no longer any technical obstacles to a public repository for MS imaging data. The availability of such informatics tools represents an important step toward establishing MS imaging as a routine and reliable method in biomedical research. For instance, it may well open the route to fully integrate MS imaging within large-scale international research programs such as the Human Proteome Project.
The proposed approach allows public dissemination of MS imaging data based on existing public investment and infrastructure, avoiding significant extra costs for the development of a new, dedicated resource. The current setup allows the community to evaluate and test this submission process. We welcome any feedback since it could be used to improve and facilitate the submission process in the future.
J.A.V. is supported by the Wellcome Trust (grant number WT101477MA) and the EU Seventh Framework Programme grants ProteomeXchange (grant number 260558) and PRIME-XS (grant number 262067). R.W. is supported by the Biotechnology and Biological Sciences Research Council Quantitative Proteomics grant (reference BB/I00095X/1). We acknowledge the COST Action BM1104 Mass Spectrometry Imaging: New Tools for Healthcare Research, which provided the framework for the initial discussion between A.R., J.P.A., and A.U. We especially acknowledge the contribution made to this work by Juan Pablo Albar (1953–2014). His untimely death has deprived us of both a valued friend and colleague.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.