The Use of Random Projections for the Analysis of Mass Spectrometry Imaging Data
The ‘curse of dimensionality’ imposes fundamental limits on the analysis of the large, information rich datasets that are produced by mass spectrometry imaging. Additionally, such datasets are often too large to be analyzed as a whole and so dimensionality reduction is required before further analysis can be performed. We investigate the use of simple random projections for the dimensionality reduction of mass spectrometry imaging data and examine how they enable efficient and fast segmentation using k-means clustering. The method is computationally efficient and can be implemented such that only one spectrum is needed in memory at any time. We use this technique to reveal histologically significant regions within MALDI images of diseased human liver. Segmentation results achieved following a reduction in the dimensionality of the data by more than 99% (without peak picking) showed that histologic changes due to disease can be automatically visualized from molecular images.
Keywords: Random projection, Mass spectrometry imaging, Informatics, Segmentation, Digital histology, Dimensionality reduction, Data processing
The determination of molecular profiles from individual tissue types is central to the understanding of their biological function, and direct chemical analysis of tissue using mass spectrometry imaging (MSI) is an established tool for determining profiles encompassing a broad range of molecules within a single imaging experiment [7, 29]. One route to producing molecular profiles is to group similar tissue regions according to the similarity of their mass spectra, and to extract an average spectrum for each group. Manually identifying distinct tissue types is difficult and requires a histologic expert [2, 8], so several groups have examined automated segmentation methods for group identification to provide an unsupervised and reproducible scheme for the analysis of data [9, 7, 29, 24].
These clustering methods were shown to be useful in MSI for extracting distinct histologic regions, separating tumor from normal tissue, and for three-dimensional visualization of tissue structures. Other sophisticated approaches have been developed for viewing data heterogeneity and provide powerful tools for the visualization of trends within mass spectrometry images. A specific advantage of segmentation is that all tissue regions within a cluster have similar spectra by construction, so molecular profiles for the corresponding tissue types can be computed. These profiles can be used to identify discriminatory, characteristic, or spatially co-varying molecules. This work addresses two issues that restrict the application of automated processing of mass spectra: first, the number of peaks that can be processed, and second, the ability to perform data processing in real time while data is still being collected.
Automated segmentation identifies clusters of similar spectra using a ‘distance’ metric to quantify spectral similarity. A significant issue when calculating distance metrics for mass spectra is the dimensionality of the data, which is equal to the number of m/z values being considered. In a time-of-flight spectrum, this could be more than 100,000 mass bins, and could be millions for high-resolution instruments. High dimensionality degrades the accuracy of distance metrics because the relative differences between distances tend to zero (so all spectra are measured as being almost equally different from each other). Two factors compound this problem in MSI: the number of samples (pixels) is nearly always much lower than the dimensionality, and the covariance of samples introduces redundancy into the data and effectively reduces the sampling rate further. Dimensionality reduction methods are frequently used to allow accurate distance calculations by removing this redundancy between spectral channels. This allows the accuracy and speed of cluster formation to be improved, either by choosing a small number of ‘important’ measurements or by a transformation of the data. A common approach involves a linear transformation of the data by projection onto a low-dimensional basis which, if constructed correctly, will preserve key relationships between samples and allow analyses such as segmentation to be performed on the projected data [19, 24]. Unfortunately, dimensionality reduction often carries a high computational cost or requires multiple passes through the data in order to extract a meaningful set of measurements. Commonly used methods such as principal component analysis and non-negative matrix factorization have been shown to be effective on mass spectrometry images but have the distinct disadvantage of requiring the basis to be calculated from the data.
This usually means the whole dataset needs to be collected and loaded into memory to compute the basis, which prevents real-time analysis and may be impossible for very large datasets, in which case a preliminary stage of data reduction is required [21, 26]. The issue of coping with the size of mass spectrometry imaging data has been noted for almost as long as the field has existed [9, 1]. Most workflows described in the literature go through a multi-stage process of peak identification and feature selection that can require extensive processing and completely removes some peaks from the subsequent analysis [21, 1, 14].
The quality of segmentation is then dependent on the quality of the peak picking, which can require extensive tuning for specific mass spectrometers, sample preparation techniques, and datasets.
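The loss of contrast between distances in high dimensions, described above, can be illustrated with a small numerical experiment. The following is a minimal sketch in Python using uniformly random points rather than real spectra; the function name `relative_contrast` and the point counts are assumptions made for illustration only.

```python
import math
import random


def relative_contrast(n_points, dim, rng):
    """Return (max - min) / min of the distances from one query point to the rest."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    query, others = points[0], points[1:]
    dists = [math.dist(query, p) for p in others]
    return (max(dists) - min(dists)) / min(dists)


rng = random.Random(0)
for dim in (2, 20, 200, 2000):
    # The contrast shrinks as the dimensionality grows
    print(dim, round(relative_contrast(200, dim, rng), 3))
```

As the dimensionality grows, the nearest and farthest neighbors of a point become almost equally distant, which is the effect that motivates dimensionality reduction before clustering.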
An alternative approach uses a pseudo-basis composed of randomly drawn vectors onto which the data is projected [30, 6]. The central idea is that projections onto a collection of such random vectors can be shown to extract almost mutually independent information, so a set of these vectors will capture the essential features of the data. The random basis itself is formed independently of the data, which removes a major computational hurdle. Random projections have been shown to preserve patterns within the data, including distances and angles between data points, making them useful for dimensionality reduction in areas including image processing and text mining. Previous applications in the processing of mass spectrometry data were to compare individual spectra against a database and to form orthonormal approximate bases for mass spectrometry imaging compression. The importance of memory-efficient data processing is well known, and the random projection algorithm can be implemented in a memory-efficient manner to avoid loading the whole dataset at once.
In this paper, we investigate the use of random projections to enable efficient image segmentation for the identification of spatial features in mass spectrometry images without requiring peak picking or other data reduction stages.
2.1 MALDI MSI of Human Liver
The mass spectrometry dataset used in this work consists of a MALDI mass spectrometry image acquired from a section of diseased human liver suffering from non-alcoholic steatohepatitis (NASH). This dataset has previously been used to demonstrate novel mass spectrometry image visualization methods, and a full description of the imaging methodology can be found in the supporting information of that paper. A brief summary is presented here.
2.1.1 Tissue Handling
Samples were collected from patients undergoing liver transplantation or tumor resection surgery at The Queen Elizabeth Hospital in Birmingham, with local research ethics committee approval (NHS Walsall LREC) and written informed patient consent during transplantation surgery. All samples were rapidly processed and snap-frozen in liquid nitrogen prior to storage at –80°C.
Serial tissue sections were obtained at 5 μm using a cryostat (model OFTF; Bright Instruments, Cambridge, UK) either onto steel MALDI target plates (ABSciex, Warrington, UK) for mass spectrometry or glass slides destined for H&E staining.
2.1.3 H&E Staining
Tissue architecture was visualized by routine hematoxylin and eosin (H&E) staining and optical microscopy.
2.1.4 MALDI Imaging
α-Cyano-4-hydroxycinnamic acid (CHCA) at 15 mg mL⁻¹ in 80% CH3OH, 0.1% trifluoroacetic acid (TFA) was applied to the sample and MALDI plate using an artist airbrush (Draper, Hampshire, UK) with Badger Airbrush propellant (Badger, IL, USA); approximately 10 mL of matrix solution was dispensed in total. MALDI TOF MS analysis was carried out on a hybrid quadrupole time-of-flight mass spectrometer (QStar XL, Analyst QS 1.1, and oMALDI 5.1, ABSciex, Warrington, UK) equipped with a fiber-delivered (100 μm core diameter) diode-pumped solid-state Nd:YVO4 laser (355 nm, 5 kHz, Elforlight: SPOT-10-100-355; Elforlight, Daventry, UK), providing a mass resolving power of >6000 at m/z 643. Spectra were acquired in positive ion mode in the mass range m/z 600–950 with a spatial resolution of 100 μm in both x and y directions.
2.1.5 Data Processing
Mass spectrometry images were extracted from the proprietary instrument format (.wiff) to the imzML format [converting to mzML using AB SCIEX MS Converter (ver. beta 1.1; ABSciex, Warrington, UK), then to imzML using imzMLConverter (ver. 1.0, www.imzMLConverter.co.uk)]. The imzML parser included with imzMLConverter was used to load individual spectra into MATLAB (MathWorks, Natick, MA, USA).
2.2 Random Projection
A mass spectrometry image is represented as a 2D data matrix $X \in \mathbb{R}^{m \times n}$, where m is the number of spectral channels and n is the number of pixels; typically m ≫ n. The random projections are implemented by constructing a matrix $Q \in \mathbb{R}^{k \times m}$, where k is an integer controlling the number of projections. Each element of Q is drawn from a zero-mean normal distribution with unit standard deviation, N(0, 1), and each row of Q corresponds to a random direction in spectral space onto which the data is projected by calculating $A = QX$, giving a projection score matrix $A \in \mathbb{R}^{k \times n}$. Setting k < m reduces the dimensionality of the projected data.
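The projection described above can be sketched in a few lines. This is an illustrative Python version, not the MATLAB demonstration code from the Supporting Information; the sizes `m` and `k`, the synthetic spectra, and the helper names `random_matrix` and `project` are assumptions. The division by the square root of k is an added convenience that puts projected distances on the original scale; it is a uniform scaling and so should not change which centroid is nearest in a Euclidean clustering.

```python
import math
import random


def random_matrix(k, m, rng):
    """Build a k x m matrix of i.i.d. N(0, 1) entries, one row per random direction."""
    return [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(k)]


def project(Q, spectrum):
    """Return the scores of one spectrum (length m) on each of the k rows of Q."""
    return [sum(q * x for q, x in zip(row, spectrum)) for row in Q]


rng = random.Random(42)
m, k = 2000, 150  # toy sizes, far smaller than a real MSI dataset
Q = random_matrix(k, m, rng)

# Two synthetic non-negative "spectra", i.e. two columns of X
x1 = [rng.random() for _ in range(m)]
x2 = [rng.random() for _ in range(m)]

a1, a2 = project(Q, x1), project(Q, x2)

d_original = math.dist(x1, x2)
d_projected = math.dist(a1, a2) / math.sqrt(k)  # rescaled to the original units
ratio = d_projected / d_original
print(f"original {d_original:.3f}, projected {d_projected:.3f}, ratio {ratio:.3f}")
```

The ratio is expected to lie close to 1, with a spread that narrows as k increases.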
We note that this can be implemented in a memory-efficient manner as the spectra are projected independently so that the full data matrix X does not need to be loaded into memory in its entirety.
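A minimal sketch of this one-spectrum-at-a-time scheme is given below, again in Python under stated assumptions. The generator `iter_spectra` is a hypothetical stand-in for a file reader (for example, an imzML parser) and simply yields synthetic spectra; only Q and the small score vectors are held in memory.

```python
import random


def iter_spectra(n_pixels, m, rng):
    """Hypothetical stand-in for a file reader: yields one spectrum at a time."""
    for _ in range(n_pixels):
        yield [rng.random() for _ in range(m)]


def project_stream(spectra, Q):
    """Yield the k projection scores of each spectrum as it arrives."""
    for spectrum in spectra:
        yield [sum(q * x for q, x in zip(row, spectrum)) for row in Q]


m, k, n_pixels = 500, 20, 30  # toy sizes
q_rng = random.Random(1)
Q = [[q_rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(k)]

# Each spectrum is discarded after projection; only k scores per pixel are kept
scores = list(project_stream(iter_spectra(n_pixels, m, random.Random(2)), Q))
print(len(scores), len(scores[0]))  # 30 20
```

Because Q does not depend on the data, the same loop could in principle consume spectra as they are acquired.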
2.3.1 k-Means Clustering
Segmentation was performed using the k-means algorithm implemented as the function kmeans in the MATLAB Statistics Toolbox (MATLAB R2009a). The algorithm is initialized by specifying a number of clusters, then arbitrarily allocating each data point to one of the clusters. The algorithm then proceeds iteratively by calculating the geometric center of each cluster and then allocating each data point to the cluster whose centroid is closest according to the Euclidean distance in the spectral space. The algorithm ends when membership of the clusters stabilizes. For visualization, every member of each cluster is assigned the same color (allowing spatially disconnected regions to have the same color), and a segmentation map is formed showing the class of each pixel.
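The iteration described above can be sketched as follows. This is a plain Python illustration of the textbook procedure with random initial assignment, not the MATLAB `kmeans` function used in this work, whose initialization and options may differ; the re-seeding of empty clusters is an assumption added to keep the sketch well defined.

```python
import math
import random


def kmeans(points, n_clusters, rng, max_iter=100):
    """Cluster points by iterated centroid update and reassignment.

    Returns (labels, centroids). Initial labels are assigned at random.
    """
    labels = [rng.randrange(n_clusters) for _ in points]
    centroids = [None] * n_clusters
    for _ in range(max_iter):
        # Update step: each centroid is the mean of its current members
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:  # assumption: re-seed an empty cluster from a random point
                members = [rng.choice(points)]
            centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        # Assignment step: nearest centroid by Euclidean distance
        new_labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # membership has stabilized
            break
        labels = new_labels
    return labels, centroids


rng = random.Random(0)
blob_a = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(20)]
blob_b = [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(20)]
labels, centroids = kmeans(blob_a + blob_b, 2, rng)
print(labels)
```

In the segmentation setting, `points` would be the columns of the score matrix A, and the returned labels would be reshaped to the image dimensions to form the segmentation map.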
2.4 Code Implementation
The random projection algorithm was implemented in MATLAB and demonstration code is provided in the Supporting Information.
We have evaluated the use of random projections for dimensionality reduction in MSI on a benchmark dataset whose histologic features have previously been identified using several approaches to MSI visualization. A second demonstration, on a publicly available mouse brain dataset that was included in the supplementary information of Race et al. (2013) [26], is contained in the Supporting Information (see Supplementary Figure S2).
3.1 Mass Spectrometry Imaging of Human Liver
The benchmark dataset consists of a MALDI mass spectrometry image acquired from a section of diseased human liver suffering from non-alcoholic steatohepatitis (NASH). The dataset contains 12,325 pixels each with an associated spectrum in 33,725 m/z channels, resulting in a raw data size ≈3 GB.
NASH disease is characterized by the accumulation of fat within liver hepatocytes (steatosis) and in a proportion of patients this is followed by the development of necro-inflammatory activity that leads to cirrhosis [17, 13]. The development of liver cell ballooning and inflammation (steatohepatitis) determines whether a patient progresses to irreversible liver damage and fibrosis, and can currently only be identified by histologic examination.
Spectra were averaged from the tissue and a substantial number of peaks were visible within the m/z range 700–900, which is known to correspond to the masses of multiple lipids (Figure 1). Manual inspection of the data revealed several peaks that produced ion images that reflected the tissue histology; an arbitrary example from a peak of low intensity in the mean spectrum is shown in Figure 1. To obtain a rough estimate of the spectral complexity of the dataset, peak picking was applied to the mean spectrum (maximum-window peak detection), which returned >900 peak centroids, the majority of which do not correspond to m/z values associated with CHCA matrix. This gives an indication of the degree to which the data can potentially be reduced, but applying peak detection to all spectra within an image and aligning the results is computationally intensive. As the random projections are data-independent, they can be generated without the dataset in memory and applied piece-wise to one pixel at a time.
3.2 Random Projection of MSI
The random projection of the data onto the k random vectors that make up Q creates k vectors, each of which randomly samples over the whole m/z range. Each projection therefore captures a randomly weighted linear combination of all m/z channels and, thus, samples the full range of chemical information present. As the sampling is random, there is no a priori way of knowing what chemical information will be captured by a particular projection, and direct analysis of single projections is unlikely to be informative, but by taking many projections, all of the information can be captured with very high probability.
It is also important to note that the projection vectors are drawn from a zero-mean Gaussian, so they contain values of both signs. Accordingly, the scores also take both positive and negative values, which makes it difficult to relate the projection intensities to their physical origin.
In this work, projections are applied to the data sequentially, loading each column of X in turn and forming the k projections for each pixel in turn. The time it takes to project a spectrum (150 random projections of a single spectrum takes ≈0.1 s) is lower than the data acquisition time (≈0.5 s), which makes this potentially usable for real-time analysis of data during the acquisition process.
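As a rough worked check of the scale of the reduction, the following arithmetic takes k = 150 (the number of projections quoted for the timing above, used here as an assumed working value) together with the m = 33,725 channels and n = 12,325 pixels of the liver dataset:

```latex
\begin{align*}
\text{fractional reduction} &= 1 - \frac{k}{m} = 1 - \frac{150}{33\,725} \approx 0.9956 \\
\text{raw values} &= m \times n = 33\,725 \times 12\,325 = 415\,660\,625 \\
\text{projected values} &= k \times n = 150 \times 12\,325 = 1\,848\,750
\end{align*}
```

Assuming 8 bytes per value (double precision, which is an assumption about storage rather than a stated detail), the raw matrix occupies roughly 3.3 GB, in line with the ≈3 GB quoted for the dataset, whereas the projected scores occupy roughly 14.8 MB.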
3.3 Segmentation from Random Projections
3.3.1 Spatial Patterns Detected by Segmentation
The image segmentation following random projection is shown in Figure 2, and shows clear delineation of the tissue section that has been determined to be consistent with histopathology. Hepatocytes are extracted from the surrounding tissue (orange), which consists mostly of fibrotic connective tissue, with the majority of hepatocytes being assigned to the same cluster (green). Interestingly, this segmentation technique identified the subpopulation of hepatocytes (blue), which were thought to be regenerating nodules, and identified the center of these nodules as being a distinct cluster. All of these assignments are in agreement with the visualization techniques of Fonville et al. [14]. Further analysis is necessary to determine the nature of the spectral differences between the clusters.
3.3.2 Spectral Properties of ROIs Derived from Segmentation
After clustering was performed on the randomly projected data, the mean spectrum for each cluster was computed from the original data. These are shown underneath the segmentation map in Figure 2. These molecular profiles show a variety of spectral differences between the regions. There is a clear difference in the relative abundances of species present, and different ions show patterns corresponding to hepatocytes (green and blue), portal areas (red), and regions of fibrotic matrix (orange). The Euclidean distance between the centroids provides an idea of how different the clusters are from each other, and this is shown in the grid in Figure 2. As this distance is based on the projection of the spectra, it is a measure of the spectral similarity between clusters, and these results indicate that the largest difference is between the regenerating hepatocyte centers and the surrounding (normal) tissue, with a smaller difference from the other enlarged hepatocytes.
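The two computations described above, per-cluster mean spectra and a grid of centroid distances, can be sketched as below. The helper names `cluster_means` and `distance_grid` and the tiny example data are assumptions for illustration. In the sketch the grid is computed from the mean spectra for brevity; to follow the text, the centroids of the projected scores would be passed to `distance_grid` instead.

```python
import math


def cluster_means(spectra, labels, n_clusters):
    """Return the mean of the spectra assigned to each cluster (None if empty)."""
    m = len(spectra[0])
    sums = [[0.0] * m for _ in range(n_clusters)]
    counts = [0] * n_clusters
    for spectrum, label in zip(spectra, labels):
        counts[label] += 1
        for j, value in enumerate(spectrum):
            sums[label][j] += value
    return [
        ([s / c for s in row] if c else None)
        for row, c in zip(sums, counts)
    ]


def distance_grid(centroids):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(a, b) for b in centroids] for a in centroids]


# Three toy "spectra" with three channels each, in two clusters
spectra = [[1.0, 0.0, 2.0], [3.0, 0.0, 2.0], [0.0, 4.0, 0.0]]
labels = [0, 0, 1]

means = cluster_means(spectra, labels, 2)
grid = distance_grid(means)
print(means)  # [[2.0, 0.0, 2.0], [0.0, 4.0, 0.0]]
print(round(grid[0][1], 3))  # 4.899
```

Accumulating sums and counts in a single pass means the mean spectra can also be computed while streaming through the original data.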
Interpreting the spatial maps still requires input from an appropriate expert but segmentation provides a way of presenting the results from mass spectrometry imaging in a format that can be readily understood by non-mass spectrometry experts.
3.4 Choosing the Number of Projections
We now consider how many projections are necessary to ensure that the original data is accurately represented. The search for formal upper bounds on the number of random projections is still an active field [6, 19, 12] and so we treat this as an experimental variable.
An important feature of this approach is that the number of projections required can be estimated as soon as a good fit to the first singular value curve can be obtained, which is possible before the elbow has been reached. This can be done efficiently by taking an initial set of projections from which subsets are drawn to generate the curve. If the total number of projections is insufficient and the elbow in the curve is not reached, further projections can be added until the elbow is seen.
3.5 Effect of the Number of Projections on the Segmentation
We first observe that clustering on a very small number of projections can produce a segmentation that bears some resemblance to the known tissue histology (row 1 in Figure 4) but is typically very noisy and poorly connected; an insufficient number of random projections yields unstable and irreproducible clustering results.
However, experiments using a low number of projections serve to illustrate the idea that each projection samples across the whole spectrum and, therefore, a few projections capture a statistical selection of the chemical information. A small number of projections is, therefore, sufficient to identify broad trends in the data, but not the important fine details.
As the number of projections is increased, the segmentation map rapidly stabilizes. The pairwise correlation between maps produced with an equal number of random projections was calculated (see Supplementary Figure S3) and was less than 0.2 for five projections but approximately 0.9 when 200 were used. Using more than ≈100 projections yields little additional benefit, which agrees with the singular value decay shown in Figure 3 and with other results in the literature on random projections: a stable solution is reached after sufficient projections are included and the results do not significantly improve when additional projections are included [6, 19, 12]. This makes random projection a very robust dimensionality reduction technique, as it is not too sensitive to the number of projections. For MALDI-MSI data, we have found that 100 to 200 projections are sufficient on all datasets that we have considered, which is in line with other recommendations for the number of variables to consider with classification algorithms. It is useful to note that the computational cost of increasing the number of projections is low, as the majority of computational time is spent loading the data from disk as opposed to performing the calculations.
For a comparison with the performance of a more conventional dimensionality reduction technique, we also performed principal component analysis (PCA), which is frequently used in MSI for this purpose, and subsequently performed segmentation, as shown in Supplementary Figure S4. Visually, the segmentation results obtained are near-identical in both cases (with 100 RPs), with the same tissue regions identified. We also computed the correlation between segmentations following random projection and PCA, and found a correlation above 0.9 from 100 projections, rising slowly thereafter. This illustrates that the information required for segmentation (in particular, Euclidean distance) is preserved to the same degree by both techniques, but random projection is much more computationally efficient (Supplementary Table S1).
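Comparing two segmentation maps requires a measure that ignores the arbitrary numbering of clusters. The exact correlation measure used for Supplementary Figure S3 is not restated here, so the sketch below substitutes the Rand index, a standard pair-counting agreement score, as one plausible way to quantify stability between repeated runs. The function name and the toy maps are illustrative assumptions.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the fraction of pixel pairs on which two segmentations agree.

    A pair agrees if both maps put the two pixels in the same cluster,
    or both maps put them in different clusters.
    """
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        total += 1
    return agree / total


map_1 = [0, 0, 1, 1, 2, 2]
map_2 = [2, 2, 0, 0, 1, 1]  # same partition as map_1 with permuted labels
map_3 = [0, 1, 0, 1, 0, 1]  # a different partition

print(rand_index(map_1, map_2))  # 1.0
print(rand_index(map_1, map_3))  # 0.4
```

The pairwise loop is quadratic in the number of pixels, so for a full image a contingency-table formulation would be preferable; the version here is kept simple for clarity.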
Random projection has been shown to be a fast, repeatable, and effective dimensionality reduction tool for MSI data that can be used to enable fast and accurate segmentation. We have shown that segmentation following random projection produces results that are consistent with the known histology. As random projection permits segmentation on data that has not undergone any processing, it potentially offers a useful baseline against which the effects of further data processing can be compared. In this work, random projections were applied directly to the data without any other processing but could equally well be applied after de-noising and feature selection. Further investigation would be required into the effect this has on subsequent segmentation.
We have demonstrated the use of random projections to allow rapid segmentation using k-means clustering but, in principle, any segmentation or visualization method that uses the Euclidean distance metric could benefit [14, 16]. The main disadvantage of this method is that the projection matrix is, in general, not invertible. The projections are, therefore, “one-way” and the results cannot be directly interpreted in terms of the original m/z values. In cases where recovery of the original data is required from the projections, an orthogonalized random basis approach has previously been developed, which yields similar benefits for segmentation but requires additional computation.
This work has demonstrated the potential of simple random projections on MSI datasets, but other spectroscopic techniques could also benefit. Related work has shown the application of random projections to Raman microscopy and hyperspectral optical imaging, and it is therefore reasonable to expect that the results found here can be generalized to other spectral techniques. We expect there will be particular benefits in high mass-resolution mass spectrometry methods and new developments such as Rapid Evaporative Ionization Mass Spectrometry or miniaturized portable spectrometers that produce high-throughput data requiring real-time analysis in environments where significant computing power is not available and data transfer bandwidth may be limited. Random projection is memory-efficient, as each spectrum is processed sequentially, and computationally inexpensive, as the basis simply requires the generation of k random vectors. The use of computationally efficient algorithms such as random projection may be a powerful tool for the rapid classification of samples or for determining which samples require further investigation.
The authors are grateful to the EPSRC for funding a studentship to A.D.P. under grant EP/F50053X/1, the PSIBS Doctoral Training Center.
- 5. Balog, J., Szaniszlo, T., Schaefer, K.-C., Denes, J., Lopata, A., Godorhazy, L., Szalay, D., Balogh, L., Sasi-Szabo, L., Toth, M., Takats, Z.: Identification of biological tissues by rapid evaporative ionization mass spectrometry. Anal. Chem. 82(17), 7343–7350 (2010)
- 6. Bingham, E., Mannila, H.: Random projection in dimensionality reduction: applications to image and text data. Proceedings of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 245–250, San Francisco, California, 26–29 Aug 2001
- 8. Chaurand, P., Cornett, D., Angel, P., Caprioli, R.: From whole-body sections down to cellular level, multiscale imaging of phospholipids by MALDI mass spectrometry. Mol. Cell. Proteom. 10(2), 4259–11 (2011)
- 10. Donoho, D.L.: High-dimensional data analysis: the curses and blessings of dimensionality. AMS Math Challenges Lecture 1–32 (2000)
- 12. Durrant, R., Kabán, A.: Compressed Fisher linear discriminant analysis: classification of randomly projected data. Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1119–1128, Washington, DC, 25–28 Jul 2010
- 13. Farrell, G., Larter, C.: Nonalcoholic fatty liver disease: from steatosis to cirrhosis. Hepatology 43(S1), S9–S112 (2006)
- 14. Fonville, J.M., Carter, C.L., Pizarro, L., Steven, R.T., Palmer, A.D., Griffiths, R.L., Lalor, P.F., Lindon, J.C., Nicholson, J.K., Holmes, E., Bunch, J.: Hyperspectral visualization of mass spectrometry imaging data. Anal. Chem. 85(3), 1415–1423 (2013)
- 15. Johnson, W., Lindenstrauss, J.: Extensions of Lipschitz mappings into a Hilbert space. Contemp. Math. 26, 189–206 (1984)
- 18. Lalor, P., Faint, J., Aarbodem, Y., Hubscher, S., Adams, D.: The role of cytokines and chemokines in the development of steatohepatitis. In: Seminars in Liver Diseases, Vol. 27, pp. 173–193. Thieme-Stratton: New York: c1981 (2007)
- 19. Lin, J., Gunopulos, D.: Dimensionality reduction by random projection and latent semantic indexing. Proceedings of the Text Mining Workshop, at the 3rd SIAM International Conference on Data Mining, San Francisco, California, 1–3 May 2003
- 23. Palmer, A.D., Bannerman, A., Grover, L., Styles, I.B.: Faster tissue interface analysis from Raman microscopy images using compressed factorization. Proceedings of the European Conferences on Biomedical Optics, pp. 87980H–87980H. International Society for Optics and Photonics, Munich, Germany, 12–16 May 2013
- 24. Palmer, A.D., Bunch, J., Styles, I.B.: Randomized approximation methods for the efficient compression and analysis of hyperspectral data. Anal. Chem. 85(10), 5078–5086 (2013b)
- 25. Race, A., Styles, I., Bunch, J.: Inclusive sharing of mass spectrometry imaging data requires a converter for all. J. Proteom. 75(16), 5111–5112 (2012)
- 26. Race, A., Steven, R., Palmer, A., Styles, I., Bunch, J.: Memory efficient principal component analysis for the dimensionality reduction of large mass spectrometry imaging datasets. Anal. Chem. 85(6), 3071–3078 (2013)
- 29. Trede, D., Schiffler, S., Becker, M., Wirtz, S., Steinhorst, K., Strehlow, J., Aichler, M., Kobarg, J.H., Oetjen, J., Dyatlov, A., Heldmann, S., Walch, A., Thiele, H., Maaß, P., Alexandrov, T.: Exploring three-dimensional matrix-assisted laser desorption/ionization imaging mass spectrometry data: three-dimensional spatial segmentation of mouse kidney. Anal. Chem. 84(14), 6079–6087 (2012)
- 31. Varmuza, K., Engrand, C., Filzmoser, P., Hilchenbach, M., Kissel, J., Krüger, H., Silén, J., Trieloff, M.: Random projection for dimensionality reduction-applied to time-of-flight secondary ion mass spectrometry data. Anal. Chim. Acta 705(1), 48–55 (2011)
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.