1 Introduction

The emerging technologies for creating complex and specific engineered surface chemistry have increased the need for surface analysis methods that can accurately characterize and verify the actual surface chemistry [1]. This is particularly challenging when one considers the complexity of even simple surface modifications, and the trend towards using multicomponent, patterned surface chemistry in a wide range of biomedical applications, including biosensing. For example, an engineered surface may have a linker molecule to tether a coating to the substrate, molecules with binding groups for a specific analyte, and other molecules to provide a non-fouling background to avoid non-specific adsorption [2]. Each of these surface components could have a similar chemical composition with differences only in the structure or arrangement of the chemical groups. To optimize the performance of biosensors it is essential to minimize non-specific interactions via a non-fouling background and to maximize the biological activity of surface tethered probe molecules via control of orientation, conformation, and density [3]. Though no one technique can provide a complete characterization of such a surface, time-of-flight secondary ion mass spectrometry (ToF-SIMS) is one method that shows great promise due to its molecular specificity, relatively high mass resolution, and high sensitivity [4]. However, ToF-SIMS data from even a set of simple homogeneous surfaces can be very complex.

Good introductions to ToF-SIMS can be found in the literature [4–7]. ToF-SIMS is a mass spectrometry technique that probes the chemistry and structure of the outer surface by impacting energetic primary ions onto a sample and analyzing the secondary ions emitted from the surface. There are a wide variety of primary ions used in ToF-SIMS ranging from mono-atomic ions such as Ar+, Ga+, Cs+, Au+, Bi+, to cluster ions such as Aun+ (n = 1–3), Bin+ (n = 1–5), SF5+, Arn+ (n ~ 500–10,000) and C60+. These ions impact the surface with energies typically in the range of 1–25 keV causing a collision cascade that results in the emission of ions, neutrals and radicals. Only a small fraction (typically 1 % or less) of the emitted material is ionized. These ionized atoms, molecules, and molecular fragments are extracted into a time-of-flight mass analyzer where they are separated by mass and recorded in a mass spectrum. ToF-SIMS is an ultra high vacuum technique that requires the samples be analyzed in a dehydrated state, but hydrated conditions can be simulated by placing samples in a frozen hydrated state. In spite of this limitation, ToF-SIMS has been used successfully to analyze a wide range of organic and biological samples including proteins [8–15], lipids [16–19], cells [20–25], and tissues [26–30].

A typical ToF-SIMS spectrum can contain hundreds of peaks, the intensity of which can vary due to the composition, structure, order, and orientation of the surface species [31]. ToF-SIMS data are inherently multivariate since the relative intensities of many of the peaks within a given spectrum are related, because they often originate from the same surface species [32, 33]. The challenge is to determine which peaks are related to each other, and how they relate to the chemical differences present on the surface. This problem is then exacerbated by the fact that a given data set typically contains multiple spectra from multiple samples, which can result in a large data matrix to be analyzed. This data overload is even more prominent in ToF-SIMS imaging where a single 256 × 256 pixel image contains 65,536 spectra. This complexity, combined with the enormous amount of data produced in a ToF-SIMS experiment, has led to a marked increase in the use of multivariate analysis (MVA) methods in the processing of ToF-SIMS images and spectra [33].

Table 1 presents a summary of ToF-SIMS studies that have been carried out using at least one MVA method. As seen in the table, MVA includes an alphabet soup of methods that are designed to aid a researcher in reducing large data sets to a manageable number of variables with the aim of helping them understand and interpret the data. These methods include, but are not limited to, principal component analysis (PCA), discriminant analysis (DA), partial least squares (PLS), multivariate curve resolution (MCR), and maximum autocorrelation factors (MAF).

Table 1 Summary table of selected studies using MVA for processing of ToF-SIMS data

These MVA methods are one of many tools available that can aid in the interpretation of ToF-SIMS images and spectra. These methods are statistically based, and to be used properly require good experimental plans and controls. MVA methods should not be treated as a black box. These methods are a way for the analyst to summarize and understand the large data sets generated by ToF-SIMS. The use of MVA does not preclude the need for a sound understanding of ToF-SIMS, or for using complementary surface analysis methods in order to interpret the data and understand the surfaces being analyzed.

In this short review we will propose guidelines for the successful application of MVA to ToF-SIMS data, and provide examples from the literature and our group on how these methods have been successfully applied. Since it is beyond the scope of this paper to discuss all MVA methods we will mainly focus on PCA and its application to ToF-SIMS data, though other methods will also be highlighted. PCA forms the basis of many of the other MVA methods and therefore provides a good entry point into the use of MVA with ToF-SIMS data.

2 PCA

2.1 Overview

A brief summary of PCA will be provided below. Jackson and Wold [34–36] have provided an excellent general introduction to PCA, and other overviews of using MVA with ToF-SIMS data can be found in the literature [32, 37–39, 41].

Principal component analysis is an MVA method that looks at the overall variance within a data set. Herein a data set is defined as a matrix where the rows contain samples and the columns contain variables. For ToF-SIMS data the samples are spectra and the variables are measured intensities of individual mass channels or integrated peak areas from selected or binned regions. PCA is calculated from the covariance matrix of this original data set [42, 43]. Geometrically, PCA is an axis rotation that aligns a new set of axes, called principal components (PCs), with the maximal directions of variance within a data set [35]. PCA generates three new matrices containing the scores, the loadings, and the residuals. The scores show the relationship between the samples (spectra) and are a projection of the original data points onto a given PC axis. The loadings show which variables (peaks) are responsible for the separation seen in the scores plot. The loadings are the direction cosines between the original axes and the new PC axes. The residuals represent random noise that is presumed to not contain any useful information about the samples [34].
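As a minimal sketch of the decomposition described above, and not the software used to produce the results in this review, the scores, loadings and residuals can be obtained with NumPy from the singular value decomposition of the mean centered data matrix. The function name `pca` and the small data matrix are hypothetical and serve only as an illustration.

```python
import numpy as np

def pca(X, n_components):
    """PCA of a samples-by-variables matrix via SVD of the mean centered data."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                  # mean center each variable (column)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T           # unit vectors: direction cosines of the PC axes
    scores = Xc @ loadings                   # projection of each sample onto each PC
    residuals = Xc - scores @ loadings.T     # variance not captured by the retained PCs
    explained = s**2 / np.sum(s**2)          # fraction of total variance per PC
    return scores, loadings, residuals, explained[:n_components]

# Invented data: 6 "spectra" (rows) x 4 "peaks" (columns).
X = np.array([[10.,  2., 1., 5.],
              [11.,  2., 1., 5.],
              [ 9.,  3., 1., 4.],
              [ 2.,  9., 1., 5.],
              [ 3.,  8., 1., 6.],
              [ 2., 10., 1., 5.]])
scores, loadings, residuals, explained = pca(X, n_components=2)
print("PC1 scores:", np.round(scores[:, 0], 2))
print("PC1 loadings:", np.round(loadings[:, 0], 2))
```

In this toy matrix the first three and last three spectra fall on opposite sides of PC1, and the first two peaks carry PC1 loadings of opposite sign. The overall sign of a PC is arbitrary, which is one reason the scores and loadings have to be read together.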

Together the scores and loadings represent a concise summary of the original data that in most cases can aid in interpreting the data being analyzed. The scores and loadings must be interpreted together and have little meaning alone. Figure 1 shows the scores and loadings plots from two brands of 70 % dark chocolate analyzed by ToF-SIMS and PCA. The data were normalized to the total intensity of each respective spectrum, square root transformed and mean centered prior to PCA. As seen in Fig. 1a the two samples are separated on the PC1 axis. The PC1 loadings in Fig. 1b show which peaks are responsible for the differences seen between the two samples. In general, peaks with high loadings on one side of a given PC axis will show a higher relative intensity for samples with high scores on the same side of the given PC axis. In the case of these two chocolates, Sample 1 (negative PC1 scores) corresponds with a series of hydrocarbons and peaks such as Al, Na, Ca (negative PC1 loadings). Sample 2 (positive PC1 scores) corresponds with K and a series of higher mass peaks from lipids and diacylglycerols (positive PC1 loadings). It should be noted that this does not necessarily mean that the high mass lipid peaks do not show up in the spectra from Sample 1, only that the relative intensities of these peaks were lower in Sample 1. This can be seen in Fig. 2, which shows the relative intensity of Na and a peak from a diacylglycerol of stearic and oleic acid. As expected, the relative intensity of the Na peak is higher for Sample 1, and the relative intensity of the diacylglycerol peak is higher for Sample 2.

Fig. 1

PCA scores and loadings from a set of dark chocolate samples. a PCA PC1 scores. The black squares represent the 95 % confidence limits for each sample. Samples are separated from each other on the PC1 axis. b PCA PC1 loadings plot. Peaks with positive loadings correspond with samples with positive scores. Peaks with negative loadings correspond with samples with negative scores

Fig. 2

Peak area plots from the chocolate data. a m/z = 23 (Na), a negative loading peak. b m/z = 605 (diacylglycerol peak), a positive loading peak. As expected from the PCA results, the relative intensity of Na is higher on sample 1 and the relative intensity of the diacylglycerol peak is higher on sample 2

2.2 Data Collection

Since all MVA methods are statistically based, it is important to make sure that one collects sufficient data to have statistically relevant results. The exact number of replicates required will depend on the system being analyzed. However, doing MVA processing on data from only 1 or 2 spots per sample will most likely produce meaningless, unreproducible results unless the surfaces being analyzed are extremely homogeneous. For ToF-SIMS spectra we suggest collecting data from at least 3–5 spots per sample across at least 2 samples for homogeneous surfaces, and 5–7 spots per sample across at least 3 samples for non-homogeneous samples. It is also best to collect data from samples created on separate days to get a true measure of the sample-to-sample variability. Care should also be taken to minimize, and ideally avoid, detector saturation during data collection. Saturated peaks can result in the production of extra factors that describe the non-linear intensity variations of saturated peaks. For example, Lee and Gilmore [40] showed an imaging PCA example of an immiscible blend of PC/PVC where the first PC captured the difference between the two polymers, and the second PC captured differences between the 35Cl peak due to detector saturation. We have also observed isotopes loading in opposite directions when one of the isotopic peaks is saturated.

2.3 Data Preprocessing

2.3.1 Peak Selection

Though it is possible to run MVA on every mass channel of a ToF-SIMS spectrum, it is more common for users to export a set of selected peak areas from the spectrum. This is a logical choice since ToF-SIMS spectra contain many mass channels that contain only noise, or no counts, due to the low background between peaks. Since the set of peaks used in MVA can affect the results, peak selection becomes the first data preprocessing step before one can run MVA. Peak selection can be done by binning the data into chosen interval sizes, using automated peak detection routines, or by manually choosing and measuring peak areas. Due to the large number of peaks typically seen at each nominal mass in ToF-SIMS spectra from organic and biological surfaces, manual peak selection is recommended. Data binning is quick, but it loses the high mass resolution information present in the original spectra. If data binning is used, one must go back to the original data to determine which peaks are changing at a given mass to properly interpret the MVA results. Automated peak selection routines are getting better, but they often misplace peak integration limits, which can be problematic when working with spectra with overlapping peaks. Manual peak selection is time consuming, but allows the user to utilize all of the spectral peaks and to check the peak integration limits to assure they are placed properly across all samples. The assumptions made when selecting peaks for ToF-SIMS MVA have been reviewed previously [33].
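The loss of mass resolution caused by binning can be illustrated with a short NumPy sketch. The function `bin_spectrum` and the counts are hypothetical; the m/z values are the calculated masses of C2H3O+, C3H7+ and C2H4O+, chosen here only because the first two share a nominal mass.

```python
import numpy as np

def bin_spectrum(mz, counts, mz_start, mz_stop, bin_width=1.0):
    """Sum the counts of all mass channels that fall in each fixed-width m/z bin."""
    edges = np.arange(mz_start, mz_stop + bin_width / 2, bin_width)
    binned, _ = np.histogram(mz, bins=edges, weights=counts)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, binned

# Invented counts for three peaks: C2H3O+ (43.018), C3H7+ (43.055), C2H4O+ (44.026).
mz = np.array([43.018, 43.055, 44.026])
counts = np.array([120.0, 80.0, 40.0])

centers, binned = bin_spectrum(mz, counts, mz_start=42.5, mz_stop=44.5)
for c, b in zip(centers, binned):
    print(f"m/z {c:.0f}: {b:.0f} counts")
```

With unit mass bins the oxygen containing and hydrocarbon peaks at nominal mass 43 are summed into a single variable, so a change in that variable cannot be assigned to either species without going back to the original spectrum.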

Before selecting peaks it is important to make sure that all of the spectra are properly mass calibrated, and that the same calibration set is used for all spectra within a given sample set. When selecting peaks it is recommended that one overlaps representative spectra from each sample type in a given set to be sure all peaks across all spectra are selected. The same peak list must be used for all samples within a given set.

2.3.2 Data Scaling

The results produced by many MVA methods depend strongly on the data matrix preprocessing. Data preprocessing includes normalization, mean centering, scaling and transformation. The goal of data preprocessing is to remove variance from the matrix that is not due to chemical differences between the samples. This could include variation due to the instrumentation, differences in the absolute intensity of peaks within a spectrum, differences due to topography or other factors. The assumptions made when preprocessing a data matrix have been addressed [33]. Normalization is done by dividing each variable (peak) in the matrix by a scalar value. Normalization is done to remove variance in the data that is due to differences such as sample charging, instrumental conditions or topography. It should be noted however that normalization can accentuate noise in ToF-SIMS images due to the low count rates often found within the data. Therefore, care should be taken when using normalization of ToF-SIMS images. Mean centering subtracts the mean of each variable (peak) from the data set. This allows the data to be compared across a common mean of zero. Scaling refers to dividing each variable by a constant, and transformation refers to transforming the data with a function such as the logarithm or square root. There are many different ways to scale and transform a data set. It is recommended that one selects a scaling and transformation method based on the experimental uncertainty of each peak. ToF-SIMS data often follows a Poisson distribution. Keenan and Kotula [39, 44, 45] proposed a scaling method that accounts for Poisson noise which has been found to work well for ToF-SIMS data. If Poisson scaling is not available in the software package being used, other good alternatives include using a square root transform of the data or scaling by the square root of the mean of the data. 
Regardless of the preprocessing methods used, one should understand the assumptions made [33] when applying a given method and choose a method based on what they are trying to learn from the data and not what gives the best looking results.
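The preprocessing steps discussed in this section can be sketched in a few lines of NumPy. All function names and the small peak area matrix are hypothetical. The `poisson_scale` function assumes that the scaling of Keenan and Kotula amounts to dividing each element by the square root of the product of its row mean and column mean; this form is our reading of the method and should be checked against Refs. [39, 44, 45] before use.

```python
import numpy as np

def normalize_total(X):
    """Divide each spectrum (row) by its total intensity."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def sqrt_transform(X):
    """Square root transform of the data."""
    return np.sqrt(np.asarray(X, dtype=float))

def mean_center(X):
    """Subtract the mean of each variable (column)."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)

def poisson_scale(X):
    """Divide each element by sqrt(row mean * column mean) (assumed form)."""
    X = np.asarray(X, dtype=float)
    row_mean = X.mean(axis=1, keepdims=True)
    col_mean = X.mean(axis=0, keepdims=True)
    scale = np.sqrt(row_mean * col_mean)
    scale[scale == 0] = 1.0          # leave all-zero rows or columns untouched
    return X / scale

# Invented peak areas: rows are spectra, columns are peaks.
X = np.array([[100., 300., 600.],
              [ 50., 150., 300.],    # same composition as row 0, half the counts
              [400., 400., 200.]])

Xn = normalize_total(X)                          # rows 0 and 1 become identical
X_pca_ready = mean_center(sqrt_transform(Xn))    # normalize, square root, mean center
X_poisson = mean_center(poisson_scale(X))        # Poisson scale, then mean center
```

The first two rows differ only in total counts, so normalization removes the difference between them. Whether that difference is instrumental or chemical is exactly the kind of assumption that should be considered before a normalization is chosen.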

2.4 Interpreting Results

Once the data set has been processed using MVA the user must interpret what, if anything, the results mean. This is particularly important when using scale dependent methods such as PCA or MCR since the results obtained will be affected by the assumptions made when preprocessing the data. Though there are no quantitative measures of the validity for a given result, we suggest using the following guidelines when examining the results from MVA processed ToF-SIMS data: (1) Are the results logical? (2) Do the results agree with the other data collected on the samples? (3) Are the results reflected in the original data?

Multivariate analysis is simply a tool available to help understand large data sets. As with any results, one should always ask if the results obtained make sense. When using MVA one should also ask if the choices made during data preprocessing make sense and rest on valid assumptions about the data. Also, since no one surface analytical method can provide all the information needed to understand a surface, it is important to collect complementary data on the samples within a given set. The user can then validate the ToF-SIMS data with respect to the information obtained from the other methods. Another important aspect of interpreting results from MVA is to validate the results by looking at the original data. This means going back to look at the relative intensity of the peaks that are highlighted in the loadings plots to verify that they follow the trends shown in the scores plots (for example, see Ref. [46]). This is particularly important when using PCA since some PCs can capture multiple trends across the samples. When this happens, one will find that some peaks with high loadings can reflect changes in only a subset of the samples seen in the scores plot. One caveat that should be noted when validating results with the original data is that methods such as PCA are calculated from the scaled covariance matrix, and that each subsequent PC is calculated from the matrix after the previous PCs have been subtracted from the data set. Therefore, the original data is only truly reflective of the first PC.
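The caveat about subsequent PCs can be made concrete with a small numerical sketch. Using a synthetic, randomly generated matrix (not ToF-SIMS data), the code below removes the PC1 contribution from the mean centered data and shows that the leading component of what remains is the same as PC2 from the full decomposition. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic matrix: 20 "spectra" x 5 "peaks" with decreasing variance per peak.
X = rng.normal(size=(20, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
Xc = X - X.mean(axis=0)

# All PCs at once from the SVD of the mean centered data.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
p1, p2 = Vt[0], Vt[1]

# Deflation: subtract the PC1 contribution, then extract the leading PC again.
t1 = Xc @ p1
E1 = Xc - np.outer(t1, p1)
p2_deflated = np.linalg.svd(E1, full_matrices=False)[2][0]

# The two loading vectors agree up to an arbitrary sign.
agreement = abs(p2 @ p2_deflated)

# Fraction of the variance in the preprocessed data that is left for PC2 onward.
remaining = np.sum(E1**2) / np.sum(Xc**2)
print(f"agreement = {agreement:.4f}, variance remaining after PC1 = {remaining:.2f}")
```

Because PC2 describes only the variance left in the deflated matrix, a peak with a high PC2 loading need not show the corresponding trend clearly in the original spectra when PC1 dominates the data set.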

Another important aspect to keep in mind is that unsupervised MVA methods such as PCA are designed to find the greatest directions of variance within a data set regardless of their source. That means that if one sample within a data set is contaminated, the first few PCs will most likely separate out the contaminated sample from the other samples. Thus, any variance that is due to the designed surface chemistry may be suppressed by the large variance due to the contamination. However, it is important to note that being able to discover contamination within a sample set can be a valuable piece of information that can help troubleshoot sample preparation and handling protocols. This information can then be fed back into the sample procedures and experimental design so differences in the surface chemistry of interest can be studied.

3 MVA of Complex Surfaces

Most biological materials are multicomponent whether they involve a mixture of proteins, a combination of cells and extracellular matrix, or a mixed polymer system used as a scaffold for tissue engineering. Being able to accurately characterize these samples can be challenging and can push the limits of any analytical technique. ToF-SIMS and MVA have been used to analyze a range of different multicomponent systems including binary [8–11], ternary [10] and multicomponent protein layers [12–15], lipids [16–19], mammalian cells [20–22], microbial cells [23–25], decellularized extracellular matrix [47, 48], tissues [26–30] and polymer microarrays [49–52].

Many multicomponent systems are challenging to analyze because the components within the systems generate the same peaks. For example, when analyzing immobilized proteins with ToF-SIMS the spectra are dominated by amino acid fragments, not small or large peptides. This means that for any given protein, the “characteristic” peaks will all be the same, and it is the relative intensities of these peaks that encode information about the protein's identity [53–55], concentration [10–12, 56, 57], conformation [58–62], orientation [63–66], or spatial distribution [66–69]. Therefore, these types of problems are truly multivariate in nature. As a result of this interrelated spectral complexity, one will need proper controls to be able to define the best system to characterize a given sample set. Furthermore, one will need to choose the appropriate MVA method to provide the information sought in the experiments. For example, PCA, MCR or MAF would be logical choices for general data exploration or sample identification, whereas principal component regression (PCR) or PLS would be better for quantification or modeling, and discriminant function analysis (DFA), hierarchical cluster analysis (HCA) or PLS discriminant analysis (PLS-DA) would be better for classification. The remainder of this section provides a brief discussion for applying MVA to multicomponent systems.

3.1 Analysis of Polymer Microarrays

Publications from Alexander et al. [49, 51, 52, 70, 71] show how to combine combinatorial chemistry, ToF-SIMS and MVA. They use a microarray printer to prepare microarrays of a library of polymeric materials for high throughput assays, and relate polymer functionality to surface properties [49] and cellular interactions [52]. They have also studied carbohydrate microarrays [51] and examined printing heterogeneities with different polymer formulations [70].

In their study of the relationship of surface chemistry with wettability, the authors used a combinatorial copolymer array of 576 different polymers from 24 unique monomers [49]. The polymers were made using a 70:30 mixture of 2 monomers. PCA scores of the mean centered data showed that most of the polymers clustered together based on the major component of the monomer mixture. The loadings showed peak trends that corresponded with the chemical differences between the types of monomers as expected from their molecular structures. The authors then created a PLS model correlating the ToF-SIMS data with the water contact angle, and showed good correlation between the predicted and measured water contact angle (R² = 0.94). This work is a good demonstration of how using a systematically designed study combined with ToF-SIMS and MVA can provide useful chemical information from a complex data set.
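To illustrate what a PLS model of this kind involves, the sketch below fits a single-response PLS regression with the NIPALS algorithm in NumPy. It is not the model or the data of Ref. [49]: the data are synthetic, the "contact angle" values are generated from two hidden factors, and the function names `pls1_fit` and `pls1_predict` are hypothetical.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean            # mean centered X and y
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w /= np.linalg.norm(w)               # direction in X most covariant with y
        t = E @ w                            # scores for this latent variable
        tt = t @ t
        p = E.T @ t / tt                     # X loadings
        qa = f @ t / tt                      # y loading
        E = E - np.outer(t, p)               # deflate X
        f = f - qa * t                       # deflate y
        W.append(w)
        P.append(p)
        q.append(qa)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # regression vector in the original variables
    return coef, x_mean, y_mean

def pls1_predict(X, coef, x_mean, y_mean):
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

# Synthetic example: 30 "samples", 6 preprocessed "peaks" driven by two hidden factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(30, 2))
peak_profiles = rng.normal(size=(2, 6))
X = latent @ peak_profiles + 0.01 * rng.normal(size=(30, 6))
contact_angle = 70.0 + latent @ np.array([10.0, -5.0])

coef, x_mean, y_mean = pls1_fit(X, contact_angle, n_components=2)
predicted = pls1_predict(X, coef, x_mean, y_mean)
ss_res = np.sum((contact_angle - predicted) ** 2)
ss_tot = np.sum((contact_angle - contact_angle.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot
print(f"R^2 (calibration) = {r2:.3f}")
```

The R² printed here is a calibration value computed on the same samples used to build the model. In practice the number of latent variables and the predictive ability should be assessed with cross validation or an independent test set.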

3.2 Analysis of Patterned NHS on Polymer Substrates

We have used ToF-SIMS imaging and PCA to analyze patterned polymer surfaces functionalized with N-hydroxysuccinimide (NHS) [67, 68]. For this, commercially available slides with a PEG based coating were patterned using standard photolithography methods. The surfaces were imaged using ToF-SIMS and the images were analyzed using PCA of the normalized, mean centered data. PC1 was able to separate the NHS from the methoxy regions with good fidelity matching the originally designed pattern. The loadings separating the two regions were seen to contain peaks characteristic of the two compounds. Interestingly PC3 showed a dark outline around the border between the two regions. From the PC3 loadings it was determined that these dark regions corresponded with the photoresist material used in the pattern creation. The presence of residual photoresist had not been previously reported in other studies using a similar patterning method. This study presents a good example of how the combination of PCA with ToF-SIMS data can often pick up trends or identify components from complex data sets that would otherwise be overlooked by the analyst. The ability to identify the presence of submonolayer quantities of unexpected organic contaminants on an organic surface is particularly challenging, but as this study shows the combination of ToF-SIMS and PCA is well suited for this type of challenging analysis.
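For imaging data, PCA is usually applied after unfolding the image so that every pixel becomes one row of the data matrix, and the scores are then folded back into images. The NumPy sketch below shows this for a synthetic 8 × 8 pixel image with two regions and three peaks. The function `image_pca`, the image and the count levels are all hypothetical and unrelated to the NHS data discussed above.

```python
import numpy as np

def image_pca(cube, n_components):
    """PCA of a (rows, cols, peaks) image cube; returns score images and loadings."""
    ny, nx, n_peaks = cube.shape
    X = cube.reshape(ny * nx, n_peaks).astype(float)    # unfold: one spectrum per pixel
    totals = X.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0                            # avoid dividing empty pixels by zero
    X = X / totals                                       # normalize each pixel to its total counts
    Xc = X - X.mean(axis=0)                              # mean center each peak
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Xc @ loadings
    score_images = scores.reshape(ny, nx, n_components)  # refold scores into images
    return score_images, loadings

# Synthetic patterned image: left and right halves have different peak ratios.
rng = np.random.default_rng(2)
expected = np.zeros((8, 8, 3))
expected[:, :4] = [50.0, 5.0, 20.0]
expected[:, 4:] = [5.0, 50.0, 20.0]
cube = rng.poisson(expected)                             # Poisson counting noise

score_images, loadings = image_pca(cube, n_components=2)
pc1_image = score_images[:, :, 0]
print(np.round(pc1_image[0], 2))                         # one row of the PC1 scores image
```

The PC1 scores image reproduces the two-region pattern, with the two halves taking scores of opposite sign and the first two peaks taking loadings of opposite sign.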

3.3 Analysis of DNA Microarray Spots

Our group has been working on a project analyzing spots on DNA microarrays. In this project, DNA spots are printed onto polymer-coated commercial microarray slides (Slide H, amine-reactive polymer-coated slides, Schott Nexterion, Louisville, USA). The exact chemistry of the slides is proprietary, but it is known that the slides contain NHS and a non-fouling background, most likely based on polyethylene glycol (PEG). It is likely that the organic coating is attached to the glass substrate via a silane linker. The aim of this work is to investigate and understand the surface chemistry of DNA microarrays with the goal of improving their reproducibility and efficiency.

Figures 3 and 4 show the PCA scores and loadings plots for the first 2 PCs from the negative ion data of a DNA spot on the NHS coated slide. The data were Poisson scaled [45] and then mean centered prior to PCA. As seen in Fig. 3a, PC1 separates the DNA spot from the slide background. The PC1 loadings seen in Fig. 3b show that the DNA spot (darker area in PC1 scores plot, negative loadings) corresponds with a series of nitrogen containing peaks, as well as the PO2 and PO3 peaks from the DNA. The areas with positive scores on PC1 (bright area) correspond with a series of peaks of the form CxHyOz, many of which are characteristic of PEG. Examining the PC1 scores image closely, it is apparent there are several bright spots scattered around the surface. These spots are even more clearly highlighted in the PC2 scores plot (see Fig. 4a). As seen in the PC2 loadings plot, these spots correspond with a set of CxHyOz peaks (darker area in PC2 scores plot, negative loadings), once again many of which are characteristic of PEG. Looking at this data set as a whole, one can see that DNA was spotted successfully onto the NHS coated slide; however, the DNA spot is not very uniform, as noted by the variation in intensity of the spot in the PC1 scores plot. Overall, the background around the spot is fairly uniform, but it appears there are spots that could be agglomerations of PEG scattered across the surface (as seen by the bright regions in the PC1 scores image and the dark regions in the PC2 scores image). The non-uniform nature of PEG coatings has been observed previously [72].

Fig. 3

PC1 scores and loadings from a DNA spot on an NHS functionalized slide. a PC1 scores image. b PC1 loadings. The bright areas correspond with peaks of the form CxHyOz. The dark regions correspond with nitrogen containing peaks as well as PO2 and PO3 peaks from the DNA

Fig. 4

PC2 scores and loadings from a DNA spot on an NHS functionalized slide. a PC2 scores image. b PC2 loadings. Bright areas correspond with a series of silicon containing peaks, presumably from the silane linker used to tether the NHS groups to the surface and some other sulfur or phosphate containing peaks that could be from the buffer solutions used to deposit the DNA. The dark spots correspond with peaks of the form CxHyOz, which likely originate from PEG

In this example, PCA was able to quickly summarize the data from a sample with a complex and heterogeneous surface chemistry, and highlight non-uniformities in the DNA spots and in the background coating chemistry. This type of information can then be fed back into the surface engineering design to improve the sample preparation methods to prepare more uniform DNA spots with improved sensitivity and reproducibility.

3.4 Discriminating Cell and Tissue Types

Several groups have explored using MVA and ToF-SIMS to distinguish cells [20–25, 73] and tissues [26–30]. Biological samples such as these represent some of the most complex samples being explored by ToF-SIMS.

Fletcher et al. [73] used principal component discriminant function analysis (PC-DFA) to process the spectra of nineteen strains of bacteria common to urinary tract infections. PC-DFA applies discriminant function analysis to the results from PCA and enables classification of samples by maximizing the between group differences while minimizing the within group differences. In their study they showed that they could discriminate between 15 different strains with minimal overlap between sample groups. They also showed, using a subgroup of the data, that they could build a PC-DFA model that could be used to correctly classify spectra from various bacterial strains.
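The idea behind PC-DFA can be sketched in NumPy as PCA followed by a Fisher discriminant analysis on the retained PC scores. This is a generic illustration under our own assumptions, not the procedure or data of Ref. [73]: the three synthetic groups, the nearest centroid classifier, and the function names `pc_dfa` and `nearest_centroid` are hypothetical.

```python
import numpy as np

def pc_dfa(X, labels, n_pcs, n_dfs):
    """PCA followed by Fisher discriminant analysis on the PC scores."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    Xc = X - X.mean(axis=0)
    Vt = np.linalg.svd(Xc, full_matrices=False)[2]
    T = Xc @ Vt[:n_pcs].T                     # PC scores (overall mean is zero)

    Sw = np.zeros((n_pcs, n_pcs))             # within group scatter
    Sb = np.zeros((n_pcs, n_pcs))             # between group scatter
    for g in np.unique(labels):
        Tg = T[labels == g]
        mg = Tg.mean(axis=0)
        Sw += (Tg - mg).T @ (Tg - mg)
        Sb += len(Tg) * np.outer(mg, mg)

    # Solve Sb w = lambda Sw w by whitening with the Cholesky factor of Sw.
    L_inv = np.linalg.inv(np.linalg.cholesky(Sw))
    evals, evecs = np.linalg.eigh(L_inv @ Sb @ L_inv.T)
    order = np.argsort(evals)[::-1][:n_dfs]   # largest between/within ratios first
    W = L_inv.T @ evecs[:, order]
    return T @ W                              # discriminant function scores

def nearest_centroid(D, labels):
    """Assign each sample to the group with the closest centroid in DF space."""
    labels = np.asarray(labels)
    groups = np.unique(labels)
    centroids = np.array([D[labels == g].mean(axis=0) for g in groups])
    dist = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
    return groups[np.argmin(dist, axis=1)]

# Synthetic example: 3 groups x 10 "spectra" x 8 "peaks".
rng = np.random.default_rng(3)
group_means = np.zeros((3, 8))
group_means[0, 0] = group_means[1, 1] = group_means[2, 2] = 10.0
labels = np.repeat(np.array(["A", "B", "C"]), 10)
X = np.vstack([m + rng.normal(size=(10, 8)) for m in group_means])

D = pc_dfa(X, labels, n_pcs=3, n_dfs=2)
predicted = nearest_centroid(D, labels)
accuracy = np.mean(predicted == labels)
print(f"resubstitution accuracy = {accuracy:.2f}")
```

The accuracy computed here is a resubstitution value on the same spectra used to build the model and is therefore optimistic. A meaningful test of a classification model requires spectra that were held out when the model was built.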

Wu et al. [27, 30] have explored the use of ToF-SIMS and MVA for the analysis of tissues. In one example they used PCA to analyze six different tissues from a mouse embryo (brain, spinal cord, heart, liver, rib, skull) [30]. For this analysis spectra were extracted from images taken from the various tissues and processed with PCA. When using all tissue types in the analysis they were able to show separation between most groups, though they saw some overlap between the brain and spinal cord, and the heart and liver tissues. As could be expected, the brain and spinal cord tissues clustered near each other, as these were similar tissue types. The skull and rib tissues were also seen on the same side of the PC axes as could be expected since they are both bony tissues.

These brief examples show that MVA has proven successful for distinguishing and classifying various cell and tissue types, but the analysis is still limited by the lack of identified peaks that appear within their spectra. It is also noted that in most cases, the spectra from cells and tissues are dominated by low mass fragments that can come from multiple sources including proteins, lipids and other compounds in the samples. This lack of higher mass molecular information makes the use of MVA even more critical since it can be used to distinguish trends in the fragmentation pattern within a set of spectra that can be specific to the type of sample [32, 55].

4 Outlook and Conclusions

As the use of MVA with ToF-SIMS data increases, it is important that users understand the methods used and assumptions made when processing their data. Proper experimental design and sampling statistics are required to ensure the results acquired are statistically significant. The choice of MVA method should also be considered according to the goals of the analysis and the results obtained should be verified by going back to the raw data.

It should be remembered that MVA methods are simply a set of tools that can be used by the analyst to better understand their data. MVA does not replace the need to understand the basic methodologies of analyzing ToF-SIMS data; rather, it provides a way to explore the data more efficiently and make use of all of the information within a given data set. MVA methods reduce large data sets down to a more manageable set of variables that can highlight trends in the data, build predictive models, and find differences between samples. The continued development and use of MVA methods for ToF-SIMS data processing holds great promise for understanding complex surface chemistry and dealing with the large data sets generated from both simple and complex sample sets. However, there is still a great need to expand the availability and depth of peak libraries for ToF-SIMS so one can interpret the results gained through applying MVA. This is particularly true for biological samples and other complex surfaces such as biosensors.