Introduction

Molecular-level tissue characterization holds great promise for cancer diagnosis. As a tissue becomes cancerous, specific biomolecules are overexpressed or aberrantly expressed, and these can serve as molecular markers of cancer. If these markers can be detected spectroscopically, a new molecular-level cancer diagnosis with high objectivity becomes possible.

Keratin-family proteins (M.W. 40,000–67,000) are major components of the fibrous structural proteins in epithelial cells. They play important roles in forming the cytoskeleton network and help maintain the structural integrity of cellular morphology1,2. Several studies have shown that keratin is aberrantly expressed in many types of human epithelial cancer, including skin, lung, breast, cervical, esophageal, salivary gland and oral cancers3,4,5,6. In the present study, we focus on oral cancer. Oral squamous cell carcinoma (OSCC) is one of the most common cancers of the oral cavity (95% of oral malignancies). Keratin is a well-established molecular marker of OSCC; oral malignancy can be diagnosed by detecting the variations in keratin expression between OSCC and normal oral tissues7,8. At present, keratin in oral tissues is detected and analyzed by immunohistochemistry (IHC). However, IHC is expensive and time-consuming, and it requires specialist attention. An economical and straightforward alternative for keratin detection in oral tissues is therefore highly desirable.

Spectroscopic methods for cancer diagnosis have made rapid progress in recent years. In particular, Raman spectroscopy has proven effective for discriminating cancerous from normal oral tissues9,10,11,12,13,14,15,16,17,18,19. Spectroscopic discrimination of cancer tissues in these previous studies is mostly based on Principal Component Analysis (PCA) in conjunction with statistical multi-parameter analyses. The key advantage of PCA is that, once a spectral data set obtained from tissues has been analyzed and separated into several categories, a new spectrum can automatically be assigned to one of those categories, for example, cancerous vs. normal. However, PCA does not extract detailed molecular spectral information from the categorized spectra, and the physical basis of its categorization tends to remain unclear.

Biological tissues are highly heterogeneous and their Raman spectra vary widely depending on the position where they are measured. Furthermore, the molecular compositions of tissues are so complicated that their raw Raman spectra can hardly be interpreted. To accomplish a global tissue analysis that is effective for cancer diagnosis, we need to (1) collect Raman spectra from as many points as possible in a tissue sample, (2) estimate the number of principal spectral components contained in this large number of Raman spectra, (3) decompose the raw spectra into spectrally interpretable components and finally (4) objectively characterize tissues according to the extracted spectral information. The methodology employed up to now relies greatly on specialized “spectroscopic eyes”, which has hindered its practical application in cancer diagnosis. The aim of the present study is to develop an automatic and objective method for discriminating oral cancer tissue by detecting keratin without any specialized knowledge of spectroscopy. We (1) collected a total of 196 Raman spectra from each oral tissue sample, (2) estimated the number of principal spectral components by Singular Value Decomposition (SVD), (3) applied Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS)20,21 analysis to decompose a large set of complicated spectra into spectrally interpretable components and (4) carried out spectral matching between these MCR-decomposed spectral components and the keratin standard spectrum, using the Unit Normalized Euclidean Distance (UNED), to objectively discriminate OSCC from normal tissues.

The present method fully utilizes the Raman spectral information (molecular fingerprint) of the marker molecule, keratin. In PCA approaches, in contrast, Raman spectra are treated merely as two-dimensional signatures for pattern recognition, with little reference to their physicochemical meaning. The keratin signature is identified automatically and objectively with the use of UNED, making the whole analysis readily accessible to non-specialists in spectroscopy.

Results

Determination of the number of principal spectral components contained in the observed spectra

We first determined the number of principal spectral components, k, based on the signal-to-noise ratio (S/N) consideration described in the Methods section. We tried several threshold values and finally set the threshold to S/N = 4. If the threshold is higher, we obtain fewer spectral components (smaller k), in which keratin signatures are likely to be mixed with other protein signatures. If the threshold is lower, we obtain more spectral components (larger k), in which keratin signatures are likely to be contaminated with noise and dispersed among several spectral components. The present threshold, S/N = 4, is the optimized value for the present data set of 196 × 24 = 4704 Raman spectra (196 spectra from each of 24 tissue samples) from 14 patients. This threshold should be optimized further in the future with more data from a larger number of patients. For the present, we used S/N = 4 to show that the following automatic analysis proceeds successfully once the threshold is fixed. The determined k values are shown in Fig. 1 for Patient-1 to Patient-10 (OSCC and normal tissue samples), Patient-11 to Patient-13 (OSCC) and Patient-14 (normal). The different k values reflect the variation among samples obtained from different patients.

Figure 1

The number of principal spectral components k in 14 patients’ OSCC and normal oral tissues with signal-to-noise ratio (S/N) higher than 4. Different k values show the variation of samples obtained from different patients.

MCR-ALS fitting to decompose the observed spectra into spectrally interpretable components

After determining the number of principal spectral components in each tissue, we applied MCR-ALS analysis (see Methods) to decompose the complicated raw spectra into spectrally interpretable components. The MCR spectral components of the OSCC tissue of Patient 1 are shown in Fig. 2(a–f). The normalized residual Rij = |(Aij − (WH)ij)/Aij| at the i-th row and the j-th column is less than 5–7%, indicating that the principal signatures in the original data set A are well represented by the product WH obtained by MCR decomposition. The MCR-decomposed spectra thus obtained are readily compared with the standard keratin spectrum in Fig. 2g. We notice that one of the MCR components, Fig. 2b, shows excellent correspondence with the standard keratin spectrum, with characteristic protein peaks at 1650 cm−1 (Amide I), 1450 cm−1 (CH bend), 1200–1350 cm−1 (Amide III), 1003 cm−1 and 1030 cm−1 (Phenylalanine), 937 cm−1 and 890 cm−1 (C-C stretch), and 850 cm−1 and 830 cm−1 (Tyrosine), and especially with a broad band feature around 1200–1350 cm−1. The spectral components in Fig. 2a,d are ascribable to autofluorescence. The Raman spectrum of the glass substrate is separated out as Fig. 2c. Proteins other than keratin seem to be included in the component in Fig. 2e, and the component in Fig. 2f is likely to contain hemoglobin22.
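As a rough illustration of how such a normalized residual could be evaluated, the following NumPy sketch computes Rij for given matrices A, W and H. The function name `normalized_residual` and the `eps` guard against division by near-zero intensities are assumptions made for this example; they are not taken from the original analysis.

```python
import numpy as np

def normalized_residual(A, W, H, eps=1e-12):
    """Element-wise normalized residual R_ij = |(A_ij - (WH)_ij) / A_ij|.

    eps is a hypothetical guard: entries with |A_ij| <= eps are returned
    as NaN instead of being divided by (almost) zero.
    """
    A = np.asarray(A, dtype=float)
    model = np.asarray(W, dtype=float) @ np.asarray(H, dtype=float)
    R = np.full(A.shape, np.nan)
    ok = np.abs(A) > eps
    R[ok] = np.abs((A[ok] - model[ok]) / A[ok])
    return R

# Toy example: a 3 x 2 data matrix that a rank-1 model underestimates by 5%.
W_demo = np.array([[1.0], [2.0], [4.0]])
H_demo = np.array([[1.0, 3.0]])
A_demo = (W_demo @ H_demo) * 1.05
R_demo = normalized_residual(A_demo, W_demo, H_demo)
```

In this toy case every element of `R_demo` equals 0.05/1.05, so the largest residual stays just below 5%.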

Figure 2

MCR-ALS spectral components from the Patient-1 OSCC tissue sample (a–f) and the standard keratin spectrum (g); the spectral component (b) shows excellent correspondence with the standard keratin spectrum (g).

The result for the normal oral tissue of Patient 1 is shown in Fig. 3(a–e). The normalized residual is less than 6%. The keratin spectrum does not seem to match any spectral components from the normal oral tissue. Spectral components in Fig. 3a,b and e are ascribed to autofluorescence. The spectral component in Fig. 3c is likely to contain protein signatures with a characteristic band of phenylalanine residue at 1003 cm−1. Prominent signatures in Fig. 3d are from glass substrate.

Figure 3

MCR-ALS spectral components from the Patient-1 normal tissue sample (a–e) and the standard keratin spectrum (f). No MCR spectral components seem to match the standard keratin spectrum.

Although we discuss the assignments of the decomposed components here, neither the MCR analysis nor the subsequent spectral matching evaluation depends on them. The process is fully automatic and our protocol does not require spectral assignments.

Spectral matching between principal MCR spectral components and the standard spectrum of keratin

By the preceding MCR analysis, we obtained decomposed spectral components. In order to evaluate how “similar” these spectral components are to the standard keratin spectrum, without relying on specialized “spectroscopic eyes”, we employed the idea of spectral matching. We have tried several indicators of spectral matching including Spectral Angle (SA), Euclidean Distance (ED) and Spectral Information Divergence (SID)23,24,25 and found that Unit Normalized Euclidean Distance (UNED) (described in Methods) provides the clearest measure of the spectral similarity.

We calculated the UNED between each MCR-decomposed spectrum and the standard keratin spectrum in the region 800–1800 cm−1. We then took the minimum UNED value among the decomposed spectra of each tissue sample to quantify the “highest similarity” of that sample. We first compared the ten paired samples (OSCC/normal) from the same ten patients. The result is shown in Fig. 4, with ten red points for the OSCC tissue samples and ten blue points for the normal tissue samples.

Figure 4

UNED results of the ten paired patients (OSCC and normal tissues) with the confidence intervals of OSCC (UNED = 0.16–0.26) and normal (UNED = 0.29–0.38). The upper bound of the OSCC confidence interval (UNED = 0.26) can separate OSCC and normal tissues effectively.

We note that the OSCC points tend to have smaller UNED values than the corresponding normal points. This trend indicates that the decomposed spectral components of the OSCC tissue samples are more similar to the standard keratin spectrum than those of the normal samples. The 95% confidence interval of the OSCC points is 0.16 < UNED < 0.26, while that of the normal points is 0.29 < UNED < 0.38. These two 95% confidence intervals do not overlap, showing that UNED clearly distinguishes OSCC from normal oral tissues with high accuracy and specificity. If we simply take the upper bound of the OSCC confidence interval (UNED = 0.26) as the threshold value, we can separate the OSCC and normal groups with 70% accuracy for OSCC tissues (failure for Patients 5, 7 and 10) and 100% specificity for normal tissues. The three false points, which were histologically diagnosed as cancerous, may well correspond to a preliminary stage of cancer in which the tissue contains more cancerous tissue than a normal sample does (see discussion below).

With the threshold UNED = 0.26, we analyzed the other three independent OSCC tissue samples and the one independent normal tissue sample. The three OSCC tissue samples show UNED values of 0.22, 0.13 and 0.22, respectively, all of which are smaller than 0.26. The normal tissue sample shows a value of 0.38, which is much higher than 0.26. If we include the three independent OSCC samples, the accuracy increases from 70% to 77%. The specificity remains 100% when the independent normal sample is added to the analysis.
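The arithmetic behind these figures can be sketched as follows. The UNED values and the counts are those quoted in the text; the function name `classify` and the use of a strict inequality at the threshold are assumptions made for this illustration.

```python
THRESHOLD = 0.26  # upper bound of the OSCC confidence interval quoted in the text

def classify(uned, threshold=THRESHOLD):
    """Label a sample as OSCC when its minimum UNED is below the threshold.

    Using '<' rather than '<=' at the threshold is an assumption.
    """
    return "OSCC" if uned < threshold else "normal"

# Minimum UNED values of the independent samples, as quoted in the text.
independent_oscc = [0.22, 0.13, 0.22]
independent_normal = [0.38]

oscc_calls = [classify(u) for u in independent_oscc]
normal_calls = [classify(u) for u in independent_normal]

# Paired samples: 7 of 10 OSCC and 10 of 10 normal samples were classified correctly.
paired_oscc_correct, paired_oscc_total = 7, 10
paired_normal_correct, paired_normal_total = 10, 10

oscc_rate = (paired_oscc_correct + oscc_calls.count("OSCC")) / (
    paired_oscc_total + len(independent_oscc)
)
specificity = (paired_normal_correct + normal_calls.count("normal")) / (
    paired_normal_total + len(independent_normal)
)
```

Here `oscc_rate` evaluates to 10/13 (about 77%) and `specificity` to 11/11.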

Discussion

In the present study, we have developed an automatic and objective method for oral cancer diagnosis by applying MCR-ALS analysis with spectral matching. In spectral matching, we used the distance, UNED, between the normalized MCR-decomposed spectral components and the normalized standard keratin spectrum to evaluate their “similarity”. The UNED value tells us how much of the characteristic Raman signature of keratin is contained in an MCR-decomposed spectrum. When the UNED value is small, the decomposed spectral component contains a strong keratin signature. When the UNED value is large, it contains less keratin signature. Our results indicate that, for the OSCC tissue samples, one of the decomposed spectral components generally showed high similarity, whereas no spectral component showed high similarity for the normal tissue samples. The keratin signature was successfully captured from the OSCC tissue samples but not from the normal ones. Keratin in the normal tissue samples was not detected by MCR primarily because the amount of keratin is much smaller in normal tissues than in OSCC3. In addition, the spatial distribution of keratin may also play a role in the MCR decomposition. Note that the MCR decomposition is based on differences not only in the spectral profile but also in the spatial distribution. It is likely that the spatial distribution of keratin in OSCC differs from that in normal tissues. OSCC tissues may consist of a homogeneous population of cells at one particular stage of differentiation, whereas normal tissues consist of cells at different stages of differentiation26,27. Cancer cells in OSCC tissues are likely to spend more time in the G2 phase with aberrant keratin synthesis, producing a localized spatial distribution of keratin. In contrast, normal cells are mostly at different stages (M, G1 and G2 phases), so that keratin is distributed randomly in space.

In the present study, we measured randomly chosen points in the tissue samples to obtain global spectral information. We anticipate that this approach can also be used to examine specific areas of tissue for comparison with immunohistochemical staining results, and that the spatial distribution of keratin, Hi in Equation 4, can be used to further substantiate the discrimination of OSCC tissues from normal tissues.

In spectral matching, the threshold UNED = 0.26 was set to discriminate OSCC from normal oral tissues of the same patient. The UNED value could also help elucidate the metastasis condition of cancer. The marginal region (UNED = 0.26–0.29 in Fig. 4) probably represents metastatic cancer cells gradually invading normal tissues. With the UNED = 0.26 threshold, the UNED values of the three false points (Patient-5, -7 and -10) were larger than 0.26; hence, these samples were not identified as cancerous. However, their UNED values were smaller than the corresponding values of the normal samples. If we compare the UNED values within the same patient, the OSCC tissue samples always show lower UNED values than the normal ones (for Patient-10, they almost overlap). In that sense, the two OSCC tissue samples from Patient-5 and Patient-7 can probably be diagnosed as suspicious (though not identified as cancerous) because they contain more cancerous tissue than the normal samples. If we regard these suspicious samples as positive, the accuracy increases from 77% to 92%. The pairwise comparison of UNED values may provide further information on the metastasis condition of oral cancer.

Methods

Tissue samples

Use of tissue samples was approved by the Institutional Review Board of the Taichung Veterans General Hospital. All experiments were performed in accordance with the approved guidelines and regulations. Informed consent was obtained from all subjects. Samples from fourteen oral cancer patients comprised ten pairs (cancer and normal tissue samples from the same patient), three independent cancer tissue samples and one independent normal tissue sample. Cancerous oral tissue samples were histologically confirmed as oral squamous cell carcinoma (OSCC). Immediately after surgical removal, all tissue samples were flash-frozen at −196 °C and stored in liquid nitrogen. The stored tissue samples were then embedded in optimum cutting temperature (OCT) compound, sectioned with a microtome into approximately 10-μm-thick slices and mounted on glass slides. The standard keratin sample was prepared from human stratum corneum, which is known to contain 80% keratin28. In practice, it was obtained from stratum corneum cut from the heel of one of the authors. The sample was soaked overnight in a 1:1 mixture of methanol and chloroform and then immersed in deionized water29. Keratin from human stratum corneum, as used here for the standard sample, is extensively used as the standard antigen in immunostaining detection of keratin in squamous cell carcinoma (SCC) tissues3.

Raman microspectroscopy

We used a laboratory-constructed Raman microspectrometer for all the Raman measurements. The 488 nm line of an Ar-ion laser (CVI Melles Griot) was used for excitation with a power of about 1 mW at the sample point. The laser beam was focused into the sample by using a non-immersion 40×, NA = 0.6 objective (Olympus, LUCPlanFL N). The laser spot size at the sample was estimated to be about 1 μm. The back-scattered light was collected by the same objective lens and was focused onto the entrance slit of a polychromator (Andor, SR303i-BNS). A 1200-grooves/mm grating was used to disperse the scattered light. The signal was detected by a CCD detector (Andor, DU401A-BV) cooled to −80 °C. The acquisition time was 60 s for each measurement. The Raman spectrum of indene was acquired for wavenumber calibration30. The Raman spectra of all samples were recorded in the 300–2000 cm−1 wavenumber region, which covers most Raman signatures observed from oral tissues.

In the present study, we focused on extracting global molecular information from tissues. Therefore, we tried to measure as many points as possible, globally and randomly, without specific localization in a tissue sample. A piezo X-Y stage (Physik Instrumente) was used to scan 7 × 7 = 49 points in one region of a tissue sample, with a distance of 5 μm between adjacent points (Fig. 5). The same measurement was repeated four times at different regions of the sample, and a total of 196 spectra were collected from each sample for subsequent analysis.
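The sampling scheme can be sketched as a simple raster of stage positions. The grid size, step and number of regions follow the text; the function name `scan_grid` and the region origins are hypothetical, since the actual positions of the four regions are not specified.

```python
import numpy as np

def scan_grid(n_side=7, step_um=5.0, origin=(0.0, 0.0)):
    """Return the (x, y) stage positions, in micrometers, of an n_side x n_side raster."""
    offsets = np.arange(n_side) * step_um
    xx, yy = np.meshgrid(offsets + origin[0], offsets + origin[1], indexing="xy")
    return np.column_stack([xx.ravel(), yy.ravel()])

# Hypothetical region origins (in micrometers); the real regions were chosen
# at different places on each section and are not given in the text.
region_origins = [(0.0, 0.0), (200.0, 0.0), (0.0, 200.0), (200.0, 200.0)]
points = np.vstack([scan_grid(origin=o) for o in region_origins])

spectra_per_sample = len(points)          # 4 regions x 49 points
total_spectra = spectra_per_sample * 24   # 24 tissue samples in the data set
```

Each 7 × 7 raster spans 30 μm per side, and the four regions together give the 196 spectra per sample used in the subsequent analysis.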

Figure 5

A schematic diagram of the laboratory-constructed Raman microspectrometer.

Data Analysis

The flow chart of data analysis is shown in Fig. 6. Wavenumber calibration based on the standard spectrum of indene was carried out prior to the analysis. The analysis consists of the following three steps: (1) determination of the number of principal spectral components contained in the observed spectra, (2) multivariate curve resolution-alternating least squares (MCR-ALS) fitting to decompose the observed spectra into spectrally interpretable components, (3) spectral matching between principal MCR spectral components and the standard spectrum.

Figure 6

Flow chart of data analysis.

(1) Determination of the number of principal spectral components

To determine the number of principal spectral components in the observed spectra, we introduced a new protocol based on signal-to-noise ratio (S/N) consideration. First, SVD analysis was performed to obtain SVD-decomposed spectra, IntensitySVDoriginal. Then, these SVD-decomposed spectra were smoothed by the Savitzky-Golay (polynomial) method to obtain IntensitySVDsmooth, which was regarded as the “Signal”. Then, for each SVD-decomposed spectrum, IntensitySVDoriginal can be written as,

$$\mathrm{Intensity}_{\mathrm{SVDoriginal}} = \mathrm{Intensity}_{\mathrm{SVDsmooth}} + \mathrm{Noise} \tag{1}$$

Equation 1 can be visualized as shown in Fig. 7.

Figure 7

Signal and noise in SVD components.

The signal-to-noise ratio is defined as,

By using this method, we can automatically select the spectral components that have S/N ratios higher than a preset threshold value. In the present study, we fixed the threshold at 4. SVD spectra with S/N ratios higher than 4 were included in the subsequent analysis.
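A minimal sketch of this selection step is given below, using NumPy only. Several details are assumptions of the sketch rather than the authors' settings: the S/N formula (peak-to-peak amplitude of the smoothed spectrum divided by the standard deviation of the residual noise), the Savitzky-Golay window and polynomial order, and the names `savgol_smooth`, `snr` and `count_components`.

```python
import numpy as np

def savgol_smooth(y, window=11, order=3):
    """Savitzky-Golay smoothing: a local least-squares polynomial fit,
    evaluated at the center of each (odd-length) window."""
    half = window // 2
    x = np.arange(-half, half + 1, dtype=float)
    V = np.vander(x, order + 1, increasing=True)
    coeffs = np.linalg.pinv(V)[0]  # weights that give the fitted value at x = 0
    padded = np.pad(np.asarray(y, dtype=float), half, mode="edge")
    return np.convolve(padded, coeffs[::-1], mode="valid")

def snr(spectrum, window=11, order=3):
    """ASSUMED S/N definition: peak-to-peak 'Signal' over the std of 'Noise'."""
    signal = savgol_smooth(spectrum, window, order)
    noise = np.asarray(spectrum, dtype=float) - signal
    return (signal.max() - signal.min()) / np.std(noise)

def count_components(A, threshold=4.0):
    """Count the SVD spectra (left singular vectors of the m x n matrix A)
    whose S/N exceeds the threshold."""
    U, s, Vt = np.linalg.svd(np.asarray(A, dtype=float), full_matrices=False)
    ratios = np.array([snr(U[:, i]) for i in range(U.shape[1])])
    return int(np.sum(ratios > threshold)), ratios

# Synthetic check: two Gaussian bands with random weights plus weak noise.
rng = np.random.default_rng(0)
axis = np.arange(300)
w1 = np.exp(-0.5 * ((axis - 100) / 10.0) ** 2)
w2 = np.exp(-0.5 * ((axis - 200) / 10.0) ** 2)
weights = rng.uniform(0.0, 1.0, size=(2, 60))
A_demo = np.outer(w1, weights[0]) + np.outer(w2, weights[1])
A_demo += rng.normal(0.0, 0.005, size=A_demo.shape)
k_demo, ratios_demo = count_components(A_demo)
```

In this synthetic case the two singular vectors that carry the Gaussian bands have S/N far above 4, while the remaining vectors are dominated by noise. With real data, the smoothing window and the exact S/N formula would need to be matched to the original protocol.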

(2) Multivariate Curve Resolution - Alternating Least Squares (MCR-ALS)

A raw Raman spectrum is a superposition of a number of independent spectral components from different molecules existing in the tissue. Multivariate Curve Resolution-Alternating Least Squares (MCR-ALS) decomposes a large spectral data set into spectrally interpretable components. The experimental raw spectral data set A can be written as an m × n matrix, where m is the number of data points in one spectrum and n is the number of spectra in the data set; each column vector of A, Ai = (A1i, …, Ami)^T, represents the i-th Raman spectrum having m data points. We decompose A into a product of two matrices W and H,

$$A \approx WH \tag{3}$$

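where W is an m × k matrix, H is a k × n matrix and k is the number of components determined by the S/N consideration in the first step. Equation (3) can be written in matrix form as,

$$\begin{pmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mn} \end{pmatrix} \approx \begin{pmatrix} W_{11} & \cdots & W_{1k} \\ \vdots & \ddots & \vdots \\ W_{m1} & \cdots & W_{mk} \end{pmatrix} \begin{pmatrix} H_{11} & \cdots & H_{1n} \\ \vdots & \ddots & \vdots \\ H_{k1} & \cdots & H_{kn} \end{pmatrix} \tag{4}$$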

In equation (4), MCR-ALS decomposes the raw data set matrix A into the major spectral component matrix W and the corresponding spatial distribution matrix H. Each column vector of W, Wi = (W1i, …, Wmi)^T, is the i-th major spectral component, and each row vector of H, Hi = (Hi1, Hi2, Hi3, Hi4, …, Hin), represents the intensity profile corresponding to Wi.
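Equivalently, with the column vectors Wi and row vectors Hi defined above, the product WH can be expanded as a sum of k rank-one terms. This expansion is added here only as an illustration (the element indices p and q are introduced for this purpose); it makes explicit that each component contributes its spectrum Wi scaled by its intensity profile Hi:

```latex
A \approx WH = \sum_{i=1}^{k} W_i H_i ,
\qquad
A_{pq} \approx \sum_{i=1}^{k} W_{pi} H_{iq}
\quad (p = 1, \dots, m;\; q = 1, \dots, n).
```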

During the MCR analysis, the W and H matrices are constrained to be non-negative; i.e., W ≥ 0, H ≥ 0. The final solutions are obtained by iterative refinement that minimizes the squared Frobenius norm ||A − WH||². The SVD spectral components are used as initial guesses of the spectral components in the iteration process, with the negative values in the SVD spectra truncated to zero. The present MCR-ALS analysis does not require orthogonality among the column vectors of W or among the row vectors of H. Note that the component spectra and the intensity patterns are in general not orthogonal to one another; in contrast, they are constrained to be orthogonal in other spectral decomposition methods such as PCA and SVD. The details of the MCR-ALS method are given elsewhere20,31.
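The iteration described above can be sketched in a few lines of NumPy. This is a simplified variant under stated assumptions, not the implementation used in the study: each unconstrained least-squares update is followed by clipping negative values to zero, the arbitrary sign of each SVD vector is fixed by making its sum non-negative before truncation, and a fixed number of iterations replaces a convergence test. The name `mcr_als` and its parameters are hypothetical.

```python
import numpy as np

def mcr_als(A, k, n_iter=500):
    """Simplified non-negative alternating least squares for A ~ W @ H.

    A: (m, n) data matrix, one spectrum per column.
    Returns W (m, k) component spectra and H (k, n) intensity profiles.
    """
    A = np.asarray(A, dtype=float)
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    W = U[:, :k].copy()
    # ASSUMPTION: SVD signs are arbitrary, so flip each vector to a non-negative sum
    # before truncating its negative values to zero.
    signs = np.where(W.sum(axis=0) < 0, -1.0, 1.0)
    W = np.clip(W * signs, 0.0, None)
    # Initial intensity profiles for the starting spectra.
    H = np.clip(np.linalg.lstsq(W, A, rcond=None)[0], 0.0, None)
    for _ in range(n_iter):
        # Update spectra with H fixed, then intensity profiles with W fixed.
        W = np.clip(np.linalg.lstsq(H.T, A.T, rcond=None)[0].T, 0.0, None)
        H = np.clip(np.linalg.lstsq(W, A, rcond=None)[0], 0.0, None)
    return W, H

# Synthetic check: two well-separated Gaussian bands with random non-negative weights.
axis = np.arange(200)
w_true = np.column_stack([
    np.exp(-0.5 * ((axis - 60) / 8.0) ** 2),
    np.exp(-0.5 * ((axis - 140) / 8.0) ** 2),
])
h_true = np.random.default_rng(0).uniform(0.0, 1.0, size=(2, 60))
A_demo = w_true @ h_true
W_fit, H_fit = mcr_als(A_demo, k=2)
rel_residual = np.linalg.norm(A_demo - W_fit @ H_fit) / np.linalg.norm(A_demo)
```

A small residual does not by itself guarantee that the recovered components equal the true ones, because non-negative factorizations are in general not unique. A production analysis would more likely use a proper non-negative least-squares solver and a convergence criterion.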

(3) Spectral matching between principal MCR spectral components and standard spectral component

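The decomposed spectra W1, …, Wk and the standard spectrum Sstd were normalized so that their norms are unity. These normalized vectors can be written as,

$$W_{i,\mathrm{unit}} = \frac{W_i}{\lVert W_i \rVert}, \qquad S_{\mathrm{std,unit}} = \frac{S_{\mathrm{std}}}{\lVert S_{\mathrm{std}} \rVert} \tag{5}$$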

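Then, we calculate the similarity measure, the Unit Normalized Euclidean Distance (UNED), between Wi,unit and Sstd,unit as follows,

$$\mathrm{UNED}_i = \sqrt{\sum_{j=1}^{m} \left( W_{i,\mathrm{unit},j} - S_{\mathrm{std,unit},j} \right)^{2}} \tag{6}$$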

where j is the index of the j-th element of the m-dimensional vectors Wi,unit and Sstd,unit.

Figure 8 schematically shows the principle of the UNED analysis, represented in 2-D space for simplicity. UNED is the distance between a normalized MCR-decomposed spectral component and the normalized standard spectrum; its minimum value is 0 (two identical vectors) and its maximum value is 2 (two vectors pointing in opposite directions). Therefore, the smaller the UNED value, the greater the similarity. Thus, UNED measures the distance between the two normalized spectral vectors and tells us how similar the two spectra are.
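The UNED calculation and the selection of the minimum value per sample can be sketched as follows. The function names `uned` and `min_uned`, and the way the wavenumber window is applied through a boolean mask, are assumptions made for this illustration.

```python
import numpy as np

def uned(spectrum, standard):
    """Unit Normalized Euclidean Distance between two spectra:
    both vectors are scaled to unit norm before the distance is taken."""
    a = np.asarray(spectrum, dtype=float)
    b = np.asarray(standard, dtype=float)
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return float(np.linalg.norm(a - b))

def min_uned(W, standard, wavenumbers=None, region=(800.0, 1800.0)):
    """Smallest UNED between the columns of W and the standard spectrum,
    optionally restricted to a wavenumber window (in cm^-1)."""
    W = np.asarray(W, dtype=float)
    standard = np.asarray(standard, dtype=float)
    if wavenumbers is not None:
        wavenumbers = np.asarray(wavenumbers, dtype=float)
        mask = (wavenumbers >= region[0]) & (wavenumbers <= region[1])
        W, standard = W[mask], standard[mask]
    return min(uned(W[:, i], standard) for i in range(W.shape[1]))

# Limiting cases in 2-D, as in the schematic model.
v = np.array([3.0, 4.0])
d_same = uned(v, 2.0 * v)                 # same direction, different scale
d_opposite = uned(v, -v)                  # opposite directions
d_orthogonal = uned([1.0, 0.0], [0.0, 1.0])
```

For non-negative spectra the inner product of the two unit vectors cannot be negative, so in practice UNED stays at or below √2, the value reached by orthogonal vectors.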

Figure 8

A schematic 2-dimensional model showing the principle of the UNED analysis.

Additional Information

How to cite this article: Chen, P.-H. et al. Automatic and objective oral cancer diagnosis by Raman spectroscopic detection of keratin with multivariate curve resolution analysis. Sci. Rep. 6, 20097; doi: 10.1038/srep20097 (2016).