Diagnosis of thyroid neoplasm using support vector machine algorithms based on platelet RNA-seq

Objective To assess the capacity of support vector machine (SVM) algorithms that are developed based on platelet RNA-seq data in identifying thyroid neoplasm patients and differentiating patients with thyroid adenomas, papillary thyroid cancer and metastasized papillary thyroid cancer. Methods Platelets were collected and isolated from 109 patients and 63 healthy controls. RNA-seq was performed to find transcripts with differential levels. Genes corresponding to these altered transcripts were identified using R packages. All samples were subsampled into a training set and a validation set. Two SVM algorithms were developed and trained with the training set, using the genes with differential transcript levels (GDTLs) as classifiers, and validated with the validation set. GO and KEGG pathway enrichment analysis were performed using the R package clusterProfiler. Results We detected 765 GDTLs (442 up-regulated and 323 down-regulated) in platelets of patients and healthy controls. The algorithm identifying thyroid neoplasm patients achieved an accuracy of 97%, with an AUC (area under curve) of 0.998. The other algorithm differentiating patients with multiclass thyroid neoplasms had an average accuracy of 80.5%. GO analysis showed that GDTLs were strongly involved in biological processes such as neutrophil degranulation, neutrophil activation, autophagy and regulation of multi-organism process. KEGG pathway enrichment analysis revealed that GDTLs were mainly enriched in NOD-like receptor signaling pathway and pathways in endocytosis, osteoclast differentiation, human cytomegalovirus infection and tuberculosis. Conclusion Our results indicated that the combination of SVM algorithms and platelet RNA-seq data allowed for thyroid neoplasm diagnostics and multiclass thyroid neoplasm classification.


Introduction
Thyroid cancer is the most common endocrine cancer, with an increasing incidence of about two fold in the last 25 years and accounting for 2% of all cancers [1]. This overall incidence growth is driven by the widespread use of sensitive imaging techniques [2,3]. The growth in incidence without corresponding rise in mortality reflects cancer overdiagnosis. Because surgery is the key therapy for most thyroid cancers [4], overdiagnosis usually leads to overtreatment, resulting in the performance of unnecessary thyroidectomies or radioiodine treatments with potential harm to the patient [5]. Therefore, development of an artificial intelligence algorithm with high sensitivity and specificity could help clinicians to diagnose thyroid neoplasm and differentiate thyroid adenomas and carcinomas, thus avoiding overtreatment and also unnecessary fine-needle aspiration biopsy.
It has been reported that blood platelets respond to changes in the host systemic environment during tumorigenesis and cancer metastasis [6], thus resulting in altered platelet behavior [7,8]. Best et al. [9]. demonstrated that platelet transcriptome could function as a biomarker trove to detect and classify cancer via SVM algorithms. Herein, we aimed to evaluate the capacity of SVM algorithms in identifying patients with thyroid neoplasms and differentiating healthy individuals and patients with thyroid adenomas (TA), papillary thyroid cancer (PTC) and metastasized papillary thyroid cancer (MPTC) based on altered transcript profiles in platelets.

Clinical sample collection
Healthy individuals and patients with biopsy-confirmed TA, PTC and MPTC were recruited from Renji Hospital, Shanghai, China. MPTC is defined as PTC that spreads to lymph nodes. Healthy individuals are voluntary blood donors who undertook thyroid ultrasound to rule out thyroid diseases. This experiment was approved by the Ethical Committee of the Renji Hospital. Written informed consent was obtained from all subjects. Patients with thyroid nodules and following complications were excluded: diabetes, hypertension, cardiovascular diseases, pregnancy and active infection. 2 ml peripheral whole blood samples were collected from all subjects and stored in EDTA-containing tubes. A patient class and a control class were established to classify samples.

Platelet isolation and RNA extraction
Platelets were isolated from the samples within 24 h after collection to minimize loss of platelet RNA quality and quantity. To isolate platelets, platelet rich plasma (PRP) was separated by a 20-min 1000 rpm centrifugation step at 4°C. To reduce the risk of contamination of platelet preparation with hemocytes, only 9/10th PRP was drawn and transferred to 1.5 ml Eppendorf tubes, after which platelets were pelleted by a 20-min 3000 rpm centrifugation step. The supernatant was subsequently removed, and platelets were observed as white precipitate. Phosphate buffered saline (PBS) solution was used to wash the platelets and then removed after a 15-min 3000 rpm centrifugation step, after which an instantaneous centrifugation was performed, and a 10-μl pipette was used to remove residual PBS solution. Precipitated platelets were carefully resuspended in RNAlater (Thermo Scientific) and frozen at −20°C. Contamination of hemocytes in platelet preparations was quantified by a hematology analyzer.

Library construction and mRNA sequencing
Total platelet RNA of each sample was extracted using RNeasy Micro Kit (Qlagen, cat. no. 74004) per the manufacturer's instructions. The total RNA was quantified and qualified by Agilent 2100 Bioanalyzer, NanoDrop (Thermo Scientific) and 1% agrose gel. Reverse transcription was achieved with 1 ng of total RNA, oligo (dT) primer and Superscript II reverse transcriptase. To have sufficient platelet cDNA for robust RNA-seq library preparation, each sample was subjected to 20-cycle enrichment PCR, using the KAPA HiFi HotStart ReadyMix PCR kit (Kapa Biosystems, USA), yielding~200 ng of cDNA. AMPure XP beads were used to purify double-stranded cDNA, which was then treated with End Prep Enzyme Mix to repair both ends and add a dA-tailing in one reaction, followed by a T-A ligation to add adapters to both ends. Size selection of Adapter-ligated DNA was then performed using beads, and fragments of~420 bp (with the approximate insert size of 300 bp) were recovered. Each sample was then amplified by PCR for 13 cycles using P5 and P7 primers, with both primers carrying sequences which can anneal with flow cell to perform bridge PCR and P7 primer carrying a six-base index allowing for multiplexing. The PCR products were cleaned up using beads, validated using an Qsep100 (Bioptic, China), and quantified by Qubit 3.0 Fluorometer (Invitrogen, USA).
Libraries with different indices were multiplexed and loaded on an Illumina HiSeq instrument per the manufacturer's instructions (Illumina, USA). Sequencing was carried out using a 2 × 150 bp paired-end (PE) configuration; image analysis and base calling were conducted by the HiSeq Control Software (HCS) + OLB + GAPipeline-1.6 (Illumina) on the HiSeq instrument. The sequences were processed and analyzed by Fengneng Pharmaceutical Technology (Shanghai) Co., Ltd.

Data analysis and GDTL identification
In order to purify raw data of fastq format, Cutadapt (v1.9.1) was used to remove adapters and ploymerase chain reaction (PCR) primers, and FASTP was used to filter reads with the following standards: (1) reads shorter than 50 bp were filtered; (2) reads with more than 5N bases were filtered; (3) reads with an average quality per base below Q15 were removed. For quality checking, a sample will be excluded if its clean reads account for <80% of its raw reads or its clean reads do not meet quality standards (Q20 > 90% and Q30 > 80%). Clean reads were aligned to human reference genome GRCh38 (ftp://ftp.ensembl.org/pub/ release-91/fasta/homo_sapiens/dna/) via STAR (v2.4.2a), yielding BAM files, which were subsequently processed by HTSeq (ve0.6.1) to estimate differential transcript levels of genes. Here, STAR was also used to remove multi-mapping and unmapped reads, and a sample will be excluded if its uniquely mapped reads cannot account for >85% of its total reads before removal. Gene-level differential analysis at transcript-level resolution was performed using Bioconductor package DESeq2. Basemean > 100, |log2(FC)| > 1 and Padj < 0.05 were set to detect GDTLs. VENNY 2.1 (https://bioinfogp.cnb.csic.es/tools/venny/) was used to screen class-specific GDTLs as classifiers for the multiclass algorithm.

GO and KEGG enrichment analysis
Gene Ontology (GO) is a structured, controlled vocabulary for the classification of gene function at the molecular and cellular level. Kyoto Encyclopedia of Genes and Genomes (KEGG) is a collection of databases that can help to understand large-scale datasets generated by transcriptome sequencing. We performed GO and KEGG pathway enrichment analysis to functionally cluster GDTLs via the R package clusterProfiler.

Algorithm development
Prior to algorithm development, a train_test_split function implemented in Python 3.7.2 was used to randomly split samples into a training set and a validation set. The ratio of training:validation was set to be 60:40. Each time we executed the algorithms, samples in each sample class were extracted realtimely per this ratio according the total size of their classes to fill the training and validation sets, thus SVM outcomes were dynamic. To test the stability of performances of both algorithms, we execute both algorithms 1000 times and calculated their average accuracies. We also tried to adjust the ratio from 60:40 to 40:60 to increase the number of samples in the validation set relatively. For binary and multiclass sample classification, a SVM algorithm implemented in Python 3.7.2 was used. Particularly, we applied a One-Versus-One approach to the classification algorithm. The algorithm's accuracy reflects the percentage of correct predictions. In principle, the algorithm was trained on the training set to fix SVM parameters before the validation set was predicted.

Platelet purity and algorithm details
Quantification of platelet purity of the used samples was not conducted initially when we carried out the study. To support the credibility of our isolation results, 10 newly collected blood samples were applied to platelet isolation following the protocol described above. The contamination of hemocytes was quantified using a hematology analyzer, yielding an average platelet purity of 93.7% (Table S1).
Per the ratio mentioned above, sizes of the training set and the validation set were fixed to be 103 and 69. Samples in each sample class were also extracted per this ratio according the total size of their classes to fill the two sets. For the binary algorithm, the training set contains 67 patient samples and 36 normal samples, and the validation set contains 42 patient samples and 27 normal samples. For the multiclass algorithm, the training set contains 20 TA, 26 PTC, 23 MPTC and 34 normal samples, and the validation set contains 12 TA, 15 PTC, 13 MPTC and 29 normal samples. Sample details were listed in Table 1.
Gene expression profiles of platelets of patients are distinct from platelets of healthy individuals Initially, 287 thyroid neoplasm patients were recruited. 119 patients with comorbidities and 58 patients with thyroid nodules were excluded before blood drawn, and 1 patient whose transcript profile contains too many zero was excluded before SVM training. Baseline characteristics of 172 subjects are described in Table 2 and a more detailed version is shown in Table S2. Platelets were collected and                 isolated from healthy individuals (n = 63) and patients with thyroid adenomas (n = 32), papillary thyroid cancer (n = 41) and metastasized papillary thyroid cancer (n = 36). Platelet RNA-seq yielded a mean read count of~18.8 million reads per sample, and a total of 58302 RNAs were detected in the dataset. We identified 765 GDTLs (base mean > 100, |log2FC| > 1, FDR < 0.05) in platelets of healthy individuals (n = 63) and patients (n = 109), 442 with higher and 323 with lower expression (Table 3). Next, we trained a binary diagnostic algorithm in a training set (n = 103). This algorithm employed the 765 GDTLs as classifier for sample classification, to which the contribution of each GDTL was evaluated at the meantime, yielding two top 50 weighted gene plots (Fig. 1). Subsequent validation employed a validation set (n = 69), yielding an accuracy of 97%, with an area under the curve (AUC) of 0.998 (Fig. 2). When the binary algorithm was executed 1000 times, we obtained an average accuracy of 96.5%. When the ratio of training:validation was changed from 60:40 to 40:60, the average accuracy decreased from 96.5% to 94.9%.
Platelet RNA-seq allows for thyroid cancer diagnostics 370, 854 and 1160 GDTLs were identified in platelets of healthy individuals compared to patients with TA (Table  S3), PTC (Table S4) and MPTC (Table S5), respectively. We noticed significant increases in the number of GDTLs, suggesting that neoplasm types as well as neoplasm progression or metastasis had influence on the platelet transcriptome profile. To distinguish healthy individuals and patients with specific disease class, GDTLs included exclusively in the three classes were screened using VENNY 2.1 (Fig. 3), and this resulted in three classspecific gene lists (Table S6), which were used as the classifier for training and validation of the diagnostic algorithm.
The multiclass algorithm was also trained in a training set (n = 103), and we subsequently determined whether it could correctly classify each individual sample in the validation set (n = 69) as healthy individual or patient with one of the three neoplasms. The validation resulted in an average accuracy of 80.5% (Fig. 4), suggesting a thyroid neoplasm discriminative power of our algorithm. However, when we executed the multiclass algorithm 1000 times, an average accuracy of 63.9% was obtained.

GO and KEGG enrichment analysis
Our date analysis yielded four GDTL lists (Tables 2, S3-S5). GO analysis revealed that most GDTLs were enriched for biological processes such as neutrophil degranulation, neutrophil activation, autophagy and regulation of multiorganism process (Fig. S1). KEGG pathway enrichment analysis demonstrated that GDTLs were mainly enriched in NOD-like receptor signaling pathway and pathways in endocytosis, osteoclast differentiation, human cytomegalovirus infection and tuberculosis (Fig. S2). Padj < 0.05 was used as the screening criterion.

Discussion
We showed that two SVM algorithms that were developed based on platelet RNA-seq data could achieve high accuracy, sensitivity and specificity in thyroid neoplasm identification and also provide a strong indication on neoplasm type and metastasis. We also identified important genes and pathways in thyroid neoplasms using bioinformatics methods.
In our study, for the first time, the SVM algorithm and platelet RNA-seq were combined and employed for thyroid neoplasm diagnostics. Compared to similar studies that employed artificial intelligence system to develop diagnostic algorithms, the accuracy, AUC, sensitivity and specificity yielded by our algorithms were at levels similar to theirs. For example, also based on platelet mRNA profiles, Best et al. [9]. employed a leave-one-out crossvalidation SVM algorithm to distinguish patients with neoplasms from healthy individuals, yielding an accuracy of 96%, with an AUC of 0.986. In Li et al.'s study [3], a deep convolutional neural network model was developed and trained based on 180668 images from 25235 controls and 131731 ultrasound images from 17627 patients with thyroid cancer, yielding a highest accuracy of 89.8%, with an AUC of 0.947. To test the stability of performances of our algorithms, we additionally executed both of them 1000 times, yielding an average accuracy of 96.5% for the  binary algorithm and 63.9% for the algorithm. Compared to 97%, the accuracy we reported particularly, 96.5% suggests the performance of the binary algorithm was stable. When the ratio of training:validation was changed to 40:60, an accompanying decrease of average accuracy from 96.5% to 94.9% indicated a relatively high number of samples in the training set was needed to ensure a relatively accurate algorithm. By contrast, the performance of the multiclass algorithm seemed to be overestimated. After 1000 times execution, its overall accuracy declined to 63.9%, with accuracies of the four classes (NC, TA, PTC, MPTC) decreasing to 93.3%, 47.3%, 47.7% and 47.3%, respectively. This was probably due to transcript patterns of our samples are partially similar. When the training set comprised too many samples with similar transcript patterns, hyperplanes generated by the algorithm tended to separate these samples well but not classes, which resulted in the decreased accuracy. Although 80.5% seems to approach the upper limit of the multiclass algorithm's accuracy, as long as the training set comprise more patient samples with distant transcript patterns, this high accuracy could be repeated, and the thyroid neoplasm discriminative power of the algorithm could also be proved. Compared to recently published studies that used bioinformatics approaches to identify key genes and pathways in thyroid cancer, our study had a rather big number of samples (both patient and normal) analyzed. Based on four datasets (GSE3678, GSE3467, GSE33630 and GSE58545), Liang et al. [10] reported 225 differentially expressed genes, among which PKIA, SELENBP1, CHI3L1, QPCT, KCNJ2, ITGA2 also showed to have differential transcript levels in our study. Other GDTLs we detected like CXCR1, VASP, ZYX, GPR137 and GOLPH3 were reported as known markers in many cancers [11][12][13][14][15].
Our study has some limitations. The two algorithms we presented were developed with a quiet small number of samples. The performance of both was believed to improved by including more samples, especially the multiclass algorithm's. The binary algorithm was good at proving the capacity of a SVM algorithm in identifying thyroid neoplasm patient but not clinically practical so far, unless other diseases that share similar clinical symptoms with thyroid neoplasms are included. Besides, the SVM hyper parameter Cost-C was not fine-tuned. In principle, a bigger or a smaller C parameter results in a hyperplane with larger minimum margin or a hyperplane that correctly separate more training points, respectively. Because we used a default C parameter, the accuracy, sensitivity and specificity that our algorithms achieved might not be optimal. Also, we did not estimate the contribution of variables, such as patient age, gender, blood sample storage time, alcohol, smoking and treatment history to the altered platelet profile.
The capacity of SVM algorithms developed based on platelet RNA-seq data in identifying thyroid neoplasm patients and differentiating patients with multiclass thyroid neoplasms was proven in this study. Further algorithm development is warranted to include significantly more matched patient/healthy individual blood samples. Further validation is expected to determine the potential of our algorithms for therapy selection, disease recurrence monitoring, etc. Also, future studies should address causes and anticipated risks of samples misclassified in our study, such as cancer patients classified as healthy individuals.