Identification of an epigenetic signature in human induced pluripotent stem cells using a linear machine learning model

The use of human induced pluripotent stem cells (iPSCs), used as an alternative to human embryonic stem cells (ESCs), is a potential solution to challenges, such as immune rejection, and does not involve the ethical issues concerning the use of ESCs in regenerative medicine, thereby enabling developments in biological research. However, comparative analyses from previous studies have not indicated any specific feature that distinguishes iPSCs from ESCs. Therefore, in this study, we established a linear classification-based learning model to distinguish among ESCs, iPSCs, embryonal carcinoma cells (ECCs), and somatic cells on the basis of their DNA methylation profiles. The highest accuracy achieved by the learned models in identifying the cell type was 94.23%. In addition, the epigenetic signature of iPSCs, which is distinct from that of ESCs, was identified by component analysis of the learned models. The iPSC-specific regions with methylation fluctuations were abundant on chromosomes 7, 8, 12, and 22. The method developed in this study can be utilized with comprehensive data and widely applied to many aspects of molecular biology research. Electronic supplementary material The online version of this article (10.1007/s13577-020-00446-3) contains supplementary material, which is available to authorized users.


Introduction
The application of human induced pluripotent stem cells (iPSCs) in medicine requires prior assessment of the cells with respect to quality, including identity, equivalence, and safety. For evaluation of the iPSCs, comprehensive molecular analysis of characteristics, such as DNA methylation, rather than tests based on a few marker genes, is considered to be more useful. DNA methylation is an epigenetic modification with important roles in normal development and differentiation [1][2][3][4][5][6]. DNA methylation profiles vary depending on tissue types and cell lineage [5,7]; therefore, the DNA methylation profile of a cell can be useful for the identification and validation of its cell type. Epigenetic reprogramming, which involves conversion of the DNA methylation profile from somatic to pluripotent cell type, is an essential for the transformation of somatic cells into iPSCs; the cells that acquire the DNA methylation profile of embryonic stem cells (ESCs) become iPSCs [8,9].
Human iPSCs lower the rate of immune rejection and help in resolving ethical issues associated with the use of ESCs in regenerative medicine [10]. Since the successful development of iPSCs [11][12][13], comparative analyses between iPSCs and ESCs have been performed by many researchers. Choi et al. [14] reported that there are no molecular or functional differences between genetically matched human ESCs and iPSCs. On the other hand, several studies have identified differentially methylated DNA regions between human iPSCs and ESCs [8,[15][16][17].

3
However, these studies only analyzed single point of passage of human iPSCs. In a previous study, we comparatively analyzed several points of passages of 22 human iPSC lines and the results indicated the presence of aberrant hypermethylated sites in iPSCs; however, aberrant hypermethylation in iPSCs occurs stochastically throughout the genome and there is no iPSC-specific aberrant methylated site common to all iPSCs [9]. Despite the lack of DNA methylation hotspots in iPSCs, previous studies have suggested that there are fundamental differences between ESCs and iPSCs, raising questions regarding the extent of similarity between ESC-type epigenome and the reconstructed whole genome of iPSCs. For comparative analysis of cell types with no clear differences, machine learning technology may be useful.
Machine learning is a data analysis technique that attempts to train computers to learn through experience with datasets, in manner similar to natural learning in human. Supervised machine learning can be used to build models for evidence-based prediction, even when there is uncertainly. A supervised learning algorithm trains a machine learning model on a set of input data and the resultant responses (outputs), so that it can reasonably predict the response to new data. In supervised machine learning, classification or regression methods are used to construct predictive models. Classification models are trained to classify the data into categories. Regression models are used to estimate one variable based on the data.
If a model capable of discriminating between ESCs and iPSCs can be constructed using supervised machine learning, the difference between the two cell types could be elucidated. Such a model could help identify the factors underlying the differences between ESCs and iPSCs, as well as enable visualization of these differences, which cannot be distinguished by the naked human eyes.
In this study, we used classification method-based machine learning to create a model that can discriminate between iPSCs and ESCs on the basis of DNA methylation profiles. Further, we attempted to determine the difference between iPSCs and ESCs by analyzing the components of the learning model. Our machine learning-based analysis method and the identified epigenetic indices are useful for evaluating the therapeutic application of human iPSCs. We propose a new method for molecular analysis of the cells that combines comprehensive DNA methylation data and machine learning.

Preparations of mouse embryonic fibroblasts (MEFs) and MEF feeder cells
MEFs were isolated from 13.5-dpc fetuses of pregnant CD1(ICR) mice (Charles River Japan, Inc., Yokohama, Japan) and cultured in Dulbecco's modified Eagle's/highglucose medium (DMEM) (Sigma-Aldrich, St Louis, MO, USA) containing 10% fetal bovine serum (FBS) (Thermo Fisher Scientific, Inc., Waltham, MA, USA, Cat. No. SH3091003), 55 μM 2-mercaptoethanol (Thermo Fisher Scientific), 1% penicillin and streptomycin (Thermo Fisher Scientific). MEFs were irradiated with 30 Gy of gamma irradiation to generate MEF feeder cells. All procedures were performed in accordance with the guidelines for animal care and use of laboratory animals, University of Miyazaki, and the experimental protocols were approved by the Animal Experiment Committee of University of Miyazaki (no. 2012-017, 2017-009).

DNA methylation analysis
DNA methylation profiles were obtained from each sample using the Illumina Infinium assay with the Infinium Human-Methylation450K BeadChip and Infinium MethylationEPIC BeadChip (Illumina Inc., San Diego, CA, USA). Genomic DNA was extracted from the cells using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany). From each sample, 1 µg of genomic DNA was subjected to bisulfite conversion using the EZ DNA Methylation kit (Zymo Research, Orange, CA, USA), according to the manufacturer's recommendations. Following bisulfite conversion, the genomic DNA was hybridized with the Infinium HumanMethylation450K BeadChip and MethylationEPIC BeadChip, and each Bead-Chip was scanned on an iScan (Illumina Inc.) according to the manufacturer's instructions. GenomeStudio (Illumina Inc.) was used for background subtraction and normalization of data. Methylated and unmethylated signals were used to compute the β value, a quantitative score of the DNA methylation rate that ranges from "0.00", for completely unmethylated state to "1.00", for completely methylated state. Additional DNA methylation data were obtained from the NCBI database. Detailed information of cell lines and accession numbers used in this study is mentioned in Supplemental Table 1. Common probes between 450K and EPIC were selected. The probes with sequences that overlapped with variants showing minor allele frequency (MAF) ≥ 5% [35] and detection p value ≥ 0.05 (computed from the background based on negative controls) were eliminated from further analysis. A total of 385,683 CpG sites were analyzed in 104 samples including 27 ESC lines, 43 iPSC lines, 9 ECC lines, and 25 somatic cell lines. Unsupervised hierarchical clustering (HCA) with Euclian distance and group average method and principal component analysis (PCA) were used for data analysis. A differentially methylated region (DMR) was characterized by a CpG site having a score that differed by ≥ 0.3 points with respect to the β values between two groups. For comparing the average number of DMRs between ESCs and iPSCs, 15 samples were randomly selected from 27 ESC lines and 47 iPSC lines, and the number of DMRs was counted. This step was repeated 100 times and the average number of DMRs was calculated. For comparing the average number of CpG sites within a certain range of standard deviation (SD) between ESCs and iPSCs, 15 samples were randomly selected from 27 ESC lines and 47 iPSC lines, and the number of CpG sites within a certain range of SD was counted. This step was repeated 100 times and the average number of CpG sites within a certain range of SD was calculated.

Machine learning
Jubatus, a machine learning analytical platform, is an online open-source software (https ://jubat .us/en/) developed by Preferred Infrastructure, Inc. (Tokyo, Japan) and NTT SIC (Tokyo, Japan). Multi-class classification (one-vs-others) of the cell types was performed using the classification module Jubaclassifier with Adaptive Regularization of Weight vectors (AROW) [36], which is a linear classification model supported by Jubatus. To perform 4-fold cross-validation, each cell line was divided into four groups, A-D (Supplemental Table 1), and the following four learning series were used: Series-1 comprising training dataset, BCD and test dataset, A; Series-2 comprising training dataset, CDA and test dataset, B; Series-3 comprising training dataset, DAB and test dataset, C; and Series-4 comprising training dataset, ABC and test dataset, D. The training datasets were used for learning in constructing learning models, and test datasets as unknowns were used for validation of the learned models. The construction of learning models was entailed by the random selection of one sample from the training dataset, followed by the input of the DNA methylation rates of 385,683 CpG sites and the cell type of the selected sample into Jubatus followed by learning, thereby updating the learning model. This process was repeated for all the samples in the training dataset, and learning once with all the samples in the training dataset was designated as 1 epoch. In total, 300 epochs were performed and the learned model was assessed every 10 epochs. The adaptive regularization parameter was evaluated using variable regularization weight values of "0.10", "0.25", "0.50", "0.90", "1.00", and "1.10". The learned model was delineated using four classification models corresponding to the cell types (ESCs, iPSCs, ECCs and somatic cells). The source code is available on GitHub (https ://githu b.com/aknis hino/20191 212_Jub). For evaluating the learned models, Precision, Recall and F-score, Macroaverage Precision (Precision Macro ), Macro-average Recall (Recall Macro ) and Macro-average F-score (F-score Macro ) of the each learned model were calculated using the formulae shown in Table 1.

Sodium bisulfite sequencing
Sodium bisulfite treatment of genomic DNA was carried out using the EZ DNA Methylation-Gold kit (Zymo Research). PCR amplification was performed using BIOTAQ™ HS DNA polymerase (Bioline Ltd, London, UK) with specific primers for CSMD1, FZD10, DNAH9, FAM19A5, TMEM132C, and TMEM132D. The primers used in this study are summarized in Supplemental Table 2. To determine the methylation states of individual CpG sites, the PCR product was gel-extracted and subcloned into Eco RV cut-pBluescriptII vector using NEBuilder HiFi DNA Assembly Master Mix (New England BioLabs, Ipswich, MA, USA), and then sequenced. Methylation sites were visualized and quality control was carried out using the web-based tool QUMA (https ://quma.cdb.riken .jp/) [37].

Accession numbers
NCBI GEO: Infinium HumanMethylation450K BeadChip and Infinium MethylationEPIC BeadChip data obtained in this study have been submitted under the accession number GSE141521. Additional DNA methylation data were obtained from the NCBI database. Accession numbers are given in Supplemental Table 1.

Comparison of DNA methylation between ESCs and iPSCs
The DNA methylation profiles of 104 human samples, including 27 ESC lines, 43 iPSC lines, 9 ECC lines, and 25 somatic cell lines, were obtained using the Illumina Infinium HumanMethylation array. The methylation rates of 385,683 CpG sites were further analyzed (see details in "Materials and methods"). The promoter regions of the pluripotencyassociated genes POU5F1, NANOG, SALL4, RAB25, and EPHA1 showed low levels of methylation, whereas those of the somatic cell-associated genes GBP3, LYST, and SP100 were highly methylated in ESC and iPSC lines (Supplemental Fig. 1a). Unsupervised hierarchical cluster analysis (HCA) (Fig. 1a and Supplemental Fig. 1b) and principal component analysis (PCA) (Supplemental Fig. 1c) revealed that iPSCs were clearly distinct from somatic cells and ECCs, but not from ESCs. Comparison between the two types of cells showed that there was no differentially methylated region (DMR) between ESCs and iPSCs (Fig. 1b). These results indicate that there was no clear difference between ESCs and iPSCs.

Construction of a machine learning model for the classification of cell types
The DNA methylation data of 385,683 CpG sites and information on the cell type of the training samples were used for machine learning (Fig. 2a). In this study, machine learning involved 4-fold cross-validation method, wherein each cell line was divided into four groups to create four datasets (training dataset and test dataset) (Supplemental Table 1) and six different regularization weight values were validated. With each training dataset and regularization weight, 300 epochs were performed; thus, the total number of epochs performed was 7,200 (4 data sets × 6 regularization weight values × 300 epochs). After every 10 epochs, learning results were saved and thus, 720 learning results were obtained as learning models from each dataset. Each of the 720 learning models was used to discriminate the training dataset and the unknowns (test data set) (Fig. 2b), and comparative analyses of the average of the F-score Macro rate were performed (Supplemental Fig. 2b). The learning models from the 250th epoch with AROW regularization weight value of "1.00" had the highest average of the F-score Macro rate from the four models for both training dataset and test dataset, and were therefore selected as the optimal learning models. The highest average of the F-score Macro rate of the test data set, which was achieved by the optimal learning model, was 94.36% (Supplemental Fig. 2a, b). The accuracy, Precision Macro , Recall Macro , and F-score Macro rates of the test data set in the mixed four models were 94.23%, 95.17%, 93.63% and 94.39%, respectively ( Table 1). The accuracy, Precision Macro , and Recall Macro rates of the test data set in each four models were shown in Supplemental Fig. 3. The learning model distinguished ESCs from iPSCs with an accuracy of ≥ 81.82% (Supplemental Fig. 3). These results indicated that the learning model generated in the 250th epoch, with a regularization weight value of 1.00, is able to distinguish iPSCs from ESCs with a high efficiency.

Distribution of the iPSC Pos-ESC Neg high weight sites on chromosome
The iPSC Pos-ESC Neg sites were found to be abundant on chromosomes 7, 8, 12, and 22 (Fig. 4a). Next, we focused on the DNA methylation rate of high weighted sites. We compared the DNA methylation rates of iPSC Pos-ESC Neg sites in ESCs and iPSCs, and identified five regions in which the DNA methylation fluctuated only in iPSCs (Fig. 4b); these regions were found in the following genes: CUB And Sushi Multiple Domains 1 (CSMD1), Transmembrane Protein 132C (TMEM132C), Transmembrane Protein 132D (TMEM132D), Frizzled 10 (FZD10), dynein axonemal heavy chain 9 (DNAH9), and TAFA chemokine like family member 5 (FAM19A5). The fluctuating regions in these genes were located around the TSS (Fig. 4c, d and Supplemental Fig. 4). To confirm variable methylation at those regions, sodium bisulfite sequencing analysis was performed. Consistent with the results of Infinium HumanMethylation assay, these reasons showed variable methylation in iPSCs (Fig. 4c, d and Supplemental Fig. 4). However, these genes are rarely expressed in ESCs and iPSCs.

Analysis of the high weight sites
The top ten highly weighted sites were selected and the DNA methylation rates in these sites were compared. The DNA methylation rates of the iPSC Pos-ESC Neg sites in an individual line of iPSCs were found to be widely distributed, whereas ESCs generally had a low methylation rate ( Fig. 5a and Supplemental Fig. 5a). On the contrary, the ESC Pos-iPSC Neg sites had varied methylation rates in both ESCs and iPSCs ( Fig. 5b and Supplemental Fig. 5b). Variations in the DNA methylation of the high weight sites in iPSCs were not due to the differences in the methods of iPSC generation or the types of the parental cells ( Fig. 6a and Supplemental Fig. 5c). Interestingly, the ESCs showed a larger number of variably   (Fig. 6b) and CpG sites with high standard deviation compared to the iPSCs (Fig. 6c) in the analysis of all CpG sites, indicating that ESCs have more variability in DNA methylation rates than iPSCs. However, in the high weight sites, iPSCs had more CpG sites with high standard deviation compared to ESCs, indicating high variability in iPSCs with respect to methylation levels at the high weight sites (Fig. 6d). These results suggest that the machine learning method was able to determine CpG sites with DNA methylation diversity specific to iPSCs, which can be considered as a characteristic for distinguishing iPSCs from ESCs.

Discussion
In this study, we developed a new method to distinguish between iPSCs and ESCs on the basis of their DNA methylation profiles. We constructed a learning model based on the linear model for multi-class classification using Jubatus, a machine learning platform. In recent years, deep learning methods have often been used for biological analysis; however, these methods usually require at least 10,000 samples. The availability of only 10-100 variants of human iPSC lines makes the linear model classification system ideal for the analysis of human iPSCs.
iPSCs are essentially an alternative to ESCs, with almost no difference between the two in terms of their properties. iPSC lines generated with non-genome integration methods, such as episomal vector or RNA transfection, are indistinguishable from ESCs in terms of morphology, differentiation ability, gene expression, DNA methylation, etc. [24,38]. The results obtained in this study are in agreement with previous reports, as no epigenetic features that clearly distinguished iPSCs from ESCs were found. However, our analyses, using a collection of DNA methylation profiles from different types of cells, including 43 iPSC lines which contained Retro-, Lenti-, Sendai-, and Episomal-iPSCs, 27 ESC lines, 9 ECC lines, and 25 somatic cell lines, demonstrated that machine learning with AROW, a linear model for classification, is effective for the discrimination of cell types, especially iPSCs and ESCs. The learned models achieved high-accuracy prediction rates in distinguishing iPSCs from ESCs. In other words, our learned models recognized the differences between iPSCs and ESCs and were able to discriminate between the cell types. Interestingly, the learned models recognized the iPSC lines as iPSCs, irrespective of the production methods used.
One of the advantages of a linear classification-based learning model is the ability to select and analyze components, such as determination weights corresponding to each CpG site. The analysis of the high weight components revealed that the learned models searched for genomic regions that are characteristic of iPSCs and used them to distinguish iPSCs from ESCs. This resulted in the identification of fluctuating iPSC-specific methylation regions, which are especially abundant on chromosomes 7, 8, 12, and 22. DNA methylation was more variable in each of the ESC lines than in iPSCs, indicating that the ESCs possessed more fluctuating methylation regions than the iPSCs. Despite the methylation variation in ESCs, the learned models selected ß-value (Methylation rate) CpG sites with DNA methylation diversity specific to iPSCs as characteristics for distinguishing iPSCs from ESCs. Comparison of the DNA methylation rates of the iPSC Pos-ESC Neg sites led to the identification of fluctuating methylation regions in six genes, including CSMD1, TMEM132C, TMEM132D, FZD10, DNAH9, and FAM19A5. CSMD1 is known to be a tumor-suppressor gene under the control of DNA methylation in liver cancer and head and neck squamous cell carcinoma [39][40][41]. TMEM132C has been reported to show differential methylation and is downregulated by DNA hypermethylation in breast tumors [42]. TMEM132D [43] and DNAH9 [44] are cancer-associated genes in small cell lung cancer, and FDZ10 has a role in cancer reactivation [45]. The expression of FAM19A5, also known as TAFA5, is influenced by the activation of β-catenin [46] and c-Myc promotes the Wnt/β-catenin activity in breast cancers [47]. These genes are involved in carcinogenesis, and the fluctuating regions in these genes are located around the TSS; this suggests that variations in DNA methylation in these genes influence the risk of iPSCs. However, it is seen that these variations in DNA methylation do not affect the gene expression profiles in either ESCs or iPSCs, and also do not exert any influence on pluripotency. Nevertheless, it is possible that the fluctuations in methylation may affect the differentiation properties of iPSCs. The possible effects of such methylation fluctuations on the differentiation properties of iPSCs need to be evaluated through further detailed investigations. Comparison of iPSCs obtained through different production methods revealed that Sendai-iPSCs were the least diverse in terms of fluctuating methylation regions, and their DNA methylation pattern showed maximum similarity with that of the ESCs. The similarity observed between Sendai-iPSCs and ESCs is consistent with the result of a previously reported comprehensive DNA methylation analysis [23]. However, no significant differences were detected in pluripotency between the Sendai-iPSCs and the iPSCs derived from other production methods [23]. Aberrant DNA methylation at some imprinted gene loci in ESCs and iPSCs has been reported [9,48,49], and this abnormality was detected in 68 imprinted genes [23], indicating that aberrant DNA methylation occurs widely in human ESCs and iPSCs. In this study, we identified 130 high weight sites, including 13 ESC Pos-iPSC Neg and 117 iPSC Pos-ESC Neg CpG sites; however, there were no imprinted genes in the 130 high weight sites, suggesting that the abnormalities of imprinted genes are not specific to either iPSCs or ESCs.
In conclusion, we were able to distinguish human iPSCs from ESCs using machine learning methods, even when the cells lacked specific markers. The results of this study will have a significant effect on the use of these cell lines in various in vitro research studies for specific purposes. In addition, an epigenetic signature of iPSCs was identified by component analysis using our learned models. The learned models developed in this study contribute towards enhancing our understanding of the iPSCs at the gene level and hold potential for achieving remarkable advances in various fields of biology research, including computational biology, molecular biology, cell biology, and cancer biology. The approach of the machine learning method used in this study is useful for comprehensive data analysis and can be widely applied to iPSC research as well as many other fields of research in life sciences.