Detailed Characterization of Human Induced Pluripotent Stem Cells Manufactured for Therapeutic Applications

We have recently described manufacturing of human induced pluripotent stem cells (iPSC) master cell banks (MCB) generated by a clinically compliant process using cord blood as a starting material (Baghbaderani et al. in Stem Cell Reports, 5(4), 647–659, 2015). In this manuscript, we describe the detailed characterization of the two iPSC clones generated using this process, including whole genome sequencing (WGS), microarray, and comparative genomic hybridization (aCGH) single nucleotide polymorphism (SNP) analysis. We compare their profiles with a proposed calibration material and with a reporter subclone and lines made by a similar process from different donors. We believe that iPSCs are likely to be used to make multiple clinical products. We further believe that the lines used as input material will be used at different sites and, given their immortal status, will be used for many years or even decades. Therefore, it will be important to develop assays to monitor the state of the cells and their drift in culture. We suggest that a detailed characterization of the initial status of the cells, a comparison with some calibration material and the development of reporter sublcones will help determine which set of tests will be most useful in monitoring the cells and establishing criteria for discarding a line.


Introduction
Induced pluripotent stem cells (iPSCs) are akin to embryonic stem cells (ESC) [2] in their developmental potential, but differ from ESC in the starting cell used and the requirement of a set of proteins to induce pluripotency [3]. Although functionally identical, iPSCs may differ from ESC in subtle ways, including in their epigenetic profile, exposure to the environment, their mitochondrial content and perhaps X chromosome inactivation [4]. These differences are intrinsic to the source of starting material, and such differences may be further amplified by the pluripotency induction process [5]. It is important to note, however, that current studies have shown that these intrinsic differences between ESC and iPSC do not necessarily reflect in their functional utility; in fact, several large-scale analyses have verified that the differences seen are more reflective of the allelic diversity of individuals [6,7]. The degree of difference seen between iPSC lines from different individuals is in the same range as differences seen between iPSC lines made from the same donor but different tissues and between ESC and iPSC [8,9]. Equally important, the changes introduced by the process of iPSC generation using nonintegration methods are in the same range as those changes that are seen when cells are maintained in culture for prolonged periods [10].
These differences, while of importance to the academic community, would be largely irrelevant to the regulatory authorities and for the development of an allogeneic or Electronic supplementary material The online version of this article (doi:10.1007/s12015-016-9662-8) contains supplementary material, which is available to authorized users. autologous product, provided these differences did not alter the potency or efficacy of the differentiated cells that were derived from these lines [11]. Indeed, this concept has been used for hematopoietic stem cell transplants where different donors (which differ in their allelic background and presumably in the efficacy) have been transplanted without showing that each sample was functionally identical by in vivo or in vitro testing [12]. It was presumed that the cells were functioning normally in the donor and that the harvesting process would not alter the cells (i.e. the cells were minimally manipulated), and one could reasonably infer that no change had occurred. This concept was extended to sibling or related transplants and then to matched unrelated donors (MUD) transplants. Here the inference was that one could reasonably extend the idea of functional equivalency even though the transplanted cells were functioning in a different setting.
This concept of functional equivalence was extended to cord blood, which is used for the same purpose as bone marrow but differs in that cord blood is processed differently from bone marrow and is considered more than minimally manipulated. The regulatory authorities reasoned that if functional equivalence in in vivo studies showed that cells could be manufactured reliably and reproducibly, then different groups using different processes and manufacturing at different sites could be approved under a Biologics License Application (BLA). Indeed, five public cord blood banks have been approved to provide MUD-type transplants for individuals using a commonly accepted release criteria for functional equivalence. Given the extent of in vivo human data available, no animal studies were required for the approval process. It is important to note that the regulatory authorities in the United States recognized that such licensure requirements should not be extended to autologous or related cord blood use, such as that proposed by private cord blood banks; indeed, those banks are not subject to the same BLA licensure requirements. This logic has been extended to other autologous therapy where cells are more than minimally manipulated, such as autologous T-cells, B cells, dendritic cells, NK cells and macrophages [12]. Each of the cell populations is manufactured in a lot that is sufficient for one individual, and each lot is intrinsically different from another lot and is transplanted in a host at different stages of illness, where the cells likely encounter different environments. The authorities have not required that each lot undergo testing, as would be required for an allogeneic product that would be used for hundreds or thousands of patients. Rather, they have asked people to demonstrate that the end product obtained after processing is functionally equivalent [12]. In some cases the authorities have required that eight or ten samples manufactured be shown to be effective in an animal model, and in some cases have required human safety studies of a limited nature [12]. Critical to such approval has been the necessity of having adequate tests or comparability data that also assesses function [13].
We have assumed that the regulatory authorities will consider a similar logic for other autologous products or HLAmatched products, including those derived from iPSC [12]. Therefore, like cord blood, autologous or matched cells will be regulated differently than allogeneic products derived from iPSC. We further reasoned that if other groups wished to generate new lines they could use the same process and the community could define functional equivalence of these new lines. Since the lines themselves are merely input material to make fully differentiated cells, we felt that no animal tests were required at the iPSC stage; rather, criteria for use for further downstream processing could be established by in vitro differentiation assays and agreed-on quality control (QC) criteria for pluripotency. Functional characterization and equivalency of the end product with any necessary in vivo or human studies would occur on the final manufactured product. As with other products that may be used for autologous or allogeneic manufacture, we assume that the tests required will be different and the regulations likely different, but in both cases it will be critical to have comparability data.
Given most groups were initially focusing on allogeneic therapy, we initiated a program to generate clinically compliant cells and have reported on the generation of two such lines [1], which we presume will be used to generate a variety of products from a MCB. Although the process development was expensive and time consuming [1,14], we reasoned that the cost would be amortized over a large number of patients [15]. Our data suggest that these lines could be used to commercialize iPSC-based cell therapy following a standard Investigational New Drug (IND) path.
However, it became evident that this was not a viable model for autologous cells and haplobank-derived cells, as had become clear with other cell therapies (see above). One would have to reduce cost and the regulators would need to develop models akin to those they have developed for cord blood banks. Therefore, we evaluated what would need to be done to reduce cost should cells be used for autologous therapy [13] or if a Haplobank was established [16]. Moreover, since these lines may be used by a number of individuals and utilized to generate a number of different productseach of which will be likely manufactured in a different site by different companieswe reasoned that additional characterization may be necessary and that a database to monitor changes in cells in culture needs to be established. In this manuscript, we describe the detailed characterization of two cGMP-compatible iPSC lines using WGS, array-based analysis and aCGH SNP analysis. One of these lines -LiPSC-GR1.1-generated during GMP manufacturing runs and the other line -LiPSC-ER2.2generated during engineering runs using the same GMP compatible process described before [1]. Our goal is to provide data to end users to determine which subset of tests will be required for ongoing monitoring, how such tests should be used to evaluate use of subclones for preclinical studies or cell therapy, and how comparability between manufacturing sites needs to be established.

Embryoid Body (EB) Differentiation
Confluent cultures of human pluripotent stem cell colonies were dissociated using L7™ hPSC Dissociation Solution. Cell aggregates were suspended in EB formation medium consisting of DMEM/F12 (Life Technologies, 11330-032) containing 10 μM Rock Inhibitor Y27632 (Millipore, SCM075) and allowed to settle by gravity in a conical tube. After removing the supernatant, cells were suspended in fresh EB medium. Cell aggregates were then seeded using a split ratio of 1:1 on Ultra Low Attachment (Corning, YO-01835-24) plates and returned to the incubator for 12 to 24 h. Once large cell aggregates formed, they were collected into a conical tube and allowed to settle by gravity. The medium was then removed and replaced with differentiation medium (80 % DMEM High Glucose (Life Technologies, 11965-092), 20 % defined fetal bovine serum (Hyclone, SH30070.03), 1X non-essential amino acids (Life Technologies, 11140-050), 2 mM L-glutamine (Cellgro/ Mediatech, 25-005-CI) and 55 μM β-Mercaptoethanol (Life Technologies, 21985-023)). The cell aggregates were placed on Ultra Low Attachment plates using a split ratio of 1:1 in 0.4 ml differentiation medium/cm 2 . The culture medium was then changed every second day for six days. On the seventh day, EBs were seeded on gelatin-coated plates (EmbryoMax® ES Cell Qualified Gelatin Solution (Millipore, ES006-B)) at 10 EBs/cm 2 . The EBs were allowed to attach undisturbed for 2 days. The differentiation medium was changed after the second day and every other day afterward with 0.4 ml/cm 2 differentiation medium. The cultures were prepared for immunocytochemistry on day 14.

Flow Cytometry
Flow cytometry of hPSCs was performed when cells reached approximately 70 to 80 % confluency in hPSC medium. The cultures were dissociated into a single-cell suspension using a solution of 0.05 % Trypsin/EDTA (CellGro, 25-052-CI) containing 2 % chick serum (Sigma-Aldrich, C5405). The cells were fixed and permeabilized for intracellular staining with the Cytofix/Cytoperm Kit (Becton Dickinson, 554714) following the manufacturer's recommended protocol.
The immunocytochemistry and staining procedures of human pluripotent stem cells differentiated into neural lineage were as described previously [19]. Briefly, cells were fixed with 4 % paraformaldehyde for 20 min, blocked in blocking buffer (10 % goat serum, 1 % BSA, 0.1 % Triton X-100) for one hour, followed by incubation with the primary antibody at 4°C overnight in 8 % goat serum, 1 % BSA, 0.1 % Triton X-100. Appropriately coupled secondary antibodies, Alexa350-, Alexa488-, Alexa546-, Alexa594-or Alexa633 (Molecular Probes, and Jackson ImmunoResearch Lab Inc., CA), were used for single or double labeling. All secondary antibodies were tested for cross reactivity and non-specific immunoreactivity.

Expression Analysis by Microarray
Total RNA was isolated using the RNeasy® Mini kit according to the manufacturer's instructions (Qiagen, CA) and hybridized to Illumina Human HT-12 BeadChip (Illumina, Inc., CA, performed by Microarray core facility at the Burnham Institute for Medical Research). All the data processing and analysis was performed using the algorithms included with the Illumina BeadStudio software. The background method was used for normalization. The maximum expression value of gene for probe set was used as the expression value of the gene. For the processed data, the dendrogram was represented by global array clustering of genes across all the experimental samples, using the complete linkage method and measuring the Euclidian distance. Expression of sample correlations was a measure of Pearson's coefficient, implemented in R System.

CGH-CHIP Analysis
CGH-CHIP analysis was carried out using the aCGH + SNP service by Cell Line Genetics. Cryopreserved vials of the iPSCs were submitted to the contract lab to prepare sample and run the assay per standard procedures summarized below. The iPSC cryovials were thawed at 37°C, washed once in 1xPBS, and centrifuged. The supernatant was then removed and the cell pellet was exposed to proteinase K and RNase and incubated at room temperature for two minutes. Following the addition of lysis buffer and incubation at 56°C for 10 min, the samples were added to a DNeasy® mini spin column and attached by centrifugation. Samples were then washed two times with wash buffer and eluted in suspension buffer. gDNA samples were then cleaned using a Zymo DNA clean and concentration column. ChIP DNA binding buffer was added to the gDNA and added to a Zymo-Spin IC-XL column by centrifugation. The tube was washed two times with wash buffer and then eluted in DNA suspension buffer. DNA concentration and quality were determined using a NanoVue™ UV spectrophotometer, Qubit™ Fluorometer, and agarose gel analysis. The isolated gDNA must meet the following requirements: concentration of ≥1 μg of dsDNA measured by the Qubit™ Fluorometer; 260/280 Ratio of 1.76-1.9 measured by NanoVue™ Spectrophotometer; and 260/230 Ratio of ≥1.9 measured by NanoVue™ Spectrophotometer.
Following gDNA isolation, labeling reactions were prepared using the Agilent SureTag Complete Labeling Protocol for aCGH with 500-1500 ng total (RNase treated) DNA input. The Agilent microarray aCGH protocol composed of two steps: Labeling of the DNA and hybridization. First, equal amounts of both test and reference samples (500-1500 ng) were enzymatically sheared for aCGH + SNP arrays using a dual DNA digestion with restriction enzymes Rsa1 and Alu1. The test sample DNA was labeled with Cyanine 5-dUTP and the reference DNA was labeled with Cyanine 3-dUTP by Exo-Klenow fragment. The labeled DNA was then purified, and the labeling efficiency and concentration were determined using the NanoVue™ UV spec. The test and appropriate reference samples were then combined and denatured. The labeled probes were allowed to hybridize with the feature on the microarray for 24 h at 65°C. Finally, the arrays were stringently washed and scanned at a 3 μM resolution on an Agilent SureScan Microarray Scanner. Feature data was extracted, processed and mapped to the human genome (hg19) using ADM-2 Segmentation Algorithm using Agilent CytoGenomics.

Whole Genome Sequencing
Whole genome sequencing was performed by Macrogen Clinical Laboratory (Rockville, MD). The samples were prepared according to an Illumina TruSeq Nano DNA sample preparation guide. Briefly, the whole genomic DNA was extracted using the DNeasy® Blood & Tissue Kit according to manufacturer's instructions (Qiagen, CA cat#69506). One microgram of genomic DNA was then processed using the Illumina TruSeq DNA PCR-Free Library Preparation Kit to generate a final library of 300-400 bp fragment size. Completed, indexed library pools were run on the Illumina HiSeq platform as paired-end 2x150bp runs. FASTQ files were generated by bcl2fastq2 (version 2.15.0.4) and aligned by ISAAC Aligner (version 1.14.08.28) to generate BAM files. SNPs, Indels, structural variants (SV) and copy number variants (CNV) were detected by ISAAC Variant Caller version 1.0.6 [20]. For the SNPs and Indel, locus reads with genotype quality less than 30 were removed from analysis. The vcf file thus generated was annotated using SNPEff Version 4.0e (http://snpeff.sourceforge.net/) [21] using hg19 reference genome, dbSNP138 build. The alternate allele frequency for European descendent samples were obtained from 1000 genome project_phase1_release_v3 and ESP6500 databases.
Samtools was used to obtain basic statistics such as the number of reads, number of duplicate reads, total reads mapped and total reads unmapped. SAMSTAT version 1.5.1 (http://samstat.sourceforge.net/) [22] was used to report the mapping quality statistics. The depth of each chromosome was computed by Issac variant caller.
The variants derived were used to predict the blood group phenotypes, with the analytical software Boogie [23]. Blood group predictions were made for routinely used ABO and Rh system. Apart from this, predictions for MN-and Rhassociated glycoprotein systems were also performed for both the cell lines. Genotype information including the chromosome number, genomic position, reference allele, alternate allele and zygosity of the variants belonging to the genes involved in the above mentioned blood group systems were provided as an input. Boogie verified the relevant variants in the input genotype with defined phenotypes in the haplotype table provided default by the software, based on 1-nearest neighbor algorithm. The SNV permutation with the most likely phenotype gets the best score. The blood groups thus predicted were compared with available donor data.
The HLA class I (HLA-A,-B and -C) and II (HLA-DQA1, −DQB1 and -DRB1) profiles for the iPSC lines were estimated from the WGS data by software called HLAVBseq, which was developed by Nariai and colleagues [24]. FASTQ files were aligned to the reference genome using BWA-MEM to generate a sam file. This method is based on the alignment of sequence reads to the genomic HLA sequences that are registered with IMGT/HLA database. Based on variational Bayesian inference statistical framework, the expected read counts on HLA alleles is estimated. The hyper parameter alpha zero for paired end data set to 0.01. The average depth of coverage for each HLA allele was calculated based on the perl script provided by the authors for 200 bp data. The predicted HLA types was cross-verified with HLA typing results generated by HLAssure TM SE SBT kit.
To verify if these cell lines showed any variations in genes implicated in PD, only the non-synonymous variants were considered. The variants were prioritized based on the benign or damaging effect of the amino acid substitutions. The insilico predictions program, such as SIFT, bases its predictions on the degree of conservation of amino acid residues [25], whereas Polyphen predicts these changes based on the impact of the amino substitution on the structure and function of the protein based on physical and comparative considerations [26], respectively. The scores of SIFT and Polyphen were computed by dbNSFP [27]. The prioritized variants were cross-validated with the list of PD related genes obtained from gene cards (www.genecards.org). The variants shortlisted were referred to clinvar and MIM_disease databases, annotated by dbNSFP. The integrative study of WGS and expression data was performed to validate if these PD-related genes showed any variation in their expression against the control lines H9, H7 and NCRM6 [28].
The Issac variant caller which was used to detect structural variants for the cell lines showed highest number of deletions, and hence were considered for the analysis. The variants with genotype quality <20, reads with MAPQ of zero around either break-end or unknown exact breakpoint location, and read pairs that support the variant with low confidence were removed. The filtered variants were annotated for the genes using UCSC table browser (https://genome.ucsc.edu/cgi-bin/ hgTables). Deletion events were manually viewed by Integrated Genomics Viewer (IGV) [29] to check for the dip in the coverage at deletion sites. Bed Tools v2-2.20.1 was used to check the level of overlap from two sets of genomic coordinates. This data was cross-validated with expression data to verify if there was any differential expression levels due to these deletions. These short listed differentially expressed genes were checked for gene enrichment to verify their implications in various OMIM_diseases and pathways [30]. A similar strategy was followed for duplications [31]. The results were compared with SNP CHIP data for both the cell lines to check for the gene overlap, if any. This cross comparison was made even with the unfiltered SV data, to verify if the genes identified by micro array were missed in WGS due to filtering.
To verify the status of the imprinted genes, the published list of imprinted genes was extracted from the database (http:// www.geneimprint.com/). Allelic depth of the alternate allele, if <10, was filtered. The number of heterozygous and homozygous SNPs, INDELS were calculated for these imprinted genes and verified for the genes overlap with expression data. The maternal or paternal specific expression for these genes were reported from the documented data. As no parental information was available, phasing could not be conducted on these samples to identify maternal or paternal specific inheritance pattern of these variants identified by the WGS.

HLA Type Analysis
HLA-typing was carried out by Texas Biogene, Inc. (Richardson, TX) using HLAssure TM SE SBT typing kits. The HLAssure TM SE SBT Kit is for determining HLA alleles using PCR amplification with sequence based typing (PCR-SBT) methodology. Briefly, the whole genomic DNA was extracted using the DNeasy® Blood & Tissue Kit according to the manufacturer instructions (Qiagen, CA cat#69506) and per requirements suggested by Texas Biogene, Inc. (i.e. DNA sample with an A260/A280 ratio between 1.65 and 1.8). The genomic DNA was then analyzed using the HLAssure TM A, B, C, DRB1, and DQB1 SBT typing kits and Accutype TM (HLADB-3.19.0) software per procedure established by Texas Biogene, Inc.

Karyotype and Short Tandem Repeat (STR)
Karyotype and STR analyses were performed by a qualified service provider (Cell Line Genetics) using standard methods. Human G-banding karyotyping was performed in accordance with FDA Good Laboratory Practice by Cell Line Genetics, which was audited by Lonza Walkersville, Inc. with clinically certified cytogeneticists experienced with identifying chromosomal abnormalities from pluripotent stem cells. For each cell line, 20 chromosomes were analyzed from live or fixed cells in metaphase. The analysis was performed using G-banding and Leishman stain, and the Cells were analyzed according to the Clinical Cytogenetics Standards and Guidelines published by the American College of Medical Genetics [32].
The STR assay utilized PCR and capillary electrophoresis on a PowerPlex 16 mutiplex STR platform (Promega) to determine a match of ≥80 % of the 16 loci evaluated. Data analysis was performed with SoftGenetics Genemarker software. Each assay was evaluated for off ladder peaks, considered artifacts, and cross contamination prior to reporting.

MCB Viral Testing
According to FDA regulations, release of allogeneic MCBs for clinical use requires extensive testing for the presence of viral contaminates. The scope of the MCB viral testing for hiPSCs was adjusted based on the cellular characteristics of pluripotent stem cells and comprised of both in vitro and in vivo assays [1]. Following preparation of samples per standard procedures recommended by the contract lab (BioReliance), samples were submitted to the BioReliance in appropriate condition and format. BioReliance is fully accredited for GLP, and all studies conducted by BioReliance are performed in compliance with the requirements of the UK and German GLP Regulations, the US FDA Good Laboratory Practice Regulations (21 CFR 58), the Japanese GLP Standard, and the OECD Principles of Good Laboratory Practice (http://www.bioreliance.com/us/ about-us).

Assay Qualification, Characterizations, and in Process Control
Flow Cytometry Assay for Pluripotent Stem Cells The flow cytometry assay for evaluation of human pluripotent stem cells was qualified according to the current Good Manufacturing Practices, the International Conference on Harmonization Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) validation guidelines [33]. The qualification study was conducted using stagespecific embryonic antigen-4 (SSEA-4), Tra-1-60, and Tra-1-81. In addition, Oct4, a transcription factor thought to play a key role in maintaining the self-renewal and pluripotency of the embryonic stem cells [34], was also included in the qualification study. The release criteria for pluripotency markers were established based on the positive expression of four different markers (SSEA-4, Tra-1-60, Tra-1-81, and Oct3/4) according to published data as well as data generated during the process development phase. Since cord blood derived CD34 positive cells were utilized as a starting material for reprogramming and generation of the final product hiPSCs, negative expression of surface marker CD34 was also included in the qualification study. Precision (intra-assay, inter-assay, and intermediate), accuracy, specificity and sensitivity of this flow cytometry assay was determined during the qualification study. The qualified flow cytometry assay with established release criteria was later used to evaluate the purity and identify of human iPSCs.
Quantitative PCR for Evaluation of Residual Plasmid Clearance Since human induced pluripotent stem cells (iPSCs) were generated using pEB-C5 (i.e. an EBNA1/OriP episomal plasmid expressing Oct4, Sox2, Klf4, c-Myc and Lin28) and pEB-Tg (i.e. an EBNA1/OriP plasmid for transient expression of SV40 T antigen), a quantitative PCR (qPCR) assay (BResidual qPCR^) was developed to quantitatively detect residual EBNA/OriP sequences originating from either pEB-C5 or pEB-Tg. Both pEB-C5 and pEB-Tg are nonintegrating plasmids that are supposed to become clear following serially passaging of hiPSCs [18,35]. Considering the goal of assay to determine the clinical safety of the hiPSC clones generated by episomal plasmids, the Residual qPCR assay was qualified according to the International Conference on Harmonization Technical Requirements for Registration of Pharmaceuticals for Human Use (ICH) validation guidelines. Accuracy, specificity, limit of detection (LOD), and limit of quantification (LOQ) were determined during the PCR qualification studies. This qualification study was executed based on 9 total assays conducted by 3 analysts on 3 separate days using a validated qPCR machine. Appropriate control positive, control negative, and reference materials were used in the qualification study.
Characterization Assays Evaluation of hiPSC colony morphology, plating efficiency of hiPSCs post-thaw, and embryoid body (EB) formation were classified as FIO assays due to the challenges associated with qualification of these assays, in particular the subjective interpretation of the results. EB formation was used to demonstrate the identity and potency of hiPSCs by investigating spontaneous differentiation into three germ layers (i.e. ectoderm, mesoderm, and endoderm) and evaluating the results through immunofluorescence at the protein level or qPCR analysis at the transcript level. Post-thaw plating efficiency was evaluated based on alkaline phosphatase (AP) staining. AP, a hydrolase enzyme responsible for dephosphorylating molecules such as nucleotides, proteins, and alkaloids under alkaline conditions, has been widely used for evaluation of undifferentiated pluripotent stem cells, including both embryonic stem cells and iPSCs [3,[36][37][38]. Upon staining, the undifferentiated cells appear red or purple, whereas the differentiated cells appear colorless. However, considering the inconsistencies observed in the quality of AP reagents offered by different suppliers and subsequent intensity of AP staining, it was difficult to set specifications and cut-off values for the cells positively stained with AP marker, which in turn resulted in inability to qualify this assay.

Pluritest
Pluritest is an online bioinformatic assay based on gene expression collected from Illumina microarray to verify pluripotency [3,39]. Pluritest is based on 450 genome-wide transcriptional profiles. These samples are from multiple laboratories and vary from diverse stem cell samples to differentiated cell types, developing and adult human tissue. 223 samples are human embryonic stem cells and 41 are from iPSC's. Two models were developed to obtain pluripotency and nonpluripotency, a pluripotency score and a novelty score. Pluripotency score is based on expression levels from known pluripotent and non-pluripotent genes in the 450 genomewide transcriptional profiles. Unknown samples' gene expression levels are compared to the expression levels from the 450 samples, and their pluripotency is based on this comparison. Novelty score measures the technical and biological variation, based on comparing samples to well-known PSC s in the dataset [40].

In Process Assays
One important aspect of the GMP manufacturing process was to establish appropriate assays at different stages of the process to ensure the quality of intermediate materials and monitor the progress of the process during the long manufacturing process. For instance, a flow cytometry assay was implemented following the isolation of CD34+ cells but before proceeding with the priming step. This step ensured appropriate population of actively proliferating CD34+ cells (i.e. a minimum of 40 % CD34+ cells) were included in the expansion phase (priming step) prior to the reprogramming (Nucleofection) step. Importantly, the selection of best hiPSC colonies for expansion was based on the quality of hiPSCs observed during the expansion phase as well as level of residual plasmid present in the samples taken from each hiPSC clone. A scoring system was established to evaluate the quality of hiPSC cultures after isolation and throughout the course of serial subculturing of the cells based on the attachment of hiPSC colonies the day after passaging, evaluation of the confluency and amount of spontaneous differentiation at each passage and every day, and elapsed time (days) per passage. Following selection and establishing the hiPSC clones, in-process samples were submitted at the end of each passage to detect the level of residual plasmids (i.e. residual EBNA/OriP sequences), using qPCR analysis and by recording the Ct value. The Ct values and the scores achieved for each clone were used to evaluate the best clone(s) to use in further manufacturing, scale-up and banking process.

Basic Cell Line Characterization
As we developed a process for cGMP manufacture of hiPSCs [1,14], we have generated over fifty lines using different methods and numerous donors to evaluate an optimal method that worked reliably and reproducibly in our hands. During the process development, we identified a critical set of required tests ( Table 1) that were similar to those required for most of the cell lines. In addition, we added karyotype analysis as a key release assay, because karyotype abnormalities (e.g. trisomies of chromosomes 12 and 17) have been observed with in vitro cultures of both human ESCs [41,42] and iPSCs [43], and these abnormalities are suggested to be characteristics of malignant germ cell tumors [44,45]. As no required tests have been defined by regulatory authorities on determining the quality of ESC or iPSC lines, we reasoned that determining performance of the lines based on their use would be a reasonable start. We determined that pluripotency could be determined by the presence of pluripotency markers that were assessed by immunocytochemistry using well characterized and widely accepted markers (Fig. 1), and by performing Table summarizes the tests that were performed on the three engineering run lines and the two cGMP lines (all). Note that the three engineering lines were generated at different times from the same donor sample (Female), while the two cGMP lines were generated from a different donor (Male) • For Information Only (FIO) functional assays on the ability of the cells to differentiate into ectoderm, endoderm and mesoderm (i.e. EB formation) (Fig. 1). In addition, given that iPSC are potentially immortal, we felt that it would be important to be able to trace the identity of the cells as they pass through the manufacturing process and are distributed worldwide. While several methods are available, we used STR (single tandem based repeat) tracking, as CLIA certified laboratories perform this test routinely and results are rapidly available [46]. Examining our process, we determined that STR typing should be done on the donor sample prior to the start of the reprogramming step, and it should be matched to the final iPSC sample taken after the cryopreservation step at the end of the manufacturing process. Additional characterization was performed using quantitative PCR, which included examining residual plasmids clearance. Furthermore, sterility was carried out as a standard method on the starting materials (cord blood derived CD34+ cells) and the final manufactured cell product (see methods). Mycoplasma and endotoxin tests were also carried out as standard method on final iPSC products. Figure 2 illustrates the process flow diagram along with the in-process samples and associated tests. At the beginning of the process, it was important to test the sterility of CD34+ cells and purity of these cells using flow cytometry analysis (in-process QC1); it was also critical to test CD34+ cells for karyotype, take samples for STR analysis and matching with the final iPSCs, and evaluate the purity of CD34+ cells at the end of priming stage (in-process QC2). The karyotype analysis, in particular, was critical to ensure the starting materials undergoing the reprogramming process were normal. In-process QC3 was performed to evaluate the quality of iPSCs selected after the reprogramming based on the morphology of iPSCs [1] and an RT-PCR based plasmid clearance test. To confirm plasmid clearance, in-process QC4 was carried out at multiple passages before final expansion and banking. These multiple inprocess tests were key to evaluate the quality of iPSCs before performing a comprehensive characterization of iPSCs through a wide range of QC tests [47,48].
An example of such a complete characterization is shown in Fig. 1 for one of the iPSC lines (LiPSC-ER2.1) produced during the engineering runs described before [1], and further detailed information on the tests conducted on the other lines is provided in Fig. 1 and Tables 1 and 2. Although such a comprehensive characterization may be deemed sufficient for these iPSC lines, we felt it was inadequate for the potential use of iPSC lines and that some additional important data needed to be collected. This included blood group (ABO and Rh) and HLA typing at high resolution. We noted that this was not part of the routine process of donor sample, and we were unable to return to the donor to obtain blood group data ( Table 2) from one of the tissue acquisition sites.
In summary, this minimal set of tests allowed us to track each line, define its basic characteristics and provide reasonable predictability as to the quality of the line and its potential to differentiate into desired differentiated cell types. The immunocytochemistry provided insight not only into the purity of the sample but also assessed the degree of contaminating cell populations and provided an unbiased method of comparing cells to established comparability criteria in the future.
However, while these tests may be necessary to evaluate a minimal level of the cells' quality, we wondered if these would be sufficient to eliminate all unwanted cell types, and unfortunately this was unclear. Minor Karyotypic abnormalities or karyotypic abnormalities in a small percentage of the cells will be missed. Integration of genes used in the iPSC generation process will not be recognized, and changes or mutations in genes that are not functionally important at the pluripotent stage will be missed. We wondered if one could increase the predictability of the quality by adding additional tests, and more importantly, be prepared to add additional tests that may be required in the future. We have proposed three additional tests, including CGH array based on hybridization to complement the karyotyping; a transcriptome analysis; and a whole genome sequencing (WGS) assay that would

CGH + SNP Microarray Analysis
We selected a SNP hybridization based assay to provide a higher-resolution map of the state of the cells. We utilized both clones of the cGMP line and the three engineering lines generated from the same individual using the same cGMP compliant process to determine the utility of the methodology. We chose a CLIA-certified provider, and the results are provided in Table 3 (for GMP lines) and Table 4 (for ENG run lines). As can be seen in Table 3, although GR1.1 is karyotypically and phenotypically normal, it has several small duplications and deletions. (See Table 3 Panel A and B). Comparing the changes seen in GR1.1 and GR1.2, several of the duplications and deletions were common, suggesting that they preexisted prior to the initiation of the iPSC process. A smaller but significant number differed between the two samples and likely were generated during the process of cell line derivation or propagation in culture. A larger number of alterations were seen in the engineering clones, and what was surprising was the degree of non-overlap (i.e. uncommon aCGH) ( Table 4). As with the ER clones, the detected changes were within the range described by other groups and included genes known to be affected in disease as well as genes known to be related to pathways important in disease [7,8,10]. Given the lack of overlap and because we had access to WGS and transcriptome analysis from cells at the same passage, we compared the results with those obtained by WGS (whole genome sequencing) and microarray expression. As discussed below, these changes did not appear to correlate with WGS data or alter transcription levels to any demonstrable extent. Although it is unclear as to the relevance of these changes, developing a database of the affected genes will allow us to determine if further changes occur as cells are propagated and whether changes in certain genes are common and related to the derivation process used, as has been suggested by some studies [7,8,10]. Indeed, in the relatively small sample size we noted that mutation in GNAS [49] were common in several lines.

Microarray Analysis
We reported a whole genome expression analysis of iPSC lines generated during process development, training runs, engineering runs and GMP manufacturing runs [1,14]. Here we compared gene expression of 10 iPSC lines generated from the same manufacturing process. Samples of the cells were collected for RNA extraction and whole genome expression analysis conducted using Illumina Bead Array platform (Human HT-12 v4 Expression BeadChip). We have previously shown that this platform is suitable for reliable and robust detection of differential gene expression in a large number of samples [50,51]. As a control, we included additional iPSC and ESC lines as well as a CD34+ sample, from which one iPSC line was derived and the NSC differentiated from the iPSC. The list of samples analyzed is reported in Supplemental Table 1, and the entire gene expression profile is reported in Supplemental Table 2, available upon request. Initial data processing was done in GenomeStudio software as previously described [52]. Whole genome expression raw data, normalized and non-normalized data can be accessed through Gene expression accession number of GSE72078. The quality control tests and the various analyses we performed are summarized in Table 5.
First, we performed the quality control of our data set. The average number of detected genes for all samples was highly similar: 11,798.4 ± 701.6 (detection p-value < 0.01; mean ± standard deviation; non-normalized data) and 14,661 ± 710.1 (detection p-value < 0.05; mean ± standard deviation; non-normalized data), and no wide discrepancies in  (14); Penta E (7, 11); D5S818 (11); D13S317 (9, 11); D7S820 (10, 11); D16S539 (11, 13); CSF1PO (12) Table summarizes the information for identity and HLA typing and ethnic background that was collected. This data does not distinguish between clones from the same individual, and tracing and tracking of clones represents an issue that will require a solution to complement the barcoding and physical tracking methods we recommend hybridization signal intensity distributions were observed (not shown). The quality of the iPSC was verified by submitting the samples for the Pluritest assay and the results are shown in Fig. 3. Pluritest verification of iPSC line PR1.0, TR1.1, TR1.2, GR1.1, GR1.2, ER2.1, ER2.2, ER2.3 and XCL1iPSC samples fall within the same pluripotency range (red lines) of ESC samples and iPSC samples used to develop Pluritest, as does our added positive controls H14 ESC. The CD34 ++ cord blood line and ER2.2 NSC fall into nonpluripotent range (blue lines), demonstrating that our iPSC lines correlate with known iPSC and ESC (Fig. 3a).
In order to visualize the overall strength of measured signal across samples and identify presence of potential outliers, we plotted the signal-to-noise ratios and high-end intensity variation (95th percentile of signal intensity, P95) (Fig. 3b) in nonnormalized data. Signal intensity was similar in all tested samples and no outliers were detected, suggesting matching quality across microarray samples. Based on technical and biological variation, all our iPSC and ESC samples falls within the novelty score (green), while CD34 ++ cord blood cells and the ER2.2 NSC fall outside the novelty score (red) (Fig. 3b). Next, we calculated the pairwise correlation coefficients (r 2 ) to aCGH-SNP analysis of the iPSCs GR1.1 and GR1.2 which are clones. Panel A identifies common changes detected and Panel B lists non-overlapping changes. The genes are color coded; Red-Known to be affected in disease. Black -Not reported as affected In database . Teal in a pathway related to diseaseNote the analysis was done at the same time on the same run. The clones were manufactured in two separate runs determine the overall relatedness of samples (data now shown). The correlation coefficients between the iPSC samples manufactured using the cGMP compliant process was greater than 0.9 [1], and no significant difference in gene expression profiles at a global level was observed among these iPSC lines and the ESC or iPSC lines previously generated using similar or different reprogramming methodologies [1,50]. We then performed unsupervised one-way hierarchical clustering analysis to group averaged samples according to the degree of gene expression similarity (Fig. 3c). The results displayed three distribution features: starting material (CD34+ cells used to generate iPSC), pluripotent stem cells (iPSC and ESC), and differentiated neuronal cells. Together, these findings indicated good overall quality of the microarray data. Combination of novelty score and pluripotency score (Fig. 3d) illustrates iPSC and ESC samples are grouped together, suggesting an empirical distribution of pluripotent cells (red background). CD34 ++ cord blood cells and ER2.2 NSC are located closer to the non-pluripotent (blue background) (Fig. 3d). These results also confirm that Pluritest can be used as another tool to verify pluripotency of newly derived iPSC. In addition to utilizing Pluritest, we reasoned we could also identify subsets of genes which have been identified as being PSC cell-specific and compare their levels of expression with that in other PSC that have been well characterized, such as XL1 [50,51]. In addition, markers of CD34+ cells as well as markers of trophoblast and early differentiating cells can be examined (see supplementary materials) to assess completed transformation and the absence of contaminating cells. We note that the Illumina chips include microRNA and our isolation process preserves small RNA species, allowing us to use the same array system to assess microRNA expression profiles of the cells. Comparison between the two lines and the other lines included in this analysis showed that the lines were similar in their expression profiles and did not show the presence of differentiated cell markers (data not shown). In particular, the expression of positional markers (such as HOX genes) was absent.
Another factor that may alter the behavior of PSC, which is unlikely to be assessed by routine tests, is the expression of imprinted genes. We therefore extracted the list of published imprinted genes and compared the expression of these genes in the entire sample set (Supplemental Table 3, available upon request). Overall the expression patterns were similar, though we noted a significant difference in the expression of NNAT in XCL1 iPSC and GR1.2. While the relevance of this observation is unknown, we believe following the levels of these genes may be important given their association with disease.
Evaluation of gene expression based on a panel of approximately 325 markers (Supplemental Table 3)including markers of pluripotency, gender, imprint, endoderm, mesoderm and ectoderm) revealed (1) no difference between the male or female lines, (2) no change in the expression of imprinted genes, and (3) highly expressed level of several pluripotency markers including Oct4, Nanog and Sox2 in all iPSC samples (1). Overall, gene expression profiling of the lines generated using cGMP compliant process was similar and similar to previously reported iPSC and ESC lines [52].
To further analyze gene expression data, we compared expression of genes associated with functional pathways. As an example we compared the levels of genes involved in Cell Cycle, as well as genes coded for Transcription factors (TF) and Growth Factor Receptors (GFR) between iPSC line ER2.2 and GR1.1. Our results (Supplemental Table 4) showed no significant differences in gene expression in these pathways between the 2 iPSC lines. (Supplemental Table 4).
We also reasoned that one could identify the sex of the individual using expression of Y chromosome genes, and assess if X chromosome inactivation had occurred in female samples by examining expression of XIST and the levels of expression of X chromosome genes and comparing them with cell lines that have demonstrated X chromosome inactivation or activation. While this is by no means a definitive test, it allows one to highlight where further testing is required. Further information on imprinting can be obtained by identifying the differences between sequences between the two alleles of the imprinted genes by WGS and determining whether expression was from the maternal or paternal allele. Although we did not perform such an analysis in this work, such analysis can be readily performed if considered necessary.
An additional analysis that could readily be performed was assessing if the deletions and duplications reported in the WGS and SNP/CGH array tests had an effect on gene expression by preparing a list of deletions and duplications in GR1.1 and GR1.2 and examining the expression of those genes that are known to be expressed at the PSC stage. No significant change in expression was detected (Sup Table 3).
We have previously shown that rather than comparing individual gene expression one can examine the overall Note that both gene subsets as well as examination of overall quality can be inferred using tests such as Pluritest, correlation co-efficients and unbiased hierarchical clustering. The data quality is improved by comparing to a database and with sufficient numbers of lines one can establish ranges for a comparability assay expression of a particular signaling pathway or a subset of genes related to a particular disease and examine if there are any obvious differences in expression [50,51]. For iPSC, we reasoned that expression of mitochondrial-related genes and genes related to cell death may be sensitive predictors of growth rate and proliferation as data on lines is collected. Furthermore, mitochondrial mutations account for a large group of hereditary disorders. We therefore examined the subset of mitochondrial genes to examine if any significant changes were observed when compared to other iPCS and ESC lines. These analyses revealed no significant change in the expression of mitochondrial-related genes (data not shown).
Overall our results suggest that microarray analysis is a relatively inexpensive method that allows one to rapidly evaluate pluripotency and presence of contaminating cells using a variety of methods. One can compare expression of specific subsets of genes that may be critical for the use of these cells for a specific purpose and importantly provide a referral database to compare the ongoing evolution of the cells as they are b. Novelty Score: This score is based on well-characterized pluripotent samples, color-coded green and non-pluripotent samples color-coded red. All iPSC and ES samples are in green demonstrates pluripotent samples, whereas CD34 + cord blood line and ER2.2 NSC demonstrates nonpluripotent color-code red. c. Hierarchical Clustering of vst-transformed samples: Samples were transformed using variance stabilizing transformation (VST) and robust spline normalization _ENREF_35 [40]. Distance on x -axis is based on Pearson correlations. d. Overview: This combines novelty score on X-axis and Pluripotency score on y-axis. The red background, where the iPSC and ESC samples are located, suggest the empirical distribution of pluripotent cells. CD34 + cord blood line and ER2.2 NSC are located closer to the non-pluripotent background colored blue maintained in culture. Collection of such a database may allow us to develop comparability criteria and acceptance and release assay cut-offs for cell processing.

Whole Genome Analysis
We prepared genomic DNA and used it to perform the WGS on two iPSC lines, one male and one female, to assess their overall status. The data was analyzed using the pipeline summarized in Fig. 4a. A list of tests that we have performed using these data sets is summarized in Table 6. Table 7 summarizes the QC parameters evaluated. Both the lines passed the standard QC parameters.
The average number variants identified in both cell lines were similar and showed uniform patterns in the number of SNP, INS, DEL, synonymous and non-synonymous variants. As the number of deletions was higher than insertions or translocations, these were analyzed further (Fig. 4c).
The maximum depth of coverage obtained across each chromosome for both cell lines are shown in Fig. 4b. Higher depth of coverage in mtDNA was observed in both cell lines, as is to be expected. Y chromosome in LiPSC-ER2.2 dipped as expected in the female line. The total of 36 variants of Y chromosome overlapped with the X chromosome from the LiPSC-GR1.1 (male line). This highlights the issues with assignment of sequence data.
Extracting Blood type from the whole genome sequencing data was predicted by using BOOGIE for four different blood group systems, namely ABO, MNS, Rh and RhAG (Fig. 5a). The two cell lines LiPSC-ER2.2 and LiPSC-GR1.1 were distinguished based on their ABO system, showing blood group A and O respectively with a high confidence score of 94 and 99. The blood group of LiPSC-GR1.1 matched with that of the donor data LiPSC-GR1.1 for ABO system (blood group O), whereas donor information for LiPSC-ER2.2 was unavailable for comparison.
Comparing predictions across platforms, we note that SNP Chip analysis would have predicted the phenotype as O. This is because the SNP Chip does not pick up the insertions while WGS does, indicating the power and sensitivity of the WGS. The same program or similar program can be used to extract information for the thirty minor blood group antigens that have been described [53].
As with extracting Blood groups, we reasoned that it should be possible to use the data to identify the HLA phenotypes of the cell lines, and we used HLAVBSeq program (Fig. 5b.). The predicted HLA types had a resolution up to 8 digits. Both iPSC cell lines could be distinguished based on their class I HLA types along with their average depth of coverage, whereas class II HLA types show similar genotype but different depths.
Likewise for LiPSC-GR1.1, DQB1*02:02:01 (449.2): DRB1*04/*08, DRB1*09:21 (18785.9): DQB1*03/*04. From these results it can be inferred that the WGS gives higher resolution of the HLA prediction, and this with the coverage values provide confidence in call. The WGS allows us to identify the cis or trans type of the HLA types, which is not feasible by the standard methods of assessment. The DQA1-DQB1 haplotype in both cell lines was found to be in transisoform.
Our interest was studying Parkinson's disease (PD), and we wanted to see if a focused analysis of the data could be performed. So we prepared a list of PD-related genes to assess their status as an example to how a report can be prepared for specific, disease-related gene sets.
Both iPSC lines showed 23 unique genes relating to PD, after filtering for damaging effects of the amino acid substitution, predicted in-silico by SIFT and Polyphen converted rank scores computed by dbNSFP. These filtered variants were verified with the expression data. LiPSC-ER2.2 genes SYNJ1 and E1F4G1 have been implicated in PD when compared to MIM disease database. Whereas line LiPSC-GR1.1 showed two variants for MAPT gene (rs63750417) and one variant with SYNJ1 gene (rs2254562), which showed its association with PD ( Table 8).
The structural variants identified in WGS were crossvalidated with the results obtained from other platforms such as microarray and expression data. LiPSC-ER2.2 showed no overlap of the data obtained from WGS in comparison to SNP data in both unfiltered and filtered data (only one clone of Fig. 4 Whole Genome analysis conducted on two iPSC lines generated using the cGMP compliant manufacturing process. a. WGS data characterisation pipeline. This figure outlines the work flow followed in this study. The filters applied at various stages have been mentioned in the methods section. The fastq files were aligned using Isaac aligner to generate bam file. The bam file generated was checked for its mapping quality using samstat. The fastq files were also used for prediction of HLA types using HLAVBseq. The variants called by Issac variant caller was annotated using SnpEff and this data was used for predicting blood groups using BOOGIE. Only non-synonymous variants were considered for its implication with PD and cross validated with expression data. Structural variation (deletions and duplications) were filtered (see methods) and annotated for genes using UCSC. This data was cross validated with expression and microarray data. The differentially expressed genes were verified for gene enrichment relevant to disease and pathway through DAVID. b. The x-axis shows the various chromosomes and Y-axis represents the max-depth computed by Issac Variant Caller across each chromosome on a log scale for both the cell lines under study. c. Bar graph representing number of variants identified for SNPs, small Insertions, small deletions, synonymous, nonsynonymous variants, CNVs and different types of structural variants including duplications, large insertions (length > 50), large deletions (length > −50), inversions and translocations. Deletions were higher in number than other types of SVs LiPSC-ER2.2 showed overlap of PPP2R2B gene across both these platforms). The LiPSC-GR1.1 showed overlap of CSRNP3 and PPP2R3B in the unfiltered data from WGS. But the genes were removed due to bad quality in the filtered set of the WGS. The CSRNP3 gene had a discrepancy that was considered as GAIN in microarray data, and WGS data shows it to be a deletion event. This data is summarized in Table 9 along with the log fold change in relation to the control lines H9_H7 and NCRM6.
To verify if the genes overlapping with structural variants identified from WGS-filtered data set showed any implication with known diseases or pathways, we tested for gene enrichment through DAVID. The line LiPSC-ER2.2 showed no signals with disease association or pathways for deletion event. For duplication event, HTR6, GRIK3, OPRD1 were found to be associated with chemdependency and was enriched for pathway Eicosanoid Metabolism but with only 2 genes (CYP2J2, PTGER3).
The LiPSC-GR1.1 line showed no disease-specific enrichment but was enriched for T Cytotoxic Cell Surface Molecules, T Helper Cell Surface Molecules pathways for deletion event and no enrichment was found with the duplication event (Table 10).
To check the status of the imprinted genes, the list of published imprinted genes were obtained from the database (http://www.geneimprint.com). Genes with the status of imprinted and predicted were considered for the analysis. In LiPSC-ER2.2 cell line, a total of 148 of 203 genes were identified with an adequate depth of coverage. Similarly for LiPSC-GR1.1 cell line, a total of 147 of 203 genes were identified.
We also examined the expression of these imprinted genes at the iPSC stage by microarray analysis. LiPSC-ER2.2 and LiPSC-GR1.1 exhibited detectable expression of 121 and 120 genes respectively (Table 11). The number of homozygous, heterozygous SNPs, INDELS for these genes and its documented inheritance are given in Table 11. Thus, the variants that show the differences in both the chromosomes for imprinted genes could be extracted. Given the parental information we would be able to phase the data to identify the paternal or maternal specific allele expression in future by combining this data with RNA sequencing data. Note that both gene subsets as well as examination of overall quality can be inferred using tests such as Pluritest, correlation co-efficients and unbiased hierarchical clustering. The data quality is improved by comparing to a database and with sufficient numbers of lines one can establish ranges for a comparability assay We have provided the examples of certain tests that could be conducted with the WGS data. We can further utilize the data generated by this platform to determine the mycoplasma contamination, presence of foreign DNA and other infectious agents as these would be present as unmapped reads. WGS analysis can be further extended to extract and annotate mitochondrial sequences, STR verification, detect the genotypes of other transplant antigens and identify plasmid/viral insertions.

Discussion
We have shown that we can use small amounts of material to analyze samples with high content tests that yield useful results. For cGMP manufacturing, we have divided these tests into those required by regulatory authorities as part of a release criteria and tests that we classified as for information only (FIO). We have, however, argued that while not required, some subset of these tests should be incorporated into a routine testing process, as this new cell type was being considered as stating material for cell therapy for a variety of products. Our rationale for the type of tests proposed here was based on practical and theoretical consideration. Cells are intrinsically variable and can change during the manufacturing process as well as after implantation as they respond to the environment. Designing predictive tests for release is difficult in these circumstances. In addition, current models for therapy suggest that primary cells will be used where each lot will come from different starting material, or Haplobanks will be considered where Fig. 5 Extracting blood type and HLA from the whole genome sequencing data. a. Table showing the blood group  predicted by BOOGIE for four  different blood group systems namely ABO, Rh, MN and Rh associated glycoprotein with the score for both the cell lines. The validity of this approach is provided as means of comparison with the actual available donor information. b. Table showing HLA types predicted in-silico by HLAVBseq using WGS data for both the cell lines along with their depth. The predicted HLA type obtained using the WGS data is compared with the results obtained using the HLA AssureTM SE SBT kit multiple lines will be used to provide the same end product. We, therefore, approached this issue of comparability or equivalency by asking what factors underlie biological variability, and whether there are some simple tests that can provide data in an unbiased way that if collected in a database over time would allow us to infer what critical parameters one will need to monitor.
We proposed a transcriptome analysis, a SNP-CHIP/CGH array, and whole genome sequencing as three basic tests to complement the standard tests for pluripotency, differentiation ability and composition that are routine. It is widely agreed that karyotyping provides important information, both as an initial quality control as well as an ongoing measure of the quality of the cells, and changes in karyotype are indicative of a change in the quality of the cells. We have, however, suggested that while useful it may not be sensitive enough to predict changes in the differentiation behavior of cells or other more subtle defects that may or may not be relevant. An alternative strategy that provides additional information while requiring small amounts of material is a SNP-CHIP/CGH analysis [54,55] where genomic DNA is hybridized to a reference genome chip set and changes across the entire genome are examined using markers ranging in number from 500 K to a million. The assay is rapid and has the advantage that additional SNP's can be added and the chip customized as new information becomes available. Our data showed that all lines that were karyotypically normal also carried a number of deletions or duplications that included genes that are known to be altered in specific diseases or are important in their development. Some of these were identical in lines made from the same individual, indicating that they were present prior to the iPSC generation process while others were induced during the process. Although the N is small we did notice that an amplification of GNAS was present in samples from two individuals, and we suggest that monitoring this region in other lines will be important to determine if this is an anomaly or of relevance. Thus, SNP-CHIP analysis can reliably be used to monitor the karyotypic stability over time, and if a database is established then the relevant importance or frequency of a deletion or duplication can be inferred over time, so appropriate regulations on use of a line carrying a higher risk can be established.
To verify if long term cell culturing, passaging has not altered the genetic integrity of the cell lines generated and to verify the contamination issues arising due to certain laboratory practices, we also suggested to use whole genome sequencing as the WGS, in addition to confirming SNP-CHIP analysis, can be used to provide data on the state of the starting material and serve as a reference to any changes that occur in this population over time. The wealth of data is of immediate utility in providing a highresolution map of key genes and variation between individuals that may have predictive value. In addition, as we have shown in the results, WGS provides significant additional value. We, for example, used standard software tools (BOOGIE and HLA VBSeq) to infer minor blood group antigens and high-resolution HLA typing. This information is difficult to assess without expensive tests and is not done routinely; the WGS if performed on iPSC cells could provide an initial signal as to the importance of testing recipients for particular phenotypes or minor blood group antigens. Cross-species contamination (such as bacterial, viral or mouse) and its relative abundance in the cell lines can be estimated from the unmapped regions of the WGS data [56,57]. Also, mitochondrial specific variants that are captured in the off-target regions are also mined from our data (unpublished data). This provides valuable information on the involvement of mitochondrial DNA in disease pathology and/or could be used as a marker to identify the individual affiliation with geography or ethnic group based on predicted mitochondrial haplogroup.
In addition to looking at any specific subset of genes that might play a role in a particular organ or cell type, one can also examine the status of imprinted genes and identify unique allelic markers for the maternal and In particular, one may be able to determine if the X chromosome is randomly inactivated or not as a complement to the transcriptome analysis.
The WGS data can also be queried to cross examine the SNP's present or absent in the actual sample to provide better clarity on the changes seen in CGH/SNP chip analysis, as lack of hybridization could be due to absence of the SNP or allelic variability. We have shown (unpublished data) that a virtual SNP-CHIP result can be generated from WGS and a subset from exome sequencing SNP's. This may be more relevant for ethnic subgroups and can be used as an independent verification of the SNP-CHIP result. The importance of such an analysis is highlighted by the lack of overlap seen between the WGS and SNP/CHIP hybridization results (see results). Others have made similar observations. Rogers and colleagues [58] have evaluated on the levels of discordance across sequencing, imputation and microarray platforms and reported that the most common type of discordance comes from missing genotypes on the sequence technology, which occurred most frequently when the microarray technology identified at least 1 reference allele at the variant site.
We show that array-based transcritpome analysis can be used to assess overall interchangeability of the starting population using a R2 correlation co-efficient or comparing the profile with currently available databases such as Pluritest [59,60]. The RNA seq data, in addition, provides a complement to the more specific tests that are performed as standard quality control and also allows us to go back and ask additional questions as to the quality of the cells that one may not ask routinely. For example, while we examine pluripotency using a couple of markers we could easily use the array-based results to identify another 150 potential pluripotency markers, or reexamine if trophectoderm is a contaminant, or some cells have already acquired a epiblast fate, or for distinguishing between naïve and primed states. Having the transcriptome analysis in a database may also allow one to assess batch-to-batch variability and define specs for any release criteria one may establish.
We chose to use transcriptome analysis, SNP-CHIP and exome sequencing rather than alternatives, because we felt that these tests provide sufficient information and represent the best cost-benefit compromise. These tests are widely available, can be done worldwide and the analysis pipeline is relatively standard. The tests provide several internal controls and the data can be readily cross-correlated. Furthermore, database repositories that are geared to receiving such data-sets have been established in multiple countries. We have also suggested that while the tests themselves are of utility as internal controls for any organization, we feel the tests could be more widely applicable for comparisons across sites and groups, provided uniformity in data collection and storage was in place and comparators or calibration material   was available. In principle, any line can be used as long as it is widely available, stable and well characterized. We have proposed that a set of ten lines that we generated at the NIH, which are widely disseminated and are well characterized and have been grown in different culture conditions and have been shown to differentiate into the major phenotypes of interest, could serve as such calibration material. The NIH has generated whole genome sequencing data and in association with RUDCR (Rutgers) SNP-CHP/CGH and RNA-SEQ data, and as such, these lines could serve such a purpose. It is important to emphasize that these tests are by no means an absolute standard, and certainly do not enable complete characterization of lines. For example, one can miss many epigenetic changes that may be important, or one may miss point mutations or inversions given the depth and resolution of the analysis we have chosen. We feel, however, these tests are a critical first step in developing a database of information that can offer a path to developing cheaper, more focused tests, as information on the lines evolves and provides a path to more rigorous testing, at a time when costs for more detailed testing decrease.
In summary, we have shown that for a relatively modest investment useful additional information can be obtained, and this information if shared widely and accumulated in a database along with data from calibration material will allow us to assess the safety of therapy and provide regulators with a For each cell line the mean depth of the alternate allele, the number of heterozygous and homozygous variants identified from WGS is provided strategy to assess Haplobanks and other models for delivering cell-based therapy.