Genomic and transcriptomic analysis of Korean colorectal cancer patients

Background Colorectal cancer (CRC) is the third most common type of diagnosed cancer in the world and has the second-highest mortality rate. Meanwhile, South Korea has the second-highest incidence rate for CRC in the world. Objective To assess the possible influence of ethnicity on the molecular profile of colorectal cancer, we compared genomic and transcriptomic features of South Korean CRCs with European CRCs. Methods We assembled a genomic and transcriptomic dataset of South Korean CRC patients (KOCRC; n = 126) from previous studies and European cases (EUCRC; n = 245) selected from The Cancer Genome Atlas (TCGA). Then, we compared the two datasets in terms of clinical data, driver genes, mutational signature, gene sets, consensus molecular subtype, and fusion genes. Results These two cohorts showed similar profiles in driver mutations but differences in the mutation frequencies of some driver genes (including APC, TP53, PABPC1, FAT4, MUC7, HSPG2, GNAS, DENND5B, and BRAF). Analysis of hallmark pathways using genomic data sets revealed further differences between these populations in the WNT, TP53, and NOTCH signaling pathways. In consensus molecular subtype (CMS) analyses of the study cases, no BRAF mutations were found in the CMS1 subtype of KOCRC, which contrasts with previous findings. Fusion gene analysis identified oncogenic fusion of PTPRK-RSPO3 in a subset of KOCRC patients without APC mutations. Conclusions This study presents insights into the genomic landscape of KOCRCs and reveals some similarities and differences with EUCRCs at the molecular level. Supplementary Information The online version contains supplementary material available at 10.1007/s13258-022-01275-4.


Introduction
Colorectal cancer (CRC) is the third most commonly diagnosed cancer in the world and has the second-highest mortality rate, accounting for about 1 out of 10 cancer mortalities worldwide. Moreover, the global burden of CRC is expected to increase by 60% to more than 2.2 million new cases and 1.1 million deaths by 2030 (Arnold et al. 2017). Notably, in this regard, South Korea has a CRC rate of 44.5 (age-standardized rate per 100,000), which was the secondhighest global rate in 2018 (Bray et al. 2018).
Over the past three decades, molecular genetic studies have provided important genomic insights into the pathogenesis of both sporadic and hereditary CRC (Fearon 2011). Alterations in oncogenes and tumor suppressor genes are closely related to CRC subsets, and a larger collection of pathway genes has also been defined for these tumors (Fearon 2011). Various targets have been subsequently explored concerning personalized treatments, and these targeted therapies are regarded as a novel approach to improving individual survival outcomes in CRC patients (Xie et al. 2020).
According to prior large-scale genomic investigations (Cancer Genome Atlas Network 2012; Lu et al. 2019;Nagahashi et al. 2016), well-known driver gene mutations including APC, TP53, SMAD4, PIK3CA, and KRAS, are significantly involved in the tumorigenesis of CRC. Furthermore, the cancer genome atlas (TCGA) has revealed the role of several new driver genes and potential target pathways in these cancers (Cancer Genome Atlas Network 2012). However, the genomic knowledge of CRC has mainly been acquired from European cohorts, and little information is available on the genomic landscape in Asian CRC populations, including Korean CRC cohorts (KOCRC). Multiple genomic studies have revealed new therapeutic approaches to CRC (Ellis and Perou 2013;Horibata et al. 2020;Nagahashi et al. 2016), uncovering the specific genomic and molecular profiles of KOCRC cohorts will likely assist with the tailoring of diagnostic and therapeutic modalities for Korean cases.
The present study aimed to identify specific molecular and genetic features of KOCRCs using an integrated approach that combined clinical data comparisons with a well-defined European CRC population (EUCRC).

Dataset establishment and public data processing
Genomic and transcriptomic data sets of KOCRC patients (n = 126) were obtained from three previous studies (Kim et al. 2016(Kim et al. , 2019(Kim et al. , 2014 by the Korea Research Institute of Bioscience and Biotechnology (KRIBB, Daejeon, Republic of Korea) and Asan Medical Center (Seoul, Republic of Korea). Whole exome sequencing (WES) of normal samples was carried out using normal tissues or blood samples (n = 42 and n = 84, respectively). All patients provided voluntary written formal consent to be included in the study. The study protocol strictly conformed to the Declaration of Helsinki and was approved by the Institutional Review Board of Asan Medical Center (registration numbers: 2009-0091, 2014-0150, 2018-0087). The data sets used in this study are available from GEO (GSE50760, GSE107422, GSE132024) and KoNA (PRJKA210050).
To examine possible ethnic differences in the molecular profiles of CRC between our Korean cases and a European cohort, we downloaded a CRC dataset from The Cancer Genome Atlas (TCGA), and exclusively selected Caucasian cases for our present analyses (EUCRC; n = 245) as the European ancestry cohort. The information for our EUCRC cases, including MAF, gene expression count, and clinical data, were acquired from the TCGA colon adenocarcinoma (TCGA-COAD) and TCGA rectum adenocarcinoma (TCGA-READ) project through the GDC Data Portal (Cancer Genome Atlas Network 2012). We used MAF files as an alternative to bam files for WES data and gene expression count files as an alternative to raw RNA sequencing (RNA-seq) fastq files. For further information about sample collection, histology method, library preparation, and bioinformatics analysis of both cohorts, please see Supplementary  Table 1.

Identification of somatic SNVs, indels, and gene fusion events
In the KOCRC cohort, exome sequencing reads were mapped to the human reference genome GRCh38 (primary assembly) using bwa-mem (version 0.7.17-r1188) with default parameters, followed by sorting of the bam files with samtools (version 1.10). As the TCGA databases had been preprocessed using GATK (McKenna et al. 2010), our databases were processed following GATK best practices (GATK version 4.1.4.0). PCR duplicates were removed via Picard MarkDuplicates (version 2.21.2), and base recalibration was conducted using GATK BaseRecalibrator & ApplyRecalibration. Candidate variants were called via GATK Mutect2 and filtered using GATK FilterMutectCalls. ANNOVAR (Wang et al. 2010) was used for the annotation steps.
Fusion genes and positions were predicted using STAR-Fusion (version 1.9.1). We used trimmed KOCRC RNA-seq fastq files as the input. We filtered and determined fusion genes identified in 4-time repeats in a sample. Fusion genes, including non-coding RNA or immunoglobulin-related genes, were excluded from the final selection. The reported and non-reported fusion genes were distinguished using previous reports. MutSigCV (Lawrence et al. 2013) (version 1.3.5) software was used to detect driver genes in our CRC subjects. Briefly, the KOCRC cases were lifted from GRCh38 to GRCh37 via the CrossMap (version 0.3.8) for MutSigCV processing. The maftools (Mayakonda et al. 2018) R package (version 2.6.0) was consecutively used to prepare MAF files for the MutSigCV analysis, which was finally completed on the GenePattern (Reich et al. 2006) online platform using default settings.

Driver gene and mutational signature identification
The nonnegative matrix factorization (NMF) R package (version 0.23.0) and maftools R package (version 2.6.0) were used to identify de novo mutation signatures. The number of signatures was estimated based on a cophenetic correlation matrix. Mutational signatures were then extracted from the trinucleotide context and decomposed into the designated number of signatures.

Gene set enrichment analysis (GSEA) and consensus molecular subtyping
Transcriptomic data from the KOCRC and EUCRC cases were used to conduct GSEA. Trimmed RNA-seq fastq files were mapped to GRCh38 (primary assembly) on STAR (Dobin et al. 2013) (version 2.7.3a), concurrently estimating the expression counts. The edgeR (Robinson et al. 2010) R package (version 3.32.0) was used to obtain log2 fold-changes in gene expression between normal and tumor tissues. The fgsea (Korotkevich et al. 2021) R package was used to perform GSEA with the 50 hallmark gene set (v7.2) from MSigDB (Liberzon et al. 2015). Significantly enriched gene sets were filtered and acquired based on a cutoff level at q < 0.01. Enriched known oncogenic pathways were examined on a maftools R package. Oncogenic signaling pathways were derived from TCGA cohorts. The values of "fraction mutated samples" were used to compare the influence in oncogenic pathways between the KOCRC and EUCRC cohorts.
To identify consensus molecular subtypes (CMS) of CRC samples, we used the CMSclassifier R package (Guinney et al. 2015). Transcriptomic data was initially normalized to counts per million bases (CPM). Log transformations were subsequently conducted by adding one pseudo-count transformed into a log 2 scale. A random forest classifier method was used to arrange the KOCRC and TCGA samples into four CMS classes. The ambiguous subtypes were designated as 'unspecified'.

Statistics
A Wilcoxon signed-rank test was used to determine differences between two dependent samples with unknown distribution, while continuous variables were compared using paired Student's t-tests. The chi-square test was used to compare clinical datasets on oncogenic pathways, whereas mutational frequencies between KOCRC and EUCRC gene sets were compared with a Fisher's exact test. All statistical analyses were performed using the limma (Ritchie et al. 2015) R package (ver. 3.48.0), with a two-sided p < 0.05 defined as statistically significant.

General clinical features of the KOCRC and EUCRC cohorts
This study was designed to enable genomic comparisons of CRC patients of Korean and European descent, i.e., KOCRC and EUCRC cohorts (Fig. 1a). The clinical features of these cases were also compared, including cancer stage, primary tumor site, and patient demographics (Fig. 1b). The gender ratios were similar between the cohorts (p = 0.1), but differences were evident in the cancer stage, primary site, and age (p = 0.004, 0.001, and 3.86 × 10 8 , respectively). Age differences were particularly noticeable, with the KOCRC cohort having a median age of about 58, which was ten years younger than of the EUCRC patients.
We estimated the tumor mutation burden (TMB) of the two cohorts ( Fig. 1c) and found a median TMB per megabase (TMB/MB) of 2.65 and 2.76, respectively, for the KOCRC and EUCRC populations. It appeared from our analyses that the higher proportion of rectum adenocarcinoma (READ) in the KOCRC cohort may have affected the median TMB/MB (the READ proportions for the KOCRC and EUCRC groups were about 47.6% and 29.8%, respectively) but this was not statistically significant (p = 0.13).

Mutation analysis centered on driver genes
Using the driver detecting software, MutSigCV, we found six previously well-known CRC driver genes (APC, TP53, KRAS, FBXW7, SMAD4, and AMER1) common between the two cohorts. In contrast, three putative novel CRC driver genes (MUC7, PABPC1, and B2M) were identified in the KOCRC cohort at a false discovery rate (FDR) of 0.05. Additionally, we adopted well-known CRC driver genes from Integrative Onco Genomics (Martinez-Jimenez et al. 2020) (intOgen) and other previous studies for these comparative analyses (Hanna et al. 2013;Lu et al. 2019). A gene set of 25 driver genes was used in further analyses (Fig. 2a).

Mutational signature analysis
We used the NMF algorithm to identify mutational signatures in the KOCRC and EUCRC patients and calculated cosine similarities against single base substitution (SBS) COSMIC (Tate et al. 2019) signatures to identify the best matches ( Fig. 3a, b). We thereby identified 'defective DNA mismatch repair (dMMR)' (COSMIC Signature 6), 'POLE' (COSMIC Signature 10), 'unknown' (COSMIC Signature 5), and 'sequencing artifact' (COSMIC Signature 45) in the KOCRC cohort, and 'aging' (COSMIC Signature 1), 'dMMR', and 'POLE' signatures in the EUCRC populations. Both cohorts have 'dMMR' and 'POLE'signatures, which have also been verified in many other cancer types. The 'unknown' signature, COSMIC Signature 5, also arises in all cancer types but remains to be verified.
We used ten canonical oncogenic signaling pathways derived from TCGA cohorts (Sanchez-Vega et al. 2018) (Fig. 4c) to perform pathway analysis. Pathway analyses were performed using genomic data. In most pathways, the frequencies of affected samples were similar in both cohorts. However, in the β-catenin/WNT and p53 signaling pathways, significantly more fractions of samples were affected in the

CMS classification
A prior study established four CMSs for CRC and developed a tool named 'CMSclassifier' (Guinney et al. 2015). To investigate how well our data fitted with existing findings, we utilized 'CMSclassifier' to analyze our transcriptomic data from both the KOCRC and EUCRC cohorts.

Discussion
By comparing large cohorts and establishing the genomic landscape of KOCRCs, the commonalities and differences between CRC patients of Korean and European ancestry could be identified and discussed. In the comparative analyses of the clinical data for these populations, it was notable that the KOCRC and EUCRC cohorts showed significant age differences, with a median age of about 58 and 68, respectively. The lower median age of the KOCRC patients is likely to be related to the higher prevalence of this cancer in Korea and the national health checkups for all Korean citizens over the age of 50. These checkups include a CRC screen using a stool occult blood test and a colonoscopy, which can improve the early diagnosis of CRC.
The KOCRC and EUCRC cohorts in our present study showed differences in the mutation frequencies in several driver genes. Of note, the lower mutation frequency of the BRAF gene in our Korean subjects is consistent with another study of CRCs from distinct ethnic groups that also found variations in the BRAF mutation frequency (Hanna et al. 2013). In addition, the higher mutation frequencies observed in the GNAS and DENND5B genes in our KOCRC cases is supported by another study that identified 13 loci that were significantly associated with the risk for CRC in Asians. Two of these 13 loci were located inside or near the proteincoding regions of GNAS and DENND5B (Lu et al. 2019).
We additionally identified three new putative driver genes (MUC7, PABPC1, B2M) in our KOCRC population. MUC7 has often been associated with other cancer types, particularly bladder cancer, and its expression levels have been assayed in many tumor types (Retz et al. 1998). However, the significance of MUC7 mutations in CRC remains uncertain. PABPC1 (poly A binding protein cytoplasmic1) is known to play a role in the post-transcriptional control of mRNA and may be involved in tumorigenesis (Takashima et al. 2006). In addition, several studies have revealed that this gene has important roles in tumor progression and carcinogenesis in both esophageal and gastric cancer (Takashima et al. 2006;Zhu et al. 2015). B2M mutations are often reported in highlevel microsatellite instability (MSI-H) CRCs (Tikidzhieva et al. 2012). Robust evidence is available that correlates B2M variations and immune escape in CRC (Grasso et al. 2018;Ozcan et al. 2018), and this gene also acts as a driver in diffuse large B cell lymphoma (DLBC) (Fan et al. 2020).
The most frequently mutated genes in our EUCRC cohort were APC, TP53, FAT4, and BRAF. These four genes are involved in major carcinogenesis pathways, including the Wnt, Hippo, and MAPK signaling pathways. Of the genes most frequently mutated in the KOCRC cohort, the activating mutation in GNAS has been reported previously in APC deficient mice to promote intestinal tumorigenesis by activating the Wnt and ERK1/2 MAPK pathways (Wilson et al. 2010). In another prior study, the GNAS mutation functioned as an alternative activator of the Wnt/beta-catenin signaling pathway in gastric adenocarcinoma (Nomura et al. 2014). These results suggest that the Wnt/beta-catenin pathway is activated in Korean CRC patients by a GNAS-mediated alternative pathway and a canonical APC pathway. We speculate that this alternative mechanism of Wnt pathway activation by GNAS may partially explain the lower mutational frequency of the APC gene in the KOCRC compared to the EUCRC cohort in our current study. However, we predict that the PTPRK-RSPO3 fusion gene likely plays a role in an alternative mechanism of Wnt pathway activation. The Wntdependent endogenous Rspo2 and Rspo3 chromosomal rearrangements can initiate and maintain colorectal carcinogenesis (Han et al. 2017). Another previous study has suggested a role for the PTPRK-RSPO3 fusion gene in activating Wnt/ beta-catenin signaling because it showed a mutually exclusive pattern with APC or beta-catenin mutations (Hao et al. 2016), which is in line with our present data indicating its mutual exclusiveness with APC mutations. Taken together, the cumulative evidence now suggests that two alternative pathways, including GNAS-mediated and PTPRK-RSPO3 fusion-mediated mechanisms, may play an important role in the activation of Wnt/beta-catenin signaling in place of APC mutations in Korean CRC lesions. Additionally, DENND5B, a guanine nucleotide exchange factor that activates RAB39A and RAB39B, was previously identified as one of 13 loci significantly associated with risk for CRC in Asians (Lu et al. 2019). Further studies are needed to determine the roles of DENND5B in colorectal carcinogenesis. Our current mutational signature analysis results suggested that KOCRCs and EUCRCs are very similar except for the unknown signature (COSMIC Signature 5), indicating that the major mutational signatures are conserved among these two cohorts. The aging signature (COSMIC Signature 1) was evident in EUCRC cases which were not surprising since the median age of the EUCRC cohort was older than that of the KOCRC cohort. POLE has a crucial role in chromosomal DNA replication due to its proofreading capacity and is known to be mutually exclusive with dMMR. Somatic mutations in the proofreading domains of POLE have been identified in relation to microsatellite instability (MSI), which has been found to occur in CRC due to a dMMR system with key MMR genes inactivated by various mechanisms (Domingo et al. 2016;Kim et al. 2013). Moreover, mutations in polymerase proofreading-associated syndrome involving POLE and POLD1 constitute 0.3-0.7% of familial cancer cases when only CRC and polyposis are considered (Mur et al. 2020).
In our GSEA and pathway analysis for mutated genes, we identified significant differences in some hallmark gene sets and pathways between the KOCRC and EUCRC patients. These results indicate that Korean CRC cases may require different therapeutic approaches than the current conventional methods. Among the gene sets enriched in KOCRC were upregulated immune-related gene sets such as 'interferon gamma response', 'inflammatory response', and 'IL2 STAT5 signaling', indicating the possibility that immunotherapy-based approaches could be effective in these cases.
In the CMS analysis we conducted in our present series, we assessed the previously established four CRC subtypes (CMS1-4) (Guinney et al. 2015). CMS1 is the MSI immunogenic type, CMS2 is the canonical type, CMS3 is a metabolic type and CMS4 is a mesenchymal type. CMS1 was enriched for MSI tumors and BRAF-mutations. CMS2 tumors had epithelial characteristics with marked WNT and MYC signaling augmentation and a high CIN. CMS3 cancers also had epithelial features but a lower CIN, were enriched for KRAS mutations and presented with evident metabolic dysregulation. The CMS4 grouping was the mesenchymal subtype with prominent TGF-β activation, stromal invasion, angiogenesis, and an inflammatory, immunosuppressive phenotype. CMS analyses of CRCs is a new modality that includes knowledge of molecular factors, tumor stroma, and signaling pathways to facilitate personalized, patient-orientated systemic treatments, i.e., precision medicine (Ten Hoorn et al. 2021). In our present study, the proportions of each subtype in the two cohorts did not show differences, implying that they are conserved among different ethnic groups. Additionally, the conserved proportions of each subtype indicated no fundamental differences in the molecular carcinogenesis processes between the two cohorts.
Since gene fusions are closely associated with specific tumor phenotypes, they represent ideal targets for anticancer treatments and risk stratification. A previous study reported that the fusion of NAV2 and TCF7L1 is a new marker for aggressive CRC and has an important role in MYC-directed transcriptional activation and repression (Cancer Genome Atlas Network 2012). We identified several new fusion genes that may become oncogenic candidates for CRC in this present study. A previous report identified a PTPRK-RSPO3 fusion gene in CRC and demonstrated that targeting RSPO3 (c) Comparison of the mutation frequency of genes in 10 hallmark pathways across the KOCRC and EUCRC patient subjects. Asterisks indicate significant differences based on a chi-square test. The p-values for the WNT, NOTCH, and TP53 pathways were 1.64e-09, 4.88e-06, and 0.011, respectively ◂ in PTPRK-RSPO3 fusion-positive human tumor xenografts inhibited tumor growth and promoted differentiation (Storm et al. 2016). Although the precise functions of the fusion genes found in CRC remain to be defined, our current data in two ethnically different cohorts suggest that gene fusion events may contribute to tumorigenesis in this cancer type.
Overall, we suggest from our present analyses that further studies involving larger populations of Korean CRC cases are needed to validate our current data. In addition, as the data from the KOCRC and EUCRC cohorts in our series were processed using partially different computational procedures, caution should be exercised in interpreting our results which may have been affected by this. However, the effect would be trivial, as we followed most of the computational procedures as GDC Data Portal stated (Supplementary Table 1). Notwithstanding these limitations, our present study suggests that distinct molecular and genomic differences exist between Korean and European CRCs, and our analyses provide an important

Data availability
The data sets used in this study are available from GEO (GSE50760, GSE107422, GSE132024) and KoNA (PRJKA210050).

Declarations
Conflict of interest SAJ, YJH, JHK, JHK, SKK, YSK,,SYK, and JCK declare that they have no conflict of interest.   Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.