This chapter was inspired by my participation in the tenth anniversary alumni meeting of the Master Online Advanced Oncology Study Program, which was held online in October 2020 during the unprecedented COVID-19 pandemic. To a large extent, the text is a personal retrospective of my research endeavors in cancer biology during the last decade. Most of the ideas that I addressed, as well as the approaches I used, were inspired by my participation in the program. I hope that these personal revelations will help young professionals who have a genuine interest in cancer research but limited resources to design and perform their own research projects.
- Data mining
- Low- and middle-income country
- Prognostic value
I graduated as a medical doctor from the Medical University in my home country (Bulgaria) in 2003. At that time, Bulgaria was on its way to joining two important international unions (NATO and the EU), which were expected to define its long-term political orientation and the rapid establishment of democratic values in all domains of Bulgarian society. During my studies, I developed a genuine interest in science and was eager for an opportunity to pursue an academic career with some research focus. I was quite unsure how to find a pathway into experimental medicine, as medical education in Bulgaria is primarily focused on clinical practice. However, shortly afterward, I won a scholarship from the Japanese government (MEXT scholarship) and was hosted by the leading immunologist Tasuku Honjo. I spent 18 months in Kyoto, where I realized the power of molecular medicine and became fascinated by the central dogma of molecular biology (Crick 1970), that is, that life itself is a flow of information between macromolecules, and that it is our ability to understand and decode this information that leads to a comprehensive understanding of living matter. Upon leaving Kyoto, I was firmly determined to follow Prof. Honjo’s advice “always to follow my research questions and to try to stay close to science” (Fig. 1). I spent the next several years in Bulgaria and the USA developing skills and expertise in immunology and hematology. In 2012, I started the Master Online Advanced Oncology Study Program at Ulm University, which gave me a broader conceptual view of cancer and prompted me to develop further skills in biostatistics and big data analysis. In that respect, I found the lessons by Prof. Dietrich Rothenbacher and my MSc thesis advisor Prof. Lars Bullinger truly transformative.
At that time, I was already based in Bulgaria and was struggling to perform high-quality research. Indeed, I faced two main problems. First, Bulgaria was and still is lagging behind most EU countries in R&D spending, both in absolute numbers and as a percentage of GDP (http://uis.unesco.org/apps/visualisations/research-and-development-spending/). Second, the only governmental organization responsible for research funding was suffering from a series of malpractices (https://www.nature.com/news/2011/110406/full/472019a.html). Because of those hurdles, I realized that one reasonable way to perform good-quality research was to make use of existing data sources. To pursue that goal, I invested additional effort in developing skills in big data analysis. That was achievable thanks to the accessibility of a series of MOOCs (massive open online courses) in biostatistics and big data analysis on the Coursera (www.coursera.org) and edX (www.edx.org) platforms. In addition, I developed my clinical skills in hematology, becoming a certified clinical hematologist, and deepened my grasp of theoretical concepts in cancer biology through participation in another blended learning course (https://postgraduateeducation.hms.harvard.edu/certificate-programs/research-programs/high-impact-cancer-research). The latter helped me shape my understanding of cancer research in the context of cancer hallmarks (Hanahan and Weinberg 2011) and their application in malignant hematology and cancer immunology (Alkhazraji et al. 2019). With this set of skills and conceptual background, I was able to address various topics in several directions, as outlined below.
Prognostic Values of Different Mutations in Acute Myeloid Leukemia (AML)
While I was participating in the Master Online Advanced Oncology Study Program, there was a debate on the prognostic value of several newly identified mutations in AML patients. Several cooperative groups had started to publish their retrospective analyses. Inspired by the Biometrics module of the program, I decided to address the prognostic role of DNMT3A mutations in adult AML patients. I taught myself to use the most popular software for that purpose (RevMan) and included 9 studies with over 4500 AML cases and almost 1000 DNMT3A-mutated cases (Shivarov et al. 2013). Our meta-analysis showed that DNMT3A mutations conferred a significantly worse prognosis, with both shorter overall and relapse-free survival (Shivarov et al. 2013). This prognostic value was independent of the cytogenetic features of AML. Our findings were later confirmed by other groups’ meta-analyses (Tie et al. 2014; Ahn et al. 2016).
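The pooling step behind a meta-analysis of this kind can be sketched as follows. This is a minimal illustration of generic inverse-variance pooling of log hazard ratios with a DerSimonian-Laird random-effects adjustment, one of the approaches available in RevMan. It is not the code used in the cited study: the function name `pool_hazard_ratios` and the three example studies are hypothetical and serve only to make the example runnable.

```python
import math


def pool_hazard_ratios(studies, z=1.96):
    """Pool hazard ratios given as (hr, ci_low, ci_high) with 95% CIs.

    Returns the random-effects (DerSimonian-Laird) pooled HR, its 95% CI,
    the between-study variance tau2, and the I^2 heterogeneity statistic.
    """
    log_hrs, variances = [], []
    for hr, ci_low, ci_high in studies:
        log_hrs.append(math.log(hr))
        # Standard error recovered from the width of the CI on the log scale.
        se = (math.log(ci_high) - math.log(ci_low)) / (2 * z)
        variances.append(se ** 2)

    # Fixed-effect (inverse-variance) estimate, needed for Cochran's Q.
    weights = [1.0 / v for v in variances]
    sum_w = sum(weights)
    fixed = sum(w * y for w, y in zip(weights, log_hrs)) / sum_w
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, log_hrs))
    df = len(studies) - 1

    # DerSimonian-Laird estimate of the between-study variance.
    c = sum_w - sum(w ** 2 for w in weights) / sum_w
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate.
    re_weights = [1.0 / (v + tau2) for v in variances]
    sum_re = sum(re_weights)
    pooled = sum(w * y for w, y in zip(re_weights, log_hrs)) / sum_re
    se_pooled = math.sqrt(1.0 / sum_re)
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    return {
        "hr": math.exp(pooled),
        "ci_low": math.exp(pooled - z * se_pooled),
        "ci_high": math.exp(pooled + z * se_pooled),
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical study-level estimates, NOT taken from any cited paper.
example_studies = [(1.4, 1.1, 1.8), (1.9, 1.2, 3.0), (1.3, 0.9, 1.9)]
result = pool_hazard_ratios(example_studies)
print(f"Pooled HR {result['hr']:.2f} "
      f"({result['ci_low']:.2f} to {result['ci_high']:.2f}), "
      f"I2 = {result['i2']:.0f}%")
```

A pooled hazard ratio above 1 whose confidence interval excludes 1 would correspond to a significantly worse prognosis in the mutated group.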
After that initial experience, I focused on the prognostic value of ASXL1 mutations. Analogous to our experience with DNMT3A mutations, we performed a meta-analysis of 6 studies with over 3300 patients (including 307 ASXL1-mutated cases) (Shivarov et al. 2015). Notably, we demonstrated that ASXL1 mutations conferred a worse prognosis in both younger and older patients (Shivarov et al. 2015). We extended the study by performing a meta-analysis of gene expression data in order to obtain a robust gene expression profile associated with the presence of ASXL1 mutations. Gene ontology analysis showed that ASXL1 mutations were associated with catabolic processes and phosphatase activity (Shivarov et al. 2015).
Outcomes Research Using Data from Cancer Registries
Population-based cancer registries can be invaluable sources of raw data for epidemiological and outcomes research. The largest registry providing free data access is the Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov). Although the SEER database does not provide in-depth data on histological features, imaging, and therapy, it does provide data on date of diagnosis, demographic and social characteristics, secondary cancers, cause of death, and survival, allowing a number of possible analyses. At that time, free access to the latest release of the SEER database was granted upon individual application, without the need to specify the project the data would be used for. These days, a straightforward process provides access to this valuable source. The granted access comes along with the SEER*Stat software, which allows for a number of predefined statistical analyses. Another approach is to produce a case listing by filtering on various criteria and then to perform descriptive or inferential statistics with commercial or free software tools. In the last several years, we used SEER data to perform outcomes research in a number of rare hematological malignancies. We analyzed the clinical outcomes of nodular lymphocyte-predominant Hodgkin lymphoma (NLPHL) after the year 2000 (Shivarov and Ivanova 2018). We listed 1401 such cases and were able to demonstrate that there was no difference in survival between males and females, which was suggested by some previous reports (Shivarov and Ivanova 2018). Furthermore, the introduction of rituximab in the management of this disease in the mid-2000s appeared not to affect clinical prognosis (Shivarov and Ivanova 2018). In line with this study, we also used cases listed from the SEER database to identify a significant number of cases with either composite or sequential B-cell lymphomas with features intermediate between DLBCL/PMBCL and classical Hodgkin lymphoma (Shivarov and Ivanova 2020).
We demonstrated differences in overall survival between composite and sequential lymphomas (Shivarov and Ivanova 2020). We also showed that the order in which the B-cell lymphoma types occur in sequential cases has prognostic impact (Shivarov and Ivanova 2020). Collectively, the data from this study demonstrated how registry data can be used to define new entities and to address, in an unbiased fashion, the role of some factors in clinical outcomes. Another small study that we performed addressed the incidence of secondary malignancies in classical HL after the year 2000 (presented at the EHA Annual Congress in 2018). We also assessed the risk of second solid cancers in another rare hematologic malignancy, systemic mastocytosis (SM) (Shivarov et al. 2018). SM with an associated hematologic neoplasm is a well-defined SM subgroup, but at the time of our analysis, it had not been reported whether SM was associated with an increased risk for solid cancers as well. Notably, based on the SM cases in the SEER data, we could not identify a consistently increased risk for second solid cancers (Shivarov et al. 2018).
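Second-cancer risk in registry data is commonly summarized as a standardized incidence ratio (SIR), the ratio of observed to expected cases. The sketch below assumes that the observed and expected counts have already been obtained (for example, from a registry tool) and uses Byar's approximation for the confidence interval. The function name `standardized_incidence_ratio` and the counts are hypothetical and are not taken from the cited analysis.

```python
import math


def standardized_incidence_ratio(observed, expected, z=1.96):
    """Return (SIR, lower, upper) using Byar's approximation to the Poisson CI.

    observed: number of second cancers seen in the cohort
    expected: number expected from reference-population rates
    """
    if expected <= 0:
        raise ValueError("expected must be positive")
    if observed < 0:
        raise ValueError("observed must be non-negative")

    sir = observed / expected

    # Lower limit: zero by definition when no cases were observed.
    if observed == 0:
        lower = 0.0
    else:
        term = 1 - 1 / (9 * observed) - z / (3 * math.sqrt(observed))
        lower = max(0.0, observed * term ** 3 / expected)

    # Upper limit uses observed + 1, as in Byar's formula.
    o1 = observed + 1
    term = 1 - 1 / (9 * o1) + z / (3 * math.sqrt(o1))
    upper = o1 * term ** 3 / expected

    return sir, lower, upper


# Hypothetical counts for illustration only.
sir, low, high = standardized_incidence_ratio(observed=12, expected=10.5)
print(f"SIR {sir:.2f} (95% CI {low:.2f} to {high:.2f})")
```

An interval that includes 1 would be read as no clear evidence of an increased risk relative to the reference population.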
Using Publicly Available Omics Datasets to Build or Support Original Hypotheses
I have had a genuine interest in the molecular biology of Philadelphia chromosome-negative myeloproliferative neoplasms (MPNs) since 2005, when the JAK2 V617F mutation was identified as the most frequent mutation in all three classical MPNs: essential thrombocythemia (ET), polycythemia vera (PV), and primary myelofibrosis (PMF) (Kralovics et al. 2005; Levine et al. 2005; Vainchenker and Constantinescu 2005; Baxter et al. 2005). Indeed, over the years, we developed several multiplex methods for the detection of mutations in myeloid malignancies, including JAK2 V617F (Shivarov et al. 2011). In December 2013, I had the opportunity to listen to the original reports of the identification of the second most frequent group of MPN-associated mutations, namely those in exon 9 of the calreticulin gene (CALR) (Klampfl et al. 2013; Nangalia et al. 2013). I was struck by the novel C-terminus that all those frameshift mutations invariably produced. Two ideas were particularly appealing to me: (i) that the structural characteristics of the novel C-terminus may play a role in its mechanism of neoplastic transformation; and (ii) that this unique C-terminus is probably a new source of variable neoepitopes. To address the former idea, we performed bioinformatic analyses of the sequence of the novel C-terminus and proposed that it was structurally disordered and could not bind Ca2+, which would have direct mechanistic and therapeutic implications (Shivarov et al. 2014). The second idea eventually developed into a project assessing the role of immunoediting in MPNs. We initially asked whether HLA alleles that protect against the development of JAK2 V617F+ MPNs exist. We genotyped a large number of MPN patients and healthy controls and demonstrated that two HLA class I alleles were significantly less frequent in JAK2 V617F+ patients than in healthy individuals.
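A comparison of allele carrier frequencies between patients and controls reduces to a 2x2 table. The sketch below computes an odds ratio and a two-sided Fisher exact p-value in plain Python. The text does not state which test was applied, so the choice of Fisher's test, the function names, and the counts are assumptions made for illustration only.

```python
from math import comb


def odds_ratio(a, b, c, d):
    """Odds ratio for the table [[a, b], [c, d]].

    a: patients carrying the allele,  b: patients not carrying it
    c: controls carrying the allele,  d: controls not carrying it
    """
    if b * c == 0:
        return float("inf")
    return (a * d) / (b * c)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x carriers among the patients.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum all tables that are at most as probable as the observed one.
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_obs * (1 + 1e-9)
    )
    return min(1.0, p_value)


# Hypothetical carrier counts, NOT the data from the cited study.
patients_pos, patients_neg = 8, 132
controls_pos, controls_neg = 45, 255
print("OR:", round(odds_ratio(patients_pos, patients_neg,
                              controls_pos, controls_neg), 2))
print("p:", fisher_exact_two_sided(patients_pos, patients_neg,
                                   controls_pos, controls_neg))
```

An odds ratio below 1 with a small p-value would indicate that the allele is less frequent among patients, which is the pattern one would expect for a protective allele. When many alleles are tested, the p-values would also need a correction for multiple comparisons.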
We then used a bioinformatic approach (the NetMHCpan server) to show that the HLA-B*35:01 allele can bind a specific 9-mer peptide derived from the mutant JAK2 protein (Ivanova et al. 2020). Using other bioinformatic tools, we were also able to show that the same peptide was likely to be successfully processed and presented in the HLA class I antigen processing and presentation pathway. Therefore, we proposed that an adaptive immune response through cognate T cells can edit early carcinogenesis in JAK2 V617F+ MPN stem cells (MPN-SCs) in the presence of alleles such as HLA-B*35:01, which are able to present JAK2 V617F-derived peptides efficiently (Fig. 2) (Ivanova et al. 2020). The major criticism of that hypothesis would be that the protection is not absolute and that JAK2 V617F+ MPNs can still develop even in the presence of potentially protective HLA class I alleles. We therefore proposed that MPN-SCs can escape immunoediting through downregulation of important genes in the HLA class I antigen processing and presentation pathway (Fig. 2). We did not have our own expression data, so we sought to provide such evidence using publicly available datasets. We were able to show that bone marrow (BM)-derived CD34+ cells from PV and ET downregulated key components of the antigen processing and presentation machinery, including HLA-A and HLA-B (Ivanova et al. 2020). The same phenomenon was not observed in PMF and PV peripheral blood (PB) CD34+ cells. We further analyzed the effect of various treatments on HLA gene expression. Notably, short-term ruxolitinib treatment of the JAK2 V617F+ CD34+ SET-2 cell line was associated with an upregulation of some HLA genes such as HLA-A, HLA-E, and HLA-F (Ivanova et al. 2020). On the other hand, SET-2 cells that persisted under long-term ruxolitinib treatment showed a downregulation of HLA-A, HLA-B, HLA-C, HLA-E, and HLA-G.
As with ruxolitinib, short-term treatment of JAK2 V617F+ mouse cells with IFN-α led to upregulation of most genes of the MHC class I pathway (Ivanova et al. 2020). However, long-term treatment with IFN-α in vivo was not associated with changes in the expression of MHC class I pathway genes in mouse long-term HSCs (Ivanova et al. 2020).
Collectively, our own genetic data, bioinformatic predictions, and analysis of a number of publicly available gene expression datasets allowed us to propose a model of immunoediting in the early pathogenesis of JAK2 V617F+ MPNs (Fig. 2) (Ivanova et al. 2020).
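The peptide-binding predictions mentioned above start from a simple step: listing every 9-mer that spans the mutated residue, so that each candidate can be submitted to a predictor such as NetMHCpan. A minimal sketch of that enumeration follows. The function names are hypothetical, and the sequence and mutation position are placeholders, not the real JAK2 sequence.

```python
def apply_substitution(sequence, position, ref, alt):
    """Return the sequence with a single amino acid substitution.

    position is 1-based, as in standard mutation notation (e.g. V617F).
    """
    idx = position - 1
    if sequence[idx] != ref:
        raise ValueError(
            f"expected {ref} at position {position}, found {sequence[idx]}"
        )
    return sequence[:idx] + alt + sequence[idx + 1:]


def peptides_covering(sequence, position, k=9):
    """List (start, peptide) for all k-mers containing the given position.

    start and position are both 1-based.
    """
    idx = position - 1
    first = max(0, idx - k + 1)          # leftmost window that still covers idx
    last = min(idx, len(sequence) - k)   # rightmost window that fits
    return [
        (start + 1, sequence[start:start + k])
        for start in range(first, last + 1)
    ]


# Placeholder sequence and mutation for illustration only (not JAK2).
toy_wild_type = "ACDEFGHIKLMNPQRSTVWY"
toy_mutant = apply_substitution(toy_wild_type, position=10, ref="L", alt="F")

for start, peptide in peptides_covering(toy_mutant, position=10):
    print(start, peptide)
```

For a substitution far from either end of the protein this yields nine overlapping 9-mers, each of which carries the mutant residue at a different position within the peptide.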
Several personal milestones in the last 15 years helped me shape my understanding of scientific medicine. My participation in the Advanced Oncology Program at Ulm University was one of them, and it confirmed the lesson I had learned earlier from my first scientific mentor, Prof. Tasuku Honjo: in order to succeed in science, you “must work hard, be smart, and have a devoted mentor.” The difficulties in my career afterward made me realize that the program was an important preparation for facing those challenges, as after all “the fact that you dare swim against the mainstream means that you have already become a very good swimmer.”
Ahn J-S, Kim H-J, Kim Y-K, Lee S-S, Jung S-H, Yang D-H et al (2016) DNMT3A R882 mutation with FLT3-ITD positivity is an extremely poor prognostic factor in patients with normal-karyotype acute myeloid leukemia after allogeneic hematopoietic cell transplantation. Biol Blood Marrow Transplant 22(1):61–70
Alkhazraji A, Elgamal M, Ang SH, Shivarov V (2019) All cancer hallmarks lead to diversity. Int J Clin Exp Med 12(1):132–157
Baxter EJ, Scott LM, Campbell PJ, East C, Fourouclas N, Swanton S et al (2005) Acquired mutation of the tyrosine kinase JAK2 in human myeloproliferative disorders. Lancet 365(9464):1054–1061. https://doi.org/10.1016/S0140-6736(05)71142-9
Crick F (1970) Central dogma of molecular biology. Nature 227(5258):561–563. https://doi.org/10.1038/227561a0
Hanahan D, Weinberg RA (2011) Hallmarks of cancer: the next generation. Cell 144(5):646–674. https://doi.org/10.1016/j.cell.2011.02.013
Ivanova M, Tsvetkova G, Lukanov T, Stoimenov A, Hadjiev E, Shivarov V (2020) Probable HLA-mediated immunoediting of JAK2 V617F-driven oncogenesis. Exp Hematol 92:75–88. https://doi.org/10.1016/j.exphem.2020.09.200
Klampfl T, Gisslinger H, Harutyunyan AS, Nivarthi H, Rumi E, Milosevic JD et al (2013) Somatic mutations of calreticulin in myeloproliferative neoplasms. N Engl J Med 369(25):2379–2390. https://doi.org/10.1056/NEJMoa1311347
Kralovics R, Passamonti F, Buser AS, Teo SS, Tiedt R, Passweg JR et al (2005) A gain-of-function mutation of JAK2 in myeloproliferative disorders. N Engl J Med 352(17):1779–1790. https://doi.org/10.1056/NEJMoa051113
Levine RL, Wadleigh M, Cools J, Ebert BL, Wernig G, Huntly BJ et al (2005) Activating mutation in the tyrosine kinase JAK2 in polycythemia vera, essential thrombocythemia, and myeloid metaplasia with myelofibrosis. Cancer Cell 7(4):387–397. https://doi.org/10.1016/j.ccr.2005.03.023
Nangalia J, Massie CE, Baxter EJ, Nice FL, Gundem G, Wedge DC et al (2013) Somatic CALR mutations in myeloproliferative neoplasms with nonmutated JAK2. N Engl J Med 369(25):2391–2405. https://doi.org/10.1056/NEJMoa1312542
Shivarov V, Ivanova M (2018) Nodular lymphocyte predominant Hodgkin lymphoma in USA between 2000 and 2014: an updated analysis based on the SEER data. Br J Haematol 182(5):727–730. https://doi.org/10.1111/bjh.14861
Shivarov V, Ivanova M (2020) Clinical outcomes of composite and sequential B-cell lymphomas with features intermediate between DLBCL/PMBCL and classical Hodgkin lymphoma from the SEER database. Br J Haematol 190(3):464–466. https://doi.org/10.1111/bjh.16728
Shivarov V, Ivanova M, Hadjiev E, Naumova E (2011) Rapid quantification of JAK2 V617F allele burden using a bead-based liquid assay with locked nucleic acid-modified oligonucleotide probes. Leuk Lymphoma 52(10):2023–2026. https://doi.org/10.3109/10428194.2011.584995
Shivarov V, Gueorguieva R, Stoimenov A, Tiu R (2013) DNMT3A mutation is a poor prognosis biomarker in AML: results of a meta-analysis of 4500 AML patients. Leuk Res 37(11):1445–1450. https://doi.org/10.1016/j.leukres.2013.07.032
Shivarov V, Ivanova M, Tiu RV (2014) Mutated calreticulin retains structurally disordered C terminus that cannot bind Ca(2+): some mechanistic and therapeutic implications. Blood Cancer J 4:e185. https://doi.org/10.1038/bcj.2014.7
Shivarov V, Gueorguieva R, Ivanova M, Tiu RV (2015) ASXL1 mutations define a subgroup of patients with acute myeloid leukemia with distinct gene expression profile and poor prognosis: a meta-analysis of 3311 adult patients with acute myeloid leukemia. Leuk Lymphoma 56(6):1881–1883. https://doi.org/10.3109/10428194.2014.974596
Shivarov V, Gueorguieva R, Ivanova M, Stoimenov A (2018) Incidence of second solid cancers in mastocytosis patients: a SEER database analysis. Leuk Lymphoma 59(6):1474–1477. https://doi.org/10.1080/10428194.2017.1382694
Tie R, Zhang T, Fu H, Wang L, Wang Y, He Y et al (2014) Association between DNMT3A mutations and prognosis of adults with de novo acute myeloid leukemia: a systematic review and meta-analysis. PLoS One 9(6):e93353
Vainchenker W, Constantinescu SN (2005) A unique activating mutation in JAK2 (V617F) is at the origin of polycythemia vera and allows a new classification of myeloproliferative diseases. Hematology Am Soc Hematol Educ Program 195–200. https://doi.org/10.1182/asheducation-2005.1.195
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
© 2022 The Author(s)
Cite this chapter
Shivarov, V. (2022). Asking Existing Data the Right Questions: Data Mining as a Research Option in Low- and Middle-Income Countries. In: Schmidt-Straßburger, U. (eds) Improving Oncology Worldwide. Sustainable Development Goals Series. Springer, Cham. https://doi.org/10.1007/978-3-030-96053-7_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-96052-0
Online ISBN: 978-3-030-96053-7
eBook Packages: Medicine, Medicine (R0)