What Is Research?

Research is a process of questioning paradigms, setting new hypotheses, and challenging the validity of such hypotheses through the proposition of theories that may or may not withstand the rigor of scientific testing. This process should result in the development of new knowledge that is used for the benefit of humanity. With such new knowledge, novel sets of questions should emerge, leading to further research. In general, research aims to open new frontiers for the greater good of humanity, allowing humankind to expand the boundaries of what is possible and attainable.

From what has been described above, it should be evident that research is a cyclical process. Indeed, research invariably starts with one core idea or set of ideas. Such ideas then need to be divided into smaller “achievable” ones. This is followed by a process of prioritizing these ideas and putting them in a logical order, thus creating a research plan. With this plan, a researcher can start collecting data and conducting experiments. Research does not come without its (often considerable) costs. These costs stem from the complexity of some research ideas; the need for materials and other experiment-related prerequisites, including human resources, chemicals, travel, tools, and space allocation; and possible payments to patients involved in clinical trials. Funding can be applied for and obtained from the potential beneficiaries of research outcomes. These include private companies that might market and sell a product of research, universities that can benefit from licensing the resultant intellectual property, and societies that try to understand the priorities of research in the community. Applications for funds are often submitted in response to a call for grant applications or, less frequently, are unsolicited. Other funding mechanisms can be more complicated, involving multistage revisions and multiple partners or consortia. In these cases, the granting body usually applies ever more complex procedures as the amount of funding requested increases. The granting body usually challenges grant applications with multiple tests to affirm the applicability of the research project and to ensure that the applicants can accomplish what they claim to be able to do.

Most evaluating committees look at the research idea, its validity, and its logical order in relation to previously proven facts. This stands at the core of the grant application review process. Once evaluating committees have affirmed the logic of the research idea, they evaluate the methods to be used in conducting the proposed research project. They usually investigate the suitability of the techniques for the intended experiments and the potential validity of data collected using such techniques. In addition, the statistical methods proposed for data analysis are reviewed. Importantly, reviewers also examine how researchers calculated the required sample size of patients or participants. Erroneous calculation of sample size has important ethical repercussions, in addition to its effect on the validity of research results. Underestimating the required number of participants would expose humans or animals to potential harm without any statistical value. Likewise, overestimating the number of required participants would expose a needless number of patients or lives to potential harm when the hypothesis could have been proven or negated with a smaller number of patients.
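To make the sample size point concrete, the following is a minimal sketch of one common textbook calculation: the number of participants per group needed to compare two proportions with a two-sided test. It is an illustration only, not the method any particular granting body requires. The function name, the default significance level of 0.05, and the default power of 0.80 are assumptions chosen for the example.

```python
import math
from statistics import NormalDist


def sample_size_two_proportions(p1, p2, alpha=0.05, power=0.80):
    """Participants needed per group to detect p1 versus p2 (two-sided test).

    Uses the standard normal-approximation formula for two independent
    proportions with equal group sizes. The defaults are illustrative.
    """
    if p1 == p2:
        raise ValueError("p1 and p2 must differ; otherwise there is no effect to detect")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p1 + p2) / 2                          # pooled proportion under the null
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Hypothetical example: an event rate of 50% in one group versus 40% in the other.
n_per_group = sample_size_two_proportions(0.50, 0.40)
print(n_per_group)
```

A smaller expected difference between the groups raises the required number sharply, which is why an optimistic effect estimate leads to an underpowered study and a pessimistic one to needless enrollment.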

Evaluation of the research plan could also involve the team’s structure and qualifications in relation to the proposed research tasks, as well as the ability of each research sub-team to achieve its work packages (a term used for parts of a plan). Individual team members are evaluated separately in terms of their previous achievements. Next, the entire team’s work history is examined. A multidisciplinary team (or consortium) with a track record of previous successes would thus be more likely to win funding for its research proposal. In particular, any successful previous work that is relevant to the proposed research project would play an essential role in a team’s successful application.

The evaluation process typically also involves an assessment of the proposed marketing, dissemination, and application of research outcomes. Presentations by researchers should thus describe how they have previously marketed and disseminated their research outcomes, as well as their methods of evaluating such processes.

At the start of their careers, most new researchers initially take part in well-funded research projects at well-established research facilities. As they contemplate moving to independent research, they question how these well-funded projects received their funding. While this is not easy to summarize, we hope that this chapter will shed light on how a new researcher, especially in the epidemiology field, could prepare for the moment when they decide to move to independent research.

What Is a Cancer Registry?

As the name suggests, a cancer registry is a method of keeping records of data pertaining to patients with different cancers. The scope of such a registry could involve data collected at a hospital or a clinic or a group of them. Registries may also collect data pertaining to a small population within a county or a state, or to bigger populations, such as that of an entire country or group of countries. Moreover, some registry efforts aim to collect data pertaining to the global burden of various cancers. Often, the registries covering smaller populations report to those with a wider scope (Menck and Smart 1994; Jensen et al. 1991).

Typically, countries collect information about cancer in order to estimate the burden of disease and to inform decisions on disease management and resource allocation; to this end, they aim to collect a basic set of information about cancers. Proper data collection requires follow-up and quality assurance, and each collected item necessitates multiple checks and verifications that may require multiple sources of data. Therefore, most cancer registries prefer to start with a small set of variables and expand with time and with the advancement of their collection methodologies (Parkin 2006).

There is no single agreed list of data items that should be collected in a cancer registry. However, online resources can help to initiate and decide the scope of a new cancer registry (The International Agency for Research on Cancer (IARC) n.d.). Most cancer registries collect the patients’ identifiers to follow up with such patients in the future (if follow-up is included) and to avoid duplication of entries. Other items typically collected include tumor morphology and topography, age at diagnosis, and survival data. The scope of data collected may often be expanded if agreed to between the data collection counterparts. Those counterparts may involve clinics/practices, laboratories, and hospitals. Data collection processes are sometimes enforced by law in order to guarantee a national level of data collection; however, such enforcement does not guarantee the quality of the process and, hence, the resulting outcomes.
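As an illustration of the items listed above, the following sketch models a minimal registry record and a simple check that avoids duplicate entries. The field names, the example codes, and the rule that a duplicate is a repeated combination of patient identifier, topography, and morphology are all assumptions made for this example; real registries apply more elaborate rules, particularly for patients with multiple primary tumors.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RegistryRecord:
    """A hypothetical minimal record; field names are illustrative only."""
    patient_id: str                           # identifier used for follow-up and de-duplication
    topography: str                           # code for the tumor site
    morphology: str                           # code for the tumor histology
    age_at_diagnosis: int                     # in years
    survival_months: Optional[float] = None   # None until follow-up data exist


def deduplicate(records: List[RegistryRecord]) -> List[RegistryRecord]:
    """Keep the first record seen for each (patient, site, histology) combination."""
    seen = set()
    unique = []
    for record in records:
        key = (record.patient_id, record.topography, record.morphology)
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique


# Illustrative data: the second entry repeats the first and is dropped.
records = [
    RegistryRecord("P001", "C69.6", "9590/3", 62),
    RegistryRecord("P001", "C69.6", "9590/3", 62, survival_months=14.0),
    RegistryRecord("P002", "C69.6", "9590/3", 48),
]
print(len(deduplicate(records)))
```

Keeping the first record is only one possible policy; a registry might instead merge the two entries so that later follow-up information is not lost.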

Because each collected item requires multiple checks and verifications that may necessitate multiple data sources, most cancer registries have historically preferred to start with a small set of variables and to expand with time and with the advancement of their collection methodologies. Expansion requires additional training of those involved in data collection, especially in large cancer registries where multiple data collection agencies or registrars have to agree on the definitions of the items collected and on the milestones of collection and reporting.

After data collection and validation, registries typically report such data both internally and externally. The process of handling data has to be conducted in accordance with internationally agreed ethical norms. The identity of patients involved should always be protected when data is reported.

Security and Privacy

The scope of data collected typically evolves as the aims of a cancer registry change or expand. Commonly, data collected include demographic data, details pertaining to the tumor(s), and data pertaining to survival. Less commonly, data collected may include details of treatment. After some time, a cancer registry may decide to help determine the environmental drivers or psychosocial confounders involved in the burden of a malignancy. Other data collected may include biomolecular or genetic data about the malignancies. Cancer registries may facilitate the integration with other population databases, including health insurance claims and socioeconomic details through national identifiers.

Although this data may represent a wealth of information, it also brings a responsibility to protect the identity of patients/participants. Many studies have shown that such information, even when minimal, could still be traced back to individual patients, especially those with rare diseases, and all the more so with the introduction of genomic medicine. Therefore, a debate has arisen regarding whether the scope of such registries should be widened at all.

To foster scientific ingenuity and research, cancer registries should be able to provide tiered access to their data, with clearance requirements and guarantees of patient anonymity matched to each “level” of data provided, without hindering scientific progress and innovation. The rules for data access levels and for the minimum reportable groups of patients should be based on further modeling studies conducted to examine the datasets (Terry 2012; Hofferkamp 2008). Trials have recently been initiated to use technologies like blockchain to enhance security while connecting multiple data sources (Glicksberg et al. 2020).
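The idea of a minimum reportable group can be sketched as small-cell suppression: counts below a chosen threshold are withheld rather than published. The threshold of 5 and the function name below are assumptions for illustration; as noted above, the appropriate minimum should come from modeling studies of the dataset in question.

```python
from collections import Counter

MIN_CELL_SIZE = 5  # hypothetical threshold, not a recommended value


def reportable_counts(group_labels, min_cell_size=MIN_CELL_SIZE):
    """Count patients per group, replacing counts below the threshold with None."""
    counts = Counter(group_labels)
    return {
        group: (n if n >= min_cell_size else None)  # None marks a suppressed cell
        for group, n in counts.items()
    }


# Illustrative data: one common and one rare diagnosis group.
labels = ["common_tumor"] * 12 + ["rare_tumor"] * 2
print(reportable_counts(labels))
```

Suppressing single cells is not sufficient on its own when totals are also published, because a suppressed value can sometimes be recovered by subtraction. This sketch shows only the basic rule.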

Using Registry Data for Research

Although the long introduction above may seem unrelated, cancer registries support each of the aspects of research it describes. Registry data can provide a base for hypothesis generation and testing in epidemiologic studies. Studies may involve describing specific groups or testing hypotheses against data collected from a particular population or setting. Through their coverage of time and population, registries may reveal differences in incidence and survival between groups, in addition to temporal and spatial variations (Armstrong 1992; Kumar et al. 2020). Such findings can be correlated with data from other sources, allowing researchers to detect correlations between cancer incidence and environmental as well as occupational factors. Moreover, such data allow for the comparative assessment of the quality of healthcare services and of treatment outcomes between regions and time periods. Furthermore, registry data can help develop models that predict the impact of diseases, possibly across different socioeconomic strata of the population. A great many projects can be carried out under the abovementioned paradigms.
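A basic building block for such comparisons is the crude incidence rate, usually expressed per 100,000 person-years. The sketch below computes it per group; the function names and the numbers are made up for illustration, and the sketch deliberately omits age standardization, which real comparisons between populations normally require.

```python
def incidence_per_100k(cases, person_years):
    """Crude incidence rate per 100,000 person-years."""
    if person_years <= 0:
        raise ValueError("person_years must be positive")
    return cases * 100_000 / person_years


def incidence_by_group(case_counts, person_years_by_group):
    """Crude rate for every group that has a population denominator."""
    return {
        group: incidence_per_100k(case_counts.get(group, 0), person_years)
        for group, person_years in person_years_by_group.items()
    }


# Hypothetical counts for two calendar periods in the same region.
cases = {"1990-1999": 40, "2000-2009": 66}
population = {"1990-1999": 2_000_000, "2000-2009": 2_200_000}
print(incidence_by_group(cases, population))
```

Groups with no recorded cases still appear in the result with a rate of zero, so that an absence of cases is not confused with an absence of data.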

For a researcher, working with registry data enriches and strengthens their profile in the studied subjects. For grant applicants, working on relevant cancer registry projects together with their future collaborators can provide proof that the proposers can create a functioning team. In addition, working with real-life data brings new skills and statistical challenges to the team and helps with sample size calculations for future studies.

Registry data could also allow the team to validate concepts that were found computationally. For example, if a molecular finding is recorded in a cancer registry, its predictive and prognostic significance can be assessed between groups.

The Limitations of Cancer Registries

Cancer registries provide a snapshot or snapshots of patient status at a certain point or at multiple points in time. Records do not typically provide a continuous view of patient status, or a view beyond the end of treatment. Multiple treatment milestones could therefore be missed. For this reason, data collected from cancer registries should be analyzed with consideration of the aims and methods used in that registry (Izquierdo and Schoenbach 2000).

Cancer registries vary in their methods of data collection and confirmation, in part due to the nature of their active counterparts. Confirmation of disease in some registries can be based solely upon clinical findings without histopathological confirmation. In other situations, confirmation of data may be based on death certificates only (DCO), which do not typically provide sufficient information about dates of diagnosis or the sites or morphologies of tumors. Such DCO data should be analyzed with caution. Feeding cancer registries from multiple sources, including hospitals, clinics, pathologists, laboratories, insurance companies, and death registries, results in a verifiable, multidimensional, and time-oriented picture of the patients’ experience with the disease. Differences in the quality or amount of data collected by cancer registries could also be caused by differences in the lengths of time chosen for data collection, as well as in the version of the staging or classification system chosen.

Cancer registries that cover geographically dispersed regions face possible discrepancies in training or resources and hence in the quality of the data collection. Cancer registries try to eliminate such discrepancies through training, reporting, policies, and infrastructure standardization.

As mentioned above, the aims of cancer registries can differ from one registry to another, as can their collection milestones. Therefore, it is sometimes challenging to aggregate data from different registries, especially when specific questions are being addressed.

The Use of Data from a Cancer Registry: One Researcher’s Story

Between 2010 and 2014, I worked on establishing a clinical research training program for undergraduates and early graduates at the Children’s Cancer Hospital Egypt 57357 (Amgad and AlFaar 2014). In that program, I included methods of designing case report forms, data collection, analysis, and reporting. Because of the time that typical data collection and “cleaning” would take, I decided instead to teach some trainees how to analyze cancer registry data using the Surveillance, Epidemiology, and End Results (SEER) Program of the US National Cancer Institute. The SEER program had been mentioned a few months earlier in the Advanced Oncology master’s program at Ulm, Germany. Its importance had also been emphasized by our colleague, Mohamed Sabry, the head of the research informatics unit. We therefore thought SEER would be a good first topic for the training program. Our student and later colleague, Waleed Magdy, started investigating the structure of a particular set of data and conducted an analysis of this data (which, at that time, was downloadable). He returned with his first report 2 weeks later. The question I had chosen for him to study was a familiar exploratory question from the field of ophthalmic oncology regarding the relative incidence and the temporal and spatial patterns of orbital malignancies in the USA. At that time, we downloaded large data files that we then had to go through to select the relevant information. Now SEER data can be accessed through the SEER*Stat program, which is convenient for those with stable Internet connections, especially when it comes to big queries. In the resultant publication following Sabry’s work, we were able to report the incidence of orbital and adnexal tumors and time trends over the years. We reported finding a steady increase in the incidence of lymphomas until the early 2000s (Hassan et al. 2016).
We then decided to compare survival data in order to find any potential differences in survival between orbital tumors in general and orbital lymphomas in particular, as well as potential differences influenced by age (Hassan et al. 2014). Taking this question to the extreme, Dr. Ibrahim Qaddoumi from St. Jude’s Hospital (Memphis, Tennessee, USA) suggested that we focus on patients who had been afflicted with malignancies in the neonatal period (Alfaar et al. 2017a). Such tumors most probably develop during the intrauterine period. In addition to differences in the distribution of tumors attributed to genetic causes, we found that patients with neonatal tumors had worse survival rates than older patients. This inferior survival had not improved since 1973, despite developments in diagnostic and therapeutic methods. It is worth noting that collecting such detailed age-specific data required a special “agreement” before using the SEER program. Moreover, we highlighted the major non-oncological causes of death in those patients.

After a discussion with me, my colleague Anas Saad decided to focus on the topic of non-cancer causes of death in cancer patients. He started investigating possible causes of death in different malignancies. His results shed light on aspects of cancer patient mortality that were previously unknown. We worked together to study the psychological effects of a tumor diagnosis on patients, as well as the effects of tumor type and of the time elapsed since diagnosis (Saad et al. 2018). In this study, we found a significant increase in the suicide rate early after diagnosis, before any treatment results were available. We recommended that patients with cancers known to have an unfavorable prognosis (such as pancreatic and lung cancers) be given special psychosocial attention and support in the first 3 months after diagnosis. Our investigation of events after cancer diagnosis led us to study the association between cancer diagnoses when multiple malignancies were diagnosed at temporally distant points, as highlighted in previous studies (Harbour et al. 2010). This resulted in finding that pathways involved in the development of uveal melanoma may be shared with other malignancies that were not previously thought to involve such a pathway. In the same study, we found an increased incidence of primary systemic tumors following an earlier diagnosis of uveal melanoma (but not the reverse). We attributed this to efforts by oncologists to search for possible metastases following a diagnosis of uveal melanoma. Such efforts would not be made to search the uvea for metastases after a diagnosis of malignancies elsewhere in the body (Alfaar et al. 2020).

Another innovation that is likely to come into widespread use soon is the application of artificial intelligence/machine learning methods to aid in the classification of registry data, especially data that are hard to analyze with conventional techniques.

During that time, we had been developing our hospital-based cancer registry. Our aim was to automate procedures and create routines that update records, in order to overcome the delays imposed by manual data collection, cleaning, and verification. We were also able to develop routines (or automated verification methods) that check data against a set of preset rules. These rules were mainly meant to help clinical practice and to avoid protocol violations. However, we discovered that such routines also helped the research process by finding anomalies that could not have been discovered by practitioners, including a list of rare presentations or combinations of presentations that had rarely been discussed in the literature (Zekri et al. 2015). Having our own database allowed us to learn how to judge other databases. Moreover, it allowed us to link our database with other databases, aggregating data and allowing for better decision-making with regard to local and countrywide issues. We have also tried to overcome the drawbacks of other databases. Furthermore, our registry provided a core of data upon which we started to advance our biorepository and epidemiologic studies (Labib et al. 2016). Based on these achievements, we were able to apply for further research grants and joint collaborations. This shows that establishing a cancer registry is not the end of the story; it draws a path for further research into bettering patient outcomes.
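An automated verification routine of the kind described above can be sketched as a set of named rules applied to each record. The rules, field names, and record layout below are hypothetical and chosen only to show the pattern; they are not the rules used in the registry described here.

```python
from datetime import date

# Each rule maps a name to a function returning True when the record passes.
# These rules are illustrative assumptions, not an actual registry rule set.
RULES = {
    "age_in_range": lambda r: 0 <= r["age_at_diagnosis"] <= 120,
    "diagnosis_not_after_last_contact": lambda r: r["diagnosis_date"] <= r["last_contact_date"],
    "topography_present": lambda r: bool(r.get("topography")),
}


def violated_rules(record, rules=RULES):
    """Return the names of all rules that the record fails, in rule order."""
    return [name for name, passes in rules.items() if not passes(record)]


# A hypothetical record whose last contact precedes its diagnosis date.
record = {
    "age_at_diagnosis": 7,
    "diagnosis_date": date(2012, 5, 1),
    "last_contact_date": date(2012, 3, 15),
    "topography": "C69.6",
}
print(violated_rules(record))
```

Because the function returns every failed rule rather than stopping at the first, the output can serve both as a correction list for data staff and as a way to surface unusual records that deserve a closer look.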

Starting Your Own Registry in a Nutshell

Fourteen years ago, there were only patients, limited resources, a dream of a children’s cancer hospital, and determined efforts to fund such a hospital through charity. The founders recruited engineers to build and run the hospital infrastructure and physicians to design, provide, and follow up on treatment plans. However, it was clear that the system required a parallel team that would dig deep into data to provide the evidence for better treatment plans, support the design of measurable protocols, and help follow up the patients closely in conjunction with physicians and pharmacists. The founders decided to establish a research department with the core idea of providing measurement tools that would help improve clinical care. It was destined to be the classic research department that one would find in other countries (Alfaar et al. 2017b). The key to the initiation of this department was a saying by Peter Drucker, who stated that “You cannot manage what you cannot measure.” The initial team consisted of eight pharmacists and physicians who were given the titles of clinical research associates (CRAs) and clinical research specialists (CRSs). Each of them was assigned a group of diseases. They started collecting resources and studying the required data, treatment protocols, and related studies for the diseases they were assigned. Physicians used to write their findings on paper, and the papers were gathered in sectioned folders. Folders were kept in central drawers, retrieved once the patient was present in the clinic, and archived if the patient died or was lost to follow-up. During study team meetings for each disease, the team discussed different clinical and research topics. The CRAs and CRSs translated these topics into fields in the case report forms. At first, the case report forms were distributed as paper forms and the data were tabulated in Microsoft Excel sheets. Physicians found it difficult to complete these forms, and data were liable to be lost.
Excel sheets were found to be liable to inconsistencies and required extensive cleaning. Later, repeated orientations were conducted for physicians and nurses to stress the importance of filling in the forms accurately. Excel sheets were converted into MS Access databases, which were easier to clean and allowed for some networking. However, with the recruitment of an increasing number of CRAs/CRSs, it became clear that a more network-friendly solution was needed. Therefore, a solution based on a MySQL database and the PHP programming language was adopted; it provided more networking functions, faster data entry, and cleaner exports. Later, the research teams faced further challenges, including patients developing multiple cancers and study teams requiring multiple changes or modifications to the case report forms. The epidemiology team concomitantly developed multiple survey-based studies and started designing the biobank, which required integration with the central patients’ records. Therefore, we started looking for a modular and flexible solution that would facilitate integration between projects and keep running while modifications were being made or data were being exported. At that point, hospital management decided to acquire an electronic medical record (EMR) system. The options for the solution were narrowed down to caBIG and REDCap. The cancer Biomedical Informatics Grid (caBIG), which NCI provided, was a good candidate because we had decided to use the caTissue solution for managing biobanking, which was made by NCI to be easily integrated with caBIG (Whippen et al. 2007). However, the REDCap solution from Vanderbilt University was chosen to manage the research data space because of its more straightforward setup and maintenance (Harris et al. 2009). In parallel, we developed a solution for importing data from the EMR into REDCap and another one, based on R, R-Shiny, and PHP, that would connect with REDCap and produce data aggregates, statistics, and graphs in real time.
This required developing and refining previously specified data management and analysis plans and defining data release and transfer policies. The need for these changes became pressing in order to integrate with the then-emerging national cancer registry program.

The final program resulted in a path where the initial patients’ data are typed into the electronic medical records as the patient registers in the hospital. These data are transferred automatically overnight to REDCap’s common pool of patients’ data. Further rules were developed to sort the patients automatically based on laboratory results, pathology results, surgical techniques, or possible diagnoses. This allows access to patients’ data on the research nurses’ interface. These nurses then open the medical records and verify the diagnoses. Once they mark a diagnosis as verified, that patient’s data appear on the designated CRAs’/CRSs’ view. The CRAs/CRSs start preparing clinical trial consents, completing the case report forms, and preparing patient educational materials. Patients who require special attention or a deviation from protocols are highlighted by other applied rules. Both automatic and manual data collection and validation cycles continue until the end of treatment and over follow-up milestones. At the data analysis time point, data are gathered based on designated filtering rules and exported in formats interpretable by data analysis software, which helps develop more customizable reports and publication-ready graphs. Data are concomitantly shared with the national cancer registries. Daily exports of patient datasets and their confirmed diagnoses are shared with the hospital biobank and epidemiology team so that they can start preparing consents, filling in custom forms, and collecting samples. The publishing workflow has been accelerated dramatically by these implementations.
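The routing described in this workflow can be sketched as a small decision function that determines on which interface a pooled record should currently appear. The class, field names, and queue labels are hypothetical; they mirror the steps in the text and do not represent the actual REDCap configuration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PoolRecord:
    """A hypothetical record in the common research pool."""
    patient_id: str
    possible_diagnosis: Optional[str] = None  # filled in by the automatic sorting rules
    diagnosis_verified: bool = False          # set by a research nurse after chart review


def queue_for(record: PoolRecord) -> str:
    """Return the interface on which the record should currently appear."""
    if record.possible_diagnosis is None:
        return "unsorted_pool"    # transferred overnight, not yet sorted by any rule
    if not record.diagnosis_verified:
        return "research_nurse"   # awaiting verification against the medical record
    return "cra_crs"              # verified; visible to the designated CRA/CRS


record = PoolRecord("P001")
print(queue_for(record))
record.possible_diagnosis = "retinoblastoma"
print(queue_for(record))
record.diagnosis_verified = True
print(queue_for(record))
```

Deriving the queue from the record's state, rather than storing it separately, keeps the display consistent with the data after each overnight transfer.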

All in all, the founding of the cancer registry has provided a “safe haven” of complete and reliable patient data for scientists and researchers. Elsewhere, the lack of solid patient data continues to hinder quality medical research. The establishment of this registry has thus laid the groundwork for more complex and more ambitious studies to be carried out in the near future.