Ethical Data Collection for Medical Image Analysis: A Structured Approach

Advances in technologies such as data science and artificial intelligence have given healthcare research new momentum, generating findings and predictions about abnormalities that lead to the diagnosis of diseases or disorders in human beings. On the one hand, the application of data science to healthcare research is progressing rapidly; on the other, the ethical concerns, and the adjoining risks and legal hurdles that data scientists may face in the future, slow that progress down. Simply put, the ethically guided application of data science to healthcare research remains more aspiration than reality. Hence, in this paper, we discuss the current practices, challenges, and limitations of the data collection process during medical image analysis (MIA) conducted as part of healthcare research, and propose an ethical data collection framework to guide data scientists in addressing possible ethical concerns before commencing data analytics on a medical dataset.


Introduction
Medical image analysis (MIA) has gained significant importance in recent years due to advancements in technology, specifically in disciplines such as artificial intelligence, machine learning, and data science. Data analytics for MIA has become inevitable for diagnosing diseases or disorders in human beings at an initial stage.
MIA is the process of using computer algorithms to examine the medical images that are available, such as X-rays, CT scans, and MRI scans (Chen et al. 2022). MIA focuses on identifying specific structures in the images, measuring the sizes or positions of structures, and tracing abnormal or affected segments. This is vital for diagnosing diseases or disorders and providing suitable medical solutions to get the patient into a stable state as early as possible (Chakraborthy and Mali 2023). The underlying principle of medical image analysis is to enable researchers and practitioners to diagnose the abnormal state of diseases or disorders through extensive data analytics based on the dataset, i.e., the medical images maintained in medical or healthcare centers and medical research laboratories. The most widely used techniques in medical image analysis include image segmentation, registration, and visualization (Van et al. 2022).
Data ethics and medical image analysis have become inseparable, as MIA is grounded in the datasets provided by healthcare centers or research laboratories. The process these institutions adopt to collect, store, and share data for MIA must address the data privacy of the patients. To date, there is no structured procedure or universally accepted approach for carrying out the data collection process for MIA ethically (Martin et al. 2022). One of the key concerns in medical image analysis is protecting patient data, which may consist of sensitive information such as medical history and personal identification. Ensuring that patient data is secured and protected from unauthorized access is required to maintain the patient's privacy and to maintain trust in the healthcare system. Ethical practice also includes avoiding any bias in the selection of training data, as well as ensuring that the analysis does not discriminate against certain groups or individuals. Thus, data ethics for medical image analysis deserves attention and study, so as to guide researchers and practitioners in adopting a structured approach for ethical data collection and analytics (Carter et al. 2015).
Gathering research data for medical image analysis is a herculean task for data scientists. To ensure data privacy, medical ethics commissions from participating institutions must approve all data collection. The use, distribution, and reuse of datasets is a distinguishing aspect of today's medical research field. Several technological advancements, such as artificial intelligence and data science, can only be achieved through the use of medical datasets. While much has been written about the ethics of medical data collection in a number of contexts, there is little guidance on which values are at risk and how we should make decisions in an incredibly challenging health and research environment. Ethical data collection is vital for medical image analysis because it ensures that the data used to train and test models are obtained in a way that respects patients' privacy and anonymity (Beauchamp and Childress 1994). This includes obtaining informed consent from patients before collecting their medical images and ensuring that their personal information is protected. The datasets obtained through an ethical data collection process must also be accurate, reliable, and representative of the population the resulting models are intended to serve (Beauchamp 2003).
Asian Bioethics Review (2024) 16:95-108

If the data used to train models is not representative of the population, or is biased, the models developed from those datasets may not produce accurate predictions, thereby misguiding the patient's treatment. While the potential of data analytics in many disciplines, including health and research, is widely acknowledged, there are certain inherent issues with the ethics of the data-gathering process for MIA. These difficulties relate to the peculiarities of medical datasets, as well as to the difficulty of acquiring, integrating, processing, evaluating, and interpreting data for medical image analysis. In addition, they involve addressing privacy concerns, data security, governance, data sharing, and data compliance as per stipulated laws and regulations (Altman 1980; Krutzinna and Floridi 2019), as well as operational and ownership issues (Sivarajah et al. 2017). We would like to place on record that all these challenges and issues inevitably influence each other.
In recent years, the extensive use of data science has had adverse effects, such as increased privacy invasion, data-driven prejudice, and data-driven decision-making without justification (Martens 2022). Data ethics is primarily concerned with right and wrong. When data science is applied to medical images, addressing ethical concerns may result in better datasets and data models, possibly with more accurate predictions or greater user acceptance of the data models. The General Data Protection Regulation (GDPR) in Europe addresses numerous data science elements linked to data privacy, including explainability (Martens 2022). The theme of "data-related problems" focuses on the most important ethical dilemmas that might occur in connection with the gathering and use of data. The large volume of data that is collected, stored, and made available allows data scientists to forecast future events based on historical trends (Saltz and Dewar 2019). It has been repeatedly observed that data ethics are necessary when analyzing medical images.
Organizations that use data science should also provide ethical training and interactive ethical evaluations to assist employees in resolving ethical problems (Leonelli 2016). However, it is uncertain whether these organizations have the breadth and depth of expertise required to provide this training effectively. Previous research (Saltz and Dewar 2019) identified three main data-related challenges: data accuracy and validity, data misuse, and privacy and anonymity. The ability to access or gather data does not necessarily imply its ethical use. A further class of ethical issues arises when creating and applying analytical models. An analytical model is a computational tool for analyzing, interpreting, and forecasting future events based on historical data; in other words, it is a series of mathematical operations that provide a forecast of a certain state based on prior knowledge. However, applying an algorithm could bring about or exacerbate a variety of unethical circumstances (Saltz and Dewar 2019).
The paper is organized as follows. "Related Works" reviews related research and the current practices in medical image analysis. "Challenges and Limitations" describes the challenges and limitations of the current data collection process. "Ethical Data Collection for Medical Image Analysis: A Structured Approach" describes our proposed structured approach for ethical data collection when conducting medical image analysis. "Data Science Ethics Equilibrium" discusses the application of the proposed approach, followed by the conclusion.

Related Works
MIA refers to the process of using computational methods to extract meaningful information from medical images obtained in the form of X-ray, CT, or MRI scans. Medical image analysis includes computer-aided diagnosis (CAD), image segmentation, image registration, and image-guided therapy. Recently, all such analyses have been supported by deep learning techniques and explainable artificial intelligence (AI) to improve the accuracy of predictions, leading to earlier diagnosis of diseases or disorders (Shen et al. 2017). Computer-aided diagnosis uses machine learning algorithms to analyze images and identify potential abnormalities.
Image segmentation involves separating an image into different regions or structures, such as tumors or organs. Image registration is carried out by aligning, or registering, multiple images of the same patient taken at different times or using different modalities. Image-guided therapy involves using image analysis techniques to guide and monitor surgical procedures or other forms of therapy (Altaf et al. 2019). Deep learning techniques play a significant role in medical image analysis, specifically for tasks such as image segmentation and diagnosis. Explainable AI is gaining momentum in developing methods that can provide insights into the decision-making process of a model to improve its precision and interpretability (Weese and Lorenze 2016).
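To make the segmentation task above concrete, the sketch below labels pixels of a tiny synthetic image by intensity thresholding. This is a minimal illustration only: the image, threshold, and function name are invented for this example, and real MIA pipelines rely on far more sophisticated (often deep-learning) segmentation methods than thresholding.

```python
def threshold_segment(image, threshold):
    """Label every pixel above `threshold` as foreground (1), the rest as background (0).

    Intensity thresholding is the simplest member of the segmentation family;
    it serves here only to illustrate what "separating an image into regions" means.
    """
    return [[1 if pixel > threshold else 0 for pixel in row] for row in image]

# A tiny synthetic "scan": a bright 3x3 square on a dark 8x8 background.
scan = [[200.0 if 2 <= r <= 4 and 2 <= c <= 4 else 0.0 for c in range(8)]
        for r in range(8)]

mask = threshold_segment(scan, threshold=100.0)
print(sum(sum(row) for row in mask))  # foreground pixel count: 9
```

Even this toy example shows why segmentation output needs validation against expert-annotated ground truth: the threshold alone determines what counts as "tumor."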
Armed with our understanding of medical image analysis (MIA) based on the related previous research works on MIA by the first author (Kalaiselvi et al. 2020a, b, 2021, 2022a, b; Kalaiselvi and Padmapriya 2022; Padmapriya et al. 2021, 2022), we now present the current practices, challenges, and limitations in the data collection process for performing medical image analysis.
In medical image analysis, tasks such as brain tumor detection and classification (Kalaiselvi et al. 2020a, b, 2021, 2022a) and segmentation (Kalaiselvi et al. 2022b; Kalaiselvi and Padmapriya 2022) from magnetic resonance imaging (MRI), and COVID-19 detection from X-ray and computed tomography (CT) (Padmapriya et al. 2021, 2022), were carried out using deep learning techniques. Brain tumor classification and segmentation were carried out using the brain tumor segmentation (BraTS) dataset (Menze et al. 2014). It is a publicly available dataset of magnetic resonance imaging (MRI) scans of brain tumors. It is used for evaluating the performance of algorithms for brain tumor segmentation, which is the process of identifying and separating the tumor from the surrounding healthy tissue in an MRI scan.
The BraTS dataset consists of multi-modal MRI scans from more than 200 patients with gliomas, which are the most common type of brain tumor. The scans include T1, T1-weighted with contrast, T2, and fluid-attenuated inversion recovery (FLAIR) modalities. The dataset also includes manual segmentations of the tumors, which are used as the ground truth for evaluating algorithm performance. The challenges in collecting and articulating these datasets are data variability and inconsistency, annotation and labeling, limited availability of data, data storage, and management.

Challenges and Limitations
Medical image analysis is a rapidly evolving field, and there are several challenges that researchers and practitioners currently face when collecting the correct dataset (Altaf et al. 2019). Some of these challenges are erroneous data in the dataset, noisy data, incomplete datasets, limited domain adaptation, complexity of images, lack of interpretability, data annotation, integration with clinical workflows, and explainable AI (Weese and Lorenze 2016). A successful and effective medical image analysis demands a large volume of high-quality data without compromising on privacy and ethical concerns.
Medical images can vary significantly depending on the imaging modality, patient population, and clinical context. This can make it difficult to develop models that generalize across different domains. Medical images can also be very complex and difficult to interpret, particularly in the case of 3D images or images that contain multiple structures or organs. Most of the existing image analysis techniques are based on deep learning, which requires domain knowledge for practitioners to correctly interpret and understand the results.
The annotation process for medical images is time-consuming and expensive and requires expert knowledge. The integration of computer-aided diagnostic systems into clinical workflows is still a challenge and requires the development of efficient and user-friendly systems. Developing models that can provide insights into their decision-making process is important for improving trust and interpretability, but it can be a challenging task. Medical image analysis is a complex field, and there are several limitations regarding dataset collection and usage. Some of these limitations include limited accuracy, lack of robustness, limited interpretability, lack of generalizability, limited scalability, privacy and ethical concerns, and limited clinical adoption (Habuza et al. 2021). Collecting the right dataset while retaining privacy and handling ethical concerns is a difficult task for MIA.
The existing methods used in MIA are not robust enough to respond to variations in imaging modalities, acquisition protocols, and patient populations. Models trained on a specific dataset may not be sufficient to reach generalized results or observations, which could later become a limitation in terms of clinical adoption as well (Yaffe 2019). Medical image analysis may also demand the use of the same dataset for multiple iterations or repurposing. This is impossible if researchers or practitioners are not granted the required rights for such usage. Failing to address such issues may also constrain scalability in the future during MIA. Medical image analysis often requires access to sensitive personal information, which may raise a red flag due to privacy and ethical concerns.
Based on our observations of the related previous research on the data collection process within medical image analysis, we now state the most significant challenges and limitations of the current data collection process during MIA as follows:

1. Data privacy for the medical images, as they hold sensitive information about the patients.
2. There is no standard procedure or regulation in practice to carry out an ethically guided data collection process.
3. A necessary extension of collecting a dataset of medical images is properly annotating and labeling it. This is mostly left out in practice, as it is a time-consuming and expensive process, and it may cause ethical issues in the future.
4. Datasets of medical images can come either from a data repository (stored images) or be collected dynamically in real time from patients. The ethical procedures will certainly vary depending on the method adopted for data collection and storage.
We believe that once these challenges are addressed with a structured approach for managing the data collection process ethically during MIA, researchers and practitioners will be able to perform extensive analysis using cutting-edge techniques involving artificial intelligence, data science, and machine learning.

Ethical Data Collection for Medical Image Analysis: A Structured Approach
While data science can help in so many ways, it can cause harm too. By developing a shared sense of ethical values, we can reap the benefits while minimizing the harms. Based on the previous research on medical image analysis by the first author (Kalaiselvi et al. 2020a, b, 2021, 2022a, b; Kalaiselvi and Padmapriya 2022; Padmapriya et al. 2021, 2022) and the related research works involving data science during MIA discussed in the previous section, we draw the elements for developing a structured approach to facilitate an ethical data collection process during MIA.
The structured approach is grounded specifically in previous research on applications of data science in healthcare, namely big data repositories (Xafis and Labude 2019), big data and medicines (Schaefer et al. 2019), real-time healthcare datasets (Lipworth 2019), and artificial intelligence and machine learning on big datasets from healthcare systems (Lysaght et al. 2019; Ballantyne and Stewart 2019; Laurie 2019). This structured approach is presented below (Fig. 1) in the form of a framework. Even though it can be connected to theory in the data science research literature, the goal of the framework for ethical data collection during MIA is to provide a practical tool, rather than a theory that justifies actions or a framework that merely deepens understanding of complicated issues (Dawson 2010).
In a complex world, with many actors, it is often difficult to see where the problem lies and how to define limits on the use of data science. In these circumstances, an ethical data collection framework will help data scientists think about the ethical questions to consider as we define these limits. From Fig. 1, we observe that there are two possible sources of data for conducting medical image analysis: a primary dataset and a secondary dataset. Medical data such as X-rays, CT scans, and MRI scans that are collected directly from human beings are considered a primary dataset. The metadata that arises from the original dataset is viewed as a secondary dataset. It is generally considered inappropriate to use either dataset (primary or secondary) without permission. Ethical collection also involves recording metadata during the data collection process, such as who collected the data, when it took place, and how long it lasted.
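The provenance metadata just mentioned (who collected the data, when, and for how long) can be captured in a simple record attached to each collection session. The sketch below is a hypothetical illustration; the field names and values are assumptions for this example, not a standard.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class CollectionRecord:
    """Provenance metadata for one data-collection session (illustrative fields)."""
    collector: str            # who collected the data
    source: str               # "primary" (direct from patients) or "secondary" (derived metadata)
    started_at: str           # when collection took place (ISO timestamp)
    duration_minutes: float   # how long the session lasted
    consent_reference: str    # pointer to the approval or consent document

# Hypothetical session: a primary dataset collected under a (made-up) approval ID.
record = CollectionRecord(
    collector="radiology-team-A",
    source="primary",
    started_at=datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc).isoformat(),
    duration_minutes=45.0,
    consent_reference="IRB-2024-017",
)
print(record.source)
```

Keeping such a record alongside each dataset makes the later consent and permission checks discussed in this section auditable rather than a matter of memory.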
Concerns about data validity are not new. Anyone with basic statistics training should be able to think through the concerns we discuss, yet many data scientists seem to forget these simple fundamentals. It should be noted that untimely use of data analytics may lead to erroneous inferences and findings. Clearly, one cannot expect a fair execution of an algorithm during MIA if that algorithm uses an erroneous or invalid dataset as its input. A computer itself cannot hold prejudice or stereotypes. The assumptions, model, training data, and other boundary conditions were all specified by human beings, who may reflect their biases in the data analytics outcome, possibly without even realizing it. Even though the analysis technique itself may be perfectly neutral, stakeholders have only recently started to consider how an algorithm's dataset can lead to unfair outcomes.
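The point about invalid input data can be made concrete with a small validity gate that runs before any analytics. The record keys and checks below are illustrative assumptions; a real MIA pipeline would check far more (image geometry, intensity ranges, annotation agreement, and so on). The point is that some validity gate should run before analysis, not that these checks are sufficient.

```python
def validate_dataset(records):
    """Return a list of problems found; an empty list means the basic checks pass.

    Each record is a dict with (hypothetical) keys 'image', 'label', and 'modality'.
    """
    problems = []
    allowed_modalities = {"X-ray", "CT", "MRI"}
    for i, rec in enumerate(records):
        if rec.get("image") is None:
            problems.append(f"record {i}: missing image")
        if rec.get("label") is None:
            problems.append(f"record {i}: missing label")
        if rec.get("modality") not in allowed_modalities:
            problems.append(f"record {i}: unknown modality {rec.get('modality')!r}")
    return problems

# Hypothetical two-record dataset, the second invalid on purpose.
dataset = [
    {"image": [[0, 1]], "label": "tumor", "modality": "MRI"},
    {"image": None, "label": "healthy", "modality": "CT"},
]
print(validate_dataset(dataset))  # ['record 1: missing image']
```

Refusing to run analytics until such a gate returns an empty list is a cheap defense against the "erroneous or invalid dataset" failure mode described above.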
Assume that you have a suggestion to enhance the process for entering patient data into electronic medical records that would reduce errors and enhance the integration of data entry with the workflow for patient care. In this scenario, when you do an experiment to test a hypothesis, the type of data you need is unquestionably prospective data, not retrospective data (i.e., pre-existing information). Prospective data is inevitable for MIA, as new patients are cared for and their data recorded. Data scientists should remember that the ethical concerns for prospective and retrospective datasets differ, and that permission or rights to capture and utilize those datasets for MIA must be obtained prior to building a data model and performing analysis.
It should be noted that the prospective data collection process demands a review of the dataset by the Institutional Review Board (IRB). This also applies to human subject-based social science research. In the current era, the use of prospective datasets by data scientists has become inevitable, as it contributes significantly to medical research. The past population is not the same as the future population. More so, MIA involving a dataset based on the past will work in the future only to the extent the future resembles the past. In these circumstances, data scientists may have to watch out for singularities, but should also worry about gradual drift.

Fig. 1 Ethical data collection framework
When discussing datasets for medical image analysis, data privacy is undoubtedly another pressing issue for data scientists. Making use of data through its collection, connection, and analysis, while also minimizing the negative effects that can result from the dissemination of data about human beings, is a challenging task. The current need is to define rules and regulations that regulatory bodies would agree to. It should also be noted that maintaining anonymity forever is very difficult in practice.
"Anonymization" (also known as "anonymity") is the process of deleting identifying information so that the remaining information can never be used to recover the identity of any specific person (Xafis and Labude 2019). Data would not be considered anonymous if there is room to re-identify an individual, considering both the data by itself and the data when combined with other information to which the organization has, or is likely to have, access, as well as the measures and safeguards implemented by the organization to mitigate the risk of identification.
With enough other data, anonymity is practically impossible to guarantee. The diversity of datasets can be eliminated by combining them with external data, and aggregation only works when the entities being aggregated have unknown structures. In a facial recognition dataset, for instance, faces can be gradually recognized even in difficult situations such as partial occlusion. If anonymity cannot be achieved or is difficult to implement, the simplest way to prevent data misuse is to avoid publishing the dataset in public forums and to ensure that data scientists are well informed about the level of consent (informed or voluntary) given by the data owners for its use during MIA.
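As a minimal sketch of the distinction drawn above, the function below strips direct identifiers from a hypothetical patient record. The field names are invented for this example, and, as the text stresses, removing direct identifiers is not full anonymization: the remainder may still be re-identifiable when combined with external data. Real medical de-identification follows formal profiles such as those defined in the DICOM standard.

```python
# Hypothetical set of direct identifiers; a real profile enumerates many more.
DIRECT_IDENTIFIERS = {"patient_name", "patient_id", "address", "phone"}

def deidentify(record):
    """Return a copy of `record` with direct identifiers removed.

    This demonstrates de-identification only; the remaining quasi-identifiers
    (dates, modality, etc.) may still allow re-identification via linkage.
    """
    return {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}

raw = {
    "patient_name": "Jane Doe",      # invented example data
    "patient_id": "P-001",
    "modality": "MRI",
    "scan_date": "2024-01-15",
}
print(sorted(deidentify(raw)))  # ['modality', 'scan_date']
```

The surviving `scan_date` illustrates the text's warning: a quasi-identifier that looks harmless in isolation can re-identify a patient once joined with an external record.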
To protect data privacy during MIA, the validity period for the usage of the collected data should be well defined, whether for the purpose of research or for gaining new insights into diseases or disorders. Healthcare centers and medical research laboratories legitimately collect data as part of performing MIA, and such an institution will typically make an effort not to use the data in egregiously inappropriate ways, in order to maintain patients' trust. But after the tests, this dataset becomes an asset that may be sold to a different party with malicious intentions. Thus, to ensure data privacy, either the collected data must be destroyed, not sold, after a stipulated time period, or the consent for its usage must be explicitly mentioned in the agreement approved by the data owner during the data collection process itself. Modern data science systems must deal with data privacy by design. As there are too many players in the race to exploit datasets for personal or business gain, it should be noted that data sharing is contractual and not based on trust.
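The validity period discussed above can be enforced mechanically: before any analysis, check whether the agreed usage window has expired. The sketch below assumes a simple day-count rule invented for illustration; a real agreement would encode whatever terms the data owner actually approved.

```python
from datetime import date

def usage_allowed(collected_on, validity_days, today):
    """True if the dataset is still within its agreed validity period.

    A hypothetical policy: usage is permitted for `validity_days` days
    after the collection date, after which the data should be destroyed
    or fresh consent obtained.
    """
    return (today - collected_on).days <= validity_days

collected = date(2024, 1, 1)
print(usage_allowed(collected, validity_days=365, today=date(2024, 6, 1)))  # True
print(usage_allowed(collected, validity_days=365, today=date(2026, 6, 1)))  # False
```

Wiring such a check into the pipeline turns "the data must be destroyed after a stipulated time period" from a policy statement into an automatically enforced constraint.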
For utilizing a sensitive dataset extracted from medical records, such as de-identified data used for MIA, data scientists must first go through a straightforward licensing process. This could be carried out by signing a contract or agreement with the data owners or other trusted parties who hold the data repository. When it comes to utilization of the collected dataset, before commencing the medical image analysis, data scientists should verify whether either informed consent or voluntary consent exists for the chosen dataset. By "informed consent," we mean that the human subjects must be made aware of the experiment, must give their permission for it to proceed, and must be given the option to revoke their consent at any time by notifying the researchers. Consent can also be obtained voluntarily, i.e., without coercion, for the dataset to be used for MIA without any hassle. Ideally, any data science project that involves human subjects should consider the feedback of the Institutional Review Board (IRB).
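A simple way to operationalize this verification step is a consent gate that refuses to start any analysis unless informed or voluntary consent is on record and, in the informed case, has not been revoked. The metadata keys below are hypothetical, chosen only to mirror the two forms of consent discussed above.

```python
def consent_gate(dataset_meta):
    """Raise before analysis starts if no acceptable consent is on record.

    `dataset_meta` is a hypothetical dict: 'consent' should be 'informed'
    or 'voluntary', and 'consent_revoked' records a subject's revocation.
    """
    consent = dataset_meta.get("consent")
    if consent not in ("informed", "voluntary"):
        raise PermissionError("no informed or voluntary consent on record")
    if consent == "informed" and dataset_meta.get("consent_revoked", False):
        raise PermissionError("informed consent was revoked by the subject")
    return True

# Analysis may proceed only when the gate passes.
print(consent_gate({"consent": "informed", "consent_revoked": False}))  # True
```

Because the gate raises rather than returning False, a pipeline cannot accidentally ignore a missing or revoked consent and continue into the analysis stage.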
The IRB comprises diverse members, including non-scientists; it approves human subject studies, balances potential harm to subjects against the benefits to science, and manages necessary waivers of informed consent. Regarding voluntary consent, i.e., voluntary disclosure, the stakeholders (data owners) should be informed that anything they disclose voluntarily to others has much less protection than anything they keep completely to themselves. For example, the telephone company must know what number you dialed in order to place the call. Hence, once you have disclosed this metadata to someone, even if it is not the content of your conversation, there is always a possibility that it becomes problematic due to its ubiquity.
Data scientists should abide by the laws framed by the medical councils of different countries (the USA, the European Union, India, Australia, and so on) while retaining datasets on human subjects for future reference and usage. If, after some years, such medical records become available in the public domain through published research papers, research reports, or design patents, this would impact the privacy of the data owner and may later lead to legal hurdles as well. Thus, knowing how to remove such records should also be a matter of concern for data scientists during MIA.
Between voluntary consent and informed consent, there will always be a tradeoff. It should be emphasized that while "voluntary" consent is given by the stakeholders to carry out a desired action when necessary, "informed" consent is often based on information concealed in numerous pages of fine print. Facebook, for example, explicitly tells users in its agreement that it may collect user data for research purposes, yet it has faced user dissatisfaction over irrelevant use of their data and over the limitations of that agreement (informed consent). Hence, informed consent and voluntary consent, when used wisely, ensure data privacy and allow data scientists to collect and use data for effective MIA ethically; used otherwise, data science can result in major asymmetries during MIA. In practice, many user agreements are "all-or-nothing." This amounts to data owners (the persons who hold accountability for a specific dataset) completely giving up control of shared data to get any service benefits at all; later, such owners often complain about the loss of privacy. Hence, the ideal solution would be for data scientists to provide graduated choices, so that data owners can make the tradeoffs they like, as with, for example, incognito browsing.
Another problematic area in the ethical data collection process is "repurposing." By "repurposing," we refer to collecting a dataset for one objective during MIA and later using it for a different objective without obtaining consent from the data owner. Imagine that patients give their dataset to a medical team to obtain a specific service. They may not want their dataset (medical records) to be used for other purposes by data scientists during MIA, nor may they permit the team to share the data with others. This means that the data owner's consent can often be limited to disallow repurposing, thereby indicating that "context" matters during data collection and its subsequent usage by data scientists.
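Limiting consent to disallow repurposing can be expressed as a purpose check: every requested use of the dataset is compared against the purposes recorded at collection time, and anything outside that scope is repurposing that requires fresh consent. The purpose labels below are invented for illustration.

```python
def purpose_permitted(consent_scope, requested_purpose):
    """Check a requested use against the purposes the data owner consented to.

    `consent_scope` is a hypothetical set of purpose labels recorded at
    collection time; any purpose outside it counts as repurposing.
    """
    return requested_purpose in consent_scope

# Hypothetical scope agreed at collection time.
scope = {"diagnosis", "treatment-planning"}
print(purpose_permitted(scope, "diagnosis"))       # True: within consented scope
print(purpose_permitted(scope, "model-training"))  # False: repurposing, needs new consent
```

Recording the scope as explicit labels, rather than free text in an agreement, is what makes the "context matters" principle checkable at the moment a dataset is reused.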
Repurposing is also occasionally inevitable for carrying out medical research using data science or artificial intelligence to generate new insights on certain diseases or disorders. For instance, a patient may willingly allow the repurposing of their medical data by a hospital to receive better care. However, the specific research questions may not be known to the medical team at the time this particular patient receives care at the hospital. This is referred to as retrospective data analysis in data science during MIA.

Data Science Ethics Equilibrium
The simplest way to maintain equilibrium between ethical concerns and the utilization of data is to ensure that the data collection process during MIA adheres strictly to data compliance. Data scientists (or their companies) must comply with laws; they must at least do the minimum required to meet the letter of the law. Ideally, compliance follows regulations, which in turn follow the social impact of technology. Data scientists ought to be enthusiastic about the good that data science can accomplish. But if businesses do not self-regulate, there will be a public backlash that will cost them and impede us from fully realizing the promise of data science.
Data scientists must therefore conduct themselves morally if they are to continue to be proud of, and successful in, their data analytics on medical images for improved comprehension and diagnosis. Do not surprise people with data analytics carried out on their health dataset without their consent; it is unfair to surprise others by utilizing their data for MIA, irrespective of the goals and objectives of such analysis. As part of MIA, data scientists should own the result ethically. Even if there is nothing "wrong" with the process, it should be changed if the data analytics lead to unwanted results.
The structured approach (Fig. 1) is meant to help data scientists take into account the ethical concerns that arise during the data collection process and its utilization for the purpose of MIA. The proposed framework strives to achieve a stable equilibrium between ethical concerns and data utilization for MIA by helping data scientists respond to the following questions: Who is the data owner? What would be the purpose of utilizing the data? And is there a way to hide certain portions of exposed data? Also, after performing data analytics over a medical dataset on human beings, data scientists should own the outcomes by addressing the following questions: Are we doing a valid data analysis? Is the data analysis transparent, fair, and accountable? And is there any societal impact? Ethical analysis is difficult in the sense that, in a complex world with many actors, it is often challenging to see where the problem lies and how to define limits to the use of data science. Making the right decision does not guarantee success, but surrendering morality invariably ends in disaster. Data is the primary component of all data science. Algorithms that result in specific data science models can be applied to such data. One can debate which of these is unethical, but data can undoubtedly be unethical if it contains a data subject's private information that they do not want to be known, or if there is bias against particularly vulnerable groups. A predictive model can undoubtedly be unethical, most typically as a result of its reliance on unethical data. In such cases, a predictive model is likely to utilize personal data and generate outcomes that negatively discriminate against a group of individuals. When it comes to applying data science, a practice is not simply moral or immoral. Rather, it frequently involves striking a balance between moral considerations and the usefulness of the data.
At one extreme, there is absolutely no interest in or investment in data science ethics; at the other, the ethical problem is so great that no data is used at all. Data science approaches are chosen based on how important the ethical issues are and how useful the data is. The equilibrium that is reached depends heavily on the context, and it is influenced by the potential effects on people and society, as well as by how positive or negative these effects are considered. Medical image analysis using data science will undoubtedly require law enforcement and entail stringent requirements, and data scientists need to adhere to stricter data science ethics procedures in such circumstances.

Conclusions
The ethical data collection framework presented in this paper identified three important components: the source of data, data validity, and data usage with due consent from data owners while conducting MIA. Even though this list is not all-inclusive, the articulation of these elements and their underlying idea has the ability to highlight crucial factors to take into account when using data science in healthcare research. The framework also highlighted general issues that cut across all data science applications in healthcare contexts, such as data privacy, data repurposing, prospective and retrospective datasets, and, finally, limited informed consent or open voluntary consent for data usage.
Most data of interest during MIA are provided by human beings, about human beings, or affect human beings. Thus, we must consider this impact as we practice data science. It is crucial that we pay careful attention to the ethical concerns around data collection, its validity, and its utility during medical image analysis. Otherwise, the results or findings of MIA cannot be disseminated among the intended audience, as they may invite legal hurdles and privacy concerns or harm the reputation of the medical research laboratories. Each new algorithm, analysis, or system must be considered in terms of how it will affect society; data scientists cannot shirk this obligation. In reality, technology advances quickly, whereas regulations governing data privacy and related ethical concerns move slowly, so we will always be regulating yesterday's technologies. This may permit many abuses because they comply with outdated regulations and, at the same time, disallow possible benefits because they conflict with outdated or needlessly broad regulations.