Introduction

The advent of quantitative image analysis techniques has revolutionized the field of radiology, enabling researchers and clinicians to analyze and interpret medical imaging data more efficiently and accurately [1]. Radiomics, an emerging field at the intersection of radiology and oncology, leverages advanced computational techniques to extract a wealth of quantitative information from different imaging modalities [2]. This process involves extracting numerous high-dimensional features that capture various aspects of the tumor and its surrounding microenvironment, including shape, size, texture, intensity, spatial relationships, and intratumoral heterogeneity [3].

By converting medical images into mineable, high-dimensional data, radiomics can uncover potential biomarkers that aid in cancer diagnosis, prognosis, treatment response monitoring, and the personalization of therapy according to each patient’s individual needs. In the domain of oncologic imaging, supervised quantitative analysis, in which labeled cross-sectional imaging data guide the model, has arguably seen the greatest success, with applications ranging from organ segmentation [4] and lesion detection [5] to cancer characterization and risk stratification [6]. Furthermore, radiomics can potentially improve the efficiency and cost-effectiveness of cancer care by reducing the need for invasive biopsies and enabling earlier detection of malignancies [7]. By providing non-invasive, quantitative, and reproducible information, radiomics can complement traditional imaging techniques and contribute to a more comprehensive understanding of a patient’s cancer and its underlying biology [8].

Unlike deep learning approaches, which can learn features and patterns directly from raw image data, radiomics typically requires predefined regions of interest (ROIs) to be segmented within the dataset. As a result, radiomics approaches require additional preparatory steps to delineate relevant areas for later feature extraction and analysis. However, because labeling is expensive and time-consuming, datasets containing both accurate inputs and labels are difficult to find and are therefore frequently reused as benchmarks across many different studies [1]. Open-access datasets have emerged as an invaluable resource for validating new radiomics approaches, providing researchers with diverse and annotated data [9].

Despite the growing number of publicly available datasets, numerous challenges hinder their effective utilization in radiomics research. These issues include incomplete documentation, low visibility, inconsistency in image and segmentation formats, data inhomogeneity across disparate datasets, and complex data preprocessing. Inadequate documentation and mislabeling in datasets can lead to misinterpretation and unintentional bias, whereas low visibility stems from datasets being hosted across various platforms. The absence of centralized data repositories with standardized formats impedes system interoperability and limits opportunities for collaboration and shared progress in the field. Furthermore, differences in acquisition protocols, scanners, and settings across studies can introduce bias and diminish the robustness of radiomics models. Depending on the clinical application, datasets might require custom, time-consuming preprocessing to handle multiple modalities (e.g., CT and PET), sequences, ROIs, or readers and to verify data correctness before their use in a radiomics analysis.

The lack of reproducibility and generalizability of radiomics models is another major challenge. Insufficient transparency in reporting radiomics studies further prevents the translation of the developed radiomics signatures into clinical practice. In recent years, several notable efforts to improve reproducibility and standardization in radiomics studies have been initiated, including the Image Biomarker Standardization Initiative (IBSI) [10], which identified a reference set of reproducible radiomics features, and the CheckList for EvaluAtion of Radiomics Research (CLEAR) [11], which provided guidelines for more structured and consistent documentation for radiomics studies. While these initiatives primarily focused on improving study methodologies, access to high-quality, open-source data is the other crucial element for further progress in the field.

In this study, we systematically reviewed cross-sectional cancer imaging datasets, specifically identifying those suitable for radiomics research. We created a code repository and curated a comprehensive data repository to facilitate the evaluation of new radiomics models on benchmark datasets, addressing the time-consuming task of locating appropriate datasets with segmentations and clinical labels and preprocessing them from their raw form. We hope that the project will catalyze further advancements in this field, promoting standardization, reproducibility, and ultimately the clinical translation of radiomics research.

Methods

Dataset selection and acquisition

We reviewed multiple publicly available imaging datasets spanning various oncologic entities. The datasets were acquired from established online data repositories, including The Cancer Imaging Archive (TCIA) [12], the Grand Challenge platform (https://grand-challenge.org, Radboud University Medical Center, 2023), Zenodo [13] (https://zenodo.org), Synapse (https://synapse.org, Sage Bionetworks, 2023), and BMIAXNAT [14]. Inclusion criteria encompassed (1) publication of the dataset on one of the abovementioned repositories by March 2023 and (2) availability of a tomographic imaging modality (CT, MRI, or PET). Exclusion criteria were as follows: a non-permissive license, absence of volumetric (3D) segmentations, unavailability of clinical labels, inclusion of the dataset in another public dataset, and an insufficient number of labeled cases (n < 10). The study flowchart, displaying the data sources as well as the inclusion and exclusion criteria, is presented in Fig. 1. Labels were defined as clinical outcomes, characteristics, or classifications related to the imaging data. They can be used to guide a machine learning model in learning the mapping from radiomics features to clinical information from labeled examples and are therefore necessary for developing clinical radiomics models. We also collected detailed information about each dataset, including its clinical task, imaging modality, cohort size, data format, region of interest, annotation process, label availability, and license.

Fig. 1
figure 1

Study flowchart with the inclusion and exclusion criteria

Data preprocessing

Raw data were acquired from various sources in different formats, including DICOM (Digital Imaging and Communications in Medicine), NIfTI (Neuroimaging Informatics Technology Initiative), MetaImage, and others. Data preprocessing was performed to ensure uniformity and compatibility across all datasets for subsequent analysis. This process involved conversion of images and segmentations into the NIfTI format, intensity normalization, and resampling to a common voxel size. Preprocessing parameters are described in detail in Supplement S1. Standard Python libraries, including SimpleITK [15], NiBabel [16], and PlatiPy [17], were used for processing volumetric medical imaging data. Multichannel images were split into separate volumetric images, and segmentations with multiple labels (e.g., for multiple ROIs, organs, or readers) were split into separate segmentations. Modalities, ROI names, and readers were explicitly encoded in the filenames as well as in the tables containing the relevant metadata. Each image-segmentation pair was assigned a unique ID to streamline subsequent feature extraction. Data identified as corrupt were excluded, with all associated errors carefully logged.
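
The snippet below is a minimal sketch of the kind of conversion, resampling, and label splitting described above, using SimpleITK; the file paths, the 1-mm isotropic target spacing, and the interpolation choices are illustrative assumptions, and the exact parameters applied to each dataset are given in Supplement S1 and the code repository.

```python
import numpy as np
import SimpleITK as sitk

def resample_to_spacing(image, new_spacing=(1.0, 1.0, 1.0), is_mask=False):
    """Resample a volume to a common voxel spacing (illustrative 1 mm isotropic default)."""
    orig_size, orig_spacing = image.GetSize(), image.GetSpacing()
    new_size = [int(round(sz * sp / nsp))
                for sz, sp, nsp in zip(orig_size, orig_spacing, new_spacing)]
    interpolator = sitk.sitkNearestNeighbor if is_mask else sitk.sitkBSpline
    return sitk.Resample(image, new_size, sitk.Transform(), interpolator,
                         image.GetOrigin(), new_spacing, image.GetDirection(),
                         0, image.GetPixelID())

# Convert an image (assumed readable by SimpleITK, e.g., MetaImage) to NIfTI.
image = resample_to_spacing(sitk.ReadImage("case_001_image.mha"))   # hypothetical path
sitk.WriteImage(image, "case_001_image.nii.gz")

# Split a multi-label segmentation into one binary NIfTI mask per label (ROI/organ/reader).
seg = resample_to_spacing(sitk.ReadImage("case_001_seg.mha"), is_mask=True)
seg_array = sitk.GetArrayFromImage(seg)
for label in np.unique(seg_array)[1:]:                                # skip background (0)
    mask = sitk.GetImageFromArray((seg_array == label).astype(np.uint8))
    mask.CopyInformation(seg)                                         # keep origin/spacing/direction
    sitk.WriteImage(mask, f"case_001_seg_label{int(label)}.nii.gz")
```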

Feature extraction

Radiomic features were extracted from the segmented regions of interest using standardized methods. Included radiomics feature classes were selected from the standardized set of features validated in the Image Biomarker Standardization Initiative [10] and included first-order statistics, 3D shape-based features, and texture features derived from the Gray Level Size Zone (GLSZM), Gray Level Dependence Matrix (GLDM), Gray Level Co-occurrence Matrix (GLCM), Neighbouring Gray Tone Difference Matrix (NGTDM), and Gray Level Run Length Matrix (GLRLM). Feature extraction was performed using the open-source AutoRadiomics [18] framework, which performs the standard extraction based on the pyradiomics [19] library. Extraction parameters are detailed in Supplement S2.
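
For illustration, the following shows a direct pyradiomics extraction with settings of the kind used in this study; the actual extraction was run through AutoRadiomics, and the bin width, resampling, and file paths shown here are assumptions rather than the exact Supplement S2 parameters.

```python
from radiomics import featureextractor

# Illustrative settings; the values applied to each dataset are documented in Supplement S2.
settings = {
    "binWidth": 25,                      # intensity discretization
    "resampledPixelSpacing": [1, 1, 1],  # resample to isotropic voxels before extraction
    "interpolator": "sitkBSpline",
}
extractor = featureextractor.RadiomicsFeatureExtractor(**settings)

# Enable the IBSI-aligned feature classes listed above.
extractor.disableAllFeatures()
for feature_class in ["firstorder", "shape", "glcm", "glrlm", "glszm", "gldm", "ngtdm"]:
    extractor.enableFeatureClassByName(feature_class)

# One image-segmentation pair (hypothetical paths); returns an ordered dict of feature values.
features = extractor.execute("case_001_image.nii.gz", "case_001_seg_label1.nii.gz", label=1)
print({k: v for k, v in features.items() if not k.startswith("diagnostics")})
```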

All the processing steps for each dataset were run as a single script using Python 3.10 and are documented in the code repository at https://github.com/pwoznicki/RadiomicsHub. The repository is distributed under the permissive MIT license.

Dataset repository

We have built a dedicated website for the project, which conveniently presents all the extracted metadata for each dataset, along with tables of radiomics features, clinical data, and labels. It can be accessed at https://radiomics.uk. The website provides backlinks to the original data sources and references to studies that have used each dataset. Radiomics features and clinical parameters can be directly downloaded and used to develop machine-learning models for the prediction of specific clinical outcomes.
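
As a sketch of the intended workflow, a downloaded feature table and label table can be merged and used to fit a simple classifier; the file names, the ID column, and the label column below are hypothetical placeholders rather than the actual column naming used on the website.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical file and column names; adjust to the tables downloaded from https://radiomics.uk.
features = pd.read_csv("dataset_radiomics_features.csv")
labels = pd.read_csv("dataset_labels.csv")
data = features.merge(labels, on="ID")

X = data.drop(columns=["ID", "label"])
y = data["label"]

# Simple baseline: standardized features fed into a random forest, evaluated with 5-fold CV.
model = make_pipeline(StandardScaler(), RandomForestClassifier(n_estimators=200, random_state=0))
print(cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean())
```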

Results

Dataset overview

Out of 143 open-access datasets reviewed, we identified 29 datasets suitable for radiomics analysis, covering a wide range of cancer types and imaging modalities. The datasets encompassed 10,354 patients, 15,221 studies, and 49,515 scans. The most common organ of interest was the lung (7 datasets), followed by the head and neck (6 datasets), the brain (5 datasets), and the prostate, the liver, and the soft tissue (3 datasets each). Gastrointestinal tract and kidney tumors were each represented by a single dataset. Table 1 presents the core statistics of the datasets, including the clinical tasks and imaging modalities used. The tasks ranged from binary classification (15 datasets) and multi-class classification (1 dataset) to survival analysis (11 datasets) and repeatability assessments (3 datasets). The most common imaging modality was computed tomography (CT), followed by magnetic resonance imaging (MRI) and combined positron emission tomography/computed tomography (PET/CT). Figure 2 showcases the diversity of imaging modalities and disease focuses through representative ROIs from each dataset.

Table 1 Core statistics of the datasets, including clinical tasks and imaging modalities used
Fig. 2
figure 2

Examples of regions of interest from each dataset, demonstrating the diversity in imaging modalities and disease focuses

Data formats and annotation methods

Table 2 provides an overview of the image and segmentation formats used in the datasets, the segmented ROIs, and the annotation types. The original image formats included DICOM, NIfTI, and MetaImage. The segmentation formats comprised DICOM Segmentation object (DICOM-SEG), DICOM Radiation Therapy Structure set (DICOM-RT), NIfTI, MetaImage, and Stereolithography (STL). The primary ROI varied across datasets but was typically the tumor region; a few datasets additionally provided segmentations of organs of interest (lung, prostate, liver, kidney). The included datasets utilized manual, semiautomatic, and automatic segmentation techniques. Manual segmentations were performed by expert radiologists and radiation oncologists, while automatic segmentation methods employed state-of-the-art algorithms based on convolutional neural networks, such as U-Net [48] and its variants. Three datasets included segmentations from multiple readers.

Table 2 Overview of image and segmentation formats as well as segmentation region of interest (ROI), imaging phase or sequence and annotation type

Detailed dataset description

The data sources, study periods, licensing, and cohort sizes are presented in Table 3. The study periods ranged from the early 1990s to 2021, with 15 studies concluding after 2013. Most datasets were licensed under Creative Commons licenses (versions 3.0 and 4.0), which permit non-commercial and commercial usage as well as redistribution, whereas some datasets had custom or restricted licenses. The number of patients per dataset varied from 15 to 1,476, the number of studies ranged from 30 to 7,380, and the number of scans ranged from 62 to 11,523. The largest dataset in terms of patients was the PI-CAI dataset (n = 1,476) for detecting clinically significant prostate cancer using MRI. The LIDC-IDRI dataset, which focuses on lung nodule classification using CT, contained 1,010 patients and 1,308 studies. The UCSF-PDGM dataset, comprising brain tumor MRI cases, included the most scans (n = 11,523) for 495 patients, which can be attributed to the acquisition of multiple sequences, including T2w, FLAIR, SWI, DWI, T1w, T1CE, ASL, and HARDI.

Table 3 Overview of dataset times, sources, licensing and cohort sizes (CC—Creative Commons)

Clinical labels and predictors

Table 4 describes labels and clinical predictors provided for each dataset. The labels included health outcomes (overall survival, recurrence- and progression-free survival), pathologic tumor type and grade, TNM status, genetic markers, and imaging-based scores. Clinical predictors varied across datasets, including demographic information (age, sex, BMI), medical history (risk factors), laboratory parameters, clinical scores, and treatment details.

Table 4 A detailed description of dataset labels and clinical predictors

Radiomics features

All datasets were successfully preprocessed, and radiomics features were extracted with the specified settings. The results of the preprocessing and extraction for each dataset are available online at https://radiomics.uk, with an overview of the website provided in Fig. 3. The website presents each dataset with its detailed metadata, examples, links to sources, the code used for extraction, and logs. At its core are the tables with radiomics features and labels, which are available for download. The website also includes a form that allows users to request a new dataset. We also investigated the distributions of two core radiomics features, mean intensity and major axis length, across overlapping regions of interest and imaging modalities in our collection. Figure 4 shows a substantial overlap in the distributions of these features, which emphasizes the potential for integrating multiple datasets for a more extensive evaluation.

Fig. 3
figure 3

View of metadata and extraction artifacts for a selected dataset (LIDC-IDRI). a Dropdown menu for dataset selection, b most important dataset information, c extraction success rate, d detailed dataset information, e logs for download, f radiomics features, g labels. An interactive version of the wiki is available at https://radiomics.uk

Fig. 4
figure 4

Scatterplot illustrating the relationship between core radiomics features. Mean intensity and major axis length for shared regions of interest and modality are plotted across multiple datasets. The substantial overlap observed in the feature distributions suggests the feasibility of merging these datasets for a comprehensive evaluation
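
A comparison of this kind can be reproduced from the published feature tables roughly as follows; the dataset names and file paths below are placeholders, and the two column names are assumed to follow the default pyradiomics naming (original_firstorder_Mean and original_shape_MajorAxisLength) rather than being confirmed names from the repository tables.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder dataset names and file paths; column names assumed from default pyradiomics output.
datasets = {"Dataset A": "dataset_a_features.csv", "Dataset B": "dataset_b_features.csv"}

fig, ax = plt.subplots(figsize=(6, 5))
for name, path in datasets.items():
    df = pd.read_csv(path)
    ax.scatter(df["original_shape_MajorAxisLength"], df["original_firstorder_Mean"],
               s=10, alpha=0.5, label=name)
ax.set_xlabel("Major axis length (mm)")
ax.set_ylabel("Mean intensity")
ax.legend()
plt.show()
```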

Discussion

In this study, we introduced RadiomicsHub, a repository and wiki designed to streamline the utilization of open-access cancer imaging datasets for radiomics research. The primary goal of RadiomicsHub is to enable the efficient evaluation of novel radiomics models on benchmark datasets, addressing the time-consuming task of locating appropriate datasets with segmentations and outcomes/labels and preprocessing them from their raw form. Our detailed examination of publicly available datasets revealed a collection well-suited for radiomics research. The key findings reveal great diversity in imaging modalities, data formats, segmentation techniques, clinical labels, and predictors across the datasets, with comprehensive details and the associated radiomics features made readily accessible online.

By converting all datasets into a common format (NIfTI) and making the conversion process reproducible and traceable, RadiomicsHub ensures consistency and reliability in the data used for model evaluation. Furthermore, the extraction of radiomics features from each dataset using various parameter settings, together with the availability of metadata and descriptions online, contributes to a comprehensive and accessible platform for researchers, which may serve as a bridge between nuanced radiomics research and practical clinical oncology care. Importantly, standardized and processed radiomics data are invaluable for developing robust machine learning models trained on high-quality, validated public datasets. As a rich, curated repository of radiomics features, RadiomicsHub emerges as a potential catalyst in translating radiomics research findings into tangible clinical applications. We are committed to maintaining and expanding the project in collaboration with the research community.

RadiomicsHub builds upon existing open science projects and repositories, such as the TCIA, Grand Challenge, and Zenodo, which have laid the foundation for sharing imaging datasets. A few other notable projects committed to promoting open science and collaboration exist in the domain of medical imaging. EUCanImage [49] is a consortium that is building a highly secure, federated, and large-scale cancer imaging platform across Europe, aimed at enhancing the use of AI in oncology. Although there are parallels in our goal to identify and utilize cancer imaging data, EUCanImage is a large initiative focusing on data exchange and storage. In contrast, our study focuses on providing the methods to preprocess and extract radiomics features that can be reproduced locally. The National Cancer Institute (NCI) Imaging Data Commons (IDC) [50] is a cloud-based platform that provides access to diverse cancer-related medical imaging datasets from various sources, including TCIA and other NCI-supported projects. It aims to facilitate the development and validation of AI models, computational models, and quantitative imaging methods by making it easier for researchers to find, access, and analyze large-scale imaging datasets. Open Access Series of Imaging Studies (OASIS) [51] is another project that offers a publicly accessible collection of neuroimaging data, including cross-sectional and longitudinal MRI data. Other large-scale initiatives contributing valuable imaging data to their respective research fields include Alzheimer’s Disease Neuroimaging Initiative (ADNI) [52], UK Biobank [53], and the German National Cohort (NAKO) [54] studies. ADNI focuses on collecting and sharing Alzheimer’s disease-related data, including MRI and PET images. The UK Biobank offers an extensive collection of genetic, lifestyle, and health data from half a million UK participants, including brain, cardiac, and abdominal MRI datasets. The NAKO study investigates the causes of chronic diseases by collecting a wealth of data, including imaging data, from a large German population.

Our study complements these initiatives by focusing on providing standardized and processed radiomics data, making it a specialized resource for the radiomics research community. As a living repository, it has the potential to grow and adapt to the evolving needs of the community by incorporating new datasets, feature sets, and tools reflecting the latest developments and innovations in the field. With a commitment to open science and a focus on collaborative research, we hope its results will stimulate further research and innovation within the research community, further expanding its scope and capabilities. We hope that through this dynamic nature, RadiomicsHub will remain relevant and valuable to researchers, fostering collaboration and accelerating the progress of radiomics research. We believe that pooling different datasets will spark interest in novel research questions, such as the impact of study-specific parameters (acquisition parameters, study time, annotation method, and quality) on the distribution of radiomics features and clinical variables.

While we have focused on the core features of RadiomicsHub, there are potential areas for expansion and improvement. For instance, allowing single images and segmentations to be downloaded through an API or providing TotalSegmentator [4] organ masks for CT datasets could enhance the platform’s utility. Additionally, offering baseline models for each dataset could assist researchers in comparing the performance of their models against established benchmarks; this could be achieved using the recently published AutoRadiomics [18] framework. Furthermore, feature harmonization methods, such as ComBat [55], could be used to compensate for multicenter effects affecting the extracted radiomics features. ComBat can align feature distributions across different sites without performing any additional image processing. Adding this step to subsequent analyses would help ensure that models trained on our data work reliably in various settings, which is necessary for successful clinical translation.
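
To illustrate the underlying idea, a simplified location-and-scale adjustment can realign per-site feature distributions; the sketch below is only a ComBat-like approximation and omits the empirical Bayes shrinkage of the batch estimates that the full ComBat method applies, so established ComBat implementations would be preferable in practice.

```python
import pandas as pd

def location_scale_harmonize(features: pd.DataFrame, batch: pd.Series) -> pd.DataFrame:
    """Rescale each batch (e.g., site or scanner) to the pooled mean and standard deviation.

    Simplified, ComBat-like sketch: full ComBat additionally shrinks the per-batch
    location/scale estimates with empirical Bayes and can preserve covariates of interest.
    Assumes `features` (samples x features) and `batch` share the same index.
    """
    pooled_mean = features.mean()
    pooled_std = features.std(ddof=1).replace(0, 1.0)
    harmonized = features.astype(float).copy()
    for b in batch.unique():
        idx = batch == b
        b_mean = features.loc[idx].mean()
        b_std = features.loc[idx].std(ddof=1).replace(0, 1.0)
        harmonized.loc[idx] = (features.loc[idx] - b_mean) / b_std * pooled_std + pooled_mean
    return harmonized
```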

There are potential risks and challenges associated with RadiomicsHub. One such concern is the possibility of introducing errors or generating non-meaningful processed data and features during the conversion and preprocessing steps. To address this concern, we implemented robust quality control measures, including standard, reproducible processing instructions and error logging. Volumes were tested against various assertions, including correct dimensionality, shape, label presence, and valid ROI placement. However, despite our efforts, a residual risk regarding the integrity and accuracy of the data remains.
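
The sketch below illustrates the kind of assertions referred to above for a single image-segmentation pair; it is not the exact validation code from the repository, and the spacing tolerance and file paths are assumptions.

```python
import numpy as np
import SimpleITK as sitk

def check_pair(image_path, mask_path):
    """Return a list of detected problems for one image-segmentation pair (empty if none)."""
    errors = []
    image, mask = sitk.ReadImage(image_path), sitk.ReadImage(mask_path)
    if image.GetDimension() != 3:
        errors.append("image is not volumetric (3D)")
    if image.GetSize() != mask.GetSize():
        errors.append("image and mask shapes differ")
    if not np.allclose(image.GetSpacing(), mask.GetSpacing(), atol=1e-3):
        errors.append("image and mask voxel spacings differ")
    if sitk.GetArrayFromImage(mask).max() == 0:
        errors.append("mask contains no foreground label")
    return errors

problems = check_pair("case_001_image.nii.gz", "case_001_seg_label1.nii.gz")  # hypothetical paths
if problems:
    print("\n".join(problems))  # in the pipeline, such errors would be logged and the case excluded
```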

Conclusions

In this study, we developed a comprehensive repository with radiomics features from public cancer imaging datasets that can be readily used for robust evaluation of radiomics models. We addressed the challenges associated with dataset preprocessing and radiomics feature extraction, ensuring reproducibility and offering our scripts for reuse. We believe that fostering a collaborative research environment and promoting standardized datasets can accelerate the discovery of new biomarkers and improve clinical decision-making in oncology and beyond.