Data Curation for Preclinical and Clinical Multimodal Imaging Studies
In biomedical research, imaging modalities help uncover pathological mechanisms and support the development and evaluation of novel diagnostic and theranostic approaches. However, while standards for data storage exist in the clinical medical imaging field, data curation standards for biomedical research are yet to be established. This work aimed at developing a free, secure file format for multimodal imaging studies that supports common in vivo imaging modalities with up to five dimensions, as a step towards establishing data curation standards for biomedical research.
Images are compressed using a lossless compression algorithm, and cryptographic hashes are computed on the compressed image slices. Hashing and compression run in parallel, speeding up computation in proportion to the number of available cores. The hashed images, together with digitally signed timestamps, are then written to file. Fields in the structure, compressed slices, hashes, and timestamps are serialized for writing to and reading from files. The C++ implementation is tested on multimodal data from six imaging sites, is well documented, and is integrated into a preclinical image analysis software.
The format has been tested with several imaging modalities, including fluorescence molecular tomography/x-ray computed tomography (CT), positron emission tomography (PET)/CT, single-photon emission computed tomography/CT, and PET/magnetic resonance imaging. To assess performance, we measured the compression rate, compression ratio, and time spent in compression. Additionally, we measured the time and rate of writing to and reading from a network drive. Our findings demonstrate close to 50 % reduction in storage space for μCT data. Parallelization speeds up the hash computations by a factor of 4, and we achieve a compression rate of 137 MB/s for a file of size 354 MB.
The development of this file format is a step towards abstracting and curating common processes involved in preclinical and clinical multimodal imaging studies in a standardized way. This work also defines a cleaner interface between multimodal imaging devices and analysis software.
Key Words: Data curation · Reproducibility · Credibility · Timestamp · Multimodal imaging · Metadata · Compression · Cryptographic hashing · File format · Serialization
Data curation is the active management of data throughout its lifecycle of interest and usefulness to scholarly and educational activities, ensuring that data remain available for discovery and reuse. For instance, in accordance with the rules of good scientific practice and as a foundation for quality-assured research data, the German Research Foundation (DFG) requests that primary data be stored for ten years in sustainable and secure repositories, either at the institution where they were collected or in a nationwide infrastructure [4, 5, 6]. Cipra and Taubes, in their separate publications in Science, suggest the use of digital timestamps to prove that a given document existed at a particular time and to assure data integrity, in an attempt to enhance the credibility of published scientific data [7, 8].
Different imaging modalities store their acquired images in various data formats such as DICOM, NIfTI, etc. These formats may not be sufficient for efficient data manipulation, analysis, and image processing. Similarly, manufacturers of medical imaging systems use different proprietary formats to store images in digital form. The problem of divergent file formats is even greater for preclinical imaging devices. These format differences pose significant challenges for multimodal imaging studies, particularly in the registration, fusion, analysis, and curation of image datasets. A further difficulty arises if the utilized analysis software becomes obsolete, inaccessible, or undergoes repeated substantive changes that are not backward compatible [9, 10]. Post-processing software tools for kinetic modeling, spectral unmixing, and relaxometry require exact information about time points and channels. Our file format seeks to serve as the necessary bridge between imaging devices and multiple analysis software tools and to enhance post-processing.
The aim of our study was to provide the research imaging community with a free secured file format for multimodal imaging that is compatible with all in vivo imaging modalities for five dimensions, supports multichannel and multitime series data, and equipped with the possibility to write cryptographically hashed images with or without trusted timestamp to file. This helps to improve the integrity of study data and promotes reproducibility and continuity. We developed this new image format in cooperation with users and providers of imaging hardware and software.
Materials and Methods
Content of the File
Many of the current imaging modalities are able to acquire multiple images successively over time, where time is considered the fourth dimension. Time points may be equidistant, as in μCT scanners and ultrasound devices, or non-equidistant, as in PET scanners. In addition, some of these devices are able to acquire multichannel information, with the number of channels as the fifth dimension, as in dual-energy CT, multichannel fluorescence molecular tomography (FMT), MRI, multi-isotope SPECT, or the successive use of different scanning protocols. It is desirable to have a single image file format that supports both isotropic (e.g., usually provided by μCT and FMT scanners) and anisotropic (e.g., in ultrasound and MRI modalities) voxels, multiple time points, and multiple channels. We propose a simple image file format with extension “gff” that writes and reads up to five-dimensional images to and from a single file. It consists of a data structure that serially stores voxel dimensions, sizes, and type, rotation and translation data elements, time and channel information, compression information, as well as metadata. We maintain both structured metadata (e.g., voxel size, time point, and channels) and unstructured or more general metadata represented as key-value pairs (such as data for standardized uptake values (SUV), weight, age, etc.) in the format. The unstructured metadata can be easily deleted to ensure anonymity without negatively affecting the data needed for analysis. The format is capable of storing voxel data of any type, ranging from char, half, float, double, and signed and unsigned short integer through to long integer, representing the data size per voxel. The X, Y, and Z dimensions provide the orthogonal spatial dimensions, while dimensions T and C define the time and channel dimensions, respectively. The X, Y, Z, T, and C dimensions are represented by 64-bit signed integers to support very large images.
The voxel size is of type double and is measured in millimeters. As an option, we store scale factor and offset values to be applied to the voxel intensities. These values are stored as part of the structure. If this option is enabled, the software that reads the data should apply the scale and offset and convert the voxel intensities to float type. Our format stores all data in little-endian byte order in the file.
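The optional scale/offset conversion described above can be sketched as follows; the function name and the int16 raw type are illustrative assumptions, not the format's actual API.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: applying the optional scale factor and offset stored
// in the header to raw voxel data. Per the format's convention, a reader
// converts the intensities to float when this option is enabled.
std::vector<float> applyScaleOffset(const std::vector<int16_t>& raw,
                                    double scale, double offset) {
    std::vector<float> out;
    out.reserve(raw.size());
    for (int16_t v : raw)
        out.push_back(static_cast<float>(v * scale + offset)); // intensity = raw * scale + offset
    return out;
}
```

Storing only scale and offset keeps the raw integer data compact on disk while still allowing quantitative float intensities after reading.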
We provide two fields of type double in the file format that store the nine elements of the rotation matrix R and the three elements of the translation vector t, which together map voxel coordinates to world coordinates. The inverse operation converts world coordinates back to voxel coordinates. The world position is provided in millimeters, while the voxel position is provided in a manner similar to the numbering of the voxels, with X being the fastest dimension and Z the slowest.
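The voxel-to-world mapping and its inverse can be sketched as below. This is a minimal illustration assuming the mapping world = R · (index ∘ voxelSize) + t with R a pure rotation (orthonormal, so its inverse is its transpose); function names are hypothetical.

```cpp
#include <array>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<double, 9>; // row-major 3x3 rotation matrix R

// Voxel index -> world position (mm): world = R * (index .* voxelSize) + t
Vec3 voxelToWorld(const Vec3& index, const Vec3& voxelSize,
                  const Mat3& R, const Vec3& t) {
    Vec3 s{index[0] * voxelSize[0], index[1] * voxelSize[1], index[2] * voxelSize[2]};
    Vec3 w{};
    for (int r = 0; r < 3; ++r)
        w[r] = R[3 * r + 0] * s[0] + R[3 * r + 1] * s[1] + R[3 * r + 2] * s[2] + t[r];
    return w;
}

// Inverse, assuming R is orthonormal so R^-1 = R^T:
// voxel = (R^T * (world - t)) ./ voxelSize
Vec3 worldToVoxel(const Vec3& world, const Vec3& voxelSize,
                  const Mat3& R, const Vec3& t) {
    Vec3 d{world[0] - t[0], world[1] - t[1], world[2] - t[2]};
    Vec3 v{};
    for (int r = 0; r < 3; ++r)
        v[r] = (R[0 + r] * d[0] + R[3 + r] * d[1] + R[6 + r] * d[2]) / voxelSize[r];
    return v;
}
```

The transpose trick avoids a general matrix inversion, which is why storing a proper rotation (rather than an arbitrary affine matrix) keeps both directions of the transform cheap.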
Time Point and Channel Information
The format stores the different time points at which the devices capture 3D volumes, a characteristic feature of four-dimensional images. Both equidistant and non-equidistant time points are supported. The metadata saved for each time point corresponds to the center time point and the duration of each frame. We store the channel centers and widths as double data types. The dimensions representing the number of time points and channels are stored as 64-bit integers. The field sizes used in storing the time point and channel information depend on the number of time points and channels used. We support both equidistant and non-equidistant channel step information in the format.
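The per-frame (center, duration) representation above can be sketched as follows; the struct and helper names are illustrative assumptions, and non-equidistant series would simply be filled element by element instead of generated.

```cpp
#include <vector>

// Sketch: per-frame time metadata as (center, duration) pairs, which covers
// both equidistant and non-equidistant acquisitions. Units are assumed to be
// seconds for illustration.
struct TimePoint {
    double center;   // center of the frame
    double duration; // duration of the frame
};

// An equidistant series (e.g., a uCT or ultrasound time series) can be
// generated from a start time, a fixed step, and a frame count.
std::vector<TimePoint> equidistant(double start, double step, int n) {
    std::vector<TimePoint> tp;
    tp.reserve(n);
    for (int i = 0; i < n; ++i)
        tp.push_back({start + i * step, step});
    return tp;
}
```

For non-equidistant PET frames, each pair is stored explicitly, so the same structure serves both cases without a format change.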
We enable users to store unstructured metadata, such as the body weight, age, and sex needed for PET-CT or SPECT-CT SUV computation, as key-value pairs. The metadata can represent any data type but is stored as key-value pairs of type string. We define a data dictionary of type string, which contains entries that describe the unstructured metadata; the dictionary is used to check keys for validity. These metadata strings are encoded as UTF-8 Unicode. Similarly, we support file and folder names in Unicode.
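The dictionary-checked key-value store described above can be sketched as follows. The class name and the example dictionary entries (weight, age) are illustrative assumptions, not the format's actual controlled vocabulary.

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>

// Sketch of the unstructured-metadata store: UTF-8 string key-value pairs,
// with keys validated against a predefined data dictionary.
class Metadata {
public:
    explicit Metadata(std::set<std::string> dictionary)
        : dictionary_(std::move(dictionary)) {}

    // Setter that rejects keys not present in the dictionary.
    void set(const std::string& key, const std::string& value) {
        if (dictionary_.count(key) == 0)
            throw std::invalid_argument("unknown metadata key: " + key);
        pairs_[key] = value;
    }

    // Anonymization: the unstructured pairs can be dropped wholesale without
    // touching the structured fields needed for analysis.
    void clear() { pairs_.clear(); }

    std::size_t size() const { return pairs_.size(); }

private:
    std::set<std::string> dictionary_;
    std::map<std::string, std::string> pairs_; // stored as UTF-8 strings
};
```

Keeping the dictionary separate from the pairs is what makes the key validation and the one-call anonymization both straightforward.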
Cryptographic hash functions apply a one-way algorithm to input data of arbitrary length to produce a fixed-length output. The hash function is a powerful tool whose application helps protect the authenticity and integrity of information. We use the one-way SHA-256 algorithm to compute a cryptographic hash of the entire file. We compute all hashes in parallel, resulting in negligible overhead. We further implemented timestamp functionality based on the OpenSSL cryptographic library, such that timestamps are appended to files written in our proposed “gff” file format. The cryptographic hash of the data is sent as a timestamp request to our chosen TSA (e.g., Bundesdruckerei, DFN timestamp server, etc.), which, upon receipt, generates the timestamp and sends it back as a response as per the definitions of RFC 3161. The timestamp can be verified locally, but its creation requires internet connectivity to reach the TSA. Due to this requirement, the inclusion of a timestamp is optional.
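The one-task-per-slice parallel hashing pattern can be sketched as below. To keep the example dependency-free, a 64-bit FNV-1a hash stands in for SHA-256; the format itself uses SHA-256, and the function names are hypothetical.

```cpp
#include <cstdint>
#include <future>
#include <vector>

// Stand-in hash (FNV-1a, 64-bit) used here in place of SHA-256 so the
// sketch compiles without OpenSSL; only the parallelization pattern matters.
uint64_t fnv1a(const std::vector<uint8_t>& slice) {
    uint64_t h = 14695981039346656037ull; // FNV offset basis
    for (uint8_t b : slice) {
        h ^= b;
        h *= 1099511628211ull; // FNV prime
    }
    return h;
}

// One asynchronous task per compressed slice; slices may finish in any
// order, which is fine because each result is stored at its slice index.
std::vector<uint64_t> hashSlices(const std::vector<std::vector<uint8_t>>& slices) {
    std::vector<std::future<uint64_t>> tasks;
    tasks.reserve(slices.size());
    for (const auto& s : slices)
        tasks.push_back(std::async(std::launch::async, fnv1a, std::cref(s)));
    std::vector<uint64_t> hashes;
    hashes.reserve(tasks.size());
    for (auto& t : tasks)
        hashes.push_back(t.get());
    return hashes;
}
```

Because each slice is hashed independently, the speedup scales with the number of available cores, which matches the parallel hashing behavior reported for the format.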
Memory Layout of File Format
The fields or objects in the file structure include versioning numbers, header size, dimensions and sizes of voxels, voxel data type, affine transform elements, time dimensions, channel dimensions, and compression information, among others, and these are represented in the header structure as shown in Fig. 3. A 32-bit tag (0xD8CA0B00) is placed at the beginning of the structure as a signature that uniquely identifies the file type. The same tag is placed at the end of the structure, but only as a control. It is important to note that the memory space allotted to the header for each structure depends on the number of dimensions of the structure. Data pertaining to all these fields are written to the file serially. In addition, a metainformation field that stores a map of key-value pairs is serialized to the structure. The addresses pointing to the beginning of each slice (sliceStarts) and the lengths of each slice (sliceLengths) are saved in their respective vectors and are serially saved in memory. These vectors allow for reading a subset of slices. The addresses in the sliceStarts vector correspond to the compressed slices of the acquired image data that are serially written to the file, while the sliceLengths vector provides the size in bytes of each slice. These two vectors have a certain degree of redundancy on purpose, to allow serialization of slices in arbitrary order, e.g., the order of generation, which also simplifies parallel encoding and decoding of individual slices. Finally, the hash and optional timestamps are appended to the end of the file. Figure 3 gives a pictorial representation of the memory layout of the proposed file format.
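A heavily simplified serialization of this layout can be sketched as follows. The field selection and order are illustrative assumptions (the real header carries many more fields, such as voxel type, transform elements, and metadata); only the signature tag 0xD8CA0B00, the little-endian encoding, and the sliceStarts/sliceLengths vectors come from the description above.

```cpp
#include <cstdint>
#include <vector>

const uint32_t kTag = 0xD8CA0B00; // opening signature and closing control tag

// Append a value in little-endian byte order, as the format prescribes.
void putU64(std::vector<uint8_t>& buf, uint64_t v) {
    for (int i = 0; i < 8; ++i) buf.push_back(uint8_t(v >> (8 * i)));
}
void putU32(std::vector<uint8_t>& buf, uint32_t v) {
    for (int i = 0; i < 4; ++i) buf.push_back(uint8_t(v >> (8 * i)));
}

// Sketch: tag, five 64-bit dimensions (X, Y, Z, T, C), slice count, then the
// sliceStarts and sliceLengths vectors, and the closing control tag.
std::vector<uint8_t> serializeHeader(const uint64_t dims[5],
                                     const std::vector<uint64_t>& sliceStarts,
                                     const std::vector<uint64_t>& sliceLengths) {
    std::vector<uint8_t> buf;
    putU32(buf, kTag);
    for (int i = 0; i < 5; ++i) putU64(buf, dims[i]);
    putU64(buf, sliceStarts.size());
    for (uint64_t s : sliceStarts) putU64(buf, s);
    for (uint64_t l : sliceLengths) putU64(buf, l);
    putU32(buf, kTag);
    return buf;
}
```

Writing fields serially in a fixed order is what lets a reader locate sliceStarts/sliceLengths cheaply and then fetch an arbitrary subset of slices.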
Example Images Used for Experimentation and Evaluation
We also performed a comparative study to identify the costs involved in saving with and without cryptographic hashes. From the graph in Fig. 6d, we observed that saving with a hash is computationally more expensive than saving without one. However, our experiments showed that the difference between saving with and without the cryptographic hash was not statistically significant (P > 0.05). Hence, considering the integrity benefits that cryptographic hashes provide, we suggest saving with a hash to achieve the set security goal.
Image Compression Measurements
Table: File compression analysis using the DEFLATE and HUFFMAN-only options: sizes of the experimental input files before and after compression at level 5, with the corresponding compression and decompression ratios and rates. Columns: image file type, input file size (MB), output file size (MB), compression rate (MB/s), and decompression rate (MB/s).
In this work, we proposed and developed a simple image file format that is usable for all in vivo imaging modalities generating volumetric datasets with regular grids. This new file format is particularly suitable for multimodal imaging including CT-FMT, PET-CT, SPECT-CT, and PET-MRI. We provided an implementation with few dependencies: the zlib library, the OpenSSL library (optional), and Gay's SHA-256 implementation [17, 21]. The software is platform-independent, and the simple implementation is contained in two C++ header files. The file format was developed in close collaboration with device manufacturers, including Inviscan (Strasbourg, France), Molecubes (Ghent, Belgium), and MILabs (Utrecht, the Netherlands), and the software supplier Gremse-IT (Aachen, Germany) to improve the interface between imaging hardware and analysis software.
The generation and/or collection of multidimensional medical image sequences, such as in vivo μCT images, functional magnetic resonance imaging (fMRI), etc., is increasing at a rapid rate and necessitates structured and secure measures for managing, accessing, preserving, and reusing this huge amount of data, since such digital data are typically not stored in long-term institutions such as libraries [6, 22, 23]. The data also require a considerable amount of storage space, on the order of several gigabytes per acquisition, posing storage difficulties. However, image data have different types of redundancies, which compression algorithms (both lossy and lossless) take advantage of in order to increase the effective data densities on storage devices and optimize transmission costs [24, 25, 26]. Lossy compression algorithms, though they produce smaller files, are not the choice for clinical use and preclinical research, since they may eliminate certain critical information needed for diagnosis, analysis, and legal purposes [27, 28]. In this work, we employ the lossless compression implemented in the widely adopted and freely available zlib library. File formats characterized by properties such as complete and open documentation, platform independence, and lossless compression, among others, have been identified as capable of preserving their contents and functionality over the long term [29, 30].
Metadata controls most steps of the curation process, from preservation to access and reuse [5, 31]. However, its usage raises concerns about anonymity in clinical research, since it may include subjects' personal information and other traceable information that could raise privacy concerns. Due to this and other complexities, certain image file formats such as NIfTI-1 provide limited support for the storage of the various image acquisition parameters, as opposed to what is available in DICOM. Also, writing a single large file to disk is comparatively more efficient and cost-effective than writing several small files. Hence, in our file format, data is written as a single file per modality, as opposed to other file formats.
To assess the performance of our file format, we measured the rates, times, and ratios of compression and decompression, and also measured the times and rates of writing to and reading from a network drive. We took particular interest in the network drive analysis, since it is the most heavily used drive in research centers, institutes, and clinical centers. We recorded average writing and compression rates of 28 MB/s and 137 MB/s, respectively, on our network drive, showing that writing an uncompressed file to a network drive is more time-consuming than compressing the same file. The disk reading and decompression rates recorded were 97 MB/s and 1288 MB/s, respectively, indicating that decompression is approximately 13 times faster than the disk reading speed and compression is approximately 5 times faster than the writing speed. Both hashing and compression/decompression are performed in parallel to minimize the duration of writing files and enhance efficiency. We achieve a hashing speed of 822 MB/s. The parallel compression achieves a speedup by a factor of 2 and 3 for the μCT and segmentation files, respectively.
It is worth mentioning that the reading and writing rates depend on whether multithreaded parallel processing is in use. To further enhance efficiency, GPUs could be used for these computations.
The proposed file format can store 3D data, multiple time series, and multichannel data resulting in what we refer to as five-dimensional data. Future extensions of the format may provide more dimensions to cater for time intervals and dimensions in ECG-gated images, respiratory-gated images, and dual-gated images, among others.
Incremental reading and writing of slices are enabled with the format. In addition, the files in this proposed format are easy to anonymize by changing the file name and removing key-value pairs constituting the unstructured metadata without the danger of losing relevant information. This provides privacy protection for study participants, concealing sensitive information that could easily be traceable. We provide a dictionary-based mechanism to store predefined keys for the key-value pairs to enhance standardization.
A unique feature of our format is the inclusion of trusted timestamps. Timestamps provide a legally accepted way to prove the existence of a file at a certain time. Timestamps can aid in ensuring credibility and reproducibility, properties that are crucial to scientific research [7, 8, 35]. The RFC 3161 specification states that only a hashed representation of the data or file should be time-stamped, to avoid unnecessary exposure of the original data. Timestamps are typically issued by trusted third parties or timestamp authorities (TSA). There are free timestamp servers for academic use and commercial servers whose service comes at a cost. For instance, the D-STAMP timestamp, provided by the German Bundesdruckerei (Federal Bureau for Printing), costs approximately 10 cents. The creation and checking of timestamps introduce some overhead in the writing and reading of files, respectively. However, considering the benefit of the timestamp and the negligible overhead for hash creation observed in our measurements, we recommend including the timestamp to achieve the integrity goal. It is important to note that the inclusion of the timestamp is optional and can be added later (e.g., in cases where internet connectivity is a problem). We also make the source code and sample datasets freely available for use and reproducibility purposes, and as a step towards cutting replication costs. Reproducibility of science is an aspect of credibility that requires researchers to provide proper and sufficient information as well as accurate documentation to enable the verification of their work [38, 39]. In recent times, some researchers, the National Institutes of Health (NIH), and other government funding bodies have observed a growing irreproducibility of science and poor data management, which may be incidental or deliberate.
Data curation methods and tools could help restore confidence and trust in scientific research and subsequently enhance credibility. The format provides backward compatibility support in the event of a version change.
In conclusion, we have developed a free file format and software tool for multimodal imaging studies that seeks to ease post-processing by addressing vendor variability and data import issues, particularly for complex image analysis methods. The development of this file format is a step towards archiving and curating preclinical and clinical scientific imaging studies in a standardized way. The features of the file format are summarized in Suppl. Table 1.
This research was supported by the German Academic Exchange Service (DAAD), German Research Foundation (DFG; GR 5027/2-1), German Higher Education Ministry (BMBF) (Biophotonics/13 N13355), Federal Government of North-Rhine Westphalia (EFRE), and the European Union (FP7), and a grant from the Interdisciplinary Centre for Clinical Research within the faculty of Medicine at the RWTH Aachen University (E8-13).
Compliance with Ethical Standards
Conflict of Interests
F. Gremse is the owner of Gremse-IT GmbH. F. Beekman holds shares in MILabs B.V. B. Vandeghinste holds shares in Molecubes NV.
- 4.DFG, Ger Res Foundation—Handling of research data. http://www.dfg.de/en/research_funding/proposal_review_decision/applicants/research_data/index.html. Accessed 25 Feb 2019
- 5.Ray JM (2014) Introduction to research data management. In: Ray JM (ed) Research Data Management: Practical Strategies for Information Professionals. Purdue University Press, West Lafayette, pp 1–22
- 6.Lord P, Macdonald A, Lyon L, Giaretta D (2004) From Data Deluge to Data Curation. Proceedings of the UK e-Science All Hands Meeting 2004, pp 371–37
- 10.Abrams S (2007) DCC digital curation manual Installment on file formats. HATII, University of Glasgow; University of Edinburgh; UKOLN, University of Bath; Council for the Central Laboratory of the Research Councils. https://www.era.lib.ed.ac.uk/handle/1842/3351. Accessed 27 Feb 2019
- 14.Pöschinger T, Renner A, Eisa F, Dobosz M, Strobel S, Weber TG, Brauweiler R, Kalender WA, Scheuer W (2014) Dynamic contrast-enhanced micro-computed tomography correlates with 3-dimensional fluorescence ultramicroscopy in antiangiogenic therapy of breast cancer xenografts. Investig Radiol 49:445–456
- 17.Gailly J, Adler M (2004) zlib compression library. http://www.dspace.cam.ac.uk/handle/1810/3486. Accessed 27 Feb 2019
- 18.Adams C, Cain P, Pinkas D, Zuccherato R (2001) Internet X.509 Public Key Infrastructure Time-Stamp Protocol (TSP). RFC 3161. https://doi.org/10.17487/RFC3161
- 19.Rosenhain S, Al Rawashdeh W, Kiessling F, Gremse F (2016) Sensitivity and accuracy of hybrid fluorescence-mediated tomography in deep tissue regions. J Biophotonics 10(9):1208–1216. https://doi.org/10.1002/jbio.201600232
- 21.Gay O (2007) Fast software implementation in C of the FIPS 180-2 hash algorithms SHA-224, SHA-256, SHA-384 and SHA-512. http://www.ouah.org/ogay/sha2/. Accessed 25 Feb 2019
- 26.Carpentieri B, Pizzolante R (2014) Lossless compression of multidimensional medical images for augmented reality applications. In: De Paolis L, Mongelli A (eds) Augmented and Virtual Reality. Lecture Notes in Computer Science, vol 8853. Springer, Cham
- 29.Mortimore J Guides: Data Management Services: Recommended file formats for long-term data curation. http://georgiasouthern.libguides.com/c.php?g=410908&p=2955521. Accessed 25 Feb 2019
- 30.File formats and standards - Digital Preservation Handbook. http://www.dpconline.org/handbook/technical-solutions-and-tools/file-formats-and-standards. Accessed 25 Feb 2019
- 31.Bird CL, Willoughby C, Coles SJ, Frey JG (2013) Data curation issues in the chemical sciences. Inf Stand Q 25:4
- 33.NIfTI: Neuroimaging Informatics Technology Initiative. https://nifti.nimh.nih.gov/. Accessed 25 Feb 2019
- 34.Thain D, Moretti C (2007) Efficient access to many small files in a filesystem for grid computing. In: Proceedings of the 8th IEEE/ACM International Conference on Grid Computing (GRID '07). IEEE Computer Society, Washington, DC, pp 243–250. https://doi.org/10.1109/GRID.2007.4354139
- 36.Manouchehri D (2016) List of free rfc3161 servers. In: Gist https://gist.github.com/Manouchehri/fd754e402d98430243455713efada710. Accessed 25 Feb 2019
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.