A novel image dataset for source camera identification and image based recognition systems

Multimodal emotion recognition has attracted a great deal of attention in recent years, with new interesting applications now being considered. One promising application is in the field of digital image forensics, where, for example, it makes it possible to automatically highlight subjects that are in pain in the digital images under examination by analyzing their facial expressions. However, finding an image that represents a possible crime leaves open the problem of identifying the device used to take the image. This problem has been addressed by Source Camera Identification algorithms (SCI, for short). These algorithms analyze some features hidden in a target image to find traces left by the sensor that captured the image. A particularly challenging case is when the candidate source cameras for an image under investigation are of the same manufacturer and model. A fair and universal assessment of these algorithms is only possible if standard datasets are used for their benchmarking. However, our comprehensive analysis has shown that the majority of the datasets proposed so far contain a collection of images taken with different types of cameras, mostly smartphones. We fill this gap by presenting UNISA2020, a novel image dataset that contains a large collection of real-world images taken with multiple conventional digital cameras of the same type. The images in our dataset have been assembled so as to avoid artifacts that could negatively affect the identification process. To validate our dataset, we also performed a comparative experimental analysis to investigate the performance of an SCI reference algorithm when running on our dataset as well as on other SCI standard datasets.


Introduction
Over the years, thanks in part to the success of social media, digital photography has become increasingly pervasive. As a negative consequence of this, the number of crimes involving the use of digital images has also grown. Consider, for example, the case of digital pedo-pornography or the practice of extorting money by threatening to reveal sexually explicit digital images. In all these cases, digital images can play an important role as evidence both in investigations and in court. Unfortunately, law enforcement agencies struggle to keep up with cybercrime, in part because of the large amount of material that needs to be investigated. Scientific research is providing answers to these needs by developing methods, such as those based on face, emotion, and body gesture recognition, that are useful for automatically examining images under investigation and identifying those that may represent a crime as it is being committed (see, e.g., [7,24,26]).
These applications are very promising. However, their effectiveness depends on the ability to prosecute the discovered crime by identifying the digital source camera used to capture the image. Fortunately, many identification techniques have been proposed in the literature for this purpose, mostly based on the analysis of the Pixel Non-Uniformity noise (PNU for short) left by the camera sensors on the digital images they capture.
As the number of alternative approaches increases, it is critical to provide standard image datasets that can be used to compare the performance of different algorithms on a challenging and neutral ground. This requirement has been consistently addressed by the scientific community, as witnessed by the many datasets that have been published in the recent past. However, most of these proposals seem to focus on a use case that, although very recent and popular, is nevertheless specific. They mostly consider images taken using devices of different types, mostly smartphones.
In this paper, we first analyze this landscape by providing a comprehensive overview of all the image datasets that have been proposed so far for benchmarking SCI algorithms, as well as the image datasets that, although primarily developed for other forensic problems related to digital images, can also be used for benchmarking SCI algorithms with relatively little effort. We then present a novel image dataset called UNISA2020 that builds on the experience of previous datasets and provides a solution for benchmarking SCI algorithms in an uncovered scenario. We focus only on images taken with a traditional camera rather than a smartphone, as well as images taken with several different cameras of the same manufacturer and model. In this way, the researchers can investigate the impact that the use of digital sensors that are different but of the same type can have on correct identification.
Finally, we report the results of an experimental analysis comparing the performance of a reference SCI algorithm, in terms of identification accuracy, when running on a wide range of image datasets, including the one proposed.
The rest of the paper is organized as follows. We formally introduce the SCI problem in Section 2, together with the description of a reference algorithm commonly used for its solution. In Section 3, we overview many of the image datasets proposed so far for benchmarking SCI algorithms or that can, with little effort, be used to this end. In Section 4, we detail the objective of this paper. More information about the dataset we propose is available in Section 4.1. In Section 5, we report the results of the experimental analysis we conducted in order to test the performance of a reference SCI algorithm when run on our dataset, as opposed to the same algorithm being run on other standard datasets. Finally, in Section 6, we draw our conclusions.

The SCI problem
The SCI problem deals with the identification of the digital camera device used to capture one or more digital images under scrutiny. A widely used identification technique is based on the use of pixel non-uniformity noise (PNU noise for short), the characteristic noise left by digital camera sensors when images are captured. Realistically, it is unlikely that different sensors, even if manufactured on the same chip wafer, will have similar PNU patterns. Therefore, PNU is an inherent property of every sensor. Accordingly, each digital camera has its own unique PNU noise. If there is a method to extract this noise from a digital camera's sensor, it can effectively be used as a kind of fingerprint for the identification process.
One of the most popular SCI techniques proposed so far, based on PNU noise analysis, is the one presented by Fridrich et al. in [23]. It also includes a robust and reliable method for extracting an approximation of the PNU noise characteristic of a given digital sensor, starting from a collection of images taken with that sensor. This technique can be seen as a template for SCI algorithms, since many of the identification algorithms proposed in the literature are essentially a variant of this algorithm.
In the following, we give a brief overview of this algorithm, since it is used as a reference method in our experiments.

The Fridrich et al. SCI algorithm
Let I be a digital image under scrutiny and let C be a reference camera. We are interested in assessing whether I is likely to have been taken with C. First, we need to extract the PNU pattern noise of C, defined here as its reference pattern RP_C, and the noise component of I, defined here as its residual noise RN_I. The general approach proposed by Fridrich et al. in [23] is then to associate I with C by checking whether or not the correlation between RP_C and RN_I exceeds a recognition threshold.
In their paper, Fridrich et al. thus present both a method for extracting an approximation to the PNU noise of a target camera and an identification algorithm (see Algorithm 1) based on the correlation between this noise and the noise present in the image under scrutiny.

Reference pattern extraction
The reference pattern noise of C, RP_C, is extracted starting from a collection N_C of enrollment images taken with that camera.
First, for each image I in N_C, we extract an approximation of the residual noise contained in it (RN_I) according to the following formula:
\[ RN_I = I - F(I) \tag{1} \]
where F is a denoising function, such as a Daubechies 8 low-pass wavelet filter. Then, the residual noises extracted from all the images in N_C are used to estimate the PNU reference pattern RP_C of C. This can be done using the maximum likelihood approach [8], defined in the following formula:
\[ RP_C = \frac{\sum_{I \in N_C} RN_I \cdot I}{\sum_{I \in N_C} I^2} \tag{2} \]
where the products and the division are computed element-wise (pixel by pixel).
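The two extraction steps described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names are hypothetical, images are assumed to be single-channel NumPy arrays of equal size, and a simple 3x3 mean filter stands in for the wavelet-based denoising function F.

```python
import numpy as np

def box_denoise(img, k=3):
    """Stand-in for the denoiser F: a k x k mean filter with edge padding.
    (Assumption: the reference algorithm uses a wavelet-based filter instead.)"""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

def residual_noise(img, denoise=box_denoise):
    """Residual noise of one image: RN = I - F(I)."""
    img = np.asarray(img, dtype=np.float64)
    return img - denoise(img)

def reference_pattern(images, denoise=box_denoise):
    """Maximum-likelihood style estimate of the reference pattern:
    sum(RN_i * I_i) / sum(I_i ** 2), computed pixel by pixel."""
    num = None
    den = None
    for img in images:
        img = np.asarray(img, dtype=np.float64)
        rn = residual_noise(img, denoise)
        num = rn * img if num is None else num + rn * img
        den = img ** 2 if den is None else den + img ** 2
    # Guard against division by zero on completely black pixels.
    return num / np.maximum(den, 1e-12)
```

Passing a different callable as `denoise` allows the stand-in filter to be swapped for a wavelet-based one without touching the rest of the sketch.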

Source camera identification
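A correlation statistic is used to measure the similarity between RP_C and RN_I. This can be done using different approaches, such as the Circular Cross-Correlation Norm (CCN) statistic (see [20]) or the Bravais-Pearson Correlation (CCBP) statistic (see [23]). In our experiments, we used the Peak-to-Correlation Energy (PCE) statistic, which is defined as follows:
\[ PCE(RP_C, RN_I) = \frac{\rho(s_{peak}; RP_C, RN_I)^2}{\frac{1}{mn - |\mathcal{N}|} \sum_{s \notin \mathcal{N}} \rho(s; RP_C, RN_I)^2} \tag{3} \]
where \rho(s; RP_C, RN_I) is the circular cross-correlation between RP_C and RN_I at shift s, s_{peak} is the location of the correlation peak, \mathcal{N} is a small neighbourhood around the peak, and m \times n is the image size. If the value of this statistic exceeds a certain threshold θ estimated by the Neyman-Pearson method, then it is likely that I was taken with C. Note that we chose the PCE statistic because it works well for any image size [17] and can highlight the distance between very close correlation values.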
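As an illustration of the identification step, the following Python sketch computes a PCE value from a reference pattern and a residual noise and compares it with a threshold. It is a sketch under assumptions: the function names are hypothetical, both inputs are assumed to be 2-D NumPy arrays of the same size, the peak is searched over all circular shifts, and the size of the neighbourhood excluded around the peak (`exclude`) is a free parameter chosen here for illustration only.

```python
import numpy as np

def circular_cross_correlation(a, b):
    """Circular cross-correlation of two equally sized 2-D arrays via the FFT."""
    a = a - a.mean()
    b = b - b.mean()
    return np.real(np.fft.ifft2(np.fft.fft2(a) * np.conj(np.fft.fft2(b))))

def pce(reference_pattern, residual, exclude=5):
    """Peak-to-Correlation Energy: squared peak divided by the mean squared
    correlation outside an exclude x exclude neighbourhood of the peak."""
    cc = circular_cross_correlation(residual, reference_pattern)
    peak_y, peak_x = np.unravel_index(np.argmax(np.abs(cc)), cc.shape)
    peak = cc[peak_y, peak_x]
    # Mask out the neighbourhood of the peak, wrapping around the borders.
    mask = np.ones(cc.shape, dtype=bool)
    half = exclude // 2
    ys = np.arange(peak_y - half, peak_y + half + 1) % cc.shape[0]
    xs = np.arange(peak_x - half, peak_x + half + 1) % cc.shape[1]
    mask[np.ix_(ys, xs)] = False
    energy = np.mean(cc[mask] ** 2)
    return float(peak ** 2 / energy)

def same_camera(reference_pattern, residual, threshold):
    """Decision rule: attribute the image to the camera if PCE exceeds the threshold."""
    return pce(reference_pattern, residual) > threshold
```

The threshold is left as an argument on purpose: in the setting described above it would be estimated from data (for example with the Neyman-Pearson method) rather than hard-coded.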

Other SCI algorithms
In addition to the method of Fridrich et al., there have been many other contributions to this topic. For example, among the conventional feature extraction algorithms based on the analysis of traces left by the original sensor on digital images, there is a similar technique by Goljan [16]. It is based on the cross-correlation analysis of the peak to correlation energy ratio available in digital images. We also refer to the method proposed by Kang et al. in [20], which improves the results obtained by the Fridrich et al. algorithm by using the circular cross-correlation norm (CCN) as a test statistic, which could increase the rate of true positives in camera identification. A computational problem commonly encountered when using these techniques is the large size of the digital sensor fingerprints, leading to longer processing times. Li et al. have mitigated this problem in [21] by introducing a compact representation of the sensor fingerprints that reduces their processing time. We also refer to the study by Gupta and Tiwari in [18,34], which discusses several possible approaches to further improve the identification performance of existing SCI techniques. A new class of approaches that is attracting a lot of attention is that based on the use of machine learning techniques. Contributions such as [4,12,22] use neural networks to provide identification systems with high accuracy that can easily handle very large datasets.

Related work
As described in Section 2, numerous SCI algorithms for camera detection have been introduced in the recent past. In this context, we note that at the beginning of the research in this area, many of the proposed scientific contributions were underpinned by experimental studies performed on self-generated datasets. Moreover, many of these datasets have not always been made publicly available, making it difficult for other researchers to replicate these results and compare the performance of the current state of the art with new algorithms under development.
The availability of standard datasets is an essential prerequisite to properly evaluate the performance of these algorithms, as well as to compare them to existing algorithms. Things have changed in recent years and there is an increasing interest in the introduction of standard datasets that can be used as a reference for a fair comparison of the performance of different solutions to the same problem. This has the advantage of promoting a kind of data-driven research that fosters innovation and facilitates the reproducibility of scientific results in the forensic field.
In the following, we review a selection of several popular datasets that we used for the experimental analysis of the SCI algorithms. They are an integral part of the work aimed at showing that the accuracy of the SCI algorithm is indeed affected by the input dataset.
Note that some of these datasets were originally developed to study other digital forensics problems, but can easily be adapted to test SCI algorithms. In these cases, we provide some essential information about the original application.
For all the considered datasets, we report in Table 1 some essential information useful for their evaluation in this context, including:
• Descriptive information: the name of the dataset, the year of publication, and the scientific publication (if any) in which the dataset was presented;
• Structure of the dataset: the total number of images included, the number of different devices, the number of different device models, and the type of devices used to capture those images (e.g., a smartphone digital camera);
• Structure and content of the images: the minimum and maximum resolution of all the images, the encoding format of these images (which may have important implications for the SCI process), and whether or not they have been post-processed.

UCID -uncompressed color image dataset
This dataset consists of uncompressed TIFF images showing a variety of subjects, including natural scenes and man-made objects, both indoor and outdoor, taken by five different cameras of the same type (i.e., Minolta Dimage 5) along with ground truth (predefined query images with corresponding model images to be retrieved). The exposure settings (e.g., exposure, contrast, colour balance) were automatically selected by the camera according to the context of each shot. Originally, the UCID database was intended to provide a dataset to evaluate algorithms for compressed domains, but it is also a good benchmark set to evaluate any type of content-based image retrieval (CBIR) method as well as to analyze image compression and colour quantization algorithms. 1

COLUMBIA image splicing detection evaluation Dataset
This dataset was presented as the first and earliest dataset available online to study image tampering. It focuses primarily on images that have undergone a splicing process, i.e., a process in which areas of one or more images are cut out and then pasted into a target image. It contains a collection of grayscale-only blocks extracted from 322 different photographs: 10 of which were taken by the authors using a Canon PowerShot S40 digital camera, while the remaining 310 were downloaded from the CalPhotos dataset [5], an online database of natural photographs portraying subjects from nature, like plants, animals and habitats. 2

SIDD dataset
This dataset was created to analyze the image denoising performance of smartphone cameras. It contains a large number of images representing ten different types of scenes captured under the following lighting conditions:
• fifteen different ISO levels from 50 to 10000, to obtain a variety of noise values (the higher the ISO level, the higher the noise);
• three illumination levels, to simulate the effect of different light sources: 3200K for tungsten or halogen, 4400K for fluorescent and 5500K for daylight;
• three brightness levels: low, normal and high, to adjust the brightness of the lighting and the color temperature (from 3200K to 5500K).
In addition, the database also provides noise-free images of these native images as ground truth images. 3

CASIA image tampering detection evaluation database -version 1
This dataset was published as an open benchmark dataset for techniques to detect any tampering and verify the authenticity of image content, focusing on images tampered by copy-move and cut-paste operations. The manipulated images were obtained by cutting and pasting authentic images using image editing software. Most of the authentic images were obtained from the COREL image dataset [32]. The remaining portion was taken by the authors and shows various subjects that fall into one of the following categories: Animals, Architecture, Characters, Plants, Articles, Nature, and Texture. Edge masks marking the boundaries of the manipulated regions are not explicitly provided, but can be computed using the techniques presented in [28,36]. 4

CASIA image tampering detection evaluation database -version 2
This dataset is a large-scale image dataset presented as an evolution of the CASIA v1.0 dataset. Most of the authentic images are from the COREL image dataset [32]. The remaining portion is from the authors and represents various subjects organized by the same categories introduced in CASIA v1.0. The manipulated images were obtained by cutting and pasting authentic images using image processing software and by applying a blur effect to the resulting image. Edge masks marking the boundaries of the manipulated regions are not explicitly provided, but can be computed using the techniques presented in [28,36]. 5

DIDB -Dresden image dataset
This dataset was created for the development and benchmarking of camera-based digital forensic techniques. It contains images representing different indoor and outdoor scenes taken with different devices for each model. This allows for the study of feature similarity between different devices of the same model, different models of the same manufacturer, and between devices of different manufacturers. Based on the observation that most consumer cameras save images in lossy compressed JPEG format by default, the authors of this dataset configured their camera with the highest JPEG quality setting and maximum available resolution, or instead captured lossless compressed images if supported by the device. 6

MICC-F220 dataset
This dataset was introduced to evaluate techniques that detect tampering and verify the authenticity of image content. Most of the authentic images are from the COLUMBIA dataset. The remaining part was taken by the authors. The tampered images were obtained by applying 14 different types of attacks to authentic images, such as translation, rotation, scaling, or a combination thereof. 7

MICC-F2000 dataset
This dataset, published along with MICC-F220, is a large-scale image dataset. Unlike the latter, however, it contains a much larger number of authentic and manipulated images, compiled here using the same methodology as MICC-F220. 8

RAISE-RAw ImageS datasEt
This dataset was created to evaluate digital counterfeit detection algorithms. It consists of high-resolution RAW images stored in an uncompressed format as provided by the cameras used. Each image is also assigned to one of seven possible categories (e.g., "outdoor," "indoor," "landscape"). It is designed for general purposes in the field of forensic image processing: The raw images contain all the information about the acquisition process and can be used to test any type of processing according to the desired experimental setup. 10

WILD WEB dataset
This dataset contains forged images collected from the Internet. Most of them were post-processed, e.g., re-saved and re-sampled, while circulating online, so they cannot be treated as authentic camera outputs in recognition tasks. All the collected images contain confirmed tampering: most are cut-paste forgeries, and a few are copy-move and erase-fill forgeries. For each manipulation case, ground truth masks were manually created. After removing exact duplicates, the remaining unique images include a number of variants of each forgery, falling into one or more categories: a) versions of the same image at different scales, often with different aspect ratios, b) cropped versions of the original image, c) versions with subsequently added overlays, such as watermarks, frames, or other elements that have been superimposed on the forged images and are usually noticeable. 11

VISION dataset
This dataset contains a total of 34427 images. Of these, 7565 were shared via WhatsApp and Facebook, in both high and low quality. The images include native and socially shared images and represent either generic images or flat surfaces in landscape format, such as skies or walls. 12

SOCRatES dataset
This dataset contains a large number of images captured with a variety of smartphones and has the largest number of different sensors used for data collection by the device owners themselves, resulting in a large heterogeneity and realism of the data. 13 The collection took place under uncontrolled conditions. Background images are the subset of photos that depict a solid-colored scene (for example, a blue sky). Foreground images, on the other hand, are photos that represent an arbitrary scene. These images are very heterogeneous because they were taken under conditions where the equipment, people, places, and times differed.
10 http://loki.disi.unitn.it/RAISE/download.html
11 It is not publicly available online but is accessible for research purposes upon request.
12 https://lesc.dinfo.unifi.it/en/datasets
13 http://socrates.eurecom.fr/

DSID -DAXING smartphone identification dataset
This dataset contains images taken with different smartphone models. Each image is also assigned to one of eight possible categories (e.g., "sky", "grass", "stone"). Exposure settings (e.g., exposure, contrast, colour balance) were automatically selected by the camera according to the context of each image. 14

DHIf Dataset
This dataset was assembled to investigate the performance of SCI algorithms in processing High Dynamic Range (HDR) images. It contains 5415 HDR JPEG images and their standard dynamic range counterparts captured with different smartphones. 15 The photos were taken without flash in different environments, including both indoor and outdoor shots, and were divided into two categories: FLAT and NAT. The first ones are homogeneous and include walls and sky. The latter, taken from a tripod, handheld and with a shaky hand, are natural images; they have a wide range of scenes and can contain a large number of details and colours. Finally, the images taken with a shaky hand may be blurry due to the pixel shift caused by the shaking of the camera.

FODB -FORCHHEIM dataset
This dataset was introduced to test the ability of SCI algorithms to handle the recompression algorithms used by social media to publish user-supplied images. It includes a large number of images taken with a variety of smartphones and available in six different versions: the original version coming from the camera and five copies from social networks such as Facebook, Instagram, Telegram, Twitter and WhatsApp. 16 The images depict indoor and outdoor, day and night, close-up and distant, natural and man-made scenes.

KAGGLE dataset
This dataset is from the 2018 Signal Processing Cup cell phone image source identification competition held on Kaggle and sponsored by the IEEE Signal Processing Society [14]. It consists of a training dataset containing images captured with 10 different smartphones and a test dataset containing only single blocks of 512 x 512 pixels cropped from the center of a single image captured with each of the considered devices. 17

Our contribution
The detailed overview presented in Section 3 confirms the interest in the availability of standard datasets for a fair evaluation and comparison of SCI algorithms. However, many of the available options seem to focus on a specific application scenario that, while very topical, does not cover all the scenarios where SCI algorithms are needed. Most of these datasets only contain images taken with smartphones, and they mostly contain collections of images taken with a wide variety of devices. Indeed, smartphones are now the most commonly used devices for capturing images, but traditional digital cameras still have a significant market share (e.g., about 20 million devices shipped in 2018 according to figures available in [27]). We also argue that evaluating SCI algorithms against a dataset where the images were captured with very heterogeneous devices can simplify the identification process, making the results less general. Finally, as pointed out in several other scientific papers (see, e.g., [13,15,33]), the robustness and reliability of the SCI process based on PNU analysis can be severely compromised when images with different resolutions are considered. This is because the resizing or cropping required to bring all the images under analysis to a common resolution will cause some of the existing PNU in an image to be discarded or its geometry to be altered, making the identification process less effective.
To address these concerns, we present a new large image dataset, named UNISA2020, specifically designed for experiments using SCI algorithms to identify source cameras under certain controlled conditions. To this end, our dataset was developed considering the following requirements.
First, we only considered images taken with traditional digital cameras. Second, we considered a rather large number of different devices (i.e., 20), but all of them were from the same manufacturer and model. Finally, all the images of our dataset were captured and stored at the same resolution. In addition, the images were captured in such a way that there were almost no statistical artifacts (e.g., those due to the type of color interpolation, JPEG compression, resampling, or modification of Exif data) that were not caused by the camera firmware and that could either positively or negatively affect the SCI process.

UNISA2020 technical information
The UNISA2020 dataset 18 consists of 4647 unpublished JPEG images. All of these images were taken with twenty different cameras of the same model (i.e., Nikon D90), using the maximum resolution (i.e., 4288×2848) and the minimum compression factor (i.e., Nikon D90 Maximum Image Quality setting). In addition, many of these images were taken with a professional camera tripod. Of all the images included in our dataset, 2592 portray a sheet conforming to the ISO 15739:2017 standard (see Fig. 1), chosen as the subject to maximize the resulting extracted PNU noise. These images are marked as "non-generic". In addition, these images are also labeled as "BUSY". By this term, we refer to the high amount of detail and colour present in these images, as opposed to "FLAT" images, i.e., images that represent homogeneous and roughly uniform surfaces (e.g., images of walls and sky). The remaining images in our dataset, either FLAT or BUSY, mainly depict outdoor scenes and are labeled as "generic" because they contain a wide variety of scenes. All the images were taken using the sRGB colour profile and none of them were post-processed in any way.

Experimental analysis
To serve as a benchmark, a good image dataset should challenge an SCI algorithm to identify the original camera used to capture the analyzed images, under adversarial conditions. In our case, we focus on the difficult case of tracing the origin of multiple digital images taken with several different instances of the same camera brand and model. Therefore, we tested the robustness of our dataset by using it to evaluate the identification accuracy of a vanilla implementation of the Fridrich et al. algorithm presented in Section 2.1, available via an open-source Python library 19 already used in [6], and compared it with the accuracy obtained on some of the main public datasets available today, described in Section 3. For each device in each dataset, we selected about 60% of the images uniformly at random for the reference pattern extraction process (see Section 2 for details). Then, we compared the residual noise of the remaining 40% of the images with the reference pattern of each digital camera under study using the PCE statistic (see (3)). At the end of this process, we obtained a correlation matrix. Then we trained a Support Vector Machine (SVM) with half of this correlation matrix (including half of the images from each device) and tested the system with the other half of the correlation matrix. In the end, we calculated the accuracy of the SVM for each of the datasets.
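The first two steps of this protocol (the per-device random split and the construction of the correlation matrix) can be sketched as follows. This is a hedged illustration, not the code used for the experiments: the function names, the dictionary layout, and the fixed seed are assumptions, and the `score` callable stands for whatever similarity statistic is used (PCE in our case). The SVM training step is omitted because it depends on an external machine learning library.

```python
import random

def split_enrollment_test(images_by_device, enroll_fraction=0.6, seed=0):
    """Split each device's images uniformly at random into an enrollment set
    (used for reference pattern extraction) and a test set."""
    rng = random.Random(seed)
    enrollment, test = {}, {}
    for device, images in images_by_device.items():
        shuffled = list(images)
        rng.shuffle(shuffled)
        cut = round(len(shuffled) * enroll_fraction)
        enrollment[device] = shuffled[:cut]
        test[device] = shuffled[cut:]
    return enrollment, test

def score_matrix(residuals, reference_patterns, score):
    """One row per test residual, one column per device (sorted by name):
    entry [i][j] is score(reference pattern of device j, residual i)."""
    devices = sorted(reference_patterns)
    return [[score(reference_patterns[d], rn) for d in devices] for rn in residuals]
```

Each row of the resulting matrix can then serve as a feature vector for a classifier, with the true source device as its label.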

SCI datasets
To test the performance of this procedure, we used for our experimental evaluation a subset of the datasets discussed in Section 3 that have the following characteristics: they should contain a consistent number of images for each of the devices under study, these images should have a certified origin, they should not have undergone any post-processing, and finally the entire dataset should be publicly available for download. The datasets that meet these requirements are: UCID, VISION and SOCRatES.

Results and discussion
We report the results of our experiment in Table 2. As a precondition, we had to ensure that the images used for identification were of the same resolution as the images used to extract the reference patterns (see Section 2.1). If this is not the case, this condition can be satisfied by reducing or cropping the images with different resolutions so that they all have the same reference resolution. In our case, we decided to exclude the possibility of resizing, since it has a very negative impact on the identification performance [6]. Instead, we opted for a solution based on the operation of cropping images.
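For illustration, bringing a set of images to a common resolution by cropping can be sketched as below. The choice of a centered crop and of the smallest height and width in the set as the target size are assumptions made for this example; the text above only states that cropping was preferred over resizing. The function names are hypothetical.

```python
import numpy as np

def center_crop(img, height, width):
    """Return the centered height x width window of an image array."""
    img = np.asarray(img)
    h, w = img.shape[:2]
    if height > h or width > w:
        raise ValueError("crop size exceeds image size")
    top = (h - height) // 2
    left = (w - width) // 2
    return img[top:top + height, left:left + width]

def crop_to_common(images):
    """Crop every image to the smallest height and width found in the set."""
    height = min(np.asarray(i).shape[0] for i in images)
    width = min(np.asarray(i).shape[1] for i in images)
    return [center_crop(i, height, width) for i in images]
```

Unlike resizing, cropping leaves the retained pixels untouched, but it discards part of the sensor area and can misalign images whose original resolutions differ.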
Looking at the results of our experiments, we find that the worst performance is obtained when we consider the entire VISION dataset. This dataset contains images that come from a variety of devices and have widely varying resolutions, so all the images must be brought to the same resolution by cropping. If we instead consider only the images from VISION that come from the same device (e.g., Apple iPhones), we can observe a significant increase in accuracy. This is probably due to the fact that, in this case, the images considered are all of the same resolution and have not been post-processed. Also, the performance of the SOCRatES dataset is negatively affected by using the cropping process to bring all the images to the same resolution.
We find that the algorithm under consideration achieves a very high accuracy (i.e., about 98%) when running on our dataset. This is probably due to a combination of factors: the images considered all have the same, very high resolution, and they were captured and stored in such a way that there are no artifacts that could hinder the identification process.
For further confirmation, we have also added an additional experiment where we measured the pairwise correlation between the Reference Patterns (RPs) of the camera devices in each dataset. Figure 2 shows the results for UNISA2020 and VISION; for the others, the results are comparable and are available upon request. A quick look at the scales in the subfigures shows that the UNISA2020 PCE values are higher and more uniform than the ones in VISION. This result is a consequence of the artefact (i.e., cropping) necessary to compare the RPs in the second dataset. The uniformity of the values in our dataset makes it harder to identify the source between two devices in this dataset.
Fig. 2 Pairwise comparison between devices in each dataset

Conclusion and future directions
In recent years, we have witnessed the introduction of new methods, such as those based on face, emotion, and body gesture recognition, to support the prevention and investigation activities of law enforcement agencies and courts related to the analysis of digital images. Once a digital image is found to represent a crime, a closely related problem is often to identify the camera used to capture the image. Many scientific proposals have been made to date. A fair and universal evaluation of their accuracy and robustness requires the availability of standard datasets against which these algorithms can be compared under controlled, realistic conditions. In this paper, we presented a new image dataset called UNISA2020, which was developed specifically to experiment with SCI algorithms in a particular scenario. We considered the case where there are several different but similar source cameras available. We also restricted ourselves to traditional digital cameras, a case that is often overlooked in other datasets. In addition, we captured our images in a way that eliminated some preliminary processing steps that could hinder the identification procedure.
We evaluated the effectiveness of our proposal by using it in a benchmark study along with several other public image datasets available in the literature and obtained positive results. A much-needed further development of our work is the inclusion of SCI machine learning-based algorithms in our benchmarks, as described in Section 2.2.
We expect that our dataset and underlying methodology can be shared among the research groups working on the SCI topic. In this way the scientific community will obtain comparable results allowing for the measurement of the effectiveness of new algorithms. In the future we plan to constantly use this dataset in other application contexts for both identification and security purposes.
Funding Open access funding provided by Università degli Studi di Salerno within the CRUI-CARE Agreement.
The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.