There has been an increasing amount of work surrounding the Electric Network Frequency (ENF) signal, an environmental signature captured by audio and video recordings made in locations where there is electrical activity. ENF is the frequency of power distribution networks, 60 Hz in most of the Americas and 50 Hz in most other parts of the world. The ubiquity of this power signature and the appearance of its traces in media recordings motivated its early application toward time–location authentication of audio recordings. Since then, more work has been done toward utilizing this signature for other forensic applications, such as inferring the grid in which a recording was made, as well as applications beyond forensics, such as temporally synchronizing media pieces. The goal of this chapter is to provide an overview of the research work that has been done on the ENF signal and to provide an outlook for the future.

1 Electric Network Frequency (ENF): An Environmental Signature for Multimedia Recordings

In this chapter, we discuss the Electric Network Frequency (ENF) signal, an environmental signature that has been under increased study since 2005 and has been shown to be a useful tool for a number of information forensics and security applications. The ENF signal is such a versatile tool that later work has also shown that it can have applications beyond security, such as for digital archiving purposes and multimedia synchronization.

The ENF signal is a signal influenced by the electric power grid. Most of the power provided by the power grid comes from turbines that work as generators of alternating current. The rotational velocity of these turbines determines the ENF, which has a nominal value of 60 Hz in most of the Americas and 50 Hz in most other parts of the world. The ENF fluctuates around its nominal value as a result of power-frequency control systems that maintain the balance between the generation and consumption of electric energy across the grid (Bollen and Gu 2006). These fluctuations can be seen as random, unique in time, and typically very similar in all locations of the same power grid. The changing instantaneous value of the ENF over time is what we define as the ENF signal.

What makes the ENF particularly relevant to multimedia forensics is that the ENF can be embedded in audio or video recordings made in areas where there is electrical activity. It is in this way that the ENF serves as an environmental signature that is intrinsically embedded in media recordings. Once we can extract this invisible signature well, we can answer many questions about the recording it was embedded in. For example: When was the recording made? Where was it made? Has it been tampered with? A recent study has shown that ENF traces can even be extracted from images captured by digital cameras with rolling shutters, which can allow researchers to determine the nominal frequency of the area in which an image was captured.

Government agencies and research institutes of many countries have conducted ENF-related research and developmental work. This includes academia and government in Romania (Grigoras 2005, 2007, 2009), Poland (Kajstura et al. 2005), Denmark (Brixen 2007a, b, 2008), the United Kingdom (Cooper 2008, 2009a, b, 2011), the United States (Richard and Peter 2008; Sanders 2008; Liu et al. 2011, 2012; Ojowu et al. 2012; Garg et al. 2012, 2013; Hajj-Ahmad et al. 2013), the Netherlands (Huijbregtse and Geradts 2009), Brazil (Nicolalde and Apolinario 2009; Rodríguez et al. 2010; Rodriguez 2013; Esquef et al. 2014), Egypt (Eissa et al. 2012; Elmesalawy and Eissa 2014), Israel (Bykhovsky and Cohen 2013), Germany (Fechner and Kirchner 2014), Singapore (Hua 2014), Korea (Kim et al. 2017; Jeon et al. 2018), and Turkey (Vatansever et al. 2017, 2019). Among these works, we can see two major groups of study. The first group addresses challenges in accurately estimating an ENF signal from media signals. The second group focuses on possible applications of the ENF signal once it is extracted properly. In this chapter, we conduct a comprehensive literature study on the work done so far and outline avenues for future work.

The rest of this chapter is organized as follows. Section 10.2 describes the methods for ENF signal extraction. Section 10.3 discusses the findings of works studying the presence of ENF traces and providing statistical models for ENF behavior. Section 10.4 explains how ENF traces are embedded in videos and images and how they can be extracted. Section 10.5 delves into the main forensic and security ENF applications proposed in the literature. Section 10.6 describes an anti-forensics framework for understanding the interplay between an attacker and an ENF analyst. Section 10.7 extends the conversation toward ENF applications beyond security. Section 10.8 summarizes the chapter and provides an outlook for the future.

2 Technical Foundations of ENF-Based Forensics

As will be discussed in this chapter, the ENF signal can have a number of useful real-world applications, both in multimedia forensics-related fields and beyond. A major first step to these applications, however, is to properly extract and estimate the ENF signal from an ENF-containing media signal. But, how can we check that what we have extracted is indeed the ENF signal? For this purpose, we introduce the notion of power reference signals in Sect. 10.2.1. Afterward, we discuss various proposed methods to estimate the ENF signal from a media signal. We focus here on ENF extraction from digital one-dimensional (1-D) signals, typically audio signals, and delay the explanation of extracting ENF from video and image signals to Sect. 10.4.

2.1 Reference Signal Acquisition

Power reference recordings can be very useful to ENF analysis. The ENF traces found in the power recordings are typically much stronger than the ENF traces found in audio recordings. This is why they can be used as a reference and a guide for ENF signals extracted from audio recordings, especially in cases where one has access to a pair of simultaneously recorded signals, one a power signal and one an audio signal; the ENF in both should be very similar at the same instants of time. In this section, we describe different methods used in the ENF literature to acquire power reference recordings.

The Power Information Technology Laboratory at the University of Tennessee, Knoxville (UTK), operates the North American Power Grid Frequency Monitoring Network System (FNET), or GridEye. The FNET/GridEye is a power grid situational awareness tool that collects real-time, Global Positioning System (GPS)-timestamped measurements of grid reference data at the distribution level (Liu et al. 2012). A framework for FNET/GridEye is shown in Fig. 10.1. The FNET/GridEye system consists of two major components, which are the frequency disturbance recorders (FDRs) and the information management system (IMS). The FDRs are the sensors of the system; each FDR is an embedded microprocessor system that performs local GPS-synchronized measurements, such as computing the instantaneous ENF values over time. In this setup, the FDR estimates the power frequency values at a rate of 10 records/s using phasor techniques (Phadke et al. 1983). The measured data is sent to the server through the Internet, where the IMS collects the data, stores it, and provides a platform for the visualization and analysis of power system phenomena. More information on the FNET/GridEye system can be found in FNET Server Web Display (2021), Zhong et al. (2005), Tsai et al. (2007), and Zhang et al. (2010).

Fig. 10.1

Framework of the FNET/GridEye system (Zhang et al. 2010)

A system similar to the FNET/GridEye system, named the wide area management systems (WAMS), has been set up in Egypt, where the center providing the information management functions is at Helwan University (Eissa et al. 2012; Elmesalawy and Eissa 2014). The researchers operating this system have noted that during system disturbances, the instantaneous ENF value is not the same across all points of the grid. It follows that in such cases, the ENF value from a single point in the grid may not be a reliable reference. For this purpose, in Elmesalawy and Eissa (2014), they propose a method for establishing ENF references from a number of FDRs deployed in multiple locations of the grid, rather than from a single location.

Recently, the authors of Kim et al. (2017) presented an ENF map that takes a different route: Instead of relying on installing specialized hardware, they built their ENF map by extracting ENF signals from the audio tracks of open-source online streaming multimedia data obtained from such sources as “Ustream”, “Earthcam”, and “Skyline webcams”. Most microphones used in such streaming services are mains-powered, which makes the ENF traces captured in the recordings stronger than those that would be captured in recordings made using battery-powered recorders. Kim et al. (2017) addressed in detail the challenges that come with the proposed approach, including accounting for packet loss, aligning different ENF signals temporally, and interpolating ENF signals geographically to account for locations that are not covered by the streaming services.

Systems such as the ones discussed offer tremendous benefits in power frequency monitoring and coverage, yet one does not need access to them in order to acquire ENF references locally. An inexpensive hardware circuit can be built to record a power signal or measure ENF variations, given access to an electric wall outlet. Typically, a transformer is used to convert the voltage from the wall outlet voltage levels down to a level that an analog-to-digital converter can capture. Figure 10.2 shows a sample generic circuit that can be built to record the power reference signal (Top et al. 2012).

Fig. 10.2

Sample generic schematic of sensing hardware (Top et al. 2012)

There is more than one design to build the sensor hardware. In the example of Fig. 10.2, an anti-aliasing filter is placed in the circuit along with a fuse for safety purposes. In some implementations, such as in Hajj-Ahmad et al. (2013), a step-down circuit is connected to a digital audio recorder that records the raw power signal, whereas in other implementations, such as in Fechner and Kirchner (2014), the step-down circuit is connected, via a Schmitt trigger, to a BeagleBone Black board that computes an estimate of the ENF signal on the fly. In the former case, the recorded digital signal is processed later using ENF estimation techniques, which will be discussed in the next section, to extract the reference ENF signal, while in the latter case, the ENF signal is ready to be used as a reference for the analysis.

As can be seen, there are several ways to acquire ENF references depending on the resources one has access to. An ENF signal extracted through these measurements typically has a high signal-to-noise ratio (SNR) and can be used as a reference in ENF research and applications.

2.2 ENF Signal Estimation

In this section, we discuss several approaches that have been proposed in the literature to extract the ENF signal embedded in audio signals.

A necessary stage before estimating the changing instantaneous ENF value over time is preprocessing the audio signal. Typically, since the ENF component lies in a low-frequency band, the ENF-containing audio signal can be lowpass filtered with proper anti-aliasing and downsampled to make the computations of the estimation algorithms easier. For some estimation approaches, it also helps to bandpass the ENF-containing audio signal around the frequency band of interest, i.e., the frequency band surrounding the nominal ENF value. The ENF-containing signal is then divided into consecutive overlapping or nonoverlapping frames. The aim of the ENF extraction process is to apply a frequency estimation approach to each frame to estimate its most dominant frequency around the nominal ENF value. This frequency estimate is the estimated instantaneous ENF value for the frame. Concatenating the frequency estimates of all the frames forms the extracted ENF signal. The length of the frame, typically on the order of seconds, determines the resolution of the extracted ENF signal. A trade-off exists here: a smaller frame size better captures the ENF variations but may degrade the performance of the frequency estimation approach, and vice versa.

Generally speaking, an ENF estimation approach can be one of three types: (1) a time-domain approach, (2) a nonparametric frequency-domain approach, and (3) a parametric frequency-domain approach.

2.2.1 Time-Domain Approach

The time-domain zero-crossing approach is fairly straightforward, and it is one of the few ENF estimation approaches that is not preceded by dividing the recording into consecutive frames for individual processing. As described in Grigoras (2009), a bandpass filter with a 49–51 Hz or 59–61 Hz passband is first applied to the ENF-containing signal, without initial downsampling, to separate the ENF waveform from the rest of the recording. Afterward, the zero-crossings of the filtered signal are located, and the time differences between consecutive crossings are used to obtain the instantaneous ENF estimates.
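
A minimal sketch of this idea in Python is given below. It assumes a one-channel audio array x sampled at fs Hz; the filter order, bandwidth, and one-second averaging are illustrative choices rather than the exact settings of Grigoras (2009).

```python
import numpy as np
from scipy.signal import butter, filtfilt

def enf_zero_crossing(x, fs, nominal=60.0, half_bw=1.0, frame_sec=1.0):
    """Rough zero-crossing ENF estimator (illustrative sketch)."""
    # Bandpass filter around the nominal ENF value, e.g., 59-61 Hz.
    b, a = butter(4, [(nominal - half_bw) / (fs / 2),
                      (nominal + half_bw) / (fs / 2)], btype="band")
    y = filtfilt(b, a, np.asarray(x, dtype=float))

    # Locate zero crossings, refined by linear interpolation between samples.
    s = np.sign(y)
    idx = np.where(s[:-1] * s[1:] < 0)[0]
    t_cross = (idx - y[idx] / (y[idx + 1] - y[idx])) / fs

    # A full cycle spans two consecutive crossings of the same direction.
    inst_freq = 1.0 / (t_cross[2:] - t_cross[:-2])
    t_mid = t_cross[1:-1]

    # Average the instantaneous estimates within consecutive frames.
    n_frames = int(t_mid[-1] // frame_sec)
    return np.array([inst_freq[(t_mid >= k * frame_sec) &
                               (t_mid < (k + 1) * frame_sec)].mean()
                     for k in range(n_frames)])
```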

2.2.2 Nonparametric Frequency-Domain Approach

Nonparametric approaches do not assume any explicit model for the data. Most of these approaches are based on the Fourier analysis of the signal.

Most nonparametric frequency-domain approaches are periodogram or spectrogram based, utilizing the short-time Fourier transform (STFT). The STFT is often used for signals with a time-varying spectrum, such as speech signals. After the signal is segmented into overlapping frames, each frame undergoes Fourier analysis to determine the frequencies present. A spectrogram is then defined as the squared magnitude of the STFT and is usually displayed as a two-dimensional (2-D) intensity plot, with the two axes being time and frequency, respectively (Hajj-Ahmad et al. 2012).

Because of the slowly varying nature of the ENF signal, it is reasonable to consider the instantaneous frequency within the duration of a frame approximately constant for analysis. Given a sinusoid of a fixed frequency embedded in noise, the power spectral density (PSD) estimated by the STFT should ideally exhibit a peak at the frequency of the sinusoidal signal. Estimating this frequency well gives a good estimate for the ENF value of this frame.

A straightforward approach to estimating this frequency would be finding the frequency that has the maximum power spectral component. Directly choosing this frequency as the ENF value, however, typically leads to a loss in accuracy, because the spectrum is computed for discretized frequency values and the actual frequency of the maximum energy may not be aligned with these discretized values. For this reason, the STFT-based ENF estimation approach typically carries out further computations to obtain a more refined estimate. Examples include quadratic or spline interpolation about the detected spectral peak, or a weighted approach in which the ENF estimate is obtained by weighting the frequency bins around the nominal value by their spectral intensities (Hajj-Ahmad et al. 2012; Cooper 2008, 2009b; Grigoras 2009; Liu et al. 2012).
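
The spectrogram-based estimator with quadratic peak refinement can be sketched as follows. The frame length, step size, and zero-padded FFT size are illustrative assumptions; the parabolic refinement uses the three magnitude bins around the coarse peak.

```python
import numpy as np

def enf_stft(x, fs, nominal=60.0, half_bw=1.0, frame_sec=8.0, step_sec=1.0, nfft=2**16):
    """Per-frame ENF estimates: FFT peak near the nominal value, refined by
    quadratic (parabolic) interpolation over three neighboring bins."""
    x = np.asarray(x, dtype=float)
    frame, step = int(frame_sec * fs), int(step_sec * fs)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    band = (freqs >= nominal - half_bw) & (freqs <= nominal + half_bw)
    enf = []
    for start in range(0, len(x) - frame + 1, step):
        seg = x[start:start + frame] * np.hanning(frame)
        mag = np.abs(np.fft.rfft(seg, n=nfft))
        k = np.flatnonzero(band)[np.argmax(mag[band])]       # coarse peak bin
        a, b, c = mag[k - 1], mag[k], mag[k + 1]
        delta = 0.5 * (a - c) / (a - 2 * b + c)              # fractional bin offset
        enf.append(freqs[k] + delta * (fs / nfft))
    return np.array(enf)
```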

In addition to the STFT-based approach that is most commonly used, the authors in Ojowu et al. (2012) advocate the use of a nonparametric, adaptive, and high-resolution technique known as the time-recursive iterative adaptive approach (TR-IAA). This algorithm obtains the spectral estimates of a given frame by minimizing a quadratic cost function using a weighted least squares formulation. It is an iterative technique that takes 10–15 iterations to converge, where the spectral estimate is initialized to be either the spectrogram or the final spectral estimate of the preceding frame (Glentis and Jakobsson 2011). Compared to STFT-based techniques, this approach is more computationally intensive. In Ojowu et al. (2012), the authors report that the STFT-based approach gives slightly better estimates of the network frequency when the SNR is high, yet the adaptive TR-IAA approach achieves a higher ENF estimation accuracy in the presence of interference from other signals.

To enhance the ENF estimation accuracy, Ojowu et al. (2012) propose a frequency tracking method based on dynamic programming that finds a minimum-cost path. For each frame, the method generates a set of candidate frequency peak locations from which the minimum-cost path is generated. The cost function selected in this proposed extension takes into account the slowly varying nature of the ENF and penalizes significant jumps in frequency from frame to frame. The frequencies along the minimum-cost path form the estimated ENF signal.

In low SNR scenarios, the sets of candidate peak frequency locations used by Ojowu et al. (2012) may not be precisely estimated. To avoid using imprecise peak locations, the authors of Zhu et al. (2018, 2020) apply dynamic programming directly to a 2-D time–frequency representation such as the spectrogram. The authors also propose to extract multiple ENF traces of unequal strengths by iteratively estimating and erasing the strongest one. A near-real-time variant of the multiple ENF extraction algorithm was proposed in Zhu et al. (2020) to facilitate efficient online frequency tracking.
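
The frequency-tracking idea can be sketched with a simple dynamic program over a precomputed spectrogram strip. The quadratic jump penalty below is an illustrative stand-in for the cost functions used in the cited works.

```python
import numpy as np

def track_enf(spec, freqs, jump_penalty=50.0):
    """Minimum-cost path through a spectrogram strip.
    spec: (n_frames, n_bins) magnitudes around the nominal ENF band;
    freqs: (n_bins,) bin frequencies in Hz."""
    n_frames, n_bins = spec.shape
    local = -np.log(spec + 1e-12)                      # stronger bins are cheaper
    jump = jump_penalty * (freqs[:, None] - freqs[None, :]) ** 2   # (prev, cur)
    cost = np.empty((n_frames, n_bins))
    back = np.zeros((n_frames, n_bins), dtype=int)
    cost[0] = local[0]
    for t in range(1, n_frames):
        total = cost[t - 1][:, None] + jump            # arrive at cur from every prev
        best_prev = np.argmin(total, axis=0)
        back[t] = best_prev
        cost[t] = local[t] + total[best_prev, np.arange(n_bins)]
    # Backtrack the cheapest path; its bin frequencies form the ENF estimate.
    path = np.empty(n_frames, dtype=int)
    path[-1] = int(np.argmin(cost[-1]))
    for t in range(n_frames - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return freqs[path]
```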

There have been other proposed modifications to nonparametric approaches in the literature to improve the final ENF signal estimate. In Fu et al. (2013), the authors propose a discrete Fourier transform (DFT)-based binary search algorithm to lower the computational complexity. Instead of calculating the full Fourier spectrum, the proposed method uses the DFT to calculate a spectral line at the midpoint of a frequency interval at each iteration. At the end of the iteration, the frequency interval will be replaced by its left or right half based on the relative strength of the calculated spectral line and that of the two ends of the interval. The search stops once the frequency interval is narrow enough, and the estimated frequency of the current frame will be used to initialize the candidate frequency interval of the next frame.

In Dosiek (2015), the author considers the ENF estimation problem as a frequency demodulation problem. By considering the captured power signal to be a carrier sinusoid of nominal ENF value modulated by a weak stochastic signal, a 0-Hz intermediate frequency signal can be created and analyzed instead of using the higher frequency modulated signal. This allows the ENF to be estimated through the use of FM algorithms.

In Karantaidis and Kotropoulos (2018), the authors suggest using refined periodograms as the basis for the nonparametric frequency estimation approach, including the Welch, Blackman–Tukey, and Daniell methods, as well as the Capon method, which is a filter bank approach based on a data-dependent filter.

In Lin and Kang (2018), the authors propose to improve the extraction of ENF estimates in cases of low SNR by exploiting the low-rank structure of the ENF signal in an approach that uses robust principal component analysis to remove interference from speech content and background noise. Weighted linear prediction is then used to extract the ENF estimates.

2.2.3 Parametric Frequency-Domain Approach

Parametric frequency-domain ENF estimation approaches assume an explicit model for the signal and the underlying noise. Due to such an explicit assumption about the model, the estimates obtained using parametric approaches are expected to be more accurate than those obtained using nonparametric approaches if the modeling assumption is correct (Manolakis et al. 2000). Two of the most widely used parametric frequency estimation methods are based on the subspace analysis of a signal–noise model, namely the MUltiple SIgnal Classification (MUSIC) (Schmidt 1986) and Estimation of Signal Parameters via Rotational Invariance Techniques (ESPRIT) (Roy and Kailath 1989). These methods can be used to estimate the frequency of a signal composed of P complex sinusoids embedded in white noise. Since a real sinusoid corresponds to two complex exponentials, the value of P for ENF signals is 2 when no harmonics are present.

The MUSIC algorithm is a subspace-based approach to frequency estimation that relies on eigendecomposition and the properties between the signal and noise subspaces for sinusoidal signals with additive white noise. MUSIC makes use of the orthogonality between the signal and noise subspaces to compute a pseudo-spectrum for a signal. This pseudo-spectrum should have P frequency peaks corresponding to the dominant frequency components in the signal, i.e., the ENF estimates.

The ESPRIT algorithm exploits the rotational invariance between staggered signal subspaces to produce the frequency estimates. In our case, this property relies on observations of the signal over two intervals of the same length staggered in time. ESPRIT is similar to MUSIC in the sense that they are both subspace-based approaches, but it differs in that it works with the signal subspace rather than the noise subspace.
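
As a concrete illustration, the sketch below computes a MUSIC pseudo-spectrum for one frame and returns the peak near the nominal value. With a single real tone, P = 2; the covariance size m and grid resolution are illustrative assumptions, not values taken from a specific ENF paper.

```python
import numpy as np

def music_enf_frame(x, fs, p=2, m=64, nominal=60.0, half_bw=1.0, n_grid=2000):
    """Single-frame MUSIC frequency estimate for a real tone near `nominal`."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Sample autocorrelation matrix from overlapping length-m snapshots.
    X = np.array([x[i:i + m] for i in range(n - m + 1)])
    R = X.T @ X / X.shape[0]
    # Noise subspace: eigenvectors of the m - p smallest eigenvalues.
    w, V = np.linalg.eigh(R)                 # eigenvalues in ascending order
    En = V[:, :m - p]
    # Scan a fine frequency grid around the nominal ENF value.
    f_grid = np.linspace(nominal - half_bw, nominal + half_bw, n_grid)
    k = np.arange(m)
    pseudo = np.empty(n_grid)
    for i, f in enumerate(f_grid):
        a = np.exp(2j * np.pi * f * k / fs)  # steering vector for candidate f
        pseudo[i] = 1.0 / np.linalg.norm(En.conj().T @ a) ** 2
    return f_grid[np.argmax(pseudo)]
```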

2.3 Higher Order Harmonics for ENF Estimation

The electric power signal is subject to waveform distortion due to the presence of nonlinear elements in the power system. A nonlinear element draws a non-sinusoidal current even when driven by a sinusoidal voltage. Thus, even for a nondistorted voltage waveform, the current through a nonlinear element is distorted. This distorted current waveform, in turn, leads to a distorted voltage waveform. Though most elements of the power network are linear, transformers are not, especially during sustained over-voltages, and neither are power-electronic components. Beyond that, a major source of distortion is nonlinear load, mainly power-electronic converters (Bollen and Gu 2006).

A significant way in which the waveform distortion of the power signal manifests itself is in harmonic distortion, where the power signal waveform can be decomposed into a sum of harmonic components with the fundamental frequency being close to the 50/60  Hz nominal ENF value. It follows that scaled versions of almost the same variations appear in many of the harmonic bands, although the strength of the traces may differ at different harmonics with varying recording environments and devices used. An example of this can be seen in Fig. 10.3. In extracting the ENF signal, we can take advantage of the presence of ENF traces at multiple harmonics of the nominal frequency in order to make our ENF signal estimate more robust.

Fig. 10.3

Spectrogram strips around the harmonics of 60 Hz for two sets of simultaneously recorded power and audio measurements (Hajj-Ahmad et al. 2013)

In Bykhovsky and Cohen (2013), the authors extend the ENF model from a single-tone signal to a multi-tone harmonic one. Under this model, they derive the Cramér–Rao bound, which shows that the harmonic model can lead to a theoretical \(O \left( M^3 \right) \) factor improvement in the ENF estimation accuracy, with M being the number of harmonics. The authors then derive a maximum likelihood estimator for the ENF signal, and their results show a significant gain as compared with the results of the single-tone model.

In Hajj-Ahmad et al. (2013), the authors propose a spectrum combining approach to ENF signal estimation, which exploits the ENF components appearing at different harmonics of a signal and strategically combines them based on the local SNR values. The performance of an ENF-based timestamp verification application is examined under a hypothesis testing framework, validating that the proposed approach achieves a more robust and accurate estimate than conventional ENF approaches that rely solely on a single tone.
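
A simplified per-frame version of combining harmonic bands is sketched below: each harmonic's peak estimate is mapped back to the base band and weighted by a crude SNR proxy. The exact weight derivation in Hajj-Ahmad et al. (2013) differs; the harmonic set and bandwidths here are illustrative.

```python
import numpy as np

def combine_harmonics(frame, fs, nominal=60.0, harmonics=(2, 3, 4, 5, 6, 7),
                      half_bw=0.1, nfft=2**18):
    """Weighted combination of per-harmonic ENF estimates for one frame."""
    frame = np.asarray(frame, dtype=float)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    mag = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft))
    estimates, weights = [], []
    for h in harmonics:
        band = (freqs >= h * (nominal - half_bw)) & (freqs <= h * (nominal + half_bw))
        if not band.any():
            continue                                   # harmonic above Nyquist
        peak = freqs[band][np.argmax(mag[band])]
        snr = mag[band].max() ** 2 / np.mean(mag[band] ** 2)   # crude local SNR proxy
        estimates.append(peak / h)                     # map back to the base band
        weights.append(snr)
    w = np.array(weights) / np.sum(weights)
    return float(np.dot(w, estimates))
```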

3 ENF Characteristics and Embedding Conditions

Although there has been an increasing amount of work recently geared toward extracting the ENF signal from media recordings and then using it for innovative applications, there has been relatively less work done toward understanding the conditions that promote or hinder the embedding of ENF in media recordings. Further work can also be done toward understanding the way the ENF behaves. In this section, we first go over the existing literature on the conditions that affect ENF capture and then discuss recent statistical modeling efforts on the ENF signal.

3.1 Establishing Presence of ENF Traces

Understanding the conditions that promote or hinder the capture of ENF traces in media recordings would go a long way in helping us benefit from the ENF signal in our applications. It would also help us understand better the situations in which ENF analysis is applicable.

If a recording is made with a recorder connected to the electric mains power, it is generally accepted that ENF traces will be present in the resultant recording (Fechner and Kirchner 2014; Kajstura et al. 2005; Brixen 2007a; Huijbregtse and Geradts 2009; Cooper 2009b, 2011; Garg et al. 2012). The strength and presence of the ENF traces depend on the recording device’s internal circuitry and electromagnetic compatibility characteristics (Brixen 2008).

If the recording is made with a battery-powered recorder, the question of whether or not the ENF will be captured in the recording becomes more complex. Broadly speaking, ENF capture can be affected by several factors that can be divided into two groups: factors related to the environment in which the recording was made and factors related to the recording device used to make the recording. Interactions between different factors may also lead to different results. For instance, electromagnetic fields in the place of recording promote ENF capture when the recording microphone is dynamic but not when it is electret. Table 10.1 shows a sample of factors that have been studied in the literature for their effect on ENF capture in audio recordings (Fechner and Kirchner 2014; Brixen 2007a; Chai et al. 2013).

Table 10.1 Sample of factors affecting ENF capture in audio recordings made by battery-powered recorders (Hajj-Ahmad et al. 2019)

Overall, the most common cause of ENF capture in audio recordings is the acoustic mains hum, which can be produced by mains-powered equipment in the place of recording. The hypothesis that this background noise is a carrier of ENF traces was confirmed in Fechner and Kirchner (2014). Experiments carried out in an indoor setting suggested high robustness of ENF traces: traces were present in a recording made 10 m away from a noise source located in a different room. Future work will still need to conduct large-scale empirical studies to infer how likely real-world audio recordings are to contain distinctive ENF traces.

Hajj-Ahmad et al. (2019) conducted studies exploring factors that affect the capture of ENF traces. They demonstrated that moving a recorder while making a recording will likely compromise the quality of the ENF being captured, due to the Doppler effect, possibly in conjunction with other factors such as air pressure and other vibrations. They also showed that using different recorders in the same recording setting can lead to different strengths of ENF traces captured and at different harmonics. Further studies along this line will help understand better the applicability of ENF research and inform the design of scalable ENF-based applications.

In Zhu et al. (2018, 2020), the authors propose a test for the presence of ENF traces in 1-D signals. The presence of the ENF is first tested for each frame using a test statistic quantifying the relative energy within a small neighborhood of the frequency of the spectral peak. A frame can be classified as “voiced” or “unvoiced” by thresholding the test statistic. The algorithm merges frames of the same decision type into a segment while allowing frames of the other type to be sparsely present within a long segment. This refinement produces a final decision result containing two types of segments that are sparsely interleaved over time.
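
A per-frame presence test in this spirit can be sketched as follows, using the fraction of band energy concentrated in a narrow neighborhood of the spectral peak as the test statistic. The neighborhood width and threshold are illustrative assumptions rather than the exact values of the cited works.

```python
import numpy as np

def enf_present(frame, fs, nominal=60.0, half_bw=1.0, peak_bw=0.02,
                nfft=2**16, threshold=0.5):
    """Return True if the spectrum near `nominal` is dominated by a narrow peak,
    suggesting that ENF traces are present in this frame."""
    frame = np.asarray(frame, dtype=float)
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    mag2 = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=nfft)) ** 2
    band = (freqs >= nominal - half_bw) & (freqs <= nominal + half_bw)
    f_peak = freqs[band][np.argmax(mag2[band])]
    near = band & (np.abs(freqs - f_peak) <= peak_bw)
    ratio = mag2[near].sum() / mag2[band].sum()        # relative energy statistic
    return ratio > threshold
```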

In Vatansever et al. (2017), the authors address the issue of the presence of ENF traces in videos. In particular, the problem they address is that ENF analysis on videos can be computationally expensive, and typically, there is no guarantee that a video would contain ENF traces prior to analysis. In their paper, they propose an approach to assess a video before further analysis to understand whether it contains ENF traces. Their ENF detection approach is based on using superpixels, which are steady object regions having very close reflectance properties, in a representative video frame. They show that their algorithm can work on video clips as short as 2 min and can operate independently of the camera image sensor type, i.e., complementary metal-oxide semiconductor (CMOS) or charge-coupled device (CCD).

3.2 Modeling ENF Behavior

Understanding how the ENF behaves over time can be important toward understanding if a detected frequency component corresponds to the ENF or not. Several studies have carried out statistical modeling on the ENF variations. In Fig. 10.4, we can see that ENF values collected from the UK grid over one month follow a Gaussian-like distribution (Cooper 2009b).

Fig. 10.4

Histogram of ENF data collected from the UK for November 2006, with a Gaussian distribution overlaid for reference (Cooper 2009b)

Other studies have been carried out on the North American grids, where there are four interconnections, namely the Eastern Interconnection (EI), the Western Interconnection (WECC), the Texas Interconnection (ERCOT), and the Quebec Interconnection. The studies show that the ENF generally follows Gaussian distributions in the Eastern, Western, and Quebec interconnections, yet the mean and standard deviation values differ across interconnections. For instance, the Western interconnection shows a smaller standard deviation than the Eastern interconnection, indicating that it maintains a slightly tighter control over the ENF variations. The differences in density are reflective of the control strategies employed on the grids and the size of the grids (Liu et al. 2011, 2012; Garg et al. 2012; Top et al. 2012). A further indication of the varying behaviors of ENF variations across different grids is that a study has shown that the ENF data collected in Singapore do not strictly follow a Gaussian distribution (Hua 2014).

In Garg et al. (2012), the authors have shown that the ENF signal in the North American Eastern interconnection can be modeled as a piecewise wide sense stationary (WSS) signal. Following that, they model the ENF signal as a piecewise autoregressive (AR) process, and show how this added understanding of the ENF signal behavior can be an asset toward improving performance in ENF applications, such as with time–location authentication, which will be discussed in Sect. 10.5.1.

4 ENF Traces in the Visual Track

In Garg et al. (2011), the authors showed for the first time that the ENF signal can be sensed and extracted from the light illumination using a customized photodiode circuit. In this section, we explain how the ENF signal can be extracted from video and image signals captured by digital cameras. First, we describe each stage of the pipeline that an ENF signal goes through and is processed. Different types of imaging sensors, namely CCD and CMOS, will be discussed. Second, we explain the steps to extract the ENF signal. We start from simple videos with white-wall scenes and then move to more complex videos with camera motion, foreground motion, and brightness change.

4.1 Mechanism of ENF Embedding in Videos and Images

The intensity fluctuation in visual recordings is caused by the alternating pattern of the supply current/voltage. We define ENF embedding as the process of adding intensity fluctuations caused by the ENF to visual recordings. The steps of the ENF embedding are illustrated in Fig. 10.5. The process starts by converting the AC voltage/current into a slightly flickering light signal. The light then travels through the air with its energy attenuated. When it arrives at an object, the flickering light interacts with the surface of the object, producing a reflected light that flickers at the same speed. The light continues to travel through the air and the lens and goes through another round of attenuation before it arrives at the imaging sensor of a camera or the retina of a human. Through a short temporal accumulation process that is lowpass in nature, a final sensed signal will contain the flickering component in addition to the pure visual signal. We describe each stage of ENF embedding in Sect. 10.4.1.1.

Fig. 10.5

Illustration of the ENF embedding process for visual signals. The supply voltage v(t) with frequency f(t) is first converted into the visual light signal, which has a DC component and an AC component flickering at 2f(t) Hz. The visual light then reaches an object and interacts with the object’s surface. The reflected light emitted from the object arrives at a camera’s imaging sensor at which photons are collected for a duration of the exposure time, resulting in a sampled and quantized temporal visual signal whose intensity fluctuates at 2f(t) Hz

Depending on whether a camera’s image sensor type is CCD or CMOS, the flickering due to the ENF will be captured at the frame level or the row level, respectively. The latter scenario has an equivalent sampling rate that is hundreds or thousands of times that of the former scenario. We provide more details about CCD and CMOS in Sect. 10.4.1.2.

4.1.1 Physical Embedding Processes

Given the supply voltage

$$\begin{aligned} v(t) = A(t) \cos \left( 2 \pi \int _{-\infty }^{t} f(\tau ) d\tau + \phi _0 \right) \end{aligned}$$
(10.1)

with a time-varying voltage amplitude A(t), a time-varying frequency f(t) Hz, and a random initial phase \(\phi _0\), it follows from the power law that the frequency of the supply power is double that of the supply voltage, namely

$$\begin{aligned} P(t) = v^2(t) / R&= A^2(t) \cos ^2 \left( 2 \pi \int _{-\infty }^{t} f(\tau ) d\tau + \phi _0 \right) \Big / R \end{aligned}$$
(10.2a)
$$\begin{aligned}&= \frac{A^2(t)}{2R} \left\{ \cos \left( 2 \pi \int _{-\infty }^{t} [2f(\tau )] d\tau + 2\phi _0 \right) + 1 \right\} . \end{aligned}$$
(10.2b)

Note that for the nonnegative supply power P(t), the ratio of the amplitude of the AC component to the strength of the DC is equal to one. The voltage to power conversion is illustrated by the first block of Fig. 10.5.

Next, the supply power in the electrical form is converted into the electromagnetic wave by a lighting device. The conversion process equivalently applies a lowpass filter to attenuate the relative strength of the AC component based on the optoelectronic principles of the fluorescent or incandescent light. The resulting visual light contains a stable DC component and a relatively smaller flickering AC component with a time-varying frequency 2f(t). The power to light conversion is illustrated by the second block of Fig. 10.5.

A ray of the visual light then travels to an object and interacts with the object’s surface to produce reflected or reemitted light that can be picked up by the eyes of an observer or a camera’s imaging system. The overall effect is a linear scaling to the magnitude of the visual light signal, which preserves its flickering component for the light arriving at the lens. The third block of Fig. 10.5 illustrates a geometric setup where a light source shines on an object, and a camera acquires the reflected light. Given a point \(\mathbf{p}\) on the object, the intensity of reflected light arriving at the camera is dependent on the light–object distance \(d(\mathbf{p})\) and the camera–object distance \(d'(\mathbf{p})\) per the inverse-square law that light intensity is inversely proportional to the squared distance. The intensity of the reflected light is also determined by the reflection characteristics of the object at \(\mathbf{p}\). The reflection characteristics can be broadly summarized into the diffuse reflection and the specular reflection, which are widely adopted in computer vision and computer graphics for understanding and modeling everyday vision tasks (Szeliski 2010). The impact of reflection on the intensity of the reflected light is a combined effect of albedo, the direction of the incident light, the orientation of the object’s surface, and the direction of the camera. From the camera’s perspective, the light intensities at different locations of the object form an image of the object (Szeliski 2010). Adding a time dimension, the intensities of all locations fluctuate synchronously at the speed of 2f(t) Hz due to the changing appearance caused by the flickering light.

The incoming light is further sampled and quantized by the camera’s sensing unit to create digital images or videos. Through the imaging sensor, the photons of the incoming light reflected from the scene are converted into electrons and subsequently into voltage levels that represent the intensity levels of pixels of a digital image. Photons are collected for a duration of the exposure time to accumulate a sizable mass to clearly depict the scene. Hajj-Ahmad et al. (2016) show that the accumulation of photons can be viewed as convolving a rectangular window with the perceived light signal arriving at the lens, as illustrated in the last block of Fig. 10.5. This is a second lowpass filtering process that further reduces the strength of the AC component relative to the DC.

4.1.2 Rolling Versus Global Shutter

When images and videos are digitized by cameras, depending on whether the camera uses a global-shutter or rolling-shutter sensing mechanism, the ENF signal will be embedded into the visual data in different ways. CCD cameras usually capture images using a global shutter, which exposes and reads out all pixels of an image/frame simultaneously; hence the ENF is captured through the global intensity of the image. In contrast, CMOS cameras usually capture images using rolling shutters (Gu et al. 2010). A rolling shutter exposes and reads out only one row of pixels at a time, hence the ENF is captured by sequentially sampling the intensity of every row of an image/frame. The rolling shutter digitizes rows of each image/frame sequentially, making it possible to sense the ENF signal much faster than using the global shutter: the effective sampling rate is scaled up by a multiplicative factor that equals the number of rows of sensed images, which is usually on the order of hundreds or thousands. The left half of Fig. 10.6 shows a timing diagram of when each row and each frame are acquired using a rolling shutter. Rows of each frame are sampled uniformly in time, followed by an idle period before proceeding to the next frame (Su et al. 2014a). The right half of Fig. 10.6 shows a row signal of a white-wall video (Su et al. 2014b), which can be used for frequency estimation/tracking. The row signal is generated by averaging the pixel intensities of each row and concatenating the averaged values for all frames (Su et al. 2014b). Note that the discontinuities are caused by the missing values during the idle periods.

Fig. 10.6

(Left) Rolling-shutter sampling time diagram of a CMOS camera: Rows of each frame are sequentially exposed and read out, followed by an idle period before proceeding to the next frame (Su et al. 2014a). (Right) The row signal of a white-wall video generated by averaging the pixel intensities of each row and concatenating the averaged values for all frames (Su et al. 2014b). The discontinuities are caused by idle periods

4.2 ENF Extraction from the Visual Track

ENF signals can be embedded in both global-shutter and rolling-shutter videos, as explained in Sect. 10.4.1.2. However, successfully extracting an ENF signal from a global-shutter video requires a careful selection of the camera’s frame rate. This is because global-shutter videos may suffer from frequency contamination due to aliasing caused by an insufficient sampling rate, normally ranging from 24 to 30 fps. Two typical failure examples are using a 25-fps camera in an environment with light flickering at 100 Hz, or using a 30-fps camera with 120-Hz light. We will detail this challenge in Sect. 10.4.2.1.

The rolling shutter, traditionally considered to be detrimental to image quality (Gu et al. 2010), can sample the flickering of the light signal hundreds or thousands of times faster than the global shutter. A faster sampling rate eliminates the need to worry about the ENF’s potential contamination due to aliasing.

We will explain the steps to extract the ENF signal starting from simple videos with no visual content in Sect. 10.4.2.2 to complex videos with camera motion, foreground motion, and brightness change in Sect. 10.4.2.3.

4.2.1 Challenges of Using Global-Shutter Videos

In Garg et al. (2013), the authors used a CCD camera to capture white-wall videos under indoor lighting to demonstrate the feasibility of extracting ENF signals from the visual track. The white-wall scene can be considered to contain no visual content except for an intensity bias, hence it can be used to demonstrate the essential steps to extract ENF from videos. The authors took the average of the pixel values in each H-by-W frame of the video and obtained a 1-D sinusoid-like time signal \(s_{\text {frame}}(t)\) for frequency estimation, namely

$$\begin{aligned} s_{\text {frame}}(t) = \frac{1}{HW}\sum _{x=0}^{H-1} \sum _{y=0}^{W-1} {I(x,y,t)}, \quad t = 0, 1, 2, \dots , \end{aligned}$$
(10.3)

where H is the number of rows, W is the number of columns, I is the video intensity, and x, y, and t are its row, column, and time indices, respectively. Here, the subscript “frame” of the signal symbol implies that its sampling is conducted at the frame level. Figure 10.7(a) and (b) shows the spectrograms calculated from the power signal and the time signal \(s_{\text {frame}}(t)\) of frame-averaged intensity, respectively. The comparison reveals that the ENF estimated from the video has the same trend as that from the power mains and has a doubled dynamic range, confirming the feasibility of ENF extraction from videos.

Fig. 10.7

Spectrograms of fluctuating ENF measured from (a) power mains and (b) frames of a white-wall video recording (Garg et al. 2013)

One major challenge in using a global-shutter-based CCD camera for ENF estimation is the aliasing effect caused by the insufficient sampling rate. Most consumer digital cameras adopt a frame rate of around 30 fps, while the ENF signal and its harmonics appear at integer multiples of 100/120 Hz. The ENF signal, therefore, suffers from a severe aliasing effect due to an insufficient sampling rate. In Garg et al. (2013), the illumination flickered at 100 Hz and the camera’s sampling rate was 30 fps, resulting in an aliased component centered at 10 Hz, as illustrated in Fig. 10.7(b). Such a combination of ENF nominal frequency and the CCD camera’s frame rate does not affect the ENF extraction. However, the 30-fps sampling rate can cause major difficulties in ENF estimation for global-shutter videos captured in 120-Hz ENF countries such as the US. We point out two issues using the illustrations in Fig. 10.8. First, two mirrored frequency components cannot be easily separated once they are mixed. Since the power signal is real-valued, a mirrored \(-120\) Hz component also exists in its frequency domain, as illustrated in the left half of Fig. 10.8. When the frame rate is 30 fps, both \(\pm 120\) Hz components will be aliased to 0 Hz upon sampling, creating a symmetrically overlapping pattern at 0 Hz as shown in the right half of Fig. 10.8. Once the two desired frequency components are mixed, they cannot, in general, be separated without ambiguity, making ENF extraction impossible in most cases. Second, to make matters worse, the native DC content around 0 Hz may further distort the mirrored and aliased ENF components (Garg et al. 2013), which may further hinder the estimation of the desired frequency component.

Fig. 10.8

Issue of mixed frequency components caused by aliasing when global-shutter CCD cameras are used. (Left) Time-frequency representation of the original signal containing substantial signal contents around \({\pm }120\) Hz and DC (0 Hz); (Right) Mixed/aliased frequency components around DC after being sampled at 30 fps. Two mirrored components, in general, cannot be separated once mixed. Their overlap with the DC component further hinders the estimation of the desired frequency component

4.2.2 Rolling-Shutter Videos with No Visual Content

To address the challenge of insufficient sampling rate, Garg et al. (2013), Su et al. (2014a), and Choi and Wong (2019) exploited the fact that a rolling shutter acquires rows of each frame in a sequential manner, which effectively upscales the sampling rate by a multiplicative factor of the number of rows in a frame. In the rolling-shutter scenario, the authors again use a white-wall video to demonstrate the essential steps of extracting ENF traces. As discussed in Sect. 10.4.1.2, the ENF is embedded by sequentially affecting the row intensities. To extract the ENF traces, it is intuitive to average the similarly affected intensity values within each row to produce one value indicating the impact of ENF at the timestamp at which the row is exposed. More precisely, after averaging, the resulting “video” data is indexed only by row x and time t:

$$\begin{aligned} I_{\text {row}}(x, t) = \frac{1}{W} \sum _{y=0}^{W-1} I(x, y, t), \quad x = 0, \dots , H-1,\ t = 0, 1, \dots . \end{aligned}$$
(10.4)

Su et al. (2014a) define a 1-D row signal by concatenating \(I_{\text {row}}(x, t)\) along time, namely

$$\begin{aligned} s_{\text {row}}(n) = I_{\text {row}}(n \text { mod } H, \ {\text {floor}}(n/H)), \quad n = 0, 1, \dots \end{aligned}$$
(10.5)

Using a similar naming convention as in (10.3), the subscript in \(s_{\text {row}}(n)\) implies that its sampling is equivalently conducted at the row level, which is hundreds or thousands of times faster than the frame rate. This allows the frequency estimation of ENF to be conducted on a signal of a much higher rate without suffering from potentially mixed signals due to aliasing.

Multi-rate signal analysis (Parishwad 2006) is used in Su et al. (2014a) to analyze the concatenated signal (10.5), showing that direct concatenation that ignores a frame’s idle period can result in slight distortion to the estimated ENF traces. To avoid such distortion, Choi and Wong (2019) show that the signal must be concatenated as if it were sampled uniformly in time. That is, the missing sample points due to the idle period need to be filled in with zeros before the concatenation. This zero-padding approach can produce undistorted ENF traces but requires knowledge of the duration of the idle period ahead of time. The idle period is related to a camera-model-specific parameter named the read-out time (Hajj-Ahmad et al. 2016), which will be discussed in Sect. 10.5.4. The authors in Vatansever et al. (2019) further illustrate how the frequency of the main ENF harmonic is replaced with new ENF components depending on the length of the idle period. Their model reveals that the power of the captured ENF signal is inversely proportional to the idle period length.
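
A sketch of constructing the row signal from a rolling-shutter video, with optional zero padding for the idle period, is given below. It assumes the video is already available as an array of grayscale frames and that the number of missing row slots per frame (derived from the read-out time) is known; both are illustrative assumptions.

```python
import numpy as np

def row_signal(frames, idle_rows=0):
    """frames: array of shape (n_frames, H, W) with grayscale intensities.
    idle_rows: number of zero samples inserted per frame to model the idle
    period; idle_rows=0 reproduces direct concatenation."""
    frames = np.asarray(frames, dtype=float)
    n_frames, H, W = frames.shape
    rows = frames.mean(axis=2)                        # (n_frames, H): average each row
    if idle_rows > 0:                                 # periodic zero padding
        rows = np.concatenate([rows, np.zeros((n_frames, idle_rows))], axis=1)
    return rows.reshape(-1)                           # concatenate frames in time order

# Hypothetical usage: a 1080-row video whose read-out time implies about 120
# missing row slots per frame: s_row = row_signal(frames, idle_rows=120)
```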

4.2.3 Rolling-Shutter Videos with Complex Visual Content

In practical scenarios, an ENF signal in the form of light intensity variation coexists with nontrivial video content reflecting camera motion, foreground motion, and brightness change. To be able to extract an ENF signal from such videos, the general principle is to construct a reasonable estimator for the video content V(x, y, t) and then subtract it from the original video I(x, y, t) in order to single out the light intensity variation due to ENF (Su et al. 2014a, b). The resulting ENF-only residual video \(\hat{E}(x,y,t) = I(x,y,t) - \hat{V}(x, y, t)\) can then be used for ENF extraction by following the procedure laid out in Sect. 10.4.2.2. A conceptual diagram is shown in Fig. 10.9. Below, we formalize the procedure to generate an ENF-only residual video using a static-scene example. The procedure is also applicable to videos with more complex scenes once motion estimation and compensation are conducted to estimate the visual content V(x, y, t). We will also explain how camera motion, foreground motion, and brightness change can be addressed.

Fig. 10.9

A schematic of how ENF can be extracted from rolling-shutter videos

ENF-Only Residual Video Generation

Based on the ENF embedding mechanism discussed in Sect. 10.4.1.2, we formulate an additive ENF embedding model for rolling-shutter captured video as follows:

$$\begin{aligned} I(x,y,t) = V(x,y,t) + E(x,t), \end{aligned}$$
(10.6)

where V(x, y, t) is the visual content and E(x, t) is the ENF component that depends on row index x and time index t. For a fixed t, the ENF component E(x, t) is a sinusoid-like signal.

Our goal is to estimate E(x, t) given a rolling-shutter captured video I(x, y, t) modeled by (10.6). Once we obtain the estimate \(\hat{E}(x, t)\), one can choose to use the direct concatenation (Su et al. 2014a) or the periodic zero padding (Choi and Wong 2019) to generate a 1-D time signal for frequency estimation.

An intuitive approach to estimating E(x, t) is to first obtain a reasonable estimate \(\hat{V}(x, y, t)\) of the visual content and subtract it from the raw video I(x, y, t), namely

$$\begin{aligned} \hat{E}(x,y,t) = I(x,y,t) - \hat{V}(x, y, t). \end{aligned}$$
(10.7)

For a static scene video, since the visual contents of every frame are identical, it is intuitive to obtain an estimator by taking the average of a random subset \(\mathcal {T} \subset \{0, 1, 2, \dots \}\) of video frames

$$\begin{aligned} \hat{V}(x, y, t) = \frac{1}{|\mathcal {T}|} \sum _{t \in \mathcal {T}} I(x,y,t). \end{aligned}$$
(10.8)

Such a random average can help cancel out the ENF component in (10.6). Once \(\hat{E}(x,y,t)\) is obtained, the same procedure of averaging over the column index (Su et al. 2014a) as in (10.4) is conducted to estimate E(x, t), as the pixels of the same row in \(\hat{E}(x, y, t)\) are perturbed by the ENF in the same way:

$$\begin{aligned} \hat{E}(x, t)&= \frac{1}{W} \sum _{y=0}^{W-1}{ \hat{E}(x, y, t)} \end{aligned}$$
(10.9a)
$$\begin{aligned}&= E(x, t) + \frac{1}{W} \sum _{y=0}^{W-1} [V(x,y,t) - \hat{V}(x,y,t)]. \end{aligned}$$
(10.9b)

When \(\sum _{y=0}^{W-1} \hat{V}(x,y,t)\) is a good estimator for \(\sum _{y=0}^{W-1} V(x,y,t)\) for all (x, t), the second term is close to zero, making \(\hat{E}(x, t)\) a good estimator for E(x, t). To extract the ENF signal, \(\hat{E}(x,t)\) is then vectorized into a 1-D time signal as in (10.5), namely

$$\begin{aligned} s_{\text {row}}(n) = \hat{E}(n \text { mod } H, \ {\text {floor}}(n/H)),\quad n = 0, 1, \dots \end{aligned}$$
(10.10)
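
The static-scene pipeline (10.6)–(10.10) can be sketched as follows; the size of the random frame subset is an illustrative choice.

```python
import numpy as np

def enf_residual_row_signal(frames, n_ref=30, seed=0):
    """Static-scene ENF extraction sketch.
    frames: (n_frames, H, W) grayscale video intensities."""
    frames = np.asarray(frames, dtype=float)
    rng = np.random.default_rng(seed)
    n_frames = frames.shape[0]
    # (10.8): estimate the visual content from a random subset of frames.
    subset = rng.choice(n_frames, size=min(n_ref, n_frames), replace=False)
    V_hat = frames[subset].mean(axis=0)               # (H, W)
    # (10.7): ENF-only residual video.
    E_hat_video = frames - V_hat[None, :, :]
    # (10.9): average the residual over the column index to get E_hat(x, t).
    E_hat = E_hat_video.mean(axis=2)                  # (n_frames, H)
    # (10.10): vectorize into a 1-D row signal for frequency estimation.
    return E_hat.reshape(-1)
```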

Camera Motion

For videos with more complex scenes, appropriate video processing tools need to be used to generate the estimated visual content \(\hat{V}(x,y,t)\). We first consider the class of videos containing merely camera motions such as panning, rotation, zooming, and shaking. Two frames that are not far apart in time can be regarded as being generated by almost identical scenes projected to the image plane with some offset in the image coordinates. The relationship of the two frames can be established by finding each pixel of the frame of interest at \(t_0\) and its new location in a frame at \(t_1\), namely

$$\begin{aligned} I \, \big ( \, x + dx(x,y,t_1)\, , \, y + dy(x,y,t_1)\, , \, t_0 \, \big ) \approx I(x,y,t_1), \end{aligned}$$
(10.11)

where (dx, dy) is the motion vector associated with the frame of interest pointing to the new location in the frame at \(t_1\). Motion estimation and compensation are carried out in Su et al. (2014a), Su et al. (2014b) using optical flow to obtain a set of estimated frames \(\hat{I}_i(x,y,t_0)\) for the frame of interest \(I(x,y,t_0)\). As the ENF signal has a relatively small magnitude compared to the visual content and the motion compensation procedure often introduces noise, an average over all these motion-compensated frames, namely

$$\begin{aligned} \hat{V}(x, y, t) = \frac{1}{n} \sum _{i=1}^n \hat{I}_i(x,y,t), \end{aligned}$$
(10.12)

should lead to a good estimation of the visual content per the law of large numbers.
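
One way to realize the motion-compensated averaging of (10.12) is sketched below with OpenCV's dense Farnebäck optical flow; the flow parameters, and the use of this particular flow method, are illustrative assumptions rather than the exact choices of Su et al. (2014a, b).

```python
import cv2
import numpy as np

def motion_compensated_content(frames, t0, neighbors):
    """Estimate the visual content of frames[t0] by warping nearby frames onto it.
    frames: sequence of grayscale uint8 frames; neighbors: frame indices near t0."""
    H, W = frames[t0].shape
    grid_x, grid_y = np.meshgrid(np.arange(W, dtype=np.float32),
                                 np.arange(H, dtype=np.float32))
    warped = []
    for t in neighbors:
        # Dense flow from frames[t0] to frames[t]: flow[y, x] = (dx, dy).
        flow = cv2.calcOpticalFlowFarneback(frames[t0], frames[t], None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Pull pixels of frames[t] back to the coordinates of frames[t0].
        warped.append(cv2.remap(frames[t], grid_x + flow[..., 0],
                                grid_y + flow[..., 1], cv2.INTER_LINEAR))
    # Averaging the compensated frames suppresses the ENF flicker and noise.
    return np.mean(warped, axis=0)
```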

Foreground Motion

We now consider the class of videos with foreground motions only. In this case, the authors in Su et al. (2014a), Su et al. (2014b) use a video’s static regions to estimate the ENF signal. Given two image frames, the regions that are not affected by foreground motion in both frames are detected by thresholding the pixel-wise differences. The row averages of the static regions are then calculated to generate a 1-D time signal for ENF estimation.

Brightness Compensation

Many cameras are equipped with an automatic brightness control unit that would adjust a camera’s sensitivity in response to the changes of the global or local light illumination. For example, as a person in a bright background moves closer to the camera, the background part of the image can appear brighter when the control unit relies on the overall intensity of the frame to adjust the sensitivity (Su et al. 2014b). Such a brightness change can bias the estimated visual content and lead to a failure in the content elimination process.

Su et al. (2014b) found that the intensity values of two consecutive frames roughly follow the linear relationship

$$\begin{aligned} I(x,y,t) \approx \beta _1(t) I(x,y,t+1) + \beta _0(t), \quad \forall (x,y), \end{aligned}$$
(10.13)

where \(\beta _1(t)\) is a slope and \(\beta _0(t)\) is an intercept for the frame pair \(\big (I(x,y,t),I(x,y,t+1)\big )\). By estimating the parameters, background intensity values of different frames can be matched to the same level, which allows a precise visual content estimation and elimination.
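
The frame-pair brightness relationship (10.13) can be estimated with a simple least-squares fit, as sketched below.

```python
import numpy as np

def brightness_params(frame_t, frame_t1):
    """Fit I(x,y,t) ~ beta1 * I(x,y,t+1) + beta0 over all pixels, as in (10.13)."""
    x = np.asarray(frame_t1, dtype=float).reshape(-1)
    y = np.asarray(frame_t, dtype=float).reshape(-1)
    A = np.column_stack([x, np.ones_like(x)])
    (beta1, beta0), *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta1, beta0

# Usage sketch: match frame t+1 to the brightness level of frame t before
# estimating and subtracting the visual content:
#   beta1, beta0 = brightness_params(frame_t, frame_t1)
#   frame_t1_matched = beta1 * frame_t1 + beta0
```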

4.3 ENF Extraction from a Single Image

In an extension to ENF extraction from videos, Wong et al. (2018) showed that ENF can even be extracted from a single image taken by a camera with a rolling shutter. As with rolling-shutter videos, each row of an image is sequentially acquired and contains the temporal flickering component due to the ENF. The acquisition of all rows within an image happens during the frame period that is usually 1/30 or 1/25 s based on the common frame rates, which is too short to extract a meaningful temporal ENF signal for most ENF-based analysis tasks. However, an easier binary classification problem can be answered (Wong et al. 2018): Is it possible to tell whether the capturing geographic region of an image has 50 Hz or 60 Hz supply frequency?

There are several unsolved research challenges in ENF from images. For example, the proposed ENF classifier in Wong et al. (2018) works well for images with synthetically added ENF, but its performance on real images may be further improved by incorporating a more sophisticated physical embedding model. Moreover, instead of making a binary decision, one may construct a ternary classifier to test whether ENF is present in a given image. Lastly, one may take advantage of a few frames captured in a continuous shooting mode to estimate the rough shape of the ENF signal for authentication purposes.

5 Key Applications in Forensics and Security

The ENF signal has been shown to be a useful tool in solving several problems that are faced in digital forensics and security. In this section, we highlight major ENF-based forensic applications discussed in the ENF literature in recent years. We start with joint time–location authentication, and proceed to discuss ENF-based tampering detection, ENF-based localization of media signals, and ENF-based approaches for camera forensics.

5.1 Joint Time–Location Authentication

The earliest works on ENF-based forensic applications have focused on ENF-based time-of-recording authentication of audio signals (Grigoras 2005, 2007; Cooper 2008; Kajstura et al. 2005; Brixen 2007a; Sanders 2008). The ENF pattern extracted from an audio signal should be very similar to the ENF pattern extracted from a simultaneously made power reference recording. So, if a power reference recording is available from a claimed time-of-recording of an audio signal, the ENF pattern can be extracted from the audio signal and compared against the ENF pattern from the power recording. If the extracted audio ENF pattern is similar to the reference pattern, then the audio recording’s claimed timestamp is deemed authentic.

To measure the similarity between the two ENF patterns, the minimum mean squared error (MMSE) criterion or Pearson’s correlation coefficient can be used. Certain modifications to these matching criteria have been proposed in the literature.
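Before turning to those modifications, the following minimal sketch illustrates the two baseline criteria by sliding a query ENF signal over a longer reference ENF signal and scoring each lag; it assumes both signals are sampled at the same rate and is not tied to any particular paper’s implementation.

```python
import numpy as np

def match_enf(query, reference):
    """Slide `query` over a longer `reference` ENF signal and score each lag.

    Returns the best lag (in samples) under the MMSE criterion and under
    Pearson's correlation coefficient.
    """
    n = len(query)
    lags = range(len(reference) - n + 1)
    mse = [np.mean((reference[l:l + n] - query) ** 2) for l in lags]
    rho = [np.corrcoef(reference[l:l + n], query)[0, 1] for l in lags]
    return int(np.argmin(mse)), int(np.argmax(rho))
```

If the reference database stores one ENF estimate per second, the returned lag translates directly into a candidate time offset relative to the start of the reference.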

One such modification to the matching criteria was proposed in Garg et al. (2012), where the authors exploit their finding that the US ENF signal can be modeled as a piecewise linear AR process to achieve better matching in ENF-based time-of-recording estimation. In this work, the authors extract the innovation signals resulting from the AR modeling of the ENF signals and use these innovation signals for matching instead of the original ENF signals. Experiments conducted under a hypothesis testing framework show that this approach provides higher confidence in time-of-recording estimation and verification.

Other matching approaches have been proposed as well. For instance, in Hua (2014), the authors proposed a new threshold-based dynamic matching algorithm, termed the error correction matching (ECM) algorithm, to carry out the matching. Their approach accounts for noise affecting ENF estimates due to limited frequency resolution, and the paper illustrates the advantages of the proposed approach using both synthetic and real signals. The performance of the ECM approach, along with a simplified version of it termed the bitwise similarity matching (BSM) approach, is compared against the conventional MMSE matching criterion in Hua (2018). The authors find that, owing to the complexity of practical situations, an advantage of the ECM and BSM approaches over the benchmark MMSE approach cannot be guaranteed. In the situations examined in the paper, however, ECM yields the most accurate matching results, while BSM trades matching accuracy for processing speed.

In Vatansever et al. (2019), the authors propose a time-of-recording authentication approach tailored to ENF signals extracted from videos taken with cameras using rolling shutters. As mentioned in Sect. 10.4.1.2, such cameras typically have an idle period between frames, resulting in missed ENF samples in the resultant video. In Vatansever et al. (2019), the missing illumination samples in each frame are interpolated under multiple idle-period assumptions to compensate for the missing samples. Each interpolated time series then yields an ENF signal that can be matched to the ground-truth ENF reference through correlation coefficients to find or verify the time-of-recording.

An example of a real-life case where ENF-based time-of-recording authentication was used was described in Kajstura et al. (2005). In 2003, the Institute of Forensic Research in Cracow, Poland, was asked to investigate a 55-min long recording of a conversation between two businessmen made using a Sony portable digital recorder (model: ICD-MS515). The time of the creation of the recording indicated by the evidential recorder differed by close to 196 days from the time reported by witnesses of the conversation. Analysis of the audio recording revealed the presence of ENF traces. The ENF signal extracted from the audio recording was compared against reference ENF signals provided by the power grid operator, and it was revealed that the true time of the creation of the recording was the one reported by the witnesses. In this case, the date/time setting on the evidential recorder at the time of recording was incorrect.

A broader way to look at this particular ENF forensics application is not just as time-of-recording authentication but as joint time–location authentication. Consider the case where a forensic analyst is asked to verify, or discover, the correct time-of-recording and location-of-recording of a media recording. In addition to authenticating the time-of-recording, the analyst may also be able to identify the location-of-recording at a power grid level. Provided the analyst has access to ENF reference data from candidate times-of-recording and candidate grids-of-origin, comparing the ENF pattern extracted from the media recording against the reference ENF patterns from the candidate times and grids should point to the correct time-of-recording and grid-of-origin of the media recording.

This application was the first that piqued the interest of the forensics community in the ENF signal. Yet, wide application in practical settings still faces several challenges.

First, the previous discussion has assumed that the signal has not been tampered with. If it has been tampered with, the previously described approach may not be successful. Section 10.5.2 describes approaches to verify the integrity of a media recording using its embedded ENF.

Second, depending on the initial information available about the recording, exhaustively matching the extracted media ENF pattern against all possible ENF patterns may be too expensive. An example of recent work addressing this challenge is Pop et al. (2017), where the authors proposed computing a bit sequence encoding the trend of an ENF signal when it is added to the ENF database. When presented with a query ENF signal to be timestamped, its trend bit sequence is computed and compared to the reference trend bit sequences to prune the candidate timestamps and reduce the number of reference ENF signals against which the query must be compared using conventional approaches such as MMSE and the correlation coefficient.
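One way such trend-based pruning could be realized is sketched below: each ENF signal is summarized by one bit per sample pair (1 if the frequency rises, 0 otherwise), and a query is compared in full only against references whose bit sequences are within a mismatch budget. The encoding and the threshold are illustrative assumptions rather than the exact scheme of the cited work.

```python
import numpy as np

def trend_bits(enf):
    """Encode an ENF signal as a binary up/down trend sequence."""
    return (np.diff(enf) > 0).astype(np.uint8)

def prune_candidates(query_enf, reference_segments, max_mismatch=0.2):
    """Keep only the reference segments whose trend bits roughly agree with the query."""
    q = trend_bits(query_enf)
    survivors = []
    for idx, ref in enumerate(reference_segments):   # same-length candidate segments
        if np.mean(trend_bits(ref) != q) <= max_mismatch:
            survivors.append(idx)                    # compare only these with MMSE/correlation
    return survivors
```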

Third, this application assumes the availability of reference ENF data from the time and location of the media recording under study. In the case where reference data is incomplete or unavailable, a forensic analyst will need to resort to other measures to carry out the analysis. In Sect. 10.5.3, one such approach is described, which infers the grid-of-origin of a media recording without the need for concurrent reference ENF data.

5.2 Integrity Authentication

Audio forgery techniques can be used to conduct piracy over the Internet, falsify court evidence, or modify recordings from security devices or of events taking place in different parts of the world (Gupta et al. 2012). This is especially relevant today with the prevalent use of social media, through which people add to news coverage of global events by sharing their own videos and recordings of what they experienced.

In some cases, it is crucial to be able to ascertain whether recordings are authentic or not. Transactions involving large sums of money, for example, can take place over the phone. In such cases, it is possible to change the numbers spoken in a previously recorded conversation and then replay the message. If the transaction is later challenged in court, the forged phone conversation can be presented by the accused to argue that the account owner did authorize the transaction (Gupta et al. 2012). This would be a case where methods to detect manipulation in audio recordings are necessary. In this section, we discuss the efforts made to use the ENF signal captured in audio recordings for integrity authentication.

The basic idea of ENF-based authentication is that if an ENF-containing recording has been tampered with, the changes made will affect the extracted ENF signal as well. Examples of tampering include removing parts of the recording and inserting parts that do not belong. Examining the ENF of a tampered signal would reveal discontinuities in the extracted ENF signal that raise the question of tampering. The easiest scenario in which to check for tampering would be the case where we have reference ENF available. In such a case, the ENF signal can be compared to the reference ENF signal from the recording’s claimed time and location. This comparison would either support or question the integrity of the recording. Figure 10.10 shows an example where comparing an ENF signal extracted from a video recording with its reference ENF signal reveals the insertion of a foreign video piece.

Fig. 10.10
figure 10

ENF matching result demonstrating video tampering detection based on ENF traces (Cooper 2009b)

Fig. 10.11
figure 11

Audio signals from Brazil bandpassed around the 60 Hz value, a filtered original signal, and b filtered edited signal (Nicolalde and Apolinario 2009)

The case of identifying tampering in ENF-containing recordings with no reference ENF data is more complicated. The approaches studied in the literature have focused on detecting tampering based on ENF phase and/or magnitude changes. Generally speaking, when an ENF-containing recording has been tampered with, discontinuities are highly likely to appear in the phase of the ENF at the regions where the tampering took place.

In Nicolalde and Apolinario (2009), the authors noticed that phase changes in the embedded ENF signal result in a modulation effect when the recording containing the ENF traces is bandpassed around the nominal ENF band. An example of this is shown in Fig. 10.11, where there are visible differences in the relative amplitude between an edited signal and its original version. Here, the decrease in amplitude occurs at the locations where the authors had introduced edits into the audio signal.

Plotting the phase of an ENF recording will also reveal changes indicating tampering. In Rodríguez et al. (2010), the authors describe their approach to computing the phase, which proceeds as follows. First, the signal is downsampled and then passed through a bandpass filter centered around the nominal ENF value. If no ENF traces are found at the nominal ENF, for instance due to attacks aimed at avoiding forensic analysis (Chuang et al. 2012; Chuang 2013), bands around the higher harmonics of the nominal ENF can be used instead (Rodriguez 2013). Afterward, the filtered signal is divided into overlapping frames of \(N_C\) cycles of the nominal ENF, and the phase of each segmented frame is computed using the DFT or using a high-precision Fourier analysis method called DFT\(^1\) (Desainte-Catherine and Marchand 2000). When the phase of the signal is plotted, one can not only visually inspect for tampering but also glean insights as to whether the editing was the result of a fragment deletion or of a fragment insertion, as can be seen in Fig. 10.12.
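A minimal sketch of such a phase-estimation pipeline is given below, using a Butterworth bandpass around the nominal ENF, frames of \(N_C\) nominal cycles, and the phase taken from the DFT bin closest to the nominal frequency. The downsampling step is omitted, and the filter order, overlap, and plain DFT phase estimator are simplifying assumptions that do not reproduce the high-precision DFT\(^1\) method.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def enf_phase_sequence(x, fs, f0=60.0, n_cycles=10, overlap=0.5):
    """Estimate the ENF phase per frame of an audio signal x sampled at fs Hz."""
    # Bandpass around the nominal ENF value (e.g., 59.5-60.5 Hz).
    b, a = butter(4, [(f0 - 0.5) / (fs / 2), (f0 + 0.5) / (fs / 2)], btype="band")
    y = filtfilt(b, a, x)

    frame_len = int(round(n_cycles * fs / f0))       # N_C cycles of the nominal ENF
    hop = int(round(frame_len * (1 - overlap)))
    phases = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len]
        spectrum = np.fft.rfft(frame * np.hanning(frame_len))
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
        k = np.argmin(np.abs(freqs - f0))            # DFT bin closest to the nominal ENF
        phases.append(np.angle(spectrum[k]))
    return np.unwrap(np.array(phases))
```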

Fig. 10.12
figure 12

Phase estimated using DFT from edited Spanish audio signals where the edits were a fragment deletion and b fragment insertion (Rodríguez et al. 2010)

Rodríguez et al. (2010) also proposed an automatic approach for discriminating between an edited signal and an unedited one, which follows from the phase estimation just described. The approach depends on a statistic F defined as

$$\begin{aligned} F = 100 \log \left( \frac{1}{N - 1} \sum _{n = 2}^N \left[ \phi (n) - m_\phi \right] ^2 \right) , \end{aligned}$$
(10.14)

where N is the number of frames used for phase estimation, \(\phi (n)\) denotes the estimated phase of frame n, and \(m_\phi \) is the average of the computed phases. The process of determining whether or not an ENF-containing recording is authentic is formulated under a detection framework with the null hypothesis \(H_0\) and the alternative hypothesis \(H_1\) denoting an original signal and an edited signal, respectively. If F is greater than a threshold \(\gamma \), then the alternative hypothesis \(H_1\) is favored; otherwise, the null hypothesis \(H_0\) is favored.
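A direct implementation of the statistic in Eq. (10.14) and of the threshold test could look as follows; the base of the logarithm is taken as 10 here, which is an assumption, and the threshold value itself would come from the calibration described next.

```python
import numpy as np

def edit_statistic(phases):
    """Compute the statistic F of Eq. (10.14) from per-frame phase estimates.

    A base-10 logarithm is assumed; a different fixed base only rescales F
    and the threshold together.
    """
    phi = np.asarray(phases, dtype=float)
    m_phi = phi.mean()
    return 100.0 * np.log10(np.sum((phi[1:] - m_phi) ** 2) / (len(phi) - 1))

def is_edited(phases, gamma):
    """Decide H1 (edited) when F exceeds the calibrated threshold gamma."""
    return edit_statistic(phases) > gamma
```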

For optimal detection, the goal is to obtain a value for the threshold \(\gamma \) that maximizes the probability of detection \(P_{\text {D}} = \Pr \left( F > \gamma \mid H_1 \right) \). To do so, the authors in Rodríguez et al. (2010) prepare a corpus of audio signals, including original and edited signals, and evaluate the corpus with the proposed automatic approach over a range of \(\gamma \) values. The value chosen for \(\gamma \) is the one at the equal error rate (EER) point, where the probability of miss \(P_{\text {M}} = 1-P_{\text {D}}\) equals the probability of false alarm \(P_{\text {FA}} = \Pr \left( F > \gamma \mid H_0 \right) \). Experiments carried out on the Spanish databases AHUMADA and GAUDI resulted in an EER of 6%, and experiments carried out on the Brazilian databases Carioca 1 and Carioca 2 resulted in an EER of 7%.
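The EER-based choice of \(\gamma \) can be reproduced on any labeled corpus of F values with a few lines, as in the generic sketch below (labels 1 for edited recordings, 0 for originals); this is a standard EER computation and not the exact procedure of the cited work.

```python
import numpy as np

def eer_threshold(f_values, labels):
    """Pick the threshold gamma at the equal-error-rate point.

    f_values: statistic F for each recording; labels: 1 = edited (H1), 0 = original (H0).
    """
    f_values, labels = np.asarray(f_values, dtype=float), np.asarray(labels)
    best_gamma, best_gap = None, np.inf
    for gamma in np.sort(f_values):
        p_miss = np.mean(f_values[labels == 1] <= gamma)   # edited but not detected
        p_fa = np.mean(f_values[labels == 0] > gamma)      # original flagged as edited
        if abs(p_miss - p_fa) < best_gap:
            best_gamma, best_gap = gamma, abs(p_miss - p_fa)
    return best_gamma
```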

In Esquef et al. (2014), the authors proposed a novel edit detection method for forensic audio analysis that modifies their earlier approach in Rodríguez et al. (2010). In this version of the approach, the detection criterion is based on unlikely variations of the ENF magnitude. A data-driven magnitude threshold is chosen in a hypothesis framework similar to the one in Rodríguez et al. (2010). The authors conduct a qualitative evaluation of the influence of the edit duration and location, as well as of noise contamination, on the detection ability. This new approach achieved a 4% EER on the Carioca 1 database, versus 7% EER on the same database for the previous approach of Rodríguez et al. (2010). The authors report that amplitude clipping and additive broadband noise severely affect the performance of the proposed method and that further research is needed to improve the detection performance in more challenging scenarios.

The authors further extended their work in Esquef et al. (2015). Here, they modified the detection criteria by taking advantage of the typical pattern of ENF variations elicited by audio edits. In addition to the threshold-based detection strategy of Esquef et al. (2014), a verification of the pattern of the anomalous ENF variations is carried out, making the edit detector less prone to false positives. The authors compared this newly proposed approach against that of Esquef et al. (2014) and demonstrated experimentally that the new approach is more reliable, as it tends to yield lower EERs, such as a reduction from 4% to 2% EER on the Carioca 1 database. Experiments carried out on speech databases degraded with broadband noise also support the claim that the modified approach can achieve better results than the original one.

Other groups have also addressed and extended the explorations into ENF-based integrity authentication and tampering detection. We mention some highlights in what follows.

In Fuentes et al. (2016), the authors proposed a phase-locked loop (PLL)-based method for determining audio authenticity. They use a voltage-controlled oscillator (VCO) in a PLL configuration that produces a synthetic signal similar to a preprocessed ENF-containing signal. Some corrections are made to the VCO signal to bring it closer to the ENF, but if the ENF has strong phase variations, large differences remain between the ENF signal and the VCO signal, signifying tampering. An automatic threshold-based decision on the audio authenticity is made by quantifying the frequency variations of the VCO signal. The experimental results in this work show that the performance of the proposed approach is on par with that of previous works (Rodríguez et al. 2010; Esquef et al. 2014), achieving, for instance, 2% EER on the Carioca 1 database.

Fig. 10.13
figure 13

Flowchart of the proposed ENF-based authentication system in Hua et al. (2016)

In Hua et al. (2016), the authors present an ENF-based audio authentication system that jointly performs timestamp verification (if no tampering is detected) and detection of the tampering type and tampering region (if tampering is detected). A high-level description of this system is shown in Fig. 10.13. The authentication tool here is an absolute error map (AEM) between the ENF signal under study and reference ENF signals from a database. The AEM is a matrix containing the raw absolute errors obtained by matching the extracted ENF signal against each possible shift of a reference signal. The AEM-based solutions rely on the authentic and trustworthy portions of the ENF signal under study to make their decisions on authenticity. If the signal is deemed authentic, the system returns the timestamp at which a match is found. Otherwise, the authors propose two variant approaches that analyze the AEM to find the tampering regions and characterize the tampering type (insertion, deletion, or splicing). The authors frame their work as a proof-of-concept study, demonstrating the effectiveness of their proposal through synthetic performance analysis and experimental results.

In Reis et al. (2016, 2017), the authors propose SPHINS, an ESPRIT-Hilbert-based tampering detection framework with an SVM classifier, which uses an ESPRIT-Hilbert ENF estimator in conjunction with an outlier detector based on the sample kurtosis of the estimated ENF. The computed kurtosis values are vectorized and applied to an SVM classifier to indicate the presence of tampering. They report a 4% EER on the clean Carioca 1 database and find that their proposed approach gives improved results, compared to those in Rodríguez et al. (2010) and Esquef et al. (2014), for low-SNR regimes and in scenarios with nonlinear digital saturation on the Carioca 1 database.

In Lin and Kang (2017), the authors propose an approach where they apply a wavelet filter to an extracted ENF signal to reveal the detailed ENF fluctuations and then carry out autoregressive (AR) modeling on the resulting signal. The AR coefficients are used as input features for an SVM system for tampering detection. They report that, as compared to Esquef et al. (2015), their approach can achieve improved performance in noisy conditions and can provide robustness against MP3 compression.

In the same vein as ENF-based integrity authentication, Su et al. (2013) and Lin et al. (2016) address the issue of recaptured audio signals. The authors in Su et al. (2013) demonstrate that recaptured recordings may contain two sets of ENF traces, one from the original time of the recording and the other due to the recapturing process. They tackle the more challenging scenario in which the two traces overlap in frequency by proposing a decorrelation-based algorithm to extract the ENF traces with the help of the power reference signal. Lin et al. (2016) employ a convolutional neural network (CNN) to decide whether a given recording is recaptured or original. Their deep neural network relies on spectral features based on the nominal ENF and its harmonics. Their paper goes into further depth, examining the effect of the analysis window on the approach’s performance and visualizing the intermediate feature maps to gain insight into what the CNN learns and how it makes its decisions.

5.3 ENF-Based Localization

The ENF traces captured by audio and video signals can be used to infer information about the location in which the media recording was taken. In this section, we discuss approaches to use the ENF to do inter-grid localization, which entails inferring the grid in which the media recording was made, and intra-grid localization, which entails pinpointing the location-of-recording of the media signal within a grid.

5.3.1 Inter-Grid Localization

This section describes a system that seeks to identify the grid in which an ENF-containing recording was made, without the use of concurrent power references (Hajj-Ahmad et al. 2013, 2015). Such a system can be very important for multimedia forensics and security. It can pave the way to identifying the origins of videos such as those of terrorist attacks, ransom demands, and child exploitation. It can also reduce the computational complexity of, and thus facilitate, other ENF-based forensics applications, such as the time-of-recording authentication application of Sect. 10.5.1. For instance, if a forensic analyst is given a media recording with unknown time and location information, inferring the grid-of-origin of the recording would help narrow down the set of power reference recordings against which the media ENF pattern needs to be compared.

Upon examining ENF signals from different power grids, it can be noticed that there are differences between them in the nature and manner of the ENF variations (Hajj-Ahmad et al. 2015). These differences are generally attributed to the control mechanisms used to regulate the frequency around the nominal value and to the size of the power grid. Generally speaking, the larger the power grid is, the smaller the frequency variations are. Figure 10.14 shows the 1-min average frequency during a 48-hour period at five different locations in five different grids. As can be seen in this figure, Spain, which is part of the large continental European grid, shows a small range of frequency values as well as fast variations. The smaller grids of Singapore and Great Britain show relatively large variations in the frequency values (Bollen and Gu 2006).

Fig. 10.14
figure 14

Frequency variations measured in Sweden (top left), in Spain (top center), on the Chinese east coast (top right), in Singapore (bottom left), and in Great Britain (bottom right) (Bollen and Gu 2006)

Given an ENF-containing media recording, a forensic analyst can process the captured ENF signal to extract statistical features that facilitate the identification of the grid in which the media recording was made. A machine learning system can be built that learns the characteristics of ENF signals from different grids and uses them to classify ENF signals in terms of their grids-of-origin. Hajj-Ahmad et al. (2015) developed such a machine learning system. In that implementation, ENF signal segments are extracted from recordings that are 8 min long each. Statistical, wavelet-based, and linear-predictive features are extracted from these segments and used to train a multiclass SVM system.

The authors make a distinction between “clean” ENF data extracted from power recordings and “noisy” ENF data extracted from audio recordings, and various combinations of these two sets of data are used in different training scenarios. Overall, the authors were able to achieve an average accuracy of 88.4% on identifying ENF signals extracted from power recordings from eleven candidate power grids, and an average accuracy of 84.3% on identifying ENF signals extracted from audio recordings from eight candidate grids. In addition, the authors explored using multi-conditional systems that can adapt to cases where the noise conditions of the training and testing data are different. This approach was able to improve the identification accuracy of noisy ENF signals extracted from audio recordings by 28% when the training dataset is limited to clean ENF signals extracted from power recordings.
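The overall pipeline can be mimicked with standard tools, as in the sketch below, which computes a few simple statistics per ENF segment and trains a multiclass SVM. The specific features shown are a small illustrative subset, not the full statistical, wavelet-based, and linear-predictive feature set of Hajj-Ahmad et al. (2015).

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def segment_features(enf_segment):
    """A few illustrative statistics of an ENF segment (e.g., one estimate per second)."""
    d = np.diff(enf_segment)
    return [np.mean(enf_segment), np.var(enf_segment),
            np.var(d), np.mean(np.abs(d)),
            np.max(enf_segment) - np.min(enf_segment)]

def train_grid_classifier(segments, grid_labels):
    """segments: list of ENF segments; grid_labels: grid-of-origin label for each segment."""
    X = np.array([segment_features(s) for s in segments])
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))  # multiclass SVM (one-vs-one)
    clf.fit(X, grid_labels)
    return clf
```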

This ENF-based application was the focus of the 2016 Signal Processing Cup (SP Cup), an international undergraduate competition overseen by the IEEE Signal Processing Society. The competition engaged participants from nearly 30 countries. A total of 334 students on 52 teams registered for the competition; among them, more than 200 students in 33 teams turned in the required submissions by the open-competition deadline in January 2016. The top three teams from the open competition attended the final stage of the competition at ICASSP 2016 in Shanghai, China, to present their final work (Wu et al. 2016).

Most student participants from the SP Cup have uploaded their final reports to SigPort (Signal Processing 2021). A number of them have also further developed their work and published it externally. A common improvement on the SVM approach originally proposed in Hajj-Ahmad et al. (2015) was to make the classification system a multi-stage one, with earlier stages distinguishing between a 50 Hz and a 60 Hz grid, or between an audio signal and a power signal. An example of such a system is proposed in Suresha et al. (2017).

In Šarić et al. (2016), a team from Serbia examined different machine learning algorithms that can be used to achieve inter-grid localization, including K-nearest neighbors, random forests, SVM, linear perceptron, and neural networks. In their setup, the classifier that achieved the highest accuracy overall was random forest. They also explored adding additional features to those of Hajj-Ahmad et al. (2015), related to the extrema and rising edges in the ENF signal, which showed performance improvements between 3% and 19%.

Aside from SP-Cup-related works, Jeon et al. (2018) demonstrated how their proposed ENF map can be utilized for inter-grid location identification as part of their LISTEN framework. Their experimental setup included identifying the grid-of-origin of audio streams extracted from Skype and Torfan across seven power grids, achieving classification accuracies in the 85–90% range for audio segments of length 10–40 min. Their proposed system also tackles intra-grid localization, which we discuss next.

5.3.2 Intra-Grid Localization

Though the ENF variations at different points in the same grid are very similar, research has shown that there can be discernible differences between these variations. The differences can be due to the local load characteristics of a given city and the time needed for a response to changes in load demand and supply to propagate to other parts of the grid (Bollen and Gu 2006). System disturbances, such as short circuits, line switching, and generator disconnections, can be contributing causes of these different ENF values (Elmesalawy and Eissa 2014). When a significant change in the load occurs somewhere in the grid, it has a localized effect on the ENF in the given area. This change in the ENF then propagates across the grid at a finite speed, typically on the order of 500 miles per second (Tsai et al. 2007).

Fig. 10.15
figure 15

a Locations of power recordings. b Pairwise correlation coefficients between 500 s long, highpass-filtered ENF segments extracted from power recordings made in the locations shown in a (Hajj-Ahmad et al. 2012)

In Hajj-Ahmad et al. (2012), the authors conjecture that small and large changes in the load may cause location-specific signatures in local ENF patterns. Following this conjecture, such differences may be exploited to pinpoint the location-of-recording of a media signal within a grid. Due to the finite propagation speed of frequency disturbances across the grid, the ENF signal is anticipated to be more similar for locations close to each other than for locations that are farther apart. As an experiment, concurrent recordings of power signals were made in the three cities in the Eastern North American grid shown in Fig. 10.15a.

Examining the ENF signals extracted from the different power recordings, the authors found that they have a similar general trend but possibly microscopic differences. To better capture these differences, the extracted ENF signals are passed through a highpass filter, and the correlation coefficient between the highpassed ENF signals is used as a metric of the location similarity between recordings. Figure 10.15b shows the correlation coefficients between various 500 s ENF segments from different locations at the same time. We can see that the closer two cities of origin are, the more similar their ENF signals, and the farther apart they are, the less similar their ENF signals. From here, one can start to think about using the ENF signal as a location stamp and exploiting the relation between pairwise ENF similarity and the distance between recording locations. A highpass-filtered ENF query can be compared to highpass-filtered ENF signals from known city anchors in a grid as a means toward pinpointing the location-of-recording of the query.
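A sketch of this location-similarity measure is given below: each ENF segment is highpass filtered to isolate the fast, location-specific fluctuations, and the correlation coefficient between the filtered segments serves as the similarity score. The cutoff frequency and filter order here are arbitrary illustrative choices rather than the settings used in the cited work.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def location_similarity(enf_a, enf_b, fs=1.0, cutoff=0.05):
    """Correlation between highpass-filtered ENF segments from two locations.

    fs: ENF sampling rate in Hz (e.g., one estimate per second -> fs = 1.0).
    cutoff: highpass cutoff in Hz; keeps only the fast local fluctuations.
    """
    b, a = butter(2, cutoff / (fs / 2), btype="high")
    ha, hb = filtfilt(b, a, enf_a), filtfilt(b, a, enf_b)
    return np.corrcoef(ha, hb)[0, 1]
```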

The authors further examined the use of the ENF signal as an intra-grid location stamp in Garg et al. (2013), where they proposed a half-plane intersection method to estimate the location of ENF-containing recordings and found that the localization accuracy can be improved by increasing the number of locations used as anchor nodes. This study conducted experiments on power ENF signals. With audio and video signals, the situation is more challenging because of the noisy nature of the embedded ENF traces. The location-specific signatures are best captured using instantaneous frequencies estimated at 1 s temporal resolution, and reliable ENF signal extraction at such a high temporal resolution is an ongoing research problem. Nevertheless, there is potential for using ENF signals as a location stamp.

Jeon et al. (2018) have included intra-grid localization as part of the capabilities of their LISTEN framework for inferring the location-of-recording of a target recording, which relies on an ENF map built using ENF traces collected from online streaming sources. After narrowing down the power grid to which a recording belongs through inter-grid localization, their intra-grid localization approach relies on calculating the Euclidean distance between a time-series sequence of interpolated signals in the chosen power grid and the target recording’s ENF signal. The approach then narrows down the location of recording to a specific region within the grid.

In Chai et al. (2016), Yao et al. (2017), and Cui et al. (2018), the authors rely on the FNET/GridEye system, described in Sect. 10.2.1, to achieve intra-grid localization. In these works, the authors address ENF-based intra-grid localization for ENF signals extracted from clean power signals, without using concurrent ENF power references. Such intra-grid localization is made possible by exploiting geolocation-specific temporal variations induced by electromechanical propagation, nonlinear load, and recurrent local disturbances (Yao et al. 2017).

In Cui et al. (2018), the authors apply features similar to those proposed in intra-grid localization approaches and rely on a random forest classifier for training. In Yao et al. (2017), the authors apply wavelet analysis to ENF signals and feed detail signals into a neural network for training. The results show that the system works better at larger geographical scales and when the training and testing ENF-containing recordings were recorded closer together in time.

5.4 ENF-Based Camera Forensics

An additional ENF-based application that has been proposed uses the ENF traces captured in a video recording to characterize the camera that was used to produce the video (Hajj-Ahmad et al. 2016). This is done within a nonintrusive framework that analyzes only the video at hand to extract a characterizing internal parameter of the camera used. The focus was on CMOS cameras equipped with rolling shutters, and the parameter estimated was the read-out time \(T_{\text {ro}}\), which is the time it takes for the camera to acquire the rows of a single frame. \(T_{\text {ro}}\) is typically not listed in a camera’s user manual and is usually less than the frame period, which equals the reciprocal of the video’s frame rate.

This work was inspired by prior work on flicker-based video forensics that addresses issues in the entertainment industry pertaining to movie piracy-related investigations (Baudry et al. 2014; Hajj-Ahmad 2015). The focus of that work was on pirated videos that are produced by camcording video content displayed on an LCD screen. Such pirated videos commonly exhibit an artifact called the flicker signal, which results from the interplay between the backlight of the LCD screen and the recording mechanism of the video camera. In Hajj-Ahmad (2015), this flicker signal is exploited to characterize the LCD screen and camera producing the pirated video by estimating the frequency of the screen’s backlight signal and the camera’s read-out time \(T_{\text {ro}}\) value.

Both the flicker signal and the ENF signal are signatures that can be intrinsically embedded in a video due to the camera’s recording mechanism and the presence of a signal in the recording environment. For the flicker signal, the environmental signal is the backlight signal of the LCD screen, while for the ENF signal, it is the electric lighting signal in the recording environment. In Hajj-Ahmad et al. (2016), the authors leverage the similarities between the two signals to adapt the flicker-based approach into an ENF-based approach targeted at characterizing the camera producing the ENF-containing video. The authors carried out an experiment involving ENF-containing videos produced using five different cameras, and the results showed that the proposed approach achieves high accuracy in estimating the discriminating \(T_{\text {ro}}\) parameter, with a relative estimation error within 1.5%.

Vatansever et al. (2019) further extend the work on this application by proposing an approach that can operate on videos that the approach in Hajj-Ahmad et al. (2016) cannot address, namely cases where the main ENF component is aliased to 0 Hz because the nominal light-flicker frequency is a multiple of the video camera’s frame rate, e.g., a 25 fps camera capturing 100 Hz flicker. The approach examines the two strongest frequency components at which the ENF component appears, following a model proposed in the same paper, and deduces the read-out time \(T_{\text {ro}}\) from the ratio of the strengths of these two components.

6 Anti-Forensics and Countermeasures

ENF signal-based forensic investigations such as time–location authentication, integrity authentication, and recording localization discussed in Sect. 10.5 rely on the assumption that the ENF signal buried in a hosting signal is not maliciously altered. However, there may exist adversaries performing anti-forensic operations to mislead forensic investigations. In this section, we examine the interplay between forensic analysts and adversaries to better understand the strengths and limitations of ENF-based forensics.

6.1 Anti-Forensics and Detection of Anti-Forensics

Anti-forensic operations by adversaries aim at altering the ENF traces to invalidate or mislead ENF analysis. One intuitive approach is to superimpose an alternative ENF signal without removing the native ENF signal. This approach can confuse the forensic analyst when the two ENF traces overlap and have comparable strengths, in which case it is impossible to separate the native ENF signal without using side information (Su et al. 2013; Zhu et al. 2020). A more sophisticated approach is to remove the ENF traces, either in the frequency domain by using a bandstop filter around the nominal frequency (Chuang et al. 2012; Chuang 2013) or in the time domain by subtracting a time signal synthesized via frequency modulation based on accurate estimates of the ENF signal, such as those from the power mains (Su et al. 2013). The most malicious approach is to mislead the forensic analyst by replacing the native ENF signal with an alternative ENF signal (Chuang et al. 2012; Chuang 2013). This can be done by first removing the native signal and then adding an alternative ENF signal.
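To make the frequency-domain removal operation concrete, the sketch below notches out narrow bands around the nominal ENF and a few of its harmonics; the notch quality factor and the choice of harmonics are arbitrary illustrative values, and the sketch describes the general attack surface rather than any cited paper’s exact procedure.

```python
import numpy as np
from scipy.signal import iirnotch, filtfilt

def remove_enf(x, fs, nominal=60.0, harmonics=(1, 2, 3), q=30.0):
    """Suppress ENF traces by notch filtering the nominal frequency and a few harmonics."""
    y = np.asarray(x, dtype=float)
    for h in harmonics:
        f = nominal * h
        if f < fs / 2:                       # only bands below the Nyquist frequency
            b, a = iirnotch(f, q, fs)
            y = filtfilt(b, a, y)
    return y
```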

Forensic analysts are motivated to devise ways to detect the use of anti-forensic operations (Chuang et al. 2012; Chuang 2013), so that a forged signal can be rejected or recovered to allow forensic analysis. If an anti-forensic operation is carried out merely around the fundamental frequency, the consistency of the ENF components across multiple harmonic frequency bands can be used to detect such an operation. An abrupt discontinuity in the spectrogram may be another indicator of an anti-forensic operation. In Su et al. (2013), the authors formulated a composite hypothesis testing problem to detect the operation of superimposing an alternative ENF signal.
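A simple form of the harmonic-consistency check could look like the sketch below: the ENF is tracked independently around two harmonic bands, rescaled to the fundamental, and a low correlation between the two tracks raises a flag. The spectrogram-peak tracker, the choice of harmonics, and the correlation threshold are illustrative assumptions rather than the detectors proposed in the cited works.

```python
import numpy as np
from scipy.signal import spectrogram

def estimate_enf(x, fs, band_center, bandwidth=1.0, nperseg_sec=8):
    """Track the dominant frequency near `band_center` over time (simple spectrogram peak)."""
    f, t, S = spectrogram(x, fs=fs, nperseg=int(nperseg_sec * fs))
    mask = (f >= band_center - bandwidth) & (f <= band_center + bandwidth)
    return f[mask][np.argmax(S[mask, :], axis=0)]

def harmonics_consistent(x, fs, nominal=60.0, threshold=0.8):
    """Flag possible anti-forensic editing if ENF tracks from two harmonic bands disagree."""
    enf_2nd = estimate_enf(x, fs, 2 * nominal) / 2.0   # rescale to the fundamental
    enf_3rd = estimate_enf(x, fs, 3 * nominal) / 3.0
    rho = np.corrcoef(enf_2nd, enf_3rd)[0, 1]
    return rho >= threshold
```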

Adversaries may respond to forensic analysts by improving their anti-forensic operations (Chuang et al. 2012; Chuang 2013). Instead of replacing only the ENF component at the fundamental frequency, adversaries can attack all harmonic bands as well. The abrupt-discontinuity check on the spectrogram can be defeated by envelope adjustment via the Hilbert transform. This dynamic interplay continues as both sides evolve their actions in response to each other (Chuang et al. 2012; Chuang 2013), and the actions tend to become more complex as the interplay goes on.

6.2 Game-Theoretic Analysis on ENF-Based Forensics

Chuang et al. (2012) and Chuang (2013) quantitatively analyzed the dynamic interplay between forensic analysts and adversaries using a game-theoretic framework. A representative but simplified scenario involving different actions was quantitatively evaluated. The optimal strategies in terms of the Nash equilibrium, namely, the state in which no player can increase his or her own benefit (formally, the utility) via unilateral strategy changes, were derived. The optimal strategies and the resulting forensic and anti-forensic performances at the equilibrium lead to a comprehensive understanding of the capability of ENF-based forensics in a specific scenario.

7 Applications Beyond Forensics and Security

The ENF signal as an intrinsic signature of multimedia recordings gives rise to applications beyond forensics and security. For example, an ENF signal’s unique fluctuations can be used as a time–location signature to allow the synchronization of multiple pieces of media signals that overlap in time. This allows the synchronized signal pieces to jointly reveal more information than the individual pieces (Su et al. 2014c, b; Douglas et al. 2014). In another example, the slowly varying trend of the ENF signal can serve as an anchor for the detection of abnormally recorded segments within a recording (Chang and Huang 2010, 2011) and can thus allow for the restoration of tape recordings suffering from irregular motor speed (Su 2014).

7.1 Multimedia Synchronization

Conventional multimedia synchronization approaches rely on passive/active calibration of the timestamps or on identifying common contextual information across two or more recordings. Voice and music are examples of contextual information that can be used for audio synchronization. Overlapping visual scenes, even when recorded from different viewing angles, can be exploited for video synchronization. ENF signals naturally embedded in the audio and visual tracks of recordings can complement conventional synchronization approaches, since they do not rely on common contextual information.

ENF-based synchronization can be categorized into two major scenarios, one where the recordings to be synchronized overlap in time, and one where they do not. If multiple recordings from the same power grid are recorded with time overlap, then they can be readily synchronized by the ENF without using other information. In Su et al. (2014c), Douglas et al. (2014), and Su et al. (2014b), the authors use the ENF signals extracted from audio tracks for synchronization. For videos containing both visual and audio tracks, it is possible to use either modality as the source of ENF traces for synchronization. Su et al. (2014b) demonstrate using the ENF signals from the visual track for synchronization. Using the visual track for multimedia synchronization is important in scenarios such as video surveillance, where an audio track may not be available.
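For the time-overlapping case, the relative alignment can be estimated by cross-correlating the two extracted ENF signals, as in the sketch below; both signals are assumed to carry one ENF estimate per second and to originate from the same grid, and the peak lag of the correlation gives the offset between the two recordings.

```python
import numpy as np

def enf_sync_offset(enf_a, enf_b):
    """Offset (in ENF samples) that best aligns recording B to recording A.

    Both inputs are mean-removed so that the nominal value does not dominate
    the correlation score.
    """
    a = enf_a - np.mean(enf_a)
    b = enf_b - np.mean(enf_b)
    corr = np.correlate(a, b, mode="full")
    # Positive lag: recording B starts that many ENF samples after recording A.
    return int(np.argmax(corr)) - (len(b) - 1)
```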

In Golokolenko and Schuller (2017), the authors show that the idea can be extended to synchronizing the audio streams from wireless/low-cost USB sound cards in a microphone array. Sound cards have their own sample buffers, which fill up at different rates, leading to synchronization mismatches between streams from different sound cards. Conventional approaches to this synchronization problem require specialized hardware, which an approach based on the captured ENF traces does not.

If recordings were not captured with time overlap, then reference ENF databases are needed to help determine the absolute recording time of each piece. This falls back to the time-of-recording authentication scenario discussed in Sect. 10.5.1: each recording needs to determine its timestamp by matching against one or more reference databases, depending on whether the source location of the recording is known.

7.2 Time-Stamping Historical Recordings

ENF traces can help archivists by providing a time-synchronized exhibit of multimedia files. Many twentieth century recordings are important cultural heritage records, but some may lack necessary metadata, such as the date and the exact time of recording. Su et al. (2013), Su et al. (2014c), Douglas et al. (2014) found ENF traces in the 1960s phone conversation recordings of President Kennedy in the White House, and in the multi-channel recordings of the 1970 NASA Apollo 13 mission. In Su et al. (2014c), the ENF was used to align two recordings of around 4 hours in length from the Apollo 13 mission, and the result was confirmed by the audio contents of the two recordings.

Fig. 10.16
figure 16

The spectrogram of an Apollo mission control recording a before and b after speed correction on the digitized signal (Su 2014)

A set of time-synchronized multimedia files with reliable recording-time metadata and high-SNR ENF traces can be used to reconstruct a reference ENF database for various purposes (Douglas et al. 2014). Such a reconstructed ENF database can be valuable for applying ENF analysis to sensing data collected in the past, before power signals recorded for reference purposes were available.

7.3 Audio Restoration

An audio signal digitized from an analog tape can have its content frequencies “drift” away from their original values due to inconsistent rolling speeds between the recorder and the digitizer. The ENF signal, although varying over time, has a general trend that remains around the nominal frequency. When ENF traces are embedded in a tape recording, this consistent trend can serve as an anchor for correcting the abnormal speed of the recording (Su 2014). Figure 10.16a shows the drift of the general trend of the ENF, followed by an abrupt jump, in a recording from the NASA Apollo 11 mission. The abnormal speed can be corrected by temporally stretching or compressing the frame segments of the audio signal using techniques such as multi-rate conversion and interpolation (Su 2014). The spectrogram of the resulting corrected signal is shown in Fig. 10.16b, in which the trend of the ENF is constant without an abrupt jump.
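A simplified version of such speed correction is sketched below: the digitized signal is split into fixed-length segments, the locally estimated ENF of each segment is compared against the nominal value, and the segment is resampled by the corresponding ratio. The segment length, the externally supplied local ENF estimates, and the generic polyphase resampler are illustrative assumptions rather than the specific multi-rate procedure of Su (2014).

```python
import numpy as np
from fractions import Fraction
from scipy.signal import resample_poly

def correct_tape_speed(x, fs, local_enf, nominal=60.0, seg_sec=10):
    """Stretch/compress successive segments so the embedded ENF returns to nominal.

    local_enf: one ENF estimate per segment of length seg_sec seconds (assumed
    to be provided by a separate ENF tracker).
    """
    seg_len = int(seg_sec * fs)
    corrected = []
    for i, f_est in enumerate(local_enf):
        seg = x[i * seg_len:(i + 1) * seg_len]
        if len(seg) == 0:
            break
        # If the observed ENF is above nominal, the segment plays too fast;
        # upsampling by f_est/nominal scales its frequencies back to nominal.
        ratio = Fraction(f_est / nominal).limit_denominator(1000)
        corrected.append(resample_poly(seg, ratio.numerator, ratio.denominator))
    return np.concatenate(corrected)
```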

8 Conclusions and Outlook

This chapter has provided an overview of the ENF signal, an environmental signature that can be captured by multimedia recordings made in locations where there is electrical activity. We first examined the embedding mechanism of ENF traces, followed by how ENF signals can be extracted from audio and video recordings. We noted that in order to reliably extract ENF signals from visual recordings, it is often helpful to exploit the rolling-shutter mechanism commonly used by cameras with CMOS image sensor arrays. We then systematically reviewed how the ENF signature can be used for the time and/or location authentication, integrity authentication, and localization of media recordings. We also touched on anti-forensics, in which adversaries try to mislead forensic analysts, and discussed countermeasures. Applications beyond forensics, including multimedia synchronization and audio restoration, were also discussed.

Looking into the future, we notice many new problems that have naturally arisen as technology continues to evolve. For example, given the popularity of short video clips that last less than 30 s, ENF analysis tools need to accommodate short recordings. The technical challenge lies in making more efficient use of the ENF information than traditional ENF analysis tools, which focus on recordings that are minutes or even hours long.

Another technical direction is to push forward the limit of ENF extraction algorithms from working on videos to images. Is it possible to extract enough ENF information from a sequence of fewer than ten images captured over a few seconds/minutes to infer the time and/or location? For the case of ENF extraction from a single rolling-shutter image, is it possible to know more beyond whether the image is captured at a 50 or 60 Hz location?

Exploring new use cases of ENF analysis based on the intrinsic ENF embedding mechanism is also a natural extension. For example, deepfake video detection may be a potential avenue for ENF analysis. ENF can be naturally embedded, through the flickering of indoor lighting, into the facial region and the background of a video recording. A consistency test may be designed to complement detection algorithms based on other visual/statistical cues.