1 Introduction

The surface of the Earth comprises 15 tectonic plates, each moving several centimeters per year. Seismic phenomena often occur in subduction zones, at the boundary where one plate subducts beneath another, when strain accumulated over a long time is released. Figure 1 shows the schematic of a subduction zone, in which an oceanic plate is subducting beneath a continental plate. For example, the Philippine Sea Plate is subducting beneath the Eurasian Plate around southwest Japan, forming a subduction zone called the Nankai Trough. In a subduction zone, various seismic phenomena occur depending on the magnitude of plate friction: ordinary earthquakes (high friction), slow earthquakes (low friction), and stable sliding (no friction). The friction generally increases as the temperature decreases and the pressure increases. As both temperature and pressure increase with depth, the dependence of friction on depth is complicated. This dependence may explain why different seismic phenomena occur at different locations, as shown in Fig. 1. Ordinary earthquakes usually occur when strain accumulated over a long time is released instantaneously at a shallow depth, where the friction is very high. At the beginning of the 21st century, a new class of phenomena originating in subduction zones, called slow earthquakes, was discovered [8]. Owing to the lower friction, slow earthquakes repeat at much shorter intervals and have smaller magnitudes than ordinary earthquakes.

Fig. 1. Schematic of a subduction zone. The ellipses represent the locations where seismic phenomena occur.

Hi-net [9, 12], the seismometer network installed by NIED and operating since 1996, led to the discovery of deep low-frequency tremors, which are categorized as slow earthquakes. The Hi-net seismometers are more sensitive and more densely located than those of conventional seismometer networks. Thus, they make it possible to observe weak oscillations that were previously unobservable and to analyze correlations between the records of neighboring seismometers. The envelope correlation method [8] successfully extracted evidence of tremors from Hi-net data, which has been recognized as the first discovery of tremors in the world. Such tremors are weak oscillations that occur in a deeper area than ordinary earthquakes. These tremors have an approximate dominant frequency of 2–8 Hz, last from several hours to several days, and have magnitudes below 1.3 according to the tremor catalog published by NIED [4, 11]. Thus, tremors cannot be perceived on the ground. According to the NIED catalog, more than 30,000 tremors were observed in southwest Japan from January 2001 to April 2019, which shows how often they occur. Since the first discovery, tremors have been detected in various subduction zones worldwide (e.g., [5, 15]), and they represent a research hotspot in seismology, as many studies have indicated relations between tremors and large earthquakes (e.g., [10]). In fact, seismologists expect tremors to provide clues for predicting large earthquakes and understanding their mechanisms. Currently, only digital data from the last 20 years are available for studying tremors. Considering that megathrust earthquakes have occurred periodically in the Nankai Trough at intervals of 100–200 years, it is important to analyze tremor occurrences in southwest Japan over a longer period. Before seismic records were available in digital format, seismometers continuously recorded waveforms with a pen on paper mounted on a drum.

In this study, we aimed to detect tremors from seismograph paper records by using a convolutional neural network (CNN), a deep-learning method that has shown high performance for image recognition. A CNN can automatically tune its internal parameters by learning the characteristics of tremors from input images without requiring prior knowledge of tremors or manually adjusting the parameters. Training a CNN from scratch with real data polluted by a variety of noises may hinder the model construction and hyperparameter tuning. Thus, we conducted numerical experiments to construct a CNN and train it with synthetic images similar to the seismograph paper records. These experiments aim to obtain clues to improve the model and provide pretrained models for subsequent fine-tuning, which is a popular learning method to improve the model performance and learning efficiency.

2 Seismographs

Modern seismic research based on observational data strongly relies on digital records. In Hi-net, more than 1000 seismometers installed across the Japanese Islands continuously observe the velocity of the ground at a sampling rate of 100 Hz, and NIED collects and publishes the corresponding digital data in real time. Multivariate time-series analyses of digital seismic data that consider spatial correlations make it possible to increase the signal-to-noise ratio of detected phenomena and even to uncover previously unknown phenomena such as slow earthquakes and tremors.

Seismometers in use more than 50 years ago drew waveforms directly on paper. Considering the time interval between megathrust earthquakes in subduction zones, paper records are a valuable source for research on slow earthquakes (e.g., [2]). Figure 2 shows an example of a seismograph paper record. Each daily record is drawn on a single sheet of paper, in which each time series is drawn horizontally and contains approximately 2.5 min of data. Besides seismic waveforms, the records contain pulses that indicate a time stamp every second. The average image size of a paper record excluding its margins is approximately \(7000\times 7000\) pixels.

The digitization of paper records by tracing the waveforms is effective for investigating large earthquakes because such waveforms are extractable even from overlapping time series given the low frequencies and large amplitudes of earthquakes (e.g., [6]). In contrast, tremors generally have smaller amplitudes and higher frequencies than large earthquakes. Consequently, their digitization is difficult, especially for overlapping waveforms. Therefore, CNN-based image recognition is a promising alternative for analyzing tremors compared to individual waveform extraction through digitization.

Fig. 2. Seismograph paper record from August 13, 1966, 19:47 to August 14, 1966, 20:05 in Kumano, Japan [14].

Both digital and analog seismic data include a wide variety of phenomena such as earthquakes, tremors, pulsations excited by oceanic waves, teleseisms (distant earthquakes), oscillations due to meteorological events, and noise. Thus, identifying tremors in seismic data becomes difficult when large earthquakes or other signals with similar spectra pollute the measurements.

3 Methods

To detect tremors by applying CNNs to real data, we conducted numerical experiments based on synthetic images to obtain a trained model that correctly determines the presence of tremors in an input image. In this section, we detail these experiments, including image synthesis and CNN construction.

3.1 Generation of Synthetic Images

Fig. 3. Representative examples of synthetic images. Details of each image are available in the text and Table 1.

Table 1. Versions of synthetic images. The checkmarks represent the waveforms contained in each version. The values next to the checkmarks indicate the ratio of the waveform amplitude with respect to Ver. 1.

We generate synthetic images based on seismograph paper records from Kumano, Japan (Fig. 2). Figure 3 shows examples of synthetic images. Each image is \(7000\times 7000\) pixels and corresponds to a daily record that contains 576 time series of 2.5 min vertically stacked. The synthetic images correspond to one of three versions, Ver. 1, Ver. 2, or Ver. 3, according to the included types of signals and noise listed in Table 1. For each version, 100 images were generated without tremors (labeled “none”), and 100 images were generated with tremors (labeled “tremor”). Ver. 1 images (Figs. 3(a) and 3(b)) contain observation noise, time stamps, and pulsations, as well as tremors only for “tremor” images. Ver. 1 represents a simple case that allows straightforward tremor detection. In Ver. 2 images (Figs. 3(c) and 3(d)), the tremors are smaller and the pulsations are larger than those in Ver. 1 images. In Ver. 3 images (Figs. 3(e) and 3(f)), we change the amplitudes of tremors and pulsations and insert earthquakes. Figure 4 confirms that Ver. 3 images suitably resemble the paper records regarding both their overview and details.
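The layout figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check: the constant and function names are made up for illustration, and it assumes that each trace spans the full image width and that the traces are evenly spaced vertically, neither of which is stated explicitly in the text.

```python
# Layout figures stated in the text for one synthetic daily record.
IMAGE_SIDE_PX = 7000      # each image is 7000 x 7000 pixels
TRACES_PER_IMAGE = 576    # vertically stacked time series
MINUTES_PER_TRACE = 2.5   # duration drawn along one horizontal trace


def record_layout(side_px=IMAGE_SIDE_PX, traces=TRACES_PER_IMAGE,
                  minutes=MINUTES_PER_TRACE):
    """Derive coverage and pixel densities for one daily record image.

    Assumption (not from the paper): a trace spans the full image width
    and traces are evenly spaced over the full image height.
    """
    return {
        "total_hours": traces * minutes / 60.0,         # time covered by one image
        "rows_per_trace": side_px / traces,             # vertical pixels per trace
        "px_per_second": side_px / (minutes * 60.0),    # horizontal pixels per second
    }


if __name__ == "__main__":
    for name, value in record_layout().items():
        print(f"{name}: {value:.2f}")
```

Under these assumptions, 576 traces of 2.5 min cover exactly 24 h, each trace has roughly 12 pixel rows available, and one second occupies roughly 47 pixels, so a 2–8 Hz oscillation would span only about 6–23 pixels per cycle.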

Fig. 4. (a) Ver. 3 “tremor” image. (b) Magnified view of (a). (c) Image of paper record shown in Fig. 2. (d) Magnified view of (c).

3.2 Preprocessing of Synthetic Images

For preprocessing, we divide each image vertically into five strips of \(7000\times 1400\) pixels and then resize each strip to \(2000\times 400\) pixels. The vertical division makes it easy to distinguish between tremors and noise. As a tremor usually lasts several hours, all five strips must include its evidence, unlike temporary noise. This feature is useful for deciding whether tremors are present in an image. For example, consider five strips extracted from an image with an unknown label, either “none” or “tremor.” If a CNN assigns “none” to four of the five strips and “tremor” to the remaining strip, we can assume that the image corresponds to “none” and that the misjudgment for the remaining strip is due to noise. Resizing to a lower resolution reduces the number of model parameters and consequently the computational cost. After preprocessing, each version comprises 500 “none” images and 500 “tremor” images. For each version, we use 800 images for training and the remaining 200 for validation, as detailed in the next subsection.
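The preprocessing and the strip-level reasoning described above can be sketched as follows. This is a minimal illustration rather than the authors' code, and it rests on several assumptions: the strips are read as full-height column bands, so that \(7000\times 1400\) means height by width; nearest-neighbor sampling stands in for the unspecified resizing method; and a simple majority vote generalizes the four-to-one example in the text. All function names are hypothetical.

```python
import numpy as np


def split_into_strips(image, n_strips=5):
    """Cut a 2-D image into n_strips equal-width column bands (full height each).

    Assumption: every strip keeps all rows, so a tremor lasting several
    hours leaves evidence in each strip.
    """
    _, width = image.shape
    strip_w = width // n_strips
    return [image[:, i * strip_w:(i + 1) * strip_w] for i in range(n_strips)]


def resize_nearest(strip, out_h, out_w):
    """Nearest-neighbor resize; a stand-in for the unspecified resizing method."""
    h, w = strip.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return strip[np.ix_(rows, cols)]


def majority_label(strip_labels):
    """Image-level label: "tremor" only if more than half of the strips say so."""
    votes = sum(label == "tremor" for label in strip_labels)
    return "tremor" if votes > len(strip_labels) / 2 else "none"


if __name__ == "__main__":
    image = np.zeros((7000, 7000), dtype=np.uint8)   # placeholder daily record
    strips = [resize_nearest(s, 2000, 400) for s in split_into_strips(image)]
    print(len(strips), strips[0].shape)
    print(majority_label(["none", "none", "tremor", "none", "none"]))
```

With 100 images per label and five strips per image, this yields the 500 inputs per label mentioned in the text.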

Fig. 5. Architecture of proposed CNN.

3.3 CNN for Tremor Detection

The CNN is a representative deep learning method that has exhibited high performance in tasks such as image recognition and handwriting recognition. A CNN has two distinctive layers, the convolutional layer and the pooling layer, which mathematically describe the function of human visual cells. These layers allow the CNN to extract features from input images. In recent years, CNNs have begun to be used in seismology for tasks such as detection or discrimination of seismic phenomena (e.g., [7, 13]).

Below, we define the key terms and formulas of CNNs. Let \(\varvec{x}\) be an input image for a CNN. Image \(\varvec{x}\) has a true label \(\ell \) in set S of labels. Let \(\mathcal {M}\) be a CNN with internal parameters \(\varvec{\theta }\). For an input \(\varvec{x}\), CNN \(\mathcal {M}=\mathcal {M}(\varvec{\theta })\) provides predictions \(\varvec{y}=(p_{\ell })_{\ell \in S}\), where \(p_{\ell }\) is the predicted probability that the true label of \(\varvec{x}\) is \(\ell \). In a CNN, the loss is a function that expresses its performance. The loss returns a non-negative real number for the pair \((\ell ,\varvec{y})\) associated with a given \(\varvec{x}\). If the loss value for \(\varvec{x}\) is close to 0, the prediction performance for \(\varvec{x}\) is favorable. On the other hand, the accuracy is the agreement rate between true label \(\ell \) and predicted label \(\hat{\ell }=\arg \max _{\ell \in S}p_{\ell }\) over the inputs. Training is the process of optimizing parameters \(\varvec{\theta }\) by minimizing the loss value for inputs with known labels. Then, validation calculates the loss and accuracy at fixed parameters \(\varvec{\theta }\) for inputs that have known labels but were not used for training. Thus, validation evaluates the predictive performance of the CNN for previously unseen inputs. Finally, testing performs prediction on inputs with unknown labels.
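As a compact restatement of these definitions, training and accuracy can be written in formula form. The notation here is introduced only for illustration and does not come from the original text: \(N\) is the number of inputs, \(i\) indexes them, \(L\) denotes the per-input loss, and \(\mathbb {1}[\cdot ]\) is the indicator function. Averaging the loss over the inputs, rather than summing it, is also an assumption.

\[\varvec{\theta }^{*}=\arg \min _{\varvec{\theta }}\frac{1}{N}\sum _{i=1}^{N}L\bigl (\ell _{i},\varvec{y}_{i}(\varvec{\theta })\bigr ),\qquad \text {accuracy}=\frac{1}{N}\sum _{i=1}^{N}\mathbb {1}\bigl [\hat{\ell }_{i}=\ell _{i}\bigr ],\quad \hat{\ell }_{i}=\arg \max _{\ell \in S}p_{\ell }^{(i)}\]

Training evaluates the first expression on the training inputs, whereas validation evaluates the loss and the accuracy on held-out inputs with \(\varvec{\theta }\) fixed.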

We built the proposed CNN by incorporating the residual connections used in the ResNet [1]. Figure 5 shows the architecture of the proposed CNN, which establishes a binary classifier that determines whether the input image has label “none” or “tremor.” The CNN output has the form \(\varvec{y}=(p_{\text {``none''}},p_{\text {``tremor'' }})\) corresponding to \(S=\{\text {``none''},\text {``tremor'' }\}\). We used the categorical cross-entropy as the loss function and optimized it using the Adam method [3] with a batch size of 16 in this study.
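For the two-label case, the categorical cross-entropy reduces to the negative logarithm of the probability assigned to the true label. The following NumPy sketch illustrates that standard definition only. The function names and the clipping constant `eps` are assumptions for illustration, and the paper does not state which framework or implementation was used.

```python
import numpy as np

LABELS = ("none", "tremor")   # the label set S from the text


def categorical_cross_entropy(true_label, probs, eps=1e-12):
    """Loss for one input: -log of the probability given to the true label.

    eps is a hypothetical guard that avoids log(0); it is not from the paper.
    """
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0)
    return float(-np.log(p[LABELS.index(true_label)]))


def mean_loss(true_labels, batch_probs):
    """Average loss over a batch (the text reports a batch size of 16)."""
    losses = [categorical_cross_entropy(t, p)
              for t, p in zip(true_labels, batch_probs)]
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # A confident correct prediction gives a loss near 0,
    # while an undecided prediction gives log(2).
    print(categorical_cross_entropy("tremor", [0.01, 0.99]))
    print(categorical_cross_entropy("none", [0.5, 0.5]))
```

The loss is non-negative and approaches 0 as the probability of the true label approaches 1, in line with the definition of the loss given earlier.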

Fig. 6. (a) Accuracy and (b) loss for Ver. 1 images according to number of training epochs.

Fig. 7. (a) Ver. 1 “none” image and corresponding Grad-CAMs for prediction of labels (b) “none” and (c) “tremor.” (d) Ver. 1 “tremor” image and corresponding Grad-CAMs for prediction of labels (e) “none” and (f) “tremor.”

4 Results

Figure 6 shows the accuracy and loss throughout training on Ver. 1 images. An epoch (horizontal axis) is one learning iteration, that is, one pass of training on all the training images followed by validation on all the validation images. Although the validation accuracy does not improve in the initial learning stage, it rapidly increases after 13 epochs to reach almost 1.0, possibly because the model parameters escape a local optimum. Figure 7 shows the gradient-weighted class activation maps (Grad-CAMs) [16] of Ver. 1 images. Each Grad-CAM indicates the image regions that influence the prediction, with red (light) indicating the highest influence and blue (dark) indicating no influence. For prediction of “none”, the response is uniform over the entire image, except for the areas containing tremors (Figs. 7(b) and 7(e)). In Fig. 7(f), remarkable responses clearly appear in areas containing tremors. Therefore, the model correctly detects tremors and correctly identifies the “tremor” image.
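For readers unfamiliar with Grad-CAM, the core computation in its standard formulation [16] is short: the gradients of the class score with respect to a convolutional layer's feature maps are averaged spatially to obtain one weight per channel, the feature maps are combined with those weights, and negative values are discarded. The sketch below assumes that the activations and gradients have already been extracted from a trained network. The array shapes, the function name, and the final normalization to the range 0 to 1 are illustrative choices, not details taken from the paper.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Compute a Grad-CAM heat map from one convolutional layer.

    activations: array of shape (K, H, W), the K feature maps of the layer.
    gradients:   array of shape (K, H, W), the derivative of the chosen class
                 score with respect to each activation.
    Returns an (H, W) map, scaled to [0, 1] for display when it is not all zero.
    """
    weights = gradients.mean(axis=(1, 2))               # one weight per channel
    cam = np.tensordot(weights, activations, axes=1)    # weighted sum over channels
    cam = np.maximum(cam, 0.0)                          # keep positive influence only
    peak = cam.max()
    return cam / peak if peak > 0 else cam


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    acts = rng.random((8, 25, 5))              # placeholder feature maps
    grads = rng.standard_normal((8, 25, 5))    # placeholder gradients
    print(grad_cam(acts, grads).shape)
```

In practice, the resulting low-resolution map is upsampled to the input size and overlaid on the image with a color scale such as the one described above.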

Fig. 8. (a) Accuracy and (b) loss for Ver. 2 images according to number of training epochs.

Fig. 9. (a) Ver. 2 “none” image and corresponding Grad-CAMs for prediction of labels (b) “none” and (c) “tremor.” (d) Ver. 2 “tremor” image and corresponding Grad-CAMs for prediction of labels (e) “none” and (f) “tremor.”

Figures 8 and 9 show the learning performance and Grad-CAMs for Ver. 2 images, respectively. Although the tremors in Ver. 2 are smaller than those in Ver. 1, model training and tremor detection are successful.

Fig. 10. (a) Accuracy and (b) loss for Ver. 3 images according to number of training epochs.

Fig. 11. (a) Ver. 3 “none” image and corresponding Grad-CAMs for prediction of labels (b) “none” and (c) “tremor.” (d) Ver. 3 “tremor” image and corresponding Grad-CAMs for prediction of labels (e) “none” and (f) “tremor.”

Figures 10 and 11 show the learning performance and Grad-CAMs for Ver. 3 images, respectively. In Ver. 3 images, the Grad-CAM responses to earthquake waveforms are notable. For small earthquake waveforms, the Grad-CAMs do not show any remarkable response. This result may arise because the input image shrinks as it passes through the convolutional layers, so that small waveforms eventually disappear. On the other hand, in Figs. 11(b) and 11(e), the large earthquake waveforms cause the most influential responses. This result suggests that the CNN can distinguish earthquake waveforms from tremors, observation noise, and pulsations. Moreover, the proposed CNN may be able to achieve multinomial classification of labels such as “none,” “tremor,” “earthquake,” and “both.” Figure 11(f) shows responses to tremors, except where the tremors are masked by earthquake waveforms.

Overall, the tremor waveforms appear as a band-like pattern in the images. This result may indicate that the CNN discriminates tremors based only on a rough view of the image. To rule out this possibility, we conducted an additional experiment using monochromatic and dichromatic images, as shown in Fig. 12. The set of monochromatic images consists of 256 images, one for each grayscale value from 0 to 255. The set of dichromatic images consists of 200 images, with each image containing a monochromatic band-like pattern on a monochromatic background. When these images are provided as inputs to the CNN trained on Ver. 3, all the predicted labels are “none.” Although Ver. 3 “tremor” images seem to be similar to the dichromatic images, the CNN correctly determines the presence of tremors in Ver. 3 images. This result indicates that the CNN appropriately learned tremor features.
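Control images of this kind are straightforward to construct. The sketch below builds the 256 monochromatic images exactly as described, one per grayscale value. For the dichromatic set, the text gives only the count (200) and the general form (a uniform band on a uniform background), so the random choice of gray levels and band position, the horizontal band orientation, the image shape, and all function names are assumptions made for illustration.

```python
import numpy as np


def monochromatic_images(shape=(2000, 400)):
    """Yield one uniform image per 8-bit grayscale value (256 images)."""
    for value in range(256):
        yield np.full(shape, value, dtype=np.uint8)


def dichromatic_images(n=200, shape=(2000, 400), seed=0):
    """Yield n images with one uniform horizontal band on a uniform background.

    Gray levels and band rows are drawn at random here; the paper does not
    say how its 200 images were chosen.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n):
        # Two distinct gray levels so that the band is always visible.
        background, band = rng.choice(256, size=2, replace=False)
        top = int(rng.integers(1, shape[0] - 1))           # row 0 stays background
        bottom = int(rng.integers(top + 1, shape[0] + 1))  # band has at least one row
        image = np.full(shape, background, dtype=np.uint8)
        image[top:bottom, :] = band
        yield image


if __name__ == "__main__":
    print(sum(1 for _ in monochromatic_images(shape=(20, 4))))
    print(sum(1 for _ in dichromatic_images(shape=(20, 4))))
```

Each generated image would then be passed to the trained CNN, and the predicted label recorded, to reproduce the kind of check described above.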

Fig. 12. Proposed CNN trained on Ver. 3 images correctly predicts label “tremor” for “tremor” images and label “none” for all monochromatic and dichromatic images.

5 Conclusion

The numerical experiments suggest that the proposed CNN can effectively detect tremors in seismograph paper records. Based on the findings from these experiments, we will train the CNN with real data, which contain a wider variety of noise than the synthetic images. To improve the CNN performance, we will also explore continued parameter tuning and additional data preprocessing methods.