1 Introduction

Parkinson’s disease (PD) is the second most common neurodegenerative disorder, with a prevalence exceeding 6 million people worldwide [6]. The disease is directly related to deficits in the production of the neurotransmitter dopamine, which plays a key role in transmitting messages between the brain and the body. This deficit results in progressive motor alterations, e.g., tremor, bradykinesia, rigidity, and postural instability [14].

A critical issue nowadays is the poor understanding of dopamine loss, which leaves diagnostic procedures based mostly on conspicuous motor alterations. In clinical practice, coarse observational tests that relate such physical signs to PD are often used as the main diagnostic tool [11]. This evaluation is, however, prone to bias and subjectivity due to high inter-subject motion variability and its dependency on the examiner’s perception [16]. To tackle these problems, researchers have proposed a variety of quantitative and computerized approaches, including body-worn motion sensors [10], computer vision schemes [4], and machine and deep learning algorithms that recognize PD behaviors during handwriting, speech, and gait tasks [15]. These approaches rely on the quantification of upper and lower limb movements, which generally appear altered only in advanced stages [3] and are therefore not suitable for early disease detection.

Fig. 1. Ocular fixation comparison of PD and control subjects. Magnified xt-slices accentuate the visual difference between both classes. Video magnification was applied around a frequency of 5.7 Hz, previously identified in PD fixational behavior [8]. (a) PD sequence, where the magnification enhances an oscillatory fixational instability. (b) Control sequence, where no particular oculomotor pattern can be detected at a glance.

Recently, eye movement has been pointed out as a promising PD biomarker [5]. Several works report a strong association between subtle oculomotor abnormalities and changes in dopamine concentration, even at disease onset [1, 12], suggesting a highly sensitive predictor of neurodegeneration before it reaches critical levels. For instance, the representative study of [8] evidenced ocular tremor responses while fixating a static target in a large cohort of PD patients at different stages. Interestingly, one control subject who exhibited fixational tremor went on to develop the disease within 2 years. Such studies rely on classical oculomotor monitoring protocols such as electro-oculography (EOG) and video-oculography (VOG) [7]. The main disadvantage of EOG setups is their high susceptibility to electronic noise. Modern VOG devices offer more reliable, low-noise recordings, but in both settings the obtained signals depict only global, simplified trajectories of the whole eye motion field and its intrinsic deformations. Furthermore, they require individual calibration steps and invasive configurations, usually making contact with the entire area around the eyes and thus affecting the natural visual gesture. Other approaches [19] quantify oculomotor behavior from velocity field patterns, but require experiments with strong eye displacements to differentiate between control and disease classes.

This paper introduces a novel strategy to enhance and quantify fixational eye micro-movements, resulting in a rich ocular motility description without invasive or uncomfortable protocols. This description is obtained from spatio-temporal video slices extracted from standard videos, instead of relying on the acquisition of global gaze trajectories through sophisticated devices. First, tiny fixation patterns are highlighted by means of video acceleration magnification. As shown in Fig. 1, magnified slices better depict the oscillatory fixational patterns that visually differentiate control from PD eye motion. Then, a dense profile description is obtained from a bank of filter responses computed at early layers of deep learning architectures. These features are compactly codified as the covariance matrix of the filter channels carrying the most information energy according to an eigenvalue decomposition. The resulting covariance is then mapped to a previously trained support vector machine to classify the disease given a particular slice. A preliminary evaluation with 6 PD patients and 6 control subjects under a leave-one-patient-out cross-validation scheme shows high effectiveness in codifying fixational motility and differentiating between both classes. The proposed approach could potentially support and assist PD diagnosis and follow-up.

2 Video Acceleration Magnification

A particular limitation in the analysis of fixational PD motion lies in the quantification of involuntary eye micro-movements [9]. Additionally, eye displacement can be masked by comparatively larger head motion. This work hence starts by performing an optical spatio-temporal amplification over fixation sequences. A remarkable property of PD fixations is their oscillatory behavior, as shown in [8, 9]; we can therefore amplify specific frequency bands to characterize PD. To do so, we use the acceleration magnification approach of [20], which amplifies subtle (fixational) motion even in the presence of large (head) motion. Acceleration magnification works by analyzing the variation of pixel signals over time. In this case, any amplified pixel signal \(\hat{I}(\mathbf{x}, t)\) can be represented by a Taylor decomposition:

$$\begin{aligned} \hat{I}(\mathbf{x}, t) = f(\mathbf{x} + \beta\,\delta(t)) \approx f(\mathbf{x}) + \underbrace{\beta\,\delta(t)\,\tfrac{\partial f(\mathbf{x})}{\partial \mathbf{x}}}_{A} + \underbrace{\alpha\,\delta(t)^{2}\,\tfrac{\partial^{2} f(\mathbf{x})}{\partial \mathbf{x}^{2}}}_{B}\,, \end{aligned}$$
(1)

where \(\delta (t)\) is the small displacement function of the original pixel signal \(I(\mathbf {x}, t) = f(\mathbf {x} + \delta (t))\), and \(\alpha =\beta ^2\) is the magnification factor.

Part A of Eq. 1 is related to first-order linear changes (velocity) and is not considered. Part B represents second-order changes, i.e., acceleration, which is amplified by a factor \(\alpha\) that accentuates every small deviation from linear motion. To measure motion information, local phase is computed through a complex-valued steerable pyramid \(\varPsi _{w,\theta }\): \(\mathbf {I}(\mathbf {x},t) *\varPsi _{w,\theta } = A_{w,\theta }(\mathbf {x},t)\,e^{i\,\varPhi _{w,\theta }(\mathbf {x},t)}\), where \(\mathbf {I}(\mathbf {x}, t)\) is a video frame and \(w,\theta \) index spatial frequency bands and orientations. Then, phases are amplified using Laplacian temporal filtering:

$$\begin{aligned} \hat{\varPhi }_{w,\theta }(\mathbf {x},t) = \varPhi _{w,\theta }(\mathbf {x},t) \,\,+\,\, \alpha \,\,\varPhi _{w,\theta }(\mathbf {x},t) * \tfrac{\partial ^2 G_{\sigma }(\mathbf {x},t)}{\partial t^2} \,\, , \end{aligned}$$
(2)

where \(\sigma \) is the Laplacian filter scale. This computation takes the second-order derivative of the smoothed phase signals and amplifies them by \(\alpha \) at a given temporal frequency \(f = \tfrac{\text {frame rate}}{8\pi \sqrt{2}\,\sigma }\).
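As an illustration, the following minimal Python sketch (our own simplification, not the original implementation) applies the phase amplification of Eq. (2) to a single one-dimensional sub-band phase signal, with \(\sigma\) derived from the frequency relation above; the synthetic signal and all names are assumptions for illustration only.

```python
# Minimal 1-D sketch of the phase amplification in Eq. (2); illustrative only.
import numpy as np
from scipy.ndimage import gaussian_filter1d

def amplify_phase(phase, alpha, sigma):
    """Amplify deviations from linear phase (acceleration) by alpha.

    phase : (T,) phase signal of one pyramid coefficient over time
    alpha : magnification factor (alpha = beta**2 in Eq. (1))
    sigma : temporal scale of the Gaussian whose 2nd derivative is applied
    """
    # order=2 convolves with the second derivative of a Gaussian, isolating
    # second-order (acceleration) temporal changes of the phase.
    accel = gaussian_filter1d(phase, sigma=sigma, order=2, mode="nearest")
    return phase + alpha * accel

# Choosing sigma for a target frequency via f = frame_rate/(8*pi*sqrt(2)*sigma):
frame_rate, f_target = 30.0, 5.7                  # 30 fps video, PD tremor band
sigma = frame_rate / (8 * np.pi * np.sqrt(2) * f_target)

t = np.arange(150) / frame_rate
phase = 0.02 * np.sin(2 * np.pi * f_target * t)   # synthetic sub-band phase
amplified = amplify_phase(phase, alpha=15.0, sigma=sigma)
```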

3 Learning a CNN Fixation Representation

The resulting amplified video \(\hat{I} (\mathbf {x}, t)\) is split up into a set of spatio-temporal slices \( \mathbf {S}_\theta = \{\mathbf {s}_{\theta _1}, \mathbf {s}_{\theta _2}, \ldots , \mathbf {s}_{\theta _N}\}\) at N orientations. To do so, different radial directions on the spatial xy-plane are used as references along time; typical slice configurations are illustrated in Fig. 2. These slices capture small iris displacements and hence constitute an ideal source of information for analyzing small ocular movements.
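A minimal sketch of this slice extraction follows, assuming the magnified eye sequence is available as a (T, H, W) grayscale array; the function and variable names are illustrative, not taken from the original implementation.

```python
import numpy as np

def extract_slice(video, angle_deg):
    """Sample one xt-slice along a radial direction through the frame center.

    Returns a (T, L) array: for each frame, the pixels along a line at
    `angle_deg` on the xy-plane, stacked over time.
    """
    T, H, W = video.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    L = min(H, W)
    r = np.linspace(-L / 2.0, L / 2.0, L)
    theta = np.deg2rad(angle_deg)
    ys = np.clip(np.round(cy + r * np.sin(theta)).astype(int), 0, H - 1)
    xs = np.clip(np.round(cx + r * np.cos(theta)).astype(int), 0, W - 1)
    return video[:, ys, xs]  # (T, L) spatio-temporal slice

# Four slices at evenly spaced orientations, as in Fig. 2:
# slices = [extract_slice(video, a) for a in (0, 45, 90, 135)]
```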

Fig. 2. Spatio-temporal video slices. At different slice directions, relevant cues in fixation recordings can be captured.

Each slice \(\mathbf {s}_{\theta _i}\) is represented as a bank of band responses from high- and low-frequency filters, with some coverage of mid frequencies. This is achieved by mapping \(\mathbf {S_\theta }\) onto the first layers of well-known pre-trained deep convolutional frameworks, originally built for the classification of natural scenes and trained on the ImageNet dataset (around 1.2 million samples). In brief, such architectures progressively compute linear transformations followed by contractive nonlinearities, projecting information onto a set of C learned filters \(\mathbf {\Psi }^j = \{\varvec{\psi }^j_1,\varvec{\psi }^j_2, \ldots , \varvec{\psi }^j_C \}\) at a given layer j. Hence, each eye slice \(\mathbf {s}_{\theta _i}\) is filtered by a particular set \(\mathbf {\Psi }^j\), yielding a feature representation \(\mathbf {\Phi }^j = \{\varvec{\phi }^j_1, \ldots , \varvec{\phi }^j_C\}\), with \(\varvec{\phi }^j_c = \mathbf {s}_{\theta _i} *\varvec{\psi }^j_c\) as each independent feature channel. In this work, three different pre-trained architectures were independently used and evaluated for dense eye-slice representation. A succinct description of the studied deep architectures is presented below, followed by a feature-extraction sketch.

  • VGG-19 [17]: a classical CNN architecture with a total of 19 layers. For low-level representation purposes, we considered the first-block pooling layer, with \(C=64\) filter channels and responses of size \(W=112\,\times \,H=112\), hereafter named VGG-19 Layer A. A second representation was evaluated from the second-block pooling layer (\(C=128\), \(W=56 \times H=56\)), referred to as VGG-19 Layer B.

  • ResNet-101 [13]: a deep network that propagates residual maps to deeper layers through shortcut connections. A primary representation was obtained from the first-block pooling layer (\(C=64\), \(W=112\,\times \,H=112\)), hereafter named ResNet-101 Layer A. Additionally, the first residual-added ReLU layer was used as a second representation (\(C=256\), \(W=56 \times H=56\)), designated ResNet-101 Layer B.

  • Inception-ResNet-v2 [18]: one of the most recent approaches, combining inception blocks, i.e., multiple sub-networks that learn independently, with residual connections that accelerate training. The third-block ReLU layer, with \(C=64\) filters and responses of \(W=147\,\times \,H=147\), was used as the primary representation, referred to as Inception-ResNet-v2 Layer A. The first residual-concatenated layer (\(C=320\), \(W=35\,\times \,H=35\)) was also considered, named Inception-ResNet-v2 Layer B.
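The sketch below illustrates the Layer A extraction for VGG-19 using torchvision's pre-trained weights; the layer cut-off index, the channel replication of the grayscale slice, and all names are our assumptions, and the other architectures would be handled analogously.

```python
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T

# First VGG-19 block up to its pooling layer: conv-ReLU-conv-ReLU-pool,
# yielding C=64 responses of 112x112 for a 224x224 input (Layer A above).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
layer_a = torch.nn.Sequential(*list(vgg.children())[:5])

preprocess = T.Compose([
    T.ToTensor(),
    T.Resize((224, 224)),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def slice_features(slice_2d):
    """Map one (T, L) grayscale slice to its (64, 112, 112) Layer A responses."""
    rgb = np.repeat(slice_2d[..., None], 3, axis=-1).astype(np.uint8)
    x = preprocess(rgb).unsqueeze(0)  # (1, 3, 224, 224)
    with torch.no_grad():
        return layer_a(x).squeeze(0)  # (64, 112, 112)
```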

For illustration, Fig. 3 shows sample responses of the utilized deep architectures for a given input video slice.

Fig. 3. Sample filter responses from each CNN architecture. In general, the selected layers exhibit a high response rate to small local slice patterns, such as lines, edges, and corners. This representation can thus provide an adequate feature map for the depicted fixational cues.

3.1 Recognizing Parkinsonian Patterns: A Compact Fixational Descriptor

The feature representation \(\mathbf {\Phi } \in \mathbb {R}^{H \times W \times C}\) for each slice \(\mathbf {s_{\theta }}\) is composed of C filter responses \(\varvec{\phi }_{c}\) of dimensions \(H \times W\). We vectorize each \(\varvec{\phi }_{c}\), reshaping \(\mathbf {\Phi }\) to \(HW \times C\). This information is nevertheless redundant over common background and could lead to a wrong motion description. Hence, a very compact analysis is carried out by computing the feature covariance matrix \(\mathbf {\Sigma } = \tfrac{1}{HW}\left[ (\mathbf {\Phi } - \mu (\mathbf {\Phi }))^T (\mathbf {\Phi } - \mu (\mathbf {\Phi })) \right] \), with \(\mu (\mathbf {\Phi })\) the mean \(1 \times C\) feature vector repeated HW times. The covariance \(\mathbf {\Sigma } \in \mathbb {R}^{ C \times C}\) describes a second statistical moment over the whole feature space that compactly summarizes the fixational motion representation. From a spectral matrix analysis, we found that the energy of \(\mathbf {\Sigma }\) is concentrated in only a few eigenvalues, so we keep only the information related to the k largest eigenvalues. In this way, a compact reduced covariance \(\mathbf {\Sigma _{r}} \) that captures most of the variability of the C feature channels is computed as \(\mathbf {\Sigma _{r}} =\mathbf {W^T \Sigma \,W}\), where \(\mathbf {W} \in \mathbb {R}^{C \times k}\) is the reduced eigenvector matrix of \(\mathbf {\Sigma }\), with \(k < C\).
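A sketch of this descriptor computation, assuming `phi` holds the (H, W, C) feature map of one slice; the 95% energy criterion anticipates the setting used in Sect. 4, and all names are illustrative.

```python
import numpy as np

def reduced_covariance(phi, energy=0.95):
    """Compact covariance of the C feature channels of one slice."""
    H, W, C = phi.shape
    X = phi.reshape(H * W, C)                   # one C-dim feature per pixel
    Xc = X - X.mean(axis=0, keepdims=True)
    cov = (Xc.T @ Xc) / (H * W)                 # (C, C) second moment
    evals, evecs = np.linalg.eigh(cov)          # ascending eigenvalues
    evals, evecs = evals[::-1], evecs[:, ::-1]  # reorder: largest first
    k = int(np.searchsorted(np.cumsum(evals) / evals.sum(), energy)) + 1
    Wk = evecs[:, :k]                           # reduced eigenvector matrix
    return Wk.T @ cov @ Wk                      # (k, k) reduced covariance
```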

Since covariance matrices are symmetric positive semi-definite, they lie on a Riemannian manifold rather than in a Euclidean space. This fact limits the application of classic machine learning approaches that assume Euclidean structured data. We therefore project \(\mathbf {\Sigma _r}\) onto Euclidean space via its matrix logarithm: \(\log (\mathbf {\Sigma _{r}}) = \mathbf {V_r} \log (\mathbf {\Lambda _r}) \mathbf {V_r}^T\), where \(\mathbf {V_r}\) are the eigenvectors of \(\mathbf {\Sigma _{r}}\) and \(\log (\mathbf {\Lambda _r})\) the corresponding logarithmic eigenvalues. This log-mapped reduced covariance is the fixational motion descriptor fed into a supervised machine learning algorithm to obtain a Parkinson’s disease prediction. We selected a support vector machine (SVM) as the supervised model due to its demonstrated capability at defining non-linear boundaries between classes; SVMs have also widely reported good performance on high-dimensional data with low computational complexity. As the non-linear SVM kernel, we used the classical yet powerful Radial Basis Function (RBF) kernel \(K = \exp \left( - \gamma \, || \log (\mathbf {\Sigma _{r}})_i - \log (\mathbf {\Sigma _{r}})_j ||^2 \right) \) [2].
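The following sketch combines the log-Euclidean mapping with an RBF-kernel SVM via scikit-learn; the small epsilon regularization and the assumption of a common reduced size k across slices are our additions.

```python
import numpy as np
from sklearn.svm import SVC

def log_map(cov_r, eps=1e-6):
    """Project an SPD covariance onto Euclidean space via the matrix log."""
    evals, evecs = np.linalg.eigh(cov_r)
    log_evals = np.log(np.maximum(evals, eps))     # eps keeps the log defined
    log_cov = evecs @ np.diag(log_evals) @ evecs.T
    # Flattening preserves the Frobenius distance used in the RBF kernel.
    return log_cov.ravel()

# Assuming `covariances` is a list of (k, k) reduced covariances, labels y:
# X = np.stack([log_map(c) for c in covariances])
# clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```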

3.2 Imaging Data

We implemented a protocol to record fixational motion in PD-diagnosed patients and control subjects. Participants were asked to observe a fixed spotlight projected on a screen with a dark background. A Nikon D3200 camera with a spatial resolution of \(1920\,\times \,1080\) pixels and a temporal resolution of 30 fps was fixed in front of the subjects to capture the whole face. The eye region was manually cropped (\(180\,\times \,120\) pixels) to obtain the sequences of interest. A total of 6 PD patients (average age \(71.8 \pm 12.2\) years) and 6 control subjects (average age \(66.2 \pm 6.6\) years) were recorded and analyzed to validate the proposed approach. PD patients had been diagnosed by a physician at the second (2 patients), third (3 patients), and fourth (1 patient) stage of the disease using standard Hoehn-Yahr scale protocols [11]. A total of 24 sequences were recorded, i.e., 2 samples per subject, each with a duration of 5 s. This study was approved by the Ethics Committee of the Universidad Industrial de Santander, and written informed consent was obtained. The recorded dataset was made possible thanks to the support of the local Parkinson foundation FAMPAS (Fundación del Adulto Mayor y Parkinson Santander) and the local elderly institution Centro Vida Años Maravillosos.

4 Evaluation and Results

Experiments were performed on the 24 recorded sequences through leave-one-patient-out cross-validation, in which at each iteration one subject is left out for testing and the remaining ones are used to train the model. For comparison purposes, we considered both standard and magnified videos. Figure 1 illustrates standard and magnified ocular fixational motion for sample PD and control subjects; parkinsonian slices show well-defined amplitudes of slight oscillatory motility at specific frequencies (discussed later). A total of 4 slices recovering eye motion cues were extracted from each video (see Fig. 2) and thereafter mapped to the selected CNN architectures to obtain a deep representation. Then, a very compact covariance descriptor per slice was computed using the minimal number of eigenvalues that concentrates \(95\%\) of the spectral energy.
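A sketch of this validation scheme using scikit-learn's LeaveOneGroupOut, where the group id ties both sequences of a subject together so they are always held out jointly; the toy descriptors stand in for the real covariance features.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(24, 36))         # 24 sequences, toy descriptors
y = np.repeat([1, 0], 12)             # 1 = PD, 0 = control
groups = np.repeat(np.arange(12), 2)  # subject id: 2 sequences per subject

# One fold per subject: both of their sequences are left out for testing.
scores = cross_val_score(SVC(kernel="rbf", gamma="scale"),
                         X, y, groups=groups, cv=LeaveOneGroupOut())
print(f"mean accuracy: {scores.mean():.3f}")
```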

Fig. 4. Accuracy obtained using the different deep architectures and different numbers of slices on standard and magnified videos. In general, the proposed approach achieves outstanding results for PD prediction.

Figure 4 shows a first quantitative analysis of average accuracies over the patient-fold scheme. Initially, it should be noted that an increasing number of eye slices yields a performance improvement; this holds for all of the considered architectures. Secondly, video magnification consistently contributes to improving disease prediction, with the largest contribution when fewer slices are used in the analysis. Interestingly, outstanding performance is achieved at layer A of all CNNs using a 4-slice representation. In this case, standard slices obtain complete (i.e., 100% accurate) predictions on VGG-19 and Inception-ResNet-v2, and magnified ones on all three networks. This could be related to a more comprehensive description of eye slices at the earliest representation levels, suggesting that low-level primitives properly represent the depicted fixational patterns even on standard slices. Also remarkable is the performance of Inception-ResNet-v2, which achieves complete predictions with only two slices: although its Layer A is not directly connected to residual inputs or inception blocks, the overall network optimization allows a better description of relevant slice features. Layer B of VGG-19 achieves the same results for two slices, but in general this layer requires more forward steps and denser representations.

Thereafter, a deeper analysis of layer A was performed in Fig. 5 by quantifying the False Negative Rate (FNR) and the False Positive Rate (FPR). The FNR is the percentage of PD patterns incorrectly classified as control, while the FPR is the percentage of control patterns misclassified as PD. In all cases, the use of four slices on magnified sequences yields a negligible rate of false predictions. In general, an exponential error decay can be observed as more slices are involved in the descriptor. For the Inception-ResNet-v2 architecture this trend is more accentuated, achieving the best classification with a very compact descriptor, i.e., using fewer slices. On the other hand, the ResNet-101 architecture presents the slowest decay, requiring more information to achieve proper performance.
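For reference, a minimal sketch of the FNR/FPR computation from pooled cross-validation predictions, using scikit-learn's confusion matrix; the label convention (1 = PD, 0 = control) and the toy label arrays are illustrative assumptions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Pooled ground-truth and predicted labels over all folds (toy values here).
y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
fnr = fn / (fn + tp)  # PD sequences misclassified as control
fpr = fp / (fp + tn)  # control sequences misclassified as PD
```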

Fig. 5. FPR and FNR indices for Layer A of the selected architectures.

Finally, the video magnification parameters were evaluated to correctly enhance fixational motion patterns. In the proposed approach, the temporal frequency to be magnified was fixed with respect to the fundamental PD ocular tremor frequency, which during ocular fixation has been quantified in the literature as \(f = 5.7\) Hz [8]. As observed in Fig. 6-(a), the proposed approach achieves equally outstanding performance over a range of characteristic PD tremor frequencies, in terms of improved accuracies per layer. Regarding the magnification factor, a visually reasonable value to emphasize this subtle oscillatory pattern was found to be \(\alpha = 15\), as shown in the spatio-temporal comparison of Fig. 6-(b).

5 Conclusions

In this work, we proposed a quantitative strategy to characterize ocular fixational motion as an imaging biomarker for Parkinson’s disease (PD). The approach achieves robust eye motion modeling over conventional video sequences. To do so, we recorded eye fixation experiments to capture the oculomotor activity of test subjects. Acquired videos were then magnified using an optical acceleration-based framework that enhances small motility patterns. Video slice features based on early CNN layer responses were used to classify control and PD-diagnosed subjects under a supervised machine learning framework. Preliminary experiments on a pilot case-control set of 12 subjects yielded promising results in terms of high classification accuracies and low false-positive and false-negative rates. Despite the relatively limited sample size, a common problem in PD research [9, 19], the obtained results demonstrate a feasible alternative for PD assessment based on ordinary and magnified videos, avoiding complex and sophisticated acquisition setups such as EOG and VOG. The proposed strategy therefore represents a potential approach to understanding and quantifying the association between PD and eye motility, aiming to support diagnosis and follow-up of the disease. To validate our findings, further evaluation with a larger population sample is warranted. Future work also includes assessing sensitivity to different stages of the disease.

Fig. 6. Video magnification variables. (a) Influence of the magnification frequency choice on the disease classification task. Reported values were obtained by counting the number of layer representations (layers A and B at the three considered slice quantities) with increased performance due to the magnification step. (b) Visual effect of different magnification factors.