Abstract
Recent neurological studies suggest that oculomotor alterations are among the most important biomarkers for detecting and characterizing Parkinson’s disease (PD), even at asymptomatic stages. Nevertheless, only global and simplified gaze trajectories, obtained from tracking devices, are generally used to represent the complex eye dynamics. Moreover, such acquisition procedures often require sophisticated calibration and invasive configuration schemes. This work introduces a novel approach that models very subtle ocular fixational movements, recorded with conventional cameras, as an imaging biomarker for PD assessment. For this purpose, video acceleration magnification is performed to enhance small fixational patterns in standard gaze video recordings of test subjects. Subsequently, feature maps are derived from spatio-temporal video slices by means of convolutional layer responses of known pre-trained CNN architectures, describing the depicted oculomotor cues. The extracted CNN features are then efficiently coded as covariance matrices in order to train a support vector machine and perform automated disease classification. Promising results were obtained through a leave-one-patient-out cross-validation scheme, showing proper PD characterization from fixational eye motion patterns in ordinary sequences.
1 Introduction
Parkinson’s disease (PD) is the second most common neurodegenerative disorder, affecting more than 6 million people worldwide [6]. The disease is directly related to deficiencies in the production of the neurotransmitter dopamine, which plays a key role in sending and receiving messages between the brain and the body. This deficiency results in progressive motor alterations, e.g., tremor, bradykinesia, stiffness, and postural instability [14].
A critical issue nowadays is the poor understanding of dopamine loss, so diagnostic procedures rely mostly on conspicuous motor alterations. In clinical practice, specific macro-scale observational tests that coarsely relate such physical signs to PD are often used as the main diagnostic tool [11]. This evaluation is, however, prone to bias and subjectivity due to high inter-subject motion variability and its dependence on the examiner’s perception [16]. To tackle these problems, researchers have proposed a variety of quantitative and computerized approaches, including body-worn motion sensors [10], computer vision schemes [4], and machine and deep learning algorithms that recognize PD behaviors during handwriting, speech, and gait tasks [15]. These approaches rely on the quantification of upper and lower limb movements, which generally appear in advanced stages [3] and are therefore not suitable for early disease detection.
Recently, eye movement has been pointed out as a very promising PD biomarker [5]. Several works report a strong association between subtle oculomotor abnormalities and dopamine concentration changes, even at disease onset [1, 12], suggesting a highly sensitive predictor of neurodegeneration before it reaches critical levels. For instance, the representative study of [8] evidenced the presence of ocular tremor responses while fixating a static target in a large cohort of PD patients at different stages. Interestingly enough, one of the control subjects, who turned out to exhibit fixational tremor, eventually developed the disease over a period of 2 years. Such studies rely on classical oculomotor monitoring protocols such as electro-oculography (EOG) and video-oculography (VOG) [7]. The main disadvantage of EOG setups is their high susceptibility to electronic noise. Modern VOG devices offer more reliable and noise-free recordings, but in both settings the obtained signals depict only global and simplified trajectories of the whole eye motion field and its intrinsic deformations. Furthermore, they require individual calibration steps and invasive configurations, usually making contact with the entire area around the eyes and thus affecting the natural visual gesture. Other approaches [19] include oculomotor quantification from velocity field patterns, but require experiments with strong eye displacements to differentiate between control and disease classes.
This paper introduces a novel strategy to enhance and quantify fixational eye micro-movements, yielding a rich ocular motility description without invasive or uncomfortable protocols. The description is obtained from spatio-temporal video slices extracted from standard videos, instead of relying on the acquisition of global gaze trajectories through sophisticated devices. First, tiny fixational patterns are highlighted by means of video acceleration magnification. As shown in Fig. 1, magnified slices better depict the oscillatory fixational patterns that visually differentiate control and PD eye motion. Then, a dense profile description is obtained from a bank of filter responses computed at early layers of deep learning architectures. These features are compactly codified as the covariance matrix of the filter channels carrying the most information energy according to an eigenvalue decomposition. The resulting covariance is then mapped to a previously trained support vector machine to classify the disease given a particular slice. A preliminary evaluation with 6 PD patients and 6 control subjects under a leave-one-patient-out cross-validation scheme shows high effectiveness in codifying fixational motility and differentiating between the two classes. The proposed approach could potentially provide support and assistance in PD diagnosis and follow-up.
2 Video Acceleration Magnification
A particular limitation in the analysis of fixational PD motion lies in the quantification of involuntary eye micro-movements [9]. Additionally, eye displacement can be masked by comparatively larger head motion. This work hence starts by performing an optical spatio-temporal amplification over fixation sequences. A remarkable fact of PD fixations is their oscillatory behavior, as shown in [8, 9]; specific frequency bands can therefore be amplified to characterize PD. To do so, we use the acceleration magnification approach of [20], which amplifies subtle (fixational) motion even in the presence of large (head) motion. Acceleration magnification analyzes the variation of pixel signals over time. In such a case, any amplified pixel signal \(\hat{I} (\mathbf {x}, t)\) can be represented by a Taylor decomposition:

$$\hat{I}(\mathbf {x}, t) \approx \underbrace{f(\mathbf {x}) + \delta (t)\,\frac{\partial f(\mathbf {x})}{\partial \mathbf {x}}}_{A} \; + \; \underbrace{\alpha \,\frac{\delta (t)^2}{2}\,\frac{\partial ^2 f(\mathbf {x})}{\partial \mathbf {x}^2}}_{B} \qquad (1)$$

where \(\delta (t)\) is the small displacement function of the original pixel signal \(I(\mathbf {x}, t) = f(\mathbf {x} + \delta (t))\), and \(\alpha =\beta ^2\) is the magnification factor.
Part A of Eq. 1 corresponds to first-order, linear changes (velocity) and is not considered. Part B represents second-order changes, i.e., acceleration, which is amplified by the factor \(\alpha \), accentuating every small deviation from linear motion. To measure motion information, local phase is computed through a complex-valued steerable pyramid \(\varPsi _{w,\theta }\): \(\mathbf {I}(\mathbf {x},t) *\varPsi _{w,\theta } = A_{w,\theta }(\mathbf {x},t)\,e^{i\,\varPhi _{w,\theta }(\mathbf {x},t)}\), where \(\mathbf {I}(\mathbf {x}, t)\) is a video frame and \(w,\theta \) are the spatial frequency bands and orientations. Then, phases are amplified by Laplacian temporal filtering:

$$\hat{\varPhi }_{w,\theta }(\mathbf {x},t) = \varPhi _{w,\theta }(\mathbf {x},t) + \alpha \left( \varPhi _{w,\theta }(\mathbf {x},t) * \frac{\partial ^2 G_{\sigma }(t)}{\partial t^2}\right) \qquad (2)$$
where \(\sigma \) is the Laplacian filter scale. This computation takes the second-order derivative of the smoothed phase signals and amplifies them by \(\alpha \) at a given temporal frequency \(f = \tfrac{\text {frame rate}}{8\pi \sqrt{2}\,\sigma }\).
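The effect of amplifying only deviations from linear motion can be sketched on a one-dimensional pixel trace. The toy numpy example below is not the phase-based pipeline of [20] (which operates on steerable-pyramid coefficients); it simply fits the dominant linear (head) motion and amplifies every residual deviation by \(\alpha\):

```python
import numpy as np

def magnify_deviation_from_linear(trace, alpha):
    """Toy analogue of acceleration magnification on a 1-D pixel trace:
    fit the dominant linear motion, then amplify every deviation from
    it by alpha, leaving the linear (velocity) part untouched."""
    n = np.arange(trace.size)
    slope, intercept = np.polyfit(n, trace, 1)
    linear = slope * n + intercept
    return linear + alpha * (trace - linear)

fps = 30
t = np.arange(0, 5, 1 / fps)
# large linear drift (head motion) plus a tiny 5.7 Hz oscillation (tremor)
trace = 0.5 * t + 0.01 * np.sin(2 * np.pi * 5.7 * t)
out = magnify_deviation_from_linear(trace, alpha=15)
```

After magnification, the oscillatory residual is 15 times larger while the drift is preserved, mirroring how part B of Eq. 1 is boosted without touching part A.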
3 Learning a CNN Fixation Representation
The resulting amplified video \(\hat{I} (\mathbf {x}, t)\) is split into a set of spatio-temporal slices \( \mathbf {S}_\theta = \{\mathbf {s}_{\theta _1}, \mathbf {s}_{\theta _2}, \ldots , \mathbf {s}_{\theta _N}\}\) at N orientations. To do so, different radial directions on the spatial xy-plane are used as reference along time. Typical slice configurations are illustrated in Fig. 2. These slices capture small iris displacements and thus constitute an ideal source of information for analyzing small ocular movements.
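The slice extraction above can be sketched as follows, assuming each slice samples intensities along a radial line through the frame centre and stacks them over time (the exact sampling geometry of Fig. 2 may differ):

```python
import numpy as np

def extract_slice(video, theta, num_samples=64):
    """Spatio-temporal slice: intensities along a radial line at angle
    theta through the frame centre, stacked over time.
    video: array of shape (T, H, W); returns (num_samples, T)."""
    T, H, W = video.shape
    cy, cx = (H - 1) / 2.0, (W - 1) / 2.0
    radius = min(cy, cx)
    r = np.linspace(-radius, radius, num_samples)
    ys = np.clip(np.round(cy + r * np.sin(theta)).astype(int), 0, H - 1)
    xs = np.clip(np.round(cx + r * np.cos(theta)).astype(int), 0, W - 1)
    return video[:, ys, xs].T  # (num_samples, T)

video = np.random.rand(150, 120, 180)   # stand-in for 5 s at 30 fps, cropped eye
thetas = np.deg2rad([0, 45, 90, 135])   # N = 4 orientations, as in Fig. 2
slices = [extract_slice(video, th) for th in thetas]
```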
Each slice \(\mathbf {s}_{\theta _i}\) is represented as a bank of separate band responses of high- and low-frequency filters, with some coverage of mid frequencies. This is achieved by mapping \(\mathbf {S_\theta }\) onto the first layers of well-known pre-trained deep convolutional frameworks, originally designed for natural scene classification and trained on the ImageNet dataset (around 1.2 million samples). In brief, such architectures progressively compute linear transformations, followed by contractive nonlinearities, projecting information onto a set of C learned filters \(\mathbf {\Psi }^j = \{\varvec{\psi }^j_1,\varvec{\psi }^j_2, \ldots , \varvec{\psi }^j_C \}\) at a given layer j. Hence, each eye slice \(\mathbf {s}_{\theta _i}\) is filtered by a particular \(\mathbf {\Psi }^j\) set, obtaining a feature representation \(\mathbf {\Phi }^j = \{\varvec{\phi }^j_1, \ldots , \varvec{\phi }^j_C\}\), with \(\varvec{\phi }^j_c = \mathbf {s}_{\theta _i} *\varvec{\psi }^j_c\) as each independent feature channel. In this work, three different pre-trained architectures were independently used and evaluated for dense eye slice representation. A succinct description of the studied deep architectures is presented below.
-
VGG-19 [17] is a classical CNN architecture with a total of 19 layers. For low-level representation purposes, we considered the first block pooling layer, with \(C=64\) filter channels and responses of size \(W=112\,\times \,H=112\), hereafter named VGG-19 Layer A. A second representation was evaluated from the second block pooling layer (\(C=128\), \(W=56 \times H=56\)), referred to as VGG-19 Layer B.
-
ResNet-101 [13] is a deep net that includes residual maps serving as recursive inputs to higher layers through shortcut connections. A primary representation was obtained from the first block pooling layer (\(C=64\), \(W=112\,\times \,H=112\)), hereafter named ResNet-101 Layer A. Additionally, the first residual-added ReLU layer was computed as a second representation (\(C=256\), \(W=56 \times H=56\)), designated ResNet-101 Layer B.
-
Inception-ResNet-v2 [18] is one of the most recent approaches, combining inception blocks, i.e., multiple sub-networks that learn independently, with residual connections that allow optimal learning rates. The third block ReLU layer, with \(C=64\) filters and responses of \(W=147\,\times \,H=147\), was used as the primary representation, referred to as Inception-ResNet-v2 Layer A. The first residual-concatenated layer (\(C=320\), \(W=35\,\times \,H=35\)) was also considered, named Inception-ResNet-v2 Layer B.
For illustration, Fig. 3 shows sample responses of the utilized deep architectures for a given input video slice.
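The per-channel mapping \(\varvec{\phi }_c = \mathbf {s}_{\theta _i} *\varvec{\psi }_c\) can be sketched with a naive 2-D correlation; here random kernels stand in for the pre-trained CNN filters, and the slice dimensions are illustrative:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 2-D 'valid' correlation of a slice with one filter."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

slice_img = np.random.rand(64, 150)   # one spatio-temporal slice (space x time)
bank = np.random.randn(8, 3, 3)       # stand-in for C learned filters psi_c
# feature representation Phi: one response map phi_c per filter channel
phi = np.stack([conv2d_valid(slice_img, k) for k in bank], axis=-1)
```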
3.1 Recognizing Parkinsonian Patterns: A Compact Fixational Descriptor
The feature representation \(\mathbf {\Phi } \in \mathbb {R}^{H \times W \times C}\) for each slice \(\mathbf {s_{\theta }}\) is composed of C filter responses \(\varvec{\phi }_{c}\) of dimensions \(H \times W\). We vectorize each \(\varvec{\phi }_{c}\), reshaping \(\mathbf {\Phi }\) to \(HW \times C\). This information is nevertheless redundant over common background regions and could lead to a wrong motion description. Hence, a very compact analysis is carried out by computing the feature covariance matrix \(\mathbf {\Sigma } = \tfrac{1}{HW}\left[ (\mathbf {\Phi } - \mu (\mathbf {\Phi }))^T (\mathbf {\Phi } - \mu (\mathbf {\Phi })) \right] \), with \(\mu (\mathbf {\Phi })\) the mean \(1 \times C\) feature vector repeated HW times. The covariance \(\mathbf {\Sigma } \in \mathbb {R}^{ C \times C}\) describes a second statistical moment over the whole feature space that compactly summarizes the fixational motion representation. From a spectral matrix analysis, we found that the energy of \(\mathbf {\Sigma }\) is concentrated on only a few eigenvalues, so we keep only the information related to the k major eigenvalues. In this way, a compact reduced covariance \(\mathbf {\Sigma _{r}} \) that captures most of the variability of the C feature channels is computed as \(\mathbf {\Sigma _{r}} =\mathbf {W^T \Sigma \,W}\), where \(\mathbf {W} \in \mathbb {R}^{C \times k}\) is the reduced eigenvector matrix of \(\mathbf {\Sigma }\) with \(k < C\).
Because covariance matrices are symmetric positive semi-definite, they lie on a Riemannian manifold rather than in a Euclidean vector space. This fact limits the application of classic machine learning approaches that assume Euclidean-structured data. We therefore project \(\mathbf {\Sigma _r}\) onto Euclidean space through its matrix logarithm: \(\log (\mathbf {\Sigma _{r}}) = \mathbf {V_r} \log (\mathbf {\Lambda _r}) \mathbf {V_r}^T\), where \(\mathbf {V_r}\) are the eigenvectors of \(\mathbf {\Sigma _{r}}\) and \(\log (\mathbf {\Lambda _r})\) the corresponding logarithmic eigenvalues. This reduced covariance constitutes the fixational motion descriptor fed into a machine learning algorithm to predict Parkinson’s disease under a supervised learning scheme. We selected a support vector machine (SVM) as the supervised model due to its demonstrated capability in defining non-linear boundaries between classes; SVMs have also widely reported proper performance on high-dimensional data with low computational complexity. As the non-linear SVM kernel, we used the classical yet powerful Radial Basis Function (RBF) kernel \(K = \exp \left( - \gamma \, || \log (\mathbf {\Sigma _{r}})_i - \log (\mathbf {\Sigma _{r}})_j ||^2 \right) \) [2].
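The log-Euclidean projection and the RBF kernel over descriptors can be sketched as follows; the \(\gamma\) value and the toy SPD matrices are illustrative assumptions:

```python
import numpy as np

def log_map(sigma_r, eps=1e-10):
    """Matrix logarithm of an SPD covariance via eigendecomposition,
    projecting it from the Riemannian manifold to Euclidean space."""
    vals, vecs = np.linalg.eigh(sigma_r)
    return vecs @ np.diag(np.log(np.maximum(vals, eps))) @ vecs.T

def rbf_kernel(Li, Lj, gamma=1e-3):
    """RBF kernel between two log-mapped covariance descriptors."""
    d2 = np.sum((Li - Lj) ** 2)   # squared Frobenius distance
    return np.exp(-gamma * d2)

# two toy SPD covariance matrices standing in for slice descriptors
A = np.cov(np.random.rand(64, 200))
B = np.cov(np.random.rand(64, 200))
k = rbf_kernel(log_map(A), log_map(B))
```

In practice the log-mapped matrices would be vectorized and handed to a precomputed-kernel SVM.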
3.2 Imaging Data
We implemented a protocol to record fixational motion in PD-diagnosed patients and control subjects. Participants were invited to observe a fixed spotlight projected on a screen with a dark background. A Nikon D3200 camera, with a spatial resolution of \(1920\,\times \,1080\) pixels and a temporal resolution of 30 fps, was fixed in front of the subjects to capture the whole face. The eye region was manually cropped (\(180\,\times \,120\) pixels) to obtain the sequences of interest. A total of 6 PD patients (average age \(71.8 \pm 12.2\)) and 6 control subjects (average age \(66.2 \pm 6.6\)) were recorded and analyzed to validate the proposed approach. PD patients were diagnosed in the second (2 patients), third (3 patients), and fourth (1 patient) stage of the disease by a physician using standard protocols of the Hoehn-Yahr scale [11]. A total of 24 sequences were recorded, i.e., 2 samples per subject, each with a duration of 5 s. The study was approved by the Ethics Committee of the Universidad Industrial de Santander, and written informed consent was obtained. The recorded dataset was possible thanks to the support of the local Parkinson foundation FAMPAS (Fundación del Adulto Mayor y Parkinson Santander) and the local elderly institution Centro Vida Años Maravillosos.
4 Evaluation and Results
Experiments were performed on the 24 recorded sequences through a leave-one-patient-out cross-validation, in which at each iteration one patient is left out to test and the remaining ones are used for training the model. For comparison purposes, we considered both standard and magnified videos. Figure 1 illustrates standard and magnified ocular fixational motion for sample PD and control subjects. Parkinsonian slices show well-defined amplitudes of slight oscillatory motility at specific frequencies (later discussed). A total of 4 slices that recover eye motion cues were extracted from each video (see Fig. 2), and thereafter mapped to the selected CNN architectures in order to obtain a deep representation. Then, a very compact covariance descriptor per slice was computed by using the minimal number of eigenvalues that concentrate \(95\%\) of information.
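The leave-one-patient-out protocol can be sketched as follows; to keep the example self-contained, a nearest-class-mean classifier and random separable features stand in for the SVM and the covariance descriptors:

```python
import numpy as np

def leave_one_patient_out(descriptors, labels, patient_ids, train_fn, predict_fn):
    """Leave-one-patient-out CV: all samples of one patient form the
    test fold; the remaining patients train the model. Returns accuracy."""
    hits = []
    for p in np.unique(patient_ids):
        test = patient_ids == p
        model = train_fn(descriptors[~test], labels[~test])
        hits.extend(predict_fn(model, descriptors[test]) == labels[test])
    return np.mean(hits)

# toy setting: 12 subjects x 2 sequences, 5-D descriptors per sequence
rng = np.random.default_rng(0)
pids = np.repeat(np.arange(12), 2)              # 2 sequences per subject
y = np.repeat([0, 1], 12)                       # 6 control, 6 PD subjects
X = rng.normal(y[:, None] * 2.0, 1.0, (24, 5))  # separable toy features

# nearest-class-mean classifier as a stand-in for the SVM
train = lambda Xt, yt: (Xt[yt == 0].mean(0), Xt[yt == 1].mean(0))
pred = lambda m, Xs: (np.linalg.norm(Xs - m[1], axis=1)
                      < np.linalg.norm(Xs - m[0], axis=1)).astype(int)
acc = leave_one_patient_out(X, y, pids, train, pred)
```

Splitting by patient rather than by sequence ensures that both sequences of a subject never appear in training and test folds simultaneously.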
Figure 4 shows a first quantitative analysis of average accuracies over the patient-fold scheme. First, an increasing number of eye slices yields a performance improvement for all of the considered architectures. Second, video magnification consistently improves disease prediction, with the largest gains when fewer slices are used in the analysis. Interestingly enough, outstanding performance is achieved on Layer A of all CNNs using a 4-slice representation: standard slices obtain complete predictions on VGG-19 and Inception-ResNet-v2, and magnified ones on all three nets. This could be related to a more comprehensive description of eye slices at the earliest representation levels, suggesting that low-level primitives properly represent the depicted fixational patterns even in standard slices. The Inception-ResNet-v2 performance is also remarkable, achieving complete predictions with only two slices. Although its Layer A is not directly related to residual inputs or inception blocks, the overall network optimization allows a better description of relevant slice features. Layer B of VGG-19 achieves the same results for two slices, but in general this layer requires more forward steps and denser representations.
A deeper analysis of Layer A was then performed in Fig. 5 by quantifying the False Negative Rate (FNR) and the False Positive Rate (FPR). The FNR is the percentage of PD patterns incorrectly classified as control, while the FPR is the percentage of control patterns misclassified as PD. In all cases, the use of four slices on magnified sequences yields negligible false predictions. In general, an exponential error decay is observed as more slices are involved in the descriptor. For the Inception-ResNet-v2 architecture this trend is more accentuated, achieving the best classification with a very compact descriptor, i.e., with fewer slices. In contrast, the ResNet-101 architecture presents the slowest decay, requiring more information to achieve proper performance.
Finally, the video magnification parameters were evaluated to correctly enhance fixational motion patterns. In the proposed approach, the temporal frequency to be magnified was fixed w.r.t. the fundamental PD ocular tremor frequency, which has been quantified during ocular fixation as \(f = 5.7\) Hz [8]. As observed in Fig. 6-(a), the proposed approach achieves equally outstanding performance over a range of characteristic PD tremor frequencies, in terms of improved accuracies per layer. Regarding the magnification factor, a visually reasonable value to emphasize this subtle oscillatory pattern was found to be \(\alpha = 15\), as shown in the spatio-temporal comparison of Fig. 6-(b).
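Inverting the relation \(f = \tfrac{\text{frame rate}}{8\pi \sqrt{2}\,\sigma }\) gives the Laplacian scale that tunes the magnification to the 5.7 Hz ocular tremor band. The helper below is a hypothetical illustration, assuming \(\sigma\) is expressed in the same temporal units as the frame rate:

```python
import math

def laplacian_scale(frame_rate, target_freq):
    """Invert f = frame_rate / (8 * pi * sqrt(2) * sigma) to obtain the
    Laplacian filter scale that tunes magnification to target_freq."""
    return frame_rate / (8 * math.pi * math.sqrt(2) * target_freq)

sigma = laplacian_scale(frame_rate=30, target_freq=5.7)
```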
5 Conclusions
In this work, we proposed a quantitative strategy to characterize ocular fixational motion as an imaging biomarker for Parkinson’s disease (PD). The approach achieves robust eye motion modeling over conventional video sequences. To do so, we recorded eye fixation experiments capturing the oculomotor activity of test subjects. Acquired videos were then magnified using an acceleration-based optical framework that enhances small motility patterns. Video slice features based on primary CNN layer responses were used to classify control and PD-diagnosed subjects under a supervised machine learning framework. Preliminary experiments on a pilot case-control set of 12 subjects yielded promising results in terms of high classification accuracies and low false-positive and false-negative rates. Despite the relatively limited sample size, a common problem in PD research [9, 19], the obtained results demonstrate a feasible alternative for PD assessment upon ordinary, magnified videos, avoiding complex and sophisticated acquisition setups such as EOG and VOG. The proposed strategy therefore represents a potential approach to understand and quantify the association between PD and eye motility, aiming to support diagnosis and follow-up of the disease. Further evaluation with a larger population sample is warranted to validate these findings. Future work also includes assessing the sensitivity of the approach across different stages of the disease.
References
Anderson, T., Macaskill, M.R.: Eye movements in patients with neurodegenerative disorders. Nat. Rev. Neurol. 9, 74–85 (2013)
Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 27:1–27:27 (2011)
Cheng, H.C., Ulane, C.M., Burke, R.E.: Clinical progression in Parkinson disease and the neurobiology of axons. Ann. Neurol. 67(6), 715–725 (2010)
Contreras, S., Salazar, I., Martínez, F.: Parkinsonian hand tremor characterization from magnified video sequences. In: 14th International Symposium on Medical Information Processing and Analysis, SPIE, vol. 10975, p. 1097503 (2018)
Ekker, M.S., Janssen, S., Seppi, K., et al.: Ocular and visual disorders in Parkinson’s disease: common but frequently overlooked. Parkinsonism Related Disord. 40, 1–10 (2017)
Feigin, V.L., Abajobir, A.A., Abate, K.H., et al.: Global, regional, and national burden of neurological disorders during 1990–2015: a systematic analysis for the Global Burden of Disease Study 2015. Lancet Neurol. 16(11), 877–897 (2017)
Furman, J.M., Wuyts, F.L.: Vestibular laboratory testing. In: Aminoff, M.J. (ed.) Aminoff’s Electrodiagnosis in Clinical Neurology, 6th edn., pp. 699–723. W.B. Saunders, London (2012)
Gitchel, G.T., Wetzel, P.A., Baron, M.S.: Pervasive ocular tremor in patients with Parkinson disease. Arch. Neurol. 69(8), 1011–1017 (2012)
Gitchel, G.T., Wetzel, P.A., Qutubuddin, A., Baron, M.S.: Experimental support that ocular tremor in Parkinson’s disease does not originate from head movement. Parkinsonism Related Disord. 20(7), 743–747 (2014)
Godinho, C., Domingos, J., Cunha, G., et al.: A systematic review of the characteristics and validity of monitoring technologies to assess Parkinson’s disease. J. Neuroeng. Rehabil. 13(1), 24 (2016)
Goetz, C.G., Poewe, W., Rascol, O., et al.: Movement Disorder Society Task Force report on the Hoehn and Yahr staging scale: status and recommendations. The Movement Disorder Society Task Force on Rating Scales for Parkinson’s Disease. Mov. Disord. 19(9), 1020–1028 (2004)
Gorges, M., Müller, H.P., Lulé, D., et al.: The association between alterations of eye movement control and cerebral intrinsic functional connectivity in Parkinson’s disease. Brain Imaging Behav. 10(1), 79–91 (2016)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Jankovic, J.: Parkinson’s disease: clinical features and diagnosis. J. Neurol. Neurosurg. Psychiatry 79(4), 368–376 (2008)
Pereira, C.R., Pereira, D.R., Weber, S.A., et al.: A survey on computer-assisted Parkinson’s disease diagnosis. Artif. Intell. Med. 95, 48–63 (2018)
Rizzo, G., Copetti, M., Arcuti, S., et al.: Accuracy of clinical diagnosis of Parkinson disease: a systematic review and meta-analysis. Neurology 86(6), 566–576 (2016)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, inception-resnet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
Trujillo, D., Martínez, F., Atehortúa, A., et al.: A characterization of Parkinson’s disease by describing the visual field motion during gait. In: 11th International Symposium on Medical Information Processing and Analysis, SPIE, vol. 9681 (2015)
Zhang, Y., Pintea, S.L., Van Gemert, J.C.: Video acceleration magnification. In: Computer Vision and Pattern Recognition (2017)
Acknowledgements
The authors thank the Vicerrectoría de Investigación y Extensión of the Universidad Industrial de Santander for supporting this research work by the project “Reconocimiento continuo de expresiones cortas del lenguaje de señas registrado en secuencias de video”, with SIVIE code 2430.
© 2019 Springer Nature Switzerland AG
Salazar, I., Pertuz, S., Contreras, W., Martínez, F. (2019). Parkinsonian Ocular Fixation Patterns from Magnified Videos and CNN Features. In: Nyström, I., Hernández Heredia, Y., Milián Núñez, V. (eds) Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. CIARP 2019. Lecture Notes in Computer Science, vol. 11896. Springer, Cham. https://doi.org/10.1007/978-3-030-33904-3_70
Print ISBN: 978-3-030-33903-6
Online ISBN: 978-3-030-33904-3