1 Introduction

Increasing life expectancy of the world's population has resulted in a growing number of older adults. It is estimated that by 2050 the percentage of the UK population aged over 60 years will double to 23.6%, while in Japan it will reach 35.6% and globally it will exceed 20%, i.e. 2.43 billion people [1]. However, healthy life expectancy statistics have not followed these trends [2], and increasing numbers of elderly people require social or clinical care due to age-related cognitive conditions and frailty [3]. For example, hospitalized patients over 65 years account for 62% of total bed days and stay 11.9 days in hospital on average, making older people the main users of both secondary and primary care [4]. Much of this health expenditure is due to people living longer with multiple chronic illnesses: as the population ages, healthcare costs, the complexity of healthcare delivery and the burden on caregivers all grow [5, 6]. The age structure of the population is expected to continue to change; with the proportion of younger age groups continuing to decline [7], the ratio of caregivers to pensioners will decrease. These trends have in turn put an increased burden on national health providers such as the NHS and on care delivery organizations.

To mitigate this strain, Information and Communications Technologies (ICT) can be very useful in facilitating the remote delivery of some of the essential care that elderly people need in order to stay safe at home without unnecessary, expensive hospitalizations or admission into nursing homes. In this context, the MiiHome project [8, 9] was recently formed to develop a new eHealth system for autonomous remote monitoring and decision-making. Because implementation of autonomous remote monitoring requires home installation of sensors and associated minicomputers, ethical, environmental and infrastructural considerations are vital. Engagement with a wide range of stakeholders is therefore necessary and is sought via a care organization delivering integrated health and social care. This includes Salford Royal NHS Foundation Trust, which works closely with the local Council to develop an Integrated Care Organisation (ICO) called “Salford Together” [29, 30]. Salford Together conducted a community engagement exercise in 2017 called “The Big Health and Social Care Conversation” [31]. We have adopted some of the output from this exercise to inform the direction of our MiiHome project in Salford. For example, we have explored how ICTs may help the people of Salford, while being mindful that the solution should be applicable more widely across the UK National Health Service (NHS) or worldwide. In this regard, the new technologies are eventually viewed as an integral component of the home. Working closely with the housing sector is therefore important, because the project relies on implementation of (albeit modest) structural changes to homes. Mobilization of the tenants of our project partner, Salix Homes, a registered social housing provider based in Salford that owns 8,500 properties, has greatly added to the richness of the intervention and driven forward recruitment of participants. There are currently around 10,000 tenants, mostly elderly, living in these properties. The resident age group and distribution are shown in Fig. 1. Co-creation with all stakeholders [3] and amplification of the agency of such tenant/participants are pivotal to the broader aim of developing a proactive approach to maintaining the health and well-being of elderly people, by developing a product that meets clinical requirements and the expectations of the participants while remaining sustainable for the NHS.

Fig. 1 a Age group and b distribution of residents in Salix Homes

In order to analyse the acquired data and to assist health professionals in reaching a conclusion about the state of health of the elderly residents, a decision support tool was put in place. The capability of the current tool is limited to measuring gait speeds using a conventional method and reporting them for frailty assessment. Because it lacks the sophistication to deal with big data, it takes extremely long to retrieve and analyse the data accumulated over time [77]. The main focus of our research, as presented in this paper, is extending the capabilities of the current decision support tool with more sophisticated algorithms that analyse the data acquired from various sources quickly and accurately, in order to assist clinicians in the diagnosis and prognosis of cognitive impairment. In the next section, after reviewing related work and the current state of the art, the methodology used in this study is described at some length. After explaining the current constraints on applying our methodology and our solutions to address them, the results obtained from the application of various machine learning techniques for routine human behaviour recognition are presented. This is followed by a discussion and validation of the results, especially of how they may help clinicians predict cognitive decline in a home setting. The paper ends with conclusions and suggestions for further work, particularly because it reports only those results derived from the elderly participants' physical (in)activities and does not cover the clinical results.

2 Related work

2.1 Assistive living technologies and autonomous remote monitoring

To improve and provide better care for a globally ageing population, and to facilitate its delivery in terms of cost and reduced complexity, recent developments in the field of assistive living technologies have attracted the attention of health professionals and researchers [10,11,12,13,14]. Low-cost, technology-based solutions, including ambient living and remote healthcare management systems, are under development to tackle some aspects of ageing [13, 14]. The use of autonomous systems for remote health monitoring is well known, and conventionally there are two types of remote monitoring technology used to support elderly people living independently at home. The first is wearable devices, such as pendants, which send an alarm signal when pressed, or automatically when incorporating sensors such as accelerometers [30]. Smartphones have been used as wearables for fall detection [33], although with limitations: the participant needs to remember to carry the device at all times and keep it charged and switched on [34]. Furthermore, smartphones need to be kept in a fixed position and are susceptible to false positives [33]. The second type of technology used to aid independence is environmentally embedded (non-wearable) sensors. Many kinds are available, such as infrared motion sensors to detect presence; bed sensors that can detect pulse, respiration and restlessness; and pressure sensors to detect walking and gait. Rantz et al. [35] described the development of “TigerPlace”, a purpose-built US retirement community using multiple embedded sensors to support residents. Lin et al. [36] developed a four-device system for home-based remote detection of frailty. This approach benefits from the sensors being mains powered, and elderly people generally prefer non-wearables as they are less intrusive [34]. The Technology Integrated Health Management (TIHM) project in Surrey, UK [38] uses a combination of passive environmental sensors, medical devices, wearable technologies and interactive applications to collect real-time data on environmental conditions (i.e. humidity, temperature, appliance usage), patients' physiological parameters (i.e. blood pressure, pulse) and their daily lifestyles. For measuring gait properties, wearable insole pressure sensors can be used in home settings, but they have drawbacks such as needing correct placement, regular calibration and battery charging/replacement [39]. Optical gait analysis systems require two video cameras for depth information, and their set-up and calibration can be complex [39]. In the study by Alberdi et al. [37], the daily distance that subjects moved inside their homes was estimated by computing the distance between the areas of the home covered by each passive infrared (PIR) motion sensor, as determined from the sensor layout and the floor plan. Note that this approach only approximates the real distance covered, as it does not consider walls or other obstacles between the sensors that must be avoided or navigated. In TIHM, all sensor and medical devices record and send their data to corresponding gateways over Wi-Fi, Bluetooth or, in exceptional cases, via auxiliary interfaces of the device. Gateways relay these data to the companies' back-end systems over GPRS, SigFox or home broadband [40]. At this stage, all participating companies comply with a common JSON data model for the TIHM project.
Communication with the TIHM back-end system is then handled through a publish/subscribe (pub/sub) or message queue (MQ) model; specifically, the Advanced Message Queuing Protocol (AMQP) is employed. For in-home assessment of walking speed, [41] uses PIR sensors and a wireless network for data collection. The authors assume that the sensors are placed at known physical positions in a common spatial coordinate system; for a particular walking event, the times at which the sensors fire are fitted to a linear walking model, from which velocity is estimated. The linear model was experimentally verified on a total of 882 walks from 27 participants.
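
The walking model in [41] is only summarized here; as an illustration, a minimal sketch of the underlying idea, fitting a straight line to the times at which PIR sensors at known positions fire and taking the slope as the velocity estimate, might look like the following (all positions and timestamps are hypothetical):

```python
import numpy as np

# Hypothetical PIR sensor positions along a corridor (metres, from the floor
# plan) and the timestamps (seconds) at which each sensor fired for one walk.
positions = np.array([0.0, 1.2, 2.4, 3.6])
fire_times = np.array([0.0, 1.1, 2.3, 3.4])

# Fit position = v * t + p0 by least squares; the slope v is the estimated
# walking speed for this event.
A = np.vstack([fire_times, np.ones_like(fire_times)]).T
(v, p0), *_ = np.linalg.lstsq(A, positions, rcond=None)
print(f"estimated walking speed: {v:.2f} m/s")
```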

2.2 MS Kinect

Recently, MS Kinect (referred to as Kinect from now on) has been widely used for health monitoring. Kinect's ability to measure gait, inactivity, isolation and falls has been demonstrated in laboratory conditions using healthy subjects [34, 36, 43, 44]. It has been found to have acceptably high accuracy when identifying key elements of healthy subject movement. Very few studies have involved clinical subjects or home environments [44].

One study [34] used Kinect as a fall detection system with 16 older adults in an independent living facility. Over one year, six out of seven standing falls could be detected at a false alarm rate of one per week, but two sitting falls were highly occluded and not detectable. The TigerPlace project, in a real-world test setting, compared the Kinect to an optical system and a pulse Doppler radar; post-analysis of 203 paired observations of the six standardized Fall Risk Assessments (FRAs) from the same day for 18 subjects found that most Kinect measurements were significantly correlated with all of the FRAs (p < 0.01), indicating the Kinect to be the most robust of the three [45]. These early results demonstrate the utility of technology-supported independence and the great potential of such systems for remote monitoring in a residential setting. However, a major obstacle to applying this technology more widely is the lack of evidence on performance at scale in real-world settings.

Beside its application as a monitoring device, Kinect has also been explored as a means of providing meaningful engagement in group activities for people with dementia [78]. A literature review [79] concluded that motion-based technology (including Kinect) has the potential to improve the lives of people living with dementia or Mild Cognitive Impairment (MCI). In this regard, guidelines that would help in designing motion-based software to meet the needs of people with dementia or MCI have been considered in [80].

2.3 ICT for MCI

Although cognitive-related conditions are not uncommon in the elderly population, which is at higher risk of cognitive decline due to age, the literature shows that far fewer researchers have been working on innovative technologies that could alleviate the impending cognitive challenges of an ageing population. In terms of using new technologies for nonpharmacological interventions, Liappas and Cabrera-Umpierrez [15] recently proposed treating cognitive deficits by the use of virtual reality (VR), while a novel virtual reality application was developed in [81] for screening of early dementia. In addition, a commercial-grade eye-tracking camera and a laptop-embedded camera were used for a 30-min visual paired comparison (VPC) recognition memory task, whose results were found to correlate accurately with existing cognitive composites [82]. Another method for detecting early decline in cognitive function was suggested in [83], based on continuously tracking basic aspects of internet search terms using online software. Alberdi et al. [37] assessed the possibility of detecting changes in psychological, cognitive and behavioural symptoms of Alzheimer's Disease (AD) by applying machine learning techniques to smart home behaviour data collected unobtrusively by passive infrared (PIR) motion sensors.

Computer Interactive Reminiscence and Conversation Aid (CIRCA) has been adopted and developed for cognitive stimulation therapy (CST) by various authors [16,17,18]. NANA (Novel Assessment of Nutrition and Ageing) was developed as a means of computerized self-administered measurement of mood and appetite [19,20,21,22] and was recently used for predicting depression and anxiety in older adults [21, 22]. Subramaniam and Woods [23] presented original work on ICT reminiscence systems for dementia with a therapeutic view. The Virtual Cognitive Health (VC Health) study [84] claims that internet-delivered lifestyle interventions can act as a remote intervention for the prevention or delay of Alzheimer's disease.

Other technological paradigms such as smart cities (SC) and the Internet of Things (IoT) have also been explored in various European projects over the last 10 years [10, 24, 25]. While these have focused on elderly people, on the whole they target people who do not exhibit cognitive deterioration or deficits. Indeed, this gap is not limited to technological research. A scoping review conducted by Fang et al. [26] explored the conceptual development of MCI to identify the resulting ethical, political and technological implications for the care of older adults with MCI. It states that “the development of effective interventions for MCI in older adults has been limited by extensive variability in the conceptualization and definition of MCI, its subtypes and relevant diagnostic criteria within the neurocultural, pharmaceutical and gerontological communities”. Among many findings, the authors concluded that technological interventions were scarce and highlighted “significant opportunities for technological interventions to effectively reposition MCI in the ageing care discourse”. In a study of an eHealth intervention in the adult Dutch population with early-stage cognitive decline, widespread, cost-effective and self-motivated prevention effects were anticipated [85]. Besides this, Panou et al. [27] anticipated a prevalent increase in cognitive impairment in future elderly populations as a result of the recent economic crisis and hardship, based on a separate survey [28] whose authors concluded that people who lived through economic recessions in early to mid-life might be at higher risk of cognitive decline after the age of 50. These findings indicate a vacuum in the application of new technology for people with cognitive-related conditions.

3 Method

3.1 Objectives

The main objective of this work is to extend the current capabilities of the decision support tool so that data acquired from various sources can be analysed quickly and accurately. This was achieved by developing intelligent algorithms with the level of sophistication required to present key indicators of health events and states related to MCI and to deal with big data issues. Having developed an appropriately powerful decision support tool, we are then able to evaluate the effectiveness of using Kinect and other sensors for remote cognition and frailty monitoring of elderly people in their homes.

In the long term, once the effectiveness of such a system is proven, we expect it to assist clinical teams with the assessment of daily living activities in the periods between hospital and clinic visits [32], based on the analysis of the historical data collected with Kinect and other sensors. It is also anticipated that the proposed system will reduce the time and cost spent on in-clinic examinations, as clinicians will use objective measures instead of semi-structured interviews aimed at eliciting an accurate history.

3.2 Setting and system infrastructure

Here we present one aspect of the development of the system infrastructure, as part of the MiiHome project, in achieving the above objectives. The infrastructure includes “The Living Laboratory”, a living space within the university campus laboratories that is fully furnished and has a fully functional kitchen supplied by Salix Homes. The suite is heavily sensorized, with monitoring at every level, including motion sensors, door sensors, appliance monitors, water flow meters and Internet of Things appliances such as fridges and dishwashers. It is used as a test bed to develop software, networks and machine learning. Older people, including those living with dementia, and their carers visited the laboratory to begin co-creation of the work programme in August 2015. There is also a heavily sensorized one-bedroom home (Smart Home 1), situated within the community directly adjacent to the hospital campus, that has been provided by Salix Homes. Here participants can live in the short term to engage in experiments or evaluate systems in terms of functionality.

Furthermore, we have configured the system so that the Kinect sensors, the minicomputers that power them and the internet connectivity in each residence communicate with the Living Lab, see Fig. 2. The system is provided turn-key and is installed by Salix Homes working with our technicians. The participant living in the home does not need any experience with computers, the internet or mobile computing devices of any sort. In this way, we can offer the system to any participant regardless of educational attainment or socioeconomic standing. Figure 2 illustrates the information flow of the MiiHome project. We have created two smart homes for pilot testing, one of which is adjacent to the hospital campus and used for temporary living (testing or clinical evaluation).

Fig. 2 MiiHome configuration and its information flow

The two smart homes deploy Kinect and a number of sensors (motion, magnetic, temperature, tilt, air quality, touch, sound, moisture, water flow). MiiHome is a large-scale project deployed in up to 200 participants'/patients' homes throughout Salford, using a more limited collection of sensor arrays (Fig. 3). Some homes deploy care-on-call (a health service call centre that provides emergency assistance, currently based on pendant alarms and eventually planned to be triggered by MiiHome).

Fig. 3 Some of the sensors used in the MiiHome project: a water flow, b touch (on the base of a kettle), c magnetic switch (on the top of a microwave door) and d smart carpet sensors

3.3 Participants

Out of the 200 Salix Homes residences in which Kinects and various sensors with minicomputers and internet connectivity were installed, 12 pensioners aged 65 or over were selected after interview as participants for continuous remote monitoring. All participants live alone and suffer from various age-related cognitive impairments and frailty. However, apart from an indication of mild severity, no further diagnostic details were provided to the researchers by the clinicians.

To track participants' behaviour, a vision system (i.e. Kinect) was implemented in such a way that it does not act as a surveillance system, in order to preserve participants' privacy. This system requires no wearable device and provides depth information without interrupting participants' activities of daily living (ADL).

3.4 Methodology and design

The use of Kinect for extracting postural data, and the use of machine learning and big data techniques for detecting and recognizing various routine human behavioural patterns at home and their attributes, are presented in the following.

3.4.1 Using MS Kinect and practical constraints

For the first time, the use of Kinect for human behaviour detection and recognition for elderly adults with MCI in a natural setting in their own homes is presented here. Kinect is a single unit with a built-in depth sensor plus a camera, and is fairly inexpensive. It captures the movements of 25 body joint positions and angles in an anonymized fashion [42]. Machine learning and big data techniques are used to capture the posture obtained from the 25 body joint positions and angles and then recognize various kinds of human behaviour, such as eating, walking, phoning, furniture crawling and other daily activities, together with their related parameters (e.g. speed).

Data from the human skeleton are extracted, and each of the 25 body joints is represented by 3-dimensional coordinates (x, y, z) obtained from the depth image. In total, the 75 joint coordinates correspond to a specific posture and can therefore identify a basic human activity. For example, by visualizing these 3-dimensional joint coordinates, it can easily be seen whether the resident is “Standing” or “Sitting”.
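
As an illustration of how such a frame is consumed downstream, the sketch below flattens one skeleton frame into the 75-dimensional feature vector used as classifier input, and shows a deliberately crude height-based check of the kind a human performs visually (the joint indices and threshold are hypothetical placeholders, not the study's trained classifier):

```python
import numpy as np

N_JOINTS = 25  # joints tracked per Kinect skeleton

def frame_to_features(joints_xyz):
    """Flatten one skeleton frame (25 joints x 3 coords) into the
    75-dimensional vector fed to the classifiers."""
    joints = np.asarray(joints_xyz, dtype=float)
    assert joints.shape == (N_JOINTS, 3)
    return joints.reshape(-1)  # [x1, y1, z1, x2, y2, z2, ...]

def crude_posture_check(joints_xyz, head_idx=3, hip_idx=0, thresh=0.6):
    """Toy heuristic only: a large head-to-hip height difference
    suggests 'Standing' rather than 'Sitting'."""
    head_y, hip_y = joints_xyz[head_idx][1], joints_xyz[hip_idx][1]
    return "Standing" if (head_y - hip_y) > thresh else "Sitting"
```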

3.4.1.1 Environmental constraints

Figure 4 illustrates three of the participants' homes and their home settings. The project has so far installed Kinect into the homes of 12 participants who live freely in the community. This differs from some other studies that took place in retirement facilities (e.g. TigerPlace). We found that installation had to be a compromise between the optimal positions for detecting activity and the wishes of the participants. The Kinect device shown in Fig. 4b and e had to be placed where it could be obscured by a door and faced a window. This caused numerous artefacts in the data, but it illustrates the difficulty of real-world working. Unfortunately, it was not economical to install more Kinect devices in the living rooms, or extra devices in places such as the staircase, hallway or kitchen, where occupants spend much of their daily activity and where falls are more likely. In situations when occupants are not in front of the Kinect device, signals drawn from other sensors in the dwelling, such as water flow, touch and magnetic switches, have been interrogated to provide further information about specific activities or behaviours. For example, a signal from a touch sensor in the base of a kettle, followed by water flow in the kitchen and then the kettle-base touch sensor again, may lead us to infer a kettle-filling activity followed by switching the kettle on to boil the water. Please note that in Kinecting cognition we opted to detect broad health indicators that were potentially accessible to direct measurement, rather than trying to de-convolute information into activities of daily living as has been done in many smart home installations and systems [12, 13]. As a result, we used digital signals from other sensors to supplement Kinect and as a measure of presence/absence of the occupier (for example, sleeping in bed at night) or (in)activity.
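
As a sketch of this kind of sensor-fusion inference (the sensor names, time window and event format are all hypothetical), the kettle-filling sequence described above could be detected roughly as follows:

```python
from datetime import timedelta

# The pattern described in the text: kettle-base touch -> kitchen water
# flow -> kettle-base touch again, within a short window.
KETTLE_PATTERN = ["kettle_touch", "kitchen_water_flow", "kettle_touch"]

def detect_kettle_fill(events, window=timedelta(minutes=3)):
    """events: time-ordered list of (timestamp, sensor_id) tuples.
    Returns True if the kettle-fill pattern occurs within `window`."""
    for i, (t0, s0) in enumerate(events):
        if s0 != KETTLE_PATTERN[0]:
            continue
        needed = list(KETTLE_PATTERN[1:])
        for t, s in events[i + 1:]:
            if t - t0 > window:
                break
            if s == needed[0]:
                needed.pop(0)
                if not needed:
                    return True  # full sequence seen in time
    return False
```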

Fig. 4 Typical participants' home settings

3.4.1.2 Geometrical constraints

As mentioned earlier, Kinect provides 3-dimensional coordinates (x, y, z) for each joint. It has a limited field of view [70, 71], as shown in Fig. 5; however, this does not in itself introduce errors caused by geometrical distortion. An important source of error arises because it is nearly impossible to mount the Kinect perfectly horizontally. As a result, there is an error in the measured height, or “y” coordinate, that increases with the angle θ that the Kinect makes with the horizontal. After some geometrical manipulation, the corrected height y_r can be obtained as

$$y_{r} = \frac{z_{m} \sin(\theta)\cos(\theta) - y_{m}}{\cos(\theta)}$$

where y_m and z_m are the y and z coordinates measured by Kinect. This correction is vital, especially for fall detection, and can be applied to the data used for machine learning during training.
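
A direct transcription of this correction into code (a minimal sketch; in practice the tilt angle θ would be estimated for each installation) could read:

```python
import numpy as np

def correct_height(y_m, z_m, theta_rad):
    """Tilt correction from the equation above: y_m and z_m are the y and z
    coordinates reported by Kinect (metres), theta_rad is the angle the
    Kinect makes with the horizontal (radians)."""
    return (z_m * np.sin(theta_rad) * np.cos(theta_rad) - y_m) / np.cos(theta_rad)

# Example: a hip joint measured at y_m = -0.3 m, z_m = 3.0 m, 10-degree tilt
print(correct_height(-0.3, 3.0, np.radians(10)))
```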

Fig. 5 Kinect's limited a vertical and b horizontal field of view in the default range, where the depth is approximately 4 m

3.4.1.3 Big data constraints

Kinect is a low-price, low-frame-rate camera producing 30 frames per second. Figure 6 shows the hourly rate of data generation for one of the 12 computers installed in a Salix home. This implies that the hourly storage capacity of the system must be of the order of gigabytes. Since most machine learning algorithms have computational complexity of order N², a working memory of the order of 10^18 bytes would be required.

Fig. 6 Data generation rate in bytes with respect to time

If we consider continuous operation 24 h a day, 7 days a week, the system must be able to store terabytes of data per month, with a working memory of the order of 10^24 bytes. This is impractical and imposes a limitation on the system, requiring big data techniques and appropriate analytics to deal with it properly. We therefore employed techniques that process the data stream on the fly, instead of analysing batches of data all at once, to deal with big data and the curse of dimensionality.
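
One common ingredient of such on-the-fly processing, shown here purely as an illustration rather than as the project's exact pipeline, is to fold each incoming frame into a running summary so that raw frames need not be stored (Welford's incremental mean/variance uses O(1) memory per feature):

```python
class StreamingStats:
    """Welford's incremental mean/variance over a data stream."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

# Each incoming Kinect frame is reduced to a scalar of interest (here a
# hypothetical hip height) and folded into the running summary.
stats = StreamingStats()
for hip_y in (0.92, 0.95, 0.91, 0.30):
    stats.update(hip_y)
print(stats.mean, stats.variance)
```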

3.4.2 Learning for human behaviour recognition

Recent rapid advances in AI and machine learning have created better opportunities for dealing with many difficult nonlinear, stochastic and time-varying applications and the associated curse of dimensionality, especially in pattern recognition, where they excel. We have therefore used several machine-learning algorithms independently in order to recognize and classify human activities and postures autonomously in real time. A pictorial description of the proposed human behaviour recognition system using Kinect is shown in Fig. 7. The three independent machine learning techniques used for this purpose are discussed next.

Fig. 7 Human behaviour recognition using Kinect

3.4.2.1 Applied learning algorithms

In the first instance, a multilayer perceptron artificial neural network (ANN) was utilized and implemented for learning the above data sets. Every ANN has an input layer, an output layer and some optional layers between them, called hidden layers. In each layer there are several processing elements, and each processing element has a single output connection that branches into as many collateral connections as desired to the next layer [86]. For demonstration, Fig. 8 shows a feed-forward ANN architecture with just one hidden layer, 3 inputs, 2 outputs and processing elements f_1 and f_2. The actual number of inputs for the proposed ANN was based on the Kinect-produced data signals. As already mentioned, each Kinect data entry comprises the coordinates of the 25 joints extracted from the skeleton. Since each joint has three coordinates (x, y, z), the size of each data entry is 75, which is the number of inputs to the network and the number of neurones in the input layer. The number of network outputs depends on the number of postures to be classified, and each output can be interpreted as the probability associated with its related posture. As described in the following sections, five different postures were selected, i.e. the number of ANN outputs was set to five, and only one hidden layer was used. The connections among the processing elements are appropriately weighted to obtain a desired output from the ANN. Each processing element may also have an additional bias term or offset. The outputs of the ANN depend on its weights and biases, and the determination of these weights and biases is what enables the ANN to learn the proposed human behaviours.
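
A minimal sketch of such a network's forward pass, with 75 inputs and 5 sigmoid outputs as above (the hidden-layer size and the random weights are placeholders that training would determine), is:

```python
import numpy as np

rng = np.random.default_rng(0)
N_IN, N_HIDDEN, N_OUT = 75, 30, 5  # hidden size is an assumption

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weights and biases, to be determined by training
W1 = rng.normal(0, 0.1, (N_HIDDEN, N_IN)); b1 = np.zeros(N_HIDDEN)
W2 = rng.normal(0, 0.1, (N_OUT, N_HIDDEN)); b2 = np.zeros(N_OUT)

def forward(x):
    """One hidden layer; the five sigmoid outputs are read as per-posture
    probabilities."""
    h = sigmoid(W1 @ x + b1)
    return sigmoid(W2 @ h + b2)

x = rng.random(N_IN)          # one 75-dimensional skeleton frame
posture_scores = forward(x)   # five scores in (0, 1)
```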

Fig. 8 A feed-forward ANN with one hidden layer

Since the ANN was not able to incorporate other sources of information, including human knowledge, into the network, another paradigm, the Adaptive Neuro-Fuzzy Inference System (ANFIS), was developed. A typical ANFIS with two inputs and one output is shown in Fig. 9 for demonstration. Fuzzy sets A_i and B_i (i = 1, 2, … n, where n is the number of linguistic variables) encapsulate human knowledge as linguistic variables over the joint coordinates, Π represents the relevant fuzzy operations, and the processing element in the consequent part of each linguistic rule is indicated by f_j. The actual number of inputs for the proposed ANFIS was 75, with five outputs as described above. The outputs of ANFIS depend on its weights, linguistic variables and rules. The ANFIS was first initialized with some intuitive knowledge (or rules) about various human postures and the relationships between their positions and angles. It was then optimized using the above subsets of data through a training process.
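
The paper does not reproduce the ANFIS equations in full; as a rough, down-scaled sketch of how a first-order Sugeno-type fuzzy inference of this kind computes its output (two inputs and two rules for readability, with all membership centres, widths and consequent coefficients as placeholders that training would tune):

```python
import numpy as np

def gauss(x, c, s):
    """Gaussian membership function."""
    return np.exp(-0.5 * ((x - c) / s) ** 2)

centres = np.array([[0.2, 0.8],    # input 1's centres for rules 1 and 2
                    [0.3, 0.7]])   # input 2's centres for rules 1 and 2
widths = np.full((2, 2), 0.25)
coef = np.array([[0.5, -0.2, 0.1],   # rule 1 consequent: p*x1 + q*x2 + r
                 [-0.3, 0.6, 0.0]])  # rule 2 consequent

def anfis_forward(x1, x2):
    # Layers 1-2: memberships and rule firing strengths (product T-norm)
    w = np.array([gauss(x1, centres[0, j], widths[0, j]) *
                  gauss(x2, centres[1, j], widths[1, j]) for j in range(2)])
    wn = w / w.sum()                    # Layer 3: normalized firing strengths
    f = coef @ np.array([x1, x2, 1.0])  # Layer 4: linear consequents
    return float(wn @ f)                # Layer 5: weighted sum

print(anfis_forward(0.25, 0.75))
```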

Fig. 9 An ANFIS block diagram

Linguistic variables enable ANFIS to infuse human knowledge into the network structure and at the same time offer a simple, crisp structure with a specific, limited number of layers (Fig. 9). However, this prevents the ANFIS structure from having more layers and therefore the sophistication required for absorbing the complexities present in complicated dynamical systems and their associated data. Finally, the architecture of the above paradigms was enhanced and their training further optimized by using a deep learning algorithm to obtain the most robust generalization of the training sets for human behaviour classification and recognition. “Deep” learning refers to the number of layers in the network: the more layers, the deeper the network. Deep learning is usually implemented using an ANN architecture, and while traditional ANNs contain only 2 or 3 layers, deep ANNs can have hundreds. With the same number of parameters for approximating an input (X) to output (Y) mapping, an ANN with only two layers can be expressed mathematically as Y = f_2(f_1(X)), while for a deep ANN this can be written as

$$Y = f_{\Lambda}(\ldots f_{2}(f_{1}(X)) \ldots),$$

where Λ refers to the output layer. Each layer can perform a different mapping, so more hidden layers give ANN paradigms, as universal approximators, more flexibility for approximating general nonlinear mappings to any desired degree of accuracy. Figure 10 shows a deep ANN architecture counterpart to Fig. 8, with the same number of inputs and outputs.

Fig. 10 A deep ANN with many hidden layers

The process of finding the network parameters that produce the best match between the desired map and the output of the network is called the training process. During training, all weights and biases are modified so as to minimize an error function, formalized as the normalized sum of squared errors ε(i) = y(i) − a^Λ(i) between the outputs of the network (a^Λ, with s^Λ neurones at output layer Λ) and a training output set:

$$\min_{W \in \mathbb{R}^{Q}} J = \frac{1}{2k}\left\| Y - A^{\Lambda} \right\|_{2}^{2} = \frac{1}{2k}\sum_{i=1}^{k} \sum_{j=1}^{s^{\Lambda}} \left[ y_{j}(i) - a_{j}^{\Lambda}(i) \right]^{2}$$

where A^Λ denotes the k consecutive outputs of the network and Q is the number of parameters in the network. A training set is a set of k input/output pairs of the desired mapping, which the network is supposed to learn. The error function is a multivariate function of the network parameters, so the training process is essentially an unconstrained nonlinear optimization problem. A quadratic objective function is the simplest nonlinear function, and any optimization technique that successfully minimizes a general nonlinear function can be expected to work well on a quadratic one. Moreover, any nonlinear function behaves approximately quadratically near its optimum. Optimization theory offers several well-known methods for searching for the optimal value of the objective function, which can be applied to train neural networks.
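
As a minimal sketch of such a training process, the loop below minimizes the quadratic cost above by plain first-order gradient descent (back-propagation) for a one-hidden-layer sigmoid network; all sizes, data and the learning rate are placeholders, and, as discussed in Sect. 4.1, no second-order (Hessian-based) information is used:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy training set: k frames of 75 features and 5 posture targets
k, n_in, n_h, n_out = 64, 75, 30, 5
X = rng.random((k, n_in)); Y = rng.random((k, n_out))

W1 = rng.normal(0, 0.1, (n_in, n_h)); b1 = np.zeros(n_h)
W2 = rng.normal(0, 0.1, (n_h, n_out)); b2 = np.zeros(n_out)

lr = 0.5
for epoch in range(1000):
    # Forward pass
    H = sigmoid(X @ W1 + b1)        # hidden activations
    A = sigmoid(H @ W2 + b2)        # network outputs a^Lambda
    err = A - Y
    J = (err ** 2).sum() / (2 * k)  # the quadratic cost J above
    if epoch % 200 == 0:
        print(epoch, J)             # monitor convergence

    # Backward pass: first-order gradients only
    dA = err * A * (1 - A) / k
    dH = (dA @ W2.T) * H * (1 - H)
    W2 -= lr * H.T @ dA; b2 -= lr * dA.sum(axis=0)
    W1 -= lr * X.T @ dH; b1 -= lr * dH.sum(axis=0)
```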

After learning the residents' postures and basic activities, the networks can be exploited to recognize ADL as frequent sequences of the resident's postures through time (transactions). Each input transaction is weighted and compared with older input transactions; if a transaction appears frequently, it is considered significant and is represented as a frequent behaviour. If the resident changes behaviour, a new transaction and hence a new behaviour may emerge. The human behaviour recognition system can work autonomously in real time for remote monitoring of a resident's daily activities, where the percentage change in behaviour can be obtained on an hourly, daily, weekly and monthly basis.
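
Using the transaction size given later in Sect. 4.2.2 (four consecutive 30-second activities per 2-minute transaction), grouping the classified postures into transactions and counting recurrences might be sketched as follows (the posture stream shown is invented):

```python
from collections import Counter, deque

WINDOW = 4  # four consecutive 30-s activities = one 2-min transaction

def stream_transactions(postures):
    """Group a stream of recognized postures into fixed-size transactions."""
    buf = deque(maxlen=WINDOW)
    for p in postures:
        buf.append(p)
        if len(buf) == WINDOW:
            yield tuple(buf)
            buf.clear()

counts = Counter(stream_transactions(
    ["Sitting", "Sitting", "Standing", "Walking",
     "Walking", "Walking", "Walking", "Walking"]))
# Frequently recurring transactions are candidates for frequent behaviours
print(counts.most_common(1))
```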

3.4.2.2 Determining training sets for basic human behaviour

Machine learning algorithms normally rely on good examples of these activities in training sets for learning and identification. In order to provide such examples from the data obtained by Kinect, a visualization software tool called Master Active Gestures ID (MAGID) was developed using a C compiler (LabWindows CVI from National Instruments). This visualization tool helped to identify and separate the subsets of data related to each activity for training, validation and verification (testing). A sample output of this tool is presented in Fig. 11, showing three perspectives of a standing skeleton in three perpendicular planes.

Fig. 11 User interface of the developed software MAGID

3.4.2.3 Basic behaviour and attributes identified

It is well established that several simple tests of physical performance are strongly associated with the onset of long-term functional decline and disability [47]. Ample evidence supports the potential use of physical performance measures in risk assessment strategies to identify subgroups of older persons, initially independent in all ADLs, who are at increased risk of decline into disability or even death [47]. We sought to investigate whether real-time measurement of such physical performance could also reflect acute decline and act as a signal triggering proactive interventions to prevent or mitigate such decline. The following behaviours and attributes were identified for monitoring using the Kinect. Items were selected because they fulfilled the following criteria: simple to implement, acceptable load on computing power, clinically relevant to a wide range of situations and the potential to give acceptable predictive values in assessing the risk of adverse outcomes.

Dietary intake and food consumption behaviour:

It has been recognized that elderly people's appetite, in addition to affecting dietary intake [48], can also be related to depression [49] and cognitive decline [50]; it is therefore a useful addition to computerized systems for collecting longitudinal assessments of health and behaviour [19]. Mood, which is also an important factor in quality of life [51] and well-being [52], can likewise be related to dietary intake [48]. Changes in appetite caused by depressed mood have been associated with reductions in food consumption and body weight in later life [19], a phenomenon known as the anorexia of ageing [19, 53]. Measuring dietary intake and food consumption can be based on attributes such as the speed of food consumption and the duration of dietary intake. Using the developed visualization software (MAGID), appropriate sets of Kinect data were identified and used as the training set for this behaviour.

Walking behaviour:

Gait or walking speed is a common clinical measure and has been described as the sixth vital sign [54]. Timed walking tests are an important measure in comprehensive geriatric assessment [55]. Changes in walking speed mark a critical point in personal performance, and the assessment of gait speed has the potential to serve as a key indicator in mapping the trajectory of health and function in ageing and disease [56]. Systematic reviews have shown that it reliably predicts disability, cognitive impairment, institutionalization, falls, hospital admission and mortality [57, 58]. Furthermore, a declining trajectory of gait speed is also associated with adverse events such as mortality [59]. Furniture crawling (cruising in North America) is a classic adaptive response to more severe levels of gait disorder [60]. In response to severe postural or gait instability, patients hold on to furniture, walls and door handles in order to stabilize themselves while moving around the house. Acutely, it is a marker of severe loss of stability; chronically, it is an adaptive response to neurological disorders such as ataxia [60]. We considered this approach after advice from clinical colleagues, because the arms are used for support in furniture crawling, which may be easy to detect using Kinect in a real home setting. The MAGID visualization tool proved quite helpful in identifying training datasets from Kinect for this behaviour.

Sit-to-stand (SiSt) behaviour:

Rising to stand from a sitting position is one of the most common activities performed in a home every day. Sit to stand and stand to sit are two of the most mechanically demanding activities undertaken in daily life, so the ability to perform a sit to stand (SiSt) is an important skill. In the case of elderly people without disability, the inability to perform this basic skill can lead to institutionalization, impaired functioning in activities of daily living (ADL) and impaired mobility. Objective measures of lower extremity function such as SiSt are highly predictive of subsequent disability [61, 62]. Ageing causes a loss of skeletal muscle mass and quality, which leads to a deficit in muscle function measured as loss of muscle strength or muscle power [63]. SiSt depends on the quadriceps femoris and trunk musculature. Muscle strength, the ability to generate force, and muscle power, the ability to generate this force rapidly, are both associated with falling [64, 65]. There is also evidence for the contribution of both muscle strength and power to the ability of older people to maintain balance and posture [63]; balance and posture are themselves key components of SiSt [66] alongside muscle strength and power. We evaluated SiSt using Kinect, taking into account biomechanical considerations [67] but also being mindful that we performed assessments in a real home where variables such as the height of the chair could not be controlled. Our approach was to use the results as part of a composite marker of general health, and not as a marker of falls risk as has been done by others [68, 69].

In order to capture this behaviour for machine learning, two postures, i.e. sitting and standing, must be identified. The developed visualization tool, MAGID, was used for identifying these postures, in addition to lying on the ground, so that appropriate datasets for each posture could be used as training sets.

3.5 Implementation platform

For this work, we focus on developing and enhancing the functionality of the decision support tool. Implementation and effective usage of this tool within the system infrastructure of the MiiHome project requires proper configuration and synergy with the rest of the components in the infrastructure. MiiHome algorithms perform a range of clinical and well-being assessments. For this purpose, a three-tier conceptual platform, which maps onto the infrastructure, was employed. In the lowest tier, patient/occupier data are collected using an array of sensor devices and processed using software modules. Mostly passive sensing devices, such as temperature, tilt, air-quality, touch, sound, moisture and water flow sensors as well as the Microsoft Kinect devices, are preferred within the MiiHome project, as opposed to wearable sensors. The features detected by the system range up to composite signals drawn from the electricity and water supply to the dwelling, which can be interrogated to give information about specific behaviours such as filling a kettle with water and switching it on to boil the water (for example, see [11]). In the top tier, the methodology is to analyse participants' data retrospectively and compare these quantitative data against data collected during an interview. The key performance metrics are fall detection, activity monitoring (e.g. duration of eating and sitting per hour or day), walking speed, posture and so on. Features include posture and balance, gait and behavioural pattern analysis. Quantitative measures assess the progression of specific disorders, assist clinical decisions and offer personalized predictive analysis. Raw and processed patient data are used to model the trajectory of health status. Data are hosted on University of Salford servers but will be moved to NHS systems as a plug-in to the electronic patient record (EPR). The patient data are fused and aggregated with the EPR; clinician input and clinical decision support systems then update the individual's care plan.

4 Results and discussion

4.1 Dealing with practical constraints

Three major constraints, i.e. environmental, geometrical and big data constraints, were identified during the course of this study. While resolving and handling the geometrical constraints was uncomplicated, dealing with the environmental constraints required us to gather data concurrently from every available source that could shed light on the participants' activities. Success in this objective will come not only from maturing the technical and logistical aspects but also from resolving attitudinal and sociological issues. As part of the overall project, we are therefore working closely with these and other potential participants in focus groups to co-create a solution that is mutually beneficial.

The constraints imposed by big data issues enforced many limitations, especially on extending the decision-making system and machine learning algorithms freely. As a result, any technique that uses batch training or has O(N²) complexity, such as Hessian-based and Newton methods and Levenberg–Marquardt, was avoided; only the back-propagation learning algorithm based on gradient descent, which requires evaluating only the first-order derivative of the objective function, was used, in order to reduce the required memory storage and the curse of dimensionality. This resulted in a longer training duration; however, once the network is trained, it can process the data stream on the fly quickly, within the required real-time specification.

4.2 Human behaviour recognition using MS Kinect and deep learning

4.2.1 Basic human activities

As mentioned previously, a software tool called MAGID was developed in order to provide good examples of basic human activities from the data obtained by Kinect as training sets for learning; this tool can therefore also be used for visual validation. Using it, large subsets of Kinect data containing over 80,000 data entries related to each activity were identified for training, validation and verification (testing) during the learning process. Each data entry comprises the coordinates of the 25 joints extracted from the skeleton by Kinect. Therefore, all three paradigms mentioned in Sect. 3.4.2.1 had 75 inputs and the same number of neurones in the input layer. The number of network outputs depends on the number of postures to be classified, i.e. “Standing”, “Sitting”, “Eating”, “Walking” and “Lying on the ground”, so that each output can be interpreted as the probability associated with its related posture. To achieve this, the processing elements of the output layers of both the ANN and the deep learning ANN were chosen to be Sigmoid functions. For ANFIS, however, the output layer's processing elements are linear functions, and this constraint cannot be hardwired into the network: depending on the quality of ANFIS training, it is expected that the network learns an appropriate output mapping so that its outputs range between 0 and 1, neither higher nor lower. The Softmax activation function was chosen for the hidden layers of the deep learning network, while only one hidden layer with a Sigmoid function was used for the feed-forward ANN. Many technical details are omitted from this section in order to focus on the application domain.

Both ANFIS and the one-layer feed-forward ANN produced training errors of the order of 10^-6, but nevertheless failed during verification tests. Of the three paradigms, the best validation and verification results after training were obtained for the network trained by deep learning, using the validation and verification subsets of Kinect data separated by the MAGID software. After training with over 1000 iterations, the confusion matrix for the test data showed that the overall system achieved an accuracy rate of 99.3%. The static network model has been implemented so that it can analyse a stream of big data on the fly in real time. The average accuracy of the network during verification was 98.17%, which is discussed in Sect. 5, “Validations and discussion”.

4.2.2 Determining the frequent behaviours

The trained networks presented in Sect. 4.2.1 can be used on the fly and in real time to process the data stream from Kinect and to model basic human activities. Their results can then be passed directly to various data mining algorithms for further analysis. For example, a simple algorithm can detect whether two consecutive postures are sitting and standing, and then calculate the attributes related to SiSt behaviour as discussed earlier. Similarly, the frequency and duration of dietary intake have been calculated. The output of the network can also feed into estDec [46], a data mining algorithm for determining frequent transactions. The SPMF [72, 73] Java platform was used for the implementation of estDec, which can mine a big data stream on the fly. The data stream is composed of one-hour blocks, in which the network detects and recognizes one activity from the Kinect data every 30 s. Each transaction is composed of four consecutive activities and is sent to the data mining algorithm every 2 min, i.e. 30 transactions are processed per hour. Experimental data obtained in the Living Lab show that using decay-rate values (the rate of reduction in the weight of a chunk of information) close to one keeps transaction occurrences in the model for longer. A possible alternative is statistical analysis of a specific basic behaviour recognized by the network through time, using it to model the time trajectory of health status.
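
The estDec algorithm itself is more involved; a much-simplified, decay-weighted frequency counter in its spirit (the decay factor d plays the role of the decay rate discussed above, with values close to one retaining old transactions longer) could look like this:

```python
def decayed_counts(transactions, d=0.999):
    """Weight of every previously seen transaction decays by d per step,
    so recent behaviour dominates the counts; d close to 1 keeps older
    transaction occurrences alive for longer."""
    weights = {}
    for t in transactions:
        for key in weights:
            weights[key] *= d
        weights[t] = weights.get(t, 0.0) + 1.0
    return weights

# Patterns whose decayed weight exceeds a support threshold would be
# reported as frequent behaviours.
```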

4.2.3 Fall detection

Many research developments have demonstrated the capacity of Kinect to identify falls in laboratory conditions for healthy subjects. As far as the authors are aware, these have not involved clinical subjects in home environments. Here, falls are identified for elderly people with MCI by analysing the coordinates, speed and acceleration on the y-axis of the captured skeleton positions, measured directly in their homes. For demonstration, the distance travelled by the HIP_CENTRE joint, its speed and its acceleration were calculated for one participant on 17 November 2017. The x and y coordinates measured directly by Kinect are shown in Fig. 12a, where the y coordinates refer to height. At first glance, it can be seen that there are more negative and large height coordinates than positive ones, which makes it impossible to reach any sensible conclusion regarding falls, as it suggests the skeleton must be under the floor most of the time. Using these data directly can be misleading and results in false positives from both human intuition and AI or machine learning techniques. However, using the geometrical transformation described in Sect. 3.4.1.2, new height coordinates were estimated, allowing better intuitive analysis by a human prior to using any autonomous decision-making system.
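
A minimal sketch of this kinematic screening follows; the thresholds are illustrative rather than the study's tuned values, and in practice the derivatives would be smoothed first, since, as noted in Sect. 5, direct differentiation amplifies noise:

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def flag_fall_candidates(t, y, speed_thresh=1.5, acc_frac=0.5):
    """Flag frames where the (tilt-corrected) hip height y drops with a
    fast downward speed, or where the vertical acceleration magnitude
    approaches free fall."""
    vy = np.gradient(y, t)   # vertical speed (m/s)
    ay = np.gradient(vy, t)  # vertical acceleration (m/s^2)
    return (vy < -speed_thresh) | (np.abs(ay) > acc_frac * G)

t = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
y = np.array([0.90, 0.88, 0.55, 0.10, 0.08])  # hypothetical hip heights (m)
print(flag_fall_candidates(t, y))
```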

Fig. 12 a Skeleton hip's x (horizontal) and y (height) coordinates sensed by Kinect directly and b the transformed skeleton hip's x and y coordinates of Fig. 12a

The results of the geometrical transformation, i.e. the estimated new y coordinates, are portrayed in Fig. 12b. Note that the x (horizontal) coordinates are unchanged from Fig. 12a. Two interesting outlier regions can quickly be identified at a glance in Fig. 12b; these two regions, which are circled, could plausibly suggest a fall. To investigate this, a histogram of the skeleton's x and z coordinates obtained directly by Kinect, with no geometrical transformation (as no y coordinates are involved), was produced and is presented in Fig. 13. This figure represents the floor plan of the residence's living room, where the Kinect was installed, and shows how frequently the resident occupied each specific location within the living room. This can provide useful information about the resident's habits when superimposed on the real residence floor plan.

Fig. 13 Histogram of skeleton hip's x and z coordinates, which mimics the living room floor plan

The far-right circle in Fig. 12b maps to the top circle in Fig. 13, which after some analysis we concluded to be the entrance/exit door to the living room, where the participant moves out of the Kinect's view. The high frequency associated with this location (shown in Fig. 13) confirms this. At this spot, Kinect collapses all 25 skeleton joint coordinates into that small region (the exit region) and reports it as a skeleton position, when there is actually no human there, in the room, or indeed anywhere in the Kinect field of view. In doing so, Kinect introduces a systematic error into the measurements. The far-right region in Fig. 12b can therefore be ignored as an outlier unrelated to falls.

The lower middle circle in Fig. 12b is more interesting, as it relates to the lower middle circle in Fig. 13, where the resident spent considerable time according to the histogram. To investigate this, the resident's trajectories (i.e. space–time history) at this location and point in time were created, after performing the geometrical transformation on height, and are shown in Fig. 14. The analysis of Fig. 14 suggests that the fall happened when the resident started standing up from a stationary sitting position (circles in Fig. 14). A high acceleration, comparable to the gravitational acceleration g, was obtained in this area of the trajectories, which suggests a near-miss fall followed by a rapid recovery (shown by the smaller circle) in less than 3 s. The noise and non-smooth vertical speed behaviour observed in Fig. 14 may call into question whether this fully validates the hypothesis suggested by the system. To analyse such events and validate the system fully, rendering a video clip from the upper skeleton positions has been suggested (Fig. 15) and was developed for the resident's initial movements, showing clear movement of the skeleton and its walking activity; however, it is yet to be developed for this specific event due to memory requirements. The content of a video rendered for a forward-moving skeleton captured by Kinect in laboratory settings was confirmed against the ART motion analysis system as a gold standard, which provides 60 frames per second using eight cameras. The developed MAGID software also provided visual validation for the video clip, which is beyond the scope of this section.

Fig. 14 Resident trajectories in terms of height (y coordinates) and speed vs time

Fig. 15 Validating the system for identified human behaviour by a video clip rendered from the upper skeleton's joints

4.2.4 Determining walking behaviour attribute, gait speed

The approach presented in Sect. 3.4.2.3 was used for gait speed. Figure 16 illustrates a histogram of all the subject speeds (m/s) collected from one participant's skeleton by Kinect from 17 November 2017 to 5 January 2018; the figure shows the histogram of median hourly subject speed. For comparison, the participant's walking speed was measured with a stopwatch at his house during an interview and noted as 0.162 m/s from the timed up and go test. The red arrow on the x-axis in Fig. 16 marks this manually measured walking speed. The gait speed within the room is distributed over a range of values lower than the speed measured in a straight line with a stopwatch; this reflects the shorter path lengths and the reduction in speed needed to negotiate furniture and entrances.

Fig. 16 Histogram of hourly subject speed (median) over 50 days

Analysis of the large amount of data accumulated from 17 November 2017 to 5 January 2018 shows a large number of low-speed occurrences (frequency ~ 500). This is due to the participant standing stationary, or to the acceleration and deceleration phases of the participant's walk. We observe two similar patterns in this data set: there are two peaks (~ 0.2 m/s and ~ 0.5 m/s), with a gradual decline in frequency between ~ 0.2 m/s and ~ 0.4 m/s and between ~ 0.5 m/s and ~ 0.8 m/s. We are examining whether the peaks represent changes in the participant's general health. The peak at ~ 0.5 m/s, for example, might occur at times when the participant is walking faster (when the participant's muscles are flexible enough, on a good day when he/she is not feeling any stiffness). The lower peak might occur when the participant's health is not as good (when he/she is suffering from back pain, or during mornings when the muscles are not yet flexible).
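
The in-home gait speed attribute can be sketched as frame-to-frame hip displacement over time, with the biologically implausible speeds discussed in Sect. 5 filtered out before taking the hourly median (a simplified illustration, not the exact production pipeline):

```python
import numpy as np

def hourly_median_speed(t, xyz, max_speed=5.0):
    """t: frame timestamps (s); xyz: per-frame hip coordinates, shape (n, 3).
    Frame-to-frame speeds above max_speed (m/s) are treated as artefacts
    and discarded before the median is taken."""
    dt = np.diff(t)
    dist = np.linalg.norm(np.diff(xyz, axis=0), axis=1)
    speed = dist / dt
    speed = speed[speed <= max_speed]  # drop implausible artefact speeds
    return np.median(speed) if speed.size else np.nan
```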

5 Validations and discussion

All participants in this study suffer from various age-related cognitive impairments and frailty. In order to understand exactly how monitoring and modelling certain behavioural aspects could be useful at each level of severity, more detailed information on the diagnosis and severity of illness for each participant would be needed. While this was not available for this study, it is apparent that the trend over time of quantitative measures of features such as posture, balance, gait and behavioural patterns can be used to assess the progression of specific disorders. This may then assist clinical decisions on the diagnosis and prognosis of cognitive impairment, as well as offering personalized predictive analysis. To demonstrate this, the percentages of active time (including walking and standing) and inactive time (sitting time, as no significant falling or lying time was observed) for one participant over 50 days, obtained by analysing behavioural patterns with deep learning, are shown in the upper and middle curves of Fig. 17, respectively. To explain the peaks and valleys in these two curves, the percentage of time for which deep learning did not recognize any behaviour is also plotted in Fig. 17 (the bottom curve); it shows two distinct peaks, correlating with the other graphs in the same figure. To examine this, together with the hypothesis on the number of skeletons (i.e. social isolation), the number of simultaneously detected skeletons is presented as a bar chart in Fig. 18. This chart explains the curves in Fig. 17 very well: for example, as the number of detected skeletons approaches two and increases further, the level of activity of each skeleton is expected to increase. This can be considered natural, as in the presence of a visitor the average activity of each skeleton, including the pensioner's, is expected to rise. This may also help to interpret the increase in skeletal speed and the second peak (~ 0.5 m/s) in the hourly speed histogram of Fig. 16. However, this interpretation could be challenged, as Kinect cannot distinguish between skeletons because of the multiplexing nature of its three-dimensional skeleton scanning; the graphs therefore show the average (in)activity per skeleton. The decrease in behavioural recognition can also be correlated with a reduction in the average number of detected skeletons below one, especially when it approaches zero, i.e. when there are no skeletons in the Kinect field of view for behavioural recognition. Other sensory signals could then be used to obtain complementary information.

Fig. 17 Trends of active time, inactive (sitting) time and unrecognized behaviour

Fig. 18 Average number of simultaneously detected skeletons for each day

However, the important point here is that the trend lines (dotted lines) for active and inactive times in Fig. 17 show a tendency toward a more active (or more passive) lifestyle, which could be a key health indicator. This suggests that the raw and processed participant data (e.g. eating habits) can be used to model the trajectory of health status, such as average mood [21], and its key indicators with regard to cognitive decline. Correlating the recognized behavioural patterns and their attributes directly to cognitive decline diagnosed by clinicians is critical. Unfortunately, due to the short period of this study and the slow, nonlinear characteristics of cognitive impairment, this was not possible in this pilot study, but it forms one of the tasks for our future studies.

Many limitations were identified that could be a source of various errors in the measurements. We developed a geometrical transformation to remove errors due to geometrical limitations. Its effectiveness was validated by the fall detection results obtained for a participant in a Salix home in Sect. 4.2.3 after applying the geometrical transformation, showing that this correction will be necessary when developing our autonomous remote monitoring and decision support tool further. In the presence of environmental constraints, fusion of composite signals drawn from other sensors along with the Kinect signal proved to be a supplement that can yield useful conclusions about the participant's activities. Animating skeleton movements by creating video clips showed the depth and effect of the physical limitations clearly. This in turn helps to validate and remove physical limitations by addressing them with participants in focus groups, and to make progress on attitudinal and sociological issues alongside the technical and logistical ones.

Despite the many measures taken to deal with big data issues, they can still constrain our progress in assessing and storing large amounts of raw historical data for further analysis and for validating acquired information, for example in creating video clips. Calculation of errors from the above data shows an average error of 1.83%, while the maximum error reached a peak of 12.98% during the implementation phase.

Computation can also introduce errors. For example, Fig. 19a shows the average daily walking speed over the given period for a participant, calculated from the hourly subject speed outcomes. Collecting data in real-world settings results in artefacts such as biologically implausible in-home walking speeds (> 5 m/s); it is for this reason that we chose to report median values (Fig. 19b). Our earlier interpretation of the daily outcomes was that if the participant spent less time in front of the Kinect, good samples would be less likely because of noise and lower accuracy [8]. However, this is not the whole story: as shown in Fig. 13, the so-called exit region causes a systematic measurement error each time the participant moves out of or into the Kinect's scope.

Fig. 19 Daily subject walking speed in m/s: a average and b median [8]

Another source of systematic error is the multiplexing of a skeleton by Kinect. Walking speeds much greater than 5 m/s lead us to believe that the way the calculations were previously performed introduced further errors and noise: it is a well-known fact that integration acts as a filter, whereas direct differentiation introduces additional noise into the measurements. One means of validation, as described previously, is developing video clips (Fig. 15) of the resident behaviours suggested by our autonomous decision-making system. While such a video clip addresses all privacy concerns, it presents the residents' movements and behaviour well and consequently validates the various resident behaviours suggested by our system. Our MAGID software tool also provides visual validation and, in addition, handles big data very well, while video clips do not.

6 Conclusion

Increasing life expectancy without a matching increase in healthy life expectancy has resulted in people living longer with a growing number of age-related conditions, including cognitive deficiencies. This has put an increasing burden on social care and healthcare. To mitigate this, there has been an extensive focus within the MiiHome project on exploring the capabilities of ICT and new technologies such as autonomous systems, machine learning and big data. During this pilot study, dealing with the large dataset gathered from the 12 participants, all of whom had age-related cognitive conditions and frailty, was not an easy task. However, it has been fruitful, as every day we discover a fact or a solution that improves the current situation and narrows the existing vacuum in the application of new technologies for people with cognitive-related conditions.

Measures of mood and appetite [19,20,21,22] have recently been used for predicting depression and anxiety in older adults. These, together with the other studies highlighted in our literature review, have motivated us to develop our autonomous remote monitoring and decision support tool further and to employ various machine learning approaches, including deep learning, although this was previously considered impossible due to big data issues [8]. By monitoring and recognizing important human behaviours such as eating, walking, furniture crawling and physical (in)activity, the system can calculate the main attributes related to key indicators of health events that are necessary for the diagnosis and prognosis of cognitive-related conditions in older adults, and hence prolong independent living. By applying big data techniques and customizing machine learning algorithms accordingly, we have successfully developed various human behaviour recognition models for analysing the data gathered from participants in the MiiHome study. Alongside this, a complementary study of gait analysis and classification, especially of speed during swing, was performed in our Living Lab [74, 75]. Further improvements, such as fundamental changes to the system architecture and database, are scheduled. Using Hadoop looks promising, as it provides a shared-nothing architecture while allowing storage of large amounts of data and running various applications and analytics on the data in parallel [76].

Several validation tools were developed, including animation of the skeleton data and the MAGID software tool, which show encouraging results. While the results of this pilot study are based on a small number of elderly residents over a short period of time, they imply that within the MiiHome project it is possible to obtain indicators of cognitive decline. Further, more comprehensive and longer studies with continuous cognitive assessment of many more participants are therefore planned for full clinical validation of these indications.