1 Introduction

Human activity recognition in intelligent environments is a highly dynamic research area that has gained considerable attention due to its varied applications. Activity recognition systems are applied in active and assisted living systems for smart homes (SH), indoor and outdoor monitoring and surveillance, health care monitoring, and tele-immersion [1,2,3]. Among these, the SH plays an important role, especially in user behavior analysis, health monitoring, and assistance. Most research on activity recognition in SH has investigated single-resident activity monitoring [4,5,6,7]. However, in real-life scenarios a home is not always occupied by a single resident but is often occupied by more than one. Therefore, developing SH solutions from the perspective of multiple residents is crucial.

In recent years, there has been an increase in multiple-occupancy research related to activity modelling and data association. However, various challenges remain in multiple-occupancy settings, such as finding suitable models for data association, i.e., identifying which resident triggered each sensor, and capturing interactions between the occupants [8]. Another major challenge in developing real-life applications is the class imbalance problem. Activity recognition is mainly treated as a classification problem in which system performance depends on the model selection, the features involved, the number of classes, and the size of the datasets available for training. Most SH datasets lack uniformity across the different activities of daily living, which is expected because, in real life, some activities are performed more often than others.

Although several studies have addressed class imbalance, there remains a lack of empirical work on class imbalance in multi-resident activity recognition. In this work, we report an empirical study of both data-level and algorithm-level techniques for handling class imbalance. Data-level approaches modify the original dataset: oversampling the minority classes can provide a balanced distribution without losing information on the majority class, whereas undersampling removes samples from the majority class. The main advantage of undersampling lies in the reduction of training time, which is significant for highly imbalanced large datasets [9]. At the algorithm level, we applied cost-sensitive learning to deep learning models, an approach that has been reported to perform well on class imbalance problems. However, the majority of prior works use statistical methods such as SVM and naive Bayes as the base classifier in cost-sensitive learning [10]. Other works apply machine learning methods to activity recognition that rely on feature extraction techniques, including time-frequency transformations and statistical approaches; such features are carefully engineered and heuristic, and no universal feature extraction method effectively captures distinguishable features of human activities. Consequently, we selected the Long Short-Term Memory (LSTM) network, as it extracts highly discriminative non-linear feature representations while modeling temporal sequences by learning long-term dependencies. In addition, LSTM and 1D-convolutional neural networks have outperformed other statistical machine learning models on single-resident activity recognition [11].

To summarize, the main contributions of this paper are:

  i. a review on handling class imbalance problem with deep learning and explainable AI (XAI);

  ii. employing LSTM and BiLSTM networks for multiple resident activity recognition;

  iii. evaluating model performance by taking each resident separately and also with combined activity labels of the residents;

  iv. conducting extensive experiments using both data level and algorithm level class imbalance techniques; and,

  v. investigating model performance at different sample ratios and cost coefficients on three benchmark datasets.

The paper is structured as follows: Sect. 2 reports the related work. Section 3 introduces the SH datasets, the LSTM and BiLSTM networks, and the data imbalance methods used in the paper, and Sect. 4 describes the experiments performed. The results are presented and discussed in the next section, followed by a concluding section highlighting the major findings.

2 Related work

In this section, we review related works on multiple resident activity recognition and imbalanced data classification and discuss them in detail, as they lay the foundation of the current work.

2.1 Multiple resident activity recognition

Activity recognition approaches fall into two main categories: vision-based [12,13,14] and pervasive sensing-based [15,16,17]. Vision-based activity recognition can provide good results but raises privacy concerns among residents because cameras must be installed in their private spaces [18, 19], whereas pervasive sensing-based approaches use data from wearable sensors and non-intrusive environment sensors [20]. A significant amount of work has been performed on activity recognition using wearable sensors. A technology called Body Sensor Networks (BSN) has emerged, consisting of different wearable sensors that capture and process physiological signals on the human body; BSNs collect data from these sensors and process them to extract useful information [21, 22]. A major issue with wearable sensors is that wearing or carrying a tag is often not feasible, especially for older people, who often forget to wear tags or are unwilling to wear them at all. There have been efforts to create adaptive solutions for user adoption and integration; nonetheless, the challenge of usability persists among older individuals [23,24,25]. Pervasive sensing using environment sensors offers the advantage of being non-intrusive to the inhabitants, who are not required to carry any tag or device. In pervasive sensing, sensors are deployed in the environment to capture the activities of the residents, which can then be used for activity recognition. This approach has challenges of its own: recognizing human activities from environment sensors is difficult because the data captured by the sensors can be disturbed by the surroundings, making them noisy, and because human activities are complex. In such a setting, sensor deployment, sensor configuration, and the selection of the classification model play an important role in identifying both the residents' activities and the residents themselves [26].

In previous works, diverse computational models have been applied to single-resident activity recognition, including standard data mining approaches, probabilistic models, machine learning models such as neural networks, support vector machines, and decision trees, and ontologies. For multi-resident activity recognition, however, such a diversity of models has not yet been explored. The problem in multi-resident activity recognition using non-intrusive sensors is the association of sensor data, since such sensors cannot directly identify residents or the interactions between them, whereas in a single-resident setting the sensors' states directly reflect the activity of the sole resident. Multi-resident activities can follow different scenarios: the same activity can be performed by two or more residents (e.g. eating a meal or watching TV together), or multiple residents can perform different activities independently (e.g. one resident watching TV while the other prepares a meal). Evidently, a model is needed that can capture the complex nature of both joint and independent activities. Previous works have addressed multi-resident activity recognition using wearable sensors such as RFID [17] and accelerometers [27], and using videos [28]. Machine learning approaches used previously for multi-resident activity recognition include naive Bayes, a Markov model classifier [29], and conditional random fields (CRF) [30] on the CASAS dataset [31], in which the data association problem was investigated. In [32], the authors proposed a two-stage activity recognition method to exploit more knowledge in multi-resident activities. The model comprises a building phase and an activity recognition phase, and it converts the multi-label problem into a single-label problem by treating the activities of the residents as a combined label state, using HMM (Hidden Markov Model) and CRF classifiers. More recently, deep learning models have shown impressive performance in various fields. The LSTM network, a variant of the recurrent neural network (RNN), is well suited to time series problems because its design enables gradients to flow through time readily [33]. A deep residual bidirectional LSTM network has been used for activity recognition with wearable sensors on the UCI dataset (smartphone data) and the Opportunity dataset (data from wearable, object, and ambient sensors) [34]. In [35], CNN (Convolutional Neural Network) and LSTM models were used to extract spatio-temporal features from multisensory and multimodal data, including RFID, audio, and video, for concurrent activity recognition. In [36], a joint diverse temporal learning framework using LSTM and 1D-CNN models was proposed to improve human activity recognition.

However, existing state-of-the-art approaches in multi-resident activity recognition focus on improving recognition algorithms with accuracy as the performance metric rather than on handling imbalanced datasets. Furthermore, there is a lack of comprehensive studies on how different class imbalance approaches perform in the multi-resident activity recognition domain.

2.2 Imbalanced data classification

Existing research uses various machine learning and deep learning models for activity classification but rarely analyzes the class imbalance in the dataset, so a model can achieve very high accuracy that does not reveal its actual performance. Likewise, in some machine learning problems not every mistake is treated equally. This is certainly true in the SH setting; for example, failing to detect a resident's fall is much more harmful than failing to detect that a resident is brushing their teeth. Training with equal importance for each activity in the home environment is therefore not suitable for providing a high-quality user experience. In the multi-resident setting, common methods for handling imbalanced datasets need to be adapted, since instead of one classification there are multiple classifications (one for each resident of the house).

Three major approaches have been defined to learn from imbalanced data [37]: data-level methods, algorithm-level methods, and hybrid methods. Data-level methods concentrate on modifying training sets to balance the class distributions by adding or removing samples; they use oversampling (adding new samples to the minority class) and undersampling (removing samples from the majority class). In this way, data-level approaches avoid modifying the learning algorithm by reducing the effect of the imbalance in a preprocessing step [38]. The Synthetic Minority Over-sampling Technique (SMOTE) is the most popular oversampling method [39]; its idea is to create new minority samples by interpolating between minority class instances that lie close together. The strategy used in SMOTE can be problematic, as it blindly generalizes the minority class without regard to the majority class; in particular, when the class distribution is highly skewed and the minority class is very sparse compared to the majority class, there is a high chance of class mixture [40]. Among undersampling techniques, four K-Nearest Neighbour (KNN) methods were proposed [41], namely NearMiss-1, NearMiss-2, NearMiss-3, and the "most distant" method, in which, instead of using the entire set of majority samples, a small subset is selected so that the resulting training data are less skewed. Experimental results suggest that NearMiss-2 is competitive with SMOTE and random undersampling. Algorithm-level methods modify existing learning algorithms to alleviate the bias towards majority classes. These methods require knowledge of both the learning algorithm and the application domain, in particular an understanding of why the classifier fails when the class distribution is uneven. The most popular such methods are cost-sensitive approaches [42], in which the learning algorithm is modified by assigning different misclassification costs to the considered groups of samples. Another algorithm-level approach is one-class learning, which focuses only on the target group and thus helps eliminate bias towards any group. Hybrid methods combine data-level and algorithm-level methods, drawing on the strengths of both; merging data-level solutions with classifier ensembles is one of the widely used hybrid approaches [43]. Some works also propose hybridizing cost-sensitive learning with sampling methods [44].
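As a concrete illustration of the two data-level families discussed above, the sketch below applies SMOTE and NearMiss-2 to synthetic imbalanced data using the imbalanced-learn package; this package and the synthetic data are assumptions for illustration only, and the sampling scheme actually used in this paper is described in Sect. 3.3.

```python
# Minimal sketch of the data-level methods: SMOTE oversampling and NearMiss-2
# undersampling on synthetic imbalanced data (illustrative only).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss

# Synthetic imbalanced data standing in for sensor feature windows.
X, y = make_classification(n_samples=2000, n_classes=3, n_informative=6,
                           weights=[0.80, 0.15, 0.05], random_state=0)
print("original:", Counter(y))

# Oversampling: SMOTE interpolates between neighbouring minority instances.
X_over, y_over = SMOTE(random_state=0).fit_resample(X, y)
print("after SMOTE:", Counter(y_over))

# Undersampling: NearMiss-2 keeps the majority samples closest to the farthest
# minority samples, yielding a smaller, less skewed training set.
X_under, y_under = NearMiss(version=2).fit_resample(X, y)
print("after NearMiss-2:", Counter(y_under))
```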

Numerous other works have handled class imbalance in traditional classification problems using data preprocessing and algorithm-level techniques [45,46,47]. These studies have shown that, for various base classifiers, a balanced dataset provides better overall performance than an imbalanced one. Traditional machine learning algorithms such as SVM, decision trees, naive Bayes, random forests, Hidden Markov Models, and their ensembles have been used to balance minimizing the total recognition error against maximizing the classification accuracy on the minority class [48]. The major drawback of these methods is their reliance on handcrafted, heuristic feature extraction. Recently, deep learning methods have shown promising results in applications such as natural language processing, image classification, speech recognition, and human activity recognition, outperforming other methods on raw sensor datasets [4].

2.2.1 Handling class imbalance in deep learning

Among deep learning methods for handling class imbalance, one work uses a CNN for representation learning [49] and proposes quintuplet sampling with a triple-header loss for imbalanced learning. Another proposes Deep Over-sampling (DOS) with a CNN architecture [50] to address the effect of class imbalance on both the classifier and the learned representation; empirical results show that the DOS framework improves handling of class imbalance. A new loss function for deep neural networks is proposed in [51], which captures classification errors from both majority and minority classes, and another method jointly optimizes the network parameters and class-sensitive costs [52]. Since deep reinforcement learning has shown promising results in various applications, recent work also explores its performance for imbalanced classification of text and image data [53], formulating classification as a sequential decision-making process: the environment returns a high reward for minority class samples and a low reward for majority class samples, and an episode terminates when the agent misclassifies a minority class sample. Deep Q-learning is used to find the optimal classification policy for the resulting Imbalanced Classification Markov Decision Process (ICMDP), and experiments show improved classification performance on imbalanced text datasets. A survey on class imbalance in deep learning presents classical methods such as random oversampling and cost-sensitive target functions, which show promising results when applied in deep learning settings [54]. In general, reinforcement learning (RL) is highly relevant because it is close to a human-in-the-loop approach and can therefore help bring human conceptual understanding into the machine learning pipeline [55].

2.2.2 Class imbalance problem with XAI and interpretable machine learning (IML)

Interpretable and explainable ML models are providing promising solutions in critical fields such as healthcare, finance, and computer vision. Explainable AI focuses on explaining learning models, while machine learning interpretability allows users to understand a model's results through the reasons behind its predictions. State-of-the-art models such as deep learning and boosting classifiers are trained to classify instances with high accuracy, and their interpretability can be enhanced through eXplainable AI (XAI) techniques such as Layerwise Relevance Propagation (LRP) [56], SHAP [57], and LIME [58]; for an overview and comparison of XAI methods see [59]. XAI has meanwhile developed into an established, large, and complicated field with many different approaches, and this variety makes it difficult for the non-expert data scientist to decide which method to use. This is where design patterns can help [60]. The LRP process mentioned above is an example of the decomposition pattern: breaking complex elements down into smaller, more understandable parts and explaining how they work together to reach a final decision or result. Explainability is essential for fostering trust, facilitating human understanding, and ensuring ethical and effective decision-making in high-stakes applications [61], and ultimately for ensuring that the human remains in control of AI [62].

Performance evaluations, including confusion matrices, affirm the models' efficacy and reliability in classifying high-cost, minority-class instances. Interpretation is a critical task in both imbalanced data learning and IML; however, the two take different perspectives. In deep learning, imbalanced learning has typically sought to understand class imbalance in terms of overlap, sub-concepts, and data outliers, whereas IML methods are designed to explain internal representations, inputs, and outputs of neural networks. In addition, imbalanced learning is generally concerned with all classes globally, whereas IML explains model decisions with respect to specific instances [63]. It would therefore be interesting to combine both fields in a single framework to better understand the predictions made by a model on imbalanced data.

There is little research on the potential effect of class imbalance on model-agnostic interpretation methods. Patil et al. (2020) [64] address the challenge of imbalanced datasets by employing the Synthetic Minority Oversampling Technique (SMOTE) for data resampling and applying LIME and SHAP to the balanced dataset to identify the important features. They found that oversampling does not change the correlation among features, as the features crucial for predicting both valid and fraudulent observations remain consistent. Several studies have assessed the reliability of LIME and SHAP [65, 66], yet none of them considered class imbalance. One might expect that class imbalance does not affect the consistency of interpretations, or even reduces their uncertainty, since interpretations may come from the majority class with its more stable distribution. Conversely, class imbalance could hinder interpretation methods, as rare events may fall outside the typical distribution and thus be challenging to interpret. Chen et al. (2024) [67] argue that class imbalance does have an adverse effect on the interpretive performance of both LIME and SHAP: their findings indicate that interpretations generated by LIME and SHAP become less stable as class imbalance increases, suggesting that class imbalance negatively impacts the interpretability of machine learning models. The potential effects of imbalanced learning techniques on the performance of interpretation methods still need to be investigated. Resampling methods can tackle imbalanced data, but they introduce overfitting (oversampling) or information loss (undersampling). Exploring the relationship between cost-sensitive learning and the interpretability of a network model involves understanding how cost adjustments influence model decisions and how those decisions can be interpreted, particularly through the lens of explainable methods. SHAP can help in understanding the impact of class weights and in misclassification analysis: based on SHAP insights, the model can be refined by adjusting class weights if certain features disproportionately influence the model and introduce bias, and by re-evaluating the balance between managing misclassification costs and maintaining interpretability. Furthermore, it would be beneficial to examine theoretically how class imbalance affects the stability of interpretation methods; for instance, one could use Logistic Regression as the predictive model and establish theoretical results for the interpretations generated by SHAP or LIME in the presence of class imbalance. Such an approach could provide further insight into the stability of interpretation methods, especially for complex "black-box" machine learning models. However, an experimental study on the integration of interpretable models and its effect on class imbalance falls beyond the scope of this paper.
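As a hedged illustration of how SHAP could be used to inspect the effect of class weights, the sketch below compares SHAP attributions of an unweighted and a class-weighted logistic regression on synthetic imbalanced data; the models, data, and SHAP configuration are illustrative assumptions, not part of the reviewed works or of this paper's experiments.

```python
# Hedged sketch: compare SHAP attributions for an unweighted vs. a
# class-weighted logistic regression on imbalanced synthetic data.
import numpy as np
import shap
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1500, weights=[0.95, 0.05], random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X, y)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X, y)

background = shap.sample(X, 100, random_state=0)  # background set for KernelExplainer
for name, model in [("plain", plain), ("class-weighted", weighted)]:
    f = lambda data: model.predict_proba(data)[:, 1]   # minority-class probability
    explainer = shap.KernelExplainer(f, background)
    shap_values = explainer.shap_values(X[:50])        # attributions for 50 instances
    # Mean absolute attribution per feature: shows how the class weighting
    # shifts which features drive the minority-class prediction.
    print(name, np.abs(shap_values).mean(axis=0).round(3))
```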

2.2.3 Handling class imbalance in activity recognition

A few studies in single-resident activity recognition have used an improved SMOTE algorithm to address imbalanced activity classes [68]. SMOTE [39] is widely used because it creates new, non-replicated examples by interpolating neighboring minority class instances, but it fails to preserve the class covariance structure and increases overlap between classes. Another work uses a cost-sensitive SVM approach for activity classification and compares the results with HMM, CRF, and traditional SVM models [69]. Most work on handling class imbalance focuses on vision and text classification problems, and very little has been done on class imbalance in multi-resident activity recognition. In addition, existing works lack comparative studies of different class imbalance approaches.

Therefore, this paper presents a comprehensive study of both data-level and algorithm-level class imbalance approaches in multi-resident activity recognition. Since temporal deep learning methods have shown promising results on raw sensor data in single-resident activity recognition, we use LSTM and BiLSTM networks as classifiers for addressing class imbalance in activity recognition.

3 Methodology

3.1 Smart home datasets

In this work we use the publicly available ARAS [70] and CASAS-Kyoto Multiresident ADL Activities datasets (the fourth dataset in the CASAS dataset list: http://casas.wsu.edu/datasets/) [16, 71]. ARAS is widely used in activity recognition research, whereas the CASAS-Kyoto Multiresident ADL Activities dataset has received little attention in previous work. Because collecting real SH data is time-consuming, costly, and difficult to annotate, these publicly available datasets are used to provide a baseline for comparison.

3.1.1 ARAS multi-resident ADL dataset

The ARAS dataset uses ambient sensors such as contact sensors, temperature sensors, sonar distance sensors, force sensors, photocells, resistors, and infrared receivers in the SH setting. The dataset consists of 20 different sensor signals as features, together with the activity labels of two residents, for two different houses, termed House A and House B. Each house has 30 days of data in 30 separate files, and every file contains 86,400 instances. The dataset contains 27 different activities for each resident. The distributions of activities in House A and House B of the ARAS dataset are shown in Fig. 1.

Fig. 1 Activity distribution of both residents (R1 and R2) in the ARAS dataset

As Fig. 1 shows, the data of both residents in the two houses are highly imbalanced: only a few activities account for more than 35% of the distribution, and most activities account for less than 10% of the whole dataset.

3.1.2 CASAS-Kyoto multiresident ADL activities dataset

The CASAS-Kyoto Multiresident ADL Activities dataset was collected in a smart apartment testbed at Washington State University (WSU). The sensors used are motion, item, cabinet, water, burner, phone, and temperature sensors. The smart space was occupied by two residents at the same time, who performed daily living tasks concurrently. The collected sensor events were labeled with activity and person identifiers. The dataset has 15 different daily living activities performed by the residents, a few of which (moving furniture, playing checkers, paying bills, and packing picnic supplies) were accomplished jointly. Since some activities were performed jointly and others individually, when an activity is performed by only one resident there is no label for the other resident's activity. Because both residents are present in the apartment, we assigned a label (named "Other") to the second resident's unknown activity, giving 16 activity labels for both residents. In many cases, however, there were sensor readings and activity labels for both residents. The frequency distribution of activities in the dataset is shown in Fig. 2.

Fig. 2 Frequency count of activities in the CASAS-Kyoto Multiresident ADL Activities dataset

3.2 LSTM models

LSTM networks [72] are a successful extension of RNNs designed to avoid the long-term dependency problem associated with RNNs. LSTM models introduce a cell state and a constant error carousel (CEC) that allow error signals to propagate constantly over time, thereby solving the vanishing gradient problem. In addition, LSTM uses a gating mechanism over an internal memory cell to control access to the CEC and to learn a more complex representation of long-term dependencies. LSTM is well suited to classifying, processing, and predicting time series data with time lags of unknown size. An LSTM block consists of input, output, and forget gates, which are responsible for the write, read, and reset operations of the memory cell, respectively. The main component of the LSTM is the memory cell, which remembers states over short or long periods across arbitrary time intervals. The forget gate receives the new time step \(X_{t}\) and the previous output \(h_{t-1}\) as input and applies a sigmoid activation to decide which information is kept or deleted: information is deleted if the output of the sigmoid is 0 and kept if it is 1. The forget gate computation is shown in Eq. (1). The next step decides what new information is stored in the cell state and has two parts: first, the input gate decides which new information from the current input (\(X_{t}, h_{t-1}\)) is written to the cell state; second, a tanh activation generates a new candidate value \(\tilde{C_{t}}\) that could be appended to the cell state. The product of these two parts is added to the product of the forget gate (\(f_{t}\)) and the previous cell state (\(C_{t-1}\)) to generate the new cell state (\(C_{t}\)): multiplying \(f_{t}\) with \(C_{t-1}\) forgets the information that was marked for deletion earlier, and adding \(i_{t} * \tilde{C_{t}}\) incorporates the new candidate value, scaled by how much the cell state is updated. The computations of the input gate, the candidate value, and the cell state are shown in Eqs. (2)–(4). In the final step, the output gate produces a filtered version of the cell state: the previous hidden state and the current input are passed through a sigmoid activation, the new cell state is passed through a \(\tanh\) function, and the two outputs are multiplied to generate the next hidden state. The updated cell state and new hidden state forward the information to the next time step. Equations (5) and (6) show the computation of the output gate and the hidden state (\(h_{t}\)).

$$\begin{aligned} f_{t}&= \sigma (W_{f}\cdot [h_{t-1}, x_{t}] + b_{f}) \end{aligned}$$
(1)
$$\begin{aligned} i_{t}&= \sigma (W_{i}\cdot [h_{t-1}, x_{t}] + b_{i}) \end{aligned}$$
(2)
$$\begin{aligned} \tilde{C_{t}}&= \tanh (W_{C}\cdot [h_{t-1}, x_{t}] + b_{c}) \end{aligned}$$
(3)
$$\begin{aligned} C_{t}&= f_{t} * C_{t-1} + i_{t} * \tilde{C_{t}} \end{aligned}$$
(4)
$$\begin{aligned} o_{t}&= \sigma (W_{o}\cdot [h_{t-1}, x_{t}] + b_{o}) \end{aligned}$$
(5)
$$\begin{aligned} h_{t}&= o_{t} * \tanh (C_{t}) \end{aligned}$$
(6)

where \(\sigma\) is the sigmoid activation function, \(\tanh\) is the hyperbolic tangent function, \(x\) is the input data, and \(W\) is the weight matrix. The LSTM equations are adapted from [73].
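For concreteness, the following NumPy sketch implements one LSTM time step exactly as defined in Eqs. (1)–(6); the experiments in this paper use a deep learning framework rather than this code, so the sketch is purely illustrative.

```python
# Minimal NumPy sketch of one LSTM step implementing Eqs. (1)-(6).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_c, b_o):
    """One LSTM time step; each W has shape (hidden, hidden + input)."""
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)             # Eq. (1) forget gate
    i_t = sigmoid(W_i @ z + b_i)             # Eq. (2) input gate
    c_tilde = np.tanh(W_C @ z + b_c)         # Eq. (3) candidate cell state
    c_t = f_t * c_prev + i_t * c_tilde       # Eq. (4) new cell state
    o_t = sigmoid(W_o @ z + b_o)             # Eq. (5) output gate
    h_t = o_t * np.tanh(c_t)                 # Eq. (6) new hidden state
    return h_t, c_t

# Toy usage: 20 sensor features, 128 hidden units (as in Sect. 4).
n_in, n_hidden = 20, 128
rng = np.random.default_rng(0)
Ws = [rng.standard_normal((n_hidden, n_hidden + n_in)) * 0.1 for _ in range(4)]
bs = [np.zeros(n_hidden) for _ in range(4)]
h, c = np.zeros(n_hidden), np.zeros(n_hidden)
h, c = lstm_step(rng.standard_normal(n_in), h, c, *Ws, *bs)
```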

Fig. 3 LSTM and Bidirectional LSTM

The architecture of the LSTM and BiLSTM networks is shown in Fig. 3. The input layer of the network comprises an embedded vector containing a sequence of sensor events; n LSTM cells are then fully connected to the inputs and have recurrent connections with the other LSTM cells.

Finally, a dense output layer performs the classification task. In the BiLSTM network, two parallel LSTMs are used as forward and backward passes, which extract patterns from past and future events: the forward layer reads the input from left to right, whereas the backward layer reads it from right to left.

The output prediction is the weighted sum of the prediction scores from the forward and backward layers. In both networks, the Adam optimizer is used to train the network by minimizing the softmax cross-entropy loss.
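A minimal sketch of the two networks, assuming a Keras implementation and the per-resident ARAS setting (20 sensor features, 27 activity classes); the paper does not name its framework, so the code below is illustrative only.

```python
# Minimal Keras sketch of the LSTM and BiLSTM classifiers described above.
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN, N_FEATURES, N_HIDDEN, N_CLASSES = 30, 20, 128, 27

def build_model(bidirectional=False):
    inputs = layers.Input(shape=(SEQ_LEN, N_FEATURES))
    lstm = layers.LSTM(N_HIDDEN)
    # BiLSTM: two parallel LSTMs reading the sequence forwards and backwards.
    x = layers.Bidirectional(lstm)(inputs) if bidirectional else lstm(inputs)
    outputs = layers.Dense(N_CLASSES, activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",  # softmax cross-entropy
                  metrics=["accuracy"])
    return model

lstm_model = build_model(bidirectional=False)
bilstm_model = build_model(bidirectional=True)
```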

3.3 Handling class imbalance with LSTM and BiLSTM networks

In this paper, the following three methods are used with LSTM and BiLSTM networks.

3.3.1 Oversampling

Oversampling is a data-level approach that balances the class distribution by increasing the number of minority class samples. It is performed by computing the sampling ratio (also known as the class imbalance ratio) between the minority and majority classes. We selected the most frequent activity and reduced the imbalance of the less frequent activities in the training set. We oversampled the less frequent activities with varying sampling ratios but never, in any case, to the point where a less frequent activity became more frequent than the actual most frequent activity. For example, suppose an activity has 1,000 samples for Resident 1 and 5,000 samples for Resident 2, and the most frequent activity has 10,000 samples; in this case we cap the oversampling ratio at 2, even though a ratio of 10 could be applied if only Resident 1 were taken into consideration. We used different sampling ratios (ranging from 1 to 10) and conducted experiments over this range; the most notable differences in model performance were observed at sampling ratios of 2 and 5.
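A minimal sketch of one way this capped oversampling could be implemented (an illustrative reconstruction, not the authors' code): each activity is repeated by the sampling ratio, but never beyond the count of the most frequent activity.

```python
# Capped oversampling sketch: repeat minority-activity windows by `ratio`,
# never exceeding the size of the most frequent activity.
import numpy as np
from collections import Counter

def capped_oversample(X, y, ratio, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    y = np.asarray(y)
    counts = Counter(y.tolist())
    max_count = max(counts.values())
    keep_idx = []
    for label, count in counts.items():
        label_idx = np.flatnonzero(y == label)
        target = min(int(count * ratio), max_count)   # cap at the majority size
        extra = rng.choice(label_idx, size=target - count, replace=True)
        keep_idx.extend(label_idx.tolist())
        keep_idx.extend(extra.tolist())
    keep_idx = np.asarray(keep_idx)
    return X[keep_idx], y[keep_idx]
```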

3.3.2 Undersampling

Similar to oversampling, undersampling is a data-level approach performed by computing a sampling ratio, here used to reduce the samples of the residents' most frequent activities. We limited the undersampling ratio so that the most frequent activity remains the most frequent even after being undersampled. For example, if the activity ratio of either Resident 1 or Resident 2 is lower than the average (a uniform distribution over all activities in the original counts), we keep these instances and do not undersample; we only undersample when both residents' activities are over-represented, again thresholding the undersampling ratio with respect to the average. We also tried different sampling ratios in the range 0.25 to 1.0 and conducted experiments over this range; the most notable differences in model performance were observed at undersampling ratios of 0.25 and 0.5.
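A minimal sketch of the corresponding constrained undersampling, shown for a single label vector and keeping each window of an over-represented activity with probability equal to the sampling ratio (the probabilistic selection is detailed at the end of Sect. 3.3.3); in the multi-resident case the same over-representation check is applied to both residents' labels before a window is discarded. This is an illustrative reconstruction, not the authors' code.

```python
# Constrained undersampling sketch: only over-represented activities (count
# above the per-activity average) are thinned, each kept with probability
# `ratio`. The paper additionally thresholds the ratio so that the most
# frequent activity remains the most frequent.
import numpy as np

def capped_undersample(X, y, ratio, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    y = np.asarray(y)
    labels, counts = np.unique(y, return_counts=True)
    average = counts.mean()                          # uniform-distribution reference
    over_represented = set(labels[counts > average].tolist())
    keep = np.array([
        (label not in over_represented) or (rng.random() < ratio)
        for label in y.tolist()
    ])
    return X[keep], y[keep]
```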

Data-level approaches are independent of the classifier, as they avoid modifying the learning model by reducing the effect of imbalanced data in a preprocessing step; this makes them more versatile.

3.3.3 Cost-sensitive learning

Cost-sensitive learning lies between the data-level and algorithm-level approaches, as it incorporates both data-level processing, by adding costs to samples, and algorithm-level modification of the learning process [74]. This method evaluates the cost associated with misclassifying samples. It does not create a balanced data distribution; rather, it assigns the training samples of different classes different weights, proportional to their misclassification costs. In our cost-sensitive version, we scaled the loss according to cost coefficients for infrequent activities and limited the cost coefficient to below the ratio of the most frequent activity's frequency to the given activity's frequency. We conducted experiments with different cost coefficients (ranging from 1 to 10), and the best model performance was observed at cost coefficients 2 and 5.
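A minimal sketch of how such capped cost coefficients could be derived and passed to a Keras model as class weights; the paper scales the loss directly, so this class_weight formulation is an assumption for illustration, and the lstm_model and data variables below refer to the earlier hypothetical sketches.

```python
# Cost-sensitive sketch: per-class weights equal to the chosen cost, capped
# below the ratio (most frequent activity count) / (given activity count).
from collections import Counter

def cost_coefficients(y_train, cost=2):
    counts = Counter(y_train)
    max_count = max(counts.values())
    return {label: min(cost, max_count / count) for label, count in counts.items()}

# Hypothetical usage with the lstm_model sketched in Sect. 3.2 and already
# windowed training data:
# class_weights = cost_coefficients(y_train, cost=5)
# lstm_model.fit(X_train, y_train, epochs=30, batch_size=64,
#                validation_data=(X_val, y_val), class_weight=class_weights)
```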

Since the dataset contains multiple residents, where most activities are performed separately by each resident but some are performed together, we considered both how often each resident performs each activity individually and how often activities are performed jointly. Figure 4 depicts the LSTM model for multi-resident activity recognition with activity labels \(a^{1,1}a^{2,1}\),...,\(a^{1,T}a^{2,T}\), where \(a^{1,t}\) is the activity label of the first resident and \(a^{2,t}\) that of the second resident at time step t. Figure 4a shows the LSTM model with each resident's activity treated separately, and Fig. 4b shows the model with the residents' combined activities. In the case of separate activity labels, for example, we selected activity 1 and activity 3 separately for different residents and applied the class imbalance methodologies per resident, always keeping the most frequent activity's sample count above that of any other activity. In the case of combined activities of both residents, we used a tuple of activities, computed the frequency of each tuple, such as activity (1, 3), and applied the class imbalance methodologies to these tuple activities.

Fig. 4 LSTM model for multi-resident activity recognition

Oversampling the activities with a sampling ratio of 2 or 5 does not mean multiplying each resident's activity count by 2 or 5; we took into account that increasing one resident's activity also changes the distribution of the other resident's activities, since the dataset contains sensor information for both residents together. Similarly, undersampling with a sampling ratio of 0.25 or 0.5 does not mean reducing the activity distribution to one quarter or one half. Instead, when undersampling with a ratio of 0.25, each data point was kept with probability 0.25. The exact distribution of activities may therefore vary in each case.
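A minimal sketch of the combined-label construction used in Fig. 4b: each pair of resident activities at a time step is mapped to a single class, and the tuple frequencies then drive the sampling ratios and cost coefficients. The helper below is an illustrative reconstruction, not the authors' code.

```python
# Combined-label sketch: treat the pair of resident activities at each time
# step, e.g. (1, 3), as one class whose frequency drives the sampling.
from collections import Counter

def combine_labels(y_resident1, y_resident2):
    pairs = list(zip(y_resident1, y_resident2))
    pair_to_id = {pair: i for i, pair in enumerate(sorted(set(pairs)))}
    y_combined = [pair_to_id[p] for p in pairs]
    return y_combined, pair_to_id, Counter(pairs)   # tuple frequencies
```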

4 Experiments

The experiments were performed on three SH datasets: two houses (House A and House B) from the ARAS dataset and a third house from the CASAS dataset. Both datasets contain sensor observations of two residents. The experiments are designed such that, for all three houses, the residents' activities are classified using LSTM and BiLSTM networks, and for each model we explore oversampling, undersampling, and cost-sensitive learning to handle the class imbalance problem. Each house of the ARAS dataset comprises 30 days of human activity data, with 86,400 data points per day. The dataset is divided into training, validation, and test sets such that the first 18 days are used for training, the next six days for validation, and the last six days as the test set. In the CASAS-Kyoto Multiresident ADL Activities dataset, the activities of the two residents were recorded over 26 days, and each file has a different number of data points. We followed the same approach as for the other datasets: the first 16 days are used for training (10,572 instances), the next five days for validation (3,051 instances), and the last five days as the test set (3,608 instances of sensor readings). The experiments are first run on the original dataset (without applying class imbalance methods), and then twelve experiments are conducted for each model by applying class imbalance techniques to the training data.
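A minimal sketch of the chronological, day-wise split described above, assuming the per-day files have already been loaded into a list (the variable name below is hypothetical).

```python
# Chronological day-wise split, e.g. 18/6/6 for each ARAS house and
# 16/5/5 for CASAS-Kyoto.
def split_days(day_files, n_train, n_val):
    train = day_files[:n_train]
    val = day_files[n_train:n_train + n_val]
    test = day_files[n_train + n_val:]
    return train, val, test

# Hypothetical usage:
# train, val, test = split_days(aras_house_a_days, n_train=18, n_val=6)
```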

The evaluation metrics play an important role in measuring how well the models handle class imbalance in multi-resident activity recognition. We therefore use the Exact Match Ratio (EMR), balanced accuracy, and the micro average of the F1-score to evaluate all models. The EMR indicates the percentage of samples for which all labels are classified correctly (Eq. 7). Balanced accuracy is used in multi-class classification problems to deal with imbalanced datasets and is based on two commonly used metrics, sensitivity (also known as the true positive rate or recall) and specificity (the true negative rate), as shown in Eq. 8. We also use the micro average of the F1-score, which is a weighted average of recall and precision (Eq. 9). The exact match ratio over both residents, and the balanced accuracy and micro-averaged F1-score of each resident on the test set, are computed at the best validation accuracy for all models.

In both the LSTM and BiLSTM networks, we explored sequence lengths from 10 to 100, batch sizes from 32 to 512, and numbers of epochs from 5 to 100 through a series of trial-and-error experiments. We observed that 30 epochs, a batch size of 64, a sequence length of 30, and 128 hidden units (n) were the optimal parameters for avoiding overfitting and achieving low generalization error when training both models. The model parameters are kept the same for all datasets. The networks were trained on a single Quadro RTX 4000 8 GB GPU; the trained models can also be used for inference without much loss of performance when no GPU is available. In addition, we ran the experiments on a single NVIDIA 12 GB GeForce GTX 1080 Ti GPU and observed the same results in both computing environments.

$$\begin{aligned} Exact\,Match\,Ratio,EMR = \frac{1}{n}\sum _{i=1}^{n}I(Y_i=Z_i) \end{aligned}$$
(7)

where I is the indicator function, \(Y_{i}\) is the target class, and \(Z_{i}\) is the predicted class.

$$\begin{aligned} Balanced\,Accuracy&= \frac{Sensitivity + Specificity}{2} \end{aligned}$$
(8)
$$\begin{aligned} F1-score&= \frac{2 * (precision * recall)}{(precision + recall)} \end{aligned}$$
(9)
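A minimal sketch of how the three metrics can be computed, using scikit-learn for balanced accuracy (macro-averaged recall, the usual multi-class generalisation of Eq. 8) and the micro-averaged F1-score, and computing the EMR of Eq. (7) directly over both residents' predictions.

```python
# Evaluation sketch for Eqs. (7)-(9).
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score

def exact_match_ratio(y_true_r1, y_pred_r1, y_true_r2, y_pred_r2):
    """Eq. (7): fraction of time steps where both residents' labels are correct."""
    both_correct = (np.asarray(y_true_r1) == np.asarray(y_pred_r1)) & \
                   (np.asarray(y_true_r2) == np.asarray(y_pred_r2))
    return both_correct.mean()

def per_resident_scores(y_true, y_pred):
    return {
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),  # Eq. (8)
        "micro_f1": f1_score(y_true, y_pred, average="micro"),         # Eq. (9)
    }
```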

5 Results and discussion

In this section, the experimental results of the LSTM and BiLSTM networks with the different class imbalance approaches are presented and discussed in terms of the exact match ratio, balanced accuracy, and micro-averaged F1-score. Figures 5, 6 and 7 present the balanced accuracy results for each resident of each dataset. Tables 1 and 2 report the results for the House A (ARAS) dataset, Tables 3 and 4 for House B (ARAS), and Tables 5 and 6 for the CASAS-Kyoto Multiresident ADL Activities dataset, in terms of EMR and micro-averaged F1-score.

Fig. 5 ARAS House A Balanced accuracy results

As discussed in the previous section, each table shows the results of the baseline model, trained without class imbalance techniques, followed by 12 experiments combining data-level and algorithm-level techniques with the deep learning models. For all three approaches (oversampling, undersampling, and cost-sensitive learning), the term "single" denotes activity recognition for each resident separately and the term "multi" denotes combined activity recognition. The models are evaluated at different oversampling (2 and 5) and undersampling (0.25 and 0.5) class ratios, together with different cost coefficients (2 and 5), to provide a detailed study and comparison of class imbalance approaches in the multi-resident setting.

Fig. 6 ARAS House B Balanced accuracy results

Fig. 7 CASAS-Kyoto Balanced accuracy results

The balanced accuracy results show that the single cost-sensitive learning approach outperforms all other class imbalance approaches in the majority of cases. In the ARAS House A dataset, the single cost-sensitive approach improves R1 by 3% with the LSTM and by 1% with the BiLSTM relative to the baseline model; for R2, the cost-sensitive approach increases balanced accuracy by 1% with the LSTM, whereas with the BiLSTM single undersampling improves balanced accuracy by 3%. In House B of the ARAS dataset, the cost-sensitive approach performs better for both LSTM and BiLSTM models, except for the LSTM model of R2, where the undersampling approach is slightly better. In the CASAS dataset, the single cost-sensitive approach clearly outperforms all other approaches, improving the balanced accuracy of R1 by 9% and 13% for the LSTM and BiLSTM, and that of R2 by 11% and 14%, compared with the baseline model.

Table 1 LSTM-HouseA (ARAS)
Table 2 BiLSTM-HouseA (ARAS)

To summarize, it can be observed that in almost all the networks cost-sensitive learning performs best in terms of balanced accuracy. For the EMR over both residents, no clear trend emerges: in House B the difference in EMR is minimal for both the LSTM and BiLSTM networks; in House A the baseline model performs better than the other models, whereas for the BiLSTM network the EMR of the undersampling and cost-sensitive approaches is similar; and in the CASAS-Kyoto dataset the EMR is better for the undersampling and cost-sensitive approaches. The F1-score of R2 is better than that of R1 in House A, whereas for House B high F1-scores are achieved for both residents compared with House A. Furthermore, in the CASAS-Kyoto smart home, no significant difference is seen between the F1-scores of R1 and R2.

Table 3 LSTM-HouseB (ARAS)
Table 4 BiLSTM-HouseB (ARAS)

Since each SH dataset has a different configuration, different sensor readings, activity labels, and degrees of class imbalance, differences in model performance are observed across all three datasets. The computation time for the CASAS-Kyoto dataset was much shorter than for the ARAS dataset, owing to the smaller number of sensor observations per day. In terms of computational time, the undersampling method was fastest to train, compared with oversampling and cost-sensitive learning, while multi-oversampling took the longest, which is expected given the increased number of training samples. Among the deep learning models, training the LSTM was faster than training the BiLSTM. Figure 8 shows the computational time of both models for all three datasets.

Table 5 LSTM (CASAS-Kyoto)
Table 6 BiLSTM (CASAS-Kyoto)
Fig. 8 Model execution time

5.1 Results on frequent activities

To provide a more comprehensive analysis of the different class imbalance approaches on multi-resident activity recognition datasets, we extended our experiments by selecting the top five activities of each dataset and performing classification with the same LSTM and BiLSTM models as described above. Since the distribution remains imbalanced even after oversampling and undersampling, we selected the top five activities in all datasets to analyze model performance on frequent activities. The model configurations are exactly the same as in the previous experiments; the ARAS results are shown in Tables 7, 8, 9 and 10. The model configurations for the CASAS frequent-activity experiments are likewise the same as in the previous CASAS experiments; Tables 11 and 12 present the results of the class imbalance techniques on the frequent activities of the CASAS-Kyoto dataset.

Table 7 LSTM-House A (ARAS)
Table 8 BiLSTM-House A (ARAS)
Table 9 LSTM-House B (ARAS)
Table 10 BiLSTM-House B (ARAS)
Table 11 LSTM-(CASAS-Kyoto)
Table 12 BiLSTM-(CASAS-Kyoto)

The EMR, balanced accuracy, and F1-score for both House A and House B of the ARAS dataset improved considerably compared with the previous experiments when only the frequent activities were used, which also makes the dataset more balanced and thus improves the performance of the LSTM and BiLSTM models. The EMR results improved substantially over the previous experiments but are similar across all approaches within each dataset. In the balanced accuracy results for frequent activities, the cost-sensitive approach again performed best in most cases compared with the oversampling and undersampling methods. There were a few exceptions, such as in House B, where for R2's activity classification the oversampling approach with the LSTM, and the baseline model with the BiLSTM, performed better than the other approaches. However, the cost-sensitive approach performed comparably in these cases; for example, the results of cost single (2) and single oversample (2) are almost equal for the LSTM model. Similarly, for the BiLSTM network in House B, the difference between the baseline and cost-multi (5) is small. In the CASAS-Kyoto dataset, the multi-undersampling method performed better with the BiLSTM network for both residents; however, in the per-class F1-score results, the cost-sensitive method performed better in classifying the minority classes.

The CASAS-Kyoto dataset showed improvement with the class imbalance techniques (Tables 11, 12) relative to the baseline models: in the LSTM model, the cost-sensitive method outperformed all the other methods, and in the BiLSTM model the undersampling approach performed better, although the per-class F1-scores of the cost-sensitive approach are almost identical to those of the undersampling method. The micro-averaged F1-score of both House A and House B improved considerably in the frequent-activity experiments, whereas in CASAS it did not improve much. This may be due to the "curse of dimensionality" in SH datasets, as not all sensors are relevant to the classification and high dimensionality degrades classifier performance. Furthermore, the CASAS-Kyoto dataset shows clear differences in model performance between class imbalance techniques, whereas no such clear trend is observed for the ARAS dataset. This can be attributed to the fact that the CASAS-Kyoto dataset is relatively balanced, whereas the ARAS dataset is highly imbalanced.

6 Conclusion

In the realm of multiple resident activity recognition, which is integral to the enhancement of smart technologies [75], elder care [76], and ambient assisted living systems [77], as well as safety and context-aware applications [78], the importance of explainability, retraceability, and human interpretability cannot be overstated. Explainability is paramount in the application of complex models such as LSTM and BiLSTM networks, as it fosters trust and acceptance among users and stakeholders. The ability to interpret model decisions is critical, especially in environments such as health care and ambient assisted living, where decisions must be transparent and justifiable. This study, through the lens of class imbalance techniques, has not only sought to enhance the accuracy of activity recognition systems but also contributed to the field of explainable AI by exploring how different techniques can influence model interpretability. Retraceability, the ability to audit the data and processes that lead to a particular model decision, is essential for compliance with regulatory frameworks that govern AI systems, particularly in Europe where the right to explanation is an emerging requirement [79]. By meticulously documenting the experimental setup, including data processing, model configuration, and the application of class imbalance techniques, this study provides a blueprint for retraceability. Human interpretability is inherently linked with the first two concepts, emphasizing the need for model predictions to be understandable by humans. This is especially pertinent when AI systems are used to support decision-making in critical settings. The study’s findings suggest that cost-sensitive learning can improve performance metrics, such as balanced accuracy, which is a step toward making AI decisions more interpretable. The interpretability of such approaches must be further investigated to ensure that users can comprehend and trust the system’s predictions. The discussion on scalability also implicitly acknowledges the challenges of maintaining explainability and interpretability as the complexity of the environment increases. With more residents and potentially more complex class distributions, the importance of designing AI systems that are not only accurate but also explainable and interpretable becomes more pronounced. Thus, this research does not merely present a set of algorithms for activity recognition but also paves the way for future studies that must consider these critical dimensions of AI development in order to be deemed trustworthy and human-centered.

To elevate the performance of our model, particularly concerning the minority class, future endeavors will be directed towards the examination of alternative deep learning architectures and hybrid methodologies that are adept at negotiating class imbalance within the multi-resident milieu [80]. In the pursuit of universal access in the information society, it is critical to align these technological advances with the principles of explainable AI, especially in the context of graph neural networks [81, 82]. The ambition is to construct models that are not only effective but also human-interpretable, particularly for temporal Smart Home (SH) datasets.

Human interpretability engenders an understanding of the reasoning behind network decisions, which in turn cultivates trust in the system, a necessity for the universal adoption of such technologies [83]. Furthermore, the deployment of AI in settings with diverse and potentially vulnerable populations accentuates the need for transparent and accountable systems.

As we forge ahead, it is imperative to recognize that novel evaluation paradigms are required to adequately assess the efficacy of such models within the context of imbalanced datasets [84]. These new paradigms must address not only the technical accuracy of model predictions but also the explainability and fairness [85] of these predictions to ensure that AI systems contribute positively to the inclusivity and accessibility of the information society. Thus, our future research is poised to contribute to this critical discourse, ensuring that the advancements in AI are both technically sound and ethically responsible, facilitating a more equitable information society.