The Happimeter
The Happimeter [28] is a system for tracking and predicting human mood. It provides feedback to individuals about their mood and what influences it. The hypothesis tested in this experiment is whether this feedback actually makes users happier. The Happimeter uses a smartwatch to collect the necessary information, and a website as well as a phone application to provide detailed insights.
The smartwatch is the most important component of the system, as it is responsible for collecting the body-sensing data. For the experiment described in this paper, users were equipped with either a Mobvoi TicWatch S2, a Mobvoi TicWatch E2, or a Mobvoi TicWatch S. These watches run Wear OS, and all three models had the following sensors. First, an accelerometer, which measured the acceleration force in m/s² on three physical axes (x, y, and z); thus, when the user moved the watch, e.g. by shaking or tilting it, an acceleration along one of these axes was recorded. Second, a step counter, which recorded the number of steps since the last sensor measurement. Third, a heart rate sensor, which collected the heart signal in beats per minute. Fourth, a microphone, which recorded the ambient noise expressed as an amplitude. Finally, a position sensor, which determined the user's location as longitude, latitude, and altitude. Each sensor ran automatically for 30 s every 20 min, collecting as many measurements as possible. In addition, exogenous variables, i.e. weather data and time-related data, were added by the system based on the GPS information provided by the watch.
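As a rough illustration, one collected sensor record could be represented as in the following minimal sketch; the field names are illustrative and not the actual Happimeter schema.

```python
from dataclasses import dataclass
from datetime import datetime

# One aggregated sensor record per 20-minute measurement cycle; the field names
# are illustrative, not the actual Happimeter schema.
@dataclass
class SensorRecord:
    timestamp: datetime
    accel_x: float      # acceleration force along the x axis, in m/s^2
    accel_y: float      # acceleration force along the y axis, in m/s^2
    accel_z: float      # acceleration force along the z axis, in m/s^2
    steps: int          # steps counted since the last measurement
    heart_rate: float   # beats per minute
    noise: float        # microphone amplitude
    latitude: float     # position sensor
    longitude: float
    altitude: float
```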
In order to build a customized individual model for each user, not only physiological information is necessary but also subjective information about the user's mood to train the system. Thus, a survey is integrated into the Happimeter application which asks users approximately every two hours about their activity level, happiness, and stress. At the same time, the survey presents the predicted mood level, telling wearers how the system thinks they feel. Users can either confirm the suggested mood or reject and correct it. For example, users are asked: "Your happiness prediction is 1 of 2. Is this prediction correct?" If not, they can enter the correct value on a sliding scale from zero to two, where zero represents that the user is not happy, one that the user is happy, and two that the user is very happy. The same applies to activity and stress (see Fig. 1 for an illustration).
The predictions are based on a machine learning algorithm which uses the user-entered mood and smartwatch sensor data. While new users start with a generic model for the prediction, individualized models are trained over time based on the feedback of the user. The difference between a generic model and an individualized model is that the latter is based only on the user's own mood input, whereas the former uses the mood input of every user in the system. Note that users can enter their sensor and mood data manually in addition to the predictions generated automatically, further increasing the accuracy of the system.
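The distinction between the two model types can be sketched as follows; this is a minimal illustration using a decision tree, with hypothetical column names such as "user_id" and "happiness", and does not claim to be the actual Happimeter training pipeline.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# A generic model is fit on the pooled labeled data of all users, whereas an
# individualized model is fit only on one user's rows. Column names such as
# "user_id" and "happiness" are illustrative.
def fit_generic_model(labeled: pd.DataFrame, feature_cols: list[str]) -> DecisionTreeClassifier:
    model = DecisionTreeClassifier()
    model.fit(labeled[feature_cols], labeled["happiness"])
    return model

def fit_individual_model(labeled: pd.DataFrame, user_id: int,
                         feature_cols: list[str]) -> DecisionTreeClassifier:
    own = labeled[labeled["user_id"] == user_id]
    model = DecisionTreeClassifier()
    model.fit(own[feature_cols], own["happiness"])
    return model
```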
A second key component of the Happimeter system for conducting virtual mirroring is the smartphone application, which is available for Android and iOS. Although users can also enter their mood data there, the primary purpose of this component is to monitor emotions and to manage the social network. Users can look at their past survey inputs and examine how happy they were on a specific day (see Fig. 2a). Further, users can evaluate their mood input associated with the geolocation. They can label specific places, such as "Home" or "Friend's Place". This function allows users to analyze accurately how they felt in certain places (see Fig. 2b, c).
Furthermore, the app can be used to manage the social network, including functions such as sending and accepting friend requests, controlling privacy options for mood sharing, and unfriending somebody. Users not only have access to a friend's happiness, activity, and stress level but can also be given access to the friend's geolocation (see Fig. 2a).
Finally, the application sends notifications to users who, according to the Happimeter predictions, are not happy. Based on the system's prediction of happiness, activity, and stress, different notifications are sent. For example, if the system predicts that the user is very unhappy, very active, and very stressed, it suggests taking a long nap. The default interventions are shown in Fig. 10.
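Such a mapping from predicted levels to default interventions might look like the sketch below; only the long-nap rule comes from the text, and the key encoding and message wording are illustrative assumptions.

```python
from typing import Optional

# Rule table mapping predicted (happiness, activity, stress) levels, each on the
# 0-2 scale, to a default intervention. Only the long-nap rule is taken from the
# text; the key encoding and message wording are illustrative.
DEFAULT_INTERVENTIONS: dict[tuple[int, int, int], str] = {
    (0, 2, 2): "You seem very stressed and very active but unhappy - take a long nap.",
}

def pick_intervention(happiness: int, activity: int, stress: int) -> Optional[str]:
    return DEFAULT_INTERVENTIONS.get((happiness, activity, stress))
```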
The last component of the Happimeter system is the website. It can be used to review the collected measures and mood inputs. On the dashboard, users can see their mood inputs over the last 30 days (see Fig. 11), as well as the system's prediction based on the user's latest sensor data. Another function is to provide insights about the drivers of a user's mood: users can see which variables influence their mood the most (see Fig. 12). Further, they can see by whom they are influenced and on whom they exert influence (see Fig. 13). Moreover, the website can be used to modify the default interventions, so that each user can set up individual notifications. Finally, a user can create a team consisting of multiple users. For each team, the mood inputs are aggregated, the average is calculated, and the information is displayed on the website.
Experimental setup
A real-world experiment was conducted with 22 full-time employees of the Sparkassen Innovation Hub (S-Hub) in Hamburg, Germany, from May 1, 2019 to August 19, 2019. Due to allergic reactions to the strap, four employees aborted the experiment prematurely after one month, leaving 18 participants, of whom 16 (88.89%) were male and 2 (11.11%) were female. Every participant was equipped with a smartwatch at the beginning of the experiment. After setting up the watches, each participant was given a few days to get used to the system. Furthermore, this time was necessary to replace the generic happiness prediction models with individualized ones.
To verify the reliability of the happiness, activity, and stress predictions, one model was created for each question for each user. Sensor data of each participant were collected approximately every twenty minutes during working hours. Note that sometimes the participants forgot to wear their watches, or the sensors failed to record any information because of hardware or software issues. Further, the watch asked participants approximately every two hours about their happiness, activity, and stress. Both sensor and mood data were used to create models with different machine learning algorithms, i.e. decision trees, gradient boosting with decision trees, random forests, support vector machines (SVM), feedforward neural networks, and long short-term memory (LSTM) models. Because user-entered information about the user's mood was available, the algorithms could easily be evaluated by comparing real mood values with predicted mood values.
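A sketch of how several of these algorithms could be compared for one user and one question with scikit-learn is shown below; the feedforward neural network and LSTM are omitted, as they would require a deep-learning framework, and the hyperparameters are library defaults rather than the study's configuration.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Compare several of the listed algorithms for one user and one question.
# X holds the sensor features, y the user-entered mood labels; hyperparameters
# are scikit-learn defaults and need not match the study's configuration.
def compare_models(X: pd.DataFrame, y: pd.Series) -> dict[str, float]:
    candidates = {
        "decision_tree": DecisionTreeClassifier(),
        "gradient_boosting": GradientBoostingClassifier(),
        "random_forest": RandomForestClassifier(),
        "svm": SVC(),
    }
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return {
        name: cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
        for name, model in candidates.items()
    }
```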
To explore whether feedback and recommendations affect happiness, stress, and activity, i.e. whether the Heisenberg effect through virtual mirroring indeed exists, the participants were divided equally into two groups, a control group and an experimental group. The latter had access to all functionalities mentioned above. Members of the experimental group received recommendations based on their predictions; these recommendations were sent approximately every three hours during business hours. Further, they had insights into their drivers, past mood inputs, and geolocations, and they could use the social network functionalities. In contrast, the control group had none of the Happimeter functionalities. These users were just asked to wear the watches to provide sensor data and to answer the questions to provide mood data. Their access to the website was denied, and the functionalities in the phone application were restricted so that participants could not see their last mood inputs. Further, when asking about the mood, the watch did not provide any predictions; instead, members of the control group were only asked about their happiness, activity, and stress levels (see Fig. 3 for an illustration).
Feature selection
Feature selection is crucial in machine learning, as it can have a significant impact on the outcome of the algorithms. In this paper, recursive feature elimination (RFE) [29] was used to create a feature set. In RFE, the classifier is first trained with all features. Next, the least important feature is removed. This procedure is repeated until a single feature is left. The best feature set is the one that achieves the highest accuracy. For the creation of the final feature set, a decision tree was used as the estimator, since this algorithm can return the importance of each feature, which, in turn, can be used to determine the best features.
RFE was applied to all available features, including the nine features resulting from the five sensors of the Mobvoi smartwatches described above, plus weather information, i.e. temperature, humidity, pressure, wind, clouds, and weather, as well as time-related information, i.e. hour, weekday, and session, where the session denotes the part of the day. The best feature set found by RFE based on a decision tree is presented in Table 1.
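This procedure can be sketched with scikit-learn's RFECV, which wraps the removal loop described above and keeps the subset with the best cross-validated score; the sketch assumes the candidate features are available as a DataFrame and is not the exact implementation used in the study.

```python
import pandas as pd
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Recursive feature elimination with a decision-tree estimator: one feature is
# dropped per round by importance, and the subset with the best cross-validated
# accuracy is kept. X holds the candidate features, y the mood labels.
def select_features(X: pd.DataFrame, y: pd.Series) -> list[str]:
    selector = RFECV(
        estimator=DecisionTreeClassifier(),
        step=1,
        cv=StratifiedKFold(n_splits=10),
        scoring="accuracy",
    )
    selector.fit(X, y)
    return list(X.columns[selector.support_])
```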
Table 1 Best feature set using RFE

One model was built for each question for each user, i.e., each user had three different models: one for predicting happiness, one for predicting activity, and one for predicting stress. In order to build these models, it was necessary to apply some preprocessing steps. These steps are explained in the following.
To create the training and testing instances, each mood data entry was combined with one instance of sensor data by searching for the sensor record whose timestamp was closest to that of the mood entry. Unfortunately, a match could not always be found between mood and sensor data, and thus some information could not be considered when fitting the models. The answers to the different questions were used as labels.
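One possible way to implement this nearest-timestamp matching is pandas' merge_asof, sketched below; the column names, the 30-minute tolerance, and the column checked for missing matches are illustrative assumptions.

```python
import pandas as pd

# Pair each mood entry with the sensor record whose timestamp is closest.
# Column names and the 30-minute tolerance are illustrative; mood entries
# without a sensor record inside the tolerance are dropped, mirroring the
# unmatched cases described above.
def build_instances(mood: pd.DataFrame, sensors: pd.DataFrame) -> pd.DataFrame:
    mood = mood.sort_values("timestamp")
    sensors = sensors.sort_values("timestamp")
    merged = pd.merge_asof(
        mood,
        sensors,
        on="timestamp",
        direction="nearest",
        tolerance=pd.Timedelta("30min"),
    )
    return merged.dropna(subset=["heart_rate"])  # NaN here means no match was found
```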
During the preprocessing step, columns that either had more 'not-a-number' (NaN) values than numerical values or whose standard deviation was zero were dropped. Remaining NaN values in the columns that had not been dropped were replaced with the mean of the respective column. Further, one weather variable had to be encoded so that its values were numerical. To cope with the imbalanced dataset, cost-sensitive learning [29] was taken into account when building the models. Additionally, ten-fold cross-validation with stratification was used when evaluating the generalization performance of each algorithm.
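A minimal sketch of these preprocessing and evaluation steps could look as follows; the column names are illustrative, all remaining columns are assumed numeric after encoding, and class weights stand in for the cost-sensitive learning, so the study's actual implementation may differ.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Sketch of the preprocessing above; assumes df contains only feature columns
# and that all of them are numeric once the weather description is encoded.
def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # Encode the categorical weather variable as integer codes (illustrative name).
    if "weather" in df.columns:
        df["weather"] = df["weather"].astype("category").cat.codes
    # Drop columns with more NaN than numeric values, or zero standard deviation.
    keep = [
        c for c in df.columns
        if df[c].isna().sum() <= df[c].notna().sum() and df[c].std(skipna=True) != 0
    ]
    df = df[keep]
    # Replace the remaining NaN values with the column mean.
    return df.fillna(df.mean(numeric_only=True))

# Cost-sensitive learning approximated via class weights, evaluated with
# stratified ten-fold cross-validation (the study's exact setup may differ).
def evaluate(X: pd.DataFrame, y: pd.Series) -> float:
    clf = DecisionTreeClassifier(class_weight="balanced")
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    return cross_val_score(clf, X, y, cv=cv, scoring="accuracy").mean()
```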
Performance measures
Different performance metrics, i.e. accuracy, F1-score, and a combination of both, were calculated for the different feature sets, periods, and questions to evaluate and compare each algorithm. Although accuracy is a common evaluation metric, in this case it might not be a good indicator: when the class distribution in a dataset is skewed, the accuracy can be very high, which suggests the model is doing well even when it is not [30]. To overcome this drawback of accuracy, the F1-score, a combination of precision (the fraction of predicted positives that are true positives) and recall (the fraction of true positives that are predicted as positive), was used for evaluating the models. Mathematically, it is expressed as:
$$\text{F1} = 2\times \frac{\text{precision}\times \text{recall}}{\text{precision}+\text{recall}}$$
(1)
Finally, to compare different models, a single-number metric was necessary which takes accuracy and F1-score into account equally. Without such a metric, one model could be superior with respect to accuracy while another could be superior with respect to the F1-score. The combination was simply the mean of both metrics:
$$\frac{\text{Accuracy}+\text{F1}}{2}$$
(2)
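Computed for one model, these metrics could be combined as in the following sketch; macro-averaging of the F1-score over the three mood classes is an assumption about the averaging used.

```python
from sklearn.metrics import accuracy_score, f1_score

# Accuracy, F1-score (Eq. 1) and their mean (Eq. 2) for one model; macro
# averaging over the three mood classes is an assumption.
def combined_metric(y_true, y_pred) -> float:
    acc = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average="macro")
    return (acc + f1) / 2
```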
The metrics were calculated for all three models of each user. For the final evaluation, the per-user performances were summarized and the average was taken.
To infer the meaning of these metrics, it was necessary to establish a baseline against which the models' performance could be compared. Such a baseline can evince performance improvements while pointing out superior algorithms. Due to the highly imbalanced dataset, it was reasonable to use a majority classifier, also called ZeroR, as the baseline: a naïve classifier that always predicts the most frequent class in the training set [30]. The bottom line is that all learned models should at least outperform the majority classifier.
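The ZeroR baseline can be reproduced, for example, with scikit-learn's DummyClassifier; in this sketch the train/test split and the macro-averaged F1-score are assumptions for illustration.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# ZeroR baseline: always predict the most frequent class of the training data,
# then score it with the same metrics as the learned models.
def baseline_scores(X_train, y_train, X_test, y_test) -> dict[str, float]:
    zero_r = DummyClassifier(strategy="most_frequent")
    zero_r.fit(X_train, y_train)
    pred = zero_r.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "f1": f1_score(y_test, pred, average="macro"),
    }
```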