1 Introduction

Driver distraction through secondary tasks, e.g., phone usage, is one of the major causes of road accidents [23], and avoiding it has been a driving force behind technological advances. Such secondary tasks pull the drivers' eyes off the road, their minds away from driving and their hands away from the steering wheel. Consequently, the detection of driver distraction is a popular research topic, and vehicle manufacturers increasingly implement proprietary distraction detection systems to prevent accidents. With smartphone penetration (according to connections) continuing to rise to over 75% of the population [3], the number of messages received (via email, SMS, messenger apps, etc.) is steadily increasing. Studies show that people check their smartphones about 6 times an hour in daily life to see whether any new messages have arrived. Young drivers in particular check their smartphone on average 1.71 times per minute while driving for various reasons, e.g., to text, surf the internet, listen to music or watch videos, and those who are addicted to their smartphones use them dangerously while driving [22]. However, many distraction detection applications only register whether someone is actively answering a call or using a messenger app, not whether they are merely checking their status, which appears to be a more frequent activity.

Smartwatches can be regarded as a smaller version of devices such as smartphones [8], causing a similar frequency of notifications and status checking. While their market has been growing steadily (with a penetration rate of approximately 2.69% in 2021) [1, 8], studies have reported them to be even more distracting than smartphones while driving [6]. Nonetheless, only a very small number of studies have been conducted on the impact of smartwatch usage on driving [6].

Nevertheless, smartphones are rising as important platforms for general mobile applications (including applications developed for smartwatches) and for the transport and mobility sector in particular, in what is called smartphone-based vehicle telematics [35]. Examples include application-based vehicle information systems [19] and fleet management applications, which, if used appropriately, can even help prevent risky driving behaviour [21]. Research has recently looked at the use of the smartphone and its integrated sensors, as well as smartwatches, as a basis for application-based driver monitoring systems (e.g., [6, 12]). Smartphone-based driver monitoring systems can provide important added value as they can be used to retrofit older vehicles, since they do not rely on the vehicle's sensors and actuators. They make use of the large number of built-in sensors offered by the devices, of processors that offer high computational power, and of efficient means of wireless data transfer and communication. Moreover, the number of developers producing applications for them has seen unprecedented growth because smartphones have a huge market, and smartphone-based solutions are generally updated more frequently than vehicle-based systems and are therefore much more scalable, upgradable and cheap [14]. Smartphone-based solutions for the transport and mobility sector (and others) are a natural way of providing instant driver feedback via audio-visual means, enabling the smooth integration of notifications and driving activity. At the same time, limitations with respect to battery and sensor quality, and the fact that the built-in sensors might be decoupled from the orientation and positioning of the vehicle, can pose challenges to their uptake.

Overall, monitoring driving behaviour, and especially distraction detection, can have a strong impact on traffic safety, but can also lead to improvements in fuel or energy consumption and gas emissions. Recognising or preventing such behaviour plays an important role in generating a safety score for a driver, which can increase overall safety and promote economical driving. In addition, monitoring driving behaviour can address the needs of multiple markets: vehicle manufacturers, car insurance, fleet management and fuel consumption/optimisation.

The aim of this book chapter is to propose solutions in which smartphones and wearables such as smartwatches reduce risky driving rather than causing road accidents. In Sect. 2 we will present definitions and the background of the driver distraction area. Then, we will describe the design of our solution (Sect. 3), proposing three Machine Learning applications for detecting drivers' distractions (Sect. 4) and a web application (Sect. 5) presenting the information as a dashboard. Before concluding, we will report the work related to ours (Sect. 6).

2 Definitions and Background

In [27], a general overview of the term "driver distraction" is given: what it means, how it relates to driver inattention, types and sources of driver distraction, factors that moderate the effects of distraction on driving, the interference that can derive from distraction, theories that seek to explain this interference, the impact of distraction on driving performance and safety, and strategies for mitigating the effects of driver distraction. Although some inconsistencies are reported in the definitions found in the literature, and different relations to inattentive driving are discussed by the authors, the following key elements characterising driver distraction emerge [27]: driver distraction involves the diversion of attention away from driving, or away from activities critical for safe driving, toward a competing activity. This activity may originate from inside or outside the vehicle, and it may be driving- or non-driving-related. Driver distraction is moreover a subset of driver inattention, and relates to something (a task, object, or person) that diverts the attention the driver needs to perform the driving task adequately [29].

Driver distraction is a very broad topic. The focus of this chapter is exclusively on methods that use the smartphone and wearables as sensor devices to detect distraction and to communicate detected distraction to the driver. This chapter thus describes a conceptual framework that uses smart devices for distracted driving detection. The overall architecture of the concept and a series of initial proof-of-concept experiments are presented. Artificial Intelligence (AI) is used to assess driving distractions using smart devices in a comprehensive manner.

3 System Design

The aim of our system is to implement and test different approaches that could be used for detecting distracted driving events. The philosophy behind it is that a smart device can be used to detect the distraction it is causing. For example, a smartphone used by the driver can at the same time predict such events; the same idea applies to a wearable such as a smartwatch. However, each device's predictions concern the usage of that device itself. In addition, we include a camera-based approach that detects the driver's activity. The camera-based solution, using computer vision techniques on images, enables detecting distractions caused by wearables without running on the device itself, and can be extended to other sources of distraction such as radio manipulation, talking to a passenger, etc.

Our conceptual framework makes use of smart devices for distracted driving detection at its core. An overview of its architecture is presented in Fig. 1. It comprises the following 4 main components:

  1. Smartphone application for driving distraction detection developed by VIF

  2. Smartwatch application for driving distraction detection developed by RISE

  3. Camera-based system for activity labelling developed by Tietoevry

  4. Dashboard application summarising the driving session developed by JIG

The first three components are based on Machine Learning and share a similar process: data collection, data pre-processing, model training, testing, validation and deployment. For some of them, a dedicated app is developed to perform the inference on the device itself. The dashboard application offers an offline visualisation of a driving session in which distractions have been detected by the three other components. The goal is to implement a collaboratively designed dashboard including the information provided by all the Machine Learning-based components.

Fig. 1

Conceptual framework of the system design for distracted driver monitoring with smartphones and wearables

In the following sections we will describe the process of building each component. As the process is similar for the Machine Learning-based components, we will describe them together in one subsection.

4 Machine Learning-Based Components

In this section, we present the three Machine Learning-based applications developed for the smartphone, the smartwatch and the computer vision system. We first provide their definitions and describe the architecture of the applications. Then, we present them according to the Machine Learning process, which is composed of three steps:

  • Data acquisition and pre-processing

  • Machine Learning model training and experimental results of model fitting

  • Model deployment on smart devices.

4.1 Use Case Definition and Components' Architecture

4.1.1 Smartphone Application

This component is responsible for providing a smartphone application capable of collecting sensor data and of analysing and preparing the data for detecting driving distraction caused by the use of the smartphone. A previous author-centric literature review [20] showed that most driver distraction detection systems are based on proprietary hardware, and that there is comparatively scarce research on how to use the smartphone as a sensor and detect critical events on the edge device. While proprietary hardware often prevents data access and thus makes it difficult for researchers to compare algorithms, accuracy and trustworthiness of results, smartphones are increasingly powerful devices that are always connected to the internet. Therefore, the development of this component aims to offer an accessible, efficient and ordered method of data collection for research purposes. To achieve this, a dedicated smartphone application is developed to detect potential smartphone usage by drivers. The application is capable of collecting smartphone sensor data, which is later used in a Machine Learning model. Data collection includes data cleaning, pre-processing and storing. We explore to what extent it is possible to run Machine Learning models on smartphones, given the increasing availability of powerful devices. As a privacy requirement, data may alternatively be chosen not to leave the user's smartphone. Hence, driver phone usage is detected via smartphone sensor data classification using a model on the smartphone.

Fig. 2

Smartphone application sub-components architecture

The smartphone application is implemented using Flutter, with Android as the target platform. The application's modules are shown in Fig. 2. Raw values from the smartphone sensors are collected and can be processed in three alternative ways: (i) pass-through: the raw values are transmitted to the storage without being processed; (ii) analytical: the raw values go through an analytical module where they are modified by applying mathematical operations (e.g., applying a rotation to Inertial Measurement Unit (IMU) values); (iii) edge device AI: the raw values are passed to an AI model and the output of the AI model is then stored. These processing ways may also be combined. The following smartphone data is collected:

  • Timestamp: absolute time of the sensor recording,

  • IMU (gyroscope and accelerometer) with customisable frequencies ranging from 1 to 20 Hz (5 Hz steps),

  • GPS coordinates with customisable frequencies ranging from 1 to 5 Hz (1 Hz steps),

  • Screen state, and,

  • Moving state (i.e., walking, running, biking, or driving).

The goal of the application is to be able to predict distraction and attention states based on smartphone data only.
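The three processing ways described above can be sketched as a small dispatch over raw sensor samples. The following is a minimal Python illustration, not the actual Flutter implementation; the function names and the placeholder threshold "model" are assumptions for illustration only.

```python
from typing import Callable, Dict, List

def pass_through(sample: Dict[str, float]) -> Dict[str, float]:
    # (i) raw values go to storage unmodified
    return sample

def analytical(sample: Dict[str, float]) -> Dict[str, float]:
    # (ii) apply a mathematical operation; here, negating IMU axes stands in
    # for a rotation applied to the IMU values
    return {k: (-v if k.startswith(("acc", "gyr")) else v) for k, v in sample.items()}

def edge_ai(sample: Dict[str, float],
            model: Callable[[Dict[str, float]], int]) -> Dict[str, int]:
    # (iii) store the model output instead of the raw values
    return {"distracted": model(sample)}

# Placeholder "model": flags high lateral acceleration (hypothetical)
stub_model = lambda s: int(abs(s["acc_x"]) > 1.0)

sample = {"acc_x": 1.5, "acc_y": 0.1, "acc_z": 9.8,
          "gyr_x": 0.0, "gyr_y": 0.2, "gyr_z": 0.1}
stored: List[dict] = [pass_through(sample), analytical(sample), edge_ai(sample, stub_model)]
```

The modes can be combined by chaining the functions before storage, mirroring the combined processing mentioned in the text.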

4.1.2 Smartwatch Application

Smartwatches are increasingly becoming a source of driving distraction, as they can be used for calling, texting and receiving notifications. At the same time, they can be used to detect driver smartwatch distractions, provided they are fed with appropriate smartwatch distraction data. This component is responsible for exactly this: collecting smartwatch usage data while driving and running Machine Learning models on it. The smartwatch application thus collects sensor data and shows basic trip information (like trip date, duration and distance covered) along with distraction data. The development targets data collection and driver training for research purposes. The smartwatch application's sub-components and data flow are shown in Fig. 3.

Fig. 3

Smartwatch application sub-components architecture and data flow

The diagram supports the following scenario: the driver distraction application is initiated and used to inform the driver about distraction events occurring during trips. The driver uses a smartwatch application that collects sensor data of wrist movements and a companion smartphone application that collects sensor data from the smartphone. A video camera records the trip at the same time; the recording is used for labelling purposes only. The camera captures only body parts, so that the person is not identifiable. A labelled dataset is then produced, which is used for Machine Learning model creation, training, testing and validation. The trained model is then included in the smartwatch application for driver distraction detection. A sound is played when the application detects a distraction. The following smartwatch data is collected:

  • Timestamp: absolute time of the sensor recording,

  • Seconds elapsed: time expressed relative to the start of the data collection session,

  • IMU (gyroscope, gravity and accelerometer) with customisable frequencies ranging from 1 to 200 Hz,

  • GPS latitude and longitude,

  • Activity: a value corresponding to an activity which might involve distraction or not,

  • Distraction: a boolean value indicating whether a distraction occurred (1 = distracted, 0 = not distracted).

The application helps drivers detect distracted driving and summarises, at the end of the trip, the trip's start time, its duration and the times at which a distraction took place. A history of trips can also be viewed. Finally, all trip data collected on the smartwatch can be deleted by the user at any time.

4.1.3 Computer Vision Application

One potential limitation of the two applications above is that each is dedicated to a particular smart device and a particular set of distraction events. Therefore, we propose an additional application, a computer vision solution, to extend the scope and include detection of distracted driving based on video recordings. These distractions are not necessarily limited to the usage of smart devices but can be generalised to other behaviours such as talking to a passenger, reaching for something behind the driver, etc. Our motivation relies on the automotive industry roadmap, where vehicles embed more and more sophisticated systems to improve passengers' safety. In particular, the EuroNCAP roadmap [25] reports that an increasing level of road assistance is expected, including systems able to ensure that the driver remains engaged in the driving task, e.g., keeping hands on the steering wheel and eyes on the road.

If an efficient and accurate computer vision system for detecting drivers' distractions is implemented, this application could also be used to label the data generated by the smart devices and be responsible for keeping such data in the vehicle. The aim of this computer vision-based solution is therefore to build a Machine Learning algorithm able to classify the following 3 events based on drivers' images: driving normally, using smartphone and using smartwatch. To achieve this, we follow a transfer learning procedure to train a computer vision model on a custom dataset of images, as explained below.

Training a computer vision model usually demands a lot of data. In our situation, taking into consideration that we want to execute the algorithms inside a vehicle, where the computation resources are limited, we combine an openly available dataset of images (the Statefarm dataset [18]) with a dataset that we created on our own. A lot of prior work exists in the literature where computer vision algorithms have been trained and tested on the Statefarm dataset. Most of them perform well, frequently reaching more than \(96\%\) accuracy [4, 15, 30, 31]. Based on that, we report the performance of a state-of-the-art computer vision algorithm on our custom dataset.

4.2 Data Acquisition and Pre-processing

4.2.1 Smartphone Application

The smartphone data collection was conducted in two test drives in which participant drivers performed certain actions. Each action is related either to a distraction state or an attention state. Each test drive took about 140 minutes, with different cars and drivers participating. IMU data was collected at 50 Hz and GPS at 1 Hz. Furthermore, the distraction state, either distraction or no distraction, was recorded by a co-driver. We only considered smartphone-induced distraction, more precisely phone calls and application usage. We considered 4 classes: two classes when the smartphone is active (the smartphone is used to place a phone call or for app usage) and two when the smartphone is resting (in the middle console or on the smartphone holder).

Data pre-processing describes the process of combining the different data sources and unifying the data. In our case, each test drive generated two CSV files (driver and co-driver data), which were then combined. Faulty data was removed and a combined time reference frame was created. The final dataset contains eight measurements for each timestamp, i.e., three acceleration measurements, three gyroscope measurements and two GPS measurements.
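The merging step can be illustrated with a small sketch. We assume here that the two CSV files share a timestamp column; the column names and the join logic are illustrative, not the actual pipeline.

```python
import csv
import io

# Toy driver and co-driver CSV data; in the real pipeline these come from files.
driver_csv = io.StringIO(
    "timestamp,acc_x,acc_y,acc_z,gyr_x,gyr_y,gyr_z,lat,lon\n"
    "1.0,0.1,0.0,9.8,0.01,0.02,0.0,47.06,15.44\n"
    "2.0,0.2,0.1,9.7,0.00,0.01,0.0,47.06,15.44\n"
)
codriver_csv = io.StringIO(
    "timestamp,label\n"
    "1.0,distraction\n"
    "2.0,no distraction\n"
)

# Index the co-driver labels by timestamp.
labels = {row["timestamp"]: row["label"] for row in csv.DictReader(codriver_csv)}

# Join on the shared timestamp; rows without a label (faulty data) are dropped.
merged = [
    {**row, "label": labels[row["timestamp"]]}
    for row in csv.DictReader(driver_csv)
    if row["timestamp"] in labels
]
```

Each merged row then carries the eight sensor measurements plus the co-driver's state label for that timestamp.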

4.2.2 Smartwatch Application

The smartwatch data collection was carried out in three test drives. Each test drive took approximately 120 minutes and two persons (one male and one female) participated. Videos were recorded during the drives and used for labelling the distractions (cf. Fig. 4). Similar frequencies as for the smartphone were used for the smartwatch sensors: gyroscope, gravity and accelerometer data was collected at 50 Hz and GPS at 0.1 Hz.

Fig. 4

Video recorded during test drives for driver distraction

Once the sensor data were collected, human annotators started the labelling process based on the video recording of the events of a driving session. For each driving session, three CSV files are generated and combined together (see Fig. 3): the first contains the time series of the wrist motion with gyroscope, gravity and accelerometer measurements; the second contains the location at different timestamps; and the third contains the label of each time period, i.e., distraction or attention. The label distraction was used for those intervals in which a person was looking at the watch screen, or interacting with the watch by tapping on the screen or using the scroll wheel on its side. This dataset is highly imbalanced, with \({<} 5\%\) of the events annotated as distractions.
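The labelling step, assigning a distraction flag to each sensor sample from the annotated video intervals, can be sketched as follows; the interval representation and names are assumptions for illustration.

```python
# Annotated (start, end) intervals in seconds, taken from the video review,
# during which the driver looked at or interacted with the watch.
distraction_intervals = [(3.0, 5.0), (9.5, 11.0)]

def label_sample(t: float, intervals) -> int:
    """Return 1 if timestamp t falls inside any distraction interval, else 0."""
    return int(any(start <= t <= end for start, end in intervals))

timestamps = [1.0, 3.5, 6.0, 10.0]
labels = [label_sample(t, distraction_intervals) for t in timestamps]
# attention, distraction, attention, distraction
```

With short annotated intervals relative to trip length, this naturally produces the class imbalance noted above.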

4.2.3 Computer Vision Application

The data used for the computer vision-based application is composed of two different datasets. The first is an openly available dataset [18] used in a Kaggle competition, the Statefarm Distracted Driver prediction [2]. It is composed of 100,000 images, 20,000 of which are annotated and belong to 10 classes, e.g., driving normally, manipulating the radio, texting with the left/right hand, calling with the left/right hand, etc.

However, this dataset contains no class of events in which the driver is distracted by a wearable device, such as a smartwatch. To overcome this limitation, we decided to augment the Statefarm dataset by collecting data ourselves, performing a dozen driving sessions of 10-15 min each and using video recordings of these sessions. Each driving session was recorded using a smartphone installed on the headrest of the passenger seat, capturing at the same time both the side profile of the driver and the steering wheel. This setup mimics the one used in the Statefarm dataset; however, the type of camera, the resolution, the angle and the colour of the images are different, which resulted in a heterogeneous dataset (see Fig. 5). Two drivers, a male and a female, participated in these sessions. Images were extracted from the video recordings and labelled according to the three following categories: driving normally, using smartphone or using smartwatch.

Fig. 5

Example of two images from the training set showing the heterogeneity of the data. On the left, a picture from the Statefarm competition representing the pictures included in the Statefarm dataset [18]. On the right, an annotated image taken from a driving session. Original image size, angle, colour and view are different

The construction of our new training dataset, based on the Statefarm dataset and our own recordings, is briefly explained here. From the Statefarm dataset, we used the driving normally class, combined the 4 classes related to the usage of a smartphone (texting left/right, calling left/right) into one class, and removed the remaining 5 classes. Then, we added all the images of one of our drivers to this training set. We built a separate test set, in which we manually annotated images from the Statefarm test set (using only the driving normally and using phone classes) and added the second driver from our own dataset. The statistics of this dataset are presented in Table 1.

We also adjusted the images to a common resolution, resizing the images of both the training and test sets to \(224\times 224\) pixels.

Table 1 Computer vision dataset statistics: number of images per category

The minority class, using smartwatch, has only a few hundred examples, compared to the two other, much larger classes. This may impact the performance of our Machine Learning-based models.

4.3 Machine Learning Model Training and Experimental Results

4.3.1 Smartphone Application

Fitting a model to the smartphone data was done using a neural network (NN) with a Long Short-Term Memory (LSTM) layer. An overview of the NN architecture can be seen in Fig. 6. It contains 6 neurons in the input layer, 32 in the central layer, 16 in the hidden layer and 2 in the output layer. We train the model from scratch on the collected data.
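The architecture can be sketched in Keras as follows. Only the layer sizes (6 input, 32 LSTM, 16 hidden, 2 output) come from the text; the window length, dropout rate, activations and optimiser are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_smartphone_model(timesteps: int = 20) -> tf.keras.Model:
    # 6 input features: 3 accelerometer + 3 gyroscope axes per timestep
    inputs = tf.keras.Input(shape=(timesteps, 6))
    x = layers.LSTM(32)(inputs)                          # central layer, 32 neurons
    x = layers.Dropout(0.2)(x)                           # regularisation (assumed rate)
    x = layers.Dense(16, activation="relu")(x)           # hidden layer, 16 neurons
    outputs = layers.Dense(2, activation="softmax")(x)   # distraction / attention
    return tf.keras.Model(inputs, outputs)

model = build_smartphone_model()
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

The model is trained from scratch on the windowed IMU sequences, as described in the text.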

Fig. 6

The neural network architecture used for smartphone distraction detection

For the smartphone data, a 70%/15%/15% split was used to create training, validation and test data. The training data was used to adapt the model's weights. The validation data was not directly used in the training process but served to evaluate the current model's performance independently of the training data; based on the validation data, the training process was stopped. The test data gives an unbiased evaluation of the model on data not involved at all in the training/validation process. Models are evaluated with the accuracy measure (see below).
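The 70%/15%/15% split can be sketched as below; the shuffling and the helper function are illustrative, with a fixed seed assumed for reproducibility.

```python
import random

def train_val_test_split(samples, seed: int = 42):
    """Shuffle and split into 70% train, 15% validation, 15% test."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.70 * n)
    n_val = int(0.15 * n)
    return (shuffled[:n_train],                    # adapts the model's weights
            shuffled[n_train:n_train + n_val],     # monitors training, decides stopping
            shuffled[n_train + n_val:])            # unbiased final evaluation

train, val, test = train_val_test_split(list(range(100)))
```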

We carried out experiments fitting models to predict the correct state (distraction or attention) based on our collected IMU dataset. The best model achieved an accuracy of 94%. The transitions between the distraction and non-distraction states were especially responsible for the errors, see Fig. 7.

Fig. 7

Model performance of the AI model created for the smartphone data. Lines represent IMU measurements, the background the distraction state (grey represents distraction). The top background represents the labels, the bottom background the model output

4.3.2 Smartwatch Application

In this application, as data similar to the smartphone application's were collected, we used a similar Machine Learning model as well. An LSTM-based model was designed and is presented in Fig. 8. To deal with overfitting, as our dataset is small, we added a Dropout layer after the LSTM layer. The architecture comprises 9 neurons in the input layer, 4 in the central layer, 16 in the hidden layer and 1 neuron in the output layer. We then trained the model from scratch on our dataset to detect when a smartwatch distraction occurs.
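Analogously to the smartphone model, the smartwatch architecture can be sketched in Keras. The layer sizes (9 input, 4 LSTM, 16 hidden, 1 output) follow the text; the window length and dropout rate are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_smartwatch_model(timesteps: int = 50) -> tf.keras.Model:
    # 9 input features: gyroscope, gravity and accelerometer, 3 axes each
    inputs = tf.keras.Input(shape=(timesteps, 9))
    x = layers.LSTM(4)(inputs)                           # small central layer
    x = layers.Dropout(0.2)(x)                           # added against overfitting
    x = layers.Dense(16, activation="relu")(x)           # hidden layer
    outputs = layers.Dense(1, activation="sigmoid")(x)   # distraction probability
    return tf.keras.Model(inputs, outputs)

watch_model = build_smartwatch_model()
```

A sigmoid single-neuron output fits the boolean distraction label (1 = distracted, 0 = not distracted) in the collected data.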

Fig. 8

The neural network architecture used for smartwatch distraction detection

Similar to the smartphone application, a 70%/15%/15% data split was used to create training, validation, and test data and the models are evaluated with the accuracy measure (see below).

We carried out experiments fitting models to predict the correct state (distraction or not) based on the dataset collected during our driving sessions. The best model achieved an accuracy of 94% on the test set; the evolution of the performance is reported in Fig. 9.

Fig. 9

Model's performance for the smartwatch data. The training, validation and test accuracy are reported. After a few epochs the model has already converged

4.3.3 Computer Vision Application

In our experiments, we tested various state-of-the-art computer vision algorithms such as VGG16 [32], Xception [7], MobileNetV2 [28] and EfficientNetB0 [34]. We opted for a transfer learning approach on our custom driver detection dataset with pre-trained models. Based on the preliminary results we obtained, we focused on the model offering the best compromise between accuracy and size; recall that the end goal is to be able to train and run the model on an embedded system inside a vehicle, where the computation resources are limited. Models based on EfficientNetB0 offered the best compromise between accuracy and model size. An overview of the transfer learning model architecture can be seen in Fig. 10.
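The transfer learning head on top of the EfficientNetB0 backbone can be sketched as below. The backbone is loaded without its classification head and frozen; `weights=None` keeps the sketch offline-friendly (actual transfer learning would use `weights="imagenet"`), and the dense size and dropout rate are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_vision_model(num_classes: int = 3) -> tf.keras.Model:
    # Backbone without its original classification head; frozen for transfer learning.
    backbone = tf.keras.applications.EfficientNetB0(
        include_top=False, weights=None, input_shape=(224, 224, 3))
    backbone.trainable = False

    inputs = tf.keras.Input(shape=(224, 224, 3))
    x = backbone(inputs)
    x = layers.GlobalAveragePooling2D()(x)        # pooling as in Fig. 10
    x = layers.Dense(128, activation="relu")(x)   # assumed head size
    x = layers.Dropout(0.3)(x)                    # assumed rate
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)

vision_model = build_vision_model()
```

Only the head is trained on the custom dataset, which keeps training feasible with the limited data and compute discussed above.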

Fig. 10

The neural network architecture used for camera-based distraction detection

However, all models' performances were quite low, especially for detecting smartwatch activity. Several reasons may explain this, such as the fact that smartwatch activity is the minority class (two orders of magnitude smaller than the majority class) and that the images from this class differ from the original Statefarm dataset, with only one driver represented (see Fig. 5).

To tackle this problem, we carried out experiments using data augmentation techniques to improve the robustness of our model. In each batch provided to the model, several geometrical transformations were randomly applied to each image:

$$\begin{aligned} \text {Zoom}&: [0.5, 2.5] \\ \text {Brightness}&: [0.5, 2.5] \\ \text {Rotation}&: [-30, 30] \\ \text {Flip}&: \{\text {Horizontal}\} \\ \text {Shift}_{\{\text {Height, Width}\}}&: [-0.3, 0.3] \end{aligned}$$
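These transformation ranges map directly onto Keras' `ImageDataGenerator`; this is one possible realisation of the augmentation, not necessarily the exact implementation used.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random geometric/photometric transformations matching the ranges above.
augmenter = ImageDataGenerator(
    zoom_range=[0.5, 2.5],        # Zoom in [0.5, 2.5]
    brightness_range=[0.5, 2.5],  # Brightness in [0.5, 2.5]
    rotation_range=30,            # Rotation in [-30, 30] degrees
    horizontal_flip=True,         # Flip: horizontal
    height_shift_range=0.3,       # Shift (height) in [-0.3, 0.3]
    width_shift_range=0.3,        # Shift (width) in [-0.3, 0.3]
)
```

Feeding training batches through such a generator applies a fresh random combination of these transformations to every image in every epoch.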

We used a 90%/10% split of the training data, the first part for training and the second for validation. We used an early stopping strategy, stopping when the validation loss had not decreased for 3 consecutive epochs. We then evaluated the model on a separate test set, computing the F1 score for each class (Driving normally, Using Smartphone and Using Smartwatch) and the overall accuracy. Both measures are defined below:

$$\begin{aligned} Accuracy &= \frac{\textit{TP}+\textit{TN}}{\textit{TP}+\textit{TN}+\textit{FP}+\textit{FN}} \\ F1 &= \frac{2*\textit{TP}}{2*\textit{TP}+\textit{FP}+\textit{FN}} \end{aligned}$$

where:

  • \(\textit{TP}\), the True Positives are the instances correctly predicted as Positive.

  • \(\textit{TN}\), the True Negatives are the instances correctly predicted as Negative.

  • \(\textit{FP}\), the False Positives are the instances wrongly predicted as Positive.

  • \(\textit{FN}\), the False Negatives are the instances wrongly predicted as Negative.
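Both measures can be computed directly from the confusion counts; a quick self-contained check with made-up example counts:

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    # (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)

def f1_score(tp: int, fp: int, fn: int) -> float:
    # 2*TP / (2*TP + FP + FN)
    return 2 * tp / (2 * tp + fp + fn)

# Example counts (illustrative only): 8 TP, 5 TN, 2 FP, 1 FN
acc = accuracy(8, 5, 2, 1)   # 13/16 = 0.8125
f1 = f1_score(8, 2, 1)       # 16/19 ≈ 0.842
```

Per-class F1, as reported here, treats the class in question as Positive and all other classes as Negative.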

The performances of a vanilla version of EfficientNetB0 and of one using Data Augmentation are reported in Table 2.

Table 2 Computer vision models' performances

Data Augmentation helped the EfficientNetB0 model outperform the vanilla version and overcome some of the challenges arising from the data imbalance and heterogeneity. However, to increase the performance of our model, more data should be collected. More specifically, additional data for the smartwatch class should be collected, as this class drastically impacts the performance of the model compared to the other two.

4.4 Model Deployment on Smart Devices

4.4.1 Smartphone Application

In our smartphone application, each model used for training and inference can be selected and changed. This ensures that a well-performing model is always served in production. To achieve this, we used TensorFlow Lite to deploy the AI model on the smartphone. The available models are stored in a Google Firebase storage, where they can easily be updated without redeploying the application, provided that the number and type of the input parameters are not changed. Whenever the application is started, the most recent models are fetched from Firebase and updated in the application. The deployed model takes the IMU data as input and computes the predictions, returning a distraction boolean as output (1 = distracted, 0 = not distracted).

For demonstration purposes, a sound is played when the model detects a distraction.
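A minimal sketch of the on-device inference step is given below, assuming a TensorFlow Lite binary classifier over a window of IMU samples; the model path, input shape and decision threshold are hypothetical illustrations, not the exact values used in our application:

```python
DISTRACTED, NOT_DISTRACTED = 1, 0

def to_distraction_flag(score, threshold=0.5):
    # Map the model's output score to the boolean the app returns:
    # 1 (distracted) or 0 (not distracted). The threshold is a
    # hypothetical choice for illustration.
    return DISTRACTED if score >= threshold else NOT_DISTRACTED

def predict_distraction(model_path, imu_window):
    # Run one inference over a window of IMU samples with the TensorFlow
    # Lite interpreter; model_path points to a model file previously
    # fetched from Firebase storage.
    import numpy as np          # imported lazily; only needed at inference time
    import tensorflow as tf
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    out = interpreter.get_output_details()[0]
    interpreter.set_tensor(inp["index"], np.asarray(imu_window, dtype=np.float32))
    interpreter.invoke()
    score = interpreter.get_tensor(out["index"]).ravel()[0]
    return to_distraction_flag(score)
```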

4.4.2 Smartwatch Application

In our smartwatch application, the goal is to deploy the trained models on the wearable devices themselves. These devices have limited computation and memory resources. To address this challenge, we convert the trained model so that it runs efficiently on edge devices such as smartwatches. We opted for converting the model from its TensorFlow format to a CoreMLFootnote 5 model, which is then added to the smartwatch application at compilation time.
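The conversion step can be sketched with the coremltools Python package; the directory and output names below are hypothetical examples, not our exact paths:

```python
def convert_to_coreml(saved_model_dir, out_path="DistractionModel.mlpackage"):
    # Convert a trained TensorFlow model into a Core ML model that can
    # be bundled into the watchOS application at compile time.
    import coremltools as ct  # imported lazily; only needed for the conversion
    mlmodel = ct.convert(saved_model_dir, source="tensorflow")
    mlmodel.save(out_path)
    return out_path
```

The resulting Core ML model file is added to the Xcode project, so it is compiled into the smartwatch application.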

Currently, the beta testing service TestFlightFootnote 6 is used, which can push over-the-air updates of both the model and the smartwatch application to a group of beta testers of the application.

4.4.3 Computer Vision Application

Deep Learning computer vision models are usually computationally heavy, and the end goal of deploying them on a low-resource device, such as a smartphone or a micro-controller inside the vehicle, makes the challenge even greater. Moreover, video/image processing approaches are resource-hungry, and different techniques exist to produce a lightweight model for inference.

We optimised our computer vision model with TensorFlow Lite: the parameters used only during training were removed, and the weights were quantised to a lower-precision representation and then compressed. We evaluated the performance of this lightweight model and compared it to the original one (see Table 3).
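This optimisation can be sketched with TensorFlow Lite's post-training quantization; the paths are hypothetical, and the exact converter options we used may differ:

```python
def quantize_for_mobile(saved_model_dir, out_path="model_quant.tflite"):
    # Post-training quantization: training-only state is dropped and the
    # weights are stored in a lower-precision representation, shrinking
    # the model roughly 4x relative to float32.
    import tensorflow as tf  # imported lazily; only needed for the conversion
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return out_path
```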

As we can see, we were able to reduce the size of the model by a factor of 4 while maintaining comparable accuracy.

Table 3 Computer vision models' sizes and performances

5 Dashboard Application for Driver Distraction

The three Machine Learning components introduced above provide driver distraction predictions based on data from smart devices' usage. To integrate these concepts into one product, we implemented an end-user application comprising a dashboard that summarises the distraction events occurring during a driving session.

To achieve this, we designed an application based on the Javascript framework VUEFootnote 7, developed as a web application. Our development team comprises members from different countries and companies. We therefore analysed different collaborative development tools that would enable a better working methodology, leading to greater agility and ease of working together. We had previously tested InVisionFootnote 8 and the Adobe XDFootnote 9 set of services internally, with the following results:

  • Adobe XD: Powerful, with many associated services, but not convenient for collaborative work with other departments/companies. The acquisition cost of this service is high.

  • InVision: Better quality/price ratio and better support for working together, but very focused on design, with little integration with development frameworks.

We finally chose FigmaFootnote 10 after comparing it with the other tools, since it has a high capacity for group work, supporting both internal and external team members, and requires no installation, as it can be used via a web browser. It also has the advantage of facilitating the implementation and evolution of the design to a greater extent than the other options.

After this initial analysis, Figma allowed us to carry out the entire design proposal so that all team members involved could see the changes and the design evolution in real-time. The team was also able to add comments and details to each part of the prototype, as can be seen in Fig. 11.

Fig. 11

A screenshot of the design using the Figma Design tool

At the same time, Figma has facilitated the transition from design to code thanks to its ability to manage artefacts that were previously external to the design, such as the CSS or HTML code of the visual proposals. Laying out the components according to the proposed design is straightforward because, when a component is created in the system, its CSS code is generated and exported directly to the development codebase, as shown in Fig. 12.

Apart from the improved collaboration and communication when proposing and developing the design, which reduced the communication overhead through other means (e.g., sharing files in sharepoints, sending emails), the time to implement this application in a real environment was reduced to less than half, as Figma itself generates the code in the language of our choice (CSS, PHP, Java, etc.), making its use effective and straightforward.

Fig. 12

Transition from design of the final dashboard layout to code

In conclusion, the use of a tool such as Figma in collaborative projects has offered us greater speed, control, and efficiency in developing web applications from design to implementation.

6 Related Work

Using smart devices to recognise driver distraction, based on data extracted from wearables, smartphones and onboard diagnostics to obtain sensing information such as accelerometer and gyroscope readings, is an active research area that has been gaining interest due to the increasing computational power smart devices offer. This section summarises recent related work on the topic.

In [24], a 4-step methodological framework is presented for driving analytics to understand driving behaviour based on smartphone data. In [17], smartphone and smartwatch data is used to detect distraction while driving, such as controlling the infotainment system, drinking/eating, and smartphone usage. Owens et al. [26] describe the coding efforts of an accessible dataset of driver behaviour and situational factors observed during distraction-related safety-critical events and baseline driving epochs. Data coding includes frame-by-frame video analysis of secondary task and hands-on-wheel activity, as well as summary event information. Deep learning on video and sensor data is proposed by [33] in a system called DarNet, capable of detecting and classifying distracted driving behaviour. To minimise privacy concerns, the system applies a distortion filter to the video data before processing it.

Baheti et al. [5] use a dataset for distracted driver posture estimation and classify images into the following 10 classes: driving, texting on a mobile phone using the right or left hand, talking on a mobile phone using the right or left hand, adjusting the radio, eating or drinking, hair and makeup, reaching behind, and talking to a passenger. They use convolutional neural networks (CNNs) and report an accuracy of 96.31% on the test set.

Dua et al. [11] developed a Machine Learning-based system that uses the front camera of a windshield-mounted smartphone to monitor and rate driver attention by combining multiple features based on the driver state and behaviour such as head pose, eye gaze, eye closure, yawns, and use of cellphones. Ratings include inattentive driving, highly distracted driving, moderately distracted driving, slightly distracted driving, and attentive driving. The evaluation with a real-world dataset of 30 different drivers showed that the automatically generated driver inattention rating had an overall agreement of 0.87 with the ratings of 5 human annotators for the static data set.

Dua et al. [10] aim to identify driver distraction using facial features (i.e., head pose, eye gaze, eye closure, yawns, use of smartphones, etc.). The smartphone's front camera is used, and three approaches are compared: in the first, convolutional neural networks (CNNs) are used to extract generic features, and then a gated recurrent unit (GRU) is applied to obtain a final representation of an entire video. In the second approach, besides the features from a CNN, further specific features are extracted, which are then combined using a GRU to obtain an overall feature vector for the video. In the third approach, an attention layer is applied after long short-term memory (LSTM) layers on both the specific and facial features. Their automatically generated rating has an overall agreement of 0.88 with the ratings provided by 5 human annotators on a static dataset, whereas their attention-based model (third approach) outperforms the other models by 10% accuracy on the extended dataset.

Eraqi et al. [13] aim to detect ten types of driver distraction from images showing the driver. They use (in one phase) the rear camera of a fixed smartphone to collect RGB images, in order to distinguish the following classes with convolutional neural networks (CNNs): safe driving, phone right, phone left, text right, text left, adjusting radio, drinking, hair or makeup, reaching behind, and talking to passenger. Thereby, they run a face detector, a hand detector, and a skin segmenter against each frame. As results, they first present a new public dataset, and second, their driver distraction detection solution achieves an accuracy of 90%.

Janveja et al. [16] present a smartphone-based system to detect driver fatigue (based on eye blinks and yawn frequency) and driver distraction (based on mirror-scanning behaviour) under low-light conditions. In detail, two approaches are presented: in the first, a thermal image is synthesised from the smartphone's RGB camera using a Generative Adversarial Network, and in the second, a low-cost near-IR (NIR) LED is attached to the smartphone to improve driver monitoring under low-light conditions. For distraction detection, statistics are calculated on whether the driver scans his/her mirrors at least once every 10 seconds continuously during the drive. A comparison of the two approaches reveals that "results from NIR imagery outperforms synthesised thermal images across all detectors (face detection, facial landmarks, fatigue and distraction)." As a result, they report an accuracy of 93.8% in detecting driver distraction using the second approach, the NIR LED setup.

Most of the related works use smart devices to address driving distraction from the wider perspective of driver behaviour and driver monitoring. Only a few of these works focus on distractions caused by the use of smart devices themselves. Moreover, none of these works operate in real-time, gathering and detecting smart device usage while driving, nor do they rely on a wide range of driver distraction collection methods (combining smartphones, smartwatches and camera sensors within one framework).

7 Conclusion and Future Work

Distracted driving due to the usage of smart mobile devices, like smartphones or smartwatches, increases the risk of accidents. Events of interest directly related to the usage of these devices include texting, browsing the web, calling, using applications, etc. To prevent distracted driving, many approaches focus on particular types of distractions.

This work demonstrates a concept called Smart Devices Distracted Driving Detection. The overall architecture of the concept and a series of initial experiments on a proof-of-concept are presented. Artificial Intelligence, and more specifically Machine Learning, is used to assess driving distractions involving smart devices in a comprehensive manner. In the experiments we carried out, Machine Learning models running on smart devices demonstrated good prediction performance on test data. The computer vision model, which aims to detect distractions by adding an external point of view on the smart devices, under-performs compared to the models running on the smart devices. We also validated that the models can be deployed on the smart devices themselves without sacrificing much prediction performance. Moreover, a dashboard application was developed to show the user the distraction events predicted by the models after a driving session.

As a future direction, collecting and annotating additional image data of distraction events would be a way of improving the prediction performance of the computer vision model. However, labelling driving distractions is a privacy-sensitive and laborious task if not done automatically. Computer vision algorithms may complement the predictions made on smart devices and could also be used to perform efficient data labelling. Moreover, as the models have been deployed successfully on smart devices, investigating a Federated Learning setting, in which data remains on the smart devices instead of being transferred outside, is a natural next step.