
1 Introduction

Automated vehicles offer significant assistance to drivers across various levels of automation and through multiple means. Over the last decade, public transportation systems have been progressively integrating different levels of automated vehicles into their mobility as a service (MaaS) offerings. In the last few years, fully automated vehicles have started appearing on the market, able to operate without a driver in complex road conditions. In this context, as there is no driver, no assistance can be provided to the passengers, and routes and destinations are automatically determined by the associated fleet management system, based on the passenger trip requests. The ultimate goal of MaaS is to provide comprehensive, efficient, and seamless transportation solutions to users. MaaS aims to integrate various forms of transportation services into a single, accessible platform that allows users to plan, book, and pay for their journeys using different modes of transport, such as buses, trains, taxis, rideshares, bikes, and even micromobility options like scooters.

In the vision of the AVENUE project, one of the most promising mobility modes for MaaS services will be the use of fully automated vehicles (e.g., minibuses), providing the glue that binds together all the other transport modes of a successful MaaS offering in an urban environment.

However, the transition to fully automated public transportation vehicles is not seamless, and several obstacles arise in a real-world scenario.

Several concerns among end users are associated with the safety and reliability of fully automated vehicles (AVs), directly impacting the acceptance of this new technology. Two major concerns for potential passengers involve sharing the automated shuttle with other passengers and trust in the technology itself. The absence of a driver in the bus raises various possible issues: for instance, no personnel would be immediately present to administer first aid in case of emergencies, and passengers might feel uneasy being alone, especially at night or in certain neighborhoods.

Additionally, the lack of an authoritative figure onboard could be a concern, especially when transporting groups such as schoolchildren, potentially leading to incidents like vandalism, theft, or altercations among passengers. With the removal of the driver and the introduction of external intervention teams, questions arise about how the formal and informal services provided by the driver could still be upheld while maintaining the current high-quality service standards.

Addressing these social and personal safety and security concerns within automated vehicles necessitates the implementation of specific services to replace the driver’s presence. Within the AVENUE project, various IT-based solutions and services have been identified and developed. These solutions aim to provide passengers with similar services to those offered by a driver onboard. This includes the creation of a novel artificial intelligence (AI)-supported framework that facilitates the widespread adoption of these services. This approach shows great promise in significantly enhancing safety and security levels in automated public transportation and bolstering passenger trust.

The services that will be presented in this chapter are the following:

  • Enhance the sense of security and trust

  • Automated passenger presence

  • Follow your kid/grandparents

  • Shuttle environment assessment

  • Smart feedback system

The reason for choosing these services is twofold: (a) They are all essential services necessary to remove the safety driver from the shuttle, as safety and efficient operation are critical factors. For example, ensuring the safety of travelers has been identified by end users as the single most important factor in choosing to ride the automated shuttle. (b) They are all based on the same foundation of technology, with cameras, sensors, and algorithms, meaning that testing multiple types of services is feasible once the necessary equipment has been installed. The analysis is also similar, and therefore the results can be compiled together.

2 Service: Enhance the Sense of Security and Trust

The service “enhance the sense of security and trust” aims to address the new reality formed in automated shuttle mobility infrastructures by the absence of a bus driver and the threat of criminal activities in European cities. Typically, drivers are trained to handle incidents of abnormal passenger behavior, such as petty crimes, according to standard procedures adopted by the transport operator. Surveillance using sensors such as cameras (cameras of different technologies can be used so that passengers’ privacy is protected) and microphones, as well as smart software in the bus, will maximize both the feeling of security and the actual level of security. However, end users still have several concerns regarding the safety and robustness of AVs in which no driver is present, and these concerns are directly linked to the final user acceptance of the new technology. Prospective passengers harbor apprehensions about potential scenarios in the absence of a driver in the bus, including:

  • No one will be in the bus to perform first aid if required

  • Feeling of discomfort being all alone in the bus at night, especially in certain neighborhoods

  • No authority figure present to keep passengers calm (e.g., schoolkids)

  • Vandalism, bag snatching, indoor fighting, and unaccompanied luggage

To address the aforementioned concerns on social and personal safety and security in the vehicle, certain measures need to be implemented. For example, the detection of unaccompanied luggage and other personal belongings may raise a notification or an alert to the supervisor and/or the suitable authorities. This may be followed by appropriate notifications and/or instructions to the passengers, while the vehicle may also implement respective actions. Moreover, implementing a solution for enhancing safety and security inside the automated buses will support the safekeeping not only of the users of the automated public bus but also of the vehicle itself.

This section details the implementation of a video and audio analytics software module designed for an embedded security subsystem or for the cloud-based services within the system. Additionally, it covers the deployment and testing of this service at the pilot sites of the AVENUE project.

The service addresses the timely, accurate, robust, and automatic detection of various petty crime types or misdemeanors as well as the assistance of authorized end users toward the reidentification of any offenders. A misdemeanor is any “lesser” criminal act in some “common law” legal systems. Misdemeanors are generally punished less severely than felonies but theoretically more than administrative infractions (also known as minor, petty, or summary offences) and regulatory offences. Many misdemeanors are punished with monetary fines. The petty crimes that are targeted for identification by the sensors include petty theft like bag snatching and pickpocketing, vandalism, aggression, illegal consumption of cigarettes, public intoxication, simple assault, and disorderly conduct. These are explained in more detail:

  • Petty theft: Theft is the taking of another person’s property or services without that person’s permission or consent with the intent to deprive the rightful owner of it.

  • Vandalism: Vandalism is the action involving deliberate destruction of or damage to public or private property.

  • Aggression: Aggression is overt or covert, often harmful, social interaction with the intention of inflicting damage or other unpleasantness upon another individual.

  • Public intoxication: Public intoxication, also known as “drunk and disorderly” and drunk in public, is a summary offense in some countries related to public cases or displays of drunkenness.

  • Simple assault: An assault is the act of inflicting physical harm or unwanted physical contact upon a person or, in some specific legal definitions, a threat or attempt to commit such an action.

  • Disorderly conduct: Disorderly conduct covers behavior such as being drunk in public, “disturbing the peace,” or loitering in certain areas.

In the “enhance the sense of security and trust” service, two distinct petty crime detection approaches are implemented: video analytics and audio analytics. The video analytics approach supports end-to-end detection of abnormal events; achieves real-time inference on modern hardware; offers flexibility with supervised, unsupervised, and semi-supervised learning to compensate for the scarcity of data in the security domain; supports multiple camera types, positions, and angles; and is able to operate in an embedded setup with limited power requirements. The audio approach, on the other hand, uses information from the acoustic sensors of the shuttle for abnormal event detection by comparing different spectrogram representations and focusing on the effect of the signal-to-noise ratio (SNR) on audio recognition, as well as on the potential of a model to generalize across different SNR settings and datasets collected in different environments.

More specifically, for the video analytics approach, a pose classification approach is developed that classifies the skeleton key points extracted from each dataset (Tsiktsiris et al., 2020). For training the proposed model, five distinct datasets were used: data simulated in the lab at the CERTH/ITI facilities, data captured from Geneva Public Transport (TPG) shuttles in Geneva and HOLO shuttles in Copenhagen, data from the P-REACT project, the NTU-RGB-D dataset by the ROSE lab (NTU-RGB-D is the dataset from Nanyang Technological University (NTU) (https://www.ntu.edu.sg/rose) featuring RGB (red-green-blue/color) and D (depth) images), and the UCSD Anomaly Detection Dataset. Three different models were tested for pose classification, namely, a stacked bidirectional long short-term memory (LSTM) network classifier, a spatiotemporal autoencoder, and a spatiotemporal LSTM classifier. The first approach consists of a stacked LSTM model used as a classifier. An overview of the pipeline is depicted in Fig. 5.1. Overall, the classification is performed in four stages: (a) In the first stage, pose estimation techniques are applied to obtain skeleton key points; the generated pose proposals are refined by parametric pose non-maximum suppression to obtain the estimated human poses. (b) In the second stage, tracking is performed to match cross-frame poses and form pose flows. (c) In the third stage, features are generated from the detected and tracked human body key points and are forwarded into the network. (d) In the fourth stage, the network classifies the action as normal or abnormal.
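
As an illustration of the classification stage, the following is a minimal tf.keras sketch of a stacked bidirectional LSTM classifier over sequences of skeleton key points; the window length, number of key points, and layer widths are illustrative assumptions, not the values of the AVENUE implementation.

```python
# Minimal sketch of the stacked bidirectional LSTM pose classifier (stage d).
# Assumptions: 17 two-dimensional keypoints per skeleton, 20-frame windows,
# and illustrative layer widths.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_pose_classifier(frames=20, n_keypoints=17, n_classes=2):
    model = models.Sequential([
        layers.Input(shape=(frames, n_keypoints * 2)),   # (x, y) per keypoint
        # Two stacked bidirectional LSTM layers over the keypoint sequence.
        layers.Bidirectional(layers.LSTM(128, return_sequences=True)),
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(n_classes, activation="softmax"),   # normal vs. abnormal
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

clf = build_pose_classifier()
print(clf(tf.random.uniform((4, 20, 34))))   # 4 clips -> 4 probability vectors
```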

Fig. 5.1

Pipeline of the pose classification (Tsiktsiris et al., 2020)

For the pose estimation stage, the regional multi-person pose estimation (RMPE) approach by Fang et al. (2017) was adopted with a pretrained VGG19 backend, a convolutional neural network model proposed by Simonyan and Zisserman (2014). In the tracking stage, cross-frame poses are matched to form pose flows, using a real-time algorithm based on a distance matrix. In addition, pose flow non-maximum suppression is applied in order to reduce unnecessary pose flows and relink temporally disjoint ones. This is an important step that associates poses belonging to the same person across multiple frames. A skeleton tracking algorithm was implemented in order to meet the performance requirements of a real-time service. The algorithm sorts the skeletons by the distance between the neck and the image center, from small to large. Certain heuristics are taken into consideration, such as the position of the joints, the average height of the person, and the height difference between frames. Height variation improves the ability of the algorithm to understand depth, since the analysis is based on two-dimensional input. These parameters of the algorithm are optimized for the in-shuttle space and fine-tuned to specific weights based on the camera calibration. A skeleton near the center is processed first and given a smaller human ID. Subsequently, each skeleton’s features are matched between its previous and current frames. The distance (or cost) matrix between the skeleton joints is the main criterion of the matching function. Skeletons with the smallest distance are paired between the frames and are given the same ID (Fig. 5.2).
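
The following is a minimal sketch of the cross-frame matching idea described above, assuming two-dimensional joint coordinates. It uses the Hungarian assignment over a mean joint-distance cost matrix as a stand-in for the greedy pairing heuristic of the actual implementation; the neck-joint index and the cost threshold are illustrative assumptions.

```python
# Sketch of skeleton matching across two frames via a joint-distance cost matrix.
import numpy as np
from scipy.optimize import linear_sum_assignment

def sort_by_center(skeletons, image_center):
    """Process skeletons closest to the image center first (smaller IDs).
    skeletons: (N, J, 2) joint coordinates; joint 1 is assumed to be the neck."""
    neck = skeletons[:, 1, :]
    order = np.argsort(np.linalg.norm(neck - image_center, axis=1))
    return skeletons[order]

def match_skeletons(prev, curr, max_cost=80.0):
    """Pair skeletons of the previous and current frame by smallest mean joint
    distance; unmatched current skeletons would receive new IDs upstream."""
    cost = np.zeros((len(prev), len(curr)))
    for i, p in enumerate(prev):
        for j, c in enumerate(curr):
            cost[i, j] = np.mean(np.linalg.norm(p - c, axis=1))
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < max_cost]

prev = np.random.rand(3, 17, 2) * 100
curr = prev + np.random.rand(3, 17, 2) * 2     # small motion between frames
print(match_skeletons(sort_by_center(prev, np.array([50, 50])), curr))
```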

Fig. 5.2

Skeleton matching across two subsequent frames (blended) (Tsiktsiris et al., 2020). Notice that the passenger ID, highlighted in green at the left of each bounding box, is the same across the frames

For the spatiotemporal autoencoder, two stages are formed: encoding and decoding. Autoencoders set the number of encoding units to be smaller than the input dimension; thus, they were first used for dimensionality reduction. Training is usually performed with unsupervised backpropagation, decreasing the reconstruction error between the decoded output and the original input. Generally, an autoencoder can extract more useful features when the activation function is nonlinear than common linear transformation methods, such as principal component analysis (PCA).

In order to learn the regular events in the training data, a spatiotemporal autoencoder was introduced. In particular, the spatial autoencoder consists of an encoder and a decoder composed of two convolutional and two transpose convolutional layers, respectively, whereas the temporal encoder comprises three convolutional LSTM layers, as depicted in Fig. 5.3.
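
A minimal tf.keras sketch of such a spatiotemporal autoencoder is given below, following the layer sizes indicated in Fig. 5.3 (20 grayscale frames of 64 × 64 pixels); the exact strides, filter counts, and activations of the deployed model are assumptions based on the figure.

```python
# Sketch of the spatiotemporal autoencoder: frame-wise spatial encoding,
# three ConvLSTM layers (the middle one as a bottleneck), spatial decoding.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_st_autoencoder(frames=20, size=64):
    inp = layers.Input(shape=(frames, size, size, 1))
    # Spatial encoder: two convolutions applied frame by frame.
    x = layers.TimeDistributed(
        layers.Conv2D(64, 7, strides=2, padding="same", activation="relu"))(inp)
    x = layers.TimeDistributed(
        layers.Conv2D(32, 5, strides=1, padding="same", activation="relu"))(x)
    # Temporal encoder/decoder: three ConvLSTM layers.
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    x = layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True)(x)  # bottleneck
    x = layers.ConvLSTM2D(32, 3, padding="same", return_sequences=True)(x)
    # Spatial decoder: mirror the encoder and reconstruct the input frames.
    x = layers.TimeDistributed(
        layers.Conv2DTranspose(32, 5, strides=1, padding="same", activation="relu"))(x)
    x = layers.TimeDistributed(
        layers.Conv2DTranspose(64, 7, strides=2, padding="same", activation="relu"))(x)
    out = layers.TimeDistributed(
        layers.Conv2D(1, 3, padding="same", activation="sigmoid"))(x)
    return models.Model(inp, out)

autoencoder = build_st_autoencoder()
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction error as anomaly score
```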

Fig. 5.3
Layer configuration shown in the figure: input 20 × 64 × 64 × 1 → Conv 7 × 7, 64 filters, stride 2 → Conv 5 × 5, 32 filters, stride 1 → ConvLSTM2D 3 × 3, 32 filters, stride 1 → output 20 × 64 × 64 × 1.

Model architecture of the autoencoder. The first two convolutional layers are spatial encoders, followed by temporal encoder and decoder. Between them, a ConvLSTM with reduced filters is used as a bottleneck to eliminate non-useful information. At the last two layers, spatial decoding is performed, reconstructing the input image to the same format (Tsiktsiris et al., 2020)

Finally, it was observed that even if the model is trained on thousands of samples, some false positives will still occur on certain occasions. As a result, it is possible to manually sift through the anomaly outputs and flag some of them as false positives, letting the previous autoencoder model act as a high-recall detector. Semi-supervised learning is employed by decreasing the detection threshold, so that the majority of true anomalies are detected (high recall), together with additional false positives (low precision). To realize the semi-supervised approach, a new model has been designed, which includes the previous encoder and an LSTM acting as a classifier, as depicted in Fig. 5.4.
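
A minimal sketch of this hybrid arrangement is shown below: the frozen encoder of the autoencoder feeds a stacked LSTM that separates true anomalies from flagged false positives. The pooling step, the layer widths, and the small stand-in encoder used for the demonstration are illustrative assumptions.

```python
# Sketch of the hybrid semi-supervised classifier: frozen encoder + stacked LSTM.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_hybrid_classifier(encoder):
    encoder.trainable = False                        # reuse the learned features
    inp = layers.Input(shape=encoder.input_shape[1:])
    feats = encoder(inp)                             # (frames, H, W, C)
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(feats)
    x = layers.LSTM(64, return_sequences=True)(x)    # stacked LSTM classifier
    x = layers.LSTM(32)(x)
    out = layers.Dense(1, activation="sigmoid")(x)   # anomaly vs. false positive
    return models.Model(inp, out)

# Stand-in encoder with the same interface as the autoencoder's encoder half
# (in practice, slice the trained autoencoder up to its bottleneck ConvLSTM).
enc_in = layers.Input(shape=(20, 64, 64, 1))
enc_out = layers.ConvLSTM2D(16, 3, padding="same", return_sequences=True)(
    layers.TimeDistributed(
        layers.Conv2D(32, 5, strides=2, padding="same", activation="relu"))(enc_in))
encoder = models.Model(enc_in, enc_out)
hybrid = build_hybrid_classifier(encoder)
hybrid.compile(optimizer="adam", loss="binary_crossentropy")
```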

Fig. 5.4

Model architecture of the hybrid model. The red container contains components of the previous autoencoder approach. The green components indicate the new hybrid model which acts as a classifier (Tsiktsiris et al., 2020)

The pose classification approach was tested on the NTU RGB + D dataset (Fig. 5.5a–c) and on the TPG dataset captured by CERTH inside the AV shuttle (Fig. 5.5d, e). In the images, various debugging layers are enabled, such as skeleton points, lines, tracker IDs, and bounding boxes for each detection. The predicted result is marked in green when the classifier indicates “normal” and in red when “abnormal,” respectively. Notably, these NTU dataset samples were not included in the training set, so it is safe to assume that the derived model can generalize across different people, view angles, and events. Figures 5.6, 5.7, and 5.8 depict the aforementioned conditions and use cases.

Fig. 5.5

Evaluation on test data: (a–c) abnormal event detection (violence/passengers fighting) using different camera angles from the NTU-RGB dataset; (d, e) detection of fighting/bag-snatching real-world scenarios inside the shuttle (Tsiktsiris et al., 2020)

Fig. 5.6

Evaluation on multiple camera angles, excessive occlusion, and partial presence (Tsiktsiris et al., 2020)

Fig. 5.7

Evaluation across various scenarios (left to right): Bag snatching, fighting, vandalism, and unaccompanied luggage (Tsiktsiris et al., 2020)

Fig. 5.8

Additional evaluation on the NTU-RGB dataset. Metrics at the top left depict the prediction scores for P1 (Tsiktsiris et al., 2020)

For the audio analysis (Papadimitriou et al., 2020), the different procedures implemented include spectrograms, single-channel representation, multichannel representation, and transfer learning, and the dataset used for sound event classification (glass breaking, gunshot, screaming) is the MIVIA Audio Events dataset. The three spectrogram representations used were the single-channel short-time Fourier transform (STFT) (Salamon & Bello, 2017), the mel scale, and the mel-frequency cepstral coefficients (MFCCs) (Zhang et al., 2015).

The experimental results in Fig. 5.9 showed that the MFCC is able to generalize better than the STFT spectrogram and the mel spectrogram when comparing single-channel representations. This is most probably due to the fact that this representation concentrates all the important information of the audio signal in the lowest MFCC features (e.g., the first ten features), in terms of concentrated energies, with minimal changes in the highest ones. Hence, it has its place in a feature representation combination, and for that reason, it was indeed used in both methods of multichannel representation. With regard to the multichannel representation (Fig. 5.10), the stacked features method proved to be more generalizable than the concatenated features method, especially when training was carried out on higher signal-to-noise ratios (SNRs) and testing was carried out on lower ones. Neither the concatenated features method nor the separate single-channel spectrogram representations (STFT, mel, and MFCC) performed as well.
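
To make the two fusion schemes concrete, the following sketch computes the three spectrogram representations with librosa and combines them either by concatenation along the frequency axis (single channel) or by stacking them as channels (multichannel). The parameter values, the synthetic test signal, and the crude frequency-axis resampling are illustrative assumptions and may differ from the original preprocessing.

```python
# Sketch of STFT, mel, and MFCC representations and the two fusion schemes.
import numpy as np
import librosa

def spectrograms(y, sr, n_fft=1024, hop=512, n_mels=64, n_mfcc=64):
    stft = librosa.amplitude_to_db(
        np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop)))
    mel = librosa.power_to_db(librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft, hop_length=hop, n_mels=n_mels))
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop)
    return stft, mel, mfcc

def resize_freq(spec, n_bins):
    """Crudely subsample the frequency axis so all maps share one grid."""
    idx = np.linspace(0, spec.shape[0] - 1, n_bins).astype(int)
    return spec[idx, :]

sr = 32000
y = np.random.randn(sr).astype(np.float32)           # 1 s stand-in clip
stft, mel, mfcc = spectrograms(y, sr)

# Concatenated (single channel): representations side by side on one axis.
concat = np.concatenate([resize_freq(s, 64) for s in (stft, mel, mfcc)], axis=0)
# Stacked (multichannel): representations as separate input channels.
stacked = np.stack([resize_freq(s, 64) for s in (stft, mel, mfcc)], axis=-1)
print(concat.shape, stacked.shape)                    # (192, T) vs. (64, T, 3)
```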

Fig. 5.9

Frame-by-frame recognition rate (RR) for all models using concatenated features from STFT, mel, and MFCC spectrograms (single channel) validated for each signal-to-noise ratio (SNR): each column group refers to a specific model (Papadimitriou et al., 2020)

Fig. 5.10

Frame-by-frame RR for all models using stacked features from STFT, mel, and MFCC spectrograms (multichannel) validated for each SNR: each column group refers to a specific model (Papadimitriou et al., 2020)

Finally, in Fig. 5.11, the generalization capabilities of the two multichannel methods are shown in terms of event-based recognition (GB: glass breaking, G: gunshot, S: scream). Moving along the sequence of the ten models, it is evident that the generalization capabilities of the stacked multichannel method are significantly better than those of the corresponding concatenated multichannel method. In both cases, the model that was trained at −5 dB and tested at 15 dB showed the best performance, with a recognition score of 91.51% for the concatenated method and 90.23% for the stacked method, with the lowest standard deviations, namely, 0.034 and 0.019, respectively.

Fig. 5.11

A comparison of the models trained with the concatenated (top) and stacked (bottom) features method with regard to event-based RR: the sequence of the models increases from 1 to 10 (Papadimitriou et al., 2020)

As the training SNR (and model number) increases, it becomes more difficult to generalize, especially toward zero and negative SNRs. This is because lower-SNR audio contains higher levels of noise and is thus more challenging, so training on it leads to more robust and generalizable classification.

3 Service: Automated Passenger Presence

The service “automated passenger presence” aims to address a basic problem of operators’ services, namely the occupancy of their vehicles and the awareness of the number of people onboard, which is needed in order to schedule the routes. Furthermore, passengers would like to know in advance whether there is an available seat or enough space on a shuttle to plan their boarding. Traditionally, and to a large extent still today, passenger counting is conducted manually via passenger surveys or human ride checkers. Typically, the driver or inspectors are responsible for counting the onboard passengers, something not feasible in an automated shuttle. Automatic passenger counting has been rapidly emerging in recent years to address similar needs. An automated system is introduced that is capable of detecting passenger presence in real time with high accuracy, counting onboard passengers, and calculating vehicle occupancy. Surveillance using sensors such as cameras (cameras of different technologies can be used so that passengers’ privacy is protected) and smart software in the bus will automate the detection of passenger presence.

Several concerns of the end users regarding the safety and robustness of the automated vehicles, directly linked to the final user acceptance of the new technology, can be identified. Prospective passengers may face several possible situations that could arise when there is no staff inside the shuttle. Indicatively:

  • No one will be in the bus to count the number of passengers with regard to the shuttle’s capacity.

  • There are continuous stops throughout the entire route, even in cases where the shuttle is fully occupied.

  • No authority figure is present to alert passengers of their designated bus stop.

To address the aforementioned concerns on social and personal safety and quality of service in the vehicle, certain measures need to be implemented. For example, counting the number of passengers inside the automated vehicle could help avoid overcrowding in the shuttle, as well as unnecessary stops when the bus is at full capacity. This may be followed by appropriate notifications and/or instructions to the passengers, while the vehicle may also implement respective actions.

The service provides a video analysis of the vehicle interior, using the onboard camera, in order to identify the vehicle occupation and free space, as well as to count the people onboard. Automatic assessment of space occupation using the onboard cameras is enabled. Capacity is defined as an absolute number of space units; for example, each space unit is associated with one standing passenger. Occupancy is defined as the absolute number of space units currently occupied in the shuttle. For the operation manager, occupancy is visible on the dashboard of the AVENUE platform, whereas for travelers, occupancy is displayed as real-time information via the AVENUE mobile app, wherever the traveler is. Each passenger (normal, large size, wheelchair user, seated) can thus determine whether he/she can fit in or not. Assessments for the different cases can be provided to assist passengers in deciding whether to request boarding. Automatic counting of people using the onboard cameras is also provided. Moreover, occupancy annotated with information for the different user cases is displayed as real-time information, wherever the traveler is; however, it does not guarantee a free spot by the time the shuttle reaches the station of their choice.
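
A minimal sketch of the space-unit bookkeeping described above follows; the per-type space costs and the example capacity are illustrative assumptions, not operator values.

```python
# Sketch of occupancy bookkeeping in space units (1 unit = 1 standing passenger).
SPACE_UNITS = {"standing": 1, "seated": 1, "big_size": 2, "wheelchair": 3}  # assumed costs

def can_board(capacity_units, occupied_units, passenger_type):
    """Return True if a passenger of the given type fits in the free space."""
    free = capacity_units - occupied_units
    return SPACE_UNITS[passenger_type] <= free

print(can_board(capacity_units=11, occupied_units=9, passenger_type="wheelchair"))  # False
print(can_board(capacity_units=11, occupied_units=9, passenger_type="standing"))    # True
```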

The implementation of a video analytics software module for an automated passenger presence counting subsystem or for the cloud-based services of the system is described below, along with the deployment and testing of the service at the pilot sites of the AVENUE project, where the following use cases have been identified to be further examined and addressed:

  • Passenger counting: The automated shuttle has a fixed capacity regarding the number of passengers it can carry. The video cameras installed in the automated shuttle acquire color and depth images, and the data are fed into the system’s video analytics algorithms for further analysis. If the algorithms identify that the maximum number of passengers has been reached, the shuttle stops accepting additional passengers, and appropriate notifications are sent to the AVENUE mobile app for the passengers who would like to board.

  • Route optimization: Even when the shuttle is at full capacity, there may still be people waiting at a bus stop to board. The bus then only stops when a passenger needs to get off, and the route is modified to save time and cost. The number of onboard passengers is continuously monitored, so that new passengers can get on when space becomes available.

  • Passenger awareness: Even though the automated vehicle has reached its terminal, there could still be passengers onboard. The shuttle counts the number of passengers to make sure there is no one left. If there are passengers, the bus alerts them to get off.

For the “automated passenger presence” service, a deep learning-based distance assessment service is proposed that uses an overhead perspective and is able to function with high accuracy and low power consumption in confined spaces, such as the interior of the automated shuttle (Tsiktsiris et al., 2022). For this purpose, a fisheye wide-angle camera with a top-down perspective is used. In order to detect persons in a timely and accurate manner, a pretrained RAPiD model was employed, which outputs bounding box coordinates that are used for computing the detections’ centroids.

More specifically, as already mentioned, the network architecture is inspired by RAPiD (Duan et al., 2020) and therefore consists of three stages: the backbone network, the Feature Pyramid Network (FPN), and the bounding box regression network. The backbone network works as a feature extractor that takes an image as input and outputs a list of features from different parts of the network. In the next stage, these features are passed into the FPN in order to extract features related to object detection. Finally, at the last stage, a convolutional neural network (CNN) is applied to each extracted feature map in order to produce a transformed version of the bounding box predictions, as depicted in Fig. 5.12.

Fig. 5.12
Feature-map dimensions shown in the figure: backbone output 1024 × 32 × 32 → FPN output 256 × 128 × 128 → head output 3 × 128 × 128 × 6.

An illustration of multiple convolutional layers and multidimensional matrices, such as the feature maps with 1024 × 1024 input resolution (Tsiktsiris et al., 2022)

Furthermore, the Euclidean distance formula is implemented to compute the pairwise distances between the bounding box centroids. This pixel distance is then converted into the real distance between passengers by multiplying it by a weight value defined via the camera calibration.
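
The following sketch illustrates the distance check: centroids are computed from the detector's bounding boxes, pairwise Euclidean distances are scaled by an assumed pixel-to-metre calibration weight, and pairs closer than 1 m are flagged. The box format and the calibration value are illustrative assumptions.

```python
# Sketch of pairwise passenger-distance assessment from bounding boxes.
import numpy as np
from itertools import combinations

def centroids(boxes):
    """boxes: (N, 4) array of [x1, y1, x2, y2] detections."""
    return np.stack([(boxes[:, 0] + boxes[:, 2]) / 2,
                     (boxes[:, 1] + boxes[:, 3]) / 2], axis=1)

def unsafe_pairs(boxes, metres_per_pixel, min_distance=1.0):
    pts = centroids(np.asarray(boxes, dtype=float))
    pairs = []
    for i, j in combinations(range(len(pts)), 2):
        d = np.linalg.norm(pts[i] - pts[j]) * metres_per_pixel  # calibration weight
        if d < min_distance:
            pairs.append((i, j, round(d, 2)))
    return pairs

print(unsafe_pairs([[0, 0, 50, 100], [60, 0, 110, 100]], metres_per_pixel=0.01))
# [(0, 1, 0.6)] -> these two passengers are only 0.6 m apart
```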

Experimental results indicated that the service efficiently identifies passengers with unsafe proximity according to COVID-19 regulations (passengers must keep a distance of 1 m apart (Olivera-La Rosa et al., 2020)), as depicted in Fig. 5.13. However, handling reflections remains a challenge in certain scenarios, as in similar approaches: passenger figures may appear as reflections in the windows of the shuttle, especially when the lighting is low. To mitigate this issue, a custom mask is applied to large reflective surfaces.

Fig. 5.13

Results on unseen scenarios from a real pilot site and the BOSS dataset. Green lines represent a safe distance, while red lines an unsafe one (Tsiktsiris et al., 2022) (Color figure online)

4 Service: Follow My Kid/Grandparents

The service “follow my kid/grandparents” is designed to increase the autonomy of partially autonomous people (kids, grandparents, disabled people). It allows carers or family members to be sure that their loved ones are safe while moving around the city using public transport. At the same time, it increases the confidence of non-fully autonomous people to use public transport, knowing that their family can “be with them.” Surveillance using sensors such as cameras (different technologies can be used so that passengers’ privacy is protected) and microphones, as well as smart software in the shuttle, will maximize both the feeling of security and the actual level of security. Prospective passengers fear several situations that could arise if there is no driver in the bus:

  • Passengers feeling discomfort travelling alone during nighttime

  • Parents not being able to know if their kids have reached their destination safely

  • Caregivers not being able to track passengers with dementia or other health issues

To address those concerns on social and personal safety and security, certain measures need to be implemented. For example, allowing third parties to monitor the route of minors or passengers with health issues could make their trip much easier and less frightening. This may be followed by appropriate notifications and/or instructions to the third party, while the vehicle may also implement respective actions. Moreover, implementing a solution for monitoring the routes of kids and patients will support the safekeeping not only of the users of the automated public shuttle but also of the vehicle itself. The implementation of a video and audio analytics software module for an embedded security subsystem or for cloud-based services of the system is described below, along with the deployment and testing of the service at the pilot sites of the AVENUE project.

The service and scenario propose a full-fledged solution that allows designated “guardians” to follow the automated public transport journeys of more vulnerable people: the guardians can check the trip via a dashboard or mobile app, receive notifications, add people to their “guarded” list, and share trips/positions and the estimated time of arrival (ETA) with others. In the context of the AVENUE project, the following use cases have been identified to be further examined and addressed:

  • Passenger monitoring: Travelling without a guardian during nighttime can be unsettling for a vulnerable person.

  • Kids monitoring: Parents need to be able to track their kids.

  • Patients monitoring: Caregivers need to track their patients, especially when they are not able to commute on their own.

The video cameras installed in the automated shuttle record the color images, and the data are fed into the system’s video analytics algorithms for further analysis. When the automated bus’s system identifies the passenger/kid/patient, the tracking begins, and the parents/caregivers can monitor their route.

For the “follow my kid/grandparents” service, an end-to-end service based on deep learning models is developed for automated facial recognition inside the automated shuttle (Tsiktsiris et al., 2021). The techniques introduced in this service are based on attention, in order to mitigate the occlusion issues introduced by face masks during the COVID-19 pandemic. More specifically, the sensor layer connects to the hardware abstraction layer (HAL), which implements the IP and USB cameras, respectively, and also requests raw data from the API end points to perform face recognition. The input data are then converted and transformed into a compatible format and passed into the analytics algorithms. The result is transferred via the API end points to the cloud, where the user has access to the data and acts accordingly. A new passenger can be enrolled in the service using a single image of his/her face, which is stored in a database; using it as a reference, the network calculates the similarity of any new instances presented to it. An overview of the service is presented in Fig. 5.14.

Fig. 5.14

Overview of the “follow my kid” service (Tsiktsiris et al., 2021)

As for the video analysis, facial recognition techniques identify human faces in images or videos by measuring specific facial characteristics. The extracted information is then combined to create a facial signature or profile. When used for facial verification, a camera frame is compared to the recorded profile. More specifically, a multi-task cascaded convolutional network (MTCNN) (Zhang et al., 2016) receives the input frame to extract and align facial images. The facial images are then preprocessed and passed into a feature extractor (CNN backbone) linked with the explainable cosine (xCos) module, which provides an interpretable cosine similarity metric.

As current face verification models use fully connected layers, spatial information is lost along with the ability to understand the convolution features in a human sense. To address this obstacle, the plug-in xCos module is integrated as described below and depicted in Fig. 5.15, while experimental results are illustrated in Fig. 5.16.

  • Input: The two input images are preprocessed and passed into the feature extractor. Input A is the database image, while Input B is the image cropped from the video stream.

  • Backbone: The same CNN feature extractor is implemented as in ArcFace (Deng et al., 2019). However, to employ the xCos module, the last fully connected layer and the previous flatten layer are replaced with a 1 × 1 convolutional layer.

  • Lcos calculation (xCos): The patch-wise cosine similarity is multiplied by the attention maps and then summed to calculate the Lcos (a brief sketch of this computation follows the list).
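
The following is a minimal numpy sketch of the patch-wise cosine computation outlined above; the feature-map size and the uniform attention map are illustrative assumptions.

```python
# Sketch of the patch-wise (grid) cosine similarity weighted by attention.
import numpy as np

def xcos_similarity(feat_a, feat_b, attention=None):
    """feat_a, feat_b: (H, W, C) convolutional feature maps of the two faces.
    attention: (H, W) weights summing to 1; uniform if not provided."""
    h, w, _ = feat_a.shape
    if attention is None:
        attention = np.full((h, w), 1.0 / (h * w))
    dot = np.sum(feat_a * feat_b, axis=-1)                       # per-patch dot product
    norms = np.linalg.norm(feat_a, axis=-1) * np.linalg.norm(feat_b, axis=-1)
    grid_cos = dot / np.clip(norms, 1e-8, None)                  # per-patch cosine
    return float(np.sum(grid_cos * attention))                   # the Lcos score

a = np.random.rand(7, 7, 32)
print(xcos_similarity(a, a))   # identical feature maps give a score of 1.0
```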

Fig. 5.15

Network pipeline: the two input images are preprocessed and passed into the backbone CNN for feature extraction along with the plugin xCos module (Tsiktsiris et al., 2021)

Fig. 5.16

The Follow-my-kid service detecting passengers in real time inside the shuttle

5 Service: Shuttle Environment Assessment

The service “shuttle environment assessment” aims to maintain at acceptable levels the environmental conditions in the automated vehicle, which may not be adequately controlled due to the absence of the shuttle driver. Minimum acceptable conditions and comfort, such as good air quality, acceptable odors, and the absence of smoke, are necessary for the safe transport of passengers, as well as for the viability of the whole automated service, since the lack of these conditions within the vehicle could significantly discourage potential users. Moreover, monitoring the environmental conditions enables passenger alert and warning services via notifications, thus enhancing the user experience and safety during trips. Under these circumstances, several situations must be considered in order for prospective passengers to feel content and safe. With no driver inside the vehicle, various problems might come up, such as the following:

  • There will be no staff inside the bus to prevent someone from lighting a cigarette.

  • In very high or low temperatures, there will be no driver to regulate the air conditioning or heating system.

  • In emergency situations, there will be no one in charge of informing the operators and the competent authorities.

  • If the CO2 concentration inside the bus is high and someone gets dizzy or exhibits breathing difficulties, there will be no driver to either open the windows or stop the shuttle.

The buses should create a comfortable environment for all passengers. This feeling could undoubtedly be strengthened by controlling the temperature inside the vehicle. Besides, heating, ventilation, and air conditioning control now belongs to the standard equipment of city buses. As a result, it is crucial for temperature sensors to be positioned at specific locations inside the vehicle. These sensors will be connected to the air conditioning system, and if the humidity/temperature exceeds a suitable limit, the air conditioning will be put into operation. In this way, the humidity/temperature will be automatically adjusted to provide an appropriate indoor climate, neither too hot nor too cold for the passengers. In addition, the detection of certain pollutants, such as CO2, NO2, or dust particles, in the indoor environment, along with critical temperature variations, is critical for the condition of certain passengers, especially ill people, such as asthma patients. Smoke in the vehicle, e.g., from a person lighting a cigarette, will deteriorate the passenger experience but may also put the whole vehicle in danger (fire hazard). Detection of such events (air quality deterioration, smoke) may raise a notification or an alert to the passengers, along with instructions on how to handle the situation, as well as to the supervisor and/or the suitable authorities (e.g., police, fire department), while the vehicle may also implement respective actions. Moreover, smoke in the vehicle may also result in cancelling the automated transport service.

The service is responsible for the timely, accurate, robust, and automatic detection of any change in the air quality and of the presence of smoke or fire inside the vehicle. When an alert is raised by the system, notifications and instructions are sent to the passengers, to the operators, and/or to the suitable authorities. Several possible situations could arise, ranging from high CO2 and NO2 concentrations in the indoor air to the presence of humidity, smoke, or even fire. In particular, exposure to carbon dioxide can produce a variety of health effects (Azuma, 2018). These may include headaches, dizziness, restlessness, a tingling or pins-and-needles feeling, difficulty breathing, sweating, tiredness, and an increased heart rate. Furthermore, detection of certain air quality indexes and pollutants in the indoor environment, along with critical temperature variations, is necessary for providing a secure service to the passengers. A gas composition sensor could be used for checking the indoor air quality. This sensor monitors the air intake of the heating, ventilation, and air conditioning system of the vehicle, detects undesirable gases, and adjusts the system accordingly by shutting off the intake and recirculating the indoor air. Additionally, another possible scenario, mostly observed during the rainy days of the year, is the fogging of the windows due to increased humidity. This might have a negative impact on the passengers’ attitude while reinforcing a feeling of confinement. Consequently, including a fogging prevention sensor inside the vehicle might be an efficient solution. More specifically, fogging prevention sensors are used to prevent fogging of the windshield glass. These sensors consist of three sensing elements for sensing the indoor temperature, the windshield glass temperature, and the cabin humidity. The fogging sensor feedback is used for adjusting the heating, ventilation, and air conditioning system to maintain the interior temperature higher than the windshield glass temperature, thus preventing the windshield from fogging up.
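
A minimal rule-based sketch of these environmental checks follows; all threshold values are illustrative assumptions, not the operator's configuration.

```python
# Rule-based sketch of cabin-environment assessment and alerting.
from dataclasses import dataclass

@dataclass
class CabinReading:
    co2_ppm: float
    temperature_c: float
    glass_temperature_c: float
    humidity_pct: float
    smoke_detected: bool

def assess(reading):
    alerts = []
    if reading.smoke_detected:
        alerts.append("smoke: notify operator and authorities")
    if reading.co2_ppm > 1500:                       # assumed comfort limit
        alerts.append("high CO2: ventilate and notify passengers")
    if reading.temperature_c <= reading.glass_temperature_c:
        alerts.append("fogging risk: raise interior temperature via HVAC")
    if reading.humidity_pct > 70:                    # assumed humidity limit
        alerts.append("high humidity: adjust ventilation")
    return alerts

print(assess(CabinReading(1800, 18.0, 19.0, 75.0, False)))
```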

To summarize, passengers wish to travel in a clean and comfortable environment and to be notified if the conditions deteriorate. Moreover, operators would like to be notified when environmental conditions are considered harmful for the passengers. Representative use cases are indicatively presented as follows:

  • Lighting a cigarette: Inside the shuttle, a passenger lights a cigarette. The smoke detection sensors detect the smoke coming from the cigarette. The real-time sensor data is sent to a central PC, installed in the vehicle, where the data processing takes place. Based on the real-time data processing, the PC determines that it is an emergency and sends a smoke detection message to the operators. The operator evaluates the criticality of the situation and decides how to intervene (with an announcement over the loudspeakers or by taking more drastic measures, e.g., stopping the bus).

  • Exposure to carbon dioxide: While the bus is on its route, high levels of CO2 are detected by the relevant sensor. The sensor data is sent in real time to the central PC and processed. Unusual CO2 concentrations in the air might have an adverse effect on passengers’ health; for instance, high levels of CO2 are related to dizziness, restlessness, breathing difficulties, and an increased heart rate. To prevent an incident related to these health issues, such as fainting, the PC sends the vehicle’s central system the command to open the windows, so that the air is refreshed and returns to its normal composition. The passengers are also informed of the air composition through the mobile application in real time.

  • High temperature in the automated vehicle: The sensor measures the temperature and sends the data to the central PC. When the temperature exceeds a predefined level, the PC sends the command to start the cooling system. At the same time, passengers can be informed of the temperature inside the vehicle through the mobile application.

For the “shuttle environment assessment” service, a set of sensors was used to determine the air quality inside the automated shuttle environment, as well as to detect any passengers smoking and to prevent any fogging, as depicted in Fig. 5.17. After the sensors were deployed, they measured several metrics, such as CO2, NO2, humidity, temperature, fog, dust, and smoke, over a sufficiently long period of time. The collected values are used to predict the indoor air quality conditions for the next couple of hours. These measurements, as well as the conclusions of the real-time assessment, are sent to both the passengers and the operators of the shuttle; in case of an alarm, such as a possible fire, they are also sent to the suitable authorities. Moreover, this may be followed by appropriate notifications and/or instructions to the passengers, while the vehicle may also implement respective actions.

Fig. 5.17
Environment sensors measuring temperature, smoke and dust, fog and humidity, and pollutants are connected to an onboard PC, which is linked to the AVENUE platform with the operators and the mobile app.

Environmental assessment process

6 Service: Smart Feedback System

The service “smart feedback system” aims to allow travelers inside the shuttle to give easy and effortless feedback to the operators when the safety driver is no longer inside the shuttle. It is important for the operators to know whether people are satisfied with the services and the transportation. Currently, the safety driver talks to the travelers and, through his/her presence, also serves as the communication channel between the operators and the travelers. When the safety driver is removed, knowing whether travelers are satisfied or disappointed becomes even more important, as the safety driver is no longer there to support and assist them.

When the operator is removed, the automated services must provide the same level of service and interaction as he/she did while being in the shuttle. This service aims to allow travelers to give their feedback about the service experience as easily as possible. This is done by instructing the travelers to give a hand gesture to one of the cameras inside the shuttle, allowing them to effortlessly say “I like” or “I don’t like” the experience with a thumbs up or a thumbs down. The concept is communicated to the travelers via stickers inside the shuttle. Camera technology is used to capture the thumbs up or thumbs down, but, if possible, sound sensors will also be tested to capture the experience/feedback from the travelers. The service will be communicated to the travelers as follows:

  • Giving a thumbs up/down in light settings: Midday with sunlight. Good visibility for the cameras.

  • Giving a thumbs up/down in dark settings: Early morning or night with no sunlight. Low visibility for the cameras.

  • Giving a thumbs up/down in crowded settings: Many passengers inside the shuttle, both standing and seated. Low visibility for the cameras due to people standing close to the cameras.

  • Giving thumbs up/down in empty settings (or with few passengers): Few or no passengers inside the shuttle. Good visibility for the cameras, making the hand gesture easy to see.

For the “smart feedback system” service, the model is trained end to end and regularized so that it distills the most compact profile of the normal patterns in the training data and effectively detects the “thumbs up” and “thumbs down” gestures. The original images of the hand gestures are acquired through the USB camera inside the shuttle and then passed through a single-shot detector (SSD), which detects the bounding box of the hand and produces the corresponding cropped frame. The cropped frame of the hand is then passed to the CNN, which predicts a class vector with values between 0 and 1, as illustrated in Fig. 5.18. These values correspond to the probability of the frame belonging to each of the classes. Real-time results are depicted in Fig. 5.19.
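
The following is a minimal tf.keras sketch of the gesture classifier that consumes the hand crop produced by the detector; the input size, layer widths, and class encoding are illustrative assumptions.

```python
# Sketch of the CNN that classifies a cropped hand image as thumbs up or down.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_thumb_classifier(size=64, n_classes=2):
    model = models.Sequential([
        layers.Input(shape=(size, size, 3)),             # cropped hand region
        layers.Conv2D(16, 3, activation="relu"),
        layers.MaxPooling2D(),                           # convolution + subsampling
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),             # fully connected layer
        layers.Dense(n_classes, activation="softmax"),   # P(thumbs up), P(thumbs down)
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

clf = build_thumb_classifier()
print(clf(tf.random.uniform((1, 64, 64, 3))))            # class probabilities in [0, 1]
```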

Fig. 5.18

CNN architecture for thumb orientation detection

Fig. 5.19

The smart feedback service operating in real time inside the shuttle

The results were validated during a twofold live session. Staff from HOLO performed fighting, bag snatching, falling, and vandalism scenarios in vehicle P109, which serves the Slagelse route in Denmark. The algorithms were able to correctly identify the performed scenarios with 89% accuracy, and the appropriate notifications were also captured in the operator’s dashboard. The two dashboards were correctly synchronized with regard to real-time event detection. Figures 5.20 and 5.21 show the maintenance and operator dashboards with the results for the service “enhance the sense of security and trust.” The validation of the automated passenger counting was completed through manual comparison with data received from the operator phone’s data stream. The automated count was found to be precise in terms of timestamp and accurate in terms of the passenger count. The follow my kid, shuttle environment assessment, and smart feedback services were also validated successfully during the live session.

Fig. 5.20

Maintenance front-end interface regarding the service “enhance the sense of security and trust”

Fig. 5.21

Operator dashboard regarding the service “enhance the sense of security and trust”

7 Conclusion

Transport service quality and passengers’ safety and comfort are major preoccupations of public transportation operators. Thus, the automated AI-supported in-vehicle services will eventually substitute for the role of the safety driver and the essential functions he/she currently provides inside the shuttle, while enhancing the passengers’ adherence to the novel services and accelerating the adoption of automated mobility.