1 Introduction

In the last few years, many research works have been dedicated to performing or activating tasks using voice commands [1]. Those voice commands must first be interpreted using speech recognition, which has been applied in many fields, including medicine [2], agriculture [3], smart homes [4], and smart vehicles [5]. Research suggests that by 2022, nearly three-quarters (73%) of drivers would use speech recognition technology [6]. In smart vehicles, speech recognition improves the driver's experience by allowing hands-free control of a car's functionalities, which helps the driver focus on driving. It is used for a wide range of tasks, such as adjusting speed, turning on the radio, controlling the climate, and even starting the car without a key.

While these functionalities can improve the driver's experience, several safety concerns have been raised. One of the main concerns regarding cars that can be started through voice commands is that children who have access to the keys or are left alone in a car might be able to start the car simply by saying the required keyword. One possible solution is to implement an age classification system that analyzes the audio signal and only grants access to start the car if an adult's voice is detected. However, children might use a recording of an adult voice to start the car (a "replay attack") or use an application to produce a synthetic pronunciation of the required keyword (a "synthetic voice attack"). A replay attack can be countered if the age classification system is trained to distinguish live voices from recorded voices.

Moreover, speech recognition problems are often solved using the traditional method of processing the audio stream in the cloud [7,8,9], which raises several issues. First, it raises privacy concerns, since the driver's speech is sent to the cloud for processing. Second, transmitting the data through the network introduces high latency.

1.1 Motivation and challenges of on-device robust speech recognition

Many studies have developed systems to detect adversarial voice commands by using sound and spectral features to distinguish human voices from machines [10, 11] or from other passengers [12]. However, these systems are still vulnerable to receiving commands from ineligible drivers, such as children, which could result in car accidents. This work (LimitAccess) focuses on a new security aspect: classifying the driver's age to ensure the driver's eligibility. LimitAccess can also detect adversarial attacks in cars that use a speech recognition system to start.

The datasets used by the limited number of existing studies separate speakers into age groups such as the 20s, 30s, and so on; none separates the samples into children and adults. Furthermore, these datasets contain spoken phrases rather than single words. Moreover, the majority of research processed the audio files in real time using the traditional cloud-based approach. Overall, there is a lack of work dedicated to assigning speech to a specific age group.

1.1.1 Model size and latency

Running an NN model for speech recognition on MCUs faces limited memory and computational resources. MCUs typically provide from a few tens to a few hundred kilobytes of memory, and this limited memory must accommodate the whole NN model, including all its parameters. Computational resources are also limited: since our system needs to be on all the time, only a small number of operations per inference can be afforded. These MCU restrictions, together with the high accuracy and low latency requirements of speech recognition systems, necessitate a lightweight NN design, which is the emphasis of our research. In this paper we follow the TensorFlow Lite framework and use its built-in libraries to optimize our simple convolutional neural network architecture. The optimized model must be sufficiently compact to be installed on the Arduino Nano 33 BLE Sense, which is supported by TensorFlow Lite.

1.1.2 Robustness

Our primary goal is to optimize the performance of our model so that it achieves the maximum possible accuracy while remaining efficient enough to be deployed on the MCU. One of the system requirements is that it works effectively in any situation, so it should be able to detect possible attacks against it. For instance, children may figure out how to circumvent the system by recording an adult's voice: they can simply use a phone to record a parent saying the keyword and then play the recording back to start the car and fool the system. They can also use text-to-speech software to generate the keyword synthetically and unlock the car. As a result, this work focuses on improving the proposed system's resilience against these sorts of replay and synthetic voice attacks.

1.2 Contributions of this paper

To address the aforementioned issues, we propose a secure and automatic speech recognition system called LimitAccess to effectively restrict children from accessing and performing safety-critical tasks, such as starting a car, via speech recognition. Our system can distinguish the voices of children from those of adults and continues to work even if a recorded or synthetically generated command is given. We deployed LimitAccess's prototype on a microcontroller in real-world settings to reduce the time needed for the system to analyze and respond to the driver's command. This also addresses privacy concerns associated with the drivers' voices, since the data is processed locally (on the microcontroller). The main contributions of this paper are summarized next:

  • Develop a robust age classification model that works on speech signals to limit access by underage users. Our system is resilient to replay and synthetic voice attacks.

  • Create a dataset of the word “open” uttered by adults and children.

  • Deploy the trained model on an Arduino Nano 33 after it has been evaluated and optimized.

1.3 Organization of this paper

The rest of the paper is organized as follows: Sect. 2 provides background information about speech recognition and age classification and reviews related work. Sect. 3 introduces the methodology. The experiments conducted and the results obtained are presented in Sect. 4, followed by a discussion of the outcomes in Sect. 5. Finally, Sect. 6 closes the paper with the conclusion and future work.

2 Background and related work

Recently, the way users interact with devices and applications has been evolving: they are now more likely to use voice commands to control and initiate an interaction with their devices. This is mainly because of the wide use of personal voice assistants (PVAs), which are found everywhere, from smart homes to digitized world environments and smart cars. They come in many shapes and forms; the best-known PVAs are Siri, Alexa, Google Home, and Amazon Echo. These personal assistants can be integrated into our smartphones and tablets, attached to different home devices and smart cars (Mercedes and Jaguar), or operate on their own [1]. The use of PVAs has increased significantly over the last decade; over 21% of Americans own a smart speaker, and 80% of the American population owns a smartphone [13].

2.1 Personal voice assistants (PVAs)

PVAs are devices that listen to voices and detect commands to perform the required action. PVAs often combine hardware such as microphones and speakers with software to facilitate listening, recording, analyzing, and initiating actions. A PVA usually runs continuously to detect a wake word, such as "Hey Siri" in the case of the iPhone; the system then sends the recording to the cloud. The cloud runs Automatic Speech Recognition (ASR), which analyzes the speech and scans it to identify commands. Finally, the cloud carries out the requested commands and sends back a response to the device, which is then played for the user through the speakers [1]. ASR is a research field combining computer science, linguistics, and electrical engineering, and it is used mainly to convert speech to text. ASR works by first converting the analog audio signal into a digital representation and then extracting features using ML. The ML techniques usually look for the parts of the sound that represent distinct speech units, called phonemes [14].

2.2 Speech recognition in cars

The use of PVAs and speech recognition is becoming an essential part of smart vehicles, deployed mainly to help drivers multitask while driving safely. Our day-to-day lives are highly dependent on our devices, especially smartphones, and although it is better not to have distractions while driving, most people tend to take their eyes off the road to use their smartphones or control the different functionalities of the car. Therefore, in-car speech recognition is very important for keeping drivers safe. Although many car companies have integrated speech recognition into their cars, this field is still under constant development; more and more functionalities are being added to these systems [15].

Many accidents have been reported in which children have stolen car keys and driven the vehicle. A traffic accident in Utah between a sedan and a truck was reported; both the driver and the passenger of the sedan turned out to be children (9 and 4 years old) who had stolen their parents' car keys [16]. In [17], it was reported that a 2-year-old toddler who was left alone in a parked car managed to move the car, causing an accident involving two cars. Another incident was reported by BBC News in [18], where a 13-year-old was driving a truck and collided with a university van; nine of the van's passengers were killed in this accident, including eight members of the university's athletics team and their coach. Thus, it is critical to prevent children from accessing cars, which necessitates adding a voice-based lock to the car to verify who is behind the steering wheel. Because our study addresses multiple problems, such as processing audio inputs for age categorization and converting the model to a lightweight TinyML model, we review the literature for each of these issues and place our system within the context of the existing work.

2.3 Key classification features

When dealing with audio data, machine learning models cannot process raw audio directly. Instead, the models take as input features extracted from the audio files. Audio features can be divided into those that can be perceived by human listeners, such as pitch, loudness, and timbre, which are called perceptual features, and those that can be represented only through mathematical and statistical descriptions, which are called physical features [19].

The selection of audio features depends mainly on the task and the purpose of the classification. For the task of age classification from audio signals, for instance, there are key features that contribute the most to determining age. Because the human voice changes continuously throughout a person's lifetime, age estimation is naturally a regression task of predicting an age value from the voice. In practice, however, these values are often grouped into age groups, so the age classification system needs to associate a voice with an age group rather than an exact age. The main features used to determine age groups are short-term energy, zero-crossing rate, and energy entropy. Short-term energy is the sum of the squared sample points, while the zero-crossing rate measures the speed of the speech [20].
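As a rough illustration (not code from the paper), the following NumPy sketch computes two of these frame-level features, short-term energy and zero-crossing rate, for a one-second 16 kHz clip; the frame and hop lengths are assumptions.

```python
# Sketch: short-term energy and zero-crossing rate for a framed 16 kHz signal.
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):   # 25 ms frames, 10 ms hop at 16 kHz (assumed)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop: i * hop + frame_len] for i in range(n_frames)])

def short_term_energy(frames):
    # Sum of squared sample points in each frame.
    return np.sum(frames.astype(np.float64) ** 2, axis=1)

def zero_crossing_rate(frames):
    # Fraction of adjacent sample pairs whose sign changes.
    signs = np.sign(frames)
    return np.mean(np.abs(np.diff(signs, axis=1)) > 0, axis=1)

audio = np.random.randn(16000)          # placeholder for a real one-second recording
frames = frame_signal(audio)
ste, zcr = short_term_energy(frames), zero_crossing_rate(frames)
```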

2.3.1 Spectrogram

After a spectrum is obtained by converting the audio signal from the time domain into the frequency domain, a windowing process splits the signal into shorter segments so that each segment can be analyzed at a certain point in time. The short-time Fourier transform (STFT) of the signal is obtained by taking the discrete Fourier transform (DFT) of each window. The STFT is hard to visualize because it is complex-valued, so the log spectra of the STFT matrix are computed to obtain a 2-dimensional representation that is more easily visualized as a heat map (spectrogram) containing the details of the audio signal. In general, a spectrogram is a graphical representation of the signal strength, or "loudness," of a signal across time at the different frequencies contained in a waveform. It shows how much energy there is at different frequencies and how energy levels change over time. A spectrogram is typically represented as a heat map, a picture in which the intensity is represented by varying the color or brightness [19].
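As an illustrative sketch only (the paper does not provide code), the log spectrogram described above could be computed with SciPy as follows; the 16 kHz sample rate matches the dataset, while the frame and hop lengths are assumptions.

```python
# Sketch: windowed STFT followed by a log-magnitude (dB) spectrogram.
import numpy as np
from scipy.signal import stft

def log_spectrogram(audio, sr=16000, frame_len=400, hop=160):
    # STFT with a Hann window; Zxx is complex-valued, so only |Zxx| is kept.
    f, t, Zxx = stft(audio, fs=sr, window="hann",
                     nperseg=frame_len, noverlap=frame_len - hop)
    return f, t, 20 * np.log10(np.abs(Zxx) + 1e-10)   # dB scale, avoid log(0)

freqs, times, spec_db = log_spectrogram(np.random.randn(16000))
print(spec_db.shape)   # (frequency bins, time frames)
```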

2.3.2 MFCCs

Mel-frequency cepstral coefficients (MFCCs) are extracted from the spectrogram by multiplying each window with a filter bank to obtain a Mel-weighted spectrum. This procedure removes some of the spectrogram's unrelated features and therefore concentrates on the most informative components of the signal. MFCCs are then calculated by applying a Discrete Cosine Transform (DCT) to the Mel-weighted spectra. One of their flaws is that MFCCs are not noise resistant; as a result, they have not always performed well in noisy environments compared to other features. Despite these drawbacks, MFCC has been the most widely used feature in speech and audio systems because it is simple to construct and has low computational complexity. The MFCCs of a signal are thus a small number of coefficients (often 10-20) that concisely describe the shape of the spectral envelope and model characteristics of the human voice [19].
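Similarly, a hedged sketch of MFCC extraction using librosa is shown below; the choice of 13 coefficients follows the 10-20 range mentioned above and is not taken from the paper.

```python
# Sketch: MFCC extraction for a one-second 16 kHz clip with librosa.
import numpy as np
import librosa

audio = np.random.randn(16000).astype(np.float32)   # placeholder utterance
mfccs = librosa.feature.mfcc(y=audio, sr=16000, n_mfcc=13,
                             n_fft=400, hop_length=160)
print(mfccs.shape)   # (13 coefficients, time frames)
```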

2.4 Generating and detecting recorded and synthetic voices

There are some distinct characteristics of the speech signal, related to the power spectrum, that can be used to distinguish recorded voices from live voices. All loudspeakers introduce distortions into recorded speech, ranging from significant distortion in low-quality speakers to minor distortion in high-quality ones. Moreover, the majority of the spectral power of live voices occurs between 20 Hz and 1 kHz, with a significant power decrease around 1 kHz. In contrast, the spectral power of recorded sounds decreases linearly between 1 and 5 kHz [21].
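To make this cue concrete, the following sketch compares the spectral power in the two bands discussed above; interpreting the resulting ratio (and any threshold on it) is illustrative and not taken from [21].

```python
# Sketch: ratio of spectral power below ~1 kHz to power between 1 and 5 kHz.
import numpy as np

def band_power_ratio(audio, sr=16000):
    spectrum = np.abs(np.fft.rfft(audio)) ** 2
    freqs = np.fft.rfftfreq(len(audio), d=1.0 / sr)
    low = spectrum[(freqs >= 20) & (freqs < 1000)].sum()     # 20 Hz to 1 kHz
    high = spectrum[(freqs >= 1000) & (freqs < 5000)].sum()  # 1 kHz to 5 kHz
    return low / (high + 1e-12)

# Live voices concentrate power in the low band, so a larger ratio is
# expected for live speech than for loudspeaker playback.
ratio = band_power_ratio(np.random.randn(16000))
```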

Another task that targets analyzing audio signals is the detection of synthetic voices, usually used to detect voice spoofing attacks. There are many tools and techniques to generate a synthetic voice. A vocoder is a speech model that uses text to predict speech features and parameters and then uses those features and parameters to reproduce acoustic waveforms. As a result, synthetic voices have become more human-like. However, the wrongly predicted speech parameters and traits can still be used to distinguish synthetic voices from live voices. Another synthetic voice generation technique creates speech by concatenating selected portions of recorded samples that represent letters and syllables (speech units). The speech generated using this technique is also identifiable because the added audio segments form an uneven waveform [22].

2.5 Utilizing audio signals for classification problems

Many fields of research utilize audio for classification. Interesting research was done in [23] on classifying the sounds of crying babies into states such as hungry, unwell, deaf, asphyxia, or normal. The goal of this study was to identify problems at their earliest stages by processing the babies' crying sounds, since babies are unable to express themselves verbally. ResNet50 was employed as a pre-trained model with modest modifications because it gives very high accuracy in image classification problems. The authors then proposed combining ResNet50 and SVM models in an ensemble approach. Compared with a classic CNN, SVM, and ResNet50, the results reveal that the proposed technique provided the highest accuracy: the CNN obtained 87.33%, ResNet50 90.80%, SVM 90.10%, and the proposed model 91.10%. Although this study mentions imbalanced classes, it does not describe the measures taken to address them. Even though the dataset was small, no data augmentation techniques were used. Moreover, it is possible that the spectrogram was not as beneficial as MFCC for feature extraction, so the researchers could test the accuracy of the model using MFCC instead.

The authors of another interesting study [24] classified bird species based on their sounds and visuals using a multi-model technique. The study’s goal was to look at audio and video characteristics before applying a kernel-based fusion to categorize the type. Activation values from CNN’s internal layer were used to extract features, then merged using Multiple Kernel Learning. They compared their proposed multi-model approach with all the standard strategies of single modality learning, simple kernel, and conventional fusion; this strategy yielded the best accuracy of 78.15%.

A deep learning approach was presented in [22] for classifying people's eating sounds to better understand their eating habits. By employing Mel spectrograms for feature extraction, the team obtained promising results: Mel spectrograms were generated from the audio stream and fed into AlexNet and VGG16 pre-trained CNN models, with the activations of three FC layers used as feature vectors. The average recall of their proposed system was 79.9%. Whisper is a recent speech recognition model trained on 680,000 hours of diverse multilingual and multitask supervised data acquired from the internet; it achieves high speech recognition accuracy, but none of its released model sizes fits on MCUs [25]. Large Language Models (LLMs), originally developed in the context of Natural Language Processing and Understanding, have recently become very popular as foundation models for various downstream tasks [26]. LLMs are, however, inappropriate for our use case: in their original form they are huge models that are difficult to deploy on tiny embedded devices. We have instead chosen CNNs, which are well-known, popular techniques that are suitable for spectrograms and MFCCs and can be compressed without sacrificing much performance.

2.6 Age classification

Age classification can be performed using machine learning based on different modalities, including images, speech, or video. Some examples of techniques that have worked on this problem using images for age and gender classification can be found in [27,28,29,30,31], and [32].

There is active research on classifying age using speech signals. In [33], the authors proposed the use of Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN) often used to capture the temporal and sequential characteristics found in speech. The authors converted the speech signal into MFCC features for classification. Their model faced an overfitting problem and produced better accuracy in testing than in training; however, they were able to mitigate the problem by using augmentation and regularization. In [34], the authors used a multi-layer perceptron to classify speech into different age groups using several audio features (MFCC, pitch, and formants) on the Mozilla speech dataset. Their model obtained an 85.68% testing accuracy in categorizing the speech samples into the different age groups, surpassing other models in the literature such as SVM, Random Forest, and XGBoost classifiers.

Another work on age classification can be found in [35], which used a convolutional neural network (CNN) to classify speech audio by gender and age, with multiple models used in conjunction for classification. The authors reported 90.08% accuracy for the age prediction model, 43.18% for female age classification, and 43.46% for male age prediction. In most of the works surveyed, the results were acceptable for gender classification; however, the accuracy obtained for age classification was relatively low. A multi-layer perceptron classifier was also used in [36], where the authors developed a model that classifies speech into eight different age groups using the Mozilla open-source dataset with MFCC and Perceptual Linear Prediction (PLP) features; they reported 94.34% classification accuracy.

Generally, detecting age from speech or images yields lower accuracy than from video, since these two forms of media contain less information.

2.7 TinyML

The traditional method of processing a continuous audio stream in the cloud is inefficient in terms of power consumption, because such speech recognition systems must be on all the time to detect predefined words and transmit them over the network to the cloud. The resulting increase in system latency may harm the user experience, and privacy concerns may also arise. Considering that these systems are continually running, they should use very little power to maximize battery life. Limited-resource devices such as microcontrollers are the best candidates for always-on systems since they are affordable and power-efficient [37].

Since ML models typically require large storage space and computation power, deploying them on MCUs is challenging, which is where TinyML comes in. TinyML is a fast-growing field of machine learning technologies and applications, including algorithms, hardware, and software, that can perform on-device inference at extremely low power, typically in the milliwatt range or below [38]. The general TinyML workflow involves training a model on a powerful computer or the cloud, optimizing the trained model, and then deploying it to an embedded device.

Raza et al. [39] investigated TinyML inference at the edge using a microcontroller integrated into a DJI Tello Micro Aerial Vehicle (MAV), a type of drone. They defined a mission in which a drone navigated a populated area while identifying people (a face detection task) and classifying whether they were wearing a protective mask. The proposed solution used TinyML's capabilities to create a fully autonomous smart drone able to complete the mission entirely independently while minimizing battery use to extend flight time. The authors analyzed the energy efficiency of their model and validated the proposed energy-aware smart drone on two microcontrollers: OpenMV and Arduino Nano 33 BLE. The system's energy usage was evaluated in various settings and flying conditions to show its energy efficiency. Experimental validation demonstrated that the impact of integrating the microcontroller on the payload was manageable in relation to the value added by the system's intelligence, making the system viable in real-world contexts.

In [40], the authors worked on the keyword spotting problem and introduced the notion of attention condensers, which they used to build low-footprint, high-efficiency deep neural networks for speech detection at the edge. They suggested an attention condenser-based model that learns and creates a condensed embedding. In contrast to self-attention techniques for deep convolutional neural networks, which rely mostly on convolution modules, attention condensers are self-contained, stand-alone modules that allow a more efficient deep neural network as their use is increased. By using attention condensers extensively and stand-alone convolution modules minimally, they could minimize the number of final design parameters. Furthermore, the suggested approach employs a machine-driven exploration strategy to find an on-device speech recognition design well suited to microcontroller restrictions. MFCC features, which represent the audio signals, are fed into the model. They evaluated their model on the Google Speech Commands dataset and achieved significantly lower complexity and storage requirements than previous studies built for on-board speech recognition applications. According to their findings, attention condensers can be used to construct high-performance deep neural networks in various TinyML applications, including on-device speech recognition. The proposed system still needs to be tested in different contexts to determine its effectiveness.

In [19], several datasets, such as BDLib, UrbanSound, and ESC, were used to assess the environmental sound recognition (ESR) problem. The authors compared traditional machine learning techniques such as SVM, KNN, and decision trees with deep learning approaches. On certain datasets, the traditional ML techniques surpassed the DL ones in terms of accuracy, but at the cost of classification time, which scales dramatically on low-resource devices such as the Raspberry Pi. As a result, they proposed a lightweight CNN model as an excellent alternative for devices with limited resources.

The authors in [41] worked on a similar problem. Their primary goal was to develop a light model that can distinguish between children's and adults' voices; their system only accepts specific instructions based on that categorization. The authors built their dataset with several English and Malay keywords and performed different experiments to ensure that their suggested model could distinguish between children and adults when tested on any of the keywords on which the system had been trained. In the experiment with the word "open", they obtained 94% accuracy. However, the results were reported for the validation set, not the test set. This work also used a dataset that lacked sufficient diversity: it contained 1683 samples collected from 30 adults and 14 children, all from Malaysia, and it is not clear whether the authors took special care to ensure gender and racial diversity. This dataset was also not balanced. We try to address these issues in our dataset.

3 Methodology

This section explains the framework of our approach and the equipment used to develop a TinyML model that can categorize the input voice as either child or adult. Our framework is capable of distinguishing voice commands issued by an adult driver from those issued by children. Moreover, LimitAccess can also detect and deny access when a recorded or synthetic command is played from a speaker, as shown in Fig. 1. We implemented our framework as a prototype and tested it in real-world settings.

Fig. 1
figure 1

Flowchart of robust speech recognition in LimitAccess

Fig. 2
figure 2

The LimitAccess ML pipeline starts with data collection and preprocessing before feeding the data into the CNN model. After training, the model is converted with TensorFlow Lite for deployment on an Arduino Nano 33

3.1 Framework

The overall framework, as shown in Fig. 2, starts with the data acquisition stage, which involves the creation of a dataset with four classes for the word "Open": children, adults, recordings of adult voices, and noise. The collected voice samples are then pre-processed with Audacity [42] to filter out noise components. The machine learning (ML) model is trained on MFCC features extracted from the voice samples to distinguish between the different classes. To deploy our model as a working prototype in a real-world scenario, the trained ML model is compressed in a model conversion stage before being deployed on the Arduino Nano 33. The conversion follows TensorFlow Lite's [43] general framework. Once the model is deployed on the memory-constrained device, it can be evaluated in real time and in real-world settings.

3.1.1 Diversity in our dataset

Data augmentation is a typical approach for increasing the size of a dataset or mitigating bias in it by manipulating existing samples [44]; this not only expands the dataset but also provides various permutations of a single sample, helping the model avoid overfitting and improving its robustness and generalization. With our dataset, we did not need to perform data augmentation since the collected samples are diverse in nature, as can be seen from Fig. 3. To ensure this diversity, we asked the participants to utter the word "Open" in several ways and from various angles and distances while collecting the voice samples.

Fig. 3
figure 3

Different samples of the word "Open" in the time-frequency domain from the dataset with the label "Adult"

3.1.2 Removing noise

Since the voice samples were collected in crowded environments such as schools and malls, it was necessary to remove the background noise from the voice samples containing the keyword "Open", so that the model can differentiate between the noise class and the child and adult classes. This was accomplished using the Audacity 3.1 tool [42], which reduces the decibel level of the selected waveform to the needed level.
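The authors performed this noise removal interactively in Audacity; as a programmatic analogue (a different tool, shown only as a sketch under our own assumptions), a spectral-gating library such as noisereduce could be applied to the raw clips.

```python
# Sketch: spectral-gating noise reduction as a scripted alternative to the
# Audacity GUI workflow (noisereduce is not the tool used in the paper).
import numpy as np
import noisereduce as nr

sr = 16000
noisy = np.random.randn(sr).astype(np.float32)   # placeholder noisy "Open" clip
cleaned = nr.reduce_noise(y=noisy, sr=sr)        # attenuates the estimated noise floor
```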

3.1.3 Machine learning classifier

We selected a CNN model, a deep neural network-based machine learning model, for three primary reasons. First, CNN-based models still achieve state-of-the-art results in many audio classification tasks due to their ability to effectively extract local features from the audio input [45, 46]; as a result, combining them with transformers has been popular in several recent works to achieve SOTA accuracies in different speech recognition tasks [47, 48]. Second, CNN-based models have shown promising results on image-related problems compared to other ML models because of their ability to capture patterns [49]; since audio files can be converted into images, their features can be properly captured. Third, the CNN model is a lightweight neural network compared to RNN- and transformer-based models, and it can be compressed without significantly degrading performance [45]. We designed a lightweight CNN model consisting of two 1-D convolutions with kernel size 3 × 3 and stride 2, followed by max pooling of size 2 × 2 and dropout layers to improve the model's regularization. The model ends with a flatten layer whose output feeds a dense layer with a SoftMax activation function and an output vector equal to the number of classes. More details of the CNN architecture are shown in Fig. 4.

Fig. 4
figure 4

The Lightweight CNN architecture used in LimitAccess
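As a rough Keras sketch of the lightweight architecture in Fig. 4 (not the authors' exact code), the layer types follow the description above; the filter counts, input shape, number of classes, and the use of a scalar kernel length of 3 for the 1-D convolutions are assumptions.

```python
# Sketch: a lightweight 1-D CNN over MFCC frames, ending in a softmax classifier.
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 4                  # adult, child, recording, noise (assumed ordering)
INPUT_FRAMES, N_MFCC = 49, 13    # illustrative MFCC matrix shape, not from the paper

model = tf.keras.Sequential([
    layers.Input(shape=(INPUT_FRAMES, N_MFCC)),
    layers.Conv1D(8, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.25),
    layers.Conv1D(16, kernel_size=3, strides=2, padding="same", activation="relu"),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.summary()
```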

3.2 Dataset

Our model combines keyword detection and classification: it recognizes the word "Open" and then determines whether it was spoken by an adult or a child, and it must run on a microcontroller unit (MCU). Finding existing data sources containing the word "Open" proved difficult, prompting us to create our own dataset to fulfill our requirements. Collecting data for such a model is not an easy task, and it requires specific constraints to be defined so that it can be completed effectively.

3.2.1 Constraints

While gathering data, we imposed some constraints on ourselves so the task could be completed properly. Firstly, we decided to concentrate on capturing audio on-device in a way that matches the real environment; recording audio free of background noise, with high-quality microphones, and in a formal context would be unrealistic. A successful model has to deal with noisy environments, ordinary recording equipment, and people conversing in a natural, chatty manner. To reflect this, we visited people in different places (schools, playgrounds, malls, offices, and classrooms) while background discussions were taking place, and we did the recording with the microphones of an iPhone 13 and a Samsung Note 10+.

Secondly, we concentrated on English as the language of choice to limit the scope of the data collection and make it easier to undertake quality checks on the gathered data. However, we are hopeful that transfer learning and other techniques will make this dataset valuable for other languages. Thirdly, to improve the generalizability of our model, we recruited as many people as possible of both genders and from different ages and races (Arab, Asian, European, African, etc.). Fourthly, we chose the word "open" as our keyword because we wanted to avoid terms and lengthy sentences that would make it difficult for participants to record their voices accurately. Lastly, to make the training and testing procedures easier, we limited all utterances to one word at a time, with a standard duration of one second.

3.2.2 Dataset statistics

In this research, we obtained 40 min of data collected from 250 participants as given in Table 1, utilizing the impulse framework, which allows users to collect audio data from microphones. In the data collection design, we concentrated on collecting the term “open” from two sources: adults and children.

Table 1 LimitAccess Dataset breakdown

3.3 Attacks

3.3.1 Replay attack

We considered the possibility of a replay attack in which a child attempts to start the engine by playing an old recording of an adult saying the word "open". To simulate this attack, we followed the steps shown in Fig. 5: we played recordings of adults saying the word "open" on four different devices and then captured them, so that the model could learn to differentiate between recorded and live adult voices. A live voice differs from a recorded voice for two main reasons: firstly, the cumulative power distributions of live human sounds and those reproduced through speakers differ significantly; secondly, speakers inherently distort the original sounds when playing them back [21]. These differences can be captured by the audio features [21]. Since each speaker generates frequencies that differ from those produced by other speakers, we made sure that the recordings were played on four distinct devices, listed in Table 2.

Table 2 Devices used to play back the recordings and the distribution of samples collected by each device
Fig. 5
figure 5

Steps for collecting recorded samples [21]

3.3.2 Adversarial attacks

Three additional attacks were generated and evaluated against the proposed system: synthetic attacks, hidden voice attacks, and Generative Adversarial Network (GAN) attacks. A synthetic attack occurs when a child uses a text-to-speech tool to play a synthetic voice saying the word "open" to fool the system into believing an adult is speaking. To simulate this attack, we used several text-to-speech applications found on the App Store, such as Speechbot, Text to Speech, and Speechie, and played the synthetic voices to be tested. To enhance the system's robustness against this attack, the system was retrained on our dataset after adding 120 synthetic voice samples to the recording class.

Fig. 6
figure 6

Generic architecture of the Generative Adversarial Networks [50]

3.3.3 Hidden voice attacks

Hidden voice attacks are a type of attack in which audio samples are produced that sound like noise to humans but contain a hidden signal intelligible to the machine. Ideally, the ML algorithm should classify such samples as noise, but the attack causes them to be interpreted as a specific command [21].

3.3.4 GAN-based synthetic attack

In our study, we utilized Generative Adversarial Networks (GANs) [50] to produce another synthetic attack, in which GANs generate sounds that mimic the features of adults' voices. The architecture of the GAN model used is shown in Fig. 6. GANs are a type of generative model, built with deep learning techniques, consisting of two sub-models: a generator and a discriminator. The generator produces samples, such as specific sounds, while the discriminator classifies samples into predetermined categories: real examples from the domain or fake ones produced by the generator. In other words, the generator network (G) attempts to replicate samples from the data distribution, while the discriminator network (D) must distinguish between genuine and generated samples [51].
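For illustration only, a minimal TensorFlow sketch of the generator/discriminator pairing in Fig. 6 is given below; the feature dimensions, layer sizes, and training details are assumptions rather than the setup used to generate the attack samples.

```python
# Sketch: a minimal GAN over flattened voice-feature vectors (dimensions assumed).
import tensorflow as tf
from tensorflow.keras import layers

LATENT_DIM = 64
FEATURE_DIM = 13 * 50   # e.g., a flattened 13 x 50 MFCC matrix (illustrative)

def build_generator():
    # G: maps a random latent vector to a synthetic "adult voice" feature vector.
    return tf.keras.Sequential([
        layers.Input(shape=(LATENT_DIM,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(FEATURE_DIM, activation="tanh"),
    ])

def build_discriminator():
    # D: classifies a feature vector as real (from the dataset) or generated.
    return tf.keras.Sequential([
        layers.Input(shape=(FEATURE_DIM,)),
        layers.Dense(256, activation="relu"),
        layers.Dense(1, activation="sigmoid"),
    ])

generator, discriminator = build_generator(), build_discriminator()
g_opt, d_opt = tf.keras.optimizers.Adam(1e-4), tf.keras.optimizers.Adam(1e-4)
bce = tf.keras.losses.BinaryCrossentropy()

@tf.function
def train_step(real_features):
    noise = tf.random.normal((tf.shape(real_features)[0], LATENT_DIM))
    with tf.GradientTape() as d_tape, tf.GradientTape() as g_tape:
        fake = generator(noise, training=True)
        real_pred = discriminator(real_features, training=True)
        fake_pred = discriminator(fake, training=True)
        # D learns to label real samples 1 and generated samples 0.
        d_loss = bce(tf.ones_like(real_pred), real_pred) + \
                 bce(tf.zeros_like(fake_pred), fake_pred)
        # G learns to make D predict 1 for generated samples.
        g_loss = bce(tf.ones_like(fake_pred), fake_pred)
    d_opt.apply_gradients(zip(d_tape.gradient(d_loss, discriminator.trainable_variables),
                              discriminator.trainable_variables))
    g_opt.apply_gradients(zip(g_tape.gradient(g_loss, generator.trainable_variables),
                              generator.trainable_variables))
    return d_loss, g_loss
```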

3.3.5 Noise

Because word recognition applications may be used with a variety of background sounds, we gathered 86 recordings of real-life background noise from various locations, covering various sounds and words. We also used Google's open-source Speech Commands dataset for noise and unknown-word samples [52]. We collected 377 dataset samples equally distributed across all the classes and then refined and labeled them into five different classes (adult, child, noise, other words, and recordings). The sample rate of all the dataset samples was set to 16 kHz to match the sample rate of the Arduino Nano board's microphone. The dataset was then divided into 80% for training and 20% for testing and evaluating the model; 20% of the training samples were used as a validation set to fine-tune the hyperparameters of the CNN model.

3.4 Evaluation metrics

To evaluate our system, we used several metrics to quantify its effectiveness and efficiency. Because our system includes both hardware and software, accuracy, F1 score, timing, and storage space are used to assess the proposed system's performance.

3.4.1 Accuracy

Accuracy indicates the model's overall performance in terms of the proportion of correctly classified predictions, as depicted in Eq. 1.

$$\begin{aligned} Accuracy = \frac{Number\,of\,correct\,predictions}{Total\,number\,of\,predictions} \end{aligned}$$
(1)

3.4.2 F1 score

The F1 score is the harmonic mean of precision and recall, calculated according to Eq. 2. We used it to provide a balanced view of precision and recall.

$$\begin{aligned} F1\,score = 2\times \frac{Precision\times Recall}{Precision+Recall} \end{aligned}$$
(2)

3.4.3 Timing and storage space

Since our application works in real time on a resource-limited device, the classification and feature extraction time are crucial evaluation metrics. This time is obtained by measuring the inference time to classify one sample. In addition, we must determine how much storage space the trained model requires after optimization so that it can be flashed onto the microcontroller.
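As a hedged, desktop-side sketch (the paper measures these quantities on the Arduino itself), the two efficiency metrics could be approximated with the TensorFlow Lite interpreter as follows; the model file name is hypothetical.

```python
# Sketch: per-sample inference latency and on-flash size of a converted .tflite model.
import os
import time
import numpy as np
import tensorflow as tf

MODEL_PATH = "limitaccess.tflite"        # hypothetical converted model file

interpreter = tf.lite.Interpreter(model_path=MODEL_PATH)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]

sample = np.zeros(inp["shape"], dtype=inp["dtype"])   # one preprocessed MFCC sample
start = time.perf_counter()
interpreter.set_tensor(inp["index"], sample)
interpreter.invoke()
latency_ms = (time.perf_counter() - start) * 1000

print(f"Inference time: {latency_ms:.2f} ms")
print(f"Model size: {os.path.getsize(MODEL_PATH) / 1024:.1f} KiB")
```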

4 Experiments and results

Our dataset was used to conduct experiments in the Google Colab environment using a one-dimensional convolutional neural network (CNN) model. The dataset comprises one-second-long audio clips of the keyword "Open", noise, and recordings representing replay and synthetic voice attacks. The neural network model was trained to categorize incoming audio as an adult saying the keyword "Open", a child saying the keyword "Open", an attack, or noise. To match the microphone on the Arduino Nano board, we set the audio sample rate to 16 kHz. For the model to extract features that allow it to distinguish between the various classes, we converted the audio samples into MFCCs. We trained our model for 100 epochs using the Adam optimizer \((\beta _1= 0.9, \beta _2= 0.999)\) with a learning rate of 0.005 and a batch size of 32.
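A minimal sketch of this training configuration, assuming `model` and the MFCC training arrays come from the earlier steps; the loss function and label format are our assumptions, as the paper does not state them.

```python
# Sketch: compile and train with the reported hyperparameters
# (Adam, beta_1=0.9, beta_2=0.999, lr=0.005, 100 epochs, batch size 32).
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(learning_rate=0.005, beta_1=0.9, beta_2=0.999)
model.compile(optimizer=optimizer,
              loss="sparse_categorical_crossentropy",   # assumes integer class labels
              metrics=["accuracy"])

history = model.fit(x_train, y_train,
                    validation_data=(x_val, y_val),
                    epochs=100,
                    batch_size=32)
```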

4.1 Improving the accuracy

Our initial model had a 78.4% F1 score, which is relatively low. Therefore, to improve the performance further, we made the following adjustments:

  1. Collect more dataset samples: The initial dataset contained 206 samples, so we collected more samples to reach 1,982 samples from the adult, child, and recording classes. The different categories are depicted in Table 1.

  2. Explore different model architectures: We experimented with both 1D CNN and 2D CNN. Furthermore, some of the hyperparameters were tuned, such as the number of neurons, layers, epochs, dropout rate, and learning rate.

  3. Remove the unknown words from the noise class: Instead of combining noise and unknown words (any words other than the term "open"), the unknown words were eliminated from the dataset. We observed that accuracy improved when unknown words were removed, and the system retained the ability to label unfamiliar words as noise during testing.

  4. Explore feature extractors: The effect of using MFCC and the spectrogram for feature extraction was further investigated, and the difference in performance between these two feature extractors is shown in Fig. 7.

Fig. 7
figure 7

Comparison of the system performance using MFCC vs. Spectrogram

As shown in Fig. 7, MFCC slightly outperforms the spectrogram in terms of accuracy. Moreover, MFCC uses significantly less storage and yields lower inference time than the spectrogram. As discussed in Sect. 1, one of the main requirements of LimitAccess is to achieve acceptable accuracy while efficiently using the limited resources available on edge devices. Thus, MFCC was chosen as the feature extractor for the proposed model.

4.2 Improving the system robustness and generalization

The following steps were taken to enhance the robustness and generalization of our LimitAccess system:

  • Collecting recordings from different speakers.

  • Ensuring that the recording class is diverse (containing people of both genders). The dataset contained a variety of races (European, African, Asian, Middle Eastern, and American) to ensure that the system can recognize different accents. The samples were also collected from different distances: since the standard minimum distance between a driver's seat and the steering wheel (where the edge device might be installed) is 25 cm [53], we based the simulated distance at which a child would be sitting or standing from the wheel on this value, with an adjustment of 10 cm.

  • Training the model to differentiate between recorded commands and live commands by including a recording class that covers both genders.

  • Testing and retraining the model on synthetic commands from different text-to-speech applications.

4.3 Robustness against voice replay attack

Our system was able to detect replay attacks since it was trained on a dataset that includes a class of recorded adult voices. It successfully classified 88.3% of the recording samples in the test set, as can be seen in Table 3.

Table 3 Confusion matrix of the model test results

4.4 Robustness against synthetic voice attack

Since our system was trained to differentiate between recorded speech and live voices, we expected that it would be able to detect synthetic voices. However, when the system was tested against synthetic voices at test time, it detected only 12.5% of them as recordings.

Table 4 Confusion matrix after adding synthetic voices

As shown in Table 4, the system was retrained after adding the synthetic samples. The retrained model was able to detect synthetic voice attacks, and it also improved the detection rate of replay attacks from 88.3% to 92.1%. Table 5 shows the results of the system after including GAN-generated [50] and hidden voice attack [54] samples as adversarial attacks. Even though the system's accuracy dropped to 85.89% after these attack samples were introduced, the system became more robust, as it can identify more attacks.

Table 5 Confusion matrix after adding GAN-generated and hidden voice attack samples

4.5 System optimization

After the model was trained, tested on the evaluation set, and achieved acceptable results, an optimization technique was required to reduce the trained model's size so that it fits on our microcontroller (Arduino Nano 33 BLE Sense). There are a variety of approaches for reducing the size of a model; in our work, we used the standard TensorFlow Lite framework for the conversion process [55]. We further optimized the converted lite model to reduce the inference time as much as possible by applying quantization, which changes the model's weights from 32-bit floating point to 8-bit integers. Quantization had a significant impact on the model size and latency with a minimal impact on accuracy. The effect of quantization on the model's performance is shown in Fig. 8.

Fig. 8
figure 8

The effect of quantization on CNN model’s performance
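A sketch of this conversion and full-integer post-training quantization step using the standard TensorFlow Lite APIs; the representative-dataset generator, calibration set size, and output file name are assumptions, not taken from the paper.

```python
# Sketch: convert the trained Keras model to TensorFlow Lite with int8 quantization.
import tensorflow as tf

def representative_samples():
    # Yield a few hundred real preprocessed inputs so the converter can
    # calibrate the int8 ranges for weights and activations.
    for sample in x_train[:200]:
        yield [tf.expand_dims(tf.cast(sample, tf.float32), axis=0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_samples
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

tflite_model = converter.convert()
with open("limitaccess.tflite", "wb") as f:
    f.write(tflite_model)
print(f"Quantized model size: {len(tflite_model) / 1024:.1f} KiB")
```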

4.6 Comparison with baseline

Since the authors of the baseline work [41] only reported the validation accuracy, we contacted them and reconstructed their model using their code and dataset. Their model achieved an 83.3% accuracy and an 86% F1 score on their test set.

Table 6 Comparison between the baseline and LimitAccess

Our work also extends theirs by including an additional class to detect replay and synthetic attacks. As shown in Table 6, our model achieved a 90.5% F1 score, indicating that it accurately distinguishes the properties of each class.

5 Discussion

Using a 1D CNN with MFCC as the feature extractor, the proposed method (LimitAccess) was able to distinguish between adult and child voices, and it performed well in detecting replay and synthetic voice attacks. Furthermore, running the system on an MCU improves the user's privacy, latency, and power consumption. In the remainder of this section, we reflect on the performance of our model and discuss various aspects of it.

5.1 Error types

There is an important factor that we considered while evaluating our system, which is the type of error we need to optimize. The two types of errors in our system are:

  • Type 1 error (false positive): the model falsely labels any other class as an adult, so the system grants a child access to start the car.

  • Type 2 error (false negative): the model misclassifies an adult voice as any other class, which leads the system to deny access to adults.

Type 1 errors could cause accidents and injuries to children, as they would gain access to the car, whereas a Type 2 error only causes the user to repeat the word "open" several times. Since the consequences of Type 1 errors are far more harmful than those of Type 2 errors, we tried to optimize our system for higher precision to reduce Type 1 errors. After investigating the samples on which the model made Type 1 errors, we noticed that the confidence of the "adult" prediction was mostly between 0.6 and 0.8. Therefore, precision was enhanced by adding an additional security level that asks the user a real-time question if the confidence falls between 0.6 and 0.8.
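A minimal sketch of this additional security level, assuming a hypothetical class index for "adult" and a placeholder challenge-question routine (neither is specified in the paper):

```python
# Sketch: confidence-gated decision with a challenge question in the 0.6-0.8 band.
ADULT = 0   # assumed index of the "adult" class in the model's output vector

def ask_challenge_question() -> bool:
    # Placeholder for the real-time verification step (e.g., asking the driver
    # to answer a simple spoken question); returns True if answered correctly.
    return False

def decide(probabilities) -> str:
    adult_conf = probabilities[ADULT]
    if adult_conf >= 0.8:
        return "grant_access"
    if 0.6 <= adult_conf < 0.8:
        # Uncertain region where most Type 1 errors were observed.
        return "grant_access" if ask_challenge_question() else "deny_access"
    return "deny_access"
```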

5.2 Limitations and deployment challenges

The aim of this work was to explore the technical feasibility and viability of using TinyML at the edge for voice-based control. The system has demonstrated good performance, but it is not perfect and mistakes are possible. A more detailed socio-technical evaluation and safety audit is needed before such systems can be used in real life. It is also recommended that such systems initially be deployed for value-added services where mistakes are not catastrophic or dangerous. Another limitation is that the dataset used to train the model did not include ages 13–16 in the "child" class. The proposed model also sometimes has difficulty distinguishing female adults from children.

5.3 Energy consumption

Since embedded devices have limited energy, the architecture of the ML model should be chosen with care to reduce the system's power consumption. Our motivation for using a TinyML model stems from the need for lightweight ML models that can be deployed on resource-constrained edge devices. Banbury et al. have shown in their benchmarking study [56] that the largest TinyML devices consume drastically less power than the smallest traditional ML devices. As discussed in the background and related work, previous work has shown the viability of using TinyML at the edge for complex tasks such as face detection on drones, consuming less than 1 mW [39]. The task of audio classification is less complex, and the Arduino Nano is a good choice in terms of power consumption with a lightweight inference engine implemented on it.

5.4 Privacy and ethical considerations

For the sake of privacy, we collected the data samples anonymously since we did not collect any personal information from contributors. We also asked them to sign a data-use agreement before they could contribute. The samples were only accepted if they agreed to these terms. Moreover, before collecting the data, we explained to the participants the objective of this research and how the audio samples would be utilized, and we did the same for the children’s parents.

5.5 Potential applications for LimitAccess

There are numerous other applications that can benefit from age-based access control such as preventing children from using dangerous home appliances and locking some areas at homes where children should not enter without supervision. Also, it can be used to lock certain applications on smartphones and tablets.

6 Conclusions and future work

In this paper, we proposed LimitAccess, a TinyML speech recognition system that limits children's access and prevents them from starting a car by classifying the voice command "open" as coming from either a child or an adult. The proposed system can also detect replay attacks (if a child plays a recording of an adult voice) and adversarial attacks, including synthetic voice attacks (if a child uses a text-to-speech application or GANs to generate the command "open" synthetically) and hidden voice attacks. The system achieved an overall 87.7% F1 score and 85.89% accuracy and was able to detect replay and synthetic voice attacks with an 88% F1 score. LimitAccess was deployed on an Arduino Nano 33 using a model converted with TensorFlow Lite. In the future, we intend to further enhance the performance of our system against synthetic attacks using open-source datasets and to provide LimitAccess with the ability to detect more types of attacks. In addition, we plan to add a new class for teenagers aged 13 to 16 to differentiate them from adults. Another future direction is to separate the adult class into female and male adults and to improve the system's accuracy in distinguishing female adult voices from children's voices.