1 Introduction

Autism spectrum disorder (ASD) is a lifelong neurological disorder. Its primary traits are differences in conduct, social interaction, communication, particular interests, and sensory processing [1]. People with ASD may encounter difficulties interacting with their surroundings as a result of these characteristics. Many individuals on the autism spectrum share certain symptoms to some degree, whereas other symptoms are common but not necessarily shared by all people on the spectrum. Although some people on the autism spectrum also have intellectual impairments or disabilities, the majority have ordinary to above-average intellect [2]. ASD is a complex disorder that frequently appears during a child's first three years of life and involves behavioural and communication difficulties, with widely varying manifestations and abilities. It can range from a relatively minor problem to an impairment requiring full-time special care [3].

Autism prevalence is rising at an epidemic rate. Since 2000, the prevalence of autism has risen by 178%, from one per 10,000 children to one per 54 children. According to the Centers for Disease Control and Prevention (CDC), almost 1% of people worldwide have ASD [4]. In 2022, one in every 100 children received an autism spectrum disorder diagnosis, as shown in Table 1. Boys are diagnosed with autism at roughly four times the rate of girls, and by 2020 boys were expected to be diagnosed about four times as often as girls of the same age [5]. The prediction of 1 in 44 for 2021 represents a significant increase from the 2006 estimate of 1 in 110, and this growth can seem alarming.

Table 1 Prevalence of autism spectrum disorder [5]

How individuals with ASD identify emotions from facial expressions is a complex problem that has been thoroughly researched. However, the findings of previous research are inconsistent: some studies report significant difficulties in emotion recognition, while others show no difference compared with typically developing individuals. This variability may be due to the heterogeneity of the ASD population, the severity of the condition, and differences in the research methodology employed. Additionally, individuals with ASD may use different strategies to recognize emotions, such as focusing on specific features of the face. Difficulties in accurately identifying emotions have significant effects on the social and emotional competence of people with ASD, including social isolation and mental health problems. Therefore, more research is necessary to better understand the underlying mechanisms of emotion recognition in ASD and to develop effective interventions that improve social and emotional functioning [6].

Children diagnosed with ASD have difficulty communicating and expressing their emotions, and the severity of this condition varies among individuals. Researchers are exploring ways to enhance the quality of life and support for individuals with ASD, from disease diagnosis and treatment to understanding their emotions and sensations [7]. Experts have found that autistic children struggle with recognizing and understanding other people's emotions, making it challenging for them to respond appropriately and to learn how to express their own emotions through facial expressions. To improve social interactions between children with ASD and society, researchers are utilizing basic technology, graphical representation systems, and advanced technology that includes human–computer interaction applications. Starting with fundamental technologies, such as card games and computer programs that teach individuals with ASD how to express their emotions and read facial expressions, researchers are working towards more advanced technologies that can provide more comprehensive support for individuals with ASD [8].

The inability to recognize and comprehend emotions is one of the key diagnostic features of autism spectrum disorder (ASD). This involves difficulties recognizing body language, voice intonation, and facial expressions. Poor social functioning, difficulty paying attention in class, poor academic performance, and both internalizing and externalizing symptoms have all been linked to low emotion awareness in childhood and adolescence [9].

By utilizing modern emotion detection techniques, we can create beneficial tools and software systems that improve the lives of autistic children and make it simpler for them to communicate with people, comprehend emotions, and express their own. This will improve their overall quality of life and reduce autistic symptoms.

Artificial intelligence used in medical diagnosis offers many benefits for the healthcare sector. AI-based software can identify whether a patient has a particular ailment even before overt symptoms appear. The ability of deep learning methods to evaluate images and spot patterns opens up the possibility of developing algorithms that aid in faster and more accurate analysis of specific diseases. Additionally, these algorithms learn continuously, which enhances their ability to predict the correct diagnosis in the future. AI-assisted ASD screening will help carers, parents, and professionals diagnose children with ASD more rapidly for better outcomes and encourage them to seek further clinical examination and therapy [10].

Technology that helps people with autism is currently developing at a rapid rate, ranging from basic tools to powerful emerging technologies [11]. Most scholars agree that it is important to carefully select appropriate assistive technology for individuals with autism, taking into account the severity of their condition [12]. The automatic emotion recognition domain introduces new techniques and tools to improve the diagnosis and treatment of autistic children.

The well-known YOLO (You Only Look Once) object recognition and image segmentation model was created by Joseph Redmon and Ali Farhadi at the University of Washington [13]. Introduced in 2015, YOLO soon became well known for its speed and accuracy. Numerous versions followed, including YOLOv2, YOLOv3, YOLOv4, YOLOv5, YOLOv6, and YOLOv7, each improving accuracy and processing speed.

Ultralytics YOLOv8 is the most recent version of the YOLO object detection and image segmentation model. Because it was created with a strong focus on speed, size, and accuracy, YOLOv8 is an attractive option for a range of visual AI applications. With advancements such as a new backbone network, an anchor-free split head, and improved loss functions, it outperforms earlier iterations. These upgrades allow YOLOv8 to produce better outcomes while retaining a small model size and fast performance. YOLOv8 supports a wide range of visual AI tasks, including detection, segmentation, pose estimation, tracking, and classification [14]. Figure 1, created by GitHub user RangeKing, provides a thorough representation of the network's architecture.

Fig. 1
figure 1

YOLOv8 Architecture, visualization made by GitHub user RangeKing [15]

In [16], the backbone of YOLOX is upgraded with an attention mechanism. By combining all of the encoded input vectors in a weighted fashion, with the most pertinent vectors receiving the highest weights (as illustrated in Fig. 2), the attention mechanism enables the decoder to flexibly employ the most relevant elements of the input sequence [16].

Fig. 2
figure 2

Attention mechanism increases YOLOX performance [16]
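To make this weighting scheme concrete, the following minimal NumPy sketch (ours, for illustration only) computes scaled dot-product attention over a toy set of encoded vectors; the array shapes and values are assumptions and do not come from the cited work.

```python
import numpy as np

def scaled_dot_product_attention(query, keys, values):
    """Weight the encoded vectors (values) by their relevance to the query."""
    d_k = keys.shape[-1]
    # Relevance scores: dot product of the query with every key, scaled by sqrt(d_k).
    scores = query @ keys.T / np.sqrt(d_k)            # shape: (num_vectors,)
    # Softmax turns scores into weights that sum to 1; the most relevant vectors get the largest weights.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # The context vector is the weighted combination of all encoded input vectors.
    return weights @ values, weights

# Toy example: 4 encoded input vectors of dimension 8.
rng = np.random.default_rng(0)
encoded = rng.normal(size=(4, 8))
query = rng.normal(size=(8,))
context, weights = scaled_dot_product_attention(query, encoded, encoded)
print(weights)   # attention weights over the 4 input vectors
print(context)   # weighted combination used by the decoder
```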

The benefit of the attention mechanism is that it shortens the maximum path length between long-range dependencies in the input and the target, and it improves performance by locating the parts of the input that are most relevant to the task. The following are the research's main contributions:

1. Develop a real-time system for recognizing emotions in children with autism, which uses a collected database together with image preprocessing techniques and a DCNN algorithm to assist in the early diagnosis of autism.

2. Design an attention-based YOLOv8 (AutYOLO-ATT) algorithm for facial expression recognition, which improves the performance of the YOLOv8 model by incorporating an attention mechanism.

3. Identify six facial emotions (surprise, anger, sadness, fear, joy, and natural) that can assist medical professionals and families in recognizing facial expressions in autistic children for early diagnosis and intervention.

4. Potentially outperform existing approaches to early autism diagnosis through facial expression recognition, providing an alternative and effective method for early diagnosis that improves the patient's overall health and well-being.

The remainder of this study is structured as follows. Section 2 presents a review of recent developments in facial recognition and emotion recognition techniques. Section 3 outlines the proposed approach and the problem definition. Section 4 provides an experimental evaluation. Finally, in Sect. 5, the research is concluded.

2 Literature review

For humans, detecting faces is a routine process that comes naturally. However, it is not easy for computers or machines to recognize faces in an arbitrary setting. Facial recognition is the primary goal of any face-capturing device, and face detection is its initial stage. Face detection is a preprocessing stage that locates human faces: the image is categorized into two kinds of region, face regions and non-face regions [17]. Bledsoe published one of the earliest studies on automatic facial recognition in 1960 [18]. Kanade [19] developed the first fully functioning automatic facial recognition system. This system could compare traits extracted by humans with those derived by computers, and it measured sixteen facial features that could be used in different fields.

Numerous studies have demonstrated the effectiveness of computer-assisted learning technology as a diagnostic and treatment tool for autistic people. Raja and Masood [20] investigated the potential application of Support Vector Machine, Naive Bayes, Logistic Regression, KNN, and Convolutional Neural Networks for forecasting and analyzing ASD difficulties in children, adolescents, and adults. Results strongly suggested that CNN-based prediction models performed better on all of these datasets, with accuracies of 99.53%, 98.30%, and 96.88% for ASD screening in adults, children, and adolescents, respectively. Using a binary firefly feature selection wrapper based on swarm intelligence, Vaishali and Sasikala [21] experimented with an ASD diagnosis dataset with 21 features taken from the UCI machine learning repository. The experiment's alternative hypothesis asserted that a machine learning model may improve classification accuracy by using fewer feature subsets. The average accuracy values for the optimal feature subsets, which ranged from 92.12% to 97.95%, were nearly comparable to the average accuracy of the full ASD diagnosis dataset, supporting the hypothesis. In a different study, Thabtah and Peebles [22] developed the Rules-Machine Learning technique, which not only recognizes autistic traits in cases and controls but also creates rule-based knowledge that helps specialists understand the reasons behind the classification.

Mythili and Shanavas [23] conducted a study on ASD using classification techniques. With the aid of data mining classification algorithms, the research's primary objective was to identify the autism issue and the severity of the condition. In this study, the social interactions and behaviour of students were examined using WEKA tools together with neural networks, Support Vector Machine (SVM), and fuzzy approaches. In order to forecast the development of Autism Spectrum Disorder (ASD), Nishat [24] undertook research to assess machine learning (ML) methods, most notably linear and quadratic discriminant analysis algorithms.

In order to improve the accuracy of identifying ASD traits, Baadel [25] reduced autism dataset complexity and eliminated redundancy. They introduced the Clustering-based Autistic Trait Classification (CATC) framework, a new semi-supervised ML framework that uses clustering techniques to verify classifiers and classification approaches. In contrast to many ASD screening instruments that employ a scoring system, the proposed approach detects prospective autism cases based on their shared characteristics. The empirical results were verified on various datasets comprising children, teenagers, and adults and were compared with other established machine learning classification methods.

In terms of sensitivity, specificity, and accuracy rates, the results showed that CATC performed better than other intelligent classification systems including Artificial Neural Networks (ANN), Random Forest, Random Trees, and Rule Induction. The Autism Activity Checklist, Aberrant Activity Checklist, and Clinical Global Impression indicators were used in the study to evaluate the symptoms of 433 children with ASD who had undergone initial examinations of ASD symptoms. One of the datasets used to assess the performance of the machine learning algorithms included 254 elements from the baseline forms. Patients were deemed to have had a "better outcome" if their ASD symptoms had improved by two points at 36 months. The majority of cases demonstrated appreciable reductions in ASD symptoms.

Talaat [26] proposed an emotion recognition framework that consists of three layers: the cloud layer, the fog layer, and the Internet of Things layer. An Enhanced Deep Learning (EDL) technique, which uses Convolutional Neural Networks, is employed to classify emotions. The first stage optimizes the CNN hyperparameters with a Genetic Algorithm (GA). A U-net segmentation model is used as the second stage to enable faster and more precise detection. An autoencoder for feature extraction and feature selection enhances the approach's performance and makes it easier to classify input photos. The EDL classifier produced the best results because CNN's performance was enhanced through hyperparameter tuning.

In another study, Md. Rahman [27] used machine learning methods to explore important concerns about autism. The authors place a strong emphasis on choosing the best characteristics of autism, enhancing classification, and maintaining better accuracy.

Wall and Kosmicki [28] examined 612 people with autism diagnoses and 15 non-spectrum individuals from the Autism Genetic Resource Exchange (AGRE) and the Boston Autism Consortium (AC) using the complete score set of Module 1 of the Autism Diagnostic Observation Schedule Generic (ADOS) and a set of machine learning algorithms. According to the study, only 8 of the 29 items in Module 1 of the ADOS were necessary to accurately classify autism.

Achenie [29] proposed an automated machine learning (ML) strategy based on a feed-forward artificial neural network (fANN) to remove obstacles to ASD screening. The scientists applied the fANN approach to historical M-CHAT-R data from 14,995 toddlers. To investigate subgroup differences, they separated the sample into groups based on race, sex, and maternal education. The top scores were 99.72% for the entire sample, 99.64% for the boys' category utilizing 18 items, and 99.95% for the girls' category.

Facial expression is an essential part of non-verbal communication and plays a significant role in human behaviour and social interaction. Automatic facial expression identification is a difficult problem because of the intricacy and diversity of gestures, the variation across people's faces, and the range of expressions that can be performed. Therefore, over the past two decades, researchers have shown a great deal of interest in the recognition of facial emotional expressions.

Facial expression recognition can be achieved through two main techniques: model-based and appearance-based recognition methods [30,31,32,33]. Chang and Chen [34] proposed an automated facial expression recognition system that utilizes neural network classifiers. The Rough Contour Estimation Routine (RCER) is employed to extract features of the mouth, eyes, and eyebrows with the aid of the Point Contour Detection Method (PCDM), Chen [35], which enhances the accuracy of the eye and mouth. The researchers utilized Action Units (AUs), Ekman [36], which describe the basic movements of facial muscles. They identified 30 facial characteristic points for the eye, mouth, and eyebrow using AUs to recognize facial expressions. To achieve this, they used 80 photos of people's faces with a resolution of 128 by 128 pixels under identical lighting, distance, and background conditions. After applying this method, they obtained a recognition rate of 92.1%.

Abdullah [37] proposed a Principal Component Analysis (PCA) method for face recognition from digitized facial images. In this paper, they decompose a picture into a smaller set of feature images, or eigenfaces. To compare results, they first construct a training dataset; the preprocessed input facial image is then compared against the computed training dataset. Matching against multiple face photos can produce the best match, but it requires a lot of processing time. Using the FACE94 database, their method required 35% less time than the original PCA while achieving a 100% recognition rate.
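As a rough illustration of the eigenface idea described above, and not the authors' implementation, the following sketch uses scikit-learn's PCA to project flattened face images onto a small set of eigenfaces and match a probe image by nearest neighbour in that subspace; the image size, number of components, and stand-in data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training data: 100 flattened grayscale face images (e.g. 64x64 -> 4096 pixels),
# a stand-in for a real face database such as FACE94.
rng = np.random.default_rng(42)
train_faces = rng.random((100, 64 * 64))
train_labels = np.arange(100)

# Learn the eigenfaces (principal components of the training faces).
pca = PCA(n_components=20, whiten=True)
train_proj = pca.fit_transform(train_faces)   # each face becomes a 20-dim coefficient vector

def recognize(probe_face):
    """Project a probe face onto the eigenfaces and return the closest training identity."""
    probe_proj = pca.transform(probe_face.reshape(1, -1))
    distances = np.linalg.norm(train_proj - probe_proj, axis=1)
    return train_labels[np.argmin(distances)]

print(recognize(train_faces[7]))   # matches identity 7 on this toy data
```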

Murthy [38] described a technique for identifying facial expressions using eigenfaces. PCA is used for feature extraction from the input image, and the approach is evaluated on the training dataset. They split the training set into six core groups corresponding to the universal expressions. They used the Cohn-Kanade database, Kanade [39], and the Japanese Female Facial Expression (JAFFE) database, Lyons and Kamachi [40].

The Gabor filter, named after Dennis Gabor, is a linear filter used for edge detection in image processing. Research has demonstrated the efficacy of Gabor filters in texture representation and discrimination because their frequency and orientation representations are comparable to those of the human visual system. The 2D Gabor filter is a Gaussian kernel function modulated by a sinusoidal plane wave in the spatial domain. Because Gabor filters are self-similar, all filters can be produced from a single mother wavelet by dilation and rotation, Murthy [38].
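A minimal sketch of applying a small bank of 2D Gabor filters with OpenCV is shown below; the kernel size, wavelengths, and orientations are illustrative assumptions rather than values from the cited works.

```python
import cv2
import numpy as np

def gabor_bank_features(gray_image, orientations=4, wavelengths=(4.0, 8.0)):
    """Filter the image with Gabor kernels at several orientations and frequencies
    and return mean/variance statistics of each response as a feature vector."""
    features = []
    for lam in wavelengths:
        for k in range(orientations):
            theta = k * np.pi / orientations
            # Gaussian kernel modulated by a sinusoidal plane wave:
            # args are ksize, sigma, theta, lambda (wavelength), gamma (aspect ratio), psi (phase).
            kernel = cv2.getGaborKernel((21, 21), 4.0, theta, lam, 0.5, 0)
            response = cv2.filter2D(gray_image, cv2.CV_32F, kernel)
            features.extend([response.mean(), response.var()])
    return np.array(features)

# Toy usage on a random grayscale "image".
img = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
print(gabor_bank_features(img).shape)   # 2 wavelengths * 4 orientations * 2 stats = (16,)
```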

Andrysiak and Chora [41] claim that Gabor filters are efficient at lowering image noise and redundancy. These filters can either be applied to a particular area or combined with the entire image. The responses of a number of Gabor filters, all of which are centred at the pixel point and have various frequencies and orientations, are used in this case to characterise the region around a pixel.

Hong-Bo and colleagues proposed a facial expression detection system based on the Gabor feature by employing a novel local Gabor filter bank. They utilized Gabor coefficients of fiducial points to classify human emotions, achieving an average recognition rate of 97.33% on the JAFFE dataset, Deng and Huang [42].

Lekshmi [43] created a technique for extracting frames from video image sequences to identify faces and classify facial expressions. They employed skin color detection to identify facial regions and considered the entire face for the construction of the eigenspace. After the face recognition stage, their system detected facial expressions with an overall success rate of 88% for identifying expressions from retrieved faces and frames.

To improve the precision and computational efficiency of automatic facial expression recognition, Richard-Whitehill [44] looked into two computer vision techniques. They contrasted a global segmentation of the entire face with a local segmentation of the face around the lips, eyes, and brows and discovered that, despite potential additional noise in the global data, recognizing features from the entire face generated superior accuracy. They explained this by pointing to the Cohn-Kanade database's correlation effects. They also created a technique for locating Facial Action Coding System (FACS) action units based on the Adaboost boosting algorithm and Haar features. This technique was a great deal quicker than the Gabor + SVM approach and yet managed to recognize some AUs with a high degree of accuracy. Finally, utilizing FACS as a transitional framework, they developed a live automatic signature recognition system prototype.

In a different strategy, Yang [45] introduced a novel technique for recognizing facial action units (AUs) and expressions based on coded dynamical properties. To capture the temporal variations of facial events, they developed dynamic Haar-like features that were later encoded into binary pattern features, drawing inspiration from binary pattern coding. They trained a set of discriminative coded dynamic features for facial action unit and expression recognition using Adaboost.

Sarode and Bhatia [46] proposed a method for extracting transient facial features using a 2D appearance-based local approach, identifying four emotions. The algorithm involves applying the Radial Symmetry Transform, extracting features from the face's dynamic spatiotemporal representation through edge projection analysis, and then classifying the face into one of the expression categories. The program can recognize facial expressions from grayscale images with an accuracy of 81.0%.

Anitha [47] reviewed the many available facial expression databases, which exhibit distinct variations in size, shape, illumination, color, expression, and texture. Frank [48] performed a comparison of facial expression recognition performance on the JAFFE database. In order to recognize seven expressions (disgust, happy, anger, fear, surprise, sad, and neutral), they explored several feature representations and classification systems. Using 2D-LDA (Linear Discriminant Analysis) and SVM, they achieved a 95.71% recognition rate, with one 256 × 256 pixel image processed in 0.0357 s.

Kong and Zhang [49] used the 2-D Gabor filter to extract texture and palmprint features for authentication. To obtain good results, they follow the five elements they outline: palmprint acquisition, preprocessing, texture feature extraction, matching, and database construction.

The first issue that researchers encountered was that an autistic child often cannot understand other people's emotions, which makes it difficult for them to react appropriately or to learn how to express themselves from other people's facial expressions. Three technological levels are used to enhance how children with ASD engage with society: basic technology, graphic representation systems, and advanced technology in the form of human–computer interaction applications. Researchers begin with the most basic forms of technology, such as card games and software that teach the user about emotions and facial expressions. They then begin training while utilizing Virtual Reality (VR) as a learning aid for practicing reading emotions, Moon [50]. The goal of further study was for autistic children to be able to interpret their own emotions. To do this, an electroencephalogram (EEG), Torres [51], was used, and signal processing techniques determined which emotion each signal represented.

With the improvement of diagnosis, autistic children receive an early diagnosis and treatment that makes significant progress in patient cases by preventing symptoms from worsening over time and making autistic children's emotions easier to detect. By tracking eye movements and analyzing them to understand the emotions associated with each one, researchers have discovered new methods for reading the emotions of autistic children. In one such study, children participated in static and dynamic emotion recognition (ER) eye-tracking tasks, in which attention, reliability, and reaction time to the eyes were recorded, Bedford [52]. The ability of autistic children to communicate their emotions has improved over the years, though not quite to the level of typically developing children; as a result, facial emotion can now be more easily detected in photos. A comparison of the state-of-the-art algorithms used for emotion recognition is presented in Table 2.

Table 2 Several methods for recognizing facial expressions of emotion

3 Proposed framework for emotion detection

This section presents a real-time emotion recognition system for children with autism. The steps of emotion recognition are dataset collection, image preprocessing, face identification, extraction of facial features, and feature categorization. The proposed method recognizes a total of six facial emotions: surprise, joy, sadness, anger, fear, and natural.

The proposed framework for emotion detection is built on three layers: the data collection layer, the preprocessing and training model layer, and the interface application layer. An image of the child is taken using a mobile application on a smart device, and the emotion is then determined according to the proposed YOLO method. A database (DB) that contains the pre-trained dataset is also included in the training layer.

The primary control and management functions are implemented with an integrated attention technique to reduce latency for real-time detection, with prompt response and position awareness. Figure 3 depicts the overall proposed framework, and Fig. 4 shows the flowchart of the whole method in detail.

Fig. 3
figure 3

Proposed emotion detection framework

Fig. 4
figure 4

Flowchart of the proposed framework

The AutYOLO-ATT algorithm for facial expression recognition involves the following main phases:

1. Data collection and preprocessing: A dataset of facial expression images from autistic children and typically developing children is collected and preprocessed.

2. Training of the deep convolutional neural network: The DCNN architecture is trained using the preprocessed dataset to recognize six facial emotions (surprise, anger, sadness, fear, joy, and natural) in real time.

3. Integration of the attention mechanism: An attention mechanism is integrated into the YOLOv8 model to enhance its performance in recognizing facial expressions in autistic children.

4. Real-time emotion recognition: The AutYOLO-ATT algorithm is applied in real time to recognize facial expressions in autistic children, providing an effective method for early autism diagnosis.

5. Experimental evaluation: The performance of the AutYOLO-ATT algorithm is evaluated experimentally and compared with existing approaches for facial expression recognition.

3.1 Data collection layer

In this phase, the system collects a dataset of facial images that represent different emotional states, including surprise, anger, sadness, fear, joy, and natural, to train the deep convolutional neural network (DCNN) architecture. The data collection process involves selecting a diverse set of individuals of different ages, genders, and ethnicities to ensure that the dataset is representative of the general population. The individuals are asked to express the six emotions mentioned above in front of a camera, and their facial expressions are recorded as images. The dataset is annotated to provide labels for each image, indicating the corresponding emotional state.

3.2 Preprocessing and training model layer

Once the dataset is collected, the preprocessing phase begins. The images are resized and cropped to remove any unnecessary background information and to focus solely on the facial features. Additionally, the images are normalized and standardized to account for differences in lighting, contrast, and other factors that may affect the accuracy of the emotion detection algorithm.
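The exact pipeline used in this work is given in Algorithm 1 below; the following OpenCV sketch is only an illustrative approximation of the cropping, resizing, and normalization steps just described, with the face detector, target size, and file names chosen as assumptions.

```python
import cv2
import numpy as np

# Haar cascade face detector shipped with OpenCV (an assumption; any detector could be used).
face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def preprocess_face(image_bgr, target_size=(224, 224)):
    """Crop the face region, resize it, and scale pixel intensities to [0, 1]."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                       # no face found; skip this image
    x, y, w, h = faces[0]                 # keep the first detected face
    face = image_bgr[y:y + h, x:x + w]    # crop away background
    face = cv2.resize(face, target_size)  # uniform input size for the network
    return face.astype(np.float32) / 255.0

# Example usage (hypothetical file name):
# face = preprocess_face(cv2.imread("child_face.jpg"))
```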

Next, a training set and a validation set are created from the dataset. The DCNN model is trained using the training set, while the validation set is used to evaluate the model's performance during training and to fine-tune the model's hyperparameters.

Overall, the data collection and preprocessing phases are crucial steps in developing an accurate and reliable emotion detection system for early autism diagnosis. By collecting a diverse, annotated dataset and preprocessing the images appropriately, the system can train a robust model capable of accurately identifying the emotional states of autistic children. The overall steps of the data collection and preprocessing phase are shown in Algorithm 1.

Algorithm 1
figure a

Data Collection and Preprocessing

i. YOLO Algorithm

The YOLO algorithm utilizes a single deep convolutional neural network to perform object detection on input images. The model's first 20 convolutional layers, together with a temporary average pooling layer and a fully connected layer, are pre-trained on ImageNet. The pre-trained model is then adapted to perform object detection, because prior studies have demonstrated that a pre-trained network performs better when additional convolutional and connected layers are added. The final fully connected layer of YOLO predicts both the bounding box coordinates and the class probabilities.

For each grid cell, YOLO predicts numerous bounding boxes during training. To guarantee that only one bounding box predictor is in charge of each object, YOLO designates the predictor with the highest current IOU with the ground truth as "responsible." This strategy causes the bounding box predictors to become increasingly specialized, as each predictor improves its ability to predict particular sizes, aspect ratios, or kinds of objects; the overall recall score consequently rises [53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68].
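For orientation, a hedged sketch of how a YOLOv8 detector can be fine-tuned and run with the Ultralytics Python API is shown below; the dataset configuration file, epoch count, image size, and file names are placeholders, and this is not the exact training script used for AutYOLO-ATT.

```python
from ultralytics import YOLO

# Start from a pretrained YOLOv8 checkpoint (the nano variant is chosen here only as an example).
model = YOLO("yolov8n.pt")

# Fine-tune on a custom facial-emotion dataset described by a hypothetical YAML file
# listing the six emotion classes and the train/validation image folders.
model.train(data="emotions.yaml", epochs=100, imgsz=640)

# Run inference on a new image; each result holds predicted boxes, class ids, and confidences.
results = model("child_face.jpg")
for box in results[0].boxes:
    print(int(box.cls), float(box.conf), box.xyxy.tolist())
```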

ii. Attention-Mechanism

The attention mechanism is a crucial component of the proposed emotion detection framework, as it helps to improve accuracy and reduce latency for real-time detection. Specifically, the attention mechanism enables the decoder to concentrate on the most crucial segments of the input sequence, giving higher weights to the most pertinent vectors. By doing so, the system can identify the child's emotional state more accurately and quickly. The overall steps of the attention mechanism phase are depicted in Algorithm 2.

Algorithm 2
figure b

Attention Mechanism
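Algorithm 2 gives the attention steps used in this work. As a complementary illustration only, the PyTorch module below sketches a squeeze-and-excitation style channel attention block of the kind that can be inserted into a convolutional backbone such as YOLOv8's; the layer sizes and placement are assumptions and do not reproduce AutYOLO-ATT exactly.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style attention: reweight feature-map channels
    so that the most informative channels receive the largest weights."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)          # squeeze: global spatial average per channel
        self.fc = nn.Sequential(                     # excitation: learn per-channel weights in [0, 1]
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        weights = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * weights                            # emphasize relevant channels, suppress the rest

# Toy usage on a feature map such as one produced by a backbone stage.
feature_map = torch.randn(2, 64, 40, 40)
print(ChannelAttention(64)(feature_map).shape)        # torch.Size([2, 64, 40, 40])
```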

3.3 Interface application layer

Any smart device, including smartphones and tablets, may be utilized with the proposed application. It allows the user to take a picture of the child and determine their emotional state using the YOLOv8 algorithm with an integrated attention mechanism. The application can run in the background while the child uses other apps, and it can detect emotions such as natural or joy without issue. However, an alert will be issued to a connected app running on the parent's smartphone if the identified emotion is anger, fear, sadness, or surprise.

This application is especially helpful for parents of autistic children because it can inform them when their child is not feeling well. Autistic children may have difficulty expressing their emotions, so this application can provide valuable information to parents and allow them to offer assistance when needed. Overall, the proposed application and emotion detection framework can have a significant impact on early autism diagnosis and intervention, ultimately improving the overall health and well-being of autistic children.

4 Implementation and experiments

This section discusses the used dataset, the performance metrics, and the evaluation of the proposed algorithm.

4.1 Used dataset

The images used in this study have been cleaned up to better reflect the range of emotions experienced by children with autism; duplicate and stock images have been removed. The dataset was then separated into six facial emotions: happy, sad, angry, surprised, natural, and fear [69]. The training set comprises 85% of the dataset and the validation set 15%. The distribution of the data is shown in Fig. 5.

Fig. 5
figure 5

Data distribution to training 85% and validation 15%

The number of photos for each emotion in the dataset is not equal. Figure 6 displays the number of images for each emotion.

Fig. 6
figure 6

Data distribution for six emotions

4.2 Performance metrics

The following performance indicators are employed in this study: (i) Accuracy: the percentage of correct predictions made by the system; it can be computed using Eq. (1). (ii) Precision: the ratio of correctly predicted positive outcomes to all predicted positive outcomes; it can be computed using Eq. (2). (iii) Recall: the ratio of true positive predictions to all actual positive outcomes; it can be computed using Eq. (3). (iv) F1 Score: the harmonic mean of recall and precision, which assesses how well recall and precision are balanced; it can be computed using Eq. (4). (v) Mean Average Precision (mAP): a statistic used in object detection tasks that assesses the average precision across various recall levels; it can be computed using Eq. (5). (vi) Mean Squared Error (MSE): a metric used in regression tasks that measures the average squared difference between predicted and actual values; it can be computed using Eq. (6).

$$ {\text{Accuracy}} = \frac{{\left( {{\text{TP}} + {\text{ TN}}} \right) }}{{\left( {{\text{TP}} + {\text{ TN}} + {\text{FP}} + {\text{ FN}}} \right)}} $$
(1)
$$ {\text{Precision}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{ FP}}} \right)}} $$
(2)
$$ {\text{Recall}} = \frac{{{\text{TP}}}}{{\left( {{\text{TP}} + {\text{ FN}}} \right)}} $$
(3)
$$ {\text{F}}1\_{\text{Score}} = 2 * \frac{{\left( {{\text{Precision }}* {\text{Recall}}} \right)}}{{\left( {{\text{Precision }} + {\text{Recall}}} \right)}} $$
(4)
$$ {\text{mAP}} = \frac{1}{N}\sum\limits_{i = 1}^{N} \left( {\text{Precision}}_{i} \times {\text{Recall}}_{i} \right) $$
(5)
$$ {\text{MSE}} = \frac{1}{N}\sum \left( y_{{\text{pred}}} - y_{{\text{actual}}} \right)^{2} $$
(6)

where TP: true positive, TN: true negative, FP: false positive, and FN: false negative, N is the number of samples, y_pred is the predicted value, and y_actual is the actual value.
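For concreteness, the sketch below computes accuracy, precision, recall, and F1 from raw confusion-matrix counts exactly as in Eqs. (1)–(4); the example counts are arbitrary and only illustrate the arithmetic.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int):
    """Accuracy, precision, recall, and F1 score from confusion-matrix counts (Eqs. 1-4)."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Arbitrary example counts, not results from this study.
print(classification_metrics(tp=90, tn=85, fp=10, fn=5))
# accuracy 0.921, precision 0.900, recall 0.947, F1 0.923
```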

4.3 Evaluating model

The YOLO method uses convolutional neural networks (CNN) to recognize objects swiftly. As the name suggests, the method requires only a single forward propagation through a neural network to detect objects: the entire image is subjected to a single algorithm run for prediction, and the CNN predicts several bounding boxes and class probabilities at once [70]. The YOLO algorithm employs three techniques: residual blocks, bounding box regression, and intersection over union (IOU). Residual blocks: the image is first divided into an S × S grid of cells. Figure 7 shows the grids produced from an input image. Bounding box regression predicts an outline that highlights a specific object in an image.

Fig. 7
figure 7

Examples for evaluation labeling boxes

YOLO (You Only Look Once) uses a single bounding box regression to estimate the height, width, center, and class of an object, along with the likelihood that an object exists in the bounding box. The intersection over union (IOU) describes the degree of overlap between boxes in object detection. YOLO uses the IOU to produce an output box that tightly encloses the object. Each grid cell is assigned predicted bounding boxes and their confidence scores; a predicted bounding box that coincides with the actual box has an IOU score of 1, and bounding boxes that do not sufficiently match the real box are eliminated.
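A minimal sketch of the IOU computation described above is given below for two axis-aligned boxes in (x1, y1, x2, y2) format; the example coordinates are arbitrary.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# A predicted box that perfectly matches the ground truth has IOU = 1.0.
print(iou((10, 10, 50, 50), (10, 10, 50, 50)))   # 1.0
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))   # partial overlap, about 0.14
```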

4.4 Results

This section presents the results of the experiment using the collected dataset, which is used for training and validation. Each image is labelled with one of the six expressions (surprise, anger, fear, happy, natural, sad). The results show how the loss changes during training and validation, as well as precision and recall performance. We consider recall as the main performance metric for this work. The ROC curve shows that the algorithm achieves high performance. Figure 8 shows the results of the training phase: train/box_loss and train/obj_loss start at a high level and then decrease after about 40 training epochs, reaching less than 0.02 near the end (Fig. 9).

Fig. 8
figure 8

Results for training/validation

Fig. 9
figure 9

ROC curve

Comparing the findings of Talaat [26] shown in Fig. 10 with our results, it is clear that the model trained in this work has obtained lower validation and training loss values as well as higher accuracy. This suggests that the model performs better than the one proposed by Talaat [26]. The lower validation and training loss values indicate that the model trained on the given data is more capable of generalizing and producing precise predictions on unseen data. Furthermore, the increased accuracy suggests that the model is more accurate and reliable in its predictions. Compared to Talaat's earlier research [26], these results demonstrate the efficiency of the training procedure and the model's possible superiority (Fig. 11).

Fig. 10
figure 10

Results from Talaat [26]

Fig. 11
figure 11

Comparison of results from different classifiers

Table 3 indicates the most successful outcomes of our proposed technique for the specified evaluation metrics. The shaded cells in the table represent these results and demonstrate that our method (AutYOLO-ATT) has efficiently tackled the obstacles associated with the given task and has achieved a high level of precision in categorizing the provided dataset. It is important to highlight that the shaded cells in Table 3 showcase the effectiveness of our proposed approach and its potential for practical use. Figure 11 displays a visual representation of these results.

Table 3 The performance of the proposed method (AutYOLO-ATT) versus previous classifiers

Standard CNN architectures for facial emotion identification are composed of several convolutional layers, pooling layers, and fully connected layers in sequence. The convolutional layers apply filters to the input image to capture local elements such as edges, textures, and facial landmarks. The pooling layers downsample the feature maps, reducing the spatial dimensions without losing the most important information. Finally, the fully connected layers combine the extracted features and predict the emotion depicted in the picture [71,72,73].
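A minimal PyTorch sketch of such a stack, convolution and pooling blocks followed by fully connected layers predicting one of six emotion classes, is shown below for illustration; the layer sizes and input resolution are assumptions and do not describe any model evaluated in Table 3.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """Conv -> pool blocks capture local facial features; fully connected layers predict the emotion."""
    def __init__(self, num_classes: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 64 -> 32
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 32 -> 16
            nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2), # 16 -> 8
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, 256), nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x))

# Toy forward pass on a batch of 64x64 RGB face crops.
logits = EmotionCNN()(torch.randn(4, 3, 64, 64))
print(logits.shape)   # torch.Size([4, 6])
```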

In the setting of facial emotion recognition, SVMs operate on feature vectors extracted from face photos. These feature vectors can be statistical measures, texture patterns, facial landmarks, or other descriptors of the face. The SVM algorithm then determines the decision boundary in the feature space that best separates the different emotion classes [74].

KNN (K-Nearest Neighbors) is a non-parametric supervised learning algorithm that can also be used for facial emotion recognition tasks. Unlike SVMs or CNNs, KNN does not construct an explicit model during training but relies on the stored training instances to make predictions for new, unseen data points [75].

Decision Trees (DTs) provide interpretability and simplicity in facial expression recognition. The procedure entails extracting features from facial images, building a feature representation, training the DT model using labelled data, and classifying new facial photos. While DTs offer explainability and transparency, they may struggle with complex data and are prone to overfitting. Although DTs are useful, other algorithms such as CNNs and SVMs are more commonly employed in this field [76]. A comparative sketch of these classical classifiers is given below.
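The following scikit-learn sketch illustrates, under assumed stand-in feature vectors, how SVM, KNN, and decision tree classifiers of the kind discussed above can be trained and compared on precomputed facial features; it is not the experimental setup used to produce Table 3.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Stand-in data: 600 face feature vectors (e.g. landmark or texture descriptors) with 6 emotion labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 128))
y = rng.integers(0, 6, size=600)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

classifiers = {
    "SVM": SVC(kernel="rbf", C=1.0),
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "DT": DecisionTreeClassifier(max_depth=10),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, accuracy_score(y_test, clf.predict(X_test)))
```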

4.5 Results discussion

The training losses show a general downward trend over the epochs, suggesting that the model's prediction of bounding boxes, objectness, and class labels improves over time. Metrics such as precision, recall, and mean average precision also exhibit improvements across epochs, indicating that the model's overall performance is improving with training. The model is not overfitting and can generalize well to new data, as evidenced by the validation losses being generally lower than their corresponding training losses. Figure 8 includes the following metrics: (i) train/box_loss: Indicates the training loss associated with the bounding box regression component of the model. (ii) train/obj_loss: Represents the training loss related to the objectness prediction component of the model. (iii) train/cls_loss: Denotes the training loss associated with the class prediction component of the model. (iv) metrics/precision: Represents the precision metric value achieved during training. (v) metrics/recall: Indicates the recall metric value achieved during training. (vi) metrics/mAP_0.5: Represents the mean average precision (mAP) at an IoU threshold of 0.5 during training.

(vii) metrics/mAP_0.5:0.95: Denotes the mAP calculated over a range of IoU thresholds from 0.5 to 0.95 during training. (viii) val/box_loss: Represents the validation loss associated with the bounding box regression component of the model. (ix) val/obj_loss: Indicates the validation loss related to the objectness prediction component of the model. (x) val/cls_loss: Denotes the validation loss associated with the class prediction component of the model.

Table 3 presents a comparison of the performance of the proposed method (AutYOLO-ATT) with four other classifiers, namely CNN, SVM, KNN, and DT, in terms of precision, recall, F1-score, and accuracy. The proposed method outperforms all other classifiers in all metrics, achieving a precision of 93.97%, recall of 97.5%, F1-score of 92.99%, and accuracy of 97.2%. In comparison, the best-performing previous classifier, CNN, achieves a precision of 90.2%, recall of 95.7%, F1-score of 92.9%, and accuracy of 90%.

The results suggest that AutYOLO-ATT has effectively addressed the challenges posed by the classification task and has achieved a significantly higher level of accuracy than previous classifiers. The precision and recall scores indicate that the proposed method has performed well in accurately classifying the dataset, while the high F1 score demonstrates that it has achieved a balance between precision and recall.

These results highlight the potential of the proposed method for real-world applications, particularly in fields where high accuracy is essential. The graphical representation of these results in Fig. 11 further emphasizes the superiority of AutYOLO-ATT over the other classifiers. Figure 9 presents a well-behaved ROC curve that rises steeply, signifying high sensitivity (accurately detecting positive cases) and a low false positive rate (erroneously labelling negative cases as positive) [77]. The closer the curve is to the upper-left corner of the plot, the better the model performs.

5 Conclusion

In conclusion, this paper proposes a novel approach for early autism diagnosis through facial expression recognition. The proposed real-time emotion recognition system based on deep convolutional neural networks (DCNN) is designed to identify six distinct facial emotions, including surprise, anger, sadness, fear, joy, and natural. By using this system, medical professionals and parents can detect these facial expressions in autistic children and diagnose the disorder at an earlier stage, which is crucial for providing timely and appropriate treatment. Moreover, the proposed attention-based YOLOv8 (AutYOLO-ATT) algorithm enhances the performance of the YOLOv8 model by incorporating an attention mechanism. The attention mechanism allows the model to identify the most relevant facial features and use them to predict the correct emotion, resulting in better accuracy and reduced latency. Overall, the proposed framework and algorithm offer a promising approach for early autism diagnosis through facial expression recognition, which can potentially improve the lives of autistic children and their families. Future work can involve expanding the dataset and incorporating other types of features such as voice recognition to further improve the accuracy of the system. The proposed method (AutYOLO-ATT) outperforms all other classifiers in all metrics, achieving a precision of 93.97%, recall of 97.5%, F1-score of 92.99%, and accuracy of 97.2%. These results highlight the potential of the proposed method for real-world applications, particularly in fields where high accuracy is essential. In future work, we can compare our approach with existing literature. For example, Alhussan et al. [78] proposed a Facial Expression Recognition Model Depending on Optimized Support Vector Machine, Talaat et al. [79] developed a Real-time facial emotion recognition model based on kernel autoencoder and convolutional neural network for autism children, and Gamel and Talaat [80] introduced SleepSmart, an IoT-enabled continual learning algorithm for intelligent sleep enhancement. Comparing our work with these methods will provide valuable insights into the performance and effectiveness of our proposed method.