1 Introduction

In recent years, rapid advances in artificial intelligence (AI) have paved the way for applying deep learning techniques in various fields [1]; one prominent research field where deep learning has gained traction is multiobject classification for decision-making [2]. An important strategy of artificial intelligence for studying big data is data fusion, the joint analysis of multiple interrelated datasets that provide complementary views of the same phenomenon [6]. Data fusion systems are now widely used in areas such as sensor networks, robotics, video and image processing, and intelligent system design [7,8,9,10]. Recent statistics on digital information worldwide estimate that 80–90% of the data generated by digitized industrial services is unstructured [10]. Data fusion has therefore become a wide-ranging subject, and many terminologies are used interchangeably. Multidata fusion is the process of combining disparate data streams to generate information in a form that is more understandable or usable [3]. It combines multisensor data fusion technology [4] with multimodal data fusion (MMDF) [5], the process of combining disparate data streams of different dimensionality, resolution, type, etc., to generate information in a more understandable or usable form.

The main challenge addressed in this paper is modality- and context-based fusion: interpreting diverse data fusion for classifying objects and improving decisions by unifying multiple targets into one objective for each system. Modality/context-based fusion must resolve the conflicting nature of data, including uncertainty, ambiguity, and imbalanced interrelated data [6]. Moreover, there is no established way to resolve the conflicting nature of data such as images, text, audio, and video from multitarget sensors for object classification across diverse context systems, given heterogeneous data, imbalanced data, unstructured data, conflicting data, different representations, and varying numbers of modalities [7].

The difficulty of this research lies in finding commonalities between smart systems, which requires common elements in the connectivity of smart devices despite mismatched input types, input targets, and input relationships across different intelligent systems. The lack of a single intelligent-system dataset capable of covering a large number of inputs for testing leads to the need to test multiple intelligent systems.

Most of the recent literature addresses context-based fusion or modality fusion for specific known contexts [9, 10]. Furthermore, no existing fusion framework can analyze offline stream data to derive hidden relationships between different modality types and diverse numbers of modalities. The fusion problem is considered one of the most researched aspects of multimodal learning [11, 12].

Open research problems in modality/context-based fusion include the following:

  • Standardization: building a generalized context-aware middleware is hard because of the variety of contexts and systems involved in constructing a generic, domain-focused middleware solution [13].

  • Increased autonomy: although context-aware middleware architectures minimize the need for human intervention when serving personalized applications, human intervention is still necessary and plays a significant role in realizing context awareness [14].

  • Lack of testing: most of the middleware architectures surveyed are still at the conceptual stage [14].

  • Lack of accurate data: because problems originate from different sources, a context-aware system often cannot build a computational model that represents the knowledge of a real-world domain [15].

Fusion applications that aim to support context representation and fusion, when formally incorporated into a context-aware system, remain open research [16]. Existing approaches fuse multimodal features in a single way, which is not enough to elicit complementary data and therefore limits performance [17].

This paper presents a new adaptive and late multimodal fusion framework that relies on a multifusion learning model to solve modality/context-based fusion challenges and improve multiobject classification and decision-making. It creates a fully automated, selective deep neural network and constructs an adaptive fusion model for all modalities based on the input type. The proposed framework automatically constructs a deep neural network based on the Dempster–Shafer and concatenation strategies to obtain a larger number of features for interpreting unstructured multimodality types using late fusion. The framework is implemented in five layers: a software-defined fusion layer, a preprocessing layer, a dynamic classification layer, an adaptive fusion layer, and an evaluation layer. It formalizes the modality/context-based problem as an adaptive multifusion framework operating at the late fusion level.

2 Literature review

The related work on the modality fusion problem and the context-aware fusion problem has been discussed by many researchers. For example, the authors in [18] presented an early fusion model applied to time-series text modality data in the stock market. Although its experimental results reached 87.7%, its limitation was redundant data in the fusion. In [19], the authors presented another early fusion model applied to bimodal audio/visual data for human action recognition. Although its experimental results reached 86%, its limitation was the difficulty of fusing multiple modality types. The authors in [20] presented a late image fusion model for CIFAR-10 that can automate fusion in one context. Its object classification accuracy reached 89–94%, but the limitation is that it is not suitable for multiple contexts.

The authors in [21] presented a late fusion model applied to ECG signals. It checked the quality of cardiology ECGs, and its object classification accuracy was 61 and 87%. Its limitation is that it cannot exploit a larger number of characteristics to improve quality. The authors in [22] presented a late fusion model designed for the image modality in medical imaging. Although its main objective, medical image classification, achieved 88%, it needs a larger number of features and applicability in multiple contexts.

The authors in [23] presented a hybrid fusion model for daily historical water-level text data in Vietnam. Its essential objective was water-level prediction, and its accuracy reached 91–93%. Its limitations were redundant data and difficulty in handling multiple model types. The authors in [24] presented a hybrid fusion model designed for two modalities in one context (images and text). The experimental results reached 99.57% on two modalities only in one context, and the limitations were low robustness and a restricted context with specific conditions.

The authors in [25] investigated how to extract common features from the vast amount of multisensor data using data preparation and mining techniques. A deep self-attention network was proposed to handle aero-engine multisensor data containing degradation information at different scales and then accurately predict the corresponding remaining useful life (RUL) of the aero-engine. First, multiscale kernels with a self-attention mechanism were developed to selectively extract multisensor features at different scales. The authors in [26] presented a depth estimation algorithm based on convolutional neural networks (CNNs). First, a single-image super-resolution algorithm was adopted to spatially super-resolve the sub-aperture images (SAIs). Second, to adapt to texture complexity, the SAIs were partitioned into two regions, a simple texture region and a complex texture region, based on texture analysis of the central SAI. Third, the epipolar plane images (EPIs) in the horizontal, vertical, 45-degree diagonal, and 135-degree diagonal directions were extracted for both the complex and simple texture regions, and the corresponding EPIs were fed into the specified network branches. Finally, a fusion module was designed to generate a depth map. Experimental results show that the quality of the depth maps estimated by the proposed method was better than the state-of-the-art methods in terms of both objective and subjective quality.

The authors in [27] addressed recognizing the epistemic emotions in learner-generated reviews in massive open online courses (MOOCs), which could help provide adaptive guidance and interventions for learners. Epistemic emotion identification is a fine-grained identification task involving multiple categories of emotions that arise during the learning process. Previous studies considered only the emotional or the semantic information within the review texts, which led to insufficient feature representation. In addition, some categories of epistemic emotions are ambiguously distributed in the feature space, making them difficult to recognize. The emotion-semantic-aware dual contrastive learning (ES-DCL) approach was presented to tackle these issues. To learn adequate feature representations, implicit semantic features and human-interpretable emotional features were extracted separately from two different views to form complementary emotional-semantic features. The proposed ES-DCL was compared with 11 other baseline models on four disciplinary MOOC review datasets.

Adaptive control differs from dynamic control in two respects. The difference appears in the flexibility to adapt to diverse requirements in system behavior with respect to the adopted rules [28]. Adaptivity is considered a type of adaptive dynamic programming that reaches the optimal solution for a system iteratively [29], whereas dynamic control programming only adjusts the parameters of a system with respect to changes over time. The authors in [30] noted that software-defined networking (SDN) and network function virtualization (NFV) are recognized as the most promising technologies for flexibly allocating resources to network services. A service function chain (SFC), which can deploy virtualized network functions (VNFs) and chain them with the associated flow allocation, can be used to represent each network service owing to the introduction of SDN/NFV technology. That work presented a deep learning approach in which a multitask regression layer over graph neural networks was first introduced to predict the long-term resource requirements of each VNF instance. According to the simulation findings, the proposed model showed at least a 6.2% improvement in prediction accuracy over standard prediction models, and the proposed SFC deployment strategy delivered better performance in terms of acceptance ratio and revenue than existing static deployment algorithms.

On the other hand, fusion models face a big challenge in extracting relationships across multiple contexts because each context has specific roles, parameters, and objectives [31,32,33]. The authors in [34] analyzed the relationship between human activities and the properties (amplitude and phase) of Wi-Fi CSI signals on different receiving antennas and identified the signal properties that change markedly in response to human movement. The variation in the signal among different antennas showed different sensitivities to human activities, directly affecting recognition performance. Hence, to recognize human activities more efficiently, the study proposed an adaptive antenna elimination algorithm that automatically discards the non-sensitive antennas and keeps the sensitive antennas for different human activities. The experimental results revealed that even when using easy-to-implement, non-deep machine learning such as random forest, the recognition framework based on the proposed adaptive antenna elimination algorithm achieved a superior classification accuracy of 99.84% (line-of-sight) on the StanWiFi dataset and 97.65% (line-of-sight)/93.33% (non-line-of-sight) on another widely used multienvironment dataset at a fraction of the time cost, illustrating the robustness of the proposed algorithm. Table 1 presents a summary of the state of the art for the modality fusion problem and the context-aware fusion problem.

Table 1 A summary of the state-of-the-art comparative analysis

3 Background and basics

3.1 Background of data fusion

Data fusion is the process of combining information from heterogeneous sources into a single composite picture of the relevant process, such that the composite picture is generally more accurate and complete than that derived from any single source alone [35, 36]. It often implies the concatenation of datasets that exhibit enormous diversity in terms of information, size, and behavior [37, 38]. Data fusion is based on the abstraction level that is used to simplify reality [39]. Data fusion systems are of three types: cooperative, competitive, and complementary. Abstraction focuses only on the data and processes relevant to the application being built. Modality-based fusion is defined by the interpretation of multimodal input data, which is categorized into four modality types [40]. It can be classified into two classes: same data types, such as images only or text only, and different data types, such as image-video or text-image.

Context/aware-based fusion refers to the ability of a framework to fuse data relative to its specific context via time streams, with the dynamic flexibility to study the data behavior of each context [41].

The multimodal dataset types in multiple contexts are classified into four classes, text, audio, image, and video, as shown in Table 2. Data fusion techniques have three fusion strategy levels, early fusion, late fusion, and hybrid fusion, as shown in Fig. 1 [14, 42, 43]. Multidata fusion combines different data streams to produce information in a more understandable or usable format. It arises from the combination of multisensor data fusion technology and applicable multimodal data fusion (MMDF).

Table 2 Modality data types
Fig. 1
figure 1

Data fusion strategies levels

Multidata fusion must contend with the conflicting nature of data: ambiguity, imbalanced data, uncertain data, and data redundancy [44]. Ambiguity represents uncertainty but is also an essential subject of discourse for those interested in the interpretation of languages, and it serves communicative purposes both in human–human communication and in human–machine interaction [25]. There are four types of ambiguity, arising from phonetics, lexicon, syntax, and semantics; even punctuation and sound can be causes of uncertainty.

On this basis, linguists partition ambiguity into distinct types such as phonetic, lexical, syntactic, and pragmatic ambiguity. Prior work presents four steps to deal with ambiguity: interpreting the context and determining the objective, selecting the correct data sources and suitable techniques, acknowledging the uncertainty and ambiguity, and then evaluating and iterating over the data in each context. Uncertainty that cannot be resolved by probability alone constitutes the true uncertainty of statistical information. Uncertainty is the quantitative estimation of the error present in information; all measurements contain some uncertainty produced by systematic and/or random errors. Recognizing the uncertainty of information is a critical component of reporting the results of scientific analysis [45]. Uncertainty refers to situations in which there is a lack of complete information or knowledge about a particular aspect, leading to ambiguity and unpredictability. Prior reviews of uncertainty distinguish deductive and inductive reasoning. Uncertainty is essentially a lack of information for formulating a decision. Uncertainty challenges are handled by many algorithms, including Bayesian probability, Markov models, Dempster–Shafer theory, and fuzzy theory.

Imbalanced data are a common problem in machine learning, in which one class contains a substantially higher number of observations than the other. This may lead to biased models and poor performance on the minority class. Imbalanced data have a strong effect on model performance [46]. Prior research on techniques for handling imbalanced data includes data augmentation, resampling (over-sampling, under-sampling), the synthetic minority over-sampling technique (SMOTE), ensemble techniques (bagging, boosting), and cost-sensitive learning (evaluation); a small sketch of two of these remedies is given after this paragraph. Data redundancy occurs when the same piece of information exists in several sources, whereas data inconsistency occurs when the same information exists in several formats in numerous tables. Data redundancy arises when multiple copies of the same data are stored in more than one place at a time.
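As an illustration of two of the remedies listed above, the sketch below shows random over-sampling of the minority class and inverse-frequency class weights for cost-sensitive learning; the array shapes, seed, and helper names are illustrative rather than part of the proposed framework.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate minority-class rows until all classes have equal counts."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for c, n in zip(classes, counts):
        idx = np.flatnonzero(y == c)
        if n < target:
            extra = rng.choice(idx, size=target - n, replace=True)
            idx = np.concatenate([idx, extra])
        X_parts.append(X[idx])
        y_parts.append(y[idx])
    return np.concatenate(X_parts), np.concatenate(y_parts)

def class_weights(y):
    """Inverse-frequency weights for cost-sensitive training."""
    classes, counts = np.unique(y, return_counts=True)
    return {int(c): len(y) / (len(classes) * n) for c, n in zip(classes, counts)}

# Example: 90 samples of class 0 and 10 samples of class 1.
X = np.random.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = random_oversample(X, y)
print(np.bincount(y_bal))   # [90 90]
print(class_weights(y))     # {0: ~0.56, 1: ~5.0}
```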

Lack of organization is one of the essential challenges of unstructured data: it lacks inherent organization [47]. Unlike structured data, which is typically organized in databases or spreadsheets, unstructured data lacks predefined categories or labels, making it hard to classify and process. Prior work presents four solutions to avoid data redundancy: creating a master table, normalization, deleting repeated or unused data, or designing a suitable database for integration.

Early fusion, or feature fusion [42], combines all modalities at the feature level and requires a single learning phase; concatenation-based fusion is one of the best-known early fusion techniques. Decision fusion, or late fusion [14], refers to combining the results of different models after building each model independently; it combines the predictions of multiple classifiers to obtain a single classification result per record. Hybrid fusion [43] performs inference twice, which leads to high complexity.
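The toy sketch below contrasts the two main strategies: early fusion concatenates the per-modality features before a single stand-in classifier, while late fusion combines the decisions of separate per-modality classifiers. The weights are arbitrary and only illustrate the data flow, not the framework's models.

```python
import numpy as np

# Toy per-modality features for one sample (e.g., an image and an audio embedding).
f_image = np.array([0.2, 0.9, 0.4])
f_audio = np.array([0.7, 0.1])

def toy_classifier(features, weights, bias=0.0):
    """Stand-in classifier returning a class-1 probability."""
    z = features @ weights + bias
    return 1.0 / (1.0 + np.exp(-z))

# Early (feature-level) fusion: concatenate features, use ONE model.
early_features = np.concatenate([f_image, f_audio])
p_early = toy_classifier(early_features, np.array([0.5, -0.2, 0.8, 0.3, -0.6]))

# Late (decision-level) fusion: one model per modality, combine their outputs.
p_image = toy_classifier(f_image, np.array([0.5, -0.2, 0.8]))
p_audio = toy_classifier(f_audio, np.array([0.3, -0.6]))
p_late = (p_image + p_audio) / 2   # e.g., averaging or majority voting

print(f"early fusion score: {p_early:.3f}, late fusion score: {p_late:.3f}")
```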

3.2 Background of data fusion techniques

Data fusion techniques can fuse extracted data via multiple intelligent devices or sensors and relative metadata from databases to reach enhanced accuracy results [48]. The important fusion techniques are the central limit theorem (CLT), Kalman filter (KF), Bayesian networks (BN), Dempster–Shafer theory (DST), and deep learning (DL) algorithms, as described in brief in the following subsections.

  • Central limit theorem (CLT) provides an understanding of aggregated random variables [49]: it describes the population's random variables in terms of mean, variance, and standard deviation. The mean of the sampling distribution equals the population mean, and its standard deviation is expressed in terms of the population standard deviation.

  • Kalman filter (KF) is an estimation algorithm for estimating the state of a discrete-time controlled process described by a linear stochastic equation. The KF fuses all available information [50] (see the sketch after this list).

  • Bayesian networks produce data fusion measurements; they are a common method applied for multisensor data fusion in static environments [51]. Their probability distributions provide a convenient treatment of suspect data under additive Gaussian noise; however, when other noise influences a multisensor data fusion system, this approach cannot always recover and preserve the original data. The Kalman filter (KF) relies on a purely mathematical approach to problem solving and analysis. The central idea of data fusion is fusing data on the basis of their uncertainties.

  • Dempster–Shafer theory builds on Bayesian theory, which underlies the canonical approach to statistical inference problems; Dempster–Shafer decision theory is a generalization of Bayesian theory [52, 53]. It eases the evaluation of the distribution of propositions and of unions of propositions. Dempster–Shafer is very powerful in systems that recognize the total mutual context facts of the same type in "the frame of discernment θ".

  • Deep learning algorithms require interconnection; for example, the constructed network grows unreasonably fast as the size of the input grows [54].
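As referenced in the Kalman filter item above, the following is a minimal one-dimensional sketch of how a KF fuses noisy scalar measurements with its running prediction; the noise variances are made-up values, not parameters from this work.

```python
import numpy as np

def kalman_1d(measurements, process_var=1e-3, meas_var=0.25):
    """Fuse a stream of noisy scalar measurements into a state estimate."""
    x, p = 0.0, 1.0                  # initial state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: state unchanged, uncertainty grows by the process noise.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        k = p / (p + meas_var)       # gain in [0, 1]
        x = x + k * (z - x)
        p = (1 - k) * p
        estimates.append(x)
    return np.array(estimates)

# Noisy readings of a true value of 5.0; the estimate converges toward 5.0.
noisy = 5.0 + np.random.default_rng(0).normal(0, 0.5, size=20)
print(kalman_1d(noisy)[-1])
```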

Table 3 shows a comparative study between data fusion techniques based on Strengths, Weaknesses, Opportunities, and Threats (SWOT) analysis.

Table 3 A comparative analysis between data fusion techniques based on SWOT analysis

3.3 Deep learning techniques

A neural network comprises diverse layers connected to each other, modeled on the structure and operation of the human brain. It learns from huge volumes of data and uses complex computations to train the network. Deep learning is a kind of machine learning that achieves superior results through deep preprocessing and feature extraction to improve the learned model. Deep learning is a machine learning strategy that teaches computers to do what comes naturally to humans: learning by example [55].

  • Convolutional neural network (CNN) is designed around the convolutional layer, which is considered the core building block of a CNN. It introduces multiple parameters, including a group of learnable kernel filters. Each filter is convolved across the width and height of the input volume.

  • Artificial neural networks (ANNs) are biologically inspired computational systems. Among the different types of ANNs, this work focuses on multilayer perceptrons (MLPs) with backpropagation learning algorithms. MLPs, the ANNs most commonly used for a wide variety of problems, are based on a supervised method and contain three layers: input, hidden, and output.

  • Recurrent neural network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. In conventional neural networks, all inputs and outputs are independent of each other; RNNs address this with the help of a hidden layer. The most important feature of an RNN is its hidden state, which remembers some information about a sequence. This state is also referred to as the memory state because it remembers the previous input to the network. The RNN uses the same parameters for each input because it performs the same task on all inputs or hidden layers to produce the output, which reduces the number of parameters compared with other neural networks.

  • Long short-term memory (LSTM) is a kind of deep, sequential neural network that allows information to persist. It is a special type of recurrent neural network that can deal with the vanishing gradient problem faced by RNNs [56].

  • Transfer learning is a machine learning strategy in which a model developed for one task is reused as the starting point for a model on a second task [57]. It is a popular approach in deep learning, where pretrained models are used as the starting point for computer vision and natural language processing tasks, given the vast computing and time resources required to develop neural network models for these problems and the large jumps in capability they provide on related problems.

  • AlexNet is a convolutional neural network that is 8 layers deep. A pretrained version of the network is trained on more than a million images from the ImageNet database. The trained network can classify images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. As a result, the network has learned rich feature representations for a wide range of images. The network has an image input size of 227-by-227 [58].

  • GoogLeNet is a convolutional neural network that is 22 layers deep. A pretrained version of the network is trained on the ImageNet dataset. The network trained on ImageNet classifies images into 1000 object categories, such as keyboard, mouse, pencil, and many animals. Reinforcement learning (RL) is a suite of strategies that allows machine learning systems to take decisions sequentially. RL extracts numerous factors and features, and it relies on unknown input and unknown output [58].

  • Attention learning model [59] addresses the bottleneck problem of a fixed-length encoding vector, which is useful but limits access to the data. It is powerful for sequence-to-sequence models and can compute alignment scores, attention weights, and attention context vectors.

3.4 Dempster–Shafer theory

Dempster–Shafer theory (DST) is a theory of evidence that has its roots in the work of Dempster and Shafer and is expressed through the basic Eq. (1). Whereas conventional probability theory is restricted to assigning probabilities to mutually exclusive single events, DST extends this to sets of events in a finite discrete space [60]. DST also gives a more flexible and precise approach to dealing with uncertain data without depending on additional assumptions about the events within an evidential set. By leveraging the special features of this theory, AI systems can better navigate uncertain scenarios, exploiting the potential of different evidentiary types and successfully managing conflicts. Therefore, Dempster–Shafer theory is a capable tool for building AI systems that can handle complex uncertain scenarios. Bayes' theorem is based on the classical notion of probability, whereas Dempster–Shafer theory is a later attempt to permit a broader interpretation of what uncertainty is about [61]. It eases the evaluation of the distribution of propositions and of unions of propositions. Dempster–Shafer is very powerful in systems that recognize the total mutual context facts of the same type in the frame of discernment θ, as expressed by the Dempster–Shafer Eq. (1). Here Θ is not an angle; it denotes the frame of discernment, whose power set determines the number of propositions over which masses (probabilities) are computed.

$$ [\,{\text{Belief}}_{i}(A),\ {\text{Plausibility}}_{i}(A)\,], $$
(1)

The interpretation of this example is "user-A", "user-B", "either user-A or user-B", or "neither user-A nor user-B, it must be somebody else". Each sensor, sensor Si, for instance, contributes its observation by specifying its beliefs over Θ. This function is known as the "probability mass function" of sensor Si, denoted by mi. So, with respect to sensor Si's observation, the probability that "the detected person is user A" is specified by a "confidence interval," as illustrated in Eqs. (2)–(3).

$$ {\text{Bel}}_{i}(A) = \sum_{E_{k} \subseteq A} m_{i}(E_{k}) $$
(2)
$$ {\text{Pl}}_{i}(A) = 1 - {\text{Bel}}_{i}(\bar{A}) = \sum_{E_{k} \cap A \ne \emptyset} m_{i}(E_{k}) $$
(3)

The lower bound of the confidence interval is the belief, obtained as in Eq. (2) by summing over all evidence \(E_{k}\) that supports the given proposition "user A". The plausibility is the upper bound of the confidence interval, computed as in Eq. (3) from all evidence that does not contradict the given proposition. The mass functions of two sensors are combined with Dempster's rule in Eq. (4).

$$ \left( m_{i} \oplus m_{j} \right)(A) = \frac{\sum_{E_{k} \cap E_{k'} = A} m_{i}\left( E_{k} \right) m_{j}\left( E_{k'} \right)}{1 - \sum_{E_{k} \cap E_{k'} = \emptyset } m_{i}\left( E_{k} \right) m_{j}\left( E_{k'} \right)} $$
(4)
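A small sketch of Eqs. (1)–(4) for the two-user example, with mass functions defined over subsets of the frame Θ = {A, B}; the numeric masses are illustrative, not values from the framework.

```python
from itertools import combinations

THETA = frozenset({"A", "B"})

def powerset(s):
    s = list(s)
    return [frozenset(c) for r in range(len(s) + 1) for c in combinations(s, r)]

def belief(m, A):
    """Bel(A): total mass committed to nonempty subsets of A (Eq. 2)."""
    return sum(v for E, v in m.items() if E and E <= A)

def plausibility(m, A):
    """Pl(A): total mass not contradicting A (Eq. 3)."""
    return sum(v for E, v in m.items() if E & A)

def combine(m1, m2):
    """Dempster's rule of combination (Eq. 4) for two mass functions."""
    raw = {S: 0.0 for S in powerset(THETA)}
    conflict = 0.0
    for E1, v1 in m1.items():
        for E2, v2 in m2.items():
            inter = E1 & E2
            if inter:
                raw[inter] += v1 * v2
            else:
                conflict += v1 * v2
    return {S: v / (1.0 - conflict) for S, v in raw.items() if v > 0}

# Two sensors' masses over {A}, {B}, and Theta (ignorance).
m1 = {frozenset({"A"}): 0.6, frozenset({"B"}): 0.1, THETA: 0.3}
m2 = {frozenset({"A"}): 0.5, frozenset({"B"}): 0.2, THETA: 0.3}
m12 = combine(m1, m2)
A = frozenset({"A"})
print(belief(m12, A), plausibility(m12, A))   # confidence interval [Bel, Pl]
```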

3.5 Particle swarm optimizer

The particle swarm optimizer (PSO) algorithm is an elegant way of solving difficult problems by imitating how animals work together. PSO employs many small agents that move around to discover the best answer; each agent remembers its own best solution and the best solution from its neighbors [62]. This helps them work together and find the best answer faster. The process of finding ideal values for the particular parameters of a given system that fulfill all design requirements at the lowest possible cost is referred to as optimization, and optimization problems can be found in all areas of science. Particle swarm optimization is a capable meta-heuristic optimization algorithm inspired by swarm behavior observed in nature. PSO is a simulation of a simplified social system; the first aim of the PSO algorithm was to graphically mimic the elegant but unpredictable choreography of a bird flock. Each particle has an associated position, velocity, and fitness value. PSO creates many particles (i) that form a population of (N) particles. Each particle has two properties, position and velocity, and keeps track of its own best location and the global best. The PSO algorithm uses Eqs. (5) and (6), where p refers to position, v refers to velocity, and bestglobal refers to the best optimal point over all computed data.

$$ P_{i}^{t + 1} = P_{i}^{t} + V_{i}^{t + 1} $$
(5)
$$ V_{i}^{t + 1} = wV_{i}^{t} + c_{1} r_{1} \left( P_{{\text{best}}\left( i \right)}^{t} - P_{i}^{t} \right) + c_{2} r_{2} \left( P_{{\text{bestglobal}}}^{t} - P_{i}^{t} \right) $$
(6)

Particle swarm optimization has the fundamental advantage of having few parameters to tune. PSO obtains the best solution through the particles' interaction. The drawbacks of the PSO algorithm are that it easily falls into a local optimum in high-dimensional spaces and has a low convergence rate in the iterative process. The computational complexity of PSO becomes significant when it is applied to solve high-dimensional and complex problems [63].
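A compact sketch of the PSO update rules in Eqs. (5) and (6), applied to a toy objective; the inertia and acceleration coefficients are common textbook values rather than the settings used in this work.

```python
import numpy as np

def pso(objective, dim=2, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    pos = rng.uniform(-5, 5, (n_particles, dim))     # particle positions P_i
    vel = np.zeros((n_particles, dim))               # particle velocities V_i
    pbest = pos.copy()
    pbest_val = np.array([objective(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()         # best global position
    for _ in range(iters):
        r1 = rng.random((n_particles, dim))
        r2 = rng.random((n_particles, dim))
        # Eq. (6): velocity update; Eq. (5): position update.
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([objective(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Minimize the sphere function; the optimum is at the origin.
best_x, best_f = pso(lambda x: float(np.sum(x ** 2)))
print(best_x, best_f)
```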

4 The proposed adaptive and late multimodal fusion framework

4.1 Framework architecture

The architecture of the adaptive and late multimodal fusion framework, with contextual representation based on evidential deep learning and Dempster–Shafer, relies on creating a multifusion learning model for classifying objects by solving modality/context-based fusion challenges and improving decision-making. It is designed on the basis of the proven, improved mathematical fusion obtained by fully automated control of the combination of deep neural networks with Dempster–Shafer. The architecture is constructed to interpret unstructured, supervised multimodality types and to improve object classification accuracy. It unifies multiple unstructured topologies into one topology of feature matrices with a feature-reduction level related to every object in the datasets. Figure 2 shows the general architecture of the adaptive and late fusion framework, which is designed on the basis of two fusion levels.

Fig. 2
figure 2

The general architecture of the adaptive multifusion framework

The adaptive multifusion framework is designed on the basis of two fusion levels. The first is model-based fusion and the second is feature-based fusion. Moreover, it is implemented in five layers: a software-defined fusion layer, a preprocessing layer, a dynamic classification layer, an adaptive fusion layer, and an evaluation layer. This section discusses these two levels and their layers; the detailed algorithmic steps of all layers are presented in Appendix A.

4.1.1 Fusion level (1): Model fusion level

The first fusion level, model-based fusion, interprets multiple topologies with different modalities and diverse characteristics. This level extracts new correlations between the modality dataset inputs based on weight, priority, reduction level, and the extracted relationship. It consists of two layers, the software-defined fusion layer and the preprocessing layer. It evaluates the mathematically proven weight and priority dynamically, counts the modality data types, and measures the modality dataset size and the number of modality data items of each type.

4.1.1.1 Layer (1): Software-defined fusion layer

The software-defined fusion layer is a controller for creating the proposed correlation between multiple dataset inputs. It is constructed over five dimensions: modality data type, modality data number, modality dataset size, the weight of feature-relationship interpretation, and the relationship weights defining the priority of each modality. Software-defined fusion extends the software-defined terminology, which refers to a software controller or the management of an application programming interface (API), as in a software-defined network. This research presents proven original equations for controlling the inputs as follows:

  • Multimodality adaptation for multiple modality inputs refers to the dynamic number, type, and size of interrelated input data. It interprets the inference of four modality data types (image, text, audio, and video) and deals with multiple input numbers as defined by Eq. (7).

    $$ I\left( n \right) = \mathop \sum \limits_{i = 1}^{N} \mathop \sum \limits_{j = 1}^{x} Dt_{xN} $$
    (7)
  • Multimodality relationships, weight and type, refer to the weight value over all modality data numbers and data sizes. It interprets the inferred relationships between modalities. The weight factor of each dataset is computed from the relationships between that dataset and its neighboring datasets, as defined by Eq. (8).

    $$ I\left( w \right) = \frac{{\mathop \sum \nolimits_{1}^{n} Dt_{x1N1} }}{{\mathop \sum \nolimits_{x}^{N} Ds_{xN} }} $$
    (8)

    The default weight factor is computed for each modality input dataset as the division of the current dataset size by the biggest dataset size. The main questions are how to compute the weight of each modality dataset and how it impacts the relationships of the extracted features. Because there is no previous information about the conditions of object classification, the main goal of the extracted weight is to count the size of each dataset and to quantify the relationship among all dataset sizes.

  • Multimodality priority refers to the importance of a modality with respect to all datasets, based on the relationship between each dataset and the smallest dataset. This priority relies on the subtraction of the smallest modality dataset size from each modality dataset size, divided by the summation of the total sizes of all input modalities, as given in Eq. (9).

    $$ P(f) = \frac{\sum_{1}^{n} Dt_{x_{c}N_{c}} - \sum_{1}^{n} Dt_{x_{l}N_{L}}}{\sum_{x}^{N} Ds_{xN}} $$
    (9)

    This interprets, for example, that each patient has one X-ray, i.e., a 1–1 relationship. The same problem shows that the disease classification accuracy for patients differs according to the types and number of modalities interpreted. The modality priority is then applied in Eq. (10) by multiplying it with the temporary accuracy (TempAcc) of the default suitable classification results.

    $$ T\left( {P\left( {{\text{DTxn}}} \right)} \right) = P * {\text{TempAcc}} \left( {{\text{DTxn}}} \right) $$
    (10)
  • Context adaptation for diverse domains refers to the reduction-level filter that adapts to the domain in diverse contexts. It improves object classification by reducing uncertainty across multiple features. The computed reduction level of domain adaptation is a suggested filter to improve the object classification of offline supervised learning, creating a model for improving decision-making, as shown in Eq. (11).

    $$ f\left( {\text{RL}} \right) = \sum_{Dt = 1}^{n} {\text{Rweight}} + {\text{Mpriority}} $$
    (11)

    This explores the one-to-many relationship. The importance of the computed reduction level, which is the sum of the weight and priority of each modality dataset, is reflected in the changed weights of the neural networks, as given in Eq. (12). The affected neural network, through the multiplication of the reduction level with the inputs and the addition of the biases of each neuron, improves the features of the data output.

    $$ W = \sum_{{\text{DTxn}}} \left( {\text{Rl}} \times {\text{inputs}} \right) + {\text{bias}} $$
    (12)

    This equation constitutes the proven mathematical control for different topologies, with the proposed correlation extracted across multimodality in multicontext (unknown context) settings, as detailed in Appendix B. The difficulty of implementing this layer lies in managing the unstructured multimodality types and characteristics with an unknown multicontext.

  • Improvement of ambiguity, uncertainty, imbalance, and redundancy strategies: the developed strategies address the challenges of data ambiguity, uncertainty, imbalance, and redundancy; the following discusses the methods developed within the framework to mitigate these issues. A newly presented ambiguity strategy can interpret negative meanings by identifying and validating meaningful features of one word (e.g., dislike, less), two-word phrases (e.g., not good work), or three-word phrases (e.g., not work efficiently) and converting them into a numerical flag (0, 1) that characterizes and distinguishes them compared with existing approaches. The experimental results demonstrate an improvement in the classification rate when these new ambiguity classes are considered.

This research addresses ambiguity in text preprocessing so that the important features of multimodal ambiguity are recognized dynamically in the classification learning model, adapting to multiple contexts; a small sketch of the negation-flag idea follows this paragraph. Reducing the uncertainty across multiple inputs creates a fusion learning model with a reduced feature vector using the improved Dempster–Shafer theory with two fusion techniques; this improves the accuracy of object classification with dynamic belief and evidence that reduce the uncertainty of the probabilities. The strategy for handling imbalanced data addresses the case of multiple inputs whose modality numbers cannot easily be extracted and counted. Prior research requires duplicating the dataset or applying a classifier to the data for checking (such as nearest neighbor or bagging with random forest), which takes a long time. There is a limit on the interval of relative data to fuse, from [1 to 1/10] of the data; the relative limit cannot fuse data smaller than the related data. The importance of the weighted value is that it captures more detail of the interrelated data to improve the complementary data. Redundant data are handled by creating a bridge controller table that converts a many-to-many relationship into two one-to-many relationships with respect to features and time conditions.
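A minimal sketch of the negation-flag idea described above; the phrase list and function name are illustrative and not the framework's exact lexicon.

```python
import re

# Illustrative lexicon of negative cues of one to three words.
NEGATIVE_PHRASES = [
    "not work efficiently", "not good work",   # three- and two-word phrases
    "dislike", "less",                          # single words
]

def negation_flag(text: str) -> int:
    """Return 1 if the text contains a negative cue, else 0."""
    lowered = re.sub(r"\s+", " ", text.lower())
    return int(any(phrase in lowered for phrase in NEGATIVE_PHRASES))

print(negation_flag("The device does not work efficiently"))  # 1
print(negation_flag("The scan quality is acceptable"))         # 0
```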

The proposed strategy can fill the gap between real dataset tables at a lower grain. It is designed for text-sheet datasets and image datasets, and it can also reconcile different time points with common features across the data. The strategy follows these steps: deleting concurrent duplicates, creating near-time bridge controller tables, creating frequency tables to delete unrelated data, and reducing the data against the tuned data. Identifying the complex relationships between the various modality data types by extracting the relationships via the behavioral matrix can be used as an analysis tool, improving the design of the adaptation rules by helping to understand the relationships between the evolving behavior of the users and the interaction between the available modalities. It can enhance data fusion models based on the presented multifusion learning model, which can be more efficient in determining the extraction conditions and enhancing object classification accuracy across different modalities and systems. The automated extraction conditions are counted and extracted based on the number of features. The output is highly dependent on the input datasets when tracing the modality type and the dynamic number of feature vectors. A sketch of the weight, priority, and reduction-level computation follows.
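Under the readings of Eqs. (8), (9), and (11) given above (weight relative to the largest input, priority relative to the smallest, reduction level as their sum), and assuming dataset size in samples as the quantity behind Dt/Ds, a minimal sketch of the software-defined controller's bookkeeping could look as follows; the names and rounding are illustrative, and the paper's exact normalization may differ.

```python
def modality_controller(dataset_sizes):
    """dataset_sizes: dict mapping modality name -> number of samples."""
    total = sum(dataset_sizes.values())
    largest = max(dataset_sizes.values())
    smallest = min(dataset_sizes.values())
    control = {}
    for name, size in dataset_sizes.items():
        weight = size / largest                 # Eq. (8): relative to the biggest input
        priority = (size - smallest) / total    # Eq. (9): relative to the smallest input
        reduction = weight + priority           # Eq. (11): reduction-level filter
        control[name] = {"weight": round(weight, 3),
                         "priority": round(priority, 3),
                         "reduction_level": round(reduction, 3)}
    return control

# Example: three modality inputs of different sizes.
print(modality_controller({"xray_images": 7000, "text_records": 7000, "audio": 1000}))
```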

4.1.1.2 Layer (2): Preprocessing layer

The preprocessing layer tunes the data of the different modality dataset types (images, videos, audio, or text), which can affect the weight and priority measurements. Any change in the preprocessing of any data will change the extracted correlation measurements based on the parameters measured by the software-defined layer. Each modality type has its own preprocessing based on normalization, cleaning, and augmentation. The importance of this layer is automated preprocessing for the different data topologies while interpreting heterogeneous, unstructured modality input types.

The preprocessing layer is designed for the different data topology layouts (image, text, audio, and video). In addition, it selects the suitable deep learning technique to work in parallel in the next layer. Two types of preprocessing are applied depending on dataset size and tuning. The default automated preprocessing for the four data topology layouts is as follows: image preprocessing does not require increasing the number of images; text preprocessing performs normalization and fills missing data with null; audio preprocessing converts signals to spectrograms; and video preprocessing splits the video into image frames, computes the number of frames, and sorts the frames.

The proposed framework then infers a more dynamic training configuration, providing more options for the multiple topologies. Image: increase the number of images with augmentation, adding or removing noisy data, rotation, scaling, reflection, and cropping. Text: cleaning, normalization, and applying the trade-off between filling missing data with null, removing missing data, removing outliers, and determining and replacing the data fill. Audio: converting to spectrograms, adding noisy data, and augmenting data. Video: splitting the video into image frames, computing the number of frames, sorting the frames, and determining the time scale of the video; augmentation is not required, but the number of video frames must be limited with respect to time, which can be normalized or zero-centered. The preprocessed data change the measurement of the proposed correlation through the computed weight and priority. This layer works with default properties to prepare the modality dataset inputs for the suitable deep learning techniques. The difficulty of implementing this layer lies in managing the unstructured multimodality types and characteristics with an unknown multicontext. The outputs of this layer are the processed modality datasets. A minimal per-modality dispatch sketch follows.
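A minimal per-modality dispatch sketch of the default preprocessing described above; each branch is a simple stand-in for the corresponding routine (normalization, null filling, spectrogram conversion, frame splitting), not the framework's actual implementation.

```python
import numpy as np

def preprocess(modality, data):
    """Route each modality input to a default preprocessing routine (stand-ins)."""
    if modality == "image":
        # Normalize pixel values to [0, 1]; augmentation would be added here.
        return np.asarray(data, dtype=float) / 255.0
    if modality == "text":
        # Normalize case/whitespace and fill missing entries with a null token.
        return [(t.strip().lower() if t else "<null>") for t in data]
    if modality == "audio":
        # Stand-in for spectrogram conversion: magnitude spectrum of the signal.
        signal = np.asarray(data, dtype=float)
        return np.abs(np.fft.rfft(signal))
    if modality == "video":
        # Split into frames (here: the first axis) and keep their order.
        return list(np.asarray(data))
    raise ValueError(f"unsupported modality: {modality}")

print(preprocess("text", ["  COUGH Detected ", None]))   # ['cough detected', '<null>']
```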

4.1.2 Fusion Level (2): Feature fusion level

Fusion level two, the feature-based fusion level, is designed on the basis of feature-level fusion. This level relies on deep learning to convert different topologies into one topology of reduced matrices. It aims to improve the adaptive fusion with the improved Dempster–Shafer technique using a larger set of filtered features and to improve the accuracy evaluation. This level extracts the proposed classification learning model between the modality dataset inputs into unified matrix topologies with reduced features for all objects in the datasets. It consists of three layers: the dynamic classification layer, the adaptive fusion layer, and the evaluation layer. It evaluates the accuracy optimization using the particle swarm optimizer to reach the best optimal accuracy point.

4.1.2.1 Layer (3): Dynamic classification layer

The third layer is an automated layer for improving multiobject classification and object detection by selecting a suitable neural network according to the input data types. The dynamic classification layer matches the appropriate neural networks to the modality types and numbers of the input data, which are image, video, audio, and text. The dynamic deep learning layer contributes in part to converting all modalities of the different input topologies into one topology of converted matrices with reduced features. In addition, it uses the sigmoid function to extract learned feature vectors from the various topologies, outputting numbers between zero and one that describe how much of each component should be let through. It prepares the feature and object vectors as the input of the adaptive fusion layer. The dynamic deep learning layer performs feature extraction and data reduction, converting multiple topologies into one topology of feature matrices. Feature extraction is the operation of transforming the modality input datasets into numerical data features that can be processed while preserving the information in the original dataset. The main goal of feature extraction in the deep learning layer is to achieve better results than applying machine learning directly to a single modality, by learning from multiple modalities instead. The dynamic deep learning layer relies on automated feature extraction and uses specialized deep learning techniques to extract features automatically from text, images, audio, or video without human intervention. This can be very powerful for producing higher results and more detailed features of the data objects, and for moving more quickly from raw data to machine learning algorithms. Previously, this was done through specialized feature detection, feature extraction, and feature matching algorithms. Nowadays, deep learning is very popular in image and video analysis and is known for its ability to take raw image data as input, skipping the handcrafted feature extraction step. Regardless of the approach, computer vision applications such as image registration, object detection and classification, and content-based image retrieval require an effective representation of image features, either implicitly through the first layers of the deep network or by explicitly applying some of the longstanding image feature extraction techniques.

4.1.2.2 Layer (4): Adaptive fusion layer

The adaptive fusion layer is designed on the basis of the improved mathematical Dempster–Shafer fusion, with dynamic probabilities and parameters counted from each dataset, and concatenation fusion working in parallel. This layer consists of two classifiers, (1) the dynamic CNN evidence classifier and (2) the CNN concatenation fusion classifier, working in parallel to improve the classification accuracy.

First, the evidential Dempster–Shafer neural network creates a classifier that automatically computes the belief and evidence, which form the output of the classifier. It creates a belief vector and an evidence vector for each object. This classifier uses convolutional and pooling layers to first extract high-dimensional features from the input modality datasets. The features are then transformed into mass features and summed in vectors.

The first classifier, the dynamic CNN evidence classifier, handles the modality dataset types using the following CNN structure (a code sketch of this structure follows the list):

  • Input layer: the image input layer specifies the size of the image, in this case 28 × 28 × 1. These numbers correspond to the height, width, and channel size.

  • The digital data here are grayscale images, so the channel size (color channel) is 1. For color images, the channel size is 3, corresponding to the RGB values. For a convolutional layer with a default stride of 1, "same" padding ensures that the spatial output size is the same as the input size.

  • Batch normalization layers normalize the activations and gradients propagated through the network, making training the network a simpler optimization problem. Use batch normalization layers between convolutional and nonlinear layers, such as ReLU layers, to speed up network training and reduce network initialization sensitivity.

  • ReLU layer, the batch normalization layer is followed by a nonlinear activation function. The most common activation function is the rectified linear unit (ReLU).

  • Max pooling layer, convolutional layers (with activation functions) are sometimes followed by a down sampling operation that reduces the spatial dimension of the feature map and removes redundant spatial information. Down sampling allows you to increase the number of filters in deeper convolutional layers without increasing the amount of computation required for each layer. The max pooling layer returns the maximum value of the input's rectangular regions, specified by the first argument, pool size.

  • Fully connected layer: the convolution and down-sampling layers are followed by one or more fully connected layers. As the name suggests, a fully connected layer is a layer in which neurons connect to all neurons in the previous layer. This layer combines all the features learned by previous layers on the image to identify larger patterns. The final fully connected layer combines the features to classify the images; therefore, its Output Size parameter equals the number of classes in the target data. In this example, the output size is 10. A key feature of Dempster–Shafer theory is its handling of ignorance, such that the masses over all events accumulate to 1; ignorance is reduced by adding more and more evidence. Combination rules are used to combine different types of evidence. The main advantage of Dempster–Shafer is that adding more information reduces the period of uncertainty, so DST reaches a much lower level of ignorance. A diagnostic hierarchy can be represented with it, and people faced with such problems have the freedom to reason about the evidence. The main limitation is the high computational effort, since this research deals with 2^n sets.

The second classifier, the CNN concatenation fusion classifier, is designed on the basis of a convolutional neural network with learned data to obtain a large number of features. The concatenated neural network classifies the characteristics estimated for the multimodal data collected by multiple sources/sensors in offline mode and produces the classifier output. Concatenation fusion has two main characteristics: it yields a greater number of features, and it does not require learning the data before concatenating the features from the various datasets. Its major advantages are that more information is added and the uncertainty interval is reduced.

The filter classifier filters the classification by subtracting the redundant feature classes detected in the first vector from the classes detected in the second vector, which is equivalent to the initialization of the data fusion algorithm used; the output is the fused vector.

The reduction level is interpreted through the weights of the various relationships between parameters (Pweight) and the priority values between the input modality types (Mpriority), as proven in Proof #4. The relationships concern the parameters and the relationships among them. The reduction level is based on similar or different vectors. Data reduction relies on the parameters and their relationships or conditions among themselves and on the priority between the modality data inputs.

  • SoftMax Layer, the SoftMax activation function normalizes the output of the fully connected layer. The output of the SoftMax layer consists of positive numbers that sum to 1, which can then be used by the classification layer as the classification probability.

  • Classification Layer, the last layer is the classification layer. This layer uses the probability returned by the SoftMax activation function for each input to assign the input to one of the mutually exclusive classes and calculate the loss.

  • The deep neural network works as follows: (a) the weighted sum of the inputs is calculated; (b) the bias is added; (c) the result is fed to an activation function; (d) a specific neuron is activated.

  • The improvement of Dempster–Shafer lies in the automated neural networks and in obtaining a larger number of features. The improved Dempster–Shafer aims to support multilabel classification. The adaptive fusion approach makes two fusions on two different levels, high and low: it runs the Dempster–Shafer and concatenation fusions in parallel, then extracts the unimportant features and reduces them.
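A minimal sketch of the layer sequence listed above (28 × 28 × 1 input, convolution/batch-normalization/ReLU/max-pooling stages, a fully connected layer with 10 outputs, and SoftMax), written in PyTorch for illustration only; the filter counts are assumptions, and the framework itself is implemented in MATLAB.

```python
import torch
import torch.nn as nn

# Two conv/BN/ReLU/max-pool stages followed by a 10-class fully connected head.
cnn = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1),   # convolution, "same" padding
    nn.BatchNorm2d(8),                           # batch normalization
    nn.ReLU(),                                   # nonlinear activation
    nn.MaxPool2d(kernel_size=2, stride=2),       # down-sampling to 14x14
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),       # down-sampling to 7x7
    nn.Flatten(),
    nn.Linear(16 * 7 * 7, 10),                   # fully connected layer, 10 classes
    nn.Softmax(dim=1),                           # class probabilities summing to 1
)

x = torch.randn(4, 1, 28, 28)                    # a batch of four grayscale images
print(cnn(x).shape)                               # torch.Size([4, 10])
```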

The contribution of this layer is the improvement of the computation and extracted features of the Dempster–Shafer fusion by concatenation fusion, improving the fused multiobject classification from diverse multimodality data in a multicontext or unknown context. The adaptive fusion layer draws the full picture of the models' classification, providing a unified target for the multiple sensory data classifications in various smart environment systems. The present work executes majority voting over the diverse CNN-based pretrained models with an adaptive fusion technique. Thanks to the simplicity and expressiveness of the DS formalism, the outputs of an evidential classifier provide more information than ordinary classifiers (e.g., a neural network with a sigmoid feature-extraction layer) that transform an input feature vector into a probability distribution or any other distribution. A sigmoid function supports multilabel object classification. The importance of this layer lies in converting the different vectors from the various matrices of topology layouts. The difficulty of implementing this layer lies in managing the unstructured multimodality types and characteristics with an unknown multicontext. The output is a reduced, featured, filtered vector.

The output of the implemented Dempster–Shafer neural network is a numerical vector holding two temporary vectors for each object, called belief and evidence. The output of Dempster–Shafer is the tuned fused features. The proposed solution is constructed by building the classification of fused objects using a dynamic neural network with a computed number of parameters based on the data perspective in diverse contexts. The output of the adaptive fusion layer is tuned, classified numerical feature vectors with two temporary belief and evidence vectors. A sketch of the parallel fusion and filtering step follows.
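A simplified sketch of the parallel fusion and the filter classifier described above: the evidential branch is approximated by combining two classifiers' SoftMax outputs as singleton mass functions with Dempster's rule, and the filter step drops classes already detected by the first vector. Both are simplifications of the layer's actual computation; the class names and scores are illustrative.

```python
import numpy as np

def ds_combine_softmax(p1, p2):
    """Treat two classifiers' SoftMax outputs as singleton mass functions and
    combine them with Dempster's rule (all mass on singleton classes)."""
    joint = p1 * p2                       # agreement on each singleton class
    conflict = 1.0 - joint.sum()          # mass assigned to conflicting pairs
    return joint / (1.0 - conflict)

def filter_redundant(classes_a, classes_b):
    """Filter classifier: drop classes already detected by the first vector."""
    return [c for c in classes_b if c not in classes_a]

# SoftMax outputs of the evidential branch and the concatenation branch.
p_evidential = np.array([0.70, 0.20, 0.10])
p_concat     = np.array([0.60, 0.30, 0.10])
print(ds_combine_softmax(p_evidential, p_concat))   # sharpened fused distribution

# Detected class labels from the two branches, before and after filtering.
print(filter_redundant(["tank", "truck"], ["truck", "helicopter"]))  # ['helicopter']
```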

4.1.2.3 Layer (5) Evaluation layer

The evaluation layer has two parts: (1) evaluating the training accuracy and (2) optimizing the results for multiple smart systems. It improves accuracy results to 96–98% in various contexts. The experiments are applied to various multimodal inputs for diverse contexts that share the common factors of smart systems, such as smart military and smart health. The layer measures the accuracy and optimization results in multiple smart context systems. It splits the data into two types, training data and testing data, and measures accuracy, precision, recall, and F1 [64, 65]. The training applies particle swarm optimization to improve the accuracy evaluation, changing the hyperparameters 30 times to reach the best point. The importance of this layer is the training of the data to reach the best-accuracy solution point. The difficulty of implementing this layer lies in managing the unstructured multimodality types and characteristics with an unknown multicontext. The contribution of this layer is achieving the best accuracy result by changing hyperparameters to reach the best optimized point. The output is a featured, filtered vector. The multiclass deep learning model is designed so that the neural network can make multiclass predictions; it computes the confidence in the SoftMax output.

4.2 The inputs and outputs of adaptive multifusion framework

The architecture includes five layers of the adaptive multifusion framework based on two fusion levels, which are described in Table 4.

Table 4 An adaptive multifusion framework based on two fusion levels (inputs/output)

5 Datasets characteristics for multimodality on multicontext

The modality data types are described in Table 5. The description of the modality datasets in multiple contexts is limited to tracing from 1 to 16 modality inputs, as shown in the following experiments. This section presents a comparative accuracy analysis between the proposed adaptive fusion model using deep learning and Dempster–Shafer fusion and the concatenation fusion model. The experiments are designed to be generic and to adapt to multimodality in multiple contexts, interpreting the data perspective of each dataset based on the target of complementary data, whether interrelated data such as patients' data and metadata, or the complementary fusion of the same objects in diverse datasets, for example weapons datasets. This research classifies multimodality datasets by interpreting the modality data types and numbers without known conditions or a known context, although all experimental datasets satisfy the data criteria. The adaptivity to multiple contexts is shown to be applicable to the diverse experimental datasets: smart military with three inputs of the same modality, smart health with two inputs of different modalities, smart dietary health with three modality inputs, and smart agriculture with four modality inputs.

Table 5 A description of experimental datasets for multimodality datasets in multicontext

5.1 Dataset (1): Smart military data sets

Dataset 1 is extracted from three sources [66,67,68]. This dataset aims to classify military objects from three inputs of the same modality type. The dataset comprises 30,000 images across the intersective spectrum, visual-insensitive spectrum, and RGB images; samples are shown in Fig. 3. These data are balanced, as described in Table 6.

Fig. 3

Samples of smart military dataset in diverse spectrums

Table 6 Smart military dataset description of size and modality data type

5.2 Dataset (2): smart agriculture dataset

It is extracted from sixteen sources in [69]. This dataset aims to classify leaf-disease objects from inputs of the same modality type. The dataset comprises 2,282,829,720 augmented RGB images of leaf diseases; samples are shown in Fig. 4. These data are imbalanced, as described in Table 7.

Fig. 4

Samples of smart agriculture dataset in diverse spectrums

Table 7 Smart agriculture dataset description

5.3 Dataset (3) Smart health COVID-19 data sets

Dataset 3 is extracted from two sources in [70, 71]. This dataset aims to classify COVID-19-infected subjects from two inputs of different modality types. The dataset comprises 7,000 text records and 1,000 audio recordings drawn from COVID-19 patient records and cardio cough-audio datasets; samples are shown in Fig. 5. These data are imbalanced, as described in Table 8.

Fig. 5

Samples of smart health COVID-19 dataset with different bimodalities types

Table 8 Smart COVID-19 health dataset description of size and modality data type

5.4 Dataset (4) Smart dietary health

Dataset 4 is extracted from two sources in [72]. This dataset aims to classify dietary objects from inputs of different modality types. The dataset comprises 6265 text records from a smart watch, 3657 text records from a mobile sensor, and 2586 images; samples are shown in Fig. 6. These data are imbalanced, as described in Table 9.

Fig. 6

Samples of smart dietary health dataset (4) with different tri-modality types

Table 9 Smart dietary dataset description of size and modality data type

6 Experimental results, analysis and discussion

The adaptive multifusion framework is implemented with a graphical user interface (GUI) to make it easier and simpler to use, and the framework is reusable across multiple information systems sharing mutual properties of multiple sources. The proposed solution is demonstrated by an implementation in MATLAB R2022b that constructs the Adaptive Smart Environment Multimodal System (ASEMMS) [73].

The accuracy evaluation measurement computes the classification accuracy of the various classification models. Precision and recall are useful measures of prediction success when the classes are imbalanced [74, 75], as summarized in Table 10, and the F1-measure is defined as the harmonic mean of precision and recall. Particle swarm optimization (PSO) is a computational optimization method inspired by social behavior [76, 77].

Table 10 The accuracy measurements

The precision and recall measures are defined in Eqs. (13) and (14):

$$ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $$
(13)
$$ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $$
(14)

These quantities are combined in the F1 score, defined in Eq. (15):

$$ \text{F1-measure} = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} $$
(15)
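As a small worked illustration of Eqs. (13)–(15), the helper below computes the three measures from confusion counts; the counts and the function name `precision_recall_f1` are placeholders for illustration, not values from the reported experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Eqs. (13)-(15): precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * recall * precision / (recall + precision)) if (recall + precision) else 0.0
    return precision, recall, f1

# Hypothetical confusion counts for one class of one experiment.
p, r, f1 = precision_recall_f1(tp=95, fp=3, fn=2)
print(f"precision={p:.3f}, recall={r:.3f}, F1={f1:.3f}")
```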

6.1 Experiments and results analysis

6.1.1 Experiment (1) Smart military: an experiment for same-modalities fusion

The experimental results include the tracing of the changing hyperparameters against the classification accuracy, as shown in Tables 11 and 12. The variables reported in Table 12 are defined as follows:

1. Iter: the iteration number, indicating which evaluation of the optimization process is being recorded.
2. Eval result: the evaluation result of each iteration, which can be a performance measure or another metric.
3. Objective: the value of the objective function being optimized.
4. Objective runtime: the run time of the objective-function evaluation for the current iteration.
5. BestSoFar (observed): the best observed result reached so far during the optimization process.
6. BestSoFar (estim.): an estimate of the best reachable result, based on extrapolation or estimation techniques.
7. Section depth: the depth explored by the optimization algorithm.
8. Initial learn rate: the learning rate used by the gradient-based learning algorithm.
9. Momentum: a parameter of optimization algorithms such as stochastic gradient descent that determines the contribution of past gradients to the current update.
10. L2 regularization: a penalty term tuned in the optimization experiment that helps avoid overfitting in machine learning models.

The best point achieved is shown in Table 13, together with the fused output numerical vector. A minimal sketch of the particle swarm search loop that produces such a trace is given after Table 13.

Table 11 A Comparative analysis of results between accuracy before optimization and after optimization
Table 12 First experiment tracing analysis of optimization of Adaptive fusion-based particle swarm optimizer
Table 13 The first experiment best fit optimizer record results
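As a rough illustration only, the following Python sketch runs a small particle swarm search over the three hyperparameters listed above (initial learn rate, momentum, L2 regularization) and prints a trace analogous to the Iter, Eval result, and BestSoFar (observed) columns of Table 12. The objective function, search ranges, swarm size, and PSO coefficients are all assumptions made for the illustration; in the framework, the objective evaluation is the training and validation of the fusion network in MATLAB.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed search space: [initial learn rate, momentum, L2 regularization].
lo = np.array([1e-4, 0.5, 1e-6])
hi = np.array([1e-1, 0.99, 1e-2])

def objective(x):
    """Stand-in objective (e.g., validation error); the real framework
    would train and evaluate the fusion network here."""
    lr, mom, l2 = x
    return (np.log10(lr) + 2.5) ** 2 + (mom - 0.9) ** 2 + 50 * l2

n_particles, n_iter = 6, 30          # 30 hyperparameter updates, as in the text
pos = lo + (hi - lo) * rng.random((n_particles, 3))
vel = np.zeros_like(pos)
pbest, pbest_val = pos.copy(), np.array([objective(p) for p in pos])
gbest = pbest[pbest_val.argmin()].copy()

for it in range(1, n_iter + 1):
    r1, r2 = rng.random((2, n_particles, 3))
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = np.clip(pos + vel, lo, hi)
    vals = np.array([objective(p) for p in pos])
    improved = vals < pbest_val
    pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
    gbest = pbest[pbest_val.argmin()].copy()
    # Trace columns analogous to Table 12: Iter | Eval result | BestSoFar (observed)
    print(f"Iter {it:2d}  eval={vals.min():.5f}  best-so-far={pbest_val.min():.5f}")
```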

The adaptive multifusion framework on the smart military context achieves an accuracy of 98.8% on the same tri-modalities, as shown in Fig. 7.

Fig. 7

The adaptive multifusion framework experiment on the smart military context achieves an accuracy of 98.8% for fusion via dataset (1) of the same tri-modalities

Table 13 introduces the tracing of the optimization analysis of adaptive fusion, in which the particle swarm optimizer automatically searches for the best optimal points. There are four best points, of which the best achieves an estimated objective function value of 0.045845.

6.1.2 Experiment (2) Smart agriculture: an experiment for same modalities fusion

The experiment presents a comparative analysis of accuracy before and after optimization, as shown in Table 14.

Table 14 A comparative analysis of results between accuracy before optimization and after optimization

The experimental results include the tracing of the changing hyperparameters against the classification accuracy, as shown in Table 15; the best point achieved is shown in Table 16, together with the fused output numerical vector. The adaptive multifusion framework experiment on the smart agriculture context achieves an accuracy of 98.5% on the same multimodalities, as shown in Fig. 8.

Table 15 Second experiment tracing analysis of optimization of the hybrid Adaptive fusion model, deep learning-based Dempster–Shafer fusion model, and deep learning-based concatenation fusion model
Table 16 The second experiment best-fit optimizer record results
Fig. 8

The adaptive multifusion framework experiment on the smart agriculture context achieves an accuracy of 98.5% for fusion via dataset (2) of the same multimodalities

6.1.3 Experiment (3) Smart COVID-19 Health with different modalities

The experimental results include the tracing of the changing hyperparameters against the classification accuracy, the best point, and the fused output numerical vector, as shown in Table 17.

Table 17 A comparative analysis of results between accuracy before optimization and after optimization of experiment three

The adaptive multifusion framework experiment on the smart COVID-19 health context achieves an accuracy of 97.6% for fusion via dataset (3), described in Sect. 5, of the different multimodalities, as shown in Fig. 9.

Fig. 9

The adaptive multifusion framework experiment on the smart COVID-19 health context achieves an accuracy of 97.6% for fusion via different multimodalities

6.1.4 Experiment (4) Smart dietary health with different modalities

The experimental results include the tracing of the changing hyperparameters against the classification accuracy, the best point, and the fused output numerical vector, as shown in Table 18.

Table 18 A comparative analysis of results between accuracy before optimization and after optimization for experiment 4

The adaptive multifusion framework experiment on the smart dietary health context achieves an accuracy of 95.9% for fusion via dataset (4), described in Sect. 5, of the different multimodalities, as shown in Fig. 10.

Fig. 10

The adaptive multifusion framework experiment on the smart dietary health context achieves an accuracy of 95.9% for fusion via dataset (4) of the different multimodalities

6.2 Comparative analysis and discussion

The first comparative analysis compares the proposed adaptive multifusion model with two prior fusion models [20] and [24], as shown in Table 19. The comparison considers the multimodal fusion models' properties: modality data type, modality number, data fusion level, interpreted context, experimental dataset, and weaknesses.

Table 19 A comparative analysis between proposed adaptive fusion model and previous models

Table 20 compares the accuracy before and after optimization across the four experiments for the three models, optimized with the Bayesian optimizer and the particle swarm optimizer.

Table 20 A comparative analysis of accuracy before and after optimization for four experiments for the three models with optimization with Bayesian optimizer and particle swarm optimizer

The second comparative analysis compares the proposed adaptive framework with three multimodal frameworks [78,79,80], as shown in Table 21. The comparison considers the multimodal frameworks' properties: modality data type and modality number, data fusion level, interpreted context, experimental dataset, and weaknesses. The proposed adaptive framework solves several drawbacks of [78,79,80]: the three previous frameworks cannot interpret multimodal input in diverse contexts to improve object classification. The advantages of the proposed framework are the dynamic interpretation of multimodality types and numbers (based on the data perspective rather than the context perspective), the ability to excavate the relationships between multimodalities, the automatic handling of both the same and various multimodalities (text, audio, image, and video), and high classification accuracy for single- and multiobject classification. In addition, the proposed adaptive framework addresses the redundant fused data problem and the high-level abstract data problem, which is based on low-level features, by removing redundancy from the fused vector data. It is designed on deep neural network models that combine Dempster–Shafer fusion with concatenation fusion, and it offers a development implementation and user interface at the cost of a highly complex implementation.

Table 21 A comparative analysis between proposed adaptive fusion framework and previous fusion frameworks

The proposed adaptive multifusion framework can be applied to data or big data, whether of the same or different types, acquired via smart sensors or intelligent devices, provided the data meet the criteria presented in the Preliminaries. Two types of datasets satisfy the proposed criteria: (1) data of the same type combined from different sources to classify objects, and (2) multiple-object data of different types or characteristics combined to achieve a unified classification of objects that are often interrelated.

The experimental results in Fig. 11 represent the average of the total experimental classification of multimodalities and present the behavior of the classification accuracy across the experiments for the various modality inputs.

Fig. 11

The behavior analysis of accuracy classification results and classification fusion results in many experiments for various modality inputs

The experimental results in Fig. 12 present the behavior of the fusion techniques, namely the adaptive multimodal fusion model, the Dempster–Shafer model, and the concatenation fusion model, across the experiments for multimodal inputs in multicontext.

Fig. 12

The behavior analysis of fusion techniques in many experiments for multimodal inputs in multicontext

The experimental results in Fig. 13 illustrate the comparative accuracy analysis between the proposed fusion model of the adaptive multifusion framework, the concatenation fusion model, and the Dempster–Shafer fusion model, based on the four experimental results across diverse modality data types and numbers. The average accuracy of the proposed multimodal fusion framework is 97.45%, with a reduced feature level for the multiclass fused multifusion learning model; this makes it better than the concatenation fusion model by 28.5% and better than the Dempster–Shafer fusion model by 7.075%. The average accuracy of the concatenation fusion model is 68.925% with a large number of features, and the average accuracy of the Dempster–Shafer fusion model is 90.375% with limited features.
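For clarity, the reported margins are simply the differences of these average accuracies (the first difference is rounded to one decimal place in the text):

$$ 97.45\% - 68.925\% = 28.525\% \approx 28.5\%, \qquad 97.45\% - 90.375\% = 7.075\% $$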

Fig. 13

A comparative analysis of accuracy results between the proposed fusion model of the adaptive multifusion framework, the concatenation fusion model, and the Dempster–Shafer fusion model based on four experimental results in diverse modality data types and numbers

A direct numerical accuracy comparison with related works is not applicable, because those works are tested on different datasets and do not satisfy the adaptivity condition of multicontext: to our knowledge, no previous research uses multiple datasets in diverse contexts to test generic adaptivity on multicontext. Because of the modality condition, the way information is extracted and the way the features of diverse objects are fused with respect to multiple modality types add significant value. Therefore, the comparative analysis with previous frameworks is kept general rather than an in-depth numerical comparison.

7 Conclusion and future works

This paper presents the adaptive multimodal fusion framework as a solution for the modality/context-based problem, which is divided into two fusion problems: modality-based fusion and context-based fusion. The main challenge of modality/context-based fusion lies in the conflicting nature of the data and the complexity of fusing sensory data. Modality-based fusion is interpreted as fusing multiple data sources with the same data type and fusing multiple diverse modality types from the same source in various smart systems. Context awareness is described as interpreting the context to extract relationships, features, conditions, and data modality types. Two types of datasets satisfy the proposed criteria: (1) similar types of data fused from various sources for object classification and (2) complementary multitarget data of diverse types or characteristics fused to achieve the unified classification target of objects that are often interrelated. The framework creates a multifusion learning model that plays a vital role in fusing complementary heterogeneous data into a reliable and robust classification model in multiple contexts. The main strengths of the adaptive fusion framework are improving multiobject classification with automatically reduced features, resolving ambiguity and inconsistency in the fused data, and increasing certainty while reducing data redundancy by improving the balance of the data. The adaptive multimodal fusion framework outperforms the concatenation fusion model by 28.5% and the Dempster–Shafer fusion model by 7.075%; the average accuracy of the concatenation fusion model is 68.925% with many features, and the average accuracy of the Dempster–Shafer fusion model is 90.375% with limited features. The limitation of the presented adaptive fusion framework is the difficulty of scaling beyond 16 sensory datasets and to larger data volumes, which would require higher performance.

Future work will pay more attention to deeper analysis of particular fusion techniques, whether motivated by new proposals, attempts at different research directions, or simple curiosity. One significant research direction is fusion for materials data science: developments in industrial materials science have produced large amounts of materials data that vary in format and semantics and are extracted from multiple sources. Materials data integration and fusion provide a unified framework for representation, processing, storage, and mining, which helps accomplish tasks such as materials data clarification, materials extraction, fabrication parameter setting, and materials knowledge extraction.