1 Introduction

Deep learning applications and related artificial intelligence (AI) models for clinical information and image analysis arguably have the greatest potential to make a positive, lasting impact on human lives in a relatively short period of time [1]. The computer processing and analysis of medical images involves image retrieval, image creation, image analysis, and image-based visualization [2]. Medical image processing has grown to encompass computer vision, pattern recognition, image mining, and machine learning [3]. Deep learning is one methodology commonly used to achieve state-of-the-art accuracy, and it has opened new doors for medical image analysis [4]. Deep learning applications in healthcare address a wide variety of problems, from cancer screening and infection monitoring to personalized treatment recommendations [5]. Today, a massive quantity of data from sources such as radiological imaging, genomic sequencing, and pathology imaging is at physicians' disposal [6]. We still lack mature methods, however, to turn all this data into usable information. PET (positron emission tomography), X-ray, CT (computed tomography), fMRI (functional MRI), DTI (diffusion tensor imaging), and MRI (magnetic resonance imaging) are the typical modalities used for medical imaging [7, 8].

Deep learning learns patterns in data using neural networks built from many interconnected artificial neurons [9, 10]. An artificial neuron is a unit that, like a biological neuron, takes several inputs, performs a simple calculation, and returns a result [11,12,13,14,15]. This calculation takes the form of a weighted linear combination of the inputs followed by a nonlinear activation function [16]. The sigmoid, ReLU (rectified linear unit) and its variants, and tanh (hyperbolic tangent) are examples of commonly used nonlinear activation functions [17,18,19,20,21]. The origins of deep learning can be traced back to Warren McCulloch and Walter Pitts (1943), followed by milestones such as backpropagation (1986), the convolutional neural network (CNN; LeNet, 1998), LSTM (long short-term memory, 1997), ImageNet (2009), and AlexNet (2012) [22]. In 2014, Google introduced GoogLeNet (winner of the ILSVRC 2014 challenge) [23], which incorporated the idea of inception modules and significantly lowered CNN's computational complexity. A CNN architecture is composed of multiple layers that use differentiable functions to transform an input volume into an output volume (e.g., holding the class scores). Essentially, deep learning is a reincarnation of the artificial neural network, in which artificial neurons are stacked in depth. In a CNN [22, 23], features are created by convolving kernels with the outputs of previous layers; the kernels in the first hidden layer carry out convolutions directly on the input images [24, 25]. Early hidden layers capture edges, curves, and simple shapes, while deeper hidden layers capture more abstract and complex features. Figure 1 shows the different types of learning processes available [26].

Fig. 1 Types of learning

1.1 Types of learning

The following are the 14 types of learning that an AI practitioner should be acquainted with.

Learning problems

  • 1. Supervised learning

  • 2. Unsupervised learning

  • 3. Reinforcement learning

Hybrid learning problems

  • 4. Semi-supervised learning

  • 5. Self-supervised learning

  • 6. Multi-instance learning

Statistical inference

  • 7. Inductive learning

  • 8. Deductive inference

  • 9. Transductive learning

Learning techniques

  • 10. Multi-task learning

  • 11. Active learning

  • 12. Online learning

  • 13. Transfer learning

  • 14. Ensemble learning

Each of these is examined in turn in the following sections.

1.1.1 Learning problems

1.1.1.1 Supervised learning

Supervised learning describes problems in which a model learns a mapping between input examples and a target variable [26]. In supervised learning, the training data consists of input vectors together with the target vectors that correspond to them. There are two major types of supervised learning problems: classification, which involves predicting a class label, and regression, which involves predicting a numerical value [27]. Classification is a supervised learning problem that requires the prediction of a class label; regression is a supervised learning problem that involves predicting a numerical label [28].

Classification and regression problems may have one or more input variables, and the input variables may be of any data type, such as numerical or categorical [28]. MNIST, a handwritten digit dataset whose inputs are images of handwritten digits (pixel data), is an example of a classification problem [29]. Algorithms designed for supervised machine learning problems are known as supervised machine learning algorithms; decision trees and support vector machines are two examples [29, 30].

These algorithms are referred to as supervised because they learn by making predictions on given input data, and the models are corrected and improved using the known outcomes [31]. Some methods are suited specifically to classification (e.g., logistic regression) or regression (e.g., linear regression), while others can be employed for both types of problem with minor modifications (e.g., artificial neural networks) [32, 34].
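To make the distinction concrete, the following minimal sketch (using scikit-learn and its bundled toy datasets, our own choice for illustration rather than anything from the surveyed literature) fits a decision tree for classification and a linear model for regression:

```python
# Minimal supervised-learning sketch: classification vs. regression
# (illustrative only; uses scikit-learn's bundled toy datasets).
from sklearn.datasets import load_digits, load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

# Classification: predict a discrete class label (digit 0-9 from pixel data).
X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = DecisionTreeClassifier().fit(X_tr, y_tr)
print("classification accuracy:", clf.score(X_te, y_te))

# Regression: predict a continuous numerical value.
X, y = load_diabetes(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
reg = LinearRegression().fit(X_tr, y_tr)
print("regression R^2:", reg.score(X_te, y_te))
```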

1.1.1.2 Unsupervised learning

Unsupervised learning describes problems that involve using a model to describe or extract relationships in data. In contrast with supervised learning, unsupervised learning operates on the input data alone, with no outputs or target variables [33]. As such, unlike supervised learning, unsupervised learning has no instructor to correct the model. There are many forms of unsupervised learning, but two key problems are frequently encountered in practice: clustering, which involves grouping the data, and density estimation, which involves summarizing the distribution of the data. Clustering is an unsupervised learning problem that requires finding groups (classes) in the data [33,34,35,36,37,38,39].

Density estimation is an unsupervised learning problem that requires summarizing the distribution of the data. K-means is a clustering technique in which k refers to the number of cluster centres to be found in the data [40]. Kernel density estimation is a density estimation method that uses small groups of closely related data samples to estimate the distribution of new points in the problem space [34,35,36,37,38]. Clustering and density estimation can both be performed to learn about patterns in the data. Additional unsupervised approaches can also be used, such as visualization, which involves graphing or plotting data in various ways, and projection, which involves reducing the dimensionality of the data [41].
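Both problems can be sketched in a few lines (again with scikit-learn, for illustration only):

```python
# Unsupervised-learning sketch: clustering with k-means and
# kernel density estimation (illustrative synthetic data).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # one blob of points
               rng.normal(5, 1, (100, 2))])   # a second blob

# Clustering: group the data into k = 2 clusters (no labels needed).
labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)

# Density estimation: summarize the data distribution, then score new points.
kde = KernelDensity(bandwidth=0.5).fit(X)
log_density = kde.score_samples(np.array([[0.0, 0.0], [2.5, 2.5]]))
print(labels[:5], log_density)
```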

Visualization is an unsupervised learning problem that involves creating plots of data. Data visualization is a methodology for helping people understand large quantities of data by using standardized and interactive visuals within a particular context [42]. The information is often presented in a narrative style that highlights patterns, trends, and associations that would otherwise go unnoticed [43].

Projection is an unsupervised learning problem that involves creating lower-dimensional representations of data [44]. Random projection is a more computationally efficient dimensionality reduction approach than principal component analysis [45], and it is often used on datasets with too many dimensions for principal component analysis to be computed directly.
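A brief sketch of projection (illustrative data sizes; random projection and PCA are applied to the same synthetic matrix for comparison):

```python
# Projection sketch: reduce dimensionality with a random projection
# and with PCA (illustrative only).
import numpy as np
from sklearn.random_projection import GaussianRandomProjection
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(500, 10_000))  # high-dimensional data

X_rp = GaussianRandomProjection(n_components=100, random_state=0).fit_transform(X)
X_pca = PCA(n_components=100).fit_transform(X)   # typically more expensive here
print(X_rp.shape, X_pca.shape)                   # both (500, 100)
```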

1.1.1.3 Reinforcement learning

Reinforcement learning describes a class of problems in which an agent must learn to act in a given environment using feedback [46]. It is similar to supervised learning in that the agent receives responses from which to learn, but the feedback may be delayed and statistically noisy, which makes it challenging for the agent to link actions to their consequences [42, 47]. Deep reinforcement learning, Q-learning, and temporal-difference learning are some common examples of reinforcement learning algorithms [48].
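The core of tabular Q-learning fits in a short sketch (a minimal illustration; the random stand-in environment is our own assumption, not a real task):

```python
# Tabular Q-learning sketch: learn action values from delayed, noisy reward
# (illustrative only; the "environment" here is a random stand-in).
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma, epsilon = 0.1, 0.9, 0.1   # learning rate, discount, exploration
Q = np.zeros((n_states, n_actions))
rng = np.random.default_rng(0)

state = 0
for step in range(1000):
    # Epsilon-greedy action selection.
    if rng.random() < epsilon:
        action = int(rng.integers(n_actions))
    else:
        action = int(np.argmax(Q[state]))
    # Stand-in environment: random next state, noisy reward.
    next_state = int(rng.integers(n_states))
    reward = float(action == state % n_actions) + rng.normal(0, 0.1)
    # Q-learning update: move Q(s, a) toward reward + discounted best future value.
    Q[state, action] += alpha * (reward + gamma * Q[next_state].max() - Q[state, action])
    state = next_state
```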

1.1.2 Hybrid learning problems

1.1.2.1 Semi-supervised learning

Semi-supervised learning is supervised learning where the training data contains a few labelled examples and a large number of unlabelled examples [48]. The goal of a semi-supervised learning model is to make effective use of all the available data, not only the labelled data as in the supervised learning technique [49]. It may draw on, or be inspired by, unsupervised methods such as clustering and density estimation in order to make effective use of the unlabelled data [49, 50]. Once groups or patterns are discovered, supervised techniques or ideas from supervised learning can be used to label the unlabelled examples, and these labels are then used to make accurate predictions [51].

This category covers many problem domains, such as audio data (automatic speech recognition), text data (natural language processing), and image data, which are not easily solved with traditional supervised learning techniques [34, 51].

1.1.2.2 Self-supervised learning

A self-supervised learning system needs only unlabelled data, from which it formulates a pretext learning task, such as predicting context or image rotation, for which a target objective can be computed without supervision [52]. Autoencoders are a good example of self-supervised learning algorithms. An autoencoder is a type of neural network used to create a compact, or compressed, representation of an input sample [52, 53]. It consists of an encoder and a decoder component separated by a bottleneck that holds the internal compact representation of the input [54]. Autoencoder models are trained by providing the input as both input and target output, forcing the model to reproduce the input by first encoding it to a compressed representation and then decoding it back to the original [53]. After training, the decoder is discarded and the encoder is used to generate compact input representations as needed. Autoencoders have historically been used for dimensionality reduction or feature learning [54].
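A minimal encoder-bottleneck-decoder sketch in Keras follows (the layer sizes and the 784-dimensional input, e.g., a flattened 28 x 28 image, are illustrative assumptions):

```python
# Autoencoder sketch: encoder -> bottleneck -> decoder, trained to
# reproduce its own input (illustrative architecture and sizes).
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(784,))            # e.g., a flattened 28x28 image
encoded = layers.Dense(128, activation="relu")(inputs)
bottleneck = layers.Dense(32, activation="relu")(encoded)   # compact representation
decoded = layers.Dense(128, activation="relu")(bottleneck)
outputs = layers.Dense(784, activation="sigmoid")(decoded)

autoencoder = Model(inputs, outputs)
encoder = Model(inputs, bottleneck)              # kept after training
autoencoder.compile(optimizer="adam", loss="mse")
# Trained with the input serving as both input and target:
# autoencoder.fit(x_train, x_train, epochs=10, batch_size=256)
```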

Self-supervised learning is also exemplified by generative adversarial networks, or GANs [54, 55]. These are generative models most frequently used to generate synthetic images using only a collection of unlabelled examples from the target domain [55].

1.1.2.3 Multi-instance learning

In multi-instance learning, an entire bag of examples is labelled as containing or not containing an example of a class, but the individual members of the bag are not labelled [56].

1.1.3 Statistical inference

The term inference refers to the process of reaching a conclusion or making a decision. In machine learning, both fitting a model and making a prediction are examples of inference [56]. There are several inference paradigms that can be used to describe how particular machine learning algorithms work or how certain learning problems may be solved: inductive, deductive, and transductive inference. Induction derives a general model from specific examples [57, 58]. Deduction uses a general model (rule) to make predictions about specific cases. Transduction makes predictions for specific examples directly from other specific examples [58].

1.1.3.1 Inductive learning

Inductive learning involves using evidence to determine an outcome; it refers to reasoning from specific cases to general rules [59]. Most machine learning algorithms learn through inductive reasoning, in which general rules (the model) are learned from specific historical examples (the data) [59, 60]. Fitting a machine learning model is thus an instance of induction. The model is a generalization of the specific examples in the training dataset: the training data is used to form a model or hypothesis about the problem, and the model is assumed to carry over to fresh, unseen data later [60].

1.1.3.2 Deductive inference

Deduction, or deductive inference, refers to using general concepts to evaluate specific outcomes. We can better understand induction by contrasting it with deduction, which is its polar opposite [61]. Where induction progresses from the specific to the general, deduction progresses from the general to the specific [62]. Induction is a bottom-up form of reasoning that uses the available evidence as support for an outcome, while deduction is a top-down method of reasoning that requires all premises to hold before determining the result [63]. In machine learning, we first use induction to fit a model on a training dataset; the model can then be used to make predictions [64,65,66,67,68]. Using the model in this way is a deductive process.

1.1.3.3 Transductive learning

Transduction, or transductive learning, is a term used in statistical learning theory to describe the process of predicting specific examples given other specific examples from a domain [69]. It differs from induction, which involves learning general rules from specific examples [70]. In the model of estimating the value of a function at a single point of interest, a distinct notion of inference is defined [71]; note that this principle of inference arises when one would like to get the best possible outcome from a limited amount of information [72]. The k-nearest neighbours algorithm is a classic example: instead of modelling the training data, the transductive algorithm uses it directly each time a prediction is required [3, 47, 72].
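A short sketch of this transductive flavour (using scikit-learn's k-nearest neighbours on the bundled iris data, for illustration only):

```python
# Transduction-style sketch: k-nearest neighbours makes predictions
# directly from stored examples rather than from a fitted general model.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # "fit" just stores the data
print(knn.predict(X[:3]))  # each prediction consults the 5 nearest stored examples
```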

1.1.4 Learning techniques

1.1.4.1 Multi-task learning

Multi-task learning is a technique for improving generalization by pooling information from several related tasks (which can be seen as soft constraints placed on the parameters) [73]. It can be a useful approach to problem-solving when there is an abundance of labelled data for one task that can be shared with another task that has much less labelled data [74, 75].

For example, a multi-task learning problem may involve the same input patterns being used for several different outputs or supervised learning problems [76]. In this configuration, each output is predicted by a different part of the model, allowing the core of the model to learn a representation of the same inputs that is shared across tasks [75].
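A minimal Keras sketch of this configuration follows (the input size, one classification head, and one regression head are illustrative assumptions):

```python
# Multi-task sketch: one shared core ("trunk") with two task-specific heads
# (illustrative sizes; a classification head and a regression head).
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(64,))
trunk = layers.Dense(128, activation="relu")(inputs)   # shared by both tasks
trunk = layers.Dense(64, activation="relu")(trunk)

class_head = layers.Dense(10, activation="softmax", name="label")(trunk)
reg_head = layers.Dense(1, name="score")(trunk)

model = Model(inputs, [class_head, reg_head])
model.compile(optimizer="adam",
              loss={"label": "sparse_categorical_crossentropy", "score": "mse"})
# One set of inputs, two supervised targets trained jointly:
# model.fit(x, {"label": y_class, "score": y_value})
```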

1.1.4.2 Active learning

Active learning is a methodology in which the model may query a human operator during the learning process to resolve uncertainty [77]. It is a form of supervised learning that aims to achieve the same or better performance than so-called passive supervised learning while being more data-efficient [78]. The central principle behind active learning is that a machine learning algorithm allowed to choose the data from which it learns can achieve greater accuracy with fewer training labels [79, 80]. An active learner poses queries, typically in the form of unlabelled data instances to be labelled by an oracle [e.g., a human annotator] [81]. Active learning is a valuable tool when there is little data available and collecting or labelling new data is expensive [82, 83]. The active learning process directs the sampling of the domain in a way that decreases the number of samples needed while increasing the model's effectiveness [84].
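The loop below sketches uncertainty sampling, the simplest active learning strategy (illustrative only; the "oracle" is simulated by the held-back true labels):

```python
# Active-learning sketch: uncertainty sampling from a pool of unlabelled data
# (illustrative only; the "oracle" here is the held-back true label).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
labelled = list(range(10))                    # start with 10 labelled examples
pool = [i for i in range(len(X)) if i not in labelled]

for round_ in range(20):
    model = LogisticRegression(max_iter=1000).fit(X[labelled], y[labelled])
    # Query the pool example the model is least certain about.
    probs = model.predict_proba(X[pool])
    most_uncertain = pool[int(np.argmin(probs.max(axis=1)))]
    labelled.append(most_uncertain)           # the oracle supplies its label
    pool.remove(most_uncertain)
```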

1.1.4.3 Online learning

Machine learning is typically carried out offline, meaning that we have a batch of data and we optimize a model over it [85]. However, if we have streaming data, we need to conduct online learning, updating our estimates as each new data point arrives rather than waiting until the end (which may never occur) [86]. Online learning is useful because data can change rapidly over time [87,88,89]. It is also useful for applications involving a broad body of data that is continuously growing, even if the changes are incremental [90]. In general, online learning aims to minimize the discrepancy between how well the model performs and how well it would have performed if all the available data had been processed as a single batch [91]. The so-called stochastic or online gradient descent used to fit an artificial neural network is one instance of online learning [92].

The fact that stochastic gradient descent minimizes generalization error is easiest to see in the online learning setting, where examples or mini-batches are drawn from a stream of data [93].
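A minimal sketch of online updates with scikit-learn's partial_fit (the stream of mini-batches is simulated here):

```python
# Online-learning sketch: update a linear model one mini-batch at a time
# with scikit-learn's partial_fit (illustrative stream of batches).
import numpy as np
from sklearn.linear_model import SGDClassifier

model = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])                 # must be declared up front
rng = np.random.default_rng(0)

for t in range(100):                       # each step: one new mini-batch arrives
    X_batch = rng.normal(size=(32, 20))
    y_batch = (X_batch[:, 0] > 0).astype(int)
    model.partial_fit(X_batch, y_batch, classes=classes)  # update, don't refit
```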

1.1.4.4 Transfer learning

Transfer learning is a form of learning in which a model is first trained on one problem and then used as the starting point for another task [94]. It is a good solution for problems where a task related to the main problem has abundant data while the main problem itself does not [95]. Transfer learning differs from multi-task learning in that the tasks are learned sequentially, whereas multi-task learning seeks good performance from a single model on all tasks simultaneously. An example is image classification, in which a predictive model such as an artificial neural network is trained on a broad set of images, and its weights are then used as the starting point when training on a smaller, more specific dataset, such as cats and dogs [94]. The features that the model has already learned on the broader task, such as extracting lines and patterns, are helpful for the new task.
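A minimal Keras transfer learning sketch follows (the ImageNet-pretrained ResNet50 backbone and the two-class head are illustrative choices, not the surveyed works' method):

```python
# Transfer-learning sketch: reuse an ImageNet-pretrained backbone and train
# only a new head for a 2-class task (illustrative; e.g., cats vs. dogs).
import tensorflow as tf
from tensorflow.keras import layers, Model

base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet",
                                      input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                       # freeze the pretrained features

inputs = tf.keras.Input(shape=(224, 224, 3))
features = base(inputs, training=False)
outputs = layers.Dense(2, activation="softmax")(features)  # new task head

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(small_specific_dataset, ...)  # only the head's weights are updated
```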

1.1.4.5 Ensemble learning

Ensemble learning is an approach in which two or more models are fit on the same data and their predictions are combined [96]. The goal of ensemble learning is to achieve better performance with the ensemble than with any individual model [97]. This involves deciding both how to construct the models used in the ensemble and how to combine the predictions of the ensemble members [98,99,100,101].

Ensemble learning is a useful approach for improving predictive performance on a problem and for reducing the variance of stochastic learning algorithms such as artificial neural networks. Bagging (bootstrap aggregation), weighted averaging, and stacking (stacked generalization) are some examples of common ensemble learning algorithms [103].
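A minimal sketch of one such ensemble, majority voting over three different classifiers (illustrative only; bagging and stacking follow the same fit-and-combine pattern):

```python
# Ensemble sketch: combine three different classifiers by majority vote
# (illustrative only).
from sklearn.datasets import load_digits
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=1000)),
    ("tree", DecisionTreeClassifier()),
    ("svm", SVC()),
])                                  # hard voting: the majority class wins
ensemble.fit(X, y)
print(ensemble.predict(X[:3]))
```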

2 Deep learning architectures

During the past twenty years, deep learning models have dramatically increased the type and number of problems that neural networks can solve [101]. Deep learning is not a single method but rather a class of algorithms and topologies that can be applied to a wide variety of problems [102, 103]. Connectionist architectures have existed for over 70 years, but modern architectures and GPUs (graphics processing units) have brought them to the forefront of artificial intelligence. Figure 2 shows the general architecture of neural networks [102].

Fig. 2 General architecture of neural network and deep learning

Although deep learning techniques are not new, the field is experiencing exponential growth thanks to the intersection of deeply layered neural networks and the use of GPUs to accelerate their execution. In this article, a comparison is made among different deep learning architectures [103, 104]. A general deep learning architecture contains the following layers: input layers, convolutional and fully connected layers, sequence layers, activation layers, normalization, dropout, and cropping layers, pooling and unpooling layers, combination layers, object detection layers, generative adversarial network layers, and output layers [101,102,103,104,105,106,107,108].

The hidden layer(s) are a network's secret sauce: their nodes/neurons allow it to model complex data. They are called hidden because their true values are not given in the training dataset; we only have access to the input and output [100,101,102,103,104,105,106,107,108,109,110]. Every neural network has at least one hidden layer. There is no rule that the number of hidden units must be a multiple of the number of inputs; the ideal number of hidden units may well be less than the number of inputs [111]. Many hidden units can be used when there are many training examples, but with little data, often just two hidden units will suffice.

2.1 Deep neural network (DNN)

In this architecture, at least two layers allow nonlinear complexities to be modelled, and both classification and regression can be carried out. The advantage of this model is its widespread use, owing to its high accuracy [104]. The drawback is that training is not easy, because the error signal propagated back to earlier layers becomes small, and the model's learning is correspondingly slow [105].

2.2 Convolutional neural network (CNN)

This model is best suited for 2D data. The network consists of convolutional filters that transform 2D data into 3D feature volumes; it is strong in performance and a rapid learning model. For classification, it needs a lot of labelled data [54, 69, 106]. However, CNNs face issues such as local minima, slow convergence rates, and intensive human intervention. After AlexNet's great success in 2012, CNNs have been increasingly used to enhance the efficacy of human clinicians in medical image processing [107].

2.3 Recurrent neural network (RNN)

RNNs have the ability to recognize sequences. The weights of the neurons are shared across all steps of the sequence. There are many variants, such as LSTM, BLSTM, MDLSTM, and HLSTM [110,111,112,113,114,115]. RNNs have achieved state-of-the-art accuracies in character recognition, speech recognition, and other natural language processing problems; by learning sequential events, they can model time dependencies [116]. The disadvantage is that training suffers from vanishing gradients, and the architecture needs big datasets [117].

2.4 Deep conventional-extreme learning machine (DC-ELM)

The deep convolutional extreme learning machine blends the strength of CNNs with the rapid training of ELMs. To effectively abstract high-level features from input images, it uses multiple alternating convolution and pooling layers [118]. The abstracted features are then fed to an ELM classifier, which leads to better generalization with faster learning speed [3]. In the last hidden layer, the DC-ELM applies stochastic pooling to significantly reduce the dimensionality of features, saving a lot of training time and computational resources [117].

2.5 Deep Boltzmann machine (DBM)

A DBM (deep Boltzmann machine) is a three-layer generative model. It is similar to a deep belief network but allows bidirectional connections in the bottom layers. Its energy function, an extension of the energy function of the RBM, is shown in Eq. (1).

$$ E = - \left( {\sum\limits_{i < j} {w_{ij} s_{i} s_{j} } + \sum\limits_{i} {\theta_{i} s_{i} } } \right) $$
(1)

In a DBM with N hidden layers, undirected connections exist between adjacent layers. Top-down feedback integrates ambiguous inputs for more accurate inference [119, 120]. Parameter optimization is difficult for big datasets.

2.6 Deep belief network (DBN)

Deep belief networks are fundamentally generative graphical models: all the potential values that can be generated for the current situation are represented. They combine probability and statistics with neural networks and machine learning [110]. Deep belief networks comprise several layers of units, with connections between the layers but not between the units within each layer. The essential objective is to help the machine categorize the data into different classes. The drawback of this architecture is that the initialization process makes training expensive [110, 112].

2.7 Deep autoencoder (DAN)

Applicable in the unsupervised learning process, the deep autoencoder is helpful for dimensionality reduction and feature extraction. Here the number of inputs equals the number of outputs [2]. The advantage of the model is that it does not need labelled data. Various kinds of autoencoders, such as the denoising autoencoder and the sparse autoencoder, extend the conventional autoencoder for robustness. A pre-training step is needed, and training can suffer from vanishing gradients [3,4,5].

Generally, an autoencoder [6] consists of both an encoder and a decoder, which can be defined as \(\Phi \) and \(\Psi \) as shown in Eq. (2).

$$ \begin{gathered} \Phi :X \to {\mathcal{F}};\quad \Psi :{\mathcal{F}} \to X \hfill \\ \Phi ,\Psi = \mathop {\arg \min }\limits_{\Phi ,\Psi } \left\| {X - \left( {\Psi \circ \Phi } \right)X} \right\|^{2} \hfill \\ \end{gathered} $$
(2)

2.8 Deep stacking networks (DSN)

A deep stacking network, also termed a deep convex network, is the final architecture considered here [7]. A deep stacking network differs from conventional deep learning systems in that, although it is a deep network, it is essentially a deep collection of individual networks, each with its own hidden layers [8]. This architecture is a response to one of the problems of deep learning, the difficulty of training: the complexity of training grows dramatically with each layer in a deep architecture, so the DSN treats training not as a single problem but as a series of individual training problems [9].

2.9 Long short-term memory/gated recurrent unit networks (LSTM/GRU)

The LSTM network was introduced in 1997 by Hochreiter and Schmidhuber, but it has grown in popularity in recent years as an RNN architecture for various applications [4, 5]. The LSTM departed from ordinary neuron-based neural network models by introducing the concept of a memory cell [5]. The memory cell can hold its value for a short or long time as a function of its inputs, which allows the cell to remember what is important and not just its last computed value [7]. In 2014, the gated recurrent unit was introduced as a refinement of the LSTM. This model has two gates, discarding the output gate present in the LSTM model [8]. For many applications, the GRU has performance similar to the LSTM, yet being simpler it requires fewer weights and executes faster [9].

The GRU combines two gates: an update gate and a reset gate. The update gate indicates how much of the previous cell content to keep; the reset gate defines how to combine the new input with the previous cell content [10]. A GRU can model a standard RNN simply by setting the reset gate to 1 and the update gate to 0. The different kinds of architectures used for a wide variety of applications are summarized in Table 1 [11].
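For reference, the standard GRU update can be written as follows (one common convention, chosen here to match the gate description above, where \(z_t\) is the update gate, \(r_t\) the reset gate, and \(\odot\) denotes element-wise multiplication; some formulations swap the roles of \(z_t\) and \(1 - z_t\)):

$$ \begin{gathered} z_{t} = \sigma \left( {W_{z} x_{t} + U_{z} h_{t - 1} } \right);\quad r_{t} = \sigma \left( {W_{r} x_{t} + U_{r} h_{t - 1} } \right) \hfill \\ \tilde{h}_{t} = \tanh \left( {Wx_{t} + U\left( {r_{t} \odot h_{t - 1} } \right)} \right) \hfill \\ h_{t} = z_{t} \odot h_{t - 1} + \left( {1 - z_{t} } \right) \odot \tilde{h}_{t} \hfill \\ \end{gathered} $$

With \(r_t = 1\) and \(z_t = 0\), the update collapses to \(h_t = \tanh (Wx_t + Uh_{t-1})\), i.e., a standard RNN.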

Table 1 Deep learning architecture with suitable application

3 Development frameworks

It is possible to implement these deep learning architectures from scratch, but doing so takes time, and the implementations then require time to be refined and to mature [12]. Fortunately, many open-source frameworks can be used to implement deep learning algorithms more quickly. These frameworks support Java, the R programming language, Python, and C/C++ [13].

TensorFlow It began as an internal Google project called Google Brain in 2011. An open-source deep learning framework whose networks can run across multiple CPUs and GPUs, it was released publicly in 2015, with version 1.0 following in 2017 [2,3,4]. It is used for training neural networks, analogous to human learning and reasoning, to identify and decode patterns and relationships. It provides C++ and Python interfaces [2,3,4,5].

Caffe It was developed by Berkeley AI Research (BAIR) and released in 2014. It provides Python and C++ interfaces and became popular in academic research. It is a deep learning framework built around convolutional networks [4].

Caffe2 Facebook launched it in 2017 as a commercial successor to Caffe. It was designed to resolve Caffe's scalability problems and to make it lighter [4, 5]. It supports distributed computing, deployment, and quantized computation. It provides Python and C++ interfaces [6].

ONNX Facebook and Microsoft announced the Open Neural Network Exchange in September 2017 [2,3,4]. ONNX is an interchange format intended to make it easy to move deep learning models between the frameworks used to build them [5]. This initiative could make it easier for developers to work across different frameworks [4, 5].

Torch Torch is an open-source machine learning library, a scientific computing framework, and a scripting language based on the Lua programming language [6]. Torch is used by IBM, Yandex, the Idiap Research Institute, and the Facebook AI Research Group. Facebook has published a set of extension modules, as well as PyTorch, as open-source software [4, 5].

Keras It is a high-level library that can run on top of Theano and TensorFlow [4, 5], acting as an interface to them. While not as low-level as other frameworks, Keras is especially famous for rapid development. With its prebuilt popular networks, layers, and evaluation utilities, it has been described as an entry point for users new to deep learning.
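As a flavour of that rapid development, a small CNN classifier can be declared in a few lines (an illustrative example; the input size and class count are assumptions):

```python
# Minimal Keras sketch: a small CNN image classifier declared in a few lines
# (illustrative input size and class count).
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(64, 64, 1)),         # e.g., a 64x64 grayscale image
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),    # 10 output classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```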

MatConvNet It is a commonly used deep learning library for MATLAB [6].

Theano It is a Python library for implementing deep learning [5], developed and promoted by MILA at the University of Montreal.

Deeplearning4j Deeplearning4j is a popular deep learning framework that focuses on Java technology but also includes application programming interfaces for other languages, including Clojure, Scala, and Python [6]. Distributed under the Apache license, the platform offers support for RNNs, RBMs, CNNs, and DBNs. Deeplearning4j also provides distributed parallel versions that work with the big-data processing frameworks Apache Hadoop and Spark [15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43]. It has been applied to several problems, including fraud detection in the financial sector, recommender systems, image recognition, and cybersecurity (network intrusion detection). The framework integrates with CUDA for GPU optimization and can be distributed using OpenMP and Hadoop [20].

Distributed deep learning IBM Distributed Deep Learning (DDL), nicknamed the "jet engine of deep learning", is a library that interfaces with leading frameworks such as TensorFlow and Caffe. DDL can be used across clusters of servers and many GPUs to accelerate deep learning computations [35]. By indicating optimal paths that the intermediate data should take between GPUs, DDL streamlines the communication of neuron computations [37]. Deep learning libraries from Microsoft and others, including MXNet, Microsoft Cognitive Toolkit, PaddlePaddle, scikit-learn, MATLAB, Pandas, NumPy, cuDNN, NVIDIA TensorRT, NVIDIA DIGITS, and Jupyter Notebook, are other popular libraries, frameworks, and tools among developers [38].

4 Process involved in medical image analysis

Medical image computing is closely associated with the field of medical imaging, but it focuses on the computational analysis of images, not their acquisition [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76]. The techniques can be grouped into several broad categories: image segmentation, image registration, image-based physiological modelling, and others. These techniques are discussed in the following sections. Figure 3 shows the processes involved in medical image analysis [96].

Fig. 3 Medical image analysis

4.1 Deep learning networks based on goals and architecture type

Image analysis uses both processing steps for quantitative measurements and abstract representations of the medical images [96, 97]. These measures require prior knowledge of the images' meaning and content, which must be incorporated into the algorithms at a high degree of abstraction. Figure 4 shows the taxonomy of the literature review [96].

Fig. 4 Taxonomy of literature review

4.1.1 Image registration

Goal Find a coordinate transformation to align different images of the same object, such as an MRI and a CT scan, or two scans taken at different points in time; the general deep learning methods used here are deep regression networks and deep neural networks (DNNs) [3, 6]. Medical image analysis methods can be grouped into the categories shown in Fig. 4, and image registration is one of these analysis methods [9]. Image registration, otherwise called image mapping, fusion, or warping, can be described as the process of aligning two or more images; the purpose of an image registration framework is to find the optimal transformation of the image data [12]. Image registration converts different image datasets into one matched coordinate system with matched imaging content, which has significant implications in the medical field [10, 11]. Registration is a critical stage of image processing whenever useful information is carried by more than one image: images acquired at different times, from different viewpoints, or by different sensors must be harmonized, so the exact fusion of useful data from two or more images is critical [10]. DeepFLASH is a new network for learning-based image registration with efficient training and inference [12]. Unlike established approaches, it learns spatial transformations from training data entirely in a low-dimensional band-limited space, which significantly decreases the computational cost and memory footprint of expensive training and inference [13]. To achieve this, complex-valued operations and representations were introduced into the neural architecture, providing key components for learning-based registration models, together with an explicit loss function on transformation fields fully characterized in a band-limited space with far fewer parameters [14].

Medical image registration is used in various clinical applications (Hiba A. Mohammed), including image fusion (Fatma El-Zahraa Ahmed El-Gamal, Mohammed Elmogy, Ahmed Atwan), learning-based image registration (Jian Wang, Miaomiao Zhang; Grant Haskins, Uwe Kruger, Pingkun Yan), and image reconstruction [11]. The registration of medical images is an enormous topic that can be categorized from different perspectives. From the input-image perspective, registration approaches may be divided into interpatient, intrapatient (e.g., same-day or different-day), multimodal, and unimodal registration [12]. From the perspective of the deformation model, methods can be characterized as rigid, affine, and deformable [13]. Registration techniques can also be grouped by region of interest (ROI) according to anatomical sites such as brain, liver, and lung. From the viewpoint of the image pair dimension, registration strategies can be split into 3D-to-3D, 3D-to-2D, and 2D-to-2D/3D [14]. Deep learning-based image registration methodologies can be divided, according to their techniques, features, and popularity, into seven groups: (1) reinforcement learning-based strategies, (2) deep similarity-based strategies, (3) supervised transformation prediction, (4) unsupervised transformation prediction, (5) generative adversarial networks in medical image registration, (6) deep learning used to validate registrations, and (7) other learning-focused strategies [15,16,17].

There are many freely available tools [19] and toolkits for medical image registration, such as ANTs [24] and SimpleITK [25]. These classical methods typically register images by iteratively updating the transformation parameters until a pre-defined metric of consistency is optimized [27]. Such techniques achieve respectable accuracy; nevertheless, they are limited by slow registration procedures [29].
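The sketch below illustrates this classical iterative scheme with SimpleITK (the file paths, the metric, and the optimizer settings are illustrative assumptions):

```python
# Iterative registration sketch with SimpleITK (illustrative settings):
# align a moving image to a fixed image by optimizing a similarity metric.
import SimpleITK as sitk

fixed = sitk.ReadImage("fixed_ct.nii.gz", sitk.sitkFloat32)    # hypothetical paths
moving = sitk.ReadImage("moving_mri.nii.gz", sitk.sitkFloat32)

reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)  # multimodal metric
reg.SetOptimizerAsRegularStepGradientDescent(learningRate=1.0,
                                             minStep=1e-4,
                                             numberOfIterations=200)
reg.SetInitialTransform(sitk.CenteredTransformInitializer(
    fixed, moving, sitk.Euler3DTransform()))   # rigid 3D-to-3D registration
reg.SetInterpolator(sitk.sitkLinear)

transform = reg.Execute(fixed, moving)         # iteratively updates parameters
aligned = sitk.Resample(moving, fixed, transform, sitk.sitkLinear)
```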

4.1.2 Object localization

Goal: Recognize where organs, landmarks, or other objects are located in space (2D and 3D) or in time (video/4D); the general deep learning method used here is to identify the region of interest by running separate CNNs on each 2D plane of a 3D image [18]. Localization [19] of anatomical structures is a fundamental prerequisite for many medical image analysis tasks. Medical image analysis methods can be grouped into the categories shown in Fig. 4, and localization is one of these analysis methods [20]. Localization may be a hassle-free operation for the radiologist, but it is typically a challenging job for neural networks, which are susceptible to variations in medical images caused by differences in the image acquisition process, in anatomy, and in pathology between patients. In medical image processing [13], the localization of anatomical structures is necessary for several activities. A technique has been proposed for automatically localizing one or more anatomical structures in 3D medical images by detecting their presence in 2D image slices using a ConvNet [14, 15]. The ConvNet is trained to detect the anatomical structure of interest in axial, coronal, and sagittal slices extracted from a 3D image; to allow the ConvNet to examine slices of different sizes, spatial pyramid pooling is used [24]. After detection, 3D bounding boxes are generated by combining the outputs of the ConvNet across all slices. In the experiments, 100 abdominal CT scans, 200 chest CT scans, and 100 cardiac CTA (angiography) scans were used [25]. The ascending aorta, aortic arch, descending aorta, and left cardiac ventricle were localized in chest CT, the left cardiac ventricle in cardiac CTA, and the liver in abdominal CT [10,11,12,13]. Localization was evaluated by measuring the distances between automatically and manually defined reference bounding-box centroids and walls. The best results were obtained for structures with well-defined boundaries, such as the aortic arch, and the worst for structures whose borders are not clearly visible, such as the liver [14]. A novel localization strategy has also been proposed, applicable mainly to medical images in which objects can be distinguished from the background chiefly by feature differences; it uses a CRF framework at the global and structural levels with contrast and value potentials that represent the wider contextual information between regions [15]. A sparse-coding-based classification solution for region-of-interest detection with discriminative dictionaries was also suggested for more precise location labelling [16]. The localization technique was evaluated on two medical imaging applications, lesion discrimination in thoracic PET-CT images and cell segmentation in microscopy images, and those evaluations show improvements compared with recently reported approaches [17].

4.1.3 Classification and detection

4.1.3.1 Exam classification

Goal: Categorize a diagnostic exam image as disease absent/present or normal/abnormal; the general deep learning technique here is the CNN (convolutional neural network), in particular CNNs pre-trained on natural images [95]. Medical image analysis methods can be grouped into the categories shown in Fig. 4; image classification and detection are among these methods [96].

4.1.3.2 Object classification

Goal: Classify a pre-identified object (such as a chest CT nodule) into one of two or more classes; the general deep learning method here is the multi-stream CNN, and additional methods are SAEs (sparse autoencoders), RBMs (restricted Boltzmann machines), and CSAs (convolutional sparse autoencoders) [15,16,17].

4.1.3.3 Classification algorithms

Classification algorithms have a simple goal: we predict the target data class by analysing the training dataset [16,17,18,19,20]. We use the training dataset to derive boundary conditions that can be used to determine each target class. Once these boundary conditions are determined, the next task is to predict the target data class for new data [21,22,23,24,25]. The whole cycle is referred to as the classification technique.

4.1.3.4 Algorithm types related to classification

Figure 5 shows the various kinds of algorithms used in the classification process [38]. The most important distinction between regression and classification is that regression predicts continuous quantities, whereas classification predicts discrete class labels. The two types of machine learning algorithms nevertheless share certain similarities as well as distinctions [39, 40].

Fig. 5 Classification algorithms

Deep learning can recognize patterns in visual inputs and determine class labels for an image [30,31,32,33]. A convolutional neural network (CNN), or specialized CNN frameworks such as AlexNet, VGG, Inception, and ResNet, are the most popular deep learning architectures used for image processing. Figure 6 shows the evolution of deep learning techniques [35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54].

Fig. 6 Evolution in deep learning techniques

4.1.3.5 Essential terminologies in classification algorithms

Classifier A classifier is an algorithm that assigns a specific category to the input data it receives [10].

Classification model A classification model predicts the class labels and categories of new data [10, 11].

Feature A feature is an individually measurable, observable attribute of the phenomenon being observed [10,11,12,13,14,15].

Binary classification Binary classification is a classification task with two possible outcomes; classifying gender as either female or male is an example [10,11,12,13,14,15].

Multi-class classification Classification with more than two classes is referred to as multi-class classification. In multi-class classification, each example is assigned to exactly one target class label; for example, an animal may be a dog or a cat, but not both at once [17].

Multi-label classification A classification task in which each example may be assigned a set of target labels (more than one class) is known as multi-label classification; for example, a news article may be about sports, a person, and a place all at once [13,14,15,16,17,18,19,20].

The classification of diseases [19] using deep learning techniques on medical images has gained a lot of traction over the last few years. For neuroimaging, the main focus of 3D deep learning has been on identifying diseases from anatomical images [21]. Several investigations have sought to identify dementia and its variants from various imaging modalities such as functional magnetic resonance imaging and DTI [22]. AD (Alzheimer's disease) is the most widely recognized form of dementia, typically connected with pathological amyloid deposits, structural atrophy, and metabolic variations in the brain [23]. Timely diagnosis of Alzheimer's disease plays a significant role in blocking the progression of the disease.

For analysis and disease diagnosis [16], radiologists may need to consult medical archives for similar clinical cases. It is difficult to retrieve relevant clinical cases automatically, reliably, and accurately across a variety of diseases and imaging modalities from substantial medical image collections [17]. A powerful and reliable method for medical image classification, modality classification, was used in one study to extract clinical data from vast medical repositories. The method combines the transfer learning principle with a pre-trained ResNet50 model for optimized feature extraction, together with TLRN-LDA (linear discriminant analysis) for classification [18]. On the ImageCLEF benchmark (a 31-class image dataset), the developed technique gives 88% average classification accuracy, which is up to 11% higher than current state-of-the-art methods on the same image datasets [19]. Furthermore, hand-crafted features were extracted for comparison in this study [19, 20].

Transfer learning [17] is a viable mechanism that can provide a promising solution by transferring knowledge from generic object recognition tasks to domain-specific tasks. A deep convolutional neural network called DeTraC (Decompose, Transfer, Compose) was used for the classification of COVID-19-positive cases from chest X-ray data [18]. DeTraC can deal with any anomalies in the image dataset by investigating its class boundaries using a class decomposition mechanism [19]. The experimental results showed DeTraC's ability to detect COVID-19-positive cases in complete image datasets collected from several hospitals around the world. An accuracy of 94% (with a true-positive rate of 100%) was achieved by DeTraC in the detection of COVID-19-positive X-ray images from normal cases and severe acute respiratory syndrome cases [20].

Hierarchical classification was performed using the HMIC (hierarchical medical image classification) method [18]. This uses stacks of deep learning models to provide specialized comprehension at each level of the medical image hierarchy [19, 20]. For validation and testing, medical images were categorized into three classes at the parent level (histologically normal controls, celiac disease, and environmental enteropathy) [21]. At the child level, four classes (I, IIIa, IIIb, and IIIc) of celiac disease severity were arranged [22].

Table 2 summarizes previous research results associated with image classification [50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80]. The different accuracy levels produced by several deep learning architectures are described in Fig. 7 [55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85].

Fig. 7 Deep learning algorithms with accuracy

Table 2 Summary of previous research works associated with COVID-19 classification and detection

4.1.4 Object detection

Goal: Identify a lesion or other interesting object inside an image; the general deep learning methods here are the CNN and the multi-stream CNN [85].

4.1.5 Segmentation

4.1.5.1 Substructure/organ segmentation

Goal: Identify the contour of an organ or structure of interest, enabling quantitative analysis of volume and shape, such as for the heart or brain; the general deep learning methods are RNNs (recurrent neural networks), CNNs, and fCNNs (fully convolutional neural networks) [90, 91]. Medical image analysis methods can be grouped into the categories shown in Fig. 4, and image segmentation is one of these analysis methods [92].

4.1.5.2 Lesion segmentation

Goal: Combine object detection and organ/substructure segmentation; the general deep learning method is the multi-stream CNN [5,6,7].

Medical image segmentation [7] plays an essential role in numerous computer-aided diagnostic applications. New medical image processing algorithms are being applied thanks to enormous investment and the advancement of imaging; microscopy, ultrasound, computed tomography (CT), dermoscopy, magnetic resonance imaging (MRI), positron emission tomography, and X-ray are examples of medical imaging modalities [8]. Medical image segmentation detects structures in 3D (three-dimensional) or 2D (two-dimensional) image data automatically or semi-automatically. Image segmentation is the process by which a digital image is partitioned into multiple regions of pixels; the primary objective of segmentation is to make the medical image representation clearer and more meaningful. Because of the high variability in the images, segmentation is a hard job [19]. In recent times, AI and machine learning algorithms have been assisting radiologists in the segmentation of medical images, for example breast cancer mammograms, brain tumours, brain lesions, skull stripping, and so on. Beyond focusing on specific regions in the medical image, segmentation also helps expert radiologists with quantitative evaluation and with planning further treatment [20]. Several researchers have contributed to the use of 3D convolutional neural networks in medical image segmentation [21].
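As an illustration of the fully convolutional approach mentioned above, the following is a minimal U-Net-style encoder-decoder sketch in Keras (the layer sizes, input shape, and binary lesion/background task are assumptions for exposition, not a surveyed model):

```python
# Fully convolutional segmentation sketch: a small U-Net-style encoder-decoder
# producing a per-pixel mask (illustrative sizes; e.g., lesion vs. background).
import tensorflow as tf
from tensorflow.keras import layers, Model

inputs = tf.keras.Input(shape=(128, 128, 1))           # e.g., one MRI slice
# Encoder: capture context.
c1 = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
p1 = layers.MaxPooling2D()(c1)
c2 = layers.Conv2D(32, 3, padding="same", activation="relu")(p1)
# Decoder: recover resolution, with a skip connection from the encoder.
u1 = layers.UpSampling2D()(c2)
m1 = layers.concatenate([u1, c1])
c3 = layers.Conv2D(16, 3, padding="same", activation="relu")(m1)
outputs = layers.Conv2D(1, 1, activation="sigmoid")(c3)  # per-pixel mask

model = Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```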

5 Trends and challenges

With the various CNN-based deep neural networks that have been developed, significant results have been achieved on the ImageNet Challenge, the most significant image classification and segmentation challenge in the image analysis field. The key benefit of CNNs over their predecessors is that they identify essential features without the need for human intervention [27]. The classification models discussed in Fig. 6 produce different performance metrics, such as precision, sensitivity/recall, specificity, accuracy, F-measure, and receiver operating characteristic (ROC) curves; based on these performance metrics, we can evaluate the best CNN model [28].

The issues of imbalanced data, lack of confidence intervals, and lack of properly annotated data come up so often in the recent deep learning medical imaging literature that it is fair to call them the fundamental challenge the medical imaging field currently faces in fully exploiting deep learning advances [29]. The numbers of samples and patients in the public databases currently available for medical imaging tasks are limited, with few exceptions. Medical imaging datasets are very small compared to datasets for general computer vision problems, which usually range from a few hundred thousand to millions of annotated images [30]. On the other hand, there is a growing trend in the medical imaging community to follow the practices of the wider pattern recognition community and learn deep models end-to-end. The wider community, however, has typically embraced such practices on the basis of large-scale annotated datasets, a crucial prerequisite for inducing accurate deep models [31]. As a result, it is still unclear how well end-to-end trained models can perform medical image analysis tasks without over-fitting to the training datasets. Elementary data augmentation techniques have been developed, such as image flipping, padding, principal component analysis (PCA)-based augmentation, image cropping, and adversarial training; however, these techniques are not as advanced as GANs for enhancing datasets [32,33,34,35,36].
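For reference, such elementary augmentation can be expressed in a few lines (an illustrative Keras sketch; the specific transforms and magnitudes are assumptions):

```python
# Data-augmentation sketch: elementary transforms applied on the fly
# (illustrative Keras preprocessing layers: flips, shifts, crop-like zoom).
import tensorflow as tf
from tensorflow.keras import layers

augment = tf.keras.Sequential([
    layers.RandomFlip("horizontal"),     # image flipping
    layers.RandomTranslation(0.1, 0.1),  # padding/shifting effect
    layers.RandomZoom(0.1),              # crop-like scale jitter
    layers.RandomRotation(0.05),
])

# Applied per batch during training, e.g.:
# ds = ds.map(lambda x, y: (augment(x, training=True), y))
```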

Another major obstacle is the use of black boxes: the legal ramifications of black-box functionality could be a deterrent, since healthcare professionals may not rely on it. Who would be held liable if the outcome were unfavourable? Due to the sensitivity of this area, a hospital might be hesitant to use a black-box system that cannot explain how a specific result was reached [35]. Opening the black box is a major research subject, and deep learning scientists are working to solve it [36].

Furthermore, due to complex data structures, training deep learning models is an extremely expensive endeavour; they sometimes require high-end GPUs and hundreds of machines, which drives up the cost for users [43].

Since the increased complexity of many layers imposes a high computational load, training performance suffers. Enhanced activation functions, cost function design, and dropout approaches have all been used to combat vanishing gradients and over-fitting [47]. The problem of high computational load has been addressed using highly parallel hardware, such as GPUs, together with batch normalization. Deep learning architectures and their corresponding applications, from the early days to the present, are listed in Table 1 [48]. The availability of vast volumes of electronic medical record data makes the development of an interdisciplinary data pool possible. Machine learning extracts information from large amounts of data and generates output that can be used for individual outcome prediction and clinical decision-making [16]. This could pave the way for personalized medicine (also known as precision medicine), in which each person's genetic, environmental, and lifestyle factors are considered for disease prevention, treatment, and prognosis [17].

6 Conclusion

Medical imaging is a key technology that bridges scientific and societal needs and can provide important synergies that may contribute to advances in each of these areas. Our survey has illuminated the current state of the art based on the recent scientific literature from 120 medical imaging research papers, which may be beneficial to radiologists worldwide. In addition to finding that the ResNet architecture typically has the highest performance, we also covered the current challenges, major issues, and future directions.