1 Introduction

Artificial intelligence encompasses a wide range of algorithms (Yang et al. 2023; LeCun et al. 1998; Krizhevsky et al. 2012; He et al. 2016) and modeling tools (Sutskever et al. 2014) for large-scale data processing tasks. The emergence of massive data sets and deep neural networks has provided elegant solutions in various fields, and the academic community has begun to explore the application of AI to traditional disciplines. The objective is to promote the development of AI while expanding the possibilities of traditional analytical modeling (Hsieh 2009; Ivezić et al. 2019; Karpatne et al. 2017, 2018; Kutz 2017; Reichstein et al. 2019). Realizing general artificial intelligence is a goal that humans have long pursued. Although AI has made considerable progress in the past few decades, general machine intelligence and brain-like intelligence remain difficult to achieve Jiao et al. (2016).

At present, researchers are beginning to explore the field of “AI + Physics” (Muther et al. 2023; Mehta et al. 2019). The objectives of current research are: (1) use findings from physical science and artificial intelligence to investigate the principles governing brain learning; (2) use AI to facilitate the advancement of physics; (3) apply physical science to inform the development of novel AI paradigms. We selectively review research at the intersection of AI and the physical sciences, including AI concepts and algorithms driven by physical insights, the application of AI techniques across multiple fields of physics, and the interplay between the two (Zdeborová 2020; Meng et al. 2022).

Physics Physics is a natural science that plays a heuristic role in our cognition of the objective world. It focuses on the study of matter, energy, space, and time, especially their respective properties and the relationships between them. Broadly speaking, physics explores and analyzes the phenomena that occur in nature in order to understand its rules; Engel (2001), for instance, describes the theoretical progress made on neural networks within statistical physics. Over its long history, physical knowledge (priors) has been collected, verified, and consolidated into practical theories; it is a simplified induction of the laws of nature and human behavior that underpins many important disciplines and engineering applications. If such prior knowledge and AI are properly combined, richer and more effective feature information can be extracted from scarce data sets, which helps to improve the generalization ability and interpretability of network models Meng et al. (2022).

Artificial intelligence Artificial intelligence is a discipline that researches and develops theories and application systems for simulating and extending human brain intelligence. Its purpose is to enable machines to simulate intelligent human behavior (such as learning, reasoning, thinking, and planning) Widrow and Lehr (1990), so that machines possess intelligence and can complete “complex work”. Today, artificial intelligence is widely valued in the computer field and is an interdisciplinary subject, with models spanning machine vision (Krizhevsky et al. 2012; Heisele et al. 2002), natural language processing Devlin et al. (2018), psychology (Rogers and Mcclelland 2004; Saxe et al. 2018), pedagogy Piech et al. (2015), and other disciplines Khammash (2022). The convergence of the physical sciences and deep learning offers exciting prospects for theoretical science, providing valuable insights into the learning and computational capabilities of deep networks.

Relationship The development of physics is a simplified induction of nature, which promotes research on brain-like science in artificial intelligence. Conversely, a brain that perceives “experience” operates close to the so-called “physical sense”, and physics opens new avenues and provides new tools for current artificial intelligence research McCormick (2022). To some extent, artificial intelligence models and physical models can both share information and predict the behavior of complex systems Tiberi et al. (2022); that is, they share certain methods and goals, but their implementations differ. Physics seeks to understand natural mechanisms, using prior knowledge Niyogi et al. (1998), regularity, and inductive reasoning to inform models, while model-agnostic AI provides “intelligence” Werner (2013) through data extraction.

Main contributions Based on these analyses, this study aims to provide a comprehensive review and classification of the field of physics-inspired AI deep learning (Fig. 1) and summarize potential research directions and open questions that need to be addressed urgently in the future. The main contributions of this paper are summarized as follows:

  1. Comprehensiveness and readability. This article comprehensively reviews over 400 works on physical science ideas and physics-inspired deep learning algorithms. It summarizes existing physics-inspired learning and modeling research from four aspects: classical mechanics, electromagnetism, statistical physics, and quantum mechanics.

  2. Inspirational. The article summarizes the latest progress in applying artificial intelligence techniques to physical science problems. Finally, we analyze the outlooks and implications between AI and physics for the new generation of deep learning algorithms.

  3. In-depth analysis. This article reviews open questions that need to be addressed to facilitate future research and exploration.

In this review, we attempt to provide a coherent overview of the different intersections of deep learning and physics. The rest of the paper is organized as follows: Chapter 2 presents artificial intelligence algorithms inspired by the perspective of classical mechanics and how AI can solve physical problems. Chapter 3 briefly reviews electromagnetics-inspired artificial intelligence algorithms and the applications of AI in electromagnetics. Chapters 4 and 5 provide an overview of AI algorithms and applications inspired by statistical physics and quantum mechanics, respectively. Chapter 6 explores potential applications and challenges currently facing the intersection of AI and physics. Chapter 7 concludes the paper.

Fig. 1

This article presents an overall taxonomic structure for physics-inspired AI paradigms. We mainly outline the main contents of AI deep algorithms inspired by classical mechanics, electromagnetism, statistical physics and quantum mechanics

2 Deep neural network paradigms inspired by classical mechanics

In this section, we briefly introduce manifolds, graphs, and fluid dynamics in geometric deep learning, as well as the basics of Hamiltonian/Lagrangian formulations and differential equation solvers in dynamic neural network systems. We then explain the related work these ideas have inspired, and finally introduce deep learning methods in which graph neural networks solve physical problems. We summarize the structure of this section and an overview of representative methods in Table 1.

Table 1 An overview of methods for AI DNNs inspired by classical mechanics

2.1 Geometric deep learning

Deep learning simulates the symmetry of the physical world (the invariance of the laws of physics under various transformations). From the invariance of a physical law, an unchanging physical quantity can be obtained, called a conserved quantity or invariant; for example, the universe follows translational/rotational symmetry, which yields conservation of momentum. Momentum conservation is the embodiment of the uniformity of space, explained by mathematical group theory: space has translational symmetry, so after a spatial translation of an object, its physical motion and the related physical laws remain unchanged. In the 20th century, Emmy Noether proved Noether’s theorem, which states that every continuous symmetry corresponds to a conservation law; for relevant formulations see (Torres 2003, 2004; Frederico and Torres 2007) and references therein, and related applications are shown in Fig. 2.
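As a concrete illustration (written here in standard Lagrangian notation rather than taken from the cited references), translation invariance of a Lagrangian \(L(q, \dot{q})\) yields momentum conservation directly from the Euler-Lagrange equation:

```latex
% If L(q + \varepsilon, \dot q) = L(q, \dot q) for all shifts \varepsilon
% (translation invariance), then L does not depend on q, so
\frac{\partial L}{\partial q} = 0
\;\Longrightarrow\;
\frac{d}{dt}\frac{\partial L}{\partial \dot{q}}
  = \frac{\partial L}{\partial q} = 0
\;\Longrightarrow\;
p \equiv \frac{\partial L}{\partial \dot{q}} = \mathrm{const.}
```

That is, the momentum conjugate to \(q\) is conserved; time-translation invariance analogously yields conservation of energy.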

Fig. 2

Related applications of Noether’s theorem

The translation invariance, locality, and compositionality of Convolutional Neural Networks (CNNs) make them naturally suitable for tasks dealing with Euclidean-structured data such as images. However, much real-world data is non-Euclidean, and Geometric Deep Learning (GDL) Gerken et al. (2023) emerged to handle it: from the perspective of symmetry and invariance, it studies the design of deep learning architectures for non-Euclidean data structures Michael (2017). The term was first proposed by Michael Bronstein in 2016, and GDL attempts to generalize (structured) deep networks to non-Euclidean domains such as graphs and manifolds. The data structures are shown in Fig. 3.

Fig. 3

Euclidean/Non-Euclidean data structures

2.1.1 Manifold neural networks

A manifold is a space with local Euclidean properties and is used in mathematics to describe geometric shapes, such as the spatial coordinates of object surfaces returned by radar scans. A Riemannian manifold is a differentiable manifold equipped with a Riemannian metric, a concept from differential geometry; simply put, it is a smooth manifold given a smooth, symmetric, positive-definite second-order tensor field. For example, in physics, the phase space of classical mechanics is an instance of a manifold, as is the four-dimensional pseudo-Riemannian manifold underlying the space-time model of general relativity.

Often, manifold data have richer spatial information, such as magnetoencephalography on a sphere Defferrard et al. (2020) and human scan data (Armeni et al. 2017; Bogo et al. 2014), which contain local structures and spatial symmetries Meng et al. (2022). At present, a new type of manifold convolution has been introduced into the physics-informed manifold (Masci et al. 2015; Monti et al. 2017; Boscaini et al. 2016; Cohen et al. 2019; De Haan et al. 2020) to make up for the defect that convolutional neural networks cannot fully utilize spatial information.

Manifold learning is a large class of manifold-based frameworks; recovering low-dimensional structure is often referred to as manifold learning or nonlinear dimensionality reduction, an instance of unsupervised learning. Examples include: (1) the multidimensional scaling (MDS) family Tenenbaum et al. (2000), which focuses on preserving “similarity” (usually Euclidean distance) information from the high-dimensional space; (2) the local linear embedding (LLE) algorithm Roweis and Saul (2000), which preserves the local linear features of the samples during dimensionality reduction, abandoning a globally optimal reduction over all samples; (3) the stochastic neighbor embedding (t-SNE) algorithm Maaten and Hinton (2008), which uses the heavy-tailed t distribution to avoid the crowding and optimization problems, but is only suitable for visualization and cannot perform feature extraction; (4) the Uniform Manifold Approximation and Projection (UMAP) algorithm McInnes et al. (2018), built on the theoretical framework of Riemannian geometry and algebraic topology; like t-SNE, UMAP is only suitable for visualization, and the performance of both is determined by the choice of initialization (Kobak and Linderman 2019, 2021); (5) spectral embeddings such as the Laplacian eigenmap, a graph-based dimensionality reduction algorithm that constructs relationships between data points from a local perspective, so that points related to each other (connected in the graph) remain as close as possible after dimensionality reduction and the original data structure is maintained Belkin and Niyogi (2003); (6) the diffusion map method Wang (2012), which uses a diffusion process to construct the data kernel and is also a nonlinear dimensionality reduction algorithm; (7) the deep model of Hadsell et al. (2006), which learns a globally consistent nonlinear function (an invariant map) that evenly maps the data to a low-dimensional output manifold. Cho et al. (2024) proposed a Gaussian manifold variational autoencoder (GM-VAE) that addresses common limitations previously reported in hyperbolic VAEs. Katsman et al. (2024) studied ResNet and showed how to extend this structure to general Riemannian manifolds in a geometrically principled way.
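For concreteness, the classical MDS idea from item (1) can be written in a few lines of NumPy: double-center the squared distance matrix and embed with the top eigenvectors. This is a generic sketch, not code from the cited works; the collinear toy points are an illustrative example.

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical MDS: embed points so that pairwise Euclidean
    distances approximate the given distance matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                   # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]              # keep the top-k components
    scale = np.sqrt(np.maximum(w[idx], 0.0))
    return V[:, idx] * scale                   # n x k embedding

# Collinear points in 3-D: a 1-D embedding should preserve all spacings.
X = np.array([[0., 0., 0.], [1., 1., 1.], [2., 2., 2.]])
D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
Y = classical_mds(D, k=1)
D_hat = np.abs(Y - Y.T)   # pairwise distances recovered from the embedding
```

Because the toy points lie exactly on a line, one embedding dimension suffices and the recovered distances match the originals.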

2.1.2 Graph neural networks

Another type of non-Euclidean geometric data is the graph: network-structured data composed of nodes and edges, such as social networks. The concept of the graph neural network (GNN) was first proposed by Gori et al. (2005) to extend existing neural networks to handle more types of graph data, and was then further developed by Scarselli et al. (2008). In 2020, Wu et al. (2020) proposed a new taxonomy providing a comprehensive overview of GNNs in data mining and machine learning, and Zhou et al. (2020) proposed a general design process for GNN models and systematically classified and reviewed their applications. Working in the context of spectral graph theory, Defferrard et al. (2016) first extended the convolution and pooling operations of CNNs to graph-structured data: the input is a graph together with a signal on the graph, and the output is a feature on each node.

The graph convolutional network (GCN) is a foundational work on GNNs. It uses a semi-supervised learning method to approximate the convolution kernel of the original graph convolution operation, improving the original graph convolution algorithm Kipf and Welling (2016), as shown in Fig. 4. For the application of GCNs in recommender systems, refer to Monti et al. (2017). Graph convolutional networks are the basis for many complex graph neural network models, including autoencoder-based models, generative models, and spatiotemporal networks. Inspired by physics, Schuetz et al. (2022) used graph neural networks to solve combinatorial optimization problems in Nature Machine Intelligence. To address the heavy computational cost of GCNs, Xu et al. (2019) proposed the graph wavelet neural network (GWNN), which uses the graph wavelet transform to reduce the amount of computation.
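A minimal NumPy sketch of the GCN propagation rule popularized by Kipf and Welling — the symmetrically normalized adjacency with self-loops, \(H' = \sigma(\hat{D}^{-1/2}\hat{A}\hat{D}^{-1/2} H W)\) — is shown below; the toy path graph, features, and weights are invented for illustration.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN propagation step:
    H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W)."""
    A_hat = A + np.eye(A.shape[0])             # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy graph: 4 nodes in a path, 3 input features, 2 output channels.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.normal(size=(4, 3))                    # node feature matrix
W = rng.normal(size=(3, 2))                    # learnable weights
H_next = gcn_layer(A, H, W)                    # shape (4, 2), ReLU-nonnegative
```

Each output row mixes a node's features with those of its neighbors, which is what lets stacked layers propagate label information for semi-supervised classification.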

Fig. 4

a Schematic depiction of GCN for semi-supervised learning with C input channels and F feature maps in the output layer. The graph structure (edges shown as black lines) is shared over layers, labels are denoted by \({Y_i}\). b Visualization of hidden layer activations of a two-layer GCN. Colors denote document class Kipf and Welling (2016)

The Graph Attention Network (GAT) is a spatial graph convolutional network that brings the attention mechanism from natural language processing to learning on graph-structured data. The attention mechanism is used to determine the weights of a node’s neighborhood, resulting in more effective feature representations Veličković et al. (2017), and is suitable for both (graph-based) inductive and transductive learning problems. A related graph attention model proposes a recurrent neural network that solves graph classification by adaptively visiting a sequence of important nodes.
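The attention coefficients of a single GAT head can be sketched as follows. The LeakyReLU slope of 0.2 follows common practice; the toy graph and random parameters are illustrative assumptions, not from the cited paper.

```python
import numpy as np

def gat_attention(A, H, W, a):
    """Attention coefficients of one GAT head:
    e_ij = LeakyReLU(a^T [W h_i || W h_j]), softmax over neighbors j."""
    Z = H @ W                                    # transformed node features
    n = A.shape[0]
    e = np.full((n, n), -np.inf)                 # -inf masks non-neighbors
    for i in range(n):
        for j in range(n):
            if A[i, j] or i == j:                # neighbors incl. self-loop
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else 0.2 * s   # LeakyReLU
    e = np.exp(e - e.max(axis=1, keepdims=True))    # stable softmax per row
    return e / e.sum(axis=1, keepdims=True)

# Star graph: node 0 connected to nodes 1 and 2.
A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
rng = np.random.default_rng(1)
alpha = gat_attention(A, rng.normal(size=(3, 4)),
                      rng.normal(size=(4, 2)), rng.normal(size=4))
# each row of alpha sums to 1; non-neighbor entries are exactly 0
```

The softmax over each node's neighborhood is what makes the aggregation weights adaptive, in contrast to the fixed normalization of a GCN.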

Graph autoencoders are a class of graph embedding methods that aim to represent the vertices of a graph as low-dimensional vectors using a neural network structure. At present, GCN-based autoencoder methods mainly include: GAE Kipf and Welling (2016) and ARGA Pan et al. (2018), and other variants are NetRA Yu et al. (2018), DNGR Cao et al. (2016), DRNE Ke et al. (2018).

The purpose of a graph generation network is to generate new graphs given a set of observed graphs. MolGAN Lee et al. (2021) integrates relational GCNs, modified GANs, and reinforcement learning objectives to generate graphs with the properties the model requires. DGMG Li et al. (2018) uses graph convolutional networks with spatial properties to obtain hidden representations of existing graphs, and is suitable for expressive and flexible relational data (such as natural language generation and the pharmaceutical field). GRNN You et al. (2018) generates graphs with a deep model composed of two levels of recurrent neural networks.

In addition to the above-mentioned classic models, researchers have conducted further studies on GCNs, for example GLCN Jiang et al. (2018), RGDCN Brockschmidt (2019), GIC Jiang et al. (2019), HA-GCN Zhou and Li (2017), HGCN Liu et al. (2019), BGCN Zhang et al. (2018), SAGNN Zhang et al. (2019), DVNE Zhu et al. (2018), SDNE Wang et al. (2016), GC-MC Berg et al. (2017), ARGA Pan et al. (2018), Graph2Gauss Bojchevski and Günnemann (2017), and GMNN Qu et al. (2019), among other network models. The DeepMind team has also begun to pay attention to deep learning on graphs. In 2019, the Megvii Research Institute proposed GeoConv, which models the geometric structure between points, and the hierarchical feature extraction framework Geo-CNN Lan et al. (2019). Hernández et al. (2022) proposed a method to predict the time evolution of dissipative dynamical systems using graph neural networks. Yao et al. (2024) introduced the Federated Graph Convolutional Network (FedGCN) algorithm for semi-supervised node classification, which features fast convergence and low communication cost.

2.1.3 Fluid dynamics neural networks

Computational fluid dynamics (CFD) combines modern fluid mechanics with numerical methods and computing. Its research content is to solve the governing equations of fluid mechanics through computers and numerical methods, and to simulate and analyze fluid mechanics problems.

Raissi et al. proposed a physics-informed neural network framework in Science, the Hidden Fluid Mechanics (HFM) framework, to solve partial differential equations Raissi et al. (2020). The motion of the fluid in Raissi et al. (2020) is governed by the transport equation, the momentum equation, and the continuity equation; these equations (fluid-mechanics knowledge) are encoded into the neural network, and a feasible solution is obtained by combining the residuals of the governing equations with the network, as shown in Fig. 5. The HFM framework is not limited by boundary or initial conditions. In predicting fluid physics data, it combines the strong versatility of machine learning with the strong specificity of computational fluid dynamics.

Fig. 5

A physics-uninformed neural network (left) takes the input variables t, x, and y and outputs c, u, v, and p. By applying automatic differentiation on the output variables, we encode the transport and NS equations in the physics-informed neural networks \({e_i}\), i = 1,..., 4 (right) Raissi et al. (2020)

Wessels et al. proposed the Neural Particle Method (NPM) Wessels et al. (2020), a computational fluid dynamics approach using an updated Lagrangian physics-informed neural network; even with highly irregular discrete point locations, NPM is stable and accurate. Zhang et al. (2020) proposed a new end-to-end deep learning network for automatically generating fluid animations from Lagrangian fluid simulation data. Guan et al. proposed the “NeuroFluid” model, which uses differentiable rendering based on neural implicit fields, treats fluid physics simulation as the inverse of the 3D rendering problem for fluid scenes, and realizes fluid dynamics inversion Guan et al. (2022).

2.2 Dynamic neural network systems

Both dynamical systems and neural networks can be used to express nonlinear functions; in a network, these nonlinear functions are in effect information waves propagating between layers. If physical systems in the real world are represented by neural networks, it will greatly improve the possibility of analyzing these systems with artificial intelligence. Neural networks are usually trained on large amounts of data, adjusting weights and biases using the information obtained so as to minimize the difference between the actual and expected outputs, approximating the ground truth and thereby imitating the judgments made by neurons in the human brain. However, this training method suffers from “chaos blindness”: the AI system cannot respond to chaos (or abrupt changes) in the system.

2.2.1 Hamiltonian/Lagrangian neural networks

The brachistochrone (steepest-descent curve) problem posed by the Swiss mathematician Johann Bernoulli made the calculus of variations an essential tool for solving extreme-value problems in mathematical physics. Using the variational method, the variational principle of a physical problem (or a problem in another discipline) is transformed into finding the extreme value (or stationary value) of a functional. The variational principle is also called the principle of least action Feynman (2005); Carl Jacobi called the principle of least action the mother of analytical mechanics. When applied to the action of a mechanical system, the equation of motion of the system can be obtained. The study of this principle led to the development of the Lagrangian and Hamiltonian formulations of classical mechanics.

Hamiltonian neural networks Hamilton’s principle is a variational principle proposed by Hamilton in 1834 for holonomic dynamical systems. The Hamiltonian embodies complete information about a dynamic physical system: the total of all its energies, kinetic and potential. The Hamiltonian principle is often used to establish dynamic models of systems with continuous mass and stiffness distributions (elastic systems). The Hamiltonian is the “special seasoning” that gives neural networks the ability to learn order and chaos, letting them understand underlying dynamics in a way that conventional networks cannot; this is a first step toward neural networks grounded in physics. The NAIL team incorporated the Hamiltonian structure into a neural network and applied it to the well-known Hénon-Heiles model of stellar and molecular dynamics Choudhary et al. (2020), accurately predicting the dynamics of a system moving between order and chaos.

An unstructured neural network, such as a multi-layer perceptron (MLP), can be utilized to parameterize the Hamiltonian. In 2019, Greydanus et al. proposed Hamiltonian Neural Networks (HNN) Greydanus et al. (2019) that learn basic laws of physics (the Hamiltonian of a mass-spring system) and accurately preserve a quantity analogous to the total energy (energy conservation). In the same year, Toth et al. used the Hamiltonian principle (the variational method) to transform the optimization problem into a functional extreme-value (or stationary-value) problem and proposed Hamiltonian Generative Networks (HGN) Toth et al. (2019). Subject to the physical constraints defined by Hamilton’s equations of motion, Han et al. (2021) introduced a class of HNNs that can adapt to nonlinear physical systems: by training a time-series-based neural network on a small number of bifurcation-parameter values of the target Hamiltonian system, the dynamical states at other parameter values can be predicted. Dierkes and Flaßkamp (2021) introduced a Hamiltonian Neural Network that explicitly learns the total energy of the system, training the network to learn the equations of motion to compensate for missing physical rules.
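The core mechanism behind these models can be sketched without any learning: given a Hamiltonian \(H(q,p)\), the dynamics follow Hamilton's equations \(\dot{q} = \partial H/\partial p\), \(\dot{p} = -\partial H/\partial q\), and a symplectic integrator approximately conserves the energy. The sketch below uses a fixed mass-spring Hamiltonian and finite differences in place of a trained network with automatic differentiation (both substitutions are illustrative assumptions).

```python
import numpy as np

def hamiltonian_vector_field(H, q, p, eps=1e-5):
    """(dq/dt, dp/dt) from Hamilton's equations, via central differences.
    An HNN parameterizes H with a network and uses autodiff instead."""
    dH_dq = (H(q + eps, p) - H(q - eps, p)) / (2 * eps)
    dH_dp = (H(q, p + eps) - H(q, p - eps)) / (2 * eps)
    return dH_dp, -dH_dq

# Mass-spring Hamiltonian (m = k = 1): H = p^2/2 + q^2/2.
H = lambda q, p: 0.5 * p**2 + 0.5 * q**2
q, p, dt = 1.0, 0.0, 0.01
for _ in range(1000):                       # symplectic (semi-implicit) Euler
    _, dp = hamiltonian_vector_field(H, q, p)
    p = p + dt * dp                         # update momentum first
    dq, _ = hamiltonian_vector_field(H, q, p)
    q = q + dt * dq                         # then position
energy_drift = abs(H(q, p) - H(1.0, 0.0))   # stays small: energy is preserved
```

Updating \(p\) before \(q\) makes the step symplectic, which is why the energy drift stays bounded over long rollouts, exactly the behavior HNNs aim to preserve.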

In the field of neural networks applied to chaotic dynamic systems, the work by Haber and Ruthotto (2017) introduces a neural network model called “Stable Neural Networks,” which is inspired by the differential equations of the Hamiltonian dynamical system. This model aims to address the issue of susceptibility to input data disturbance or noise that can affect the performance of neural networks obtained through the discretization of chaotic dynamic systems.

Another relevant research paper by Massaroli et al. (2019) offers a novel perspective on neural network optimization, specifically tackling the problem of escaping saddle points. The non-convexity and high dimensionality of the optimization problem in neural network training make it challenging to converge to a minimum loss function. The proposed framework guarantees convergence to a minimum loss function and avoids the saddle point problem. It also demonstrates applicability to neural networks based on physical systems and pH control, improving learning efficiency and increasing the likelihood of finding the global minimum of the objective function.

Additionally, there are other methods available for identifying Hamiltonian dynamical systems (HDS) using neural networks, as discussed in the referenced paper by Lin et al. (2017). These methods contribute to the exploration of neural network architectures and techniques for modeling and understanding HDS. Zhao et al. (2024) used conservative Hamiltonian neural flow to construct a GNN that is robust to adversarial attacks, greatly improving the robustness to adversarial perturbations.

Overall, these research works highlight important approaches and perspectives in applying neural networks to chaotic dynamic systems, addressing challenges such as input data disturbance, saddle point problems, and optimization difficulties.

Lagrangian neural networks The Lagrangian function of analytical mechanics is a function that describes the dynamical state of the entire physical system. The Lagrangian function of a system represents the properties of the system itself. If the world is symmetric (such as spatial symmetry), then after the system is translated, the Lagrangian function remains unchanged, and momentum conservation can be obtained using the variational principle.

Even if the training data satisfy all physical laws, a trained artificial neural network can still make non-physical predictions (there are scenarios where rigid-body kinematics does not apply and that are difficult to compute with physical formulas). Therefore, in 2019, Lutter et al. (2019) represented the mass matrix of the Euler-Lagrange equation with a neural network, so that the relationship between the mass distribution and the robot pose can be estimated. Deep Lagrangian networks learn the equations of motion of mechanical systems, train faster than traditional feedforward neural networks, make more physically plausible predictions, and are more robust when predicting new trajectories.

To enhance sparsity and stability, the work Cranmer et al. (2020) proposes a new sparse penalty function based on the dimension-reduction method SCAD Fan and Li (2001) and adds it to a Lagrangian-constrained neural network, overcoming the defects of traditional blind source separation and independent component analysis; this effectively avoids ill-conditioned equations and improves the sparsity, stability, and accuracy of blind image restoration. Since standard neural networks do not conserve energy, it is difficult for them to model dynamics over long time horizons. In 2020, Cranmer et al. (2020) therefore used neural networks to learn arbitrary Lagrangians, inducing strong physical priors, as shown in Fig. 6. Xiao et al. (2024) introduced an extension of the Lagrangian neural network (LNN), the generalized Lagrangian neural network, innovatively tailored to non-conservative systems.
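The Lagrangian-to-acceleration step underlying such models can be sketched directly: the Euler-Lagrange equation gives \(\ddot{q} = (\partial^2 L/\partial \dot{q}^2)^{-1}(\partial L/\partial q - \dot{q}\,\partial^2 L/\partial q\,\partial \dot{q})\) in the scalar case. Below, a fixed mass-spring Lagrangian and finite differences stand in for the trained network and automatic differentiation (illustrative assumptions, not the cited implementation).

```python
import numpy as np

def lagrangian_acceleration(L, q, qd, eps=1e-4):
    """Acceleration implied by a scalar Lagrangian L(q, qd), via the
    Euler-Lagrange equation with central finite differences."""
    dL_dq = (L(q + eps, qd) - L(q - eps, qd)) / (2 * eps)
    d2L_dqd2 = (L(q, qd + eps) - 2 * L(q, qd) + L(q, qd - eps)) / eps**2
    d2L_dq_dqd = (L(q + eps, qd + eps) - L(q + eps, qd - eps)
                  - L(q - eps, qd + eps) + L(q - eps, qd - eps)) / (4 * eps**2)
    return (dL_dq - d2L_dq_dqd * qd) / d2L_dqd2

# Mass-spring Lagrangian (m = k = 1): L = qd^2/2 - q^2/2  =>  q'' = -q.
L = lambda q, qd: 0.5 * qd**2 - 0.5 * q**2
acc = lagrangian_acceleration(L, q=0.3, qd=1.7)   # close to -0.3
```

An LNN learns \(L\) itself and differentiates through it, so the predicted accelerations automatically respect the variational structure rather than being fit freely.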

Fig. 6

Cranmer et al. (2020) propose a method to address the challenge of modeling the dynamics of physical systems using neural networks. They demonstrate that neural networks struggle to accurately represent these dynamics over long time periods due to their inability to conserve energy. To overcome this limitation, the authors introduce a technique for learning arbitrary Lagrangians with neural networks, which incorporates a strong physical prior on the learned dynamics. By leveraging the principles of Lagrangian mechanics, the neural networks are able to better capture the underlying physics of the system. This approach improves the accuracy of the neural network model (shown in blue) compared to traditional neural networks (shown in red), providing a promising avenue for enhancing the modeling of complex dynamical systems

2.2.2 Neural network differential equation solvers

In physics, because locality and causality are expressed through them, differential equations are the basic equations; it is therefore a cutting-edge trend to treat neural networks as dynamic differential equations and to use numerical solution algorithms to design network structures.

Ordinary differential equation neural networks The general neural ODE is as follows:

$$\begin{aligned} y\left( 0 \right)&= {y_0}\\ \frac{{dy}}{{dt}}\left( t \right)&= {f_\theta }\left( {t,y\left( t \right) } \right) \end{aligned}$$
(1)

where \({y_0}\) can be a tensor of any dimension, \(\theta\) denotes a vector of learned parameters, and \({f_\theta }\) denotes a neural network.
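Equation (1) can be integrated with any ODE solver. A minimal forward-Euler sketch follows, with a fixed linear vector field standing in for the trainable \(f_\theta\) (an illustrative assumption, chosen so the exact solution is known).

```python
import numpy as np

def neural_ode_euler(f, y0, t0, t1, steps=100):
    """Integrate dy/dt = f(t, y) with forward Euler; in a neural ODE,
    f would be a trainable network f_theta."""
    y, t = np.array(y0, dtype=float), t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        y = y + h * f(t, y)     # one explicit Euler step
        t += h
    return y

# Stand-in "network": the linear field f(t, y) = -y, with exact
# solution y(t) = y0 * exp(-t).
f = lambda t, y: -y
y1 = neural_ode_euler(f, y0=[1.0], t0=0.0, t1=1.0, steps=1000)
# y1 is close to exp(-1); the O(h) error shrinks as steps grow
```

In a full neural ODE, the same loop is made differentiable (or replaced by the adjoint method) so that gradients can flow back into the parameters of \(f_\theta\).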

Neural networks offer powerful function approximation capabilities, while penalty terms help bridge the gap between theory and practice. One application is in turbulence modeling, as demonstrated in Ling et al. (2016), where a carefully designed neural network approximates closed relations (Reynolds stresses) while adhering to specific physical invariances. This approach enables the modeling of residuals between theoretical and observed data.

Latent ODEs emerge from this framework when incorporating time-varying components. Rubanova et al. (2019) utilize latent ODEs to simulate the dynamics of a hopping body leaving the ground in a simulated physics environment. Additionally, Du et al. (2020) explore the applications of latent ODEs in reinforcement learning.

Another study by Shi and Morris (2021) combines latent ODEs with change-point detection algorithms to model switching dynamical systems. This approach provides a powerful tool for segmenting and understanding complex dynamics with abrupt changes.

In summary, neural networks coupled with penalty terms and latent ODEs offer valuable methods for modeling and simulating various dynamic systems, including turbulence, reinforcement learning, and switching dynamical systems. These approaches bridge the gap between theoretical principles and practical applications, opening up new possibilities in understanding and predicting complex phenomena.

Euler’s method: The main idea of Euler’s method is to use the first derivative at a point to linearly approximate the next value. Depending on the point at which the first derivative is taken, it is divided into the forward Euler method (explicit Euler method) and the backward Euler method (implicit Euler method). The general form of the deep residual network (ResNet) He et al. (2016) can be regarded as a discrete dynamical system, because each step consists of the simplest nonlinear discrete dynamical system: a linear transformation followed by a nonlinear activation function. One can thus say that the residual network is an explicit Euler discretization of a neural ODE. The RevNet neural network Behrmann et al. (2019), a further generalization of ResNet, performs residual learning in a symmetric form. The backward Euler algorithm corresponds to PolyNet Zhang et al. (2017), which can reduce depth by increasing the width of each residual block, thereby achieving state-of-the-art classification accuracy. In addition, from the perspective of ordinary differential equations, the backward Euler method has better stability than the forward Euler method. For more methods that use ordinary differential equations themselves as neural networks, see Chen et al. (2018).
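The ResNet-Euler correspondence can be made concrete: a residual block computes \(x + h\,g(x)\), which is exactly one forward-Euler step of \(dx/dt = g(x)\) (with \(h = 1\) in a standard ResNet). The toy block below, with random weights and a tanh residual branch (both illustrative), shows block stacking as numerical integration.

```python
import numpy as np

def residual_block(x, W, h=1.0):
    """A residual block x + h * g(x): one forward-Euler step of
    dx/dt = g(x) with step size h. Here g(x) = tanh(x W)."""
    return x + h * np.tanh(x @ W)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4)) * 0.1
x = rng.normal(size=(1, 4))

# Stacking blocks = integrating the ODE: ten blocks of step 0.1 trace the
# continuous flow over the same "time" interval as one block of step 1.0.
x_fine = x.copy()
for _ in range(10):
    x_fine = residual_block(x_fine, W, h=0.1)
x_coarse = residual_block(x, W, h=1.0)
```

Viewed this way, network depth plays the role of integration time, which is the observation that motivated treating the continuous limit itself as the model (the neural ODE above).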

Partial differential equation neural networks A representative second-order PDE, the two-dimensional Poisson equation:

$$\begin{aligned} \frac{{{\partial ^2}\psi \left( {x,y} \right) }}{{\partial {x^2}}} + \frac{{{\partial ^2}\psi \left( {x,y} \right) }}{{\partial {y^2}}} = f\left( {x,y} \right) \ \end{aligned}$$
(2)

The design of FractalNet is based on self-similarity: by repeatedly applying a simple expansion rule, it generates deep networks whose structure is laid out as a truncated fractal Larsson et al. (2016) and can be interpreted in the well-known Runge–Kutta form. The activation and weight dynamics of the neural networks in Ramacher (1993) are derived from partial differential equations that incorporate the weights as parameters or variables. Results obtained using a combination of time-varying parameter patterns and dynamics show that learning rules can be replaced by learning laws at equal performance.

The Physics-Informed Neural Network (PINN) Raissi et al. (2019) is a scientific machine learning method for traditional numerical fields, especially for solving various problems related to PDEs. The principle of PINN is to approximate the solution of a PDE by training a neural network to minimize a loss function. The essence is to integrate the equation (physical knowledge) into the network: the residual of the governing equation is used to construct a loss term that acts as a penalty restricting the space of feasible solutions.
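The residual-as-penalty idea can be sketched numerically. A real PINN differentiates a neural network with automatic differentiation; the minimal sketch below (with an illustrative 1D problem u″ = f, exact solution u = sin(πx), all choices hypothetical) instead uses central finite differences on candidate functions, just to show how the governing-equation residual becomes a loss term:

```python
import math

def pde_residual_loss(u, f, xs, h=1e-4):
    # Physics loss: mean squared residual of u''(x) - f(x) = 0 at the
    # collocation points xs, estimated with central finite differences.
    # (A real PINN would use automatic differentiation on a network.)
    total = 0.0
    for x in xs:
        u_xx = (u(x + h) - 2.0 * u(x) + u(x - h)) / (h * h)
        total += (u_xx - f(x)) ** 2
    return total / len(xs)

f = lambda x: -math.pi ** 2 * math.sin(math.pi * x)  # source term
u_true = lambda x: math.sin(math.pi * x)             # satisfies u'' = f
u_bad = lambda x: x * (1.0 - x)                      # does not satisfy it

xs = [0.1 * k for k in range(1, 10)]                 # collocation points
print(pde_residual_loss(u_true, f, xs) < 1e-3)       # near-zero physics loss
print(pde_residual_loss(u_bad, f, xs) > 1.0)         # large penalty
```

During PINN training, this physics loss is added to the data-fitting loss, so candidates violating the governing equation are penalized even where no observations exist.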

The PINN-HFM algorithm Raissi et al. (2020), fused with physical knowledge, reconstructs the full-resolution velocity field from sparse velocity information. That is, by minimizing the loss term of the Navier–Stokes (NS) equations, the velocity and pressure fields are obtained simultaneously, so that the result conforms to the “laws of physics”. Compared to traditional CFD solvers, PINN is better at integrating data (observations of the flow) and physical knowledge (essentially the governing equations describing the physical phenomenon).

Considering that PINN training is not robust under steep gradients, and that network depth grows with the PDE order, leading to vanishing gradients and slow learning, Dwivedi et al. (2019) propose DPINN. In 2020, Meng et al. (2020) used the traditional parareal time-domain decomposition method for parallelization to reduce the complexity and learning difficulty of the model. Unlike PINN and its variants, Fang (2021) proposed hybrid physics-informed networks that solve PDEs using approximations of differential operators instead of automatic differentiation. Moseley et al. (2021) present a parallel approach based on spatially partitioned domains. As a meshless method, PINN does not require a mesh; nevertheless, an algorithm that fuses finite-difference schemes to accelerate information propagation has also emerged Chen et al. (2021). Schiassi et al. (2022) use PINN to “learn” the optimal control of the planar orbit transfer problem. Since the global outbreak of Covid-19, Treibert et al. have used PINN to estimate model parameters, building an SVIHDR differential dynamical system model Treibert and Ehrhardt (2021) that extends the Susceptible-Infected-Recovered (SIR) model Trejo and Hengartner (2022).

Although AI using PDEs to simulate physical problems has been widely applied, limitations remain in solving high-dimensional PDE problems. Karniadakis et al. (2021) discuss diverse applications of physics-informed learning that integrate noisy data and mathematical models, improving accuracy under physical invariance constraints and solving hidden-physics inverse problems and high-dimensional problems. Xiao et al. (2024) proposed SHoP, a deep learning framework for solving high-order partial differential equations, which expands the network into a Taylor series to provide explicit solutions to PDEs.

Controlled differential equations neural networks Neural controlled differential equations (CDEs) rely on two concepts: paths of bounded variation and Riemann–Stieltjes integrals, which are formulated as follows:

$$\begin{aligned} \begin{aligned}&y\left( 0 \right) = {y_0}\\&\int _0^t {f\left( {y\left( s \right) } \right) dx\left( s \right) } = \int _0^t {f\left( {y\left( s \right) } \right) \frac{{dx}}{{ds}}\left( s \right) ds} \end{aligned} \end{aligned}$$
(3)

Modeling the dynamics of time series with neural differential equations is a promising option; however, the performance of current methods is often limited by the choice of initial conditions. The neural CDE model of Kidger et al. (2020) can handle irregularly sampled and partially observed input data (i.e., time series), and outperforms ODE- or RNN-based models. Morrill et al. (2021) introduce additional terms into the numerical solver to incorporate substep information, obtaining neural rough differential equations. When dealing with data with missing information, it is standard practice to add observation masks Che et al. (2018), for which this is the appropriate continuous-time analogue.
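The Riemann–Stieltjes identity in Equation (3) can be checked numerically for a smooth control path. In this minimal sketch, x(s) = s² and the integrand g(s) = s are illustrative stand-ins (g plays the role of f(y(s))); both sides of the identity then equal 2/3 on [0, 1]:

```python
# Numerical check of the Riemann-Stieltjes identity behind neural CDEs:
# for a smooth control path x, the integral of g(s) dx(s) equals the
# ordinary integral of g(s) x'(s) ds.
N = 100_000
h = 1.0 / N

x = lambda s: s * s        # control path
dx_ds = lambda s: 2.0 * s  # its derivative
g = lambda s: s            # integrand, stand-in for f(y(s))

# Riemann-Stieltjes sum: weight increments of the path itself
stieltjes = sum(g(k * h) * (x((k + 1) * h) - x(k * h)) for k in range(N))
# Ordinary Riemann sum of g(s) * x'(s)
riemann = sum(g(k * h) * dx_ds(k * h) * h for k in range(N))

print(abs(stieltjes - 2.0 / 3.0) < 1e-3, abs(riemann - 2.0 / 3.0) < 1e-3)
```

For irregularly sampled time series, the left-hand sum is the natural object: it weights the vector field by the actual increments of the observed path, which is why CDEs handle non-uniform sampling gracefully.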

Stochastic differential equation neural networks Stochastic Differential Equations (SDEs) have been widely used to model real-world stochastic phenomena such as particle systems (Coffey and Kalmykov 2012; Pavliotis 2014), financial markets Black and Scholes (2019), population dynamics Arató (2003) and genetics Huillet (2007). Neural SDEs serve as a natural extension of ordinary differential equations (ODEs) for modeling systems that evolve in continuous time while accounting for uncertainty Kidger (2022).

The dynamics of a stochastic differential equation (SDE) encompass both a deterministic term and a stochastic term:

$$\begin{aligned} dy\left( t \right) = \mu \left( {t,y\left( t \right) } \right) dt + \sigma \left( {t,y\left( t \right) } \right) \circ dw\left( t \right) \ \end{aligned}$$
(4)

where \(\mu\) and \(\sigma\) are regular functions, w is a d-dimensional Brownian motion, and y is the resulting d-dimensional continuous stochastic process.
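A standard way to simulate Equation (4) is the Euler–Maruyama scheme. The sketch below is illustrative (drift, diffusion, and step sizes are hypothetical choices, and the scheme is the Itô-type discretization rather than the Stratonovich form written above):

```python
import math
import random

def euler_maruyama(mu, sigma, y0, t1, n, seed=0):
    # Euler-Maruyama: y_{k+1} = y_k + mu(t,y) dt + sigma(t,y) sqrt(dt) N(0,1)
    rng = random.Random(seed)
    dt = t1 / n
    y, t = y0, 0.0
    for _ in range(n):
        y = y + mu(t, y) * dt + sigma(t, y) * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        t += dt
    return y

mu = lambda t, y: -y      # mean-reverting drift
sigma = lambda t, y: 0.3  # constant diffusion

one_path = euler_maruyama(mu, sigma, 1.0, 1.0, 1000)  # one noisy sample path

# With sigma = 0 the scheme reduces to forward Euler for y' = -y:
y_det = euler_maruyama(mu, lambda t, y: 0.0, 1.0, 1.0, 1000)
print(abs(y_det - math.exp(-1.0)) < 1e-3)
```

Each call with a different seed draws a different sample path, which is exactly the generative, RNN-like reading of SDEs discussed next.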

The inherent randomness in stochastic differential equations (SDEs) can be viewed as a generative model within the context of modern machine learning. Analogous to recurrent neural networks (RNNs), an SDE can be seen as an RNN with random noise, specifically Brownian motion, as input and the generated sample as output. Time series are a classic modeling interest, with predictive models such as Holt-Winters Holt (2004), ARCH Engle (1982), ARMA Hannan and Rissanen (1982), and GARCH Bollerslev (1986).

For more deep learning libraries that solve differential equations by combining physical knowledge and machine learning, see Lu et al. (2021).

2.3 Graph neural networks to solve physical problems

Molecular Design: The most critical problem in the materials and pharmaceutical fields is to predict the chemical, physical, and biological properties of new molecules from their structures. Work from Harvard University Duvenaud et al. (2015) proposes to model molecules as graphs and use graph convolutional neural networks to learn the desired molecular properties. Their method significantly outperforms handcrafted molecular fingerprints (Morgan 1965; Rogers and Hahn 2010), opening up a new route to molecular design.

Medical Physics: The field of medical physics Manco et al. (2021) is one of the most important areas of artificial intelligence application, and can be roughly divided into radiotherapy and medical imaging. With the success of AI in imaging tasks, AI research in radiotherapy (Hrinivich and Lee 2020; Maffei et al. 2021) and medical imaging (such as X-ray, MRI, and nuclear medicine) Barragán-Montero et al. (2021) has grown rapidly. Among them, magnetic resonance imaging (MRI) in medical image analysis Castiglioni et al. (2021) plays a vital role in the diagnosis, management, and monitoring of many diseases Li et al. (2022). A study from Imperial College Ktena et al. (2017) uses graph CNNs on non-Euclidean brain imaging data to detect disruptions in autism-related brain functional networks. Zegers et al. (2021) outlined state-of-the-art applications of deep learning in neuro-oncology MRI, which has broad potential. Rizk et al. (2021) introduced externally validated deep learning models for meniscal tear detection. The discussion and summary of MRI image reconstruction work in Montalt-Tordera et al. (2021) highlights great potential for the acquisition of future clinical data.

High-energy physics experiments: Graph neural networks have been introduced to predict the dynamics of N-body systems (Battaglia et al. 2016; Chang et al. 2016), with remarkable results.

Power System Solver: The research Donon et al. (2019) combines graph neural networks to propose an architecture that solves the power flow equations (so-called “load flow”) in the grid. The work Park and Park (2019) proposes a physics-inspired data-driven model for wind farm power estimation.

Structure prediction of glass systems (glass phase transitions): DeepMind published a paper in Nature Physics Bapst et al. (2020) modeling glass dynamics with a graph neural network and linking network predictions to physics. The long-term evolution of glassy systems can be predicted using only the local structure around the particles. The model works well across different temperature, pressure, and density ranges, demonstrating the power of graph networks.

3 Deep neural network paradigms inspired by electromagnetics

3.1 Optical design neural networks

Optical neural networks (ONNs) are a novel type of neural network designed with optical technologies such as optical interconnection and optical devices. The idea of optical neural networks is to imitate neural networks by encoding information onto optical features via modulation, while exploiting optical propagation phenomena such as interference, diffraction, transmission, and reflection to realize neural networks and their operators. The first implementation of ONNs was the optical Hopfield network, proposed by Psaltis and Farhat (1985) in 1985. Three main operators are involved in traditional neural networks: linear operations, nonlinear activation operations, and convolution operations; in this subsection, the optical implementations of these operators are presented in that order. We summarize the structure of this section and an overview of representative methods in Table 2.

Table 2 An overview of methods for AI DNNs inspired by electromagnetism

3.1.1 Optical implementation of linear operations

The main linear operators of neural networks are matrix multiplication and weighted summation. Weighted summation is easy to implement thanks to the coherence and incoherence properties of light, so the challenge of optically implementing linear operations lies in matrix multiplication. As early as 1978, Goodman et al. (1978) first implemented an optical vector–matrix multiplier with a lens set based on the principles of optical transmission, and the optical matrix–matrix multiplier was first realized with a 4f-type system consisting of a lens set by Chen (1993).

Optical implementation of vector–matrix multiplications The vector \({\textbf {p}}\) is obtained by multiplying the matrix \({\textbf {A}}\) with the vector \({\textbf {b}}\). The mathematical essence is to use each row of the matrix \({\textbf {A}}\) to make an inner product with the vector \({\textbf {b}}\) to obtain the value of the corresponding position of the vector \({\textbf {p}}\). The mathematical expression is:

$$\begin{aligned} {\textbf {p}}(i)=\sum _{j}{{\textbf {A}}(i,j) {\textbf {b}}(j)} \end{aligned}$$
(5)

The optical vector–matrix multiplier is mainly composed of two parts: the light source, such as an array of light-emitting diodes, and the optical path system, composed of a spherical lens, a cylindrical lens, a spatial light modulator, and an optical detector. The mathematical idea is to transform the vector–matrix multiplication into a matrix–matrix point-wise multiplication followed by summation.

As shown in Fig. 7, the vector \({\textbf {b}}\) is modulated onto optical features of the incoherent light source (LS), such as amplitude, intensity, phase, or polarization; the light then passes through the first spherical lens L1. Since the LS array is located in the front focal plane of L1, the light emerging from L1 is collimated. Next, the light passes through the cylindrical lens CL1, which is located in the post-focal plane of L1. Because CL1 is placed vertically, the light through CL1 converges on the post-focal plane only in the horizontal direction, while remaining collimated in the vertical direction. At this point the light field carries the information:

$$\begin{aligned} {\textbf {B}}= \left[ \begin{matrix} {\textbf {b}}\\ \vdots \\ {\textbf {b}}\\ \end{matrix} \right] \in R^{I \times J} \end{aligned}$$
(6)

A spatial light modulator (SLM), which encodes the matrix \({\textbf {A}}\), is placed at the post-focal plane of CL1. Passing through the SLM can be seen as the point-wise multiplication of the matrices \({\textbf {A}}\) and \({\textbf {B}}\). At this point, the light field carries the information:

$$\begin{aligned} {\textbf {P}}(i,j)={\textbf {A}}(i,j){\textbf {B}}(i,j) \end{aligned}$$
(7)

Then, the light through the SLM passes the cylindrical lens CL2, placed at a distance equal to its focal length f from the SLM. Because CL2 is placed horizontally, the light converges on the post-focal plane only in the vertical direction, while remaining collimated in the horizontal direction. At this point, the light field carries the multiplication result, the vector \({\textbf {p}}\):

$$\begin{aligned} {\textbf {p}}(i)=\sum _{j}{{\textbf {P}}(i,j)}=\sum _{j}{{\textbf {A}}(i,j){\textbf {B}}(i,j)} \end{aligned}$$
(8)

Finally, the light through CL2 is demodulated, and the vector \({\textbf {p}}\) can be obtained with a charge-coupled device (CCD).
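The three optical stages above map directly onto three array operations. The following minimal sketch (a numerical analogue, not an optical simulation; the function name is ours) mimics the pipeline of Fig. 7:

```python
def optical_vector_matrix(A, b):
    # Numerical analogue of the optical pipeline in Fig. 7:
    # CL1 replicates b across rows (Eq. 6), the SLM performs the
    # point-wise product with A (Eq. 7), and CL2 sums each row (Eq. 8).
    I, J = len(A), len(A[0])
    B = [list(b) for _ in range(I)]                                # Eq. (6)
    P = [[A[i][j] * B[i][j] for j in range(J)] for i in range(I)]  # Eq. (7)
    return [sum(P[i]) for i in range(I)]                           # Eq. (8)

A = [[1, 2], [3, 4]]
b = [5, 6]
print(optical_vector_matrix(A, b))  # [17, 39], matching A @ b
```

The point is that the optics never performs an explicit inner product: replication, masking, and focusing jointly realize Eq. (5).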

Fig. 7
figure 7

Optical implementation of vector–matrix from Goodman et al. (1978)

Optical implementation of matrix–matrix multiplications Compared to vector–matrix multiplication, matrix–matrix multiplication is more complicated. The multiplication of matrix \({\textbf {A}}\) by matrix \({\textbf {B}}\) takes the inner product of each row of \({\textbf {A}}\) with each column of \({\textbf {B}}\). Assuming the result matrix is \({\textbf {P}}\), the expression is as follows:

$$\begin{aligned} {\textbf {P}}(x,y)=\sum _{l}{{\textbf {A}}(x,l) {\textbf {B}}(l,y)} \end{aligned}$$
(9)

The matrix–matrix multiplication is implemented with the help of an optical 4f-type system, which consists of Fourier lenses, holographic masks (HM), and charge-coupled devices. Taking advantage of the discrete Fourier transform (DFT), the multiplication can be implemented by combining the matrix \({\textbf {B}}\) with DFT matrices.

As shown in Fig. 8, the matrix \({\textbf {B}}\) is modulated onto the complex amplitude of the input light, and the result matrix \({\textbf {P}}\) is obtained in the output plane. The multiplication of matrices \({\textbf {A}}\) and \({\textbf {B}}\) is completed as the light propagates from the input plane to the output plane. Let the matrix \({\textbf {B}}\) be the input light field at the front-focal plane of the Fourier lens, and let F denote the Fourier transform. According to the principle of Fresnel diffraction, the complex amplitude distribution of the light field at the post-focal plane of the lens is the Fourier transform of that at the front-focal plane:

$$\begin{aligned} {\textbf {P}}(x,y)=\frac{1}{i \lambda f}F({\textbf {B}}(\frac{x}{\lambda f},\frac{y}{\lambda f})) \end{aligned}$$
(10)

Since the DFT can be implemented with the DFT matrix, combined with Equation (10) the discretized light field is expressed as:

$$\begin{aligned} {\textbf {P}}(x,y)=\sum _{l}{{\textbf {G}}(x,l){\textbf {B}}(l,y)} \end{aligned}$$
(11)

In this case, the DFT matrix \({\textbf {G}}\) of the lens is related only to the focal length and the wavelength, so the matrix \({\textbf {A}}\) must be encoded with a holographic mask, which adjusts the complex amplitude distribution of the light field. The whole optical system is composed of two Fourier lenses and a holographic mask, so the output light field is:

$$\begin{aligned} \begin{aligned} {\textbf {P}}(x,y)&=\sum _{m}{{\textbf {G}}_2(x,m){\textbf {H}}(m)(\sum _{l}{{\textbf {G}}_1(m,l){\textbf {B}}(l,y)})}\\&=\sum _{m}{\sum _{l}({{\textbf {G}}_2(x,m){\textbf {H}}(m) {\textbf {G}}_1(m,l){\textbf {B}}(l,y)})} \end{aligned} \end{aligned}$$
(12)

where the matrices \({\textbf {G}}_1\) and \({\textbf {G}}_2\) denote the DFT matrices of the two lenses, respectively, and \({\textbf {H}}(m)\) is the complex amplitude distribution function of the holographic mask. Comparing Equation (9) with Equation (12):

$$\begin{aligned} {\textbf {A}}(x,l)=\sum _{m}{{\textbf {G}}_2(x,m){\textbf {H}}(m){\textbf {G}}_1(m,l)} \end{aligned}$$
(13)

The relationship between the sampling periods and the sampling numbers in the input plane, the output plane, and the holographic mask satisfies:

$$\begin{aligned} \left\{ \begin{aligned}&\frac{\triangle {x_1}\triangle {x}}{f \lambda }=\frac{1}{M}\\&\frac{\triangle {x}\triangle {x_2}}{f \lambda }=\frac{1}{X}\\&M=X\times L\\ \end{aligned} \right. \end{aligned}$$
(14)

where \(\triangle {x_1}, \triangle {x}, \triangle {x_2}, L, M, X\) are the sampling periods and the sampling numbers in the input plane, the holographic mask, and the output plane, respectively. According to Equations (13) and (14), \({\textbf {H}}(m)\) can be obtained:

$$\begin{aligned} {\textbf {H}}(m)=\sum _{x}{\sum _{l}{exp(\frac{i2\pi mx}{X}){\textbf {A}}(x,l)exp(\frac{i2\pi lm}{M})}} \end{aligned}$$
(15)
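Equation (13) has a lens–mask–lens structure: a fixed transform, a tunable diagonal (the mask), and another fixed transform. As a self-contained numerical analogue of that structure (not the Chen (1993) construction itself, and restricted to circulant matrices, which DFT matrices diagonalize exactly), the sketch below rebuilds a circulant \({\textbf {A}}\) from DFT "lenses" and a diagonal "mask":

```python
import cmath

def dft_matrix(n):
    # Forward DFT matrix G with G[j][k] = exp(-2*pi*i*j*k/n)
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (j * k) for k in range(n)] for j in range(n)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

n = 4
c = [2.0, 1.0, 0.0, 3.0]  # first column of a circulant matrix A
A = [[c[(i - j) % n] for j in range(n)] for i in range(n)]

G1 = dft_matrix(n)  # "first lens": forward DFT
G2 = [[x.conjugate() / n for x in row] for row in dft_matrix(n)]  # inverse DFT, "second lens"
H = [sum(G1[m][k] * c[k] for k in range(n)) for m in range(n)]    # "mask": the DFT of c
D = [[H[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

A_rebuilt = matmul(G2, matmul(D, G1))  # lens - mask - lens, cf. Eq. (13)
err = max(abs(A_rebuilt[i][j] - A[i][j]) for i in range(n) for j in range(n))
print(err < 1e-9)  # True: A = G2 * diag(H) * G1
```

This is the classical fact that circulant matrices are diagonalized by the DFT; the holographic-mask derivation above plays the analogous role of choosing \({\textbf {H}}\) so that the fixed lens transforms compose into the desired \({\textbf {A}}\).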

Optical matrix multipliers The vector–matrix multiplier was first proposed by Goodman et al. (1978) in 1978; with this multiplier, the DFT was implemented optically. Several works (Liu et al. 1986; Francis et al. 1990; Yang et al. 1990) proposed constructing a spatial light modulator from a miniature liquid crystal television (LCTV) to replace the matrix mask and lens for matrix multiplication. Francis et al. (1991) proposed using a mirror array instead of the commonly used lens array to realize an optical neural network with mirror-array interconnection. Nitta et al. (1993) removed the two cylindrical lenses from the matrix multiplier, improved the light-emitting diode arrays and variable-sensitivity photodetector arrays, and produced the first optical neural chip. Chen (1993) proposed an optical 4f-type system that uses the optical Fourier transform and inverse transform of Fourier lenses to implement matrix–matrix multiplication. Wang et al. (1997) proposed a new optical neural network architecture using two perpendicular 1-D prism arrays for optical interconnection to implement matrix multiplication.

Fig. 8
figure 8

Optical implementation of matrix-matrix from Chen (1993)

Psaltis et al. (1988) proposed implementing matrix multiplication through dynamic holographic modification of photorefractive crystals, enabling the construction of most neural networks. Slinger (1991) proposed a weighted N-to-N volume-holographic neural interconnect method and derived coupled-wave solutions describing the behavior of an idealized version of the interconnect. Several works (Yang et al. 1994; Di Leonardo et al. 2007; Nogrette et al. 2014) proposed using the Gerchberg-Saxton algorithm to calculate holograms for each region. Lin et al. (2018) proposed using transmissive and reflective layers to form phase-only masks and construct all-optical neurons by optical diffraction. Yan et al. (2019) proposed a novel diffractive neural network implemented by placing diffraction modulation layers at the Fourier plane of the optical system. Qian et al. (2020) proposed scattering or focusing plane waves at microwave frequencies in a diffractive manner on a compound Huygens metasurface to mimic the functionality of artificial neural networks.

Mengu et al. (2019) proposed using five phase-only diffractive layers for complex-valued phase and amplitude modulation to implement an optical diffractive neural network. Shen et al. (2017) and Bagherian et al. (2018) take advantage of Mach-Zehnder interferometer arrays to implement matrix multiplication via singular value decomposition. Hamerly et al. (2019) proposed an optical-interference-based homodyne detection method to implement matrix multiplication and constructed a new type of photonic accelerator for optical neural networks. Zang et al. (2019) implemented vector–matrix multiplications by stretching time-domain pulses; with the help of fiber loops, multi-layer neural networks can be implemented optically.

3.1.2 Optical implementation of nonlinear activation

Nonlinear activation functions play an important role in neural networks, enabling them to approximate complex nonlinear mappings. However, owing to the lack of nonlinear response in optics and the limitations of optical-device fabrication, the optical response of devices is often fixed, which prevents the optical nonlinearity from being reprogrammed into different forms of nonlinear activation functions. Therefore, nonlinearities in ONNs were previously achieved mainly with optoelectronic hybrid methods Dunning et al. (1991). Only with advances in material fabrication has the all-optical implementation of optical nonlinearity Skinner et al. (1994) emerged. An example is presented below (Fig. 9).

Fig. 9
figure 9

Optical neural network based on Kerr-type nonlinear materials from Skinner et al. (1994)

The all-optical neural network consists of linear layers and nonlinear layers, where the linear layers are composed of thick linear media, such as free space, and the nonlinear layers are composed of thin nonlinear media, such as Kerr-type nonlinear materials, whose refractive index satisfies the following relationship:

$$\begin{aligned} n(x,y,z)=n_0+n_2I_r(x,y,z) \end{aligned}$$
(16)

where \(n_0\) is the linear refractive index, \(n_2\) is the nonlinear refractive index coefficient, and \(I_r(x,y,z)\) is the light field intensity. The material is self-focusing if \(n_2>0\) and self-defocusing if \(n_2<0\). Since its refractive index depends on the light intensity, the nonlinear layer can play the role of both weighted summation and nonlinear mapping.

When the input light is incident on the nonlinear layer, the refractive index differs across the plane, which changes the intensity and direction of the transmitted light and produces interference; the nonlinear layer thus achieves the function of spatial light modulation. The final output signal depends on the input to the first layer and the successive weighting and nonlinear mapping of the nonlinear layers.
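To see why Equation (16) acts as a nonlinear activation, consider a thin Kerr slab applying an intensity-dependent phase to each field sample. The sketch below is a minimal illustration with hypothetical parameter values (the function name and constants are ours); it verifies that the layer violates superposition, i.e., it is genuinely nonlinear:

```python
import cmath

def kerr_layer(E, n0=1.5, n2=1e-2, d=1.0, lam=1.0):
    # Thin Kerr slab: each field sample acquires an intensity-dependent
    # phase, phi = 2*pi*(n0 + n2*|E|^2)*d/lam, following Eq. (16).
    out = []
    for e in E:
        n = n0 + n2 * abs(e) ** 2
        out.append(e * cmath.exp(2j * cmath.pi * n * d / lam))
    return out

E1 = [1.0 + 0j, 0.5 + 0j]
E2 = [0.5 + 0j, 1.0 + 0j]
lhs = kerr_layer([a + b for a, b in zip(E1, E2)])
rhs = [a + b for a, b in zip(kerr_layer(E1), kerr_layer(E2))]
# The layer is nonlinear: superposition does not hold.
print(any(abs(a - b) > 1e-6 for a, b in zip(lhs, rhs)))  # True
```

A purely linear optical element (n2 = 0) would pass this superposition test exactly; the intensity-dependent term is what provides the activation-like response.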

Photoelectric hybrid methods Dunning et al. (1991) processed video signals point by point with a frame grabber and image processor to implement programmable nonlinear activation functions. Larger et al. (2012) used an integrated telecom Mach-Zehnder modulator to provide an electro-optical nonlinear modulation transfer function for constructing optical neural networks. Antonik et al. (2019) modulated the phase of spatially extended plane waves with a spatial light modulator to improve the parallelism of the optical system, significantly increasing the scalability and processing speed of the network. Katumba et al. (2019) constructed the network's nonlinear operators from the nonlinearity of electro-optical detectors, achieving extremely high data modulation speed and large-scale network parameter updates. Williamson et al. (2019) and Fard et al. (2020) converted a small portion of the incident light into an electrical signal and modulated the original light signal with an electro-optical modulator to realize the network nonlinearity, increasing the operating bandwidth and computational speed of the system.

All-optical methods Skinner et al. (1994) implemented weighted connectivity and nonlinear mapping using Kerr-type nonlinear optical materials as thin layers separating free space, improving the response speed of optical neural networks. Saxena and Fiesler (1995) used a liquid crystal light valve (LCLV) to achieve the threshold effect of nonlinear functions and constructed an optical neural network that avoids the energy loss of photoelectric conversion. Vandoorne et al. (2008, 2014) used coupled semiconductor optical amplifiers (SOA) as basic blocks to achieve nonlinearity in all-optical neural networks, yielding networks with low power consumption, high speed, and high parallelism. Rosenbluth et al. (2009) used novel nonlinear optical fibers as thresholds to achieve nonlinear responses, overcoming the scalability problem of digital optical computation and the noise accumulation problem of analog optical computation. Mesaritakis et al. (2013), Denis-Le Coarer et al. (2018), and Feldmann et al. (2019) used the nonlinear refractive index variation of ring resonators to provide the network's nonlinear response, enabling optical neural networks with high integration and low power consumption. Lin et al. (2018) proposed building optical neural networks using only optical diffraction and passive optical components working in concert, avoiding powered layers and providing an efficient, fast way to implement machine learning tasks. Bao et al. (2011), Shen et al. (2017), and Schirmer and Gaeta (1997) exploited the saturable absorption properties of nanophotonic materials to achieve nonlinearity in networks. Miscuglio et al. (2018) discussed two approaches to achieving nonlinearity in all-optical neural networks, based on reverse saturable absorption and the electromagnetically induced transparency of nanophotonics. Zuo et al. (2019) used a spatial light modulator and Fourier lens to program linear operations, and the electromagnetically induced transparency of laser-cooled atoms for nonlinear optical activation functions.

3.1.3 Optical implementation of convolutional neural networks

By imitating the hierarchical information processing mechanism of biological vision, the convolutional neural network (CNN) has the properties of local perception and weight sharing, which significantly reduce computational complexity and give the network a stronger ability to fit complex nonlinear functions.

A deep convolutional neural network is proposed in Shan et al. (2018) to accelerate electromagnetic simulations, solving the 3D Poisson equation for the electrostatic potential distribution through the network's powerful ability to approximate nonlinear functions. Li et al. (2018) proposed a novel DNN architecture called DeepNIS for nonlinear inverse scattering problems (ISPs). DeepNIS consists of a cascade of multilayer complex-valued residual CNNs that imitate the multiple-scattering mechanism; it takes the EM scattering data collected by the receiver as input and outputs a super-resolution image of EM inverse scattering, mapping coarse images to precise solutions of the ISPs. Wei and Chen (2019) proposed a physics-inspired induced current learning method (ICLM) to solve full-wave nonlinear ISPs. In this method, a novel CEE-CNN convolutional network is designed, which feeds most of the induced currents directly to the output layer via skip connections and focuses on the remaining induced currents; a multi-label combined loss function reduces the nonlinearity of the objective function to accelerate convergence. Guo et al. (2021) proposed a complex-valued Pix2pix generative adversarial network consisting of a generator, built from multilayer complex-valued CNNs, and a discriminator, which computes the maximum likelihood between the original and reconstructed values. Through adversarial training, the generator captures more nonlinear features than a conventional CNN. Tsakyridis et al. (2024) provide an overview and discussion of the basics of photonic neural networks and optical deep learning, while Matuszewski et al. (2024) discuss the role of all-optical neural networks.

4 Deep neural network paradigms inspired by statistical physics

The field of artificial intelligence contains a wide range of algorithms and modeling tools to handle tasks in various fields and has become one of the most active subjects in recent years. In the previous chapters, we reviewed recent research on the intersection of artificial intelligence with classical mechanics and electromagnetism, including the conceptual development of artificial intelligence powered by physical insights, the application of artificial intelligence techniques to multiple domains in physics, and the intersection between these two domains. Below we describe how statistical physics can be used to understand AI algorithms and how AI can be applied to statistical physics. An overview of representative methods is shown in Table 3.

Table 3 An overview of methods for AI DNNs inspired by statistical physics

4.1 Nonequilibrium neural networks

The most general problem in nonequilibrium statistical physics is the detailed description of the time evolution of physical (chemical or astronomical) systems: for example, phenomena tending towards equilibrium states, the response of a system to external influences, metastability and instability due to fluctuations, pattern formation and self-organization, the emergence of probabilities in place of deterministic descriptions, and open systems. Nonequilibrium statistical physics has created concepts and models that are relevant not only to physics but also to information science, technology, biology, medicine, and the social sciences, and that even bear on fundamental philosophical questions.

4.1.1 Neural networks understood from entropy

Entropy Proposed by the German physicist Clausius in 1865, entropy was first a basic concept in the development of thermodynamics. Its essence is the "inherent degree of disorder" of a system: the more disordered the system, the more information is needed to describe it, the more difficult it is to predict, and the greater the information entropy. It is denoted S in formulas. Entropy summarizes a basic law of the universe: systems tend spontaneously to become more disordered, so entropy continually increases; this is the principle of entropy increase.

Boltzmann distribution In 1877, Boltzmann proposed a physical explanation of entropy: it is a macroscopic property of the system, obtained as the equal-probability statistical average over all possible microstates.

Information entropy (learning cost) With the development of statistical physics and information theory, Shannon extended the concept of entropy from statistical physics to channel communication Shannon (1948) in 1948 and proposed information entropy; the universal significance of entropy then became even more apparent.
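As a concrete illustration, the following minimal sketch (the function name `shannon_entropy` is ours, not from any cited work) computes Shannon's information entropy of a discrete distribution:

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy H = -sum(p * log(p)) of a discrete distribution.

    Terms with p == 0 contribute nothing (the limit of p*log p as p -> 0 is 0).
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# A fair coin carries 1 bit per toss; a biased coin carries less, i.e. it is
# easier to predict, matching the entropy-as-disorder picture above.
fair = shannon_entropy([0.5, 0.5])       # 1 bit
biased = shannon_entropy([0.9, 0.1])     # below 1 bit
uniform8 = shannon_entropy([1 / 8] * 8)  # log2(8) = 3 bits
```

The uniform distribution maximizes entropy for a given number of outcomes, which is why redundancy reduction (next paragraph) can be read as entropy minimization.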

In deep learning, the rate at which a model absorbs information is fixed, so the only way to speed up learning is to reduce the amount of redundant information in the learning target. “Removing the dross and keeping the essentials” is the principle of minimum entropy in deep learning models, and can be understood as removing unnecessary learning costs (Fig. 10).

Fig. 10 De-redundancy

Algorithms inspired by the principle of minimum entropy include InfoMap, which uses information entropy to define the shortest code length (Rosvall et al. 2009; Rosvall and Bergstrom 2008), cost minimization (Kuhn 1955; Riesen and Bunke 2009), Word2Vec (Mikolov et al. 2013a, b), and t-SNE dimensionality reduction Maaten and Hinton (2008).

4.1.2 Chaotic neural networks

Chaos refers to the unpredictable, seemingly random motion of a deterministic dynamical system that arises from its sensitivity to initial values. Poole et al. (2016), published at NIPS in 2016, combine Riemannian geometry and dynamical mean-field theory Sompolinsky et al. (1988) to analyze how signals propagate through random deep networks, mapping the weight and bias variances in a phase plane. This work reveals a dynamical phase transition of signal propagation between ordered and chaotic regimes. Lin and Chen (2009) proposed a chaotic dynamic neural network based on a sinusoidal activation function, which differs from other models and has strong memory storage and retrieval capabilities. Keup et al. (2021) develop a statistical mean-field theory for random networks to treat transient chaos.
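The order-to-chaos transition studied by Poole et al. (2016) rests on a mean-field recursion for the pre-activation variance, \(q_{l+1} = \sigma_w^2\,\mathbb{E}_z[\tanh(\sqrt{q_l}\,z)^2] + \sigma_b^2\) with \(z \sim N(0,1)\). A rough sketch (function names are ours, and the Gaussian expectation is estimated by Monte Carlo rather than quadrature):

```python
import math
import random

def variance_map(q, sigma_w, sigma_b, n_samples=20000, seed=0):
    """One step of the mean-field variance recursion
    q_{l+1} = sigma_w^2 * E_z[tanh(sqrt(q_l) z)^2] + sigma_b^2,  z ~ N(0,1)."""
    rng = random.Random(seed)
    s = math.sqrt(max(q, 0.0))
    mean = sum(math.tanh(s * rng.gauss(0, 1)) ** 2
               for _ in range(n_samples)) / n_samples
    return sigma_w ** 2 * mean + sigma_b ** 2

def fixed_point(sigma_w, sigma_b, q0=1.0, n_layers=50):
    """Iterate the map layer by layer until the pre-activation variance settles."""
    q = q0
    for layer in range(n_layers):
        q = variance_map(q, sigma_w, sigma_b, seed=layer)
    return q
```

Small weight variance drives the signal variance to zero (ordered regime), while large weight variance sustains fluctuations layer after layer (chaotic regime).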

4.1.3 From Ising models to Hopfield networks

In everyday life we see phase transitions everywhere: matter changing from one phase to another. For example, liquid water cools to form ice, or is heated and evaporates into water vapor (liquid to solid, liquid to gas). According to Landau’s theory, a phase transition must be accompanied by a change in some kind of “order”. For example, liquid water molecules are arranged haphazardly; once frozen, they occupy regular, ordered lattice positions (molecules vibrate near their lattice sites but do not stray far from them). Crystalline order is thus created during the liquid-solid phase transition, as shown in Fig. 11.

Fig. 11 Liquid-solid phase transition

Another important example of a phase transition is the ferromagnetic phase transition: the process in which a magnet (ferromagnetic phase) loses its magnetism during heating and becomes paramagnetic. In the reverse, ferromagnetic transition (Fig. 12), the atomic spin orientations change from the random state of the paramagnetic phase to a common direction, so the ferromagnetic transition is accompanied by the emergence of spin-orientation order, which produces the macroscopic magnetism (spontaneous magnetization) of the material. According to Landau’s theory, the order parameter changes continuously in a continuous phase transition and discontinuously in a discontinuous one.

Fig. 12 Ferromagnetic phase transitions and spin-glass phase transitions. The grey edges represent ferromagnetic interactions, and the red edges represent antiferromagnetic interactions

Exactly 100 years ago, the mathematical key to the phase-transition problem appeared: the “primary version” of the spin-glass model, the Ising model (the basic model of phase transitions). The Ising model (also called the Lenz-Ising model) is one of the most important models in statistical physics. Between 1920 and 1924, Wilhelm Lenz and Ernst Ising proposed the Ising model to describe the stochastic process of phase transitions in matter. Taking the two-dimensional Ising lattice as an example, the spin \(s_i\) at any site can take two values \(\pm 1\) (spin up or down) and interacts only with its adjacent sites (with interaction strength \(J\)); the energy of the system (the Hamiltonian) is \(H = -J\sum_{\langle i,j \rangle} s_i s_j\), where the sum runs over pairs of neighboring sites. If all spins point in the same direction, the Hamiltonian is at its minimum and the system is in the ferromagnetic phase. Likewise, the second law of thermodynamics tells us that, at a fixed temperature, the system seeks the configuration that minimizes its free energy, and the Gibbs-Bogoliubov-Feynman inequality can be used to perform variational inference on the Ising model to obtain an optimal solution. In 1982, Hopfield, inspired by the Ising model, proposed the Hopfield neural network Hopfield (1982), which can solve a large class of pattern-recognition problems and give approximate solutions to a class of combinatorial optimization problems. Its weights simulate the couplings between adjacent spins in the Ising model, and its neuron updates simulate the cell updates of the Ising model. The units of the (fully connected) Hopfield network are binary, taking values -1 or 1 (or 0 or 1); the network also provides a model of human memory (the analogy between the Ising model and the Hopfield network is shown in Fig. 13).

Fig. 13 Ising model and Hopfield network analogy diagram
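As a toy illustration of the Ising Hamiltonian and its thermal dynamics (ours, not from the cited works), here is a minimal Metropolis single-spin-flip simulation on a periodic lattice:

```python
import math
import random

def ising_energy(spins, J=1.0):
    """Hamiltonian H = -J * sum over nearest-neighbour pairs s_i s_j
    on an L x L lattice with periodic boundaries (each bond counted once)."""
    L = len(spins)
    E = 0.0
    for i in range(L):
        for j in range(L):
            s = spins[i][j]
            E -= J * s * (spins[(i + 1) % L][j] + spins[i][(j + 1) % L])
    return E

def metropolis_sweep(spins, T, J=1.0, rng=random):
    """One Metropolis sweep: propose single-spin flips, accept with
    probability min(1, exp(-dE / T))."""
    L = len(spins)
    for _ in range(L * L):
        i, j = rng.randrange(L), rng.randrange(L)
        nb = (spins[(i + 1) % L][j] + spins[(i - 1) % L][j]
              + spins[i][(j + 1) % L] + spins[i][(j - 1) % L])
        dE = 2.0 * J * spins[i][j] * nb  # energy cost of flipping s_ij
        if dE <= 0 or rng.random() < math.exp(-dE / T):
            spins[i][j] *= -1
    return spins
```

At temperatures well below the critical point, repeated sweeps drive the lattice toward aligned, low-energy configurations, the ferromagnetic ordering described above.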

Hopfield formed a new computational paradigm with the idea of an energy function and clarified the relationship between neural networks and dynamical systems. He used nonlinear dynamics to study the characteristics of this neural network and established a stability criterion for it, pointing out that information is stored in the connections between the neurons of the network; the result is the so-called Hopfield network. By comparing the feedback network with the Ising model of statistical physics, the up and down directions of the magnetic spins are regarded as the two states of activation and inhibition of a neuron, and the interactions between spins are regarded as the synaptic weights between neurons. This analogy paved the way for a large number of physical theories, and many physicists, to enter the field of neural networks. In 1984, Hopfield designed and built a circuit implementation of the Hopfield network model, pointing out that neurons can be implemented with operational amplifiers and that the connections among all neurons can be simulated by electronic circuits; this is called the continuous Hopfield network. Using this circuit, Hopfield successfully tackled the traveling salesman problem (TSP), a classic computational puzzle (optimization problem).
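A minimal sketch of Hebbian storage and asynchronous recall in a Hopfield network (function names ours) shows how the Ising-like energy drives pattern retrieval:

```python
def train_hopfield(patterns):
    """Hebbian learning: w_ij = sum over patterns of x_i * x_j, zero diagonal.
    These weights play the role of the Ising couplings J_ij."""
    n = len(patterns[0])
    w = [[0.0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, n_iters=10):
    """Asynchronous updates: each neuron takes the sign of its local field,
    which never increases the energy E = -1/2 * sum w_ij s_i s_j."""
    s = list(state)
    n = len(s)
    for _ in range(n_iters):
        for i in range(n):
            field = sum(w[i][j] * s[j] for j in range(n))
            if field != 0:
                s[i] = 1 if field > 0 else -1
    return s
```

Storing a pattern makes it a fixed point (an energy minimum); a corrupted version relaxes back to it, which is the associative-memory behavior described above.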

Liu et al. (2019) discuss an image encryption algorithm based on the Hopfield chaotic neural network. This algorithm simultaneously scrambles and diffuses color images by utilizing the iterative process of a neural network to modify the pixel values. The encryption process results in highly randomized and complex encrypted images. During decryption, the original image is restored by reversing the iterative process of the Hopfield neural network.

In 2023, Lin et al. (2023) reviewed research on chaotic systems based on memristive Hopfield neural networks, exploring how such networks, which incorporate memristors to preserve resistance changes, can be used to construct chaotic systems. The article discusses the properties and applications of chaotic systems obtained by adjusting network parameters and connection weights. These studies offer new ideas and methods for understanding and applying image encryption and chaotic systems. Ma et al. (2024) proposed a variational autoregressive architecture with a message-passing mechanism, which can effectively exploit the interactions between spin variables. Laydevant et al. (2024) train Ising machines in a supervised manner via an equilibrium propagation algorithm, which has the potential to enhance machine learning applications.

4.1.4 Classic simulated annealing algorithms

Physical annealing process: the object starts in an amorphous state; the solid is heated to a sufficiently high temperature to become disordered, and is then cooled slowly, annealing into a crystal (an equilibrium state).

The simulated annealing algorithm was first proposed by Metropolis et al.; in 1983, Kirkpatrick et al. applied it to combinatorial optimization, forming the classical simulated annealing algorithm Kirkpatrick et al. (1983). It exploits the similarity between the annealing of solid matter in physics and general optimization problems: starting from some initial temperature, as the temperature continually decreases, it searches the solution space using the probabilistic jump property of the Metropolis criterion (accepting a worse state with a certain probability), and settles on the optimal solution with probability 1 (Fig. 14).

Fig. 14 Global optimal search process
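The search process of Fig. 14 can be sketched as follows; this is a toy implementation with an illustrative multimodal test function (`rugged`, the cooling schedule, and all parameter values are our choices, not from Kirkpatrick et al.):

```python
import math
import random

def simulated_annealing(f, x0, t_start=10.0, t_end=1e-3, alpha=0.95,
                        step=1.0, iters_per_temp=50, seed=0):
    """Minimize f over the reals with the Metropolis acceptance rule:
    always accept improvements; accept a worse state with prob exp(-dE/T)."""
    rng = random.Random(seed)
    x, fx = x0, f(x0)
    best_x, best_f = x, fx
    T = t_start
    while T > t_end:
        for _ in range(iters_per_temp):
            cand = x + rng.uniform(-step, step)
            fc = f(cand)
            if fc < fx or rng.random() < math.exp(-(fc - fx) / T):
                x, fx = cand, fc
                if fx < best_f:
                    best_x, best_f = x, fx
        T *= alpha  # geometric cooling schedule
    return best_x, best_f

# A multimodal test function with global minimum 0 at x = 0 and many
# local minima that would trap plain greedy descent.
rugged = lambda x: x * x + 10.0 * (1.0 - math.cos(2.0 * math.pi * x))
```

At high temperature the chain hops freely between basins; as the temperature drops, uphill moves are increasingly rejected and the search concentrates around the best basin found.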

Importance Sampling (IS) is an effective variance reduction algorithm for rare events, as described in the seminal work by Marshall (1954). The fundamental concept of IS involves approximating the computation by taking a random weighted average of a simpler distribution function, representing the objective function’s mathematical expectation.
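A minimal sketch of importance sampling for a rare event, here estimating a Gaussian tail probability by sampling from a shifted proposal and reweighting (function names ours):

```python
import math
import random

def normal_pdf(x, mu=0.0, sigma=1.0):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def importance_sample_tail(threshold, n=100000, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0,1) by drawing from the shifted
    proposal N(threshold, 1) and reweighting by the density ratio p(x)/q(x)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.gauss(threshold, 1.0)  # proposal centred on the rare region
        if x > threshold:
            total += normal_pdf(x) / normal_pdf(x, mu=threshold)
    return total / n
```

For P(Z > 4), roughly 3e-5, naive Monte Carlo with the same budget would see only a handful of hits, while the reweighted estimator attains sub-percent relative error; this is the variance reduction the text refers to.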

Inspired by the idea of annealing, Radford proposed Annealed Importance Sampling (AIS) Salakhutdinov and Murray (2008) as a solution to address the high bias associated with IS. AIS, along with its extension known as Hamiltonian Annealed Importance Sampling (HAIS) Sohl-Dickstein and Culpepper (2012), represents generalizations of IS that enable the computation of unbiased expectations by reweighting samples from tractable distributions.

In AIS, a bridge of intermediate distributions is constructed between forward and reverse Markov chains, connecting the two distributions of interest. By leveraging the connections between the forward and reverse chains, AIS provides lower-variance estimates than IS alone, and thus improved accuracy and efficiency when estimating expectations for challenging problems involving rare events.

Later work followed, including Ranzato’s mcRBM model (2010) Bengio et al. (2013), Sohl-Dickstein’s non-equilibrium diffusion model (2015) Sohl-Dickstein et al. (2015), and Menick’s self-scaling pixel-network autoregressive model (2016) Oord et al. (2016). To adapt network null models to weighted-network inference, Milisav et al. (2024) proposed a simulated annealing procedure that generates random networks while preserving the strength sequence. The simulated annealing algorithm is widely used and can efficiently handle NP-complete problems such as the Travelling Salesman Problem, the Max-Cut Problem, the 0-1 Knapsack Problem, and the Graph Colouring Problem.

4.1.5 Boltzmann machine neural networks

Hinton and colleagues proposed the Boltzmann Machine (BM) in 1985; in physics, the BM is often referred to as the inverse Ising model. The BM is a special form of log-linear Markov random field (MRF), i.e., its energy function is linear in its free parameters. It introduces statistical probability into the state changes of neurons, the equilibrium state of the network obeys the Boltzmann distribution, and the network’s operation is based on a simulated annealing algorithm (Fig. 15); it is a good global-optimum search method and is widely used within its range of applicability. See Nguyen et al. (2017) for recent research on Boltzmann machines.

Fig. 15 BM composition and structure diagrams

A Restricted Boltzmann Machine (RBM) is a type of Boltzmann Machine with a specific structure and interaction pattern between its neurons. In an RBM, the visible-layer neurons and the hidden-layer neurons are two sets of variables that interact through couplings. Unlike a general BM, where all neurons can interact with each other, an RBM restricts interactions to occur exclusively between the visible and hidden units.

The RBM’s goal is to adjust its parameters in a way that maximizes the likelihood of the observed data. By learning the weights and biases of the connections between the visible and hidden units, the RBM aims to capture and represent the underlying patterns and dependencies present in the data. Through an iterative learning process, the RBM adjusts its parameters to improve the likelihood of generating the observed data and, consequently, enhance its ability to model and generate similar data instances.
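The likelihood maximization described above is usually approximated with contrastive divergence; below is a toy CD-1 sketch in pure Python (the class `TinyRBM` and all hyperparameters are our illustrative choices):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyRBM:
    """A restricted Boltzmann machine with binary visible/hidden units,
    trained by one-step contrastive divergence (CD-1)."""

    def __init__(self, n_vis, n_hid, seed=0):
        self.rng = random.Random(seed)
        self.W = [[self.rng.gauss(0, 0.1) for _ in range(n_hid)]
                  for _ in range(n_vis)]
        self.b_vis = [0.0] * n_vis
        self.b_hid = [0.0] * n_hid

    def hid_probs(self, v):
        return [sigmoid(self.b_hid[j] + sum(v[i] * self.W[i][j] for i in range(len(v))))
                for j in range(len(self.b_hid))]

    def vis_probs(self, h):
        return [sigmoid(self.b_vis[i] + sum(h[j] * self.W[i][j] for j in range(len(h))))
                for i in range(len(self.b_vis))]

    def sample(self, probs):
        return [1 if self.rng.random() < p else 0 for p in probs]

    def cd1_update(self, v0, lr=0.1):
        """Positive phase uses the data; negative phase uses one Gibbs step."""
        ph0 = self.hid_probs(v0)
        h0 = self.sample(ph0)
        v1 = self.vis_probs(h0)   # probabilities used as the reconstruction
        ph1 = self.hid_probs(v1)
        for i in range(len(v0)):
            for j in range(len(ph0)):
                self.W[i][j] += lr * (v0[i] * ph0[j] - v1[i] * ph1[j])
        for i in range(len(v0)):
            self.b_vis[i] += lr * (v0[i] - v1[i])
        for j in range(len(ph0)):
            self.b_hid[j] += lr * (ph0[j] - ph1[j])

    def reconstruct(self, v):
        """Mean-field reconstruction: visible probs given hidden probs."""
        return self.vis_probs(self.hid_probs(v))
```

Trained on two anti-correlated patterns, the hidden units learn to separate them and reconstructions concentrate near the training data.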

Regarding RBMs, there are many studies in physics that shed light on how they work and what structures they can learn. Since Professor Hinton proposed contrastive divergence, a fast learning algorithm for RBMs, many variant models of the RBM have been proposed to enhance its expressive power and to account for specific data structures (Bengio 2009; Ranzato et al. 2010; Ranzato and Hinton 2010). The Convolutional Restricted Boltzmann Machine (CRBM) Lee et al. (2009) is a breakthrough for the RBM model: it uses filters and image-convolution operations to share weights and reduce the number of model parameters. Since most hidden-unit states learned by an RBM are not activated (non-sparse), researchers combined the idea of sparse coding, adding a sparsity penalty to the original RBM log-likelihood, and proposed the sparse RBM model Lee et al. (2007), the sparse group restricted Boltzmann machine (SGRBM) model Salakhutdinov et al. (2007), and the LogSumRBM model Ji et al. (2014), among others. In the articles (Cocco et al. 2018; Tubiana and Monasson 2017), the authors investigate a stochastic RBM model with random, sparse, unlearned weights. Surprisingly, they find that even a single-layer RBM can capture compositional structure through its hidden layer, highlighting the expressive power of RBMs in representing complex data.

Additionally, the relationship between RBMs with random weights and the Hopfield model is explored in Barra et al. (2018), Mézard (2017). These studies demonstrate the connections and similarities between RBMs and the Hopfield model, shedding light on the underlying mechanisms and properties of both models.

Overall, these works provide insights into the capabilities of RBMs with random weights in capturing compositional structures and their connections to the Hopfield model. Such research enhances our understanding of RBMs and their potential applications in various domains.

4.2 Designing neural networks with energy models

According to physics, the steady state of a system is the state with the lowest potential energy; that is, a steady state corresponds to the minimum of some energy. Transplanting this idea into networks leads to the definition of an energy function that is minimized when the network is in its steady state.

In 2006, LeCun et al. reviewed energy-based models and their applications. When such a model reaches its optimal solution, it is in the lowest-energy state (that is, it seeks to minimize the energy of positive data and maximize the energy of negative data) LeCun et al. (2006). The task is to find the configuration of the hidden variables that minimizes the energy given the observed variables (inference), and to find an appropriate energy function such that observed configurations have lower energy than unobserved ones (learning).

Normalizing probability distributions is difficult in high-dimensional spaces, which has led to interesting approaches to generative modeling of data Pernkopf et al. (2014). With normalizing flows (Dinh et al. 2014, 2016; Rezende et al. 2016), the normalization can still be carried out analytically; these methods are surveyed in Wang (2018).

4.2.1 Generative adversarial networks (GANs)

In 2014, Goodfellow et al. proposed the GAN Goodfellow et al. (2014), which aims to generate samples of the same type as the training set; it essentially replaces explicit evaluation of probabilities with the judgments of a learned discriminator, so that unsupervised learning can exploit knowledge acquired through a supervised learning process. Physics-inspired GAN research is beginning to emerge; for example, Wang et al. (2019) generalize the perceptron in an interpretable model of GANs using early statistical-physics work on online learning.

Both the discriminator and generator of the Deep Convolutional Generative Adversarial Network (DCGAN) Radford et al. (2015) use CNNs to replace the multilayer perceptrons in the GAN, connecting supervised and unsupervised learning. CycleGAN Zhu et al. (2017) can achieve mode conversion between a source domain and a target domain without establishing a one-to-one mapping between training data. GCGAN Fu et al. (2019) adds convolution constraints to the original GAN, which can stabilize the learning configuration. WGAN Arjovsky et al. (2017) improves the GAN loss function and also achieves good performance with fully connected layers.
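For reference, the GAN minimax value \(V(D,G) = \mathbb{E}[\log D(x)] + \mathbb{E}[\log(1 - D(G(z)))]\) and its optimal discriminator for known densities, both from Goodfellow et al. (2014), can be sketched as (function names ours):

```python
import math

def gan_value(d_real, d_fake):
    """Empirical GAN value: mean(log D(x_real)) + mean(log(1 - D(x_fake)))."""
    v_real = sum(math.log(d) for d in d_real) / len(d_real)
    v_fake = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return v_real + v_fake

def optimal_discriminator(p_data, p_gen):
    """For known densities, the discriminator maximizing V is
    D*(x) = p_data(x) / (p_data(x) + p_gen(x))."""
    return p_data / (p_data + p_gen)
```

At the equilibrium where the generator matches the data distribution, D* outputs 1/2 everywhere and the value is -log 4, the global optimum derived in the original paper.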

4.2.2 Variational autoencoder models (VAEs)

An autoencoder (AE) is a feedforward neural network that aims to find a concise representation of the data that still preserves the salient features of each sample; an autoencoder with linear activations is closely related to PCA. The VAE Kingma and Welling (2013) combines variational inference and autoencoders to provide a deep generative model for the data, generating target data X from latent variables Z, and can be trained in an unsupervised manner. The VAE is close to a physicist’s way of thinking: the autoencoder is represented by a graphical model and trained by inference with latent variables and a variational prior (Cinelli et al. 2021; Vahdat and Kautz 2020). Rezende et al. (2014) is a foundational reference for understanding VAEs.
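When both the approximate posterior and the prior are Gaussian, the KL regularizer in the VAE objective has a well-known closed form; a small sketch (function names ours):

```python
import math

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, sigma^2) || N(0, 1) ) summed over latent
    dimensions, the regularizer in the VAE objective:
    KL = -1/2 * sum(1 + log sigma^2 - mu^2 - sigma^2)."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))

def elbo(recon_log_likelihood, mu, log_var):
    """Evidence lower bound: reconstruction term minus the KL regularizer."""
    return recon_log_likelihood - gaussian_kl(mu, log_var)
```

Maximizing the ELBO trades reconstruction quality against keeping the encoder's posterior close to the prior, the variational-inference structure the text describes.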

An interesting approach to generative modeling involves decomposing the probability distribution into a product of one-dimensional conditional distributions in the autoregressive model, as discussed in the work by Van Oord et al. (2016). This decomposition allows for efficient modeling of complex high-dimensional data, such as images, by sequentially generating each dimension conditioned on the previous dimensions.

In the context of variational autoencoders (VAEs), another intriguing approach is to replace the posterior distribution with a tractable variational approximation. This idea was introduced in the seminal works by Kingma and Welling (2013), Gregor et al. (2014), and Rezende et al. Ozair and Bengio (2014). By introducing an encoder network that maps the input data to a latent space and a decoder network that reconstructs the data from the latent space, VAEs enable efficient and scalable generative modeling.

These techniques, namely decomposing probability distributions in autoregressive models and using tractable variational approximations in VAEs, offer interesting and effective strategies for generative modeling. They provide insights into modeling complex data distributions and have found applications in various domains, including image generation and data synthesis.

4.2.3 Auto-regressive generative models

The auto-regressive generative model (Van Oord et al. 2016; Salimans et al. 2017) is a tractable method for modeling distributions that allows maximum-likelihood training without latent random variables; the conditional probability distributions are represented by a neural network. Since these models form a family of explicit probability distributions, direct and unbiased sampling is possible. Applications of these models have been realized in statistics Wu et al. (2019) and in quantum physics problems Sharir et al. (2020).

Neural Autoregressive Distribution Estimation (NADE) is an unsupervised neural network built on top of autoregressive models and feedforward neural networks Zhang et al. (2019), which is a tractable and efficient estimator for modeling data distribution and density.
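Ancestral sampling and explicit likelihood evaluation under the chain-rule factorization \(p(x) = \prod_t p(x_t \mid x_{<t})\) can be sketched with hypothetical conditional tables for three binary variables (all probability values below are illustrative, not from any cited model):

```python
import math
import random

# Hypothetical conditional tables: each entry is p(x_t = 1 | x_{<t}),
# indexed by the tuple of previous values.
P1 = 0.7
P2 = {(0,): 0.2, (1,): 0.9}
P3 = {(0, 0): 0.5, (0, 1): 0.3, (1, 0): 0.4, (1, 1): 0.8}

def log_prob(x):
    """Explicit likelihood from the chain rule: p(x) = prod_t p(x_t | x_{<t})."""
    ps = [P1, P2[(x[0],)], P3[(x[0], x[1])]]
    lp = 0.0
    for p, xt in zip(ps, x):
        lp += math.log(p if xt == 1 else 1.0 - p)
    return lp

def sample(rng):
    """Ancestral sampling: draw each variable from its conditional in order."""
    x1 = 1 if rng.random() < P1 else 0
    x2 = 1 if rng.random() < P2[(x1,)] else 0
    x3 = 1 if rng.random() < P3[(x1, x2)] else 0
    return (x1, x2, x3)
```

In NADE and PixelCNN-style models the conditional tables are replaced by a neural network, but the explicit likelihood and direct sampling work exactly as in this toy version.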

4.2.4 RG-RBM models

In a 2014 paper by Mehta and Schwab (2014), the concept of renormalization is applied to explain the performance of deep learning models. Renormalization is a technique used to study physical systems when detailed information about their microscopic components is unavailable, providing a coarse-grained understanding of the system’s behavior across different length scales.

The authors propose that deep neural networks (DNNs) can be viewed as iterative coarse-graining schemes, similar to the renormalization group (RG) theory. In this context, each new high-level layer of the neural network learns increasingly abstract and high-level features from the input data. They argue that the process of extracting relevant features in deep learning is fundamentally the same as the coarse-graining process in statistical physics, as DNNs effectively mimic this process.
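The coarse-graining the authors have in mind can be illustrated with a block-spin (majority-rule) renormalization step on a spin lattice; a minimal sketch (ours, not from Mehta and Schwab):

```python
def block_spin(spins, block=2):
    """One renormalization-group step by majority rule: each block x block
    patch of +/-1 spins is coarse-grained to the sign of its sum (ties -> +1)."""
    L = len(spins)
    out = []
    for bi in range(0, L, block):
        row = []
        for bj in range(0, L, block):
            s = sum(spins[bi + di][bj + dj]
                    for di in range(block) for dj in range(block))
            row.append(1 if s >= 0 else -1)
        out.append(row)
    return out
```

Each application halves the lattice size while keeping large-scale structure, analogous to how successive DNN layers discard microscopic detail and retain increasingly abstract features.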

The paper highlights the close connection between RG and Restricted Boltzmann Machines (RBM) and suggests a possible integration of the physical conceptual framework with neural networks. This mapping between RG and RBM provides insights into the relationship between statistical physics and deep learning.

Overall, Mehta and Schwab’s work demonstrates how renormalization can be applied to understand the performance of deep learning models. It emphasizes the similarity between feature extraction in deep learning and the coarse-graining process in statistical physics. The mapping between RG and RBM offers a potential explanation for the combination of physical concepts and neural networks.

4.3 Dissipative structure neural networks

The theory of self-organization holds that when an open system reaches a nonlinear regime far from equilibrium, then once some parameter of the system crosses a threshold, the system can undergo an abrupt change through fluctuations, passing from disorder to order and producing self-organization phenomena such as chemical oscillations. The theory comprises dissipative structures (disorder to order), synergetics (the cooperation of the system's elements), and catastrophe theory (abrupt change at a threshold).

The self-organizing feature map (SOM) (Kohonen 1989, 1990) was proposed by Professor Kohonen. When the neural network receives external input, the SOM divides into different regions, each with different response characteristics to the input pattern. It self-organizes, adaptively changing the network's parameters and structure by automatically discovering the inherent regularities and essential attributes of the samples. The self-organizing (competitive) neural network is an artificial neural network that simulates the corresponding functions of the biological nervous system: in its learning algorithm, it mimics the dynamics of excitation, coordination, inhibition, and competition among biological neurons in information processing to guide the network's learning. Since the SOM can visualize high-dimensional data and effectively compress information for transmission, Kohonen et al. (1996) summarize some of its engineering applications.
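One SOM learning step, find the best matching unit (BMU) and pull every unit toward the input with a Gaussian neighborhood factor, can be sketched as follows (function name and parameter values are our illustrative choices):

```python
import math

def som_step(weights, x, bmu_lr=0.5, sigma=1.0):
    """One SOM update on a 1-D grid of units: find the best matching unit,
    then move each unit towards the input, scaled by a Gaussian
    neighbourhood factor that decays with grid distance from the BMU."""
    dists = [sum((wi - xi) ** 2 for wi, xi in zip(w, x)) for w in weights]
    bmu = dists.index(min(dists))
    for k, w in enumerate(weights):
        h = bmu_lr * math.exp(-((k - bmu) ** 2) / (2.0 * sigma ** 2))
        for i in range(len(w)):
            w[i] += h * (x[i] - w[i])
    return bmu
```

Repeated presentations of similar inputs carve out a region of the grid that responds to them, which is the competitive region formation described above; in practice the learning rate and neighborhood width are also decayed over time.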

A dissipative structure arises when a system far from thermodynamic equilibrium, under suitable external conditions and through nonlinear interactions within the system, forms a new ordered structure via an abrupt transition; dissipative structures are an important topic of non-equilibrium statistics within physics. In 2017, Amemiya et al. discovered and outlined the role of glycolytic oscillations in cell rhythms and cancer cells Amemiya et al. (2017). In the same year, Kondepudi et al. discussed the relevance of dissipative structures to understanding organisms and proposed a voltage-driven system Kondepudi et al. (2017) that exhibits behaviors surprisingly similar to those seen in organisms. Also in 2017, Budroni and De Wit discussed how the interplay between reaction and diffusion produces localized spatiotemporal patterns Budroni and De Wit (2017) when different reactants come into contact with each other.

4.4 Random surface neural networks

In the field of artificial intelligence, early research was heavily influenced by the theoretical guarantees offered by optimization over convex landscapes, where each local minimum is also a global minimum Boyd et al. (2004). However, when dealing with non-convex surfaces, the presence of high-error local minima can impact the dynamics of gradient descent and affect the overall performance of optimization algorithms.

The statistical physics of smooth random Gaussian surfaces in high-dimensional spaces has been extensively studied, yielding various surface models that connect spatial information to probability distributions (Bray and Dean 2007; Fyodorov and Williams 2007). These models provide insights into the behavior and properties of non-convex surfaces, shedding light on the challenges posed by high-dimensional optimization problems.

Choromanska et al. studied the connection between the neural-network error surface and statistical physics, namely its relation to the energy function of spherical spin glasses Choromanska et al. (2015).

In 2014, Dauphin, Pascanu, et al. proposed the saddle-free Newton method (SFN) Dauphin et al. (2014) to address the fact that high-dimensional non-convex optimization is dominated by saddle points rather than local minima; SFN can quickly escape the saddle points that slow down gradient descent. Furthermore, Kawaguchi (2016) extends these landscape analyses to deeper networks.

By examining the statistical physics of random surfaces, researchers have gained a better understanding of the complex landscapes encountered in non-convex optimization. This knowledge has implications for improving optimization algorithms and enhancing the performance of artificial intelligence systems operating in high-dimensional spaces.

To summarize, research in statistical physics has explored different surface models to analyze the behavior of non-convex optimization landscapes. Understanding the properties of these surfaces is important not only for solving the challenges associated with high-dimensional optimization problems, but also for improving the performance of artificial intelligence algorithms.

4.5 Free energy surface (FES) neural networks

Free energy refers to the part of a system's internal energy that can be converted into external work during a certain thermodynamic process; it measures the "useful energy" that the system can output during that process. It can be divided into Helmholtz free energy and Gibbs free energy. The partition function is directly related to the free energy through \(F = -k_B T \ln Z\).

In the context of energy-based models, several approaches have been proposed to address the challenge of computing with the free energy. These methods aim to train models effectively despite the computational difficulties associated with estimating it. Some notable approaches include:

Exhaustive Monte Carlo: This method involves sampling from the model’s distribution using Monte Carlo techniques, which can be computationally expensive for high-dimensional datasets.

Contrastive Divergence (CD) and its variants: CD is a popular heuristic proposed by Hinton (2002) for training energy-based models. It approximates the gradient of the model’s parameters by performing a few steps of Gibbs sampling. Variants of CD, such as Persistent Contrastive Divergence (PCD) Tieleman and Hinton (2009), aim to improve the training process by maintaining a persistent chain of samples.

Score Matching: This approach, introduced by Hyvärinen and Dayan (2005), estimates the model’s parameters by matching the gradient of the model’s log-density (the score) to that of the data distribution.

Pseudo-Likelihood: Proposed by Besag (1975), this method approximates the likelihood of the model by considering the conditional probabilities of each variable given the others.

Minimum Probability Flow Learning (MPF): MPF, based on non-equilibrium statistical mechanics, is a technique for training energy-based models introduced by Battaglino (2014) and Sohl-Dickstein et al. (2011). It minimizes the difference between the model’s distribution and the data distribution using flow-based dynamics.

Machine learning methods learn the FES of a system as a function of collective variables to optimize AI algorithms. Using a neural-network functional representation of the FES, the sampling of high-dimensional spaces can be improved. For example, Schneider et al. proposed a learnable FES to predict the NMR spin-spin coupling of solid xenon under pressure Schneider et al. (2017). In 2018, Sidky et al. proposed a small neural network for the FES that can be trained iteratively on data points generated by dynamic (real-time) adaptive sampling Sidky and Whitmer (2018); this model verifies that, as new data are generated, a smooth representation of the full configuration space can be obtained. Wehmeyer and Noé (2018) propose a time-lagged autoencoder approach to identify slowly changing collective variables, illustrated on peptide folding. In 2018, Mardt et al. proposed a variational neural-network approach to identify important dynamical processes in protein-folding simulations, providing a unified framework for coordinate transformation and FES exploration Mardt et al. (2018) and insight into the underlying dynamics of the system. In 2019, Noé et al. proposed using Boltzmann generators to sample the equilibrium distribution of the collective space and thus represent the distribution of states on the FES Noé et al. (2019).

Despite these advances, training expressive energy-based models on high-dimensional datasets remains a challenging task. Ongoing research aims to develop more efficient and effective training methods to tackle this open challenge in the field.

4.6 Knowledge distillation to optimize neural networks

For neural networks, the larger the model and the deeper the layers, the stronger the learning ability. To extract features from large amounts of redundant data, CNNs often require excessive parameters and large models for training. However, a good model structure is hard to design by hand, so model optimization has become an important way to address this problem.

Knowledge distillation In 2015, Hinton's pioneering work on Knowledge Distillation (KD) promoted the development of model optimization Hinton et al. (2015). Knowledge distillation is analogous to heating distillation in physics, which extracts the effective substance: it transfers the knowledge of a large model (the teacher network) to a small model (the student network), making the model easy to deploy. During distillation, the small model learns the generalization ability of the large model, speeds up inference, and retains performance close to that of the large model (Fig. 16).

Fig. 16 Knowledge distillation process

4.6.1 Knowledge distillation neural networks

In 2017, Huang et al. of TuSimple proposed a distillation algorithm that transfers knowledge by aligning the distribution of neuron selectivity patterns, named Neuron Selectivity Transfer (NST) Huang and Wang (2017). NST can be combined with other models to learn better features and improve performance. To enable the student network to automatically learn a good loss from the teacher network, preserving inter-class relationships and maintaining multi-modality, Xu et al. used conditional adversarial networks (CAN) in 2018 to build a teacher-student architecture Xu et al. (2017). The deep mutual learning (DML) model Zhang et al. (2018) and the Born Again Neural Networks (BAN) model Furlanello et al. (2018), which apply KD without aiming to compress the model, were both proposed in 2018. Huang et al. (2024) proposed a novel KD model that uses a diffusion model to explicitly denoise and match features, reducing computational cost. Ham et al. (2024) proposed NEO-KD, a novel network based on a knowledge-distillation adversarial training strategy that improves robustness against adversarial attacks.

4.6.2 Network Architecture Search (NAS) and KD

KD transfers the knowledge in a teacher network to a student network; since NAS involves a large number of candidate networks, KD helps to improve the overall performance of the supernet. In 2020, Peng et al. proposed a network distillation algorithm based on priority paths to address an inherent defect of weight sharing between models, namely insufficient subnet training in hypernetworks Peng et al. (2020), which improves the convergence of individual models. In the same year, Li et al. used the Distill the Neural Architecture (DNA) algorithm Li et al. (2020) to supervise the search for the internal structure of the network with knowledge distillation, significantly improving the effectiveness of NAS. Wang et al. (2021) improve on the KL divergence by adaptively choosing an alpha divergence, effectively preventing over- or under-estimation of the teacher model's uncertainty. Gu and Tresp (2020) combine network pruning and distillation to search for the most suitable student network. Kang et al. proposed the Oracle Knowledge Distillation (OKD) method in Kang et al. (2020), which distills from an ensemble teacher network and uses NAS to adjust the capacity of the student network, thereby improving the student network's learning ability and efficiency. Inspired by BAN, Macko et al. (2019) propose the Adaptive Knowledge Distillation (AKD) method to assist the training of sub-networks. To improve the efficiency and effectiveness of knowledge distillation, Guan et al. (2020) used differentiable feature aggregation (DFA) to guide the learning of the teacher and student networks during network architecture search, using a method similar to differentiable architecture search (DARTS) Liu et al. (2018) to adaptively adjust the scaling factors.

4.7 DNNs to solve statistical physics classical problems

4.7.1 Rubik’s cube problem

Professor Rubik invented the Rubik's Cube in 1974, initially calling it the "Magic Cube"; the puzzle was later licensed through the Seven Towns toy business, issued by Ideal Toy Co, and renamed the "Rubik's Cube" European Plastics News (2015).

In 2018, DeepCube, a new algorithm requiring no human assistance, solved the Rubik's Cube through self-taught reasoning McAleer et al. (2018), a milestone in solving complex problems with minimal help. In 2019, Agostinelli et al. proposed in Nature Machine Intelligence the DL method DeepCubeA, which combines deep learning with search to solve the Rubik's Cube Agostinelli et al. (2019); DeepCubeA learns to solve the cube without any domain-specific knowledge, working backward from the goal state on increasingly difficult scrambles. In 2021, Corli et al. introduced a deep reinforcement learning algorithm based on a Hamiltonian reward, bringing quantum mechanics to bear on the Rubik's Cube as a combinatorial problem Corli et al. (2021). The team of Colin Johnson, an associate professor at the University of Nottingham, published a paper in Expert Systems using a stepwise deep learning method to learn a "fitness function" for solving the Rubik's Cube Johnson (2021), highlighting the advantages of stepwise processing.

4.7.2 Neural networks to detect phase transition

Since each new layer of a DNN learns increasingly abstract high-level features from the data, while earlier layers represent the input at finer scales, researchers have connected this hierarchy to renormalization in physics, which extracts macroscopic rules from microscopic ones. In 2017, Bradde and Bialek (2017) discussed the analogy between the renormalization group and principal component analysis. In 2018, neural networks were used to learn new renormalization schemes (Koch-Janusz and Ringel 2018; Kamath et al. 2018).

Phase transitions are boundaries between different phases of matter, typically characterized by order parameters. However, neural networks have shown the ability to learn appropriate order parameters and detect phase transitions without prior knowledge of the underlying physics.

In a study by Morningstar and Melko (2017), unsupervised generative models were used to learn the probability distribution of two-dimensional Ising systems. This work demonstrated that neural networks can capture the essential features of phase transitions in the Ising model.

The literature also provides positive evidence that neural networks can discriminate phase transitions in the Ising model. Carrasquilla and Melko (2017) and Wang (2016) utilized principal component analysis to detect phase transitions without prior knowledge of the system’s physical properties.

Tanaka and Tomiya (2017) proposed a method for estimating specific phase boundary values from heatmaps, further demonstrating the possibility of discovering phase transition phenomena without prior knowledge of the physical system.

For a deeper understanding of these topics, interested readers can refer to the papers by Kashiwa et al. (2019) and Arai et al. (2018).

Overall, these studies highlight the potential of neural networks to identify and characterize phase transitions even without explicit knowledge of the underlying physics, opening up new avenues for studying complex systems and discovering emergent phenomena.

4.7.3 Protein sequence prediction and structural modeling

Protein sequence prediction and structural modeling are of great significance for providing valuable information in "AI + Big Health" fields such as precision medicine and drug research and development. In 2003, Bakk and Høye (2003) studied protein folding by introducing a simplified one-dimensional analogy of proteins composed of N contacts (that is, using the one-dimensional Ising model). Stochastic RBM models Cocco et al. (2018) have recently been used to model protein families from their sequence information Tubiana et al. (2019). Analytical study of the RBM learning process is extremely challenging; training is usually performed with the Gibbs-sampling-based contrastive divergence algorithm Hinton (2002).

Wang et al. (2018) utilized convolutional neural networks combined with extreme learning machine (ELM) classifiers to predict RNA-protein interactions. In 2019, Kuhlman and Bradley reviewed the deep learning methods Kuhlman and Bradley (2019) that have been used for protein sequence prediction and 3D structure modeling. In Nature Communications, Ju et al. introduced a new neural network architecture, CopulaNet, which extracts features from multiple sequence alignments of a target protein and infers residue co-evolution, overcoming the "information loss" of traditional statistical methods Ju et al. (2021).

4.7.4 Orderly glass-like structure design

Mehta's experiments with the Ising model in Bukov et al. (2018) provide some initial ideas in this direction, highlighting the potential usefulness of reinforcement learning for equilibrium quantities beyond quantum physics. In 2019, Greitemann, Liu et al. introduced and studied a kernel-based learning method in Greitemann et al. (2019), Liu et al. (2019), which learns phases in frustrated magnetic materials, is easier to interpret, and can identify complex order parameters.

In 2016, Nussinov et al. also studied ordered glass-like solids, using multi-scale network clustering methods to identify the spatial and spatiotemporal structure of glasses and learning to identify structural flow defects Cubuk et al. (2015). Such methods can also discern the subtle structural features responsible for the heterogeneous dynamics observed in broadly disordered materials. In 2017, Wetzel et al. applied unsupervised learning to the Ising and XY models Wetzel (2017), and in 2018 Wang and Zhai introduced unsupervised learning for frustrated spin systems Wang and Zhai (2017), Wang and Zhai (2018), going beyond the limitations of supervised learning to classify more phases.

4.7.5 Prediction of nonlinear dynamical systems

AI also provides robust tools for studying, predicting, and controlling nonlinear dynamical systems. In 2016, Reddy et al. used reinforcement learning to teach autonomous gliders to exploit atmospheric thermals and soar like birds (Reddy et al. 2016, 2018). In 2017, Pathak et al. used a recurrent neural network, or reservoir computer, called an echo state network Jaeger and Haas (2004) to predict trajectories of chaotic dynamical systems and to build a model for weather forecasting Pathak et al. (2018). Graafland et al. (2020) used Bayesian networks (BNs) to build data-driven complex networks for climate problems: the topology of correlation networks (CNs) contains redundant information, whereas BNs include only non-redundant information (from a probabilistic perspective) and can therefore extract informative physical features using sparse topologies. Boers et al. (2014) used the extreme-event synchronization method to study the global pattern of extreme precipitation and attempted to predict rainfall in South America. Ying et al. (2021) used the same method to study the carbon cycle and carbon emissions and formulated strategies and countermeasures for carbon emission and reduction. Chen et al. (2021) applied the method of eigen microstates to the distribution and evolution of ozone across different structures. Zhang et al. (2021) modified the traditional ETAS model for earthquake prediction by accounting for the memory effect of earthquakes through a long-term memory model. Uncertainty in ocean mixing parameters is a major source of bias in ocean and climate modeling, and traditional physics-driven parameterizations that lack process understanding perform poorly in the tropics. Zhu et al. (2022) explored data-driven parameterizations of ocean vertical mixing using deep learning and long-term turbulence measurements, demonstrating good performance from limited observations, good generalization under physical constraints, and improved physics for climate simulations.
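The echo state network used by Pathak et al. reduces to a fixed random recurrent reservoir with a trained linear readout; the following is a minimal sketch on a toy signal (the reservoir size, weight scales, ridge constant, and sine input are illustrative assumptions, not the configuration of the cited work):

```python
import numpy as np

rng = np.random.default_rng(0)

N = 200                                   # reservoir size (illustrative)
W_in = rng.uniform(-0.5, 0.5, (N, 1))     # fixed random input weights
W = rng.uniform(-0.5, 0.5, (N, N))        # fixed random recurrent weights
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius < 1

def run_reservoir(u):
    """Collect reservoir states x_t = tanh(W_in u_t + W x_{t-1})."""
    x = np.zeros(N)
    states = []
    for u_t in u:
        x = np.tanh(W_in[:, 0] * u_t + W @ x)
        states.append(x.copy())
    return np.array(states)

# Train only the linear readout, by ridge regression, to predict u[t+1].
u = np.sin(0.3 * np.arange(500))          # simple illustrative signal
X = run_reservoir(u[:-1])
y = u[1:]
W_out = np.linalg.solve(X.T @ X + 1e-3 * np.eye(N), X.T @ y)
pred = X @ W_out
```

Only `W_out` is trained; the fixed random reservoir supplies a rich nonlinear feature map of the input history, which is the core of the reservoir-computing idea.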

5 Deep neural network paradigms inspired by quantum mechanics

Quantum algorithms are a class of algorithms that run on a quantum computing model. By exploiting fundamental features of quantum mechanics such as superposition and entanglement, quantum algorithms can achieve dramatic reductions in computational complexity compared with classical algorithms, in some cases exponential. Back in 1992, David Deutsch and Richard Jozsa proposed the first quantum algorithm, the Deutsch-Jozsa algorithm Deutsch and Jozsa (1992). The algorithm requires only one measurement to determine the class to which the unknown function in the Deutsch-Jozsa problem belongs. Although this algorithm lacked practicality, it led to a series of subsequent quantum algorithms. In 1994, Peter W. Shor (1994) proposed the famous quantum prime factorization algorithm, known as the Shor algorithm. The computational complexity of traditional factorization algorithms grows exponentially with the size of the problem; the Shor algorithm, however, can solve the prime factorization problem in polynomial time. In 1996, Lov K. Grover (1996) proposed the classic quantum search algorithm, also known as the Grover algorithm, which has complexity \(O(\sqrt{N})\), a quadratic efficiency improvement over traditional search algorithms. Nature-inspired stochastic optimization algorithms have long been a hot research topic: recent work Sood (2024) provides a comprehensive overview of quantum-inspired metaheuristic algorithms, while Kou et al. (2024) summarizes quantum dynamic optimization algorithms. An overview of representative methods is shown in Table 4.

Table 4 An overview of methods for AI DNNs inspired by quantum mechanics

5.1 Quantum machine learning

Quantum machine learning (QML) combines the speed of quantum computing with the learning and adaptation capabilities of machine learning. By exploiting the superposition, entanglement, coherence, and parallelism of microscopic particles, traditional machine learning algorithms are quantized to enhance their ability to represent, reason about, learn from, and associate data.

In general, quantum machine learning algorithms have the following three steps. (1) Quantum state preparation: to take advantage of the high parallelism of quantum computing, the original data must be encoded into qubits so that the data acquire quantum characteristics. (2) Quantum algorithm processing: quantum computers are no longer von Neumann machines and their operating units differ completely from those of traditional computers, so the traditional algorithm must be quantized and ported to the quantum computer. The port should combine the data structures of the traditional algorithm with the characteristics of quantum theory to effectively accelerate it, which is what makes quantum algorithms worthwhile. (3) Quantum measurement: the result is output as a quantum state, which inherently exists in the form of probabilities. Quantum measurement collapses the superposed wave packet to a classical state, extracting the information contained in the quantum state for subsequent processing.

The history of QML can be traced back to 1995, when Subhash C. Kak (1995) first introduced the concept of "quantum neural computing". Kak considered quantum computers as a collection of conventional computers that can respond to stimuli and reorganize themselves to perform efficient computations. As with traditional machine learning algorithms, quantum machine learning algorithms can be classified according to the data format: quantum unsupervised learning and quantum supervised learning.

5.1.1 Quantum unsupervised learning algorithms

Quantum k-means algorithm The clustering algorithm is one of the most important classes of unsupervised learning algorithms. Clustering means partitioning some samples without labels into different classes or clusters according to some specific criteria (e.g., distance criterion), so that the difference between samples in the same cluster is as small as possible, and the difference between samples in different clusters is as large as possible.

For unsupervised clustering, the K-means algorithm is the most common. Its core idea is that, given a dataset consisting of U unlabeled samples and a number of clusters C (\(C<U\)), each sample is assigned to the nearest cluster according to its distance to the cluster centers:

$$\begin{aligned} \mathop {\arg \min }_{c} \left\vert {{\textbf {u}}-\frac{1}{M}\sum _{j=1}^{M}{{\textbf {v}}_j^c}} \right\vert \end{aligned}$$
(17)

where \({\textbf {u}}\) denotes the sample to be clustered and \({\textbf {v}}_j^c\) denotes the jth sample of class c. Then the centers of all clusters are iteratively updated until the position of centers converges. Since it is necessary to measure the distance between each sample and the center of every cluster and update the centers of all clusters when the K-means algorithm is performed, the time cost of the K-means algorithm will be very high when the number of clusters and samples is large.
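The assignment rule of Eq. (17) can be sketched classically in a few lines of NumPy (the helper name `assign_cluster` is ours):

```python
import numpy as np

def assign_cluster(u, clusters):
    """Implement Eq. (17): assign sample u to the class c whose
    centroid (the mean of its M members v_j^c) is nearest."""
    dists = [np.linalg.norm(u - np.mean(V, axis=0)) for V in clusters]
    return int(np.argmin(dists))
```

For example, with two clusters centered near the origin and near (5.5, 5), a sample at (0.2, 0.4) is assigned to the first cluster. The cost the quantum version attacks is exactly this distance evaluation, repeated for every sample against every cluster.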

In 2013, Lloyd et al. (2013) proposed a quantum version of Lloyd's algorithm for performing K-means. Its main idea is the same as the traditional K-means algorithm, comparing distances between quantum states; but because quantum states in Hilbert space exhibit both entanglement and superposition, they can be processed in parallel to obtain the clusters the samples belong to. First, the algorithm transforms the samples into quantum states \(|u \rangle = \frac{{\textbf {u}}}{|{\textbf {u}}|}\). The entangled states \(| \varphi \rangle , | \phi \rangle\) are then defined as

$$\begin{aligned} \left\{ \begin{aligned} | \varphi \rangle&=\frac{1}{\sqrt{2}} (| u \rangle |0 \rangle + \frac{1}{\sqrt{M}} \sum _{j=1}^{M}{| v_j^c \rangle } |j \rangle )\\ | \phi \rangle&= \frac{1}{\sqrt{Z}} (|{\textbf {u}}| |0 \rangle -\frac{1}{M} \sum _{j=1}^{M}{| v_j^c \rangle }| j \rangle ) \end{aligned} \right. \end{aligned}$$
(18)

where \(Z=|{\textbf {u}}|^2 + \frac{1}{M} \sum _{j}{|{\textbf {v}}_j^c|^2}\) is the normalization factor. It can be shown that the square of the expected distance \(D_c^2=\left| {\textbf {u}}-\frac{1}{M}\sum _{j=1}^{M}{{\textbf {v}}_j^c}\right| ^2\) between the sample to be measured and the cluster center is equal to Z times the probability of success of the measurement:

$$\begin{aligned} D_c^2=2|\langle \varphi | \phi \rangle |^2Z \end{aligned}$$
(19)

\(| \langle \phi | \varphi \rangle |^2\) can be regarded as the squared modulus of the projection of \(| \varphi \rangle\) onto \(| \phi \rangle\), which can be obtained from the success probability of the quantum SWAP operation Nielsen and Chuang (2002). These steps are executed only once per sample in the sample space to find the nearest cluster and assign the sample to it.
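The overlap estimate behind this step can be checked with a classical simulation of the swap-test statistic, whose ancilla-zero probability is \(P(0)=\tfrac{1}{2}+\tfrac{1}{2}|\langle \phi | \varphi \rangle |^2\) (the function name is ours):

```python
import numpy as np

def swap_test_probability(phi, psi):
    """Probability of measuring 0 on the ancilla in a swap test:
    P(0) = 1/2 + |<phi|psi>|^2 / 2, for normalized states."""
    phi = phi / np.linalg.norm(phi)
    psi = psi / np.linalg.norm(psi)
    overlap_sq = abs(np.vdot(phi, psi)) ** 2
    return 0.5 + 0.5 * overlap_sq
```

Orthogonal states give P(0) = 0.5 and identical states give P(0) = 1, so repeated measurements of the ancilla estimate the squared overlap, and hence the distance of Eq. (19).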

The selection of the initial cluster centers is important, and an improper selection may lead to convergence to a local optimum. The general principle is to spread the initial centers as sparsely as possible throughout the sample space. Lloyd et al. therefore proposed quantum adiabatic computation to solve the optimization problem of finding the initial cluster centers. Quantum adiabatic computation uses quantum operations to evolve between states, and the method can be applied to quantum machine learning.

Quantum principal component analysis Dimensionality reduction is another important class of unsupervised learning algorithms. These algorithms map sample features from a high-dimensional space to a lower-dimensional one. The high-dimensional representation of samples contains noise, which causes errors and reduces accuracy; dimensionality reduction suppresses this noise and helps to extract the essential features of the samples.

In dimensionality reduction, Principal Component Analysis (PCA) is one of the most common algorithms. The idea of PCA is to map the high-dimensional features of the sample matrix \({\textbf {X}}\) to a low-dimensional representation \({\textbf {Y}}\) by a linear projection \({\textbf {P}}\), so that the variance of the features in the projected space is maximized and the covariance between dimensions is minimized, i.e. the covariance matrix \({\textbf {D}}\) of \({\textbf {Y}}\) is diagonal. It can be shown that \({\textbf {P}}\) is the eigenvector matrix of the covariance matrix \({\textbf {C}}\) of \({\textbf {X}}\). In this way, fewer feature dimensions preserve most properties of the original samples, but the computational cost of PCA is prohibitive for large numbers of high-dimensional vectors.
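Classical PCA as just described fits in a few lines; this sketch recovers the projection P as the top eigenvectors of the covariance matrix C:

```python
import numpy as np

def pca(X, k):
    """Project rows of X onto the top-k eigenvectors of the
    sample covariance matrix C (classical PCA)."""
    Xc = X - X.mean(axis=0)                 # center the data
    C = Xc.T @ Xc / (len(X) - 1)            # sample covariance
    vals, vecs = np.linalg.eigh(C)          # eigenvalues ascending
    P = vecs[:, ::-1][:, :k]                # top-k principal directions
    return Xc @ P
```

The eigendecomposition of C is the step whose cost motivates QPCA: for d-dimensional data it scales polynomially in d, which becomes prohibitive for very high-dimensional vectors.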

In 2014, Lloyd et al. (2014) proposed the quantum principal component analysis (QPCA) algorithm. QPCA can be used for the discrimination and assignment of quantum states. Suppose there are two sets, each consisting of m states: the density matrix \(\rho =\frac{1}{m}{\sum _{i}{| \phi _i \rangle \langle \phi _i |}}\) is obtained from the first set \(\{ | \phi _i \rangle \}\), and the density matrix \(\sigma = \frac{1}{m}{\sum _{i}{| \psi _i \rangle \langle \psi _i |}}\) from the second set \(\{ | \psi _i \rangle \}\). Assuming the quantum state to be assigned is \(| \chi \rangle\), density matrix exponentiation together with quantum phase estimation can be applied to \(| \chi \rangle\) to obtain the eigenvectors and eigenvalues of \(\rho - \sigma\):

$$\begin{aligned} | \chi \rangle | 0 \rangle \rightarrow \sum _{j}{ \chi _j | \xi _j \rangle | x_j \rangle } \end{aligned}$$
(20)

where \(| \xi _j \rangle\) and \(x_j\) are the eigenvectors and eigenvalues of \(\rho - \sigma\), respectively. By measuring the eigenvalue register, \(| \chi \rangle\) is assigned to the first class \(\{ | \phi _i \rangle \}\) if the eigenvalue is positive and to the second class \(\{ | \psi _i \rangle \}\) if it is negative. This procedure is minimum-error state discrimination, and it achieves exponential speedup. QPCA assumes quantum state preparation via QRAM, but QRAM is only a theoretical model and no reliable physical implementation has yet emerged.

5.1.2 Quantum supervised learning algorithms

Quantum linear discriminant analysis Like PCA, Linear Discriminant Analysis (LDA) is a dimensionality reduction algorithm, but whereas PCA reduces the dimensionality of unlabeled data, LDA reduces the dimensionality of labeled data. The idea of LDA is to project the data into a low-dimensional space so that projected samples of the same class are as close as possible, i.e. minimizing the intra-class scatter:

$$\begin{aligned} {{S}_{w}}=\sum _{i=1}^{N}{\sum _{x\in {{C}_{i}}}{(x-{{\mu }_{i}})}}{{(x-{{\mu }_{i}})}^{T}} \end{aligned}$$
(21)

and the distance between the centers of clusters to be as large as possible, i.e. maximizing inter-class scatter:

$$\begin{aligned} {{S}_{b}}=\sum _{i=1}^{N}{({{\mu }_{i}}-{\overline{x}}){{({{\mu }_{i}}-{\overline{x}})}^{T}}} \end{aligned}$$
(22)

To satisfy these two conditions simultaneously, it is necessary to maximize the generalized Rayleigh quotient:

$$\begin{aligned} J=\frac{{{w}^{T}}{{S}_{b}}w}{{{w}^{T}}{{S}_{w}}w} \end{aligned}$$
(23)

where w is the normal vector of the projected hyperplane. This optimization problem can be solved by the Lagrange multiplier method.
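Maximizing the Rayleigh quotient of Eq. (23) is equivalent to the generalized eigenproblem \(S_b w = \lambda S_w w\); a minimal classical sketch following Eqs. (21)-(22) as written (with the inter-class scatter unweighted by class size, as in Eq. (22)):

```python
import numpy as np

def lda_direction(X, y):
    """Classical LDA: maximize J = (w^T S_b w) / (w^T S_w w)
    by solving the generalized eigenproblem S_b w = lambda S_w w."""
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sw += (Xc - mu).T @ (Xc - mu)       # intra-class scatter, Eq. (21)
        diff = (mu - mean).reshape(-1, 1)
        Sb += diff @ diff.T                 # inter-class scatter, Eq. (22)
    # Solve S_w^{-1} S_b w = lambda w (small ridge for invertibility).
    vals, vecs = np.linalg.eig(np.linalg.solve(Sw + 1e-9 * np.eye(d), Sb))
    w = np.real(vecs[:, np.argmax(np.real(vals))])
    return w / np.linalg.norm(w)
```

The dominant eigenvector of \(S_w^{-1} S_b\) is the projection direction w, the same quantity the quantum algorithm extracts via phase estimation.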

In 2016, Cong and Duan (2016) proposed the Quantum Linear Discriminant Analysis (QLDA) algorithm. Compared with the classical LDA algorithm, QLDA achieves exponential acceleration, greatly reducing the computational difficulty and space usage. First, QLDA uses oracle operators to obtain the density matrices:

$$\begin{aligned} \left\{ \begin{aligned} |\left. {{\Psi }_{1}} \right\rangle&={{O}_{2}}(\frac{1}{\sqrt{\text {k}}}\sum \limits _{c=1}^{k}{|\left. c \right\rangle }|\left. 0 \right\rangle |\left. 0 \right\rangle )\\&=\frac{1}{\sqrt{\text {k}}}\sum \limits _{c=1}^{k}{|\left. c \right\rangle }|\left. ||{{\mu }_{c}}-{\overline{x}}|| \right\rangle \left. |{{\mu }_{c}}-{\overline{x}} \right\rangle \\ |\left. {{\Phi }_{1}} \right\rangle&={{O}_{1}}(\frac{1}{\sqrt{M}}\sum \limits _{j=1}^{M}{|\left. j \right\rangle }|\left. 0 \right\rangle |\left. 0 \right\rangle |\left. 0 \right\rangle )\\&=\frac{1}{\sqrt{M}}\sum \limits _{j=1}^{M}{|\left. j \right\rangle }|\left. ||{{x}_{j}}-{{\mu }_{cj}}|| \right\rangle \left. |{{x}_{j}}-{{\mu }_{cj}} \right\rangle |\left. c \right\rangle \\ \end{aligned} \right. \end{aligned}$$
(24)

If the vector norms form a distribution that can be prepared efficiently, the following states can be obtained:

$$\begin{aligned} \left\{ \begin{aligned}&|\left. {{\Psi }_{2}} \right\rangle =\frac{1}{\sqrt{A}}\sum \limits _{c=1}^{k}{||{{\mu }_{c}}-{\overline{x}}|||\left. c \right\rangle }|\left. ||{{\mu }_{c}}-{\overline{x}}|| \right\rangle \left. |{{\mu }_{c}}-{\overline{x}} \right\rangle \\&|\left. {{\Phi }_{2}} \right\rangle =\frac{1}{\sqrt{B}}\sum \limits _{j=1}^{M}{||{{x}_{j}}-{{\mu }_{c_j}}|||\left. j \right\rangle }|\left. ||{{x}_{j}}-{{\mu }_{c_j}}|| \right\rangle \left. |{{x}_{j}}-{{\mu }_{cj}} \right\rangle |\left. c_j \right\rangle \\ \end{aligned} \right. \end{aligned}$$
(25)

where \(A=\sum \limits _{c=1}^{k}{||{{\mu }_{c}}-{\overline{x}}|{{|}^{2}}}\), \(B=\sum \limits _{j=1}^{M}{||{{x}_{j}}-{{\mu }_{cj}}|{{|}^{2}}}\). After applying the partial trace to the density matrix composed of the two states \(|\left. {{\Psi }_{2}} \right\rangle , |\left. {{\Phi }_{2}} \right\rangle\), Equation (26) can be obtained:

$$\begin{aligned} \left\{ \begin{aligned}&{{S}_{B}}=\frac{1}{A}\sum \limits _{c=1}^{k}{||{{\mu }_{c}}-{\overline{x}}|{{|}^{2}}|\left. {{\mu }_{c}}-{\overline{x}} \right\rangle \left\langle {{\mu }_{c}}-{\overline{x}}| \right. } \\&{{S}_{W}}=\frac{1}{B}\sum \limits _{c=1}^{k}{\sum \limits _{i\in c}{|{{x}_{i}}-{{\mu }_{c}}|{{|}^{2}}|\left. {{x}_{i}}-{{\mu }_{c}} \right\rangle \left\langle {{x}_{i}}-{{\mu }_{c}}| \right. }} \\ \end{aligned} \right. \end{aligned}$$
(26)

Equation (26) can be solved by the Lagrange multiplier method, giving:

$$\begin{aligned} ({{S}_{B}}^{1/2}{{S}_{W}}^{-1}{{S}_{B}}^{1/2})v=\lambda v \end{aligned}$$
(27)

where \(w={{S}_{B}}^{-1/2}v\). The eigenvalues and eigenvectors of v can be obtained by quantum phase estimation. Finally, the optimal projection direction w is obtained via the Hermitian matrix exponentiation solution.

The algorithm is similar to the QPCA algorithm. They both relate the covariance matrix of samples in original problems to the density matrix of the quantum system, and the eigenvalues and eigenvectors of the density matrix are investigated to obtain the optimal projection direction or the main eigenvectors.

Quantum k-nearest neighbors The K-Nearest Neighbors (KNN) algorithm is a very classical classification algorithm. It finds the k nearest labeled samples to the sample x to be classified, and then applies a classification decision rule, such as majority voting, to decide which cluster x belongs to according to the labels of these k samples. The advantage of KNN is that its accuracy becomes extremely high when the dataset is large enough, but its computational cost grows rapidly with the size of the database and the dimensionality of the samples.
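The classical decision rule just described is short enough to state in full (the function name is ours):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Classical KNN: majority vote among the k nearest labeled samples."""
    d = np.linalg.norm(X_train - x, axis=1)   # distance to every sample
    nearest = np.argsort(d)[:k]               # indices of k closest
    return Counter(y_train[nearest]).most_common(1)[0][0]
```

Every query computes a distance to every training sample, which is the cost that motivates the quantum inner-product estimation below.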

In 2014, Wiebe et al. proposed the Quantum K-Nearest Neighbor (QKNN) algorithm Wiebe et al. (2014). The algorithm obtains the Euclidean distance via inner products and achieves a polynomial reduction in query complexity compared with Monte Carlo methods. QKNN first encodes the nonzero entries of the vectors u, v into the probability amplitudes of the quantum states \(|v \rangle ,|u \rangle\):

$$\begin{aligned} \left\{ \begin{aligned}&{| v \rangle }={d^{-\frac{1}{2}}}{\sum _{i:v_{ji} \ne 0}{{| i \rangle }{ \left( \sqrt{1-{\frac{r_{ji}^2}{r_{max}^2}}}{e^{-i \phi _{ji}}}{| 0 \rangle }+{\frac{v_{ji}}{r_{max}}}{|1 \rangle } \right) }{| 1 \rangle }}}\\&{| u \rangle }={d^{-\frac{1}{2}}}{\sum _{i:v_{0i} \ne 0}{{| i \rangle }{ \left( \sqrt{1-{\frac{r_{0i}^2}{r_{max}^2}}}{e^{-i \phi _{0i}}}{| 0 \rangle }+{\frac{v_{0i}}{r_{max}}}{|1 \rangle } \right) }{| 1 \rangle }}}\\ \end{aligned} \right. \end{aligned}$$
(28)

where \({{v}_{0}}=u\), \({{v}_{ji}}={{r}_{ji}}{{e}^{i{{\phi }_{ji}}}}\), and \(r_{max}\) is an upper bound on the magnitudes \(r_{ji}\); the inner product can then be obtained by performing the SWAP operation on \(| v \rangle , | u \rangle\):

$$\begin{aligned} |\langle u|{{v}_{j}} \rangle {{|}^{2}}=|\langle {{v}_{0}}|{{v}_{j}} \rangle {{|}^{2}}=(2P(0)-1){{d}^{2}}{{r}_{\max }}^{2} \end{aligned}$$
(29)

where P(0) denotes the probability of measuring zero. Wiebe et al. also tried using the Euclidean distance between two quantum states directly to determine the classification, but the experiments showed that this method requires more iterations and achieves lower accuracy than the inner-product method; the quantum Euclidean distance classifier has therefore not been widely adopted.

Quantum decision tree classifier The Decision Tree (DT) algorithm is a classical supervised learning model. A DT represents a mapping between the attributes and categories of objects. Each node in the tree tests an attribute value, which determines the direction of classification, and each path from the root to a leaf encodes a set of attribute-value requirements by which an object is assigned to a category. The algorithm uses samples to learn the tree structure and the discriminant rules used for classification. To improve the learning efficiency of DT, information gain is often used to select key features (Table 5).

Table 5 The comparison of time complexity. The Q-version and C-version denote the Quantum version and Classical version of the algorithm

In 2014, Lu and Braunstein (2014) proposed a Quantum Decision Tree (QDT) classifier, which clusters samples into subclasses using the quantum fidelity between two quantum states. In addition, they proposed a quantum entropy impurity criterion to prune the decision tree. The QDT classifier first converts the sample features \(\mathop {\{ {{x}_{i}},{{y}_{i}}\}}_{i=1}^{n}\) into quantum states \(\mathop {\{ | {{x}_{i}} \rangle , | {{y}_{i}} \rangle \}}_{i=1}^{n}\), where \(|x_i\rangle\) denotes the quantum state corresponding to the ith sample and \(|y_i\rangle\) denotes the quantum state of the known class of sample \(|x_i\rangle\). The quantum entropy impurity criterion is defined as:

$$\begin{aligned} S(\rho )=-tr(\rho \log \rho ) \end{aligned}$$
(30)

where \(\rho =\sum _{i=1}^{{{n}_{dt}}}{\mathop {p}_{i}^{dt}}| \mathop {y}_{i}^{(dt)} \rangle \langle \mathop {y}_{i}^{(dt)}|\) is the density matrix of the quantum states of the classes at node t, and tr denotes the trace of a matrix, the sum of its main-diagonal elements. The expectation of the criterion is calculated as:

$$\begin{aligned} {{S}_{e}}(\rho _{i}^{(t)}) =\sum _{j=1}^{{{t}_{i}}}{{{p}_{j}}S(\rho _{i,j}^{(t)})} \end{aligned}$$
(31)

Finally, the Grover algorithm is used to find the minimum of the expectation in Equation (31), and the class achieving that minimum is the one to which the sample belongs. Replacing the Shannon entropy of classical information theory with the quantum entropy impurity criterion, whose value is obtained by computing this expectation, is what distinguishes this algorithm from the traditional decision tree.
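The quantum entropy of Eq. (30) can be evaluated numerically from the spectrum of \(\rho\); a pure state gives zero, while a maximally mixed qubit gives \(\log 2\):

```python
import numpy as np

def von_neumann_entropy(rho):
    """S(rho) = -tr(rho log rho), computed from the eigenvalues of rho."""
    vals = np.linalg.eigvalsh(rho)     # rho is Hermitian
    vals = vals[vals > 1e-12]          # 0 log 0 = 0 by convention
    return float(-np.sum(vals * np.log(vals)))
```

Because \(\rho\) is diagonal in its eigenbasis, \(-\mathrm{tr}(\rho \log \rho )\) reduces to the Shannon entropy of its eigenvalue distribution, which is exactly what the code computes.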

Quantum support vector machine The Support Vector Machine (SVM) is an important supervised linear classification algorithm. The idea of SVM is to classify by finding the separating hyperplane with maximum margin:

$$\begin{aligned} \arg {{\max }_{w,b}}({min_{i}{\frac{{y_i}({{w}^{T}}\phi ({x_i})+b)}{||w||}}}) \end{aligned}$$
(32)

where w is the normal vector of the hyperplane, b is the bias, and \({{y}_{i}}\in \{-1,1\}\) is the label of the sample \(x_i\). The solutions \(x_i^*\) of Equation (32) are called support vectors; they are the samples closest to the classification hyperplane, and \(d=\frac{1}{||w||}{y_i^*}({{w}^{T}}\phi ({x_i^*})+b)\) is the maximum margin from the samples to the hyperplane. Equation (33) can be obtained from Equation (32) by a scale transformation:

$$\begin{aligned} \left\{ \begin{aligned}&\arg \min _{w,b}(\frac{1}{2}||w||^{2}), \\&s.t. \text { }{{y}_{i}}({{w}^{T}}\phi ({{x}_{i}})+b)\ge 1 \\ \end{aligned} \right. \end{aligned}$$
(33)

Equation (33) is a constrained optimization problem, which can be solved by the Lagrange multiplier method.
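As a concrete check of Equation (33), the sketch below uses a two-point toy dataset with the identity feature map \(\phi(x)=x\) (an illustrative assumption) and verifies that the analytically known optimum satisfies the constraints with equality and attains margin \(1/||w||\):

```python
import numpy as np

# Toy check of the hard-margin SVM constraints in Eq. (33). For the two
# points below the optimal separating hyperplane is known analytically:
# w = (1, 0), b = -1, so both constraints hold with equality.
X = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([-1.0, 1.0])

w = np.array([1.0, 0.0])
b = -1.0

margins = y * (X @ w + b)                     # y_i (w^T x_i + b) >= 1
geometric_margin = 1.0 / np.linalg.norm(w)    # d = 1/||w|| at the optimum

print(margins, geometric_margin)
```

Both samples sit exactly on the margin boundaries (constraint value 1), so both are support vectors, and the geometric margin is 1.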

In 2014, Rebentrost et al. (2014) proposed the Quantum Support Vector Machine (QSVM). The QSVM uses a non-sparse matrix exponentiation technique to perform matrix inversion efficiently, obtaining an exponential speedup. The QSVM first encodes the feature vectors into quantum state amplitudes using oracle operators:

$$\begin{aligned} | {{x}_{i}} \rangle =\frac{1}{|{{x}_{i}}|}\sum _{k=1}^{N}{{{({{x}_{i}})}_{k}} |k\rangle } \end{aligned}$$
(34)

where \({{({{x}_{i}})}_{k}}\) denotes the kth component of the ith feature vector. In order to obtain the normalized kernel matrix, it is necessary to prepare the quantum state:

$$\begin{aligned} | \chi \rangle = \frac{1}{\sqrt{{{N}_{\chi }}}}\sum _{i=1}^{M}{ | {{x}_{i}} | | i \rangle } | {{x}_{i}} \rangle \end{aligned}$$
(35)

where \({{N}_{\chi }}=\sum _{i=1}^{M}{| {{x}_{i}}|^{2}}\). The normalized kernel matrix can then be obtained from the partial trace of the density matrix \(| \chi \rangle \langle \chi |\):

$$\begin{aligned} t{{r}_{2}}\{ | \chi \rangle \langle \chi |\}=\frac{1}{{{N}_{\chi }}}\sum _{i,j=1}^{M}{\langle {{x}_{j}} | {{x}_{i}} \rangle | {{x}_{i}} | | {{x}_{j}} | | i \rangle } \langle j |=K/trK \end{aligned}$$
(36)

In this way, the quantum system is associated with the kernel matrix of traditional ML. Owing to the high parallelism of quantum state evolution, the computation of the kernel matrix in traditional ML can be accelerated.
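The classical object that Equation (36) prepares is simply the Gram matrix of inner products normalized by its trace; a minimal NumPy illustration (with an arbitrary example dataset):

```python
import numpy as np

# Classical counterpart of Eq. (36): the kernel matrix K with K_ij = <x_i, x_j>,
# normalized by its trace, which is what tr_2{|chi><chi|} encodes.
X = np.array([[1.0, 0.0],
              [0.6, 0.8],
              [0.0, 1.0]])

K = X @ X.T                 # Gram matrix of pairwise inner products
K_hat = K / np.trace(K)     # K / tr K, the right-hand side of Eq. (36)

print(np.trace(K_hat))      # the normalized kernel has unit trace
```

The unit-trace property is exactly what makes \(K/\mathrm{tr}\,K\) a valid density matrix on the quantum side.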

5.2 Quantum deep learning

Similar to QML, quantum deep learning (QDL) allows deep learning algorithms to take advantage of the basic properties of quantum mechanics. QDL replaces traditional von Neumann computation with quantum computation, making deep learning algorithms quantum and thereby significantly improving their parallelism and reducing their computational complexity (Fig. 17).

Fig. 17 Quantum Neurons Model

The basic principle of a neuron is to simulate excitatory or inhibitory signals with weight parameters and to simulate information processing with weighted connections to obtain the output, so a neuron can be modeled as \(Y=\sum _{i}{{{w}_{i}}{{x}_{i}}}\). In QDL, all neurons first convert their inputs into quantum states \(\phi _j\):

$$\begin{aligned} Y=\sum _{i}{{{w}_{i}}{{\phi }_{j}}}=\sum _{i}{\sum _{j}{{{w}_{ij}} | {{x}_{1}},\cdots ,{{x}_{{{2}^{n}}}} \rangle }},i=1,2,\cdots ,{{2}^{n}} \end{aligned}$$
(37)

where \(2^n\) denotes the number of input nodes.

If the quantum states \(\phi _{j}\) are orthogonal, the output of the neurons can be expressed by a quantum unitary transformation:

$$\begin{aligned} Y=\left( \begin{matrix} {{w}_{11}} &{} {{w}_{12}} &{} \cdots &{} {{w}_{1{{2}^{n}}}} \\ {{w}_{21}} &{} {{w}_{22}} &{} \cdots &{} {{w}_{2{{2}^{n}}}} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {{w}_{{{2}^{n}}1}} &{} {{w}_{{{2}^{n}}2}} &{} \cdots &{} {{w}_{{{2}^{n}}{{2}^{n}}}} \\ \end{matrix} \right) \times \left( \begin{matrix} | 0,0,\cdots ,0 \rangle \\ | 0,0,\cdots ,1 \rangle \\ \vdots \\ | 1,1,\cdots ,1 \rangle \\ \end{matrix} \right) \end{aligned}$$
(38)

In general, training a quantum neuron model involves five steps: first, initialize the weight matrix \(W^0\); second, construct the training set \(\{ | \phi \rangle , | O \rangle \}\) according to the problem; third, calculate the neuron output \(| \Theta \rangle ={{W}^{t}} | \phi \rangle\), where t is the iteration number; fourth, update the weight parameters \({{W}_{ij}}^{t+1}={{W}_{ij}}^{t}+\alpha ({{ | {\mathrm O} \rangle }_{i}}-{{| \Theta \rangle }_{i}}){{| \phi \rangle }_{j}}\), where \(\alpha\) is the learning rate; finally, repeat the third and fourth steps until the network converges.
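The five steps above can be sketched classically, with real vectors standing in for \(|\phi\rangle\) and \(|O\rangle\) (a toy simulation, not a circuit implementation; the dimension, learning rate, and iteration count are illustrative):

```python
import numpy as np

# Minimal classical sketch of the five-step quantum neuron training loop.
rng = np.random.default_rng(0)
n = 4                                   # state dimension (2 qubits)
W = rng.normal(scale=0.1, size=(n, n))  # step 1: initialize W^0
phi = np.eye(n)[0]                      # step 2: one training pair (|phi>, |O>)
O = np.eye(n)[2]
alpha = 0.5                             # learning rate

for _ in range(200):                    # steps 3-5, repeated until convergence
    Theta = W @ phi                     # step 3: |Theta> = W^t |phi>
    W += alpha * np.outer(O - Theta, phi)   # step 4: outer-product update

print(np.linalg.norm(W @ phi - O))      # residual shrinks toward 0
```

Because the input is a basis vector, the update only modifies the corresponding column of W, which converges geometrically to the target state.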

The concept of quantum neural computation was introduced by Kak (1995) in 1995, but the concept of quantum deep learning was first proposed by Wiebe et al. (2014) in 2014. In the same year, Schuld et al. (2014) proposed three requirements that a quantum neural network should satisfy: first, the input and output of the quantum system are encoded as quantum states; second, the quantum neural network reflects one or more fundamental neural computational mechanisms; third, the evolution, based on quantum effects, must be fully compatible with quantum theory. This section presents quantum multilayer perceptrons, quantum recurrent networks, and quantum convolutional networks, in that order.

5.2.1 Quantum multilayer perceptrons

In 1995, Menneer and Narayanan (1995) proposed the quantum-inspired neural network (QUINN). Traditional training seeks a single set of parameters that enables one network to produce the correct result for every pattern. Inspired by quantum superposition, a QUINN instead trains, for each pattern, an isomorphic network that processes only that single pattern, and the isomorphic networks corresponding to the different patterns are superimposed in a quantum way to produce the QUINN. The weight vector of the QUINN is called the quantum-inspired wave function (QUIWF), which collapses and yields the classification result upon measurement.

In 1996, Behrman et al. (1996) proposed the Quantum Dot Neural Network (QDNN). The QDNN imitates a quantum dot molecule coupled to the substrate lattice in a time-varying field and uses discrete nodes of the time dimension as hidden-layer neurons. It is shown that the QDNN can perform any desired classical logic gate in some regions of the phase space.

In 1996, Tóth et al. (1996) proposed Quantum Cellular Neural Networks (QCNs). A QCN consists of cells of interacting quantum dots that communicate with each other through Coulomb forces; each cell encodes a continuous degree of freedom, and its state equation can be represented by the time-dependent Schrödinger equation describing the cellular network.

In 2000, Matsui et al. (2000) proposed a quantum neural network based on quantum circuits. The basic unit of the network is a quantum logic gate consisting of a 1-bit rotation gate and a 2-bit controlled-NOT gate, which together can implement all the basic logic operations. The network controls the connections between neurons through the rotation gate and the computation within each neuron through the controlled-NOT gate. Since the construction of neurons depends on quantum logic gates, the number of gates increases exponentially when the network structure is complex.

In 2005, Kouda et al. (2005) constructed a qubit neural network with quantum logic gates and proposed the structure of a quantum perceptron. The state z of a neuron receiving inputs from K other neurons is given by:

$$\begin{aligned} \left\{ \begin{aligned}&u=\sum \limits _{k=1}^{K}{f({{\theta }_{k}})\cdot {{x}_{k}}-f(\lambda )}=\sum \limits _{k=1}^{K}{f({{\theta }_{k}})\cdot f({{y}_{k}})-f(\lambda )} \\&y=\frac{\pi }{2}g(\delta )-\arg (u) \\&z=f(y) \end{aligned} \right. \end{aligned}$$
(39)

where the quantum state \(f(\varphi )={{e}^{i\varphi }}=\cos \varphi +i\sin \varphi\), \(g(\cdot )\) denotes the sigmoid function, and \(\arg (u)\) denotes the phase angle of u.

In 2006, Zhou et al. (2006) proposed the Quantum Perceptron Network (QPN). Simulation experiments show that a quantum perceptron containing only one neuron can still realize the XOR operation, which cannot be achieved by a conventional perceptron containing only one neuron. The structure of the QPN is as follows:

$$\begin{aligned} \left\{ \begin{aligned}&t=f(y) \\&sigmoid(x)=\frac{1}{1+{{e}^{-x}}} \\&y=\frac{\pi }{2}sigmoid(\sigma )-\arctan ({\text {Im}}(\varphi )/{\text {Re}}(\varphi )) \\&\varphi =\sum _{n=1}^{N}{f(\frac{\pi }{2}{{P}_{n}})f({{\theta }_{n}})-f(\frac{\pi }{2})}f(\lambda ) \\ \end{aligned} \right. \end{aligned}$$
(40)

where \(\theta _n\) and \(\lambda\) are the weight parameters and the phase parameter, respectively, \(\sigma\) is the phase control factor, and \(P_n\) is the input data.
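A classical simulation of the forward pass in Equation (40) is straightforward, with \(f(\varphi)=e^{i\varphi}\); all parameter values below are illustrative, not from the paper, and `arctan2` is used for the phase, which agrees with \(\arctan(\mathrm{Im}(\varphi)/\mathrm{Re}(\varphi))\) when \(\mathrm{Re}(\varphi)>0\):

```python
import numpy as np

# Forward pass of the QPN in Eq. (40). Parameter values are illustrative.

def f(phi):
    """Quantum-state phase factor f(phi) = e^{i phi}."""
    return np.exp(1j * phi)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def qpn_forward(P, theta, lam, sigma):
    # varphi = sum_n f(pi/2 * P_n) f(theta_n) - f(pi/2) f(lambda)
    varphi = np.sum(f(np.pi / 2 * P) * f(theta)) - f(np.pi / 2) * f(lam)
    # y = pi/2 * sigmoid(sigma) - arctan(Im(varphi)/Re(varphi))
    y = np.pi / 2 * sigmoid(sigma) - np.arctan2(varphi.imag, varphi.real)
    return f(y)    # output t = f(y)

t = qpn_forward(P=np.array([1.0, 0.0]),
                theta=np.array([0.3, -0.2]),
                lam=0.1, sigma=0.0)
print(abs(t))      # output lies on the complex unit circle
```

Since y is real, the output \(t=e^{iy}\) always has unit modulus; the information is carried entirely in its phase.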

In 2014, Schuld et al. (2014) proposed a quantum neural network based on quantum walks. The network uses the position of the quantum walk to represent the firing patterns of binary neurons (resting and activated states), which are encoded as a set of binary strings. To simulate the dissipative dynamics of a neural network, the network performs the quantum walk in a decoherent manner to retrieve memorized patterns from incompletely initialized patterns.

5.2.2 Quantum recurrent neural networks

In 2014, Wiebe et al. (2014) first introduced the concept of “quantum deep learning”. They argued that quantum algorithms can effectively solve some problems that cannot be solved by traditional computers. Quantum algorithms provide a more efficient and comprehensive framework for deep learning. In addition, they also proposed an optimization algorithm for quantum Boltzmann machines, which reduced the training time of Boltzmann machines and provided significant improvements in the objective function.

The Boltzmann machine (BM) is a kind of undirected recurrent neural network. From a physical perspective, the BM is modeled on the Ising model at thermal equilibrium and uses the Gibbs distribution to model the probability of each hidden node. They proposed two quantum methods for the optimization problem of the BM: Gradient Estimation via Quantum Sampling (GEQS) and Gradient Estimation via Quantum Amplitude Estimation (GEQAE). The GEQS algorithm uses mean-field theory to approximate the nonuniform prior distribution over configurations and prepares Gibbs states from the mean-field states, allowing the Gibbs distribution to be prepared accurately when the two states are close enough. Unlike GEQS, GEQAE uses oracle operators to quantize the training data; its idea is to encode samples with quantum amplitude estimation, which greatly reduces the computational complexity of gradient estimation.

In 2020, Bausch (2020) proposed quantum recurrent neural networks, constructed mainly from a novel quantum neuron. The nonlinear activation function of the neuron is implemented through the nonlinearity of the cosine function that arises from the amplitude change when the basis vector of a qubit is rotated. These neurons are combined into a structured QRNN cell, which is iterated to obtain a recurrent model similar to the traditional RNN.

In 2020, Chen et al. (2020) proposed the Quantum Long Short-Term Memory network (QLSTM). The QLSTM utilizes Variational Quantum Circuits (VQCs) with tunable parameters, consisting of data encoding layers, variational layers, and quantum measurement layers, to replace the cells of a traditional LSTM; VQCs have the ability to extract features and compress data. Numerical simulations demonstrate that the QLSTM learns faster and converges more robustly than the LSTM, and the typical spikes of the loss function that appear in the traditional LSTM do not appear in the QLSTM.

In 2021, Ceschini et al. (2021) proposed a method to implement LSTM cells in a quantum framework. The method uses quantum circuits to replicate the internal structure of the cell for inference. An encoding scheme was proposed to quantize the operators of the LSTM cell, such as quantum addition, quantum multiplication, and quantum activation functions. Finally, the quantum architecture was verified by numerical simulations on the IBM Quantum Experience\(^\text {TM}\) platform and on classical devices.

5.2.3 Quantum convolutional networks

In 2019, Cong et al. (2019) first proposed a quantum convolutional neural network (QCNN). The QCNN is a variational quantum circuit model whose input is an unknown quantum state. The convolution layer consists of parametrized two-qubit gates applying a single quasilocal unitary in a translationally invariant manner. The pooling operation is implemented by measuring a fraction of the qubits and applying unitary rotations to nearby qubits conditioned on the measurement outcomes. Convolution and pooling layers are applied until the system size is small enough to yield a few output qubits. As in traditional convolutional neural networks, hyperparameters such as the number of convolutional and pooling layers are fixed in the QCNN, while the parameters of the convolution and pooling layers are learnable.

In 2019, Kerenidis et al. (2019) proposed a modular quantum convolutional neural network algorithm, which implements all modules with simple quantum circuits. The network supports any number of layers and convolution kernels of any number and size. During forward propagation, this QCNN achieves an exponential speedup over the traditional CNN.

In 2021, Liu et al. (2021) proposed the hybrid quantum-classical convolutional neural network (QCCNN). The QCCNN utilizes interleaved one-qubit and two-qubit layers to form a quantum convolution layer. The one-qubit layer consists of \(\hbox {R}_{\textrm{y}}\) gates with tunable parameters; the two-qubit layer consists of CNOT gates on nearest-neighbor pairs of qubits. The QCCNN converts the input into a separable quantum feature with the quantum convolution layer, uses pooling layers to reduce the dimensionality of the data, and finally measures the quantum feature to obtain the output scalar.

5.3 Quantum evolutionary algorithms

An evolutionary algorithm is a stochastic search algorithm based on Darwin's theory of natural selection and Mendel's theory of genetic variation, which simulates reproduction, mutation, competition, and selection in biological evolution. A quantum evolutionary algorithm uses qubits to encode individuals and updates them with rotation gates and NOT gates, so that an individual can carry the information of multiple states at the same time, yielding richer populations and greatly improving the parallelism and convergence speed of the algorithm.

In a quantum evolutionary algorithm, each individual in the population is encoded with qubits. After encoding, each gene of an individual contains all information in a superposition state:

$$\begin{aligned} | \phi \rangle =\alpha | 0 \rangle +\beta | 1 \rangle \end{aligned}$$
(41)

where \(\alpha , \beta\) denote the probability amplitudes of the quantum state and satisfy \({{| \alpha |}^{2}}+{{| \beta |}^{2}}=1\). An individual encoded with qubits can thus be expressed as:

$$\begin{aligned} q_{j}^{t}= \left( \begin{matrix} \alpha _{j1}^{t} \\ \beta _{j1}^{t} \\ \end{matrix} \begin{matrix} \alpha _{j2}^{t} \\ \beta _{j2}^{t} \\ \end{matrix} \begin{matrix} \cdots \\ \cdots \\ \end{matrix} \begin{matrix} \alpha _{jm}^{t} \\ \beta _{jm}^{t} \\ \end{matrix} \right) \end{aligned}$$
(42)

where \(q_j^t\) denotes the jth individual in the population after the tth iteration and m denotes the number of genes in the individual. An individual encoded with qubits can express a superposition of multiple quantum states at the same time, making the population more diverse. As the algorithm converges, \(|\alpha |\) and \(|\beta |\) also converge to 0 or 1, so the encoded individual converges to a single state.

In general, a quantum evolutionary algorithm has the following steps: first, initialize the population with \(\alpha =\beta =\frac{1}{\sqrt{2}}\); second, generate a random number \(r \in [0,1]\) and compare it with the probability amplitude \(\alpha\): the measured value of the quantum state is 1 if \(r > \alpha ^2\) and 0 otherwise, and measuring each individual in the population once in this way yields a set of solutions; third, evaluate the fitness of each solution; fourth, compare the current best solution of the population with the recorded historical best, and record the better one together with its fitness; fifth, update the population with quantum rotation gates and quantum NOT gates according to a certain strategy; finally, repeat the above steps until the convergence condition is reached.
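The loop above can be sketched on a toy fitness function (OneMax, i.e., the number of 1-bits, chosen purely for illustration), with a fixed rotation angle toward the recorded best individual as the update strategy; all constants are illustrative:

```python
import numpy as np

# Classical sketch of the quantum evolutionary algorithm loop on OneMax.
# Each gene is a (cos(ang), sin(ang)) amplitude pair, so alpha^2 + beta^2 = 1.
rng = np.random.default_rng(1)
pop, genes, delta = 10, 8, 0.02 * np.pi

ang = np.full((pop, genes), np.pi / 4)   # step 1: alpha = beta = 1/sqrt(2)
best_x, best_fit = None, -1

for _ in range(100):
    alpha = np.cos(ang)
    x = (rng.random((pop, genes)) >= alpha**2).astype(int)  # step 2: measure
    fit = x.sum(axis=1)                                     # step 3: evaluate
    if fit.max() > best_fit:                                # step 4: record best
        best_fit, best_x = int(fit.max()), x[fit.argmax()].copy()
    # step 5: rotate each qubit slightly toward the best individual's bit
    ang = np.clip(ang + np.where(best_x == 1, delta, -delta), 0.0, np.pi / 2)

print(best_fit, best_x)   # best fitness found and the corresponding bit string
```

As the angles drift toward 0 or \(\pi/2\), the measured individuals collapse onto the recorded best pattern, mirroring the convergence of \(|\alpha|, |\beta|\) to 0 or 1 described above.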

The earliest quantum evolutionary algorithm was proposed by Narayanan and Moore (1996). In 1996, they first combined quantum theory with genetic algorithms and proposed the quantum genetic algorithm, opening up the field of quantum evolutionary computation. The quantum evolutionary algorithm itself was developed by Han et al.: building on the parallel quantum-inspired genetic algorithm (PGQA) Han et al. (2001), they extended quantum genetic algorithms to quantum evolutionary algorithms (QEA) Han and Kim (2000).

5.3.1 Quantum encoding algorithms

In 2008, Li and Li (2008) proposed a quantum evolutionary algorithm encoded by Bloch sphere coordinates. In this algorithm, individuals are encoded in Bloch coordinates of qubits, updated by a quantum rotation gate, and mutated by a quantum NOT gate. Compared with a simple genetic algorithm (SGA), the algorithm has higher effectiveness and feasibility.

In the same year, Cruz et al. (2007) proposed a quantum evolutionary algorithm based on real number encoding. The algorithm uses an interval in the search space to represent the genes in quantum individuals and calculates the pulse height by the pulse width value and the total number of quantum individuals in the population, which ensures that the total area of the probability density function used to generate classical individuals is equal to 1. Compared with similar algorithms, this algorithm can obtain better solutions with less computation, which greatly reduces the convergence time.

In 2009, Zhao et al. Chen et al. (2005) proposed a Real-coded Chaotic Quantum-inspired Genetic Algorithm (RCQGA). The RCQGA maps real individuals to qubits in the solution space and applies crossover and mutation operations to search among real individuals. The individuals representing network weights are encoded as tunable vectors, which can be obtained by the RCQGA. Compared with similar algorithms, it converges faster when searching for the best weights of fuzzy neural networks.

In 2016, Joshi et al. (2016) proposed an adaptive real-coded quantum evolutionary algorithm (ARQEA). The algorithm uses a parameter-free quantum crossover operator inspired by the rotation gate to generate new populations and amplifies amplitudes with a quantum phase rotation gate to search for the desired element, thereby avoiding the tuning of evolutionary parameters.

5.3.2 Quantum evolutionary operators

In 2002, Li and Zhuang (2002) proposed a genetic algorithm based on the quantum probability representation (GAQPR), in which a novel crossover operator and mutation operator are designed. The crossover operator makes individuals carry the best evolutionary information by exchanging the current evolutionary target and updating individuals, and the mutation operator is implemented by randomly selecting one qubit of each individual and exchanging its probability amplitudes. The GAQPR algorithm is more effective for multi-peaked optimization problems, as demonstrated on two typical function optimization problems.

In 2004, Yang et al. (2004) proposed a novel discrete particle swarm optimization algorithm based on quantum individuals. The algorithm defines each particle as one qubit and uses random observation instead of a sigmoid function to approach the optimal result step by step. Its effectiveness is demonstrated in simulation experiments and in CDMA applications.

In 2015, Jin and Jin (2015) proposed an improved quantum particle swarm algorithm (IQPSO) for visual feature selection (VFS). The algorithm obtains a reverse solution through the reverse operation on each solution and selects the individual optimal solution and the global optimal solution by evaluating the fitness function on all solutions and reverse solutions.

In 2019, Rehman et al. (2019) proposed an improved approach to the quantum particle swarm algorithm. The method uses a mutation strategy to change the mean best position by randomly selecting the best particle to take part in the current search domain and then adds an enhancement factor to improve the global search capability to find the global best solution.

5.3.3 Quantum immune operators

In 2008, Li et al. Jiao et al. (2008) proposed a quantum-inspired immune clonal algorithm (QICA), in which the antibody population is divided into a set of subpopulations. The antibodies in the subpopulations are represented by multistate gene qubits. Antibody updates are implemented with a quantum rotation gate strategy and a dynamic angle adjustment mechanism to accelerate convergence, quantum mutations are implemented with the quantum NOT gate to avoid premature convergence, and a quantum recombination operator is designed for information exchange between subpopulations to improve search efficiency.

In the same year, Li et al. Yangyang and Licheng (2008) proposed a quantum-inspired immune clonal multiobjective optimization algorithm (QICMOA). The algorithm encodes the dominant population antibodies with qubits and designs quantum recombination operators and quantum NOT gates to clone, recombine, and update the dominant antibodies with less crowded density.

In 2013, Liu et al. (2013) proposed the cultural immune quantum evolutionary algorithm (CIQEA), which consists of a population space based on the QEA and a belief space based on immune vaccination. The population space periodically provides vaccines to the belief space; the belief space continuously evolves these vaccines and optimizes the evolutionary direction of the population space, which greatly improves the global optimization capability and convergence speed.

In 2014, Shang et al. (2014) proposed an immune clonal coevolutionary algorithm (ICCoA) for dynamic multi-objective optimization (DMO). The algorithm solves the DMO problem based on the basic principle of an artificial immune system with an immune clonal selection method and designs coevolutionary competition and cooperation operators to improve the consistency and diversity of solutions.

In 2018, Shang et al. (2018) proposed a quantum-inspired immune clonal algorithm (QICA-CARP). The algorithm encodes the antibodies in the population as qubits and steers the population's evolution toward a good schema using the information of the current optimal antibody. The quantum mutation strategy and quantum crossover operator speed up the convergence of the algorithm as well as the exchange of individual information.

5.3.4 Quantum population optimization

In 2005, Alba and Dorronsoro (2005) subdivided the grid population structure into squares, rectangles, and bars, and designed a quantum evolutionary algorithm that introduces a preprogrammed change in the relationship between individual fitness and population entropy to dynamically adjust the structure of the population, constructing the first adaptive dynamic cellular model.

In 2008, Li et al. (2008) used a novel distance measurement method to maintain population diversity. The algorithm evolves the solution population by non-dominated sorting and uses the Pareto max-min distance to preserve diversity, achieving a good balance between global and local search.

In 2009, Mohammad and Reza (2009) proposed a dynamic structured interaction algorithm among population members in the quantum evolutionary algorithm (QEA). The algorithm classifies the population structures of QEA into ring, cellular, binary tree, cluster, lattice, star, and random structures, and shows by comparing these structures that the cellular structure is the best for QEA.

In 2015, Qi and Xu (2015) proposed an L5-based simultaneous cellular quantum evolution algorithm (LSCQEA). In the LSCQEA, each individual is located in a lattice cell, and each individual together with its four neighbors undergoes an iteration of QEA. In every iteration, different individuals exchange information through overlapping neighborhoods, which drives the evolution of the population.

In 2018, Mei and Zhao (2018) proposed a random perturbation QPSO algorithm (RP-QPSO). By introducing a random perturbation strategy to the iterative optimization, the algorithm can dynamically and adaptively adjust, which improves the local search ability and global search ability.

6 Top open problems

The laws of physical knowledge are diverse and powerful, while AI models simulate the brain, composed of millions of neurons connected by weights, to realize human-like behavior. Combining physical knowledge and AI, and letting the two influence and evolve with each other, deepens our understanding of deep neural network models and in turn promotes the development of a new generation of artificial intelligence. However, combining the two also poses huge challenges, which we discuss around the following issues (see Fig. 18).

Fig. 18 Top open problems combining physics with AI

6.1 Open problem 1: credibility, reliability, and interpretability of physical priors

Neural networks in AI are becoming more and more popular in physics as general-purpose models in various fields (Redmon et al. 2016; He et al. 2017; Bahdanau et al. 2014). However, the intrinsic properties of neural networks (parameters, model inference results, etc.) are difficult to explain, so neural networks are often labeled as black boxes. Interpretability aims to describe the internal structure and inferences of a system in a way that humans can understand, which is closely related to the cognition, perception, and biases of the human brain. Today, the emerging and active intersection of physics and neural networks attempts to make the black box transparent by designing deep neural networks based on physical knowledge; by using such prior knowledge, deeper and more complex neural networks become feasible. However, the reasoning and internal structure of neural networks remain a mystery, and using physics-informed methods as a supplement to prior knowledge to explain neural networks remains a major challenge.

6.2 Open problem 2: causal inference and decision making

The purpose of AI is to let machines learn to “think” and “decide” like the brain, and the brain's understanding of the real world, its processing of incomplete information, and its task-handling capabilities in complex scenarios are unmatched by current AI technologies, especially in time-series problems (Rubin 1974; Pearl 2009; Imbens and Rubin 2015). Since most existing AI models are association-driven, just as the decision output of a physical machine is affected by a change of mechanism or the intervention of other factors, these models usually only know the “how” (correlation) but not the “why” (causality). Recent groundbreaking work on time-series causality (Runge 2018; Runge et al. 2019a, b; Nauta et al. 2019) lays the foundation for AI. Introducing causal reasoning, statistical-physics thinking, and the brain's multi-perspective cognitive activities into AI, removing spurious associations, and using causal reasoning and prior knowledge to guide model learning is a major challenge for improving AI's generalization ability in unknown environments.

6.3 Open problem 3: catastrophic forgetting

The brain's memory storage system is an information filter: like a computer clearing disk space, it deletes useless information in order to receive new information. “Catastrophic forgetting”, in neurobiological terms, means that when a new task is learned, the connection weights between neurons weaken or even disappear as the network deepens; that is, the appearance of new neurons causes the weights to be reset, and hippocampal neurons rewire and overwrite memories Abraham and Robins (2005). For humans, forgetting can improve decision-making flexibility by reducing the impact of outdated information, and it can also let people forget negative events and improve adaptability.

Achieving artificial general intelligence today requires agents to be able to learn and remember many different tasks, and the most important part of the learning process is forgetting (McCloskey and Cohen 1989; Goodfellow et al. 2013). Through the purification of selective forgetting (Kirkpatrick et al. 2017; Zhang et al. 2023), AI can better understand human commands, improve the generalization ability of the algorithm, prevent overfitting of the model, and solve more practical problems. Therefore, learning to forget is one of the major challenges facing artificial intelligence.

6.4 Open problem 4: optimization and collaboration driven by knowledge and data

Many practical optimization problems are difficult to solve because they are non-convex or multi-modal, large-scale, highly constrained, multi-objective, and subject to large uncertainty in the constraints; moreover, most evolutionary optimization algorithms must evaluate the potential of candidate solutions, while explicit objective and constraint functions are often oversimplified or may not exist at all. In contrast, solving evolutionary optimization problems by evaluating objectives and/or constraints through numerical simulations, physical experiments, production processes, or data collected in everyday life is called data-driven evolutionary optimization. However, data-driven optimization algorithms pose different challenges depending on the nature of the data (distributed, noisy, heterogeneous, or dynamic). Inspired by AI algorithms, physics-informed models not only reduce the cost of implementation and computation Belbute-Peres et al. (2020), but also have stronger generalization ability Sanchez-Gonzalez et al. (2020). AI is largely based on knowledge bases and inference engines that simulate human behavior, and knowledge, as a highly condensed embodiment of data and information, often means higher algorithm execution efficiency. Inspired by physics, knowledge-driven AI carries rich experience and strong interpretability, so knowledge-data dual-driven optimization provides a new method and paradigm for general AI; combining the two will be a very challenging subject.

6.5 Open problem 5: physical information data augmentation

In real life, the distributions of real data and predicted data differ, and obtaining high-quality labeled data is crucial, so transfer learning (Tremblay et al. 2018; Bousmalis et al. 2018), multi-task learning, and reinforcement learning are indispensable tools for introducing physical prior knowledge.

In reality, many problems cannot be decomposed into independent sub-problems; even when they can, the sub-problems are connected by shared factors or shared representations. Decomposing such a problem into multiple independent single-task processes therefore ignores the rich correlation information among the sub-problems. Multi-task learning puts multiple related tasks together and shares the learned information between tasks, which is not available in single-task learning. Associative multi-task learning Thanasutives et al. (2021) can achieve better generalization than single-task learning. However, interference between tasks, different learning rates and loss functions across tasks, and the limited expressivity of the model make multi-task learning challenging in the AI field.

Reinforcement learning is a branch of AI that studies how an agent should act in an environment to maximize its expected return. The reasoning ability it provides is a key measure of intelligence, giving machines the capacity to learn and think for themselves. The laws of physics are a priori knowledge, and how to combine them with reinforcement learning is a challenging topic.
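One concrete way to inject a physical prior into reinforcement learning is potential-based reward shaping, where a potential function (here a distance-to-goal term, playing the role of a physical energy) densifies the sparse reward without changing the optimal policy. The chain MDP, potential, and hyperparameters below are invented for illustration.

```python
import random

# 1-D chain MDP: states 0..N, goal at N, actions move left (-1) or right (+1).
N = 10
GAMMA = 0.9

def potential(s):
    # Physics-style prior: potential increases as the agent nears the goal
    return -abs(N - s)

def shaped_q_learning(episodes=300, alpha=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N + 1) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        while s != N:
            if rng.random() < eps:
                a = rng.choice((-1, 1))
            else:
                a = max((-1, 1), key=lambda act: q[(s, act)])
            s2 = min(max(s + a, 0), N)
            r = 1.0 if s2 == N else 0.0
            # Potential-based shaping term F = gamma*phi(s') - phi(s)
            r += GAMMA * potential(s2) - potential(s)
            best_next = 0.0 if s2 == N else max(q[(s2, -1)], q[(s2, 1)])
            q[(s, a)] += alpha * (r + GAMMA * best_next - q[(s, a)])
            s = s2
    return q

q = shaped_q_learning()
policy = [max((-1, 1), key=lambda a: q[(s, a)]) for s in range(N)]
```

The learned greedy policy moves right toward the goal; the same tabular learner with only the sparse terminal reward would need far more exploration on longer chains.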

6.6 Open problem 6: system stability

In physics, stability is a performance requirement that every automatic control system must satisfy: the motion of the system should return to its original equilibrium state after a disturbance. In the field of AI, studying system stability means asking whether the output of the system can track its expected value, i.e., stability is analyzed with respect to the output Chen et al. (2023). Since an AI system is itself a dynamic system, its output also has dynamic characteristics. A neural network is a highly simplified approximation of the biological nervous system and can, in principle, approximate any function; from a systems perspective, the network plays the role of the system's output function, i.e., its dynamics. It simulates, at different degrees and levels, the structure of the human brain's nervous system and its mechanisms for processing, storing, and retrieving information. From the perspective of causality, interpretability and stability are intrinsically related: optimizing the stability of a model can improve its interpretability, thereby alleviating the difficulties that artificial intelligence currently faces in deployment.
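For a linearized dynamic system h_{t+1} = A h_t (e.g., a linear recurrent model), the return-to-equilibrium property above reduces to a spectral-radius check: the system is asymptotically stable iff all eigenvalues of A lie strictly inside the unit circle. A minimal check for the 2x2 case, with matrices invented for illustration:

```python
import cmath

def spectral_radius_2x2(A):
    # Eigenvalues of [[a, b], [c, d]] via the characteristic polynomial
    (a, b), (c, d) = A
    tr, det = a + d, a * d - b * c
    disc = cmath.sqrt(tr * tr - 4 * det)  # complex sqrt handles oscillatory modes
    return max(abs((tr + disc) / 2), abs((tr - disc) / 2))

def is_stable(A):
    # Discrete-time system h_{t+1} = A h_t returns to equilibrium after a
    # perturbation iff every eigenvalue lies strictly inside the unit circle
    return spectral_radius_2x2(A) < 1.0

stable = is_stable([[0.5, 0.2], [-0.1, 0.4]])    # radius ~0.47 -> stable
unstable = is_stable([[1.1, 0.0], [0.0, 0.3]])   # eigenvalue 1.1 -> unstable
```

The same criterion (on the Jacobian at a fixed point) is what makes trained recurrent models either forget perturbations or let them blow up.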

As a new learning paradigm, stable learning attempts to build on the consensus between these two directions. How to reasonably relax strict assumptions to match more challenging real-world application scenarios, making machine learning more trustworthy without sacrificing predictive ability, is a key problem for stable learning in the future.

6.7 Open problem 7: lightweight networking

Deep learning now plays a major role in AI, but it is constrained by the traditional computer architecture: data storage and computation are handled separately by memory chips and central processing units, so processing data is time-consuming and power-hungry. Introducing physical prior knowledge into the search space of neural architecture search (NAS) yields architectures that balance network structure against prediction quality Skomski et al. (2021), and modularity also plays a key role in physics-informed NAS (Xu et al. 2019; Chen et al. 2020; Goyal et al. 2019). At the same time, deep neural networks have complex structures and many hyperparameters, so training is extremely time- and energy-consuming and hard to parallelize. We should therefore draw on the physical structure and thinking behavior of the brain, add physical priors, break through the computing-power bottleneck, realize low-power, low-parameter, high-speed, high-precision, non-deep AI models, and develop more efficient artificial intelligence technology.
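One of the simplest routes to the low-parameter models called for above is magnitude pruning: zero out the smallest-magnitude fraction of weights and keep only the survivors. The toy weight list below is invented for illustration; real pipelines prune per layer and fine-tune afterwards.

```python
def magnitude_prune(weights, sparsity=0.5):
    # Zero out the smallest-magnitude `sparsity` fraction of weights; the
    # surviving parameters define a lighter model with fewer effective ops.
    # (Ties at the threshold may prune slightly more than the target fraction.)
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    threshold = flat[k - 1] if k > 0 else 0.0
    pruned = [0.0 if abs(w) <= threshold else w for w in weights]
    kept = sum(1 for w in pruned if w != 0.0)
    return pruned, kept

weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.03, 0.2, -0.002]
pruned, kept = magnitude_prune(weights, sparsity=0.5)
```

Here four of eight weights survive; a physics-informed prior could replace the plain magnitude criterion with one that protects parameters tied to known physical structure.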

6.8 Open problem 8: physics-informed federated learning

Privacy protection: The wide application of artificial intelligence algorithms brings not only convenience but also great risks of privacy leakage. Massive data is the foundation of artificial intelligence; it is precisely the use of big data, the growth of computing power, and breakthroughs in algorithms that have allowed AI to develop rapidly and be widely deployed. Acquiring and processing massive amounts of data inevitably raises the important issue of personal privacy protection Wang and Yang (2024). Artificial intelligence therefore needs to find a balance between privacy protection and capability.
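Federated learning is one concrete mechanism for this balance: clients exchange model parameters rather than raw data. The sketch below is a minimal federated-averaging loop under invented assumptions (noiseless client data drawn from the same line y = 2x + 1, plain unweighted averaging); it is an illustration of the communication pattern, not of any particular system.

```python
import random

# Each client holds private samples of the same underlying line y = 2x + 1;
# only model parameters, never raw samples, leave a client.
def make_client_data(seed, n=20):
    rng = random.Random(seed)
    return [(x, 2 * x + 1) for x in (rng.uniform(-1, 1) for _ in range(n))]

def local_sgd(w, b, data, lr=0.1, epochs=5):
    # Client-side training on private data, starting from the global model
    for _ in range(epochs):
        for x, y in data:
            err = w * x + b - y
            w -= lr * err * x
            b -= lr * err
    return w, b

def fed_avg(num_clients=4, rounds=20):
    clients = [make_client_data(seed=i) for i in range(num_clients)]
    w, b = 0.0, 0.0  # global model held by the server
    for _ in range(rounds):
        updates = [local_sgd(w, b, data) for data in clients]
        # Server aggregates by plain averaging of client parameters
        w = sum(u[0] for u in updates) / num_clients
        b = sum(u[1] for u in updates) / num_clients
    return w, b

w, b = fed_avg()
```

The global model recovers the shared line even though the server never sees a single (x, y) pair; in a physics-informed variant, each client's local loss could additionally penalize violations of known physical laws.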

Security intelligence: As AI spreads across all walks of life, the abuse or malicious subversion of AI systems can have a huge negative impact on society. In recent years, attack techniques targeting artificial intelligence algorithms, such as algorithm attacks, adversarial-example attacks, and model-stealing attacks, have continued to develop, posing growing security risks to AI. Realizing the security intelligence of AI is therefore a major challenge for the future.

6.9 Open problem 9: algorithmic fairness

While the rapid development of AI has brought benefits, it has also raised fairness issues, such as statistical (sampling) bias, the sensitivity of the algorithm itself, and discriminatory behavior introduced by human bias Pfeiffer et al. (2023). Since AI is an important tool for assisting human decision-making, improving the fairness of AI algorithms is a central concern for artificial intelligence Xivuri and Twinomurinzi (2021). Given the large scale of the data involved, important remedies include improving dataset quality, reducing the algorithm's dependence on sensitive attributes (e.g., by introducing fairness constraints), defining quantitative fairness indices and measures, and improving the algorithm's generalization ability Chen et al. (2023). In addition, human–machine symbiosis and algorithmic transparency are also important ways to achieve fairness.
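Defining quantitative fairness measures can be made concrete with one of the simplest such indices, the demographic-parity gap: the difference in positive-prediction rates between two groups. The predictions and group labels below are invented toy data.

```python
def demographic_parity_gap(predictions, groups):
    # Absolute difference in positive-prediction rates between groups "A"
    # and "B"; a gap of 0 means the classifier satisfies demographic parity.
    def positive_rate(g):
        members = [p for p, grp in zip(predictions, groups) if grp == g]
        return sum(members) / len(members)
    return abs(positive_rate("A") - positive_rate("B"))

preds  = [1, 0, 1, 1, 0, 1, 0, 0]          # binary decisions of a classifier
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
gap = demographic_parity_gap(preds, groups)
```

Here group A receives positive decisions at rate 0.75 versus 0.25 for group B, a gap of 0.5; a fairness-constrained training objective would penalize exactly this quantity. Demographic parity is only one of several competing criteria (equalized odds, calibration), and these cannot in general all be satisfied at once.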

The human–machine symbiosis of machine intelligence with human cognition, thinking, and decision-making, together with human inductive reasoning about the laws of the real world (physical knowledge), will be a future direction of development, and algorithmic transparency (understandability and interpretability) is an important tool for achieving fairness. The problem of algorithmic fairness is not to solve some complex statistical Rubik's Cube, but to try to project a Platonic ideal of fairness onto the walls of a cave that can capture only shadows. Continued deepening of algorithmic fairness research is therefore a key issue in AI governance.

6.10 Open problem 10: open environment adaptation learning

Today, the AI field largely rests on closed-environment assumptions, such as the i.i.d. assumption and the assumption that the data distribution is constant. Reality, however, is an open, dynamic environment subject to change. The learning environment of a neural network is a necessary condition for its learning process, and an open environment, as a mechanism for learning, requires exchanging information; future AI must therefore be able to adapt to its environment, i.e., be robust. For example, in autonomous driving Müller et al. (2018), the real world always presents emergent situations, especially rare scenarios, that training samples cannot simulate. The future development of AI must therefore overcome the "open environment" problem in data analysis and modeling, which poses a huge challenge to the adaptability and robustness of AI systems.
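Adapting to an open environment presupposes detecting that the data distribution has changed. A simple drift score is the two-sample Kolmogorov-Smirnov statistic, the maximum gap between empirical CDFs; the Gaussian toy data below is invented to mimic a shift between training and deployment.

```python
import bisect
import random

def ks_statistic(sample_a, sample_b):
    # Two-sample Kolmogorov-Smirnov statistic: the maximum gap between the
    # empirical CDFs of the two samples, used here as a simple drift score.
    a, b = sorted(sample_a), sorted(sample_b)
    def ecdf(sorted_xs, x):
        return bisect.bisect_right(sorted_xs, x) / len(sorted_xs)
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

rng = random.Random(0)
train = [rng.gauss(0, 1) for _ in range(500)]  # closed-environment data
same  = [rng.gauss(0, 1) for _ in range(500)]  # new data, same distribution
drift = [rng.gauss(2, 1) for _ in range(500)]  # the open environment shifted

score_same = ks_statistic(train, same)
score_drift = ks_statistic(train, drift)
```

A system can monitor such a score on incoming data and trigger retraining or fall back to a conservative policy when it exceeds a threshold; this is detection only, while the harder open-environment problem is responding to the detected change.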

6.11 Open problem 11: green and low carbon

With the development of AI, AI-enabled industries increasingly require a greener, lower-carbon footprint Liu and Zhou (2024). At present, the three cornerstones of AI, namely algorithms, data, and computing power, are scaling up rapidly and consuming ever more resources. Achieving green and low-carbon intelligence therefore requires doing "subtraction" Yang et al. (2024). At the same time, the deep integration of new energy vehicles, smart energy, and artificial intelligence also poses great challenges to green and low-carbon intelligence. On the one hand, more flexible network models must be built; on the other, more efficient and widely shared reuse mechanisms must be established to realize green and low-carbon goals at the macro level. In short, the five development concepts of "innovation, coordination, green, openness, and sharing" point the direction for the future development of AI and propose its fundamental principles.

6.12 Open problem 12: morality and ethics construction

At present, artificial intelligence has created considerable economic benefits for humanity, but the negative impacts and ethical issues of its application have become increasingly prominent Huang et al. (2022). Predictable, constrained, and behavior-oriented AI governance has become the priority proposition of the artificial intelligence era. Examples include the privacy protection of user data and information; the protection of intellectual achievements and algorithms; the excessive claims on portrait rights made by AI face-swapping; and accountability for autonomous-driving safety accidents. AI technology may also be abused by criminals, for example to commit cybercrime, produce and disseminate fake news, or synthesize fake images realistic enough to deceive the eye and ear. Protecting user privacy should be a guiding principle of AI development; only then can artificial intelligence give back to human beings and offer new hope for a new ethics between people and AI Akinrinola et al. (2024).

7 Conclusion and outlooks

After a long period of evolution, physics offers laws of knowledge that are diverse and powerful; inevitably, our current theoretical understanding is only the tip of the iceberg. With the development of artificial intelligence, deep learning and physics have become closely connected. Combining physical knowledge with AI is not only a driving force for progress in physical concepts but also promotes the development of a new generation of artificial intelligence. This paper first introduced the mechanisms of physics and artificial intelligence, then surveyed deep learning inspired by physics, covering the inspiration that classical mechanics, electromagnetism, statistical physics, and quantum mechanics provide for deep learning, and explained how deep learning solves physical problems. Finally, the challenges of physics-inspired artificial intelligence and thoughts on its future were discussed. Through interdisciplinary analysis and design across artificial intelligence and physics, more powerful and robust algorithms can be explored to develop a new generation of artificial intelligence.