Quantum-inspired event reconstruction with Tensor Networks: Matrix Product States

Tensor Networks are non-trivial representations of high-dimensional tensors, originally designed to describe quantum many-body systems. We show that Tensor Networks are ideal vehicles to connect quantum mechanical concepts to machine learning techniques, thereby facilitating an improved interpretability of neural networks. This study presents the discrimination of top quark signal over QCD background processes using a Matrix Product State classifier. We show that entanglement entropy can be used to interpret what a network learns, which can be used to reduce the complexity of the network and feature space without loss of generality or performance. For the optimisation of the network, we compare the Density Matrix Renormalization Group (DMRG) algorithm to stochastic gradient descent (SGD) and propose a joined training algorithm to harness the explainability of DMRG with the efficiency of SGD.


Introduction
Over the last decades, Machine learning (ML) techniques have developed into standard tools for data analysis strategies in high-energy physics (HEP). Due to the requirement of analysing vast, highly correlated data in order to exploit the full physics potential of the LHC, it becomes more and more important to develop a fundamental understanding of the data analysis methods applied. Jet substructure analysis is a particularly popular research area where analytic reconstruction techniques [1][2][3][4][5][6] co-exist with numerical multivariate analyses methods [7][8][9][10][11]. The combination of a large amount of available data with an excellent theoretical understanding of the underlying physics in collider phenomenology provides the ideal environment to explore novel reconstruction techniques and to improve our understanding of existing approaches.
A method of increasing popularity that is rooted in quantum mechanical concepts are tensor networks (TNs) or tensor network states [12,13]. The amplitude of a wave function in quantum mechanics can be represented as a matrix for the superposition of multiple states where TNs comes into play to describe complex quantum many-body systems [14,15]. TNs are Lego -like constructions where the connection between each Lego piece represents the entanglement between two or more states. A one-dimensional lattice, in this configuration, can be written as the Matrix Product States (MPS) [16][17][18][19][20] where more complex entanglements can be represented with tree tensor networks [21,22] or multiscale entanglement ansatz (MERA) [23,24]. It has been shown that tensor networks can be used to compress fully connected and convolutional networks to achieve more efficient results [25]. This study has been further expanded by using specialised MPS training techniques for image classification [26][27][28][29] and feature extraction [30]. MPS has also been used in unsupervised learning [31], for anomaly detection [32] and has been shown that it can produce comparable results to recurrent neural networks [33] 1 . Beyond MPS, there have been various ML applications with MERA [35,36] and in the 2D case with projected entangled pair states (PEPS) [37]. The transition to TNs also allows transferring the knowledge developed to understand and compute quantum many-body systems to ML. A particularly interesting feature of MPS is that the entanglement entropy of an MPS can shed light on the query of "what the network is learning?" [38].
Despite the ever-growing interest in TNs, surprisingly little work has been dedicated to the applications of TNs in HEP, with the exception of ref. [39]. This study will show that MPS can be used to discriminate top jets over QCD jets with comparable precisions to state-of-the-art classifiers and that the tensor network learns the volume and correlations of the projected geometry of topological relations in the data, which is reflected by the entanglement entropy of the network. This observation can be exploited to reduce redundant information in the input data, thereby reducing the complexity of the network while maintaining a high classification performance. Thus, we propose different network pruning and optimisation techniques relying on the entanglement entropy of the system.
Traditionally the MPS has been optimized by dedicated algorithms such as Density Matrix Renormalization Group algorithm (DMRG) [17,[40][41][42][43][44] or Time Evolving Block Decimation (TEBD) [45,46]. These algorithms are designed to reduce the degrees of freedom in the given wave function to find the ground state energy of the acting Hamiltonian. Additionally, due to their construction, such algorithms allow the network to adapt to the complexity of the problem by expanding or reducing the connections between the local states. Common neural networks, such as convolutional neural networks (CNN), recurrent neural networks (RNN) or deep neural networks (DNNs), are often optimised via stochastic gradient descent algorithms (SGD), which are known to be very effective methods to optimise a network. Such algorithms are designed to increase degrees of freedom in the network to map the given data on a higher-dimensional manifold. We will compare these two training methods and propose a novel way of combining the training algorithms to harness the adaptability and physical insight related to entropy entanglement of the MPS optimisation algorithm and the efficiency of SGD. We show that entanglement entropy calculated using DMRG can be used to identify redundant information in the data feature space and elucidate how the two training methods can be combined to one optimisation algorithm.
This study has been organised as follows; in Sec. 2 we will give a detailed introduction into tensor networks in mathematical form which will branch out into following subsections 2.1 and 2.2 where we will further discuss MPS and optimization algorithms. In Sec. 3 we will introduce our case study of top tagging and discuss our results. Finally we will conclude and summarize the study in Sec. 4.
1 An extensive review on tensor networks in ML applications can be found in ref. [34].

Tensor Networks
Tensors are general multidimensional objects which can describe the multilinear relationship between algebraic objects within a vector space. For this study, we will use Penrose notation (or tensor diagram notation) [47] to describe tensors as briefly shown in Fig. 1 2 . In this notation, a node without any edge describes a scalar, and each edge represents a higher rank object, such as one edge for vectors, two for matrices, where tensors can be rank N objects. The bottom two panels show two examples of algebraic operations with tensor diagram notation. Here, Einstein summation between two indices represented by connecting edges of nodes is assumed. Hence, the contraction between tensors A klm , B n lm and C ij kn has been shown by connecting edges with the same indices (for simplicity, indices are not shown in the figure). Similarly, an isometric matrix, V is connected by its conjugate V † , which leads to Kronecker delta. Note that the Kronecker delta tensor is often shown as a line without any node. Figure 1. Top panel shows the Penrose representation of the basic tensor forms such as scalar, vector, matrix and a rank tree tensor. Bottom two panels shows two examples of tensor operations where first notation shows tensor contraction between two rank tree (A klm and B n lm ) and one rank four tensor (C ij kn ) resulting a rank two tensor (Γ ij ). The bottom equations shows a tensor contraction of a isometric tensor V which leads to identity.
Tensor networks are defined as a graph that describes the connection between several tensors into a composite tensor as shown in Fig. 1 with Γ ij . The number of dangling edges describes the rank of this composite tensor. The Einstein summation has been applied to the connected edges where the leg between A and C indicates summation over index k. These are called auxiliary (or bond) dimensions of the network. The size of these connections indicates each tensor's influence on each other, which will be further detailed in the following sections. Such objects have been widely used in the theoretical description of quantum many-body systems [49], and in the design of quantum computing algorithms.

Matrix Product States
The Matrix Product States (MPS) or Tensor-Train (TT) [12,15,48,[50][51][52] is one of the most studied tensor network topologies, widely used to describe 1D entangled quantum many-body systems [15,41,48,[53][54][55]. The wave function for a general 1D lattice of N particles can be written as where local measurements on W p 1 ...pn can be held independently. A locally entangled state, on the other hand, can be factorized as sum of products of amplitudes, Here each tensor A is connected to the one on the right and the left (except the first and last one). Although all the combinations are shown with A, these tensors can be independent of each other. As before p i represents the physical states where α i shows the auxiliary indices between each tensor where the size of α i given as χ, as shown in the right-hand side of the equation in Fig. 2. Such factorisation effectively compresses the computational complexity of the system to O(N dχ 2 ), assuming all the tensors, A, have the same size of bond dimensions to the tensors on the left and the right. Note that this represents a state with open boundary conditions due to the fact that initial and final tensors are only connected to the tensor on the right and left, respectively. This structure ensures maximum entanglement between neighbouring sites, and entanglement decreases for the further sites depending on the size of the auxiliary dimensions. The entanglement between tensors is ensured via auxiliary dimensions where a larger bond dimension indicates a deeper influence of the node A i in the network. However, the size of these ancilla dimensions also increases the computational cost; hence it needs to be adjusted with respect to the necessary accuracy required from the system. Note that in MPS, the network's topology is restricted at one dimension; hence the influence of a feature can only be observed on the features on their right and left. The amplitude tensor can be unfolded to the form in Eq. (2.2) via Singular Value Decomposition (SVD) [56,57]. Each physical dimension, p i can be splitted as where U and V are isometric tensors (U U † = V V † = 1) with respect to the decomposition index and S are diagonal tensors of singular values per for factorization of Hilbert space dimensions. It is important to note that this decomposition is not unique. One can immediately observe that we could start the decomposition from the right-most leg of the W tensor and combine S tensors with V to form A. Equally, one could have started from the left-most tensor and absorb S with U to form A, which would give the same result after contraction as Eq. (2.3). Similarly, leaving S unabsorbed would not change the outcome either. Hence each S-tensor corresponds to gauge choices [15]. Although SVD splits amplitude into bitesize tensors, it is still quite expensive to do algebra with such an object for a large bond dimension. Trimming the singular values below a certain threshold, , allows using significantly smaller tensors. However, one pays the price of a reduced precision in the approximation of the full tensor. Recently, it has been shown that a fully connected neural network can be expressed via a tensor network [58] and even a convolutional network can be compressed into a tensor network structure [25]. A neural network defines an affine transformation between the feature space and the output of the layer, where W and B are weight and bias tensors, respectively and Φ(x) is the feature tensor. f l F C (x) denotes outputs of the network 3 where it can be further transformed and σ i are the activation functions. Each transformation refers to a layer which aims to find a different vector-space to represent the previous layer or input. Ref. [58] shows that compressing weight tensor via an MPS can drastically improve the performance by increasing the ability to represent the feature space with a single linear transformation instead of interconnected hidden transformations. Since the MPS asserts a linear relation between the features, it also increases the interpretability of the network.
In order to benefit from an MPS layer, one first needs to map the feature tensor in a tensor product form [26,27],  After the mapping, an MPS can be connected to each of the physical dimensions denoted by p i . Fig. 3 shows the contraction between Φ p 1 ···pn (x) and the MPS where red circles represent the outer product of the feature-tensor, and blue circles represent the MPS. The contracted tensor results in a rank-1 tensor (vector) f l (x i ) for a single example. The Born rule dictates that the square of a wave function is the probability of the measurement; hence for this study, we will use |f l (x i )| 2 as the prediction probability of the network. Note that the auxiliary dimensions, χ, holds the information regarding the entanglement between each site. For MPS, the entanglement defines the correlation of a site to the block on the left or right. As discussed before, although the size of the bond increases the computational complexity, it also enables each site to be entangled with further away sites rather than only the neighbouring sites. The influence of each bond can be measured by its entanglement entropy [38].
Using an MPS for classification also allows us to use the quantum theory built around MPS. Measuring the entanglement entropy on the MPS nodes allows us to interpret the correlation between two neighbouring features, and using this information, one can further adjust the network size or feature space mapping to achieve more efficient classification [14]. Using the Schmidt decomposition, one can write a bipartite quantum state via its orthonormal basis. The Schmidt decomposition can be achieved via SVD for the centre-of-orthogonality tensor by separating the eigenvalues from the node, Here λ i are positive definite singular values (Schmidt coefficients) in the S tensor, and as before |U X , |V Y is the orthonormal basis (Schmidt basis). The entanglement entropy then can be calculated via the von Neumann entropy [59], The entanglement entropy indicates how strongly two sites are connected, where zero means the two sites decouple. The entropy is bounded from above by the area law where for each site S ≤ log χ [60]. The normalized singular value tensor, S, has all the singular values in descending order where λ 1 = 1. If there is no other singular value but λ 1 , the entropy of the node will be zero, which means that it does not hold valuable information for classification [61]. As will be demonstrated in Sec. 3.2, if neighbouring sites have similar entropy, only the one with largest entropy can be kept to shrink the feature-space without loss of generality. Whilst this gives extensive interpretability of the network, it also allows us to optimize the network by eliminating redundant features. Figure 4. Two site correlation of the MPS, shown with blue nodes and |W , with respect to an operator, O also represented via purple triangles. i(j) represents the location of the sites to which the operator has been applied, and l is the label index.
Beyond entanglement entropy, two site correlations of the MPS can be calculated depending on an operator. This operator can be chosen from the particular group that the data embedding is based on; for instance, the spin states are based on the SU (2) group, and the generators of this group are Pauli matrices. Fig. 4 shows the two-site normalised correlator where the operator has been shown by O and i(j) represents the sites that the operator has attached. The right-hand side of the equation shows the same in Tensor Diagram notation, where purple triangles represent the operator and its Hermitian conjugate. C l ij = 1 (−1) denotes fully (anti-)correlated sites on a given basis, where one can discard one or the other site without loss of generality. C l ij = 0, on the other hand, indicates no correlation, which means that both sites bring valuable information to the process at hand. As before, l indicates the prediction label; hence the correlation can be calculated with respect to signal or background. This property of TNs has also been exploited in ref. [39].
MPS have been optimized using dedicated algorithms like the DMRG [17,[40][41][42][43][44], TEBD [45,46] and recently, it has been shown that usage of alternating least squares methods in MPS minimization can be quite efficient [62]. All these methods are designed to contract a given Hamiltonian with a 1D lattice wave-function, shown in Eq. (2.1), and find the ground state energy efficiently. Ref. [26] shows that it is possible to use DMRGlike updating algorithms for classification tasks as well, which will be discussed in Sec. 2.2. However, neural networks traditionally updated using SGD-based backpropagation algorithm, which can also be used for MPS type of networks [29]. For the SGD-based backpropagation, an efficient contraction method is needed. Since all the components of the network cannot be contracted at once, one needs to use the so-called bubbling method [12] to contract each site. Fig. 5 shows the bubbling sequence for such contraction. As before, red circles represent the training sample preprocessed in the form of Eq. (2.5), and the blue circles represent the MPS. The sequence evolves from the top frame to the bottom, as shown with the dashed enclosing line, where the first site on the left contracted with the first MPS node, then in the second step the second MPS node is contracted with the second site. In the third step, while the third site has been contracted with the third node, two previous contractions have been merged. This sequence continues until all the tensors are contracted, and the final f l (x) have been reached 5 . Note that, although the label edge is placed at the centre of the MPS in Fig. 5, the replacement of it will not affect the result; as a matter of fact, such bubbling algorithm will be most efficient if the label edge is placed at the rightmost tensor. This will allow the contraction order to be constant until the last node. Once the full contraction is complete one can calculate the loss function and each tensor can be updated via a specific SGD algorithm such as Adam [64].
Whilst such an approach leads to a very efficient algorithm, it lacks the adaptation features of the DMRG algorithm. It has been shown that a DMRG-like updating algorithm allows a network to automatically adjust its bond dimension for the complexity of the problem [26]. In Sec. 2.2 we will discuss how such updating scheme is possible.

Learning through Density Matrix Renormalization Group Algorithm
A DMRG-like MPS updating algorithm for classification purposes was first proposed in ref. [26]. In this algorithm, tensors are updated two-by-two for the reasons that will be shown below. Originally, the DMRG algorithm is used to find the ground state energy of a Hamiltonian acting on a 1D lattice where the algorithm sweeps the lattice from one side to the other and optimizes the energy by updating the site its currently on. It iteratively reduces effective degrees of freedom to emphasize the most prominent features. The same principle can be applied to classification tasks.
Node I Node II Sweep direction to the left: Node I Node II Figure 6. Schematic representation of two-site DMRG-like algorithm used for updating tensor network. I -III shows the forward-pass where the loss function is calculated and IV -VI shows the backpropagation where two nodes of the MPS has been updated using gradient decent and splitted back into its original two node structure via SVD. Fig. 6 shows the step-by-step evolution of a DMRG-like updating algorithm where I-III represents the forward pass, and IV-VI represents the backpropagation portion of the algorithm. First, let us assume that we have a network in the form of Fig. 3. In order to update two nodes, one needs to calculate the gradient of the loss function with respect to the contraction of these two nodes. In step I, we calculate the contraction of these two tensors resulting in a rank-3 tensor, B s 2 p 1 p 2 . For the sake of simplicity, this procedure has only been shown for first two tensors. Here, s 2 represents the auxiliary leg between the second node and the rest of the MPS. Then, in step II, we contract the rest of the MPS with Φ p 3 ···pn (x (i) ) via the bubbling algorithm shown in Fig. 5 and take the outer product with the two sites that are not connected to the MPS, resulting in a rank-4 tensor, Γ ls 2 p 1 p 2 . Here, i represents a particular training example, where each training example goes through the same process. Finally, on step III, the prediction, p(l, x), is calculated by contracting Γ ls 2 p 1 p 2 with B s 2 p 1 p 2 and the objective function, L, can be calculated using the entire training batch, which completes the forward pass of the algorithm. It is important to note that, in this study, we will use the tensor network as a Born Machine, where the square of the state, i.e. the wave-function, corresponds to the probability of the prediction.
For the backpropagation portion, one first needs to calculate the gradient of the loss function with respect to B s 2 p 1 p 2 and the contracted tensor can be updated via gradient descend, shown in Fig. 6 panel IV. Here η stands for the learning rate, which controls the step size of the gradient descent. Choosing a small learning rate may cause a prolonged convergence to the desired outcome; a large learning rate, on the other hand, may lead to fast convergence but can skip over the global minima. Hence, its value needs to be adjusted throughout the training.
In the following, the updated tensor,B s 2 p 1 p 2 , is decomposed using SVD to replace the initial two nodes. This procedure continues step by step until the last tensor on the right is updated. Then, the algorithm starts this time from the right side and applies the same algorithm until it reaches the starting position. As discussed earlier for Eq. (2.3), there is not a unique way of contracting the singular values, S, with the eigenvector tensors. As shown in Fig. 6 panel VI, we choose to contract S to the tensor on the right while sweeping is done towards the right direction, and S are contracted with the tensor on the left for the left sweeping direction.
As a virtue of SVD, the size of the auxiliary dimensions is modified during each update. This allows the network to increase the reach of the entanglement of each site, effectively increasing the entanglement entropy. Hence, the network grows automatically without the need to explicitly determining the size of each bond. However, an upper limit to the number of allowed singular values is necessary; otherwise, the network will grow to an unmanageable scope for the computer's memory storage device. Alongside the maximum limit for the number of singular values, the maximum truncation error can be used to limit the growth of the network where the truncation error is determined by comparing the Frobenius norm 6 [65] of the initial and final tensors. This will allow the network to shrink by discarding the small singular values to satisfy the maximum error condition. One can also achieve a similar effect by explicitly imposing a threshold for the value that singular values can take.
Whilst the DMRG-like updating algorithm reduces the need for hyperparameter optimization, it does not allow an update via SGD algorithms [26]. Hence in this study, we propose a combined updating scheme where each epoch starts with the DMRG-like algorithm on a batch and optimizes the network for a limited number of sweeps. This will allow the network to expand with respect to the complexity of the classification problem. Then for other batches, the network is updated using the SGD with the bubbling algorithm, shown in Fig. 5. 6 Frobenius norm is defined as where M is a rank-N tensor with indices i1···n.

Top Tagging Through Matrix Product States
The energy deposits of particles in the electromagnetic and hadronic calorimeters in AT-LAS and CMS experiments have long been used to reconstruct the underlying event to understand the physical system better. It has been repeatedly shown that mapping these energy deposits on a modified η − φ-plane and analyzing them with a CNN can efficiently discriminate top quark signal over QCD background. In such mapping, each pixel on η − φ-plane corresponds to one or more particles depending on their distribution through the detector geometry. A CNN algorithm can pick up the transitionally invariant features of the underlying physics. Due to different particle clusters occurring on the η − φ-plane, each pixel is only closely correlated with the close-by pixels. The η − φ-plane can be mapped onto an entangled 1D lattice. Since each pixel is only closely correlated with the neighbouring pixels, an MPS can efficiently classify such images. For this study, we will use the preprocessing prescription used in ref. [66] and demonstrate the usage of MPS for top versus QCD jet discrimination. The implementation of this study can be found at this link 7 .

Dataset & Preprocessing
In order to demonstrate the usage of the MPS-classifier, we will use the dataset provided in [10,67]. This dataset consists of top signal and QCD background generated at 14 TeV centre-of-mass energy. The parton level events have been showered in Pythia 8 [68] and the detector simulation has been obtained via Delphes 3 package [69] using the default ATLAS detector card. The so-called fat-jets are reconstructed via anti-kT algorithm [70] with R = 0.8 which is employed in FastJet [71] package. The transverse momentum of these jets are limited to [550, 650] GeV range with |η| < 2. In order to label the dataset, a parton matching with truth level tops have been applied where ∆R(j, t truth ) < 0.8 have been required from each event to be considered as signal. The complete dataset includes 1.2 million training, 400,000 validation and test events respectively.
Following the prescription presented in ref. [66], the training images are generated by further preprocessing the dataset where each image includes the leading fat-jet (reconstructed as described above). The constituents of the leading fat-jet have been centered with respect to p T weighted centroid where the jet vector shifted to the centre of η − φplane. The coordinate system has been shifted to align with the direction of the positive η. Furthermore, all the partonic activity has been collected into a 37 × 37 pixelated frame where both η and φ spaning within [−1.5, 1.5] range. The lower half of each pixellated image has been flipped to the top if the total p T of the lower half is greater than the upper half. The same process applied to the left half of the image to ensure that the first quadrant always has the highest p T . In order to simplify the problem, eight pixels from each side of the image have been cropped, and the resulting image has been downsampled by averaging each four-square pixels. Fig. 7 shows the downsampled image where the left panel shows the top signal and the right panel shows the QCD background. Each image has been averaged from 5,000 events.   In order to make use of the MPS-classifier, the image data was reshaped to a 1D lattice with an S-shaped mapping, as shown in Fig. 8 where the MPS sites follow the red line where the left panel shows a η-based ordering, the right panel shows a φ-based ordering. Note that this representative figure is shrunk to 5 × 5 just for visual simplicity. This S-shaped reshaping ensures that the alternating edges of the image are in proximity to maximize the entanglement 8 . After normalizing images by 650 GeV, each pixel has been mapped onto a hypersphere via where in two dimensions (D = 2) φ p i (p i T ) reduces to [cos(p i T π/2), sin(p i T π/2)]. Herep i T corresponds to the modified transverse momentum within the pixel i.

Network architecture & training
To capture the MPS-classifier's capability, we will demonstrate our results for different training procedures and compare the outcomes to the state-of-the-art CNN results. Our MPS classifier relies on TensorFlow version 2.4.1 [72,73] and TensorNetwork version 0.4.3 [74] and the CNN relies on Keras package [75] embeded in TensorFlow.
As a baseline, we choose to use a CNN architecture presented in ref. [66]. The architecture takes in 10 × 10 image pixels with a batch size of 128. The images are then passed through a convolutional layer with eight features and four stride pixels, including zero padding. After a batch normalization layer, the latent image is passed through a max-pooling layer where the size of the image is dropped to 5 × 5 pixels. Then the resulting latent image has been flattened and passed through a fully connected layer with sixteen nodes. For each layer ReLU activation function has been used. The network has been trained for 500 epochs and learning rate has been decayed from 0.01 every 20 epochs if validation loss has not been improved. In all the networks presented below, we used the cross-entropy loss function, where N represents the number of events in a batch. In order to compare with the CNN, we first mapped our η-based ordered images onto a 10D hypersphere as shown in Eq. (3.1). Then the network has been trained via the DMRG algorithm. We initially assumed no entanglement between the lattice nodes within the quantum state, and allowed it to expand up to 20 auxiliary dimensions. In order to avoid getting stuck in a local minimum, the growth of the network has been done gradually where in the first epoch the network has been allowed to expand up to 10 auxiliary dimensions and then each epoch this value increased by 1.5 times its size, eventually reaching 20. Without such gradual growth, we observed that the network could not reach the presented stage within a limited amount of epochs. The MPS-classifier has been trained for 200 epochs with a batch size of 10,000 events. The learning rate was chosen to be 0.0001 and decayed to its half every 20 epochs if no improvement has been observed in the validation loss. Upper left panel of Fig. 9 shows the receiver operating characteristic (ROC) curve for MPS-classifier (blue) versus CNN (green). The statistical uncertainty has been calculated by splitting the test set into batches of 50,000 events, where shaded area around each curve represents one standard deviation. Similarly the right panel of Fig. 9 shows the classification output of both networks where the blue solid (dashed) line represents the background for MPSclassifier (CNN) and the red solid (dashed) line represents the signal. Despite the large The bottom left panel of Fig. 9 shows the entanglement entropy of the MPS classifier and the right panel shows the Schmidt coefficient values, calculated as shown in eqs. (2.6) and (2.7). One can immediately observe that the entanglement entropy follows the pixel unfolding procedure shown in the left panel of the Fig. 8. Additionally, entropy follows a grouped pattern where instead of a sharp increase or decrease, it preserves similar entropy among around 10 pixels before changing the entropy value. This shows that not all pixels are equally valuable, where one can avoid using the pixels with the same entropy values hence shows a path for feature compression. One can observe that the network takes the same entropy value mainly on the left or right side of the image and changes at the centre. This is because most of the information is at the centre of the image and sides are generally empty; hence information has been propagated through the pixels that do not posses additional information. Note that the bottom-left pixel is empty simply because each node is entangled with the node to its right (or left); due to non-periodic boundaries in MPS the last node does not have any connection on its right. In the right panel, we present the Schmidt value distribution in each node, used to calculate the entanglement entropy. In this particular example, we observe that Schmidt values can be smaller than 10 −14 . As can be seen from the plot, the number of Schmidt coefficients decreases towards the edges of the MPS. This is due to the canonical structure of the MPS with non-periodic boundaries, where the number of auxiliary dimensions increases towards the centre. Each auxiliary dimension would be the same for an MPS with periodic boundaries, assuming no Schmidt value trimming has been imposed.

Feature space and network compression
Whilst a large number of trainable parameters gives the network chance to sample various possibilities to achieve the optimisation task at hand, it also leads to a vast loss hypersurface to explore. It can be challenging to explore and find a global minimum in such an ample parameter space, even with advanced optimisation algorithms. Hence, it is vital to optimise a given hypothesis's hyperparameters to achieve faster, if not better, convergence to a global minimum. Beyond the convergence, this will also lead to a quicker inference after the training.
The entanglement entropy and the Schmidt coefficients are especially valuable to provide a path towards network compression. One can use them to design a compressed feature space or limit the network size to achieve faster predictions. Fig. 10 shows three different methodologies that we adapted to reduce the size of the network while quantifying the effects of the compression on the precision of its network output. On the upper left panel, we show the effect of applying the minimum Schmidt value threshold (λ min ) on the network after the training. Following the bottom right panel of Fig. 9, we applied λ min = 10 −3 , 3 × 10 −3 , 5 × 10 −3 and 10 −2 shown with dashed red, cyan, green and purple ROC curves respectively. Each step number of trainable parameters dropped from 390500 to 204310, 91690, 32990 and 18020, respectively. Although there is a considerable 76% drop in the number of trainable parameters, we do not observe a significant change in the precision until we reach λ min = 3 × 10 −3 .
On the top right panel of Fig. 10, we applied an entanglement entropy-based compression of the network. The change in entanglement entropy signals the contribution of information that can be used for the classification task. If ∆S does not change along the different sites, one can eliminate the features with the same entanglement entropy while maintaining a similar precision. Hence we binned the entanglement entropy values with respect to the change in entropy denoted by ∆S where the entropy change between the first and last element in the bin corresponds to given ∆S value. Among these bins, we chose the nodes with the most significant entanglement and formed a new network. We present the ROC curves with ∆S = 0.025%, 0.05%, 0.1% and 0.2% following red, cyan, green and purple ROC curves which reduced the network size to 83, 75, 64 and 54 nodes.
Finally, at the bottom panel of Fig. 10 we adopted an entropy-threshold-based compression where only the nodes with entropy greater than indicated have remained in the new network. We present the results from the network with S ≥ 0.055 and 0.056 presented with red and cyan ROC curves. These thresholds dropped the number of sites in the network to 95 and 94, respectively.
Comparing all three compression methods presented reveals the nature of the DMRG training algorithm. This method spans the geometrical structure of the feature space and attempts to understand the entanglement in the 1D lattice wave function. The number of auxiliary dimensions shows the length of entanglement in each node when it is reduced, as shown in the upper left panel of Fig. 10, the sites becomes less entangled, and the outcome becomes less precise. Entanglement based restriction shows the location of the stored information where the information regarding the 3-prong structure of top has been spread towards the outskirts of the MPS; hence removing these nodes degrades the precision of the network at high tagging efficiency region. The information regarding the dipole structure, on the other hand, has been preserved at the centre of the network, where the entanglement entropy is at its larges. While removing the nodes with low entanglement on the edges does not affect the dipole substructure information, the ∆S method effectively removes the nodes in each region, including the region with the highest entropy, which effectively reduces the precision in low top tagging efficiency. This exercise reveals that the DMRG algorithm is capable of learning the geometrical structure of the given data. Combining the compression methods of ∆S and λ min , we formed a new network and retrained it from scratch. Fig. 11 shows the ROC comparison between the original network, compressed version and the trained version. Initially, the original network (shown with the solid blue curve) compressed by imposing ∆S = 0.2% and limiting the Schmidt coefficient values above 3 × 10 −3 (shown with the dashed red curve). This compression limited the network to only 54 sites (or pixels) and 43410 parameters. Then we trained this network for another 50 epochs with a learning rate starting at 10 −4 and decaying as before. During the training, we set a rigit Schmidt threshold at 3 × 10 −3 . This resulted in further compression of the network to 34160 trainable parameters-the ROC curve after 50 epochs presented with the dashed green curve. Although the number of parameters of the new network is only 8% of the original one, it managed to achieve very similar precision. This shows that the entanglement entropy holds the crucial information behind what the network learns.

Training algorithm comparison
The MPS-classifier's prediction can also be calculated using the SGD algorithm. Note that, although we also use a mini-batch SGD type backpropagation in the DMRG algorithm, henceforth SGD acronym will only refer to a neural network type of backpropagation where all the parameters are updated simultaneously using a sophisticated algorithm such as Adam [64]. In order to compare these algorithms in a more straightforward framework, we reduced the hypersphere mapping to 2D and investigated the effects of both η and φ based ordering for sake of completeness. Although η-based ordering has been shown to have a significant impact on the classification results, the pixels do not possess a correlation in the vertical axis, which might introduce different properties. Whilst DMRG training has been done as before, SGD training has been achieved by the Adam algorithm [64]. After the contraction of the MPS-classifier, the gradients of each node in the MPS has been calculated with respect to the objective function. We observed that the normalised gradient tensors lead to more stable results. Since the SGD algorithm cannot adapt the network with respect to the complexity of the problem, we initialised the network with ten bond dimensions per node, and the network has been canonicalised to limit the number of trainable parameters prior to the training, in agreement with the DMRG construction. The rest of the training hyperparameters are set equal to the DMRG algorithm. Furthermore, to harvest the ability of both algorithms, we set up a DMRG+SGD algorithm where each epoch applied the DMRG algorithm to the first batch for tree sweeps, and the rest of the batches are trained with SGD. This combination gave the network the ability to adjust its size with respect to the complexity of the problem. Hence, we assumed no entanglement in the network prior to the training, i.e. the network can grow without restrictions except from an upper limit on the bond dimensions. Fig. 12 shows the ROC curve for all the networks and training algorithms presented above, where blue, red and orange curves represent DMRG, DMRG+SGD and SGD algorithms. The left panel shows results for η-based ordered MPS, and the right panel shows the same for φ-based ordered MPS. Compared to the 10D hypersphere, we observe a slight decrease in the generality of MPS classification results; however, we only observe slight differences between the results of the different training algorithms. All algorithms seem to be able to discriminate the dipole type substructure with equal performance, and the difference seems to be more towards discriminating 3-prong substructure. This directly shows the effect of the local entanglement in MPS which is directly linked to the lattice order.
Although there is no significant difference in the classification results, we observe large variations in what the training algorithm captures. Fig. 13 shows the entanglement entropy per site for each algorithm following the same colour scheme as before, and the left (right) panel shows the entropy for η(φ)-based ordered image. As observed in Fig. 9, the DMRG algorithm is trying to set seemingly similar entropy distribution for each site where its peaking at the centre of the image, the area corresponding to the most hadronic activity. Compared to η-based ordering, φ-based ordering seems to capture two prongs in the vertical axis, hence the increased entropy between sites 40 and 60. Compared to the DMRG, on the  other hand, the SGD algorithm seems to capture much more information which also reflects to the DMRG+SGD algorithm. This approach can capture larger entropy since SGD attempts to increase the degrees of freedom of the network, it does not capture irrelevant pixels. Although this is a desirable feature in terms of classification performance, it limits the interpretability of the network. The SGD algorithm is susceptible to the rare energy fluctuations on the edges of the image; however, the DMRG algorithm seems to learn not to rely on non-frequent data. Whilst this does not affect the performance between DMRG and SGD algorithms significantly in this particular example, the SGD algorithm might be superior to DMRG for the cases with large fluctuations in the data. This effect is also captured in Schmidt coefficients observed in two algorithms; upper and bottom part of the Fig. 14 shows the Schmidt coefficient distribution for η and φ-based ordering. The left panel of both images shows it for the DMRG algorithm, where the right panel shows for the SGD algorithm. For both orderings, the SGD algorithm seems to have similar Schmidt coefficient distribution where all the singular values agregate above ∼ 10 −2 . The DMRG algorithm, on the other hand, seems to preserve the information embedded by the geometrical structure of the ordering where the singular values are peaking at the centre of the image, and the nodes are less and less expressive at the beginning and the end of the network. This vital information holds the key towards compressing the network and eliminating the less expressive connections between nodes as shown before. The DMRG algorithm not only learns data to achieve the best possible minimum for the loss function but also captures the physical properties of the data reflected in the entanglement entropy and Schmidt distribution. This also shows that in a less correlated feature space SGD would be expected to perform better than DMRG.
Comparing ordering schemes shows a couple of main differences between them. Whilst the entanglement entropy captured by the DMRG algorithm is more or less the same for both orderings, the entropy change (∆S) in φ-ordering is mostly effective between sites 40 − 60. This allows for more significant feature space compression with respect to η-based ordering. Similarly, Schmidt values for DMRG, presented in Fig. 14, shows that φ-based ordering requires less precision to achieve the same accuracy as η-based ordering. This is due to the alignment of the image where most of the hadronic activity is happening between 0 − 6 η pixels, see Fig. 7, which is captured by the DMRG algorithm by decreasing the Schmidt values after the 50 th site. Hence φ-based ordering can be used to achieve more extensive compression of both feature space and network without loss of generality.
As mentioned before, two-site correlations of the sites can be helpful to unfold the properties of these algorithms even further. Fig. 15 shows the normalised correlations of the MPS sites that trained with the DMRG algorithm (on the left) and SGD algorithm (on the right) with respect to the third Pauli matrix, σ z . Due to the nature of the DMRG algorithm, we observe many (anti-)correlated pixels, which essentially coincides with the entanglement entropy distribution. As shown in ref. [39], this information can be used alongside entanglement entropy to further compress the network. However, the network that is trained via SGD hasn't shown any significant correlation between pixels compared to DMRG-based training, where the largest (anti-)correlation reaching up to ±0.8. This shows further support for the claim of the nature of these algorithms; as mentioned before, the DMRG algorithm is actively decreasing the degrees of freedom by iteratively selecting the most prominent features by correlating them while the SGD algorithm is trying to actively increase the expressibility of the network. Although this renders the results of the SGD algorithm hard to interpret, it might be beneficial in challenging optimisation problems.
During the training, we did not observe overtraining due to the large batch size. The overtraining has only been observed when the batch size reduced below 500. The SGD algorithm, seems to reach a well-trained level at around 200 epochs; however, the DMRG algorithm requires more extended training time to reach that level. Hence, the results presented for the DMRG algorithm are all somewhat undertrained. On the other hand, the DMRG algorithm reaches a good discrimination level within couple epochs, and during the rest of the training it only slightly decreases the loss value which has also been captured in other studies [26,29]. Hence, the DMRG algorithm is suitable to achieve decent results with little training time, but it requires a longer training time to reach its full potential. Despite the interpretability, TN based applications can come with a significant toll on computation time. To assess the timing, we measured the elapsed time for bond dimensions 10 and 20 with 2D data embedding. As above, batch size of 10, 000 events has been used for each training algorithm. The benchmarks have been collected for NVIDIA Tesla V100 16GB GPU, where the algorithms are parallelized only with respect to the training samples. A complete sweep (starting from the left node and ending at the left node) for the DMRG algorithm with 10 (20) bond dimensions and 100 sites have been measured to take 6 (6.02) seconds on average after complete compilation of the algorithm. The calculation of a single batch on the SGD algorithm, on the other hand, took 0.46 (0.47) seconds for a maximum bond dimension of 10 (20). Note that although these benchmarks are from fully parallelized algorithms, it is possible to make them more efficient. For instance, the DMRG algorithm presented in this study has been optimized to preserve the canonical form of the MPS throughout the training, which means that each update has only been applied to the centre-of-orthogonality tensor. However, it is possible to write a much faster algorithm by contracting adjacent tensor pairs and updating them simultaneously. This will allow more NN-like backpropagation but will not preserve the canonical form of the network. Similarly, the SGD algorithm presented in this study does not use the efficient Keras integration in order to manipulate network structure and use normalized gradients to update the network. More efficient Keras-based TN classifiers can be found in TensorNetwork [74] library 9 .

Conclusion
Tensor Networks are non-trivial representations of high-dimensional tensors. They have developed into powerful numerical tools to perform highly sophisticated calculations of complex quantum systems. In the context of machine learning Tensor Networks have been shown to be ideal vehicles to connect quantum mechanical concepts to machine learning techniques, thereby facilitating an improved interpretability of neural networks.
In this study, we applied specific TNs, i.e. Matrix Product States, to classify LHC pseudo-data as top quark or QCD-induced processes. We compared state-of-the-art convolutional network-based tagging algorithms to such MPS and showed that it is possible to achieve similar results to the CNN, while gaining a deeper insight into how and what the network learns.
The DMRG algorithm has been designed to automatically adapt the network architecture depending on the complexity of the problem at hand, which reduces the need for hyperparameter optimization. We showed that whilst SGD can extract more information from the network in terms of entanglement entropy and two-site correlations of the pixels, DMRG learns the geometrical structure of the data and the correlations between the sites on the 1D lattice. Whereas no significant performance difference has been observed between the two optimization algorithm, DMRG led to more interpretable results. Despite the fact that the MPS only captures one dimensional correlations between pixels it can achieve similar precision as the state-of-the-art CNN classifier.
We also proposed an adaptable algorithm for training MPS by merging DMRG and SGD algorithms within one training sequence. This allowed us to harvest the adaptability of DMRG and the efficiency of the SGD algorithm. However, we observed that since the goal of these algorithms is entirely different, one needs to be mindful while training the network. While the SGD algorithm aims to increase the individual expressivity of each node by increasing the degrees of freedom to achieve high-performance classification, DMRG is trying to reduce the degrees of freedom of the neighbouring nodes to construct a locally entangled network. Hence, the fraction of DMRG sweeps during the training becomes a crucial hyper-parameter where if not adjusted correctly, the network performance can be less than the network separately trained by DMRG or SGD.
Further, using entanglement entropy, one can devise algorithms to effectively compress the network to reduce the decision making time during prediction, which can be beneficial for experimental analyses [39]. Additionally, since the MPS only expresses local entanglement, different pixel reordering schemes are very effective to increase the classification power [28]. It has also been shown that MPS can be used to pre-train the Quantum Machine Learning (QML) networks to boost the convergence of the algorithm [80]. One shortcoming of the MPS approach is that it cannot fully represent the data contained in the event image, as the pixel have to be represented as a 1D latticised chain. However, such limitation can be mitigated by using 2D tensor networks such as Projected Entangled Pair States (PEPS) to exploit more information from an image [37] or tree tensor network structure can be used to embed non-trivial connections between the features [39]. Such classical methods based on TNs can also be used to estimate the limitations of the algorithms in QML applications [81,82].
The development of novel, yet explainable, data analysis methods is of crucial importance for upcoming LHC runs, in particular during its high-luminosity phase. In the context of high-energy physics Tensor Networks are rather unexplored techniques that show promising properties which can complement or even replace existing machine learning approaches.