Heuristic optimisation of multi-task dynamic architecture neural network (DAN2)

This article proposes a novel method to optimise the Dynamic Architecture Neural Network (DAN2) adapted for multi-task learning. The multi-task learning neural network adopts multi-head and serial architectures with DAN2 layers acting as the basic subroutine. Following a dynamic architecture, layers are added consecutively, starting from a minimal initial structure. The optimisation method adopts an iterative heuristic scheme that sequentially optimises the shared layers and the task-specific layers until the solver converges to a small tolerance. Applications to simulated and experimental datasets demonstrate the applicability of the algorithm. Results comparable to Artificial Neural Networks (ANNs) have been obtained in terms of accuracy and speed.


Introduction
In the field of process engineering, Artificial Neural Networks (ANNs) are often adopted as approximators for nonlinear processes. They have been widely adopted in fault detection, signal processing, process modelling, and control [1]. While the traditional form of the ANN is used in mainstream literature to solve prediction problems, alternative forms of ANN have been proposed to explore potential variations of network architecture. One example is the DAN2 network [2]. In this network, the nonlinear transformation is achieved by sine and cosine functions. Moreover, instead of arranging the linear and nonlinear components of the network sequentially, the model adopts a parallel structure to achieve regression and classification tasks. In this article, we explore a modification of the DAN2 network to give this novel alternative architecture dynamic multi-task learning functionality.
The advantages of the DAN2 architecture are threefold: (1) the model trains on the entire in-sample dataset collectively, allowing more effective capture of data patterns; (2) the model is highly scalable, allowing layers to be added by continuously and dynamically updating only five parameters; (3) the model removes the ''black box'' notion associated with neural network models by producing a closed-form set of equations at the end of each step of the training process. With such advantages, it is worthwhile to expand the potential areas of application through modifications of the network, such as making it multi-task.
The DAN2 architecture has two limitations. First, it can only solve single-output regression problems. This is problematic because some engineering processes require multivariate outputs. For example, an adsorption process may involve outputs such as recovery rate, purity and energy consumption. Finding an optimised multi-output approximator is therefore an important task. A suitably structured ANN capable of multi-output forecasting, i.e. of multi-task learning, is thus paramount for efficient monitoring of a process.
Apart from the challenge of predicting multivariate outputs, the second limitation of the current DAN2 algorithm is the unresolved definition of the architecture, i.e. it is unclear how many dynamic layers are to be used in the network. Overcoming this limitation entails architecture optimisation to decide on the number of layers the model should contain. The architectural optimisation of neural networks is often difficult, as it involves predicting the number of layers suitable for a particular dataset. Popular optimisation methods include pruning, which starts off with a larger-than-necessary network and then reduces the network size by disconnecting branches of linkages [3][4][5]. Whether the size reduction is optimal is an ongoing research topic [6,7]. Pruning a network has some weaknesses. First, training time is spent on optimising a network larger than necessary, generating wasteful computation [8]. Second, the architecture found is often stuck at one of the intermediately sized solutions, and the smallest network is not found [9].
The reverse of this process is to adopt a dynamic model, under which the network is expanded layer by layer from some basic initial structure until a tolerance is reached. A dynamic architecture has several advantages. First, it allows lifelong learning, where the responses of a network evolve with a new feed of input data [10]. Traditional neural networks adapt to changing inputs by varying the parameters of the network, but the overall architecture stays unchanged. With a dynamic structure, it is possible to alter architectural hyperparameters in response to a learning question. In chemical processes such as pressure swing adsorption, for example, a dynamic architecture is advantageous because a change in the process conditions instigates corresponding changes in the architecture of a predictive network. Second, empirical evidence has demonstrated that a dynamic architecture trained with local adaptation methods works as effectively as, and sometimes outperforms, a fixed network re-trained on the new dataset, as the advantage of eliminating unnecessary parameters outweighs the disadvantage of local incremental learning [11].
This article discusses a modification of the DAN2 network, with a focus on the chemical engineering datasets. This modification addresses the two aforementioned challenges: (1) multi-task learning and (2) dynamic architecture construction. The former is achieved by modifying the architecture of DAN2 and the latter is achieved by proposing a dynamic optimisation scheme.
We have identified that the most suitable application scenario for the proposed multi-task DAN2 network is tabulated engineering data. This is because the network has a small, fixed number of parameters per layer, which limits the scale of data it can treat but enhances model simplicity when dealing with small- to medium-scale engineering data. Any data in the form of sensory outputs, or from other multiple-input multiple-output systems, are suitable for the proposed method. This is also why we have collected chemical engineering data for the application.
We have not made use of large-scale experimental databases, as the focus of the article is tabulated data of a relatively smaller scale rather than large-scale image or sound data. We believe such data are better suited to the architecture of DAN2. Therefore, we have used specially selected chemical engineering data as test datasets for the proposed architecture instead of a large experimental database.
We motivate the development of a multi-task learning model for this network because industrial datasets from the chemical process industries usually contain multiple highly correlated outputs relating to the same operating conditions [12,13]. Since the correlation between the outputs is high, there is great potential in applying multi-task learning in chemical engineering. Moreover, the adoption of DAN2 serves as a novel alternative to the ANNs or mathematical models commonly used in industry to simulate processes. Mathematical models require heavy computation, as in the case of fluid dynamics. DAN2 and ANNs are data-based models that make predictions with less complicated modelling techniques. In comparison with ANNs, DAN2 tends to have fewer parameters and is more suited to time-series modelling. Moreover, the DAN2 network has an edge in the prediction of dimensionally small datasets, typical of most chemical engineering datasets. The original model was created for single-output predictions only. We introduce the model to a few chemical datasets and adapt it to multi-task learning to allow multivariate output.
Moreover, we have identified that one of the most significant application areas of our proposed methodology is the measurement of online systems, which are often difficult to model. Our method is particularly suitable for online measuring systems in fields such as biochemical engineering. In the processing of cell cultures, for example, we would like to infer the intracellular state of the culture from external measurements, since the intracellular state is very difficult to measure in real time. By tracking the changes in concentrations with our network, the dynamics within the cell culture can be inferred effectively without breaking the cells, which would modify the concentrations. This research therefore lays the foundation for potential applications in engineering and allows the effective development of a novel alternative to traditional machine learning models.
The article makes two novel contributions: (1) the adaptation of the network for multi-task learning is unprecedented, with the proposal of both a multi-head and a serial structure for the treatment of chemical engineering data; and (2) a heuristic search scheme is proposed, enabling an architectural search for optimality with a simple but sophisticated solution. It is simple because the heuristic scheme is easy to implement, with an easy-to-understand search process; it is sophisticated because the methodology behind it is effective in generating optimal results. Neither contribution is covered by the current literature, and the proposals provide a structured design based on DAN2 with a systematic architectural search process.
The article consists of the following sections. In Sect. 2, we review key concepts and recent work related to DAN2 optimisation. In Sect. 3, we briefly review the DAN2 model. In Sect. 4, we discuss the methodology used in designing the multi-task functionality and in proposing the heuristic optimisation scheme. In Sect. 5, we apply the newly designed network to simulated and experimental datasets. In Sect. 6, the multi-task DAN2 is compared with traditional ANNs. Section 7 concludes the article.

Background
In this section, we review three key concepts adopted in this work: the DAN2 architecture, the dynamic model and multi-task learning. The DAN2 architecture, as the basis of this work, has wide applications in both regression and classification tasks. Its key characteristic is its dynamic nature. Understanding these concepts clarifies how this dynamic nature can best be exploited and modified to form a multi-task network.

Current research on DAN2
The archetype model used in this research is the dynamic architecture neural network (DAN2), a novel alternative to the traditional feedforward backpropagation (FFBP) algorithm [2]. The DAN2 network has already been widely applied in prediction tasks. Research has demonstrated that the network is effective in predicting nonlinear processes [14] and time-series data [15]. Since then, mainstream research has focused on time-series prediction based on this model [16,17]. A variety of cross-disciplinary problems have been treated with this model, including automated text classification [18], movie revenue prediction [19], Twitter sentiment analysis [20], urban water demand forecasting [21], medium-term electrical load forecasting [22], stock market index and stock price prediction [23,24], and government spending forecasting [25].
Meanwhile, it has been demonstrated that the model is also effective in classification tasks [26,27]. In [26], the author developed a hierarchical model that is compared to traditional machine learning algorithms such as linear discriminant analysis, quadratic discriminant analysis, k-nearest neighbour algorithms, support vector machines, and traditional artificial neural networks. The ability of the network to perform classification is corroborated in [27], where the model is used to classify Camellia (Theaceae) species based on leaf characteristics.
While most of the literature focuses on applying the algorithm in different contexts, an important revision of the algorithm is performed in [15] with regard to time-series forecasting. The modification reformulates the network as an additive model and proposes a novel optimisation method in which the network parameter $\mu$ is fixed to a few selected values and linear regression is performed on the additive model. Applications to datasets have demonstrated positive results.
Our research is novel in that such an adaptation has not been attempted in the current research domain: most current research on DAN2 focuses exclusively on applying the model to novel datasets. The present work is therefore unprecedented in its aim to broaden the applicability of DAN2 to more varied application scenarios.

Current research on dynamic networks
The idea of dynamically extending the network is inspired by the classic Adaptive Resonance Theory (ART) and the Grow and Learn (GAL) theory for networks. Adaptive Resonance Theory is a match-based learning method that changes weightings or adds new neurons only when an input is close enough to internal expectations or a completely new input is presented [28]. The Grow and Learn theory posits that network architectures change incrementally with new inputs to the system; a typical algorithm adds hidden neurons to the hidden layer as learning proceeds [29]. In either case, the dynamic architecture encourages ''continual learning'', which refers to the ability of the system to learn and adapt from a continuous stream of input.
An alternative terminology for a dynamic model is neural network self-organisation, where the network evolves to fit non-stationary input by adopting a dynamic architecture. More recent literature on neural network self-organisation dynamically allocates or removes neurons in response to sensory experience, with applications in human-robot interaction (HRI) [10,30]. In [10], a self-organising recurrent model is constructed to process human actions in video sequences, achieving state-of-the-art performance. In [30], a self-organising convolutional neural network is proposed to recognise body and emotional expression. In both cases, the models are robust to non-stationary input owing to the self-evolving nature of the network. [31] applies self-organising neural networks to novelty detection, the process by which robots recognise unexpected data in their sensory field, with applications in surveillance, reconnaissance, self-monitoring, etc.; the model used is based on the Grow When Required Neural Network (GWRNN). In [32], a novel model is proposed to analyse stationary and non-stationary noisy streaming data.
The popularity of self-organising neural networks in robotics stems from the fact that the input is usually a continuous stream with a non-stationary distribution. Chemical processes, similar to robotics, contain streaming data of temperature, pressure, composition, etc. It is therefore natural to find applications of a self-organising network, or a dynamic architecture neural network, in the field of chemical engineering.
The controversy surrounding dynamic methods is that finding an architecture can be quite time-consuming and computationally demanding. This is accentuated in a large and deep network such as ResNet [33], which has 152 layers in total, where layer-by-layer addition can be a slow process: each time the architectural parameters are altered, an optimisation algorithm is run to ensure that the structure is optimal. However, this process is capable of producing a minimally structured network, thereby reducing the need for unnecessary computation in the testing and application stages. Moreover, the problems to be solved in many industrial settings [34,35] have small datasets with less complicated nonlinear relationships to model, which makes a dynamic method economical compared to problems in computer vision [36], natural language processing [37,38], speech recognition [39,40], etc.

Background on multi-task learning
Multi-task neural networks make use of the commonalities and differences across tasks, share representations of the problem within the network, and simultaneously solve related tasks. Multi-task learning allows the model to generalise better by using the domain information as an inductive bias [41], thus enabling inductive transfer among different tasks [42]. A common inductive bias is L1-regularisation, where the network is biased towards sparsity. In the case of multi-task learning, the inductive bias is the set of auxiliary tasks that allow the network to perform better when solving tasks in parallel.
Multi-task learning has wide applications in many fields, including drug discovery [43,44], computer vision [45,46], speech recognition [47,48], healthcare [49] and natural language processing [50,51]. Multi-task learning techniques can be separated into hard parameter sharing and soft parameter sharing. Our model adopts the hard parameter sharing scheme, where the same weights are shared and fixed across tasks. The advantage of hard parameter sharing is that it greatly reduces over-fitting, allowing the network to generalise better between tasks [42]. Soft parameter sharing is not adopted since it in essence trains separate networks for different tasks, taking more computation and memory to develop and store.
Multi-task learning has several advantages [42]. The first is implicit data augmentation: the sample size is increased with data from different tasks, and the trained model better represents the hidden distribution of a more general representation. The second is attention focusing: the model puts more emphasis on important features whose relevance is confirmed by auxiliary tasks. The third is eavesdropping: a difficult-to-learn task can be learned effectively through auxiliary tasks. The fourth is representation bias: the model is biased towards representations that generalise across the defined problems. The last is regularisation: sharing parameters prevents over-fitting to a specific task.
A technical issue in applying multi-task learning is whether to freeze the weightings when a new layer is added or to retrain the whole network. While the former requires less re-computation, the latter finds better solutions, because holding the other weightings constant only allows a search along an affine subspace of the weight space [9]. However, the DAN2 network is designed so that weights are frozen at each iteration, as the addition of each layer is a deterministic step with no free parameter. It is therefore reasonable to freeze the weights without retraining the whole network, saving computation as an added benefit.

Model description
The dynamic feed-forward architecture DAN2 is proposed in [2]. It is characteristic of dynamic models in that the number of layers increases incrementally, with a precise definition of each layer encapsulated in its design [2]. The objective of the network is to model nonlinear processes or time-series data. The model achieves this by learning and accumulating knowledge at each layer, propagating and adjusting this knowledge forward to the next layer, and repeating until the desired network performance criteria are reached [2].
In the DAN2 model, the input layer is defined by the external data with normalisation. The hidden layers are sequentially and dynamically generated until the stopping criterion is met. A novelty of the algorithm is that there is a fixed number of hidden neurons in each hidden layer. The architecture is shown in Fig. 1.
Each layer contains four hidden nodes. The constant (C) node adds a constant to the corresponding node. The ''current accumulated knowledge element'' (CAKE, labelled F) node is a linear transformation of all nodes in previous layers. The remaining nodes are ''current residual nonlinear element'' (CURNOLE, labelled G and H) nodes, which perform a nonlinear transformation of the weighted average of the previous CURNOLE nodes. The quantities $a_k$, $b_k$, $c_k$, $d_k$ and $\mu_k$ are the weights connecting each node to the next layer.
The training process adds the network layer by layer. In the first layer, the CAKE node is calculated as a linear combination of the input variables and the constant (C) node, with the weights of the linear combination obtained through linear regression. If the accuracy criterion is met at this step, the process to be modelled is effectively linear. If not, the subsequent CAKE nodes take input from the previous CAKE, CURNOLE and C nodes.
The nonlinear transformation brought about by the CURNOLE nodes adopts a vector projection method. A reference vector is defined and the input vectors are projected onto it. The angle $\alpha_i$ between each input sample and the reference vector is recorded. The trigonometric function $\cos(\mu_k \alpha_i + \theta_k)$ captures a nonlinear transformation equivalent to a rotation by $\mu_k$ and a shift by $\theta_k$. Expanding it as $\cos(\mu_k \alpha_i + \theta_k) = A\cos(\mu_k \alpha_i) + B\sin(\mu_k \alpha_i)$, where $A$ and $B$ are constants found through linear regression, removes the need to restrict $\alpha$ and $\theta$, since the ranges of the cosine and sine functions are bounded. Therefore, two CURNOLE nodes are set up, giving a nonlinear transformation of sine and cosine respectively.
The overall relationship is outlined as

$F_k(X_i) = a_k + b_k F_{k-1}(X_i) + c_k G_k(X_i) + d_k H_k(X_i)$,

where $X_i$ represents the $i$-th independent input record, $F_k(X_i)$ is the output value at layer $k$, $G_k(X_i) = \cos(\mu_k \alpha_i)$ and $H_k(X_i) = \sin(\mu_k \alpha_i)$ are the transferred nonlinear components, and $a_k$, $b_k$, $c_k$, $d_k$ and $\mu_k$ are the parameter values at iteration $k$. The objective is to minimise the total error $SSE = \sum_i (F_k(X_i) - y_i)^2$ with respect to the parameters $a_k$, $b_k$, $c_k$, $d_k$ and $\mu_k$. Layer addition and parameter optimisation alternate until a satisfactory stopping condition, evaluated on the SSE, is obtained.
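To make the layer update concrete, the following is a minimal numpy sketch of one DAN2 layer as we read the equations above. The choice of reference vector, the bounds of the one-dimensional search over $\mu_k$, and all function names are our assumptions rather than the original implementation.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dan2_layer(X, y, F_prev, reference=None):
    """Fit one DAN2 layer: for a candidate rotation mu, regress y on
    [1, F_prev, cos(mu*alpha), sin(mu*alpha)] and search mu to minimise SSE.
    Sketch only; reference vector and search bounds are assumptions."""
    if reference is None:
        reference = np.ones(X.shape[1])          # assumed reference vector R
    # angle alpha_i between each input record and the reference vector
    cos_sim = X @ reference / (np.linalg.norm(X, axis=1) * np.linalg.norm(reference))
    alpha = np.arccos(np.clip(cos_sim, -1.0, 1.0))

    def fit(mu):
        # columns: constant (C), previous CAKE (F), CURNOLE nodes (G, H)
        A = np.column_stack([np.ones_like(alpha), F_prev,
                             np.cos(mu * alpha), np.sin(mu * alpha)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return float(np.sum((A @ coef - y) ** 2)), coef

    # one-dimensional search over mu_k (the bounds are an assumption)
    res = minimize_scalar(lambda mu: fit(mu)[0], bounds=(0.1, 10.0), method="bounded")
    sse, (a_k, b_k, c_k, d_k) = fit(res.x)
    F_k = a_k + b_k * F_prev + c_k * np.cos(res.x * alpha) + d_k * np.sin(res.x * alpha)
    return F_k, (a_k, b_k, c_k, d_k, res.x), sse
```

Stacking such layers, each consuming the previous $F_{k-1}$, reproduces the layer-by-layer training loop described above.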

Methodology
A novel architectural search algorithm is proposed that inherits the gist of the dynamic extension of the DAN2 network. We introduce multi-task learning to this network and experiment with the results. The DAN2 network is adopted because a conventional neural network requires many parameters per layer, whereas DAN2 requires only 5 per layer, leading to a much lighter model. We propose this methodology because the focus is sensor data, which do not accumulate to the large scale of datasets such as images or sound; the aim is to treat correlated multiple-input multiple-output engineering datasets with both effectiveness and simplicity. We adopt two multi-task learning networks, whose architectures are shown in Figs. 2 and 3. We term the first architecture the ''multi-head architecture'' and the second the ''serial architecture''. $Y_i$ represents the $i$-th dimension of the output.
In the multi-head architecture, the input undergoes transformations through a number of shared layers and then enters different task-specific layers arranged in parallel. As the number of task-specific layers differs between tasks, the total number of layers each output undergoes is different. The structure is adopted to exploit the inherently strong correlations between the different dimensions of the output.
In the serial architecture, the input undergoes transformations through the shared layers, and the common task-specific layers are arranged sequentially, so that the number of layers an output undergoes differs between outputs. This structure is adopted because we observe that the outputs are correlated and some outputs need to undergo a larger number of layers to generate accurate results. To best exploit this property and reduce the size of the network, the serial architecture is proposed. The sequence of the outputs in the serial structure is determined by the correlations between the outputs, with the most strongly correlated outputs placed earliest in the sequence. This allows the less correlated outputs to undergo more layers before the prediction results are produced.
The serial architecture is thus motivated by the correlated nature of the potential outputs of the system. The structure can also be adapted to time-series data, where a number of outputs into the future can be predicted from each input.
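As a concrete illustration of the ordering rule above, the short numpy sketch below ranks outputs for the serial sequence. Quantifying ''most strongly correlated'' as the mean absolute correlation with the other outputs is our assumption; the article does not specify the exact score.

```python
import numpy as np

def serial_output_order(Y):
    """Order output columns for the serial structure so that the most
    strongly correlated outputs come earliest in the sequence (sketch)."""
    corr = np.corrcoef(Y, rowvar=False)      # T x T output correlation matrix
    np.fill_diagonal(corr, 0.0)              # ignore self-correlation
    score = np.abs(corr).mean(axis=1)        # mean |correlation| with the others
    return np.argsort(-score)                # column indices, most correlated first

# usage: serial_output_order(np.column_stack([y1, y2, y3])) for outputs y1..y3
```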
The advantage of applying the DAN2 network to multi-task learning is that it freezes the parameters in each layer after training and adds new layers without affecting the parameters of the previous layers. This greatly simplifies the calculation of the shared layers. Our datasets are suitable for multi-task learning since the output values are generated from the same operating conditions as inputs. We therefore expect strong correlations between the different outputs, and a shared representation suits the data at hand.

Mathematical background
The multi-task learning problem can be formulated as a multi-objective optimisation problem [52]. Here, we summarise key concepts discussed in [52].
Suppose the input space is represented as $\mathcal{X}$ and the output spaces as $\{\mathcal{Y}_t\}_{t \in [T]}$. The dataset is a set of i.i.d. data points $\{x_i, y_i^1, \ldots, y_i^T\}_{i \in [N]}$, where $x_i$ is a particular input related to the outputs $y_i^t$ of tasks $1$ to $T$, and $N$ is the total number of data points. The learning problem for a particular task $t$ is

$\min_{\theta^{sh}, \theta^t} \hat{L}_t(\theta^{sh}, \theta^t)$,

where $\theta^{sh}$ are the shared parameters, $\theta^t$ are the task-specific parameters, and $L_t(\cdot, \cdot)$ is the task-specific loss function. In most empirical cases, the overall loss function is a weighted average of the task losses,

$\min_{\theta^{sh}, \theta^1, \ldots, \theta^T} \sum_{t=1}^{T} c_t \hat{L}_t(\theta^{sh}, \theta^t)$,

where the weights $c_t$ can be static or dynamically computed and $\hat{L}_t(\theta^{sh}, \theta^t)$ is the empirical loss of task $t$. To formulate the problem as a multi-objective optimisation problem, we define the vector-valued loss

$L(\theta^{sh}, \theta^1, \ldots, \theta^T) = (\hat{L}_1(\theta^{sh}, \theta^1), \ldots, \hat{L}_T(\theta^{sh}, \theta^T))^\top$.

We seek Pareto optimality [53] as the goal of the multi-objective optimisation.
Definition (Pareto optimality) (a) A solution $\theta$ dominates a solution $\bar{\theta}$ if $\hat{L}_t(\theta^{sh}, \theta^t) \le \hat{L}_t(\bar{\theta}^{sh}, \bar{\theta}^t)$ for all tasks $t$ and $L(\theta^{sh}, \theta^1, \ldots, \theta^T) \ne L(\bar{\theta}^{sh}, \bar{\theta}^1, \ldots, \bar{\theta}^T)$. (b) A solution $\theta^{\star}$ is called Pareto optimal if there exists no solution $\theta$ that dominates $\theta^{\star}$.

In the context of DAN2, the loss function is defined as the SSE on the testing dataset. The shared parameter $\theta^{sh}$ is the free hyperparameter giving the total number of shared layers, and the task-specific parameter $\theta^t$ is the free hyperparameter giving the number of task-specific layers for task $t$. To find Pareto optimality, a heuristic search over the shared and task-specific parameters is performed, within which condition (a) holds. The best-performing parameter set derived from the heuristic search is assumed not to be dominated by other parameter values, fulfilling condition (b). The multi-objective optimisation problem is thus simplified to optimising a set of hyperparameters regarding the architecture while minimising the vector-valued loss $L(\theta^{sh}, \theta^1, \ldots, \theta^T)$.

Optimisation scheme
The optimisation scheme is heuristic in nature and adopts a sequential method. In the original network, where only one output is calculated, the total number of layers is the only free parameter to adjust, and it is settled when the percentage difference between two consecutive SSEs is smaller than a user-defined value $\Theta$. In the case of multi-task learning, the total number of layers consists of the number of shared layers $n_{sh}$ and the numbers of task-specific layers $n_{ts}$ (see Fig. 2). The heuristics become more complicated as the number of tasks increases. For example, when there are three tasks to regress, the total number of parameters to optimise increases to 4: $n_{sh}$ plus $n_1$, $n_2$, $n_3$ for each task.
Traditional methods for optimising a large number of architectural parameters include random search, grid search, evolutionary algorithms, Bayesian optimisation and reinforcement learning. Random search is unsystematic, and grid search covers only a small portion of the search space. The other methods are complicated in nature and require dedicated implementations or packages.
This work proposes a novel heuristic method to sequentially optimise the number of shared layers and the number of task-specific layers. The method provides a systematic framework in which the optimal multi-head and serial structure can be found. Although there is no proof of optimality in the structure generated, the dynamic nature of the network dictates that a relatively minimal structure extended from one basic initial layer is obtained.
The optimisation scheme consists of two steps: (a) increasing the number of shared layers by 1 and (b) increasing the number of task-specific layers by 1. The stopping criterion for Step (a) is $\epsilon_{total} \ge \Theta_{total}$, where $\Theta_{total}$ is a user-defined small value and $\epsilon_{total} = (SSE_{total} - SSE_{total,prev}) / SSE_{total,prev}$. Since layers are shared between tasks, the total SSE over all tasks is used in this stopping criterion. The stopping criterion for Step (b) is $\epsilon_{task} \ge \Theta_{task}$, where $\Theta_{task}$ is another user-defined small value and $\epsilon_{task} = (SSE_{task} - SSE_{task,prev}) / SSE_{task,prev}$. The task-specific SSE is used because we want to train these layers so that the number of layers differs for each task.
We employ an iterative procedure in which Step (a) and Step (b) alternate until the overall SSE is smaller than a tolerance value. The tolerance value is arbitrarily small and determines the structure of the network. In this article, the tolerance is set at the scale of $10^{-2}$, which generates an acceptable range of optimisation results: when the tolerance is too low, the network cannot converge, and when it is too high, the outputs are highly inaccurate. The pseudocode for the optimisation of the multi-task learning network is shown in Algorithm 1. Note that the parameters $a$, $b$, $c$, $d$ are the coefficients of the linear regression of $F$, $G$ and $H$.
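Since Algorithm 1 itself is not reproduced in this text, the following Python sketch gives our reading of the alternating scheme for the multi-head case. It reuses the hypothetical dan2_layer routine sketched earlier; training the shared layers against the summed task targets and all bookkeeping details are assumptions on our part.

```python
import numpy as np

def rel_change(sse, sse_prev):
    # epsilon = (SSE - SSE_prev) / SSE_prev, per the stopping criteria above
    if not np.isfinite(sse_prev):
        return -np.inf                        # first layer: always keep growing
    return (sse - sse_prev) / sse_prev

def optimise_multihead(X, Y, theta_total=1e-2, theta_task=1e-2, tol=1e-2,
                       max_rounds=20, max_layers=30):
    """Alternate Step (a), growing shared layers, and Step (b), growing each
    task-specific head, until the overall SSE drops below tol (sketch)."""
    T = Y.shape[1]
    F_sh, sse_sh_prev = np.zeros(len(X)), np.inf
    sse_tasks = [np.inf] * T
    for _ in range(max_rounds):
        # Step (a): add shared layers until epsilon_total >= Theta_total
        for _ in range(max_layers):
            F_sh, _, sse_sh = dan2_layer(X, Y.sum(axis=1), F_sh)
            if rel_change(sse_sh, sse_sh_prev) >= theta_total:
                break
            sse_sh_prev = sse_sh
        # Step (b): grow each head from the shared representation until
        # epsilon_task >= Theta_task, tracked by its task-specific SSE
        sse_tasks = []
        for t in range(T):
            F_t, sse_t_prev = F_sh.copy(), np.inf
            for _ in range(max_layers):
                F_t, _, sse_t = dan2_layer(X, Y[:, t], F_t)
                if rel_change(sse_t, sse_t_prev) >= theta_task:
                    break
                sse_t_prev = sse_t
            sse_tasks.append(sse_t)
        if sum(sse_tasks) < tol:              # overall SSE below tolerance
            break
    return F_sh, sse_tasks
```

For the serial architecture, Step (b) would instead grow one chain of task-specific layers, reading off each output at its own depth.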
This optimisation scheme works for both the multi-head architecture and the serial architecture. In either case, we optimise the shared layers and then the task-specific layers, alternating dynamically until the tolerance is reached. The difference is that under the multi-head architecture we optimise a number of different task-specific layers (the ''heads'') in parallel, whereas for the serial architecture we optimise the task-specific layers sequentially.
Case study

Case study I: pressure swing adsorption simulation

Datasets
There are two datasets to which we have applied our algorithm; both are used for regression tasks. We developed the first dataset to simulate a dynamic pressure swing adsorption (PSA) process. The PSA dataset contains 6 sets of continuous input features and 3 continuous output values, i.e. the recovery rate, the purity and the energy consumption. Details of the data collection process and the data pre-processing can be found in the Supplementary Material. This dataset is used because (1) it contains highly correlated outputs obtained under the same operating conditions, which suits multi-task learning, and (2) the process is highly nonlinear in nature and thus requires a good nonlinear approximator such as DAN2 to model. To ensure that the dataset is applicable to multi-task learning, we calculate the correlation matrix of the output values consisting of recovery rate, purity and energy consumption, shown in Table 1. From the matrix, we see correlations between the three values, indicating the applicability of multi-task learning.
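The correlation check can be reproduced with a few lines of numpy; the placeholder arrays below merely stand in for the simulated PSA outputs and carry no real values.

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholders for the three PSA outputs (recovery rate, purity,
# energy consumption); in practice these come from the simulation.
outputs = rng.normal(size=(500, 3))

corr = np.corrcoef(outputs, rowvar=False)    # 3 x 3 matrix, cf. Table 1
print(np.round(corr, 3))                     # off-diagonals give inter-task correlation
```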

Application
We then applied the multi-head structure to the PSA dataset. The optimal architectural parameters, together with the training and testing MSE, are listed in Table 2. It is evident that the multi-head structure is capable of producing a network with small training and validation losses. The value for energy consumption diverges because of outlier values in the dataset. Since each layer has 5 network parameters and there are 23 layers in total, the total number of network parameters is 115.
Similarly, we applied the serial structure to the PSA dataset. The optimal architectural parameters, together with the training and testing MSE, are listed in Table 3. The serial structure is also capable of generating small training and validation losses, and there is no divergence in the output values. The optimised architecture has 11 layers, so the total number of network parameters is 55.
Comparing the two architectures, the serial architecture performs better in terms of training and validation loss. Moreover, it uses fewer parameters and fewer layers. The serial architecture is therefore more efficient than the multi-head architecture for this dataset.

Case study II: oil droplet experiments

Dataset
The dataset used is the OILDROPLET dataset [54]. The dataset contains 24,422 experimental entries; within each entry, there are 7 input variables and 4 output variables (i.e. average movement speed, maximum speed of a single droplet, average number of droplets in the last second, and average number of droplets throughout the experiment). Details of the dataset can be found in the Supplementary Material. We made use of this dataset because it also involves an implicitly nonlinear process for which DAN2 can be a suitable model. Similarly, we calculate the correlation matrix of the output values to ensure that a multi-task learning architecture is suitable. Table 4 tabulates the correlation matrix, demonstrating strong correlations between the droplet-count outputs and between the droplet-speed outputs.

Application
The application of the multi-head structure to the OILDROPLET dataset is shown in Table 5. In this case, no divergence is observed, both the training and validation losses are small, and the number of layers used is very limited. This demonstrates the effectiveness of the network as an approximator of nonlinear processes: adapting to a multi-head structure does not limit the network's ability to generate small training and validation losses. Since there are 5 network parameters in each layer and 10 layers in total, the total number of parameters used is 50.
Similarly, we applied the serial architecture to the OILDROPLET dataset and the results are shown in Table 6.
From the results in Table 6, it is evident that the serial structure reaches low training and validation losses, demonstrating the effectiveness of the network structure. The number of layers used is comparable to, though slightly higher than, that generated by the multi-head structure, and the training and validation losses are likewise comparable but slightly higher. In either case, the number of layers used is minimal, requiring a small number of parameters; the structures thus serve as effective alternatives to artificial neural networks.

Comparison with traditional ANNs
We construct artificial neural networks using the standard Python library Keras 2.4.2 to compare the performance of the DAN2 network with traditional artificial neural networks. The hyperparameters of the traditional deep neural network are optimised manually. We control the total number of network parameters in the ANN to be the same as in the DAN2 model. Performance data of the DAN2 model are partly repeated from the previous section in order to compare against the results obtained with the standard ANN.

Multi-head architecture
The DAN2 network for the PSA dataset has 115 parameters. A comparable ANN with one hidden layer of 11 neurons entails 113 network parameters, since the input dimension is 6 and the output dimension is 3. We train the network using the Adam optimiser with MSE loss. Empirical experimentation has demonstrated that both networks generate comparable results, as listed in Table 7, with the exception of the divergence. From Table 7, it can be observed that the DAN2 training MSE is of a similar order of magnitude to that of a traditional ANN. For the validation MSE, values are of a similar order of magnitude except the divergent value in the DAN2 network. The divergence is possibly due to the initial shared structure in the multi-head architecture, which limits the prediction performance of each head. Overall, the traditional ANN performs slightly better but remains comparable to the multi-task DAN2 network.

The DAN2 network for the OILDROPLET dataset has 50 parameters. A comparable ANN with a similar number of parameters entails 52 parameters; its structure consists of one hidden layer with 4 neurons. The comparison between the ANN and the DAN2 network is listed in Table 8. From Table 8, all training MSEs are of the same order of magnitude for the traditional ANN and the multi-task DAN2 network. This is similarly observed in the validation MSE, with some values of the multi-task DAN2 over-performing and some under-performing the traditional ANN.
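As an illustration, a comparable PSA network can be sketched in Keras as follows; the hidden-layer activation is an assumption, since the text specifies only the Adam optimiser and the MSE loss.

```python
from tensorflow import keras

# Comparable ANN for the PSA dataset: 6 inputs -> 11 hidden -> 3 outputs,
# giving 6*11+11 + 11*3+3 = 113 parameters as stated above.
model = keras.Sequential([
    keras.layers.Dense(11, activation="relu", input_shape=(6,)),  # activation assumed
    keras.layers.Dense(3),  # linear outputs for the three regression tasks
])
model.compile(optimizer="adam", loss="mse")
model.summary()  # Total params: 113
```

The OILDROPLET counterpart follows the same pattern with a 7-4-4 layout, giving the 52 parameters quoted above.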

Serial architecture
We compare the serial architecture with ANNs having a similar number of parameters. The results are shown in Tables 9 and 10.
In the PSA dataset, the serial structure has 11 layers with 55 parameters in total. A comparable ANN has one hidden layer of 5 neurons, entailing 53 parameters. We train the network using the Adam optimiser with MSE loss. The results of running the ANN in comparison to the serial DAN2 network are shown in Table 9.
The second dataset, OILDROPLET, makes use of 12 layers with 60 parameters. This is equivalent to an ANN with one hidden layer of 5 neurons and a total of 59 parameters. We run the comparable ANN and report the results in Table 10.
In either table, the performance of DAN2 is comparable to that of the ANNs, with some entries performing better. As the MSE values are close to zero, both networks generate accurate predictions. The DAN2 network with a serial architecture can therefore act as a good alternative to ANNs.

Comparison with other machine learning models
We have also applied other regressors commonly used in the machine learning literature and compared their performance with the DAN2 network. The regressors are Support Vector Regression (SVR), k-nearest neighbours regression (KNN), Ridge regression (RR) and Decision Tree regression (DT). The performance is tabulated in Tables 11 and 12; the MSEs in the tables are validation MSEs. From the tables, we observe that both the multi-head DAN2 and the serial DAN2 obtain performance comparable to current machine learning regression models in the literature.
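The baselines can be assembled with scikit-learn; the default hyperparameters and the placeholder arrays below are our assumptions, standing in for the real training and validation splits.

```python
import numpy as np
from sklearn.multioutput import MultiOutputRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X_tr, Y_tr = rng.normal(size=(400, 6)), rng.normal(size=(400, 3))
X_va, Y_va = rng.normal(size=(100, 6)), rng.normal(size=(100, 3))

baselines = {
    "SVR": MultiOutputRegressor(SVR()),   # SVR is single-output, so wrap it per task
    "KNN": KNeighborsRegressor(),
    "RR": Ridge(),
    "DT": DecisionTreeRegressor(),
}
for name, reg in baselines.items():
    reg.fit(X_tr, Y_tr)
    print(name, mean_squared_error(Y_va, reg.predict(X_va)))  # validation MSE
```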

Limitations of DAN2 model
A key limitation of the DAN2 model is that it is not trained through backpropagation and is thus less flexible with regard to structure. In ANNs, the model parameters are found through backpropagation, where parameter updates are calculated in a backward manner through automatic differentiation. This gives much freedom in the definition of the model in terms of connections and structure, as the structure is fixed before the parameter values are calculated.
On the other hand, the DAN2 network is more rigid: the output is always taken directly from the most recently added layer. This can be a problem when dimension reduction is required. While dimension reduction is easy in an ANN, where the matrix pre-multiplying the neurons can shrink in size, it is harder in DAN2 because the output dimension is fixed. To enable dimension reduction, one has to introduce matrix multiplications manually, which is akin to introducing an ANN layer into the model. This problem is common to most dynamic architectures.
Another issue with the model is that, while it can act as a good regressor or binary classifier, outputting categorical values is more problematic: the model is rigid in dimension, so it is difficult to convert the regressed results to one-hot vectors.

CPU time analysis
A comparison of the CPU time required to train each network is performed, with the results tabulated in Table 13. The architectures of the networks are the same as those used in the analysis above. The results are generated on a 2.3 GHz Intel Core i5 processor with 8 GB of 2133 MHz LPDDR3 memory. From the results, it is evident that the ANN and DAN2 networks spend similar amounts of time on the task of regressing multidimensional output using the multi-task learning architecture.

Similarly, we compare the CPU time spent optimising the serial DAN2 and the ANN, with results tabulated in Table 14. The architectures of the DAN2 and the ANN are as reported in the previous section. From the results, it is evident that the DAN2 network with a serial structure spends less time optimising the network to generate comparable MSE. This demonstrates an advantage of the DAN2 network: its faster processing time.

Memory storage
The multi-head architecture does not require the storage of matrices: the coefficients in each layer are stored as separate vectors, and only vector manipulations are required, with no matrix multiplication. The total number of vectors to store depends on the number of layers used in the network, so this number varies. Overall, the memory requirement is O(n).

Similar to the multi-head structure, the serial architecture only requires the storage of vectors, with no matrices. The number of vectors stored depends on the total number of layers. Overall, the memory requirement is O(n).

Comparison between multi-head and serial structure
We perform an empirical comparison of the two architectures adopting the heuristic search: the multi-head and the serial architecture. The results are tabulated in Tables 15 and 16.
From the results, it is evident that on the PSA dataset the serial structure outperforms the multi-head structure in terms of both training and validation MSE. Moreover, with the serial structure, the divergence of the energy consumption output is not observed. However, on the OILDROPLET dataset, the serial structure underperforms the multi-head structure, generating a higher MSE on both the training and the validation dataset. We postulate that the serial structure is more suitable for the PSA dataset because its outputs are more strongly inter-correlated. The stronger correlation means that processing one output may extract features relevant to the processing of another output. The serial structure reduces the total number of layers by processing multiple outputs simultaneously, thus achieving better performance in both time and accuracy.
The multi-head structure is more suitable for the OILDROPLET dataset because its outputs are more independent than those of the PSA dataset, so the parallel structure outperforms a serial structure. The serial structure may confound the processing of different outputs, which may also increase the number of layers required. Therefore, the multi-head structure achieves better performance on the OILDROPLET dataset under comparable CPU time.

Conclusion
In this work, we propose a multi-task learning network adapted from the DAN2 network, together with a heuristic optimisation scheme that approximates the Pareto optimality condition. Adoption of the optimisation scheme is shown to generate networks with low training and validation losses on the Pressure Swing Adsorption (PSA) and OILDROPLET datasets.
Compared to ANNs with a similar number of parameters, comparable and in some cases slightly better performance is achieved. The CPU times used are similar, and in one instance significantly smaller with the DAN2 approach, and the memory requires only vector storage. Overall, the novel multi-task DAN2 structure serves as an effective alternative to traditional ANNs, simultaneously approximating different nonlinear processes to a high accuracy.
We also discussed one limitation of the model, namely the fixed output dimension. Another limitation is the rigidity of the network parameters. In ANNs, the parameters are updated with the sequential input of batches of data. The DAN2 network, on the other hand, does not calculate parameters through updates; only one pass over the data is required to fix the parameters of the network. This limits the generalisation and the scalability of the network, and could be the reason that the network did not become a mainstream architecture like ANNs.
Therefore, for future work, an update formula for the parameters of the network could be developed so that the network can be trained in batches. This would improve generalisation, introduce flexibility and enable scalability of the network. As the network essentially consists of multiple linear regressions, an algorithm that combines regression coefficients (for example, by weighted averaging) should work in theory, but more empirical evidence is required to formulate an update rule properly. As the network has advantages in speed and a low number of parameters, an update rule would further diversify the possible applications of the network, enabling more functionality and turning the DAN2 network into a well-designed counterpart to traditional ANNs. The multi-task DAN2 also has potential applications in signal processing when the techniques of [55] are combined with our optimisation framework.
Table of notations

The table of notations is presented in Table 17.

Supplementary material
This section describes the relevant details regarding the generation and collection of the data for the applications of the multi-task DAN2 algorithm.

The PSA dataset
Process description A pressure swing adsorption (PSA) process is an energy-efficient technology for gas separation. It achieves separation by operating a cyclic process in which the gas species is adsorbed at a higher pressure and released at a lower pressure. The data used in this research adopt a four-stage PSA process model for CO2 capture as reported in Haghpanah's work [56].

Data generation method The PSA dataset is generated with the software Dymola, which simulates the PSA process through a set of differential equations [56] programmed in Modelica [57]. The PSA cycle is simulated iteratively under a series of defined operating conditions. We evaluate the system operating under these conditions until cyclic steady state (CSS) is reached, at which point we evaluate the recovery, purity and energy consumption of the system.
Data description The PSA dataset contains 6 sets of continuous input features and 3 continuous output values. The inputs are operating conditions (e.g. set points for pressure, durations of the adsorption and desorption stages, and inlet flow rate). The outputs are the recovery rate, purity and energy consumption of the system.
Data preprocessing We normalise the data before inputting them to the multi-task DAN2 and the ANN. Normalisation is required because the ranges of the inputs are of different scales: it rescales the numeric values in each column to a common scale, equalising the ranges of different inputs without affecting the relative values within a column.
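A minimal sketch of the normalisation step; min-max scaling to [0, 1] is our assumption, as the text does not name the exact scheme.

```python
import numpy as np

def normalise_columns(X):
    """Rescale each column to [0, 1] (column-wise min-max normalisation)."""
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    return (X - x_min) / (x_max - x_min)
```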
Data accuracy The collected data come from a simulation, which means that the data points are exact up to numerical precision. There is therefore no concern about anomalies in the dataset.

The OILDROPLET dataset
Data collection method The dataset is generated by a high-throughput droplet-generating robot, which can execute and record a 90-second droplet experiment every 111 seconds, including mixing, syringe-driven droplet placement, recording, cleaning and drying. Further details of the experimental setup can be found in [54].
Data description The dataset includes 24,422 experimental entries. Within each entry, there are 7 input variables: the ratios of four oils (diethyl phthalate, 1-octanol, octanoic acid and 1-pentanol) in the droplet, and the viscosity, surface tension and density of the mixture. The observation space is generated by observing the movement and merging of the oil droplets on the water surface, and comprises the average movement speed, the maximum speed of a single droplet, the average number of droplets in the last second, and the average number of droplets throughout the experiment.
Data pre-processing Similar to the preprocessing step for the PSA dataset, we apply normalisation to the data before inputting them into the neural networks.

Data accuracy The dataset comes from experiments, which means that there might be collection errors in the process. However, we have not identified anomalies in the dataset that would be troublesome to deal with. The data points are reliable because the collection process is carefully conducted and the data are carefully preprocessed.

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Consent to participate The authors consent to participate in the publishing activities of this journal.

Consent for publication
The authors consent to the publication of this article by this journal.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.