MS-NET: modular selective network

We propose a modular Deep Neural Network (DNN) architecture for multi-class classification tasks. The architecture consists of two parts, a router network and a set of expert networks. In this architecture, for a C-class classification problem, we have exactly C experts. The backbone networks for these experts and the router are built with a simple and identical DNN architecture. For each class, the modular network has a certain number ρ of expert networks specializing in that particular class, where ρ is called the redundancy rate in this study. We demonstrate that ρ plays a vital role in the performance of the network. Although these experts are lightweight and weak learners alone, together they match the performance of more complex DNNs. We train the network in two phases: first, the router is trained on the whole set of training data; then each expert network is trained with a new stochastic objective function that alternates between training on a small subset of expert data and on the whole dataset. This alternating training provides an additional form of regularization and keeps the expert networks from over-fitting to their subset data.
During the testing phase, the router dynamically selects a fixed number of experts for further evaluation of the input datum. The modular nature and low parameter requirement of the network make it very suitable for distributed and low-computation environments. An extensive empirical study and theoretical analysis on CIFAR-10, CIFAR-100 and F-MNIST substantiate the effectiveness and efficiency of our proposed modular network.


Introduction
Deep Neural Networks (DNNs) have, over the last two decades, shown their superiority in visual object recognition [26,62,64]; image segmentation [5,9,63,76]; speech recognition and translation [3,29]; natural language processing [11,68]; reinforcement learning [43,56,57]; bioinformatics [63]; education [38,73]; and so on. Despite their simple layered structure of neurons and connections, they have outperformed other machine learning models [74]. This superiority stems from their ability to learn complex non-linear mappings from input to output and to learn rich, discriminative features automatically, as opposed to hand-engineered low-level features such as Gabor features [41], local binary patterns [2], SIFT [54] and so on. Over time, we can observe that not only is performance improving dramatically, networks are also getting deeper [14,31,69] and wider [77]. As a result, these larger networks lack a few important and desirable properties such as interpretability or comprehensibility, practical applicability on low-computation devices, and so on. In addition, problems such as catastrophic forgetting upon the arrival of new data [28] and a lack of memory efficiency have also started to arise. Fortunately, various novel approaches have been proposed to mitigate a few of these shortcomings. Recent notable approaches include knowledge distillation from cumbersome models to smaller models [33]; compression of knowledge from an ensemble into a single model [8]; pruning of neural networks [48,51,58,81,82]; efficient Neural Architecture Search (NAS) [61,82,83]; modular neural network design [4,30,40,42,74]; and so on. There have also been significant advances in efficient hardware architectures for DNNs. Intel Corporation has developed a neural compute stick powered by the Vision Processing Unit (VPU), which can accelerate the inference phase of complex DNNs on a low-computation device.
Google has also recently developed the small edge Tensor Processing Unit (TPU) for high-performance machine learning inference. These small ASIC devices for DNNs can easily execute deep Convolutional Neural Networks (CNNs), which makes them one of the best alternatives to cloud-based services. Unfortunately, when it comes to state-of-the-art networks, these ASIC devices still face performance bottlenecks when executed in real-time scenarios. Thus, it is necessary to devote time and research to mitigating the above shortcomings of DNNs.
In this paper, we propose a novel modular neural network framework for multi-class classification, which is inherently simple and easy to implement. The key idea is to leverage a fixed number of experts, each with as few parameters as possible during the inference phase, while maintaining accuracy comparable to relatively complex and monolithic state-of-the-art DNNs. The proposed framework closely resembles the model of the human brain depicted by Minsky in [55], where he described the human brain as a collection of specialist agents interconnected by nerve-bundles. Quoting from [55]: "We're born with proto-specialists involved with hunger, laughter, fear and anger, sleep and sexual activity - and surely many other functions no one has discovered yet - each based upon a somewhat different architecture and mode of operation." Analogous to this brain model, our framework consists of a countable set of expert agents and a router agent. In this paper, we term the expert agents the expert networks and the router agent the router network. Each expert network specializes in a specific subtask, and their computations take place independently. Although they are not individually superior on the whole set of tasks, collectively they outperform each individual network. The router network moderates the execution of these expert networks. The concept of the modular neural network itself is not new. The key concept of modular connectionism goes back to the mid-1980s in [40]. A number of contributions such as [30,40,50,72] have approached the task of speech recognition using modular connectionist theory. A majority of the proposed modular architectures are equipped with a gating network (analogous to our router network) and a set of expert networks.
Despite the popularity of modular connectionist models during the 80s, the modular approach has been relatively sparse in the era of recent DNNs (such as CNNs, Recurrent Neural Networks (RNNs) and so on), until Hinton and Vinyals recently introduced the novel concept of knowledge distillation in neural networks [33].
Our proposed modular neural network framework, which we term MS-Net, closely resembles the literature [4,24,33,60] on the following key points: (i) we divide the dataset into a number of subsets/subtasks/concepts, and then train a fixed number of expert neural networks on each of these subsets; (ii) the router module navigates us to those expert networks for further re-evaluation.
However, in addition to the above points and our previous work [39], the novelty of our contributions to this research can be summarized as follows: 8. Sect. 10: guidance for optimal hyper-parameter selection for the network. 9. Sect. 11: effects of knowledge distillation on MS-Net. 10. Sect. 12: discussion of results and comparison to state-of-the-art DNNs. 11. Sect. 13: conclusion and possible future work.

Related work
Modular architectures have long been popular in neural networks and connectionist models. In addition, modularity has also been widely implemented in other traditional machine learning models. This approach has not only boosted the performance of these learning models, but also introduced virtues such as interpretability, training efficiency, distributed computation, parameter reduction and so on [4]. In this section, we provide an overview of neural networks and other machine learning models that exhibit modular behaviour.
Class binarization (CB) is one of the most well-known methods in the modular framework. It can be considered a special case of ensemble learning, where each binary module is assigned to learn or distinguish a single concept or class from the rest. Among the different CB techniques, ONE VS ALL (unordered binarization) is the most commonly practiced in neural networks [4] and support vector machines [12], due to its computational efficiency and performance boost. The technique first appeared in [10]. The method constructs C binary classifiers in total, where C is the total number of classes. Despite its simplicity, the method suffers from class imbalance, since the number of positive instances is small compared to the number of negative instances for each binary classifier. In addition, an ordered variant of this CB technique requires only C − 1 classifiers. The class imbalance shortcoming was later resolved by the ONE VS ONE method, which appeared in the Separate-and-Conquer Rule Learning literature [21]. A more systematic method for generating binary classifiers, known as Round Robin learning, was introduced by the same author in [22][23][24]. Due to its systematic method of creating binary classifiers, it carries more interpretability. The method constructs a total of C(C − 1)/2 classifiers using the Round-Robin scheme. Each of these classifiers is a pairwise classifier, expert on two specific classes or concepts. Thus, the issue of class imbalance no longer prevails. In addition, the authors have shown that this approach requires relatively less data during training as opposed to the ONE VS ALL method. However, during the inference phase all C(C − 1)/2 classifiers require evaluation.
To resolve this computational issue, a relatively recent work [60] proposed an efficient prediction algorithm for these ensembles, where pairwise classifiers can be chosen dynamically without any drop in accuracy.
Knowledge distillation (KD) is a recent and very popular method for compressing complex and cumbersome DNNs. The method was first proposed by Hinton et al. [33] and is now widely implemented in deep learning research and industrial applications. Studies such as [20,78] have shown that KD not only allows compression but also enables a relatively smaller student model to outperform its teacher model. The key idea is to train a student network to mimic the output features or the class probability distributions of the teacher network. The contribution of [33] was not limited to KD; the authors also proposed a modular network framework that closely resembles our proposed framework. The model consists of two main parts, a generalist network and a set of independent expert networks. Each expert network is a simple CNN trained on data that are often confused and misclassified by the generalist network. Thus, each individual expert is a classifier of type CONFUSABLE SUBSET VS ALL, where one part is the CONFUSABLE set of classes and the rest ends up in a single DUSTBIN class. This implies that the generalist model needs to be evaluated first to obtain the CONFUSABLE set of classes. In order to (i) retain knowledge about the non-expert classes, (ii) avoid over-fitting and (iii) solve the class imbalance problem, the authors initialize the expert networks with the weights of the generalist network. The work has shown that as the number of expert networks increases, the accuracy increases proportionally. However, there is no concrete indication or estimate of the number of experts covering the CONFUSABLE set of classes. In addition, the work states that there can be situations where no expert network covers a certain set of classes (since the generalist network is already confident in its prediction for those classes).
Recent research [80], titled Deep Mutual Learning (DML), which consists of a cohort of student models, also exhibits modular behavior. DML enables a number of student models to learn mutually from one another by minimizing the Kullback-Leibler (KL) divergence between their predictions, which is a special case of KD [33]. The experiments have shown that the number of student networks in the cohort during training can be extended beyond two. Moreover, empirical results show that multiple student neural networks trained by mutual learning outperform a single model trained independently. This learning process has also been shown to outperform the KD method.
Other notable recent research contributions on modular neural networks include the famous Generative Adversarial Network (GAN) [27], where two networks, the discriminator and the generator, cooperate and compete against each other. There are also different variants of GAN which comprise more than two networks [45]. The research in [16] proposed a modular-like architecture that is built upon existing state-of-the-art neural networks. The authors re-configure the model parameters into several parallel branches, where each branch is a stand-alone neural network. They demonstrate that the average of the log-probabilities of multiple parallel branches gives a better representation than a single independent branch.
In this paper, our modular neural network framework bears a very close similarity to [4,24,33], such as the presence of a gating network and expert networks. But in contrast to ONE VS ONE and ONE VS ALL, our expert networks are not limited to binary classifiers. We introduce a simple Round-Robin based systematic data partition technique which enables us to train each expert on a subset of multiple classes. Note that, unlike ensemble learning methods such as the well-known AdaBoost [18], Bagging [6], Random Forest [7], Gradient Boosting [19] and so on, which require the collective wisdom of all available classifiers, our network does not need to run all the neural network models during inference. The novelty of our proposed framework is that the router of the MS-Net greatly reduces the number of expert network evaluations during the inference phase. Since the partition of the dataset is systematic, (i) it gives us prior knowledge of which experts are specialists on which subsets, which also allows us to dynamically choose a specific number of expert networks during inference; and (ii) it guarantees the presence of multiple expert networks for a single concept or class, so we have a certain degree of fault tolerance in case other experts or the router network fail to correctly classify the data.

Proposed network architecture
The network has two main modules, a router network and a pool of expert networks. The expert network pool contains C expert networks, where C is the total number of classes in a given dataset. A simplified image of our modular framework is shown in Fig. 1. The expert networks and the router network have the same architecture. A very important issue is the size of the network. In our experiments, a cumbersome and computationally expensive network is not desirable. On the other hand, we also do not want the network to face a performance bottleneck due to an overly simple architecture. There is much remarkable recent literature on compact, efficient and accurate Neural Architecture Search (NAS) [17,83], but this topic is out of scope for this paper. Note that the choice of architecture for any network depends on the complexity of the dataset. Considering computational cost and memory efficiency, we chose ResNet-20 [31] as the initial backbone network, which is one of the most minimal and lightweight networks to our knowledge. We leverage ResNet-20 as the backbone of MS-Net to find the optimal hyper-parameters such as the value of ρ, top-n and so on. After we obtain the optimal hyper-parameters through an extensive empirical study with ResNet-20, we train other complex DNNs, namely GoogleNet [69] and MobileNet [35], as the backbone network of MS-Net.

Round robin based data-set partition with sliding window
In this research, the redundancy rate plays a vital role in the performance of the framework. We denote the redundancy rate as ρ. The variable ρ has two main interpretations. First, ρ is the size of each subset of class indices. Second, each class index appears in exactly ρ subsets of class indices. In either sense, when ρ is larger, more expert networks get the chance to see the training data from any particular class. This is why we call ρ the redundancy rate.
In order to prove the above two points, let us introduce several notations. First, we use D = {d_i | i = 1, .., N} to denote the whole training dataset, where N is the total number of training data, and T = {t_i | i = 1, .., N} to denote the set of teacher signals, where t_i is associated with d_i for i = 1, 2, ..., N. To partition the subsets for training the expert networks, we leverage a sliding window of length k and stride s. Refer to Fig. 2 for a graphical overview of the dataset partition. In this figure, we arrange the indices of all classes in a ring-shaped manner. The sliding window length k is a positive integer less than C, which is the total number of classes (Table 1). The redundancy rate depends directly on k. Each time we shift the sliding window with a stride s over the ring in a Round-Robin fashion (clockwise), we obtain a subset sub_i which contains k class indices. We use S = {sub_1, sub_2, ...} to denote the set of all class index sets so far obtained. We can prove that, for any value of k and with stride s = 1, the cardinality of S is always equal to the total number of classes C. Since we have C target classes, if we can prove |S| = C, we can conclude that our framework requires no more than C experts.
(Fig. 1 caption: the first block represents the router network R, which dynamically selects the set of expert networks E based on its softmax (SM) confidence; the second part is the pool of expert networks further re-evaluating the router's top-n most likely predictions; finally, the network aggregates the softmax scores of the router and the selected experts.)

Lemma 1 With stride s = 1 and for any value of k, the cardinality of S is always equal to the total number of classes C available in the dataset.
Proof If k is the length of the sliding window, then by convolution arithmetic [15] the number of class index sets in S is

|S| = (C − k)/s + k.    (1)

Since we are using the Round-Robin rotation, the latter term k is added instead of 1. As s = 1, Eq. (1) can be re-written as

|S| = C − k + k = C.    (2)

Thus, with stride 1, the total number of class index sets, i.e., the number of expert networks, is always equal to the number of classes. ◻

Lemma 2
If the length of the sliding window is k and the stride is s = 1, the index of each class occurs in exactly k class index sets; in other words, we have exactly k experts related to each class.
This implies that the redundancy rate ρ is determined by the sliding window size k, i.e., ρ = k. This also suggests that k determines the fault tolerance of the proposed MS-Net. As the value of k increases, we have more experts for each particular class (note that the total number of experts remains constant, i.e., C). On the contrary, as we decrease k, the redundancy rate, i.e., the number of experts specializing in a particular class, decreases.
Proof Let us assume that the sliding window length is k, where k < C. After the n-th (n = 0, 1, ..., C − 1) sliding operation, we obtain the class index set

sub_{n+1} = {n mod C + 1, (n+1) mod C + 1, ..., (n+k−1) mod C + 1}.

According to the definitions of the sliding window and the class index sets, |sub_{n+1}| = k. Since we are performing Round-Robin rotation, we use the modulus operator on the class indices. Without loss of generality, we show that the index (n + k − 1) mod C + 1 exists in exactly k class index sets. During the Round-Robin partition, each sliding operation shifts the elements of the window one position to the left with stride s = 1, as depicted in Figs. 3 and 4, and introduces a new class index at the right of the window, which in the case of sub_{n+1} is (n + k − 1) mod C + 1. In the same way, for the next sliding operation, we have

sub_{n+2} = {(n+1) mod C + 1, ..., (n+k) mod C + 1}.

As we can observe in sub_{n+2}, the class index n mod C + 1 ceases to exist, a new index (n + k) mod C + 1 arrives in the rightmost position, and the index (n + k − 1) mod C + 1 shifts one position to the left. After the (n + k − 1)-th sliding operation, we have

sub_{n+k} = {(n+k−1) mod C + 1, ...},

so the index (n + k − 1) mod C + 1 is now at the leftmost position. After the (n + k)-th sliding operation, we have

sub_{n+k+1} = {(n+k) mod C + 1, ...},

and (n + k − 1) mod C + 1 ∉ sub_{n+k+1}, since after the (n + k)-th sliding operation the class index (n + k − 1) mod C + 1 slides out of the window. Thus, (n + k − 1) mod C + 1 occurs in {sub_{n+1}, sub_{n+2}, ..., sub_{n+k}}, i.e., in exactly (n + k) − (n + 1) + 1 = k − 1 + 1 = k class index sets, which also concludes that we have k experts for the class (n + k − 1) mod C + 1. ◻
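The Round-Robin sliding-window partition described above can be sketched in a few lines of Python (a minimal illustration using 1-based class indices; the function name is our own convention, not from the paper's code):

```python
def partition_classes(C, k, s=1):
    """Round-Robin sliding-window partition of the class indices 1..C.

    C: total number of classes, k: window length (the redundancy rate),
    s: stride. Returns the list S of class index sets."""
    num_subsets = (C - k) // s + k  # Eq. (1); equals C when s = 1
    # the n-th window starts at offset n*s on the ring and spans k indices
    return [[(n * s + j) % C + 1 for j in range(k)] for n in range(num_subsets)]
```

With s = 1 and any k < C, this yields exactly C subsets, each class index occurring in exactly k of them, in line with Lemmas 1 and 2.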

Training phase
We perform the training procedure in two steps. First, we train the router network on the whole dataset. Second, we train the C experts on the subsets constructed from the class index sets obtained in the Round-Robin fashion depicted in Sect. 5. We denote the router network as y = R(.) : D → T, where D and T are the dataset and the corresponding label set, respectively. The output of the router network is the softmax defined in Eq. (3), from which we obtain the probabilities q_1, ..., q_C for all C classes:

q_i = exp(z_i) / Σ_{j=1}^{C} exp(z_j),    (3)

where z_1, ..., z_C are the logit scores for the corresponding classes.
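As a quick reference, the softmax of Eq. (3) can be written in plain Python (an illustration only; any deep learning framework provides this built in):

```python
import math

def softmax(logits):
    """Softmax over the logit scores z_1..z_C (Eq. (3)), returning q_1..q_C."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```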
For our modular network framework, the top-1 score does not need to be strictly accurate. Since the likelihood of the correct answer being in the top-n (as n increases) is higher than in the top-1, we take the top-n most probable answers into consideration. This is where the expert networks come into play: a set of experts further re-evaluates the router's top-n predictions. Thus, the accuracy of the experts has a significant effect on MS-Net performance. Let us assume we have a set of expert neural networks E = {e(.)_1, ..., e(.)_C}. In order to ensure that these experts effectively specialize on the subsets, we formulate a stochastic objective function, depicted in Eq. (4). The objective function optimizes each expert network on its corresponding subset data {D_sub_i, T_sub_i} using the cross-entropy loss function, and on the whole dataset using knowledge distillation. The knowledge is distilled from the router network; thus, the router is the teacher model. The alternation between the two loss terms in Eq. (4) is controlled by a Bernoulli random variable X with probability γ. The stochastic nature of the objective function, for a certain range of γ, provides (i) balanced training of the networks and (ii) better regularization. Again, the cardinality of each class index set sub_i is determined by the redundancy rate ρ. In our experiments, we demonstrate the effectiveness and
performance of the framework for ρ = 2, 3 and 4. We stress that, during the inference phase, as we increase ρ, the number of expert network evaluations increases linearly. Due to the stochastic training of the expert networks on the whole dataset using KD (the second term of Eq. (4)), these networks are no longer limited to their corresponding subset data. Rather, each network is an expert on its own subset classes while retaining a certain generalization ability on the data of the other classes. In Eq. (4), the first term optimizes the expert network e_i(.) on the classes defined by sub_i, weighted by the Bernoulli random variable X, which takes the value 1 with probability γ. The latter term of Eq. (4) optimizes the network on the whole dataset, weighted by 1 − X, i.e., with probability 1 − γ. Thus, γ controls the trade-off between the two loss terms in Eq. (4).
In the above equations, δ(t, k) is the Kronecker delta function, defined by δ(t, k) = 1 if t = k and δ(t, k) = 0 otherwise. The hyper-parameter α controls the trade-off between the KD and cross-entropy loss, where 0 < α < 1. The value of α during training depends on the performance of the teacher network: a high value puts more weight on the distilled knowledge of the teacher network, and vice-versa. In our experiments, we aim to retain as much knowledge as possible from the router network (here, the router network is the teacher network for the experts) in the expert networks. In this way, we ensure that the expert networks are at least as good as the router network, if not better. Thus, in this paper, we fix the value of α to 0.8. To learn more about the fine-tuning of KD parameters, we refer the reader to [33]. The purpose of leveraging KD in the loss function Loss^kd_{e_i} is simply to retain all the knowledge of the router network in the experts. To illustrate the contrast, we construct another objective function, depicted in Eq. (7), which is a variant of the objective function in Eq. (4) but without the knowledge distillation (wokd) term. We retrain all the experts using the loss function Loss^wokd_{e_i} and illustrate the performance gain from KD in the result discussion section.
Algorithm 1 illustrates the step-by-step training procedure of MS-Net. In Algorithm 1, lines 1 through 4 perform the initialization of the variable containers. In line 4 we obtain the subset class indices using the method discussed in Sect. 4. Lines 6 and 7 load the subset of training data corresponding to the class index sets. In line 9 we randomly sample training data consisting of all classes. Thus we have two sets of training data available, one with classes exclusively from the class index set and the other with all available classes. Lines 10 and 11 perform the forward pass of the expert network e_i(.) for the data from all classes and from the class index set, respectively. The objective function defined in line 12 optimizes either term based on the state of the random variable X. Finally, we perform backpropagation of the loss term followed by a parameter update for the expert network. We carry out this procedure for the rest of the class index sets and expert networks.
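The stochastic alternation of Eq. (4) can be sketched as follows (a simplified pure-Python stand-in for one training sample; the symbols γ for the Bernoulli probability and α for the KD weight follow the notation used above, the temperature scaling of [33] is omitted, and the helper names are our own, not the paper's code):

```python
import math
import random

def cross_entropy(q, t):
    """-sum_k delta(t, k) * log(q_k): cross-entropy with hard label t."""
    return -math.log(q[t])

def kd_loss(q_student, q_teacher, t, alpha=0.8):
    """Simplified distillation term: alpha weights the cross-entropy against
    the teacher's soft targets, (1 - alpha) the hard-label cross-entropy."""
    soft = -sum(p * math.log(qs) for p, qs in zip(q_teacher, q_student))
    return alpha * soft + (1 - alpha) * cross_entropy(q_student, t)

def expert_loss(q_sub, t_sub, q_full, q_teacher_full, t_full, gamma=0.9):
    """Eq. (4) for one sample: with probability gamma, train on the expert's
    own subset (cross-entropy); otherwise, train on the whole dataset with KD
    from the router (teacher)."""
    X = 1 if random.random() < gamma else 0  # X ~ Bernoulli(gamma)
    return (X * cross_entropy(q_sub, t_sub)
            + (1 - X) * kd_loss(q_full, q_teacher_full, t_full))
```

Setting gamma to 1 recovers pure subset training, and gamma = 0 recovers pure KD training on the whole dataset, mirroring the two extremes discussed in the result section.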

Inference phase
During the inference phase of MS-Net, the cost, or the model complexity, depends on two key parameters: n, for the top-n evaluation, and the redundancy rate ρ. In the testing phase, the input is first fed to the router, from which we obtain the probability scores for each class. Since the router is relatively small, its top-1 prediction will often be incorrect; however, the probability of obtaining the correct answer increases with n. Thus, we select the top-n most likely classes or predictions P = {p_1, .., p_n} from the sorted softmax scores q_1, .., q_n of the router. Next, for each predicted class p_i, where i = {1, .., n}, the router chooses ρ experts from the expert pool. Thus, as ρ increases, the number of expert evaluations for a particular class increases proportionally. For each element or prediction in P, we select the set of experts whose class index sets contain a predicted class:

Ē = {e_j ∈ E | p_i ∈ sub_j, i = 1, ..., n},    (9)

where E is the set of all experts, whose cardinality is |E| = C (refer to Lemma 1), and Ē is the subset of experts available for a certain set of predictions P for a single input datum. In the proposed MS-Net we always have C expert neural networks, as shown by Lemma 1 and Lemma 2. However, during inference we do not leverage all C expert neural networks. Rather, the expert neural networks are selected based on ρ and n. For each input datum, the router selects the n most likely classes for re-checking. For each class, we use ρ expert neural networks to provide information for making the final decision. Thus, MS-Net leverages at most (ρ · n) and at least (ρ + (n − 1)) expert networks during the inference phase. The values of (ρ · n) and (ρ + (n − 1)) are always smaller than C. In this paper, the maximum values for ρ and n are only 4 and 3, respectively. The prediction we obtain from the aggregated softmax of the set of selected experts Ē for input x is presented in Eq. (10).
For a single input x, the softmax returns {q_1, .., q_C}, where each element q_i is the probability of x belonging to class i,
where sm_r and sm_e are the softmax scores of the router and the experts, respectively. Finally, we take the most likely output label, or the predicted class, using Eq. (11). Algorithm 2 represents the testing phase of MS-Net. Lines 1 through 6 initialize the variables and all the networks (the router and the expert networks). Initially, we pass the input to the router network in line 8. We select the top-n most probable predictions from the router, whose further re-evaluation starts from line 9. Based on the predictions of the router, we select a fixed number of expert networks. As discussed in the earlier section, the number of expert networks used for inference is governed by the variables ρ and top-n. In the worst case, we have to evaluate at most (ρ · n) expert networks, and in the best case (ρ + (n − 1)) expert networks. We aggregate the softmax of all the expert networks in line 13 and increment the count of expert networks evaluated so far. After all the expert networks have been evaluated, we take the corrected, or re-evaluated, output based on the highest softmax value in line 20. The final output is the accuracy of MS-Net. In lines 4 and 15 of Algorithm 2, the Boolean dictionary visited[sub_1 : sub_C] ensures that we do not evaluate an expert for a particular subset more than once. This optimization comes into play in situations where the indices of two or more of the router's predictions are consecutive numbers.
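The expert selection and softmax aggregation of the inference phase (Eqs. (9)-(11)) can be illustrated as follows. This is a simplified sketch, not the paper's implementation: the networks are replaced by callables returning softmax vectors, and the visited set mirrors the deduplication of Algorithm 2:

```python
def ms_net_predict(x, router, experts, subsets, top_n=3):
    """Route x, re-evaluate the router's top-n classes with the experts that
    cover them, and return the argmax of the aggregated softmax scores.

    router, experts[i]: callables mapping x to a softmax vector over C classes.
    subsets[i]: class index set (1-based) on which expert i specializes."""
    q_router = router(x)
    C = len(q_router)
    # top-n most likely class indices (1-based), by router confidence
    top = sorted(range(1, C + 1), key=lambda c: q_router[c - 1], reverse=True)[:top_n]
    agg = list(q_router)          # start from the router's softmax (Eq. (10))
    visited = set()               # evaluate each expert at most once
    for p in top:
        for i, sub in enumerate(subsets):
            if p in sub and i not in visited:   # expert i covers class p (Eq. (9))
                visited.add(i)
                q_e = experts[i](x)
                agg = [a + b for a, b in zip(agg, q_e)]
    return max(range(1, C + 1), key=lambda c: agg[c - 1])   # Eq. (11)
```

With ρ experts per class, the visited set keeps the number of evaluated experts between ρ + (n − 1) (fully overlapping top-n classes) and ρ · n (disjoint ones), as discussed above.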

Datasets
To evaluate and validate the effectiveness of the network, we leverage three public datasets: CIFAR-10 or C-10 (Canadian Institute For Advanced Research) [1], CIFAR-100 or C-100 [1], and F-MNIST (Fashion-MNIST, Modified National Institute of Standards and Technology database). The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes. Each class has 6,000 images. The dataset is divided into two parts, with 50,000 images for training and 10,000 images for testing [1]. CIFAR-100 is just like CIFAR-10, but with 100 classes containing 600 images each. Among these 600 images per class, 500 are for training and the remaining 100 for testing. Moreover, the 100 classes are grouped into 20 super-classes. The F-MNIST database is a large database of fashion articles. It contains 60,000 training images and 10,000 testing images in 10 classes, where each image is a 28×28 gray-scale image.

Experiment settings
We implement MS-Net in the PyTorch framework [44] and perform all the experiments on a single NVIDIA RTX 2080 GPU. The hyper-parameter settings during training vary slightly across datasets. However, for all datasets, we use Stochastic Gradient Descent (SGD) with momentum. We set the initial learning rate for all routers and experts to lr = 0.1 and the momentum to 0.9. Hyper-parameters such as the batch size, the number of iterations and the learning rate decay schedule differ across routers, experts and datasets, as shown in Table 2.

Result discussion
For the CIFAR-10 and CIFAR-100 datasets, we perform a detailed empirical study of the effect of the variable γ (of the objective function in Eq. (4)) on the expert networks during the training phase. We also analyze the effect of ρ and n during the test phase. In addition, besides ResNet-20, we also report the performance of MS-Net with two well-known DNNs as the backbone. However, we perform all the empirical analysis and hyper-parameter search with the ResNet-20 backbone. Tables 3 and 4 present the performance of MS-Net (with the ResNet-20 backbone) for CIFAR-10 and CIFAR-100, respectively.
In Table 6 we demonstrate the performance of each individual expert network on the subset class indices for CIFAR-10 and F-MNIST. CIFAR-100 has 100 classes, which makes it difficult to present the performance of all 100 expert networks in a table. The table illustrates several key points about MS-Net. Firstly, we observe that each expert network performs remarkably well on its corresponding subset; that is, the performance of e_i on its corresponding subset sub_i, where i = 1, ..., C, is very good (highlighted in Table 6). The performance of any expert network on the whole set, or on subsets assigned to other expert networks, is comparatively lower. Secondly, the performance of the router R on each individual subset is significantly lower than that of the expert networks. However, when we execute the router and the expert networks together, they perform very well.
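The joint execution of the router and the experts can be sketched as follows. This is an illustrative reconstruction, not the paper's code: the fusion rule (summing the softmax scores of the router and the selected experts) and the one-expert-per-class mapping (i.e. ρ = 1) are simplifying assumptions:

```python
import math

def softmax(logits):
    # numerically stable softmax over a list of raw class scores
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def ms_net_predict(x, router, experts, expert_of_class, n=2):
    """Selective evaluation: re-check the router's top-n classes with the
    experts specializing in them, then fuse the softmax scores.
    One expert per class (rho = 1) is assumed here for simplicity."""
    probs = softmax(router(x))
    top_n = sorted(range(len(probs)), key=lambda c: -probs[c])[:n]
    fused = list(probs)
    for c in top_n:
        expert = experts[expert_of_class[c]]
        fused = [f + e for f, e in zip(fused, softmax(expert(x)))]
    return max(range(len(fused)), key=lambda c: fused[c])

# Toy demo: the router narrowly prefers class 0, but the class-1 expert
# is confident about class 1, so the fused prediction flips to class 1.
router = lambda x: [2.0, 1.9, 0.1]
experts = {0: lambda x: [1.0, 1.0, 1.0],
           1: lambda x: [0.0, 5.0, 0.0],
           2: lambda x: [0.0, 0.0, 5.0]}
prediction = ms_net_predict(None, router, experts, {0: 0, 1: 1, 2: 2}, n=2)
```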
The empirical results for CIFAR-10 and CIFAR-100 suggest that, during the training phase, fixing the Bernoulli probability of the objective function to 0.9 tends to give relatively higher scores. To avoid redundant experiments, we perform the rest of the training with this probability fixed to 0.9. Table 5 presents the performance of MS-Net for F-MNIST.
Clearly, with the Bernoulli probability fixed to 1 in Eq. (4), we simply optimize the expert networks on training data sampled from the subset class indices. On the contrary, with the probability fixed to 0, we optimize the expert networks on the dataset comprising all the available classes, which is analogous to naive Ensemble Learning (EL) of DNNs. The optimal value of this probability has no theoretical binding; rather, it depends on the dataset. Expert networks trained with the probability in the range 0.3 to 0.9 give near-optimal classification scores. However, fixing it to either 0 or 1 during training degrades the scores, which implies that we should keep it within a certain range while optimizing the proposed loss function. The variable n specifies how many of the router's top-n most probable predictions the experts further re-evaluate. For all the experiments, we re-evaluate up to the router's top-3 predictions. The delta score δ depicts the net number of samples correctly re-classified by the experts: a positive δ is the number of samples the expert networks have correctly re-classified, and a negative δ indicates the number of mis-classifications introduced by the experts. In other words, δ measures the improvement in accuracy of our framework relative to the router network. All the scores that we report in this paper (figures and tables) are relative to the backbone network, which in this case is the router network. It is worth noting that we use online inference during testing: for MS-Net, we make the prediction for a single observation at each iteration, as opposed to batch processing. Due to the modular nature of the framework, online inference is the simplest implementation. Table 3 presents the performance of MS-Net for CIFAR-10. From our experimental results, we can deduce the following key observations.
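The alternative training enforced by the stochastic objective can be sketched as below. The function and argument names are ours, and only the Bernoulli gating of Eq. (4) is shown, not the full loss:

```python
import random

def stochastic_loss(loss_subset, loss_full, p, rng=random):
    """Sketch of the Eq. (4)-style gating: per iteration, X ~ Bernoulli(p)
    selects whether the expert trains on a batch from its own subset
    classes (X = 1) or on a batch drawn from all classes (X = 0).
    The occasional whole-set batches act as a regularizer against
    over-fitting on the subset."""
    x = 1 if rng.random() < p else 0
    return x * loss_subset + (1 - x) * loss_full
```

At the extremes, p = 1 reduces to training on subset data only, and p = 0 to training on all classes (naive ensembling), matching the discussion above.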

Performance on CIFAR-10
1. As we increase the value of ρ, the accuracy increases; refer to Fig. 8 (d, e) for a graphical illustration of this phenomenon. However, for CIFAR-10, increasing top-n beyond 2 does not improve the performance further (Fig. 8 (a, b and c)).
2. We observe a gradual improvement in the performance of the expert networks trained with an increasing Bernoulli probability, which can be confirmed from Figs. 5 (a) and 6. The score is lowest when we train the expert networks with the probability fixed to 1 (i.e. clamping X = 1 in the objective function during the whole training process). This suggests that training the expert networks solely on their subset classes does not improve performance, but rather degrades it. This degradation occurs due to imbalanced logit values in the last layer, since the expert networks do not encounter any training data from the remaining classes (classes apart from the subset classes) during the training phase. Alternatively training these experts on the whole set of data, within the optimal range of the Bernoulli probability, substantially improves the performance. This acts as a very effective regularization, as it prevents the experts from over-fitting on the data from the subset classes. A graphical overview of the effect of this probability is presented in the bar chart of Fig. 6.
3. In our experiments, we obtain the best score for CIFAR-10 with the ResNet-20 backbone (95.38%) with ρ = 4, n = 2 and the Bernoulli probability 0.9. The δ score with these parameters is +270, which means that integrating the expert networks with the router improves the performance by +2.70%. In other words, the router with a ResNet-20 backbone has a top-1 accuracy of 92.68%, and by integrating the experts for further re-evaluation we elevate the top-1 score by +2.70%, i.e. to 95.38%.
Table 4 presents the results for CIFAR-100. For CIFAR-100, the same hyper-parameters ρ = 4, n = 2 and Bernoulli probability 0.9 give a relatively high score of 71.68%.
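The δ score used throughout these tables can be computed as in the following sketch (the function names are ours, not the paper's):

```python
def delta_score(router_preds, msnet_preds, labels):
    """delta: net number of test samples MS-Net classifies correctly
    beyond the router alone (negative if the experts introduce more
    errors than they fix)."""
    router_correct = sum(r == y for r, y in zip(router_preds, labels))
    msnet_correct = sum(m == y for m, y in zip(msnet_preds, labels))
    return msnet_correct - router_correct

def accuracy_gain_percent(delta, num_samples):
    # e.g. delta = +270 on the 10,000 CIFAR-10 test images -> +2.70%
    return 100.0 * delta / num_samples
```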
We can observe from Table 7 that the router's top-1 performance (ResNet-20) for CIFAR-100 is only 69.58%, and with the integration of the experts the performance increases by 2.48%. This suggests that as we increase ρ and n, we are more likely to obtain higher accuracy. Table 5 reports the performance on F-MNIST with the Bernoulli probability fixed to 0.9.

Performance on CIFAR-100
The scores in Table 7 show that MS-Net has a relatively lower score on CIFAR-100 compared to CIFAR-10 and F-MNIST. This is also observable for other state-of-the-art DNNs (refer to Table 8). The probable reason for this lower performance is the smaller amount of data per class in CIFAR-100: while CIFAR-10 has 6,000 samples per class, CIFAR-100 has only 600. This problem has recently been mitigated to a certain extent by leveraging large-scale Transfer Learning (ImageNet pre-training) [46], learned data augmentation policies or Auto-Augment (AA) [13], task-specific NAS with Transfer Learning (TL) [59,71], neural architecture search through hybrid online TL with a multi-objective evolutionary search procedure [75], and so on. The MS-Net proposed in this study also improves significantly on its backbone networks, and we may expect further improvement by introducing TL and the other techniques described above. Table 5 presents the results of MS-Net on F-MNIST; the δ score there depicts the number of samples correctly re-classified by MS-Net's expert networks relative to the backbone. In order to avoid redundant experiments, we use the same hyper-parameters that give the optimal scores for CIFAR-10 and CIFAR-100. The backbone (ResNet-20) score for F-MNIST is 95.22%, and the network achieves a score of 96.77% on the F-MNIST test set; the delta score is +156, which is relatively high. To our knowledge, the best previously reported score for F-MNIST was 96.30%, obtained by Wide-ResNet-28-10 with Random Erasing data augmentation [79]. Thus, MS-Net achieves a state-of-the-art score for this particular dataset.

Optimal values for the Bernoulli probability, top-n evaluation and ρ
A very common intuition is that as we increase the router's n, we can score (at best) as well as the router's top-n prediction score. In practice, increasing n beyond 2 does not substantially improve the performance; however, increasing the value of ρ gradually improves the accuracy of the network. For a numerical comparison, please refer to Tables 3 and 4. In addition, Figs. 6 and 7 illustrate the effect of ρ and n for CIFAR-10 and CIFAR-100, respectively. (Table 6 reports the performance of each individual expert network on all the subsets. The highlighted entries depict the score of each expert on its corresponding subset class index, obtained through the Round-Robin partition; the last column S depicts the performance of the experts on the whole set of data, and the row R represents the performance of the router network on each individual subset. In this table, all the experts and the router network have a ResNet-20 backbone, ρ is 4, which is also the cardinality of each subset, and all experts are trained with the Bernoulli probability 0.9.) An interesting observation from Fig. 8 is that as we increase ρ the performance improves dramatically; on the contrary, increasing n does not increase the accuracy by a big margin. This is also the case for the CIFAR-100 dataset. The graphs in Fig. 9 indicate that, for a fixed value of ρ, increasing n further increases the accuracy, but not by a substantial margin. Although in this study our experiments are limited to ρ ≤ 4, we can anticipate that for CIFAR-100 further increasing ρ will increase the accuracy, since CIFAR-100 is a relatively difficult dataset with a large number of classes. From the above observations, we can conclude with the following guidelines for optimal parameter selection.
1. Evaluating up to the router's top-2 probable predictions suffices. This holds at least for all the datasets we have explored so far.
2. Setting the redundancy rate ρ to 3 provides a comparable classification score in all cases. Increasing ρ means that we have more expert networks for each class; thus, given a sufficient resource budget, we can increase ρ beyond 3 for more redundant expert networks and higher accuracy.
3. During the training phase, the Bernoulli probability of the objective function (Eq. (4)) plays a crucial role in performance. Although there is no fixed value or theoretical binding for it, we recommend avoiding the two extreme values, 0 and 1. Optimizing the expert networks with the probability kept in the range 0.3-0.9 tends to give optimal scores.
Thus, to keep the experiments simple, the training of MS-Net (implemented with different backbone networks) in the rest of the paper is confined to the above-mentioned hyper-parameters. To assess the contribution of Knowledge Distillation (KD), we also train with a variant of the objective function (Eq. (4)) in which we replace the KD term with a simple cross-entropy loss term (Eq. (7)). We train MS-Net on CIFAR-10, CIFAR-100 and F-MNIST leveraging this loss, Loss_wokd, depicted in Eq. (7); the hyper-parameters are exactly identical to those of the experiments with the KD loss. The results show that the expert networks of MS-Net optimized without the KD loss drop in accuracy by a substantial margin. Fig. 10 shows the contrast in classification accuracy, and Fig. 11 depicts the contrast between the δ scores of MS-Net trained with and without the KD loss for CIFAR-10, CIFAR-100 and F-MNIST, respectively. Using distillation in the loss term assists the expert networks in retaining the existing knowledge of the router. It also ensures that the experts are at least as good as the router network in the worst case; in other words, the expert networks are less prone to mis-classify samples that are already correctly classified by the router.
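The two objectives contrasted in this ablation can be sketched as below. This is a standard cross-entropy-plus-distillation formulation; the weighting `alpha` and the `temperature` are illustrative assumptions and may differ from the exact form of Eqs. (4) and (7):

```python
import torch
import torch.nn.functional as F

def expert_loss_with_kd(student_logits, teacher_logits, targets,
                        alpha=0.5, temperature=4.0):
    """Cross-entropy plus a distillation term that keeps the expert close
    to the router (teacher). alpha and temperature are illustrative
    values, not the paper's exact settings."""
    ce = F.cross_entropy(student_logits, targets)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)
    return alpha * ce + (1.0 - alpha) * kd

def expert_loss_without_kd(student_logits, targets):
    # the ablation of Eq. (7): plain cross-entropy, no distillation term
    return F.cross_entropy(student_logits, targets)
```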

Comparison to state-of-the-art results
In this section, we provide a brief comparison of MS-Net with existing state-of-the-art DNNs on the CIFAR-10, CIFAR-100 and F-MNIST datasets. For this comparison we provide two tables, Tables 7 and 8. (Table 7 reports the performance of MS-Net for CIFAR-10 (C-10), CIFAR-100 (C-100) and F-MNIST: the first section depicts the scores of the backbone networks themselves, which also indicate the performance of the routers, and the second section represents the performance of our proposed framework (MS-Net) equipped with different backbone networks; we train MS-Net with the different backbones using the exact same hyper-parameters.) We can observe from Table 7 that the MS-Net framework elevates the classification accuracy by a significant margin relative to the backbone router. Comparing the MS-Net framework with the Type-I networks of Table 8, the network actually performs with neck-and-neck scores. However, compared to the Type-II networks (with approximately similar parameter counts), MS-Net scores higher than most of the networks. The highest score that we obtain so far is with the backbone network GoogLeNet, leveraging at most 55.80M parameters (Table 7). This high score and setup undeniably come with a trade-off of more computational resources and a larger parameter budget. Most recently, researchers have been trying to find the best structure using evolutionary algorithms, reinforcement learning algorithms, and so on, and some very interesting results have been obtained [75,82-84]. For example, for CIFAR-10, the best performance obtained so far is 98.9% (refer to the Type-II section of Table 8), and that model has 64M training parameters [71]. (Fig. 9 illustrates the effect of the variables n and ρ on CIFAR-100 during the test phase: the first row (a, b and c) depicts the variation of the score while keeping ρ fixed and changing n, the first two figures of the second row depict the performance while keeping n fixed and nudging ρ, and the last figure (f) combines figures (a)-(e) to depict the contrast clearly.) However, based on the 'no free lunch theorem' [34], an optimal model is usually fine-tuned for some specific database, and the model may not be useful for solving other problems. Even for the same problem with more observed data, to preserve the best performance we have to use a very expensive process to re-design the model. On the other hand, the MS-NET structure proposed in this study is very simple, and can leverage the performance of any existing state-of-the-art model while increasing the inference cost only slightly. In this sense, MS-NET can be a good starting point for solving various problems.

Conclusion
In this paper, we have proposed a modular neural network architecture termed MS-Net (Modular Selective Network). For a C-class classification problem, the network consists of a router network and C expert networks. In summary, the key idea of this research is to further re-evaluate the top-n most probable predictions of the router by leveraging these expert networks. To effectively train the expert networks, we have proposed a stochastic objective function equipped with the knowledge distillation technique, which facilitates alternative training on a subset of expert data and the whole set of data. This alternative training is regulated by attaching a Bernoulli random variable to each term of the loss function. We construct the subsets of data systematically, in a Round-Robin fashion. This provides a means to control the redundancy of each class in the set of subsets, and also lets us know which expert network is a specialist on which subset (thus giving more interpretability). We have shown that, with a very limited parameter budget and a simple DNN as the backbone, our network achieves performance comparable, and sometimes equivalent, to more complex DNNs.
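One plausible implementation of the Round-Robin subset construction with redundancy rate ρ is sketched below; the exact scheme assumed here (ρ consecutive class indices per expert, wrapping around) is our reading of the method, not the paper's code:

```python
def round_robin_subsets(num_classes, rho):
    """Assign each of the C experts a subset of rho consecutive class
    indices (mod C). Under this round-robin scheme every class is
    covered by exactly rho expert networks."""
    return [[(i + k) % num_classes for k in range(rho)]
            for i in range(num_classes)]

# CIFAR-10 setting used in the paper's tables: C = 10 experts,
# each subset of cardinality rho = 4
subsets = round_robin_subsets(10, 4)
```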
An interesting research direction would be to apply a Neural Architecture Search (NAS) strategy to MS-Net. We anticipate that such an approach can further cut down the redundant parameters of the expert networks; each expert network could then be reshaped and designed based on its assigned expert data. Fortunately, the modular nature of MS-Net keeps each individual expert network independent and local, i.e. the experts do not depend on each other during the training and inference phases. This means the experts can be trained and tested in parallel, which gives us an opportunity to utilize powerful parallel computing systems. Further optimization of this network can be obtained by reducing the number of expert-network evaluations. As we know from our experiments, during the inference phase the router network chooses a fixed number of experts for further evaluation, and the final prediction is obtained only after all the selected experts have been fully evaluated. This can be time-consuming and redundant for easy data or patterns. To mitigate the unnecessary evaluation of experts, the concept of Progressive inference introduced in [49] can be leveraged in our modular network. The main idea is to stop evaluating expert networks once we reach a certain softmax confidence threshold (the threshold can be obtained through trial and error). In this way, the parameter usage can be further reduced without compromising the network accuracy. So far in this paper, we have used the router network as the teacher model for knowledge distillation; we anticipate that using a more accurate and powerful DNN as the teacher model can help the expert networks generalize better. In our future work, we plan to deploy the proposed modular neural network in real-world scenarios.
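The early-stopping idea described above could look as follows; this is our sketch of an early-exit rule under an assumed confidence threshold, not the implementation from [49]:

```python
import math

def progressive_expert_eval(x, selected_experts, threshold=0.9):
    """Evaluate the router-selected experts one by one and stop early
    once an expert's softmax confidence reaches the threshold (which
    would be tuned by trial and error). Sketch of the Progressive
    inference idea [49] applied to MS-Net."""
    best = None
    for expert in selected_experts:
        logits = expert(x)
        m = max(logits)
        probs = [math.exp(v - m) for v in logits]
        total = sum(probs)
        probs = [p / total for p in probs]
        conf = max(probs)
        pred = probs.index(conf)
        if best is None or conf > best[0]:
            best = (conf, pred)
        if conf >= threshold:
            break  # confident enough: skip the remaining experts
    return best[1], best[0]
```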
In order to test the network, we will build a test bed equipped with multiple neural compute sticks (powered by vision processing units) and run several experts in parallel for faster and more efficient inference.