1 Introduction

Machine learning is a form of artificial intelligence (AI) that has been explored in computer vision, natural language processing, speech recognition, robot control, social networks, financial analysis, medical diagnosis, security, and other applications. Machine learning enables a system to learn and improve from experience rather than through explicit programming. Specifically, machine learning does not solve a problem by manually programming for all possible inputs. Instead, it trains on sample input–output behaviors and obtains learning models through statistical learning analysis. The models learn the inherent structure of the input data in unsupervised machine learning, or predict the outcome from the input data in supervised machine learning. Many problems that machine learning algorithms tackle are data-intensive applications that contain large volumes of data which need to be processed [1, 2]. For example, large amounts of time-series sensor log data collected in smart home Internet of Things (IoT) environments are fed into artificial neural networks (ANN) or support vector machines (SVM) to track and recognize resident activities, which helps elderly people living alone [3, 4]. Vast amounts of sentiment data from social networks are learned by decision trees, k-nearest neighbors, SVM and Bayes classifiers to analyze people's sentiments and opinions towards products, events, topics, etc. [5, 6]. Large-scale driving video datasets are used to train convolutional neural networks (CNN) to predict driver actions [7]. In addition, large stores and databases of multimedia datasets are also used to train CNNs for content-based image retrieval analysis [8]. However, most machine learning algorithms contain iterative processes that are computationally intensive, and when these datasets grow in size, the execution time grows significantly. Therefore, it is necessary to spread the computationally intensive part among multiple cores or threads to reduce the time complexity.

Much research has been conducted into parallel implementations of machine learning algorithms. Recent developments include parallel implementations running on multi-core central processing units (CPUs) and graphics processing units (GPUs). GPUs are capable of processing data with higher peak computing power than CPUs alone [9]. Parallel programming models for GPUs include CUDA (Compute Unified Device Architecture), OpenCL (Open Computing Language), DirectCompute and many other approaches. Cuomo et al. [10] proposed a novel parallel implementation of self-organizing map (SOM) neural networks on CUDA-GPU architectures. This approach uses the latest cuBLAS library from NVIDIA to achieve fast accelerated execution of standard linear algebra subroutines. Tan et al. [11] presented a parallel implementation of SVM for hyperspectral image classification utilizing CUDA and OpenMP. The advantage of the method is a two-level parallelization that optimizes both the kernel functions of SVM and the communication cost between GPU and host. Some high-level abstraction models, such as MapReduce and GraphLab, are used to simplify the parallel design and implementation along with the GPU [12, 13]. The MapReduce model applies a mapper operation independently to each datum of a large dataset, and a reducer operation then aggregates the outputs from each mapper. However, MapReduce has limitations when dealing with data dependencies. The GraphLab model is able to represent dependencies using a data graph and a shared data table.

On the other hand, parallel implementations on multicore CPUs gain overall performance improvements by applying common low-level parallel programming practices to machine learning algorithms, such as POSIX threads, Open Multi-Processing (OpenMP) and the Message Passing Interface (MPI) for shared and distributed memory architectures [14]. This approach relies heavily on the parallel data representation and poses programming challenges. Lotrič presents a parallel implementation of a feed-forward neural network using MPI [15]. This method achieves acceleration by updating the weights iteratively in batch mode for two gradient-based algorithms: the classical gradient descent algorithm and the Levenberg–Marquardt algorithm. Zhao et al. [16] proposed a parallel implementation of SVM on a multi-core and multiprocessor system.

The margin setting algorithm (MSA) is a novel machine learning algorithm that was proposed by Caulfield et al. [17] for pattern recognition. It has been applied in many fields, including segmentation analysis in hyperspectral images and impulse noise removal in color images [18,19,20,21]. Recently, it has been applied to anomaly detection for false data injection attacks in smart grids and to human activity learning [22, 23]. However, MSA still leaves room for improvement in computational efficiency. To the best of our knowledge, this is the first work in which the MSA is accelerated with a parallel implementation. Given the multiple applications of MSA, its massive parallelization is a relevant contribution. Multicore processors are a suitable platform to implement MSA in terms of flexibility, high performance and energy efficiency. The proposed parallel MSA, called PMSA, can significantly reduce the execution time.

The main contribution of this paper is twofold. First, a parallel pool with a specified number of workers is used to reduce the MSA classification testing time. Second, we implement PMSA so that it can dynamically spawn a number of threads at runtime to perform parallelism. These two parallel mechanisms are evaluated through experiments on image segmentation and machine learning benchmark datasets. The experimental results show that PMSA classification gains a significant speedup compared to the serial implementation.

2 Related work

A large body of research has been dedicated to parallel machine learning algorithms. Studies have addressed supervised learning with decision trees, K-nearest neighbors (KNN), Naïve Bayes, SVM, ANN and CNN.

A decision tree is a tree structure that represents a natural way of decision-making. The decision tree takes examples as inputs; each example includes a collection of attributes together with an outcome (or class). To induce a decision tree, there are three basic elements: decision nodes, branches and leaves. A decision node specifies a test attribute, each branch is an outcome of that test attribute, and a leaf node indicates the class to which the test example belongs. There are mainly four methods for parallelism: horizontal, vertical, task and hybrid. Horizontal parallelism partitions the data across different processors, vertical parallelism partitions the attributes across different processors, and task parallelism partitions the tree nodes across different processors. Hybrid parallelism combines the other three parallelisms during different stages. For example, Ben-Haim et al. proposed a new algorithm called streaming parallel decision tree (SPDT), which builds the decision tree using horizontal parallelism for large datasets [24, 25].

The KNN algorithm determines the class of unknown data based on the closest K neighbor examples with known class labels. KNN can be parallelized using various methods. One is to partition the data using spatial indexing structures [26]; however, this method is not efficient for high-dimensional data. To overcome this limitation, a parallelized brute-force KNN was proposed that uses a merge-path function and implements a truncated merge sort to find the K smallest squared distances from each element in a query set Q to each element in a fixed set R [27].

A Bayesian network is a probabilistic graphical model represented by a directed acyclic graph (DAG). The Naïve Bayes classifier is a simple Bayesian network composed of a DAG with only one root (parent) node and several child nodes. It assumes that the child nodes are independent and does not consider any correlations between them. Naïve Bayes can be accelerated using GPUs for automatic document classification, with different threads running during the two steps of model generation: calculation of the probabilities of terms within classes, and calculation of document term frequencies [28, 29].

SVM performs classification by finding the separating hyperplane that has the largest margin between two classes. For non-linearly separable data, SVM employs kernel functions that map the data from the original space into a feature space where a maximum-margin hyperplane can be found. SVM has wide applications in various fields. Ali et al. [30] proposed a new image retrieval method that uses SVM on normalized histograms constructed from the visual-word integration of the Scale Invariant Feature Transform (SIFT) and Speeded Up Robust Features (SURF). SVM was also applied to 3D face recognition on the GavabDB database with an accuracy of 87.5% [31, 32]. Besides, a method using SVM was proposed for satellite image classification based on the Pairs Orthogonal Vector Histogram (POVH); its advantage is computing discriminative spatial clues, which is robust to image rotation and achieves better performance than CNN [33]. Many parallel versions of SVM have been proposed. Lin et al. [34] presented a method using a sparse matrix format to achieve better performance. Chang et al. proposed a method that performs a parallel row-based Incomplete Cholesky Factorization (ICF) on the loaded data; in addition, a parallel interior-point method (IPM) is used to solve the linearly constrained quadratic programming problem of SVM [35]. However, this method does not consider advanced multicore architectures. You et al. considered X86-based multicore and many-core architectures, such as Ivy Bridge CPUs and Intel Xeon Phi (MIC), and presented a parallel sequential minimal optimization (SMO) algorithm for SVM. This method includes several analyses and optimizations of SVM to achieve multilevel parallelism, e.g., data parallelism and task parallelism [36, 37].

ANN is a non-linear data modeling tool for classification and regression problems. It is a self-adaptive, universal function approximator for mapping any functional relationship between the features and class labels. The most widely used training algorithm for ANN is backpropagation, which builds a feed-forward network composed of layers of neurons. Dahl et al. proposed a network-parallel training technique that divides the neurons of the ANN across different threads on different machines. Each thread maintains one copy of the ANN and trains on a randomly selected disjoint subset of the training samples. After a certain number of epochs, the weights of all threads are collected and broadcast to the other threads, and the process continues until the error is less than a threshold value. It is implemented using MPI [38]. The same parallel approach is used on multicore CPUs for face recognition, implemented with OpenMP. The disadvantage of this method is that the ANN must be very large to offset the communication overhead between neurons on different threads. Huqqani et al. also presented a GPU-based approach that keeps only one copy of the network on the GPU; each GPU thread executes independently, and one auxiliary array is needed to store the weights and input data for the GPU loop [39].

CNN is an important deep learning algorithm that learns multiple layers of representations and abstractions to capture complex relationships in the data. There is a large body of research on applications of CNN. For example, Ratyal et al. presented pose-invariant, deeply learned multi-view 3D face recognition using a deep convolutional neural network, which yields superior performance compared to other existing methods [40]. Additionally, CNN has also been used to assist makeup-invariant face recognition using an augmented face dataset, as well as accurate age estimation by investigating the asymmetric left and asymmetric right face images [41, 42]. Practical implementations of CNN rely heavily on parallel processors such as GPUs. Nowadays, most deep learning algorithms are implemented using existing deep learning frameworks. These implementations mostly focus on high-level issues so that developers do not need to worry about the close optimization of parallel kernels for different hardware platforms. For example, Catanzaro et al. proposed a library with optimized routines for deep learning workloads that is easy to integrate into existing frameworks, e.g., Caffe. This approach optimizes the performance of the matrix multiplication routines into which convolutions are converted, without using any auxiliary memory [43, 44].

3 Parallel margin setting algorithm

3.1 Background

The invention of the margin setting algorithm was motivated by pattern recognition. Pattern recognition in machine learning is a method of making statistical inferences from perceptual data, and a central task of pattern recognition is classification. The main purpose of classification is to understand and perceive the discrimination among patterns. In particular, a pattern is defined as a pair \(\left\langle {x,y} \right\rangle\), where x is a feature vector composed of a collection of features. Features can be a series of attributes or properties of the data, which can be numerical (e.g., length) or symbolic (e.g., shape). If the feature vector x has m features, x lies in the m-dimensional feature space. Note that examples \(\left\langle {x_{i} ,y_{i} } \right\rangle\) of a pattern that share similar feature values may belong to the same class, and if their features differ, they should be categorized into different classes. Therefore, it is desirable that a good feature vector contains features that discriminate among classes very well.

MSA differs from SVM in two ways. First, the original design of the margin setting classifier is more applicable to difficult non-separable scenarios, whereas the original design of SVM is motivated by a linear classifier. Second, in the non-separable scenario, SVM seeks a linear separation by mapping the data into a higher dimensional space, whereas MSA only considers classifying the data in its original feature space with optimal classifiers. The optimal classifiers are the union of all the prototypes of each class.

3.2 Training

Each class region in MSA is represented as a set of prototype regions, i.e., hyperspheres. Prototypes are the classifiers of MSA, so we use the terms prototype and classifier interchangeably. Specifically, a prototype is defined as a hypersphere, or n-sphere, where n is a natural number representing the n-dimensional space. \(\omega_{i}\) is the centroid of \(G_{i}\), \(R_{i}\) is the radius of \(G_{i}\), and \(C_{p}\) is the class label. i and k are natural numbers, and N is the number of prototypes belonging to class \(C_{p}\). The prototypes \(G\) are a set in center-radius form with class label \(C_{p} \left( {p = 1,2, \ldots ,P} \right)\):

$$G_{i} = \left( {\omega_{i} ,R_{i} ,C_{p} } \right),\quad 1 \le i \le N,\quad p = 1, 2 , \ldots ,P$$
(1)
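For concreteness, the center-radius prototype of Eq. (1) can be represented by a small data structure such as the following Python sketch; this is an illustration of the definition only, not the authors' code, and the names are our own.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Prototype:
    center: np.ndarray   # omega_i, a point in the normalized feature space
    radius: float        # R_i, radius of the hypersphere
    label: int           # C_p, class label covered by this prototype

    def covers(self, x: np.ndarray) -> bool:
        """True if the sample x falls inside the hypersphere."""
        return float(np.linalg.norm(x - self.center)) <= self.radius
```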

The training process of MSA includes two alternating processes: evolution and partition, as shown in Fig. 1. The evolution process seeks an optimal prototype for each class after a number of iterations; each iteration is called one mutation. After the evolution process ends, MSA enters the partition process, which separates the training data into subsets over several iterations; each iteration is called one generation. We discuss these two iterative processes in turn.

Fig. 1 MSA training process

  1. Evolution process: the evolution process of MSA employs an artificial immune system approach, continually developing new antibodies to fight off antigens. As shown in Fig. 2, this process performs three steps repeatedly until the stopping conditions are reached: (a) prototype generation, (b) fitness calculation and (c) prototype mutation.

    Fig. 2 PMSA classification at thread-level

Prototype generation starts with N random points in the normalized space. The \(N\) random points are antibodies, and the training sample set T contains the antigens. Prototypes are constructed by using the antibodies as centers. If the antigen closest to a center belongs to class Cp, the prototype covers class Cp. The radius is the distance between the center and the closest antigen of a different class \(C_{p}^{'}\). In this way, the constructed prototypes of class Cp cover only training points of class Cp. However, some constructed prototypes are redundant because they cover the same training sample points. To remove the redundancy, the prototypes are evaluated for fitness.

Fitness is measured using a metric called the figure of merit. Given a prototype G, the figure of merit \(F\) is defined as the number of points belonging to class \(C_{p}\) inside the prototype \(G\). The figure of merit represents the fitness of the prototype classifier: a larger \(F\) indicates better fitness than a smaller \(F\). The \(F\) values of all prototypes are calculated, and the largest \(F\) is denoted \(LF\). Only the prototype with \(LF\) is used for this generation; the other prototypes are discarded. In order to find a prototype with a larger \(LF\), MSA searches the neighborhood area using mutation.

Prototype mutation starts after the \(LF\) of the prototypes of each class is calculated. The aim of mutation is to find a larger \(LF\). MSA starts the search by selecting one prototype per class and mutating it into several prototypes. The selection is stochastic: one prototype is randomly selected according to the figure-of-merit distribution of the prototypes, which ensures that prototypes with a larger figure of merit have a higher probability of being selected. After a prototype is chosen, its center mutates into \(N\) points. The mutated points follow a certain randomness to ensure that they lie in the neighborhood area. The mutation usually runs for several iterations and stops when no larger \(LF\) can be found for all classes or when it reaches a certain number of iterations set by the user. Once the mutation stops, the mutated points are used as the random points when the algorithm returns to step (a) to generate new prototypes. In this sense, the artificial immune system approach continually develops new prototypes (antibodies) to cover as many sample points (antigens) as possible (a code sketch of the evolution steps is given after the partition process below).

  2. Partition process: this process begins after the mutation stops. The training sample points inside the prototype with \(LF\) of each class are denoted as subset \(T_{1}\); they are considered training points covered by optimal prototypes. These prototypes are optimal, strong classifiers that classify class points correctly, which differs from boosting algorithms that combine weak classifiers [45]. The partition process removes \(T_{1}\) from the original training set \(T\), and the remaining set then enters the evolution process to construct new prototypes again.
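As a concrete illustration of the evolution steps above, the following Python sketch implements one round of prototype generation, figure-of-merit calculation, fitness-proportional selection and center mutation. It is a minimal sketch under our own assumptions (NumPy arrays, features normalized to [0, 1], prototypes stored as (center, radius, label) tuples), not the authors' implementation; all function and parameter names are illustrative.

```python
import numpy as np

def generate_prototypes(antibodies, X, y):
    """Step (a): one prototype per antibody; its class is that of the closest
    antigen, its radius the distance to the closest antigen of another class."""
    prototypes = []
    for w in antibodies:
        d = np.linalg.norm(X - w, axis=1)        # distances to all antigens
        label = y[np.argmin(d)]                  # class of the closest antigen
        other = d[y != label]
        radius = other.min() if other.size else d.max()
        prototypes.append((w, radius, label))
    return prototypes

def figure_of_merit(proto, X, y):
    """Step (b): number of same-class training points inside the hypersphere."""
    w, r, label = proto
    inside = np.linalg.norm(X - w, axis=1) <= r
    return int(np.sum(inside & (y == label)))

def select_for_mutation(prototypes, X, y, rng):
    """Step (c): fitness-proportional (roulette-wheel) selection of one prototype."""
    f = np.array([figure_of_merit(p, X, y) for p in prototypes], dtype=float)
    p_sel = f / f.sum() if f.sum() > 0 else np.full(len(f), 1.0 / len(f))
    return prototypes[rng.choice(len(prototypes), p=p_sel)]

def mutate_center(center, rng, n_points=20):
    """Mutate the selected center into n_points neighbours; with features in
    [0, 1] the maximum perturbation reduces to min(w, 1 - w) per coordinate."""
    sign = rng.choice([-1.0, 1.0], size=(n_points, center.size))
    alpha = rng.random((n_points, center.size))
    u = np.where(center <= 0.5, center, 1.0 - center)
    return np.clip(center + sign * alpha * u, 0.0, 1.0)
```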

Compared to the training process of some machine learning algorithms, such as SVM, which uses the whole training set to seek the hyperplane classifier with the largest margin, MSA gradually divides the training set into smaller subsets. MSA trains classifiers for those smaller problems with a margin for error preset by the user. The advantage of this decomposition is that breaking the whole training set into a number of small subsets makes it easier to find classifiers for non-separable cases. Specifically, given the training set \(T\), MSA generates a partition of \(T\), and each partition is denoted as a subset \(T_{1}\), \(T_{2}\), \(T_{3}\), …. The training set can be viewed as a union of its subsets, and the subsets are non-empty and non-overlapping. In particular, we formulate the partition process as below:

$$\begin{aligned} & T = \mathop {\bigcup }\limits_{i > 1} T_{i} \\ & {\text{where}}\quad T_{i } \ne \emptyset \,\&\, T_{i } \cap T_{j } = \emptyset \left( {i \ne j} \right). \\ \end{aligned}$$
(2)

The above partition process is conducted at runtime. MSA iteratively seeks the prototypes for the subsets, and the partition subsets are generated sequentially. The training points that fall into the prototypes are removed from the whole training set, and the remaining training set recursively performs the partition task until no training points are left.

Ideally, MSA stops when all the points are partitioned out and the reduced set becomes empty. However, considering misclassified and unclassified points, the ideal case may not be reached. Therefore, some parameter values are preset to prevent the algorithm from running for a long time without converging. First, the proportion of members in the reduced set drops below a user-set limit, e.g., 0.001%. Second, the number of generations reaches a user-set limit, e.g., 20. As soon as one of the above conditions is met, MSA stops the training process.
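The outer training loop described in this subsection can be sketched as follows. The sketch reuses the helper functions from the previous code block, omits the inner mutation search for brevity, and uses the stopping thresholds mentioned above (0.001% of points remaining, or 20 generations); all parameter names are our own assumptions.

```python
def msa_train(X, y, margin=0.1, n_antibodies=20,
              max_generations=20, min_fraction=1e-5, rng=None):
    """Alternate evolution and partition until the remaining training set
    is (almost) empty or the generation limit is reached."""
    if rng is None:
        rng = np.random.default_rng()
    prototypes_all = []
    remaining = np.ones(len(X), dtype=bool)
    n_total = len(X)
    for _ in range(max_generations):
        Xr, yr = X[remaining], y[remaining]
        # Evolution: random antibodies -> prototypes; keep the best one per class.
        antibodies = rng.random((n_antibodies, X.shape[1]))   # assumes [0, 1] features
        best = {}
        for p in generate_prototypes(antibodies, Xr, yr):
            f = figure_of_merit(p, Xr, yr)
            if f > best.get(p[2], (None, -1))[1]:
                best[p[2]] = (p, f)
        # Partition: shrink each winning radius by the margin, remove covered points.
        for (w, r, label), _ in best.values():
            r = (1.0 - margin) * r
            prototypes_all.append((w, r, label))
            covered = np.linalg.norm(X - w, axis=1) <= r
            remaining &= ~(covered & (y == label))
        if remaining.sum() / n_total < min_fraction:          # 0.001% threshold
            break
    return prototypes_all
```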

3.3 Parallel MSA

PMSA iteratively solves the problem and obtains the optimized classifiers, i.e., prototypes. The prototype regions are the decision boundaries for classification, and the classification process is implemented in parallel. The PMSA algorithm is presented as follows.

Consider a multiclass (\(P\)-class) problem. Each class is denoted as \(C_{p} \left( {p = 1,2, \ldots ,P} \right)\). There are \(O\) objects in the raw data belonging to the \(P\) classes. Each sample point is an n-dimensional feature vector, denoted \(x_{i} = \left( {x_{i1} , \ldots ,x_{in} } \right)\). We present the margin setting algorithm in five phases below: prototype generation, fitness calculation, prototype mutation, partition and parallel classification testing.

Algorithm 1 PMSA

Notation:

Unif (A): uniform distribution on set A.

χ: margin

\({\text{M}}:\) number of iterations during mutation

\({\text{Q}}:\) number of partitions/generations

MM: Maximum Mutation

MQ: Maximum Generation

N: Number of random points

INPUT:

        1) Training set \(T = \left\{ {x_{1} , \ldots ,x_{m} } \right\}\), consisting of m training samples. Each training sample \(x_{k} \left( {1 \le k \le m} \right)\) is an n-dimensional vector (\(n \ge 2\)) with class label \(C_{p} \left( {p = 1, \ldots ,P} \right)\) for a \(P\)-class (\(P > 1\)) classification problem,

        2) Testing set \(S = \left\{ {y_{1} , \ldots ,y_{n} } \right\}\), consisting of n testing samples with unknown labels.

INITIALIZE:

Set MM \(\leftarrow\) 20; MQ \(\leftarrow\) 20; N \(\leftarrow\) 20; M \(\leftarrow\) 0; Q \(\leftarrow\) 0;

REPEAT:

        1. Normalize \(T\) into the [0, 1] space. Randomly select \(N\) n-dimensional points \(\upomega_{i} \left( {1 \le i \le N} \right) \in {\text{Unif }}\left[ {0,1} \right]\).

REPEAT:

Prototype Generation

        2. Build prototypes \(G_{i} = \left( {\omega_{i} ,R_{i} ,C_{p} } \right)\), \(1 \le i \le N\), \(p = 1, 2 , \ldots ,P\). Assign to each center \(\omega_{i}\) of \(G_{i}\) the class label \(C_{p}\) of the training sample \(x_{k}\) that has the minimum Euclidean distance from \(\omega_{i}\):

\(d_{k} = \min \left\| {\omega_{i} - x_{k} } \right\|\). (3)

        3. Compute the radius \(R_{i}\) of \(G_{i}\). It equals the minimum Euclidean distance from \(\omega_{i}\) to a training sample \(x_{j}\) of a class different from \(C_{p}\):

\(R_{i} = \min \left\| {\omega_{i} - x_{j} } \right\|,\quad j \ne k\). (4)

Fitness Calculation

        4. Compute the fitness, measured by the figure of merit of the prototype. The figure of merit of prototype \(G_{i}\) is denoted as \(F_{{G_{i} }}\), i.e., the number of class \(C_{p}\) data samples geometrically inside \(G_{i}\). Suppose class \(C_{p}\) contains a total of h prototypes during the current iteration. The largest figure of merit among all h prototypes is \(LF\):

\(LF = \max \left\{ {F_{{G_{1} }} ,F_{{G_{2} }} , \ldots ,F_{{G_{h} }} } \right\}\). (5)

Prototypes Mutation

        5. Select one center \(\omega_{i}^{'}\) of a prototype \(G_{i}\) per class to mutate into N random points in the neighborhood of \(\omega_{i}^{'}\). Calculate the figure-of-merit proportion \(f_{p}\) of each prototype:

\(f_{p} = \frac{{F_{{G_{i} }} }}{{\sum\nolimits_{i = 1}^{h} {F_{{G_{i} }} } }}\). (6)

Draw \(\upzeta \in {\text{Unif }}\left[ {0,1} \right]\) and select the center \(\omega_{i}^{'}\) whose index i satisfies \(\sum\nolimits_{\xi = 1}^{i - 1} {f_{\xi } } < \upzeta \le \sum\nolimits_{\xi = 1}^{i} {f_{\xi } }\). The selected \(\omega_{i}^{'}\) mutates into another N points:

\(\omega_{i}^{'} + \varepsilon \alpha U\). (7)

where \(\varepsilon\) is a random sign in \(\{ - 1,1\}\), \(\alpha \in {\text{Unif }}\left[ {0,1} \right]\), and U is the maximum perturbation:

\(U = \begin{cases} \omega_{i}^{'} & {\text{if }}\omega_{i}^{'} \le \frac{{\min \left\{ {x_{k} } \right\} + \max \left\{ {x_{k} } \right\}}}{2} \\ \max \left\{ {x_{k} } \right\} - \omega_{i}^{'} & {\text{otherwise}} \end{cases} \quad \left( {1 \le k \le m} \right)\). (8)

        6. \(M \leftarrow M + 1\), enter the next mutation round. The largest figure of merit in the current mutation round M + 1 is denoted as \(LF^{M + 1}\); that of the previous round is \(LF^{M}\).

UNTIL \(M > MM\) \(\parallel\) \(LF^{M + 1} \le LF^{M}\)

Partition

        7. Partition the training set \(T\) by removing all data inside the prototype \(G_{i}^{o}\). \(G_{i}^{o}\) is the optimal prototype with the largest figure of merit \(LF^{M}\) and radius \(R_{i,Q}\), where

\(R_{i,Q} \leftarrow \left( {1 - \chi } \right)R_{i,Q}\). (9)

We tune the margin \(\chi (0 \le \chi < 1)\) to shrink the radius of hyperspheres. A larger margin tends to yield better generalization. The reduced training set is \({T^{\prime}}\).

        8. \(T \leftarrow T^{\prime}\), update the training set \(T\) with the reduced set \(T^{\prime}\).

        9. \(Q \leftarrow Q + 1\), enter the next generation round.

UNTIL \(T^{\prime} = \emptyset\) \(\parallel\) \(Q > MQ\)

Parallel Classification

        10. After the partition completes and all the prototypes have been generated in all \(Q\) generations for all classes \(C_{p} \left( {p = 1,2, \ldots ,P} \right)\), the prototypes for all P classes are

\(G^{\prime} = \bigcup\nolimits_{t = 1}^{Q} {\bigcup\nolimits_{p = 1}^{P} {G_{i}^{o} |_{{t,C_{p} }} } }\). (10)

        11. Dynamically assign n threads/processes for testing, where each thread/process handles one prototype in \(G^{\prime}\). Each thread executes the following: for every point \(y_{i}\) in the test set \(S\), compute the Euclidean distance between \(y_{i}\) and the center \(\omega_{i}\) of its prototype \(G^{\prime}\left( {\omega_{i} ,R_{i} ,C_{p} } \right)\), and if

\(\left\| {y_{i} - \omega_{i} } \right\| \le R_{i}\). (11)

Set Output = 1; the point \(y_{i}\) belongs to class \(C_{p}\), i.e., \(y_{i} = C_{p}\). Otherwise, set Output = 0.

        12. Gather the results from all threads. A logical OR is performed over all outputs to determine the classification result: \(y_{i} = C_{p}\) when Output = 1; otherwise, \(y_{i}\) is not classified.
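The parallel classification phase (steps 10–12) can be sketched as follows, assuming the (center, radius, label) prototype tuples produced by the training sketch earlier. Python's concurrent.futures is used purely for illustration and is not necessarily the environment used in the paper; integer class labels with -1 for "not classified" are our own convention, and a point covered by prototypes of more than one class simply keeps the first label assigned, a simplification of the OR gather.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

def classify_parallel(prototypes, S):
    """Steps 10-12: one task per prototype; per-point results are OR-combined."""
    S = np.asarray(S, dtype=float)
    labels = np.full(len(S), -1)                 # -1 means "not classified"
    assigned = np.zeros(len(S), dtype=bool)

    def test_one(proto):                         # step 11 for a single prototype
        w, r, label = proto
        inside = np.linalg.norm(S - w, axis=1) <= r
        return label, inside                     # Output = 1 where inside

    with ThreadPoolExecutor(max_workers=max(1, len(prototypes))) as pool:
        for label, inside in pool.map(test_one, prototypes):
            newly = inside & ~assigned           # step 12: logical OR gather
            labels[newly] = label
            assigned |= newly
    return labels

# Example usage (names from the earlier training sketch):
# predictions = classify_parallel(msa_train(X_train, y_train), X_test)
```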

The aforementioned mathematical model of the parallel classification of MSA is illustrated in Fig. 2. The training process of MSA completes after several iterations/generations, denoted as Gen 1, Gen 2, …, Gen n, which yields multiple prototypes \(G_{1} , \ldots ,G_{n}\) as its classification decision boundaries. For each generation, MSA splits the training set T into two subsets: one subset contains the training points covered by the prototypes, and the other contains the remaining unclassified class points. MSA usually keeps only one prototype with \(LF\) for each class. If the training points of one class are all classified after several generations, no more prototypes will be generated for that class; however, MSA continues to generate prototypes to cover the remaining unclassified training points. If there are P classes, the maximum number of prototypes that can be yielded in one generation is P.

Consider the binary classification problem (classes 1 and 2) shown in Fig. 2. The training set T is split into subsets \(T_{1}\) and \(T_{2}\) in the first generation (Gen 1). Subset \(T_{1}\) contains the class 1 points covered by \(G_{1}\) and the class 2 points covered by \(G_{2}\); the remaining class points form subset \(T_{2}\). In the next generation (Gen 2), \(T_{2}\) is split into \(T_{3}\) and \(T_{4}\). Subset \(T_{3}\) contains the class 1 points covered by \(G_{3}\) and the class 2 points covered by \(G_{4}\), and \(T_{4}\) contains the remaining unclassified points. Eventually, all class points are classified by the prototypes \(\{ G_{1} , \ldots ,G_{n} \}\). Note that a generation does not necessarily yield one prototype for each class. For example, if during some generation all class 1 points have been classified, the remaining set only contains class 2 points, so the next generation will only yield one prototype for class 2. Therefore, MSA classifies the class points using a disjunction of several prototypes generated over multiple generations.

In addition, as can be seen from Fig. 2, the proposed parallel classification is conducted with thread-level parallelism (TLP). TLP concentrates on splitting a task so that each subtask runs concurrently on the same data across different processing units [46]. MSA classification is conducted sequentially through a series of classifiers, so an efficient parallel practice is to split this sequential task into several subtasks. MSA generates a sequence of prototypes \(\{ G_{1} , \ldots ,G_{n} \}\) as classifiers, each of which exhibits good generalization with a low VC dimension. Unknown or new data can be classified by the prototypes \(\{ G_{1} , \ldots ,G_{n} \}\) in parallel for each class independently. Each classifier is allocated a thread/process to test whether the unknown sample belongs to its class: if the classifier classifies the sample correctly, it outputs 1; otherwise, it outputs 0. The overall classification result is then a simple logical OR of the outputs from all threads. For example, as shown in Fig. 2, we allocate threads 1 to n to the prototypes \(\{ G_{1} , \ldots ,G_{n} \}\). For class 1, \(\{ G_{1} ,G_{3} , \ldots \}\) are the classifiers running on threads 1, 3, … that independently test whether the unknown data can be classified and recognized as class 1. For class 2, \(\{ G_{2} ,G_{4} , \ldots \}\) are the classifiers running on threads 2, 4, … that independently test whether the unknown data can be classified and recognized as class 2.

4 Experiments and results

To test the efficiency of the proposed PMSA, our experiments were conducted in two stages. First, we open a pool of workers for MSA classification applied to image segmentation. Second, threads are spawned at runtime to speed up MSA classification on several machine learning benchmark datasets chosen for the classification tasks. All parallel experiments are conducted on a machine with an Intel Xeon E5520 processor at 2.27 GHz. It is equipped with two processors, each with four cores, and each core supports hyper-threading. In total, we can utilize 16 logical processors to accelerate the performance of MSA classification in the testing phase.

4.1 Image segmentation

MSA is a supervised learning algorithm that is applied here to discriminate the objects in images. The intensities of the color pixels are chosen as the features for classification. Two benchmark color images, Airplane and Peppers, are used in the experiment as shown in Fig. 3a and b. The size of Airplane is 321 × 481 pixels and that of Peppers is 512 × 512 pixels in the R, G, B components, respectively. The image segmentation tasks are as follows. MSA segments the airplane from its background as shown in Fig. 3c. To implement this, we consider the airplane and the background as two classes. The training set is constructed by randomly selecting 20 points from the airplane and another 20 points from pixels that are not airplane. After MSA training, the generated prototypes are used to perform a binary classification on the Airplane image. For the Peppers image, MSA removes the red peppers from the image as shown in Fig. 3d. To perform MSA classification, red peppers form one class; the other peppers, including dark green and light green peppers, form another class. The training set consists of 20 random points from each class, i.e., 40 points in total. Overall, it can be seen that MSA yields good performance for image segmentation. We also compare the results with SVM and ANN.
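For illustration, constructing the pixel-level training set described above (R, G, B intensities as features, 20 random pixels per class) might look like the following sketch; the boolean object mask used for labelling, and the names, are assumptions made for the example.

```python
import numpy as np

def build_pixel_training_set(image, object_mask, n_per_class=20, rng=None):
    """image: H x W x 3 RGB array; object_mask: H x W boolean (True = object pixel)."""
    if rng is None:
        rng = np.random.default_rng()
    pixels = image.reshape(-1, 3).astype(float) / 255.0    # R, G, B features in [0, 1]
    mask = object_mask.ravel()
    obj_idx = rng.choice(np.flatnonzero(mask), size=n_per_class, replace=False)
    bg_idx = rng.choice(np.flatnonzero(~mask), size=n_per_class, replace=False)
    X = np.vstack([pixels[obj_idx], pixels[bg_idx]])
    y = np.array([1] * n_per_class + [0] * n_per_class)    # 1 = object, 0 = background
    return X, y
```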

Fig. 3 PMSA for image segmentation. a Image "Airplane"; b Image "Peppers"; c classification results for Image "Airplane"; d classification results for Image "Peppers"

To accelerate the MSA classification on the image datasets, we utilize a parallel pool, which opens a set of workers that execute the classification statements in parallel as shown in Fig. 4. The number of workers corresponds to the number of processes that can run simultaneously. To test the effectiveness of the parallel workers on a multicore and multiprocessor system, we choose different sizes of the datasets. For simplicity, we resize the Airplane image by factors of 2, 3, 4, …, 10 and perform the image segmentation task. The number of image pixels is the total number of data points during MSA classification. Each data point has three features, i.e., the intensity values in the R, G, B channels, respectively. The execution times of the non-parallel case and of 2, 4, 6, 8, 12 and 16 workers are reported in Table 1. The serial MSA execution time is obtained by running MSA sequentially on a single processor without any parallel scheme. Table 1 shows that the execution time is greatly reduced when the number of workers increases. Specifically, the execution time for 8 workers is reduced more than for 6, 4 and 2 workers. For example, when the image size is 3210 × 4810, the execution time for segmenting the airplane from its background is reduced from 280.002 s in the non-parallel scenario to 58.724 s with 8 workers, and to only 38.419 s with 16 workers.
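The worker-pool scheme can be illustrated with a process pool that splits the test pixels into as many chunks as there are workers and classifies each chunk independently. This is a hedged Python sketch of the idea, not the implementation used to produce Table 1; prototypes are again assumed to be (center, radius, label) tuples as in the earlier sketches.

```python
from multiprocessing import Pool
import numpy as np

def classify_chunk(args):
    """Serial MSA classification of one chunk of pixels against all prototypes."""
    prototypes, chunk = args
    labels = np.full(len(chunk), -1)
    for w, r, label in prototypes:
        inside = np.linalg.norm(chunk - w, axis=1) <= r
        labels[(labels == -1) & inside] = label
    return labels

def classify_with_pool(prototypes, pixels, n_workers=8):
    """Split the pixel set across n_workers processes and gather the results."""
    chunks = np.array_split(pixels, n_workers)
    with Pool(processes=n_workers) as pool:   # call under `if __name__ == "__main__":`
        parts = pool.map(classify_chunk, [(prototypes, c) for c in chunks])
    return np.concatenate(parts)
```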

Fig. 4 PMSA using parallel pool of workers

Table 1 Execution time of PMSA for image segmentation (s)

Next, we investigate the speedup ratio and parallel efficiency obtained by the parallel pool. The speedup ratio R is the relative performance measurement between a single processor system and a multiprocessor system. It is calculated as follows:

$$R\left( m \right) = T_{s} /T_{m}$$
(12)

where \(T_{s}\) is the execution time on a single processor system, \(T_{m}\) is the execution time on a multiprocessor system with m processors. The parallel efficiency E can be estimated as the ratio of speedup to the number of processors. It measures the fraction of time for which a processor is being used on the computation. Efficiency is calculated as:

$$E = \frac{{T_{s} }}{{mT_{m} }}$$
(13)
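As a worked example using the 3210 × 4810 Airplane timings reported above (280.002 s serial, 58.724 s with 8 workers, 38.419 s with 16 workers):

$$R\left( 8 \right) = \frac{280.002}{58.724} \approx 4.77,\quad E = \frac{4.77}{8} \approx 0.60$$

$$R\left( {16} \right) = \frac{280.002}{38.419} \approx 7.29,\quad E = \frac{7.29}{16} \approx 0.46$$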

The speedup ratios are reported in Table 2 and Fig. 5. The speedup increases as the number of workers increases; it is higher for 16 workers than for 12, 8, 4 and 2 workers for all image sizes. The average speedup ratio is 7.06 for 16 workers and 4.54 for 8 workers, compared to 1.93 for 2 workers. We also note that the primary increase in speedup ratio happens before the image size grows to a certain amount. For example, the speedup ratio increases from 3.11 to 4.26 as the image size rises to \(0.4 \times 10^{7}\) pixels for 6 workers; after that, it increases slowly, reaching only 4.31 when the number of image pixels reaches \(4.6 \times 10^{7}\). Similarly, for the 16-worker case, the speedup ratio rises quickly from 5.93 to 7.22 when we enlarge the image to 963 × 1443 pixels in its R, G, B channels; after that, there is only a 0.06 gain in speedup ratio when the image is resized to 3210 × 4810 pixels.

Table 2 Performance of PMSA for image segmentation (s)
Fig. 5 Speedup ratio of PMSA for image segmentation

The parallel efficiency E for different numbers of workers is presented in Fig. 6. E would reach 1 if the parallel algorithm scaled linearly and achieved ideal performance. When PMSA is tested with 2 workers, E reaches 1 when the image contains \(1.1 \times 10^{7}\) pixels. However, E drops as the number of workers increases. Specifically, the average value of E falls to 0.81 for 4 workers, 0.68 for 6 workers, 0.56 for 8 workers, 0.49 for 12 workers and 0.44 for 16 workers. This comes from the fact that PMSA employs thread-level data parallelism which distributes the generated prototypes across different workers, and this is less efficient than task parallelism.

Fig. 6 Parallel efficiency of PMSA for image segmentation

4.2 Benchmark datasets

We also implement PMSA classification using threads. On a multicore and multiprocessor system, each independent processor owns a specific amount of physical resources. To take full advantage of thread-level parallelism, a practical approach is to set the number of threads equal to the number of prototypes generated. To test the PMSA performance, several machine learning benchmark datasets from the UCI machine learning repository are chosen for the experiment [47]. The details of the datasets are shown in Table 3. There are five datasets: Pima Indians Diabetes, Wisconsin Breast Cancer, Australian Credit Approval, Wine and Svmguide2. One-third of each dataset is randomly chosen for training and the remaining data for testing. We repeat the experiments ten times and report the average results.

Table 3 Benchmark Datasets

To validate the classification performance, we compare PMSA with two other state-of-the-art algorithms, SVM and ANN. There are two considerations. First, both PMSA and SVM are margin-based classifiers: their performance is affected by the margin, which is the distance from each example to the decision boundary. Second, both SVM and ANN are popular supervised classification algorithms that have been applied to the UCI benchmark datasets. The default parameters are chosen for MSA, and the margin χ of MSA is varied from 0 to 0.5 for the classification results. For SVM we choose a radial basis function (RBF) kernel with regularization parameter C = 1; the kernel parameter γ is set to the reciprocal of the number of features in the dataset. The ANN is a two-layer feed-forward backpropagation neural network with 10 neurons in its hidden layer. The classification performance of PMSA is reported in Table 4. It can be seen that PMSA yields better performance than SVM and ANN on all datasets.
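The SVM and ANN baseline settings described above could be reproduced roughly as in the following scikit-learn sketch; the paper does not state which toolkit was used, so this is an assumption for illustration (gamma='auto' corresponds to γ = 1/number of features).

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# SVM baseline: RBF kernel, C = 1, gamma = 1 / n_features (scikit-learn's 'auto').
svm_baseline = SVC(kernel="rbf", C=1.0, gamma="auto")

# ANN baseline: two-layer feed-forward backpropagation network, 10 hidden neurons.
ann_baseline = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)

# Evaluation protocol from the text: one-third of each dataset for training,
# the rest for testing, repeated ten times with the average accuracy reported.
```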

Table 4 Performance of PMSA on benchmark datasets

The execution time and speedup of PMSA are reported in Table 4. This approach generates a number of threads equal to the number of prototypes; therefore, the number of threads is dynamically determined after the MSA training process. This thread-level parallelism achieves its best speedup ratio of 65.69 on the Pima Indians Diabetes dataset. When a large number of prototypes is generated, the speedup capacity is affected because the communication overhead increases when splitting the classification task across a large number of threads. Overall, our algorithm shows good speedup results.

5 Conclusion

In this work, we proposed a novel parallel implementation of a supervised learning algorithm, the margin setting algorithm. This is the first work to design a parallel mechanism to speed up the classification phase of MSA. Our parallel mechanism uses a parallel pool with a specified number of workers to run MSA classification in parallel. In addition, threads are dynamically spawned at runtime to spread the classification work across a multicore and multiprocessor system. Extensive experiments have been conducted to analyze the execution time and speedup of PMSA. We applied PMSA to image data to perform image segmentation, and then to classification on benchmark datasets from the UCI machine learning repository. We varied the number of workers in the parallel implementation for image segmentation; the results show that the speedup ratio tends to be higher once the image grows to a certain size. To validate the correctness of the proposed algorithm, the classification performance is also compared with SVM and ANN on the benchmark data. The experimental results show that the proposed PMSA yields good classification accuracy and greatly reduces the computation time in the testing phase. In the future, we will work on a parallel mechanism for the training process of PMSA, and larger-scale datasets will be used to test the robustness of the proposed parallel learning algorithm. One limitation of this work is that the thread-level parallelism that PMSA employs relies only on multicore and multiprocessor systems. We will deploy PMSA on GPUs and compare it with deep learning algorithms, such as convolutional neural networks and recurrent neural networks, for data-intensive applications.