1 Introduction

Data science has attracted wide attention in recent years. One of its most important aspects is data analytics, which aims at the automatic extraction of knowledge from massive data. Traditional model-based methods mainly fit the collected data to predefined mathematical models. However, these models may fail when confronted with problem characteristics such as large volume, dynamic changes, and noise. As these characteristics become more pronounced, traditional data processing approaches become inefficient or even ineffective. New and efficient methods are therefore needed for data analysis tasks [11], and the mainstream is shifting from the traditional model-driven paradigm to the data-driven paradigm. Many applications in data science can be formulated as optimization problems, which requires algorithms that can search the solution space and find the optima [9]. Traditional model-based methods require the problem to be written as a continuous and differentiable function. With large amounts of data and complex tasks, this is often difficult to achieve.

Population-based meta-heuristic algorithms are good at solving problems that traditional methods cannot handle or, at least, find challenging [10]. Swarm Intelligence (SI), a family of meta-heuristic algorithms, is attracting more and more attention and has been shown to be effective in handling the large-scale, dynamic, multi-objective problems in data analytics. As shown in Fig. 1, there are mainly two categories of approaches that utilize SI algorithms in data science [41]. The first uses swarm intelligence to tune or optimize the parameters of data mining techniques, which may include machine learning, statistics, and others. The second applies SI algorithms directly to data organization, i.e., moving data instances in a low-dimensional feature space to reach a suitable clustering or to reduce the dimensionality of the data.

Fig. 1. Two approaches of Swarm Intelligence for data science

Swarm Intelligence is a group of nature-inspired search and optimization techniques that study collective intelligence in a population of low-complexity individuals [32]. SI algorithms are inspired by the interactions among individuals within a group or several groups, which involve patterns of competition and cooperation [16]. SI algorithms use a population of individuals to search a problem domain. Each individual represents a potential solution to the problem being optimized. During a guided search process, SI algorithms maintain and successively improve a collection of potential solutions until a predefined stopping condition is met, e.g., the result is acceptable or the maximum number of iterations is reached [26].

To gain better insight into the use of these methods in data science and to provide a reference for future research, this paper focuses on data science related work from the past few years that utilizes swarm intelligence. After introducing the mainstream swarm intelligence algorithms and their common characteristics, it reviews both the theoretical and the real-world applications in the literature that apply swarm intelligence to domains related to data analytics. Based on a summary of the existing work, this paper also analyzes the opportunities and challenges in this field, in an attempt to shed some light on designing more effective algorithms for data science problems in real-world applications. The remainder of the paper is organized as follows. Section 2 briefly reviews the development of swarm intelligence and some major algorithms in this field. Section 3 introduces some theoretical applications in the literature that adopt swarm intelligence algorithms in data science. Section 4 gives a set of real-world applications. The opportunities and challenges of applying SI algorithms to data science are discussed in Sect. 5, followed by the conclusions in Sect. 6.

2 Swarm Intelligence Algorithms

2.1 General Procedure of SI Algorithms

SI algorithms are a set of artificial intelligence techniques inspired by biological swarm behaviors at both macro and micro levels. They generally follow self-organizing and decentralized paradigms and are characterized by scalability, adaptability, robustness, and individual simplicity. In SI algorithms, a population of individuals, each representing a potential candidate solution, cooperate among themselves and statistically become better and better over iterations, eventually finding good enough solutions [45]. In recent years, a large number of swarm intelligence methods have been proposed. These methods have different sources of inspiration and various operations. In general, these operations try to balance the convergence and diversity of the search process, i.e., the balance between exploitation and exploration.

The general procedure of swarm intelligence algorithms is summarized in Algorithm 1. The procedure starts with the random initialization of a population of individuals in the solution space, followed by repeated evaluation and new solution generation. After a certain number of iterations, swarm intelligence algorithms can eventually find acceptable solutions.

[Algorithm 1: general procedure of swarm intelligence algorithms]
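The general procedure described above (random initialization, evaluation, new solution generation, and a stopping condition) can be illustrated with a minimal Python sketch. It is not a reproduction of Algorithm 1: the function name `swarm_search`, the update rule (attraction toward the best solution found so far plus Gaussian noise), the iteration budget, and the toy `sphere` objective are all assumptions made for illustration, and minimization is assumed.

```python
import random


def swarm_search(fitness, dim, bounds, pop_size=20, iterations=100, seed=0):
    """Generic population-based search loop (minimization).

    The update rule used here is a placeholder chosen for illustration;
    real SI algorithms differ mainly in this step.
    """
    rng = random.Random(seed)
    low, high = bounds

    # 1. Random initialization of the population in the solution space.
    population = [[rng.uniform(low, high) for _ in range(dim)]
                  for _ in range(pop_size)]

    # 2. Evaluation of the initial individuals.
    scores = [fitness(x) for x in population]
    best_idx = min(range(pop_size), key=scores.__getitem__)
    best, best_score = population[best_idx][:], scores[best_idx]
    history = [best_score]

    # 3. Iterate until the stopping condition (iteration budget) is met.
    for _ in range(iterations):
        for i, x in enumerate(population):
            # New solution generation: attraction to the best solution
            # (convergence) plus a random perturbation (diversity).
            candidate = [
                min(high, max(low, xi + rng.random() * (bi - xi)
                              + rng.gauss(0.0, 0.1)))
                for xi, bi in zip(x, best)
            ]
            score = fitness(candidate)
            # Greedy replacement: an individual never gets worse.
            if score < scores[i]:
                population[i], scores[i] = candidate, score
            if score < best_score:
                best, best_score = candidate[:], score
        history.append(best_score)

    return best, best_score, history


def sphere(x):
    """Toy objective with its minimum (0) at the origin."""
    return sum(v * v for v in x)


best, best_score, history = swarm_search(sphere, dim=3, bounds=(-5.0, 5.0))
```

Here `history` records the best fitness after each iteration. Because the best solution found so far is always retained, the recorded values never increase.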

As a general principle, the expected fitness value of a solution should improve as more computational resources in time and/or space are given. More desirably, the quality of the solution should improve monotonically over iterations, i.e., the fitness value of the solution at time \(t+1\) should be no worse than the fitness at time \(t\).
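This monotonicity requirement can be written compactly. The notation is an assumption introduced here and does not appear in the original: a minimization problem with fitness function \(f\), population \(P_s\) at iteration \(s\), and \(x^{*}_{t}\) the best solution found up to iteration \(t\).

\[
x^{*}_{t} = \arg\min_{x \in \bigcup_{s \le t} P_s} f(x), \qquad f\left(x^{*}_{t+1}\right) \le f\left(x^{*}_{t}\right) \quad \text{for all } t \ge 0 .
\]

The inequality holds by construction whenever the best solution found so far is retained, since the set over which the minimum is taken can only grow with \(t\).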

2.2 Developments

In the past 30 years, a large number of swarm intelligence algorithms have emerged. They draw inspiration from different phenomena and design corresponding new solution generation operations that balance the convergence and diversity of the swarm. As shown in Table 1, the sources of inspiration range from human society (BSO, TLBO), animals (BA, GWO, MA, LOA), insects and birds (PSO, ACO, ABC, FA, CS, GSO), and bacteria (BFO) to human-made phenomena (FWA).

With the increasing prominence of NP-hard problems, it is almost impossible to find optimal solutions in real time. The number of potential solutions to these problems is often extremely large or even infinite. In this case, it is essential to find a feasible solution within the time limit. SI algorithms have proved practical for solving nonlinear problems in almost all fields of science, engineering, and industry, from data mining to optimization, computational intelligence, business planning, and bioinformatics, as well as industrial applications. In the era of big data, the scientific and engineering problems mentioned above are all, more or less, related to data issues. Swarm intelligence has been applied successfully in many data-related applications. Meanwhile, with the increasing dynamics, noise, and complexity of tasks, there are still many opportunities, along with challenges, in applying swarm intelligence to data science.

3 Theoretical Applications

For decades, data mining has been a hot academic topic in the fields of computer science and statistics. As mentioned, SI algorithms are mainly used in data mining tasks in two forms: parameter tuning or data organizing. The main applications include dimensionality reduction, classification, and clustering, as well as automated machine learning.

Table 1. Some Swarm Intelligence algorithms with source of inspiration

3.1 Dimensionality Reduction

Dimensionality reduction is the process of reducing the number of random variables or attributes in the dataset under consideration. It plays a vital role in data preprocessing for data mining. There are generally two operations for dimensionality reduction: feature selection and feature extraction. Feature selection is the process of selecting an optimal subset of relevant features for use in model construction, while feature extraction is the process of projecting the original data from a high-dimensional space onto a smaller space. The accuracy of a model can be enhanced by using wisely selected or projected features rather than all available features in a large amount of data.

Since feature selection is an NP-hard combinatorial optimization problem, SI algorithms have been found to be a promising option for solving this kind of problem. Many related works have emerged recently. The following are some examples: Gu et al. proposed a feature selection method for high-dimensional classification based on a very recent PSO variant known as Competitive Swarm Optimizer (CSO) [23]. Hang et al. designed an FA-based method for feature selection that is able to prevent premature convergence [72]. Pourpanah et al. combined the Fuzzy ARTMAP (FAM) model with the BSO algorithm for feature selection tasks [47]. A more detailed survey of SI-powered feature selection can be found in [44].
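To make the idea of SI-based feature selection concrete, the following minimal sketch encodes a feature subset as a binary mask and searches over masks with a binary PSO that uses a sigmoid transfer function. It does not reproduce any of the cited methods. The function names, the parameter values, the per-feature relevance scores in `RELEVANCE`, and the per-feature `PENALTY` are all hypothetical; in practice the fitness would typically be the validation accuracy of a classifier trained on the selected features.

```python
import math
import random


def binary_pso(fitness, n_features, swarm_size=15, iterations=60,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Binary PSO (sigmoid transfer function) that maximizes fitness(mask)."""
    rng = random.Random(seed)
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(swarm_size)]
    vel = [[0.0] * n_features for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(swarm_size), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(n_features):
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Clamp the velocity so the sigmoid does not saturate.
                vel[i][d] = max(-4.0, min(4.0, vel[i][d]))
                # The velocity sets the probability that the bit becomes 1.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            fit = fitness(pos[i])
            if fit > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit > gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return gbest, gbest_fit


# Hypothetical relevance score of each feature (illustrative values only).
RELEVANCE = [0.9, 0.05, 0.8, 0.1, 0.02, 0.7]
PENALTY = 0.3  # cost charged for every selected feature


def toy_fitness(mask):
    """Reward relevant features and penalize the size of the subset."""
    return sum(r - PENALTY for r, bit in zip(RELEVANCE, mask) if bit)


mask, score = binary_pso(toy_fitness, n_features=len(RELEVANCE))
```

With this toy fitness, the best possible subset contains exactly the features whose relevance exceeds the penalty. The search is stochastic, so the returned `mask` is the best subset found within the budget rather than a guaranteed optimum.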

3.2 Classification and Clustering

Classification and clustering are essential aspects of data science. They have been studied widely in the domains of statistics, neural networks, machine learning, and knowledge-based systems over the decades. In general, classification predicts the target class by analyzing a training dataset, while clustering groups similar targets according to a similarity criterion.

The SI applications in these two areas are mainly related to parameter tuning. For classification, the literature contains works that combine SI algorithms with regression models [53], support vector machines [7, 14, 60], k-nearest neighbor classifiers [58, 65], decision trees [3, 35], and neural networks [30, 62]. For clustering, some recent works utilize SI with k-means [28, 59, 61], c-means [21], and other linear or non-linear clustering algorithms [19, 27].
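As an illustration of SI-based parameter tuning for clustering, the sketch below uses a standard global-best PSO to search for the centroid positions that minimize the within-cluster sum of squared errors, the objective that k-means also minimizes. It is a minimal sketch and not the method of any cited work. The one-dimensional toy data, the function names `sse` and `pso_centroids`, and the PSO coefficients are assumptions chosen for illustration.

```python
import random


def sse(centroids, data):
    """Within-cluster sum of squared errors for 1-D data."""
    return sum(min((x - c) ** 2 for c in centroids) for x in data)


def pso_centroids(data, k, swarm_size=20, iterations=80,
                  w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best PSO; each particle encodes k centroid positions."""
    rng = random.Random(seed)
    low, high = min(data), max(data)
    pos = [[rng.uniform(low, high) for _ in range(k)]
           for _ in range(swarm_size)]
    vel = [[0.0] * k for _ in range(swarm_size)]
    pbest = [p[:] for p in pos]
    pbest_fit = [sse(p, data) for p in pos]
    g = min(range(swarm_size), key=pbest_fit.__getitem__)
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(iterations):
        for i in range(swarm_size):
            for d in range(k):
                # Inertia + cognitive (personal best) + social (global best).
                vel[i][d] = (w * vel[i][d]
                             + c1 * rng.random() * (pbest[i][d] - pos[i][d])
                             + c2 * rng.random() * (gbest[d] - pos[i][d]))
                # Keep centroids inside the range spanned by the data.
                pos[i][d] = min(high, max(low, pos[i][d] + vel[i][d]))
            fit = sse(pos[i], data)
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], fit
                if fit < gbest_fit:
                    gbest, gbest_fit = pos[i][:], fit
    return sorted(gbest), gbest_fit


# Toy data with two well-separated groups (illustrative values only).
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.9, 5.1]
centroids, error = pso_centroids(data, k=2)
```

The same loop applies to other tuning tasks by swapping the objective, for example replacing `sse` with the validation error of a classifier as a function of its hyper-parameters.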

3.3 Automated Machine Learning

In the past decade, the research and application of machine learning have seen explosive growth. In particular, deep neural networks (DNNs) [37] have made great progress in many application fields. However, the performance of many machine learning methods is very sensitive to a large number of design decisions. In particular, designing the architecture of a DNN is very complex and relies heavily on experts' prior knowledge. To address this problem, many SI-based methods have been proposed to design DNNs automatically [54].

Wang et al. [64] propose an efficient particle swarm optimisation (EPSOCNN) approach to automatically design the architectures of convolutional neural networks (CNNs). Specifically, in order to reduce the computational cost, EPSOCNN reduces the hyperparameter space of CNNs to a single block and evaluates the candidate CNNs on a small subset of the training set. Wang et al. [63] propose multi-objective evolutionary CNNs (MOCNN) to search for the non-dominated CNN architectures on the Pareto front in terms of two objectives: classification accuracy and computational cost. The method introduces a novel encoding strategy to encode CNNs and utilizes a multi-objective particle swarm optimization (OMOPSO) to optimize the candidate CNN architectures.
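The notion of a non-dominated architecture can be made concrete with a short sketch. It only illustrates Pareto dominance for two objectives and is not the MOCNN or OMOPSO procedure. Accuracy is expressed here as classification error so that both objectives are minimized, and the candidate names, error values, and cost values are hypothetical.

```python
from typing import List, Tuple

# (name, classification error, computational cost); both are minimized.
Candidate = Tuple[str, float, float]


def dominates(a: Candidate, b: Candidate) -> bool:
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a[1] <= b[1] and a[2] <= b[2]
    strictly_better = a[1] < b[1] or a[2] < b[2]
    return no_worse and strictly_better


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical candidate architectures (illustrative values only).
candidates = [
    ("net_a", 0.08, 120.0),
    ("net_b", 0.10, 60.0),
    ("net_c", 0.12, 90.0),   # dominated by net_b
    ("net_d", 0.20, 20.0),
    ("net_e", 0.08, 150.0),  # dominated by net_a
]
front = pareto_front(candidates)
```

The resulting `front` contains `net_a`, `net_b`, and `net_d`: each of them trades error against cost in a way that no other candidate improves on in both objectives at once.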

4 Real-World Applications

Social Community Network Analysis. Social network analysis plays an important role in many real-world problems. One example is community detection [20, 46], which aims to mine the implicit community structures in networks. Recently, many SI methods have shown promising potential in community detection problems. Lyu et al. [40] propose a novel local community detection method called evolutionary-based local community detection (ECLD), which utilizes the entire obtained information and the PSO algorithm to find the local community structures in complex networks. Sun et al. [55] introduce a Parallel Self-organizing Overlapping Community Detection (PSOCD) method, inspired by swarm intelligence systems, to detect overlapping communities in large-scale dynamic complex networks. It treats a complex network as a decentralized, self-organized, and self-evolving system in which the community structures can be found iteratively. For other relevant works, refer to [6, 22, 25].

Scheduling and Routing. Scheduling and routing problems are very common in the real world, wherever there are resources to manage. For example, the PSO algorithm has been used in power systems for demand response management [17] and consumer demand management [38].

Internet of Things. The Internet of Things (IoT) is another real-world application area in which SI algorithms have been widely used [5]. For example, in IoT-based systems, SI algorithms have been used for task scheduling [4]. In IoT-based smart cities, SI algorithms have been used because their population-based nature makes the system flexible and scalable [70].

Bioinformatics. Bioinformatics is an interdisciplinary field that develops algorithms and software tools for processing biological data samples. Various biological problems can be represented as optimization problems and solved by SI algorithms. For example, the protein design problem can be represented as a combinatorial optimization problem [24]. More information is summarized in [56].

Resource Allocation. Resource allocation is the process of allocating and managing assets in an optimized way to support the strategic objectives of an organization. SI algorithms have been used in many related applications, such as cloud service resource allocation [8] and wireless network planning [2].

Others. Apart from the real-world applications discussed above, SI algorithms have also been applied to many other data-related real-world systems. Examples include a wind farm decision system [74] that reduces the cost of wind farms, autonomous DDoS attack detection [33], anomaly intrusion detection [18], image analysis [34, 51], facial recognition [43], medical image segmentation [52], and natural language processing [1, 39].

5 Opportunities and Challenges

Unified Swarm Intelligence. Are there any universal rules behind this growing field? What are the fundamental components that a good swarm intelligence algorithm should have? Dozens of SI algorithms have been proposed so far, and they share similar operations for solving problems. Is there a unified framework for SI algorithms that can develop its learning capacity to better solve optimization problems that are unknown at algorithm design or implementation time [50]? Correctly identifying and extracting the fundamental components of SI algorithms, so that new algorithms can be formed automatically according to the characteristics of the problem at hand, is a challenge. Some efforts have been made toward this goal [12, 50, 71], but more work is needed to make it a reality.

Handling High Dimensional and Dynamical Data. The “curse of dimensionality” arises in high-dimensional data mining problems as the dimension of the data space increases. For example, nearest neighbor approaches are instrumental in categorization. However, for high-dimensional data, the similarity search problem becomes difficult to solve because the computational complexity grows with the dimensionality. Furthermore, when the problems are in non-stationary or uncertain environments, i.e., the conditions of the data change dynamically over time, additional measures must be taken so that swarm intelligence algorithms can still solve dynamic problems satisfactorily.
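Besides the growth in computational cost, a commonly cited symptom of the curse of dimensionality in nearest neighbor search is that distances become less discriminative as the dimension grows. The sketch below illustrates this effect under assumptions that are not part of the text above: points drawn uniformly from the unit hypercube, Euclidean distance, and a hypothetical helper `relative_contrast` that measures the gap between the farthest and nearest neighbor relative to the nearest one.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(d_max - d_min) / d_min from a random query to random points."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# The contrast shrinks as the dimension grows, so the nearest and the
# farthest neighbor become harder to tell apart.
contrast = {dim: relative_contrast(dim) for dim in (2, 10, 100, 1000)}
```

A small contrast means that a similarity search has little signal to work with, which is one reason why dimensionality reduction is often applied before nearest neighbor methods.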

SI Based AutoML. As mentioned before, swarm intelligence algorithms can be used not only for the automatic optimization of the hyper-parameters of a machine learning model, but also for the automated design of the model structure. With the development of AutoML, swarm intelligence algorithms have great potential in this field. However, in addition to hyper-parameter optimization, the representation of the learning model and the mechanism of model evaluation also pose challenges.

6 Conclusion

This paper has reviewed related works that apply swarm intelligence algorithms in data science. The fundamentals and developments of swarm intelligence were briefly summarized. Theoretical applications, such as SI-based dimensionality reduction, classification, clustering, and automated machine learning, were introduced. A short review of real-world applications, including social community network analysis, scheduling and routing, the Internet of Things, bioinformatics, and resource allocation, was also given, followed by the opportunities and challenges in this field. Generally speaking, swarm intelligence algorithms have been widely used in data science in the past decades, in both theoretical and practical applications. Moreover, with the development of artificial intelligence technology and data science, swarm intelligence algorithms have great opportunities in different aspects of data science. Nevertheless, they also face a series of challenges that need more in-depth research.