1 Introduction

The application of the IoT concept in different economic sectors is becoming a key factor for business improvement. According to [1], 92% of companies believe the IoT concept will be important for their business by the end of 2020. At the same time, companies consider that security, privacy, cost, and regulatory issues pose the greatest challenges in implementing and applying the IoT concept. Research [2] conducted in 1,430 companies (small, medium, and large) points to a number of advantages seen by the vast majority (95%) of adopters of the IoT concept. More than half (53%) confirm significant benefits of implementing the IoT concept in business, while 79% of those surveyed believe that by applying the IoT concept, they achieve positive results in different areas of work that they would not otherwise be able to achieve.

According to Gartner, the largest representation and application of the IoT concept, according to the number of IoT devices used until 2017, was in the area of smart building environments. After 2017, the smart home concept is the environment that brings together the largest number of IoT devices [3]. More precise insight into the representation of IoT devices by individual areas of application is provided by the research of the company IHS Markit [4]. It can be seen that the smart home concept has the largest number of installed IoT devices (822.6 million) compared to other areas of application. The annual growth rate (prediction by 2021) is 19.6%, which makes the smart home concept [5,6,7], along with the industrial IoT concept (CAGR 23.4%), the fastest growing area of application of the IoT concept. The classification of IoT devices is essential for several reasons. Successfully identifying IoT devices in a particular scenario and environment can be vital in identifying illegitimate devices, unauthorized devices, unwanted devices, devices that do not behave as expected, and have the potential to cause a security incident within the system. Besides, useful device classification and identification of new and hitherto unseen devices can enable more efficient traffic management as well as network capacity required in the environments in which IoT devices exist [8, 9].

The rest of this paper is organized as follows: the second chapter deals with the current research, its shortcomings, and the positioning of our research with respect to previous findings. In the third chapter, the data collection approach is explained, which includes the establishment of the laboratory environment, raw network traffic collection, data preprocessing, and device class definition as key activities for further classification model development. The fourth chapter explains the classification model development as well as the ensemble supervised machine learning method used for that purpose. In the fifth chapter, the results of the developed model are analyzed and discussed. In the final chapter, the authors give their conclusion and further research directions.

2 Related work

According to the forecasts presented in [10], by the end of 2020, approximately 31 billion IoT devices will be in use globally, and by 2025 there will be 75 billion IoT devices. At the same time, 41%, i.e., 12.86 billion IoT devices, will be installed within a smart home (SH) [11]. The limitations of IoT devices in general, and thus of SHIoT (smart home IoT) devices, are described in the research [12]. They include hardware limitations and requirements for high autonomy and low production cost, which reduce the possibility of implementing advanced protection methods and increase the risk of many threats shown in [13]. The traffic generated by SHIoT devices, or MTC (Machine Type Communication) traffic, differs from the traffic generated by conventional devices, HTC (Human Type Communication) traffic, as shown by research [14]. Specific features of MTC traffic have been used to solve several problems in the communication network. Research [15] looks at the impact of MTC traffic on QoS during integration with HTC traffic in the LTE (Long-Term Evolution) communications network. Identification and classification of IoT devices in smart cities [5, 6], campuses, and smart environments using MTC traffic features have been presented by research [16] and [17]. Research [18] seeks to identify new requirements and challenges in the design and management of a mobile communication network imposed by the generation of MTC traffic.

SHIoT traffic can be observed through network activity features such as traffic volume (sum of the total traffic received and total traffic transferred), traffic flow duration (time between the first and last packet in a traffic flow), and device inactivity time (the period in which the device has no active traffic flow). Network behavioral modeling is an often-used approach to address communication network challenges such as detecting illegitimate events based on traffic generated by devices on the network. In general, current approaches seek to identify traffic characteristics at the network packet level and the traffic flow level [19]. The analyzed research shows more frequent consideration and use of traffic features at the level of traffic flows than at the level of network packets. Likewise, the mentioned studies use the presented features to identify individual devices or to classify them based on the semantic characteristics of the observed devices [20]. Authors in [21] developed a tool for automatic extraction of packet-level signatures of IoT devices from the network traffic. They extracted packet-level features of 18 smart home devices, which were used as a basis for the development of a classification model with a recall of 97%. Although the research reports high results for the developed classification model, it remains unclear how the model will behave on previously unseen devices. It would have to be trained again for every new device that comes on the market. Such an approach is not suitable considering the nature of the IoT concept. In research [22], the authors present the LSIF (Locality-Sensitive IoT Fingerprinting) approach for identification of IoT devices. The presented approach does not require feature extraction from the traffic. Although this approach has its benefits, it is lacking in performance such as precision (93%) and recall (90%). Also, this approach is focused on the identification of individual devices, which raises the already mentioned shortcomings. The research presented in [23] used an artificial neural network for the classification. The authors developed a model that can identify nine known devices with approximately 99% accuracy. In research [24], the primary goal is to develop a model for detecting network anomalies caused by IoT devices. For that purpose, the authors first developed a classification model for profiling normal behavior. They used the J48 machine learning method for model development, achieving precision, recall, and F-measure of 96.2%, 96.8%, and 96.9%, respectively. The developed model is valid only for the nine devices used in the research because profiling was done for each individual device. The shortcomings of this research are the number of devices used and the insufficient generalization of the problem, where for every new device a new model needs to be learned, trained, and validated. In research [25], the authors use decision trees and deep learning based methods for identification, classification, and anomaly detection of IoT devices. This research tries to use a more general approach to the classification of network traffic by using three classes of traffic (actuation, sensing, video streaming). Such an approach is suitable for recognizing normal (expected) behavior of network traffic generated by IoT devices, and it is useful in resolving problems such as anomaly detection. The drawbacks of this research are the number of devices used (7), the amount of network traffic (5 days), and the results of the developed classification model (93.5%).

According to the above, the possibility of developing an efficient classification model of IoT devices based on the characteristics of the generated traffic flows is set as a hypothesis of the current research. The research aims to develop a classification model based on an ensemble supervised machine learning method that will be able to assign IoT devices to predefined classes based on the values of their traffic flows. Current research in this domain tries to identify the individual device. Such an approach is not suitable in a fast-evolving, heterogeneous, and dynamic environment such as IoT, where the number of new devices is rising exponentially. For this reason, the approach in this research brings novelty and gives the opportunity to recognize the class of a new and unseen IoT device based on its network traffic behavior. Our approach aims to generalize the identified problem and develop a solution adjusted to the nature of the IoT environment. Accordingly, IoT devices need not be observed individually when solving problems such as certain types of IoT device management, detecting network anomalies generated by IoT devices, or identifying unauthorized IoT devices in the network. For that purpose, a classification model is needed that is able to assign previously unseen devices to a generic behavior profile. Compared to previous work, this research contributes a larger set of observed devices, a longer period and larger amount of collected data, an innovative approach to the classification of IoT devices, and better performance results of the developed classification model.

3 Proposed approach

This research has been conducted in three phases with the activities shown in Fig. 1. In the first phase, the research problem was identified and the laboratory environment was established. The dataset was formed from primary and secondary data sources. In the second research phase, the Cu index was extracted for each device and IoT device classes were defined. The collected data were preprocessed, which included feature engineering and data normalization (dealing with null and categorical values). In the final, third phase, the dataset was balanced, and the classification model was developed. For the model development, the ensemble supervised machine learning method was used. The developed model performance was measured using standard validation measures for classification models such as confusion matrix, accuracy, kappa coefficient, TPR (True Positive Ratio), FPR (False Positive Ratio), F-measure, ROC (Receiver Operating Characteristics) curve, and others.
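Several of the validation measures listed above follow directly from the confusion matrix. The sketch below is only an illustration under stated assumptions: the function name `classification_metrics` and the row/column convention (rows are true classes, columns are predicted classes) are chosen here and are not taken from the paper. It shows how accuracy, the kappa coefficient, and per-class TPR and FPR could be derived.

```python
def classification_metrics(cm):
    """Derive basic validation measures from a square confusion matrix.

    Assumed convention (not from the paper): cm[i][j] is the number of
    samples whose true class is i and whose predicted class is j.
    Every class is assumed to have at least one sample.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]                       # true counts
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]  # predicted counts

    # Observed agreement (accuracy) and agreement expected by chance.
    accuracy = sum(cm[k][k] for k in range(n)) / total
    expected = sum(row_sums[k] * col_sums[k] for k in range(n)) / total ** 2
    kappa = (accuracy - expected) / (1 - expected)

    tpr, fpr = [], []
    for k in range(n):
        tp = cm[k][k]
        fn = row_sums[k] - tp
        fp = col_sums[k] - tp
        tn = total - tp - fn - fp
        tpr.append(tp / (tp + fn))   # one-vs-rest true positive ratio
        fpr.append(fp / (fp + tn))   # one-vs-rest false positive ratio
    return {"accuracy": accuracy, "kappa": kappa, "tpr": tpr, "fpr": fpr}
```

The per-class values are computed one-vs-rest, which is one common way to extend TPR and FPR to a multi-class problem such as the four device classes used later in the paper.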

Fig. 1 Research phases and activities

One of the crucial research activities was primary data collection for which the laboratory environment with SHIoT devices was established. SHIoT devices are supplied by authorized distributors and representatives of each device manufacturer. They are connected to the communication network as recommended by the manufacturer, and in no way are the devices modified at the software and hardware level. Therefore, it is assumed that the devices that are used to collect legitimate traffic in this research work are as designed and are in no way previously compromised in terms of security.

The network topology, as well as the characteristics of the smart home environment, can be seen in Fig. 2. The devices are connected, directly or indirectly, by Wi-Fi communication technology to the Fortinet AP 221C wireless access point, except for Philips Hue, which communicates with the rest of the local network via the Ethernet (IEEE 802.3) communication standard. Some devices, such as the Blink smart camera, Netatmo smart thermostat, and Philips Hue smart lighting fixtures, use an IoT hub with which they communicate wirelessly, but with ZigBee technology. The reason is energy efficiency, since the end devices use a battery as the power source, which gives them advantages in terms of mobility and independence from mains electricity. The IoT hub is connected to the wireless access point via Wi-Fi (or to the local network via Ethernet in the case of Philips Hue devices). Based on the above, a wireless access point has been determined as an adequate collection point for traffic generated by SHIoT devices. Due to the known modes of operation and characteristics of computers, and thus wireless Wi-Fi networks, traffic in the communication network cannot be collected directly. Several methods are available for traffic collection, often using physical port mirroring on the switch. This method has proven efficient in several studies, such as [15, 26,27,28], which provides a basis for the application of the same method in conducting this research.

A software-hardware platform has been set up to collect traffic by port mirroring. It consists of a Fortinet AP 221C wireless access point, a Cisco 2960 Catalyst 48 PoE (Power over Ethernet) switch, and an HP Pavilion dm1 workstation (Microsoft Windows 10 version 10.0.17134, build 17134, x64 processor architecture, AMD E-350, 1600 MHz, 2 cores, 4 GB RAM) with the Wireshark software tool version 2.6.3 installed.

As shown in Fig. 2, port mirroring is configured for the physical communication ports (FA0/1 and FA0/3) of the switch to which the wireless access point and the IoT hub for the Philips Hue device are connected. These ports are configured as a source, which means that all traffic coming to or from these ports will be mirrored (mapped) to the destination communication port (FA0/2). A traffic collection workstation is connected to this port (Fig. 2).

Fig. 2 Laboratory environment of a smart home formed for data collection [29]

3.1 Analysis of the used SHIoT devices

The laboratory environment of a smart home was formed to collect primary data. It contains SHIoT devices commercially available on the market, considering that, according to statistical indicators, such devices feature continuous growth of the application. Figure 3 shows the distribution of SHIoT devices, i.e., the representation of each group in the total number of devices and the number of devices that will be used to collect the primary and secondary data. The complete list of SHIoT devices included in this research is shown in Table 1.

Fig. 3 Distribution of SHIoT device groups

Table 1 SHIoT devices for data collection purposes

The smart home laboratory environment was formed within the Laboratory for security and forensic analysis of the information and communication system of the Department for information and communication traffic at the Faculty of Transport and Traffic Sciences. In addition to SHIoT devices intended for the collection of primary data, for the subject research, secondary data already collected through various SHIoT devices within the existing research were used [17, 30, 31].

Table 1 lists the MAC (Media Access Control) addresses as the unique identifiers of the SHIoT devices in the network, the device name, the P/S code indicating whether the observed device was used to collect primary or secondary data, and the functional group to which the observed SHIoT device belongs.

A total of 41 devices in a smart home environment were used for the research, part of which was already shown in [29]. According to statistics, there are differences in the estimate of the average number of SHIoT devices per household that has implemented a specific form of a smart home. These estimates range from 6.53 to 14 SHIoT devices per household. In the Republic of Croatia, the representation of smart homes is still low, and telecom operators are taking on the role of smart home service providers through the offer of SHIoT devices for end-users. For example, the Internet service provider Iskon Internet offers customers the opportunity to purchase a smart home package consisting of four SHIoT devices [32]. In comparison, the telecom operator A1 offers customers the opportunity to implement a total of five SHIoT devices in a smart home environment [33].

Despite the above, this research sought to achieve the highest possible diversity of SHIoT devices due to the need to define device classes based on the characteristics of the generated traffic. Therefore, the number of devices used is higher than the current statistical estimate of the average value of SHIoT devices per smart home in the Republic of Croatia and the world. The predictions shown in [34] refer to the period until 2023, but given the upward trend in the growth of the number of devices, it is to be assumed that the number of devices will reach 40 per smart home in the foreseeable future.

3.2 Descriptive statistical analysis of collected data

The primary dataset formed for this research consists of a total of 103 files in .pcap format that contain a complete record of network traffic. The secondary dataset consists of 41 files of the same format as the primary set, which makes a total of 144 network traffic files generated by various SHIoT devices and representing legitimate network traffic. Each of the 144 files contains traffic generated in a 24-h time interval.

Table 2 shows the statistical description of the dataset through statistical measures of standard deviation, minimum, maximum, and mean values at the level of 24-h intervals of collected traffic for primary and secondary data and the consolidated dataset. Statistical description is represented through three logical parts: primary data, secondary data and total.

Table 2 Statistical description of the collected legitimate network traffic data

For every logical part we gave standard deviation, minimum, maximum and mean value for the parameters such as Number of collected packets, File size, Amount of collected data, Average data transfer rate, Average packet transfer rate, and Average packet size. These measures show the characteristics of the collected data. For example, it can be concluded that secondary dataset is bigger than the primary one or that the average packet size is smaller in the secondary dataset than in the primary. All previously mentioned can be explained with a high level of device heterogeneity and diversity in both datasets which are characteristics of the devices in IoT concept. The characteristics of the initially collected data are shown in Table 3. They are expressed through the number of collected files containing 24-h intervals of generated traffic, number of collected packets, file size, amount of collected data, and the total period of data collection.

Table 3 Characteristics of the initial traffic dataset

The network traffic acquisition tool (Wireshark) uses specific metadata that it records within files with the collected traffic, which makes a difference between the size of the file and the amount of collected data (traffic) contained in the file.

3.3 Extraction of identified traffic features

To develop the SHIoT device classification model, the traffic in each individual .pcap file was filtered according to the MAC address of the device. The reason for this way of filtering is that IP (Internet Protocol) addresses are assigned to devices by a DHCP (Dynamic Host Configuration Protocol) server, so an IP address can change over time and is not a reliable feature for accurately attributing traffic to a particular device.
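The MAC-based filtering step can be sketched with the Python standard library alone. This is an illustrative sketch rather than the procedure the authors used: it assumes the classic pcap file layout (24-byte global header, 16-byte per-record header), microsecond-resolution files, and Ethernet framing, and the function names `parse_mac` and `filter_pcap_by_mac` are hypothetical.

```python
import struct

def parse_mac(text):
    """Convert 'aa:bb:cc:dd:ee:ff' into its 6 raw bytes."""
    return bytes(int(part, 16) for part in text.split(":"))

def filter_pcap_by_mac(data, mac):
    """Return classic-pcap bytes holding only frames sent or received by `mac`.

    Assumptions (not from the paper): `data` is a classic pcap capture
    with microsecond timestamps and Ethernet link type, so bytes 0-5 of
    each frame are the destination MAC and bytes 6-11 the source MAC.
    """
    target = parse_mac(mac)
    magic = data[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"          # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"          # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond pcap file")

    out = bytearray(data[:24])            # keep the global header unchanged
    offset = 24
    while offset + 16 <= len(data):
        # Record header: ts_sec, ts_usec, captured length, original length.
        _sec, _usec, incl_len, _orig = struct.unpack(
            endian + "IIII", data[offset:offset + 16])
        frame = data[offset + 16:offset + 16 + incl_len]
        if len(frame) >= 12 and target in (frame[0:6], frame[6:12]):
            out += data[offset:offset + 16 + incl_len]
        offset += 16 + incl_len
    return bytes(out)
```

Matching on the link-layer address keeps a device's traffic together even when its DHCP lease, and therefore its IP address, changes between capture days.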

The research observes the traffic characteristics for individual SHIoT devices covered by the research (41 devices) at the traffic flow level. A traffic flow is defined as a sequence of packets with equal values of source IP address, destination IP address, source communication port, destination communication port, and the protocol used, TCP (Transmission Control Protocol) or UDP (User Datagram Protocol) [35]. The reason for choosing the traffic flow as the level of observation and analysis of traffic characteristics is that it represents the aggregated (statistical) data of the packet headers for communication between the source and the destination. The analysis of packet-level traffic features encompasses more information, such as packet content, and also requires more computing resources to store and process it. An example of the relationship between the number of traffic flows and the number of packets in 24 h is visible for the Google Chromecast device (covered by this study), where 11,877 separate traffic flows were generated while the number of packets was 2,459,538. Nowadays, a growing number of devices and applications use cryptographic methods for communication, so the contents of the packet cannot be observed and analyzed in an economically, temporally, and legally acceptable way. Therefore, the observation and analysis of traffic characteristics at the traffic flow level represent an acceptable and frequently used approach in numerous studies.
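The five-tuple flow definition can be illustrated with a small grouping routine. This is a simplified sketch, not the logic of any particular flow extraction tool: packets are represented as plain dictionaries with hypothetical field names (`src_ip`, `dst_ip`, `src_port`, `dst_port`, `proto`, `ts`, `size`), flows are treated as unidirectional, and flow timeouts are ignored.

```python
from collections import defaultdict

def group_into_flows(packets):
    """Group packets that share the same five-tuple into one flow.

    Each packet is a dict with the hypothetical keys src_ip, dst_ip,
    src_port, dst_port, proto, ts (seconds) and size (bytes).
    """
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt["src_ip"], pkt["dst_ip"],
               pkt["src_port"], pkt["dst_port"], pkt["proto"])
        flows[key].append(pkt)
    return dict(flows)

def flow_summary(flow_packets):
    """Aggregate header-level statistics for one flow."""
    times = [p["ts"] for p in flow_packets]
    return {
        "packets": len(flow_packets),
        "bytes": sum(p["size"] for p in flow_packets),
        "duration": max(times) - min(times),  # first to last packet
    }
```

Because each flow collapses many packets into a handful of statistics, the Google Chromecast example above shrinks from roughly 2.46 million packets to fewer than 12 thousand records per day, which is the storage and processing advantage the paragraph describes.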

The CICFlowMeter software tool was used to extract traffic flow features. CICFlowMeter is a tool developed at the Canadian Institute of Cyber Security, University of New Brunswick [36]. The tool was developed in the Java programming language, which provides flexibility in selecting traffic flow features that can be calculated as well as adding new features. By using this tool, a total of 83 traffic flow features were extracted (z1,…,z83). The extracted traffic flow characteristics are the result of the analysis and identification of relevant traffic characteristics for MTC traffic resulting from the research [20]. The reason is to collect as many features as possible in the initial set in order to determine in the later stages of the research (classification of SHIoT devices and anomaly detection) which independent features have the most significant influence on the change of the selected dependent feature.

Figure 4 shows the distribution of traffic flows (feature vectors), i.e., the share of traffic flows extracted from the collected traffic of SHIoT devices covered by the research.

Fig. 4 Distribution of the number of traffic flows according to the SHIoT device

The total number of collected traffic flows is 2,045,052. The presented feature vectors were used in the later phases, which include defining of SHIoT device classes and developing a SHIoT device classification model.

3.4 Defining classes of IoT devices

Identification of devices in the IoT environment is an important step and the basis for activities related to the security of the environment in which such devices exist, such as the detection of unauthorized activities, unauthorized devices within the network, malicious program code. The authors in the research [16] use the cluster method for the purpose of classifying 21 IoT devices whereby the devices are classified separately based on 11 features. Based on the identification of the device, research [28] seeks to detect unauthorized devices connected to the observed network. For this purpose, a total of 11 IoT devices was used, which are classified according to the semantic characteristics of the devices, i.e., their purpose (child monitoring devices, motion sensors, refrigerators, security cameras, smoke sensors, sockets, thermostats, televisions, clocks). A similar method of classification, based on the semantic characteristics of the device, is shown in research [37] in which the authors use a secondary dataset collected in [16]. The research included a total of 15 devices that are classified into four categories concerning the purpose of each device (concentrators, electronic devices, cameras, and sockets). Based on the analysis conducted by the research, the authors point out that the diversity of devices included in the data collection phase is more critical for the classification of SHIoT devices than the size of the dataset (the period of collection and amount of collected traffic).

From previous research, it is noticeable that the classification approaches so far are based mainly on semantic features, which means that the device classes are defined according to the application of such devices or their functionalities. The lack of such an approach for defining classes can be observed from the aspect of the dynamism of the smart home environment. According to the statistical indicators presented in [34], the number of SHIoT devices is continuously increasing, which is accompanied by an increase in the number of companies developing new solutions and new SHIoT devices. Therefore, SHIoT device classes need to be defined in a way that will apply to the upcoming SHIoT devices that will differ in functionality and application from the currently available devices.

3.4.1 Determining the traffic flow feature for the definition of device classes

The predictability of IoT device behavior is a phenomenon that results from the communication activities of IoT devices observed in research [15, 27, 38]. Since SHIoT devices possess a limited number of functionalities, specific devices will behave approximately equally over time according to the values of the observed traffic characteristics. Unlike IoT devices, conventional devices (smartphones, desktops, laptops, and servers) support the installation of a large number of applications, so the communication activity of such devices depends on the end-users and the way the device is used. Accordingly, the index of the level of predictability of the behavior of IoT devices, expressed by the coefficient of variation of the received and sent data (Cu index), is a measure based on which it is possible to determine the behavior of SHIoT devices in a certain period. The closer the index (Cu) is to 0, the smaller the deviation of the observed device in relation to the amount of received and sent data, and the level of predictability of the behavior of such a device is considered higher than that of a device whose Cu index is farther from 0. All notation used in paper are shown in.

The Cu index was calculated for the mean values of consecutive traffic flows of an individual SHIoT device in 30 days according to expression (1).

$$C_{u} = CVar_{u} = \frac{\sqrt{\frac{1}{N - 1}\sum_{i = 1}^{N}\left(x_{i} - \overline{x}\right)^{2}}}{\frac{1}{N}\sum_{i = 1}^{N} x_{i}}$$
(1)

where:

\(C_{u} = CVar_{u}\) – traffic predictability level index for SHIoT device u;

\(N\) – total number of mean values of the ratio of received and sent traffic for consecutive traffic flows in period T;

\(x_{i}\) – mean value of the ratio of received and sent traffic volume for consecutive traffic flows;

\(\overline{x}\) – arithmetic mean of all \(x_{i}\) values.

The coefficient of variation, as a normalized measure of dispersion, becomes unstable when the mean value approaches 0. To avoid this problem, traffic flows in which the ratio of received and sent data is equal to 0 are removed from the dataset.
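Expression (1) together with the removal of zero-valued ratios can be sketched as follows. The function name `cu_index` and the plain list input are assumptions made for illustration; `statistics.stdev` is used because it applies the same N − 1 denominator as the numerator of expression (1).

```python
from statistics import mean, stdev

def cu_index(ratios):
    """Coefficient of variation of received/sent traffic ratios.

    `ratios` holds the x_i values of expression (1). Ratios equal to 0
    are dropped first, as described in the text.
    """
    nonzero = [r for r in ratios if r != 0]
    if len(nonzero) < 2:
        raise ValueError("at least two non-zero ratios are required")
    # stdev() is the sample standard deviation (N - 1 denominator).
    return stdev(nonzero) / mean(nonzero)
```

A device whose ratios never change yields an index of 0, the most predictable case, while larger swings between consecutive flows push the index upward.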

3.4.2 Defining IoT device classes based on coefficients of variation

To define the device classes based on the Cu index value, we used the method of coefficients of variation classification used in research [29, 39,40,41,42]. It assumes a normal distribution of data. Since the distribution of the obtained values (Cu index) is asymmetric (slanted to the left), the data are transformed. The data transformation method was selected using the Ladder of powers method (Tukey method), which clearly shows the appropriate data transformation function to achieve a normal distribution [43].

The results obtained by the applied method show that the logarithmic function is suitable for data transformation, since in this case it results in a normal distribution. The distribution of data is closer to normal the closer chi2 is to 0, i.e., the closer P (chi2) is to 1. The normal distribution of the obtained data was confirmed by both the Shapiro–Wilk and Shapiro–Francia normality tests, seen in Table 4, where, in both cases, p > 0.05 and the null hypothesis (that the values of the log (Cu) variable follow the normal distribution) cannot be rejected. Parameters W and V represent coefficients that indicate the deviation from the normal distribution of data, where a value of W ≈ 1 indicates a normal distribution, while z is a z-statistic that indicates how many standard deviations the observed data are away from the mean value [44].

Table 4 Results of Shapiro–Wilk and Shapiro–Francia normality tests

To apply the coefficients of variation classification method, the logarithmic values of Cu index were normalized by the min–max method according to expression (2):

$$C_{u(norm)} = \frac{\log\left(C_{u}\right) - \log\left(C_{u_{\min}}\right)}{\log\left(C_{u_{\max}}\right) - \log\left(C_{u_{\min}}\right)}$$
(2)

where:

\(C_{u(norm)}\) – normalized value of the logarithmically transformed value Cu in the interval [0,1];

\(\log\left(C_{u}\right)\) – logarithmic value of Cu for device u;

\(\log\left(C_{u_{\min}}\right)\) – minimum logarithmic value of Cu of all devices;

\(\log\left(C_{u_{\max}}\right)\) – maximum logarithmic value of Cu of all devices.
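Expression (2) can be sketched in a few lines. The function name `normalize_log_cu` and the dictionary keys are illustrative assumptions. The sample input reuses the four Cu values quoted later in the text for the Fig. 5 devices, so its minimum and maximum are those of this four-device subset only, and the output will not reproduce the values in Table 5, which are based on all devices.

```python
import math

def normalize_log_cu(cu_values):
    """Min-max normalize log-transformed Cu values into [0, 1].

    `cu_values` maps a device label to its (positive) Cu index.
    """
    logs = {dev: math.log10(cu) for dev, cu in cu_values.items()}
    lo, hi = min(logs.values()), max(logs.values())
    if hi == lo:
        raise ValueError("all Cu values are equal; cannot normalize")
    return {dev: (v - lo) / (hi - lo) for dev, v in logs.items()}

# Cu values quoted in the text for four devices (subset only).
sample = {"tp_link": 0.042, "nest_smoke": 0.19, "i896": 0.37, "google_mini": 4.18}
normalized = normalize_log_cu(sample)
```

The base of the logarithm is not stated in the text. For the min-max step this does not matter, because changing the base scales the numerator and denominator of expression (2) by the same factor; base 10 is used here only as a convenient choice.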

After establishing the normal distribution of data and their normalization, the method of defining classes based on coefficients of variation was applied as a result of the mean values of the coefficients of variation and their standard deviation.

The mean value of the coefficient of variation was calculated according to expression (3):

$$A_{C_{u(norm)}} = \frac{1}{N}\sum_{u = 1}^{N} C_{u(norm)} = \frac{C_{1(norm)} + C_{2(norm)} + \cdots + C_{N(norm)}}{N}$$
(3)

where:

\(A_{C_{u(norm)}}\) – arithmetic mean of the coefficients of variation of all devices;

\(N\) – number of devices;

\(C_{u(norm)}\) – coefficient of variation of device u.

The standard deviation of the coefficients of variation was calculated according to expression (4):

$$\sigma_{C_{u(norm)}} = \sqrt{\frac{1}{N - 1}\sum_{u = 1}^{N}\left(C_{u(norm)} - \overline{C}\right)^{2}}$$
(4)

where:

\(\sigma_{C_{u(norm)}}\) – standard deviation of the coefficients of variation of all devices;

\(N\) – number of devices;

\(C_{u(norm)}\) – coefficient of variation of device u;

\(\overline{C}\) – arithmetic mean of the coefficients of variation of all devices, i.e., \(A_{C_{u(norm)}}\) from expression (3).

Based on the previously performed data processing, a total of four classes of IoT devices were defined according to the method used in the research [41]. The first class includes devices that meet the condition \(C_{u(norm)} \le A_{C_{u(norm)}} - \sigma_{C_{u(norm)}}\). The second class includes devices that meet the condition \(A_{C_{u(norm)}} - \sigma_{C_{u(norm)}} < C_{u(norm)} \le \frac{A_{C_{u(norm)}} + \sigma_{C_{u(norm)}}}{2}\). The third class includes devices that meet the condition \(\frac{A_{C_{u(norm)}} + \sigma_{C_{u(norm)}}}{2} < C_{u(norm)} \le A_{C_{u(norm)}} + \sigma_{C_{u(norm)}}\), while the last class includes devices that satisfy the condition \(C_{u(norm)} > A_{C_{u(norm)}} + \sigma_{C_{u(norm)}}\).

The values of the Cu index, the logarithmically transformed values, and the min–max normalized values for each analyzed device are shown in Table 5. According to the data shown in Table 5, a total of four device classes were defined based on the values of the Cu index. The first class (C1) includes all devices whose logarithmically transformed and normalized value of the Cu index is Cu(norm) ≤ 0.253722. The second class (C2) includes devices that meet the condition 0.253722 < Cu(norm) ≤ 0.354866. The third class (C3) includes devices that meet the condition 0.354866 < Cu(norm) ≤ 0.709732, while the last class (C4) includes devices that meet the condition Cu(norm) > 0.709732.
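The class boundaries follow from the mean and standard deviation of the normalized values, so the assignment rule can be sketched as below. The function names `class_thresholds` and `assign_class` are hypothetical, and `statistics.stdev` is used because expression (4) has an N − 1 denominator. The three numeric boundaries in the usage lines are the ones quoted in the text.

```python
from statistics import mean, stdev

def class_thresholds(norm_values):
    """Return the three class boundaries from normalized Cu values.

    Boundaries follow the conditions in the text:
    A - sigma, (A + sigma) / 2 and A + sigma.
    They are ordered as expected only when A < 3 * sigma.
    """
    a = mean(norm_values)
    s = stdev(norm_values)        # sample standard deviation, expression (4)
    return (a - s, (a + s) / 2, a + s)

def assign_class(c_norm, thresholds):
    """Map one normalized Cu value to a class label C1..C4."""
    t1, t2, t3 = thresholds
    if c_norm <= t1:
        return "C1"   # very high predictability
    if c_norm <= t2:
        return "C2"   # high predictability
    if c_norm <= t3:
        return "C3"   # medium predictability
    return "C4"       # low predictability

# Boundaries quoted in the text for the studied device set.
PAPER_THRESHOLDS = (0.253722, 0.354866, 0.709732)
label = assign_class(0.30, PAPER_THRESHOLDS)
```

Because the rule depends only on a single normalized value per device, a previously unseen device can be placed in a class as soon as its Cu index has been computed and normalized against the same minimum and maximum.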

Table 5 Defined device classes according to Cu index value

Class C1 denotes IoT devices with a very high level of behavioral predictability since the coefficient of variation of the ratio of received and sent data is closest to 0. This means that such devices behave approximately equally over time from the aspect of the observed feature. If a Class C1 IoT device is used by a user, another device, or the environment, there will be no significant effect on the change in the Cu index value.

Class C2 combines devices with a high level of predictable behavior. If a device in this class is used by a user, another device, or the environment, it can result in minor changes to the ratio of received and sent data. Devices grouped into class C3 have a medium level of predictable behavior. The impact of user interaction, other devices, or the environment on the relationship between received and sent data can be significant. This behavior can result from additional functionality of the device that, at certain times, produces a larger amount of data in the incoming or outgoing direction.

The last class (C4) combines SHIoT devices with a low level of predictable behavior. The use of such devices and their interaction with the user, other devices, or the environment significantly affects the relationship between the received and sent data. The reason is a significantly higher amount of data in the incoming direction (download) as a result of user requests. An example is a device such as Google Chromecast, where video content is played at the user's request, which requires it to be downloaded via the YouTube service. This class also includes the Google Home mini, a smart speaker that can provide a variety of audio content at the user's request, which also causes a more considerable variation in the ratio of received and sent traffic.

Figure 5 shows an example of the behavioral relationships of four SHIoT devices (TPlink Day Night Cloud NC220 camera, NEST Protect Smoke Alarm, iRobot Roomba 896, and Google Home mini) belonging to different classes for 1,000 consecutive traffic flows. The variation of the ratio of received and sent traffic for the Google Home mini (Cgoogle_mini = 4.18) is markedly higher than for the TPlink Day Night Cloud NC220 camera (Ctp_link = 0.042), the NEST Protect Smoke Alarm (Cnest_smoke = 0.19), and the iRobot Roomba 896 (Ci896 = 0.37).
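The coefficient of variation of the received-to-sent ratio can be sketched as follows. This is only an illustration: it assumes the coefficient is the standard deviation divided by the mean of the per-flow ratios and uses the population standard deviation, which may differ from the estimator used in the study. The byte counts are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(ratios):
    """Return std / mean for a sequence of received-to-sent ratios."""
    avg = mean(ratios)
    if avg == 0:
        raise ValueError("mean ratio is zero; coefficient of variation is undefined")
    return pstdev(ratios) / avg


# Hypothetical (received_bytes, sent_bytes) pairs for a few flows of two devices.
steady_device = [(1000, 500), (1010, 500), (990, 500), (1000, 500)]
bursty_device = [(200, 100), (90000, 100), (300, 100), (45000, 100)]

for name, flows in (("steady", steady_device), ("bursty", bursty_device)):
    ratios = [received / sent for received, sent in flows]
    print(name, round(coefficient_of_variation(ratios), 3))
```

A device whose ratio barely changes between flows yields a value close to 0, while a device with occasional large downloads yields a value above 1.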

Fig. 5 Display of the difference in the behavior of four SHIoT devices in time according to the ratio of received and sent traffic for 1,000 consecutive traffic flows

To develop a classification model based on logistic regression extended with supervised machine learning, a dataset was formed that contains the values of the extracted characteristics of SHIoT device traffic flows together with the class of the generating device for each traffic flow. The process of forming this dataset, which aggregates the characteristic values of individual traffic flows and the affiliation of each traffic flow to the defined classes, is shown by the UML (Unified Modeling Language) activity diagram in Fig. 6.

Fig. 6 UML activity diagram of the dataset creation process

Each traffic flow is generated by a SHIoT device belonging to a particular class according to the classification shown in Table 5. Accordingly, each traffic flow is associated with a corresponding class, as shown in Table 6.

Table 6 Example and aggregation of traffic flows and class labels

The extraction of traffic flow characteristics generated by an individual SHIoT device and the definition of SHIoT device classes are the basis for the formation of a data set of SHIoT device traffic flows to which class labels are associated.

4 Development of SHIoT device classification model

In order to develop a multiclass classification model of SHIoT devices, the logitboost method was used. This method belongs to the ensemble machine learning methods and is based on the statistical method of logistic regression. Ensembles combine several models, as shown in Fig. 7, each of which solves the original problem, to obtain a composite global model that performs better than any single model [45].

Fig. 7 Generalized presentation of the working principle of an ensemble machine learning method

Boosting belongs to a set of ensemble methods that can convert multiple "weak" classifiers (models that predict the target class depending on the values of the observed feature vectors) into "strong" classifiers. In general, a "weak" classifier is a model whose class prediction accuracy is slightly better than random guessing, while a strong classifier is characterized by near-ideal performance. Boosting methods have proven to be a suitable classification technique that provides excellent results in solving problems from different domains [46]. Given the classification problem that is being addressed and the proven effectiveness of the boosting group of machine learning methods, the logitboost method was used in this research.

4.1 Feature selection for development of SHIoT device classification model

Selecting the traffic characteristics generated by SHIoT devices is a crucial step in the process of developing a SHIoT device classification model. The importance of feature selection has been proven in numerous studies using statistical and machine learning methods, especially in the area of classification and regression. The aim is to identify a subset of the original feature set that is relevant to the classification problem being addressed and to remove those features that are irrelevant or redundant, thus reducing the dimensionality of the feature space as well as the entire dataset. The choice of features has a positive effect on the accuracy of the classification model, the speed of classification, and can reduce the occurrence of overfitting, which often leads to poor results in the validation process [47].

Features related to traffic flow identification (z1,.., z7) were preventively removed from the initial feature set to reduce its bias, a phenomenon that causes "wrong assumptions" during the model learning phase and results in a failure to identify the relevant relationships between independent and dependent features. Therefore, the initial set of independent features was reduced from 83 to 76.

For the purpose of selecting features, the information gain (IG) method was used. The selected method is based on entropy and belongs to a set of feature ranking methods. This group of methods is characterized by simplicity and good results in practical applications, which is why it is often used in the process of selecting features in different domains such as text categorization, genome analysis, anomaly detection in communication networks, and bioinformatics [48,49,50,51].

According to [52], the IG method belongs to the correlation-based measures. It is used to calculate the degree of correlation between a selected independent feature and the dependent feature (device class) and to evaluate the suitability of the feature for classification (goodness of feature). According to [53], an independent feature is appropriate if it is relevant to the observed dependent feature and is not redundant with other relevant independent features. IG measures the reduction in the uncertainty of identifying the dependent feature when the value of the independent feature is known. The uncertainty calculation is based on information theory and Shannon entropy, and it is used to select those independent features that have the most significant impact on the dependent feature. The entropy of the dependent feature X is defined by expression (5) [53].

$$H\left( X \right) = - \sum \limits_{i = 1}^{n} P\left( x_{i} \right)\log_{2} \left( P\left( x_{i} \right) \right)$$
(5)

where:

H(X) entropy of dependent feature X;

P(xi) probability of occurrence of value xi for feature X.

The entropy of the dependent feature X, after observing the value of the independent feature Y, is defined by expression (6).

$$H\left( X|Y \right) = - \sum \limits_{j = 1}^{m} P\left( y_{j} \right) \sum \limits_{i = 1}^{n} P\left( x_{i} |y_{j} \right)\log_{2} \left( P\left( x_{i} |y_{j} \right) \right)$$
(6)

where:

P(yj) probability of occurrence of value yj for feature Y;

P(xi|yj) conditional probability of value xi of feature X given value yj of feature Y.

The information gain reflects the amount by which the uncertainty of an individual value identification of the dependent feature X (device class) decreases with respect to the values of the observed independent feature Y according to the expression (7).

$$IG = H\left( X \right) - H\left( {X|Y} \right){ }$$
(7)

Since the dependent feature X can only take four values (four possible classes), the maximum value of IG is 2 (\(\log_{2} 4\)). The value obtained for an individual independent feature therefore represents the amount of information of that feature, i.e., the amount by which the observed independent feature reduces the entropy (uncertainty) of the dependent feature. Table 7 shows the characteristics of the traffic flow with the corresponding values of IG. The table shows that, for example, feature z12 almost completely reduces the entropy of the dependent feature (IG = 1.832), while certain features (e.g., z67, z39, z37) do not contribute to the decrease of the entropy of the dependent feature (IG = 0).
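Expressions (5), (6), and (7) can be illustrated with a minimal sketch for discrete feature values. The toy labels and features below are hypothetical. Continuous traffic features would first have to be discretized, and how that was done in the study is not stated here, so the sketch makes no assumption about it.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(X) of a sequence of class labels, expression (5)."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def conditional_entropy(labels, feature_values):
    """H(X|Y): entropy of the labels within each feature value, weighted by P(y)."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, feature_values):
        groups.setdefault(value, []).append(label)
    return sum(len(group) / total * entropy(group) for group in groups.values())


def information_gain(labels, feature_values):
    """IG = H(X) - H(X|Y), expression (7)."""
    return entropy(labels) - conditional_entropy(labels, feature_values)


# Hypothetical toy data: four equally frequent classes and two discrete features.
classes = ["C1", "C2", "C3", "C4"] * 2
perfect = ["a", "b", "c", "d"] * 2   # determines the class exactly
useless = ["a"] * 8                  # carries no information about the class

print(information_gain(classes, perfect))
print(information_gain(classes, useless))
```

With four equally frequent classes, a feature that determines the class exactly reaches the maximum gain of 2 bits, and a constant feature yields a gain of 0.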

Table 7 Information gain values as the basis for selecting a subset of relevant independent features

Accordingly, the set of 76 has been further reduced to 58 independent features by retaining only those features that satisfy the condition IG > 0. The obtained subset of features cannot be considered final since, in the development phase of the SHIoT device classification model, it is necessary to examine further how the model performs when features with lower IG values are removed. The goal is to use the minimum set of features that gives the best performance of the classification model in order to reduce the time required to predict the class, reduce complexity, and reduce the occurrence of model bias.

4.2 Dataset used in the development of the classification model

The classification model, which aims to determine the class to which a device belongs based on the traffic flow characteristics it generates, is based on traffic flow characteristics collected over a period of 10 days for each device. The traffic flow feature vectors extracted for SHIoT devices are labeled with the appropriate class (Table 6). The number of traffic flows generated in the observed period depends on the characteristics of each SHIoT device [54].

The initial dataset, according to the above, is unbalanced and contains a total of 681,684 feature vectors distributed across four classes, as shown in Fig. 8. Therefore, before the development of the classification model, the number of traffic flows in the dataset was balanced by stratification with under-sampling of the majority class, taking into account the representation of the traffic flows of each individual device in the initial dataset. The reason for this approach is that the model could otherwise become biased toward the class that contains the largest number of feature vectors; according to [55], it is necessary to stratify the classes before model development. Following the stratification, the dataset contains 117,423 feature vectors, which were used to develop the classification model.
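One possible reading of under-sampling that respects the per-device representation is sketched below. This is an assumption made for illustration, not the procedure used in the study: the function `undersample_class`, the target size, and the toy flows are all hypothetical.

```python
import random
from collections import defaultdict


def undersample_class(flows, target_size, seed=0):
    """Reduce `flows` (a list of (device, features) tuples of one class) to about
    `target_size` items while keeping each device's share roughly unchanged."""
    if len(flows) <= target_size:
        return list(flows)
    rng = random.Random(seed)
    by_device = defaultdict(list)
    for flow in flows:
        by_device[flow[0]].append(flow)
    fraction = target_size / len(flows)
    sample = []
    for device_flows in by_device.values():
        # Keep at least one flow per device so that no device disappears.
        keep = max(1, round(len(device_flows) * fraction))
        sample.extend(rng.sample(device_flows, keep))
    return sample


# Hypothetical majority class: 900 flows from device "a" and 100 from device "b".
majority = [("a", i) for i in range(900)] + [("b", i) for i in range(100)]
reduced = undersample_class(majority, target_size=200)
print(len(reduced), sum(1 for device, _ in reduced if device == "a"))
```

Because of rounding and the one-flow minimum per device, the result can deviate slightly from the requested size when many devices contribute only a few flows.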

Fig. 8 Distribution of traffic flows according to SHIoT device classes

4.3 Application of the additive logistic regression method for multiclass classification of SHIoT devices

Additive logistic regression (logitboost) is a supervised machine learning method that can be viewed as a generalization of the classical statistical method of logistic regression. The logitboost method was developed in 2000 and presented in [56].

4.3.1 Logistic regression method

The logistic regression method models the conditional probability Pr(G = j|X = x) that the observed example belongs to a particular class, for each of the J classes, which makes it possible to determine the classes of unknown examples according to expression (8).

$$j = \mathop {{\text{argmax}}}\limits_{j} Pr(G = j|X = x){ }$$
(8)

where:

j j-th class from the set of classes G;

G set of classes (1,…,J);

x independent feature from set X;

X a set of independent features.

Logistic regression models the class probabilities using linear functions in x while ensuring that they sum to one and that each remains within the interval [0,1]. The model is specified in terms of J − 1 log-odds that separate each class from the "basic" class J according to expressions (9), (10), and (11).

$$\log \frac{{Pr\left[ {G = j|X = x} \right]}}{{Pr\left[ {G = J|X = x} \right]}} = \beta_{j}^{T} x_{i} ;j = 1,..,{ }J - 1$$
(9)

where:

\(\beta_{j}\) logistic coefficient of the independent feature for class j;

$$\Pr \left( {G = j{|}X = x} \right) = { }\frac{{e^{{\beta_{j}^{T} x_{i} }} }}{{1 + \mathop \sum \nolimits_{l = 1}^{J - 1} e^{{\beta_{l}^{T} x_{i} }} }}{ };j = 1,..,J - 1$$
(10)
$$\Pr \left( {G = J{|}X = x} \right) = { }\frac{1}{{1 + \mathop \sum \nolimits_{l = 1}^{J - 1} e^{{\beta_{l}^{T} x_{i} }} }}{ }$$
(11)

Expression (9) implies a multiclass classification model in which xi is the i-th feature vector and J is the number of classes, with class J serving as the reference class, where j ∈ {1,2,..,J − 1} under the condition J ≥ 3. This model sets linear boundaries between the regions corresponding to different classes. Thus, examples (xi) that lie on the boundary between two classes (j and J) are those for which Pr(G = j|X = x) = Pr(G = J|X = x), which is equivalent to log-odds = 0. Fitting the logistic regression model involves estimating the parameters \({\beta }_{j}\), where the standard statistical procedure is to find the maximum of the likelihood function [57].
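Expressions (8), (10), and (11) can be sketched in a few lines. The coefficient vectors and the feature vector below are hypothetical and chosen only for illustration. Class J is treated as the reference class, and no intercept term is included because none appears in the expressions above.

```python
from math import exp


def class_probabilities(betas, x):
    """Return [Pr(G=1|x), ..., Pr(G=J|x)] for J-1 coefficient vectors `betas`,
    with class J as the reference class (expressions (10) and (11))."""
    scores = [exp(sum(b * v for b, v in zip(beta, x))) for beta in betas]
    denominator = 1.0 + sum(scores)
    return [s / denominator for s in scores] + [1.0 / denominator]


def predict(betas, x):
    """Return the 1-based index of the most probable class (expression (8))."""
    probs = class_probabilities(betas, x)
    return max(range(len(probs)), key=probs.__getitem__) + 1


# Hypothetical coefficients for J = 4 classes and two features.
betas = [[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]]
x = [2.0, 1.0]
print(class_probabilities(betas, x), predict(betas, x))
```

The returned probabilities always sum to one, and an all-zero set of coefficients gives every class the same probability 1/J.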

4.3.2 Logitboost method

In models based on logistic regression, there is no closed-form solution for the parameters \({\beta }_{j}\) that maximize the likelihood function, so optimization methods must be used. In this way, the maximum of the likelihood function is reached by an iterative procedure. Logitboost is one such method and is used in this study. It is based on the multinomial ordinal logistic regression method because the dependent feature takes more than two values that follow a natural sequence. In general, logitboost takes the form shown by expression (12).

$$\Pr \left( {G = j{|}X = x} \right) = { }\frac{{e^{{F_{j} \left( x \right)}} }}{{\mathop \sum \nolimits_{{{\text{k}} = 1}}^{J} e^{{F_{k} \left( x \right)}} }};{ }\mathop \sum \limits_{k = 1}^{J} F_{k} \left( x \right) = 0{ }$$
(12)

where:

\(F_{j} \left( x \right)\) function of the independent features x for class j.

Each function \(F_{j} \left( x \right) = \mathop \sum \limits_{m = 1}^{M} f_{mj} \left( x \right)\) is a sum of M functions \(f_{mj}\) of the independent features. In each iteration m (m ∈ {1,2,…, M}), the weighting factor (w) of each misclassified example (xi) increases, while the weighting factor of each correctly classified example decreases. In this way, the m-th "weak" classifier fm focuses on examples that have been misclassified in previous iterations.

The output of the logitboost method is a set of J response functions {Fj(x); j = 1,…,J}, as shown in Fig. 9. Each Fj(x) is a linear combination of a set of "weak" classifiers.
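The way class probabilities follow from the response functions in expression (12) can be sketched as follows. The weak-learner outputs are hypothetical, and the sketch covers only the final probability computation, not the iterative fitting of the weak classifiers.

```python
from math import exp


def center(scores):
    """Shift the scores so that they sum to zero, as required by expression (12)."""
    avg = sum(scores) / len(scores)
    return [s - avg for s in scores]


def probabilities(scores):
    """Pr(G=j|x) = exp(F_j) / sum_k exp(F_k) for a list of F_j(x) values."""
    peak = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical outputs f_mj(x) of M = 3 weak learners for J = 4 classes.
weak_outputs = [[0.4, -0.1, -0.2, -0.1],
                [0.6, -0.3, -0.2, -0.1],
                [0.1, 0.2, -0.2, -0.1]]
F = center([sum(column) for column in zip(*weak_outputs)])  # F_j(x) = sum_m f_mj(x)
print(F, probabilities(F))
```

Adding the same constant to every Fj(x) leaves the probabilities unchanged, which is why the sum-to-zero condition is needed to make the response functions unique.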

Fig. 9 Logitboost method [46]

5 Results analysis and discussion

Model development, testing, and validation were performed using the WEKA software tool, with the support of MS Excel 2016 during the preparation of the dataset for model development. Starting from the 59 features selected in the feature selection process using the information gain method, the number of features was gradually reduced during model development, and the validation measures of each resulting model were compared. This process aims to develop a model that uses the smallest possible number of independent features without significantly degrading its performance.

Each model was validated by k-fold cross-validation with k = 10. The principle of operation of k-fold cross-validation, illustrated for k = 5, is shown in Fig. 10. Cross-validation is a statistical method for assessing the performance of machine learning models on new, unseen data, that is, on data that was not used in the learning phase. The dataset is divided into k parts, and the model is trained and validated k times. In each iteration, a different part is used to validate the model, while the remaining k − 1 parts are combined into a subset for model learning.
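The splitting scheme of k-fold cross-validation can be sketched as below. The helper `k_fold_indices` is hypothetical and, for brevity, neither shuffles nor stratifies the examples, which a real validation run would normally do.

```python
def k_fold_indices(n_examples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_examples))
    folds = [indices[i::k] for i in range(k)]  # k disjoint parts of near-equal size
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


for train, validation in k_fold_indices(n_examples=10, k=5):
    print(sorted(validation), len(train))
```

Every example is used for validation exactly once and for learning in the remaining k − 1 iterations.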

Fig. 10 Representation of k-fold cross-validation with k = 5 [58]

Tables 8, 9, 10, and 11 show the performance and the results of validation measures for a total of five models (M1,…, M5) with different numbers of independent features (M1: 59 features, M2: 48 features, M3: 33 features, M4: 13 features, and M5: 8 features). Features were removed in order of increasing IG value, starting with the lowest (Table 7). The initial dataset was divided in a 70/30 ratio, where 70% of the examples in the set were used for model learning, while 30% were used for model testing. This division, along with 60/40 and 80/20, is common in the development of models based on machine learning methods [58].

Table 8 Performance representation of the SHIoT device classification model
Table 9 Confusion matrix for classification model M4
Table 10 Overview of model validation measures (TPR and FPR)
Table 11 Overview of model validation measures (F-measure and precision)

The performance of a classification model based on machine learning needs to be expressed through several different measures, especially when the model is multiclass, given that each measure has advantages and limitations [59]. Accuracy is one of these measures. It represents the share of accurately classified examples in the set of all examples according to expression (13), where TP, TN, FP, and FN denote the numbers of true positive, true negative, false positive, and false negative examples.

$$Acc = \frac{TP + TN}{{TP + TN + FP + FN}}$$
(13)

where:

Acc proportion of accurately classified examples in the set of all examples;

TP number of true positive examples;

TN number of true negative examples;

FP number of false positive examples;

FN number of false negative examples.
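For a multiclass model, the share of accurately classified examples can be read directly from a confusion matrix as the sum of the diagonal divided by the total number of examples. The sketch below assumes that reading of accuracy; the matrix is hypothetical and does not reproduce the values of Table 9.

```python
def accuracy(confusion):
    """Share of correctly classified examples; rows are actual, columns predicted."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical 4x4 confusion matrix for classes C1..C4.
confusion = [[50, 2, 0, 0],
             [1, 45, 3, 0],
             [0, 2, 40, 1],
             [0, 0, 1, 55]]
print(accuracy(confusion))
```

In this toy matrix, 190 of the 200 examples lie on the diagonal.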

Table 8 shows that all models have approximately the same classification accuracy (≈ 99.8%). A drop in accuracy is noticeable only for the M5 model, which uses eight independent features: its classification accuracy is 99.71%, which corresponds to 336 misclassified examples. The table also shows a slight decrease in classification accuracy for the M4 model (99.7956%) compared to the M1 model (99.8075%), which uses all 59 features.

The kappa coefficient (κ) measures the agreement between the classes predicted by the observed model and the actual classes, corrected for the agreement expected by random selection [60]. The values of the kappa coefficient are interpreted on the interval [0,1], where κ = 0–0.2 is an extremely bad model, κ = 0.2–0.39 is a bad model, κ = 0.4–0.59 is a moderate model, κ = 0.6–0.79 is a good model, κ = 0.8–0.9 is a very good model, and κ = 0.9–1 is an excellent model. According to this scale and the results in Table 8, all models show excellent characteristics according to the kappa coefficient, while the M4 model shows a minimal deviation from the M1 model (0.0001) despite a significant reduction in the independent features used.
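The text does not give a formula for the kappa coefficient. The sketch below assumes the standard definition of Cohen's kappa, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The confusion matrix is hypothetical.

```python
def cohen_kappa(confusion):
    """Cohen's kappa for a square confusion matrix (rows actual, columns predicted)."""
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Chance agreement: product of the marginal shares, summed over the classes.
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical 4x4 confusion matrix for classes C1..C4.
confusion = [[50, 2, 0, 0],
             [1, 45, 3, 0],
             [0, 2, 40, 1],
             [0, 0, 1, 55]]
print(round(cohen_kappa(confusion), 4))
```

A model that classifies every example correctly reaches κ = 1, while a model that agrees with the actual classes only as often as chance would gives κ = 0.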

The accuracy of the model in predicting SHIoT device classes according to the traffic flow characteristics is also shown by the confusion matrix in Table 9. A confusion matrix is a performance measure for machine learning classification models whose output can be two or more classes, and it forms the basis for other performance measures. The confusion matrix in Table 9 shows the relation between the class affiliation of traffic flows predicted by the developed model and the actual class affiliation of the observed traffic flows. The number of traffic flows whose class affiliation is accurately predicted is high in relation to the number of instances whose class affiliation is incorrectly predicted.

Additional validation measures, the sensitivity or true positive rate (TPR) and the false positive rate (FPR), are shown in Table 10.

The true positive rate represents the share of accurately classified examples of a class in the set of all examples that actually belong to that class, according to expression (14). The false positive rate represents the share of examples incorrectly assigned to a class in the set of all examples that do not belong to that class, according to expression (15).

$$TPR = \frac{TP}{{TP + FN}}$$
(14)

where:

TPR true positive rate;

$$FPR = \frac{FP}{{FP + TN}}$$
(15)

where:

FPR false positive rate;
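Expressions (14) and (15) can be applied to each class of a multiclass model by treating that class as positive and all other classes as negative. The sketch below assumes this one-vs-rest reading; the confusion matrix is hypothetical.

```python
def one_vs_rest_counts(confusion, k):
    """Return (TP, FN, FP, TN) for class index k; rows are actual, columns predicted."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                  # actual class k, predicted otherwise
    fp = sum(row[k] for row in confusion) - tp   # predicted class k, actually otherwise
    tn = total - tp - fn - fp
    return tp, fn, fp, tn


def tpr_fpr(confusion, k):
    """Return (TPR, FPR) for class index k, expressions (14) and (15)."""
    tp, fn, fp, tn = one_vs_rest_counts(confusion, k)
    return tp / (tp + fn), fp / (fp + tn)


# Hypothetical 4x4 confusion matrix for classes C1..C4.
confusion = [[50, 2, 0, 0],
             [1, 45, 3, 0],
             [0, 2, 40, 1],
             [0, 0, 1, 55]]
for k in range(4):
    print(f"C{k + 1}", tpr_fpr(confusion, k))
```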

Table 10 shows that models M1 and M2 provide the best results according to the TPR measure for class C1, for class C2 all observed models provide the same results, while for classes C3 and C4 model M5 provides slightly worse results compared to the others. From the aspect of FPR measures, the M2 and M4 models provide better or equally good results compared to other models, with the M5 model providing the worst results.

Additional validation measures that show the quality of the classification model are the precision, or positive predictive value (PPV), and the F-measure, both shown in Table 11, as well as the ROC and PRC (Precision-Recall Curve) measures, whose values are shown in Table 12.

Table 12 Overview of model validation measures (ROC and PRC)

The measure of precision expresses the number of correctly classified examples of a class in relation to the total number of examples that the model assigned to that class, according to expression (16).

$$PPV = \frac{TP}{{TP + FP}}$$
(16)

where:

PPV positive predictive value.

According to the values expressed in Table 11, it can be seen that for class C1 the best results are given by model M1 while the worst results are visible with model M5. For classes C2 and C4, equally good results are observed for all models with the exception of the M5 model for the C2 class, while for the C3 class, the M2 and M4 models provide the best results.

The F-measure, or F1 score, is the harmonic mean of precision and TPR according to expression (17) [59]. According to [61], the harmonic mean is more intuitive than the classical arithmetic mean for averaging ratios.

$$F1 = \frac{{2\left( {PPV \cdot TPR} \right)}}{PPV + TPR}$$
(17)
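Expressions (16) and (17) can be checked with a small worked example. The counts used below are hypothetical and are not taken from Tables 9 or 11.

```python
def precision(tp, fp):
    """PPV = TP / (TP + FP), expression (16)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """TPR = TP / (TP + FN), expression (14)."""
    return tp / (tp + fn)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and TPR, expression (17)."""
    ppv, tpr = precision(tp, fp), recall(tp, fn)
    return 2 * ppv * tpr / (ppv + tpr)


# Hypothetical counts for one class: 50 true positives, 1 false positive, 2 false negatives.
print(precision(50, 1), recall(50, 2), f1_score(tp=50, fp=1, fn=2))
```

Substituting (14) and (16) into (17) gives the equivalent form F1 = 2TP / (2TP + FP + FN), which for these counts is 100/103.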

The calculated values of the F-measure shown in Table 11 indicate the M5 model as the worst observed from the aspect of classes C1 and C4 while the other models show approximately the same results.

Table 12 shows the values of the ROC and PRC validation measures. The area under the ROC curve (AUROC) is one of the most important and most frequently used measures of the quality of a classification model.

Although expressed in tabular form in this case, the ROC is a graphical representation of the relationship between the rate of true positive classifications (TPR) and specificity, i.e., the rate of true negative classifications (TNR = 1 − FPR). An example of a graphical representation of the ROC curve is shown in Fig. 11 for the M4 model. The area under the curve, AUROC, is interpreted as the average TPR value over all TNR values in the interval [0,1]. The closer the AUROC is to the value of 1, the better the performance of the classification model. A value of 0.5 represents model performance equal to random guessing [8]. The values shown in Table 12 indicate the excellent performance of all observed models, i.e., almost all AUROC values are very close to the value of 1.
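AUROC can be computed without drawing the curve by using an equivalent rank-based formulation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below assumes binary labels for one class against the rest; the scores are hypothetical.

```python
def auroc(scores, labels):
    """Share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical model scores for the positive class and the actual labels.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 0, 1, 0]
print(auroc(scores, labels))
```

Perfect separation of the two groups gives 1, and identical scores for all examples give 0.5, the value associated with random guessing.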

Fig. 11 ROC curve representation for classification model M4

An alternative measure to the ROC is the PRC (Precision-Recall Curve), which is often used for unbalanced datasets, because in the ROC measure a large change in the number of false positive examples may result in only a small change in the false positive rate. Since the PRC relates PPV and TPR, i.e., focuses on positively classified examples (TP and FP), it can better demonstrate the impact of many negative examples on model performance. Because the dataset was stratified in this research, the PRC measure gives almost the same results as the ROC measure for all the observed models.

According to the analysis of the results, the M4 model was selected as the optimal model of SHIoT device classification, considering that its performance according to all presented measures does not deviate significantly from other observed models (TPR 0–0.001, FPR 0–0.001, PPV 0–0.001, F1 0–0.001, ROC 0.0001–0.0003 and PRC 0–0.002) with a significant reduction in the independent features used. In the M4 model, a total of 13 independent features were used compared to the initial 59, which, according to the IG value, had some influence on the dependent feature. The independent features used are shown in Table 13.

Table 13 Independent features used in the process of developing the classification model

As can be seen from the table, the most relevant features are those related to the length of packets in the observed traffic flow (z11: total length of sent packets in the traffic flow; z12: total length of received packets; z13: maximum length of sent packets; z17: maximum length of received packets; z19: mean value of received packet length; z46: maximum packet length; z47: mean packet length; z49: packet length variation), followed by those related to packet interarrival times in the traffic flow (z25: maximum packet interarrival time in the traffic flow; z30: maximum time between two consecutive packets sent in the traffic flow). Features that provide information on the segments in the traffic flow (z61: average size of the received segment), as well as features that provide information on the amount of data transmitted in the sub-stream (z60: amount of data sent in the sub-stream; z71: amount of data received in the sub-stream), also proved to be relevant (Table 14).

Table 14 Notation used in paper

6 Conclusion and future work

The research presented in this paper provides a new approach to observing the behavior of IoT devices based on the network traffic they generate. The goal achieved by this research was to develop an effective model of IoT device classification in a smart home environment, based on the coefficient of variation of the ratio of received and sent traffic as a measure of device behavior predictability. The research rests on the scientific hypothesis that it is possible to define classes of IoT devices and to develop an effective classification model based on supervised machine learning methods by using the traffic characteristics generated by IoT devices in a smart home environment. The hypothesis was confirmed by defining four classes of devices based on the coefficient of variation of the ratio of received and sent traffic. The defined classes conditioned the development of the classification model of IoT devices.

The mentioned coefficient was named the Cu index and was chosen as the dependent feature for defining a total of four classes of SHIoT devices using the coefficient of variation classification method. Based on the defined classes of SHIoT devices, a multiclass classification model was developed using the boosting method of additive logistic regression as a machine learning method; according to all validation measures, the model shows high performance. The accuracy of assigning a SHIoT device to one of the defined classes based on independent traffic flow features is 99.79%. The relevant independent traffic flow features used in the development of the model were selected using the information gain method.

The research proved that, by applying boosting methods of machine learning, it is possible to assign new and unseen devices, and the traffic flows that they generate, to predefined classes with high accuracy. In addition, this approach responds to the needs of the newly created IoT environment, in which the number of devices is growing exponentially and it is not possible (or requires substantial resources) to know the traffic profile of each device, while it is sufficient to identify which class the device belongs to. This approach has the potential to lay the foundations for many other activities and research in the IoT problem domain. The detection of anomalies in the communication network caused by IoT devices is one of the future research directions that will use the findings and conclusions of this research. Besides further research, the developed model can have real-life applications as a software solution that extends the functionality of existing solutions for device and network monitoring and management in environments where many different IoT devices exist. Such a solution can help to monitor device groups that have similar communication patterns, manage their behavior in the network, and plan future communication capacities for various device classes or similar activities. A solution of this kind is usable only as a support process for various other activities. That is why future research, as well as future real-life applications and services, will build on the results achieved in this research.