Privacy Enhancing Techniques in the Internet of Things Using Data Anonymisation

The Internet of Things (IoT) and Industry 4.0 promise enormous benefits by enabling highly customised services and applications, which generate data of huge volume and variety. However, preserving privacy in IoT and Industry 4.0 against re-identification attacks is very challenging. In this work, we consider three main data types generated in IoT: context data, continuous data, and media data. We first propose a stream data anonymisation method based on k-anonymity for data collected by IoT devices, and then propose privacy enhancing techniques for both continuous data and media data in different IoT scenarios. The experimental results show that the proposed techniques preserve privacy well without significantly affecting the utility of the data.


Introduction
The Internet of Things (IoT) shows great promise in supporting a variety of applications, such as healthcare, smart manufacturing, Industry 4.0, and the smart home, which create huge volumes of data that need to be processed and shared, and which may contain sensitive information that must be protected before being shared with others. Alongside this great potential, however, there are significant security and privacy concerns and legal issues to be aware of (Zhang et al. 2015). The EU's General Data Protection Regulation (GDPR) outlines data protection and privacy requirements, addresses the transfer of personal data, and gives users more control over their personal data. In IoT, data anonymisation techniques are widely used to protect sensitive information and privacy related to personally identifiable information by erasing or encrypting identifiers that connect an individual to stored data. The GDPR aims to simplify the regulatory environment for business so that both individuals and businesses can fully benefit from the digital economy. It also permits businesses to collect anonymised data without consent, use it for any purpose, and store it for an indefinite time (Otgonbayar et al. 2016; Da Xu et al. 2014).
The most recent data linkage technologies, such as big data analytics, are able to establish links between datasets created by different components in IoT (Zhang et al. 2015; Zhang and Chen 2020). In an IoT environment, massive numbers of smart devices are connected and used to gather data; this comes with security and privacy risks, and breaches of sensitive data might be caused by malicious behaviour or carelessness. In addition, resource-limited IoT devices compromised through vulnerabilities can cause disclosure of critical data (e.g., in healthcare, the smart home, etc.) (Viriyasitavat et al. 2019).
In IoT scenarios, massive amounts of data are gathered, and much of this data is digitised and stored online (e.g., on cloud servers). Although much of it is not made public, there are authorised accesses and the threat of hackers looking to steal data, often with malicious intent. Virtually everything related to a user, both online and offline, can be tracked in the form of data, such as doctor visits, interactions with companies, browsing habits, and app use. If a user lives in a 'smart home', then daily actions, such as the use of kitchen appliances, temperature settings, and clothing choices, can be tracked as well. Some of this data might not be directly linked to a user; privacy problems arise when information is tied to personally identifiable information (PII), including email address, name, IP address, and location (Ma et al. 2020; Zhou et al. 2019). Data anonymisation does not mean that it is always impossible to discover the identity of the subject; many anonymisation techniques are reversible. For example, hashed data can be de-anonymised by guessing inputs until a matching hash is found. Even with irreversible suppression, probably the most fail-safe method of anonymisation, the remaining data can be cross-referenced with other datasets to identify the source.
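To illustrate why hashing alone is weak anonymisation, the guessing attack mentioned above can be sketched in a few lines of Python (the phone-number format and the small candidate space are illustrative assumptions, not from any real dataset):

```python
import hashlib

def deanonymise_hash(target_hash, candidates):
    """Brute-force a hashed identifier by hashing each candidate value
    and comparing it against the target digest."""
    for value in candidates:
        if hashlib.sha256(value.encode()).hexdigest() == target_hash:
            return value
    return None

# A hashed phone number is recoverable because its input space is small
# enough to enumerate exhaustively.
hashed = hashlib.sha256(b"07700900123").hexdigest()
recovered = deanonymise_hash(hashed, ("07700900%03d" % i for i in range(1000)))
```

The same attack applies to any hashed quasi-identifier whose value space can be enumerated, which is why salting alone does not make hashing a safe anonymisation technique.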
Data anonymisation is a valuable privacy-preserving tool that can anonymise identifiers by removing, substituting, distorting, generalising, or aggregating them. Broadly, user identifiers include direct identifiers and indirect identifiers. Direct identifiers are attributes that can directly identify a user, such as name, address, or photo, while indirect identifiers are attributes that can identify a user when linked with other available datasets or information, such as age, salary, or occupation. Many IoT applications collect as much data as possible in order to improve their services and develop new products. However, this significantly increases the risk of data loss or accidental data breaches. Many popular applications, such as Facebook, Twitter, TikTok, Skyscanner, Hungryhouse, Trainline, and smart-home apps, collect key information such as the user's name, address, credit card details, timeline, and patterns of app use, which may be re-associated with the data at a later time to identify personally identifying information. Among the arsenal of data analysis tools available, data anonymisation is highly recommended to satisfy the requirements of the GDPR wherever data is identifiable before being shared with a third party or the public.
In this work, we consider three main data types generated in IoT: context data, continuous data, and media data. The main contributions are: 1. A stream data anonymisation method for IoT devices based on k-anonymity, which is able to continuously enforce k-anonymity over data streams in IoT scenarios. 2. Multi-modal data anonymisation techniques for continuous datasets and media data (images and video) in IoT scenarios. 3. Experimental results that show the effectiveness of the proposed privacy enhancing techniques.
The remainder of this paper is organised as follows: Section 2 discusses the state of the art of data anonymisation in IoT; Section 3 proposes anonymisation schemes for attribute data, continuous data, and visual data generated by IoT systems; Section 4 uses real data to evaluate the proposed schemes; and Section 5 concludes the paper.

Related Works
A data breach may cause an organisation significant financial and reputational loss. In the past decade, IoT security has attracted a great deal of research attention, and a number of privacy-preserving techniques have been developed for protecting data generated by IoT devices (Ouazzani and Bakkali 2018), including data anonymisation, pseudonymisation, and de-identification.

Data Anonymisation Techniques
In the past decades, a number of privacy-preserving mechanisms have been implemented through data masking, pseudonymisation, generalisation, swapping, perturbation, synthetic data, etc. The key features of these techniques are summarised as follows: - Data masking: using character modification (such as shuffling, substitution, or encryption) to hide data, which guards against reverse-engineering or detection attacks (Gope and Sikdar 2019); - Pseudonymisation: replacing private identifiers with fake identifiers or pseudonyms to hide key identifiable information, which preserves statistical accuracy and data integrity (Somolinos et al. 2015; Faldum 2007); - Generalisation: removing part of the data to make it unidentifiable, e.g., values can be altered into a set of ranges (Deldar and Abadi 2019; Yaseen et al. 2018); - Perturbation: hiding sensitive data patterns by adding crafted random noise to prevent privacy data mining attacks (Amar et al. 2018); - Synthetic data: using algorithms to create an artificial dataset with specific statistical patterns or models instead of altering the original dataset (El Emam 2020).
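As an illustration of the pseudonymisation technique above, the following sketch replaces a direct identifier with a consistent random pseudonym; the record layout and the `user-` prefix are hypothetical, and the mapping table would itself need to be protected or discarded:

```python
import secrets

def pseudonymise(records, identifier_field):
    """Replace a direct identifier with a random pseudonym, keeping a
    mapping so the same identifier always receives the same pseudonym
    (which preserves linkability and hence statistical utility)."""
    mapping = {}
    out = []
    for rec in records:
        rec = dict(rec)  # do not mutate the caller's records
        ident = rec[identifier_field]
        if ident not in mapping:
            mapping[ident] = "user-" + secrets.token_hex(4)
        rec[identifier_field] = mapping[ident]
        out.append(rec)
    return out, mapping

records = [{"name": "Alice", "temp": 21.5}, {"name": "Alice", "temp": 22.0},
           {"name": "Bob", "temp": 19.8}]
pseudo, mapping = pseudonymise(records, "name")
```

Note that pseudonymised data remains personal data under the GDPR as long as the mapping exists, which is why it is listed separately from full anonymisation.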
Specifically, a number of k-anonymity based data anonymisation techniques have been developed. Gionis et al. improved k-anonymisation by proposing a three-measure method for capturing the amount of information that is lost during anonymisation (Gionis and Tassa 2009).
However, there are still many challenges in data anonymisation. In IoT applications, a huge number of smart sensors and devices are connected that continuously generate data for monitoring activities, status, etc., and IoT applications can be designed to access these data. As discussed above, these data may include sensitive values or specific data patterns that the data owners wish to keep private. To preserve privacy in complex IoT environments, we need to develop intelligent and automatic data anonymisation techniques that can hide key data attributes and potential patterns in raw data.

Recent Advances in Data Anonymisation
In the past decades, k-anonymity, l-diversity, and t-closeness based methods have been widely used in cloud-based applications to protect sensitive information. In a dataset, some attributes are not unique identifiers by themselves, but can form a unique identifier when correlated with other attributes; we call these quasi-identifiers (QIs). In dataset anonymisation, both direct identifiers and QIs should be anonymised, since an attacker may be able to identify individuals by linking QI attributes with external dataset(s) containing direct identifiers. For a dataset D, if each combination of QI attribute values is shared by at least k records, we say D has the k-anonymity property. To achieve l-diversity, D needs to 'well represent' at least l values of the sensitive attribute in each equivalence class; and D satisfies t-closeness if the distance (e.g., the Earth Mover's Distance (EMD)) between the distribution of a sensitive attribute within each equivalence class and its distribution in the whole dataset is no more than a threshold t.
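The k-anonymity and l-diversity properties described above can be checked mechanically. A minimal sketch, using a hypothetical toy table of generalised health records:

```python
from collections import Counter, defaultdict

def satisfies_k_anonymity(rows, qi_fields, k):
    """True if every combination of quasi-identifier values is shared
    by at least k records."""
    counts = Counter(tuple(r[f] for f in qi_fields) for r in rows)
    return all(c >= k for c in counts.values())

def satisfies_l_diversity(rows, qi_fields, sensitive, l):
    """True if every equivalence class (records sharing the same QI
    values) contains at least l distinct sensitive values."""
    classes = defaultdict(set)
    for r in rows:
        classes[tuple(r[f] for f in qi_fields)].add(r[sensitive])
    return all(len(v) >= l for v in classes.values())

rows = [
    {"age": "20-30", "zip": "BS1*", "diagnosis": "flu"},
    {"age": "20-30", "zip": "BS1*", "diagnosis": "cold"},
    {"age": "30-40", "zip": "BS2*", "diagnosis": "flu"},
]
```

On this table, the first two rows form a 2-anonymous, 2-diverse equivalence class, while the third row is a singleton that would need further generalisation or suppression.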
Microaggregation is a perturbative data protection method (Shi et al. 2018) in which the dataset is partitioned into small clusters and each original record is replaced by the centroid of its cluster (each cluster should have between k and 2k elements); the larger the k, the larger the information loss and the lower the disclosure risk. It ensures k-anonymity only when multivariate microaggregation is applied to all the variables of the dataset (Mahawaga Arachchige et al. 2020; Du et al. 2020).
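A minimal univariate microaggregation sketch, assuming a simple sort-and-partition heuristic rather than an optimal clustering: records are grouped into clusters of between k and 2k elements and each value is replaced by its cluster mean, at their original positions:

```python
def microaggregate(values, k):
    """Univariate microaggregation: sort the indices, split them into
    groups of size k (the final group absorbs any remainder, so every
    group has between k and 2k - 1 elements), and replace each value
    by its group mean."""
    ordered = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(ordered):
        # If fewer than 2k items remain, put them all in one last group.
        size = k if len(ordered) - i >= 2 * k else len(ordered) - i
        group = ordered[i:i + size]
        mean = sum(values[j] for j in group) / len(group)
        for j in group:
            out[j] = mean
        i += size
    return out
```

Because every output value is shared by at least k records, the single perturbed attribute is k-anonymous; as the text notes, a full dataset is only k-anonymous if all variables are microaggregated jointly.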
Li et al. proposed a stream k-anonymity scheme to continuously enforce k-anonymity on data streams (Li et al. 2008); however, it does not handle data well in diverse IoT scenarios. In Pervaiz et al. (2015), a data stream scheme was proposed for privacy protection in access control, in which k-anonymity or l-diversity is used to generalise stream data. Domingo et al. explored a unified and conceptually simple stream data anonymisation approach using microaggregation (Domingo-Ferrer et al. 2019), a fine-grained data aggregation that supports both static datasets and dynamic data streams. Khavkin et al. proposed a stream anonymisation scheme based on microaggregation that defends against differential privacy attacks; it includes an algorithm that satisfies k-anonymity and recursive (c, l)-diversity, aiming at minimising information loss and reducing data disclosure risks (Khavkin and Last 2018; Soria-Comas et al. 2017).
Differential privacy (DP) is a strong privacy protection scheme that aims to guarantee bounds on how much information can be revealed by the participation of an individual record in a database (Soria-Comas et al. 2017). Specifically, ε-differential privacy was proposed to measure relative privacy: for any two datasets D_1 and D_2 differing in at most one record and a randomised algorithm κ, for all S ⊆ Range(κ), Eq. 1 holds (Wang et al. 2018):

Pr[κ(D_1) ∈ S] ≤ e^ε · Pr[κ(D_2) ∈ S].   (1)
Recently, DP has been widely used to protect privacy in machine learning (Phan et al. 2017). In many IoT scenarios, anonymising personal data alone is not enough to protect privacy, because even a heavily incomplete dataset still carries a risk of re-identifying a specific individual. It has been shown that even anonymised datasets can be traced back to individuals using machine learning (Rocher et al. 2019; Yang et al. 2020; Lu and Ning 2020).
Recently, blockchain technologies have also been introduced to address security and privacy issues in both IoT and Industry 4.0 (Gorkhali et al. 2020; Aceto et al. 2020; Yli-Ojanperä et al. 2019; Xu et al. 2018). This work focuses on stream data anonymisation in IoT environments using existing k-anonymity based techniques. k-anonymity can guarantee the privacy of data only when the following conditions hold: (1) the sensitive data must not reveal information that was redacted in the generalised columns; (2) the values of the sensitive columns are not all the same for a particular group of k records; (3) the dimensionality of the data must be sufficiently low. Data generated by applications or IoT devices may contain sensitive PII, and modern big data analytics techniques are able to learn key PII from such data.

Data Anonymisation in IoT
In most IoT scenarios, the collected data mainly falls into three categories: (1) normalised context data collected by IoT sensors and devices, such as temperature, flow, pressure, and humidity, often collected using proprietary formats and protocols depending on the source; (2) continuous data obtained via sensors, collected using appropriate communication protocols such as MQTT, which retains real-time characteristics; (3) media data: many IoT devices, such as IP cameras, can collect media data, including audio, image, video, and text data, in real time.
In the past decades, much research effort has been devoted to the trade-off between data utility and the security of this information. (1) In statistical data analysis, adding noise is the most common way to maintain some statistical invariants; however, it may compromise the integrity of the data, and it is not easy to add random noise without significantly reducing utility. (2) Data is usually stored on devices that may have different security levels, which creates a risk of data disclosure, since we cannot anticipate every potential attack an IoT system may face. Another issue is that many data holders share the same data but may have different security or privacy concerns, which makes it very challenging to balance these concerns. (3) In IoT systems, sophisticated user/device authentication is needed to ensure proper access control, and access control policies need to be well designed to protect private information.
In this work, we present an IoT data anonymisation scheme as shown in Fig. 1. The data generated by sensors in IoT is aggregated into a data stream, which is analysed to identify sensitive information; the data is then anonymised using secure anonymisation algorithms before being sent to apps or released to the public.

IoT Data Pre-processing
We first introduce the data pre-processing performed before analysing attributes that might contain sensitive information. In an IoT system, the data collected by a sensor may be intermittent, so the time-span gaps need to be levelled before further processing. Suppose a sensor node N_i receives n input data items in every time window w_j of t seconds, with each item of size d and each item requiring p seconds to process. Then the total processing time needed is n × p seconds; if n × p < t, there is no issue. However, if n × p > t, i.e., the time needed to process the input is greater than the time between two consecutive batches of data, we need multiple windows. In this work, we use a variable Δ for the number of parallel windows, so the problem is simplified to (n × p)/Δ < t.
For an IoT gateway G, we introduce a factor ζ to denote the number of threads, which balances CPU and RAM utilisation. The larger ζ is, the more CPU computation resources are used. When data is received, the IoT device needs ζ windows to process it; hence there are ζ active windows and n − ζ pending items in the queue. In this work, we use r to denote the number of batches in memory; one batch can be processed by w windows at a time, and the size of a batch is ζ × d. Here r is used to balance the data stream speed against the bandwidth. A fixed-length time window of data, X = {x_sj} ∈ R^{M×W}, contains attributes together with specific patterns that can be utilised to identify a specific user.
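The window-sizing arithmetic above can be sketched as follows, assuming the number of parallel windows is the free variable to be solved for (the numerical rates are illustrative):

```python
import math

def windows_needed(n, p, t):
    """Minimum number of parallel windows so that n items, each taking
    p seconds to process, fit within a t-second window: the smallest
    integer delta satisfying (n * p) / delta < t."""
    if n * p < t:
        return 1  # a single window keeps up with the input rate
    return math.floor(n * p / t) + 1

# 100 items at 0.5 s each in a 10 s window: 50 s of work -> 6 windows.
demand = windows_needed(100, 0.5, 10)
```

Once the number of windows is known, the batch size ζ × d and the queue length n − ζ from the gateway model above follow directly.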
Definition 1 (Data stream). DS = {pid, X_t} is an IoT data stream generated by a smart sensor, in which pid denotes the identifier and X_t denotes the dataset, which includes both a quasi-identifier attribute set and other attributes, and t is the time window. In this work, t is adjustable, and the quasi-identifier attributes in X_t need to be anonymised before publishing.
Definition 2 (Anonymisation algorithm). An anonymisation algorithm A transforms a dataset or stream DS into an anonymised data stream DS' that can be shared with third parties or released publicly.
The collected data can be clustered into multiple clusters using specific algorithms. The following method can be used to form a data cluster C from the data stream DS (Otgonbayar et al. 2016).
Definition 3. A cluster C built from DS satisfies k-anonymity if the number of distinct tuples |C| in the cluster is at least k (|C| ≥ k); in this case we say C is generalised and the cluster C satisfies k-anonymity.
Definition 4. For two tuples in the same cluster C, t_1 ∈ C and t_2 ∈ C, the distance between t_1 and t_2 can be calculated by

d(t_1, t_2) = (1/|X^q_1|) Σ_{i=1}^{|X^q_1|} d_i,   (5)

where |X^q_1| (|X^q_2|) is the number of quasi-identifier attributes in X^q_1 (X^q_2), and d_i is the normalised distance between the numerical values of the i-th quasi-identifier attribute.
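A sketch of this distance computation for numerical quasi-identifiers, assuming each per-attribute distance d_i is normalised by the attribute's value range (the attribute values and ranges are illustrative):

```python
def tuple_distance(xq1, xq2, ranges):
    """Average of per-attribute normalised distances between the
    quasi-identifier values of two tuples; ranges[i] is the span of
    attribute i, used to normalise each difference into [0, 1]."""
    assert len(xq1) == len(xq2) == len(ranges)
    d = [abs(a - b) / r for a, b, r in zip(xq1, xq2, ranges)]
    return sum(d) / len(d)

# QIs (age, salary): ages differ by 10 over a 100-year range,
# salaries are equal, so the average normalised distance is 0.05.
dist = tuple_distance([30, 50000], [40, 50000], [100, 100000])
```

This distance is what the clustering step later uses to find the k − 1 nearest neighbours of an expiring tuple.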
In data anonymisation, the quality of an anonymisation algorithm is usually measured by the average information loss defined in Eq. 6.

Definition 5 (Average information loss). The average information loss of the first N tuples from DS is

AvgInfoLoss(N) = (1/N) Σ_{i=1}^{N} InfoLoss(g_i),   (6)

in which g_i is the generalisation of tuple t_i.

k-Anonymity for Data Streams
Given an original data stream S_org = {<pid_1, X_1>, <pid_2, X_2>, ..., <pid_n, X_n>}, where each tuple t_i comprises a vector of identifier attributes pid_i and values X_i, an output stream S_out satisfying the k-anonymity property with respect to the QIs is produced. Moreover, the order of S_out may deviate from that of the input stream.
Each tuple t_i = (id, q_1, ..., q_m, z) includes the identity id, the QIs q_1, ..., q_m, and a sensitive attribute z. s_i = (q_1, ..., q_m, z) is the anonymised t_i, in which the id has been pruned. Stream anonymisation may cause delay and extra temporary data that can degrade the performance of data processing.

Identifier Group
If a set of tuples <pid_i, X_i> ∈ S_org have the same values on the QI attributes, these tuples form a QI group g_i. For a specific group g_i, we use a QI detection algorithm to check whether the QI attributes meet the k-anonymity requirement: if |g_i| ≥ k, the group already satisfies k-anonymity; if |g_i| < k, the tuples in g_i need to be further generalised before release.

Classification of Attributes
In IoT applications, key attributes in a dataset can be used to identify a subject, such as name, address, and mobile number, which can uniquely identify an individual directly and should be hidden before release. In a healthcare application, key attributes could be date of birth, sex, postcode, etc. Sensitive attributes include medical records, wage, home address, bank account, etc.

Proposed IoT Based Anonymisation Algorithm
In this work, we classify IoT data into three categories: (1) attribute datasets, which include different attributes in each record; (2) continuous datasets, which include continuous data samples, such as monitoring values, motion data, etc.; (3) image datasets, which include images and videos (Fig. 2).
First, the data captured by an IoT device is clustered into partitions, and the k − 1 nearest neighbours of an expiring tuple t are found according to the distances in Eq. 5; a cluster is created over the identified tuples. Then, the expiring tuple is anonymised using a reusable k-anonymity cluster defined over the same partition that covers t; the k-anonymisation is guided by the information loss calculated by Eq. 6.

Attributes Dataset Anonymisation
In this scenario, most existing anonymisation techniques (k-anonymity, l-diversity, t-closeness, etc.) can be used to anonymise identifiable attributes. The following main steps are involved: 1. Determine the release model, which defines how the anonymised dataset will be released; 2. Determine the acceptable re-identification risk threshold and utility, which are used to define the anonymisation parameters in the algorithms; 3. Classify the IoT data attributes: in this process, attributes need to be labelled as direct identifiers, indirect identifiers, or non-identifiers; 4. Remove unused attributes: since in IoT some attributes may be missing or anomalous data may be collected, this process removes all unused data attributes; 5. Anonymise direct and indirect identifiers by applying techniques such as k-anonymity, l-diversity, t-closeness, or a combination of these; 6. Evaluate the risk or anonymisation quality and, if needed, adjust the parameters and repeat steps 5 and 6; 7. Examine the utility of the anonymised dataset: if the utility is sufficient, it can be released; if not, the anonymisation process needs to be redesigned.
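Steps 3-5 above can be sketched as a simple record-level transformation; the attribute names and the generalisation rules (decade age bands, truncated postcodes) are illustrative assumptions, not a prescribed policy:

```python
def anonymise_record(rec, direct_ids, generalisers):
    """Suppress direct identifiers and generalise indirect identifiers;
    generalisers maps an attribute name to a coarsening function, and
    any attribute in neither set is treated as a non-identifier."""
    out = {}
    for key, value in rec.items():
        if key in direct_ids:
            continue  # step 5: suppress direct identifiers
        elif key in generalisers:
            out[key] = generalisers[key](value)  # step 5: generalise QIs
        else:
            out[key] = value  # non-identifier, kept as-is
    return out

generalisers = {
    "age": lambda a: f"{(a // 10) * 10}-{(a // 10) * 10 + 9}",
    "postcode": lambda p: p[:3] + "*",
}
rec = {"name": "Alice", "age": 34, "postcode": "BS161QY", "reading": 21.5}
anon = anonymise_record(rec, {"name"}, generalisers)
```

Steps 6 and 7 would then measure the re-identification risk and utility of the transformed records and tune the coarsening functions accordingly.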
An IoT system performs data anonymisation using an anonymisation engine that defines the anonymisation parameters and hosts the anonymisation algorithms that carry out the anonymisation processes. Figure 1 provides the main architecture of the proposed data anonymisation scheme in an IoT environment. To anonymise the data, we first use the anonymisation engine to define the privacy requirements. The anonymisation can then be conducted using Algorithm 1, in which we assume that the device owner and the device together are the data owner. Let s and v denote the data owner and the data user, respectively. The input is the data D created by the devices, and the output D' is the anonymised data.
Streaming IoT data often involves a large amount of data and a large number of devices, which need to be carefully processed. The data rate may change with the environment, and the data stream may need to be buffered or aggregated before transmission to the cloud. Many existing stream processors, such as Apache Spark, Storm, and Flink, have been developed to perform MapReduce-style processing on streaming data in real time. Some tools, like Kafka Streams and Apache Samza, can aggregate multiple data streams into a larger or more complex stream, while time-series stores such as InfluxDB, TimescaleDB, and Cassandra can hold the results. Two components are involved: 1. Aggregate process: an aggregator performs aggregation by reading time-series values from the short-term buffer, writing the sums to aggregate storage, and then deleting the aggregated raw values; 2. Rule engine: the rule engine defines and maintains rules and stores them in the same database as the short-term buffer, e.g., a rule attribute such as "triggerInterval": "1m".
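A minimal sketch of the aggregate process above, assuming readings arrive as (timestamp, value) pairs and are summed into fixed intervals (here matching a hypothetical "triggerInterval" of one minute):

```python
from collections import defaultdict

def aggregate_stream(readings, interval_s=60):
    """Sum time-series readings into fixed intervals: read values from
    the short-term buffer, write the per-interval sums to aggregate
    storage, after which the raw points can be discarded."""
    sums = defaultdict(float)
    for ts, value in readings:
        bucket = ts - (ts % interval_s)  # align to interval start
        sums[bucket] += value
    return dict(sums)

# Four readings over two minutes collapse into two interval sums.
readings = [(0, 1.0), (30, 2.0), (60, 3.0), (90, 4.0)]
agg = aggregate_stream(readings, interval_s=60)
```

In a production stream processor this windowed sum would be expressed with the framework's own windowing operators, but the buffering-then-summarising pattern is the same.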

Continuous Dataset Anonymisation
In many IoT applications, including e-healthcare, body sensor networks, and location-based services, IoT devices can capture continuous motion data, location data, and continuous bio-signals, which may include private information about users' health status, behaviours, activities, locations, etc., that might be identifiable. To protect this kind of information, this section discusses ways to anonymise continuous data streams.
One of the challenges is identifying the user-identifiable patterns in the dataset; researchers have recently shown that data patterns can be used to create fine-grained behavioural profiles of users that can reveal their identity (Neverova et al. 2016; Malekzadeh et al. 2019). For continuous data anonymisation, we propose Algorithm 2.
In continuous data anonymisation, adding Laplace noise to continuous or unbounded data is an effective way to improve the randomness of the data against pattern learning, as shown in Eq. 7. Laplace noise preserves differential privacy by adding randomness while keeping the utility of the data; adding Gaussian noise is another effective way to protect privacy.
f(x | μ, λ) = (1/(2λ)) exp(−|x − μ|/λ),   (7)

in which μ is the position of the distribution peak and the non-negative λ is the exponential decay. In practice, μ and λ need to be carefully chosen depending on the model. As an example, in this work we use data extracted with a wearable IoT system for human behaviour detection: we first extract the continuous data for the activities 'walking', 'jogging', and 'upstairs', and use a deep learning network to train on and extract features of these activities. We directly use the models trained in our previous work and apply Algorithm 2 to anonymise the 'walking' data series; the results can be found in Fig. 6.
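A sketch of Laplace perturbation for a continuous series, using the standard inverse-CDF sampler for the distribution in Eq. 7 (the series values and the choice of λ are illustrative, not the parameters used in the experiments):

```python
import math
import random

def laplace_noise(mu, lam):
    """Sample from Laplace(mu, lam) via the inverse-CDF transform:
    x = mu - lam * sgn(u) * ln(1 - 2|u|) with u uniform on (-0.5, 0.5)."""
    u = random.random() - 0.5
    return mu - lam * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def anonymise_series(series, lam, mu=0.0):
    """Perturb each sample with Laplace noise; a larger lam hides
    behavioural patterns more strongly but lowers utility."""
    return [x + laplace_noise(mu, lam) for x in series]

random.seed(0)  # for reproducibility of the sketch
noisy = anonymise_series([0.1, 0.2, 0.3], lam=0.05)
```

The zero-mean noise keeps aggregate statistics such as the series mean approximately intact, which is why the anonymised signals in Fig. 6 remain usable for activity-level analysis.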

Visual Data Anonymisation
In many IoT applications, such as smart cities and intelligent transport systems, devices often collect visual data that may include personal data. In this section, we develop a deep learning based video anonymisation scheme that can remove private data from images and videos in IoT applications. This can also help IoT applications comply with the GDPR (Xiao et al. 2019).
The idea is to perform privacy information detection on images or videos using Yolov3 and then conduct image anonymisation. Privacy information detection is a task that involves identifying the presence, location, and type of one or more privacy-sensitive data items. For example, in a road surveillance system, the images/video from a surveillance camera can capture vehicles, pedestrians, buildings, etc.; the registration number, the face of a pedestrian, or a building location/number may be information that the owner does not want to share with the public. Figure 3 shows the image anonymisation process. The first step is to identify the private information using Yolov3, a popular image object detection algorithm that can perform real-time, accurate detection. It uses the Darknet-53 architecture, a 53-layer network trained on ImageNet. Yolo is a fully convolutional network whose eventual output is generated by applying a 1×1 kernel on a feature map; in v3, detection is done by applying 1×1 detection kernels on feature maps of three different sizes at three different places in the network.
Video anonymisation can be performed using Algorithm 3, in which M is an object model trained using the Yolov3 darknet. In practice, depending on the specific privacy information or object (e.g., facial features for humans, registration numbers for vehicles, house numbers in street views), a pre-trained model can be used to detect the objects carrying private information in an image or video.
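A minimal sketch of the final redaction step of Algorithm 3, assuming the detector (e.g., a Yolov3 model) has already returned bounding boxes; here the regions are simply blacked out in a toy greyscale frame represented as a list of pixel rows:

```python
def redact_regions(image, boxes):
    """Black out detected privacy regions (e.g. plates or faces) in a
    greyscale image given as a list of rows; each box is
    (x, y, width, height), as an object detector would return."""
    for x, y, w, h in boxes:
        for row in range(y, min(y + h, len(image))):
            for col in range(x, min(x + w, len(image[0]))):
                image[row][col] = 0
    return image

frame = [[255] * 6 for _ in range(4)]   # 6x4 all-white toy frame
redacted = redact_regions(frame, [(1, 1, 2, 2)])
```

In a real pipeline the blackout would typically be replaced by Gaussian blurring or pixelation of the bounding box, but the structure, detect then overwrite the region, is the same.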

Evaluation
As mentioned above, this work considers three types of data: (1) context data; (2) continuous data; and (3) media data. Based on these three scenarios, we evaluated the performance from different angles.
Figure 4a-f shows the anonymisation performance when k and l change from 2 to 7 and the number of QIs changes from 2 to 5, in which the blue line shows k-anonymity, the green line shows l-diversity, and the red line shows t-closeness.
To evaluate the performance of anonymisation, we fixed the QIs as {age, education-num, sex, race} and used k = l ranging from 2 to 10; the elapsed time (in seconds) can be found in Fig. 5.

Continuous Dataset Anonymisation
To evaluate the effectiveness of anonymisation for continuous data, we use a segment of walking-motion accelerometer raw data (x, y, z) acquired using wearable body sensor networks, as shown in Fig. 6. Figure 6a, c, and e show the raw data in the x, y, and z directions, respectively, while Fig. 6b, d, and f show the anonymised versions of Fig. 6a, c, and e. In the anonymisation procedure, we use Laplace noise to hide the pattern information, and the parameters of the noise generation are chosen depending on the specific patterns; in this work, the noise for each direction was derived based on the walking pattern.

Visual Data Anonymisation
For media data, as discussed in Section 3.3, we use the proposed algorithm to anonymise a video clip from a transport surveillance system. We first used the VOC and car registration-number datasets to train a model that can recognise vehicle registration numbers, pedestrians, etc.; the Yolov3 darknet performs the object detection. We define the vehicle 'registration number' and 'facial features' as two key pieces of private information that the data owner might not want to share with others. The darknet is used to detect registration plates or faces, and an image anonymisation algorithm is then run to anonymise the bounding box over the licence plate or face. We used the proposed model to anonymise a video clip, and Fig. 7 shows the anonymised plate numbers and facial features.

Conclusion and Discussion
IoT systems are increasingly gaining importance in both daily life and industrial applications, creating substantial opportunities for users together with huge volumes of varied data. This work introduced a data anonymisation scheme that can anonymise data streams before data is stored or shared with other systems or organisations, without leaking user privacy or other confidential information. The experiments show that the proposed scheme can effectively anonymise data streams in IoT.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Jing Du

Associate Researcher, China Information Security Evaluation Center. She graduated from Xi'an University of Electronic Science and Technology with a master's degree. She is mainly engaged in risk assessment of party and government offices and nationally important information systems, data security assessment, and network security inspection of critical information infrastructure.

Na Wang
Research Assistant, China Information Technology Security Evaluation Center. She has a bachelor's degree in computer science and technology. She is mainly engaged in risk assessment, product evaluation, network security inspection of critical information infrastructure and quality management of information system.

Shancang Li

He is an associate professor at the Department of Computer Science and Creative Technology at the University of the West of England, Bristol, UK. His research interests include IoT security, lightweight crypto, digital forensics, and cyber security.