A systematic construction of non-i.i.d. data sets from a single data set: non-identically distributed data

Data-driven models strongly depend on data. Nevertheless, for research and academic purposes, public data sets are usually considered and analyzed. For example, most machine learning algorithms are applied and tested using the UCI Machine Learning repository. Distributed machine learning currently needs data sets that are not i.i.d., that is, whose instances are not independent and identically distributed. Federated learning is an example of this need. Its typical scenario considers a set of agents, each with its own data set. Agents are typically heterogeneous and, because of that, it is not appropriate to assume that their data follow the same distribution. In this paper we propose an approach to build non-identically distributed data sets from a single data set for machine learning classification, whether or not we suppose that all instances of the original data set follow the same distribution. Each device will have instances of only a subset of the classes. The approach uses optimization to distribute the data set into a set of subsets, each one following a different distribution. Our goal is to define an approach for building subsets for training that is as systematic as the approaches used for cross-validation/k-fold validation.


Vicenç Torra (vtorra@cs.umu.se), Department of Computing Science, Umeå University, Umeå, Sweden

Introduction
To properly compare and evaluate new machine learning algorithms [7], testing and validation need to be done on public data. This makes it possible to reproduce results and to determine the performance of the algorithms in a fair way. This is in fact the case for all data-driven methods. Publicly available repositories, such as the UCI Machine Learning repository [5], have been used for this purpose for a long time.
Recent research in distributed machine learning has the same requirements. In this case, data sets need to satisfy additional constraints. One of them is that standard assumptions on data sets do not hold. In standard machine learning and statistical learning it is usual to assume that data sets contain observations that are independent and identically distributed. This is usually expressed as i.i.d. The i.i.d. assumption is convenient for the statistical properties of the methods [3]. Unfortunately, this assumption cannot be considered true in general in distributed machine learning. More particularly, it cannot be considered true in federated learning (see e.g. [9]).
Federated learning [1,2,9,12] is a distributed machine learning framework in which a set of agents collaborate in a machine learning task. Each agent has (or is) a device, and the data is naturally distributed and stored in each device. Typically, the goal is to build a classification model based on the devices' data. Most often, this model is a deep learning model and, thus, is internally represented by the weight matrices associated with the model. Federated learning assumes that there is a server that leads the process of building the machine learning model. This model is built in a distributed way. The server bootstraps the process with an initial model, which is transmitted to the devices. Then, the devices use their own data to locally train the received model and transmit to the central server the difference between their own model and the one they have received. This process is repeated until the central model converges. According to this standard procedure, the data from each device is not transmitted to the central server and, instead, is kept private in the device. Because of that, the approach is more respectful of the users' privacy. This has made federated learning a hot research topic. Different research directions exist for federated learning, including research on more efficient algorithms as well as on algorithms that are private by design (e.g., PySyft [10]).
Federated learning usually considers that the devices are heterogeneous. Heterogeneity comes in different flavors. The research literature shows that its most important source is the one related to the computational capabilities of the devices. That is, they are typically resource-constrained. In particular, this refers to communication capabilities (and access to the internet), computational power, and memory and storage. In addition, the research literature [9] also discusses that agents are heterogeneous with respect to the data they keep in their devices. This is modeled by considering that the data is not i.i.d. In other words, different devices have data that have been generated using different (random) processes. A typical scenario in federated learning consists of a population of mobile phones that gather data related to texting (SMS messaging, social network messaging, etc.), where the central machine learning model learned using the decentralized approach is for predictive text. It is clear that different devices contain different types of data because agents (i.e., people) use, e.g., different (natural) languages. So, at least the textual data gathered from people using different languages will differ from the data gathered from people who all express themselves in the same language. In other words, the data will not be independent and identically distributed (i.e., it will not be i.i.d.).
Some of the data sets currently used in the federated learning literature are not i.i.d.; others are i.i.d. but are transformed in an ad-hoc way so that they no longer follow the i.i.d. assumptions. Other experiments just ignore this issue. These are some relevant examples of data sets used in federated learning. LEAF [2] considers the use of FEMNIST, Sentiment140, and Shakespeare. For these data sets, LEAF considers 3550, 660120, and 1129 devices, respectively. The FEMNIST data set consists of data for 62 different classes. The Shakespeare data set includes data for 715 different characters from Shakespeare plays. When each character is considered independently, this corresponds to 715 non-i.i.d. data sources. Here the number of classes will depend on the classification problem to be considered. For example, classification of a character from text means 715 classes, but for text prediction purposes, the number of classes will be different. Chai et al. [4] use MNIST [8] and Fashion-MNIST [13]. Both contain 60,000 training images (and 10,000 test images) represented by 28x28 pixels and corresponding to 10 classes. They also use Cifar10 (10 classes) and FEMNIST. Sarkar et al. [11] also consider MNIST and sampled-FEMNIST (10000, 10 classes), as well as other data sets with a smaller number of classes: in particular, VSN (68532 samples, 2 classes) and HAR (15762 samples, 6 classes). The number of clients they use in their experiments is rather small: 10, 10, 23, and 30, respectively (for MNIST, sampled-FEMNIST, VSN, and HAR). The use of MNIST and similar data sets is quite common in the federated learning literature. An arbitrary subset of MNIST will satisfy the same properties as the whole MNIST. If the latter is considered i.i.d., the arbitrary subset (selected by, e.g., a random partition) will also be i.i.d.
Most open data sets in machine learning repositories do not provide non-i.i.d. data. More precisely, machine learning and statistical research using open data sets usually assumes that the instances of a single database (of these open data sets) have been generated by independent and identically distributed random variables. Then, in principle, when we partition the database into random subsets (generated using, e.g., uniform distributions on the instances of the database) to assign a subset to each device, these subsets will also be independent and identically distributed. Because of that, we need to follow a different approach to build this partition. We summarize our goal and contribution as follows.
• Our goal is to provide an approach for building subsets for training and testing that is as systematic as the current approaches used for cross-validation/k-fold validation.
• To achieve this goal, our approach generates several disjoint data sets that are non-identically distributed from a given data set (that is assumed to be i.i.d.). Each subset will contain instances of only a subset of the whole set of classes. In this way, we build a partition of an original data set in a systematic way so that the resulting sets no longer satisfy the i.i.d. condition.
The structure of this paper is as follows. In Sect. 2 we describe our approach to generate non-identically distributed data from a single data set. In Sect. 3 we discuss the complexity of the approach and also report some examples of computations. The paper concludes with a discussion and research directions.

Generation of a non-identically distributed data set
This section describes our systematic approach to create data sets that do not satisfy the i.i.d. condition from an original data set that is i.i.d. The premise is that devices have instances corresponding to different classes. For example, device number 1 has instances of the first and the second classes but no instances of the other classes; then, in contrast, device number 2 has instances of the first and the third classes but no instance of the other classes.
We will use the following notation. The original data set has $n$ instances of $l$ classes, with $n_j$ instances associated with the $j$th class. Naturally, $\sum_{j=1}^{l} n_j = n$. In addition, one of the parameters of the approach is the number of different classes each device can handle. In the example above, devices 1 and 2 each have only two different classes. Let cxd denote the number of classes per device. Then, when there are $l$ different classes in the data set, there are

$$\text{osdd} = \binom{l}{\text{cxd}}$$

possible types of devices. We call this number osdd, for One Set of Different Devices. We create nCopies of each of these devices, each with different probabilities for each class to fulfill the non-i.i.d. assumption. In this way, the total number of devices will be $d = \text{nCopies} \cdot \text{osdd}$. Taking all these assumptions into consideration, we create the new data by finding a set of probabilities $p_{ij}$ for $i = 1, \ldots, d$ and $j = 1, \ldots, l$. Here, $p_{ij}$ represents the probability that an instance of the $i$th device is of the $j$th class. We represent these probabilities in Table 1. We define and solve an optimization problem to find these probabilities. These probabilities need to satisfy several constraints.

Table 1 For each device ($\text{dev}_i$), probability that instances belong to given classes $c_1, \ldots, c_l$. Here, $n_j$ is the total number of instances in $c_j$, and $n$ the total number of instances to be distributed
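The enumeration of device types described above can be sketched in a few lines of Python. This is only an illustration: the function name `device_types`, the variable `n_copies` and the 0-based class indices are our own choices and are not taken from the original implementation.

```python
from itertools import combinations
from math import comb


def device_types(l, cxd):
    """Return every subset of cxd classes out of l (classes indexed 0..l-1)."""
    return list(combinations(range(l), cxd))


l, cxd, n_copies = 3, 2, 2     # e.g. 3 classes, 2 classes per device, 2 copies
types = device_types(l, cxd)
osdd = len(types)              # One Set of Different Devices
d = n_copies * osdd            # total number of devices

# Each device type is replicated n_copies times.
devices = types * n_copies

print(osdd, d)                 # 3 6
print(devices)
```

The number of types equals the binomial coefficient `comb(l, cxd)`, so it grows quickly with the number of classes even though it does not depend on the number of instances.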
First, for a given device, only selected classes have a nonzero probability. For example, according to the example above, device number one will have $p_{1j} = 0$ for $j \geq 3$, and device number 2 will have $p_{22} = 0$ and $p_{2j} = 0$ for $j \geq 4$. Other probabilities can be nonzero. This is modeled by means of a set $N$ that denotes all null probabilities. In this example, $N$ includes at least $p_{1j}$ for $j \geq 3$, $p_{22}$, and $p_{2j}$ for $j \geq 4$.
Second, for each device, the probabilities add to one. That is, for all $i$, we have $\sum_{j=1}^{l} p_{ij} = 1$. Third, the probabilities associated with each class are also constrained by the number of instances in the data. For example, if in the original data set there are as many instances for class one as for class two (i.e., $n_1 = n_2$), then the proportion of instances assigned to the 1st class and to the 2nd class should be the same. Naturally, the sum of probabilities for the $j$th class is $\sum_{i=1}^{d} p_{ij}$. It should be clear that the sum of all probabilities for all classes is $d$ (because there are $d$ devices and the probabilities associated with each device add to 1). That is, $\sum_{j=1}^{l} \sum_{i=1}^{d} p_{ij} = d$. Therefore, we need that for each class $j$ the proportion $\sum_{i=1}^{d} p_{ij} / d$ is equal to $n_j / n$. In other words, we need that for each class $j$:

$$\sum_{i=1}^{d} p_{ij} = d \, \frac{n_j}{n}. \tag{1}$$

Finally, we also need probabilities to be nonnegative. That is, $p_{ij} \geq 0$ for all $i = 1, \ldots, d$ and $j = 1, \ldots, l$.
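As a minimal sketch of these constraints, the following Python function checks whether a candidate matrix of probabilities is feasible: null entries are zero, each row adds to one, each column adds to $d \cdot n_j / n$, and all entries are nonnegative. The function name `check_constraints`, the representation of the allowed classes as sets, and the two small matrices are assumptions made for illustration.

```python
def check_constraints(p, allowed, n_per_class, tol=1e-9):
    """Check the constraints on a d x l matrix p given as a list of rows.

    allowed[i] is the set of class indices device i may hold; every other
    entry of row i belongs to the null set N and must be zero.
    """
    d, l = len(p), len(n_per_class)
    n = sum(n_per_class)
    for i in range(d):
        for j in range(l):
            if p[i][j] < -tol:                              # nonnegativity
                return False
            if j not in allowed[i] and abs(p[i][j]) > tol:  # null set N
                return False
        if abs(sum(p[i]) - 1.0) > tol:                      # rows add to one
            return False
    for j in range(l):
        column_sum = sum(p[i][j] for i in range(d))
        if abs(column_sum - d * n_per_class[j] / n) > tol:  # class constraint
            return False
    return True


# Three balanced classes, two classes per device, one copy of each type.
allowed = [{0, 1}, {0, 2}, {1, 2}]
uniform = [[0.5, 0.5, 0.0], [0.5, 0.0, 0.5], [0.0, 0.5, 0.5]]
vertex = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
print(check_constraints(uniform, allowed, [50, 50, 50]))  # True
print(check_constraints(vertex, allowed, [50, 50, 50]))   # True
```

Both matrices are feasible, but the second one gives each device a single class. This is the kind of vertex solution that the objective function is designed to avoid.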
Different assignments satisfy these constraints. Within a device, we prefer probabilities to be distributed among the different (selected) classes. In our example, for device number one we prefer $p_{11} = 0.5$ and $p_{12} = 0.5$ to the solution $p_{11} = 1.0$ and $p_{12} = 0$. Similarly for the second device: our goal is to distribute the non-null probabilities among classes 1 and 3 (i.e., $p_{21} + p_{23} = 1$ but also $p_{21} \neq 0$ and $p_{23} \neq 0$). From an optimization point of view, this means that we do not want the solutions to be at the vertices of the polyhedron of feasible solutions. We define a quadratic objective function to achieve this effect. The best solutions are the ones in which probabilities for a device are equally distributed. Therefore, an interim expression for the objective function associated with the $i$th device is the following one:

$$\sum_{j=1}^{l} \left( p_{ij} - \frac{1}{\text{cxd}} \right)^2. \tag{2}$$

This results in a quadratic optimization problem (convex, as the objective is a sum of squares) with solutions that are, in general, not at the vertices of the feasible polyhedron.
Recall that nCopies represents the number of devices with the same classes. In our example, when nCopies is two, we have two devices with only classes 1 and 2. In order to have a non-identically distributed data set, these two devices need different probabilities for classes 1 and 2. Nevertheless, Eq. 2 will produce the same probabilities for all devices with the same classes. To avoid this, we modify the objective function using some random numbers. Let $\alpha_{ij}$ be a random number taken from a uniform distribution in [0,1]; then replace $1/\text{cxd}$ by $\alpha_{ij}$:

$$\sum_{j=1}^{l} \left( p_{ij} - \alpha_{ij} \right)^2. \tag{3}$$

Naturally, this is also a quadratic objective function. Let $p$ be the vector of all probabilities $p_{ij}$. This vector has dimension $d \cdot l$. Then, as we have that

$$(p_{ij} - \alpha_{ij})^2 = p_{ij}^2 - 2 \alpha_{ij} p_{ij} + \alpha_{ij}^2,$$

and the term $\alpha_{ij}^2$ is a constant, the objective function of our problem can be expressed using the square matrix $Q = \mathrm{Id}$ (i.e., the identity matrix of size $d \cdot l$) and the vector $L = -2A$ where $A = (\alpha_{11}, \alpha_{12}, \ldots, \alpha_{dl})$. That is, the objective function is $p^T Q p + p^T L$, where $p^T$ denotes the transpose of $p$.
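The equivalence between the sum of squared deviations $\sum (p_{ij} - \alpha_{ij})^2$ and the form $p^T Q p + p^T L$ with $Q$ the identity and $L = -2A$ can be checked numerically. The following sketch uses arbitrary random vectors; the vector length and the seed are illustrative assumptions.

```python
import random

random.seed(0)
size = 6                                     # stands for d * l
alpha = [random.random() for _ in range(size)]
p = [random.random() for _ in range(size)]   # any vector, feasible or not

# Sum of squared deviations from the random targets alpha.
squares = sum((pi - ai) ** 2 for pi, ai in zip(p, alpha))

# p^T Q p + p^T L with Q the identity matrix and L = -2 * alpha.
L = [-2.0 * ai for ai in alpha]
quadratic_form = sum(pi * pi for pi in p) + sum(pi * li for pi, li in zip(p, L))

# The two expressions differ only by the constant sum of alpha^2, which
# does not depend on p and therefore does not change the minimizer.
constant = sum(ai * ai for ai in alpha)
print(abs(squares - (quadratic_form + constant)) < 1e-12)  # True
```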
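Putting all together, we need to solve:

$$\begin{aligned}
\min_{p} \quad & p^T Q p + p^T L \\
\text{subject to} \quad & p_{ij} = 0 && \text{for all } p_{ij} \in N, \\
& \sum_{j=1}^{l} p_{ij} = 1 && \text{for all } i = 1, \ldots, d, \\
& \sum_{i=1}^{d} p_{ij} = d \, \frac{n_j}{n} && \text{for all } j = 1, \ldots, l, \\
& p_{ij} \geq 0 && \text{for all } i = 1, \ldots, d \text{ and } j = 1, \ldots, l.
\end{aligned} \tag{4}$$

A solution of this problem will be a matrix of probabilities $p_{ij}$ for $i = 1, \ldots, d$ and $j = 1, \ldots, l$ as in Table 1. From this matrix of probabilities we can compute the expected number of instances of each class assigned to each device. Let us denote this number by $n_{ij}$. That is, $n_{ij}$ represents the number of instances of class $j$ that we assign to device $i$. Then, $n_{ij}$ needs to be defined as follows:

$$n_{ij} = p_{ij} \, \frac{n}{d}. \tag{5}$$

It is easy to see that, by construction, $\sum_{i=1}^{d} n_{ij} = n_j$ for any class $j$. Observe (using Eq. 1):

$$\sum_{i=1}^{d} n_{ij} = \frac{n}{d} \sum_{i=1}^{d} p_{ij} = \frac{n}{d} \cdot d \, \frac{n_j}{n} = n_j.$$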
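The computation of the expected counts, $n_{ij} = p_{ij} \cdot n / d$, and the consistency property $\sum_i n_{ij} = n_j$ can be illustrated as follows. The matrix `p` below is a hand-picked feasible matrix for 150 instances in three balanced classes and six devices (it is not the output of the optimizer), and the function name `expected_counts` is our own.

```python
def expected_counts(p, n):
    """n_ij = p_ij * n / d for a d x l probability matrix p (list of rows)."""
    d = len(p)
    return [[p_ij * n / d for p_ij in row] for row in p]


# Six devices, three classes, two classes per device (two copies per type).
p = [
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0],
]
counts = expected_counts(p, n=150)
class_totals = [sum(row[j] for row in counts) for j in range(3)]
device_totals = [sum(row) for row in counts]
print(class_totals)    # [50.0, 50.0, 50.0]
print(device_totals)   # [25.0, 25.0, 25.0, 25.0, 25.0, 25.0]
```

Note that the $n_{ij}$ are expected values (12.5 in this illustration) and need not be integers; the actual integer counts come from the random assignment of instances to devices.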

An example: the IRIS data set
As an example, we consider the IRIS data set [5] that consists of 150 instances of 3 classes represented by 4 numerical features. The three classes are Iris setosa, Iris virginica and Iris versicolor, but we represent them here by $c_1$, $c_2$, $c_3$. There are 50 instances for each class. Therefore, $n_1 = n_2 = n_3 = 50$. Then, if we consider that each device has 2 classes (i.e., cxd = 2), there are fundamentally 3 types of devices. Type 1 has classes $c_2$ and $c_3$, type 2 has classes $c_1$ and $c_3$, and type 3 has classes $c_1$ and $c_2$. Therefore, osdd = 3. If we select nCopies = 2, we will have two devices of each type and, thus, a total of 6 devices (i.e., $d = 2 \cdot \text{osdd} = 6$). Let $\text{dev}_1$ and $\text{dev}_4$ be of type 1, $\text{dev}_2$ and $\text{dev}_5$ of type 2, and $\text{dev}_3$ and $\text{dev}_6$ of type 3. Then, we need the constraints $p_{11} = 0$, $p_{22} = 0$, $p_{33} = 0$, $p_{41} = 0$, $p_{52} = 0$, and $p_{63} = 0$ to prevent the devices from including instances of classes that are not allowed. Then, we will have 6 constraints for the six devices requiring that their probabilities add to one (i.e., $p_{12} + p_{13} = 1$, $p_{21} + p_{23} = 1$, $p_{31} + p_{32} = 1$, $p_{42} + p_{43} = 1$, $p_{51} + p_{53} = 1$, $p_{61} + p_{62} = 1$). Finally, we also have the equalities for each class. As we have $n_j = 50$ instances for each class $j$, the probabilities for each class add to $d \cdot n_j / n = 6 \cdot 50 / 150 = 2$. So, $p_{21} + p_{31} + p_{51} + p_{61} = 2$, $p_{12} + p_{32} + p_{42} + p_{62} = 2$, and $p_{13} + p_{23} + p_{43} + p_{53} = 2$.
In addition, we have the inequalities $p_{ij} \geq 0$, and the objective function defined by $Q = \mathrm{Id}$ (an identity matrix of size $d \cdot 3 = 6 \cdot 3 = 18$, as there are 18 probabilities $p_{ij}$) and $L$, a vector of length 18 with random numbers in [0,1] multiplied by −2. The problem built in this way minimizes $p^T Q p + p^T L$ subject to these constraints.

Assignment of instances to devices
In this way, we can partition the whole data set of $n$ instances by randomly assigning to each device an appropriate number of instances that takes into account the classes this device needs to consider. This can be easily implemented in the following way. Consider class 1 with $n_1$ instances; we can assign its instances to the devices as follows.
1. Generate a random sample (with replacement) of size $n_1$ of values in the set of devices $\{1, \ldots, d\}$. The probability of selecting device $i_0$ is $n_{i_0 1} / n_1$ (equivalently, $p_{i_0 1} / \sum_{i=1}^{d} p_{i1}$). Let $(d_1, \ldots, d_{n_1})$ be the devices in the sample. Observe that, as there are $n_1$ instances of the first class, this process results in an assignment of each instance to a device.
2. Assign each instance in the data set with class $c_1$ to the devices according to the random sample. For example, assign the first instance with class $c_1$ to $d_1$, assign the second instance with class $c_1$ to $d_2$, etc.
For the first step, in our implementation we have used the function choice from the random module of Python's numpy package. An alternative would be to draw $n_1$ values from a uniform distribution and then use the inverse of the cumulative distribution function to map each value to a device.
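The two sampling alternatives above can be sketched as follows. This is not the original implementation: to keep the example free of dependencies it uses `random.choices` from the Python standard library as a stand-in for numpy's `choice`, and the function names and the column of probabilities `probs_c1` are hypothetical.

```python
import random
from bisect import bisect_right
from itertools import accumulate


def assign_class_instances(instance_ids, device_probs, rng):
    """Assign each instance of one class to a device index.

    device_probs[i] is proportional to the probability of device i for this
    class (e.g. the column p_i1 of the probability matrix); random.choices
    normalizes the weights and samples with replacement.
    """
    devices = range(len(device_probs))
    sample = rng.choices(devices, weights=device_probs, k=len(instance_ids))
    return dict(zip(instance_ids, sample))


def assign_by_inverse_cdf(instance_ids, device_probs, rng):
    """Same assignment via uniform draws and the inverse cumulative distribution."""
    total = sum(device_probs)
    cdf = list(accumulate(w / total for w in device_probs))
    # Guard against rounding: never go past the last device with nonzero weight.
    last = max(i for i, w in enumerate(device_probs) if w > 0)
    return {
        inst: min(bisect_right(cdf, rng.random()), last)
        for inst in instance_ids
    }


rng = random.Random(42)
class1_ids = list(range(50))                  # 50 instances of class c_1
probs_c1 = [0.0, 0.6, 0.4, 0.0, 0.7, 0.3]     # hypothetical column p_i1
assignment = assign_class_instances(class1_ids, probs_c1, rng)
print(sorted(set(assignment.values())))       # only devices with nonzero weight
```

Repeating the call for every class, each time with the corresponding column of the probability matrix, yields the full partition. Devices with zero probability for a class never receive instances of that class.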

Unbalanced number of instances for devices
The optimization problem formulated above does not impose any requirement on the number of instances associated with each device. As a consequence, Eq. 4 may result in all devices having the same number of instances. In practice, this is not necessarily the case, because the constraints (including the definition of the set of null probabilities $N$) can force some sets to have more instances than others.
Our point of view that Eq. 4 assumes that all devices have the same number of instances is supported by Eq. 1. Observe that Eq. 1 can be rewritten as

$$\sum_{i=1}^{d} \frac{1}{d} \, p_{ij} = \frac{n_j}{n}.$$

Expressed in this way, the equation considers $d$ devices, each with weight $1/d$. We can, thus, consider different weights for different devices by including parameters $w_i$ for $i = 1, \ldots, d$. These weights need to be positive and add to one (i.e., $\sum_{i=1}^{d} w_i = 1$). Then, we can rewrite the previous set of equations into the following one:

$$\sum_{i=1}^{d} w_i \, p_{ij} = \frac{n_j}{n} \quad \text{for each class } j. \tag{6}$$

Then, the optimization problem becomes:

$$\begin{aligned}
\min_{p} \quad & p^T Q p + p^T L \\
\text{subject to} \quad & p_{ij} = 0 && \text{for all } p_{ij} \in N, \\
& \sum_{j=1}^{l} p_{ij} = 1 && \text{for all } i = 1, \ldots, d, \\
& \sum_{i=1}^{d} w_i \, p_{ij} = \frac{n_j}{n} && \text{for all } j = 1, \ldots, l, \\
& p_{ij} \geq 0 && \text{for all } i = 1, \ldots, d \text{ and } j = 1, \ldots, l.
\end{aligned} \tag{7}$$

A solution of this optimization problem is, again, a set of probabilities $p_{ij}$ of instances being assigned to class $j$ given a device $i$. Then, from these $p_{ij}$ we need to compute the average number of instances for each pair (device, class). In the first formulation of the problem, this was achieved by multiplying $p_{ij}$ by $n/d$. In the present situation, we need to multiply $p_{ij}$ by $n$ and the weight of the device. Using the notation above, with $n_{ij}$ as the average number of instances for class $j$ in device $i$, we have $n_{ij} = p_{ij} \, n \, w_i$ for all $i = 1, \ldots, d$ and $j = 1, \ldots, l$. We can also observe that this definition is consistent because we can prove that $\sum_{i=1}^{d} n_{ij} = n_j$ for any class $j$. Taking into account Eq. 6, we can write:

$$\sum_{i=1}^{d} n_{ij} = n \sum_{i=1}^{d} w_i \, p_{ij} = n \cdot \frac{n_j}{n} = n_j.$$

This solution is implemented in the same way as for the previous problem, that is, using the description in Sect. 2.2. Both optimization problems just produce a set of probabilities and an average number of instances consistent with the constraints.

An example: unbalanced case for the IRIS data set
Let us consider again the IRIS data set, but now with the goal of creating an unbalanced number of instances. For illustration, we consider the same parameters as above (i.e., 6 devices). In addition, we consider that devices have different weights (i.e., we expect them to have different numbers of instances): the first device has three times the weight of the last one, the second and the third have twice the weight of the last one, and the fourth and the fifth have the same weight as the last one. More formally, we consider the weights $w = (3/10, 2/10, 2/10, 1/10, 1/10, 1/10)$. It can be seen that the assignment of instances to devices is such that the number of instances follows the weights $w_i$ for $i = 1, \ldots, d$. Observe that the assignments $n_{ij}$ are such that the sums $\sum_{j=1}^{l} n_{ij}$ lead, respectively, to (45, 30, 30, 15, 15, 15), which, in this case, satisfies the proportions given by $w$.
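The relation between weights and device sizes can be verified with exact arithmetic. The sketch below uses a hand-picked matrix `p` that satisfies the weighted class constraint $\sum_i w_i p_{ij} = n_j / n$; it is an illustrative feasible point, not the solution returned by the optimizer, and the variable names are our own.

```python
from fractions import Fraction as F

n = 150
n_per_class = [50, 50, 50]
w = [F(3, 10), F(2, 10), F(2, 10), F(1, 10), F(1, 10), F(1, 10)]

# Hand-picked feasible probabilities; rows are devices, columns are classes.
p = [
    [F(0), F(1, 2), F(1, 2)],    # dev 1: classes c2, c3
    [F(5, 9), F(0), F(4, 9)],    # dev 2: classes c1, c3
    [F(5, 9), F(4, 9), F(0)],    # dev 3: classes c1, c2
    [F(0), F(1, 2), F(1, 2)],    # dev 4: classes c2, c3
    [F(5, 9), F(0), F(4, 9)],    # dev 5: classes c1, c3
    [F(5, 9), F(4, 9), F(0)],    # dev 6: classes c1, c2
]

# Weighted class constraint: sum_i w_i * p_ij = n_j / n for every class j.
weighted = [sum(w[i] * p[i][j] for i in range(6)) for j in range(3)]
print(weighted == [F(n_j, n) for n_j in n_per_class])   # True

# Expected counts n_ij = p_ij * n * w_i and their totals.
counts = [[p[i][j] * n * w[i] for j in range(3)] for i in range(6)]
device_totals = [sum(row) for row in counts]
class_totals = [sum(counts[i][j] for i in range(6)) for j in range(3)]
print([int(t) for t in device_totals])   # [45, 30, 30, 15, 15, 15]
print([int(t) for t in class_totals])    # [50, 50, 50]
```

Since each row of `p` adds to one, the total for device $i$ is always $n \cdot w_i$, whatever feasible matrix is used.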
In this example, there are a few probabilities with a value that is close to zero. This is caused by the weights assigned to the devices, but also by the random vector $A$. Different vectors $A$ will produce different probabilities, and some of them will avoid these values close to zero, thus enforcing diversity in these devices.
We now describe these results in text. The first device will have the majority of the instances

Computational complexity and experiments
In this section we discuss the computational complexity of our approach and some experiments we have performed to generate data sets.

Computational complexity of the approach
We have given our initial formulation of the problem in Eq. 4 and the revised version in Eq. 7. The optimization problem has been defined taking into account that the problem has $l$ classes, with $n_1, \ldots, n_l$ instances for each class. Then, we have also considered the number of classes per device (cxd) and the number of copies of the different types of devices (nCopies) as input parameters of our approach.
It can be observed that the computational complexity of both definitions is the same. Both problems are quadratic with linear constraints. Both problems have the same number of variables and equations.
To solve this type of problem, quadratic programming solvers can be used. We have implemented our approach using Python and solved the optimization problem using the library cvxopt. In particular, we have used the function solvers.qp. This function requires the specification of the matrix and vector of the objective function, and the matrices and vectors of equalities and inequalities. The software is available at [14].
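As a hedged sketch of how such a problem could be passed to cvxopt (this is not the code available at [14]), the block below builds the problem data as plain Python lists and wraps the call to `solvers.qp` in a separate function. It relies on the documented conventions of cvxopt: `solvers.qp(P, q, G, h, A, b)` minimizes $\frac{1}{2} x^T P x + q^T x$ subject to $Gx \leq h$ and $Ax = b$, and `matrix` built from a list of lists treats each inner list as a column. Two simplifications are our own assumptions: variables in the null set $N$ are left out instead of being fixed to zero, and one class equation is dropped because it is implied by the others (cvxopt's documentation assumes linearly independent equality rows). The names `build_qp` and `solve_with_cvxopt` are hypothetical, and the sketch assumes cxd ≥ 2.

```python
import random
from itertools import combinations


def build_qp(n_per_class, cxd, n_copies, rng):
    """Build the QP data as plain lists, one variable per allowed p_ij."""
    l, n = len(n_per_class), sum(n_per_class)
    devices = list(combinations(range(l), cxd)) * n_copies
    d = len(devices)
    variables = [(i, j) for i, classes in enumerate(devices) for j in classes]
    index = {pair: k for k, pair in enumerate(variables)}
    m = len(variables)

    A_rows, b = [], []
    for i, classes in enumerate(devices):        # probabilities of a device add to one
        row = [0.0] * m
        for j in classes:
            row[index[(i, j)]] = 1.0
        A_rows.append(row)
        b.append(1.0)
    # Class constraints: sum_i p_ij = d * n_j / n. The last class is skipped
    # because its equation follows from the device rows and the other classes.
    for j in range(l - 1):
        row = [0.0] * m
        for i, classes in enumerate(devices):
            if j in classes:
                row[index[(i, j)]] = 1.0
        A_rows.append(row)
        b.append(d * n_per_class[j] / n)

    alpha = [rng.random() for _ in range(m)]     # random targets in [0, 1)
    L = [-2.0 * a for a in alpha]                # linear term, L = -2A
    return variables, L, A_rows, b


def solve_with_cvxopt(L, A_rows, b):
    """Minimize p^T p + p^T L subject to A p = b and p >= 0 (requires cvxopt)."""
    from cvxopt import matrix, solvers           # imported lazily on purpose

    m = len(L)
    identity = [[1.0 if r == c else 0.0 for r in range(m)] for c in range(m)]
    P = matrix([[2.0 * v for v in col] for col in identity])  # qp uses (1/2) x^T P x
    q = matrix(L)
    G = matrix([[-v for v in col] for col in identity])       # -p <= 0
    h = matrix([0.0] * m)
    A = matrix([list(col) for col in zip(*A_rows)])           # columns, not rows
    solution = solvers.qp(P, q, G, h, A, matrix(b))
    return list(solution["x"])


variables, L, A_rows, b = build_qp([50, 50, 50], cxd=2, n_copies=2,
                                   rng=random.Random(0))
print(len(variables), len(A_rows))               # 12 8
```

With cvxopt installed, `solve_with_cvxopt(L, A_rows, b)` would return one probability per entry of `variables`. The factor 2 in `P` compensates for the $\frac{1}{2}$ in cvxopt's objective, so that the minimized function matches $p^T Q p + p^T L$ with $Q$ the identity.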
The optimization problem has one constraint for each class (i.e., $l$ equations), one constraint for each device (i.e., $d$ equations), inequalities for each probability ($l \cdot d$ inequalities), and equalities for each $p_{ij} \in N$. The latter equations just set the probability to zero. In our case we have established the number of devices as

$$d = \text{nCopies} \cdot \binom{l}{\text{cxd}}.$$

So, the total number of relevant equations (i.e., ignoring the ones that set a probability to zero) is $l + d + l \cdot d$, where $d$ is defined as above.
The number of variables to be determined is naturally l · d.
It can be observed that the number of equations mainly depends on the number of classes of the problem. More precisely, it is linear in the number of combinations built from the number of classes. In contrast, the number of instances does not affect the computational cost.
For problems with a limited number of classes, the solution of this optimization problem does not pose any computational difficulty.

Experiments
We have illustrated our approach with the Iris data set. This data set is a classification problem that consists of 150 instances described by 4 features and corresponding to 3 classes. We have seen that considering 2 classes for each device means only 3 different types of devices. Then, for nCopies = 1 we have
• $l = 3$ equations, one for each class,
• $d = 3$ equations, one for each different type of device, and
• $l \cdot d = 9$ inequalities, one for each probability (i.e., pair (class, device)).
Therefore, it is an optimization problem with 15 equations and 9 variables.
We have also considered the case of MNIST. This data set consists of 60000 training instances (images of 28x28 pixels each) that correspond to 10 classes. Therefore, $l = 10$. Then, $d$ will depend on the number of classes we require for each device. Table 2 gives the number of different types of devices that would be generated considering different numbers of classes in each device. The table also includes the number of variables of the optimization problem. As we have described above, the total number of equations is the sum of the three values $l = 10$, $d$ and $l \cdot d$.
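The sizes discussed above follow directly from the formulas $d = \text{nCopies} \cdot \binom{l}{\text{cxd}}$, $l \cdot d$ variables, and $l + d + l \cdot d$ equations. The following sketch computes them for a given number of classes; the function name `problem_size` is our own, and the printed values are derived from these formulas rather than copied from Table 2.

```python
from math import comb


def problem_size(l, cxd, n_copies=1):
    """Return (d, variables, equations) for l classes and cxd classes per device."""
    d = n_copies * comb(l, cxd)
    variables = l * d
    equations = l + d + l * d      # class + device + nonnegativity constraints
    return d, variables, equations


print(problem_size(3, 2))          # (3, 9, 15), the Iris case with nCopies = 1
for cxd in range(1, 10):           # a setting with l = 10 classes, as in MNIST
    print(cxd, problem_size(10, cxd))
```

For $l = 10$ the number of device types peaks at cxd = 5 (252 types), with cxd = 4 and cxd = 6 giving 210 types each.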
For illustration we give mean computation times of our implementation using a regular laptop (characteristics: lat7400n, 31,2 GiB, Intel Core i7-8665U CPU@1.90GHz x 8, 1,0TB, Ubuntu 20.04.2 LTS 64-bits) when we require devices to have instances of 4 and 5 different classes (i.e., cxd=4 and cxd=5). Note that these are the cases in which the optimization problem has the largest number of equalities and variables.

Simple extensions
In our analysis of problem complexity we have assumed that the number of classes associated with each device is the same for all devices (and equal to cxd). Considering different numbers of classes for different devices (e.g., devices with cxd' classes for cxd' ≥ cxd) will produce additional types of devices and the corresponding constraints. Nevertheless, the whole process will be similar to the one described in Sect. 2.

Discussion and research directions
In this paper we have presented an approach to generate non-independent and identically distributed data from a data set. The approach is based on creating different data sets for different devices so that each device has only a subset of the classes. We have formulated this solution in terms of an optimization problem: a quadratic problem with linear constraints. We have provided two solutions. One in which all devices have the same number of instances, and the second in which we can generate different number of instances for different devices. This second problem is, naturally, a generalization of the first and we have seen that this does not add complexity to the optimization problem.
The goal was to define a systematic way to create these data sets, in line with other standard machine learning approaches to partition data sets for testing and evaluation, such as generating sets for cross-validation/k-fold validation.
This approach has been defined for classification data sets. In this approach we have considered that all devices share the features of the data set. That is, our approach provides horizontally distributed data.
As future work we consider alternatives to the use of random values $\alpha_{ij}$ in Eq. 3. In particular, as a referee suggested, we may consider values diverging from $1/\text{cxd}$ (as used in Eq. 2). Another direction is to consider the generation of non-i.i.d. data for vertically distributed data. A similar optimization problem can be defined considering a partition of the features. Selection of features per device will also provide non-i.i.d. data.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Vicenç Torra is currently a WASP professor on AI at Umeå University (Sweden). He is an IEEE and EurAI Fellow, and ISI elected member. His fields of interests include data privacy for machine learning and statistics, approximate reasoning, and decision making. He has written seven books including "Modeling decisions" (with Y. Narukawa, Springer, 2007), "Data Privacy" (Springer, 2017), "Scala: from a functional programming perspective" (Springer, 2017) and "Guide to data privacy" (Springer, 2022). He is founder and editor of the Transactions on Data Privacy, and started in 2004 the annual conference series Modeling Decisions for Artificial Intelligence (MDAI).