Private Hospital Workflow Optimization via Secure k-Means Clustering
Abstract
Optimizing the workflow of a complex organization such as a hospital is a difficult task. An accurate option is to use a real-time locating system to track the locations of both patients and staff. However, privacy regulations forbid hospital management from assessing the location data of their staff members. In this exploratory work, we propose a secure solution to analyze the joint location data of patients and staff by means of an innovative cryptographic technique called Secure Multi-Party Computation, in which an additional entity that the staff members can trust, such as a labour union, takes care of the staff data. The hospital, owning the location data of patients, and the labour union perform a two-party protocol, in which they securely cluster the staff members by means of the frequency of their patient facing times. We describe the secure solution in detail, and evaluate the performance of our proof-of-concept. This work thus demonstrates the feasibility of secure multi-party clustering in this setting.
Keywords
Secure multi-party computation, Hospital, Workflow optimization, Privacy, Real-time locating system, Clustering, k-means
Introduction
Hospitals are highly complex organizations, typically involving a toxic combination of unpredictable patient flows and limited staffing and equipment resources. Achieving the Quadruple Aim (which aims to simultaneously improve Patient Experience, Population Health, Cost of Care and Provider Well-Being [28]) under such challenging conditions often drives senior healthcare management to find every opportunity to optimize resources within the hospital.
A common approach taken by hospitals to optimize workflows is to hire consultants who interview and shadow key stakeholders and patients in order to develop an accurate picture of how the targeted department/hospital is functioning. A well-known drawback of such an approach is that individuals tend to change their behavior due to their awareness of being observed (a phenomenon known as the Hawthorne effect [36]). In addition, such manual observations only allow for point measurements, as it is impossible for any group of visiting consultants to accurately capture the operational characteristics of all key individuals in a department at any given time. Interviews are also unable to accurately capture data, as people often report their perception of events rather than facts.
Some hospitals approach this problem with a data-driven strategy. This involves going through the timestamps entered in various hospital IT systems, e.g. Electronic Health Record (EHR) systems, Staffing Information Systems, Laboratory Information Systems, etc. While this is a better strategy than simply depending on manual observations, the data entered into hospital IT systems is highly susceptible to data quality issues [9, 24, 34]. Optimizing hospital workflows based on such noisy data can lead to erroneous outcomes [37].
One option is to use a Real-Time Locating System (RTLS) to help address the problem of inaccurate timestamps. An RTLS consists of tags that can be placed on patients, staff and assets. The tags allow the locations of all tagged entities to be tracked at high spatial and temporal (e.g. every few seconds) resolution throughout the defined area of interest (e.g. within a hospital department). The real-time streaming data can also be used to automatically and accurately label many events. For example, a tagged patient would allow the system to accurately label when a patient has moved into a particular exam room. Similarly, a tagged nurse could be used to determine how many times the nurse has moved back and forth between two rooms of interest. Patient and staff location information can then be combined and plugged into certain common data mining algorithms (e.g. k-means clustering, sequential pattern analysis, or market basket analysis) to analyze the utilization patterns of various hospital rooms and to highlight any abnormalities that might exist. Such information can subsequently be used to identify bottlenecks and thus optimize workflows.
Under current hospital practices, hospitals routinely monitor staffing logs which describe which members of staff are on duty at any point in time; such information is critical for running a hospital. However, collecting fine-grained location data of staff members is not currently considered routine in hospitals. Moreover, location data is considered personal data in Europe under the newly established GDPR. This means that it is essential for a hospital to be completely transparent about what data is collected about individuals and to gather permission from them prior to collecting and using the data; on the other hand, in order to perform effective and accurate workflow analysis based on location data, it is essential to have a high degree of participation from staff members. With hospital boards under constant pressure to improve productivity, sharing real-time location data of staff members with higher management could be considered a step too far. Such fear could greatly limit the number of participants who agree to share their location information. Moreover, privacy regulations such as GDPR, when it comes to dealing with patient records, mean that hospitals are not allowed to send any data beyond their physical boundaries.
A traditional approach in this case would be for the hospital to hire a trusted third party that collects all RTLS data, and outputs the clustering results. This party would be obliged, by contract and law only, not to disclose the RTLS data. However, the especially sensitive type of data involved would require expensive security measures. Furthermore, having all data at one single place increases the risk of information leakage. This makes it highly challenging to perform any kind of workflow optimization by analyzing these separate patient and staff RTLS data streams jointly.
In order to address this problem, this exploratory paper demonstrates how Secure Multi-Party Computation (shortened as MPC) can be used to allow data mining algorithms, such as k-means clustering, to be performed on two separate RTLS data streams: one generated by tagged patients, and the other by tagged hospital staff, while maintaining the privacy of all individuals. The location information of patients will only be made accessible to the hospital, while the location information of staff members will only be accessible to the staff members themselves, or to the labour union that represents them; labour unions, having the goal to represent the interests of all staff members of the hospital, are effectively the only body that can collectively act on behalf of all the nurses in a hospital.^{1} By splitting sensitive location information into two parts (patients and staff), each part being handled by a suitable independent party (hospital and labour union), we avoid any party gaining location information that they are not supposed to learn. Such a scheme allows the hospital to derive insights using both patient and staff RTLS data streams, without having access to individual location data streams of its staff members. Because the staff data stays with the labour union, secondary use of the location data of its members (i.e., the hospital staff) becomes impossible.
More concretely, we show the feasibility of this approach with a demonstrator that clusters nurses based on their patient facing time. This is motivated by the fact that hospital departments generally have some expectations in terms of how they should operate: for instance, in a hospital ward, patients typically arrive from different parts of the hospital with medical conditions of various types and of various degrees of seriousness. As a consequence, nurses may be given different tasks and be requested to assist patients of a given ‘type’, where a type can indicate the medical condition of a patient or its seriousness. Clustering nurses, i.e. assigning them to separate sets based on the frequency and duration of interactions with patients of different types, can assist hospitals in determining whether nurses are indeed behaving as expected. Unexpected behavior may be a sign of suboptimal workflow (e.g., signaling how other tasks prevent nurses from focusing on the assigned patients), and may thus lead to further investigation on the part of the hospital. For the proof-of-concept described in this article, we focused on k-means clustering, due to its popularity and its relative conceptual simplicity; k-means clustering is commonly used, for instance, when performing workflow analysis [23, 29, 35].
We stress the fact that the usage of RTLS in this setting is still in its infancy, and precise requirements are thus yet to be determined; in particular, it is still unclear at this point which data analysis algorithm can give the best insight into hospital workflow. We believe that the solution we present could also potentially help in clarifying needs and goals for an RTLS-based hospital workflow analysis, with k-means clustering of nurses based on patient facing times constituting a first use-case.
In the remainder of this section, we introduce the concept of secure multiparty computation, and give an overview of related work. In “Details of the computation” the details of all computational steps are explained, and in “Secure solution” it is shown how these could be performed securely. The performance results are shown in “Implementation and results”, and we end with the conclusions.
Secure multi-party computation
The idea of MPC is that different mutually distrusting parties compute the output of a certain function or computation, depending on private inputs of each party, without actually revealing information on their inputs. MPC was introduced by Yao in the 1980s [39], and has led to a new flourishing research area yielding secure solutions for a large number of applications. Although efficiency was often a bottleneck, various implementation frameworks for MPC have appeared, especially during recent years, incorporating the latest technical accelerations and bringing applications towards practice [25].
To illustrate how the seemingly impossible requirements of MPC can be met, we briefly discuss a paradigm for constructing MPC protocols, which is widely used by the most recent generation of MPC frameworks. This paradigm is referred to as share-compute-reveal, and works in three phases: first, the input data is ‘secret-shared’ between the different parties, then a secure computation of the function is performed, and finally the output is revealed to the authorized party. All sensitive (intermediate) values are secret-shared, which means that each party obtains a non-revealing part of the data, called a share, and the actual secret can only be obtained after combining all shares.^{2} Therefore, the data is secure as long as not all parties collude, and the parties can securely compute the desired function with sensitive information. Once the output has been securely computed, the parties can jointly reveal it; this means that the output of the computation is the only information learned by the parties.
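The share-compute-reveal paradigm can be illustrated with a minimal Python sketch of two-party additive secret sharing. This is an illustration under stated assumptions, not the protocol used in this work: the modulus `P` and the helper names `share`, `add_shares` and `reveal` are hypothetical, and the authentication that an actively secure framework adds on top of plain shares is omitted.

```python
import secrets

# Illustrative prime modulus (2^61 - 1); an assumption, not a parameter from the paper.
P = 2**61 - 1

def share(x, p=P):
    """Share phase: split x into two additive shares modulo p."""
    x1 = secrets.randbelow(p)   # uniformly random, so x1 alone reveals nothing about x
    x2 = (x - x1) % p           # x2 alone is uniformly random as well
    return x1, x2

def add_shares(a, b, p=P):
    """Compute phase (addition only): each party adds its own shares locally."""
    return ((a[0] + b[0]) % p, (a[1] + b[1]) % p)

def reveal(x1, x2, p=P):
    """Reveal phase: the secret is recovered only by combining both shares."""
    return (x1 + x2) % p

a = share(25)          # e.g. a value contributed by one party
b = share(17)          # e.g. a value contributed by the other party
total = reveal(*add_shares(a, b))
```

Addition needs no interaction in this sketch; multiplication of two shared values does require a dedicated sub-protocol, which is not shown here.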
Various applications of MPC in the medical domain have been presented, e.g., privacy-preserving data mining for joint data analysis between hospitals [26], branching programs for privacy-preserving classification of medical ElectroCardioGram signals [7], and also secure disclosure of patient data for disease surveillance [20], R-based healthcare statistics [15], and privacy-preserving genome-wide association studies [11].
Related work
The potential benefits derived from using real-time locating systems in hospitals and other healthcare facilities have been presented in several papers [6, 8, 19, 30]. The security and privacy implications of pervasive data analysis techniques for healthcare, moreover, are widely discussed in the scientific community; see e.g. [1, 2] for some surveys on the topic.
To the best of our knowledge, this is the first paper that studies the usage of MPC for secure hospital optimization. However, other privacy-preserving techniques for healthcare data analysis have been presented in [32], and several MPC techniques for secure data analysis, and clustering in particular, have been presented in the past few years [3, 4, 10, 14, 21, 22, 27]. These MPC-based works differ from our approach in that they are set in the so-called ‘honest-but-curious’ model, where security is only guaranteed as long as parties follow the instructions of the protocol, while our solution is also secure in the ‘malicious’ model, where one (or several) of the parties may deviate from the instructions of the protocol.
Another important difference is that previous works on secure clustering assume that data is partitioned between parties, either horizontally (meaning that different data points will be owned by different parties) or vertically (meaning that each party only holds specific attributes of any data point). Our assumptions and requirements are different, as the data to be clustered is sensitive information that should remain hidden from both parties; a securely distributed version of it (or, formally, a secret-shared version of it) is thus constructed in a first step of our solution (cf. “Secure solution” for details). Although showing the feasibility of secure clustering for hospital optimization is the main contribution of this manuscript, we thus believe the secure-clustering protocol itself to be of independent interest.
Unlike some of the related work mentioned above, we use secret sharing instead of (additively) homomorphic encryption. The main disadvantage of homomorphic encryption is that it leads to large overheads: ciphertexts need to be large for security reasons, which induces considerable computational effort and large amounts of communication. On the other hand, secret shares can be much shorter, and quite efficient secure frameworks based on them (see “The MPC framework of our choice: SPDZ”) have recently been developed.
Details of the computation
In this section we give a precise description of the algorithm that we wish to compute. We stress the fact that what is described here is the ‘plaintext’, or ‘unsafe’ computation, where privacy-sensitive data of patients and staff members is used. We show in “Secure solution” how to securely compute the functionality described in this section.
As informally described in the introduction, the input of the clustering algorithm is given by the RTLS data, which gives a snapshot of the hospital every few seconds, identifying where each patient and each staff member is at a given moment. The algorithm uses this input to cluster nurses according to the frequency and length of interactions with patients (the so-called nurse-patient facing time). Focusing on this concrete use-case, we will henceforth speak of ‘nurses’ instead of more generic ‘staff members’.
In order to realize this functionality, we developed a two-step algorithm: first, we construct a table that combines the RTLS data from the hospital and the labour union, and secondly, k-means clustering is applied to this table.
Table 1. Parameters
| Parameter | Description |
|---|---|
| Nn | number of nurses |
| Np | number of patients |
| Nptype | number of patient types |
| nID | nurse ID |
| tagID | tag ID |
| zID | zone ID |
| time | time record |
| tagRole | person role tag |
| nP | set of nurse periods |
| pP | set of patient periods |
| st | starting time of a period |
| et | end time of a period |
| Ntimebins | number of time bins |
| TB | array with time bin boundaries |
| ov | overlap between interaction periods |
| ovbin | time bin indicator of overlapping periods |
Constructing the table
Since the hospital and the labour union each own a part of the RTLS data, which is needed to determine and compare the behaviour of the nurses, the first step of the computation is to combine these data. The outcome of this step is a table that associates each nurse to an array, indicating frequency and length of her/his interactions with patients, which can be used as input for a clustering algorithm.
Table 2. Structure of raw RTLS data
| Tag | Role | Time stamp | Zone |
|---|---|---|---|
| tagID_{1} | tagRole_{1} | time_{1} | zID_{1} |
| tagID_{2} | tagRole_{2} | time_{2} | zID_{2} |
| \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) |
The tag tagID is the unique identifier assigned to each tag, while the role tagRole defines whether the tag belongs to a nurse or a patient; as stated in the introduction, what is crucial for the privacy of our solution is that the hospital will receive only rows with tag roles for patients, and the labour union will receive only rows with tag role ‘nurse’.
The tag role also serves another goal, namely, it differentiates between various patient types. Indeed, patients are divided into Nptype ‘types’, according to the nature and severity of their medical condition; types could thus denote, for instance, terminally ill patients, or patients suffering from a heart attack. Each row of the table means that the individual with tag tagID was in a zone with identifier zID at time time, where tagRole gives additional information on the individual (role and patient type, if applicable).
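Table 3. Structure of individual preprocessed RTLS data
| Tag | Start time | End time | Zone | Role |
|---|---|---|---|---|
| tagID_{1} | st_{1} | et_{1} | zID_{1} | tagRole_{1} |
| tagID_{2} | st_{2} | et_{2} | zID_{2} | tagRole_{2} |
| \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) |
Table 4. Nurse-patient facing times (time bins in seconds)
| nID | Type A: 0-10 | Type A: 10-30 | Type A: 30-60 | Type A: > 60 | Type B: 0-10 | Type B: 10-30 | Type B: 30-60 | Type B: > 60 |
|---|---|---|---|---|---|---|---|---|
| nID_{1} | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ |
| nID_{2} | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ | ⋆ |
| \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) | \(\dots \) |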
Notice that, for simplicity, Table 4 only shows two patient types (‘A’ and ‘B’); entries denoted by ⋆ are aggregates, indicating how many times the nurse nID_{i} was in the same zone as a patient of the specified type (‘A’ or ‘B’) for a period of time within the specified time bin (less than 10 seconds, between 10 and 30 seconds, and so on).
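To make the structure of Table 4 concrete, the following Python sketch counts same-zone overlaps between nurse periods and patient periods and sorts them into time bins. It is a plaintext illustration under assumptions, not a reproduction of Algorithm 1: periods are tuples laid out as in Table 3, the bin boundaries `TB = [10, 30, 60]` follow the bins named in the text, an overlap that falls exactly on a boundary is placed in the higher bin, and the function name `facing_time_table` is hypothetical.

```python
from bisect import bisect_right

TB = [10, 30, 60]  # time bin boundaries in seconds (4 bins), as named in the text

def facing_time_table(nurse_periods, patient_periods, patient_types, tb=TB):
    """Count nurse-patient overlaps per nurse, patient type and time bin.

    Periods are tuples (tag_id, start, end, zone_id, role), as in Table 3;
    for patients, the role carries the patient type.
    """
    n_bins = len(tb) + 1
    table = {}
    for n_id, n_st, n_et, n_zone, _ in nurse_periods:
        row = table.setdefault(n_id, [0] * (n_bins * len(patient_types)))
        for _, p_st, p_et, p_zone, p_type in patient_periods:
            if p_zone != n_zone:
                continue
            # Overlap of the two intervals: min of end times minus max of start times.
            ov = min(n_et, p_et) - max(n_st, p_st)
            if ov > 0:
                ov_bin = bisect_right(tb, ov)  # assumption: boundary values go to the higher bin
                row[patient_types.index(p_type) * n_bins + ov_bin] += 1
    return table

nurses = [("n1", 0, 20, "z1", "nurse"), ("n1", 30, 100, "z2", "nurse")]
patients = [("p1", 5, 50, "z1", "A"), ("p2", 0, 200, "z2", "B")]
table = facing_time_table(nurses, patients, ["A", "B"])
```

In this toy input, nurse `n1` has one 15-second overlap with a type-A patient and one 70-second overlap with a type-B patient, so exactly two entries of the row are non-zero.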
K-means clustering
The computation described above associates each nurse with an array of non-negative integers, where each entry specifies how many interactions of a given length the nurse had with patients of given types (cf. Table 4).
Clustering, a branch of unsupervised machine learning, offers a way to extract valuable information from this data: informally speaking, it allows us to find a partition of the set of nurses into disjoint sets, or clusters, in such a way that ‘similar’ nurses (i.e., with a ‘similar’ associated array) belong to the same cluster, while ‘dissimilar’ nurses belong to different clusters.
We focus on k-means clustering, widely used due to its relative simplicity and applicability to large data sets [33]. The k-means algorithm works as follows: denote by \(\mathbf {y}^{(i)}\in \mathbb {R}^{{m}}\) (where \(m = \textsf {Ntimebins} \cdot \textsf {Nptype}\)) the vector, or data point, associated with the i-th nurse for every \(i=1,\dots ,\textsf {Nn}\), and let \(\mathcal {S}\) denote the list \((\mathbf {y}^{(1)},\dots ,\mathbf {y}^{(\textsf {Nn})})\) (i.e., the list consisting of the rows of Table 4). While various notions of similarity between data points can be defined, k-means clustering typically assumes that a distance d is defined over the vector space the data points belong to; we assume for simplicity that d is the Euclidean distance, which is the most common case in k-means clustering.
Formally, the goal of k-means clustering is to find a partition \((\mathcal {S}_{1}, \dots , \mathcal {S}_{k})\) of the list \(\mathcal {S}\) of data points, i.e., \(\mathcal {S} = \mathcal {S}_{1} \sqcup {\dots } \sqcup \mathcal {S}_{k}\), so as to minimize the quantity \({\sum }_{j=1}^{k} {\sum }_{\mathbf {y} \in \mathcal {S}_{j}}d(\mathbf {y},\boldsymbol {\mu }_{j})^{2}\), where μ_{j} denotes the arithmetic mean of the points belonging to the j-th cluster.
The output of Algorithm 2 does not encompass the centroid values: this is due to our MPC-motivated approach, since the centroids may reveal sensitive information. We also remark that the description of Algorithm 2 only provides a skeleton of the actual k-means algorithm, as it does not specify how to sample the initial centroids, and does not handle some degenerate cases which make the algorithm ill-defined (notably, it implicitly assumes that clusters are never empty). Several approaches are possible to fill these gaps and obtain a fully-fledged specification of k-means; in the following section, we will detail the solution of our choice, highlighting the reasons that led us to select it.
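For orientation, a plaintext k-means skeleton of this kind can be sketched in Python as follows. This is the standard alternation of assignment and mean-update steps, written under our own assumptions rather than copied from Algorithm 2: the initial centroids are passed in by the caller, the number of iterations is fixed, an empty cluster simply keeps its previous centroid, and only the cluster assignment is returned.

```python
def kmeans(points, initial_centroids, iterations):
    """Plain k-means skeleton; returns the cluster index of every point.

    points and initial_centroids are lists of equal-length numeric lists.
    Centroids are deliberately not returned, only the assignment.
    """
    centroids = [list(c) for c in initial_centroids]  # work on a copy
    assignment = []
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance
        # (ties go to the lowest index, because min returns the first minimum).
        assignment = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(y, centroids[j])))
            for y in points
        ]
        # Update step: each centroid becomes the arithmetic mean of its cluster.
        for j in range(len(centroids)):
            members = [y for y, a in zip(points, assignment) if a == j]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment

points = [[0, 0], [0, 1], [10, 10], [10, 11]]
labels = kmeans(points, [[0, 0], [10, 10]], iterations=5)
```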
Secure solution
In order to develop a secure solution, we make use of MPC schemes based on so-called secret sharing techniques. The owner of each entry x of Table 3 uploads this entry as a secret, shared between the hospital and the labour union. We denote the resulting secret-shared value by \(\left \langle x \right \rangle \); such a secret-shared value consists of two shares, x_{1} and x_{2}, held by the hospital and the labour union respectively. The fundamental property of this secret-sharing process is that a single share x_{i} gives no information whatsoever on the original value x, but the two parties can cooperate to perform computations on secret-shared data and, if required, jointly reconstruct the value of a secret-shared element.
The secret-sharing-based framework of our choice, SPDZ (cf. “The MPC framework of our choice: SPDZ”), ensures that our solution is secure under the assumption that the involved parties are restricted to polynomial-time computation, and safeguards the privacy of each party’s input and the correctness of the result even if one of the parties actively cheats and does not follow the instructions of the protocol.^{3}
Our secure solution consists of two steps:
A. A secure computation of the table consisting of facing-time frequencies per nurse (see Table 4).
B. A secure clustering of the nurses, based on this table.
Prior to the secure computation of the table, both parties need to locally transform their RTLS data into a series of time intervals per zone (see also “Constructing the table”), as illustrated in Table 3. Since this does not require combining data of patients and nurses, there is no security issue: parties can perform this processing locally, and we therefore do not further discuss this preliminary step.
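This local preprocessing can be sketched in Python as follows. The sketch makes assumptions that the text does not fix: raw rows are tuples laid out as in Table 2, consecutive readings of a tag in the same zone are merged into one period, a period ends at the last time stamp observed in that zone, and the function name `to_periods` is hypothetical.

```python
from itertools import groupby

def to_periods(rows):
    """Turn raw RTLS rows into per-zone periods.

    rows: iterable of (tag_id, tag_role, time, zone_id), as in Table 2.
    Returns tuples (tag_id, start, end, zone_id, tag_role), as in Table 3.
    """
    periods = []
    rows = sorted(rows, key=lambda r: (r[0], r[2]))  # by tag, then by time
    for tag_id, tag_rows in groupby(rows, key=lambda r: r[0]):
        tag_rows = list(tag_rows)
        _, role, start, zone = tag_rows[0]
        end = start
        for _, _, t, z in tag_rows[1:]:
            if z != zone:
                # Zone changed: close the current period and open a new one.
                periods.append((tag_id, start, end, zone, role))
                start, zone = t, z
            end = t
        periods.append((tag_id, start, end, zone, role))
    return periods

raw = [
    ("n1", "nurse", 0, "A"), ("n1", "nurse", 4, "A"),
    ("n1", "nurse", 8, "B"), ("n1", "nurse", 12, "B"),
    ("p1", "typeA", 0, "A"),
]
periods = to_periods(raw)
```

Each party would run such a routine on its own rows only, so no nurse and patient data are combined at this stage.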
Secure table construction
The input of the first step of the computation is given by a secure variant of Table 3, where all entries have been secret-shared between the two parties. In order to obtain a table of nurse-patient facing times, we need to translate Algorithm 1 to the encrypted domain; namely, we need to specify how all steps of Algorithm 1 can be performed on secret-shared data.
As a first step, we discuss the translation to the encrypted domain of basic operations:
Sum and multiplication: these can be directly computed on secret-shared inputs by secret-sharing-based MPC protocols [12]. The same holds for the sum and the product of a secret-shared input and a public constant.
Secure comparison: securely checking whether a < b for secret-shared values \(\left \langle a \right \rangle \) and \(\left \langle b \right \rangle \) can be performed by any secure comparison protocol [12], given the above basic operations. We do not describe here exactly how this is performed by an MPC protocol, and denote the output of a secure comparison as follows:
$$ \langle (a \overset{?}{<} b) \rangle \text{, where } (a \overset{?}{<} b) = \left\{\begin{array}{ll} 1 & \text{, if } a<b, \\ 0 & \text{, otherwise}. \end{array}\right. $$
Similarly, one can securely compute a secret-shared bit \(\langle (a\overset {?}{\geq } b) \rangle \) that expresses whether a ≥ b, or not.
Minimum and maximum computation: given a secret-sharing of \(\epsilon = (a \overset {?}{<} b)\), the minimum (resp. maximum) of two secret-shared values a and b can be readily computed by means of the above operations:
$$ \left\langle \min(a,b) \right\rangle = \left\langle a \right\rangle \cdot \langle \epsilon \rangle + \left\langle b \right\rangle \cdot \left( 1 - \langle \epsilon \rangle \right), $$
and similarly for the secure maximum function.
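The arithmetic behind the minimum and maximum formulas can be checked with a small Python sketch on plain integers. The function `lt_bit` is only a stand-in for the secure comparison (in an MPC protocol its output bit would remain secret-shared), and the helper names, including `overlap`, are hypothetical.

```python
def lt_bit(a, b):
    """Stand-in for a secure comparison: returns 1 if a < b, and 0 otherwise."""
    return int(a < b)

def oblivious_min(a, b):
    eps = lt_bit(a, b)
    # Branch-free selection: a if eps == 1, b otherwise.
    return a * eps + b * (1 - eps)

def oblivious_max(a, b):
    eps = lt_bit(a, b)
    # Same selection with the roles of a and b swapped.
    return b * eps + a * (1 - eps)

def overlap(st_n, et_n, st_p, et_p):
    """Overlap of a nurse period and a patient period using only min and max."""
    return oblivious_min(et_n, et_p) - oblivious_max(st_n, st_p)

ov = overlap(0, 20, 5, 50)
```

Because the selection uses only sums and products, no step of the computation depends on a branch over secret data.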
Secure k-means clustering
After the above step has been performed, we thus obtain a (secret-shared) table that associates to each nurse a secret-shared data point (vector) \(\left \langle \mathbf {y}^{(i)} \right \rangle \) expressing how many interaction periods of a given length, and with a patient of given type, the i-th nurse had.
To perform secure k-means clustering over secret-shared data, we construct a membership matrix \(\textbf {M} \in \mathbb {N}^{\textsf {Nn} \times {k}}\), where M_{ij} = 1, if the i-th data point belongs to the j-th cluster, and M_{ij} = 0, otherwise. The idea is then to keep M secret-shared, and only to reveal it at the last step of the clustering algorithm.
With this concept in mind, one can then transpose the ‘skeleton’ k-means Algorithm 2 to an MPC setting: the key points of the iterative steps are presented below.
Distance computation: since sums and multiplications can be directly computed, we can securely compute the (secret-shared) value
$$ \left\langle d^{2}\left( \mathbf{y}^{(i)},\mathbf{c}^{(j)} \right) \right\rangle = {\sum}_{\ell=1,\dots,{m}} \left( \left\langle \mathbf{y}^{(i)}_{\ell} \right\rangle - \left\langle \mathbf{c}^{(j)}_{\ell} \right\rangle \right)^{2} $$
for each nurse y^{(i)} and each cluster (with centroid) c^{(j)}.
Cluster assignment: by making use of a secure-comparison subroutine as described in the previous subsection, we can compute for any \(\mathbf {y}^{(i)},\mathbf {c}^{(j)},\mathbf {c}^{(j^{\prime })}\) the following secret-shared value:
$$ \left\langle \xi(i,j,j^{\prime}) \right\rangle:=\left\langle \left( d^{2}\left( \mathbf{y}^{(i)}, \mathbf{c}^{(j^{\prime})}\right) \overset{?}{\geq} d^{2}\left( \mathbf{y}^{(i)},\mathbf{c}^{(j)} \right) \right) \right\rangle $$
We can then set \(\left \langle \textbf {M}_{ij} \right \rangle = {\prod }_{j^{\prime }=1}^{k} \left \langle \xi (i,j,j^{\prime }) \right \rangle \).
Centroid update: assuming the selected MPC protocol has a built-in secure integer-division subroutine (for fixed- or floating-point numbers), we can securely compute the value
$$ \left\langle \mathbf{c}^{(j)} \right\rangle = \frac{{\sum}_{i=1}^{\textsf{Nn}} \left\langle \textbf{M}_{ij} \right\rangle \cdot \left\langle \mathbf{y}^{(i)} \right\rangle}{\left\langle {\sum}_{i=1}^{\textsf{Nn}} \textbf{M}_{ij} \right\rangle} $$
for all \(j=1,\dots ,{k}\). At the end of this section, we show how to avoid this expensive subroutine, and only use basic operations and comparisons instead.
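The three steps can be traced on plain values with the Python sketch below. It mirrors the formulas for the squared distance, the bits ξ(i, j, j′), the membership matrix M and the centroid update, but it runs in the clear: `geq_bit` stands in for the secure comparison, ordinary division stands in for the secure division subroutine, and all function names are hypothetical.

```python
def sq_dist(y, c):
    """Squared Euclidean distance, using only subtractions and multiplications."""
    return sum((yl - cl) ** 2 for yl, cl in zip(y, c))

def geq_bit(a, b):
    """Stand-in for a secure comparison: returns 1 if a >= b, and 0 otherwise."""
    return int(a >= b)

def membership(points, centroids):
    """M[i][j] is the product over j' of xi(i, j, j')."""
    k = len(centroids)
    M = []
    for y in points:
        d = [sq_dist(y, c) for c in centroids]
        row = []
        for j in range(k):
            bit = 1
            for jp in range(k):
                bit *= geq_bit(d[jp], d[j])  # xi(i, j, j')
            row.append(bit)
        M.append(row)
    return M

def update_centroids(points, M, k):
    """c_j = (sum_i M[i][j] * y_i) / (sum_i M[i][j]); fails for an empty cluster."""
    m = len(points[0])
    new = []
    for j in range(k):
        size = sum(row[j] for row in M)
        sums = [sum(M[i][j] * points[i][l] for i in range(len(points))) for l in range(m)]
        new.append([s / size for s in sums])
    return new

points = [[0, 0], [2, 0], [10, 0]]
M = membership(points, [[0, 0], [10, 0]])
centroids = update_centroids(points, M, 2)
tie_row = membership([[5, 0]], [[0, 0], [10, 0]])[0]
```

The last line shows the tie case: a point at equal distance from both centroids receives a 1 in both columns of M.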
The steps above leave the following points open:
1. A method to sample the k initial centroids needs to be specified;
2. The algorithm does not prevent assignment of a data point to several clusters. It would be preferable to assign each point to one cluster only;
3. If a cluster becomes empty, then the algorithm is ill-defined, as it attempts to perform a division by \(|\mathcal {S}_{j}| = {\sum }_{i} \textbf {M}_{ij} =0\). A method to prevent this should be specified;
4. A routine that checks whether the algorithm has converged (i.e., whether the cluster assignment did not change at the last iteration) should be specified.
We now describe our solution to the above issues. Furthermore, we show how we can avoid expensive fixed- or floating-point computation and restrict ourselves to more efficient integer arithmetic.
Sampling initial centroids
Various methods are used in standard k-means clustering to sample the initial centroids, often selecting them among the data points via a randomized choice method. While more involved techniques such as k-means++ [5] can guarantee faster convergence and/or better cluster quality, we opt for a simpler method, which can very efficiently be implemented in a secure way, and which is sufficient for our goal of showing the feasibility of an MPC solution. We thus select the initial centroids by sampling k elements among the data points; this random sampling can be executed, for instance, by the hospital, which should be in charge of deciding the relevant clustering parameters, given that it is the entity interested in the workflow analysis.
Avoiding multiple assignment
As we noticed above, if for a given data point y^{(i)} there are two centroids \(\mathbf {c}^{(j_{1})},\mathbf {c}^{(j_{2})}\) such that \(d(\mathbf {y}^{(i)},\mathbf {c}^{(j_{1})}) = d(\mathbf {y}^{(i)},\mathbf {c}^{(j_{2})}) = \min \limits _{j} (d(\mathbf {y}^{(i)},\mathbf {c}^{(j)}))\), then the algorithm sets \(\textbf {M}_{ij_{1}}=\textbf {M}_{ij_{2}}=1\). It would instead be desirable to assign y^{(i)} to a unique cluster. In order to do this, we simply assign y^{(i)} to the cluster with the lowest index; this can be done securely by setting M_{ij} = 0, if \(\textbf {M}_{ij} \leq \textbf {M}_{ij^{\prime }}\) for some \(j^{\prime }<j\), which can be done via a secure-comparison subroutine.
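The lowest-index rule can be sketched on one row of the membership matrix as follows. This is a plaintext illustration with hypothetical names: `gt_bit` stands in for the secure comparison, and the rule is applied to the original row values, which is our assumption about the order of evaluation.

```python
def gt_bit(a, b):
    """Stand-in for a secure comparison: returns 1 if a > b, and 0 otherwise."""
    return int(a > b)

def break_ties(row):
    """Keep only the lowest-index 1 in a membership row.

    Entry j is kept only if row[j] > row[j'] for every j' < j, i.e. it is
    set to 0 as soon as row[j] <= row[j'] for some j' < j.
    """
    out = []
    for j, m in enumerate(row):
        keep = 1
        for jp in range(j):
            keep *= gt_bit(m, row[jp])
        out.append(m * keep)
    return out

fixed = break_ties([0, 1, 1])
```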
Handling empty clusters
As highlighted above, Algorithm 2 is not guaranteed to be well-defined. Namely, if a cluster becomes empty, the algorithm will attempt to divide by 0 upon computing the new centroid corresponding to that cluster. Once again, several methods are used in (non-secure) k-means clustering to address this problem. Most of these methods take action in case an empty cluster is detected, for instance by assigning a given data point to an empty cluster. This is arguably a suboptimal approach in secure k-means, since it either requires revealing intermediate cluster assignments (which could undermine the security of our solution), or it can lead to increased complexity by checking in a secure way whether there is an empty cluster. We adopt an alternative approach, described in [31]: simply add each centroid to its corresponding cluster. As shown in [31], the convergence time of the algorithm is only slightly increased with this method.
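One way to read this rule is that the previous centroid is counted as an extra member of its cluster, so that the denominator of the centroid update is always at least 1. The Python sketch below implements this reading on plain values; it is our interpretation for illustration, not necessarily the exact formulation of [31], and the function name is hypothetical.

```python
def update_with_centroid(points, M, centroids):
    """Centroid update in which each old centroid counts as a cluster member.

    new c_j = (c_j + sum_i M[i][j] * y_i) / (1 + sum_i M[i][j]),
    so the division is well-defined even when no point is assigned to cluster j.
    """
    new = []
    for j, c in enumerate(centroids):
        size = 1 + sum(row[j] for row in M)
        new.append([
            (c[l] + sum(M[i][j] * points[i][l] for i in range(len(points)))) / size
            for l in range(len(c))
        ])
    return new

points = [[0, 0], [2, 0]]
M = [[1, 0], [1, 0]]  # both points in cluster 0, cluster 1 is empty
updated = update_with_centroid(points, M, [[1, 0], [10, 0]])
```

In this example the empty cluster keeps its centroid, and no branch on the (secret) cluster size is needed.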
Adding a convergence check
As a general rule, the k-means algorithm is supposed to stop only after it has converged, i.e., once the cluster assignments (and centroid values) no longer change. Such a check can be performed in an (almost) oblivious way by means of secure equality; we stress the fact that this is a relatively expensive check, and we thus prefer not to execute it after every iteration of the algorithm. A better alternative is to only run it after the last iteration, or, alternatively, after any fixed number of iterations. In our simulations, we made use of the first alternative.
Altogether, the above subroutines yield a complete specification of a circuit modeling secure k-means clustering.
Improving Efficiency with Integer-Only Computation
An important observation for improving the efficiency of our solution is that the data points \(\mathbf {y}^{(1)},\dots ,\mathbf {y}^{(\textsf {Nn})}\) of nurse-patient interaction periods are vectors with integer-only entries. We can exploit this fact by designing a centroid-update routine that only makes use of secure integer arithmetic (instead of fixed- or floating-point arithmetic), significantly improving the efficiency of secure k-means clustering. Notice that integer arithmetic can be readily simulated by choosing a large enough integer M and then embedding \(\mathbb {Z}\cap [-M,M]\) into a prime field \(\mathbb {F}_{p}\) for any p > 2M; in contrast, simulating fixed- and floating-point arithmetic in a finite field is a more involved and computationally expensive process.
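The embedding of bounded integers into a prime field can be illustrated with the following Python sketch. The bound `BOUND` (playing the role of the integer M in the text, renamed here to avoid confusion with the membership matrix) and the prime `PRIME` are toy values chosen for illustration, not the parameters of our implementation.

```python
BOUND = 1000   # plays the role of the bound M: all values lie in [-BOUND, BOUND]
PRIME = 2003   # a prime p with p > 2 * BOUND

def encode(x, p=PRIME):
    """Map an integer in [-BOUND, BOUND] to its residue in {0, ..., p - 1}."""
    return x % p

def decode(v, p=PRIME):
    """Map a field element back to the representative of smallest absolute value."""
    return v if v <= p // 2 else v - p

# Field arithmetic agrees with integer arithmetic while results stay within the bound.
s = decode((encode(-7) + encode(5)) % PRIME)   # corresponds to -7 + 5
t = decode((encode(-3) * encode(4)) % PRIME)   # corresponds to -3 * 4
```

If an intermediate result leaves the interval [-BOUND, BOUND], decoding may no longer return the integer result, which is why the bound has to be chosen large enough for the whole computation.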
The above steps thus form a fully-fledged and efficient secure k-means clustering algorithm, which we believe to be of independent interest as well.
Implementation and results
We describe in this section our implementation of the secure solution of “Secure solution”, and present some evaluation of its performance.
The MPC framework of our choice: SPDZ
We chose to use SPDZ [17, 18], a recent secret-sharing-based MPC platform of celebrated efficiency. A software suite for UNIX systems based on the SPDZ platform is publicly available [13, 16];^{4} we used this suite to implement our secure solution for workflow analysis.
SPDZ has built-in functionalities for secure comparison, and can thus be used to implement the building blocks described in “Secure table construction”. SPDZ needs to produce some raw material in a precomputation phase in order to securely evaluate these functionalities; however, this precomputation is independent of the actual function to be computed and of the secret inputs, and can thus be executed by the two parties during idle time. For this reason, we neglect preprocessing when measuring the performance of our solution.
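To illustrate why the precomputed material is independent of the inputs, the following toy sketch shows multiplication of additively secret-shared values using a multiplication (Beaver) triple. It is a simplification under stated assumptions: both parties are simulated in one process, the triple comes from a hypothetical trusted dealer (`make_triple`), and the authentication tags that SPDZ attaches to shares are omitted. The modulus is likewise chosen for illustration.

```python
import secrets
from typing import Tuple

P = 2**61 - 1  # illustrative prime modulus (assumption)

Shares = Tuple[int, int]  # (share of party 0, share of party 1)


def share(x: int) -> Shares:
    """Split x into two additive shares modulo P."""
    r = secrets.randbelow(P)
    return r, (x - r) % P


def reconstruct(sh: Shares) -> int:
    """Open a shared value by adding both shares."""
    return (sh[0] + sh[1]) % P


def make_triple() -> Tuple[Shares, Shares, Shares]:
    """Offline phase (hypothetical dealer): shares of random a, b and of c = a*b.

    Nothing here depends on the inputs or on the function to be evaluated.
    """
    a, b = secrets.randbelow(P), secrets.randbelow(P)
    return share(a), share(b), share(a * b % P)


def beaver_multiply(x: Shares, y: Shares,
                    triple: Tuple[Shares, Shares, Shares]) -> Shares:
    """Online phase: compute shares of x*y while consuming one triple."""
    a, b, c = triple
    # The masked differences d = x - a and e = y - b are opened; they reveal
    # nothing about x and y because a and b are uniformly random.
    d = reconstruct(((x[0] - a[0]) % P, (x[1] - a[1]) % P))
    e = reconstruct(((y[0] - b[0]) % P, (y[1] - b[1]) % P))
    # x*y = c + d*b + e*a + d*e; the public term d*e is added by party 0 only.
    z0 = (c[0] + d * b[0] + e * a[0] + d * e) % P
    z1 = (c[1] + d * b[1] + e * a[1]) % P
    return z0, z1


triple = make_triple()                       # can be prepared during idle time
z = beaver_multiply(share(6), share(7), triple)
result = reconstruct(z)
```

Each secure multiplication consumes one such triple, which is why the online phase can be fast once enough triples have been stockpiled.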
Setup
To test the efficiency of the algorithms we developed, we ran several simulations on two physically separated machines, representing the hospital and the labour union, respectively. Both machines were equipped with a 3.5 GHz Intel i7-7567U CPU and 32 GB of RAM, and were connected to each other via a 1 Gbit/s wired network. The SPDZ protocol was instantiated with 40-bit statistical security, 128-bit computational security and a 64-bit prime field.
Performance results
Several simulations were run in order to measure the efficiency and scalability of both phases of our secure solution in the above setup. We sampled artificial data for these simulations, made to resemble a realistic size of a hospital department and realistic behavior of nurses [38]: we considered a fixed number of 15 zones and a total study time of one hour, in which tracking information was produced every 4 seconds. We assumed that nurses remain in the same zone for up to 120 seconds, while patients can remain in the same zone for the entire hour. Accordingly, we considered 4 time bins, namely 0 to 10 seconds, 10 to 30 seconds, 30 to 60 seconds, and more than 60 seconds.
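The binning of interaction durations described above can be sketched in the clear as follows. The treatment of boundary values (a duration of exactly 10, 30 or 60 seconds falls into the lower bin) and all names are assumptions made for illustration; the secure protocol performs the corresponding comparisons on secret-shared values.

```python
from bisect import bisect_left
from typing import Iterable, List

TRACKING_INTERVAL_S = 4          # one location sample every 4 seconds
BIN_UPPER_EDGES_S = [10, 30, 60]  # bins: 0-10, 10-30, 30-60, more than 60 seconds


def duration_from_ticks(n_ticks: int) -> int:
    """Duration in seconds covered by n consecutive tracking samples."""
    return n_ticks * TRACKING_INTERVAL_S


def time_bin(duration_s: int) -> int:
    """Index (0..3) of the time bin; upper edges are inclusive (assumption)."""
    return bisect_left(BIN_UPPER_EDGES_S, duration_s)


def frequency_vector(durations_s: Iterable[int]) -> List[int]:
    """Count how many interaction periods fall into each of the 4 bins."""
    counts = [0] * (len(BIN_UPPER_EDGES_S) + 1)
    for d in durations_s:
        counts[time_bin(d)] += 1
    return counts


# Example: seven interaction periods of one nurse, in seconds.
vector = frequency_vector([4, 8, 12, 28, 44, 64, 120])
```

The resulting count vectors have integer-only entries, which is the property exploited by the integer-only centroid update.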
We measured the elapsed computation time and the communication cost while varying either the number of patient types (3, 5, 10), considering a fixed number of 7 patients per patient type, or the number of nurses (5, 12, 30, 60, 120). We also investigated the effect of increasing the total number of clusters, considering 2, 5 and 10 clusters, while fixing at 5 the number of iterations of the k-means clustering protocol.
In Fig. 1 we varied the number of nurses, while fixing at 5 the number of patient types. We observe that the computation time of the first phase grows linearly in the number of nurses; this matches our expectations, since theoretically the complexity of this phase scales linearly with the number of nurse time periods, which in turn grows linearly with the number of nurses in our simulations. Also notice that the computation time of this phase is independent of the number of clusters, as this number only plays a role in the second phase of the protocol. The slight variation in the timing results of the first phase is explained by the fact that new artificial data is generated for each experiment, so that the total number of patient time periods varies. Finally, the timing results indicate that the computation time of the second phase scales linearly in the number of clusters and in the number of nurses.
In Fig. 2 we varied the number of patient types, keeping the number of nurses fixed at 60. We note that the computation time of the first phase grows linearly with the total number of patients, again slightly fluctuating due to the fact that the total number of time periods (of patients and nurses) slightly varies per experiment. The computation time of the second phase scales linearly in the number of clusters and in the number of patient types.
Runtime (seconds) and exchanged data (megabytes), 5 clusters
Patients | Measure | 7 nurses | 12 nurses | 30 nurses | 60 nurses | 120 nurses
21 patients | time (s) | 108 | 160 | 310 | 564 | 1072
21 patients | comm. (MB) | 47 | 95 | 233 | 499 | 964
35 patients | time (s) | 154 | 212 | 422 | 816 | 1677
35 patients | comm. (MB) | 90 | 143 | 335 | 677 | 1496
70 patients | time (s) | 241 | 384 | 768 | 1530 | 2912
70 patients | comm. (MB) | 166 | 297 | 657 | 1338 | 2657
Inspecting the pseudocode of our solution readily confirms that the observed linearity in the timing results is as expected. Finally, we note that the benchmarks described in this section were obtained with an implementation that still has plenty of room for efficiency improvement. Future development on these aspects could, for instance, benefit from further parallelization within both phases of the protocol, use of high-performance computing machines, or implementation in low-level, very fast programming languages such as C.
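As a rough sanity check of the linear trend, one can fit a straight line to the runtimes reported in the table for 21 patients. The sketch below uses an ordinary least-squares fit in plain Python; the nurse counts are taken as printed in the table header, and the helper name `linear_fit` is an assumption introduced for this illustration.

```python
from typing import Sequence, Tuple


def linear_fit(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Ordinary least-squares line y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Values from the table row "21 patients", 5 clusters.
nurses = [7, 12, 30, 60, 120]
runtime_s = [108, 160, 310, 564, 1072]

slope, intercept, r_squared = linear_fit(nurses, runtime_s)
print(f"about {slope:.2f} s per additional nurse, R^2 = {r_squared:.4f}")
```

A coefficient of determination close to 1 is consistent with linear scaling in the number of nurses; with only five measurements per row, it should be read as an indication rather than as a statistical proof.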
Conclusion
We proposed a novel approach to analyze the joined location data of patients and staff in a hospital, by means of an innovative cryptographic technique called Secure Multi-Party Computation. In a joint protocol, the hospital and the labour union securely cluster the staff members by means of the frequency of their patient-facing times.
In the first step, a table is securely constructed that contains, for each nurse, a secret frequency distribution of his or her patient-facing times. In the second step, this table is used to cluster the nurses into similar groups. Although this secure k-means clustering algorithm is used here for optimizing the workflow in a hospital, it could be used in many different domains where sensitive data needs to be clustered.
We described the secure protocol in detail, and evaluated its performance, thereby demonstrating the feasibility of our approach: it takes less than half an hour to securely cluster 120 nurses, who take care of 35 patients in 15 different zones, given location data of one hour and a tracking frequency of 4 seconds. While speed was not a factor of capital importance for our solution, given that data analysis does not need to be performed in real time, we believe that the good performance obtained by our protocol paves the way for more advanced data analysis techniques to optimize the workflow in a hospital.
Towards a fully operational deployment, however, some points need to be addressed. Notably, our solution was not tested on real data, given that even obtaining retrospective data would require individual consent from the involved staff members and patients; for operational deployment, however, this step will be necessary, in order to properly assess the impact of the data analysis. Moreover, while k-means clustering was a natural choice for a demonstrator due to its ubiquity and relative conceptual simplicity, several other machine-learning techniques could be securely implemented with our approach. This means that an appropriate evaluation and comparison of the various possibilities will have to be performed.
Footnotes
 1.
We remark that our solution can also accommodate several labour unions, by means of a natural extension of the steps described in “Constructing the table” and “Secure table construction”.
 2.
Thus ‘sharing’ is here by no means a synonym of ‘revealing’: secret-sharing can actually be seen as a strong form of distributed encryption.
 3.
In this case, however, it is not guaranteed that the honest party will obtain output: they might only detect that cheating occurred, and have at that point no other option than to abort the protocol.
 4.
Support for the SPDZ-2 implementation is being discontinued; development has shifted to the SCALE-MAMBA platform, which is also based on the SPDZ protocol.
Notes
Acknowledgements
The research activities that have led to this paper were partly funded by the PPS-surcharge for Research and Innovation of the Dutch Ministry of Economic Affairs and Climate Policy. This project has received funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No 780495, and from the ERC advanced investigator grant 740972 (ALGSTRONGCRYPTO).
The authors would like to thank Meilof Veeningen, Peter van Liesdonk, Thomas Attema and Mark Abspoel for their valuable help in developing and implementing the solution described in this paper.
Compliance with Ethical Standards
Conflict of interests
The authors declare that they have no conflict of interest.
References
 1.Abouelmehdi, K., BeniHessane, A., and Khaloufi, H., Big healthcare data: preserving security and privacy. Journal of Big Data 5(1):1, 2018. https://doi.org/10.1186/s4053701701107.
 2.Abouelmehdi, K., BeniHssane, A., Khaloufi, H., and Saadi, M., Big data security and privacy in healthcare: A review. Procedia Computer Science 113:73–80, 2017. https://doi.org/10.1016/j.procs.2017.08.292. http://www.sciencedirect.com/science/article/pii/S1877050917317015. The 8th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN 2017) / The 7th International Conference on Current and Future Trends of Information and Communication Technologies in Healthcare (ICTH2017) / Affiliated Workshops.
 3.Almutairi, N., Coenen, F., and Dures, K.: Kmeans clustering using homomorphic encryption and an updatable distance matrix: Secure third party data clustering with limited data owner interaction. In: Big data analytics and knowledge discovery  19th international conference, DaWaK 2017, Lyon, France, August 2831, 2017, Proceedings, pp. 274–285. https://doi.org/10.1007/9783319642833_20, 2017
 4.Arora, D., Kumar, U., et al.: Implications of privacy preserving kmeans clustering over outsourced data on cloud platform. Journal of Theoretical & Applied Information Technology 96(12), 2018
 5.Arthur, D., and Vassilvitskii, S.: kmeans++: the advantages of careful seeding. In: Proceedings of the Eighteenth Annual ACMSIAM Symposium on Discrete Algorithms, SODA 2007, New Orleans, Louisiana, USA, January 79, 2007, pp. 1027–1035. http://dl.acm.org/citation.cfm?id=1283383.1283494, 2007
 6.Baek, H.: Lessons learned from adopting rtlsbased asset tracking system in a tertiary hospital. In: AMIA 2016, American medical informatics association annual symposium, Chicago, IL, USA, November 1216, 2016. http://knowledge.amia.org/amia633001.3360278/t0051.3362920/f0051.3362921/25000421.3364425/24990291.3364420, 2016
 7.Barni, M., Failla, P., Kolesnikov, V., Lazzeretti, R., Sadeghi, A., and Schneider, T.: Secure evaluation of private linear branching programs with medical applications. In: Backes, M., and Ning, P. (Eds.) Computer security  ESORICS 2009, 14th European symposium on research in computer security, SaintMalo, France, September 2123, 2009. Proceedings, lecture notes in computer science, vol. 5789, pp. 424–439. Springer. https://doi.org/10.1007/9783642044441_26, 2009.
 8.Bendavid, Y., Rfidenabled realtime location system (RTLS) to improve hospital’s operations management: An uptodate typology. I. J. RF Technol.: Res. Appl. 5(34):137–158, 2013. https://doi.org/10.3233/RFT130056.
 9.Benin, A., Fenick, A., Herrin, J., Vitkauskas, G., Chen, J., and Brandt, C., How good are the data? feasible approach to validation of metrics of quality derived from an outpatient electronic health record. Am. J. Med. Qual. 26:441–51, 2011.
 10.Beye, M., Erkin, Z., and Lagendijk, R.L.: Efficient privacy preserving kmeans clustering in a threeparty setting. In: 2011 IEEE International Workshop on Information Forensics and Security, WIFS 2011, Iguacu Falls, Brazil, November 29  December 2, 2011, pp. 1–6. https://doi.org/10.1109/WIFS.2011.6123148, 2011
 11.Bonte, C., Makri, E., Ardeshirdavani, A., Simm, J., Moreau, Y., and Vercauteren, F.: Privacypreserving genomewide association study is practical. Cryptology ePrint Archive, Report 2017/955. https://eprint.iacr.org/2017/955, 2017
 12.Bristol, U.: Multiparty computation with spdz, mascot, and overdrive offline phases, github repository. https://github.com/brystolcrypto/SPDZ2
 13.Bristol Crypto: Spdz2: Multiparty computation with spdz, mascot, and overdrive offline phases. https://github.com/bristolcrypto/SPDZ2 (2016–2018)
 14.Bunn, P., and Ostrovsky, R.: Secure twoparty kmeans clustering. In: Proceedings of the 2007 ACM conference on computer and communications security, CCS 2007, Alexandria, Virginia, USA, October 2831, 2007, pp. 486–497. https://doi.org/10.1145/1315245.1315306, 2007
 15.Chida, K., Morohashi, G., Fuji, H., Magata, F., Fujimura, A., Hamada, K., Ikarashi, D., and Yamamoto, R., Implementation and evaluation of an efficient secure computation system using ’R’ for healthcare statistics. Journal of the American Medical Informatics Association 21(e2):e326–e331, 2014. https://doi.org/10.1136/amiajnl2014002631. https://academic.oup.com/jamia/articlelookup/doi/10.1136/amiajnl2014002631.
 16.COSIC KU Leuven: Secure computation algorithms from leuven (scale) and multiparty algorithms basic argot (mamba). https://github.com/KULeuvenCOSIC/SCALEMAMBA 2018
 17.Damgård, I., Keller, M., Larraia, E., Pastro, V., Scholl, P., and Smart, N.P.: Computer Security  ESORICS 2013  18th European Symposium on Research in Computer Security, Egham, UK, September 913, 2013. Proceedings, Lecture Notes in Computer Science, vol. 8134, pp. 1–18. Springer. In: Crampton, J., Jajodia, S., and Mayes, K. (Eds.) https://doi.org/10.1007/9783642402036_1, 2013.
 18.Damgård, I., Pastro, V., Smart, N.P., and Zakarias, S.: Multiparty computation from somewhat homomorphic encryption. In: Advances in Cryptology  CRYPTO 2012  32nd annual cryptology conference, Santa Barbara, CA, USA, August 1923, 2012. Proceedings, pp. 643–662. https://doi.org/10.1007/9783642320095_38, 2012
 19.D’Souza, I., Ma, W., and Notobartolo, C., Realtime location systems for hospital emergency response. IT Professional 13(2):37–43, 2011.
 20.El Emam, K., Hu, J., Mercer, J., Peyton, L., Kantarcioglu, M., Malin, B., Buckeridge, D., Samet, S., and Earle, C., A secure protocol for protecting the identity of providers when disclosing data for disease surveillance. Journal of the American Medical Informatics Association 18(3):212–217, 2011. https://doi.org/10.1136/amiajnl2011000100. https://academic.oup.com/jamia/articlelookup/doi/10.1136/amiajnl2011000100.
 21.Erkin, Z., Veugen, T., Toft, T., and Lagendijk, R.L.: Privacypreserving user clustering in a social network. In: First IEEE international workshop on information forensics and security, WIFS 2009, London, UK, December 69, 2009, pp. 96–100. IEEE. https://doi.org/10.1109/WIFS.2009.5386476, 2009
 22.Erkin, Z., Veugen, T., Toft, T., and Lagendijk, R.L.: Privacypreserving distributed clustering. EURASIP J. Information Security 2013, 4. https://doi.org/10.1186/1687417X20134, 2013
 23.Greco, G., Guzzo, A., Pontieri, L., and Sacca, D.: Mining expressive process models by clustering workflow traces. In: PacificAsia Conference on Knowledge Discovery and Data Mining, pp. 52–62. Springer, 2004.
 24.Hogan, W. R., and Wagner, M. M., Accuracy of data in computerbased patient records. J. Am. Med. Inform. Assoc. 4(5):342–355, 1997.
 25.Keller, M., Pastro, V., and Rotaru, D.: Advances in Cryptology  EUROCRYPT 2018  37th Annual International Conference on the Theory and Applications of Cryptographic Techniques, Tel Aviv, Israel, April 29  May 3, 2018 Proceedings, Part III, Lecture Notes in Computer Science, vol. 10822, pp. 158–189. Springer. In: Nielsen, J.B., and Rijmen, V. (Eds.) https://doi.org/10.1007/9783319783727_6, 2018.
 26.Lindell, Y., and Pinkas, B.: Secure multiparty computation for privacypreserving data mining. Cryptology ePrint Archive, Report 2008/197. https://eprint.iacr.org/2008/197, 2008
 27.Liu, D., Bertino, E., and Yi, X.: Privacy of outsourced kmeans clustering. In: 9th ACM symposium on information, computer and communications security, ASIA CCS ’14, Kyoto, Japan  June 03  06, 2014, pp. 123–134. https://doi.org/10.1145/2590296.2590332, 2014
 28.MiHIA – Michigan Health Improvement Alliance : What is the quadruple aim? (2016). https://www.mihia.org/index.php/quadaim/whatisthequadaim
 29.Nara, A., Izumi, K., Iseki, H., Suzuki, T., Nambu, K., and Sakurai, Y.: Trajectory data mining for surgical workflow analysis
 30.Oude Weernink, C., Felix, E., Verkuijlen, P., Dierickvan Daele, A., Kazak, J., and van Hoof, J., Realtime location systems in nursing homes: state of the art and future applications. Journal of Enabling Technologies 12(2):45–56, 2018. https://doi.org/10.1108/JET1120170046. https://www.emeraldinsight.com/doi/10.1108/JET1120170046.
 31.Pakhira, M. K., A modified kmeans algorithm to avoid empty clusters. International Journal of Recent Trends in Engineering 1:220–226, 2009.
 32.Park, J., and Lee, D. H.: Privacy preserving knearest neighbor for medical diagnosis in ehealth cloud. Journal of Healthcare Engineering 2018, 2018
 33.ShalevShwartz, S., and BenDavid, S., Understanding machine learning: From theory to algorithms. New York: Cambridge University Press, 2014.
 34.Smith, P., ArayaGuerra, R., Bublitz, C., Parnes, B., Dickinson, L., Vorst, R. V., Westfall, J., and Pace, W., Missing clinical information during primary care visits. JAMA 293(5):565–71, 2005.
 35.Song, M., Günther, C. W., and Van der Aalst, W. M.: Trace clustering in process mining. In: International conference on business process management, pp. 109–120. Springer, 2008.
 36.The Economist: The hawthorne effect (2008). https://www.economist.com/news/2008/11/03/thehawthorneeffect
 37.Ward, M., Self, W., and Froehle, C., Effects of common data errors in electronic health records on emergency department operational performance metrics: A monte carlo simulation. Acad. Emerg. Med. 22(9):1085–92, 2015.
 38.Westbrook, J., Duffield, C., Li, L., and Creswick, N.J.: How much time do nurses have for patients? a longitudinal study quantifying hospital nurses’ patterns of task time distribution and interactions with health professionals. BMC Health Services Research 11. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3238335/, 2011
 39.Yao, A.C.: Protocols for secure computations (extended abstract). In: 23rd annual symposium on foundations of computer science, Chicago, Illinois, USA, 35 November 1982, pp. 160–164. IEEE Computer Society. https://doi.org/10.1109/SFCS.1982.38, 1982
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.