Enhancing Transparency with Distributed Privacy-Preserving Logging
Transparency of data processing is often a requirement for compliance with legislation and/or business requirements. Furthermore, it has been recognised as a key privacy principle, for example in the European Data Protection Directive. At the same time, transparency of the data processing should be limited to the users involved, in order to minimise the leakage of sensitive business information and to protect the privacy of the employees (if any) performing the data processing.
We propose a cryptographic logging solution, making the resulting log data publicly accessible, that can be used by data subjects to gain insight into the data processing that takes place on their personal data, without disclosing any information about data processing on other users’ data. Our proposed solution can handle arbitrary distributed processes, dynamically continuing the logging from one data processor to the next. Committing to the logged data is irrevocable, and results in log data whose integrity can be verified by the data subject, the data processor and a third party. Moreover, our solution allows data processors to offload storage and interaction with users to dedicated log servers. Finally, we show that our scheme is applicable in practice, providing performance results for a prototype implementation.
Transparency is recognised as a key privacy principle, e.g., in the EU Data Protection Directive 95/46/EC Articles 7, 10, and 11; and in the Swedish Patient Data Act ("Patientdatalagen") SFS (2008:355). The Swedish Patient Data Act states that every patient has the right to see who has accessed their electronic healthcare record (EHR), i.e., access logs to EHRs have to be kept and made available to patients. This kind of transparency of data processing is often a requirement for compliance with legislation and/or business requirements, in healthcare as well as in other sectors, e.g., bookkeeping in the financial sector. Transparency of data processing, in general, may increase end-users’ trust in the data processor¹, especially if the data processing is distributed as in cloud computing [KJM11]. The need for building trust is also a major reason why transparency towards citizens is a key element of eGovernment services [UN12].
When attempting to make data processing transparent and determining what an adequate level of transparency is, there are a number of social and economic issues that need to be taken into account beyond (sometimes purely) technical considerations of what to make transparent to whom. For example, employees of a data processor may experience the requirement of transparency as a breach of their own privacy [Robe09]. For data processors, as recognised in recital 41 of the EU Data Protection Directive, too detailed descriptions of data processing risk revealing business-sensitive information of the data processor, such as trade secrets.
Most commercially deployed logging systems operate on a single system and focus on a single goal: providing deep and fast analysis of massive amounts of log data for Security Information and Event Management (SIEM) purposes, e.g., to detect system malfunctions and security breaches. Some of these logging systems include cryptographic methods to validate the log’s integrity in a simple way.
Our proposed logging scheme focuses on which guarantees and services can be delivered to the end-user of a data processor, based on the events that this data processor logs for its users. Furthermore, by only allowing the user whose personal data is being processed to read the logged information, the negative effects of transparency (e.g., logs being misused to monitor the employees’ performance) are minimised. Compared to commercially available logging systems, this is more than a shift of focus; it introduces very challenging questions about trust, privacy and confidentiality. Adapting existing logging systems to answer these questions is far from trivial, as these questions touch the core of the logging system.
The paper is structured as follows: in Section 2, we present our design goals and rationale. Section 3 introduces our proposed logging scheme. We present a performance evaluation of our prototype implementation in Section 4. Our solution is compared to related work in Section 5. Finally, Section 6 provides concluding remarks.
2 Design Goals and Rationale
When implementing a log system for transparency of data processing, there are several privacy and security issues that need to be addressed. Take for example the case of electronic healthcare records: knowing who has accessed a patient’s medical records is sensitive information. In fact, even knowing that a person has been a patient at a particular medical institution may be sensitive. So, not only the content of the log entries is sensitive information, but also merely the fact that log entries exist for a certain individual (a patient in this example) can be sensitive information. Furthermore, imagine the case of an attacker (such as malicious medical personnel) illegitimately accessing EHRs. In such a case, the offenders are likely to attempt to cover the traces of their actions, for example by deleting the generated log entries. Therefore, alterations to log entries need to be detectable. Similar arguments can be made for the case of business-sensitive information being logged for a company.
The content of log entries is submitted by the data processor to a log server. The logging of data can be completely outsourced and aims to minimise impact on existing processing infrastructure. Coupling of the process to the logging is loose. Log entries are fully confidential and hold identification and authentication metadata that allow only the end user concerned to identify the log entries related to a certain process, and to check the integrity of the logged data. The data processor ensures the confidentiality of the data to be logged, also keeping these data confidential for the log server. The log server adds the identifying and authentication metadata, both for the end user and data processor, ensuring the integrity of the log trail. The log process is also auditable: a log server can be forced to reproduce the entire set of log entries related to a user or to the data processor that generated the data to be logged. In other words, neither data processor nor log server needs to be completely trusted.
The log server keeps a state that is updated each time a log entry is created in such a way that it is hard to recover the previous state values once an update is complete. This is the mechanism at the core of our prior to compromise adversary model: we assume that log servers (and data processors) are initially trusted and will at some point in time t become compromised by an adversary. Due to the fact that the state kept by log servers is continuously updated as log entries are created, an adversary is unable to reconstruct prior states needed for successfully manipulating log entries created prior to compromising the log server. Figure 2 illustrates the interplay between log entries, the log server’s state, and our adversarial model.
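The one-way state update behind this adversary model can be illustrated with a minimal Go sketch. It assumes that the authentication key is evolved by hashing it with SHA-256; this update rule and the name evolveKey are illustrative assumptions, not the exact construction of our scheme.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// evolveKey returns the next authentication key as SHA-256(current key).
// Assumption: a plain hash serves as the one-way update, so an adversary who
// learns the current key cannot recompute any earlier key.
func evolveKey(ak []byte) []byte {
	next := sha256.Sum256(ak)
	return next[:]
}

func main() {
	ak := []byte("initial authentication key (example value)")
	for i := 0; i < 3; i++ {
		ak = evolveKey(ak) // the previous key is discarded after each entry
		fmt.Printf("key prefix after entry %d: %x\n", i+1, ak[:8])
	}
}
```

In a deployment the previous key would additionally have to be erased from memory and storage; the sketch only shows the direction of the update.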
To ensure the “anonymity” of the logged data, we need more than just confidentiality, especially if the logged data is to be served publicly. It should be impossible for attackers, given only the logged data, to link together multiple entries concerning the same end user.
Finally, we need to take another important aspect of data processing into account, namely that it is often distributed. Data concerning users may be shared by the data processor with other data processors for additional processing. Also, each data processor can be logging to different log servers. This means that there should be support for distributed processes, such that these can be logged and the logging scheme itself can also be distributed across several log servers. An example of such a distributed setting is depicted in Fig. 3. Alice discloses data to Bob, the initial data processor. Bob then shares (part of) Alice’s data with the downstream data processor Charlie, who in turn shares (part of) Alice’s data with data processors Dave and Eve. While data processors Bob, Charlie, Dave and Eve process Alice’s data, all of them continuously log descriptions of their processing (dashed lines) to their, potentially different, log servers. Alice can later reconstruct the log trail of the data processing on her data.
For distributed processes, the user identifiers used across the different data processors should also be unlinkable, to ensure maximal anonymity of the logged data.
To conclude, a logging system for transparency should have the following security and privacy properties:
Forward-Integrity: any changes (including deletion²) of entries committed to the log prior to the log server’s compromise can be detected.
Confidentiality: given a log entry, only the user the log entry concerns can read the logged data.
Unlinkability of Log Entries: given the log and the current state of the log server, no two entries in this log that relate to the same user (or data processor when multiple data processors are using the same log server) can be linked.
Support for Distributed Processes: logging for the user continues when going from one data processor to the next, and the log trail can be reconstructed across the different data processors.
Unlinkability of User Identifiers: in case of distributed processes, user identifiers used across multiple data processors are unlinkable.
3 Our Logging Scheme
In this section we present our logging scheme. First we give an overview of the internals of the log server and then we discuss the most important logging operations.
3.1 Log Server
The log server in our logging scheme, depicted in Fig. 4, stores all log entries and keeps state information, which is updated with each new log entry, for each registered user and data processor.
A log entry consists of five fields:
Data: The data field contains the actual data to be logged in an encrypted form, such that only the user can derive the plaintext.
IC(U): The index chain field for the user serves as an identifier for the log entry for the user. The values of this field create a chain that links all log entries for the user together. Only the user can reconstruct this chain.
DC(U): The data chain field for the user allows the user to verify the validity of this log entry. All entries that were created for this user are chained together, leading to cumulative verification.
IC(P): The index chain field for the data processor.
DC(P): The data chain for the data processor.
A state consists of four fields:
ID: The identity of the user/data processor. This identity also serves as a public encryption key for the user/data processor, for which they have the corresponding private decryption key.
DC: The current data chain intermediate for this user/data processor. This intermediate will be used while constructing the next log entry for this user/data processor.
IC: The current index chain intermediate for this user/data processor. This intermediate will be used while constructing the next log entry for this user/data processor.
AK: The current authentication key for this user/data processor. This value will be used while creating the next log entry for this user/data processor. This value will also be used to update the state (DC, IC and AK fields) for this user/data processor.
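The log entry and state layouts described above map directly onto two record types. The following Go sketch shows one possible representation; the type and field names, and the choice to key the states by identity in a map, are assumptions made for illustration.

```go
package main

import "fmt"

// LogEntry mirrors the five fields of a log entry.
type LogEntry struct {
	Data []byte // ciphertext that only the user can decrypt
	ICU  []byte // IC(U): index chain field for the user
	DCU  []byte // DC(U): data chain field for the user
	ICP  []byte // IC(P): index chain field for the data processor
	DCP  []byte // DC(P): data chain field for the data processor
}

// State mirrors the four fields kept per registered user or data processor.
type State struct {
	ID []byte // identity, which is also the public encryption key
	DC []byte // current data chain intermediate
	IC []byte // current index chain intermediate
	AK []byte // current authentication key
}

// LogServer stores all log entries and one state per identity.
// Assumption: states are looked up by the identity bytes used as a map key.
type LogServer struct {
	Entries []LogEntry
	States  map[string]*State
}

// NewLogServer returns an empty log server.
func NewLogServer() *LogServer {
	return &LogServer{States: make(map[string]*State)}
}

func main() {
	srv := NewLogServer()
	srv.States["user-public-key"] = &State{ID: []byte("user-public-key")}
	srv.Entries = append(srv.Entries, LogEntry{Data: []byte("ciphertext")})
	fmt.Println(len(srv.States), "state,", len(srv.Entries), "entry")
}
```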
3.2 Start Logging
Before a data processor can start logging data for a user, the user needs to be set up at the log server. The user will present his identity to the data processor, which in turn passes it on to the log server. The log server initialises the user’s state for this identifier and returns the initial authentication key to the user through the data processor. The user will need this initial authentication key to reconstruct his log trail (see section 3.5).
To keep the initial authentication key hidden from the data processor, it is encrypted under the user’s identity, which also serves as public encryption key in our scheme. The user knows the corresponding private decryption key and can thus obtain the initial authentication key. To guarantee the origin of the initial authentication key, it is signed by the log server prior to encryption.
3.3 Creating Log Entries
When a data processor performs processing on a user’s disclosed data, it logs a description of the processing to the log trail of the user located at the log server used by the data processor. The data processor first signs the data to log (to prove the origin to the user) and then encrypts the data and signature under the identity (public encryption key) of the user. Next, the data processor sends the resulting ciphertext, together with the user identifier, to the log server, which creates a log entry for this user.
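The sign-then-encrypt step at the data processor can be sketched in Go as follows. The sketch is not ECIES as used in our prototype: it assumes a simplified hybrid construction (ephemeral ECDH on P-256, SHA-256 as a stand-in key derivation function, AES-256-GCM) built from the Go standard library. All function names and the plaintext layout are hypothetical.

```go
package main

import (
	"crypto/aes"
	"crypto/cipher"
	"crypto/ecdh"
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/sha256"
	"errors"
	"fmt"
)

// aeadFor runs ECDH and derives AES-256-GCM (assumption: SHA-256 as the KDF).
func aeadFor(priv *ecdh.PrivateKey, pub *ecdh.PublicKey) (cipher.AEAD, error) {
	shared, err := priv.ECDH(pub)
	if err != nil {
		return nil, err
	}
	key := sha256.Sum256(shared)
	block, err := aes.NewCipher(key[:])
	if err != nil {
		return nil, err
	}
	return cipher.NewGCM(block)
}

// signThenEncrypt signs msg, then encrypts signature and message for the user.
func signThenEncrypt(dp *ecdsa.PrivateKey, user *ecdh.PublicKey, msg []byte) ([]byte, error) {
	digest := sha256.Sum256(msg)
	sig, err := ecdsa.SignASN1(rand.Reader, dp, digest[:])
	if err != nil {
		return nil, err
	}
	// Plaintext layout (assumed): 1-byte signature length | signature | message.
	plain := append(append([]byte{byte(len(sig))}, sig...), msg...)
	eph, err := ecdh.P256().GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	gcm, err := aeadFor(eph, user)
	if err != nil {
		return nil, err
	}
	// The all-zero nonce is tolerable only because every key is used once.
	return gcm.Seal(eph.PublicKey().Bytes(), make([]byte, gcm.NonceSize()), plain, nil), nil
}

// decryptThenVerify is the user's side: decrypt, split, check the signature.
func decryptThenVerify(user *ecdh.PrivateKey, dp *ecdsa.PublicKey, ct []byte) ([]byte, error) {
	const pubLen = 65 // uncompressed P-256 point
	if len(ct) < pubLen {
		return nil, errors.New("ciphertext too short")
	}
	eph, err := ecdh.P256().NewPublicKey(ct[:pubLen])
	if err != nil {
		return nil, err
	}
	gcm, err := aeadFor(user, eph)
	if err != nil {
		return nil, err
	}
	plain, err := gcm.Open(nil, make([]byte, gcm.NonceSize()), ct[pubLen:], nil)
	if err != nil {
		return nil, err
	}
	if len(plain) < 1 || len(plain) < 1+int(plain[0]) {
		return nil, errors.New("malformed plaintext")
	}
	sig, msg := plain[1:1+int(plain[0])], plain[1+int(plain[0]):]
	digest := sha256.Sum256(msg)
	if !ecdsa.VerifyASN1(dp, digest[:], sig) {
		return nil, errors.New("signature check failed")
	}
	return msg, nil
}

// roundTrip generates fresh keys and runs both directions once.
func roundTrip(msg []byte) ([]byte, error) {
	dp, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	user, err := ecdh.P256().GenerateKey(rand.Reader)
	if err != nil {
		return nil, err
	}
	ct, err := signThenEncrypt(dp, user.PublicKey(), msg)
	if err != nil {
		return nil, err
	}
	return decryptThenVerify(user, &dp.PublicKey, ct)
}

func main() {
	msg, err := roundTrip([]byte("record read by staff member 7"))
	fmt.Println(string(msg), err)
}
```

Only the ciphertext returned by signThenEncrypt would be sent to the log server, which therefore never sees the plaintext or the signature.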
A log entry consists of three parts: the user block, the data processor block and the data. The data is the ciphertext as provided by the data processor. The user block and data processor block are in part derived from the internal state kept by the log server.
A graphical overview of how the user block is generated and the user’s state is updated is given in Fig. 5. The index chain field is derived from the state kept by the log server. The data chain field is derived from the index chain field of the log entry, together with the state kept by the log server. The authentication key is used to update the index and data chain intermediates, before the authentication key itself is updated. The data processor block is generated in a similar manner. The log server only needs to perform symmetric-key operations (hashes and MACs), which makes creating a log entry very efficient at the log server.
3.4 Forking
For a data processor to involve another data processor in the processing of a user’s data, the data processor needs to fork the transparency logging of data processing to the other data processor. When forking, the data processor needs to blind the public key that serves as an identifier for the user, to prevent the user’s log trails at the two data processors from being linked. Blinding a public key is done by applying a random blinding factor, which is passed on to the user. This blinding factor, together with his original private key, allows the user to decrypt messages that are encrypted under the new blinded public key.
Forking is a protocol between two data processors A and B with their respective log servers. Data processor B will set up a new user at its log server for the blinded identity as provided by data processor A. Data processor B will sign (to prove its involvement in the forking to the user) the resulting ciphertext from its log server, before sending it back to data processor A. Data processor A will then create a new log entry at its log server that contains a forking marker, the identity of data processor B, the signed ciphertext and the blinding factor.
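Key blinding can be illustrated with a short Go sketch. It assumes multiplicative blinding on P-256: the blinded public key is r*Q and the matching private key is r*d mod n, where r is the blinding factor. The text above does not fix the blinding operation, so this choice and all function names are assumptions. The generic ScalarMult and ScalarBaseMult methods of crypto/elliptic are deprecated in recent Go releases and are used here only to keep the sketch within the standard library.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"fmt"
	"math/big"
)

// randomFactor draws a blinding factor uniformly from [1, n-1].
func randomFactor(c elliptic.Curve) (*big.Int, error) {
	limit := new(big.Int).Sub(c.Params().N, big.NewInt(1))
	r, err := rand.Int(rand.Reader, limit) // uniform in [0, n-2]
	if err != nil {
		return nil, err
	}
	return r.Add(r, big.NewInt(1)), nil
}

// blindPublic returns the blinded public key r*Q (assumed blinding rule).
func blindPublic(pub *ecdsa.PublicKey, r *big.Int) *ecdsa.PublicKey {
	x, y := pub.Curve.ScalarMult(pub.X, pub.Y, r.Bytes())
	return &ecdsa.PublicKey{Curve: pub.Curve, X: x, Y: y}
}

// blindPrivate returns r*d mod n, the private scalar matching blindPublic.
func blindPrivate(priv *ecdsa.PrivateKey, r *big.Int) *big.Int {
	d := new(big.Int).Mul(priv.D, r)
	return d.Mod(d, priv.Curve.Params().N)
}

// matches reports whether the private scalar d corresponds to pub.
func matches(d *big.Int, pub *ecdsa.PublicKey) bool {
	x, y := pub.Curve.ScalarBaseMult(d.Bytes())
	return x.Cmp(pub.X) == 0 && y.Cmp(pub.Y) == 0
}

// demo blinds a fresh key pair. It reports whether the blinded pair is
// consistent and whether the blinded public key differs from the original.
func demo() (bool, bool, error) {
	priv, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return false, false, err
	}
	r, err := randomFactor(priv.Curve)
	if err != nil {
		return false, false, err
	}
	blindedPub := blindPublic(&priv.PublicKey, r)
	blindedPriv := blindPrivate(priv, r)
	return matches(blindedPriv, blindedPub), blindedPub.X.Cmp(priv.X) != 0, nil
}

func main() {
	consistent, changed, err := demo()
	fmt.Println("consistent:", consistent, "changed:", changed, "err:", err)
}
```

Under this assumption, data processor B only ever sees r*Q, while the user, who receives r through the log entry at log server A, can derive the private scalar for the blinded key.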
3.5 Log Trail Reconstruction
When the user disclosed data to the data processor, the user initiated the start logging protocol (see section 3.2), generating a user identifier and obtaining the initial authentication key from the data processor in the process. To reconstruct the log trail, the user first downloads all log entries, stored at the log server used by this data processor, linked to his identifier.
Starting from the initial index chain field, derived from his identity and the initial authentication key, all following index chain fields can be computed by evolving the index chain field and authentication key in the same manner as the log server. The user can request all log entries for which he can provide a valid index chain field.
Now the user can validate his log trail by evolving the data chain field from log entry to log entry. After validating the integrity of the log trail, the data fields of the log entries are decrypted and the signature of the data processor is checked.
The user can also request the latest index chain intermediate in the log server’s state. This mechanism allows the user to detect truncation attacks, in which the attacker deletes one or more consecutive log entries at the end of the chain.
In case the user comes across a forking marker, the user first verifies the signature by data processor B. Then he creates a new private key using the blinding factor, which can be used to decrypt the ciphertext, containing the initial authentication key at log server B. Now he can also reconstruct and validate his log trail at log server B.
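Reconstruction and validation at a single log server can be sketched in Go as follows. The sketch omits decryption, signature checks and forking. The chain derivations in step are illustrative assumptions rather than the exact formulas of our scheme; the essential point is that the user replays the same state evolution as the log server.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/binary"
	"errors"
	"fmt"
)

type state struct{ ak, ic, dc []byte }
type entry struct{ ic, dc, data []byte }

// mac computes HMAC-SHA256 over length-prefixed parts.
func mac(key []byte, parts ...[]byte) []byte {
	h := hmac.New(sha256.New, key)
	var n [8]byte
	for _, p := range parts {
		binary.BigEndian.PutUint64(n[:], uint64(len(p)))
		h.Write(n[:])
		h.Write(p)
	}
	return h.Sum(nil)
}

// step produces the next entry for data and evolves s. The log server and
// the user run the same function (assumed derivations).
func step(s *state, data []byte) entry {
	e := entry{data: data}
	e.ic = mac(s.ak, []byte("index"), s.ic)
	e.dc = mac(s.ak, []byte("data"), s.dc, e.ic, data)
	s.ic = mac(s.ak, []byte("next-index"), e.ic)
	s.dc = mac(s.ak, []byte("next-data"), e.dc)
	next := sha256.Sum256(s.ak)
	s.ak = next[:]
	return e
}

// verifyTrail replays the chain from the user's initial state over the
// public log and compares the final index chain intermediate with the one
// reported by the log server. It returns the number of verified entries.
func verifyTrail(start state, log []entry, serverIC []byte) (int, error) {
	byIndex := make(map[string]entry, len(log))
	for _, e := range log {
		byIndex[string(e.ic)] = e
	}
	s, n := start, 0
	for {
		e, ok := byIndex[string(mac(s.ak, []byte("index"), s.ic))]
		if !ok {
			break // no entry for the next expected index chain field
		}
		if want := step(&s, e.data); !hmac.Equal(want.dc, e.dc) {
			return n, errors.New("data chain mismatch: entry was modified")
		}
		n++
	}
	if !hmac.Equal(s.ic, serverIC) {
		return n, errors.New("index chain ends early: entries are missing")
	}
	return n, nil
}

// buildLog plays the log server: it creates n entries and returns the
// initial state, the entries, and the final index chain intermediate.
func buildLog(n int) (state, []entry, []byte) {
	start := state{ak: []byte("initial key"), ic: []byte("user id"), dc: []byte("user id")}
	s := start
	var log []entry
	for i := 0; i < n; i++ {
		log = append(log, step(&s, []byte(fmt.Sprintf("ciphertext %d", i))))
	}
	return start, log, s.ic
}

func main() {
	start, log, serverIC := buildLog(3)
	n, err := verifyTrail(start, log, serverIC)
	fmt.Println("verified entries:", n, "error:", err)
	_, err = verifyTrail(start, log[:2], serverIC) // last entry deleted
	fmt.Println("after truncation:", err)
}
```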
4 Performance Evaluation
We used ECIES for public-key encryption and ECDSA for signature generation on the NIST P-256 elliptic curve. The selected hash function is SHA-256, which is also used in an HMAC construction to generate MACs. For these selected cryptographic key lengths, long term protection (from 2013 to 2040) is ensured [Ecry12].
A prototype of our scheme was implemented in the programming language Go. The first benchmarks are performed on a mid-range laptop (quad core 2.6 GHz CPU and 8 GB DDR3 RAM). Using Go’s built-in benchmarking functionality, which will run a test until it is “timed reliably”, we created Table 1 that provides a benchmark of the algorithms that make up our logging scheme. The benchmark shows that the main bottlenecks are operations related to encryption and signatures. As a consequence, data processors perform the bulk of the work when creating log entries. For log servers, creating log entries is fast. The only relatively costly operation at the log server is the setup of a user (which is also part of forking, i.e., log server B), needed to start logging for a new user, which presumably will be relatively infrequent. Decryption and verification for users are relatively costly.
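Go’s benchmarking facility can also be invoked programmatically through testing.Benchmark, which repeats the function with a growing iteration count until the timing is reliable. The following sketch times HMAC-SHA256 over 1 KiB of data as a stand-in for one symmetric-key operation at the log server; the function name benchMAC and the chosen operation are illustrative and do not reproduce the benchmarks of Table 1.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
	"testing"
)

// benchMAC measures HMAC-SHA256 over a message of the given size.
func benchMAC(size int) testing.BenchmarkResult {
	key := make([]byte, 32)
	msg := make([]byte, size)
	return testing.Benchmark(func(b *testing.B) {
		b.SetBytes(int64(size)) // lets the result report throughput
		for i := 0; i < b.N; i++ {
			m := hmac.New(sha256.New, key)
			m.Write(msg)
			m.Sum(nil)
		}
	})
}

func main() {
	r := benchMAC(1024)
	fmt.Println("HMAC-SHA256 over 1 KiB:", r.String())
}
```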
Table 1: Benchmark of algorithms. Benchmarked operations:
- Start logging (log server)
- Create a log entry, 1 KiB data: data processor; log server
- From data processor A with log server A to data processor B with log server B: data processor A; log server A; data processor B; log server B
- Verify log trail (user), 10 entries of 1 KiB data
To get a better idea of how our scheme would perform in practice as a deployed system, we extended our implementation:
First, we transformed the data processor to a standalone service (similar to a Syslog server) to which other systems at the data processor send messages that should be logged. The data processor and log server communicate securely over a TLS connection. The data processor service is also offered over TLS.
Next, we introduced the concept of transactions, analogous to transactions in relational databases. At the data processor service, starting a transaction creates a new buffer for messages to log for users. A transaction can then be committed, which takes all messages in the buffer and creates log entries from them. At a log server, a transaction buffer works in a similar way: a data processor can create a buffer of messages for users that can be committed to create log entries. Transactions at a log server enable the data processor to send messages to the buffer in parallel, since the order in which log entries are created is determined only when the transaction is committed, not when a message arrives at the log server.
For the transaction buffers at a data processor we also added support for parallelism. When a data processor receives a message for a user to put into a transaction buffer, the processor spawns a new goroutine (lightweight thread) that performs the signing and encryption of the message for the user in the background. This way the data processor service can instantly acknowledge that a message has been stored in a transaction buffer, enabling the caller to return to its data processing. The computationally demanding cryptographic operations are then completed in the background of the service while waiting for the transaction to be committed.
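A transaction buffer with background processing can be sketched in Go as follows. The types and method names are hypothetical, and the prepare callback is a placeholder for the signing and encryption performed by the data processor service. The sketch assumes that Add and Commit are called from a single goroutine, so only the prepare work runs concurrently.

```go
package main

import (
	"fmt"
	"sync"
)

// Prepared is a buffered message after background processing.
type Prepared struct {
	User    string
	Payload []byte
}

// Transaction buffers messages until Commit.
type Transaction struct {
	prepare func(user string, msg []byte) []byte // stand-in for sign + encrypt
	wg      sync.WaitGroup
	items   []*Prepared
}

// NewTransaction starts an empty transaction buffer.
func NewTransaction(prepare func(string, []byte) []byte) *Transaction {
	return &Transaction{prepare: prepare}
}

// Add reserves the next slot in arrival order and returns at once; the
// costly preparation runs in a background goroutine.
func (t *Transaction) Add(user string, msg []byte) {
	p := &Prepared{User: user}
	t.items = append(t.items, p)
	t.wg.Add(1)
	go func() {
		defer t.wg.Done()
		p.Payload = t.prepare(user, msg) // each goroutine writes only its own slot
	}()
}

// Commit waits for all background work, then returns the buffered messages
// in arrival order and clears the buffer.
func (t *Transaction) Commit() []*Prepared {
	t.wg.Wait()
	done := t.items
	t.items = nil
	return done
}

func main() {
	tx := NewTransaction(func(user string, msg []byte) []byte {
		return append([]byte(user+": "), msg...) // placeholder for the crypto
	})
	tx.Add("alice", []byte("read record"))
	tx.Add("bob", []byte("updated record"))
	for _, p := range tx.Commit() {
		fmt.Printf("%s\n", p.Payload)
	}
}
```

Because every goroutine writes to its own slot and Commit waits on the WaitGroup before reading, no further locking is needed in this single-caller setting.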
The log server and the data processor were run in two different settings: local (L) and remote (R). The local experiment was run on the earlier described laptop. For the remote experiment, the log server was run at Amazon EC2 (Ireland) using a medium instance, and the data processor in a private cloud at Karlstad University (Sweden). The latency between the data processor and the log server at Amazon was on average 45.7 ms with a standard deviation of 0.3 ms. Table 2 shows the goodput, which is the throughput measured with respect to the data to be logged, for both the local and remote setting at 100 log entries per transaction. The average log entry generation time does not scale linearly with the size of the logged data. This is mainly due to the fact that the data to be logged is first signed and then encrypted before being sent to the log server, which involves relatively costly operations on the elliptic curve. The increased time in the remote setting is most likely due to the increased latency and potential bottlenecks at our Amazon EC2 instance.
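Goodput as defined here counts only the logged data, not metadata or protocol overhead. As a small worked example in Go (the function name and the input values are hypothetical and are not measurements from our experiments):

```go
package main

import "fmt"

// goodputKiBps returns the amount of logged data per second in KiB/s.
// Only the data to be logged counts, not the size of the log entries.
func goodputKiBps(entries, entryBytes int, seconds float64) float64 {
	return float64(entries) * float64(entryBytes) / 1024 / seconds
}

func main() {
	// Hypothetical input: one transaction of 100 entries of 1 KiB each,
	// committed in 2 seconds.
	fmt.Printf("%.1f KiB/s\n", goodputKiBps(100, 1024, 2.0))
}
```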
Table 2: The goodput at 100 log entries per transaction, by log entry size.
5 Related Work
[Table: comparison with related work. Recoverable column headings: Unlinkability of Log Entries; Unlinkability of User Identifiers. Recoverable cell values: Partial (truncation attack possible), twice.]
6 Conclusion
We introduced a privacy-preserving distributed logging scheme which can be used to enhance transparency of data processing. Our scheme generates a log trail for a user, typically the data subject of the process that is logged. Dynamic and distributed processes can be logged to distributed log servers. The log entries are world-readable, but the strong cryptographic properties of the underlying scheme ensure confidentiality and unlinkability in a broad sense. Last, but not least, we implemented our scheme in a robust prototype implementation and evaluated its performance. The initial timing results show that the scheme can be used in practice.
This work was supported in part by the Research Council KU Leuven: GOA TENSE (G0A/11/007). Tobias Pulls has received funding from the Seventh Framework Programme for Research of the European Community under grant agreement no. 317550.
¹ We use the technical terminology of data processor and user, as opposed to the EU Data Protection Directive in which a more formal/legal terminology (data controller, data subject) is used.
² Schemes that do not support deletion detection are subject to so-called truncation attacks, in which the adversary can delete one or more consecutive entries at the end of a log.
- [BouA11] Bournez, Carine; and Ardagna, Claudio A.: Policy Requirements and State of the Art. Camenisch, Fischer-Hubner and Rannenberg: Privacy and Identity Management for Life, ISBN 978-3-642-20316-9, Springer, 2011, p. 295-312.
- [Ecry12] ECRYPT II: Yearly Report on Algorithms and Keysizes (2012). D.SPA.20 Rev. 1.0, ICT-2007-216676 ECRYPT II, 2012.
- [HPHL10] Hedbom, Hans; Pulls, Tobias; Hjartquist, Peter; and Laven, Andreas: Adding Secure Transparency Logging to the PRIME Core. Bezzi, Duquenoy, Fischer-Hubner, Hansen and Zhang: Privacy and Identity Management for Life, ISBN 978-3-642-14281-9, Springer, 2010, p. 299-314.
- [KJM11] Ko, Ryan K.L.; Jagadpramana, Peter; Mowbray, Miranda; Pearson, Siani; Kirchberg, Markus; Liang, Qianhui; and Lee, Bu-Sung: TrustCloud: A Framework for Accountability and Trust in Cloud Computing. In: Proceedings of EuroPKI 2011. Camenisch and Costas: LNCS 6711, Springer, 2011, p. 584-588.
- [PWVG12] Pulls, Tobias; Wouters, Karel; Vliegen, Jo; and Grahn, Christian: Distributed Privacy-Preserving Log Trails. Karlstad University Studies 2012:24, 2012.
- [Robe09] Roberts, John: No one is perfect: The limits of transparency and an ethic for ’intelligent’ accountability. Accounting, Organizations and Society 34(8), 2009.
- [SaSA06] Sackmann, Stefan; Struker, Jens; and Accorsi, Rafael: Personalization in Privacy-Aware Highly Dynamic Systems. Communications of the ACM 49(9), ACM, 2006, p. 32-38.
- [SchK98] Schneier, Bruce; and Kelsey, John: Cryptographic Support for Secure Logs on Untrusted Machines. In: USENIX Security Symposium. USENIX, 1998, p. 53-62.
- [UN12] United Nations Department of Economic and Social Affairs: UN e-Government Survey 2012. E-Government for the People. ISBN 978-92-1-055353-7, 2012.
- [WSLP08] Wouters, Karel; Simoens, Koen; Lathouwers, Danny; Preneel, Bart: Secure and Privacy-Friendly Logging for eGovernment Services. In: ARES. IEEE Computer Society, 2008, p. 1091-1096.