Generalised Differential Privacy for Text Document Processing

We address the problem of how to "obfuscate" texts by removing stylistic clues which can identify authorship, whilst preserving (as much as possible) the content of the text. In this paper we combine ideas from "generalised differential privacy" and machine learning techniques for text processing to model privacy for text documents. We define a privacy mechanism that operates at the level of text documents represented as "bags-of-words"; these representations are typical in machine learning and contain sufficient information to carry out many kinds of classification tasks including topic identification and authorship attribution (of the original documents). We show that our mechanism satisfies privacy with respect to a metric for semantic similarity, thereby providing a balance between utility, defined by the semantic content of texts, and the obfuscation of stylistic clues. We demonstrate our implementation on a "fan fiction" dataset, confirming that it is indeed possible to disguise writing style effectively whilst preserving enough information and variation for accurate content classification tasks.


Introduction
Partial public release of formerly classified data incurs the risk that more information is disclosed than intended. This is particularly true of data in the form of text such as government documents or patient health records. Nevertheless there are sometimes compelling reasons for declassifying data in some kind of "sanitised" form; for example, government documents are frequently released as redacted reports when the law demands it, and health records are often shared to facilitate medical research. Sanitisation is most commonly carried out by hand but, aside from the cost incurred in time and money, this approach provides no guarantee that the original privacy or security concerns are met.
To encourage researchers to focus on privacy issues related to text documents the digital forensics community PAN@Clef ( [42], for example) proposed a number of challenges that are typically tackled using machine learning. In this paper our aim is to demonstrate how to use ideas from differential privacy to address some aspects of the PAN@Clef challenges by showing how to provide strong a priori privacy guarantees in document disclosures.
We focus on the problem of author obfuscation, namely to automate the process of changing a given document so that as much as possible of its original substance remains, but that the author of the document can no longer be identified. Author obfuscation is very difficult to achieve because it is not clear exactly what to change that would sufficiently mask the author's identity. In fact author properties can be determined by "writing style" with a high degree of accuracy: this can include author identity [28] or other undisclosed personal attributes such as native language [33,52], gender or age [16,27]. These techniques have been deployed in real world scenarios: native language identification was used as part of the effort to identify the anonymous perpetrators of the 2014 Sony hack [17], and it is believed that the US NSA used author attribution techniques to uncover the identity of the real humans behind the fictitious persona of Bitcoin "creator" Satoshi Nakamoto. 1
arXiv:1811.10256v1 [cs.CR] 26 Nov 2018
Our contribution concentrates on the perspective of the "machine learner" as an adversary that works with the standard "bag-of-words" representation of documents often used in text processing tasks. A bag-of-words representation retains only the original document's words and their frequency (thus forgetting the order in which the words occur). Remarkably this representation still contains sufficient information to enable the original authors to be identified (by a stylistic analysis) as well as the document's topic to be classified, both with a significant degree of accuracy. 2 Within this context we reframe the PAN@Clef author obfuscation challenge as follows:
Given an input bag-of-words representation of a text document, provide a mechanism which changes the input without disturbing its topic classification, but such that the author can no longer be identified.
In the rest of the paper we use ideas inspired by $d_X$-privacy [3], a metric-based extension of differential privacy, to implement an automated privacy mechanism which, unlike current ad hoc approaches to author obfuscation, gives access to both solid privacy and utility guarantees. 3 We implement a mechanism $K$ which takes bag-of-words inputs $b, b'$ and produces "noisy" bag-of-words outputs determined by $K(b), K(b')$ with the following properties:
Privacy: If $b, b'$ are classified to be "similar in topic" then, depending on a privacy parameter $\epsilon$, the outputs determined by $K(b)$ and $K(b')$ are also "similar to each other", irrespective of authorship.
Utility: Possible outputs determined by $K(b)$ are distributed according to a Laplace probability density function scored according to a semantic similarity metric.
In what follows we define semantic similarity in terms of the classic Earth Mover's distance used in machine learning for topic classification in text document processing. 4 We explain how to combine this with $d_X$-privacy, which extends privacy for databases to other unstructured domains (such as texts).
In §2 we set out the details of the bag-of-words representation of documents and define the Earth Mover's metric for topic classification. In §3 we define a generic mechanism which satisfies "$\epsilon E_{d_X}$-privacy" relative to the Earth Mover's metric $E_{d_X}$ and show how to use it for our obfuscation problem. We note that our generic mechanism is of independent interest for other domains where the Earth Mover's metric applies. In §4 we describe how to implement the mechanism for data represented as real-valued vectors and prove its privacy/utility properties with respect to the Earth Mover's metric; in §5 we show how this applies to bags-of-words. Finally in §6 we provide an experimental evaluation of our obfuscation mechanism, and discuss the implications.
Throughout we assume standard definitions of probability spaces [18]. For a set $\mathcal{A}$ we write $D\mathcal{A}$ for the set of (possibly continuous) probability distributions over $\mathcal{A}$. For $\eta \in D\mathcal{A}$ and a (measurable) subset $A \subseteq \mathcal{A}$ we write $\eta(A)$ for the probability that (wrt. $\eta$) a randomly selected $a$ is contained in $A$. In the special case of singleton sets, we write $\eta\{a\}$. If $K : \mathcal{A} \to D\mathcal{A}$ is a mechanism, we write $K(a)(A)$ for the probability that, if the input is $a$, the output will be contained in $A$.

Documents, topic classification and Earth Moving
In this section we summarise the elements from machine learning and text processing needed for this paper. Our first definition sets out the representation for documents we shall use throughout. It is a typical representation of text documents used in a variety of classification tasks.
Definition 1. Let $S$ be the set of all words (drawn from a finite alphabet). A document is defined to be a finite bag over $S$, also called a bag-of-words. We denote the set of documents as $BS$, i.e. the set of (finite) bags over $S$.
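As an informal illustration of Def. 1, a bag-of-words can be modelled as a multiset of tokens. The sketch below assumes whitespace tokenisation and lower-casing, which are simplifications of ours and not part of the definition; the function name `bag_of_words` is hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Return the bag (multiset) of tokens in `text`.

    Assumption: tokens are lower-cased, whitespace-separated strings.
    Word order is forgotten; only words and their counts are kept.
    """
    return Counter(text.lower().split())


# Two texts with the same words in a different order give the same bag.
b = bag_of_words("the cat sat on the mat")
b_shuffled = bag_of_words("mat the on sat cat the")
```

The size $|X|$ of a bag corresponds to `sum(b.values())`, counting repeated words.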
Once a text is represented as a bag-of-words, depending on the processing task, further representations of the words within the bag are usually required. We shall focus on two important representations: the first is when the task is semantic analysis, e.g. topic classification, and the second is when the task is author identification. We describe the representation for topic classification in this section, and leave the representation for author identification for §5 and §6.

Word embeddings
Machine learners can be trained to classify the topic of a document, such as "health", "sport", "entertainment"; this notion of topic means that the words within documents will have particular semantic relationships to each other. There are many ways to do this classification, and in this paper we use a technique that has as a key component "word embeddings", which we summarise briefly here.
A word embedding is a real-valued vector representation of words where the precise representation has been experimentally determined by a neural network sensitive to the way words are used in sentences [38]. Such embeddings have some interesting properties, but here we only rely on the fact that when the embeddings are compared using a distance determined by a pseudometric 5 on $\mathbb{R}^n$, words with similar meanings are found to be close together as word embeddings, and words which are significantly different in meaning are far apart as word embeddings.
Definition 2. An n-dimensional word embedding is a mapping $Vec : S \to \mathbb{R}^n$. Given a pseudometric $dist$ on $\mathbb{R}^n$ we define a distance on words $dist_{Vec} : S \times S \to \mathbb{R}_{\ge 0}$ as follows:
$$dist_{Vec}(w_1, w_2) := dist(Vec(w_1), Vec(w_2)).$$
Observe that the property of a pseudometric on $\mathbb{R}^n$ carries over to $S$.
Lemma 1. If $dist$ is a pseudometric on $\mathbb{R}^n$ then $dist_{Vec}$ is also a pseudometric on $S$.
Proof. Immediate from the definition of a pseudometric: i.e. the triangle inequality and the symmetry of $dist_{Vec}$ are inherited from $dist$.
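To make Def. 2 concrete, the following sketch lifts the Euclidean distance on vectors to a distance on words. The two-dimensional embedding `VEC` is entirely made up for illustration (real embeddings have hundreds of dimensions), and the names `VEC` and `dist_vec` are ours. The example also shows why the lifted distance is in general only a pseudometric: two distinct words mapped to the same vector are at distance 0.

```python
import math

# Hypothetical toy embedding; the coordinates are invented for illustration.
VEC = {
    "colour": (1.0, 2.0),
    "color": (1.0, 2.0),   # distinct word, same vector
    "chef": (4.0, 6.0),
}


def dist_vec(w1, w2, vec=VEC, dist=math.dist):
    """Distance on words induced by `dist` on their embeddings (Def. 2)."""
    return dist(vec[w1], vec[w2])
```

Here `math.dist` is the Euclidean distance from the Python standard library (Python 3.8 or later); any other pseudometric on vectors could be passed as `dist`.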
Word embeddings are particularly suited to language analysis tasks, including topic classification, due to their useful semantic properties. Their effectiveness depends on the quality of the embedding Vec, which can vary depending on the size and quality of the training data. We provide more details of the particular embeddings in §6. Topic classifiers can also differ on the choice of underlying metric dist, and we discuss variations in §3.2.
In addition, once the word embedding Vec has been determined, and the distance dist has been selected for comparing "word meanings", there are a variety of semantic similarity measures that can be used to compare documents, for us bags-of-words. In this work we use the "Word Mover's Distance", which was shown to perform well across multiple text classification tasks [31].
The Word Mover's Distance is based on the classic Earth Mover's Distance [44] used in transportation problems with a given distance measure. We shall use the more general Earth Mover's definition with $dist_{Vec}$ 6 as the underlying distance measure between words. We note that our results can be applied to problems outside of the text processing domain.
Let $X, Y \in BS$; we denote by $X$ the tuple $\langle x_1^{a_1}, x_2^{a_2}, \ldots, x_k^{a_k} \rangle$, where $a_i$ is the number of times that $x_i$ occurs in $X$. Similarly we write $Y = \langle y_1^{b_1}, y_2^{b_2}, \ldots, y_l^{b_l} \rangle$; we have $\sum_i a_i = |X|$ and $\sum_j b_j = |Y|$, the sizes of $X$ and $Y$ respectively. We define a flow matrix $F \in \mathbb{R}^{k \times l}_{\ge 0}$ where $F_{ij}$ represents the (non-negative) amount of flow from $x_i \in X$ to $y_j \in Y$.
Definition 3. (Earth Mover's Distance) Let $d_S$ be a (pseudo)metric over $S$. The Earth Mover's Distance with respect to $d_S$, denoted by $E_{d_S}$, is the solution to the following linear optimisation:
$$E_{d_S}(X, Y) := \min \sum_{i=1}^{k} \sum_{j=1}^{l} d_S(x_i, y_j)\, F_{ij} \tag{1}$$
$$\text{subject to:} \quad \sum_{i=1}^{k} F_{ij} = \frac{b_j}{|Y|} \quad \text{and} \quad \sum_{j=1}^{l} F_{ij} = \frac{a_i}{|X|}, \quad F_{ij} \ge 0, \tag{2}$$
where the minimum in (1) is over all possible flow matrices $F$ subject to the constraints (2). In the special case that $|X| = |Y|$, the solution is known to satisfy the conditions of a (pseudo)metric [44] which we call the Earth Mover's Metric.
In this paper we are interested in the special case $|X| = |Y|$, hence we use the term Earth Mover's metric to refer to $E_{d_S}$.
We end this section by describing how texts are prepared for machine learning tasks, and how Def. 3 is used to distinguish documents. Consider the text snippet "The President greets the press in Chicago". The first step is to remove all "stopwords"; these are words which do not contribute to semantics, and include things like prepositions, pronouns and articles. The words remaining are those that carry a great deal of semantic and stylistic information. 7 In this case we obtain the bag:
$$b_1 := \langle \text{President}^1, \text{greets}^1, \text{press}^1, \text{Chicago}^1 \rangle.$$
Consider a second bag: $b_2 := \langle \text{Chief}^1, \text{speaks}^1, \text{media}^1, \text{Illinois}^1 \rangle$, corresponding to a different text. Fig. 1 illustrates the optimal flow matrix which solves the optimisation problem in Def. 3 relative to $d_S$. Here each word is mapped completely to another word, so that $F_{i,j} = 1/4$ when $i = j$ and 0 otherwise. We show later that this is always the case between bags of the same size. With these choices we can compute the distance between $b_1, b_2$:
$$E_{d_S}(b_1, b_2) = \tfrac{1}{4}\big(d_S(\text{President}, \text{Chief}) + d_S(\text{greets}, \text{speaks}) + d_S(\text{press}, \text{media}) + d_S(\text{Chicago}, \text{Illinois})\big) = 2.816.$$
For comparison, consider the distance between $b_1$ and $b_2$ to a third document, $b_3 := \langle \text{Chef}^1, \text{breaks}^1, \text{cooking}^1, \text{record}^1 \rangle$. Using the same word embedding metric, 8 both $E_{d_S}(b_1, b_3)$ and $E_{d_S}(b_2, b_3)$ are larger than $E_{d_S}(b_1, b_2)$, so $b_1$ and $b_2$ would be classified as semantically "closer" to each other than to $b_3$, in line with our own (linguistic) interpretation of the original texts.
7 In fact the way that stopwords are used in texts turns out to be a characteristic feature of authorship. Here we follow standard practice in natural language processing to remove them for efficiency purposes and study the privacy of what remains. All of our results apply equally well had we left stopwords in place.
8 We use the same word2vec-based metric as per our experiments; this is described in §6.
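For bags of the same size $N$ the optimal flow moves each element of one bag to exactly one element of the other, so the Earth Mover's metric reduces to a minimum-cost matching divided by $N$. The sketch below exploits this by brute force over all permutations; this is only feasible for very small bags and is our simplification (a practical implementation would use an assignment or linear-programming solver). Bags are given as lists of vectors with repeats allowed, and the function name is hypothetical.

```python
import math
from itertools import permutations


def earth_movers_equal_size(xs, ys, d=math.dist):
    """Earth Mover's metric between two equal-size bags of vectors.

    Assumption: both bags have the same size N, so some optimal flow
    matrix has entries 0 or 1/N, i.e. it is a one-to-one matching.
    Brute force over N! matchings: for illustration only.
    """
    n = len(xs)
    if n == 0 or n != len(ys):
        raise ValueError("bags must be non-empty and of equal size")
    best = min(
        sum(d(x, ys[j]) for x, j in zip(xs, perm))
        for perm in permutations(range(n))
    )
    return best / n
```

With word embeddings, `xs` and `ys` would be the embedded words of the two documents and `d` the chosen metric on $\mathbb{R}^n$.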

Differential Privacy and the Earth Mover's Metric
Differential Privacy was originally defined with the protection of individuals' data in mind. The intuition is that privacy is achieved through "plausible deniability", i.e. whatever output is obtained from a query, it could just as easily have arisen from a database that does not contain an individual's details, as from one that does. In particular, there should be no easy way to distinguish between the two possibilities.
Privacy in text processing means something a little different. A "query" corresponds to releasing the topic-related contents of the document (in our case the bag-of-words); this relates to the utility because we would like to reveal the semantic content. The privacy relates to investing individual documents with plausible deniability, rather than individual authors directly. What this means for privacy is the following. Suppose we are given two documents $b_1, b_2$ written by two distinct authors $A_1, A_2$, and suppose further that $b_1, b_2$ are changed through a privacy mechanism so that it is difficult or impossible to distinguish between them (by any means). Then it is also difficult or impossible to determine whether the authors of the original documents are $A_1$ or $A_2$, or some other author entirely. This is our aim for obfuscating authorship whilst preserving semantic content.
Our approach to obfuscating documents replaces words with other words, governed by probability distributions over possible replacements. Thus the type of our mechanism is BS → D(BS), where (recall) D(BS) is the set of probability distributions over the set of (finite) bags of S. Since we are aiming to find a careful trade-off between utility and privacy, our objective is to ensure that there is a high probability of outputting a document with a similar topic as the input document. As explained in §2, topic similarity of documents is determined by the Earth Mover's distance relative to a given (pseudo)metric on word embeddings, and so our privacy definition must also be relative to the Earth Mover's distance.
Definition 4. (Earth Mover's Privacy) Let $X$ be a set, $d_X$ a (pseudo)metric on $X$, and $E_{d_X}$ the Earth Mover's metric on $BX$ relative to $d_X$. Given $\epsilon \ge 0$, a mechanism $K : BX \to D(BX)$ satisfies $\epsilon E_{d_X}$-privacy iff for any $b, b' \in BX$ and $Z \subseteq BX$:
$$K(b)(Z) \le e^{\epsilon E_{d_X}(b, b')}\, K(b')(Z). \tag{4}$$
Def. 4 tells us that when two documents are measured to be very close, the multiplier $e^{\epsilon E_{d_X}(b, b')}$ is approximately 1 and the outputs $K(b)$ and $K(b')$ are almost identical. On the other hand the more that the input bags can be distinguished by $E_{d_X}$, the more their outputs are likely to differ.
This flexibility is what allows us to strike a balance between utility and privacy; we discuss this issue further in §5 below. Our next task is to show how to implement a mechanism that can be proved to satisfy Def. 4. We follow the basic construction of Dwork et al. [13] for lifting a differentially private mechanism $K : X \to DX$ to a differentially private mechanism $\underline{K}^\star : X^N \to DX^N$ on vectors in $X^N$. (Note that, unlike a bag, a vector imposes a fixed order on its components.) Here the idea is to apply $K$ independently to each component of a vector $v \in X^N$ to produce a random output vector, also in $X^N$. In particular the probability of outputting some vector $v'$ is the product:
$$\underline{K}^\star(v)\{v'\} = \prod_{1 \le i \le N} K(v_i)\{v'_i\}. \tag{5}$$
Thanks to the compositional properties of differential privacy when the underlying metric on $X$ satisfies the triangle inequality, it is possible to show that the resulting mechanism $\underline{K}^\star$ satisfies the following privacy property [14]:
$$\underline{K}^\star(v)(Z) \le e^{\epsilon M_{d_X}(v, v')}\, \underline{K}^\star(v')(Z), \tag{6}$$
where $M_{d_X}(v, v') := \sum_{1 \le i \le N} d_X(v_i, v'_i)$, the Manhattan metric relative to $d_X$.
However Def. 4 does not follow from (6), since Def. 4 operates on bags of size $N$, and the Manhattan distance between any vector representations of two bags is at least $N \times E_{d_X}$. Remarkably however, it turns out that $K^\star$ (the mechanism that applies $K$ independently to each item in a given bag) in fact satisfies the much stronger Def. 4, as the following theorem shows, provided the input bags have the same size as each other.
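Theorem 1. Let $d_X$ be a pseudo-metric on $X$ and let $K : X \to DX$ be a mechanism satisfying $\epsilon d_X$-privacy, i.e.
$$K(x)(Z) \le e^{\epsilon d_X(x, x')}\, K(x')(Z), \quad \text{for all } x, x' \in X,\ Z \subseteq X. \tag{7}$$
Let $K^\star : BX \to D(BX)$ be the mechanism obtained by applying $K$ independently to each element of $X$ for any $X \in BX$. Denote by $K^\star{\downarrow}N$ the restriction of $K^\star$ to bags of fixed size $N$. Then $K^\star{\downarrow}N$ satisfies $\epsilon N E_{d_X}$-privacy.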
Proof. (Sketch) The full proof is given in App. A.1; here we sketch the main ideas.
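Let $b, b'$ be input bags, both of size $N$, and let $c$ be a possible output bag (of $K^\star$). Observe that the output bags determined by $K^\star(b)$ and $K^\star(b')$, and hence $c$, also have size $N$. We shall show that (4) is satisfied for the set containing the singleton element $c$ and multiplier $\epsilon N$, from which it follows that (4) is satisfied for all sets $Z$.
By Birkhoff-von Neumann's theorem ([26], Thm. A1), in the case where all bags have the same size, the minimisation problem in Def. 3 is optimised for a transportation matrix $F$ where all values $F_{ij}$ are either 0 or $1/N$. This implies that the optimal transportation for $E_{d_X}(b, c)$ is achieved by moving each word in the bag $b$ to a (single) word in bag $c$. The same is true for $E_{d_X}(b', c)$ and $E_{d_X}(b, b')$. Next we use a vector representation of bags as follows. For bag $b$, we write $\underline{b}$ for a vector in $X^N$ such that each element in $b$ appears at some $\underline{b}_i$.
Next we fix $\underline{b}$ and $\underline{b}'$ to be vector representations of respectively $b, b'$ in $X^N$ such that the optimal transportation for $E_{d_X}(b, b')$ is
$$N \times E_{d_X}(b, b') = \sum_{1 \le i \le N} d_X(\underline{b}_i, \underline{b}'_i) = M_{d_X}(\underline{b}, \underline{b}'). \tag{8}$$
The final fact we need is to note that there is a relationship between $K^\star$ acting on bags of size $N$ and $\underline{K}^\star$ which acts on vectors in $X^N$ by applying $K$ independently to each component of a vector: it is characterised in the following way. Let $b, c$ be bags and let $\underline{b}, \underline{c}$ be any vector representations. For a permutation $\sigma : \{1 \ldots N\} \to \{1 \ldots N\}$ write $\underline{c}^\sigma$ for the vector with components permuted by $\sigma$, so that $\underline{c}^\sigma_i = \underline{c}_{\sigma(i)}$. With these definitions, the following equality between probabilities holds:
$$K^\star(b)\{c\} = \sum_{\sigma} \underline{K}^\star(\underline{b})\{\underline{c}^\sigma\}, \tag{9}$$
where the summation is over all permutations that give distinct vector representations of $c$. We now compute directly:
$$\begin{aligned}
K^\star(b)\{c\} &= \sum_{\sigma} \underline{K}^\star(\underline{b})\{\underline{c}^\sigma\} && \text{by (9)} \\
&\le \sum_{\sigma} e^{\epsilon M_{d_X}(\underline{b}, \underline{b}')}\, \underline{K}^\star(\underline{b}')\{\underline{c}^\sigma\} && \text{by (6)} \\
&= e^{\epsilon N E_{d_X}(b, b')} \sum_{\sigma} \underline{K}^\star(\underline{b}')\{\underline{c}^\sigma\} && \text{by (8)} \\
&= e^{\epsilon N E_{d_X}(b, b')}\, K^\star(b')\{c\} && \text{by (9)}
\end{aligned}$$
as required.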

Application to Text Documents
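Recall the bag-of-words $b_1 := \langle \text{President}^1, \text{greets}^1, \text{press}^1, \text{Chicago}^1 \rangle$, and assume we are provided with a mechanism $K$ satisfying the standard $\epsilon d_X$-privacy property (7) for individual words. As in Thm. 1 we can create a mechanism $K^\star$ by applying $K$ independently to each word in the bag, so that, for example, the probability of outputting $b_3 = \langle \text{Chef}^1, \text{breaks}^1, \text{cooking}^1, \text{record}^1 \rangle$ is determined by (9):
$$K^\star(b_1)\{b_3\} = \sum_{\sigma} \prod_{1 \le i \le 4} K(x_i)\{y_{\sigma(i)}\},$$
where $x_1, \ldots, x_4$ and $y_1, \ldots, y_4$ list the words of $b_1$ and $b_3$ respectively, and $\sigma$ ranges over the permutations of $\{1, \ldots, 4\}$. By Thm. 1, $K^\star$ satisfies $4\epsilon E_{d_X}$-privacy on bags of size 4. Recalling that $E_{d_X}(b_1, b_2) = 2.816$, we deduce that if $\epsilon \sim 1/16$ then the output distributions $K^\star(b_1)$ and $K^\star(b_2)$ would differ by the multiplier $e^{2.816 \times 4/16} \sim 2.02$; but if $\epsilon \sim 1/32$ those distributions differ by only 1.42. In the latter case it means that the outputs of $K^\star$ on $b_1$ and $b_2$ are almost indistinguishable.
The parameter $\epsilon$ depends on the randomness implemented in the basic mechanism $K$; we investigate that further in §4.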
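The multipliers quoted above can be checked numerically. The small sketch below simply evaluates $e^{\epsilon N E}$ for the values used in the example ($N = 4$, $E = 2.816$); the function name is ours.

```python
import math


def indistinguishability_multiplier(eps, n, emd):
    """Bound e^(eps * N * E) on the ratio of output probabilities (Thm. 1)."""
    return math.exp(eps * n * emd)


m16 = indistinguishability_multiplier(1 / 16, 4, 2.816)
m32 = indistinguishability_multiplier(1 / 32, 4, 2.816)
```

Halving $\epsilon$ takes the square root of the multiplier, which is why the bound drops from roughly 2.02 to roughly 1.42.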

Properties of Earth Mover's Privacy
In machine learning a number of "distance measures" are used in classification or clustering tasks, and in this section we explore some properties of privacy when we vary the underlying metrics of an Earth Mover's metric used to classify complex objects.
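Let $v, v' \in \mathbb{R}^n$ be real-valued n-dimensional vectors. We use the following (well-known) metrics. Recall in our applications we have looked at bags-of-words, where the words themselves are represented as n-dimensional vectors. 9
1. Euclidean: $\|v - v'\| := \sqrt{\sum_{1 \le i \le n} (v_i - v'_i)^2}$
2. Manhattan: $\lfloor v - v' \rceil := \sum_{1 \le i \le n} |v_i - v'_i|$
Note that the Euclidean and Manhattan distances determine pseudometrics on words as defined at Def. 2 and proved at Lem. 1.
Lemma 2. Let $d_X, d'_X$ be (pseudo)metrics on $X$ with $d_X \le d'_X$. Then $E_{d_X} \le E_{d'_X}$.
Proof. Trivial, by contradiction. Suppose $d_X \le d'_X$ but $E_{d_X} > E_{d'_X}$ for some pair of bags, and let $F_{ij}, F'_{ij}$ be the minimal flow matrices for $E_{d_X}, E_{d'_X}$ respectively. Then $\sum_{ij} d_X(x_i, y_j) F'_{ij} \le \sum_{ij} d'_X(x_i, y_j) F'_{ij} = E_{d'_X} < E_{d_X}$, so $F'_{ij}$ is a (strictly smaller) solution for $E_{d_X}$, which contradicts the minimality of $F_{ij}$.
This shows that, for example, $\epsilon E_{\|\cdot\|}$-privacy implies $\epsilon E_{\lfloor\cdot\rceil}$-privacy, and indeed for any distance measure $d$ which exceeds the Euclidean distance, $\epsilon E_{\|\cdot\|}$-privacy implies $\epsilon E_d$-privacy.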
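We end this section by showing that Def. 4 satisfies post-processing; i.e. that privacy does not decrease under post processing. We write $K; K'$ for the composition of mechanisms $K, K' : BX \to D(BX)$, defined:
$$(K; K')(b)(Z) := \sum_{b'' \in BX} K(b)\{b''\} \times K'(b'')(Z).$$
Lemma 3. (Post-processing) If $K$ satisfies $\epsilon E_{d_X}$-privacy then so does $K; K'$, for any mechanism $K'$.
Proof. Let $b, c \in BX$; we reason as follows.
$$(K; K')(b)(Z) = \sum_{b''} K(b)\{b''\}\, K'(b'')(Z) \le \sum_{b''} e^{\epsilon E_{d_X}(b, c)}\, K(c)\{b''\}\, K'(b'')(Z) = e^{\epsilon E_{d_X}(b, c)}\, (K; K')(c)(Z).$$
In Thm. 1 we have shown how to promote a privacy mechanism on components to $\epsilon E_{d_X}$-privacy on a bag of those components. In this section we show how to implement a privacy mechanism satisfying (7), when the components are represented by high dimensional vectors in $\mathbb{R}^n$ and the underlying metric is taken to be Euclidean on $\mathbb{R}^n$, which we denote by $\|\cdot\|$.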
We begin by summarising the basic probabilistic tools we need. A probability density function (PDF) over some domain $D$ is a function $\phi : D \to [0, \infty)$ whose value $\phi(z)$ gives the "relative likelihood" of $z$. The probability density function is used to compute the probability of an outcome "$z \in A$", for some region $A \subseteq D$ as follows:
$$\int_A \phi(x)\, dx.$$
In differential privacy, a popular density function used for implementing mechanisms is the Laplacian, defined next.
Definition 5. Let $n \ge 1$ be an integer, $\epsilon > 0$ be a real, and $v \in \mathbb{R}^n$. We define the Laplacian probability density function in n dimensions:
$$Lap^n_\epsilon(v) := c^\epsilon_n \times e^{-\epsilon \|v\|},$$
where $\|v\| := \sqrt{v_1^2 + \cdots + v_n^2}$, and $c^\epsilon_n$ is a real-valued constant satisfying the integral equation $1 = \int \cdots \int_{\mathbb{R}^n} Lap^n_\epsilon(v)\, dv_1 \ldots dv_n$.
When $n = 1$, we can compute $c^\epsilon_1 = \epsilon/2$, and when $n = 2$, we have that $c^\epsilon_2 = \epsilon^2/2\pi$. In privacy mechanisms, probability density functions are used to produce a "noisy" version of the released data. The benefit of the Laplace distribution is that, besides creating randomness, the likelihood of releasing a value at a given distance from the true value decreases exponentially in that distance. This implies that the utility of the data release is high, whilst at the same time masking its actual value. In Fig. 2 the probability density function $Lap^2_\epsilon(v)$ depicts this situation, where we see that a randomly selected point on the plane is most likely to be close to the origin, with the chance of choosing more distant points diminishing rapidly. Once we are able to select a vector $v'$ in $\mathbb{R}^n$ according to $Lap^n_\epsilon$, we can "add noise" to any given vector $v$ as $v + v'$, so that the true value $v$ is highly likely to be perturbed only a small amount.
In order to use the Laplacian in Def. 5, we need to implement it. Andrés et al. [5] exhibited a mechanism for $Lap^2_\epsilon(v)$, and here we show how to extend that idea to the general case. The main idea of the construction for $Lap^2_\epsilon(v)$ uses the fact that any vector on the plane can be represented by polar coordinates $(r, \theta)$, so that the probability of selecting a vector at distance no more than $r$ from the origin can be achieved by selecting $r$ and $\theta$ independently. In order to obtain a distribution which overall is equivalent to $Lap^2_\epsilon(v)$, Andrés et al. computed that $r$ must be selected according to a distribution whose inverse cumulative distribution function is expressed using the "Lambert W" function, and $\theta$ is selected uniformly over the unit circle. In our generalisation to $Lap^n_\epsilon(v)$, we observe that the same idea is valid [7]. Observe first that every vector in $\mathbb{R}^n$ can be expressed as a pair $(r, p)$, where $r$ is the distance from the origin, and $p$ is a point in $B^n$, the unit hypersphere in $\mathbb{R}^n$. Now selecting vectors according to $Lap^n_\epsilon(v)$ can be achieved by independently selecting $r$ and $p$, but this time $r$ must be selected according to the Gamma distribution, and $p$ must be selected uniformly over $B^n$. We set out the details next.
Definition 6. The Gamma distribution of (integer) shape $n$ and scale $\delta > 0$ is determined by the probability density function:
$$Gam^n_\delta(r) := \frac{r^{n-1} e^{-r/\delta}}{\delta^n (n-1)!}.$$
Definition 7. The uniform distribution over the surface of the unit hypersphere $B^n$ is determined by the probability density function:
$$Uniform^n(v) := \frac{\Gamma(n/2)}{2\pi^{n/2}} \ \text{ if } v \in B^n, \text{ else } 0,$$
where $B^n := \{v \in \mathbb{R}^n \mid \|v\| = 1\}$, and $\Gamma(\alpha) := \int_0^\infty x^{\alpha - 1} e^{-x}\, dx$ is the "Gamma function".
With Def. 6 and Def. 7 we are able to provide an implementation of a mechanism which produces noisy vectors around a given vector in $\mathbb{R}^n$ according to the Laplacian distribution in Def. 5. The first task is to show that our decomposition of $Lap^n_\epsilon$ is correct.
Lemma 4. The n-dimensional Laplacian $Lap^n_\epsilon(v)$ can be realised by selecting vectors represented as $(r, p)$, where $r$ is selected according to $Gam^n_{1/\epsilon}(r)$ and $p$ is selected independently according to $Uniform^n(p)$.
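Proof. (Sketch) The proof follows by changing variables to spherical coordinates and then showing that $\int_A Lap^n_\epsilon(v)\, dv$ can be expressed as the product of independent selections of $r$ and $p$.
Next we assume for simplicity that $A$ is a hypersphere of radius $R$; with that we can reason:
$$\begin{aligned}
\int_A Lap^n_\epsilon(v)\, dv
&= \int \cdots \int_{\|v\| \le R} c^\epsilon_n e^{-\epsilon \|v\|}\, dv_1 \ldots dv_n && \text{Def. 5; $A$ is a hypersphere} \\
&= \int \cdots \int_{r \le R,\, \theta} c^\epsilon_n e^{-\epsilon r} \left| \frac{\partial(v_1, \ldots, v_n)}{\partial(r, \theta_1, \ldots, \theta_{n-1})} \right| dr\, d\theta_1 \ldots d\theta_{n-1} && \text{Change of variables to spherical coordinates; see below (14)} \\
&= \int \cdots \int_{r \le R,\, \theta} c^\epsilon_n e^{-\epsilon r} r^{n-1} \sin^{n-2}\theta_1 \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}\, dr\, d\theta_1 \ldots d\theta_{n-1} && \text{See below (14)}
\end{aligned}$$
Now rearranging we can see that this becomes a product of two integrals. The first, $\int_{r \le R} e^{-\epsilon r} r^{n-1}\, dr$, is over the radius, and is proportional to the integral of the Gamma distribution Def. 6; and the second is an integral over the angular coordinates and is proportional to the surface of the unit hypersphere, and corresponds to the PDF in Def. 7. We complete the details in App. A.2. Finally, for the "see below's" we are using the "Jacobian", with details given at App. A.2:
$$\frac{\partial(v_1, \ldots, v_n)}{\partial(r, \theta_1, \ldots, \theta_{n-1})} = r^{n-1} \sin^{n-2}\theta_1 \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}. \tag{14}$$
We can now assemble the facts to demonstrate the n-Dimensional Laplacian.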
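Theorem 2 (n-Dimensional Laplacian). Given $\epsilon > 0$ and $n \in \mathbb{Z}^+$, let $K : \mathbb{R}^n \to D\mathbb{R}^n$ be a mechanism that, given a vector $x \in \mathbb{R}^n$, outputs a noisy value as follows:
$$K(x) := x + x',$$
where $x'$ is represented as $(r, p)$ with $r \ge 0$ distributed according to $Gam^n_{1/\epsilon}(r)$ and $p \in B^n$ distributed according to $Uniform^n(p)$. Then $K$ satisfies (7) from Thm. 1, i.e. $K$ satisfies $\epsilon\|\cdot\|$-privacy where $\|\cdot\|$ is the Euclidean metric on $\mathbb{R}^n$.
Proof. (Sketch) Let $z, y \in \mathbb{R}^n$. We need to show that for any (measurable) set $A \subseteq \mathbb{R}^n$:
$$K(z)(A) \le e^{\epsilon \|z - y\|} \times K(y)(A). \tag{15}$$
However (15) follows provided that the probability densities of respectively $K(z)$ and $K(y)$ satisfy it. By Lem. 4 the probability density of $K(z)$, as a function of $x$, is distributed as $Lap^n_\epsilon(z - x)$; and similarly for the probability density of $K(y)$. Hence we reason:
$$\frac{Lap^n_\epsilon(z - x)}{Lap^n_\epsilon(y - x)} = \frac{c^\epsilon_n e^{-\epsilon \|z - x\|}}{c^\epsilon_n e^{-\epsilon \|y - x\|}} = e^{\epsilon(\|y - x\| - \|z - x\|)} \le e^{\epsilon \|z - y\|}, \qquad \text{"Triangle inequality; } s \mapsto e^s \text{ is monotone"}$$
as required.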
Thm. 2 reduces the problem of adding Laplace noise to vectors in $\mathbb{R}^n$ to selecting a real value according to the Gamma distribution and an independent uniform selection of a unit vector. Several methods have been proposed for generating random variables according to the Gamma distribution [30] as well as for the uniform selection of vectors on the unit n-sphere [35]. The method described in [35] avoids the transformation to spherical coordinates by selecting $n$ random variables from the standard Gaussian distribution (mean 0, variance 1) to produce a vector $v \in \mathbb{R}^n$, and then normalising to output $v / \|v\|$.
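The recipe of Thm. 2 can be sketched with the Python standard library alone: a Gamma-distributed radius of shape $n$ and scale $1/\epsilon$, and a direction obtained by normalising $n$ standard Gaussian samples. This is a minimal illustration under our own naming (`sample_laplace_noise`, `laplace_mechanism`), not the implementation used in the experiments; it also makes no attempt to address floating-point subtleties that matter for deployed privacy mechanisms.

```python
import math
import random


def sample_laplace_noise(n, eps, rng=random):
    """Draw a noise vector in R^n with density proportional to exp(-eps*||v||).

    Radius: Gamma with shape n and scale 1/eps (random.gammavariate takes
    shape then scale). Direction: normalised vector of standard Gaussians,
    which is uniform on the unit sphere.
    """
    r = rng.gammavariate(n, 1.0 / eps)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [r * x / norm for x in g]


def laplace_mechanism(x, eps, rng=random):
    """Return x + x', with x' drawn by sample_laplace_noise (cf. Thm. 2)."""
    noise = sample_laplace_noise(len(x), eps, rng)
    return [a + b for a, b in zip(x, noise)]
```

Since the mean of the Gamma distribution is shape times scale, the expected length of the noise vector is $n/\epsilon$; for high-dimensional embeddings the noise therefore grows with the dimension for a fixed $\epsilon$.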

Earth Mover's Privacy in BR n
Using the n-dimensional Laplacian, we can now implement an algorithm for $\epsilon N E_{\|\cdot\|}$-privacy. Algorithm 1 takes a bag of n-dimensional vectors as input and applies the n-dimensional Laplacian mechanism described in Thm. 2 to each vector in the bag, producing a noisy bag of n-dimensional vectors as output. Cor. 2 summarises the privacy guarantee.
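The description of Algorithm 1 above can be sketched as follows. The function name `generate_private_bag` mirrors the procedure name GeneratePrivateBag used in Algorithm 2; the representation of a bag as a list of tuples and the helper `_laplace_noise` are assumptions of this sketch.

```python
import math
import random


def _laplace_noise(n, eps, rng):
    """n-dimensional Laplace noise: Gamma(n, 1/eps) radius, uniform direction."""
    r = rng.gammavariate(n, 1.0 / eps)
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [r * x / norm for x in g]


def generate_private_bag(bag, n, eps, rng=random):
    """Apply the n-dimensional Laplace mechanism independently to each vector.

    `bag` is a list of n-dimensional vectors (repeats allowed); the output
    is a noisy bag of the same size.
    """
    noisy = []
    for x in bag:
        if len(x) != n:
            raise ValueError("every vector must have dimension n")
        noise = _laplace_noise(n, eps, rng)
        noisy.append(tuple(a + b for a, b in zip(x, noise)))
    return noisy
```

Because each vector is perturbed independently, the size of the bag is preserved, which is the condition under which Thm. 1 applies.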

Utility Bounds
We prove a lower bound on the utility for this algorithm, which applies for high dimensional data representations. Given an output element $x$, we define $Z$ to be the set of outputs within distance $\Delta > 0$ from $x$. Recall that the distance function is a measure of utility, therefore $Z = \{z \mid E_{\|\cdot\|}(x, z) \le \Delta\}$ represents the set of vectors within utility $\Delta$ of $x$. Then we have the following:
Thus the probability of outputting an element of $Z$ is the same as the probability of outputting $Z_E$, and by (16) that is at least the probability of outputting an element from $Z_M$ by applying a standard n-dimensional Laplace mechanism to each of the components of $b$. We can now compute:
The result follows by completing the multiple integrals and applying some approximations, whilst observing that the variables in the integration are n-dimensional vector valued. The details appear in App. A.2.
We note that in our application word embeddings are typically mapped to vectors in $\mathbb{R}^{300}$, thus we would use $n \sim 300$ in Thm. 3.

Text Document Privacy
In this section we bring everything together, and present a privacy mechanism for text documents; we explore how it contributes to the author obfuscation task described above. Algorithm 2 describes the complete procedure for taking a document as a bag-of-words, and outputting a "noisy" bag-of-words. Depending on the setting of parameter $\epsilon$, the output bag will be likely to be classified to be on a similar topic as the input.

Algorithm 2 Document privacy mechanism
Require: Bag-of-words $b$, dimension $n$, epsilon $\epsilon$, word embedding $Vec : S \to \mathbb{R}^n$
procedure GenerateNoisyBagOfWords($b$, $n$, $\epsilon$, $Vec$)
    $X \leftarrow Vec^\star(b)$
    $Z \leftarrow$ GeneratePrivateBag($X$, $n$, $\epsilon$)
    return $(Vec^{-1})^\star(Z)$
end procedure
Note that $Vec^\star : BS \to B\mathbb{R}^n$ applies $Vec$ to each word in a bag $b$; and $(Vec^{-1})^\star : B\mathbb{R}^n \to BS$ reverses this procedure as a post-processing step, mapping each vector $z$ to the word $w$ that minimises the Euclidean distance $\|z - Vec(w)\|$.
Algorithm 2 uses a function $Vec$ to turn the input document into a bag of word embeddings; next Algorithm 1 produces a noisy bag of word embeddings, and, in a final step the inverse $Vec^{-1}$ is used to reconstruct an actual bag-of-words as output. In our implementation of Algorithm 2, described below, we compute $Vec^{-1}(z)$ to be the word $w$ that minimises the Euclidean distance $\|z - Vec(w)\|$. The next result summarises the privacy guarantee for Algorithm 2.
Theorem 4. On bags of size $N$, Algorithm 2 satisfies $\epsilon N E_{d_S}$-privacy, where $d_S = dist_{Vec}$ and $dist$ is the Euclidean metric $\|\cdot\|$.
Proof. The result follows by appeal to Thm. 2 for privacy on the word embeddings; the step to apply $Vec^{-1}$ to each vector is a post-processing step which by Lem. 3 preserves the privacy guarantee.
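The final, post-processing step of Algorithm 2 can be illustrated as a nearest-neighbour lookup in the embedding. The sketch below assumes the embedding is available as a dictionary from words to vectors and uses a linear scan over the vocabulary; both are simplifications of ours, and the names `nearest_word` and `decode_bag` are hypothetical. Ties are broken by dictionary order.

```python
import math
from collections import Counter


def nearest_word(z, vec):
    """Return the word w minimising the Euclidean distance ||z - vec[w]||."""
    return min(vec, key=lambda w: math.dist(z, vec[w]))


def decode_bag(noisy_vectors, vec):
    """Map each noisy vector to its nearest word, giving a bag-of-words."""
    return Counter(nearest_word(z, vec) for z in noisy_vectors)
```

For a realistic vocabulary a linear scan per vector is slow; an approximate nearest-neighbour index would be a natural replacement, at the cost of no longer returning the exact minimiser.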
Although Thm. 4 utilises ideas from differential privacy, an interesting question to ask is how it contributes to the PAN@Clef author obfuscation task, which recall asked for mechanisms that preserve content but mask features that distinguish authorship. Algorithm 2 does indeed attempt to preserve content (to the extent that the topic can still be determined) but it does not directly "remove stylistic features". So has it, in fact, disguised the author's characteristic style? To answer that question, we review Thm. 4 and interpret what it tells us in relation to author obfuscation. The theorem implies that it is indeed possible to make the (probabilistic) output from two distinct documents $b, b'$ almost indistinguishable by choosing $\epsilon$ so that $\epsilon \times N \times E_{\|\cdot\|}(Vec^\star(b), Vec^\star(b'))$ is extremely small. However, if $E_{\|\cdot\|}(Vec^\star(b), Vec^\star(b'))$ is very large, meaning that $b$ and $b'$ are on entirely different topics, then $\epsilon$ would need to be so tiny that the noisy output document would be highly unlikely to be on a topic remotely close to either $b$ or $b'$ (recall Lem. 3).
This observation highlights the fact that, in some circumstances, the topic itself is a feature that characterises author identity. (First-hand accounts of breaking the world record for highest and longest free fall jump would immediately narrow the field down to the title holder.) This means that any obfuscating mechanism would, as for Algorithm 2, only be able to obfuscate documents so as to disguise the author's identity if there are several authors who write on similar topics. And it is in that spirit that we have made the first step towards a satisfactory obfuscating mechanism: provided that documents are similar in topic (i.e. are close when their embeddings are measured by $E_{\|\cdot\|}$) they can be obfuscated so that it is unlikely that the content is disturbed, but that the contributing authors cannot be determined easily.
We can see the importance of the "indistinguishability" property wrt. the PAN obfuscation task. In stylometry analysis the representation of words for, e.g., author classification is completely different to the word embeddings which we have used for topic classification. State-of-the-art author attribution algorithms represent words as "character n-grams" [28], which have been found to capture stylistic clues such as systematic spelling errors. A character 3-gram representation, for example, represents a given word as the complete list of its substrings of length 3. For example, the character 3-gram representations of "color" and "colour" are:
· "color" → |[ "col", "olo", "lor" ]|
· "colour" → |[ "col", "olo", "lou", "our" ]|
For author identification, any output from Algorithm 2 would then need to be further transformed to a bag of character n-grams, as a post-processing step; by Lem. 3 this additional transformation preserves the privacy properties of Algorithm 2. We explore this experimentally in the next section.
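The character n-gram transformation described above can be sketched in a few lines. The function names are hypothetical; the sketch only illustrates the post-processing of a bag of words into a bag of character n-grams and is not taken from any attribution system.

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Return all substrings of length n of `word`, in order of occurrence."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def bag_of_char_ngrams(words, n=3):
    """Post-process a bag of words into a bag (multiset) of character n-grams."""
    bag = Counter()
    for word in words:
        bag.update(char_ngrams(word, n))
    return bag

color_grams = char_ngrams("color")
colour_grams = char_ngrams("colour")
combined = bag_of_char_ngrams(["color", "colour"])
```

The two spellings share "col" and "olo" but differ in the remaining 3-grams, which is the kind of stylistic signal such representations can expose.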

Experimental Results
Document Set The PAN@Clef tasks and other similar work have used a variety of types of text for author identification and author obfuscation. Our desiderata are that we have multiple authors writing on one topic (so as to minimise the ability of an author identification system to use topic-related cues) and to have more than one topic (so that we can evaluate utility in terms of accuracy of topic classification). Further, we would like to use data from a domain where there are potentially large quantities of text available, and where it is already annotated with author and topic.
Given these considerations, we chose "fan fiction" as our domain. Wikipedia defines fan fiction as follows: "Fan fiction . . . is fiction about characters or settings from an original work of fiction, created by fans of that work rather than by its creator." This is also the domain that was used in the PAN@Clef 2018 author attribution challenge, 10 although for this work we scraped our own dataset. We chose one of the largest fan fiction sites and the two largest "fandoms" there; 11 these fandoms are our topics. We scraped the stories from these fandoms, the largest proportion of which are for use in training our topic classification model. We held out two subsets of size 20 and 50, evenly split between fandoms/topics, for the evaluation of our privacy mechanism. 12 We follow the evaluation framework of [28]: for each author we construct a known-author text and an unknown-author snippet that we have to match to an author on the basis of the known-author texts. (See Appendix B.1 for more detail.)
Word Embeddings There are sets of word embeddings trained on large datasets that have been made publicly available. Most of these, however, are already normalised, which makes them unsuitable for our method. We therefore use the Google News word2vec embeddings as the only large-scale unnormalised embeddings available. (See Appendix B.1 for more detail.)

Inference Mechanisms
We have two sorts of machine learning inference mechanisms: our adversary mechanism for author identification, and our utility-related mechanism for topic classification. For each of these, we can define inference mechanisms both within the same representational space or in a different representational space. As we noted above, in practice both author identification adversary and topic classification will use different representations, but examining same-representation inference mechanisms can give an insight into what is happening within that space.
Different-representation author identification For this we use the algorithm by [28]. This algorithm is widely used: it underpins two of the winners of PAN shared tasks [25,48]; is a common benchmark or starting point for other methods [19,40,45,47]; and is a standard inference attacker for the PAN shared task on authorship obfuscation. 13 It works by representing each text as a vector of space-separated character n-gram counts, and comparing repeatedly sampled subvectors of known-author texts and snippets using cosine similarity. We use as a starting point the code from a reproducibility study [41], but have modified it to improve efficiency. (See Appendix B.2 for more details.)
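The idea of comparing repeatedly sampled subvectors by cosine similarity can be sketched as follows. This is a simplified illustration and not the algorithm of [28] or the code of [41]: the n-gram size, the sampling fraction, the number of iterations, the handling of short tokens and the majority-vote rule are all assumptions made for the sketch.

```python
import math
import random
from collections import Counter

def ngram_counts(text, n=4):
    """Count character n-grams inside each space-separated token.
    Tokens no longer than n are counted whole (an assumption of this sketch)."""
    counts = Counter()
    for token in text.split():
        if len(token) <= n:
            counts[token] += 1
        else:
            for i in range(len(token) - n + 1):
                counts[token[i:i + n]] += 1
    return counts

def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def attribute(snippet, known_texts, iterations=100, fraction=0.5, seed=0):
    """Vote for the most similar known author over random feature subsets."""
    rng = random.Random(seed)
    snippet_counts = ngram_counts(snippet)
    known_counts = {a: ngram_counts(t) for a, t in known_texts.items()}
    features = sorted(set(snippet_counts).union(*known_counts.values()))
    subset_size = max(1, int(fraction * len(features)))
    votes = Counter()
    for _ in range(iterations):
        subset = rng.sample(features, subset_size)
        best = max(known_counts,
                   key=lambda a: cosine(snippet_counts, known_counts[a], subset))
        votes[best] += 1
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two "authors" who differ only in spelling habits.
known = {
    "alice": "the colour of the harbour favoured neighbours",
    "bob": "the color of the harbor favored neighbors",
}
predicted = attribute("my favourite colour flavour", known)
```

Sampling feature subsets makes the decision depend on many partial views of the text rather than on a single comparison, which is the property the description above emphasises.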

Different-representation topic classification
Here we choose fastText [8,22], a high-performing supervised machine learning classification system. It also works with word embeddings; these differ from word2vec in that they are derived from embeddings over character n-grams, learnt using the same skipgram model as word2vec. This means it is able to compute representations for words that do not appear in the training data, which is helpful when training with relatively small amounts of data; also useful when training with small amounts of data is the ability to start from pretrained embeddings trained on out-of-domain data that are then adapted to the in-domain (here, fan fiction) data. After training, the accuracy on a validation set we construct from the data is 93.7% (see Appendix B.2 for details).

Same-representation author identification
In the space of our word2vec embeddings, we can define an inference mechanism that for an unknown-author snippet chooses the closest known-author text by Euclidean distance.
Same-representation topic classification Similarly, we can define an inference mechanism that considers the topic classes of neighbours and predicts a class for the snippet based on that. This is essentially the standard k "Nearest Neighbours" technique (k-NN) [21], a non-parametric method that assigns the majority class of the k nearest neighbours. 1-NN corresponds to classification based on a Voronoi tessellation of the space, has low bias and high variance, and asymptotically has an error rate that is never more than twice the Bayes rate; higher values of k have a smoothing effect. Because of the nature of word embeddings, we would not expect this classification to be as accurate as the fastText classification above: in high-dimensional Euclidean space (as here), almost all points are approximately equidistant. Nevertheless, it can give an idea about how a snippet with varying levels of noise added is being shifted in Euclidean space with respect to other texts in the same topic. Here, we use k = 5. Same-representation author identification can then be viewed as 1-NN with author as class.

         20-author set                        50-author set
ε        SRauth  SRtopic  DRauth  DRtopic     SRauth  SRtopic  DRauth  DRtopic
none     12      16       15      18          19      36       27      43
30       8       18       16      18          19      37       29      43
25       8       18       14      17          17      34       24      41
20       5       11       11      16          12      28       19      42
15       2       11       12      17          9       22       13      42
10       0       15       11      19          1       24       10      43

Table 1: Number of correct predictions of author/topic in the 20-author set (left) and 50-author set (right), using 1-NN for same-representation author identification (SRauth), 5-NN for same-representation topic classification (SRtopic), the Koppel algorithm for different-representation author identification (DRauth) and fastText for different-representation topic classification (DRtopic).

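The same-representation inference mechanisms can be sketched as a plain k-NN classifier. The sketch assumes that each document has already been reduced to a single point in Euclidean space; how that point is obtained from a bag of word embeddings is left open here, and the two-dimensional data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(query, labelled_points, k=5):
    """Return the majority label among the k points nearest to `query`
    by Euclidean distance. `labelled_points` is a list of (point, label)."""
    nearest = sorted(labelled_points,
                     key=lambda item: math.dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical document points labelled by topic.
topic_points = [
    ((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((1.0, 0.0), "A"), ((1.0, 1.0), "A"),
    ((5.0, 5.0), "B"), ((5.0, 6.0), "B"), ((6.0, 5.0), "B"),
]
topic_near_a = knn_predict((0.5, 0.5), topic_points, k=5)
topic_near_b = knn_predict((5.5, 5.5), topic_points, k=5)

# Author identification as 1-NN: each known-author text is its own class.
author_points = [((0.0, 0.0), "author1"), ((5.0, 5.0), "author2")]
closest_author = knn_predict((4.0, 4.5), author_points, k=1)
```

With k = 1 the classifier simply returns the label of the closest point, which is the same-representation author identification rule.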
Results: Table 1 contains the results for both document sets, for the unmodified snippets ("none") or with the privacy mechanism of Algorithm 2 applied with various levels of ε: we give results for ε between 10 and 30, as at ε = 40 the text does not change, while at ε = 1 the text is unrecognisable. For the 20-author set, a random guess baseline would give 1 correct author prediction, and 10 correct topic predictions; for the 50-author set, these values are 1 and 25 respectively. Performance on the unmodified snippets using different-representation inference mechanisms is quite good: author identification gets 15/20 correct for the 20-author set and 27/50 for the 50-author set; and topic classification 18/20 and 43/50 (comparable to the validation set accuracy, although slightly lower, which is to be expected given that the texts are much shorter). For various levels of ε, with our different-representation inference mechanisms we see broadly the behaviour we expected: the performance of author identification drops, while topic classification holds roughly constant. Author identification here does not drop to chance levels: we speculate that this is because (in spite of our choice of dataset for this purpose) there are still some topic clues that the algorithm of [28] takes advantage of: one author of Harry Potter fan fiction might prefer to write about a particular character (e.g. Severus Snape), and as these character names are not in our word2vec vocabulary, they are not replaced by the privacy mechanism.
In our same-representation author identification, though, we do find performance starting relatively high (although not as high as the different-representation algorithm) and then dropping to (worse than) chance, which is the level we would expect for our privacy mechanism. The k-NN topic classification, however, shows some instability, which is probably an artefact of the problems it faces with high-dimensional Euclidean spaces. (We show a sample of texts and nearest neighbours in Appendix B.3.)

Related Work
Our work addresses the problem of author obfuscation which falls in the general domain of text document privacy.
Author Obfuscation The most similar work to ours is by Weggenmann and Kerschbaum [54] who also consider the author obfuscation problem but apply standard differential privacy using a Hamming distance of 1 between all documents. As with our approach, they consider the simplified utility requirement of topic preservation and use word embeddings to represent documents. Our approach differs in our use of the Earth Mover's metric to provide a strong utility measure for document similarity.
An early work in this area by Kacmarcik et al. [23] applies obfuscation by modifying the most important stylometric features of the text to reduce the effectiveness of author attribution. This approach was used in Anonymouth [36], a semi-automated tool that provides feedback to authors on which features to modify to effectively anonymise their texts. A similar approach was also followed by Karadzhov et al. [24] as part of the PAN@Clef 2017 task.
Other approaches to author obfuscation, motivated by the PAN@Clef task, have focussed on the stronger utility requirement of semantic sensibility [6,9,34]. Privacy guarantees are therefore ad hoc and are designed to increase misclassification rates by the author attribution software used to test the mechanism.
Most recently there has been interest in training neural network models which can protect author identity whilst preserving the semantics of the original document [15,49]. Other related deep learning methods aim to obscure other author attributes such as gender or age [11,32]. While these methods produce strong empirical results, they provide no formal privacy guarantees. Importantly, their goal also differs from the goal of our paper: they aim to obscure properties of authors in the training set (with the intention of the author-obscured learned representations being made available), while we assume that an adversary may have access to raw training data to construct an inference mechanism with full knowledge of author properties, and in this context aim to hide the properties of some other text external to the training set.
Machine Learning and Differential Privacy Outside of author attribution, there is quite a body of work on introducing differential privacy to machine learning: [14] gives an overview of a classical machine learning setting; more recent deep learning approaches include [1,50]. However, these are generally applied in other domains such as image processing: text introduces additional complexity because of its discrete nature, in contrast to the continuous nature of neural networks. A recent exception is [37], which constructs a differentially private language model using a recurrent neural network; the goal here, as for instances above, is to hide properties of data items in the training set.
Generalised Differential Privacy Also known as d_X-privacy [10], this definition was originally motivated by the problem of geo-location privacy [5]. Despite its generality, d_X-privacy has yet to find significant applications outside this domain; in particular, there have been no applications to text privacy.
Text Document Privacy This typically refers to the sanitisation or redaction of documents either to protect the identity of individuals or to protect the confidentiality of their sensitive attributes. For example, a medical document may be modified to hide specifics in the medical history of a named patient. Similarly, a classified document may be redacted to protect the identity of an individual referred to in the text.
Most approaches to sanitisation or redaction rely on first identifying sensitive terms in the text, and then modifying (or deleting) only these terms to produce a sanitised document. Abril et al. [2] proposed this two-step approach, focussing on identification of terms using NLP techniques. Cumby and Ghani [12] proposed k-confusability, inspired by k-anonymity [51], to perturb sensitive terms in a document so that its (utility) class is confusable with at least k other classes. Their approach requires a complete dataset of similar documents for computing (mis)classification probabilities. Anandan et al. [4] proposed t-plausibility which generalises sensitive terms such that any document could have been generated from at least t other documents. Sánchez and Batet [46] proposed C-sanitisation, a model for both detection and protection of sensitive terms (C) using information theoretic guarantees. In particular, a C-sanitised document should contain no collection of terms which can be used to infer any of the sensitive terms.
Finally, there has been some work on noise-addition techniques in this area. Rodriguez-Garcia et al. [43] propose semantic noise, which perturbs sensitive terms in a document using a distance measure over the directed graph representing a predefined ontology.
Whilst these approaches have strong utility, our primary point of difference is our insistence on a differential privacy-based guarantee. This ensures that every output document could have been produced from any input document with some probability, giving the strongest possible notion of plausible-deniability.

Conclusions
We have shown how to combine representations of text documents with generalised differential privacy in order to implement a privacy mechanism for text documents. Unlike most other techniques for privacy in text processing, ours provides a guarantee in the style of differential privacy. Moreover we have demonstrated experimentally the trade-off between utility and privacy.
This represents an important step towards the implementation of privacy mechanisms that could produce readable summaries of documents with a privacy guarantee.
One way to achieve this goal would be to reconstruct readable documents from the bag-of-words output that our mechanism currently provides. A range of promising techniques for reconstructing readable texts from bag-of-words have already produced some good experimental results [20,53,55]. In future work we aim to explore how techniques such as these could be applied as a final post processing step for our mechanism.

A Appendix A
Here we present proofs omitted from the main body of the paper.

A.1 Proofs Omitted from §3
To prove Thm. 1 we introduce the following results.
Definition A1. An n × n matrix whose elements are non-negative and whose rows and columns all sum to 1 is called doubly stochastic. A doubly stochastic matrix which contains only 1's and 0's is called a permutation matrix.
Theorem A1. (Birkhoff-von Neumann) The set of n × n doubly stochastic matrices forms a convex polytope whose vertices are the n × n permutation matrices.
The Birkhoff-von Neumann theorem says that the set of doubly stochastic matrices is a closed, bounded convex set, and every doubly stochastic matrix can be written as a convex combination of the permutation matrices. We can now prove the following result.
Lemma A1. Let D and F be non-negative n × n matrices. Then the problem of finding an F which minimises
\[ \sum_{a,b} D_{ab} F_{ab} \quad \text{subject to} \quad \sum_{a} F_{ab} = 1 \text{ for all } b, \qquad \sum_{b} F_{ab} = 1 \text{ for all } a \]
always has an n × n permutation matrix as an optimal solution.
Proof. We prove this by contradiction. Let F be an optimal n × n solution matrix. Since F is doubly stochastic we can apply Birkhoff-von Neumann. Firstly, we know that such a solution exists (since the set of solutions is closed and bounded). We now assume that F is not a permutation matrix, and also that no permutation matrix is optimal. Let {P_1, P_2, . . . , P_k} be the set of n × n permutation matrices. Then, by Birkhoff-von Neumann, we can write
\[ F = \sum_{i=1}^{k} \lambda_i P_i \]
where λ_i ≥ 0 and \sum_{i=1}^{k} λ_i = 1. Since F is optimal and none of the P_i are optimal, by assumption we also know
\[ \sum_{a,b} D_{ab} F_{ab} < \sum_{a,b} D_{ab} (P_m)_{ab} \]
for 0 < m ≤ k. And thus we have:
\[ \sum_{a,b} D_{ab} F_{ab} = \sum_{i=1}^{k} \lambda_i \sum_{a,b} D_{ab} (P_i)_{ab} > \sum_{i=1}^{k} \lambda_i \sum_{a,b} D_{ab} F_{ab} = \sum_{a,b} D_{ab} F_{ab} \]
which is a contradiction. Thus, either F is a permutation matrix, or there must be a permutation matrix which is also optimal.
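As a numerical sanity check of Lemma A1, the following sketch assumes that the objective is the linear cost obtained by summing D and F entrywise, as the proof's use of convex combinations suggests. It samples doubly stochastic matrices as random convex combinations of permutation matrices and confirms that none beats the best permutation matrix. The helper names are hypothetical.

```python
import itertools
import random

def cost(D, F):
    """Linear transport cost: the sum over all entries of D[a][b] * F[a][b]."""
    n = len(D)
    return sum(D[a][b] * F[a][b] for a in range(n) for b in range(n))

def permutation_matrix(perm):
    """The 0/1 matrix that sends row a to column perm[a]."""
    n = len(perm)
    return [[1.0 if perm[a] == b else 0.0 for b in range(n)] for a in range(n)]

def random_doubly_stochastic(n, rng):
    """A random convex combination of all n x n permutation matrices."""
    perms = list(itertools.permutations(range(n)))
    weights = [rng.random() for _ in perms]
    total = sum(weights)
    F = [[0.0] * n for _ in range(n)]
    for w, p in zip(weights, perms):
        for a in range(n):
            F[a][p[a]] += w / total
    return F

rng = random.Random(0)
n = 4
D = [[rng.random() for _ in range(n)] for _ in range(n)]
best_perm_cost = min(cost(D, permutation_matrix(p))
                     for p in itertools.permutations(range(n)))
sampled_costs = [cost(D, random_doubly_stochastic(n, rng)) for _ in range(200)]
```

A check of this kind does not replace the proof; it only illustrates that a convex combination can never cost less than its cheapest vertex.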
Lemma A2. Let d_X be a pseudometric on X and let K : X → DX be a mechanism satisfying d_X-privacy. Let x, z ∈ BX be bags of length N with corresponding vectors \vec{x}, \vec{z} ∈ X^N. Then K can be extended to a mechanism K* : BX → D(BX) satisfying:
\[ K^{*}(x)(z) = \sum_{\sigma} \prod_{i=1}^{N} K(\vec{x}_i)(\vec{z}_{\sigma(i)}) \]
where the sum is over unique permutations of elements in \vec{z}.
Proof. Recall that a mechanism is a probabilistic function; we have to show that there is a mechanism K* that outputs a valid distribution over bags in BX given an input bag in BX. We show this by constructing the required mechanism. We can easily extend K to a mechanism K′ : X^N → D(X^N) operating on vectors by applying K to each element of \vec{x} in order. That is,
\[ K'(\vec{x})(\vec{z}) = \prod_{i=1}^{N} K(\vec{x}_i)(\vec{z}_i). \]
K′(\vec{x}) defines a valid probability distribution for any \vec{x} since we sum over all possible output vectors \vec{z}.
Observe that the mechanism K′ produces the same output distribution regardless of the ordering of elements in \vec{x} (since the mechanism K operates on each element independently). Therefore the distribution over bags depends only on the different permutations of elements in the output \vec{z}. That is,
\[ K^{*}(x)(z) = \sum_{\sigma} K'(\vec{x})(\sigma(\vec{z})). \]
Here K*(x) also defines a valid probability distribution, since it produces the same distribution as K′(\vec{x}) except that the output probabilities are 'collected' for all permutations of the output vector. Thus K* is the required mechanism.
We are now ready to prove Thm. 1.
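Theorem 1. Let d_X be a pseudometric on X and let K : X → DX be a mechanism satisfying d_X-privacy, i.e.
\[ K(x)(Z) \le e^{d_X(x, x')} K(x')(Z) \quad \text{for all } x, x' \in X \text{ and } Z \subseteq X. \]
Let K* : BX → D(BX) be the mechanism obtained by applying K independently to each element of X for any X ∈ BX. Denote by K*↓N the restriction of K* to bags of fixed size N. Then K*↓N satisfies N E_{d_X}-privacy.
Proof. Let b, b′ be input bags of size N, and c a possible output bag of K*. Observe that c also has size N. Therefore, the Earth Mover's constraints in Def. 3 can be rewritten as:
\[ \sum_{i} F_{ij} = \frac{1}{N}, \qquad \sum_{j} F_{ij} = \frac{1}{N}, \qquad F_{ij} \ge 0. \]
Up to the constant factor 1/N, this has the same form as in Lem. A1, thus the optimal transportation for E_{d_X}(b, c) is achieved by moving each word in bag b to a single word in bag c. The same is true for E_{d_X}(b′, c) and E_{d_X}(b, b′). Next, we fix \vec{b} and \vec{b}′ to be vector representations of respectively b, b′ in X^N such that the optimal transportation for E_{d_X}(b, b′) is
\[ N \, E_{d_X}(b, b') = \sum_{i=1}^{N} d_X(\vec{b}_i, \vec{b}'_i). \]
That is, we fix the ordering of the elements in the vectors \vec{b}, \vec{b}′ so that the Manhattan distance is exactly N times the Earth Mover's distance (which we know can be done thanks to Lem. A1). Finally, from Lem. A2 we know that the following equality between probabilities holds:
\[ K^{*}(b)(\{c\}) = \sum_{\sigma} \prod_{i=1}^{N} K(\vec{b}_i)(\{\vec{c}_{\sigma(i)}\}) \]
where the summation is over all permutations that give distinct vector representations of c. We now compute directly:
\[ K^{*}(b)(\{c\}) = \sum_{\sigma} \prod_{i=1}^{N} K(\vec{b}_i)(\{\vec{c}_{\sigma(i)}\}) \le \sum_{\sigma} \prod_{i=1}^{N} e^{d_X(\vec{b}_i, \vec{b}'_i)} K(\vec{b}'_i)(\{\vec{c}_{\sigma(i)}\}) = e^{\sum_{i} d_X(\vec{b}_i, \vec{b}'_i)} K^{*}(b')(\{c\}) = e^{N E_{d_X}(b, b')} K^{*}(b')(\{c\}). \]
Thus the mechanism K* satisfies N E_{d_X}-privacy for singleton sets, and by extension for all finite sets Z ⊆ BX.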

A.2 Proofs Omitted from §4
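Lemma 4. The n-dimensional Laplacian Lap^n(v) can be realised by selecting vectors represented as (r, p), where r is selected according to Gam^n_{1/ε}(r) and p is selected independently according to Uniform^n(p).
Proof. We note first that the n-dimensional Laplacian is spherically symmetric; that is, we want the length of the random vector to follow a Laplacian distribution independently from its direction. Therefore the Laplacian has a stochastic representation:
\[ X \overset{d}{=} R \cdot U \]
where R = ||X|| and U = X/||X||, i.e. U is a random variable drawn from the uniform distribution on the n-sphere (that is, U ∼ Uniform^n(p)) and R is a 'scaling' component independent from U. We now show that the radial component is drawn from the Gamma distribution. From Def. 5 we have that
\[ \mathrm{Lap}^n(v) = c_n e^{-\epsilon \|v\|}. \]
We perform a conversion to spherical co-ordinates (r, θ_1, θ_2, . . . , θ_{n−1}) using the following transformation [39]:
\[ v_1 = r\cos\theta_1, \quad v_2 = r\sin\theta_1\cos\theta_2, \quad \ldots, \quad v_{n-1} = r\sin\theta_1\cdots\sin\theta_{n-2}\cos\theta_{n-1}, \quad v_n = r\sin\theta_1\cdots\sin\theta_{n-2}\sin\theta_{n-1} \]
where r = \sqrt{v_1^2 + · · · + v_n^2}. The bounds for the new co-ordinates are:
\[ 0 \le r < \infty, \qquad 0 \le \theta_i \le \pi \ (1 \le i \le n-2), \qquad 0 \le \theta_{n-1} \le 2\pi. \]
We also need the Jacobian determinant, denoted \partial(v_1, v_2, \ldots, v_n) / \partial(r, \theta_1, \ldots, \theta_{n-1}). This is well-known to be:
\[ r^{n-1} \sin^{n-2}\theta_1 \sin^{n-3}\theta_2 \cdots \sin\theta_{n-2}. \]
And therefore we reason:
\[ \int \mathrm{Lap}^n(v)\, dv = \int_0^\infty \int_0^\pi \cdots \int_0^{2\pi} c_n e^{-\epsilon r} r^{n-1} \sin^{n-2}\theta_1 \cdots \sin\theta_{n-2} \, d\theta_{n-1} \cdots d\theta_1 \, dr \]
\[ = \int_0^\infty c_0 r^{n-1} e^{-\epsilon r} dr \int_0^\pi c_1 \sin^{n-2}\theta_1 d\theta_1 \cdots \int_0^\pi c_{n-2} \sin\theta_{n-2} d\theta_{n-2} \int_0^{2\pi} c_{n-1} d\theta_{n-1} \qquad \text{"Independent variables; } c_n = c_0 \times c_1 \times \cdots \times c_{n-1}\text{"} \]
We recognise the form of this integral as the stochastic representation of a spherically symmetric distribution. The first component is the radial component and the remainder of the integrals represent the uniform distribution on the n-sphere. We can compute the constant c_0 by equating the radial component with a univariate distribution. That is,
\[ c_0 r^{n-1} e^{-\epsilon r} = \mathrm{Gam}^n_{1/\epsilon}(r) = \frac{\epsilon^n r^{n-1} e^{-\epsilon r}}{(n-1)!}, \qquad \text{so} \qquad c_0 = \frac{\epsilon^n}{(n-1)!}. \]
In particular, the normalising constant is
\[ c_n = \frac{\epsilon^n}{(n-1)! \, S_{n-1}(1)} \]
where S_{n−1}(1) is the surface area of the n-dimensional unit sphere.
Proof. This follows from the observation that the integral \int_0^\pi c_1 \sin^{n-2}\theta_1 d\theta_1 \cdots \int_0^\pi c_{n-2} \sin\theta_{n-2} d\theta_{n-2} \int_0^{2\pi} c_{n-1} d\theta_{n-1} must be equal to 1, since we defined c_0 such that the radial integral was a probability distribution.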
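The sampling procedure implied by Lemma 4 can be sketched as follows. The sketch assumes the density is proportional to exp(−ε||v||), takes the Gamma distribution to have shape n and scale 1/ε, and draws the direction by normalising a standard Gaussian vector, which is one standard way of sampling uniformly on the sphere (the lemma itself does not prescribe a method). The function name is hypothetical.

```python
import math
import random

def sample_laplace(n, epsilon, rng):
    """Draw one n-dimensional vector with density proportional to
    exp(-epsilon * ||v||), using a radius and a direction drawn independently."""
    # Radius: Gamma distribution with shape n and scale 1/epsilon.
    r = rng.gammavariate(n, 1.0 / epsilon)
    # Direction: a normalised standard Gaussian vector is uniform on the sphere.
    g = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in g))
    return [r * x / norm for x in g]

rng = random.Random(0)
n, epsilon = 3, 2.0
norms = [math.sqrt(sum(x * x for x in sample_laplace(n, epsilon, rng)))
         for _ in range(20000)]
mean_norm = sum(norms) / len(norms)  # the Gamma mean is n / epsilon
```

Smaller values of ε give a larger expected radius n/ε, so the noise vector moves a word embedding further from its original position.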
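Lemma A3. For higher dimensions than 1 the probability of a random vector being selected within a region is determined by a multiple integral; for the special case that the region is D(R) := {v | ||v|| ≤ R}, then when v is sampled from a Lap^n distribution, the probability that it is contained in D(R), denoted L_n(R), is given by:
\[ L_n(R) = 1 - e^{-\epsilon R} \, e_{n-1}(\epsilon R) \]
where e_k(α) := \sum_{0 ≤ i ≤ k} α^i / i!.
Proof. Using Lem. 4 we can calculate this probability using the radial (Gamma) distribution, since we defined the angular and radial distributions independently. That is,
\[ L_n(R) = \int_0^R \mathrm{Gam}^n_{1/\epsilon}(r) \, dr = \int_0^R r^{n-1} e^{-\epsilon r} \times \frac{\epsilon^n}{(n-1)!} \, dr. \]
This has well-known CDF given by
\[ 1 - e^{-\epsilon R} \, e_{n-1}(\epsilon R) \]
which we now prove. We note that
\[ \int_0^R \epsilon e^{-\epsilon r} \, dr = 1 - e^{-\epsilon R} \tag{26} \]
and using integration by parts we see that
\[ \int_0^R r^{n-1} e^{-\epsilon r} \times \frac{\epsilon^n}{(n-1)!} \, dr = -e^{-\epsilon R} \frac{(\epsilon R)^{n-1}}{(n-1)!} + \int_0^R r^{n-2} e^{-\epsilon r} \times \frac{\epsilon^{n-1}}{(n-2)!} \, dr. \tag{27} \]
And therefore
\[ \text{Probability of outputting an element in } D(R) = \int_0^R r^{n-1} e^{-\epsilon r} \times \frac{\epsilon^n}{(n-1)!} \, dr = 1 - e^{-\epsilon R} \, e_{n-1}(\epsilon R) \qquad \text{"(26) and (27); induction"} \]
We now present the proof of Thm. 3. It follows as a consequence of the next theorem.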
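The probability L_n(R) can be checked numerically. The sketch below assumes the standard closed form for the CDF of a Gamma distribution with integer shape n and scale 1/ε, namely 1 − exp(−εR) · e_{n−1}(εR), and compares it with a direct midpoint-rule integration of the radial density. Function names and the chosen parameter values are hypothetical.

```python
import math

def truncated_exp(k, alpha):
    """e_k(alpha): the sum of alpha**i / i! for 0 <= i <= k."""
    return sum(alpha ** i / math.factorial(i) for i in range(k + 1))

def laplace_ball_probability(n, epsilon, R):
    """Assumed closed form: 1 - exp(-epsilon * R) * e_{n-1}(epsilon * R)."""
    return 1.0 - math.exp(-epsilon * R) * truncated_exp(n - 1, epsilon * R)

def radial_integral(n, epsilon, R, steps=20000):
    """Midpoint-rule integral over [0, R] of the radial density
    r**(n-1) * exp(-epsilon * r) * epsilon**n / (n-1)!."""
    h = R / steps
    total = 0.0
    for i in range(steps):
        r = (i + 0.5) * h
        total += r ** (n - 1) * math.exp(-epsilon * r)
    return total * h * epsilon ** n / math.factorial(n - 1)

closed_form = laplace_ball_probability(5, 0.5, 12.0)
numeric = radial_integral(5, 0.5, 12.0)
```

For n = 1 the closed form reduces to 1 − exp(−εR), the familiar exponential CDF, which is the base case of the induction.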
, and so

Thus the probability of outputting an element of Z is the same as the probability of outputting Z_E, and by (28) that is at least the probability of outputting an element from Z_M by applying a standard n-dimensional Laplace mechanism to each of the components of b. We can now compute:

The result follows by completing the integration, and applying simplifying approximations. Let

We rewrite the RHS of this as: (30)

Using Lem. A3 we can simplify the integral for v_N, to obtain:

Rewriting the product \prod_{1 ≤ i ≤ N−1} Lap^n(v_i) as (c_n)^{N−1} e^{−ε(\sum_{1 ≤ j ≤ N−1} ||v_j||)} and using (29) we can simplify (31) to

We make two approximations. Observe that e_{n−1} is an increasing function, and that within the region of integration we have

so that (32) is at least

The final simplification is to note that the integral is no more than V_n(R)^{N−1}, where V_n(R) is the volume of an n-dimensional sphere. Putting all this together we have:

We can now unwind this inequation to obtain the result, noting that R := N∆.
For the proof of Thm. 3, we need to make some further simplifications by using some additional constraints on the data.
In our application we know that the word2vec embeddings are typically of the order n ≥ 30. In this case we can make the following approximations to c_n V_n(R).
Using Stirling's approximation for n! we obtain
\[ c_n V_n(R) \approx \left( \frac{e \epsilon R}{n} \right)^{n} \times \frac{1}{\sqrt{2\pi n}} \tag{36} \]
Comparing to our formula for Thm. 3, we can see that if we set eεR ≤ n then (36) is no more than 1/\sqrt{2πn} and so, for n ≥ 30, we have
\[ \frac{(c_n V_n(N\Delta))^{N} - 1}{c_n V_n(N\Delta) - 1} \approx 1 \]
giving finally that if eεN∆ ≤ n then the utility calculation reduces to: which would be expected of a linear-like integration over an n-dimensional variable.
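The quality of the Stirling approximation can be checked numerically. The sketch assumes that c_n V_n(R) equals (εR)^n / n!, which is what one obtains if c_n is ε^n / ((n−1)! S_{n−1}(1)) and V_n(R) is R^n S_{n−1}(1) / n; this identity and the function names are assumptions of the sketch.

```python
import math

def exact_cn_vn(n, epsilon, R):
    """(epsilon * R)**n / n!, evaluated in log space to avoid overflow."""
    return math.exp(n * math.log(epsilon * R) - math.lgamma(n + 1))

def stirling_cn_vn(n, epsilon, R):
    """Stirling form: (e * epsilon * R / n)**n / sqrt(2 * pi * n)."""
    log_power = n * (1.0 + math.log(epsilon * R) - math.log(n))
    return math.exp(log_power) / math.sqrt(2.0 * math.pi * n)

n = 300                          # dimension of the embeddings used in B.1
epsilon = 10.0                   # hypothetical privacy parameter
R = n / (math.e * epsilon)       # boundary case: e * epsilon * R = n
exact = exact_cn_vn(n, epsilon, R)
approx = stirling_cn_vn(n, epsilon, R)
bound = 1.0 / math.sqrt(2.0 * math.pi * n)
```

At the boundary e·ε·R = n the Stirling form equals 1/√(2πn), and the relative error against the exact value is of order 1/(12n).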

B Experimental Details
Here we describe further details of our constructed dataset and implemented inference mechanisms to support replicability, along with some additional analysis.

B.1 Dataset Construction
Document Sets Following [28], we take the first 2000 words of each story to constitute the known-author text, and the final 1000 words of each to constitute the unknown-author snippet. We then normalise the text (removing stopwords and punctuation, and lowercasing, which are standard for topic classification) and then, because our mechanism requires each bag-of-words document to be of a fixed size N, truncate each text or snippet to the length of the shortest one in the set. So for our 20-author set the texts are of length 420, and for the 50-author set length 402.
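The preprocessing steps above can be sketched as follows. The stopword list is a deliberately tiny, hypothetical stand-in, and the function names are illustrative; the sketch only mirrors the order of operations described (split, normalise, truncate to a common length N).

```python
import string

# Hypothetical, deliberately tiny stopword list; a real run would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "was"}

def split_story(words, known_len=2000, snippet_len=1000):
    """First `known_len` words form the known-author text; the final
    `snippet_len` words form the unknown-author snippet."""
    return words[:known_len], words[-snippet_len:]

def normalise(words):
    """Lowercase, strip punctuation, and drop stopwords and empty tokens."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = (w.lower().translate(table) for w in words)
    return [w for w in cleaned if w and w not in STOPWORDS]

def truncate_to_shortest(documents):
    """Truncate every document to the length N of the shortest one."""
    n = min(len(d) for d in documents)
    return [d[:n] for d in documents]

story = "The cat sat. And the dog ran to the park!".split()
normalised = normalise(story)
known, snippet = split_story(list(range(10)), known_len=4, snippet_len=3)
equal_length = truncate_to_shortest([[1, 2, 3], [4, 5]])
```

Truncating after normalisation means that N counts content words only, which is why the resulting lengths are well below the original word counts.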
Word Embeddings The Google News word2vec embeddings 14 are 300-dimensional embeddings that were trained on about 100 billion words of news text and in full contain about 3 million words and phrases. To make our experiments computationally feasible, we restricted our vocabulary to the 100,000 most frequent words for our 20-document dataset and the 30,000 most frequent words for our 50-document dataset.

B.2 Inference Mechanism Implementation Details
Different-representation author identification The algorithm of [28] does not require any training for our purposes. (In tasks where a "don't know" answer is permitted, there is a threshold parameter σ that can be learnt, but we do not use this.) There are a few hyperparameters to the method (e.g. size of character n-grams, number of character n-grams in the feature vector); for the most part we use the hyperparameter settings of the replication we used as our starting point 15 [41], which were set on the basis of the empirical analysis of [28]. Only the minimum text length for training is changed to 400, given the length of our texts.
Different-representation topic classification In training, we use pretrained embeddings trained on about 16 billion words of English Wikipedia text. 16 Our in-domain dataset consists of texts from 500 authors chosen at random, some of whom had written multiple stories; from this we derived a training set of 448 known-author texts of size 2000 each, split evenly between the two topics, and a comparable validation set of 111 known-author texts. We trained the classifier for 25 epochs, with learning rate 1.0. On the validation set, the accuracy is 0.937.

B.3 Analysis of Examples
Table 2 illustrates the three closest distances for a sample of unknown-author snippets, along with their authors. It can be seen that the distances are relatively close, a consequence of the high-dimensional Euclidean space. Table 3 illustrates some of the changes introduced by the privacy mechanism under various levels of ε for a single sample snippet. ε = 30 produces very minor changes, mostly semantically close (felt → feels, uncle → grandma), but with some greater randomness as well (hurt → Harrison). This increases as ε decreases, until there are some very unlikely words (e.g. ENERGY STAR), as expected.
In a practical application, such unlikely words or phrases like ENERGY STAR would look rather out of place with respect to the domain. However, our choice of word2vec vocabulary for this experiment was in a sense arbitrary; a practical application could use a vocabulary tailored to the domain, by (say) culling entries that do not appear in a training set.