Towards privacy-aware exploration of archived personal emails

Bartliff, Zoe; Kim, Yunhyong; Hopfgartner, Frank

doi:10.1007/s00799-024-00394-5


Open access
Published: 21 February 2024


International Journal on Digital Libraries


Zoe Bartliff^1, Yunhyong Kim^1 (ORCID: orcid.org/0000-0001-5400-0389) & Frank Hopfgartner^2,3


Abstract

This paper examines how privacy measures, such as anonymisation and aggregation processes for email collections, can affect the perceived usefulness of email visualisations for research, especially in the humanities and social sciences. The work is intended to inform archivists and data managers who are faced with the challenge of accessioning and reviewing increasingly sizeable and complex personal digital collections. The research in this paper provides a focused user study to investigate the usefulness of data visualisation as a mediator between privacy-aware management of data and maximisation of research value of data. The research is carried out with researchers and archivists with vested interest in using, making sense of, and/or archiving the data to derive meaningful results. Participants tend to perceive email visualisations as useful, with an average rating of 4.281 (out of 7) for all the visualisations in the study, with above average ratings for mountain graphs and word trees. The study shows that while participants voice a strong desire for information identifying individuals in email data, they perceive visualisations as almost equally useful for their research and/or work when aggregation is employed in addition to anonymisation.


1 Introduction

Email has been referred to as ‘the backbone of the internet’, a ‘virtual working environment’ and the ‘main means for distributed collaboration’ ([1]). An email collection is an organically formed record that documents both important and everyday moments in an individual’s life and work. The extent of information that can be extracted from such a dataset makes email collections a rich source for investigating patterns of human behaviour, relationships, and communications (cf. [2,3,4,5,6,7]). However, there are caveats to the valuable nature of this data, most notably the enduring ethical concerns provoked by facilitating access to such personal content. Email research often thrives on the details of individual lives and connections with others, information that can be deeply private, sensitive, and/or confidential in nature. The challenge has hitherto encouraged a caution-driven practice of closing or severely restricting access to collections.

Scholars and custodians of data alike have explored and implemented a great range of methods for accessing and exploring email collections (e.g. [2, 3, 8,9,10]), and yet the impact of these methods on privacy preservation is neither widely discussed nor, seemingly, understood ([11]). This partly reflects the complexity of thought surrounding privacy itself (cf. [12,13,14]). Regardless, this disconnect amplifies continued uncertainty, resulting in a ‘risk-adverse attitude’ (cf. [15]) amongst custodians of data. Consequently, a great swathe of potential research data remains locked within closed or ‘dark’ archives ([3, 4, 15,16,17]). Whilst preventing access might be ‘[t]he most intuitive way to preserve privacy’ ([18]), it also, in many ways, defeats the purpose of maintaining the records, particularly in instances where the relevance of the data might be time sensitive ([11, 15]). This is the second of three key challenges that Lise Jaillant identifies the archival sector as facing along ‘the path from the appraisal of records to their analysis’ ([16]).

Even in cases where an email archive is not ‘dark’, discoverability is a continued issue with a heavy reliance on search infrastructure and accurate metadata and cataloguing ([16]), as well as a demand on the end user to have ‘a rough idea of the information they are trying to retrieve’ ([19]). In response, data visualisation has been used in many facets of email research to support the holistic, creative, and perhaps even ‘playful’ (cf. [20]) exploration of email datasets. Visualisations have been shown to reveal patterns and insights that may otherwise be obscure to researchers (cf. [21,22,23,24,25]). The exploratory and browsing behaviour encouraged by visualisations (cf. [26, 27]) is of particular use for high volume data. They ‘capitalise on the characteristics of digital sources’ ([28]), facilitating a malleable perspective on a collection. In short, visualisations represent a method that may support both researchers and practitioners to engage usefully with email collections, irrespective of pre-existing data analysis skills.

Although many have noted the value of visualisations as a research-enabling interface to email collections, how the design of a visualisation interacts with, protects, or compromises privacy remains understudied. Without an understanding of the impact of the visualisation on privacy, it is possible that the method of mediation might negatively impact upon access or open the data to reveal ‘previously unknown patterns and relationships’ ([11]) that might, contrary to intention, compromise privacy.

It is within this knowledge gap that the research presented within this paper sits. It presents findings from an empirical investigation into the potential for visualisations to facilitate access for users and provide a degree of protection for any personal or sensitive data contained within a dataset. Through these findings, this paper intends to promote a greater understanding of the relationship between privacy management strategies and the impact that these might have on the perceived usefulness of visualisations to users. This, in turn, might support both researchers and archivists ‘to capitalise on the information available to them at the appropriate scale of privacy’ ([11]), therefore mitigating the need to close email archives to adhere to the legal and ethical requirements of engaging with sensitive data. Should such an approach prove fruitful, it would fall within the calls for archivists and other custodians of knowledge to ‘consider very different types of access’ that more closely reflect user needs ([29]).

In the next section, we start by setting the scene to explain our approach to selecting and implementing visualisations in our user case study. This is followed by a detailed methodology of the user study (Sect. 3). Section 4 sets out the findings from the study which, in turn, is followed by a reflective discussion in Sect. 5, that considers the results of the study, their implications, and future work that might be conducted in this area.

2 Background

Our approach to the current study is developed through three steps. First, previous email research is reviewed, especially where data visualisation techniques have been employed and/or evaluated (Sect. 2.1). Second, ethical concerns for digital archives are also discussed, with a special focus on concerns associated with privacy and email collections (Sect. 2.2). Finally, in Sect. 2.3, we explain how we bridge these areas, to formulate our research questions and to select and generate our visualisations for our user case study.

2.1 Email visualisation

Research related to emails often poses questions concerned with understanding how people use email for communication and what this can reveal about them, their environment ([30,31,32,33]), and their social/professional network ([34,35,36,37]). Building indirectly on this understanding of email usage are studies aimed towards improving the efficiency and efficacy of communication workflows ([1, 38, 39]), and the filtering out of unwanted communication ([40,41,42]). Additionally, in the humanities, email data research naturally aligns with that of older forms of correspondence such as letters (cf. [43]), for example, the close reading of selected passages for qualitative analysis in the context of other events and achievements in the correspondents’ lives. Features such as the metadata found in email headers (e.g. time stamps, subject, who is sending and receiving) help broaden this context, to open up the researchers’ gaze to a wider array of analysis than email’s technological predecessors might have allowed.
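As a minimal sketch of the header metadata mentioned above, the following Python fragment reads the sender, recipients, subject and time stamp from a single raw message using only the standard library. The sample message, the addresses and the function name `header_metadata` are illustrative assumptions and are not drawn from the study's data or tooling.

```python
from email import message_from_string
from email.utils import getaddresses, parsedate_to_datetime

# Hypothetical raw message; only the header block matters for this sketch.
RAW = """\
From: Alice Example <alice@example.org>
To: Bob Example <bob@example.org>, carol@example.org
Subject: Draft chapter
Date: Mon, 05 Feb 2024 09:30:00 +0000

Body text is ignored in this sketch.
"""


def header_metadata(raw: str) -> dict:
    """Return sender, recipients, subject and sent time from the headers."""
    msg = message_from_string(raw)
    sender = getaddresses(msg.get_all("From", []))
    recipients = getaddresses(msg.get_all("To", []) + msg.get_all("Cc", []))
    return {
        "sender": sender[0][1] if sender else None,
        "recipients": [address for _, address in recipients],
        "subject": msg.get("Subject"),
        "sent": parsedate_to_datetime(msg["Date"]) if msg["Date"] else None,
    }


if __name__ == "__main__":
    print(header_metadata(RAW))
```

Even this small set of fields is enough to place a message in time and within a network of correspondents, which is what broadens the analytical context beyond close reading.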

A systematic classification of email research ( [11]) reveals two strands of thought (cf. Fig. 1)—one with the focus of enquiry on people (e.g. the patterns of relationships and social network analysis), and one which concentrates on the emails themselves and their usage (e.g. topic identification, content analysis, patterns of behaviour).

The use of specific types of visualisation has, on the whole, been agnostic of these branches of research ([11]), although there are exceptions to this with, for instance, studies focused on social network analysis prioritising network graphs (cf. [24, 44,45,46,47,48,49,50,51,52,53]). The great variety and adaptability of visualisations ensure that many common designs (e.g. bar charts [24, 49, 54,55,56], line graphs [2, 24,25,26, 49, 54, 55, 57, 58], scatter/bubble plots [46, 54, 57,58,59], pie charts [60]) can be adapted to diverse research objectives. There have been several, more specialised types of visualisations that were employed across the spectrum of research interests, such as timelines ([25, 57, 59, 61,62,63]), heatmaps ([64]) and iconographic representations ([65,66,67]), and some studies even creatively combine visualisations in a hybrid approach (e.g. [22, 25, 45, 48, 53, 63, 65, 66, 68]).

The literature shows that network graphs, of various types (e.g. random, force directed, tree), are notable as a mainstay of social network analysis research (cf. [24, 44,45,46,47,48,49,50,51,52,53]) with all 18 items reviewed in this area using this visualisation. The research for patterns of relationships employs a more varied selection with no particular preference, ranging from widely popular visualisations such as scatter and/or bubble plots ([54, 59]) to newer visualisations such as mountain graphs ([57]). Bar charts are most regularly used in literature for studies investigating patterns of behaviour, although, as a mainstay of visualisation creation, they also appear in studies focused on other branches of investigation (cf. [24, 49, 54,55,56]). Email content analysis ‘aims to help users navigate a collection and withdraw meaningful data whether as a search or summary mechanism’ [11] and, as with many forms of textual analysis, the forms of visualisation used are quite broad (cf. [25, 52, 57, 60,61,62,63, 69, 70]). The word tree visualisation is one of these (cf. Fig. 9), a type of visualisation that has proved useful for the early stages of textual exploration (cf. [71,72,73,74]) and, as such, will be employed in our study.
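To illustrate the data structure that typically sits behind a word tree, the sketch below nests the words that follow a chosen root word and counts how often each branch occurs. It is an assumption-laden illustration rather than the implementation used in this study: the function name `build_word_tree`, the whitespace tokenisation and the `depth` parameter are all hypothetical choices.

```python
def build_word_tree(sentences, root, depth=3):
    """Nest the words that follow `root`, counting how often each branch occurs."""
    tree = {}
    for sentence in sentences:
        words = sentence.lower().split()  # naive tokenisation, assumed for brevity
        for i, word in enumerate(words):
            if word != root:
                continue
            node = tree
            # Walk up to `depth` words after the root, extending the branch.
            for following in words[i + 1 : i + 1 + depth]:
                child = node.setdefault(following, {"count": 0, "children": {}})
                child["count"] += 1
                node = child["children"]
    return tree


if __name__ == "__main__":
    sample = ["Please send the report", "please send the draft", "Please call me"]
    print(build_word_tree(sample, "please"))
```

A renderer would then draw each nested key as a branch, with the count controlling font size or line weight, so that frequent continuations of the root word stand out.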

In exploring the ‘state-of-the-art’ approaches to visualisation design, [75] highlights several criteria that encompass successful visualisation. They indicate that data visualisations should be ‘familiar’, ‘able to convert abstract information’ in a way that ‘preserves its underlying meaning but also provides insights to the user’. In each of the studies above, it is argued, if indirectly, that the method of visualisation utilised fulfils these criteria, therefore creating a useful interface for the potential users (cf. [2, 25, 26]). These studies, however, centre their focus on the particular features of the visualisation under investigation, rather than exploring the broader applications or benefits of the design outside of the stated purpose. Therefore, whilst the visualisations might be well suited to the task at hand and they might also fulfil the key criteria of good design established by [75] and other scholars, it is not possible to extrapolate meaningfully from these studies as to what might benefit the sector as a whole, particularly with reference to user needs.

2.2 Email collection ethics

Within the context of email collections, it is necessary to advance discussions of email visualisation beyond the immediate needs of the researcher to also address questions of ethical needs. Emails, in their raw form, not only contain information that can identify people by name, email address, and/or affiliation, but contain detailed information about locations, events, and relationships between people. Metadata alone can be used to infer identities, and sensitive and/or confidential information. For example, ‘e-mail headers reveal who is central to your professional, social and romantic life’ ([76]). The access to such collections creates opportunities for ‘private information within these collections to be disseminated widely and without consent’ ([77]) even where it creates opportunities for much needed research ([78]).
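The point that metadata alone can be revealing may be illustrated with a short sketch. Assuming a hypothetical list of (sender, recipient) pairs taken from message headers, a simple frequency count is already enough to rank an account owner's correspondents; the addresses and the function name `contact_frequency` are invented for illustration.

```python
from collections import Counter

# Hypothetical (sender, recipient) pairs as might be read from email headers.
HEADERS = [
    ("owner@example.org", "a@example.org"),
    ("owner@example.org", "a@example.org"),
    ("a@example.org", "owner@example.org"),
    ("owner@example.org", "b@example.org"),
    ("c@example.org", "owner@example.org"),
]


def contact_frequency(pairs, owner):
    """Count how often each correspondent exchanges mail with the owner."""
    counts = Counter()
    for sender, recipient in pairs:
        other = recipient if sender == owner else sender
        if other != owner:
            counts[other] += 1
    return counts


if __name__ == "__main__":
    # The most frequent correspondent is listed first.
    print(contact_frequency(HEADERS, "owner@example.org").most_common())
```

No message body is read at any point, which is why header-level data cannot be treated as automatically safe to release.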

Emails also often include information beyond that which is written or intended to be received by the email account owner, or worse, those who access it later. For example, emails have attachments which could, if distributed further, entail copyright infringement or communication of privileged, proprietary, or confidential information. In established archival practice, it is standard to consider the materials of long-deceased individuals as carrying less risk on disclosure. Yet even when the primary owner of the email is deceased, content is directly associated with others who may still be living, potentially causing distress or issues of privacy. This challenge is compounded by the potential for the emails of others to get copied in as a thread and, sometimes, even sent to individuals who were not intended to have access. Effectively, when you archive emails in one person’s personal archive, you are archiving other people’s emails as well.^{Footnote 1} It has to be recognised that, when it comes to digital forms of communication, it is not always possible for creators to be aware of how the information will be used in later contexts, and that such later use can interfere with an individual’s right to forget ([79]). In addition, some laws and/or regulations stipulate that the control of the data needs to take into account cultural needs.^{Footnote 2}

Privacy management is an especially thorny and shifting concept ([12]), and its intersection with email research and visualisation has, thus far, remained largely unexplored. The majority of the studies identified in [11] made no mention of privacy, and nearly a quarter of the studies were tested on participants’ own email collections, or in a slightly smaller sample, on the popular open source email dataset, the Enron dataset.^{Footnote 3} That survey highlights that only two out of the 39 reviewed papers engaged with a personal email archive (cf. [25, 56]) and, of these, one involved the owner of the archive as a co-author. This is a distinct gap within literature pertaining to email visualisation research, one which has arisen, at least in part, due to the difficulties involved in defining privacy.

Debated in scholarship at least since the philosophies of Aristotle ([80]), little has been agreed about the definition of privacy other than that it is a multifaceted concept encompassing legal, ethical, cultural and personal dimensions. There have been ‘many attempts to create a synthesis of existing literature’ ([81]), but the default approach to protecting privacy for many institutions, archives included, has necessarily been to rely on the more concrete legal definitions of, for example, personal and sensitive data,^{Footnote 4} as well as on the ethical mandate to limit harm.

This situation is not one that will improve with time. In 2012, it was noted that approximately ‘75% of the email accounts belong to individual users, with only 25% belonging to organisations’ ([19]). This statistic is more than a decade old at the point of writing and, therefore, does not necessarily reflect the proportions of email data that are destined to be archived in coming years. Whilst the ‘risk adverse attitude’ (cf. [15]) of present custodians of data is quite logical given the potential ramifications from mis-managed email data, the great swathes of incoming, culturally significant data necessitate the inclusion of alternative approaches in order to facilitate effective user-driven access and, therefore, research.

There is not, at present, a nuanced and consistent approach for managing privacy with respect to email collections, although [11] presents the first steps towards this. The paper explores existing literature pertaining to the visualisation of emails and the impact of different design choices on the level of privacy consciousness. The five privacy consciousness (PrivCon) levels discussed in the paper ([11]) represent a scale of privacy management strategies that might be applied to the data that forms the basis for different visualisations. These strategies range from full disclosure (PrivCon 0) through to closed to public access (PrivCon 4) with each level representing a category of privacy management as opposed to a specific method. The description of the levels is reproduced in contracted form below:

PrivCon 0: the open end of the scale, with no accounting for privacy. Visualisations at this level contain the full range of the data, as would be utilised in, for example, a forensic examination of the data or an archivist’s appraisal when a full collection has been donated.
PrivCon 1: the introduction of redaction that ‘includes situations whereby the data have been altered or removed in order to obscure the identity of individuals contained within’.
PrivCon 2: ‘the grouping or amalgamation of data to the point that individuals become “lost in the crowd”, minimising the risk that details might be identified’.
PrivCon 3: the introduction of noise which ‘involves shifting the data through the use of an algorithm, statistical model or encryption, in a way that maintains the statistical characteristics of the data set, but the detail does not consistently reflect the original’.
PrivCon 4: a closed collection which has been fully redacted and contains only a descriptive representation of the collection, with only a cursory indication of contents. This presentation of the data represents what might be found in an online catalogue for an archival collection that only permits on-site access, or for a fully embargoed collection.
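The levels from full disclosure to added noise can be made concrete with a small sketch over hypothetical header records. This is one possible reading of each level, not the procedure used in [11] or in this study: the salted-hash pseudonyms, the monthly counts and the Gaussian noise are all assumptions chosen for brevity, and the function names are invented.

```python
import hashlib
import random
from collections import Counter

# Hypothetical header records: (sender, recipient, month).
RECORDS = [
    ("alice@example.org", "bob@example.org", "2024-01"),
    ("alice@example.org", "bob@example.org", "2024-01"),
    ("carol@example.org", "alice@example.org", "2024-01"),
    ("alice@example.org", "carol@example.org", "2024-02"),
]


def privcon0(records):
    """Full disclosure: records are returned unchanged."""
    return list(records)


def pseudonym(address, salt="demo-salt"):
    """Replace an address with a stable, salted pseudonym (assumed scheme)."""
    digest = hashlib.sha256((salt + address).encode("utf-8")).hexdigest()
    return "person-" + digest[:8]


def privcon1(records):
    """Redaction: identities are obscured but every record is kept."""
    return [(pseudonym(s), pseudonym(r), month) for s, r, month in records]


def privcon2(records):
    """Aggregation: only the number of messages per month survives."""
    return dict(Counter(month for _, _, month in records))


def privcon3(records, scale=1.0, seed=0):
    """Noise: monthly counts are perturbed (Gaussian noise is an assumption)."""
    rng = random.Random(seed)
    return {
        month: max(0, round(count + rng.gauss(0, scale)))
        for month, count in privcon2(records).items()
    }


if __name__ == "__main__":
    for level in (privcon0, privcon1, privcon2, privcon3):
        print(level.__name__, level(RECORDS))
```

Each step discards detail that the previous one retained: pseudonyms keep the structure of who wrote to whom, aggregation keeps only volume over time, and noise keeps only the approximate shape of that volume.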

The manner in which the PrivCon levels (0 to 3) might be applied to a dataset is displayed in Fig. 2. The paper ([11]) reveals a skewed distribution of approaches in the thirty-nine papers reviewed, with a clear tendency leaning towards anonymised/pseudonymised content (PrivCon 1). This is summarised in Table 1 along with pros and cons of each type of strategy.

Table 1 Distribution of PrivCon levels adopted in the literature review of [11], collection type, and pros and cons of each level
