1 Introduction

Social media services like Twitter or Instagram are used to communicate and share information worldwide, which generates a rich set of data. Since a large part of this data is publicly available, it can be used beyond the features of the social media services themselves, especially by third parties.

The main problem with using social media data for applications other than their dedicated use case is that explicit consent from the social media user is usually missing. While most users are aware that their content is publicly available on the Internet, they do not assume that data is frequently recycled for other purposes such as scientific, commercial, or administrative use (Boyd and Crawford 2012). Accordingly, this demands an exceptionally strong focus on their privacy.

In contrast to other environments, the data to be protected is already public (Williams et al. 2017). In the view of third parties, that data can be utilized for any purpose, including those that oppose the user’s interest (Zhou et al. 2008). But data can also be used with good intentions (Daly et al. 2019), where “good” could be defined as “in the user’s interest.” For example, social media has proven to be a valuable source of information in crisis mapping, emergency response, or public planning (Fiedrich and Fathi 2021; Dunkel 2021).

In order to support ongoing development of positive use cases, scientists need to respect and actively protect social media users’ privacy. Scientists need to take explicit control over data that they expose and prevent accidental disclosures.

An approach to support the adoption of accidental disclosure prevention techniques is to prevent the gathering of privacy-relevant data in the first place. We specifically aim at providing methods for the use of social media data following the privacy-by-design principles (Cavoukian et al. 2009).

In this chapter, we show a set of concepts that enable processing social media data with social media users’ privacy in mind. We present a data storage concept that implements an algorithm called HyperLogLog (HLL) (Flajolet et al. 2007) to store not the raw social media data but only statistics about their occurrence. We further show that, at the cost of data precision, privacy can be increased even more by applying multiple layers of abstraction to the data. For a context-dependent treatment of privacy and to cover edge cases, we further introduce a model to implement filter lists on the incoming data stream. A concluding case study demonstrates how our methods are protected against adversarial actors.

2 Fundamentals

2.1 Related Work

Issues and challenges related to privacy arise wherever social media data is involved. In the following, we link to research projects within this book that are primarily based on processing social media data and for which our research is therefore relevant.

The EVA-VGI project (see Chap. 12) studies the heterogeneity, quality, subjectivity, spatial resolution, and temporal relevance of geo-referenced social media data. Focusing on the integration of spatial, temporal, topical, and social dimensions combined with an explicit link between events and reactions, they present conceptual approaches and methods that enable a privacy-aware visual analysis of VGI in general and geo-social media data in particular. The project has taken advantage of the results of our research by implementing HLL on datasets related to their publications (Dunkel et al. 2020).

Similarly, the VA4VGI project (see Chap. 6) describes how geo-aware filtering and anomaly detection on geo-referenced social media data can be a significant information source for stakeholders in journalism, urban planning, or disaster management. They present tag maps that provide overview-first, details-on-demand, visual summaries of large amounts of social media data over time and thus visualize their temporal evolution.

Closely related to the former is the DVCHA project (see Chap. 13). The overall objective of their research is to study the implications of social media data for the efficiency of disaster management. Focusing on so-called Virtual Operations Support Teams (VOST), their research addresses motivation, success factors, and improvement of distributed decision making processes based on disaster-related real-time social media data.

In collaboration with the DVCHA project, we carried out a case study in which we explored the deployment of HLL in disaster management processes (see Sect. 13.6). We developed and conducted a focus group discussion with VOST members, in which we identified challenges and opportunities of working with HLL and compared the process with conventional techniques (Löchner et al. 2020). Findings showed that deploying HLL in the data acquisition process of VOST operations does not disrupt their data analysis process. Instead, several benefits, such as improved handling of huge datasets, may contribute to a more widespread use and adoption of the presented technique, which provides a basis for a better integration of privacy considerations in disaster management.

2.2 On Privacy Aspects

From a generic point of view, privacy is the freedom to retreat, fully or partially, in a self-controlled manner. There are multiple definitions of the term privacy, ranging from personal to cultural points of view (Solove 2008). It is important to distinguish between the right to privacy and the concept of privacy (Hildebrandt 2006). The right is clearly formed by laws, whereas the concept is rather vaguely determined based on subjectively perceived personal values. Privacy is often sacrificed voluntarily in exchange for perceived benefits and sometimes violated by others, either intentionally or accidentally (Reyman 2013).

Privacy by design as a set of principles is a relevant objective in the conception of applications in general. As Cavoukian et al. (2009) state, privacy must be approached from a design-thinking perspective. It must be incorporated in technologies not as an optional on-top feature but as a fundamental characteristic of organizational priorities, project objectives, design processes, and planning operations. Concepts built upon these principles are hard to break in terms of privacy violations.

A contemporary method to protect data has been presented as differential privacy (DP) (Dwork 2008) and adopted frequently (Desfontaines and Pejó 2020). DP adds a certain amount of random data (noise) to a real data set in order to make the real data indistinguishable from the random data and thus protect it from being identified as such. However, DP still requires the original data to be available for processing. Furthermore, DP requires developing new concepts and models for each data set, which is very inefficient when dealing with very large data sets.

In the geo-community, a wide range of concepts is known for protecting privacy in terms of location data. Some techniques are based on anonymity, e.g., mix zones (Beresford and Stajano 2003) or k-anonymity (Ciriani et al. 2007). Others are based on obfuscation, e.g., imprecision (Duckham and Kulik 2005), or on policy, e.g., restriction (Hauser and Kabatnik 2001). All of these approaches require possession of the original raw data. Processed data sets cannot be updated with subsequent data, so the entire data set must be reprocessed upon each update. This is very inefficient when dealing with large amounts of social media data.

In the context of social media data, the consideration of privacy, ethics, and legal issues should play an important role. The statement “Privacy of user data and information should be considered in the initial design of VGI systems” (Mooney et al. 2017) can be extended to platforms and methods for the analysis and further processing of social media data in general.

Kounadi et al. (2018) discuss privacy threats related to inference attacks on geosocial network data. They provide protection recommendations for sharing these sorts of data and publishing resulting visualizations. Keßler and McKenzie (2018) proposed a total of 21 theses that reflect on the current state of geoprivacy from a technological, ethical, legal, and educational perspective. They provide various examples of how common it has become to share location and how it can be used and misused.

2.3 Data Retention

Processing social media data relies to a relevant extent on analytics software, which automatically analyzes gathered social media data stored in local databases. Their user interfaces accept queries against the stored data and return, for example, statistics of post occurrences in any context. Depending on the situation, only parts of that information may be relevant (see Sect. 14.3.1). Still, the entirety of every post has been and remains stored in local databases.

This means that if a data item is being deleted on the site of the corresponding social media service, it still resides at the place where it has been downloaded to. Technically, that practice meets the requirements to be termed data retention. We define this term as such: preserving data for an indefinite time period with no specific purpose for any individual data item but with the assumption to make use of the information in entirety at a later point in time.

The term is publicly discussed mostly in conjunction with telecommunication analysis and surveillance. The public interest group European Digital Rights states that “data retention practices interfere with the right to privacy at two levels: at the level of retention of data, and at the level of subsequent access to that data by law enforcement” (Rucz and Kloosterboer 2020).

We introduce the term in a broader and more technical environment to emphasize the explosive nature of recklessly dealing with personal data, which social media data is (European Commission 2018). According to the above definition, the term is valid for any case of storing and retaining personal data in stock. Wright et al. (2020) even use it to describe any storage of data underlying scientific studies.

Owning a set of data requires great responsibility in terms of data security. It opens up risks of possible abuse, theft, or accidental public exposure (Miller 2020). Breaking it down to a simple rule, it can be stated that “the more data you have, the more data you can lose” (Guillou and Portner 2020).

Beyond governmental agencies and law enforcement, commercial players, journalists, researchers, and nonprofit organizations also face challenges when storing individual-related data like those from social media. Stieglitz et al. (2018) discovered that the volume of data was most often cited as a challenge by researchers. Wang and Ye (2018) summarize common techniques for social media analytics in natural disaster management and coin the term mining for that matter.

Furthermore, the social impact of misusing large sets of data is well-known. The Cambridge Analytica scandal is one of the examples that show how massive data sets can be misappropriated (Berghel 2018). The company used personal information from millions of Facebook users without their consent to derive information about their political points of view and then microtarget personally tailored political advertisements to them. They claimed to have had a major impact on the 2016 US presidential election, which can be regarded as a threat to democratic legitimacy (Dowling 2022).

Users of social media services are starting to realize that all of their data is not only publicly available but also made use of by third parties. Data retention puts forgetfulness as a social concept at risk (Blanchette and Johnson 2002). Described consequences are the chilling effect, in which people gradually increase self-discipline and restrict their communication behavior as they become aware of digital surveillance, and panopticism (Manokha 2018; Büchi et al. 2022).

Nevertheless, the huge amount of data generated by social media services being a tremendous privacy threat is only one side of the coin. Large sets of social media data can also be beneficial for the public. The work of humanitarian organizations depends on publicly available data that is authentic and relevant. VOSTs in particular rely on the availability of public social media data (Kuner and Marelli 2020); therefore, its prosperity must be preserved. A gradual retreat of users from social media services in favor of closed, “antisocial” messaging groups (Leetaru 2019; Wilson 2020) must be prevented.

2.4 HyperLogLog

One of our contributions to this issue presented in this chapter is based on storing data using an algorithm called HyperLogLog (HLL). This algorithm is a cardinality estimator first introduced by Flajolet et al. (2007).

Its fundamental strength is the ability to estimate the distinct count of a multiset (cardinality) and store it in a data structure that does not allow the extraction of individual elements. This is done by hashing each data item and retaining, instead of the original raw data, only the number of leading zeros in the binary representation of its hash. The algorithm is able to predict how many distinct items have been added to the HLL set, based on the maximum number of leading zeros observed. This makes processing data using HLL very efficient in terms of processing time and storage space. It is not possible to search an HLL set for previously unknown information, for example, the usernames of all the posts that have been gathered. In this respect, an HLL-based implementation follows the privacy by design principle.
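The register update and the estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this chapter: the precision P, the truncated SHA-1 hash, and all function names are hypothetical choices, and only the small-range correction from Flajolet et al. (2007) is included.

```python
import hashlib
import math

P = 10              # index bits (assumed precision)
M = 1 << P          # number of registers ("buckets"): 1024
REST_BITS = 64 - P  # bits left for counting leading zeros


def hll_new():
    """Return an empty HLL set: one small counter per register."""
    return [0] * M


def hll_add(registers, item):
    """Hash the item and keep only the maximum leading-zero rank per register."""
    h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
    idx = h >> REST_BITS                       # first P bits pick the register
    rest = h & ((1 << REST_BITS) - 1)          # remaining bits
    rank = REST_BITS - rest.bit_length() + 1   # leading zeros in `rest`, plus one
    registers[idx] = max(registers[idx], rank)


def hll_count(registers):
    """Estimate the number of distinct items that were added."""
    alpha = 0.7213 / (1 + 1.079 / M)           # bias constant, valid for M >= 128
    raw = alpha * M * M / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * M and zeros:               # small-range correction
        return M * math.log(M / zeros)
    return raw


posts = hll_new()
for post_id in range(1000):
    hll_add(posts, f"post-{post_id}")
snapshot = list(posts)
for post_id in range(1000):                    # duplicates leave the set unchanged
    hll_add(posts, f"post-{post_id}")

print(round(hll_count(posts)))                 # an estimate near 1000, not an exact count
```

Note that the registers hold neither the items nor their hashes, only one small integer per bucket, which is why individual post IDs cannot be listed from the structure.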

3 Concepts

3.1 Privacy-Aware Storage

The key aspect of our approach is to make it impossible to trace back to the original social media data from a given processed data set (privacy by design). Therefore, we propose to utilize the cardinality estimation algorithm HyperLogLog (HLL) described in Sect. 14.2.4 to store gathered social media data.

To provide a minimal example of the process, we introduce a scenario, in which the difference in spatial occurrences of social media posts including a certain hashtag should be visualized. The result should be a choropleth map of areas according to the amount of post occurrences within that area (see Fig. 14.1). Areas are defined by a GeoHash, a hierarchical grid-like geocode identification concept (Niemeyer 2008; Morton 1966).
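To illustrate how a coordinate maps to such a grid cell, the following sketch encodes a latitude/longitude pair with the publicly documented Geohash scheme (alternating longitude and latitude bisection, five bits per base-32 character). The function name and the default precision are assumptions made for this example.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=5):
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value = 0, 0
    use_lon = True                       # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            bit = lon >= mid
            lon_lo, lon_hi = (mid, lon_hi) if bit else (lon_lo, mid)
        else:
            mid = (lat_lo + lat_hi) / 2
            bit = lat >= mid
            lat_lo, lat_hi = (mid, lat_hi) if bit else (lat_lo, mid)
        value = (value << 1) | int(bit)
        use_lon = not use_lon
        bits += 1
        if bits == 5:                    # five bits form one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


print(geohash_encode(42.6, -5.6))        # "ezs42"
print(geohash_encode(42.6, -5.6, 3))     # "ezs", the enclosing coarser cell
```

Because shorter geohashes are prefixes of longer ones, truncating a geohash yields the enclosing, larger cell.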

Fig. 14.1 Example of a map showing areas with different occurrences of posts containing the #omicron hashtag on Twitter from January through March 2022. The map of Europe shades areas by post count, with legend classes of greater than 136, 848, and 1262. Map data: OpenStreetMap contributors. Color distribution: Head/Tail Breaks (Jiang 2013)

To store the occurrence of posts in an area, it is only necessary to count the number of distinct posts, i.e., their cardinality. On reflection, this shows that storing the entirety of a social media post is unnecessary. It is sufficient to memorize its unique identifier (ID), which has been assigned by the social media service it originates from.

However, storing the ID in clear text in the database would allow identifying the post, and thus its author, later on. The characteristics of HLL, in turn, make it possible to store data such as a post ID in a set in such a way that it cannot be recovered without prior knowledge of its existence in the set. Storing post IDs in an HLL set related to their geohash will only reveal their cardinality. Posts that occur later in the stream and match the same geohash will be added to this HLL set, which increases its cardinality by one for each new distinct post. The geohash itself, representing the post’s originating area, is stored as the index of the database record (see Table 14.1). The resulting HLL data structure represents all posts matching a certain term from a certain area, while it is impossible to derive the post IDs back from it.

Table 14.1 Exemplary database table structure showing four records (each stands for one area represented by the geohash) and the corresponding HLL set containing the post IDs

Using HLL, we do not store the post IDs themselves but calculate hashes from them and use these to update an array of counters that represents the set of post IDs (see Sect. 14.2.4). Table 14.1 shows an example database table structure with geohash values representing an area and the corresponding HLL set representing the IDs of posts that occurred in that area.

Having a database with geohashes and their corresponding HLL set as shown exemplarily in Table 14.1, it is possible to compute the cardinality of the HLL set and thus determine the number of posts in each area. The result of such a computation could as well be achieved by just incrementing an integer per seen post ID and storing the sum instead of an HLL set. The significance of using the HLL algorithm instead is that it provides the opportunity to perform the set operations union and intersection on the HLL sets.

This can be useful for combinations of individual data sets. Different sets of gathered posts, each relating to certain terms, can be combined to monitor a more specific scenario.
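The set operations can be sketched as follows. The union of two HLL sets is the register-wise maximum; the intersection is not stored directly but can be estimated with the inclusion-exclusion principle, |A ∩ B| ≈ |A| + |B| − |A ∪ B|. The helper names, the precision, the hash function, and the example data are assumptions for illustration only.

```python
import hashlib
import math

P = 12
M = 1 << P
REST_BITS = 64 - P


def make_hll(items=()):
    """Build the register array of an HLL set from an iterable of items."""
    regs = [0] * M
    for item in items:
        h = int.from_bytes(hashlib.sha1(str(item).encode("utf-8")).digest()[:8], "big")
        idx, rest = h >> REST_BITS, h & ((1 << REST_BITS) - 1)
        regs[idx] = max(regs[idx], REST_BITS - rest.bit_length() + 1)
    return regs


def estimate(regs):
    """Cardinality estimate with small-range correction."""
    alpha = 0.7213 / (1 + 1.079 / M)
    raw = alpha * M * M / sum(2.0 ** -r for r in regs)
    zeros = regs.count(0)
    return M * math.log(M / zeros) if raw <= 2.5 * M and zeros else raw


def union(a, b):
    """Union of two HLL sets: register-wise maximum (lossless)."""
    return [max(x, y) for x, y in zip(a, b)]


def intersection_estimate(a, b):
    """Inclusion-exclusion estimate; its error grows as the overlap shrinks."""
    return estimate(a) + estimate(b) - estimate(union(a, b))


# Hypothetical post IDs: 3000 posts mention "fire", 3000 mention "forest",
# and 1000 of them mention both terms.
fire = make_hll(range(0, 3000))
forest = make_hll(range(2000, 5000))

print(round(estimate(union(fire, forest))))        # roughly 5000
print(round(intersection_estimate(fire, forest)))  # roughly 1000
```

The union is exact in the sense that it yields the same registers as an HLL set built from all items at once, whereas the intersection is only an estimate derived from three other estimates.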

A social media post as a data item can be broken down into its spatial, temporal, topical, and social components, each of which can be stored as separate HLL sets. As shown in Fig. 14.2, this can lead to a number of different HLL sets, each containing the post IDs of posts matching different criteria: involving a certain topic, originating in a certain area or in a certain time period, or authored by a user of a certain group.

Fig. 14.2 Examples of HLL sets derived from the four facets of a social media post. The illustration shows a social media post at the center, connected to thematic, social, spatial, and temporal HLL sets, each holding post IDs for different topics or categories.

Taking the topical facet in a disaster management scenario as an example, an intersection of a set containing posts with the term fire and one containing posts with the term forest could point more precisely to disaster incidents than either term on its own. It still makes sense to monitor the terms individually in the first place because a combination of fire and accident can point to other disaster incidents, as can forest and accident.

Furthermore, different terms could have the same meaning; for example, flood, high tide, wave, and tsunami could all refer to the same situation. A union of HLL sets on posts over these terms can therefore provide more comprehensive information about disasters. Likewise, terms in different languages could also be monitored in combination. This, for example, enables VOSTs (see Sect. 14.2.1 and Chap. 13) to monitor larger areas involving multiple languages, such as border triangles, or groups of smaller countries like Benelux or the Baltics.

This concept provides privacy by design because it does not store the post IDs in a readable way. It only stores a statistical derivative resulting from the characteristics of the HLL algorithm (see Sect. 14.2.4). The following subsections cover how it can be extended with further concepts to adjust the level of privacy protection.

3.2 Abstraction Layers

The concept of abstraction has been widely used in the geo-community to visualize spatial information at scale-dependent degrees of detail (Burghardt et al. 2016). We repurpose these generalization methods from geovisualization for privacy protection.

Herein, we present a model to improve privacy for social media users, in particular in the context of data collection. It aims at withdrawing precision from the data by deriving multiple abstraction layers of it. Applying these layers, we are able to quantitatively describe different levels of privacy. By deploying methods of generalization and thus decreasing precision of the data, we can increase privacy, and vice versa.

Figure 14.3 shows a visual representation of this model, following the four-facet representation to characterize a social media post, introduced by Dunkel et al. (2019). The bottom layer in each facet is formed by the original data. Each following layer represents an increase in privacy protection for the user. This way, we have the ability to adjust the level of detail of the data in a fine-grained and context-dependent way. Each of the layers is described in detail in the following subsections.

Fig. 14.3 Abstraction layers for each facet. Four pyramid diagrams show the layers from bottom to top. Spatial facet: latitude/longitude, place, city, region, country. Temporal facet: timestamp, day, month, year. Topical facet: terms, subject, domain. Social facet: user, follower, cluster, platform.

3.2.1 Spatial Facet

In the spatial facet, the original data is usually represented by a coordinate in latitude and longitude or a tiny area surrounding that coordinate. A first abstraction from it can be an arbitrarily named place that includes the coordinate, e.g., a market square or a park. The next abstraction could be an administrative or functional region or other territory enclosing the place, e.g., a city or metropolitan area, county, or state. Cities can also be regarded as intermediate layers below regions. The next abstraction layer could be a country or another, even more broadly defined region, e.g., a natural, political, administrative, or religious one.

When applying this model to the HLL-based storage concept presented in Sect. 14.3.1, it is crucial to note that even the lowest layer needs to be an area to be able to count multiple posts within it. If using a point coordinate instead, chances tend toward zero for multiple posts hitting that exact coordinate. An alternative approach with clustering techniques would be necessary there.

In an application implementing this model, the database index would be the geohash, place, city, or country, depending on the layer of abstraction. The corresponding HLL sets include the hashed post IDs of posts that originate in that respective area. An appropriate visualization of that data would be a map showing visually differentiable areas (see, e.g., Fig. 14.1).
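Moving to a coarser spatial layer can be sketched as merging all records whose index falls into the same larger area. In the sketch below, geohash prefixes stand in for the coarser areas and plain Python sets stand in for HLL sets (for which the merge would be an HLL union); the geohash values and post IDs are made up for illustration.

```python
from collections import defaultdict


def coarsen(index, prefix_len):
    """Merge records whose geohash shares the same prefix of `prefix_len` characters."""
    merged = defaultdict(set)
    for geohash, post_ids in index.items():
        merged[geohash[:prefix_len]] |= post_ids   # with HLL: a union of the two sets
    return dict(merged)


# Hypothetical fine-grained records: geohash -> set of post IDs
fine = {"u33d8": {1, 2}, "u33db": {3}, "u31f2": {4, 5, 6}}
coarse = coarsen(fine, 3)

for area, post_ids in sorted(coarse.items()):
    print(area, len(post_ids))
```

After coarsening, the single post in cell u33db is no longer distinguishable from the posts of its neighboring cell, which is the intended gain in privacy at the cost of spatial precision.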

3.2.2 Temporal Facet

Abstractions in the temporal facet are clearly defined by common time units. The basic layer is the timestamp of the publication of a post, abstracted as the day of the publication or a month or even a year.

In analogy to the spatial facet, when implementing this model, the database index must be a time range rather than a point in time, in order to associate multiple matching posts with it. To visualize just the temporal facet, a timeline is the preferred graphical representation. Usually, this facet also requires filtering for a certain topic first, to avoid merely visualizing all social media posts occurring within a certain time period.
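A time-range index for the temporal layers can be derived by truncating the timestamp, as in the following sketch. The layer names follow Fig. 14.3; the key formats, the function name, and the sample posts are assumptions, and plain sets again stand in for HLL sets.

```python
from collections import defaultdict
from datetime import datetime, timezone

# Key format per abstraction layer (assumed for this example)
LAYER_FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def temporal_key(timestamp, layer):
    """Truncate a timezone-aware timestamp to the chosen abstraction layer (in UTC)."""
    return timestamp.astimezone(timezone.utc).strftime(LAYER_FORMATS[layer])


# Hypothetical posts: (post ID, publication time)
posts = [
    (101, datetime(2022, 1, 15, 13, 45, tzinfo=timezone.utc)),
    (102, datetime(2022, 1, 15, 22, 10, tzinfo=timezone.utc)),
    (103, datetime(2022, 3, 2, 8, 0, tzinfo=timezone.utc)),
]

by_month = defaultdict(set)
for post_id, published in posts:
    by_month[temporal_key(published, "month")].add(post_id)  # with HLL: add to the set

for month, ids in sorted(by_month.items()):
    print(month, len(ids))
```

Only the truncated key is used as the index, so the exact publication time of a post never needs to be stored.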

3.2.3 Topical Facet

The topical facet is characterized by applying topic modeling techniques (Kherwa and Bansal 2019) to find abstractions of terms. The basic layer can be defined by the terms, i.e., the original content of the post, e.g., The River Elbe has burst its banks in Dresden today. A more general layer can be given by the overall subject of the post, e.g., Dresden flood. Another abstraction layer can be the domain, e.g., Natural Disaster.

In an implementation, the database index will represent these terms, subjects, or domains, and the corresponding HLL sets hold the associated post IDs. It is not trivial to generate more generic terms for specific posts, but topic modeling techniques can help with that task. A word cloud can visualize these terms in different sizes (Hearst et al. 2019) depending on the cardinality of posts associated with them.

3.2.4 Social Facet

The social facet relates to users of social media. When running analyses on this object, it is crucial to note that we are switching the focus. While we are trying to avoid storing personal data around an object in the other facets, here we want to achieve the opposite: counting appearances of data that relates to a single person or a group. An exemplary analysis would be to count the number of posts per user or group. In this scenario, it is especially useful to apply abstraction layers in order to gain privacy for a single user. In analogy to the temporal facet, it is also useful to filter posts for a certain topic beforehand.

In the basic layer, the creator of a social media post, the user, is targeted. The database index would be the username or ID, and the corresponding HLL set consists of the post IDs. Combining, e.g., all of the user account’s followers into a group and considering the posts of all of them could serve as the first abstracted layer. Through network analysis (Maireder et al. 2014), we can define clusters as objects in a second step of abstraction. Another layer of abstraction could be the consideration of different social network platforms (Cosenza 2022), e.g., Twitter, Instagram, WeChat, and VKontakte, which have distinct user bases that might originate from different cultural backgrounds.

Implementing groups of users is more challenging than in other layers. The group of followers in the second layer ultimately consists of a list of user IDs, which could again be stored in an HLL set that is assigned an ID of its own. IDs of multiple groups are then stored in HLL sets and can be combined or contrasted with other group IDs, defining clusters accordingly.

All the described layers are only examples and can be replaced by other structures. Also, the number of abstraction layers can be chosen arbitrarily, as the granularity of the data can change.

It should be noted that abstraction layers not only gain privacy for the social media users but also diminish the precision of the data. Applying abstraction to social media data is therefore a compromise between privacy and precision.

3.3 Filter Lists

Storing social media data using HLL for processing in analytics software forms the foundation of privacy protection. Applying generalization methods as described in Sect. 14.3.2 provides further opportunities to adjust data precision. However, there are edge cases that require special handling. For instance, even the existence of a single specific term, a specific time, location, etc. may provide hints that can be repurposed or combined with other (e.g., external) information to compromise user privacy in certain situations. Following the principle that different data must be treated differently (Almås et al. 2018), we seek to contribute to a systematic approach to fine-tuning privacy preservation and analytical flexibility.

There are two main approaches to adjusting privacy-utility trade-offs with HLL and abstraction layers. First, stop and allow lists can be used during the generation of the HLL set to enable context-dependent data protection through filtering. Second, threshold values can be defined flexibly to influence the granularity of the HLL set indexes and, based on that, the degree of anonymity. Table 14.2 lists examples for each context in the framework, where accuracy (utility) may be traded in favor of a higher degree of privacy, similar to the broader data sensitivity spectrum proposed by Rumbold and Pierscionek (2018).

Table 14.2 Example of sensitive context factors for which no data analysis might be carried out

Whether stop lists or allow lists are preferable depends on the context of application. Allow lists are more restrictive and require less effort from the analysts, by automatically excluding all terms, times, locations, etc. that are not explicitly considered beforehand. For the spatial context, for instance, unless worldwide data is required, allow lists are frequently used, to limit data collection to a specific area, region, place, etc. Conversely, stop lists can be added selectively on top, to exclude places that are known to be related to vulnerable groups or sensitive contexts (e.g., hospitals, party locations). Similarly, filter lists for specific terms, hashtags, or emoji can be defined for the topical context.
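The interplay of allow lists and stop lists on the incoming data stream can be sketched as a simple predicate that is evaluated before a post is added to any HLL set. The class name, its fields, and the example geohash prefixes are hypothetical; the chapter does not prescribe a concrete data model.

```python
from dataclasses import dataclass, field


@dataclass
class FilterLists:
    """Context-dependent filters applied before a post reaches any HLL set."""
    allow_areas: set = field(default_factory=set)  # geohash prefixes; empty = allow all
    stop_areas: set = field(default_factory=set)   # geohash prefixes to exclude
    stop_terms: set = field(default_factory=set)   # lowercase terms, hashtags, or emoji

    def accepts(self, geohash, terms):
        # Allow list: anything not explicitly listed is dropped.
        if self.allow_areas and not geohash.startswith(tuple(self.allow_areas)):
            return False
        # Stop list on top: selectively exclude sensitive places.
        if self.stop_areas and geohash.startswith(tuple(self.stop_areas)):
            return False
        # Topical stop list: drop posts containing a sensitive term.
        return not any(term.lower() in self.stop_terms for term in terms)


filters = FilterLists(
    allow_areas={"u33", "u31"},   # limit collection to a study region
    stop_areas={"u33d8"},         # e.g., a cell containing a hospital
    stop_terms={"💉"},            # thematically sensitive emoji
)

print(filters.accepts("u33db", ["flood"]))        # True
print(filters.accepts("u33d8x", ["flood"]))       # False: stop-listed area
print(filters.accepts("ezs42", ["flood"]))        # False: outside the allow list
print(filters.accepts("u31f2", ["park", "💉"]))   # False: stop-listed term
```

Filtering at ingestion means that excluded posts never contribute to any stored statistic, in contrast to filtering at query time.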

For topical contexts, the openness of possible references complicates defining holistic stop lists ahead of time. As an example, Fig. 14.4 shows a map generated from terms, hashtags, and emoji used on the social media services Twitter, Flickr, and Instagram at a public vantage point and park. The syringe emoji could indicate drug use, which may lead to further onsite investigation by, e.g., authorities, with potentially unexpected consequences from the user’s perspective. This is an edge case for social-individual privacy because both positive (society) and negative (user) consequences are imaginable. One solution would be to assign the specific emoji to a thematically broader emoji class, e.g., the umbrella group of “medical emoji” (see Sect. 14.3.2). As another solution, the syringe emoji could be classified ahead of time as more sensitive, leading to, e.g., a greater reduction of spatial granularity on data ingestion, or to its exclusion, so that this ambiguous ethical edge case does not have to be dealt with later.

Fig. 14.4 A thematically sensitive emoji on drug use at selected locations. The map shows emoji at selected locations, including syringe, star, caution, muscle, and flower, together with non-English place labels.

Lastly, as the second approach to enable systematic user privacy with HLL, threshold values may be defined, similar to what is known from other disciplines, such as the HIPAA Privacy Rules for health data publications (Malin et al. 2011) or census statistics (Szibalski 2007, p. 142). Allshouse et al. (2010), for instance, use geomasking in combination with k-anonymity to define a lower threshold of \(k=5\) (people), which is a rule-of-thumb size in geoprivacy (Kamp et al. 2013). Comparable best-practice threshold values could be defined for HLL sets of different sizes, e.g., following suggestions by Desfontaines et al. (2019), with smaller sets indicating lesser privacy protection due to a scarce context collapse. In the spatial context, this could be implemented by using quadtrees, for example, to split and aggregate social data into sub-sections (quads), based on pre-defined thresholds, where the resolution is automatically decreased for areas of lesser data density.
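One possible reading of the quadtree idea is sketched below: a cell is split into four quads only if every non-empty quad would still contain at least k items, so sparse areas stay at a coarse resolution. The splitting rule, the function name, and the sample points are assumptions; the sketch operates on raw points purely for illustration, whereas an HLL-based pipeline would compare estimated cardinalities against the threshold.

```python
def quadtree_cells(points, bounds, k=5, max_depth=6):
    """Return [(bounds, count)] leaf cells; split only while each non-empty quad keeps >= k points."""
    x0, y0, x1, y1 = bounds
    xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
    quads = {}
    for x, y in points:
        quad_bounds = (x0 if x < xm else xm, y0 if y < ym else ym,
                       xm if x < xm else x1, ym if y < ym else y1)
        quads.setdefault(quad_bounds, []).append((x, y))
    # Stop splitting at the depth limit or if any quad would fall below the threshold.
    if max_depth == 0 or any(len(pts) < k for pts in quads.values()):
        return [(bounds, len(points))]
    cells = []
    for quad_bounds, pts in quads.items():
        cells.extend(quadtree_cells(pts, quad_bounds, k, max_depth - 1))
    return cells


# Hypothetical data: a dense cluster and a sparse area
dense = [(1 + i * 0.01, 1 + i * 0.01) for i in range(20)]
sparse = [(12, 12), (13, 9), (9, 14), (10, 10), (15, 15), (11, 13)]
cells = quadtree_cells(dense + sparse, (0, 0, 16, 16), k=5)

for cell_bounds, count in cells:
    print(cell_bounds, count)
```

With this rule, every returned cell holds at least k points whenever the whole data set does; a root cell with fewer than k points would still be returned and would have to be suppressed by the caller.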

4 Case Study

Even though different implementations of HLL exist, all share a number of basic steps. At the core, the binary representation of any given character string is divided into buckets, for which the number of leading zeroes is counted (see Sect. 14.2.4). Because any given character string is first randomized, it is possible to predict how many distinct items must have been added to a given HLL set, based on the maximum number of leading zeroes observed. In other words, if multiple items are added to an HLL set, only the highest number of leading zeroes per bucket needs to be memorized. As a result, the cardinality estimation will only approximate counts.

As a side effect, there is a limited ability to check whether a specific user or ID has been added to an HLL set. In an adversarial situation, Desfontaines et al. (2019) refer to such a check as an intersection attack. Intersection attacks first require obtaining the hash of a targeted person or ID and then adding this hash to an HLL set. If the HLL set changes, an adversary may be able to increase their initial suspicion by a certain degree. To better illustrate intersection attacks and how and under which circumstances the privacy of a user could become compromised in the presented two-component research setup, we briefly introduce two examples.
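The membership check underlying such an attack can be sketched as follows. If adding an item would change a register, the item was certainly not in the set; if nothing would change, the item may have been in the set, or another item may simply have produced an equal or higher value in the same bucket (a false positive). The precision, hash function, names, and synthetic user IDs are assumptions for illustration.

```python
import hashlib

P = 10
M = 1 << P
REST_BITS = 64 - P


def bucket_and_rank(item):
    """Return the register index and leading-zero rank for an item."""
    h = int.from_bytes(hashlib.sha1(item.encode("utf-8")).digest()[:8], "big")
    rest = h & ((1 << REST_BITS) - 1)
    return h >> REST_BITS, REST_BITS - rest.bit_length() + 1


def make_hll(items):
    regs = [0] * M
    for item in items:
        idx, rank = bucket_and_rank(item)
        regs[idx] = max(regs[idx], rank)
    return regs


def would_change(regs, item):
    """True if adding `item` would alter the HLL set (item was definitely absent)."""
    idx, rank = bucket_and_rank(item)
    return rank > regs[idx]


def unchanged_rate(regs, probes):
    """Share of probes that leave the set unchanged (false positives for non-members)."""
    return sum(not would_change(regs, p) for p in probes) / len(probes)


small = make_hll(["user-alex"])                        # set with a single user
large = make_hll(f"user-{i}" for i in range(10_000))   # set with many users
outsiders = [f"stranger-{i}" for i in range(1_000)]    # none of them was ever added

print(would_change(small, "user-alex"))    # False: members never change the set
print(unchanged_rate(small, outsiders))    # low: the check is informative
print(unchanged_rate(large, outsiders))    # high: mostly false positives
```

The comparison of the two rates illustrates why HLL sets containing only one or a few users are the critical case, while large sets offer little evidence to an adversary.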

Alex is included in the YFCC100M dataset (Thomee et al. 2016) because he published 289 photos under Creative Commons Licenses between 2013 and 2014 on Flickr; 120 of these photos are geotagged. Given this information, it will be relatively easy to re-identify Alex. Sandy is an internal adversary. She could be someone working at an analytics service with full access to the database. Robert, on the other hand, is someone representing an external adversary, with access only to the published dataset. In the first example, the privacy of Alex is compromised if Sandy could increase or confirm her suspicion that Alex was not at his workplace in Berlin on 9 May 2012. In the second example, the privacy of Alex is compromised if Robert could increase or confirm his suspicion that Alex was indeed at least once at a specific location, e.g., contrary to what Alex claims. Finally, Alex could be someone who voluntarily contributed his pictures to the conceived analytics service or altruistically published Creative Commons photos on Flickr.

Consider that, at the moment of contribution, Alex may not have thought of the consequences for his privacy but later realized his mistake. With raw data, even if Alex removed any compromising data from Flickr, this change would also need to be reflected in every subsequent data collection, such as the analytics service or the YFCC100M dataset. This is either impractical or impossible. The question is, therefore, whether it is possible to replace raw data workflows with a privacy-aware visualization pipeline, without significantly reducing utility.

Several factors must coincide for intersection attacks to be successful. Firstly, an adversary must have access to HLL sets. In our system model, this can either be an internal adversary (Sandy), having direct access to the database, or an external adversary (Robert), having access only to published data. Furthermore, an adversary must be able to either compute hashes for a given target user or somehow gain access to a computed HLL set for the given user. The former is only possible if the secret key is compromised. The latter appears conceivable, in our example, if the adversary has some prior knowledge about other locations visited by a target user, and if the HLL sets of these locations ideally contain only the target user or a few other users. In the following, we explore this worst-case scenario, where both Sandy and Robert somehow got hold of an HLL set that only contains Alex’s computed hashes.
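To make the role of the secret key concrete, the following sketch shows one common way to derive keyed hashes. It assumes an HMAC-SHA256 construction and a simple `user_id:date` concatenation for user days. Neither is specified in this section, and the function names are hypothetical.

```python
import hashlib
import hmac


def keyed_hash(secret_key: bytes, value: str) -> str:
    """Keyed hash of a value; it cannot be recomputed without the key."""
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def user_day_hash(secret_key: bytes, user_id: str, date: str) -> str:
    """Hypothetical user-day identifier: user ID and ISO date, hashed together."""
    return keyed_hash(secret_key, f"{user_id}:{date}")
```

Under this assumption, an adversary who knows a user ID but not the key cannot compute the value that was added to the HLL set, which is why a compromised key is a precondition for the first attack path.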

For Sandy, this means that, in order to test whether Alex was not in Berlin on 9 May 2012, she needs either Alex’s original user ID and the secret key to construct the hash, or another location that has only been visited by Alex on this date. In this unlikely scenario, the result of an intersection attack for all grid cells is shown in Fig. 14.5. Visible in the figure is that a large number of other grid cells show false positives for the intersection test; that is, these HLL sets did not change, even when updated with the particular user day-hash for Alex.

Fig. 14.5
A world map highlights the distribution of query results on published data, additional query results with direct database access, and Alex on 9 May 2012. The majority of the query results on published data are in Canada and around Berlin.

Evaluation of scenario “Sandy” (Dunkel et al. 2020, CC-BY 4.0)

Since HLL prevents the occurrence of false negatives, and San Francisco is indeed among these locations, the result does include Alex’s actual location on 9 May 2012. Depending on the size of the targeted HLL set, Sandy may then increase her suspicion by some degree. In the case of the grid cell for San Francisco, with 209,581 user days, this increase in posterior knowledge may be found to be negligibly small. In other words, even if there was no post from Alex on 9 May 2012, the intersection attack may have produced the same result. In conclusion, even in the worst scenario, having direct access to the database and a compromised secret key, Sandy could not gain any further affirmation.

Similarly, and rather incidentally, the positive grid cell for Berlin falsely suggests that Alex was in Berlin. This is not surprising given that larger HLL sets have a higher likelihood of showing false positives and Berlin is a highly frequented location. In other words, Alex benefits from the privacy-preserving effect of HLL.
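The relationship between set size and false positives can be explored with a small simulation. The sketch below is not a reproduction of the case study: it assumes 2**10 buckets and 64-bit uniformly random hashes, and fills a set with random items before testing it against fresh items that were never added. The function name and parameters are hypothetical.

```python
import random


def false_positive_rate(set_size, p=10, trials=2000, seed=0):
    """Fraction of never-added items whose intersection test is positive."""
    rng = random.Random(seed)

    def bucket_and_rank():
        # A 64-bit random number stands in for an already randomized hash.
        x = rng.getrandbits(64)
        rest = x & ((1 << (64 - p)) - 1)
        return x >> (64 - p), (64 - p) - rest.bit_length() + 1

    # Fill the set, keeping only the maximum rank per bucket.
    registers = [0] * (1 << p)
    for _ in range(set_size):
        b, r = bucket_and_rank()
        registers[b] = max(registers[b], r)

    # Test fresh items: a positive result means the set would not change.
    hits = 0
    for _ in range(trials):
        b, r = bucket_and_rank()
        hits += registers[b] >= r
    return hits / trials
```

Comparing, for example, `false_positive_rate(100)` with `false_positive_rate(100000)` illustrates why a positive result for a highly frequented cell carries little information about any individual user.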

In the second scenario, consider a situation in which Robert may have an a priori suspicion that Alex went to Cabo Verde. Alex, on the other hand, does not want Robert to know that he went surfing without him. Robert knows that Alex is participating in the conceived analytics service and, somehow, gains access to an HLL set containing only one hashed user ID from Alex. The results of the intersection attack for all grid cells are shown in Fig. 14.6. Since only 56 users have been to Cabo Verde in the YFCC100M dataset, the particular bin is not included in the published benchmark data, which is limited by a minimum threshold of 100 users. However, with direct access to the database, Robert could observe that Cabo Verde is among the locations revealed. In this case, Robert may gain some affirmation for his suspicion that Alex was in Cabo Verde. At the same time, a definite answer will not be possible, given the irreversible approximation of the HLL structure. For example, for the same intersection attack, for set sizes below 56 users, there are 14 other grid cells that show false positives, down to 8 users. In other words, even though these HLL sets do not change when tested, Alex has never been to these locations.

Fig. 14.6
A world map highlights the distribution of query results on published data where the user count is greater than 100, additional query results with direct database access, and Alex with actual locations. The majority of the query results on published data are in Canada and around Berlin.

Evaluation of scenario “Robert” (Dunkel et al. 2020, CC-BY 4.0)
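The minimum user threshold applied to the published data in this scenario can be sketched as a simple filter over estimated user counts per grid cell. The cell names and the second count below are hypothetical. Whether the boundary value of 100 is itself published is also an assumption, since the text speaks of a minimum threshold of 100 users.

```python
MIN_USERS = 100  # minimum threshold named in the text


def publishable(bins: dict, min_users: int = MIN_USERS) -> dict:
    """Keep only grid cells whose estimated user count reaches the threshold.

    Assumption: the boundary is inclusive (count >= min_users).
    """
    return {cell: count for cell, count in bins.items() if count >= min_users}


# Hypothetical example: one sparse cell (56 users, as for Cabo Verde)
# and one densely visited cell with an invented count.
example_bins = {"sparse_cell": 56, "dense_cell": 1500}
published = publishable(example_bins)
```

In this example, the sparse cell is withheld from publication, so an external adversary would only see it with direct database access.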

While these two scenarios provide a base to understand how intersection attacks may be executed in a spatial setting, a valid question is how likely successful intersection attacks are overall. To some degree, this depends on questions of security, such as protecting the secret key or managing database access.

Another part is directly related to the distribution of collected data and the number of outliers that are present at each stage of data processing. If data is more clustered, users will generally receive more benefits from the privacy-preserving effects of HLL. This can be quantitatively substantiated with the given dataset (Dunkel et al. 2020).

5 Conclusion

The research presented in this chapter introduced a number of approaches to deal with privacy aspects in social media data processing. Social media data is being used as a source of data for wide-ranging projects within and beyond the scope of this book (see Sect. 14.2.1). The relevance of privacy aspects in processing this kind of data and the range of related work are pointed out in Sect. 14.2.2. Furthermore, in Sect. 14.2.3, we discussed our focus on data retention as a potential threat for analysts. We defined the term and explained that we make use of that specific term to emphasize the sensitivity of dealing with personal data.

We showed that it is possible to preserve the privacy of social media users with the following major concepts. As a basis for our first concept, we introduced the cardinality estimation algorithm HyperLogLog in Sect. 14.2.4. In Sect. 14.3.1, the main part of this chapter, we introduced a concept to store social media data in a way that it is not possible to extract individual items from it but only to estimate the cardinality of social media data items within a certain set, and to run set operations over multiple sets to extend analytical ranges. Applying this method requires defining the scope of the result before even gathering the data and thus prevents the data from being misused for other purposes at a later point in time. This follows the privacy-by-design principle.

As an extension to the first concept, we proceeded by introducing generalization, a concept that is well known in the geographic community, in Sect. 14.3.2. By defining a number of abstraction layers, it is possible to further reduce the data to be stored, depending on the required precision. The less precision is needed, the less data needs to be stored. Finally, in Sect. 14.3.3, we explained the conceptual exclusion of edge cases by applying filter lists to the data set.

A closing case study in Sect. 14.4 explains the concept of intersection attacks and shows that under rare circumstances the HyperLogLog technology is vulnerable to them. The case study reveals that the larger the dataset, the less likely intersection attacks are to succeed. Since social media datasets are usually very large, implementing the HyperLogLog technology is an excellent approach to protect the data from being abused, stolen, or publicly exposed, and thus to preserve the privacy of social media users.