Invisible market for online personal data: An examination


Despite widespread knowledge that corporations collect users' online personal data (OPD) and exchange it among themselves in a market for OPD, there have been few attempts to systematically understand the nature and structure of these markets or to answer basic questions about the behavior of the parties in them. This paper addresses these questions using records of data sharing behavior by 218 websites across eight economic sectors. Two datasets, collected 4 years apart, are analyzed using social network analysis (SNA). Findings indicate that linear preferential attachment is the most likely coordinating mechanism in the OPD market. Further, this market has a much higher number of brokers (intermediary corporations that facilitate exchange between other corporations) than comparable markets. Building on these findings, implications for research and practice are presented along with future research directions.



Appendix 1

Dataset 1: Spring 2016

Lightbeam, originally called Collusion, was created by Atul Varma, a software developer at Mozilla. The application visualizes the network of websites that collect data about a user's browsing behavior on each page the user visits online. In February 2012, Gary Kovacs, then CEO of Mozilla, spoke about Collusion in a TED talk, after which the plugin went viral. Starting in September 2012, Mozilla, together with faculty and student researchers at Emily Carr University of Art + Design, extended the plugin and relaunched it as Lightbeam in 2013. The application was supported by the Ford Foundation and the Natural Sciences and Engineering Research Council (NSERC). The full reference for this application and its source code can be accessed from:

The data retrieved from Lightbeam has the following layout:

[source, target, timestamp, contentType, cookie, sourceVisited, secure, sourcePathDepth, sourceQueryDepth, sourceSub, targetSub, method, status, cacheable]
For instance, ["", "", 1456366106722, "text\/html", true, false, true, 1, 0, "www.", "cm.g.", "GET", 204, true, false]
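As an illustration of how a record in this layout can be read, the following minimal Python sketch maps the positional values onto the field names listed above. It assumes that records are stored as JSON arrays and that the timestamp is in milliseconds since the Unix epoch. The hostnames `site.example` and `tracker.example` are hypothetical placeholders, not values from the dataset.

```python
import json
from datetime import datetime, timezone

# Field names in the order given in the layout above.
FIELDS = [
    "source", "target", "timestamp", "contentType", "cookie",
    "sourceVisited", "secure", "sourcePathDepth", "sourceQueryDepth",
    "sourceSub", "targetSub", "method", "status", "cacheable",
]


def parse_record(raw):
    """Turn one JSON-encoded positional record into a dict keyed by field name."""
    values = json.loads(raw)
    # zip() stops at the shorter sequence, so extra trailing values are ignored.
    record = dict(zip(FIELDS, values))
    # Assumption: the timestamp is milliseconds since the Unix epoch.
    record["time"] = datetime.fromtimestamp(
        record["timestamp"] / 1000, tz=timezone.utc
    )
    return record


# Hypothetical record; the source and target hostnames are placeholders.
raw = (
    '["site.example", "tracker.example", 1456366106722, "text\\/html", '
    'true, false, true, 1, 0, "www.", "cm.g.", "GET", 204, true]'
)
rec = parse_record(raw)
print(rec["source"], "->", rec["target"], rec["time"].date())
```

Under the millisecond assumption, the example timestamp falls in late February 2016, which is consistent with the Spring 2016 collection period.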

WHOIS Lookup is a query-and-response protocol used to query internet registry databases that store the registered users or assignees of an internet resource, such as a domain name, an IP address block, or an autonomous system.

Steps in creating the dataset:

  1. Selected 8 different economic sectors.

  2. Identified the top 20–25 ranked websites in each sector using Alexa.

  3. Visited only the homepage of each top-ranked website.

  4. Saved the data on websites that ‘talked’ to the visited page using Lightbeam.

  5. Retrieved the name of the corporation that owns each website in the dataset using a WHOIS lookup tool.
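The last two steps can be sketched as follows. This is a minimal illustration rather than the authors' actual procedure: it takes website-level records (source, target) and a WHOIS-derived owner mapping, and aggregates them into corporation-level ties. The function name `build_company_edges`, all hostnames and company names, and the choice to drop ties between sites with the same owner are assumptions made for this example.

```python
from collections import Counter


def build_company_edges(records, owner_of):
    """Aggregate website-level (source, target) records into company-level edges.

    records  : iterable of dicts with "source" and "target" hostnames
    owner_of : dict mapping hostname -> owning corporation (e.g. from WHOIS)
    Returns a Counter mapping (source_company, target_company) -> request count.
    """
    edges = Counter()
    for rec in records:
        src = owner_of.get(rec["source"])
        dst = owner_of.get(rec["target"])
        # Assumption: skip sites with no known owner and ties within one owner.
        if src is None or dst is None or src == dst:
            continue
        edges[(src, dst)] += 1
    return edges


# Hypothetical sample data (not from the study).
records = [
    {"source": "news.example", "target": "ads.example"},
    {"source": "news.example", "target": "cdn-news.example"},
    {"source": "shop.example", "target": "ads.example"},
    {"source": "news.example", "target": "ads.example"},
]
owner_of = {
    "news.example": "Publisher A",
    "cdn-news.example": "Publisher A",
    "ads.example": "Ad Broker B",
    "shop.example": "Retailer C",
}
edges = build_company_edges(records, owner_of)
for (src, dst), count in sorted(edges.items()):
    print(f"{src} -> {dst}: {count}")
```

The resulting edge list is the kind of input a social network analysis package would take to build a directed graph of corporations.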

Dataset 2: Spring 2020

OpenWPM is an automated web privacy measurement framework that makes it easy to collect data from thousands to millions of websites. It is built on top of Firefox and runs in a windowed or windowless state, automatically crawling the provided list of websites according to the supplied configuration. This tool is still in active development. The full reference for this application and its source code can be accessed from:

The data retrieved from OpenWPM is in the form of an SQLite database with several tables. More information about the HTTP requests table can be found at this link:

Steps in creating the dataset:

  1. Crawled the websites from dataset 1 using an OpenWPM script.

  2. Extracted the same information as used for dataset 1.

  3. Retrieved the name of the corporation that owns each website in the dataset using the WHOIS records collected for dataset 1.

  4. Updated the WHOIS records for those websites that were new in this dataset.
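The extraction in step 2 might look like the sketch below, which pulls (visited site, contacted site) pairs out of an SQLite crawl database. The table name `http_requests` and the columns `top_level_url` and `url` are assumptions about the crawl schema and should be checked against the schema documentation for the OpenWPM version used. The example builds a small in-memory table with hypothetical rows so that it runs on its own.

```python
import sqlite3
from urllib.parse import urlparse


def third_party_pairs(conn):
    """Return sorted (visited_host, contacted_host) pairs where the hosts differ.

    Assumes a table http_requests with columns top_level_url and url.
    """
    rows = conn.execute(
        "SELECT top_level_url, url FROM http_requests "
        "WHERE top_level_url IS NOT NULL"
    )
    pairs = set()
    for top_level_url, url in rows:
        visited = urlparse(top_level_url).hostname
        contacted = urlparse(url).hostname
        if visited and contacted and visited != contacted:
            pairs.add((visited, contacted))
    return sorted(pairs)


# Stand-in for a real crawl database; schema and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE http_requests (top_level_url TEXT, url TEXT)")
conn.executemany(
    "INSERT INTO http_requests VALUES (?, ?)",
    [
        ("https://news.example/", "https://news.example/style.css"),
        ("https://news.example/", "https://ads.example/pixel.gif"),
        ("https://shop.example/", "https://ads.example/pixel.gif"),
    ],
)
pairs = third_party_pairs(conn)
print(pairs)
```

This comparison uses full hostnames, so subdomains of the visited site would count as separate hosts. Grouping by registered domain or by WHOIS owner, as in steps 3 and 4, would be a further step.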

Complete List of Websites Crawled.

Social Media

Appendix 2

A t-test was performed to compare the incidence of different forms of brokerage between the observed network and the commensurate random networks. The full results of this test, by brokerage type and for each observed network, are shown below in Tables 6 and 7. Table 8 contains details of the companies with the most brokerage positions at both times.

Table 6 T-tests comparing observed networks to random networks (Time 1)
Table 7 T-tests comparing observed networks to random networks (Time 2)
Table 8 Companies and the number of Brokerage positions occupied (Time 1)
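To make the comparison concrete, the sketch below computes a one-sample t statistic that tests whether the brokerage count in an observed network differs from the mean count across a set of random networks. The appendix does not state the exact test variant, so the one-sample form is an assumption. The function name `one_sample_t` is hypothetical, and the numbers are illustrative only and are not taken from Tables 6 or 7.

```python
import math
import statistics


def one_sample_t(random_counts, observed):
    """t statistic for H0: the mean of random_counts equals the observed value.

    random_counts : brokerage counts from the simulated random networks
    observed      : brokerage count in the observed network
    Returns (t, degrees_of_freedom); t is positive when the observed count
    exceeds the random-network mean.
    """
    n = len(random_counts)
    mean = statistics.mean(random_counts)
    sd = statistics.stdev(random_counts)  # sample standard deviation (n - 1)
    t = (observed - mean) / (sd / math.sqrt(n))
    return t, n - 1


# Illustrative values only (not from the study).
random_counts = [10, 12, 11, 9, 13]
observed = 20
t, df = one_sample_t(random_counts, observed)
print(f"t = {t:.3f}, df = {df}")
```

A large positive t value would indicate that the observed network contains more brokers of that type than expected by chance. A p-value can then be read from a t distribution with the returned degrees of freedom.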

