bias goggles: Graph-Based Computation of the Bias of Web Domains Through the Eyes of Users

Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12035)

Abstract

Ethical issues, along with transparency, disinformation, and bias, are in the focus of our information society. In this work, we propose the bias goggles model for computing the bias characteristics of web domains with respect to user-defined concepts, based on the structure of the web graph. To support the model, we exploit well-known propagation models and the newly introduced Biased-PR PageRank algorithm, which models various behaviours of biased surfers. An implementation discussion, along with a preliminary evaluation over a subset of the Greek web graph, shows the applicability of the model even in real-time for small graphs, and showcases promising results. Finally, we pinpoint important directions for future work. A constantly evolving prototype of the bias goggles system is readily available.

Keywords

Bias; Web graph; Propagation models; Biased PageRank

1 Introduction

There is an increasing concern about the potential risks in the consumption of abundant biased information in online platforms like Web Search Engines (WSEs) and social networks. Terms like echo chambers and filter-bubbles [26] depict the isolation of groups of people and its aftereffects, that result from the selective and restrictive exposure to information. This restriction can be the result of helpful personalized algorithms, that suggest user connections or rank highly information relevant to the users’ profile. Yet, this isolation might inhibit the growth of informed and responsible humans/citizens/consumers, and can also be the result of malicious algorithms that promote and resurrect social, religious, ethnic, and other kinds of discriminations and stereotypes.

Currently, the community focuses on the transparency, fairness, and accountability of mostly machine learning algorithms for decision-making, classification, and recommendation in social platforms like Twitter. However, social platforms and WSEs mainly act as gateways to information published on the web as common web pages (e.g., blogs and news). Unfortunately, users are unaware of the bias characteristics of these pages, except for obvious cases (e.g., a page in a political party's web site will be biased towards this party).

In this work, we propose the bias goggles model, where users are able to explore the bias characteristics of web domains for a specific biased concept (i.e., a bias goggle). Since there is no objective definition of what bias and biased concepts are [27], we let users define them. For these concepts, the model computes the support and the bias score of a web domain, by considering the support of this domain for each aspect (i.e., dimension) of the biased concept. These support scores are calculated by graph-based algorithms that exploit the structure of the web graph and a set of user-defined seeds representing each aspect of bias. As a running example we will use the biased concept of Greek politics, which consists of nine aspects of bias, each one representing a popular Greek party and identified by a single seed: the domain of its homepage.

In a nutshell, the main contributions of this work are:
  • the bias goggles model for computing the bias characteristics of web domains for a user-defined concept, based on the notions of Biased Concepts (BCs), Aspects of Bias (ABs), and the metrics of the support of the domain for a specific AB and BC, and its bias score for this BC,

  • the introduction of the Support Flow Graph (SFG), along with graph-based algorithms for computing the AB support score of domains, that include adaptations of the Independence Cascade (IC) and Linear Threshold (LT) propagation models, and the new Biased-PageRank (Biased-PR) variation that models different behaviours of a biased surfer,

  • an initial discussion about performance and implementation issues,

  • some promising evaluation results that showcase the effectiveness and efficiency of the approach on a relatively small dataset of crawled pages, using the new AGBR and AGS metrics,

  • a publicly accessible prototype of bias goggles.

The rest of the paper is organized as follows: the background and the related work are discussed in Sect. 2, while the proposed model and its notions and metrics are described in Sect. 3. The graph-based algorithms for computing the support score of a domain for a specific AB are introduced in Sect. 4. The developed prototype and related performance issues are discussed in Sect. 5, while some preliminary evaluation results over a relatively small dataset of web pages are reported in Sect. 6. Finally, Sect. 7 concludes the paper and outlines future work.

2 Background and Related Work

Social platforms have been found to strengthen users' existing biases [21], since most users try to access information that they agree with [18]. This behaviour leads to rating bubbles when positive social influence accumulates [24] and minimizes the exposure to different opinions [31]. This is also evident in WSEs, where the personalization and filtering algorithms lead to echo chambers and filter bubbles that reinforce bias [4, 12]. Remarkably, users of search engines place more trust in the top-ranked search results [25], and biased search algorithms can shift the voting preferences of undecided voters by as much as 20% [8].

There is a growing number of discrimination reports regarding various protected attributes (e.g., race, gender, etc.) in various domains, like in ads [7, 29] and recommendation systems [13], leading to efforts for defining principles of accountable, auditing [28] and de-bias algorithms [1], along with fair classifiers [6, 14, 34]. Tools that remove discriminating information, flag fake news, make personalization algorithms more transparent, or show political biases in social networks also exist. Finally, a call for equal opportunities by design [16] has been raised regarding the risks of bias in the stages of the design, implementation, training and deployment of data-driven decision-making algorithms [3, 11, 20].

There are various efforts for measuring bias in online platforms [27]. Bias in WSEs has been measured as the deviation from the distribution of the results of a pool of search engines [23] and the coverage of search result pages (SRPs) towards US sites [30]. Furthermore, the presence of bias in media sources has been explored through human annotations [5], by exploiting affiliations [32], the impartiality of messages [33], the content and link-based tracking of topic bias [22], and the quantification of data and algorithmic bias [19]. However, this is the first work that provides a model that allows users to explore the available web sources based on their own definitions of biased concepts. The approach exploits the web graph structure and can annotate web sources with bias metrics on any online platform.

3 The bias goggles Model

Below we describe the notions of Biased Concepts (BCs) and Aspects of Bias (ABs), along with the support of a domain for an AB and BC, and its bias score for a BC. Table 1 describes the used notation.
Table 1. Description of the used notation. The first part describes the notation used for the Web Graph, while the second the notation for the proposed model.

Web graph:

| Symbol | Description |
|---|---|
| [symbol image] | the set of crawled Web pages |
| \(\mathtt{p}\) | a page in the set of crawled Web pages |
| \(\mathtt{dom(p)}\) | the normalized SLD of page \(\mathtt{p}\) |
| [symbol image] | the set of normalized SLDs of the crawled Web pages |
| \(\mathtt{dom}\) | an SLD in the set of normalized SLDs |
| \(\mathtt{link_{p, p'}}\) | a link from page \(\mathtt{p}\) to page \(\mathtt{p'}\) |
| \(\mathtt{link_{dom, dom'}}\) | a link from domain \(\mathtt{dom}\) to domain \(\mathtt{dom'}\) |
| [symbol image] | the set of crawled links between the crawled pages |
| [symbol image] | the set of crawled links between the crawled domains |
| \(\mathtt{inv(link_{p, p'})}\) | the inverse link of \(\mathtt{link_{p, p'}}\), i.e., \(\mathtt{link_{p', p}}\) |
| \(\mathtt{inv(link_{dom, dom'})}\) | the inverse link of \(\mathtt{link_{dom, dom'}}\), i.e., \(\mathtt{link_{dom', dom}}\) |
| [symbol image] | the set of inverse links between the crawled pages |
| [symbol image] | the set of inverse links between the crawled domains |
| [symbol image] | the graph with the crawled domains as nodes and the crawled domain links as edges |
| \(\mathtt{outInvLinks(dom)}\) | the set of outgoing inverse links of domain \(\mathtt{dom}\) |
| \(\mathtt{outInvLinks(dom, dom')}\) | the set of outgoing inverse links of domain \(\mathtt{dom}\) that point to domain \(\mathtt{dom'}\) |
| \(\mathtt{neigh(dom)}\) | the set of all [symbol image] |
| \(\mathtt{invNeigh(dom)}\) | the set of all [symbol image] |
| \(\mathtt{w_{dom, dom'}}\) | the weight of the \(\mathtt{link_{dom, dom'}}\) |
| [symbol image] | the Support Flow Graph: the weighted graph with the crawled domains as nodes and the inverse domain links as edges, where \(\mathtt{w_{dom, dom'}} = \frac{\lvert\mathtt{outInvLinks(dom, dom')}\rvert}{\lvert\mathtt{outInvLinks(dom)}\rvert}\) |

Proposed model:

| Symbol | Description |
|---|---|
| [symbol image] | a non-empty set of normalized domain urls (i.e., seeds) |
| [symbol image] | the signature of a set of seeds |
| [symbol image] | an Aspect of Bias (\(\mathtt{AB}\)), as identified by the signature of its set of seeds |
| [symbol image] | the set of seeds that define an \(\mathtt{AB}\) |
| [symbol image] | the universe of all available \(\mathtt{AB}\)s |
| [symbol image] | a non-empty set of \(\mathtt{AB}\)s |
| [symbol image] | a Biased Concept (\(\mathtt{BC}\)) as defined by a set of \(\mathtt{AB}\)s |
| [symbol image] | a vector with one dimension per \(\mathtt{AB}\), holding the \(\mathtt{AB}\)s of the \(\mathtt{BC}\) |
| [symbol image] | the \(\mathtt{AB}\) stored in dimension i of this vector |
| [symbol image] | support score of domain \(\mathtt{dom}\) regarding an \(\mathtt{AB}\) |
| [symbol image] | vector holding the support scores of domain \(\mathtt{dom}\) for each \(\mathtt{AB}\) of the \(\mathtt{BC}\) |
| [symbol image] | support score of dimension i of the support vector |
| [symbol image] | support score of domain \(\mathtt{dom}\) regarding a \(\mathtt{BC}\) |
| [symbol image] | bias score of \(\mathtt{dom}\) for a \(\mathtt{BC}\) |
| [symbol image] | a vector with one dimension per \(\mathtt{AB}\), with support 1 in all dimensions |

3.1 Biased Concepts (BCs) and Aspects of Bias (ABs)

The interaction with a user begins with the definition of a Biased Concept (BC), which is considered the goggles through which the user wants to explore the web domains. BCs are given by users and correspond to a concept that can range from a very abstract one (e.g., god) to a very specific one (e.g., political parties). For each BC, the users must identify at least two Aspects of Bias (ABs), representing its bias dimensions. ABs are also given by the users, and each corresponds to a non-empty set of seeds (i.e., domains) that the user considers to fully support this bias aspect. For example, consider the homepage of a Greek political party as an aspect of bias in the biased concept of politics in Greece. Notice that an AB can be part of more than one BC. An AB is identified by the signature of its non-empty set of seeds. The signature is the SHA1 hash of the lexicographic concatenation of the normalized Second Level Domains (SLDs) of the URLs in the set of seeds. We assume that all seeds of an AB are incomparable and support this AB with the same strength.
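The signature computation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the prototype's code: the function names `normalize_sld` and `ab_signature` are hypothetical, the normalization shown (lowercasing, dropping the scheme, the path, and a leading `www.`) is a simplified stand-in for the actual SLD normalization, and the sorted SLDs are assumed to be concatenated without a separator.

```python
import hashlib


def normalize_sld(url: str) -> str:
    """Hypothetical, simplified normalization (an assumption, not the paper's rule)."""
    host = url.strip().lower()
    if "://" in host:
        host = host.split("://", 1)[1]   # drop the scheme
    host = host.split("/", 1)[0]         # drop the path
    if host.startswith("www."):
        host = host[4:]                  # drop a leading "www."
    return host


def ab_signature(seeds) -> str:
    """SHA1 hash of the lexicographically sorted, concatenated normalized seeds."""
    slds = sorted({normalize_sld(s) for s in seeds})
    if not slds:
        raise ValueError("an AB needs a non-empty set of seeds")
    # Assumption: plain concatenation, no separator between SLDs.
    return hashlib.sha1("".join(slds).encode("utf-8")).hexdigest()


# A singleton AB from the running example.
print(ab_signature(["kke.gr"]))
```

Because the seeds are sorted before hashing, the same set of seeds always yields the same signature regardless of the order in which a user lists them.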

Assumption 1

Incomparable Seeds Support. The domains in the set of seeds of an AB are incomparable and equally supportive of that AB.

A user-defined BC is built from a set of at least two ABs, drawn from the universe of all possible ABs over the domains of the crawled pages. The BC is represented by a pair: a vector with one dimension per AB, holding all ABs of this BC in lexicographic order, and a user-defined textual description of the BC. In this work, we assume that all ABs of any BC are orthogonal and unrelated.

Assumption 2

Orthogonality of Aspects of Bias. All ABs in a user-defined BC are considered orthogonal.

Using this notation, our running example is the BC whose vector holds, in lexicographic order, the SHA1 signatures of the nine singleton-seed ABs of Greek political parties, namely \(\mathtt{\{``anexartitoiellines.gr"\}}\), \(\mathtt{\{``antidiaploki.gr"\}}\), \(\mathtt{\{``elliniki-lisi.gr"\}}\), \(\mathtt{\{``kke.gr"\}}\), \(\mathtt{\{``mera25.gr"\}}\), \(\mathtt{\{``nd.gr"\}}\), \(\mathtt{\{``syriza.gr"\}}\), \(\mathtt{\{``topotami.gr"\}}\), and \(\mathtt{\{``xryshaygh.com"\}}\), and whose description is \(\mathtt{``politics\ in\ Greece"}\).

3.2 Aspects of Bias Support and Biased Concepts Support

A core metric in the proposed model is the support score of a domain \(\mathtt{dom}\) for an aspect of bias. The support score ranges in [0, 1], where 0 denotes an unsupportive domain for the corresponding AB, and 1 a fully supportive one. We can identify three approaches for computing this support for a dataset of web pages: (a) the graph-based ones that exploit the web graph structure and the relationship of a domain with the seed domains of the AB, (b) the content-based ones that consider the textual information of the respective web pages, and (c) the hybrid ones that take advantage of both the graph and the content information. In this work, we focus only on graph-based approaches and study two frequently used propagation models, the Independence Cascade (IC) and Linear Threshold (LT) models, along with the newly introduced Biased-PageRank (Biased-PR), which models various behaviours of biased surfers. The details about these algorithms are given in Sect. 4.

In the same spirit, we are interested in the support of a specific domain \(\mathtt{dom}\) for a biased concept. The basic intuition is that we need a metric that shows the relatedness and support to all or any of the aspects of the BC, which can be interpreted as the relevance of this domain to any of the aspects of the biased concept. A straightforward way to measure it is the norm of the vector that holds the support scores of \(\mathtt{dom}\) for each AB of the BC, normalized by the norm of the vector that has support 1 in all dimensions. The latter vector holds the support scores of a 'virtual' domain that fully supports all bias aspects of the BC. The resulting value ranges in [0, 1]. With this formula, two domains might have similar support scores for a specific BC, while their support scores for the respective aspects differ greatly. For example, consider two domains \(\mathtt{dom}\) and \(\mathtt{dom'}\), with \(\mathtt{dom}\) fully supporting only one aspect of the BC and \(\mathtt{dom'}\) fully supporting only a different aspect. Then both domains have the same support score for the BC. Below we introduce the bias score of a domain regarding a specific BC, as a way to capture the leaning of a domain to specific ABs of a BC.
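The support of a domain for a BC, as described in the paragraph above, can be written out explicitly. The symbols here are assumptions introduced only for illustration: \(\mathbf{s}\) is the vector of support scores of \(\mathtt{dom}\) for the \(k\) ABs of the BC, \(\mathbf{1}_k\) is the \(k\)-dimensional vector with support 1 in all dimensions, and the norm is assumed to be the Euclidean one.

```latex
\[
\mathrm{sup}_{BC}(\mathtt{dom})
  \;=\; \frac{\lVert \mathbf{s} \rVert_2}{\lVert \mathbf{1}_k \rVert_2}
  \;=\; \frac{\sqrt{\sum_{i=1}^{k} s_i^{2}}}{\sqrt{k}}
\]
% Worked example for the running example with k = 9 aspects:
% a domain that fully supports exactly one aspect has s = (1, 0, ..., 0), so
\[
\mathrm{sup}_{BC}(\mathtt{dom}) \;=\; \frac{\sqrt{1}}{\sqrt{9}} \;=\; \frac{1}{3}
\]
```

Under these assumptions, any domain that fully supports a single aspect gets the value 1/3 regardless of which aspect it is, which is exactly why the support score alone cannot reveal the leaning of a domain.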

3.3 Bias Score of Domain Regarding a Biased Concept

The bias score of a domain regarding a BC tries to capture how biased the domain is over any of its ABs, and results from the support scores that the domain has for each aspect of the BC. For example, consider a domain \(\mathtt{dom}\) that has a rather high support for a specific AB, but rather weak ones for the rest ABs of a specific BC. This domain is expected to have a high bias score. On the other hand, the domain \(\mathtt{dom'}\) that has similar support for all the available ABs of a BC can be considered to be unbiased regarding this specific BC.

We define the bias score of a domain \(\mathtt{dom}\) for a BC as the distance of the vector of its support scores from the vector with support 1 in all dimensions, multiplied by the support of the domain for the BC. The bias score takes values in [0, 1]. The distance metric is defined using the cosine similarity.
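As a rough illustration of the definition above, the following sketch computes the BC support and the bias score from a list of per-aspect support scores. It is a sketch under assumptions: the function names are hypothetical, the norm is taken to be Euclidean, and the distance is assumed to be one minus the cosine similarity, which is one common way to turn the cosine similarity into a distance in [0, 1] for non-negative vectors.

```python
import math


def bc_support(s):
    """Norm of the support vector divided by the norm of the all-ones vector."""
    return math.sqrt(sum(x * x for x in s)) / math.sqrt(len(s))


def bias_score(s):
    """Assumed form: (1 - cosine similarity to the all-ones vector) * BC support."""
    norm_s = math.sqrt(sum(x * x for x in s))
    if norm_s == 0.0:
        return 0.0  # no support for any aspect, hence no measurable bias
    cos_sim = sum(s) / (norm_s * math.sqrt(len(s)))
    distance = max(0.0, 1.0 - cos_sim)  # clamp tiny negative rounding errors
    return distance * bc_support(s)


one_sided = [1.0] + [0.0] * 8   # fully supports one of nine aspects
balanced = [1.0] * 9            # fully supports all nine aspects
print(bias_score(one_sided), bias_score(balanced))
```

With these assumptions the one-sided domain gets a bias score of 2/9, while the balanced domain gets 0, matching the intuition that similar support for all aspects means no bias.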

4 Graph-Based Computation of Aspects of Bias Support

In this section, we discuss the graph-based algorithms that we use for computing the support score of a domain regarding a specific AB. We focus on the popular Independence Cascade (IC) and Linear Threshold (LT) propagation models, along with the newly introduced Biased-PageRank (Biased-PR) algorithm.

Consider the set of crawled web pages, the set of normalized SLDs of these pages, the set of crawled links between these domains, and the corresponding graph with the domains as nodes and the domain links as edges. With \(\mathtt{link_{dom, dom'}}\) we denote a link from domain \(\mathtt{dom}\) to domain \(\mathtt{dom'}\), while \(\mathtt{inv(link_{dom, dom'})}\) inverses the direction of a link; the inverses of all crawled domain links form the set of inverse links. Furthermore, for the links we assume that:

Assumption 3

Equally Supportive Links. Any link \(\mathtt{link_{dom, dom'}}\) from the domain \(\mathtt{dom}\) to the domain \(\mathtt{dom'}\) in the set of crawled domains is considered to be of supportive nature (i.e., \(\mathtt{dom}\) has the same support stance as \(\mathtt{dom'}\) for any AB). All links in a domain are equally supportive and independent of the importance of the page they appear in.

Although the above assumption might not be precise, since links from a web page to another are not always of supportive nature (e.g., a web page criticizing another linked one), or of the same importance (e.g., links in the homepage versus links deeply nested in a site), it suffices for the purposes of this first study of the model. Identification of the nature of links and the importance of the pages they appear in is left as future work. Given that the assumption holds, part or whole of the support of \(\mathtt{dom'}\) regarding any AB can flow to \(\mathtt{dom}\) through \(\mathtt{inv(link_{dom, dom'})}\). Specifically, we define the Support Flow Graph as:

Support Flow Graph (SFG) Definition

The \(\mathtt{SFG}\) of a set of web pages is the weighted graph that is created by inversing the links of the domain graph (i.e., the graph with the crawled domains as nodes and the inverse domain links as edges). The weight of each edge is \(\mathtt{w_{dom, dom'}} = \frac{|\mathtt{outInvLinks(dom, dom')}|}{|\mathtt{outInvLinks(dom)}|}\) (i.e., the number of outgoing inverse links of pages in the domain \(\mathtt{dom}\) that link to pages in the domain \(\mathtt{dom'}\), divided by the total number of outgoing inverse links of pages in the domain \(\mathtt{dom}\)), and takes a value in [0, 1].
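The construction of the SFG can be sketched as follows. This is an illustrative sketch, not the prototype's implementation: the function name `build_sfg` and the input format (one `(source domain, target domain)` pair per crawled page-level link) are assumptions, and links within the same domain are assumed to be ignored, which the definition above does not state.

```python
from collections import Counter, defaultdict


def build_sfg(page_links):
    """Build the weighted Support Flow Graph from page-level links.

    page_links: iterable of (src_domain, dst_domain), one pair per crawled link.
    Returns {dom: {dom2: weight}}, where an edge dom -> dom2 is the inverse of
    a crawled link dom2 -> dom.
    """
    inv_counts = defaultdict(Counter)
    for src, dst in page_links:
        if src == dst:
            continue  # assumption: intra-domain links carry no support flow
        inv_counts[dst][src] += 1  # inverse link: dst -> src

    sfg = {}
    for dom, targets in inv_counts.items():
        total = sum(targets.values())  # all outgoing inverse links of dom
        sfg[dom] = {t: c / total for t, c in targets.items()}
    return sfg


links = [
    ("blog.gr", "kke.gr"), ("blog.gr", "kke.gr"),
    ("news.gr", "kke.gr"), ("news.gr", "nd.gr"),
]
print(build_sfg(links))
```

In this toy input two thirds of the inverse links leaving `kke.gr` reach `blog.gr`, so most of the support of `kke.gr` can flow to that domain. The outgoing weights of every node sum to 1.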

So, given an SFG and the seeds of an AB, we can now describe how the support flows through the nodes of the SFG. All algorithms described below return a map M holding the support score of each domain for this AB.

4.1 Independence Cascade (IC) Model

The IC propagation model was introduced by Kempe et al. [17], and a number of variations have been proposed in the literature. Below, we describe the basic form of the model as adapted to our needs. In the IC propagation model, we run n experiments. Each run starts with a set of activated nodes, in our case the seeds of the AB, which fully support the AB. In each iteration, every edge from an activated node carries a history-independent and non-symmetric probability of activating the corresponding neighbor, so that support flows to the neighbors of the activated nodes in the SFG. This probability is represented by the weights of the links of an activated node to its neighbors, and each node, once activated, can then activate its neighbors. The nodes and their neighbors are selected in arbitrary order. Each experiment stops when there are no new activated nodes. After n runs we compute the average support score of each node over the runs. The algorithm is given in Algorithm 1.
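The following is a minimal sketch of the IC adaptation described above, not a reproduction of Algorithm 1. It assumes the SFG is given as a nested dictionary `{dom: {neighbor: weight}}`, that an activated node counts as fully supportive (score 1) in a run, and that the average support of a node is therefore the fraction of runs in which it was activated. The function name and the `rng` parameter are hypothetical.

```python
import random


def independence_cascade(sfg, seeds, n=100, rng=None):
    """Return a map M: domain -> average support over n cascade experiments."""
    rng = rng or random.Random(0)
    counts = {}
    for _ in range(n):
        active = set(seeds)
        frontier = list(seeds)
        while frontier:  # the run stops when no new node gets activated
            node = frontier.pop()
            for neigh, weight in sfg.get(node, {}).items():
                # Each edge of a newly activated node is tried exactly once.
                if neigh not in active and rng.random() < weight:
                    active.add(neigh)
                    frontier.append(neigh)
        for node in active:
            counts[node] = counts.get(node, 0) + 1
    return {node: c / n for node, c in counts.items()}


sfg = {"kke.gr": {"blog.gr": 0.7, "news.gr": 0.3}, "blog.gr": {"forum.gr": 0.5}}
M = independence_cascade(sfg, ["kke.gr"], n=1000)
print(sorted(M.items()))
```

Domains that are never activated are simply absent from the returned map, which can be read as a support score of 0.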

4.2 Linear Threshold (LT) Model

The LT model is another widely used propagation model. The basic difference from the \(\mathtt{IC}\) model is that for a node to become active we have to consider the support of all its neighbors, which must be greater than a threshold \(\theta \in [0,1]\), serving as the resistance of a node to its neighbors' joint support. Again, we use the support probabilities represented by the weights of the \(\mathtt{SFG}\) links. The full algorithm, which is based on the static model introduced by Goyal et al. [10], is given in Algorithm 2. In each experiment the thresholds \(\theta\) get a random value.
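A generic LT experiment loop over the SFG could look like the sketch below. It is not Algorithm 2 and does not reproduce the static model of Goyal et al.; it only illustrates the thresholding idea. The assumptions are that the joint support of a node is the sum of the weights of SFG edges arriving from already active nodes, that thresholds are redrawn uniformly in every experiment, and that the function and parameter names are hypothetical.

```python
import random


def linear_threshold(sfg, seeds, n=100, rng=None):
    """Return a map M: domain -> fraction of n experiments in which it was active."""
    rng = rng or random.Random(0)
    nodes = set(sfg) | {t for ts in sfg.values() for t in ts} | set(seeds)
    # incoming[y][x] holds the weight of the SFG edge x -> y.
    incoming = {v: {} for v in nodes}
    for x, targets in sfg.items():
        for y, w in targets.items():
            incoming[y][x] = w

    counts = dict.fromkeys(nodes, 0)
    for _ in range(n):
        theta = {v: rng.random() for v in nodes}  # random resistance per node
        active = set(seeds)
        changed = True
        while changed:
            changed = False
            for v in nodes - active:
                joint = sum(w for x, w in incoming[v].items() if x in active)
                if joint > theta[v]:  # joint support must exceed the threshold
                    active.add(v)
                    changed = True
        for v in active:
            counts[v] += 1
    return {v: c / n for v, c in counts.items()}


sfg = {"kke.gr": {"blog.gr": 0.7, "news.gr": 0.3}, "blog.gr": {"news.gr": 0.5}}
print(linear_threshold(sfg, ["kke.gr"], n=1000))
```

Compared with the IC sketch, every pass re-examines all inactive nodes and sums over their active in-neighbors, which hints at why LT is the more expensive of the two models.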

4.3 Biased-PageRank (Biased-PR) Model

We introduce the Biased-PR variation of PageRank [9] that models a biased surfer. The biased surfer always starts from the biased domains (i.e., the seeds of an AB), and either visits a domain linked by the selected seeds or one of the biased domains again, with some probability that depends on the modeled behaviour. The same process is followed in the next iterations. Biased-PR differs from the original PageRank in two ways. The first one is how the score (support in our case) of the seeds is computed at any step. The support of all domains is initially 0, except for the support of the seeds, which have the value \(\mathtt{init_{seeds} = 1}\). At any step, the support of each seed is the original PageRank value, increased by a number that depends on the behaviour of the biased surfer. We have considered three behaviours: (a) the Strongly Supportive (SS) one, where the support is increased by \(\mathtt{init_{seeds}}\) and models a constantly strongly biased surfer, (b) the Decreasingly Supportive (DS) one, where the support is increased by \(\mathtt{init_{seeds} / iter}\), modeling a surfer that becomes less biased the more pages he/she visits, and (c) the Non-Supportive (NS) one, with no increment, modeling a surfer that is biased only on the initially visited pages, with the support score computed afterwards as in the original PageRank. The second difference is how the biased surfer is teleported to another domain when he/she reaches a sink (i.e., a domain that has no outgoing links). The surfer randomly teleports, with the same probability for every distance from the seeds, to a domain at that distance. If a path from a node to any of the seeds does not exist, the distance of the node is the maximum distance of a connected node increased by one. Since the number of nodes at a certain distance from the seeds increases as we move away from the seeds, the teleporting probability for a node is greater the closer the node is to the seeds. We expect slower convergence for Biased-PR than for the original PageRank, due to the initial zero scores of non-seed nodes. The algorithm is given in Algorithm 3.
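Two ingredients of Biased-PR described above lend themselves to a short sketch: the per-iteration increment of the seed support for the three behaviours, and the distance-based teleport distribution. This is not Algorithm 3. The function names are hypothetical, distances are assumed to be hop counts computed by a breadth-first search over the SFG starting at the seeds, the seeds themselves are assumed to form distance level 0 and to be valid teleport targets, and iterations are assumed to be numbered from 1.

```python
from collections import deque


def seed_increment(behaviour, iteration, init_seeds=1.0):
    """Extra support added to each seed at a given iteration (1-based)."""
    if behaviour == "SS":   # Strongly Supportive
        return init_seeds
    if behaviour == "DS":   # Decreasingly Supportive
        return init_seeds / iteration
    if behaviour == "NS":   # Non-Supportive
        return 0.0
    raise ValueError(f"unknown behaviour: {behaviour}")


def teleport_distribution(sfg, seeds):
    """Pick a distance level uniformly, then a node uniformly within that level."""
    nodes = set(sfg) | {t for ts in sfg.values() for t in ts} | set(seeds)
    dist = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:  # breadth-first search from the seeds
        u = queue.popleft()
        for v in sfg.get(u, {}):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    unreachable = max(dist.values()) + 1  # maximum connected distance plus one
    for v in nodes:
        dist.setdefault(v, unreachable)

    levels = {}
    for v, d in dist.items():
        levels.setdefault(d, []).append(v)
    p_level = 1.0 / len(levels)
    return {v: p_level / len(members) for members in levels.values() for v in members}


sfg = {"s": {"a": 1.0}, "a": {"b": 0.5, "c": 0.5}, "x": {"y": 1.0}}
print(teleport_distribution(sfg, ["s"]))
```

In the toy graph the single node at distance 1 receives twice the teleport probability of each of the two nodes at distance 2, illustrating why nodes closer to the seeds are favoured.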

5 Performance and Implementation Discussion

Due to size restrictions we provide a rather limited discussion about the complexities and the cost of tuning the parameters of each algorithm. The huge scale of the web graph has the biggest performance implication for the graph-based computation of the ABs support. What is encouraging, though, is that the algorithms are applied over the compact SFG, which contains only the SLDs of the pages and their corresponding links. The complexity of IC is in [expression missing in the source], where n is the number of experiments. LT is much slower, since we have to additionally consider the joint support of the neighbors of a node. Finally, Biased-PR converges more slowly than the original PageRank, since the algorithm begins only with the seeds, spreading the support to the remaining nodes. We must also consider the added cost of computing the shortest paths of the nodes from the seeds. For the relatively small SFG used in our study (see Sect. 6), the SS variation converges much faster than the DS and NS variations, which need ten times more iterations.

For newly introduced ABs, though, the computation of the support scores of the domains can be considered an offline process. Users can submit ABs and BCs into the bias goggles system and get notified when they are ready for use. However, what is important is to let users explore in real-time the domains space for any precomputed and commonly used BCs. This can be supported by providing efficient ways to store and retrieve the signatures of already known BCs, along with the computed support scores of domains for the available ABs. Inverted files and trie-based data structures (e.g., the space efficient burst-tries [15] and the cache-conscious hybrid or pure HAT-tries [2]) over the SLDs and the signatures of the ABs and BCs can allow the fast retrieval of offsets in files where the support scores and the related metadata are stored. Given the above, the computation of the bias score and the support of a BC for a domain is very fast. We have implemented a prototype that allows the exploration of predefined BCs over a set of mainly Greek domains. The prototype offers a REST API for retrieving the bias scores of the domains, and exploits the open-source project crawler4j. We plan to improve the prototype by allowing users to search and ingest BCs, ABs and domains of interest, and to develop a user-friendly browser plugin on top of it.

6 Experimental Evaluation Discussion

Evaluating such a system is a rather difficult task, since there are no formal definitions of what bias in the web is, and there are no available datasets for evaluation. As a result, we based our evaluation on BCs for which it is easy to find biased sites. We used two BCs for our experiments, Greek politics (BC1) with 9 ABs, and Greek football (BC2) with 6 ABs. For these BCs, we gathered well known domains, generally considered as fully supportive of only one of the ABs, without, however, inspecting their link coverage to the respective seeds, to avoid any bias towards our graph-based approach. Furthermore, we did not include the original seeds in this collection. In total, we collected 50 domains for BC1 and 65 domains for BC2, including newspapers, radio and television channels, blogs, pages of politicians, etc. This collection of domains is our gold standard.
Table 2. Experimental results over two BCs.

We crawled a subset of the Greek web by running four instances of the crawler: one with 383 sites related to Greek political life, one with 89 sport related Greek sites, one with the top-300 popular Greek sites according to Alexa, and a final one containing 127 seeds related to big Greek industries. We black-listed popular sites like Facebook and Twitter to control the size of our data and avoid crawling non-Greek domains. The crawlers were restricted to depth seven for each domain, and were free to follow any link to external domains. In total we downloaded 893,095 pages including 531,296,739 links, which led to the non-connected SFG with 90,419 domains, 288,740 links (on average 3.1 links per domain) and a diameter \(k=7,944\). More data about the crawled pages, the gold standard, and the SFG itself are available in the prototype's site.

Below we report the results of our experiments over an i7-5820K 3.3GHz system, with 6 cores, 15MB cache, 16GB RAM, and a 6TB disk. For each of the two BCs and for each algorithm, we ran experiments for various iterations n and Biased-PR variations, for the singleton ABs of the 9 political parties and 6 sports teams. For Biased-PR we evaluate all possible behaviours of the surfer using the parameters \(\theta _{conv}=0.001\) and \(d=0.85\). We also provide the average number of iterations for convergence over all ABs for Biased-PR. We report the run times in seconds, along with the metrics Average Golden Bias Ratio (AGBR) and Average Golden Similarity (AGS), which we introduce in this work. The AGBR is the average bias score of the golden domains, as computed by the algorithms for a specific BC, divided by the average bias score of all domains for this BC. The higher the value, the more easily we can discriminate the golden domains from the rest. On the other hand, the AGS is the average similarity of the golden domains to their corresponding ABs. The higher the similarity value, the more biased the golden domains are found to be by our algorithms towards their aspects. A high similarity score, though, does not imply high support for the golden domains or high dissimilarity for the rest. The perfect algorithm will have high values for all metrics. The results are shown in Table 2.
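The two metrics can be illustrated with a small sketch. The AGBR follows directly from the definition above. For the AGS the text does not spell out the similarity measure, so the sketch assumes the cosine similarity between the support vector of a golden domain and the one-hot vector of its expected AB; this choice, the function names, and the input formats are all assumptions made for illustration.

```python
import math


def agbr(bias, golden):
    """Average bias of the golden domains divided by the average bias of all domains.

    bias: {domain: bias score for the BC}; golden: iterable of golden domains.
    """
    golden = list(golden)
    avg_golden = sum(bias[d] for d in golden) / len(golden)
    avg_all = sum(bias.values()) / len(bias)
    return avg_golden / avg_all


def ags(support_vectors, golden_aspect):
    """Average similarity of golden domains to their expected AB.

    support_vectors: {domain: [support per AB]};
    golden_aspect: {golden domain: index of its expected AB}.
    Assumption: similarity is the cosine to the one-hot vector of that AB.
    """
    sims = []
    for dom, i in golden_aspect.items():
        s = support_vectors[dom]
        norm = math.sqrt(sum(x * x for x in s))
        sims.append(s[i] / norm if norm > 0 else 0.0)
    return sum(sims) / len(sims)


bias = {"g1": 0.4, "g2": 0.2, "o1": 0.1, "o2": 0.1}
print(agbr(bias, ["g1", "g2"]))
print(ags({"g1": [1.0, 0.0], "g2": [3.0, 4.0]}, {"g1": 0, "g2": 1}))
```

An AGBR above 1 means that the golden domains are, on average, scored as more biased than a typical domain, which is the separation the metric is meant to capture.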

The difference between the BC1 and BC2 results implies a less connected graph for BC2 (higher AGBR values for BC2), where the support flows to fewer domains, but with a greater interaction between domains supporting different aspects (smaller AGS values). What is remarkable is the striking time performance of IC, suggesting that it can be used in real-time and with excellent results (at least for AGBR). On the other hand, LT is a poor choice, being the slowest of all and dominated in every respect by IC. Regarding Biased-PR, only the SS variation offers exceptional performance, especially for AGS. The DS and NS variations are more expensive and have the worst results regarding AGBR, especially the NS variation, which avoids bias. In most cases, algorithms benefit from more iterations. The SS variation of Biased-PR needs only 40 iterations for BC1 and 31 for BC2 to converge, which indicates that fewer nodes are affected by the seeds in BC2. Generally, IC and the SS variation of Biased-PR are the best options, with IC allowing the real-time ingestion of ABs. However, we need to evaluate the algorithms on larger graphs and for more BCs.

We also manually inspected the top domains according to the bias and support scores for each algorithm and each BC. Generally the support scores of the domains were rather low, showcasing the value of other support cues, like the content and the importance of the pages that links appear in. In the case of BC1, apart from the political parties, we found various blogs, politicians' homepages, news sites, and also the national Greek TV channel, being biased towards a specific political party. In the case of BC2 we found the sport teams, sport related blogs, news sites, and a political party being highly biased towards a specific team, which is an interesting observation. In both cases we also found various domains with high support to all ABs, suggesting that these domains are good unbiased candidates. Currently, the bias goggles system is not able to pinpoint false positives (i.e., pages with non supportive links) and false negatives (i.e., pages with content that supports a seed without linking to it), since there is no content analysis. We are certain that such results can exist, although we were not able to find such an example in the top results of our study. Furthermore, we are not able to distinguish links that can frequently appear in users' content, like in the signatures of forum members.

7 Conclusion and Future Work

In this work, we introduce the bias goggles model, which facilitates the important task of exploring the bias characteristics of web domains with respect to user-defined biased concepts. We focus only on graph-based approaches, using popular propagation models and the new Biased-PR PageRank variation that models the behaviours of biased surfers. We propose ways for the fast retrieval and ingestion of aspects of bias, and offer access to a developed prototype. The results show the efficiency of the approach, even in real-time. A preliminary evaluation over a subset of the Greek web and a manually constructed gold standard of biased concepts and domains shows promising results and interesting insights that need further research.

In the future, we plan to explore variations of the proposed approach where our assumptions do not hold. For example, we plan to exploit the supportive, neutral, or opposing nature of the available links, as identified by sentiment analysis methods, along with the importance of the web pages they appear in. Content-based and hybrid approaches for computing the support scores of domains are also in our focus, as well as the exploitation of other available graphs, like the graphs of friends, retweets, etc. In addition, interesting aspects include how the support and bias scores of multiple BCs can be composed, providing insights about possible correlations between different BCs, as well as how the bias scores of domains change over time. Finally, our vision is to integrate the approach in a large-scale WSE/social platform/browser, in order to study how users define bias, create a globally accepted gold standard of BCs, and explore how such tools can affect the consumption of biased information. In this way, we will be able to evaluate and tune our approach in real-life scenarios, and mitigate any performance issues.

References

  1. Adomavicius, G., Bockstedt, J., Shawn, C., Zhang, J.: De-biasing user preference ratings in recommender systems. In: CEUR-WS, vol. 1253, pp. 2–9 (2014)
  2. Askitis, N., Sinha, R.: Hat-trie: a cache-conscious trie-based data structure for strings. In: Proceedings of the Thirtieth Australasian Conference on Computer Science, vol. 62, pp. 97–105. Australian Computer Society Inc. (2007)
  3. Bolukbasi, T., Chang, K., Zou, J.Y., Saligrama, V., Kalai, A.: Man is to computer programmer as woman is to homemaker? debiasing word embeddings. CoRR, abs/1607.06520 (2016)
  4. Bozdag, E.: Bias in algorithmic filtering and personalization. Ethics Inf. Technol. 15(3), 209–227 (2013)
  5. Budak, C., Goel, S., Rao, J.M.: Fair and balanced? quantifying media bias through crowdsourced content analysis. Publ. Opin. Q. 80(S1), 250–271 (2016)
  6. Corbett-Davies, S., Pierson, E., Feller, A., Goel, S., Huq, A.: Algorithmic decision making and the cost of fairness. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 797–806. ACM (2017)
  7. Dwork, C., Hardt, M., Pitassi, T., Reingold, O., Zemel, R.S.: Fairness through awareness. In: ITCS, pp. 214–226 (2012)
  8. Epstein, R., Robertson, R.E.: The search engine manipulation effect (seme) and its possible impact on the outcomes of elections. PNAS 112(20), E4512–E4521 (2015)
  9. Gleich, D.F.: Pagerank beyond the web. SIAM Rev. 57(3), 321–363 (2015)
  10. Goyal, A., Bonchi, F., Lakshmanan, L.V.: Learning influence probabilities in social networks. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 241–250. ACM (2010)
  11. Hajian, S., Bonchi, F., Castillo, C.: Algorithmic bias: from discrimination discovery to fairness-aware data mining. In: KDD, pp. 2125–2126. ACM (2016)
  12. Hannak, A., et al.: Measuring personalization of web search. In: WWW, pp. 527–538. ACM (2013)
  13. Hannak, A., Soeller, G., Lazer, D., Mislove, A., Wilson, C.: Measuring price discrimination and steering on e-commerce web sites. In: Internet Measurement Conference, pp. 305–318 (2014)
  14. Hardt, M., Price, E., Srebro, N.: Equality of opportunity in supervised learning. In: NIPS, pp. 3315–3323 (2016)
  15. Heinz, S., Zobel, J., Williams, H.E.: Burst tries: a fast, efficient data structure for string keys. ACM Trans. Inf. Syst. (TOIS) 20(2), 192–223 (2002)
  16. House, W.: Big data: A report on algorithmic systems, opportunity, and civil rights. Executive Office of the President, White House, Washington, DC (2016)
  17. Kempe, D., Kleinberg, J., Tardos, É.: Maximizing the spread of influence through a social network. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 137–146. ACM (2003)
  18. Koutra, D., Bennett, P.N., Horvitz, E.: Events and controversies: influences of a shocking news event on information seeking. In: WWW, pp. 614–624 (2015)
  19. Kulshrestha, J., et al.: Quantifying search bias: investigating sources of bias for political searches in social media. In: CSCW (2017)
  20. Lepri, B., Staiano, J., Sangokoya, D., Letouzé, E., Oliver, N.: The Tyranny of data? the bright and dark sides of data-driven decision-making for social good. In: Cerquitelli, T., Quercia, D., Pasquale, F. (eds.) Transparent Data Mining for Big and Small Data. SBD, vol. 11, pp. 3–24. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-54024-5_1
  21. Liu, Z., Weber, I.: Is twitter a public sphere for online conflicts? a cross-ideological and cross-hierarchical look. In: SocInfo, pp. 336–347 (2014)
  22. Lu, H., Caverlee, J., Niu, W.: Biaswatch: a lightweight system for discovering and tracking topic-sensitive opinion bias in social media. In: Proceedings of the 24th ACM International on Conference on Information and Knowledge Management, pp. 213–222. ACM (2015)
  23. Mowshowitz, A., Kawaguchi, A.: Measuring search engine bias. Inf. Process. Manag. 41(5), 1193–1205 (2005)
  24. Muchnik, L., Aral, S., Taylor, S.J.: Social influence bias: a randomized experiment. Science 341(6146), 647–651 (2013)
  25. Pan, B., Hembrooke, H., Joachims, T., Lorigo, L., Gay, G., Granka, L.: In google we trust: users’ decisions on rank, position, and relevance. J. Comput.-Mediated Commun. 12(3), 801–823 (2007)
  26. Pariser, E.: The filter bubble: what the Internet is hiding from you. Penguin, UK (2011)
  27. Pitoura, E., Tsaparas, P., Flouris, G., Fundulaki, I., Papadakos, P., Abiteboul, S., Weikum, G.: On measuring bias in online information. ACM SIGMOD Record 46(4), 16–21 (2018)
  28. Sandvig, C., Hamilton, K., Karahalios, K., Langbort, C.: Auditing algorithms: research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry (2014)
  29. Skeem, J.L., Lowenkamp, C.T.: Risk, race, and recidivism: predictive bias and disparate impact. Criminology 54(4), 680–712 (2016)
  30. Vaughan, L., Thelwall, M.: Search engine coverage bias: evidence and possible causes. Inf. Process. Manag. 40(4), 693–707 (2004)
  31. Weber, I., Garimella, V.R.K., Batayneh, A.: Secular vs. islamist polarization in Egypt on twitter. In: ASONAM, pp. 290–297 (2013)
  32. Wong, F.M.F., Tan, C.W., Sen, S., Chiang, M.: Quantifying political leaning from tweets and retweets. ICWSM 13, 640–649 (2013)
  33. Zafar, M.B., Gummadi, K.P., Danescu-Niculescu-Mizil, C.: Message impartiality in social media discussions. In: ICWSM, pp. 466–475 (2016)
  34. Zafar, M.B., Valera, I., Rodriguez, M.G., Gummadi, K.P.: Fairness beyond disparate treatment & disparate impact: learning classification without disparate mistreatment. In: WWW (2017)

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. Institute of Computer Science, FORTH-ICS, Heraklion, Greece
  2. Computer Science Department, University of Crete, Heraklion, Greece