Introduction

The expanding use of algorithms in society has prompted the emergence of “critical algorithm studies” (Seaver, 2013; Gillespie, 2014). In this chapter, we present four cases of how digital traces may be used for studying algorithms and testing their quality in terms of data, models, and outcomes. We employ unobtrusive digital methods, such as search as research and Application Programming Interfaces (APIs).

The contextual use of algorithms to support decision-making processes is an increasingly frequent phenomenon within social organizations (Nakamura, 2013; Grosser, 2014; Aragona & De Rosa, 2018). Studies that analyze the contextual nature of algorithms have unfolded in several fields, spanning from media studies to chemistry and from the social sciences to the humanities. These critical algorithm studies pay particular attention to the social and political consequences of outputs, for example, how the algorithmic circulation of content influences cultural consumption (Beer, 2013), how the massive use of algorithms affects markets and finance (Mackenzie, 2019), or how algorithms can reinforce inequalities and embed cultural bias (Lupton, 2015; Noble, 2018; Espeland & Yung, 2019; Airoldi, 2020; Aragona, 2020). Over the past five years, a substantial literature has developed on the subject (Beer, 2013; Pasquale, 2015; O’Neil, 2016; Bucher, 2018; Eubanks, 2018). The critical study of algorithms is a process to which the social sciences can make an important contribution (Koene et al., 2019).

This chapter focuses on the methods of social research that can be employed for auditing algorithms to gain a greater understanding of algorithmic functioning (Amaturo & Aragona, 2021).

Different social science methods can be used for the critical study of algorithms. They can be classified along two basic dimensions: the level of obtrusiveness and whether they require access to the algorithm assemblage. Here, the main focus is on the use of two unobtrusive digital methods. The first is search as research, which uses the queries that users make on search engines for research purposes (Salganik, 2019). The second relies on Application Programming Interfaces (APIs). Through four examples, it is shown how digital traces can be effectively used for auditing algorithms (Aragona, 2021).

The first example is taken from Noble’s academic research (2018), conducted between 2009 and 2015, which investigated the relationship between search engines and racism, with a particular focus on gender. Noble notes that algorithms are not neutral, but rather can be influenced by the biases of those who produce them. She employed different combinations of keywords related to gender and race on the Google search engine to detect discriminatory biases. The second example is based on the “Happy white woman” case on Google Images. The third example, by AlgorithmWatch (Kayser-Bril, 2020), shows how the algorithm behind the Google Cloud Vision API, an automatic image classification service, produced different results depending on the skin color of the individuals depicted. The last example is a quasi-experiment conducted by Barsan (2021) of Wunderman Thompson, who tested different image classification services (Google, IBM, and Microsoft) for gender bias.

All four pieces of research highlight how digital traces may be analyzed with different digital research methods for the study of algorithms.

In the next section, we present the rationale of critical algorithm studies, explaining why algorithms should become privileged objects of social research. In Sect. “Methods for Researching Algorithms,” we outline the methods for researching algorithms through digital traces. The following section illustrates the four examples and their main results. Some final remarks and the limitations of our work are presented in the conclusions.

Critical Algorithm Studies

With the rapid intensification of the datafication process, which especially concerns bureaucracies, governments, and policies, the need for a critical study of algorithms has arisen (Espeland & Stevens, 1998; Visentin, 2018). In response, various scholars from different research fields have begun to focus on the nature of algorithms more generally (Geiger, 2014; Montfort et al., 2012), on the effects they produce in specific sectors of society (Amoore, 2006; Pasquale, 2015), or, even more specifically, on code construction procedures (Gillespie, 2014; Seaver, 2013).

Several authors have moved beyond a purely technical vision in favor of a more complex approach that recognizes the crucial role algorithms play in traditional social, political, and economic institutions, for example, in preventing and combating crime in the context of predictive policing or in supporting hiring and firing decisions in the workplace, and, more generally, their influence on social reality (O’Neil, 2016; Boyd & Crawford, 2012; Nakamura, 2013; Grosser, 2014; Tufekci, 2015). Recent literature is cohesive in identifying algorithms as socio-technical constructs, the result of a combination of social and technical knowledge (Aragona & Felaco, 2018). In the context of “critical algorithm studies,” there is widespread agreement in considering the algorithm a complex set of steps defined to produce specific results, in which social and material practices with their own cultural, historical, and institutional nature are intertwined (Montfort et al., 2012; Takhteyev, 2012; Napoli, 2013; Dourish, 2016; Aragona & De Rosa, 2017). The objectivity, impartiality, and consequent claim of reliability of algorithms are contested, underlining that the codes themselves are not mere abstractions but also carry social and political value (Porter, 1995; Gillespie, 2014). This socio-technical production implies a socio-cultural influence that, deeply stratified at the various levels of the development of the algorithmic formulation, emerges during the implementation of the algorithm. As a result, algorithms cannot be defined as neutral constructs. The technical implementation of the code is influenced by the milieu of the producer and is imbued with his or her personal evaluations (Simondon, 2017).

At the same time, algorithms can have unexpected effects and automatically disadvantage and discriminate against different social and demographic groups. Eubanks (2018), for example, notes that in the United States, badly designed algorithms, uncorrected indexes, and poor computer systems have denied many requests for health, food, or economic aid. The redlining of beneficiaries through algorithms, which was supported by neoliberal logic as an antidote to inefficiency and waste, has in some cases weighed heavily on the lives of the poorest and most socially excluded citizens, especially African Americans. Another famous case that shows the opacity of algorithms is the Correctional Offender Management Profiling for Alternative Sanctions (COMPAS), an algorithmic decision tool employed by US courts in many states to estimate the recidivism of defendants. The fairness of this tool has been questioned because it may exacerbate inequalities: black defendants were almost twice as likely as white defendants to be labelled higher risk yet not actually re-offend, whereas white defendants were much more likely than black defendants to be labelled lower risk yet go on to commit other crimes. These differing error rates suggested racial bias.

Both cases show that one of the main problems concerning algorithms is the lack of transparency and accountability. They are often treated as code whose attribution of responsibility remains opaque and ultimately weighs on the overall system that produced them. Digital traces may allow us to investigate the fairness of algorithms and to make them accountable.

Methods for Researching Algorithms

The critical study of algorithms makes use of both technical and social research methods.

The methods for studying algorithms can be distinguished along two dichotomies. The first is whether the methods are used in an analogue or a digital environment. This choice is a critical step for the researcher. If the primary objective is to study the assemblage, it might be possible to conduct participant observation using ethnography. If, on the other hand, the goal is to investigate the algorithm’s outputs given certain inputs, the digital context is the more appropriate choice.

The second dichotomy contrasts obtrusive methods with unobtrusive ones. For example, the unobtrusive deconstruction of algorithms via document analysis or code testing may provide insight into how an algorithm works, but it provides little understanding of the designers’ intent. The danger is that researchers may interpret the data in a self-referential way, based on ideas that correspond solely to their own normative, cognitive, and emotional frames. Interviews with designers and programmers, as well as the ethnography of a coding team, are examples of obtrusive research within the algorithm assemblage (Aragona & Felaco, 2019). The interviewees are asked how they defined the objectives and turned them into code in terms of languages and technologies, practices, influences, and limits (Diakopoulos, 2016).

A method that can be used for the study of black-boxed algorithms is the walkthrough. The walkthrough is an unobtrusive method that allows the researcher to interact with the user interface of an algorithm when its APIs cannot be used. It makes it possible to understand the technological mechanisms and the cultural references that underlie the algorithm’s functioning. Through its application, the researcher can approach the algorithm by experiencing its effects as users perceive them. The walkthrough method involves the production of documentation, such as field notes, and allows further research focused on users and their relationship with the algorithm. Experiments and quasi-experiments are further methods for studying algorithms. Through experiments, it is possible to investigate the risks of algorithms in automating inequalities. However, an experimental design must satisfy several conditions: sampling and the assignment of units to groups must be probabilistic, there must be a control group that does not receive the treatment, and the treatment must be measured. Experiments can be conducted in both analogue and digital contexts and allow the researcher to study characteristics of algorithms such as fairness and transparency.

Two methods are presented below, both of which are digital and unobtrusive and make use of digital traces.

The first method is search as research, where the researcher uses the search engines of digital platforms as a tool for gathering data. By querying the search engine, it is possible to investigate the answers that the algorithm provides to the researcher’s requests. This method allows the researcher to investigate the social impact of algorithms when their assemblage is black boxed (Pasquale, 2015). In many cases, the use of this technique is necessary, since the product and its code have very high levels of opacity, effectively making the entire algorithmic assemblage unintelligible.
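
As a minimal illustration of search as research, the sketch below collects the auto-completion suggestions that Google returns for a given query prefix. It relies on a publicly reachable but unofficial and undocumented suggestion endpoint, so the URL, parameters, and response format shown here are assumptions that may change or be rate limited at any time and should be verified before use.

```python
# Minimal sketch of "search as research": collecting auto-completion suggestions.
# NOTE: suggestqueries.google.com is an unofficial, undocumented endpoint; its
# availability, parameters, and response format are assumptions and may change.
import json
import urllib.parse
import urllib.request


def google_suggestions(prefix: str, lang: str = "en") -> list[str]:
    """Return the auto-completion suggestions proposed for a query prefix."""
    url = (
        "https://suggestqueries.google.com/complete/search?"
        + urllib.parse.urlencode({"client": "firefox", "hl": lang, "q": prefix})
    )
    with urllib.request.urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset("utf-8")
        payload = json.loads(response.read().decode(charset))
    # The response is a JSON array of the form: [query, [suggestion, ...], ...]
    return payload[1]


if __name__ == "__main__":
    for suggestion in google_suggestions("happy white woman"):
        print(suggestion)
```

Suggestions collected in this way can be archived and compared over time or across query variants, much as Noble did manually in the research discussed below.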

The second method relies on Application Programming Interfaces (APIs), sets of functions and procedures that allow us to interact with the algorithm being studied. Products and services that offer this possibility certainly sit at a different level of opacity compared to black boxes, but at the same time, they place constraints on how and for what purposes these tools can be used. It is important to note that APIs are not provided primarily for research purposes. Their origin lies in the value production and datafication that lead digital companies to provide third-party developers with the tools to integrate these services into their applications. In some cases, such as that of Twitter, the use of APIs for research purposes is highly regulated. Through APIs, it is possible to obtain some of the information that the algorithm processes, and the researcher therefore gets closer to how the data is collected and constructed by the algorithm. Because APIs are designed primarily as tools granted to third-party producers with a view to integration between platforms, they enable the creation of new products that process data and information by recombining multiple services, opening the door to uses that are unconventional or were never envisaged by those who produced the algorithm in the first place.
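
Whatever the platform, the general pattern of API-based research is the same: authenticate, send a request, and parse the structured response. The sketch below illustrates it with Twitter’s v2 recent-search endpoint, as it was documented at the time of writing; the bearer token is a placeholder the researcher must obtain, and access levels, rate limits, and research-use rules change over time and must be checked against the current terms of service.

```python
# Minimal sketch of the API pattern: authenticate, request, parse the response.
# Uses Twitter's v2 recent-search endpoint as documented at the time of writing;
# the bearer token is a placeholder and access rules may have changed since.
import os

import requests  # third-party: pip install requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # set by the researcher


def recent_tweets(query: str, max_results: int = 10) -> list[dict]:
    """Return recent tweets matching a query, as structured records."""
    response = requests.get(
        "https://api.twitter.com/2/tweets/search/recent",
        headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
        params={"query": query, "max_results": max_results},
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("data", [])


if __name__ == "__main__":
    for tweet in recent_tweets("algorithmic bias -is:retweet"):
        print(tweet["id"], tweet["text"][:80])
```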

Digital Traces and Algorithm Studies

A first example of the use of digital traces for researching algorithms is offered by Noble’s research, described thoroughly in her book Algorithms of Oppression. The scholar was able to highlight how, between 2009 and 2015, the Google Search algorithm perpetuated gender and racial discrimination. Noble interrogated the Google search engine to reconstruct how its algorithm interpreted certain keywords. Her research interest, also driven by a feminist perspective, led her to investigate the representation of the female gender in the search engine and, more specifically, the condition of black women. Starting from these premises, Noble queried the search engine in various ways regarding the representation of women and men and then, again, of white women and black women. By querying the search engine, Noble analyzed the results proposed by Google’s auto-completion function, noting how the suggestions proposed by the algorithm were discriminatory. It is in this context that the scholar developed the formulation that gives her study its title. The use of the auto-completion proposed by the Google search engine is only one of the techniques used by Noble. She spent a significant amount of time analyzing the results pages and related content, noting how the portrayal of black women was always associated with erotic behaviors, whereas white men and women yielded very different results. The research was not limited to the search engine’s suggested web results and web pages but also included observations of images. Noble continued her investigation using the image search, and it is precisely from this in-depth analysis that she obtained further results highlighting the serious discrimination enacted by the search engine.

Noble notes that the near monopoly in the search engine sector, together with advertising systems and services that serve private interests, has strongly favored the sponsoring of content featuring white people. Research on paid advertising provided an additional element highlighting discrimination against black women. Noble argues that the combination of these factors can lead to a culture of racism. The growth of search engines, as increasingly used tools, represents the main vector in the spread of these biases. Specifically, she points out how the errors and discrimination carried out by algorithms have concrete consequences on people’s lives, pointing to the need to define their responsibilities. The results of her research have greatly contributed to contesting algorithmic neutrality. Following Noble’s work, Google has repeatedly changed its search algorithm.

The second example that shows how biases may be incorporated into algorithms is the “Happy white woman” case. As in Noble’s research, this algorithmic bias was also identified using search as research. Early references to this case date to January 2021, and as of February 2022, the results did not appear to have changed significantly. We carried out two searches on the Google Images search engine with the keywords “Happy black woman” and “Happy white woman.” The results obtained for “Happy black woman” are mainly photos and stock images that show black women smiling and apparently happy. The images obtained from this search thus appear in line with the content one would expect from the query entered. On the contrary, the results for “Happy white woman” show some peculiar images and do not feature many stock images or photos of smiling women. Rather, in most of the photos on the first page of results, the “happy white woman” appears alongside a black man. Although it is not the aim of this chapter to develop a content analysis of the results of these searches, it is clear that the Google algorithm presents some problems with this specific query (Figs. 8.1 and 8.2). A sketch of how this comparison could be repeated programmatically is given after the figures.

Fig. 8.1
19 images depict the search results of different black women smiling and apparently happy. Every image has the face blurred, and a short text is written below.

Anonymised screenshot of Google search results images of “Happy black woman.” Obtained on 02/23/22

Fig. 8.2
22 images depict the search results of the Happy white woman along with a black man. Every image has the face blurred, and a short text is written below.

Anonymised screenshot of Google search results images of “Happy white woman.” Obtained on 02/23/22
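
The comparison above was carried out manually through the Google Images interface. A researcher who wanted to repeat and document it programmatically could, for example, use Google’s Custom Search JSON API with image search enabled, as in the sketch below; the API key and the Programmable Search Engine identifier are placeholders, and the results returned by this API are not guaranteed to match those shown on the consumer Google Images page.

```python
# Sketch of repeating the "Happy black woman" / "Happy white woman" comparison
# with Google's Custom Search JSON API (image search). API_KEY and the
# Programmable Search Engine ID (cx) are placeholders; results may differ from
# the consumer Google Images interface.
import requests  # third-party: pip install requests

API_KEY = "YOUR_API_KEY"         # obtained from the Google Cloud console
SEARCH_ENGINE_ID = "YOUR_CX_ID"  # a Programmable Search Engine with image search enabled


def image_results(query: str, num: int = 10) -> list[dict]:
    """Return up to `num` image results (max 10 per request) for a query."""
    response = requests.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": API_KEY,
            "cx": SEARCH_ENGINE_ID,
            "q": query,
            "searchType": "image",
            "num": num,
        },
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("items", [])


if __name__ == "__main__":
    for query in ("Happy black woman", "Happy white woman"):
        print(f"--- {query} ---")
        for item in image_results(query):
            print(item.get("title"), item.get("link"))
```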

A third recent experiment was carried out by Kayser-Bril of the non-governmental organization (NGO) AlgorithmWatch, which monitors and analyzes the impact of automated decision systems on society, and produced evidence of discriminatory problems in the Google Cloud Vision API. Cloud Vision is a Google service that allows developers to analyze image content through the use of ever-evolving machine learning models (Kelleher, 2019). This description indicates that Google is providing a business-to-business service that is not intended for non-commercial use. These Cloud Vision APIs, according to Google, are able to detect the content of an image with extreme precision. When their functionalities are examined in detail, it becomes clear that they can detect labels, explicit content, places, landmarks, faces, and image attributes. Kayser-Bril submitted two similar images to the Vision API, each representing a hand holding a laser thermometer, with the difference that one hand was white-skinned and the other black-skinned. The result proposed by the Vision API was not the same for the two images: the algorithm determined that the object in the white hand was a “monocular,” while it mistook the same object for a gun when it was held in the black hand. The cause of this error is inherent in the machine learning techniques and models used to train the Cloud Vision algorithm. In this regard, Kayser-Bril concludes that the algorithm was probably provided with more images of black people in violent contexts than of white people. As a result of this type of learning process, black people may be associated more frequently with the concept of violence. From this experiment, it is clear how, even in the absence of deliberate human manipulation, and owing to the socio-technical nature of algorithms, an uncontrolled learning process can lead algorithmic constructs to develop the same biases that affect individuals (Konrad & Böhle, 2019). Individuals may suffer severe consequences as a result of such biases. In contexts where algorithmic tracking associates facial detection with the detection of weapons, serious discrimination can arise for citizens if these algorithms present similar biases. Police forces, and especially airport security forces, often equipped with cutting-edge technologies to prevent threats, use predictive algorithms trained on specific models that can provide important support in the prevention of crimes (Uchida, 2014; Dieterich et al., 2016; Ferguson, 2017). If these algorithms present such biases, how can they still be considered reliable tools for the security of citizens? In these circumstances, critical studies of algorithms through digital traces make it possible to investigate the algorithmic construct and provide clear evidence of the social impact that the algorithm has on society. On April 6, 2020, Google implemented the changes suggested by Kayser-Bril, correctly detecting the laser thermometers without being influenced by the user’s skin color.
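
A test of this kind can be approximated with a few lines of code. The sketch below requests label annotations for two images and compares the labels returned, assuming the official google-cloud-vision client library is installed and application credentials are configured; the image file names are placeholders for the researcher’s own test images.

```python
# Sketch of a Kayser-Bril-style test: request label annotations for two images
# that differ only in the skin colour of the hand, then compare the labels.
# Assumes the google-cloud-vision client library is installed and that
# GOOGLE_APPLICATION_CREDENTIALS points to valid credentials; the file names
# below are placeholders.
from google.cloud import vision

client = vision.ImageAnnotatorClient()


def labels_for(path: str) -> dict[str, float]:
    """Return {label description: confidence score} for one image."""
    with open(path, "rb") as image_file:
        image = vision.Image(content=image_file.read())
    response = client.label_detection(image=image)
    return {label.description: label.score for label in response.label_annotations}


if __name__ == "__main__":
    light = labels_for("thermometer_light_hand.jpg")
    dark = labels_for("thermometer_dark_hand.jpg")
    # Labels returned for one image but not the other are the interesting ones.
    print("Only for the light-skinned hand:", set(light) - set(dark))
    print("Only for the dark-skinned hand:", set(dark) - set(light))
```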

A final example of the use of digital traces for studying algorithms is a quasi-experiment that revealed gender bias, conducted by Barsan (2021), director of data science at Wunderman Thompson, a marketing firm. Barsan was developing a tool that would allow authorities to connect to thousands of street cameras and determine the percentage of pedestrians wearing masks at any given time. As in the previous case, the image recognition APIs offered by Google Cloud Vision, IBM Watson, and Microsoft Computer Vision were used in this research; these services were intended to power the mask detection tool developed by Wunderman Thompson. However, they showed gender bias when tested on self-portraits of people wearing face masks. Barsan found this directly when she uploaded a photo of herself wearing a mask to test the accuracy of the Cloud Vision APIs. The analysis returned ten labels, two of which are of particular interest: “duct tape” with 94.51% confidence and “mask” with 73.92%. Although most of the labels were consistent with the image Barsan had submitted, the “duct tape” label with a confidence of 94.51% was surprising. Based on these results, Barsan wondered why Cloud Vision associated the mask so strongly with duct tape and continued her investigation. Two more tests were conducted after the first: she wore a ruby red mask and then a blue surgical mask. The confidence levels for “duct tape” decreased in these cases but were still 87% and 66%, respectively. Surprisingly, the label “mask,” which had been applied with 74% confidence in the first example, was not applied at all in these cases. This evidence led Barsan to conduct the quasi-experiment. Her hypothesis was that the Cloud Vision API might classify men and women wearing masks differently. To test it, her team formed two groups of 265 images each, one of men and the other of women, in different contexts, dressed in different ways, and with various types of masks. For the male group, Cloud Vision correctly identified personal protective equipment (PPE) in 36% of cases, associated the images with facial hair in 27%, and with duct tape in 15%. For the female group, on the other hand, PPE was identified only 19% of the time, while 28% of images were identified as duct tape; the association with facial hair was detected in 8% of cases. From these data, it emerges that, for women, duct tape was associated almost twice as often as for men. This quasi-experiment produced results similar to what Kayser-Bril had noted the previous year. Again, Cloud Vision’s machine learning algorithm may have been trained on inadequate data and, as a result, produced this bias. Moreover, even before the SARS-CoV-2 pandemic, generic masks and PPE were widely used in healthcare facilities and laboratories.
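
The core of this group comparison can be reproduced in outline by computing, for each group, the share of images whose returned labels include a label of interest. The sketch below shows that aggregation step on toy placeholder data; in the actual study, each label set would come from an image classification API call such as the one sketched earlier.

```python
# Sketch of the group comparison in Barsan's quasi-experiment: given the set of
# labels returned for each image, compute the share of images in each group
# that received a given label. The data below are toy placeholders; in practice
# each set would come from an image classification API call.

def label_rates(label_sets: list[set[str]], labels_of_interest: list[str]) -> dict[str, float]:
    """Share of images in a group whose labels include each label of interest."""
    total = len(label_sets)
    return {
        label: sum(label in labels for labels in label_sets) / total
        for label in labels_of_interest
    }


if __name__ == "__main__":
    men = [{"Personal protective equipment", "Face"}, {"Facial hair", "Face"}, {"Duct tape"}]
    women = [{"Duct tape", "Face"}, {"Duct tape", "Mask"}, {"Personal protective equipment"}]
    interesting = ["Personal protective equipment", "Duct tape", "Facial hair"]
    print("men:  ", label_rates(men, interesting))
    print("women:", label_rates(women, interesting))
```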

Barsan and her team ran the same experiment on the image recognition services of IBM Watson Visual Recognition and Microsoft Azure Cognitive Services Computer Vision. In the case of IBM Watson, the labels “restraint chains” and “gag” were returned at the same rate, appearing in 10% of men’s images and 23% of women’s. The mask, on the other hand, was detected in 12% of images of men and only 5% of images of women. The average confidence value for the “gag” label was around 0.79 for women and 0.75 for men. This indicates that, compared with Google’s algorithm, IBM’s was found to be less biased. With Microsoft Azure, no gender bias was detected, but the algorithm correctly classified the masks in only 9% of cases for men and 5% for women. The strongest associations were with the “fashion accessories” label: 40% for women and 13% for men.
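
For completeness, the same kind of request can be sent to Microsoft’s service. The sketch below queries the Azure Computer Vision “analyze” REST endpoint for image tags, following the v3.2 API as documented at the time of writing; the resource endpoint, subscription key, and file name are placeholders, and the IBM service would be queried in an analogous way through its own SDK or REST interface.

```python
# Sketch of querying Microsoft Azure Computer Vision for image tags, following
# the v3.2 "analyze" REST endpoint as documented at the time of writing.
# The resource endpoint, subscription key, and file name are placeholders.
import requests  # third-party: pip install requests

AZURE_ENDPOINT = "https://YOUR-RESOURCE-NAME.cognitiveservices.azure.com"
AZURE_KEY = "YOUR_SUBSCRIPTION_KEY"


def azure_tags(image_path: str) -> dict[str, float]:
    """Return {tag name: confidence} for one image."""
    with open(image_path, "rb") as image_file:
        response = requests.post(
            f"{AZURE_ENDPOINT}/vision/v3.2/analyze",
            params={"visualFeatures": "Tags"},
            headers={
                "Ocp-Apim-Subscription-Key": AZURE_KEY,
                "Content-Type": "application/octet-stream",
            },
            data=image_file.read(),
            timeout=10,
        )
    response.raise_for_status()
    return {tag["name"]: tag["confidence"] for tag in response.json().get("tags", [])}


if __name__ == "__main__":
    print(azure_tags("selfie_with_mask.jpg"))  # placeholder image of a person wearing a mask
```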

Through these experiments and the results obtained, Barsan reports that such evidence is not the result of bad intentions. Rather, it serves as a reminder that prejudices and stereotypes can exist in learning models; this is one of the ways in which social biases are replicated in software. Figures 8.3 and 8.4 show the results of our own search on the Google Images search engine using, respectively, the keywords “Duct tape man” and “Duct tape woman.” This search was performed in February 2022. The results show that, even across different services, the bias of the Google search engine’s algorithm is still present.

Fig. 8.3
21 images depict the search results of different men with duct tape wrapped around their bodies. The tapes for every image are different and placed around different areas. Every image has the face blurred, and a short text is written below.

Anonymised screenshot of Google search results images of “Duct tape man.” Obtained on 02/23/22

Fig. 8.4
19 images depict the search results of different women with duct tapes. The tapes for every image are different and placed around different areas. In most of the images, the duct tapes are placed around the mouth. Every image has the face blurred, and a short text is written below.

Anonymised screenshot of Google search results images of “Duct tape woman.” Obtained on 02/23/22

However, if Barsan hadn’t conducted these tests, it’s unlikely that the algorithms’ hidden bias would have been discovered, and a tool to detect how many people on the streets wear masks would have been developed with the help of these algorithms. Testing the tools on digital traces such as images was a valuable research strategy to control the biases and reliability of the Vision API system.

Conclusions

The examples shown here are all primarily linked to Google services, but similar problems can be traced in any algorithmic assemblage. They all allowed us to observe how digital traces can provide valuable data for finding evidence on the functioning of algorithms and for increasing transparency and accountability. Interacting with black boxes implies the need to consider the levels of opacity that separate the researcher from the algorithm. As Noble has effectively shown, through the search as research method, it is still possible to trace how algorithms work. In addition, the last two examples clearly show how, through APIs, it is possible to identify the presence of racial and gender discrimination. In a digital society where algorithms are present at all levels, critical studies on algorithms become a fundamental requirement to which the social sciences can and should contribute. As the examples show, the issues that an unregulated use of algorithms can raise concern biases and controversies typically investigated by social research. The use of digital traces can provide valuable evidence on the social impact that algorithms have on society, something that the technical sciences alone cannot address.