1 Introduction

Recommender systems (RSs) drive many of the digital services we use today. Businesses use them for hiring and for keeping our attention, while consumers use them to find everything from flights to songs. It is difficult to find an online service that doesn’t use an RS. While RSs have made our digital lives convenient, academics have raised ethical issues associated with them – including privacy violations, inappropriate content, fairness issues, and threats to autonomy and personal identity (Milano et al. 2020).

Indeed, the ethical issues associated with recommender systems have had clear consequences. The title of this chapter may be alarming; however, in some cases recommendations have been accused of causing death. There is the case of Molly Russell in the UK, who killed herself after viewing graphic images of suicide and self-harm on Instagram (BBC News 2020). Then there is the genocide of the Rohingya in Myanmar: Facebook was sued for 150 billion pounds because it failed to prevent the amplification of hate speech and misinformation which led to offline violence (Mozur 2018; Milmo 2021).

While these cases deserve attention, in this chapter I am not necessarily referring to literal death. I show that many RSs are responsible for a vicious cycle that takes our most important judgements outside of our control. The judgements I am referring to are evaluative judgements that carry ethical weight – particularly when the judgement is about human beings. Recommending someone for a job or evaluating someone’s criminal risk level has important consequences. These judgements deserve the highest level of scrutiny.

In scrutinizing these types of recommendations when algorithms are used to arrive at them, at least four distorting forces should make the resulting evaluations suspect. First, because they depend upon machine learning algorithms, they are inherently based upon past evaluative standards. This may not always be a bad thing, but rarely is the fact that “this is the way we have always done it” a good argument in favor of continuing to do it that way. Second, because we are using machines, all the input data must be machine readable. This necessarily forces us to reduce complex things like emotions and character traits to computable information or to ignore them completely. Either option will distort the outputs in an undesirable way. Third, algorithms need some proxy to maximize that serves as their notion of the ‘good’ (Braganza 2022). Engagement, clicks, purchases, etc. have all played this role, and all have led to serious consequences, including teen suicide, election interference, and radicalization. There is no doubt that revenue has increased due to these proxies; however, it is also clear that these proxies distort recommendations. Finally, the process by which these recommendations are generated is opaque. That is, the considerations that led to the recommendation are unknown. This leads us astray from the actual object of evaluation – the reasons for a recommendation – not the recommendation itself.

The concern of this chapter is not only that we are likely to get bad recommendations when it comes to the evaluative; it is also that these recommendations serve to change the way we evaluate. This has already had an effect on how people – especially young women – evaluate their bodies (see e.g. Cohen et al. 2017). Just as algorithms learn from our behavior, we will ‘learn’ from the recommendations that the algorithms give us. Large technology companies know this and use it to their advantage as they try to ‘expand our tastes’ to generate more engagement and revenue (Roose 2019). This, however, takes humans out of the driver’s seat when it comes to setting evaluative standards. It is fundamentally a human enterprise to determine how the world ought to be. Algorithms can serve to help us achieve that world – but given the distorting forces I have outlined, they should never determine it.

2 Distorting Forces

The premise of this chapter is that recommender systems will inevitably play a role in the formation of our evaluative standards. Recommender systems telling us what news to read, which movies to watch, whom to hire, what music to listen to, etc. will shape how we see the world. When Russia invaded Ukraine in 2022, it was on the front page of every newspaper. In the weeks after, Ukraine was featured on the front page every day, and my news recommender system (Google News) also had it as the main headline. Today (26 July 2022) the New York Times still has stories about Ukraine on the front page. However, both the generic (non-signed-in) and personalized Google News pages feature no stories specific to the war in Ukraine. There is no way to verify whether this is the case for everyone; the point is to show that how professional journalists rate the importance of news can diverge greatly from how news recommender systems do.

I am not going to argue whether stories about the Ukraine war should be featured as front-page news or not. Rather, what stories are shown as front-page news influences what stories we think are important. While the New York Times has an editor-in-chief and an editorial board deciding what is important, news recommender systems have an algorithm.

I cannot simply say that an editor is better than an algorithm. The New York Times has, for example, been criticized for coverage that distorted the truth. It has also been argued that there is a consistent bias in the New York Times towards corporations (Herman and Chomsky 2011) and that it propagated unchecked claims that Saddam Hussein was manufacturing weapons of mass destruction (The New York Times 2004). Algorithms should also receive this type of scrutiny. What biases do they have? How could this go wrong?

The claim being made in this chapter is that there are forces that will distort the outputs of algorithms when it comes to the evaluative. These forces are: (1) that recommender system outputs are based solely on past evaluative standards; (2) that the outputs are based solely on computable information; (3) that these algorithms are trained using proxies for ‘good’; and (4) that the algorithms are black boxed – meaning we cannot know the considerations that lead to the output – which, as I will argue, is what matters when evaluating these evaluative judgements. These distorted outputs then feed into our future evaluative judgements. This creates a vicious cycle whereby humans are taken out of the driver’s seat when it comes to the evaluative. We are in danger of losing meaningful human control over what we ought to do and how the world ought to be. We must ensure that we retain control over how the world ought to be and only use technology to help us realize that world.

2.1 Past Evaluative Standards

If we consider our personal aesthetic or ethical standards, we know that some of them have changed over time. What we once thought was a good movie is no longer a good movie (can you imagine what it would be like if your standards for films had not changed since you were a child?). Many across the world have come to the conclusion that eating meat is a moral wrong. The changes brought by the pandemic have introduced many novel moral norms regarding wearing masks and coming to work sick.

What this points to is that our evaluative standards have changed. Not simply as a matter of subjective taste: the context has changed over time, and this has necessitated new evaluative standards. Recommender systems are built on the premise that past evaluative standards were good in the first place. In many cases this will simply be untrue.

Amazon used an algorithm to recommend applicants for jobs. The algorithm was trained on past hiring habits – which can be read as past evaluative standards regarding who would make a good employee. It was quickly understood that those past evaluative standards included the idea that men were better suited than women for senior positions (Dastin 2018). The past evaluative standards of Amazon’s hiring were not good – and therefore the algorithm’s standards were not good either.
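
The mechanism is easy to reproduce. Below is a minimal, hypothetical sketch (the features, data, and crude is_male column are invented for illustration and are not drawn from Amazon’s actual system) of how a model trained on past hiring decisions simply re-encodes the standard embedded in those decisions:

```python
# Hypothetical sketch: a hiring model trained on past decisions can only
# reproduce the evaluative standards embedded in those decisions.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy historical data: [years_experience, is_male] -> was_hired
# The past standard (men favored for senior roles) is baked into the labels.
X_past = np.array([
    [8, 1], [7, 1], [9, 1], [6, 1],   # men: 3 of 4 hired
    [8, 0], [7, 0], [9, 0], [6, 0],   # equally qualified women: 1 of 4 hired
])
y_past = np.array([1, 1, 1, 0, 0, 1, 0, 0])

model = LogisticRegression().fit(X_past, y_past)

# Two equally qualified new candidates, differing only in gender.
candidates = np.array([[8, 1], [8, 0]])
print(model.predict_proba(candidates)[:, 1])  # the male candidate scores higher
```

The model has not discovered anything about what makes a good employee; it has only compressed the old evaluative standard into a set of weights.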

It is also easy to see that some evaluative standards that were good may not continue to be good in the future. To continue with the example of hiring, the profile and character of what constituted a good employee for a specific role in the past will differ from what constitutes a good employee in the future. People with gaps in their CV, for example, were often viewed negatively. However, it has since been pointed out that this benefits men, who historically have taken little time off to care for a newborn. Coming to this understanding changes how we evaluate potential hires. In the academic world there is a constant debate over what standards academics should be judged by (Hicks et al. 2015). These will necessarily evolve. Simply training an algorithm on data from the past will cement in place an evaluative standard that is most likely incorrect.

2.2 Reducing to Computable Information

The data used to train recommender system algorithms must be machine readable. This means that information which is not machine readable must either be left out or be converted into a machine-readable form. Both options are problematic, and I will take each in turn.

Leaving non-machine-readable information out means that much of the information we use to evaluate is deemed unimportant. When evaluating people – whether for a job, for prison sentencing, or for their ability to pay back a loan – it is necessary to reduce those people to a machine-readable format. Their criminal history, financial data, job history, etc. can all be used to evaluate them. However, there is a reason that processes of evaluating people often involve open conversations in the form of interviews, depositions, etc.: they allow us to understand the data in context. Someone’s CV may look bad because they have a two-year employment gap, but their explanation for that gap could be good (e.g., they had to take care of a dying family member). When information like this doesn’t make it to the machine, it is implicitly deemed unimportant – and this will disadvantage people with unusual circumstances. It privileges those who have ‘normal’ lives.
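
A toy sketch of this first option, leaving the information out: the feature schema itself decides what counts, and anything without a column – such as the explanation for an employment gap – is silently discarded. The field names below are hypothetical.

```python
# Hypothetical sketch: reducing a candidate to a fixed feature vector.
# Whatever has no column simply does not exist for the model.
candidate = {
    "years_experience": 6,
    "employment_gap_months": 24,  # machine readable: counts against the candidate
    "gap_explanation": "Cared for a dying family member",  # not machine readable
}

FEATURES = ["years_experience", "employment_gap_months"]

def to_feature_vector(person: dict) -> list[float]:
    # The context that makes the gap understandable is silently dropped.
    return [float(person[f]) for f in FEATURES]

print(to_feature_vector(candidate))  # [6.0, 24.0] -- the explanation is gone
```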

Things get worse when non-machine-readable information is claimed to be made machine readable. This happens with emotion detection, lie detection, pain detection, etc. These are often done by analyzing video or image data and reading people’s facial expressions, which are thought to show what people are feeling. However, studies have shown that the scientific basis for this is non-existent (Barrett et al. 2019). This has not stopped companies and academics from claiming that such methods work. Students are rated for engagement in classes (Goldberg et al. 2021). HireVue claims that its AI-powered video interviewing service can read a candidate’s empathy level: “E-Motions measures empathy, defined as an individual’s ability to read and recognize emotions in others” (HireVue 2019). Companies and academics also claim to be able to detect ‘suspicious’ people (Arroyo et al. 2015; Gorilla Technology 2019). There is even software that claims to provide pain assessment based on facial recognition (PainCheck n.d.). These systems are known to be seriously flawed. Studies have shown that negative emotions are assigned more often to people of color (Rhue 2018). The AI Now Institute concluded in 2019 that “there remains little to no evidence that these new affect-recognition products have any scientific validity” (Crawford et al. 2019).

The point is that reducing non-computable information like emotions and character traits to computable information is, to date, not possible. When recommender systems try to make recommendations that require such information, they must either leave it out or fake it. Either way, the results will be distorted in an undesirable way.

2.3 Proxies for ‘Good’

When ML algorithms are trained, some goal must be built into the training process so that the algorithm knows what it is supposed to be getting closer to. For example, a chess-playing algorithm has the goal of winning built into it. When it plays a game and loses, it adjusts the statistical weights for the moves made in that game to reflect that loss. Recommender systems powered by ML also need a goal. When a social media feed ‘recommends’ a post by placing it at the top of your feed, it may have the goal of getting you to click on it, re-tweet it, reply to it, etc. The overall goal of these platforms is to keep their app in the foreground – the focus of your attention (BBC News 2021).
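
A minimal sketch of what ‘building the goal in’ looks like for a recommender (the features and interaction data are invented): the training label is simply whether a user clicked, so the system ranks items by predicted engagement rather than by any independent notion of quality.

```python
# Hypothetical sketch: 'good' is operationalized as whatever the objective rewards.
# The label is "did the user click?", so the model learns to rank items by
# predicted engagement -- not by accuracy, balance, or benefit to the user.
from sklearn.linear_model import LogisticRegression
import numpy as np

# Toy interaction log: [sensationalism_score, relevance_to_user] -> clicked
X = np.array([[0.9, 0.2], [0.8, 0.1], [0.7, 0.3],
              [0.2, 0.9], [0.1, 0.8], [0.3, 0.7]])
y = np.array([1, 1, 1, 0, 1, 0])  # the sensational items got more clicks

clicks_model = LogisticRegression().fit(X, y)

def rank(items: np.ndarray) -> np.ndarray:
    # Recommend in order of predicted click probability (the proxy for 'good').
    return items[np.argsort(-clicks_model.predict_proba(items)[:, 1])]

print(rank(np.array([[0.9, 0.1], [0.1, 0.9]])))  # the sensational item comes first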

The point is that recommender systems need something to aim for, and whatever that something is deserves scrutiny, as it is – in some sense – a proxy for ‘good’. The simplistic logic is that if a recommendation keeps you engaged, then whatever was recommended was good for you. This is obviously not necessarily the case; the very opposite may be true. For example, if an algorithm recommends that I eat French fries, and I indeed order French fries, that does not mean that the French fries are good for me. Steamed broccoli would be much better – even if I do not end up ordering it. When engagement is used as the goal for an ML algorithm on, for example, a news feed, the implicit assumption is that the more engaged the user is with the news, the better that news is for them. Yet a person might be most engaged with news stories that are written to be misleading – stories that give a false impression of what is going on in the world.

Moving away from platforms, we can see something similar with, for example, RSs that recommend job candidates. The algorithm succeeds when the top recommended candidates get hired. But this is simply circular. The entire purpose of the system is to recommend ‘good’ candidates; however, ‘good’ is simply a measure of whether or not the candidate gets hired. The company wants to hire good candidates – but whomever they hire will be considered ‘good’. The reader may ask how the algorithm determines the top-ranked candidates. This may be based on past hiring decisions by the employer (see the problems with this in the preceding section) or even on video analysis of candidates answering interview questions – which has raised ethical issues surrounding cultural, racial, and gender differences biasing the results. In the next section I say more about the opacity of the considerations that lead to an algorithm’s output – and why that is a problem. Here it is important to note that because of this opacity we are forced to judge the success of the algorithm by the subject’s engagement with the top recommendations, rather than by an evaluation of the recommendation itself.
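
To make the circularity concrete, here is a hypothetical sketch of such an evaluation metric: the system is scored against the very behaviour it exists to shape, not against any independent standard of ‘good’.

```python
# Hypothetical sketch of the circularity: success is measured as
# "were the top recommendations subsequently hired?"
def evaluate_recommender(recommended_ids: list[int], hired_ids: set[int]) -> float:
    """Fraction of top recommendations that were subsequently hired."""
    hits = sum(1 for cid in recommended_ids if cid in hired_ids)
    return hits / len(recommended_ids)

# If the employer hires whoever is recommended, the metric approaches 1.0
# regardless of whether the candidates were actually good hires.
print(evaluate_recommender([101, 102, 103], hired_ids={101, 102, 103}))  # 1.0
```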

With recommender systems in evaluative contexts, there is a danger that simplistic goals like ‘engagement’ or ‘clicks’ will drive us away from what is good for us in those contexts. Platforms may claim that the goals of their algorithms have nothing to do with the good – that they are neutrally trying to get you to engage more with their platform so that they can sell ads. However, determining which job candidates are rated highest, which convicted criminals are rated the riskiest, which news stories are prominently displayed, and which social media post sits at the top of your feed is a huge responsibility with normative implications.

2.4 Black Boxed

If someone were to tell you that you should quit your job, they are making an evaluative statement – something to the effect of “it would be good for you to quit your job.” I imagine that if you had not yet thought about quitting your job, you would not simply follow this person’s advice without further inquiry. Instead, it would be appropriate to ask why they think that. What are their reasons for thinking that you should quit your job? Without those reasons the judgement is worth little, for it is the reasons that need scrutiny. For example, their reasoning might be that academia doesn’t pay enough and you could make much more in the private sector. This reason may or may not be a good reason for you.

With AI-powered recommender systems, we have judgements without the reasons. They are black boxed because, as it stands, it is not possible to understand the reasoning behind the output of a machine learning system. While this is often portrayed as ML’s biggest problem, it is also the source of its power. ML is not restricted to reasons that are articulable to humans – it has access to patterns and considerations that would be impossible for us to comprehend (Robbins 2019, 2020). If we restricted machines to human-articulable reasons, as we did with expert systems or good old-fashioned AI (GOFAI), then we would not have had the breakthroughs in AI we have today.

However, a decision about whether to quit your job seems to require human-articulable reasons. In fact, it is difficult to find ethical or aesthetic decisions that do not require justifying reasons (that is, human-articulable reasons that justify a particular decision). This presents a problem for ML-powered recommender systems used for such outputs, because using them implies that reasons are not important for the outputs. To take another example, using ML systems to recommend people for positions implies that the reasons for hiring a particular person are not important. It implies this because whatever internal logic the recommender system uses is opaque to us. There may be reasons (inarticulable to humans) that could justify the output; however, without access to those reasons we cannot question their ability to justify the output. When we accept such recommendations without the ability to question the reasons that led to them, we implicitly disregard the importance of those reasons. This is a problem because the reasons justifying why you hire one person over another are of the utmost importance. Amazon’s ill-fated hiring tool, mentioned earlier in this chapter, highlights this point. A reason for hiring one person rather than another was gender. Knowing that gender is a reason precludes someone who wants to make ethical decisions from accepting the output of the system.

While using gender as a reason to hire one person over another is almost always unethical, there are other seemingly irrelevant reasons that also have ethical import. If the number of letters in someone’s name or the number of lines on someone’s CV were used to hire one person over another, that would also be unethical. These are not good reasons to hire or not to hire someone. A hiring committee could decide to use such considerations because it was deliberately going with a random approach – which would not be unethical. However, a machine doing this without our knowing that these were considerations would cause us to place higher evaluative weight on a candidate for reasons that are irrelevant to their candidacy. If a human being recommended someone for a job based on the number of letters in their name without telling you that this was a consideration, it would be deceptive at best. Knowing that this situation is possible with algorithms should give us pause. When we do not know the reasons for a particular ML recommendation, we are left without a way to evaluate whether the output is acceptable. In cases like hiring, not having the reasons why a particular person was hired is unacceptable.

This has led some to argue that what is needed is so-called ‘explainable AI’ (XAI) (Floridi et al. 2018; Wachter et al. 2017). Many researchers are now working on making ML explainable (see e.g. Adadi and Berrada 2018; Linardatos et al. 2021). While there has been some success, we are a long way from knowing the reasons that justify a particular output – which is what is needed in the situations described above.
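
To see why current XAI techniques fall short of justifying reasons, consider one common approach in this literature: fitting a simple ‘surrogate’ model to imitate a black-box model and reading feature weights off the surrogate. The sketch below is a generic illustration on invented data, not any particular XAI tool; the point is that what comes out is an approximation of the model’s behaviour, not reasons that justify a particular recommendation.

```python
# Hypothetical sketch of a global surrogate explanation: train a simple model to
# mimic a black box's outputs and inspect the simple model's weights.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))                       # four anonymous candidate features
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0).astype(int)  # unknown 'true' decision rule

black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Global surrogate: a linear model trained to imitate the black box's predictions.
surrogate = LogisticRegression().fit(X, black_box.predict(X))
print(surrogate.coef_)  # weights of the imitation, not reasons of the original
```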

All of this goes to show that ML-based recommender systems should probably not be used for evaluative outputs like hiring, or anywhere else that reasons are important. What I want to argue now is that the situation is worse than it looks: these systems also have the capacity to influence our evaluative reasoning. To highlight this, let us look at news recommender systems.

What is going on in the world is important. The news helps us form opinions about where resources should be allocated (frequent stories about traffic jams may make one conclude that we need to fund light rail) and about whom to vote for (a candidate may be involved in a scandal reported by the news, causing you to vote for someone else), and it gives us an overall picture of what is currently going on in the world. We can easily see the ethical import of this in the U.S. right now (July 2022). Hearings are going on in Washington D.C. regarding the pro-Trump violence at the Capitol on January 6, 2021. All but one news network is broadcasting the hearings live. Fox News has decided it is not important enough to broadcast (Peters and Koblin 2022).

3 Changing Human Values

The ultimate concern of this chapter is that recommendation algorithms are not simply trying to figure out what is good for you. They are, intentionally or not, influencing what you think is good. They are changing your behavior to realize their goals. When I say that a recommender algorithm has a goal, I do not mean that the algorithm is an agent with its own goals. These goals are given to the algorithm. They are ‘surrogate agents’: despite not being conscious or intelligent, they are able to act on behalf of the human agents that gave them their goals (Johnson and Powers 2008).

Information released by former employees at YouTube has shown that the designers of these recommender systems understand that their algorithms change user behavior – and that this is the point. Platforms like YouTube make money by selling advertisements. They have a financial incentive to keep you engaged on their platform for as long as possible so that you see a maximum number of ads. This incentive has driven changes in the recommender algorithm. For example, in response to users getting bored of watching recommended videos that were simply very similar to things they had already watched, Google built an RS called ReinForce. This algorithm was designed to “maximize users’ engagement over time by predicting which recommendations would expand their tastes” (Roose 2019).
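
The technical details of ReinForce are not public, so the sketch below is only a generic illustration of the underlying logic, with invented categories and reward values: an epsilon-greedy bandit whose sole reward signal is watch time will drift toward whatever content keeps the user watching, and ‘expanding tastes’ simply falls out of reward maximization.

```python
# Hypothetical sketch (not Google's actual system): an epsilon-greedy bandit whose
# only reward is watch time drifts toward whatever holds the user's attention.
import random

CATEGORIES = ["familiar", "adjacent", "fringe"]
value = {c: 0.0 for c in CATEGORIES}   # estimated watch time per category
count = {c: 0 for c in CATEGORIES}

def simulated_watch_time(category: str) -> float:
    # Stand-in for the user: suppose 'fringe' content happens to hold attention longest.
    base = {"familiar": 5.0, "adjacent": 7.0, "fringe": 9.0}[category]
    return random.gauss(base, 1.0)

for step in range(1000):
    if random.random() < 0.1:                   # explore: recommend something new
        choice = random.choice(CATEGORIES)
    else:                                       # exploit: maximize expected engagement
        choice = max(CATEGORIES, key=value.get)
    reward = simulated_watch_time(choice)
    count[choice] += 1
    value[choice] += (reward - value[choice]) / count[choice]   # incremental mean

print(value)   # the policy converges on whatever maximizes watch time
```

Nothing in this loop refers to what is good for the user; ‘good’ is exhausted by the reward signal.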

The idea that algorithms could have the power to ‘expand our tastes’ should be of the utmost concern. Remember – the algorithm is not driving you towards some agreed-upon ‘good’. The algorithm is simply maximizing user engagement. In sum, the algorithm is designed to change what you value so that you spend more time on YouTube. The entire project is premised upon the idea that an algorithm can take some control over what a person values. In other words, recommendations can shape how an individual evaluates. This is an extremely important point. While ReinForce was designed to change values, other recommendation systems may change values without such malicious intent.

The concern is that algorithms could influence how we come to view what a good X is – in light of their recommendations. For example, suppose you were on a hiring committee, were given 20 CVs to review, and your top 5 had zero overlap with the 5 selected by a hiring algorithm. You could either believe that the algorithm was wrong, or you might adjust the standards you use in light of the recommendations (though you may not do this consciously). Though many of us would surely claim that we would never simply take the algorithm’s recommendations as a truth that overrides our own intuitions and standards, the situation is far from clear.

Humans suffer from various biases which cast doubt on whether we will be able to prevent recommender algorithms from influencing our evaluative standards. Automation bias, for example, “occurs when a human decision maker disregards or does not search for contradictory information in light of a computer-generated solution which is accepted as correct” (Cummings 2012). This has resulted in disasters like the 1983 Korean Air flight that was shot down in Soviet airspace after the automated system was incorrectly set up and not monitored by the flight crew (Skitka et al. 1999). We humans tend to be biased towards machine solutions and do not tend to verify every solution offered by these machines. Piling onto that is confirmation bias, which inclines us to look for confirming data rather than disconfirming data (Cummings 2012).

This gets much more complicated with ML-powered recommender systems, as we have very little information with which to confirm, disconfirm, or in any way check the output of such a system. In evaluative cases, there may simply be no way to check. When YouTube recommends a video to watch next, there is little we can use to check whether that is indeed the best video to follow the one we just watched. Having an algorithm recommend the five best candidates in a pool of 1000 is nearly impossible to verify. This is in part because the real evaluation – as discussed earlier – should be made of the evaluative standards themselves. Are these the right standards? With ML we will not know what those standards are. I am not saying that there is some objective set of standards that is perfect. What I mean is that we need to be able to evaluate the standards themselves. Whatever process we use to, for example, hire someone, we must do it in a way that allows us to critique and question the evaluative standards used.

This does not stop us from being influenced by those standards. It will be difficult to understand the effect that this multitude of recommendations will have on us – especially on children. The news, music, videos, people, products, etc. that are recommended to us will no doubt inform our understanding of what ‘good’ is in any given context. The hypothesis is that this will affect our behavior, which will feed back into the algorithms that make recommendations. This vicious cycle would wrest control over the evaluative from human beings and give it to machines. It is necessary to study how people are influenced by recommender systems, and how we can mitigate those influences, before we have lost control over the evaluative.

4 Same Problem with Humans?

Here an objection may arise: humans’ reasons for their recommendations are also not accessible to us. This is what Jocelyn Maclure calls the argument from the “limitations of the human mind”:

Decision-making, either by human beings or machines, lacks transparency. As was abundantly shown by researchers in fields such as cognitive science, social psychology, and behavioural economics, real world human agents are much less rational than imagined by either some rationalist philosophers or by rational choice theorists in the social sciences. (Maclure 2021)

Top AI researchers, including Geoffrey Hinton, have made a similar point:

When you hire somebody, the decision is based on all sorts of things you can quantify, and then all sorts of gut feelings. People have no idea how they do that. If you ask them to explain their decision, you are forcing them to make up a story. Neural nets have a similar problem (Simonite 2018)

It is an intuitive point – especially in light of research by Daniel Kahneman (2013; Kahneman et al. 2021) and Jonathan Haidt (2001). They have made the empirical case that humans do not simply act in light of the reasons they claim to have used. Rather, it is the other way around: they make up the reasons for their actions after the fact. Their actions were influenced by situational factors that were outside of their control and unknown to them. This is supported by studies showing, for example, that judges give harsher sentences before lunch than afterwards (Danziger et al. 2011). The implication is that how hungry a judge is affects sentencing decisions – though no judge has ever cited hunger as a reason for handing down a harsh sentence. So, the objection goes, though it feels like we have good reasons for human decisions, this is simply a myth; therefore, there is no reason not to use machines simply because their process for reaching outputs is opaque.

However, this misses some crucial points. First, if expensive machines simply re-create the problems that we have with humans, then there seems to be no reason to use the machines. The burden is on those pushing for the adoption of these systems to show a good reason to use them. This is especially true considering the wealth of ethical issues associated with these systems – most importantly their environmental costs (Crawford 2021; van Wynsberghe 2021; Robbins and van Wynsberghe 2022).

More to the heart of the matter, Maclure makes the point that institutions and processes are designed to account for human biases and deficiencies: “in non-ideal normative theory, none of these institutions are seen as perfectly capable of neutralizing human foibles, but they can be criticized and continuously improved” (Maclure 2021). The fact that, for example, judges hand down harsher sentences when they are hungry can be mitigated with better scheduling. Institutions and people can be criticized for their failures and can change due to public scrutiny. Furthermore, individuals – especially professionals like judges – must take moral and legal responsibility for their decisions. This is something that RSs – and machines in general – cannot do. So, before we delegate such decisions to agents that cannot accept moral or legal responsibility, an argument would have to be made that decisions like these can legitimately be taken by such agents.

5 Conclusion

Recommender systems are increasingly being used for many purposes. With this chapter I have shown that this is creating a deeply problematic situation. First, recommender systems are likely to be wrong when used for these purposes because there are distorting forces working against them. RSs are based on past evaluative standards, which will often not align with current evaluative standards. RS algorithms must reduce everything to computable information – which will often, in these cases, be incorrect and will leave out information that we normally consider important for such evaluations. The algorithms powering these RSs must also use proxies for the evaluative ‘good’. These proxies are not equal to the ‘good’ and therefore will often go off track. Finally, these algorithms are opaque. We do not have access to the considerations that lead to a particular recommendation. I have argued that it is precisely these considerations that are needed to evaluate whether or not a particular recommendation is good. Without them we are taking the machine’s output on faith.

Second, I have shown that these algorithms can modify how we evaluate. YouTube has modified its algorithm explicitly to ‘expand our tastes’. This is an extraordinary amount of power – and if my first argument goes through, it means these algorithms will be expanding our tastes in a manner that is likely to take us away from the good. This influences our behavior, which feeds back into the algorithms that make recommendations. It is important that we establish some meaningful human control over this process before we lose control over the evaluative.

Finally, I have anticipated the objection that the way we receive recommendations without RSs is also problematic: there is no way to verify that a person’s recommendation of a job candidate, movie, prison sentence, etc. is ‘good’. To this I have replied in two ways. First, if all things were equal and we could verify neither, why would we use a machine – which has environmental and economic costs – over a human being? Second, things are not equal: machines cannot accept the moral responsibility required to make evaluative choices that cannot be verified by human beings. Giving up this control is giving up control over the evaluative – something that requires good reasons which have yet to be offered.