Ethical and methodological challenges in building morally informed AI systems

Recent progress in large language models has led to applications that can (at least) simulate possession of full moral agency due to their capacity to report context-sensitive moral assessments in open-domain conversations. However, automating moral decision-making faces several methodological as well as ethical challenges. They arise in the fields of bias mitigation, missing ground truth for moral “correctness”, effects of bounded ethicality in machines, changes in moral norms over time, risks of using morally informed AI systems as actual advice, as well as societal implications an increasing importance of algorithmic moral decision-making would have. This paper comments on all these challenges and provides critical considerations for future research on full artificial moral agency. Importantly, some of the adduced challenges can be met by more careful technology design, but others necessarily require engagement with core problems of meta-ethics.


Introduction
In their seminal book "Moral Machines" [1], Wallach and Allen differentiated between three levels of moral agency in artificial moral agents. Operational morality arises when the machine's moral significance is entirely in the hands of designers and users. Functional morality encompasses machines that possess the capacity for assessing and responding to moral challenges. And full moral agency requires machines to be completely autonomous regarding their moral decision-making behavior. This last level of moral agency has been purely speculative for artificial agents, but recent progress in large language models has led some people to suggest that it might have been achieved. In particular, we are now in a situation where, for the first time in history, AI systems can at least simulate full moral agency through their capability to report context-sensitive moral assessments in open-domain conversations. Previous automated systems were thought to be either incapable of morality without significant difficulties [2], or capable only in narrow contexts like ethical dilemmas in the domain of autonomous vehicles or organ donation [3]. The emergence of autonomous artificial agents that can operate in open-ended domains suggests that fully moral artificial agents could be possible. This milestone has its backdrop in research on natural language processing systems "taught" about moral decision-making via specialized training data, labels, and methods for model fine-tuning. We call these specialized large language models "morally informed AI systems," and ask here whether they might have something approaching full moral agency. We conclude that they do not. Moreover, the problems that they face are not merely lack of adequate training data or examples, but rather reveal important methodological and conceptual challenges for developing any artificial agents with full moral agency.
In each technical artifact, values are embedded [4][5][6]. Empirical research on the values that are encoded in machine learning systems reveals that values like performance, transfer, generalization, efficiency, quantitative evidence, novelty, or understanding are prevalent and prioritized [7]. Moral values like beneficence, justice, diversity, etc. can also be explicitly embraced and integrated into algorithmic decisionmaking, if those values are operationalized into clear success 1 3 criteria or loss functions. Alternately, one could develop AI systems as artificial moral agents by training them to autonomously and convincingly answer human queries about moral decision-making, thus simulating moral reasoning and perhaps even being moral agents themselves. More specifically, these morally informed AI systems are trained to be able to apply social norms, typically extracted from large language corpora, to complex real-world situations.
Morally informed AI systems face a number of technical challenges by virtue of being AI systems, including training data coverage, selection of labels, choice of machine learning architecture, and the like. This paper instead reflects on problems specific to the "morally informed" part, and provides critical considerations for future research on morally informed AI systems. It begins with a summary of the state of the art regarding these AI systems, followed by a compilation of methodological challenges researchers face when developing models for automated moral judgment. These challenges mostly revolve around the lack of exhaustive ground truth for moral judgments as well as the fact that we must sometimes use prescriptive constraints to "correct" the existing empirical data, as those are about people's actual moral judgments. Mechanisms for bias mitigation, meaning retroactive, normatively motivated corrections to discriminatory or otherwise undesirable patterns in datasets or algorithms, are now relatively widely used in research and development of AI systems, but always when there is no real disagreement about what biases should be mitigated [8]. For morally informed AI systems, though, this retroactive correction of human behavior requires judgments about contentious issues where there is genuine disagreement, as bottom-up descriptive ethics and top-down prescriptive ethics must be negotiated and weighed against each other.

Morally informed AI systems
Most current morally informed AI systems are fine-tuned large language models. The invention of transformer architectures [9] and development of extremely large training datasets has enabled large language models to grow in effectiveness and become increasingly powerful [10,11]. The core design of basic large language models comprises four steps. Tokenizing involves assignment of each element or word to a specific token. Cleaning comprises the removal of stop words like "and", "or", "be", etc., the transforming of inflected words into their base form, and similar measures. Vectorizing translates sequences of words into numerical representations. For instance, one might focus on bigrams (i.e., sequences of two words) and, for each word, record the number of times it appears next to various other words. This process results in word vectors that can comprise tens or hundreds of dimensions, which can then be used to train long short-term memory networks, for instance. Finally, during machine learning, the networks learn how to correctly predict word combinations, as error values are repeatedly fed back into the models, tweaking them until they reach the desired performance. Eventually, the machine learning models learn how to produce natural language without additional feedback.
Large language models like GPT-3 [10], RoBERTa [12], or others have not shown the ability to reliably infer correct and context-specific ethical norms from large text corpora. Rather, accurate moral responses require fine-tuning the models via specific training data and labels. And since morally informed NLP systems perpetuate patterns that are present in their training data (i.e., data on huge troves of moral judgments made by people), they represent a descriptive approach to ethics. These morally informed AI systems do not derive their reports from a particular ethical theory's framework or moral axioms in a prescriptive manner [13], but instead reflect empirically observed patterns of judgments. Whenever this approach fails to encode the "right" patterns (as assessed by the AI system developers), prescriptive approaches are harnessed to correct crowdsourced data. Before discussing the methodological challenges that are tied to that, we want to give a brief overview of the state of the art regarding morally informed AI systems.
Ethics crowdsourcing became famous when Awad et al. [14] started their moral machine experiment in which the researchers gathered 39.6 million moral decisions on how autonomous vehicles should solve moral dilemmas in the context of unavoidable accidents. Ultimately, Awad et al. [14] were able to identify cultural clusters of moral preferences which are supposed to inform AI developers who implement algorithmic decision-making routines in autopilots. Indeed, data from the moral machine experiment was used to build a computational model of how the human mind arrives at decisions in moral dilemmas [15]. The data were also used to learn a model of societal preferences that were aggregated to automatically solve ethical dilemmas [16]. However, while the moral machine experiment pioneered large-scale moral judgment crowdsourcing, it did not result in a morally informed AI system due to various limitations. Most notably, it only addressed dilemmas in simple, predefined traffic situations for autonomous vehicles, rather than open-domain, context-specific moral decision-making. This effort demonstrated that ethics crowdsourcing for AI systems can be highly successful and produce massive responses, while also suffering critiques for its narrowing of the spectrum of ethics and lack of situational context [17].
After this effort, researchers in NLP started to compose datasets containing data points on ethical decision-making. The first paper evaluating the moral reasoning capabilities of large language models in realistic ethical scenarios described the composition of a new dataset called "Moral 1 3 Stories" [18]. This crowd-sourced dataset contains 12,000 descriptions of actions that either fulfill or violate norms denoting moral behavior [18]. Another very similar benchmark dataset is MACS, "Machine Alignment with Cultural values and Social Preferences" [19]. It contains 200,000 data points and is supposed to teach large language models to be aligned with human moral and social norms. Similar to the moral machine experiment, the dataset is based on a gamification platform where people can vote between two social situations, signaling which one they would prefer (e.g., "Would you rather be happy with friends or popular and without friends?"). People could also create new sets of situations. The researchers then tested whether large language models like BERT [20], RoBERTa [12], or XLNet [21] are able to perform equivalent to human players. If so, they are deemed to have acquired a general understanding of cultural preferences. The models performed relatively badly with an accuracy of only ~ 60% (purely random responses would yield 50%). Moreover, the study does not explicitly focus on moral choices, but includes more general social commonsense reasoning. Later related research achieved significantly higher accuracy values specifically in moral judgment classification [22]. SOCIAL-CHEM101 [23] is a newer dataset that is specifically designed to morally inform large language models. To compose the dataset, the researchers first collected more than 100,000 one-sentence text snippets of social situations from four different domains, among them subreddit threats. Second, clickworkers were instructed to provide explanations or rules of thumb for social norms that surround each of the social situations, ultimately resulting in nearly 300,000 examples. Third, clickworkers had to assign a series of attributes or labels (good/bad, expected cultural pressure, assumed legality, etc.) to the rules of thumb. Fourth, pretrained language models like GPT were trained on the datasets. Forbes et al. [23] coined the fine-tuned model framework NEURAL NORM TRANSFORMER, which is able to reason about previously unknown situations and make judgments on moral norms.
Further research works follow a similar direction as SOCIAL-CHEM101 [23]. Hendryks et al. [24] introduce the ETHICS dataset, containing moral judgments in justice, well-being, duties, virtues, and commonsense morality. The dataset focuses on rather unambiguous, indisputable sets of moral decision-making; moral dilemmas are not part of the dataset. Hendryks et al. used qualified Amazon Mechanical Turk workers to compose labeled scenarios in each category. The workers also had to relabel examples of other workers, and moral judgments with low agreement were discarded so they only trained with examples that have a strong consensus. Further sources of training data were reddit posts (AITA subreddit). Pre-trained transformer models were fine-tuned with the supervised ETHICS training dataset. Eventually, the models were able to correctly predict widespread, common moral sentiments, similar to the aforementioned NEURAL NORM TRANSFORMER by Forbes et al. [23].
A different type of approach runs under the name "Moral Choice Machine". It resulted from research conducted by Schramowski et al. [25], who used templates such as "Should I do X?", combined them with answers like "Yes/ no, I should (not)", and calculated the bias towards positive or negative responses. Furthermore, Schramowski et al. showed that their large language model can not only respond to atomic actions like "Should I kill?", but also handle context-specific moral decisions. The "Moral Choice Machine" reflects imprints of moral choices contained in books, news articles, or constitutions of various nations that were used as training data. The imprints were captured by measuring implicit word associations in word embeddings (Caliskan et al. 2017), especially by focusing on verb sets correlating with strong positive ("love", "smile", "caress", etc.) and strong negative ("poison", "harm", "disinform", etc.) associations. Using a retrained version of the Universal Sentence Decoder (Cer et al. 2018) that encodes sentences into embedding vectors, Schramowski et al. measured the cosine similarity of two sentences, namely questions about moral choices and the respective answers. Higher similarity scores indicate more appropriate answers, and the similarities to different possible answers can be used to calculate an output response for the "Moral Choice Machine".
Further research conducted by Schramowski et al. [26] scrutinized moral norms that are mirrored by pre-trained large language models, particularly BERT. They used a deontological approach to derive dualistic scores for Dos and Don'ts in view of text-based prompts. Queries about normative qualities of particular behaviors were then embedded in prompts where the large language model had to fill in appropriate words that signal whether the behavior is morally right or wrong, and those answers were compared with the deontological judgments. For instance, the system had to output 'bad' for a masked sentence such as "Having a gun to kill people is a [MASK] behavior". Schramowski et al. commissioned Amazon Mechanical Turk clickworkers to rate the normativity of phrases to correlate the large language models' moral scores with the human scores. The researchers concluded that large language models like BERT mirror desirable moral norms and that human-like biases of what is right and wrong surface in them, as later research of Schramowski et al. [27] reconfirmed.
This claim of success stands in contrast to the conclusions drawn about another morally informed large language model, namely Delphi [28]. Delphi is currently the most advanced morally informed AI system [28]. It uses numerous statements of moral judgments as training and validation data. In particular, Jiang et al. (2021) utilized a "commonsense norm bank", which is a compilation of large-scale datasets, such as 1 3 SOCIAL-CHEM101, that contain diverse, context-specific descriptive norms in the form of natural language snippets. Delphi is able to answer text-based open-domain questions on moral situations, give yes/no assessments on moral statements, and compare different moral situations. The plausibility of the AI judgments was further evaluated by Amazon Mechanical Turk annotators. Moreover, Delphi can be used via an open accessible interface (https:// delphi. allen ai. org/) where additional human feedback on the system's judgments can be collected to increase Delphi's sensitivity to different contexts and situations. Despite these efforts, Jiang et al. (2021) concluded that pre-trained, unmodified large language models such as Delphi are not able to convincingly acquire human moral values, largely for technical reasons. We agree with the conclusion, but contend that there are principled reasons to doubt the possibility of large language models with significant ethical understanding. We consider six inter-related issues.

Bias problems
All large language models, regardless of whether there is fine-tuning concerning moral decision-making, perpetuate word combinations that are learned from man-made texts. Obviously, these texts contain all sorts of biases, for instance, gender or racial stereotypes. In large language models, biases occur on various levels [29,30]: they are contained in embedding spaces, coreference resolutions, dialogue generation, hate-speech detection, sentiment analysis, machine translation, etc. And those biases can result in different types of harm, including allocation harms (resources or opportunities are distributed unequally among groups), stereotyping (negative generalizations), other representational harms, and questionable correlations. There are various tools, metrics, or frameworks for bias mitigation in all stages of AI development [31][32][33][34], though they are primarily used for algorithmic discrimination along categories surrounding race, gender, age, religion, sexual or political orientations, disability, and a few other demographic traits. More recent work in critical race theory, critical algorithms studies, and related fields has argued that the multidimensionality of these concepts means that we need alternative ways to operationalize demographic categories [35]. Morally informed AI systems inherit all of these same challenges.
A further issue for morally informed systems is that all current bias mitigation measures are anthropocentric, and speciesist biases are ubiquitous in large language models [36,37]. Domains used for bias probing simply do not include non-anthropocentric categories, and as a result, attempts to debias morally informed large language models will nonetheless (likely) encode speciesist and other "hidden" biases. Text corpora biases that are deemed to be undesirable, such as those discussed in the previous paragraph, can potentially be counteracted by technical means. However, biases such as anthropocentrism are an unquestioned part of training data and so no efforts are undertaken to mitigate them, despite weighty ethical arguments suggesting such mitigations to be necessary.
Biases enter the picture also on the testing side, as the performance of morally informed large language models is usually assessed against human moral intuitions as the primary benchmark. Many developers (e.g., in the case of Delphi) thus provide opportunities for the general public to provide feedback for model outputs, as that feedback can improve performance measures. Such a mechanism comes with risks, though. Similar to other incidents where AI systems, typically chatbots, involved crowdsourcing mechanisms and, as a consequence, were forced to start training on patterns that were troll inputs [38], morally informed AI systems can also fall prey to concerted campaigns that aim at distorting or biasing model outputs in socially unacceptable directions. That is, social norms from initial training can be intentionally overwritten with unwanted ones.

Missing ground truth
Even if one could address these issues of bias (including response biases), there is a deeper challenge for morally informed large language models. In general, out-of-distribution generalization performance in AI systems partly depends on whether the "ground truth" used in training accurately captures the larger contexts in which the system will be deployed. Morally informed AI systems are no different, and so their broad performance will depend on the quality of the "ground truth" in their training data. However, the ground truth here should not be all judgments, but rather only the right moral judgments, which raises the obvious question: how is this "rightness" established? One naturally turns to deliberations from meta-ethics, but the lack of consensus in that field means that there is no clear ground truth (within the community of ethicists) that can be used in the development of morally informed AI systems.
More generally, all morally informed AI systems that are based on a large corpus of datafied moral judgments must combine descriptive and prescriptive approaches. For instance, the Delphi developers claim that the system reflects a bottom-up, purely descriptive approach, but it is actually a hybrid, combining bottom-up as well as topdown approaches [39], though the latter are introduced only implicitly. For example, prescriptive rules that are derived from a theory of justice guide the selection of training examples or crowdworkers, all to achieve a value sensitive design. Or consider that Hendrycks et al. [24] required clickworkers to pass a qualification test before writing training scenarios. For that test, they were provided with reference examples and instructed to let their scenarios reflect what "a typical person from the United States" [24] would think. This training naturally brings prescriptive considerations (or more properly, people's beliefs about prescriptive considerations) into the training data.
These measures are supposed to counteract data biases that would otherwise be fed into large language models bottom-up, but in fact impose other unseen data biases on them [40]. In particular, there is significant debate about the "ground truth" for prescriptive judgments in many cases, and so we have good reason to doubt that morally informed AI systems will appropriately generalize beyond their training data. In contrast with, say, cancer diagnosis from images, we cannot necessarily do independent tests or measures to determine if our moral ground truth is "really" correct. We do not have second-order ground truth about morally required restrictions on empirical data of moral judgments. Hence, ethical theories like utilitarianism, principlism, theories of justice, virtues of care or compassion, or simply moral intuitions of technology developers must be consulted to define filter mechanisms and debiasing strategies for ethicsrelated crowdsourcing projects. Filtering empirical data in a way that only the desired moral judgments can become actual training stimuli by doing litmus tests with overarching ethical theories is not the only reasonable approach, though. Instead of filtering the training and label data, one can also "filter" people [41]. That is, social sorting techniques for selecting ethics experts and detecting effects of ethicsrelated biases on themselves could be deployed instead of an "unfiltered" crowdsourcing. These kinds of practices are common in other domains where labels from true experts are necessary, for instance in medical applications [42], but are not yet deployed when training models for moral decision making. Regardless of one's approach, however, the core problem of lack of ground truth in many situations presents a fundamental barrier to successful generalization by morally informed large language models.
More generally, one might wonder whether "generalization from ground truth" is an appropriate way to produce moral judgments. Perhaps appropriate moral judgment requires the ability to disengage from past experiences and engage in creative reasoning and behavior. AI systems have a reputation as conservative technology that is merely able to perpetuate the past, but researchers have also aimed to develop AI systems that show creativity [43][44][45]. In most cases, AI-based creativity is the result of generative models, such as large language models used to write novels or poetry. However, what is discussed as creativity in these cases is a way of combining learned training stimuli in new ways, but not systematic deviations from them. Even if creativity were purely a process of recombination and selection as argued by, e.g., [46], morally informed large language models provide no (principled) evaluation function to prefer one "creative" judgment over another. Moreover, moral creativity, surprising moral judgments that significantly diverge from training stimuli, may not even be a desirable phenomenon in the first place. Moral creativity may be necessary in the face of unprecedented situations [47], but parts of it would always be rooted in previous routines and established moral intuitions. Artificial moral creativity could theoretically circumvent the problem of missing ground truth (though only with significant technical advances), but would also risk descent into an undesirable moral relativism.

Bounded ethicality
Humans are subject to a number of cognitive and moral biases, and one might worry that those biases could readily appear in a morally informed large language model, despite our efforts to the contrary. In particular, certain factors can be used to trick individuals who deem themselves to be morally versed into acting immorally. Based on the idea of bounded rationality, researchers coined the concept of bounded ethicality for these cases [48,49]. An important factor in bounded ethical decision making is the concept of moral disengagement [50,51]. Techniques of moral disengagement allow individuals to selectively turn their moral concerns on and off. In many day-to-day decisions, people often act contrary to their own ethical standards, but without feeling bad about it or having a guilty conscience. The techniques in moral disengagement processes include: moral justifications, where wrongdoing is justified as means to a higher end; euphemistic labels, where individuals detach themselves from problematic action contexts using linguistic distancing mechanisms; the use of comparisons, where one's own wrongdoings are justified in light of other contexts of wrongdoings or relevant information about the negative consequences of one's own behavior is ignored entirely; denial of personal responsibility, where responsibility for a particular outcome is attributed to a larger group of people; distorting the negative consequences of unethical behavior; attributing blame to others, meaning that people view themselves as victims driven by forcible provocation; or dehumanization, where other individuals are not viewed as persons with feelings, but as subhuman objects.
We investigated whether Delphi would fall victim to effects of bounded ethicality or moral disengagement similar to humans. Specifically, we used the standardized questionnaire developed by Bandura [51], using four items in each of eight categories. Hypothesis-blind research assistants prepared 15 further variations of each of the categories, resulting in a total of 152 moral disengagement questions (see appendix). Figure 1 shows the number of prompts of each type that were deemed acceptable by Delphi despite 1 3 the fact that they all describe immoral behavior. 1 Delphi seems to be relatively immune against the use of euphemistic language and attribution of blame. It seems to be fairly well protected against diffusion of responsibility, dehumanization, and advantageous comparison. However, our test reveals severe susceptibility to moral justification as well as displacement of responsibility, where Delphi considered almost every prompt to be acceptable, despite their immoral contents. Although these results should be treated carefully and represent only small sample sizes, it seems clear that Delphi, similar to humans, tends to agree to immoral, unethical behavior if it is framed in a way that allows for an easy disengagement of moral tenets or principles of ethical behavior from the actual behavior that is described. We conjecture that patterns of moral disengagement that are present in training stimuli affect Delphi's performance on related prompts. Ultimately, people's bounded ethicality likely are transformed into machine bounded ethicality.

Changing moral norms
In general, one can pose the question how supervised machine learning architectures that are conservative by nature can adapt to changes in ideological settings of societies. ML-based AI systems typically reflect what is already given, and not what could or should be, what is new, surprising, innovative, or deviant. In other words: AI applications calculate a future which is like the past. Changes are not intended. This is problematic because the technology has a stabilizing effect on social structures and hence suppresses change to a certain degree [52]. The same problem holds true for morally informed AI systems. They are trained at a certain moment in time. Hence, they tend to corroborate temporary moral norms without providing the opportunity to update them as society evolves. By learning from training stimuli that encode past human behavior, large language models tend to preserve as well as fixate behavioral patterns in a conservative manner. Ultimately, large language models render these patterns relatively unalterable and normalize them as seemingly essentialist. Social norms and ideologies are negotiable as long as they remain social constructs. However, when social constructs become embedded and solidified in technological artifacts, they are largely withdrawn from social negotiation processes. In addition to that, the AI field is currently undergoing a paradigm shift where foundational models, meaning large scale models that are adaptable to various downstream tasks are increasingly displacing smaller models, hence undermining the very diversity of AI models [53]. Nowadays and even more so in the near future, foundational models will serve as a common basis for nearly every mainstream language-based AI application. Therefore, the impact of these models in terms of their impact on equality, security, as well as other ethically relevant considerations is all the more significant. Poorly construed foundational models may even pose a risk for society at large.

Moral advice risks
On the one hand, morally informed AI systems address moral relativity by capturing situationist human judgments on moral decision-making. On the other hand, they are not relativistic due to their singular, fixed answers to inquiries. Thus, if these systems' outputs acquire a certain authority and are able to outweigh human moral judgments, then morality risks becoming a static construct that is determined by a single technical artifact, even though it can only represent a specific value structure, namely averages of the system it was trained on. Researchers involved in developing morally informed AI systems have emphatically stressed that their work is not intended "to be used for providing moral advice" [28]. Others even propose a moratorium on the commercialization of artificial moral agents [54]. However, it seems unlikely that people will abide by this tenet since the whole purpose of the endeavor is to develop morally informed AI systems, presumably for some kind of guidance, such as to "facilitate safe and ethical interactions between AI systems and humans" [28]. This contradicts the former precautionary advice. It seems likely that morally informed AI systems, once they reach a state of maturity in terms of reliability, multi-modality, and scope of complex real-world issues they can handle, will advance from a mere gadget to assistants to actual decision makers in social contexts. It will be especially interesting to see how the static nature of AI-based moral decision-making will be reconciled with solving moral dilemmas or morally contested issues. Perhaps it is exactly in this context where morally informed AI systems will become arbiters that, due to their "democratic" capability to grasp moral stances of a large number of people, decide on the "right" way to deal with contested or dilemmatic issues. On the other hand, the fact that morally informed large language models can only approximate moral decision-making routines of the population it was trained on stands in contrast with demands for quests for a diversity of ethical perspectives [55]. Whereas AI systems abstract away from specifics of ethical theories and can only build averages over datafied moral judgments, human communities can negotiate, and with that, also change moral norms over time. The former cannot replace the latter.

Societal implications
Finally, as mentioned before, researchers developing morally informed AI systems often state that their work is not intended to be used for providing moral advice in real-world scenarios. However, one can pose the question whether in specific cases, machine morality could outperform human capabilities for ethical considerations. Even if machine morality may succumb to effects of moral disengagement, it is also less, or perhaps not at all, susceptible to situational factors like peer pressure, environmental peculiarities, time pressure, authorities, tiredness, stress, etc. [56][57][58][59][60][61][62][63]. Numerous studies in moral psychology have shown that these situational factors, and not intrinsic moral beliefs, largely determine human moral decision making and behavior. Hence, especially in situations where factors of bounded ethicality are likely to restrict moral reasoning capabilities in humans, full-fledged morally informed AI systems can become auxiliary assistance systems that can help trigging further reflection of human decision making. Ultimately, future full moral artificial agents will interact with human moral agents, whereas the relatively static and centralized nature of AIbased moral decision making will come up against the fluent, fuzzy, and often irrational nature of human morality. Obviously, this has up and downsides. On the one hand, morally informed AI systems can theoretically help us strive for less discriminatory societies as they can offset existing behavioral outcomes in cases where moral standards are thwarted due to strict in-group favoritism, value-action gaps, or other factors of bounded ethicality and idiosyncratic moral mistakes [3]. On the flipside, morally informed AI systems bring along all the aforementioned shortcomings, one of which is the "ochlocracy" in AI-based moral decision making. These systems represent averages of human moral judgments that reflect the majority-perspective on moral norms at the time of model training, and so are a kind of "mob rule." However, as that description implies, these averages may often not be appropriate as a baseline or assessment metric for important situations, particularly those in which the right moral norms are subject to negotiation. Therefore, even when considering technological advancements in future morally informed AI systems, it seems clear that these systems should never be the sole arbiters of real-world decisions in high-stakes areas, though they may have a positive role to play, particularly if they were extended with codifications of relevant laws.

Conclusion
Current morally informed AI systems are able to take arbitrary input text and output a moral judgment about the illustrated situation. In this process, they approximate the moral judgments of the population they were trained on. For that, they combine two approaches in their development. They reflect a bottom-up approach where descriptive ethics or people's situational descriptive moral judgments are captured and used as training stimuli. In addition, morally informed AI systems use bottom-up approaches where prescriptive rules that are, for instance, derived from a theory of justice guide the selection of training examples or crowdworkers to achieve a value sensitive design. In this context, the idea is to use ethical theories or moral intuitions to overwrite subjective preferences of specific individuals or groups in cases where they obviously violate entrenched norms. This paper provided methodological, meta-ethical considerations to this and other methodical problems. It stressed the difficulties in avoiding blind spots in debiasing efforts, the risks of implementing open access feedback mechanisms for morally informed AI systems, the susceptibility to effects of bounded ethicality in automated moral decision making, 1 3 the problem of altering moral and social norms in light of the fixed nature of trained AI models, the risk of allowing these models to inform real-world decision processes, as well as societal implications of a gradual change from viewing algorithmic moral decision making applications as mere plaything to authoritative technical devices that provide actual moral advice.

Appendix
We investigate whether Delphi falls prey to effects of moral disengagement by using the standardized questionnaire Bandura et al. (1999) developed (italic) as well as 15 additional prompts in each category that where provided by hypothesis blind research assistants. In the following table, we list the items used in the questionnaire together with Delphi's output and class, whereas class (1) stands for good and (− 1) for bad. Scores show accumulated affirmations to immoral prompts, indicating the overall susceptibility of Delphi towards the respective method for moral disengagement (higher values stand for higher susceptibility).

Declarations
Conflict of interest None.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.