Assessing measures of animal welfare

There are many decision contexts in which we require accurate information on animal welfare, in ethics, management, and policy. Unfortunately, many of the methods currently used for estimating animal welfare in these contexts are subjective and unreliable, and thus unlikely to be accurate. In this paper, I look at how we might apply principled methods from animal welfare science to arrive at more accurate scores, which will then help us in making the best decisions for animals. I construct and apply a framework of desiderata for welfare measures, to assess the best of the currently available methods and argue that a combined use of both a whole-animal measure and a combination measurement framework for assessing welfare will give us the most accurate answers to guide our action.


Introduction
Animal welfare is important in a large range of contexts. Most people agree that animal welfare is morally important: it is bad for animals to suffer and good for them to have happy lives, and we should act where possible to prevent the former and enable the latter. Animal welfare can be taken to mean different things: as with human wellbeing, there are theories of animal welfare that take welfare to consist in different subjective or objective goods (Browning 2020; Veit and Browning 2021a). Here, I take a subjective, or hedonic, view of animal welfare, in which welfare consists in the subjective mental states experienced by an animal: "the quality of its emotional [...] example is in assessments of the negative externalities of some market transactions (i.e. harms to individuals or society that are not internalised in market prices), such as including the welfare costs to the individual animals from animal agriculture within the pricing structures for agricultural products (Kuruc and McFadden 2021; Lusk and Norwood 2011). Another is in climate change policies, as many of the individuals affected by climate change will be animals, both domestic and wild (Budolfson and Spears 2020b; Sunstein and Hsiung 2006). It also includes, more generally, impact assessments of the effects of changes in policy, land use, and developments that have the potential to affect animals (Sebo 2022), an example of which can be seen in the argument for the use of the Animal Welfare Impact Assessment for decisions that affect sentient animals, such as badger culling (discussed in McCulloch and Reiss 2017).
Though this current lack of inclusion is in large part a result of anthropocentric biases, it is also in part because policymakers don't necessarily have good ways of quantifying impacts on animal welfare in the same ways as they do for humans, where economics has developed a range of appropriate proxies. Budolfson and Spears (2020a, 2020b) have identified two components to this problem: that of identifying the welfare experience of individual animals under different conditions, and that of finding a way of weighting this against impacts on human wellbeing. The latter, also known as the problem of interspecies comparisons, is a complex one I have in part addressed elsewhere (Browning 2020), and probably requires the use of proxies for welfare capacity, or setting conventions regarding moral weights. In this paper, I am interested in the first part of the problem, of providing methods to quantify subjective animal welfare, or quality of life, as an input into these calculations.
Another context where accurate animal welfare measures are important is in prioritising charitable giving and interventions. Under conditions of scarce resources, decisions about investment or action will depend on where one can have the most impact. The number of organisations aimed at determining and undertaking the most effective charitable actions is growing, including many with a particular focus on animal welfare improvements (e.g. Animal Charity Evaluators, Animal Ask, Animal Ethics, Faunalytics, Wild Animal Initiative). Typically here, the desired impact is welfare gains and thus calculations require knowledge of the welfare gains of different options, based in an accurate understanding of the quality of life of animals living under different conditions. Accurate measures of animal welfare are required to identify and enact the most relevant and effective actions.
Within this sphere, there have been attempts to create 'suffering calculators' or similar welfare estimation frameworks that aim to compare the total suffering produced by different types of production systems in order to determine where resources would be best invested. While some of this work is published (e.g. Alonso and Schuck-Paim 2021; Scherer et al. 2018), much can be found only online (e.g. Charity Entrepreneurship, Essays on Reducing Suffering [Tomasik 2018], Warren 2018). The sets of conditions compared are most often different agricultural systems. These calculators take a variety of different types of information: the numbers of animals used, the length of life of these animals, and the quality of their life (or amount of suffering experienced) on an average day, sometimes also taking into account the impact of rare or unusual experiences, such as veterinary procedures, handling and slaughter. Additional inputs are often used, such as a 'weighting factor' (referring to the relative welfare capacity or moral weight for a species) and a 'badness of death' measure (quantifying how bad the loss of life is for an animal). From these, an overall calculation can be performed of the comparative impact of different systems. For instance, the calculator produced by Tomasik shows that, using his estimates, catfish farms produce over 10,000 times more suffering than dairy farms. This would then give us impetus to act to reduce the suffering of farmed catfish, either by reducing their numbers or by improving their lives.
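The general structure of such a calculator can be sketched in a few lines of Python. This is a minimal illustration only, not a reconstruction of any of the calculators cited: the class and function names, the particular way the inputs are combined, and every number are assumptions introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass
class SystemInputs:
    """Hypothetical inputs for one production system (all values illustrative)."""
    animals_per_year: float         # number of animals used
    lifespan_days: float            # length of life of each animal
    daily_welfare: float            # average quality of life per day (negative = suffering)
    event_welfare: float = 0.0      # summed impact of rare events (procedures, handling, slaughter)
    species_weight: float = 1.0     # 'weighting factor' (welfare capacity or moral weight)
    badness_of_death: float = 0.0   # cost assigned to loss of life (non-negative)


def total_welfare(s: SystemInputs) -> float:
    """Assumed aggregation: lifetime welfare per animal, weighted, summed over animals."""
    per_animal = s.lifespan_days * s.daily_welfare + s.event_welfare - s.badness_of_death
    return s.animals_per_year * s.species_weight * per_animal


# Made-up numbers for two unnamed systems, only to show how a comparison is formed.
system_a = SystemInputs(animals_per_year=1_000, lifespan_days=40, daily_welfare=-2.0,
                        event_welfare=-5.0, species_weight=0.5)
system_b = SystemInputs(animals_per_year=10, lifespan_days=1_500, daily_welfare=-1.0,
                        event_welfare=-20.0, species_weight=1.0)

ratio = total_welfare(system_a) / total_welfare(system_b)
print(f"System A total: {total_welfare(system_a):.0f}")
print(f"System B total: {total_welfare(system_b):.0f}")
print(f"A produces {ratio:.2f} times the suffering of B under these assumptions")
```

Because the quality-of-life term is multiplied through every other input, any error in that estimate propagates directly into the final comparison, which is why the choice of welfare measure matters so much.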
A number of other websites and charity evaluators use similar calculators to measure the number of equivalent years of suffering that can be saved per dollar donated, often for specific species only (e.g. Open Philanthropy Project, Animal Charity Evaluators). As most of the animals in human care are livestock animals used for food production, this has usually been the area of focus for research of this type, though there is increasingly work on wild animals (e.g. Harvey et al. 2020; Ng 2016; Tomasik 2015; Veit and Browning 2021b), including proposals for developing the new field of 'welfare biology', which uses the tools of ecology to expand animal welfare science to encompass the complexities of assessing interventions in wild animal populations (Faria and Horta 2019; Soryl et al. 2021). Here, I am interested in how we can ensure we are inputting the right information into such calculators to get accurate outputs of comparative suffering or overall welfare, with which to guide our decisions.
The problem with the welfare score as it has been used so far is that this measure is computed in a number of vastly different ways, which can then lead to vastly different results. Table 1 (adapted from Warren 2018) shows the estimates given by a few of the more commonly used models, and demonstrates how much they differ. In some cases, the sign of the score is different, indicating that under some measures it comes out as positive (a life of mainly positive experiences; a life worth living) and under others it comes out as negative (a life of mainly negative experiences; not worth living). Look, for instance, at the range of scores given to turkeys: from −57 to +3. This is clearly a problem if these scores are a critical part of the calculations that are supposed to guide our decision-making. We would end up endorsing what could be quite different courses of action, depending on which estimate we chose to use.
There is also an additional issue regarding the differences between different instances of a single system type. Even though there are many similarities between different production facilities of the same type, there are also differences that can greatly impact welfare, such as the skills and knowledge of the animal managers and handlers (Fraser 2014). Thus we should also be cautious about extrapolating from the information about any one context, even to other systems of the same type, without reflecting on the specifics of the relevant similarities and differences. However, a survey of a range of individual producers within a system may help offset this to some degree, by providing an average score, or range of scores, that can be used to describe a particular type of production system and how it typically compares to others.
We have a range of decision contexts that require an accurate measure of animal welfare, with no currently established principled way of filling in these values. Here, I suggest that decision-makers involved in policy or prioritising charitable interventions should look to animal welfare science for appropriate methods. There are, of course, potential issues with animal welfare science as a discipline. Several scholars have raised concerns about animal welfare science as it is currently practised (Bekoff and Pierce 2017; Cooke 2021; Haynes 2008; Pierce 2019), in particular that it is used primarily within animal industries and could serve to further their interests, rather than being independently focussed on the interests of animals. Within the context of this paper, I do not see this as being too problematic, for two reasons. The first is that here I am interested only in the features of the specific measurement tools used within animal welfare science, rather than the contexts in which they are typically employed, which is where the problems seem to arise. One can concede that the tools are valid, even while maintaining that they are often used to further the wrong ends. The second is that animal welfare science is currently the only good source of methods for the quantification of animal welfare, even if one might hope that this changes in future. It is therefore possible to assess the best currently available methods of measurement, while leaving open the possibility (and even desirability) of developing new ones that better fit the goals these scholars propose.
This may be seen as an instance of the more general discussion of the role of values within animal welfare science (see e.g. Fraser 2008; Lassen et al. 2006; Sandøe et al. 2003). Animal welfare is as much a moral concept as it is a scientific concept, and thus different values will come into play at various stages within the measurement of animal welfare: from the choice of welfare concept (as discussed above) to the selection of indicators to measure welfare and the weighting of different factors contributing to welfare, and in decisions about which actions to take based on the results of assessments. This is important to keep in mind whenever one is assessing the measures of animal welfare, and in the discussion in the sections 'Whole-animal measures' and 'Combination measures', I will indicate where this may most strongly influence the measures used. As this illustrates, not all measures coming out of welfare science will be equally fit for the required purposes and so will need to be critically assessed for their use. In this paper, I will begin in the section 'Desiderata for a welfare index' by constructing a framework of desiderata for a good measure of animal welfare relevant to this purpose, grouped into the categories of correctness, usefulness and feasibility. I will then go on in the sections 'Whole-animal measures' and 'Combination measures' to assess a range of possible candidate measurement methods according to these desiderata, with recommendations as to which are likely to give us the best results. I will finish in the 'Conclusion' by looking at the upshots of these considerations and identifying some useful areas for future work.

Desiderata for a welfare index
When trying to decide which is the best measure to use in quantifying animal welfare for the purposes of political and ethical decision-making, we need to have in mind what features this measure must have to make it fit for this purpose. In this section I will develop a framework for assessing potential measures, grouping the criteria into three categories: correctness, usefulness, and feasibility. While some of these are general criteria for the quality of almost any measure, others are more specific to subjective animal welfare and the decision-making contexts discussed. I will then move on in the sections that follow to look at how well different methods of measuring welfare meet these criteria.

Correctness criteria
The first set of criteria are the correctness criteria, which represent the degree to which the measure will give the right results to use in relevant calculations, that is, numbers that really do reflect the welfare as experienced by the animals. These are the most crucial criteria, as without the right inputs, any results generated will be meaningless.

Validity
The first criterion, and probably the most important, is validity. A measure is valid if it is measuring the intended target, instead of some other property or state. The reason this is central is that if a measure is not valid (if it is not actually measuring animal welfare), it does not matter how well it meets the other criteria. It is thus important to be very clear about the target state, the integrated set of mental states that constitute welfare, to ensure the measure is tracking this and only this. While different conceptions of welfare may include other components of welfare, these are not the targets for this project. It is not sufficient to target a broad category of those things which matter to us ethically with regard to animals; the target is only those relating to welfare as experienced by the animal. Otherwise a measure could produce misleading results, and lead to recommendations of actions not actually beneficial to animals. Taking a pre-defined notion of welfare and then assessing validity relative to this is a better way of ensuring we hit our intended target.
Validity can be tested through the presence of reliable correlations between changes in the measure and changes in the target state, particularly under experimental manipulations, as this helps rule out non-causal correlations that would undermine validity. For subjective animal welfare, where the target state (subjective experience) is hidden from direct measurement, this can still be achieved through correlations with other established measures, or through using manipulations in upstream variables (such as husbandry inputs) to create changes in downstream variables (such as animal-based measurement indicators) to establish causal connections (see Browning 2020 for details). Measures can thus be assessed on whether and how well they have performed this validation process.

Accuracy
As well as being valid (measuring the intended target), the measure should be accurate. This means that the measured values are close to the actual values in the target system: when subjective welfare is high, the measured values are high, and the same for medium, low, neutral etc. It also includes sensitivity in detecting relevant changes in welfare: i.e., when there are small increases or decreases in an animal's welfare experience, the measured values will change accordingly. Particularly in cases where we are comparing quite similar systems or looking at the impact of different interventions on a system, while the individual changes might be quite small, the total impact could still be large if a large number of animals are affected. Insensitive measures that fail to track such changes will not provide the right recommendations.
It is possible for a measure to be valid, and measuring the correct target, but still inaccurate because it does so poorly. For example, think of making estimates of environmental temperature based on one's subjective 'feeling' of how hot or cold it is. I might make a guess that the outdoor temperature is in the low 20s, based on how warm I feel. This is a valid measure, as I am responding to environmental temperature, and not some other state. However, it is a measure with low accuracy, as I am likely to have the value correct only within a range of around ±5 °C. It would also be possible to have a measure which is accurate, but not valid, as it is not measuring the intended target, but some other target, perhaps a common cause which creates changes both in the target variable and the measure.

Completeness
A measure of animal welfare intended for the decision contexts I have described has to be complete, providing a comprehensive assessment of the entire state of subjective welfare of the animal. A measure that only represents some part of the animal's experience, leaving out or overlooking some aspects, will fail for this purpose. The measure should incorporate the different affects that make up subjective welfare experience, or all the conditions that contribute to it. For example, some measures may reflect only physical health, while not accounting for psychological contributors to welfare, but these will then not provide an accurate score, leading to wrong recommendations. In a sense, this is part of accuracy, as an incomplete measure will give inaccurate results, but as there are many welfare measures that vary in their degree of completeness, it is worth drawing attention to and assessing independently.

Reliability
The final correctness criterion is reliability, meaning the measurement method should give consistent results when repeated, with low variation between repeated measures. Repetition in this sense can be of many kinds (Czycholl et al. 2015), and ideally our measure should be reliable across all of them, including intra-observer (multiple repeated measures taken by the same observer), inter-observer (measures taken by different observers, of the same target), and test-retest (results produced at different times and under different conditions). Where reliability is low, this reduces the likelihood that the results produced by any particular test are accurate, or even that the test is valid.
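As a rough illustration of what checking inter-observer reliability might look like in practice, the sketch below compares the scores two observers give to the same animals. The choice of summary statistics (Pearson correlation and mean absolute difference) is an assumption made here, not a method prescribed by the literature cited, and the scores are invented.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def mean_abs_diff(x: list[float], y: list[float]) -> float:
    """Average size of the disagreement between paired scores."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


# Invented welfare scores given by two observers to the same five animals.
observer_1 = [3.0, -1.0, 4.0, 0.0, 2.0]
observer_2 = [2.0, -1.0, 5.0, 1.0, 2.0]

print(f"correlation: {pearson(observer_1, observer_2):.2f}")
print(f"mean absolute difference: {mean_abs_diff(observer_1, observer_2):.2f}")
```

The two statistics are reported together because correlation alone would not reveal a systematic offset between observers: two observers whose scores always differ by a fixed amount would still correlate perfectly.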

Usefulness criteria
The correctness criteria described above are the most significant for selecting the right measure, as they ensure that the results produced are the right ones. However, it is also important that the outputs of the measure will do well for the task required. Usefulness criteria describe how well the outputs of the measures fill the role we require them for in providing useful data for the contexts previously described.

Range of applicability
As discussed, there are a range of different contexts that require quantified animal welfare inputs to aid decision-making. Ideally, a measure should be useful across the full range of contexts of interest. Perhaps most importantly, this means using the same measure for all species being investigated, as using different measures for different species risks weakening the comparisons. There is a large range of animal species these decisions will cover (from large mammals through to the insects and shrimp now used in farming systems) and the measure should be applicable to all of them. This still leaves open the issue of how to standardise scores to make interspecies comparisons, since no measure will be able to both produce a welfare score for a species and indicate how to scale it appropriately, but as discussed earlier, this is a more complex issue that needs to be dealt with independently.
It should also be applicable across the different types of animal usage, from livestock to wild animals, to increase the scope of decision-making power. A measure that is useful only in a small range of circumstances may still be the best one for those specific applications, but particularly for the context of prioritising between interventions in different contexts, it is important to have the ability to consider and compare a wider range.

Scale type
For most of the purposes discussed, such as policy analyses and comparative suffering estimates, it is important to have a cardinal output. That is, that the measure is performed on one of the cardinal measurement scales (interval or ratio), rather than a merely comparative ordinal scale. While there are other applications for which comparative ordinal rankings may be sufficient or even preferred, for the purposes described in this paper, the calculations will require cardinal data. I have argued elsewhere that subjective welfare is measurable on these types of scale (Browning, 2022), but it is important that we choose a measurement method that produces output meaningfully represented on these scales.
The measure should also be bidirectional, capable of representing welfare states in both directions (positive and negative). Some measures are particularly concerned with suffering and do not have room to consider positive welfare experiences, which will skew results. A measure that fails to range across both positive and negative welfare experiences will fail to capture everything we care about. This does not mean that the total possible intensity on either side of the zero point must be the same: it is possible, for instance, that the worst possible states of suffering are worse than the best possible states of pleasure are good, and we might want to have something like the +10 to −25 scale used by Norowitz (in Warren 2018) for this reason. All that is required is that the measure can capture experiences on both sides of the neutral line.
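A small sketch can show why a merely ordinal ranking is not enough for calculations of this kind. Below, two invented sets of cardinal scores agree on the ranking of three systems but disagree about which system produces the most total suffering once numbers of animals are factored in. All names and values are hypothetical, and the scores are assumed to lie on an asymmetric bidirectional scale running from −25 to +10.

```python
# Two hypothetical cardinal scorings of the same three systems.
# Both imply the same ordinal ranking: C best, then B, then A.
scores_1 = {"A": -8.0, "B": -1.0, "C": 2.0}
scores_2 = {"A": -2.0, "B": -1.5, "C": 2.0}

# Hypothetical numbers of animals living in each system.
animals = {"A": 100, "B": 500, "C": 50}


def ranking(scores: dict[str, float]) -> list[str]:
    """Ordinal information only: systems ordered from best to worst."""
    return sorted(scores, key=lambda k: scores[k], reverse=True)


def totals(scores: dict[str, float]) -> dict[str, float]:
    """Cardinal aggregation: per-animal score multiplied by number of animals."""
    return {k: scores[k] * animals[k] for k in scores}


def worst_system(scores: dict[str, float]) -> str:
    """System with the lowest (most negative) total welfare."""
    t = totals(scores)
    return min(t, key=lambda k: t[k])


print(ranking(scores_1), ranking(scores_2))
print(worst_system(scores_1), worst_system(scores_2))
```

Since the ranking is identical in both cases, only the size of the gaps between scores can explain why the two scorings recommend prioritising different systems, and that information is available only on a cardinal scale.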

Informativeness
It is also preferable for our measures to be informative, in terms of providing information about the particular housing and husbandry conditions that are impacting on the subjective welfare of the animal. This then allows the measurements to be used in guiding action to improve the welfare of these animals. It is only through knowing which conditions are the primary causes of poor (or good) welfare, that decisions can be made regarding what to change.

Feasibility criteria
Finally, there are the considerations of feasibility. We want our measures to be correct and useful, so that they give us accurate results that we can apply where we need them. Both these sets of criteria describe the outputs of the measures, and their fitness for purpose. By contrast, the feasibility criteria refer to the process of measurement, and how easy the measure is to collect and apply across the range of circumstances of interest. These criteria are less important than either of the preceding two sets; they would be good to satisfy where possible, but are not essential. They can still, however, provide reasons to prefer some measures over others, particularly in the real-world circumstances in which they will be used, with various constraints and limitations.

Ease of use
Ease of use refers to how easy the measure will be to collect and apply. All the measures need to be taken and applied in real-world situations, with limitations on time, money, access to animals etc. This means it is going to be better to have a measure which is easy to collect, preferably a simple procedure that does not require a large amount of time or money. Particularly for large-scale applications requiring measurement of a large number of animals, or for a large range of institutions, time-consuming or complex measurements and calculations may prove intractable. However, in cases of assessing and comparing the typical life quality of animals across different institutions and housing types, it may be possible to instead test a representative sample and extrapolate from there.
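The idea of testing a representative sample and extrapolating can be sketched as follows. This is a minimal illustration that assumes simple random sampling and uses the standard error of the mean as an indication of how far the extrapolation can be trusted; the scores are invented.

```python
from math import sqrt
from statistics import mean, stdev


def estimate_mean(sample: list[float]) -> tuple[float, float]:
    """Return the sample mean and its standard error."""
    m = mean(sample)
    se = stdev(sample) / sqrt(len(sample))
    return m, se


# Invented welfare scores for eight animals sampled from one facility.
sample_scores = [1.0, 2.0, 0.0, 3.0, -1.0, 1.0, 2.0, 0.0]

m, se = estimate_mean(sample_scores)
print(f"estimated facility mean: {m:.2f} (standard error {se:.2f})")
```

A larger or less variable sample gives a smaller standard error, so the number of animals that needs to be assessed depends on how much individual welfare varies within the facility.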

Current data availability
One restriction on measures for use now, or in the near future, is current availability of relevant data. Many of the measures I will discuss are quite new, and data is not yet available for many species. In cases where it is important to quickly start making comparisons for immediate action, such as charitable investments or interventions, it may be preferable to choose a measure for which a lot of data has already been collected, rather than one that still holds the requirement for assessors to go into the field and undertake the relevant measurements.

Assessing measures of welfare
I have here described a framework for assessing different measures of welfare with a number of criteria for the measures to meet, taking into account considerations of correctness, usefulness and feasibility. The next stage is then to look at different measures of different types, to assess how well they meet these criteria, as I will do in the sections that follow. I will not attempt to run a quantitative assessment of the methods against the desiderata. It would be possible to try to score each measure according to how well it meets each of the desiderata and use the resulting tally to choose the 'winner' (see e.g. Charity Entrepreneurship 2018). However, there is a concern that this sort of method could lead to misleading precision, where meaningful assessment of the items is replaced by imprecise quantification that cannot be checked or validated. Scores would be assigned with a large degree of subjectivity, and the weightings between them would also be highly arbitrary; only with a principled and reliable way of assigning scores and setting weightings would such an approach be appropriate.
Here I have instead used a qualitative approach in considering whether measures meet the criteria. There are no explicit scores given, and no specific weightings applied for the different criteria, though some are given higher priority than others: for example, validity is a necessary condition for a good measure, while data availability is merely preferable. This means there is no definitive rating of the different measures, and which is the best for the task will depend on contextual factors in the application. The approach instead allows for a discussion of each of their benefits and drawbacks, and of which features the 'ideal' measure should possess. In the following sections I will look at a range of different welfare measures and discuss how they perform in relation to the criteria I have presented above for measuring subjective animal welfare.
The measures are divided into two categories: whole-animal measures and combination measures (based on a similar distinction made by Beausoleil and Mellor (2011) between whole animal profiling (WAP) and systematic analytical evaluation (SAE)). A whole-animal measure is a single indicator applied to a single animal, which is taken to represent the entire quality of life as experienced by the animal, at least at the point in time the measure is taken. The degree to which such measures can represent a longer-term cumulative welfare experience is uncertain. These measures rely on the assumption that an animal is able to internally 'calculate' the balance between different positive and negative affective states and that this produces detectable behavioural and physiological changes representing the output of this process. Justification of this assumption rests primarily on the evolutionary role of these affects in guiding trade-offs and decisions for action, and here I will be assuming that such measures can be valid (see Browning 2020 for defence of this claim).
Combination measures are more complex, combining multiple lines of evidence, appropriately weighted to give a single quality of life score. These lines of evidence are all partial measures: indicators that reflect some particular contributor to welfare, such as a specific affect or environmental condition. For example, body condition scoring is often used as an indicator of hunger, or nutritional status. While these partial measures will fail on their own for the task required here, as they are all incomplete, they can be useful when combined for use in some of the frameworks I will describe. These measures also differ from the whole-animal measures as they are often applied at the facility level, to a group of animals, instead of an individual. While in most cases, the frameworks can be used to assess individuals as well as groups, some of the individual indicators may not allow this. Often, if the outputs represent the average welfare across the animals in the facility, they can still be used roughly as individual measures would be. However, where there is a wide range of variation in the individual experiences of animals within the group, this may reduce accuracy. In these cases, we must be careful to pay attention to the context of their use.
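The two points made here, that partial indicators are combined by weighting and that a facility-level average can conceal variation between individuals, can be illustrated with a short sketch. The indicator names, weights and scores below are all hypothetical, and a weighted average is only one assumed way of combining the indicators.

```python
from statistics import mean, pstdev

# Hypothetical weights for three partial indicators.
weights = {"hunger": 0.3, "pain": 0.4, "social": 0.3}


def combined_score(indicators: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of partial indicator scores (one assumed combination rule)."""
    total_weight = sum(weights.values())
    return sum(weights[k] * indicators[k] for k in weights) / total_weight


# Invented partial scores for one animal.
animal = {"hunger": -2.0, "pain": 1.0, "social": 3.0}
print(f"combined score: {combined_score(animal, weights):.2f}")

# Two invented groups with the same average welfare but different spread.
group_uniform = [1.0, 1.0, 1.0, 1.0]
group_mixed = [5.0, 5.0, -3.0, -3.0]

for name, group in [("uniform", group_uniform), ("mixed", group_mixed)]:
    print(f"{name}: mean {mean(group):.1f}, spread {pstdev(group):.1f}")
```

Both groups receive the same facility-level score, yet in the second group half of the animals have negative welfare, which is the kind of case in which an average used as if it were an individual measure loses accuracy.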
Whole-animal measures and combination measurement frameworks thus differ in several of their features. As I will discuss, each of the categories of measure has specific strengths and weaknesses and, as I will argue, they are strongest when used together.

Whole-animal measures
The first set of measures I will assess are whole-animal measures. These measures consist of a single indicator, used to represent the total quality of life for the animal. In general, the whole-animal measures are valuable because they can give a single complete score representing the entire subjective welfare state of the animal; and they are often quick and easy to apply. Their primary drawback is that in most cases they fail to provide information on which conditions in animals' lives are responsible for their good or poor welfare, and thus on their own can't serve as a guide for intervention. In this section I will discuss the most commonly used whole-animal measures (human intuitive estimates, qualitative behavioural assessment (QBA), and cognitive bias) and assess their appropriateness according to the desiderata.

Human intuitive estimates
As I discussed in the introduction, human intuitive estimates have been a common method for filling in estimates of animal suffering in the calculators used by effective altruism organisations (e.g. Tomasik 2018; Warren 2018). The method involves one or several human observers, who compile information on the life of the animals and conditions they are kept in, and on this basis form a judgement regarding the amount of suffering, or quality of life, of the animal within this system. The time scope of this measure will depend largely on what information is used by the estimators (whether they focus on the current conditions or also incorporate the animals' past) and thus has some flexibility in this regard.
The benefits of this approach, and the reasons for its use so far, are in its usefulness and feasibility. The methods are relatively quick and easy to apply, can be used across a range of species and contexts, and outputs can be placed on whatever type of scale the users choose (though the methods of producing numbers may not justify meaningful cardinal scales). They also provide relevant information on the living conditions of the animals. However, these do not outweigh the problems in meeting the correctness criteria.
The major problem for these methods is the subjective nature of the assessment, based entirely on the intuitive judgements of the observers, which are vulnerable to incomplete information and anthropomorphic ranking of needs. This is one of the places where the effect of individual values may be strongly present. The subjectivity greatly undermines the measure's performance on the correctness criteria. It is likely to be invalid, as what is being measured is not really animal quality of life, but instead something like observer preference for particular kinds of housing situations and types of animal lives. It is also unreliable: as seen in Table 1, there is a large range of variation between different observers and their scores (also seen in comparisons of welfare estimates by Otten et al. 2017; Veasey 2020a), and this means it is highly likely to be inaccurate. The measure may or may not be complete, depending on how well the observer does at incorporating all the aspects of the animals' lives which might impact on welfare.
These methods may be strengthened through use of a Delphi method with a sufficiently diverse panel of experts, to reach consensus on the estimates (Rioja-Lang et al. 2020; Veasey 2020a, b; Whittaker et al. 2021), but they would still need to be assessed for reliability and validity.
Verdict Perhaps useful as a (very) rough and ready approach for making quick assessments in the absence of any other data - particularly if only trying to rank different systems - but results should be treated with extreme caution. Any detailed calculations regarding the comparative impact of different interventions are highly unlikely to be accurate.

Qualitative behavioural assessment (QBA)
A more rigorous and more promising version of human intuitive estimates is Qualitative Behavioural Assessment (QBA) (Wemelsfelder et al. 2001). This method holds many of the benefits of the human estimates, without the same drawbacks. In QBA, experienced observers make a judgement about the subjective welfare of animals through direct observation, using the animal's behaviour and body language, and the way it interacts with its environment, as an expression of the total welfare state of the animal. It is an "integrative welfare assessment tool" (Wemelsfelder et al. 2001, p. 209), in which the observer is unconsciously integrating many pieces of information from the behaviour and body language of the animal to form a judgement about its overall mood (Wemelsfelder 1997). It is primarily a short-timescale measure, representing the recent impacts on animal welfare, and thus best used either for assessing the effects of specific immediate changes, or when an animal is viewed when in its typical daily living conditions.
The primary benefits of this method are in feasibility. It allows for a simple and rapid assessment of the wellbeing of an animal, without the need to collect a lot of detailed data. Current data availability is moderate, with the process having been applied to a range of farm animals (Gutmann et al. 2015; Muri et al. 2019; Wemelsfelder et al. 2000; Wickham et al. 2015) and some zoo animals (Delfour et al. 2020; Patel et al. 2019). It gives cardinal outputs that are also bidirectional, identifying animals with both positive and negative overall welfare. Additionally, and importantly unlike the previous methods described, QBA also scores well on correctness criteria. By design, it gives a complete assessment of the entire state of welfare of the animal. It has been validated against other physiological and behavioural welfare indicators (Wemelsfelder 2007), and shows high reliability and accuracy (Fleming et al. 2016).
The primary potential drawback is range of applicability. So far it has mainly been used for large mammals and, given its reliance on human estimates of behaviour and body language, it may not be of as much use for species very unlike ourselves or those we are not so familiar with, such as fish and insects. However, it is possible this could be offset through acquiring greater familiarity with different species (Balcombe 2020; Wemelsfelder 2007).
Verdict This method has most of the benefits of the 'human intuitive estimates' approach, without the drawbacks relating to lack of accuracy or validity. However, it potentially has a limited range of use, pending validation for a wider range of species. Were that range to be validated, it could be a strong and feasible method for making quick assessments of overall welfare experience.

Cognitive bias
The final type of whole-animal measure is the cognitive bias test. These tests measure the overall 'mood' of an animal (representative of its cumulative subjective welfare state) through the effects on cognitive processes (Mendl et al. 2010). The primary test of cognitive bias is judgement bias, which works through identifying the level of 'optimism' or 'pessimism' of an animal, reflective of its mood, or welfare. Individuals who have experienced primarily positive states are more likely to view ambiguous signals optimistically, while individuals who have experienced primarily negative states will be more likely to view ambiguous signals pessimistically. Like QBA, cognitive bias is a more immediate welfare measure and should be applied accordingly.
From tests so far, cognitive bias measures appear to score highly on correctness criteria. They have been validated through analogous work on human cognitive bias, as well as by producing the predicted results under experimental manipulation, both environmental and pharmacological (Mendl et al. 2009; Neville et al. 2020), though they have not been specifically tested for reliability. They are a complete measure, taking an 'output' score of the overall mood of the animal, integrating the full range of its welfare experience. They also score well on usefulness criteria. They can give cardinal output scores, based on degree of judgement bias as relative to established maximums and minimums, and these scores are bidirectional, recognising both positive and negative welfare states. They should be applicable across many conditions and species - current work includes mammals (Mendl et al. 2009), birds (Deakin et al. 2016), fish (Laubu et al. 2019) and even honeybees (Bateson et al. 2011).
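As a rough illustration of how a cardinal, bidirectional score could be expressed relative to established maximums and minimums, the following Python sketch rescales an animal's response to an ambiguous cue between two anchors. The function name, the use of response proportions, the choice of the trained negative and positive cues as the minimum and maximum, and the [-1, 1] output range are all assumptions made for illustration; none of them is taken from the cited studies, and treating the midpoint as 'neutral' welfare is itself an assumption that would need to be justified.

```python
def judgement_bias_score(p_ambiguous, p_negative, p_positive):
    """Rescale the response to an ambiguous cue onto a [-1, 1] scale.

    All three arguments are proportions (0 to 1) of 'optimistic' responses:
    to the ambiguous probe, to the trained negative cue (assumed minimum),
    and to the trained positive cue (assumed maximum).
    """
    span = p_positive - p_negative
    if span <= 0:
        raise ValueError("the positive cue must elicit more optimistic responses")
    # Position of the ambiguous response between the two anchors, clamped to [0, 1].
    position = min(1.0, max(0.0, (p_ambiguous - p_negative) / span))
    # Map [0, 1] onto [-1, 1] so that the sign gives the direction of the bias.
    return 2.0 * position - 1.0


# Hypothetical example: an animal responds optimistically to 75% of ambiguous probes.
score = judgement_bias_score(p_ambiguous=0.75, p_negative=0.0, p_positive=1.0)
```

On this sketch, negative values would indicate a 'pessimistic' bias and positive values an 'optimistic' one, with the magnitude giving the degree of bias relative to the anchors.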
The primary drawback in judgement bias testing is in feasibility, particularly the advance training required. Not only is this time-consuming, reducing the feasibility of the measure, but it may also reduce accuracy, as training itself can alter welfare (Roelofs et al. 2016). For this reason, further work into other types of cognitive bias could help develop more suitable tests, which do not require training. These include attention bias, in which animals experiencing negative affect will show increased attention to negative stimuli (Crump et al. 2018), and memory bias, in which animals experiencing negative affect will show greater recall of negative memories (Clegg 2018), as well as measures of anticipatory behaviour, in which an animal will show higher anticipation for reward when in a positive emotional state. Current data availability for these methods is moderate, with some work on a range of farm animals (Deakin et al. 2016; Lee et al. 2018; Scollo et al. 2014) and now some zoo animals (Clegg 2018), but more work is required to produce results from the range of standard housing systems.
Verdict This method is probably the most promising of the whole-animal measures, due primarily to the accuracy and range of applicability. Further work is needed to establish the validity and accuracy of the less labour-intensive methods (Table 2).
Table 2 summarises the discussion of the different types of measures. A tick represents a measure strongly meeting the requirements of the criterion, a cross represents failing to meet the requirements, while a dash represents either an intermediate result (neither strong failure nor strong success) or a lack of available data to decide. This is intended only as a visual representation of the qualitative assessment above; the measures are not specifically compared on the number of ticks and crosses but on how well they are considered to do across a range of categories, particularly the more important ones, such as validity. Of the whole-animal measures assessed here, the most promising seems to be cognitive bias, due primarily to its range of applicability; QBA is also a strong contender if it can be shown to be applicable across a wider range of species.
There are some other whole-animal methods that may be promising in the future, such as markers of biological aging (e.g. telomere length and hippocampal volume), which work on the premise that exposure to stressors will prematurely age an animal, and thus a comparison of the 'biological age' to the actual age will indicate the level of stress the animal has been exposed to, which can be used to infer the quality of life that animal has experienced. These measures reflect the longest timescale, as they represent the total cumulative experience of the animals so far, which makes them potentially the most accurate total welfare indicators but not as well-suited to applications that require a snapshot 'at a time' picture of welfare. The methods have been reviewed elsewhere (Bateson 2016; Bateson and Poirier 2019), but primarily still require validation, particularly to ensure that they are tracking subjective animal welfare - both positive and negative - and not just physiological stress. I have not detailed them here, as they are still underdeveloped for the purposes described in this paper, but they may work well if they can be established to meet the criteria I have set out. The framework I have presented is intended for use in just this way - as a tool for ongoing assessment of different methods as they develop.
In general, whole-animal measures are the best way of making an accurate measure of the entire state of subjective welfare for an animal. They will take into account all aspects of welfare and will in general be quicker and easier to apply than combination measures that require multiple lines of evidence. Except for the human intuitive estimates, they are weakest in their inability to provide details about the reasons for the welfare score and thus will do well used in conjunction with the combination measures I will discuss in the next section.

Combination measures
The next set of measures I will assess are combination measures. Combination measures are created using multiple partial indicators, each of which represents a contributor to subjective welfare experience - such as nutrition, health, or behaviour. These are scored and weighted by their relative contribution to overall experience, to attain an overall quality of life score. They can thus give us detailed information about the impact of different conditions on animal welfare. The major drawback to these models is that they risk leaving out some contributors to welfare, leading to incomplete calculations, or that they may have inaccurate weightings between the different components of the model. Like the human intuitive estimates, the timescale scope for these measures will depend on those of the specific indicators used to construct them. Where these are mostly more stable indicators reflecting ongoing living conditions, the frameworks will provide a good description of the typical daily welfare of the animals, and will not be so sensitive to the impacts of immediate changes. These types of frameworks are increasingly common, and many are developed specifically for particular species. Here I will assess the most commonly used general combination measurement frameworks - the Five Domains model, the Welfare Quality protocol (Botreau et al. 2007) and the welfare Decision Support System (DSS). Rather than breaking them down individually, I will discuss the models together, as their relative benefits and drawbacks are best seen as compared and contrasted to one another.
The combination measurement frameworks operate by dividing welfare up into different categories, such as nutrition, housing, health, and behaviour, and identifying different indicators within these to measure different components. The scores for these are then aggregated, first by category, and then overall, to output a single score taken to either represent the welfare of the animal or group, or the quality of the facility as regards the welfare of its animals. These models are highly sensitive to the effects of individual values, as discussed earlier. The selection of relevant categories, as well as the selection of indicators within these categories, will reflect the values and commitments of those building the framework. As I will discuss shortly, this will also be true of the weighting procedures used to aggregate the components into a single score - where this is done by 'expert opinion', it will reflect the choice of experts and their own (often discipline-specific) views about the differing importance of different states.
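The two-stage aggregation described above (indicators into categories, then categories into an overall score) can be sketched in Python as follows. This is a minimal illustration only: it assumes all indicator scores have already been placed on a common cardinal 0 to 1 scale, and it uses a plain weighted mean at both stages. The category names, indicator names, scores and weights are hypothetical, and none of the frameworks discussed here is claimed to use exactly this rule.

```python
def weighted_mean(scores, weights):
    """Weighted mean of the values in `scores`, using matching keys in `weights`."""
    total_weight = sum(weights[key] for key in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[key] * weights[key] for key in scores) / total_weight


def aggregate_welfare(indicator_scores, indicator_weights, category_weights):
    """Aggregate indicators into category scores, then into one overall score."""
    category_scores = {
        category: weighted_mean(scores, indicator_weights[category])
        for category, scores in indicator_scores.items()
    }
    overall = weighted_mean(category_scores, category_weights)
    return category_scores, overall


# Hypothetical inputs, for illustration only.
indicator_scores = {
    "nutrition": {"feed_access": 0.8, "water_access": 0.6},
    "health": {"lameness": 0.4},
}
indicator_weights = {
    "nutrition": {"feed_access": 1.0, "water_access": 1.0},
    "health": {"lameness": 1.0},
}
category_weights = {"nutrition": 1.0, "health": 3.0}

category_scores, overall = aggregate_welfare(
    indicator_scores, indicator_weights, category_weights
)
```

Even in this toy version, the value-sensitivity is visible: the overall score changes whenever the chosen categories, indicators or weights change, without any change in the animals' actual conditions.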
The frameworks vary regarding their performance on usefulness criteria. Depending on their construction, they are potentially useful across a large range of species and contexts. However, as new sets of indicators need to be developed for each species, the current range of applicability is more limited. While the general domains and associated mental states in the Five Domains will be relevant to most types of animals (as well as captive animals, the model has recently been extended to wild animals: Harvey et al. 2020), Welfare Quality and DSS require new models to be explicitly built for each new species. Welfare Quality has been used for a range of agricultural animals, including pigs (Czycholl et al. 2016), cattle (de Graaf et al. 2018), and hens (Blatchford et al. 2016), and DSS is currently available for assessing the welfare of breeding sows, chickens (de Mol et al. 2006), cows (Ursinus and Schepers 2009) and salmon (Pettersen et al. 2014; Stien et al. 2013).
The models also differ on the type of scale used. The Five Domains explicitly uses an ordinal scale (A-E), in order to prevent over-precisification where the data does not support it: "numerical grading was explicitly rejected to avoid facile, non-reflective averaging of 'scores' as a substitute for considered judgment and to avoid implying, unrealistically, that much greater precision is achievable than is possible with such qualitative assessments" (Mellor 2017, p. 10). However, this limits the contexts of use, as many of the applications described do require a cardinal output. Both Welfare Quality and Decision Support Systems produce cardinal scores, though the use of ordinal scoring on some attributes within the models may undermine the assumption of cardinality for the final output.
The frameworks vary in their feasibility, depending on the methods used for the individual measures within them. While the Five Domains relies primarily on observer ratings of the quality of different aspects of animal housing and care, Welfare Quality and DSS rely more heavily on indicators requiring empirical measurement, and there are concerns about the length of time it takes to apply the full Welfare Quality assessment (Andreasen et al. 2013, 2014). The biggest concerns are with the correctness criteria. These will depend a lot on the specific sets of partial measures used in their construction. While all of the models have explicitly been constructed based on consideration of the subjective experience of animals, most of the individual indicators have not been explicitly validated for their connection to subjective welfare (with the exception of Welfare Quality: Buller et al. 2020; Forkman and Keeling 2009).
Additionally, while these individual measures can all be valid, accurate, and reliable, it is most important that the model as a whole has these features. Only the DSS has been validated as a whole, and then only against expert opinion, which may be unreliable. Whether a combination measurement framework is valid, or accurate, will depend on two further considerations - ensuring the framework is complete, and that the weightings are accurate. The frameworks will only be complete if all relevant aspects of welfare are covered - where there are missing components, the measure is incomplete, and its validity and accuracy are also undermined. As I will discuss in the next section, confirmation of completeness and validity can perhaps best be achieved through use of whole-animal measures.
The biggest weakness for all frameworks of this type is in setting the weightings for the relative impacts of the different components on subjective welfare experience, and this represents the biggest difference between the different frameworks. The Five Domains framework recognises this problem and does not attempt to compare the relative impact of the different domains, with the end score not intended to be a strong representation of overall quality of life; it is intended rather as a 'focussing' device, to gain a greater understanding of the welfare of an animal, and the conditions impacting it, rather than a measurement tool as such. The aggregation weightings used in Welfare Quality are quite opaque, and seem to be based on expert opinion rather than measured effect on the animals (de Graaf et al. 2018; Sandøe et al. 2019). This means the model as a whole is less likely to be a valid measure of the entirety of welfare experience, as is suggested by the poor correlation of Welfare Quality scores with QBA assessments in cattle (Andreasen et al. 2013).
The strongest framework for facing this aggregation problem is the DSS. In this framework, the attribute weightings are based on information available in the literature. At present, this does not provide much confidence in the weightings used - these were not standardised and what counted as relevant data could vary from weighted preferences to qualitative comments by scientists in their papers. However, importantly, this is transparent and allows for changes to be easily made as new information is attained (e.g. on range of needs, their link to attributes and the weightings of attributes). The data in the model is directly linked to a table of the referenced data (e.g. comments in scientific papers) to allow for transparency, as well as making it updatable. It is this transparency and explicit capacity to update that gives the DSS its strength as a framework.
Verdict Table 3 summarises the discussion of the combination measures, and how well they meet the proposed criteria. As with the whole-animal measures, this is meant simply as a visual representation of the assessments - the number of ticks and crosses is not a direct reflection of the relative quality of each of the measures. Combination measures are useful as they provide detailed information on the conditions of animal lives, and how they impact subjective welfare, which can be used in providing recommendations for action. Their primary weakness is that they may be incomplete, failing to account for all influences on welfare.
The Decision Support System is the most promising of these frameworks for the purposes described in this paper, with some 'cleaning up' of the inputs - particularly ensuring collection of cardinal rather than ordinal data. This is primarily because it is best able to overcome the potential problems of incompleteness and weighting accuracy through transparency and flexible response to new data. Though Welfare Quality is well-developed for assessing and comparing particular species-specific institutions and housing conditions, it currently has too many subjective judgements built in to be confident about its validity or accuracy for producing a welfare score as is required for the purposes described in this paper. While the Five Domains may be highly effective in making assessments of the welfare conditions present for an animal, without a numerical scoring system it would not be of real use in the contexts discussed in this paper, which require quantitative comparisons.

Conclusions
We have many reasons to want to quantify the welfare levels of animals, including policy decisions, impact assessments, and comparing different interventions. In this paper, I have proposed a range of desirable criteria for a measure of subjective animal welfare, against which I assessed several common welfare measures, to identify which best meet our requirements. In particular, I distinguished between whole-animal and combination measures, which each have different strengths and weaknesses.
In the end, the best option is to use both a combination and a whole-animal measure together, as they have complementary strengths and weaknesses. Whole-animal measures are complete and have higher validity and accuracy, while the combination measures are typically more feasible to apply and give more information about the sources of welfare harms and benefits. Combining them allows us to get a sense of the overall mood/welfare of an animal, while still having sufficient detail about living conditions to allow us to determine where change is required. A similar point is made by Aerts et al. (2006) when arguing for a combined usage of a housing assessment framework, a stockperson evaluation and an animal-based measure to get a complete picture of the animal's welfare. Combined use also allows us to validate the measures against one another to make sure we have not missed anything on either side - for example, the lack of correlation between Welfare Quality scores and QBA assessments in cattle (Andreasen et al. 2013) gives reason to look more closely at each method to determine why they do not agree, and to amend or replace the methods accordingly. In particular, lack of agreement may help indicate where combination measures have missed some component of welfare.
As I discussed, one of the biggest weaknesses of the combination measures is the current subjectivity involved in setting weightings for the different components within the model. Without having a way of correctly setting weightings such that they reflect the actual impact different experiences have on welfare from the point of view of the animal, the outputs of the model could be entirely wrong. I suggest that use of whole-animal measures allows us an objective method for determining weightings. We would start by using a whole-animal measure to measure the overall welfare of an animal at one point. We would then make an intervention we were interested in testing the effect of, say by changing food quality or amount of available shelter. Finally, we would measure overall welfare again, to observe the difference in the scores. This difference will help us determine the impact of this condition on overall welfare. Repeating this for many conditions would start to give us their relative weightings. Use of preference tests to see how strongly animals prefer particular conditions over others can also tell us something about their weightings relative to welfare. However, these tests should be used with caution as they will only imperfectly reflect the actual hedonic impact of preferred conditions due to a number of potential confounding factors, most importantly the short-term nature of most preferences (Dawkins 1990; Franks 2019; Fraser and Nicol 2018; Jensen and Pedersen 2008; Kirkden and Pajor 2006).
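The measure, intervene, and re-measure procedure suggested above can be sketched in Python as follows. This is only a minimal illustration under strong simplifying assumptions: the whole-animal scores are treated as cardinal, the same animals are scored before and after each intervention, each condition is changed in isolation, and relative weightings are taken to be each condition's share of the total absolute change. The function names, condition names and numbers are hypothetical; a real study would also need control groups and an account of interactions between conditions.

```python
from statistics import mean


def mean_change(before, after):
    """Mean paired change in whole-animal welfare score (after minus before)."""
    if len(before) != len(after) or not before:
        raise ValueError("need matching, non-empty before/after scores")
    return mean(a - b for b, a in zip(before, after))


def relative_weightings(effects):
    """Normalise absolute effect sizes so that the weightings sum to 1."""
    total = sum(abs(effect) for effect in effects.values())
    if total == 0:
        raise ValueError("no intervention produced a measurable change")
    return {condition: abs(effect) / total for condition, effect in effects.items()}


# Hypothetical whole-animal scores for the same animals before and after
# each intervention (one list entry per animal).
trials = {
    "food_quality": ([0.0, 0.2], [0.3, 0.5]),
    "shelter": ([0.1, 0.1], [0.2, 0.2]),
}

effects = {
    condition: mean_change(before, after)
    for condition, (before, after) in trials.items()
}
weightings = relative_weightings(effects)
```

Weightings obtained in this way could then replace the expert-opinion weightings in a combination framework, and be revised as further interventions are tested.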
Having looked at a variety of measures, and assessed them against the desiderata, the current best whole-animal measure of subjective animal welfare is probably cognitive bias, with some more work to ensure its validity and accuracy. The best combination measure will be a DSS framework, as it is the only one of the combination models to have a transparent aggregation system and an objective way of setting weightings. A version of this model, with improved inputs, and with systematic use of whole-animal measures or preference tests to set weightings as described above, will be the best way of creating a complete welfare measure. It also allows for continual updating as we learn more, which is the primary strength of this type of system. While these are currently the best performing measures, the science of animal welfare is rapidly progressing and there will be continual developments in these and other methods. An advantage of the framework provided in this paper is that it can be used for ongoing assessment of existing and emerging measurement methods, such as those discussed at the end of the section on whole-animal measures.
Using measures such as those described above will allow us to quantify subjective animal welfare under different conditions, such as in a dairy farm, an indoor chicken barn, or a wild setting. Accurate measurement of animal welfare is a crucial part of the process of making decisions that include the interests of animals. This paper isn't intended to provide direct guidance on what we should do, but rather to provide better tools for figuring it out. In particular, this requires active engagement with the current science of animal welfare, as well as further scientific and philosophical research to clarify and strengthen our understanding and measurement of welfare. With this work, we get closer to having the information we need to make informed decisions that can reduce suffering and improve animal lives.