1 Introduction

The term User Experience (UX) is, despite being coined almost 25 years ago by Donald Norman, still a somewhat chatoyant concept. In an attempt to provide an overview of its meaning, the site allaboutux.org lists 27 definitions of user experience published up to 2010. ISO 9241-210 defines user experience as “a person’s perceptions and responses that result from the use or anticipated use of a product, system or service … [UX] includes all the users’ emotions, beliefs, preferences, perceptions, physical and psychological responses, behaviors and accomplishments that occur before, during and after use … Usability criteria can be used to assess aspects of user experience”. While this definition falls short of explicitly naming directly measurable UX metrics, the situation is clearer in the ISO 9241-11 definition of usability, which defines usability as “the extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use”. While the exact relation of usability to the obviously more comprehensive concept of user experience is not spelled out extensively in the ISO 9241 definition, we are informed that “usability criteria can be used to assess aspects of user experience”. Completion rates or error rates can be used to operationalize the usability criterion of effectiveness, while time on task is an example of a quantification of the efficiency criterion. Completion rates and time on task can be measured objectively, while the third usability criterion, satisfaction, is a subjective measure that can be elicited using, for example, the System Usability Scale (Sauro 2011). Operationalizing and measuring criteria that extend beyond the task level (on which the semantically tighter definition of usability mainly operates), as required by the ISO definition of user experience, is clearly more complex: taking the subjective consequences of an “anticipated use of a product” or a “user’s emotions, beliefs, preferences, perceptions” into account demands the application of valid and reliable methods like the User Experience Questionnaire (UEQ, Laugwitz et al. 2008).

In this paper we introduce an artifact called UX metrics table that focuses on the identification, measurement and comparative interpretation of UX metrics. To lay the groundwork for outlining a UX metrics table, some essential terms are clarified in the next section.

1.1 What Do We Mean by UX Metrics?

A metric is understood as an approach to the measurement or evaluation of a phenomenon under consideration (Tullis and Albert 2013). A metric that focuses on the usability of an interactive system like time on task can be measured directly and objectively by the assignment of a numerical value representing the delta between the start time and the end time of working on a task for a given sample of users. We need to determine an instrument for measuring, as well as the units for reporting the results of its application in a (set of) situation(s). A potential result would be a statement like: The mean time for completing an order with system X is 37.9 s (with a standard deviation of 7.2 s).
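Such a computation is straightforward to sketch in code. The following minimal Python example (with invented timestamps, not data from any study mentioned here) derives the mean and standard deviation of time on task from per-participant start and end times:

    from statistics import mean, stdev

    # (start, end) timestamps in seconds for each participant working on the task
    task_timestamps = [(0.0, 35.2), (3.1, 42.8), (1.4, 30.9), (0.5, 46.0), (2.0, 38.6)]

    # time on task = delta between end time and start time
    times_on_task = [end - start for start, end in task_timestamps]

    print(f"Mean time on task: {mean(times_on_task):.1f} s "
          f"(SD = {stdev(times_on_task):.1f} s)")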

When moving from the definition of usability to the superordinate concept of user experience according to ISO 9241, we lack simple, directly observable indicators and need to resort to subjective measures like the UEQ mentioned above or to marketing metrics like brand perception (Sauro 2015) to capture specific aspects of user experience.

UX metrics, as results of measurements, are quantitative by nature: we can compare the measurement results of a certain UX metric for different systems or contrast the measurement before and after the redesign of a system. In order to qualify as a well-founded and useful UX metric, the respective measure is required to be valid (i.e. it should measure what it claims to measure), reliable (i.e. it should produce similar results under similar conditions) and objective (i.e. it should be independent of the person conducting the measurement and should be free of references to outside influences). Ideally, a UX metric should be easy and economical to measure and its results should be understandable and informative. Embedding an instrument for measuring UX in a comprehensive framework of UX like the Components Model of User Experience (Thüring and Mahlke 2007) provides valuable theoretical underpinnings to support a sound interpretation of its results (see the meCUE questionnaire, Thüring and Minge 2014). To aid broad applicability, metrics should be flexible enough for utilization in early design phases and should give meaningful results even with small sample sizes.

Leech (2011) argues that in order to be helpful, a UX metric needs to come with a timescale (to designate the temporal period under consideration), a benchmark (to allow for comparisons), a reason to be reported (to focus on significant data) and an associated action (that would allow an appropriate response in light of the available data).

1.2 Why Should We Collect UX Metrics?

UX metrics provide directions for design. Depending on the requirements of a given project, the goal of design activities might vary: in one project we might primarily focus on achieving an efficient interaction concept while we might need to balance (conflicting) metrics like efficiency, error rates and/or learnability in the next project.

Identifying a relevant set of metrics and adequately balancing conflicting metrics is crucial to the success of product development. Starting with the vision of a new or improved product, we need to identify UX metrics to qualify what we mean by a desirable user experience in the respective project context. Explicit UX metrics help to shape and elaborate the goals of product development. Agreeing on benchmarks that come as target (and, optionally, acceptance) values for the UX metrics selected fosters insightful discussions especially in cross-functional teams and aligns project members on a strategic level. UX metrics guide data collection during the research phase of a project by pointing to empirical information required to come to informed design decisions.

Establishing UX metrics also helps us to evaluate our assumptions: they aid the interpretation of conducted studies, measure the impact of changes to a product and spell out improvements over iteration cycles. UX metrics tell us when to stop iterating because target values are met or exceeded. Defined target values for UX metrics allow a calculation of the intended benefits when conducting predictive Return-on-Investment (ROI) analyses. Monitoring the status and continuously reviewing the relevance of the chosen UX metrics acknowledges the dynamics of user experience and provides opportunities for prompt action in the light of changing UX requirements.

UX metrics make it easy to communicate project progress and provide convincing numbers to prove it. Finally, UX metrics allow for easy comparisons of different products or different versions of the same product. Tullis and Albert (2013, p. 8), in their seminal book on UX metrics, claim that “Metrics add structure to the design and evaluation process, give insight into the findings, and provide information to the decision makers”.

Despite the value outlined above, UX metrics are often neither explicitly operationalized, nor agreed upon or annotated with target and/or acceptance values. Nielsen’s (2001) infamous quote “Metrics are expensive and are a poor use of typically scarce usability resources” may not be entirely blameless here. A fortunate, clear exception are approaches that sail under the Lean UX flag and follow a designated build-measure-learn loop (Gothelf and Seiden 2012). In too many projects, however, UX metrics remain implicit and lose their guiding force. Even worse, team members, who are jointly contributing to the development of a product, might individually be striving for the attainment of different (or even conflicting), but unexpressed UX goals. The lack of explicit and shared UX metrics impedes a smooth orchestration of project activities and misses opportunities to focus on informed, goal-oriented actions.

User experience design approaches typically rely on the creation of various concrete artifacts that are constructed and refined during the iterative course of a UX project. UX metrics need to be tightly engrained in the landscape of core UX artifacts to become widely understood as a matter of course and used as informative tools for design. A UX metrics table connects personas and scenarios with the explicit elaboration of quantitative metrics and has proven to be a helpful artifact through the stages of a UX project. In the next section we discuss ways to establish a UX metric.

1.3 Establishing UX Metrics

Meaningful UX metrics focus on the critical interaction scenarios of using an application. A scenario might be critical for different reasons: it is used very frequently, errors imply severe consequences, the touch point of interaction contributes crucially to the general impression of the application, the interaction sequence is of eminent importance for learning how to use a system, or the scenario shapes the user experience significantly in some other essential way. We first need to identify these critical interaction scenarios in order to set concrete expectations regarding relevant UX metrics.

Different scenarios may be critical for different personas that are conceptualized as representative users of an application. When we establish UX metrics we need to take critical scenarios and their associated personas into account. Independently from the concept of a UX metrics table presented in this paper, Travis (2011) proposed a related approach in which identifying the red routes is the initial step to create UX metrics: “Most systems, even quite complex ones, usually have only a handful of critical tasks. This doesn’t mean that the system will support only those tasks: it simply means that these tasks are the essential ones for the business and for users. So it makes sense to use these red routes to track progress”. Travis refers to user stories (Cohn 2004) to support “thinking of the needs and goals of a specific persona” and to “fully ground” the scenario in context.

Including personas and scenarios in the formulation of a UX metric helps to prevent the definition of overly generic UX metrics that would require a significant operationalization before they can be associated with a method for measurement: “… «easy to use», «enjoyable» or «easy to learn». It’s not that these aren’t worthy design goals, it’s that they are simply too abstract to help you measure the user experience” (Travis 2011). Defining UX metrics in reference to personas and scenarios gives us the concreteness needed to declare successful goal attainment.

What is still missing in order to arrive at quantifiable UX metrics are numerical values that form precise criteria for comparison. We may want a revised system to be better with regard to a certain UX metric than its predecessor, and/or we want it to be at least as good as its closest competitor. We need to know benchmark values for comparison to indicate success or failure. In a summative evaluation, we can then qualify a measured value for a relevant UX metric as good when it exceeds the benchmark value or, in the simplest case, as bad when it falls below it.

Having benchmark values for UX metrics does not just allow for meaningful comparisons, but also helps to set reasonable target values regarding the magnitude of intended improvement. Relevant benchmark values can be obtained by measuring a UX metric with regard to a preceding system version, a competitor’s offering or, for some UX metrics, even the manual processing of a previously unsupported task. Technical capabilities and insights from research activities typically inform the precise assignment of target and acceptance values. If no benchmark values are available, generic average values for UX metrics, as published by Sauro (2012), can serve as rough first evidence.

In practice, differentiating between intended target values and acceptance values has turned out to be helpful on some occasions. Target values represent true indicators of success, while achieving acceptance values points in the right direction but leaves room for improvement.

Identifying and prioritizing UX metrics is typically not a solitary endeavor. Different UX metrics might be of varying importance to different stakeholders; it is thus advisable to engage all interested parties when selecting and consolidating the set of UX metrics considered to be relevant. Especially in early phases of a project, large-scale UX goals, as spelled out in Google’s HEART framework (Happiness, Engagement, Adoption, Retention, and Task Success, see Luenendonk 2015), are often put forward and later refined by referring to measurable metrics. Tullis and Albert (2013) extensively discuss behavioral and attitudinal UX metrics, categorized as performance metrics, issue-based metrics, self-reported metrics, as well as behavioral and physiological metrics.

2 The UX Metrics Table

Human-centered design activities are, by their very nature, artifact-centered. The construction of artifacts is the lowest common denominator uniting the iteratively intertwined phases of a human-centered product development cycle. Coming in different guises and named according to different flavors, artifacts are shared as communication tools amongst stakeholders, are evaluated and refined and, if necessary, abandoned. The maturity of artifacts indicates, within the limits of iterative approaches, progress in UX projects: we envision future users of an interactive system by establishing lively personas as representative archetypes. Scenarios express the working goals of an acting persona in context and constitute an essential ingredient for deriving requirements. Scribbles, wireframes, user journeys and interactive prototypes let a product gradually come alive and support an early experience of a product’s essentials. In contrast to the artifacts mentioned before, UX metrics have, however, not yet found a firm home in the artifact arsenal of UX professionals.

A UX metrics table links personas and scenarios to explicit UX metrics in a comprehensive artifact accompanying human-centered design activities. It continuously conveys the targets for the design, helping to keep the focus on agreed upon quantitative UX goals of a project. The rows of a UX metrics table refer to the UX metrics considered, presented in decreasing order of priority. Its columns provide the agreed upon parameters of the metric.

Figure 1 shows the header of a UX metrics table. The UX metrics table consists of nine columns that define (1) the UX metric under consideration, (2) the method to measure the UX metric, (3) the persona representing the intended user, (4) the scenario that provides the situational context, (5) a benchmark value for comparison, (6) a target value (complemented by an optional acceptance value) that serves as a quantitative goal for this metric, (7) a result column holding the empirically measured score for the UX metric, (8) a time scale column to indicate the time frame within which the intended result is to be achieved and (9) a sample column to describe the designated sample for evaluating the UX metric.

Fig. 1. Elements of a UX metrics table
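Although the UX metrics table is a document-level artifact, its nine columns can also be sketched as a simple data structure. The following Python dataclass is our own illustrative model (field and class names are not taken from the paper); the helper method mirrors the convention, described in the next paragraph, of marking a row as successful once the measured result equals or exceeds the target value:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class UXMetricRow:
        metric: str                          # (1) UX metric under consideration
        method: str                          # (2) method to measure the UX metric
        persona: str                         # (3) persona representing the intended user
        scenario: Optional[str]              # (4) scenario providing the situational context
        benchmark: float                     # (5) benchmark value for comparison
        target: float                        # (6) target value serving as quantitative goal
        acceptance: Optional[float] = None   # (6) optional acceptance value
        result: Optional[float] = None       # (7) empirically measured score
        time_scale: str = ""                 # (8) time frame for achieving the result
        sample: str = ""                     # (9) designated sample for the evaluation

        def target_met(self) -> bool:
            """True once a measured result equals or exceeds the target value."""
            return self.result is not None and self.result >= self.target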

The number of rows representing the chosen UX metrics and their respective content depend of course on the specific goals and circumstances of an actual project. While starting with a small and focused list of UX metrics is advisable, the details of a UX metric table are subject to change over time. In practice, UX metrics might be added, rows where the result value equals or exceeds the target value might be highlighted to indicate success, and values might be adjusted in the light of new insights. The history of a UX metrics table represents dynamic snapshots of the progress made in establishing and attaining UX metrics.

Using the UX metrics table in real-world projects is quite straightforward: after an initial understanding of the project goals has been reached, a first version of a UX metrics table is established in a joint meeting that brings the relevant stakeholders together. Agreeing on relevant UX metrics lays the groundwork for a shared understanding of the project goals, a process that often involves a vivid tug-of-war over the “right” metrics and/or their prioritization. In projects targeted at the development of new products, initial UX metrics tables often consist of hardly more than two entries in the UX metric column that have mutually been agreed to be significant.

Setting UX metrics for a product requires (well-grounded) assumptions regarding the quality of its use. Establishing a UX metrics table gives additional weight to the careful creation of personas and scenarios based on empirical data. If user research has already provided evidence for their construction, the columns for persona and scenario can be filled in, with their content certainly having a major impact on the definition of the UX metric under consideration. If not, the empty cells for persona and scenario in the UX metrics table will remind the team of the research still required to empirically ground the UX metric.

To illustrate the construction process of a UX metrics table, we discuss a simple example in the next section.

3 Applying the UX Metrics Table in Practice

A UX metrics table provides a clear rationale for assessing the progress made during the iterations of a design project. It is easy to understand the benefits of the suggested artifact in the build-measure-learn loops of Lean UX approaches, where (real-world) validations of incrementally extended products are conducted formatively (see Steimle and Wallach, in preparation, for a discussion of UX metrics tables in Lean Development). How a UX metrics table can contribute to summative evaluations is less obvious. In the following real-world example we illustrate a summative use of UX metrics tables in a project that was initiated by a client interested in an overall evaluation of a complex software system. For reasons of anonymization, we will denote the software as Wcs in the remainder of the paper.

3.1 An Example for the Summative Use of the UX Metrics Table

Wcs was installed three years ago with a large, international user base as the successor of an application with similar functionality. After having received negative user feedback regarding several UX deficiencies of Wcs, the client commissioned a summative usability study to inspect the application.

Analysis of the user base revealed two distinct user groups of Wcs:

  • Expert users, who (1) mostly report using Wcs daily or on several days per week, (2) rate themselves as very proficient in using Wcs (self-reports on a 7-point scale [1–7], yielding an average of 5.28, SD = 1.63), and (3) work on task sets that require full access to Wcs’s functionality;

  • Occasional users, of whom (1) more than 75% use Wcs less than once per month, who (2) rate their level of proficiency as comparatively low (self-reports on a 7-point scale [1–7], yielding an average of 3.66, SD = 1.70), and who (3) work on restricted tasks with only limited access to Wcs’s functionality.

Both user groups have been working with Wcs for about the same length of time (expert users: 31.22 months, SD = 10.68; occasional users: 30.44 months, SD = 10.98).

To better understand the respective user attributes, their working goals and situational parameters, a Contextual Inquiry was conducted. The domain of Wcs is quite knowledge-intensive, so the gathered insights were a necessary prerequisite for a detailed UX inspection of the system. Results from the Contextual Inquiry were used to empirically ground the construction of two personas. The name Emily Expert was used to denote the expert persona, while the group of occasional users was archetypically depicted by a persona called Tim Sometime.

When the client commissioned the UX evaluation, he was mainly interested in (a) an understanding of the user perspective, i.e. the subjective impression of Wcs’s quality of use and (b) an expert opinion regarding reported (potential) usability flaws of Wcs.

Following the categorization of Tullis and Albert (2013), the subjective usability impression of Wcs’s users can be considered a self-reported metric that can be measured, for example, using the System Usability Scale (SUS). The SUS is a questionnaire to elicit the subjective impression of the usability of an interactive system. It was published by Brooke (1996) and is used in an exceptionally large number of scientific and practical studies. The SUS has excellent values for validity and reliability (Sauro 2011) and allows economical data elicitation since the questionnaire comprises only ten items. In response to these items, participants express their (dis)agreement with statements about a system on a five-point Likert scale (example: “I feel very confident in using Wcs”, strongly disagree … strongly agree). A SUS score is calculated from the answers to the ten items and can take one of 41 values between 0 and 100 (in increments of 2.5), with 0 representing the theoretical minimum and 100 the highest possible SUS score.
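The scoring rule published with the questionnaire (Brooke 1996) is simple enough to sketch in a few lines of Python: odd-numbered items contribute their response minus one, even-numbered items contribute five minus their response, and the sum is multiplied by 2.5. The example responses below are invented for illustration:

    def sus_score(responses):
        """responses: the ten item answers on the 1-5 agreement scale, in item order."""
        if len(responses) != 10 or not all(1 <= r <= 5 for r in responses):
            raise ValueError("SUS expects ten responses between 1 and 5")
        contributions = [
            (r - 1) if i % 2 == 0 else (5 - r)  # 0-based index: even index = odd-numbered item
            for i, r in enumerate(responses)
        ]
        return 2.5 * sum(contributions)

    print(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # -> 85.0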

3.2 Determining the Values of the UX Metrics Table

Given the UX metric of a subjective usability impression, the SUS as a method for measurement and the two personas, we can already complete parts of Wcs’s UX metrics table. The scenario cell in the table can be left empty since the SUS score refers to the overall impression of a system and is not related to an isolated scenario. To determine a benchmark value for the SUS score, data reported by Sauro (2011) was used. Sauro published SUS scores from a total of 446 studies with more than 5,000 participants to derive general benchmark data that supports a comparative interpretation of obtained SUS scores. The mean SUS score representing the entirety of those studies is 68, which can be inserted as the benchmark value for Wcs, with the target value set to >68. While this setting seems appropriate for the Tim Sometime persona representing occasional users, the benchmark/target values for the Emily Expert persona are set in reference to Sauro’s mean SUS score for “Internal-productivity software: Customer Service and Network Operations applications”, which is 76.7 (SD = 8.8). In fact, all expert users of Wcs work in the very same organization that released Wcs, while occasional users are employed in a variety of different organizations: the respective reference points are set accordingly. The slot for time scale was set to Now because the summative study is intended to capture the current state of the metric. Although the SUS is a robust tool even with small sample sizes, a web-based presentation of the SUS promised to gather data from a larger sample (N > 50). Figure 2 shows the initial version of the UX metrics table for Wcs.

Fig. 2. Initial UX metrics table for Wcs

Note: For reasons of simplicity and illustration, the table in Fig. 2 just comprises a single metric, referring to two different personas. Typically, a UX metrics table includes various types of UX metrics that require careful balancing (see Steimle and Wallach, in preparation).
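In code form, the initial table for Wcs as described above could be captured roughly as follows. This is an illustrative sketch using plain dictionaries; the keys are our own labels, the exact cell wording of Fig. 2 is not reproduced, and the >76.7 target is assumed by analogy with the >68 target for occasional users:

    initial_ux_metrics_table = [
        {
            "metric": "Subjective usability impression",
            "method": "System Usability Scale (SUS), web-based",
            "persona": "Tim Sometime (occasional users)",
            "scenario": None,      # SUS addresses the overall impression, no single scenario
            "benchmark": 68.0,     # mean SUS score across 446 studies (Sauro 2011)
            "target": "> 68",
            "result": None,        # to be filled in by the summative study
            "time_scale": "Now",
            "sample": "N > 50",
        },
        {
            "metric": "Subjective usability impression",
            "method": "System Usability Scale (SUS), web-based",
            "persona": "Emily Expert (expert users)",
            "scenario": None,
            "benchmark": 76.7,     # mean SUS score for internal-productivity software (Sauro 2011)
            "target": "> 76.7",
            "result": None,
            "time_scale": "Now",
            "sample": "N > 50",
        },
    ]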

3.3 Results

Data collection using a web-based version of the SUS was supported by unique, computer-generated tokens to rule out multiple participations by the same person. Approximately 4,000 users from both user groups were invited via email to participate in the study.
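A generic sketch of such a token mechanism (not the study’s actual implementation) could look as follows, generating one unique, hard-to-guess token per invited user:

    import secrets

    def generate_tokens(n):
        """Return n unique participation tokens, one per invitation email."""
        tokens = set()
        while len(tokens) < n:
            tokens.add(secrets.token_urlsafe(16))
        return tokens

    invitation_tokens = generate_tokens(4000)  # roughly one token per invited user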

A total sample of close to 500 Wcs users answered the SUS questionnaire, resulting in a data set of 414 occasional users and 75 expert users. The average SUS score for occasional Wcs users was 40.4 (SD = 19.4); the 95% confidence interval ranges between 38.48 and 42.22. Evaluating Cronbach’s alpha to measure the internal consistency of the scale returned an excellent value of 0.914. The average SUS score for expert users was 43.3 (SD = 18.6); the 95% confidence interval ranges between 39.02 and 47.58. Here, Cronbach’s alpha returned a very good value of 0.885.
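The reported intervals can be reproduced (up to rounding) from the group means, standard deviations and sample sizes with a t-based 95% confidence interval; Cronbach’s alpha requires the individual item responses, which are of course not reproduced here, so the function below would be applied to a participants-by-items matrix. This is a sketch assuming NumPy and SciPy are available:

    import numpy as np
    from scipy import stats

    def confidence_interval(mean, sd, n, level=0.95):
        """t-based confidence interval computed from summary statistics."""
        half_width = stats.t.ppf((1 + level) / 2, df=n - 1) * sd / np.sqrt(n)
        return mean - half_width, mean + half_width

    print(confidence_interval(40.4, 19.4, 414))  # occasional users, ~ (38.5, 42.3)
    print(confidence_interval(43.3, 18.6, 75))   # expert users, ~ (39.0, 47.6)

    def cronbach_alpha(item_scores):
        """item_scores: participants x items matrix of individual item scores."""
        x = np.asarray(item_scores, dtype=float)
        k = x.shape[1]
        item_variances = x.var(axis=0, ddof=1)
        total_variance = x.sum(axis=1).var(ddof=1)
        return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)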

The results for both user groups, shown in Fig. 3, indicate extraordinarily low SUS scores, expressing very negative subjective usability impressions of Wcs. The results were clearly confirmed by the outcome of another (ISO 9241-related) usability inventory (Isonorm, see Prümper and Anft 1997) that was applied jointly with the System Usability Scale in the sample. The highly significant positive correlation coefficients (Pearson, p < 0.001) of 0.78 (expert users) and 0.61 (occasional users) indicate convergent validity of the two measurement tools. The picture is completed by the results of a Heuristic Analysis targeted at identifying usability flaws in Wcs: a total of 48 findings was reported, with 6 findings classified as minor, 21 as severe and 21 as critical.
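The convergent-validity check itself is a plain Pearson correlation over paired per-participant scores; a minimal sketch (with invented placeholder values, not the study’s data) could look like this:

    from scipy import stats

    sus_scores     = [42.5, 35.0, 57.5, 40.0, 30.0, 52.5, 45.0, 25.0]
    isonorm_scores = [3.1,  2.4,  4.0,  2.9,  2.2,  3.8,  3.3,  2.0]

    r, p = stats.pearsonr(sus_scores, isonorm_scores)
    print(f"Pearson r = {r:.2f}, p = {p:.4f}")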

Fig. 3. Complemented UX metrics table for Wcs

Although the study reported here was summative, the negative evaluation results convinced the client to start a project to redesign Wcs. With the basic UX metrics table shown in Fig. 2 as a quantitative vantage point, two metrics, completion time for the core scenario and error rate, were added and the time frame was set to the release date of the next version of Wcs. At present, log file analyses are being carried out to determine benchmark values for completion time and error rates regarding the selected core scenarios of Wcs.

4 Discussion

In this paper we have introduced an artifact coined UX metrics table that has proven to be very helpful in real-world projects. Constructing a UX metrics table emphasizes the identification and tracking of UX metrics in the course of a project and supports a comprehensive understanding of the mutual dependencies between UX metrics. Linking UX metrics to concrete scenarios and personas provides the right level of resolution for their meaningful definition and measurement. Introducing UX metrics tables puts metrics into true effect and supports institutionalizing their use in organizations.

As lasting project artifacts, UX metrics tables connect the research, design and evaluation phases of a project and provide clear quantitative means to determine project success or failure. Travis (2011) argues: “That is the strength of metrics-based testing: it gives us the what of usability”. While we wholeheartedly agree, qualitative methods give us the why of usability and UX and help to make sense of available data. It is in combination with qualitative data that quantitative approaches permit significant insights. Design, however, should never degenerate into a compulsive focus on metrics and data, or, as Bowman (2009) put it in his now famous farewell letter to Google: “I won’t miss a design philosophy that lives or dies strictly by the sword of data”.