1 Introduction

Multimodal user interfaces (MMUI) allow the user to interact with the machine using natural communication modalities such as speech, pen, touch and gesture. This provides a more robust and stable solution than a single-modality interface, due to the mutual disambiguation inherent to an MMUI [7]. One of the most pervasive applications of MMUIs is in the accessibility and inclusion area, where studies [10, 31] show that they improve the user experience of disabled, elderly and less technologically-savvy users [25, 26].

The multimodal interaction scenario poses several challenges for designers and developers. It is not just the possibility of using different modalities to interact with the applications and devices, it is also the continuously changing plethora of modalities that are proposed, need to be tested and possibly supported by existing applications. Several modalities are already part of everyday activities, such as touch, but what about evolving modalities such as eye gazing or emerging approaches such as silent speech [14] or emotions?

In this article we discuss several aspects which we deem important to the design, development and evaluation of multimodal systems. These derive from our experience gathered from continued work on multimodal applications in the context of several projects, such as S4S, AAL4ALL and PaeLife.

The contribution of this article is a vision of the full multimodal application design and development cycle, to which we have contributed at different levels, along with our perspective on some of the key issues to address. At the onset of our proposals are concerns regarding how traditional methods need to be adapted to serve the more complex scenario of multimodality, and what needs to be proposed to tackle new challenges and provide users with the best possible experience of usable and useful applications.

In our discussion, we consider three aspects: (1) the system architecture, to flexibly support multimodality, not only to deploy applications, but also to support research in, for example, interaction modalities; (2) the design and development methodologies, to account for proper gathering and fulfilment of requirements, adapted to the target users; and (3) the evaluation, at the different stages of development, considering the increasing complexity of the application, its possibly distributed nature and the importance of context.

The focus of this article is not on a detailed description of all aspects, but on providing an integrated view of the full range of what we have been considering and adopting for the design, development and evaluation of multimodal applications, providing examples and, where applicable, directing the reader to additional literature. With this, we hope to show how a set of methods and tools can be put together to support research and development in multimodality in a wide range of scenarios. This is not to be understood as the only way to do it, nor are the methods presented as the best, but as possible instruments to serve a set of long-term research goals.

The remainder of this article is organised as follows: Sect. 2 briefly discusses multimodal interaction and presents our high-level research goals; Sect. 3 concerns the rationale and advantages deriving from adopting a multimodal architecture aligned with the W3C recommendations; Sect. 4 describes the adopted iterative user-centred development methodology; Sect. 5 explains the methods used for system evaluation, how they blend with the development methods and adapt to the different development stages; finally, Sect. 6 presents conclusions and ideas for further work.

2 Multimodal Interaction

Multimodal interaction research looks for more natural communication channels [24] and for ways to deal with certain context restrictions or user limitations (e.g., reduced motor and cognitive abilities as a result of ageing) by adding redundancy to the interfaces or by providing the chance to perform different tasks using the most suitable modality [3]. This can improve accessibility and user performance, but we are no longer designing for a fixed keyboard-and-mouse setting, and the essence of multimodal interaction raises different challenges if we want to harness its full potential.

First, designing multimodal interfaces requires following a set of principles that concern the applicability of the different modalities to specific tasks and data, and needs to consider how to perform modality combination and adaptation (e.g., to context) [32]. Second, interaction modalities are often improved, and new modalities are proposed, which need to be tested in a context that favours a perception of their real potential and flaws. Third, any application should be designed with a strong focus on its potential users and application contexts and be subject to thorough evaluation. For multimodal applications it is particularly relevant to pay attention to how different modalities might interact, or how cognitive load or task complexity might influence performance [35] or modality choice. Finally, developing a multimodal application that includes many features and interaction modalities can be a complex task, and its modularity might enable parallel development efforts.

Each of these aspects is a challenge in itself and contributions to each are required. When addressing research on multimodality, we consider a set of high-level goals:

  • Specifically address the particularities of the target user groups (e.g., elderly) and contexts;

  • Develop and improve interaction modalities, with a particular emphasis on speech related interaction given its importance for human communication and usefulness for interaction with small/vanishing devices;

  • Develop multimodal systems for different devices and application scenarios;

  • Foster evaluations that account for the maturity level of the application, for the characteristics of the end-users and for multimodal interaction and the context in which it should happen;

  • Collaboratively develop complex applications, e.g., by different partners in a research project.

Based on these high-level goals, we made a set of choices, adopting or evolving methods already described in the literature, and started research on aspects we felt were not adequately covered by the state of the art, as detailed in the following sections.

3 Architecture

Several approaches to supporting multimodal interaction have been proposed in the literature, such as Mudra [16] or HephaisTK [11]. One notable effort we have been following attentively is that of the World Wide Web Consortium (W3C). The W3C recommendation for multimodal architectures [4] defines four major components of a multimodal system (as depicted in Fig. 1) and specifies how the communication between the components and data modules should work. Notable modules in the architecture are the modalities and the interaction manager. The W3C recommendations, even though originally proposed for web scenarios, have the potential to support a wider range of applications, as we advocated in Teixeira et al. [39]. Therefore, we adopted this view and extended it to the general multimodal interaction scenario, encompassing mobile devices (e.g., smartphones, tablets) and different application contexts, e.g., AAL [36]. This is enabled by the versatile nature of the architecture and provides a direct answer to a significant part of the envisaged requirements, easing the creation and integration of new modules or their improvements.

Having a standard multimodal architecture thus helps application developers avoid the impractical situation of having to master each individual modality technology, which is particularly problematic as the number of technologies usable for multimodal interaction is increasing rapidly. The standard architecture lets experts develop standalone components [9] that can be used in a common way.

Fig. 1. The W3C multimodal architecture diagram depicting its main components.
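
To illustrate how the interaction manager decouples modalities, the following minimal Python sketch routes W3C MMI-style life-cycle events (e.g., StartRequest, DoneNotification) between loosely coupled components. The event names follow the W3C life-cycle vocabulary, but the classes, handlers and routing policy are purely illustrative, not part of the recommendation or of our framework.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class LifeCycleEvent:
    """Simplified W3C MMI life-cycle event (e.g., StartRequest, DoneNotification)."""
    name: str     # life-cycle event name
    context: str  # identifier shared by all events of one interaction
    source: str   # component that emitted the event
    target: str   # addressee ("IM" for the interaction manager itself)
    data: dict = field(default_factory=dict)

class InteractionManager:
    """Routes life-cycle events between loosely coupled modality components."""

    def __init__(self) -> None:
        self._modalities: Dict[str, Callable[[LifeCycleEvent], None]] = {}

    def register(self, name: str, handler: Callable[[LifeCycleEvent], None]) -> None:
        # Modalities (speech, touch, GUI, ...) plug in at runtime
        # instead of being hard-coded into the application.
        self._modalities[name] = handler

    def dispatch(self, event: LifeCycleEvent) -> None:
        if event.target == "IM":
            self._handle(event)
        elif event.target in self._modalities:
            self._modalities[event.target](event)

    def _handle(self, event: LifeCycleEvent) -> None:
        # Illustrative policy: when the speech modality reports a result,
        # notify the GUI so it can update its view.
        if event.name == "DoneNotification" and event.source == "speech":
            self.dispatch(LifeCycleEvent("ExtensionNotification", event.context,
                                         "IM", "gui", event.data))

im = InteractionManager()
im.register("gui", lambda e: print("GUI received:", e.name, e.data))
im.dispatch(LifeCycleEvent("DoneNotification", "ctx-1", "speech", "IM",
                           {"text": "show news"}))
```

Because the interaction manager only sees events, a refined speech modality can replace an older one without touching the application logic, which is precisely what eases the incremental improvements mentioned above.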

This architecture has already been tested as the basis for the development of a multimodal personal assistant application [36], in the scope of project PaeLife, involving the development of different modules (messaging services, agenda, weather report, news) by multiple European partners and supporting speech interaction in multiple languages (Portuguese, French, Hungarian, Polish and English). The adoption of this architecture allowed a collaborative effort by all partners and the seamless integration of all modules, including increasingly refined versions of the speech modality [1, 38].

Another line of research we are following, supported by this architecture, is multimodal multi-device application development [2]. It consists of interacting with an application using more than one device, with each device providing a set of interaction modalities and presenting the user with the same or complementary views of the application. In a particular instantiation of this concept, in project PaeLife, the personal assistant application can be accessed through a tablet and, when near the television, use its display to present detailed news content while the user keeps browsing the news list on the tablet. When the user moves away, all interaction and output are performed using the tablet. Figure 2 shows different ways of accessing the same application using two devices: a TV and a tablet.

Fig. 2. Using a news reader in a multi-device setting. The two devices can present: (a) the same content; (b) content and navigation pane; or (c) detail and full content.
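
The coordination behind this behaviour can be summarized as a view-assignment policy. The Python sketch below is a hypothetical reduction of the proximity-driven behaviour described above; the function, view names and policy are ours, not the PaeLife application's API.

```python
from enum import Enum

class View(Enum):
    FULL = "full content"        # news list plus detail on a single device
    LIST = "navigation pane"     # news list only
    DETAIL = "detailed content"  # full article text

def assign_views(near_tv: bool) -> dict:
    """Decide which view each device renders (cf. Fig. 2).

    Hypothetical policy: near the TV, the tablet keeps the navigation
    pane while the TV shows the detailed content; away from the TV,
    the tablet presents everything.
    """
    if near_tv:
        return {"tablet": View.LIST, "tv": View.DETAIL}
    return {"tablet": View.FULL}

# The application re-evaluates the assignment whenever proximity changes.
print(assign_views(near_tv=True))   # tablet browses, TV shows detail
print(assign_views(near_tv=False))  # everything back on the tablet
```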

4 Design and Development

The adoption of the architecture described in the previous section defines the organization of the different components required to develop a multimodal application and provides the structure to support research at its different levels, but how can we use it to develop applications tailored to specific audiences and scenarios?

First of all, it is our view that interaction and interaction modalities need to be developed considering real application contexts, which allow the definition of realistic requirements and the assessment of their performance [12]. Therefore, we do not separate modality development and tuning from applications [1]. This view benefits from the adopted architecture, since any developed modality is not hard-coded into the application but is a module that can be reused in any of our other applications.

Since one of our goals is dealing with different age groups, particularly those presenting strong heterogeneity [34], whether age-related or deriving from other disabilities impeding communication and interaction, it is important to adopt a methodology that includes the end-users in the whole process. Furthermore, given the complexity of the envisaged systems and interaction modalities, it makes sense to have multiple development stages [19] and to assess progress along the way, to guarantee that the system evolves towards the defined requirements and is usable and useful for its users. Therefore, inheriting from user-centred design (UCD), we adopted an iterative, user-centred methodology aligned with Martins et al. [21]. After the requirements are obtained (phase 1), a prototype is proposed (phase 2) and evaluated (phase 3), in order to refine the requirements. This iterative methodology continues with additional prototypes and evaluations towards an increasingly refined application. In this methodology, the prototype works as a mediator of the dialogue between the developers and the end-users, used to gather feedback and to refine and elicit requirements.

The first requirements are gathered based on Personas and context scenarios [8]. From these, a set of requirements is chosen for the first application prototype, and its evaluation yields information that allows refining existing requirements and identifying new ones. These, possibly together with a few more from the original requirements list, depending on the complexity of the problems identified or refinements needed, become the requirements for the next prototype, and a new development iteration is performed.

Note that, since this methodology is grounded on fast prototyping, and significant additions or changes might be required from one prototype to the next, the adopted architecture plays a key role in reducing the development effort. Its modularity decouples the different aspects (modalities, fusion, graphical user interface, etc.), minimizing the cascade effect when changes are required.

Examples of applications developed by adopting this methodology are the Medication Assistant [13], an application devoted to addressing the different factors contributing to medication non-adherence in the elderly; Trip4All [33], a gamified tourism application that provides users with information; and Telerehab [37], a telerehabilitation service that allows a patient to perform a remote physiotherapy session supervised by a physiotherapist.

5 Evaluation

The previous section already presented evaluation as intrinsic to the adopted iterative design and development methodology, but how should this evaluation be performed?

The design and development of complex multimodal systems, working on multiple devices and deployed in dynamic environments, poses several challenges. Beyond the technical aspects, designing the user experience in this context is far from simple. At this level, tasks and interaction modalities cannot be regarded as isolated phenomena. For example, the simultaneous use of several modalities, as a result of a more complex use of the system, might result in sensory overload; or particular modalities, which in the abstract seem suitable options, may be disregarded in some (e.g., stressful) situations. Furthermore, these concerns are particularly relevant when the target users might present some level of disability, physical or cognitive, which directly influences how they use the system: an audio warning might not be heard by a user with a hearing disability, or overlapping tasks might leave the user disoriented. Therefore, integrating proper evaluation into the development cycles, covering different contexts of use and complex tasks, running in the intended (real or simulated) environment, is of paramount importance and should be introduced early on as a tool to support the development of such systems.

In Martins et al. [21] a method is described that reflects this need to intertwine evaluation with iterative design and development; it considers three phases: conceptual validation, prototype test, and pilot test. The first phase, conceptual validation, aims to determine if the idea of a system is sustainable in terms of interface and functions. In the prototype test, the second phase, the goal is to collect information regarding usability and user satisfaction. At this phase there is already a physical implementation of the system prototype, which can be tested by users. The prototype test is conducted in a controlled environment and can be repeated as many times as judged necessary, e.g., to fulfil the defined requirements. The goal of the third phase, the pilot test, is to evaluate, in addition to usability and satisfaction, the meaning that a system has in users' lives. For this reason, this last phase differs from the prototype test in the context where it happens: the system should be installed in users' homes and integrated into their daily life routines.

It has been reported in the literature that users tend to increase their use of multimodality as cognitive load or task difficulty increases [27], and that context plays an important role in how systems are used and modalities selected [42]. The advantages of deploying systems in the field are also an important aspect of evaluating multimodal systems [5, 30] and might be an entry point to the long-term assessment of user experience, as advocated, for example, by Ickin et al. [17] and Wechsung [40]. Therefore, adding complexity, context and naturalness to usability evaluation seems an important route to follow. Nonetheless, even though some of the usual usability evaluation approaches can be used [40], accounting for all the environmental factors is not a simple task and might profit from a supporting framework.

Facing these issues, and considering evaluation scenarios such as that of a telerehabilitation application [37], or scenarios where evaluating the system during context changes is important, such as multi-device settings, we have proposed Dynamic Evaluation as a Service (DynEaaS) [28], an evaluation platform providing the means to evaluate user performance in dynamic environments by allowing evaluation teams to create and conduct context-aware evaluations. The platform allows evaluators to specify evaluation plans that are triggered at precise timings, or only when certain conditions are met, thus gathering better contextualized data. DynEaaS follows a distributed paradigm, allowing the evaluator to run multiple evaluations at different locations simultaneously. At each location, the plan is instantiated and applied taking into account user preferences, the current context and the environment itself. When applying the plan, DynEaaS constantly evaluates the current context and chooses the best-suited conditions to interact with the user. Results are synchronized in real time; by having access to them, the evaluator is able to analyse current data, get a better grasp of the evaluation's current status and, if required, make small changes to it (Fig. 3).

Fig. 3. DynEaaS allows the instantiation of local nodes in each of the envisaged evaluation contexts, based on evaluation plans defined by the evaluator, and adapts the application of the defined evaluation tools to the local ecosystem.
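
To make the triggering idea concrete, the following Python sketch shows a heavily simplified, hypothetical version of a context-gated evaluation plan: a question fires only when the observed context satisfies its condition. The actual DynEaaS plan format and API are not described here, so all names and structures below are illustrative.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PlanItem:
    """One step of an evaluation plan: a question gated by a context condition."""
    question: str
    condition: Callable[[dict], bool]  # predicate over the current context

# Illustrative plan: ask about a feature right after it was used,
# but only while the user is not otherwise engaged.
plan: List[PlanItem] = [
    PlanItem("How easy was it to add a medication reminder?",
             lambda ctx: ctx.get("last_event") == "reminder_added"
                         and not ctx.get("user_busy", False)),
]

def on_context_change(context: dict) -> None:
    """Called whenever the local node observes a context update."""
    for item in plan:
        if item.condition(context):
            print("ASK:", item.question)  # stand-in for the questionnaire UI

on_context_change({"last_event": "reminder_added", "user_busy": False})  # fires
on_context_change({"last_event": "reminder_added", "user_busy": True})   # suppressed
```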

The major difference of DynEaaS, when compared to other evaluation frameworks, such as those proposed by Navarro et al. [22], Ickin et al. [17] and Witt [44], is that it specifically addresses the context of use and emphasizes the need to collect the data at the best possible time, or at least to contextualize it as well as possible. For example, it makes far more sense to ask users about an application feature right after they have used (or had problems with) it than to ask the same questions at the end of the evaluation session, when most of the impressions have probably faded; likewise, it might not be a good time to enrol users in providing feedback if they are leaving for an appointment. Furthermore, by using ontologies, DynEaaS is highly flexible and can be used in different domains without core changes.

Another important aspect, beyond the stages at which evaluation is performed and the support framework described above, is which methods to use to actually measure quality of service (QoS) and quality of experience (QoE); Wechsung et al. [41] propose a taxonomy of the factors defining each of these measures. Usability questionnaires are also an important evaluation tool, and a set of works has assessed the applicability of existing questionnaires (AttrakDiff [15], System Usability Scale (SUS) [6], USE [20] and QUESI [23]) to the evaluation of multimodal systems [18].
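
Among these, SUS has a particularly simple and well-documented scoring rule, shown below as a short Python sketch. The scoring itself is the standard one [6]; the function name and example responses are ours.

```python
def sus_score(responses):
    """Compute the System Usability Scale score (0-100) from ten 1-5 responses.

    Standard SUS scoring: odd-numbered (positively worded) items contribute
    (response - 1); even-numbered (negatively worded) items contribute
    (5 - response); the summed contributions are scaled by 2.5.
    """
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum(r - 1 if i % 2 == 0 else 5 - r  # 0-based index: even index = odd item
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 1, 5, 1, 5, 1, 5, 1, 5, 1]))  # 100.0, the best possible rating
```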

Despite the great number of usability questionnaires, none of them adequately addresses user functionality when interacting with technological solutions: existing questionnaires are technology-oriented instead of user-centred. To address this issue, members of our team [21] proposed assessment tools based on the International Classification of Functioning, Disability and Health (ICF) [43], addressing individuals' functionality and assessing environmental factors according to an ICF approach. The ICF brought the concepts of functionality and disability into a multidimensional understanding of human functioning, spanning biological, psychological, social and environmental dimensions. The surrounding environment is crucial, in multimodal systems, for the attenuation or elimination of disability; in the ICF, an environmental factor is classified as a facilitator if it contributes to increasing users' performance and participation.

Technologies, including multimodal systems, should therefore be considered environmental factors in an ICF approach. Accordingly, the ICF may serve as a conceptual model for the holistic development of a methodology for the evaluation of environmental factors and, consequently, of multimodal systems. The assessment tools were created based on the first qualifier of the ICF environmental factors. Using the ICF as a framework to develop instruments for the evaluation of environmental factors allows the terminology, concepts and coded information to be aggregated with the available information; the ICF can also be used as a comprehensive model to characterize users and their contexts, activities and participation [21]. Applying these tools at the proper time, in the relevant context, maximizes their utility, and their integration with DynEaaS has already been carried out [29].
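
As a small illustration of the first-qualifier notation, the sketch below renders an ICF environmental-factor code with its qualifier, following the ICF convention that barriers are written with a point and facilitators with a plus sign. The code e1250 and the function are purely illustrative and are not the assessment tools of [21].

```python
def icf_qualifier(code: str, impact: int, facilitator: bool) -> str:
    """Render an ICF environmental-factor code with its first qualifier.

    ICF convention: barriers use a point (e.g., e1250.3, a severe barrier)
    and facilitators a plus sign (e.g., e1250+2, a moderate facilitator);
    a magnitude of 0 means no barrier or facilitator.
    """
    if not 0 <= impact <= 4:
        raise ValueError("first-qualifier magnitude ranges from 0 to 4")
    separator = "+" if facilitator and impact > 0 else "."
    return f"{code}{separator}{impact}"

# 'e1250' is used purely for illustration; the factor being coded here
# would be the multimodal system under evaluation.
print(icf_qualifier("e1250", 2, facilitator=True))   # e1250+2
print(icf_qualifier("e1250", 3, facilitator=False))  # e1250.3
```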

6 Conclusions

This article provides an overview of our approach to the design and development of multimodal applications, covering the full cycle, from architecture definition to evaluation. In the different lines of work involved there is still room for improvement. On the subject of the multimodal architecture, we are currently exploring how to integrate fusion of events into our multimodal framework and how to dynamically discover and register new interaction modalities.

In terms of the evaluation process, more research is needed to consolidate the ICF evaluation method. However, despite the operational difficulties of evaluation using the ICF as a conceptual framework, it is still an added value because it focuses the assessment on users' functionality. The ICF seems useful for identifying what to change in the product and what to consider good practice.

The use of ontologies in DynEaaS opens the door to automatic data evaluation, capable of triggering new questions based on domain ontologies. Such a feature would enable evaluation plans to inquire of the user without the evaluator specifically setting the questions: the evaluator would simply indicate a domain ontology from which DynEaaS would extract knowledge and combine it with already gathered data to enhance the evaluation plan on its own. On the subject of evaluation, the use of DynEaaS paves the way to improved in-context evaluations, but also brings forward an infrastructure that might be used to gather data for continuously measuring user performance and detecting changes in behaviour. These might be due, for example, to environmental changes, or be a sign of difficulties in dealing with the system or with particular features. Proper handling of such information can lead to improved adaptability of the system [44].

To conclude, we do not claim this to be the only (or the best) possible approach. Instead, we aim to provide an integrated view of the whole pipeline currently in use, along with the rationale supporting our choices. Along the way, we refer to concrete examples that have been put together using these same methods and discuss where they can or need to evolve and where the literature provides further information.