1 Introduction

A large number of visually impaired people use state-of-the-art technology to perform tasks in their everyday lives. Such technologies consist of electronic devices equipped with sensors and processors capable of making “intelligent” decisions. Various feedback devices are then used to communicate results effectively. One of the most important and challenging tasks in developing such technologies is to create a user interface that is appropriate for the sensorimotor capabilities of blind users, both in terms of providing input and interpreting output feedback. Today, the use of commercially available mobile devices shows great promise in addressing such challenges. Besides being equipped with increasing computational capacity and sensor capabilities, these devices generally also provide standardized possibilities for touch-based input and perceptually rich auditory-tactile output. As a result, the largest and most widespread mobile platforms are rapidly evolving into de facto standards for the implementation of assistive technologies.

The term “assistive technology” in general is used within several fields where users require some form of assistance. Although these fields can be diverse in terms of their scope and goals, user safety is always a key issue. Therefore, besides augmenting user capabilities, ensuring their safety and well-being is also of prime importance. Designing navigational aids for visually impaired people is an exemplary case where design decisions must in no way detract from users’ awareness of their surroundings through natural channels.

In this paper, our goal is to provide an overview of theoretical and practical solutions to the challenges faced by the visually impaired in various domains of everyday life. Recent developments in mobile technology are highlighted, with the goal of setting out an agenda for future research in CogInfoCom-supported assistive technologies. The paper is structured as follows. In Sect. 2, a summary is provided of basic auditory and haptic feedback techniques in general, along with solutions developed in the past decades both in academia and for the commercial market. In Sect. 3, an overview is given of generic capabilities provided by state-of-the-art mobile platforms which can be used to support assistive solutions for the visually impaired. It is shown that both the theoretical and practical requirements relating to assistive applications can be addressed in a unified way through the use of these platforms. A cross-section of trending mobile applications for the visually impaired on the Android and iOS platforms is provided in Sect. 4. Finally, an agenda is set out for future research in this area, ranging from basic exploration of perceptual and cognitive issues, to the development of improved techniques for prototyping and evaluation of sensor-bridging applications with visually impaired users.

2 Overview of auditory and haptic feedback methods

The research fields dealing with sensory interfaces between users and devices often acknowledge a certain duality between iconic and more abstract forms of communication. In a sense, this distinction is intuitive, but its value in interface design has also been experimentally justified with respect to different sensory modalities.

We first consider the case of auditory interfaces, in which the icon-message distinction is especially strong. Auditory icons were defined in the context of everyday listening as “caricatures of everyday sounds” [13]. This was the first generalization of David Canfield-Smith’s original visual icon concept [4] to modalities other than vision, through a theory that separates ‘everyday listening’ from ‘musical listening’. Briefly expressed, Gaver’s theory distinguishes cases where sounds are interpreted with respect to their perceptual-musical qualities from cases where they are interpreted with respect to a physical context in which the same sound is generally encountered. As an example of the latter case, the sound of a door being opened and closed could for instance be used as an icon for somebody entering or leaving a virtual environment. Earcons were defined by Blattner, Sumikawa and Greenberg as “non-verbal audio messages used in the user–computer interface to provide information to the user about some computer object, operation, or interaction” [5]. Although this definition does not in itself specify whether the representation is iconic or message-like, the same paper offers a distinction between ‘representational’ and ‘abstract’ earcons, thus acknowledging that such a duality exists. Today, the term ‘earcon’ is used exclusively in the second sense, as a concept that is complementary to the iconic nature of auditory icons. As a parallel example to the auditory icon illustration described above, one could imagine a pre-specified but abstract pattern of tones to symbolize the event of someone entering or leaving a virtual space. Whenever a data-oriented perspective is preferred, as in transferring data to audio, the term ‘sonification’ is used, which refers to the “use of non-speech audio to convey information or perceptualize data” [6] (for a more recent definition, the reader is referred to [7]).

Since the original formulation of these concepts, several newer kinds of auditory representations have emerged. For an overview of novel representations—including representations with speech-like and/or emotional characteristics (such as spearcons, spindexes, auditory emoticons and spemoticons), as well as representations used specifically for navigation and alerting information (such as musicons and morphocons)—the reader is referred to the overview provided in [8].

In the haptic and tactile domains, a distinction that is somewhat analogous to the auditory domain exists between iconic and message-like communications, although not always in a clearly identifiable sense. Thus, while MacLean and Enriquez suggest that haptic icons are conceptually closer to earcons than auditory icons, in that they are message-like (“our approach shares more philosophically with [earcons], but we also have a long-term aim of adding the intuitive benefits of Gaver’s approach...”) [9], the same authors in a different publication write that “haptic icons, or hapticons, [are] brief programmed forces applied to a user through a haptic interface, with the role of communicating a simple idea in manner similar to visual or auditory icons” [10]. Similarly, Brewster and Brown define tactons and tactile icons as interchangeable terms, stating that both are “structured, abstract messages that can be used to communicate messages non-visually” [11]. Such a view is perhaps tenable as long as no experimentally verifiable considerations suggest that the two terms should be regarded as referring to different concepts. Interestingly, although the terms ‘haptification’ and ‘tactification’ have been used in analogy to visualization and sonification, very often these terms arise independently of any data-oriented approach (i.e., it is the use of haptic and tactile feedback—as opposed to no feedback—that is referred to as haptification and tactification).

In both the auditory and haptic/tactile modalities, the task of creating and deploying useful representations is a multifaceted challenge which often requires a simultaneous reliance on psychophysical experiments and trial-and-error based techniques. While the former source of knowledge is important in describing the theoretical limits of human perceptual capabilities—in terms of e.g. just-noticeable-differences (cf. e.g. [12–14]) and other factors—the latter can be equally important, as the specific characteristics of the devices employed and the particular circumstances in which an application is used can rarely be controlled for in advance. This is well demonstrated by the large number of auditory and haptic/tactile solutions which have appeared in assistive technologies in the past decades, and by the fact that, despite huge differences among them, many have been used with a significant degree of success. An overview of relevant solutions is provided in the following section.

2.1 Systems in assistive engineering based on tactile solutions

Historically speaking, solutions supporting vision through the tactile modality appeared earlier than audio-based solutions; therefore, a brief discussion of tactile-only solutions is provided first.

Systems that substitute tactile stimuli for visual information generally translate images from a camera into electrical or vibrotactile stimuli, which can then be applied to various parts of the body (including the fingers, the palm, the back or the tongue of the user). Experiments have confirmed the viability of this approach in supporting the recognition of basic shapes [15, 16], as well as reading [17, 18] and localization tasks [19].

Several ideas for such applications have achieved commercial success. An early example of a device that supports reading is the Optacon, which operates by transcoding printed letters onto an array of vibrotactile actuators in a \(24 \times 6\) arrangement [17, 20, 21]. While the Optacon was relatively expensive at a price of about 1500 GBP in the 1970s, it allowed for reading speeds of 15–40 words per minute [22] (others have reported an average of about 28 wpm [23], while the variability of user success is illustrated by the fact that one of the authors of the current paper knew and observed at least two users with Optacon reading speeds of over 80 wpm). A camera extension to the system was made available to allow for the reading of on-screen material.

An arguably more complex application area of tactile substitution for vision is navigation. This can be important not only for the visually impaired, but also for applications in which users are subjected to significant cognitive load (as evidenced by the many solutions for navigation feedback in both on-the-ground and aerial navigation [24–26]).

One of the first commercially available devices for navigation was the Mowat sensor, from Wormald International Sensory Aids, which is a hand-held device that uses ultrasonic detection of obstacles and provides feedback in the form of tactile vibrations that are inversely proportional to distance. Another example is Videotact, created by ForeThought Development LLC, which provides navigation cues through 768 titanium electrodes placed on the abdomen [27]. Despite the success of such products, newer ones are still being developed so that environments with increasing levels of clutter can be supported at increasing levels of comfort. The former goal is supported by the growing availability of (mobile) processing power, while the latter is supported by the growing availability of unencumbered wearable technologies. A recent example of a solution which aims to make use of these developments is a product of a company by the name “Artificial Vision For the Blind”, which incorporates a pair of glasses from which haptic feedback is transmitted to the palm [28, 29].

The effective transmission of distance information is a key issue whenever depth information has to be communicated to users. The distance values to be represented are usually mapped, proportionally or inversely proportionally, onto tactile and auditory attributes such as frequency or the spacing between impulses. The distances reproduced usually range from 1 to 15 m. Instead of a continuous representation of distance, discrete distance levels can be used by defining areas as being e.g. “near” or “far”.
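
As a rough illustration of these two mapping strategies, the following sketch (in Python, with illustrative constants that are not taken from any particular device) shows a continuous mapping of distance onto the pause between vibration pulses, alongside a discretized “near/mid/far” alternative:

```python
def pulse_interval_s(distance_m, d_min=0.3, d_max=15.0,
                     t_min=0.05, t_max=1.0):
    """Continuous mapping: the pause between vibration pulses grows
    linearly with distance, so closer obstacles pulse faster."""
    d = max(d_min, min(d_max, distance_m))
    ratio = (d - d_min) / (d_max - d_min)
    return t_min + ratio * (t_max - t_min)


def distance_zone(distance_m, near_m=2.0, far_m=8.0):
    """Discrete alternative: collapse the same reading into coarse zones."""
    if distance_m < near_m:
        return "near"
    if distance_m < far_m:
        return "mid"
    return "far"
```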

2.2 Systems in assistive engineering based on auditory solutions

In parallel to tactile feedback solutions, auditory feedback has also increasingly been used in assistive technologies oriented towards the visually impaired. Interestingly, it has been remarked that both the temporal and the frequency-based resolution of the auditory sensory system are higher than the resolution of somatosensory receptors along the skin [30]. For several decades, however, this potential advantage of audition over touch was difficult to exploit due to limitations in processing power. For instance, given that the sound information presented to users has to be synchronized with the frame rate at which new data is read, limitations in visual processing power ultimately limited the precision of the feedback as well. Even today, frame rates of about 2–6 frames per second (fps) are commonly used, despite the fact that modern camera equipment is easily capable of capturing 25 fps, and that human auditory capabilities would be well suited to interpreting more information.

Two early auditory systems—both designed to help blind users with navigation and obstacle detection—are SonicGuide and the LaserCane [31]. SonicGuide uses a wearable ultrasonic echolocation system (in the form of a pair of eyeglass frames) to provide the user with cues on the azimuth and distance of obstacles [32]. Information is provided to the user by directly mapping the ultrasound echoes onto audible sounds in both ears (one ear can be used if preferred, leaving the other free for the perception of ambient sounds).

The LaserCane system works similarly, although its interface involves a walking cane and it uses infrared instead of ultrasound signals in order to detect obstacles that are relatively close to the cane [33]. It projects beams in three different directions in order to detect obstacles that are above the cane (and thus possibly in front of the chest of the user), in front of the cane at a maximum distance of about 12 ft, and in front of the cane in a downward direction (e.g., to detect curbs and other discontinuities in terrain surface). Feedback to the user is provided using tactile vibrations for the forward-oriented beam only (as signals in this direction are expected to be relatively more frequent), while obstacles from above and from the terrain are represented by high-pitched and low-pitched sounds, respectively.

A similar, early approach to obstacle detection is the Nottingham Obstacle Detector, which works using a \(\sim \)16 Hz ultrasonic detection signal and is somewhat of a complement to the Mowat sensor (it is also handheld and also supports obstacle detection based on ultrasound, albeit with audio feedback). Eight gradations of distance are assigned to the notes of a musical scale. However, as the system is handheld, the angle at which it is held relative to the horizontal plane is important; since blind users have been shown to have lower awareness of limb positions, this is a drawback of the system [34].
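
A discretized mapping of this kind can be sketched as follows. The note frequencies are standard equal-temperament values for a C major scale, while the distance range and the direction of the mapping (nearer obstacles to lower notes) are assumptions made for illustration only, not the actual parameters of the Nottingham Obstacle Detector:

```python
# C major scale from C4 to C5 (equal temperament), in Hz.
SCALE_HZ = [261.63, 293.66, 329.63, 349.23, 392.00, 440.00, 493.88, 523.25]


def note_for_distance(distance_m, max_range_m=4.0):
    """Quantize a distance reading into one of eight bands and return
    the frequency of the scale note assigned to that band."""
    d = max(0.0, min(max_range_m, distance_m))
    band = min(len(SCALE_HZ) - 1, int(d / max_range_m * len(SCALE_HZ)))
    return SCALE_HZ[band]
```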

More recent approaches are exemplified by the Real-Time Assistance Prototype (RTAP) [35], which is camera-based and more sophisticated in the kinds of information it conveys. It is equipped with stereo cameras mounted on a helmet, a portable laptop running Windows and small stereo headphones. Its disadvantages are the limited panning area of \(\pm \)32\(^\circ \), the lack of a wireless connection, the laptop-sized central unit, the use of headphones that block out the outside world, low resolution in distance and a “sources too close” perception during binaural rendering, and the inability to detect objects at ground level. The latter can be addressed by a wider field of view, or by using the equipment as a complement to the white cane, which remains responsible for detecting steps, stones, etc. The RTAP uses 19 discrete levels for distance. It also provides the advantage of explicitly representing the lack of any obstacle within a certain distance, which can be very useful in reassuring the user that the system is still in operation. Further, based on the object classification capabilities of its vision system, the RTAP can filter objects based on importance or proximity. Tests conducted using the system have revealed several important factors in the success of assistive solutions. One such factor is the ability of users to remember auditory events (for example, even as they disappear and reappear as the user’s head moves). Another important factor is the amount of training and the level of detail with which the associated training protocols are designed and validated.

A recent system with a comparable level of sophistication is the System for Wearable Audio Navigation (SWAN), which was developed to serve as a safe pedestrian navigation and orientation aid for persons with temporary or permanent visual impairments [36, 37]. SWAN consists of an audio-only output and a tactile input via a dedicated handheld interface device. Once the user’s location and head direction are determined, SWAN guides the user along the required path using a set of beacon sounds, while at the same time indicating the location of features in the environment that may be of interest to the user. The sounds used by SWAN include navigation beacons (earcon-like sounds), object sounds (through spatially localized auditory icons), surface transitions, and location information and announcements (brief prerecorded speech samples).

General-purpose systems for visual-to-audio substitution (e.g. with the goal of allowing pattern recognition, movement detection, spatial localization and mobility) have also been developed. Two influential systems in this category are the vOICe system developed by Meijer [38] and the Prosthesis Substituting Vision with Audition (PSVA) developed by Capelle and his colleagues [31]. The vOICe system translates the vertical dimension of images into frequency and the horizontal dimension of images into time, and has a resolution of 64 pixels \(\times \) 64 pixels (more recent implementations use larger displays [30]). The PSVA system uses similar concepts, although time is neglected and both dimensions of the image are mapped to frequencies. Further, the PSVA system uses a biologically more realistic, retinotopic model, in which the central, foveal areas are represented in higher resolution (thus, the model uses a periphery of 8 \(\times \) 8 pixels and a foveal region in the center of 8 \(\times \) 8 pixels).
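
The column-scan principle behind the vOICe can be summarized in a short sketch: each image column is rendered as a brief time slice in which row position controls the frequency of a sine partial and pixel brightness controls its amplitude. The scan duration, frequency range and sample rate below are placeholder values, not the parameters of the actual system:

```python
import numpy as np


def column_scan_sonify(image, scan_s=1.0, f_lo=500.0, f_hi=5000.0, sr=22050):
    """Render a grayscale image (rows x cols, values in [0, 1]) as a
    left-to-right soundscape: column index -> time, row -> frequency,
    brightness -> amplitude of the corresponding sine partial."""
    rows, cols = image.shape
    samples_per_col = int(scan_s * sr / cols)
    t = np.arange(samples_per_col) / sr
    freqs = np.linspace(f_hi, f_lo, rows)  # top rows -> higher pitch
    slices = []
    for c in range(cols):
        column = image[:, c]
        slice_ = np.zeros(samples_per_col)
        for r in range(rows):
            if column[r] > 0:
                slice_ += column[r] * np.sin(2 * np.pi * freqs[r] * t)
        slices.append(slice_ / rows)  # crude normalization
    return np.concatenate(slices)
```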

2.3 Systems in assistive engineering based on auditory and tactile solutions

Solutions combining the auditory and tactile modalities have thus far been relatively rare, although important results and solutions have begun to appear in the past few years. Despite some differences between them, in many cases the solutions represent similar design choices.

HiFiVE is one example of a technically complex vision support system that combines sound with touch and manipulation [39–41]. Visual features that are normally perceived categorically are mapped onto speech-like (but non-verbal) auditory phonetics (analogous to spemoticons, only not emotional in character). All such sounds comprise three syllables which correspond to different areas in the image: two syllables for color, and one for layout. For example, “way-lair-roar” might correspond to “white-grey” and “left-to-right”. Changes in texture are mapped onto fluctuations of volume, while motion is represented through binaural panning. Guided by haptics and tactile feedback, users are also enabled to explore various areas of the image via finger or hand motions. Through concurrent feedback, information is presented on shapes, boundaries and textures.
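
The three-syllable structure can be illustrated with a toy lookup. Apart from the “way-lair-roar” example quoted above, the syllable tables below are invented placeholders and do not reflect HiFiVE’s actual vocabulary:

```python
# Placeholder syllable tables; only "way", "lair" and "roar" come from the
# example in the text, the remaining entries are invented for illustration.
COLOR_SYLLABLES = {"white": "way", "grey": "lair", "red": "ray", "blue": "bloo"}
LAYOUT_SYLLABLES = {"left-to-right": "roar", "top-to-bottom": "toam"}


def audio_token(color_a, color_b, layout):
    """Compose a speech-like, three-syllable token: two colour syllables
    followed by one layout syllable, e.g. 'way-lair-roar'."""
    return "-".join((COLOR_SYLLABLES[color_a],
                     COLOR_SYLLABLES[color_b],
                     LAYOUT_SYLLABLES[layout]))


# audio_token("white", "grey", "left-to-right") -> "way-lair-roar"
```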

The HiFiVE system has seen several extensions since it was originally proposed, and these have in some cases led to a relative rise to prominence of automated vision processing approaches. This has enabled the creation of so-called audiotactile objects, which represent higher-level combinations of low-level visual attributes. Such objects have auditory representations, and can also be explored through ‘tracers’—i.e., communicational entities which either systematically present the properties of corresponding parts of the image (‘area-tracers’), or convey the shapes of particular items (‘shape-tracers’).

The See ColOr system is another recent approach to combining auditory feedback with tactile interaction [42]. See ColOr combines modules for local perception, global perception, alerting and recognition. The local perception module uses various auditory timbres to represent colors (through a hue-saturation-level technique reminiscent of Barrass’s TBP model [43]), and rhythmic patterns to represent distance as measured on the azimuth plane. The global module allows users to pinpoint one or more positions within the image using their fingers, so as to receive comparative feedback relevant to those areas alone. The alerting and recognition modules, in turn, provide higher-level feedback on obstacles which pose an imminent threat to the user, as well as on “auditory objects” which are associated with real-world objects.

At least two important observations can be made based on these two approaches. The first observation is that whenever audio is combined with finger or hand-based manipulations, the tactile/haptic modality is handled as the less prominent of the two modalities—either in the sense that it is (merely) used to guide the scope of (the more important) auditory feedback, or in the sense that it is used to provide small portions of the complete information available, thereby supporting a more explorative interaction from the user. The second observation is that both of these multi-modal approaches distinguish between low-level sensations and high-level perceptions—with the latter receiving increasing support from automated processing. This second point is made clear in the See ColOr system, but it is also implicitly understood in HiFiVE.

2.4 Summary of key observations

Based on the overview presented in Sect. 2, it can be concluded that feedback solutions range from iconic to abstract, and from serial to parallel. Given the low-level physical nature of the tasks considered (e.g. recognizing visual characters and navigating), solutions most often apply a low-level, direct mapping of physical changes onto auditory and/or tactile signals. Such approaches can be seen as data-driven sonification (or, analogously, tactification) approaches. Occasionally, they are complemented by higher-level representations, such as the earcon-based beacon sounds used in SWAN, or the categorical verbalizations used in HiFiVE. From a usability perspective, approaches that rely mostly on direct physical metaphors, complemented by a few abstract ones, seem to have proven the most effective.

3 Generic capabilities of mobile computing platforms

Recent trends have led to the appearance of generic mobile computing technologies that can be leveraged to improve users’ quality of life. In terms of assistive technologies, this tendency offers an ideal working environment for developers who are reluctant to develop their own, specialized hardware/software configurations, or who are looking for faster ways to disseminate their solutions.

The immense potential behind mobile communication platforms for assistive technologies can be highlighted through various perspectives, including general-purpose computing, advanced sensory capabilities, and crowdsourcing/data integration capabilities.

3.1 General-purpose computing

The fact that mobile computing platforms offer standard APIs for general-purpose computing provides both application developers and users with a level of flexibility that is very conducive to developing and distributing novel solutions. As a result of this standardized background (which can already be seen as a kind of ubiquitous infrastructure), there is no longer any significant need to develop one-of-a-kind hardware/software configurations. Instead, developers are encouraged to use the “same language” and thus collaboratively improve earlier solutions. A good example of this is illustrated by the fact that the latest prototypes of the HiFiVE system are being developed on a commodity tablet device. As a result of the growing ease with which new apps can be developed, it is intuitively clear that a growing number of users can be expected to act upon the urge to improve existing solutions, should they have new ideas. This general notion has been formulated in several domains in terms of a transition from “continuous improvement” to “collaborative innovation” [44]; or, in the case of software engineering, from “software product lines” to “software ecosystems” [45].

3.2 Advanced sensory capabilities

State-of-the-art devices are equipped with advanced yet generic sensing capabilities that enable tight interaction with the environment. For instance, the Samsung Galaxy S5 (which appeared in 2014) has a total of 10 built-in sensors, and a similarly rich set of sensors is found in the vast majority of state-of-the-art mobile devices. Typically supported sensors include:

  • GPS receivers, which are very useful in outdoor environments, but have the disadvantages of being inaccurate, slow, at times unreliable, and unusable in indoor environments.

  • Gyro sensors, which detect the rotation state of the mobile device based on three axes (it is interesting to note that the gyroscopes found on most modern devices are sufficiently sensitive to measure acoustic signals of low frequencies, and through signal processing and machine learning, can be used as a microphone [46]).

  • Accelerometer sensors, which detect the movement state of the device based on three axes.

  • Magnetic sensors (compass), which have to be calibrated, and can be affected by strong electromagnetic fields.

The accuracy of such sensors generally depends on the specific device; however, they are already highly relevant to applications that integrate inputs from multiple sensors, and their accuracy will only improve in the coming years.
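
As an example of the kind of multi-sensor integration such applications rely on, the sketch below combines accelerometer and magnetometer readings into a tilt-compensated compass heading. It assumes an aerospace-style axis convention (x forward, y right, z down) and omits the gyroscope-based smoothing that a production implementation would add:

```python
import math


def tilt_compensated_heading(accel, mag):
    """Estimate the compass heading (degrees, 0 = magnetic north) from raw
    3-axis accelerometer and magnetometer readings in the device frame.
    Axis convention assumed here: x forward, y right, z down."""
    ax, ay, az = accel
    mx, my, mz = mag
    roll = math.atan2(ay, az)                    # rotation about x
    pitch = math.atan2(-ax, math.hypot(ay, az))  # rotation about y
    # De-rotate the magnetic vector into the horizontal plane.
    xh = (mx * math.cos(pitch)
          + my * math.sin(pitch) * math.sin(roll)
          + mz * math.sin(pitch) * math.cos(roll))
    yh = my * math.cos(roll) - mz * math.sin(roll)
    return math.degrees(math.atan2(-yh, xh)) % 360.0
```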

3.3 Crowdsourcing and data integration capabilities

A great deal of potential arises from the ability of mobile communication infrastructures to aggregate semantically relevant data that is curated in a semi-automated, but also crowd-supported way. This capability allows for low-level sensor data to be fused with user input and (semantic) background information in order to achieve greater precision in functionality.

A number of recent works highlight the utility of this potential, for example in real-time emergency response [47, 48], opportunistic data dissemination [49], crowd-supported sensing and processing [50], and many others. Crowdsourced ICT-based solutions are also increasingly applied in the assistive technology arena, as evidenced by several recent works on e.g. collaborative navigation for the visually impaired both in the physical world and on the Web [51–55]. Here, sensor data from the mobile platform (e.g., providing location information) are generally used to direct users to the information that is most relevant, or to the human volunteer who has the most knowledge with respect to the problem being addressed.

4 State-of-the-art applications for mobile platforms

Both Android and iOS offer applications that support visually impaired people in performing everyday tasks. Many of the applications can also be used to good effect by sighted persons. In this section, a brief overview is provided of various solutions on both of these platforms.

Evidence of the take-up of these applications can be seen on email lists specific to blind and visually impaired users [56–58], at exhibitions of assistive technology [59], and in specialist publications [60]. A further indication of the growth in mobile phone and tablet use by this population is the fact that the Royal National Institute of the Blind in the UK runs a monthly “Phone Watch” event at its headquarters in London.

4.1 Applications for Android

In this subsection, various existing solutions for Android are presented.

4.1.1 Text, speech and typing

Android includes a number of facilities for text-to-speech based interaction as part of its Accessibility Service. In particular, TalkBack, KickBack, and SoundBack are applications designed to help blind and visually impaired users by allowing them to hear and/or feel their selections on the GUI [61]. These applications are also capable of reading text out loud. The sound quality of TalkBack is relatively good compared to other screen readers for PCs; however, the proper language versions of SVOX must be installed in order to obtain the required quality, and some of these are not free. Furthermore, vibration feedback cannot be switched off, and the system sometimes reads superfluous information on the screen. Users have reported that during text messaging, errors can occur if the contact is not already in the address book. The software is only updated for the latest Android versions. Although all of these applications are preinstalled on most configurations, the freely available IDEAL Accessible App can be used to further enhance their functionality.

The Classic Text to Speech Engine includes a combination of over 40 male and female voices, and enables users to listen to spoken renderings of e.g. text files, e-books and translated texts [62]. The app also features voice support in key areas like navigation. In contrast to SVOX, this application is free, but with limited language support both for reading and the human-device user interface.

BrailleBack works together with the TalkBack app to provide a combined Braille and speech experience [63]. This allows users to connect a number of supported refreshable Braille displays to the device via Bluetooth. Screen content is then presented on the Braille display and the user can navigate and interact with the device using the keys on the display. Ubiquitous Braille based typing is also supported through the BrailleType application [64].

The Read for the Blind application is a community-powered app that allows users to create audio books for the blind, by reading books or short articles from magazines, newspapers or interesting websites [65].

The ScanLife Barcode and QR Reader [62] enables users to read barcodes and QR codes. This can be useful not only in supermarkets but in other locations as well, even with a relatively low-resolution camera of 3.2 MP; however, memory demands can be a problem.

The identification of colors is supported by applications such as the Color Picker or Seenesthesis [66, 67]. More generally, augmented/virtual magnifying glasses—such as Magnify—also exist to facilitate reading for partially sighted users [62]. Magnify only works adequately if it is used with a high-resolution camera.

4.1.2 Speech-based command interfaces

The Eyes-Free Shell is an alternative home screen or launcher for the visually impaired as well as for users who are unable to focus on the screen (e.g. while driving). The application provides a way to interact with the touch screen to check status information, launch applications, and direct dial or message specific contacts [68]. While the Eyes-Free Shell and the Talking Dialer screens are open, the physical controls on the keyboard and navigation arrows are unresponsive and different characters are assigned to the typing keys. Widgets can be used, the volume can be adjusted and users can find answers with Eyes-Free Voice Search.

Another application that is part of the Accessibility Service is JustSpeak [69]. The app enables voice control of the Android device, and can be used to activate on-screen controls, launch installed applications, and trigger other commonly used Android actions.

Some of these applications may also rely on TalkBack being enabled.

4.1.3 Navigation

Several Android apps can be used to facilitate navigation for users with different capabilities in various situations (i.e., both indoor and outdoor navigation in structured and unstructured settings).

Talking Location enables users to learn their approximate position through WiFi or mobile data signals by shaking the device [70]. Although often highly inaccurate, this can nevertheless be the only option for indoor navigation, where GPS signals are not available. The app lets users send SMS messages containing their location to friends, so that they can obtain help when needed. This is similar to the idea behind Guard My Angel, which sends SMS messages, and can also send them automatically if the user does not provide a “heartbeat” confirmation [71].

Several “walking straight” applications have been developed to facilitate straight-line walking [72, 73]. Such applications use built-in sensors (i.e. mostly the magnetic sensor) to help blind pedestrians.

Through the augmentation of mobile capabilities with data services comes the possibility of making combined use of GPS receivers, compasses and map data. WalkyTalky is one of the many apps created by the Eyes-Free Project that help blind people navigate by providing real-time vibration feedback whenever they are not moving in the correct direction [62]. The accuracy of the built-in GPS can be low, making it difficult to issue warnings within 3–4 m of accuracy; this limitation can be mitigated by connecting a better GPS receiver via Bluetooth.
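
The feedback logic behind such a “wrong direction” cue can be sketched as follows: the standard great-circle bearing to the next waypoint is compared against the user’s current heading, and a vibration is triggered when the deviation exceeds a tolerance (the 30-degree threshold is an arbitrary illustrative choice):

```python
import math


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from (lat1, lon1) to (lat2, lon2), in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = (math.cos(phi1) * math.sin(phi2)
         - math.sin(phi1) * math.cos(phi2) * math.cos(dlon))
    return math.degrees(math.atan2(y, x)) % 360.0


def should_vibrate(heading_deg, lat, lon, target_lat, target_lon,
                   tolerance_deg=30.0):
    """Trigger a vibration cue when the user's heading deviates from the
    bearing to the next waypoint by more than the tolerance."""
    desired = bearing_deg(lat, lon, target_lat, target_lon)
    deviation = abs((heading_deg - desired + 180.0) % 360.0 - 180.0)
    return deviation > tolerance_deg
```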

Similarly, Intersection Explorer provides a spoken account of the layout of streets and intersections as the user drags her finger across a map [74].

A more comprehensive application is The vOICe for Android [75]. The application maps live camera views to soundscapes, providing the visually impaired with an augmented reality based navigation support. The app includes a talking color identifier, talking compass, talking face detector and a talking GPS locator. It is also closely linked with the Zxing barcode scanner and the Google Goggles apps by allowing for them to be launched from within its own context. The vOICe uses pitch for height and loudness for brightness in one-second left to right scans of any view: a rising bright line sounds as a rising tone, a bright spot as a beep, a bright filled rectangle as a noise burst, a vertical grid as a rhythm (cf. Section 2.2 and [38]).

4.2 Applications for iOS

In this subsection, various existing solutions for iOS are presented.

4.2.1 Text, speech and typing

One of the most important apps is VoiceOver. It provides substantive screen-reading capabilities for Apple’s native apps but also for many third party apps developed for the iOS platform. VoiceOver renders text on the screen and also employs auditory feedback in response to user interactions. The user can control speed, pitch and other parameters of the auditory feedback by accessing the Settings menu. VoiceOver supports Apple’s Safari web browser, providing element by element navigation, as well as enabling navigation between headings and other web page components. This sometimes provides an added bonus for visually impaired users, as the mobile versions of web sites are often simpler and less cluttered than their non-mobile counterparts. Importantly VoiceOver can easily be switched on and off by pressing the home button three times. This is key if the device is being used alternately by a visually impaired and a sighted user, as the way in which interactions work is totally different when VoiceOver is running.

Visually impaired users can control their device using VoiceOver by using their fingers on the screen or by having an additional keyboard attached. There are some third party applications which do not conform to Apple guidelines on application design and so do not work with VoiceOver.

VoiceOver can be used effectively in conjunction with the oMoby app, which searches the Internet based on photos taken with the iPhone camera and returns a list of search results [76]. oMoby also allows for the use of images from the iPhone photo library and supports some code scanning. It has a unique image recognition capability and is free to use.

Voice Brief is another utility for reading emails, feeds, weather, news etc. [77].

Dragon Dictation can help translate voice into text, with the user speaking and adding punctuation verbally as needed [78]. Apple’s Braille solution, BrailleTouch, offers a split-keyboard design in the form of a Braille cell to allow a blind user to type [79]. The left side shows dots 1, 2 and 3, while the right side holds dots 4, 5 and 6. The right hand is oriented over the iPhone’s Home button, with the volume buttons on the left edge.

In a way similar to various Android solutions, the Recognizer app allows users to identify cans, packages and ID cards through camera-based barcode scans [79]. Like Money Reader, this app incorporates object recognition functionality. The app stores its image library locally on the phone and does not require an Internet connection.

The LookTel Money Reader recognizes currency and speaks the denomination, enabling people experiencing visual impairments or blindness to quickly and easily identify and count bills [80]. By pointing the camera of the iOS device at a bill, the user hears the denomination announced in real time.

The iOS platform also offers a free color identification app called Color ID Free [81].

4.2.2 Speech-based command interfaces

The TapTapSee app is designed to help the blind and visually impaired identify objects they encounter in their daily lives [82]. By double tapping the screen to take a photo of anything, at any angle, the user will hear the app speak the identification back.

A different approach is possible through crowdsourced solutions, i.e. by connecting to a human operator. VizWiz is an application that records a question after taking a picture of any object [83]. The query can be sent to a web worker or to IQ Engines, or it can be emailed or shared on Twitter. A web worker is a human volunteer who reviews and answers the question, whereas IQ Engines is an image recognition platform.

A challenging aspect of daily life for visually impaired users is that headphones block environmental sounds. “Awareness! The Headphone App” allows users to listen through their headphones while also hearing the surrounding sounds [84]. It uses the microphone to pick up ambient sounds while the user is listening to music or using other apps with headphone output. While this solution helps mitigate the problem of headphones masking ambient sounds, its use means that the user hears both the ambient sounds and whatever is going on in the app, potentially leading to overload of the auditory channel or distraction during navigation tasks.

With an application called Light Detector, the user can transform any natural or artificial light source he/she encounters into sound [85]. By pointing the iPhone camera in any direction, the user will hear higher or lower pitched sounds depending on the intensity of the light. Users can check whether the lights are on, whether the windows and doors are closed, etc.
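
A light-to-pitch mapping of this kind essentially reduces to averaging the camera frame’s brightness and mapping it onto a tone frequency; the frequency range below is an assumption for illustration, not the app’s actual setting:

```python
def light_to_pitch_hz(frame, f_lo=200.0, f_hi=2000.0):
    """Map the mean brightness of a grayscale camera frame (a 2-D sequence
    of values in [0, 1]) onto a tone frequency: brighter -> higher pitch."""
    pixels = [p for row in frame for p in row]
    brightness = sum(pixels) / len(pixels) if pixels else 0.0
    return f_lo + min(1.0, max(0.0, brightness)) * (f_hi - f_lo)
```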

Video Motion Alert (VM Alert) is an advanced video processing application for the iPhone capable of detecting motion as seen through the iPhone camera [86]. VM Alert can use either the rear or front facing camera. It can be configured to sound a pleasant or alarming audible alert when it detects motion and can optionally save images of the motion conveniently to the iPhone camera roll.

4.2.3 Navigation

The iOS platform also provides applications for navigation assistance. The Maps app provided by default with the iPhone is accessible in the sense that its buttons and controls can be operated with VoiceOver, and using these one can set routes and obtain turn-by-turn written instructions.

Ariadne GPS also works with VoiceOver [87]. Talking maps allow for the world to be explored by moving a finger around the map. During exploration, the crossing of a street is signaled by vibration. The app has a favorites feature that can be used to announce stops on the bus or train, or to read street names and numbers. It also enables users to navigate large buildings by pre-programming e.g. classroom locations. Rotating maps keep the user centered, with territory behind the user on the bottom of the screen and what is ahead on the top portion. Available in multiple languages, Ariadne GPS works anywhere Google Maps are available. Similarly to WalkyTalky, low-resolution GPS receivers can be a problem; however, this can be solved through external receivers connected to the device [70].

GPS Lookaround also uses VoiceOver to speak the name of the street, city, cross-street and points of interest [79]. Users can shake the iPhone to create a vibration and swishing sound indicating the iPhone will deliver spoken information about a location.

BlindSquare also provides information to visually impaired users about their surroundings [79]. The tool uses GPS and a compass to identify location. Users can find out details of local points of interest by category, define routes to be walked, and have feedback provided while walking. From a social networking perspective, BlindSquare is closely linked to FourSquare: it collects information about the user’s environment from FourSquare, and allows users to check in to FourSquare by shaking the iPhone.

4.3 Gaming and serious gaming

Although not directly related to assistive technologies, audio-only solutions can help visually impaired users access training and entertainment using so-called serious gaming solutions [88–90].

One of the first attempts was Shades of Doom, an audio-only version of the maze game Doom, which is not available on mobile platforms [91]. The user has to navigate through a maze, accompanied by the sound of footsteps and the growling of a monster, with the goal of finding the way out without being killed. The game introduced interesting new ideas, but the quality of the sound it uses, as well as its localization features, have been reported to be relatively poor.

A game called Vanished is another “horror game” that relies entirely on sound to communicate with the player. It has been released on iPhone, and there is also a version for Android [92]. A similar approach is BlindSide, an audio-only adventure game promising a fully immersive 3D world [93]. Nevertheless, games incorporating 3D audio and directional simulation may not always provide a high-quality experience, depending on the playback system used (speaker or headphone quality, etc.).

Papa Sangre is an audio-only guidance-based navigation game for both Android and iOS [94]. The user can walk/run by tapping left and right on the bottom half, and turn by sliding the finger across the top half of the screen while listening to 3D audio cues for directional guidance.

Grail to the Thief aims to recreate the experience of a classic adventure game such as Zork or Day of the Tentacle, except that instead of text-only descriptions it presents the entire game through audio [95].

A Blind Legend is an adventure game that uses binaural audio to help players find their bearings in a 3D environment, and allows for the hero to be controlled through the phone’s touchscreen using various multi-touch combinations in different directions [96].

Audio archery brings archery to Android. Using only hearing and reflexes, the player’s goal is to shoot at targets. The game is entirely auditory: users hear a target move from left to right, pull back the bow with a flicking motion of one finger on the screen, and release it when the target is centered [97]. The game has been reported to work quite well with playback systems in which two speakers are placed at a relatively large distance from each other, allowing for perceptually rich stereo imagery.

Agents is an audio-only adventure game in which players control two field agents solely via simulated “voice calls” on their mobile phone [98]. The task is to help them break into a guarded complex, then to safely make their way out, while helping them work together. The challenge lies in using open-ended voice-only commands directed at an automatic speech recognition module.

Deep sea (A Sensory Deprivation Video Game) and Aurifi are audio-only games in which players wear a mask that obscures their vision and takes over their hearing, plunging them into a world of blackness occupied only by the sound of their own breathing or heartbeat [99, 100]. Although these games try to mimic “horrifying environments” using audio only, at best a limited experience can be provided due to the limited capabilities of the playback system. In general, audio-only games (especially those using 3D audio, stereo panning etc.) require high-quality headphones.

5 Towards a research agenda for mobile assistive technology for visually impaired users

Given the level of innovation described in the preceding sections, and the growth in the take-up of mobile devices, both phones and tablets, by the visually impaired community worldwide, the potential and prospects for research into effective interaction design and user experience for non-visual use of mobile applications are enormous. The growing prevalence of small-screen devices, allied to the continual growth in the amount of information—including cognitive content—that is of interest to users, means that this research will undoubtedly have relevance to sensor-bridging applications targeted at the mobile user population as a whole [101].

In the following subsections, we set out a number of areas and issues for research which appear to be important to the further development of the field.

5.1 Research into multimodal non-visual interaction

  1. Mode selection. Where a choice exists, there is little to guide designers on which mode to use for presenting a given piece of information. We know relatively little about which mode, audio or tactile, is better suited to which type of information.

  2. Type selection. Little is also known about how information should be mapped to the selected mode, i.e. whether through (direct) representation-sharing or (analogy-based) representation-bridging as described in [102, 103].

  3. Interaction in a multimodal context. Many studies have been conducted on intersensory integration, sensory cross-effects and sensory dominance effects in visually oriented desktop and virtual applications [104–107]. However, few investigations have examined, in a non-visual and mobile computing oriented context, the consumption of information in a primary mode while information is simultaneously presented in a secondary mode. How does the presence of the other mode detract from the cognitive resources available for processing information delivered in the primary mode? What information can safely be relegated to, yet still be perceived effectively in, the secondary mode?

  4. What happens when users themselves are given the ability to allocate separate information streams to different presentation modes?

  5. What is the extent of individual differences in the ability to process auditory and haptic information in unimodal and multimodal contexts? Is it a viable goal to develop interpretable tuning models allowing users to fine-tune feedback channels in multimodal mobile interactions [108, 109]?

  6. Context of use. How are data gained experimentally on the issues above affected when examined in real, mobile contexts of use? Can data curated from vast numbers of crowdsourced experiments be used to substitute for, or complement, laboratory experimentation?

5.2 Accessible development lifecycle

The following issues relate to the fact that the HCI literature on user-centered prototype development and evaluation is written assuming a visual context. Very little is available about how to go about these activities with users who have little or no sight.

  1. How might the tools used for quickly creating paper-based visual prototypes be replaced by effective counterparts for a non-visual context? What types of artifact are effective in conducting prototyping sessions with users with little or no vision? It should be a fundamental property of such materials that they can be perceived and easily altered by the users, so that an effective two-way dialogue between developers and users can take place.

  2. To what extent are the standard means of collecting evaluation data from sighted users applicable to working with visually impaired users? For example, how well does the think-aloud protocol work in the presence of spoken screen-reader output? What difficulties are posed for evaluation by the lack of a common vocabulary among non-specialist users for expressing the qualities of sounds and haptic interactions?

  3. Are there techniques that are particularly effective in evaluating applications for visually impaired users, giving realistic results while addressing health and safety concerns? For example, what is the most effective yet realistic way of evaluating an early-stage navigation app?

5.3 Summary

This paper has briefly summarized recent developments in mobile-device-based assistive technologies targeting visually impaired people. The focus was on the use of sound and vibration (haptics) wherever applicable. It was demonstrated that the most commonly used mobile platforms—i.e. Google’s Android and Apple’s iOS—both offer a large variety of assistive applications that use the built-in sensors of the mobile devices and combine this sensory information with the capability of handling large datasets, as well as with cloud resources and crowdsourced contributions, in real time. A recent H2020 project has been launched to develop a navigation aid running on a mobile device, incorporating depth-sensitive cameras, spatial audio and haptics, along with the development of associated training methods.