
1 Introduction

The number of people owning and using smartphones across the globe is high: in 2015, 68% of adults in advanced economies reported owning a smartphone. This number is rising, with even developing countries showing large growth in adult smartphone ownership [1]. As the proliferation of technology continues, the available hardware is also becoming more powerful. Moore's Law, which holds that the capabilities of a computer chip double roughly every two years, was recently shown to still hold [2]. Yet as processing power and connectivity continue to increase, the number of ways users can interact with a smartphone has stagnated. Most design innovation takes place "under" the screen, as visual and interaction design for applications. Personal assistants introduced voice-based commands, and the iOS 3D Touch release added slightly more depth to screen-based interactions [3]. Our goal with this research project was to expand interactions well beyond the screen, to consider a broader environment than can be afforded by a phone held in the hand. Our hypothesis was that this expansion could increase the scenarios and environments in which the smartphone could play a role in task completion.

2 Concept Generation – Phase 1

The scope of this project limited hardware to just the smartphone in our pocket. These devices have impressive computing power and come with built-in sensors: microphone, light, proximity, accelerometer, and at least one camera. Although other devices, sensors, and processing exist, we sought to capture current use cases without complicating the experience beyond the capabilities of today's users.

We started with the assumption that smartphones could be placed in a Draper-proprietary case incorporating Zman™ material that could adhere to a vertical surface, creating a hands-free interaction opportunity with the phone. We wanted to explore two major areas. In the first phase, we explored the tasks and environments where this hands-free phone experience could have the most impact. Once we had determined a good use case, our second phase explored how to optimize the usability of a hands-free phone in that scenario.

As a note, because of the nature of the Zman™ material and the image of a gecko climbing walls, we refer to the case, or to the case and phone together, as the "Gecko Phone".

2.1 Research Methods

To explore our hypothesis from an HCE perspective, we conducted exploratory user research. We began by framing the solution and user space of interest, designed a number of research activities to explore that space, developed a set of high-level concepts, and then validated those concepts with target users. To explore this space thoroughly, there were several dimensions we needed to examine. The first was defining the target audience, or the types of users we wanted to focus our efforts on. Once we had a defined user group, we began to explore what unmet needs these users have that relate to cell phones. We also needed to explore the unique capabilities enabled by Zman™ technology and how they pair with the native capabilities of cell phones.

Four different techniques helped us assess user requirements and needs against the capabilities of a surface-mounted phone. First, a diary study clarified the contexts people were in when they would be using their phones. The second, brainstorming cards, helped generate ideas based on user types, locations, and phone capabilities. The third used storytelling to gather ideas about edge use cases. Finally, the results were sorted and culminated in an affinity diagram.

Diary Study.

The first was a diary study in which we texted surveys to our participants four times a day for two days each (a work day and a weekend day). A diary study provides the opportunity for near-real-time feedback about exactly where a person is and what they are doing [4], removing recall bias for short-term feedback. We performed this study twice, with a total of 11 participants. Each participant was asked five questions about where they were, what they were doing, how they were using their phone, and any problems with that phone usage. The final question asked the participant to brainstorm in place: to use the context around them and the task they were performing to identify uses of a Gecko Phone. Each participant also filled out an introductory demographic survey and an exit survey that asked them to reflect on their time in the study and provide additional thoughts and concepts.

Brainstorming Cards.

Our second activity was a card-matching activity. Card matching allows users to expand on their ideas by pairing sometimes-nonsensical cards together and providing a backstory or context [5]. We created a set of cards that included different user types and locations. We held two sessions of this activity, with three participants in each (plus the researchers). The first session focused on creating pairings of users and locations and then brainstorming problems that could be solved through the use of a surface-mounted phone. For the second session we added a set of cards that captured the capabilities of the phone itself, such as cameras, accelerometers, and Wi-Fi connectivity. In this session we asked users to create sets from all three types of cards. This allowed us to pair the "what" and "why" questions with "how" questions, connecting user context with real-world capabilities.

Physical Story Telling.

Our third activity was a physical storytelling activity. Storytelling furthers the creative process, allowing participants to imagine use cases and capabilities beyond their everyday life [6]. For this activity we created physical items that represented technologies in multiple sizes and form factors: anything from small sensors to phones to laptops and televisions. We also included shapes and sizes without particular technologies in mind to prompt creativity. We started the session by telling a story: “I’m going to the park to fly my quadcopter” and then drawing that setting on the whiteboard. From there we had our participants extend that story with their physical props. One would tape a “phone” to the board and say “I stick my phone to the quadcopter so I can take a picture from up above.” Another user might extend the story by saying “there’s a road next to the park that had an accident. I use my gecko phone as a beacon for the paramedics.” We told a single story until it ran out, then started with a new story, repeating this process until the session ended.

Affinity Diagramming.

The fourth activity was affinity diagramming, which was a continuous process over the course of the study. The affinity diagramming process comes from Beyer and Holtzblatt [7]: findings are written on sticky notes, which are then clustered naturally by their relatedness to other findings. As we received results, we printed them out and grouped them by similarity, without an overarching organization scheme in mind. This process allows organization to emerge organically from the data. With each activity we added more data and named these organic groups. At times we found an intriguing group and developed an activity around it; the storytelling activity, for example, grew out of one such grouping from the diary study. The affinity organization allowed us to review and reorganize the data multiple times in order to understand what we had really captured. Once we had grouped all of the data, we reviewed the groups to derive our high-level finding categories.

2.2 Major Findings

Phase 1 research resulted in two major findings. The first was that the smartphone as a tool was not a good fit for every environment. Environments are often extremely dynamic, and technology that cannot sense and react to those changes can be a mismatch for a changing setting. The second finding was that phones are capable of more than users are able to do with them. This occurs when the device fails to adapt to the user's context (where they are and what they are doing) and when the phone fails to bridge the divide between physical and digital. When a phone is used to accomplish a digital task, such as answering email or reading an article, that divide is very small. It becomes larger when the device is used for a more physical task, such as following a recipe. Our plan was to design an application that made the phone a better fit for physical environments by bridging that gap.

2.3 Storyboards Through Speed Dating

In order to select a use case from Phase 1, the team implemented a research method called speed dating. The team generated 23 concepts for a potential device. These concepts were portrayed as storyboards (see Fig. 1), which were then presented to 29 research participants (14 women). A storyboard provides a visual and verbal example of a problem the user may have, how technology could be implemented to address that problem, and a resolution with that technology integrated into the task. The speed dating process had participants pick their most and least favorite concepts and discuss which aspects of the concepts elicited strong reactions from them.

Fig. 1. Storyboards were used to sum up Phase 1 findings and get feedback from potential users.

These results were quantified (Fig. 2) to determine which ideas were best to pursue. Figure 3 shows the top three storyboards: "Monitor This" (Fig. 3a), "MasterChef" (Fig. 3b), and "Selfie Stick" (Fig. 3c).

Fig. 2. Speed dating adjusted responses, grouped into Highest Rated, Above Average, Below Average, and Lowest Rated quadrants.
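The paper does not give the exact formula behind the adjusted responses in Fig. 2; the sketch below is a purely illustrative tally under the assumption that each "most favorite" pick counts +1 and each "least favorite" pick counts -1, with concepts then bucketed around the mean adjusted score.

```python
# Hypothetical scoring sketch for speed-dating votes; weighting and cut points are assumptions.
from statistics import mean

def adjusted_scores(votes):
    """votes: iterable of (concept, pick) tuples where pick is 'most' or 'least'."""
    scores = {}
    for concept, pick in votes:
        scores[concept] = scores.get(concept, 0) + (1 if pick == "most" else -1)
    return scores

def bucket(scores):
    """Assign each concept to one of the four quadrants used in Fig. 2."""
    avg = mean(scores.values())
    hi, lo = max(scores.values()), min(scores.values())
    return {c: ("Highest Rated" if s == hi else
                "Lowest Rated" if s == lo else
                "Above Average" if s >= avg else "Below Average")
            for c, s in scores.items()}
```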

Fig. 3. Storyboards for the top three concepts. (A.) Monitor This, where the Gecko Phone is used as a portable camera and audio monitor. (B.) MasterChef, where the Gecko Phone contains the recipe and can be mounted out of the mess of cooking. (C.) Selfie Stick, where the Gecko Phone can assist users in taking a series of selfies.

A competitive market analysis was performed to identify the largest technology gaps among the three top options. Because people already gesture while cooking, integrating gestures offered the greatest opportunity for us to advance the technology in that area. This analysis revealed that MasterChef was the best route to pursue.

2.4 Concept to Pursue: MasterChef

The team implemented a proof-of-concept prototype of the MasterChef idea. The MasterChef concept was a recipe tutorial that displayed step-by-step instructions for making challah bread, using photos for each step. This application could be navigated using either voice or gesture commands. Researchers tested this concept using a Wizard-of-Oz prototype, wherein from the user's perspective the prototype is fully functional, but many parts of the software are "faked" or remote-controlled by the experimenter. This prototype was tested with five participants in situ, standing in a kitchen. Since cooking is often a partnered task, participants worked in pairs and communicated with each other using verbal cues. Beyond verbal cues, participants with a cooking partner began resorting to gestures to facilitate the cooking process. This finding helped shape subsequent studies and development.

3 Phase 2 – Validating Our Solution

In Phase 1 the team determined that hands-free navigation of instructional content was the best use case to explore. At the beginning of Phase 2, the team decided to better understand how users approach and use existing instructional video content. The researchers performed an online survey with 26 anonymous participants; their free-response quotes are captured below.

The survey revealed interesting findings. The main issue users have with video tutorials is the pacing: “I find instructional videos do not fit with the speed I prefer to learn at. Mostly they are too slow, except in the few instances where I need to pour over an image, then they move too quickly.”

Users want to be able to skip through content they don’t need and slow down or repeat content they’re less certain about. Users find video content most helpful for learning a specific part of a task but not necessarily for the entire task: “[I would use a video to learn] cooking methods or skills…like chopping vegetables faster…–NOT recipe videos.”

Users want to be able to navigate away from irrelevant information quickly or, better yet, not be shown that information at all. When users get stuck, they find it cumbersome to pause and navigate a video to watch step by step. Certain tasks, especially spatial tasks, are best portrayed by video.

We selected an origami video as our test case. Origami projects already have difficulty ratings and plentiful existing online content. We were able to find a task that was easy enough for a non-expert to get started but difficult enough to guarantee that a user would need to go back and navigate the video to re-watch steps. Each step in an origami project is extremely dependent on the previous step having been executed well. Origami is a highly spatial task, which makes it a good candidate for video instruction over audio-only or written instructions. Finally, this project presented few barriers to collecting data, as it can be carried out in an unspecialized environment and requires few specialized or potentially dangerous elements, such as heat or sharp objects.

Wizard of Oz.

In order to validate our video choice and interaction methods, we held an open-ended Wizard-of-Oz testing session with three participants. Users were shown the video and asked to navigate it entirely by voice at first. They were encouraged to be conversational. The researcher involved interpreted the verbal cues from the participants and used a remote control to navigate video playback. Users were then asked to generate some of their own methods of controlling the video. All three users chose to use gesture controls. The major finding from this observation was that users asked specifically to go back to certain steps ("Show me that again") instead of going back by a certain amount of time ("rewind 15 s").

From these observations, we learned that controlling video playback using relative positioning and more natural interaction techniques would be the most usable approach.
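To make the distinction concrete, here is a minimal sketch, not the study's actual implementation, contrasting fixed time-based rewind with the step-relative navigation users asked for ("Show me that again"). The sub-step boundary timestamps are hypothetical.

```python
# Step-relative vs. time-based rewind; SUB_STEP_STARTS is an assumed parse of the origami video.
from bisect import bisect_right

SUB_STEP_STARTS = [0, 14, 32, 55, 81]  # hypothetical sub-step start times in seconds

def rewind_fixed(position_s, amount_s=15):
    """Time-based rewind: the behavior users did NOT ask for."""
    return max(0, position_s - amount_s)

def show_that_again(position_s):
    """Step-relative rewind: jump to the start of the current sub-step."""
    idx = bisect_right(SUB_STEP_STARTS, position_s) - 1
    return SUB_STEP_STARTS[max(idx, 0)]

print(show_that_again(40))  # -> 32, the start of the sub-step that contains 40 s
```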

4 Final Deliverable - MAEK

MAEK, an odd combination of letters, stands for Motion Activated Expert Kontroller. By combining gestures (Motion Activated) with smart video parsing (Expert), we explored the usability of controlling an instructional video (Kontroller).

This project included four components: the Zman™ case hardware, the computer-vision gesture recognition, the automatic video parsing, and the user experience of the application interface incorporating the video parsing and gesture recognition. This section focuses on using the case, gesture recognition, and video parsing to build the user experience.

4.1 Gestures

Gestures provide a physical control mechanism that is available to the user when their hands are not occupied by holding a phone. Gestures can come naturally, as physical expressions that convey additional meaning, or they can be taught, attaching non-intuitive, semantic meaning to movement. Gestures can also be customized to the situation in which the user operates, to maintain semantic relevance.

Gestures have certain advantages and disadvantages compared with other interaction methods. Humans have a natural tendency to extend their communication beyond words to include body movements; the concept of gesturing extends from this [8]. In modern technology, designers have taken advantage of gestures as a way to remove an extra control device that can be lost or malfunction. Unfortunately, gestures have their shortcomings: developing software both sensitive and robust enough to detect intentional movements is not a trivial task. Despite the challenges, we chose to move ahead with hand-based movements or poses as control and input to our system.

There are a number of pre-existing gesture languages: American Sign Language, hand signals for tactical missions, and even fish-market bartering all rely on poses and movements of the hands to create meaning. Karam [9] builds a taxonomy of gesture-based interactions. Based on her research and other sources, "gesticulation" is the most natural use of hand and body movement to enhance communication. It is also one of the hardest forms to interpret from a technical perspective: the nuances of motioning with your hands are not easily detectable, even by humans. Thus, more formal methods are preferred for defined human-computer interactions.

With the understanding that gesticulation needs more development from both a semantic and a software perspective, generating a set of known gestures became our goal. These known gestures should be intuitive so as to require little training, simple so as to be repeatable, and distinct so as to be recognized by currently available camera and software equipment. The goal was for this information to help us design rich and innovative touch-free interactions with an instructional video on a phone that can be mounted at eye level. It is worth noting that, as we designed the gestures, recognition software, and video interface, gestures often required the user's attention to confirm that a command had been received. From the storyboarding activity in Phase 1, we learned of the importance of gestures and semantics when communicating steps in a process: "this one" and "that one," combined with gestures, indicate specifics without touching the object of interest. From the Wizard-of-Oz origami task in Phase 2, we learned that users like to use gestures to move forward or backward through a video by relative amounts.

To develop a set of intuitive gestures that could be recognized by a COTS phone, we turned to our human-centered design methodology. In a second round of Wizard-of-Oz testing, we tested with eight participants. For this round of testing, the researchers defined a set of hand gestures that users were to use to control the video. We captured their hand gestures to train our computer vision model and gathered usability feedback.

Their only control mechanism for the video was to use their hands in repeatable motions, after which the wizard would interpret the gesture and act accordingly. The final gesture set involved three dynamic hand motions. The first, play/pause, was simply holding up a closed fist (Fig. 4a). The second was to skip a sub-step, in which users held up two fingers and moved their hand to the right to skip forward or to the left to skip backward (Fig. 4b). The third gesture was a version of the second, in which users held up an open hand to skip forward or backward by an entire chapter (a meaningful, related group of consecutive sub-steps; Fig. 4c). This round of testing validated that the gestures were intuitive for participants to use to navigate a video and were easily differentiated by computer vision.

Fig. 4. The final gestures: (A.) Play/pause, (B.) Skip a sub-step, and (C.) A version of the second, skip forward or backward by an entire chapter.
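The sketch below restates this gesture-to-command mapping in code. It is illustrative only; the pose labels and the direction flag are assumptions about what a gesture classifier might emit, not the application's actual interface.

```python
# Hypothetical mapping from recognized hand pose (+ motion direction) to playback commands.
from enum import Enum
from typing import Optional

class Command(Enum):
    TOGGLE_PLAY = "toggle_play"
    SUB_STEP_FORWARD = "sub_step_forward"
    SUB_STEP_BACK = "sub_step_back"
    CHAPTER_FORWARD = "chapter_forward"
    CHAPTER_BACK = "chapter_back"

def gesture_to_command(pose: str, moving_right: Optional[bool]) -> Optional[Command]:
    """pose: 'fist', 'two_fingers', or 'open_hand'; moving_right is None for static poses."""
    if pose == "fist":
        return Command.TOGGLE_PLAY                      # play/pause (Fig. 4a)
    if pose == "two_fingers":                           # skip one sub-step (Fig. 4b)
        return Command.SUB_STEP_FORWARD if moving_right else Command.SUB_STEP_BACK
    if pose == "open_hand":                             # skip a whole chapter (Fig. 4c)
        return Command.CHAPTER_FORWARD if moving_right else Command.CHAPTER_BACK
    return None                                         # unrecognized pose: do nothing
```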

4.2 Multimodal Interactions

In addition to hand-based, touch-free gestures, we considered options that combine commands across different modalities (gestures, voice, gaze, and possibly head tilt) in order to either expand the available commands or increase the reliability of the inputs.

Gaze and gestures could be combined to improve reliability. In this case, the system would accept commands, or some subset of commands, only when the user is looking at the device. This assumes, for example, that the user would not rewind the video unless they were watching it, and it constrains the user to shift their attention away from the origami task in order to manipulate the play location of the video. This is an assumption we did not want to make, so we eliminated the option of gaze-plus-gesture interactions.

Head tilt could also be used in place of, or in addition to, hand-based gestures. Users were observed to instinctively tilt their heads when they were confused. This movement could add confidence that a user wants to pause or go back, or it could be used as a more independent command. However, reliably detecting this motion and distinguishing it from ordinary head movement during the task made it challenging to execute as a command.

Another option is to combine gesture and voice commands. One participant was observed to do this naturally, without explicit prompting. The use of redundant commands would increase confidence in what the system recognizes, and voice commands can also change the context or precise meaning of a gesture.
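A hedged sketch of the redundant gesture-plus-voice idea follows: accept a command when either modality is confident on its own, or when both weakly agree. The thresholds and the independence assumption behind the combination are illustrative, not taken from the study.

```python
# Illustrative multimodal fusion of gesture and voice commands (all parameters are assumptions).
from typing import Optional

def fuse(gesture_cmd: Optional[str], gesture_conf: float,
         voice_cmd: Optional[str], voice_conf: float,
         solo_threshold: float = 0.9, joint_threshold: float = 0.6) -> Optional[str]:
    # Both modalities agree: combine their confidences (treating their errors as independent).
    if gesture_cmd is not None and gesture_cmd == voice_cmd:
        combined = 1 - (1 - gesture_conf) * (1 - voice_conf)
        return gesture_cmd if combined >= joint_threshold else None
    # Otherwise require a single modality to be confident on its own.
    if gesture_cmd and gesture_conf >= solo_threshold:
        return gesture_cmd
    if voice_cmd and voice_conf >= solo_threshold:
        return voice_cmd
    return None
```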

4.3 Content Parsing

Our observational research revealed the importance of content parsing. Users greatly preferred to navigate their content by meaningful steps and sub-steps. We used human annotation to train content-parsing models. These models worked well when the instructional content contained visual markers on the screen to indicate a new step or chapter, but they require more training before unmarked videos can be parsed automatically. Our final content was parsed into sub-steps, each containing no more than one complete action. These sub-steps are grouped into chapters. See Fig. 4 for an example of steps and chapters.
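One way to represent the parsed content is sketched below: chapters grouping sub-steps, each sub-step covering at most one complete action, loaded from a human annotation file. The JSON format and field names are hypothetical; the study only specifies that human annotation marked the boundaries.

```python
# Hypothetical parsed-content representation and annotation loader.
import json
from dataclasses import dataclass
from typing import List

@dataclass
class SubStep:
    start_s: float   # timestamp where the sub-step begins in the video
    label: str       # e.g. "fold the top corner to the center"

@dataclass
class Chapter:
    title: str
    sub_steps: List[SubStep]

def load_annotation(path: str) -> List[Chapter]:
    """Read human-annotated step boundaries from an assumed JSON layout."""
    with open(path) as f:
        raw = json.load(f)
    return [Chapter(c["title"],
                    [SubStep(s["start_s"], s["label"]) for s in c["sub_steps"]])
            for c in raw["chapters"]]
```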

4.4 Gesture Recognition

The final application could be controlled via gestures using only the phone's built-in camera and processing power. In order to control the phone using gestures, the application needed to be calibrated to the background and the user. This calibration process was relatively simple and was included in a user onboarding interaction. The current gesture recognition required a blank background and recalibration between users; further optimization is required to make it more robust.
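As a minimal illustration of why a blank background and per-user calibration help, the sketch below averages a few onboarding frames of the empty background and then flags pixels that differ strongly from it as a candidate hand region. This is an assumed background-differencing approach, not the app's actual on-device recognition code.

```python
# Illustrative background calibration and hand segmentation (assumed approach, OpenCV + NumPy).
import cv2
import numpy as np

def calibrate(frames):
    """Average several frames of the blank background captured during onboarding."""
    return np.mean([f.astype(np.float32) for f in frames], axis=0).astype(np.uint8)

def hand_mask(frame, background, diff_threshold=30):
    """Return a binary mask of pixels that differ strongly from the calibrated background."""
    gray_f = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray_f, gray_b)
    _, mask = cv2.threshold(diff, diff_threshold, 255, cv2.THRESH_BINARY)
    # Remove small speckles so only the hand-sized blob remains.
    return cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
```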

5 Conclusion

Hands-free interaction proved to be a good way to improve the utility of instructional video content. Users who were able to navigate their content with quick hand gestures reported improved usability. An important part of this navigation was parsing content into meaningful steps, which further improved navigation and usability. Further research will be required to make automatic parsing and automatic gesture recognition more robust. Field-testing opportunities would allow the researchers to evaluate these findings in real-world environments.