A Simple Method to Record Keystrokes on Mobile Phones and Other Devices for Usability Evaluations
Task times, sometimes at the keystroke level, as well as the number of keystrokes are often used to assess the usefulness and ease of use of mobile devices. This paper describes a new method to obtain keystroke-level timing of tasks involving Virtual Network Computing (VNC) and Techsmith Morae (commonly used for usability tests). Running VNC, the PC mimics what the mobile device does, which is recorded by Morae running on the PC. To evaluate this configuration, 24 pairs of subjects texted 1,200 messages concerning five topics to each other. Every keystroke was recorded and timed to the nearest 10 ms. As desired, the communications were quite stable, with times of 90 % of the messages on the two recording computers being normally distributed and within ± 100 ms of each other. Others are encouraged to use this method, given its accuracy and low cost, to examine mobile device usability.
Keywords: Keystroke-logging, Usability testing, Mobile devices, Virtual network computing (VNC)
Mobile devices are widely popular because they are convenient and easy to use. Ease of use can be assessed using many measures and statistics, though the most straightforward measures are task time and number of keystrokes. Previous studies have reported methods to record when and which key is pressed at any given time to determine inter-keystroke intervals for mobile devices [2, 3, 4]. However, more specific and easier-to-implement methods are needed. Until now, usability assessments of these devices have been limited in number because recording the desired data accurately is a challenge.
Basically, task time can be determined using (1) a real-time manual timing system operated by an observer, (2) specialized hardware, (3) video records analyzed after the fact (post-processed frame by frame), or (4) usability evaluation software embedded in the device being tested (e.g. Remote User Interface (RUI), LetterWise, Uzilla).
(1) Manual real-time data collection has a long history of use, especially within industrial engineering, typically using stopwatches. However, it is time consuming, and events that occur in rapid succession cannot be reliably timed manually. (2) For human-computer interaction, specialized data collection devices inserted between the keyboard and computer (e.g. KeyGhost for PS/2 keyboards, KeyCarbon for USB keyboards) and computer software (Windows APIs, Mac Keylogger by REFOG) can be used to surreptitiously record all keystrokes, mouse actions, and screen updates. By design, most variations of this method will not work for mobile devices. (3) On the surface, post-processing of video records seems to be an easy method to collect the desired data. However, this approach is not cost effective, and the data analysis is tedious. (4) Finally, one could use specialized evaluation software that runs in parallel with the system being tested. This software may be specifically intended for human-computer interaction evaluations.
Accordingly, tailored approaches for mobile devices have been developed. These approaches involve universal applications such as a small J2ME program that can be installed on the device very quickly to log the times of key presses and releases and save them in a file (ASCII, Unicode) on the device. The program runs in the background, independent of other processes. After the data collection phase, the experimenter can export, reduce, and analyze the data. This software may add to the device processor load and memory requirements (leaving insufficient capacity for the applications being assessed), and may affect timing accuracy (adding variable delays). Furthermore, these programs often do not provide any type of independent real-time output, in particular a signal that can be used by other software to synchronize related information (e.g. physiological data, other task performance data). Some applications do not encrypt the data, so there may be security concerns.
To overcome mobile device limitations, the mobile device could be emulated on a personal computer, mapping the key/icon assignments (in Flash, Java, or XML) to those typical on the PC. The data collection program can be installed on a powerful PC, run in the background, and be combined with other processes, such as eye-tracking systems. The PC interface, however, is not what mobile device users normally use, and the key/icon assignments are device dependent. For example, users will feel odd if they are asked to use a PC numeric keypad instead of the telephone keypad, because the shape, size, layout, and force feedback of the two keypads are very different. Betiol and Cybis clearly indicated that the emulator setup may affect the validity of the usability problems found on the device.
Furthermore, these methods can require more software knowledge than those conducting usability evaluations may have, and can exceed their limited budgets and time. Therefore, this paper describes a new method to collect user task times using a remote protocol (Virtual Network Computing, VNC) and usability evaluation software (Morae v.3.2, TechSmith). A sample experiment conducted using this method is described.
2 New Method Description
2.1 Overview of the VNC-Morae Method
A novel method was developed in which a VNC (Virtual Network Computing) application was installed on a mobile device (a smartphone) that was connected to a PC running the usability evaluation software (Morae v3.2). Because the application involved texting, Skype was used as the platform for data exchange between mobile devices. Currently, the only limitation of this method is that the mobile device must run iOS, Android, or BlackBerry, operating systems that together account for the majority of phones sold (94.1 % in Q4 2012).
By far, Morae is the most popular application for collecting usability data. The software suite consists of a recorder, an observer, and a manager. The Morae Recorder is installed on the subject’s computer and records every event (keystroke, mouse action, screen refresh) that occurs, along with input from a web camera (showing the user’s face or hands) and a microphone (what the user says). Times can reliably be determined to the nearest 10 ms. Morae Observer, running on another computer connected to the Recorder via a local area network (LAN) or the Internet, allows a remote observer to see the user’s screen, hear what they say, and see their face as a picture-in-picture (PIP) in real time. Using the Observer software, an experimenter can mark when tasks start and stop, log errors, enter comments about tasks, and identify segments for highlight clips. The experimenter’s inputs are synchronized with the subjects’ events on the Recorder computer.
2.2 Virtual Network Computing - The Connection Between Mobile Device and Computer
Virtual network computing (VNC) involves using the Remote Frame Buffer (RFB) protocol to remotely control another computer. Color values for pixels on the screen stored in the memory buffer are transferred through RFB. VNC consists of a server, a client (or viewer), and a protocol between them. The machine running the server feeds the signal to the client via RFB protocol, so the client has the same desktop image as the server. As the controller, the client can submit commands and interact with the server.
As a default, the communication uses the TCP port 5900 for connections to the Internet through a broadband connection or LAN. RealVNC, a particular implementation, uses the high-strength AES (Advanced Encryption Standard) data encryption to improve the data transfer security.
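Before a session, it can be useful to confirm that the VNC server on the recording PC is reachable on the default port. The following is a minimal Python sketch, not part of the original method. It relies on the fact that, under the RFB protocol specification, a server opens the connection by sending a 12-byte ProtocolVersion message such as `RFB 003.008\n`. The function names, the timeout, and any host address passed in are illustrative assumptions.

```python
import re
import socket

RFB_DEFAULT_PORT = 5900  # default VNC/RFB TCP port, as stated in the text

# An RFB ProtocolVersion message is exactly 12 bytes: "RFB xxx.yyy\n".
_BANNER = re.compile(rb"RFB (\d{3})\.(\d{3})\n")


def parse_rfb_version(banner: bytes) -> tuple[int, int]:
    """Return (major, minor) from an RFB ProtocolVersion message."""
    match = _BANNER.fullmatch(banner)
    if match is None:
        raise ValueError(f"not an RFB ProtocolVersion message: {banner!r}")
    return int(match.group(1)), int(match.group(2))


def read_server_version(host: str, port: int = RFB_DEFAULT_PORT,
                        timeout: float = 3.0) -> tuple[int, int]:
    """Connect to a VNC server and return the version it announces.

    The host is whatever address the recording PC has on the network;
    it is supplied by the caller and is not taken from the paper.
    """
    with socket.create_connection((host, port), timeout=timeout) as conn:
        banner = b""
        while len(banner) < 12:
            chunk = conn.recv(12 - len(banner))
            if not chunk:  # server closed the connection early
                break
            banner += chunk
    return parse_rfb_version(banner)


# Parsing can be checked without a network connection.
print(parse_rfb_version(b"RFB 003.008\n"))  # (3, 8)
```

Only the version banner is read here; authentication, encryption, and frame buffer updates are handled by the VNC client and server themselves.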
2.3 Data Transfer Platform – The Connection Between Computers
Because the user’s operations on the smartphone are relayed to the computer running the VNC server and Morae Recorder, the data transfer is triggered by the mobile device but executed by the computer. In other words, the communication between mobile devices can be treated as communication between computers, which is much easier. The communication can be over a local area network (LAN) or the Internet. Using a LAN eliminates network traffic jams and provides a secure connection.
2.4 Hardware and Software Configuration
In summary, what the users see on their smartphone (running the VNC Client) is actually from the PC, which is showing the Skype interaction and running Morae in the background. Most of the computationally intensive software runs on the PC, and the smartphone serves as an input device.
3 Case Study: Keystroke-Level Accurate Timing of Text Messages Between Smartphones
This configuration was used to study users sending text messages to each other. Of interest were both the linguistic content of the messages and the keystroke-level timing of user inputs. In this experiment, subjects sent text messages on five different topics that were the same for all subjects. About five minutes was spent on each topic, after which the interaction ended and subjects moved on to a new topic. To provide the realism of a driving scenario, this experiment took place in a driving simulator. Because texting while driving is illegal, all manual entry by drivers was done with the vehicle parked. The goal of this case study is to examine the protocol of real-time keystroke logging between two smartphones for the usability test, comparing the timelines recorded on the two sides.
Twenty-four pairs of young smartphone users who regularly sent text messages to each other participated, an important distinction from other research. Each subject was paid $50 for approximately two hours of his/her time. One subject sat in the driver’s seat of the driving simulator (http://www.umich.edu/~driving/facilities/sim.html), and the other subject was elsewhere, out of sight of the driver (sitting at a desk in an office), as is commonly the case. Subject pairs were equally drawn from two groups, late teens (aged 18–19, mean = 19) and young adults (aged 20–29, mean = 23). Among all drivers, these are the two groups who are most likely to text and drive. Because their relationship could affect their texting interaction, each group of twelve had three male pairs, three female pairs, and six mixed-gender pairs. All participants were friends, classmates, or colleagues, but not relatives, to avoid relationships that could alter message content.
3.3 Equipment, Materials, and Software
The ideal situation would be for subjects to have used their own phones. Android users used their own phones with the RealVNC Client downloaded (free) and installed from Android Market. iPhone users used two iPhone 4 handsets provided by the UMTRI Driver Interface Group, onto which RealVNC had been downloaded before the experiment to save time. To simulate the BlackBerry experience, those users were provided iPhones with an attached hard-key QWERTY keyboard (Keyboard Buddy iPhone 4 case, Boxwave Co.). The screen size of the BlackBerry was too small (usually about 2.5 in) to accommodate all the information needed by the VNC Server.
Morae Recorder (version 3.2) ran independently on two Windows XP computers and recorded the keys pressed and the inter-keystroke intervals to the nearest centisecond. The timestamps of when messages were received were also collected. A wireless router (Netgear WNDR3400, N600 wireless dual band, 300 Mbps x 2) was connected to the two mobile devices (via WiFi). One of the WiFi protocols used the 2.4 GHz band and the other used the 5.0 GHz band, each of which had a bandwidth of 300 Mbps without interference.
3.4 Experiment Design
As mentioned before, subjects then sent text messages to each other, with one subject in the driving simulator and the other in an office. An expressway road scene was presented to the driver, but he/she did not drive. Initially, subjects practiced texting for 10 to 15 min to become familiar with the experimental situation, with the two topics of “I need BBQ” and “I am glad the football season begins.” Subjects could either continue the topic assigned, or change to other topics they were more interested in at any time.
Subsequently, subjects participated in five message sequences, each triggered by text provided to the driver by the experimenter on an in-vehicle display. Topics were selected to be statistically representative of those that typically occurred while texting (and not driving), based on a text message corpus provided by General Motors and collected while not driving. See also Winter et al. There were 5 topics about human relations, 2 for activities and events, 3 for appointments and schedules, 2 for school and work, 2 for technology, and 1 for emotion. For further information about the design, please refer to Green et al.
There were 1,200 messages (49,985 keystrokes) sent between the 2 sides, 584 messages from drivers to partners and 616 from partners to drivers. On average, a typical message included 8.5 words (S.D. = 6.2), composed of 41.6 characters (S.D. = 30.7), including 7.5 spaces, 1.2 punctuation marks, and 32.9 letters and numbers. For a detailed analysis of message content, see Hecht et al.
To analyze the inter-keystroke intervals, one needs to have confidence that the timing is accurate. When typing transcripts, the brain is used as a short-term buffer and the typist will load a certain amount of text into the buffer. Text will be grouped into discrete units and entered at once, so the times between keystrokes can be very short. All too often research is conducted on keying behavior, response times, or eye fixations, but the timing is never checked. This is particularly important when claims are made about millisecond or even centisecond accuracy, but the hardware and software do not support such accuracy. The focus of this paper is on how well the method captured the keystrokes and on the timing accuracy.
There was no evidence that any of the 49,985 keystrokes were lost or transmitted out of order. Also of interest was whether the various devices were reporting the same times and the same transmission delays between devices. The issue was addressed in two phases. During the preliminary phase of the development of this method, each smartphone was placed side by side with the PC that was associated with it, and text was entered into the phone. There was no perceptible delay between when keys were pressed, when the text appeared on the smartphone display, and when the text appeared on the associated PC (connected via VNC).
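To make the quantity under discussion concrete, the sketch below derives inter-keystroke intervals from a list of timestamped key presses, quantized to the 10 ms resolution reported for Morae. The `KeyEvent` record and the sample data are hypothetical and do not reflect Morae's actual export format.

```python
from typing import NamedTuple


class KeyEvent(NamedTuple):
    """One key press (hypothetical record layout, not Morae's export format)."""
    key: str
    time_s: float  # seconds since logging started


def inter_keystroke_intervals_ms(events: list[KeyEvent],
                                 resolution_ms: int = 10) -> list[int]:
    """Return the intervals between consecutive key presses in milliseconds.

    Each timestamp is first rounded to the logger's resolution
    (10 ms by default), then consecutive timestamps are subtracted.
    """
    times_ms = [round(e.time_s * 1000 / resolution_ms) * resolution_ms
                for e in events]
    return [later - earlier for earlier, later in zip(times_ms, times_ms[1:])]


# Made-up example: typing "hi" followed by a space.
events = [KeyEvent("h", 12.31), KeyEvent("i", 12.52), KeyEvent(" ", 12.90)]
print(inter_keystroke_intervals_ms(events))  # [210, 380]
```

Because every interval is the difference of two timestamps that are each rounded to 10 ms, an individual interval can be off by up to about 10 ms, which is why the underlying timing needs to be checked before interpreting short intervals.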
In a subsequent phase, system timing and lags were examined. The time logging started before showing the topic to the driver, continued on the five topics, and did not stop until the texting for all five topics ended. The authors compared the log files on each PC (driver, partner), each with its own timeline and analyzed the effects of message characteristics. In this case, the driver and the partner could not be readily synchronized and compared using a third-party timeline.
The log files included the time a message was sent from one device and the time that message was received by the other device, each on a different timeline. The two timelines were therefore aligned using the first sent/received message as the baseline.
Comparing the adjusted message-sent and -received timelines, the time differences included two parts, the time for Skype to pass messages through the Internet (whose transmission times were variable), and the difference between the Morae timelines running on two non-identical computers to record all the events. (The driver side computer (Intel Core 2 Quad 2.4 GHz + 4 GB of RAM) had more throughput than the other computer (Intel Pentium IV 3.6 GHz + 1 GB of RAM).) These two timelines could not be synchronized without a third-party timeline, which did not exist in this case. Identical hardware for the driver and partner sides was not available, and was initially not thought to be a concern.
[Table: Calculation of the difference between timelines. Columns: time by driver’s computer, time by partner’s computer, time difference (ms). Rows: raw times, time based on the first sent message, time based on the first received message. Table values not recoverable from the source.]
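The baseline adjustment described above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' analysis code: each timeline is re-expressed relative to its own first message, and the remaining difference for every later message is taken as the relative time difference. The function name and the sample timestamps are assumptions.

```python
def relative_time_differences_ms(sent_ms: list[int],
                                 received_ms: list[int]) -> list[int]:
    """Align two timelines on their first message and compare the rest.

    sent_ms:     send times of each message on the sender's timeline (ms)
    received_ms: receive times of the same messages on the receiver's timeline (ms)

    The first message defines the baseline on both sides, so it is not
    included in the result.
    """
    if not sent_ms or len(sent_ms) != len(received_ms):
        raise ValueError("need two non-empty lists of equal length")
    sent_0, received_0 = sent_ms[0], received_ms[0]
    return [(r - received_0) - (s - sent_0)
            for s, r in zip(sent_ms[1:], received_ms[1:])]


# Made-up timestamps: the two clocks start at unrelated offsets.
sent = [1_000, 6_000, 15_000]            # sender's timeline
received = [251_300, 256_340, 265_290]   # receiver's timeline
print(relative_time_differences_ms(sent, received))  # [40, -10]
```

A positive value means the message arrived later than the baseline message would predict; a negative value means it arrived earlier. Any constant offset between the two clocks cancels out, but drift between them and variable network delay do not.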
During the data collection phase, three messages from the driver to the partner and one from the partner to the driver were delayed by VNC for more than five seconds, with the data stream from the VNC Client not being sent immediately to the VNC Server. These four messages were removed from further analysis, leaving 1,196 messages.
Further, the first sent/received message of each pair of subjects was excluded for both sides because it was the baseline of the time-log adjustment and there was no message sent/received before it. Therefore, 1,148 messages were considered (584 - 24 - 3 = 557 sent from the driver to the partner; 616 - 24 - 1 = 591 from the partner to the driver).
In Fig. 3a, when the driver sent messages to the partner, the relative time differences were normally distributed, with a mean of 0.70 ms and a standard deviation of 73 ms. Approximately 83 % (463/557) of messages had relative time differences within ± 73 ms (one standard deviation), and over 99 % (553/557) were within ± 3σ. Similar results were found for the messages sent by the partner to the driver (Fig. 3b). The mean and standard deviation of the relative time differences were 8.2 ms and 73 ms. Time differences were within ± 73 ms for 77 % (454/591) of messages and within ± 3σ for over 99 % (589/591). Thus the mean delay was about 8 ms longer from the partner relative to the first message, but the standard deviation was identical, which is most important. As a reminder, times were determined to the nearest 0.01 s by Morae.
Thus, the message transmission delays due to VNC and Morae were quite stable. However, as long as communication occurred over the Internet, the communication delays introduced could not be completely avoided. Certainly, using identical hardware for the driver and partner would have led to more consistent times, but the expected differences due to hardware are likely to be much less than those due to the Internet. What is important is that in many situations the inter-keystroke intervals for each device were of sufficient accuracy.
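The message counts and percentages reported above can be re-derived with a few lines of Python. All numbers are taken from the text; only the variable names are introduced here.

```python
# Message counts reported in the text, by direction.
sent = {"driver_to_partner": 584, "partner_to_driver": 616}
delayed = {"driver_to_partner": 3, "partner_to_driver": 1}  # VNC delay > 5 s
pairs = 24  # one baseline message per pair and direction is excluded

analyzed = {d: sent[d] - pairs - delayed[d] for d in sent}

# Numbers of messages within one and three standard deviations (73 ms each).
within_1sd = {"driver_to_partner": 463, "partner_to_driver": 454}
within_3sd = {"driver_to_partner": 553, "partner_to_driver": 589}

for direction, n in analyzed.items():
    share_1sd = within_1sd[direction] / n
    share_3sd = within_3sd[direction] / n
    print(f"{direction}: n = {n}, "
          f"within 1 SD = {share_1sd:.1%}, within 3 SD = {share_3sd:.1%}")

print("total analyzed:", sum(analyzed.values()))
```

This yields 557 and 591 analyzed messages (1,148 in total), with about 83 % and 77 % of the differences within one standard deviation and more than 99 % within three in both directions.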
4.1 Strengths of the Method Used
This paper describes in detail a method that easily and accurately collects keystrokes on mobile devices to the nearest centisecond, and provides example performance data collected using this method from a case study. The performance of the configuration was quite good. The data recorded were very stable. Some 83 % (from driver to partner) and 77 % (from partner to driver) of the messages had time differences within ± 1σ (73 ms in both directions), using the first messages as the baselines. When the error tolerance was ± 100 ms, some 93 % (from driver to partner) and 88 % (from partner to driver) were included. This is excellent, considering that timing was to the nearest 10 ms. Kukreja et al. and Austin et al. report peak and mean inter-keystroke intervals of 175 ms and 356 ms for typing on a full-size computer keyboard. Therefore, the data-logging configuration in this study was accurate enough to record keystroke timing. The timing was unaffected by the length of the message sent and was fairly stable throughout the experiment.
4.2 Concerns with the Method Used
There are three sources of potential timing errors: (1) the Internet over which the communication occurred, (2) the software (VNC) used to exchange messages, and (3) the computers used to log the timestamps. The Internet was used in this case for ease of access. There were three outliers in the messages sent from the driver to the partner and one from the partner to the driver. This corresponds to only 0.3 % of all messages, an acceptably low amount. Oddly, three of these cases occurred in the afternoon, between 1 and 4 pm, on a particular day, which is why some sort of Internet-related cause is suspected. In theory, a LAN dedicated to an experiment should provide more consistent transmission times because the load is stable and the hardware fixed. However, few applications support LANs, customizing a LAN information exchange platform is time consuming and costly, and the timing will not be as accurate as that of native programs.
Finally, although not considered to be a major source of timing problems, the driver and partner logging computers were different, so their processing time could differ. Using two identical computers is recommended to eliminate any suspicion of a problem.
4.3 Closing Thought
In summary, the method described in this paper provides a simple, low-cost, and accurate way to record and time keystroke-level actions for mobile devices, something which is extremely difficult to do as well using other methods. These data are essential for performing detailed analyses of user actions in applied usability studies and more fundamental analyses of how people interact with mobile devices. The accuracy obtained with the configuration described is good enough for most purposes, but it is not perfect. The next step is to explore (1) using LANs to improve timing, (2) variations in VNC performance as a function of hardware, and (3) recording performance for other input gestures such as swiping and dragging (in particular their path and click locations). Researchers are strongly encouraged to use this method in their research.
References
- 1. Hornbæk, K., Law, E.L.-C.: Meta-analysis of correlations among usability measures. Paper presented at the CHI 2007 Proceedings, San Jose, CA, USA (2007)
- 3. Klockar, T., Carr, D.A., Hedman, A., Johansson, T., Bengtsson, F.: Usability of mobile phones. Paper presented at the Proceedings of the 19th International Symposium on Human Factors in Telecommunication, Berlin, Germany (2003)
- 4. Silfverberg, M., MacKenzie, S., Korhonen, P.: Predicting text entry speed on mobile phones. Paper presented at the CHI 2000, The Hague, The Netherlands (2000)
- 6. MacKenzie, S., Kober, H., Smith, D., Jones, T., Skepner, E.: LetterWise: prefix-based disambiguation for mobile text input. Paper presented at the Proceedings of the 14th Annual ACM Symposium on User Interface Software and Technology, Orlando, FL, USA (2001)
- 8. Holleis, P., Otto, F., Hußmann, H., Schmidt, A.: Keystroke-level model for advanced mobile phone interaction. Paper presented at the CHI 2007 Proceedings, San Jose, CA, USA (2007)
- 12. Gupta, A., Cozza, R., Milanesi, C., Lu, C.: Market Share Analysis: Mobile Phones, Worldwide, 4Q12 and 2012. Gartner, Inc. (2013)
- 13. Green, P., Lin, B., Kang, T.-P., Best, A.: Manual and Speech Entry of Text Messages while Driving. University of Michigan Transportation Research Institute (UMTRI), Ann Arbor (2011)
- 14. Hecht, R.M., Tzirkel, E., Tsimhoni, O.: Language models for text messaging based on driving workload. Paper presented at the 4th International Conference on Applied Human Factors and Ergonomics, San Francisco, CA, USA (2012)
- 15. Winter, U., Grost, T.J., Tsimhoni, O.: Language pattern analysis for automotive natural language speech applications. Paper presented at the Proceedings of the 2nd International Conference on Automotive User Interfaces and Interactive Vehicular Applications, Pittsburgh, PA, USA (2010)
- 16. Winter, U., Ben-Aharon, R., Chernobrov, D., Hecht, R.M.: Topics as contextual indicators for word choice in SMS conversations. Paper presented at the Proceedings of the SIGDIAL 2011: The 12th Annual Meeting of the Special Interest Group on Discourse and Dialogue, Portland, OR, USA (2011)