Response time accuracy in Apple Macintosh computers
The accuracy and variability of response times (RTs) collected on stock Apple Macintosh computers using USB keyboards were assessed. A photodiode detected a change in the screen’s luminosity and triggered a solenoid that pressed a key on the keyboard. The RTs collected in this way were reliable, but could be as much as 100 ms too long. The standard deviation of the measured RTs varied between 2.5 and 10 ms, and the distributions approximated a normal distribution. Surprisingly, two recent Apple-branded USB keyboards differed in their accuracy by as much as 20 ms. The most accurate RTs were collected when an external CRT was used to display the stimuli and Psychtoolbox was able to synchronize presentation with the screen refresh. We conclude that RTs collected on stock iMacs can detect a difference as small as 5–10 ms under realistic conditions, and this dictates which types of research should or should not use these systems.
Keywords: Reaction time, Response time, Measurement, Experimentation, Computers
Researchers studying vision or other areas of psychophysics have long been sensitive to issues of measurement, especially response time (RT), accuracy, and various properties of the visual display. For researchers in other areas of experimental psychology, however, many of the issues are less critical. For example, in a standard memory experiment, the researcher is more often interested in relative speeds of responding in two (or more) conditions than in absolute time to respond. Similarly, although precise knowledge of the onset time of a stimulus is important, the researcher does not necessarily worry about other factors that may be critical in psychophysical studies (e.g., luminance or saturation). Although information is available about individual LCD displays or computers running Microsoft Windows (e.g., Plant & Turner, 2009) or older versions of Apple’s operating system (e.g., MacInnes & Taylor, 2001), we could find no publications examining timing accuracy and variability for systems running recent versions of Mac OS X. The purpose of the present work was to answer the following question: How accurate are RTs collected on Apple Macintosh computers?
The basic logic of the tests reported is as follows. In an experiment, the program starts a clock and displays a stimulus on the screen. The subject presses a key on the keyboard, and the clock is stopped. The dependent variable is the RT, the difference between when the clock started and when it stopped. We replaced the subject with a device that always took the same amount of time to press a key on the keyboard once the stimulus was shown. Because the device's latency is constant, any variation observed in the measured RTs must be attributable to a combination of the computer hardware (e.g., display, USB bus, and keyboard) and software (e.g., operating system and specific program running the experiment). This is similar to the method of De Clercq, Crombez, Buysse, and Roeyers (2003), except that we used dedicated hardware expressly constructed for the purpose, whereas they used a second general-purpose computer.
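The measurement logic described above can be sketched in a few lines of Python. This is an illustration only, not the code used in the assessments (which ran Psychtoolbox under MATLAB or Octave). The callables `show_stimulus` and `wait_for_keypress` are hypothetical stand-ins for whatever the experiment software provides, and `fake_device_press` is an assumed substitute for the solenoid device.

```python
import time


def measure_rt_ms(show_stimulus, wait_for_keypress, clock=time.perf_counter):
    """Return one RT in milliseconds.

    The clock is started, the stimulus is shown, and the clock is stopped
    once a keypress has been registered.
    """
    start = clock()        # start the clock
    show_stimulus()        # hypothetical: draw the stimulus on the screen
    wait_for_keypress()    # hypothetical: block until a key is pressed
    stop = clock()         # stop the clock
    return (stop - start) * 1000.0


def fake_device_press(latency_s=0.038):
    """Stand-in for a device that takes a fixed time to press a key."""
    time.sleep(latency_s)


# Three simulated trials; any spread in these values comes from the
# computer (here, the sleep call and scheduler), not from the "subject".
rts = [measure_rt_ms(lambda: None, fake_device_press) for _ in range(3)]
print([round(rt, 2) for rt in rts])
```

Passing the clock as a parameter keeps the function easy to check against a known timer, which mirrors the validation approach taken here.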
Hardware and software
Technical Services at Memorial University of Newfoundland built a custom testing box that consisted of the following: A photodetector monitored the display for a change in luminance on the screen; when this occurred, a relay was activated that in turn activated a solenoid. The solenoid was positioned over the keyboard and pressed a key.
Details and calibration
The testing box used a PIC 16F877 microcontroller, and the T2 timer was set to interrupt at 1 kHz. For the purposes of calibration, the testing box also included an LED and a single test button. On a calibration run, the solenoid was positioned over the onboard test button and the photodetector was pointed at the LED. The microcontroller set logical 1 at one of its outputs, which caused the LED to light. The photodetector was monitored at one of the microcontroller’s analog inputs. When the photodetector output voltage reached a value chosen to be 2/3 of a typical scale reading, a second output of the microcontroller was set to logical 1. This second output activated a relay, and in turn a solenoid, which pressed the test button. The buttonpress grounded another microcontroller input, which was normally at logical 1. Detection of logical 0 at this input was monitored by the microcontroller, which reported the number of 1-ms periods elapsed.
This basic cycle—from setting a pin high in order to light the LED to noting that the test button had been pressed—was measured in two independent ways. First, the microcontroller always reported a time of 38 ticks of a 1-kHz timer. Second, a Tektronix oscilloscope (model TDS2012B) reported a value of 38.0 ms. Thus, all results reported below have this 38-ms value subtracted from the measured time. With an ideal system, the resulting latencies should be 0 ms; any value higher than 0 is due to the computer/monitor/software.
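The correction applied to all reported values amounts to a constant subtraction. The following Python sketch shows the arithmetic; the function names and the raw readings are hypothetical and serve only as an illustration.

```python
DEVICE_LATENCY_MS = 38.0  # latency of the testing box, from the calibration above


def ticks_to_ms(ticks, timer_hz=1000):
    """Convert timer ticks to milliseconds (a 1-kHz timer gives 1 ms per tick)."""
    return ticks * 1000.0 / timer_hz


def system_latency_ms(measured_rts_ms, device_latency_ms=DEVICE_LATENCY_MS):
    """Subtract the constant device latency from each measured RT.

    With an ideal system the result would be 0 ms; anything above 0 is
    delay added by the computer, display, keyboard, and software.
    """
    return [rt - device_latency_ms for rt in measured_rts_ms]


raw = [78.2, 80.9, 76.4]  # made-up readings for illustration, not data from the study
print(ticks_to_ms(38))
print([round(x, 1) for x in system_latency_ms(raw)])
```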
Two iMac computers were assessed, a recent model and an eight-year-old model. The newer model was a 24-in. iMac with the model identifier iMac8,1, sold between April 2008 and March 2009. This machine has a 2.8-GHz Intel Core 2 Duo processor, a bus speed of 1.07 GHz, and 2 GB 800-MHz DDR2 SDRAM. The graphics card is an ATI Radeon HD 2600 Pro driving a 24-in. glossy TFT Active Matrix LCD (1,920 × 1,200). This iMac was running Mac OS X 10.6.3 (build 10D573) with all software updates installed as of May 5, 2010.
The older model was a 15-in. iMac with the model identifier PowerMac4,2, sold between January 2002 and February 2003. This machine has a 700-MHz G4 processor, a bus speed of 100 MHz, and 512 MB PC133 SDRAM. The graphics card is an NVIDIA GeForce2 MX powering a 15-in. TFT Active Matrix LCD display (1,024 × 768). This iMac was running Mac OS X 10.4.11 (build 8S165) with all software updates installed as of May 5, 2010.
Two types of Apple-branded keyboards were tested: The currently available (as of 2010) aluminum USB keyboard (Model A1243), which came with the 24-in. iMac, and the previous generation white USB keyboard (Model A1048). Note that the white keyboard was one generation more recent than that which originally came with the 15-in. iMac. An Apple-branded USB mouse (A1152) remained attached during the tests.
While the tests were being conducted, no other user-initiated software was running except for the following: MATLAB requires that the X11 application be running, and Octave runs inside a Terminal.app window. Unnecessary services (e.g., Bluetooth, wireless, Time Machine, and file sharing) were turned off, but the computer remained connected to the Internet via an ethernet cable. There was no antivirus software running.
The first assessment addressed reliability. The 24-in. iMac and the default aluminum keyboard that came with it were used, and the software was Psychtoolbox running under MATLAB. For each of five runs, the computer was first rebooted under a nonadministrator account, and only MATLAB was running.
Table 1. Measures of central tendency, range, and standard deviation for each of five replications using Psychtoolbox running under MATLAB on a 24-in. iMac with an aluminum USB keyboard (top) and for five distributions randomly generated from a normal distribution using MATLAB (bottom); see the text for details
Descriptive statistics for each are shown in the bottom half of Table 1. There is little difference between the observed and generated RTs except for the range, which increased from a mean of 13.605 for the observed to a mean of 18.047 for the generated. The means, standard deviations, quartile, and skewness measures are all comparable; if anything, the generated distributions show a slightly wider range and more variability in skewness than the observed distributions.
The bottom panel of Fig. 1 shows the cumulative frequency distribution for all 5,000 measured RTs combined and all 5,000 generated RTs combined. While there are a few minor differences, the observed function approximates the generated function quite closely. The right-most column in Table 1 shows the descriptive statistics for these combined functions. With the exception of the range, the numbers are quite comparable.
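A comparison of this kind can be sketched as follows. The sketch assumes that each generated distribution is drawn from a normal distribution with the same mean and standard deviation as a set of observed RTs. It is written in Python rather than MATLAB, and the function names and the `observed` values are hypothetical placeholders, not data from Table 1.

```python
import random
import statistics


def describe(xs):
    """Mean, SD, quartiles, and range for a list of RTs (in ms)."""
    q1, median, q3 = statistics.quantiles(xs, n=4)
    return {
        "mean": statistics.fmean(xs),
        "sd": statistics.stdev(xs),
        "min": min(xs),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(xs),
        "range": max(xs) - min(xs),
    }


def generate_matched_normal(observed, n=1000, seed=0):
    """Draw n values from a normal distribution matched to the observed mean and SD."""
    rng = random.Random(seed)
    mu = statistics.fmean(observed)
    sigma = statistics.stdev(observed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


observed = [38.0, 40.0, 42.0, 44.0, 41.0]  # placeholder values for illustration
generated = generate_matched_normal(observed)
print(describe(observed))
print(describe(generated))
```

Because a normal distribution has unbounded tails, a large generated sample will tend to show a wider range than a measured one with the same standard deviation.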
This assessment shows that RTs collected on a 24-in. iMac and default aluminum keyboard with MATLAB and Psychtoolbox produce RTs that are, on average, approximately 40 ms too long. However, there are no noticeable differences between different runs, the standard deviation is quite small, and the cumulative frequency distribution is similar to one generated from a normal distribution.
The second assessment compared two different Apple-branded keyboards and also compared Psychtoolbox running under MATLAB and Octave. One set of tests was run using the same keyboard as in Assessment 1 (i.e., the aluminum USB keyboard that was still shipping in 2010); a second set of tests was run using the previous-generation white USB keyboard. In addition, one set of tests ran Psychtoolbox under MATLAB, which is proprietary commercial software, and the second set ran the same Psychtoolbox code under Octave, which is mostly compatible with MATLAB but is freely redistributable under the terms of the GNU General Public License.
Measures of central tendency, range, and standard deviation for RTs measured on two types of keyboards (aluminum or white) and two kinds of software running Psychtoolbox (MATLAB or Octave)
There were no differences observable between running Psychtoolbox code under MATLAB or under Octave, but there was a large difference between the two most recent Apple-branded keyboards, with the older keyboard yielding more accurate RTs.
One advantage of Psychtoolbox is that the presentation of stimuli on a display can be synchronized with the vertical refresh. To the extent that the software is successful, the start of the timer coincides with the actual display, and therefore most of the remaining timing noise may be attributable to issues with the USB keyboard. The third assessment examined accuracy and variability in RTs when the stimuli were shown on an external CRT rather than on the built-in LCD.
The 24-in. iMac supports a second display, and a ViewSonic Professional Series PF790 CRT was connected via a mini-DVI to a VGA adapter. The CRT was set to a resolution of 1,024 × 768 at 85 Hz. In addition, we compared mirrored (i.e., both displays set to the same resolution of 1,024 × 768 and showing the same images) and nonmirrored (i.e., the built-in LCD set to 1,920 × 1,200 and the CRT set to 1,024 × 768, with the MATLAB main window displayed on the LCD and the “stimuli” shown on the CRT) modes. Both the white and aluminum keyboards were tested.
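To give a sense of scale for why synchronization with the vertical refresh matters, the sketch below computes the refresh interval and the timing noise that would arise if stimulus onset were not synchronized. It rests on a simplifying assumption that is ours, not a measured property of the systems tested: an unsynchronized draw request lands at a uniformly random point within a refresh interval. The 60-Hz value is a hypothetical comparison.

```python
import math


def refresh_period_ms(refresh_hz):
    """Duration of one refresh interval in milliseconds."""
    return 1000.0 / refresh_hz


def unsynchronized_onset_sd_ms(refresh_hz):
    """SD of a delay uniformly distributed over one refresh interval.

    For a uniform distribution of width w, the SD is w / sqrt(12).
    """
    return refresh_period_ms(refresh_hz) / math.sqrt(12.0)


# 85 Hz is the CRT setting used here; 60 Hz is a hypothetical comparison value.
for hz in (85, 60):
    print(f"{hz} Hz: period {refresh_period_ms(hz):.2f} ms, "
          f"onset SD if unsynchronized {unsynchronized_onset_sd_ms(hz):.2f} ms")
```

At 85 Hz the refresh interval is about 11.8 ms, so under this assumption an unsynchronized timer start alone could add a few milliseconds of variability.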
Measures of central tendency, range, and standard deviation for RTs measured with MATLAB and Psychtoolbox with an aluminum or white keyboard when stimuli were shown on a CRT set at 85 Hz and the built-in display either mirrored or did not mirror the CRT
The RTs were faster when an external CRT was used rather than the built-in LCD display. The fastest RT detected with the white keyboard was just 5.6 ms with the CRT, compared with 13.4 ms with the built-in display. The comparable values for the aluminum keyboard were 23.9 and 32.9 ms.
Unlike in previous assessments, the skewness measures for the RTs collected on the G4 iMac were well outside the range of those from the generated distributions in Assessment 1, and were also more than two standard errors from 0. This was due primarily to a larger range at the higher end of the distribution than in the distributions from the Intel iMac. For both computers and both keyboards, the standard deviations were almost twice as large as those seen in previous assessments.
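The criterion of skewness lying more than two standard errors from 0 can be checked as in the following sketch. It assumes the adjusted Fisher-Pearson sample skewness and its usual standard error; the formula actually used for the reported values is not stated in this section, so treat the choice as an assumption. The function names and example data are hypothetical.

```python
import math


def sample_skewness(xs):
    """Adjusted Fisher-Pearson sample skewness (requires at least 3 values)."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    g1 = m3 / m2 ** 1.5
    return g1 * math.sqrt(n * (n - 1)) / (n - 2)


def skewness_standard_error(n):
    """Standard error of the sample skewness under normality."""
    return math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))


def is_clearly_skewed(xs):
    """True if skewness is more than two standard errors from 0."""
    return abs(sample_skewness(xs)) > 2 * skewness_standard_error(len(xs))


# Made-up example: most values near 40 ms with a long upper tail.
long_tailed = [40.0] * 90 + [60.0] * 10
print(round(sample_skewness(long_tailed), 3), is_clearly_skewed(long_tailed))
```

For 1,000 observations the standard error is roughly 0.077, so a skewness beyond about ±0.15 would meet this criterion.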
The fifth assessment focused on collecting data over the Internet using Java applets (e.g., Stevenson, Francis, & Kim, 1999) and again compared performance on the recent 24-in. iMac to that on an eight-year-old 15-in. iMac.
Measures of central tendency, range, and standard deviation for RTs measured with Java on two types of keyboards and two different iMac computers
Measures of central tendency, range, and standard deviation for RTs measured with Flash on two types of keyboards and two different iMac computers
As can be seen, more than 100 observations are necessary to detect a 1-ms difference once the standard deviation is larger than 2 ms. However, larger differences can be consistently detected with fewer observations, as long as the standard deviation remains small. It should be kept in mind that the criterion of “consistently detected” means that each of the 1,000 statistical tests in one run resulted in a p of .05 or less.
Although this simulation depends on certain assumptions, it does offer one way of estimating the number of observations required in order to detect a particular difference in RTs. If the particular combination of hardware and software results in an approximately normal distribution with a small standard deviation, then a 1-ms difference can be consistently detected. However, given that the smallest standard deviation we observed in any of the assessments was on the order of 2.6 ms, the smallest difference in magnitude that a stock iMac could detect under reasonable conditions is approximately 5–10 ms.
One way of reading the figure is to note that with a large difference in magnitude between the two RTs of interest, the standard deviation of the distribution of measured RTs does not matter much; that is, a real difference of 50 ms will be consistently detected with just a few measurements, even with a standard deviation larger than any we observed. While this is a reasonable reading, we prefer to emphasize an alternate view: The smaller the difference in RTs, the more critical it is to know the properties of the timing device used.
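A simulation in this spirit can be sketched as follows. The design choices here are assumptions made for illustration and may differ from the simulation reported above: two conditions whose measured RTs are normally distributed with the same standard deviation, an independent-samples t test with pooled variance, and 1,000 tests per run with a criterion of p ≤ .05. The function names, the 500-ms baseline, and the seed are hypothetical.

```python
import math
import random
import statistics


def t_two_sided_p(t, df, steps=400):
    """Two-sided p value for Student's t, by Simpson integration of the pdf."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    upper = abs(t)
    h = upper / steps
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # P(0 < T < |t|)
    return max(0.0, 1.0 - 2.0 * area)


def t_test_p(a, b):
    """Independent-samples t test with pooled variance; returns the p value."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    pooled = ((na - 1) * statistics.variance(a)
              + (nb - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled * (1 / na + 1 / nb))
    if se == 0.0:  # degenerate case: no variability at all
        return 0.0 if statistics.fmean(a) != statistics.fmean(b) else 1.0
    return t_two_sided_p((statistics.fmean(a) - statistics.fmean(b)) / se, df)


def proportion_detected(diff_ms, sd_ms, n_obs, n_tests=1000, alpha=0.05,
                        base_ms=500.0, seed=1):
    """Proportion of simulated tests that detect a true difference of diff_ms."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_tests):
        a = [rng.gauss(base_ms, sd_ms) for _ in range(n_obs)]
        b = [rng.gauss(base_ms + diff_ms, sd_ms) for _ in range(n_obs)]
        if t_test_p(a, b) <= alpha:
            hits += 1
    return hits / n_tests


# A proportion of 1.0 corresponds to "consistently detected" as defined above.
for n_obs in (5, 20, 100):
    print(n_obs, proportion_detected(diff_ms=1.0, sd_ms=2.6, n_obs=n_obs))
```

The t distribution is integrated numerically so that the sketch needs only the standard library; a statistics package would normally be used instead.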
Some research areas require highly specialized equipment for displaying stimuli and collecting RTs (e.g., tachistoscopes, high-end CRT displays, and dedicated response boxes). Although these tools are essential for many kinds of psychophysical and perceptual work, in other areas many aspects of the display or response apparatus are not critical to the questions being asked or to the conclusions that are made. Many memory paradigms fall into the latter category. However, prior to this report, there was no available information on the accuracy and characteristics of RTs collected on stock Apple Macintosh computers running Mac OS X and equipped with Apple-branded keyboards. Thus, the quality of RT data collected using these machines was unknown. The assessments presented here allow the researcher to make informed decisions about the types of hardware needed to answer the questions that are being investigated.
We found that RT distributions collected using Psychtoolbox were comparable to distributions generated from a normal distribution. The most accurate RTs occurred using Psychtoolbox (running under either MATLAB or Octave), an external CRT, and the older white keyboard, but even when the built-in display was used, the distributions remained largely unchanged except for the mean.
Surprisingly, we also found a large difference between two Apple-branded keyboards. This difference was larger than the difference in accuracy between a current and an eight-year-old computer. As noted in other studies (e.g., Plant, Hammond, & Turner, 2004), if half of the subjects in a study used one type of keyboard and the remaining half used the second type of keyboard, the RTs of the two groups would be statistically different.
Our simulation results are consistent with previous examinations of clock resolution. For example, Ulrich and Giray (1989, p. 11) concluded that the time resolution of a clock has “almost no effect on detecting mean RT differences even if the time resolution is about 30 ms or worse.” The data in Fig. 8 provide a resource so that a researcher can make a more informed evaluation of whether the likely differences in RTs can be observed with a stock Apple Macintosh computer, and also reveal the increasing importance of validating the timing in a particular experiment as the magnitude of the RT difference of interest decreases. The smallest magnitude likely to be detected consistently is on the order of 5–10 ms.
Should researchers conduct experiments on stock Apple Macintosh computers when the dependent variable is RT? Given the variability in RTs observed above, we strongly recommend that researchers using any computer to collect RTs assess the accuracy and reliability of their chosen platform. It is always desirable to minimize sources of error, and therefore one should validate the system on which one is collecting data. Given our findings above, we can recommend using the particular hardware/software combinations tested in only some situations. To the extent that the research examines small differences, or that absolute measures of time are important, or that the properties of the visual display are critical, or that synchronizing two or more items is critical, the answer must be no. However, if a researcher tests all subjects using the exact same hardware, if the focus is on relative rather than absolute RTs, if the differences in RTs in the conditions to be examined are expected to be fairly large (e.g., at least 20–40 ms), if only certain software is used, and if many properties of the visual display are not of critical importance, then the conclusions drawn from RT data collected on a stock iMac are likely to be the same as those drawn from RT data collected on custom or high-end hardware.
GNU Octave is a freely available numerical computing language that is mostly compatible with MATLAB. See www.gnu.org/software/octave/.
The RTs listed as “MATLAB – Aluminum” are the same as those shown as Replication 1 in Assessment 1.
We used three different aluminum keyboards (all Model A1243) and two different white keyboards (all Model A1048). We detected no differences between keyboards of the same model, but always observed the same magnitude difference between keyboard models.
This work was supported, in part, by grants from NSERC to D.H., I.N., and A.M.S. We thank D.J.K. Mewhort for piquing our interest in this topic. Example source code is available from http://memory.psych.mun.ca/research/code.shtml or from the first author.
- Ulrich, R., & Giray, M. (1989). Time resolution of clocks: Effects on reaction time measurement—Good news for bad clocks. The British Journal of Mathematical and Statistical Psychology, 42, 1–12.