During naturalistic tasks, such as supervising children on a playground, watching sports events, or driving a vehicle, the human environment changes dynamically over time. In many of these situations it is crucial to keep track of multiple independently moving objects. In laboratory experiments, this ability can be studied with the multiple object tracking paradigm (MOT; Pylyshyn & Storm, 1988). In the almost three decades since the seminal work of Pylyshyn and Storm, there has been an impressive increase in MOT research. To date, more than 160 journal articles have studied a broad variety of aspects of MOT. Because MOT provides a good means to study dynamic visual attention, the aim of this tutorial review is to provide researchers without prior experience in MOT with an introduction to MOT research (for more specific reviews, see Scholl, 2009; Scimeca & Franconeri, 2015). For this purpose, we have divided this manuscript into three sections. In the first section, we introduce the MOT paradigm as well as its link to visual attention. In the second section, we present and discuss the evidence for several theories of MOT. Finally, in the third section, we review the state of the art of recent research topics in multiple object tracking and summarize how the described research has fostered the understanding of visual attention. We conclude with a brief outlook on potential future directions of research on visual attention using MOT.

Multiple object tracking

The central features of the multiple object tracking task closely match several features of attentionally demanding naturalistic tasks. Most importantly, the setup of the task is highly dynamic as the objects change their location over time. This transience of information requires a continuous deployment of visual attention in order to avoid confusions between the objects. In turn, studying the MOT task might help us to understand the basic principles of dynamic visual attention as it operates in many real-world situations. Given the complexity of tracking, it is not surprising that MOT is related not only to other tasks that draw upon the efficiency of visual attention (Huang, Mo, & Li, 2012) but also to processes of attentional selection (Franconeri, Alvarez, & Enns, 2007) and working memory (e.g., Oksama & Hyönä, 2004).

A good way to get started with the MOT paradigm obviously is to experience the tracking task. For this purpose, we have uploaded several video demonstrations of the task (which increase in speed/difficulty; go to https://www.iwm-tuebingen.de/public/realistic_depictions_lab/mot_demo).

In this section, we will introduce the main paradigm of MOT research. Similar to the experiments of Pylyshyn and Storm (1988), most MOT studies have explored tracking performance within a 2-D frame of reference on common lab computers. Although some studies have embedded the MOT task in more immersive setups, such as virtual realities (Lochner & Trick, 2014; Thomas & Seiffert, 2010), large projection screens (Franconeri, Lin, Pylyshyn, Fisher, & Enns, 2008), or three-dimensional scenes (Pylyshyn, Haladjian, King, & Reilly, 2008), all MOT studies follow the same principles. Therefore, we will outline these principles first. We will then review the evidence illustrating that MOT draws upon attentional resources and how these attentional resources alter MOT performance.

The MOT paradigm

Several variations of the MOT paradigm have been published to study the mechanisms of tracking; however, the basic paradigm is similar across almost all MOT experiments. First, several visually indistinguishable objects appear onscreen (typically six to 10 objects). Thereafter, a subset of these objects is designated as target objects (typically three to five targets). The participants are instructed to track this subset of objects across an interval of object motion. The motion paths of the objects (e.g., Brownian motion vs. straight motion paths) as well as object speed (1 to 15 degrees of visual angle per second) usually vary between studies depending on the actual research question. After the interval of object motion, performance is typically measured either with a probe-one or a mark-all procedure. In a mark-all procedure, participants mark all of the targets that they were able to track and guess the remaining target objects. In contrast, in a probe-one procedure, one of the objects is probed (a target in one half of all trials), and participants decide for this specific object whether it is a target or not. Both procedures are depicted in Fig. 1.
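The trial structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in any of the cited studies: the function names (make_trial, probe_one, score_mark_all), the unit-square display, the straight motion paths with reflection at the display border, and all default parameter values are assumptions chosen for brevity.

```python
import math
import random


def make_trial(n_objects=8, n_targets=4, n_frames=300, speed=0.01, seed=None):
    """Return (targets, frames) for one simulated MOT trial in a unit square.

    Hypothetical helper: display size, speed, and motion rule are assumptions.
    """
    rng = random.Random(seed)
    # All objects look identical; only this index set marks the targets.
    targets = set(rng.sample(range(n_objects), n_targets))
    pos = [[rng.random(), rng.random()] for _ in range(n_objects)]
    angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_objects)]
    vel = [[speed * math.cos(a), speed * math.sin(a)] for a in angles]

    frames = []
    for _ in range(n_frames):
        for p, v in zip(pos, vel):
            for axis in (0, 1):
                p[axis] += v[axis]
                # Reflect at the display border so objects stay onscreen.
                if p[axis] < 0.0:
                    p[axis] = -p[axis]
                    v[axis] = -v[axis]
                elif p[axis] > 1.0:
                    p[axis] = 2.0 - p[axis]
                    v[axis] = -v[axis]
        frames.append([tuple(p) for p in pos])
    return targets, frames


def probe_one(targets, n_objects, rng):
    """Probe one object; it is a target on half of the trials."""
    probe_is_target = rng.random() < 0.5
    pool = [i for i in range(n_objects) if (i in targets) == probe_is_target]
    return rng.choice(pool), probe_is_target


def score_mark_all(targets, marked):
    """Number of correctly marked targets in a mark-all response."""
    return len(targets & set(marked))
```

A trial is generated with make_trial(seed=1); the returned frames can be drawn by any display library, and the response is then scored with either probe_one or score_mark_all.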

Fig. 1

Illustration of the procedure of multiple object tracking experiments. A subset of visually indistinguishable objects is designated as targets. Following target designation, the objects move for several seconds. Then the objects stop moving and participants either mark all targets (mark-all procedure; left column) or indicate for one probed object whether this object was a target or a distractor (probe-one procedure; right column)

Whether it is more sensible to apply the mark-all procedure or the probe-one procedure depends on the experimental manipulations of the specific experiment. Due to the higher number of measurements within the same trial, the mark-all procedure is generally more powerful. However, if the experimental manipulations involve tracking load (i.e., the number of targets), a probe-one procedure is preferable because this procedure maintains the same chance level across different tracking loads even when the total number of objects is constant. As the dependent variable, most studies report the number of correctly identified targets, the proportion of correctly identified targets, or tracking capacity (i.e., the number of actually tracked objects when corrected for guessing). Specific formulas for these calculations have been summarized by Hulleman (2005).
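The difference in chance levels, and one way of correcting for guessing, can be illustrated with a short sketch. It assumes a simple high-threshold guessing model for the mark-all procedure: an observer who has tracked m of t targets among n objects marks those m and guesses the remaining t - m marks among the n - m untracked objects, so that the expected number of correct marks is c = m + (t - m)^2 / (n - m). Solving for m gives the estimate below. The function names are hypothetical, and the formulas summarized by Hulleman (2005) should be consulted before use, as they may differ in detail.

```python
PROBE_ONE_CHANCE = 0.5  # a target is probed on half of all trials


def mark_all_chance(n_objects, n_targets):
    """Expected proportion of targets marked correctly by pure guessing."""
    return n_targets / n_objects


def mark_all_capacity(n_correct, n_objects, n_targets):
    """Estimate the number of tracked targets m from c correct marks.

    Assumes c = m + (t - m)**2 / (n - m), which rearranges to
    m = (c * n - t**2) / (n + c - 2 * t).
    """
    denom = n_objects + n_correct - 2 * n_targets
    if denom <= 0:
        raise ValueError("capacity is undefined for this combination of values")
    return (n_correct * n_objects - n_targets ** 2) / denom
```

With eight objects, mark_all_chance yields 0.25 for two targets but 0.5 for four targets, whereas the probe-one chance level stays at 0.5 regardless of load. Under the same assumptions, four correct marks out of four targets give a capacity estimate of 4, and performance at chance (two correct marks) gives an estimate of 0.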

The difficulty level of the MOT task can be varied parametrically (e.g., Alvarez & Franconeri, 2007; Bettencourt & Somers, 2009) up to an almost complete exhaustion of attentional resources. This becomes visible in experiments showing that tracking of a single object could be so exhausting that it is impossible to track a second object (Holcombe & Chen, 2012). Further, tracking also has been demonstrated to interfere with the extraction of the gist of a natural scene in a dual task setup (Cohen, Alvarez, & Nakayama, 2011). Across the large body of research, several variables that influence tracking performance have been identified. Some of these variables are linked to the evaluation of the competing theories of tracking (see section titled Theories of MOT and their evidence). However, besides their theoretical importance, these variables might serve to adjust the difficulty of the MOT task to any required level of difficulty. Typically, MOT performance declines with an increasing number of targets (e.g., Alvarez & Franconeri, 2007; Drew, Horowitz, Wolfe, & Vogel, 2011; Pylyshyn & Storm, 1988), an increasing number of distractors (e.g., Bettencourt & Somers, 2009; Sears & Pylyshyn, 2000), an increasing trial duration (e.g., Oksama & Hyönä, 2004), an increasing proximity between the objects in the display (e.g., Bettencourt & Somers, 2009; Franconeri, Jonathan, & Scimeca, 2010), as well as an increasing object speed (e.g., Holcombe & Chen, 2012; Meyerhoff, Papenmeier, Jahn, & Huff, 2016; Tombu & Seiffert, 2011). Please note that this list represents the most common manipulations only and makes no claim to be complete.

Situating MOT in research on visual attention

Although the MOT paradigm was not designed as an attentional paradigm initially, there is overwhelming behavioral, electrophysiological, and neuroimaging evidence that MOT draws heavily on attentional resources. For instance, Tombu and Seiffert (2008) demonstrated the attentional demands of MOT in a dual task setup. In their study, intervals of increased difficulty such as a spontaneous attraction of the objects or a temporary increase in object speed interfered with a tone discrimination task that was temporally aligned with these intervals (see Fig. 2). In a similar study, Kunar, Carter, Cohen, and Horowitz (2008) showed that MOT not only interferes with other attentionally demanding tasks but that tracking limitations also arise at more central stages of information processing. In their experiments, the generation of words during a telephone conversation (with the experimenter) impaired MOT performance. The close link between MOT and visual attention also becomes evident in a remarkable study by Huang et al. (2012), who investigated the interrelations between numerous attentional paradigms. In this study, MOT not only correlated with all other attentional paradigms that draw upon attentional efficiency but also correlated with a general factor of visual attention that was extracted from the intercorrelations of the individual tasks.

Fig. 2

Demonstration of attentional demands during multiple object tracking (Tombu & Seiffert, 2008). a The participants track four objects over time. During the trial the objects spontaneously attract each other (and/or increase in speed). Concurrently, the participants perform a tone pitch discrimination task. b Multiple object tracking performance as a function of interdot attraction and the stimulus onset asynchrony (SOA) between the attraction interval and the dual task. Dual-task interference was more pronounced with short SOAs. Figures reproduced with permission from Elsevier

Remarkably, however, participants are able to interrupt the tracking task for a few hundred milliseconds in order to perform a concurrent dual task. Such rapid task switching was reported by Alvarez, Horowitz, Arsenio, DiMase, and Wolfe (2005), who observed better tracking performance with a concurrent search task than would be expected with shared attentional resources. Furthermore, tracked objects do not need to be continuously visible during tracking as long as occlusion cues signal their disappearance (Scholl & Pylyshyn, 1999) or a global offset triggers memory processes (Horowitz, Birnkrant, Fencsik, Tran, & Wolfe, 2006). Further evidence for the role of task switching and memory processes during MOT arises from correlations between individual differences in MOT and task switching as well as working memory (Oksama & Hyönä, 2004). In general, previous research has established a strong connection between spatial attention and spatial working memory (Awh, Anllo-Vento, & Hillyard, 2000; Awh & Jonides, 2001; Smyth & Scholey, 1994). Therefore, a link between MOT and working memory is not too surprising (see Allen, McGeorge, Pearson, & Milne, 2006; Zhang, Xuan, Fu, & Pylyshyn, 2010). Conversely, a concurrent MOT task also disturbs processes of working memory such as feature binding (Fougnie & Marois, 2009). However, there is also clear evidence for a functional dissociation of tracking and memorizing (Carter et al., 2005; Fougnie & Marois, 2006).

In line with intuition, the MOT task requires visual selection as well as sustained visual attention to track the target objects (Ma & Flombaum, 2013). Although visual selection typically occurs at the beginning of the trials, these two processes do not mutually exclude each other. During tracking, observers are able to deselect previously tracked objects and select new target objects (multiple object juggling) without sacrificing accuracy (Wolfe, Place, & Horowitz, 2007; see also Ericson & Christensen, 2012; Pylyshyn & Annan, 2006). The distinction between selection and tracking has also been corroborated by electrophysiological and neuroimaging studies. For instance, in an event-related potential (ERP) approach, Drew and Vogel (2008) observed that increasing demands of the target selection process (i.e., prior to motion onset) were reflected in an increasing negativity roughly 200 ms after stimulus onset over the posterior electrodes (N2pc; see also Hopf et al., 2006; Woodman & Luck, 2003, for converging evidence from visual search). Increasing load during the tracking period was reflected in the contralateral delay activity (CDA; Drew et al., 2011). Because CDA amplitude is a better predictor for tracking performance than N2pc amplitude in trials of typical duration (Drew & Vogel, 2008), limitations in tracking performance arise from tracking itself rather than from visual selection. This matches the behavioral observation that the capacity limitation for visual selection (Franconeri et al., 2007) is higher than the limitation for tracking moderately fast-moving objects (e.g., Alvarez & Franconeri, 2007).

Importantly, Drew et al. (2011) were able to show that the load sensitivity of the CDA distinguishes attentive tracking from working memory tasks (see Vogel & Machizawa, 2004; Vogel, McCollough, & Machizawa, 2005). Although the CDA varied with the load manipulation in both types of tasks, it was much more pronounced in tracking trials than in memory trials. In fact, the CDA was even reduced when the tracked objects stopped moving. This observation agrees with the assumption that tracking moving objects requires sustained visual attention and cannot be reduced to a passive memorization of object locations. In further work, Drew, Horowitz, and Vogel (2013) took advantage of the load sensitivity of the CDA in order to disentangle different kinds of tracking errors. Here, the CDA amplitude indicated that an increase in the number of distractors causes confusion between targets and distractors (i.e., swaps), whereas increasing object speed results in dropping objects that were to be tracked.

In general, the electrophysiological evidence matches neuroimaging studies that have identified several cerebral areas that contribute to MOT, such as the intraparietal sulcus (IPS), the superior parietal lobule (SPL), the frontal cortex such as the frontal eye fields (FEF), the precentral sulcus (PreCS), and the motion sensitive areas in the MT+ complex (Culham et al., 1998). Two of the neuroimaging studies aimed at distinguishing areas that are sensitive to tracking load from areas that respond to the tracking task in general (Culham, Cavanagh, & Kanwisher, 2001; Jovicich et al., 2001). Across both studies, IPS as well as PreCS were sensitive to the manipulation of attentional load (see also Howe, Horowitz, Morocz, Wolfe, & Livingstone, 2009). This agrees with previous research that has highlighted the role of the IPS for spatial attention (e.g., Coull & Frith, 1998). Surprisingly, evidence for the contribution of early visual areas to MOT performance is remarkably absent in the neuroimaging studies. However, electroencephalography (Störmer, Winther, Li, & Andersen, 2013) as well as eye tracking (i.e., pupil dilation; Alnaes et al., 2014) have provided evidence that early visual processing also predicts MOT performance.

How attention modulates MOT

Whereas the abovementioned studies clearly show that MOT requires attention, the question of how attentional resources contribute to tracking is more controversial. The first studies that provided insights into the contributions of attention to tracking typically used dual task setups in which participants performed a probe detection task in addition to tracking (e.g., Pylyshyn, 2006). The principal idea of these experiments is that the allocation of visual attention toward target objects should enhance probe detection performance on targets relative to distractors. Indeed, this prediction was confirmed across a wide range of experiments (Huff, Papenmeier, & Zacks, 2012; Pylyshyn, 2006; Pylyshyn et al., 2008; Sears & Pylyshyn, 2000). A remarkable observation, however, was that probes that appeared on the empty background elicited higher detection rates than probes that appeared on distractors, suggesting that distractors are suppressed during tracking (Pylyshyn, 2006). The extent of this suppression depends on the similarity between targets and distractors in terms of motion and form (Feria, 2012) as well as depth plane (Pylyshyn et al., 2008; see also Rehman, Kihara, Matsumoto, & Ohtsuka, 2015; Viswanathan & Mingolla, 2002).

A central challenge for the interpretation of these dual task experiments is that the probe detection task might have affected the allocation of visual attention in the MOT task. In order to avoid this fallacy, Drew, McCollough, Horowitz, and Vogel (2009; see also Sternshein & Sekuler, 2011) recorded ERPs elicited by task-irrelevant probes that appeared on targets, distractors, stationary distractors, or the empty background. They observed more pronounced amplitudes of the N1 and the P1 (see Luck, 1995; Luck et al., 1994) for probes that appeared on targets relative to moving distractors or empty space but no evidence for distractor suppression. Even further, the magnitude of ERPs signaling target enhancement predicted tracking performance on the behavioral level. In other words, good and poor trackers could be distinguished based on their neural response (see Fig. 3).

Fig. 3

Illustration of a study conducted by Drew, McCollough, Horowitz, and Vogel (2009) addressing the role of attentional enhancement during multiple object tracking. a During the tracking task, task-irrelevant probes appear on target objects, distractor objects, stationary objects, or the empty background. b Good trackers show a stronger electrophysiological response to task-irrelevant probes on targets than poor trackers, signaling attentional enhancement. Figures reprinted with permission from Springer

It is noteworthy that Doran and Hoffman (2010) observed evidence for distractor suppression in the N1 amplitude. There were several differences in the stimulus design that might be responsible for the contrasting patterns of results. However, this study points out that target enhancement and distractor suppression might occur in parallel during tracking. Because the study by Drew et al. (2009) indicates that target enhancement can arise without distractor suppression, both mechanisms might reflect functionally independent processes. On the behavioral level, target enhancement and distractor suppression seem to warp the perceived spacing between the moving objects (Liverence & Scholl, 2011). Nevertheless, suppression does not blank out distractors completely, as demonstrated by studies showing that distractor locations are represented above chance level (Alvarez & Oliva, 2008), that repeating motion paths of targets as well as distractors enhances tracking performance (Ogawa, Watanabe, & Yagi, 2009), and that displacing distractors impairs tracking even when the displacements maintain the spacing between target and distractors (Meyerhoff, Papenmeier, Jahn, & Huff, 2015).

Theories of MOT and their evidence

Several theoretical frameworks have been proposed in order to explain limitations in MOT. A central difference between these theories is whether they propose fixed architectural constraints, such as slots, or a limited attentional resource that is allocated among the objects being tracked. The theories also differ in the role that is attributed to visual attention. Although recent tendencies seem to favor models without architectural constraints, the empirical evidence is still conflicting. Thus, the evaluation of the MOT theories is still subject to change. In this section, we review the theories in the order of their publication dates.

Visual index theory (FINST theory)

The basic assumption of the visual index theory (also called FINST theory; Pylyshyn, 1989, 2001, 2007) is the existence of a visual index mechanism in early vision that provides a connection between objects in the distal world and the visual representation in the mind (see Fig. 4a). This connection is achieved by preconceptual visual indexes (also called FINSTs, derived from FINgers of INSTantiation) that point and stick to feature clusters on the retina (Pylyshyn, 1989); that is, they provide a reference to objects in the scene. They are characterized as preconceptual because the visual indexes provide a reference to objects (like the referential “this” or “that” in speech) without encoding information about their identities. While this property was called “preattentive” in earlier work (Pylyshyn, 1989), this term was replaced by the term “preconceptual” in later work (Pylyshyn, 2001) to avoid misunderstandings and make clear that focused attention can also play a role during tracking (Pylyshyn, 2001). The visual indexes stick to the indexed objects across motion or eye movements and thereby allow for an automatic tracking of multiple objects in parallel, even if they look identical. Because visual indexes are a part of a mental architecture, their number is limited, presumably at about four or five (Pylyshyn, 2001). The visual index theory is not a theory of MOT in particular but a theory of vision in general, with findings beyond MOT, such as visual search or subitizing (Pylyshyn, 1994; Pylyshyn et al., 1994) supporting its assumptions.

Fig. 4

MOT theories. a According to the visual index theory, four or five indices that provide a pointer toward the actual object locations remain “sticking” onto moving objects. b The perceptual grouping model states that observers track the higher order object (i.e., the polygon) formed by the individual objects. c In the multifocal attention theory, independent attentional spotlights track the target objects. d The FLEX model proposes a demand based and dynamically changing allocation of visual attention toward target objects (e.g., based on interobject spacing). e According to the spatial interference theory, distractors that break through an inhibitory zone around targets increase the probability of tracking errors. Also, the inhibitory surround of one individual target is considered to interfere with the enhancement of other nearby targets

Originally, the MOT paradigm was designed to test the prediction of the visual index theory that observers can track multiple objects at a higher accuracy than predicted by a serial spotlight metaphor (Pylyshyn & Storm, 1988). It is important to note, however, that participants’ response times for the detection of flashes on targets increased with the number of targets in the display. That is, participants could track multiple objects in parallel but required covert attention to serially scan the indexed objects for feature changes because the visual indexes are merely pointers that do not provide feature information by themselves. Furthermore, Sears and Pylyshyn (2000) showed that a zoom-lens model of attention could also not account for the distribution of attention during MOT, because response latencies for form changes on distractors in a secondary task did not differ between distractors within or outside the convex hull formed by the target objects.

Regarding object selection, the visual index theory states that objects grab the visual indexes in a data-driven, parallel, and automatic manner, such as by new objects appearing in the visual field or by having objects that blink (Pylyshyn, 2001). In contrast, a voluntary allocation of visual indexes to objects requires focused attention such that local features can then pop out and attract a visual index (Pylyshyn, 2001). Results found by Pylyshyn and Annan (2006) support this assumption by showing that participants could track voluntarily selected objects (e.g., marked by digits) but that voluntary selection was sensitive to the duration of the marking phase and the number of target objects present in the display while automatic selection (blinking) was not.

A number of findings have challenged the visual index theory. Those studies were mainly concerned with the role of attention during MOT. For example, they showed that attention is utilized during MOT (Tombu & Seiffert, 2008) or that separate tracking resources are available for the left and right visual hemifields (Alvarez & Cavanagh, 2005; see also section titled Multifocal Attention Theory). Furthermore, it was argued that tracking is achieved by a flexible resource rather than by a fixed architecture, allowing participants to track up to eight objects when these objects are slow enough (Alvarez & Franconeri, 2007; see also the section titled FLEX Model). Increases in object speed can also result in tracking capacities far below the assumed four or five visual indexes (Alvarez & Franconeri, 2007; Holcombe & Chen, 2012). While those findings speak against the strong view that tracking is achieved solely by the visual index mechanism, the question of whether mechanisms of visual indexing and attention might co-occur during MOT (Pylyshyn, 2001) remains to be resolved.

Perceptual grouping model

Yantis (1992) proposed that the visual system groups the individual target objects into a higher order visual representation (i.e., three targets into a triangle or four targets into a quadrangle; see Fig. 4b). Evidence for this proposition comes from experiments that investigated grouping processes during either group formation or group maintenance. Although tracking performance benefits from manipulations that foster the formation of perceptual groups such as canonical configurations or explicit grouping instructions, Yantis (1992) has shown that these effects are of short duration. He therefore suggested that group formation is automatic and preattentive. In contrast, group maintenance during tracking was shown to be effortful. For instance, Yantis (1992) reported more accurate tracking when the higher order object (i.e., the polygon formed by the individual objects) remained intact as compared to when it collapsed during the trial.

Grouping processes during MOT are also affected by identity information (Erlikhman, Keane, Mettler, Horowitz, & Kellman, 2013; see also Zhao et al., 2014). Compared to a condition in which all targets shared the same feature, tracking performance was impaired when identity information divided the objects into two groups, each containing two targets and two distractors. Evidence for the importance of abstracted higher order representations in MOT comes from studies that emphasized the importance of the centroid of the target objects. If multiple targets are integrated into a higher order object such as the polygon connecting these targets, the centroid seems to reflect a plausible instance of this higher order object in terms of a summary statistic of the locations of the individual targets. In fact, Alvarez and Oliva (2008) showed that the location of the centroid is represented above chance level during tracking.
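As a concrete illustration of the centroid as a summary statistic, the following sketch computes it as the mean of the target coordinates. This is one common operationalization and an assumption here: the center of area of the polygon formed by the targets is a different quantity for irregular configurations, and the cited studies should be consulted for the exact definition they used. The function name is hypothetical.

```python
def centroid(points):
    """Mean x and mean y of a list of (x, y) target locations."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))
```

For four targets at the corners of a square, such as (0, 0), (2, 0), (2, 2), and (0, 2), this returns the center of the square, (1.0, 1.0). Applied frame by frame to the target positions, it yields the path of the (invisible) centroid across a trial.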

This observation matches findings from studies that monitored eye movements during MOT. These studies revealed that observers tend to fixate the (invisible) centroid rather than the individual objects during tracking (Fehd & Seiffert, 2008). Within observers, oculomotor processes are remarkably stable across repetitions of trials (Lukavský, 2013; see also Lukavský & Děchtěrenko, 2016); however, properties of the moving objects themselves have been demonstrated to alter fixation behavior. For instance, the tendency to fixate the centroid increases with increasing object speeds (Huff, Papenmeier, Jahn, & Hesse, 2010) but decreases with increasing tracking load (Zelinsky & Neider, 2008) or reduced target-distractor spacing (rescue saccades toward the individual targets that are in danger of getting lost; Zelinsky & Todor, 2010; see also Colas, Flacher, Tanner, Bessiere, & Girard, 2009). This overall pattern of results matches the idea of an automatic (and thus stimulus driven) formation of perceptual groups during tracking. In line with the observation of Yantis (1992) that perceptual grouping fosters MOT performance, centroid looking behavior is predictive of successful tracking (Fehd & Seiffert, 2010).

Multifocal attention theory

Both the FINST theory and the grouping model involve a single focus of visual attention. As an alternative, Cavanagh and Alvarez (2005) proposed the multifocal attention theory of MOT. This model suggests that multiple foci of attention follow the objects being tracked across the tracking trial (see Fig. 4c). In this sense, the multiple foci of attention serve a similar function as the visual indices within the FINST theory. The critical difference from the FINST model, however, is that the multiple foci of attention allow for continuous attentional access to all objects being tracked. Indeed, there is convincing behavioral (Awh & Pashler, 2000; Castiello & Umiltà, 1992; Kramer & Hahn, 1995) as well as neuroscientific evidence (McMains & Somers, 2004; Müller, Malinowski, Gruber, & Hillyard, 2003) that the attentional focus can be split, thus enhancing the processing of stimuli at independent locations.

There are at least two empirical findings that support the multifocal theory of MOT. The first line of evidence stems from hemifield independence during tracking. For instance, Alvarez and Cavanagh (2005) observed that participants were able to track twice as many objects when they were equally distributed across the left and right visual hemifields (see also Battelli et al., 2001; Chen, Howe, & Holcombe, 2013; Hudson, Howe, & Little, 2012; Störmer, Alvarez, & Cavanagh, 2014). In other words, they observed that the capacity limitation of MOT is two objects per hemifield rather than four objects for the entire visual field. Indeed, the observation of hemifield independence rules out a simple switching model with a single focus of attention (see also Delvenne, 2005). The second finding supporting the multifocal theory is that tracking performance is more accurate when all objects move simultaneously than when they move sequentially (Howe, Cohen, Pinto, & Horowitz, 2010). Because a single focus of attention that switches between the target objects would predict the opposite pattern of results, this finding also supports the assumption of multiple foci of visual attention.

Despite the evidence in favor of parallel tracking as suggested by multiple attentional spotlights, there also is evidence highlighting serial components during tracking. For instance, Howard, Masom, and Holcombe (2011; see also Howard & Holcombe, 2008) observed a lag between actual and remembered object locations that increased with tracking load. This observation is in line with a serial updating process that returns to an individual object less often at higher tracking load. Furthermore, the idea of serial updating also fits in with other studies that demonstrate decreasing temporal resolution of visual attention with increasing tracking load (d’Avossa, Shulman, Snyder, & Corbetta, 2006; Holcombe & Chen, 2013). In fact, Holcombe and Chen (2013) showed that the temporal resolution of tracking dropped from 7 Hz when tracking a single target to 4 Hz when tracking two objects to 2.6 Hz when tracking three objects. Because this decline closely matches the computational predictions of a single focus of attention switching between targets, this finding supports serial models of object tracking rather than the multifocal attention account.
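The logic of this comparison can be made explicit with a back-of-the-envelope calculation. The sketch below assumes the simplest possible serial account, in which a single focus of attention is time-shared equally among the targets without any switching cost, so that the temporal limit for n targets is the single-target limit divided by n. This is a deliberate simplification for illustration and not the computational model reported by Holcombe and Chen (2013); the function name is hypothetical.

```python
def serial_limit_hz(single_target_hz, n_targets):
    """Temporal limit under equal time-sharing of one attentional focus."""
    return single_target_hz / n_targets


# Temporal limits as reported in the text (Holcombe & Chen, 2013).
reported_hz = {1: 7.0, 2: 4.0, 3: 2.6}

for n_targets, observed in reported_hz.items():
    predicted = serial_limit_hz(reported_hz[1], n_targets)
    print(f"{n_targets} target(s): observed {observed} Hz, "
          f"time-sharing prediction {predicted:.2f} Hz")
```

Under this assumption, the single-target limit of 7 Hz predicts 3.5 Hz for two targets and roughly 2.3 Hz for three targets, which is of the same order as the reported values of 4 Hz and 2.6 Hz.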

It is, however, currently unclear whether the load-dependent decrease in temporal resolution spreads across the two hemifields or whether each hemifield has a distinct temporal resolution at its disposal. A study addressing this question would be useful to distinguish between accounts that suggest parallel tracking between, but serial tracking within, the hemifields and accounts that predict a serial limitation across the entire visual field (see also Chen et al., 2013). From the current work, it seems plausible that tracking occurs in parallel between the hemifields but serially within each hemifield.

FLEX model

The previously introduced theories of MOT describe limitations in the ability to track objects as a consequence of architectural constraints such as the number of visual indices or attentional foci. Critically, such fixed architecture models predict a fixed precision of tracking as well as accuracy as long as the number of targets does not exceed the architectural constraints. In a study by Alvarez and Franconeri (2007), however, observers were able to track up to eight objects when they were moving at sufficiently slow speed. Because the detrimental effects of object speed also were more pronounced at reduced spacing between the objects, the overall pattern of results indicates that it is the speed of objects or their spatial interference that limits tracking rather than tracking load in terms of architectural constraints. Based upon this observation, Alvarez and Franconeri introduced the FLEX model of MOT, which proposes a flexible allocation of an attentional resource between the objects being tracked. According to this model, tracking errors arise when the attentional resource is insufficient to cover the demands of all targets. Critically, the actual demand of a target being tracked varies with stimulus properties such as spatial proximity or object speed. Because spatial proximity between the objects varies across a tracking trial, the demand-based allocation of visual attention also needs to change continuously across the tracking interval (see Fig. 4d).
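The idea of a demand-based allocation can be made concrete with a toy sketch. Everything in it is an assumption made for illustration: the FLEX model as described here does not specify a demand function, so the choice of demand as object speed divided by the distance to the nearest other object, the proportional sharing rule, and the function names are all hypothetical.

```python
import math


def nearest_distance(index, positions):
    """Distance from one object to its closest neighbor."""
    return min(
        math.dist(positions[index], other)
        for j, other in enumerate(positions)
        if j != index
    )


def allocate_resource(positions, speeds, targets, total_resource=1.0):
    """Share a fixed resource among targets in proportion to a toy demand.

    Toy demand (an assumption): speed divided by nearest-neighbor distance,
    so fast and crowded targets receive a larger share.
    """
    eps = 1e-9  # avoids division by zero for overlapping objects
    demand = {
        i: speeds[i] / (nearest_distance(i, positions) + eps) for i in targets
    }
    total_demand = sum(demand.values())
    if total_demand == 0:
        # No target is moving: split the resource evenly.
        return {i: total_resource / len(demand) for i in demand}
    return {i: total_resource * d / total_demand for i, d in demand.items()}
```

Calling allocate_resource once per frame reflects the point that the allocation has to change continuously as interobject spacing changes. In this toy version, a target with a close distractor receives a larger share than an isolated target moving at the same speed.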

Supporting evidence for the demand-based allocation of visual attention comes from a study by Iordanescu, Grabowecky, and Suzuki (2009), who asked their participants to localize target discs that had disappeared after an interval of object tracking. When Iordanescu et al. analyzed the localization errors as a function of the distance between the corresponding targets and their closest distractor, they observed that targets with close distractors were localized more precisely than those without close distractors. This pattern of results indicates that more attentional resources are indeed devoted to targets with close distractors. Horowitz and Cohen (2010) investigated whether the process underlying MOT is limited by fixed architectural constraints or a flexible resource by applying the mixture model approach previously used within the slot versus resource debate in visual working memory (Zhang & Luck, 2008). By doing so, they observed that the precision of participants in reporting the motion direction of multiple moving targets decreased with an increasing tracking load. In fact, this decline matched the predictions of a model assuming an attentional resource being shared among all items being tracked. Additionally, Holcombe and Chen (2012) demonstrated that a single target (moving on a circular path) is capable of consuming the full tracking resource. If this target moves fast enough, adding a second target results in a level of tracking performance that matches what would be expected if the observer were able to track only one of the two targets. This observation also cannot be reconciled with fixed architecture models.
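For readers unfamiliar with the mixture model approach, the following sketch shows its general form as introduced by Zhang and Luck (2008): the distribution of report errors is modeled as a mixture of a von Mises distribution centered on the true value (responses based on a representation of the item) and a uniform distribution (random guesses). The exact parameterization and fitting procedure used by Horowitz and Cohen (2010) are not described here, so the function names and parameter values below are assumptions for illustration only.

```python
import math

def bessel_i0(x, terms=30):
    """Modified Bessel function of the first kind, order 0, via its power series."""
    return sum((x / 2) ** (2 * k) / math.factorial(k) ** 2 for k in range(terms))

def von_mises_pdf(error, kappa):
    """Density of a von Mises distribution centered on zero error (radians)."""
    return math.exp(kappa * math.cos(error)) / (2 * math.pi * bessel_i0(kappa))

def mixture_pdf(error, guess_rate, kappa):
    """Mixture of target-based responses (von Mises) and random guesses (uniform)."""
    return (1 - guess_rate) * von_mises_pdf(error, kappa) + guess_rate / (2 * math.pi)

# Hypothetical parameter values: a higher kappa means more precise direction reports.
for kappa in (8.0, 4.0, 2.0):
    print(kappa, round(mixture_pdf(0.0, guess_rate=0.1, kappa=kappa), 3))
```

Broadly, a shared-resource account predicts that the concentration parameter (kappa) declines as tracking load increases, whereas a fixed-capacity account predicts stable precision with more guessing once the capacity limit is exceeded.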

An advantage (and a drawback at the same time) of the FLEX model is that it can explain a broad variety of findings across a broad spectrum of the tracking literature such as a flexible switching between tasks (Alvarez et al., 2005) or flexible switching between location and identity tracking (e.g., Cohen, Pinto, Howe, & Horowitz, 2011). However, a very strong argument against the FLEX model is that it does not fulfill the criteria of a good scientific theory. There is hardly any pattern of results that cannot be resolved within the FLEX framework. The major reason for this deficit is that the FLEX model is rather vaguely specified. Similar to the multifocal attention approach, the existence of hemifield independence suggests that there are two distinct resources rather than one. This is most evident in a set of experiments of Chen et al. (2013). Although these authors observed evidence in favor of a flexible allocation of an attentional resource in general, this was only true when the corresponding objects were within the same visual hemifield.

Spatial interference theory

As a testable alternative to the FLEX model, Franconeri et al. (2010) proposed the spatial interference theory of MOT. According to this model, objects being tracked receive attentional enhancement that is accompanied by an inhibitory surround (see Hopf et al., 2006; Müller, Mollenhauer, Rösler, & Kleinschmidt, 2005). Tracking errors arise when distractor objects break through the inhibitory surrounds of targets or when the inhibitory surround of one target interferes with the attentional enhancement of another target (see Fig. 4e). Because these events occur only at reduced interobject spacing, spatial interference from close objects is supposed to be the only limiting factor of multiple object tracking performance in this approach (see also Intriligator & Cavanagh, 2001). Indeed, the experiments of Franconeri et al. (2010; see also Franconeri et al., 2008) have shown that it is the distance that objects travel in close proximity to other objects rather than trial duration that constrains tracking performance. Importantly, most of the parameters that have previously been identified to alter tracking performance, such as the number of targets (e.g., Pylyshyn & Storm, 1988), the number of distractors (e.g., Bettencourt & Somers, 2009), or object speed at a constant tracking interval (e.g., Alvarez & Franconeri, 2007), can be traced back to modulations of the spatial proximity between the moving objects.
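The quantity that this account treats as critical, the distance that targets travel while other objects are in close proximity, can be computed from recorded object trajectories. The sketch below is only an illustration: the radius of the inhibitory surround is a free parameter that is not specified here, and the function name and data layout are hypothetical.

```python
import math

def crowded_travel(frames, target_ids, radius):
    """Sum the distance each target travels while any other object is within `radius`.

    frames: list of frames; each frame is a list of (x, y) positions, one per object.
    target_ids: indices of the objects that are tracked.
    """
    total = 0.0
    for prev, curr in zip(frames, frames[1:]):
        for t in target_ids:
            crowded = any(
                math.dist(curr[t], pos) < radius
                for j, pos in enumerate(curr)
                if j != t
            )
            if crowded:
                total += math.dist(prev[t], curr[t])
    return total

# Object 0 (a target) passes a stationary object 1 in unit steps.
frames = [[(float(x), 0.0), (3.0, 1.0)] for x in range(5)]
print(crowded_travel(frames, target_ids=[0], radius=1.5))  # 3.0
```

Under this view, two trials of different duration should be equally difficult as long as this crowded travel distance is matched.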

Despite its parsimony, the spatial interference account captures a large portion of the variance in tracking performance, and the outstanding influence of spatial interference on tracking is corroborated by several studies. For instance, tracking performance increased markedly when Bae and Flombaum (2012) briefly colored distractors during events of spatial interference (see Fig. 5). Further, Shim, Alvarez, and Jiang (2008) demonstrated that not only proximity between targets and distractors but also proximity among targets impairs tracking performance, a finding that is hard to reconcile with the other theories of MOT. The spatial interference theory also receives support from recent computational models that have conceptualized tracking errors as a result of probabilistic processes and their summation across the duration of a trial (Zhong, Ma, Wilson, Liu, & Flombaum, 2014; see also Vul, Frank, Alvarez, & Tenenbaum, 2009).

Fig. 5

Demonstration of the detrimental effects of spatial proximity on multiple object tracking performance (by Bae & Flombaum, 2012). a In the experimental trials, distractor objects that are close to targets change in color. b Coloring distractor objects that are close to targets enhances tracking performance as these distractors are typically involved in target-distractor confusions. Figures reprinted with permission from Springer

The clear predictions of the spatial interference account have inspired much research on the relationship between object speed and spatial proximity during tracking. Whereas the spatial interference account states that object speed per se has no influence on tracking performance beyond the modulation of spatial interference, several recent studies have reported data that are incompatible with this view. For instance, in Tombu and Seiffert (2011), participants tracked objects that were rotating around each other on an orbital path. Although the orbital rotation left spatial interference unaffected, tracking performance declined with orbital speed (see also Feria, 2013, for similar results). Further, Holcombe, Chen, and Howe (2014) observed that even objects that are well beyond the range of inhibitory surrounds are capable of interfering with the tracking task. Finally, in one of our own studies, we asked participants to track objects that dynamically changed their speed of motion following a sine wave pattern (Meyerhoff et al., 2016). In the control condition, the objects moved at a constant speed that matched the average speed of the dynamically changing condition. Thus, traveled distance as well as spatial interference was controlled for. The only difference between the conditions was that the events of spatial interference were less predictable and their duration was more variable in the condition with dynamic speed changes than in the condition with constant object speed (see Fig. 6). Tracking performance was worse with dynamically changing object speed, indicating that speed is capable of impairing MOT. One possible way to accommodate these immediate effects of speed within the spatial interference theory is to assume that the inhibitory surround of targets needs to unfold over time. As a consequence, fast moving distractors might interfere with the objects being tracked before the inhibition is fully established. Given the liveliness of the debate about the effects of speed and spatial interference, we expect to see further theoretical progress with regard to the spatial interference theory within the near future.
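The logic of matching a sine-modulated speed profile to a constant-speed control can be checked numerically: when the modulation covers a whole number of periods, the average speed and the traveled distance are identical in both conditions. The sketch below uses assumed values for mean speed, amplitude, and period; these are not the parameters of the original experiments.

```python
import math

def sine_speed(t, mean_speed, amplitude, period):
    """Object speed at time t when speed is modulated by a sine wave."""
    return mean_speed + amplitude * math.sin(2 * math.pi * t / period)

def traveled_distance(speed_fn, duration, steps=10000):
    """Integrate speed over time with the midpoint rule."""
    dt = duration / steps
    return sum(speed_fn((i + 0.5) * dt) for i in range(steps)) * dt

# Hypothetical values (arbitrary units); the trial spans three full periods.
mean_speed, amplitude, period = 6.0, 3.0, 2.0
duration = 3 * period

variable = traveled_distance(lambda t: sine_speed(t, mean_speed, amplitude, period), duration)
constant = traveled_distance(lambda t: mean_speed, duration)
print(round(variable, 6), round(constant, 6))
```

Both conditions therefore differ only in how speed is distributed over time, not in how far the objects travel.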

Fig. 6

Demonstration of the influence of object speed beyond effects of spatial proximity (by Meyerhoff, Papenmeier, Jahn, & Huff, 2016). a The participants tracked four out of eight objects that moved at a constant or variable object speed. Importantly, the traveled distance and average speed as well as spatial proximity were identical between the conditions. b Variable object speed impaired tracking performance, indicating that object speed affects tracking beyond the modulation of effects of spatial interference. Figures reproduced with permission from the American Psychological Association

Research topics in MOT

Besides the research that explicitly aimed at evaluating the theories of MOT, other lines of experiments have explored topics with a broader scope. For the purpose of this tutorial review, we have identified four such fields of research that used the MOT paradigm in order to study broader properties of visual attention. These fields encompass questions about the basic unit of dynamic visual attention, the reference frame of dynamic attention, whether attentional processes use motion information to anticipate prospective object locations, and how distinct identity information affects attentive tracking. In addition, we briefly summarize the growing literature on tracking in special groups such as children, experts, and clinical samples.

MOT reveals the object-based nature of dynamic visual attention

The allocation of attentional resources is either space based (Posner, 1980; Posner, Snyder, & Davidson, 1980) or object based (Egly, Driver, & Rafal, 1994; Scholl, 2001). This section reviews studies on MOT that provide evidence that tracking, and thus the allocation of dynamic visual attention toward moving objects, is mostly object based.

Scholl, Pylyshyn, and Feldman (2001; see Fig. 7) demonstrated the object-based nature of MOT with a target-distractor merging technique. In their study, targets and distractors were either distinct boxes or each target was visually merged to a distractor. To be more concrete, targets and distractors either were the endpoints of a common object such as a line (object-based merging) or they were connected so that they appeared as dumbbells (connection-based merging). As predicted by an object-based account of attention, object-based merging severely impaired tracking performance. Connection-based merging impairments were smaller and occurred only if the connection lines touched the targets and distractors. This shows that participants had difficulties in directing their attention to just one endpoint of connected objects. Merging effects occurred also when the size of the merged objects remained constant across the tracking trials (Howe, Incledon, & Little, 2012) instead of expanding and shrinking across a trial which in itself impairs tracking performance (van Marle & Scholl, 2003). Remarkably, a physical connection between target and distractor is not necessary to induce detrimental effects on tracking because illusory contours have been demonstrated to impair tracking performance similarly (Keane, Mettler, Tsoi, & Kellman, 2011). While the effect of connection-based merging was less pronounced in children with autism (Evers et al., 2014), indicating a relatively higher amount of local processing in this population, object-based merging affected this population equally (Van der Hallen et al., 2015).

Fig. 7

Demonstration of the object-based nature of multiple object tracking (by Scholl, Pylyshyn, & Feldman, 2001). The participants were able to track the objects accurately only when they appeared as distinct objects. Tracking the endpoint(s) of larger objects reduced tracking capacity to one object. Figure reprinted with permission from Elsevier

Beyond merging, further studies demonstrated the object-based nature of MOT. In line with an account that objects are defined by topological invariants such as the number of holes in the object shape, participants' tracking performance was impaired when objects frequently changed their shape between two instances varying in their number of holes, such as a filled disc versus a disc with a hole (Zhou, Luo, Zhou, Zhuo, & Chen, 2010). However, nontopological changes such as shape changes between filled discs and S-shapes or constant filled discs with changing color did not impair tracking performance. Note that it is not the holes per se that impair tracking because participants can track shapes with holes (Zhou et al., 2010) and holes (Horowitz & Kuzmova, 2011) as efficiently as solid discs. Instead, the change in topological invariants likely caused the formation of a new object, thus disturbing tracking. While abrupt transitions in object shape between small squares and long rectangles did not affect tracking (Zhou et al., 2010), continuous transitions did (Howe, Holcombe, Lapierre, & Cropper, 2013; van Marle & Scholl, 2003). However, in this case it was not the formation of new objects that impaired tracking but the reduced ability to locate objects that expand and contract, particularly along the axis of elongation or contraction (Howe et al., 2013).

By asking participants to track lines instead of discs or squares, a number of studies investigated the allocation of attention across tracked objects (Alvarez & Scholl, 2005; Doran, Hoffman, & Scholl, 2009; Feria, 2008). These studies explored detection performance for probes appearing either in the center or near the endpoints of the tracked lines. Because probe detection was more efficient in the center than at the endpoints, these studies suggest an attentional bias toward the center of the objects being tracked (Alvarez & Scholl, 2005; Doran et al., 2009; Feria, 2008). This center benefit was not the result of overt attention shifts being biased toward object centers (Vishwanath & Kowler, 2003) because it was also observed when eye movements were controlled for (Doran et al., 2009). Although the center bias is strong and increases with line length, it is not completely automatic because it is sensitive to the distribution of probe probabilities between the center and endpoints (Feria, 2008, 2010).

The reference frame of visual attention

What is the reference frame of (dynamic) visual attention? Does attentional tracking operate within a retinotopic coordinate system (i.e., relative to the corresponding locations on the retina) or within an allocentric coordinate system (i.e., scene-based coordinates)? These questions are of interest because the visual cortex is mostly organized in retinotopic maps (DeYoe et al., 1996; Engel, Glover, & Wandell, 1997; Lennie, 1998; Van Essen et al., 2001); however, mental representations of scenes appear to be in allocentric coordinates rather than retinotopic coordinates (e.g., Li & Warren, 2000; Wang & Simons, 1999).

In a first attempt to disentangle these two alternatives for MOT, G. Liu et al. (2005) asked participants to track a set of moving targets within a virtual three-dimensional box that itself underwent translations, rotations, and zooms. When G. Liu et al. manipulated object speed independently of scene speed, they observed that increasing object speed but not increasing scene speed impaired tracking. This finding suggests that tracking is carried out in allocentric coordinates.

Although the results of G. Liu et al. (2005) clearly favored an allocentric reference system, subsequent studies with different methodologies have revealed more mixed results. In two studies, Howe and his colleagues (Howe, Drew, Pinto, & Horowitz, 2011; Howe, Pinto, & Horowitz, 2010) explored the stability of tracking performance across saccades and smooth pursuits. In these studies, disrupting the allocentric frame of reference impaired tracking performance during saccades as well as smooth pursuits. This matches the results found by G. Liu et al. (2005) as well as related research providing evidence that the programming of saccades (Deubel, Bridgeman, & Schneider, 1998) as well as the execution of smooth pursuits (Raymond, Shapiro, & Rose, 1984) also is based on an allocentric representation of the scene. However, in contrast to the results found by G. Liu et al. (2005), tracking performance was also sensitive to manipulations of the retinotopic coherence of the scene during smooth pursuits, indicating that tracking partially is based on retinotopic coordinates (Howe et al., 2010). The observation that allocentric as well as retinotopic coordinate systems contribute to tracking is consistent with related research from our lab showing that continuous scene information is necessary to maintain objects in an allocentric frame of reference (Huff, Meyerhoff, Papenmeier, & Jahn, 2010; see also Jahn, Wendt, Lotze, Papenmeier, & Huff, 2012, for neuroimaging evidence) whereas abrupt changes in the frame of reference seem to draw on retinotopic processes (Huff, Jahn, & Schwan, 2009; Jahn, Papenmeier, Meyerhoff, & Huff, 2012).

In a further study, we showed that observers automatically track objects within an allocentric frame of reference as long as continuous motion cues were available (Meyerhoff, Huff, Papenmeier, Jahn, & Schwan, 2011). In this study, our participants tracked objects while either the floor plane, the set of objects, neither, or both underwent continuous rotations of 30 degrees during brief intervals of object invisibility. Thus, we disentangled the spatial orientation of the objects from the spatial orientation of the floor. We observed that participants were unable to suppress continuous visual information about the rotations of the floor plane even under experimental conditions under which this updating process actually disturbed tracking performance (see Fig. 8). Despite the automatic usage of continuous motion information from the frame of reference, the perceived space also affects tracking performance. For instance, tracking accuracy is impaired when the tracking space is turned upside down (Papenmeier, Meyerhoff, Brockhoff, Jahn, & Huff, in press). Further, tracking performance declines when participants have to track objects relative to external points of reference. Such an effect was demonstrated in a study by Thomas and Seiffert (2010). In the critical conditions of this study, the participants’ viewpoint on a tracking display (projected via a HMD) changed as a consequence of self-motion. When contrasted to conditions with the same viewpoint change without proprioceptive motion cues, self-motion impaired tracking performance because participants had to track their own location in space in addition to the target objects (see also Thomas & Seiffert, 2011). In fact, it is the self-motion during tracking that draws upon the tracking resource rather than the execution of motor actions themselves (Thornton, Bülthoff, Horowitz, Rynning, & Lee, 2014; Thornton & Horowitz, 2015; Thomas & Seiffert, 2010; but see also Trick, Guindon, & Vallis, 2006).

Fig. 8

Experiment conducted by Meyerhoff, Huff, Papenmeier, Jahn, and Schwan (2011). a Participants track objects that briefly become invisible during a rotation of the floor plane. After the rotation, the objects reappear at their previous screen location (floor plane only) or at the location where they would have been if they rotated with the floor plane (full). b When the floor plane was invisible during the rotation, tracking was more accurate when the objects reappeared at the same screen locations than when they rotated with the floor plane. When the floor plane was visible during the rotation, the results were inverted, namely, tracking was more accurate when the objects rotated with the floor plane than when they reappeared at their previous screen location. This pattern of results shows that participants use the continuous visual information of the floor plane in order to maintain the tracked objects in scene-based coordinates. Figures reprinted with permission from Elsevier

With regard to the broader picture of visual attention, the findings from the MOT studies indicate that most attentional processing occurs within an allocentric frame of reference. In other words, the objects are addressed in terms of their position relative to other objects or frames of reference (including the observer's own location) rather than in the corresponding retinal coordinates.

Extrapolation

How predictive are visual processes? Does dynamic visual attention anticipate prospective object locations based on their actual trajectories? These questions have been studied extensively with variants of the MOT paradigm. Most of this research has operationalized these questions by testing whether MOT is restricted to pure location tracking (i.e., not predictive) or whether spatiotemporal object information such as motion direction or heading is used to extrapolate object locations and thus predict prospective demands.

The first studies that explored such extrapolatory processes during MOT asked participants to track objects that were rendered invisible for several hundred milliseconds during the trial. Because performance was best when the objects remained stationary during their invisibility, Keane and Pylyshyn (2006) concluded that prospective object locations are not predicted based on previous direction information (see Fig. 9 for a replication of the results). These results were confirmed in a similar study conducted by Fencsik, Klieger, and Horowitz (2007), who showed that direction information might aid tracking but only under specific preview conditions and for a maximum of two targets. With respect to tracking continuously visible objects, however, these results should be interpreted with caution because subsequent work has demonstrated that tracking invisible objects increases the attentional demands of tracking (Flombaum, Scholl, & Pylyshyn, 2008; see also Ilg, 2008). Also, the mental representation of multiple moving objects has been demonstrated to lag behind the actual object locations. This lag increases with tracking load and object speed (Howard et al., 2011). As a consequence of these restrictions in the interpretation of experiments that require the recovery of invisible objects, later studies have explored variants of the MOT paradigm that involved continuously visible objects.

Fig. 9

Illustration of Experiments 1 and 2 conducted by Fencsik, Klieger, and Horowitz (2007), which replicate and extend the finding of Keane and Pylyshyn (2006). a During the tracking trials, the objects briefly disappear. The location of their reappearance is behind, at, or in front of their last location. b Tracking accuracy as a function of tracking load and the location of object reappearance. Participants seem to use only the last object location in order to relocate the tracked targets. Figures reprinted with permission from Springer

A first set of studies that investigated whether observers extrapolate motion paths during tracking continuously visible objects manipulated the predictability of the motion paths of the objects. In general, MOT performance was more accurate with predictable than unpredictable motion paths (Howe & Holcombe, 2012), even when eye movements were controlled for (Luu & Howe, 2015). This effect depended on tracking load and object speed. Extrapolation occurred less in conditions with higher tracking loads (i.e., more than two target objects) and lower object speed. In line with the idea that motion information is evaluated during tracking, even a single direction change of a target is sufficient to impair tracking performance (Meyerhoff, Papenmeier, Jahn, & Huff, 2013).

A second set of studies approached the question of location extrapolation by studying the effect of conflicting motion information on tracking performance. In these studies, the texture of the moving objects signaled motion either in the same or in the opposing direction as the actual direction of the object itself. In the first study of this kind, St. Clair, Huff, and Seiffert (2010) showed that conflicting motion information impaired tracking relative to a neutral baseline. Because there was no benefit in tracking performance when the texture moved along with the object, St. Clair et al. as well as Huff and Papenmeier (2013) have attributed this effect to shifts in the perception of the object location rather than the extrapolation of object locations. This detrimental effect of conflicting motion signals, however, does stem from object-based processing during MOT because tracking impairments arise selectively on objects that actually exhibit conflicting motion signals (Meyerhoff, Papenmeier, & Huff, 2013).

In sum, it is still controversial whether motion information is used during tracking in order to extrapolate object locations. It seems fair to conclude that if extrapolation occurs during tracking, it might occur only to a limited extent. This conclusion also would be in agreement with a recent computational model showing that tracking benefits due to extrapolation would be marginal at best (Zhong et al., 2014). For the broader question regarding predictive attentional processing, this pattern of results of course does not rule out predictive processing in general; however, in the spatiotemporal domain predictive processes do not seem to arise consistently, at least not under the demands of an MOT task.

Tracking objects with identities

Is the attentional processing of the spatiotemporal information of objects affected by unique object identities? Although many MOT researchers motivate their research with real-world examples such as tracking cars, tracking children on a playground, or tracking players in sports, most research reviewed above was dedicated to the tracking of indistinguishable objects. In real-world scenarios, however, observers track objects with distinct identities. For the purpose of this review, we focus on studies investigating the influence of identities on location tracking performance. This review, however, does not include studies concerned with task-relevant identity information that require maintaining location-identity bindings during tracking to give identity-related responses at the end of the trial, sometimes called multiple identity tracking (MIT; Botterill, Allen, & McGeorge, 2011; Cohen et al., 2011; Horowitz et al., 2007; Howard & Holcombe, 2008; Li, Oksama, & Hyönä, 2016; Oksama & Hyönä, 2004, 2008, 2016; Pinto, Howe, Cohen, & Horowitz, 2010; Pylyshyn, 2004; Ren, Chen, Liu, & Fu, 2009).

When observers track unique objects (e.g., colored discs or cartoon animals) instead of indistinguishable objects, location tracking performance increases (Horowitz et al., 2007; Makovski & Jiang, 2009a, 2009b). One reason for this benefit is that adding identities to objects helps in separating targets from distractors (Bae & Flombaum, 2012; Feria, 2012). Therefore, adding identity information to targets does not improve location tracking performance per se but only when targets and distractors do not share the same features (Horowitz et al., 2007; Howe & Holcombe, 2012; Jardine & Seiffert, 2011; Makovski & Jiang, 2009a, 2009b). The beneficial effect of identity information on location tracking has been attributed to effortful deliberate processing as well as automatic processes. For instance, Makovski and Jiang (2009b) provided evidence for the suggestion that effects of identity information stem from deliberate processing. In their study, they showed that observers effortfully encoded target identities into memory and recovered lost targets based on this memory representation. In their experiments, Makovski and Jiang were able to prevent such a memory-based target recovery by a concurrent color memory task that frequently changed the color of the objects in the display. However, there is also evidence supporting the view that automatic processing of identity information enhances location tracking performance. For instance, when participants were asked to track a set of faces, they were better at tracking attractive faces than unattractive faces, even though face identities were task irrelevant (C. H. Liu & Chen, 2012). Furthermore, identity information influenced participants’ tracking performance even when relying on object identities was always harmful to MOT performance (Erlikhman et al., 2013; Papenmeier, Meyerhoff, Jahn, & Huff, 2014).

The occurrence of identity effects on location tracking is also determined by the reliability of spatiotemporal information; that is, effects of identity information arise when spatiotemporal information becomes less reliable. For example, adding identity information to objects influenced tracking performance particularly at reduced interobject spacing (Bae & Flombaum, 2012; Makovski & Jiang, 2009b) or with an increasing number of distractors in the display (Drew et al., 2013), although identity effects can also occur across large distances with low object speeds (Störmer, Li, Heekeren, & Lindenberger, 2011). In one of our studies (Papenmeier et al., 2014), we manipulated spatiotemporal reliability. Whereas swapping object colors between targets and distractors left tracking performance unaffected when spatiotemporal information was continuous, we observed identity effects when we introduced spatiotemporal discontinuities such as abrupt scene rotations, abrupt zooms, or reduced presentation frame rates. These findings indicate that tracking relies not only on location information but also on identity information of objects. The exact mechanisms by which spatiotemporal and identity information work together still need to be resolved. Identity information could either be encoded during tracking and used to establish object correspondence (Papenmeier et al., 2014; see also Jardine & Seiffert, 2011) or identity information might just be another source of input that is directly utilized by the tracking mechanism (Oksama & Hyönä, 2016), such as the visual indices (FINSTs) that stick to feature clusters on the retina without recognizing their identities (Pylyshyn, 1989).

Development and expertise

How do development and experience shape attentional processing? The ability to track multiple objects in parallel arises relatively early in normally developing children. Even children as young as 6 months are able to keep track of objects moving synchronously on circular trajectories (Richardson & Kirkham, 2004). At the age of 6.5 years, children are able to track up to four moving objects (O’Hearn, Hoffman, & Landau, 2010; O’Hearn, Landau, & Hoffman, 2005; see also Brockhoff et al., 2016). Remarkably, even children suffering from autism spectrum disorders show a qualitatively similar pattern of development of object tracking as healthy children, although they reach the same quantitative level of tracking at a later biological age (Koldewyn, Weigelt, Kanwisher, & Jiang, 2013; see also Griffith, Pennington, Wehner, & Rogers, 1999; Poirier, Martin, Gaigg, & Bowler, 2011). To date, science is only beginning to understand how visual processing is affected by disorders. As an example, the development of MOT in children suffering from Williams syndrome (a genetic disorder involving impairments of visuospatial processing; O’Hearn et al., 2010) seems to differ qualitatively from that of their peers.

At older ages (e.g., above 60 years), MOT performance declines relative to younger adults (20–35 years; Sekuler, McLaughlin, & Yotsumoto, 2008; Störmer et al., 2011; Trick, Perl, & Sethi, 2005). However, even at older ages, performance remained well above one target (Trick et al., 2005). Some studies also have suggested that certain types of expertise come along with an increased ability to track multiple objects. For instance, radar operators (Allen, McGeorge, Pearson, & Milne, 2004) as well as regular video game players (Dye & Bavelier, 2010; Green & Bavelier, 2006; Sekuler et al., 2008) outperform their corresponding control groups. However, such effects of expertise seem to be restricted to closely matching tasks that involve object tracking. Other plausible activities such as team sports were uncorrelated with the capability to track multiple objects even after more than 10 years of extensive practice (Memmert, Simons, & Grimme, 2009).

Future directions

Whereas MOT has been studied mostly in laboratory settings, the conclusions of these studies are often extended to everyday activities, such as driving cars or supervising children on a playground. Therefore, a central challenge for future research on MOT is to determine and strengthen the ecological validity of MOT by considering relevant factors, such as environmental complexity and its constraints. This could be accomplished by embedding the tracking task into more naturalistic scenarios (e.g., Ericson, Parr, Beck, & Wolshon, 2017; Lochner & Trick, 2014) or by investigating the impact of the ability to track indistinguishable objects on performance in naturalistic tasks such as team sports (Memmert et al., 2009). Research on the ecological validity of MOT has just begun, and much more work is needed to provide a complete and compelling demonstration of the relevance of MOT for everyday activities.

On a theoretical level, much progress has been made over recent years. Nevertheless, several open questions cannot yet be answered sufficiently on the basis of existing work. Related to the question of ecological validity, it is still unclear to what extent MOT reflects a singular process or whether it consists of several subroutines (including attentional selection and working memory processes) that interact with each other based on current task demands (e.g., Drew et al., 2012; Oksama & Hyönä, 2016). Further, more work is necessary to distinguish between models based on fixed architectural constraints and models based on the idea of a flexible attentional resource. Although the pendulum currently seems to swing toward the more flexible attentional resource models, to date no decisive evidence in favor of any one of these theories has been obtained. One potential way to deepen the understanding of the mechanism of tracking and to develop more sophisticated models of tracking would be a more systematic investigation of concrete instances of tracking errors. An interesting approach toward this goal comes from computational modeling, which can help to specify and quantify the cognitive operations determining both successful tracking and tracking errors during MOT.