The context for geographic research has shifted from a data-scarce to a data-rich environment, in which the most fundamental changes are not just the volume of data, but the variety and the velocity at which we can capture georeferenced data, trends often associated with the concept of Big Data. A data-driven geography may be emerging in response to the wealth of georeferenced data flowing from sensors and people in the environment. Although this may seem revolutionary, it may be better described as evolutionary. Some of the issues raised by data-driven geography are longstanding issues in geographic research, namely, large data volumes, dealing with populations and messy data, and tensions between idiographic and nomothetic knowledge. The belief that spatial context matters is a major theme in geographic thought and a major motivation behind approaches such as time geography, disaggregate spatial statistics and GIScience. There is potential to use Big Data to inform both geographic knowledge-discovery and spatial modeling. However, there are challenges, such as how to formalize geographic knowledge to clean data and to ignore spurious patterns, and how to build data-driven models that are both true and understandable.
Keywords: Big data; GIScience; Spatial statistics; Geographic knowledge discovery; Geographic thought; Time geography
A great deal of attention is being paid to the potential impact of data-driven methods on the sciences. The ease of collecting, storing, and processing digital data may be leading to what some are calling the fourth paradigm of science, following the millennia-old tradition of empirical science describing natural phenomena, the centuries-old tradition of theoretical science using models and generalization, and the decades-old tradition of computational science simulating complex systems. Instead of looking through telescopes and microscopes, researchers are increasingly interrogating the world through large-scale, complex instruments and systems that relay observations to large databases to be processed and stored as information and knowledge in computers (Hey et al. 2009).
This fundamental change in the nature of the data available to researchers is leading to what some call Big Data. Big Data refer to data that outstrip our capabilities to analyze. This has three dimensions, the so-called “three Vs”: (1) volume—the amount of data that can be collected and stored; (2) velocity—the speed at which data can be captured; and (3) variety—encompassing both structured (organized and stored in tables and relations) and unstructured (text, imagery) data (Dumbill 2012). Some of these data are generated from massive simulations of complex systems such as cities (e.g., TRANSIMs; see Cetin et al. 2002), but a large portion of the flood is from sensors and software that digitize and store a broad spectrum of social, economic, political, and environmental patterns and processes (Graham and Shelton 2013; Kitchin 2014). Sources of geographically (and often temporally) referenced data include location-aware technologies such as the Global Positioning System and mobile phones; in situ sensors carried by individuals in phones, attached to vehicles, and embedded in infrastructure; remote sensors carried by airborne and satellite platforms; radiofrequency identification (RFID) tags attached to objects; and georeferenced social media (Miller 2007, 2010; Sui and Goodchild 2011; Townsend 2013).
Yet despite the enthusiasm over Big Data and data-driven methods, the role they can play in scholarly research, and specifically in geographic research, may not be immediately apparent. Are theory and explanation archaic when we can measure and describe so much, so quickly? Does data velocity really matter in research, with its traditions of careful reflection? Can the obvious problems associated with variety (lack of quality control, lack of rigorous sampling design) be overcome? Can we make valid generalizations from ongoing, serendipitous (instead of carefully designed and instrumented) data collection? In short, can Big Data and data-driven methods lead to significant discoveries in geographic research? Or will the research community continue to rely on what for the purposes of this paper we will term Scarce Data: the products of public-sector statistical programs that have long provided the major input to research in quantitative human geography?
Our purpose in this paper is to explore the implications of these tensions—theory-driven versus data-driven research, prediction versus discovery, law-seeking versus description-seeking—for research in geography. We anticipate that geography will provide a distinct context for several reasons: the specific issues associated with location, the integration of the social and the environmental, and the existence within the discipline of traditions with very different approaches to research. Moreover, although data-driven geography may seem revolutionary, in fact it may be better described as evolutionary since its challenges have long been themes in the history of geographic thought and the development of geographical techniques.
The next section of this paper discusses the concepts of Big Data and data-driven geography, addressing the question of what is special about the new flood of georeferenced data. The "Data-driven geography: challenges" section of this paper discusses major challenges facing data-driven geography; these include dealing with populations (not samples), messy (not clean) data, and correlations (not causality). The "Theory in data-driven geography" section discusses the role of theory in data-driven geography. "Approaches to data-driven geography" identifies ways to incorporate Big Data into geographic research. The final section concludes this paper with a summary and some cautions on the broader impacts of data-driven geography on society.
Big data and data-driven geography
Humanity’s current ability to acquire, process, share, and analyze huge quantities of data is without precedent in human history. It has led to the coining of such terms as the “exaflood” and the metaphor of “drinking from a firehose” (Sui et al. 2013; Waldrop 1990). It has also led to the suggestion that we are entering a new, fourth phase of science that will be driven not so much by careful observation by individuals, or theory development, or computational simulation, as by this new abundance of digital data (Hey et al. 2009).
It is worth recognizing immediately, however, that the firehose metaphor has a comparatively long history in geography, and that the discipline is by no means new to an abundance of voluminous data. The Landsat program of satellite-based remote sensing began in the early 1970s by acquiring data at rates that were well in excess of the analytic capacities of the computational systems of the time; subsequent improvements in sensor resolution and the proliferation of military and civilian satellites have meant that four decades later data volumes continue to challenge even the most powerful computational systems.
Volume is clearly not the only characteristic that distinguishes today’s data supply from that of previous eras. Today, data are being collected from many sources, including social media, crowd sourcing, ground-based sensor networks, and surveillance cameras, and our ability to integrate such data and draw inferences has expanded along with the volume of the supply. The phrase Big Data implies a world in which predictions are made by mining data for patterns and correlations among these new sources, and some very compelling instances of surprisingly accurate predictions have surfaced in the past few years with respect to the results of the Eurovision song contest (O’Leary 2012), the stock market (Preis et al. 2013), and the flu (Butler 2008). The theme of Big Data is often associated not only with volume but with variety, reflecting these multiple sources, and velocity, given the speed with which such data can now be analyzed to make predictions in close-to-real time.
Ubiquitous, ongoing data flows are a big deal because they allow us to capture spatio-temporal dynamics directly (rather than inferring them from snapshots) and at multiple scales. The data are collected on an ongoing basis, meaning that both mundane and unplanned events can be captured. To borrow Nassim Taleb’s metaphor for probable and inconsequential versus improbable but consequential events (Taleb 2007): we do not need to sort the white swans from the black swans before collecting data: we can measure all swans and then figure out later which are white or black. White swans may also combine in surprising ways to form black-swan events.
Big Data is leading to new approaches to research methodology. Fotheringham (1998) defines geocomputation as quantitative spatial analysis where the computer plays a pivotal role. The use of the computer drives the form of the analysis rather than just being a convenient vehicle: analysts design geocomputational techniques with the computer in mind. Similarly, data play a pivotal role in data-driven methods. From this perspective data are not just a convenient way to calibrate, validate, and test but rather the driving force behind the analysis. Consequently, analysts design data-driven techniques with data in mind, and not just large volumes of data but a wider spectrum of data flowing at higher speeds from the world. In this sense we may indeed be entering a fourth scientific paradigm where scientific methods are configured to satisfy data rather than data configured to satisfy methods.
Data-driven geography: challenges
In Big Data: A Revolution That Will Transform How We Live, Work, and Think, Mayer-Schonberger and Cukier (2013) identify three main challenges of Big Data in science: (1) populations, not samples; (2) messy, not clean data; and (3) correlations, not causality. We discuss these three challenges for geographic research in the following subsections.
Populations, not samples
Back when analysis was largely performed by hand rather than by machines, dealing with large volumes of data was impractical. Instead, researchers developed methods for collecting representative samples and for generalizing to inferences about the population from which they were drawn. Random sampling was thus a strategy for dealing with information overload in an earlier era. In statistical programs such as the US Census of Population it was also a means for controlling costs.
Random sampling works well, but it is fragile: it works only as long as the sampling is representative. A sampling rate of one in six (the rate previously used by the US Bureau of the Census for its more elaborate Long Form) may be adequate for some purposes, but becomes increasingly problematic when analysis focuses on comparatively rare subcategories. Random sampling also requires a process for enumerating and selecting from the population (a sampling frame), which is problematic if enumeration is incomplete. Sample data also lack extensibility for secondary uses. Because randomness is so critical, one must carefully plan for sampling, and it may be difficult to re-analyze the data for purposes other than those for which they were collected (Mayer-Schonberger and Cukier 2013).
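The difficulty with rare subcategories can be illustrated with a small calculation. The sketch below applies the textbook standard-error formula for a proportion under simple random sampling without replacement; the tract size of 4,000 and the three subgroup shares are hypothetical values chosen for illustration, not census figures, and the function name is our own.

```python
import math

def relative_standard_error(population, share, rate):
    """Relative standard error of an estimated subgroup proportion under
    simple random sampling without replacement (textbook formula)."""
    n = population * rate                      # expected sample size
    fpc = (population - n) / (population - 1)  # finite-population correction
    se = math.sqrt(share * (1 - share) / n * fpc)
    return se / share

# Hypothetical tract of 4,000 people sampled at a rate of one in six.
for share in (0.30, 0.05, 0.005):
    rse = relative_standard_error(4000, share, 1 / 6)
    print(f"subgroup share {share:.3f}: relative standard error {rse:.1%}")
```

Under these assumptions the relative error grows steadily as the subgroup becomes rarer, which is why a rate that is adequate for common categories can become uninformative for rare ones.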
In contrast, many of the new data sources consist of populations, not samples: the ease of collecting, storing, and processing digital data means that instead of dealing with a small representation of the population we can work with the entire population and thus escape one of the constraints of the past. But one problem with populations is that they are often self-selected rather than sampled: for example, all people who signed up for Facebook, all people who carry smartphones, or all cars that happened to travel within the City of London between 8:00 a.m. and 11:00 a.m. on 2 September 2013. Geolocated tweets are an attractive source of information on current trends (e.g., Tsou et al. 2013), but only a small fraction of tweets are accurately geolocated using GPS. Since we do not know the demographic characteristics of any of these groups, it is impossible to generalize from them to any larger populations from which they might have been drawn.
Yet geographers have long had to contend with the issues associated with samples and their parent populations. Consider, for example, an analysis of the relationship between people over 65 years old and people registered as Republicans, the case studied by Openshaw and Taylor in their seminal article on the modifiable areal unit problem (Openshaw and Taylor 1979). The 99 counties of Iowa (their source of data) are all of the counties that exist in Iowa. They are not therefore a random sample of Iowa counties, or even a representative sample of counties of the US, so the methods of inferential statistics that assume random and independent sampling are not applicable. In remote sensing it is common to analyze all of the pixels in a given scene; again, these are not a random sample of any larger population.
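The sensitivity of such correlations to the choice of areal units can be shown with a toy calculation. The sketch below uses eight synthetic units on a line (not Openshaw and Taylor's Iowa data) and two hypothetical contiguous zonings; all names and values are assumptions made for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def aggregate(values, zones):
    """Sum unit-level values within each zone (a tuple of unit indices)."""
    return [sum(values[i] for i in zone) for zone in zones]

# Synthetic counts for eight basic units arranged along a line.
x = [2, 8, 3, 9, 8, 2, 9, 3]
y = [8, 2, 9, 3, 2, 8, 3, 9]

# Two hypothetical ways of grouping the same units into contiguous zones.
zoning_a = [(0, 1), (2, 3), (4, 5), (6, 7)]
zoning_b = [(0,), (1, 2), (3, 4), (5, 6), (7,)]

r_units = pearson(x, y)
r_a = pearson(aggregate(x, zoning_a), aggregate(y, zoning_a))
r_b = pearson(aggregate(x, zoning_b), aggregate(y, zoning_b))
print(f"units: {r_units:.2f}, zoning A: {r_a:.2f}, zoning B: {r_b:.2f}")
```

In this constructed case the correlation is strongly negative at the unit level, perfectly positive under the first zoning, and weakly negative under the second, even though the underlying data never change.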
However, the cases discussed above are where we can be assured that the entire population of interest is included: we are interested in all of the land cover in a scene, or all of the people over 65 and Republicans in Iowa. This is often not true with many new sources of data. A challenge is how to identify the niches to which monitored population data can be applied with reasonable generality. This inverts the classic sampling problem where we identify a question and collect data to answer that question. Instead, we collect the data and determine what questions we can answer.
Another issue concerns what people are volunteering when they volunteer geographic and other information (Goodchild 2007). Social media such as Facebook may have high penetration rates with respect to population, but do not necessarily have high penetration rates into people’s lives. Checking in at an orchestra concert or lecture provides a noble image that a person would like to promote, while checking in at a bar at 10 a.m. is an image that a person may be less keen to share. In the classic sociology text The Presentation of Self in Everyday Life, Erving Goffman uses theater as a metaphor and distinguishes between stage and backstage behaviors, with stage behaviors being consistent with the role people wish to play in public life and backstage behaviors being private actions that people wish to keep private (Goffman 1959). While there are certainly cases of over-sharing behavior (especially among celebrities) we cannot be sure whether the information people volunteer is an accurate depiction of their complete lives or just of the lives they wish to present to the social sphere. Several geographic questions follow from these observations. What is the geography of stage versus backstage realms in a city or region? Does this distribution vary by age, gender, socioeconomic status, or culture? What do these imply for what we can know about human spatial behavior?
In addition to selective volunteering of information about their lives, there also may be selection biases in the information people volunteer about environments. OpenStreetMap (OSM) is often identified as a successful crowdsourced mapping project: many cities of the world have been mapped by people on a voluntary basis to a remarkable degree of accuracy. However, some regions get mapped more quickly than others, such as tourist locations, recreation areas, and affluent neighborhoods, while locations of less interest to those who participate in OSM (such as poorer neighborhoods) receive less attention (Haklay 2010). While biases exist in official, administrative maps (e.g., governments in developing nations often do not map informal settlements such as favelas), the biases in crowdsourced maps are likely to be more subtle. Similarly, the rise of civic hacking where citizens generate data, maps, and tools to solve social problems tends to focus on the problems that citizens with laptops, fast internet connections, technical skills, and available time consider to be important (Townsend 2013).
Messy, not clean
The new data sources are often messy, consisting of data that are unstructured, collected with no quality control, and frequently accompanied by no documentation or metadata. There are at least two ways of dealing with such messiness. On the one hand, we can restrict our use of the data to tasks that do not attempt to generalize or to make assumptions about quality. Messy data can be useful in what one might term the softer areas of science: initial exploration of study areas, or the generation of hypotheses. Ethnography, qualitative research, and investigations of Grounded Theory (Glaser and Strauss 1967) often focus on using interviews, text, and other sources to reveal what was otherwise not known or recognized, and in such contexts the kinds of rigorous sampling and documentation associated with Scarce Data are largely unnecessary. We discuss this option in greater detail later in the paper.
On the other hand, we can attempt to clean and verify the data, removing as much as possible of the messiness, for use in traditional scientific knowledge construction. Goodchild and Li (2012) discuss this approach in the context of crowdsourced geographic information. They note that traditional production of geographic information has relied on multiple sources, and on the expertise of cartographers and domain scientists to assemble an integrated picture of the landscape. For example, terrain information may be compiled from photogrammetry, point measurements of elevation, and historic sources; as a result of this process of synthesis the published result may well be more accurate than any of the original sources.
Goodchild and Li (2012) argue that that traditional process of synthesis, which is largely hidden from popular view and not apparent in the final result, will become explicit and of critical importance in the new world of Big Data. They identify three strategies for cleaning and verifying messy data: (1) the crowd solution; (2) the social solution; and (3) the knowledge solution. The crowd solution is based on Linus’ Law, named in honor of the developer of Linux, Linus Torvalds: “Given enough eyeballs, all bugs are shallow” (Raymond 2001). In other words, the more people who can access and review your code, the greater the accuracy of the final product. Geographic facts that can be synthesized from multiple original reports are likely to be more accurate than single reports. This is of course the strategy used by Wikipedia and its analogs: open contributions and open editing are evidently capable of producing reasonably accurate results when assisted by various automated editing procedures.
In the geographic case, however, several issues arise that limit the success of the crowd solution. Reports of events at some location may be difficult to compare if the means used to specify location (place names, street address, GPS) are uncertain, and if the means used to describe the event is ambiguous. Geographic facts may be obscure, such as the names of mountains in remote parts of the world, and the crowd may therefore have little interest or ability to edit errors.
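One way to picture the crowd solution and its limits is a simple fusion rule for repeated position reports. The sketch below is an assumption of ours rather than a procedure described by Goodchild and Li (2012): it takes the coordinate-wise median of several reports and accepts it only when enough reports agree with it. The function name, the tolerance, and the sample coordinates are all hypothetical.

```python
import statistics

def crowd_consensus(reports, min_reports=3, tolerance=0.001):
    """Fuse several (lat, lon) reports of one feature.

    Returns the coordinate-wise median when at least `min_reports`
    reports lie within `tolerance` degrees of it, otherwise None.
    """
    if len(reports) < min_reports:
        return None  # obscure feature: too few eyeballs to cross-check
    lat = statistics.median(r[0] for r in reports)
    lon = statistics.median(r[1] for r in reports)
    agreeing = [
        r for r in reports
        if abs(r[0] - lat) <= tolerance and abs(r[1] - lon) <= tolerance
    ]
    return (lat, lon) if len(agreeing) >= min_reports else None

# Hypothetical reports: three that agree and one clear outlier.
well_reported = [
    (34.4140, -119.8489),
    (34.4141, -119.8490),
    (34.4139, -119.8488),
    (34.5000, -119.7000),
]
print(crowd_consensus(well_reported))           # a fused position
print(crowd_consensus([(61.2000, -149.9000)]))  # None: a single report
```

The second call reflects the limitation noted above: a feature that attracts only one report cannot be corroborated by the crowd, however accurate that report may be.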
Goodchild and Li (2012) describe the social solution as implementing a hierarchical structure of volunteer moderators and gatekeepers. Individuals are nominated to roles in the hierarchy based on their track record of activity and the accuracy of their contributions. Volunteered facts that appear questionable or contestable are referred up the hierarchy, to be accepted, queried, or rejected as appropriate. Schemes such as this have been implemented by many projects, including OSM and Wikipedia. Their major disadvantage is speed: since humans are involved, the solution is best suited to applications where time is not critical.
The third, the knowledge solution, asks how one might know if a purported fact is false, or likely to be false. Spelling errors and mistakes of syntax are simple indicators which all of us use to triage malicious email. In the geographic case, one can ask whether a purported fact is consistent with what is already known about the geographic world, in terms both of facts and theories. Moreover such checks of consistency can potentially be automated, allowing triage to occur in close-to real time; this approach has been implemented, although on a somewhat unstructured basis, by companies that daily receive thousands of volunteered corrections to their geographic databases.
A major task for the knowledge solution is formalizing knowledge to support automated triage of asserted facts and automated data fusion. Knowledge can be derived empirically or as predictions from theories, models, and simulations. In the latter case, we may be looking for data at variance with predictions as part of the knowledge-discovery and construction processes.
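A minimal sketch of such automated triage appears below. It assumes that background knowledge has been formalized as a lookup of region bounding boxes; the dictionary, the function name, and the approximate box for Iowa are illustrative assumptions, not a description of any production system.

```python
# Hypothetical background knowledge: approximate bounding boxes given as
# (min_lat, min_lon, max_lat, max_lon). Values are for illustration only.
KNOWN_REGIONS = {
    "Iowa": (40.37, -96.64, 43.51, -90.14),
}

def triage(fact, regions=KNOWN_REGIONS):
    """Return the reasons to doubt an asserted fact.

    An empty list means no conflict with the background knowledge was
    found, which is not the same as the fact being verified.
    """
    problems = []
    lat, lon = fact["lat"], fact["lon"]
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        problems.append("coordinates outside valid range")
    box = regions.get(fact.get("region"))
    if box is not None:
        min_lat, min_lon, max_lat, max_lon = box
        if not (min_lat <= lat <= max_lat and min_lon <= lon <= max_lon):
            problems.append("location outside stated region")
    return problems

print(triage({"name": "clinic", "region": "Iowa", "lat": 41.6, "lon": -93.6}))
print(triage({"name": "clinic", "region": "Iowa", "lat": 34.4, "lon": -119.8}))
```

Crisp rules of this kind are easy to automate; the harder cases are those where the background knowledge itself is vague or contested.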
There are at least two major challenges to formalizing geographic knowledge. First, geographic concepts such as neighborhood, region, the Midwest, and developing nations can be vague, fluid, and contested. A second challenge is the development of explicit, formal, and computable representations of geographic knowledge. Much geographic knowledge is buried in formal theories, models, and equations that must be solved or processed, or in informal language that must be interpreted. In contrast, knowledge-discovery techniques require explicit representations such as rules, hierarchies, and concept networks that can be accessed directly without processing (Miller 2010).
Correlations, not causality
Traditionally, scholarly research concerns itself with knowing why something occurs. Correlations alone are not sufficient, because the existence of correlation does not imply that change in either variable causes change in the other. In the correlation explored by Openshaw and Taylor cited earlier (Openshaw and Taylor 1979), the existence of a correlation between the number of registered Republicans in a county and the number of people aged 65 and over does not imply that either one has a causal effect on the other. Over the years, science has adopted pejorative phrases to describe research that searches for correlations without concern for causality or explanation: “curve-fitting” comes to mind. Nevertheless correlations may be useful for prediction, especially if one is willing to assume that an observed correlation can be generalized beyond the specific circumstances in which it is observed.
But while they may be sufficient, explanation and causality are not necessary conditions for scientific research: much research, especially in such areas as spatial analysis, is concerned with advancing method, whether its eventual use is for explanation or for prediction. The literature of geographic information science is full of tools that have been designed not for finding explanations but for more mundane activities such as detecting patterns, or massaging data for visualization. Such tools are clearly valuable in an era of data-driven science, where questions of “why” may not be as important. In the next section we extend this argument by taking up the broader question of the role of theory in data-driven geography.
Theory in data-driven geography
In a widely discussed article published in Wired magazine, Anderson called for the end of science as we know it, claiming that the data deluge is making the scientific method obsolete (Anderson 2008). Using physics and biology as examples, he argued that as science has advanced it has become apparent that theories and models are caricatures of a deeper underlying reality that cannot be easily explained. However, explanation is not required for continuing progress: as Anderson states “Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all.”
Duncan Watts makes a similar argument about theory in the social sciences, stating that unprecedented volumes of social data have the potential to revolutionize our understanding of society, but this understanding will not be in the form of general laws of social science or cause-and-effect social relationships. Although Watts suggests the limitations of theory in the era of data-driven science, he does not call for the end of theory but rather for a more modest type of theory that would include general propositions (such as what interventions work for particular social problems) or how more obvious social facts fit together to generate less obvious outcomes. Watts links this approach to calls by sociologist Robert Merton in the mid-twentieth century for middle-range theories: theories that address identifiable social phenomena instead of abstract entities such as the entire social system (Watts 2011). Middle-range theories are empirically grounded: they are based in observations, and serve to derive hypotheses that can be investigated. However, they are not endpoints: rather, they are temporary stepping-stones to general conceptual schemes that can encompass multiple middle-range theories (Merton 1967).
Data-driven science seems to entail a shift away from the general and towards the specific: away from attempts to find universal laws that encompass all places and times and towards deeper descriptions of what is happening at particular places and times. There are clearly some benefits to this change: as Batty (2012) points out, urban science and planning in the era of Scarce Data focused on radical and massive changes to cities over the long-term, with little concern for small spaces and local movements. Data-driven urban science and planning can rectify some of the consequent urban ills by allowing greater focus on the local and routine. However, over longer time spans and wider spatial domains the local and routine merge into the long-term; a fundamental scientific challenge is how local and short-term Big Data can inform our understanding of processes over longer temporal and spatial horizons; in short, the problem of generalization.
Table: A brief history of partnerships and tensions between nomothetic (law-seeking) and idiographic (description-seeking) knowledge in geographic thought. Recovered column heading: Path to geographic knowledge. Recovered path entries, in order: Nomothetic ↔ idiographic; Nomothetic → idiographic; Nomothetic ← idiographic; Nomothetic ↔ idiographic. Recovered row labels: Hägerstrand (time geography); Fotheringham/Anselin (local spatial statistics).
However, attempts to reconcile nomothetic and idiographic knowledge did not die with Humboldt and Ritter. Approaches such as time geography seek to capture context and history and recognize the roles of both agency and structure in human behavior (Cresswell 2013). In spatial analysis, the trend towards local statistics, exemplified by Geographically Weighted Regression (Fotheringham et al. 2002) and Local Indicators of Spatial Association (Anselin 1995), represents a compromise in which the general principles of nomothetic geography are allowed to express themselves differently across geographic space. Goodchild (2004) has characterized GIS as combining the nomothetic, in its software and algorithms, with the idiographic in its databases.
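The idea of a general principle expressing itself differently across space can be made concrete with a local Moran statistic in the spirit of Anselin (1995). The sketch below is a simplified illustration under our own assumptions: six synthetic values on a line, left and right neighbors with row-standardized weights, and no significance testing.

```python
def local_morans_i(values, neighbors):
    """Local Moran's I for each location, using row-standardized weights.

    `neighbors` maps each index to the list of indices adjacent to it.
    """
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]    # deviations from the mean
    m2 = sum(d * d for d in z) / n    # second moment of the deviations
    result = []
    for i in range(n):
        lag = sum(z[j] for j in neighbors[i]) / len(neighbors[i])
        result.append(z[i] * lag / m2)
    return result

# Synthetic values along a line: high on the left, low on the right.
values = [10, 9, 8, 2, 1, 2]
neighbors = {
    i: [j for j in (i - 1, i + 1) if 0 <= j < len(values)]
    for i in range(len(values))
}

local_i = local_morans_i(values, neighbors)
# With row-standardized weights the global statistic is the mean of the
# local values, so the single global number decomposes by location.
global_i = sum(local_i) / len(local_i)
print([round(v, 2) for v in local_i], round(global_i, 2))
```

Locations deep inside the high and low clusters contribute most to the global value, while the units at the boundary between clusters contribute little, which is the kind of spatial variation a single global statistic conceals.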
In a sense, the paths to geographic knowledge engendered by data-intensive approaches such as time geography, disaggregate spatial statistics and GIScience are a return to the early foundation of geography where neither law-seeking nor description-seeking was privileged. Geographic generalizations and laws are possible but space matters: spatial dependency and spatial heterogeneity create local context that shapes physical and human processes as they evolve on the surface of the Earth. Geographers have believed this for a long time, but this belief is also supported by recent breakthroughs in complex systems theory, which suggests that patterns of local interactions lead to emergent behaviors that cannot be understood in isolation at either the local or global levels. Understanding the interactions among agents within an environment is the scientific glue that binds the local with the global (Flake 1998).
In short, data-driven geography is not necessarily a radical break with the geographic tradition: geography has a longstanding belief in the value of idiographic knowledge by itself as well as its role in constructing nomothetic knowledge. Although this belief has been tenuous and contested at times, data-driven geography may provide the paths between idiographic and nomothetic knowledge that geographers have been seeking for two millennia. However, while complexity theory supports this belief, it also suggests that this knowledge may have inherent limitations: emergent behavior is by definition surprising.
Approaches to data-driven geography
If we accept the premise—at least until proven otherwise—that Big Data and data-driven science harmonize with longstanding themes and beliefs in geography, the question that follows is: how can data-driven approaches fit into geographic research? Data-driven approaches can support both geographic knowledge-discovery and spatial modeling. However, there are some challenges and cautions that must be recognized.
Data-driven geographic knowledge discovery
Geographic knowledge-discovery refers to the initial stage of the scientific process where the investigator forms his or her conceptual view of the system, develops hypotheses to be tested, and performs groundwork to support the knowledge-construction process. Geographic data facilitate this crucial phase of the scientific process by supporting activities such as study-site selection and reconnaissance, ethnography, experimental design, and logistics.
Perhaps the most transformative impact of data-driven science on geographic knowledge-discovery will be through data-exploration and hypothesis generation. Similar to a telescope or microscope, systems for capturing, storing, and processing massive amounts of data can allow investigators to augment their perceptions of reality and see things that would otherwise be hidden or too faint to perceive. From this perspective, data-driven science is not necessarily a radically new approach, but rather a way to enhance inference for the longstanding processes of exploration and hypothesis generation prior to knowledge-construction through analysis, modeling, and verification (Miller 2010).
Data-driven knowledge-discovery has a philosophical foundation: abductive reasoning, a form of inference articulated by astronomer and mathematician C. S. Peirce (1839–1914). Abductive reasoning starts with data describing something and ends with a hypothesis that explains the data. It is a weaker form of inference relative to deductive or inductive reasoning: deductive reasoning shows that X must be true, inductive reasoning shows that X is true, while abductive reasoning shows only that X may be true. Nevertheless, abductive reasoning is critically important in science, particularly in the initial discovery stage that precedes the use of deductive or inductive approaches to knowledge-construction (Miller 2010).
Abductive reasoning requires four capabilities: (1) the ability to posit new fragments of theory; (2) a massive set of knowledge to draw from, ranging from common sense to domain expertise; (3) a means of searching through this knowledge collection for connections between data patterns and possible explanations; and (4) complex problem-solving strategies such as analogy, approximation, and guesses. Humans have proven to be more successful than machines in performing these complex tasks, suggesting that data-driven knowledge-discovery should try to leverage these human capabilities through methods such as geovisualization rather than try to automate the discovery process. Gahegan (2009) envisions a human-centered process where geovisualization serves as the central framework for creating chains of inference among abductive, inductive, and deductive approaches in science, allowing more interactions and synergy among these approaches to geographic knowledge building.
One of the problems with Big Data is the size and complexity of the information space implied by a massive multivariate database. A good data-exploration system should generate all of the interesting patterns in a database, but only the interesting ones to avoid overwhelming the analyst. Two ways to manage the large number of potential patterns are background knowledge and interestingness measures. Background knowledge guides the search for patterns by representing accepted knowledge about the system to focus the search for novel patterns. In contrast, we can use interestingness measures a posteriori to filter spurious patterns by rating each pattern based on dimensions such as simplicity, certainty, utility, and novelty. Patterns with ratings below a user-specified threshold are discarded or ignored (Miller 2010). Both of these approaches require formalization of geographic knowledge, a challenge discussed earlier in this paper.
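The a posteriori filtering step described above can be illustrated with a minimal sketch. It assumes each candidate pattern already carries a score between 0 and 1 on the four dimensions named in the text. The `Pattern` class, the equal weights, the example patterns, and the threshold are all hypothetical choices made for illustration, not part of any published system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """A candidate pattern with hypothetical scores in [0, 1] per dimension."""
    label: str
    simplicity: float
    certainty: float
    utility: float
    novelty: float


def interestingness(p, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted mean of the four dimension scores (equal weights are an assumption)."""
    scores = (p.simplicity, p.certainty, p.utility, p.novelty)
    return sum(w * s for w, s in zip(weights, scores))


def filter_patterns(patterns, threshold):
    """Keep only patterns rated at or above the user-specified threshold."""
    return [p for p in patterns if interestingness(p) >= threshold]


# Illustrative, made-up candidates and scores.
candidates = [
    Pattern("cluster near transit hubs", 0.8, 0.9, 0.7, 0.6),
    Pattern("population follows population", 0.9, 0.9, 0.2, 0.0),
    Pattern("weak seasonal blip", 0.3, 0.2, 0.3, 0.5),
]

kept = filter_patterns(candidates, threshold=0.6)
print([p.label for p in kept])
```

The hard part is not the filter but the scores: assigning a defensible novelty or utility rating is exactly where formalized geographic knowledge is needed.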
Traditional approaches to modeling are deductive: the scientist develops (or modifies or borrows) a theory and derives a formal representation that can be manipulated to generate predictions about the real world that can be tested with data. Theory-free modeling, on the other hand, builds models based on induction from data rather than through deduction from theory.
The field of economics has flirted with data-driven modeling in the form of general-to-specific modeling (Miller 2010). In this strategy, the researcher starts with the most complex model possible and reduces it to a more elegant one based on data, founded on the belief that, given enough data, only the true specification will survive a sufficiently stringent battery of statistical tests designed to pare variables from the model. This contrasts with the traditional specific-to-general strategy where one starts with a spare model based on theory and conservatively builds a more complex model (Hoover and Perez 1999). However, this approach is controversial, with some arguing that given the enormous number of potential models one would have to be very lucky to encompass the true model within the initial, complex model. On this view, predictive performance is the only relevant criterion; explanation is irrelevant (Hand 1999).
Geography has also witnessed attempts at theory-free modeling, also not without controversy. Stan Openshaw is a particularly strong advocate for using the power of computers to build models from data: examples include the Geographical Analysis Machine (GAM) for spatial clustering of point data, and automated systems for spatial interaction modeling. GAM uses a technique that generates local clusters or “hot spots” without requiring a priori theory or knowledge about the underlying statistical distribution. GAM searches for clusters by systematically expanding circular search regions from locations within a lattice. The system saves circles with observed counts greater than expected and then systematically varies the radii and lattice resolution to begin the search again. The researcher does not need to hypothesize or have any prior expectations regarding the spatial distribution of the phenomenon: the system searches, in a brute-force manner, all possible (or reasonable, at least) spatial resolutions and neighborhoods (Charlton 2008; Openshaw et al. 1987).
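The brute-force search described above can be sketched in a few lines. This is a simplified illustration rather than Openshaw's implementation: it assumes a uniform expected density over a rectangular study area in place of a population-at-risk surface, it flags circles with a plain ratio rule instead of a significance test, and it ignores edge effects. The function name `gam_sketch`, its parameters, and the synthetic data are all hypothetical.

```python
import math
import random


def gam_sketch(cases, bounds, radii, ratio=2.0, min_count=5):
    """Return (cx, cy, r, observed, expected) for circles with excess counts.

    Expected counts assume a uniform density over the study area (a strong
    simplification); `ratio` and `min_count` are illustrative thresholds.
    """
    xmin, ymin, xmax, ymax = bounds
    density = len(cases) / ((xmax - xmin) * (ymax - ymin))
    flagged = []
    for r in radii:
        step = r  # lattice spacing tied to the radius, so adjacent circles overlap
        nx = int((xmax - xmin) / step) + 1
        ny = int((ymax - ymin) / step) + 1
        for i in range(nx):
            for j in range(ny):
                cx, cy = xmin + i * step, ymin + j * step
                observed = sum(
                    1 for (x, y) in cases if math.hypot(x - cx, y - cy) <= r
                )
                expected = density * math.pi * r * r
                if observed >= min_count and observed > ratio * expected:
                    flagged.append((cx, cy, r, observed, expected))
    return flagged


# Synthetic data: uniform background plus one tight cluster near (7, 3).
rng = random.Random(0)
cases = [(rng.uniform(0, 10), rng.uniform(0, 10)) for _ in range(200)]
cases += [(rng.gauss(7, 0.2), rng.gauss(3, 0.2)) for _ in range(40)]

flagged = gam_sketch(cases, bounds=(0, 0, 10, 10), radii=[0.5, 1.0])
best = max(flagged, key=lambda c: c[3])  # circle with the largest observed count
print(len(flagged), "circles flagged; densest is centred at", best[:2])
```

Because many overlapping circles are tested, some background circles may be flagged by chance alone, which is one reason the interpretation of such output still depends on the analyst.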
GAM is arguably an exploratory technique, while Openshaw’s automated system for exploring a universe of possible spatial interaction models leaps more into the traditional realm of deductive modeling. The automated system uses genetic programming to breed spatial interaction models from basic elements such as the model variables (e.g., origin outflow and destination inflow totals, travel cost, intervening opportunities), functional forms (e.g., square root, exponential), parameterizations, and binary operators (add, subtract, multiply and divide) using goodness-of-fit as a criterion (Diplock 1998; Openshaw 1988).
One challenge in theory-free modeling is that it takes away a powerful mechanism for improving the effectiveness of a search for an explanatory model—namely, theory. Theory tells us where to look for explanation, and (perhaps more importantly) where not to look. In the specific case of spatial interaction modeling, for example, the need for models to be dimensionally consistent can limit the options, though the possibility of dimensional analysis (Gibbings 2011) was not employed in Openshaw’s work. The information space implied by a universe of potential models can be enormous even in a limited domain such as spatial interaction. Powerful computers and clever search techniques can certainly improve our chances (Gahegan 2000). But as the volume, variety, and velocity of data increase, the size of the information spaces for possible models also increases, leading to a type of arms race with perhaps no clear winner.
A second challenge in data-driven modeling is that the data drive the form of the model, meaning there is no guarantee that the same model will result from a different data set. Even given the same data set, many different models could be generated that fit the data, meaning that slight alterations in the goodness-of-fit criterion used to drive model selection can produce very different models (Fotheringham 1998). This is essentially the problem of statistical overfitting, a well-known problem with inductive techniques such as artificial neural networks and machine learning. However, despite methods and strategies to avoid overfitting, it appears to be endemic: some estimate that three-quarters of the published scientific papers in machine learning are flawed due to overfitting (The Economist 19 October 2013).
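A toy experiment illustrates how selecting a model purely on goodness-of-fit can reward noise. In this sketch the "models" are 500 candidate predictors that are pure random noise, unrelated to the target by construction. The one that best fits a small training sample is chosen and then checked against held-out data. The sample sizes, seed, and variable names are arbitrary assumptions for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(42)
n_train, n_test, n_candidates = 20, 200, 500
n = n_train + n_test

# Target and candidate predictors are independent noise: no true relationship exists.
y = [rng.gauss(0, 1) for _ in range(n)]
candidates = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_candidates)]


def fit_score(x):
    """Goodness-of-fit on the training portion only."""
    return abs(pearson(x[:n_train], y[:n_train]))


# Data-driven selection: keep whichever candidate fits the training data best.
best = max(candidates, key=fit_score)
in_sample = fit_score(best)
out_of_sample = abs(pearson(best[n_train:], y[n_train:]))
print(f"in-sample |r| = {in_sample:.2f}, out-of-sample |r| = {out_of_sample:.2f}")
```

The selected predictor looks strong in-sample only because it was the luckiest of many candidates. Searching a larger space of models raises the in-sample fit without improving out-of-sample performance, and a different training sample would generally select a different "best" model.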
The knowledge from data-driven models can be complex and non-compressible: the data are the explanation. But if the explanation is not understandable, do we really have an explanation? Perhaps the nature of explanation is evolving. Perhaps computers are fundamental in data-driven science not only for discovering but also for representing complex patterns that are beyond human comprehension. Perhaps this is a temporary stopgap until we achieve convergence between human and machine intelligence as some predict (Kurzweil 1999). While we cannot hope to resolve this question (or its philosophical implications) within this paper, we can add a cautionary note from Nate Silver: telling stories about data instead of reality is dangerous and can lead to mistaking noise for signal (Silver 2012).
A final challenge in data-driven spatial modeling is de-skilling: a loss of modeling and analysis skills. While allocating mundane tasks to computers frees humans to perform sophisticated activities, there are times when mundane skills become crucial. For example, there are documented cases of airline pilots who, due to a lack of manual flying experience, reacted badly in emergencies when the autopilot shut off (Carr 2013). Although rarely life-threatening, one could make a similar argument about automatic model building: if a data-driven modeling process generates anomalous results, will the analyst be able to determine if they are artifacts or genuine? With Openshaw’s automated spatial interaction modeling system, the analyst may become less skilled at spatial interaction modeling and more skilled at combinatorial optimization techniques. While these skills are valuable and may allow the analyst to reach greater scientific heights, they are another level removed from the empirical system being modeled. Moreover, the more anomalous the results, the deeper the thinking required.
A solution to de-skilling is to force the skill: require it as part of education and certification, or design software that encourages or requires analysts to maintain some basic skills. However, this is a difficult case to make compared to the hypnotic call of sophisticated methods with user-friendly interfaces (Carr 2013). Re-reading Jerry Dobson’s prescient essay on automated geography thirty years later (Dobson 1983), one is impressed by the number of activities in geography that used to be painstaking but are now push-button. Geographers of a certain age may recall courses in basic and production cartography without much nostalgia. What skills that we consider essential today will be considered the pen, ink, and lettering kits of tomorrow? What will we lose?
The context for geographic research has shifted from a data-scarce to a data-rich environment, in which the most fundamental changes are not just the volume of data, but the variety and the velocity at which we can capture georeferenced data. A data-driven geography may be emerging in response to the wealth of georeferenced data flowing from sensors and people in the environment. Some of the issues raised by data-driven geography have in fact been longstanding issues in geographic research, namely, large data volumes, dealing with populations and messy data, and tensions between idiographic versus nomothetic knowledge. The belief that spatial context matters is a major theme in geographic thought and a major motivation behind approaches such as time geography, disaggregate spatial statistics, and GIScience. There is potential to use Big Data to inform both geographic knowledge-discovery and spatial modeling. However, there are challenges, such as how to formalize geographic knowledge to clean data and to ignore spurious patterns, and how to build data-driven models that are both true and understandable.
Cautionary notes need to be sounded about the impact of data-driven geography on broader society (see Mayer-Schonberger and Cukier 2013). We must be cognizant of where this research is occurring: in the open light of scholarly research where peer review and reproducibility are possible, or behind the closed doors of private-sector companies and government agencies, as proprietary products without peer review and without full reproducibility. Privacy is a vital concern, not only as a human right but also as a potential source of backlash that will shut down data-driven research. We must be careful to avoid pre-crimes and pre-punishments (Zedner 2010): categorizing and reacting to people and places based on potentials derived from correlations rather than actual behavior. Finally, we must avoid a data dictatorship: data-driven research should support, not replace, decision-making by intelligent and skeptical humans. Some of the other papers in this special issue explore these challenges in depth.
- Anderson, C. (2008). The end of theory: The data deluge makes the scientific method obsolete. Wired, 16, 07.
- Carr, N. (2013). The great forgetting. The Atlantic, pp. 77–81.
- Charlton, M. (2008). Geographical Analysis Machine (GAM). In K. Kemp (Ed.), Encyclopedia of Geographic Information Science (pp. 179–180). London: Sage.
- Cresswell, T. (2013). Geographic thought: A critical introduction. New York: Wiley-Blackwell.
- DeLyser, D., & Sui, D. (2013). Crossing the qualitative-quantitative divide II: Inventive approaches to big data, mobile methods, and rhythmanalysis. Progress in Human Geography, 37(2), 293–305.
- Dumbill, E. (2012). What is big data? An introduction to the big data landscape, http://strata.oreilly.com/2012/01/what-is-big-data.html. Last accessed 17 April 2014.
- Flake, G. W. (1998). The computational beauty of nature: computer explorations of fractals, chaos, complex systems, and adaptation. Cambridge: MIT Press.
- Fotheringham, A. S., Brunsdon, C., & Charlton, M. (2002). Geographically weighted regression: The analysis of spatially varying relationships. Chichester: Wiley.
- Gahegan, M. (2000). On the application of inductive machine learning tools to geographical analysis. Geographical Analysis, 32(1), 113–139.
- Gahegan, M. (2009). Visual exploration and explanation in geography: Analysis with light. In H. J. Miller & J. Han (Eds.), Geographic data mining and knowledge discovery (2nd ed., pp. 291–324). London: Taylor and Francis.
- Glaser, B. G., & Strauss, A. L. (1967). The discovery of grounded theory. Chicago: Aldine.
- Goffman, E. (1959). The presentation of self in everyday life. New York: Anchor Books.
- Goodchild, M. F. (2004). GIScience, geography, form, and process. Annals of the Association of American Geographers, 94(4), 709–714.
- Guptill, S. C., & Morrison, J. L. (Eds.). (1995). Elements of spatial data quality. Oxford: Elsevier.
- Hartshorne, R. (1939). The nature of geography: A critical survey of current thought in the light of the past. Washington, DC: Association of American Geographers.
- Hey, T., Tansley, S., & Tolle, K. (Eds.). (2009). The fourth paradigm: Data-intensive scientific discovery.
- Kurzweil, R. (1999). The age of spiritual machines: when computers exceed human intelligence. New York: Vintage.
- Mayer-Schonberger, V., & Cukier, K. (2013). Big Data: A revolution that will transform how we live, work, and think.
- Merton, R. K. (1967). On sociological theories of the middle range. In R. K. Merton (Ed.), On theoretical sociology (pp. 39–72). New York: The Free Press.
- O’Leary, M. (2012). Eurovision statistics: post-semifinal update, Cold Hard Facts (May 23). Available: http://mewo2.com/nerdery/2012/05/23/eurovision-statistics-post-semifinal-update/. Accessed October 25, 2013.
- Openshaw, S., & Taylor, P. J. (1979). A million or so correlation coefficients: three experiments on the modifiable areal unit problem. In N. Wrigley (Ed.), Statistical methods in the social sciences (pp. 127–144). London: Pion.
- Preis, T., Moat, H. S., & Stanley, H. E. (2013). Quantifying trading behavior in financial markets using Google Trends. Scientific Reports, 3 (1684). doi: 10.1038/srep01684.
- Raymond, E. S. (2001). The cathedral and the bazaar: Musings on linux and open source by an accidental revolutionary. Sebastopol: O’Reilly Media.
- Silver, N. (2012). The signal and the noise: Why most predictions fail—but some don’t.
- Sui, D. (2004). GIS, cartography, and the “Third Culture”: Geographic imaginations in the computer age. Professional Geographer, 56(1), 62–72.
- Taleb, N. N. (2007). The black swan: The impact of the highly improbable. New York: Random House.
- The Economist. (19 October 2013). Trouble at the lab, pp. 26–30.
- Townsend, A. (2013). Smart cities: Big data, civic hackers, and the quest for a new utopia. New York: Norton.
- Tsou, M. H., Yang, J. A., Lusher, D., Han, S., Spitzberg, B., Gawron, J. M., et al. (2013). Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): a case study in 2012 US Presidential Election. Cartography and Geographic Information Science, 40(4), 337–348.
- Watts, D. J. (2011). Everything is Obvious – Once You Know the Answer. United States of America: Crown Business.
- Weinberger, D. (2011). The machine that would predict the future, Scientific American, November 15, 2011. http://www.scientificamerican.com/article.cfm?id=the-machine-that-would-predict.