1 Introduction

The research community has relied for several decades on a consistent supply of high-quality digital data about urban areas, and this supply has provided an essential component of urban infrastructure, guiding the development of public policy and forming a basis for socioeconomic and demographic research. These traditional data have often come from government sources, such as administrative records or the demographic census that exists in most countries and provides details of age, sex, and socioeconomic variables such as income and education level. The data have been produced on a regular basis (often decennial) and disaggregated to small geographic units. While these units must be sufficiently large to allay concerns about the potential for breaching confidentiality, it is also possible to obtain suitably anonymized data on individuals. Yet today this system is being disrupted in many ways. New kinds of data, such as those obtained by tracking vehicles and smartphones, from postings on social media, or by capturing video of streetscapes, have created a new data environment that we now identify with the phrase Big Data. Its characteristics are variety, because of the multiple sources, some of high quality and some much more biased and less reliable; velocity, because in contrast to the traditional world, in which many months often elapsed between capture and publication, these new sources are often available in close to real time; and volume, because the new sources often eclipse the traditional by orders of magnitude. Arguments are often made for additional Vs: veracity, though it is not clear whether the new sources are necessarily more or less accurate than the old; and value, because the new sources clearly have potential value for research.

This new world is clearly radically different from the old, and is stimulating new interest in urban informatics (for a more comprehensive discussion of urban informatics see Shi et al., 2021) and in the infrastructure that cities and the research community will have to develop if they are to take full advantage of these new developments and to use them to achieve the vision of the smart city. It seems appropriate, therefore, to ask what this new world implies for research that is based on urban informatics, for urban informatics itself, and for this journal. In this short article I explore five such implications. These are based in part on my own interests as an academic geographer and geographic information scientist, and on topics that I have explored over the past few years, and I make no claim that the list is complete. I hope, however, that it will stimulate further discussion and further research, and that the results of those discussions and that research will eventually appear in the pages of this journal.

1.1 Data integration

Variety is a fundamental characteristic of Big Data, but the effective operation of a Big Data infrastructure will require effective and functional means to integrate the assorted types of data that are now available, and will be increasingly available in the future. Geographic information systems (GIS) have long been touted as a technology that is fundamentally focused on integrating geospatial data, by making it possible to simultaneously access and process many different data types based on common location: in other words, to assemble various facts about a location and its neighborhood. In the geospatial world these data types are likely to be thought of as layers, a metaphor that dates from the earliest GIS and is in large part inherited from maps. The same concept also extends to remote sensing, since the basic unit of remote sensing is an image, collected at one point in time by a sensor and later georeferenced.

Today a large number of data models exist for layers. They may consist of rectangular or hexagonal rasters, of regularly or randomly located points, of space-exhausting polygons, of digitized lines or contours, or of meshes of triangles or quadrilaterals. In each case one or more attributes are associated with each of the basic spatial units of the data model. The task of integrating two layers may take various forms depending on the data models. Integrating two congruent rasters is a trivial matter of combining the attributes of each pixel, while integrating non-congruent rasters, or data that use different data models, will require a geometric operation on the spatial units, a task that is often described as a spatial join.
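As a minimal illustration of the simplest case (the arrays and the combination rule are hypothetical, not drawn from any particular system), integrating two congruent rasters reduces to combining attributes cell by cell:

```python
import numpy as np

# Two hypothetical congruent rasters: same extent, cell size, and alignment,
# so cell (i, j) in one layer describes the same patch of ground as in the other.
land_cover = np.array([[1, 1, 2],
                       [2, 3, 3],
                       [3, 3, 1]])          # categorical attribute (class codes)
elevation = np.array([[12.0, 14.5, 13.0],
                      [15.2, 18.9, 17.4],
                      [20.1, 22.3, 19.8]])  # continuous attribute (meters)

# Integration reduces to combining attributes cell by cell:
# here, the mean elevation within each land-cover class.
for cls in np.unique(land_cover):
    print(cls, round(float(elevation[land_cover == cls].mean()), 2))
```

Integrating non-congruent layers is where the geometric work of a spatial join begins, since the spatial units themselves must first be intersected or resampled.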

Data are integrated in a GIS using geographic coordinates such as latitude and longitude. But determining latitude and longitude is always a problem of measurement, and like all measurement problems it inherits the uncertainties and errors of the measuring instrument. The Global Positioning System (GPS) and other global navigation satellite systems (GNSS) provide varying accuracies, and their signals tend to be distorted in dense urban areas. Thus the measurement errors of a hand-held GPS receiver or the GPS chip installed in a smartphone may range from a few meters to as much as a hundred meters, while survey-grade receivers may achieve accuracies in the centimeter range. As a result, GPS provides a reliable means of integrating data in some, but by no means all, circumstances. Determining whether someone is at a given point of interest (POI) will be very unreliable when, for example, the POI is less than a few meters across, the POI is indoors, or the POI is in a multistory building.

These issues came to the fore early in the COVID pandemic in the search for a technological solution to the problem of contact tracing. Calculating the distance between two individuals based on their smartphone GPS coordinates, and comparing it to some assumed critical distance such as two meters, will clearly lead to abundant false positives and false negatives because of the uncertainty in location measurements. Bluetooth signals between smartphones may offer a more reliable solution, but little appears to be known about the accuracy of distance measurement using Bluetooth. Similarly, various technologies for indoor positioning, including Bluetooth beacons and Wi-Fi (Chen & Chen, 2021), may offer ways of computing distance between individuals. In short, there are significant technological problems in integrating data based on the measured positions of individuals.
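A minimal sketch of why GPS-based proximity detection is unreliable, assuming a standard haversine distance and an illustrative ten-meter urban positioning error per receiver (both the coordinates and the error figure are my assumptions, not measurements from any deployed contact-tracing system):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlmb / 2) ** 2
    return 2 * r * asin(sqrt(a))

# Two reported smartphone positions (hypothetical values).
d = haversine_m(34.41280, -119.84810, 34.41281, -119.84812)
threshold_m = 2.0      # assumed critical distance
gps_error_m = 10.0     # plausible per-receiver error in a dense urban area

# The combined positional uncertainty dwarfs the threshold, so the Boolean
# comparison below cannot reliably distinguish contacts from non-contacts.
print(f"reported separation = {d:.1f} m, contact? {d <= threshold_m} "
      f"(uncertainty ~ +/- {2 * gps_error_m:.0f} m)")
```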

These issues persist when the data are spatiotemporal and it becomes possible to address questions such as “Were these two people within two meters of each other for at least 15 minutes?” While geospatial data have tended to focus largely on capturing what is static about the geographic domain, in recent years there has been a massive increase in the availability of spatiotemporal data, and velocity is now often the most important and useful of the characteristics of Big Data. Many decisions can now be made in close to real time, and the dashboard showing the current situation is an increasingly popular means of communication. Dashboards became very popular during the COVID pandemic (Kitchin et al., 2016), and are likely to remain so in the future.
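To make the spatiotemporal version of the question posed above concrete, the following sketch scans two hypothetical, synchronized tracks in a local planar grid and reports whether their separation stays within two meters for at least 15 consecutive minutes; the track format, the one-minute sampling interval, and the planar distance are all illustrative assumptions, and the positional uncertainty discussed above applies equally here.

```python
from math import hypot

# Hypothetical tracks: lists of (t_minutes, x_m, y_m) in a local planar grid,
# sampled at the same one-minute interval for both people.
track_a = [(t, 0.0, 0.0) for t in range(30)]
track_b = [(t, 1.5, 0.0) for t in range(30)]   # a constant 1.5 m away

def sustained_contact(a, b, max_dist_m=2.0, min_minutes=15):
    """True if the separation stays within max_dist_m for min_minutes in a row."""
    run = 0
    for (ta, xa, ya), (tb, xb, yb) in zip(a, b):
        assert ta == tb                        # assumes synchronized sampling
        if hypot(xa - xb, ya - yb) <= max_dist_m:
            run += 1                           # one more minute of proximity
            if run >= min_minutes:
                return True
        else:
            run = 0
    return False

print(sustained_contact(track_a, track_b))     # True for this contrived pair
```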

In addition to location and time, individual identity can often provide a useful basis for integration. Names are not necessarily unique, individuals can control multiple telephone numbers, and while social security numbers (and their equivalents in other countries) are in principle unique, fairly stringent protections apply to their use. Instead it is the device ID that provides the basis for data integration in many contexts. A single record consisting of a time, a location, and a device ID may appear anonymous and harmless, but it is a trivial task to link together all of the times and locations for a given device to construct the personal track or trajectory of the carrier of the device, and a vast amount of such data is now being generated through the use of smartphone apps. Such data are being aggregated and sold through a rapidly growing network of companies as the basis for monitoring consumer behavior (Valentino-DeVries et al., 2018), largely without any regulation or constraint, at least in the US. Moreover, the track of an individual can readily lead to inferences about home and work locations, personal relationships, and even medical conditions, leaving the track essentially de-anonymized.
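The ease of re-identification can be shown in a few lines: given hypothetical ping records of device ID, time, and location (the record layout and values are assumptions for illustration), a simple group-and-sort reconstructs each device's full trajectory, from which home and work locations can then be inferred.

```python
from collections import defaultdict

# Hypothetical pings: (device_id, timestamp, lat, lon) as they might be sold in bulk.
pings = [
    ("device-42", "2021-03-01T08:05", 34.4140, -119.8489),
    ("device-17", "2021-03-01T08:07", 34.4201, -119.6982),
    ("device-42", "2021-03-01T09:30", 34.4129, -119.8610),
    ("device-42", "2021-03-01T18:45", 34.4140, -119.8489),
]

# Grouping by device ID and sorting by time reconstructs each personal track.
tracks = defaultdict(list)
for device_id, ts, lat, lon in pings:
    tracks[device_id].append((ts, lat, lon))
for device_id in tracks:
    tracks[device_id].sort()

# The most frequently visited overnight location in such a track is a crude
# but effective guess at "home", which is what makes the data de-anonymizable.
print(tracks["device-42"])
```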

1.2 Accuracy and uncertainty

Reference has already been made to the issue of accuracy and uncertainty, and its impact on linking records by location and on contact tracing. But accuracy of position is only the tip of the uncertainty iceberg. In principle all spatiotemporal data have uncertainty, because both time and location are the results of measurement. But many other forms of uncertainty exist and will impact the development of Big Data infrastructure. Many variables are to some degree subjective, as in the assessment and classification of neighborhoods and housing quality, and as such are not scientifically replicable: two observers may not agree when asked to make the same assessment. For such variables the terms accuracy and error seem inappropriate, because they imply agreement on the existence of a true value, so the term uncertainty is now preferred in geographic information science (GIScience).

Korzybski (1933) is often quoted in discussions of the differences between the geographic world and representations of it: “The map is not the territory.” In part these differences arise through measurement, as discussed in the previous section. But they also arise because for representations to be computationally useful they must meet the constraints imposed by computational systems, and although those constraints have been loosened substantially in recent years, they will always exist: it will never be possible to capture the infinitely complex geographic world in its full detail. Many practices limit the amount of detail that is stored in representations, including the pixel sizes of remote sensing, the number of significant digits that are maintained when coordinates are processed, and the level of detail that is included in topographic mapping and reflected in its scale. It follows that effective use of spatiotemporal data requires an ability on the part of the user to know what is missing from the data but present in the real world, and whether such missing information would seriously impact conclusions.
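One concrete way to see the effect of limited detail is the ground resolution implied by the number of decimal digits retained in a latitude. The short sketch below uses the approximate figure of 111,320 m per degree of latitude (longitude shrinks further by the cosine of the latitude); the figures are approximations for illustration only.

```python
# Approximate ground size of the last retained decimal digit of latitude.
# One degree of latitude is roughly 111,320 m anywhere on the globe.
for digits in range(1, 8):
    print(f"{digits} decimal places ~ {111320 / 10 ** digits:,.3f} m")
```

Five decimal places, for example, correspond to roughly a meter on the ground, so coordinates truncated beyond that point cannot support applications that demand sub-meter detail.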

The concept of a digital twin has received attention in recent years as a theme in discussions of the smart city (Batty, 2018). In principle a digital twin allows processes that impact the real world to be simulated on a digital copy, and it is possible to imagine a world in which the impacts of the massive increases in atmospheric greenhouse gases could have been simulated in the laboratory, rather than made the subject of the enormous experiment on the planet itself that has now been running since the start of the industrial revolution. The concept of a Digital Earth, a digital twin of the planet, has also received much attention in recent years (Guo et al., 2020). But since a digital twin cannot be a complete representation of a city or of the Earth, there will always be some degree of ambiguity associated with the concept: What threshold of completeness allows a developer to claim that a representation is a digital twin? What applications will an incomplete digital twin support and what applications will it not support? How should the inevitable uncertainty of an incomplete digital twin be measured, visualized, and propagated to influence its predictions?

In effect, then, accuracy and uncertainty should not be seen as properties of data, to be measured and reported as an important part of metadata, but should instead be judged in terms of specific use cases: For what applications is a data set suitable, given its inevitable uncertainty, and for what is it not? This is the doctrine of fitness for use, and it is one of a number of ways in which the traditional perspective on geospatial data is being challenged. Decades ago, when the cost of collecting, compiling, and publishing geospatial data was almost prohibitive and production was effectively limited to agencies of national governments, it was necessary to spread the cost by finding as many applications as possible, and for the data to remain valid for as long as possible. But these arguments have largely disappeared today, with the costs of production a mere fraction of what they were, and with everyone, including private citizens, enabled to collect and build geospatial data sets (Turner, 2006). We can apply the same argument to digital twins, by linking them to explicitly identified use cases. Thus a collection of data might rank as a digital twin for a specific use case if it can be shown that the data are fit for that particular purpose, and that the differences between the real world and its digital twin, while substantial, are nevertheless largely irrelevant to that purpose.

One of the most important differences between data and the real world concerns time, and the need in many applications for data to be up to date. Some types of features change slowly, including the existence and locations of streets and physical infrastructure. But construction activities may impact cities on a daily basis, and data on congestion require minute-by-minute updating. A number of mechanisms are now in place to ensure the accuracy of congestion data, using traffic cameras, loop detectors embedded in pavement, and the tracking of vehicles through programs such as Waze (www.waze.com). But POI data are proving much more difficult to update and maintain. Businesses open and close, especially during periods of disruption such as the COVID pandemic, and it may be difficult to rely on published hours of opening and closing. As a result POI data may be the weakest element in current wayfinding and route-guidance technology, and badly in need of novel ideas for improvement.

1.3 Data ownership

While the volume of data with potential relevance to urban Big Data infrastructure has been expanding rapidly, and is expected to continue to do so, access to those data is a much more complex issue. Many traditional data sets are readily accessible, including data from the decennial census. But it was frustration with the high cost of data about the UK’s street network that led to the launch of OpenStreetMap (OSM) in 2004, an effort to enlist volunteers armed with GPS receivers and fine-resolution imagery to create a free and readily accessible geospatial data set. Enthusiasm for OSM has since spread around the world, and OSM now provides a very successful mechanism for rapid mapping in areas impacted by disaster, such as Haiti following the earthquake of 2010. But many other data sets remain expensive, or closely held by corporations such as utility companies, or inaccessible because of national security concerns. Building an admittedly incomplete digital twin of a major US city would require a massive, expensive, and time-consuming effort to negotiate access with the various data-holding companies and agencies, a process that would seriously impact response to disasters, as it did in the case of Manhattan after the 9/11 attacks or New Orleans after Hurricane Katrina (National Research Council, 2007).

Even personal data about individuals remain locked up in the databases of many corporations, a situation that Geoff Jacquez calls the “Balkanization of the quantified self” (Goodchild, 2015). As noted earlier, users of a corporation’s apps frequently and largely inadvertently allow the corporation to capture “pings,” or records of the user’s presence at a location at a certain time, and then to aggregate these pings for sale to market analysts. Today a vast amount of information about an individual exists, held by insurers, medical clinics, or financial institutions, but almost none of it is owned by, or accessible to, the individual. In my case, and probably that of most academics, the only major repository of personal data that is definitely under my control is my CV.

1.4 The locations of computing

In the early stages of the popularization of the Internet, in the 1990s, it was popular to speculate about the “death of distance,” because it had become as easy to connect with someone on the other side of the world as with a person in one’s own neighborhood. Internet access was largely free, creating a major disruption for people who were used to the idea that the cost of telephone communication was an increasing function of distance. Yet three decades later distance remains an important factor in human behavior. Our acquaintance networks are still largely governed by distance, distance acts as a surrogate for common nationality and common language, and despite the growth of telecommuting we still conduct many of our activities locally.

What does all of this imply for Big Data infrastructure? We instinctively think of the city as the primary unit of organization, and the city center as the appropriate location for its control functions. But is this necessarily the best solution, and how should infrastructure be organized in complex metropolitan areas such as Los Angeles with its many local units of government? Should control be centralized, or would there be advantages and possibly cost reductions if control were disaggregated?

Any application of information technology involves multiple locations: the location of the observer or sensor that acquires the data, the locations where the data are stored and processed, and the locations of users and decision makers. Confounding this is the somewhat chaotic nature of urban governance, with its different levels of government, jurisdictions for different specialized agencies, and services that are often provided from distributed locations. The Internet and cellular networks ensure that data pass easily and at minimal cost, but these networks are themselves vulnerable during emergencies and disasters.

Consider a digital camera installed somewhere in a city, generating fine-resolution images many times per second. A single camera might generate on the order of gigabits of data per second, and a large city might have a million such cameras. If all of these data were fed to a single location it is hard to imagine how even the simplest forms of analysis would be feasible: consider tracking an individual from camera to camera through facial recognition, for example. It is not surprising, then, that many of the success stories of such technology are based on subsequent analyses of stored images following known events, not the detection of suspicious events in real time.
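A rough back-of-the-envelope calculation, using the illustrative orders of magnitude above (one gigabit per second per camera and one million cameras, both assumptions rather than measurements), shows the scale of the problem:

```python
# Order-of-magnitude estimate of aggregate raw video flowing to one central site.
bits_per_camera_per_s = 1e9          # ~1 gigabit per second per camera (assumed)
cameras = 1e6                        # ~one million cameras in a large city (assumed)

total_bits_per_s = bits_per_camera_per_s * cameras
total_bytes_per_s = total_bits_per_s / 8
print(f"{total_bits_per_s:.0e} bits/s ~ {total_bytes_per_s / 1e12:.0f} TB per second")
# ~1e15 bits/s, i.e. on the order of 125 terabytes arriving every second.
```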

The concept of edge computing (Marcham, 2021) is clearly relevant here: instead of transmitting all raw data to a single central location for analysis, edge computing performs as much analysis as possible at the point of data capture. For example, a satellite used to detect change might do so by comparing a current image to images that were previously stored onboard, rather than transmitting all images to the ground for analysis. A similar approach is used in the design of autonomous vehicles, where much of the data collected by the sensors attached to the vehicle is processed locally, and only limited data, such as a representation of the vehicle’s trajectory, may be uploaded to a central store; and similarly only limited data, such as that required to route the vehicle to its destination, need to be downloaded.
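A minimal sketch of the onboard change-detection idea, in which the current image is compared to a stored reference at the point of capture and transmitted only when the fraction of changed pixels exceeds a threshold; the images, tolerance, and decision rule are assumptions for illustration, not a description of any particular satellite or vehicle.

```python
import numpy as np

def should_downlink(current, reference, pixel_tol=10, change_frac=0.05):
    """Edge-side decision: transmit only if enough pixels differ from the stored reference."""
    diff_mask = np.abs(current.astype(int) - reference.astype(int)) > pixel_tol
    return bool(diff_mask.mean() > change_frac)

# Hypothetical 8-bit images held onboard the sensor.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, (512, 512), dtype=np.uint8)

unchanged = reference.copy()                     # same scene re-imaged
changed = reference.copy()
changed[:200, :200] = 255                        # simulate a large bright change

print(should_downlink(unchanged, reference))     # False: keep the data onboard
print(should_downlink(changed, reference))       # True: worth transmitting to the ground
```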

In short, a major element of future research into urban Big Data infrastructure will need to be concerned with the locations of computing, with issues such as the centralization of data storage and control versus edge computing and local storage. Central control is consistent with the model of the monocentric and radially symmetric city, but many conurbations no longer follow such simplistic models, and patterns of data ownership and data acquisition now invite very different approaches. Geographers have long been concerned with the provision of services from central locations to distributed populations, and have developed models and algorithms that will provide a useful basis for this future research.
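As one hedged illustration of the kind of model referred to here, the sketch below assigns demand points to the nearest of several candidate control centers, the elementary allocation step behind classic location-allocation formulations such as the p-median problem; the coordinates and the planar distance metric are assumptions for illustration.

```python
from math import hypot

# Hypothetical candidate control centers and demand points, in km on a planar grid.
centers = {"downtown": (0.0, 0.0), "valley": (25.0, 10.0), "harbor": (5.0, -20.0)}
demand = [(2.0, 1.0), (24.0, 12.0), (7.0, -18.0), (15.0, 3.0)]

# Assign each demand point to its nearest center: the elementary step behind
# location-allocation models that weigh centralized against distributed provision.
for x, y in demand:
    nearest = min(centers, key=lambda c: hypot(x - centers[c][0], y - centers[c][1]))
    print((x, y), "->", nearest)
```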

1.5 Broader impacts

We tend to think of the urban infrastructure of information technology as relatively invisible. The towers of cellular networks and occasional microwave dishes may impinge on our field of view at times, but many communication links are buried underground or strung overhead. Yet it seems that the developing Big Data infrastructure may have much broader impacts in the future. We have already seen how online shopping is leading to the failure of many traditional shopping centers and the expansion of logistics facilities, and the COVID pandemic has had dramatic effects in reducing the daytime occupancy of downtown office buildings, with knock-on effects on retailing and services.

There is a developing need for research into the long-term, broader impacts of Big Data infrastructure. What will be the effects on land use, as shopping centers fail and their spaces are redeveloped? Will the growth of autonomous vehicles lead to a further growth in urban sprawl, as people opt for a longer commute and repurpose their driving time? Will drivers continue to regard route selection as an individual right, when autonomous vehicles and smart traffic controls allow for greater central control? Will street layouts, signage, and mode separation need to be rethought to accommodate new and disruptive technologies such as e-bikes, e-scooters, ridesharing, and active transportation, along with the broader impacts of Big Data infrastructure?

2 Concluding comments

Cities have existed for millennia, and have weathered many disruptions in the past. Cities have been besieged, obliterated, in some cases planned, and frequently disrupted. I have attempted in this paper to elaborate on one particular form of disruption: the growth of Big Data, its supporting infrastructure, and their impacts. At first one might think of these as minor: as I have argued, the infrastructure of information technology remains largely hidden from view, and we have little awareness of the vast numbers of bits that flow through our wires and fiber-optic cables, and through the atmosphere as electromagnetic waves. It is only when these bits are acquired by cameras and sensors, or stored in data warehouses, or handled by computers, or displayed for users, that we gain a sense of their importance. Ours is still very much a physical world of people, buildings, streets, trees, and vehicles, despite the growing and critical importance of digital information.

Yet that digital information is now impacting almost all of our activities, infecting our culture, and ultimately changing the nature of the city. In this paper I have identified several areas where the effects of Big Data infrastructure cry out for research. All of them are geospatial in nature, and although I would not argue that Big Data infrastructure is an exclusively geospatial issue, it clearly has important geospatial dimensions. It also has societal and political dimensions, and we have seen the importance of societal and political reactions very clearly over the past two years of the COVID pandemic. Which developments of Big Data infrastructure will encounter significant push-back? Will there be significant resistance, for example, to the widespread use of autonomous vehicles because of safety concerns? Will significant privacy-protection legislation be adopted in the US? What aspects of Big Data infrastructure will raise issues of environmental justice or equity?