Cameras have become part of the urban landscape and a testament to our social interactions with the city. They are deployed on buildings and streetlights as surveillance tools, carried daily by billions of people, and embedded as assistive technology in vehicles with different levels of self-driving capability. We rely on this abundance of images to interact with the city.
In fact, 2.5 quintillion bytes of data are created each day by billions of Internet users. Increasingly, social media rely heavily on visual data. Among the top social media channels, several are overwhelmingly or exclusively image-based: YouTube has 1.5 billion users and Instagram has 1 billion, compared with Facebook's 2.3 billion. Such visually based social interactions extend to the interactions we have in our cities. In the USA, a person is caught on camera 75 times per day on average, and over 300 times in London. Disruptive urban technologies such as autonomous vehicles also rely on cameras. The challenge is to make sense of the visual data generated daily in our cities in meaningful ways, beyond surveillance purposes.
In this chapter, we are not interested in the abundance of visual data collected by individuals and widely available online on social media. Previous work has used geotagged photographs available online to measure urban attractiveness (Paldino et al. 2016), to assess the aesthetic appeal of the urban environment from user-generated images (Saiz et al. 2018), and to compare the visual heterogeneity of different cities around the world (Zhang et al. 2019). The focus of this chapter is not on the visual data produced by cameras carried by people for personal use, but rather on the images collected by cameras specifically designed and deployed to gather visual data about the city, which we call here urban cameras.
Cameras deployed and controlled by a range of public and private organizations number in the tens of thousands in cities from London and Beijing to New York and Rio de Janeiro. As an example, a Londoner is captured on camera more than 300 times every day, and during the same period the UK records over 30 million license plate numbers (Kitchin 2016). Additionally, private companies such as Google collect and make available online hundreds of thousands of images of hundreds of cities worldwide.
Making sense of such large visual datasets is key to understanding and managing contemporary cities. Many technical issues must still be solved to make these huge visual datasets actionable. Challenges include cloud versus local storage and processing; architecture integration, ontology building, semantic annotation, and search; and online real-time analysis and offline batch processing of large-scale video data (Shao et al. 2018; Xu et al. 2014; Zhang et al. 2015).
Besides the technical challenges, there are also ethical issues. The concern most prevalent among social scientists is the narrow understanding of cities that results when urban phenomena are equated with available data, leading to the operationalization of the urban (Luque-Ayala and Marvin 2015), particularly when “portions of the urban public space that are shadowed by the gaze of private cameras and security systems” (Firmino and Duarte 2015, p. 743) become subject to the datafication of the city, often leading to “social sorting and anticipatory governance” (Kitchin 2016, p. 4). Closed-circuit television (CCTV), deployed in public areas to assist police patrols with crime prevention, uses video analytics to identify abnormal behaviors, fosters predictive policing through the profiling of subjects and places, and frequently triggers false alarms due to biases embedded in the algorithms (Vanolo 2016).
We are aware of these issues and have ourselves contributed to the literature on the risks of oversurveillance based on the abundance of data about people’s behavior in public spaces. In this chapter, however, we would like to discuss the other side of this phenomenon: how novel computational techniques can be used to make sense of the huge amount of visual data generated about cities, and how the results reveal aspects of urban life that can contribute to a better understanding and design of cities.
The projects discussed in this chapter are part of the extensive work using urban cameras done by the Senseable City Lab, at the Massachusetts Institute of Technology. These works can be divided into two types: the use of visual urban data available online, and the capture of visual data by the Lab with specifically designed devices.
In the first type, we take advantage of visual urban data available online and develop machine learning techniques to make sense of it. The dataset used in this research is Google Street View imagery, which we have been using to quantify the green canopy in urban areas, a critical aspect of rapidly urbanizing cities, with a standard method that can be deployed cheaply and makes comparisons among hundreds of cities worldwide possible. At the same time, it provides a fine-grained analysis of greenery at the street level, allowing citizens and municipalities to assess tree coverage in different neighborhoods.
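The core of such a measurement is the fraction of each street-level image occupied by vegetation. As an illustrative sketch only, the snippet below estimates that fraction with a naive color heuristic; the function name, thresholds, and synthetic image are our own assumptions, and production pipelines rely on far more robust methods such as semantic segmentation networks.

```python
import numpy as np

def green_view_index(image: np.ndarray) -> float:
    """Fraction of pixels classified as vegetation in an RGB image.

    Illustrative heuristic only: a pixel counts as 'green' when its
    green channel dominates both its red and blue channels.
    """
    r = image[..., 0].astype(int)
    g = image[..., 1].astype(int)
    b = image[..., 2].astype(int)
    green_mask = (g > r) & (g > b)
    return float(green_mask.mean())

# Synthetic street-level image: top half sky (bluish), bottom half foliage
img = np.zeros((100, 100, 3), dtype=np.uint8)
img[:50] = (80, 120, 200)   # sky: blue dominates
img[50:] = (40, 140, 60)    # canopy: green dominates
print(green_view_index(img))  # -> 0.5
```

Averaging such per-image scores over all panoramas sampled along a street yields a single comparable greenery metric per street segment, which is what makes city-to-city comparisons possible.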
In the second type, we design specific devices to collect images and deploy them ourselves. In one example, we started by using thermal cameras mounted on vehicles to measure heat leaks in buildings. Using the same devices, we then developed techniques that use thermal data to quantify and track people’s movements in indoor and outdoor areas. Besides the method’s technical advantages in data transmission and processing, it also addresses an important concern about the use of cameras in public spaces: thermal cameras give us accurate data about people’s behavior without revealing their identities, thereby avoiding privacy concerns. Also as part of this type of research, we address the problem of indoor navigability in large public areas. Users often have difficulty navigating spaces such as shopping malls, university campuses, and train stations, due either to their labyrinthine design or to the repetitiveness of their visual cues. Here, we collected thousands of images on the MIT campus and in train stations in Paris and trained a neural network to measure how easy these spaces are to navigate, comparing the results with a user survey.
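The privacy-preserving property of thermal sensing can be made concrete: people appear as warm blobs in a low-resolution temperature grid that carries no identifying features, so counting them reduces to finding connected warm regions. The sketch below illustrates this idea under our own assumptions; the threshold, minimum blob size, and synthetic frame are illustrative values, not the calibrated parameters of any deployed system.

```python
import numpy as np
from scipy import ndimage

def count_people(frame: np.ndarray, body_temp_c: float = 30.0,
                 min_pixels: int = 4) -> int:
    """Count warm blobs in a thermal frame (degrees Celsius per pixel).

    Thresholds are illustrative assumptions: pixels warmer than
    body_temp_c are grouped into connected regions, and regions
    smaller than min_pixels are discarded as sensor noise.
    """
    warm = frame > body_temp_c
    labeled, n = ndimage.label(warm)            # connected warm regions
    sizes = ndimage.sum(warm, labeled, range(1, n + 1))
    return int((np.asarray(sizes) >= min_pixels).sum())

# Synthetic 16x16 frame: 20 C background with two warm 3x3 'people'
frame = np.full((16, 16), 20.0)
frame[2:5, 2:5] = 33.0
frame[10:13, 9:12] = 34.0
print(count_people(frame))  # -> 2
```

Because only aggregate counts and trajectories of anonymous blobs leave the sensor, the approach also reduces the bandwidth and storage burden compared with transmitting full video.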
Visual data about cities will only increase in the coming years, with the personal photographs and videos that people post on social media to register their daily urban routines, the deployment of cameras not only for policing but also for traffic management and infrastructure monitoring, and the crucial role visual data will play in technologies such as self-driving cars. All work dealing with visual big data must overcome the impossibility of manually processing this massive amount of information and must generate useful empirical metrics on visual structure and perception. In this chapter, we discuss how the development of novel computational methods for analyzing the abundance of visual urban data can help us better understand urban phenomena.