A travel demand modeling framework based on OpenStreetMap

Notelaers, Lotte; Verstraete, Jeroen; Vansteenwegen, Pieter; Tampère, Chris M. J.

doi:10.1007/s44290-024-00020-y

Research
Open access
Published: 28 June 2024

Volume 1, article number 26, (2024)

Discover Civil Engineering

Abstract

Demand modeling is an important part of the setup of a traffic model for a city. All travel demand models rely on land use data as the demand for traveling fundamentally stems from activities occurring at different locations; however, many cities lack these data or the experience needed to estimate travel demand in their region. In response, this study develops a methodology for generating highly detailed land use data in the form of points of interest (POIs) specifically aimed at travel demand estimation purposes. The framework includes a procedure to extract, clean, enhance, and categorize freely available land use data from OpenStreetMap (OSM) into different POI categories, such as residences, schools, and shops. These residential and activity POIs, which are typical origins and/or destinations of trips, serve as the starting point for estimating travel demand. This paper demonstrates the framework’s utility through three case studies across different cities in Belgium. It validates the effectiveness of OSM-derived POIs for travel demand estimation by replicating Antwerp’s existing demand model, examines the POI classification’s suitability for various travel demand purposes in Leuven, and assesses the transferability of correlations between OSM data and travel demand from Antwerp to Ghent. Beyond the applications illustrated in this paper, the framework provides opportunities for future research on the consistent disaggregation of existing zonal demand estimates and design-based research in which future demand is estimated given the development of POIs. The framework is openly available as a Python tool called Poidpy.

1 Introduction

An essential component of a traffic model is the travel demand in the region. For modeling travel demand, there are multiple approaches in literature and practice, of which the best known are trip-based [1] and activity-based [2] demand modeling. Although these methods vary in their methodology, they all rely on land use data for estimating demand. This is because the demand for traveling fundamentally stems from activities occurring at different locations. Therefore, regardless of the method used, understanding the land use and distribution of activity locations or points of interest (POIs) is essential for accurately estimating travel demand within a region.

The traditional trip-based demand models, also known as the 4-step model, represent travel demand by estimating the number of trips originating and ending in a certain area, called production and attraction within a traffic analysis zone (TAZ) [1]. The reliance on these TAZs in traditional models stems from the use of spatial data available only at an aggregate level, typically derived from government-provided census data. However, the need for more detailed travel demand estimates becomes apparent, particularly for the planning of emerging and adaptable transportation services such as micro-mobility and demand-responsive systems. More detailed spatial data are required to accurately capture the complexities of the travel patterns these systems serve.

This paper provides a methodology for generating highly detailed land use data in the form of POIs specifically aimed at travel demand estimation purposes. For this purpose, the developed framework leverages OpenStreetMap (OSM) for gathering spatial data as it is free and available worldwide. The framework includes an automated process of extracting, cleaning and categorizing the OSM data in different types of POIs. The framework is generic in the sense that these classified POIs can be used as input for any type of demand model and can be applied to any region in the world.

The remainder of the paper is structured as follows. Section 2 provides background information about OSM, and states the contribution of this paper compared to the literature. In section 3, the developed framework is presented, and the methodology for collecting, cleaning and categorizing the data is explained. In section 4, the applicability of the framework is showcased through three case studies in which demand models are estimated, and their performance is evaluated. Finally, the paper ends with a discussion and conclusion.

2 Literature and contribution statement

Volunteer-based geographic information (VGI) systems like OpenStreetMap (OSM) have become increasingly popular. OSM is the most comprehensive open database that contains geographical features representing physical objects worldwide, such as roads, buildings, POIs and land use data [3]. Operating as a key-value database, each object in OSM is assigned a unique ID for identification, with additional information added through tags attached to the object. These tags consist of a key and an associated value [4]. For example, the tag “building” = ‘school’ indicates that the building is used for educational purposes.

In this way, OSM provides open access to POI information, revealing highly disaggregated geospatial data about the built environment and land use in a region. An advantage of OSM over other static data sources is that it is a living data source, ensuring it remains up to date while land use and mobility habits evolve. Moreover, the availability of these OSM data, at a highly disaggregated scale for any region of choice, can reduce dependency on heterogeneous governmental data sources, which simplifies the process and enhances its portability to other regions.

The global and cost-free accessibility of OSM, facilitated by VGI crowdsourced data, has benefited numerous applications in transport planning. For instance, previous studies have leveraged OSM data for the generation of a road network required in traffic simulation [5, 6]. OSM has also been used for travel demand estimation. Valdes et al. [7] used a subset of POI data from OSM in combination with demographic information to estimate the demand for electric charging. More recently, Klinkhardt et al. [8] developed and validated a methodology for calculating the attractiveness of travel demand models from OSM data and national trip generation guidelines. Similarly, Li et al. [9] employed OSM POIs and standard trip generation rates to estimate travel demand using the traditional 4-step model.

Despite its popularity, the reliability and fitness for use of crowdsourced data can be questioned as the quality depends on the diligence of the local contributors [10, 11]. There exists a large body of literature on quality assessment of VGI systems [12] and the International Organization for Standardization (ISO) has defined a set of measures for evaluating the quality of geographic data [13]. Relevant quality aspects of OSM include completeness (i.e., the extent to which the map data covers all geographic features within a given area) [14, 15], positional accuracy of features [16], topological consistency between features [17, 18], temporal quality [19], attribute completeness [17, 20], and attribute accuracy [20].

Research shows that building completeness exhibits notable variability across regions and urban contexts with relatively high completeness observed in Europe & Central Asia (71%) and North America (64%), while regions such as Latin America & Caribbean (20%), East Asia & Pacific (20%), Middle East & North Africa (12%), and South Asia (9%) exhibit lower completeness values [15]. Moreover, the latter study observed substantial variations in building completeness within individual urban areas, highlighting disparities even within the same cityscape [15].

Furthermore, the completeness of attributes remains low, even in areas with extensive coverage of building geometries. Overall, OSM includes more than 96,000 attributes and more than 150,000 distinct tags [4]. At most 20% of buildings were labeled with their building type, while 4.6% had tags indicating the number of levels, and merely 2.9% included height information [20]. The authors of [17] observed comparable rates of omissions in building types and noted that these omissions are significantly higher than those observed for road or rail types. Nonetheless, despite the low attribute completeness, accuracy reached up to 84.4% for building type and up to 72.2% for building levels [20]. The analysis of land use data in [21] revealed a similar trend indicating low completeness in most countries but relatively high accuracy.

On top of the low attribute completeness, there is a large difference in tag usage [17]. For example, 7868 different values are used for specifying the building type, and 99.9% of these values yield usage rates less than 0.5% [4]. This underscores the challenges in data completeness and heterogeneity within OSM, attributed to its VGI nature, where it relies on the contributions of local mappers and is subject to regional cultural and language differences.

While initial efforts have been made to use OSM data for travel demand generation, few publications address the challenges posed by low attribute completeness, large attribute variability and topological inconsistencies of OSM data for this application. Moreover, the highly heterogeneous quality of OSM buildings across regions has hindered the development of an automated process for acquiring detailed land use data from OSM for travel demand estimation.

This paper addresses these gaps by developing a framework for meticulous gathering, cleaning and categorizing OSM data in the form of POIs for application in any travel demand model. Previous methodologies (proposed by Klinkhardt et al. [8] and Li et al. [9]) select OSM data based on a specific selection of tags. This means that spatial objects that have incomplete labeling (which often occurs in OSM) might be ignored. To address this, the framework presented here selects objects by key only, rather than by full tags (i.e., key-value pairs). Later in our process, a tag list is used to drop specific objects and, in this way, limit the pollution in the dataset. Additionally, while previous efforts primarily focused on attractiveness in travel demand models, in this study, we also want to identify residential buildings (residential POIs) as they form the origins of most trips. Information is missing in OSM especially for residences, more so than for activity POIs, because the completion rate of the relevant tags is very low. Furthermore, inconsistencies in the multilayered dataset, such as overlapping building polygons or overrepresentation (for example, a shop can be included as a point with the tag “shop” = “supermarket” as well as a polygon with the tag “building” = ‘supermarket’), are resolved and flattened to one layer. In this layer, building polygons serve as the unit of analysis, to which the relevant information of the contained POIs and surrounding land use is added. This approach is an effective step to (1) take into account and categorize incompletely labeled or unlabeled objects and (2) avoid double counting.

Furthermore, our contribution lies in providing a fully automated, open-source Python toolkit for data gathering, cleaning, and categorization. In contrast, existing methods required manual data retrieval and multiple software tools for visualization. Klinkhardt et al. [8] proposed a manual methodology that requires the use of multiple available tools. In their suggestions for future research, they acknowledge the added value of an automated process for faster estimation of travel demand and for easier updating of existing models based on an updated OSM dataset. Moreover, automating the estimation process benefits portability to other regions. The toolkit includes a built-in module for data retrieval and allows customization of attributes and values based on regional differences. Furthermore, visualizations of the OSM data and model performance are integrated within the tool.

This paper shows the applicability of the framework by applying it in three case studies of three different cities in Belgium. The first case study validates the effectiveness of POI data generated from OSM for travel demand estimation by reproducing the travel demand of an existing trip-based demand model of Antwerp. Further, the correspondence between the chosen POI classification and different travel demand purposes is analyzed through a second case study on the city of Leuven. In the third case study, the model estimated for Antwerp is applied to Ghent to analyze to what extent the correlations between OSM data and travel demand found for one city can be extrapolated to a different but comparable region or city.

3 Poidpy framework

The methodology developed in this paper is incorporated in a tool called Poidpy. The core idea of Poidpy is that POIs, i.e., activity locations and residences, form the origins and destinations of trips and hence can be used to estimate travel demand within and between regions. The overall Poidpy framework is presented in Fig. 1.

The Poidpy framework consists of two submodules: POI data extraction and preprocessing, and POI categorization. They are, respectively, related to gathering and categorizing the data from OSM.

In the next subsections, the submodules in Fig. 1 will be discussed in more detail. Poidpy (see Footnote 1) is openly available as a Python package and includes code, documentation and example notebooks.

3.1 POI data extraction and preprocessing

This section describes the design choices made concerning the data downloaded from OSM as well as the preprocessing methods applied to attain a consistent dataset of POIs. An overview is given in Fig. 2. Four preprocessing steps are considered. First, relevant data are downloaded from OSM. Second, objects that are irrelevant for travel demand estimation are removed from the downloaded dataset. Third, inconsistencies in the data are resolved. Finally, the information of objects representing the same building in reality is combined. Moreover, this information is further enhanced by including details of the surrounding land use type. All four steps are discussed in more detail below.

To download the data, the OSMnx package [22] is utilized to interact with the OSM API. To query the OSM data, different parameters are specified in the OSM download module. Spatial data in OSM exist in three geometry types: points, linestrings and polygons. For identifying activity locations and residences, only points (e.g., a POI) and polygons (e.g., a building) are of interest and hence downloaded from OSM, assuming that a line feature, such as a street, is not a POI that contributes to travel demand.

Because of the low level of completeness in OSM and the large heterogeneity in tags, we did not choose to download objects based on specific key-value pairs (as in previous research), as this could result in an underestimation of objects in the study area. Instead, the approach developed in this paper extracts all the information from OSM within the specified study area that is relevant for classifying residences and activity locations.

Nonetheless, downloading all relevant information from OSM while limiting the inclusion of irrelevant data is challenging because of the large heterogeneity in tag usage. Therefore, a selection of relevant keys (without specified values) is passed on as a parameter in the OSM download module instead of as a list of specific tags. The keys used are landuse, building, amenity, shop, office, leisure, sport and tourism. These are among the most commonly used attributes in the OSM database [23] and were chosen based on previous research [24]. This process results in a multilayered dataset (visualized in Fig. 3a) that includes points representing POIs, polygons representing buildings, and polygons providing general information on land use. Via this approach, unlabeled buildings can also be classified using the additional information extracted from contained POIs or surrounding land use polygons.
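The key-only query described above can be sketched in Python as follows. This is a minimal illustration and not Poidpy's actual code: the helper names `build_download_tags` and `download_features` are hypothetical, and the sketch assumes a recent OSMnx release that provides `features_from_place` (older releases exposed the same query as `geometries_from_place`). In OSMnx, mapping a key to `True` requests every object carrying that key, whatever its value. Note that the OSM key for land use is spelled `landuse`.

```python
# Keys named in the text; the list itself comes from the paper.
RELEVANT_KEYS = [
    "landuse", "building", "amenity", "shop",
    "office", "leisure", "sport", "tourism",
]


def build_download_tags(keys):
    """Map each key to True so that any value of that key is matched."""
    return {key: True for key in keys}


def download_features(place, keys=RELEVANT_KEYS):
    """Download point and polygon features for a place name (hypothetical helper).

    Requires network access and the osmnx package.
    """
    import osmnx as ox  # imported lazily so the helper above runs without OSMnx

    gdf = ox.features_from_place(place, build_download_tags(keys))
    # Keep points and polygons only; line features are not treated as POIs.
    keep = ["Point", "Polygon", "MultiPolygon"]
    return gdf[gdf.geom_type.isin(keep)]


tags = build_download_tags(RELEVANT_KEYS)
```

A call such as `download_features("Leuven, Belgium")` would then return one GeoDataFrame holding the multilayered data, from which irrelevant objects still have to be dropped.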

Only in the subsequent step, a list of tags to ignore is used to drop objects that are irrelevant for travel demand estimation from the downloaded dataset. The ignore-tags might, for example, include “landuse” = ‘flowerbed’, “building” = ‘garage’, “amenity” = ‘waste_basket’, and “leisure” = ‘outdoor_seating’, as these objects do not represent residential or activity locations. An example of a set of dropped data points is shown in Fig. 3b. The full list of attribute-value pairs that are ignored in the case studies presented in this paper is shown in Appendix A, Table 9.
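A small sketch of this dropping step is given below. The four ignore-tags are the examples quoted in the text; the object records and the function name `drop_ignored` are made up for illustration. The sketch assumes that an object is dropped as soon as any one of its tags appears in the ignore list, which the text does not state explicitly.

```python
# Example ignore-tags quoted in the text (the full list is in Appendix A, Table 9).
IGNORE_TAGS = {
    ("landuse", "flowerbed"),
    ("building", "garage"),
    ("amenity", "waste_basket"),
    ("leisure", "outdoor_seating"),
}


def drop_ignored(objects, ignore_tags=IGNORE_TAGS):
    """Return the objects that carry none of the ignored key-value pairs."""
    kept = []
    for obj in objects:
        # Assumption: a single matching tag is enough to drop the object.
        if any((key, value) in ignore_tags for key, value in obj["tags"].items()):
            continue
        kept.append(obj)
    return kept


# Hypothetical downloaded objects.
objects = [
    {"id": 1, "tags": {"building": "yes"}},
    {"id": 2, "tags": {"building": "garage"}},
    {"id": 3, "tags": {"amenity": "waste_basket"}},
    {"id": 4, "tags": {"shop": "supermarket", "building": "retail"}},
]
kept_ids = [obj["id"] for obj in drop_ignored(objects)]
print(kept_ids)  # [1, 4]
```

Filtering after a broad download, rather than querying specific tags, keeps unlabeled buildings such as object 1 in the dataset.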

Ideally, the multilayered spatial data extracted from OSM are accurate and consistent; unfortunately, this is not the case. There could be faulty or inaccurate mappings that make the layer spatially inconsistent. In the third step, two types of inconsistencies are handled: contained polygons and overlapping polygons. Two types of polygon features are distinguished: (1) objects with a value for the building key (building polygons or, in short, buildings) and (2) objects without a value for the building key but with a value for the landuse key or other tags referring to the land use within the polygon (called contour polygons). The contour polygons are selected based on the tag list included in Table 10 in the Appendix. Examples of buildings and contour polygons are shown in Fig. 4.

Examples of overlapping and contained building and contour polygons are shown in Fig. 5. For buildings, it is impossible to have more than one structure at the same location. Having another function inside a building is possible, but the structure as such consists of only one building or two individual nonoverlapping buildings. Similarly for contour polygons, an area with a specific land use should not contain a zone with another land use, since this implies a new land use consisting of the combination of the others which is an unwanted situation. To resolve these inconsistencies and avoid double counting, two possibilities are put forward: (1) removing the contained or smallest polygon, keeping only the larger polygon or (2) cutting out the contained or smallest polygon from the larger one, keeping both as individual nonoverlapping polygons in the dataset. The contained buildings are removed. For overlapping buildings, the smallest polygon is always cut from the largest polygon. The same pragmatic approach is used for overlapping and contained contour polygons: the smallest or contained polygon is always cut out from the largest or surrounding contour polygon. In this way, the most informative land use type is used to infer the function of buildings located inside the contour polygon.
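The two resolution rules can be illustrated with a toy example. To keep the sketch free of external dependencies, footprints are represented as sets of unit grid cells instead of true polygons; this is a simplification for illustration only. With real geometries, the same logic would typically rely on a geometry library such as Shapely (its `contains` and `difference` operations). The function names `rect` and `resolve` are hypothetical.

```python
def rect(x0, y0, x1, y1):
    """Footprint as the set of unit grid cells covering [x0, x1) x [y0, y1)."""
    return frozenset((x, y) for x in range(x0, x1) for y in range(y0, y1))


def resolve(footprints, drop_contained):
    """Make footprints non-overlapping, processing them from smallest to largest.

    drop_contained=True:  a footprint lying fully inside a larger one is removed
                          (rule used for buildings).
    drop_contained=False: it is cut out of the larger one instead
                          (rule used for contour polygons).
    Partial overlaps are always cut: the smaller footprint keeps the shared cells.
    """
    ordered = sorted(footprints.items(), key=lambda item: len(item[1]))
    result = {}
    for i, (name, cells) in enumerate(ordered):
        larger = [other for _, other in ordered[i + 1:]]
        if drop_contained and any(cells <= other for other in larger):
            continue  # contained in a larger footprint: drop it
        # Cut out the cells already claimed by smaller footprints that were kept.
        for claimed in result.values():
            cells = cells - claimed
        if cells:
            result[name] = cells
    return result


shapes = {
    "A": rect(0, 0, 4, 4),  # 16 cells
    "B": rect(1, 1, 2, 2),  # 1 cell, contained in A
    "C": rect(3, 3, 6, 5),  # 6 cells, overlaps A in one cell
}
building_areas = {n: len(c) for n, c in resolve(shapes, drop_contained=True).items()}
contour_areas = {n: len(c) for n, c in resolve(shapes, drop_contained=False).items()}
print(building_areas)  # {'C': 6, 'A': 15}
print(contour_areas)   # {'B': 1, 'C': 6, 'A': 14}
```

In both cases the total area is counted once: treated as buildings, B disappears and A loses only the cell shared with C; treated as contours, B survives as its own zone and is cut out of A.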

As a final cleaning step, a threshold for a minimal building surface is specified. For the Belgian case studies considered in this paper, this threshold is set to 40 m² because permission is needed for constructing buildings in Belgium that are larger than this threshold. In this way, most garden sheds and garages are successfully removed from the building layer, as shown in Fig. 6.

Finally, although already consistent, the multilayered dataset still needs to be flattened to only one layer where all the object information is combined. For this purpose, the procedure considers the building polygons as the unit of analysis and adds all the information to these polygons. This is because most activities (including residential activity) are hosted inside a building. One exception is outside leisure activities such as soccer pitches or running tracks. As these will also attract trips, they are taken into account as a separate category in the categorization module (which will be explained in more detail in the next subsection).

The flattening of the multilayered data is an important contribution of the framework developed in this paper. It is important for two reasons: (1) to avoid double counting and (2) to be able to classify unlabeled buildings. First, places for certain activities are often included in OSM both as points and polygons (see Fig. 7a). It also happens, although less frequently, that an activity is included as an extra polygon in addition to the building polygon. Second, combining information from the different layers enables the classification of unlabeled buildings. For example, in Fig. 7, buildings with the tag “building” = ‘yes’ can be classified by the information of the points located within (Fig. 7a) and based on the land use polygons around it (Fig. 7b). For both reasons, the information of points and nonbuilding polygons lying within the building polygons and of the surrounding contour polygons is added to the building polygons. The exact use of this combined information for classifying the building polygons in the subsequent categorization step is explained in the next section.
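The flattening idea can be sketched for the point layer as follows. This is an illustrative simplification, not the Poidpy implementation: it handles only points (contour polygons would be attached in a similar way), uses a plain ray-casting test that ignores boundary cases, and all names and records are hypothetical.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test; polygon is a list of (x, y) vertices. Boundary cases are not handled."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal line through the point
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def flatten(buildings, points):
    """Attach tags of contained points to their building; return (buildings, leftover points)."""
    flat = [
        {"id": b["id"], "polygon": b["polygon"], "tags": dict(b["tags"]), "inner_tags": []}
        for b in buildings
    ]
    leftovers = []
    for p in points:
        host = next((b for b in flat if point_in_polygon(p["xy"], b["polygon"])), None)
        if host is None:
            leftovers.append(p)  # e.g. outdoor leisure, handled as a separate category
        else:
            host["inner_tags"].append(p["tags"])  # the point is absorbed, not counted twice
    return flat, leftovers


# Hypothetical data: one unlabeled building, one shop point inside it, one pitch outside.
buildings = [{"id": 1, "polygon": [(0, 0), (10, 0), (10, 10), (0, 10)], "tags": {"building": "yes"}}]
points = [
    {"xy": (5, 5), "tags": {"shop": "supermarket"}},
    {"xy": (20, 20), "tags": {"leisure": "pitch"}},
]
flat, leftovers = flatten(buildings, points)
```

After flattening, the building tagged only “building” = ‘yes’ carries the supermarket information of the point inside it, so it can be classified while the shop is counted once.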

3.2 POI categorization

The next step uses these preprocessed data to identify activities and residential locations (called POIs hereafter) that attract or produce trips and categorize them. The following categorization was used: Small residential, Large residential, Health, Services, Shops, Industry, Catering, Leisure, Leisure areas, School, Tourism and Others. To identify the POIs, two separate procedures are developed for residential POI categories and activity POI categories (as discussed in Sects. 3.2.1 and 3.2.2, respectively). Both procedures assign a probability between zero and one, indicating the likelihood of any POI belonging to the POI categories. Note that buildings can be of mixed use and can receive a probability for both a residential and an activity category and that their sum can be greater than one. Once the probabilities are assigned, these values are multiplied by the polygon area (in square meters) to take the size of the residence and activity location into account. In this way, not only the existence of a POI of a certain type but also its degree of attractiveness or trip-generating potential is taken into account.
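The area weighting amounts to a simple product per category, as the following sketch shows. The building record and its probabilities are invented for illustration, and the function name `weighted_scores` is hypothetical.

```python
def weighted_scores(probabilities, area_m2):
    """Multiply each category probability by the polygon area (in square meters)."""
    return {category: p * area_m2 for category, p in probabilities.items()}


# Hypothetical mixed-use building: the probabilities sum to more than one,
# which the framework allows for residential plus activity categories.
building = {
    "area_m2": 250.0,
    "probabilities": {"Large residential": 0.5, "Shops": 1.0},
}
scores = weighted_scores(building["probabilities"], building["area_m2"])
print(scores)  # {'Large residential': 125.0, 'Shops': 250.0}
```

Summing such scores per category over all buildings in a zone would give zonal POI quantities that can serve as explanatory variables in a demand model.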

Note that OSM has a building-level tag, but because it is rarely used (in approximately 2% of all polygons in OSM), it is not considered [4, 20]. However, as explained below, we heuristically increase the area inside city centers for some activities to account for the effect that buildings inside city centers tend to be multistoried buildings.

3.2.1 Residential function

To obtain all the residential POIs, the categorization procedure starts from a layer containing only building polygons. Fig. 20 in Appendix B depicts the full procedure that is used to assign a residential function to a building depending on the available information.

The process starts by selecting geometries with specific attribute values directly referring to residences. The considered attribute-value pairs are, for example, “building” = ‘house’ classified as Small residential with a probability of one and “building” = ‘apartments’ with a probability of one for Large residential. The full list of considered attribute-value pairs is presented in Table 11 in Appendix B. Similarly, buildings that clearly have nonresidential purposes according to tags such as “building” = ‘church’, “amenity” = ‘university’, “leisure” = ‘sports_centre’, “office” = ‘government’ and “tourism” = ‘hotel’ are labeled nonresidential. These buildings are selected based on the tag list presented in Table 12.

However, many features (for example, 92.2% of the buildings presented in Fig. 8) lack detailed information to immediately label them residential or nonresidential, such as features with the tag “building” = ‘yes’. Moreover, buildings can be of mixed use. A clear example is buildings in the city center that have a shop on the ground floor and apartments on the floors above. Therefore, if the labeling of objects is incomplete, some additional basic decision rules are used, which are briefly explained below.

One strategy used to classify buildings is to look at the surrounding contour polygons, providing information on the land use. For this purpose, three types of land use contours are differentiated: nonresidential, unlikely residential and residential. On the one hand, features inside a nonresidential land use contour are classified as nonresidential (zero probability), as these features are highly unlikely to be residences. An example of a nonresidential land use contour is “landuse_outer” = ‘cemetery’. Other nonresidential land use contours are identified according to the tags presented in Table 13 in Appendix B. On the other hand, buildings located inside unlikely residential land areas (e.g., “landuse_outer” = ‘farmyard’, “landuse_outer” = ‘retail’ and others presented in Table 14) have a small probability of having a residential function because they are most likely to represent an activity POI but could also be of mixed use and hence host a residence. Finally, buildings inside residential land use polygons (“landuse_outer” = ‘residential’) are more likely to be purely residential. Nonetheless, because they could also be of mixed use, this rule is also combined with other rules (extra tags, polygon area) in the categorization algorithm to define the residential probability.

Extra tags: Having extra tags (e.g., “shop”, “amenity”, “office”) in addition to a building tag decreases the likelihood of a building being a residence.
Polygon area: Buildings larger than the maximal residential building area threshold are less likely to be residences. As the average size of a house varies by region and country, this threshold is a changeable parameter in the tool (here 600 m²).
City center: Buildings inside the city center are more likely to be of mixed use.
Inner point or polygon info: when a building intersects with another feature (point or polygon), its likelihood of being solely residential decreases. By considering the tags of these intersecting features, in combination with the other rules, a probability for the residential function is assigned.

As a result, all buildings will receive a probability representing the likelihood of the building being a Large or Small residence. An example of such a result is visualized in Fig. 8.
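How such decision rules might combine is sketched below. The rule structure follows the text, but every numeric weight is a placeholder chosen purely for illustration; none is taken from the paper, whose full procedure is given in Fig. 20 in Appendix B. Only the 600 m² area threshold and the example tags are quoted from the text. The function and constant names are hypothetical.

```python
MAX_RESIDENTIAL_AREA_M2 = 600  # threshold quoted in the text (changeable parameter)

# Example tags quoted in the text; the complete lists are in Appendix B.
NONRES_TAGS = {("building", "church"), ("amenity", "university"), ("tourism", "hotel")}
NONRES_CONTOURS = {"cemetery"}
UNLIKELY_CONTOURS = {"farmyard", "retail"}


def residential_probability(building):
    """Return an illustrative residential probability; all weights are hypothetical."""
    tags = building["tags"]
    if tags.get("building") in ("house", "apartments"):
        return 1.0  # directly labeled as a residence
    if any((k, v) in NONRES_TAGS for k, v in tags.items()):
        return 0.0  # clearly nonresidential building
    contour = building.get("landuse_outer")
    if contour in NONRES_CONTOURS:
        return 0.0  # inside a nonresidential land use contour
    # Placeholder starting values per surrounding land use type.
    if contour in UNLIKELY_CONTOURS:
        p = 0.2
    elif contour == "residential":
        p = 0.9
    else:
        p = 0.6
    # Placeholder penalties for the additional rules listed in the text.
    if any(key in tags for key in ("shop", "amenity", "office")):
        p *= 0.5  # extra tags
    if building["area_m2"] > MAX_RESIDENTIAL_AREA_M2:
        p *= 0.5  # polygon area above the residential threshold
    if building.get("in_city_center"):
        p *= 0.8  # more likely to be of mixed use
    if building.get("inner_tags"):
        p *= 0.5  # intersecting points or polygons
    return round(p, 3)


house = {"tags": {"building": "house"}, "area_m2": 120}
unlabeled = {"tags": {"building": "yes"}, "landuse_outer": "residential", "area_m2": 150}
p_house = residential_probability(house)
p_unlabeled = residential_probability(unlabeled)
```

The point of the sketch is the ordering of the rules: explicit tags decide first, the surrounding contour sets a starting value, and the remaining rules only adjust it.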

3.2.2 Activity categories

This section describes the approach for classifying POIs according to their activity type. The following nine activity categories are considered: School, Health, Services, Industry, Catering, Shops, Leisure, Tourism and Others.

The algorithm starts by classifying each building according to all its attribute-value pairs (see Footnote 2), including the tags derived from the information of contained or surrounding points and polygons. All considered pairs associated with any activity category are shown in Table 15 in Appendix C. Some examples of classifications are listed in Table 1. If a building has multiple tags that are part of different activity categories, this building receives a probability for each of these categories. For example, a stadium is assigned to both the Leisure and Tourism categories. In the case studies presented in this paper, all activity categories receive equal probability.

Table 1 Examples of considered tags per activity category
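The tag-based activity classification can be sketched as a lookup from tags to categories. The mapping below is illustrative only: apart from the stadium example (Leisure and Tourism), the entries are plausible guesses rather than rows copied from Table 1 or Table 15. The sketch also reads "equal probability" as equal shares that sum to one across the matched categories, which is one possible interpretation of the text.

```python
# Illustrative tag-to-category lookup (hypothetical entries, except the stadium case).
ACTIVITY_TAGS = {
    ("amenity", "school"): {"School"},
    ("shop", "supermarket"): {"Shops"},
    ("amenity", "restaurant"): {"Catering"},
    ("leisure", "stadium"): {"Leisure", "Tourism"},
}


def activity_probabilities(tags):
    """Return equal probabilities over all activity categories matched by the tags.

    `tags` is an iterable of (key, value) pairs, including pairs inherited
    from contained points or surrounding polygons.
    """
    matched = set()
    for tag in tags:
        matched |= ACTIVITY_TAGS.get(tag, set())
    if not matched:
        return {}
    share = 1 / len(matched)  # assumption: equal shares summing to one
    return {category: share for category in sorted(matched)}


stadium = activity_probabilities([("building", "yes"), ("leisure", "stadium")])
print(stadium)  # {'Leisure': 0.5, 'Tourism': 0.5}
```

Passing tags as a list of pairs rather than a dictionary allows the same key to appear several times, for example when a building inherits a "shop" tag from more than one contained point.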
