The technical approach of using mobile positioning data to support urban population size monitoring

This paper summarizes the methods and approaches of using mobile positioning data to estimate and monitor urban population size. It starts with the necessity of using big data to monitor urban population size in territorial spatial planning. Then it elaborates on the difference between the definition of "population size" reflected by mobile positioning data and the common concept of urban population size, and the necessity of verifying the logic and measurement of sample expansion at four levels. Finally, taking Wuhan city as a case, this paper proposes the technical approach of monitoring the size of urban permanent population through multi-source data verification. The study finds that when it comes to monitoring urban population size, mobile positioning data have the advantage of monitoring short-period changes in population size and spatial distribution, yet special attention must be paid to the three technical links of definition, sample expansion and verification.

Population size has long been a basic topic in urban and rural planning, and it is one of the most important reference data for territorial spatial planning. Planning of land use, public services, infrastructure and other fields is underpinned by population size. With the establishment of the territorial spatial planning system, the size of the permanent population and other major indicators have been included in the monitoring indicator system for the implementation of territorial spatial planning (Ministry of Natural Resources of the People's Republic of China [MNR], 2019).
The past four decades have witnessed massive population migration from rural to urban areas and from small and medium-sized cities to large and mega ones as a result of rapid urbanization. This has posed great challenges to the traditional demographic approach because of the large scale and high frequency of migration and the difficulty of estimating its temporal and spatial distribution. The national census is the most accurate demographic approach in China, but it is conducted only once every ten years. Between two censuses, results are obtained through annual sample surveys of 1‰ of the population, which makes it difficult to monitor massive and highly frequent population flows in a comprehensive and efficient way. The longer the interval from the previous census, the greater the possibility of errors. Taking Shanghai as an example, after the Sixth National Census, the permanent population at the end of 2009 was corrected from 19,213,200 to 22,102,800, a difference of 2,889,600 (see footnote 1).
With the development of information and communication technology, especially the increasingly wide availability of mobile Internet, mobile positioning data such as mobile phone signaling data and mobile Internet positioning data have been used in rural and urban planning, playing an effective role in such areas as regional and urban spatial structures (Niu et al., 2014; Wang et al., 2017), urban transportation (Zhang, 2016), job-housing spatial relationships (Niu & Ding, 2015; Song et al., 2019), urban center systems (Ding et al., 2016), and the provision of facilities and services (Niu & Li, 2019; Niu et al., 2019). As mobile terminal devices such as mobile phones are widely used and easy to carry around, real-time positioning of such devices provides a way to dynamically monitor the size and distribution of urban and rural populations. Mobile positioning data have thus been frequently used to estimate population size. Over the years, however, there has been an interesting phenomenon: no papers on the application of mobile positioning data to estimate population size have been published in peer-reviewed academic journals, while media outlets have covered many stories about how mobile positioning data are used to estimate the permanent population in several super-cities and mega-cities. This suggests that formidable technical problems remain to be solved before mobile positioning data can be used to measure population size. The news coverage, which requires no peer review, has been neither recognized by academia nor applied to planning practice. The technical approach of using mobile positioning data to measure the permanent population needs to be identified before these data can be applied to the dynamic monitoring of population size.
As an important part of the territorial spatial planning system, smart territorial spatial planning includes major links such as the dynamic monitoring of planning based on big data technology, and the relevant requirements have been included in the technical documents of territorial spatial planning (MNR, 2019). Therefore, it is necessary to conduct a systematic discussion of the technical obstacles to using mobile positioning data to monitor population size, identify current progress, difficulties and feasible solutions, and forecast future technology trends. Starting from the characteristics of mobile positioning data as a data source, this paper discusses the differences in the definition of "population size" reflected by mobile positioning data, focuses on the technical aspects of sample expansion and inspection, and defines the scope of application by sorting out the technical approaches of using mobile positioning data to support the monitoring of urban population size.
1 Consistency in the definition of population size

Definition of urban permanent population
Permanent population is an important statistical indicator in the current national census of China and is defined as "the population actually living in a place regularly for six months or longer." This definition focuses on people's real living needs and gives particular attention to the social context of rapid urbanization and mass population migration. Regarded as the most stable population size indicator, it provides more effective guidance on resource allocation in housing, infrastructure, environmental protection, healthcare, sports, cultural and recreational facilities based on people's livelihood needs (Shi et al., 2018). The definition is used in both territorial spatial planning and urban and rural planning.
The application of mobile positioning data to measure population size is based on the big data with positioning labels left by mobile communication devices when they are connected to the mobile communication network or mobile Internet. Each record intuitively captures the specific spatial location of a device user at a certain point in time. Through a range of temporal and spatial records of the device in use, the data partly restore the life traces of the user, making it possible to identify the social attributes of the user and estimate the total population meeting specific conditions. There are two challenges on the way from device identification to population estimation. Firstly, one device is not necessarily associated with one person. A person may have more than one mobile phone, and some people may not have their own mobile phones because of their financial status, lifestyle, age, etc. Therefore, the number of users identified is not equivalent to the population.
Secondly, how people use their devices does not necessarily reflect how they live their lives. Despite the increasingly important role of mobile phones and other electronic devices in our daily life, devices for special purposes are used at intervals and cannot cover the whole process of our life. The loss of some location records may disrupt the identification of device users. For example, it is impossible to identify a user's residence when his or her mobile phone is shut down at night. Even in an ideal world where "one person corresponds to one device," the way people use their devices still cannot be regarded as the same as the way they live their lives.
Footnote 1: According to the Shanghai Municipal Bureau of Statistics, the city's permanent population was 19,213,200 in 2009. Yet the Sixth National Census conducted in 2010 found that the permanent population of Shanghai was 23,020,000, a difference of nearly four million. Therefore, the permanent population at the end of 2009 was revised to 22,102,800.
Therefore, when measuring the permanent population size using mobile positioning data, it is necessary to convert the life behavior logic of "actually living regularly for six months or longer" into a computing logic of device use behavior, so as to sort out the users meeting the definition. In addition, a sample expansion from the number of users to the total permanent population of the city is needed.

Mobile positioning data-based measurement of "permanent population" must comply with the definition of "urban population size"

Long time series for the measurement of permanent population
Long time series are a basic requirement for using mobile positioning data to identify the permanent population, for two reasons. Firstly, mobile positioning data measure population size and identify location based on long-term use of devices. The population type and location of a user are determined by calculating the spatial and temporal trajectory of the device over many days and nights. For example, a person's residence can be estimated by identifying the place where he or she stays longest between 9:00 p.m. and 7:00 a.m., when people generally rest at home. With a short time series, occasional nights out will disturb the measurement of residence; with a long time series, their influence can be eliminated by calculating the repetition rate, thus improving the accuracy of estimation. Secondly, given the definition of permanent population, "[people] living [in a place] regularly for six months or longer," the urban population calculated from one to two weeks of data tends to be affected by short-term migration. The result will be inaccurate if people visiting the place temporarily for business, travel, etc. are included.
In contrast, long-period data can improve the accuracy of identifying the permanent population by allowing a longer time threshold. For example, the condition can be set as living in the place for more than 50% of the days during a period of six consecutive months.
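The residence rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used in the study: the record format (time-sorted pairs of timestamp and place identifier), the approximation of a stay as the gap to the next record in the same night, and the function names `night_key`, `nightly_residence` and `is_permanent_resident` are all hypothetical.

```python
from collections import Counter, defaultdict
from datetime import timedelta

NIGHT_START_HOUR = 21  # 9:00 p.m.
NIGHT_END_HOUR = 7     # 7:00 a.m. the next day


def night_key(ts):
    """Return the calendar date on which the night began, or None for daytime."""
    if ts.hour >= NIGHT_START_HOUR:
        return ts.date()
    if ts.hour < NIGHT_END_HOUR:
        return (ts - timedelta(days=1)).date()
    return None


def nightly_residence(records):
    """Map each observed night to the place with the longest recorded stay.

    `records` is a time-sorted list of (timestamp, place_id) pairs for one
    device. A stay is approximated (an assumption) as the time between a
    record and the next record that falls in the same night.
    """
    dwell = defaultdict(Counter)  # night -> place -> seconds
    for (ts, place), (next_ts, _) in zip(records, records[1:]):
        night = night_key(ts)
        if night is not None and night_key(next_ts) == night:
            dwell[night][place] += (next_ts - ts).total_seconds()
    return {night: c.most_common(1)[0][0] for night, c in dwell.items()}


def is_permanent_resident(records, city_places, total_nights, min_share=0.5):
    """Apply the 'more than 50% of the nights in the period' condition."""
    homes = nightly_residence(records)
    nights_in_city = sum(1 for place in homes.values() if place in city_places)
    return nights_in_city / total_nights > min_share
```

Dividing by the total number of nights in the period, rather than by the number of nights actually observed, means that a device with many unrecorded nights fails the test; this is one possible design choice and would interact with the sample expansion discussed later.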
It can be seen that the length of the time series has a significant impact on the results. Although long-period data are difficult to obtain, short time series should not be used as the basis for measuring the permanent population. The period of data collection should be as long as possible even if a full six months is not attainable. A reasonable and effective alternative is stratified sampling. When it is difficult to obtain long-term data on a continuous basis, the impact of contingency can be reduced and the reliability of measurement improved by sampling a number of days in each month of the six-month period.
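A minimal sketch of such stratified day sampling is given below, assuming that each calendar month is one stratum and that the same number of days is drawn at random from each. The function name `stratified_sample_days`, the default of five days per month and the fixed random seed are illustrative assumptions, not values from the study.

```python
import random
from datetime import date


def stratified_sample_days(start, months=6, days_per_month=5, seed=0):
    """Draw the same number of random days from each calendar month.

    `start` is any date in the first month of the period. Each month is
    treated as one stratum (an assumption made for this sketch).
    """
    rng = random.Random(seed)
    sample = []
    year, month = start.year, start.month
    for _ in range(months):
        next_year, next_month = (year + 1, 1) if month == 12 else (year, month + 1)
        n_days = (date(next_year, next_month, 1) - date(year, month, 1)).days
        picks = rng.sample(range(1, n_days + 1), days_per_month)
        sample.extend(date(year, month, d) for d in sorted(picks))
        year, month = next_year, next_month
    return sample


# Example: five days from each of the first six months of 2019.
sampled_days = stratified_sample_days(date(2019, 1, 1))
```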

Continuous positioning record for the measurement of permanent population
Continuous positioning can help improve the accuracy of permanent population identification supported by mobile positioning data. Positioning data are generated by the use of devices. The number and interval of positioning records, however, differ from day to day because devices have different positioning frequencies. In an ideal world, the positioning of a device occurs continuously and evenly over one day (24 h). Such continuity ensures that daily positioning records restore people's spatial and temporal trajectory as truly as possible, approximating their life logic (Fig. 1a). However, it is impossible to restore a complete life trajectory using existing mobile positioning data, because the continuity and evenness of daily positioning records cannot be guaranteed (Fig. 1b).
When the positioning records of a day are clustered in certain periods, the part of the life trajectory during which the device was not in use has not been recorded. This makes it difficult to restore the complete spatial and temporal behavior chain of device users, resulting in deviation in the identification of the permanent population. Therefore, however difficult it is to generate and acquire data sources, data with a low collection frequency cannot simply be used as they are. Instead, the daily positioning records should be distributed as continuously and evenly as possible over one day (24 h) by improving the algorithm and other means.
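One way to screen a day of records for continuity and evenness is sketched below. The two indicators (the share of the 24 hourly slots that contain at least one record, and the longest interval without any record) are assumptions chosen for illustration; the paper does not prescribe specific metrics, and the function names are hypothetical.

```python
from datetime import timedelta


def hourly_coverage(timestamps):
    """Share of the 24 hourly slots of a day with at least one record."""
    return len({ts.hour for ts in timestamps}) / 24


def largest_gap_hours(timestamps):
    """Longest interval of the day without a record, in hours.

    All timestamps are assumed to fall on the same calendar day; the gaps
    from midnight to the first record and from the last record to the next
    midnight are included.
    """
    if not timestamps:
        return 24.0
    ordered = sorted(timestamps)
    day_start = ordered[0].replace(hour=0, minute=0, second=0, microsecond=0)
    points = [day_start] + ordered + [day_start + timedelta(days=1)]
    return max((b - a).total_seconds() for a, b in zip(points, points[1:])) / 3600
```

A day resembling Fig. 1a would score a coverage close to 1 and a small largest gap, while a clustered day resembling Fig. 1b would score low coverage and a long gap; any cut-off values for accepting a day would have to be chosen for the data source at hand.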

The necessity of sample expansion
Long time series and continuous positioning are only prerequisites for using mobile positioning data to identify the permanent population. On the basis of a consistent definition, sample expansion is also required when mobile positioning data are applied to measure the urban permanent population. Unlike applications such as the identification of regional spatial structures and the establishment of urban center systems, which are based on relative population values (Ding et al., 2016; Wang et al., 2017), the measurement of population size describes the absolute value of the total population, for which sample expansion is a must. Moreover, sample expansion is still needed even for full sample data. Taking mobile phone signaling data as an example, all the data collected from the three major operators can be regarded as full sample data of mobile phone users, but the full sample of devices is not equivalent to that of the population, because some people do not use mobile phones and one person may have more than one phone. Therefore, full sample data is not necessary. In fact, data from multiple operators are adopted mainly for the purpose of comparison, not to skip sample expansion. When it comes to the measurement of population size, even a full sample still needs to be expanded.

Sample expansion at four levels from number of devices to population size
Now that the necessity of sample expansion has been confirmed, the specific process of sample expansion calls for more research. It is a popular practice in the industry to calculate the population simply from the number of devices and a single sample expansion coefficient (population after sample expansion = number of devices / K). However, such a simple practice ignores the complexity of sample expansion coefficients and makes it hard to spot or review any errors. To improve the accuracy and reliability of sample expansion, it is necessary to clarify the sample expansions at different levels. Specifically, there are four levels of sample expansion from the number of devices to the population size. Taking mobile phone signaling data as an example, the sample expansion at four levels covers five P values and four K values (Fig. 2).
In the data pre-processing stage, "the number of devices that identify permanent residence (P4)" is calculated. Using mobile positioning data obtained through long time series and continuous positioning, a proper algorithm can be developed to ensure accurate identification.
The calculation of "the number of active devices (P3)" at the first level (K3) aims to restore a considerable number of devices that are used irregularly, such as those that are turned off at night and therefore cannot identify permanent residence. In fact, with a precise definition of active devices, P3 can also be calculated directly from the big data through restrictive rules on time thresholds. The necessity of calculating P4 lies in the fact that it identifies the spatial locations of the permanent population and allows verification by comparing them with conventional statistical data for administrative spatial units.
The calculation of "the total number of devices (P2) of a particular operator" at the second level (K2) aims to restore a considerable number of inactive devices. Because those devices are rarely used, it is difficult to obtain regular spatial and temporal trajectories from them and thus to identify population characteristics. Even if each operator were able to count the total number of cards issued in a city, the mobility of mobile users and the use of cards in other cities, which is especially common after the cancellation of roaming charges, make it difficult to count the total number of devices that are actually used locally.
The calculation of "the total number of users of all operators (P1)" at the third level (K1) aims to convert the number of devices to the number of people. Consideration should be given not only to the market share of one operator, but also to the possibility of one person using multiple devices. The latter includes two scenarios: 1) when the multiple devices possessed by one person belong to the same operator, they can be combined by an appropriate location algorithm based on their spatial and temporal trajectories; 2) when they belong to different operators, calculations can only be made through inter-network communication and other means. By comparing the total amount of contact with each operator, the market share can be estimated at the same time.
The calculation of "the number of urban permanent population (P0)" at the fourth level (K0) aims to restore the considerable part of the population that does not use mobile communication devices. Despite the popularity of mobile phones today, a considerable part of the population, including the elderly, infants and children, does not or cannot use them, and there is no proper method to count these people for the time being.
In conclusion, among the above four levels of sample expansion, P4 and P3 can be accurately calculated based on continuous spatial and temporal positioning of mobile phone signaling data, and hence so can the value of K3. K2, K1 and K0, however, are very uncertain and may differ greatly with the level of regional economic development and with social and cultural characteristics. Selecting their values is the technical difficulty in applying mobile positioning data to calculate the permanent population. Because the coefficients are applied one after another, any error in them, however small, is amplified in the results. Therefore, the expansion coefficients of the four levels are the greatest challenge in measuring the size of the permanent population.
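The chained expansion and the way coefficient errors compound can be illustrated with a short sketch. It assumes that each level follows the same form as the single-coefficient formula quoted earlier, that is, each count is the lower-level count divided by its coefficient (K3 = P4/P3, K2 = P3/P2, K1 = P2/P1, K0 = P1/P0). The function name `expand` and all numeric values are hypothetical and are not taken from the Wuhan case.

```python
def expand(p4, k3, k2, k1, k0):
    """Expand P4 step by step to P0, dividing by one coefficient per level."""
    p3 = p4 / k3  # active devices
    p2 = p3 / k2  # all devices of the operator
    p1 = p2 / k1  # users of all operators
    p0 = p1 / k0  # urban permanent population
    return {"P4": p4, "P3": p3, "P2": p2, "P1": p1, "P0": p0}


# Hypothetical inputs, chosen only to show the arithmetic.
baseline = expand(p4=3_000_000, k3=0.80, k2=0.90, k1=0.45, k0=0.85)

# The same inputs with K2, K1 and K0 each underestimated by 5%.
biased = expand(p4=3_000_000, k3=0.80, k2=0.90 * 0.95,
                k1=0.45 * 0.95, k0=0.85 * 0.95)

# Ratio of the biased estimate to the baseline: (1 / 0.95) ** 3.
inflation = biased["P0"] / baseline["P0"]
```

Under these assumptions, three coefficient errors of 5% each inflate the population estimate by roughly 17%, which is why the text stresses that K2, K1 and K0 are the main difficulty.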

The necessity of verification
As mentioned above, errors in the identification of the permanent population arise directly if the calculation based on mobile positioning data is not conducted in strict compliance with the definition of population adopted in the traditional approach. Such errors are amplified if the complexity of the sample expansion system and the differences between geographical spaces are not taken into account, with a profound impact on the accuracy and reliability of population measurement. In the absence of an established sample expansion process, checking the results against other data is an effective way to improve the reliability of measurement. There are two approaches available. One is checking the big data against field survey sampling data, that is, comparing the census results of a small number of typical areas with the results of big data measurement. This approach is undoubtedly reliable and effective if the sampling of the typical areas is conducted in a strict and appropriate way. The problem is that even small-scale censuses are time-consuming.
The other is comparing the estimation results of mobile positioning data from two different sources to see whether the population changes and spatial characteristics reflected by them are identical. On this basis, taking conventional statistical data as a reference, it can be determined whether the trends and characteristics jointly reflected by multi-source data are consistent with the social and economic status quo. The advantage of this approach is that it requires less work. The key lies in the use of appropriate means to analyze and compare the multi-source data.

Case overview
Based on the case of Wuhan City, this study explores a feasible approach to cross-inspect and measure the permanent population based on multi-source data. In the case of Wuhan, there are three sets of data on its permanent population, one from the statistical yearbook and the other two obtained from mobile positioning data sources A and B (Table 1). The statistical yearbook published by the Wuhan Bureau of Statistics, using the traditional demographic method, shows that the permanent population of Wuhan was 11,081,000 at the end of 2018. The latter two, which take the first six months of 2019 as the calculation period, show that the permanent population of Wuhan was 11,310,000 and 13,340,000, respectively, a difference of more than two million.
In conclusion, there are significant differences between the sample expansion estimates based on mobile positioning data from different sources, and the big data-based measurement also differs from the traditional statistical data, but the spatial distribution patterns of the three are highly similar. Therefore, comparing, inspecting and revising the three sets of data in each spatial unit is a proper way to improve the reliability of big data-based measurement of the permanent population in the absence of accurately calculated expansion coefficients.

Inspection of change ratio, over-expectation ratio and difference ratio
(1) Change ratio. The change ratio is the ratio of the big data-based measurement to the published statistical data, which directly reveals the spatial difference in population distribution between big data observation and the statistical yearbook. Transverse comparison shows that the change ratios of the two data sources in each district present a similar rank, which suggests that population changes measured by different data sources are basically the same, and that for both sources the values differ strikingly between the "head" and the "tail". The change ratio in the head section is quite high, which means that the big data-based measurement is significantly higher than the traditional statistical result; this may be caused by an increase in permanent population or in device users. The head section, which includes Wuhan Donghu New Technology Development Zone, Wuhan Donghu Ecological Scenic Spot, Dongxihu District, and Hanyang District, mainly covers the suburban area. On the contrary, the tail section, which includes Qingshan District (the Chemical Industry Area), Xinzhou District, Huangpi District and Caidian District, mainly covers the outer suburbs of the city (Fig. 3).
In addition, the change ratio of the two kinds of big data in Jiang'an District is closest to 1, i.e., the big data-based measurement is the closest to that in the statistical yearbook. The population value of the statistical yearbook can be regarded as an expected value extrapolated from the census year and based on the pattern of past population changes. The value of big data-based sample expansion is calculated from the actual number of devices, the general behavior characteristics of device use, and the relationship between devices and people. The area with the change ratio closest to 1 can be seen as having the most stable population size and age structure, with the population size estimated by big data closest to the expected value.
(2) Over-expectation ratio. The over-expectation ratio is the ratio of the results measured by big data to the expected values derived from published statistics. Taking the change ratio of Jiang'an District, which is closest to 1, as the reference for population change, the expected growth coefficients of the two data sources are 1.04 and 1.18, respectively. The expected value of the permanent population is obtained by multiplying the published statistical data by the expected growth coefficient. In the "middle" section, which includes Jiangxia District, Wuhan Economic and Technological Development Zone, Hongshan District, Jiang'an District, Wuchang District, Jianghan District and Qiaokou District, the sample expansion results are all within ± 30% of the expected values and the difference between the over-expectation ratios of the two data sources is within ± 0.08. This indicates that the measurements calculated from the two kinds of big data in the seven districts are in line with theoretical expectations and differ only slightly, making it difficult to decide which measurement is more reliable. Meanwhile, there is some ambiguity in the classification of the two spatial units Jiangxia District and Qiaokou District, which lie next to the dividing lines as the first and the last of the "middle" section.
(3) Difference ratio. The difference ratio is the ratio of the measurements of the two kinds of big data, directly reflecting the difference between them. Ideally, the ratio should remain in a stable range. By comparing the sections above, it is found that the difference ratio of the "middle" section is basically within ± 5% of the mean value, which is regarded as an acceptable interval for the two kinds of calculations. In comparison, the difference ratio is high in Jiangxia District and low in Qiaokou District, i.e., there is a great difference in the calculation results of the two data sources among ten districts and counties. Furthermore, the difference ratio is low in Hanyang District and Dongxihu District in the "head" section, similar to the spatial units in the "tail" section. In other words, there is some ambiguity in the classification of Hanyang District and Dongxihu District.
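The three indicators can be computed per spatial unit as in the sketch below, which follows the definitions given in the text. The direction of the difference ratio (source A divided by source B) is an assumption, since the text does not state it, and the function name `cross_inspect`, the district labels and all population figures are hypothetical placeholders rather than values from Table 1.

```python
def cross_inspect(stats, source_a, source_b, reference):
    """Compute change, over-expectation and difference ratios per district.

    `stats`, `source_a` and `source_b` map district names to population
    values; `reference` is the district whose change ratio is closest to 1
    and therefore supplies the expected growth coefficient of each source.
    """
    growth_a = source_a[reference] / stats[reference]
    growth_b = source_b[reference] / stats[reference]
    rows = {}
    for district, published in stats.items():
        a, b = source_a[district], source_b[district]
        rows[district] = {
            "change_a": a / published,
            "change_b": b / published,
            # Expected value = published statistic * expected growth coefficient.
            "over_a": a / (published * growth_a),
            "over_b": b / (published * growth_b),
            "difference": a / b,  # assumed direction: A over B
        }
    return rows


# Hypothetical figures (in thousands of people), for illustration only.
stats = {"reference": 1000.0, "head_unit": 500.0, "tail_unit": 800.0}
source_a = {"reference": 1040.0, "head_unit": 700.0, "tail_unit": 640.0}
source_b = {"reference": 1180.0, "head_unit": 900.0, "tail_unit": 760.0}
ratios = cross_inspect(stats, source_a, source_b, reference="reference")
```

By construction, the over-expectation ratio of the reference district equals 1 for both sources, so the other districts are read relative to it.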

Measurement of permanent population
The cross-inspection results of the three indicators show significant differentiation between the head and the tail, as well as some spatial distribution characteristics. Considering the general rule of urban development, in the process of rapid urbanization, regional central cities and mega-cities tend to attract a large migrant population, who normally work in manufacturing and gather in the periphery and outskirts of the downtown area. This explains why the population calculated through big data for the "head" section is far larger than that in the statistical yearbook. Also, for these migrant workers, who belong to the active age group, the availability rate of smart phones and other electronic devices is higher than the average level across all age groups, which leads to the underestimation of K0 and K2 and a higher change ratio. The lower measurement should therefore be taken, as the higher one is more likely to be in error.
On the contrary, the core of the downtown area and the outer suburbs tend to suffer population outflow in the process of urban renewal and development, which explains why the population measured by big data in the above "tail" area is far less than that in the statistical yearbook. Also, the degree of aging in those two areas is higher than the city average because they attract few migrant permanent residents, which leads to a low availability rate of smart electronic devices and hence the overestimation of the sample expansion coefficient. Therefore, the higher measurement should be selected.
To sum up, the number of permanent devices calculated by the two sets of big data based on their respective sources and the sample expansion coefficient of the total population underlie the different change ratios for the head and tail. The approach of head-tail combination is adopted, in which a more credible sample expansion value is selected through the big data calculation at the "head," "middle," and "tail." Specifically, first, for the "head" spatial statistical units with a higher over-expectation ratio, the calculation result with the lower over-expectation ratio is selected, whereas for the "tail" the calculation result with the higher over-expectation ratio is selected; second, for the "middle" spatial statistical units with very close over-expectation ratios, the sample expansion result closer to the data in the statistical yearbook is selected; third, if it is difficult to determine from the two calculation results which is more reliable for a spatial statistical unit, the mean value is recommended.
According to the principle of "taking the lower for the head, the higher for the tail, and the average for the uncertain" in deciding the sample expansion value, different scenarios of value determination are set up, giving a value range for the permanent population of Wuhan between 11,980,000 and 12,310,000, with a recommended value of 12,090,000.
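The selection principle can be written down as a small decision function. This is a hedged sketch of the rule as stated in the text, not the authors' implementation: the function name `select_value`, the section labels and the representation of each source as a (value, over-expectation ratio) pair are assumptions, and how a unit is assigned to a section is left outside the sketch.

```python
def select_value(section, a, b, yearbook):
    """Pick the more credible population value for one spatial unit.

    `a` and `b` are (value, over_expectation_ratio) pairs from the two data
    sources; `section` is "head", "tail", "middle" or "uncertain".
    """
    if section == "head":
        # Take the result with the lower over-expectation ratio.
        return min(a, b, key=lambda r: r[1])[0]
    if section == "tail":
        # Take the result with the higher over-expectation ratio.
        return max(a, b, key=lambda r: r[1])[0]
    if section == "middle":
        # Take the result closer to the statistical yearbook value.
        return min(a, b, key=lambda r: abs(r[0] - yearbook))[0]
    if section == "uncertain":
        # Neither source is clearly more reliable: use the mean.
        return (a[0] + b[0]) / 2
    raise ValueError(f"unknown section: {section}")
```

Summing the selected values over all spatial units gives one scenario; treating ambiguous units under alternative classifications is one way to arrive at a value range rather than a single total.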

Case Summary
The "permanent population" of Wuhan in the conventional sense is calculated through the big data under the logic of "the number of people who reside in Wuhan city at night for more than 50% of the days during a period of six consecutive months, where the residence at night is the place where they stay longest between 9:00 p.m. and 7:00 a.m. the next day." In the absence of accurate sample expansion coefficients, traditional statistical yearbook data are used to check the sample expansion results based on mobile positioning data from different sources. The units with large differences and those with blurred classification are identified through the change ratio, the over-expectation ratio, and the difference ratio, based on which a more credible value is selected. The value range and the recommended value of the permanent population are eventually obtained from the calculations of different scenarios. During the process, the verification of multi-source data is an indispensable part, which helps improve the accuracy and reliability of mobile positioning data in measuring the permanent population.

Conclusion
The application of mobile positioning data to measure the permanent population is an effective tool for monitoring and evaluating the implementation of territorial spatial planning. In super-cities and mega-cities with large population inflow, mobile positioning data are especially suitable for monitoring the size of the permanent population when monitoring the implementation of territorial spatial planning. Compared with the conventional demographic approach, mobile positioning data have three advantages in monitoring the urban permanent population: first, they are suitable for dynamic monitoring because of the short update cycle; second, they are convenient, efficient and relatively inexpensive; third, they can monitor not only the size of the permanent population but also changes in the spatial distribution of population.
In the application of mobile positioning data to measure the size of the urban permanent population, the technical approach includes three parts: definition, sample expansion and verification. To begin with, the measurement of "permanent population" by mobile positioning data must comply with the definition of "urban population size". The consistency of the definition is the premise of the measurement. Secondly, sample expansion includes four levels, from the number of devices to the size of the permanent population. The sample expansion coefficients at the four levels are the key difficulties in measuring the permanent population. Finally, verifying the measurement obtained via mobile positioning data is an essential step. The verification underpins the accuracy of the measured permanent population.
The cross-inspection of measurements obtained from multi-source mobile positioning data proposed in this paper compares the results achieved through two data sources with the population in the statistical yearbook. By cross-inspecting the change ratio, the over-expectation ratio and the difference ratio, the value range and the recommended value of the permanent population are estimated through the head-tail combination method. Since there is no way to accurately determine the sample expansion coefficients at the four levels, cross-inspection of multi-source data can be used to measure the permanent population of mega-cities with large population inflow and to monitor the implementation of territorial spatial planning.

Discussion and Prospect
Firstly, artificial intelligence (AI)-based algorithms can be applied to sample expansion. At present, technical difficulties are seen in all three technical links of definition, sample expansion and inspection, and sample expansion is the most challenging one. Technical breakthroughs are required to address the challenges in sample expansion and determine the sample expansion coefficients at the four levels. The values of the three key coefficients, K0, K1 and K2, depend not only on the type of data source but also vary between cities. Right now, using machine learning and other AI technologies to learn from the behavior characteristics of device users seems a possible way to identify the values of K0, K1 and K2, but the relevant technologies are still under exploration.
Secondly, the use of mobile positioning data to measure the permanent population should not obsessively pursue an accurate total number. Given the difficulties in the relevant technical approaches, only an interval value of the permanent population can be achieved through sample expansion and verification. Therefore, when applying mobile positioning data to measure the urban permanent population, more attention should be paid to changes in the size and spatial distribution of population than to the accurate total number.
Thirdly, when it comes to the prospects for applying mobile positioning data to measure the permanent population, it should be stressed that the national census conducted every ten years is the most detailed and accurate method of obtaining population data, and no big data-based measurement can match it in detail and accuracy. Mobile positioning data cannot replace the traditional demographic approach, as the latter has its own justification and application scope. It is appropriate to use mobile positioning data to measure the permanent population in the years between two censuses; and for cities with large-scale population inflow and outflow, mobile positioning data can be used to improve the accuracy of population measurement.

Fig. 1 Continuous positioning records (illustrated by the author). (a) Equal-interval continuous positioning. (b) Unequal-interval positioning

Fig. 2 Diagram of sample expansion logic for using mobile positioning data to measure permanent population (illustrated by the author)

Fig. 3 Spatial distribution patterns of different change ratios at the head and tail (illustrated by the author). (a) Change ratio of data source A. (b) Change ratio of data source B

Table 1 Cross-inspection calculation results of multi-source data-based measurement of permanent population