Urban photograph localization using the INSTREET application—accuracy and performance analysis
The paper proposes a solution to the problem of geolocating photographs: an algorithm compares their content against a geolocated database of street view images, making it possible to pinpoint the location where a photograph was taken. The algorithm is based on MPEG-7 features. The paper also describes the results of optimizing the performance of the algorithm and analyzes its accuracy. We show that the algorithm scales with the size of the reference database at least up to 130 km², which was the largest urban area we tested the algorithm on.
Keywords: Street view, Geotagging, Geolocalization, Urban environment, Safety
1 Introduction
Many cameras available today feature a GPS module. It makes it possible to add the geographic coordinates of the location where the photograph was taken to its metadata (EXIF). However, if such information is either unavailable or has been removed from the photograph metadata, the photograph’s location may be difficult or impossible to determine.
This problem is frequently encountered by police officers during their investigations. If the outcome of the investigation depends on the geolocation of the photograph, the problem becomes crucial. In this case, the only way to geolocate the photograph is to hire analysts whose job it is to manually search for the most likely location. This work is extremely labour-intensive and has a low probability of success.
The initial version of this algorithm was described in [3]. This paper outlines the principles of the INSTREET algorithm, presents the optimizations which have been added since the first version of the algorithm was published, and shows new and extended results of accuracy tests. Our algorithm is based on low-level image features and uses MPEG-7 descriptors for the task. Data from a street view application is used as a source database of reference images. We show that our algorithm is able to pinpoint the location of the photograph based on a single characteristic detail, at the cost of the high computing power required for the task. We aim for an accuracy that allows a single building to be identified, that is approx. 15 m. We also show that the algorithm scales well as the area covered by the reference database is increased.
The problem is yet to be solved by the scientific community, although it has been approached in several research papers and its importance has been stressed by the US National Intelligence Office. When compared to other solutions presented in the literature, our solution provides satisfactory results when tested on a large dataset. Details of the state-of-the-art analysis are presented in Section 2 of this paper.
The rest of the paper is structured as follows. Section 2 reviews the state of the art. Section 3 presents the architecture of the solution. Section 4 describes the results of performance optimization and accuracy tests. The paper is concluded in Section 5.
2 State of the art
The problem considered in this paper has been approached by researchers in the past. Hays and Efros [4] propose a system, similar to the one presented here, that attempts to obtain geolocation based on a single image. In their paper, the authors use geolocated images obtained from an image sharing service. Seven different features are compared in terms of accuracy of localization (where an accuracy of 200 km is considered satisfactory). This concept is further extended by using the tagged reference database in [2].
Unlike the authors of [4], we focus on street view data and try to pinpoint the location of the photograph exactly. In their next paper [5], the authors focus on using a larger database and optimizing the geolocation algorithm. Work presented by Zhang and Kosecka [10] is the closest in concept to that presented here. The authors attempt to solve the problem by applying the SIFT [6] algorithm. A great effort is made to compensate for perspective transforms resulting from the different angles the photographs may have been taken from. In our work, we attempted to use the SURF algorithm (enhanced and more efficient when compared to SIFT); however, it proved to be less accurate than our MPEG-7 approach.
Schindler et al. [9] propose a solution where the location is estimated by detecting and matching repeated patterns. While the proposed solution offers a very high accuracy with a mean error of just 6 m, the paper does not report a large-scale test (in terms of area covered) of this solution. A good overview of the current approach to geotagging research is presented in , where the topic is presented in terms of three different aspects: how the information can be extracted, where it can be applied, and what the interplay is between the extracted geotagging information and applications.
The importance of this topic is further confirmed by a call published by the US National Intelligence Office: Intelligence Advanced Research Projects Activity (IARPA) for a solution to the presented problem (Solicitation IARPA-BAA-11-05, May 2011).
3 Solution architecture
In brief, our algorithm accepts a fragment of the photograph to be localized as an input. This rectangular fragment is chosen by the user and should contain a characteristic architectural detail of the photograph. For instance, it could be an unusually shaped window in a building, or a piece of wall of a distinctive colour, texture and/or shape. Low-level descriptors of this fragment are compared against low-level descriptors of photographs coming from a street view service.
A street view service is a popular Internet application that combines interactive maps with panoramic photographs taken at street level. Examples of such services are StreetView from Google, NORC (which covers cities in mid-eastern Europe not covered by Google StreetView) and Microsoft Streetside. Photographs from these services are of high resolution and quality, as well as being geolocated; as such they create a perfect reference database for the INSTREET application.
Our approach uses a cascade of a texture descriptor (Edge Histogram) and four colour descriptors. Afterwards, a weighted rank algorithm is used. Finally, the k-nearest neighbour algorithm (k-NN) returns the candidate panoramas. The best candidate street view panoramas, together with their locations, are chosen by the algorithm and presented to the user for the final review and geolocation of the photograph. The following sections describe the subsequent stages of the INSTREET algorithm.
3.1 Stage 1 — inputs and pre-filtering
The user is also required to mark the region to be searched on the map. Our algorithm is optimized to work in an urban environment. The larger the area the user chooses, the more street view panoramas need to be searched and the longer the search time.
The panoramas from the selected region are divided into subimages using the sliding window algorithm. The operation of this algorithm is characterized by two parameters: the area of overlap between two adjacent blocks, and the change in the size of the sliding window between consecutive iterations. In the system presented, the best results were achieved with a 90 % overlap area and a 10 % resizing of the sliding window per iteration. With these parameters, one panoramic view is divided into approx. 30,000 subimages, while one street may consist of 40–80 views. This makes INSTREET a computationally heavy algorithm.
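The sliding window step can be illustrated with a minimal sketch. It assumes that the window shrinks by 10 % per iteration until a minimum size is reached and that the step between adjacent windows is the window size times (1 − overlap); the starting size, the minimum size, the direction of resizing and the name `sliding_windows` are illustrative assumptions, not details taken from the INSTREET implementation.

```python
from typing import Iterator, Tuple

Window = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(img_w: int, img_h: int,
                    start_w: int, start_h: int,
                    overlap: float = 0.90,
                    resize: float = 0.10,
                    min_size: int = 32) -> Iterator[Window]:
    """Yield subimage rectangles covering an image at several window sizes.

    Assumption: the window shrinks by `resize` per iteration; the paper
    states only the 90 % overlap and the 10 % resizing.
    """
    w, h = float(start_w), float(start_h)
    while w >= min_size and h >= min_size:
        win_w, win_h = int(w), int(h)
        # Adjacent windows share `overlap` of their extent, so the step
        # is the remaining fraction of the window size (at least 1 px).
        step_x = max(1, int(round(win_w * (1.0 - overlap))))
        step_y = max(1, int(round(win_h * (1.0 - overlap))))
        for y in range(0, img_h - win_h + 1, step_y):
            for x in range(0, img_w - win_w + 1, step_x):
                yield (x, y, win_w, win_h)
        w *= (1.0 - resize)
        h *= (1.0 - resize)
```

Because the step is only a tenth of the window size, the number of windows grows quickly with image size, which is consistent with the large number of subimages per panorama reported above.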
Pre-filtering is conducted by using the edge histogram algorithm from the MPEG-7 standard [8]. The edge histogram reduces an image to a vector that represents the number and directionality of edges in the image. The distance between two such vectors can then be computed, which in turn makes it possible to estimate the visual similarity between the two images. The edge histogram distinguishes five types of edges in local image regions: four directional (vertical, horizontal, 45°, 135°) and one non-directional.
This information is provided for every subimage. During the process of calculating the edge histogram, subimages are defined by dividing the image into 16 non-overlapping parts, so with five edge types per part the descriptor holds 16 × 5 = 80 values. The edge histogram returns this vector of values, also referred to as a descriptor value. The distance between two images is computed following the MPEG-7 standard [8]. This distance represents the visual similarity between the query and the analyzed image.
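As a rough illustration of how two edge histogram descriptors can be compared, the sketch below computes a plain L1 (sum of absolute differences) distance over the 80 local bins. This is a simplification and an assumption: the exact matching procedure defined for MPEG-7 is not reproduced here, and the function name `ehd_l1_distance` is hypothetical.

```python
from typing import Sequence

NUM_SUBIMAGES = 16   # the image is divided into 16 non-overlapping parts
NUM_EDGE_TYPES = 5   # vertical, horizontal, 45°, 135°, non-directional
EHD_LENGTH = NUM_SUBIMAGES * NUM_EDGE_TYPES  # 80 bins in total


def ehd_l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Simplified distance between two edge histogram descriptors.

    Assumption: a plain L1 distance over the local bins only; smaller
    values mean more similar edge layouts.
    """
    if len(a) != EHD_LENGTH or len(b) != EHD_LENGTH:
        raise ValueError(f"expected descriptors of length {EHD_LENGTH}")
    return float(sum(abs(x - y) for x, y in zip(a, b)))
```

A distance of zero means the two images have identical edge statistics in every part; it does not by itself guarantee that the images show the same object.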
At this stage in the algorithm, the characteristic object chosen by the user is compared using the edge histogram descriptor against subimages created by the sliding window algorithm. A list of best candidate subimages is created and continuously updated during the process of pre-filtering. This list is referred to as a bank of results. Such an approach enables optimized memory management and a more time-efficient operation of the application.
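One way such a bank of results could be kept small and cheap to update is a bounded heap that retains only the best candidates seen so far. The sketch below is an assumption about a possible data structure, not the INSTREET implementation; the class name `ResultBank` and its capacity parameter are hypothetical.

```python
import heapq
from typing import List, Tuple


class ResultBank:
    """Keep the `capacity` candidates with the smallest distances."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        # Distances are stored negated so that the worst retained
        # candidate sits at the heap root and can be evicted cheaply.
        self._heap: List[Tuple[float, str]] = []

    def offer(self, distance: float, candidate_id: str) -> bool:
        """Try to add a candidate; return True if it was kept."""
        item = (-distance, candidate_id)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, item)
            return True
        worst_distance = -self._heap[0][0]
        if distance < worst_distance:
            heapq.heapreplace(self._heap, item)
            return True
        return False

    def best(self) -> List[Tuple[float, str]]:
        """Return the retained candidates, best (smallest distance) first."""
        return sorted((-neg, cid) for neg, cid in self._heap)
```

With this structure the memory used during pre-filtering depends on the bank capacity rather than on the number of subimages scanned.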
3.2 Stage 2 — colour-based filters
At this stage, INSTREET compares the candidates in the bank against the user-selected region using colour-based MPEG-7 descriptors: Color Layout (CL), Scalable Color (SC), Dominant Color (DC) and Color Structure (CS). For each of the descriptors, all candidates in the bank are compared against the user-selected region. The results are ordered from best to worst match.
The per-descriptor results are then combined into an overall rank, where R(i) is the overall rank of the i-th object and X(i,j) is the result for the i-th object in the j-th descriptor (CL, SC, DC, CS).
As a result, the final output of this stage of the algorithm is a set of candidate images with an associated overall score. In summary, we first analyze the data focusing on the layout of the edges and then focus on the colour characteristics. Our experiments have shown that such an approach yields the best results in terms of accuracy.
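The combination of per-descriptor results into an overall score can be sketched as follows. The exact weighting used by INSTREET is not given in this text, so the sketch assumes the simplest variant: each candidate's position in every per-descriptor ordering is multiplied by a descriptor weight and the products are summed. The function name `overall_rank` and the weight values are hypothetical.

```python
from typing import Dict, List, Tuple


def overall_rank(distances: Dict[str, Dict[str, float]],
                 weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Combine per-descriptor orderings into one score per candidate.

    distances[descriptor][candidate] is the distance of the candidate
    from the query under that descriptor (smaller is better).
    Assumption: the score is a weighted sum of rank positions.
    """
    scores: Dict[str, float] = {}
    for descriptor, per_candidate in distances.items():
        # Order candidates from best to worst match for this descriptor.
        ordered = sorted(per_candidate, key=per_candidate.get)
        for position, candidate in enumerate(ordered, start=1):
            scores[candidate] = (scores.get(candidate, 0.0)
                                 + weights[descriptor] * position)
    # Lower combined score means a better overall match.
    return sorted(scores.items(), key=lambda kv: kv[1])
```

Working on rank positions rather than raw distances avoids having to normalize the four descriptors to a common scale, which is one reason a rank-based combination is a natural choice here.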
It should be noted that the presented algorithm will not be effective if the example photograph is taken at night due to colour differences between daytime and nighttime photos. In order to compensate for different light conditions, a colour normalization algorithm is currently under development.
3.3 Stage 3 — final classification
The final stage of the algorithm uses the k-nearest neighbour (k-NN) algorithm. This is a method for classifying objects based on the closest distance in a feature space. After the logarithmic classification, a set of closest-matching photos is available. The role of the k-NN algorithm is to assign each object to its own model (a street view). As a consequence, it is possible to determine which class of objects is the most numerous in the k-NN classification. The algorithm outputs the best candidate panoramas.
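The voting idea behind this stage can be sketched as below: the k best-ranked subimages each vote for the panorama they were cut from, and panoramas are returned in order of vote count. This is a minimal sketch under that assumption; the function name `vote_panoramas` and the input layout are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def vote_panoramas(ranked: List[Tuple[str, float]], k: int) -> List[str]:
    """Return panorama ids ordered by how many of the k best subimages
    belong to them.

    `ranked` holds (panorama_id, score) pairs sorted best first.
    """
    votes = Counter(panorama_id for panorama_id, _ in ranked[:k])
    # most_common() sorts by vote count, most numerous class first.
    return [panorama_id for panorama_id, _ in votes.most_common()]
```

For example, if three of the five best subimages come from the same panorama, that panorama is placed first in the list shown to the user.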
4 Experiment and results
This section describes the practical experiments performed with the INSTREET algorithm in order to assess the accuracy and time performance.
4.1 Reference database
[Table: Views and regions in the reference database; column: # of views]
4.2 Time performance
The time performance of the INSTREET algorithm has been greatly improved compared to the performance described in [3]. The only optimization in the early version of the application was the downscaling of panoramic views, which accelerated the process by a factor of three. In the version presented in this paper, we made significant progress in both the descriptor calculation time and the descriptor comparison time, with queries accelerated by a factor of 150.
As for the time required for the descriptor calculation, we have created a multi-threaded version of the software used for calculating the descriptors. The process of calculating the descriptor values is independent for each of the subimages generated by the sliding window, and can be executed in a separate thread. As the process is CPU intensive, the acceleration in processing time is proportional to the number of threads used. We used an eight-thread processor for the experiment, therefore the descriptor calculation was accelerated eight times. The time required to calculate a set of descriptors for one view is 55 s for a single thread, which means it takes approx. 26 days of computing time for the entire set. By using multi-threading, we reduced the time required to calculate the descriptors for the entire dataset from 26 days to 3 days. Further improvement is possible by using a more powerful machine.
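The figures above can be checked with a short sketch, which also shows the shape of a per-subimage parallel map. The number of views used in the usage note (about 40,800) is not stated in this text; it is derived from 26 days at 55 s per view. The function names are hypothetical, and `descriptor_fn` stands in for the real descriptor routine. Note that in CPython a thread pool only speeds up CPU-bound work if that work releases the interpreter lock (for example inside native code); otherwise a process pool would be needed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")

SECONDS_PER_DAY = 86_400


def estimate_days(num_views: int, seconds_per_view: float,
                  threads: int = 1) -> float:
    """Estimated wall-clock days, assuming speed-up proportional to threads."""
    return num_views * seconds_per_view / threads / SECONDS_PER_DAY


def compute_all(subimages: Iterable[T],
                descriptor_fn: Callable[[T], R],
                threads: int = 8) -> List[R]:
    """Compute a descriptor for every subimage, one independent task each."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        # map() preserves input order, so results line up with subimages.
        return list(pool.map(descriptor_fn, subimages))
```

With these assumptions, `estimate_days(40_800, 55)` gives roughly 26 days and `estimate_days(40_800, 55, threads=8)` gives roughly 3.2 days, in line with the reduction from 26 days to 3 days reported above.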
As for the time required for the search, we switched from on-demand calculation of the descriptors to pre-calculating the descriptor set for the whole reference database. In the first version of the application, the descriptors for the entire dataset were recalculated each time we issued a query. This is far from optimal, although it allowed us to create a proof-of-concept of the application. The time performance of the application has since been improved by pre-calculating all the descriptors and limiting the search process to comparing the descriptors. The largest dataset we were capable of testing with the first version of the application was Region 1, where a single query took 36 hr. With pre-calculation, we were able to increase the performance by a factor of approx. 150, with a query to this region taking 13 min. This also allowed us to perform tests on larger regions within reasonable time frames. The results of the accuracy tests are presented in the next section.
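The difference between on-demand calculation and pre-calculation can be sketched as a small index that computes every reference descriptor once and then answers queries by comparison only. This is an illustrative sketch, not the INSTREET code; the class name `DescriptorIndex` and the callables passed to it are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Descriptor = Sequence[float]


class DescriptorIndex:
    """Pre-calculated descriptors for a reference database."""

    def __init__(self, descriptor_fn: Callable[[object], Descriptor]) -> None:
        self._descriptor_fn = descriptor_fn
        self._store: Dict[str, Descriptor] = {}

    def build(self, subimages: Dict[str, object]) -> None:
        """Expensive step, run once: one descriptor per reference subimage."""
        for key, image in subimages.items():
            self._store[key] = self._descriptor_fn(image)

    def query(self, query_image: object,
              distance_fn: Callable[[Descriptor, Descriptor], float],
              top_n: int) -> List[Tuple[float, str]]:
        """Cheap step, run per query: one descriptor calculation for the
        query image, then comparisons against the stored descriptors."""
        q = self._descriptor_fn(query_image)
        scored = sorted((distance_fn(q, d), key)
                        for key, d in self._store.items())
        return scored[:top_n]
```

After `build` has run, each query costs one descriptor calculation plus a pass over the stored vectors, which is where the reduction in query time comes from.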
4.3 Accuracy
In order to assess the accuracy of the INSTREET application, we prepared a set of 30 queries that were processed by the algorithm. The queries were selected from a different street view service to the one that was used to create the reference database. This way the queries are similar to what can be expected during the real operation of the system, with photographs of urban scenery from the street view level, although taken in different conditions (lighting, weather) than those stored in the reference database.
We analyzed the results of the query by observing the returned list of results. The score given to a query in terms of accuracy is the place on the list of results where the relevant result is given (we refer to it as ‘rank’). The lower the rank, the better, as in an ideal situation we expect the relevant result to appear in the first position. We also calculate the percentage of the entire dataset that needs to be analyzed by an operator by going down the list of query results in order to find the relevant answer. We also expect this value to be as low as possible. The most important question that needed answering was the impact of increasing the dataset that is available to be searched on the accuracy of the results. In other words, does the INSTREET scale as the size of the area covered is increased? In order to answer this question, we increased the size of the reference database by adding more regions in consecutive experiments. As described previously, 30 queries were issued in each experiment.
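The two accuracy measures described above can be written down as a short sketch. It assumes that the result list contains view identifiers and that the dataset size is counted in views; the function names are hypothetical.

```python
from statistics import mean, median
from typing import Dict, List, Sequence


def rank_of_relevant(results: Sequence[str], relevant: str) -> int:
    """1-based position of the relevant result in the returned list."""
    return results.index(relevant) + 1


def percent_to_analyse(rank: int, dataset_size: int) -> float:
    """Share of the dataset an operator inspects before reaching the
    relevant result when going down the list from the top."""
    return 100.0 * rank / dataset_size


def summarise(values: List[float]) -> Dict[str, float]:
    """Average, median and worst case over a set of queries."""
    return {"average": mean(values), "median": median(values),
            "worst": max(values)}
```

For instance, a relevant result at rank 2 in a dataset of 40 views means the operator has to look at 5 % of the dataset; applying `summarise` to the ranks of all 30 queries gives the average, median and worst-case figures for one experiment.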
[Table: Results for accuracy tests for the INSTREET application, percentage]
[Table: Results for accuracy tests for the INSTREET application, rank]
[Row labels: To analyse (average), To analyse (median), To analyse (worst)]
5 Conclusions and further work
This paper proposes an algorithm for geolocating photographs in an urban environment. Our algorithm requires the user to select a region of interest in the photograph, such as a distinctive architectural detail. Based on this rectangular image, the algorithm searches a database of geolocated panoramas taken at the street view level. Using MPEG-7 descriptors, the algorithm suggests geolocated panoramas that are most likely to have been taken in the vicinity of the query photograph.
We significantly improved the time performance of the algorithm by enabling multi-threaded calculation of the descriptors and by pre-calculating a descriptor set for the whole reference database. As for the accuracy of the solution, we created a set of tests with queries generated from another, competing street view service. The tests reveal that while a positive result is obtained for the majority of the queries, some troublesome queries remain. We also demonstrated that the system scales linearly as the urban area covered by the reference database is increased.
In further work, we plan to create a GUI for the algorithm in order to create a prototype application. We will also work on speeding up the most time-consuming operation, the calculation of descriptor values, by delegating this task to a GPU instead of the CPU. We also plan to perform a set of experiments with user-generated content that will further demonstrate the usefulness of the algorithm. Additionally, we hope to deploy the service to a multi-threaded supercomputer and prepare a web-accessible demonstrator capable of returning search results in real time.
References
- 2. Gallagher A, Joshi D, Yu J, Luo J (2009) Geo-location inference from image content and user tags. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009. CVPR Workshops, pp 55–62
- 3. Grega M, Łach S, Cyganek B (2012) INSTREET—application for urban photograph localization. In: Multimedia Communications, Services and Security. Communications in Computer and Information Science, vol 287, pp 130–138
- 4. Hays J, Efros AA (2008) IM2GPS: estimating geographic information from a single image. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 5. Kalogerakis E, Vesselova O, Hays J, Efros AA, Hertzmann A (2009) Image sequence geolocation with human travel priors. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV)
- 6. Lowe DG (1999) Object recognition from local scale-invariant features. In: Proceedings of the International Conference on Computer Vision 2, pp 1150–1157
- 8. Manjunath BS, Salembier P, Sikora T (2002) Introduction to MPEG-7. Wiley
- 9. Schindler G, Krishnamurthy P, Lublinerman R, Yanxi L, Dellaert F (2008) Detecting and matching repeated patterns for automatic geo-tagging in urban environments. In: IEEE Conference on Computer Vision and Pattern Recognition, 2008. CVPR, pp 1–7
- 10. Zhang W, Kosecka J (2006) Image based localization in urban environments. In: International Symposium on 3D Data Processing, Visualization and Transmission, 3DPVT 2006, Chapel Hill, North Carolina