Abstract
With the advance of wireless networks, location-based services have become increasingly important, as people often need to look up the addresses of unfamiliar locations on the Web and then locate them on a map. Existing geographic information systems based on crowdsourcing are insufficient and slow to update. However, they can be complemented by automatically extracting the addresses of location entities and their associated information from general web pages. Thus, effectively crawling web pages that contain addresses is a practical challenge in enriching a location entity database. This research is devoted to the automatic extraction of addresses and associated information to support information retrieval on maps, i.e., integrating the process of querying location entities on the Web with positioning them on maps. We build a geographic information system of location entities by crawling the Web with three strategies for Chinese addresses. In total, 1.27 million distinct Chinese addresses are crawled using 1.08 million HTTP requests, giving a return on investment of 1.169.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Chuang, HM., Chang, CH., Kao, TY. (2014). Effective Web Crawling for Chinese Addresses and Associated Information. In: Hepp, M., Hoffner, Y. (eds) E-Commerce and Web Technologies. EC-Web 2014. Lecture Notes in Business Information Processing, vol 188. Springer, Cham. https://doi.org/10.1007/978-3-319-10491-1_2
DOI: https://doi.org/10.1007/978-3-319-10491-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10490-4
Online ISBN: 978-3-319-10491-1
eBook Packages: Computer Science (R0)