Effective Web Crawling for Chinese Addresses and Associated Information

Chuang, Hsiu-Min; Chang, Chia-Hui; Kao, Ting-Yao

doi:10.1007/978-3-319-10491-1_2

Part of the book series: Lecture Notes in Business Information Processing (LNBIP, volume 188)

Included in the following conference series:

International Conference on Electronic Commerce and Web Technologies

Abstract

With the advance of wireless networks, location-based services have become very important, as people often need to look up the address of an unfamiliar location on the Web and then find its position on a map. Existing geographic information systems based on crowdsourcing are incomplete and slow to update. However, they can be complemented by automatically extracting the addresses of location entities and their associated information from general Web pages. Effectively crawling Web pages that contain addresses is therefore a practical challenge in enriching a location entity database. This research is devoted to the automatic extraction of addresses and associated information in order to provide information retrieval on maps, i.e., integrating the query for a location entity on the Web with its positioning on a map. We build a geographic information system of location entities by crawling the Web for Chinese addresses using three strategies. In total, 1.27 million distinct Chinese addresses are crawled using 1.08 million HTTP requests, a return on investment of 1.169.
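
The abstract does not define return on investment. One plausible reading, assumed here and not confirmed by the source, is the number of distinct addresses obtained per HTTP request. Under that assumption, the rounded counts quoted above give a ratio of about 1.18, close to the reported 1.169; the small gap would be consistent with the counts being rounded to two decimals. The function name `addresses_per_request` is hypothetical.

```python
def addresses_per_request(distinct_addresses: float, http_requests: float) -> float:
    """Assumed ROI definition: distinct addresses harvested per HTTP request."""
    if http_requests <= 0:
        raise ValueError("http_requests must be positive")
    return distinct_addresses / http_requests


# Rounded counts from the abstract, in millions; the exact counts are not given.
roi_estimate = addresses_per_request(1.27, 1.08)
print(f"{roi_estimate:.3f}")
```

A ratio above 1 under this reading would mean that, on average, each request yields more than one new address.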

Author information

Authors and Affiliations

Dept. of Computer Science and Information Engineering, National Central University, Chungli, Taoyuan, Taiwan
Hsiu-Min Chuang, Chia-Hui Chang & Ting-Yao Kao

Editor information

Editors and Affiliations

E-Business and Web Science Research Group, Universität der Bundeswehr, Neubiberg, Germany
Martin Hepp
Department of Software Engineering, Shenkar College of Engineering and Design, Ramat Gan, Israel
Yigal Hoffner

About this paper

Cite this paper

Chuang, HM., Chang, CH., Kao, TY. (2014). Effective Web Crawling for Chinese Addresses and Associated Information. In: Hepp, M., Hoffner, Y. (eds) E-Commerce and Web Technologies. EC-Web 2014. Lecture Notes in Business Information Processing, vol 188. Springer, Cham. https://doi.org/10.1007/978-3-319-10491-1_2

DOI: https://doi.org/10.1007/978-3-319-10491-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10490-4
Online ISBN: 978-3-319-10491-1
eBook Packages: Computer Science, Computer Science (R0)