Abstract
Entity-pages are Web pages that publish data representing exactly one instance of a certain conceptual entity. In this paper we propose SSUP, a new method for entity-page discovery. Specifically, given a sample entity-page from a Web site (e.g., the Jolyon Palmer entity-page from the GP2 Web site), we aim to find all entity-pages of the same type (driver entity-pages) on this Web site. We propose two structural URL similarity metrics and a set of algorithms that combine URL features with HTML features in order to improve the quality of the results and to minimize the number of downloaded pages and the processing time. We evaluate our method on real-world Web sites and compare it with two baselines to demonstrate its effectiveness.
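To make the idea of structural URL similarity concrete, the following is a minimal sketch and not the metrics defined by SSUP, which the abstract does not specify. It assumes a simple position-wise comparison of URL tokens (host plus path segments); the function names, the 0.5 threshold, and the example.org URLs are all hypothetical and chosen only for illustration.

```python
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into its host followed by its non-empty path segments."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    return [parts.netloc.lower()] + segments


def structural_similarity(url_a, url_b):
    """Illustrative metric (an assumption, not SSUP's): the fraction of
    aligned token positions that hold the same token. URLs with a
    different number of tokens are treated as structurally dissimilar."""
    a, b = url_tokens(url_a), url_tokens(url_b)
    if len(a) != len(b):
        return 0.0
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def candidate_entity_urls(sample_url, site_urls, threshold=0.5):
    """Keep URLs whose structure resembles the sample entity-page URL.
    The threshold is a hypothetical parameter."""
    return [
        u for u in site_urls
        if u != sample_url and structural_similarity(sample_url, u) >= threshold
    ]


if __name__ == "__main__":
    sample = "http://example.org/drivers/jolyon-palmer"
    site = [
        "http://example.org/drivers/other-driver",
        "http://example.org/teams/some-team",
        "http://example.org/news/2014/05/report",
    ]
    print(candidate_entity_urls(sample, site))
```

Filtering by URL structure first means that only the surviving candidates would need to be downloaded and checked against HTML features, which is the kind of saving in downloaded pages that the abstract refers to.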
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Manica, E., Galante, R., Dorneles, C.F. (2014). SSUP – A URL-Based Method to Entity-Page Discovery. In: Casteleyn, S., Rossi, G., Winckler, M. (eds) Web Engineering. ICWE 2014. Lecture Notes in Computer Science, vol 8541. Springer, Cham. https://doi.org/10.1007/978-3-319-08245-5_15
DOI: https://doi.org/10.1007/978-3-319-08245-5_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-08244-8
Online ISBN: 978-3-319-08245-5
eBook Packages: Computer Science, Computer Science (R0)