Abstract
This paper proposes a method for finding and extracting academic information from conference Web pages. The main contributions are: (1) A lightweight, search-engine-based topic crawling method is used to crawl academic conference Web pages. (2) A new vision-based page segmentation algorithm is proposed that improves on the classical VIPS algorithm by introducing a complete tree; it divides Web pages into text blocks. (3) Using a Bayesian network classifier, all text blocks are classified into 10 categories according to their vision features, keyword features, and text content features. The initial classification results have 75% precision and 67% recall. (4) The context information of text blocks is employed to repair and refine the initial classification results, which improves them to 96% precision and 98% recall. Finally, academic information is easily extracted from the classified text blocks. Experimental results on real-world datasets show that our method is effective and efficient for finding and extracting academic information from conference Web pages.
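As a rough illustration of steps (3) and (4), the sketch below classifies text blocks from categorical features and then repairs isolated labels using their neighbours. It is not the authors' implementation: it substitutes a plain naive Bayes classifier for the Bayesian network classifier named in the abstract, and the feature names (`font`, `keyword`), the category labels (`title`, `date`, `people`), and the neighbour-agreement repair rule are all hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(samples):
    """Count labels and per-label feature values.

    samples: list of (features: dict[str, str], label: str).
    Naive Bayes stands in here for the paper's Bayesian network classifier.
    """
    label_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, feature name) -> value counts
    values = defaultdict(set)              # feature name -> values seen in training
    for features, label in samples:
        label_counts[label] += 1
        for name, value in features.items():
            feature_counts[(label, name)][value] += 1
            values[name].add(value)
    return label_counts, feature_counts, values


def classify(model, features):
    """Return the label with the highest log posterior (Laplace-smoothed)."""
    label_counts, feature_counts, values = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        for name, value in features.items():
            if name not in values:
                continue  # ignore features never seen during training
            seen = feature_counts[(label, name)][value]
            score += math.log((seen + 1) / (count + len(values[name])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def repair_with_context(labels):
    """Hypothetical context rule: relabel a block whose two neighbours
    agree with each other but disagree with it."""
    repaired = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            repaired[i] = labels[i - 1]
    return repaired


# Toy training data; real blocks would carry vision, keyword, and text features.
training = [
    ({"font": "large", "keyword": "call"}, "title"),
    ({"font": "large", "keyword": "none"}, "title"),
    ({"font": "small", "keyword": "deadline"}, "date"),
    ({"font": "small", "keyword": "deadline"}, "date"),
    ({"font": "small", "keyword": "committee"}, "people"),
]
model = train_naive_bayes(training)
```

Splitting the work into an initial per-block classification and a separate context pass mirrors the two-stage structure the abstract describes; the specific repair rule used by the authors is not given in the abstract and may differ substantially from this one.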
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wang, P., Zhang, X., Zhou, F. (2013). Finding and Extracting Academic Information from Conference Web Pages. In: Zhou, S., Wu, Z. (eds) Social Media Retrieval and Mining. Communications in Computer and Information Science, vol 387. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41629-3_6
DOI: https://doi.org/10.1007/978-3-642-41629-3_6
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41628-6
Online ISBN: 978-3-642-41629-3
eBook Packages: Computer Science, Computer Science (R0)