Finding and Extracting Academic Information from Conference Web Pages

Wang, Peng; Zhang, Xiang; Zhou, Fengbo

doi:10.1007/978-3-642-41629-3_6

Conference paper
First Online: 16 November 2013

Part of the book series: Communications in Computer and Information Science (CCIS, volume 387)

Abstract

This paper proposes a method for finding and extracting academic information from conference Web pages. The main contributions are: (1) A lightweight topic crawling method based on a search engine is used to crawl academic conference Web pages. (2) A new vision-based page segmentation algorithm is proposed that improves on the results of the classical VIPS algorithm by introducing a complete tree. This algorithm divides Web pages into text blocks. (3) Using a Bayesian network classifier, all text blocks are classified into 10 categories according to their visual features, keyword features, and text content features. The initial classification results have 75% precision and 67% recall. (4) The context information of text blocks is employed to repair and refine the initial classification results, which improve to 96% precision and 98% recall. Finally, academic information is easily extracted from the classified text blocks. Experimental results on real-world datasets show that the method is effective and efficient for finding and extracting academic information from conference Web pages.
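Steps (3) and (4) of the abstract can be illustrated with a small sketch. The abstract does not give the classifier's structure, the 10 category names, or the repair rules, so everything below is an assumption made for illustration: a multinomial naive Bayes model stands in for the Bayesian network classifier, only word tokens are used as features (no visual features), the three labels and the training blocks are hypothetical, and the repair step is a simple "both neighbours agree" rule rather than the paper's context method.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase word tokens with surrounding punctuation stripped."""
    return [w.lower().strip(".,:;()") for w in text.split() if w.strip(".,:;()")]


class NaiveBayesBlockClassifier:
    """Multinomial naive Bayes with add-one smoothing (a stand-in, not the paper's model)."""

    def __init__(self):
        self.label_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def fit(self, blocks):
        # blocks: iterable of (block_text, label) pairs
        for text, label in blocks:
            self.label_counts[label] += 1
            for w in tokenize(text):
                self.word_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, text):
        total = sum(self.label_counts.values())
        tokens = tokenize(text)
        scores = {}
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for w in tokens:
                score += math.log((self.word_counts[label][w] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


def repair_with_context(labels):
    """Hypothetical repair rule: if both neighbours of a block share a label
    that differs from the block's own label, adopt the neighbours' label."""
    repaired = list(labels)
    for i in range(1, len(labels) - 1):
        if labels[i - 1] == labels[i + 1] != labels[i]:
            repaired[i] = labels[i - 1]
    return repaired


# Hypothetical training blocks and labels, invented for this sketch.
TRAIN = [
    ("Important dates submission deadline March 15", "dates"),
    ("Notification of acceptance May 1 camera-ready deadline June 1", "dates"),
    ("Topics of interest include data mining and information retrieval", "topics"),
    ("Topics include machine learning social media retrieval", "topics"),
    ("Program committee chair general chair organizing committee", "committee"),
    ("Committee members program chair", "committee"),
]

clf = NaiveBayesBlockClassifier().fit(TRAIN)
initial = [clf.predict(t) for t in (
    "Paper submission deadline: April 10",
    "General chair and program committee",
)]
refined = repair_with_context(["dates", "topics", "dates", "committee"])
```

The two stages are kept separate on purpose: the classifier labels each block in isolation, and the repair pass then uses the labels of adjacent blocks, which mirrors the abstract's distinction between initial classification and context-based refinement.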

Notes

1. http://arnetminer.org/page/conference-rank/html/Conference.html

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, China
Peng Wang & Xiang Zhang
Focus Technology Co., Ltd, Nanjing, China
Fengbo Zhou

Corresponding author

Correspondence to Peng Wang.

Editor information

Editors and Affiliations

Fudan University School of Computer Science, Shanghai, People's Republic of China
Shuigeng Zhou
University of Finance and Economics, Nanjing, People's Republic of China
Zhiang Wu

About this paper

Cite this paper

Wang, P., Zhang, X., Zhou, F. (2013). Finding and Extracting Academic Information from Conference Web Pages. In: Zhou, S., Wu, Z. (eds) Social Media Retrieval and Mining. Communications in Computer and Information Science, vol 387. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41629-3_6

DOI: https://doi.org/10.1007/978-3-642-41629-3_6
Published: 16 November 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41628-6
Online ISBN: 978-3-642-41629-3
eBook Packages: Computer Science (R0)
