Locating Candidate Tables in a Spreadsheet Rendered Web Page
This paper presents a method for locating tables in a web page. The web page is captured, using spreadsheet software, as a spreadsheet grid of textual elements (the web sheet) with all visual attributes retained. The leaf tables in that web page are captured in a separate sheet by DOM analysis (the DOM sheet). Locating a table in the web sheet consists of two sub-tasks: locating the start point of the table and locating its end point. The start point is located by comparing the text of the table elements in the DOM sheet with the text in the web sheet. The end point is located by navigating through the web sheet from the located start point, using the row and column information taken from the DOM sheet. The method was tested on 60 arbitrarily selected URLs containing 450 leaf tables, and in more than 90% of the cases the tables were located correctly.
Keywords: information extraction, web mining, web table location, spreadsheet grid
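The two sub-tasks named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the web sheet and a DOM-sheet leaf table are both available as plain lists of rows of strings, that a table occupies a contiguous block with no merged cells, and that matching the first DOM row is enough to fix the start point. The names `find_start`, `find_end`, and `locate_table` are hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[str]]   # rows of cell texts
Cell = Tuple[int, int]   # (row index, column index), zero-based


def _norm(text: str) -> str:
    """Collapse whitespace and case so rendering differences do not block a match."""
    return " ".join(text.split()).lower()


def find_start(web_sheet: Grid, dom_table: Grid) -> Optional[Cell]:
    """Start point: first web-sheet position where the DOM table's first row matches."""
    first_row = [_norm(cell) for cell in dom_table[0]]
    width = len(first_row)
    for r, row in enumerate(web_sheet):
        for c in range(len(row) - width + 1):
            if [_norm(cell) for cell in row[c:c + width]] == first_row:
                return (r, c)
    return None


def find_end(start: Cell, dom_table: Grid) -> Cell:
    """End point: move from the start by the row and column counts of the DOM table.

    Assumption: one DOM row maps to one sheet row and one DOM cell to one sheet column.
    """
    n_rows = len(dom_table)
    n_cols = max(len(row) for row in dom_table)
    return (start[0] + n_rows - 1, start[1] + n_cols - 1)


def locate_table(web_sheet: Grid, dom_table: Grid) -> Optional[Tuple[Cell, Cell]]:
    """Return (start, end) cells of the table in the web sheet, or None if not found."""
    if not dom_table or not dom_table[0]:
        return None
    start = find_start(web_sheet, dom_table)
    if start is None:
        return None
    return start, find_end(start, dom_table)


if __name__ == "__main__":
    web_sheet = [
        ["Site title", "", ""],
        ["", "Name", "Age"],
        ["", "Ann", "31"],
        ["", "Bob", "27"],
        ["Footer", "", ""],
    ]
    dom_table = [["Name", "Age"], ["Ann", "31"], ["Bob", "27"]]
    print(locate_table(web_sheet, dom_table))
```

A rendered page with merged cells, spacer columns, or repeated header text would need a more careful comparison than this sketch performs, for example by checking more than the first row before accepting a start point.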