Information extraction for deep web using repetitive subject pattern

Abstract

In this paper, we propose an information extraction (IE) system that extracts data records from semi-structured documents on the Deep Web using a technique called Repetitive Subject Pattern. The technique is based on the hypothesis that each data record in a web page must have a subject item, and that the repetitive pattern of these subject items can be used to identify the boundaries of the data records. The system consists of four automatic tasks: (1) parsing a sample page into a DOM tree, (2) recognizing a subject string in the DOM tree, (3) using the subject string to identify the pattern of data records and generate a wrapper, and (4) using the generated wrapper to extract data records. This approach makes the wrapper generator flexible: when the automatic process generates an incorrect wrapper, the user can provide a new sample subject string to generate a better one. As a result, the system can operate in either a semi-supervised or an unsupervised mode. Experiments show that the proposed technique generates high-quality wrappers, with both recall and precision close to 100% when tested on a number of datasets.
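The idea behind the four tasks can be illustrated with a small sketch. This is not the authors' algorithm; it is a simplified illustration that assumes the subject string is supplied by the user (the semi-supervised case) and that a subject item can be identified by its tag path alone. The names `PathTextParser`, `extract_records`, and `SAMPLE` are hypothetical and exist only for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathTextParser(HTMLParser):
    """Task 1 (simplified): flatten a page into (tag_path, text) pairs."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, root first
        self.items = []   # (tag path, text) in document order

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.items.append(("/".join(self.stack), text))


def extract_records(html, subject):
    """Group text nodes into records, each starting at a subject item."""
    parser = PathTextParser()
    parser.feed(html)
    parser.close()

    # Task 2 (assumed user-supplied here): find the subject's tag path.
    subject_path = next((p for p, t in parser.items if t == subject), None)
    if subject_path is None:
        return []

    # Tasks 3 and 4 (simplified): every text node on the same tag path is
    # treated as the subject of a new record; following text joins that record.
    records = []
    for path, text in parser.items:
        if path == subject_path:
            records.append([text])
        elif records:
            records[-1].append(text)
    return records


SAMPLE = """
<html><body>
<h1>Results</h1>
<div class="r"><a href="/1">Alpha Book</a><span>$10</span></div>
<div class="r"><a href="/2">Beta Book</a><span>$12</span></div>
<div class="r"><a href="/3">Gamma Book</a><span>$9</span></div>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE, "Beta Book"):
        print(record)
```

Given one sample subject ("Beta Book"), the sketch recovers all three records because the subject items share the tag path `html/body/div/a`. It makes no attempt to trim content that follows the last record, and it does not cover automatic subject recognition, which the paper's unsupervised mode would require.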

Author information

Corresponding author

Correspondence to Wachirawut Thamviset.

About this article

Cite this article

Thamviset, W., Wongthanavasu, S. Information extraction for deep web using repetitive subject pattern. World Wide Web 17, 1109–1139 (2014). https://doi.org/10.1007/s11280-013-0248-y

Keywords

  • Information extraction
  • Web data extraction
  • Web content mining
  • Subject pattern
  • Wrapper induction
  • Unsupervised learning