World Wide Web, Volume 22, Issue 2, pp 603–620

A novel approach for Web page modeling in personal information extraction

  • Wei Yuliang
  • Zhou Qi
  • Lv Fang
  • Han Xixian
  • Xin Guodong
  • Wang Bailing (corresponding author)
Part of the following topical collections:
  1. Special Issue on Deep vs. Shallow: Learning for Emerging Web-scale Data Computing and Applications


The goal of personal information extraction (PIE) is to extract content associated with a name from Web pages. Available Web page models, which are also widely used in content extraction and automatic wrapper algorithms, include the text model, the document object model, and the vision-based page segmentation model. Because existing models focus on Web structure rather than semantic relevance, they are difficult to apply directly to PIE. To address this problem, we introduce the sequence block model (SBM), with which it is easy to determine the relevance of each page block to the retrieved name. We then define PIE in terms of the SBM. Exploiting the sequence correlation of the SBM, we design a 4-layer seq2seq deep learning network for PIE. Experimental results show that our new model extracts twice as much data as content extraction algorithms, and that the recall of the network is 7% higher than that of the traditional model with a classification algorithm.


Sequence block model · Personal information · Seq2seq network · Deep learning · Web information extraction



This work is partially supported by the National Key Research and Development Program of China (No. 2016YFB0800802) and the Shandong Key Research and Development Plan (grants No. 2016ZDJS01A04 and No. 2017CXGC0706).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. Harbin Institute of Technology, Weihai, People’s Republic of China
