Extracting Structured Data from Web Pages with Maximum Entropy Segmental Markov Model

  • Susan Mengel
  • Yaoquin Jing
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5802)


Automated techniques can help extract information from the Web. This paper proposes a new semi-automatic approach, based on the maximum entropy segmental Markov model, for extracting structured data from Web pages. Two ideas motivate it: modeling the sequences that embed structured data, rather than their context, to reduce the number of training Web pages required; and preventing the training data from producing models that are too specific or too general. Experimental results show that the approach outperforms Stalker when only one training Web page is provided.


HTML extraction Markov Model 





Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Susan Mengel (1)
  • Yaoquin Jing (1)
  1. Computer Science, Texas Tech University, Lubbock
