Juicer: Scalable Extraction for Thread Meta-information of Web Forum

  • Yan Guo
  • Yu Wang
  • Guodong Ding
  • Donglin Cao
  • Gang Zhang
  • Yi Lv
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5477)

Abstract

In Web forums, the thread meta-information contained in the list of threads on a board page provides fundamental data for further forum mining. This paper describes a complete system named Juicer, which was developed as a subsystem for an industrial application that involves forum mining. The task of Juicer is to extract thread meta-information from the board pages of a great many large-scale online Web forums, which requires scalable extraction with high accuracy and speed and with minimal user effort for maintenance. Among the many existing approaches to information extraction, we could not find one that fully satisfies these requirements, so we present the simple, scalable extraction approach behind Juicer. Juicer consists of four modules: Template generation, Specifying labeling setting, Automatic extraction, and Label assignment. Both experiments and practical use show that Juicer satisfies the requirements.

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Yan Guo (1)
  • Yu Wang (1)
  • Guodong Ding (1)
  • Donglin Cao (1)
  • Gang Zhang (1)
  • Yi Lv (2)
  1. Institute of Computing Technology, Chinese Academy of Sciences, China
  2. State Key Laboratory of Computer Science, Institute of Software, Chinese Academy of Sciences, China
