Unsupervised User-Generated Content Extraction by Dependency Relationships

  • Jingwei Zhang
  • Yuming Lin
  • Xueqing Gong
  • Weining Qian
  • Aoying Zhou
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6997)


User-generated content is valuable for event detection, opinion mining, and similar tasks, but it is difficult to extract because Web 2.0 pages give users great freedom in how they present their content. Compared with machine-generated content, user-generated content is highly personalized: it often takes on complex styles, combines various kinds of information, and embeds much noise. Users' deep participation greatly changes the data acquisition environment and breaks the hidden assumption of traditional extraction methods, namely that Web pages are relatively regular. Traditional extraction methods therefore do not adapt well to complex user-generated content. In this paper, we treat user-generated content as unstable content and propose an unsupervised approach to extract high-quality user-generated content without noise. The stable information in machine-generated content, which traditional extraction methods often omit, is first picked up by a two-stage filtering operation consisting of page-level filtering and template-level filtering. A path accompanying distance is then defined to compute the dependency relationships between unstable and stable information, which guide us in locating user-generated content. Our approach takes full account of structure, content, and the dependency information between stable and unstable content to ensure the extraction accuracy of user data. The whole process requires no manual intervention. The experimental results show that the approach performs well and is robust.
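The general idea of separating stable from unstable content and then tying each unstable node to a nearby stable one can be illustrated with a small sketch. This is not the paper's algorithm: the abstract does not define the filtering rules or the path accompanying distance, so the page representation (lists of `(path, text)` pairs), the `min_share` threshold, and the toy `path_distance` function below are all assumptions made purely for illustration.

```python
from collections import Counter

def split_stable(pages, min_share=1.0):
    """Return (path, text) pairs that recur across pages (assumed 'stable').

    pages: list of pages, each a list of (path, text) pairs in document order.
    min_share: hypothetical threshold, the share of pages a pair must occur in.
    """
    counts = Counter()
    for page in pages:
        counts.update(set(page))  # count each pair at most once per page
    threshold = min_share * len(pages)
    return {pair for pair, n in counts.items() if n >= threshold}

def path_distance(a, b):
    """Toy stand-in for a path distance: steps left after the common prefix."""
    sa, sb = a.strip("/").split("/"), b.strip("/").split("/")
    common = 0
    for x, y in zip(sa, sb):
        if x != y:
            break
        common += 1
    return (len(sa) - common) + (len(sb) - common)

def locate_user_content(page, stable):
    """Pair every unstable node with its closest stable node (or None)."""
    stable_nodes = [node for node in page if node in stable]
    located = []
    for node in page:
        if node in stable:
            continue
        anchor = min(stable_nodes,
                     key=lambda s: path_distance(s[0], node[0]),
                     default=None)
        located.append((node, anchor))
    return located

# Two made-up pages that share a template but differ in the user's text.
pages = [
    [("/html/body/div/span", "Posted by"),
     ("/html/body/div/p", "Great phone!"),
     ("/html/body/footer", "Contact us")],
    [("/html/body/div/span", "Posted by"),
     ("/html/body/div/p", "Battery died fast."),
     ("/html/body/footer", "Contact us")],
]
stable = split_stable(pages)
located = locate_user_content(pages[0], stable)
```

In this toy input, the text that repeats on both pages is treated as stable template content, and the remaining text is attached to the stable node whose path is closest to it. A real system would need the paper's own definitions of page-level filtering, template-level filtering, and path accompanying distance.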


Keywords (machine-generated, not supplied by the authors): Unstable Region, Unsupervised Method, Stable Information, Redundant Path, Path Consistency





Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Jingwei Zhang (1)
  • Yuming Lin (1)
  • Xueqing Gong (1)
  • Weining Qian (1)
  • Aoying Zhou (1)
  1. Institute of Massive Computing, Software Engineering Institute, East China Normal University, Shanghai, China
