Probabilistically Ranking Web Article Quality Based on Evolution Patterns

  • Jingyu Han
  • Kejia Chen
  • Dawei Jiang
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7600)

Abstract

User-generated content (UGC) is created, updated, and maintained by various web users, and its data quality is a major concern to all users. We observe that each Wikipedia page usually goes through a series of revision stages, gradually approaching a relatively steady quality state and that articles of different quality classes exhibit specific evolution patterns. We propose to assess the quality of a number of web articles using Learning Evolution Patterns (LEP). First, each article’s revision history is mapped into a state sequence using the Hidden Markov Model (HMM). Second, evolution patterns are mined for each quality class, and each quality class is characterized by a set of quality corpora. Finally, an article’s quality is determined probabilistically by comparing the article with the quality corpora. Our experimental results demonstrate that the LEP approach can capture a web article’s quality precisely.
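
The first step described above, mapping a revision history to a state sequence with an HMM, can be illustrated with a standard Viterbi decoder. This is only a minimal sketch and not the authors' implementation: the two state names, the two observation symbols ("large" and "small" edits), and every probability below are hypothetical values chosen for illustration, since the abstract does not specify the model's states, observations, or parameters.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state sequence for the observations.

    Works in log space to avoid numeric underflow on long histories.
    All probabilities are assumed to be strictly positive.
    """
    # Log-score of the best path ending in each state after the first observation.
    prev = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    backpointers = []

    for obs in observations[1:]:
        curr, pointers = {}, {}
        for s in states:
            # Best predecessor state for reaching s at this step.
            best = max(states, key=lambda p: prev[p] + math.log(trans_p[p][s]))
            curr[s] = (
                prev[best] + math.log(trans_p[best][s]) + math.log(emit_p[s][obs])
            )
            pointers[s] = best
        backpointers.append(pointers)
        prev = curr

    # Trace back from the best final state.
    path = [max(states, key=lambda s: prev[s])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical toy model: an article is either still being built or stable.
states = ["building", "stable"]
start_p = {"building": 0.8, "stable": 0.2}
trans_p = {
    "building": {"building": 0.7, "stable": 0.3},
    "stable": {"building": 0.1, "stable": 0.9},
}
emit_p = {
    "building": {"large": 0.8, "small": 0.2},
    "stable": {"large": 0.1, "small": 0.9},
}

# Hypothetical revision history, one observation symbol per revision.
revisions = ["large", "large", "small", "small"]
path = viterbi(revisions, states, start_p, trans_p, emit_p)
print(path)
```

In the approach the abstract outlines, state sequences like `path` would then be the input to pattern mining per quality class; that later stage is not sketched here because the abstract gives no detail on it.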

Keywords

Hidden Markov Model; Support Vector Regression; Evolution Pattern; Quality Class; Observation Sequence
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Jingyu Han (1)
  • Kejia Chen (1)
  • Dawei Jiang (2)
  1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing, P.R. China
  2. School of Computing, National University of Singapore, Singapore
