Abstract
User-generated content (UGC) is created, updated, and maintained by many different web users, and its data quality is a major concern to all of them. We observe that each Wikipedia page usually goes through a series of revision stages, gradually approaching a relatively steady quality state, and that articles of different quality classes exhibit specific evolution patterns. We propose to assess the quality of web articles by Learning Evolution Patterns (LEP). First, each article’s revision history is mapped into a state sequence using a Hidden Markov Model (HMM). Second, evolution patterns are mined for each quality class, and each quality class is characterized by a set of quality corpora. Finally, an article’s quality is determined probabilistically by comparing the article with the quality corpora. Our experimental results demonstrate that the LEP approach can capture a web article’s quality precisely.
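The three steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the state names, observation symbols, probabilities, pattern corpora, and the contiguous-match scoring rule are all hypothetical values chosen for illustration. The sketch decodes a revision history into a state sequence with standard Viterbi decoding, then scores the sequence against a small set of patterns per quality class and normalizes the scores into probabilities.

```python
from math import log

# Hypothetical toy HMM. None of these states, observation symbols, or
# probabilities come from the paper; they only illustrate the idea.
STATES = ["building", "polishing", "stable"]
START = {"building": 0.8, "polishing": 0.15, "stable": 0.05}
TRANS = {
    "building": {"building": 0.6, "polishing": 0.3, "stable": 0.1},
    "polishing": {"building": 0.1, "polishing": 0.6, "stable": 0.3},
    "stable": {"building": 0.05, "polishing": 0.15, "stable": 0.8},
}
# Each observation is the coarse size of one revision's edit (assumed feature).
EMIT = {
    "building": {"large": 0.7, "medium": 0.2, "small": 0.1},
    "polishing": {"large": 0.1, "medium": 0.6, "small": 0.3},
    "stable": {"large": 0.05, "medium": 0.15, "small": 0.8},
}


def viterbi(observations):
    """Step 1: map a revision history to its most likely state sequence."""
    # scores[s] is the best log-probability of any path ending in state s.
    scores = {s: log(START[s]) + log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + log(TRANS[p][s]))
            new_scores[s] = scores[prev] + log(TRANS[prev][s]) + log(EMIT[s][obs])
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Step 2 (assumed already done): hypothetical patterns mined per quality class.
CORPORA = {
    "high": [["polishing", "stable"], ["stable", "stable"]],
    "low": [["building", "building"], ["stable", "building"]],
}


def count_occurrences(sequence, pattern):
    """Count contiguous occurrences of pattern in sequence (a simplification)."""
    n, m = len(sequence), len(pattern)
    return sum(1 for i in range(n - m + 1) if sequence[i:i + m] == pattern)


def class_probabilities(sequence, smoothing=1.0):
    """Step 3: turn smoothed pattern-match counts into class probabilities."""
    raw = {
        cls: smoothing + sum(count_occurrences(sequence, p) for p in patterns)
        for cls, patterns in CORPORA.items()
    }
    total = sum(raw.values())
    return {cls: value / total for cls, value in raw.items()}


if __name__ == "__main__":
    history = ["large", "large", "medium", "small", "small", "small"]
    states = viterbi(history)
    print(states)
    print(class_probabilities(states))
```

Contiguous matching and additive smoothing are used here only to keep the sketch short; a gapped or closed sequential pattern miner and a different similarity measure could be substituted without changing the overall shape of the pipeline.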
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Han, J., Chen, K., Jiang, D. (2012). Probabilistically Ranking Web Article Quality Based on Evolution Patterns. In: Hameurlain, A., Küng, J., Wagner, R., Liddle, S.W., Schewe, KD., Zhou, X. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems VI. Lecture Notes in Computer Science, vol 7600. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-34179-3_8
DOI: https://doi.org/10.1007/978-3-642-34179-3_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-34178-6
Online ISBN: 978-3-642-34179-3
eBook Packages: Computer Science; Computer Science (R0)