1 Introduction

Since its advent in 2001, Wikipedia has become the most sought-after online encyclopedia; its free distribution and wide coverage have radically changed the way people approach knowledge. By 2014, Wikipedia contained more than 4 million English articles ranging from the natural sciences to the social sciences. Moreover, Wikipedia provides a gold mine for researchers across disciplines. For researchers in the social sciences, Wikipedia serves as a prominent exemplar of using wiki technology for crowdsourcing; it offers a rich source of data for research on online collaboration, open innovation, and related topics.

Researchers utilizing Wikipedia data typically analyze the predictors of the quality of Wikipedia articles. Ironically, the quality of Wikipedia has long been a debatable issue, drawing attention not only from regular users who rely on its content as a knowledge source but also from researchers who study Wikipedia itself. There are two main streams of research related to the quality of Wikipedia articles: one stream models article quality, proposing methods to identify or distinguish article quality levels [1, 2]; the other uses quality as a dependent variable in studies of collaboration, conflict, or virtual teams in general [3, 4]. Studies from both streams rely on sample articles that have already been rated at different quality levels. The typical ways to derive the quality ratings are either to use Wikipedia's internal ratings directly or to hire external expert raters to evaluate the quality of randomly selected articles. Based on the quality rating criteria provided by Wikipedia, about 1 million of the 4 million articles have been assigned one or more quality labels; moreover, many of these articles have experienced quality promotion, demotion, or both over their development history. Kittur and Kraut [5] found a significant correlation between Wikipedia's internal article quality ratings and ratings assigned by external raters, indicating high consistency between the evaluation criteria applied within and outside of Wikipedia. The main limitation of that study is its reliance on only a small number of articles. So the question remains: do the ratings Wikipedia has assigned to around 1 million articles really reflect the criteria in Wikipedia's evaluation system?

We first refine this question before attempting to answer it. Constant change distinguishes Wikipedia articles from those of traditional encyclopedias; quality rating in Wikipedia is therefore a dynamic process. Kane et al. [6] suggest two stages of online collaboration: the creation stage, when information is developed and shaped, and the retention stage, when the created information is preserved and refined through ongoing collaboration. The development of Wikipedia articles can likewise be divided into these two stages; accordingly, article improvement has a different focus in each stage. In the creation stage, content completeness is the priority, while elaboration and minor details such as reliable references are attended to later. Wikipedia's criteria take into account complete, accurate content as well as detailed requirements on citations and writing style. In examining the mapping between quality and criteria, we focus on content change, since the completed content forms the backbone of an article on which all other minor changes are based. Content change is therefore the critical factor for a significant quality improvement.

There are seven quality scales in Wikipedia, each with a corresponding rating criterion. Observing the content-related rule in the criterion of each scale, we noticed that splitting the seven quality scales into two groups facilitates a better alignment of the content rule with article quality, since the two groups have clearly distinguishable content requirements that differ only slightly within each group. This grouping of quality scales is also adopted in a recent study proposing a model to evaluate article quality [7]. We define quality promotion as a change of quality scale from the lower-level group to the higher-level group, and quality demotion as a change from the higher-level group to the lower-level group. Having collected almost all English Wikipedia articles that experienced quality promotion or demotion in their first two quality ratings, denoted as promoted and demoted articles, we conduct a longitudinal analysis of content change by examining semantic convergence during the period of quality change. Semantic convergence measures content change by computing the semantic similarity of each version of an article with its last version.

2 Quality Promotion and Demotion

According to Wikipedia's quality rating system, an article's quality level ranges, in ascending order, over Stub, Start, C-class, B-class, Good Article (GA), A-class, and Featured Article (FA). Wikipedia provides guidelines for assessing articles along multiple dimensions. For instance, a Featured Article needs to be: (1) well written; (2) comprehensive; (3) well researched, with appropriate references; (4) neutral; (5) stable (no ongoing edit wars); (6) compliant with Wikipedia style guidelines, such as consistent citation and appropriate structure; (7) accompanied by appropriate media, mostly images with acceptable copyright status; and (8) of appropriate length and focused on the main topic.

Wikipedia has developed various mechanisms for quality rating centered around two principles: relying on the consensus of most reviewers and constraining the influence of major contributors. To promote an article to GA or FA, it first needs to be nominated by at least one registered editor; all other editors, excluding the article's significant contributors, can then review the article, give comments based on the corresponding criteria, and vote on the nomination. For a nomination to be promoted to GA or FA, a consensus among most reviewers must be reached. For GA assessment, the nominator is not allowed to participate in the review. Quality assessments for the other levels (e.g., B-class, C-class) are normally performed by members of a WikiProject. A WikiProject comprises a collection of articles in the same domain or on a specific topic, together with a group of editors who collaborate on these articles. Similarly, a consensus among WikiProject members is required for the rating.

In general, an article changing from any level to a higher level is considered promoted, and demoted in the reverse direction. Nevertheless, instead of exhausting all possible changes, we simplify the problem by first grouping quality levels and then adjusting the definition of promotion and demotion accordingly. We cluster GA, A, and FA as the advanced group and B, C, Start, and Stub as the underdeveloped group. The reasoning behind this grouping is primarily based on the reader's experience and Wikipedia's editing suggestions. According to Wikipedia, readers tend to perceive FA articles as professional, outstanding, and thorough; A-class articles as very useful and fairly complete; and GA articles as useful, without obvious problems, and approaching the quality of professional encyclopedia articles. For FA articles, no further content is needed unless new information becomes available. Some style problems may need solving in GA and A-class articles, and GA articles may need some editing by subject and style experts.

Moving down to B-class articles, the content is probably not enough to satisfy a serious researcher: a few aspects of content and style need to be addressed, and the inclusion of supporting materials as well as a better style should also be considered. The situation is worse in C-class articles, which give no complete picture for a detailed study and need considerable editing. In Start-class articles, users can find some meaningful content, but most readers will need more; these articles are advised to add reliable sources and to improve substantially in content and structure.

From the above descriptions of each quality class provided by Wikipedia, there seemingly exists a line distinguishing GA, A, and FA articles from B, C, Start, and Stub articles: the need for considerable content change. GA, A, and FA articles need expert-level elaboration and minor updates, while articles at the lower levels require more content. This distinction matches the two dimensions of our study: content change and quality motion. Based on the grouping, we characterize promotion as a change from any level in the underdeveloped group (B, C, Start, or Stub) to one in the advanced group (GA, A, or FA), and demotion as a change from the advanced group to the underdeveloped group.

As mentioned above, around 1 million Wikipedia articles have received a quality rating, and many of them have had more than two quality assessments. In our research, we focus only on an article's first quality motion. For example, when an article has an assessment history like "B → GA → FA", we study the content change during the period from B to GA. We align the promoted and demoted articles by their first quality motion, regardless of their age and creation date. In the following section, we provide the details of collecting and preparing the data. We then describe the semantic convergence method and present the semantic convergence for the different groups of quality motion.

3 Data Collection

We extract the data from the English Wikipedia as of January 2014. It includes over 4 million articles, of which 3,799 are labeled FA, 18,616 GA, and 674 A-class; there are also 64,021 B-class and 118,906 C-class articles. We did not consider articles currently rated Start or Stub class, since we focus only on articles that have already undergone one creation stage.

Each Wikipedia article has its own talk page where editors discuss the editing of the article. Wikipedia stores not only the most recent version of each article and its talk page but also every effective change made to them, in the form of revisions, throughout their lifecycle. A revision of a Wikipedia article is a version of the article whose changes are made relative to the previous revision. When a revision is submitted, in addition to the updated article content, Wikipedia records the editor who created the revision and the submission time. The same recording procedure applies to talk pages. Additionally, each version of the talk page contains new comments and new meta information about the article, including which WikiProject the article belongs to, suggestions for improving the article, and the article's current quality level. We retrieve the quality information of the articles from their talk pages.

To find articles that have been promoted or demoted, we first retrieve the quality rating history for all 205,987 articles presently labeled C-class, B-class, GA, A-class, or FA. We derive the quality ratings by checking the meta information through all revisions of each article's talk page. We obtain the quality rating history for around 201,500 of the 205,987 articles; about 5,000 articles were dropped because of missing information on the talk page. The quality ratings are recorded in the form \( q_{1}(t_{1}) \ldots q_{n}(t_{n}) \), where \( q_{i} \) is the \( i^{\text{th}} \) quality rating at timestamp \( t_{i} \). An article is considered to be at level \( q_{i} \) from \( t_{i} \) until a different quality level is assigned, i.e., \( q_{i} \) differs from its adjacent quality ratings.

Next, we collect all articles that have at least two quality assessments in their histories. Checking their first two quality levels, regardless of their present level, we select the articles for which one of the two quality classes is from the advanced group (GA, A, FA) and the other is from the underdeveloped group (B, C, Start, Stub). We then cluster them as promoted articles (PA) and demoted articles (DA) according to the order of the quality change. We finally obtain 7,653 promoted articles and 525 demoted articles; both clusters contain articles presently assessed as C-class, B-class, GA, A-class, or FA. This is our base dataset, on which we check the relation between the pattern of the first quality motion and the present quality of these articles.
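The selection step above can be sketched in a few lines. This is a simplified illustration, not the authors' actual pipeline: the function name `first_quality_motion` and the assumption that each rating history arrives as an ordered list of (quality, timestamp) pairs with adjacent ratings distinct are ours.

```python
# Classify an article's first quality motion from its rating history,
# assuming the history is an ordered list of (quality, timestamp) pairs.
ADVANCED = {"GA", "A", "FA"}
UNDERDEVELOPED = {"B", "C", "Start", "Stub"}

def first_quality_motion(history):
    """Return 'PA' (promoted), 'DA' (demoted), or None based on the
    first two quality ratings only."""
    if len(history) < 2:
        return None
    q1, q2 = history[0][0], history[1][0]
    if q1 in UNDERDEVELOPED and q2 in ADVANCED:
        return "PA"   # promoted article
    if q1 in ADVANCED and q2 in UNDERDEVELOPED:
        return "DA"   # demoted article
    return None       # first motion stays within one group

# e.g. history "B -> GA -> FA": only the first motion (B -> GA) counts
print(first_quality_motion([("B", 1), ("GA", 2), ("FA", 3)]))  # PA
```

Later assessments are deliberately ignored, mirroring the paper's focus on the first quality motion.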

Interestingly, across all promoted and demoted articles, we found that the first quality motion pattern indeed has an impact on present quality status: (1) more than 95 % of the promoted articles (7,399 out of 7,653) stay in the advanced group, and fewer than 5 % suffered demotion after having been promoted; (2) only about 40 % of the demoted articles (213 out of 525) were promoted back to the advanced group after their demotion, while the rest (60 %) stay in the underdeveloped group. Finding (1) shows that if an article is promoted from the underdeveloped group to the advanced group in its first quality motion, it is likely to remain at least a Good Article. This initial finding further underlines the importance of the changes made between the first two quality assessments.

For all the promoted and demoted articles, we do the following: given the timestamps \( t_{1} \) and \( t_{2} \) corresponding to the first quality rating \( q_{1} \) and the second quality rating \( q_{2} \), we extract every revision of the article issued between \( t_{1} \) and \( t_{2} \) in order to capture the content changes and compute the semantic convergence.

4 Semantic Convergence to Measure Content Change

The semantic convergence measure applied in Wikipedia research was first designed to assess the stability of Wikipedia articles in order to automatically judge their maturity [8]. In our research, we adopt a similar method to measure content change during the time from the first quality assessment to the second. The details of the method, and the validity of using it to measure content change, are stated as follows.

4.1 Vector Space Model

“Semantic” is a commonly used term in the natural language processing domain. It is normally referred to when comparing the similarity of two documents in terms of their topic and content: when two documents match in content, they are considered semantically similar. If a vector is used to represent a document, the degree of similarity is typically computed as the cosine of the angle between the two vectors, which ranges from −1 to 1. A value of 1 indicates that the two documents are identical or very closely matched, while a value of −1 indicates the opposite. (For term-frequency vectors, whose entries are non-negative, the cosine in fact lies between 0 and 1.) Among the various vector representations of documents, we use the Term Frequency (TF) vector in our study, explained in the following.

We take each revision of an article as a single document. Suppose an article has N revisions issued during the first quality motion period. The article then yields a set of N documents, denoted by:

$$ D = \{ d_{k } |k \in 1 \ldots N\} , $$

where \( d_{k} \) is the \( k^{\text{th}} \) document, i.e., the \( k^{\text{th}} \) revision. We then establish a vocabulary from all documents in D, i.e., a list of all distinct words appearing in D. Let \( w_{1}, w_{2}, \ldots, w_{m} \) be the words in the vocabulary. The (sparse) representation of the TF values of the words in \( d_{k} \) is the vector

$$ \overrightarrow {{d_{k} }} = \left\{ {{\text{TF}}\left( {{\text{w}}_{{1,{\text{k}}}} } \right), \ldots , {\text{TF}}\left( {{\text{w}}_{{{\text{m}},{\text{k}}}} } \right)} \right\}, $$

where \( {\text{TF}}(w_{i,k}) \) is the Term Frequency of word \( w_{i} \) (\( i \in 1 \ldots m \)) in document \( d_{k} \), i.e., the number of occurrences of \( w_{i} \) in \( d_{k} \). For words listed in the vocabulary that do not appear in the document, the TF value is zero. Given a different document \( d_{l} \) with vector representation \( \overrightarrow {{d_{l} }} = \left\{ {{\text{TF}}\left( {{\text{w}}_{{1,{\text{l}}}} } \right), \ldots , {\text{TF}}\left( {{\text{w}}_{{{\text{m}},{\text{l}}}} } \right)} \right\} \), the semantic similarity between \( d_{l} \) and \( d_{k} \) is computed as the cosine of the two vectors:

$$ \cos \left( {\overrightarrow {{d_{k} }} , \overrightarrow {{d_{l} }} } \right) = \frac{{\overrightarrow {{d_{k} }} . \overrightarrow {{d_{l} }} }}{{\left| {d_{k} } \right|\left| {d_{l} } \right|}} , $$

where the numerator is the dot product of the two vectors:

$$ \overrightarrow {{d_{k} }} . \overrightarrow {{d_{l} }} = {\text{TF}}\left( {{\text{w}}_{{1,{\text{k}}}} } \right) \times {\text{TF}}\left( {{\text{w}}_{{1,{\text{l}}}} } \right) + \ldots + {\text{TF}}\left( {{\text{w}}_{{{\text{m}},{\text{k}}}} } \right) \times {\text{TF}}\left( {{\text{w}}_{{{\text{m}},{\text{l}}}} } \right) $$

and in the denominator “\( \left| {d_{k} } \right| \)” denotes \( \sqrt {{\text{TF}}\left( {{\text{w}}_{{1,{\text{k}}}} } \right)^{2} + \ldots + {\text{TF}}\left( {{\text{w}}_{{{\text{m}}, {\text{k}}}} } \right)^{2} } \).
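The cosine computation above can be sketched directly on sparse TF vectors. This is an illustrative implementation, not the paper's Lucene-based one; the helper names `tf_vector` and `cosine` are ours.

```python
import math
from collections import Counter

def tf_vector(tokens):
    """Term-frequency vector of one revision, as a word -> count mapping
    (words absent from the revision implicitly have TF = 0)."""
    return Counter(tokens)

def cosine(d_k, d_l):
    """Cosine similarity between two sparse TF vectors."""
    # dot product over the shared vocabulary only (other terms are zero)
    dot = sum(d_k[w] * d_l[w] for w in d_k.keys() & d_l.keys())
    norm_k = math.sqrt(sum(v * v for v in d_k.values()))
    norm_l = math.sqrt(sum(v * v for v in d_l.values()))
    if norm_k == 0 or norm_l == 0:
        return 0.0
    return dot / (norm_k * norm_l)

rev1 = tf_vector("wikipedia article quality article".split())
rev2 = tf_vector("wikipedia article quality rating".split())
print(round(cosine(rev1, rev1), 3))  # 1.0 (a revision compared to itself)
```

Since TF entries are non-negative, the similarity returned here always lies in [0, 1].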

In this way, a matrix is built for every Wikipedia article, with rows representing the revisions in the order they were issued and columns representing the words of the vocabulary:

$$ Article \,Matrix = \left( {\begin{array}{*{20}c} {{\text{TF}}\left( {{\text{w}}_{1,1} } \right)} & \cdots & {{\text{TF}}\left( {{\text{w}}_{{{\text{m}},1}} } \right)} \\ \vdots & \ddots & \vdots \\ {{\text{TF}}\left( {{\text{w}}_{{1,{\text{N}}}} } \right)} & \cdots & {{\text{TF}}\left( {{\text{w}}_{{m,{\text{N}}}} } \right)} \\ \end{array} } \right) = \left( {\begin{array}{*{20}c} {\overrightarrow {{d_{1} }} } \\ \vdots \\ {\overrightarrow {{d_{N} }} } \\ \end{array} } \right). $$
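Assembling this article matrix from tokenized revisions can be sketched as follows; `article_matrix` is a hypothetical helper of ours, and a dense NumPy array is used for clarity where a sparse representation would be preferred in practice.

```python
import numpy as np

def article_matrix(revisions):
    """Build the N x m TF matrix: one row per revision (in issue order),
    one column per vocabulary word (sorted for a stable column order)."""
    vocab = sorted({w for rev in revisions for w in rev})
    index = {w: j for j, w in enumerate(vocab)}
    M = np.zeros((len(revisions), len(vocab)), dtype=int)
    for i, rev in enumerate(revisions):
        for w in rev:
            M[i, index[w]] += 1  # TF(w_j, i): count of word j in revision i
    return M, vocab

# toy article with two revisions
revs = [["stub", "text"], ["stub", "text", "more", "text"]]
M, vocab = article_matrix(revs)
print(M.shape)  # (2, 3): 2 revisions, 3 distinct vocabulary words
```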

The reason we consider the above vector model appropriate for measuring content change lies in the types of actions leading to the change. As in any other text revision, the primary editing actions in a Wikipedia article include [4]: (1) insertion or deletion of a sentence; (2) modification or rewording of an existing sentence; (3) linking an existing word to another Wikipedia article or to an external web page; (4) changing the URL or the name of an existing link; (5) deletion of an existing link; (6) adding or deleting a reference; (7) modification of an existing reference; and (8) reverting an article to a former version. All of these actions, to a greater or lesser extent, cause changes in word frequencies. For example, inserting a new sentence means the vector representing the new revision will have larger values in some of its entries; the more the values change, the more the revision differs from the previous one. In prior work [9], the distance between two revisions was represented by the count of inserted and deleted words, computed by a complex algorithm that becomes quite time-consuming when an article is long and has many revisions. In our study, we first index the words of each revision using Lucene and then build the vectors from the index. This way, the matrix computation is much faster than word-counting methods, and it can be improved further by using algorithms optimized for sparse vectors. We therefore consider the vector space model more robust and efficient than other models for measuring document distance.

4.2 Revision Milestone

After obtaining the article matrix composed of the vectors representing all the requested revisions, we can compute the semantic similarity between any two revisions. Before doing so, however, two problems need to be addressed. First, vandalism is common in Wikipedia; since a vandalizing edit is also recorded as a revision, it adds noise to the analysis of effective content change. Second, to explore the characteristics of content change for a group of articles, such as the promoted and demoted groups, we need to align the articles in some way, because they vary in the number of revisions and in the timestamps of those revisions.

Inspired by the work of Thomas and Sheth [8], we use revision milestones to address these problems. A revision milestone is an abstract revision representative of the content change made through a cluster of real revisions. Since a revision milestone is itself a “revision”, it can be represented as a vector. Consider an article matrix as in Eq. (5), where each row is the vector of one revision and the rows are chronologically ordered by revision timestamp. We cluster the revision vectors from top to bottom of the matrix using a one-week timeframe; that is, the revisions belonging to one cluster are issued within one week of the timestamp of the first revision in that cluster. For example, we derive the first cluster matrix from the article matrix as:

$$ Cluster_{1} = \left( {\begin{array}{*{20}c} {{\text{TF}}\left( {{\text{w}}_{1,1} } \right)} & \cdots & {{\text{TF}}\left( {{\text{w}}_{{{\text{m}},1}} } \right)} \\ \vdots & \ddots & \vdots \\ {{\text{TF}}\left( {{\text{w}}_{{1,{\text{i}}}} } \right)} & \cdots & {{\text{TF}}\left( {{\text{w}}_{{m,{\text{i}}}} } \right)} \\ \end{array} } \right), $$

where \( 1 \le i \le N \); \( t_{{rev_{1} }} \le t_{{rev_{i} }} \le (t_{{rev_{1} }} + 1 week) \). The revision milestone vector of the first cluster, denoted by \( \overrightarrow {{RM_{1} }} \), is defined as:

$$ \overrightarrow {{RM_{1} }} = median \left( {Cluster_{1} } \right), $$

where “median” takes the median value of the entries in each column of the cluster matrix.

Using revision milestone vectors, we can align articles regardless of their different development times. Moreover, taking the median value as each entry of the revision milestone helps eliminate the impact of revert wars and random vandalism. Hence, we represent each article by its revision milestone vectors instead of its revision vectors; the rows of the new article matrix are the revision milestone vectors ordered by week, and the columns are the words of the vocabulary.
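The weekly clustering and median step can be sketched as below. The function name `revision_milestones` is ours, and it assumes timestamps and matrix rows arrive in chronological order, as in the article matrix.

```python
from datetime import datetime, timedelta
import numpy as np

WEEK = timedelta(weeks=1)

def revision_milestones(timestamps, matrix):
    """Collapse chronologically ordered revision vectors into weekly
    milestone vectors: the column-wise median over each one-week cluster,
    which dampens revert wars and random vandalism."""
    milestones, start, cluster = [], None, []
    for t, row in zip(timestamps, matrix):
        if start is not None and t > start + WEEK:
            # close the current cluster and start a new one at this revision
            milestones.append(np.median(cluster, axis=0))
            start, cluster = t, []
        if start is None:
            start = t
        cluster.append(row)
    if cluster:
        milestones.append(np.median(cluster, axis=0))
    return np.array(milestones)

# toy example: two revisions in week one, a third ten days later
ts = [datetime(2014, 1, 1), datetime(2014, 1, 4), datetime(2014, 1, 11)]
tf = np.array([[1, 0], [3, 0], [2, 2]])
RM = revision_milestones(ts, tf)
print(RM)  # two milestones: the first week's column medians, then the last row
```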

4.3 Semantic Convergence

The next step is to measure content change using the new article matrix composed of revision milestone vectors. Given an article with L revision milestones, we display its semantic convergence by computing the cosine similarity, as shown in Eq. (3), between each \( \overrightarrow {{RM_{l} }} \left( {l \in 1 \ldots L} \right) \) and the last vector \( \overrightarrow {{RM_{L} }} \). In this way, we can track the major content change towards the last version that dynamically leads to a quality motion.

The same semantic convergence computation is applied to each article, after which we align the articles by grouping the cosine similarity values with the same index (corresponding to the index of their revision milestones, from the 1st week to the last week).
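The per-article convergence curve can be sketched as a vectorized cosine against the final milestone. This is an illustrative helper of ours; it assumes no milestone vector is all-zero.

```python
import numpy as np

def semantic_convergence(milestones):
    """Cosine similarity of each revision-milestone vector with the final
    one (Eq. (3) applied row-wise); assumes non-zero milestone vectors."""
    last = milestones[-1]
    norms = np.linalg.norm(milestones, axis=1) * np.linalg.norm(last)
    return milestones @ last / norms

# toy milestone matrix: content drifts towards the final version
RM = np.array([[1.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
curve = semantic_convergence(RM)
print(curve.round(3))  # the last entry is always exactly 1.0
```

Plotting such curves, grouped by milestone index across articles, yields figures like Fig. 1.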

5 Comparison of Semantic Convergence

By displaying the semantic convergence over the revision milestone index, we examine the pattern of the content change of the promoted and demoted articles during the period between the times of their first and second quality assessments. The promoted articles have had their first quality level from the underdeveloped group (B, C, Start, Stub) upgraded to one level in the advanced group (GA, A, FA), while the demoted articles have the quality motion in the opposite direction.

In addition, looking back at our grouping of quality levels and the definition of quality motion, we recall (according to Wikipedia's statements) that: (a) considerable content change distinguishes articles in the underdeveloped group from those in the advanced group; and (b) quality motion within the advanced group should be caused by other, minor updates rather than by significant content change. To confirm these two statements computationally, we also display the semantic convergence of articles whose first quality motion occurred within the advanced group. We cluster the articles by their motion pattern: (1) promotion_GA → A; (2) promotion_GA → FA; (3) promotion_A → FA; (4) demotion_A → GA; (5) demotion_FA → GA; (6) demotion_FA → A, and show their semantic convergence separately.

5.1 Semantic Convergence of Quality Motion Across Groups

As mentioned above, we have 7,653 promoted articles and 525 demoted articles in our dataset. To balance the sample sizes of the two groups, we randomly select 600 promoted articles and take all 525 demoted articles. We filter out articles having fewer than 3 revision milestones, on the assumption that content change becomes more stable and effective only after at least three weeks. Further, to better align the articles, we split the selected articles into 3 subgroups according to the number of their revision milestones: from 3 to 20 in the first subgroup, from 21 to 50 in the second, and from 51 to 100 in the third. We end up with 358 promoted articles and 187 demoted articles in total. Notice that about 60 % of the demoted articles have fewer than 3 revision milestones; this already indicates that remarkable content change is unlikely to be the main reason for quality demotion. We further check the semantic convergence of the sampled articles.

In Fig. 1, the blue curve shows the mean semantic similarity of each revision milestone to the final revision milestone. In each subgroup, the similarity distance from the first revision milestone to the final one is larger for promoted articles than for demoted articles. This can be interpreted as much editing effort being devoted to content improvement for quality promotion, even if with some fluctuation along the way. In contrast, the semantic convergence of demoted articles is faster, without a noticeable semantic distance. In general, the semantic convergence patterns computationally reflect Wikipedia's rating mechanism. On one hand, considerable improvement in content is needed for underdeveloped articles to be promoted to an advanced quality level; specifically, this means complete coverage of the topic with abundant facts and resources as well as a professional writing style. On the other hand, there is no strong evidence that content issues are critical for quality demotion; it is more likely caused by outdated references or by problems in article structure or writing style. For the promoted articles, we also see that the more revision milestones the articles have, the larger the similarity distance is.

Fig. 1.
figure 1

Semantic convergence across groups (Color figure online)

In terms of content stability, for the demoted articles we see no turbulence in any of the three subgroups, and the same holds for the first two subgroups of promoted articles. Some fluctuation is visible in the third promoted subgroup after 55 revision milestones; this is mostly due to the drop in the number of articles with more than 55 revision milestones.

5.2 Semantic Convergence of Quality Motion Within Advanced Groups

We now show the semantic convergence of quality motion within the advanced group, for comparison with quality motion across groups. From our original dataset, we extract all articles matching the motion patterns within the advanced group; Table 1 shows the number of sampled articles. Since there are fewer such articles, instead of dividing them by the number of revision milestones as in Sect. 5.1, we simply filtered out the articles having far more revision milestones than the average for each case.

Table 1. Sampled articles within group quality motion

The results in Fig. 2 show that none of the demotion and promotion cases within the advanced group exhibits significant semantic change over the period of quality motion. This is consistent with Wikipedia's suggestion on how to improve article quality when the current level is already high: rather than content change, expert knowledge is needed.

Fig. 2.
figure 2

Semantic convergence within advanced group

5.3 Tests on the Distribution of Means Across Groups

Having seen the difference in the semantic convergence patterns of promotion and demotion across groups, we now test the discrepancy between the distributions of the means in the two groups. The mean of the semantic convergence from the first revision to the final revision represents the effort needed for an article to be promoted or demoted: a smaller mean indicates a larger effort. As Fig. 1 depicts, the mean of promoted articles across groups is smaller than that of demoted articles; we performed the non-parametric Kolmogorov-Smirnov (KS) test and confirmed that the mean in the promotion cases is statistically smaller than in the demotion cases for each subgroup, with p-value = 0.0003892 for the first subgroup, p-value = 6.715e-15 for the second, and p-value < 2.2e-16 for the third.
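A two-sample KS comparison of this kind can be sketched with SciPy. The per-article mean values below are synthetic stand-ins (not the paper's data), drawn so that promoted articles have smaller means; note the KS test compares the full empirical distributions, not the means alone.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# hypothetical per-article mean-convergence values: promoted articles
# start farther from their final version, hence smaller means
promoted_means = rng.uniform(0.2, 0.7, size=300)
demoted_means = rng.uniform(0.6, 0.95, size=180)

# two-sided two-sample Kolmogorov-Smirnov test
stat, p = ks_2samp(promoted_means, demoted_means)
print(p < 0.05)  # True: the two distributions differ significantly
```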

6 Conclusion

Some prior studies of Wikipedia phenomena rely on its internal quality ratings of articles. Though Wikipedia provides a set of criteria and implements a voting mechanism for quality evaluation, the validity of the internal ratings had yet to be examined. Our study is one of the first to show the mapping between the stated evaluation criteria and the quality ratings of Wikipedia articles. We investigate one of the most important evaluation criteria, the content rule, in a computational way. We check to what extent content, in terms of quantity change and stability, affects quality change in both directions (promotion and demotion), and whether the result is consistent with the criteria stated by Wikipedia. We measure content change by computing the semantic similarity of every revision of an article to its final revision, from the first quality rating until the first quality motion occurs, to see how the semantics converge. We define quality promotion as a change of quality scale from the underdeveloped group to the advanced group, and quality demotion as the reverse. To show the semantic convergence of a group of articles, namely a group of promoted or demoted articles, we align the articles' semantic convergence by their revision milestones.

By examining the aligned semantic convergence of the articles, we found that the quantity of content change is significant in the promoted articles, which complies with Wikipedia's stated criteria. We also saw slight content instability in some demoted articles, though the phenomenon is not as clear at the group level. We thereby conclude that quality demotion may be driven by major issues other than content instability, such as outdated references and missing links. Overall, our findings suggest that Wikipedia's evaluation for promotion is significantly influenced by content change, whereas the evaluation for demotion is influenced by other factors stated in Wikipedia's evaluation criteria. Wikipedia's assigned quality ratings may thus be a reliable outcome variable for research, provided that researchers compare articles from the lower-quality group (B-class, C-class, Start, and Stub) with the higher-quality group (Featured, A-class, and Good). A more fine-grained outcome variable is not recommended, as the content requirements in the evaluation criteria differ only slightly within the advanced group. Nevertheless, for articles that experience quality promotion within the underdeveloped group, for example from Start-class to B-class, we expect larger content change to take place.