1 Introduction

In Human-Computer Interaction (HCI), the readability of text is one of the primary requirements for HCI systems designed to provide textual information [1]. The readability of text is defined as the degree of comprehension imparted by the underlying text. One such HCI system serving textual information is Wikipedia, which is regarded as one of the most exhaustive sources of textual information. According to Alexa rankings, Wikipedia has outperformed all other information sources on the Internet with respect to users’ traffic [2]. Wikipedia is crowdsourced: it is curated and maintained by the crowd, and any Wikipedia article can be edited by any individual, referred to as an editor. The edits made by editors include adding, deleting, rephrasing, and restructuring the information present in an article. Hence, the crowdsourced nature of Wikipedia articles allows the knowledge of different people around the globe to be combined. Although Wikipedia is a source of exhaustive information, the majority of Wikipedia users are readers rather than producers of information [3]; the high traffic on Wikipedia comprises mostly readers consuming information rather than editors contributing to the articles. For a reader, one of the primary requirements is the comprehension of the underlying text, which is captured by readability. In view of the importance of readability for an HCI system serving textual information and the popularity of Wikipedia as an information source, it becomes imperative to study the readability of Wikipedia.

1.1 Research gap

A previous study on the readability of Wikipedia established that the crowdsourced nature of its articles does affect their readability [4]. The crowdsourced parameters used in that study to capture the nature of crowdsourced articles were the edit coefficient, the stylistic coefficient, and the knowledge gap parameter. In [4], the authors develop a classification model to distinguish readable from non-readable articles and conclude that the knowledge gap parameter affects readability the most. The knowledge gap parameter captures the missing pieces of information in Wikipedia articles, which hinder the comprehension of the underlying text. It should also be noted that many past NLP-based studies have shown that text-based parameters, such as lexical diversity, semantic diversity, lexical complexity, and semantic complexity, are correlated with the readability of the underlying text [5,6,7,8,9,10]. However, no study analyses the effect of all of these NLP-based parameters on readability together; each of the aforementioned studies examines the effect of only one or two parameters at a time. In addition, no study on crowdsourced platforms such as Wikipedia analyses the joint effect of NLP parameters and crowdsourced parameters on the readability of articles. The present study explains the effect of both NLP parameters and crowdsourced parameters on the readability of articles, and also takes into consideration the changes in NLP parameters caused by crowdsourced parameters, which in turn affect the readability of the text.

It is clear from the aforementioned studies that text-based parameters affect the readability of text. In the case of Wikipedia articles, however, crowdsourced parameters affect readability as well. In the present study, we first use existing tools and techniques to calculate the NLP-based features. We then build an Artificial Neural Network (ANN) to quantify the effect of NLP-based features on the readability of articles. According to the results obtained, the NLP-based features do affect the readability of the articles. However, in the case of Wikipedia articles, it is the crowd that shapes these features: since any piece of text can be added, deleted, or modified by the editors, multiple editors author a single Wikipedia article. Hence, the crowd dynamics on Wikipedia are responsible for the changes in the NLP-based features of an article, which in turn lead to changes in its readability scores. To quantify the effect of crowdsourced parameters on the readability of articles, we analyze not only their direct effect on readability scores but also their effect on the NLP-based features that lead to changes in readability scores. Hence, we perform mediation analysis with the NLP-based features as the mediating variable. The aim of the mediation analysis is to quantify the extent to which the effect of crowdsourced parameters on readability operates through NLP-based features. The results show that the effect of crowdsourced parameters on the readability of articles is partially due to the changes in NLP-based features introduced by the crowd.

1.2 Present study

In this study, we consider five NLP-based parameters related to the readability of text: semantic complexity, lexical complexity, sentiment, lexical diversity, and semantic diversity exhibited by the text. These parameters are selected because they have been found to be correlated with the readability of text [11, 12]. We train a feed-forward artificial neural network using these parameters to study the relation between them and the readability of the articles.

However, the crowd dynamics may be responsible for achieving the desired text-based features, such as lexical diversity: text written by multiple editors is more likely to be lexically diverse. The footprints left by the crowd while editing readable articles must be discerned in order to engineer the production of more readable articles on Wikipedia. We investigate parameters specific to the crowdsourced nature of Wikipedia articles that may affect the text-based features, which in turn affect readability. The crowdsourced parameters investigated are the ratio of experienced editors editing a particular article and the standard deviation of the edits contributed by its editors. For this, we design a theoretical model that judges the mediating effect of NLP-based features (particularly lexical diversity) between the above-mentioned crowdsourced parameters and the readability of Wikipedia articles. The mediation analysis judges whether the effect of crowdsourced parameters on readability is mediated by changes in NLP-based features.

The present study is divided into two phases. In the first phase, we build an artificial neural network to find the most dominant NLP-based feature affecting the readability of articles; according to the results obtained, the parameters related to lexical diversity are the most dominant. In the second phase, we perform a mediation analysis that judges whether the effect of crowdsourced parameters on readability is channelled through NLP-based features. The results of the mediation analysis suggest a partial mediation effect of lexical diversity between the two crowdsourced parameters and the readability of the Wikipedia articles. Hence, the first phase verifies the effect of NLP-based parameters on the readability of Wikipedia articles, while the second phase judges the effect of crowdsourced parameters on readability, with lexical diversity as the mediator variable. The mediation analysis sheds light on the effect of crowdsourced parameters on lexical diversity, which in turn affects the readability scores.

Overall, the findings suggest the importance of including text-based features in studying the effect of crowdsourced parameters on the readability of Wikipedia. The main contributions of our work are as follows.

  1. Understanding the impact of NLP-based features on the readability of crowdsourced articles.

  2. Understanding the impact of crowdsourced parameters on the readability of articles.

  3. Analyzing the effect of crowdsourced parameters on the lexical diversity of articles.

  4. Finding the mediating effect of NLP-based parameters on readability due to the crowdsourced parameters.

1.3 Organization of the article

The remainder of the article is organized as follows. Section 2 discusses related studies on NLP-based features affecting the readability of articles, along with a summary of recent studies on the readability of Wikipedia articles. The tools used and the dataset details are discussed in Section 3 and Section 4, respectively, with further details of the tools given in Appendix A. Next, we divide the methodology and results into two phases: Section 5.1 discusses the methodology and results of the neural network built on NLP-based features (Phase 1), and Section 5.2 discusses the methodology and results of the mediation analysis of NLP-based features on readability due to crowdsourced parameters (Phase 2).

2 Related work

In our study, we consider NLP-based parameters and crowdsourced parameters affecting the readability of Wikipedia articles. In this section, we first discuss past studies on NLP parameters that affect the readability of text. Next, we discuss the literature on crowdsourced parameters that specifically affect the readability of Wikipedia articles.

2.1 NLP Based parameters

This section discusses the literature on NLP-based parameters and their effect on the readability of the underlying text. It enumerates the past studies describing the NLP-based parameters that affect the readability of text. These parameters can be grouped under five broad categories, listed below.

2.1.1 Cohesion

Many past researchers have investigated parameters that can act as predictors of text readability. One such parameter is the cohesion of the text. Cohesion refers to the grammatical linking in sentences, which imparts meaning to them. Past studies have clearly shown that cohesive texts are easier for a reader to comprehend [13,14,15,16]. In 2013, Todirascu et al. [6] demonstrated that the cohesion of a text is a predictive variable for its readability. The authors calculated the correlation of the readability of the text (annotated manually) with 41 variables measuring its cohesion. Some of the variables were calculated using the similarity between adjacent sentences via cosine similarity in LSA (Latent Semantic Analysis) space, POS (part-of-speech) tagging, and other techniques. The LSA-based parameters were found to be highly correlated with the readability of the text.

In addition, a study by Rezeaee et al. [7] showed the correlation between readability and cohesion. In this study, readability was calculated using standard readability metrics, and cohesion was calculated using grammatical markers, lexical markers, and conjunctions such as and, or, then, and so on. In another study focused on calculating concept-based readability, document cohesion was used as one of the parameters of the readability metric [17]. Document cohesion can be calculated using the Leacock-Chodorow semantic similarity measure [18] and captures the cohesion between the concepts explained in a document. The results show that higher document cohesion led to better readability values.

2.1.2 Sentiment analysis

The sentiments conveyed by a text are associated with the feelings, emotions, and opinions expressed by it [8, 19]. The readability of text is also found to be a function of the sentiments it conveys [20], since a reader’s opinion may differ from the one conveyed by the text. Some past studies also claim that the opinion conveyed in a text influences cognition and, hence, the comprehension of that text [21].

There are generally two ways of measuring the underlying sentiment [22]. One is the lexical methodology, where a list of sentiment words (such as good, bad, happy), along with the intensity of sentiment conveyed by each word, is given in advance; the sentiment of a sentence is quantified by extracting its sentiment words and looking them up in this list. The other is the Bag-of-Words approach, which uses techniques such as POS tagging and co-occurrence with other sentiment words.
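As a toy illustration of the lexical methodology, consider the sketch below; the word list and intensity weights are invented for illustration and do not correspond to any real lexicon.

```python
# A toy lexicon-based sentiment scorer: look up each word in a pre-built
# sentiment list and average the intensities. The word list and weights
# here are illustrative, not a real lexicon.
SENTIMENT_LEXICON = {"good": 1.0, "happy": 2.0, "bad": -1.0, "terrible": -2.0}

def lexicon_sentiment(sentence: str) -> float:
    words = sentence.lower().split()
    hits = [SENTIMENT_LEXICON[w] for w in words if w in SENTIMENT_LEXICON]
    return sum(hits) / len(hits) if hits else 0.0  # mean intensity of matched words

print(lexicon_sentiment("The plot was good but the ending was terrible"))  # -0.5
```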

A number of studies quantify or classify the sentiment conveyed by a sentence. One past study employs a hybrid approach to classify the sentiment (positive/negative) conveyed by a sentence [23]. The authors extract sentiment words using POS tagging and LDA (Latent Dirichlet Allocation) topic modelling, and then employ AFINN (a lexicon-based dictionary) to capture the context of the underlying sentiment word. This approach helps classify ambiguous words with high accuracy. In addition, a number of deep learning models, such as BERT, are used to quantify the sentiment conveyed by text [24]: BERT generates text embeddings, which can then be used as features in a classification model that classifies the sentiment conveyed by the underlying text. Another study focuses on capturing the variance in the sentiments conveyed by the underlying sentences [25]; the authors construct a novel feature set that helps quantify the variance in the intensity of the sentiments.

2.1.3 Lexical complexity

Apart from the parameters discussed above, lexical complexity also affects the readability of the text. Cobb et al. [11] observed that text comprehension depends on the words presented in the underlying text: if the reader is aware of the words used in the text, then comprehension is successful. In addition, Crossley et al. [9] claimed lexical complexity to be a determinant of text readability, although lexical complexity was calculated only in terms of the frequency of the words present in the text. A class of studies has also focused on simplification of the underlying text to make it more readable [9, 26]; simplification involves rephrasing complex lexical and syntactic structures, which aids readers in comprehending the text.

Lexical sophistication is also considered one of the dimensions of lexical complexity. Various past studies have established that the measures used to quantify lexical sophistication, such as word familiarity [27], word imageability [28], and word concreteness [28], affect the readability of the text.

2.1.4 Syntactic and semantic complexity

Similar to lexical complexity, syntactic complexity is also found to be an indicator of text readability [12]. A past study conducted an experiment with three versions of a text with varied syntactic complexity, with the readers divided into three groups based on their reading proficiency. The results suggested that syntactically complex texts were difficult to comprehend for low-proficiency readers. In addition, a past study concluded that a semantically complex sentence is difficult to comprehend, and hence the readability of semantically complex text is low [29].

The syntactic complexity of a sentence is measured using metrics such as the mean length of sentences, the mean length of clauses, and the mean length of phrases [30]. The rationale behind these simple metrics is that a longer sentence is syntactically more complex than a shorter one. The semantic complexity, on the other hand, can be measured with the help of text embeddings using BERT [31]. The semantics of the underlying text can also be discerned with the help of topological embeddings over a graph of the entities present in the text [32].
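To make the surface metric concrete, the sketch below computes mean sentence length in words; it is a minimal illustration using a naive sentence splitter, not a reimplementation of the metrics in [30].

```python
# A minimal sketch of the surface syntactic-complexity metric described
# above: mean sentence length in words, using a naive sentence splitter.
import re

def mean_sentence_length(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)

print(mean_sentence_length("Short one. This second sentence is considerably longer."))
```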

2.1.5 Lexical diversity

Lexical diversity is one of the factors that affect the readability of the underlying text [5]. A lexically diverse text uses a larger number of unique words; it tries to convey more information in less text. In such cases, it becomes difficult to comprehend the underlying text, and readability is low. In a study on the readability of Journal Mission Statements (JMSs), readability and lexical diversity were found to be positively correlated [33]. In [34], the authors developed a new readability measure called CAREC (Crowdsourced Algorithm of Reading Comprehension), which uses crowdsourcing techniques to collect human judgments of text comprehension. This study used lexical diversity along with other features, including surface-level parameters (the number of words, phrases, and sentences), lexical sophistication, and so on, to predict readability scores.

A number of past studies have coined indices to measure the lexical diversity of text [35, 36]. Some of the important indices are evenness, i.e., the variance in the number of tokens observed across types; dispersion, i.e., the average number of words between words of the same type; and importance, i.e., the frequency of the words in the text as a whole. These indices help measure the lexical diversity of the text.

2.2 Crowdsourced parameters

A few past studies have worked on the readability of crowdsourced articles [4, 37, 38]. Lucassen et al. [37] calculated the readability scores of Wikipedia articles using conventional readability metrics like the Flesch-Kincaid readability score. The results suggested that Wikipedia articles are hard to comprehend, since they score low on readability. However, conventional readability metrics are based on surface-level measures, such as the number of words in a sentence, the number of sentences in a paragraph, and the number of phrases in a sentence. Such metrics do not take the semantics of the text into account and hence are not sufficient to quantify its readability [39, 40]. Semantic text mining is an important dimension required to quantify the readability of text [41, 42]. Keeping these shortcomings in mind, other studies defined metrics that capture the semantics and quality of Wikipedia articles. In one such study, the authors built a classification model to classify Wikipedia articles into readable and non-readable articles [4], using both NLP features and crowdsourced features; the crowdsourced features considered were the edit coefficient, stylistic coefficient, and knowledge gap parameter [4].

In addition to the above studies, a number of studies have analyzed the readability of medical information present on Wikipedia [43,44,45,46]. [47] focused on the analysis of patient medication information and showed that the reading levels of the information given on Wikipedia were usually too high for an average reader to understand. In [38], the authors compared the readability of Wikipedia articles with Simple Wikipedia articles (Simple Wikipedia is a Wikimedia Foundation project that maintains concise and readable versions of Wikipedia articles) and Britannica articles, using parameters related to word complexity in conjunction with conventional readability metrics. The authors scrutinized the words present in the text based on topic-based, genre-based, and popularity-based familiarity. The results show that Wikipedia articles score lower on readability than Simple Wikipedia and Britannica.

To summarize, a number of crowdsourced parameters have been defined that affect the readability of text, including the knowledge gap parameter, stylistic coefficient, edit coefficient, and word familiarity. As per the above discussion, two types of parameters (NLP-based and crowdsourced) lead to changes in the readability scores of an article. In the case of NLP parameters, the high-level metrics are cohesion, lexical diversity, sentiment, lexical complexity, syntactic complexity, and semantic complexity [48]; these are, in turn, measured using a number of low-level metrics. Similarly, in the case of crowdsourced parameters, the high-level metrics are the edit coefficient, stylistic coefficient, and knowledge gap parameter, each measured through a number of low-level metrics. A summary of all the metrics described in this section is provided in Table 1.

Table 1 Past Literature on NLP-based parameters and Crowdsourced Parameters
Table 2 Mapping of high-level metrics to low-level metrics

3 Tools and techniques

Conventional readability metrics consider only surface-level parameters, such as the number of words, sentences, and phrases, to calculate text readability, and such parameters do not capture the quality of the written text [40, 51]. Therefore, we use a number of text analysis tools, namely TAALED, SEANCE, TAALES, TAASSC, and TAACO, to compute quality parameters related to the readability of the text, such as lexical diversity, lexical complexity, and semantic complexity. These tools are briefly discussed below. We further summarize the mapping of high-level metrics to low-level metrics in Table 2.

3.1 TAALED

TAALED (Tool for Automatic Analysis of Lexical Diversity) [52] measures the lexical diversity of a given text along three dimensions: volume, which takes into account the number of tokens; abundance, the number of different token types; and variety, the number of different token types encountered in a given length of text. TAALED computes a number of indices to quantify these three dimensions of lexical diversity; further details are provided in Appendix A.
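To give a flavor of the kind of quantity such indices build on, the sketch below computes a simplified moving-average type-token ratio (TTR); it is an illustration of the variety dimension, not TAALED's exact algorithm.

```python
# A simplified moving-average type-token ratio (TTR): the mean ratio of
# unique word types to tokens over a sliding window. An illustration of
# the "variety" dimension, not TAALED's exact algorithm.
def moving_average_ttr(tokens, window=50):
    if len(tokens) < window:
        return len(set(tokens)) / max(len(tokens), 1)
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

tokens = "the cat sat on the mat and the dog sat on the rug".split()
print(moving_average_ttr(tokens, window=5))  # higher value = more lexical variety
```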

Table 3 Mean and Standard Deviation of four rating dimensions present in AFT
Fig. 1

Frequency Distribution of four rating dimensions of AFT. The above figure shows that all the rating dimensions follow a skewed distribution

3.2 TAASSC

TAASSC (Tool for Automatic Analysis of Syntactic Sophistication and Complexity) measures various indices related to syntactic complexity and syntactic sophistication. Syntactic complexity is measured using phrasal complexity and clausal complexity. It should be noted that clauses differ from phrases in English grammar: a clause must have a subject and a predicate, whereas a phrase does not. Syntactic sophistication, on the other hand, is measured using frequency-based indices, such as the frequency of verbs and the conditional probability of verbs occurring in a specific sentence structure (for example, a sentence comprising a subject, verb, and object). A detailed explanation of these indices is given in Appendix A.

3.3 TAALES

TAALES (Tool for Automatic Analysis of Lexical Sophistication) measures indices related to lexical sophistication [53]. Lexical sophistication takes into account the length and breadth of lexical words, i.e., the frequency and the quality of the words used in the text. The frequency is calculated using the frequency of the different words used and the n-gram frequency, while the quality of the words is measured using indices such as concreteness, familiarity, and imageability. A description of all these indices is given in Appendix A.

3.4 TAACO

TAACO (Tool for Automatic Analysis of Cohesion) measures the indices required for text cohesion, which is measured through lexical overlap and semantic overlap [54]. The lexical overlap between sentences is measured by the number of intersecting POS tags between sentences, and the lexical overlap between paragraphs by the number of intersecting POS tags between paragraphs. Semantic overlap between sentences is measured using the number of semantically similar words shared between sentences; in the case of paragraphs, semantically similar words are found between the sentences of the given paragraphs. Semantically similar words are identified using WordNet. Information about all these indices is given in Appendix A.
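The sketch below is a toy version of the semantic-overlap idea: it counts the words of one sentence whose WordNet synsets share a lemma with any word of another sentence. It is a simplification for illustration, not TAACO's exact method.

```python
# A toy illustration of WordNet-based semantic overlap between two
# sentences; a simplification of TAACO's indices, not its exact method.
# Requires: nltk.download("wordnet")
from nltk.corpus import wordnet as wn

def semantic_overlap(sent_a: str, sent_b: str) -> int:
    def lemma_names(word):
        return {l.name() for s in wn.synsets(word) for l in s.lemmas()}
    names_b = [lemma_names(w.lower()) for w in sent_b.split()]
    # count words of sent_a that share at least one lemma with sent_b
    return sum(1 for w in sent_a.split()
               if any(lemma_names(w.lower()) & nb for nb in names_b))

print(semantic_overlap("The cat slept on the couch", "A kitten dozed on the sofa"))
```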

3.5 SEANCE

SEANCE (Sentiment Analysis and Social Cognition Engine) extracts indices related to the human sentiments and feelings conveyed by the underlying text. The sentiments are quantified using preexisting databases that define the degree of positive and negative sentiment conveyed by a particular word. These databases include SenticNet [55, 56], VADER [57], Hu-Liu polarity [58], and EmoLex [59]. More details about the computed indices are given in Appendix A.

4 Dataset

The dataset is collected using the Article Feedback Tool, version 4 (AFT), and is publicly available in the Wikipedia data dump [60]. The Article Feedback Tool was deployed by the Wikimedia Foundation on Wikipedia articles to collect feedback from readers; the main aim of this initiative was to engage readers and assess the quality of Wikipedia articles. AFT was deployed on around 45,000 English Wikipedia articles and allowed readers to rate articles on four dimensions, namely trustworthiness, completeness, neutrality, and readability [61], each rated on a scale of one to five. The dataset consists of 12,498 ratings given by various users. The statistics of the four rating dimensions are given in Table 3, and their distributions are shown in Fig. 1; the frequency distributions of all the rating dimensions are observed to be skewed. To carry out the experiments, we sampled 3000 of the 45,000 articles using stratified sampling: since articles are rated on a scale of one to five (Fig. 1), we divide the given set of articles into five strata according to the ratings and sample an equal number of articles (600) from each stratum.
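A minimal sketch of this sampling step is shown below, assuming a pandas DataFrame `articles` with an integer 1-5 `rating` column; the column name is an assumption for illustration.

```python
# A minimal sketch of the stratified sampling step, assuming a pandas
# DataFrame `articles` with an integer 1-5 `rating` column; the column
# name is illustrative.
import pandas as pd

def stratified_sample(articles: pd.DataFrame, per_stratum: int = 600,
                      seed: int = 42) -> pd.DataFrame:
    return (articles
            .groupby("rating", group_keys=False)   # five strata, one per rating
            .apply(lambda g: g.sample(n=per_stratum, random_state=seed))
            .reset_index(drop=True))

# sampled = stratified_sample(articles)  # 5 strata x 600 = 3000 articles
```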

The NLP-based features are collected using TAALED, TAACO, TAASSC, TAALES, and SEANCE, which quantify lexical diversity, cohesion, syntactic complexity, lexical sophistication, and the sentiments in text, respectively. The details of the features are as follows.

  • TAALED is used to calculate 38 indices for measuring lexical diversity.

  • TAALES is used to calculate 484 indices for measuring lexical complexity.

  • TAASSC is the tool used to quantify 367 indices related to syntactic complexity. It is used to calculate 31 indices related to clausal complexity, 132 indices related to phrasal complexity, 190 indices related to syntactic sophistication, and 14 indices included in Syntactic Complexity Analyzer.

  • TAACO is the tool used to calculate 178 indices related to text cohesion, out of which 15 indices are related to TTRs, 54 to the lexical overlap of sentences, 54 to the lexical overlap of paragraphs, 8 to the semantic overlap of sentences, 8 to the semantic overlap of paragraphs, 25 to connectives, and 4 to givenness.

  • SEANCE is the tool used to calculate 254 indices measuring the sentiments conveyed by the text.

5 Experimental analysis

In this section, we discuss the experiments carried out as part of the research study. We build a neural network to quantify the readability scores; the neural network helps identify the most dominant NLP feature affecting the readability of articles. This most dominant NLP feature is then used in a mediation analysis to find the mediating effect of NLP features in the relationship between crowdsourced parameters and the readability of articles.

5.1 Experiment 1: deep learning based approach to quantify readability

In the first phase of the research study, we use a feed-forward Artificial Neural Network (ANN) to quantify the readability scores based on input NLP features computed using tools, as discussed above.

5.1.1 Details of experiment

We use a neural network to quantify the different readability scores of the articles, with the NLP-based parameters as features and the four dimensions present in the dataset (trustworthiness, completeness, neutrality, and readability) as the target variables. For this, we perform text preprocessing on the 3000 Wikipedia articles in the dataset, which involves removing the unwanted words and tags used in the Wiki markup language. After preprocessing, the text corpus of 3000 articles is fed to the tools (TAALED, TAASSC, TAALES, TAACO, and SEANCE), which calculate the 1474 indices used as input to the neural network, as shown in Fig. 3. The regression models are built using the 1474 independent variables and the four respective dependent variables (readability, trustworthiness, neutrality, and completeness), as discussed in the dataset section. We take all four survey parameters as dependent variables because the three parameters other than readability are also essential for an article to achieve the desired readability score. An incomplete article contains missing information, which hinders the comprehension of the underlying text [4]. An article written from a specific point of view may not be appreciated by a reader with a contrasting point of view, and some past studies suggest that the opinion conveyed by a text influences the cognition and comprehension of that text [21]; therefore, an article must be written from a neutral point of view. In addition, the reliability of the information present in the text intuitively affects its readability: if a reader does not trust the information conveyed by the text, then he/she will certainly not make any effort to comprehend it. Hence, we use all four survey parameters as dependent variables to quantify the readability scores. The proposed methodology of Experiment 1 is shown in Fig. 2.

Fig. 2

Phase 1 Methodology: This flowchart describes the steps to be followed to execute experiment 1. First, we collect the features using the tools mentioned in the previous section. Then, we build four regression models using the features as independent variables and each of the AFT dimensions as the dependent variables. We fine-tune the hyperparameters of the ANN by observing the changes in the accuracy of the regression models

We train a feed-forward neural network with three back-propagation training algorithms, using a train-test split of 80:20. The neural network comprises one input layer, three hidden layers, and one output layer covering the four target variables. The three training algorithms used are listed below.

  1. Scaled Conjugate Gradient Algorithm (SCG) [62]

  2. Levenberg-Marquardt Algorithm (LM) [63]

  3. Gradient Descent Algorithm [64]

As per the results obtained, we observe the best mean accuracy with the Gradient Descent Algorithm (see Table 5). The input layer comprises 1474 neurons for processing all the inputs. Each of the hidden layers comprises 3000 neurons, and the output layer has four neurons, which output the readability, trustworthiness, neutrality, and completeness scores (as shown in Fig. 3). We use the ReLU activation function in the output layer and the hyperbolic tangent function in the hidden layers. The performance goal is set to 85% accuracy, as we observe no significant improvement in accuracy after this goal is reached (the improvement falls below a threshold of 0.5%). After tuning the hyperparameters, the learning rate is 0.05, the number of epochs is set to a high value of 3000, and the batch size is 100. The number of epochs required for convergence for the different training algorithms and dependent variables is listed in Table 4. Training stops once the network reaches the required performance goal. The training and testing accuracies of all four models with the three training approaches are listed in Table 5. The designed ANN model exhibited an accuracy of 84.7% with readability as the dependent variable, 87.2% with trustworthiness, 83.2% with neutrality, and 86.3% with completeness; the mean accuracy observed is 85.4%. Accuracy is computed as one minus the difference between the actual and predicted values. The hyperparameters are tuned using the grid search method of hyperparameter optimization: a number of alternatives are tried for each hyperparameter, and those that exhibit higher accuracy are selected. It should be noted that the GridSearchCV method (from scikit-learn, used with the Keras wrapper) uses predefined splits for training and testing along with cross-validation [65]. The results shown in Table 5 are for the specific train and test split obtained when GridSearchCV is overridden with a predefined split of 80:20.
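For concreteness, a minimal Keras sketch of this architecture is shown below, assuming the hyperparameters reported above; the function and variable names are illustrative, not the authors' code.

```python
# A minimal sketch of the Phase 1 network, assuming the hyperparameters
# reported above (1474 inputs, three tanh hidden layers of 3000 neurons,
# a 4-neuron ReLU output, SGD with learning rate 0.05).
from tensorflow import keras
from tensorflow.keras import layers

def build_readability_ann(n_features: int = 1474) -> keras.Model:
    model = keras.Sequential([
        keras.Input(shape=(n_features,)),       # 1474 NLP indices
        layers.Dense(3000, activation="tanh"),  # hidden layer 1
        layers.Dense(3000, activation="tanh"),  # hidden layer 2
        layers.Dense(3000, activation="tanh"),  # hidden layer 3
        layers.Dense(4, activation="relu"),     # readability, trustworthiness,
    ])                                          # neutrality, completeness
    model.compile(optimizer=keras.optimizers.SGD(learning_rate=0.05),
                  loss="mse", metrics=["mae"])
    return model

# X has shape (n_articles, 1474); y has shape (n_articles, 4)
# model = build_readability_ann()
# model.fit(X_train, y_train, epochs=3000, batch_size=100,
#           validation_data=(X_test, y_test))
```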

Fig. 3

Workflow for the ANN-based approach to compute readability. The figure depicts Phase 1 of the underlying analysis. First, we extract the features of each article using the tools TAALED, TAASSC, TAALES, TAACO, and SEANCE. Once the features are extracted, we build an ANN with these features as the independent variables and each of the rating dimensions as the dependent variables

5.1.2 Results of experiment 1

Regarding the relative importance of the features involved, lexical diversity is the most dominant feature in the entire feature space. Since lexical diversity is associated with a number of indices, we observe that the index mtld_ma_wrap_aw is the most dominant one. The mtld_ma_wrap_aw index is computed using TAALED and is based on the moving average number of words required to reach a certain Type-Token Ratio (TTR) value, where the TTR is the ratio of the number of types to the total number of tokens occurring in a text [52]. To identify the dominant feature, we use the permutation importance method [66]: the technique permutes a given input feature and checks the resulting change in accuracy. If the accuracy drops, the feature affects the output parameters; if it does not, the feature has no impact on the outputs. Based on the drop in accuracy, the permutation importance technique ranks the features. The relative drop in ANN accuracy observed for all the features is given in the table in Appendix B; here, the drop in accuracy refers to the drop in mean accuracy. It should also be noted that the second-highest drop in accuracy is observed for another feature related to lexical diversity, mtld_ma_bi_aw, which is based on the moving average of words in both the forward and backward directions required to reach a certain TTR value.
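The sketch below illustrates the permutation importance procedure, assuming the trained `model` and held-out `X_test`, `y_test` arrays from the ANN sketch above; the accuracy function mirrors the "one minus prediction error" definition used in the text, and is an assumption for illustration.

```python
# A minimal sketch of permutation importance for the trained ANN.
import numpy as np

def permutation_importance(model, X, y, n_repeats=5, seed=0):
    rng = np.random.default_rng(seed)

    def accuracy(X_):
        # one minus mean absolute prediction error, as defined in the text
        return 1.0 - np.mean(np.abs(model.predict(X_, verbose=0) - y))

    baseline = accuracy(X)
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):                  # permute one feature at a time
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break feature-target link
            drops[j] += baseline - accuracy(X_perm)
        drops[j] /= n_repeats
    return drops                                 # larger drop = more important feature

# ranking = np.argsort(permutation_importance(model, X_test, y_test))[::-1]
```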

Table 4 Number of Epochs required for different training algorithms
Table 5 Training and Testing Accuracies with different training algorithms
Fig. 4

The plots show the correlation of the target variable with the most dominant feature, i.e., mtld_ma_wrap_aw, that reflects the lexical diversity. The above figure reports a positive correlation between mtld_ma_wrap_aw and each of the rating dimensions

Next, we plot the most dominant feature against all the target variables (Fig. 4). We observe a positive correlation between lexical diversity (mtld_ma_wrap_aw) and all the target variables, which shows that a higher value of this index goes with better readability scores. A higher value of mtld_ma_wrap_aw means that more words are required to reach the TTR value, which in turn implies that fewer diverse tokens are used in the text. If fewer diverse tokens are used, the reader encounters very few new words while reading and can therefore easily comprehend the underlying text. This is also supported by a past study on readability, which states that when more diverse tokens are used, it is difficult for a reader to comprehend the underlying text [5]; fewer diverse tokens facilitate comprehension and, hence, better readability scores.

In addition, the other parameters (trustworthiness, neutrality, and completeness) also exhibit a positive correlation with lexical diversity. A possible explanation is the positive correlation of all of these parameters with the number of edits. Trustworthiness is measured by the number of references in an article, and as the number of edits increases, the number of references also increases. An article also comes closer to completion with a greater number of edits. The neutrality of an article likewise increases with more edits by different editors: a single editor may express only his or her point of view, whereas a number of editors expressing their points of view leads to a neutral point of view. Since lexical diversity also increases with the number of edits, the parameters (trustworthiness, neutrality, and completeness) increase together with lexical diversity as the number of edits grows.

Further, it should be noted that the kind of tokens used in the text affects the readability of the underlying text. If more diverse tokens are used, then readability is low. However, if the diverse tokens are function words, such as I, we, are, is, and you, then they should not affect readability, whereas if the diverse tokens are content words, such as nouns, adjectives, and adverbs, then they do affect readability. To verify this claim about the type of words affecting readability, we calculate mtld_ma_wrap_aw only for content words, i.e., the number of content words required to reach the required TTR value. As per the results obtained, mtld_ma_wrap_aw in the case of content words is also positively correlated with the readability values (correlation coefficient = 0.63). This again confirms the relationship between lexical diversity (even when restricted to content words) and readability values.

5.2 Experiment 2: mediation analysis of NLP-based parameters

As per the above discussion, it is clear that lexical diversity affects the readability of the underlying text the most. Thus, the lexical diversity can be altered to achieve the desired readability level of the underlying text. However, a number of crowdsourced parameters in Wikipedia articles are also responsible for the quality of the text and, thus, its readability. We therefore propose a methodology that takes into account both the crowdsourced parameters and lexical diversity and their effect on readability.

There have been many instances of crowdsourcing where a group of individuals outperforms a single individual. One of the most famous examples is the DARPA balloon challenge, where people across the US were able to identify the locations of balloons in different parts of the country. According to ‘Wisdom of Crowds’ [67], it is the diversity among a group of individuals that makes it more intelligent than a single individual. Similarly, on crowdsourced platforms like Wikipedia, diversity among article editors leads to good-quality Wikipedia articles. Since quality is directly related to the readability of the articles, diversity does influence the readability of Wikipedia articles. As per past studies, the diversity in Wikipedia articles can be measured using a number of parameters, described below:

  • As described by Kittur and Kraut [68], implicit coordination among editors helps introduce information diversity into the underlying article. Implicit coordination is quantified using the Gini coefficient of the contributions made by the editors. A very similar metric, called local workload diversity, was used by Robert et al. in a past study.

  • Another study focused on the quality of the contributions made by the editors rather than the number of edits [69]. The experience of the editors is taken into account while judging the quality of their edits; experience is, in turn, measured using the variation in the edits made by an editor across all Wikipedia articles.

  • Ren et al. coined tenure disparity as one of the measures of diversity in Wikipedia [50]. Tenure disparity quantifies the variation in the amount of time editors have spent editing Wikipedia articles.

To summarize, the aforementioned parameters, such as tenure disparity, implicit coordination, and experience diversity, play an important role in the quality of the articles, and all of them are, in one way or another, related to the number and quality of edits. Hence, based on the previous studies, we primarily take into account two crowdsourced parameters that capture the diversity in Wikipedia articles: the number of edits and the quality of edits. The number of edits is captured by the standard deviation of the edits contributed by all editors of an article; the standard deviation measures the variation in the edits made by the editors. For an experienced editor, the standard deviation of edits should be low. The quality of edits is captured by the ratio of experienced editors editing the particular Wikipedia article; a higher share of experienced editors indicates quality edits being made to the underlying article. Therefore, we introduce two crowdsourced parameters, one concerning the number of edits and the other concerning the quality of the edits contributed by the editors of a Wikipedia article. The details of the crowdsourced parameters are as follows (a minimal computation sketch follows the list):

  • Standard deviation of edits contributed by the editors of a Wikipedia article: The contribution of different editors to a Wikipedia article is not uniform. It is established in the previous literature that a handful of editors contribute most of the content to an article [70]: a majority of editors are responsible for minor changes, while a small set of editors is responsible for the majority of the changes in the underlying article. However, the number of editors contributing the majority of the content varies from article to article.

  • Ratio of experienced editors contributing to a Wikipedia article: The second crowdsourced parameter takes into account the experience of the editors contributing to a Wikipedia article. Editors in the list of the top 10,000 editors ordered by number of edits are considered experienced [71]. The ratio of experienced editors is calculated as the number of experienced editors divided by the total number of editors of the Wikipedia article.
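The sketch below computes the two parameters for a single article, assuming `edit_counts` maps each editor of the article to their number of edits and `experienced` is the set of top-10,000 editors; the names and sample values are illustrative.

```python
# A minimal sketch of the two crowdsourced parameters for one article.
import statistics

def crowd_parameters(edit_counts: dict[str, int], experienced: set[str]):
    sd_of_edits = statistics.pstdev(edit_counts.values())
    ratio_experienced = (sum(e in experienced for e in edit_counts)
                         / len(edit_counts))
    return sd_of_edits, ratio_experienced

sd, ratio = crowd_parameters({"alice": 120, "bob": 3, "carol": 5}, {"alice"})
print(sd, ratio)  # high sd = uneven contributions; share of experienced editors
```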

Based on the above discussion, we propose the following four hypotheses on the relationship between the above crowdsourced parameters and the readability of Wikipedia articles.

H1: The standard deviation of edits contributed by the editors is positively correlated with the readability of Wikipedia articles

In [72], a classification model is trained to determine the effect of writing style on the quality of an article. The results showed that the writing style of an article is an effective predictor of its quality; as quality is directly correlated with readability, writing style also affects readability. [4] states that different editors have different writing styles, and therefore a change in the writing style of an article is observed as it is crowdsourced. These changes in writing style lead to changes in the readability levels of the article.

Testing hypothesis H1, we observe a positive correlation between the standard deviation of edits contributed by the editors and the readability of articles (correlation coefficient = 0.45). The reason is the frequent change in writing style observed for articles with a low standard deviation of edits, which leads to low readability. If the standard deviation of the edits contributed by the editors is low, then all the editors contribute uniformly to the article; since all the editors have uniform contributions, there are frequent changes in writing style, which affect the readability levels. On the contrary, if the standard deviation of the edits is high, the editors contribute to the article in a non-uniform fashion, with some editors contributing more than others. In such cases, past studies claim that most of the content is added by a handful of editors while the rest are involved in rectifying minor errors in the text [73]. The reader then experiences fewer changes in the writing style, which makes the comprehension of the underlying text easier; a sudden change in the writing style while reading hinders the comprehension of the text and affects its readability. Our claim is also substantiated by past studies, which state that writing style affects readability and that Wikipedia articles experience changes in writing style [73]. Hence, a high standard deviation of the edits made to an article by various editors leads to high readability in Wikipedia articles.

H2: The ratio of experienced editors is negatively correlated with the readability of Wikipedia articles

It is clear that experienced editors contribute significantly to articles. If the ratio of experienced editors in an article is higher, then more editors are contributing heavily to it. Due to the higher number of heavily contributing editors, there are more frequent changes in the article’s writing style. As discussed above, sudden changes in writing style affect the readability of a Wikipedia article. This hypothesis is supported by the negative correlation observed between the two parameters, with a correlation coefficient of around -0.56.

After defining the nature of relationships between the two crowdsourced parameters and the readability of the text present in Wikipedia articles, we propose the following hypotheses defining the relationship between lexical diversity and the crowdsourced parameters.

H3: The standard deviation of edits contributed by the editors is negatively correlated with the Lexical Diversity of Wikipedia articles

A low standard deviation value means that all editors contribute evenly to an article. In other words, such an article comprises text written by various editors whose contributions are nearly equal. As each of these editors uses a different vocabulary, the underlying article exhibits varied vocabulary, which increases the lexical diversity of the article. This claim is supported by the data on Wikipedia articles, with a correlation coefficient of around -0.67 between the two variables.

H4: The ratio of experienced editors is positively correlated with the lexical diversity of the Wikipedia articles

A higher ratio of experienced editors suggests that more editors contributed heavily to the underlying article. Since these editors use different vocabularies, the text in the concerned Wikipedia article exhibits higher lexical diversity. The correlation between the ratio of experienced editors and the lexical diversity of the article is around 0.48.

Fig. 5

Bootstrap Mediation Analysis with the readability as the dependent variable, lexical diversity as the mediator variable, and standard deviation of edits and the ratio of experienced editors as the independent variables

5.2.1 Details of mediation analysis

According to the neural network presented in the last section, lexical diversity is a deciding parameter for the readability level of the text. However, it is the editors who are responsible for introducing lexical diversity into the text of Wikipedia articles. Thus, we propose a theoretical model in which the crowdsourced parameters act as independent variables, lexical diversity acts as the mediator, and the readability of the text acts as the dependent variable. The crowdsourced parameters considered are the standard deviation of the edits contributed by the editors and the ratio of experienced editors editing a Wikipedia article. After testing the correlation between the independent variables and the mediator variable, we perform a bootstrap mediation analysis to study the mediating effect of lexical diversity on the readability of Wikipedia articles [74]. Along with the above-mentioned variables, we also introduce control variables, namely the age and size of the article, into the bootstrap mediation model. The age of an article is calculated as the time difference (in days) between its first and last edits, and its size as the number of words added to the article (log base-10 transformed). We use the Pingouin library in Python to perform the bootstrap mediation analysis [75]; the library uses the method proposed by Baron and Kenny. It should also be noted that the VIF values obtained for the crowdsourced parameters are below 10, suggesting no issues of multicollinearity in the proposed model.
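As a concrete illustration, a minimal Pingouin-based version of this analysis might look as follows; the DataFrame `df` and its column names are assumptions for illustration, not the authors' exact variable names.

```python
# A minimal sketch of the bootstrap mediation analysis, assuming a
# DataFrame `df` with one row per article and the named columns.
import pingouin as pg

med = pg.mediation_analysis(
    data=df,
    x="sd_of_edits",                         # or "ratio_experienced_editors"
    m="lexical_diversity",                   # mediator, e.g., mtld_ma_wrap_aw
    y="readability",
    covar=["article_age", "article_size"],   # control variables
    n_boot=5000,
    seed=42,
)
print(med)  # paths a and b, direct and indirect effects, bootstrap CIs
```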

In addition to the above model, we regress the two crowdsourced parameters on the readability values without considering the mediating effect of lexical diversity; this regression model supports the hypotheses H1 and H2. We also regress the two crowdsourced parameters on the lexical diversity of the underlying text; this regression model supports the hypotheses H3 and H4. The results obtained for both models are discussed in the next section.
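Under the same assumptions as the mediation sketch above, these regressions might be run as follows; the column names remain illustrative.

```python
# A minimal sketch of the regressions used to test H1-H4, assuming the
# same hypothetical DataFrame `df`; statsmodels' formula API.
import statsmodels.formula.api as smf

# H1/H2: readability on the crowdsourced parameters plus controls
m_read = smf.ols("readability ~ sd_of_edits + ratio_experienced_editors"
                 " + article_size + article_age", data=df).fit()

# H3/H4: lexical diversity on the same predictors
m_lex = smf.ols("lexical_diversity ~ sd_of_edits + ratio_experienced_editors"
                " + article_size + article_age", data=df).fit()

print(m_read.rsquared)  # ~0.65 reported in the text
print(m_read.params)    # signs should match H1 (+) and H2 (-)
print(m_lex.params)     # signs should match H3 (-) and H4 (+)
```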

5.2.2 Results of experiment 2

To test hypotheses H1 and H2, a simple regression analysis is performed with the crowdsourced parameters as the independent variables and the readability of the text as the dependent variable. We also add the control variables, i.e., the age and size of the article, to the model. The \(R^2\) value of the regression model is observed to be 0.65. The results are consistent with the proposed hypotheses: the model shows a positive relationship between the standard deviation of edits contributed by the editors and the readability of the text present in Wikipedia articles. The fitted model is given in equation 1 below.

\[\text{Readability} = i_1 + 0.92 \cdot \text{Standard deviation of edits} - 1.71 \cdot \text{Ratio of experienced editors} - 0.41 \cdot \text{Article size} + 1.1 \cdot \text{Article age} + e_1 \tag{1}\]

To test H3 and H4, we perform a regression analysis with the standard deviation of edits contributed by the editors and the ratio of experienced editors as the independent variables and lexical diversity as the dependent variable. The results show a negative relationship between lexical diversity and the standard deviation of edits, as hypothesized in H3, and a positive relationship between lexical diversity and the ratio of experienced editors, in line with hypothesis H4. The fitted model is shown in equation 2 below.

\[\text{Lexical diversity} = i_2 - 0.6 \cdot \text{Standard deviation of edits} + 0.51 \cdot \text{Ratio of experienced editors} - 0.30 \cdot \text{Article size} + 0.04 \cdot \text{Article age} + e_2 \tag{2}\]

In addition to the above models, we perform a bootstrap mediation analysis with all the variables, as depicted in Fig. 5. The bootstrap mediation analysis takes into account the mediating effect of lexical diversity on the readability of the text in Wikipedia articles. As the effects of the standard deviation of edits and the ratio of experienced editors remain significant after adding the mediator lexical diversity, we conclude that lexical diversity has a partial mediation effect on the readability of the text present in Wikipedia articles. The results for this model are shown in equation 3 below.

\[\text{Readability} = i_3 - 0.73 \cdot \text{Standard deviation of edits} - 1.25 \cdot \text{Ratio of experienced editors} + 0.62 \cdot \text{Lexical diversity} - 0.35 \cdot \text{Article size} + 0.92 \cdot \text{Article age} + e_3 \tag{3}\]

To summarize, the above results confirm the effect of crowdsourced parameters on the readability scores. As per equation 1, a higher ratio of experienced editors leads to lower readability scores, and a higher standard deviation of edits leads to higher readability scores. As per equation 2, a higher standard deviation of edits leads to lower lexical diversity, and a higher ratio of experienced editors leads to higher lexical diversity. As per equation 3, we still observe a significant effect of the crowdsourced parameters on readability even after taking lexical diversity into account, along with a moderate effect of lexical diversity on readability scores. Hence, we conclude that the effect of crowdsourced parameters on readability scores is partially mediated by lexical diversity. Since the mediation is only partial, the effect of crowdsourced parameters on readability scores is not solely due to the changes observed in the writing style of the article; other factors are also responsible. One probable factor is conflict among the editors. With a low standard deviation of edits and a high ratio of experienced editors, all editors contribute a roughly equal number of edits. In such a scenario, there is collective ownership of the article, which often leads to conflicts among editors. With a high number of conflicts, much of the editors’ time and attention is spent on conflict resolution, and article quality suffers; hence, we observe low readability in such articles. Note that this explanation does not rely on writing style or lexical diversity: it takes into account the effect of crowd dynamics on the readability scores. Hence, apart from crowdsourced parameters affecting lexical diversity and lexical diversity affecting readability scores, the crowdsourced parameters also directly influence the readability scores.

6 Conclusion

The regression models built with NLP parameters as input features and the trustworthiness, neutrality, readability, and completeness scores as target variables show that NLP-based features do affect the readability score of a crowdsourced document. The NLP-based features are calculated using a wide range of tools: TAALED, TAASSC, TAALES, SEANCE, and TAACO. Further, this exercise establishes lexical diversity as one of the deciding factors behind a good readability score. Past studies also show that crowdsourced parameters influence the readability scores of Wikipedia articles. Hence, both NLP-based and crowdsourced parameters affect the readability scores of the articles. However, it is the crowd that is responsible for generating the entire article on Wikipedia, and the effect of crowdsourced parameters on NLP-based parameters cannot be ignored either. We conduct a bootstrap mediation analysis to judge the effect of crowdsourced parameters on NLP-based parameters, which in turn influence the readability scores. The results show that the NLP-based features partially mediate the effect of crowdsourced parameters on readability scores. Therefore, it is important to consider both crowdsourced and NLP-based parameters while calculating the readability scores of Wikipedia articles.

In the future, we plan to study the changes in lexical diversity with every change introduced in a crowdsourced article and their effect on readability scores. Additionally, other crowdsourced parameters, such as conflicts between editors and interactions among editors, can be considered to quantify their effect on readability scores. This will help us better understand the crowd dynamics that affect the NLP parameters, leading to changes in the readability scores.