In KDAP, we have implemented commonly used traditional algorithms as well as new methods for analyzing and understanding the dynamics of collaborative knowledge building. All these methods accept Knol-ML files as input and return the analysis results. The main disadvantage of collaborative online portals is that their data dumps are in a raw format and require multiple cleaning steps before analysis. With the design of Knol-ML and KDAP, we have created a system that reduces the time and effort required to retrieve and analyze knowledge data. We provide many atomic-level methods that can be used as building blocks for analysis, such as language statistics as a measure of the quality of contributions [44–47], global inequality as a measure of participation [48, 49], editing patterns to understand collaboration [50–52], and data extraction for various machine learning and NLP algorithms [11, 53, 54].
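As an illustration of one such building block, global inequality of participation is commonly quantified with the Gini coefficient. The following is a minimal sketch in plain Python, assuming that participation is given as a list of per-author contribution counts; the function name `gini` and the sample data are hypothetical and do not represent KDAP's API.

```python
def gini(contributions):
    """Gini coefficient of non-negative contribution counts (0 = equal shares)."""
    values = sorted(contributions)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values (ranks start at 1).
    weighted = sum(rank * value for rank, value in enumerate(values, start=1))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Hypothetical edit counts per author for two articles.
equal_participation = gini([5, 5, 5, 5])
skewed_participation = gini([0, 0, 0, 10])
```

With this formulation, a value of 0 means that all authors contributed equally, and the value approaches 1 as contributions concentrate in a single author.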
To evaluate our tool, we describe two evaluation methodologies with the following goals in mind:
- To show that our toolkit performs better than existing analysis toolkits in terms of execution time, memory consumption, code complexity, and lines of code.
- To show that it is possible to perform complex analyses using our toolkit without significant effort.
The first evaluation methodology includes six common mining tasks, whereas the second includes large-scale analysis tasks. The tasks in both methodologies were performed twice, once with KDAP and once without it. The authors of [55] used a similar approach to evaluate the PyDriller tool against other existing tools. The tasks in the first evaluation methodology were designed to measure our tool by comparison with existing tools, whereas the tasks of the second methodology evaluate the usefulness of our tool for large-scale analyses. We describe and compare the analyses of both evaluation methodologies. To the best of our knowledge, there is no other library for the large-scale analysis of collaborative knowledge-building portals. Hence, we compare the performance of KDAP with existing libraries and APIs commonly used for extracting and parsing the datasets of these portals. All the analyses were performed on a computer with a 3.10 GHz Intel Xeon E5-1607 processor and 16 GB of RAM. Each analysis was performed five times, and the average execution time and memory consumption are reported.
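The repeated-measurement protocol described above can be sketched as follows. This is only an illustrative harness built on the Python standard library, not the measurement code used in the experiments; the helper `measure` and the stand-in task `example_task` are hypothetical. Note that `tracemalloc` only tracks memory allocated by the Python interpreter and adds some overhead to the timing, so a careful setup might measure time and memory in separate passes.

```python
import statistics
import time
import tracemalloc


def measure(task, repeats=5):
    """Run `task` `repeats` times; return (mean seconds, mean peak bytes)."""
    times, peaks = [], []
    for _ in range(repeats):
        tracemalloc.start()
        start = time.perf_counter()
        task()
        elapsed = time.perf_counter() - start
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        times.append(elapsed)
        peaks.append(peak)
    return statistics.mean(times), statistics.mean(peaks)


def example_task():
    # Hypothetical stand-in for a mining task.
    words = ("knowledge building " * 10_000).split()
    return len(set(words))


mean_seconds, mean_peak_bytes = measure(example_task)
```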
Evaluation based on comparison with present tools
We first compare KDAP against existing libraries and APIs for online collaborative portals. We selected six common knowledge building tasks that we encountered in our experience as researchers in the knowledge building field. We divided the tasks into three groups, as shown in Table 2. The reason for this grouping is to evaluate our tool on a variety of tasks commonly performed by knowledge building researchers. We compare the two implementations of each task using four metrics: lines of code (LOC), complexity (McCabe complexity [56]), memory consumption, and execution time. Table 3 shows the results. We do not count lines of code that do not contribute to the core functionality (such as constructors). Instead, we assign a fixed LOC value of 3 to all such code.
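To make the two code-level metrics concrete, the sketch below approximates McCabe complexity as one plus the number of decision points found in a Python syntax tree, and counts LOC as lines that are neither blank nor pure comments. It is a simplified illustration under these assumptions, not the tooling used to produce Table 3; the function names and the sample snippet are hypothetical, and dedicated complexity tools may count some constructs differently.

```python
import ast

# Node types treated as decision points in this simplified count.
_BRANCH_NODES = (ast.If, ast.For, ast.AsyncFor, ast.While,
                 ast.ExceptHandler, ast.IfExp, ast.comprehension)


def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    complexity = 1
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # `a and b and c` has three operands and adds two extra paths.
            complexity += len(node.values) - 1
    return complexity


def count_loc(source):
    """Count lines that are neither blank nor pure comments."""
    return sum(1 for line in source.splitlines()
               if line.strip() and not line.strip().startswith("#"))


# Hypothetical snippet under measurement.
SAMPLE = '''
def count_long_words(words, threshold):
    # count alphabetic words longer than the threshold
    total = 0
    for word in words:
        if len(word) > threshold and word.isalpha():
            total += 1
    return total
'''

sample_complexity = cyclomatic_complexity(SAMPLE)
sample_loc = count_loc(SAMPLE)
```

For the sample snippet, the loop, the `if`, and the `and` each add one decision point to the base value of one.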
Table 2 Tasks assigned to the first group
Table 3 Comparison between KDAP and various libraries
Regarding execution time, KDAP is 63.32 and 8.33 times faster than the corresponding tool for tasks 1 and 2, respectively. This speed is achieved because KDAP maintains a database of Wikipedia article names and their corresponding categories (please see the Appendix for more details). For the other tasks, the performance of KDAP is similar to that of the other tools. In terms of memory consumption, the tools behave similarly. In most cases, memory consumption was less than 20 MB. The most memory-consuming task (task 4) used 86 MB of memory. Given the massive size of the dataset (e.g., the United States articles amount to close to 6 GB) and the limited main memory, we had to perform iterative parsing of the files (for tasks 3, 4, 5, and 6) when using the Wikipedia API, cElementTree, and mwparserfromhell. More precisely, we iterate over each block of a file and process it, keeping the overall memory consumption constant. This iterative parsing adds to the complexity of the code. KDAP provides a method for iteratively parsing files by default, and the user is free to either load a whole file into memory or process it chunk-wise.
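The block-by-block parsing described above can be sketched with the standard library's `xml.etree.ElementTree.iterparse`. The element names (`article`, `revision`, `text`) and the sample document are hypothetical and are not meant to reproduce the Wikipedia dump schema or the Knol-ML schema; this is an illustration of the technique rather than the code used in the experiments.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified revision history.
SAMPLE = b"""<article>
  <revision id="1"><text>Knowledge building</text></revision>
  <revision id="2"><text>Knowledge building is collaborative</text></revision>
</article>"""


def iter_revision_word_counts(source):
    """Yield (revision id, word count) one revision at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "revision":
            text = elem.findtext("text", default="")
            yield elem.get("id"), len(text.split())
            # Drop the revision's text and children once it has been processed.
            elem.clear()


counts = list(iter_revision_word_counts(io.BytesIO(SAMPLE)))
```

Clearing each processed element keeps memory roughly bounded by the size of one revision, although the emptied elements remain attached to their parent; for very long histories, one would also detach them from the root.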
The more significant differences are in the complexity of the implementation and in LOC. For the former, we observe that using KDAP results (on average) in 61% less complex code compared to the respective libraries. This is especially the case for tasks that deal with mining Wikipedia and Stack Exchange (tasks 1 and 2); indeed, obtaining this information in KDAP takes just one line of code, while the Wikipedia API and the Stack Exchange API require many lines of code and exception handling.
We also observe that the number of lines of code written using KDAP is significantly lower than for the respective library. Table 3 shows that, on average, 84% fewer lines of code are required using KDAP. The biggest difference is in task 3, where the tool had to calculate the change in words, sentences, and Wikilinks for each revision of an article. This problem was solved in five LOC using KDAP, compared with 120 LOC using cElementTree (a 95% difference). The details of the experiment are provided in the supplementary material [57] (Appendix Getting started with KDAP), and the code for all the implementations is available in the experiment section of the GitHub repository.
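To give a sense of what task 3 computes, the sketch below derives the change in words, sentences, and Wikilinks between consecutive revisions from their raw wikitext. It is a plain-Python illustration under simplifying assumptions (whitespace tokenization, punctuation-based sentence boundaries, non-nested `[[...]]` links); it is neither the KDAP implementation nor the cElementTree one, and all names and sample revisions are hypothetical.

```python
import re

# Simplified patterns: non-nested [[...]] links; ., ! or ? ends a sentence.
WIKILINK = re.compile(r"\[\[[^\[\]]+\]\]")
SENTENCE_END = re.compile(r"[.!?]+(?:\s|$)")


def text_stats(text):
    """Count words, sentences and Wikilinks in one revision's wikitext."""
    return {
        "words": len(text.split()),
        "sentences": len(SENTENCE_END.findall(text)),
        "wikilinks": len(WIKILINK.findall(text)),
    }


def revision_deltas(revisions):
    """Return the per-revision change in each statistic, oldest first."""
    deltas = []
    previous = text_stats("")
    for text in revisions:
        current = text_stats(text)
        deltas.append({key: current[key] - previous[key] for key in current})
        previous = current
    return deltas


# Hypothetical revision texts, oldest first.
revisions = [
    "Python is a language.",
    "Python is a [[programming language]]. It is popular.",
]
deltas = revision_deltas(revisions)
```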
Evaluation based on usefulness
To further analyze our tool, we chose four peer-reviewed articles in the CSCW domain to be analyzed using KDAP. We took the help of four undergraduate students working in the knowledge building domain to re-perform the analyses described in these articles. Each participant was assigned one paper, as shown in Table 4. They were asked to perform the analyses twice (with and without KDAP) and to note the time they took to solve the problems, as well as their personal opinions on all the tools. All the students had experience in developing with Python and in performing knowledge building studies, but they had never used KDAP before. The setting of the experiment is the following:
- Each participant is assigned one paper, which they have to implement first with KDAP and then with any other library of their choice. Since understanding how to solve the tasks requires some additional time, we asked the participants to start with KDAP. This choice clearly penalizes our tool, as participants will have a better intuition about the tasks by the second round of implementation. However, we believe that KDAP is simpler to use and that the difference between the two implementations will still be significant.
- For the sake of simplicity, participants should only implement the core analysis methods. Methods such as machine learning model training and graph plotting were excluded.
- Participants note the time taken to implement the tasks. They are also asked to include the time spent reading the documentation of the tools, since understanding how to use a tool is part of the experiment.
- After the participants have implemented all the tasks, we ask them to elaborate on the advantages and disadvantages of the tools.
Table 4 Papers Assigned to the Participants
The results of the experiment are shown in Table 5. All the participants took less time to solve the problems with KDAP (26% less in the worst case, 71% less in the best case). Regarding LOC, three out of four participants wrote significantly fewer lines. P4, by contrast, solved both problems using a similar amount of time and LOC: the participant first solved the problem using KDAP and then applied the same logic to solve it using cElementTree.
Table 5 Time and LOC Comparison for Each Participant
All the participants agreed that KDAP was more comfortable to use than the other analysis tools (P1 to P4). For example, P1 affirmed that, using KDAP, he was able to achieve the same result with more straightforward and shorter code, and that he will continue to use KDAP in his subsequent knowledge building studies. P2 added that the Wikipedia and Stack Exchange APIs are useful when one has to perform limited extraction tasks, but that they can be overcomplicated when the goal is to perform broad-scale analysis on these portals, for which KDAP is more appropriate because it hides this complexity from the users.
Evaluation based on scalability and generalizability: a case study of JOCWiki
Although we have established our toolkit’s usefulness, it is essential to show whether our toolkit is scalable enough to analyze a new portal’s dataset. Moreover, it is interesting to evaluate the generalizability of the analysis methods using cross-portal analysis.
To understand the extent of the scalability and generalizability of our toolkit, we describe the analysis of JOCWiki with KDAP as a case study. JOCWiki is an online crowdsourced portal that follows a unique integration of a Wiki-like portal and a discussion-style forum [62]. JOCWiki was deployed as a part of a MOOC named “The Joy of Computing” to create a rich knowledge repository using the course students as the contributors. The authors showed that this integration of a Wiki and a QnA forum (also known as QWiki) generates more knowledge than a single disintegrated platform. The integration of Wiki and discussion-forum-like features in a single platform allows us to evaluate our toolkit based on scalability and generalizability. We show that our toolkit can handle the JOCWiki dataset, and that the fundamental analyses can be performed in fewer lines of code. We also show the generalizability of our toolkit by performing the same tasks on the Wiki and the QnA forum of JOCWiki.
We extracted the full dataset of JOCWiki, publicly available on GitHub [63]. The dataset contained twelve weeks of Wiki articles (one for each course module) and their respective discussion forums in raw XML and text format, respectively. We first converted the dataset of Wiki articles and their respective discussion forums into the Knol-ML format. Although the conversion step is an overhead in terms of time and computation, it is essential since it enables a user to execute all the KDAP library methods on the converted Knol-ML dataset. We extended the KDAP library by adding the JOCWiki conversion methods to it. We perform a set of fundamental analyses on the JOCWiki Knol-ML dataset with and without KDAP (see Footnote 4). We divide the tasks into two categories. The first category of tasks requires similar analysis on both portals (Wiki articles and their respective QnA forums), and hence we perform them on both portals (Wikis and QnAs) separately. The second category contains a set of tasks that require cross-portal analysis between the Wiki articles and the QnA forums. We aim to determine whether we can perform these analyses with less complex code when using KDAP. We compare the results based on LOC (lines of code) and complexity (McCabe complexity [56]). Table 6 presents the results of the experiment.
Table 6 Evaluation of KDAP methods based on scalability and generalizability. Intra-portal analysis tasks were performed on both the Wiki and the QnA forum of JOCWiki, whereas inter-portal analysis tasks represent the cross-portal analysis between the Wiki and the QnA forum
Results on scalability
We observed that, using KDAP, we could perform the analyses in fewer lines of code than with the conventional libraries. More precisely, we could perform all the mentioned tasks using, on average, 95% fewer lines of code with KDAP. Moreover, with our toolkit, we could write less complex code. We observed that the fundamental methods of the KDAP library are useful in performing cross-portal analysis using less complex code and fewer lines of code. For instance, extracting topics from Wiki articles that triggered discussion in QnA forums, and vice versa, is comparatively more straightforward with KDAP. This quantification of triggers was a significant contribution by Simran et al. [62], who showed that the QWiki setup generates more knowledge units than conventional collaborative knowledge building systems. The reason for the more complex code with the conventional libraries was the unstructured formatting of the QnA forum dataset. Retrieving information from an unstructured dataset is generally difficult and time-consuming. Given the standardized structure of the Knol-ML format, such information retrieval and analysis tasks are easy to perform with KDAP.
Results on generalizability
In terms of generalizability, we observed that most of the KDAP methods were generalizable over both portals when performing intra-portal analysis tasks. More specifically, for most of the tasks, common KDAP methods were used to analyze both Wiki articles and QnA forums. We had to use a different set of methods for some tasks, as no common methods were applicable (the red-colored rows in Table 6 represent those tasks). For example, to extract author information (such as age, country, and bio) for both Wiki articles and QnA forums, we had to use the external file of the authors' dataset. One solution to this problem is to include all detailed author information in the Knol-ML dataset. However, including such information would increase the size of a single Knol-ML KnowledgeData (refer to Fig. 2 for details) manyfold. We discuss this tradeoff in detail in the Limitations and future work section.
Comparison of KDAP with other tools
Various tools, such as WikiBrain, DBpedia, and the Wikipedia API, exist for analyzing knowledge data. Although these tools provide analysis and retrieval methods, their knowledge building analysis methods (such as edit statistics, inequality measures, and controversy detection) are limited in number. Also, these tools are limited to specific portals. KDAP provides dedicated methods for analyzing the dynamics of collaborative knowledge building. We define a set of tasks (listed in Table 7) on which we compare our tool with other analysis tools. Table 8 shows a comparison of the methods implemented in KDAP with those of the other analysis tools.
Table 7 Tasks defined to compare KDAP with other analysis tools
Table 8 Comparison of KDAP methods with other analysis tools