As the analysis involved a very large volume of text, several methods were employed to summarize the text and highlight patterns of potential interest. In analyzing the literature, we aimed to gain a high-level overview of macro-trends, while also being able to pinpoint micro-trends that may not yet be widespread, but illustrate interesting directions in how organic waste is processed. In general, we are interested in identifying the general topics being discussed and then performing a more fine-grained analysis of the entities involved. In particular, we were looking for interesting combinations of entities (e.g. “oyster shells” as a catalyst for “biodiesel” production) which may indicate new pathways being explored in the bio-economy.
Literature Collection
The literature was collected via search terms that aimed to locate articles describing the processing of organic waste, published between 1995 and 2014. The queries, listed in Appendix A.1, were run on both Scopus and Web of Science (WOS), resulting in 53,292 distinct matching articles. While it would be ideal to analyze the full text of these articles, all of our analysis is restricted to article titles, keywords and abstracts, due to limitations on automated full-text downloads for Scopus and Web of Science. As documented further in the “Results” section, there is overlap in coverage between Scopus and Web of Science, so we had to remove duplicate articles. We performed this step using OpenRefine (see footnote 2) and its “cluster and edit” feature, which uses several string distance metrics to group together text strings that have similar forms.
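To make the de-duplication step concrete, the following is a minimal Python sketch of similarity-based clustering of article titles, using the standard library’s difflib as a stand-in for OpenRefine’s string distance metrics; the titles and the 0.9 threshold are illustrative, not taken from our pipeline.

```python
# Illustrative sketch: group near-duplicate article titles by string
# similarity, mimicking OpenRefine's "cluster and edit" step.
from difflib import SequenceMatcher

titles = [
    "Biodiesel production from waste cooking oil",
    "Biodiesel Production from Waste Cooking Oil.",   # near-duplicate
    "Anaerobic digestion of municipal solid waste",
]

def similar(a: str, b: str, threshold: float = 0.9) -> bool:
    """True if two titles are close enough to be treated as duplicates."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

clusters: list[list[str]] = []
for title in titles:
    for cluster in clusters:
        if similar(title, cluster[0]):
            cluster.append(title)   # treat as duplicate of this cluster
            break
    else:
        clusters.append([title])    # start a new cluster

deduplicated = [cluster[0] for cluster in clusters]  # keep one per cluster
print(deduplicated)
```

In practice, OpenRefine lets the analyst inspect each cluster and choose the surviving record by hand, rather than always keeping the first member as this sketch does.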
Topic Modelling
This study employed topic modelling [12] both as a way to refine the literature search terms and as a way to address the objective of “distinguishing prominent topics” in the collected literature. Topic modelling is a means of identifying the dominant themes that occur across a collection of documents. The topics are not predetermined by a user, but rather are based on statistical patterns of words that commonly co-occur. The advantage of this is that it allows abstracts to be grouped based on their contents instead of the biases of those examining the text. For example, if one were to ask a group of researchers to read the same set of literature and describe what they thought were the dominant topics, their answers would likely differ based on their own interests and background knowledge.
A key value of topic modelling is that it aids in providing a map of what literature has been collected. In early stages of analysis, we noticed a topic involving “organic Rankine cycle” and “waste heat”, which was not relevant for the analysis (these terms describe a thermodynamic cycle suitable for low-grade waste heat). We later realized that our search terms at the time were matching these articles because we were searching for terms such as “organic” and “waste”. Topic modelling the search results thus allowed us to improve our collection of abstracts: by quickly scanning the topics, we could check whether the topics we expected were included, and refine our search terms when topics undesirable for further analysis were identified.
The number of topics is specified by the user; conceptually, this is similar to k-means clustering [13], where users specify how many clusters to find. While a given cluster does not necessarily have a specific meaning, the objects within it will statistically have a degree of similarity. In general, specifying a small number of topics yields general clusters of themes, while a large number of topics would likely distinguish between sub-themes such as ethanol from maize versus cellulosic ethanol.
The particular algorithm used for this analysis was Latent Dirichlet Allocation (LDA) [12]. We used the software implementation of this algorithm provided by the MAchine Learning for LanguagE Toolkit (MALLET) [14]. The software produces several outputs which allow us to characterize the literature collected. First, for every document there is a vector with 50 elements, each corresponding to a topic number and indicating the document’s relevance to that topic. By examining the elements of this vector, one can see the degrees to which a document may actually be discussing multiple topics. Second, for every topic, there is a vector of words and their weights within that topic. These word weights are used when characterizing a document’s relevance to a particular topic. For example, a document may contain a single heavily weighted word, or a combination of lower-weighted words.
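As an illustration of these two outputs, the sketch below uses gensim’s LdaModel as a stand-in for MALLET’s LDA implementation (our pipeline used MALLET itself); the example abstracts are invented, and a realistic corpus would of course be far larger.

```python
# Illustrative sketch (not the authors' pipeline): LDA topic modelling with
# gensim, producing the per-document topic vector and per-topic word weights
# described above.
from gensim import corpora
from gensim.models import LdaModel

abstracts = [
    "anaerobic digestion of food waste for biogas production",
    "pyrolysis of rice husk to produce bio oil and biochar",
    # ... one entry per article title/keywords/abstract
]
tokenized = [doc.lower().split() for doc in abstracts]
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(doc) for doc in tokenized]

# 50 topics, matching the number used in the study.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=50, passes=10)

# Per-document topic vector: the 50-element relevance vector described above.
doc_topics = lda.get_document_topics(corpus[0], minimum_probability=0.0)

# Per-topic word weights: top-weighted words characterizing each topic.
for topic_id, words in lda.show_topics(num_topics=3, num_words=5, formatted=False):
    print(topic_id, [(w, round(p, 3)) for w, p in words])
```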
The output of MALLET is a set of matrices which are not easy to interpret on their own. To expose the results in a more user-friendly way, we visualize them with the dfr-browser tool (see footnote 3). dfr-browser creates an interactive online visualization that allows one to pivot around different analysis results of the topics, documents and words in a way that helps to show the statistical relations between them. For example, one can see an overview of topics showing their top-weighted words. Per topic, one can find the most relevant documents sorted by score. Per word, one can also see which topics it is prevalent in, which can indicate different contexts for the word. For example, “soil” may occur in topics covering actual soil studies, while also appearing in other topics, such as plant growth, where it is more incidental.
Co-occurrence Analysis of Terms
To examine the content of the abstracts in more detail, and to address our third objective to “Identify a broad range of processes utilized for secondary waste material valuation in academic literature”, co-occurrence analysis of entities was performed as illustrated in Fig. 2. Put simply, this process analyzed which types of wastes were mentioned in the literature along with certain types of technologies, applications or final products (TAPs). To achieve this, four steps were required: (a) create a robust list of TAPs, (b) create a sufficiently detailed list of wastes, (c) scan academic literature, recording which wastes and TAPs are mentioned in which documents, and (d) analyze and present statistics on the co-occurrence of terms from these lists in the collected literature.
A motivation for this approach is that it is not unusual for researchers to search academic databases for co-occurrences of wastes and TAPs in order to find literature showing how a particular waste can be utilized (e.g. “fish fat” and “biodiesel”). Our goal is to offer an alternative to performing one such search at a time by hand: we can scan a large body of literature for a variety of wastes and TAPs at once, and perform statistical analysis to highlight combinations which could be useful for researchers or practitioners seeking options for processing wastes.
We use two different strategies to scan academic literature for mentions of TAPs and wastes. For both tasks, we needed a list of terms to match, a way to match variants of those terms (e.g. plural and singular forms, synonyms, differing adjectives, etc.), and finally a way to map term variants to a single “preferred” version useful for later analysis. The literature we collected contains over 16 million words across the titles, keywords and abstracts. We therefore wanted a system for term location that was as automated as possible, although, as described below, full automation was not achievable, particularly for the waste terms.
We should emphasize that the lists of wastes and TAPs used to find terms of interest are not meant to be exhaustive or authoritative, but rather comprehensive and broad enough to demonstrate the value of our approach. Our system can automatically regenerate the analysis once better expert-curated lists are created.
Filtering for TAPs
Creating a list to capture TAPs of interest was difficult as there is no standard classification (see footnote 4) that we could use, especially one that can be easily linked to the entities that have been extracted. For this analysis, as shown in Appendix A.3, we compiled a list of 85 Wikipedia categories covering a range of topics in the bio-economy, then created a list of all the article titles within those categories (779 total), and finally filtered this by hand down to a list of 112 article titles representing an inventory of energy, products, chemicals, processes, and agricultural terms. Furthermore, we amended the list by incorporating chemicals identified as promising for the bio-economy in a literature review [15].
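As an illustration of the title-collection step, the sketch below queries the public MediaWiki API for the article titles in one category; the category name here is illustrative, and the 85 categories actually used are listed in Appendix A.3.

```python
# Hedged sketch: list the article titles in a Wikipedia category via the
# public MediaWiki API. "Category:Biofuels" is an illustrative example.
import requests

resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "list": "categorymembers",
        "cmtitle": "Category:Biofuels",
        "cmlimit": "500",
        "format": "json",
    },
)
resp.raise_for_status()

# Each member is a page (or subcategory) within the category.
titles = [m["title"] for m in resp.json()["query"]["categorymembers"]]
print(titles[:10])
```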
There are three reasons for using titles of Wikipedia articles to compile this list. First, it gave us a basis to build a sufficiently comprehensive list to demonstrate our approach. Second, Wikipedia article URLs (titles) essentially function as unique identifiers, as every distinct concept should have only one Wikipedia article. To be clear, in our analysis we are not using any of the actual content of Wikipedia articles, but simply the titles themselves; the only content we analyze is sourced from academic literature. While people are naturally skeptical of the quality of Wikipedia, one can argue that it is at least extremely comprehensive, with 5.4 million articles at the time of writing. Finally, this approach allowed us to use the DBpedia Spotlight service [16] to automate the process of locating terms of interest in the abstracts. A more detailed discussion of using DBpedia Spotlight for this purpose is given in Appendix A.4.
Through the DBpedia Spotlight analysis, we can determine, for every academic article abstract analyzed, which terms present in it have corresponding Wikipedia titles. This gives us, in an efficient manner, a very broad scan that tells us whether an abstract contains terms related to chemicals, organisms, place names, etc. For our analysis we needed to further filter the set of Wikipedia articles found per abstract, keeping only those mentioned in the list of TAPs compiled for the study.
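The sketch below shows this two-part step: annotating one abstract via the public DBpedia Spotlight endpoint, then keeping only the hits that appear in a TAP list. The endpoint and parameters reflect the current public API rather than necessarily the deployment used for the study, and both the abstract and the TAP list are illustrative.

```python
# Hedged sketch: annotate an abstract with DBpedia Spotlight, then filter
# the located Wikipedia titles against a TAP list.
import requests

abstract = ("Transesterification of waste fish fat over an oyster-shell "
            "derived catalyst for biodiesel production.")
resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": abstract, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Each resource URI (e.g. http://dbpedia.org/resource/Biodiesel) maps
# one-to-one to a Wikipedia article title.
resources = resp.json().get("Resources", [])

taps = {"Biodiesel", "Transesterification"}  # illustrative TAP list
tap_hits = [r["@URI"] for r in resources
            if r["@URI"].rsplit("/", 1)[-1].replace("_", " ") in taps]
print(tap_hits)
```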
Filtering for Wastes
For locating detailed waste terms in the abstracts, DBpedia Spotlight is not suitable, as Wikipedia does not have article titles at the level of detail that we would like to analyze. Existing waste classifications such as the European List of Waste (LoW; see footnote 5) are not suitable either, as they are too aggregated.
To enable scanning for detailed waste terms, we first created two lists: one for “waste sources”, and one for “waste descriptors”. As shown in Appendix A.3, the “waste sources” list contains 189 items or locations from which the wastes are produced (e.g. apple, aquaculture, banana, etc.). The “waste descriptors” list contains 156 various forms of waste (e.g. ash, pomace, peel, etc.). These two lists were initially populated using the biomass waste terms mentioned in the ECN Phyllis database [17] and were expanded to include synonyms and other wastes or sources of interest. In scanning for waste terms, we looked for sequences of terms containing matches from both lists (e.g. apple + peel, core, pomace, etc.). After locating the terms in the abstracts, we also had to perform extensive de-duplication in order to standardize the waste terms as much as possible for later analysis. For example, through this strategy we were able to locate the terms cow manure, cow excreta, cattle dung, bovine manure and dairy cow feces in the abstracts, but we needed an extra step to group these variants into a single “preferred” term, as sketched below.
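The following sketch illustrates this strategy with tiny, illustrative source and descriptor lists and a hand-curated variant-to-preferred mapping; the full lists are given in Appendix A.3.

```python
# Minimal sketch: locate waste terms as a "waste source" followed by a
# "waste descriptor", then map located variants to a single preferred term.
# All lists and mappings here are illustrative excerpts.
import re

waste_sources = {"apple", "cow", "cattle", "bovine", "dairy cow"}
waste_descriptors = {"peel", "pomace", "core", "manure", "dung", "excreta", "feces"}

# Hand-curated mapping of located variants to one preferred term.
preferred = {
    "cow excreta": "cow manure",
    "cattle dung": "cow manure",
    "bovine manure": "cow manure",
    "dairy cow feces": "cow manure",
}

# Longer sources first so "dairy cow" matches before "cow".
pattern = re.compile(
    r"\b(" + "|".join(sorted(waste_sources, key=len, reverse=True)) + r")\s+("
    + "|".join(waste_descriptors) + r")\b"
)

def find_wastes(abstract: str) -> set[str]:
    """Return the preferred waste terms mentioned in an abstract."""
    found = set()
    for match in pattern.finditer(abstract.lower()):
        variant = f"{match.group(1)} {match.group(2)}"
        found.add(preferred.get(variant, variant))
    return found

print(find_wastes("Biogas yields from cattle dung and apple pomace were compared."))
# -> {'cow manure', 'apple pomace'}
```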
The final list of wastes, documented in Appendix A.3.3, may be one of the most extensive lists of organic waste compiled. For comparison, the European Waste Catalog (EWC) lists a total of 90 wastes within its 02, 03, and 04 categories (see footnote 6), which cover agriculture, forestry, food processing, wood processing, and the textile industries, among many others. Data analysis utilizing such lists can be problematic, as the waste entities in the EWC and similar lists are often very vague, for example:
- 02 06 01—Materials unsuitable for consumption or processing,
- 02 01 03—Plant-tissue waste,
- 02 01 01—Sludges from washing and cleaning,
- 03 03 01—Waste bark and wood,
- 04 02 10—Organic matter from natural products (for example grease, wax),
- 04 02 99—Wastes not otherwise specified.
Co-occurrence Analysis and Visualization
One strategy for analyzing co-occurring terms would be to count how often two terms appear together across all documents. A drawback of this approach is that it would likely only highlight common knowledge or combinations that are already well known and appear in a large number of documents. To get around this issue, we used a different strategy: calculating the Normalized Pointwise Mutual Information (NPMI) of co-occurring terms [18]. The Pointwise Mutual Information is a statistical measure that evaluates whether two terms co-occur more often than would be expected by chance, which indicates that there may be some special relation between them. As seen in Eq. 1, it is the logarithm of the ratio of the probability that two terms occur together in a document to the product of the probabilities that each term occurs in a document (independent of whether the other is present). The probabilities are based on actual observations; in other words, a value of p(x, y) = 0.1 means that the terms x and y are observed to co-occur in 10% of all the documents collected. The NPMI (Eq. 2) was then used to normalize the PMI scores to between −1 and 1.
$$\begin{aligned} \mathrm{pmi}(x;y)=\log \frac{p(x,y)}{p(x)\,p(y)} \end{aligned}$$
(1)
Pointwise Mutual Information (PMI),
$$\begin{aligned} \mathrm{npmi}(x;y)=\frac{\mathrm{pmi}(x;y)}{-\log p(x,y)} \end{aligned}$$
(2)
Normalized Pointwise Mutual Information (NPMI) as derived from PMI.
After scanning for terms in all the abstracts, observations on the occurrence of TAPs and wastes per abstract were used to compute NPMI scores for each co-occurrence of a waste and a TAP. An NPMI value of 1 means that the terms always co-occur, 0 means that they are independent of each other, and −1 means that the terms never co-occur.
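A minimal sketch of this computation, following Eqs. 1 and 2 directly, is given below; the per-document term sets are invented rather than drawn from our corpus.

```python
# Hedged sketch: compute NPMI (Eqs. 1-2) from per-document term occurrences.
import math

docs = [
    {"fish fat", "biodiesel"},
    {"fish fat", "biodiesel", "transesterification"},
    {"cow manure", "biogas"},
    {"cow manure", "anaerobic digestion", "biogas"},
]
n_docs = len(docs)

def p(*terms: str) -> float:
    """Observed probability that all given terms occur in the same document."""
    return sum(all(t in doc for t in terms) for doc in docs) / n_docs

def npmi(x: str, y: str) -> float:
    """Normalized PMI in [-1, 1]: 1 = always co-occur, 0 = independent."""
    p_xy = p(x, y)
    if p_xy == 0.0:
        return -1.0  # the terms never co-occur
    pmi = math.log(p_xy / (p(x) * p(y)))  # Eq. 1
    return pmi / (-math.log(p_xy))        # Eq. 2

print(round(npmi("fish fat", "biodiesel"), 3))  # 1.0: always together
print(round(npmi("fish fat", "biogas"), 3))     # -1.0: never together
```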
As with the topic modelling, this step generated a large amount of data which in its raw form was not easily interpretable. To help with this, we created an interactive visualization, which can be viewed at https://is.gd/wastecircle, that employs the following strategies. First, wastes and TAPs are arranged in a circle, with connecting lines showing co-occurrences that have an NPMI value above a chosen threshold. As detailed in the results, we were able to locate over 473 specific wastes in the literature. To prevent the visualization from becoming too cluttered, we grouped the wastes by their waste sources (e.g. “apple pomace” and “apple peel” show up as “apple”). When the user hovers over a term like “apple”, a secondary visualization gives a more detailed view of the specific waste terms and their linkages with TAPs. The user can then highlight a combination of a specific waste term and a TAP to see links to the actual literature sources mentioning that combination. This helps users verify for themselves the exact nature of a co-occurrence and whether it is interesting for their purposes.
To further unclutter the circle visualization, we grouped together similar wastes and TAPs along the circumference of the circle, based on the categories shown in Appendices A.3.3 and A.2.2. For the TAPs, the categories largely correspond to those found in the literature review by [15], with categories for terms not in that review filled in by hand.
The final circle visualization employs hierarchical edge bundling [19], using an implementation created with the d3.js JavaScript library (see footnote 7), which we extended to include the more detailed network visualization and to show the links to literature mentioning the co-occurrences.
Since the co-occurrence analysis results are extensive and difficult to visualize within a single image, we have also created a series of matrices, introduced in “Co-occurrence Analysis Results”, showing the NPMI values between TAPs (on rows) and wastes (on columns). As an additional step, we applied hierarchical clustering to the rows and columns of these matrices. This clustering re-orders the matrix so that rows and columns with similar values are located near each other, giving an indication of the similarity of wastes and TAPs. For wastes, this clustering may group together materials with similar properties that could serve as substitutes in a particular value pathway. For TAPs, it can indicate chemicals which are related to processes, or relations between products and processes (such as syngas and gasification).
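A sketch of this re-ordering step is shown below, using SciPy’s hierarchical clustering on a random matrix that stands in for the actual NPMI matrix.

```python
# Hedged sketch: re-order an NPMI matrix with hierarchical clustering so that
# similar rows (TAPs) and columns (wastes) sit next to each other.
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage

rng = np.random.default_rng(0)
npmi_matrix = rng.uniform(-1, 1, size=(6, 8))  # stand-in: TAPs x wastes

# Cluster rows and columns separately, then take the leaf order of each
# dendrogram as the new ordering.
row_order = leaves_list(linkage(npmi_matrix, method="average"))
col_order = leaves_list(linkage(npmi_matrix.T, method="average"))
reordered = npmi_matrix[np.ix_(row_order, col_order)]
print(row_order, col_order)
```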
Iterative Support Between Topic Modelling and Co-occurrence Analysis
These approaches are not meant to operate in isolation. As illustrated in Fig. 3, we used the topic modelling results as a means of quickly evaluating whether we had collected clearly irrelevant literature, in which case we had to adjust our Web of Science and Scopus search terms and re-download the results. The results of the topic modelling also gave insight into the types of wastes and TAPs that we could expect to see in the co-occurrence analysis. If we failed to see these wastes or TAPs, this was an indication that the lists we used to scan the literature for terms were not complete enough and should be supplemented.