Towards robust tags for scientific publications from natural language processing tools and Wikipedia
In this work, two simple methods of tagging scientific publications with labels reflecting their content are presented and compared. The first source of labels is Wikipedia; the second label set is constructed from the noun phrases occurring in the analyzed corpus. The corpus itself consists of abstracts from 0.7 million scientific documents deposited in the ArXiv preprint collection. We present a comparison of both approaches, which shows that the discussed methods are to a large extent complementary. Moreover, the results give interesting insights into the completeness of Wikipedia knowledge in various scientific domains. As a next step, we examine the statistical properties of the obtained tags. It turns out that both methods show a qualitatively similar rank–frequency dependence, which is best approximated by a stretched exponential curve. The number of distinct tags per document also follows the same distribution for both methods and is well described by the negative binomial distribution. The developed tags are meant for use as features in various text mining tasks. Therefore, as a final step, we show preliminary results on their application to topic modeling.
Keywords: Tagging document collections · Natural language processing · Wikipedia · ArXiv preprint collection
Text mining methods and techniques are increasingly important in the design and deployment of digital library systems. They automatically generate additional value from the stored information, which improves the way the content may be searched, presented and consumed by the end user. In this work, we present a study of two methods for enriching scientific publications with a compact and uniform set of tags reflecting their content. The first method is based on Wikipedia, and the second approach relies on noun phrases detected by natural language processing (NLP) tools. Both methods are applied to a document collection consisting of abstracts from the ArXiv preprint server. The motivation behind this study is threefold.
First, we would like to generate compact and meaningful features for document content representation in text mining tasks, going beyond the basic bag-of-words approach. The developed tags can serve as such features, which can later be employed in various applications, e.g., determining document similarity, clustering and topic modeling. After appropriate filtering and ranking, the obtained tags can also be used as keyphrases summarizing the document. In this work, we briefly demonstrate the potential of the obtained tags by using them, instead of the bag-of-words representation, in the latent Dirichlet allocation topic modeling method.
Our second goal is the comparison of the two approaches to tagging publications with labels reflecting their content. We employed two methods, hereafter abbreviated NP and WIKI. The NP approach relies on a tag dictionary generated from noun phrases detected in the analyzed corpus using NLP tools. The WIKI method relies on a filtered set of Wikipedia multi-word entries. The tags generated by the two methods are to a large extent independent. Therefore, a comparison of the obtained results reveals the strengths and shortcomings of the NP and WIKI tags. As a side effect, on the basis of such a comparison one may draw conclusions about the completeness of Wikipedia knowledge in the examined scientific domains. Even though Wikipedia is often used in the analysis of scientific texts, its completeness with respect to the domain-specific vocabulary involved can be questioned, since many topics may be outside the scope of interest of the average Internet user, i.e., the Wikipedia reader and author. When considering texts from a particular field of science, the efficiency of the WIKI method relative to the complementary NP approach may therefore be used as a figure of merit, crudely reflecting Wikipedia's knowledge coverage of the examined discipline.
The third goal of this work is the analysis of the statistical properties of the obtained tags. We look at the distributions of the number of distinct tags per document. We also examine whether Zipf's law holds for the rank–frequency curves of the labels detected by both methods. Finally, it is interesting to check whether these statistical properties are qualitatively similar for the NP and WIKI tags.
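Such a rank–frequency curve is straightforward to compute from raw tag occurrences. The following minimal Python sketch shows the construction; the tag names are illustrative only, not taken from the corpus:

```python
from collections import Counter

def rank_frequency(tag_occurrences):
    """Return (rank, frequency) pairs, most frequent tag first."""
    counts = Counter(tag_occurrences)
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))

# Toy list of tag occurrences (one entry per tagged document).
tags = ["dark matter", "dark matter", "dark matter",
        "neural network", "neural network", "phase transition"]
curve = rank_frequency(tags)
# Under Zipf's law, frequency would fall off roughly as 1/rank;
# the question examined in this work is how well the real curves follow it.
```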
The paper is organized as follows. Related work is discussed in Sect. 2. In Sect. 3, the employed datasets are described. Afterward, in Sect. 4, we provide the details of both tagging procedures—the one based on Wikipedia (WIKI) and the complementary approach based on the noun phrases (NP). Comparison of both methods is the subject of Sect. 5. Statistical properties of the obtained tags are investigated in Sect. 6. The sample application of both NP and WIKI tags to the topic modeling problem is presented in Sect. 7. The results are summarized in Sect. 8.
This paper contains an extended version of the material presented at the Linking and Contextualizing Publications and Datasets Workshop, held during the Theory and Practice of Digital Libraries 2013 conference.
2 Related work
In the first of our tagging methods, we used noun phrases as a source of tags reflecting the content of a given text. This was inspired by methods applying NLP, in particular noun phrase detection, to the keyword extraction problem. Barker and Cornacchia  filtered noun phrases according to the frequency of the head noun to obtain the best keyphrases summarizing a document. Hulth  employed noun phrases and part-of-speech patterns in algorithms for supervised keyword extraction. Chuang, Manning and Heer  recently conducted a large-scale study of the properties of human-assigned keyphrases. They provided solid data confirming the intuition about the importance of noun phrases: in their study, almost 65 % of the manually assigned keyphrases either were a noun phrase or were contained in one. They also showed that a vast majority of human-assigned keyphrases consisted of multiple words (75 % in the experiment where human experts were presented one document at a time). Therefore, in our study we also focus on multi-word tags.
Our second approach to tagging is based on Wikipedia. Wikipedia is currently very often used in studies on conceptualizing and contextualizing document collections, and there is no doubt that it constitutes a very useful source of semi-structured knowledge. To name just a few recent examples, applications of Wikipedia knowledge in text mining tasks include extracting keywords , clustering [28, 29], assigning readable labels to the obtained document clusters [23, 24] and facilitating classification . When it comes to tagging, a method similar to ours, although much more advanced, was used by Mendes and coworkers . They created DBpedia Spotlight, a system which uses DBpedia URIs to tag documents. Moreover, it allows the annotations to be configured to the user's needs through the DBpedia Ontology  and dedicated quality measures.
3 Employed datasets
The ArXiv categories and their abbreviations:
Condensed Matter Physics (cond-mat)
General Relativity and Quantum Cosmology (gr-qc)
High Energy Physics—Experiment (hep-ex)
High Energy Physics—Lattice (hep-lat)
High Energy Physics—Phenomenology (hep-ph)
High Energy Physics—Theory (hep-th)
4 Processing methods
Generating the preliminary dictionary During this stage, we generated the preliminary version of the dictionary, used later on as a lexicon for tagging. For the WIKI case, all multi-word titles of articles from a Wikipedia dump were simply extracted; full texts of Wikipedia articles were not used. For the NP method, all the abstracts from the ArXiv corpus were analyzed using the general-purpose natural language processing library OpenNLP , and all noun phrases containing two or more words were detected. Noun phrases occurring in fewer than four documents were excluded from the dictionary.
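The document-frequency filter for the NP dictionary can be sketched as follows. Noun phrase detection itself is done with OpenNLP in our pipeline, so in this illustration the detected phrases are simply assumed as input:

```python
from collections import defaultdict

def build_np_dictionary(doc_phrases, min_docs=4):
    """Keep multi-word phrases occurring in at least `min_docs` documents.

    `doc_phrases` maps a document id to the collection of noun phrases
    detected in it (in the paper, the detection is performed by OpenNLP;
    here the phrases are given directly).
    """
    doc_freq = defaultdict(int)
    for phrases in doc_phrases.values():
        for p in set(phrases):          # count each phrase once per document
            if len(p.split()) >= 2:     # multi-word phrases only
                doc_freq[p] += 1
    return {p for p, df in doc_freq.items() if df >= min_docs}
```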
Cleaning the dictionary Clearly, at this stage both dictionaries contain a lot of non-informative entries. Therefore, we apply a cleaning procedure to both preliminary tag sets. For each tag, we remove the initial and final words if they belong to a set of stopwords. Labels which contain only one word after such filtering are removed. Then we use the simple heuristic observation that good label candidates usually do not contain a stopword in the middle; see the study  for more details. One notable exception here is the word of. We drop all entries violating this heuristic rule. Naturally, far more sophisticated algorithms can be employed here; one example is matching grammatical patterns devised to select true keywords, which could be employed when part-of-speech information is available [3, 14]. However, the simple stopword method worked well enough for us, especially because we mostly aimed at labels for further applications in machine learning and hence could afford a certain fraction of “bogus labels”. After the cleaning procedure, the generated dictionaries contained around 5 million entries for the WIKI method and 0.3 million for the NP case.
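The stopword heuristic can be sketched in a few lines of Python. The stopword list below is a small illustrative sample, not the list actually used:

```python
# Illustrative sample only; the real stopword list is much longer.
STOPWORDS = {"the", "a", "an", "in", "on", "of", "and", "for", "to"}

def clean_label(label):
    """Apply the cleaning heuristic: trim boundary stopwords, then reject
    labels that end up with fewer than two words or that contain an
    interior stopword other than 'of'. Returns None for rejected labels."""
    words = label.lower().split()
    while words and words[0] in STOPWORDS:      # trim leading stopwords
        words.pop(0)
    while words and words[-1] in STOPWORDS:     # trim trailing stopwords
        words.pop()
    if len(words) < 2:
        return None
    if any(w in STOPWORDS and w != "of" for w in words[1:-1]):
        return None
    return " ".join(words)
```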
Tagging Finally, we tag the analyzed corpus of ArXiv abstracts with the filtered dictionaries obtained previously. In the process of tagging, we make use of Porter stemming  to alleviate the problem of different grammatical forms. For the WIKI case, every abstract that contains a sequence of words stemming to the same roots as a label from the WIKI dictionary is tagged with this label. The NP case proceeds in the same way, except that the dictionary generated from noun phrases is used instead of the lexicon created from the titles of Wikipedia articles.
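The stem-based matching can be sketched as follows. Note that `toy_stem` below is a crude illustrative stand-in that strips a few common suffixes; the actual implementation uses the Porter stemmer:

```python
def toy_stem(word):
    """Crude stand-in for the Porter stemmer (illustration only):
    strip a few common English suffixes."""
    for suf in ("ations", "ation", "ings", "ing", "es", "s"):
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def tag_abstract(abstract, dictionary):
    """Tag an abstract with every dictionary label whose stemmed word
    sequence occurs as a contiguous sequence in the stemmed abstract."""
    words = [toy_stem(w) for w in abstract.lower().split()]
    tags = set()
    for label in dictionary:
        stems = [toy_stem(w) for w in label.lower().split()]
        n = len(stems)
        if any(words[i:i + n] == stems for i in range(len(words) - n + 1)):
            tags.add(label)
    return tags
```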
5 Comparison of the WIKI and NP tags across domains
Comparison of the top ten most frequent tags in four categories (sample entries: state of art, degrees of freedom, point of view, quality of service, order of magnitude, multi agent system, equation of state, center of mass, heavy ion collision, Au Au collision, time of flight, relativistic heavy ions)
Clearly, not all of the above tags are perfect. It can be observed that the noun phrase detector sometimes yields only a fragment of the actual noun phrase, e.g., hoc network is a fragment of the correct phrase ad hoc network, and time algorithm comes from complexity statements such as polynomial time algorithm. There are also a few tags which do not carry any information, e.g., et al, point of view, give rise and initial data. If needed, their impact can be reduced by improving the filtering procedure described in Sect. 4.
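One possible refinement of the filtering, not used in this work, would be to drop a tag whenever it occurs almost exclusively as a sub-phrase of a longer, attested tag; a hypothetical sketch:

```python
def drop_fragments(tag_counts, containment=0.9):
    """Hypothetical refinement (not from the paper): drop a tag if it is
    a sub-phrase of longer tags and almost all of its occurrences are
    accounted for by them. `tag_counts` maps tag -> corpus frequency."""
    kept = {}
    for tag, count in tag_counts.items():
        # How many occurrences are explained by longer tags containing it?
        covered = sum(c for t, c in tag_counts.items()
                      if t != tag and f" {tag} " in f" {t} ")
        if covered < containment * count:
            kept[tag] = count
    return kept
```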
6 Statistical properties of the WIKI and NP tags
7 Application of the NP and WIKI tags to topic modeling
7.1 Description of the employed methods
The tags developed and examined in the previous sections can be employed as features in various sorts of text mining algorithms. In this part, we examine their applicability to topic modeling.
Detecting topics in large collections of documents is of very high interest when building digital libraries or other document repositories. In particular, this technique can significantly improve search facilities or content discovery options. The key method in the field is latent Dirichlet allocation (LDA) . This is a very useful model, as it does not require any a priori knowledge about the domain of a given document collection or a manually labeled training set (it is an unsupervised method). However, the topics generated by this model are still often imperfect and not always meaningful for humans . Therefore, to improve the situation, we examine running the LDA analysis not on words, but on tags, which already form comprehensible short phrases. To put it another way, we investigate the use of the “set of tags” instead of the “bag of words” document representation.
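In practice, switching from “bag of words” to “set of tags” amounts to building the document-term matrix over the tag vocabulary instead of the word vocabulary; any LDA implementation can then consume the resulting matrix. A minimal sketch (the tag names are illustrative):

```python
from collections import Counter

def tags_to_matrix(doc_tags):
    """Build a document-term matrix whose 'vocabulary' consists of tags
    rather than words. `doc_tags` is a list of per-document tag lists.
    Returns (vocabulary, rows of counts) ready for an LDA implementation."""
    vocab = sorted({t for tags in doc_tags for t in tags})
    index = {t: j for j, t in enumerate(vocab)}
    rows = []
    for tags in doc_tags:
        row = [0] * len(vocab)
        for t, c in Counter(tags).items():
            row[index[t]] = c
        rows.append(row)
    return vocab, rows
```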
For the experiments in this part, a subset of 1,052 documents was selected. This set consisted of ArXiv abstracts from the cond-mat domain published in March 2012. The domain of condensed matter physics was selected since it is a very strongly represented area in the dataset (the second most abundant domain, see Fig. 1); its content is therefore likely to be a reasonable sample of the whole body of ongoing research. Moreover, a domain expert in the field was available for this study.
To carry out the analysis, we employed the R statistical environment, supplemented with the specialized text analysis packages tm  and topicmodels . The latter contains procedures for carrying out the LDA analysis on a given corpus of documents.
7.2 Qualitative analysis of the obtained topics
Objective automatic evaluation of the obtained topics is a difficult problem. Typically, topics are ranked on the basis of the probability of a held-out set of documents calculated from the previously trained LDA model [6, 30]. Other, more sophisticated methods of ranking topic models have also been introduced [4, 21]. Nevertheless, automatic methods do not always correlate with human judgment [7, 8]. In the work , the authors summarized the features describing a well-extracted topic as sensible, meaningful, interpretable and coherent. These traits are very difficult to quantify automatically. Therefore, the only currently available method of reliable topic ranking is manual judgement. Indeed, recent works dealing with this issue [7, 8, 22] employed large numbers of human evaluators, gathered, e.g., using crowdsourcing services such as Amazon Mechanical Turk. Human judgement is particularly important when further applications in digital library systems are planned. Obviously, digital libraries are designed for people, and it is their user experience, not a mathematical criterion, that has to be maximized.
Therefore, our LDA results were inspected by an expert in the field of condensed matter. The most apparent observation, noticeable to the human expert in the presented word clouds, is that the tag-based methods yield narrower and more easily interpretable topics. For example, both tag-based methods (NP and WIKI) recognized topics related to Bose–Einstein condensation; see NP1 and WIKI7 in Figs. 9 and 10, respectively. This is indeed a heavily investigated field, both experimentally and theoretically (the 2001 Nobel Prize in Physics was awarded for work in this field). Similarly, both the NP and WIKI methods recognized the quantum Hall effect and related research (again, a heavily investigated phenomenon; the Nobel Prizes in Physics in 1998 and 1985 were awarded for work related to this field); see NP7 and WIKI3. Furthermore, the NP and WIKI methods recognized density functional theory as a separate topic (NP3, WIKI6), which is fitting, as it is currently the most powerful theoretical approach to predicting the properties of solids from the first principles of quantum mechanics. The remaining topics easily identifiable by the human expert are spin–orbit effects, especially in low-dimensional structures (NP2, WIKI5), and many-body simulation methods (WIKI1). Out of these topics, only one can be identified (with slight difficulty) by the expert in the abstract-based results: ABS6 in Fig. 8, which is related to density functional theory. Apart from it, the ABS method yields only one more sharp topic that could be clearly resolved by the human expert, namely ABS3, which can be interpreted as graphene-related research. Admittedly, this has been a hot topic in recent years (the 2010 Nobel Prize in Physics was awarded for pioneering work in this field). Other than that, the ABS topics described rather general and broad themes. The fact that so few of the results from the ABS and NP/WIKI methods coincide seems to be related to deficiencies of the LDA method.
As has been pointed out, e.g., in , the distribution of words in documents can be much sharper than the distribution in fairly general topics. This effect is captured by models in which active features are multiplied and renormalized, e.g., [16, 27]. It is possible that such methods would locate sharp correlations between the words density, functional, theory or spin, orbit, coupling.
The next important conclusion from the presented results is that the efficiency of topic detection in the tag-based methods is heavily affected by the quality of the input tags. Both tag-based methods (NP and WIKI) missed the very important graphene topic ABS3. This was a consequence of the tags’ construction: they were by design restricted to multi-word features. Even though the detected tag set contained terms such as bilayer graphene or graphene nanoribbon, these were too specific and rare to generate a separate topic. Recent studies  show that human-generated keywords consist mostly of multiple words; however, it turns out that important, otherwise unrecoverable information may be contained in single-word tags. Therefore, including unigram tags in the presented methods (without cluttering the tag dictionary with all nouns or all single-word entries from Wikipedia) is a very interesting direction for further research.
As far as the number of tags per document is concerned, it does not seem to influence topic quality very strongly. The average number of WIKI tags in the physics-cond-mat field was slightly lower than 40 % of the number of NP tags (compare Sect. 5, in particular Fig. 4). Nevertheless, the quality of the topics obtained from the NP and WIKI methods is comparable. Moreover, in many cases a correspondence between the topics produced by the two approaches can be established, e.g., NP1–WIKI7, NP2–WIKI5, NP3–WIKI6 and NP7–WIKI3.
Another interesting aspect of the presented comparison is related to the topic repetition effect, one of the flaws encountered in LDA results, in which one theme as resolved by a human appears as multiple LDA topics . Our experiments indicate that the tag-based approaches and the plain abstract method seem equally prone to this difficulty. For example, ABS1 and ABS2, or NP4 and NP5, are related to very similar areas.
7.3 Analysis of computational cost
Execution time for the examined methods
Overall, we find that the examined NP and WIKI tags are useful features for applications in LDA. They generate interpretable, narrow topics and reduce the computing resources needed to obtain results. Their application to other text mining problems, such as evaluating document similarity or clustering, also seems promising.
8 Summary and conclusions
In this paper, we have compared two methods of tagging scientific publications. The first, abbreviated WIKI, was based on the multi-word entries from Wikipedia. The second, referred to as NP, relied on the noun phrases detected by NLP tools. We have focused on the effectiveness of each method across domains and on the statistical properties of the obtained labels. Since the tags are meant to be used as features in text mining tasks, we have also shown a sample application to topic modeling using the LDA approach.
When it comes to the effectiveness of the above tagging methods, it turned out that the NP approach yields a higher average number of tags per document, by a factor of between two and three with respect to the WIKI case, depending strongly on the domain. The coverage of the WIKI tags is better in areas more relevant to the Internet community, such as computer science or quantitative finance, than in more exotic domains, such as experimental nuclear physics. This ratio can be interpreted as a crude measure of the completeness of Wikipedia knowledge across domains. Neither method is clearly superior to the other; both have their strengths and weaknesses. The NP method is more prone to inaccuracies of the underlying NLP tools, sometimes detecting incomplete phrases, and it produces more uninformative tags. The WIKI method yields generally fewer tags, but is much more effective in detecting tags including surnames, which are often important in the names of theorems, equations or effects in science.
As far as the statistical properties are concerned, it turned out that the WIKI and NP methods exhibit qualitatively very similar behavior. The observed dependence of tag frequency on tag rank deviates from Zipf's law; however, it can be well approximated by the so-called stretched exponential model. The investigation of the distribution of the number of distinct tags per document revealed that, in both the WIKI and NP cases, it follows the negative binomial model quite closely.
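For reference, the two fitted forms mentioned above can be written down explicitly; the parameter names below are generic, as the exact parameterization and normalization used in the fits are not reproduced here:

```latex
% Stretched exponential rank--frequency dependence
% (f(r) is the frequency of the tag of rank r; r_0 and beta are fit parameters):
f(r) \propto \exp\!\left[-\left(\frac{r}{r_0}\right)^{\beta}\right],
\qquad 0 < \beta < 1.

% Negative binomial distribution of the number k of distinct tags per document
% (r > 0 and 0 < p < 1 are fit parameters):
P(k) = \binom{k + r - 1}{k}\, p^{r} (1 - p)^{k},
\qquad k = 0, 1, 2, \ldots
```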
The sample application of the prepared tags to topic modeling using the LDA method shows promising results. Representing a document as a “set of tags” instead of a “bag of words” yielded topics that were well defined and easily interpretable by humans. It also reduced the computational time required for the analysis. Both the NP and WIKI tag methods performed comparably well here. Overall, in our opinion, the presented tags are a useful complement to the “bag of words” representation. We plan further refinement of both methods (in particular, extending them with unigrams) and further experiments with their applications in text mining tasks.
This research was carried out with the support of the “HPC Infrastructure for Grand Challenges of Science and Engineering (POWIEW)” Project, co-financed by the European Regional Development Fund under the Innovative Economy Operational Programme.
- 1. Apache OpenNLP, http://opennlp.apache.org
- 2. arXiv preprint server, http://arxiv.org
- 4. AlSumait, L., Barbará, D., Gentle, J., Domeniconi, C.: Topic significance ranking of LDA generative models. In: Buntine, W., Grobelnik, M., Mladenić, D., Shawe-Taylor, J. (eds.) Machine Learning and Knowledge Discovery in Databases, Lecture Notes in Computer Science, p. 67. Springer, Berlin (2009). doi:10.1007/978-3-642-04180-8_22
- 5. Barker, K., Cornacchia, N.: Using noun phrase heads to extract document keyphrases. In: Hamilton, H. (ed.) Advances in Artificial Intelligence, Lecture Notes in Computer Science, vol. 1822, p. 40. Springer, Berlin (2000). doi:10.1007/3-540-45486-1_4
- 7. Chang, J., Boyd-Graber, J.L., Gerrish, S., Wang, C., Blei, D.M.: Reading tea leaves: how humans interpret topic models. In: Neural Information Processing Systems, vol. 22, p. 288 (2009)
- 8. Chuang, J., Gupta, S., Manning, C., Heer, J.: Topic model diagnostics: assessing domain relevance via topical alignment. In: Dasgupta, S., McAllester, D. (eds.) Proceedings of the 30th International Conference on Machine Learning (ICML-13), vol. 28, p. 612. JMLR Workshop and Conference Proceedings (2013)
- 10. Feinerer, I., Hornik, K., Meyer, D.: Text mining infrastructure in R. J. Stat. Softw. 25(5), 1 (2008)
- 11. Grün, B., Hornik, K.: topicmodels: an R package for fitting topic models. J. Stat. Softw. 40(13), 1 (2011)
- 12. Hulth, A.: Improved automatic keyword extraction given more linguistic knowledge. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, EMNLP '03, p. 216. Association for Computational Linguistics, Stroudsburg, PA, USA (2003). doi:10.3115/1119355.1119383
- 16. Larochelle, H., Lauly, S.: A neural autoregressive topic model. In: NIPS, p. 2717 (2012)
- 17. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia: a large-scale, multilingual knowledge base extracted from Wikipedia. Semant. Web J. (2014, to appear)
- 18. Łopuszyński, M., Bolikowski, Ł.: Tagging scientific publications using Wikipedia and natural language processing tools. In: Bolikowski, Ł., Casarosa, V., Goodale, P., Houssos, N., Manghi, P., Schirrwagen, J. (eds.) Theory and Practice of Digital Libraries: TPDL 2013 Selected Workshops, Communications in Computer and Information Science, vol. 416, p. 16. Springer International Publishing (2014). doi:10.1007/978-3-319-08425-1_3
- 19. Mendes, P.N., Jakob, M., García-Silva, A., Bizer, C.: DBpedia Spotlight: shedding light on the web of documents. In: Proceedings of the 7th International Conference on Semantic Systems, I-Semantics '11, p. 1. ACM, New York, NY, USA (2011). doi:10.1145/2063518.2063519
- 21. Newman, D., Lau, J.H., Grieser, K., Baldwin, T.: Automatic evaluation of topic coherence. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, HLT '10, p. 100. Association for Computational Linguistics, Stroudsburg, PA, USA (2010)
- 22. Newman, D., Noh, Y., Talley, E., Karimi, S., Baldwin, T.: Evaluating topic models for digital libraries. In: Proceedings of the 10th Annual Joint Conference on Digital Libraries, JCDL '10, p. 215. ACM, New York, NY, USA (2010). doi:10.1145/1816123.1816156
- 23. Nomoto, T.: WikiLabel: an encyclopedic approach to labeling documents en masse. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, CIKM '11, p. 2341. ACM, New York, NY, USA (2011). doi:10.1145/2063576.2063961
- 24. Nomoto, T., Kando, N.: Conceptualizing documents with Wikipedia. In: Proceedings of the Fifth Workshop on Exploiting Semantic Annotations in Information Retrieval, ESAIR '12, p. 11. ACM, New York, NY, USA (2012). doi:10.1145/2390148.2390155
- 25. Porter, M.: An algorithm for suffix stripping. Progr. Electron. Libr. Inf. Syst. 14(3), 130 (1980)
- 26. Rose, S., Engel, D., Cramer, N., Cowley, W.: Automatic keyword extraction from individual documents, p. 1. John Wiley and Sons, Ltd (2010). doi:10.1002/9780470689646.ch1
- 27. Salakhutdinov, R., Hinton, G.E.: Replicated softmax: an undirected topic model. In: NIPS, vol. 22, p. 1607 (2009)
- 30. Wallach, H.M., Murray, I., Salakhutdinov, R., Mimno, D.: Evaluation methods for topic models. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, p. 1105. ACM, New York, NY, USA (2009). doi:10.1145/1553374.1553515
Open Access This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.