
1 Introduction

The creation of texts has accelerated in the last few decades. The number of patents, websites, and data in general has increased exponentially, and searching for the right piece of information has become a ubiquitous problem (referred to here as general-purpose information searching). Scientific writing is particularly affected: when compiling a literature review, many researchers do not know how search engines actually work. While journals and renowned conferences help sort articles in a research field and identify the state of the art, individual researchers often struggle to gain a comprehensive overview of all the relevant studies. Not only has the pace of writing and publishing studies accelerated, but the pressure to publish or perish has been quantified into numbers and scores such as the h-index,Footnote 1 further increasing the amount of data to be searched. The idea that digitalization and search engines can nonetheless lead to substantial time gains when surveying a subject in a certain scientific field is appealing, but in practice it often entails the problem of finding appropriate studies within an overly long list of potentially relevant matches.

In this situation, academics are confronted with problems that arose with big data and the internet, especially information overload. Information retrieval is the field that develops algorithms for finding a piece of information in large corpora. The problem emerged in the late 1960s and 1970s with the creation of databases, and more specifically with the storage of large text collections in libraries and large institutions. Databases use an index to access data quickly; unfortunately, creating an index over texts is not that easy. For instance, sometimes part of a word is of interest (when looking for graduate, the word undergraduate is also relevant), so a simple alphabetic index will not cover even basic use cases. Better methods had to be developed, turning databases into search engines. Nevertheless, textual data is unstructured data, from which computers cannot easily extract knowledge. Knowledge extraction refers to the field that studies approaches to extracting structured information from text. Since the beginning of electronic computers, a large amount of data has been embedded in text, and manually extracting structured information from it is an arduous task. In particular, when performing knowledge extraction, information retrieval is often a first task to execute, so information retrieval and knowledge extraction are closely related.

In the last two decades, the issue of information retrieval has become omnipresent, for example in the competition between search engines such as AltaVista, Yahoo, Microsoft Search (Bing), and Google, with Google ending up with the lion's share. Even today, there are attempts to break Google's monopoly with new search engines such as Ecosia and DuckDuckGo. However, Google's algorithm, whose core we will cover later, remains the most popular nowadays.

When writing scientific articles, thanks to the rapid digitalization of academic publishing and the rise of search engines, we now have access to so much more data and information than before that we are often confronted with the challenge of finding a needle in a haystack. This is where online tools can help, especially those providing access to scientific publications. Hence, academic social network platforms, search engines, and bibliographic databases such as Google Scholar (Halevi et al., 2017), Scopus, Microsoft Academic, ResearchGate, or Academia.edu have become very popular over the last decade (Ortega, 2014; van Noorden, 2014). These specialized search engines are needed and provide a great gain over conventional search engines, since the procedure for academic writing is very different from general-purpose information searching (Raamkumar et al., 2017). Most of these online platforms offer more or less detailed search interfaces to retrieve relevant scientific output. Moreover, they provide us with indicators allowing us to assess the relevance of the search results: the number of citations, specific keywords, references to further relevant studies through automatic linking from the citation list, and articles suggested on the basis of previous searches or according to preferences set in one's profile, amongst others. However, many challenges remain, such as the ontological challenge of finding the right search terms (many terms being ambiguously coined), including all possible designations for a given topic, as well as assessing the quality of the articles presented in the results list.

On top of that, with the rise of academic social-networking activities, the number of potentially interesting and quickly accessible publications surpasses our human capacities. As a result, we depend more and more on algorithms to perform a first selection and extract relevant information which we can turn into knowledge for our scientific writing purposes. In that sense, algorithms provide us with two important services: on one side, information retrieval, which is becoming more sophisticated every day, and on the other side, knowledge extraction, i.e., access to structured dataFootnote 2 allowing us to process the information automatically, e.g., for statistics or surveys. This chapter will present and discuss the methods used to solve these tasks.

2 Information Retrieval

When we use an academic search engine or database to obtain an overview of the relevant articles on a given topic, we come up with a moderate number of words that, in our opinion, sum up the topic, and enter them in the search field. By launching the search, we hand over to the machine and the actual information retrieval process. The main purpose of information retrieval is to find relevant texts in a large collection given the handful of words provided by a human user and, more specifically, to rank these documents by their relevance to the query words. The resulting list of matches is thus created according to various criteria usually not known to the users. Yet, gaining insights into the information retrieval process might help understand and assess the relevance of the displayed search results, especially what ends up on top of the ranked list and what might get suppressed or ranked down. As we will see later, depending on the search engine, a search term needs to be written exactly, or the search engine can provide us with helpful synonyms or links to interesting papers.

The first approach to information retrieval is to break the query down into individual words and look for the occurrences of each term in the texts. The occurrence of a term in a document might increase the relevance of the document to the query, especially if the term occurs many times within the same document. However, if a term is as frequent in the language in general as it is in the corpus, it might be of no help. A metric aiming to address this issue is term-frequency inverse-document-frequency (TF-IDF) (Manning & Schütze, 1999), which is used for an array of natural language problems, amongst others in automatic text classification (e.g., spam recognition, document classification [Benites, 2017]), dialect classification (Benites et al., 2018), but also in research-paper recommender systems (Beel, 2015). This method can find words that are important or specific for a certain document within a collection of documents, by giving a word more weight if it frequently occurs in the document and less weight if it frequently occurs in the collection. Further considerations can help sort the results. If we want to find something about “scientific text writing” in a scientific database of articles on the internet, there will probably be just too many hits. Adding more words will reduce the list of results (since they are usually combined by an AND operation, seldom by an OR), but this implies choosing an adequate term that gives the query more purpose and specificity. For example, adding the word “generation” will narrow down the result set, but it could be equally helpful to discard a less important query term, such as “text.” Moreover, very long documents might contain all the query words, which would lead to considering them a good match.
However, if the terms are scattered throughout different parts of the document with no vicinity or direct relation to each other, chances are the document covers disjoint subjects that do not reunite into the subject of interest. This is why some methods also penalize lengthy documents and prioritize documents showing indicators of centrality, such as the number of citations, to obtain a more relevant set of results. More importantly, these criteria have a direct impact on the ranking order of the results.
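As an illustration, the TF-IDF weighting described above can be sketched in a few lines of Python. The toy corpus and the tokenization are invented for demonstration; real search engines use far more elaborate variants of this formula.

```python
import math

def tf_idf(term, doc, corpus):
    """Weight a term highly if it is frequent in this document
    but rare across the whole collection of documents."""
    tf = doc.count(term) / len(doc)                      # term frequency
    n_docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / (1 + n_docs_with_term)) # inverse document frequency
    return tf * idf

# Toy corpus: each document is a list of tokens.
corpus = [
    ["scientific", "text", "writing", "generation"],
    ["text", "classification", "with", "text", "features"],
    ["river", "bank", "erosion"],
]

# "generation" occurs in only one document, so it is more specific to
# document 0 than the ubiquitous "text", whose IDF drops to zero here.
print(tf_idf("generation", corpus[0], corpus))
print(tf_idf("text", corpus[0], corpus))
```

This shows why adding a specific word like “generation” helps a query: its score concentrates on few documents, while a common word like “text” contributes almost nothing to the ranking.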

However, all those aspects do not consider the semantic context of a word. A “bank” can be a piece of furniture, a financial institution, or the land alongside a river. This is why more and more search engines use so-called contextual language models (such as transformers): artificial neural networks (machine learning approaches) trained to predict missing words in sentences from a collection of billions of texts (Devlin et al., 2018). This training procedure is called a self-supervisedFootnote 3 task but is also known as pre-training. This approach helps the model memorize which words are used in the vicinity of certain words. After the pre-training phase, these models can be fine-tuned to downstream tasks, such as document classification, similarity ranking of documents, sentiment analysis (e.g., is a tweet negative or positive), named-entity recognition (e.g., classification of words: Mr. President Obama, Senator Obama, and Barack Obama all refer to one entity), language generation (for chatbots, for rephrasing tools), and the list goes on. Their range is so broad because they can create document representationsFootnote 4 that take the context into account, and they can determine whether two documents are relevant to each other even though they might only be connected by synonyms, i.e., they do not use the same exact vocabulary but have a similar meaning. This allows a search which is much more semantically guided and less orthographic (tied to the exact spelling of a word).

After breaking the text into single words and examining them, the next step towards better ranking is to not only look for single words but to analyze word combinations and see whether they constitute a term or construction in the corpus. The TF-IDF approach would only search for n-grams (contiguous sequences of n words) of the terms, and to that purpose it would need to build an index of all possible word combinations (usually n-grams of 3–7 words). This index can quickly become oversized as the number of combinations explodes (multiple hundreds of gigabytes, depending on corpus size and vocabulary diversity). Newer language models, such as transformers, take a different approach. They dissect words into subwords and then try to grasp the combination from the whole sentence or paragraph (usually 512 subwords, which can amount to 200–300 words). They use a mechanism called self-attention, which weights each word from different perspectives (one for being a query of other words, one for being a key for other words, and lastly one for being the value searched by the query and key), using a positional encoding for each word. The intuition is that the model can then check correlations between the words, as it takes the whole sentence as input. Moreover, neural networks consider all possible combinations at the same time. This creates a computational problem, which is dealt with by a myriad of heuristics and a massive amount of computational power. Consequently, this produces powerful language models able to grasp context even over long distances in a sentence, enabling, for instance, context-aware coreference resolution (the cat ate the mouse, it was hungry: which animal does “it” refer to?). This can be used by search engines when analyzing search words: are the queried words found in the documents, and if so, are they used as central words in the right context?
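The query/key/value weighting described above can be illustrated with a deliberately simplified, single-head self-attention sketch in pure Python. The 2-dimensional token vectors are invented, and each token vector serves as its own query, key, and value; real transformers additionally use learned projection matrices, multiple heads, and positional encodings.

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    """Toy single-head self-attention over token vectors X."""
    d = len(X[0])
    out = []
    for q in X:  # each token acts as a query...
        # ...scored against every token acting as a key
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in X]
        weights = softmax(scores)  # attention distribution over tokens
        # output: weighted mix of all value vectors
        out.append([sum(w * v[i] for w, v in zip(weights, X))
                    for i in range(d)])
    return out

# Three tokens with invented 2-d embeddings; tokens 0 and 1 are similar.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
mixed = self_attention(X)
```

After the mixing step, each output vector is a blend of all tokens, weighted by similarity: the first token's representation is pulled mostly towards the tokens that resemble it. This is the sense in which every word is re-represented in the context of the whole input.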

While search terms play a major role in the information retrieval process, most academic search engines still heavily rely on citations, using them to create graphs. Such graphs can use the PageRank algorithmFootnote 5 (Page et al., 1999) to prioritize works that are highly cited. CiteSeer used a different approach and implemented a “Common Citation Inverse Document Frequency” (Giles et al., 1998). It is also possible to create networks based on the search terms and count only citations that are relevant for the search. The use of citations by Google Scholar was also examined in Beel and Gipp (2009). The paradigm of the PageRank algorithm can be observed in a citation network:Footnote 6 important seminal papers are ranked higher. As Raamkumar et al. (2017) point out, seminality is critical for a scientific reading list, along with sub-topic relevance, diversity, and recency. These criteria can also be applied to a literature survey and to ranking scientific publications for the use case of scientific writing.
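The intuition behind PageRank on a citation network can be sketched as a small power-iteration loop. The four-paper citation graph below is hypothetical, with edges pointing from a citing paper to the papers it cites; real implementations work on graphs with millions of nodes and use sparse-matrix methods.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links: {paper: [papers it cites]}. Returns a score per paper;
    heavily cited papers accumulate rank from their citers."""
    papers = list(links)
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in papers}
        for p, cited in links.items():
            if cited:  # distribute this paper's rank over its citations
                share = rank[p] / len(cited)
                for c in cited:
                    new[c] += damping * share
            else:      # paper citing nothing: spread its rank evenly
                for c in papers:
                    new[c] += damping * rank[p] / n
        rank = new
    return rank

# Hypothetical citation graph: A is a seminal paper cited by everyone.
links = {"A": [], "B": ["A"], "C": ["A", "B"], "D": ["A", "C"]}
scores = pagerank(links)
```

The seminal paper A ends up with the highest score, mirroring how citation-based ranking pushes foundational works to the top of a result list.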

In sum, automatic information retrieval is a complex process involving multiple elements such as words, subwords, synonyms, document length, and citations. However, the way these elements are used and combined by the machine to establish a ranked list of matches is generally not displayed along with the results. This is why being aware of such mechanisms can help take a constructive critical stance towards the identified literature.

3 Knowledge Extraction

As the amount of scientific literature grows significantly, the need for systematic literature reviews in specific research fields is also increasing. Human-centered approaches have been developed and established as standards, e.g., the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA) method (Page et al., 2021). Nevertheless, the overwhelming amount of available literature in some fields calls for automated solutions. Unlike information retrieval, knowledge extraction directly taps into a publication’s content to extract and categorize data.

Constructing structured data, which can be saved into a schematized database and processed automatically, from unstructured data (e.g., a simple text document) is a vast research field. Processing unstructured data, especially documents or articles, automatically is of great practical importance. For example, in medical research, contraindications of a substance or illnesses associated with a certain drug could be found automatically in the literature, guiding the search process and speeding up research even more. Unfortunately, it is not so easy to identify the substances or the relationships connecting them. In the field of natural language processing (NLP), we speak of named entity recognition (the substances) and relation extraction (how the substances relate to each other). Although finding relevant entities seems easy enough, there are many cases where it is quite difficult. For example, the 44th President of the United States of America can be referred to by his name Barack Hussein Obama II, Mr. President (even though he is no longer active in this position), candidate for President, Senator Obama, President Obama, Nobel Peace Prize laureate, and so on. Usually, authors will use multiple denominations of the same entity to avoid repetition, rendering the finding and tracking of named entities very difficult for an automatic algorithm. Although many improvements were made in recent years to grasp the semantic context of a word, the understanding and real-world modelling of NLP algorithms remain extremely limited. The algorithms can extract many relations from texts, but forming a chain of consequences is difficult. They are merely advanced pattern-matching procedures: given a word, they find which words are related to it; however, they are not yet capable of extrapolation or abstract association (i.e., connecting the associations to rules or rule chains).
Nonetheless, the results can be quite impressive in some specific tasks, such as coreference resolution of entities, where some approaches are very accurate (Dobrovolskii, 2021), yet still neither perfect nor near human performance. The current generation of algorithms is learning to master relatively simple tasks; for the next generation, a paradigm change has yet to be developed.
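A crude illustration of the alias problem described above is a hand-built dictionary mapping denominations to one canonical entity. Real named-entity recognition and coreference systems must learn such mappings from context rather than enumerating them, which is precisely why the task is hard.

```python
# Hand-crafted alias table for one entity (an illustration only;
# real systems must learn and disambiguate such aliases automatically).
ALIASES = {
    "barack hussein obama ii": "Barack Obama",
    "barack obama": "Barack Obama",
    "president obama": "Barack Obama",
    "senator obama": "Barack Obama",
    "mr. president": "Barack Obama",  # ambiguous in reality!
}

def resolve_entities(text):
    """Return the canonical entities mentioned in a text."""
    found = set()
    lowered = text.lower()
    for alias, canonical in ALIASES.items():
        if alias in lowered:
            found.add(canonical)
    return found

mentions = resolve_entities(
    "Senator Obama spoke before President Obama took office."
)
```

Two different surface forms resolve to the same entity here, but note the "Mr. President" entry: without context, a lookup table cannot tell which president is meant, which is exactly the limitation of pure pattern matching discussed above.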

Being able to search for entities and for relations between entities can be helpful in many fields, such as chemistry or drug development (contraindications). When performing a literature review, it is equally important to know what the key papers are, what methods were used, how the data were collected, etc. Automatic knowledge extraction could also be used to create surveys of a new task or a specific method. Although creating a database of entities and their different relations is not new, and even constitutes a paradigm in the field of databases (graph databases), it remains very complicated, especially when it comes to resolving conflicts, ambiguities, and evolving relations. Conversely, if a document contains a graph, a text can be created from it automatically (see Benites, Benites, & Anson, “Automated Text Generation and Summarization for Academic Writing”).
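The idea of a database of entities and relations can be sketched as a list of subject-relation-object triples with a simple pattern query. The drug names and relations below are invented placeholders, not real pharmacological facts, and graph databases add indexing, schemas, and inference on top of this bare structure.

```python
# Knowledge graph as (subject, relation, object) triples.
# All facts here are invented placeholders for illustration.
triples = [
    ("drug_x", "treats", "headache"),
    ("drug_x", "contraindicated_with", "drug_y"),
    ("drug_y", "treats", "insomnia"),
    ("drug_x", "mentioned_in", "paper_123"),
]

def query(subject=None, relation=None, obj=None):
    """Return all triples matching the given (partial) pattern;
    None acts as a wildcard."""
    return [t for t in triples
            if (subject is None or t[0] == subject)
            and (relation is None or t[1] == relation)
            and (obj is None or t[2] == obj)]

# Which substances must not be combined with drug_x?
conflicts = [o for _, _, o in query("drug_x", "contraindicated_with")]
```

Once literature knowledge is stored in this shape, questions like "which papers mention a substance with a given contraindication?" become simple graph queries instead of manual reading.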

Still, some information, like cited authors or how certain research objects are dealt with, can be extracted automatically, and this can be applied to hundreds of papers, which makes the writing of research synthesis papers much easier: we can cluster and find similarities and differences much faster. Extracting entities from unstructured data such as texts is usually performed with neural networks trained on news articles. Until recently, this meant that the language model of these algorithms was confined to the so-called “news article” genre. Transformers (Vaswani et al., 2017), especially BERT (Devlin et al., 2018), changed that, since they are trained on a very large corpus covering multiple genres, from Wikipedia articles to books, news articles, and scientific articles, but in an unsupervised manner,Footnote 7 allowing the language model to learn different facets of the language. After this first training phase, the transformer is fine-tuned in a supervised manner to a specific task (e.g., entity recognition in books), where much less data is needed to achieve satisfying results. In that sense, the first step constitutes a pre-training, allowing the actual training to be performed with small amounts of task-specific data and without substantial computational effort.

This method, however, is still pattern matching, although in a much broader context. As a result, certain manipulations and associative relations, such as the triangle inequality, are not accounted for, showing the limitations of these large language models. Some newer approaches try to tackle the problem of semantic relations and logical implications, but many problems must be solved before they can be used; for instance, some language models used for summarization can count from 2 to 5 but then jump to 7, skipping 6 (e.g., the number of ships in an article; Zhang et al., 2020). Other approaches use a graph over the documents to infer relations and central entities, but this is not very reliable, as pointed out earlier.

Thus, although knowledge extraction is a very promising avenue in light of the exploding amount of scientific data being released every day, there is still work to be done before it can be considered a reliable, fully automated solution. At the moment, there is no clear path for injecting the information gained from knowledge extraction into large text generation models (see Benites, Benites, & Anson, “Automated Text Generation and Summarization for Academic Writing”), which could make many mistakes (false facts) avoidable. Combining knowledge graphs and language models is one possibility, since the extracted knowledge can be embedded into a graph over which reasoning can be performed. This would allow checking the content of a sentence against the facts stored in the knowledge graph while writing, thus contributing to faster writing, better citations, etc.

Knowing the entities and relations could also help information retrieval systems, since the connection between synonyms becomes clearer, and reasoning over the search query could also be performed. This could help researchers find what they are looking for faster and even help gather data for statistics. For example, a Google Scholar search shows the number of hits, but it would be good to know whether they all handle the same use case or a method across disciplines, what the time span is, and whether the subjects concern the same or different topics. A survey of papers could also show how many papers use a certain dataset, employ a certain methodology, or refer positively or negatively to a specific term.

3.1 Functional Specifications

Search engines allow us to conduct a literature review or survey much faster and more precisely than 20–30 years ago. More importantly, they also allow us to scavenge social media, a facet that is becoming more important for science: which papers are discussed in the community and why, and are there critical issues that cannot easily be inferred from the paper itself?

However, finding an interesting paper (because it uses a similar methodology) without knowing the specific words it uses still remains a challenging task. Using knowledge graphs of a certain field allows us to find these scattered pieces and put together a more precise and concise image of the state of the art. Although generating such graphs is not trivial either, it could become much easier with automated procedures. Maintaining a certain degree of scepticism towards the results may nonetheless be a good precaution.

3.2 Main Products

Both information retrieval and knowledge extraction belong to the technologies used by scholarly search engines—and hence used by a wide majority of researchers, scientific writers and students, even when they are not aware of them. This is why a succinct overview of current academic search engines can help establish their relevance for academic writing.

CiteSeer (Giles et al., 1998) was an early internet index for research (beginning of 2000s), especially for computer science. It already offered some knowledge extraction in the form of simple parsing such as extraction of headers with title and author, abstract, introduction, citations, citation context and full text. It also had a citation merging function, recognizing when the same article was being cited with a different citation format. For information retrieval, CiteSeer used a combination of TF-IDF, string matchingFootnote 8 and citation network.

Most popular databases and search engines for scientific articles do not disclose their relevance ranking algorithm. We do not know much about the algorithm behind Google Scholar's search engine, only that it uses the citation count in its ranking (Beel & Gipp, 2009). ResearchGate and Academia.edu are newer social networks for the scientific community, both offering to upload and share scholarly publications. This also enables a search engine capability and a recommendation system for papers to read. Springer's SpringerLink is an online service that covers reputable conferences and journals. IEEE Xplore, ACM Digital Library, and Mendeley/Scopus are similar to SpringerLink for the respective publishers IEEE, ACM, and Elsevier.

Martín-Martín et al. (2021) published a comparison of the various popular search engines for academic papers and documents. The study examined the indexes of the most used search engines, such as Google Scholar and Elsevier's Scopus, comparing the coverage of these databases over 3 million citations from 2006. In the discussion, the authors argue that the ranking algorithms are non-transparent and might change the rankings over time. This hinders reproducible results, but as the popularity of papers changes over time, it might also be difficult to argue against it. The authors point out that while Google Scholar and Microsoft Academic have broad coverage, there are more sophisticated search engines such as Scopus and Web of Science (WoS), which, however, mostly cover articles behind paywalls. Further comparisons between Google Scholar, Microsoft Academic, WoS, and Scopus can be found in Rovira et al. (2019), and between Google Scholar and ResearchGate in Thelwall and Kousha (2017). The most relevant finding for academic writing is that Google Scholar attributes great importance to citations, and ResearchGate seems to tap the same data pool as Google Scholar.

3.3 Research on Information Retrieval and Knowledge Extraction

Much research is being conducted in information retrieval and knowledge extraction, especially in light of the recent developments in NLP and big data. New, better-learning language models and broader representation of documents through contrastive learningFootnote 9 will heavily influence the next generation of search engines. One research focus is the field of academic paper recommender systems and academic search engine optimization (Rovira et al., 2019), which will become more and more important, especially given the growing awareness of these search engines among academic writers and the distribution of scholarship to wider audiences. As previously mentioned, the amount of research to be reviewed before writing will increase, and methods for automating the selection will prevail over manual evaluation of certain sources.Footnote 10 For writers, this would optimize the writing process, since the search engine would display their work more prominently.

Other rapidly developing technologies might heavily influence the way we perform searches in the near future. Automatic summarization is getting better and better, leading the way to automatically summarizing a collection of results provided by a search engine and even grouping the documents by topic. This can help easily create a literature overview and even provide an overview of the state of the art, shortening by a large margin the work performed by researchers when writing articles. The most relevant paper for a search can also be highlighted, as well as papers that may contradict its findings.

A further advance is automatic question answering, where an algorithm tries to find the answer to a question within a given text. Such a question answering system can further refine the result list by recommending keywords, by filtering out irrelevant articles, or even by posing questions to the user, helping the user find relevant terms and aspects of the document collection resulting from the search. Lastly, the results can be better visualized as graphs showing clusters and influential concepts for each cluster, thus grasping the essence of the search results. This can help not only to refine the research question when writing but also to find good articles, insights, and ideas for writing.

3.4 Implications of This Technology for Writing Theory and Practice

The way the results are prioritized makes quite an impact, especially since many researchers will not scroll through the exhaustive number of hits of their query to find appropriate papers. If they do not find relevant matches within the first entries, they will most likely rephrase their query. Papers that are highly cited might be more prominently placed in the list, although they might only be a secondary source (such a case occurred in the field of association rules in data mining, where a concept reintroduced in the 1990s, although discovered in the 1960s, became the de facto standard citation). Many concepts are coined almost simultaneously by different authors using different terminologies, and generally only one becomes mainstream, making it difficult to obtain a fair overview of a field with search methods based on TF-IDF and citation count. This might change in the future, as there is progress on structured data in some subfields such as mathematics (Thayaparan et al., 2020), but understanding that two concepts are similar or related requires cognition and understanding, something that algorithms still cannot perform over scientific natural language.

Google’s PageRank (and thus citation counts) was built for the internet. If a group of people finds an internet page interesting, they will link to it, thereby marking interesting sites for the algorithm. However, if something new and relevant but less popular or known emerges, this algorithm might take a while to catch up. Finding early citations is very important to stay current and relevant and to give an article a longer citation span, which impacts the career of a researcher. While it seems that Google Scholar is very good at this (Thelwall & Kousha, 2017), the algorithm still does not know whether the results are truly relevant for your research. This shows the limits of ranking paradigms based on non-academic internet popularity for scientific research, since novelty and relevance are usually more important factors than popularity. From the academic writing point of view, search engines can only take you so far; a good scholarly network and dissemination of research at different venues can help you get to new research findings faster.

4 Tool List

Software: https://scholar.google.com
Access: Free
Specificity: Google's search engine for scientific publications. It offers some features for managing literature, calculates an h-index, and often links the respective references to an article. Does not belong to any specific publisher.
Licensing: Proprietary

Software: https://citeseerx.ist.psu.edu/
Access: Free
Specificity: An academic solution (Pennsylvania State University) for archiving and searching scientific articles.
Licensing: Proprietary

Software: https://ieeexplore.ieee.org/
Access: Paid/free
Specificity: Search engine and archive for papers published by IEEE.
Licensing: Proprietary

Software: https://dl.acm.org/
Access: Paid/free
Specificity: Search engine and archive for papers published by ACM.
Licensing: Proprietary

Software: https://link.springer.com/
Access: Paid/free
Specificity: Search engine and archive for papers published by Springer.
Licensing: Proprietary

Software: https://www.sciencedirect.com/
Access: Paid/free
Specificity: Search engine and archive for papers published by Elsevier.
Licensing: Proprietary