1 Introduction

Scientific research plays an important role in the development of a society. Researchers have always been fascinated by the scientific method and are compelled to ask questions themselves about the phenomena all around. Which led them to perform research and seek answers to their desired questions. Researchers present their research findings in front of a group of like-minded researchers called a scientific community, who also have similar interests. A scientific community can rather be small or large depending on the involvement and interest of the researchers. During this digital era, fast-paced development has resulted in the rapid evolution of the scientific community. With the ever-increasing number of researchers with diverse interests joining large scientific communities, the number of topics is also rapidly increasing within those communities. A scientific community has to promptly adapt to the popular demand and interests of society.

In order to properly adapt to the interests of society, a scientific community needs to have a thorough understanding and awareness of the public interests. For that purpose, an intelligent system is imminent for the analysis of the scientific communities which can efficiently handle and analyze the community data and identify trends that can help make smart decisions for the future of a scientific community. These useful trends are classified into different levels depending on the target and decisions. For example, to decide for the whole community, trends found on a community level will be considered. The analysis of scientific communities also involves identifying small tightly packed sub-groups where all involved researchers only cite one another from the same sub-group and identifying individuals who use self-citations to manipulate their author indexes.

In this paper, we propose a comprehensive system for the analysis of the scientific communities called ACE 2.0 (Academic Community Explorer 2.0) which takes scientific publications as its baseline information source and also collects related information about authors and publications from several external resources for digital profiling of the authors and their publications respectively. ACE 2.0 later performs an analysis of all the collected and extracted data in the context of a scientific community. All individual entities, i.e., authors are identified and separately profiled from each publication. Once the data analysis is completed, all the trends and data are visualized using a visualization engine. We visualize the whole scientific community in the form of a network where each individual is interacting with the other by citing the papers of the others. Our proposed system consists of several modules where each module is responsible for a specific task, i.e., bibliographic reference detection, Keyword detection, topic modeling, etc. For each model, we employed state-of-the-art models for the respective tasks. Additionally, this paper introduces a novel Semantic index that quantifies the contribution and influence of an author in a scientific community. Instead of relying merely on the raw citation count like traditional indexes, the Semantic index considers both qualitative and quantitative aspects of the citations to quantify the influence.

The contributions of this publication are as follows:

  • We present a comprehensive system for the analysis of scientific communities which automatically extracts, consolidates, analyzes data and visualizes the extracted trends using a visualization engine.

  • We present a tool for automatic digital profiling of the authors and publications from a given scientific community.

  • We propose a novel Semantic index that takes both quantitative and qualitative aspects of a citation for determining the influence of a researcher in the scientific community.

The rest of the paper is structured as follows: Section 2 discusses the literature related to several modules of our proposed system. Section 3 discusses the details of the proposed system where each system module is described in detail along with their respective evaluations. It also discusses the feature highlights of our system where features from different levels are elaborated. And lastly, Sect. 4 includes the overall discussion of the proposed system and presents the concluding remarks for this paper.

2 Related works

There are several modules integrated into our proposed pipeline. Each module is responsible for performing a specific task, i.e., Bibliographic Reference Detection, Citation Sentiment Analysis, Keywords Detection, etc. In this section, we will discuss relevant literature for each of the modules of our proposed pipeline.

2.1 Bibliographic reference detection

Reference detection from scientific publications is a popular task in Scientometrics and is mainly an area of interest for library cataloging. Two kinds of solutions approach the task of bibliographic reference detection from different perspectives. The first one is text-based solutions, while the other is layout-based solutions.

Text-based reference detection is the most common solution for reference detection. Several approaches have been proposed that use/consider textual features for identifying references from a given publication. The popularity of text-based approaches is due to the existence of a finite number of referencing styles adopted by the scientific community. The most primitive approaches employed well-defined heuristicsFootnote 1\(^,\)Footnote 2 to identify components of a reference string, i.e., author name, publication title, etc. These components then contribute toward the identification of a reference occurring in a text corpus. On the similar lines, RefParseSautter and Böhm (2012) and BibPro (Chen et al. 2012) proposed using heuristics on the component similarity between reference strings to identify referencing style used in that publication. In addition to standalone tools, librariesFootnote 3 in Perl were presented to parse and extract reference string data from a given text corpus.

PDFSSA4METFootnote 4 suggested converting the PDF of a given publication to intermediate form like XML and then analyzing the structure of the XML for identification of the section containing references. Some approaches like (Ahmed and Afzal 2020) employed a wide range of features like lexical properties, font type, location, neighbors distance, etc. These features are then used to identify and extract reference strings and their metadata.

A novel method of calculating conditional random fields was proposed by Lafferty et al. (2001). Using CRF, sequence data can be systematically labeled by a probabilistic approach. A reference string is identified by identifying the specific parts of the string, such as the authors, the title of the publication, the year, the name of the conference or journal, etc. Labeling such components facilitates the identification of a reference string based on the components that are tagged.

A CRF-based system for extracting and mining bibliographic metadata from references in born-digital PDF scientific articles was proposed by Tkaczyk et al. (2015) as Content ExtRactor and MINEr (CERMINE). There are also different toolsFootnote 5\(^,\)Footnote 6 (Matsuoka et al. 2016) which are based on CRF for identifying and extracting metadata from reference strings. In a work published by Councill et al. (2008), a CRF-based package called ParsCit was presented for its application in relation to the reference metadata tagging problem. It is claimed by Councill et al. (2008) that ParsCit is among the best known and most widely used open-source systems based on heuristics and CRF for reference detection, string parsing, and metadata tagging. As part of the project, Tkaczyk et al. (2018) also proposed a recommendation system based on reference metadata, which integrated 10 of the most popular open-source citation parser tools into one system. A combination of simple heuristic-based and machine learning-based tools was selected as solutions.

To identify references, the literature discussed so far relies exclusively on textual features. The text-based approach overlooks an imperative aspect of layout due to its inability to take advantage of layout features. The potential for detecting bibliographic references using layout information has been explored in very few approaches.

Using layout information,  Bhardwaj et al. (2017) detected references in scanned documents. In order to accomplish this, a Fully Convolutional Neural Network (FCN) was used (Long et al. 2015) to segment the references and then post-processed in order to identify individual references. A layout-based citation detection method was incorporated into Lauscher et al. (2018) to build an open database of citations for libraries for indexing purposes. To detect bibliographic references in scientific publications,  Rizvi et al. (2019) evaluated four state-of-the-art object detection models based on layout information.

2.2 Sentiment classification

Several publications have been published to address the problem of sentiment classification because of its wide range of applications. Using sentiment-specific word embeddings,  Tang et al. (2014) classified tweets based on their sentiment. As a result, highly specialized embeddings can be used to improve sentiment classification performance.  Thongtan and Phienthrakul (2019) employed document embeddings trained with cosine similarity to perform sentiment classification on a movie review dataset. A sentiment classifier based on an ensemble of CNN and LSTM models was proposed by  Cliche (2017) which was fine-tuned on a large database of unlabeled tweets.

In recent years, BERT (Devlin et al. 2018) has gained widespread recognition for its ability to perform a wide variety of Natural Language Processing (NLP) tasks. Several large volumes of unlabeled data were used in training the BERT model. In order to improve the performance of sentiment analysis, recent literature has utilized the BERT model. By combining pre-processing, attention modules, and structural features with a pre-trained BERT model, the authors maximize the performance of the model. In Munikar et al. (2019), Zhou et al. (2016), Xied et al. (2019), the authors adapt a pre-trained BERT model for sentiment classification by combining transfer learning with pre-processing, attention modules, and structural features.

The majority of the literature has been devoted to classifications of sentiments in tweets and movie reviews. The text in scientific publications has a more formal tone in comparison with the text in reviews. Thus, citation sentiment classification differs considerably from review sentiment classification.  Esuli and Sebastiani (2006) have compared sentiment classifications to opinion mining and subjectivity mining. Citations can also be subject to subjectivity due to author preferences and writing styles. This is due to the fact that an author has the option of intentionally making a citation sound positive or negative.  Athar (2011) has carried out experiments to simulate experimental outcomes that involve a wide range of features for sentiment classification in scientific papers, including science lexicon, contextual polarity, dependencies, negation, sentence splitting, and word-level features. As a result of the use of textual features such as n-grams, sentiment lexicons, and structure information,  Xu et al. (2015) conducted a sentiment analysis of citations in clinical trial papers. Considering the lack of datasets suitable for scientific sentiment classification, sentiment classification is imperative in the field of scientific citation analysis. It is the result of the shallow definition of sentiment in this domain that has led to this situation. It is much more challenging to determine a sentiment in an analytical and objective text in contrast to finding a sentiment in high subjective texts, such as Twitter data. Recently, Mercier et al. (2021, 2022) released a clean dataset for citation sentiment analysis and proposed the use of transformers to achieve state-of-the-art performance for citation sentiment classification on different datasets.

2.3 Keywords detection

It is estimated that approximately millions of scientific articles are published in journals around the world every year (Ware and Mabe 2015). A manual search to link large volumes of scholarly publications with appropriate representative keywords would certainly prove to be impractical. Consequently, it is anticipated that a system will be developed that will be capable of automatically analyzing and indexing scientific articles. It is well known that automated keyword detection has been studied extensively; however, most approaches concentrate on social media such as tweets (Beliga 2014; Biswas et al. 2018; Boudin 2018; Carpena et al. 2009; Carretero-Campos et al. 2013; Duari and Bhatnagar 2019; Florescu and Caragea 2017; Mahata et al. 2018; Nikolentzos et al. 2017; Ohsawa et al. 1998; Pay and Lucci 2017; Rabby et al. 2018).

In order to detect keywords in a text, an undirected graph \(G = (N, E)\) is commonly used, in which the nodes represent the individual terms within the text and the edges represent their relationship. In general, the term co-occurrence is the most common type of relationship, which adds an edge to the graph between nodes \(n_1\) and \(n_2\) if both corresponding terms appear within the same sliding window. Depending on the approach selected, the recommended window size is often between 2 and 10 (Litvak et al. 2011; Mihalcea and Tarau 2004; Rousseau and Vazirgiannis 2015). A study by Duari and Bhatnagar (2019) indicates that the size of the window w can have a significant impact on the properties of the resulting graph. When w increases, the density increases as well, resulting in a decrease in the average path length between any two nodes.

It is assumed that the words appearing closer together are related (Rousseau and Vazirgiannis 2015). It is possible to change the sliding window in several ways, for instance, allowing it to slide over individual sentences rather than the entire text, stopping at certain punctuation marks (Litvak et al. 2011). A novel concept referred to as Connectivity Aware Graph (CAG) has been proposed by Duari and Bhatnagar (2019). They use a dynamic window size instead of a fixed window size that spans over two sentences irrespective of their length. In their view, consecutive sentences are considered to be related. They demonstrated that using CAGs instead of graphs constructed using traditional window sizes generally enhances the performance of approaches.

2.4 Authors index

A study by Bollen et al. (2009) discusses science as a gift economy. An author’s value can be defined as the extent to which he contributes to knowledge and the extent to which he influences the ideas of other scientists. As  Hirsch (2005) pointed out, it is needed to have the possibility to quantify this kind of value, among others, for recruitment decisions of universities and the award of grants, especially in a world of limited resources. The increasing costs of research and the shortage of available economic resources lead to a high and increasing interest in scientific author assessment (Costas and Bordons 2008). Additionally, the usefulness of evaluating scientific author impact and author ranking when doing research, in general, should not be underestimated. It offers the possibility for every researcher to easily spot authors heavily contributing to a research field and to discover their publications which might be worth reading when executing research in a specific field. For achieving such an author impact assessment, different indicators are commonly used by many author assessment approaches. On the one hand, there are production indicators which are, for example, the total number of published papers and on the other hand, there are impact indicators which are usually based on the citations received by an author (Alonso et al. 2010). Hirsch consequently states that the large amount of useful information, which is given by the publication record of an individual, can be evaluated with various criteria by several researchers (Hirsch 2005). This leads to the emergence of different author assessment approaches. Each of these approaches can be considered as an attempt to highlight a specific aspect of an author’s publication record that might be of interest when evaluating the author’s importance and contribution to science (Cai et al. 2019). There are huge debates about which of them are the best for assessing the importance and contribution of a scientific author. However, it is widely accepted as a good approach to simply use multiple quantitative measures to support an expert judgment for improving objectivity and fairness in the evaluation process.

3 Proposed system

In this section, we will discuss the details of our proposed system called ACE 2.0. Figure 1 shows the overview of the proposed system. It consists of several modules where each module specializes in a specific task. Each of the modules will be discussed in detail in the following subsections:

3.1 Bibliographic reference extraction

This is the first module of our proposed system and is responsible for the extraction of bibliographic references from given scientific publications. We employed our previously proposed state-of-the-art model called DeepBiRD (Rizvi et al. 2020) for the task of bibliographic reference detection. Details of the selected model are as follows:

3.1.1 Methodology

DeepBiRD (Rizvi et al. 2020) is a layout-based reference detection approach. Typical reference detection approaches use textual features for bibliographic reference detection. However, DeepBiRD makes use of layout features to identify references in a given scanned document image. For this purpose, DeepBiRD compiles a hybrid representation using the input image which is later provided to the network to identify bibliographic references from the given image. Figure 2 shows the overview of the DeepBiRD pipeline.

Fig. 1
figure 1

Overview of ACE 2.0 system

The first step in the DeepBiRD pipeline is to prepare a hybrid representation from the input image. The idea behind compiling a hybrid representation is to make decisive layout features like line and word spacing more prominent. Such layout features play an important role in identifying bibliographic references. Hybrid representation consists of three different components.

The first component of the hybrid representation is the Distance transform in which we consider the distance of each pixel from its nearest input foreground pixel. Using distance transformation of a given image we can introduce additional information highlighting spaces between words, lines, and characters. In our experimental setup, we initially inverted the input image, followed by binarization using OTSU thresholding and inversion of the resultant image. Later, the distance transformation of this image is performed using Euclidean distance as the distance measure with a mask of \(3\times 3\).

The second component of the hybrid representation is a dilated version of the input image. The process of dilation highlights the characters and their vicinity. In order to get a dilated image, we performed OTSU thresholding on the input image with a horizontal kernel size of \(1\times 5\). The purpose of using a horizontal kernel is to highlight the characters and their vicinity in each line.

The third component of the hybrid representation is the binarized input image. Once all three components of the hybrid representation are ready, we use these components to fill three channels of the image characterizing a hybrid representation. The order of the components in the channels of the hybrid representation is set as transform image, binarized image, and dilated image. Text and their surroundings are represented in red color however the spaces between characters and lines are represented in blue color. The resultant hybrid representation is categorically better than the original input image as it has more visual features than its original counterpart.

The second step in the DeepBiRD pipeline involves detecting bibliographic references from a given hybrid representation. For this purpose, DeepBiRD is equipped with a state-of-the-art object detection model called Mask-RCNN which differentiates between detected reference instances in addition to identifying bibliographic references. Mask-RCNN generates a mask and a bounding box for individual detected reference. For training the network, we followed all the settings mentioned in the original paper (Rizvi et al. 2020).

Fig. 2
figure 2

DeepBiRD (Rizvi et al. 2020) pipeline overview

3.1.2 Evaluation and discussion

The performance of DeepBiRD (Rizvi et al. 2020) was evaluated on two datasets named BibX and BibLy dataset. Both datasets are publicly available and are annotated for the task of bibliographic reference detection from scanned scientific publications. The performance of DeepBiRD model is compared with other existing object detection approaches in Table 1. It can be observed that DeepBiRD model outperforms other approaches by a significant margin on both datasets. In BibX dataset, our selected model outperforms the other approaches by an average of \(13.38\%\). Similarly, on BibLy, our selected model also outperforms the previous state-of-the-art on BibX dataset by a huge margin. The reason behind the superior performance of DeepBiRD model is that it uses hybrid representations to assist the detection model in identifying bibliographic references using additional highlighted information. Additionally, the inherent property of the selected model which employs ROIAlign operation that also plays an important role in detecting references to a very fine detailed level.

Table 1 Evaluation results of DeepBiRD (Rizvi et al. 2020) on two datasets compared with other approaches

3.2 Sentiment analysis

In this section, we will discuss the component which is responsible for the sentiment analysis of the citations. We will briefly discuss the architectural details of the selected model and its performance evaluation.

3.2.1 Methodology

For the task of sentiment analysis of citations, we employed another approach we proposed earlier which is an XLNet-based approach called ImpactCite. XLNet (Yang et al. 2019) is an auto-regressive language model which captures the context in both directions using a bi-directional attention mechanism. Such a mechanism helps comprehend the context of a sentence from right to left and left to right and is very effective in longer sentences. The most highlighting feature of XLNet is that in addition to the Transformer-XL (Dai et al. 2019) as a backbone, it uses a permutation generalization approach which helps the model to achieve generalization and state-of-the-art performance in many Natural Language Processing problems. ImpactCite model is suitable for our problem of citation sentiment analysis as the sentences in publications can be longer and their context heavily depends on the preceding and proceeding sentences.

XLNet is generally used in a combination of many different settings with variations in the number of layers and units per layer. In this work, we selected the XLNet-Large model which is very suitable for problems with long sentences and their respective context. We used an XLNet-Large model with 24-layers, 1024 hidden units, and 16 heads. We used a standard pre-trained XLNet-Large model and later fine-tune it for the task of citation sentiment classification. In the experiments, we initially employed a warm-up phase with a fixed learning rate and later we employed learning rate decay in the training phase. Transfer learning played an important role as the dataset available for the citation sentiment analysis was a relatively small dataset. Therefore, the pre-trained network adapts to the new domain and task using scarcely available data.

3.2.2 Evaluation and discussion

Evaluation results of ImpactCite are shown in Table 2. As there was no official dataset split mentioned in Athar (2011), therefore 10-fold cross-validation was performed for the selected models. The results show that CNN, RNN, and LSTM were unable to handle highly imbalanced data and could only learn representations from class with most samples, i.e., Neutral class. However, their performance in classes with low representation in the dataset was significantly worse.

On the other hand, sophisticated language models like BERT, ALBERT & ImpactCite were able to handle class imbalance and were able to learn representations from all classes. The reason for the performance improvement is that all these models were pre-trained and were later fine-tuned for the task of citation sentiment classification. For the sake of completeness, we also included results reported by Athar (2011) on the CSC dataset.

To conclude, ImpactCite outperformed all other approaches by achieving the highest macro-F1. Macro-F1 treats all classes equally irrespective of their number of samples. Contrary to BERT (Devlin et al. 2018) and ALBERT (Lan et al. 2019), ImpactCite was able to deal with the instances in long sentences and their relationships with each other in a given context. Therefore, ImpactCite sets a new state-of-the-art for citation sentiment analysis.

Table 2 Evaluation results of ImpactCite (Mercier et al. 2021) on CSC-C dataset (Mercier et al. 2021) compared with other approaches

3.3 Keywords detection and topic modeling

This module is responsible for the task of detecting keywords from scientific publications and later using those keywords to identify topics. For this purpose, we employed an approach called Collective Connectivity-Aware Node Weight (CoCoNoW) (Beck et al. 2020). The overview of the (CoCoNoW) (Beck et al. 2020) pipeline is shown in Fig. 1. It consists of two stages where keywords are detected in the first stage followed by using those detected keywords in combination with the Computer Science Ontology (Salatino et al. 2018) to perform topic modeling. Details of CoCoNoW (Beck et al. 2020) pipeline are as follows:

3.3.1 Methodology

CoCoNow (Beck et al. 2020) pipeline consists of several phases, where each phase is responsible for a set of tasks. CoCoNow (Beck et al. 2020) takes the title, abstract and full text of a publication as input and performs the standard NLP pre-processing steps. These steps involve, tokenization, parts of part of speech tagging, lemmatization, stemming and filtering the least probable keywords candidates.

Once the pre-processing is complete, a graph is generated using the pre-processed tokens. In the graph, the tokens appearing in two consecutive sentences are connected with an edge between them with their normalized co-occurrence serving as the edge weight.

The node weight is assigned using four criteria which include distance to the most central node, term frequency, occurrence in abstract, and occurrence location in the document. The scores from these criteria are summed together to assign the final node weight. Lastly, the Node and edge rank algorithm is applied on the weighted graph to acquire a list of keywords sorted in descending order of their importance. Eventually, a set of top K tokens is selected as keywords for the input document.

The second stage is responsible for performing topic modeling on the extracted keywords from the first stage. To perform topic modeling, we employed the Computer Science Ontology (CSO)Footnote 7 consisting of 23,800 nodes and 162,121 edges. The topic scores are assigned using the Levenshtein distance between each token or its synonym with a topic in the ontology. Lastly, the topic scores are normalized between 0 and 1.

Table 3 Performance comparison of CoCoNoW (Beck et al. 2020) on the different datasets

3.3.2 Evaluation and discussion

This section discusses the evaluation of both stages of the selected approach. The first stage was responsible for keyword detection. It was evaluated on three datasets. Each approach in the literature uses a different value of k which represented the number of returned keywords. CoCoNow was also evaluated on different values of k. Table 3 shows the comparative evaluation of CoCoNoW with other approaches, it can be seen that CoCoNoW consistently outperforms all the approaches on all three datasets for a different number of k keywords.

The second stage of CoCoNoW was evaluated on ICDAR proceedings due to the lack of available ground truth. A hypothesis was developed for the evaluation of topic modeling which states that topics with similar topics more often tend to cite each other. Figure 3a, b show the heatmaps representing citations among two sample topics, i.e., character recognition and pattern recognition respectively. It can be observed that the darkest colors on the diagonal axis support the evaluation hypothesis, therefore, suggesting that the topic assignment is correct.

Fig. 3
figure 3

Citation count for different super topics (Beck et al. 2020)

3.4 Semantic index

In this section, we will introduce the formulation of our proposed novel Semantic index scores to assess the influence of the researchers in a scientific community. The purpose of the Semantic index is to assign a representative score to individual authors which depicts the extent of their contribution and its acceptance in the scientific community.

Generally, author indexes take into account the quantitative aspect of citations, i.e., number of citations received by publications. We propose a novel Semantic index that considers the nature of individual citations in addition to citation count, therefore, enabling us to integrate the qualitative aspect of citations in the Semantic index. In this work, we consider two qualitative aspects of citations namely citation sentiment and self-citation. The motivation behind this selection is fairly intuitive as we propose that not all the citations are equal, the first factor which sets apart one citation from the other is whether a paper is cited positively or negatively, i.e., appreciating and using the proposed approach or highlighting shortcomings of a research work respectively. In Semantic index, we only consider the citations which have a positive sentiment as those citations represent the appreciation and support of the scientific community for research work. For this purpose, we estimate the citation sentiment by using the approach mentioned in Sect. 3.2. The second factor which affects the quality of citation is whether an author is citing their own papers which is synonymous with a famous English idiom “Self-praise is no recommendation”. Therefore, any citation which is an instance of self-citation is not considered during the estimation of the Semantic index for an author. The resultant number of citations will be referred to as \(N_{positive}\) in this paper.

In addition to the above-mentioned qualitative aspects, we also consider the multi-faceted community interactions of an author to effectively evaluate their position in a scientific community. These multi-faceted interactions are represented by different centrality measures. In graph theory, a centrality measure is used to rank nodes based on their position in the graph. To estimate these centrality measures, we use two types of graph networks one is the author citation network and the other is the author collaboration network.

For this purpose, we construct an author citation network by representing all citations extracted from the publications in the form of a directed graph where each node represents an author of a publication, and the relation between two nodes highlights citations pointing in the direction of cited author. On the other hand, the author’s collaboration network depicts the collaboration among the authors in a scientific community. It is constructed using the information extracted from the header of a publication, where each author is represented by a node and a non-directed relation between two nodes represents collaboration in a publication. Once both author citation and collaboration networks are ready, we can now estimate the value of different centrality indicators for the Semantic index of an author. The description of the selected centrality measures is as follows:

  1. 1.

    Degree centrality It represents the extent to which a node is connected in a network.

  2. 2.

    Eigenvector centrality It quantifies the transitive influence of a node on its neighboring nodes.

  3. 3.

    Betweenness centrality It measures the influence and control of a node on the flow of information.

  4. 4.

    Closeness centrality It estimates the extent to which a node can efficiently spread information.

  5. 5.

    Indegree centrality It represents the number of incoming connections, i.e., citations from other nodes.

Each of these centrality measures refers to a specific role in the scientific community which would be discussed in Sect. 3.10.2. After computing all the centrality measures for every node in the network, we prepare the weighted centralities by taking a product of non-self-cited positive citations \(N_{positive}\) with the sum of all centrality scores to finally compute the Semantic index value for each node in the network. The proposed Semantic index can be formulated as follows:

$$\begin{aligned} \mathrm {{SIndex_n}} = \log (({c_\textrm{deg}} + {c_\textrm{eig}} + {c_\textrm{bet}} + {c_\textrm{clo}} + {c_\textrm{ind}} ) \times {N_\textrm{positive}}) \end{aligned}$$
(1)

where \(N_\textrm{positive}\) represents the total number of positive citations received by an author. It is to be noted that \(N_\textrm{positive}\) does not include any self-citation. \(c_\textrm{deg}\), \(c_\textrm{eig}\), \(c_\textrm{bet}\), \(c_\textrm{clo}\), and \(c_\textrm{ind}\) represent the Degree, Eigenvector, Betweenness, closeness, and indegree centralities respectively. By default, all the centrality measures contribute equally toward estimating the value of the Semantic index. Additionally, the effect of increasing or decreasing the contribution of each centrality measure can be later visualized in the Sect. 3.10.2.

3.4.1 Evaluation and discussion

We evaluated the Semantic index by inspecting its compliance with the Declaration on Research Assessment (DORA) guidelines which were developed in 2012 as a result of the Annual Meeting of the American Society for Cell Biology in San Francisco. These guidelines were further developed and refined over a period of time and are currently been actively maintained. This initiative provides instructions and best practices to all researchers, organizations, funding agencies, and scientific communities to assess the quality of scholarly research. So far, there are 21,729 participants and organizations from 158 countries who have already signed the DORA declaration. For our comparative evaluation, we selected the 5 most widely adopted author indexes, i.e., h-index, g-index, i10-index, Eigenfactor, and Impact factor.

Now, we will discuss the core aspects of the DORA guidelines and their compliance in the case of the Semantic index and some most common author indexes. Table 4 provides a quick overview of this comparative compliance with DORA guidelines while their details are discussed as follows:

Suitability for quality evaluation This aspect represents the overall purpose of an author index, which is to evaluate the quality of the research work performed by a researcher. However, existing author indexes have certain limitations along with their advantages. The limitations of the h-index, g-index, i10-index, and Eigenfactor deem them not suitable for evaluating the quality of research work as they heavily rely on a subset of publications for assessment. For instance, any publication with citations less than the h-index of an author will be discarded. Moreover, the impact factor was initially introduced to help librarians to facilitate in deciding which potential journal volume they should buy for their libraries, and now the scientific community seems to measure the quality of a journal using the impact factor. Such affairs make the existing indexes unsuitable for accessing the quality of research work. However, our proposed Semantic index considers the qualitative aspect of a publication, i.e., citation sentiment to justify its suitability to serve as a tool for assessing the quality of research work.

Circumstances The second aspect that DORA deems important for evaluating the impact of research is the consideration of individual circumstances. For instance, a researcher who joined recently would have less time to get citations as compared to the long publishing old researchers. All h-index, g-index, i10-index, Eigenfactor, and Impact factor do not take into account such individual circumstances of an author and usually take a very long time to gradually increase the score of these indexes. On the other hand, the Semantic index considers all publications irrespective of their number of citations and hence provides a consistent increase in score upon receiving new citations.

Content oriented Another aspect of impact assessment of research work is to consider the content of the publications while performing the assessment. As already mentioned, the h-index, g-index, i10-index, Eigenfactor, and Impact factor only consider raw citation count to estimate the impact of research work and all these indexes do not consider the content of the publications. Citation count is a superficial feature, as it does not covey any information about the quality of a citation, i.e., if a publication is cited positively, negatively, or neutrally. Contrary to this, the Semantic index takes into account the content of the publications to identify the sentiment behind a citation so that it can be given credit accordingly.

Transparency and reliability Transparency and Reliability of indexes are the key aspects of analyzing the quality of research work. Existing indexes are somewhat transparent as we know that based on what data it was estimated. However, they have limited reliability as the publications with a low number of citations are completely overlooked which might be crucial in some domains, i.e., Medicine. In contrast, for estimating the Semantic index, data are collected directly from the publications, therefore it is much more transparent and reliable. Additionally, it is generalizable as it considers all publications with any number of citations thus making it suitable for any domain.

Table 4 Aspects covering DORA guidelines for evaluating research impact

3.4.2 Limitations of existing indexes

In this section, we will compare the limitations of existing indexes with the Semantic index. Similar to the previous section, we have selected the most widely used indexes for comparative analysis, i.e., h-index, g-index, i10-index, Eigenfactor, and Impact factor. Following are the key limitations of all the existing indexes:

Self citations Self-citation is the most common challenge when assessing the impact of a research profile. Indexes like h-index, g-index, i10-index, and Impact factor do not handle self-citations and continue to consider them during the estimation of their index values. However, unlike other indexes, Eigenfactor and the Semantic index discard the self-citations as they are not considered to be a part of the actual impact of a research publication.

Lack of fairness Indexes like the h-index and g-index favor the old researchers who have been publishing for a while to sustain their index score as their publications received many citations over years or decades. For the new researchers, they have to wait for years or decades to accumulate a high number of citations and eventually reach the same level as old researchers. On the other hand, the i10-index, Eigenfactor, and Semantic index consider the citations from all papers, and their values start increasing with the growing number of citations. Therefore, it does not require a lot of time to build up, hence supporting new and existing researchers relatively fairly and equally.

Quantity versus quality In the context of publishing, the scientific community has two informal popular schools of thought. One focuses more on the number of publications and the other one emphasizes more on the quality of the publications. This results in the cases where researchers have either high volume and low quality or low volume and high quality respectively. The selected indexes are ineptly not able to handle such cases as either of these cases affect the index score of the h-index, g-index, i10-index, Eigenfactor, and Impact factor. Contrary to popular indexes, the Semantic index can well handle the delicate balance between Quantity and Quality. Since the Semantic index favors all publications equally and therefore can handle both of these cases.

Coverage Some indexes, i.e., h-index and g-index employ components like h-core and g-core which results in partial coverage of publications including citations. These limited components restrict the scope of insights provided by the index scores and hence provide an incomplete picture of a researcher in the community. On the other hand, Eigenfactor, Impact factor, and Semantic index are independent of typical h-core or g-core components to estimate the index score for a given researcher. Therefore, the limitations associated with h-core or g-core are irrelevant for these indexes.

Meaningfulness Author indexes like h-index, g-index, i10-index, and Eigenfactor are meaningful to some extent as they attempt to highlight the importance of notable authors in the scientific community. However, they severely lack evaluation of the quality of the work due to the several limitations mentioned above. This results in partial meaningfulness of these indexes. On the contrary, the Semantic index not only analyzes all available citations but also the quality of each citation by analyzing the sentiment of each citation. The Semantic index has the same granularity as most popular indices like the h-index. However, the Semantic index is more comprehensive and is, therefore, more meaningful.

3.5 Consolidation of bibliographic data

Data extracted by the Data Extraction modules in Stage 1 is consolidated into a common data storage. For this purpose, ACE employs MongoDB as the central storage where all the data are collected and secured. One of the challenges faced during the consolidation of data was to precisely identify each author and correctly assign the respective publications to the right author. This phenomenon is known as Author name disambiguation. One of the reasons which gave rise to this challenge is the use of abbreviated names in the reference section of the publications. For instance, there are two persons with names Anthony Davidson and Andrew Davidson. Both persons write their short name as A. Davidson. In a scenario where we only see the shortened name, it is very challenging to identify which specific person is being referred to in this name. Another challenge could appear if there is an error in the extracted text. The text from detected bibliographic references is extracted by performing Optical Character Recognition (OCR). There is a possibility for the introduction of an OCR error in the extracted text. Especially, in the case of name initials, any misclassification can lead to an entirely different name for a person.

To tackle the challenges in author name disambiguation and ensure the quality of data, we employed a set of external resources, i.e., Crossref and Semantic Scholar to validate the accuracy of extracted data. Both of these external resources have a huge collection of bibliographic data and have specified interfaces through which we can query data about a specific publication. Any error or disambiguation in names that arise during the data extraction phase is eradicated by verifying it with author names collected from the external resources. It is worth mentioning that the author name disambiguation is not performed solely on the data received from external resources. The external resources provide some metadata attributes in addition to those extracted from the scientific publication. The attribute values from external resources, i.e., affiliation, aliases, etc., are used to validate the identity of an author. Once the author name disambiguation is complete, all the attributes from different resources are consolidated. The data are now ready for further processing.

3.6 Computation of statistics

This module deals with analyzing the consolidated data for estimation of certain author level and community-level statistics. These statistics are crucial in identifying the author and community-level trends in different aspects.

Although, there are numerous fine-grained statistical figures extracted from the consolidated data, i.e., number of publications or citations for an author, etc. However, we will only discuss the most prominent statistical figures estimated in this module. One of such figures is the generation of co-authors graph. We represented all authors as nodes such that all co-authors have a unidirectional link with one another. Once the co-authorship graph for a whole community is ready, then we apply Girwan Newmann clustering (Girvan and Newman 2002) on the co-authorship graph. It results in clusters of co-authors.

On the other hand, we also generate a community network graph. Given the consolidated data, we take authors in a community and represent them as nodes in a community network graph. Additionally, we employed the citation data to draw links between network nodes. So the resultant graph is a citation network graph. To incorporate more information in the citation network graph we color-coded the nodes based on their co-authorship cluster.

3.7 Visualization engine

Once all the statistics have been successfully estimated, the final data are delivered to the visualization engine, which uses the given data and visualizes in more than one way to highlight different trends. There are different visualizations with a granularity that spans over three levels. The highest level contains the visualization representing the domain-level insights, followed by less detailed visualization representing the community-level insights, and finally the author-level insights. The visualizations for each of these granularity levels are described below.

3.8 Domain-level insights

As the name suggests, these visualizations represent insights from the domain level. For the proof of concept, we selected the domain of Document Analysis as our sample domain. Figure 4 shows overall statistics of the Document Analysis domain. The visualization shows three communities in the selected domain which have a total of 6255 authors who contributed 4638 publications with 65775 incoming and outgoing citations. However, the graph shown below the statistics depicts the interaction between communities. Each node in the visualization represents a community in the domain and the size of each node represents the number of citations it received. If the publications in a community receive relatively more citations, then their size will be bigger than other communities. The links between nodes represent the citation relation between two communities. The direction of the link represents the citation direction and the number on the link shows the number of times the publications of the target community were cited by the other. Self-citations are also shown in the graph which is a key indicator to understanding the popularity of the publications within the community as well as in the other communities.

Fig. 4
figure 4

Community interactions in the domain of document analysis

Fig. 5
figure 5

Different interaction patterns of an author in a scientific community

The next visualization in Fig. 5 shows the distribution of top publications and authors among all three communities. Figure 5a, b show the distribution of top 100 authors in all communities ranked in the order of the number of citations received and publications respectively. However, Fig. 5c shows the distribution of the top 100 authors with the highest Semantic index score in all communities. Lastly, Fig. 6a, b show the top topics with most citations and contributions among all communities. The size of the topic refers to the number of citations or contributions received by that topic.

Fig. 6
figure 6

Topic popularity in the domain of document analysis

3.9 Community-level insights

This section describes the community-level insights which are more abstract than author-level highlights, however, they are more precise about the overall community. Such highlights play an important role in policing and paving the future path of a scientific community.

3.9.1 Community highlights

Overall community highlights give us a quick insight into community-level statistics. Figure 7 shows an example of different visualizations related to a community. A red bar chart shows the number of authors who participated in a proceeding year. It allows us to gauge if the number of participants is increasing or decreasing over time. On the other hand, the number of submissions received each year is represented in the orange bar chart. We also visualize the number of citations received by a publication by each proceeding year with the blue chart. It provides an immediate insight into which years are more popular within the community. From the example given, it is clear that the proceedings of the year 2011 are the most popular and have received the highest number of citations. Furthermore, we also visualize participation by country to see which countries are contributing the most to this academic community. It can be observed that the United States has the most contributions among all countries in the given scientific community.

Fig. 7
figure 7

Example of overall community highlights

3.9.2 Community as a network

This section describes an important community-level feature highlight where the whole community is represented in the form of a community network graph. Figure 8 shows the example of an academic community, where each node represents an author. Authors are connected with links among them. Each link between two nodes represents a citation relationship. The color of the nodes represents the collaboration groups in the scientific community. Author nodes with the same color tend to collaborate in their research work. The network graph can be filtered using the citation count threshold. It will filter the graph and only shows the authors having the specified number of overall citations or more.

Fig. 8
figure 8

Community as a network

3.9.3 Topic evolution and its trends

This section describes the visualization related to topic evolution. Figure 9 shows a dynamic visualization of the evolution of the topics over a while. It represents topics in the form of bubbles. The year slider on the top middle of the chart can be moved to see the effect on individual topic bubbles. When you move the slider, a topic bubble might become larger or smaller, representing its popularity in the selected year. Figure 10 shows an example of two selected topics along with the number of contributions for these topics over the years. The topic “Handwritten Document” and “Web Content” are represented in brown and blue colors respectively. Solid lines represent the number of scientific publications submitted on a specific topic. On the other hand, dotted lines represent the number of authors who contributed to a specific topic. It is quite evident that both topics started with minimal interest at the start. With time, the topic of ’Handwritten Documents’ became increasingly popular within the scientific community. Contrary to this, the topic of “Web Content” gradually increased in popularity. However, after the year 2007 scientific community lost interest in this topic. Such trends help us understand the community’s interests and make decisions regarding the future direction of the community.

Fig. 9
figure 9

Dynamic representation of topic evolution

Fig. 10
figure 10

Visualizing topic evolution of selected topics

3.9.4 Authors overview

The table shown in Fig. 11 shows different statistics for all authors in a scientific community. These statistics range from simple citation or publication count to complex Semantic index values. The authors can be sorted with respect to any measure in the table by clicking the title of the measure of interest. It provides a quick quantitative comparison between different authors in a single glimpse.

Fig. 11
figure 11

Authors overview

3.10 Author-level insights

This section discusses the visualizations designed for displaying the author-level statistics. Some key author-level visualizations are as follows:

3.10.1 Author statistics

Figure 12 shows how the basic statistics of a specific author are displayed in the author’s profile. The bars on the top show the citation count, the number of publications, and the Semantic index score of a given author. The extent to which the bars are filled represents their percentile. On the bottom, we can see the top 5 topics on which this author is continuing their research.

Fig. 12
figure 12

Author statistics overview

3.10.2 Community roles

This section discusses the roles of an author in an academic community. In this work, we consider each author for five different roles. These roles are represented by different centrality measures discussed in Sect. 3.4. Different roles and their description are as follows:

  • Collaborator Collaborates more often with members of the community. It is represented by the Degree centrality of the author.

  • Idea generator Highly influential individual, who brings new ideas which are widely accepted by the community. It is represented by the Eigenvector centrality of the author.

  • Community connector Diversely publishes with different cliques in the community. It is represented by the Betweenness centrality of the author.

  • Opinion leader Holds a strong network and is capable of influencing an opinion about a trend in the community. It is represented by the Closeness centrality of the author.

  • Contribution influencer Dominates the community with their important scientific contributions. It is represented by the Indegree Centrality of the author.

Fig. 13
figure 13

Visualization of an author’s roles in a community

Fig. 14
figure 14

Roles contribution sliders

Figure 13 represents a visualization example of an author’s role in the community. It can be noticed that in this specific example the person in the discussion is more of a contribution influencer as compared to any other role in the community. Figure 14 shows a set of sliders for each role that can slide to increase or decrease the extent to which they contribute toward estimating the Semantic index.

3.10.3 Citation sentiment analysis

In this section, we will discuss the system’s features related to the citation sentiment of an author in a scientific community. Figure 15 shows an example visualization for the overall citation sentiment of an author. A doughnut chart is used for this visualization. Positive, negative, and neutral citation sentiments are represented in green, red, and gray colors. It can be seen that this specific author has mostly received neutral citations. However, positive citations also have a fair share in total citations which shows that the scientific contributions by the author in the discussion have been fairly accepted and appreciated in the academic community. Figure 16 shows a list of publications of a given author. The last column on the right shows the citation sentiment of all the citations received by each publication.

Fig. 15
figure 15

Visualization of citation sentiment all the author’s citations in a community

Fig. 16
figure 16

Publications list with sentiment

3.10.4 Author citation patterns

This section discusses the citation pattern of individual authors. Figure 17a shows a visualization example of a radar chart representing the citation pattern of an author. The visualization shows the top 10 authors who were cited by the author in the discussion and the number of times they were cited is represented in green color. However, the data points in red color represent how many times those authors cited back the author in the discussion. With this visualization, we can instantly realize the citation interaction of an author with the other community members. In the given example we can see that the author in the discussion cited himself more than any other else in the community.

Fig. 17
figure 17

Different interaction patterns of an author in a scientific community

3.10.5 Author collaboration

This section presents the collaboration pattern of an author. Figure 17b shows an example collaboration visualization of an author. Each data point shows the number of times this author collaborated with other researchers in the academic community and is represented in blue color. This visualization in combination with citation pattern visualization can uncover even more patterns. For example, in the given examples, we can see that the current author tends not to cite his second most common collaborator very often.

3.10.6 Customized author network

Our proposed tool also includes a customized network graph on each author’s profile page. Figure 18 shows an example of a custom network of an author. Where the author in the discussion is at the center of the network graph. Each node represents other authors in the academic community who either cited or were cited by the current author. With this visualization, we can get a quick insight into the extent of networking of an author in the academic community.

Fig. 18
figure 18

Visualization of author’s custom network in a community

4 Conclusion

This paper proposes ACE 2.0, a comprehensive tool for analyzing and obtaining instant insights about scientific communities. An integrated pipeline is characterized by modules that are each responsible for performing a specific task. Each module of the system utilized state-of-the-art models to detect bibliographic references, analyze sentiment, and identify keywords. Furthermore, we proposed a Semantic index that quantifies the influence of a researcher in a scientific community. To accomplish this goal, the Semantic index has taken into consideration both quantitative and qualitative aspects of citations, as well as the impact of various roles played by researchers within the scientific community. Its generic nature and robustness make it an ideal index to mitigate the challenges faced by other indexes. Additionally, ACE 2.0 is capable of automatic digital profiling of all the researchers in the community and is equipped with a diverse visualization engine. With the help of different examples discussed in this paper, we got instant community insights. Using these insights, we were able to identify citation patterns of authors, and interests of the scientific community, which can help us in policing and planning the future direction of the community.