General process
The goal is the realization of an exploratory visualization approach that enables the identification of trends and their probable future potential in a graphical manner. The main feature is the inclusion of user interaction in the visual structures, which enables viewing the data from different perspectives. The user should be able to grasp the overall trend evolution and examine it from different perspectives (e.g., geographical or temporal). Thus, the focus lies on answering the questions [40] toward the analysis of potential technological trends: “(1) when have technologies or topics emerged and when were they established, (2) who are the key players, (3) where are the key players and key locations, (4) what are the core topics (5) how will the technologies or topics probably evolve, and (6) which technologies or topics are relevant for a certain enterprise or application area?” [36, 40] The questions were essentially introduced by Marchionini [30] for exploratory search. However, we expanded the question space and adapted the questions to the specific characteristics of technology and innovation management, which includes early trend detection. The analysis based on these questions provides an overview of the core topics that are currently relevant, along with navigation through the different perspectives, a result analysis, and probable reasons for evolving trends. Our approach was developed based on these requirements with the following steps (see Fig. 1).
In this section, we explain the processing exemplarily based on the DBLP database. We chose the DBLP indexing database because it does not provide any abstracts or full texts, which makes the data gathering process more difficult. DBLP is a research paper index for computer science and related research. Commonly, patents or web news are considered to identify technological trends, but both have the downside that they often indicate new technologies only when these are already market-ready. Patents in particular are well established for this purpose, but patent registrations usually take between one and two years, which also delays trends identified from patent data. Web news usually appear when a solution is already market-ready. In contrast to these sources, research publications usually introduce new technologies in an early prototype stage, so that at the time of detection there is still enough time to react to those developments.
Data indexing
We use an indexing database on the server side. The initial data source for the transformation process is DBLP, which provides rudimentary metadata in the area of computer science and related areas. Each document can have a unique identifier, the Digital Object Identifier (DOI); through these identifiers, the data entities can be identified and enriched from several additional data sources [36, 40]. The initial data is stored in a relational database. For the identification of trends, especially emerging ones, we use the enriched data (abstracts and full-text articles) to extract and model topics. The enriched data and the identified trends data model are stored in the database after the extraction process. According to Card et al. [7], the data models build the foundation for choosing visual structures and enable, in the last step, either choosing an appropriate interactive visualization or a juxtaposed visual dashboard for the initially mentioned tasks and questions [36].
For this, each DOI is sent to all publishers, e.g., “ACM DL”, “IEEE Xplore”, “Springer”, etc. If a certain document has no DOI, the title of the document in combination with the authors’ names is used to identify the document on the web.
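As an illustration of such a DOI-based lookup, the following minimal Python sketch queries the public CrossRef REST API for a single DOI. The helper function, the selected fields, and the placeholder DOI are illustrative assumptions and do not reflect the exact implementation of our system.

```python
import requests

def fetch_crossref_metadata(doi: str) -> dict:
    """Retrieve basic metadata for a single DOI from the public CrossRef REST API."""
    response = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    response.raise_for_status()
    message = response.json()["message"]
    return {
        "title": (message.get("title") or [None])[0],
        "abstract": message.get("abstract"),                      # not always available
        "citation_count": message.get("is-referenced-by-count"),  # usable for relevance ranking
        "authors": [f"{a.get('given', '')} {a.get('family', '')}".strip()
                    for a in message.get("author", [])],
    }

# Usage (placeholder DOI): metadata = fetch_crossref_metadata("10.xxxx/example-doi")
```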
Data enrichment
A proper analysis requires sufficient data quality. As the first step of the transformation, we use Data Enrichment techniques to gather additional data from other web sources and enhance data quality. The data collection used as a basis is a combination of multiple datasets. The individual datasets considered offer varying quality and a varying amount of available meta-information. As the initial dataset, we use DBLP with about six million entries [36]. The DBLP entries do not contain any text (e.g., abstracts or full-text articles). Since topic modeling is only possible with appropriate text documents, we compensate for this limitation of the original DBLP dataset by augmenting each publication entry with additional information [36].
To enrich the data, the system determines where the particular data resources are located on the web and where further information about a certain publication is available. For this, each DOI is sent to all publishers, for example, “ACM DL”, “IEEE Xplore”, “Springer”, or “CrossRef”. If a certain document has no DOI, the title of the document in combination with the authors’ names is used to identify the document on the web. The information for the publication enrichment can be gathered either through a web service or through crawling techniques. The results of a web service are well structured and commonly contain all required information, whereas crawling techniques require conformance with robot policies and the results have to be normalized. However, it should be considered that the retrieved data may contain duplicates, missing values, or faulty data. Hence, common data cleansing techniques are applied [36]. As a result of this step, we further enrich the DBLP data with metadata, such as abstracts and full-text articles from the publishers and citation information from CrossRef, which allows identifying the most relevant papers in a field with regard to citation count. A detailed description of the data enrichment would go beyond the scope of this paper. The process is described in a replicable way in [39].
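A minimal sketch of such a cleansing step is given below; it merges records from different sources by DOI or, if the DOI is missing, by a normalized title. It is an illustrative simplification, not the cleansing pipeline described in [36, 39].

```python
import re

def normalize_title(title: str) -> str:
    """Lowercase and strip punctuation so near-identical titles collide on one key."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()

def merge_records(records: list[dict]) -> list[dict]:
    """Collapse duplicate publication records gathered from different sources.

    Records sharing a DOI (or, without a DOI, a normalized title) are merged;
    later sources only fill fields that are still empty in the merged record.
    """
    merged: dict[str, dict] = {}
    for record in records:
        key = record.get("doi") or normalize_title(record.get("title", ""))
        if key not in merged:
            merged[key] = dict(record)
            continue
        for field, value in record.items():
            if not merged[key].get(field):
                merged[key][field] = value
    return list(merged.values())
```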
Topic modelling
In the previous processing stage, we gathered abstracts for the majority of DBLP entries and open-access full texts for some entries from public sources such as “CEUR-WS” or the “Springer” database. We are now able to perform information extraction from text to generate topics based on the previously enriched data. For topic classification, learned probabilistic topic models are applied. As studies have shown, this approach is a viable alternative to subject heading systems and can even outperform them when evaluating similarities between documents clustered by both systems [42]. Consequently, we implemented the Latent Dirichlet Allocation (LDA) algorithm [3]. The main advantage of LDA is its fully automatic topic classification and assignment capability. The classification is performed consistently on all publications, based on classifiers that control the assignment of all topics. The accuracy of the resulting model strongly depends on the number of documents. After the processing, each document is assigned to one or multiple topics. A topic is typically represented by the top 20 words used within the topic. In addition to the uni-grams, we also generate the most frequently used phrases for each topic in the form of N-grams (similar to [54]). In the current setting [52], we generate 500 topics with 20 words and 20 phrases through 4000 iterations of the LDA algorithm. During the analysis, the topics can be used to filter search results in the form of facets. Further, they are used to create the Topic Model. We evaluated Latent Semantic Indexing (LSI) and LDA with and without lemmatization for abstracts and full texts and found that, in the majority of cases, the generation of topics with LDA and without lemmatization provides good coherency [39].
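For illustration, a topic model with these parameters could be trained as sketched below using the gensim library; the library choice, the variable names, and the toy corpus are assumptions, since the concrete implementation is not prescribed here.

```python
from gensim import corpora
from gensim.models import LdaModel, Phrases

# Placeholder corpus: in the real pipeline each entry would be the tokenized
# abstract or full text of one enriched DBLP publication.
documents = [
    ["neural", "network", "image", "classification"],
    ["topic", "model", "latent", "dirichlet", "allocation"],
    ["information", "visualization", "interactive", "dashboard"],
]

# Detect frequent bi-gram phrases, similar in spirit to the N-gram phrases above.
bigrams = Phrases(documents, min_count=1, threshold=1)
documents = [bigrams[doc] for doc in documents]

dictionary = corpora.Dictionary(documents)
corpus = [dictionary.doc2bow(doc) for doc in documents]

# Parameters as reported in the text: 500 topics and 4000 iterations
# (a corpus of millions of abstracts is assumed for meaningful results).
lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=500, iterations=4000)

top_words = lda.show_topic(0, topn=20)            # top 20 words of topic 0
doc_topics = lda.get_document_topics(corpus[0])   # topic mixture of document 0
```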
Trend identification
The topic modeling uses the Latent Dirichlet Allocation (LDA) proposed by Blei et al. [3] and is the foundation for identifying trends in the analysis process. However, if we tried to identify trends based on the topics’ raw frequency over the years, the retrieved trends would not be appropriate: the steadily increasing absolute numbers of nearly all topics over the years would make any topic look like a trending topic. Furthermore, the number of publications has increased dramatically in recent years. This is illustrated in Table 1, which shows the absolute number of publications in four-year intervals.
Table 1 Number of publications in DBLP [36]

Therefore, the normalization of topic frequencies is the first step to obtain the real trends over time. Hence, we calculate the normalized number of documents containing a topic for each year. Let \(d_{y}\) be the total number of documents in a year y, and \(t_{y}\) the number of documents in year y that contain a certain topic t [36]. Then, \(\tilde{t}_{y}\) is the normalized topic frequency in the given year y, and is computed as
$$ \tilde{t}_{y} = \frac{t_{y}}{d_{y}} $$
(1)
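A direct implementation of (1) is straightforward, as sketched below; the counter-style inputs and the function name are illustrative assumptions.

```python
def normalized_topic_frequency(docs_per_year: dict, topic_docs_per_year: dict) -> dict:
    """Equation (1): share of the documents of year y that contain the topic."""
    return {year: topic_docs_per_year.get(year, 0) / total
            for year, total in docs_per_year.items() if total > 0}

# Hypothetical example: docs_per_year = {2018: 300000, 2019: 320000},
#                       topic_docs_per_year = {2018: 1200, 2019: 1800}
```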
After obtaining the normalized frequency of documents containing the topic, the entire range of years with a value of \(\tilde{t}\) is split into periods of a fixed length x > 1, limiting the length of the earliest period to the time of the topic’s first occurrence, if necessary. So, at the current year \(y_{c}\), each period \(p_{k}\) covers the years \([y_{c} - x \cdot (k + 1) + 1, y_{c} - x \cdot k]\). For example, in the year 2019, for x = 5, we have the periods \(p_{0} = [2015, \dots , 2019]\), \(p_{1} = [2010, \dots , 2014]\), \(p_{2} = [2005, \dots , 2009]\), up to the period in which the topic appeared for the first time [36].
For each period, we calculate the regression of the normalized topic frequencies and take the gradient (slope) as an indicator for the trend. Equation (2) calculates the slope for a topic t in a period \(p_{k}\), based on the normalized topic frequencies \(\tilde{t}_{y}\), where \(\bar{t}\) is the mean of the normalized topic frequencies and \(\bar{y}\) is the mean of the years in the time period [36].
$$ b_{\tilde{t},k}= \frac{{\sum}_{y \in p_{k}} (y-\bar{y}) \cdot (\tilde{t}_{y} - \bar{t})} {{\sum}_{y \in p_{k}} (y-\bar{y})^{2}} $$
(2)
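The period construction and the slope of (2) can be computed as in the following sketch, which builds on the normalized frequencies from above; the helper names are our own and purely illustrative.

```python
def split_into_periods(current_year: int, first_year: int, x: int = 5) -> list[range]:
    """Periods p0, p1, ... of length x, newest first, back to the topic's first occurrence."""
    periods, k = [], 0
    while current_year - x * k >= first_year:
        start = max(first_year, current_year - x * (k + 1) + 1)
        periods.append(range(start, current_year - x * k + 1))
        k += 1
    return periods

def period_slope(norm_freq: dict, period: range) -> float:
    """Equation (2): least-squares slope of the normalized topic frequency in one period."""
    years = [y for y in period if y in norm_freq]
    if len(years) < 2:
        return 0.0
    y_mean = sum(years) / len(years)
    t_mean = sum(norm_freq[y] for y in years) / len(years)
    numerator = sum((y - y_mean) * (norm_freq[y] - t_mean) for y in years)
    denominator = sum((y - y_mean) ** 2 for y in years)
    return numerator / denominator if denominator else 0.0
```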
Each calculated slope \(b_{\tilde{t},k}\) is weighted by two parameters. The first parameter is the regression’s coefficient of determination \({R^{2}_{k}}\). The second parameter is a weight \(\omega_{k}\) that is determined by a function that decreases for earlier periods.
For example, the weight ωk that is used for one period can be defined using a linearly decreasing function:
$$ \omega_{k} = max(0, 1 - \frac{k}{4}) $$
(3)
This means that the weight is 1 for the most recent period \(p_{0}\), decreases linearly by 0.25 for each earlier period, and becomes 0 for period \(p_{4}\) and beyond.
Alternatively, the weight can decrease exponentially:
$$ \omega_{k} = \frac{1}{2^{k}} $$
(4)
In this case, the weight is 1 for the most recent period \(p_{0}\), 0.5 for period \(p_{1}\), 0.25 for period \(p_{2}\), and so on.
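Both weighting variants translate directly into code, for example:

```python
def linear_weight(k: int) -> float:
    """Equation (3): weight 1 for the most recent period, 0.25 less per earlier period."""
    return max(0.0, 1.0 - k / 4)

def exponential_weight(k: int) -> float:
    """Equation (4): the weight halves for each earlier period."""
    return 1.0 / 2 ** k
```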
The final weighting for a topic t is then computed from the slopes \(b_{\tilde {t},k}\), the coefficients of determination \({R^{2}_{k}}\), and the weights ωk of each of the K periods as follows:
$$ \omega = \frac{1}{K} \cdot \sum\limits_{k=0}^{K-1} b_{\tilde{t},k} \, \omega_{k} \, {R^{2}_{k}} $$
(5)
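Combining the sketches above, the final topic weight of (5) can be computed as in the following illustrative sketch; the R² helper and the function names are again our own assumptions.

```python
def r_squared(norm_freq: dict, period: range, slope: float) -> float:
    """Coefficient of determination of the per-period linear regression."""
    years = [y for y in period if y in norm_freq]
    if len(years) < 2:
        return 0.0
    y_mean = sum(years) / len(years)
    t_mean = sum(norm_freq[y] for y in years) / len(years)
    intercept = t_mean - slope * y_mean
    ss_res = sum((norm_freq[y] - (intercept + slope * y)) ** 2 for y in years)
    ss_tot = sum((norm_freq[y] - t_mean) ** 2 for y in years)
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0

def topic_trend_weight(norm_freq: dict, periods: list, weight_fn=linear_weight) -> float:
    """Equation (5): mean of the weighted, R^2-damped slopes over all K periods."""
    if not periods:
        return 0.0
    total = 0.0
    for k, period in enumerate(periods):
        slope = period_slope(norm_freq, period)
        total += slope * weight_fn(k) * r_squared(norm_freq, period, slope)
    return total / len(periods)
```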
To identify the best measurement, we integrated both the linear and the exponential calculation of the weight \(\omega_{k}\) and evaluated them in two different systems. The more appropriate results seem to be achieved with the linear calculation, due to the fixed time periods covering 20 years overall.
The weighting of the trends, the slopes in the different time periods, and the regression allow us to identify trends with better results than the trend identification methods described and illustrated in the literature review, although the method is quite simple [36].
Data modelling
In the Data Modeling stage, the data models of our approach are created according to Card et al. [7] for the different aspects of the data that are relevant in the analysis process. The interaction with our system should lead to answers to the questions mentioned in Section 3.1. To answer those questions with the particular aspects given in the data, we considered aspect-oriented data models [36]. With five data models, the Semantics Model, Temporal Model, Geographical Model, Topic Model, and Trend Model, we enable a refined data structuring [36]. The Enriched Data and the Trend Identification build the basis for the creation of these models. The generation of the semantic data model [38] has a special role, as it serves as the primary data model holding all information. To ease the extraction of the information needed to create the visual representations, structure and semantics are added to the data. In particular, the textual list representation makes use of the semantics model, where all available information about every publication needs to be shown, besides the generation of facet information for filtering purposes.
Several temporal visualizations make use of the temporal data model. Here, several aspects of the information in the data collection must be accessible based on their temporal properties. The temporal model must map the publication years to the set of publications in a given year to create an overview of the entire result set as a temporal spread. This temporal analysis is not only necessary for the entire available result set; it is also necessary for analyzing specific parts of faceted aspects. Based on these faceted attributes, detailed temporal spreads for all attributes of each facet type must be part of the temporal model. The temporal spread analysis must be available for each facet in the underlying data. With this information, the temporal visualizations can be created more easily, for example, to show a ranking over time or comparisons of popularity over time [36].
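A minimal sketch of such a temporal model, keeping per-year counts for the whole result set and per facet value, could look as follows; the publication record fields and the chosen facets are assumptions for illustration.

```python
from collections import defaultdict

def build_temporal_model(publications: list[dict]) -> dict:
    """Map years to overall publication counts and to per-facet-value counts.

    Each publication is assumed to be a dict such as
    {"year": 2019, "authors": [...], "topics": [...]}.
    """
    overall = defaultdict(int)
    facets = {"authors": defaultdict(lambda: defaultdict(int)),
              "topics": defaultdict(lambda: defaultdict(int))}
    for pub in publications:
        year = pub["year"]
        overall[year] += 1
        for facet, counts in facets.items():
            for value in pub.get(facet, []):
                counts[value][year] += 1
    return {"overall": dict(overall), "facets": facets}
```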
The geographical aspect of the available data is represented in the geographical data model. The complexity of the geographical data model is lower than that of the temporal model, because the geographic visualization only needs quantity information for each country. The data in this model provides the information about the country of origin of the authors’ affiliations. Although the data is enriched with information from various additional databases, many data entities lack the information about the country. To face this problem, we integrated two approaches: (1) we use the country of the author’s affiliation, and (2) we consider publications from the same author and field, based on the extracted topics and within a certain time range (plus and minus one year), to infer the country. Since authors can change their affiliation and thus their country, it is important to respect the year of publication [36].
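The second heuristic can be sketched as follows; `author_index` (mapping an author to that author's other publications) and the record fields are illustrative assumptions rather than the system's actual data structures.

```python
def infer_country(publication: dict, author_index: dict) -> str | None:
    """Assign a country to a publication as described above (simplified sketch).

    (1) Use the country of the author's affiliation if present;
    (2) otherwise, look at publications of the same author in the same field
        (shared topics) within plus/minus one year and reuse their country.
    """
    if publication.get("country"):
        return publication["country"]
    for author in publication.get("authors", []):
        for other in author_index.get(author, []):
            shares_field = set(other.get("topics", [])) & set(publication.get("topics", []))
            close_in_time = abs(other.get("year", 0) - publication["year"]) <= 1
            if other.get("country") and shares_field and close_in_time:
                return other["country"]
    return None
```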
The topic model contains detailed information about the generated probabilistic topic model. It supplements the other data models, in particular the semantics data model, by providing insights into the assigned topics.
The inclusion of the most frequently used phrases can help the user immensely when reformulating the search query to find additional information on topics of interest. However, the main purpose of the topic model is to gather relevant information about technological developments and the approaches used within those implementations. To provide the temporal spread of topics, the topic model is commonly correlated with the temporal model. Figure 9 demonstrates the temporal spread of topics for the exemplary search “Information Visualization”. The trend model is generated as a combination of the trend recognition process described in Section 3.5 and the temporal model. This combination enables illustrating the main trends either as an overview of the “top trends” identified by the described weight calculation or after a query has been performed. In the second case, the same procedure is applied, but only to the results that relate to the queried term instead of the entire database.
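Conceptually, the trend model can then be derived as sketched below, either for the whole collection or for a query result set; the sketch reuses the helper functions from the trend identification sketches above and is again only illustrative.

```python
from collections import Counter

def top_trends(publications: list[dict], topics: list[str], current_year: int,
               query_filter=None, n: int = 10) -> list[tuple]:
    """Rank topics by their trend weight, optionally restricted to a query result set."""
    if query_filter is not None:
        publications = [p for p in publications if query_filter(p)]
    docs_per_year = Counter(p["year"] for p in publications)
    ranking = []
    for topic in topics:
        topic_docs = Counter(p["year"] for p in publications if topic in p.get("topics", []))
        if not topic_docs:
            continue
        norm = normalized_topic_frequency(docs_per_year, topic_docs)
        periods = split_into_periods(current_year, min(topic_docs))
        ranking.append((topic, topic_trend_weight(norm, periods)))
    return sorted(ranking, key=lambda item: item[1], reverse=True)[:n]
```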
Interactive visualizations
The generation of interactive visualizations has two main processing stages: (1) Visual Structure and (2) Visual Representations, which are described in more detail in this section.
Visual structure
The “visual structure” of our approach enables an automatic generation and selection of visual representations based on the underlying data model. We applied the three-step model by Nazemi [35, p. 256], consisting of semantics, visual layout, and visual variables, which was originally created for visual adaptive applications. The model starts the visual transformation to generate the “visual structure” with the semantics layer. Although our system is not yet adaptive, we investigated the data characteristics for choosing appropriate “visual layouts”. Afterward, we defined a number of “visual variables” according to Bertin [2] that are applied to a certain “visual layout”. The inclusion of this model allows us to enhance the system with adaptation capabilities and reduces the complexity of integrating new visualizations. Thereby, the system performs adaptations rather as recommendations that the user can reject at any time. The users can therewith decide to compose their own analytical data view or to follow the adaptation recommendations of the system.
Visual representations
We integrated several “data models” (as described in Section 3.6) that allow the users to interact with different aspects of the underlying data. Based on the visual structures created from the integrated data models, we provide several interactive visual layouts to enable information gathering from different perspectives. For the analysis processes in technology and innovation management, we identified five different visual representations as necessary overall. Most of the integrated visual representations are based on temporal data, since these are most common in visual data analysis. Temporal data not only allow visualizing the temporal spread of certain data entities over time, but also provide forecasting and foresight based on statistical and learning methods. A simple temporal spread of a certain search term is illustrated in Fig. 2, where the right visual representation includes some statistical values, for example, the regression line, maximum, and minimum.
A main aspect of the temporal visual representation is to gain insight into the temporal spread of certain automatically extracted topics. The temporal spread of the highest weighted topics over time is illustrated exemplarily in Fig. 3. It clearly shows that “neural networks” gained more attention in “artificial intelligence” research; the analysis was initiated via the search for “machine learning”.
Another well-established visual layout for temporal data is the “stacked chart”. However, one should consider that the visual perception of the underlying information might become difficult if many information entities are illustrated. Stacked visualizations sometimes make it difficult to identify differences between multiple datasets or changes within the same dataset over time. As a consequence, we integrated the temporal river layout, which, in contrast to stacked layouts, separates all the topics and trends for a more comprehensible view.
Figure 4 illustrates two visualizations of the same data: a river chart on the left and a stacked chart on the right. Each river has a center line and expands uniformly to each side based on the frequency distribution over time. Placing multiple rivers beside each other makes it easier to spot differences in the temporal datasets and to compare the impact of various authors, topics, or trends on a search term.
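The difference between the two layouts can be reproduced with a few lines of matplotlib, as in the sketch below; the random placeholder data and the styling are purely illustrative and unrelated to the actual system, which renders these views interactively.

```python
import numpy as np
import matplotlib.pyplot as plt

years = np.arange(2000, 2020)
# Placeholder normalized topic frequencies, one row per topic.
frequencies = np.abs(np.random.default_rng(42).normal(0.5, 0.2, (3, len(years))))
labels = ["topic A", "topic B", "topic C"]

fig, (river_ax, stacked_ax) = plt.subplots(1, 2, figsize=(10, 4))

# River-style view: each topic is drawn symmetrically around its own center line.
for i, (freq, label) in enumerate(zip(frequencies, labels)):
    center = i * 1.5
    river_ax.fill_between(years, center - freq / 2, center + freq / 2,
                          alpha=0.7, label=label)
river_ax.set_title("river layout (separate center lines)")
river_ax.set_yticks([])
river_ax.legend(loc="upper left")

# Conventional stacked area chart of the same data.
stacked_ax.stackplot(years, frequencies, labels=labels, alpha=0.7)
stacked_ax.set_title("stacked layout")

plt.tight_layout()
plt.show()
```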
For the analysis of trends, it is important to gather knowledge about the underlying topics (e.g., technologies) that emerged or possibly lost relevance over time. We have integrated a number of temporal visual structures, which can be combined into analysis dashboards to enable a fast and comprehensible analysis. It is even more important to gather the different correlations between the semantic data model, the geographic spread, the topics and temporal spreads, and especially the trends, which are modeled through the described procedure (see Section 3.5). A small set of the visual representations implemented in our system is shown in Fig. 5. The practical usage, in form of the interaction behavior while analyzing trends, technological advancements, and correlations, is described in the next section.