1 Introduction

Patents are widely considered to be a major indicator of current technological developments. They are often used as a metric of the research and innovation output of a given society, depicting current technological trends, and indicating its economic potential [1, 2]. It has long been established that a significant amount of time is often needed to advance from the innovation stage to the production one [3]. This is partly due to the fact that advances in production require similarly advanced levels in many different and distinct innovation fields [4]. This translates to a need to understand and track the co-evolution of innovation fields and their cross-correlations.

Such questions, however, have mainly been analyzed at the spatial level [5, 6], and much less at the temporal one. Indeed, studies on patents and patent citations have mostly focused on the dynamics of specific domains, such as the Internet of Things [7], synthetic biology [8], and the water sector [9]. Other papers [10,11,12] have tried to identify the dynamics of patent citation networks as a whole, and to understand the hidden processes behind the growth and evolution of their structure. The effect of citations added by patent examiners on such studies should not be overlooked. However, it has been shown to have a mostly spatial and self-citing angle, rather than a temporal, cross-sectorial one [13].

In econophysics, biophysics, and many other fields where data in most cases carry a timestamp (which is an extremely useful temporal mark), time-series analysis is very commonly used [14]. It serves as a method to extract information on the evolution of specific quantities, and can be used to understand whether some hidden dependencies exist [15]. For example, it can be used to extract hidden information that may provide hints on the future values of stock market indices [16], or it may be used to describe the metabolism of free-moving mice under controlled ambient temperature [17]. Another issue in time-series analysis is that of representation and dimensionality reduction, for which there have been several different approaches such as [18, 19]. Regarding patents and innovation, work has also been done in the field of forecasting future trends that are found when combining time series [20].

To the best of our knowledge, what seems to be lacking in this field is a study on the correlations and cross-correlations between individual time series derived from data on technological fields. The time series of citations of a technological field in patents describes the potential this area has for applied innovation. The influence that innovations in one field may have on other fields is usually due to new trends that may arise. To cover this, we try to identify specific innovation fields (as defined in the patent network) which are seemingly unrelated, and which, according to our analysis, are highly likely to exhibit a well-defined decrease or increase in their correlation over time, or to exhibit cross-correlation.

The remainder of this study is organized as follows. Section 2 lists the data used for our study. Section 3 describes the method followed to treat our data in order to get results and reach our basic goals, which are listed in Sect. 4. Our study concludes with Sect. 5 which lists our basic observations and our ideas on the future prospects of this work.

2 Data

The International Patent Classification (IPC) system is a hierarchical system used worldwide to categorize patents based on the scientific or technical areas they cover. It comprises eight sections (labeled A–H), each representing a broad field of technology, which are further subdivided into classes, subclasses, groups, and subgroups. An IPC code consists of a combination of letters and numbers. The first letter indicates the section, while the first letter along with the subsequent two digits represent the class. The remaining characters and numbers provide further details on the classification, specifying the subclasses and groups. Each patent is assigned one or more IPC codes based on the specific technical areas it pertains to, making it easier for patent examiners, researchers, and inventors to search for relevant prior art and analyze the patent landscape in specific technology fields.
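The code structure described above can be illustrated with a minimal sketch that splits an IPC code into its section, class and subclass. The function name `parse_ipc`, the regular expression and the sample code string are illustrative assumptions and are not part of the data pipeline used in this study.

```python
import re

# Leading part of an IPC code: section letter (A-H), two-digit class number
# and an optional subclass letter. Groups and subgroups are not parsed here.
IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])?")

def parse_ipc(code):
    """Return (section, ipc_class, subclass) for an IPC code string."""
    match = IPC_PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognisable IPC code: {code!r}")
    section, digits, subclass_letter = match.groups()
    ipc_class = section + digits
    subclass = ipc_class + subclass_letter if subclass_letter else None
    return section, ipc_class, subclass

print(parse_ipc("G21C 3/04"))  # ('G', 'G21', 'G21C')
```

Only the class level (for example G21) is used in the analysis that follows.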

Our data has been obtained from the Organization for Economic Co-operation and Development (OECD), and covers European Patent Office (EPO) data spanning 37 full years, from 1979 to 2015. The network created from the patents’ citations has cited and citing patents as nodes, and each citation is represented by a directed link. Each link has a timestamp, allowing for a detailed record of the evolution of link creation within the network.

Our primary focus is on the IPC classes of the cited patents. We tally each patent citation and attribute it to the respective IPC classes the cited patent is associated with, taking note of the citing date. Thus, when the cited patent belongs to more than one class, each citation adds a citation event (all of equal weight) to each of these classes. Through this process, we generate a time series of the total citations for each of the 124 IPC classes. Since the EPO citation data is not recorded at uniform time intervals, we aggregated the data on a monthly basis, which yielded approximately 450 time points in each time series.
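The aggregation step can be sketched as follows. The record layout (a citing date paired with the list of IPC classes of the cited patent) and the function name `monthly_class_series` are assumptions made for illustration; the actual OECD files are organised differently.

```python
from collections import defaultdict
from datetime import date

def monthly_class_series(citations, start_year, end_year):
    """Build one monthly citation-count series per IPC class.

    `citations` is an iterable of (citing_date, cited_classes) pairs, where
    cited_classes lists the IPC classes of the cited patent.
    """
    n_months = (end_year - start_year + 1) * 12
    series = defaultdict(lambda: [0] * n_months)
    for citing_date, cited_classes in citations:
        if not start_year <= citing_date.year <= end_year:
            continue
        month_index = (citing_date.year - start_year) * 12 + citing_date.month - 1
        # A cited patent in several classes adds one equal-weight event to each.
        for ipc_class in set(cited_classes):
            series[ipc_class][month_index] += 1
    return dict(series)

# Toy input with three hypothetical citations.
toy = [
    (date(1979, 1, 15), ["C07", "C12"]),
    (date(1979, 1, 20), ["C07"]),
    (date(1979, 3, 2), ["G21"]),
]
series = monthly_class_series(toy, 1979, 2015)
print(len(series["C07"]), series["C07"][:3], series["G21"][:3])
# 444 [2, 0, 0] [0, 0, 1]
```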

3 Methodology

We calculate the Pearson correlation coefficient, a statistic that measures the linear correlation between two variables x and y, where x and y are the time series of individual IPC classes [21]. The coefficient values rxy of the 124 × 124 pairs of IPC classes, computed over all 37 years of citation measurements, are then plotted in a matrix. The correlation values are calculated using the following equation:

$$ r_{xy} = \frac{{\mathop \sum \nolimits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)\left( {y_{i} - \overline{y}} \right)}}{{\sqrt {\mathop \sum \nolimits_{i = 1}^{n} \left( {x_{i} - \overline{x}} \right)^{2} \mathop \sum \nolimits_{i = 1}^{n} \left( {y_{i} - \overline{y}} \right)^{2} } }} $$
(1)

where x̄ and ȳ are the mean values of x and y, and n is the number of data points in the time series. We can then plot the values in a correlation matrix. The diagonal of this matrix has the fixed value 1, which is the correlation of any IPC class time series with itself. It follows from Eq. 1 that correlation coefficient values range from −1 to 1. However, since the classes appear in the order of the IPC classification system, which is effectively arbitrary with respect to correlations, the rxy values follow that same order. To rearrange them in a more meaningful and visually clear order, and to identify possible clusters, we use the Hierarchical Binary Cluster Tree (HBCT) method [22].
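As an illustration, Eq. 1 can be evaluated for all pairs of classes at once with a few array operations. This is a minimal sketch using NumPy; the function name `pearson_matrix` and the Poisson toy data are assumptions, and the result is equivalent to `numpy.corrcoef` applied to the same array.

```python
import numpy as np

def pearson_matrix(series):
    """Pearson r (Eq. 1) for every pair of rows of `series` (classes x months)."""
    series = np.asarray(series, dtype=float)
    centered = series - series.mean(axis=1, keepdims=True)
    cross = centered @ centered.T      # numerators of Eq. 1 for all pairs
    norms = np.sqrt(np.diag(cross))    # square roots of the sums of squares
    return cross / np.outer(norms, norms)

# Toy data: 4 hypothetical classes with 444 monthly counts each.
rng = np.random.default_rng(0)
toy = rng.poisson(lam=5.0, size=(4, 444))
r = pearson_matrix(toy)
print(np.round(r, 2))
```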

To this end, we compute the correlation distance δxy on which the rearrangement of the matrix is based:

$$ \delta_{xy} = \sqrt {2\left( {1 - r_{xy} } \right)} $$
(2)

where rxy is the Pearson correlation coefficient of variables x, y and δxy is the corresponding distance, which takes values 0 ≤ δxy ≤ 2 (δxy = 0 for rxy = 1 and δxy = 2 for rxy = −1).

Distance values are preferable over Pearson correlation coefficients, since IPC classes can cluster together when their distance is small. By rearranging the matrix, it is possible to distinguish with greater ease differences between pairs of classes.
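The reordering can be sketched with SciPy's agglomerative clustering routines. The linkage criterion is not stated in the text, so average linkage is assumed here purely for illustration, and the function name `reorder_by_clustering` and the toy matrix are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import leaves_list, linkage
from scipy.spatial.distance import squareform

def reorder_by_clustering(r, method="average"):
    """Reorder a correlation matrix by the leaf order of a binary cluster tree."""
    r = np.asarray(r, dtype=float)
    # Eq. 2; clipping guards against tiny negative values caused by rounding.
    dist = np.sqrt(np.clip(2.0 * (1.0 - r), 0.0, None))
    np.fill_diagonal(dist, 0.0)
    dist = (dist + dist.T) / 2.0       # enforce exact symmetry
    tree = linkage(squareform(dist, checks=False), method=method)
    order = leaves_list(tree)          # left-to-right order of dendrogram leaves
    return order, r[np.ix_(order, order)]

# Toy matrix: classes 0 and 2 are strongly correlated, as are classes 1 and 3.
toy_r = np.array([
    [1.0, 0.1, 0.9, 0.1],
    [0.1, 1.0, 0.1, 0.9],
    [0.9, 0.1, 1.0, 0.1],
    [0.1, 0.9, 0.1, 1.0],
])
order, reordered = reorder_by_clustering(toy_r)
print(order)
print(reordered)
```

In the reordered toy matrix the two correlated pairs sit next to each other, which is the effect sought when moving from the unordered to the ordered correlation matrix.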

A basic correlation analysis would not enable us to confirm the presence or absence of temporal variations in the correlation between different IPC classes. To address this, we employ the “sliding windows method” [23, 24], which allows us to divide the aggregated network into smaller time periods with various starting dates. In particular, we establish a 4-year window and apply a 1-year time step, resulting in a total of 34 consecutive windows. In each window, we calculate the correlation coefficients (rxy) and keep track of the changes in their values, thereby constructing a time series that represents the evolution of rxy for every pair of IPC classes.
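A minimal sketch of the sliding windows computation for a single pair is given below. It assumes monthly series, so the 4-year window and 1-year step become 48 and 12 months; the function name `sliding_window_correlation` and the toy data are illustrative.

```python
import numpy as np

def sliding_window_correlation(x, y, window=48, step=12):
    """Pearson r of x and y inside each window (window and step in months)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    starts = range(0, len(x) - window + 1, step)
    return np.array([
        np.corrcoef(x[s:s + window], y[s:s + window])[0, 1] for s in starts
    ])

# Toy pair: 37 years of monthly values, with y an exact linear function of x.
rng = np.random.default_rng(0)
x = rng.normal(size=37 * 12)
y = 2.0 * x + 1.0
r_windows = sliding_window_correlation(x, y)
print(len(r_windows))  # 34
```

With 444 monthly points, a 48-month window and a 12-month step, the window count is (444 − 48) / 12 + 1 = 34, in agreement with the number of windows stated above.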

Subsequently, we evaluate the temporal dependence of all these pairs using cross-correlation analysis. This process involves determining whether a displaced similarity exists between the pairs by temporally shifting one of the two time series forwards or backwards, and by using the same equation (Eq. 1) as above. To ensure more reliable statistical analysis, we exclude time series with low citation counts by setting a minimum threshold of 1500 citations over the 37-year period, which corresponds to an average of approximately 3 citations per month.
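The shifting procedure can be sketched as below, applying Eq. 1 to the overlapping parts of the two series at each lag. The maximum lag of 12 months and the sign convention (a positive lag pairs x at month t with y at month t + lag, i.e. x leads y) are assumptions of this sketch and need not match the convention used in the tables of this study.

```python
import numpy as np

def lagged_correlation(x, y, max_lag=12):
    """Pearson r between x[t] and y[t + lag] for lag = -max_lag..max_lag."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    lags = np.arange(-max_lag, max_lag + 1)
    values = []
    for lag in lags:
        if lag >= 0:
            a, b = x[:len(x) - lag], y[lag:]
        else:
            a, b = x[-lag:], y[:len(y) + lag]
        values.append(np.corrcoef(a, b)[0, 1])
    return lags, np.array(values)

# Toy pair: y repeats x three months later, so the peak should sit at lag +3.
rng = np.random.default_rng(0)
x = rng.normal(size=444)
y = np.roll(x, 3)
lags, r_lagged = lagged_correlation(x, y)
print(lags[np.argmax(r_lagged)])  # 3
```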

Additionally, we implement measures to account for any generic trends, such as the increase in citations over time. We apply a ‘pre-whitening’ process to mitigate such concerns, flattening the spectrum [25] and rendering the time series stationary. We examine variance over time, auto-correlation, and seasonality, ensuring that any potential impact on our results is neutralized.
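The exact pre-whitening filter is not detailed here. One common and simple variant, shown only as an assumed illustration, is to difference the series and then remove first-order autoregressive, AR(1), structure from the differences. The function names `prewhiten_ar1` and `lag1_autocorr` and the synthetic series are hypothetical.

```python
import numpy as np

def lag1_autocorr(x):
    """Lag-1 autocorrelation of a series."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    return float(np.dot(x[1:], x[:-1]) / np.dot(x, x))

def prewhiten_ar1(x):
    """Difference the series, then subtract the fitted AR(1) part."""
    x = np.asarray(x, dtype=float)
    d = np.diff(x)                 # removes a linear trend in the level
    d = d - d.mean()
    # Least-squares estimate of the AR(1) coefficient of the differences.
    phi = np.dot(d[1:], d[:-1]) / np.dot(d[:-1], d[:-1])
    return d[1:] - phi * d[:-1]

# Synthetic trending series whose increments follow an AR(1) process.
rng = np.random.default_rng(1)
n = 2000
noise = rng.normal(size=n)
increments = np.zeros(n)
for t in range(1, n):
    increments[t] = 0.6 * increments[t - 1] + noise[t]
toy = np.cumsum(increments + 0.1)

residual = prewhiten_ar1(toy)
print(round(lag1_autocorr(np.diff(toy)), 2), round(lag1_autocorr(residual), 2))
```

The residual series should show a lag-1 autocorrelation close to zero, whereas the raw differences remain clearly autocorrelated.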

To eliminate the possibility that our findings are due to random statistical occurrences or artifacts of our methodology, we generate a large set of randomized time-series samples by shuffling the existing values randomly. This allows us to compare the outcomes derived from the original real-world data with those from the randomized instances.
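A shuffling test of this kind can be sketched as follows for the zero-lag correlation of one pair. The function name `shuffle_p_value`, the two-sided comparison on |r| and the (count + 1) / (n + 1) estimator are assumptions of this sketch rather than a description of the exact procedure used.

```python
import numpy as np

def shuffle_p_value(x, y, n_shuffles=10_000, seed=0):
    """Fraction of shuffled series whose |r| reaches the observed |r|."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    observed = abs(np.corrcoef(x, y)[0, 1])
    count = 0
    for _ in range(n_shuffles):
        # Shuffling destroys the temporal order of x but keeps its values.
        shuffled_r = abs(np.corrcoef(rng.permutation(x), y)[0, 1])
        if shuffled_r >= observed:
            count += 1
    # The +1 terms keep the estimate strictly positive.
    return (count + 1) / (n_shuffles + 1)

# Toy pair: y is x plus weak noise, so no shuffle should match the real r.
rng = np.random.default_rng(42)
x = rng.normal(size=444)
y = x + 0.1 * rng.normal(size=444)
print(shuffle_p_value(x, y, n_shuffles=2000))
```

The smallest p-value this estimator can return is 1 / (n_shuffles + 1), so the number of shuffles sets the resolution of the test.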

4 Results and discussion

The Pearson correlation coefficient methodology yields a unique correlation value per pair, rxy, enabling us to generate a correlation matrix for all IPC class pairs. The resulting values are displayed in Fig. 1a, where each row corresponds to the correlation of each IPC class with all others, including itself on the diagonal. Since IPC class codes are arranged in the IPC classification ordering, there is no evidence of clustering in the correlation matrix.

Fig. 1

a The correlation matrix of all IPC classes, as produced when no ordering is imposed. b The dendrogram as produced by the HBCT method, which is used to order the IPC Classes. c The resulting ordered correlation matrix of all IPC classes

To shed more light on this, we employ the HBCT method to generate a dendrogram (Fig. 1b), which rearranges all IPC classes according to their existing clustering patterns, using the distance values δ (derived from Eq. 2) as the basis for this reorganization. We then proceed to produce the rearranged correlation matrix (Fig. 1c), where the diagonal consists of coefficients with exact value 1 as it refers to the pairs of each IPC class with itself. With this rearrangement, it becomes more evident that some clustering takes place, as there are some same-colored regions mostly located in the edges, although the majority of IPC classes do not distinctly belong to a specific cluster. This is an indication of a yet hidden relation between individual IPC classes.

It should be noted that correlation values are dynamic: they can vary greatly from year to year, or even from month to month, due to seasonal phenomena or significant social, technological and economic events. For this reason, we perform a time series analysis using the ‘sliding window method’ mentioned in Sect. 3. The window width is set to 4 years and the window is moved forward in steps of 1 year, thus creating 34 time windows (Fig. 2). However, given that the first few years have limited recorded patent data for most IPC classes, we omit the first 4 time windows and keep the remaining 30 for all subsequent analysis.

Fig. 2

Sliding time windows method on the patent data. The red and green windows on the left are two consecutive windows. The yellow, blue and red time series are 3 indicative time series of IPC classes showing some of the most extreme behaviour: the yellow one has many citations per year, the red one has very few, and the blue one has an extreme peak

In order to ensure that only those IPC classes with a significant number of citations are used, and that correlations between pairs of them are meaningful, we apply a minimum of 1500 overall citations per IPC class. This reduces the number of classes to 78. Thus, the number of pairs whose correlations are investigated is 78 × 77 / 2 = 3003.

Furthermore, given that we have 30 sliding windows per pair, we obtain 30 correlation coefficient values, one for each window, for each of these pairs. This helps us construct a new time series of the correlations of each pair. As expected, some pairs show strong and continuous correlation throughout their entire time series, while others seem to lose their correlation depending on the period under study (Fig. 3).

Fig. 3

Indicative correlation values of various IPC class pairs over the 30 time windows. Some pairs exhibit strong and continuous correlation, while others exhibit periods of significant correlation and other periods of no correlation. The year on the x-axis corresponds to the year in which each 4-year sliding window starts. For a full definition of all related IPC Codes found in the figure above see Appendix A

We focus on those cases exhibiting a trend with either a continuously positive or a continuously negative slope. We calculated the least squares slope estimator of a simple linear regression. All presented slopes were found to be statistically significant at the 99.9% level, with p-value < 10⁻³. As seen in the results (Table 1 and Fig. 4), some pairs have a cumulative change of up to ± 0.6 in the correlation coefficient over those 30 windows.
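The trend estimate for one pair can be sketched with an ordinary least squares fit. The start years 1983 to 2012 follow from dropping the first 4 of the 34 windows and are an inference, not a value quoted from the data; the function name `correlation_trend` and the synthetic correlation values are illustrative.

```python
import numpy as np
from scipy.stats import linregress

def correlation_trend(start_years, r_values):
    """Least squares slope, its p-value and the cumulative fitted change."""
    fit = linregress(start_years, r_values)
    total_change = fit.slope * (start_years[-1] - start_years[0])
    return fit.slope, fit.pvalue, total_change

# Synthetic example: 30 windows with a steady rise in correlation plus noise.
rng = np.random.default_rng(0)
start_years = np.arange(1983, 2013)
r_values = 0.2 + 0.02 * (start_years - 1983) + rng.normal(scale=0.02, size=30)

slope, p_value, total_change = correlation_trend(start_years, r_values)
print(round(slope, 3), p_value < 1e-3, round(total_change, 2))
```

A slope of about 0.02 per year accumulates to a change of roughly 0.6 over the 30 windows, which is the order of magnitude of the largest changes reported above.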

Table 1 Selected IPC class pairs with highest and lowest slopes and their respective 95% confidence intervals. The definitions of all presented IPC codes are given in Appendix A
Fig. 4

Correlation coefficient values of various pairs over each sliding window. Each subfigure corresponds to the correlation of a specific pair. The pairs selected exhibit the highest and lowest slope values. Pairs on the left correspond to the highest slope, on the right to the lowest. a C22–C30, b C07–F02, c B61–F01, d C07–F16, e B62–H04, f C12–G21. Year in the x–axis is the same as in Fig. 3. Full definition of IPC codes can be found in Appendix A

It is worth noting that there are many more pairs that show a strong systematic decrease in their correlation values than increase. This is mainly due to some specific IPC classes that seem to have lost their correlation to many other classes over time. Looking specifically at one, G21, we see that it regards patents in the field of: “Physics/Nuclear physics; Nuclear engineering”. A simple explanation that may be offered is that nuclear physics materials are used less often in innovation in recent times than a few decades ago. Thus, any association with this class has potentially gradually declined, and loss of correlation to other fields may have ensued.

Furthermore, it is also interesting that IPC classes B60, D04 and F16 exhibit a peculiar behaviour, in that they appear both in pairs that show a significant increase in correlation with one class, and in pairs that show a significant decrease with another (or two others in the case of F16). The exact descriptions of these classes are:

B60: Performing Operations; Transporting/Vehicles in General.

D04: Textiles; Paper/Braiding; Lace-making; Knitting; Trimmings; Non-woven fabrics.

F16: Mechanical engineering; Lighting; Heating; Weapons; Blasting/Engineering elements or units; General measures for producing and maintaining effective functioning of machines or installations; Thermal insulation in general.

Their behaviour may point to a change in the use of innovations on these specific fields over time. For example, D04 undergoes a decrease in its correlation value to A01 and an increase to that of B24. Those classes are agriculture and animal husbandry for the first, and grinding and polishing for the second. This means that it was more common in the past to have innovations in textiles and paper that were related to innovations in agricultural practices, while it may now be more common for innovations in textiles and paper to be registered along with grinding and polishing. A simple observation is that textiles were more related to raw materials produced through animal husbandry 40 years ago than now, and innovations in paper are now more often related to polishing (e.g., glossy papers) and grinding (e.g., paper recycling) than decades ago. Furthermore, agricultural practices and textiles/paper production have been driven further apart due to the advent of smart textiles that are not connected to agriculture. In comparison, properties needed in paper/textiles such as water repellency, antimicrobial activity, or improved strength, can be achieved through innovations in grinding and polishing of modern times.

Similar explanations hold for the behaviour of the correlation of F16 with classes B62 (increasing correlation), and C07 and C12 (decreasing correlation), where innovation in mechanical engineering is nowadays more closely bound to land vehicles and is less correlated with innovation in organic chemistry and in brewing.

Next, we focus on understanding whether any hidden relations exist between pairs of IPC classes that are not direct, and can, instead, be found only when there is some time delay. Uncovering this would help identify innovation-driving fields (IPC classes) and innovation-following ones. Therefore, cross correlation was performed on all possible IPC pairs in order to detect whether such intriguing cases exist. Given that the number of pairs is 3003, and that the cross correlation of each pair was investigated over a range of monthly time delays, additional criteria were required to identify the most promising results in a reliable way.

Thus, we applied 3 additional criteria to select the strongest and most meaningful cross correlations. Firstly, a threshold of rmax > 0.75 was imposed on the maximum cross correlation value. Secondly, the cross correlation peak was required to be much higher than the zero-lag correlation value of the pair, with ∆r = rmax − r0 > 0.45. Thirdly, only those cases where the peak was reached at a time delay larger than 1 month were examined, i.e., ∆t = |tmax − t0| > 1 is required. In the above, tmax indicates the time of the peak cross correlation value, t0 the zero time delay, rmax the peak cross correlation value, and r0 the value at t0, which is essentially the correlation value derived before. If these criteria (visually presented in Fig. 5) are fulfilled for a pair of time series, then this pair indicates that a variation in the first time series may be the cause of a change in the second one after some time Δt.
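The three criteria translate directly into a small filter applied to the lagged correlation curve of each pair. The thresholds are those given above; the function name `passes_selection` and the toy curves are illustrative assumptions, and lags are taken to be in months with t0 = 0.

```python
import numpy as np

def passes_selection(lags, r_lagged, r_max_min=0.75, delta_r_min=0.45,
                     delta_t_min=1):
    """Apply the rmax, delta-r and delta-t criteria to one pair."""
    lags = np.asarray(lags)
    r_lagged = np.asarray(r_lagged, dtype=float)
    peak = int(np.argmax(r_lagged))
    r_max = r_lagged[peak]
    r_0 = r_lagged[lags == 0][0]       # zero-lag (ordinary) correlation
    delta_r = r_max - r_0
    delta_t = abs(int(lags[peak]))     # distance of the peak from zero lag
    return bool(r_max > r_max_min and delta_r > delta_r_min
                and delta_t > delta_t_min)

def toy_curve(peak_lag, peak_value, base=0.2):
    """Flat curve at `base` with a single peak; purely illustrative."""
    lags = np.arange(-12, 13)
    curve = np.full(lags.shape, base, dtype=float)
    curve[lags == peak_lag] = peak_value
    return lags, curve

print(passes_selection(*toy_curve(3, 0.8)))             # True
print(passes_selection(*toy_curve(1, 0.8)))             # False: lag not > 1
print(passes_selection(*toy_curve(3, 0.8, base=0.5)))   # False: delta-r too small
```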

Fig. 5

Visual representation of the criteria for the selection of the most promising pairs with significant cross correlation peaks in their time delays

Using this process, it was found that only 68 pairs satisfied the requirements discussed above. The 20 pairs with the highest cross-correlation values are shown in Table 2. Furthermore, to provide an indicative picture of some cases, we also show four such pairs (Fig. 6a–d).

Table 2 Cross-correlation coefficients of 20 pairs, ordered by the highest cross-correlation values, and time lag of the max (peak) value. Positive time lag values signify that IPC Code 2 is ahead of IPC Code 1, and negative values the opposite. Full definition of IPC codes can be found in Appendix A
Fig. 6

Cross correlation of specific pairs over their time lags (in months). Pairs with some of the highest cross correlation values that cover all previously set criteria. Full definition of IPC codes can be found in Appendix A

As before, to ensure that such results would not be produced randomly, we shuffle each time series 10,000 times and calculate the cross-correlation values. All the cases presented demonstrate statistical significance at a 99.99% confidence level, with a p-value of 1 × 10⁻⁴.

It is evident that there are some pairs where strong cross-correlation exists. Also, given that these pairs often cover cases between IPC Classes for which it can be surmised by their description that a hidden relation may exist, our results seem fitting. For example, in Fig. 6b there is a pair of IPC classes which have the following exact descriptions:

B65: Performing Operations; Transporting/Conveying; Packing; Storing; Handling thin or filamentary material.

C23: Chemistry; Metallurgy/Coating metallic material; Coating material with metallic material; Chemical surface treatment; Diffusion treatment of metallic material; Coating by vacuum evaporation, by sputtering, by ion implantation or by chemical vapour deposition, in general; Inhibiting corrosion of metallic material or incrustation in general.

This pair can easily be thought of as connected, since an innovation in Coating metallic material, Chemical surface treatment or Coating by vacuum evaporation can have an effect on innovations in packing or handling thin or filamentary material. Our results indicate that there is indeed such a relation, and we estimate that the maximum cross correlation value is reached after a period of about 3–4 months.

Similarly, the pair displayed in Fig. 6a seems unrelated by definition (A47 is Furniture; Domestic Articles or Appliances; Coffee Mills; Spice Mills; Suction Cleaners in General, and B62 is Land Vehicles for Traveling Otherwise than on Rails), yet it displays a lag of 8 months. This could be because technological advances in furniture design and related materials or devices may precede their incorporation into land vehicles, as some time may be required for them to be adopted or adapted for use in the automotive industry. As for Fig. 6c, where the pair again seems initially unrelated (C03 is for Glass; Mineral or Slag Wool, and H05 is for Electric Techniques not Otherwise Provided for) and which shows a time lag of 3 months, some assumptions can be made. It is, for example, reasonable to assume that specialized glass compositions, coatings, or manufacturing processes covered by C03 may have implications for the electrical applications of H05, and the time lag may be due to just that. Finally, for Fig. 6d, a pair where it is again not easy to find direct relations (B62 is for Land Vehicles for Traveling Otherwise than on Rails, and G01 is for Measuring; Testing), there is a time lag of 9 months. This can perhaps be related to innovations in measuring and testing technologies, such as advanced sensors, diagnostic tools, or testing methodologies covered by G01, which may be developed and patented before their incorporation into the land vehicles of B62.

5 Conclusion

Patents are one of the most commonly established indicators of innovation, and can often amount to a hidden market value of several billions for large tech companies. For example, in a recent patent auction Yahoo sold around 3000 of its patents for an estimated price of 1 billion US dollars. Companies such as Apple or Samsung have a considerably higher patent portfolio value. The value alone, however, cannot determine the way a specific innovation impacts others in related, or seemingly unrelated, fields. Thus, we have tried to find methods to understand the effects that patent classes can have on each other, so as to predict with high likelihood the changes in one class when changes are observed in another.

Our methodology, which is based on time series analysis, and the results obtained verify that in some cases there is a significant, and statistically improbable, change in the correlation of some pairs of IPC classes. In fact, we have clearly identified several cases where there is a gradual (or in some cases more abrupt) change in the correlation values between the time series of a pair of classes. We have provided an explanation for this change, and we have identified specific classes whose patent content has probably changed over time, in relation to other classes. As an example, we mentioned the case of class G21, which may nowadays have a smaller technological impact on innovation overall. We further found classes such as B60, D04 and F16 that have both significantly increased their correlations with some classes and decreased them with others. This can be attributed to a change in the use of innovations of each IPC class, especially when compared to its correlated class.

The methodology also identifies whether there is a temporal lag between the correlation of two IPC classes, and estimates the strength of this cross correlation. Through the application of additional criteria, we have chosen some example cases where there appears to be significant cross correlation. One such case is the pair B65 and C23 mentioned in the results section. We have also provided a plausible explanation for all four such cases shown in the related figure, based on some of the specific subfields that exist in these classes.

Our approach lacks the ability to identify cases where there are peaks for only short periods of time in the correlation values over the span of the 34 years of data. It also cannot identify pairs where cross correlation may be more complex, as would be the case if both classes were to influence one another after some time lag. Future work could focus on finding more robust ways to identify such changes, as well as set indisputable criteria to identify and specify exact cross correlation temporal lags between pairs.