1 Introduction

Policymakers and a range of interest groups need official statistics that allow them to make evidence-based decisions. Questions by researchers, policy makers and citizens are increasingly complex and stimulated by the dynamic relationships. In this context, there is a need to widen statistical information of official statistics in order to allow dynamic analyses and to permit the control for unobserved heterogeneity via panel econometric methods. Therefore, the availability of longitudinal data is a crucial point. In many countries, like in Italy, longitudinal data on businesses are not collected. This paper focuses on how to solve the lack of this type of data using already collected data. Reengineering the statistical business process could allow building longitudinal information while satisfying National Institute of Statistics (NSI) cost and respondent load reduction. By combining information for firms gathered in different surveys and/or from other sources a new generation of data set can be provided. The literature and some empirical studies provide experiences in using innovative approaches to create longitudinal databases. For instance, the Longitudinal Business Database (LBD) maintained by Statistics New Zealand is an important example. This database is developed by integrating various sources. This approach leads also to the opportunity of expanding information to a variety of new contents the user is interested in. See Fabling [4] for more details and for further literature.

The Research Data Center of the statistical offices of the German landers provided panel data for Germany as whole, linking surveys across time and space. Moreover, the project AFiD (Amtliche Firmendaten für Deutschland, official firm data for Germany), for details see in Malchin and Voshage [10], is merging firm level data from different surveys to build up data sets covering a wider range of variables than the ones collected in any survey, one at a time. In Italy, Oropallo and Inglese [11] and Zeli [16] proposed an approach for a longitudinal database based on annual data collection in this country.Footnote 1

In this paper, a new approach to manage the longitudinally analysed Italian core units considering the Mergers and Acquisitions (M&A) events is proposed. When an historical complex archive of data is available (for instance surveys’ archive of data—or integrated archives of data—collected yearly for a long period, or administrative data sources available for a long time period as well), this mass of information may be exploited and organized longitudinally. Two possible approaches are the so-called the CIG (namely Code Identifying Group) and the COD (namely ISTAT firm’s code).

Following the single firms and understanding their economic and behavioral changes is possible to create a longitudinal dataset by merely linking the firms’ codes year after year (COD approach). This solution ignores the problem of M&A, which is a dramatically increasing phenomenon in the last decades (Fig. 1).

Fig. 1
figure 1

Source: Unctad. World Investment Report 2015

Number of net cross-border M&As by region of seller, 2002–2019. Data for Italy are represented on right ordinate.

Indeed, the single enterprise status changes over time. Many events and transformation may modify the structure and other firm characteristics and its ranking on the relevant markets in terms of production and activity. Changes may affect:

  • The dimensional structure (number of persons employed),

  • The economic activity,

  • The legal status (firm name, fiscal code).

Moreover, many events can imply the death of an enterprise. Due to possible changes the obtained longitudinal database is incorporating a greater attrition. An alternative approach consists on creating a longitudinal database that considers also the M&A (named CIG). In such a case all firms involved in M&A events are considered and they can be clustered using an appropriate codification. The codification permits including in the panel longitudinal data of all relevant firms.

This paper shows the characteristic of datasets build up in Italy using these two approaches. Our hypothesis is that an improvement is gained in data quality (representativity) when the panel data taking in account the M&A events (CIG) is compared with a benchmark represented by a mere code-linked panel (COD). To verify this hypothesis, we apply different statistical methodologies to perform the comparison, such as: Spearman ranking correlation, FP-test and R-indicators.

In Sect. 2 an overview of panel data implemented in different NSI is supplied, the panels building techniques are showed in Sect. 3 the statistical methodology implemented for panels’ representativity comparison is described in Sect. 4. Sections 5 and 6 present preliminary and final results respectively. Section 7 concludes.

2 Data and variables

The database used in the analysis is the result of an innovative multi-source approach. It includes the ISTAT’s Business Register (ASIA), the Event Register and the Italian financial reports database.

ASIA is a complex archive. It results from the logical and physical integration of two different type of data sources, i.e. statistical sources (surveys) and administrative sources, treated with methodologies for obtaining statistical data (Tax Register, Chamber of Commerce Register, Social Security Register, Work Accident Insurance Register, Register of the Electric Power Board).

ASIA Register Service also creates the Events Register, it is a database in which all information on enterprises changes are registered.

Due the various events that can occur, a firm code may be referred to entity notably different from one year to another. The economic indicator can be strongly influenced by this.

In order to link the firm/entity over time, the Events Register provides the following information:

  • Code1, origin firm code (for instance the seller),

  • Code2, destination firm code (for instance the buyer),

  • Event Code, according to the above-mentioned scheme,

  • Event Date, month and year,

  • Quality Code.

The enterprises transformations are classified in two groups: legal status modifications and transfer of property, hence, the Event Register classification has 2 classes of nomenclatures: change of legal status nomenclature and transfer of property nomenclature. The first nomenclature includes all events of transformation of the original firm, such as: transfer by means of transformation in a new firm, cessation by means of transformation in a new firm, transfer by means of transformation in an existing firm, birth and cessation by means of a legal status change, acquisitions et cetera.

The second nomenclature includes events such as cessation by means of succession on death or takeover, transfer by means of business/enterprise lease, birth by means of succession on death or property transfer, birth by means of business/enterprise lease.

Combining the origin and destination code and considering the event code a complete registration of all possible transformation events that might occur is obtained. This information is used for panel construction purposes.

The last database, containing the Italian financial reports, collects annual reports transmitted, according to the law, to the local Chamber of Commerce (CC) by all limited firms. Due to its relevance in terms of number of observations and target population representativeness, ISTAT uses this source in many processes such as survey data validation, panel building. It is also used as backbone dataset for many simulation models. The database includes around 700,000 firm financial reports every year. It represents a rich and complete source of information.

Table 1 shows the data of the main economic variables contained in the available Istat financial reports for years from 2008 to 2011.

Table 1 Main economic values of financial reports database for limited firms in Italy—years 2008–2011 (economic values in billions of euro)

In particular the table presents the firms’ number, the turnover (i.e., the firms’ sales amount), the intermediate costs (i.e., the commodity), the personnel costs, the value added (calculated as revenues less intermediate consumption) and the total of persons employed in the firm (the sum of employees, self-employed and other people working in the firm).

3 Panels’ construction

Literature and certain empirical studies encompass alternative survey approaches to collecting data longitudinally [6, 9]. Here we consider three different approaches for preparing panel data from an archival source:

  1. 1)

    A cross-sectional approach, i.e. a series of cross-sections bound together, see for instance [7].

  2. 2)

    A simple codes panel (COD panel), merely built by linking the register identification code year after year, with no regards if the economic entity remains the same;

  3. 3)

    A panel which considers firms’ transformation events. A complete technical description of this approach is presented in Zeli [16].

The analysis of the effects of different approaches for a longitudinal study of the data, has to consider different factors or properties. Application criteria simplicity/easy to treat with respect to the available databases, time length of the database that can be obtained by applying the different methodology, consistency with actual facts which are registered are examples of factors/properties to be taken into account.

Cross-sectional approach is the one that fits more the real firm legal and economic development, but longitudinally linking these entities is not possible in each individual case and for every possible event. However, the cross-section approach is not considered here.

The simple code panel (COD) has the advantage of being simple in procedures and concepts, and of requiring a short preparing time. According to this approach the M&A events are not considered; therefore, there is no guarantee for the actual (both legal and economic) continuity of starting firms. Drawbacks associated with analyses and conclusions based on datasets build-up on simple code approach are important misrepresentation effects.

Longitudinalization through the first approach, i.e., clusters of firms involved in events of transformation, is considered in this paper. All M&A events are used to cluster the involved firms by means of a special code (CIG); an economic super-entity is build-up. It may be traced back to the firm(s) existing in the starting year. By means of this procedure two targets can be achieved: the economic continuity of a single firm and maximisation of the representativity of panel firms and so of the panel itself. Focusing on the transformation events approach [16], Fig. 2 illustrates the scheme followed to cluster firms with 50 employees or more in Italy. An illustrative series of events during a period of 4 years is considered to describe the longitudinal recording steps. The black codes are the new longitudinal ones created for clustering the firms involved in the M&A events, while the dark grey shadowed codes are the usual ASIA codes. At every annual transition (black, firms’ clusters are created. The relevant quantitative (economic data have to be treated, for instance, summed up. For qualitative data, on the contrary, some prevalence criteria have to be adopted. In this example, three clusters are obtained; they represent the virtual firms spanning over the considered time period. Both the 6 starting firms (F1, F2, F5, F8, F10 and F12 and the 7 firms present in the last period (F3, F4, F6, F7, F9, F10 and F12 and are represented in the 3 clusters (A, B and C. If the clustering approach had not been applied and a code panel were built instead, only two firms would have been included in the code panel itself (F10 and F12 with a loss of information.

Fig. 2
figure 2

Clustering events scheme (Source: reworking from [16])

The code panel is easy to implement, but it fails to express variables levels consistent with the real ones and in granting a strong longitudinal characteristic and coverage for the firms included in the set of interest. The panel built by grouping firms involved in M&A events may be very difficult to implement starting, every time again, from the base data sources: Business Register, Events Register (i.e., Register of M&A), Business data warehouse. However, when a statistical procedure implementing this kind of panel is available (as in our case) it does not take long. As regards the strength in the longitudinal context it could to be more reliable. Having as background information the comments and concepts discussed above, the aim of this paper is to evaluate the longitudinal reliability that can be achieved taking the transformation events into account while building up a longitudinal dataset. A dataset construction by means of the grouping of firms involved in M&A events is undertaken following the scheme presented in Fig. 2. Namely the set of firms involved in one or more events is clustered by using the Event Register. Afterwards, we considered the rest of ASIA register and these firms (not involved in any events) are “clustered” by assigning a cluster number (cig) to each single firm. Subsequently, the list of enterprises obtained is linked to the economic data coming from financial reports of limited enterprises laid down to Chamber of Commerce for a time 4 years span from 2008 to 2011. Having the availability of the longitudinal dataset based on two different approaches, i.e., by grouping firms involved in transformation events and by the simply code panel approach, a comparison of these dataset in terms of representativity and coverage with respect to the whole population of financial reports (i.e., limited enterprises in Italy) is studied in this paper. From here on we denote the first panel as cig panel and the simple code panel as cod panel.

4 Methodology for the statistical comparison

In order to compare the distribution of economic variables across the databases the following analyses are undertaken.

First, Spearman's rank correlation coefficient is applied to evaluate whether the ranking of different economic sectors remains the same in cig panel and cod panel with reference to the total population. This analysis is useful to understand if changes in the ordinal structure are present, Spearman’s rank correlation coefficients for value added, persons employed and other economic variables for the years 2008 and 2011 were calculated with reference to the data classified by economic activity.

Since Spearman’s rank correlation verifies only an ordinal ranking on the relative variations among the economic divisions, it provides no information on absolute changes. To overcome this problem, as second step of our analyses, the Fligner-Policello (F-P) test is used to verify the equality of distributions for cig panel and cod panel with respect to the universe of the relative share of the economic divisions of the economic variables considered with respect to the totals for each year. Thus, the equality of the distributions of sectorial shares of the economic variables in cig panel and cod panel with respect to total population for each year is tested using the F-P test [5, 8]. For an application in a similar context see Biffignandi et al. [3].

At the last step, an R-index is applied to evaluate the representativeness of cig panel and cod panel in terms of sector breakdown and size class. The focus is particularly on the comparison of the response composition to that of our population. R-indicators have been proposed in recent literature [12,13,14,15]. They are useful to compare the quality of different surveys or registers and apply to different situations. They can be used to compare the response to different surveys sharing the same target population to understand deviations from representativeness. The R-indicators are also useful to compare response to a survey over time, e.g., monthly, quarterly or annually. A further application is to monitor the response to a survey during data collection, in this case the representativeness of the sample during the fieldwork is checked and it is possible to adjust the data collection process.

Representativity is based on the comparison of response propensities for different units given their values on a specified set of auxiliary variables. When all response propensities in the sample are equal, the sample is representative. To compute the standardized bias the response rate is combined with the R-indicator. When comparing a survey in time, the number of sets of auxiliary variables has to be fixed in the computation of the R-indicators. In this case it is possible to include also variables that correlate to the main survey items, even if they not belong to the general population characteristics. An application of this methodology in a temporal context is provided in Biffignandi and Zeli [2].

It is also possible to compute partial R-indicators. The conditional partial R-indicator measures the impact of a variable on the lack of representativity, corrected for the impact of other variables in the logit model. It is computed as the remaining within-cell variation after removal of the variable from the model. It measures ‘pure’ effects. A large value means the variable is important for explaining lack of representativity. By decomposing the variation in response propensities according to different values of the auxiliary variables, the representativeness with respect to each specific variable is obtained. The unconditional partial R-indicator for a category of a variable measures the contribution of a category to the conditional partial indicator. It measures the impact of a single variable/category on deviations from representative response. A large value means a category is a source of representativity problems. Large value can mean under- or over-representation. In all cases, partial R-indicators ideally have values equal to zero. The unconditional category-level partial R-indicators have a positive or negative sign. The sign indicates the direction of the representation of the category, e.g., a negative sign for males implies that males responded less to the survey. The conditional category level partial R-indicators do not have a sign and are always positive. In fact, in this conditional case, a category may perform better in one subpopulation but not be better in another. For example, textile sector may do better than average while food sector may do worse than average. We refer to Schouten et al. [13] for detailed descriptions of the various indicators.

5 Preliminary results

Some preliminary analyses are carried out in order to detect the possible differences in coverage rate for the event panel (cig) and the code panel (cod) with respect to the target population of whole limited firms in Italy [1]. Table 2 presents the coverage rates of the two panels with respects of target population for some important economic variables such as: turnover, intermediate costs, personnel costs, value added, persons employed and number of firms. cig represents the M&A consistent panel data and cod represents simple code linked panel data.

Table 2 Coverage rate of the two panel with respect to target population—years 2008–2011

Every year, cig panel performs for all variables better than the cod one in term of coverage rate. The difference between the two panels with respect to firm’s coverage rate is anyway rather small. From 2088 to 2011, firm’s coverage rate is fairly stable in both panels. As regards the coverage rate in terms of economic variables, the difference between the two panels is a quite relevant value. The positive issue is that the coverage rate in increasing from 2008 to 2011. The critical point is that the difference in terms of rate remains high. The cig panel is better in terms of higher coverage rate. If we compute the implications in terms of value of each economic aggregate, the different performance of cig panel with respect to cod panel becomes relevant.

6 Results

This section presents the results obtained by the application of the methodology described in the Sect. 4. First of all, our approach focuses on the ranking of the sectors, which is a critical economic information to support policy decision making, and to understand the changing pattern of an economic system. Spearman correlation coefficient is providing a synthetic indicator to compare the ranking; therefore, we apply it as preliminary information for the quality evaluation of the panels. The statistical notation of this coefficient is reported in Appendix B. The Spearman correlation coefficients are calculated at the two-digit NACE level (economic divisions). For the years 2008 and 2011, the results are shown in Table 3.

Table 3 Spearman correlation coefficients between total population and cig–cod panel for Nace two-digit level of some relevant economic variables—years 2008 and 2011

Table 3 shows similar Spearman coefficients values for cig and cod panel. It means that at NACE 2-digit levels the ranking of economic sectors does not vary much from one approach to the other. Anyway, cig Spearman coefficients are higher than cod ones for any variables in both years. Therefore, implementing the grouping of firms by events improves representativity at least in terms of ranking of economic activity.

Spearman’s rank correlation verifies only an ordinal ranking on the relative changes among the economic activities. To get insights on absolute changes, as a second step of our procedure we apply the Fligner-Policello (FP) test. For the characteristics of this test references are Fligner and Policello [5] and Hollander and Wolfe [8], in Appendix B basic statistical notations are reported. FP test is useful to verify the equality of the distributions (null hypothesis) and eventually to identify which distribution is dominant (alternative hypothesis) FP tests are calculated on the two-digit NACE level (economic divisions) distributions separately for the cig and the cod panel; each panel is compared with the original population. FP-test results for the distributions for 2008 and 2011 are presented in Table 4.

Table 4 FP-test results for the distributions of selected relevant economic variables (year 2008 and 2011)

FP test values and the related p values (higher than 10%) do not point out significant difference between the two panels and the original population of firms for the variables under study. Thus, the hypothesis of equality of distributions is accepted. Therefore, both the event panel and the cod panel well represent the original distribution of firms. Only “Persons employed” variable has probability slightly smaller (under 10%) for the equality of the distributions. If the values of FP test are considered as a “distance” between the panel distributions and the target population, the test value for the cig panel is in every case (but the person employed for the year 2008) smaller than the one for the cod data. It means that the cig panel is closer to original distribution with respect to the cod one. Thus, this test confirms that the cig panel approach improves representativity with respect to the cod one. To support a comparison of the quality of the two panels that share the same target population, our approach focuses on a third type of analysis: R-indicators, as a measure of representativeness. They have been proposed in literature by [12] (additional references are [13, 15], in Appendix B some statistical characteristics are summarized. The R-indexes are calculated at firm micro-data level. In Table 5 the R_indicators are reported as well as the R variance (R_variance), the R standard error (std_R), the confidence interval (ci_R) and lower and upper bound for the R estimate (LB_R and UB_R). However, one has to bear in mind that R-indexes, by considering only the population of firms, display less sensitivity, with respect to economic variables, to the application of the two different panel methodologies. The results show similar values in the cod and cig panel.

Table 5 R-indicator for cod and cig panels—years 2008 and 2011

To measure dissimilarity of each panel from the target population at category level of the two auxiliary variables unconditional partial indicators are computed. As regards the category level unconditional partial indicators, the two considered variables are economic sectors and size class. Economic sectors breakdown ai at sections of NACE rev.2 level, whereas firm size breakdown considers eight classes of number of persons employed, i.e. less than 5, 5 to 10, 11 to 20, 21 to 50, 51 to 100, 101 to 250, 251 to 1000, over 1000. Figure 3 shows the values of category level partial indicators for the economic sectors (sections of NACE rev.2) for year 2008.

Fig. 3
figure 3

Category level unconditional partial indicators (economic sectors)—year 2008. See Appendix A for the sectoral activity codes description

Mining (B), manufacturing (C), trade (G) and other business services in L are in both type of panel over-represented while the other sectors are underrepresented. In 2008 cig panel presents higher over representativity for industrial sectors and trade with respect to cod panel. Size class in terms of persons employed partial indicators for year 2008 are presented in Fig. 4. In this case cig panel represents better the larger firms over five persons employed while it registers an underrepresentation for the smallest firms.

Fig. 4
figure 4

Category level unconditional partial indicators (size class)—year 2008. See Appendix A for the sectoral activity codes description

In year 2011 (Fig. 5) many economic sectors present an over representativity: mining (B), manufacturing (C) and water supply (E) among industrial sectors and trade (G), information and communication (J) and professional activities (M) among services.

Fig. 5
figure 5

Category level unconditional partial indicators (economic sectors)—year 2011

For year 2011 the partial indicators for size class over represented all size classes but the micro-firms with less than five persons employed (the most numerous one). R partial indicators register a better representation for cod panel for the smallest firms (with less than 10 persons employed this time) and confirm for cig panel a greater over-representativity for larger enterprises with more than ten persons employed (Fig. 6).

Fig. 6
figure 6

Category level unconditional partial indicators (size class)—year 2011

7 Final remarks

The paper considers two alternative approaches to build up longitudinal business panels: the new multi-source approach for taking transformation events in the panel construction is proposed (cig database) and the longitudinal panel based on simple code approach (cod database). Panels’ units are linked to the Chamber of Commerce balance sheet reports database. Selected relevant economic variables (turnover, personnel costs, value added, person employed) are computed with reference to each longitudinal panel and results compared.

The aim of this study is to understand if the two approaches provide estimates of the variables of valuable quality for the economic analysis and to compare their performance. A methodological approach based on a sequence of tests is proposed to answer the research question. The value added of the presented approach is to gain insights in key issues for understanding the longitudinal characteristics of the economic aggregates, namely the ranking of sectors and the distribution.

The adopted test and methodologies show that the ranking of economic sectors does not vary very much from one approach to the other, and both panels well represent the original distribution of firms. Nevertheless, looking to the results in more detail, the firms grouped by events seem to improve representativity in terms of ranking and distribution.

Considering the partial indicators for size class register, both in the 2008 and in 2011 a better representation for cod panel for the smallest firms (with less than 10 persons employed) is detected, whereas cig panel confirms a greater over-representativity for larger enterprises (with more than 10 persons employed). As regards economic sectors, partial indicators show that mining, manufacturing, trade are in both type of panel over-represented while the other sectors are underrepresented.

These results confirm that the proposed approach for panel construction taking in account transformation events performs well, even better than the simple code approach.

This study has the limitation that only few variables are processed to evaluate and compare panel quality; even if the considered variables are the most important from economic point of view, it would be interesting to check the improvement of the proposed approach over other variables. Further studies to evaluate in more detail the impact of the proposed approach to economic results are being undertaken.