1 Introduction

As cutting-edge information technologies such as artificial intelligence and the Internet of Things continue to be combined with various new technologies, most products and services in our society are becoming intelligent. As a key source of data in the era of the 4th Industrial Revolution, public data produced by central ministries, local governments, and various government-affiliated organizations are becoming increasingly useful, creating economic and social value in various forms in both the public and private sectors.

The term “public data” refers to information or data created or authorized by the government, provided free of charge to everyone, and reusable with machine-readable features (OECD 2018). The openness of public data is a policy representative of the global open government paradigm. In 2009, the Obama administration emphasized the importance of increasing private access to and utilization of public data, and established a culture of transparency, participation, and collaboration (White House 2009). In Korea, alongside the global trend of open government, the Act on Promotion of Provision and Use of Public Data was enacted in 2013 to make data held by all public institutions, including national and local governments, open to the private sector; this has created socio-economic value in various fields.

In the Korean public data portal, data-providing institutions are divided into the central government, local governments, public institutions, and the office of education. The content of the disclosed data differs by provider. Data from the central government and public institutions deal with administrative information on national issues in a field of specialization. In contrast, local government data, accumulated in close contact with people's daily lives, deal with local innovation and current issues closely related to the local community. As a result, public data differ in subject scope depending on the type of government, leading inevitably to a difference in usability. Therefore, it is necessary to empirically analyze the differences in the intellectual structure of data made public by different government entities to determine which data characteristics lead to usability. A prior understanding of the content and utilization of government data open to private consumers is very useful for concept development and design in private business projects. Additionally, from the operator's point of view, it is essential to enhance the utility of public data for better policy development.

This study aims to identify differences in content by grouping public data thematically via a network analysis of the relationships between keywords assigned to public data. It also aims to identify the clusters that show high usability and the types of government that provide access to such data. Keyword network analysis is a methodology that reveals the intellectual structure of a domain as a network by extracting keywords from a dataset and calculating the similarity of each keyword pair based on how often the two keywords appear together. This study utilized the Pathfinder network (PFNet) technique, which removes links whose weights violate the triangle inequality in a network. In this way, the thematic differences in the data released by the four types of government can be understood and interpreted in relation to their utilization. To achieve this, the following research questions were established.

  1. (RQ1) What is the difference between the subject clusters of data released by the central government, local governments, public institutions, and the office of education?

  2. (RQ2) What is the difference in the utilization of data disclosed by the central government, local governments, public institutions, and the office of education? In addition, which subject clusters show high utilization?

2 Prior research

Research on public data shows various interdisciplinary characteristics, which can be summarized as a discussion on usability, quality, and content as follows.

Recent studies have discussed the quality and usability of public data. Vetrò et al. (2016) evaluated the quality of Italian open government data (OGD), dividing samples into centralized and municipal level decentralized data. They identified the problem of insufficient metadata and incomplete information provision for data updates, arguing that policy guidelines for a decentralized environment should be supplemented. Kim et al. (2021) evaluated the utilization of a local government data portal open to the Korean public. They pointed out that although the size of open data in metropolitan cities such as Seoul and Busan is large, the number of downloads is not as high, arguing that it is necessary to identify themes that can satisfy demand rather than the quantitative scale of the provided data. In another study on the properties that affect the use of each type of public data (Lim 2021), Lim identified the attributing factors that should be considered relatively more important when opening data for each public institution with various purposes. He said that an influencing factor in local governments is whether to adopt an open format that allows users to easily and freely utilize the data. In contrast, in national administrative agencies, public corporations, and industrial complexes, the amount of open data is regarded as an influencing factor.

Research has also been published that analyzes the knowledge domain surrounding public data based on semantic analysis. Charalabidis et al. (2016) analyzed the knowledge domain of research papers centered on open government data (OGD). They identified four major research areas (management and policy, infrastructure, interoperability, and use and value), as well as 35 detailed research topics. The semantic analysis of the open public data directive (OPDD), which is a long-term plan for open data policy, is also noteworthy. Jung and Park (2015) performed a semantic network analysis targeting the open data policies of 34 government agencies. The results revealed that "open data", "open", and "openAPI" were high-frequency keywords. Moreover, the keyword “quality” was suggested to have a high correlation with “public data,” “information,” and “utilization.” They noted that it is possible to identify the core agenda included in public policy documents using semantic analysis.

Finally, as in this study, the contents of public data have been analyzed through co-word-based semantic analysis. Shin (2021) collected the titles of datasets made available by 243 local governments through public data portals and analyzed the contents using text mining techniques. It was found that local governments most frequently open information on the status of local facilities and companies as well as other topics such as environmental inspection, public facilities, licensing and enforcement, and local tourism. Lee (2020) proposed a method for predicting public data demand in a timely manner through keyword analysis of the data requests submitted to public data portals. According to the results of the analysis, most public data belonging to topics with high demand are provided by the domestic public data portal (data.go.kr), but public data related to the actual needs of users are rarely provided. Demand for the keywords “disease” and “treatment” has recently surged as a result of the COVID-19 pandemic, but the amount of open data is low. Similarly, Cho and Ha (2020) analyzed the data requests submitted to the public data portal "Data 1st Avenue" and identified nine main topics of interest, including transportation, medical welfare, administration, and real-time data.

A study that analyzes the conceptual relationship of public data is also noteworthy: Jeong et al. (2017) proposed a concept pair that predicts future mashup demand at a time when the use of public data and its convergence are becoming important issues due to the 4th Industrial Revolution.

This study is differentiated from previous studies in that it provides empirical knowledge about which types of data were opened by each government type with different characteristics by identifying the contents of public data. In addition, it is meaningful in that it analyzed not only the content characteristics but also the usability at the same time.

3 Research method

As of June 2021, the data with the highest number of downloads were extracted from the public data portal for each government type (central government, local government, public institution, and office of education). The analysis target comprised a total of 1,200 cases, with 300 data items extracted from each of the four types of government.

First, the average number of downloads was compared across government types through descriptive statistical analysis, and the gap in the number of downloads within each type was measured using the Gini coefficient. The Gini coefficient is an index originally used to describe inequality in the distribution of wealth; it is 0 for perfect equality and approaches 1 for maximal inequality. It is used to explain gaps in various fields such as economics and information science, and it was used here to identify the government types whose data display large deviations in usability. This analysis used SPSS 25 and the ineq function of R.
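The Gini coefficient described above can be illustrated with a short Python sketch. This is not the SPSS or R procedure used in the study; it is a minimal version of the common rank-weighted formula without small-sample correction, and the function name `gini` and the sample values are assumptions made for illustration only.

```python
def gini(values):
    """Gini coefficient of non-negative numbers (0 means perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted form: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,
    # where x_i are the values in ascending order and i starts at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2.0 * weighted / (n * total) - (n + 1.0) / n


# Hypothetical download counts, not taken from the study.
even_use = [100, 100, 100, 100]
long_tail = [0, 0, 0, 400]
print(gini(even_use), gini(long_tail))
```

With this uncorrected form, a sample in which a single item receives all downloads yields (n - 1) / n rather than exactly 1, which is why the value only approaches 1 as the number of items grows.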

Second, keyword network analysis was performed to determine the difference between the subject clusters of data released by the central government, local governments, public institutions, and the office of education. The analysis covered a total of 3,600 keywords, with three keywords assigned to the metadata of each of the 1,200 data items. Similar concepts, such as synonyms, broader terms, and narrower terms, were integrated by stemming the extracted keywords. Pathfinder weighted network analysis was performed four times, once for each government type, on the keywords that showed a frequency of four or more. A Pathfinder network is a weighted network created by removing the links that violate the triangle inequality (Lee 2006a). Like traditional methodologies such as multidimensional scaling and cluster analysis, Pathfinder expresses the overall structure well, and it has also been evaluated as more advantageous in expressing detailed structures using parallel nearest neighbor clustering (PNNC) (Lee 2006a, 2006b). Here, weighted network analysis was performed using COOC and Wnet developed by Lee (https://cafe.daum.net/wnets), and subject clusters were created for each government type.
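As a rough illustration of the co-occurrence counting and Pathfinder pruning described above, the following Python sketch may help. It is not the COOC/Wnet implementation used in the study. It assumes that raw co-occurrence counts serve as the similarity measure and that the network is pruned with the commonly used parameters r = ∞ and q = n − 1, neither of which is stated in the text; the function names and sample records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(records):
    """Count how often each unordered keyword pair is assigned to the same item."""
    counts = Counter()
    for keywords in records:
        for a, b in combinations(sorted(set(keywords)), 2):
            counts[(a, b)] += 1
    return counts


def pathfinder(sim):
    """Prune a similarity network to PFNet(r=inf, q=n-1) (assumed parameters).

    sim maps sorted pairs (a, b) to a positive similarity. A link survives only
    if no indirect path joins its endpoints with a larger weakest link, which is
    the triangle inequality for r = infinity expressed on similarities.
    """
    nodes = sorted({node for pair in sim for node in pair})
    # best[a][b]: largest "weakest link" over all paths found so far from a to b.
    best = {a: {b: 0.0 for b in nodes} for a in nodes}
    for (a, b), weight in sim.items():
        best[a][b] = best[b][a] = float(weight)
    for k in nodes:  # Floyd-Warshall with (max, min) in place of (min, +)
        for a in nodes:
            for b in nodes:
                via_k = min(best[a][k], best[k][b])
                if via_k > best[a][b]:
                    best[a][b] = via_k
    return {pair: w for pair, w in sim.items() if w >= best[pair[0]][pair[1]]}


# Hypothetical records: three keywords per data item.
records = [
    ["crime", "arrest", "police"],
    ["crime", "arrest", "statistics"],
    ["crime", "police", "region"],
]
links = pathfinder(cooccurrence(records))
```

Tied weights are all retained under this rule, so the pruned network can keep more than n − 1 links when several paths are equally strong.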

Third, the number of downloads associated with each keyword selected for analysis was used to compare utilization by cluster. Each selected keyword was traced back to the public data items to which it was assigned using Kutools for Excel (https://www.extendoffice.com/product/kutools-for-excel.html), and the download counts of those items were summed. For example, if five public data items contained the keyword crime, the sum of the download counts of those five items was taken as the download count for the keyword crime. The usage count of a cluster is the sum of these keyword-level values over the keywords included in the cluster, and this was used to compare relative utility among the subject clusters.
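The two-step aggregation described above (downloads per keyword, then per cluster) can be sketched in Python as follows. The study performed this step in Excel; the function names, the data layout, and the sample values here are assumptions for illustration.

```python
from collections import defaultdict


def downloads_by_keyword(items):
    """Sum the downloads of every data item that carries each keyword.

    items is a list of (keywords, downloads) pairs.
    """
    totals = defaultdict(int)
    for keywords, downloads in items:
        for keyword in set(keywords):
            totals[keyword] += downloads
    return dict(totals)


def downloads_by_cluster(keyword_totals, clusters):
    """Sum the keyword-level download counts over the keywords of each cluster."""
    return {
        name: sum(keyword_totals.get(keyword, 0) for keyword in members)
        for name, members in clusters.items()
    }


# Hypothetical data items and cluster membership, not taken from the study.
items = [
    (["crime", "arrest", "police"], 100),
    (["crime", "statistics", "region"], 50),
]
clusters = {"Crime": ["crime", "arrest"], "National statistics": ["statistics"]}
per_keyword = downloads_by_keyword(items)
per_cluster = downloads_by_cluster(per_keyword, clusters)
```

Because the cluster total is a sum of keyword totals, a data item whose keywords fall into the same cluster contributes once per keyword, which follows the procedure as described in the text.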

4 Results of the analyses

4.1 Government data usability analysis

First, the ten most highly utilized data items were examined. Most of them were published by public institutions, and data from the central government were also included. However, the data disclosed by local governments and education authorities were not included in the top 10 rankings. Of the openly accessible data from public institutions, business district information from the Small Enterprise and Market Service, health checkup information from the National Health Insurance Service, and news big data from the Korea Press Foundation showed the highest utilization. For the central government data, 3D printing status data from the Cultural Heritage Administration and information on the import and export of seafood from the Ministry of Oceans and Fisheries showed high utilization. The business district data from the Small Enterprise and Market Service are used for self-diagnosis by start-ups and for marketing strategies by merchants. In addition, the data disclosed by the National Health Insurance Service are being actively used in mobile medical services, health and obesity management, and customized medical research (Table 1).

Table 1 Highly utilized public data

On the other hand, when the download statistics of open data are plotted, a long-tail phenomenon is observed. Only a few of the data items show high utilization, and most of the data show relatively low utilization, resulting in a long tail. In particular, data released by public institutions have the longest tail, and data released by the central government show a similar pattern. In the case of data from local governments and the office of education, there were not many cases of extremely high utilization, and the overall utilization was low. In terms of the average number of downloads (Table 2), public institutions had an overwhelming 4,462 downloads, followed by the central government with 2,444 downloads. Local governments and offices of education had relatively low download numbers of 1,602 and 554, respectively. When the Gini coefficient was used to measure the gap in downloads within each government type, public institutions and the central government showed extremely long tails, with g = 0.50 and g = 0.44, respectively, indicating a gap in utilization. In contrast, although the overall utilization of local government and office of education data was low, open data were being used relatively evenly, with g = 0.30 and g = 0.34, respectively (Fig. 1).

Table 2 Download average and gap index of data by type of government
Fig. 1 Differences in the number of downloads of public data by government type

By drawing a scatter plot of the utilization and gaps by government type (Fig. 2), it is possible to visually confirm the difference in utilization patterns of the open data according to each government type. Examining the download average on the y-axis and the Gini coefficient on the x-axis revealed that public institutions and central government data are located in the first quadrant and its boundary, while local governments and the office of education are located in the third quadrant, with the two groups displaying opposite patterns. Although public institutions, located in the upper right corner, have the largest data utilization, they also show the largest data utilization gap. Central government data also have high utilization, but gaps still exist. In contrast, the office of education data exhibit the lowest utilization with a relatively low gap index. Finally, the local government data show the smallest gap in utilization, but their utilization is lower than that of both the central government and the public institution data. In summary, although data from public institutions and the central government, which deal with nationwide administrative data in specialized fields, are highly utilized, there is a large gap in utilization among individual data items. The overall utilization of the local government and office of education data, that is, regional data, is low, so the gap in utilization is not as large. Subsequently, the data of each government entity were compared through keyword network analysis to identify the types of data that make up each topic cluster and thus the clusters that show high utilization.

Fig. 2 Data utilization and utilization gaps by government type
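The quadrant placement described above can be checked with a small Python sketch using the averages and Gini coefficients reported in the text. The text does not state where the axes of Fig. 2 cross; the sketch assumes they cross at the mean of the four values on each axis, and the function name `quadrants` is hypothetical.

```python
# Values reported in the text: (Gini coefficient, average number of downloads).
reported = {
    "public institution": (0.50, 4462),
    "central government": (0.44, 2444),
    "local government": (0.30, 1602),
    "office of education": (0.34, 554),
}


def quadrants(points):
    """Assign each point to a quadrant around the mean of x and the mean of y.

    Assumption: the axes cross at the means; the source does not specify this.
    """
    mean_x = sum(x for x, _ in points.values()) / len(points)
    mean_y = sum(y for _, y in points.values()) / len(points)
    labels = {}
    for name, (x, y) in points.items():
        if x >= mean_x:
            labels[name] = 1 if y >= mean_y else 4
        else:
            labels[name] = 2 if y >= mean_y else 3
    return labels


labels = quadrants(reported)
```

Under this assumption the central government average lies only slightly above the mean of the four averages, which is consistent with the description of its position as near the boundary of the first quadrant.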

4.2 Analysis of subject cluster differences

4.2.1 Public institutions

First, by integrating similar concepts through a stemming process for the 900 keywords extracted from the open data of public institutions, 515 refined keywords were obtained. Among these, network analysis was performed on the 40 keywords that appeared four or more times. The results of the analysis of the keyword occurrence frequency are shown in Table 3. Keywords implying specialized topics such as house (21), small business (13), traffic (11), and guarantee (10) had a high frequency of appearance. Houses, small businesses, generation of electricity, etc. are major issues in Korean society, representing topics that correspond to national issues. Network analysis of these keywords revealed 11 topic clusters (Real estate, Small business, Transportation, Energy, Statistics, Tourism, News, Big data, Airport, Other social issue, Healthcare).

Table 3 High-occurrence keywords and subject clusters (public institutions)
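The refinement and frequency filtering described for the public institution keywords can be sketched in Python as follows. The rules the authors used to integrate similar concepts are not given in the text, so the synonym table below is purely hypothetical, as are the function names; only the threshold of four occurrences comes from the study.

```python
from collections import Counter


def refine(keywords, preferred):
    """Map each raw keyword to its preferred form (unchanged if none is listed)."""
    cleaned = (keyword.strip().lower() for keyword in keywords)
    return [preferred.get(keyword, keyword) for keyword in cleaned]


def frequent(keywords, min_count=4):
    """Return the keywords that occur at least min_count times, with their counts."""
    counts = Counter(keywords)
    return {keyword: n for keyword, n in counts.items() if n >= min_count}


# Hypothetical raw keywords and synonym table, not taken from the study.
raw = ["House", "housing", "house ", "home", "traffic", "guarantee"]
preferred = {"housing": "house", "home": "house"}
selected = frequent(refine(raw, preferred))
```

Only the keywords returned by `frequent` would then enter the network analysis, mirroring the reduction from 900 raw keywords to 515 refined keywords and then to 40 analyzed keywords reported above.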

Among the 11 clusters in Fig. 3, the cluster with the highest utilization was Healthcare, with an overwhelming number of downloads, much higher than that of the subject clusters of the other three government types. This cluster includes health checkup and treatment history information. The Real estate cluster showed the second-highest utilization; collateral details, personal guarantees, and real estate public sales were included in this cluster. In addition, the Small business, Transportation, and Energy clusters also showed high usability.

Fig. 3 Network maps and clusters (public institutions)

4.2.2 Central government

For the central government data, 436 refined keywords were obtained by integrating similar concepts, and the 48 keywords that appeared four or more times were analyzed. First, assessment of the frequency of occurrence (Table 4) revealed that the keyword crime was overwhelming with 61 occurrences, followed by statistics (29), foreigners (14), national taxes (11), and industrial complexes (10). In addition, keywords related to national administration, such as land, nationality, and immigration, appeared. Then, 15 clusters (Crime, National statistics, Foreigner/immigration, Safety policing, Industry, Business work and other issues, Healthcare / public facility, Land, Disabled, Information, Livestock, School, Small business, Delinquent list, Accident) were formed. Excluding information and statistics, which span comprehensive subject areas, the Crime cluster was the most frequently used, and the Safety policing and Foreigner/immigration clusters also showed high usability, as shown in Fig. 4. The Crime and Safety policing clusters include the crime arrest status, and the Foreigner/immigration cluster includes the nationality of domestic and foreign nationals, immigration status by port, and foreign residence data.

Table 4 High-occurrence keywords and subject clusters (central government)
Fig. 4 Network map and clusters (central government)

4.2.3 Local government

For the local government data, analysis of the 53 keywords that appeared four or more times among the 515 keywords obtained by integrating similar concepts revealed that keywords closely related to local life, such as local factory and manufacturing, household and population, local transport, book catalog, and restaurant, appeared most often. Among them, the most frequent were factory (31), household and population (28), and manufacturing (19), as shown in Table 5. Then, 16 clusters (Local factories and manufacturing, Resident registration status, Local transport, Local public library, Tourist attractions and commercial areas, Construction and building registration, Local industry, Local medical care, Restaurant catering, Car registration, Food hygiene, Local currency, Pollution, Livestock quarantine, Public facilities and policing, and Elderly welfare) were formed. The cluster with the highest utilization was Local factories and manufacturing, followed by Resident registration status and Car registration, as shown in Fig. 5. Local factories and manufacturing includes keywords such as local companies, manufacturing, and products, and Resident registration status includes population and household status. In addition, data closely related to daily life, such as new library books, vehicle registration status, welfare facility status, tourist status, and building permit statistics, are included. Thus, local governments mainly disclose data on permits, registrations, and status necessary for autonomous administration, and these data consequently do not show the high utilization seen in data from the central government and public institutions.

Table 5 High-occurrence keywords and subject clusters (local governments)
Fig. 5 Network map and clustering (local government)

4.2.4 Office of education

Finally, for the data published by the office of education, the 38 keywords that appeared four or more times among the 253 refined keywords were analyzed. Keywords related to school, academy, lifelong education, and library books mainly appeared. Among them, school (96), school closing (70), academy (49), books (47), and lifelong education (41) showed the highest frequency of appearance (Table 6). The data were represented by 11 clusters (Lifelong education, Office of education / school closing, Academics, Libraries, School district, Elementary and middle school, Education budget, Contact information, Kindergarten, Education statistics, and Others). Among them, Lifelong education was found to be the most highly utilized cluster, followed by Academics and Libraries, as shown in Fig. 6. The Lifelong education cluster includes lifelong education courses, the Academics cluster includes the current status of private academies and classrooms, and the Libraries cluster includes a list of new books. As with the local government data, the office of education data mainly consist of education-related status or statistical data opened to the public for each educational district, as a result of which nationwide utility seems to be weak.

Table 6 High-occurrence keywords and subject clusters (Office of Education)
Fig. 6 Network map and clustering (Office of Education)

5 Discussion and conclusion

The results of this study have several implications. First, it was possible to accumulate empirical knowledge on the open data utilized by each government type for different purposes by identifying the content of public data, considered to be in an academic vacuum, through keyword network analysis. Such prior knowledge can increase understanding of and knowledge on data offered to users and can contribute to business planning in the private sector.

The findings based on each research question presented in this study are summarized as follows:

The first research question was whether there is any difference in the content of open data made available by the different government types, namely, the central government, local governments, public institutions, and the office of education. The identification of clusters through keyword network analysis revealed that public institutions provide specialized information on national issues; the central government provides nationwide administrative information; and local governments provide information on the status and registration of items closely related to local daily life. The office of education deals with current information on the educational institutions and libraries of its districts.

The second research question was whether there is a difference in the utilization of open data by each government type, and whether a cluster of subjects with prominent utility can be identified. The analysis revealed that the utilization of data from public institutions dealing with specialized information on national issues was the highest, followed by that from the central government with nationwide administrative information; both were characterized by a high Gini coefficient due to the existence of popular data showing extremely high utilization. In contrast, local data showed an overall lower utility than national data, and data from district offices of education had still lower utility than the local government data. On the other hand, tracing the number of downloads of data back to the assigned keywords and comparing them by cluster revealed that open data on Healthcare provided by public institutions showed overwhelming use. Data on Real estate and Small business from public institutions, and on National statistics, Crime, and Safety policing from the central government, were also found to be highly utilized. In addition, for local governments and the office of education, relatively high utilization was observed for Local factories and manufacturing, Resident registration status, and Lifelong education.

However, the results of this study are based on the sampling and analysis of a total of 1,200 samples based on the number of downloads. This is a limitation of the study and follow-up studies using a larger sample are needed.

Understanding the content of public data by government type and interpreting it in relation to utilization, as in this study, is meaningful in that future utility can be predicted in preparation for developing open policies.

For future public data open policies, the following recommendations can be made. It is necessary to maximize the added value of private use, such as business creation, by opening the public institution and central government data that show high usability through open APIs, which are a convenient means of access for the private sector. In addition, because there is a large gap in the utilization of open data, it will be necessary to identify and develop specific strategies to improve the usability of data with low usability. If best practices for underutilized data or possibilities for mashups with other data are suggested, the use of underutilized data can be improved. However, data released by local governments and district offices of education are less useful than nationwide specialized data because of quantitative limitations on data supply, low consumer awareness, and a lack of human resources to promote start-ups. It is necessary to create demand from local users through the continuous discovery of data reflecting local characteristics and environments rather than recklessly increasing the scale of data (Korea Local Information Development Institute Editorial Office 2017). In addition, policies are needed that facilitate an unobstructed path between supply and demand, such as matching public data with local companies and triggering competition.