Introduction

Sharing and integrating ecological data is essential for monitoring and studying long-term ecological change (Brunt et al. 2002). Currently, however, ecological data in South Korea are scattered across numerous research sites, institutions, and individual researchers. Moreover, there has been no common protocol for ecological data collection and management, so the data are kept in a variety of forms. As a result, existing data are difficult to integrate, analyze, and manage for long-term ecological research, and domestic ecological data need to be standardized into a common form for data integration and further analysis (Michener et al. 2012, Bonet et al. 2014).

Worldwide, long-term ecological data have so far been collected in each country according to its own protocols and maintained in large databases described with Ecological Metadata Language (EML) (Fegraus et al. 2005). In particular, various long-term environmental monitoring projects, including the Environmental Change Network (ECN) (Morecroft et al. 2009), the National Ecological Observatory Network (NEON) (Keller et al. 2008), and the Long-Term Ecological Research network (LTER) (San Gil et al. 2009), provide large volumes of ecological data that are easily accessible to the public. Following this trend, Korea is also building a unified ecological data integration network. To this end, previously collected raw data must be converted into a common form, and new data must be collected under common protocols.

In this study, we developed a semi-automatic ecological data conversion tool that helps ecologists standardize ecological data easily and efficiently in a relatively short time while preserving the inherent meaning of the data. The conversion is based on predefined protocols for data collection and management. Figure 1 summarizes the overall workflow of the conversion procedure in our program.

Fig. 1 The overall workflow of the semi-automatic ecological data conversion tool

Materials and methods

Ecological data are mostly stored in text-based tables. Each row in the table represents a record that contains the values of many attributes (or characteristics) of a target species. Each column corresponds to one attribute, with a single data type and unit. For example, the “search date” attribute holds the date on which the raw data were collected, usually in a format such as YYYY-MM-DD or DD-MM-YY. With our tool, raw data are standardized in four steps: (1) data file and protocol selection, (2) species selection, (3) attribute mapping, and (4) data standardization.

In the first step, data file and protocol selection, the user uploads the raw data file to be converted and selects the predefined protocol that defines the standard attributes and data types for the target species (see Fig. 2). The current version of the tool accepts only CSV files as raw data files.

Fig. 2 Typical user interface used in the first step of data file and protocol selection
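The first step can be pictured with a minimal Python sketch. It is not the tool's actual implementation: the protocol structure `PROTOCOL`, the helper `load_raw_table`, the column names, and the sample records are all hypothetical and serve only to show a raw CSV table being loaded next to a protocol that lists standard attributes and data types.

```python
import csv
import io

# Hypothetical protocol: standard attribute name -> expected data type.
# These names are illustrative assumptions, not the tool's real protocol.
PROTOCOL = {
    "species_name": str,
    "search_date": str,  # expected as YYYY-MM-DD after standardization
    "count": int,
}


def load_raw_table(text):
    """Parse CSV text into a list of row dictionaries keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


# Made-up raw data whose column names differ from the protocol's names.
raw_csv = (
    "Name,Date,N\n"
    "Parus major,01-MAY-2010,3\n"
    "Hyla japonica,02-MAY-2010,5\n"
)
rows = load_raw_table(raw_csv)
raw_attributes = list(rows[0].keys())  # ["Name", "Date", "N"]
```

In this sketch the raw attribute names (`Name`, `Date`, `N`) intentionally differ from the standard names in `PROTOCOL`, which is the gap that the later attribute mapping step has to close.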

In the second step, the user specifies the target species to be converted from the raw data file. When the raw data file contains many species, this step extracts and converts only the data for the target species that match the chosen protocol. If the raw data include only one species corresponding to the protocol, this step can be skipped. Figure 3 shows the user interface for choosing the list of target species to extract from the raw data. Here, users locate the attribute that contains the species names and add particular species names to the “selected species list”. In this way, users can selectively convert only the part of the raw data that matches the chosen protocol. For convenience, a list of species names to be converted can also be uploaded, which makes selecting many species easier and faster.

Fig. 3 User interface of the species selection step
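The species selection step amounts to filtering rows by the values of one attribute. The sketch below illustrates the idea under stated assumptions: the function names `parse_species_list` and `filter_species`, the case-insensitive matching, and the sample rows are hypothetical and are not taken from the tool itself.

```python
def parse_species_list(text):
    """Read one species name per line, skipping blank lines."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def filter_species(rows, species_attribute, selected_species):
    """Keep only rows whose species attribute is in the selected list.

    Matching ignores case and surrounding whitespace (an assumption made
    for this sketch; the tool's matching rule is not described).
    """
    wanted = {name.strip().lower() for name in selected_species}
    return [
        row for row in rows
        if row[species_attribute].strip().lower() in wanted
    ]


# Made-up raw rows containing more than one species.
rows = [
    {"Name": "Parus major", "Date": "01-MAY-2010"},
    {"Name": "Hyla japonica", "Date": "02-MAY-2010"},
    {"Name": "parus major ", "Date": "03-MAY-2010"},
]

# A species list as it might be uploaded from a text file.
selected = parse_species_list("Parus major\n\n")
target_rows = filter_species(rows, "Name", selected)
```

Here `target_rows` holds the two records for the selected species, and the record for the other species is left out of the conversion.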

In the third step, attribute mapping, users specify which attributes in the raw data correspond to which standard attributes defined in the protocol. Once a pair of attributes has been specified, pressing the “mapping” button (Fig. 4) registers the mapping in the data conversion procedure. Raw data attributes that are not selected are excluded from the subsequent conversion process. For convenience, a prepared mapping list between the two sets of attributes can also be used.

Fig. 4 User interface of the attribute mapping step
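Attribute mapping can be thought of as renaming the mapped columns and dropping the rest. The following sketch assumes the mapping is a plain dictionary from raw attribute name to standard attribute name. The function `apply_mapping` and all attribute names are hypothetical.

```python
def apply_mapping(rows, mapping):
    """Rename mapped raw attributes to their standard names.

    Raw attributes that do not appear in `mapping` are dropped, mirroring
    the exclusion of non-selected attributes from the conversion.
    """
    return [
        {standard: row[raw] for raw, standard in mapping.items()}
        for row in rows
    ]


# Hypothetical mapping: raw attribute -> standard attribute in the protocol.
mapping = {"Name": "species_name", "Date": "search_date"}

rows = [{"Name": "Parus major", "Date": "01-MAY-2010", "Memo": "cloudy"}]
mapped = apply_mapping(rows, mapping)
# "Memo" has no mapping, so it does not appear in the result.
```

Keeping the mapping as data rather than code is one way such a mapping list could be saved and reused across files that share the same raw layout.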

In the final step, the data type and unit of each attribute are transformed into the standardized format. For this purpose, the tool provides several functions: concatenation, separation, substitution, date conversion, unit conversion, and editing (Fig. 5). The concatenation function merges the values of two or more attributes into one new value, optionally inserting a text or symbol as a delimiter between the combined values. The separation function divides a string into several chunks. For example, the “search period” attribute can be separated into two attributes, “search start date” and “search end date.” The substitution function replaces certain values with different values, e.g., texts, numbers, delimiters, or symbols. The date conversion function specifies the desired format of the search date. For example, it splits a search date into day, month, and year and then rearranges them into the desired order, such as YYYY-MM-DD. The unit conversion function changes the unit of the data, and the editing function transforms numerical data by applying a formula. Finally, the standardized data are saved in table form to a new CSV file.

Fig. 5 User interface of the data standardization step
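The standardization functions described above can be sketched as small Python helpers. This is an illustration under stated assumptions, not the tool's code: every function name, the `~` delimiter used for the search period, the DD-MON-YYYY input format, and the centimeter-to-meter factor are hypothetical choices.

```python
# English month abbreviations, mapped explicitly to avoid locale dependence.
MONTHS = {
    "JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
    "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12,
}


def concatenate(values, delimiter=""):
    """Merge several attribute values into one, with an optional delimiter."""
    return delimiter.join(values)


def separate(value, delimiter):
    """Split one value into chunks at the given delimiter."""
    return [part.strip() for part in value.split(delimiter)]


def substitute(value, old, new):
    """Replace every occurrence of `old` in the value with `new`."""
    return value.replace(old, new)


def convert_date(value):
    """Rearrange a DD-MON-YYYY date (assumed input format) into YYYY-MM-DD."""
    day, month, year = value.split("-")
    return f"{int(year):04d}-{MONTHS[month.upper()]:02d}-{int(day):02d}"


def convert_unit(value, factor):
    """Convert a numeric value to another unit by a multiplicative factor."""
    return float(value) * factor


def apply_formula(value, formula):
    """Editing function: transform a numeric value with a user formula."""
    return formula(float(value))


# Split a search period into start and end dates, then standardize both.
start_raw, end_raw = separate("01-MAY-2010~05-MAY-2010", "~")
record = {
    "search_start_date": convert_date(start_raw),
    "search_end_date": convert_date(end_raw),
    "site_plot": concatenate(["A", "03"], "-"),
    "height_m": convert_unit("150", 0.01),  # centimeters to meters
    "count_doubled": apply_formula("4", lambda x: x * 2),
}
```

Chaining `separate` and `convert_date` in this way reproduces the kind of transformation described in the text, where a search period becomes a start date and an end date in YYYY-MM-DD order.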

Results and discussion

Our semi-automatic data conversion tool is a desktop application that runs on Windows and Macintosh. It helps ecologists create standardized data from raw collection data easily and efficiently. To evaluate the usability of the tool, we standardized six datasets from six observatory sites located in Korea National Parks (see Table 1 for details of the datasets). For this purpose, we used predefined protocols for five kinds of indicator species selected in the long-term ecological research of Kyungpook National University in Korea (see Table 2 for details). Overall, the raw datasets, which varied widely in data types and terms, were successfully standardized according to the predefined protocols (Table 3). For instance, the search period was divided into a search start date and a search end date, and a search date such as 01-MAY-2010 was converted to 2010-05-01, using the separation and date conversion functions. The number of records converted according to the SC protocol is equal to or smaller than that of the original raw data, because the SC protocol contains only search date and environment information, and several entities can be recorded on the same search date.

Table 1 Datasets of Korea National Parks used in this study
Table 2 Protocols of five species (measurements) used in this study
Table 3 Data conversion results from six datasets of Korea National Parks
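The reduction in record count under the SC protocol can be illustrated with a short sketch. It assumes, hypothetically, that conversion keeps one record per distinct combination of the protocol's attributes. The function `collapse_records`, the attribute names, and the sample rows are illustrative only and are not taken from the tool or from the tables.

```python
def collapse_records(rows, key_attributes):
    """Keep one record per distinct combination of the given attributes."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[name] for name in key_attributes)
        if key not in seen:
            seen.add(key)
            unique.append({name: row[name] for name in key_attributes})
    return unique


# Made-up raw records: two entities were observed on the same search date.
rows = [
    {"search_date": "2010-05-01", "weather": "clear", "species_name": "A"},
    {"search_date": "2010-05-01", "weather": "clear", "species_name": "B"},
    {"search_date": "2010-05-02", "weather": "rain", "species_name": "A"},
]

# "weather" stands in for environment information (an assumption).
converted = collapse_records(rows, ["search_date", "weather"])
```

Three raw records become two converted records here, because two of them share the same search date and environment values. When every raw record has a distinct search date, the two counts are equal.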

With our tool, standardized data in a common form can be created in a relatively short time. Moreover, because the converted data are stored and shared in the same format, numerous ecological datasets can be compared more easily, regardless of the organizations or project goals behind them. Consequently, this tool can broaden the applicability of ecological and environmental data, for example in uncovering the various effects of environmental factors on species.