In this section we describe our framework for digitizing chart images using crowdsourcing. We present our task design, which makes it easy to extract structured data, and discuss the feasibility of our framework in terms of the requester, the workers, the tasks, and the platform.
One of the most important aspects of human computation is designing the task so that non-expert workers receive intuitive, effective instructions and can work efficiently and accurately. The objective of our study was to extract data from charts in image format and convert them into a form convenient for computer processing. A possible task design is to simply ask workers to extract the chart data and place them in a CSV- or Excel-formatted file; however, the output of this approach carries no data structure, such as a distinction between row/column headers and data values, which is inconvenient for later processing steps such as data integration. Therefore, in our method, workers are instead asked to visually reproduce a chart image as a chart object in a spreadsheet using the functions of the spreadsheet software. This gives us a table linked to a chart object representing the data in the table, and lets us obtain the structure of the data, such as row and column headers and data sequences, from the properties of the chart object. Automatically identifying row and column headers in a table in a CSV file or a spreadsheet without the chart is not straightforward, but they can be easily obtained through an application programming interface if the chart object is provided with the table. This task design (Fig. 2) is an example of an implicit purpose (extracting structured data) hidden behind an explicit instruction (reproducing a chart), which is common in human computation, for example in the reCAPTCHA system.
Additionally, the structure of the data is essential for controlling the quality of the digitizing work: it provides an efficient way to aggregate tables made by different workers, which enables the common practice of asking multiple workers to complete the same task and then aggregating the results. Figure 3 shows our framework for digitizing chart images using crowdsourcing. The inputs are charts in an image file format such as JPEG. Microtasks asking crowd workers to reproduce the images in spreadsheets are generated and posted to a crowdsourcing marketplace. Several workers are assigned to each image. Each worker converts the image into a spreadsheet (in Microsoft Excel format) with an embedded chart. The axis labels, legends, and series values are extracted from the submitted file, resulting in pairs of attribute names and values. Finally, the spreadsheets obtained from the different workers are integrated into a single, higher-quality spreadsheet.
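The integration step can be illustrated with a minimal sketch. It assumes that each worker's submission has already been reduced to a mapping from (series name, x-axis label) to a numeric value, and it uses the per-cell median as the aggregation rule. Both the mapping layout and the median are assumptions made for illustration; this section does not specify the aggregation rule actually used, and the names `aggregate` and `workers` are hypothetical.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, List, Tuple

# Hypothetical key: (series name, x-axis label)
Key = Tuple[str, str]


def aggregate(submissions: List[Dict[Key, float]]) -> Dict[Key, float]:
    """Merge several workers' tables by taking the median of each cell.

    The median is an assumed rule chosen for illustration; it tolerates
    a minority of erroneous entries for a given cell.
    """
    collected: Dict[Key, List[float]] = defaultdict(list)
    for submission in submissions:
        for key, value in submission.items():
            collected[key].append(value)
    return {key: median(values) for key, values in collected.items()}


# Three made-up submissions for the same chart image.
workers = [
    {("Food", "Jan"): 10.0, ("Food", "Feb"): 12.0},
    {("Food", "Jan"): 10.0, ("Food", "Feb"): 21.0},  # digits transposed
    {("Food", "Jan"): 11.0, ("Food", "Feb"): 12.0},
]
merged = aggregate(workers)
```

Because every cell is addressed by its series name and x-axis label rather than by its position in the sheet, the tables can be merged even when workers lay them out differently.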
Although Berners-Lee ranks CSV above Microsoft Excel because the former is an open format whereas the latter is proprietary, in practice the distinction is not substantial: recent versions of Excel use the Office Open XML format, and data in this format are readable by other software. We therefore use Excel spreadsheets as the format in which to save data extracted from chart images.
Structured data extraction through visualization
During the process of visually reproducing a chart image, a worker has to specify the properties of the chart object in the spreadsheet to reflect the structure of the data represented in the chart. Such properties can be accessed by using a computer program and an application programming interface. Although there are various kinds of charts, including bar charts and line charts, most spreadsheets use a common format for their internal representations; for example, Microsoft Excel uses a three-item format (footnote 12):
A chart (Chart) has several data series (Series).
Each data series (Series) has a name (Name).
Each data series (Series) has x-axis values (XValues) and values (Values).
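The three items above can be mirrored in a small data model. The following is a minimal sketch in plain Python, not a binding to the Excel object model: the class and field names merely echo the Chart, Series, Name, XValues, and Values properties, and the helper `to_triples` and the sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Series:
    name: str              # legend entry (Name)
    x_values: List[str]    # x-axis labels (XValues)
    values: List[float]    # data values (Values)


@dataclass
class Chart:
    series: List[Series] = field(default_factory=list)


def to_triples(chart: Chart) -> List[Tuple[str, str, float]]:
    """Flatten a chart into (Name, XValue, Value) triples."""
    triples = []
    for s in chart.series:
        if len(s.x_values) != len(s.values):
            raise ValueError(f"series {s.name!r}: XValues and Values differ in length")
        for x, v in zip(s.x_values, s.values):
            triples.append((s.name, x, v))
    return triples


# Made-up example data: two series over two months.
chart = Chart([
    Series("Food", ["Jan", "Feb"], [10.0, 12.0]),
    Series("Rent", ["Jan", "Feb"], [50.0, 50.0]),
])
triples = to_triples(chart)
```

Each triple pairs a data value with the series name and x-axis label that identify it, which is the structural information a bare table does not provide.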
Figure 4 shows the relationships between the structure of a table and the properties of a Chart object. Although a two-dimensional table has several possible chart representations, the column labels and row labels correspond to the labels of the x-axis and the legends; they are extracted as XValues and Name. The data values are extracted as Values in Series objects. In tables, the choice of columns and rows is arbitrary; for example, in Figs. 5 and 6, the data categories can correspond to the rows and the months to the columns, or vice versa. In either case, Name corresponds to the categories and the XValues property corresponds to the months. This information is essential for integrating multiple tables because it makes the arbitrary choice of rows and columns irrelevant. Moreover, such information is also beneficial when converting tables into RDF format using the RDF Data Cube Vocabulary, which is the next step toward 5-star open data.
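The independence from table orientation can be checked with a short sketch. It assumes a table is given as a list of rows whose first row and first column hold the labels, and that a flag states whether each series occupies a row or a column; in the actual framework this information would come from the chart object rather than from a flag. The function name `table_to_cells` and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple, Union

Cell = Union[str, float]


def table_to_cells(table: List[List[Cell]], series_in_rows: bool) -> Dict[Tuple[str, str], float]:
    """Map each data cell to a (series name, x-axis label) key.

    The top-left corner cell is ignored; the first row and first column
    are treated as labels.
    """
    header = table[0][1:]
    cells = {}
    for row in table[1:]:
        row_label = row[0]
        for col_label, value in zip(header, row[1:]):
            # The series name comes from the row label or the column
            # label, depending on how the worker laid out the table.
            key = (row_label, col_label) if series_in_rows else (col_label, row_label)
            cells[key] = float(value)
    return cells


# The same made-up data in two layouts: categories as rows, then as columns.
by_rows = [["", "Jan", "Feb"], ["Food", 10, 12], ["Rent", 50, 50]]
by_cols = [["", "Food", "Rent"], ["Jan", 10, 50], ["Feb", 12, 50]]

same = table_to_cells(by_rows, True) == table_to_cells(by_cols, False)
```

Both layouts reduce to the same keyed cells, so tables from different workers can be compared without first agreeing on which dimension goes in the rows.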
Feasibility of our crowdsourcing framework
According to Hosseini et al., crowdsourcing can be classified on the basis of its four components: the crowdsourcer, the crowd, the crowdsourced task, and the crowdsourcing platform. In our framework, the crowdsourcer (requester) is a governmental agency that owns legacy data. Such an agency has a clear motivation for using crowdsourcing since it does not have enough human resources to convert legacy data into a machine-readable format, and crowdsourcing can reduce the monetary cost of converting the data.
The crowd in our case consists of people who can use spreadsheet software such as Microsoft Excel. The key to the success of our crowdsourcing approach is the availability of such users. Fortunately, Excel is one of the most commonly used business software products worldwide. While we do not have the precise number of Excel users, as of November 2014 more than 1.2 billion people were using one or more of the Office products, which include Excel (footnote 13). Among the 109,344 workers registered with Lancers (footnote 14), a crowdsourcing marketplace in Japan, 17,917 have “Excel” in their profiles. The task of extracting data from a chart image is not a simple microtask but a complex task that requires certain software skills. In Sect. 5, we demonstrate that we were able to gather workers with the skills needed to complete the task.
In our case study, we used Lancers, but any crowdsourcing platform that has basic functions such as task assignment and reward payment can be used. The quality control mechanism based on data aggregation runs entirely on the requester’s computer, as Visual Basic .NET (VB.NET) software. Thus, our framework can be easily implemented by most governmental agencies.