This section describes the experimental evaluation of the error insertion method, used to assess the impact of OCR errors on IR systems. The resources, tools, and configurations used in our experiments were as follows.
Data. To generate the character exchange list, we took a sample of 900 PDF documents containing abstracts from research articles published on the website of a Brazilian oil company (Footnote 1). The extracted text was manually checked and the extraction errors were fixed to create the list of <extracted, expected> pairs. The IR collection used was Folha de São Paulo, a Brazilian newspaper. It has 103K documents and 100 queries, and has been used in important evaluation campaigns such as CLEF [11].
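A character exchange list of this kind can be derived by aligning each extracted text with its corrected counterpart and recording the substituted character sequences. The sketch below is one possible way of doing this with Python's difflib; the function name and data layout are illustrative assumptions, not our exact implementation.

```python
import difflib
from collections import Counter

def build_exchange_list(pairs):
    """Collect character substitutions observed between the OCR output and the
    manually corrected text. `pairs` is a list of (extracted, expected) strings
    (hypothetical layout, used here for illustration only)."""
    exchanges = Counter()
    for extracted, expected in pairs:
        matcher = difflib.SequenceMatcher(None, expected, extracted)
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op == 'replace':
                # expected[i1:i2] was rendered by the OCR engine as extracted[j1:j2]
                exchanges[(expected[i1:i2], extracted[j1:j2])] += 1
    return exchanges

# Example: an engine that confuses "rn" with "m"
print(build_exchange_list([("modem", "modern")]))  # Counter({('rn', 'm'): 1})
```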
Table 1. Number of index terms (in thousands) and the proportional increase in comparison to the baseline.
Tools. The OCR software used was Abbyy Finereader 14 (Footnote 2). It was chosen because it produced the best results in a comparison against a number of alternatives, including Tesseract, a9t9, Omnipage, and Wondershare. The IR system was Apache Solr (Footnote 3).
Experimental Procedure. In our experiments, we varied the following parameters: the ranking function, with three possibilities, namely Cosine (COS) with TF-IDF weighting, BM25, and Divergence from Randomness (DFR); the use of stemming, applying either a light stemmer (ST) or no stemming (NS); the error rate, set to 1%, 5%, 10%, 25%, or 50%; and the candidate terms for error insertion, which were either any term from any document (ALL) or any term from the judged documents (JD). Baseline runs using the original documents were also created. These variations amounted to a total of 72 experimental runs, which were evaluated using standard IR metrics, with statistical significance measured using t-tests. Queries were formed by simply taking the title field of the topics, so as to simulate real queries, which are typically short.
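For concreteness, the error insertion step can be sketched as follows. The code assumes the documents are already tokenized and the exchange list is a sequence of (correct, erroneous) pairs; it is an illustration of the procedure described above rather than our exact implementation.

```python
import random

def corrupt_collection(documents, exchange_list, error_rate, candidate_ids=None):
    """Insert OCR-like errors into a proportion (error_rate) of the candidate term
    occurrences. documents: {doc_id: list of terms}; exchange_list: [(correct,
    erroneous), ...]; candidate_ids restricts insertion to the judged documents (JD)
    or, when None, allows any document (ALL). Hypothetical data layout."""
    candidates = [(doc_id, pos)
                  for doc_id, terms in documents.items()
                  if candidate_ids is None or doc_id in candidate_ids
                  for pos in range(len(terms))]
    for doc_id, pos in random.sample(candidates, int(error_rate * len(candidates))):
        term = documents[doc_id][pos]
        # Pick an exchange that applies to this term and corrupt a single occurrence
        applicable = [(c, e) for c, e in exchange_list if c in term]
        if applicable:
            correct, erroneous = random.choice(applicable)
            documents[doc_id][pos] = term.replace(correct, erroneous, 1)
    return documents
```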
Table 1 shows the number of index terms for each combination of error rate, use of stemming, and candidate terms. As expected, the number of index entries grows markedly with the error rate, reaching more than a four-fold increase for unstemmed runs at a 50% error rate.
The results for all experimental runs are shown in Table 2. The runs in which the decrease in mean average precision (MAP) with respect to the baseline was statistically significant at the 99% confidence level are shown in a darker shade, and those significant at the 95% level are shown in a lighter shade.
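Significance of this kind is typically assessed with a paired t-test over the per-query average precision values of the baseline and degraded runs. A minimal sketch using SciPy, assuming the per-query scores are available as lists, is shown below; it illustrates the test rather than reproducing our exact evaluation script.

```python
from scipy.stats import ttest_rel

def significance_band(baseline_ap, degraded_ap):
    """Paired t-test over per-query average precision values (hypothetical helper)."""
    _, p_value = ttest_rel(baseline_ap, degraded_ap)
    if p_value < 0.01:
        return "significant at the 99% level"   # darker shade in Table 2
    if p_value < 0.05:
        return "significant at the 95% level"   # lighter shade in Table 2
    return "not significant"
```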
The best ranking function in terms of absolute MAP values was DFR, followed by BM25. However, there was no difference in their robustness to OCR errors, as their patterns of MAP decrease were the same. Surprisingly, in two runs using the cosine function (ALL-COS-NS and JD-COS-ST), the insertion of errors at a 1% rate actually improved performance. This can be explained by the fact that errors are inserted into both relevant and non-relevant documents. In these cases, the errors were introduced into non-relevant documents, which caused relevant documents to be ranked higher.
The use of stemming consistently improved the results, i.e., all runs in which stemming was used had higher scores than their unstemmed counterparts. Stemming also made the runs more robust to OCR errors. This can be seen by comparing the loss in MAP of the runs with and without stemming: nearly all stemmed runs had smaller losses than their unstemmed counterparts. Furthermore, the benefit of stemming is more noticeable in the runs with higher error rates. This benefit can be explained by the fact that an OCR error may fall within the suffix that the stemmer removes, in which case the corrupted term is still conflated with its correct form. Looking at the correlation between the number of index terms (Table 1) and MAP, we find a strong negative correlation of −0.86. When the correlation is measured separately for stemmed and unstemmed runs, the correlations are −0.81 and −0.90, respectively. This gives further support to the benefits of stemming.
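A correlation of this kind can be computed, for instance, as a Pearson correlation between the index sizes and the corresponding MAP scores. A minimal sketch with SciPy is given below; the numbers in it are placeholders, not the values from Tables 1 and 2.

```python
from scipy.stats import pearsonr

# Placeholder values only: index sizes (in thousands of terms) and MAP of matching runs
index_sizes = [220, 260, 340, 520, 900]
map_scores = [0.32, 0.31, 0.28, 0.22, 0.15]

r, _ = pearsonr(index_sizes, map_scores)
print(f"Pearson correlation: {r:.2f}")  # strongly negative when MAP drops as the index grows
```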
Table 2. MAP results for all configurations. The numbers in brackets indicate the proportional change in relation to the baseline.
Looking at our sample of aligned extracted and expected texts (assembled from real documents), we observed an error rate of 1.5%. Considering this rate and the results in Table 2, one can conclude that the errors do not have a severe impact on IR, since significant impacts only start to appear at a 5% rate. At a 10% rate, all runs are significantly affected. Recall that this small error rate was obtained with the software that provided the best results, applied to relatively recent documents. Studies that report statistics on the proportion of errors found in OCR-ed documents mention error rates of around 20% for historical documents [4, 5, 14]. At that error rate, the degradation observed in our experiments is statistically significant.
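One way to obtain such a rate is to compute a character error rate over the aligned pairs, i.e., the edit distance between the extracted and expected texts divided by the length of the expected text. A minimal sketch, assuming a plain dynamic-programming Levenshtein distance (illustrative, and not necessarily the measure or tool used to obtain the 1.5% figure):

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution (0 if equal)
        prev = curr
    return prev[-1]

def character_error_rate(pairs):
    """CER = total edit distance / total length of the expected (corrected) text."""
    errors = sum(levenshtein(extracted, expected) for extracted, expected in pairs)
    total = sum(len(expected) for _, expected in pairs)
    return errors / total

print(character_error_rate([("modem text", "modern text")]))  # ≈ 0.18 for this toy pair
```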
Comparing the two choices of candidate terms for error insertion, we find very similar scores. This means that targeting the error injection at the judged documents did not influence the results.