Metadata quality frameworks
Despite the key role played by metadata in cultural heritage collections, evaluating their quality and establishing measures that identify the data features in need of improvement is still a matter of debate. Day [10] assessed metadata quality in e-print archives according to functional requirements defined at two separate levels: compliance with the specifications of the metadata schema used to describe the digital objects, and compliance with the needs of the end user. At the first level, an object must be described strictly following the rules and guidelines of the metadata schema (or application profile) in order to be considered correct. The second, higher level of correctness concerns the appropriateness of the values of the metadata fields: for example, the Italian painting The Birth of Venus by Sandro Botticelli is also known as The Venus. According to this second level of evaluation, both titles should be considered appropriate, even though the only official one is The Birth of Venus. Hence, at this second level, quality and correctness are about fitness for purpose.
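To make the distinction concrete, the following minimal sketch (our own illustration, assuming a simplified, hypothetical application profile with four required fields) shows what a first-level compliance check amounts to; second-level correctness, i.e. whether a formally valid value such as The Venus actually fits the user’s needs, cannot be verified in this way.

```python
# Minimal sketch of a first-level (schema compliance) check, assuming a
# simplified, hypothetical application profile with four required fields.
# It only verifies that mandatory fields are present and non-empty; it cannot
# tell whether "The Venus" or "The Birth of Venus" is the more appropriate
# title, which is a second-level (fitness for purpose) judgement.

REQUIRED_FIELDS = ["title", "creator", "date", "format"]  # hypothetical profile

def first_level_check(record: dict) -> list:
    """Return the required fields that are missing or empty in the record."""
    return [f for f in REQUIRED_FIELDS if not str(record.get(f, "")).strip()]

record = {"title": "The Venus", "creator": "Sandro Botticelli", "date": "c. 1485"}
print(first_level_check(record))  # ['format']: a schema violation, regardless of which title is used
```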
Another approach to assessing metadata quality has been defined by the NISO Foundation and addresses the problem in the context of metadata created by machines and by professionals who are not familiar with cataloging, indexing or vocabulary control [30]. The NISO Framework of Guidance for Building Good Digital Collections presents six principles of what is considered “good” metadata [27]. However, these criteria and principles do not translate into a well-defined set of quality dimensions, so metadata curators and end users are left without operational support in addressing quality issues.
The first attempt to operationally define the evaluation of metadata quality can be found in the Metadata Quality Framework developed by Bruce and Hillmann [5], where seven dimensions and related characteristics are introduced and described, namely Completeness, Accuracy, Conformance to Expectations, Logical Consistency and Coherence, Accessibility, Timeliness and Provenance. However, there is no formal definition of the quality aspects that should be measured by each dimension. The authors note that it is not possible to state which of the seven dimensions is most important for a given application, since the importance of each quality criterion is strictly influenced by the nature of the resource to be described, as well as by the environment in which the metadata is constructed or derived. Thus, great emphasis is put on the fact that the perception of quality strictly depends on context. As a consequence, metadata curators are required to follow a generic, “fitness for use” workflow [3] based on personal interpretation and manual intervention: they should check the content of each record and, depending on the type of issue, report errors to metadata creators or fix the metadata themselves, relying for instance on a controlled vocabulary. Given the growing amount of digital cultural heritage records available, this is a very time-consuming process, which cannot be adopted at scale. Concerning accuracy, which is the central topic of this paper, Bruce and Hillmann’s framework notes that “The information provided about the resource in the metadata instance should be as correct as possible [...] Typographical errors, as well as factual errors, affect this quality dimension.” This is, however, a very narrow definition of accuracy, which only takes into account surface features of a description (e.g. the presence of mistakes), without considering that a description can be formally flawless while containing no useful information, and therefore be of low quality.
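To illustrate how narrow this notion is, the sketch below (a hypothetical example of ours, not part of Bruce and Hillmann’s framework) scores a description only by the share of tokens found in a toy reference word list: a vacuous but well-spelled description obtains a perfect score, while an informative description with a single typo is penalized.

```python
# Hypothetical sketch of a surface-level accuracy check in the spirit of
# Bruce and Hillmann's definition: the score only reflects typographical
# correctness, not whether the description is informative.

REFERENCE_VOCABULARY = {"a", "painting", "of", "the", "object", "tempera",
                        "on", "canvas", "by", "sandro", "botticelli"}  # toy word list

def surface_accuracy(description: str) -> float:
    """Share of tokens found in the reference word list (1.0 = no typos)."""
    tokens = description.lower().split()
    if not tokens:
        return 0.0
    known = sum(1 for t in tokens if t in REFERENCE_VOCABULARY)
    return known / len(tokens)

print(surface_accuracy("A painting of the object"))               # 1.0, yet uninformative
print(surface_accuracy("Tempera on cnvas by Sandro Botticelli"))  # < 1.0 because of a single typo
```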
Besides the framework by Bruce and Hillmann, few other approaches have been proposed to automatically compute quality metrics. The ones most related to our work are the Framework for Information Quality Assessment by Stvilia [36], the Metadata Quality Framework by Ochoa and Duval [28] and the Metadata Quality Assurance Framework by Péter Király [18, 19]. Other frameworks (e.g. [25]) do not include accuracy and are therefore not discussed in this paper.
Stvilia proposes a framework which overlaps with the Metadata Quality Framework by Bruce and Hillmann. The author identifies four major sources of information quality problems: mapping, changes to the information entity, changes to the underlying entity or condition, and context changes. To address mapping, Stvilia adopts the definition from Wand [39], according to which mapping issues arise when there is an incomplete or ambiguous mapping between the information source and the information entity from the metadata schema. Changes, instead, may occur in the information entity itself or in the real-world entity it represents. On this basis, the author develops a taxonomy of 22 dimensions, systematically organized into three categories: intrinsic, i.e. dimensions that can be assessed by measuring information aspects against reference standards (e.g. spelling mistakes); relational, i.e. dimensions that measure relationships between the information and some aspects of its usage (e.g. accuracy); and reputational, i.e. dimensions that measure the position of an information entity in a given structure (e.g. authority). However, these dimensions have not been implemented as algorithms that can be operationally applied to different cases.
Ochoa and Duval’s framework is inspired by the parameters introduced by Bruce and Hillmann and by Stvilia. However, it is more detailed and specific, in that it presents several automatically calculable quality metrics associated with the seven parameters of Bruce and Hillmann’s framework. The authors point out that the proposed metrics are not intended to be a comprehensive or definitive set, but should be considered a first step towards the automatic evaluation of metadata quality.
Regarding accuracy, Ochoa and Duval define it as ‘the degree to which metadata values are “correct”, i.e. how well they describe the object’ [28]. Similarly to our approach, they make use of text processing techniques and apply them to the textual fields of metadata. However, they propose a general unsupervised method based on the vector space model (VSM), aimed at computing the semantic distance between two resources according to the keywords stored in a vocabulary, while our approach is supervised and does not rely on external resources, because this information is already captured by the trained classification model. Furthermore, Ochoa and Duval’s proposal to assess metadata accuracy may be affected by issues related to the length of the descriptions. Longer texts contain more words than shorter ones, and this has an impact on the computation of the semantic distance with the keywords stored in the external vocabulary: the longer the text, the higher the chance that it contains some of the keywords in the vocabulary, and thus, due to the way the VSM works, the higher the accuracy score, independently of whether such keywords accurately describe the content of the text. In this way, lengthy (but not accurate) descriptions containing many keywords may receive higher accuracy scores than shorter (but accurate) ones. Moreover, Ochoa and Duval also present three validation studies that evaluate the proposed metrics against human-made quality assessments. In general, the quality metrics do not seem to correlate with human ratings.
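The length effect can be illustrated with the following toy sketch (our own simplification, not Ochoa and Duval’s exact formulation): the score is the cosine similarity between the binary keyword vector of a description and an all-ones vector over a small, hypothetical domain vocabulary, so a padded description that merely mentions many keywords outscores a short, accurate one.

```python
# Toy illustration (not Ochoa and Duval's exact metric) of why VSM-style
# keyword matching favours longer descriptions: the score below is the cosine
# similarity between the binary keyword vector of a description and an
# all-ones vector over a small, hypothetical domain vocabulary.

import math

VOCABULARY = ["renaissance", "tempera", "canvas", "venus", "mythology", "florence"]

def vsm_score(description: str) -> float:
    tokens = set(description.lower().split())
    vec = [1.0 if kw in tokens else 0.0 for kw in VOCABULARY]
    norm = math.sqrt(sum(v * v for v in vec))
    ref_norm = math.sqrt(len(VOCABULARY))
    return sum(vec) / (norm * ref_norm) if norm else 0.0

short_accurate = "Venus on a shell, tempera on canvas"
long_padded = ("renaissance tempera canvas venus mythology florence "
               "and many other words that do not describe this object at all")

print(vsm_score(short_accurate))  # lower score despite being accurate
print(vsm_score(long_padded))     # higher score only because it mentions more keywords
```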
The third metadata quality framework we consider has been developed by Péter Király [19] in collaboration with the Data Quality Committee (DQC) of the European digital library Europeana. The Metadata Quality Assurance Framework is an ongoing project tailored to measuring the metadata quality of the Europeana digital library and based on the Europeana Data Model (EDM) metadata profile. The framework consists of four metrics: completeness; multilinguality; uniqueness, i.e. the frequency of duplicated values; and record patterns, i.e. the density distribution of filled fields among all Europeana content providers. While a lot of emphasis is put on the issue of multilinguality, which has become very relevant in data aggregation projects like Europeana, accuracy is not introduced as a separate metric, but only mentioned as a dimension that can be inferred from the others. In this respect, our work adopts a different perspective on the issue.
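As a rough indication of the kind of record-level measurements involved, the following simplified sketch (our own illustration on generic field-value records, not the actual EDM-based implementation) computes completeness as the share of filled fields and uniqueness as the inverse frequency of a field value across the collection.

```python
# Simplified illustration (not the actual Metadata Quality Assurance Framework
# implementation) of two of the metrics it computes on field-value records:
# completeness as the share of filled fields, and uniqueness as the inverse
# frequency of a value across the whole collection.

from collections import Counter

FIELDS = ["title", "creator", "subject", "description"]  # hypothetical schema

def completeness(record: dict) -> float:
    filled = sum(1 for f in FIELDS if str(record.get(f, "")).strip())
    return filled / len(FIELDS)

def uniqueness(field: str, record: dict, collection: list) -> float:
    counts = Counter(r.get(field, "") for r in collection)
    value = record.get(field, "")
    return 1.0 / counts[value] if value else 0.0

collection = [
    {"title": "Postcard", "creator": "Unknown", "subject": "travel"},
    {"title": "Postcard", "creator": "Unknown", "subject": "architecture",
     "description": "View of the cathedral square"},
]
print(completeness(collection[0]))                     # 0.75: three of four fields filled
print(uniqueness("title", collection[0], collection))  # 0.5: the title value is duplicated
```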
NLP and machine learning for description quality
We are interested in automatically assessing the quality of descriptions in digital records. The topic has already been tackled in the past with machine learning and NLP, but with techniques that differ from what we propose. In [9], for example, description length is considered a proxy for accurate content description and is used as a feature in a supervised classification task; no semantic information is analysed. In [13], string matching is used to detect information redundancy in metadata collections, a task related to metadata quality because redundancy may hinder basic digital library functions. In [33], accuracy is computed on public government data as the distance between the format of the referenced resource and the actual data type. Again, this measure is based on a formal check, without looking at what information is actually presented in the description. In [23], instead, the authors highlight the importance of a semantic check for quality assessment and therefore propose to verify correctness, completeness and relevance of metadata by creating logic rules to model relations among digital resources. In [26], the authors show that enriching the subject field with terms automatically extracted through topic modelling is valuable, especially when coupled with a manual revision by human curators. None of the techniques considered in our work, in particular supervised classification based on word embeddings, has previously been applied and tested on Cultural Heritage repositories and resources.
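For reference, the following minimal sketch (illustrative only: the toy embeddings, labels and classifier choice are assumptions, not our exact experimental setup) shows the kind of pipeline we refer to: each description is represented by the average of its word embeddings and a standard classifier is trained on records labelled for description quality.

```python
# Illustrative sketch (not our exact experimental setup) of supervised
# classification of description quality with averaged word embeddings:
# each description is represented by the mean of its word vectors and a
# standard classifier is trained on records labelled as accurate/inaccurate.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
DIM = 50
# Toy embeddings for demonstration; in practice these would be loaded from a
# pretrained model (e.g. word2vec or fastText vectors).
embeddings = {w: rng.normal(size=DIM) for w in
              "painting tempera canvas portrait landscape old image scan file".split()}

def embed(description: str) -> np.ndarray:
    vectors = [embeddings[t] for t in description.lower().split() if t in embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(DIM)

# Hypothetical training data: descriptions labelled 1 (accurate) or 0 (not informative).
texts = ["tempera painting portrait", "landscape painting canvas",
         "old image", "scan file"]
labels = [1, 1, 0, 0]

clf = LogisticRegression().fit(np.stack([embed(t) for t in texts]), labels)
print(clf.predict([embed("portrait painting on canvas")]))  # predicted quality label for a new description
```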