The most developed post-industrial societies live by information, and information and communication technologies keep them oxygenated (English 2009). So, the better the quality of the information exchanged, the more likely such societies and their members are to prosper. But what is information quality (IQ) exactly? The question has become increasingly pressing in recent years.1 Yet, our answers have been less than satisfactory so far.
In the USA, the Information Quality Act, also known as the Data Quality Act,2 enacted in 2000, left virtually every key concept in the text undefined. Instead, it required the Office of Management and Budget “to promulgate guidance to agencies ensuring the quality, objectivity, utility, and integrity of information (including statistical information) disseminated by Federal agencies”. Unsurprisingly, the guidelines have received much criticism and have been under review ever since.3
In the UK, some of the most sustained efforts in dealing with IQ issues have concerned the National Health Service (NHS). Already in 2001, the Kennedy Report4 acknowledged that: “All health care is information driven, so the threat associated with poor information is a direct risk to the quality of healthcare service and governance in the NHS”. However, in 2004, the NHS Information Quality Assurance Consultation5 still stressed that “Consideration of information and data quality are made more complex by the general agreement that there are a number of different aspects to information/data quality but no clear agreement as to what these are”.
Lacking a clear and precise understanding of IQ properties causes costly errors, confusion, impasse, dangerous risks and missed opportunities. Part of the difficulty lies in constructing the right conceptual and technical framework necessary to analyse and evaluate them. Some steps have been taken to rectify the situation. The first International Conference on Information Quality was organised in 1996.6 In 2006, the Association for Computing Machinery launched the new Journal of Data and Information Quality.7 The Data Quality Summit8 now provides an international forum for the study of information quality strategies. Pioneering investigations in the 1990s—including Wang and Kon (1992), Tozer (1994), Redman (1996), and Wang (1998)—and research programmes such as the Information Quality Program9 at MIT have addressed applied issues, plausible scenarios and the codification of best practices. So, there is already a wealth of available results that could make a difference. However, such results have had limited impact, partly because research concerning IQ has failed to combine and cross-fertilise theory and practice. Furthermore, insufficient work has been done to promote the value-adding synthesis of academic findings and technological know-how.
Three connected shortcomings seem to lie behind this failure:
- A failure to identify the potentially multipurpose and boundlessly repurposable nature of information as the source of significant complications (this is particularly significant when dealing with “big data” (Floridi 2012)), because of
- A disregard for the fact that any quality evaluation can only happen at a given level of abstraction (LoA).11 To simplify, the quality of a system fit for a particular purpose is analysed at a LoA whose selection is determined by the choice of that purpose in the first place: if one wants to evaluate a hammer for the purpose of holding some paper in place on the desk, then that purpose determines the LoA, which will include, for example, how clean the hammer is; leading to
- A missed opportunity to develop a satisfactory approach to IQ in terms of LoA and purpose orientation.
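The point about levels of abstraction can be made concrete with a small sketch. This is purely illustrative, not part of the original analysis: the names (`hammer`, `LOAS`, `evaluate`) and all numeric values are invented. The idea is simply that the purpose determines *which* observables are visible to the evaluation at all, and how much each one matters.

```python
# Observables of a hammer, recorded independently of any purpose.
hammer = {"mass_kg": 0.6, "head_hardness": 0.9, "cleanliness": 0.3}

# Each purpose determines a LoA: the subset of observables exposed to the
# evaluation, with a weight expressing how much each one matters.
# (Purpose names and weights are hypothetical.)
LOAS = {
    "driving_nails": {"mass_kg": 0.5, "head_hardness": 0.5},
    "paperweight":   {"mass_kg": 0.2, "cleanliness": 0.8},
}

def evaluate(artifact, purpose):
    """Score an artifact only on the observables its purpose-driven LoA exposes."""
    loa = LOAS[purpose]
    return sum(weight * artifact[obs] for obs, weight in loa.items())
```

Note that `cleanliness` is simply invisible at the `driving_nails` LoA: changing the purpose does not merely re-weight the same features, it changes which features exist for the evaluation.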
- Accuracy, objectivity, believability
- Relevancy, value-added, timeliness, completeness, amount of data
- Interpretability, ease of understanding, concise representation, consistent representation
Admittedly, all this is a bit hard to digest, so here are three examples that should clarify the point.
The proposed strategy reflects a considered balance between data relevance, accuracy, timeliness and coherence. The data accuracy that can be achieved reflects the methods and resources in place to identify and control data error and is therefore constrained by the imperative for timely outputs. ‘Timeliness’ refers to user requirements and the guiding imperative for the 2011 Census is to provide census population estimates for rebased 2011 mid-year population estimates in June 2012. ‘Coherence’ refers to the internal integrity of the data, including consistency through the geographic hierarchy, as well as comparability with external (non-census ONS) and other data sources. This includes conformity to standard concepts, classifications and statistical classifications. The 2011 Data Quality Assurance Strategy will consider and use the best available administrative data sources for validation purposes, as well as census time series data and other ONS sources. A review of these sources will identify their relative strengths and weaknesses. The relevance of 2011 Census data refers to the extent to which they meet user expectations. A key objective of the Data Quality Assurance Strategy is to anticipate and meet user expectations and to be able to justify, empirically, 2011 Census outcomes. To deliver coherent data at acceptable levels of accuracy that meet user requirements and are on time, will demand QA input that is carefully planned and targeted. Census (2011), pp. 8–9 (my italics).
Apart from a questionable distinction between information quality and accuracy (as if accuracy were something other than IQ), the position expressed in the document (and in the citation above) is largely reasonable. I say “largely” because the statement about the “key objective” of anticipating and meeting user expectations remains quite problematic: it shows a lack of appreciation for the complexity of the fit-for-purpose requirement. The objective is unrealistic because such expectations are unpredictable; that is, the purpose for which the information collected in the census is supposed to be fit may change quite radically, thus affecting the fitness itself. To understand why, consider a second example.
In the UK, postcodes for domestic properties refer to up to 100 properties in contiguous proximity. Their original purpose was to aid the automated sorting of the mail: that was what the postcode information was fit for (Raper et al. 1992). Today, they are used to calculate insurance premiums, designate destinations in route-planning software and allocate different levels of public services, depending on one’s location (postcode) in crucial areas such as health, social services and education (the so-called postcode lottery). In short, the information provided by postcodes has been radically repurposed, and keeps being repurposed, leading to a possible decline in fitness. For instance, the IQ of postcodes is very high when it comes to delivering mail, but rather poorer when route planning is in question, as many drivers who mistakenly expect a one-to-one relation between postcodes and addresses have discovered. The same holds true in the USA for Social Security Numbers (SSNs), our third and last example. Originally, and still officially, SSNs were intended for only one purpose: tracking a worker’s lifetime earnings in order to calculate retirement benefits. So much so that, between 1946 and 1972, SSNs carried the following disclaimer: “For social security purposes not for identification”. However, SSNs are the closest thing to a national ID number in the USA, and this is the way they are regularly used today, despite being very “unfit” for such a purpose, especially in terms of safety (United States Federal Trade Commission 2010).
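The postcode case can be reduced to a toy data structure. The sketch below is hypothetical (the postcode and addresses are invented), but it captures the structural point: a postcode maps to a delivery round one-to-one, so it is fit for mail sorting, while it maps to addresses one-to-many, so it is unfit, on its own, as a route-planning destination.

```python
# One postcode unit covers several contiguous properties (invented examples).
postcode_to_addresses = {
    "AB1 2CD": ["1 High Street", "2 High Street", "3 High Street"],
}

def sort_mail(postcode):
    # Fit for the original purpose: every address in the unit
    # goes on the same delivery round, so the postcode suffices.
    return f"round for {postcode}"

def plan_route(postcode):
    # Unfit for the repurposed use: the postcode alone does not
    # pick out a single destination address.
    addresses = postcode_to_addresses[postcode]
    if len(addresses) > 1:
        raise ValueError(f"{postcode} is ambiguous: {len(addresses)} addresses")
    return addresses[0]
```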
The previous examples illustrate one of the fundamental problems with IQ: the tension between purpose–depth, on the one hand, and purpose–scope, on the other. Ideally, high-quality information is information that is fit for both: it is optimally fit for the specific purpose/s for which it is elaborated (purpose–depth) and it is also easily re-usable for new purpose/s (purpose–scope). However, as in the case of a tool, sometimes the better some information fits its original purpose, the less easily it seems to be repurposable, and vice versa. The problem is not only that these two requirements may be more or less compatible, but that we often forget this and speak of purpose-fitness as if it were a single feature, synonymous with information quality, to be analysed according to a variety of taxonomies. Recall the statement from the Census Data Quality Assurance Strategy. This is a mistake. Can it be avoided? A detailed answer would require more space than is available here, so let me offer an outline of a promising strategy in terms of a bi-categorical approach, which could be implemented through some user-friendly interfaces.
(Table: example of a bi-categorical IQ analysis.)
The result would be that one would link IQ to a specific purpose, instead of talking of IQ as fit-for-purpose in absolute terms.
There are many senses in which we speak of fitness for purpose. A pre-Copernican astronomy book would be of very bad IQ if its purpose were to instruct us on the nature of our galaxy, but it may be of very high IQ if its purpose is to offer evidence about the historical development of Ptolemaic astronomy. This is not relativism; it is a matter of explicit choice of the purpose against which the value of some information is to be examined. Once this methodological step is carefully taken, a bi-categorical approach is compatible with, and can be supported by, quantitative metrics, which can let users associate values with dimensions depending on the categories in question, by relying on solutions previously identified: metadata, tagging, crowdsourcing, peer review, expert interventions, reputation networks, automatic refinement and so forth. The main advantage of a bi-categorical approach is that it clarifies that the values need not be the same for different purposes. It should be rather easy to design interfaces that enable and facilitate such interactive selection of the purposes for which IQ is evaluated. After all, we have plenty of information systems that are syntactically smart and users who are semantically intelligent, and a bi-categorical approach may be a good way to make them work together successfully.
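As a minimal sketch of how such a metric might work, consider the following. All names and scores are invented for illustration; the only substantive point is that the purpose is an explicit parameter of the evaluation, so the same item of information receives different IQ values under different purposes, exactly as in the pre-Copernican book example.

```python
# Dimension scores a user or metric might assign to a Ptolemaic astronomy
# book, relative to each explicitly declared purpose (values illustrative).
scores = {
    ("ptolemaic_book", "describe_the_galaxy"):  {"accuracy": 0.05, "relevancy": 0.1},
    ("ptolemaic_book", "history_of_astronomy"): {"accuracy": 0.9,  "relevancy": 0.95},
}

def iq(item, purpose):
    """IQ is defined only relative to a purpose, never in absolute terms:
    here, simply the mean of the dimension values recorded for that pair."""
    dims = scores[(item, purpose)]
    return sum(dims.values()) / len(dims)
```

An interface of the kind suggested above would essentially let users select the purpose key interactively, with the dimension values fed by metadata, crowdsourcing, reputation networks and the other mechanisms listed.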
See more recently US Congress House Committee on Government Reform. Subcommittee on Regulatory Affairs (2006).
See Borges, “The Analytical Language of John Wilkins”, originally published in 1952, English translation in Borges (1964).
On the method of abstraction and LoA, see Floridi (2008).
- Al-Hakim, L. (2007). Information quality management: theory and applications. Hershey, PA: Idea Group.
- Batini, C., & Scannapieco, M. (2006). Data quality: concepts, methodologies and techniques. Berlin: Springer.
- Borges, J. L. (1964). Other inquisitions, 1937–1952. Austin: University of Texas Press.
- Census (2011). Census Data Quality Assurance Strategy, http://www.ons.gov.uk/ons/guide-method/census/2011/the-2011-census/processing-the-information/data-quality-assurance/2011-census---data-quality-assurance-strategy.pdf
- English, L. (2009). Information quality applied: best practices for improving business information, processes, and systems. Indianapolis: Wiley.
- Herzog, T. N., Scheuren, F., & Winkler, W. E. (2007). Data quality and record linkage techniques. New York: Springer.
- Lee, Y. W., et al. (2006). Journey to data quality. Cambridge: MIT.
- Maydanchik, A. (2007). Data quality assessment. Bradley Beach: Technics.
- McGilvray, D. (2008). Executing data quality projects: ten steps to quality data and trusted information. Amsterdam: Morgan Kaufmann/Elsevier.
- Olson, J. E. (2003). Data quality: the accuracy dimension. San Francisco: Morgan Kaufmann.
- Raper, J. F., Rhind, D., & Shepherd, J. F. (1992). Postcodes: the new geography. Harlow: Longman.
- Redman, T. C. (1996). Data quality for the information age. Boston: Artech House.
- Theys, P. P. (2011). Quest for quality data. Paris: Editions TECHNIP.
- Tozer, G. V. (1994). Information quality management. Oxford: Blackwell.
- United States Federal Trade Commission. (2010). Social security numbers and ID theft. New York: Nova Science.
- United States. Congress. House. Committee on Government Reform. Subcommittee on Regulatory Affairs. (2006). Improving Information Quality in the Federal Government: hearing before the Subcommittee on Regulatory Affairs of the Committee on Government Reform, House of Representatives, One Hundred Ninth Congress, First Session, July 20, 2005. Washington: U.S. G.P.O.
- Wang, R. Y., et al. (Eds.). (2005). Information quality. Armonk: ME Sharpe.
- Wang, Y. R., & Kon, H. B. (1992). Toward quality data: an attributes-based approach to data quality. Cambridge: MIT.