The International Database on Longevity: Data Resource Profile

Even in countries with very good statistical systems, routine population statistics that cover individuals of very high ages are often problematic, as the proportion of erroneous cases increases sharply with age. The desire to measure human mortality at extreme ages was the main motivation for the establishment of the International Database on Longevity (IDL). The IDL is a uniquely valuable source of information on extreme human longevity. It provides high-quality age-validated individual-level data on the ages of semi-supercentenarians and supercentenarians. Moreover, the IDL is the only database that provides such data without age-ascertainment bias. It obtains its candidates from records of government agencies to ensure that there is no dependency between the probability of being included and age. Candidates who meet strict criteria for the validity of their age (date of their birth) are then included in the IDL. Nevertheless, the IDL does not include exhaustive sets of validated supercentenarians and semi-supercentenarians for any country, because it is nearly impossible to find documents that would allow for the validation of the ages of all of the individuals on the list. As of August 2017, the IDL has records on 1,304 validated supercentenarians and 18,590 semi-supercentenarians from 15 countries. The first person in the IDL collection who attained age 110 was born in 1852 and died in 1962 in Quebec, while the last person was born in 1906 and attained age 110 in 2016. This chapter introduces the database and explains its purpose and principles. We also describe the data structure and provide an overview of the information available.


Introduction
Extreme longevity has long been a topic of interest to the media and to the broader public. There are many legends of people who set longevity records, with tales of individuals who lived 200, 500, and even 969 years. Unfortunately, it is almost impossible to validate the ages of long-lived individuals from before the twentieth century. In the second half of the twentieth century, the number of people in a collection of low-mortality countries who reached age 100 approximately doubled every decade (Jeune 2002). This trend continued in the first decade of the twenty-first century (HMD 2016), which suggests that the proportion of the long-lived will probably continue to increase in the future. The unprecedented growth in the number of centenarians and supercentenarians (those aged 110 and older) in recent decades provides us with a practical basis for investigating the extremes of human longevity. There is no consensus about the limits of longevity or about the form of the mortality hazard at extreme ages. The existing data suggest that the chances of a new Jeanne Calment (who died in 1997 at age 122) appearing in the near future are quite low, though clearly higher than zero. While the postponement of mortality has been reported in many studies (Vaupel 2010), the trajectories of longevity at ages above 105 or 110 are still disputed (Gampe 2010; Gavrilov and Gavrilova 2011; Robine and Vaupel 2001). There are radically different ideas and assumptions about the direction of future change in longevity, and about the potential limits to the human lifespan (de Beer et al. 2017; Dong et al. 2016; Oeppen and Vaupel 2002; Olshansky 2013). Carefully collected and rigorously validated data might help us to confirm or reject these hypotheses.
The existing data sources on extreme human longevity can be placed into two categories. The first category consists of comprehensive data assembled by government agencies on deaths and population exposures for semi-supercentenarians (those aged 105-109) and supercentenarians. The second category consists of unofficial special lists and data collections of cases of extreme longevity compiled by researchers interested in the topic from sundry sources.
Even in countries with very good statistical systems, routine population statistics that cover individuals of very high ages are often problematic, as the proportion of erroneous cases increases sharply with age (Cairns et al. 2016; Jdanov et al. 2008). For example, according to U.S. vital statistics, there are numerous deaths at ages above 110, and even some above age 130 (HMD 2016), which clearly cannot be accurate (Rosenwaike and Stone 2003). The high-quality population registers of northern European countries are also far from perfect. For example, the proportion of foreign-born individuals in the 2014 Swedish population jumps from 6-8% at ages 90-94 to 23% at ages 105+. There is no similar jump among deaths: the proportion of foreign-born individuals among those who died after reaching age 90 is fairly stable across all age groups, at 5-7% (Glei et al. 2015). The surprisingly high proportion of foreign-born individuals alive at ages 105 and above suggests that age overstatement is occurring among people whose births were not registered in Sweden. A steep increase in the proportion of foreign-born individuals in the population denominator that does not coincide with a similar increase in the death numerator is a signal of problematic population estimates, and of a numerator-denominator bias at extremely old ages. In light of this growing problem, statistical offices have been forced to begin the open age interval at an age no higher than 100. The Human Mortality Database (HMD), the leading source of population and mortality data at the national level in the world, recommends the use of smoothed death rates at ages 95+ even for countries with high-quality statistics, such as the Western European countries (Wilmoth et al. 2007).
As we noted above, the second data category consists of lists of very old individuals compiled by researchers interested in extreme longevity. The Gerontology Research Group supercentenarian list (GRG 2015) is probably the best example of such a list. It consists of supercentenarians around the world who are known to the GRG and have met the GRG age-validation criteria. Such lists are open to several criticisms. First, the proportion of the target group that is captured in the list is not known. Second, the list may be unrepresentative of the age distribution of the extreme aged. For example, newspapers may report on the oldest or the second-oldest person in a country, but not mention younger individuals of extreme ages. Because they are subject to this age-ascertainment bias, these lists cannot be used to measure mortality at extreme ages.
The desire to measure human mortality at extreme ages was the main motivation for the establishment of the International Database on Longevity (IDL) by an international collaborative research group. The IDL aims to provide highly reliable data on the ages of semi- and supercentenarians that are free of age-ascertainment bias, and thus to ensure a solid basis for studying the mortality trajectories of extreme longevity. As the IDL obtains its candidates from the comprehensive records of government agencies, there is no dependency between the probability of being included and age. The candidates who meet strict validation criteria are ultimately included in the IDL. These criteria do, however, vary somewhat from country to country; for more about the validation processes, see Poulain (2010). Nevertheless, the IDL does not include exhaustive sets of validated supercentenarians and semi-supercentenarians for any country. Even if a complete list of individuals who survived to ages 105+ existed for a given country, it would be nearly impossible to find documents that would allow for the validation of the ages of all of the individuals on the list. Most importantly, the IDL guarantees that all of the data in the database are of high quality.
In most countries of the IDL, records of deaths at extreme ages are obtained from the vital registration system. Records of living persons are often more difficult to obtain, particularly for countries without a population register. In the validation process, records that do not meet the age threshold are rejected, and records for which no satisfactory determination can be made are annotated as such. Lists of validated semi- and supercentenarian cases may be somewhat biased compared to records on the general population due to the exclusion of two types of cases: (1) those with an incorrect age (age overstatement), and (2) those that could not be validated. For example, it is particularly difficult to validate the age of a person who was born abroad. While the number of validated cases is smaller than the number of candidate cases, if the data quality is good (i.e., if relatively few candidates are discarded), the patterns seen in the age-validated data will also be seen in the candidate data. For example, for France we can see that the numbers line up quite well for the cohorts born between 1883 and 1900 (Fig. 2.1). Only a few candidates in the cohorts born before 1883 could be age-validated because data with individual death records, which are needed for the validation process (records with, for example, name, year, and place of birth), are available in electronic form only from 1983 onward. For cohorts born after 1900, the numbers do not line up as well because the validation process has not been completed.
The first version of the IDL was launched in 2010 (Cournil et al. 2010) using the country data described in Maier et al. (2010). This chapter provides an overview of the updated and modified IDL. The new IDL includes all of the IDL-2010 data, as well as new data collected during two rounds of updates. The record content and the format of the new IDL differ from those in the IDL-2010, as we offer formalized descriptions of the metadata by applying a new set of variables.
Most importantly, the threshold age for the new IDL is 105, rather than the age of 110 used in the IDL-2010. Because it is so costly to validate the large number of candidates aged 105-109, for three countries, only a random sample of semi-supercentenarian candidates was put through the validation process. The United States provides only validated cases from a sample randomly drawn from the population, while France and England and Wales also provide full lists of known semi-supercentenarians and information about failures in the validation sample. In the latter case, the probability of successful validation can be extrapolated to the whole list. This approach is called sample validation.
Currently, the IDL includes data from 12 European countries, as well as from Canada, Japan, and the United States. The country-specific details of the validation process are given in the respective country-specific metadata files, which are an essential part of the dataset. Additional details for some countries can also be found in the country chapters of this book. All of the data were provided either by individual researchers who collected the information from official data sources, or by national statistical agencies. The full list of contributors is given on the IDL website. The pooled IDL dataset was uniformly coded, harmonized, and carefully checked. The standards developed for data collection and presentation ensure the comparability of the present and future collections, and increase the cross-country coherence of the data.
In the next section of this chapter, we describe the main features of the data collection and verification procedures used in compiling the IDL. In the third section, we explain the structure and organization of the data, and describe the main data fields. In the fourth section, we provide a brief overview of the data available in each country. In the fifth section, we discuss the use of data from the IDL. Our concluding remarks are presented in the final section.

Data Collection
In the following, we describe the main features of the IDL data collection and validation procedures, which should be taken into account by researchers who intend to analyze these data.

Sampling Frames
The IDL deals with individual data sampled from the population. In practice, this means that the IDL collects individual trajectories in a certain age-period frame. The process of data collection eliminates any age ascertainment bias that might otherwise exist. The choice of specific procedures depends on the data availability.
Data might be collected using either a cohort or a period approach. The difference between the two approaches can be explained with the help of the Lexis diagram of Fig. 2.2. There, y0 and y3 denote the years of attainment of the threshold age of 110 for the first and the last cohorts, and y1 and y3 denote the first and the last years of period observation. The area consisting of A, B, C, and D in the Lexis diagram corresponds to the cohort approach: only the supercentenarians born between y0-110 and y3-110 are included. The age w is the age of extinction; i.e., the age of the oldest person alive in the population in the last year of observation y3. This oldest person reached age 110 in the year y2. The complete information can be obtained only for cohorts born between y0-110 and y2-110 (area A+B+C), while the data on supercentenarians born in the year y2-110 or later (area D) might still change because there are candidates who are still alive. The area consisting of A, B, D, and E corresponds to the period approach; supercentenarians who died in C are excluded. Thus, the data for cohorts born before y1-110 are left-truncated: supercentenarians dying before y1 are excluded. The data representing the area D are still subject to change by future updates, which are also based on a period approach.
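The two sampling frames can be illustrated with a minimal sketch. It assumes year-level resolution, that every input person is already known to have reached age 110, and that a missing death year means the person was alive at the end of observation. The class `Person` and the function names are hypothetical and are not part of the IDL.

```python
from dataclasses import dataclass
from typing import Optional

THRESHOLD_AGE = 110  # threshold age used in the Lexis diagram description


@dataclass
class Person:
    birth_year: int
    death_year: Optional[int]  # None: still alive in the last observed year


def in_cohort_frame(p: Person, y0: int, y3: int) -> bool:
    """Cohort approach: born between y0-110 and y3-110,
    i.e. the threshold age was attained between y0 and y3."""
    return y0 <= p.birth_year + THRESHOLD_AGE <= y3


def in_period_frame(p: Person, y1: int, y3: int) -> bool:
    """Period approach: observed during y1..y3.
    People who died before y1 are excluded (left truncation)."""
    if p.birth_year + THRESHOLD_AGE > y3:
        return False  # had not reached the threshold age by y3
    return p.death_year is None or p.death_year >= y1


# Illustrative years only (assumed values, not taken from the IDL).
y0, y1, y3 = 1990, 2000, 2015
died_early = Person(birth_year=1885, death_year=1997)   # reached 110 in 1995
older_cohort = Person(birth_year=1875, death_year=2001)  # reached 110 in 1985
still_alive = Person(birth_year=1900, death_year=None)   # reached 110 in 2010
```

Under these assumed years, `died_early` falls in the cohort frame but not in the period frame, `older_cohort` falls only in the period frame, and `still_alive` falls in both.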
As we mentioned above, for several countries, no information is available on individuals who are still alive, resulting in right-truncation. When such information is available, we still do not know the age at death of the then-living; we call this right-censoring. For Japan, the exact age at death cannot be determined, only the year of death and, respectively, the age range within which the death occurred; this we call interval-censoring.
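The three observation types described above can be sketched as follows. This is an illustration under stated assumptions, not the IDL's own coding: the function `observed_age_days` and its return format are hypothetical, and ages are measured in days to avoid leap-year conventions. Right-truncation is a property of the sampling frame rather than of a single record, so it is not represented here.

```python
from datetime import date
from typing import Optional, Tuple


def observed_age_days(
    birth: date,
    death: Optional[date] = None,
    death_year: Optional[int] = None,
    last_seen_alive: Optional[date] = None,
) -> Tuple[str, int, Optional[int]]:
    """Return (kind, lower_bound, upper_bound) for the age at death in days."""
    if death is not None:
        # Exact date of death known: the age at death is fully observed.
        age = (death - birth).days
        return ("exact", age, age)
    if death_year is not None:
        # Only the year of death is known: death occurred somewhere
        # between January 1 and December 31 of that year.
        low = (date(death_year, 1, 1) - birth).days
        high = (date(death_year, 12, 31) - birth).days
        return ("interval-censored", low, high)
    if last_seen_alive is not None:
        # Known to be alive at the last observation: age at death is larger.
        return ("right-censored", (last_seen_alive - birth).days, None)
    raise ValueError("no death or survival information supplied")


# Hypothetical record with a known birth date and only a year of death.
kind, low, high = observed_age_days(date(1900, 3, 15), death_year=2011)
```

A survival model fitted to such records would then use the interval `[low, high]` rather than a single age for the interval-censored cases.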

Validation Methods
Age validation procedures vary across countries depending on the sources of information that are available in each country. Birth or baptism records are available in some countries; while in others, the validation is performed by checking early census records. In the IDL-2010, the quality of the validation procedures used in the production of the country data was assigned to one of two categories: fully validated, which is the more reliable and desirable level of validation; and carefully checked, which is the less reliable and desirable level of validation. Cases in which the individual's early life documents were validated were classified as fully validated, while all of the other validated cases were classified as carefully checked. In the present version of the IDL, the quality of the validation procedures used is not noted, because in some situations it was difficult to establish the formal criteria needed to distinguish between the two levels of validation. Information about the documents used to validate age in each country is provided by the IDL in standardized country-specific metadata files.
In France and in England and Wales, there were large numbers of semi-supercentenarian candidates, which made the cost of validating all candidates prohibitive. Therefore, in these countries the validation of semi-supercentenarians was done on a sample basis (so-called sample validation). The idea of sample validation is very simple: instead of conducting an exhaustive validation of all candidates, only the candidates in a randomly selected sample are validated in order to estimate the age-specific probability of the successful validation of every candidate (i.e., the probability that the recorded age is correct). In particular, in France a random sample of 100 candidates was selected by choosing 20 records from each one-year age group. We used this non-proportional method of sampling because the value of the observations increases with age.
Thus, in France and in England and Wales, the lists of semi-supercentenarians contain all of the known candidates, whereas the records selected for validation contain additional information about the result of the validation process.
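The non-proportional sampling scheme described for France (20 records from each one-year age group between 105 and 109, giving 100 in total) can be sketched as below. The record layout, the function name `stratified_validation_sample`, and the candidate counts are assumptions made for illustration; the actual selection procedure may have differed in detail.

```python
import random
from collections import defaultdict


def stratified_validation_sample(candidates, per_group=20,
                                 ages=range(105, 110), seed=0):
    """Draw the same number of candidates from every one-year age group.

    `candidates` is a list of dicts with an integer "age" key (hypothetical
    layout). Groups smaller than `per_group` are taken in full.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for record in candidates:
        groups[record["age"]].append(record)
    sample = []
    for age in ages:
        pool = groups.get(age, [])
        sample.extend(rng.sample(pool, min(per_group, len(pool))))
    return sample


# Invented candidate counts per age, shrinking with age as expected.
sizes = {105: 500, 106: 300, 107: 150, 108: 60, 109: 25}
candidates = [{"id": f"{age}-{i}", "age": age}
              for age, n in sizes.items() for i in range(n)]
sample = stratified_validation_sample(candidates)
```

Because the oldest groups are the smallest, equal allocation samples a much larger share of the oldest candidates than proportional allocation would, which matches the stated rationale that the value of the observations increases with age.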
The validated list of (semi-) supercentenarians for most countries consists exclusively of individuals who were born in that country. This is because foreign-born candidates often come from countries with poor records. Even when these individuals come from countries with good records, it is necessary to establish cooperation with the country of birth in order to perform an age validation.

Dataset Structure
The IDL data are classified by country. A country dataset contains as many as four data files: (1) a file of individuals who were alive at age 110 or older, if any; (2) a file of individuals who died at age 110 or older, if any; (3) a file of individuals who were alive at ages 105-109; and (4) a file of individuals who died at ages 105-109.
Each country dataset also includes a metadata file providing information about the data collection process and the validation method. When the exhaustive validation approach was used, the data files consist exclusively of validated records. Sample validated files include cases that have been validated, invalidated (selected for validation but found to be invalid), and not selected for validation.
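As a rough sketch of this layout, a record could be routed to one of the four files by vital status and age, and sample-validated files could carry one of three validation outcomes. The labels and the function `data_file_for` below are hypothetical; the actual file names are listed in the documentation supplied with the data.

```python
from enum import Enum


class ValidationStatus(Enum):
    """Outcomes that can appear in sample-validated files."""
    VALIDATED = "validated"
    INVALIDATED = "selected for validation but found to be invalid"
    NOT_SELECTED = "not selected for validation"


def data_file_for(is_alive: bool, age: int) -> str:
    """Map a record to one of the four country data files.

    `age` is the age at death, or the age at last observation for the living.
    """
    if age < 105:
        raise ValueError("below the IDL threshold age of 105")
    status = "alive" if is_alive else "dead"
    age_band = "110+" if age >= 110 else "105-109"
    return f"{status}_{age_band}"  # hypothetical label, not a real file name
```

For instance, `data_file_for(False, 107)` yields the label for the file of deaths at ages 105-109.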
Each record in the data files describes an individual. The names of the individuals are not provided to IDL users. The data fields can be grouped into the following types of information:
1. information about the date, country, and region of birth;
2. information about the date, country, and region of death;
3. information about the place of current residence and proof of being alive for those alive;
4. source of raw data, including information about the sampling frame;
5. method of validation (sample or exhaustive);
6. description of documents used for validation (birth certificate, census record, etc.).
A detailed description of the data fields and a list of the file names used are provided together with the data files.
Table 2.1 summarizes the information that the IDL currently has on supercentenarians. Fifteen countries contributed cases (in the case of Canada, for Quebec only). The IDL-2010 also had information for 15 countries, but we dropped Australia because of age-ascertainment bias in its data, and added Austria. New cases have been added for nine countries. The data are on a period basis, i.e., on the individuals who reached the threshold age during a period of years. The new IDL has records for 1,304 validated supercentenarians. The large majority of the 138 living supercentenarians are from Japan, for reasons explained below; the other cases are of the supercentenarians who were alive at the time of the most recent investigation in the respective countries, which took place between 2000 and 2016, depending on the country. The first person in the IDL collection who attained age 110 was born in 1852 and died in 1962 in Quebec, while the last person in the collection was born in 1906 and reached age 110 in 2016.

Data Overview
The large number of living supercentenarians in Japan can be explained as follows. In Japan, the primary sources of supercentenarian data are annual government lists of centenarians alive on September 1, by age. Before 2006, these lists were complete; accordingly, an individual's death could be inferred by the first absence of his or her name from the annual lists. Since 2006, however, only those individuals who agreed to be included were listed (Saito 2010); accordingly, the absence of an individual from the list no longer necessarily implies that he or she died.
Seven countries did not provide data on living supercentenarians because these data were unavailable due to data protection policies. Finally, all of the countries except France provided data based on the period approach.
Data on semi-supercentenarians were provided by 12 countries (Table 2.2), but not by Finland, Sweden, and Spain. Except for Germany and Switzerland, these [...] The lists of semi-supercentenarian cases were compiled in the same way as the corresponding lists of supercentenarians in all of the countries except England and Wales, France, and the U.S.
The first two countries used the sample validation approach; i.e., for England and Wales and France, all of the candidates appear on the list, but only a sample of cases randomly drawn from this list was validated. In England and Wales, 12% of the female deaths and 100% of the male deaths underwent a validation exercise. In France, where the pool of candidates came from a high-quality data source, a sample of 100 cases was chosen for validation, and 99 of these cases were validated. The remaining record could not be validated because the place of birth was missing, which meant that the municipality that could be approached to obtain a birth record was not known. All the U.S. semi-supercentenarian data in the IDL represent cases that were validated using the validation protocol that was also applied in validating supercentenarians (Kestenbaum and Ferguson 2010); however, the list of candidates was limited to a sample representing only around 10% of the universe of candidates.
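To give a sense of the uncertainty attached to the French result (99 of 100 sampled cases validated), the sketch below computes the observed validation rate together with a 95% Wilson score interval. The choice of the Wilson interval is ours and is not described in the chapter; the calculation also pools all ages, which ignores the stratification of the sample by one-year age group.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# French validation sample as reported in the text: 99 validated out of 100.
p_hat = 99 / 100
low, high = wilson_interval(99, 100)
print(f"validation rate {p_hat:.2f}, 95% interval ({low:.3f}, {high:.3f})")
```

Under these assumptions, the interval runs from roughly 0.95 to just under 1, so the sample is consistent with a very high, but not necessarily perfect, share of correctly recorded ages among all candidates on the list.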

Using the IDL
Because its method of construction avoids the type of age-ascertainment bias that plagues other collections of records of the extreme aged, the IDL provides researchers with an opportunity for a careful analysis of extreme longevity. Whichever analysis approach is used, regardless of whether it is based on classic Bayesian theory or extreme-value theory or something else, the analyst needs to be aware of and account for the characteristics of the IDL collection. First, it is important to keep in mind that the IDL is a collection of validated individual cases. Although all of the country samples of (semi-) supercentenarian cases were randomly drawn, the cases that have been validated might be selective with respect to place and year of birth. Second, only certain countries and certain cohorts and time periods are represented.
A related issue is that the number of supercentenarians in the IDL is fairly small. Additionally, because the number of contributing countries varies from one time period to another, comparisons over time of the oldest supercentenarian, or even of the average age of supercentenarians, can be misleading. Third, because some of the cohorts are not yet extinct, the complete set of mortality probabilities is not directly observed. Moreover, some countries do not even provide counts of their residents who are alive at extreme ages. We believe, for example, that an analysis of IDL data performed recently by Dong et al. (2016) to support their controversial thesis that the limit to the human lifespan has already been reached is flawed. The authors tabulated combined data for England and Wales, France, Japan, and the United States from the IDL-2010 database to demonstrate, among other things, that the annual average age at death of supercentenarians did not increase from 1968 to 2006. But, as shown by the blue line in Fig. 2.3, the numbers of supercentenarian deaths before 1980 are tiny or nil: there were none in 1969, 1971, 1972, 1974, and 1975; only one in 1968, 1970, 1973, 1975, and 1977; and only three in 1978 and 1979. For the years 2000 to 2006, the lines with the solid triangles reflect the fact that the counts for those years in the earlier dataset used by Dong et al. were incomplete because some supercentenarian deaths had not yet occurred, and because some data had not yet been obtained and validated. Thus, the revised pattern is quite different. Indeed, over the periods with the most reliable data (roughly 1980-2007 for IDL-2015 and 1980-2003 for IDL-2010), the average age at death was generally increasing.
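The point about tiny annual counts can be made concrete with a small sketch that tabulates, for each year, the number of deaths alongside the mean age at death, and flags years with too few cases to support a trend. The input values are invented toy data, not IDL records, and the cutoff `min_count=5` is an arbitrary assumption.

```python
from collections import defaultdict


def annual_mean_age_at_death(deaths, min_count=5):
    """Summarize (death_year, age_at_death) pairs by calendar year.

    Years with fewer than `min_count` deaths are marked as unreliable.
    """
    by_year = defaultdict(list)
    for year, age in deaths:
        by_year[year].append(age)
    summary = {}
    for year in sorted(by_year):
        ages = by_year[year]
        summary[year] = {
            "n": len(ages),
            "mean_age": sum(ages) / len(ages),
            "reliable": len(ages) >= min_count,
        }
    return summary


# Toy data only: one death in 1968, three in 1978, six in 1995.
toy_deaths = [(1968, 111.2), (1978, 110.4), (1978, 110.9), (1978, 112.0)]
toy_deaths += [(1995, 110.0 + 0.3 * i) for i in range(6)]
summary = annual_mean_age_at_death(toy_deaths)
```

Reporting the count next to each annual mean makes it visible when an apparent plateau or decline rests on one or two deaths per year.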

Summary
The IDL, with its high-quality age-validated individual-level data on the ages of semi-supercentenarians and supercentenarians, and its goal of avoiding age-ascertainment bias, is a uniquely valuable source of information on extreme human longevity. Of course, like all data collections, the IDL has its limitations, and research using the IDL will be affected by those limitations. Among the main drawbacks of the IDL is that its coverage is limited to certain countries and times, and that the validation methodology is not uniform across countries.
Unfortunately, the recent changes in data protection rules and the general tendency toward limiting access to personal data, even for scientific research, are likely to make future updates of the IDL more and more problematic. Nonetheless, we are hopeful that the IDL will be expanded, despite the increasing strictures on access to personally identifiable information.