Associations between birthweight, gestational age at birth and subsequent type 1 diabetes in children under 12: a retrospective cohort study in England, 1998–2012

Aims/hypothesis With genetics thought to explain only 40–50% of the total risk of type 1 diabetes, environmental risk factors in early life have been proposed. Previous findings from studies of type 1 diabetes incidence by birthweight and gestational age at birth have been inconsistent. This study aimed to investigate the relationships between birthweight, gestational age at birth and subsequent type 1 diabetes in England.
Methods Data were obtained from a population-based database comprising linked mother–infant pairs using English national Hospital Episode Statistics from 1998 to 2012. In total, 3,834,405 children, categorised by birthweight and gestational age at birth, were followed up through record linkage to compare their incidence of type 1 diabetes through calculation of multivariable-adjusted HRs.
Results Out of 3,834,405 children, 2969 had a subsequent hospital diagnosis of type 1 diabetes in childhood. Children born preterm (<37 weeks) or early term (37–38 weeks) experienced significantly higher incidence of type 1 diabetes than full-term children (39–40 weeks) (HR 1.19 [95% CI 1.03, 1.38] and 1.27 [95% CI 1.16, 1.39], respectively). Children born at higher than average birthweight (3500–3999 g or 4000–5499 g) after controlling for gestational age experienced higher incidence of type 1 diabetes than children born at medium birthweight (3000–3499 g) (HR 1.13 [95% CI 1.03, 1.23] and 1.16 [95% CI 1.02, 1.31], respectively), while children at low birthweight (<2500 g) experienced lower incidence (HR 0.81 [95% CI 0.67, 0.98]), signifying a statistically significant trend (p trend 0.001).
Conclusions/interpretation High birthweight for gestational age and low gestational age at birth are both independently associated with subsequent type 1 diabetes. These findings help contextualise the debate about the potential role of gestational and early life environmental risk factors in the pathogenesis of type 1 diabetes, including the potential roles of insulin sensitivity and gut microbiota.
Electronic supplementary material The online version of this article (10.1007/s00125-017-4493-y) contains peer-reviewed but unedited supplementary material, which is available to authorised users.


Identifying delivery and birth records
There are a number of ways of identifying delivery and birth records in HES [4]. For this study, delivery and birth records were identified by the presence of a delivery/birth "tail" of data items attached to the HES record. These tails of data provide information about the mother's characteristics at delivery and the child's characteristics at birth. For any given birth episode, the values of every data field on the tail of the mother's record should ideally match those on the baby's record. In practice, this is not always the case, particularly as the tails can sometimes contain missing data for certain fields. The matching methods described below, therefore, required that some, but not all, of these values matched.

Matching methods
There is no gold standard for linkage of mother-infant pairs in HES [4]. This study prioritised the correctness of accepted matches (i.e. minimising false positives) over the retention of uncertain matches. The rationale for this more restrictive approach was to reduce the possibility of bias, since incorrectly matched mother-infant pairs (i.e. false positives) would blur the true effects of baseline characteristics. The principal disadvantage of discarding uncertain matches is a reduction in statistical power. Given that the numbers in this population-based study were already very large, this reduction was regarded as an acceptable trade-off.
Mother-infant pairs were matched using a mixture of deterministic and probabilistic methods. Deterministic matching was used to create provisional pairs, while probabilistic matching was used to resolve conflicts among provisional pairs, i.e. to identify the "best pair" where deterministic methods produced competing matches, such as where one baby matched to more than one mother.

Deterministic matching
Deterministic links were identified as a pair of records (one belonging to a mother, the other belonging to a baby) agreeing exactly on baby's date of birth (which was obtained from "dob" on the stem of the infant's record and "dobbaby" on the tail of the mother's record), and two or more positive matches on the following fields: encrypted postcode, encrypted mother's date of birth ("dob" on the mother's record and "motdob" on the child's record), birthweight ("birweit"), local patient identifier ("lopatid"), provider code ("procode"), GP practice ("gpprac").
In total, 10,061,381 pairs from 1998 to 2012 passed this test. These were stored in a provisional matched pair output table and carried forward to match resolution using probabilistic matching.
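The deterministic rule can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study. It assumes that each record has already been reduced to a dictionary with common key names (the hypothetical keys "baby_dob" and "mother_dob" stand in for "dob"/"dobbaby" and "dob"/"motdob" respectively), and that a missing value never counts as a positive match. The function name is_provisional_pair is also hypothetical.

```python
from typing import Any, Mapping

# Corroborating fields named in the text; "mother_dob" and "postcode" are
# assumed to hold the encrypted values, normalised to one key per field.
CORROBORATING_FIELDS = (
    "postcode", "mother_dob", "birweit", "lopatid", "procode", "gpprac",
)


def is_provisional_pair(
    mother: Mapping[str, Any],
    baby: Mapping[str, Any],
    min_agreements: int = 2,
) -> bool:
    """Return True if a mother record and a baby record form a provisional pair."""
    # The baby's date of birth must be present and agree exactly.
    m_dob, b_dob = mother.get("baby_dob"), baby.get("baby_dob")
    if m_dob is None or m_dob != b_dob:
        return False
    # Count positive matches; missing values are not counted (assumption).
    agreements = sum(
        1
        for field in CORROBORATING_FIELDS
        if mother.get(field) is not None and mother.get(field) == baby.get(field)
    )
    return agreements >= min_agreements
```

Applied to every candidate mother-baby combination, a rule of this kind can return several provisional pairs for the same baby or the same mother, which is why a separate resolution step is needed.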

Probabilistic matching
In order to compare one match pair with another, to identify the "best" pair where deterministic methods caused a conflict, competing pairs were scored by assigning weights to each match field. The additional fields used in probabilistic matching were hospital admission date ("admidate"), discharge date ("disdate"), place of delivery ("delplace"), delivery method ("delmeth"), purchaser code ("purcode"), resident local authority ("resladst"), mother's age ("age" on the mother's record and "matage" on the child's record), sex of infant ("sex" on the child's record and "sexbaby" on the mother's record), birth order ("birorder"), total number of babies ("numbab"), gestational age ("gestat"). These fields were used along with the deterministic matching fields above to corroborate a match, to identify multiple births, and to determine whether one match pair was better than another where the deterministic match fields created multiple matched pairs (of which only one pair was likely to be a true match).
Matching fields with more distinct values discriminate better between candidate pairs than those with fewer distinct values, and were therefore assigned higher scores for agreement. Negative scores were assigned for mismatches. Competing pairs were sorted and ranked by total score to identify the highest-scoring pair. A pair was accepted only if it had the highest score in both mother-to-baby matching (i.e. where the baby was the best match for the mother) and in baby-to-mother matching (i.e. where the mother was the best match for the baby). Pairs that did not meet these criteria were discarded. After match resolution, there were 7,335,218 pairs from 1 April 1998 to 31 March 2012.
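The scoring and mutual-best-match rule can be illustrated as follows. This is a sketch under stated assumptions rather than the study's implementation: the weights in WEIGHTS are invented for illustration (the text does not report the actual values), fields with a missing value are assumed to contribute nothing, ties are broken arbitrarily in favour of the first pair encountered, and the special handling of multiple births is omitted. The names score_pair and resolve are hypothetical.

```python
from typing import Any, Iterable, Mapping

# Illustrative (agreement, mismatch) weights only; fields with more distinct
# values get a larger agreement weight, and mismatches are scored negatively.
WEIGHTS = {
    "postcode": (5.0, -3.0),
    "birweit": (4.0, -2.0),
    "sex": (1.0, -4.0),
}


def score_pair(mother: Mapping[str, Any], baby: Mapping[str, Any]) -> float:
    """Sum the per-field weights for one candidate mother-baby pair."""
    total = 0.0
    for field, (agree, mismatch) in WEIGHTS.items():
        m, b = mother.get(field), baby.get(field)
        if m is None or b is None:
            continue  # missing data neither rewards nor penalises (assumption)
        total += agree if m == b else mismatch
    return total


def resolve(pairs: Iterable[tuple[str, str, float]]) -> list[tuple[str, str]]:
    """Keep only pairs that score highest for both the mother and the baby."""
    pairs = list(pairs)
    best_for_mother: dict[str, tuple[str, float]] = {}
    best_for_baby: dict[str, tuple[str, float]] = {}
    for mother_id, baby_id, score in pairs:
        if mother_id not in best_for_mother or score > best_for_mother[mother_id][1]:
            best_for_mother[mother_id] = (baby_id, score)
        if baby_id not in best_for_baby or score > best_for_baby[baby_id][1]:
            best_for_baby[baby_id] = (mother_id, score)
    accepted = [
        (mother_id, baby_id)
        for mother_id, baby_id, _ in pairs
        if best_for_mother[mother_id][0] == baby_id
        and best_for_baby[baby_id][0] == mother_id
    ]
    return sorted(set(accepted))
```

In this sketch, a mother whose best-scoring baby is better matched to a different mother ends up with no accepted pair, which mirrors the restrictive stance of discarding uncertain matches.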