Background

The linkage of routinely collected perinatal and other administrative health data has broadened the scope of maternal and child health research, as it enables researchers to establish and follow up large samples or whole populations and ascertain multiple factors for risk adjustment [1]. Linking perinatal to pharmaceutical dispensing data offers a valuable approach for pharmacovigilance and the examination of medication safety in pregnancy [2], given ethical concerns about including pregnant women in clinical trials [3] and the bias associated with voluntary reporting to post-market pharmaceutical surveillance systems [4].

In many countries, including Australia, unique individual identifiers are not available across all of the administrative data collections relevant to perinatal research. In this situation, probabilistic linkage methods are used to link individuals’ records [5, 6], but probabilistic linkage is not perfect [7]. Previous studies have reported that the sensitivity (i.e. the proportion of truly matched records identified) of probabilistic linkage ranges from 74 to 98%, and specificity (i.e. the proportion of truly unmatched records identified) ranges between 99 and 100% [8]. False and missed matches can introduce bias and affect the validity of research findings. While data linkage units aim to improve the quality of linkage, there is a growing consensus that data cleaning (i.e. detecting, diagnosing, and editing data anomalies) [9] and proper documentation are essential aspects of quality assurance [9,10,11]. The RECORD Statement recommends that observational studies using routinely collected health data provide information on the process and quality of linkage and data cleaning [10]. Furthermore, systematic checks have the potential to improve the quality of future linkages through the provision of feedback to data linkage units.

In studies that involve cross-jurisdictional linkages, additional data cleaning considerations are required. Australia has a federated health care system with delivery and administration of services being the responsibility of either States/Territories (e.g. hospital services) or the Federal government (e.g. subsidised pharmaceuticals). In this setting, cross-jurisdictional linkage brings together diverse and rich data sources, enabling national-level research studies [12]. Cross-jurisdictional linkage performed by different data linkage units, however, is subject to discrepancies resulting from variations in the use of personal identifiers, techniques for constructing linkage keys and quality assurance policies. Consistency checks, therefore, are vital before merging records from different States.

Cleaning linked data is a complex process that requires thorough planning and knowledge of data collection methodologies and the validity of the data items. While there are existing frameworks and checklists for data cleaning [9, 11], there is little literature describing how to systematically examine the consistency of content data in linked perinatal records [13], or how to identify and resolve disparities arising from cross-jurisdictional linkages. Additionally, researchers rarely provide their coding syntax, making it difficult to replicate their data cleaning procedures. This paper presents a series of steps for assessing data consistency and cleaning in the Smoking MUMS (Maternal Use of Medications and Safety) Study [14], which involves the linkage of perinatal records from two Australian states—New South Wales (NSW) and Western Australia (WA)—to national Pharmaceutical Benefits Scheme (PBS) claims data. The exemplar documentation and SAS code presented in the paper can be adopted in similar studies.

Methods

Study design and data sources

The Smoking MUMS Study is an observational cohort study including all women who delivered in NSW and WA between 1 January 2003 and 31 December 2012, and their babies. For mothers, perinatal records (i.e. the mother’s deliveries, including pre-2003 records) were linked to hospital separations (i.e. hospital discharge), emergency department (ED) attendances, death, and pharmaceutical claims records. For babies, perinatal records (i.e. the baby’s birth) were linked to hospital, ED and death data. Congenital defect notifications were included in the linkage for babies born in NSW (Fig. 1). New South Wales is Australia’s most populous State with more than 7.5 million residents, while WA has a population of 2.6 million [15]. Table 1 describes the data collections used in the study.

Fig. 1 Data linkage and examples of data set layouts

Table 1 Descriptions of data sets

Data linkage

All the linkages for the Smoking MUMS Study used probabilistic linkage methods and a privacy-preserving approach [16,17,18]. Specifically, personal identifiers were separated from health information, with the data linkage units receiving personal identifiers only (i.e. no health information) and encrypted record IDs from the data custodians. The linkage units assigned a project-specific person number to all records that belonged to the same person and returned these person numbers and encrypted record IDs to the respective data custodians, who released the approved research variables together with the person numbers (i.e. no personal identifiers) to the researchers [16,17,18].

In NSW, the Centre for Health Record Linkage (CHeReL) has established a Master Linkage Key to routinely link the Perinatal Data Collection with the other NSW data collections (Table 1), except the Register of Congenital Conditions, which was specifically linked for NSW babies in this project. Likewise, the WA Data Linkage Branch (WA DLB) regularly links the Midwifery Notification System to the other WA data collections (Table 1). The Master Linkage Keys in NSW and WA are regularly updated and assessed via robust quality assurance procedures. The false positive rates for NSW and WA were estimated to be 0.3% and 0.05%, respectively [19, 20]. Once the linkages for the mother and baby cohorts were finalised, the CHeReL and WA DLB created a Project Person Number for each mother (mumPPN) and each baby (babyPPN, mapped to mumPPN).

In Australia, records of claims for pharmaceutical dispensing processed by the Federal government PBS are not routinely linked to State-based health records. For this study, the PBS data custodian assigned a project-specific Patient Identification Number (PATID) to each woman who had claim records and provided PATIDs and personal details to the Australian Institute of Health and Welfare (AIHW) Integration Services Centre, while CHeReL and WA DLB provided the list of mumPPNs and identifiers (Fig. 1). The AIHW conducted probabilistic linkages based on personal identifiers and assigned weights (i.e. the degree of similarity between pairs of records, with higher weights indicating greater similarity) to matches between PATIDs and PPNs. Based on AIHW clerical reviews, the recommended threshold for accepting matches to NSW mumPPNs was 29.0 (link rate 99.43%, link accuracy 98.62%) and for matches to WA mumPPNs was 28.0 (link rate 99.02%, link accuracy 98.65%) [21]. Separate mapping tables for each State, including any PATID-PPN matches with weight ≥ 17, were released to researchers, as were separate files containing claims records relating to the PATIDs included in the mapping tables (Table 1, Fig. 1). The release of claims records for matches with weights lower than the recommended threshold allows for sensitivity analyses in which different thresholds are used.
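For illustration, a minimal SAS sketch of how accepted links can be extracted from a mapping table and used to subset claims records is given below; the data set and variable names (map_nsw, pbs_claims, weight, patid, mumppn) are assumptions for the sketch, not the released file names.

```sas
/* Sketch: keep PATID-mumPPN matches at or above the recommended
   threshold and subset the claims records (assumed names). */
data accepted_links_nsw;
    set map_nsw;
    if weight >= 29.0;      /* use 28.0 for the WA mapping table */
run;

/* The threshold can be varied for sensitivity analyses because
   links with weight >= 17 were released. */
proc sql;
    create table pbs_claims_nsw as
    select c.*, l.mumppn
    from pbs_claims as c
         inner join accepted_links_nsw as l
         on c.patid = l.patid;
quit;
```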

Steps to check consistency of State-based data

Prior to the assessment of data consistency, all data sets were examined to make sure that all variables and associated data dictionaries were delivered as expected, and that the numbers of persons and records were in accordance with reports provided by the data linkage units. The mother’s and the child’s hospital separation records corresponding to the delivery of the mother and the birth of the child were carefully identified based on previously reported methods [6]. The range of data values, distribution by year and missing values were explored for all variables. Data items that underwent historical changes (as per data dictionaries or the published midwife notification forms) were examined to determine whether the distribution of the data was consistent with the documented changes (results not shown).
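The descriptive checks performed at this stage can be expressed in a few lines of SAS; the sketch below uses assumed data set and variable names (perinatal, year_of_birth, plurality, birthweight, gest_weeks).

```sas
/* Sketch: distribution and missingness of key variables by year */
proc freq data=perinatal;
    tables year_of_birth*plurality / missing;
run;

/* Sketch: range checks for continuous data items */
proc means data=perinatal n nmiss min max mean;
    class year_of_birth;
    var birthweight gest_weeks;
run;
```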

Consistency of State-based data was assessed through a series of steps (Fig. 2 and Table 2).

  • Steps 1 to 3 examined the uniqueness of records.

  • Steps 4 to 8 checked the consistency within and across pregnancies based on perinatal data items, including baby date of birth (DOB), parity, pregnancy plurality, birth order, gestational age, and birthweight. These variables were used because previous validation studies have reported high levels of accuracy in their recording [22, 23]. Parity was defined as the number of previous pregnancies of ≥20 weeks’ gestation and was numerically coded (e.g. 0, 1, 2, 3). Plurality classified pregnancies as single or multiple-fetus (coded as singleton, twins, triplets, quadruplets, etc.), while birth order indicated the order in which each baby was born (coded as 1st, 2nd, 3rd, etc.). Plural pregnancies generated more than one perinatal record; these records contained the same maternal information but baby-specific information, including birth order. Gestational age was defined as the number of completed weeks of gestation. Date of conception was calculated as baby DOB − (completed weeks of gestation × 7) + 14 days (a SAS sketch of this calculation and the related pregnancy-overlap check is shown after this list).

  • Steps 9 to 16 assessed the consistency of information across data sources, including consistency between unique events (birth, death) and episodes of health service use. These steps capitalised on the availability of the same information (e.g. baby DOB, used interchangeably with date of delivery, and mother’s month and year of birth) in multiple data sets and on the validity of these variables [22, 23].
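A minimal SAS sketch of the conception-date calculation and the overlapping-pregnancy check (Step 6) follows; it assumes one record per pregnancy in a data set named pregnancies with variables mumppn, babydob and gest_weeks.

```sas
/* Derive date of conception: baby DOB - (completed weeks x 7) + 14 days */
data pregnancies2;
    set pregnancies;
    conception_date = babydob - (gest_weeks * 7) + 14;
    format conception_date date9.;
run;

proc sort data=pregnancies2;
    by mumppn babydob;
run;

/* Flag women whose next pregnancy appears to have been conceived
   before the previous delivery (a likely false link, Step 6) */
data overlap_flags;
    set pregnancies2;
    by mumppn;
    prev_dob = lag(babydob);
    if first.mumppn then prev_dob = .;
    overlap_flag = (prev_dob ne . and conception_date < prev_dob);
    format prev_dob date9.;
run;
```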

Fig. 2 Summary of data cleaning steps and results

Table 2 Steps undertaken to assess consistency of State-based data

On-screen scrutiny of relevant records was undertaken (as indicated in Table 2) when multiple entries of the same death (Step 1) or birth (Step 3) were suspected (i.e. partial duplicates), using additional information (e.g. demographic details, birthweight, Apgar scores, delivery hospital, hospital diagnoses and discharge status). Manual review of these records was time-efficient because inconsistencies were found in only a small number of cases.
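As a sketch of how candidate partial duplicates could be listed for manual review, the following SAS step (with an assumed data set name, deaths, and the person number mumppn) counts death records per person; any person with more than one record would then be reviewed on screen.

```sas
/* Sketch of Step 1: persons with more than one death record */
proc sql;
    create table multiple_deaths as
    select mumppn, count(*) as n_death_records
    from deaths
    group by mumppn
    having count(*) > 1;
quit;
```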

Identified inconsistencies were categorised as person-level or record-level. Person-level inconsistencies suggest likely false positive links, and the persons concerned were flagged for “exclusion” from future data analyses. Examples include a woman who conceived a second child before delivering her first child (Step 6) or who had a baby after a total hysterectomy procedure (Step 13). In some cases, errors were identified for a child (e.g. date of admission later than date of death) while no inconsistencies were identified for the mother. In those cases, the mother and the records for all of her children were flagged for “exclusion”.

Findings such as duplicates, missing data, invalid data or likely typographical errors, and records where the date of admission was later than the date of discharge were considered random and classified as record-level. Duplicates were flagged, and missing values or typographical errors were corrected where plausible. Hospital separation and ED records found to contain inconsistent dates of birth, admission and discharge (Steps 9, 14 and 16) were flagged for “deletion”. Inconsistencies for which no changes were made were quantified and documented for consideration in specific analyses.
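A minimal sketch of a record-level date check on hospital separations is shown below; the data set and variable names (hospital, admdate, sepdate) are assumed for illustration.

```sas
/* Flag (rather than delete) separation records whose admission date
   falls after the discharge date */
data hosp_flags;
    set hospital;
    flag_date_error = (admdate ne . and sepdate ne . and admdate > sepdate);
run;

/* Quantify the flagged records for documentation */
proc freq data=hosp_flags;
    tables flag_date_error;
run;
```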

At the completion of each step, new variables were created and merged into the original data sets rather than deleting records or overwriting data values; this allowed the original data content to remain unmodified. For efficiency, decisions reached at each cleaning step were applied before undertaking the subsequent step (e.g. removal of duplicates and the use of the corrected birth order to select one record per pregnancy).
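The approach of adding flag and corrected variables, rather than overwriting source values, can be sketched as a simple one-to-one merge; the record key (recordid) and the correction variable names below are assumptions.

```sas
/* Attach step-specific flags and corrected values to the original
   records while keeping every original record unchanged */
proc sort data=perinatal;   by recordid; run;
proc sort data=corrections; by recordid; run;

data perinatal_clean;
    merge perinatal (in=in_orig)
          corrections (keep=recordid dup_flag parity_corrected);
    by recordid;
    if in_orig;   /* retain all original records; corrections attach where present */
run;
```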

Steps to check cross-jurisdictionally linked data

Table 3 and Fig. 2 present the steps (17 to 22) used to resolve discrepancies in the linkage performed by the different linkage units and to assess the validity of apparent cross-State links. Specifically, cases where a PBS PATID matched to multiple mumPPNs were detected and sent to the AIHW linkage unit for review, through which clusters of mumPPNs (i.e. records likely to belong to the same woman) were identified (Step 17) and assessed for person-level consistency (Step 19). Step 20 examined consistency among records for women who had records in both States. Following the creation of the variable finalPPNmum (Step 21) to integrate mothers’ records, consistency was checked for finalPPNmums that had multiple PATIDs (Step 22).
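The detection of PATIDs matched to multiple mumPPNs (Step 17) can be sketched as a simple aggregation over the accepted mapping table; the data set name (accepted_links) and variable names are assumptions.

```sas
/* Sketch of Step 17: PBS PATIDs linked to more than one mumPPN */
proc sql;
    create table patid_clusters as
    select patid, count(distinct mumppn) as n_mumppn
    from accepted_links
    group by patid
    having count(distinct mumppn) > 1;
quit;
```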

Table 3 Extract recommended PBS links and steps undertaken to check cross-jurisdictional linkage

All analyses were performed in SAS 9.3. Samples of SAS code are provided in Additional file 1.

Results

The checks for consistency of State-based data (Table 2) suggested false links for 703 women in NSW (0.12%) and 90 women in WA (0.05%), and flagged these women for “exclusion”. Corrections were made in 2062 perinatal records for variables including birth order (10 records), parity (1379 records), and baby date of birth (673 records) and in 41,516 hospital separation and ED records for baby’s age.

In the assessment of cross-jurisdictional links (Table 3), Step 19 flagged an additional 149 mumPPNs for “exclusion” and confirmed 3323 clusters of mumPPNs, while Step 20 identified 1986 women who had records in both States. Records of these mumPPN clusters and cross-State mothers were integrated through the construction of the variable finalPPNmum (Step 21), which was used as the new person number for the mothers. The last step further identified 2763 finalPPNmums for “exclusion”, bringing the total number of women flagged for “exclusion” from future data analyses to 3705. The final cohort included 774,449 women and 1,225,341 babies born between 2003 and 2012. In this cohort, about 4.6% of women had an expected number of pregnancies greater than the number of deliveries recorded in the perinatal data, suggestive of additional births elsewhere, and 4.5% had likely errors in the recording of parity. In 1838 cases, finalPPNmums were matched to two or more PBS PATIDs.

From the original mapping tables (shown in Table 1), 625,972 PBS links with weights at or above the recommended threshold were extracted and, among those, 16,138 matches (2.6%) were further disregarded (Table 3). For the remaining 608,834 matches, 14,212,875 claims records were subset for the final mother cohort.

Discussion

In this perinatal cross-jurisdictional data linkage study, we developed a series of steps to identify and, where appropriate, correct inconsistent data values. The methods were based on standard and reliable content data items [22, 23] and thus can be adopted in other perinatal research. The methods included a stepwise approach to resolving disparities in the linkage performed by different linkage units and identifying women who had records in more than one State, for whom integration of records is required for analyses.

Data errors are commonly detected incidentally during statistical analyses or the interpretation of results, leading to inefficient checking of data and repeated analyses [9, 11] and, potentially, a lack of reproducibility of results if ad-hoc or undocumented data edits are made. We found inconsistencies that were indicative of false positive links, and clusters of women’s IDs that suggest missed State-based links. These findings were fed back to the State-based data linkage units for further examination and rectification prior to future linkages, conferring benefits on other data users. Researchers play an important role in contributing to quality assurance through systematic assessment of data consistency, given that content data have not traditionally been accessible to data linkage units under the “best practice” protocol [16,17,18]. The detection of the probable missed links improved data completeness, matching a further 448 perinatal records to records of the maternal hospital admission for the delivery. Assessment of the consistency of the recording of parity identified women who might have had additional births elsewhere (4.6%) and women with likely errors in the recording of parity (4.5%). Obstetric history is particularly important for longitudinal analyses or the evaluation of interventions or exposures in the period between pregnancies.

In this study, the proportion of NSW women who were flagged for “exclusion” was lower than the false positive rate estimated by the data linkage unit in NSW (0.12% vs. 0.3%), while for WA women these proportions were similar (0.05% vs. 0.05%). This study was unable to examine the characteristics of the unlinked perinatal records, although previous studies have reported that unmatched records might have different maternal and pregnancy characteristics compared with fully linked records [6, 7, 24]. Limitations in the data cleaning methods should also be acknowledged. Assessment of parity was less likely to detect link errors among women with fewer perinatal records, and the cut-off used to flag “exclusion” due to inconsistencies in parity and mother’s year of birth was based on a conservative decision. Given the discrepancies in baby date of birth found in 667 perinatal records (0.05% of the babies), birth registrations as an additional data source would potentially be helpful in assessing these discrepancies. Following the checks for clusters of mumPPNs within a PBS PATID (Step 19), an anomaly in the opposite direction (i.e. clusters of PBS PATIDs within a finalPPNmum) was present in 1838 cases (Step 22). For these women, the recording of parity and of month and year of birth was consistent, but no further checks using dispensing data were performed. Checking the consistency of clinical information against medicines dispensed was deemed inappropriate given that maternal morbidities recorded in the perinatal, hospital and ED data might not require pharmacotherapy. Furthermore, our PBS data extract did not contain records for all medicines, nor did the PBS data contain records for all subsidised medicines dispensed (i.e. prior to April 2012 only subsidised medicines dispensed to social security beneficiaries were captured completely) [25]. The presence of more than one identifier in the PBS data suggests that additional pharmaceutical dispensing will be attributed to these women, perhaps inappropriately; hence, sensitivity analyses excluding these women should be considered.

The data cleaning process outlined in this manuscript can be summarised into stages that can be adopted in studies based on administrative health data. Moreover, the majority of the specific checks undertaken in this study are generalisable to other studies. As a first step, it is important to gather the information necessary to inform the development of a data cleaning plan. This includes descriptions of the data collections, the variables and associated data dictionaries, the reliability of the recording of these variables, and the procedures through which the project’s data were linked. It is advisable that the researcher examines the distribution of the data (e.g. frequencies, cross-tabulations); unusual patterns should be discussed with the data custodians and with researchers experienced in working with the same data source.

In this study, for example, it was noticed that hospital records of healthy newborns were included in the NSW data but were typically excluded (84%) from the WA hospital admission data.

Subsequently, it is essential to draft a plan outlining general rules about the decisions to be made for identified errors and the content of specific checks (i.e. objectives and detailed algorithms). Factors to consider when creating general rules include whether there will be data sharing among analysts or use of the data for multiple research objectives, the potential causes of errors (e.g. incorrect links, inconsistent patient responses, inaccurate recording, typographical errors) and the possible implications of decisions. Data in this project are used for several sub-studies; therefore, no deletion or overwriting of the original data values was made, and instead flag variables and corrected data values were added. Data analysts were provided with detailed documentation, including notes on inconsistencies for which no changes were made, so that informed decisions could be made for specific analyses. The decision regarding how to handle an error was guided by the probable cause of the error. Flags for exclusion were applied to the mother (and thus to all her children) even if a linkage error was found for a child, because excluding only the problematic pregnancy record could affect analyses that investigate or control for outcomes of the prior pregnancy or health service utilisation (e.g. medication use, hospital procedures) between pregnancies. Where possible, missing, invalid and erroneous data were corrected. Flags for deletion were applied to ED or hospital records that contained inconsistencies in dates. Duplicates were flagged for removal. No changes were made for “grey”, unexplainable inconsistencies.

In terms of planning for specific consistency checks, a structured approach should be used to ensure that important aspects are covered and to avoid digressing. Factors that can be used to inform which data items should be checked, and the sequence of the checks, include the method of linkage (i.e. deterministic, probabilistic), the base data set and its variables (i.e. the data sets used to derive the study population), commonalities between data sets, the coherence between different pieces of information that relate to the same event, the uniqueness of an event or expected findings, and the likely consequences of unmanaged inconsistencies. It is easier to conduct the checks in order of increasing complexity, commencing with checks of data items within a record, followed by examining consistency between records within the same data set, before checking records across data sets.

Our check for missing death registration records (Step 2), applicable only to the NSW death data, demonstrates the “uniqueness” rationale that can be applied in all studies and data sources. For projects that involve cross-jurisdictionally linked data, the checks for consistency in the ID matching (e.g. Step 17) illustrate how the “uniqueness” rationale can be used to identify potentially incorrect links when study participants are represented by different sets of IDs. In studies where ID mapping tables are not provided by the cross-jurisdictional data linkage unit (i.e. the IDs are embedded in the data sets), researchers are advised to create the mapping tables by summarising the ID variables in order to identify inconsistencies. Checking for consistency among people identified as moving between jurisdictions and the integration of IDs (Steps 19–21) are essential for all studies using cross-jurisdictional linkage of person-level unit records. A failure to identify and manage inconsistencies in ID matching would result in the loss (in a one-to-many merge) or over-collation (in a many-to-many merge) of information.
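Where the IDs are embedded in the released data rather than supplied as a mapping table, the mapping can be reconstructed by summarising the ID variables; the data set name (claims) and variable names in the sketch below are assumed.

```sas
/* Reconstruct an ID mapping table from embedded identifiers and
   check for many-to-many relationships before merging */
proc sql;
    create table id_map as
    select distinct patid, mumppn
    from claims;

    create table id_problems as
    select patid, count(distinct mumppn) as n_mumppn
    from id_map
    group by patid
    having count(distinct mumppn) > 1;
quit;
```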

During the development of the algorithms, it is critical to make sure that the exclusion of study participants is not related to their health status or outcomes (i.e. that the algorithms do not create selection bias). Such selection bias can arise because people with multiple contacts with health services have a higher chance of inconsistencies being identified. The decision to classify inconsistencies as incorrect links, therefore, should be based on biological and chronological plausibility and on the coherence between different data items. Inconsistencies that are biologically and/or chronologically impossible (e.g. different women mapped to a single ID of the child, medications dispensed years after the date of death) are indicative of incorrect linkage. When linkage errors cannot be ruled out immediately, additional information obtained from related variables or records can help to inform decisions. For example, the dates of the maternal hospital separation associated with the delivery were used to verify baby DOB (Steps 6 and 9), or inconsistencies were found in more than one data item (mother’s sex and month/year of birth, as in Step 12). When decisions about reasonable values or patterns are imposed, it is important to evaluate the implications of the chosen cut-offs by quantifying the extent of the exclusions. For instance, a conservative decision was made for inconsistencies in parity (Step 8.3.1), as a less restrictive criterion (i.e. expected number of pregnancies = 1 and a count of pregnancy records ≥ 3 instead of ≥ 4) would have resulted in an additional 156 women being flagged for exclusion (578 instead of 422 women).
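The impact of alternative cut-offs can be quantified directly; the sketch below assumes a per-woman summary data set (parity_check) containing the expected number of pregnancies and the count of pregnancy records.

```sas
/* Compare the number of women flagged under the current and a less
   restrictive parity criterion (Step 8.3.1) */
proc sql;
    select sum(expected_pregnancies = 1 and n_pregnancy_records >= 4)
               as n_flagged_current,
           sum(expected_pregnancies = 1 and n_pregnancy_records >= 3)
               as n_flagged_less_restrictive
    from parity_check;
quit;
```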

Conclusion

In conclusion, comprehensive and well-documented data consistency checks prior to commencing planned statistical analyses will improve the quality and reproducibility of perinatal research using linked administrative data. The data cleaning methods developed for the Smoking MUMS Study are recommended for other perinatal linkage studies, with appropriate modifications based on knowledge of the data collections and the validity and coherence of the data items. Adoption of similar data cleaning methods across studies will assist in making comparisons across jurisdictions and countries, as well as across studies using ostensibly the same source data sets.