Study objectives:
1) Develop a protocol to geocode infectious disease notifications, using notified cryptosporidiosis case data;
2) Evaluate the reliability of the developed protocol;
3) Produce a geocoded spatiotemporal dataset comprising notified infectious disease cases, with each case linked to one Census 2016 Small Area (SA), or larger spatial unit.
CSO Census Small Areas
A dataset comprising notified cryptosporidiosis cases was geocoded, with each case linked to one CSO Census 2016 Small Area (SA), or larger spatial unit. SAs are delineated spatial areas, generally comprising between 80 and 120 dwellings (mean: 90) and represent the smallest geographical unit (i.e. the unit of highest spatial resolution) available for compilation/reporting of statistics in line with data protection regulations. They generally comprise either complete or parts of townlands, or neighbourhoods. SAs nest within Electoral Division (ED) boundaries, which are the smallest legally defined administrative areas in Ireland. EDs are recognised by the European Union as ‘local administrative units (LAU) Level 2’. As of 2016, there were 18,641 SAs and 3440 EDs delineated in Ireland for the national Census [20, 21].
Ethical approval, data protection and data sources
Following receipt of ethical approval for this study from the Royal College of Physicians of Ireland Research Ethics Committee (RECSAF_84), access to address-level infectious disease notification data from Ireland’s CIDR system [22] at the national level was sought and subsequently granted by the National CIDR Peer Review Group. For data protection purposes, aside from address information, personal data and disease-specific information were excluded from the datasets during the geocoding process. Datasets for geocoding were only accessed by designated partners working within the HSE, including the Health Protection Surveillance Centre (HSE-HPSC), the Department of Public Health-East and the Health Intelligence Unit (HIU). Data security and confidentiality were maintained at all times, in compliance with the requirements of data protection legislation, specifically the General Data Protection Regulation (GDPR) and the Data Protection Act 2018. HSE-HPSC is accredited for Information Security Management ISO 27001.
CIDR is an information system developed to manage the surveillance and control of infectious diseases in Ireland, using standard case definitions for all notifiable diseases, as per the Infectious Diseases (Amendment) Regulations 2020 (S.I. No. 53 of 2020) [22]. The cryptosporidiosis dataset comprised 10 years (2008–2017) of national, address-level notification data from CIDR. The variables used for geocoding included a unique identifier (CIDR Event ID) and the spatial identifier variables: Address Line 1, Address Line 2, Town, Suburb, Postcode and County.
Geocoding
This protocol employs the following: (i) a Health Intelligence Unit (HIU) in-house geocoding program and (ii) the Geo Reference tools on the Health Atlas Ireland platform. Health Atlas Ireland comprises a suite of software tools developed more recently by the HIU to provide role-based web access to key health-related datasets and is limited to HSE and partner organisations [23,24,25]. Geo Reference, one of the software applications accessed via Health Atlas Ireland, uses An Post’s GeoDirectory for the purposes of geocoding, thus facilitating mapping and geospatial analyses. The GeoDirectory is a definitive reference dictionary of addresses for all 1.9 million buildings that receive post in the Republic of Ireland, assigning them with precise postal and geographic addresses. Irish postcodes (called ‘Eircodes’) are generated by the company Eircode and licensed to other parties [19], including GeoDirectory.
Geocoding of residential addresses associated with cryptosporidiosis notification data involved a series of string-matching algorithms i.e. exact string and finite string matching, undertaken sequentially in three phases (Phases 1–3); two automated phases and one manual phase (Fig. 1). In this study, the address of a cryptosporidiosis case reported in the CIDR dataset is referred to as the ‘reference address’. All address fields (Address Lines 1 & 2, Town, Suburb, Postcode and County) in the CIDR dataset were included in the geocoding process. Each record was manually assigned a numeric match type code, defining the type of spatial unit the address was geocoded to (Table 1).
Table 1 Definitions for generated variable ‘Match Type Code’ (SA, small area; ED, electoral division)
Geocoding Phase 1
An automated geocoding program attempts to match each address in the submitted CIDR dataset (i.e. the ‘reference address’) to an address from An Post’s GeoDirectory by searching for individual components (e.g. house number/name, apartment complex, street, townland, suburb, town, county) of the reference address among addresses in the GeoDirectory, using exact string matching. In the event of a likely address match identified by several identical or near-identical components, the program returns the XY geographical coordinates for the matched address (i.e. the ‘returned address’). For any address that remains unmatched (i.e. no returned address identified), the program searches again using synonyms from a pre-existing data frame of paired strings (e.g. ‘George St Gt N’ paired with ‘North Great George’s Street’), using literal string matching. The algorithm is thus described as a ‘naive string search’ with ‘normalisation’ [26]. The program does not provide a measure of distance or proximity for each match.
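The two-pass logic described above (exact matching, then a retry after synonym substitution) can be illustrated with a minimal sketch. This is not the HIU in-house program: it matches whole normalised address strings rather than individual components, and the gazetteer entries, coordinates, synonym table and function names are all hypothetical placeholders.

```python
import re

# Hypothetical miniature gazetteer: normalised address -> (x, y) coordinates.
# The coordinates are placeholders, not real GeoDirectory values.
GAZETTEER = {
    "12 north great george's street dublin 1": (715800.0, 735200.0),
    "4 main street ballybough dublin 3": (716900.0, 735900.0),
}

# Hypothetical synonym pairs (abbreviated form -> canonical form).
SYNONYMS = {
    "george st gt n": "north great george's street",
}


def normalise(address: str) -> str:
    """Lower-case, unify apostrophes, drop commas/full stops, collapse spaces."""
    text = address.lower().replace("’", "'")
    text = re.sub(r"[.,]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def apply_synonyms(address: str) -> str:
    """Replace each known abbreviated form with its canonical paired string."""
    for short, canonical in SYNONYMS.items():
        address = address.replace(short, canonical)
    return address


def geocode_exact(reference_address: str):
    """Return (x, y) for an exact match, retrying once with synonyms; else None."""
    key = normalise(reference_address)
    if key in GAZETTEER:
        return GAZETTEER[key]
    # Second pass: literal substitution from the synonym table, then exact lookup.
    return GAZETTEER.get(apply_synonyms(key))
```

For example, `geocode_exact("12 George St. Gt. N., Dublin 1")` fails the first lookup and succeeds only after the synonym substitution, while an address absent from the gazetteer returns `None` and would be carried forward to the next phase.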
A unique address match was considered to be achieved when a unique dwelling identifier in the returned address matched that in the reference address. A unique dwelling identifier refers to a house name, a house number in combination with a unique street name, an apartment number in combination with a unique apartment complex name, or a complete Eircode. Where the returned address was matched to SA level, the output was validated against a set of pre-defined validation criteria developed by two members of the research team (Table 2).
Table 2 Validation criteria for all output records with a reference address possible match to returned address from Phase 1 and 2 automated geocoding
An additional variable ‘Match Type Code’ was created in the dataset to code the address match level (Table 1). Addresses that met the validation criteria for a unique address match to XY coordinates were coded ‘1’ and records with a validated address match to one Small Area (SA), but not to a unique address, were coded ‘2’. Records that remained unmatched after Phase 1, including those that were matched but did not pass validation, were coded ‘0’ (Fig. 1). The objectives of the validation criteria and the match type coding were to improve the quality of the geocoding process via development of a series of ‘data-bins’ i.e. grouping of records based on the level of geocoding validity and reliability. The match type code also served as a measure of confidence in the match.
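The Phase 1 coding rule for codes ‘0’, ‘1’ and ‘2’ can be expressed as a small function. The sketch below is illustrative only: the record structure and field names are assumptions, the validation criteria themselves (Table 2) are represented by a single boolean flag, and any further codes defined in Table 1 are not covered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Phase1Result:
    """Hypothetical container for one record's Phase 1 output."""
    event_id: str                    # CIDR Event ID
    returned_address: Optional[str]  # None when no address was returned
    unique_dwelling_match: bool      # unique dwelling identifier agrees
    single_sa_match: bool            # match resolves to exactly one Small Area
    passed_validation: bool          # outcome of the Table 2 criteria


def match_type_code(result: Phase1Result) -> int:
    """Assign the match type code after Phase 1 (0, 1 or 2 only)."""
    if result.returned_address is None or not result.passed_validation:
        return 0  # unmatched, or matched but failed validation
    if result.unique_dwelling_match:
        return 1  # validated unique address match to XY coordinates
    if result.single_sa_match:
        return 2  # validated match to one SA, but not to a unique address
    return 0      # anything else is carried forward to the next phase
```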
Once matching was complete, each address was spatially attributed via its geographical coordinates to the corresponding SA, i.e. reverse-geocoded from a point to a geographical unit that was specifically developed for population analyses. Reverse geocoding to the geographical centroid of the SA (i.e. its latitude/longitude) maintains spatial accuracy while also safeguarding anonymity.
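Attributing a point to its enclosing SA is, in essence, a point-in-polygon test. The following sketch uses a ray-casting test and the shoelace formula for the centroid. It is an illustration under stated assumptions, not the procedure used in the study: the two square ‘Small Areas’, their identifiers and their coordinates are invented, and real SA boundaries would normally be handled with a GIS library.

```python
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float]
Polygon = List[Point]

# Hypothetical Small Area boundaries (simple squares in arbitrary units).
SMALL_AREAS: Dict[str, Polygon] = {
    "SA_A": [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0), (0.0, 2.0)],
    "SA_B": [(2.0, 0.0), (4.0, 0.0), (4.0, 2.0), (2.0, 2.0)],
}


def point_in_polygon(x: float, y: float, polygon: Polygon) -> bool:
    """Ray casting: count boundary crossings of a ray running towards +x."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def polygon_centroid(polygon: Polygon) -> Point:
    """Area-weighted centroid of a simple polygon (shoelace formula)."""
    cross_sum = cx_sum = cy_sum = 0.0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        cross = x1 * y2 - x2 * y1
        cross_sum += cross
        cx_sum += (x1 + x2) * cross
        cy_sum += (y1 + y2) * cross
    area = cross_sum / 2.0
    return (cx_sum / (6.0 * area), cy_sum / (6.0 * area))


def reverse_geocode(x: float, y: float) -> Optional[Tuple[str, Point]]:
    """Return (SA identifier, SA centroid) for the SA containing the point."""
    for sa_id, polygon in SMALL_AREAS.items():
        if point_in_polygon(x, y, polygon):
            return sa_id, polygon_centroid(polygon)
    return None
```

Storing only the SA identifier and centroid, rather than the original point, is what allows the dwelling-level coordinates to be discarded once attribution is complete.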
Geocoding Phase 2
All records with no address match (coded ‘0’) and records that failed the validation process during Phase 1 (also coded ‘0’) were included in the second geocoding phase (Phase 2; Fig. 1). These records were uploaded to the Health Atlas Geo Reference tool for automated address matching using an approximate (‘fuzzy’) string-matching algorithm [27]. Approximate string matching attempts to match components of the reference address to components identified in the GeoDirectory. Further matches are possible by allowing for erroneous characters (e.g. ‘Ballyboug’ matched to ‘Ballybough’) or typographical transpositions (e.g. ‘Dulbin’ for ‘Dublin’).
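To make the idea of approximate matching concrete, the sketch below computes an edit distance that counts insertions, deletions, substitutions and adjacent transpositions (the ‘optimal string alignment’ variant of Damerau-Levenshtein distance). The algorithm actually used by the Geo Reference tool is not specified here, so this is an assumed stand-in; the vocabulary, the threshold of one edit and the function names are hypothetical.

```python
from typing import List, Optional


def osa_distance(a: str, b: str) -> int:
    """Edit distance allowing insert, delete, substitute and adjacent swap."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def fuzzy_match(token: str, vocabulary: List[str],
                max_distance: int = 1) -> Optional[str]:
    """Return the closest vocabulary entry within max_distance, else None."""
    token = token.lower()
    best = min(vocabulary, key=lambda candidate: osa_distance(token, candidate))
    return best if osa_distance(token, best) <= max_distance else None
```

With a vocabulary of `["ballybough", "dublin", "drumcondra"]`, both examples from the text resolve at a distance of one edit: ‘Ballyboug’ is one insertion away from ‘Ballybough’, and ‘Dulbin’ is one transposition away from ‘Dublin’.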
All output records with a reference address ‘exact match’ (i.e. a unique address match) to a return address from Phase 2 of geocoding were validated against pre-agreed validation criteria analogous to those applied during Phase 1 validation (Table 2). All outputs with a reference address ‘area match’ (i.e. to a cluster of addresses) or with ‘no match’ from Phase 2 and all validated records not accepted as a match to a return address were coded ‘0’. These records underwent a third, manual geocoding step (Phase 3; Fig. 1).
Geocoding Phase 3
To maximise the number of records successfully geocoded, a manual geocoding process was applied. A ‘fuzzy search’ function in the Health Atlas Geo Reference tool was used to identify further appropriate matches in the GeoDirectory database (via approximate string matching, with multiple returned approximations) for the remaining unmatched records. The fuzzy search function automatically returns a list of all addresses that are approximately matched to each reference address, permitting the user to manually match addresses that may contain variant spellings, misspellings, typographical errors or Irish-language equivalents not identified during automated Phases 1 and 2. Criteria were devised for manual matching addresses to SA/ED using the fuzzy search function (Table 3). These criteria were devised for geocoding to SA, ED or larger spatial units; i.e. a unique address match was not required.
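The difference between the automated and manual phases is that the fuzzy search returns several ranked candidates for a person to choose from, rather than a single accepted match. A minimal sketch of that behaviour, using the similarity ranking in Python’s standard `difflib` module, is shown below. The candidate addresses, the cutoff value and the function name are assumptions, and the Geo Reference tool’s own ranking method may differ.

```python
import difflib
from typing import List

# Hypothetical normalised GeoDirectory-style address strings.
CANDIDATE_ADDRESSES = [
    "ballybough road dublin 3",
    "ballymun road dublin 9",
    "main street swords",
]


def fuzzy_search(reference_address: str, limit: int = 5,
                 cutoff: float = 0.6) -> List[str]:
    """Return up to `limit` candidate addresses, most similar first.

    The list is intended for manual review: the user, not the program,
    decides which candidate (if any) satisfies the matching criteria.
    """
    return difflib.get_close_matches(
        reference_address.lower(), CANDIDATE_ADDRESSES, n=limit, cutoff=cutoff
    )
```

A misspelt reference address such as ‘Balybough Road Dublin 3’ would place ‘ballybough road dublin 3’ at the top of the returned list, while clearly unrelated addresses fall below the cutoff and are omitted.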
Table 3 Manual matching criteria for selection of geocoded address matches from options produced by the Health Atlas Geo Reference tool’s fuzzy search function (*A ‘similar’ dwelling name refers to minor punctuation/spelling differences in the house name)
In conjunction with the selection criteria for manual matching, the Health Atlas Geo Reference ‘Area match’ function was used to check whether any given group of addresses, selected from the GeoDirectory, would successfully match to 1, 2 or 3 Small Areas before proceeding to ‘Save as area match’. The ‘Display on map’ function provided an additional visual tool to inspect whether a specific cohort of addresses is situated within a well-defined geographical area, e.g. a single townland, before proceeding with the area match.
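The area match check amounts to counting the distinct Small Areas covered by the selected group of addresses. The sketch below illustrates that check only; the input structure (address paired with an SA identifier), the identifiers and the function name are hypothetical and do not reproduce the Geo Reference implementation.

```python
from typing import Dict, List, Tuple


def area_match(selected: List[Tuple[str, str]],
               max_small_areas: int = 3) -> Dict[str, object]:
    """Summarise which Small Areas a selected group of addresses falls in.

    `selected` holds (address, SA identifier) pairs. The group is treated
    as an acceptable area match when it spans 1 to `max_small_areas` SAs.
    """
    small_areas = sorted({sa_id for _, sa_id in selected})
    return {
        "small_areas": small_areas,
        "acceptable": 1 <= len(small_areas) <= max_small_areas,
    }
```

A group of townland addresses spread across two SAs would be reported as acceptable, whereas a group spanning four or more SAs would be flagged so that the selection can be narrowed before saving.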
Validation, identification of potential bias and quality control
Records that could not be geocoded, records that failed validation during each phase and records geocoded in each phase were reviewed in order to identify any potential spatiotemporal bias that may impact on future studies/sensitivity analyses by inclusion/exclusion of these records. Records that failed validation are those with a potential address match that did not meet the outlined validation criteria/manual matching criteria.
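One simple way to screen for spatiotemporal bias is to compare the proportion of records geocoded across years or counties. The sketch below assumes each record is a dictionary carrying a `match_type_code` in which ‘0’ means not geocoded; the field names and the example data are hypothetical, and the review described above may have used different summaries.

```python
from collections import defaultdict
from typing import Dict, Hashable, List


def geocoding_rate_by_group(records: List[dict],
                            group_key: str) -> Dict[Hashable, float]:
    """Proportion of records with a non-zero match type code, per group."""
    totals = defaultdict(int)
    geocoded = defaultdict(int)
    for record in records:
        group = record[group_key]
        totals[group] += 1
        if record["match_type_code"] != 0:  # assumption: 0 = not geocoded
            geocoded[group] += 1
    return {group: geocoded[group] / totals[group] for group in totals}


# Invented example records, for illustration only.
example_records = [
    {"year": 2008, "county": "Dublin", "match_type_code": 1},
    {"year": 2008, "county": "Galway", "match_type_code": 0},
    {"year": 2009, "county": "Dublin", "match_type_code": 2},
    {"year": 2009, "county": "Galway", "match_type_code": 1},
]
```

Calling `geocoding_rate_by_group(example_records, "year")` or `geocoding_rate_by_group(example_records, "county")` gives per-group rates; a markedly lower rate in particular years or counties would suggest that excluding ungeocoded records could bias later analyses.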
A number of quality control procedures were implemented to improve the overall efficacy of geocoding, outlined as follows:
- All criteria for geocoding and validation were devised and revised through an extensive iterative process, involving members of the research team. Ambiguous addresses that did not match the initial criteria for matching were assessed with a view to devising appropriate new validation criteria;
- The final validation criteria were reviewed and agreed by the whole research team;
- A variable recording the type of geocoding match, e.g. a unique address match or match to one SA, was added for each record to allow the flexibility to conduct sensitivity analysis during subsequent epidemiological studies if required;
- The Geo Reference validation tools (Area match and Display on map functions) were used to verify and maximise matching;
- Validation/verification of townlands was conducted using the Irish townlands database (https://www.townlands.ie/).