Using Routinely Collected Administrative Data in Public Health Research: Geocoding Alcohol Outlet Data

We describe our process of geocoding alcohol outlets to create a national longitudinal exposure dataset for Wales, United Kingdom from 2006 to 2011. We investigated variation in the availability of data items and the quality of alcohol outlet addresses held within unitary authorities. We used a standard geocoding method augmented with a manual matching procedure to achieve a fully spatially referenced dataset. We found higher quality addresses are held for outlets based in urban areas, resulting in the automatic geocoding of 68 % of urban outlets, compared to 48 % in rural areas. Missing postcodes and a lack of address structure contributed to a lower geocoding proportion. An urban rural bias was removed with the development of a manual matching procedure. Only one-half of the unitary authorities provided data on on/off sales and opening times, which are important availability factors. The resulting outlet dataset is suitable for contributing to the evidence-base of alcohol availability and alcohol-related harm. Local government should be encouraged to use standardised data fields, including addresses, to enable accurate geocoding of alcohol outlets and facilitate research that aims to prevent alcohol-related harm. Standardising data collection would enable efficient secondary data reuse using record linkage techniques, allowing the retrospective creation and evaluation of population-based natural experiments to provide evidence for policy and practice.


Introduction
The 'natural experiment' is an important methodology for investigating the effectiveness of public health interventions (Cousens et al. 2009;Craig et al. 2012;Ogilvie et al. 2006;Petticrew et al. 2005). Natural experiments can be the best source of available evidence where the randomised controlled trial is not feasible and the intervention(s) have a natural non-random variability in geographical location and time. Many public health interventions, such as urban regeneration schemes, (Egan et al. 2010;Petticrew et al. 2009) are based in small geographical areas and so the spatial scale of measurement of exposures and outcomes that change over time becomes an important consideration.
Methods to evaluate the impact of natural experiments have been enhanced by the development of anonymised electronic record-linkage in secure databanks that facilitate the use of administrative data in public health research (Fone et al. 2012(Fone et al. , 2013Rodgers et al. 2012;White et al. 2014). Data linkage methods allow spatially referenced exposures to be linked to existing cohort studies and to populate age-sex registers for defining new cohorts with full information on small-area migration. These new rich sources of data are facilitating a new approach to the longitudinal analysis of natural experiments (Fone et al. 2013;Rodgers et al. 2012;White et al. 2014).
The main challenge for the use of routine administrative data in measuring variation in exposure across space and time is fulfilling the requirement for good quality standardised data. Ideally, the time and precise geographical location of unplanned changes that have taken place, due to policy implementation or other factors, should be recorded with high resolution to enable the longitudinal variation in exposure to be used in a natural experiment. Without this, spatial measures of exposure will be biased from misclassification of events across spatial boundaries and from events that are missing geographical or temporal identifiers.
We have recently carried out a natural experiment of the effect of change in alcohol availability on alcohol-related harm in the community (CHALICE) (Fone et al. 2012). Excess alcohol consumption is known to cause a wide range of harm, including liver disease (Leon and McCambridge 2006), cancers (Parkin 2011), cardiovascular disease (Klatsky and Gunderson 2008;Reynolds et al. 2003), suicide (Ramstedt 2001), violence and injuries (Flatley et al. 2010;Popova et al. 2009). One method proposed to reduce the societal burden of alcohol-related harm is to reduce the availability of alcohol through a reduction in the density of alcohol outlets (Babor 2008;British Medical British Medical 2008;Humphreys and Smith 2013;Young et al. 2013;Gruenewald 2011).
To investigate this further we had to define and measure outlet density. The problem that we address in this paper is the first step in estimating an outlet density, namely how to identify, enumerate and locate alcohol outlets in geographical space. This required the use of spatially referenced administrative data to quantify alcohol availability. Three Scottish studies have previously used local licensing authority data on alcohol outlets. Two studies set in Glasgow found that the unit postcode was obtained for every outlet recorded in 2006, but did not report any method of validation (Ellaway et al. 2010;R. Young et al. 2013). A more recent study quantified the number of alcohol outlets for each of the 6505 small-area datazones in Scotland provided by local Licensing Boards in 2012 (Richardson et al. 2014). After dataset cleaning and removal of duplicates they recorded 222 (1.4 %) fewer outlets than the official Scottish Liquor Licensing Statistics. However the variation between the 32 Licensing Boards in Scotland was from 7.4 % fewer to 1.8 % more outlets and this degree of variation will be more problematic for preparing accurate counts at smaller geographical levels.
We report on aspects of the CHALICE project (Fone et al. 2012) where we collated, geocoded address-level data, and analysed 3-monthly changes in the number of alcohol outlets in Census Lower Layer Super Output Areas across Wales for use in a longitudinal analysis of the impact of change in alcohol outlet density on alcohol-related harm. We discuss the challenges faced in the secondary use of publicly available alcohol licence data for household and small-area research and describe the methods we developed to geocode the licence data in order to estimate outlet density in a Geographic Information System. Humphreys and Smith (2013) document the difficulties in collecting data from unitary authorities. Our study aims to advance the discussion by assessing the technical challenges researchers face following receipt of the data in achieving a dataset ready for analysis.

Study Design
The setting is Wales, a country in the UK with a population of just over 3 million. More than one-half of the land area is rural with 11 % of the population living in the sparsest rural category; almost 2 million people live in urban areas (Pateman 2011). Local government is organised into 22 unitary authorities that oversee many administrative functions and as such are an important source of routinely collected data. The study area is illustrated in Fig. 1.

Licence Data Acquisition
An alcohol outlet refers to any premise or club licensed by a unitary authority to sell alcohol. The 22 Welsh unitary authorities are responsible under The Licensing Act (2003), for maintaining public registers of all licensed alcohol outlets (Hadfield et al. 2010). The Act came into force in England and Wales on 24 th November 2005. Licences may be in the form of a premise licence, a club licence or a temporary event notice. The Act requires licensing sub-committees to record the name and address of each outlet, including postcode details, using a licensing pro forma. A completed pro forma is then submitted to the unitary authority for consideration; successful applications are recorded and made available for public inspection.
The use of unitary authority data offers a potentially unbiased data source with no marketing or competitive agenda. We requested the names and location of all licensed alcohol outlets from each unitary authority for dates between Nov 2005 and Dec 2011, using a similar model to the one described by Humphreys and Smith (2013). The data items requested for each outlet were: 1) the date permission was granted or the licence became active (referred to as Start Date), 2) the licence expiry date or an indicated date of outlet closure (End Date), 3) whether the premise is licensed for ON and/or OFF premise sales (On/Off Status), 4) the hours permissible to sell alcohol or general opening hours of the outlet (Opening Hours), and 5) the type of premise as assigned by the unitary authority (Premise Type), if available. Under The Act, unitary authorities are mandated to maintain public registers of current licences, populated using licence application forms, which include items 1, 3 and 4 only. The registers hold data about premises that have licences to cover food and music; therefore we selected only those relating to alcohol.

Licence Data Format
Consistent with the findings of Humphreys and Smith (2013), outlet data were received in a wide variety of media and with different amounts of detail. Only one-half of the unitary authorities provided data on the on/off premise sales and opening times. A similar number provided data in a spreadsheet; several provided pdfs or electronic text files, and one provided printed copy only. Historical outlet data from January 2006 onwards were received from 21 of the 22 authorities. One authority could provide only their current (2012) public register. An actual or approximate licence start date was used to summarise the number of outlets into time periods for 21 of the 22 authorities. Several were unable to provide precise outlet closure dates, so the date of the last interaction with the outlet by the authority was used to generate an approximate end date. Each authority presented its own unique challenges for data gathering, such as the need to collect records in person; collation from multiple pdfs; text files without consistent formatting; or print-outs that needed to be converted into a useable format. As a result, extensive processing was required to gather all files, transpose the information into useable outlet records, and collate into a single database of outlets. The tools and techniques used included web scraping, string manipulation and optical character recognition. The final database consisted of 21,137 outlets. Each record contained some address information (partial or full) for the premise; we describe in the next section how we used the available address details to geocode outlets.

Geocoding Licence Data
Geocoding is the process of matching text-based address data to known geographic coordinates. There have been a number of studies that have assessed geocoding techniques, positional accuracy (Zandbergen 2008;Ward et al. 2005;Duncan et al. 2011) and the impacts of geocoded or matched outlet proportion and positional accuracy upon resulting analyses (Krieger et al. 2001;Ratcliffe 2004;Zandbergen 2007). In particular, Zandbergen (2007) has shown that wherever possible, high resolution address data (building-level) should be employed as the reference dataset particularly when finescale analysis is being performed. Given that the CHALICE project looks at the alcohol availability for a 10-minute walking and driving buffer around a residence, it was important to use the best quality address data available for the study area.
Ordnance Survey AddressBase® Premium (Ordnance Ordnance Survey 2014) was used as the reference dataset for the geocoding procedure. AddressBase® Premium is the most comprehensive address dataset for Great Britain and is based upon the Royal Mail Postcode Address File (PAF) and National Land and Property Gazetteer (NLPG). The AddressBase® Premium dataset contains a point for each residence, which is located within the footprint of the residence. The buildings are surveyed with a spatial accuracy of ±1-2 m in urban areas. Although this dataset is comprehensive and spatially accurate, it was not suitable as our source of outlet dataset because it did not contain a suitable classification to extract all alcohol outlets, neither did it contain information specific to alcohol outlets such as on/off sales or opening times.
A geocoding script creating an ArcMap plugin was written in VB.NET 1 to break down each outlet address received from the unitary authorities into its component parts (Organisation Name, Building Name, Street Number, Thoroughfare, Post town and Postcode) as defined by the Royal Mail (Royal 2010). Each address component was indexed before the reference AddressBase® Premium dataset was searched and the best match was recorded. The best match was achieved using string matching and regular expression techniques to resolve any formatting/grammatical issues (e.g. commas, hyphens, capitalisation). For records that were not fully matched in a first pass, a probabilistic matching algorithm was implemented to identify addresses based on 1 Available on request partial address matches. For example, a premise was identified from a street number and name where a premise name could not be matched. The final geocoding step filtered for matches based on the postcode provided in the license data.
Each successfully fully matched unitary authority outlet address record was allocated the corresponding Unique Property Reference Number (UPRN) from the AddressBase® Premium data, as well as the relevant geographic coordinates. Only at this stage was the dataset was deemed to be ready for analysis, specifically suitable for utilisation in longitudinal outlet density research.

Manually Matching Licence Data
Outlets that were not geocoded underwent a manual matching process. There is a precedent for using web-based mapping technologies to fill gaps and perform neighbourhood audits (Clarke et al. 2010;Rundle et al. 2011;Rossen et al. 2012). Google Maps and Google StreetView were examined for the outlet location using the information contained in the licence record, including, for example, the outlet name, street and locality. The outlet was identified as a point on Google Maps and the latitude and longitude of the outlet location were extracted using the information for the point contained in the URL. For any remaining outlets, for which the exact outlet location was unclear, further effort was made to approximate the location within the appropriate street by assessing the function of buildings using Google Streetview. This third Bapproximated^stage assigned all outlet locations in order to achieve a final dataset that was suitable for spatio-temporal research to guide policy and practice.
A validation exercise of the manual matching was completed using a 10 % random sample of geocoded outlets. This resulted in a dataset with two sets of coordinates; the geocoded outlet location, treated as the Btrue location^, and the manually matched location. Location errors, introduced as part of the manual match process, were calculated by measuring the Euclidean distance between the two sets of coordinates. The final dataset has the match type (geocoded, manually matched, and approximated) recorded against each outlet. This will enable purely spatial future research focussed on urban areas to omit the approximated locations.
The geocoded outlets will be used in future work to calculate alcohol density scores using a GIS. To investigate whether there would be large differences in the proportion of outlets matched between urban and rural authorities, we classified the outlets into one of six rural-urban classes of settlement type based on the ONS rural urban classification (ONS Geography 2004) assigned to the small area geographies (Lower Super Output Area). For each unitary authority the number of urbanised LSOAs were aggregated into percent urbanised.

Geocoding Licence Data
Of the 21 137 individual licence records received, 16 101 remained after filtering out those that were not alcohol licences, were temporary event notices, or duplicates. Of these, 8536 (51 %) were matched using geocoding software (matched against Ordnance Survey AddressBase® Premium). Temporal coverage of outlets was calculated across all unitary authorities for 6 years (January 2006 to December 2011) apart from one authority that retained only current licence records.
Successful geocoding proportions varied from 28 to 72 % (Table 1). The geocoded proportion was influenced by the structure and completeness of unitary authority addresses (data not shown). All provided records contained some address information but several unitary authorities only provided an address in a single column. Some outlet records contained addresses with a missing postcode field suggesting address quality is reduced, although it may be an optimistic measure because some of the provided postcodes may be incorrect.
The main reasons geocoding failed were an absent or inaccurate premise name or street name, or an absent or inaccurate postcode. For example, a relatively small change from number 10 Market Square, to 10 Market Street, and other small errors in the main address elements made it difficult for a machine to decide between multiple possible matches. This was particularly true when the error in the address partially matched an alternative address. The example outlined previously illustrates this because both Market Square and Market Street exist in the locality, making identifying the correct address impossible for an automated system.
There was no pattern of geocoding success by the proportion of LSOAs in the most deprived 10 % of the Welsh Index of Multiple Deprivation (Local Government Data Unit -Wales 2005) (data not shown), nor was there a historical influence. The average geocoding proportion was 51 % for all years, ranging from 53 % in 2006 to 51 % in 2011. The geocoded proportion, however, was generally higher for records obtained from more urbanised unitary authorities compared to those authorities in a more rural setting (Fig. 2). The percentage of records successfully matched is shown in Fig. 2 for each ONS urban-rural class. The unitary authorities were arranged in order of percent urbanised. Outlet records from three cities in south Wales; Cardiff, Swansea and Newport, had geocoding proportions of 68, 64 and 65 %, respectively, averaging 66 %. In contrast, rural geocoding proportions ranged from 28 to 68, averaging 48 %.

Manually Matching Licence Data
Records that were not geocoded underwent a manual matching procedure using Google Streetview to identify an outlet location. The proportion of records manually matched and those approximated are shown in Table 1. The manual matching validation process for a 10 % sample (1604) of geocoded outlets, revealed that 79 % were manually matched to locations within 100 m Euclidean distance of the AddressBase® Premium location, 90 % within 500 m and 93 % within 1000 m. For reporting purposes data were aggregated into standard census geographies (LSOA). Several geocoded outlets (<8 %) were located in a different LSOA when manually matched but the spatial distribution revealed no systematic error (Fig. 3). The validation process was completed to give confidence in the accuracy of the combined geocoding and manual matching procedures in allocating an accurate spatial location to alcohol outlets, and assure the usefulness of these outlet data for aggregation into areas defined by travel distances from each household.

Discussion
We found that current and historical alcohol outlet licence data are collected and retained within most unitary authorities in Wales. There were, however, important differences between the authorities in obtaining access to the licensing data, and substantial variation in the data items stored. For example, information on premise type for on-and off-sales and opening hours were recorded by only one-half of the unitary authorities. There were substantial differences in the quality and accuracy of the address details captured, which impacted on the geocoding process and required the introduction of a three-stage matching process (geocoded, manually matched and approximated). This was necessary to achieve the most accurate count of outlets at household-and LSOA-level for each three month period in the study, particularly important for accurately estimating quarterly change in outlet density. Averages calculated from data prior to rounding As Humphreys and Smith (2013) found, the effort required to obtain a dataset ready for analysis was considerable including problems in accessing and preparing these data for use. We initially under-estimated the length of time required to obtain the Licensing Act data. Several follow-up requests had to be made to ensure historical, rather than only current, licensing data were provided, with many unitary authorities unclear how to obtain such information from their systems. Six unitary authorities required as many as ten follow-up calls, and up to three months of data checking time before release. In some cases Freedom of Information (FOI) requests were necessary. One unitary authority deleted old records and therefore could not provide historical data. The availability of historical data in all but one unitary authority was nevertheless an improvement over other UK datasets, approaching the quality of a researcher-led dataset held for New Orleans (Xu et al. 2012) and data that are readily available in Finland (Halonen et al. 2013).
The bilingual status of Wales had an impact on geocoding. A licence application can be made in Welsh or English, and records may be administered and maintained in either language. This leads to different formats and spellings in licence records. This is particularly true of address fields, which contained misspelled outlet names and streets, resulting in no matches found during the geocoding process. It also makes the manual matching process more challenging. We achieved a relatively low overall geocoded proportion of 51 %. To improve this we developed a manual matching method to locate premises that could not be geocoded. The AddressBase® Premium dataset, used as the geocoding data source, is in part generated and maintained by the same unitary authorities approving licence applications. From this low resulting geocoding proportion, we concluded that basic checks on vital address components, such as postcodes or building names and numbers, had not taken place. These are essential to achieve accurate and fast geocoding. We briefly considered using Points of interest (POI) data instead of those sourced from unitary authorities but the POI dataset does not include historical outlet data to complete longitudinal research and, additionally, previous researchers found that the POI dataset contained fewer than 64 % of the outlets known to unitary authorities (Burgoine and Harrison 2013). This degree of error is too high for spatio-temporal policy research.
Our hypothesis that urban areas produced the highest quality data was generally confirmed but there were examples of rural authorities that excelled at data collection and storage. We suggest this is largely due to the efforts made by individuals and teams within each authority. The most complete, highest quality data were received from three largest cities in Wales (Cardiff, Swansea and Newport) and resulted in a geocoding proportion of 66 %. Our proportion of geocoded urban outlets was lower than found in other research, which suggests 85 % geocoding may be achieved (Zandbergen 2008). However, their research focussed on residential addresses, which in our experience are more accurately recorded and more frequently contain core address components. We suspect the lack of structure in the commercial addresses we received contributed to our Fig. 3 Map illustrating outlet location error as a result of the manual matching process for a validated sample of 1604 geocoded outlets lower geocoding success. Data received from the more sparsely populated rural unitary authorities were further investigated and for most fewer than half of the licensed premises provided could be geocoded. Given that there are statutory requirements to record these data and supply summary statistics to the Home Office we had expected that these licence data would be recorded more accurately. We recommend that communication is improved within and between unitary authorities to maximise the potential of these administrative data through initiatives such as the UK government funded Administrative Data Research Network (ADRN).
This study has used the most accurate geocoding standards to produce a complete spatial dataset of outlet locations by quarterly time-period at the highest resolution possible. Given that the quality of addresses varied systematically by rural and urban areas, the 97 % combined geocoding and manual match proportion achieved was the best solution for aggregation into small area geographies because it has reduced bias resulting from less accurately recorded rural outlets.
We suggest that the larger night-time economies in urban areas require urban unitary authorities to have accurate data on the location of premises from a planning and policing perspective. It is likely that this requirement has led to the manual adjustment of addresses by the urban authorities to match records when the licence application form is submitted for consideration. Moreover, there are likely to be more frequent checks on licensed premises in urban areas by licensing officers who update the records. Finally there is potentially greater business churn in urban areas leading to more frequent updating of records.
The use of a manual matching process to find the spatial location of licensed outlets is a relatively straightforward, if time consuming process. Limitations of the manual matching process include human error through misidentification of outlets from the address, and data errors remaining unresolved in Google Maps. However, the validation process has tested the introduction of errors with the reassuring result of 80 % located within 100 m of the geocoded location.
The spatial accuracy of the approximated locations (those which were not geocoded and failed to be manually matched precisely) is unknown and these outlets have been flagged in the dataset to enable removal for analyses, if so required. The full dataset will be used in our analyses for CHALICE because we are interested in changes in outlet density through time. Analyses concentrating on the spatial distribution of outlets within an urban area may choose to use only the geocoded and manually matched outlets. The geocoded, manually matched and approximated outlets have the greatest utility for subsequent data linkage at the household-level. These data will allow calculation of change in outlet density/accessibility through time at household-or LSOA-level to assess the impact on individual-level alcohol related health outcomes.
Longitudinal outlet data were required to assess quarterly changes through time and the impact on alcohol-related harm. The requirement to retrospectively collect data was challenging, with older premises more likely to have closed or changed name; changes were not always captured in official address registers such as AddressBase® Premium, despite the dataset structure to maintain historical property changes The final dataset covered a six-year period and had consistent geocoding proportions throughout. The outlet data received from the unitary authorities seems not to have been subjected to any address verification, allowing local knowledge or colloquialisms into some alcohol outlet address data, probably by the landlord or licensee completing the form.
Occasionally the home or work address of the licensee or club secretary was recorded, rather than the outlet. These addresses are now ambiguous, leading to inaccuracies when trying to map an outlet, especially once it closed.
Standardised recording and maintenance of records between all unitary authorities could help to overcome these problems. A standardised address checking service may be incorporated into the address capture or data entry process. Manchester City Council, one of several comprising the Greater Manchester Unitary Authority in England, is achieving this by installing standardised address capture software within Council departmental software applications (John Bradley, Personal Communication). Councils should also be encouraged to maintain historical data, which are essential to retrospectively create and evaluate natural experiments.

Conclusions
The collation of retrospective alcohol outlet data was completed to enable the building of a longitudinal exposure dataset. There was considerable variation between the unitary authorities in the quality of address data and data related to the availability of alcohol, for example opening hours. The lack of address structure required us to devise a manual address matching process to capture the addresses that could not be geocoded. This extra procedure resulted in a geocoded alcohol outlet dataset suitable for creating an alcohol availability score at the household-level and any larger geography, including the LSOA-level. The alcohol outlet dataset will be used to contribute to the evidencebase of alcohol availability and alcohol-related harm.
The Licensing Act 2003 was intended to help Government monitor the impact of, among other things, the sale of alcohol in relation to alcohol-related crime. The potential of administrative licensing data as a resource for public health surveillance and associated data access issues have already been highlighted (Humphreys and Smith 2013). We faced similar problems trying to access Licence data. Through the process of geocoding a longitudinal dataset, we have identified further issues surrounding the variation in the number of data items stored, the accuracy of the address data, and historical preservation of the licence data. Moreover, many of the issues and problems we faced were raised more than 25 years ago in the Chorley report (Coppock 1987) and little progress has been made on Bremoving the barriers^to geographic information. Local government should be encouraged to use nationally standardised data fields, including addresses. The availability of accurate addresses, with dates, would contribute to research that aims to enhance our understanding of the impact of the built environment on health.

Compliance with Ethical Standards
Funding This project was funded by the National Institute for Health Research, Public Health Research (project number 09/3007/02). Please visit the PHR programme website for more information: http://www.nets. nihr.ac.uk/projects/phr/09300702. The views and opinions expressed therein are those of the authors and do not necessarily reflect those of the Public Health Research, NIHR, NHS or the Department of Health.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.