The development of a machine learning algorithm to identify occupational injuries in agriculture using pre-hospital care reports

Purpose: Current injury surveillance efforts in agriculture are considerably hampered by the limited quantity of occupation or industry data in current health records. This has impeded efforts to develop more accurate injury burden estimates and has negatively impacted the prioritization of workplace health and safety in state and federal public health efforts. This paper describes the development of a Naïve Bayes machine learning algorithm to identify occupational injuries in agriculture using existing administrative data, specifically pre-hospital care reports (PCR).
Methods: A Naïve Bayes machine learning algorithm was trained on PCR datasets from 2008–2010 from Maine and New Hampshire and tested on newer data from those states between 2011 and 2016. Further analyses were devoted to establishing the generalizability of the model across various states and various years. Dual visual inspection was used to verify the records subset by the algorithm.
Results: The Naïve Bayes machine learning algorithm reduced the volume of cases that required visual inspection by 69.5 percent over a keyword search strategy alone. Coders identified 341 true agricultural injury records (Case class = 1) (Maine 2011–2016, New Hampshire 2011–2015). In addition, there were 581 records (Case class = 2 or 3) that were suspected to be agricultural acute/traumatic events, but lacked the necessary detail to make a certain distinction.
Conclusions: The application of the trained algorithm on newer data reduced the volume of records requiring visual inspection by two thirds over the previous keyword search strategy, making it a sustainable and cost-effective way to understand injury trends in agriculture.


Introduction
Quantifying occupational injuries is of particular importance in the agricultural sector, which has one of the highest fatality rates in the United States [1]. While workplace fatalities are often captured in the Census of Fatal Occupational Injuries (CFOI) [2], collecting data relating to injury, especially in Northeast agriculture, is problematic [3]. The majority of farms in the Northeast are exempt from OSHA regulations, as only 10.7% of Northeast region farms with hired workers have 10 or more employees [4]. The injury and fatality estimates that do exist are limited [5][6][7][8]. Similarly, data have also shown underreporting of injuries in the forestry and logging sector, which is part of the agricultural super-sector [9].
Surveillance efforts have been considerably constrained by a lack of knowledge regarding what information to collect and where to find it. These gaps in knowledge have impeded efforts to develop more accurate injury burden estimates and have negatively impacted the prioritization of workplace health and safety in state and federal public health efforts [10]. The low prioritization of occupational health and safety is demonstrated by NIOSH's 0.2% share of all medical research and development expenditures in the 2016 federal budget [11,12]. By creating a system to identify previously unreported agriculture-related injuries, it will be possible to provide a more complete picture of injury burden in this high-risk industry. The loss of several national surveillance efforts, namely the National Agricultural Worker Health Survey Injury Module [13] and the Occupational Injury Surveillance of Production Agriculture (OISPA) Survey [14], further justifies the need to invest in new strategies for surveillance research. With robust computing power now available at little cost and the ability to obtain existing administrative health databases with free text, artificial intelligence and machine learning methods offer promising avenues for public health research. For example, Koivu and Sairanen used several machine learning methods to develop a model that predicts early stillbirth and pre-term pregnancies [15]. Rybinski et al. applied natural language processing methods to identify family members and diseases in free-text family history sections of electronic health records [16]. Prieto et al. developed a natural language processing method using logistic regression and several keywords to identify opioid misuse in the narrative portion of PCRs [17]. Yang et al. developed a deep neural network to identify cases involving allergic reactions in the free-text section of hospital safety reports [18]. This paper describes the development of a Naïve Bayes machine learning algorithm for pre-hospital care reports (PCR) to identify agricultural occupational injuries, along with the algorithm's utility on untagged datasets. Naïve Bayes methodology was chosen for our first effort utilizing machine learning because of its simplicity and effectiveness in text classification [19,20].

Materials and methods
The creation of a "gold-standard" dataset has been described in detail elsewhere and will be briefly summarized here [21]. Over 50,000 PCR records were visually inspected and tagged as to their occupational injury status. To be included in this review, the record's narrative needed to contain a keyword from Table 1. This tagged dataset, drawn from Maine and New Hampshire 2008–2010 PCRs, served as the basis for training the machine learning algorithm. The following describes the training of the algorithm on this tagged dataset and the subsequent application of the trained algorithm to newer datasets for which case determinations had not yet been made. These newer datasets included Maine PCR data from 2011–2016 and New Hampshire PCR data from 2011–2015. The Institutional Review Board (IRB) of the Mary Imogene Bassett Hospital approved all protocols. Approval was also granted by each participating state's IRB or data use board.

Preprocessing of all datasets used in these analyses
Cleaning PCR records involved two steps: (1) removing duplicates and (2) removing records of no interest. Duplicate records were identified based on an exact match on four variables: gender, admission, ZIP code, and date of birth. For records that met these criteria, one record was retained and the remainder were removed as duplicates.
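The de-duplication step can be sketched as follows. This is a minimal illustration, not the study's actual SAS code: the field names (`gender`, `admission`, `zip`, `dob`) and the rule of keeping the first matching record are assumptions.

```python
def deduplicate(records):
    """Remove exact duplicates, keyed on the four matching variables.

    `records` is a list of dicts; the first record seen for each
    (gender, admission, zip, dob) combination is kept.
    """
    seen = set()
    kept = []
    for rec in records:
        key = (rec["gender"], rec["admission"], rec["zip"], rec["dob"])
        if key not in seen:
            seen.add(key)
            kept.append(rec)  # keep only the first occurrence
    return kept
```

In practice the same effect could be achieved with pandas' `drop_duplicates` on those four columns.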

Training the algorithm using tagged datasets
The algorithm was trained on the gold-standard PCR dataset from 2008 to 2010 from Maine and New Hampshire using the variables shown in Table 2. After PCR records were cleaned using SAS 9.3 (Cary, NC), they were imported into Python (v 3.7) for further data processing. Three Python packages were used: pandas (v 1.1.1) for data management, nltk (v 3.5) for natural language processing, and scikit-learn (v 0.23.2) for machine learning. Transformation mapping was performed to link similar variables (Table 2) across state datasets between given years of data. These maps were applied to the following variables: incident location, mechanism of injury, dispatch reason, and primary impression. For the other variables in Table 2, a dummy matrix was created that included a (0,1) variable for each level of the variable it represented, a process known as one-hot encoding. For example, the fifty-nine levels for mechanism of injury were represented by fifty-nine (0,1) variables.
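The one-hot encoding step can be illustrated with a small standard-library sketch. The study used pandas for this (e.g., `pandas.get_dummies`); the variable name `mech` and its levels here are hypothetical.

```python
def one_hot(values, var_name="mech"):
    """Encode a categorical column as (0,1) indicator columns.

    Each distinct level observed in `values` becomes its own indicator,
    so a variable with fifty-nine levels yields fifty-nine columns.
    """
    levels = sorted(set(values))
    return [
        {f"{var_name}={lvl}": int(v == lvl) for lvl in levels}
        for v in values
    ]
```

For example, `one_hot(["fall", "machinery", "fall"])` produces one row per record, each with exactly one indicator set to 1.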
Narratives were prepared for the stemmed keyword search by lowercasing all characters, removing all punctuation, and stemming all words using the Natural Language Toolkit's (NLTK) Snowball stemmer [22]. Next, narratives were scanned for any instances of the stemmed keywords in Table 1. Throughout this process, the algorithm was trained to ignore keywords that were found in combination with other words or phrases that did not indicate an agricultural injury, such as proper names of emergency responders, local non-agricultural businesses, or keywords followed by a known address suffix or abbreviation [23]. Lastly, exclusions were applied for irrelevant words that stemmed to the identical value as a given keyword (e.g., "animate" and "animal" both stem to "anima"; therefore "animate" was excluded). Based on that searching process, each narrative was tagged as to the presence (1) or the absence (0) of each of the stemmed keywords in Table 1.
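The narrative-tagging pipeline above can be sketched as follows. This is a toy illustration: a trivial suffix stripper stands in for NLTK's Snowball stemmer, and the keyword and exclusion lists are invented examples, not the actual contents of Table 1 or the study's exclusion list.

```python
import string

STEM_KEYWORDS = {"tractor", "cow", "silo"}   # illustrative stemmed keywords
EXCLUSIONS = {"cowl"}                        # illustrative stem-collision exclusion

def toy_stem(word):
    """Crudely strip common suffixes (a stand-in for the Snowball stemmer)."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tag_narrative(narrative):
    """Return keyword -> 0/1 presence flags for one narrative.

    Steps mirror the text: lowercase, strip punctuation, drop excluded
    words, stem, then check for each stemmed keyword.
    """
    text = narrative.lower().translate(str.maketrans("", "", string.punctuation))
    words = [w for w in text.split() if w not in EXCLUSIONS]
    stems = {toy_stem(w) for w in words}
    return {kw: int(kw in stems) for kw in sorted(STEM_KEYWORDS)}
```

The real pipeline additionally ignores keywords appearing in proper names, business names, or address contexts; those rules are omitted here for brevity.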
A four-level case-class variable was created as follows: 0 (non-agricultural, non-traumatic/acute, or both), 1 (confirmed agricultural, confirmed traumatic/acute = true case), 2 (confirmed traumatic/acute, suspected agricultural), or 3 (suspected traumatic/acute, confirmed agricultural) (Table 3). Naïve Bayes models were run for a binary case determination (case-class 1, 2, or 3 versus case-class 0). These models used all of these variables in conjunction to assign a predicted probability that the record was a true case.
The essential element of these analyses was to identify variables that occurred relatively frequently in cases and relatively infrequently in non-cases. Using this mechanism, our goal was to train the algorithm to identify a posterior probability threshold for assigning a record to be a true case such that ninety percent (90%) of all true cases would be identified. To determine what threshold would meet this ninety percent (90%) requirement, the algorithm was trained on eighty percent (80%) of the data selected at random and validated on the remaining twenty percent (20%). This procedure was repeated for one hundred iterations. Over these one hundred iterations, the mean and standard deviation were calculated for the required threshold probability, as well as for the percentage of all records in the dataset meeting that threshold.

As a hypothetical example, suppose that on iteration three, in order to identify a subset of records containing ninety percent (90%) of the true positive cases, the algorithm needed to "tag" any record with a posterior probability of 0.17 or higher as a "hit", and that this resulted in three percent (3%) of all records in the iteration being tagged as "hits". The relevant data points for this iteration would then be 0.17 (the required threshold posterior probability) and 0.03 (the proportion of all records that the algorithm needed to assign as "hits" in order to capture 90% of the true hits in the file). The proportion of records tagged by the algorithm as "hits" that had been previously confirmed as case-class 1, 2, or 3 was also recorded. Continuing the hypothetical example, if the iteration-three training file contained one hundred (100) records confirmed as case-class 1, 2, or 3, and 3,012 records with a posterior probability of 0.17 or greater were tagged as "hits" in order to capture ninety (90) of those confirmed records, the resulting percentage would be 90/3,012 = 0.0299, or approximately 3%.
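The threshold-selection step within one iteration can be sketched as follows, assuming the model's posterior probabilities for the validation records are already in hand. The probabilities and labels in the usage example are illustrative, not study data.

```python
def threshold_for_recall(probs, labels, target_recall=0.90):
    """Find the highest posterior cutoff capturing >= target_recall of true cases.

    Returns (threshold, fraction_tagged): tagging every record with a
    posterior >= threshold as a "hit" captures at least target_recall of
    the records labeled 1, and fraction_tagged is the share of all
    records that must then be reviewed visually.
    """
    n_true = sum(labels)
    # Walk candidate thresholds from the highest probability downward.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    captured = 0
    for rank, i in enumerate(order, start=1):
        captured += labels[i]
        if captured >= target_recall * n_true:
            return probs[i], rank / len(probs)
    return 0.0, 1.0  # degenerate case: every record must be tagged
```

For instance, with ten records of which three are true cases, the function returns the lowest threshold needed to cover 90% of those cases and the corresponding review burden.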
A receiver operating characteristic (ROC) curve, with sensitivity (true positive rate) on the y-axis and 1 − specificity (false positive rate) on the x-axis, was also created for a subset of these iterations. The area under each curve (AUC) was recorded as an additional data point.
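The AUC recorded at each iteration can also be computed directly from the scores without constructing the curve, using the rank (Mann–Whitney) formulation: the AUC equals the probability that a randomly chosen case outranks a randomly chosen non-case, with ties counting one half. This is an equivalent textbook formulation, not the study's code.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (case, non-case) pairs the case outranks."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A model that ranks every case above every non-case scores 1.0; chance performance scores 0.5.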
Sub-analyses were also performed with the goal of identifying the relative importance (discriminatory power) of the variables in Table 2. Further analyses were devoted to establishing the generalizability of the model across states and years. Variable importance was determined by subtracting the log probability of a variable under the negative class from its log probability under the positive class, giving a measure of how strong a discriminator the variable was.
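The variable-importance calculation can be sketched as follows for a single (0,1) feature. The Laplace (+1) smoothing and the counts in the test are illustrative assumptions; scikit-learn exposes the equivalent per-class quantities as `feature_log_prob_` on its fitted Naïve Bayes models.

```python
import math

def importance(feature_counts, class_counts):
    """Discriminatory power of one binary feature.

    feature_counts[c]: records in class c (1 = case, 0 = non-case)
    where the feature equals 1; class_counts[c]: records in class c.
    Returns log P(x=1 | case) - log P(x=1 | non-case), with +1
    Laplace smoothing; larger positive values mean the feature is a
    stronger indicator of a true case.
    """
    p_pos = (feature_counts[1] + 1) / (class_counts[1] + 2)
    p_neg = (feature_counts[0] + 1) / (class_counts[0] + 2)
    return math.log(p_pos) - math.log(p_neg)
```

A keyword present in most cases but few non-cases scores strongly positive; one equally common in both classes scores near zero.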

Application of the trained algorithm on newer un-tagged datasets from Maine (2011-2016) and New Hampshire (2011-2015)
For each of the new untagged datasets (PCR data from Maine 2011–2016 and New Hampshire 2011–2015), the trained algorithm was used to identify the subset of records with a posterior probability equal to or greater than the mean of the one hundred (100) threshold probabilities obtained above. These records were set aside for visual inspection to determine their true case classification as being either a 1, 2, or 3 versus 0 (Table 3).

Visual confirmation of records meeting the probability threshold as identified by the trained algorithm
The visual case determination utilized the following variables: state ID, state, incident ID, date of birth, gender, incident location, dispatch reason, primary impression, mechanism of injury, incident date (admit date), stems, narrative. The research team developed an injury surveillance manual with specific coding rules, and this manual was updated throughout the visual case determination process to address any questions that arose. Trained coders independently assigned one of the four case-class levels ( Table 3) to each record.
Any case that was not assigned a zero (not a case of interest) received a second, separate review by an additional coder, and discrepancies between the two case determinations were resolved by the two coders. When the initial review was complete, all non-zero cases and a random sample of 10% of zero cases were reviewed again by lead reviewers (Principal Investigator and Research Coordinator) to confirm the case determination. The lead review results were also used to provide additional training to the coders and to update the injury surveillance manual. In addition, coders used comment fields to suggest new exclusions; these suggestions were reported to the study team and subsequently included in future iterations of the algorithm's exclusion list. The labor allocated to visual case determination was calculated in minutes per case, then transformed into full-time equivalents (FTE). For each of the untagged datasets, the percentage of records in these tagged files confirmed by visual examination to be case-class 1, 2, or 3 was recorded.

Results from the training of the algorithm on data from 2008 to 2010
Of the total of 1,072,745 records, 224,572 (20.9%) were eliminated in SAS as being either duplicates or irrelevant, leaving 848,173 records for the application of the algorithm. As the algorithm was trained, a total of 1,557 exclusion words or phrases were identified.
The average posterior probability cutoff over the training iterations required to produce a sub-dataset containing ninety percent of all true positives was 0.016 (SD 0.007). On average, fifteen percent (15%) of the records fed into the model had a posterior probability of 0.016 or higher and were thus reviewed visually. Averaged over the 100 iterations, 10.6% (SD 2.64) of the records with a posterior probability of 0.016 or higher were found to be agricultural cases.
The mean area under the ROC curve over the 100 80%/20% training–validation iterations was 0.95 (Table 4).

Results from application of the trained algorithm on the newer un-tagged datasets from 2011 to 2016
Eliminating duplicates and records of no interest reduced the volume of records for machine learning by 791,659 (Fig. 3). A total of 1,923,107 PCR records were imported into Python. Of these records, 95,545 contained a stemmed keyword of interest. The Naïve Bayes algorithm identified 29,099 records (30.5%) that had a posterior probability of being an agricultural case of 0.016 or greater and therefore met the 90% true positive threshold that was derived from the training of the algorithm on the 2008-2010 data.

Results from visual confirmation of records
Visual inspection of these 29,099 records subsequently confirmed that 922 (3.2%) were in fact agricultural cases. The time burden for case determination is shown in Table 5. Initial case determination of the 29,099 records requiring visual inspection was estimated to require 1.7 full-time equivalent (FTE) staff. Without machine learning, visually inspecting all 95,545 keyword-matched records would have required an estimated 5.6 FTE. Discrepancy resolution and lead review inspection add a small amount of time, but that is a fraction of the time required for initial case determination.
Of the 922 records confirmed to be agricultural cases, coders identified 341 true agricultural injury records (Case class = 1) (Maine 2011-2016, New Hampshire 2011-2015). In addition, there were 581 (Case class = 2 or 3) that were suspected to be agricultural acute/traumatic events, but lacked the necessary detail to make a certain distinction. The top twenty variables with the highest discriminatory power of a true agricultural case are summarized in Table 6.

Discussion
The Naïve Bayes machine learning algorithm has substantially reduced the burden of identifying agricultural injury cases in PCRs by decreasing the number of records that need to be reviewed by visual inspection. Previous research showed that PCRs yield a higher proportion of occupational injury records than other types of existing administrative records, such as hospital data [24]. Therefore, speeding up the process of identifying cases without sacrificing accuracy is a significant advancement in the field of injury surveillance for agriculture.
By moving from simple keyword searches to machine learning techniques, we have moved much closer to achieving an operational, sustainable surveillance system using existing data. The healthcare system is already burdened and lacks additional time for reporting; we therefore embarked on this endeavor utilizing data that can be imperfect, understanding that formatted variables are often left blank or not required. The onus has been on us to enhance the ability to find cases in the vast number of existing records, instead of insisting that emergency services fill out yet another report. We anticipate that PCRs will continue to be a stable data source for years to come. Electronic reporting has advanced over the last decade, with rural areas obtaining improved connectivity by way of broadband internet. Most PCR data are used for quality assurance and quality control in emergency medical services, but they are increasingly being seen as a viable research dataset [17,25,26]. In addition, state-based Emergency Medical Services Bureaus are interested in utilizing the results of research involving PCRs as a way to enhance EMS response.
It is possible to use a Naïve Bayes machine learning algorithm to identify agricultural injury records in Maine and New Hampshire. Using multiple years of data from two states, we examined how such a surveillance system performs over time and how additional states may be added to the system in the future. In the case of cross-year and cross-state train/test splits, the necessary false positive rates were higher than with purely random splits, but still significantly better than chance (Table 4). The model performed best when all the variables were available, though it performed much better than chance when presented with (1) only the narrative or (2) only the categorical variables. This has major implications for expanding the surveillance system to new states, as the variables available through research data use agreements can vary by state. Results of cross-state train/test splits also suggest that while a model trained on one state does not generalize perfectly to another state, it may be an acceptable low-cost alternative to creating a state-specific training set. In addition, findings indicate that a model can be trained on earlier years and still generalize well to later years.
There is a slight decrease in the model's performance when it is applied to the newer years: within the newer validation data, 30.5% of records were tagged for visual inspection (posterior probability of 0.016 or higher), versus 15% in the training dataset. The variables with the greatest discriminatory power included stemmed keywords that are quite unique to agriculture, for example hoov(e), silag(e), and grain_bin. While they showed up rarely in the PCR narratives, when they did appear they were a good indicator of an agricultural injury. Other variables that were strongly indicative of agricultural cases and appeared much more often include the keywords cow, hay, and pastur(e), as well as the incident location of Farm. The cost of maintaining this surveillance system can be reduced in two ways: by enhancing the accuracy of the machine learning algorithm and by altering protocols for visual inspection of records. Further review of the inter-rater reliability between coders will determine whether we can reduce the time spent on visual inspection without introducing significant errors in case classification.
Our ability to add states to the system and continue to review and code timely data will rest on continued refinement of the machine learning algorithm. To this end, next steps include the exploration of active machine learning. Part of this process is scrutinizing how much tagged data is necessary to get a new state off the ground, or to review how well the algorithm performs over time, understanding that databases and their data dictionaries evolve over time.
The descriptive epidemiology of the injury events identified will be the subject of a separate manuscript.

Limitations
This surveillance method captures traumatic injuries for which EMS were involved, where the record contained a variable or keyword related to agriculture. Inherently, this leaves out injuries where medical treatment was sought without EMS involvement, such as those transported to the hospital in a private vehicle. This system is designed to capture ninety percent of true positive cases, knowing that some cases (10%) will be missed.
Since the model's performance in later years does not exactly match that of the training years, further assessment is needed to understand whether the full 90% of case-class 1, 2, and 3 records are still captured in later years. In 2008–2010, the percentage of records reaching the model that were tagged and visually confirmed was 1.59% (15% × 10.6%). In 2011–2015, it was only 0.976% (30.5% × 3.2%). Assuming the base rate of true cases among records that reach the model remained relatively constant over the nine-year period, this means we identified a slightly smaller percentage of true positives in 2011–2015 than in 2008–2010.
To refine the machine learning algorithm, a certain amount of tagged data is required, and the initial step of tagging the large corpus is quite time consuming. For the newer datasets to which the algorithm was applied, we cannot confirm that this dataset in fact contains ninety percent (90%) of all true cases. Understanding that would require visually inspecting a vast number of cases and requires further study. Choosing a lower required true positive rate will reduce the number of false positives that need to be reviewed, but will also increase the number of false negatives.

Conclusions
This research adds substantial information toward improving occupational injury surveillance using existing data sources. The application of the trained algorithm on newer data left less than two percent of records requiring visual inspection, making it a sustainable and cost-effective way to understand injury trends in agriculture. This system, along with companion surveillance methods such as those utilizing hospital, trauma, and survey data, provides a broader picture of worker injury in agriculture. Continued investment in robust injury surveillance methodologies will benefit worker health and safety by allowing occupational health and safety specialists to make informed decisions about hazards and evaluate the effect of injury prevention efforts over time.