1 Introduction

Suicides are tragic events for witnesses, affected family and friends, and society in general. Understanding the determinants of suicide, including the (un)intended consequences of past policies, requires data on suicides. Yet, there is currently only one data source for US suicides prior to 1900: the mortality schedules. Mortality schedules were parts of the census that recorded people who had died in the year before the census; they were collected in 1850, 1860, 1870, 1880, and 1885. The 1885 mortality schedules were part of some state censuses. States and territories are usually only covered in these mortality schedules once they have come under U.S. sovereignty or jurisdiction. This explains why many current states are not included in the data (see Footnote 1). Furthermore, not all mortality schedules are digitized.

It is thus unsurprising that little is known about what determined suicide rates in the past. I offer a proxy measure for the state-year suicide rate based on a large newspaper archive run by the Library of Congress. The proxy is built analogously to disease prevalence measures, as mentions per 100,000 pages. However, alternatives are explored, and in cases where only death counts are available as comparators, the raw mention counts are also used.

This paper provides evidence that suicide mentions are a feasible proxy for the suicide rate, providing the first insight into suicide (mention) trends prior to 1900. The validity of suicide mentions as a proxy is shown by comparison with the sparse available mortality data and by comparing patterns with the subjective well-being measure developed in Hills et al. (2019).

The proxy measure developed here allows more detailed analysis, as well as analysis for areas and years for which no data was previously available at all. The analysis here is mostly conducted at an aggregate level. However, future research could exploit the fact that the newspaper data includes exact dates and the towns in which the newspapers were based.

2 Previous Work

Using text, and especially newspapers, as data is a recent development. Yet, an entire literature using text as data already exists (Algaba et al., 2020; Baker et al., 2016; Currie et al., 2020; Gentzkow et al., 2011, 2014, 2015; Gutmann et al., 2018; Marquardt, 2020). A prominent example is the work by Baker et al. (2016). They search ten newspapers, which are available from 1900 to today, for a list of keywords indicating policy uncertainty and then divide the resulting counts by the total number of articles in the same newspaper and month. Yet, most of this literature focuses on how an aspect of the newspaper market is influenced by or influences societal changes (Gentzkow et al., 2011, 2014, 2015). An exception is the work by Bencsik (2020). She shows that crimes distress the neighborhoods in which they occur, but only after the crimes have been reported in newspapers.

Some work creates and validates measures of suicide, suicidality or well-being. Hills et al. (2019) create a measure of national subjective well-being. They apply sentiment analysis to Google Books to capture the subjective mental well-being of book authors and build measures of national subjective well-being for the US, the UK, Germany and Italy.

Vandoros et al. (2019) show that economic uncertainty, as measured by mentions of certain macroeconomic and political terms in newspapers, could lead to increased suicides in the short-run.

Arendt (2018) shows for fifteen newspapers from Austria (1819–1944) that there is covariation between newspaper reports of suicides and actual suicides. He further shows that newspaper reports of suicides predict future suicides, but suicides do not predict future newspaper reports of suicides. Arendt (2019) explores whether suicide reporting in two Austrian newspapers in the years 1855, 1865, 1875 and 1885 was responsible by the standards of current knowledge and finds that the level of responsible reporting was low and did not improve during the observation period. Arendt (2020) analyzes five newspapers, each from a different territory of the Austro-Hungarian Empire, between 1871 and 1910. He again finds covariation between the number of suicide reports in newspapers and the number of suicides for all five newspapers.

Several other projects have used newspapers as a data source to quantify historical events for which no records exist. For example, Monkkonen (2006) creates a database of murders in New York City by using newspaper reports in the New York Tribune, the largest contributing newspaper in this study. Atalay et al. (2020) digitized job postings in several US newspapers (Boston Globe, the New York Times, and the Wall Street Journal) from 1950 to 2000 to describe the evolution of work tasks to a previously impossible degree.

3 Data Sources

3.1 Digital Newspaper Program

The National Digital Newspaper Program (NDNP) is a collaboration between the National Endowment for the Humanities (NEH) and the Library of Congress (LC). Their website “Chronicling America” provides digitized historical newspapers. NEH-funded institutions (awardees) conduct the digitization and are spread across all US states except Massachusetts. Awardees have digitized 100,000 pages a year since 2005 (see Footnote 2). The historical years from which newspaper pages are digitized vary by award year, but lie in the range 1690–1963.

Table 1 provides the historical years covered by each grant; for example, the 2018 grants covered 1690–1963. Table 2 provides an overview of which states received grants in which year (2005–2019) and who the awardee in each state is.

Table 1 Years covered by each award
Table 2 State award years

3.2 Mortality Data

I merge two mortality datasets. First, I discovered the summaries of the 1850 and 1870 mortality schedules and digitized them. They cover 31 and 44 states, respectively. Second, from 1900 onwards, cause-specific state-year mortality data has been made available on the NBER website (see Footnote 3) by Miller (2008). Haines (2001) describes the underlying recording system and notes that death registration similar to modern standards was only implemented in the 1930s. In 1900, only 10 states had data of adequate quality to report. Comparing post-1900 mortality data to newspaper suicide mentions is therefore not ideal, as the newspaper data is fading out (see Fig. 1) just as the mortality data is starting to fade in.

Fig. 1 Distribution of pages. Note: The line shows the number of observations per year. The spikes descending from the top of the graph show how often a historical year was sampled. For example, the drop in newspaper pages around 1899 is not due to the sampling of the NDNP, but due to a real reduction in newspaper pages.

I lose observations in the comparison because Massachusetts is available in both mortality datasets but has never been part of the NDNP. On the other hand, states that joined the Union only after 1920 are never available in the mortality data (e.g. Hawaii and Alaska joined the Union in 1959).

3.3 Data Preparation

I first scrape the suicide mentions and the total number of pages available in the archive. I do not merge on the newspaper page level, due to a large share of missing page numbers, but first aggregate the pages and the suicide mentions to state-year level and then merge the suicide mentions to the total pages.
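The aggregate-then-merge step described above can be sketched as follows. This is only a minimal illustration in Python, not the Stata code used for the paper; the record layout (dictionaries with `state` and `year` keys) and the toy records are assumptions made for the example.

```python
from collections import Counter


def aggregate_state_year(records):
    """Count records (pages or mentions) per (state, year) cell."""
    return Counter((r["state"], r["year"]) for r in records)


def merge_state_year(mention_counts, page_counts):
    """Attach mention counts to page counts; cells without mentions get 0."""
    return {
        cell: {"pages": pages, "mentions": mention_counts.get(cell, 0)}
        for cell, pages in page_counts.items()
    }


# Hypothetical toy records, one per scraped page or per suicide mention.
pages = [
    {"state": "Maine", "year": 1870},
    {"state": "Maine", "year": 1870},
    {"state": "Ohio", "year": 1870},
]
mentions = [{"state": "Maine", "year": 1870}]

# Aggregate both sources first, then merge on the state-year cell.
panel = merge_state_year(
    aggregate_state_year(mentions), aggregate_state_year(pages)
)
```

Because both sources are collapsed to the state-year cell before merging, missing page numbers on individual pages do not affect the merge.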

In the cleaning process, I drop all observations that are not from current states but from current districts or territories, such as Puerto Rico and Guam. They are dropped because NEH grants are only given to current-day states, so it is unknown which state organization scanned those pages. The District of Columbia is also dropped, because the Library of Congress does the scanning but does not report how many pages were scanned in which year. It is thus difficult to form any idea about the underlying data-generating process.

I also only keep data from local newspapers in the sense that these newspapers were only distributed in the state they were published in. This should reduce over-counting due to celebrity suicides.

For the period 1800–1920 this yields 932,457 suicide mentions and 8,677,032 newspaper pages. Aggregated to the state-year level, this gives 3,112 observations. The scraping is conducted in Stata 15.1 using the jsonio Java plugin provided by Buchanan (2015). The aggregated state-year level data and the scraping code are available on my website at https://sites.google.com/view/christoph-kronenberg. Chronicling America also provides excellent resources on how to access the data at https://chroniclingamerica.loc.gov/about/api/.

4 Data Description

Table 3 shows how much of the newspaper market is represented in this sample. Unfortunately, that information is only available for 1840 and 1850, and thus the comparison can only be made with newspapers from those years that are in the scraped dataset. Overall, the scraped dataset covers 3% of newspapers at the time. In a few cases the scraped dataset includes newspapers for states that were not part of the census newspaper count; see West Virginia (1850) and Hawaii (1840 and 1850).

Table 3 Newspaper sample versus newspaper census

In terms of contributions from individual newspapers, the top three are the New York Tribune with 130 thousand pages, the New York Herald with 116 thousand pages and the Evening Star (Washington, D.C.) with 90 thousand pages. The mean/median number of pages per newspaper is 21/8 thousand for 1,684 newspapers.

Figure 1 displays the number of pages available in the data per year. The bars descending from the top indicate the number of times a historical year was covered by an NEH grant. This information is a combination of the information provided in Tables 1 and 2. For example, the combination of Arizona and the year 1881 is coded as four: newspapers from Arizona were scanned four times, and each time 1881 was covered in the award period. The year 1879, on the other hand, is coded as three, because the 2008 award only covers the period 1880–1922. The spikes follow the overall pattern of the pages and help to understand whether short-term deviations from the overall trend are due to sampling or to real changes in the production of newspaper pages.
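The coding of the coverage bars can be illustrated with a short Python sketch. Only the 1880–1922 range of the 2008 award is taken from the text; the other three Arizona award periods below are placeholder values chosen purely so that the example reproduces the counts of four and three, and the function name is hypothetical.

```python
def coverage_count(award_ranges, year):
    """Number of awards whose covered period includes the given year."""
    return sum(first <= year <= last for first, last in award_ranges)


# Award periods for one state as (first_year, last_year) tuples.
arizona_awards = [
    (1880, 1922),  # 2008 award period, as stated in the text
    (1836, 1922),  # placeholder range, NOT taken from Table 1
    (1836, 1922),  # placeholder range, NOT taken from Table 1
    (1836, 1922),  # placeholder range, NOT taken from Table 1
]

coded_1881 = coverage_count(arizona_awards, 1881)  # all four awards cover 1881
coded_1879 = coverage_count(arizona_awards, 1879)  # the 2008 award does not
```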

The number of pages drops around the turn of the century, despite the NEH grants being similar over that period. Thus, it is likely that this drop is a real change in newspaper production and not a characteristic of the digitization process. A similar pattern can be observed for the American Civil War (1861–1865): in this period the number of pages declines, while the number of times those years were sampled increases.

Figure 1 shows that newspaper pages prior to 1840 are scarce. This appears partly driven by the number of grants covering those years; see the top part of the figure. However, Dill (1928) reports that 150 newspapers were active in the US in 1800, increasing to 393 in 1810, 861 in 1820, 1,300 in 1830 and 1,403 in 1840. The number of copies took a little longer to increase: 22.5 million copies were produced in 1810, increasing to 68.1 million in 1828, 90.4 million in 1835 and 195.8 million in 1840. Prior to these changes, newspapers were so expensive that only a small number were produced and read. The current-day price of a pre-1840 newspaper is estimated to be around $20 (see Footnote 4). The introduction of new printing technologies, combined with the use of steam to power printing presses and increased literacy, led to the so-called penny presses: cheap tabloid newspapers.

I search this archive for occurrences of the term “suicide(s)”. The result is a dataset of all mentions of suicides within the archive. Given that I have more pages for some states and years than for others, I divide the total number of suicide mentions per state-year cell by the total number of pages in that cell and multiply by 100,000. This approach is analogous to cause-specific mortality, which is reported as deaths per 100,000 individuals for a specific disease or cause of death (e.g. suicide).
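The normalization is simple enough to state as a one-line function. The Python sketch below is illustrative only (the function name is hypothetical); as an example input it uses the pooled 1800–1920 totals reported in Section 3.3, whereas the proxy itself is computed separately for each state-year cell.

```python
def mentions_per_100k_pages(mentions, pages):
    """Suicide mentions per 100,000 newspaper pages for one cell."""
    if pages <= 0:
        raise ValueError("a cell needs at least one page")
    return mentions / pages * 100_000


# Pooled totals for 1800-1920 from Section 3.3, used here only to
# illustrate the arithmetic; the paper applies this per state-year cell.
pooled_rate = mentions_per_100k_pages(932_457, 8_677_032)
```

Cells with zero pages are rejected rather than silently mapped to zero, since a missing denominator is not the same as an absence of mentions.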

Figure 2 shows the trend in suicide mentions per 100,000 pages over time. The Civil War dip is far larger than in Fig. 1, which suggests that the decline in suicide mentions is larger than can be explained by the decline in pages alone. A similar finding has been reported by Wasserman et al. (1994), who look at cover pages of the New York Times. Yet, they also show that the editor drove this pattern for the New York Times.

Fig. 2 Suicide mentions per 100,000 pages over time. Note: The dashed line shows the number of suicide mentions per 100,000 pages per year. The adjusted measure (dotted line) counts at most one suicide mention per newspaper/day before aggregation. This should reduce the potential problem of over-counting celebrity suicides. The x-axis shows the years from 1800 to 1920; ticks indicate five-year intervals.

Figure 2 also shows the comparison between the raw suicide mentions per 100,000 pages and the adjusted measure, which counts only one mention per day and newspaper. The latter should reduce concerns about over-counting suicides of well-known individuals. The lines are virtually identical until 1865; the raw suicide mentions then grow faster for a decade, after which the two lines run parallel to each other. This indicates that spikes in suicide mentions are not a major concern, at least when looking at aggregate trends.
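The difference between the raw and the adjusted count can be made concrete with a small Python sketch. It assumes that each matched occurrence is available as a (state, newspaper, issue date) tuple; this layout, the function name and the toy data are assumptions for illustration and do not describe the original Stata implementation.

```python
from collections import Counter
from datetime import date


def raw_and_adjusted_counts(mentions):
    """Return (raw, adjusted) mention counts keyed by (state, year).

    raw:      every matched occurrence counts.
    adjusted: at most one occurrence per newspaper and day counts.
    """
    mentions = list(mentions)
    raw = Counter((state, day.year) for state, _paper, day in mentions)
    # Deduplicate on (state, newspaper, day) before aggregating.
    unique_paper_days = set(mentions)
    adjusted = Counter(
        (state, day.year) for state, _paper, day in unique_paper_days
    )
    return raw, adjusted


# Hypothetical toy data: "Paper A" mentions suicide twice on the same day.
toy = [
    ("Maine", "Paper A", date(1870, 3, 1)),
    ("Maine", "Paper A", date(1870, 3, 1)),
    ("Maine", "Paper A", date(1870, 3, 2)),
    ("Maine", "Paper B", date(1870, 3, 1)),
]
raw, adjusted = raw_and_adjusted_counts(toy)
```

In the toy data the raw count for Maine in 1870 is four, while the adjusted count is three, because the repeated mention in the same newspaper on the same day is collapsed.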

5 Data Validation

The natural validation process would be to compare the newspaper suicide mentions to the real suicide prevalence. Yet, as outlined in Section 3.2, this data is only available for 1850, 1870 and from 1900 onwards, and even then it lacks states before they joined the Union. Data quality issues have also been raised (Haines, 2001), which is unsurprising given that there are large data quality issues even today (Fernandez, 2019).

I still compare the suicide mentions proxy against suicide prevalence data post 1900 in order to show that, despite all the problems outlined, the two compare reasonably well.

Figure 3 compares 1850, 1870 and 1900–1920 suicide counts to suicide mentions and the adjusted suicide mention count. Given the unavailability of yearly population numbers, counts of suicides and mentions are compared. Furthermore, because suicide counts are unavailable for some states, this comparison can only be conducted for 32 states from 1900 onwards (see Footnote 5). For 1850, no suicide data is available for Hawaii, Minnesota and West Virginia, as these states were only admitted to the Union later; the same holds for Hawaii in 1870.

Fig. 3 Suicide mentions versus suicides 1850, 1870 and 1900–1920. Note: The connected lines between 1850 and 1870 show suicides reported in historical mortality schedules for 22 states (1850) and 33 states (1870). The lines from 1900 onwards show suicides as provided by Grant Miller on the NBER website https://data.nber.org/data/vital-statistics-deaths-historical/ (last accessed 20th of July 2020).

For 1850 and 1870, suicide mentions appear to over-count suicides. Yet, it is also possible that the suicide count is an undercount, given that the mortality schedules are based on the reporting of relatives. The adjusted suicide mentions track the actual number of suicides fairly well from 1900 to 1910, after which suicides grow while adjusted suicide mentions decline. The trends for the raw suicide mentions are similar, at higher levels. Thus, it appears that newspaper suicide mentions, raw or adjusted, are either a reasonably good proxy or an undercount of the true number of suicides. Yet, the newspaper data includes data for another 15 states (see Footnote 6) post 1900 that are not covered in the data provided by Grant Miller and the NBER.

Figure 4 compares the suicide mentions per 100,000 pages with the valence measure put forth in Hills et al. (2019). Their measure captures subjective well-being as implied by the sentiment (or valence) in several million books. While subjective well-being or valence is not identical to suicide (mentions), there is a known strong correlation between subjective measures of well-being and quality of life on the one hand and suicides on the other (Eckersley & Dear, 2002; Hays & Fayers, 2020; Helliwell, 2006). The valence measure drops during both the Civil War and the First World War, as do the suicide mentions. The valence measure already declines prior to both wars, whereas the suicide mentions only decline around the beginning of both wars.

Fig. 4 Suicide mentions versus valence measure from Hills et al. (2019). Note: The black dashed line shows the yearly valence as provided by Hills et al. (2019). The dashed line shows the number of suicide mentions per year and the dotted line the adjusted measure. The latter only counts at most one suicide mention per newspaper/day before aggregation. The vertical lines indicate the US Civil War and the First World War, in line with Fig. 5 in Hills et al. (2019).

The solid line shows the yearly number of suicides. The dashed line shows the number of suicide mentions per year and the dotted line the adjusted measure. The latter only counts at most one suicide mention per newspaper/day before aggregation.

Another aspect of validation is whether suicide mentions are indeed capturing suicides and what share of suicides is captured. For this, suicide mentions from Maine for 1850, 1870 and 1900–1920 were hand-validated. There are 2,495 automatically recognized mentions; hand-validation shows a 98% accuracy rate, as in 45 cases the text recognition reported a suicide mention where there was none. However, of the remaining mentions of the word suicide, some covered suicidality, suicides in other countries, suicide attempts that did not lead to death, or other instances of the word suicide appearing without a suicide having taken place in the USA. Excluding these cases, 70% of all suicide mentions relate to suicides. In most of these cases, 1,619 to be exact, the name of the individual was reported, which allows me to check the number of mentions per suicide. There are 1,113 unique names and thus 1.45 suicide mentions per suicide.
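The two ratios reported for the Maine subsample follow directly from the counts given above. The short Python sketch below merely retraces that arithmetic; the variable names are chosen for illustration.

```python
# Counts reported for the hand-validated Maine subsample.
recognized_mentions = 2_495  # automatically recognized mentions
ocr_false_positives = 45     # text recognition reported a non-existent mention
named_mentions = 1_619       # mentions that report the individual's name
unique_names = 1_113         # distinct individuals among the named mentions

# Share of recognized mentions that really contain the word "suicide".
ocr_accuracy = (recognized_mentions - ocr_false_positives) / recognized_mentions

# Average number of newspaper mentions per named suicide.
mentions_per_suicide = named_mentions / unique_names
```

Rounded, these give the 98% accuracy rate and the 1.45 mentions per suicide stated in the text.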

Figure 5 shows the share of suicide mentions after accounting for text recognition failures and invalid suicide mentions over suicide reports from the Mortality Schedules and Grant Miller. For 1870, 40% of all reported suicides are captured. From 1900 onwards the share drops as expected.

Fig. 5 Valid suicide mentions as a share of reported suicides for Maine. Note: For 1850 and 1870, the suicides reported in the historical mortality schedules serve as the denominator. From 1900 onwards, the denominator is the number of suicides as provided by Grant Miller on the NBER website https://data.nber.org/data/vital-statistics-deaths-historical/ (last accessed 20th of July 2020).

Figure 3 has shown that suicides went up from 1900 onwards, while Fig. 1 has shown that the number of pages went down over the same period.

6 Discussion

I have extracted suicide mentions from nearly nine million newspaper pages from 1800 to 1920 and then validated them against the sparse existing data on suicide prevalence, as well as against a measure of subjective well-being based on the sentiment in millions of books (Hills et al., 2019).

Thus, this work highlights the potential of using historical newspapers as a data source. There are of course challenges when working with data derived from newspapers published from 1800 to 1920. The scraping process is entirely machine-driven, and thus some suicide mentions are false positives. Yet, I have shown that 70% of all mentions relate to actual suicides in a subsample from Maine. False negatives (no reporting in newspapers despite suicides taking place) are a concern. However, the same holds for any mortality data that is not based on coroner data. The same issue is thus present in the mortality schedules, as relatives might misrepresent the cause of death for reasons of social prestige. The practice of newspapers not reporting on suicides appears to have emerged only around the year 2000 (see Footnote 7), as can also be seen from the large number of names reported in historical newspapers, which would be considered poor journalism today.

Overall, it is unlikely that better data will ever be available for the 19th century, and additional newspapers are continuously being added to the archive, improving its quality. Yet, any long historical dataset, including GDP, (un)employment or even population numbers, should be used with caution. The same holds here, but having a dataset that has to be used with some caution is arguably better than having barely any data at all. However, I currently recommend using the data only from 1840 onwards, given how rarely earlier years were sampled (see Fig. 1) and how few pages are available in the archive. I also currently recommend using the data only until 1910, given the sharp drop in available pages despite the reasonably high sampling rate.

This data resource offers researchers in the quantitative social sciences, linguistics and beyond the opportunity to study how numerous changes throughout the 19th century affected one of the currently leading causes of death (Hedegaard et al., 2018). There are many opportunities to study how changes in the economy, the westward expansion, slavery and the displacement of Native Americans, technological changes and wars affected suicides.

Future research could also attempt to extract data for other reported causes of death. Yet, it would need to be a cause of death that is often reported in newspapers and for which all synonyms are known. While suicide as a word has been used for centuries without change, this is not true for most other causes of death (see Footnote 8).