Data collection
We collected newspaper articles published in major U.S. newspapers in the year following each of the hurricanes. We chose a timespan of 1 year to capture the duration of media coverage following each hurricane and to ensure we had enough articles from each hurricane to support the mathematical analysis. We identified newspaper articles through a search for the name of the hurricane together with either the word “hurricane” or “storm” in the title or leading paragraphs of the article. To account for regional variation in post-hurricane reporting, we chose four newspapers spanning major regions of the USA: the northeast, New England, the midwest, and the west. We chose The New York Times, The Boston Globe, The Los Angeles Times, and The Chicago Tribune because they are high-profile, established newspapers with high Sunday circulation and readership, and because they are influential and well-respected both nationally and locally. These four newspapers are consistently among the top 25 U.S. Sunday newspapers and were available for article collection through online databases. We collected articles appearing from the first of the month in which each hurricane occurred through the subsequent year using the ProQuest, LexisNexis, and Westlaw Campus Research online databases. The corpora for analysis comprise 3100 articles for Hurricane Katrina and 1039 articles for Hurricane Sandy. We transform each corpus into a term-document matrix for the analysis.
Latent semantic analysis
LSA is a method of uncovering hidden relationships in document data (Deerwester et al. 1990). LSA uses the matrix factorization technique singular value decomposition (SVD) to reduce the rank of the term-document matrix and merge the dimensions that share similar meanings. SVD creates the following matrices:
$$M = USV^{T},$$
where the matrix M is the original t×d matrix (number of terms by number of documents), the columns of the matrix U are the eigenvectors of \(MM^{T}\), the entries on the diagonal of the matrix S are the square roots of the eigenvalues of \(MM^{T}\), and the rows of the matrix \(V^{T}\) are the eigenvectors of \(M^{T}M\). Retaining the k largest singular values and setting all others to 0 gives the best rank-k approximation of M. This rank reduction creates a t×k term matrix, \(U_{k}S_{k}\), consisting of term vectors in latent semantic space as its rows, and a k×d document matrix, \(S_{k}{V_{k}^{T}}\), consisting of document vectors as its columns. The documents and terms are then compared in latent semantic space using cosine similarity as the distance metric (Berry and Browne 2005). If two term vectors have a cosine similarity close to 1, then these terms are interpreted to be related to each other in meaning. We explain this process further in Fig. 1.
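As an illustration, the following is a minimal numpy sketch of this rank reduction, assuming the term-document matrix has already been constructed; the function and variable names are ours, not part of the original analysis.

```python
import numpy as np

def lsa_rank_k(M, k):
    """Rank-k LSA decomposition of a t x d term-document matrix M.

    Returns the t x k term matrix (U_k S_k), whose rows are term vectors,
    and the k x d document matrix (S_k V_k^T), whose columns are document vectors.
    """
    # Economy-size SVD: M = U S V^T
    U, s, Vt = np.linalg.svd(M, full_matrices=False)

    # Keep the k largest singular values and set the rest aside.
    U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]

    term_vectors = U_k @ S_k   # t x k: one row per term in latent semantic space
    doc_vectors = S_k @ Vt_k   # k x d: one column per document
    return term_vectors, doc_vectors

# Toy example: a random 1000-term x 200-document matrix, reduced to k = 20.
M = np.random.rand(1000, 200)
terms_20, docs_20 = lsa_rank_k(M, k=20)
```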
We load the documents into a term-document matrix and remove common and irrelevant terms. The removed terms include words common to the articles, such as “hurricane”, “storm”, “sandy”, and “katrina”, along with the names of authors and editors of the articles. We then convert each frequency in the matrix to a term frequency-inverse document frequency (tf-idf) weight via the following transformation (Baeza-Yates et al. 1999):
$$w_{i,j} = \left\{\begin{array}{ll}(1+\log_{2} f_{i,j})\times\log_{2}\frac{N}{n_{i}} & f_{i,j} > 0 \\ 0 & \text{otherwise,}\end{array}\right.$$
where the variable \(w_{i,j}\) is the new weight in the matrix at location (i,j), \(f_{i,j}\) is the current frequency in position (i,j), N is the number of documents in the corpus, and \(n_{i}\) is the number of documents containing word i. This weighting scheme places higher weights on rarer terms because they are more selective and provide more information about the corpus, while placing lower weights on common words such as “the” and “and”.
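For concreteness, this weighting can be applied directly to a term-document count matrix. The sketch below assumes terms as rows and that every term occurs in at least one document; the function name is ours.

```python
import numpy as np

def tfidf_weight(counts):
    """Apply the log tf-idf weighting above to a t x d term-document
    count matrix (rows are terms, columns are documents)."""
    counts = np.asarray(counts, dtype=float)
    N = counts.shape[1]                     # number of documents
    n_i = np.count_nonzero(counts, axis=1)  # documents containing term i
    idf = np.log2(N / n_i)                  # assumes no all-zero term rows

    # w_ij = (1 + log2 f_ij) * log2(N / n_i) if f_ij > 0, else 0
    tf = np.where(counts > 0, 1.0 + np.log2(np.where(counts > 0, counts, 1.0)), 0.0)
    return tf * idf[:, None]

# Example: weights = tfidf_weight(count_matrix), then run LSA on the result.
```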
We run LSA on the tf-idf term-document matrix for each hurricane. We then compare the documents and terms in the corpus to a given query of terms in latent semantic space. We transform the words composing the query into term vectors and calculate their centroid to give the vector representation of the query. If the query is only one word long, then the vector representation of the query equals the vector representation of that word. We analyze three queries using LSA: “climate”, “energy”, and “climate, energy”. LSA gives the terms most related to each query vector, which we then use to determine how climate change and energy are discussed, both separately and together, in the media after Hurricanes Katrina and Sandy.
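A sketch of this query procedure is given below, reusing the term vectors from the LSA sketch above; `vocabulary` is assumed to list the corpus terms in the same order as the rows of the term matrix, and the function is ours.

```python
import numpy as np

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def query_neighbors(query_words, term_vectors, vocabulary, top_n=10):
    """Return the top_n terms closest to the query in latent semantic space.

    term_vectors : t x k matrix of term vectors (rows) from LSA
    vocabulary   : list of t terms aligned with the rows of term_vectors
    """
    index = {w: i for i, w in enumerate(vocabulary)}
    # Query vector = centroid of its word vectors (a single word's vector
    # if the query is one word long).
    q = np.mean([term_vectors[index[w]] for w in query_words], axis=0)
    sims = [(word, cosine_similarity(q, term_vectors[i]))
            for i, word in enumerate(vocabulary) if word not in query_words]
    return sorted(sims, key=lambda x: x[1], reverse=True)[:top_n]

# e.g. query_neighbors(["climate", "energy"], terms_20, vocabulary)
```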
Latent Dirichlet allocation
LDA, a probabilistic topic model (Blei et al. 2003; Blei 2012), defines each hidden topic as a probability distribution over all of the words in the corpus, and each document’s content is then represented as a probability distribution over all of the topics. Figure 2 gives illustrations of distributions for a potential LDA model.
LDA assumes that the documents were created via the following generative process. For each document:
1. Randomly choose a distribution over topics from a Dirichlet distribution. This distribution of topics gives a nonzero probability of selecting each word in the corpus.

2. For each word in the current document:

   a) Randomly select a topic from the topic distribution drawn in step 1.

   b) Randomly choose a word from the topic just selected and insert it into the document.

3. Repeat until the document is complete.
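To make this generative story concrete, the following is a minimal simulation of it in numpy. The topic count, vocabulary size, hyperparameters, and function names are arbitrary choices of ours rather than values from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_document(topics, alpha, n_words, rng):
    """Simulate one document under the LDA generative process.

    topics : K x V matrix; row k is topic k's distribution over the V words
    alpha  : Dirichlet hyperparameter for the per-document topic proportions
    """
    K, V = topics.shape
    # 1. Draw this document's distribution over topics from a Dirichlet.
    theta = rng.dirichlet(np.full(K, alpha))
    words = []
    for _ in range(n_words):            # 2. for each word position...
        z = rng.choice(K, p=theta)      #    a) pick a topic
        w = rng.choice(V, p=topics[z])  #    b) pick a word from that topic
        words.append(w)
    return words                        # 3. document is complete

# Toy corpus: 20 topics over a 500-word vocabulary, 100-word documents.
K, V = 20, 500
topics = rng.dirichlet(np.full(V, 0.1), size=K)
docs = [generate_document(topics, alpha=0.5, n_words=100, rng=rng) for _ in range(5)]
```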
The distinguishing characteristic of LDA is that all of the documents in the corpus share the same set of k topics; however, each document contains each topic in a different proportion. The goal of the model is to learn the topic distributions. The generative process for LDA corresponds to the following joint distribution:
$$\begin{array}{llll} & P(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D}) \\ &\quad= \prod\limits_{i=1}^{K}\!P(\beta_{i})\prod\limits_{d=1}^{D}\!P(\theta_{d})\left( \prod\limits_{n=1}^{N}P(z_{d,n}|\theta_{d})P(w_{d,n}|\beta_{1:K},z_{d,n})\right), \end{array} $$
where \(\beta_{k}\) is the distribution over the words for topic k, \(\theta_{d,k}\) is the topic proportion for topic k in document d, \(z_{d,n}\) is the topic assignment for the nth word in document d, and \(w_{d,n}\) is the nth word in document d. This joint distribution defines certain dependencies. The topic assignment \(z_{d,n}\) depends on the topic proportions of the article, \(\theta_{d}\). The current word \(w_{d,n}\) depends on both the topic assignment, \(z_{d,n}\), and the topic distributions, \(\beta_{1:K}\). The main computational problem is computing the posterior, the conditional distribution of the topic structure given the observed documents:
$$p(\beta_{1:K},\theta_{1:D},z_{1:D}|w_{1:D}) = \frac{p(\beta_{1:K},\theta_{1:D},z_{1:D},w_{1:D})}{p(w_{1:D})}.$$
The denominator of the posterior represents the probability of seeing the observed corpus under any topic model. Because this quantity is intractable to compute exactly (it sums over every possible topic structure), the posterior is approximated using the sampling-based algorithm Gibbs sampling.
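The source does not spell out the sampler itself; for illustration, the following is a compact collapsed Gibbs sampling sketch for LDA. The hyperparameters alpha and eta, the iteration count, and the function name are ours, not part of the original analysis.

```python
import numpy as np

def gibbs_lda(docs, K, V, alpha=0.1, eta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs : list of documents, each a list of word ids in [0, V)
    K    : number of topics
    Returns the count matrices from which theta and beta can be estimated.
    """
    rng = np.random.default_rng(seed)
    n_dk = np.zeros((len(docs), K))   # topic counts per document
    n_kw = np.zeros((K, V))           # word counts per topic
    n_k = np.zeros(K)                 # total words assigned to each topic
    z = []                            # current topic assignment of each word

    # Random initialization of topic assignments.
    for d, doc in enumerate(docs):
        zd = rng.integers(K, size=len(doc))
        z.append(zd)
        for w, k in zip(doc, zd):
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts...
                n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
                # ...sample a new topic from the full conditional...
                p = (n_dk[d] + alpha) * (n_kw[:, w] + eta) / (n_k + V * eta)
                k = rng.choice(K, p=p / p.sum())
                # ...and add the new assignment back.
                z[d][i] = k
                n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1
    return n_dk, n_kw
```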
We generate topic models for the Hurricane Sandy and Katrina articles using LDA-C, developed by Blei (Blei et al. 2003). We remove a list of common stop words from the corpus, along with common words specific to this corpus such as “Sandy”, “Katrina”, “hurricane”, and “storm”. After this filtering, we use a Porter word stemmer to stem the remaining words, so that each word is represented in a single form even though it may appear in the articles in many different inflected forms (Porter 1980).
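A preprocessing sketch along these lines is shown below, assuming NLTK's English stop-word list and Porter stemmer (the stop-word corpus must be downloaded once via `nltk.download('stopwords')`); the corpus-specific stop words follow the text, but the function itself is ours.

```python
import re
from nltk.corpus import stopwords    # standard English stop-word list
from nltk.stem import PorterStemmer  # Porter (1980) stemmer

CORPUS_STOPWORDS = {"sandy", "katrina", "hurricane", "storm"}

def preprocess(text, extra_stopwords=CORPUS_STOPWORDS):
    """Lowercase, tokenize, drop stop words, and Porter-stem an article."""
    stemmer = PorterStemmer()
    stop = set(stopwords.words("english")) | set(extra_stopwords)
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(t) for t in tokens if t not in stop]

# e.g. preprocess("Hurricane Sandy flooded subway stations across the city")
```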
Determining the number of topics
The number of topics within a particular corpus depends on the size and scope of the corpus. In our corpora, the scope is already quite narrow, as we focus only on newspaper articles about a particular hurricane. Thus, we do not expect the number of topics to be large. To choose the number of topics for the analysis, we implement several techniques.
First, to determine k, the rank of the approximated term-document matrix used in LSA, we look at the singular values determined via SVD. The 100 largest singular values are plotted in Fig. 3 for Hurricanes Sandy and Katrina. The singular value decay rate slows considerably between singular values 20 and 30 for both matrices. We find that topics become repetitive above k=20, and thus we choose k=20 as the rank of the approximated term-document matrix in LSA.
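For illustration, the decay of the singular values can be inspected directly from the weighted term-document matrix before settling on k; `M_tfidf` is our name for that matrix, not one used in the study.

```python
import numpy as np

# M_tfidf: the tf-idf weighted term-document matrix built earlier (our name).
singular_values = np.linalg.svd(M_tfidf, compute_uv=False)

# Relative drop between consecutive singular values; the decay slows
# where these ratios flatten out (around k = 20-30 in our corpora).
decay = -np.diff(singular_values[:100]) / singular_values[:99]
for k in (10, 20, 30, 40):
    print(k, singular_values[k], round(decay[k], 4))
```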
To determine the number of topics for LDA to learn, we use perplexity, a measure employed by Blei et al. (2003) to determine how accurately the topic model predicts a sample of unseen documents. We compute the perplexity of a held-out test set of documents for each hurricane while varying the number of topics learned on the training data. Perplexity decreases with the number of topics and should eventually level out when increasing the number of topics no longer increases the accuracy of the model; it may begin to increase again when adding topics causes the model to overfit the data. Perplexity is defined in (Blei et al. 2003) as
$$\textnormal{perplexity}(D_{\text{test}}) = \exp\left\{-\frac{{\sum}_{d=1}^{M}\log p(\textbf{w}_{d})}{{\sum}_{d=1}^{M}N_{d}}\right\},$$
where the numerator represents the log-likelihood of the unseen documents \(\textbf{w}_{d}\), and the denominator represents the total number of words in the testing set. We separate the data into 10 equal testing and training sets for 10-fold cross validation on each hurricane. We run LDA on each of the 10 different training sets consisting of 90 % of the articles in each hurricane corpus. We then calculate the perplexity for a range of topic numbers on the testing sets, each consisting of 10 % of the articles. We average the perplexity at each topic number over the testing sets and plot the result in Fig. 4a, b.
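A sketch of this cross-validated perplexity calculation is given below, assuming scikit-learn's KFold for the splits; `fit_lda` and `heldout_loglik` are hypothetical placeholders for whichever LDA implementation supplies the trained model and the per-document log-likelihoods log p(w_d).

```python
import numpy as np
from sklearn.model_selection import KFold

def perplexity(log_likelihoods, doc_lengths):
    """Held-out perplexity per the formula above:
    exp(-sum_d log p(w_d) / sum_d N_d)."""
    return np.exp(-np.sum(log_likelihoods) / np.sum(doc_lengths))

def cv_perplexity(docs, topic_numbers, fit_lda, heldout_loglik, n_folds=10):
    """Average held-out perplexity over n_folds folds for each candidate
    number of topics. fit_lda and heldout_loglik stand in for the training
    and evaluation routines of the chosen LDA implementation."""
    results = {k: [] for k in topic_numbers}
    for train_idx, test_idx in KFold(n_splits=n_folds, shuffle=True).split(docs):
        train = [docs[i] for i in train_idx]   # 90 % of the articles
        test = [docs[i] for i in test_idx]     # 10 % of the articles
        for k in topic_numbers:
            model = fit_lda(train, num_topics=k)
            loglik = [heldout_loglik(model, d) for d in test]
            lengths = [len(d) for d in test]
            results[k].append(perplexity(loglik, lengths))
    return {k: float(np.mean(v)) for k, v in results.items()}
```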
Figure 4 indicates that the optimal number of topics in the Hurricane Sandy corpus is roughly 20 distinct topics, while the optimal number in the Hurricane Katrina corpus is between 280 and 300 distinct topics. Compared to the Sandy corpus, the Hurricane Katrina corpus contains three times as many articles and about double the number of unique words (17,898 vs. 9521). On average, an article in the Hurricane Sandy corpus contains 270 words, while an article in the Hurricane Katrina corpus contains 376 words. The difference in these statistics may account for the difference in optimal topic numbers in Fig. 4. To test this hypothesis, we take 100 random samples of size 1039 (the size of the Sandy corpus) from the Katrina corpus and calculate the average perplexity over these samples. For each of the 100 random samples, we use 10 testing and training sets for 10-fold cross validation, as was done in the previous calculations of perplexity. We calculate the average perplexity over the 10 testing sets for each topic number, and then average over the 100 samples for each topic number, showing the result in Fig. 4c. We find that on average, the optimal number of topics for a smaller Katrina corpus is around 30.
Based on the above analysis, we opt to use a 20-topic model for Hurricane Sandy and a 30-topic model for Hurricane Katrina in our LDA analysis of the post-event media coverage.