Natural Language

Part of the Springer Series in the Data Sciences book series (SSDS)


Although we are constantly exposed to written text and human speech, most of the data we encounter in policy settings is available in neat, tabular formats. In the public sector, for example, both surveys and administrative data tend to rely on closed-ended questions that restrict responses to a finite set of possibilities, such as yes/no and agree/neutral/disagree. Forcing respondents into a well-defined response makes for neat datasets, but it ignores the richness of natural language that could be harvested from open-ended questions. Language carries a tremendous amount of information, encoded in nouns, verbs, adjectives, and other parts of speech. Combinations of these words allow language to carry multi-layered information, conveying sentiments, ideas, and facts.

  • DOI: 10.1007/978-3-030-71352-2_13
  • Chapter length: 23 pages
  • ISBN: 978-3-030-71352-2

  1. In the wild, these news articles would have been embedded in HTML files on websites, requiring some attention to the structure of the page.

  2. Amazon’s Mechanical Turk asks internet participants to conduct many simple tasks for small payments. It is a favorite of experimental social scientists for creating training samples (Samuel 2018).

  3. The Wikipedia article on the Oil Crisis can be found at

  4. Other clustering techniques can accommodate mixtures of non-exclusive clusters; however, these are beyond the scope of this text.

  5. Other initialization methods can also be employed with prior knowledge.

  6. There are other inputs into the sampling procedure, such as hyperparameters that change the shape of the underlying distributions; however, these parameters vary from one technique to another, so we have chosen to omit them from this overview.

  7. While LDA is the most common topic model in use, STM is arguably more flexible, intuitive, and extensible.

  8. As of 2019, the only presidents who did not deliver SOTU addresses or letters were James A. Garfield and William Henry Harrison.

  9. Minor adjustments were made to minimize errors. These adjustments can be seen in the data processing scripts on GitHub.

  10. For more advanced DFM processing, consider applying the processes illustrated in this chapter with the tidytext package or the quanteda package.

Author information

Corresponding author

Correspondence to Jeffrey C. Chen.

Copyright information

© 2021 Springer Nature Switzerland AG

About this chapter

Cite this chapter

Chen, J.C., Rubin, E.A., Cornwall, G.J. (2021). Natural Language. In: Data Science for Public Policy. Springer Series in the Data Sciences. Springer, Cham.
