Exploratory Data Analysis

Chen, Jeffrey C.; Rubin, Edward A.; Cornwall, Gary J.

doi:10.1007/978-3-030-71352-2_6

Part of the book series: Springer Series in the Data Sciences (SSDS)

Abstract

Smartphones are a modern wonder that allows society to stay connected and enhances interactive experiences with the surrounding world. Each interaction between a user and a phone depends on a sophisticated array of sensors.

Notes

1. There are many graphs presented in this chapter. To follow along, download the DIYs repository from GitHub (https://github.com/DataScienceForPublicPolicy/diys). The R Markdown file for this section is diy-ch06-visuals.Rmd.
2. The designation of a region depends on the classification system. The U.S. Census Bureau, for example, has regional divisions and sub-divisions.
3. With ggplot2, it is also possible to supply the unaggregated data using stat_count.
4. The NNBS dataset is a General Household Survey produced by the Nigerian National Bureau of Statistics in collaboration with the World Bank to measure (Nigeria National Bureau of Statistics 2019). The example is drawn from survey questions relating to banking access in file “sect4a1_plantingw4”. The EU LFS is the European Union’s Labour Force Survey, which is conducted by national statistical agencies of EU member countries and maintained by Eurostat (Eurostat 2020). The ACS dataset is the American Community Survey, administered by the U.S. Census Bureau (U.S. Census Bureau 2018a). The NYC 311 SR dataset contains complaints and service requests made to the City of New York (NYC Department of Information Technology and Telecommunication 2020).
5. Inf values are not truly missing values but can still prove problematic. We include them for awareness.
6. Note that this applies not only to the logical values from missing value functions but also to any vector with NA values.
7. Many functions can ignore NA values. When in doubt, check the Help section for documentation.
8. Consider experimenting with the pct_miss parameter to understand the trade-offs.
9. Some series follow a multiplicative formulation. For simplicity, we focus on the additive case.
10. Seasonal adjustment has its challenges, however. Decomposing a time series can be subjective and requires analyst judgment. No universal definition exists of what truly constitutes trend or seasonality. Ultimately, whether a series is “well-adjusted” depends on trust in the process.
11. All variables should be numeric values.
12. Economic numbers are published as vintages, meaning that a given quarter’s data will be revised each time a new release is made available. The data is “real-time” as the Philadelphia Fed archives the data based on each vintage so that the history of an estimate can be traced.
13. We will revisit hierarchical clustering in Chapter 11.
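
Note 9 contrasts additive and multiplicative formulations of a time series. As a rough sketch of that distinction, the two forms are commonly written as below. The symbols are assumptions for illustration and are not taken from the chapter: y_t is the observed series, T_t the trend, S_t the seasonal component, and R_t the remainder. The third line only holds when every component is strictly positive, in which case taking logs turns a multiplicative series into an additive one.

```latex
\begin{align}
  y_t      &= T_t + S_t + R_t                  && \text{(additive)} \\
  y_t      &= T_t \times S_t \times R_t        && \text{(multiplicative)} \\
  \log y_t &= \log T_t + \log S_t + \log R_t   && \text{(multiplicative, after taking logs)}
\end{align}
```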

Author information

Authors and Affiliations

Bennett Institute for Public Policy, University of Cambridge, Cambridge, UK
Jeffrey C. Chen
Department of Economics, University of Oregon, Eugene, OR, USA
Edward A. Rubin
Department of Commerce, Bureau of Economic Analysis, Suitland, MD, USA
Gary J. Cornwall

Corresponding author

Correspondence to Jeffrey C. Chen.

About this chapter

Cite this chapter

Chen, J.C., Rubin, E.A., Cornwall, G.J. (2021). Exploratory Data Analysis. In: Data Science for Public Policy. Springer Series in the Data Sciences. Springer, Cham. https://doi.org/10.1007/978-3-030-71352-2_6

DOI: https://doi.org/10.1007/978-3-030-71352-2_6
Published: 01 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-71351-5
Online ISBN: 978-3-030-71352-2
eBook Packages: Mathematics and Statistics (R0)
