
Beginning Deep Survey Analysis


Abstract

I divided the analysis of survey data into Shallow Analysis and Deep Analysis. The former just skims the surface of all the data collected from a survey, highlighting only a minimum of findings with the simplest analysis tools. These tools, useful and informative in their own right, are only the first that should be used, not the only ones. They help you dig out some findings but leave much buried. I covered them and their use in Python in the previous chapter.


Notes

  1. Based on: https://stats.stackexchange.com/questions/267192/doubling-or-halving-p-values-for-one-vs-two-tailed-tests/267197. Last accessed July 29, 2020.

  2. As a personal anecdote, I once had a client who wanted to know if a difference of one cent in the prices of two products was significant—the products were selling for about $10 each.

  3. To be “stat tested,” as many like to say.

  4. I once did some survey analysis work for a large food manufacturing company (to remain nameless) that used α = 0.20.

  5. Economists refer to this as a perfectly competitive market. All firms in such a market are price takers, meaning they have no influence on the market price. Therefore, there is only one market price.

  6. It is easy to show that for \(\bar{X} = \frac{1}{n} \sum X_i\), \(\sum (X_i - \bar{X}) = 0\).

  7. It can actually be any level. As you will see, however, the first level is dropped by statsmodels.

  8. The cross-product term cancels after summing terms.

  9. Of course, the military branch does not determine your age. The main point is that the age distribution varies by branch.

  10. \(\binom {7}{2} = \dfrac {7!}{2! \times 5!} = 21\).

  11. QA7: “Did you ever serve in a combat or war zone?” There is a clarifying statement: “Persons serving in a combat or war zone usually receive combat zone tax exclusion, imminent danger pay, or hostile fire pay.”

  12. Source: https://en.wikipedia.org/wiki/Tooth-to-tail_ratio. Last accessed September 24, 2020.

  13. In the Design of Experiments literature, a treatment is an experimental condition placed on an object that will be measured. The measure is the effect of that treatment. The objects may be grouped into blocks designed to be homogeneous to remove any nuisance factors that might influence the responses to the treatments. Only the effect of the treatments is desired. In the survey context, the treatments are the CATA questions, and the blocks are the respondents themselves. See Box et al. (1978) for a discussion of experimental designs.

  14. See the article “Cochran’s Q test” at https://en.wikipedia.org/wiki/Cochran%27s_Q_test. Last accessed September 30, 2020.

  15. See https://docs.python.org/3/library/itertools.html#itertools.count. Last accessed October 1, 2020.

  16. The original data had “Yes” = 1, “No” = 2, and “Don’t Know” = 3.

  17. “In mathematics, specifically set theory, the Cartesian product of two sets A and B, denoted A × B, is the set of all ordered pairs (a, b) where a is in A and b is in B.” Source: Wikipedia article “Cartesian product”: https://en.wikipedia.org/wiki/Cartesian_product. Last accessed on October 2, 2020. For this problem, the collection of branches is one set, and the collection of genders is another.

  18. See, for example, comments by N. Robbins at https://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/?sh=21fd52f71c7f. Last accessed December 20, 2020. Also see Few (2008).

References

  • Agresti, A. 2002. Categorical Data Analysis, 2nd ed. New York: Wiley.

  • Box, G., W. Hunter, and J. Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: Wiley.

  • Cox, D. 2020. Statistical significance. Annual Review of Statistics and Its Application 1: 1–10.

  • Daniel, W.W. 1977. Statistical significance versus practical significance. Science Education 61 (3): 423–427.

  • Dudewicz, E.J., and S.N. Mishra. 1988. Modern Mathematical Statistics. New York: Wiley.

  • Ellis, S., and H. Steyn. 2003. Practical significance (effect sizes) versus or in combination with statistical significance (p-values). Management Dynamics 12 (4): 51–53.

  • Few, S. 2008. Practical rules for using color in charts. Perceptual Edge, Visual Business Intelligence Newsletter.

  • Greene, W.H. 2003. Econometric Analysis, 5th ed. Englewood: Prentice Hall.

  • Guenther, W.C. 1964. Analysis of Variance. Englewood Cliffs: Prentice-Hall.

  • Gujarati, D. 2003. Basic Econometrics, 4th ed. New York: McGraw-Hill/Irwin.

  • Marascuilo, L. 1964. Large-sample multiple comparisons with a control. Biometrics 20: 482–491.

  • Marascuilo, L., and M. McSweeney. 1967. Nonparametric post hoc comparisons for trend. Psychological Bulletin 67 (6): 401.

  • McGrath, J.J. 2007. The Other End of the Spear: The Tooth-to-Tail Ratio (T3R) in Modern Military Operations. The Long War Series Occasional Paper 23. Fort Leavenworth, Kansas: Combat Studies Institute Press.

  • Moore, D.S., and W.I. Notz. 2014. Statistics: Concepts and Controversies, 8th ed. New York: W.H. Freeman & Company.

  • Paczkowski, W.R. 2016. Market Data Analysis Using JMP. New York: SAS Press.

  • Paczkowski, W.R. 2018. Pricing Analytics: Models and Advanced Quantitative Techniques for Product Pricing. Milton Park: Routledge.

  • Paczkowski, W.R. 2020. Deep Data Analytics for New Product Development. Milton Park: Routledge.

  • Paczkowski, W.R. 2022. Business Analytics: Data Science for Business Problems. Berlin: Springer.

  • Robbins, N.B. 2010. Trellis display. WIREs Computational Statistics 2: 600–605.

  • Rosen, B.L., and A.L. DeMaria. 2012. Statistical significance vs. practical significance: An exploration through health education. American Journal of Health Education 43 (4): 235–241.

  • Weiss, N.A. 2005. Introductory Statistics, 7th ed. Boston: Pearson Education.

  • Ziliak, S.T., and D.N. McCloskey. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition, and Society), 1st ed. Ann Arbor: University of Michigan Press.


Appendix

This appendix provides brief overviews of, and perhaps refresher material on, some key statistical concepts.

4.1.1 Refresher on Expected Values

The expected value of a random variable is just a weighted average of the values of the random variable. The weights are the probabilities of seeing a particular value of the random variable. Although the averaging is written differently depending on whether the random variable is discrete or continuous, the interpretation is the same in either case. If Y is a discrete random variable, then

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(Y) = \sum_{i} y_i \times p(y_i) \end{array} \end{aligned} $$

where p(y) = Pr(Y = y), the probability that Y = y, is a probability function such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} 0 \leq p(y) \leq 1 \\ \sum p(y) = 1 \end{array} \end{aligned} $$

So E(Y) is just a weighted average. This is the expected value of Y.

If Y is a continuous random variable, then

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(Y) = \int_{-\infty}^{+\infty} y f(y) dy \end{array} \end{aligned} $$

where

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_{-\infty}^{+\infty} f(y) \, dy = 1 \end{array} \end{aligned} $$

The function f(y) is the probability density function of Y at y. It is not the probability that Y = y, which is zero.
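
As a quick numerical illustration, here is a minimal sketch, assuming NumPy and SciPy are available, that computes E(Y) for a fair six-sided die (discrete case) and recovers the mean of a hypothetical N(2, 1) density by numerically integrating y × f(y) (continuous case):

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    # Discrete case: a fair six-sided die.
    # E(Y) = sum of y * p(y) over all possible values.
    y = np.arange(1, 7)
    p = np.full(6, 1 / 6)
    print(y @ p)  # 3.5

    # Continuous case: E(Y) = integral of y * f(y) dy.
    # For a N(2, 1) density, the integral returns mu = 2.
    ev, _ = quad(lambda v: v * stats.norm.pdf(v, loc=2, scale=1), -np.inf, np.inf)
    print(round(ev, 6))  # 2.0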

It is easy to show the following:

  1. \(E(aX) = a \times E(X)\), where a is a constant.

  2. \(E(aX + b) = a \times E(X) + b\), where a and b are constants.

  3. \(V(aX) = a^2 \times V(X)\), where \(V(\cdot)\) is the variance, defined as \(V(X) = E[X - E(X)]^2\).

  4. \(V(aX + b) = a^2 \times V(X)\).

It is also easy to show, although I will not do it here, that the expected value of a linear function of random variables is linear. That is, for two random variables, X and Y, and constants \(c_1\) and \(c_2\),

$$\displaystyle \begin{aligned} \begin{array}{rcl} E( c_1 \times X + c_2 \times Y) = c_1 \times E(X) + c_2 \times E(Y) \end{array} \end{aligned} $$

You can also show that \(V(Y_1 \pm Y_2) = V(Y_1) + V(Y_2)\) if \(Y_1\) and \(Y_2\) are independent. If they are not independent, then \(V(Y_1 \pm Y_2) = V(Y_1) + V(Y_2) \pm 2 \times COV(Y_1, Y_2)\), where \(COV(Y_1, Y_2)\) is the covariance between the two random variables. For a random sample, \(Y_1\) and \(Y_2\) are independent.
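
These properties are easy to verify by simulation. Here is a minimal sketch, assuming NumPy; the means, variances, and constants are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000
    x = rng.normal(5, 2, n)  # X with E(X) = 5, V(X) = 4
    y = rng.normal(3, 1, n)  # Y independent of X

    a, b = 2.0, 7.0
    # E(aX + b) = a * E(X) + b and V(aX + b) = a^2 * V(X)
    print(np.mean(a * x + b), a * 5 + b)  # both near 17
    print(np.var(a * x + b), a**2 * 4)    # both near 16

    # E(c1*X + c2*Y) = c1*E(X) + c2*E(Y)
    print(np.mean(x + y), 5 + 3)          # both near 8
    # V(X - Y) = V(X) + V(Y) for independent X and Y
    print(np.var(x - y), 4 + 1)           # both near 5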

4.1.2 Expected Value and Standard Error of the Mean

You can now show that if \(Y_i\), i = 1, 2, …, n, are independent and identically distributed (commonly abbreviated as iid) random variables with mean \(E(Y) = \mu\) and variance \(V(Y) = E(Y - \mu)^2 = \sigma^2\), then

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(\bar{Y}) & =&\displaystyle \dfrac{1}{n} \times \sum_{i = 1}^n E(Y_i) \\ & =&\displaystyle \dfrac{1}{n} \times \sum_{i = 1}^n \mu \\ & =&\displaystyle \dfrac{1}{n} \times n \times \mu \\ & =&\displaystyle \mu \end{array} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} V(\bar{Y}) & =&\displaystyle \dfrac{1}{n^2} \times \sum_{i = 1}^n V(Y_i) \\ & =&\displaystyle \dfrac{1}{n^2} \times \sum_{i = 1}^n \sigma^2 \\ & =&\displaystyle \dfrac{1}{n^2} \times n \times \sigma^2 \\ & =&\displaystyle \dfrac{\sigma^2}{n}. \end{array} \end{aligned} $$

This last result can be extended. Suppose you have samples of sizes \(n_1\) and \(n_2\) on two independent random variables, \(Y_1 \sim \mathcal {N}(\mu _1, \sigma _1^2)\) and \(Y_2 \sim \mathcal {N}(\mu _2, \sigma _2^2)\). Then

$$\displaystyle \begin{aligned} \begin{array}{rcl} V(\bar{Y_1} + \bar{Y_2}) & =&\displaystyle \dfrac{1}{n_1^2} \times \sum_{i = 1}^{n_1} V(Y_{i1}) + \dfrac{1}{n_2^2} \times \sum_{i = 1}^{n_2} V(Y_{i2}) \\ & =&\displaystyle \dfrac{1}{n_1^2} \times \sum_{i = 1}^{n_1} \sigma_1^2 + \dfrac{1}{n_2^2} \times \sum_{i = 1}^{n_2} \sigma_2^2 \\ & =&\displaystyle \dfrac{1}{n_1^2} \times n_1 \times \sigma_1^2 + \dfrac{1}{n_2^2} \times n_2 \times \sigma_2^2\\ & =&\displaystyle \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}. \end{array} \end{aligned} $$

This last result is used when two means are compared.
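
This is the variance behind the standard error in a two-sample comparison of means. A minimal sketch, assuming NumPy, SciPy, and hypothetical group data, showing that Welch's t-statistic uses exactly \(\sqrt{s_1^2/n_1 + s_2^2/n_2}\) in its denominator:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y1 = rng.normal(10.0, 2.0, 50)  # hypothetical group 1 responses
    y2 = rng.normal(10.5, 3.0, 60)  # hypothetical group 2 responses

    # Standard error of (ybar1 - ybar2): sqrt(s1^2/n1 + s2^2/n2).
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y2.var(ddof=1) / len(y2))
    t_stat = (y1.mean() - y2.mean()) / se

    # Welch's t-test uses this same standard error.
    print(t_stat)
    print(stats.ttest_ind(y1, y2, equal_var=False).statistic)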

4.1.3 Deviations from the Mean

Two very important results about means are:

  1. \(\sum _{i = 1}^n (Y_i - \bar {Y}) = 0\).

  2. \(E(\bar {Y} - \mu ) = 0\).

The first is for a sample; the second is for a population. Regardless, both imply that a function of deviations from the mean is zero. To show the first, simply note that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{i = 1}^n (Y_i - \bar{Y}) & =&\displaystyle \sum_{i = 1}^n Y_i - n \times \bar{Y} \\ & =&\displaystyle \sum_{i = 1}^n Y_i - n \times \dfrac{1}{n} \sum_{i = 1}^n Y_i \\ & =&\displaystyle \sum_{i = 1}^n Y_i - \sum_{i = 1}^n Y_i \\ & =&\displaystyle 0. \end{array} \end{aligned} $$

The second uses the result I showed above, that \(E(\bar {Y}) = \mu \):

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(\bar{Y} - \mu) & =&\displaystyle E\left( \dfrac{1}{n} \sum_{i = 1}^n Y_i\right) - \mu \\ & =&\displaystyle \dfrac{1}{n} \times \sum_{i = 1}^n E(Y_i) - \mu \\ & =&\displaystyle \dfrac{1}{n} \times n \times \mu - \mu \\ & =&\displaystyle 0. \end{array} \end{aligned} $$
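
A one-line numerical check of the first result, assuming NumPy and arbitrary simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(50, 10, 1000)
    # Deviations from the sample mean sum to zero
    # (up to floating-point rounding).
    print(np.sum(y - y.mean()))  # approximately 0.0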

4.1.4 Some Relationships Among Probability Distributions

There are several distributions that are often used in survey analyses:

  1. Normal (or Gaussian) distribution

  2. \(\chi^2\) distribution

  3. Student’s t-distribution

  4. F-distribution

These are applicable for continuous random variables. They are all closely related, as you will see.

4.1.4.1 Normal Distribution

The normal distribution is the basic one; the other distributions are built from it. The normal’s probability density function (pdf) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y - \mu)^2}{2\sigma^2}} \end{array} \end{aligned} $$

where \(\mu\) and \(\sigma^2\) are two population parameters. A succinct notation is \(Y \sim \mathcal {N}(\mu , \sigma ^2)\). This distribution has several important properties:

  1. All normal distributions are symmetric about the mean \(\mu\).

  2. The area under an entire normal curve traced by the pdf formula is 1.0.

  3. The height (i.e., density) of a normal curve is positive for all y. That is, \(f(y) > 0 \; \forall y\).

  4. The limit of f(y) as y goes to positive infinity is 0, and the limit of f(y) as y goes to negative infinity is 0. That is,

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \lim\limits_{y \to \infty} f(y) = 0 ~\text{and}~ \lim\limits_{y \to -\infty} f(y) = 0 \end{array} \end{aligned} $$

  5. The height of any normal curve is maximized at \(y = \mu\).

  6. The placement and shape of a normal curve depend on its mean \(\mu\) and standard deviation \(\sigma\), respectively.

  7. A linear combination of normal random variables is normally distributed. This is the Reproductive Property of Normals.

A standardized normal random variable is \(Z = (Y - \mu)/\sigma\) for \(Y \sim \mathcal {N}(\mu , \sigma ^2)\). This can be rewritten as a linear function: \(Z = 1/\sigma \times Y - \mu/\sigma\). Therefore, Z is normally distributed by the Reproductive Property of Normals. Also, \(E(Z) = E(Y)/\sigma - \mu/\sigma = 0\) and \(V(Z) = 1/\sigma^2 \times \sigma^2 = 1\). So, \(Z \sim \mathcal {N}(0, 1)\).

I show a graph of the standardized normal in Fig. 4.43.

Fig. 4.43 This is the standardized normal pdf
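
A figure like Fig. 4.43 can be reproduced in a few lines. This is a minimal sketch assuming Matplotlib and SciPy, not the book's own plotting code:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    z = np.linspace(-4, 4, 400)
    plt.plot(z, stats.norm.pdf(z))  # N(0, 1) density
    plt.title("Standardized Normal pdf")
    plt.xlabel("z")
    plt.ylabel("f(z)")
    plt.show()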

4.1.4.2 Chi-Square Distribution

If \(Z \sim \mathcal {N}(0, 1)\), then \(Z^2 \sim \chi ^2_1\), where the subscript “1” denotes one degree-of-freedom. Some properties of the \(\chi^2\) distribution are:

  1. The sum of n independent \(\chi^2_1\) random variables is \(\chi^2\) with n degrees-of-freedom: \(\sum _n Z_i^2 \sim \chi ^2_n\).

  2. The mean of the \(\chi ^2_n\) is n, and the variance is 2n.

  3. The \(\chi ^2_n\) approaches the normal distribution as \(n \to \infty\).

I show a graph of the \(\chi^2\) pdf for 5 degrees-of-freedom in Fig. 4.44.

Fig. 4.44 This is the \(\chi^2\) pdf for 5 degrees-of-freedom. The shape changes as the degrees-of-freedom change
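
The construction from squared standard normals is easy to demonstrate by simulation. A minimal sketch, assuming NumPy and SciPy, that checks the mean, variance, and a quantile of simulated sums of five squared normals against the theoretical \(\chi^2_5\):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps = 5, 100_000

    # Sum of n squared N(0, 1) draws for each replication.
    chi2_draws = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)

    print(chi2_draws.mean(), n)      # mean near n = 5
    print(chi2_draws.var(), 2 * n)   # variance near 2n = 10

    # Simulated 95th percentile vs. the theoretical chi-square(5) value.
    print(np.quantile(chi2_draws, 0.95), stats.chi2.ppf(0.95, df=n))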

4.1.4.3 Student’s t-Distribution

The ratio of two random variables, where the numerator is \(\mathcal {N}(0, 1)\) and the denominator is the square root of a \(\chi^2\) random variable with \(\nu\) degrees-of-freedom divided by its degrees-of-freedom, follows a t-distribution with \(\nu\) degrees-of-freedom:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{Z}{\sqrt{\frac{\chi^2_\nu}{\nu}}} \sim t_\nu \end{array} \end{aligned} $$

I show a graph of the Student’s t pdf for 23 degrees-of-freedom in Fig. 4.45.

Fig. 4.45 This is the Student’s t pdf for 23 degrees-of-freedom. The shape changes as the degrees-of-freedom change
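
The ratio construction can be simulated directly and compared with the theoretical t distribution. A minimal sketch, assuming NumPy and SciPy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    nu, reps = 23, 100_000

    z = rng.standard_normal(reps)
    chi2 = rng.chisquare(nu, reps)
    t_draws = z / np.sqrt(chi2 / nu)  # Z / sqrt(chi2_nu / nu)

    # Simulated tail quantile vs. the theoretical t(23) quantile.
    print(np.quantile(t_draws, 0.975), stats.t.ppf(0.975, df=nu))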

4.1.4.4 F-Distribution

The ratio of a \(\chi^2\) random variable with \(\nu_1\) degrees-of-freedom to a \(\chi^2\) random variable with \(\nu_2\) degrees-of-freedom, each divided by its degrees-of-freedom, follows an F-distribution with \(\nu_1\) and \(\nu_2\) degrees-of-freedom:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2} \sim F_{\nu_1 \text{, } \nu_2} \end{array} \end{aligned} $$

Note that \(F_{1, \nu _2} = t_{\nu_2}^2\). You can see this from the definition of a t with \(\nu\) degrees-of-freedom, \(\frac {Z}{\sqrt {\frac {\chi ^2_\nu }{\nu }}} \sim t_\nu \): squaring it puts a \(\chi^2_1\) (from \(Z^2\)) in the numerator and a \(\chi^2_\nu/\nu\) in the denominator, which is the F ratio with one numerator degree-of-freedom. I show a graph of the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator in Fig. 4.46.

Fig. 4.46 This is the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator. The shape changes as the degrees-of-freedom change
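
The \(F_{1, \nu} = t_\nu^2\) relationship can be confirmed from theoretical quantiles alone. A minimal sketch, assuming SciPy:

    from scipy import stats

    nu = 15
    # The upper 5% point of F(1, nu) equals the square of the
    # two-sided 5% critical value of t(nu).
    f_crit = stats.f.ppf(0.95, dfn=1, dfd=nu)
    t_crit = stats.t.ppf(0.975, df=nu)
    print(f_crit, t_crit ** 2)  # both approximately 4.543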

4.1.5 Equivalence of the F and t Tests for Two Populations

Guenther (1964, p. 46) shows that when there are two independent populations, the F-test and the t-test are related. In particular, he shows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} F_{1, n_1 + n_2 - 2} = t_{n_1 + n_2 - 2}^2. \end{array} \end{aligned} $$
(4.42)

Part of this demonstration is showing that the denominator of the F-statistic is

$$\displaystyle \begin{aligned} \begin{array}{rcl} \dfrac{SSW}{n_1 + n_2 - 2} & =&\displaystyle \left(\dfrac{(n_1 - 1) \times s_1^2 + (n_2 - 1) \times s_2^2}{n_1 + n_2 - 2}\right) \times \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right) \end{array} \end{aligned} $$
(4.43)

which is the result I stated in (4.11) and (4.12).
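
This equivalence shows up directly in software output: for two groups, the one-way ANOVA F-statistic equals the square of the pooled two-sample t-statistic, and the p-values are identical. A minimal sketch, assuming SciPy and hypothetical data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    g1 = rng.normal(10.0, 2.0, 40)  # hypothetical group 1
    g2 = rng.normal(11.0, 2.0, 55)  # hypothetical group 2

    t_res = stats.ttest_ind(g1, g2, equal_var=True)  # pooled t-test
    f_res = stats.f_oneway(g1, g2)                   # one-way ANOVA

    print(t_res.statistic ** 2, f_res.statistic)  # equal up to rounding
    print(t_res.pvalue, f_res.pvalue)             # identical p-values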

4.1.6 Code for Fig. 4.37

The code to generate the 3-D bar chart in Fig. 4.37 is shown here in Fig. 4.47. This is longer than previous code because more steps are involved. In particular, you have to define the plotting coordinates for each bar in addition to the height of each bar. The coordinates are the X and Y plotting positions and the base of the Z-dimension in the X-Y plane. This base is just 0 for each bar. Not only are these coordinates needed, but the widths of the bars are also needed; these are indicated in the code as dx and dy. The height of each bar is the Z position in an X-Y-Z three-dimensional plot, indicated by dz. The data for the graph come from the pivot table in Fig. 4.35. I did some processing of these data, which I also show in Fig. 4.47. Notice, incidentally, that I make extensive use of list comprehensions to simplify list creation.

Fig. 4.47 This is the Python code I used to create Fig. 4.37
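
The actual listing in Fig. 4.47 is not reproduced on this page. The following is a minimal sketch of the general approach the text describes, assuming Matplotlib's bar3d and a small hypothetical table standing in for the pivot table of Fig. 4.35; dx, dy, and dz follow the names mentioned in the text, while the data and the xpos, ypos, and zpos names are my own illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical counts standing in for the pivot table in Fig. 4.35.
    data = np.array([[12, 30, 7],
                     [25, 18, 9]])
    n_rows, n_cols = data.shape

    # X-Y plotting position of each bar, built with list comprehensions.
    xpos = [c for _ in range(n_rows) for c in range(n_cols)]
    ypos = [r for r in range(n_rows) for _ in range(n_cols)]
    zpos = [0] * (n_rows * n_cols)       # every bar starts at z = 0

    dx = dy = [0.5] * (n_rows * n_cols)  # bar widths in X and Y
    dz = data.ravel()                    # bar heights

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.bar3d(xpos, ypos, zpos, dx, dy, dz)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Count")
    plt.show()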

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Paczkowski, W.R. (2022). Beginning Deep Survey Analysis. In: Modern Survey Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-76267-4_4