
Beginning Deep Survey Analysis


Abstract

I divided the analysis of survey data into Shallow Analysis and Deep Analysis. The former just skims the surface of all the data collected from a survey, highlighting only a minimum of findings with the simplest analysis tools. These tools, useful and informative in their own right, are only the first that should be used, not the only ones. They help you dig out some findings but leave much buried. I covered them and their use in Python in the previous chapter.


Notes

  1. Based on: https://stats.stackexchange.com/questions/267192/doubling-or-halving-p-values-for-one-vs-two-tailed-tests/267197. Last accessed July 29, 2020.

  2. As a personal anecdote, I once had a client who wanted to know if a difference of one cent in the prices of two products was significant—the products were selling for about $10 each.

  3. To be “stat tested,” as many like to say.

  4. I once did some survey analysis work for a large food manufacturing company (to remain nameless) that used α = 0.20.

  5. Economists refer to this as a perfectly competitive market. All firms in such a market are price takers, meaning they have no influence on the market price. Therefore, there is only one market price.

  6. It is easy to show that for \(\bar{X} = \frac{1}{n} \sum X_i\), \(\sum (X_i - \bar{X}) = 0\).

  7. It can actually be any level. As you will see, however, the first level is dropped by statsmodels.

  8. The cross-product term cancels after summing terms.

  9. Of course, the military branch does not determine your age. The main point is that the age distribution varies by branch.

  10. \(\binom {7}{2} = \dfrac {7!}{2! \times 5!} = 21\).

  11. QA7: “Did you ever serve in a combat or war zone?” There is a clarifying statement: “Persons serving in a combat or war zone usually receive combat zone tax exclusion, imminent danger pay, or hostile fire pay.”

  12. Source: https://en.wikipedia.org/wiki/Tooth-to-tail_ratio. Last accessed September 24, 2020.

  13. In the Design of Experiments literature, a treatment is an experimental condition placed on an object that will be measured. The measure is the effect of that treatment. The objects may be grouped into blocks designed to be homogeneous to remove any nuisance factors that might influence the responses to the treatments. Only the effect of the treatments is desired. In the survey context, the treatments are the CATA questions, and the blocks are the respondents themselves. See Box et al. (1978) for a discussion of experimental designs.

  14. See the article “Cochran’s Q test” at https://en.wikipedia.org/wiki/Cochran%27s_Q_test. Last accessed September 30, 2020.

  15. See https://docs.python.org/3/library/itertools.html#itertools.count. Last accessed October 1, 2020.

  16. The original data had “Yes” = 1, “No” = 2, and “Don’t Know” = 3.

  17. “In mathematics, specifically set theory, the Cartesian product of two sets A and B, denoted A × B, is the set of all ordered pairs (a, b) where a is in A and b is in B.” Source: Wikipedia article “Cartesian product”: https://en.wikipedia.org/wiki/Cartesian_product. Last accessed on October 2, 2020. For this problem, the collection of branches is one set, and the collection of genders is another.

  18. See, for example, comments by N. Robbins at https://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/?sh=21fd52f71c7f. Last accessed December 20, 2020. Also see Few (2008).

References

  • Agresti, A. 2002. Categorical Data Analysis, 2nd ed. New York: Wiley.

  • Box, G., W. Hunter, and J. Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: Wiley.

  • Cox, D. 2020. Statistical significance. Annual Review of Statistics and Its Application 1: 1–10.

  • Daniel, W.W. 1977. Statistical significance versus practical significance. Science Education 61 (3): 423–427.

  • Dudewicz, E.J., and S.N. Mishra. 1988. Modern Mathematical Statistics. New York: Wiley.

  • Ellis, S., and H. Steyn. 2003. Practical significance (effect sizes) versus or in combination with statistical significance (p-values). Management Dynamics 12 (4): 51–53.

  • Few, S. 2008. Practical rules for using color in charts. Perceptual Edge, Visual Business Intelligence Newsletter.

  • Greene, W.H. 2003. Econometric Analysis, 5th ed. Englewood: Prentice Hall.

  • Guenther, W.C. 1964. Analysis of Variance. Englewood Cliffs: Prentice-Hall.

  • Gujarati, D. 2003. Basic Econometrics, 4th ed. New York: McGraw-Hill/Irwin.

  • Marascuilo, L. 1964. Large-sample multiple comparisons with a control. Biometrics 20: 482–491.

  • Marascuilo, L., and M. McSweeney. 1967. Nonparametric post hoc comparisons for trend. Psychological Bulletin 67 (6): 401.

  • McGrath, J.J. 2007. The Other End of the Spear: The Tooth-to-Tail Ratio (T3R) in Modern Military Operations. The Long War Series Occasional Paper 23. Fort Leavenworth, Kansas: Combat Studies Institute Press.

  • Moore, D.S., and W.I. Notz. 2014. Statistics: Concepts and Controversies, 8th ed. New York: W.H. Freeman & Company.

  • Paczkowski, W.R. 2016. Market Data Analysis Using JMP. New York: SAS Press.

  • Paczkowski, W.R. 2018. Pricing Analytics: Models and Advanced Quantitative Techniques for Product Pricing. Milton Park: Routledge.

  • Paczkowski, W.R. 2020. Deep Data Analytics for New Product Development. Milton Park: Routledge.

  • Paczkowski, W.R. 2022. Business Analytics: Data Science for Business Problems. Berlin: Springer.

  • Robbins, N.B. 2010. Trellis display. WIREs Computational Statistics 2: 600–605.

  • Rosen, B.L., and A.L. DeMaria. 2012. Statistical significance vs. practical significance: An exploration through health education. American Journal of Health Education 43 (4): 235–241.

  • Weiss, N.A. 2005. Introductory Statistics, 7th ed. Boston: Pearson Education.

  • Ziliak, S.T., and D.N. McCloskey. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition, and Society), 1st ed. Ann Arbor: University of Michigan Press.


Appendix

This appendix provides brief overviews of, and perhaps refresher material on, some key statistical concepts.

4.1.1 Refresher on Expected Values

The expected value of a random variable is just a weighted average of the values of the random variable. The weights are the probabilities of seeing a particular value of the random variable. Although the averaging is written differently depending on whether the random variable is discrete or continuous, the interpretation is the same in either case. If Y is a discrete random variable, then

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(Y) = \sum_{i} y_i \times p(y_i) \end{array} \end{aligned} $$

where p(y) = Pr(Y = y), the probability that Y = y, is a probability function such that

$$\displaystyle \begin{aligned} \begin{array}{rcl} 0 \leq p(y) \leq 1 \\ \sum p(y) = 1 \end{array} \end{aligned} $$

So E(Y) is just a weighted average. This is the expected value of Y.

If Y is a continuous random variable, then

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(Y) = \int_{-\infty}^{+\infty} y f(y) dy \end{array} \end{aligned} $$

where

$$\displaystyle \begin{aligned} \begin{array}{rcl} \int_{-\infty}^{+\infty} f(y) \, dy = 1 \end{array} \end{aligned} $$

The function f(y) is the probability density function of Y at y. It is not the probability that Y = y, which is zero.
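
As a quick numerical illustration, here is a minimal sketch, assuming NumPy and SciPy are available, that computes E(Y) for a fair six-sided die (discrete case) and recovers the mean of a hypothetical N(2, 1) density by numerically integrating y × f(y) (continuous case):

    import numpy as np
    from scipy import stats
    from scipy.integrate import quad

    # Discrete case: a fair six-sided die.
    # E(Y) = sum of y * p(y) over all possible values.
    y = np.arange(1, 7)
    p = np.full(6, 1 / 6)
    print(y @ p)  # 3.5

    # Continuous case: E(Y) = integral of y * f(y) dy.
    # For a N(2, 1) density, the integral returns mu = 2.
    ev, _ = quad(lambda v: v * stats.norm.pdf(v, loc=2, scale=1), -np.inf, np.inf)
    print(round(ev, 6))  # 2.0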

It is easy to show the following:

  1. \(E(aX) = a \times E(X)\), where a is a constant.

  2. \(E(aX + b) = a \times E(X) + b\), where a and b are constants.

  3. \(V(aX) = a^2 \times V(X)\), where \(V(\cdot)\) is the variance, defined as \(V(X) = E[X - E(X)]^2\).

  4. \(V(aX + b) = a^2 \times V(X)\).

It is also easy to show, although I will not do it here, that the expected value of a linear function of random variables is linear. That is, for two random variables, X and Y, and constants \(c_1\) and \(c_2\),

$$\displaystyle \begin{aligned} \begin{array}{rcl} E( c_1 \times X + c_2 \times Y) = c_1 \times E(X) + c_2 \times E(Y) \end{array} \end{aligned} $$

You can also show that \(V(Y_1 \pm Y_2) = V(Y_1) + V(Y_2)\) if \(Y_1\) and \(Y_2\) are independent. If they are not independent, then \(V(Y_1 \pm Y_2) = V(Y_1) + V(Y_2) \pm 2 \times COV(Y_1, Y_2)\), where \(COV(Y_1, Y_2)\) is the covariance between the two random variables. For a random sample, \(Y_1\) and \(Y_2\) are independent.
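
These properties are easy to verify by simulation. Here is a minimal sketch, assuming NumPy; the means, variances, and constants are arbitrary illustrative choices:

    import numpy as np

    rng = np.random.default_rng(42)
    n = 1_000_000
    x = rng.normal(5, 2, n)  # X with E(X) = 5, V(X) = 4
    y = rng.normal(3, 1, n)  # Y independent of X

    a, b = 2.0, 7.0
    # E(aX + b) = a * E(X) + b and V(aX + b) = a^2 * V(X)
    print(np.mean(a * x + b), a * 5 + b)  # both near 17
    print(np.var(a * x + b), a**2 * 4)    # both near 16

    # E(c1*X + c2*Y) = c1*E(X) + c2*E(Y)
    print(np.mean(x + y), 5 + 3)          # both near 8
    # V(X - Y) = V(X) + V(Y) for independent X and Y
    print(np.var(x - y), 4 + 1)           # both near 5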

4.1.2 Expected Value and Standard Error of the Mean

You can now show that if \(Y_i\), i = 1, 2, …, n, are independent and identically distributed (commonly abbreviated as iid) random variables with mean \(E(Y) = \mu\) and variance \(V(Y) = E(Y - \mu)^2 = \sigma^2\), then

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(\bar{Y}) & =&\displaystyle \dfrac{1}{n} \times \sum_{i = 1}^n E(Y_i) \\ & =&\displaystyle \dfrac{1}{n} \times \sum_{i = 1}^n \mu \\ & =&\displaystyle \dfrac{1}{n} \times n \times \mu \\ & =&\displaystyle \mu \end{array} \end{aligned} $$

and

$$\displaystyle \begin{aligned} \begin{array}{rcl} V(\bar{Y}) & =&\displaystyle \dfrac{1}{n^2} \times \sum_{i = 1}^n V(Y_i) \\ & =&\displaystyle \dfrac{1}{n^2} \times \sum_{i = 1}^n \sigma^2 \\ & =&\displaystyle \dfrac{1}{n^2} \times n \times \sigma^2 \\ & =&\displaystyle \dfrac{\sigma^2}{n}. \end{array} \end{aligned} $$

This last result can be extended. Suppose you have samples of sizes \(n_1\) and \(n_2\) on two independent random variables, \(Y_1 \sim \mathcal {N}(\mu _1, \sigma _1^2)\) and \(Y_2 \sim \mathcal {N}(\mu _2, \sigma _2^2)\). Then

$$\displaystyle \begin{aligned} \begin{array}{rcl} V(\bar{Y_1} + \bar{Y_2}) & =&\displaystyle \dfrac{1}{n_1^2} \times \sum_{i = 1}^{n_1} V(Y_{i1}) + \dfrac{1}{n_2^2} \times \sum_{i = 1}^{n_2} V(Y_{i2}) \\ & =&\displaystyle \dfrac{1}{n_1^2} \times \sum_{i = 1}^{n_1} \sigma_1^2 + \dfrac{1}{n_2^2} \times \sum_{i = 1}^{n_2} \sigma_2^2 \\ & =&\displaystyle \dfrac{1}{n_1^2} \times n_1 \times \sigma_1^2 + \dfrac{1}{n_2^2} \times n_2 \times \sigma_2^2\\ & =&\displaystyle \dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}. \end{array} \end{aligned} $$

This last result is used when two means are compared.
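
This is the variance behind the standard error in a two-sample comparison of means. A minimal sketch, assuming NumPy, SciPy, and hypothetical group data, showing that Welch's t-statistic uses exactly \(\sqrt{s_1^2/n_1 + s_2^2/n_2}\) in its denominator:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    y1 = rng.normal(10.0, 2.0, 50)  # hypothetical group 1 responses
    y2 = rng.normal(10.5, 3.0, 60)  # hypothetical group 2 responses

    # Standard error of (ybar1 - ybar2): sqrt(s1^2/n1 + s2^2/n2).
    se = np.sqrt(y1.var(ddof=1) / len(y1) + y2.var(ddof=1) / len(y2))
    t_stat = (y1.mean() - y2.mean()) / se

    # Welch's t-test uses this same standard error.
    print(t_stat)
    print(stats.ttest_ind(y1, y2, equal_var=False).statistic)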

4.1.3 Deviations from the Mean

Two very important results about means are:

  1. \(\sum _{i = 1}^n (Y_i - \bar {Y}) = 0\).

  2. \(E(\bar {Y} - \mu ) = 0\).

The first is for a sample; the second is for a population. Regardless, both imply that a function of deviations from the mean is zero. To show the first, simply note that

$$\displaystyle \begin{aligned} \begin{array}{rcl} \sum_{i = 1}^n (Y_i - \bar{Y}) & =&\displaystyle \sum_{i = 1}^n Y_i - n \times \bar{Y} \\ & =&\displaystyle \sum_{i = 1}^n Y_i - n \times \dfrac{1}{n} \sum_{i = 1}^n Y_i \\ & =&\displaystyle \sum_{i = 1}^n Y_i - \sum_{i = 1}^n Y_i \\ & =&\displaystyle 0. \end{array} \end{aligned} $$

The second uses the result I showed above, that \(E(\bar {Y}) = \mu \):

$$\displaystyle \begin{aligned} \begin{array}{rcl} E(\bar{Y} - \mu) & =&\displaystyle E\left( \dfrac{1}{n} \sum_{i = 1}^n Y_i\right) - \mu \\ & =&\displaystyle \dfrac{1}{n} \times \sum_{i = 1}^n E(Y_i) - \mu \\ & =&\displaystyle \dfrac{1}{n} \times n \times \mu - \mu \\ & =&\displaystyle 0. \end{array} \end{aligned} $$
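
A one-line numerical check of the first result, assuming NumPy and arbitrary simulated data:

    import numpy as np

    rng = np.random.default_rng(1)
    y = rng.normal(50, 10, 1000)
    # Deviations from the sample mean sum to zero
    # (up to floating-point rounding).
    print(np.sum(y - y.mean()))  # approximately 0.0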

4.1.4 Some Relationships Among Probability Distributions

There are several distributions that are often used in survey analyses:

  1. Normal (or Gaussian) distribution

  2. \(\chi^2\) distribution

  3. Student’s t-distribution

  4. F-distribution

These are applicable for continuous random variables. They are all closely related, as you will see.

4.1.4.1 Normal Distribution

The normal distribution is the basic one; the other distributions are built from it. The normal’s probability density function (pdf) is

$$\displaystyle \begin{aligned} \begin{array}{rcl} f(y) = \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(y - \mu)^2}{2\sigma^2}} \end{array} \end{aligned} $$

where \(\mu\) and \(\sigma^2\) are two population parameters. A succinct notation is \(Y \sim \mathcal {N}(\mu , \sigma ^2)\). This distribution has several important properties:

  1. All normal distributions are symmetric about the mean \(\mu\).

  2. The area under an entire normal curve traced by the pdf formula is 1.0.

  3. The height (i.e., density) of a normal curve is positive for all y. That is, \(f(y) > 0 \; \forall y\).

  4. The limit of f(y) as y goes to positive infinity is 0, and the limit of f(y) as y goes to negative infinity is 0. That is,

    $$\displaystyle \begin{aligned} \begin{array}{rcl} \lim\limits_{y \to \infty} f(y) = 0 ~\text{and}~ \lim\limits_{y \to -\infty} f(y) = 0 \end{array} \end{aligned} $$

  5. The height of any normal curve is maximized at \(y = \mu\).

  6. The placement and shape of a normal curve depend on its mean \(\mu\) and standard deviation \(\sigma\), respectively.

  7. A linear combination of normal random variables is normally distributed. This is the Reproductive Property of Normals.

A standardized normal random variable is \(Z = (Y - \mu)/\sigma\) for \(Y \sim \mathcal {N}(\mu , \sigma ^2)\). This can be rewritten as a linear function: \(Z = 1/\sigma \times Y - \mu/\sigma\). Therefore, Z is normally distributed by the Reproductive Property of Normals. Also, \(E(Z) = E(Y)/\sigma - \mu/\sigma = 0\) and \(V(Z) = 1/\sigma^2 \times \sigma^2 = 1\). So, \(Z \sim \mathcal {N}(0, 1)\).

I show a graph of the standardized normal in Fig. 4.43.

Fig. 4.43 This is the standardized normal pdf
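
A figure like Fig. 4.43 can be reproduced in a few lines. This is a minimal sketch assuming Matplotlib and SciPy, not the book's own plotting code:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import stats

    z = np.linspace(-4, 4, 400)
    plt.plot(z, stats.norm.pdf(z))  # N(0, 1) density
    plt.title("Standardized Normal pdf")
    plt.xlabel("z")
    plt.ylabel("f(z)")
    plt.show()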

4.1.4.2 Chi-Square Distribution

If \(Z \sim \mathcal {N}(0, 1)\), then \(Z^2 \sim \chi ^2_1\), where the subscript “1” denotes one degree-of-freedom. Some properties of the \(\chi^2\) distribution are:

  1. The sum of n independent \(\chi^2_1\) random variables is \(\chi^2\) with n degrees-of-freedom: \(\sum _n Z_i^2 \sim \chi ^2_n\).

  2. The mean of the \(\chi ^2_n\) is n, and the variance is 2n.

  3. The \(\chi ^2_n\) approaches the normal distribution as \(n \to \infty\).

I show a graph of the \(\chi^2\) pdf for 5 degrees-of-freedom in Fig. 4.44.

Fig. 4.44 This is the \(\chi^2\) pdf for 5 degrees-of-freedom. The shape changes as the degrees-of-freedom change
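
The construction from squared standard normals is easy to demonstrate by simulation. A minimal sketch, assuming NumPy and SciPy, that checks the mean, variance, and a quantile of simulated sums of five squared normals against the theoretical \(\chi^2_5\):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    n, reps = 5, 100_000

    # Sum of n squared N(0, 1) draws for each replication.
    chi2_draws = (rng.standard_normal((reps, n)) ** 2).sum(axis=1)

    print(chi2_draws.mean(), n)      # mean near n = 5
    print(chi2_draws.var(), 2 * n)   # variance near 2n = 10

    # Simulated 95th percentile vs. the theoretical chi-square(5) value.
    print(np.quantile(chi2_draws, 0.95), stats.chi2.ppf(0.95, df=n))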

4.1.4.3 Student’s t-Distribution

The ratio of two random variables, where the numerator is \(\mathcal {N}(0, 1)\) and the denominator is the square root of a \(\chi^2\) random variable with \(\nu\) degrees-of-freedom divided by its degrees-of-freedom, follows a t-distribution with \(\nu\) degrees-of-freedom:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{Z}{\sqrt{\frac{\chi^2_\nu}{\nu}}} \sim t_\nu \end{array} \end{aligned} $$

I show a graph of the Student’s t pdf for 23 degrees-of-freedom in Fig. 4.45.

Fig. 4.45 This is the Student’s t pdf for 23 degrees-of-freedom. The shape changes as the degrees-of-freedom change
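
The ratio construction can be simulated directly and compared with the theoretical t distribution. A minimal sketch, assuming NumPy and SciPy:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    nu, reps = 23, 100_000

    z = rng.standard_normal(reps)
    chi2 = rng.chisquare(nu, reps)
    t_draws = z / np.sqrt(chi2 / nu)  # Z / sqrt(chi2_nu / nu)

    # Simulated tail quantile vs. the theoretical t(23) quantile.
    print(np.quantile(t_draws, 0.975), stats.t.ppf(0.975, df=nu))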

4.1.4.4 F-Distribution

The ratio of a \(\chi^2\) random variable with \(\nu_1\) degrees-of-freedom to a \(\chi^2\) random variable with \(\nu_2\) degrees-of-freedom, each divided by its degrees-of-freedom, follows an F-distribution with \(\nu_1\) and \(\nu_2\) degrees-of-freedom:

$$\displaystyle \begin{aligned} \begin{array}{rcl} \frac{\chi^2_{\nu_1}/\nu_1}{\chi^2_{\nu_2}/\nu_2} \sim F_{\nu_1 \text{, } \nu_2} \end{array} \end{aligned} $$

Note that \(F_{1, \nu _2} = t_{\nu_2}^2\). You can see this from the definition of a t with \(\nu\) degrees-of-freedom, \(\frac {Z}{\sqrt {\frac {\chi ^2_\nu }{\nu }}} \sim t_\nu \): squaring it puts a \(\chi^2_1\) (from \(Z^2\)) in the numerator and a \(\chi^2_\nu/\nu\) in the denominator, which is the F ratio with one numerator degree-of-freedom. I show a graph of the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator in Fig. 4.46.

Fig. 4.46 This is the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator. The shape changes as the degrees-of-freedom change
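
The \(F_{1, \nu} = t_\nu^2\) relationship can be confirmed from theoretical quantiles alone. A minimal sketch, assuming SciPy:

    from scipy import stats

    nu = 15
    # The upper 5% point of F(1, nu) equals the square of the
    # two-sided 5% critical value of t(nu).
    f_crit = stats.f.ppf(0.95, dfn=1, dfd=nu)
    t_crit = stats.t.ppf(0.975, df=nu)
    print(f_crit, t_crit ** 2)  # both approximately 4.543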

4.1.5 Equivalence of the F and t Tests for Two Populations

Guenther (1964, p. 46) shows that when there are two independent populations, the F-test and the t-test are related. In particular, he shows that

$$\displaystyle \begin{aligned} \begin{array}{rcl} F_{1, n_1 + n_2 - 2} = t_{n_1 + n_2 - 2}^2. \end{array} \end{aligned} $$
(4.42)

Part of this demonstration is showing that the denominator of the F-statistic is

$$\displaystyle \begin{aligned} \begin{array}{rcl} \dfrac{SSW}{n_1 + n_2 - 2} & =&\displaystyle \left(\dfrac{(n_1 - 1) \times s_1^2 + (n_2 - 1) \times s_2^2}{n_1 + n_2 - 2}\right) \times \left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right) \end{array} \end{aligned} $$
(4.43)

which is the result I stated in (4.11) and (4.12).
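
This equivalence shows up directly in software output: for two groups, the one-way ANOVA F-statistic equals the square of the pooled two-sample t-statistic, and the p-values are identical. A minimal sketch, assuming SciPy and hypothetical data:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(4)
    g1 = rng.normal(10.0, 2.0, 40)  # hypothetical group 1
    g2 = rng.normal(11.0, 2.0, 55)  # hypothetical group 2

    t_res = stats.ttest_ind(g1, g2, equal_var=True)  # pooled t-test
    f_res = stats.f_oneway(g1, g2)                   # one-way ANOVA

    print(t_res.statistic ** 2, f_res.statistic)  # equal up to rounding
    print(t_res.pvalue, f_res.pvalue)             # identical p-values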

4.1.6 Code for Fig. 4.37

The code to generate the 3-D bar chart in Fig. 4.37 is shown here in Fig. 4.47. This is longer than previous code because more steps are involved. In particular, you have to define the plotting coordinates for each bar in addition to the height of each bar. The coordinates are the X and Y plotting positions and the base of the Z-dimension in the X-Y plane. This base is just 0 for each bar. Not only are these coordinates needed, but the widths of the bars are also needed; these are indicated in the code as dx and dy. The height of each bar is the Z position in an X-Y-Z three-dimensional plot, indicated by dz. The data for the graph come from the pivot table in Fig. 4.35. I did some processing of these data, which I also show in Fig. 4.47. Notice, incidentally, that I make extensive use of list comprehensions to simplify list creation.

Fig. 4.47 This is the Python code I used to create Fig. 4.37
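
The actual listing in Fig. 4.47 is not reproduced on this page. The following is a minimal sketch of the general approach the text describes, assuming Matplotlib's bar3d and a small hypothetical table standing in for the pivot table of Fig. 4.35; dx, dy, and dz follow the names mentioned in the text, while the data and the xpos, ypos, and zpos names are my own illustrative choices:

    import numpy as np
    import matplotlib.pyplot as plt

    # Hypothetical counts standing in for the pivot table in Fig. 4.35.
    data = np.array([[12, 30, 7],
                     [25, 18, 9]])
    n_rows, n_cols = data.shape

    # X-Y plotting position of each bar, built with list comprehensions.
    xpos = [c for _ in range(n_rows) for c in range(n_cols)]
    ypos = [r for r in range(n_rows) for _ in range(n_cols)]
    zpos = [0] * (n_rows * n_cols)       # every bar starts at z = 0

    dx = dy = [0.5] * (n_rows * n_cols)  # bar widths in X and Y
    dz = data.ravel()                    # bar heights

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.bar3d(xpos, ypos, zpos, dx, dy, dz)
    ax.set_xlabel("X")
    ax.set_ylabel("Y")
    ax.set_zlabel("Count")
    plt.show()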

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

Paczkowski, W.R. (2022). Beginning Deep Survey Analysis. In: Modern Survey Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-76267-4_4