Abstract
I divided the analysis of survey data into Shallow Analysis and Deep Analysis. The former just skims the surface of the data collected from a survey, highlighting only a minimum of findings with the simplest analysis tools. These tools, useful and informative in their own right, are only the first that should be used, not the only ones. They help you dig out some findings but leave much buried. I covered them and their use in Python in the previous chapter.
Notes
- 1.
Based on: https://stats.stackexchange.com/questions/267192/doubling-or-halving-p-values-for-one-vs-two-tailed-tests/267197. Last accessed July 29, 2020.
- 2.
As a personal anecdote, I once had a client who wanted to know if a difference of one cent in the prices of two products was significant—the products were selling for about $10 each.
- 3.
To be “stat tested,” as many like to say.
- 4.
I once did some survey analysis work for a large food manufacturing company (to remain nameless) that used α = 0.20.
- 5.
Economists refer to this as a perfectly competitive market. All firms in such a market are price takers, meaning they have no influence on the market price. Therefore, there is only one market price.
- 6.
It is easy to show that \(\sum (X_i - \bar {X}) = 0\) for \(\bar {X} = ({1}/{n}) \sum X_i\).
- 7.
It can actually be any level. As you will see, however, the first level is dropped by statsmodels.
- 8.
The cross-product term cancels after summing terms.
- 9.
Of course, the military branch does not determine your age. The main point is that the age distribution varies by branch.
- 10.
\(\binom {7}{2} = \dfrac {7!}{2! \times 5!} = 21\).
- 11.
QA7: “Did you ever serve in a combat or war zone?” There is a clarifying statement: “Persons serving in a combat or war zone usually receive combat zone tax exclusion, imminent danger pay, or hostile fire pay.”
- 12.
Source: https://en.wikipedia.org/wiki/Tooth-to-tail_ratio. Last accessed September 24, 2020.
- 13.
In the Design of Experiments literature, a treatment is an experimental condition placed on an object that will be measured. The measure is the effect of that treatment. The objects may be grouped into blocks designed to be homogeneous to remove any nuisance factors that might influence the responses to the treatments. Only the effect of the treatments is desired. In the survey context, the treatments are the CATA questions, and the blocks are the respondents themselves. See Box et al. (1978) for a discussion of experimental designs.
- 14.
See the article “Cochran’s Q test” at https://en.wikipedia.org/wiki/Cochran%27s_Q_test. Last accessed September 30, 2020.
- 15.
See https://docs.python.org/3/library/itertools.html#itertools.count. Last accessed October 1, 2020.
- 16.
The original data had “Yes” = 1, “No” = 2, and “Don’t Know” = 3.
- 17.
“In mathematics, specifically set theory, the Cartesian product of two sets A and B, denoted A × B is the set of all ordered pairs (a, b) where a is in A and b is in B.” Source: Wikipedia article “Cartesian product”: https://en.wikipedia.org/wiki/Cartesian_product. Last accessed on October 2, 2020. For this problem, the collection of branches is one set, and the collection of genders is another.
- 18.
See, for example, comments by N. Robbins at https://www.forbes.com/sites/naomirobbins/2015/03/19/color-problems-with-figures-from-the-jerusalem-post/?sh=21fd52f71c7f. Last accessed December 20, 2020. Also see Few (2008).
References
Agresti, A. 2002. Categorical Data Analysis. 2nd ed. New York: Wiley.
Box, G., W. Hunter, and J. Hunter. 1978. Statistics for Experimenters: An Introduction to Design, Data Analysis, and Model Building. New York: Wiley.
Cox, D. 2020. Statistical significance. Annual Review of Statistics and Its Application 1: 1–10.
Daniel, W.W. 1977. Statistical significance versus practical significance. Science Education 61 (3): 423–427.
Dudewicz, E.J. and S.N. Mishra. 1988. Modern Mathematical Statistics. New York: Wiley.
Ellis, S. and H. Steyn. 2003. Practical significance (effect sizes) versus or in combination with statistical significance (p-values). Management Dynamics 12 (4): 51–53.
Few, S. 2008. Practical rules for using color in charts. Visual Business Intelligence Newsletter. Perceptual Edge.
Greene, W.H. 2003. Econometric Analysis, 5th ed. Englewood: Prentice Hall.
Guenther, W.C. 1964. Analysis of Variance. Englewood Cliffs: Prentice-Hall, Inc.
Gujarati, D. 2003. Basic Econometrics, 4th ed. New York: McGraw-Hill/Irwin.
Marascuilo, L. 1964. Large-sample multiple comparisons with a control. Biometrics 20: 482–491.
Marascuilo, L. and M. McSweeney. 1967. Nonparametric posthoc comparisons for trend. Psychological Bulletin 67 (6): 401.
McGrath, J.J. 2007. The other end of the spear: The tooth-to-tail ratio (T3R) in modern military operations. Technical report, The Long War Series Occasional Paper 23. Fort Leavenworth, Kansas: Combat Studies Institute Press.
Moore, D.S. and W.I. Notz. 2014. Statistics: Concepts and Controversies, 8th ed. New York: W.H. Freeman & Company.
Paczkowski, W.R. 2016. Market Data Analysis Using JMP. New York: SAS Press.
Paczkowski, W.R. 2018. Pricing Analytics: Models and Advanced Quantitative Techniques for Product Pricing. Milton Park: Routledge.
Paczkowski, W.R. 2020. Deep Data Analytics for New Product Development. Milton Park: Routledge.
Paczkowski, W.R. 2022. Business Analytics: Data Science for Business Problems. Berlin: Springer.
Robbins, N.B. 2010. Trellis display. WIREs Computational Statistics 2: 600–605.
Rosen, B.L., and A.L. DeMaria. 2012. Statistical significance vs. practical significance: An exploration through health education. American Journal of Health Education 43 (4): 235–241.
Weiss, N.A. 2005. Introductory Statistics, 7th ed. Boston: Pearson Education, Inc.
Ziliak, S.T., and D.N. McCloskey. 2008. The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice, and Lives (Economics, Cognition, and Society), 1st ed. Ann Arbor: University of Michigan Press.
Appendix
This appendix provides brief overviews of, and perhaps some refresher material on, several key statistical concepts.
4.1.1 Refresher on Expected Values
The expected value of a random variable is just a weighted average of the values of the random variable. The weights are the probabilities of seeing a particular value of the random variable. Although the averaging is written differently depending on whether the random variable is discrete or continuous, the interpretation is the same in either case. If Y is a discrete random variable, then

$$\displaystyle E(Y) = \sum _y y \times p(y) $$

where p(y) = Pr(Y = y), the probability that Y = y, is a probability function such that

$$\displaystyle \sum _y p(y) = 1 ~\text{and}~ 0 \leq p(y) \leq 1. $$

So E(Y ) is just a weighted average. This is the expected value of Y .

If Y is a continuous random variable, then

$$\displaystyle E(Y) = \int _{-\infty }^{\infty } y \times f(y)\, dy $$

where

$$\displaystyle \int _{-\infty }^{\infty } f(y)\, dy = 1. $$

The function, f(y), is the probability density function of Y at y. It is not the probability of Y = y, which is zero.
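As a minimal numerical illustration of the discrete case, consider a fair six-sided die (my own example, not from the text): the weights are the probabilities, and they must sum to one.

```python
# Expected value of a discrete random variable: a fair six-sided die.
# E(Y) = sum over y of y * p(y); the weights p(y) must sum to 1.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

total_prob = sum(probs)  # should be 1, so p(y) is a valid probability function

expected_value = sum(y * p for y, p in zip(values, probs))
print(expected_value)  # close to 3.5
```

The same weighted-average logic carries over to the continuous case, with the integral replacing the sum.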
It is easy to show the following:
-
1.
E(aX) = a × E(X) where a is a constant.
-
2.
E(aX + b) = a × E(X) + b where a and b are constants.
-
3.
\(V(aX) = a^2 \times V(X)\) where \(V(\cdot)\) is the variance, defined as \(V(X) = E[X - E(X)]^2\).
-
4.
\(V(aX + b) = a^2 \times V(X)\).
It is also easy to show, although I will not do it here, that the expected value of a linear function of random variables is linear. That is, for two random variables, X and Y , and constants \(c_1\) and \(c_2\),

$$\displaystyle E(c_1 X + c_2 Y) = c_1 \times E(X) + c_2 \times E(Y). $$
You can also show that \(V(Y_1 \pm Y_2) = V(Y_1) + V(Y_2)\) if \(Y_1\) and \(Y_2\) are independent. If they are not independent, then \(V(Y_1 \pm Y_2) = V(Y_1) + V(Y_2) \pm 2 \times COV(Y_1, Y_2)\), where \(COV(Y_1, Y_2)\) is the covariance between the two random variables. For a random sample, \(Y_1\) and \(Y_2\) are independent.
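To see the covariance term at work, here is a quick numerical check with NumPy (the simulated data are my own illustration). With the same divisor used throughout, the identity holds exactly in-sample, not just in expectation.

```python
import numpy as np

# Simulate two correlated random variables and verify the identity
# V(Y1 - Y2) = V(Y1) + V(Y2) - 2 * COV(Y1, Y2).
rng = np.random.default_rng(42)
y1 = rng.normal(size=10_000)
y2 = 0.5 * y1 + rng.normal(size=10_000)  # deliberately correlated with y1

lhs = np.var(y1 - y2)                    # V(Y1 - Y2)
cov = np.cov(y1, y2, ddof=0)[0, 1]       # COV(Y1, Y2), population divisor
rhs = np.var(y1) + np.var(y2) - 2 * cov

print(abs(lhs - rhs))  # essentially zero: the identity holds exactly in-sample
```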
4.1.2 Expected Value and Standard Error of the Mean
You can now show that if \(Y_i, i = 1, 2, \ldots, n\), are independent and identically distributed (commonly abbreviated as iid) random variables, with mean \(E(Y) = \mu\) and variance \(V(Y) = E(Y - \mu)^2 = \sigma^2\), then

$$\displaystyle E(\bar {Y}) = \mu $$

and

$$\displaystyle V(\bar {Y}) = \frac {\sigma ^2}{n}. $$

This last result can be extended. Suppose you have two independent random variables, \(Y_1 \sim \mathcal {N}(\mu _1, \sigma _1^2)\) and \(Y_2 \sim \mathcal {N}(\mu _2, \sigma _2^2)\). Then

$$\displaystyle Y_1 \pm Y_2 \sim \mathcal {N}(\mu _1 \pm \mu _2, \sigma _1^2 + \sigma _2^2). $$
This last result is used when two means are compared.
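A short simulation (my own illustration) confirms that the sample mean centers on \(\mu\) with variance \(\sigma^2/n\): draw many iid samples of size n and examine the sampling distribution of their means.

```python
import numpy as np

# Check E(Ybar) = mu and V(Ybar) = sigma^2 / n by simulation:
# 100,000 iid samples of size n, one mean per sample.
rng = np.random.default_rng(0)
mu, sigma, n = 10.0, 2.0, 25
means = rng.normal(mu, sigma, size=(100_000, n)).mean(axis=1)

print(means.mean())  # close to mu = 10
print(means.var())   # close to sigma^2 / n = 4 / 25 = 0.16
```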
4.1.3 Deviations from the Mean
Two very important results about means are:
-
1.
\(\sum _{i = 1}^n (Y_i - \bar {Y}) = 0\).
-
2.
\(E(\bar {Y} - \mu ) = 0\)
The first is for a sample; the second is for a population. Regardless, both imply that a function of deviations from the mean is zero. To show the first, simply note that

$$\displaystyle \sum _{i = 1}^n (Y_i - \bar {Y}) = \sum _{i = 1}^n Y_i - n \times \bar {Y} = n \times \bar {Y} - n \times \bar {Y} = 0. $$

The second uses the result I showed above that \(E(\bar {Y}) = \mu \). Using this,

$$\displaystyle E(\bar {Y} - \mu ) = E(\bar {Y}) - \mu = \mu - \mu = 0. $$
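The first (sample) result is easy to verify numerically; a minimal check with NumPy, using arbitrary simulated data of my own:

```python
import numpy as np

# Deviations from the sample mean always sum to zero,
# regardless of the data.
rng = np.random.default_rng(7)
y = rng.normal(50, 10, size=1_000)

deviation_sum = np.sum(y - y.mean())
print(deviation_sum)  # zero up to floating-point rounding
```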
4.1.4 Some Relationships Among Probability Distributions
There are several distributions that are often used in survey analyses:
-
1.
Normal (or Gaussian) distribution
-
2.
\(\chi^2\) distribution
-
3.
Student’s t-distribution
-
4.
F-distribution
These are applicable for continuous random variables. They are all closely related, as you will see.
4.1.4.1 Normal Distribution
The normal distribution is the basic distribution; other distributions are based on it. The Normal’s probability density function (pdf) is

$$\displaystyle f(y) = \frac {1}{\sigma \sqrt {2 \pi }}\, e^{- \frac {(y - \mu )^2}{2 \sigma ^2}} $$

where \(\mu\) and \(\sigma^2\) are two population parameters. A succinct notation is \(Y \sim \mathcal {N}(\mu , \sigma ^2)\). This distribution has several important properties:
-
1.
All normal distributions are symmetric about the mean μ.
-
2.
The area under an entire normal curve traced by the pdf formula is 1.0.
-
3.
The height (i.e., density) of a normal curve is positive for all y. That is, f(y) > 0 ∀y.
-
4.
The limit of f(y) as y goes to positive infinity is 0, and the limit of f(y) as y goes to negative infinity is 0. That is,
$$\displaystyle \begin{aligned} \begin{array}{rcl} \lim\limits_{y \to \infty} f(y) = 0 ~\text{and}~ \lim\limits_{y \to -\infty} f(y) = 0 \end{array} \end{aligned} $$
-
5.
The height of any normal curve is maximized at y = μ.
-
6.
The placement and shape of a normal curve depends on its mean μ and standard deviation σ, respectively.
-
7.
A linear combination of normal random variables is normally distributed. This is the Reproductive Property of Normals.
A standardized normal random variable is \(Z = (Y - \mu)/\sigma\) for \(Y \sim \mathcal {N}(\mu , \sigma ^2)\). This can be rewritten as a linear function: \(Z = (1/\sigma) \times Y - \mu/\sigma\). Therefore, Z is normally distributed by the Reproductive Property of Normals. Also, \(E(Z) = E(Y)/\sigma - \mu/\sigma = \mu/\sigma - \mu/\sigma = 0\) and \(V(Z) = (1/\sigma^2) \times \sigma^2 = 1\). So, \(Z \sim \mathcal {N}(0, 1)\).
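A short simulation (my own illustration) of this standardization: subtracting the population mean and dividing by the population standard deviation yields a variable with mean near 0 and variance near 1.

```python
import numpy as np

# Standardizing Y ~ N(mu, sigma^2) with Z = (Y - mu) / sigma
# yields Z ~ N(0, 1).
rng = np.random.default_rng(1)
mu, sigma = 100.0, 15.0
y = rng.normal(mu, sigma, size=200_000)

z = (y - mu) / sigma
print(z.mean())  # close to 0
print(z.var())   # close to 1
```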
I show a graph of the standardized normal in Fig. 4.43.
4.1.4.2 Chi-Square Distribution
If \(Z \sim \mathcal {N}(0, 1)\), then \(Z^2 \sim \chi ^2_1\), where the subscript “1” denotes one degree-of-freedom. Some properties of the \(\chi^2\) distribution are:
-
1.
The sum of n independent \(\chi^2_1\) random variables is also \(\chi^2\), with n degrees-of-freedom: \(\sum _n Z_i^2 \sim \chi ^2_n\).
-
2.
The mean of the \(\chi ^2_n\) is n and the variance is 2n.
-
3.
The \(\chi ^2_n\) approaches the normal distribution as n →∞.
I show a graph of the \(\chi^2\) pdf for 5 degrees-of-freedom in Fig. 4.44.
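A small simulation (my own illustration) of the first two properties, summing squared standard normals to build \(\chi^2_5\) draws:

```python
import numpy as np

# If Z ~ N(0, 1), then the sum of n independent Z^2 terms is
# chi-square with n degrees-of-freedom, so its mean is n and
# its variance is 2n. Here n = 5.
rng = np.random.default_rng(3)
n = 5
chi2_draws = (rng.normal(size=(200_000, n)) ** 2).sum(axis=1)

print(chi2_draws.mean())  # close to n = 5
print(chi2_draws.var())   # close to 2n = 10
```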
4.1.4.3 Student’s t-Distribution
The ratio of two random variables, where the numerator is \(\mathcal {N}(0, 1)\) and the denominator is the square root of a χ 2 random variable with ν degrees-of-freedom divided by the degrees-of-freedom, follows a t-distribution with ν degrees-of-freedom.
I show a graph of the Student’s t pdf for 23 degrees-of-freedom in Fig. 4.45.
4.1.4.4 F-Distribution
The ratio of a χ 2 random variable with ν 1 degrees-of-freedom to a χ 2 random variable with ν 2 degrees-of-freedom, each divided by its degrees-of-freedom, follows an F-distribution with ν 1 and ν 2 degrees-of-freedom.
Note that \(F_{1, \nu _2} = t^2_{\nu _2}\). You can see this from the definition of a t with ν degrees-of-freedom: \(\frac {Z}{\sqrt {\frac {\chi ^2_\nu }{\nu }}} \sim t_\nu \). I show a graph of the F-distribution pdf for 3 degrees-of-freedom in the numerator and 15 degrees-of-freedom in the denominator in Fig. 4.46.
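This relationship between the F and t distributions can be checked numerically with scipy.stats (my choice of library; the text does not prescribe one): the upper-tail F critical value with (1, ν) degrees-of-freedom equals the square of the two-sided t critical value with ν degrees-of-freedom.

```python
from scipy import stats

# F(1, nu) upper critical value equals the squared two-tailed
# t(nu) critical value, for any alpha.
nu, alpha = 23, 0.05
f_crit = stats.f.ppf(1 - alpha, 1, nu)     # F(1, 23) critical value
t_crit = stats.t.ppf(1 - alpha / 2, nu)    # two-tailed t(23) critical value

print(f_crit, t_crit ** 2)  # the two numbers agree
```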
4.1.5 Equivalence of the F and t Tests for Two Populations
Guenther (1964, p. 46) shows that when there are two independent populations, the F-test and the t-test are related. In particular, he shows that
Part of this demonstration is showing that the denominator of the F-statistic is the pooled variance

$$\displaystyle s_p^2 = \frac {(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} $$
which is the result I stated in (4.11) and (4.12).
4.1.6 Code for Fig. 4.37
The code to generate the 3-D bar chart in Fig. 4.37 is shown here in Fig. 4.47. This is longer than previous code because more steps are involved. In particular, you have to define the plotting coordinates for each bar in addition to the height of each bar. The coordinates are the X and Y plotting positions and the base of the Z-dimension in the X-Y plane. This base is just 0 for each bar. Not only are these coordinates needed, but the widths of the bars are also needed. These are indicated in the code as dx and dy. The height of each bar is the Z position in an X-Y-Z three-dimensional plot. This is indicated by dz. The data for the graph come from the pivot table in Fig. 4.35. I did some processing of these data, which I also show in Fig. 4.47. Notice, incidentally, that I make extensive use of list comprehensions to simplify list creations.
This is the Python code I used to create Fig. 4.37
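Since the original listing appears only as a figure, here is a minimal sketch of the same approach using Matplotlib’s bar3d. The counts array is hypothetical, standing in for the pivot table data in Fig. 4.35; only the mechanics (coordinates, widths dx and dy, heights dz) follow the description above.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical counts standing in for the pivot table:
# rows index one categorical variable, columns the other.
counts = np.array([[5, 3, 2],
                   [4, 6, 1],
                   [2, 2, 7]])

# Plotting coordinates for each bar: X and Y positions, a Z base of 0,
# bar widths dx and dy, and heights dz (one entry per bar).
n_rows, n_cols = counts.shape
xpos, ypos = np.meshgrid(np.arange(n_cols), np.arange(n_rows))
xpos, ypos = xpos.ravel(), ypos.ravel()
zpos = np.zeros_like(xpos, dtype=float)            # every bar starts at z = 0
dx = dy = [0.5] * len(xpos)                        # uniform bar widths
dz = [counts[y, x] for x, y in zip(xpos, ypos)]    # heights via a list comprehension

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.bar3d(xpos, ypos, zpos, dx, dy, dz)
fig.savefig("bar3d_sketch.png")
```

The list comprehension building dz mirrors the style the text describes; with real data, counts would come from the pivot table rather than being typed in.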
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Paczkowski, W.R. (2022). Beginning Deep Survey Analysis. In: Modern Survey Analysis. Springer, Cham. https://doi.org/10.1007/978-3-030-76267-4_4
Print ISBN: 978-3-030-76266-7
Online ISBN: 978-3-030-76267-4