Probability Distributions

Chapter in Mathematics for the Life Sciences

Abstract

This first chapter on probability is narrowly focused on probability distributions. The initial section presents the basics of descriptive statistics using the data set that William Sealy Gosset (writing as “Student”) used in the classic 1908 paper that introduced the t test. The remainder of the chapter is based on the overall theme of probability distributions as models for large populations of data. Specific sections present the basic ideas of discrete and continuous distributions and a detailed look at the binomial, normal, Poisson, and exponential distributions. Many of the problems involve characterization of small data sets from published research, including R.A. Fisher’s well-known data for measurements of iris flowers and another data set used by Gosset. Several of these problems leave open questions that are addressed in Chapter 4.

Notes

  1.

    The word “arithmetic” used here is the adjective, pronounced “eh-rith-MEH-tic,” not the noun of the same spelling that is pronounced “a-RITH-muh-tic.”

  2.

    If one person gives a rating of 3 and another gives a rating of 1, we can say that the “average” rating is 2, but this average has no clear meaning in terms of the actual categories. There is no objective reason to think that one “agree” and one “disagree” are interchangeable with two “neither agree nor disagree.”

  3.

    On the whole, taller people weigh more than shorter people, and children weigh less than adults.

  4.

    The drug tested in the study is more commonly known as scopolamine; today it is used in a patch applied to the skin to prevent motion sickness. The subjects were not insomnia patients; they were inmates at the Michigan Asylum for the Insane and obviously not volunteers. (Medical ethics has improved a lot since the early 1900s!) The clinicians who administered the test first checked the drugs’ safety by trying them on themselves. (Experimental protocols have also improved a lot since the early 1900s!) The data was subsequently used by William Sealy Gosset, better known by the pseudonym “Student,” in a classic 1908 paper that introduced the test now known as Student’s t test [19]. Gosset incorrectly copied the column headings, so he did not reach the correct conclusion from the data. (Publication standards in science have improved a lot since the early 1900s as well!) For a more complete history of this interesting story, see [15].

  5.

    If the data set contained 100 values instead of 10, we might prefer rounding off to the nearest half-hour instead.

  6.

    See Section 2.3.

  7.

    See Section 2.3.

  8.

    The correct denominator actually is n when the data represents the full population of interest. The choice n − 1 is appropriate for the much more common case where data for a sample is being used to try to characterize a larger population. The reason for this change is beyond the scope of an elementary treatment. As the sample size increases, the distinction makes less of a difference.

  9.

    A chromosome is a large DNA molecule.

  10.

    See Section 3.1.

  11.

    Data for random variables is subject to demographic stochasticity, with considerable variation among samples. It is customary to report probabilities with four decimal digits, which implies a higher degree of precision than is usually warranted. Reporting any more than four decimal digits, except as a one-digit approximation for a tiny probability, is an example of the “measure it with your hand, mark it with a pencil, cut it with a laser” fallacy.

  12.

    This is another advantage of working with probability distributions rather than actual populations. Removing an individual from a population changes the probabilities for the remaining population. Sampling does not change a probability distribution. In effect, a probability distribution is a model of an infinite population.

  13.

    A plasmid is a DNA strand that appears in cells although it is not part of a chromosome.

  14.

    This strain of E. coli is well studied because it is used in the process of making ethanol from corn.

  15.

    See the range and sum rules of Section 3.2.

  16.

    For better clarity, the probabilities at terminal nodes are in boxes.

  17.

    See Section 3.2.

  18.

    All state lottery authorities hire mathematicians to determine the profiles of the ticket populations, but it is unlikely that any of the tickets are purchased by mathematicians. Those in the know refer to a lottery as “a tax on those who are bad at math.”

  19.

    The distribution is not exactly equivalent to the actual population because the probabilities have been rounded to decimal values; nevertheless, the difference between the actual population and the model distribution is insignificant. Note that we need to round 2/3 down to 0 for one value so that the total probability is exactly 1. It is more important that the probabilities satisfy the definition of a probability distribution than for them to exactly equal the real probabilities.

  20.

    See Problem 3.3.2.

  21.

    This is also known as the gambler’s fallacy.

  22.

    The implications of this question were of great interest to the children’s author “Dr. Seuss” [16].

  23.

    On the Origin of Species was published in 1859. Not only was Charles Darwin ignorant of the DNA mechanism of inheritance—he was also ignorant of the basic facts of inheritance!

  24.

    The peas also differed in color, but the extra complication is not necessary for our mathematical modeling.

  25.

    Mendel did, in fact, clearly state that he used only a portion of the data for his calculations. Careful control of data to minimize bias is a fairly recent innovation in science. A more complete discussion of Mendel’s data and Fisher’s analysis appears in [2].

  26.

    See Section 2.1.

  27.

    Bernoulli trials are also the basic components for the geometric and negative binomial distributions.

  28.

    There is no standard notation for the binomial distribution. Most authors use the letter “b” or some variation of it. The number of trials is almost always n and the success probability for one Bernoulli trial is almost always p. The outcomes are usually labeled as either x or k. Many authors put the parameters inside the parentheses along with the outcome. I prefer to use the parameters as subscripts to emphasize that they are fixed parameters of the distribution function, whereas k is the independent variable in the function.

  29.

    For a more scholarly discussion, see [9].

  30.

    This estimate is probably a little high, but not much. It is hard to know how many people are naturally left-handed, since some natural left-handers, such as the author’s sister, were “trained” in school to be right-handed. Irrational prejudice against left-handers dates back many centuries; for example, the Latin word for “left-handed” is the original source of the English word “sinister.”

  31.

    Even if 6 of 12 is highly unusual, it does not mean that the result is significant. While any one coincidence discovered in a group of 12 people is unusual, there was no particular reason to look for the specific coincidence of left-handedness. The number of possible coincidences that could be discovered is probably quite large, so perhaps there is nothing unusual in discovering one.

  32.

    Why the dice were rolled 26,306 times, rather than some larger or smaller number, is lost to posterity. According to Weldon, 7,006 rolls were done by a clerk deemed “reliable and accurate,” but Weldon did the other 19,300 rolls himself. There is no evidence that Weldon’s experiments inspired the invention of the game “Yahtzee.”

  33.

    Nowadays, a professional scientist would have his/her graduate student roll the dice 19,300 times.

  34.

    This data is part of a larger set in a classic paper by Ronald A. Fisher [3].

  35.

    This data set was originally published in the Edinburgh Medical and Surgical Journal in 1817. Since then, it has been used as an example for the development of statistical methods. The full history of the data set is recounted in [17].

  36.

    Note the use of \(\bar{x}\) inside the definite integral rather than x. The symbol for the integration variable must be distinct from that for the independent variable. Any symbol other than x can be chosen; the advantage of \(\bar{x}\) is that it serves as a reminder that we are integrating over values of x.

  37.

    In essence, we now calculate normal distribution values the same way we calculate values for exponential and trigonometric functions.

  38.

    The question of whether a data set is approximately normal is addressed in Section 4.2.

  39.

    Think of drawing a value from the normal distribution as a Bernoulli trial. For one million Bernoulli trials, each with success probability \(3 \times 10^{-7}\), the expected number of successes is the product \(10^{6} \times 3 \times 10^{-7} = 0.3\).

  40.

    These are not confidence intervals, which are discussed in Section 4.3.

  41.

    In general, statistics can be described as “probability augmented by arbitrary standards.” The advantage of such standards is that they provide a unified language to use in comparing results, but the disadvantage is that the standards can acquire a perceived significance out of proportion to their actual value.

  42.

    For those of us who don’t speak French, an acceptable pronunciation in English is “pwa-SOHN.”

  43.

    Basketball statisticians report this as a percentage rather than a probability.

  44.

    Technically, we should only consider the distribution of baskets as suitable for a Poisson distribution. Since points can come in twos or threes, they are not technically Bernoulli trials.

  45.

    See Section 3.5.

  46.

    We use < at the lower bound of the interval and ≤ at the upper bound because this is the custom. In practice, it is a distinction without a difference, since \(P[X = x] = \int_{x}^{x} f(s)\,ds = 0\).

  47.

    See Problem 3.7.14.

  48.

    In most cases, the sample size is so large that individual decays do not measurably change the mean rate.

  49.

    Parasitoids are animals that lay their eggs inside a living organism. The parasitoid larvae feed on the host.

  50.

    This was a very important technique in the period before modern computers. It is still used in software implementations of the binomial distribution for extreme cases such as that of part (d).

  51.

    See PoissonApproximation.pdf at http://www.math.unl.edu/~gledder1/MLS for details.
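
Notes 50 and 51 point to the Poisson approximation of the binomial distribution without showing it. The following is a minimal sketch, assuming the technique meant in note 50 is the one named in note 51. It borrows the values n = 1,000,000 and p = 3 × 10^-7 from note 39 and compares the two probability functions for small outcomes. The function names are illustrative and are not taken from the chapter.

```python
from math import comb, exp, factorial


def binomial_pmf(k, n, p):
    """Exact binomial probability of k successes in n Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, mu):
    """Poisson probability of k events when the mean count is mu."""
    return exp(-mu) * mu**k / factorial(k)


# Illustrative values borrowed from note 39: many trials, tiny success probability.
n, p = 1_000_000, 3e-7
mu = n * p  # expected number of successes, about 0.3

# Print the exact and approximate probabilities side by side for k = 0..3.
for k in range(4):
    print(k, binomial_pmf(k, n, p), poisson_pmf(k, mu))
```

With n this large and p this small, the two columns should agree to several decimal places, which is why the approximation is attractive when the exact binomial formula becomes awkward to evaluate.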

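The distinction in note 8 between the denominators n and n − 1 can be checked numerically. This is a small sketch using Python's standard `statistics` module; the data values are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from statistics import pvariance, variance

# Hypothetical data set (not from the chapter).
data = [2.0, 4.0, 4.0, 5.0, 7.0]
n = len(data)

pop_var = pvariance(data)  # denominator n: data is the full population
samp_var = variance(data)  # denominator n - 1: data is a sample

print("population variance:", pop_var)
print("sample variance:    ", samp_var)

# The two values differ by the factor n / (n - 1), which approaches 1
# as the sample size grows.
print("ratio:", samp_var / pop_var, "expected:", n / (n - 1))
```

For this data set the ratio is 5/4; with 100 values it would be 100/99, which illustrates why the distinction matters less as the sample size increases.
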
References

  1. Blackman GE. A study by statistical methods of the distribution of species in grassland associations. Annals of Botany, 49, 749–777 (1935). Cited in Goodness-of-Fit Techniques, eds. RB D’Agostino and MA Stephens. Marcel Dekker, New York (1986)

  2. Fairbanks DJ and B Rytting. Mendelian controversies: A botanical and historical review. American Journal of Botany, 88, 737–752 (2001)

  3. Fisher RA. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7, 179–188 (1936)

  4. Gilchrist W. Statistical Modeling. John Wiley and Sons, New York (1984)

  5. Hand DJ, F Daly, K McConway, D Lunn, and E Ostrowski. Handbook of Small Data Sets. CRC Press, Boca Raton, FL (1993)

  6. James SD. Four Out of Five Recent Presidents Are Southpaws. ABC News (2008-02-22). http://abcnews.go.com/politics/story?id=4326568 (cited Sep 2012)

  7. Kulasekera KB and DW Tonkyn. A new discrete distribution, with applications to survival, dispersal, and dispersion. Communications in Statistics (Simulation and Computation), 21, 499–518 (1992)

  8. Ledder G, JD Logan, and A Joern. Dynamic energy budget models with size-dependent hazard rates. J Math Biol, 48, 605–622 (2004)

  9. Llaurens V, M Raymond, and C Faurie. Why are some people left-handed? An evolutionary perspective. Philosophical Transactions of the Royal Society of London, B: Biological Sciences, 364, 881–894 (2009)

  10. MacArthur JW. Linkage studies with the tomato, III, Fifteen factors in six groups. Transactions of the Royal Canadian Institute, 18, 1–19 (1931)

  11. Macrae F. As two lefties vie for the American presidency why are so many U.S. premiers left-handed? The Daily Mail (2008-10-24). http://www.dailymail.co.uk/sciencetech/article-1080401/As-lefties-vie-American-presidency--U-S-premiers-left-handed.html (cited Sep 2012)

  12. Pearson K. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling. Philosophical Magazine, Series 5, 50, 157–175 (1900)

  13. Pilkington E. Revealed: The leftist plot to control the White House. The Guardian (2008-10-24). http://www.guardian.co.uk/world/2008/oct/24/barack-obama-mccain-white-house-left-handed (cited Sep 2012)

  14. Roach JC, G Glusman, AFA Smit, CD Huff, R Hubley, PT Shannon, L Rowen, KP Pant, N Goodman, M Bamshad, J Shendure, R Drmanac, LB Jorde, L Hood, and DJ Galas. Analysis of genetic inheritance in a family quartet by whole-genome sequencing. Science, 328, 636–639 (2010)

  15. Senn SJ and W Richardson. The first t-test. Statistics in Medicine, 13, 785–803 (1994)

  16. Seuss, Dr. Too Many Daves. In The Sneetches and Other Stories. Random House, New York (1961)

  17. Stigler SM. The History of Statistics: The Measurement of Uncertainty Before 1900, p. 208. Harvard University Press (1986)

  18. “Student”. On the error of counting with a haemocytometer. Biometrika, 5, 351–360 (1906)

  19. “Student”. The probable error of a mean. Biometrika, 6, 1–25 (1908)

  20. Wang CC. Multiple invasion of erythrocyte by malaria parasites. Transactions of the Royal Society of Tropical Medicine and Hygiene, 64, 268–270 (1970)

Copyright information

© 2013 Springer Science+Business Media, LLC

About this chapter

Cite this chapter

Ledder, G. (2013). Probability Distributions. In: Mathematics for the Life Sciences. Springer Undergraduate Texts in Mathematics and Technology. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-7276-6_3
