Introduction: Why Do We Visualize Data and What Is This Book About?
The introductory chapter describes the three rationales for visualizing data: exploration, confirmation and presentation. It also discusses the developments in computer hardware, software and connectivity that were instrumental for the recent increase in interest in visualizing data.
The goal of this book is simple: We would like to show how mortality dynamics can be visualized in the so-called Lexis diagram. To appeal to as many potential readers as possible, we do not require any specialist knowledge. This approach may be disappointing: Demographers might have liked more information about the mathematical underpinnings of population dynamics on the Lexis surface as demonstrated, for instance, by Arthur and Vaupel in 1984. Statisticians would probably have preferred more information about the underlying smoothing methods that were used. Epidemiologists likewise might miss discussions about the etiology of diseases. Sociologists would probably have expected our results to be more embedded into theoretical frameworks.
We are aware of those potential shortcomings but believe that the current format can, nevertheless, provide interesting insights into mortality dynamics, and we hope our book can serve as a starting point to visualize data on the Lexis plane for those who have not used those techniques yet.
Exploration: John Tukey stresses that exploratory data analysis “can never be the whole story, but nothing else can serve as the foundation—as the first step” (Tukey 1977, p. 3). He uses the expression “graphical detective work” for the attempt to uncover as many important details about the underlying data as possible. If one explores data only with preconceived notions and theories, it is likely that essential characteristics remain undiscovered.
Confirmation: It could be argued that the mere exploration of data without any hypotheses is a misguided endeavor. Exploration needs to be firmly distinguished from confirmatory analysis, though. While exploration is comparable to the work of the police, confirmation can be seen as the task of a judge or jury. Both are important to advance science: The first step is to gather the facts, whereas the second step is judgmental in nature. Can the “facts” be interpreted to support the theory? Or do certain findings exclude some hypotheses? Confirmatory analysis thus represents the core of scientific progress in Popper’s sense, namely the falsification of theories.
Presentation: Presenting and communicating the findings from the data analysis to the reader, or more appropriately, to the viewer, represents the third pillar of why data visualization is important. Mixing up confirmatory analysis with the presentation of the findings is probably one of the root causes of poor scientific communication. It is a common occurrence at scientific conferences that researchers use the same graphical tools to present their results to others as they used to obtain their findings in the first place. As pointed out by Schumann and Müller (2000, p. 6), this step requires careful thought so that third parties are able to understand the findings without unnecessary difficulty.
The introduction of the predecessor of all modern PCs, the IBM personal computer, in 1981 as well as of microcomputers (e.g., the “C-64”) in the same era triggered a shift away from the so-called minicomputers of the 1970s [2] to computers that could be purchased by households of average income. The processors were too slow and computer memory was too small, though, to process data as conveniently as we can nowadays. The first PC had an upper limit for working memory (RAM) of 256 kB, that is about 0.000778% of the RAM of the first author’s current desktop workstation. If we disregard developments in cache technology, parallel processing, etc., the pure clock speed of processors is now three orders of magnitude higher than in the early 1980s. Only 20 years ago, the typical size of total RAM was about as large as the size of a single digital photo today. But even if there had been enough RAM and sufficient CPU clock speed, data storage was another limiting factor. The first hard disk with a capacity of more than one gigabyte was introduced in 1980 and cost at least US-$ 97,000 [3]. One thousand times the storage capacity is available now at less than US-$ 100. This trend allowed the collection of massive data sets. To illustrate current capabilities: If we were interested in creating a data set that contains about 1000 alphabetic characters (more than enough for the name, birth date and current residence) for every person alive, we would have to invest less than US-$ 400 [4]. But, once again, even if we had had the affordable computer storage of today, communicating results graphically was hindered by the low resolution combined with relatively few colors of early graphics standards such as CGA and EGA. Only with the introduction and extension of the VGA standard did high-resolution displays become feasible.
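The storage estimate in the preceding paragraph can be retraced with a few lines of arithmetic. The following is only a sketch in Python: it uses the upper bounds stated in the chapter's footnote (fewer than eight billion people, one byte per alphabetic character, a 2 TB disk for less than US-$ 100), and it additionally assumes decimal terabytes and no overhead for file systems or redundancy. All variable names are illustrative.

```python
import math

# Upper bounds taken from the chapter's footnote; the result is therefore
# an upper bound on the cost as well.
WORLD_POPULATION = 8_000_000_000    # "less than eight billion" people
BYTES_PER_PERSON = 1_000            # about 1000 characters, one byte each
DISK_PRICE_USD = 100                # "less than US-$ 100" per disk

# Assumption (not stated in the text): a 2 TB disk holds 2 * 10**12 bytes,
# i.e. decimal terabytes, and file-system overhead is ignored.
DISK_CAPACITY_BYTES = 2 * 10**12

total_bytes = WORLD_POPULATION * BYTES_PER_PERSON
disks_needed = math.ceil(total_bytes / DISK_CAPACITY_BYTES)
total_cost_usd = disks_needed * DISK_PRICE_USD

print(f"{total_bytes / 10**12:.0f} TB on {disks_needed} disks, "
      f"at most US-$ {total_cost_usd}")
```

Under these assumptions, the data set occupies 8 TB and fits on four disks, which is where the bound of US-$ 400 comes from.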
Having hardware in terms of processing speed, working memory and hard disk capacity to process graphics coincided with a revolution in software in the 1990s: Similar to the introduction of home computers that gave access to almost everyone, the emergence of free software, also called open source software, allowed anyone to use software without the costs and other restrictions often imposed by commercial software products. Examples of this development can be found in three areas:
- General programming languages (e.g., Python, Perl).
- Languages tailored to, or at least particularly suited for, statistical programming and data analysis. The invention of the S language, started in the 1970s, was instrumental [5]. The most prominent example today is probably R (Ihaka 1998), but other languages such as the now almost completely abandoned XLISP-STAT (de Leeuw 2005) also facilitate(d) the visualization of data [6].
- Efficient data storage, especially with the advent of “big data”. Although it might be one of the most abused buzzwords currently, data sets in the gigabyte and terabyte range, partly in non-rectangular formats, have become ubiquitous. Those data can be handled by relational and non-relational database systems that are also available under free and open source licenses (e.g., SQLite, MySQL, PostgreSQL, Cassandra).
Although the internet had already existed for more than 20 years, the introduction and rising popularity of the world wide web (WWW) was a catalyst for the exchange of information via electronic networks. This technology now allows billions of people to have almost instant access to data. The speed of the internet connection, which is crucial for the exchange of information such as downloading large data sets, has also increased by at least two orders of magnitude since the middle of the 1990s, when 56 kbit/s modems were the standard.
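To give a sense of what two orders of magnitude mean in practice, the sketch below compares download times for a single file. The 56 kbit/s rate and the factor of 100 come from the text; the file size of 100 MB is a purely hypothetical choice for illustration, and protocol overhead is ignored.

```python
MODEM_BITS_PER_SECOND = 56_000   # 56 kbit/s modem, as mentioned in the text
SPEEDUP = 100                    # "at least two orders of magnitude"

# Hypothetical data set size (an assumption, not a figure from the text).
FILE_SIZE_BYTES = 100 * 10**6    # 100 MB


def download_seconds(size_bytes: int, bits_per_second: int) -> float:
    """Idealized transfer time: payload bits divided by line rate."""
    return size_bytes * 8 / bits_per_second


modem_seconds = download_seconds(FILE_SIZE_BYTES, MODEM_BITS_PER_SECOND)
fast_seconds = download_seconds(FILE_SIZE_BYTES, MODEM_BITS_PER_SECOND * SPEEDUP)

print(f"56 kbit/s modem: {modem_seconds / 3600:.1f} hours")
print(f"100 times faster: {fast_seconds / 60:.1f} minutes")
```

With these numbers, a transfer that would have taken roughly four hours over a modem finishes in under two and a half minutes.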
This trend is probably best demonstrated by visualizing the popularity of the term “visualizing data” over time, for instance, via Google’s Ngram viewer. Google Books Ngram Viewer displays the relative frequency of a search term in a corpus of books during a given time frame. Please see, for example: https://books.google.com/ngrams/graph?content=visualizing+data+&year_start=1960&year_end=2008
[2] As noted at https://en.wikipedia.org/wiki/Minicomputer#cite_note-Smith_1970-4 (last accessed on 13 June 2017), the New York Times wrote in 1970 that minicomputers were computers that cost less than US-$ 25,000.
[3] See: https://www-03.ibm.com/ibm/history/exhibits/storage/storage_3380.html, last accessed on 13 June 2017.
[4] Assuming a world population of less than eight billion, a price for a 2 TB hard disk of less than US-$ 100 and one byte per alphabetic letter.
[5] Please see Appendix A in Chambers (2008) for some notes on the history of S.
[6] It should be mentioned, though, that Matlab (Mathworks 2017), which is not published under a free/open-source license, was and is also key for the analysis and visualization of data.
- Andreev, K. F. (2002). Evolution of the Danish population from 1835 to 2000 (Odense Monographs on Population Aging, Vol. 9). Odense: University Press of Southern Denmark.
- Arriaga, E. E. (1984). Measuring and explaining the change in life expectancies. Demography, 21(1), 83–96.
- Camarda, C. G. (2008). Smoothing methods for the analysis of mortality development. PhD thesis, Universidad Carlos III de Madrid.
- Camarda, C. G. (2015). Smoothing and forecasting Poisson counts with P-splines. http://CRAN.R-project.org/package=MortalitySmooth, R package version 2.3.4.
- Canudas-Romo, V. (2003). Decomposition methods in demography. PhD thesis, Rijksuniversiteit Groningen, Groningen, NL.
- Caselli, G., & Vallin, J. (2006). Frequency surfaces and isofrequency lines. In G. Caselli, J. Vallin, & G. Wunsch (Eds.), Demography. Analysis and synthesis (Vol. I, Chap. 7, pp. 69–77). Amsterdam: Elsevier.
- Caselli, G., Vaupel, J. W., & Yashin, A. I. (1985). Mortality in Italy: Contours of a century of evolution. Genus, 41(1–2), 39–55.
- CDC/NCHS. (2015). Mortality in the United States, 2014. NCHS Data Brief, Number 229, Dec 2015. Available online as Supplementary Material at: https://www.cdc.gov/nchs/data/databriefs/db229_table.pdf#1.
- Cho, H., Howlader, N., Mariotto, A. B., & Cronin, K. A. (2011). Estimating relative survival for cancer patients from the SEER program using expected rates based on Ederer I versus Ederer II method. Tech. Rep. 2011-01, National Cancer Institute.
- Crimmins, E. M., Preston, S. H., & Cohen, B. (Eds.). (2011). Explaining divergent levels of longevity in high-income countries. Washington, DC: The National Academies Press.
- Deborah, S. (1998). The da Vinci of Data. The New York Times, 30 March 1998.
- Delaporte, P. (1938). Évolution de la mortalité française depuis un siècle. Journal de la société de statistique de Paris, 79, 181–206.
- Delaporte, P. (1942). Évolution de la mortalité en Europe depuis l’origine des statistiques. Journal de la société de statistique de Paris, 83, 183–203.
- Ederer, F., Axtell, L. M., & Cutler, S. J. (1961). The relative survival rate: A statistical methodology (Chap. 6, pp. 101–121). National Cancer Institute Monograph, National Cancer Institute.
- EurowinterGroup. (1997). Cold exposure and winter mortality from ischaemic heart disease, cerebrovascular disease, respiratory disease, and all causes in warm and cold regions of Europe. Lancet, 349, 1341–1346.
- EurowinterGroup. (2000). Winter mortality in relation to climate. International Journal of Circumpolar Health, 59, 154–159.
- Few, S. (2014). Why do we visualize quantitative data? Available Online At: http://www.perceptualedge.com/blog/?p=1897, Visual Business Intelligence. A blog by Stephen Few. Last verification: 14 June 2017.
- Friendly, M. (2008). A brief history of data visualization. In C. H. Chen, W. K. Härdle, & A. Unwin (Eds.), Handbook of data visualization (Springer Handbooks of Computational Statistics, Chap. I, pp. 15–56). Berlin/Heidelberg: Springer.
- Galilei, G. (1613). Istoria e dimostrazioni intorno alle macchie solari. Rome.
- Gambill, B. A., & Vaupel, J. W. (1985). The LEXIS program for creating shaded contour maps of demographic surfaces. Tech. Rep. RR–85–94. International Institute for Applied Systems Analysis (IIASA), Laxenburg, A.
- Hippocrates. (400BC). On airs, waters, and places. Translated by Francis Adams. Available online at: http://classics.mit.edu/Hippocrates/airwatpl.html.
- Ihaka, R. (1998). R: Past and future history. Available online at https://cran.r-project.org/doc/html/interface98-paper/paper.html.
- Jacobsen, R., Keiding, N., & Lynge, E. (2006). Causes of death behind low life expectancy of Danish women. Scandinavian Journal of Social Medicine, 34(4), 432–436.
- Jdanov, D. A., Jasilionis, D., Soroko, E. L., Rau, R., & Vaupel, J. W. (2008). Beyond the Kannisto-Thatcher database on old age mortality: An assessment of data quality at advanced ages. MPIDR Working Paper WP-2008-013, Max Planck Institute for Demographic Research, Rostock, Germany.
- Kannisto, V. (1994). Development of oldest-old mortality, 1950–1990: Evidence from 28 developed countries (Monographs on Population Aging, Vol. 1). Odense: Odense University Press.
- Keyfitz, N. (1977). Applied mathematical demography. New York: John Wiley & Sons.
- Kintner, H. J. (2004). The life table. In J. B. Siegel & D. A. Swanson (Eds.), The methods and materials of demography (2nd ed., Chap. 13, pp. 301–340). San Diego: Elsevier.
- Lang, D. T., & the CRAN team. (2015). RCurl: General Network (HTTP/FTP/…) Client Interface for R. http://CRAN.R-project.org/package=RCurl, R package version 1.95-4.7.
- Leon, D. A., Chenet, L., Shkolnikov, V. M., Zakharov, S., Shapiro, J., Rakhmanova, G., Vassin, S., & McKee, M. (1997). Huge variation in Russian mortality rates 1984–1994: Artefact, alcohol, or what? The Lancet, 350(9075), 383–388. https://doi.org/10.1016/S0140-6736(97)03360-6, http://www.sciencedirect.com/science/article/pii/S0140673697033606.
- Lindahl-Jacobsen, R., Rau, R., Jeune, B., Canudas-Romo, V., Lenart, A., Christensen, K., & Vaupel, J. W. (2016). Rise, stagnation, and rise of Danish women’s life expectancy. Proceedings of the National Academy of Sciences, 113(15), 4015–4020. DOI 10.1073/pnas.1602783113, http://www.pnas.org/content/113/15/4015.abstract, http://www.pnas.org/content/113/15/4015.full.pdf.
- Luy, M. (2004). Mortality differences between Western and Eastern Germany before and after Reunification. A macro and micro level analysis of developments and responsible factors. Genus, 60(3/4), 99–141.
- Mathworks. (2017). Matlab. Available at www.mathworks.com.
- Max Planck Institute for Demographic Research (Germany) and Vienna Institute of Demography (Austria). (2017). Human fertility database. Available at http://www.humanfertility.org.
- McDowall, M. (1981). Long term trends in seasonal mortality. Population Trends, 26, 16–19.
- Meslé, F. (2006). Medical causes of death. In G. Caselli, J. Vallin, & G. Wunsch (Eds.), Demography. Analysis and synthesis (Vol. II, Chap. 42, pp. 29–44). Amsterdam: Elsevier.
- Meslé, F., & Vallin, J. (2006a). Diverging trends in female old-age mortality: The United States and the Netherlands versus France and Japan. Population and Development Review, 32, 123–145.
- Meslé, F., & Vallin, J. (2006b). The health transition: Trends and prospects. In G. Caselli, J. Vallin, & G. Wunsch (Eds.), Demography. Analysis and synthesis (Vol. II, Chap. 57, pp. 247–259). Amsterdam: Elsevier.
- Moore, T. B., & Hurvitz, C. G. (2009). Cancers in childhood. In D. A. Casciato & M. C. Territo (Eds.), Manual of clinical oncology (Chap. 18, p. 397). Philadelphia: Lippincott Williams & Wilkins.
- National Bureau of Economic Research. (1959–2015). Mortality data — Vital statistics NCHS’ multiple cause of death data, 1959–2015. http://www.nber.org/data/vital-statistics-mortality-data-multiple-cause-of-death.html.
- National Cancer Institute. (2017). NCI dictionary of cancer terms. Available online at https://www.cancer.gov/publications/dictionaries/cancer-terms.
- National Center for Health Statistics. (1959–2015). Mortality multiple cause files. Available online at https://www.cdc.gov/nchs/data_access/vitalstatsonline.htm#Mortality_Multiple.
- Parkin, D., & Hakulinen, T. (1991). Analysis of survival. In O. Jensen, D. Parkin, R. MacLennan, C. Muir, & R. Skeet (Eds.), Cancer registration: Principles and methods (Chap. 12, pp. 159–176). No. 95 in IARC Scientific Publication, International Agency for Research on Cancer (IARC).
- Peters, F. (2015). Deviating trends in Dutch life expectancy. PhD thesis, Erasmus University Rotterdam, Rotterdam, NL.
- Preston, S. H., Heuveline, P., & Guillot, M. (2001). Demography. Measuring and modeling population processes. Oxford, UK: Blackwell Publishers.
- Quetelet, A. (1838). De l’influence des saisons sur la mortalité aux différens ages dans la Belgique. M. Hayez, Bruxelles, B.
- Rau, R. (2007). Seasonality in human mortality. A demographic approach (Demographic research monographs). Heidelberg: Springer.
- Rau, R., & Riffe, T. (2015). ROMIplot: Plots surfaces of rates of mortality improvement. R package version 1.0.
- van Rossum, G. (1995). Python reference manual. Tech. rep., CWI (Centre for Mathematics and Computer Science), Amsterdam, The Netherlands.
- Schumann, H., & Müller, W. (2000). Visualisierung. Grundlagen und allgemeine Methoden. Berlin/Heidelberg: Springer.
- Surveillance, Epidemiology, and End Results (SEER) Program. (2014). Research data (1973–2011). Available online at www.seer.cancer.gov. National Cancer Institute, DCCPS, Surveillance Research Program, released April 2014, based on the November 2013 submission.
- Thatcher, R. A., Kannisto, V., & Vaupel, J. W. (1998). The force of mortality at ages 80 to 120 (Monographs on population aging, Vol. 3). Odense: Odense University Press.
- Tufte, E. R. (2001). The visual display of quantitative information (2nd ed.). Cheshire: Graphics Press.
- Tufte, E. R. (2003). The cognitive style of PowerPoint. Cheshire: Graphics Press.
- Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
- University of California, Berkeley (USA), and Max Planck Institute for Demographic Research, Rostock, (Germany). (2007). Methods protocol for the human mortality database. Available at http://www.mortality.org/Public/Docs/MethodsProtocol.pdf.
- University of California, Berkeley (USA), and Max Planck Institute for Demographic Research, Rostock, (Germany). (2017). Human mortality database. Available at http://www.mortality.org.
- Van Den Berg, G. J. (1860). Befolknings statistik in: Underdaniga berattelse for dren 1856–1860, ny folijd II, 3. Stockholm, Statistika Central-Byrans.
- Vandeschrick, C. (2001). The Lexis diagram, a misnomer. Demographic Research, 4(3), 97–124. DOI 10.4054/DemRes.2001.4.3, http://www.demographic-research.org/volumes/vol4/3/.
- Vaupel, J. W., Gambill, B. A., & Yashin, A. I. (1985a). Contour maps of population surfaces. Tech. Rep. RR–85–47, International Institute for Applied Systems Analysis (IIASA), Laxenburg, A.
- Vaupel, J. W., Gambill, B. A., Yashin, A. I., & Bernstein, A. J. (1985b). Contour maps of demographic surfaces. Tech. Rep. RR–85–33, International Institute for Applied Systems Analysis (IIASA), Laxenburg, A.
- Vaupel, J. W., Gambill, B. A., & Yashin, A. I. (1987). Thousands of data at a glance: Shaded contour maps of demographic surfaces. Tech. Rep. RR–87–16, International Institute for Applied Systems Analysis (IIASA), Laxenburg, A.
- Vaupel, J. W., Zhenglian, W., Andreev, K. F., & Yashin, A. I. (1997). Population data at a glance: Shaded contour maps of demographic surfaces over age and time (Odense Monographs on Population Aging, Vol. 4). Odense: University Press of Southern Denmark.
- Wilmoth, J. R. (2006). Age-period-cohort models in demography. In G. Caselli, J. Vallin, & G. Wunsch (Eds.), Demography. Analysis and synthesis (Vol. I, Chap. 18, pp. 227–236). Amsterdam: Elsevier.
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution, and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license, and any changes made are indicated.
The images or other third party material in this book are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.