Abstract
This lesson summarizes how R, and more specifically R’s tidyverse ecosystem, is used in support of data science. A few key concepts about the tidyverse ecosystem are reinforced: (1) use of an Application Programming Interface (API) to obtain data; (2) the need to put data into tidy format; and (3) use of the tidyverse ecosystem in support of statistical analyses and the creation of figures, maps, and other visuals. An introduction is also offered on how data scientists prepare reports and how the supporting software and processes can be integrated into R. A few comments are made on next steps for those who wish to continue in data science. Perhaps most importantly for those who are in the early days of career exploration and advancement, there is also a discussion of the soft skills needed by those who wish to become leaders in data science and, in turn, use data science to promote societal improvement.
Notes
- 1.
For those who are not well acquainted with dairy cattle and the characteristics of the leading breeds, review State and national standardized lactation averages by breed for cows calving in 2007 (https://queries.uscdcb.com/publish/dhi/dhi09/laall.shtml) for generalized by-breed statistics on milk production (lb) as well as fat and protein production (% and lb). Many dairy herdsmen place a high value on the production of fat (% and lb) and protein (% and lb) and are willing to accept less milk production in terms of measured weight (lb).
- 2.
Whether data are nonparametric or parametric is often a matter of personal judgment or group consensus. Ideally, the final decision is based not only on observation of the data but also on applied tests such as the Anderson-Darling test or the Shapiro-Wilk test. More discussion of this issue is beyond the scope of this lesson.
- 3.
For those with a special interest in the issue of normal distribution and the choice between a nonparametric and a parametric approach to inferential testing, look at the dlookr::normality() function, used either alone, such as dlookr::normality(Pounds), or chained after the dplyr::group_by() function to test for normal distribution by group.
- 4.
Consider a map of the United Kingdom of Great Britain and Northern Ireland (UK). Should this map include only England, Northern Ireland, Scotland, and Wales? How would the Republic of Ireland show on this map, given that it shares a land mass with Northern Ireland? Then, add to this complexity the Channel Islands, such as the Bailiwick of Guernsey and the Bailiwick of Jersey. Should these two entities be included in a map of the UK? Should the Isle of Man also show on the map? Should the British Virgin Islands, the Falkland Islands, Gibraltar, and other British Overseas Territories show on the map? What about the Chagos Archipelago? Should Rockall be included? The complexity of maps goes far beyond the use of R or any other software for their creation.
- 5.
Is it possible to distinguish the borders for Luxembourg in this map?
- 6.
Notice how data may not be available for all geographic entities, or there may be concerns about the quality of some data.
- 7.
Review Federal Information Processing System (FIPS) Codes for States and Counties, https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt, for state-by-state and county-by-county FIPS codes.
- 8.
Review materials such as ZIP Code Tabulation Areas (ZCTAs) (https://www.census.gov/programs-surveys/geography/guidance/geo-areas/zctas.html) to learn how United States Postal Service ZIP Codes are accommodated when working with output obtained from the Census Bureau. Census Bureau ZCTAs are similar to Postal Service ZIP Codes, but the two are not identical.
- 9.
Look at the accommodation that was needed for the county named Cape May. What is the issue? Of the many possible ways to approach this issue, which tidyverse tool is best for this accommodation?
- 10.
Many documents are also prepared with word processing software. It is not necessary to comment much on its use other than to mention that some of the most popular word processing packages are proprietary, and it cannot be assumed that interested peers and students have access to them. In contrast, the typesetting approach demonstrated in this section (both R Markdown and LaTeX) is based on markup software that is legally and freely obtained.
- 11.
Although there is no desire to make negative comments on the use of word processing software, investigate the distinction between WYSIWYG (what you see is what you get) and WYSIAYG (what you see is all you get) when deciding whether to prepare a document with word processing software or with a markup language and accompanying software. Decide whether the ability to include comments, syntax, and other text directly in the manuscript, text that is not visible in the final report, has value when selecting document preparation software.
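Notes 2 and 3 mention applied tests of normality, run either on a whole variable or by group. The following is only a minimal sketch of that idea. It assumes a hypothetical data frame named milk with a grouping column Breed and a numeric column Pounds; the values are simulated for illustration and are not the lesson's data. To keep the only dependency to dplyr, it calls base R's shapiro.test() inside dplyr::summarise() rather than dlookr::normality().

```r
library(dplyr)

set.seed(42)  # reproducible simulated values; NOT the lesson's data

# Hypothetical data frame: simulated milk weights (lb) for two breeds
milk <- tibble(
  Breed  = rep(c("Holstein", "Jersey"), each = 30),
  Pounds = c(rnorm(30, mean = 23000, sd = 2500),
             rnorm(30, mean = 16000, sd = 2000))
)

# Shapiro-Wilk test on the whole column, ignoring groups
overall <- shapiro.test(milk$Pounds)

# Shapiro-Wilk test by group: one row of results per breed
by_breed <- milk %>%
  group_by(Breed) %>%
  summarise(
    n         = n(),
    statistic = unname(shapiro.test(Pounds)$statistic),
    p_value   = shapiro.test(Pounds)$p.value,
    .groups   = "drop"
  )

print(overall)
print(by_breed)
```

A pooled variable that mixes two quite different groups can fail a normality test even when each group passes on its own, which is one reason to look at the by-group results before choosing a nonparametric or parametric test.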
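Note 7 points to state and county FIPS codes. A county FIPS code joins a two-digit state code to a three-digit county code, and the leading zeros are easily lost if the codes are read in as numbers. The sketch below shows one way to restore them in base R; make_fips() is a hypothetical helper written for this illustration, not a function from the lesson or from any package.

```r
# Hypothetical helper: build a five-character county FIPS code from
# numeric state and county codes, restoring any leading zeros.
make_fips <- function(state, county) {
  sprintf("%02d%03d", as.integer(state), as.integer(county))
}

# Cape May County, New Jersey: state 34, county 9
cape_may <- make_fips(34, 9)   # "34009"

# Autauga County, Alabama: state 1, county 1
autauga <- make_fips(1, 1)     # "01001"

print(c(cape_may, autauga))
```

Keeping FIPS codes as character strings rather than numbers avoids this problem when joining Census Bureau output to map data.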
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
MacFarland, T.W. (2024). Putting It All Together – R, the tidyverse Ecosystem, and APIs. In: Introduction to Data Science in Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-46383-9_7
DOI: https://doi.org/10.1007/978-3-031-46383-9_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46382-2
Online ISBN: 978-3-031-46383-9
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)