Use of R-Based APIs (Application Programming Interface) to Obtain Data

Introduction to Data Science in Biostatistics

Abstract

The purpose of this lesson is to demonstrate how R-based API (Application Programming Interface) functions are used to obtain data with reproducible syntax, avoiding the cumbersome and opaque process of using point-and-click menus in a graphical user interface (GUI). R-based API functions, which serve as data retrieval clients, are now so popular that many of them return data in a tidy format. Much work remains to expand the number and ease of use of these data retrieval clients, but R functions that return data in tidy format are a great advantage over prior data retrieval processes. This lesson also stresses how a key (e.g., a proxy for a resource-specific password), freely and easily obtained, is used with R-based data retrieval functions (e.g., clients).

Notes

  1. Search on the terms chmod 700 filename and chmod 777 filename, and consider how data protection in the early use of distributed systems differs from current practices.

  2. Although it is an oversimplification, think of REST as a scenario where a researcher at Computer A sends a message to Computer B, asking for data. Computer B has been set up to review the message: whether the request is structured correctly, whether the authentication process (if any) confirms that the researcher is qualified to receive the data, whether the data are available, etc. If all requirements are met, Computer B sends the requested data back to Computer A, for the requesting researcher to use as desired.

  3. A key should be treated as if it were a password. A key provides a unique identification of the approved user. Do not share a key with anyone else, in the same way that passwords should always be kept private.

  4. To continue with an evaluation of available R-based API client functions, compare ease of use and outcomes from use of the tidycensus::get_acs() function and the acs::acs.fetch() function. Try both API clients over multiple queries to see if there is a reason to prefer one over the other. This text uses the tidycensus::get_acs() function, but the acs::acs.fetch() function also has great value.

  5. Use Census Bureau resources at https://www.census.gov/programs-surveys/geography/guidance/geo-areas/urban-rural.html and give attention to the file titled "A state-sorted list of all 2020 Census Urban Areas for the US, Puerto Rico, and Island Areas first sorted by state FIPS code, then sorted by Urban Area Census (UACE) code."

  6. If the NASS GUI were used, think of all the times it may be necessary to click (or not click) on either Year (from the 1920s into the early 2020s) or County (Iowa has 99 counties). Reproducible syntax seems like a good idea when confronted with this user experience.

  7. The tidyUSDA::getQuickstat() function by default returns a data frame, not a tibble.

  8. For those with special interest, review the biography of Henry A. Wallace, a native Iowan, who was appointed Secretary of Agriculture in 1933 and was elected as Vice President of the United States in 1940. His leadership, individually and in league with others, was instrumental in the development and use of hybrid corn beginning in the mid-1930s. Look at the figure on corn yields over time, and note how, a few years later (by the late 1930s to early 1940s), corn yields per acre began their rapid ascent.

  9. Without going beyond the scope of this addendum, environmental conditions such as temperature are not easily measured as one or two data points in a spreadsheet, such as mean monthly high temperature and mean monthly low temperature. Consider the impact of extreme temperatures on corn, both high temperatures and low temperatures, especially during tasseling and silking, when pollination occurs. Extreme temperatures during these critical stages of development may have an adverse impact on kernel formation or grain fill. Large stalks, where the corn is as high as an elephant’s eye, may look good when driving past a field, but most corn is grown, sold, and used for grain, not fodder.

  10. Going back to the late 1700s and passage of the Northwest Ordinance, note how many counties in what later became the state of Iowa (and other midwestern states) are generally square. There are many resources that explain how land was surveyed into one-square-mile sections (640 acres), with 36 adjoining sections organized into a survey township, all on a rectangular grid. Townships were then collectively organized into what are commonly called box-shaped counties. There are exceptions, of course, where natural borders such as rivers intervene, but the organizational structure of borders has historically impacted land ownership, farming practices, etc.

  11. Use of facet_wrap() may not be needed, but it should still be attempted just to see if it is a reasonable way of communicating outcomes.

  12. When determining a correlation coefficient between two variables, whether using Pearson’s r, Spearman’s rho, or Kendall’s tau, always keep in mind two related common expressions: (1) past behavior is the best predictor of future behavior; and (2) correlation does not imply causation.

  13. Do not obtain data and then immediately rush in and start analyses. Take time to study the data, visually and by using software. There is a reason for the 80-20 rule.

  14. Look at the price of Iowa corn in 1915 ($0.63 per bushel) and the rapid increases up to 1919 ($1.34 per bushel), a more than doubling of price in only a few years. Were World War I and the worldwide demand for food responsible for this increase? Give special attention to the $1.34 per bushel price of Iowa corn in 1919 and how the price of Iowa corn crashed throughout the 1920s and 1930s, with a low of $0.32 per bushel in 1932. Concurrent with these low prices, review available resources on the Great Depression and the Dust Bowl. It is dangerous to say that X caused Y, but is there an association between the low prices for farm commodities in the 1920s and 1930s, the Great Depression of the 1930s, and the prior use of compromised farming practices that may have contributed to the 1930s Dust Bowl? This is of course an extremely complicated issue, but those who work in agriculture and biostatistics should at least be aware of these issues.

  15. Other global events that contributed to a rapid increase and subsequent decline in Iowa corn prices include the 1972 decision to export grain to the then Union of Soviet Socialist Republics (USSR) and, later, the 1980 decision to embargo grain sales to the same entity. The point here is that farm commodity prices are impacted by far more than weather and similar environmental factors.

  16. Note: At one time, the variable names in the original object appeared in UPPERCASE instead of lowercase. Always check datasets for current naming schemes, structure, data availability, etc. Data at online resources that are controlled by others are always subject to change: updates, deletions, modifications.

  17. Challenge: Use tools from the tidyverse ecosystem to observe change over time, if any, in percentage use of Natural Gas (Weighted US Average) in 2010 compared to 2021.

  18. Challenge: Look at mean values for n2o_emissions_co2e from 2010 to 2019, in 2020, and then again in 2021. Consider possible reasons for the gradual, if somewhat uneven, decline from 2010 to 2019, the large drop-off in 2020, and the uptick in 2021. As data become available, what is the trend from 2022 onward?

  19. The change from coal and oil to natural gas for generation of electricity has provided the opportunity for many impressive videos, showing dramatic images of the implosion of old infrastructure: boilers and cooling stacks. Merely as one of many possible selections, search for videos of the June 19, 2011, implosion of electricity-generating infrastructure at Riviera Beach, Florida. In mere seconds, two boilers and two 300-foot stacks came tumbling down, to make way for construction of a new natural gas-powered system. A serendipitous outcome was construction of a dedicated lagoon, where warm water from the new cooling towers is discharged into an area where manatees (a protected species) can gather during the winter and thrive when cool water temperatures might otherwise put them under stress. Look at the Manatee Cam (https://www.visitmanateelagoon.com/) in the winter, when air temperatures are about 60 °F (about 15 °C), and look at what may seem to be 100 or more manatees enjoying the benefit of warm water discharge from the power plant.

  20. Be sure to notice that data for total_annual_heat_input were unavailable for two years, 2010 and 2011.

  21. Consumers expect continuous and uninterrupted electricity for their factories, gas pumps at service stations, refrigeration at grocery stores, homes, hospitals, schools, shops, offices, traffic lights, water purification plants, etc. Even a few minutes (seconds, actually) of interruption to the electric power grid creates havoc. Downtime in the availability of electricity, even when power plants and communities are faced with weather-related force majeure events, is quickly deemed unacceptable by the public: the lights need to be on 24 hours a day, every day, all year, with no exceptions. Think of the February 2021 power outages in Texas. The disruptions, during an exceptionally active polar vortex that reached far into the South, with freezing weather and bitter storm conditions, caused hardship for millions of consumers and businesses: intolerable living conditions, economic losses from lost productivity, houses flooded once frozen pipes warmed up and discharged untold gallons of water into residences, and, worst of all, the many deaths that could be directly attributed to the disruption of electric service as power-generating plants went offline and power lines went down. Many consumers would have gladly accepted a temporary increase in emissions, n2o_emissions_co2e and ch4_emissions_co2e, but of course, power plants cannot be so easily transformed from one fuel type to another, and this certainly cannot be done quickly, without adequate (and costly) advance planning, if at all.

  22. In a similar manner, comma separated values (.csv) files are also text based, allowing wide use across multiple platforms, software and hardware, and users.
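The comparison suggested in Note 1 (chmod 700 versus chmod 777) can be tried safely from within R. The following is only a minimal sketch, assuming a Unix-alike system (file modes behave differently on Windows); the helper get_mode() and the temporary file are illustrative names, not part of the chapter.

```r
# Create a throwaway file so that no real data are touched.
demo_file <- tempfile(fileext = ".txt")
writeLines("placeholder contents", demo_file)

# Illustrative helper: return the permission bits as an octal string.
get_mode <- function(path) format(file.info(path)$mode)

# 700: the owner may read, write, and execute; group and others have no access.
Sys.chmod(demo_file, mode = "700", use_umask = FALSE)
mode_private <- get_mode(demo_file)

# 777: every user may read, write, and execute (rarely appropriate for data).
Sys.chmod(demo_file, mode = "777", use_umask = FALSE)
mode_open <- get_mode(demo_file)

cat("private:", mode_private, " open:", mode_open, "\n")

# Clean up the throwaway file.
unlink(demo_file)
```

The three digits refer, in order, to the owner, the group, and all other users, which is why 700 keeps a file private and 777 opens it to everyone.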
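The request-and-response scenario in Note 2 can be imitated with a toy function that plays the role of Computer B. This is a simplified sketch, not a real web service: the names server_data, valid_keys, and handle_request() and the placeholder payload are all hypothetical, and only the numeric status codes follow standard HTTP conventions.

```r
# Toy stand-in for "Computer B": available resources and approved keys.
# Every name and value here is hypothetical.
server_data <- list(corn_yield = "placeholder payload")
valid_keys  <- c("demo-key-123")

handle_request <- function(request) {
  # 1. Is the request structured correctly?
  if (!is.list(request) || !all(c("resource", "key") %in% names(request))) {
    return(list(status = 400L, body = NULL))  # 400 Bad Request
  }
  # 2. Does the authentication process approve the requester?
  if (!isTRUE(request$key %in% valid_keys)) {
    return(list(status = 401L, body = NULL))  # 401 Unauthorized
  }
  # 3. Are the requested data available?
  if (!isTRUE(request$resource %in% names(server_data))) {
    return(list(status = 404L, body = NULL))  # 404 Not Found
  }
  # All requirements are met: send the data back to "Computer A".
  list(status = 200L, body = server_data[[request$resource]])
}

# "Computer A" sends a well-formed, authenticated request.
reply <- handle_request(list(resource = "corn_yield", key = "demo-key-123"))
cat("status:", reply$status, "\n")
```

Changing the key or the resource name in the request shows how each check fails on its own before any data are returned.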
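One common way to follow the advice in Note 3 is to keep a key out of the script entirely and read it from an environment variable. The sketch below assumes a made-up variable name (DEMO_API_KEY) and a made-up key value; get_api_key() and mask_key() are illustrative helpers, and each real data service documents its own conventions.

```r
# Hypothetical environment variable name, used only for this demonstration.
key_name <- "DEMO_API_KEY"

# Read a key from the environment and fail loudly if it is missing.
get_api_key <- function(name) {
  key <- Sys.getenv(name, unset = "")
  if (!nzchar(key)) {
    stop("No key found in environment variable ", name, call. = FALSE)
  }
  key
}

# Mask a key before printing, so shared output never exposes it.
mask_key <- function(key) {
  paste0(substr(key, 1, 2), strrep("*", max(nchar(key) - 2, 0)))
}

# For demonstration only: a real key would be set outside the script,
# for example in a user-level .Renviron file.
Sys.setenv(DEMO_API_KEY = "abc123xyz")
masked <- mask_key(get_api_key(key_name))
cat("Key in use:", masked, "\n")

# Remove the demonstration key from the session.
Sys.unsetenv("DEMO_API_KEY")
```

Because the key never appears in the syntax itself, the script can be shared for reproducibility while the key stays private.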
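The distinction in Note 7 between a data frame and a tibble can be checked by inspecting the class of an object. The sketch below uses a small stand-in data frame rather than a real tidyUSDA::getQuickstat() result; the object name, its columns, and the helper has_tibble_class() are assumptions made for illustration, and the conversion step runs only if the tibble package is installed.

```r
# Stand-in for a returned object: a plain base R data frame.
# The column names are placeholders, not the actual Quick Stats fields.
quickstat_like <- data.frame(
  county = c("ADAIR", "ADAMS"),
  value  = c(NA_real_, NA_real_),
  stringsAsFactors = FALSE
)

# Illustrative helper: tibbles carry the class "tbl_df"; data frames do not.
has_tibble_class <- function(x) inherits(x, "tbl_df")

cat("data frame:", is.data.frame(quickstat_like),
    " tibble:", has_tibble_class(quickstat_like), "\n")

# Convert to a tibble when the tibble package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  quickstat_tbl <- tibble::as_tibble(quickstat_like)
  cat("after conversion, tibble:", has_tibble_class(quickstat_tbl), "\n")
}
```

A tibble is still a data frame, so functions that expect a data frame generally continue to work after the conversion.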
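The three coefficients named in Note 12 are all available through the base R function cor() by changing its method argument. The following sketch uses a deliberately artificial pair of variables (y is the square of x), chosen only to show that the coefficients can disagree; it is not data from the chapter.

```r
# Artificial example: y rises with x in every step, but not in a straight line.
x <- 1:10
y <- x^2

# Compute each coefficient with the same function, changing only the method.
methods <- c("pearson", "spearman", "kendall")
coefs <- sapply(methods, function(m) cor(x, y, method = m))

print(round(coefs, 3))
```

Spearman's rho and Kendall's tau are based on ranks, so they reach 1 for any strictly increasing relationship, whereas Pearson's r falls short of 1 here because the relationship is not linear. None of the three says anything about causation.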
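The price changes described in Note 14 can be confirmed with a few lines of arithmetic. This sketch uses only the three prices quoted in that note; the object names are illustrative.

```r
# Iowa corn prices (USD per bushel) as quoted in Note 14.
price <- c("1915" = 0.63, "1919" = 1.34, "1932" = 0.32)

# How many times higher was the 1919 price than the 1915 price?
ratio_1915_1919 <- price[["1919"]] / price[["1915"]]

# Percentage decline from the 1919 price to the 1932 low.
pct_drop_1919_1932 <-
  100 * (price[["1919"]] - price[["1932"]]) / price[["1919"]]

cat("1919 price relative to 1915:", round(ratio_1915_1919, 2), "times\n")
cat("Decline from 1919 to 1932:", round(pct_drop_1919_1932, 1), "percent\n")
```

The ratio is a little above 2, consistent with "a more than doubling," and the fall to the 1932 low amounts to roughly three quarters of the 1919 price.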
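The point in Note 22 that .csv files are plain text can be seen by writing a small data frame to disk and reading the file back as raw lines. This is a minimal sketch: the three prices are those quoted in Note 14, and the object and column names are chosen here for illustration.

```r
# A small data frame, using the prices quoted in Note 14.
corn <- data.frame(
  year = c(1915L, 1919L, 1932L),
  price_usd_per_bushel = c(0.63, 1.34, 0.32)
)

# Write the data frame to a temporary .csv file.
csv_path <- tempfile(fileext = ".csv")
write.csv(corn, csv_path, row.names = FALSE)

# Read the file as ordinary text: one header line, then one line per row.
raw_lines <- readLines(csv_path)
print(raw_lines)

# Read the same file back as a data frame and remove the temporary file.
corn_back <- read.csv(csv_path)
unlink(csv_path)

cat("Round trip preserved the data:", isTRUE(all.equal(corn, corn_back)), "\n")
```

Because the file is only text, any editor, spreadsheet, or statistical package on any platform can open it.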

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

Cite this chapter

MacFarland, T.W. (2024). Use of R-Based APIs (Application Programming Interface) to Obtain Data. In: Introduction to Data Science in Biostatistics. Springer, Cham. https://doi.org/10.1007/978-3-031-46383-9_6
