Abstract
The previous chapter introduced you to using Docker and Apache Zeppelin to power your Spark explorations. You learned to transform loosely structured data into reliable, self-documenting, and, most importantly, highly structured data by applying explicit schemas. You also wrote your first end-to-end ETL job, encoding the journey from raw data to structured data in a repeatable way. That process, however, is only the first of many steps in a data transformation pipeline. We began with raw data transformations for a simple reason: the data you ingest into your pipelines will most likely originate in the data lake.
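As a refresher on the technique the abstract describes, the following is a minimal sketch of applying an explicit schema when reading loosely structured data from a data lake with the DataFrame API. It is illustrative only, not the book's own listing; the column names and the path under /data/lake are hypothetical.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType, TimestampType}

object SchemaOnReadExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("schema-on-read")
      .master("local[*]")
      .getOrCreate()

    // An explicit schema names and types every column up front,
    // so bad records surface early instead of relying on inference.
    val schema = StructType(Seq(
      StructField("customer_id", StringType, nullable = false),
      StructField("amount", DoubleType, nullable = true),
      StructField("created_at", TimestampType, nullable = true)
    ))

    // Read raw CSV from the data lake with the schema applied
    // (the path and columns here are hypothetical).
    val orders = spark.read
      .option("header", "true")
      .schema(schema)
      .csv("/data/lake/raw/orders/*.csv")

    orders.printSchema()
    spark.stop()
  }
}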
Copyright information
© 2022 The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature
About this chapter
Cite this chapter
Haines, S. (2022). Transforming Data with Spark SQL and the DataFrame API. In: Modern Data Engineering with Apache Spark. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-7452-1_4
DOI: https://doi.org/10.1007/978-1-4842-7452-1_4
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-7451-4
Online ISBN: 978-1-4842-7452-1