Abstract
In this chapter, we continue our exploration of supervised learning with a focus on regression tasks. Specifically, we will build a regression model using the decision tree algorithm, an alternative to the multiple linear regression model we used in the previous chapter. We will use both Scikit-Learn and PySpark to train and evaluate the model, and then use it to predict the sale price of houses from features such as the size of the property and the number of bedrooms, bathrooms, and stories. Additionally, we will compare the performance of Pandas and PySpark in data loading and exploration tasks to better understand their similarities and differences.
Copyright information
© 2023 The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature
About this chapter
Cite this chapter
Testas, A. (2023). Decision Tree Regression with Pandas, Scikit-Learn, and PySpark. In: Distributed Machine Learning with PySpark. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-9751-3_4
DOI: https://doi.org/10.1007/978-1-4842-9751-3_4
Publisher Name: Apress, Berkeley, CA
Print ISBN: 978-1-4842-9750-6
Online ISBN: 978-1-4842-9751-3
eBook Packages: Professional and Applied Computing, Apress Access Books, Professional and Applied Computing (R0)