Decision Tree Regression with Pandas, Scikit-Learn, and PySpark

Chapter in: Distributed Machine Learning with PySpark

Abstract

In this chapter, we continue our exploration of supervised learning with a focus on regression tasks. Specifically, we build a regression model using the decision tree algorithm, an alternative to the multiple linear regression model used in the previous chapter. We use both Scikit-Learn and PySpark to train and evaluate the model and then use it to predict the sale price of houses from several features, such as the size of the property and the number of bedrooms, bathrooms, and stories. Additionally, we compare the performance of Pandas and PySpark in data loading and exploration tasks to better understand their similarities and differences.
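To make the idea of decision tree regression concrete, the following is a minimal pure-Python sketch. It is not the chapter's code and not how Scikit-Learn or PySpark implement their tree learners. It assumes a tiny made-up data set of houses (area and number of bedrooms, with prices in thousands of USD), chooses each split by minimizing the summed squared error of the two resulting groups, and predicts the mean price of the leaf a house falls into. All function names (`sse`, `best_split`, `build_tree`, `predict`) and all data values are hypothetical.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(rows, targets):
    """Return the (feature_index, threshold) that most reduces SSE, or None."""
    best, best_score = None, sse(targets)
    for j in range(len(rows[0])):
        for t in sorted({r[j] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[j] <= t]
            right = [y for r, y in zip(rows, targets) if r[j] > t]
            if not left or not right:
                continue  # a split must put samples on both sides
            score = sse(left) + sse(right)
            if score < best_score:
                best, best_score = (j, t), score
    return best


def build_tree(rows, targets, max_depth=2):
    """Grow a tree; a leaf is a float, an inner node is a 4-tuple."""
    split = best_split(rows, targets) if max_depth > 0 else None
    if split is None:
        return sum(targets) / len(targets)  # leaf: mean target value
    j, t = split
    left = [(r, y) for r, y in zip(rows, targets) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, targets) if r[j] > t]
    return (
        j,
        t,
        build_tree([r for r, _ in left], [y for _, y in left], max_depth - 1),
        build_tree([r for r, _ in right], [y for _, y in right], max_depth - 1),
    )


def predict(tree, row):
    """Walk from the root to a leaf and return the leaf's mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if row[j] <= t else right
    return tree


# Hypothetical data: (area in square feet, bedrooms) -> price in thousands of USD
rows = [(1000, 2), (1200, 2), (1500, 3), (1800, 3), (2400, 4), (3000, 4)]
prices = [100, 120, 150, 180, 240, 300]

tree = build_tree(rows, prices, max_depth=2)
print(predict(tree, (1100, 2)))
```

Because every leaf predicts one constant value, the model is piecewise constant in the features. This is one way it differs from the multiple linear regression model mentioned above.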

Copyright information

© 2023 The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature

About this chapter

Cite this chapter

Testas, A. (2023). Decision Tree Regression with Pandas, Scikit-Learn, and PySpark. In: Distributed Machine Learning with PySpark. Apress, Berkeley, CA. https://doi.org/10.1007/978-1-4842-9751-3_4
