Predicting House Prices in Python using Linear Regression

Hello World,

Satvik Virmani
3 min read · Apr 12, 2021

Hi everyone, this is the second blog in the Machine Learning series. In this post we’re going to predict house prices using Linear Regression. So let’s get started:-

First and foremost,

What is Linear Regression?

Linear Regression is a model that assumes a linear relationship between the input variables (x) and a single output variable (y).

Eg. Linear Regression Model

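Math behind it:-

Eq. of line: y = mx + c

If we knew m and c, we would only need x as input to find y, but in linear regression we don’t know them. So we write the model (the hypothesis hθ) with unknown parameters θ₀ and θ₁, where x is sqft_living and hθ(x) is the predicted price:

hθ(x) = θ₀ + θ₁·x

Now we only need to find θ, so we first generate a random θ and then find the error on it using the cost function J:

J(θ₀, θ₁) = (1/2m) ∑ (hθ(xᵢ) − yᵢ)²

Here m is the number of training examples (not the slope m from the line equation) and yᵢ is the actual price of the i-th house.

Now we will minimize J using differential calculus, with an algorithm called Gradient Descent, which repeats these updates until the cost stops decreasing:

θ₀ ≔ θ₀ − α[ (1/m) ∑ (hθ(xᵢ) − yᵢ) ]
θ₁ ≔ θ₁ − α[ (1/m) ∑ (hθ(xᵢ) − yᵢ)·xᵢ ] ( α = learning rate )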

After computing this, we will get the best θ possible, and now we can just plug in an input x and get the house price as the result.
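To make the idea concrete, here is a minimal sketch of gradient descent for one input in plain Python. It is only an illustration: the function name `gradient_descent`, the toy numbers, the starting point of zeros, and the values of `alpha` and `steps` are all assumptions made for this example, not the code from the notebook.

```python
def gradient_descent(xs, ys, alpha=0.1, steps=5000):
    """Fit h(x) = theta0 + theta1 * x with batch gradient descent."""
    m = len(xs)                  # number of training examples
    theta0, theta1 = 0.0, 0.0    # starting guess (zeros here; a random start also works)
    for _ in range(steps):
        # Prediction error h(x_i) - y_i for every example
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of the cost J with respect to theta0 and theta1
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters at the same time
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1


# Made-up toy data: area in thousands of sqft, price in units of $100k.
# The points lie exactly on price = 0.5 + 2 * area.
areas = [1.0, 1.5, 2.0, 2.5, 3.0]
prices = [2.5, 3.5, 4.5, 5.5, 6.5]

theta0, theta1 = gradient_descent(areas, prices)
print(theta0, theta1)
```

The inputs are kept small on purpose. With raw values like 2000 sqft the same learning rate would make the updates blow up, which is why features are usually scaled before running gradient descent.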

So this is the basic idea of how Linear Regression works. The above is one-dimensional (1 input), but in practice we use multi-dimensional (multiple inputs) regression.

Let’s begin the code:-

For this we’ll use this dataset from Kaggle.

First we’ll import the modules and the data, and check whether there are any nulls or zeros in the data.
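A minimal sketch of this check with pandas could look like the following. The small DataFrame is made up so that the snippet runs on its own; with the real dataset you would load the downloaded CSV instead (the filename in the comment is hypothetical).

```python
import pandas as pd

# Made-up stand-in for the Kaggle data. With the real file this would be
# something like: df = pd.read_csv("data.csv")  # hypothetical filename
df = pd.DataFrame({
    "price": [313000.0, 0.0, 342000.0, 420000.0],
    "bedrooms": [3, 5, 0, 3],
    "sqft_living": [1340, 3650, 1930, 2000],
    "waterfront": [0, 0, 0, 1],
})

null_counts = df.isnull().sum()   # number of missing values in each column
zero_counts = (df == 0).sum()     # number of zeros in each column

print(null_counts)
print(zero_counts)
```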

We’ll get no null values, and the following count of zeros in each column.
| column | zeros |
| --- | --- |
| date | 0 |
| price | 49 |
| bedrooms | 2 |
| bathrooms | 2 |
| sqft_living | 0 |
| sqft_lot | 0 |
| floors | 0 |
| waterfront | 4567 |
| view | 4140 |
| condition | 0 |
| sqft_above | 0 |
| sqft_basement | 2745 |
| yr_built | 0 |
| yr_renovated | 2735 |
| street | 0 |
| city | 0 |
| statezip | 0 |
| country | 0 |

We will get rid of the 0 price values as they affect our data badly: we don’t need the price of houses which are not sold yet. So we replace them with the prices of houses having similar features.
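One possible way to do this replacement is sketched below. It assumes that "similar" simply means the same number of bedrooms and fills each 0 with the median price of that group. That rule and the toy numbers are assumptions for illustration; the notebook may pick similar houses differently.

```python
import numpy as np
import pandas as pd

# Made-up data: two houses have a price of 0
df = pd.DataFrame({
    "price": [300000.0, 0.0, 320000.0, 500000.0, 0.0, 540000.0],
    "bedrooms": [3, 3, 3, 4, 4, 4],
})

# Treat a price of 0 as "unknown" rather than as a real sale price
df["price"] = df["price"].replace(0, np.nan)

# Assumption for this sketch: "similar" means the same number of bedrooms.
# The median ignores the missing values, so it uses only the known prices.
group_median = df.groupby("bedrooms")["price"].transform("median")
df["price"] = df["price"].fillna(group_median)

print(df)
```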

We will also remove the 0 values of other features by looking over their graphs.

Now we will remove outliers as they affect our model badly. We can remove them in many ways, but I am going to do this with the Z-score method.
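The Z-score of a value is its distance from the mean measured in standard deviations, and values with a large absolute Z-score are treated as outliers. Here is a minimal sketch on made-up prices. The threshold of 3 is a common rule of thumb and an assumption here, and the score is computed by hand with pandas rather than with a library helper.

```python
import pandas as pd

# Made-up prices: 20 ordinary houses plus one extreme value
typical = [300000 + 5000 * i for i in range(20)]
prices = pd.Series(typical + [5000000], name="price")

# Z-score: distance from the mean, in standard deviations
z = (prices - prices.mean()) / prices.std(ddof=0)

threshold = 3  # assumed cut-off; a common rule of thumb
kept = prices[z.abs() < threshold]

print(len(prices), len(kept))
```

On the real data the same mask can be applied to the whole DataFrame, for example `df[z.abs() < threshold]`, so that the full rows of the outliers are dropped.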

Now we will remove the outliers of other features manually by looking over the plots of the features.

Now we will remove individual data points which don’t make sense, like a house with a really low price but having 8 bedrooms. We will do this too by looking over the graphs manually.

Now we will plot a heatmap.
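A heatmap in this kind of analysis is usually drawn from the correlation matrix of the numeric columns, so that is what this sketch assumes. The data is made up, and only the matrix is computed here; the plotting call is left as a comment.

```python
import pandas as pd

# Made-up rows with the same kind of columns as the dataset
df = pd.DataFrame({
    "price": [200000, 250000, 300000, 350000, 400000],
    "sqft_living": [1000, 1300, 1500, 1900, 2100],
    "yr_built": [1990, 1975, 2005, 1960, 1985],
    "city": ["Seattle", "Kent", "Seattle", "Renton", "Kent"],
})

# Correlation only makes sense for numeric columns, so text columns are dropped
corr = df.select_dtypes("number").corr()
print(corr)

# With seaborn installed, the matrix can then be drawn as a heatmap:
# import seaborn as sns
# sns.heatmap(corr, annot=True)
```

Values close to 1 or -1 in the row for price point to the features that are most strongly related to it.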

Now, after all of the above code, we have removed the outliers and refined our data.

So let’s create our model now.
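A minimal sketch of this step with scikit-learn is shown below. To keep it runnable on its own, it uses randomly generated data with a made-up relationship between the features and the price, and only two features. The split ratio and the `random_state` are assumptions as well, so the numbers it prints will not match the scores from the real dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (made up): price depends linearly on two features plus noise
rng = np.random.default_rng(0)
n = 500
sqft_living = rng.uniform(500, 4000, n)
bedrooms = rng.integers(1, 6, n)
noise = rng.normal(0, 20000, n)
price = 50000 + 150 * sqft_living + 10000 * bedrooms + noise

X = np.column_stack([sqft_living, bedrooms])

# Hold out 20% of the rows to measure how well the model generalizes
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

model = LinearRegression()
model.fit(X_train, y_train)      # learns the best theta for the training data
y_pred = model.predict(X_test)   # predicted prices for unseen houses

r2 = r2_score(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
print("r2_score :", r2)
print("mean_absolute_error :", mae)
```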

We get the following scores:-

explained_variance_score : 0.7263689983438038
max_error : 1265577.6392805874
mean_absolute_error : 93320.21416728517
mean_squared_error : 25954205385.90649
mean_squared_log_error : 0.06420609977688856
mean_absolute_percentage_error : 0.19477983081040492
median_absolute_error : 55069.36503234797
r2_score : 0.726338460452626

Our model has an r2_score of about 0.726 (roughly 73%), which is pretty decent.

The code (with the Jupyter notebook) is on my GitHub here.

Thanks for reading 😄
And, clap 👏 if this was a good read. Enjoy!


Satvik Virmani

I am a college student with a passion for web development and Python programming. Creator of Python libraries, sharing insights through technical blogs.