1.The first step is to import the required Python libraries into Ipython Notebook.
2.This data set is available in sklearn Python module, so we’ll access it using scikitlearn. we’re going to import Boston data set into Ipython notebook and store it in a variable called boston.
3.The object boston is a dictionary, so you can explore the keys of this dictionary.
we’ll going to fit a linear regression model and predict the Boston housing prices. we’ll use the least squares method as the way to estimate the coefficients.
Y = boston housing price(also called “target” data in Python)
and
X = all the other features (or independent variables)
.lm.fit() -> fits a linear model
.lm.predict() -> Predict Y using the linear model with estimated coefficients
.lm.score() -> Returns the coefficient of determination (R^2). A measure of how well observed outcomes are replicated by the model, as the proportion of total variation of outcomes explained by the model.
we’ll going to use all 13 parameters to fit a linear regression model. Two other parameters that you can pass to linear regression object are fit_intercept and normalize.
In [20]: lm.fit(X, bos.PRICE)
Out[20]: LinearRegression(copy_X=True, fit_intercept=True, normalize=False)
and then construct a data frame that contains features and estimated coefficients
finally plot a scatter plot between True housing prices and True RM.
we’ll going to calculate the predicted prices (Y^i) using lm.predict. Then display the first 5 housing prices.
In practice we wont implement linear regression on the entire data set, we will have to split the data sets into training and test data sets. So that we train our model on training data and see how well it performed on test data.
How not to do train-test split:
Residual plots are a good way to visualize the errors in our data. If we have done a good job then our data should be randomly scattered around line zero. If we see structure in our data, that means our model is not capturing some thing. Maye be there is a interaction between 2 variables that we are not considering, or may be we are measuring time dependent data. If we get some structure in our data, we should go back to our model and check whether we are doing a good job with our parameters.