Polynomial Regression


In this guide we will be discussing our final linear regression related topic, and that’s polynomial regression. Unlike simple and multivariable linear regression, polynomial regression fits a nonlinear relationship between the independent and dependent variables. Even so, it’s classified as a special case of multivariable regression, because the model is still linear in its coefficients; each power of the input is simply treated as its own feature. Some common examples include population growth, half-life and decay rates, stock pricing, and even force fields (like gravity).

As you would expect, the equation of a polynomial regression line is very similar to that of a multivariable regression line.
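For a single feature x and degree n, that line can be written as follows (using α for the intercept, matching the alpha value mentioned later in this guide, and β for the coefficients):

```latex
\hat{y} = \alpha + \beta_1 x + \beta_2 x^2 + \dots + \beta_n x^n
```

Each power of x plays the role of a separate feature, which is why an ordinary linear model can fit this curve.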


In the previous guide we discovered that exam two scores had the most significant impact on the model when predicting final exam scores. In this guide, we’re going to be looking at the relationship between exam two and homework scores.

To get started we’re going to import our libraries and data frame. As always we’re going to be using the basics: NumPy, Pandas, and Matplotlib. We’re also going to use the train and test split again, as well as the LinearRegression class from Scikit-learn. The last piece we’ll need comes from the preprocessing package in Scikit-learn: PolynomialFeatures. Once we have everything successfully imported we’ll move forward with segmenting and assigning our feature and dependent variables. Then we’ll finish by converting our two variables into column matrices by utilizing NumPy’s newaxis index.
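A minimal sketch of that setup is below. The original grade data frame isn’t shown in the guide, so the scores here are synthesized with the same column names purely for illustration:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Hypothetical stand-in for the guide's grade data frame.
rng = np.random.default_rng(0)
hw = rng.uniform(40, 100, 60)
df = pd.DataFrame({
    "homework": hw,
    "exam_2": 0.01 * hw**2 + rng.normal(0, 5, 60),
})

# Segment the feature and dependent variables, then use np.newaxis
# to turn each 1-D series into an (n, 1) column matrix.
homework = df["homework"].values[:, np.newaxis]
exam_2 = df["exam_2"].values[:, np.newaxis]
print(homework.shape)  # (60, 1)
```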

Now, if you remember back to simple and multiple variable regression, we’re going to follow very similar steps. Just like before we’re going to start by creating and defining our training and testing variables. But before we actually train the model, we’re going to apply the PolynomialFeatures class and pass in a degree of one. Next we’re going to create two more variables: poly_1_homework_train and poly_1_homework_test. Both will be assigned to the fit_transform function in PolynomialFeatures, which expands each homework score into a row of polynomial features: a bias column of ones plus each requested power of the score.
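The steps above can be sketched as follows. Variable names mirror the guide’s, and the scores are synthesized since the original data isn’t shown:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins for the guide's (n, 1) score matrices.
rng = np.random.default_rng(0)
homework = rng.uniform(40, 100, 60)[:, np.newaxis]
exam_2 = 0.01 * homework**2 + rng.normal(0, 5, (60, 1))

homework_train, homework_test, exam_2_train, exam_2_test = train_test_split(
    homework, exam_2, test_size=0.2, random_state=0)

# degree=1 adds only the bias column; the homework scores are unchanged.
poly_1 = PolynomialFeatures(degree=1)
poly_1_homework_train = poly_1.fit_transform(homework_train)
poly_1_homework_test = poly_1.fit_transform(homework_test)

print(poly_1_homework_train.shape)  # (48, 2): ones column + original score
```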

If you open up the variable explorer pane and compare homework_train and poly_1_homework_train you’ll notice they are nearly identical, and that is because we converted the data into first degree polynomial features. And if you remember back to your math class, any number raised to the power of one is itself, so all our homework scores stayed the same. The only addition is the column of ones: that’s the bias column, which gets multiplied by the intercept (the alpha value) in our final regression equation.


From here we can go ahead and train the model, and we’re going to do that by fitting it the same way we did with both simple and multivariable regression. We’ll start by creating our regressor variable and setting it to an instance of LinearRegression. Next, we’ll use the fit function to train our model with poly_1_homework_train, and we’ll still use exam_2_train as our training output. Now that we have the model trained we can test it by using our transformed homework array. Let’s create a prediction variable, assign it to the predict function, and pass in poly_1_homework_test. Before we graph the result let’s quickly determine the r-squared value. As a reminder, the r-squared value is the proportion of the variance in the dependent variable that is predictable from the independent variable. So, just like before, we’ll use the regressor.score function and pass in poly_1_homework_test followed by exam_2_test.
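Here is that train-predict-score sequence as one runnable sketch, again on synthesized scores standing in for the guide’s data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (the guide's real scores aren't shown).
rng = np.random.default_rng(0)
homework = rng.uniform(40, 100, 60)[:, np.newaxis]
exam_2 = 0.01 * homework**2 + rng.normal(0, 5, (60, 1))
homework_train, homework_test, exam_2_train, exam_2_test = train_test_split(
    homework, exam_2, test_size=0.2, random_state=0)

poly_1 = PolynomialFeatures(degree=1)
poly_1_homework_train = poly_1.fit_transform(homework_train)
poly_1_homework_test = poly_1.fit_transform(homework_test)

# Fit on the transformed training features, predict on the test set,
# then score: r-squared on held-out data.
regressor = LinearRegression()
regressor.fit(poly_1_homework_train, exam_2_train)
prediction = regressor.predict(poly_1_homework_test)
r2 = regressor.score(poly_1_homework_test, exam_2_test)
print(round(r2, 3))
```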

Now, when we graph this, notice how the regression line is still straight, which is exactly what we expected. But, we don’t have a very strong r-squared value. That’s because we’ve underfit the model.   

To avoid underfitting we need to create a higher order equation, which adds complexity to the model. To do that we’re going to use PolynomialFeatures again to transform homework_train and homework_test into cubic polynomial features. So, I’m going to create a few new variables by copying and pasting, then I’m going to replace every poly_1 with a poly_3. And most importantly, inside PolynomialFeatures we’re going to pass in degree equals three instead of one.
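The cubic transform expands each homework score x into the row [1, x, x², x³]. A tiny sketch with toy stand-in scores:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Toy stand-in scores; any (n, 1) homework matrix works the same way.
homework_train = np.array([[80.0], [65.0], [92.0]])

poly_3 = PolynomialFeatures(degree=3)
poly_3_homework_train = poly_3.fit_transform(homework_train)
print(poly_3_homework_train[0])  # [1., 80., 6400., 512000.]
```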

Let’s run this, then open up the variable explorer pane and check the poly_3_homework_train variable. It looks like everything worked and our homework array was successfully transformed into a cubic polynomial.

Now, we’re going to add this to the graph we already have. And as you can see we have a much better fit; in fact, the r-squared value increased significantly. We’re going to follow this process one more time with a fifth degree polynomial.

Before we move on I’d like to take a minute to address a possible issue you may be facing while graphing your regression line. I am purposely graphing my regression line as a scatter plot because if you use plt.plot, or some alternative, you typically end up with a graph like this.


If you’re having this issue it’s more than likely due to your homework array. Matplotlib draws points in the order they appear in the array, meaning you’d need to sort your array before you graph it to get a single uniform line. While it may look a little cleaner, it’s not something I really recommend. Graphing is arguably the least important aspect of regression, and anytime you introduce code that isn’t necessary you run an increased risk of introducing a bug into your model.
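If you do want a connected line anyway, sorting by the x-values is enough. A minimal sketch with hypothetical test scores and predictions:

```python
import numpy as np

# Hypothetical unsorted test scores and their predictions.
homework_test = np.array([[70.0], [50.0], [90.0]])
prediction = np.array([[49.0], [25.0], [81.0]])

# Reorder both arrays by the x-values so matplotlib connects the
# points left-to-right instead of zigzagging in array order.
order = np.argsort(homework_test[:, 0])
x_sorted = homework_test[order]
y_sorted = prediction[order]
# plt.plot(x_sorted, y_sorted) would now draw one uniform line
print(x_sorted[:, 0])  # [50. 70. 90.]
```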

Alright, now for the fifth degree polynomial. We’re going to follow the same steps that we walked through for the cubic polynomial: pass in degree equals five in PolynomialFeatures, then replace every poly_3 with poly_5.


We’re also going to add this as a scatter plot to our graph. What’s important to notice is that the r-squared value actually decreased. Until we get more training data we should probably keep the regression model as a cubic polynomial.


In theory we could keep increasing the degree of the polynomial to account for higher levels of variance in the data. But there comes a point when we begin to overfit the model. The problem with overfitting is that it gives significance to outliers, which will eventually decrease the accuracy of prediction in your model. All in all, polynomial regression is a great tool because it accounts for a wide range of curves and gives us a close approximation of the relationship between the dependent and feature variables.
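The underfit-to-overfit trade-off can be seen by sweeping the degree and comparing test-set r-squared values. This sketch uses synthetic scores, so the exact numbers will differ from the guide’s, but the pattern (degree 1 underfits, very high degrees stop helping) is the point:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data with a genuinely curved relationship.
rng = np.random.default_rng(0)
homework = rng.uniform(40, 100, 60)[:, np.newaxis]
exam_2 = 0.01 * homework**2 + rng.normal(0, 5, (60, 1))
homework_train, homework_test, exam_2_train, exam_2_test = train_test_split(
    homework, exam_2, test_size=0.2, random_state=0)

# Fit one model per degree and record held-out r-squared.
scores = {}
for degree in (1, 3, 5, 9):
    poly = PolynomialFeatures(degree=degree)
    regressor = LinearRegression()
    regressor.fit(poly.fit_transform(homework_train), exam_2_train)
    scores[degree] = regressor.score(
        poly.fit_transform(homework_test), exam_2_test)
print(scores)
```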