Simple linear regression is the most basic regression method. It lets us summarize and study the relationship between two variables. As we’ve discussed before, the first variable is the independent, or feature, variable, and the second is the dependent, or response, variable. This form of linear regression is called “simple” because it analyzes the influence of only one feature variable. Common real-world examples of simple linear relationships include weight versus height, annual salary versus job experience, and a home’s price versus its square footage.
Now let’s take a look at a data frame we’ve used in the past to see whether there is a linear relationship between course grades and final exam grades. Before we get into the code, I’d like you to try this on your own. As a guide, you’ll need to import the data frame, create a feature variable and a dependent variable, graph the data as a scatter plot, and finish by adding axis labels and a title.
There are a few ways you could have done this, so if your code is different from mine, that’s just fine as long as we have similar results.
Just by looking at the data, it’s clear there is a linear relationship between final exam scores and final grades. But without a quantitative measurement, we can’t say with any certainty how strong that relationship is. Thankfully, Python offers a few different ways of solving this problem.
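Here is a minimal sketch of one way to approach the exercise. The course’s data file is not reproduced in this guide, so the data frame below is built from made-up stand-in values, and the column names final_exam and final_grade are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in data. In the lesson this data frame would be imported
# from the course's own file, whose name and columns are not shown here.
grades = pd.DataFrame({
    "final_exam": [55, 62, 68, 71, 75, 80, 84, 88, 93, 97],
    "final_grade": [58, 60, 70, 69, 78, 79, 86, 85, 95, 96],
})

final_exam = grades["final_exam"].to_numpy()    # feature variable
final_grade = grades["final_grade"].to_numpy()  # dependent variable

# Scatter plot with axis labels and a title
fig, ax = plt.subplots()
ax.scatter(final_exam, final_grade)
ax.set_xlabel("Final Exam Score")
ax.set_ylabel("Final Grade")
ax.set_title("Final Exam Score vs. Final Grade")
```

If the plot does not appear on its own in your environment, calling plt.show() at the end will display it.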
The first method we’ll cover uses the stats module in the SciPy library, which gives us access to statistical functions such as linear regression. Using the linregress function, we can determine the slope, intercept, correlation coefficient (r-value), p-value, and standard error of the slope in one line of code.
The syntax may look a bit funky, but it’s no different from any other assignment you’ve done in the past. The only real difference is that we’re unpacking the function’s five return values into five variables rather than one. Before we run this, we’re going to add one more piece: an r-squared variable. We can do that by creating a new r_squared variable and setting it to the square of our correlation coefficient.
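A minimal sketch of that assignment is shown here. The arrays are made-up stand-in values rather than the course data, and the variable names are assumptions chosen to match the description above.

```python
import numpy as np
from scipy import stats

# Hypothetical stand-in data, not the course data set
final_exam = np.array([55, 62, 68, 71, 75, 80, 84, 88, 93, 97])
final_grade = np.array([58, 60, 70, 69, 78, 79, 86, 85, 95, 96])

# linregress takes the feature variable first, then the dependent variable,
# and its result unpacks into five values
slope, intercept, r_value, p_value, std_err = stats.linregress(final_exam, final_grade)

# r-squared is the square of the correlation coefficient
r_squared = r_value ** 2
```

Typing a name such as slope or r_squared into the console afterwards displays its value.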
Alright, now let’s run this and then head over to the console and pass in a few commands to see what exactly is going on.
What we get back makes a lot of sense. The slope is just about one, which means each additional point on the final exam corresponds to roughly one additional point in the final grade. The y-intercept is close to zero, so taken together the two values tell us that final grades track exam scores almost one to one. Finally, and most importantly, the r-squared value is just under .95, which tells us that about 95 percent of the variance in final grades can be explained by our model.
So what we’re going to do now is build a function, using the equation for simple regression, that takes final exam grades as its input and returns predicted final grades as its output, so that we can plot a fit line. The equation for a simple regression line is essentially the equation of a line in slope-intercept form, y = alpha + beta * x, where beta is the slope and alpha is the intercept.
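As a rough sketch, the function can be as small as the one below. The name predict_final_grade and its argument names are hypothetical; alpha and beta would be the intercept and slope from the regression.

```python
def predict_final_grade(alpha, beta, exam_score):
    """Return the predicted final grade using y = alpha + beta * x.

    alpha is the intercept, beta is the slope, and exam_score can be a
    single number or a NumPy array of final exam grades.
    """
    return alpha + beta * exam_score
```

Because the body is plain arithmetic, passing a whole NumPy array of exam scores returns an array of predictions, which is what we need for plotting a line.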
From here we can go ahead and plot the regression line, and for easy reference, I'm going to add the r squared value and the equation of the regression line to the graph as well.
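One possible way to draw this graph is sketched below, again with made-up stand-in data. The label text, colors, and the position of the annotation are arbitrary choices, not the original course code.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Hypothetical stand-in data, not the course data set
final_exam = np.array([55, 62, 68, 71, 75, 80, 84, 88, 93, 97])
final_grade = np.array([58, 60, 70, 69, 78, 79, 86, 85, 95, 96])

slope, intercept, r_value, p_value, std_err = stats.linregress(final_exam, final_grade)
r_squared = r_value ** 2

# Predicted final grades along the fit line: y = intercept + slope * x
fit_line = intercept + slope * final_exam

fig, ax = plt.subplots()
ax.scatter(final_exam, final_grade, label="Students")
ax.plot(final_exam, fit_line, color="red", label="Regression line")

# Add the equation and r-squared value to the graph for easy reference
equation = f"y = {intercept:.2f} + {slope:.2f}x"
ax.text(0.05, 0.95, f"{equation}\nr squared = {r_squared:.3f}",
        transform=ax.transAxes, va="top")

ax.set_xlabel("Final Exam Score")
ax.set_ylabel("Final Grade")
ax.set_title("Final Exam Score vs. Final Grade")
ax.legend(loc="lower right")
```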
With everything plotted, not only can we visualize the linear relationship, but we also have a solid r-squared value as supporting evidence. And while there is scatter around the line, our model still accounts for just under 95 percent of the variance in final grades.
Our next example uses the Python module StatsModels and will be a bit more condensed than our last example. To start, we’re going to import statsmodels.api as sm. Then we’re going to create our model by calling the ordinary least squares class, sm.OLS, and assigning the result to our model variable. It’s important to note that, unlike SciPy, StatsModels calls for the dependent variable first, followed by the feature variable. Next we’ll create a results variable and assign it the fitted model. Finally, for a summary of the results, we’ll create one last variable called summary and assign it results.summary().
Now let’s run this and see what we have.
Unlike our last example, StatsModels delivers a nicely formatted results summary with all the critical information in one easy-to-read location. We won’t be covering the graphing process in StatsModels, but if you are interested, I have included a few basic lines in the final code at the end of this section.
Our last example uses Scikit-Learn and is probably the most popular method because it also allows us to train and test our model. To start, we’re going to import a couple of new tools from the Scikit-Learn library. The first allows us to split our data into training and testing sets, and the second will be used to fit the linear model.
Before we split our data, we have to make one more modification: reshaping both variables from one-dimensional arrays into two-dimensional, single-column arrays, which is the matrix-like shape scikit-learn expects for its feature data.
With our data properly formatted, we’re ready to set up our training and testing split. Similar to our previous example, we’re going to unpack a single function’s return values into a handful of variables. The official scikit-learn documentation and many of the examples you’ll come across use capital “X” as their feature variable and lowercase “y” as the dependent variable.
In our example we’ll be using final exam in place of “X” and final grade instead of “y”. Next we’ll set the test size to .2, which dedicates twenty percent of the data to the test sample. It’s also a good idea to set a random state, which seeds the shuffle that happens before the split so that the same rows land in the training and testing sets every time you run the code.
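The reshaping step can be sketched as follows, with made-up stand-in values. Passing -1 to reshape tells NumPy to work out the number of rows on its own.

```python
import numpy as np

# Hypothetical stand-in data, not the course data set
final_exam = np.array([55, 62, 68, 71, 75, 80, 84, 88, 93, 97])
final_grade = np.array([58, 60, 70, 69, 78, 79, 86, 85, 95, 96])

# Turn each one-dimensional array into a single-column, two-dimensional array
final_exam = final_exam.reshape(-1, 1)
final_grade = final_grade.reshape(-1, 1)
```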
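A sketch of the split is shown below with made-up stand-in data. The four variable names and the random_state value of 0 are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data, already reshaped into single-column arrays
final_exam = np.array([55, 62, 68, 71, 75, 80, 84, 88, 93, 97]).reshape(-1, 1)
final_grade = np.array([58, 60, 70, 69, 78, 79, 86, 85, 95, 96]).reshape(-1, 1)

# Hold out twenty percent of the rows for testing; random_state fixes the shuffle
exam_train, exam_test, grade_train, grade_test = train_test_split(
    final_exam, final_grade, test_size=0.2, random_state=0
)
```

With ten rows, this leaves eight for training and two for testing.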
The next step is to build out the regression model, which includes creating a linear regression object also known as the regressor, then fitting and testing the model. Our first line of code assigns our regressor variable, giving us access to linear regression’s functionality which scikit-learn describes below.
The second line fits the model to our data; in other words, we’re making it “learn” from the training set. Our third and final line of code tests the model by using what it learned during the fitting stage to predict final grades for the test set.
From here we have everything we need to graph our result. We’ll start by creating a scatter plot of the test data, then use the final exam test set and the predicted final grades to draw the regression line.
This graph should look very similar to the previous graphs, but because the model was fit on only the training portion of the data, there will be small variations. In fact, you can go back and change the random state, re-run your code, and you’ll get a slightly different result.
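Those three lines might look like the sketch below, shown together with the setup they depend on. The data is made up, and the names regressor and grade_pred are assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data, already reshaped into single-column arrays
final_exam = np.array([55, 62, 68, 71, 75, 80, 84, 88, 93, 97]).reshape(-1, 1)
final_grade = np.array([58, 60, 70, 69, 78, 79, 86, 85, 95, 96]).reshape(-1, 1)

exam_train, exam_test, grade_train, grade_test = train_test_split(
    final_exam, final_grade, test_size=0.2, random_state=0
)

regressor = LinearRegression()            # create the regressor
regressor.fit(exam_train, grade_train)    # learn from the training set
grade_pred = regressor.predict(exam_test) # predict final grades for the test set
```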
Scikit-learn also allows us to easily extract the slope, intercept, and r squared value with just a few lines of code.
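A hedged sketch of that extraction is shown here with made-up stand-in data. Scoring against the test set is an assumption, since the text above does not say which portion of the data the r-squared value is based on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data, already reshaped into single-column arrays
final_exam = np.array([55, 62, 68, 71, 75, 80, 84, 88, 93, 97]).reshape(-1, 1)
final_grade = np.array([58, 60, 70, 69, 78, 79, 86, 85, 95, 96]).reshape(-1, 1)

exam_train, exam_test, grade_train, grade_test = train_test_split(
    final_exam, final_grade, test_size=0.2, random_state=0
)
regressor = LinearRegression().fit(exam_train, grade_train)

# coef_ and intercept_ are arrays, so pull out the single value from each
slope = float(np.ravel(regressor.coef_)[0])
intercept = float(np.ravel(regressor.intercept_)[0])

# Assumption: r-squared is measured on the held-out test set
r_squared = float(regressor.score(exam_test, grade_test))
```

Keep in mind that with only two test rows in this toy data, the test-set r-squared value is very noisy. A real data set would give a more meaningful number.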
After we run this we can either check the variable explorer pane or pass in a console command to check the results.
This section was a bit more intensive than most, but I wanted to make sure we covered the commonly used regression methods. Moving forward we’ll be using scikit-learn with training and testing splits. But as always, I encourage you to continue to explore the different methods Python has to offer to find what works best for you.
SciPy Linear Regression:
Scikit-Learn Linear Regression: