We touched on importing data very briefly when we covered the Pandas library. Now I’d like to discuss it in a little more detail and also talk about how we can segment our independent and dependent variables.
If you remember back a few sections, we used the read_csv function in the Pandas library to import our data frame. We’ll start the same way by creating our data frame variable, df, and then using read_csv to load the file we want to work with. There are a couple of ways to pass in the file. If you know the file path, you can type it in manually. What I usually do is drag the file into the terminal, which displays the file path, and then copy it and pass it in that way.
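As a minimal sketch of that import step: the file name, column names, and values below are hypothetical stand-ins, and the example writes a tiny CSV to a temporary folder first so it can run without the course data. In practice you would skip that part and pass your own file path straight to read_csv.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in data; the real file and its columns will differ.
csv_text = "hours_studied,average_grade\n2,71\n5,88\n"

with tempfile.TemporaryDirectory() as tmp:
    file_path = Path(tmp) / "students.csv"  # hypothetical file name
    file_path.write_text(csv_text)

    # read_csv accepts the path as a string or a Path object.
    df = pd.read_csv(file_path)

print(df.head())
```

If you paste a Windows path by hand, a raw string such as r"C:\data\students.csv" keeps the backslashes from being read as escape characters.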
I’d also like to quickly mention that while we’ve only worked with CSV files, Pandas can also read other text formats like JSON and HTML, binary formats like Excel and Parquet, and SQL databases. The syntax for all the read functions can be found in the Pandas documentation, in the IO tools section of the User Guide and the Input/output section of the API reference.
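To give a feel for how similar the other readers are, here is a small sketch using read_json. The JSON text and its field names are made up for illustration, and it is wrapped in StringIO so the example runs without a file on disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical JSON records standing in for a real .json file.
json_text = (
    '[{"hours_studied": 2, "average_grade": 71},'
    ' {"hours_studied": 5, "average_grade": 88}]'
)

# read_json accepts a path or a file-like object, much like read_csv.
df_json = pd.read_json(StringIO(json_text))

print(df_json)
```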
Once we have our data frame imported, we can move forward with segmenting the data. The first variable I’m going to name “features,” and it will contain all of our independent variables. From our data frame we’ll call the loc indexer like we have in the past. We start by passing in the rows we want to access, which is all of them, so we pass in a colon, and then we add a list of the columns we want, which is everything except “average_grade.” Our second variable is going to be called “dependent_variable.” We pass in our data frame again, but this time we use something new called iloc, which uses integer-location based indexing. So instead of passing in strings, we’ll use integers that refer to index positions, counting from zero. Again, we want all rows, so we pass in a colon, and we only want average_grade, so we pass in its index position, which is 5. We could have used iloc for the features variable as well, but I wanted to show you a couple of different ways this can be done.
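Here is a rough sketch of that segmentation on a small made-up data frame. Only average_grade comes from the lesson; the other column names and all the values are assumptions, and because this toy frame has just three columns, average_grade sits at position 2 rather than 5.

```python
import pandas as pd

# Hypothetical data; only the "average_grade" column name is from the lesson.
df = pd.DataFrame({
    "hours_studied": [2, 5, 8],
    "classes_missed": [4, 1, 0],
    "average_grade": [71, 88, 95],
})

# loc: label-based. A colon selects every row, the list selects columns by name.
feature_columns = ["hours_studied", "classes_missed"]
features = df.loc[:, feature_columns]

# iloc: position-based. Look up the zero-based position instead of hard-coding it.
grade_position = df.columns.get_loc("average_grade")
dependent_variable = df.iloc[:, grade_position]

print(features)
print(dependent_variable)
```

Looking up the position with get_loc is just one option. If you already know the position, passing the integer directly, as in the lesson, works the same way.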
If we were to run this right now, what we’d get back are two separate Pandas objects, but what I’d like to do is convert them into NumPy arrays, which will allow us to work with nested arrays. There are a few ways to do this. One common approach you’ll probably see uses the values attribute, which returns only the values of the data frame in the form of an array. However, the Pandas documentation recommends the to_numpy method instead, and being the good developers we are, we’ll follow the recommended syntax. One thing I’d like to point out, which you may have noticed already, is that we’re not using the NumPy alias np. That’s because to_numpy is a method that belongs to the Pandas object, so we’re not calling the NumPy library directly and don’t need the alias.
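A quick sketch comparing the two approaches, using a made-up two-column data frame. Notice that NumPy is never imported here, even though both results are NumPy arrays.

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"hours_studied": [2, 5], "average_grade": [71, 88]})

from_values = df.values        # older style: an attribute, so no parentheses
from_to_numpy = df.to_numpy()  # recommended style: a method call

print(type(from_to_numpy).__name__)
print((from_values == from_to_numpy).all())
```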
With all that being said, we can add the to_numpy method at the end of each of our variables and return the first 5 rows. Remember, we’re not working with a Pandas data frame anymore, so we’ll pass in an index slice instead of calling the head method. I’m also going to add a couple more lines so we can see the full shape of each array.
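Putting those pieces together on a small made-up data frame might look like the sketch below. The column names other than average_grade, the values, and the position 2 are all assumptions for this toy data, not the course file.

```python
import pandas as pd

# Hypothetical six-row data set standing in for the course file.
df = pd.DataFrame({
    "hours_studied": [2, 5, 8, 3, 6, 1],
    "classes_missed": [4, 1, 0, 3, 2, 5],
    "average_grade": [71, 88, 95, 76, 85, 64],
})

# Chain to_numpy onto each selection to get arrays instead of Pandas objects.
features = df.loc[:, ["hours_studied", "classes_missed"]].to_numpy()
dependent_variable = df.iloc[:, 2].to_numpy()

# Arrays have no head method, so slice the first five rows instead.
print(features[:5])
print(dependent_variable[:5])

# shape is an attribute that reports the dimensions of each array.
print(features.shape)
print(dependent_variable.shape)
```

One detail worth knowing: selecting a single column with a plain integer gives a one-dimensional array, so its shape here prints as (6,). If you need a two-dimensional array with one column, passing a list such as [2] to iloc is one way to get shape (6, 1).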
What we’re getting back is exactly what we wanted. We segmented our original data into independent and dependent variables and then successfully converted them into arrays. The first array has 48 rows and 6 columns, and the second has 48 rows with only one column, which again is exactly what we were looking for.
FINAL CODE: