In this guide we’re going to use the Help option we discussed previously and apply it to handling missing numerical data in a data frame, filling the gaps with the mean, median, or mode.
In this example we’re going to take a look at similar data to our previous dynamic learning example. But this data is coming from the Biology department because they saw the positive results coming in from the Computer Science department, so they want to implement similar systems. Unfortunately, since you weren't there to oversee the data entry process they ended up having some missing data. And yes, we could just go back to the department and get the actual data, but that wouldn’t serve us very well for this example.
So, to start, we’re going to follow the same steps we have already gone through. We’re going to start by importing our libraries and data frame, then segment our data into independent and dependent variables, and finish by converting them into NumPy arrays. So, if you’d like to pause the video and try to work through the first few steps yourself, that would be great. Okay, let’s take a look at what we have so far, and if you’re not sure how I got here you can go back to our last guide to review.
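If you want something to compare your attempt against, here is a minimal sketch of those first steps. The Biology department's file isn't shown in this guide, so the data frame below is a small hypothetical stand-in, and the column names and values are assumptions made only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the Biology department's data; the real file,
# column names, and values are not shown in this guide.
df = pd.DataFrame({
    "student": ["Ana", "Ben", "Cy", "Dee", "Eli"],
    "lesson": ["cells", "genetics", "cells", "ecology", "genetics"],
    "grade": [78.0, np.nan, 81.0, 90.0, np.nan],
})

# Independent variables (features): every column except the last one.
features = df.iloc[:, :-1].values

# Dependent variable: the last column. Slicing with -1: (not just -1)
# keeps it two-dimensional, one column wide.
dependent_variable = df.iloc[:, -1:].values

print(features.shape)
print(dependent_variable.shape)
```

Keeping the dependent variable two-dimensional here is a choice made for this sketch; it matters later because scikit-learn's imputer expects a matrix rather than a flat array.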
Okay, so now we’re going to run our segmented data and take a look at what we have and what is missing.
So, all the elements in the features matrix are strings, so we won’t worry about that for now. We’ll move on to the dependent variable array, which contains float values as well as four “nan” elements. If you’ve never heard the term nan before, it stands for “not a number” and acts as a placeholder for any missing numerical value in the array. Now, there are a few different ways of handling missing data that we will discuss later, but for now we’re going to use the mean, median, or mode to fill in the missing data. To do this we’re going to introduce a new machine learning library called scikit-learn, which is an incredibly powerful tool for data mining and analysis that’s built on the NumPy, SciPy and matplotlib libraries. We’re only going to be using a single class from the library, so we’re going to start our code with from sklearn.impute import SimpleImputer, then create a variable with imp_mean = SimpleImputer().
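To make the idea of nan a little more concrete, here is a short sketch of how you might locate the missing entries yourself before imputing anything. The array is hypothetical sample data, not the Biology department's actual values.

```python
import numpy as np

# Hypothetical column of values with two missing entries.
grades = np.array([[78.0], [np.nan], [81.0], [90.0], [np.nan]])

# nan is itself a float, so the array keeps a floating-point dtype.
print(grades.dtype)

# nan never compares equal to anything, not even to itself,
# so use np.isnan instead of == to find the missing entries.
print(np.nan == np.nan)

missing_mask = np.isnan(grades)
missing_count = int(missing_mask.sum())
print(missing_count)
```

Counting the missing values first is a handy sanity check, because you can confirm afterwards that the imputer replaced exactly that many elements.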
Remember, I wanted to use the Help pane in this example, so let’s use the shortcut command-i to see what SimpleImputer does. If you take a look at the documentation, it summarizes the SimpleImputer class as an imputation transformer for completing missing values, and it lists the parameters missing_values, strategy, fill_value, verbose, and copy (the exact list can differ between scikit-learn versions). For more information it instructs you to reference the User Guide, and I recommend pausing the video to open the documentation because I will be using it as a reference shortly.
For this example we’re most interested in the strategy parameter, which allows us to fill missing data with the mean, median, or mode, with mean being the default setting. So, inside our parentheses we’re going to add missing_values=np.nan, strategy="mean". Now, we’re going to make a copy of dependent_variable and name it dependent_variable_median, then copy the imp_mean line and put it down here, rename it to imp_median and change the strategy to median as well. Then we’re going to do this again for mode, and the strategy for mode is most_frequent.
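Written out, that step might look something like the sketch below. The names imp_median, imp_mode and dependent_variable_mode are assumed to follow the same naming pattern as imp_mean, and the array is hypothetical sample data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-column dependent variable with missing values.
dependent_variable = np.array([[78.0], [np.nan], [81.0], [81.0], [90.0], [np.nan]])

# Independent copies, so each strategy fills in its own array.
dependent_variable_median = dependent_variable.copy()
dependent_variable_mode = dependent_variable.copy()

# One imputer per strategy; "most_frequent" is the setting for the mode.
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
```

Using .copy() rather than plain assignment matters here: a plain assignment would make all three names point at the same array, so filling one would fill them all.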
Now, I’d like you to take a look at the SimpleImputer User Guide, because we’re now going to use the fit method on only the columns that contain missing data, which in our case is just a single column. Fit is the step where the imputer works out the replacement value from the data we give it. So, we’re going to start with the imp_mean variable, call the fit method and add parentheses. The argument we pass to fit has three parts. First is the NumPy matrix that we’re going to use, so for us that’s going to be dependent_variable, containing our single column of values. Next we need to select rows and columns, so we’ll add our square brackets. We want to use every sample, so we’ll just add our colon, then a comma, and next we’ll set our range of columns, which starts at index zero, so we’ll pass in zero and another colon, giving us dependent_variable[:, 0:]. I’d like to point out that the fit method expects a matrix, not a one-dimensional array, so even though we’re just using a single column we can’t just pass in zero with no colon or an error will be returned. Now, to replace the missing data we’re going to use the fit_transform method, and that calls for the exact same argument as the fit method. So we can copy dependent_variable with the brackets and then set that equal to imp_mean.fit_transform, add the parentheses and then pass in dependent_variable with the same brackets again. Then we’re going to copy this and put it below dependent_variable_median and then again below the mode variable, and where it’s needed we’ll change mean to either median or mode.
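Here is a sketch of the fit and fit_transform calls described above, again on hypothetical sample data rather than the actual Biology data. It also shows what happens if you drop the second colon and hand fit a one-dimensional array.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical one-column dependent variable with two missing values.
dependent_variable = np.array([[78.0], [np.nan], [81.0], [81.0], [90.0], [np.nan]])
dependent_variable_median = dependent_variable.copy()
dependent_variable_mode = dependent_variable.copy()

imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
imp_median = SimpleImputer(missing_values=np.nan, strategy="median")
imp_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent")

# [:, 0] is one-dimensional, and scikit-learn rejects it with a ValueError.
try:
    imp_mean.fit(dependent_variable[:, 0])
    rejected_1d = False
except ValueError:
    rejected_1d = True

# [:, 0:] keeps the column two-dimensional, so fit accepts it.
imp_mean.fit(dependent_variable[:, 0:])

# fit_transform learns the fill value and returns the completed column.
dependent_variable[:, 0:] = imp_mean.fit_transform(dependent_variable[:, 0:])
dependent_variable_median[:, 0:] = imp_median.fit_transform(dependent_variable_median[:, 0:])
dependent_variable_mode[:, 0:] = imp_mode.fit_transform(dependent_variable_mode[:, 0:])

print(dependent_variable.ravel())
print(dependent_variable_median.ravel())
print(dependent_variable_mode.ravel())
```

Because fit_transform already does the fitting, the separate fit call isn't strictly required in this sketch; it is kept only to mirror the steps in the walkthrough.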
Now, when we run this our nan elements should all be replaced by either the mean, median or mode.
As you can see, everything worked perfectly because the four nan elements have all been replaced according to the corresponding strategy. When comparing the three, we can see the median and mode both returned the value of 81 to replace the missing data, while the mean was just a bit higher and isn’t a whole number, since it’s an average of all the values rather than one of them. And as I said at the beginning of this guide, this isn’t the only way to manage missing data. Ultimately, the method you choose should best represent the data you’re working with to ensure the most accurate result possible.
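If you’d like to compare the strategies without scanning the filled arrays by eye, one option is to read the fill value each imputer learned from its statistics_ attribute. The sketch below uses made-up values chosen so that a low outlier pulls the mean away from the median and mode; it is not the data from the video.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical scores with one low outlier and one missing value.
scores = np.array([[60.0], [81.0], [81.0], [95.0], [100.0], [np.nan]])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputer.fit(scores)
    # statistics_ holds one learned fill value per column.
    fill_values[strategy] = float(imputer.statistics_[0])

for strategy, value in fill_values.items():
    print(strategy, value)
```

Seeing the three fill values side by side can help you judge which one best represents your data before you commit to a strategy.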
FINAL CODE: