In this guide we’re going to continue our discussion of how missing values can be managed. We’ve already touched on one method, which replaces every “not a number” element with the mean, median, or mode. So we’ll expand on that a bit, then introduce a couple of new methods, global constants and data mining algorithms, and finish with when it’s appropriate to simply ignore missing data.
Now, if you remember back to the last guide, when we calculated the mean we used every available sample, which in that instance worked just fine. But think about how the data frame will change over time as more and more data comes in. Eventually there will be hundreds, if not thousands, of different courses, professors, and sections, so using every sample might not make as much sense. Calculating the mean based on relevant or similar classes might suit us better, and I’ll explain what I mean with an example. Let’s say we’re missing data for a couple of different Biometry courses. It probably wouldn’t make much sense to include University Biology in our calculation of the mean. Not only is there a massive discrepancy in difficulty, there’s also a huge difference in sample size. Because University Biology is offered much more frequently than Biometry, it could skew our result and mislead us. So instead, the mean should be calculated using only samples that are relevant. In our example we might use only the rows containing other Biometry courses, or possibly expand that to all the rows containing five-hundred-level courses. Ultimately, there is no one-size-fits-all solution, because it depends on what will represent your data most accurately.
The second option for managing missing data is the use of a global constant value, something like “Unknown” or “N/A”. This technique is often used because it’s simply easier and makes more sense than trying to predict a missing value. Going back to our example, if there is a row where the Fall or Spring semester is missing, our time would probably be better spent adding a global constant than trying to predict the semester, since the semester probably won’t have a big influence on the result of our algorithm.
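In pandas this is a one-liner. The column names below are assumptions for illustration; the technique is just `fillna` with a constant string:

```python
import pandas as pd

# Hypothetical rows where one semester value is missing (names are made up).
df = pd.DataFrame({
    "course":   ["Biometry", "Calculus"],
    "semester": ["Fall", None],
})

# Replace every missing semester with a global constant instead of
# trying to predict it.
df["semester"] = df["semester"].fillna("Unknown")
```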
Now, let’s move on to our third option, which is data mining algorithms. With this approach, missing values are predicted using regression, decision trees, or clustering algorithms. So, unlike using the mean, median, or mode to fill in an average, data mining algorithms allow us to make predictions with a known level of certainty. Going back to our example, if we’re missing grades, instead of simply using an average we could apply a data mining algorithm. Rather than taking inputs from the grade column alone, we could also include the course, professor, semester, and dynamic learning to actually predict grades. This is an option that I love because it uses the most information available in the data set to predict the missing value, which generates a less biased result. But one downside is that time is still a valuable resource, and this can be a lengthy process, especially when we compare it to replacing “not a number” elements with the mean.
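Here’s one way that could look with scikit-learn, using a decision tree as the predictor. Everything below (the column names, the toy values, the choice of `DecisionTreeRegressor`) is an assumption for the sake of a runnable sketch; any regression or clustering model could slot into the same pattern: train on the rows where the grade is known, then predict the rows where it isn’t.

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# Hypothetical data: predict missing grades from other columns
# (column names and values are made up for illustration).
df = pd.DataFrame({
    "course":    ["Biometry", "Biometry", "Biology", "Biology", "Biometry"],
    "professor": ["Smith", "Smith", "Jones", "Jones", "Smith"],
    "grade":     [3.4, 3.6, 2.8, None, None],
})

# One-hot encode the categorical predictors so the model can use them.
X = pd.get_dummies(df[["course", "professor"]])

# Train only on rows whose grade is known...
known = df["grade"].notna()
model = DecisionTreeRegressor(random_state=0)
model.fit(X[known], df.loc[known, "grade"])

# ...then predict grades for the rows where it's missing.
df.loc[~known, "grade"] = model.predict(X[~known])
```

Notice the extra machinery compared to a simple mean fill: encoding the features, fitting a model, predicting. That setup cost is exactly the time trade-off mentioned above.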
Our last option when handling missing data is simply ignoring it. And while this isn’t generally the preferred method, sometimes it really does make sense to just cut your losses. This method is usually used when the class label is missing, or when many of the attributes are missing from the row. Using a new example, let’s say we have a database of current and former students at a university. Some of the information collected includes age, state of residence, Git activity, time spent in the tutor center, time on the phone with a mentor, and finally a column classifying their success as “Low”, “Medium”, or “High”. Let’s say our goal is to build a model predicting a student’s success in college. Rows that are missing the success column, which in this case is the class label, are not useful in predicting success, so they can be ignored and removed before running the algorithm. One of the major downsides to this method is that you reduce your sample size every time you ignore a row. If you’re already working with a massive data frame, that might not be that big of a deal. But if you’re not, every time you shrink your sample size you run the risk of increasing the range of error in your result.
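In pandas, both cases, a missing class label and a mostly empty row, are handled with `dropna`. The student columns below are a shortened, made-up version of the example above:

```python
import pandas as pd

# Hypothetical student records (a trimmed-down version of the example).
df = pd.DataFrame({
    "age":          [19, 22, 20, 25],
    "git_activity": [120, 45, None, 80],
    "success":      ["High", None, "Low", "Medium"],
})

# A row missing the class label can't help train the model, so drop it.
trainable = df.dropna(subset=["success"])

# Separately, drop rows that are missing most of their attributes:
# thresh=3 keeps only rows with at least 3 non-null values.
mostly_complete = df.dropna(thresh=3)
```

Note that each `dropna` call shrinks the sample: of the four original rows, only three keep their class label and only two are fully complete, which is exactly the sample-size trade-off described above.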
Well, that brings us to the end of this guide, and as you can see, there are plenty of options available to you when managing missing values in your data. And as I’ve said, there is no clear-cut way of handling it; it comes down to which method works best for you to achieve the most effective result.