In this guide we'll dig into how supervised learning algorithms are structured and evaluated using training, validation, and testing data sets. When you start building out a supervised learning algorithm, you'll break your dataframe down into three sets: training, validation, and testing. The general rule is that the majority of the data goes to the training set, and the remainder is allocated based on sample size and the model itself.
So, if you have a robust dataframe, I'd say about 80 percent of your data will be used as your training set. Depending on your model and how many hyperparameters it has, you'll split the remainder roughly 10 percent and 10 percent between validation and testing. But keep in mind, if there aren't many hyperparameters to tune, you can probably shift some of that data into the test set. Now, let's get into what all this means, and why it's important to you as a machine learning developer.
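To make that concrete, here's a minimal sketch of an 80/10/10 split using scikit-learn's train_test_split applied twice. The features and labels here are toy data made up purely for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))            # 1,000 samples, 4 features (toy data)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # toy binary labels

# First carve off the 20% that won't be used for training...
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.20, random_state=42)

# ...then split that 20% evenly into validation and test sets.
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.50, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```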
When we talk about supervised learning, and hopefully I'm not oversimplifying it, we're talking about a machine learning task that requires prior knowledge. What I mean is that, for supervised learning to work, we have to know what the output should be for a given set of inputs. That has to be the case because the goal of supervised learning is to approximate the relationship between input and output in the data. An analogy I like to use, because we all have experience with it, is a parent telling their child not to touch a potentially hot stove top. Since the parent has previous experience with their stove, they're able to approximate how hot it may be. And in the future the child can decide what course of action to take based on that information. If they decide to touch the stove, they'll quickly realize, with a high level of certainty, that their parents were right.
Like I said earlier, building out a supervised algorithm has to begin with prior knowledge. This is what we consider the training set. More specifically, the training set is the sample of data used to fit the model. It's the actual data the model sees and learns from, and it's what we use to train the model. In our analogy, the training data comprises all the experiences the parent has had with their stove top. They know that every time they turn the stove on, it gets hot. Thus, turning the stove on is the input, and heat is the output.
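Continuing the toy split from above, fitting a model on the training set alone might look like this; the decision tree is just an arbitrary choice of estimator for the sketch:

```python
from sklearn.tree import DecisionTreeClassifier

# The model only ever "sees" X_train and y_train at this stage;
# the validation and test sets stay untouched.
model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)
```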
The parent also knows that stoves aren't just scorching hot or room temperature; instead, there's a range of temperatures based on a few factors. And that's where validation sets come in. Validation sets are used to provide an unbiased evaluation of how well a model fit on the training data is performing, and they're what we use to fine-tune the hyperparameters to produce a more accurate fit. So, if the parent in our example decided to ignore thermodynamics and concluded that stoves must be hot when the burners are on and cool when the burners are off, we'd probably end up with a poorly fit model and a child with a really sore hand. Thankfully, the parent considered heat dissipation and tuned the hyperparameters, allowing the model trained on that data to more accurately predict the output. Because of these modifications, the uncertainty in the model has been reduced, which presumably decreases the likelihood of the child burning their hand.
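In code, hyperparameter tuning against a validation set can be as simple as looping over candidate values and keeping the one that scores best. This sketch reuses the toy split from above, and the candidate depths are arbitrary:

```python
from sklearn.tree import DecisionTreeClassifier

best_depth, best_val_acc = None, 0.0
for depth in (2, 3, 5, 8):                      # candidate hyperparameter values
    candidate = DecisionTreeClassifier(max_depth=depth, random_state=42)
    candidate.fit(X_train, y_train)             # fit on training data only
    val_acc = candidate.score(X_val, y_val)     # mean accuracy on validation data
    if val_acc > best_val_acc:
        best_depth, best_val_acc = depth, val_acc

print(f"best max_depth: {best_depth} (validation accuracy: {best_val_acc:.2f})")
```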
With a well-fit model in place, the algorithm can be tested, and that's obviously where the testing dataset comes into play. The testing set really has only one main purpose: to evaluate the model. Applying this to our analogy, it's the moment the child decides to touch the stove top. Let's say the burner has been off for twenty minutes and the child finally wants to validate their parents' advice. When they touch it, it's only lukewarm. To the child that kind of makes sense, but how do they know if the stove really gets that hot? So they turn the burner back on and let it heat up. Anyway, I think you get the point. Every time the child decides to touch the burner, they're acting as the test set, evaluating and learning from their parents' prior knowledge.
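Tying the sketch together: once the hyperparameters have been chosen on the validation set, the test set should be touched exactly once, for the final evaluation:

```python
from sklearn.tree import DecisionTreeClassifier

# Refit with the hyperparameter chosen on the validation set, then
# touch the test set exactly once for the final, unbiased evaluation.
final_model = DecisionTreeClassifier(max_depth=best_depth, random_state=42)
final_model.fit(X_train, y_train)
print(f"test accuracy: {final_model.score(X_test, y_test):.2f}")
```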
Hopefully this all makes sense to you and wasn't too simplified, because it's definitely something you'll be dealing with as you move into the workplace. And when handling real-world datasets, it's not always this cut and dried. At some point you'll probably see coworkers using their validation set as their test set, which is not a good practice. Once the validation set has been used to pick hyperparameters, the model has effectively "seen" it, so scores on it are no longer unbiased. And as we've discussed, the test set is generally well curated and contains carefully sampled data that covers the various classes the model would face when used in the real world.