My Understanding of Data Separation and Model Evaluation
Part 1: Hold-out Method
- How to separate data?
Now we know that our model has errors, and there could be several sources of error. But how do we identify which one? We have millions of records in the training set, and at least several thousand in the dev set. The test set is not in the picture yet.
We cannot evaluate every record in the training set. Nor can we evaluate each record in the dev set. In order to identify the kind of errors our model generates, we split the dev set into two parts — the eyeball set and the blackbox set.
The eyeball set is the sample we actually evaluate. We can check these records manually to guess the source of errors. So the eyeball set should be small enough to inspect by hand, yet large enough to be a statistically representative sample of the whole dev set.
On analyzing the errors in the eyeball set, we can identify the different error sources and the contribution of each. With this information, we can start working on the major error sources. As we make appropriate fixes, we can go on digging for more error sources.
Note that the analysis should be based on the eyeball set only. If we use the entire dev set for this analysis, we will end up overfitting the dev set. But if the dev set is not big enough, we have to use the whole of it. In such a case, we should just note that we have a high risk of overfitting the dev set — and plan the rest accordingly. (Perhaps we can use a rotating dev set — where we pick a new dev set from the training set on every attempt.)
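The eyeball/blackbox split described above can be sketched as a simple random partition of the dev set. This is a minimal illustration, not a library API; the function name and sizes are made up for the example.

```python
import random

def split_dev_set(dev_records, eyeball_size, seed=42):
    """Randomly split a dev set into an eyeball set (inspected manually
    for error analysis) and a blackbox set (used only for aggregate
    metrics, to limit overfitting to the dev set)."""
    rng = random.Random(seed)
    shuffled = dev_records[:]           # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    eyeball = shuffled[:eyeball_size]   # small enough to check by hand
    blackbox = shuffled[eyeball_size:]  # never inspected record by record
    return eyeball, blackbox

# Example: a 5,000-record dev set with a 500-record eyeball set
dev = list(range(5000))
eyeball, blackbox = split_dev_set(dev, eyeball_size=500)
```

Fixing the random seed makes the split reproducible, so the same eyeball set is analyzed across attempts.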
2. Train/dev/test distribution
Dev and test sets have to come from the same distribution.
Choose the dev and test sets to reflect data you expect to get in the future and that you consider important to do well on.
Setting up the dev set, as well as the evaluation metric, really defines what target you want to aim at.
It is very important that the dev and test sets come from the same distribution, but it can be acceptable for the training set to come from a slightly different distribution.
3. Size of the dev and test sets
An old way of splitting the data was 70% training, 30% test or 60% training, 20% dev, 20% test.
The old way was valid when the number of examples was below roughly 100,000.
In modern deep learning, if you have a million or more examples, a reasonable split is 98% training, 1% dev, 1% test. (https://github.com/mbadry1/DeepLearning.ai-Summary/tree/master/3-%20Structuring%20Machine%20Learning%20Projects#course-summary)
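The 98/1/1 split can be sketched with a shuffle and two slices. This is a toy stdlib version, assuming the records fit in memory; the function name and fractions are illustrative.

```python
import random

def train_dev_test_split(records, dev_frac=0.01, test_frac=0.01, seed=0):
    """Shuffle and split into train/dev/test. With a million or more
    records, 1% dev and 1% test (the 98/1/1 split) is typically plenty."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]   # the remaining ~98%
    return train, dev, test

# One million examples -> 980,000 / 10,000 / 10,000
train, dev, test = train_dev_test_split(list(range(1_000_000)))
```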
4. Stratified Sampling
Note that random subsampling in a non-stratified fashion is usually not a big concern when working with relatively large and balanced datasets. However, in my opinion, stratified resampling is usually beneficial in machine learning applications. Moreover, stratified sampling is easy to implement, and Ron Kohavi provides empirical evidence [Kohavi, 1995] that stratification has a positive effect on the variance and bias of the estimate in k-fold cross-validation, a technique that will be discussed later in this article.
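Stratified sampling simply applies the split separately within each class, so class proportions carry over to both partitions. A minimal stdlib sketch, assuming hashable labels (scikit-learn's `train_test_split(..., stratify=y)` does this in practice):

```python
import random
from collections import defaultdict

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx) such that each class appears in the
    test set in approximately the same proportion as in the full data."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)  # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return train_idx, test_idx

# Imbalanced data: 90 examples of class "a", 10 of class "b"
labels = ["a"] * 90 + ["b"] * 10
train_idx, test_idx = stratified_split(labels, test_frac=0.2)
# The 20-example test set keeps the 9:1 ratio: 18 "a" and 2 "b"
```

A plain random 20% sample of these 100 examples could easily draw 0 or 1 "b" examples, which is exactly the variance problem stratification avoids.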
5. Summary of Hold-out method