A while back, I found myself thinking about various data augmentation techniques for imbalanced datasets, i.e., datasets in which one or more classes are over-represented compared to the others, and wondering how these techniques stack up against each other. So I decided to set up a simple experiment to compare them. This post describes the experiment and its results.
The dataset I chose for this experiment was the SMS Spam Collection Dataset from Kaggle, a collection of almost 5,600 text messages, consisting of 4,825 (87%) ham and 747 (13%) spam messages. The network is a simple 3-layer fully connected network (FCN), whose input is a 512-dimensional vector generated by running the Google Universal Sentence Encoder (GUSE) against the text message, and whose output is the argmax of a 2-dimensional vector (representing "ham" or "spam"). The text augmentation techniques I considered in my experiment are as follows:
- Baseline — this is a baseline for result comparison. Since the task is binary classification, the metric I chose is accuracy. I train the network for 10 epochs using cross-entropy loss and the AdamW optimizer with a learning rate of 1e-3.
- Class Weights — class weights attempt to address data imbalance by giving more weight to the minority class. Here we weight the loss for each class in proportion to the inverse of its count in the training data.
- Undersampling Majority Class — in this scenario, we sample from the majority class as many records as there are in the minority class, and use only the sampled subset of the majority class plus the minority class for training.
- Oversampling Minority Class — this is the opposite scenario, where we sample (with replacement) from the minority class a number of records equal to the number in the majority class. The sampled set will contain repetitions. We then use the sampled set plus the majority class for training.
- SMOTE — this is a variant of the previous technique of oversampling the minority class. SMOTE (Synthetic Minority Oversampling TEchnique) ensures more heterogeneity in the oversampled minority class by creating synthetic records that interpolate between real records. SMOTE requires the input data to be vectorized.
- Text Augmentation — like the two previous approaches, this is another oversampling technique. Heuristics and ontologies are used to make changes to the input text while preserving its meaning as far as possible. I used TextAttack, a Python library for text augmentation (and for generating examples for adversarial attacks).
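To make the setup concrete, here is a minimal PyTorch sketch of the network and the first two scenarios. Only the 512-d input, 2-d output, loss, optimizer, and learning rate come from the post; the hidden layer widths are illustrative guesses.

```python
import torch
import torch.nn as nn

# 3-layer fully connected network: 512-d GUSE vector in, 2 logits out.
# Hidden sizes (256, 64) are illustrative, not taken from the post.
model = nn.Sequential(
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

# Class weights proportional to the inverse of the class counts
# (ham=4825, spam=747), so spam errors are penalized more heavily.
counts = torch.tensor([4825.0, 747.0])
weights = counts.sum() / counts
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

logits = model(torch.randn(8, 512))   # a batch of 8 message vectors
preds = logits.argmax(dim=1)          # 0 = ham, 1 = spam
```

For the baseline scenario, the `weight=` argument is simply omitted from the loss.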
A few points to note here.
First, all the sampling techniques, i.e., all the techniques listed above apart from the Baseline and Class Weights, require you to separate your data into training, validation, and test splits before they are applied. Also, the sampling should be done only on the training split. Otherwise, you risk data leakage, where the augmented data leaks into the validation and test splits, giving you very optimistic results during model development that will invariably not hold up as you move your model into production.
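The leakage-safe ordering can be sketched in a few lines of NumPy (the 70/10/20 ratios are from the post; the toy labels just mimic the 87/13 class balance):

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy imbalanced labels: ~87% ham (0), ~13% spam (1), like the dataset.
y = np.array([0] * 870 + [1] * 130)
idx = rng.permutation(len(y))

# 1) Split FIRST: 70% train, 10% validation, 20% test.
n_train, n_val = int(0.7 * len(y)), int(0.1 * len(y))
train_idx = idx[:n_train]
val_idx = idx[n_train:n_train + n_val]
test_idx = idx[n_train + n_val:]

# 2) Resample ONLY the training split -- val/test stay untouched,
#    so no duplicated or augmented record can leak into evaluation.
train_y = y[train_idx]
minority = train_idx[train_y == 1]
majority = train_idx[train_y == 0]
resampled = rng.choice(minority, size=len(majority), replace=True)
balanced_train_idx = np.concatenate([majority, resampled])
```

Undersampling is the mirror image: sample `len(minority)` records from `majority` instead.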
Second, augmenting your data using SMOTE can only be done on vectorized data, since the idea is to find and use points in feature hyperspace that are "in between" your existing data. For this reason, I decided to pre-vectorize my text inputs using GUSE. The other augmentation approaches considered here do not need the input to be pre-vectorized.
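The interpolation idea behind SMOTE can be illustrated in a few lines of NumPy. This is only a sketch of the idea, not the reference implementation (in practice you would use `imblearn.over_sampling.SMOTE`); the neighbour count `k=5` is the usual default:

```python
import numpy as np

def smote_like(X_min, n_new, k=5, rng=None):
    """Create n_new synthetic minority points, each interpolated between
    a random minority point and one of its k nearest neighbours."""
    if rng is None:
        rng = np.random.default_rng(0)
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        # Euclidean distances from point i to every minority point.
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        neighbours = np.argsort(d)[1:k + 1]   # skip the point itself
        j = rng.choice(neighbours)
        lam = rng.random()                    # interpolation factor in [0, 1)
        out.append(X_min[i] + lam * (X_min[j] - X_min[i]))
    return np.array(out)

# 20 stand-in "spam" vectors in the 512-d GUSE embedding space.
X_min = np.random.default_rng(1).normal(size=(20, 512))
X_new = smote_like(X_min, n_new=40)
```

Because each synthetic point lies on a segment between two real points, the new records stay inside the region occupied by the minority class.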
The code for this experiment is split across two notebooks.
- blog_text_augment_01.ipynb — in this notebook, I split the dataset into a train/validation/test split of 70/10/20, and generate vector representations for each text message using GUSE. I also oversample the minority class (spam) by producing approximately 5 augmentations for each record, and generate their vector representations as well.
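In the notebook the augmentations come from TextAttack; as a dependency-free illustration of the same meaning-preserving idea, here is a toy synonym-replacement augmenter (the synonym table is made up for the example, whereas TextAttack draws on WordNet and embedding neighbourhoods):

```python
import random

# Tiny hand-made synonym table -- purely illustrative.
SYNONYMS = {
    "free": ["complimentary", "gratis"],
    "win": ["earn", "receive"],
    "call": ["phone", "ring"],
    "now": ["today", "immediately"],
}

def augment(text, n=5, seed=0):
    """Return n variants of text with known words swapped for synonyms."""
    rng = random.Random(seed)
    variants = []
    for _ in range(n):
        words = [
            rng.choice(SYNONYMS[w]) if w in SYNONYMS else w
            for w in text.lower().split()
        ]
        variants.append(" ".join(words))
    return variants

augmented = augment("Call now to win a free prize", n=5)
```

Each spam record then contributes roughly 5 extra records, which are vectorized with GUSE just like the originals.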
- blog_text_augment_02.ipynb — I define a common network, which I retrain using PyTorch for each of the 6 augmentation scenarios listed above, and compare their accuracies.
The results are shown below, and seem to indicate that oversampling techniques tend to work best, both the naive one and the one based on SMOTE. The next best choice seems to be class weights. This seems understandable, because oversampling gives the network the most data to train with. That is probably also why undersampling does not work well. I was also a bit surprised that text augmentation did not perform as well as the other oversampling techniques.
However, the differences here are quite small and possibly not really significant (note that the y-axis in the bar chart is truncated (0.95 to 1.0) to highlight them). I also found that the results varied across multiple runs, probably as a result of different initializations. But overall, the pattern shown above was the most common.
Edit 2021-02-13: @Yorko suggested using confidence intervals to address my concern above (see comments below), so I collected the results from 10 runs and computed the mean and standard deviation for each technique across all the runs. The updated bar chart above shows the mean value for each technique, with error bars of +/- 2 standard deviations around the mean. Thanks to the error bars, we can now draw a few more conclusions. First, we observe that SMOTE oversampling can indeed give better results than naive oversampling. It also shows that undersampling results can be highly variable.
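The error-bar computation itself is simple; a sketch with placeholder accuracy numbers (fabricated for illustration, not my actual results):

```python
import numpy as np

# Accuracies from 10 runs per technique -- PLACEHOLDER numbers made up
# to show the computation, not the real results from the experiment.
runs = {
    "baseline": [0.97, 0.96, 0.98, 0.97, 0.96, 0.97, 0.98, 0.97, 0.96, 0.97],
    "smote":    [0.98, 0.99, 0.98, 0.97, 0.98, 0.99, 0.98, 0.98, 0.97, 0.98],
}

for name, accs in runs.items():
    accs = np.array(accs)
    mean, std = accs.mean(), accs.std(ddof=1)   # sample standard deviation
    # Error bars of +/- 2 standard deviations around the mean.
    lo, hi = mean - 2 * std, mean + 2 * std
    print(f"{name}: {mean:.3f} [{lo:.3f}, {hi:.3f}]")
```

When two techniques' intervals do not overlap, the difference is unlikely to be an artifact of initialization.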