Data leakage after using RandomOverSampler
Hi! I have a question: why did you split into train/test only after applying RandomOverSampler to all of the data?
I think the evaluation you do after training will be overly optimistic and may contain a data leak, since duplicated minority-class samples can end up in both splits. (A sketch of the order I would expect is below, after the quoted cells.)
Cell 6
# random oversampling of minority class
# 'y' contains the target variable (label) we want to predict
y = df[['label']]
# Drop the 'label' column from the DataFrame 'df' to separate features from the target variable
df = df.drop(['label'], axis=1)
# Create a RandomOverSampler object with a specified random seed (random_state=83)
ros = RandomOverSampler(random_state=83)
# Use the RandomOverSampler to resample the dataset by oversampling the minority class
# 'df' contains the feature data, and 'y_resampled' will contain the resampled target variable
df, y_resampled = ros.fit_resample(df, y)
# Delete the original 'y' variable to save memory as it's no longer needed
del y
# Add the resampled target variable 'y_resampled' as a new 'label' column in the DataFrame 'df'
df['label'] = y_resampled
Cell 7
dataset = Dataset.from_pandas(df).cast_column("image", Image())
Cell 11
# Creating classlabels to match labels to IDs
ClassLabels = ClassLabel(num_classes=len(labels_list), names=labels_list)
# Mapping labels to IDs
def map_label2id(example):
    example['label'] = ClassLabels.str2int(example['label'])
    return example
dataset = dataset.map(map_label2id, batched=True)
# Casting label column to ClassLabel Object
dataset = dataset.cast_column('label', ClassLabels)
# Splitting the dataset into training and testing sets using a 60/40 split ratio.
dataset = dataset.train_test_split(test_size=0.4, shuffle=True, stratify_by_column="label")
# Extracting the training data from the split dataset.
train_data = dataset['train']
# Extracting the testing data from the split dataset.
test_data = dataset['test']
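For reference, here is a minimal sketch of the order I would have expected, assuming df is the original, not-yet-oversampled frame with its label column still attached (the train_df/test_df names are just placeholders): split first, then oversample only the training split.
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
# Split the raw data first, stratifying on the label so both splits keep the original class ratio
train_df, test_df = train_test_split(
    df, test_size=0.4, shuffle=True, stratify=df['label'], random_state=83
)
# Oversample the minority class only inside the training split;
# the test split keeps its original, untouched distribution
ros = RandomOverSampler(random_state=83)
X_train, y_train = ros.fit_resample(train_df.drop(columns=['label']), train_df['label'])
X_train['label'] = y_train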
Hi @litvan, thank you for the question, it is really appreciated!
You are right, oversampling the data before the train-test split often causes data leakage.
However, in this example the potential leakage is mitigated by augmenting the training data, see e.g.:
# Define a set of transformations for training data
_train_transforms = Compose(
    [
        Resize((size, size)),      # Resize images to the ViT model's input size
        RandomRotation(45),        # Apply random rotation
        RandomAdjustSharpness(2),  # Adjust sharpness randomly
        ToTensor(),                # Convert images to tensors
        normalize                  # Normalize images using mean and std
    ]
)
# Define a set of transformations for validation data
_val_transforms = Compose(
    [
        Resize((size, size)),      # Resize images to the ViT model's input size
        ToTensor(),                # Convert images to tensors
        normalize                  # Normalize images using mean and std
    ]
)
As a result, the model learns from augmented samples rather than exact duplicates, and in practice it generalizes better to completely unseen (test) data than a model trained without oversampling.
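As a minimal sketch of how these transforms are typically wired to the two splits with the datasets library (the apply_train_transforms/apply_val_transforms helper names are just placeholders, and train_data/test_data come from cell 11):
# Apply the augmenting transforms only to the training split and the plain
# resize/normalize transforms to the test split, converting images to RGB first
def apply_train_transforms(examples):
    examples['pixel_values'] = [_train_transforms(img.convert('RGB')) for img in examples['image']]
    return examples
def apply_val_transforms(examples):
    examples['pixel_values'] = [_val_transforms(img.convert('RGB')) for img in examples['image']]
    return examples
train_data.set_transform(apply_train_transforms)
test_data.set_transform(apply_val_transforms)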
An alternative would be to weight the imbalanced classes in the loss instead of oversampling them. See for example cmd21 here:
import torch
from transformers import Trainer

class WeightedTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False):
        labels = inputs.pop("labels")
        # forward pass
        outputs = model(**inputs)
        logits = outputs.get("logits")
        # compute a weighted loss; ordered_weights holds one weight per class, in label-id order
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=torch.tensor(ordered_weights, device=model.device).float()
        )
        loss = loss_fct(logits.view(-1, self.model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
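For completeness, a hypothetical sketch of how ordered_weights could be built from the training split (inverse class frequency is just one possible choice; labels_list and train_data come from the cells above). WeightedTrainer is then constructed with the same arguments as a regular Trainer.
from collections import Counter
# One weight per class, ordered by label id so the entries line up with the logits
label_counts = Counter(train_data['label'])
ordered_weights = [len(train_data) / label_counts[i] for i in range(len(labels_list))]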
Hope this helps.
BR Dima.