Dataset options
This option helps balance your dataset by boosting the representation of underrepresented concepts.
Each region in a dataset can have:
One label in classification and detection tasks
Multiple labels in tagging tasks
The goal of class balancing is to achieve a more uniform distribution of labels across all regions. The algorithm follows a loss-based approach to do this:
It first assesses the initial label distribution (e.g., {'cat': 54, 'dog': 13, 'horse': 79}).
It calculates an entropy value to measure how uniform the distribution is.
Each data point is assigned a score based on its individual loss, which reflects how prevalent its associated labels are in the dataset.
The algorithm ranks samples by prevalence and selects the least represented ones to add to the dataset. It then updates the loss values accordingly.
This process repeats until adding more samples no longer improves entropy, meaning the dataset has reached its best possible balance.
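The entropy measure above can be sketched in Python. This is an illustrative Shannon-entropy formulation, not necessarily the platform's exact implementation:

```python
import math

def entropy(counts):
    """Shannon entropy of a label distribution; higher means more uniform."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c)

# A skewed distribution scores lower than a uniform one (max = log(3) here).
entropy({'cat': 54, 'dog': 13, 'horse': 79})  # ≈ 0.92
entropy({'cat': 50, 'dog': 50, 'horse': 50})  # = log(3) ≈ 1.10
```

Balancing raises this value toward its maximum, which for n labels is log(n).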
To ensure efficient balancing, the algorithm follows these rules:
You can define a maximum expansion ratio, which limits how much larger the new dataset can be compared to the original.
To avoid overfitting, the same samples are not reused repeatedly.
A perfectly balanced dataset is not always achievable. For example, if a label appears in only two samples ({A -> 55, B -> 41, C -> 2}), it cannot be duplicated excessively to match other labels. Balancing stops when no further improvements in entropy can be made.
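Putting the steps and rules above together, a minimal greedy sketch might look like the following. The sample format, scoring rule, and function names are illustrative assumptions, not the platform's API:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of a label distribution; higher means more uniform."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c)

def balance(samples, max_ratio=2.0):
    """Greedy balancing sketch. `samples` maps a sample id to its labels."""
    counts = Counter(l for labels in samples.values() for l in labels)
    selected = list(samples)  # the original samples always stay in
    # Rank candidates by how common their labels are: rarest first.
    pool = sorted(samples, key=lambda s: sum(counts[l] for l in samples[s]))
    for s in pool:  # each sample is considered at most once, never reused
        if len(selected) >= max_ratio * len(samples):
            break  # maximum expansion ratio reached
        trial = counts + Counter(samples[s])
        if entropy(trial) <= entropy(counts):
            break  # duplicating more samples no longer improves entropy
        selected.append(s)
        counts = trial
    return selected
```

With {'A': 3, 'B': 1, 'C': 1} as the starting distribution, the sketch duplicates the B and C samples once each and then stops, because a further A duplicate would lower entropy.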
This option allows you to enrich your training data by integrating external datasets. It is available for tagging and detection models.
You can select a third-party dataset, such as COCO, and define the number of records to include in your training set. This helps increase dataset diversity and improve model generalization by introducing additional labelled samples. By leveraging external data, you can compensate for underrepresented concepts in your dataset, improve robustness, and reduce biases.
You can train on a dataset whose images are regions, or crops, of a parent "Detection" view; this sub view can be any kind of task. Normally, when training on a view whose parent is a detection view, each region is cropped out of the original image using the coordinates from the parent view. This parameter expands that crop: the sample region is enlarged from the original image by the given percentage.
This can be useful if important elements of context are situated next to the bounding box.
Example: you are tasked with detecting animals. You create a first view to detect any animal, and a second one to classify the kind of animal.
Your parent view correctly detects animal instances in bounding boxes. Next, you need to predict which kind of animal each one is. For that task, the tail could be useful, but unfortunately the bounding boxes did not include the animals' tails.
Adding a crop margin will allow the training engine to take a larger crop from the image and include the tail.
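As an illustration of the geometry, a crop margin could be applied like this. The coordinate convention and function name are assumptions for the sketch, not the platform's API:

```python
def expand_crop(box, margin_pct, img_w, img_h):
    """Expand a detection box by margin_pct percent of its size on every
    side, clamped to the original image bounds.
    `box` is (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    dx = (x2 - x1) * margin_pct / 100
    dy = (y2 - y1) * margin_pct / 100
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(img_w, x2 + dx), min(img_h, y2 + dy))

# A 100x100 box with a 20% margin grows by 20 px on each side,
# unless that would leave the image.
expand_crop((50, 50, 150, 150), 20, 640, 480)  # → (30.0, 30.0, 170.0, 170.0)
```

Clamping matters near the borders: a box touching the image edge simply keeps that edge instead of reading pixels that do not exist.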