Metrics explained: detection

Once you have trained a model version, and provided your validation set is not empty, you have access to a performance report. For detection projects, several metrics are given to help you understand how good your model is. All of the following metrics are defined for one particular concept (for instance, "dog"):

  • Precision: this quantifies the ability of the model to avoid false alarms, measured as the share of its predictions that are correct (e.g. the probability that the model is right when it predicts a dog).

  • Recall: this quantifies the ability of the model not to miss objects (e.g. the probability that the model catches a dog present in the validation set).

  • IoU score: this evaluates how precise the boxes predicted by the model are.

  • F1-score: this summarizes both precision and recall in a single metric. It is useful for comparing the general performance of two models.

  • LRP: this summarizes precision, recall and IoU score in a single metric. It is useful for comparing the general performance of two models while taking the tightness of boxes into account.

On the Models Library page, one metric is given to summarize how good your model is: the mean LRP score. It is the average of the LRP scores over all concepts.

In the next sections, we define all these metrics more formally.

Preliminary definitions

Before defining the metrics, let's define a few important notions. Suppose we have a detector for dogs and cats.

  • The IoU (intersection over union) of two boxes p and p' quantifies how well these two boxes overlap: it is the area of their intersection divided by the area of their union. It is generally used to check how well a predicted box fits an annotated box. It is defined as follows:

$$IoU(p,p') = \frac{p\cap p'}{p \cup p'}$$

  • A true positive is an object that has been correctly detected. In our case, a dog was present in the image, and the detector predicted a box whose intersection over union with the annotated box is greater than 0.5.

  • A false positive happens when the detector predicted a dog, but in reality there was no dog in that particular box.

  • A false negative happens when there was a dog in the image but the detector missed it (there is no predicted box whose intersection over union with the dog's box is greater than 0.5).
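
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: boxes are taken to be (x_min, y_min, x_max, y_max) tuples, the names iou and count_matches are hypothetical, and the greedy one-to-one matching (in the order predictions are given) is an assumption, since the exact matching procedure is not described here.

```python
from typing import List, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(p: Box, q: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(p[2], q[2]) - max(p[0], q[0]))
    inter_h = max(0.0, min(p[3], q[3]) - max(p[1], q[1]))
    inter = inter_w * inter_h
    union = (p[2] - p[0]) * (p[3] - p[1]) + (q[2] - q[0]) * (q[3] - q[1]) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predicted: List[Box], annotated: List[Box],
                  threshold: float = 0.5) -> Tuple[int, int, int]:
    """Return (TP, FP, FN) for one concept in one image.

    Assumption: each annotated box can be matched by at most one prediction,
    and predictions are matched greedily in the order they are given.
    """
    unmatched = list(annotated)
    tp = 0
    for p in predicted:
        best = max(unmatched, key=lambda a: iou(p, a), default=None)
        if best is not None and iou(p, best) > threshold:
            unmatched.remove(best)
            tp += 1
    fp = len(predicted) - tp   # predictions that matched no annotated box
    fn = len(unmatched)        # annotated boxes that no prediction matched
    return tp, fp, fn
```

For example, count_matches([(0, 0, 2, 2), (10, 10, 12, 12)], [(0, 0, 2, 2.2), (5, 5, 6, 6)]) yields one true positive, one false positive and one false negative.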

Precision

Given a particular concept c, the precision is defined as:

$$\text{precision}(c) = \frac{TP(c)}{TP(c)+FP(c)}$$

A score of 0 for the concept dog means that all the predictions made for the concept dog were false positives. A score of 1 means that all the predictions made for the concept dog were actual dogs.
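
As a minimal sketch, precision can be computed from the true-positive and false-positive counts of one concept. The function name and the choice to return 0.0 when the model made no prediction at all are assumptions made for this illustration.

```python
def precision(tp: int, fp: int) -> float:
    """TP / (TP + FP) for one concept.

    Assumed convention: 0.0 when there are no predictions (TP + FP == 0).
    """
    total = tp + fp
    return tp / total if total else 0.0


# 8 correct dog predictions and 2 false alarms
print(precision(tp=8, fp=2))
```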

Recall

Given a particular concept c, the recall is defined as:

$$\text{recall}(c) = \frac{TP(c)}{TP(c)+FN(c)}$$

A score of 0 for the concept dog means that the detector missed all the dogs in the validation set. A score of 1 means that it did not miss any dog.
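
As a minimal sketch, recall can be computed from the true-positive and false-negative counts of one concept. The function name and the choice to return 0.0 when there is no annotated object are assumptions made for this illustration.

```python
def recall(tp: int, fn: int) -> float:
    """TP / (TP + FN) for one concept.

    Assumed convention: 0.0 when there are no annotated objects (TP + FN == 0).
    """
    total = tp + fn
    return tp / total if total else 0.0


# 8 dogs found, 8 dogs missed
print(recall(tp=8, fn=8))
```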

IoU score

The IoU score is a metric introduced to evaluate how good the detector is at providing precise boxes (a more precise box fits the real object better and hence has a higher IoU). Two detectors could each provide a box for the same dog, but one of these boxes will be more precise: the IoU score makes it possible to compare the two detectors on this criterion. It is defined as follows:

$$IoU(c) = \frac{1}{\#TP}\sum_{(p,p')\in TP(c)} IoU(p,p')$$

Here TP(c) is the set of pairs (p,p') of predicted and annotated boxes which constitute a true positive for the concept c, and #TP is the number of such true positives.
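
A minimal sketch of this average, assuming the IoU of every true-positive pair for the concept has already been computed. The function name and the 0.0 returned when there are no true positives are assumptions made for this illustration.

```python
from typing import Sequence


def iou_score(tp_ious: Sequence[float]) -> float:
    """Mean IoU over the true-positive pairs of one concept.

    Assumed convention: 0.0 when the concept has no true positive.
    """
    if not tp_ious:
        return 0.0
    return sum(tp_ious) / len(tp_ious)


# Three true positives with IoUs of 0.6, 0.8 and 1.0
print(iou_score([0.6, 0.8, 1.0]))
```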

F1-score

The F1-score combines recall and precision into a single metric, defined as the harmonic mean of recall and precision:

$$F_1(c) = 2 \cdot \frac{\text{precision}(c) \cdot \text{recall}(c)}{\text{precision}(c)+\text{recall}(c)}$$

For instance, if model1 has a precision of 1.0 and a recall of 0.5, while model2 has a precision of 0.9 and a recall of 0.7, then:

  • The F1-score of model1 will be 0.67

  • The F1-score of model2 will be 0.79

This provides an easy way to compare both models while giving the same importance to both recall and precision.
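
The comparison above can be reproduced with a short sketch. The function name is hypothetical, and returning 0.0 when both precision and recall are zero is an assumed convention.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Assumed convention: 0.0 when precision + recall == 0.
    """
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# The two models from the example above: (precision, recall)
models = {"model1": (1.0, 0.5), "model2": (0.9, 0.7)}
for name, (p, r) in models.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}")
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a model with a perfect precision but a low recall is penalized more than it would be by a simple average.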

LRP

The LRP score is a way of taking into account the three main metrics: precision, recall and IoU score. The precise definition was introduced in the paper "Localization Recall Precision (LRP): A New Performance Metric for Object Detection", which is available on arXiv.
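
For orientation only, here is a sketch of the LRP error as formulated in that paper, to the best of our reading: each true positive contributes its localization error (1 - IoU), normalized by (1 - tau) where tau is the IoU threshold, and each false positive or false negative contributes 1. In the paper, LRP is an error, so 0 is the best value and 1 the worst. Whether the score shown in the performance report is this raw value or a transformation of it is not stated in this article, so treat the function below (and its name and parameters) as an assumption, not as the platform's implementation.

```python
from typing import Sequence


def lrp_error(tp_ious: Sequence[float], num_fp: int, num_fn: int,
              tau: float = 0.5) -> float:
    """LRP error for one concept, following our reading of the LRP paper.

    tp_ious: IoU of each true-positive pair (each expected to exceed tau).
    Assumed convention: 0.0 when there are no detections and no annotations.
    """
    total = len(tp_ious) + num_fp + num_fn
    if total == 0:
        return 0.0
    # Normalized localization error of the true positives
    localization = sum((1.0 - value) / (1.0 - tau) for value in tp_ious)
    return (localization + num_fp + num_fn) / total


# One true positive with IoU 0.75 and one missed object
print(lrp_error([0.75], num_fp=0, num_fn=1))
```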