Given a set of True/False predictions and corresponding True/False instances, precision represents the fraction of True predictions that are correct. It is sometimes referred to as positive predictive value.
Given a set of True/False predictions and corresponding True/False instances, recall represents the fraction of True instances that are correctly predicted True. It is sometimes referred to as sensitivity.
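As a quick illustration, here is a minimal R sketch computing precision and recall from logical prediction and truth vectors. The preds and truths values below are made up for demonstration.

```r
# Made-up predictions and ground-truth labels (TRUE / FALSE)
preds  <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
truths <- c(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE)

TP <- sum(preds & truths)    # predicted True, actually True
FP <- sum(preds & !truths)   # predicted True, actually False
FN <- sum(!preds & truths)   # predicted False, actually True

precision <- TP / (TP + FP)  # accuracy of the True predictions -> 2/3
recall    <- TP / (TP + FN)  # coverage of the True instances   -> 2/3
```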
F1 Score
Here, precision and recall are considered equally important.
$$F_{1} = \frac{2}{\frac{1}{\mathrm{precision}} + \frac{1}{\mathrm{recall}}} = 2 \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
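A minimal R helper mirroring this formula; the precision and recall values passed in below are arbitrary example inputs.

```r
# F1: harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

f1_score(precision = 1.00, recall = 0.67)  # ~0.80
```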
Fβ Score (Generalized F-Score)
Here, recall is considered $\beta$ times as important as precision.
$$F_\beta = (1 + \beta^2) \cdot \frac{\mathrm{precision} \cdot \mathrm{recall}}{(\beta^2 \cdot \mathrm{precision}) + \mathrm{recall}}$$
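A corresponding sketch for the generalized score; the inputs are again arbitrary example values.

```r
# F-beta: beta > 1 weights recall more heavily than precision
f_beta <- function(precision, recall, beta = 1) {
  (1 + beta^2) * precision * recall / (beta^2 * precision + recall)
}

f_beta(precision = 1.00, recall = 0.67, beta = 2)  # ~0.72, recall weighted more heavily
```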
The Precision-Recall Curve plots precision vs recall for a range of classification thresholds. For example, suppose we have the following predicted scores and truths.
Here, recall = 0.67 and precision = 1.00. This pair of values, (0.67, 1.00), is one point on the Precision-Recall curve. Repeating this process for a range of classification thresholds, we can produce a Precision-Recall curve like the one below.
A and B have identical prediction classes, and therefore have identical precision, recall, and F-scores. However, suppose we reveal A's and B's prediction scores (in this case, probabilities).
| scores_A | scores_B | truths |                                            |
|:--------:|:--------:|:------:|:-------------------------------------------|
|   0.95   |   0.55   |  True  | <- A & B are correct, only A is confident   |
|   0.85   |   0.59   |  True  | <- A & B are correct, only A is confident   |
|   0.73   |   0.88   | False  | <- A & B are incorrect, only B is confident |
|   0.62   |   0.97   | False  | <- A & B are incorrect, only B is confident |
|   0.48   |   0.20   |  True  | <- A & B are incorrect, only B is confident |
|   0.39   |   0.09   |  True  | <- A & B are incorrect, only B is confident |
|   0.12   |   0.43   | False  | <- A & B are correct, only A is confident   |
|   0.04   |   0.32   | False  | <- A & B are correct, only A is confident   |
The classification threshold in this example is 0.5. That is, `preds_A = scores_A >= 0.5` and `preds_B = scores_B >= 0.5`.
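For completeness, a short R sketch that reproduces the (identical) precision and recall of models A and B from the table above, using the 0.5 threshold:

```r
# Scores and truths from the table above
scores_A <- c(0.95, 0.85, 0.73, 0.62, 0.48, 0.39, 0.12, 0.04)
scores_B <- c(0.55, 0.59, 0.88, 0.97, 0.20, 0.09, 0.43, 0.32)
truths   <- c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

preds_A <- scores_A >= 0.5
preds_B <- scores_B >= 0.5

precision_recall <- function(preds, truths) {
  c(precision = sum(preds & truths) / sum(preds),
    recall    = sum(preds & truths) / sum(truths))
}

precision_recall(preds_A, truths)  # precision = 0.5, recall = 0.5
precision_recall(preds_B, truths)  # identical, despite very different scores
```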
Observe:

- When model A was confident in its predictions, it was correct.
- When model B was confident in its predictions, it was incorrect.
Intuitively, this suggests model A is better than model B. This is depicted visually by the fact that A's Precision-Recall curve sits consistently above B's Precision-Recall curve.
This desire to score ranked predictions leads to Average Precision, which measures the average precision value on the Precision-Recall curve over the recall range [0, 1].
Average Precision equals the area under the Precision-Recall curve.
$$\operatorname{AveP} = \int_0^1 p(r)\,dr$$
where $p(r)$ is the precision at recall $r$.
This integral can be discretized as
$$\operatorname{AveP} = \sum_{k=1}^n P(k)\,\Delta r(k)$$
where $k$ is the rank in the sequence of prediction scores (ordered high to low), $n$ is the total number of samples, $P(k)$ is the precision at cut-off $k$ in the list, and $\Delta r(k)$ is the change in recall from item $k-1$ to item $k$.
Geometrically, this formula represents a Riemann sum approximation to the area under the Precision-Recall curve, using rectangles with height $P(k)$ and width $\Delta r(k)$.
R code to reproduce this plot
```r
library(data.table)
library(ggplot2)

set.seed(4)

# Simulate prediction scores: 6 true instances and 10 false instances
trues  <- rnorm(6)
falses <- rnorm(10, -1)

dt <- data.table(
  pred_score = round(c(trues, falses), 3),
  truth = c(rep(T, length(trues)), rep(F, length(falses)))
)

# Sort by prediction score, descending
setorderv(dt, "pred_score", order = -1)

# Cumulative confusion counts at each threshold (cut-off)
dt[, `:=`(
  TP = cumsum(truth == T),
  FP = cumsum(truth == F),
  FN = c(tail(rev(cumsum(rev(truth))), -1), 0)
)]

dt[, `:=`(
  precision = TP/(TP + FP),
  recall = TP/(TP + FN)
)]
dt[, recall_prev := shift(recall, type = "lag", fill = 0)]

ggplot(dt, aes(x = recall, y = precision))+
  geom_rect(aes(xmin = recall_prev, xmax = recall, ymin = 0, ymax = precision),
            alpha = 0.1, color = "black")+
  geom_line(size = 1.5, color = "light blue")+
  geom_point(size = 2, color = "red")+
  labs(x = "Recall", y = "Precision", title = "Precision-Recall Curve",
       subtitle = "Riemann sum approximation to area under the PR curve")+
  ylim(0, 1)+
  xlim(0, 1)+
  theme_bw()+
  theme(text = element_text(size = 16))
```
Since recall ranges from 0 to 1, this can be interpreted as a weighted sum of Precisions whose weights are the widths of the rectangles (i.e. the changes in recall from threshold to threshold), hence the name Average Precision.
Furthermore, every non-zero-width rectangle has the same width. Stated differently, each positive change in recall is the same size. Thus, Average Precision can be described as the arithmetic mean of the precision values at the ranks of the True instances (relevant documents).
$$\operatorname{AveP} = \frac{\sum_{k=1}^{n} P(k) \times \operatorname{rel}(k)}{\text{number of true instances}}$$
where $\operatorname{rel}(k)$ is an indicator function equaling 1 if the item at rank $k$ is a true instance and 0 otherwise.
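To make the equivalence concrete, here is a small R sketch (made-up scores and truths, no tied scores) showing that the rectangle-sum form and the mean-precision-over-true-instances form give the same number:

```r
# Made-up prediction scores and truths
scores <- c(0.92, 0.81, 0.74, 0.55, 0.41, 0.30)
truths <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)

# Order by prediction score, high to low
truths <- truths[order(scores, decreasing = TRUE)]

TP        <- cumsum(truths)
precision <- TP / seq_along(truths)   # P(k) at each cut-off
recall    <- TP / sum(truths)         # recall at each cut-off
delta_r   <- diff(c(0, recall))       # change in recall, delta r(k)

sum(precision * delta_r)               # Riemann-sum form          -> ~0.806
sum(precision * truths) / sum(truths)  # mean precision over trues -> ~0.806
```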
Some authors use interpolated precision, whereby the Precision-Recall curve is transformed such that the precision at recall $r$ is taken to be the $\max(\mathrm{precision})$ over all $\mathrm{recall} \ge r$.
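One way to compute interpolated precision in R is a reversed cumulative maximum. The precision and recall values below are hypothetical, and the vectors are assumed to be ordered by increasing recall:

```r
# Hypothetical precision/recall pairs, ordered by increasing recall
recall    <- c(0.25, 0.50, 0.75, 1.00)
precision <- c(1.00, 0.67, 0.75, 0.57)

# Interpolated precision at recall r = max precision over all recall >= r:
# reverse, take the running max, reverse back
interp_precision <- rev(cummax(rev(precision)))
interp_precision  # 1.00 0.75 0.75 0.57
```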
Precision at k answers the question: what percent of the top k ranked predictions are true instances? This is useful in settings like information retrieval, where one only cares about the top k ranked documents returned by a search query.
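A rough sketch of Precision at k in R, using made-up scores and relevance labels:

```r
# Made-up prediction scores and true relevance labels
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4)
truths <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

precision_at_k <- function(scores, truths, k) {
  top_k <- order(scores, decreasing = TRUE)[1:k]
  mean(truths[top_k])  # fraction of the top k predictions that are true instances
}

precision_at_k(scores, truths, k = 3)  # 2/3
```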
Suppose we build a system that predicts which movies are relevant to a particular search query (e.g. "dog movies"). Our database contains 12 movies which receive the following prediction scores and prediction classes.
What if there are fewer than $k$ True instances (relevant documents)?
This scenario presents a drawback to $\text{Precision @ k}$. If there are a total of $n$ True instances (relevant documents) where $n < k$, the model will have at most $\text{Precision @ k} = \frac{n}{k} < 1$.
What if there are fewer than $k$ True predictions (retrieved documents)?
If a model returns $m < k$ True predictions (retrieved documents), one possible implementation is to treat the next $k - m$ predictions as False Positive predictions (irrelevant documents).
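A sketch of that convention: the denominator stays at k even when only m < k documents are retrieved, so the unfilled slots count against precision. The helper below is hypothetical, not a standard library function.

```r
# Precision @ k when only m <= k documents are retrieved;
# the k - m missing slots are treated as false positives
precision_at_k_padded <- function(retrieved_relevant, k) {
  # retrieved_relevant: logical vector of length m, TRUE if the retrieved doc is relevant
  sum(retrieved_relevant) / k
}

precision_at_k_padded(c(TRUE, TRUE), k = 5)  # 2/5, even though both retrieved docs are relevant
```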
Average Precision at k represents Average Precision amongst the top k classification thresholds (cut-offs). It can also be described as the area under the Precision-Recall curve restricted to the top k thresholds, normalized by the lesser of k and the total number of true instances (relevant documents).
Ambiguity
The term Average Precision at k usually means (Average Precision) @ k, not Average (Precision @ k). These are distinctly different!
First, sum the precision values at the true (relevant) ranks among the top $k = 3$ predictions:

$$\sum_{i=1}^{3} P(i) \times \operatorname{rel}(i) = 1.00 + 1.00 = 2.00$$
Normalize by the lesser of $k$ and the total number of relevant documents (true instances).
In this example, the total number of true instances is 4.
$$\operatorname{AP@k} = \frac{\sum_{i=1}^{3} P(i) \times \operatorname{rel}(i)}{\min(k, \text{total true instances})} = \frac{2.00}{\min(3, 4)} = \frac{2}{3} \approx 0.67$$
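A sketch of the full AP@k calculation in R. The ranked truths below are hypothetical (not the movie example's data); they simply illustrate a case with k = 3 and 4 true instances in total.

```r
# Average Precision at k for a vector of truths already ordered by prediction score (high to low)
average_precision_at_k <- function(truths_ranked, k) {
  top_k     <- truths_ranked[1:k]
  precision <- cumsum(top_k) / seq_len(k)              # P(i) at each cut-off i <= k
  sum(precision * top_k) / min(k, sum(truths_ranked))  # normalize by lesser of k and total trues
}

truths_ranked <- c(TRUE, FALSE, TRUE, FALSE, TRUE, TRUE)  # 4 true instances in total
average_precision_at_k(truths_ranked, k = 3)              # (1.00 + 0.67) / min(3, 4) ~= 0.56
```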
Intuition
Observe the Precision-Recall curve for this example.
Average Precision measures the area under the curve, which we can approximate with a Riemann sum.
Average Precision at 3 corresponds to area beneath the precision recall curve restricted to the first three cutoffs in the ordered predictions. In other words, the area beneath the curve to the left of the third point.
However, the rectangle widths are rescaled to have width $\frac{1}{3}$ so that the maximum achievable area beneath the curve in this range is 1. This is the purpose of the denominator, $\min(k, \text{total true instances})$, in the formula for $\operatorname{AP@k}$.