
Choosing a good threshold

fgregg edited this page Feb 27, 2013 · 8 revisions

Dedupe can predict the probability that a pair of records are duplicates. So when should we decide that a pair of records is a duplicate? To answer this question, we need to know something about precision and recall. Why don't you check out the Wikipedia page on precision and recall and come back?

There's always a trade-off between precision and recall. That's okay. As long as we know how we value precision relative to recall, we can define an F-score that will let us find a threshold for deciding when records are duplicates that is optimal for our priorities.
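Concretely, a weighted F-score is a harmonic mean of precision and recall, with a weight that says how much more we value recall than precision. Here's a minimal sketch (this mirrors the standard F-beta formula; the function name and example numbers are ours, not dedupe's API):

```python
def f_score(precision, recall, weight=1.0):
    """Weighted harmonic mean of precision and recall.

    weight > 1 means we value recall more than precision;
    weight < 1 means we value precision more. With weight = 1
    this is the ordinary F1 score.
    """
    if precision + recall == 0:
        return 0.0
    return ((1 + weight**2) * precision * recall
            / (weight**2 * precision + recall))

# With precision 0.9 and recall 0.6, weighting recall twice as
# heavily pulls the score toward the (lower) recall:
print(f_score(0.9, 0.6))            # ≈ 0.72
print(f_score(0.9, 0.6, weight=2))  # ≈ 0.64
```

Once we fix the weight, "the best threshold" just means the one that maximizes this score.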

Typically, we find that threshold by measuring the actual precision and recall on data for which we know the true labels. However, we will only get a good threshold if the labeled examples are representative of the data we are trying to classify.
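With labeled pairs in hand, measuring precision and recall at a candidate threshold is just counting true positives, false positives, and false negatives. A minimal sketch (the function and example data are ours, for illustration):

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall on labeled record pairs.

    scores: predicted duplicate probabilities, one per pair
    labels: True if the pair is a known duplicate
    """
    predicted = [s >= threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum(not p and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

scores = [0.9, 0.8, 0.4, 0.2]
labels = [True, False, True, False]
print(precision_recall(scores, labels, 0.5))  # (0.5, 0.5)
```

Sweeping the threshold over the observed scores traces out the precision-recall trade-off; but again, these numbers are only trustworthy if the labeled pairs look like the unlabeled ones.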

The labeled examples we make with Dedupe are not at all representative, and that's by design. With active learning, we are not trying to find the most representative examples, but the ones we will learn the most from.

The approach we take is to draw a random sample of blocked data and calculate the pairwise probability that records within each block are duplicates. From these probabilities we can calculate the expected number of duplicate and distinct pairs, and from those, the expected precision and recall.
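The idea behind the expectations: above a given threshold, each pair contributes its probability to the expected count of true duplicates found and a count of 1 to the pairs we'd call duplicates; the sum of all probabilities is the expected total number of true duplicates. A sketch of that calculation, assuming independent pairwise probabilities (our simplification, not dedupe's actual implementation):

```python
def expected_precision_recall(probabilities, threshold):
    """Expected precision and recall at a threshold, treating each
    pairwise probability as the chance the pair is a true duplicate."""
    above = [p for p in probabilities if p >= threshold]
    expected_true_dupes = sum(probabilities)   # over all sampled pairs
    expected_found = sum(above)                # expected true positives
    precision = expected_found / len(above) if above else 0.0
    recall = (expected_found / expected_true_dupes
              if expected_true_dupes else 0.0)
    return precision, recall

def expected_f_score(probabilities, threshold):
    """Unweighted F-score of the expected precision and recall."""
    precision, recall = expected_precision_recall(probabilities, threshold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Sweep the observed probabilities as candidate thresholds and keep
# the one with the best expected F-score (toy data for illustration):
probs = [0.95, 0.9, 0.7, 0.3, 0.1, 0.05]
best = max(probs, key=lambda t: expected_f_score(probs, t))
print(best)  # 0.7
```

No labels are needed here: the model's own probabilities on a representative sample stand in for the unknown true labels, which is what lets us pick a threshold tuned to the data we actually want to classify.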

Back: Grouping duplicates
