soft-tfidf

Misspelling-tolerant tf-idf similarity metric

Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data.

Prior art

Original Definition

Original definition from "A Comparison of String Distance Metrics for Name-Matching Tasks"

In this definition from "A Comparison of String Distance Metrics for Name-Matching Tasks", I believe that "dist'(w,v) > θ" was meant to be "sim'(w,v) > θ".
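To make the definition concrete, here is a minimal sketch of the soft tf-idf score as I read it (my own illustration, not the dedupe.io implementation). Cohen et al. use Jaro-Winkler for the secondary similarity sim'; for a self-contained example I substitute a normalized Levenshtein similarity, and I assume the caller supplies pre-computed tf-idf weights V(w, ·) as a dict:

```python
def lev(a, b):
    # classic dynamic-programming Levenshtein edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim(a, b):
    # normalized similarity in [0, 1]; stands in for sim'
    # (the paper uses Jaro-Winkler here)
    return 1 - lev(a, b) / max(len(a), len(b))

def soft_tfidf(s_tokens, t_tokens, weights, theta=0.9):
    # CLOSE(theta, S, T): tokens w in S with some v in T where sim(w, v) > theta.
    # Score = sum over those w of V(w, S) * V(v, T) * D(w, T),
    # where v is w's best match in T and D(w, T) = max_v sim(w, v).
    # `weights` is assumed to map token -> tf-idf weight (hypothetical interface).
    score = 0.0
    for w in s_tokens:
        best_v, d = max(((v, sim(w, v)) for v in t_tokens), key=lambda p: p[1])
        if d > theta:
            score += weights.get(w, 0.0) * weights.get(best_v, 0.0) * d
    return score
```

With theta = 0.8, "illinois" vs. "ilinois" (sim 0.875) still contributes to the score; at theta = 0.9 it is truncated away, which is the sensitivity the questions below poke at.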

Questions

  1. I don't see any necessity for the measure that defines closeness to be the same measure used in D. If that's true, the performance of this measure could be dramatically improved by using a data structure to quickly find strings that are within some Levenshtein distance of the target string. Something like Levenshtein_search
  2. When calculating token frequency and inverse document frequency, do you count each individual string or try to collapse strings? For example, if Illinois appears in 20 documents and Ilinois appears in 5 documents, should we try to combine these into a document frequency of 25 (or something like that)?
  3. When calculating overlapping sets, what do we do with similar tokens that appear in the same document? What if Illinois and Ilinois appear in the same document?
  4. How sensitive is performance to the choice of D?
  5. What is the justification for having the D term? If dist is the probability that w and v were supposed to be the same token, I could see a reason. But in that case, truncating at θ would not be principled, though that's likely true for any meaning of dist.
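On question 1, the idea of decoupling the closeness test from D can be sketched with a BK-tree, a metric tree that uses the triangle inequality to find all stored strings within a given edit distance without comparing against every entry. This is my own illustration of the general technique (Levenshtein_search, linked above, is a separate library with its own API):

```python
def lev(a, b):
    # dynamic-programming Levenshtein edit distance (the metric the tree indexes)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """BK-tree: each edge is labeled with the distance between parent and child."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d == 0:
                return  # already present
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, radius):
        # return all stored words within `radius` edits of `word`
        if self.root is None:
            return []
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = self.dist(word, node[0])
            if d <= radius:
                out.append(node[0])
            # triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return out
```

With the vocabulary indexed once, each CLOSE(θ, S, T) lookup becomes a pruned tree search instead of a scan over all tokens, which is where the speedup in question 1 would come from.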
