soft-tfidf

Misspelling-tolerant tf-idf similarity metric

Part of the Dedupe.io cloud service and open source toolset for de-duplicating and finding fuzzy matches in your data.

Prior art

Original Definition

Original definition from "A Comparison of String Distance Metrics for Name-Matching Tasks"

In this definition from "A Comparison of String Distance Metrics for Name-Matching Tasks", I believe that "dist'(w,v) > θ" was meant to be "sim'(w,v) > θ".
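To make the definition concrete, here is a minimal sketch of the soft tf-idf score as I read it (my own illustration, not the dedupe.io implementation). Cohen et al. use Jaro-Winkler for the secondary similarity sim'; for a self-contained example I substitute a normalized Levenshtein similarity, and I assume the caller supplies pre-computed tf-idf weights V(w, ·) as a dict:

```python
def lev(a, b):
    # classic dynamic-programming Levenshtein edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sim(a, b):
    # normalized similarity in [0, 1]; stands in for sim'
    # (the paper uses Jaro-Winkler here)
    return 1 - lev(a, b) / max(len(a), len(b))

def soft_tfidf(s_tokens, t_tokens, weights, theta=0.9):
    # CLOSE(theta, S, T): tokens w in S with some v in T where sim(w, v) > theta.
    # Score = sum over those w of V(w, S) * V(v, T) * D(w, T),
    # where v is w's best match in T and D(w, T) = max_v sim(w, v).
    # `weights` is assumed to map token -> tf-idf weight (hypothetical interface).
    score = 0.0
    for w in s_tokens:
        best_v, d = max(((v, sim(w, v)) for v in t_tokens), key=lambda p: p[1])
        if d > theta:
            score += weights.get(w, 0.0) * weights.get(best_v, 0.0) * d
    return score
```

With theta = 0.8, "illinois" vs. "ilinois" (sim 0.875) still contributes to the score; at theta = 0.9 it is truncated away, which is the sensitivity the questions below poke at.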

Questions

  1. I don't see any necessity for the measure that defines closeness to be the same measure used in D. If that's true, the performance of this measure could be dramatically improved by using a data structure to quickly find strings that are within some Levenshtein distance of the target string. Something like Levenshtein_search
  2. When calculating token frequency and inverse document frequency, do you count each individual string or try to collapse strings? For example, if Illinois appears in 20 documents and Ilinois appears in 5 documents, should we try to combine these into a document frequency of 25 (or something like that)?
  3. When calculating overlapping sets, what do we do with similar tokens that appear in the same document? What if Illinois and Ilinois appear in the same document?
  4. How sensitive is performance to the choice of D?
  5. What is the justification for having the D term? If dist is the probability that w and v were supposed to be the same token, I could see a reason. But in that case, truncating at θ would not be principled, though that's likely true for any meaning of dist.
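On question 1, the idea of decoupling the closeness test from D can be sketched with a BK-tree, a metric tree that uses the triangle inequality to find all stored strings within a given edit distance without comparing against every entry. This is my own illustration of the general technique (Levenshtein_search, linked above, is a separate library with its own API):

```python
def lev(a, b):
    # dynamic-programming Levenshtein edit distance (the metric the tree indexes)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """BK-tree: each edge is labeled with the distance between parent and child."""

    def __init__(self, dist):
        self.dist = dist
        self.root = None  # (word, {edge_distance: child_node})

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.dist(word, node[0])
            if d == 0:
                return  # already present
            if d in node[1]:
                node = node[1][d]
            else:
                node[1][d] = (word, {})
                return

    def query(self, word, radius):
        # return all stored words within `radius` edits of `word`
        if self.root is None:
            return []
        out, stack = [], [self.root]
        while stack:
            node = stack.pop()
            d = self.dist(word, node[0])
            if d <= radius:
                out.append(node[0])
            # triangle inequality: only subtrees whose edge label lies in
            # [d - radius, d + radius] can contain a match
            for edge, child in node[1].items():
                if d - radius <= edge <= d + radius:
                    stack.append(child)
        return out
```

With the vocabulary indexed once, each CLOSE(θ, S, T) lookup becomes a pruned tree search instead of a scan over all tokens, which is where the speedup in question 1 would come from.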
