[py-tx] Implement a new cleaner PDQ index solution from scratch #1613
Labels
help wanted
mlh
Related to Major League Hacking Fellowship
pdq
Items related to the pdq libraries or reference implementations
python-threatexchange
Items related to the threatexchange python tool / library
When we built the PDQ index, it was our first attempt, and we made a lot of strange/bad choices.
Namely:
I think we could provide a second implementation that is a lot simpler, which we could then find a way to swap.
The key elements:
Pass in the index type as an argument during construction
Simplify the stored state of the index implementation
Use a simpler inner wrapper to handle some of the PDQ details
class _PDQHashIndex:
    """
    A wrapper around the faiss index for pickle serialization
    """
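To make the wrapper idea concrete, here is a minimal sketch of what a simplified inner index with a small pickled state could look like. It is not the real implementation: `PDQHashIndexSketch` is a hypothetical name, and a plain linear Hamming scan stands in for faiss so the example runs with only the standard library. The state layout (one bytes blob of 32-byte hashes) is also an assumption.

```python
from typing import Dict, List, Tuple

PDQ_HEX_LEN = 64  # a PDQ hash is 256 bits, written as 64 hex characters


class PDQHashIndexSketch:
    """Hypothetical stand-in for the inner wrapper: a linear Hamming scan."""

    def __init__(self) -> None:
        self._hashes: List[int] = []

    def add(self, pdq_hex: str) -> int:
        """Store one hash and return its position in the index."""
        if len(pdq_hex) != PDQ_HEX_LEN:
            raise ValueError("expected a 64-character hex PDQ hash")
        self._hashes.append(int(pdq_hex, 16))
        return len(self._hashes) - 1

    def search(self, pdq_hex: str, threshold: int) -> List[Tuple[int, int]]:
        """Return (position, distance) pairs within `threshold` bits."""
        query = int(pdq_hex, 16)
        matches = []
        for position, stored in enumerate(self._hashes):
            distance = bin(query ^ stored).count("1")  # Hamming distance
            if distance <= threshold:
                matches.append((position, distance))
        return matches

    def __getstate__(self) -> Dict[str, bytes]:
        # Flatten to one bytes blob so the pickled state stays simple.
        blob = b"".join(h.to_bytes(32, "big") for h in self._hashes)
        return {"hashes": blob}

    def __setstate__(self, state: Dict[str, bytes]) -> None:
        # Rebuild the in-memory list from the 32-byte chunks.
        blob = state["hashes"]
        self._hashes = [
            int.from_bytes(blob[i : i + 32], "big")
            for i in range(0, len(blob), 32)
        ]
```

With `__getstate__` and `__setstate__` defined this way, `pickle.dumps(index)` only ever sees one dictionary holding a bytes value. A faiss-backed version would presumably do the same thing with the serialized faiss index as the blob.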
Putting it together with search
Dynamically selecting lookup type from build function
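One way the build function might choose a lookup type is sketched below. Everything here is illustrative: `IndexKind`, `pick_index_kind`, `build`, and the cutoff value are hypothetical names and numbers, not the library's API. The real cutoff would need to come from benchmarking.

```python
from enum import Enum
from typing import Iterable, List, Optional, Tuple


class IndexKind(Enum):
    FLAT = "flat"              # exhaustive scan over every stored hash
    MULTI_HASH = "multi_hash"  # bucketed lookup intended for larger sets


# Hypothetical cutoff, chosen only for illustration.
DEFAULT_FLAT_CUTOFF = 1000


def pick_index_kind(
    n_hashes: int,
    requested: Optional[IndexKind] = None,
    flat_cutoff: int = DEFAULT_FLAT_CUTOFF,
) -> IndexKind:
    """Honor an explicit request, otherwise choose by collection size."""
    if requested is not None:
        return requested
    return IndexKind.FLAT if n_hashes <= flat_cutoff else IndexKind.MULTI_HASH


def build(
    entries: Iterable[Tuple[str, object]],
    requested: Optional[IndexKind] = None,
) -> Tuple[IndexKind, List[Tuple[str, object]]]:
    """Materialize the entries once, then pick the lookup type from the count."""
    materialized = list(entries)
    kind = pick_index_kind(len(materialized), requested)
    return kind, materialized
```

Keeping the choice in one small function means the constructor argument and the automatic selection share the same code path, and the selection rule can be unit tested without building an index.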
Test everything
Add a robust set of unittests for this functionality
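As a starting point, the tests could compare index results against a brute-force reference. The sketch below only shows the shape of such tests: `hamming_distance` and `brute_force_query` are hypothetical helpers, and the threshold of 31 bits is assumed here as the match threshold. Real tests would run the same assertions against the new index class.

```python
import unittest
from typing import List


def hamming_distance(hex_a: str, hex_b: str) -> int:
    """Number of differing bits between two hex-encoded hashes."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


def brute_force_query(hashes: List[str], query: str, threshold: int) -> List[int]:
    """Reference answer: positions of all hashes within `threshold` bits."""
    return [
        i for i, h in enumerate(hashes)
        if hamming_distance(h, query) <= threshold
    ]


class PDQIndexSketchTest(unittest.TestCase):
    ZERO = "0" * 64        # all bits clear
    NEAR = "0" * 63 + "f"  # 4 bits away from ZERO
    FAR = "f" * 64         # 256 bits away from ZERO
    THRESHOLD = 31         # assumed match threshold

    def test_exact_match_found(self) -> None:
        hashes = [self.ZERO, self.FAR]
        self.assertEqual(brute_force_query(hashes, self.ZERO, 0), [0])

    def test_near_match_within_threshold(self) -> None:
        hashes = [self.NEAR, self.FAR]
        self.assertEqual(
            brute_force_query(hashes, self.ZERO, self.THRESHOLD), [0]
        )

    def test_far_hash_excluded(self) -> None:
        self.assertEqual(
            brute_force_query([self.FAR], self.ZERO, self.THRESHOLD), []
        )

    def test_empty_index(self) -> None:
        self.assertEqual(brute_force_query([], self.ZERO, self.THRESHOLD), [])
```

Running these with `python -m unittest` against both the old and new index classes would also give a cheap regression check before the swap.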
Rollout plan
After we confirm that everything is working as expected, we'll swap out the index class that the PDQ signal type uses by default. I think we can get away without a major version bump for this.