Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] Validate hash length #1618

Merged
merged 3 commits into from
Sep 17, 2024
Merged

[py-tx] Validate hash length #1618

merged 3 commits into from
Sep 17, 2024

Conversation

aryzle
Copy link
Collaborator

@aryzle aryzle commented Sep 9, 2024

Summary

fixes #1609

validates hash length in SignalExchangeAPIWithSimpleUpdates (used by stopNCII, file, etc)

Test Plan

reproduce error by adding "00000000000000000000000000000000" to the list of example PDQ hashes and try to run tx fetch

...
Building index for pdq with 17 signals...
Traceback (most recent call last):
  File "/home/vscode/.local/bin/tx", line 33, in <module>
    sys.exit(load_entry_point('threatexchange', 'console_scripts', 'tx')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/cli/main.py", line 321, in main
    inner_main()
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/cli/main.py", line 314, in inner_main
    execute_command(settings, namespace)
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/cli/main.py", line 157, in execute_command
    command.execute(settings)
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/cli/fetch_cmd.py", line 163, in execute
    DatasetCommand().execute_generate_indices(settings)
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/cli/dataset_cmd.py", line 292, in execute_generate_indices
    index = index_cls.build(signals.items())
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/signal_type/index.py", line 216, in build
    ret.add_all(entries)
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/signal_type/pdq/pdq_index.py", line 75, in add_all
    self.index.add(
  File "/workspaces/ThreatExchange/python-threatexchange/threatexchange/signal_type/pdq/pdq_faiss_matcher.py", line 240, in add
    self.faiss_index.add_with_ids(numpy.array(vectors), numpy.array(i64_ids))
...

after rebuilding tx with the fix in SignalExchangeAPIWithSimpleUpdates, error goes away as the offending entry is ignored (note the count 17 went down to 16)

...
Building index for pdq with 16 signals...
Index for pdq ready
...

@Dcallies
Copy link
Contributor

For test plan, can you confirm if you were able to generate a successful reproduction by adding an invalid length hash in the PDQ samples?

@aryzle
Copy link
Collaborator Author

aryzle commented Sep 11, 2024

@Dcallies not yet, I'm trying to get a unit test going, but I'll do a manual test too

@aryzle aryzle marked this pull request as ready for review September 11, 2024 16:32
@aryzle
Copy link
Collaborator Author

aryzle commented Sep 11, 2024

seems like some tests do catch this 😂 will dig into them later today

logging.warning(
"Invalid fingerprint (%s): %s",
s_type.get_name(),
signal_str,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: This string could be quite long, suggest truncating it and doing a "..."

@aryzle
Copy link
Collaborator Author

aryzle commented Sep 16, 2024

@Dcallies I commented out the bad entry here because a few tests depended on it (DatasetCommandTest and test_utils in test_vpdq_faiss.py because VPDQSignal.get_examples calls PdqSignal.get_examples) and didn't make sense to include a bad entry in get_examples when it's expected to have only valid examples. We can def refactor later to handle this, or make another method that returns mixed/dirty examples to test on.

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, would like to have a regression test that would exercise this condition, but you've been at it for a while, let's merge this one.

@Dcallies Dcallies merged commit e9d0fd0 into facebook:main Sep 17, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[py-tx]Threatexchange CLI PDQ index doesn't validate hash length
3 participants