[py-tx] Implement "File" Content Type for Image or Video subclassing #1727

Mackay-Fisher · 2024-12-16T20:14:48Z

Summary

Added FileContent to determine content types based on file extensions and GIF properties. Static GIFs resolve to PhotoContent, and animated GIFs resolve to VideoContent. Updated hash_cmd.py to integrate FileContent for dynamic content type resolution. This is the first PR for #1675.

For testing, added tests in hash_cmd_test.py to validate that the FileContent type maps correctly based on file extension, and on whether or not a given GIF is animated.

More specifically:
sample-b.jpg → PhotoContent
sample-video.mp4 → VideoContent
static.gif → PhotoContent
animated.gif → VideoContent
Unsupported file types throw errors.
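
The mapping above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `resolve_content_type`, the two extension sets, and the "photo"/"video" return strings are hypothetical stand-ins for `FileContent.map_to_content_type`, `PhotoContent`, and `VideoContent`. The default animation check assumes Pillow is installed and relies on its `is_animated` attribute.

```python
from pathlib import Path
from typing import Callable

# Hypothetical extension sets; the real FileContent may support a different list.
PHOTO_EXTENSIONS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTENSIONS = {".mp4", ".avi", ".mov"}


def gif_is_animated(path: Path) -> bool:
    """Return True if the GIF has more than one frame (assumes Pillow is installed)."""
    from PIL import Image  # imported lazily so the mapping itself works without Pillow

    with Image.open(path) as img:
        # Pillow exposes is_animated on multi-frame formats; fall back to False.
        return bool(getattr(img, "is_animated", False))


def resolve_content_type(
    path: Path, is_animated: Callable[[Path], bool] = gif_is_animated
) -> str:
    """Map a file to "photo" or "video"; raise ValueError for anything else."""
    extension = path.suffix.lower()
    if extension in PHOTO_EXTENSIONS:
        return "photo"
    if extension in VIDEO_EXTENSIONS:
        return "video"
    if extension == ".gif":
        # GIFs are the only case that needs to look inside the file.
        return "video" if is_animated(path) else "photo"
    raise ValueError(f"Unsupported file type: {extension}")
```

Passing the animation check in as a parameter is a choice made for this sketch so the mapping can be exercised without real GIF files.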

Note: I added a very basic animated RGB GIF that I made with Pillow, and a static blue GIF. The other file types are currently empty files; I will update them in the second PR, which covers hashing.

Follow Up: What is the process for hashing Video Content?

Test Plan

Unit Test Explanation

The updated unit test includes the following test cases:

JPEG Test Case: Tests a .jpg file and verifies its hash.
PNG Test Case: Tests a .png file and verifies its hash.
JPEG with RGB Profile Test Case: Tests a .jpeg file with an RGB profile.
Empty MP4 File Test: Tests an empty .mp4 file and verifies the video MD5 hash.
Empty AVI File Test: Tests an empty .avi file and verifies the video MD5 hash.
Empty MOV File Test: Tests an empty .mov file and verifies the video MD5 hash.
Static GIF Test: Tests a static .gif file and verifies its hash as a photo.
Animated GIF Test: Tests an animated .gif file and verifies its hash as a video.
Unsupported TXT File Test: Tests an unsupported .txt file and verifies that the CLI raises an appropriate error.

Expected Output

CLI Output for Each File Type

PNG File

$ tx hash file foo.png
pdq accb6d39648035f8125c8ce6ba65007de7b54c67a2d93ef7b8f33b0611306715

JPEG File

$ tx hash file foo.jpg
pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22

MP4 File

$ tx hash file foo.mp4
video_md5 d41d8cd98f00b204e9800998ecf8427e

AVI File

$ tx hash file foo.avi
video_md5 d41d8cd98f00b204e9800998ecf8427e

MOV File

$ tx hash file foo.mov
video_md5 d41d8cd98f00b204e9800998ecf8427e

Static GIF File

$ tx hash file static.gif
pdq dd908cc83bea8ddd781ad2cc37b4a2ddf780152a327ad32d777875120a67b112

Animated GIF File

$ tx hash file animated.gif
video_md5 ec82a2d0d4d99a623ec2a939accc7de5

Unsupported TXT File

$ tx hash file foo.txt
Error: Unsupported file type: .txt

Verbose Logging Output

For verbose mode (--verbose), the CLI logs detailed information for each file.

PNG File

$ tx --verbose hash file foo.png
INFO 2024-12-18 14:10:00 Processing file: foo.png
INFO 2024-12-18 14:10:00 Detected file extension: .png
INFO 2024-12-18 14:10:00 File type identified as photo.
INFO 2024-12-18 14:10:00 Content type set to: PhotoContent
pdq accb6d39648035f8125c8ce6ba65007de7b54c67a2d93ef7b8f33b0611306715

Animated GIF File

$ tx --verbose hash file animated.gif
INFO 2024-12-18 14:15:00 Processing file: animated.gif
INFO 2024-12-18 14:15:00 Detected file extension: .gif
INFO 2024-12-18 14:15:00 Identified as an animated GIF.
INFO 2024-12-18 14:15:00 Content type set to: VideoContent
video_md5 ec82a2d0d4d99a623ec2a939accc7de5

Dcallies

Looking solid so far!

blocking: Can you update your test plan to include the output of running tx hash and tx --verbose hash on some files?

blocking question: I'm always suspicious of adding new files to the codebase, especially large videos.

sample-video.mp4 is an empty file. Instead of committing it, you could just use a tmpfile to create an empty mp4 whenever you needed one.
Does that same approach work for a gif?
If it doesn't, can we use an even smaller gif? 1x1 pixels even. Additionally, test photos tend to live in pdq/data and videos in tmk/data, and you can reference them using a relative filepath trick - look for references of __file__, some of your fellows have had to do similar.
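
A rough sketch of both suggestions, using only the standard library. The helper names `data_path` and `md5_of_empty_temp_file` are hypothetical and not part of py-tx, and the directory layout in the usage comment is illustrative only.

```python
import hashlib
import tempfile
from pathlib import Path


def data_path(test_file: str, *parts: str) -> Path:
    """Build a path relative to the directory containing the given test module."""
    return Path(test_file).resolve().parent.joinpath(*parts)


def md5_of_empty_temp_file(suffix: str) -> str:
    """Create an empty file with the given suffix in a temp dir and return its MD5."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = Path(tmp_dir) / f"empty{suffix}"
        path.touch()  # zero-byte file; nothing is committed to the repo
        return hashlib.md5(path.read_bytes()).hexdigest()


# Inside a test module you would pass __file__, for example (hypothetical layout):
#   data_path(__file__, "..", "data", "sample.gif")
print(md5_of_empty_temp_file(".mp4"))  # d41d8cd98f00b204e9800998ecf8427e
```

The printed value is the MD5 of empty input, which is the same digest shown as video_md5 for the empty .mp4, .avi, and .mov files in the test plan.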

Dcallies · 2024-12-17T18:32:40Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+        # Initial apporach on usage in hash_cmd
+        # if issubclass(self.content_type, FileContent):
+        #     try:
+        #         # Use the first file to determine content type
+        #         self.content_type = FileContent.map_to_content_type(files[0].name)
+        #     except ValueError as e:
+        #         raise CommandError.user(f"Error: {e}")
+        # else:
+        #     self.content_type = content_type


blocking: Pull commented out code or uncomment it :P

This looks like where this should live, rather than in execute, where you've put it now. Why did you end up putting it in execute?

Yeah, I had started doing the whole thing, but the issue asked for it to be split into two PRs, so I stopped and put this PR up. I will finish the full execution with the hashing output and update the PR.

Dcallies · 2024-12-18T14:06:17Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+                    print(
+                        f"File: {file.name}, Resolved ContentType: {actual_content_type.get_name()}"
+                    )
+                except ValueError as e:
+                    print(f"Error: {e}")


blocking question: Are these prints for debugging, or do you expect them in the final version?

If in the final version, don't use print; instead, use logging that only shows when doing --verbose. Just using the logging module at info level should be enough for that.
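
A minimal sketch of that suggestion, using the standard logging module. The logger name and both helper functions are hypothetical, and how `tx --verbose` actually configures logging is an assumption here; the sketch only shows the general pattern of INFO messages appearing when verbose output is requested.

```python
import logging

logger = logging.getLogger("hash_cmd_sketch")  # hypothetical logger name


def configure_logging(verbose: bool) -> None:
    """Let INFO messages through only when verbose output was requested."""
    logger.setLevel(logging.INFO if verbose else logging.WARNING)


def report_resolution(file_name: str, content_type_name: str) -> None:
    """Log the resolved content type instead of printing it."""
    logger.info("File: %s, Resolved ContentType: %s", file_name, content_type_name)
```

With this shape, the hash itself can still go to stdout while the diagnostic line stays out of the way unless the user asks for it.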

Dcallies · 2024-12-18T14:06:27Z

python-threatexchange/threatexchange/cli/hash_cmd.py

+        if issubclass(self.content_type, FileContent):
+            for file in self.files:
+                try:
+                    actual_content_type = FileContent.map_to_content_type(file)


blocking q: Where is this used?

This was just for testing the content type. I will move it back up to the initializer.

Dcallies · 2024-12-18T14:07:35Z

python-threatexchange/threatexchange/cli/tests/hash_cmd_test.py

@@ -155,3 +155,59 @@ def test_unletterbox_with_photo_content(hash_cli: ThreatExchangeCLIE2eHelper):
            "pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22",
        ],
    )
+
+
+def test_file_content(hash_cli: ThreatExchangeCLIE2eHelper):


Thanks for writing a comprehensive unittest!

Mackay-Fisher · 2024-12-18T17:40:56Z

Also, I wanted to know if this was meant to read multiple files and pass each file to the associated hasher or if all the files have to be the same kind of type.

Mackay-Fisher · 2024-12-18T20:21:03Z

I updated it for the current usage: it reads the first file and then interprets the type from that, because that is the way the hashers are set up in hash cmd. If it is meant to work per file, I can switch that quickly and update the hashing for each file.

Dcallies · 2024-12-23T13:41:59Z

Also, I wanted to know if this was meant to read multiple files and pass each file to the associated hasher or if all the files have to be the same kind of type.

For now, it's fine to assume all files have to be of the same type, otherwise this will be a complicated feature :P
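
One way to picture that assumption: resolve the type from the first file and apply it to the whole batch. In the sketch below, `resolve_for_batch` and the `resolve` callable are hypothetical names, and the early failure on a mixed batch is an extra check added for illustration; nothing in this thread says the PR does that.

```python
from pathlib import Path
from typing import Callable, Sequence


def resolve_for_batch(files: Sequence[Path], resolve: Callable[[Path], str]) -> str:
    """Resolve the content type from the first file and use it for the whole batch."""
    if not files:
        raise ValueError("No files given")
    batch_type = resolve(files[0])
    # Extra check assumed for this sketch only: fail early on a mixed batch.
    for path in files[1:]:
        if resolve(path) != batch_type:
            raise ValueError(f"{path.name} does not match batch type {batch_type}")
    return batch_type
```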

Dcallies

One question for you inline, thanks for the comprehensive test plan!

Can merge as soon as I understand the context of the animated gif change.

Dcallies · 2024-12-23T13:44:27Z

python-threatexchange/threatexchange/content_type/file.py

+                logger.error(f"Error processing GIF: {e}")
+                raise ValueError(f"Error processing GIF: {e}")
+        else:
+            logger.error(f"Unsupported file type: {extension}")


ignorable: In general I think if you are throwing the exception, it doesn't make sense to log it, since whoever is catching your exception is probably going to log it, and you'll duplicate messages.

I agree; I will change that.
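
A small sketch of the pattern being suggested: the function that detects the problem only raises, and the caller that catches the exception logs it once. `check_extension`, `run`, the logger name, and the supported set are all hypothetical and only loosely modeled on the snippet above.

```python
import logging

logger = logging.getLogger("file_content_sketch")  # hypothetical logger name

# Hypothetical list of supported extensions for this sketch.
SUPPORTED = {".jpg", ".jpeg", ".png", ".gif", ".mp4", ".avi", ".mov"}


def check_extension(extension: str) -> str:
    """Raise for unsupported extensions without logging at the raise site."""
    if extension not in SUPPORTED:
        raise ValueError(f"Unsupported file type: {extension}")
    return extension


def run(extension: str) -> bool:
    """Catch the exception at the boundary and log it exactly once."""
    try:
        check_extension(extension)
    except ValueError as e:
        logger.error("Error: %s", e)
        return False
    return True
```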

Dcallies · 2024-12-23T13:45:12Z

python-threatexchange/threatexchange/content_type/file.py

+            logger.error(f"Unsupported file type: {extension}")
+            raise ValueError(f"Unsupported file type: {extension}")
+
+        logger.info(f"Content type set to: {content_type.__name__}")


ignorable: Logger natively supports printf-style formatting as well - "%s"
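
For reference, a sketch of what the printf-style form looks like with the standard logging module. The logger name and `log_content_type` are hypothetical. The value is passed as an argument rather than interpolated up front, so the message is only formatted if the record is actually emitted.

```python
import logging

logger = logging.getLogger("printf_logging_sketch")  # hypothetical logger name


def log_content_type(content_type_name: str) -> None:
    """Log with a %s placeholder; logging fills it in only when the record is handled."""
    logger.info("Content type set to: %s", content_type_name)
```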

Dcallies · 2024-12-23T13:46:57Z

python-threatexchange/threatexchange/signal_type/pdq/pdq_hasher.py

        # LA images (luminance with alpha) return 3 dimensional ndarray
+        # For GIF converts the first frame of a static GIF to RGB


blocking question: What is the behavior if you don't convert it?

Actually, it works perfectly fine without converting. I had initially started making these changes because the quality of the hash was poor, but that was due to the solid-color GIFs I was using, not the RGB conversion. Thank you for catching that; I will restore the file to the state it was in before I made any changes.

Mackay-Fisher added 2 commits December 12, 2024 22:53

wip

c1111c9

[py-tx] update for file type to help differentiate between video and …

3364d51

…photo type without updating hashing based on file

Mackay-Fisher requested a review from Dcallies as a code owner December 16, 2024 20:14

facebook-github-bot added the CLA Signed label Dec 16, 2024

formating :(

b265301

Dcallies requested changes Dec 18, 2024

[py-tx] active hashing from file class

a777225

Mackay-Fisher force-pushed the Issue-1675-File-ContentType branch from e4cb006 to a777225 Compare December 18, 2024 20:19

Mackay-Fisher requested a review from Dcallies December 18, 2024 20:21

Dcallies approved these changes Dec 23, 2024

Dcallies mentioned this pull request Dec 23, 2024

[pytx] Implement FileContent class #1680

Closed

Mackay-Fisher added 2 commits December 23, 2024 10:47

[py-tx] restored pdq file and updated logger to remove unessisary log…

c2b8fc2

…ging of errors

[py-tx] restored f style logging for file accuracy

fd0ecf7

Dcallies merged commit 148d8cc into facebook:main Dec 23, 2024
6 checks passed

Mackay-Fisher deleted the Issue-1675-File-ContentType branch December 23, 2024 21:55
