Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[py-tx] Implement "File" Content Type for Image or Video subclassing #1727

Merged
merged 6 commits into from
Dec 23, 2024

Conversation

Mackay-Fisher
Copy link
Contributor

@Mackay-Fisher Mackay-Fisher commented Dec 16, 2024

Summary

Added FileContent to determine content types based on file extensions and GIF properties. Static GIFs resolve to PhotoContent, and animated GIFs resolve to VideoContent. Updated hash_cmd.py to integrate FileContent for dynamic content type resolution. This is the first pr for #1675

For testing added tests to validate that the FileConent type accurately maps based on file extension as well as whether or not the given gif is animated in hash_cmd_test.py.

More specifically:
sample-b.jpg → PhotoContent
sample-video.mp4 → VideoContent
static.gif → PhotoContent
animated.gif → VideoContent
Unsupported file types throw errors.

Note: I added a super basic RGB animated GIF I made with a pillow and then a Blue unanimated GIF. For the other data types, I have them currently as empty files and will update for the second pr in regards to hashing.

Follow Up: What is the process for hashing Video Content?

Test Plan

Unit Test Explanation

The updated unit test includes the following test cases:

  1. JPEG Test Case: Tests a .jpg file and verifies its hash.
  2. PNG Test Case: Tests a .png file and verifies its hash.
  3. JPEG with RGB Profile Test Case: Tests a .jpeg file with an RGB profile.
  4. Empty MP4 File Test: Tests an empty .mp4 file and verifies the video MD5 hash.
  5. Empty AVI File Test: Tests an empty .avi file and verifies the video MD5 hash.
  6. Empty MOV File Test: Tests an empty .mov file and verifies the video MD5 hash.
  7. Static GIF Test: Tests a static .gif file and verifies its hash as a photo.
  8. Animated GIF Test: Tests an animated .gif file and verifies its hash as a video.
  9. Unsupported TXT File Test: Tests an unsupported .txt file and verifies that the CLI raises an appropriate error.

Expected Output

CLI Output for Each File Type

PNG File

$ tx hash file foo.png
pdq accb6d39648035f8125c8ce6ba65007de7b54c67a2d93ef7b8f33b0611306715

JPEG File

$ tx hash file foo.jpg
pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22

MP4 File

$ tx hash file foo.mp4
video_md5 d41d8cd98f00b204e9800998ecf8427e

AVI File

$ tx hash file foo.avi
video_md5 d41d8cd98f00b204e9800998ecf8427e

MOV File

$ tx hash file foo.mov
video_md5 d41d8cd98f00b204e9800998ecf8427e

Static GIF File

$ tx hash file static.gif
pdq dd908cc83bea8ddd781ad2cc37b4a2ddf780152a327ad32d777875120a67b112

Animated GIF File

$ tx hash file animated.gif
video_md5 ec82a2d0d4d99a623ec2a939accc7de5

Unsupported TXT File

$ tx hash file foo.txt
Error: Unsupported file type: .txt

Verbose Logging Output

For verbose mode (--verbose), the CLI logs detailed information for each file.

PNG File

$ tx --verbose hash file foo.png
INFO 2024-12-18 14:10:00 Processing file: foo.png
INFO 2024-12-18 14:10:00 Detected file extension: .png
INFO 2024-12-18 14:10:00 File type identified as photo.
INFO 2024-12-18 14:10:00 Content type set to: PhotoContent
pdq accb6d39648035f8125c8ce6ba65007de7b54c67a2d93ef7b8f33b0611306715

Animated GIF File

$ tx --verbose hash file animated.gif
INFO 2024-12-18 14:15:00 Processing file: animated.gif
INFO 2024-12-18 14:15:00 Detected file extension: .gif
INFO 2024-12-18 14:15:00 Identified as an animated GIF.
INFO 2024-12-18 14:15:00 Content type set to: VideoContent
video_md5 ec82a2d0d4d99a623ec2a939accc7de5

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking solid so far!

blocking: Can you update your test plan to include the output of running tx hash and tx --verbose hash on some files?

blocking question: I'm always suspicious of adding new files to the codebase, especially large videos.

  1. sample-video.mp4 is an empty file. Instead of comitting it, you could just use a tmpfile to create an empty mp4 whenever you needed one.
  2. Does that same approach work for a gif?
  3. If it doesn't, can we use an even smaller gif? 1x1 pixels even. Additionally, test photos tend to live in pdq/data and videos in tmk/data, and you can reference them using a relative filepath trick - look for references of __file__, some of your fellows have had to do similar.

Comment on lines 131 to 139
# Initial apporach on usage in hash_cmd
# if issubclass(self.content_type, FileContent):
# try:
# # Use the first file to determine content type
# self.content_type = FileContent.map_to_content_type(files[0].name)
# except ValueError as e:
# raise CommandError.user(f"Error: {e}")
# else:
# self.content_type = content_type
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking: Pull commented out code or uncomment it :P

This looks like where this should live, rather than execute where you've put it now. Why did you end up putting it in execute?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had started doing the whole thing, but the issue asked for it to be split into two prs. So I stopped and put this pr up. I will finish the full execution with the hashing output and update the pr.

Comment on lines 147 to 151
print(
f"File: {file.name}, Resolved ContentType: {actual_content_type.get_name()}"
)
except ValueError as e:
print(f"Error: {e}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking question: Are these prints for debugging, or do you expect them in the final version?

If in the final version, don't use print, instead using logging that only shows when doing --verbose. Just using the logging module at info level should be enough for that.

if issubclass(self.content_type, FileContent):
for file in self.files:
try:
actual_content_type = FileContent.map_to_content_type(file)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking q: Where is this used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was just for the testing of the content type I will move it back up to the intializer.

@@ -155,3 +155,59 @@ def test_unletterbox_with_photo_content(hash_cli: ThreatExchangeCLIE2eHelper):
"pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22",
],
)


def test_file_content(hash_cli: ThreatExchangeCLIE2eHelper):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for writing a comprehensive unittest!

@Mackay-Fisher
Copy link
Contributor Author

Also, I wanted to know if this was meant to read multiple files and pass each file to the associated hasher or if all the files have to be the same kind of type.

@Mackay-Fisher Mackay-Fisher force-pushed the Issue-1675-File-ContentType branch from e4cb006 to a777225 Compare December 18, 2024 20:19
@Mackay-Fisher
Copy link
Contributor Author

I updated for current usage based on reading the first file and then interpreting the type because that is the way the hashers are set up in hash cmd if it is meant to be for each file I can switch that real quick and update the hashing for each file.

@Dcallies
Copy link
Contributor

Also, I wanted to know if this was meant to read multiple files and pass each file to the associated hasher or if all the files have to be the same kind of type.

For now, it's fine to assume all files have to be of the same type, otherwise this will be a complicated feature :P

Copy link
Contributor

@Dcallies Dcallies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One question for you inline, thanks for the comprehensive test plan!

Can merge as soon as I understand the context of the animated gif change,

Comment on lines 53 to 56
logger.error(f"Error processing GIF: {e}")
raise ValueError(f"Error processing GIF: {e}")
else:
logger.error(f"Unsupported file type: {extension}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignorable: In general I think if you are throwing the exception, it doesn't make sense to log it, since whoever is catching your exception is probably going to log it, and you'll duplicate messages.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I will change that.

logger.error(f"Unsupported file type: {extension}")
raise ValueError(f"Unsupported file type: {extension}")

logger.info(f"Content type set to: {content_type.__name__}")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ignorable: Logger natively supports printf-style formatting as well - "%s"

# LA images (luminance with alpha) return 3 dimensional ndarray
# For GIF converts the first frame of a static GIF to RGB
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blocking question: What is the behavior if you don't convert it?

Copy link
Contributor Author

@Mackay-Fisher Mackay-Fisher Dec 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it works perfectly fine without converting. I had initially started to make these changes because the quality of the hash was poor but that was due to the solid color gifs I was using not because of the RGB convert I will change it back. Thank you for catching that I will restore the file to the state before I made any changes.

@Dcallies Dcallies merged commit 148d8cc into facebook:main Dec 23, 2024
6 checks passed
@Mackay-Fisher Mackay-Fisher deleted the Issue-1675-File-ContentType branch December 23, 2024 21:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants