-
Notifications
You must be signed in to change notification settings - Fork 322
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[py-tx] Implement "File" Content Type for Image or Video subclassing #1727
[py-tx] Implement "File" Content Type for Image or Video subclassing #1727
Conversation
…photo type without updating hashing based on file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking solid so far!
blocking: Can you update your test plan to include the output of running tx hash
and tx --verbose hash
on some files?
blocking question: I'm always suspicious of adding new files to the codebase, especially large videos.
- sample-video.mp4 is an empty file. Instead of comitting it, you could just use a tmpfile to create an empty mp4 whenever you needed one.
- Does that same approach work for a gif?
- If it doesn't, can we use an even smaller gif? 1x1 pixels even. Additionally, test photos tend to live in pdq/data and videos in tmk/data, and you can reference them using a relative filepath trick - look for references of
__file__
, some of your fellows have had to do similar.
# Initial apporach on usage in hash_cmd | ||
# if issubclass(self.content_type, FileContent): | ||
# try: | ||
# # Use the first file to determine content type | ||
# self.content_type = FileContent.map_to_content_type(files[0].name) | ||
# except ValueError as e: | ||
# raise CommandError.user(f"Error: {e}") | ||
# else: | ||
# self.content_type = content_type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking: Pull commented out code or uncomment it :P
This looks like where this should live, rather than execute
where you've put it now. Why did you end up putting it in execute?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had started doing the whole thing, but the issue asked for it to be split into two prs. So I stopped and put this pr up. I will finish the full execution with the hashing output and update the pr.
print( | ||
f"File: {file.name}, Resolved ContentType: {actual_content_type.get_name()}" | ||
) | ||
except ValueError as e: | ||
print(f"Error: {e}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking question: Are these prints for debugging, or do you expect them in the final version?
If in the final version, don't use print
, instead using logging that only shows when doing --verbose. Just using the logging module at info level should be enough for that.
if issubclass(self.content_type, FileContent): | ||
for file in self.files: | ||
try: | ||
actual_content_type = FileContent.map_to_content_type(file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking q: Where is this used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was just for the testing of the content type I will move it back up to the intializer.
@@ -155,3 +155,59 @@ def test_unletterbox_with_photo_content(hash_cli: ThreatExchangeCLIE2eHelper): | |||
"pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22", | |||
], | |||
) | |||
|
|||
|
|||
def test_file_content(hash_cli: ThreatExchangeCLIE2eHelper): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for writing a comprehensive unittest!
Also, I wanted to know if this was meant to read multiple files and pass each file to the associated hasher or if all the files have to be the same kind of type. |
e4cb006
to
a777225
Compare
I updated for current usage based on reading the first file and then interpreting the type because that is the way the hashers are set up in hash cmd if it is meant to be for each file I can switch that real quick and update the hashing for each file. |
For now, it's fine to assume all files have to be of the same type, otherwise this will be a complicated feature :P |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One question for you inline, thanks for the comprehensive test plan!
Can merge as soon as I understand the context of the animated gif change,
logger.error(f"Error processing GIF: {e}") | ||
raise ValueError(f"Error processing GIF: {e}") | ||
else: | ||
logger.error(f"Unsupported file type: {extension}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignorable: In general I think if you are throwing the exception, it doesn't make sense to log it, since whoever is catching your exception is probably going to log it, and you'll duplicate messages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree I will change that.
logger.error(f"Unsupported file type: {extension}") | ||
raise ValueError(f"Unsupported file type: {extension}") | ||
|
||
logger.info(f"Content type set to: {content_type.__name__}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignorable: Logger natively supports printf-style formatting as well - "%s"
# LA images (luminance with alpha) return 3 dimensional ndarray | ||
# For GIF converts the first frame of a static GIF to RGB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
blocking question: What is the behavior if you don't convert it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it works perfectly fine without converting. I had initially started to make these changes because the quality of the hash was poor but that was due to the solid color gifs I was using not because of the RGB convert I will change it back. Thank you for catching that I will restore the file to the state before I made any changes.
Summary
Added FileContent to determine content types based on file extensions and GIF properties. Static GIFs resolve to PhotoContent, and animated GIFs resolve to VideoContent. Updated hash_cmd.py to integrate FileContent for dynamic content type resolution. This is the first pr for #1675
For testing added tests to validate that the FileConent type accurately maps based on file extension as well as whether or not the given gif is animated in hash_cmd_test.py.
More specifically:
sample-b.jpg → PhotoContent
sample-video.mp4 → VideoContent
static.gif → PhotoContent
animated.gif → VideoContent
Unsupported file types throw errors.
Note: I added a super basic RGB animated GIF I made with a pillow and then a Blue unanimated GIF. For the other data types, I have them currently as empty files and will update for the second pr in regards to hashing.
Follow Up: What is the process for hashing Video Content?
Test Plan
Unit Test Explanation
The updated unit test includes the following test cases:
.jpg
file and verifies its hash..png
file and verifies its hash..jpeg
file with an RGB profile..mp4
file and verifies the video MD5 hash..avi
file and verifies the video MD5 hash..mov
file and verifies the video MD5 hash..gif
file and verifies its hash as a photo..gif
file and verifies its hash as a video..txt
file and verifies that the CLI raises an appropriate error.Expected Output
CLI Output for Each File Type
PNG File
$ tx hash file foo.png pdq accb6d39648035f8125c8ce6ba65007de7b54c67a2d93ef7b8f33b0611306715
JPEG File
$ tx hash file foo.jpg pdq f8f8f0cee0f4a84f06370a22038f63f0b36e2ed596621e1d33e6b39c4e9c9b22
MP4 File
$ tx hash file foo.mp4 video_md5 d41d8cd98f00b204e9800998ecf8427e
AVI File
$ tx hash file foo.avi video_md5 d41d8cd98f00b204e9800998ecf8427e
MOV File
$ tx hash file foo.mov video_md5 d41d8cd98f00b204e9800998ecf8427e
Static GIF File
$ tx hash file static.gif pdq dd908cc83bea8ddd781ad2cc37b4a2ddf780152a327ad32d777875120a67b112
Animated GIF File
$ tx hash file animated.gif video_md5 ec82a2d0d4d99a623ec2a939accc7de5
Unsupported TXT File
$ tx hash file foo.txt Error: Unsupported file type: .txt
Verbose Logging Output
For verbose mode (
--verbose
), the CLI logs detailed information for each file.PNG File
Animated GIF File