Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

Merged
merged 12 commits into from
Nov 18, 2024

Conversation

dacorvo
Copy link
Contributor

@dacorvo dacorvo commented Oct 22, 2024

Issue #4307

Description

This PR creates Hugginface's PyTorch DLC for training on neuron-v2 devices (Trainium).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@aws-deep-learning-containers-ci aws-deep-learning-containers-ci bot added build Reflects file change in build folder huggingface Reflects file change in huggingface folder Size:S Determines the size of the PR labels Oct 22, 2024
@dacorvo dacorvo marked this pull request as ready for review October 22, 2024 07:12
@dacorvo dacorvo requested review from a team as code owners October 22, 2024 07:12
@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch from de3747a to 39ff0fe Compare October 23, 2024 07:14
"71670": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='A vulnerability in the PyTorch's torch.distributed.rpc framework, specifically in versions prior to 2.2.2, allows for remote code execution (RCE). The framework, which is used in distributed training scenarios, does not properly verify the functions being called during RPC (Remote Procedure Call) operations. This oversight permits attackers to execute arbitrary commands by leveraging built-in Python functions such as eval during multi-cpu RPC communication. The vulnerability arises from the lack of restriction on function calls when a worker node serializes and sends a PythonUDF (User Defined Function) to the master node, which then deserializes and executes the function without validation. This flaw can be exploited to compromise master nodes initiating distributed training, potentially leading to the theft of sensitive AI-related data.'",
"71671": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='PyTorch before v2.2.0 was discovered to contain a heap buffer overflow vulnerability in the component /runtime/vararg_functions.cpp. This vulnerability allows attackers to cause a Denial of Service (DoS) via a crafted input.'",
"71672": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='Pytorch before version v2.2.0 was discovered to contain a use-after-free vulnerability in torch/csrc/jit/mobile/interpreter.cpp.'",
"71064": "Affected versions of Requests, when making requests through a Requests `Session`, if the first request is made with `verify=False` to disable cert verification, all subsequent requests to the same host will continue to ignore cert verification regardless of changes to the value of `verify`. This behavior will continue for the lifecycle of the connection in the connection pool. Requests 2.32.0 fixes the issue, but versions 2.32.0 and 2.32.1 were yanked due to conflicts with CVE-2024-35195 mitigation."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing a , on this line, code build logs:

Traceback (most recent call last):
--
259 | File "src/main.py", line 132, in <module>
260 | main()
261 | File "src/main.py", line 128, in main
262 | image_builder(buildspec_file, image_types, device_types)
263 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 370, in image_builder
264 | pushed_images += process_images(parent_images, "Parent/Independent", buildspec_path=buildspec)
265 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 434, in process_images
266 | build_images(common_stage_image_list, make_dummy_boto_client=True)
267 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 581, in build_images
268 | FORMATTER.progress(THREADS)
269 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/output.py", line 103, in progress
270 | output[i] += "." * 10 + constants.STATUS_MESSAGE[futures[image].result()]
271 | File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result
272 | return self.__get_result()
273 | File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
274 | raise self._exception
275 | File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run
276 | result = self.fn(*self.args, **self.kwargs)
277 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image.py", line 164, in build
278 | self.update_pre_build_configuration()
279 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/common_stage_image.py", line 54, in update_pre_build_configuration
280 | generate_safety_report_for_image(
281 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 383, in generate_safety_report_for_image
282 | ignore_dict = get_safety_ignore_dict(
283 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 324, in get_safety_ignore_dict
284 | get_safety_ignore_dict_from_image_specific_safety_allowlists(image_uri)
285 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 265, in get_safety_ignore_dict_from_image_specific_safety_allowlists
286 | ignore_dict_from_image_specific_allowlist = json.load(f)
287 | File "/usr/local/lib/python3.8/json/__init__.py", line 293, in load
288 | return loads(fp.read(),
289 | File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads
290 | return _default_decoder.decode(s)
291 | File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode
292 | obj, end = self.raw_decode(s, idx=_w(s, 0).end())
293 | File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode
294 | obj, end = self.scan_once(s, idx)
295 | json.decoder.JSONDecodeError: Expecting ',' delimiter: line 8 column 5 (char 2593)

@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch from 0c43330 to 0663096 Compare October 23, 2024 15:22
@dacorvo
Copy link
Contributor Author

dacorvo commented Oct 24, 2024

@Captainia a python vulnerability is detected for gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know if it can be ignored or not.

Copy link
Contributor

@Captainia Captainia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is one new critical vulnerability in the image

{"apparmor": [{"description": "It was discovered that the AppArmor policy compiler incorrectly generated looser restrictions than expected for rules allowing mount operations. A local attacker could possibly use this to bypass AppArmor restrictions in applications where some mount operations were permitted.", "vulnerability_id": "CVE-2016-1585", "name": "CVE-2016-1585", "package_name": "apparmor", "package_details": {"file_path": null, "name": "apparmor", "package_manager": "OS", "version": "2.13.3", "release": "7ubuntu5.3build2"}, "remediation": {"recommendation": {"text": "None Provided"}}, "cvss_v3_score": 9.8, "cvss_v30_score": 9.8, "cvss_v31_score": 0.0, "cvss_v2_score": 0.0, "cvss_v3_severity": "CRITICAL", "source_url": "https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-1585.html", "source": "UBUNTU_CVE", "severity": "CRITICAL", "status": "ACTIVE", "title": "CVE-2016-1585 - apparmor", "reason_to_ignore": "N/A"}]}

Shall we add it to apt-get update && apt-get upgrade in the docker file?

@dacorvo
Copy link
Contributor Author

dacorvo commented Oct 24, 2024

There is one new critical vulnerability in the image

{"apparmor": [{"description": "It was discovered that the AppArmor policy compiler incorrectly generated looser restrictions than expected for rules allowing mount operations. A local attacker could possibly use this to bypass AppArmor restrictions in applications where some mount operations were permitted.", "vulnerability_id": "CVE-2016-1585", "name": "CVE-2016-1585", "package_name": "apparmor", "package_details": {"file_path": null, "name": "apparmor", "package_manager": "OS", "version": "2.13.3", "release": "7ubuntu5.3build2"}, "remediation": {"recommendation": {"text": "None Provided"}}, "cvss_v3_score": 9.8, "cvss_v30_score": 9.8, "cvss_v31_score": 0.0, "cvss_v2_score": 0.0, "cvss_v3_severity": "CRITICAL", "source_url": "https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-1585.html", "source": "UBUNTU_CVE", "severity": "CRITICAL", "status": "ACTIVE", "title": "CVE-2016-1585 - apparmor", "reason_to_ignore": "N/A"}]}

Shall we add it to apt-get update && apt-get upgrade in the docker file?

It is already installed in the Dockerfile.
https://github.com/dacorvo/deep-learning-containers/blob/fb7e3ca6a491cae112edd53440a1761d67595d3b/huggingface/pytorch/training/docker/2.1/py3/sdk2.20.0/Dockerfile.neuronx#L40

@Captainia
Copy link
Contributor

@Captainia a python vulnerability is detected for gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know if it can be ignored or not.

It seems this vulnerability is not exploitable in a docker environment, but worth confirming and then we can add to ignore list.

@Captainia
Copy link
Contributor

@Captainia a python vulnerability is detected for gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know if it can be ignored or not.

It seems this vulnerability is not exploitable in a docker environment, but worth confirming and then we can add to ignore list.

Could you update gevent similar to the PR here? #4367 (comment)

@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch from 4fa1145 to a0bd545 Compare October 25, 2024 07:22
@dacorvo dacorvo requested a review from Captainia October 26, 2024 12:07
@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch 3 times, most recently from 5015228 to 2ee717f Compare November 4, 2024 09:58
@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch 2 times, most recently from 1c66ef1 to 3aa0fb7 Compare November 11, 2024 22:09
Captainia
Captainia previously approved these changes Nov 11, 2024
Copy link
Contributor

@Captainia Captainia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR looks good, feel free to revert the changes in dlc_developer_config.toml after the hook passes

@Captainia
Copy link
Contributor

There is one vulnerability

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: werkzeug] [installed: 3.0.4] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73969', advisory='Affected versions of Werkzeug are vulnerable to Path Traversal (CWE-22) on Windows systems running Python versions below 3.11. The safe_join() function failed to properly detect certain absolute paths on Windows, allowing attackers to potentially access files outside the intended directory. An attacker could craft special paths starting with "/" that bypass the directory restrictions on Windows systems. The vulnerability exists in the safe_join() function which relied solely on os.path.isabs() for path validation. This is exploitable on Windows systems by passing paths starting with "/" to safe_join(). To remediate, upgrade to the latest version which includes additional path validation checks. \r\nNOTE: This vulnerability specifically affects Windows systems running Python versions below 3.11 where ntpath.isabs() behavior differs.', reason_to_ignore='N/A', spec='<3.0.6', ignored=False), SafetyVulnerabilityAdvisory(vulnerability_id='73889', advisory="Affected versions of Werkzeug are potentially vulnerable to resource exhaustion when parsing file data in forms. Applications using 'werkzeug.formparser.MultiPartParser' to parse 'multipart/form-data' requests (e.g. all flask applications) are vulnerable to a relatively simple but effective resource exhaustion (denial of service) attack. A specifically crafted form submission request can cause the parser to allocate and block 3 to 8 times the upload size in main memory. There is no upper limit; a single upload at 1 Gbit/s can exhaust 32 GB of RAM in less than 60 seconds.", reason_to_ignore='N/A', spec='<3.0.6', ignored=False)]]

Please patch or add to ignore list

@Captainia
Copy link
Contributor

It seems there is another vulnerability id associated with werkzeug

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: werkzeug] [installed: 3.0.4] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73889', advisory="Affected versions of Werkzeug are potentially vulnerable to resource exhaustion when parsing file data in forms. Applications using 'werkzeug.formparser.MultiPartParser' to parse 'multipart/form-data' requests (e.g. all flask applications) are vulnerable to a relatively simple but effective resource exhaustion (denial of service) attack. A specifically crafted form submission request can cause the parser to allocate and block 3 to 8 times the upload size in main memory. There is no upper limit; a single upload at 1 Gbit/s can exhaust 32 GB of RAM in less than 60 seconds.", reason_to_ignore='N/A', spec='<3.0.6', ignored=False)]]

Could you add it as well?

@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch 2 times, most recently from 1d36b4f to 4d51973 Compare November 15, 2024 08:03
@Captainia
Copy link
Contributor

Hi David, thanks for making the updates, looks like the image has an incompatible version of package installed, could you take a look?


=================================== FAILURES ===================================
--
703 | _ test_pip_check[669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-pr-4365-2024-11-15-08-05-13] _
704 | [gw3] linux -- Python 3.8.0 /usr/local/bin/python
705 |  
706 | image = '669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-pr-4365-2024-11-15-08-05-13'
707 |  
708 | @pytest.mark.usefixtures("sagemaker", "security_sanity")
709 | @pytest.mark.model("N/A")
710 | def test_pip_check(image):
711 | """
712 | Ensure there are no broken requirements on the containers by running "pip check"
713 |  
714 | :param image: ECR image URI
715 | """
716 |  
717 | allowed_exceptions = []
718 |  
719 | # SageMaker Python SDK updated its pyyaml requirement to 6.0, which is incompatible with the
720 | # requirement from awscli. awscli only requires pyyaml for ecs/eks related invocations, while
721 | # pyyaml usage seems to be more fundamental in sagemaker. Therefore, we are ignoring awscli's
722 | # requirement in favor of sagemaker.
723 | allowed_exceptions.append(
724 | r"^awscli \d+(\.\d+)* has requirement PyYAML<5\.5,>=3\.10, but you have pyyaml 6\.0.$"
725 | )
726 |  
727 | # TF inference containers do not have core tensorflow installed by design. Allowing for this pip check error
728 | # to occur in order to catch other pip check issues that may be associated with TF inference
729 | # smclarify binaries have s3fs->aiobotocore dependency which uses older version of botocore. temporarily
730 | # allowing this to catch other issues
731 | gpu_suffix = "-gpu" if "gpu" in image else ""
732 | allowed_exceptions.append(
733 | rf"^tensorflow-serving-api{gpu_suffix} \d\.\d+\.\d+ requires tensorflow(\|{gpu_suffix}), which is not installed.$"
734 | )
735 |  
736 | allowed_exceptions.append(
737 | r"^aiobotocore \d+(\.\d+)* has requirement botocore<\d+(\.\d+)*,>=\d+(\.\d+)*, but you have botocore \d+(\.\d+)*\.$"
738 | )
739 |  
740 | # The v0.22 version of tensorflow-io has a bug fixed in v0.23 https://github.com/tensorflow/io/releases/tag/v0.23.0
741 | allowed_exceptions.append(
742 | rf"^tensorflow-io 0.22.0 requires tensorflow, which is not installed.$"
743 | )
744 |  
745 | framework, framework_version = get_framework_and_version_from_tag(image)
746 | # The v0.21 version of tensorflow-io has a bug fixed in v0.23 https://github.com/tensorflow/io/releases/tag/v0.23.0
747 |  
748 | tf263_io21_issue_framework_list = [
749 | "tensorflow",
750 | "huggingface_tensorflow",
751 | "huggingface_tensorflow_trcomp",
752 | ]
753 | if framework in tf263_io21_issue_framework_list or Version(framework_version) in SpecifierSet(
754 | ">=2.6.3,<2.7"
755 | ):
756 | allowed_exceptions.append(
757 | rf"^tensorflow-io 0.21.0 requires tensorflow, which is not installed.$"
758 | )
759 |  
760 | # TF2.9 sagemaker containers introduce tf-models-official which has a known bug where in it does not respect the
761 | # existing TF installation. https://github.com/tensorflow/models/issues/9267. This package in turn brings in
762 | # tensorflow-text. Skip checking these two packages as this is an upstream issue.
763 | if framework in ["tensorflow", "huggingface_tensorflow"] and Version(
764 | framework_version
765 | ) in SpecifierSet(">=2.9.1"):
766 | exception_strings = []
767 |  
768 | for ex_ver in [
769 | "2.9.1",
770 | "2.9.2",
771 | "2.10.0",
772 | "2.11.0",
773 | "2.12.0",
774 | "2.13.0",
775 | "2.14.2",
776 | "2.16.0",
777 | ]:
778 | exception_strings += [f"tf-models-official {ex_ver}".replace(".", r"\.")]
779 |  
780 | for ex_ver in ["2.9.0", "2.10.0", "2.11.0", "2.12.0", "2.13.0", "2.14.0", "2.16.1"]:
781 | exception_strings += [f"tensorflow-text {ex_ver}".replace(".", r"\.")]
782 |  
783 | for ex_ver in ["2.16.0"]:
784 | exception_strings += [f"tf-keras {ex_ver}".replace(".", r"\.")]
785 |  
786 | allowed_exceptions.append(
787 | rf"^({'\|'.join(exception_strings)}) requires tensorflow, which is not installed."
788 | )
789 |  
790 | allowed_exceptions.append(
791 | rf"tf-models-official 2.9.2 has requirement tensorflow-text~=2.9.0, but you have tensorflow-text 2.10.0."
792 | )
793 |  
794 | if framework in ["pytorch"]:
795 | exception_strings = []
796 |  
797 | # Adding due to the latest pip check platform feature: https://github.com/pypa/pip/issues/12884
798 | for ex_ver in ["1.11.1.1"]:
799 | exception_strings += [f"ninja {ex_ver}".replace(".", r"\.")]
800 |  
801 | allowed_exceptions.append(
802 | rf"^({'\|'.join(exception_strings)}) is not supported on this platform"
803 | )
804 |  
805 | if "pytorch" in image and "trcomp" in image:
806 | allowed_exceptions.extend(
807 | [
808 | r"torch-xla \d+(\.\d+)* requires absl-py, which is not installed.",
809 | r"torch-xla \d+(\.\d+)* requires cloud-tpu-client, which is not installed.",
810 | ]
811 | )
812 |  
813 | if "autogluon" in image:
814 | allowed_exceptions.extend(
815 | [
816 | r"openxlab \d+\.\d+\.\d+ has requirement requests~=\d+\.\d+\.\d+, but you have requests .*",
817 | r"openxlab \d+\.\d+\.\d+ has requirement setuptools~=\d+\.\d+\.\d+, but you have setuptools .*",
818 | ]
819 | )
820 |  
821 | # Add null entrypoint to ensure command exits immediately
822 | ctx = Context()
823 | output = ctx.run(f"docker run --entrypoint='' {image} pip check", hide=True, warn=True)
824 | if output.return_code != 0:
825 | if not (
826 | any(
827 | [
828 | re.findall(allowed_exception, output.stdout)
829 | for allowed_exception in allowed_exceptions
830 | ]
831 | )
832 | ):
833 | # Rerun pip check test if this is an unexpected failure
834 | >               ctx.run(f"docker run --entrypoint='' {image} pip check", hide=True)
835 |  
836 | sanity/test_pre_release.py:676:
837 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
838 | /usr/local/lib/python3.8/site-packages/invoke/context.py:95: in run
839 | return self._run(runner, command, **kwargs)
840 | /usr/local/lib/python3.8/site-packages/invoke/context.py:102: in _run
841 | return runner.run(command, **kwargs)
842 | /usr/local/lib/python3.8/site-packages/invoke/runners.py:380: in run
843 | return self._run_body(command, **kwargs)
844 | /usr/local/lib/python3.8/site-packages/invoke/runners.py:442: in _run_body
845 | return self.make_promise() if self._asynchronous else self._finish()
846 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
847 |  
848 | self = <invoke.runners.Local object at 0x7f63e3639eb0>
849 |  
850 | def _finish(self):
851 | # Wait for subprocess to run, forwarding signals as we get them.
852 | try:
853 | while True:
854 | try:
855 | self.wait()
856 | break  # done waiting!
857 | # Don't locally stop on ^C, only forward it:
858 | # - if remote end really stops, we'll naturally stop after
859 | # - if remote end does not stop (eg REPL, editor) we don't want
860 | # to stop prematurely
861 | except KeyboardInterrupt as e:
862 | self.send_interrupt(e)
863 | # TODO: honor other signals sent to our own process and
864 | # transmit them to the subprocess before handling 'normally'.
865 | # Make sure we tie off our worker threads, even if something exploded.
866 | # Any exceptions that raised during self.wait() above will appear after
867 | # this block.
868 | finally:
869 | # Inform stdin-mirroring worker to stop its eternal looping
870 | self.program_finished.set()
871 | # Join threads, storing inner exceptions, & set a timeout if
872 | # necessary. (Segregate WatcherErrors as they are "anticipated
873 | # errors" that want to show up at the end during creation of
874 | # Failure objects.)
875 | watcher_errors = []
876 | thread_exceptions = []
877 | for target, thread in six.iteritems(self.threads):
878 | thread.join(self._thread_join_timeout(target))
879 | exception = thread.exception()
880 | if exception is not None:
881 | real = exception.value
882 | if isinstance(real, WatcherError):
883 | watcher_errors.append(real)
884 | else:
885 | thread_exceptions.append(exception)
886 | # If any exceptions appeared inside the threads, raise them now as an
887 | # aggregate exception object.
888 | # NOTE: this is kept outside the 'finally' so that main-thread
889 | # exceptions are raised before worker-thread exceptions; they're more
890 | # likely to be Big Serious Problems.
891 | if thread_exceptions:
892 | raise ThreadException(thread_exceptions)
893 | # Collate stdout/err, calculate exited, and get final result obj
894 | result = self._collate_result(watcher_errors)
895 | # Any presence of WatcherError from the threads indicates a watcher was
896 | # upset and aborted execution; make a generic Failure out of it and
897 | # raise that.
898 | if watcher_errors:
899 | # TODO: ambiguity exists if we somehow get WatcherError in *both*
900 | # threads...as unlikely as that would normally be.
901 | raise Failure(result, reason=watcher_errors[0])
902 | # If a timeout was requested and the subprocess did time out, shout.
903 | timeout = self.opts["timeout"]
904 | if timeout is not None and self.timed_out:
905 | raise CommandTimedOut(result, timeout=timeout)
906 | if not (result or self.opts["warn"]):
907 | >           raise UnexpectedExit(result)
908 | E           invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
909 | E
910 | E           Command: "docker run --entrypoint='' 669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-pr-4365-2024-11-15-08-05-13 pip check"
911 | E
912 | E           Exit code: 1
913 | E
914 | E           Stdout:
915 | E
916 | E           tensorboard 2.6.0 has requirement google-auth<2,>=1.6.3, but you have google-auth 2.36.0.
917 | E
918 | E           Stderr:

@Captainia
Copy link
Contributor

One last vulnerabilities and we should be good to go..

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: gunicorn] [installed: 22.0.0] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='72809', advisory='A vulnerability in Gunicorn allowed the TolerateDangerousFraming setting to process conflicting headers (Transfer-Encoding and Content-Length) and dangerous characters in HTTP header fields. This could lead to HTTP request smuggling and header injection attacks. The issue was resolved by removing this setting and enforcing stricter header validation. \r\nNote: It happens due to an incomplete fix for CVE-2024-1135.', reason_to_ignore='N/A', spec='>=22.0.0,<23.0.0', ignored=False)]]
--
762 | ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: mlflow] [installed: 2.12.2] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='71691', advisory='Deserialization of untrusted data can occur in affected versions of the MLflow platform running, enabling a maliciously uploaded PyTorch model to run arbitrary code on an end user’s system when interacted with.', reason_to_ignore='N/A', spec='>=0.5.0,<=2.13.1', ignored=False), SafetyVulnerabilityAdvisory(vulnerability_id='72394', advisory="Affected versions of LangChain have a callback injection issue with asynchronous and streaming requests. The current autologging method injects the MLflow callback into the user's callback list, causing bugs such as failure with predict_stream, ainvoke, astream, and abatch calls when configurations are specified. Cleanup scripts execute prematurely, preventing proper trace generation and sometimes failing to clean up callbacks. The updated logic now creates copies of callback lists to avoid mutation and maintain original callbacks", reason_to_ignore='N/A', spec='<2.15.0', ignored=False)]]

The other code build jobs are newly added so once security test passes we can merge

@dacorvo dacorvo force-pushed the add-neuronx-train-0.0.25 branch from 9309905 to 041689f Compare November 18, 2024 09:23
@malav-shastri
Copy link
Contributor

there's one remaining mlflow vulnerability which you might need to allowlist. Can you please address that?

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: mlflow] [installed: 2.12.2] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='72394', advisory="Affected versions of LangChain have a callback injection issue with asynchronous and streaming requests. The current autologging method injects the MLflow callback into the user's callback list, causing bugs such as failure with predict_stream, ainvoke, astream, and abatch calls when configurations are specified. Cleanup scripts execute prematurely, preventing proper trace generation and sometimes failing to clean up callbacks. The updated logic now creates copies of callback lists to avoid mutation and maintain original callbacks", reason_to_ignore='N/A', spec='<2.15.0', ignored=False)]]

@dacorvo
Copy link
Contributor Author

dacorvo commented Nov 18, 2024

there's one remaining mlflow vulnerability which you might need to allowlist. Can you please address that?

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: mlflow] [installed: 2.12.2] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='72394', advisory="Affected versions of LangChain have a callback injection issue with asynchronous and streaming requests. The current autologging method injects the MLflow callback into the user's callback list, causing bugs such as failure with predict_stream, ainvoke, astream, and abatch calls when configurations are specified. Cleanup scripts execute prematurely, preventing proper trace generation and sometimes failing to clean up callbacks. The updated logic now creates copies of callback lists to avoid mutation and maintain original callbacks", reason_to_ignore='N/A', spec='<2.15.0', ignored=False)]]

I just added one ... it is impossible to complete this if every time I push a new vuln pops out ! You must improve this process somehow !

@dacorvo dacorvo requested a review from Captainia November 18, 2024 16:34
malav-shastri
malav-shastri previously approved these changes Nov 18, 2024
@malav-shastri malav-shastri merged commit 36a6aab into aws:master Nov 18, 2024
30 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
build Reflects file change in build folder huggingface Reflects file change in huggingface folder Size:S Determines the size of the PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants