[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

dacorvo · 2024-10-22T07:12:17Z

Description

This PR creates Hugginface's PyTorch DLC for training on neuron-v2 devices (Trainium).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Captainia · 2024-10-23T15:14:06Z

huggingface/pytorch/training/docker/2.1/py3/sdk2.20.0/Dockerfile.neuronx.py_scan_allowlist.json

+    "71670": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='A vulnerability in the PyTorch's torch.distributed.rpc framework, specifically in versions prior to 2.2.2, allows for remote code execution (RCE). The framework, which is used in distributed training scenarios, does not properly verify the functions being called during RPC (Remote Procedure Call) operations. This oversight permits attackers to execute arbitrary commands by leveraging built-in Python functions such as eval during multi-cpu RPC communication. The vulnerability arises from the lack of restriction on function calls when a worker node serializes and sends a PythonUDF (User Defined Function) to the master node, which then deserializes and executes the function without validation. This flaw can be exploited to compromise master nodes initiating distributed training, potentially leading to the theft of sensitive AI-related data.'",
+    "71671": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='PyTorch before v2.2.0 was discovered to contain a heap buffer overflow vulnerability in the component /runtime/vararg_functions.cpp. This vulnerability allows attackers to cause a Denial of Service (DoS) via a crafted input.'",
+    "71672": "[Package: torch] Core torch package version 2.1 affected, cannot be changed in PyTorch 2.1 DLC advisory='Pytorch before version v2.2.0 was discovered to contain a use-after-free vulnerability in torch/csrc/jit/mobile/interpreter.cpp.'",
+    "71064": "Affected versions of Requests, when making requests through a Requests `Session`, if the first request is made with `verify=False` to disable cert verification, all subsequent requests to the same host will continue to ignore cert verification regardless of changes to the value of `verify`. This behavior will continue for the lifecycle of the connection in the connection pool. Requests 2.32.0 fixes the issue, but versions 2.32.0 and 2.32.1 were yanked due to conflicts with CVE-2024-35195 mitigation."


Missing a , on this line, code build logs:

Traceback (most recent call last): -- 259 | File "src/main.py", line 132, in <module> 260 | main() 261 | File "src/main.py", line 128, in main 262 | image_builder(buildspec_file, image_types, device_types) 263 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 370, in image_builder 264 | pushed_images += process_images(parent_images, "Parent/Independent", buildspec_path=buildspec) 265 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 434, in process_images 266 | build_images(common_stage_image_list, make_dummy_boto_client=True) 267 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image_builder.py", line 581, in build_images 268 | FORMATTER.progress(THREADS) 269 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/output.py", line 103, in progress 270 | output[i] += "." * 10 + constants.STATUS_MESSAGE[futures[image].result()] 271 | File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 432, in result 272 | return self.__get_result() 273 | File "/usr/local/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result 274 | raise self._exception 275 | File "/usr/local/lib/python3.8/concurrent/futures/thread.py", line 57, in run 276 | result = self.fn(*self.args, **self.kwargs) 277 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/image.py", line 164, in build 278 | self.update_pre_build_configuration() 279 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/common_stage_image.py", line 54, in update_pre_build_configuration 280 | generate_safety_report_for_image( 281 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 383, in generate_safety_report_for_image 282 | ignore_dict = get_safety_ignore_dict( 283 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 324, in get_safety_ignore_dict 284 | get_safety_ignore_dict_from_image_specific_safety_allowlists(image_uri) 285 | File "/codebuild/output/src2844226945/src/github.com/aws/deep-learning-containers/src/utils.py", line 265, in get_safety_ignore_dict_from_image_specific_safety_allowlists 286 | ignore_dict_from_image_specific_allowlist = json.load(f) 287 | File "/usr/local/lib/python3.8/json/__init__.py", line 293, in load 288 | return loads(fp.read(), 289 | File "/usr/local/lib/python3.8/json/__init__.py", line 357, in loads 290 | return _default_decoder.decode(s) 291 | File "/usr/local/lib/python3.8/json/decoder.py", line 337, in decode 292 | obj, end = self.raw_decode(s, idx=_w(s, 0).end()) 293 | File "/usr/local/lib/python3.8/json/decoder.py", line 353, in raw_decode 294 | obj, end = self.scan_once(s, idx) 295 | json.decoder.JSONDecodeError: Expecting ',' delimiter: line 8 column 5 (char 2593)

dacorvo · 2024-10-24T14:02:38Z

@Captainia a python vulnerability is detected for gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know if it can be ignored or not.

Captainia

There is one new critical vulnerability in the image

{"apparmor": [{"description": "It was discovered that the AppArmor policy compiler incorrectly generated looser restrictions than expected for rules allowing mount operations. A local attacker could possibly use this to bypass AppArmor restrictions in applications where some mount operations were permitted.", "vulnerability_id": "CVE-2016-1585", "name": "CVE-2016-1585", "package_name": "apparmor", "package_details": {"file_path": null, "name": "apparmor", "package_manager": "OS", "version": "2.13.3", "release": "7ubuntu5.3build2"}, "remediation": {"recommendation": {"text": "None Provided"}}, "cvss_v3_score": 9.8, "cvss_v30_score": 9.8, "cvss_v31_score": 0.0, "cvss_v2_score": 0.0, "cvss_v3_severity": "CRITICAL", "source_url": "https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-1585.html", "source": "UBUNTU_CVE", "severity": "CRITICAL", "status": "ACTIVE", "title": "CVE-2016-1585 - apparmor", "reason_to_ignore": "N/A"}]}

Shall we add it to apt-get update && apt-get upgrade in the docker file?

dacorvo · 2024-10-24T14:10:32Z

There is one new critical vulnerability in the image

{"apparmor": [{"description": "It was discovered that the AppArmor policy compiler incorrectly generated looser restrictions than expected for rules allowing mount operations. A local attacker could possibly use this to bypass AppArmor restrictions in applications where some mount operations were permitted.", "vulnerability_id": "CVE-2016-1585", "name": "CVE-2016-1585", "package_name": "apparmor", "package_details": {"file_path": null, "name": "apparmor", "package_manager": "OS", "version": "2.13.3", "release": "7ubuntu5.3build2"}, "remediation": {"recommendation": {"text": "None Provided"}}, "cvss_v3_score": 9.8, "cvss_v30_score": 9.8, "cvss_v31_score": 0.0, "cvss_v2_score": 0.0, "cvss_v3_severity": "CRITICAL", "source_url": "https://people.canonical.com/~ubuntu-security/cve/2016/CVE-2016-1585.html", "source": "UBUNTU_CVE", "severity": "CRITICAL", "status": "ACTIVE", "title": "CVE-2016-1585 - apparmor", "reason_to_ignore": "N/A"}]}

Shall we add it to apt-get update && apt-get upgrade in the docker file?

It is already installed in the Dockerfile.
https://github.com/dacorvo/deep-learning-containers/blob/fb7e3ca6a491cae112edd53440a1761d67595d3b/huggingface/pytorch/training/docker/2.1/py3/sdk2.20.0/Dockerfile.neuronx#L40

Captainia · 2024-10-24T14:11:19Z

@Captainia a python vulnerability is detected for gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know if it can be ignored or not.

It seems this vulnerability is not exploitable in a docker environment, but worth confirming and then we can add to ignore list.

Captainia · 2024-10-24T16:58:56Z

@Captainia a python vulnerability is detected for gevent:

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:97 SAFETY_REPORT (FAILED) [pkg: gevent] [installed: 24.2.1] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73655', advisory='Affected versions of gevent are vulnerable to a Race Condition leading to Unauthorized Access — CWE-362. The attack can be carried out when the fallback socketpair implementation is used on platforms that lack native support, and the vulnerable function does not properly authenticate the connected sockets. To exploit this vulnerability, an attacker must be able to predict the address and port used by the fallback socketpair and establish a connection before the legitimate client. Users are advised to update to the version of gevent where this issue is fixed by introducing authentication steps in the fallback socketpair implementation to ensure the sockets are correctly connected.', reason_to_ignore='N/A', spec='<24.10.1', ignored=False)]]

It does not seem to have been detected for other images in this repository, so I don't know if it can be ignored or not.

It seems this vulnerability is not exploitable in a docker environment, but worth confirming and then we can add to ignore list.

Could you update gevent similar to the PR here? #4367 (comment)

Captainia

This PR looks good, feel free to revert the changes in dlc_developer_config.toml after the hook passes

Captainia · 2024-11-12T01:03:24Z

There is one vulnerability

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: werkzeug] [installed: 3.0.4] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73969', advisory='Affected versions of Werkzeug are vulnerable to Path Traversal (CWE-22) on Windows systems running Python versions below 3.11. The safe_join() function failed to properly detect certain absolute paths on Windows, allowing attackers to potentially access files outside the intended directory. An attacker could craft special paths starting with "/" that bypass the directory restrictions on Windows systems. The vulnerability exists in the safe_join() function which relied solely on os.path.isabs() for path validation. This is exploitable on Windows systems by passing paths starting with "/" to safe_join(). To remediate, upgrade to the latest version which includes additional path validation checks. \r\nNOTE: This vulnerability specifically affects Windows systems running Python versions below 3.11 where ntpath.isabs() behavior differs.', reason_to_ignore='N/A', spec='<3.0.6', ignored=False), SafetyVulnerabilityAdvisory(vulnerability_id='73889', advisory="Affected versions of Werkzeug are potentially vulnerable to resource exhaustion when parsing file data in forms. Applications using 'werkzeug.formparser.MultiPartParser' to parse 'multipart/form-data' requests (e.g. all flask applications) are vulnerable to a relatively simple but effective resource exhaustion (denial of service) attack. A specifically crafted form submission request can cause the parser to allocate and block 3 to 8 times the upload size in main memory. There is no upper limit; a single upload at 1 Gbit/s can exhaust 32 GB of RAM in less than 60 seconds.", reason_to_ignore='N/A', spec='<3.0.6', ignored=False)]]

Please patch or add to ignore list

Captainia · 2024-11-13T15:41:42Z

It seems there is another vulnerability id associated with werkzeug

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: werkzeug] [installed: 3.0.4] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='73889', advisory="Affected versions of Werkzeug are potentially vulnerable to resource exhaustion when parsing file data in forms. Applications using 'werkzeug.formparser.MultiPartParser' to parse 'multipart/form-data' requests (e.g. all flask applications) are vulnerable to a relatively simple but effective resource exhaustion (denial of service) attack. A specifically crafted form submission request can cause the parser to allocate and block 3 to 8 times the upload size in main memory. There is no upper limit; a single upload at 1 Gbit/s can exhaust 32 GB of RAM in less than 60 seconds.", reason_to_ignore='N/A', spec='<3.0.6', ignored=False)]]

Could you add it as well?

Captainia · 2024-11-15T14:14:46Z

Hi David, thanks for making the updates, looks like the image has an incompatible version of package installed, could you take a look?


=================================== FAILURES ===================================
--
703 | _ test_pip_check[669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-pr-4365-2024-11-15-08-05-13] _
704 | [gw3] linux -- Python 3.8.0 /usr/local/bin/python
705 |  
706 | image = '669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-pr-4365-2024-11-15-08-05-13'
707 |  
708 | @pytest.mark.usefixtures("sagemaker", "security_sanity")
709 | @pytest.mark.model("N/A")
710 | def test_pip_check(image):
711 | """
712 | Ensure there are no broken requirements on the containers by running "pip check"
713 |  
714 | :param image: ECR image URI
715 | """
716 |  
717 | allowed_exceptions = []
718 |  
719 | # SageMaker Python SDK updated its pyyaml requirement to 6.0, which is incompatible with the
720 | # requirement from awscli. awscli only requires pyyaml for ecs/eks related invocations, while
721 | # pyyaml usage seems to be more fundamental in sagemaker. Therefore, we are ignoring awscli's
722 | # requirement in favor of sagemaker.
723 | allowed_exceptions.append(
724 | r"^awscli \d+(\.\d+)* has requirement PyYAML<5\.5,>=3\.10, but you have pyyaml 6\.0.$"
725 | )
726 |  
727 | # TF inference containers do not have core tensorflow installed by design. Allowing for this pip check error
728 | # to occur in order to catch other pip check issues that may be associated with TF inference
729 | # smclarify binaries have s3fs->aiobotocore dependency which uses older version of botocore. temporarily
730 | # allowing this to catch other issues
731 | gpu_suffix = "-gpu" if "gpu" in image else ""
732 | allowed_exceptions.append(
733 | rf"^tensorflow-serving-api{gpu_suffix} \d\.\d+\.\d+ requires tensorflow(\|{gpu_suffix}), which is not installed.$"
734 | )
735 |  
736 | allowed_exceptions.append(
737 | r"^aiobotocore \d+(\.\d+)* has requirement botocore<\d+(\.\d+)*,>=\d+(\.\d+)*, but you have botocore \d+(\.\d+)*\.$"
738 | )
739 |  
740 | # The v0.22 version of tensorflow-io has a bug fixed in v0.23 https://github.com/tensorflow/io/releases/tag/v0.23.0
741 | allowed_exceptions.append(
742 | rf"^tensorflow-io 0.22.0 requires tensorflow, which is not installed.$"
743 | )
744 |  
745 | framework, framework_version = get_framework_and_version_from_tag(image)
746 | # The v0.21 version of tensorflow-io has a bug fixed in v0.23 https://github.com/tensorflow/io/releases/tag/v0.23.0
747 |  
748 | tf263_io21_issue_framework_list = [
749 | "tensorflow",
750 | "huggingface_tensorflow",
751 | "huggingface_tensorflow_trcomp",
752 | ]
753 | if framework in tf263_io21_issue_framework_list or Version(framework_version) in SpecifierSet(
754 | ">=2.6.3,<2.7"
755 | ):
756 | allowed_exceptions.append(
757 | rf"^tensorflow-io 0.21.0 requires tensorflow, which is not installed.$"
758 | )
759 |  
760 | # TF2.9 sagemaker containers introduce tf-models-official which has a known bug where in it does not respect the
761 | # existing TF installation. https://github.com/tensorflow/models/issues/9267. This package in turn brings in
762 | # tensorflow-text. Skip checking these two packages as this is an upstream issue.
763 | if framework in ["tensorflow", "huggingface_tensorflow"] and Version(
764 | framework_version
765 | ) in SpecifierSet(">=2.9.1"):
766 | exception_strings = []
767 |  
768 | for ex_ver in [
769 | "2.9.1",
770 | "2.9.2",
771 | "2.10.0",
772 | "2.11.0",
773 | "2.12.0",
774 | "2.13.0",
775 | "2.14.2",
776 | "2.16.0",
777 | ]:
778 | exception_strings += [f"tf-models-official {ex_ver}".replace(".", r"\.")]
779 |  
780 | for ex_ver in ["2.9.0", "2.10.0", "2.11.0", "2.12.0", "2.13.0", "2.14.0", "2.16.1"]:
781 | exception_strings += [f"tensorflow-text {ex_ver}".replace(".", r"\.")]
782 |  
783 | for ex_ver in ["2.16.0"]:
784 | exception_strings += [f"tf-keras {ex_ver}".replace(".", r"\.")]
785 |  
786 | allowed_exceptions.append(
787 | rf"^({'\|'.join(exception_strings)}) requires tensorflow, which is not installed."
788 | )
789 |  
790 | allowed_exceptions.append(
791 | rf"tf-models-official 2.9.2 has requirement tensorflow-text~=2.9.0, but you have tensorflow-text 2.10.0."
792 | )
793 |  
794 | if framework in ["pytorch"]:
795 | exception_strings = []
796 |  
797 | # Adding due to the latest pip check platform feature: https://github.com/pypa/pip/issues/12884
798 | for ex_ver in ["1.11.1.1"]:
799 | exception_strings += [f"ninja {ex_ver}".replace(".", r"\.")]
800 |  
801 | allowed_exceptions.append(
802 | rf"^({'\|'.join(exception_strings)}) is not supported on this platform"
803 | )
804 |  
805 | if "pytorch" in image and "trcomp" in image:
806 | allowed_exceptions.extend(
807 | [
808 | r"torch-xla \d+(\.\d+)* requires absl-py, which is not installed.",
809 | r"torch-xla \d+(\.\d+)* requires cloud-tpu-client, which is not installed.",
810 | ]
811 | )
812 |  
813 | if "autogluon" in image:
814 | allowed_exceptions.extend(
815 | [
816 | r"openxlab \d+\.\d+\.\d+ has requirement requests~=\d+\.\d+\.\d+, but you have requests .*",
817 | r"openxlab \d+\.\d+\.\d+ has requirement setuptools~=\d+\.\d+\.\d+, but you have setuptools .*",
818 | ]
819 | )
820 |  
821 | # Add null entrypoint to ensure command exits immediately
822 | ctx = Context()
823 | output = ctx.run(f"docker run --entrypoint='' {image} pip check", hide=True, warn=True)
824 | if output.return_code != 0:
825 | if not (
826 | any(
827 | [
828 | re.findall(allowed_exception, output.stdout)
829 | for allowed_exception in allowed_exceptions
830 | ]
831 | )
832 | ):
833 | # Rerun pip check test if this is an unexpected failure
834 | >               ctx.run(f"docker run --entrypoint='' {image} pip check", hide=True)
835 |  
836 | sanity/test_pre_release.py:676:
837 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
838 | /usr/local/lib/python3.8/site-packages/invoke/context.py:95: in run
839 | return self._run(runner, command, **kwargs)
840 | /usr/local/lib/python3.8/site-packages/invoke/context.py:102: in _run
841 | return runner.run(command, **kwargs)
842 | /usr/local/lib/python3.8/site-packages/invoke/runners.py:380: in run
843 | return self._run_body(command, **kwargs)
844 | /usr/local/lib/python3.8/site-packages/invoke/runners.py:442: in _run_body
845 | return self.make_promise() if self._asynchronous else self._finish()
846 | _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
847 |  
848 | self = <invoke.runners.Local object at 0x7f63e3639eb0>
849 |  
850 | def _finish(self):
851 | # Wait for subprocess to run, forwarding signals as we get them.
852 | try:
853 | while True:
854 | try:
855 | self.wait()
856 | break  # done waiting!
857 | # Don't locally stop on ^C, only forward it:
858 | # - if remote end really stops, we'll naturally stop after
859 | # - if remote end does not stop (eg REPL, editor) we don't want
860 | # to stop prematurely
861 | except KeyboardInterrupt as e:
862 | self.send_interrupt(e)
863 | # TODO: honor other signals sent to our own process and
864 | # transmit them to the subprocess before handling 'normally'.
865 | # Make sure we tie off our worker threads, even if something exploded.
866 | # Any exceptions that raised during self.wait() above will appear after
867 | # this block.
868 | finally:
869 | # Inform stdin-mirroring worker to stop its eternal looping
870 | self.program_finished.set()
871 | # Join threads, storing inner exceptions, & set a timeout if
872 | # necessary. (Segregate WatcherErrors as they are "anticipated
873 | # errors" that want to show up at the end during creation of
874 | # Failure objects.)
875 | watcher_errors = []
876 | thread_exceptions = []
877 | for target, thread in six.iteritems(self.threads):
878 | thread.join(self._thread_join_timeout(target))
879 | exception = thread.exception()
880 | if exception is not None:
881 | real = exception.value
882 | if isinstance(real, WatcherError):
883 | watcher_errors.append(real)
884 | else:
885 | thread_exceptions.append(exception)
886 | # If any exceptions appeared inside the threads, raise them now as an
887 | # aggregate exception object.
888 | # NOTE: this is kept outside the 'finally' so that main-thread
889 | # exceptions are raised before worker-thread exceptions; they're more
890 | # likely to be Big Serious Problems.
891 | if thread_exceptions:
892 | raise ThreadException(thread_exceptions)
893 | # Collate stdout/err, calculate exited, and get final result obj
894 | result = self._collate_result(watcher_errors)
895 | # Any presence of WatcherError from the threads indicates a watcher was
896 | # upset and aborted execution; make a generic Failure out of it and
897 | # raise that.
898 | if watcher_errors:
899 | # TODO: ambiguity exists if we somehow get WatcherError in *both*
900 | # threads...as unlikely as that would normally be.
901 | raise Failure(result, reason=watcher_errors[0])
902 | # If a timeout was requested and the subprocess did time out, shout.
903 | timeout = self.opts["timeout"]
904 | if timeout is not None and self.timed_out:
905 | raise CommandTimedOut(result, timeout=timeout)
906 | if not (result or self.opts["warn"]):
907 | >           raise UnexpectedExit(result)
908 | E           invoke.exceptions.UnexpectedExit: Encountered a bad command exit code!
909 | E
910 | E           Command: "docker run --entrypoint='' 669063966089.dkr.ecr.us-west-2.amazonaws.com/pr-huggingface-pytorch-training-neuronx:2.1.2-transformers4.43.2-neuronx-py310-sdk2.20.0-ubuntu20.04-pr-4365-2024-11-15-08-05-13 pip check"
911 | E
912 | E           Exit code: 1
913 | E
914 | E           Stdout:
915 | E
916 | E           tensorboard 2.6.0 has requirement google-auth<2,>=1.6.3, but you have google-auth 2.36.0.
917 | E
918 | E           Stderr:

Captainia · 2024-11-15T17:53:13Z

One last vulnerabilities and we should be good to go..

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: gunicorn] [installed: 22.0.0] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='72809', advisory='A vulnerability in Gunicorn allowed the TolerateDangerousFraming setting to process conflicting headers (Transfer-Encoding and Content-Length) and dangerous characters in HTTP header fields. This could lead to HTTP request smuggling and header injection attacks. The issue was resolved by removing this setting and enforcing stricter header validation. \r\nNote: It happens due to an incomplete fix for CVE-2024-1135.', reason_to_ignore='N/A', spec='>=22.0.0,<23.0.0', ignored=False)]]
--
762 | ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: mlflow] [installed: 2.12.2] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='71691', advisory='Deserialization of untrusted data can occur in affected versions of the MLflow platform running, enabling a maliciously uploaded PyTorch model to run arbitrary code on an end user’s system when interacted with.', reason_to_ignore='N/A', spec='>=0.5.0,<=2.13.1', ignored=False), SafetyVulnerabilityAdvisory(vulnerability_id='72394', advisory="Affected versions of LangChain have a callback injection issue with asynchronous and streaming requests. The current autologging method injects the MLflow callback into the user's callback list, causing bugs such as failure with predict_stream, ainvoke, astream, and abatch calls when configurations are specified. Cleanup scripts execute prematurely, preventing proper trace generation and sometimes failing to clean up callbacks. The updated logic now creates copies of callback lists to avoid mutation and maintain original callbacks", reason_to_ignore='N/A', spec='<2.15.0', ignored=False)]]

The other code build jobs are newly added so once security test passes we can merge

These vulnerabilities were already added for the pytorch training DLCs.

malav-shastri · 2024-11-18T15:09:55Z

there's one remaining mlflow vulnerability which you might need to allowlist. Can you please address that?

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: mlflow] [installed: 2.12.2] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='72394', advisory="Affected versions of LangChain have a callback injection issue with asynchronous and streaming requests. The current autologging method injects the MLflow callback into the user's callback list, causing bugs such as failure with predict_stream, ainvoke, astream, and abatch calls when configurations are specified. Cleanup scripts execute prematurely, preventing proper trace generation and sometimes failing to clean up callbacks. The updated logic now creates copies of callback lists to avoid mutation and maintain original callbacks", reason_to_ignore='N/A', spec='<2.15.0', ignored=False)]]

dacorvo · 2024-11-18T15:13:04Z

there's one remaining mlflow vulnerability which you might need to allowlist. Can you please address that?

ERROR    test.dlc_tests.sanity.test_safety_report_file:test_safety_report_file.py:98 SAFETY_REPORT (FAILED) [pkg: mlflow] [installed: 2.12.2] [vulnerabilities: [SafetyVulnerabilityAdvisory(vulnerability_id='72394', advisory="Affected versions of LangChain have a callback injection issue with asynchronous and streaming requests. The current autologging method injects the MLflow callback into the user's callback list, causing bugs such as failure with predict_stream, ainvoke, astream, and abatch calls when configurations are specified. Cleanup scripts execute prematurely, preventing proper trace generation and sometimes failing to clean up callbacks. The updated logic now creates copies of callback lists to avoid mutation and maintain original callbacks", reason_to_ignore='N/A', spec='<2.15.0', ignored=False)]]

I just added one ... it is impossible to complete this if every time I push a new vuln pops out ! You must improve this process somehow !

This reverts commit 7cff21f.

aws-deep-learning-containers-ci bot added build Reflects file change in build folder huggingface Reflects file change in huggingface folder Size:S Determines the size of the PR labels Oct 22, 2024

dacorvo marked this pull request as ready for review October 22, 2024 07:12

dacorvo requested review from a team as code owners October 22, 2024 07:12

dacorvo force-pushed the add-neuronx-train-0.0.25 branch from de3747a to 39ff0fe Compare October 23, 2024 07:14

Captainia reviewed Oct 23, 2024

View reviewed changes

dacorvo force-pushed the add-neuronx-train-0.0.25 branch from 0c43330 to 0663096 Compare October 23, 2024 15:22

Captainia reviewed Oct 24, 2024

View reviewed changes

Captainia mentioned this pull request Oct 24, 2024

patch HF PR Training latest neuronx vulnerabilities for apparmor and libarchive13 #4367

Merged

29 tasks

dacorvo force-pushed the add-neuronx-train-0.0.25 branch from 4fa1145 to a0bd545 Compare October 25, 2024 07:22

dacorvo requested a review from Captainia October 26, 2024 12:07

dacorvo force-pushed the add-neuronx-train-0.0.25 branch 3 times, most recently from 5015228 to 2ee717f Compare November 4, 2024 09:58

dacorvo force-pushed the add-neuronx-train-0.0.25 branch 2 times, most recently from 1c66ef1 to 3aa0fb7 Compare November 11, 2024 22:09

Captainia previously approved these changes Nov 11, 2024

View reviewed changes

dacorvo dismissed Captainia’s stale review via 26401fd November 12, 2024 17:59

dacorvo force-pushed the add-neuronx-train-0.0.25 branch 2 times, most recently from 1d36b4f to 4d51973 Compare November 15, 2024 08:03

dacorvo added 9 commits November 18, 2024 10:18

feat(neuronx): add 0.0.25 training DLC

e37a52a

fix(neuronx): add mlflow vulnerabilities to allow-list

5f62f21

These vulnerabilities were already added for the pytorch training DLCs.

REVERTME: activate neuronx train CI build

7cff21f

fix(neuronx): apparmor and gevent vulnerabilities

abd2c45

fix: add werkzeug exception (Windows vuln)

e0a9da1

fix: add another werkzeug exception

bc3ab61

fix: pin sagemaker version to stop importing errors

885d6fb

fix: try to remove tensorboard 2.6 error

294acbe

fix: add mlflow and gunicorn exceptions

041689f

dacorvo force-pushed the add-neuronx-train-0.0.25 branch from 9309905 to 041689f Compare November 18, 2024 09:23

fix: yet another mlflow vuln

9e8b986

dacorvo requested a review from Captainia November 18, 2024 16:34

malav-shastri previously approved these changes Nov 18, 2024

View reviewed changes

Revert "REVERTME: activate neuronx train CI build"

ff01a40

This reverts commit 7cff21f.

dacorvo dismissed malav-shastri’s stale review via ff01a40 November 18, 2024 21:17

Merge branch 'master' into add-neuronx-train-0.0.25

b8f03b4

malav-shastri approved these changes Nov 18, 2024

View reviewed changes

malav-shastri merged commit 36a6aab into aws:master Nov 18, 2024
30 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

dacorvo commented Oct 22, 2024 •

edited

Loading

Captainia Oct 23, 2024

dacorvo commented Oct 24, 2024

Captainia left a comment

dacorvo commented Oct 24, 2024 •

edited

Loading

Captainia commented Oct 24, 2024

Captainia commented Oct 24, 2024

Captainia left a comment

Captainia commented Nov 12, 2024

Captainia commented Nov 13, 2024

Captainia commented Nov 15, 2024

Captainia commented Nov 15, 2024

malav-shastri commented Nov 18, 2024

dacorvo commented Nov 18, 2024

[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

[HuggingFace][Neuronx] Training - Optimum Neuron 0.0.25 - Neuron sdk 2.20.0 - Transformers to 4.43.2 #4365

Conversation

dacorvo commented Oct 22, 2024 • edited Loading

Description

Captainia Oct 23, 2024

Choose a reason for hiding this comment

dacorvo commented Oct 24, 2024

Captainia left a comment

Choose a reason for hiding this comment

dacorvo commented Oct 24, 2024 • edited Loading

Captainia commented Oct 24, 2024

Captainia commented Oct 24, 2024

Captainia left a comment

Choose a reason for hiding this comment

Captainia commented Nov 12, 2024

Captainia commented Nov 13, 2024

Captainia commented Nov 15, 2024

Captainia commented Nov 15, 2024

malav-shastri commented Nov 18, 2024

dacorvo commented Nov 18, 2024

dacorvo commented Oct 22, 2024 •

edited

Loading

dacorvo commented Oct 24, 2024 •

edited

Loading