Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Private AML: Dockerfile with Nexus config #3424

Draft
wants to merge 9 commits into
base: main
Choose a base branch
from
Draft

Conversation

damoodamoo
Copy link
Member

@damoodamoo damoodamoo commented Apr 6, 2023

Resolves #3295

As AML is deployed, it deploys a container registry for its own use. We need to push a customised Docker image into that registry, that contains the nexus config 'baked in', so that nexus will be used to download required packages during AML image builds during training / experimentation runs. Without the nexus config, the compute cluster can not run in network isolation.

What this PR does:

  • As the container registry is deployed, it leaves the network open.
  • An ACR task is used to build + push the nexus docker image
  • The terraform for AML is re-run, and the network is closed.

Alternatives considered and discounted:

  • Build a Docker image in the VMSS runner, and push from there. We don't want docker-in-docker.
  • Cannot use an ACR task as-is because the general use agent pools don't have network access to the ACR
  • Instead of opening the network temporarily, use a custom ACR Task Agent Pool, within the ACR VNet, to run the task. The pools are not available in the UK, and have to be in the same region as the ACR.
  • Add an IP exception to the ACR for the public agent pool. We don't know the IP of the runner until it's running, and then it's too late.

@github-actions
Copy link

github-actions bot commented Apr 6, 2023

Unit Test Results

0 tests   0 ✅  0s ⏱️
0 suites  0 💤
0 files    0 ❌

Results for commit a8dfe47.

♻️ This comment has been updated with latest results.

Copy link
Collaborator

@jjgriff93 jjgriff93 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and makes sense - shame this workaround is needed but seems you've considered every possible alternative!


ARG TRE_ID
ARG TRE_LOCATION
ENV NEXUS_PROXY_URL="https://nexus-${TRE_ID}.${TRE_LOCATION}.cloudapp.azure.com"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to need to be updated as per the Gov cloud stuff?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep - have updated to set the endpoint based on cloud type

resource "azurerm_container_registry" "acr" {
name = local.acr_name
location = data.azurerm_resource_group.ws.location
resource_group_name = data.azurerm_resource_group.ws.name
sku = "Premium"
admin_enabled = false
public_network_access_enabled = false
public_network_access_enabled = var.public_access_enabled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as discussed, I suspect this might benefit from a comment

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed - have added comment to explain

@damoodamoo damoodamoo added the blocked Cannot progress at present label Apr 6, 2023
@damoodamoo
Copy link
Member Author

Holding off until @tamirkamara / the crew are back from leave to review from a Gov perspective

Copy link
Collaborator

@tamirkamara tamirkamara left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tagging @LizaShak for a review as well.

@@ -55,3 +63,7 @@ output "mcr_tag" {
output "batch_tag" {
value = data.azurerm_network_service_tags.batch_tag.id
}

output "azure_endpoint" {
value = var.azure_environment == "AzureGovCloud" ? "cloudapp.usgovcloudapi.net" : "cloudapp.azure.com"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. We have a similar condition somewhere else already... but at any case, the suffixes should be brought from TF configuration module we use elsewhere (we might need to add those as outputs there).
  2. If we need to have this condition in the definition, then I think we'd prefer to use arm_environment as that is native to Terraform.

@marrobi marrobi added the enhancement New feature or request label Jun 27, 2023
@tim-p-allen tim-p-allen removed the blocked Cannot progress at present label May 9, 2024
@tim-p-allen tim-p-allen self-assigned this May 9, 2024
@tim-p-allen tim-p-allen marked this pull request as draft August 8, 2024 11:17
@tim-p-allen tim-p-allen removed their assignment Aug 8, 2024
@marrobi
Copy link
Member

marrobi commented Nov 26, 2024

I know a lot of time has passed, but adding some thoughts as this might come up again. Is there anything to stop us building this in an ACR "outside" the workspace via CI/CD, then pull/push to the workspace ACR from the RP?

@damoodamoo
Copy link
Member Author

  • ACR Task Agent Pool,

This is a distant memory for me, but I think that would be doable. I don't think we wanted to add anything more explicit to the general CI pipeline as we wanted it self contained in a WS service. If you were to build in a general, open ACR that would work - but you'd need connectivity from that into the WS ACR, and a way for the RP to pull/push it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Development

Successfully merging this pull request may close these issues.

AML Compute does not build when private in the workspace
6 participants