[Don't merge] A small workflow example with RunsOn #11046

hcho3 · 2024-12-03T03:17:45Z

Per @jameslamb's suggestion, I pared down #11001 to extract a small subset: a single representative workflow, with the essential components of the CI pipeline:

Special syntax in GitHub Actions YAML to use self-hosted runners (via RunsOn)
ops/pipeline/stash-artifacts.sh: Script for stashing artifacts
ops/docker_build.sh: Script for building and caching containers
ops/docker_run.py: Script for running tests inside containers
ops/docker/dockerfile: Dockerfiles
ops/docker/ci_container.yml: YAML file to store build args for Docker containers. See also ops/docker/extract_build_args.sh.

Also see https://xgboost--11001.org.readthedocs.build/en/11001/contrib/ci.html for the overview.

Major elements of #11001 are represented in this pull request; if there is no major objection to this PR, I will go ahead and merge #11001.

Note. The failures from BuildKite are expected, since we are removing the pipelines targeting BuildKite. BuildKite is planned to be removed after #11001 is merged.

Note. In GitHub Actions, jobs run on GitHub-hosted runners by default. To opt into self-hosted runners (enabled by RunsOn), we use the following special syntax:

    runs-on:
      - runs-on
      - runner=runner-name
      - run-id=${{ github.run_id }}
      - tag=[tag that uniquely identifies the job within the GitHub Actions workflow]

where the runner is defined in .github/runs-on.yml. See the documentation at https://runs-on.com/runners/linux/.
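
As a rough illustration of how the four labels fit together, here is a minimal Python sketch. The helper names (runs_on_labels, duplicate_tags) and the runner name linux-amd64-cpu are hypothetical and not part of this PR; the sketch only composes the label list shown above and checks that no two jobs reuse the same tag.

```python
from collections import Counter


def runs_on_labels(runner: str, tag: str) -> list[str]:
    """Return the label list for the `runs-on:` key of a single job."""
    return [
        "runs-on",
        f"runner={runner}",
        # Left as a literal expression; GitHub Actions expands it at run time.
        "run-id=${{ github.run_id }}",
        f"tag={tag}",
    ]


def duplicate_tags(jobs: dict[str, list[str]]) -> list[str]:
    """Return tags used by more than one job (ideally an empty list)."""
    tags = [
        label[len("tag="):]
        for labels in jobs.values()
        for label in labels
        if label.startswith("tag=")
    ]
    return sorted(tag for tag, count in Counter(tags).items() if count > 1)


# Hypothetical jobs; the runner name is an assumption for illustration only.
jobs = {
    "build-cuda": runs_on_labels("linux-amd64-cpu", "main-build-cuda"),
    "test-python": runs_on_labels("linux-amd64-cpu", "main-test-python"),
}
for job_name, labels in jobs.items():
    print(job_name, labels)
print("duplicate tags:", duplicate_tags(jobs))
```

Deriving each tag from the workflow and job name is one simple way to keep tags unique, though any scheme that avoids collisions within a workflow should work.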

cc @jameslamb

jameslamb

Putting up a "comment" review with a few initial, small comments. Sorry, it'll take me more time to give a thorough and thoughtful review. I'll try to do that tomorrow.

ops/pipeline/enforce-ci.sh

ops/docker/ci_container.yml

ops/docker/docker_cache_ecr.yml

ops/docker/dockerfile/Dockerfile.gpu_build_rockylinux8

ops/pipeline/stash-artifacts.sh

jameslamb

I spent some more time with this PR today trying to understand the setup, left some more detailed comments and questions. You mentioned to me that a goal of this effort is for more people to be able to understand and contribute to XGBoost CI, so some of my comments are of the form "this is confusing" or "this could be simplified".

I also have some structural questions. I think #11001 (comment) did not totally answer one of my main concerns:

> where / when are ... VM images built?

Just saying "packer build will be run manually" isn't enough to answer that. That produces image files, but then:

where are those image files stored?
who can update them?
what environment does that command need to be run in?

And I have a related set of questions:

where can I find the full list of acceptable values for runs-on: and their specs? e.g. if I see a job consistently running out of memory and I want to upgrade to a bigger runner, how can I find whether there are larger runners provisioned?

I'm not expecting you to block any of this work on my understanding it. @trivialfis already approved #11001 and you two are the primary maintainers of this project, so merge these things whenever you are comfortable with them.

I'm only leaving all these comments because you asked for my review and said that making this system understandable for a wider audience was one of the design goals.

ops/docker/dockerfile/Dockerfile.gpu

.github/workflows/main.yml

ops/pipeline/build-cuda.sh

ops/pipeline/stash-artifacts.py

.github/workflows/main.yml

ops/docker/dockerfile/Dockerfile.gpu

hcho3 · 2024-12-05T23:38:31Z

> where are those image files stored?
> who can update them?
> what environment does that command need to be run in?

When packer build runs, the following steps occur:

Packer launches a new EC2 instance, using the AWS account hosting the CI.
Packer runs the bootstrap script (ops/packer/linux/bootstrap.sh) inside the new EC2 instance.
Once the bootstrap is complete, Packer stops the instance and then generates a new VM image (AMI).
The generated VM image will be stored in the AWS account.

Prerequisites for packer build:

Packer and the AWS CLI must be installed on the system.
AWS credentials must be configured, either by running aws configure or by setting the AWS_* environment variables.
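
The prerequisites above can be checked before invoking packer build. The following is a minimal sketch and not part of this PR: the function names are hypothetical, and it only looks for static credentials (the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY pair, or the ~/.aws/credentials file that aws configure writes), so setups relying on SSO or instance roles would need a different check.

```python
import os
import shutil
from pathlib import Path


def credentials_configured(env=None, home=None) -> bool:
    """Return True if static AWS credentials appear to be available."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    from_env = bool(env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"))
    from_file = (home / ".aws" / "credentials").is_file()
    return from_env or from_file


def preflight(which=shutil.which, env=None, home=None) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = [
        f"{tool} not found on PATH"
        for tool in ("packer", "aws")
        if which(tool) is None
    ]
    if not credentials_configured(env, home):
        problems.append("no AWS credentials found")
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print("preflight:", issue)
    if not issues:
        print("preflight: OK, ready to run packer build")
```

The which, env, and home parameters exist only so the checks can be exercised without touching the real environment.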

Note. For now, we build VM images manually. In a follow-up pull request, I plan to set up a CI/CD pipeline that builds VM images on a regular schedule.

> where can I find the full list of acceptable values for runs-on: and their specs? e.g. if I see a job consistently running out of memory and I want to upgrade to a bigger runner, how can I find whether there are larger runners provisioned?

RunsOn provides a set of default runners: https://runs-on.com/runners/linux/. We also define custom runners, which are listed in https://github.com/dmlc/xgboost/blob/master/.github/runs-on.yml.
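
To see at a glance which custom runners are defined, one could summarize the parsed contents of .github/runs-on.yml. The sketch below is an illustration under assumptions: it assumes the file has a top-level runners mapping whose entries carry family and image keys (the RunsOn documentation is the authority on the actual schema), and the sample runner names and instance types are made up.

```python
def summarize_runners(config: dict) -> list[str]:
    """Return one summary line per runner, sorted by runner name."""
    runners = config.get("runners") or {}
    summary = []
    for name in sorted(runners):
        spec = runners[name] or {}
        family = spec.get("family") or []
        if isinstance(family, str):
            # Accept a single instance family given as a plain string.
            family = [family]
        families = ",".join(family) if family else "?"
        image = spec.get("image", "?")
        summary.append(f"{name}: family={families} image={image}")
    return summary


# Hypothetical config, standing in for the parsed YAML file.
sample_config = {
    "runners": {
        "linux-amd64-gpu": {"family": ["g4dn.xlarge"], "image": "linux-amd64"},
        "linux-amd64-cpu": {"family": ["c7i"], "image": "linux-amd64"},
    }
}
for line in summarize_runners(sample_config):
    print(line)
```

With PyYAML installed, the real file could be loaded with yaml.safe_load and passed to summarize_runners in place of the sample dictionary.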

* x- prefix for anchors in ci_container.yml
* Use latest Miniforge; specify versions as build args
* Make PYTHON_VERSION build arg
* Fix gpu_build_r_rockylinux8
* Build containers weekly

hcho3 · 2024-12-09T15:22:26Z

@jameslamb I believe I addressed all your comments. Do you have any more feedback?

jameslamb · 2024-12-11T05:35:44Z

> @jameslamb I believe I addressed all your comments. Do you have any more feedback?

Thanks, I just reviewed #11079 ... no other comments from me. Awesome work on this!!!

[Don't merge] A small workflow example with RunsOn

6d206b1

hcho3 added 3 commits December 2, 2024 19:36

Add missing files

07fa2a5

Fix permission

4df47bb

Add missing files

9b0b399

jameslamb reviewed Dec 5, 2024

hcho3 added 3 commits December 5, 2024 17:44

Merge branch 'master' into xgboost-ci-ng-sample

6eace58

Merge branch 'master' into xgboost-ci-ng-sample

8683b1e

Merge remote-tracking branch 'origin/master' into xgboost-ci-ng-sample

f9b64fd

hcho3 changed the base branch from master to legacy-ci December 6, 2024 19:26

hcho3 mentioned this pull request Dec 6, 2024

Next-generation CI/CD pipelines with RunsOn #11001

Merged

hcho3 added 3 commits December 7, 2024 20:33

Move container build to xgboost-devops

f26c0b9

GITHUB_ACTION -> GITHUB_ACTIONS

645a5e7

Remove build_via_cmake.sh

cc1e9e8

hcho3 force-pushed the xgboost-ci-ng-sample branch from 1bd401e to 8f5b261 Compare December 8, 2024 09:33

hcho3 added 2 commits December 8, 2024 01:50

Replace stash-artifacts.{sh,py} -> manage-artifacts.py

57f6165

Use manage-artifacts.py for uploading nightly builds

e14d393

hcho3 force-pushed the xgboost-ci-ng-sample branch from 8f5b261 to e14d393 Compare December 8, 2024 09:51

hcho3 added 3 commits December 8, 2024 23:18

Fix test-python-wheel

ac97aec

Build sm_75 only if pull request

89270de

Remove rename_whl.py; change of dir structure in xgboost-nightly-builds

ea312e9

hcho3 mentioned this pull request Dec 10, 2024

Improve design and address comments in the new CI #11079

Merged
