
[Don't merge] A small workflow example with RunsOn #11046

Draft · hcho3 wants to merge 15 commits into base: legacy-ci

Conversation

@hcho3 (Collaborator) commented Dec 3, 2024

Per @jameslamb's suggestion, I pared down #11001 to extract a small subset: a single representative workflow, with the essential components of the CI pipeline:

  • Special syntax in GitHub Actions YAML to use self-hosted runners (via RunsOn)
  • ops/pipeline/stash-artifacts.sh: Script for stashing artifacts
  • ops/docker_build.sh: Script for building and caching containers
  • ops/docker_run.py: Script for running tests inside containers
  • ops/docker/dockerfile: Dockerfiles
  • ops/docker/ci_container.yml: YAML file to store build args for Docker containers (a minimal sketch follows this list). See also ops/docker/extract_build_args.sh.
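
To give a flavor of the format, here is a minimal sketch of what an entry in ops/docker/ci_container.yml could look like. The container name, the container_def key, and the build args below are illustrative placeholders (only the x-prefixed-anchor convention is taken from this PR), not the file's actual contents:

    # Hypothetical sketch of an ops/docker/ci_container.yml entry.
    # Each top-level key names a CI container; its build_args are forwarded
    # to `docker build` by ops/docker/extract_build_args.sh.
    x-cuda-args: &cuda-args          # x-prefixed YAML anchor, merged below
      CUDA_VERSION_ARG: "12.4.1"     # placeholder version

    xgb-ci.gpu:
      container_def: gpu             # e.g. built from ops/docker/dockerfile/Dockerfile.gpu
      build_args:
        <<: *cuda-args
        RAPIDS_VERSION_ARG: "24.10"  # placeholder version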

Also see https://xgboost--11001.org.readthedocs.build/en/11001/contrib/ci.html for the overview.

Major elements of #11001 are represented in this pull request; if there is no major objection to this PR, I will go ahead and merge #11001.

Note: The BuildKite failures are expected, since this PR removes the pipelines targeting BuildKite. BuildKite itself is planned to be removed after #11001 is merged.

Note: In GitHub Actions, jobs run on Microsoft-hosted runners by default. To opt into self-hosted runners (provisioned via RunsOn), we use the following special syntax:

    runs-on:
      - runs-on
      - runner=runner-name
      - run-id=${{ github.run_id }}
      - tag=[tag that uniquely identifies the job in the GitHub Actions workflow]

where the runner is defined in .github/runs-on.yml. See the documentation at https://runs-on.com/runners/linux/.
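
Putting it together, a minimal sketch of a job using this syntax would look roughly as follows; the job name, runner name, tag, and build step are placeholders, not taken from the actual workflow:

    # Hypothetical job on a RunsOn self-hosted runner (names are placeholders).
    jobs:
      build-cuda:
        runs-on:
          - runs-on
          - runner=linux-amd64-gpu        # must match a runner in .github/runs-on.yml
          - run-id=${{ github.run_id }}
          - tag=build-cuda                # unique within this workflow
        steps:
          - uses: actions/checkout@v4
          - name: Build inside the CI container
            run: bash ops/docker_build.sh   # script from this PR; arguments omitted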

cc @jameslamb


@jameslamb (Contributor) left a comment

Putting up a "comment" review with a few initial, small comments. Sorry, it'll take me more time to give a thorough and thoughtful review. I'll try to do that tomorrow.

Review threads (resolved):
  • ops/pipeline/enforce-ci.sh
  • ops/docker/ci_container.yml
  • ops/docker/docker_cache_ecr.yml
  • ops/docker/dockerfile/Dockerfile.gpu_build_rockylinux8
  • ops/pipeline/stash-artifacts.sh
@jameslamb (Contributor) left a comment

I spent some more time with this PR today trying to understand the setup, and left some more detailed comments and questions. You mentioned to me that a goal of this effort is for more people to be able to understand and contribute to XGBoost CI, so some of my comments are of the form "this is confusing" or "this could be simplified".

I also have some structural questions. I think #11001 (comment) did not totally answer one of my main concerns:

where / when are ... VM images built?

Just saying "packer build will be run manually" isn't enough to answer that. That produces image files, but then:

  • where are those image files stored?
  • who can update them?
  • what environment does that command need to be run in?

And I have a related set of questions:

  • where can I find the full list of acceptable values for runs-on: and their specs? e.g. if I see a job consistently running out of memory and I want to upgrade to a bigger runner, how can I find whether there are larger runners provisioned?

I'm not expecting you to block any of this work on my understanding it. @trivialfis already approved #11001 and you two are the primary maintainers of this project, so merge these things whenever you are comfortable with them.

I'm only leaving all these comments because you asked for my review and said that making this system understandable for a wider audience was one of the design goals.

Review threads (resolved):
  • ops/docker/dockerfile/Dockerfile.gpu (3 threads)
  • .github/workflows/main.yml (4 threads)
  • ops/pipeline/build-cuda.sh (2 threads)
  • ops/pipeline/stash-artifacts.py
@hcho3 (Collaborator, Author) commented Dec 5, 2024

  • where are those image files stored?
  • who can update them?
  • what environment does that command need to be run in?

When packer build runs, the following happens:

  1. Packer launches a new EC2 instance in the AWS account hosting the CI.
  2. Packer runs the bootstrap script (ops/packer/linux/bootstrap.sh) inside the new EC2 instance.
  3. Once bootstrapping is complete, Packer stops the instance and generates a new VM image (AMI) from it.
  4. The generated VM image is stored in the same AWS account.

Prerequisites for packer build:

  • Packer and the AWS CLI must be installed on the system.
  • AWS credentials must be configured, either by running aws configure or by setting the AWS_* environment variables.

Note: For now, we build VM images manually. In a follow-up pull request, I plan to set up a CI/CD pipeline that rebuilds the VM images on a regular schedule.
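
As a rough illustration (not part of this PR), such a pipeline could be a scheduled GitHub Actions workflow along the following lines. The cron schedule, secret names, region, and the assumption that ops/packer/linux holds HCL2 templates are all placeholders:

    # Hypothetical scheduled workflow to rebuild VM images (not part of this PR).
    name: build-vm-images
    on:
      schedule:
        - cron: "0 5 * * 0"    # e.g. weekly; the actual cadence is TBD
    jobs:
      packer-build:
        runs-on: ubuntu-latest
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          AWS_DEFAULT_REGION: us-west-2    # placeholder region
        steps:
          - uses: actions/checkout@v4
          - name: Build AMI with Packer
            run: |
              packer init ops/packer/linux
              packer build ops/packer/linux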

where can I find the full list of acceptable values for runs-on: and their specs? e.g. if I see a job consistently running out of memory and I want to upgrade to a bigger runner, how can I find whether there are larger runners provisioned?

RunsOn provides a set of default runners (https://runs-on.com/runners/linux/). We also define custom runners; the list can be found at https://github.com/dmlc/xgboost/blob/master/.github/runs-on.yml.
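
For reference, a custom runner entry in .github/runs-on.yml takes roughly the following shape; the runner name and specs below are made up, so consult the actual file for the real entries:

    # Sketch of a custom runner definition for RunsOn (illustrative specs).
    runners:
      linux-amd64-gpu:
        cpu: 16                   # minimum vCPU count
        family: ["g4dn"]          # acceptable EC2 instance families
        image: ubuntu22-full-x64  # VM image to launch

A job opts into this runner by passing runner=linux-amd64-gpu in its runs-on: list, as in the syntax shown earlier.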

@hcho3 changed the base branch from master to legacy-ci on December 6, 2024 at 19:26.
hcho3 added a commit to dmlc/xgboost-devops that referenced this pull request on Dec 8, 2024:
* x- prefix for anchors in ci_container.yml

* Use latest Miniforge; specify versions as build args

* Make PYTHON_VERSION build arg

* Fix gpu_build_r_rockylinux8

* Build containers weekly
@hcho3 force-pushed the xgboost-ci-ng-sample branch from 1bd401e to 8f5b261 on December 8, 2024 at 09:33.
@hcho3 force-pushed the xgboost-ci-ng-sample branch from 8f5b261 to e14d393 on December 8, 2024 at 09:51.
@hcho3 (Collaborator, Author) commented Dec 9, 2024

@jameslamb I believe I addressed all your comments. Do you have any more feedback?

@jameslamb (Contributor) replied:

Thanks, I just reviewed #11079 ... no other comments from me. Awesome work on this!!!
