Add remote persistent worker support #787

aherrmann · 2024-09-25T14:57:05Z

Closes #776.

Implements support for persistent workers in remote builds using the Bazel remote execution protocol and the approach documented in the Bazel remote persistent workers proposal:
https://github.com/bazelbuild/proposals/blob/main/designs/2021-03-06-remote-persistent-workers.md

Includes an example setup that works with:

- local builds without persistent worker
- local builds with persistent worker (Buck2 protocol)
- remote builds without persistent worker
- remote builds with persistent worker (Bazel protocol)

The Bazel remote persistent worker protocol includes an automatic fallback for cases where the remote execution system does not yet support persistent workers. To that end, actions take the shape

WORKER WORKER_ARGS... @REQUEST_ARGS_FILE

The remote execution system separates worker arguments on the command-line from request arguments in the response file and adds the --persistent_worker flag.
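
As a rough illustration of that split, here is a minimal Python sketch. The function name `split_worker_command` is hypothetical, and two details are assumptions based on the action shape above rather than on any particular remote execution system: that the last argument is the `@`-prefixed request arguments file, and that the file holds one argument per line.

```python
import os
import tempfile


def split_worker_command(argv):
    """Split `WORKER WORKER_ARGS... @REQUEST_ARGS_FILE` into two parts.

    Returns (worker_cmd, request_args). Assumes (hypothetically) that the
    final argument is `@<path>` and that the file holds one argument per line.
    """
    if not argv or not argv[-1].startswith("@"):
        raise ValueError("expected a trailing @REQUEST_ARGS_FILE argument")
    *worker_cmd, args_file = argv
    with open(args_file[1:], encoding="utf-8") as f:
        request_args = [line.rstrip("\n") for line in f if line.strip()]
    # The remote execution system starts the worker with the extra flag.
    return worker_cmd + ["--persistent_worker"], request_args


if __name__ == "__main__":
    # Write a throwaway arguments file so the sketch can be run directly.
    with tempfile.NamedTemporaryFile("w", suffix=".args", delete=False) as f:
        f.write("--input=a.txt\n--output=a.out\n")
    try:
        print(split_worker_command(["worker.py", "--verbose", "@" + f.name]))
    finally:
        os.unlink(f.name)
```

A remote execution system without persistent worker support would skip this split and run the full command line as a one-shot action, which is what makes the fallback automatic.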

The demo worker included in the example in this PR distinguishes between Buck2 worker, Bazel remote worker, and one-shot modes depending on whether Buck2's WORKER_SOCKET, Bazel's --persistent_worker flag, or neither is set.
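
The mode selection described above could look roughly like the following Python sketch. It is not the demo worker's actual code: the function name `detect_mode`, the returned labels, and the order of the checks are assumptions; only `WORKER_SOCKET` and `--persistent_worker` come from the description.

```python
import os
import sys


def detect_mode(argv, environ):
    """Return which protocol the worker should speak.

    Checks Buck2's WORKER_SOCKET environment variable first, then Bazel's
    --persistent_worker flag, and otherwise falls back to one-shot mode.
    The order of the checks is an assumption of this sketch.
    """
    if environ.get("WORKER_SOCKET"):
        return "buck2-worker"
    if "--persistent_worker" in argv:
        return "bazel-remote-worker"
    return "one-shot"


if __name__ == "__main__":
    print(detect_mode(sys.argv[1:], os.environ))
```

Passing `argv` and `environ` explicitly, instead of reading globals inside the function, keeps the decision easy to unit test.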

The example includes a README with detailed instructions on how to test this feature.

Add example for remote persistent workers
Implement support for remote persistent workers

sluongng

LGTM from BuildBuddy side.

CI failed due to a GitHub outage yesterday, but I think a rebase + push should retry it.

sluongng · 2024-09-26T09:48:19Z

examples/persistent_worker/README.md

+export BUILDBUDDY_CONTAINER_USER=...  # GitHub user name
+export BUILDBUDDY_CONTAINER_PASSWORD=...  # GitHub access token


I just want to note that this is optional if the container image you are using is publicly downloadable.

sluongng · 2024-09-26T09:50:41Z

examples/persistent_worker/platforms/buildbuddy.bzl

+            remote_execution_properties = {
+                "OSFamily": "Linux",
+                "container-image": image,
+                "workload-isolation-type": "podman",


Nit: you don't need to set the isolation type explicitly. We (BuildBuddy) may want to change the default isolation type underneath while maintaining backward compatibility. (In fact, we recently stopped using podman as the default isolation.)

aherrmann · 2024-09-26

I had to set it this way because I got a credentials error on image download with the default.

christolliday · 2024-10-01T20:55:20Z

Thanks for the PR! This looks good.

We started working on support for this internally but unfortunately it doesn't match the bazel spec exactly. For reasons that aren't entirely clear to me we can't attach the 'worker key' to the platform so it's attached elsewhere on the action.

I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior.

The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?). Similarly we can support both in 'bazel mode' though.

Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', but I suppose we can deal with that when we get to it.

If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is.

aherrmann · 2024-10-02T07:47:08Z

> I can't see why we can't support both APIs though and have a default 'bazel mode' for this behavior.

That's great to hear! Yes, I was hoping for something along those lines.

> The main other difference is that on our end we construct an RE::Action for the worker, upload it and use the digest of that action as the 'worker key', instead of what I assume is requiring that the worker args are a prefix of the action args (in which case the 'worker key' doesn't really seem to matter?).

In the Bazel version the worker key is used to associate a given action with a potentially already running worker instance on a remote executor node. But, it is not directly tied to any kind of previously uploaded blob. Bazel calculates a digest of the worker command and its inputs and uses that as a worker key. In this PR I went for the same approach.
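
To make the idea concrete, here is a minimal Python sketch of deriving a worker key from the worker command and its inputs. The function `worker_key` and the byte encoding (NUL-separated fields, inputs sorted by path, SHA-256) are hypothetical and are not the encoding used by Bazel or by this PR; the sketch only shows that the key is computed locally and is not tied to an uploaded blob.

```python
import hashlib


def worker_key(worker_cmd, input_digests):
    """Derive a stable key from the worker command and its input digests.

    `input_digests` maps input paths to content digests (hex strings).
    The encoding below (NUL-separated fields, sorted inputs) is hypothetical.
    """
    h = hashlib.sha256()
    for arg in worker_cmd:
        h.update(arg.encode("utf-8") + b"\0")
    # Separator between the command section and the inputs section.
    h.update(b"\0")
    for path in sorted(input_digests):
        h.update(path.encode("utf-8") + b"\0")
        h.update(input_digests[path].encode("ascii") + b"\0")
    return h.hexdigest()


if __name__ == "__main__":
    print(worker_key(["worker.py", "--verbose"], {"worker.py": "ab12"}))
```

Two actions that share the same worker command and worker inputs get the same key, so an executor can route both to an already running worker instance.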

> Just a heads up that there is likely to be some churn around this at some point and I'm slightly concerned we may not have an easy time testing that this does the right thing in all edge cases for 'bazel mode', [...] If it's easy, it would be nice to have a github action for testing the remote example. I'm not sure how difficult that is.

That makes sense. I'll look into how to test this on the CI.

aherrmann · 2024-10-04T15:24:09Z

I noticed that the example did not use the WorkerRunInfo attributes worker and exe appropriately to distinguish between (remote) persistent worker and non-worker modes respectively. I've updated the example accordingly. This highlighted that the implementation wrongly used request.all_args_vec for the remote persistent worker case, when it should compose worker.exe and request.args instead. I've updated the implementation accordingly.
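
The difference between the two command lines can be sketched in Python as follows. This is a simplified model, not the Buck2 Rust code: the `Request` and `Worker` classes are hypothetical stand-ins, and the sketch assumes that `all_args_vec` amounts to the non-worker executable followed by the request arguments.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical stand-in for an execution request."""
    exe: list = field(default_factory=list)   # non-worker command
    args: list = field(default_factory=list)  # per-request arguments

    def all_args(self):
        # Assumed behaviour of all_args_vec: non-worker exe, then args.
        return self.exe + self.args


@dataclass
class Worker:
    """Hypothetical stand-in for the persistent worker description."""
    exe: list = field(default_factory=list)   # worker command


def remote_worker_command(worker, request):
    # For remote persistent workers, prefix the request arguments with the
    # worker command rather than the non-worker command.
    return worker.exe + request.args


if __name__ == "__main__":
    req = Request(exe=["one_shot.py"], args=["@request.args"])
    wrk = Worker(exe=["worker.py", "--verbose"])
    print(req.all_args())
    print(remote_worker_command(wrk, req))
```

With the non-worker command in front, a remote execution system would start the one-shot tool instead of the worker, which is the mistake described above.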

I will continue looking into ways to test this feature on CI.

aherrmann · 2024-10-11T09:35:42Z

Just to give a heads up on expected progress here. I'm on leave for the next two weeks and will get back to this when I'm back.

I've started making the setup independent of Nix, so that it is easier to integrate with CI here. That's already working locally.
The other thing is the remote execution system needed to test the remote persistent worker mode. I spoke with BuildBuddy: it would be possible to use BuildBuddy for this CI use-case, but it would require setting up a free account for the Meta Buck2 repository and configuring an access token for CI. @christolliday would that be possible for Meta to do? Otherwise, I'd have to look into other options to set up a compatible remote execution system on CI.

christolliday · 2024-10-28T02:51:42Z

Hi @aherrmann, @KapJI is looking into setting up a BuildBuddy account.

`WorkerRunInfo` has two fields `worker` and `exe` for the persistent worker command or the non-worker mode command respectively. This commit changes the remote persistent worker example to use these appropriately to distinguish between worker and non-worker mode execution.

The Buck2 Rust implementation needs to use the worker command instead of the non-worker command for remote persistent worker execution mode.

We want to use a Buck2 managed hermetic Python toolchain. However, Python binaries generated by such a toolchain are sensitive to `PWD` and don't work in working directories other than the repository root. Unfortunately, `genrule` changes directory and is hence incompatible with such Python binaries. This commit defines a dedicated rule to invoke protoc without changing the working directory.

The previous setup used Nix to provide a Python toolchain and packages in a reproducible fashion. However, this requires dedicated remote worker images with the Nix store paths pre-populated which complicates the setup for testing on Buck2 CI. Using a hermetic Python toolchain avoids these issues and works on the standard remote execution image.

Remove the old genrule targets to generate the Python gRPC/protobuf bindings.

Without Nix it is no longer required to use a custom remote worker image.

It's useful to test multiple builds in parallel to catch potential issues related to parallel worker requests. However, we don't need an excessive number of tests, which would waste resources unnecessarily.

Requires a repository secret to be set up for the BuildBuddy API key named `BUILDBUDDY_API_KEY`.

aherrmann · 2024-10-29T16:04:36Z

@christolliday @KapJI I've rebased this PR and added the changes to make it independent of Nix, so we no longer require a custom worker image. I've also added a CI test for the persistent worker examples. The test requires a GitHub secret named BUILDBUDDY_API_KEY to hold the BuildBuddy token. I've tested the test script locally, but things may still fail under the GitHub Actions environment. Please let me know when the BuildBuddy account is set up and a token is added as a GitHub secret, then I can test and debug the CI configuration.

KapJI · 2024-10-29T16:18:19Z

I added BUILDBUDDY_API_KEY, which holds my 20-character API token, to our GitHub secrets. Does my BuildBuddy account need any extra setup?

aherrmann · 2024-10-29T16:51:01Z

@KapJI Thank you! That sounds great, I don't think it should require any additional configuration. I'll test and debug the CI setup and let you know if I run into anything.

aherrmann · 2024-10-30T15:44:43Z

One issue I'm encountering is that repository secrets are not exposed to GH actions runs that are initiated from forks (as is the case with this PR) for security reasons (see here). Here's what I've done now:

- I've added a CI step for the persistent worker steps.
- In there I check if the token is available or not. If not, I skip the remote execution tests and generate a GH actions notice annotation. So, by default the remote execution cases will only be checked on main CI or PRs coming from Meta engineers.
- I've added the workflow_dispatch trigger to the CI workflow. This should allow Meta engineers to manually trigger a CI run that has the token set (it will only be available once the workflow_dispatch trigger is merged), e.g. to test that an external PR doesn't break these tests.
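
The token check can be sketched in Python as below. This is a hedged illustration and not the test script from this PR: the function names are hypothetical, and it emits the notice through GitHub Actions' `::notice` workflow command on standard output, whereas the PR's script may produce the annotation differently.

```python
import os
import sys


def remote_tests_enabled(environ):
    """True if a non-empty BuildBuddy token is available."""
    return bool(environ.get("BUILDBUDDY_API_KEY", "").strip())


def notice(message, title="persistent worker tests"):
    """Format a GitHub Actions notice annotation (workflow command)."""
    return f"::notice title={title}::{message}"


def main(environ=os.environ, out=sys.stdout):
    if remote_tests_enabled(environ):
        print("running remote execution tests", file=out)
        return True
    # Secrets are not exposed to runs triggered from forks, so skip the
    # remote cases and leave a visible annotation instead of failing.
    print(notice("BUILDBUDDY_API_KEY is not set; skipping remote execution tests"), file=out)
    return False


if __name__ == "__main__":
    main()
```

Treating an empty value like a missing one matters here because GitHub Actions sets a secret-backed environment variable to an empty string when the secret is unavailable.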

@KapJI could I ask you to trigger a CI run of this PR from within the Buck2 repo to test the remote execution cases? (After convincing yourself that this PR doesn't do anything dodgy with the token).
You can do this by pulling this PR's branch and then pushing it to the facebook/buck2 repo (not onto the main branch, just as a separate branch). Something along the lines of gh pr checkout 787; git push origin persistent-remote-worker. I'd expect the push trigger to fire at that point, if not you may need to add a dummy commit.

facebook-github-bot · 2024-10-30T15:54:29Z

@KapJI has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

KapJI · 2024-10-30T16:25:10Z

@aherrmann It's running on #800

aherrmann · 2024-10-31T10:04:48Z

@KapJI Thanks! Unfortunately, it looks like the token is still not available:

- It is empty in the env section
- The RE tests are skipped.

Is BUILDBUDDY_API_KEY configured as a repository secret? Does gh secret list include it in its output?

aherrmann · 2024-11-11T09:10:40Z

@KapJI friendly ping, did you have a chance to look into the above?

facebook-github-bot added the CLA Signed label Sep 25, 2024

sluongng reviewed Sep 26, 2024

aherrmann force-pushed the persistent-remote-worker branch from f179b29 to f8c20ef on October 7, 2024 09:04

aherrmann added 13 commits October 29, 2024 15:43

Add example for remote persistent workers

a62ed89

Implement support for remote persistent workers

003b58d

Fix clippy error

0d4f1ea

fix run_display test

872064c

Use WorkerRunInfo.worker for remote persistent worker

dbb8d6a

Document the worker protocol in the example

2e448e0

Use proto_python_library rule

9890623

Standard BuildBuddy worker image

cc593fa

Remove the Nix flake

7a7d45d

Update the README and direnv configuration

e4e6667

aherrmann force-pushed the persistent-remote-worker branch from d9c7b9b to e4e6667 on October 29, 2024 14:59

aherrmann added 4 commits October 29, 2024 15:59

Shrink number of test targets

b4376e9

fix README instructions

88a399a

Add an automated test script

3f88885

Test persistent worker example on CI

0a7f496

fix typo

5d9b7cb

aherrmann added 6 commits October 30, 2024 11:15

Remove old Nix toolchain configuration file

722d948

close GH actions output groups

e503285

Generate GH actions annotations on missing token

3ea6a03

Document BuildBuddy token availability

6cf252e

Enable manual pipeline runs

7181c3d

Fix annotations file path

62b3f8a

aherrmann mentioned this pull request Nov 12, 2024

feature request - support remote persistent workers #776
