
refactor: reduce memory usage #596

Merged: 7 commits from the reduce-memory-usage branch merged into kubewarden:main on Dec 14, 2023

Conversation

flavio
Member

@flavio flavio commented Nov 23, 2023

Reduce the memory consumption of Policy Server when multiple instances of the same Wasm module are loaded.

Thanks to this change, a worker will have only one instance of PolicyEvaluator (and hence of the wasmtime stack) per unique Wasm module.

For example, if a user has the apparmor policy deployed 5 times (with different names, settings, ...), only one instance of PolicyEvaluator will be allocated for it.

Warning: the optimization works at the worker level, meaning that workers do NOT share these PolicyEvaluator instances with each other.

This commit helps to address issue kubewarden/kubewarden-controller#528

Note 1: this depends on kubewarden/policy-evaluator#390
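
To make the deduplication concrete, here is a minimal sketch of the idea, not the actual PR code: the `Worker` struct, the `evaluator_for` helper, and the choice of keying the cache by a module digest are assumptions made for illustration; the real types live in the policy-evaluator and policy-server crates.

```rust
use std::collections::HashMap;

// Stand-in for the real policy-evaluator type; just enough to make the sketch compile.
struct PolicyEvaluator;

// One worker owns its own cache: evaluators are NOT shared across workers.
struct Worker {
    // Key: something that uniquely identifies the Wasm module (e.g. its digest).
    evaluators: HashMap<String, PolicyEvaluator>,
}

impl Worker {
    fn evaluator_for(&mut self, module_key: &str) -> &mut PolicyEvaluator {
        // Five policies backed by the same module hit the same entry,
        // so only one wasmtime stack is allocated per unique module.
        self.evaluators
            .entry(module_key.to_owned())
            .or_insert_with(|| PolicyEvaluator)
    }
}
```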

Benchmark data

I've collected data about the amount of memory consumed by one instance of Policy Server. The test was done with the following policies loaded.

policies.yml
pod-privileged-protect-mode:
  url: ghcr.io/kubewarden/tests/pod-privileged:v0.2.1
  policyMode: protect
  allowedToMutate: false
  settings: {}

pod-privileged-protect-mode2:
  url: ghcr.io/kubewarden/tests/pod-privileged:v0.2.1
  policyMode: protect
  allowedToMutate: false
  settings: {}

pod-privileged-protect-mode3:
  url: ghcr.io/kubewarden/tests/pod-privileged:v0.2.1
  policyMode: protect
  allowedToMutate: false
  settings: {}

pod-privileged-protect-mode4:
  url: ghcr.io/kubewarden/tests/pod-privileged:v0.2.1
  policyMode: protect
  allowedToMutate: false
  settings: {}

pod-privileged-monitor-mode:
  url: ghcr.io/kubewarden/tests/pod-privileged:v0.2.1
  policyMode: monitor
  allowedToMutate: false
  settings: {}

sleep:
  url: ghcr.io/kubewarden/tests/sleeping-policy:v0.1.0
  policyMode: protect
  allowedToMutate: false
  settings:
    sleepMilliseconds: 2000

disallow-service-loadbalancer:
  url: ghcr.io/kubewarden/tests/disallow-service-loadbalancer:v0.1.5
  policyMode: protect
  allowedToMutate: false

flux:
  url: ghcr.io/kubewarden/tests/go-wasi-template:v0.1.0
  policyMode: protect
  allowedToMutate: true
  settings:
    requiredAnnotations:
      "fluxcd.io/cat": "felix"

flux2:
  url: ghcr.io/kubewarden/tests/go-wasi-template:v0.1.0
  policyMode: protect
  allowedToMutate: true
  settings:
    requiredAnnotations:
      "fluxcd.io/cat": "felix"

verify1:
  url: ghcr.io/kubewarden/test/verify-image-signatures:v0.2.8
  policyMode: protect
  allowedToMutate: true
  settings:
    signatures:
      - image: "*"
        githubActions:
          owner: "kubewarden"
          repo: "app-example"

verify2:
  url: ghcr.io/kubewarden/test/verify-image-signatures:v0.2.8
  policyMode: protect
  allowedToMutate: true
  settings:
    signatures:
      - image: "*"
        githubActions:
          owner: "kubewarden"
          repo: "app-example"

verify3:
  url: ghcr.io/kubewarden/test/verify-image-signatures:v0.2.8
  policyMode: protect
  allowedToMutate: true
  settings:
    signatures:
      - image: "*"
        githubActions:
          owner: "kubewarden"
          repo: "app-example"

verify4:
  url: ghcr.io/kubewarden/test/verify-image-signatures:v0.2.8
  policyMode: protect
  allowedToMutate: true
  settings:
    signatures:
      - image: "*"
        githubActions:
          owner: "kubewarden"
          repo: "app-example"

This configuration file will load 13 policies:

  • 4 instances of the "verify signatures" policy
  • 5 instances of the "pod privileged" policy
  • 2 instances of the go-wasi template policy (remember: this is a 20 MB policy!)
  • 1 rego policy
  • 1 regular rust policy

Measurement process

The Policy Server has been started with an increasing number of workers, from 1 up to 8.
For each worker count, 10 memory samples have been taken.
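
The PR does not say which tool was used for the sampling. Purely as an illustration, here is one way to sample the resident memory (VmRSS) of a running Policy Server process on Linux; the function name and the MB conversion are my own assumptions.

```rust
use std::fs;

// Return the resident set size (VmRSS) of a process in MB, read from /proc.
fn rss_mb(pid: u32) -> Option<f64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    let kib: f64 = status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))? // e.g. "VmRSS:   123456 kB"
        .split_whitespace()
        .nth(1)?
        .parse()
        .ok()?;
    Some(kib / 1024.0)
}
```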

The table below shows the data collected.

| Workers | Current (MB) | Optimized (MB) | Improvement % | Current StdDev | Optimized StdDev |
|--------:|-------------:|---------------:|--------------:|---------------:|-----------------:|
| 1 | 621.00 | 530.38 | -14.59 | 30.86 | 23.26 |
| 2 | 807.35 | 586.15 | -27.40 | 39.24 | 18.24 |
| 3 | 983.33 | 649.17 | -33.98 | 23.23 | 30.79 |
| 4 | 1172.28 | 752.59 | -35.80 | 30.85 | 27.50 |
| 5 | 1358.67 | 831.02 | -38.84 | 22.24 | 27.58 |
| 6 | 1535.16 | 914.64 | -40.42 | 30.53 | 21.74 |
| 7 | 1694.57 | 980.81 | -42.12 | 26.48 | 17.79 |
| 8 | 1897.01 | 1054.93 | -44.39 | 27.44 | 22.69 |

This is a chart that shows the overall memory reduction brought by this change:

[chart: memory optimization]

@flavio
Member Author

flavio commented Nov 23, 2023

I've rebased against the latest changes merged into the main branch.

Contributor

@fabriziosestito fabriziosestito left a comment


LGTM

Member

@viccuad viccuad left a comment


LGTM! Top!

@viccuad viccuad changed the title from "reduce memory usage" to "refactor: reduce memory usage" on Nov 24, 2023
@flavio
Member Author

flavio commented Nov 24, 2023

I've merged the changes into the main branch of policy-evaluator and updated this PR to consume policy-evaluator from a specific commit of the original repository.

I would prefer not to tag a new version of policy-evaluator right now. @fabriziosestito is going to open a new PR against policy-evaluator that will remove the majority of the locks. I would prefer to have his changes merged into the policy-evaluator main branch first, and only then tag a new version of the crate.

@flavio
Member Author

flavio commented Nov 24, 2023

Moving to blocked: waiting for the next, lock-free iteration of policy-evaluator before merging this PR.

@flavio flavio self-assigned this Nov 28, 2023
This allows a better organization of the code

Signed-off-by: Flavio Castelli <[email protected]>
This allows us to uniquely identify a precompiled wasm module

Signed-off-by: Flavio Castelli <[email protected]>
This is also going to be useful for performing the k6 tests.

Signed-off-by: Flavio Castelli <[email protected]>
Reduce the memory consumption of Policy Server when multiple instances
of the same Wasm module are loaded.

Thanks to this change, a worker will have only one instance of
`PolicyEvaluator` (and hence of the wasmtime stack) per unique Wasm module.

For example, if a user has the `apparmor` policy deployed 5 times
(with different names, settings, ...), only one instance of `PolicyEvaluator`
will be allocated for it.

Note: the optimization works at the worker level, meaning that workers
do NOT share these `PolicyEvaluator` instances with each other.

This commit helps to address issue kubewarden/kubewarden-controller#528

Signed-off-by: Flavio Castelli <[email protected]>
@flavio
Member Author

flavio commented Dec 12, 2023

@fabriziosestito , @jvanz , @viccuad : I've rebased my changes against the main branch. I still have to look into 3 integration tests that are currently failing, but you can start reviewing the changes.

Contributor

@fabriziosestito fabriziosestito left a comment


LGTM, just a couple of comments added

overwrite: false
supplemental_groups:
  rule: RunAsAny
  overwrite: false
Contributor


Why is this needed? We are going to remove the venom tests soonish.

Member Author


I'm using them inside the k6 tests. This file should definitely be moved to the load-testing repo once we remove venom.

src/workers/evaluation_environment.rs (review thread resolved)
@flavio flavio force-pushed the reduce-memory-usage branch 2 times, most recently from 94873a1 to 5bf6b3e on December 13, 2023 08:27
src/workers/pool.rs (outdated review thread, resolved)
This commit significantly changes the internal architecture of Policy
Server workers.

This code takes advantage of the `PolicyEvaluatorPre` structure defined by
the latest `policy-evaluator` crate.
`PolicyEvaluatorPre` has different implementations, one per type of
policy we support (waPC, Rego, WASI). Under the hood it holds a
`wasmtime::InstancePre` instance that is used to quickly spawn a
WebAssembly environment.

`PolicyEvaluatorPre` instances are managed by the `EvaluationEnvironment`
structure. `EvaluationEnvironment` takes care of deduplicating the WebAssembly
modules defined inside of the `policies.yml` file, ensuring only one
`PolicyEvaluatorPre` is created per unique Wasm module.

The `EvaluationEnvironment` struct provides `validate` and
`validate_settings` methods. These methods create a fresh `PolicyEvaluator`
instance by rehydrating its `PolicyEvaluatorPre` instance. Once the
WebAssembly evaluation is done, the `PolicyEvaluator` instance is
discarded.
This is a big change compared to the previous approach, where each WebAssembly
instance was a long-lived object.

This new architecture assures that each evaluation is done inside of a freshly
created WebAssembly environment, which guarantees:
- Policies leaking memory have a smaller impact on the memory
  consumption of the Policy Server process
- Policy evaluation always starts with a clean slate, which helps prevent
  bugs caused by policies that are not written to be stateless

In terms of memory optimizations, the `EvaluationEnvironment` is now an
immutable object. That allows us to have one single instance of `EvaluationEnvironment`
shared across all the worker threads, without using mutexes or locks.
This significantly reduces the amount of memory required by the Policy Server instance,
without impacting system performance.

As an added bonus, a lot of code has been simplified during this
transition. More code can be removed by future PRs.

Signed-off-by: Flavio Castelli <[email protected]>
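
A minimal sketch of the flow described in the commit message above, using stand-in types: the real `PolicyEvaluatorPre`, `EvaluationEnvironment`, and their method signatures live in the policy-evaluator and policy-server crates and differ from this; the two-map layout and the `rehydrate`/`validate` names are assumptions made for illustration.

```rust
use std::collections::HashMap;

// Stand-ins for the real policy-evaluator types, just enough to compile.
// The real PolicyEvaluatorPre wraps a wasmtime::InstancePre under the hood.
struct PolicyEvaluatorPre;
struct PolicyEvaluator;

impl PolicyEvaluatorPre {
    // Rehydrate a fresh, short-lived Wasm environment from the pre-instantiated module.
    fn rehydrate(&self) -> PolicyEvaluator {
        PolicyEvaluator
    }
}

// Immutable after construction, so it can be shared by all workers without locks.
struct EvaluationEnvironment {
    // Many policies can point at the same Wasm module...
    policy_to_module: HashMap<String, String>,
    // ...but only one PolicyEvaluatorPre exists per unique module.
    module_to_pre: HashMap<String, PolicyEvaluatorPre>,
}

impl EvaluationEnvironment {
    fn validate(&self, policy_id: &str, _request: &str) -> Result<bool, String> {
        let module = self
            .policy_to_module
            .get(policy_id)
            .ok_or_else(|| format!("policy not found: {policy_id}"))?;
        let pre = self
            .module_to_pre
            .get(module)
            .ok_or_else(|| format!("module not found: {module}"))?;
        // A fresh evaluator is built for this single request and dropped right
        // after, so the memory of the Wasm sandbox is released.
        let _evaluator = pre.rehydrate();
        // ... the actual Wasm evaluation would happen here ...
        Ok(true)
    }
}
```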
Introduce custom error types for `EvaluationEnvironment` and make use of
them to properly identify policies that cannot be found.

This is required to properly return a 404 response inside of the UI.

This regression was detected by the integration tests.

Signed-off-by: Flavio Castelli <[email protected]>
@flavio
Member Author

flavio commented Dec 13, 2023

@fabriziosestito , @jvanz , @viccuad : you can take another look at the PR. I've applied the changes you requested with the latest commits.

Member

@viccuad viccuad left a comment


🚀

@flavio
Member Author

flavio commented Dec 14, 2023

Before merging this PR I want to summarize what happened.

Optimization number 1

The PR started with one optimization technique: ensure each worker of Policy Server (where a worker is a thread dedicated to Wasm evaluation) has just one instance of PolicyEvaluator (a WebAssembly sandbox) per WebAssembly module. That means that if a user has the same policy defined 5 times, each time with a different configuration, only one PolicyEvaluator will be created inside a worker.
However, that still means the same PolicyEvaluator is duplicated across workers, which explains why memory consumption is directly proportional to the number of workers.

Optimization number 2

This is an iteration on the previous one. We drastically changed the way we manage WebAssembly runtimes: we now rely on wasmtime::InstancePre to keep a deduplicated list of all the policies defined by the user.
This collection of instances is shared across all the workers, which leads to huge memory savings.

In addition to that, we changed the evaluation mode to be "on demand". That means that, once an AdmissionReview is received, a fresh PolicyEvaluator is created. The creation time is significantly reduced by using wasmtime::InstancePre.
Once the evaluation is done, the PolicyEvaluator instance is discarded. This leads to the deallocation of the memory used by the WebAssembly sandbox wrapped by PolicyEvaluator.

This architecture has two main advantages:

  • When dealing with policies that leak memory: reduce the impact they have on the long running Policy Server process
  • Ensure each evaluation starts with a clean slate. This prevents bugs caused by policies that mistakenly leave stale state behind between evaluations
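
As an illustration of the lock-free sharing described above (the names and setup here are mine, not the PR's code): because the environment is never mutated after startup, a plain `Arc` is enough to hand it to every worker thread.

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for the immutable evaluation environment described above.
struct EvaluationEnvironment;

fn main() {
    // Built once at startup; shared read-only by every worker, no Mutex/RwLock needed.
    let env = Arc::new(EvaluationEnvironment);

    let workers: Vec<_> = (0..4)
        .map(|_| {
            let env = Arc::clone(&env);
            thread::spawn(move || {
                // Each worker would evaluate incoming AdmissionReviews against `env`.
                let _ = &env;
            })
        })
        .collect();

    for worker in workers {
        worker.join().unwrap();
    }
}
```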

Benchmarks

I've updated the table and the chart shown when the PR was opened to also include the numbers of the second, and final, optimization.

The benchmarks have been done on the same machine, with the same set of policies and the same measurement process.

| Workers | Current (MB) | Optimized #1 (MB) | Optimized #2 (MB) | Improvement #1 % | Improvement #2 % | Current StdDev | Optimized #1 StdDev | Optimized #2 StdDev |
|--------:|-------------:|------------------:|------------------:|-----------------:|-----------------:|---------------:|--------------------:|--------------------:|
| 1 | 621.00 | 530.38 | 526.073 | -14.59 | -15.29 | 30.86 | 23.26 | 20.15 |
| 2 | 807.35 | 586.15 | 535.173 | -27.40 | -33.71 | 39.24 | 18.24 | 13.49 |
| 3 | 983.33 | 649.17 | 532.808 | -33.98 | -45.82 | 23.23 | 30.79 | 17.44 |
| 4 | 1172.28 | 752.59 | 529.324 | -35.80 | -54.85 | 30.85 | 27.50 | 21.26 |
| 5 | 1358.67 | 831.02 | 536.239 | -38.84 | -60.53 | 22.24 | 27.58 | 25.76 |
| 6 | 1535.16 | 914.64 | 533.919 | -40.42 | -65.22 | 30.53 | 21.74 | 18.04 |
| 7 | 1694.57 | 980.81 | 524.528 | -42.12 | -69.05 | 26.48 | 17.79 | 19.15 |
| 8 | 1897.01 | 1054.93 | 541.903 | -44.39 | -71.43 | 27.44 | 22.69 | 9.08 |

This is the updated chart:

[chart: policy-server-memory-opt2]

As you can see, the memory consumption has been significantly reduced. Moreover, the memory usage stays constant regardless of the number of workers instantiated.

Adapt the code to the new API exposed by the latest version of
policy-evaluator.

On top of that, add some extra unit tests.

Signed-off-by: Flavio Castelli <[email protected]>
@flavio flavio merged commit fee9dd7 into kubewarden:main Dec 14, 2023
8 checks passed
@flavio flavio deleted the reduce-memory-usage branch December 14, 2023 18:13