[spike] Gather info for Vai Testing #1276

Closed
git-ival opened this issue Apr 10, 2024 · 8 comments

git-ival commented Apr 10, 2024

  • What framework(s) are expected to be used here?
    • Is there any known pre-existing code that can help in the test effort?
    • What functionality is needed in order to automate this testing as part of a regression suite?
  • Is there a particular cluster configuration that should be targeted for this testing?
    • HA, single-node, # of CPUs, amount of RAM, K8s distro, K8s version, etc.
    • Cloud provider (AWS, Azure, etc?)
  • What metrics must be tracked?
    • What are the cutoff points for each metric that needs to be tracked? (When should we mark a given test as pass/fail based on a given metric's performance?)
    • What tool(s) can be used to track the listed metrics? If Prometheus/Grafana: what query(ies)/dashboards should be used for tracking?
  • What benchmark testing, if any, needs to be accounted for?
    • # of clusters, # of nodes, # of nodes per cluster, etc.
    • # of rolebindings, # of secrets, # of namespaces, etc.
  • Cache testing - tbd
git-ival changed the title from [spike] Gather info on Vai Testing to [spike] Gather info for Vai Testing on Apr 10, 2024

git-ival commented Apr 11, 2024

UI Benchmark Considerations:

  • Assuming a cluster with 80,000 ConfigMaps with 1MiB payload each
  • First load of ConfigMap page should happen within 2.5 seconds
  • Changing to a different page should happen within 1 second
    • Same for changing filtering
  • Not more than 750MiB RAM per Browser Tab

We will need to rely on Cypress or a similar tool in order to perform browser-based frontend tests. We hope to leverage the rancher/dashboard test framework wherever we can for this effort.

Loading up the cluster with ConfigMaps can be done via shepherd or a relatively simple bash script; this process will need to be batched.
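
For illustration only, here is a rough, hypothetical sketch of the batched loading step using client-go rather than shepherd or a bash script. The names, namespace and batch size are placeholders; the 80,000 x 1MiB figures simply mirror the assumption above.

// Hypothetical sketch: bulk-create ConfigMaps in batches, mirroring the
// 80,000 x 1MiB assumption above. Names, namespace and batch size are placeholders.
package main

import (
	"context"
	"fmt"
	"strings"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	payload := strings.Repeat("x", 1<<20) // ~1MiB of data per ConfigMap
	const total, batchSize = 80000, 500

	for i := 0; i < total; i++ {
		cm := &corev1.ConfigMap{
			ObjectMeta: metav1.ObjectMeta{
				Name:      fmt.Sprintf("vai-test-cm-%06d", i),
				Namespace: "default",
			},
			Data: map[string]string{"payload": payload},
		}
		if _, err := client.CoreV1().ConfigMaps("default").Create(context.TODO(), cm, metav1.CreateOptions{}); err != nil {
			fmt.Printf("create %d failed: %v\n", i, err)
		}
		// Log progress (and optionally pause) at each batch boundary to avoid
		// hammering the API server.
		if (i+1)%batchSize == 0 {
			fmt.Printf("created %d/%d ConfigMaps\n", i+1, total)
		}
	}
}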

Steve API Benchmark Considerations:

  • Paginated ConfigMaps must be returned from the API at a rate of 100 Resources/500ms (not including network latency and transfer time)
  • (Nice-to-have) Modify increasing amounts of objects until reaching the "limit" of objects that can be updated while remaining at <1 second page load times

Ideally we can automate these sooner rather than later; if push comes to shove, we can do the 1st run "manually".

We can utilize k6 for bullet 1 and bullet 2. Bullet 2 is more complex, as verification would require kicking off a Cypress test that confirms page load stays under 1 second at each step of the way.

git-ival commented Apr 17, 2024

Found an old golang library for ingesting JUnit XML reports: https://github.com/joshdk/go-junit
This library could be useful for parsing Cypress results in dartboard to determine a final pass/fail.

Cypress can output a JUnit XML report: https://docs.cypress.io/guides/tooling/reporters
k6 does not support outputting JUnit XML reports, but it does support outputting as JSON (https://k6.io/docs/results-output/real-time/json/).
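
As a purely illustrative sketch of that final pass/fail step (the report path and exit behaviour are assumptions, not an agreed design), go-junit could be wired up along these lines:

// Hypothetical sketch: derive an overall pass/fail from a Cypress JUnit XML
// report using github.com/joshdk/go-junit. The report path is a placeholder.
package main

import (
	"fmt"
	"os"

	junit "github.com/joshdk/go-junit"
)

func main() {
	suites, err := junit.IngestFile("cypress/results/junit.xml")
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to ingest report: %v\n", err)
		os.Exit(1)
	}

	failed := 0
	for _, suite := range suites {
		for _, test := range suite.Tests {
			if test.Status == junit.StatusFailed || test.Status == junit.StatusError {
				failed++
				fmt.Printf("FAIL: %s / %s\n", suite.Name, test.Name)
			}
		}
	}
	if failed > 0 {
		fmt.Printf("%d Cypress test(s) failed\n", failed)
		os.Exit(1)
	}
	fmt.Println("all Cypress tests passed")
}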

moio commented Apr 23, 2024

Organizational notes:

  • I suggest we spin two tasks off this spike: implementation of a backend and a frontend benchmark. Then tackle them in order. Reasons:
    1. frameworks and code will necessarily be different
    2. there is no way frontend will ever pass until backend passes first
    3. backend is much more likely to fail than frontend, as we are lifting complexity from frontend and moving it to backend
    4. as a group, we currently have less know-how in frontend benchmarking than in backend benchmarking, so the two even carry different risk levels from a project perspective

I suggest you create two separate issues and discuss frameworks, test setup, metrics and criteria separately.

moio commented Apr 23, 2024

Backend test notes

Cluster setup

  • 3 server nodes in HA, 16 vCPUs, 64 GiB RAM, one local or fast network SSD (eg. AWS's EBS gp3 type volumes) each
  • external etcd cluster running on 3 servers with 8 vCPUs, 32 GiB RAM and locally attached NVMe SSDs (eg. AWS's m6gd.2xlarge)
  • Local cluster distribution is the latest supported RKE2 or latest supported K3s. RKE1 is explicitly out of scope

That is a monster setup; I expect this benchmark to actually pass with way less hardware - and in any case, 95% of the development should be carried out on a smaller setup, with hardware maxing happening only as a last step, if needed.

A starting point could be: AWS: 3 nodes, 4 vCPUs, 16 GiB of RAM each (eg. t3a.xlarge) with 50 GiB EBS gp3 root volumes, on latest supported RKE2, internal etcd cluster.

(I have no problem with doing development with k3d on your laptop if that is more convenient - then re-running on the above "light" setup, and leaving the "heavy monster" setup as a last option only if all else fails.)

Repeating the test on k3s is relatively unimportant and can be left for a later point, eg. after the browser tests are complete.

If you need any other details about setup please ask.

Benchmarking criteria notes

  • according to the PD&O "All time targets are intended as 95-percentile over an adequate number of repetitions."
  • a starting point for "adequate" can be 30 repetitions. We can look at variance after the fact to tweak that number (if variance is low, it can safely be reduced)

What you really care about is that asking for a page (100 resources) consistently stays below half a second in HTTP request duration (see below) in 95% of cases - no matter the sorting, filtering, resource type and size. You should also see how well that number scales as the number of virtual users grows - the minimum being 20 users making 1 request every 5 seconds.
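
For timings gathered outside k6 (eg. Cypress page-load measurements repeated ~30 times), the pass/fail check boils down to comparing the 95th percentile of the samples against the target. A minimal, self-contained sketch, not from this thread, with illustrative names:

// Minimal sketch: nearest-rank 95th percentile over repeated samples,
// compared against a target in milliseconds. Purely illustrative.
package benchmark

import (
	"math"
	"sort"
)

// p95 returns the 95th-percentile (nearest-rank) of the samples, in milliseconds.
func p95(samplesMs []float64) float64 {
	sorted := append([]float64(nil), samplesMs...)
	sort.Float64s(sorted)
	if len(sorted) == 0 {
		return 0
	}
	idx := int(math.Ceil(0.95*float64(len(sorted)))) - 1
	if idx < 0 {
		idx = 0
	}
	return sorted[idx]
}

// passes reports whether the measured repetitions meet the given target,
// e.g. 500ms for Steve page requests or 1000ms for UI page changes.
func passes(samplesMs []float64, targetMs float64) bool {
	return p95(samplesMs) < targetMs
}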

As a second objective, add virtual users who concurrently change the ConfigMaps and see how performance degrades as more virtual users change them (this is more exploratory, we can set a pass/fail limit when we see the first results).

Metrics tracking: as a first step, make sure relevant stats are recorded in Qase (eg. p(95) expected: under 500ms, actual: 234 ms, test PASS). Full k6 output is a nice-to-have. Grafana tracking can be added later.

Framework choice notes

  • k6 makes working with the above stats easy, as it computes percentiles and divides them between download and processing time by default
  • is it possible for a shepherd-based test, integrated in one of the regularly run testsuites, to shell out to k6? Do we have a setup in which the node running shepherd+k6 is on the same network as the cluster under test, and it is decently sized hardware-wise (k6 can generate quite some load)? Is it easy to do that in, eg., AWS?
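
On the first question: shelling out to k6 from a Go test is straightforward in principle. Below is a minimal, hypothetical sketch not tied to shepherd's actual API; the script name, output path and use of --summary-export are assumptions to verify against the k6 version in use.

// Hypothetical sketch: a Go test (e.g. one driven by shepherd) shelling out to k6
// and leaving an aggregated JSON summary behind for pass/fail evaluation.
package benchmark

import (
	"os"
	"os/exec"
	"testing"
)

func TestStevePaginationBenchmark(t *testing.T) {
	// --summary-export writes end-of-test aggregates (including a p(95) for
	// http_req_duration) to a JSON file that can be parsed afterwards.
	cmd := exec.Command("k6", "run", "--summary-export=summary.json", "steve_pagination.js")
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Run(); err != nil {
		t.Fatalf("k6 run failed: %v", err)
	}
	// summary.json can then be unmarshalled in Go, along the lines of the
	// snippet in the implementation notes below, to enforce the 500ms p(95) target.
}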

Implementation notes

  • parsing k6 JSON output is easy. What you need is something like:
type Metrics struct {
	HTTPReqDuration struct {
		Values struct {
			P95 float64 `json:"p(95)"`
		} `json:"values"`
	} `json:"http_req_duration"`
}

...
	// Read the whole k6 JSON output and extract the p(95) request duration
	// (requires the "encoding/json" and "io" imports).
	bytes, err := io.ReadAll(jsonFile)
	if err != nil {
		// handle error
	}

	var result Metrics
	if err := json.Unmarshal(bytes, &result); err != nil {
		// handle error
	}
  • here you have, as a starting point, a k6 script that we generally use for read benchmarks at customers. It already supports the new Steve pagination style (and it also supports Norman, which you can easily and safely drop). Feel free to reach out to me when you have the infra running and need guidance to go deeper on benchmark specifics

richard-cox commented Apr 23, 2024

UI side, it's important to note that the new vai-backed API and its features will be used

  • In eventually all resource lists via server-side pagination
  • In multiple different places to remove times when the UI fetches ALL of a resource
  • Generally for all steve based API requests, regardless of filtering / sorting / pagination

This effort is tracked in rancher/dashboard#8527 and will be partially complete in 2.9.0 (as described in Server-Side Pagination - 2.9.0 State / Solution). That doc also has a rough spec for QA.

@git-ival

@moio Regarding the upstream cluster setup for vai testing, should we model the same config? Example: 20 projects, 1000 Secrets, 5 users, 10 roles, 50 workload pods, etc.

moio commented Apr 26, 2024

@git-ival FMPOV, not necessarily. To me, they could just as well be empty or almost empty (as empty as a default installation is).

What you will need is tens of thousands of the specific resource under test (eg. ConfigMaps if you are testing the ConfigMaps page, Secrets if it is Secrets, and so on) on the cluster under test (upstream and at least one downstream should be tested, because the affected Steve code is in both). But in principle, other resources should not matter.
