
proposal: add enhance mid-tier resource proposal #1762

Open · wants to merge 4 commits into main

Conversation

j4ckstraw (Contributor)

Ⅰ. Describe what this PR does

Ⅱ. Does this pull request fix one issue?

Ⅲ. Describe how to verify it

Ⅳ. Special notes for reviews

V. Checklist

  • I have written necessary docs and comments
  • I have added necessary unit tests and integration tests
  • All checks passed in make test

@koordinator-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign hormes after the PR has been reviewed.
You can assign the PR to them by writing /assign @hormes in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


codecov bot commented Nov 28, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (3784df1) 66.11% compared to head (cb3cf33) 66.11%.
Report is 11 commits behind head on main.

❗ Current head cb3cf33 differs from pull request most recent head e6b42c9. Consider uploading reports for the commit e6b42c9 to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #1762    +/-   ##
========================================
  Coverage   66.11%   66.11%            
========================================
  Files         388      390     +2     
  Lines       42425    42589   +164     
========================================
+ Hits        28048    28159   +111     
- Misses      12305    12346    +41     
- Partials     2072     2084    +12     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 66.11% <ø> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

docs/proposals/20231123-enhance-mid-tier-resource.md: 3 outdated review threads (resolved)

#### Story 1

There are low-priority online-service tasks whose performance requirements are the same as Prod+LS; they should not be suppressed, but can tolerate being evicted when machine usage spikes.
Member:

If the Mid pods are allowed to allocate the Prod unallocated resources, the Prod apps can be affected, since ProdPeak + MidAllocated > NodeAllocatable is possible. So to avoid this impact, there should be a design for when and how we let the Prod pods preempt/evict Mid pods when they want to win back the resources that are unallocated by Prod but allocated by Mid.

j4ckstraw (Contributor, author) replied on Dec 1, 2023:
Yes. Because Prod pods request cpu/memory while Mid pods request mid-cpu/mid-memory, Prod cannot preempt Mid directly.
How about evicting based on the mid-allocated / mid-allocatable ratio?
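A minimal sketch of what eviction keyed to that ratio could look like; the names and the threshold below are illustrative, not part of the proposal:

```go
// Sketch: trigger Mid-pod eviction when the node's mid-tier allocation ratio
// exceeds a configurable threshold. All names and the threshold are illustrative.
package main

import "fmt"

// shouldEvictMid reports whether Mid pods should start being evicted,
// based on how much of the node's mid-tier allocatable is already allocated.
func shouldEvictMid(midAllocated, midAllocatable, thresholdPercent int64) bool {
	if midAllocatable <= 0 {
		// No mid-tier capacity on this node; any allocation is over budget.
		return midAllocated > 0
	}
	return midAllocated*100 > midAllocatable*thresholdPercent
}

func main() {
	// Example: 1800m mid-cpu allocated out of 2000m allocatable, 80% threshold.
	fmt.Println(shouldEvictMid(1800, 2000, 80)) // true
}
```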

Signed-off-by: j4ckstraw <[email protected]>
@eahydra (Member) left a comment:

I have one more question: How does the scheduler's LoadAware Scheduling plugin support middle tiers?

docs/proposals/20231123-enhance-mid-tier-resource.md: 4 outdated review threads (resolved)

### Prerequisites

Koordinator node reservation must be used if someone wants to use Mid+LS.
Member:

Can you provide more details about the prerequisites? Why must Koordinator node reservation be used? Is it the Reservation described in this document (20221227-node-resource-reservation.md) or the Reservation defined by the Koordinator SLO?

**native resource or extended resource**

*native resource*:
Hijack the node update and change `node.Status.Allocatable`; Mid pods then also use native resources. In this situation, Mid is equivalent to a sub-priority within Prod, and resource quota needs adaptive modification.
Member:

I don’t quite understand the logic described in this paragraph. Why do we need to hijack node update?

j4ckstraw (Contributor, author):

Hijack the node update to add prod-reclaimable on top of the original `node.Status.Allocatable`.
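A rough sketch of what that could look like, assuming a webhook or controller intercepts the node status update; the prodReclaimable quantities and function name are illustrative:

```go
// Sketch: fold prod-reclaimable capacity back into node.Status.Allocatable
// before the update is persisted. Names and values are illustrative.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// addProdReclaimable returns a copy of the node's allocatable with the
// prod-reclaimable CPU and memory added on top of the original values.
func addProdReclaimable(node *corev1.Node, reclaimable corev1.ResourceList) corev1.ResourceList {
	out := node.Status.Allocatable.DeepCopy()
	for name, qty := range reclaimable {
		cur := out[name]
		cur.Add(qty)
		out[name] = cur
	}
	return out
}

func main() {
	node := &corev1.Node{}
	node.Status.Allocatable = corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("16"),
		corev1.ResourceMemory: resource.MustParse("64Gi"),
	}
	reclaimable := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("4"),
		corev1.ResourceMemory: resource.MustParse("8Gi"),
	}
	_ = addProdReclaimable(node, reclaimable) // cpu: 20, memory: 72Gi
}
```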

docs/proposals/20231123-enhance-mid-tier-resource.md: outdated review thread (resolved)
For Mid+BE pods, they may end up in the Burstable or even Guaranteed QoS class, disobeying the QoS-level policy.

*extended resource*:
Add mid-cpu/mid-memory and insert the extended-resource fields via webhook.
Member:

And here, do you want to add a new webhook plugin?

j4ckstraw (Contributor, author):

Nope.
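For context, a rough sketch of the extended-resource route described above, assuming the Mid resources are named kubernetes.io/mid-cpu and kubernetes.io/mid-memory following the existing batch-cpu/batch-memory pattern; the helper below is illustrative, not part of the proposal:

```go
// Sketch: what a Mid pod's container requests might look like after the
// existing webhook rewrites cpu/memory into mid-tier extended resources.
// The resource names follow the batch-cpu/batch-memory pattern and are
// assumptions here, not confirmed by this PR.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

const (
	MidCPU    corev1.ResourceName = "kubernetes.io/mid-cpu"    // counted in milli-cores
	MidMemory corev1.ResourceName = "kubernetes.io/mid-memory" // counted in bytes
)

// toMidResources rewrites native cpu/memory requests into mid-tier
// extended resources, as the webhook mutation described above would.
func toMidResources(req corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	if cpu, ok := req[corev1.ResourceCPU]; ok {
		out[MidCPU] = *resource.NewQuantity(cpu.MilliValue(), resource.DecimalSI)
	}
	if mem, ok := req[corev1.ResourceMemory]; ok {
		out[MidMemory] = mem.DeepCopy()
	}
	return out
}

func main() {
	req := corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse("500m"),
		corev1.ResourceMemory: resource.MustParse("1Gi"),
	}
	_ = toMidResources(req) // kubernetes.io/mid-cpu: 500, kubernetes.io/mid-memory: 1Gi
}
```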

docs/proposals/20231123-enhance-mid-tier-resource.md: outdated review thread (resolved)
Signed-off-by: j4ckstraw <[email protected]>
@j4ckstraw (Contributor, author):

I have one more question: How does the scheduler's LoadAware Scheduling plugin support middle tiers?

I need to think about it.


Let us look at the scenario without overselling.

If Prod and Mid pods share the same resource accounting, preemption is required for an upcoming Prod pod; koord-scheduler needs filter and preemption plugins to handle this.
Member:

Please clarify the specific rules of filter and preemption.
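Purely as an illustration of what such rules might look like (the actual filter and preemption semantics are left open in this thread), a sketch under the assumption that Prod and Mid draw from a single shared account; all names are hypothetical:

```go
// Sketch of one possible filter rule when Prod and Mid share accounting:
// a Prod pod fits only if its request plus everything already allocated
// (Prod + Mid) stays within the node allocatable; otherwise Mid pods
// become preemption candidates. Purely illustrative.
package main

import "fmt"

// fitsWithSharedAccounting reports whether a Prod pod of size request fits,
// and, if not, whether preempting Mid allocation would make it fit.
func fitsWithSharedAccounting(nodeAllocatable, prodAllocated, midAllocated, request int64) (fits, needsPreemption bool) {
	free := nodeAllocatable - prodAllocated - midAllocated
	if request <= free {
		return true, false
	}
	// Not enough free capacity; check whether reclaiming Mid allocation helps.
	return false, request <= free+midAllocated
}

func main() {
	fits, preempt := fitsWithSharedAccounting(16000, 12000, 3000, 2000)
	fmt.Println(fits, preempt) // false true: only 1000m free, reclaiming Mid would help
}
```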


**share resource account or not**

Let us look at the scenario without overselling.
Member:

What does overselling mean here?

**cpuShares**

Configured according to requests.mid-cpu:
- for Mid+LS, same as Prod+LS
Member:

I think both the pod-level and container-level of the Mid pods follow the same rule as the Batch extended resources if the pods allocate Mid extended resources. So this statement is confusing. I am not sure if you are talking about the QoS-level cgroups. We'd better either make a concise and clear expression or add a comprehensive diagram to clarify the design.
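If Mid does follow the same rule as the Batch extended resources, as suggested above, cpu.shares could be derived from requests.mid-cpu with the standard Kubernetes millicores-to-shares conversion; a minimal sketch, with illustrative names:

```go
// Sketch: derive cgroup cpu.shares from a container's mid-cpu request,
// assuming Mid follows the same rule as the Batch extended resources.
package main

import "fmt"

const (
	minShares     = 2    // kernel minimum for cpu.shares
	sharesPerCPU  = 1024 // cpu.shares granted per full CPU
	milliCPUToCPU = 1000
)

// midCPUToShares converts a mid-cpu request (counted in milli-cores,
// like the Batch extended resources) into a cpu.shares value.
func midCPUToShares(midCPUMilli int64) int64 {
	if midCPUMilli <= 0 {
		return minShares
	}
	shares := midCPUMilli * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

func main() {
	fmt.Println(midCPUToShares(500)) // 512 for a 500m mid-cpu request
}
```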

docs/proposals/20231123-enhance-mid-tier-resource.md: outdated review thread (resolved)
**CPU Eviction**

CPU eviction is currently linked to pod satisfaction.
In the long term, however, it should be done from the perspective of the operating system, like memory eviction.
Member:

Could you please provide more information on why and how the CPU eviction would be done from the perspective of the OS?


Eviction is sorted by priority and resource model:
- Batch first and then Mid.
- Mid+LS first and then Mid+BE; for Mid pods, request and usage should be taken into account when evicting, for fairness reasons.
Member:

Mid+LS first and then Mid+BE

It may not work well if you deploy online services as Mid+LS pods and deploy stream-computing jobs as Mid+BE. Think about the online service pods being evicted earlier than the job pods. Though the Mid+BE pods can be suppressed to reduce interference with Prod resources, it cannot be a reason for Mid+LS to be a lower priority in eviction. Please avoid unnecessary coupling of the priority and QoS if there is no proper design.

j4ckstraw (Contributor, author):

Got it, thank you for your advice.

Signed-off-by: j4ckstraw <[email protected]>
@ZiMengSheng (Contributor):

/milestone 1.5

@koordinator-bot:

@ZiMengSheng: The provided milestone is not valid for this repository. Milestones in this repository: [someday, v1.4, v1.5]

Use /milestone clear to clear the milestone.

In response to this:

/milestone 1.5

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ZiMengSheng (Contributor):

/milestone v1.5

@koordinator-bot added this to the v1.5 milestone on Jan 2, 2024.
At the moment we need to change this to:

```
Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
```
Member:

According to my understanding, adding Unallocated here is to allow Mid+LS to also use unallocated resources in the cluster, but there is a problem: using unallocated resources will affect the view of Prod resources, and we would ultimately need to support Prod's preemption of Mid, which in turn affects the stability of Mid resources.

We are also considering applying node prediction to the amplification factor of the Node, so that Prod can be directly oversold and the Quota management and priority preemption stay consistent with the native semantics of k8s.
Does this satisfy your Mid + LS need?

j4ckstraw (Contributor, author):

Consider such a scenario: a user deployed a Mid+LS pod while the cluster's Mid resources were insufficient, so the cluster autoscaler scaled out one new node.
The problem is that there are no Prod pods on the new node, so no Mid resources are available either. That is why we want to allow Mid+LS to use unallocated resources in the cluster.
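A small numeric sketch of the proposed formula in that scenario; the function name and all quantities are illustrative:

```go
// Sketch: Allocatable[Mid] on a freshly scaled-out node, per the proposed
// formula. All numbers are illustrative.
package main

import "fmt"

func midAllocatable(reclaimable, nodeAllocatable, unallocated, thresholdRatio float64) float64 {
	capped := reclaimable
	if limit := nodeAllocatable * thresholdRatio; capped > limit {
		capped = limit
	}
	return capped + unallocated
}

func main() {
	// New node, no Prod pods yet: Reclaimable[Mid] = 0, but the node is
	// entirely unallocated, so Mid+LS can still be scheduled there.
	fmt.Println(midAllocatable(0, 16000, 16000, 0.5)) // 16000 (milli-cores)
}
```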
