proposal: add enhance mid-tier resource proposal #1762

Open
wants to merge 4 commits into
base: main
253 changes: 253 additions & 0 deletions docs/proposals/20231123-enhance-mid-tier-resource.md
@@ -0,0 +1,253 @@
---
title: Enhance Mid-tier resources
authors:
- "@j4ckstraw"
- "@jiasheng55"
reviewers:
- "@zwzhang0107"
- "@hormes"
- "@eahydra"
- "@FillZpp"
- "@jasonliu747"
creation-date: 2023-11-23
last-updated: 2023-11-28
status: implementable
see-also:
- "/docs/proposals/20230613-node-prediction.md"
---

# Enhance Mid-tier resources

## Table of contents

<!--ts-->
- [Enhance Mid-tier resources](#enhance-mid-tier-resources)
- [Table of contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-goals](#non-goals)
- [Proposal](#proposal)
- [Terminology](#terminology)
- [User stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Prerequisites](#prerequisites)
- [Design principles](#design-principles)
- [Implementation details/Notes/Constraints](#implementation-detailsnotesconstraints)
- [Cgroup basic configuration](#cgroup-basic-configuration)
- [QoS policy](#qos-policy)
- [Node QoS](#node-qos)
- [Risks and mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
- [Upgrade strategy](#upgrade-strategy)
- [Additional details](#additional-details)
- [Implementation history](#implementation-history)
<!--te-->

## Summary

*Mid-tier resources* are proposed to improve node utilization while avoiding overload. They rely on [node prediction](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/20230613-node-prediction.md).

*Node prediction* clarifies how the Mid-tier resources are calculated from the prediction, while this proposal clarifies the Mid-tier cgroup and QoS design and the suppression and eviction policy.

## Motivation

This proposal introduces Mid+LS and Mid+BE to close the gap between Prod+LS and Batch+BE and to fulfil the requirements of different task types.

### Goals

- How to calculate and update the mid-tier resource amount of a node.
- Clarification of Mid-tier cgroup and QoS
- Clarification of Mid-tier suppression and eviction policy
- Necessary optimization of the scheduler/descheduler

### Non-goals

- Replace Batch-tier resources
- Add new QoS type

## Proposal

### Terminology

The following concepts are used in this proposal:
1. koord-QoS

Quality of Service. We assume that pods at the same QoS level have similar runtime performance and quality.

2. koord-priority

Scheduling priority. By default, high-priority pods can preempt low-priority pods.

3. Resource models

There are currently four resource models: prod, mid, batch and free. The resource model determines whether a resource is oversold and whether it is stable, which affects pod eviction.

Note that Koordinator binds koord-priority to the resource model: different priorities have different resource models.

### User stories

#### Story 1

Some low-priority, latency-sensitive service tasks have the same performance requirements as Prod+LS. They should not be suppressed, but they can tolerate being evicted as machine utilization increases.

Mid+LS can handle them.

#### Story 2

There are latency-insensitive tasks such as AI or stream computing, e.g. Apache Spark, that can consume a lot of resources. They require stable resources: they can be suppressed but do not want to be evicted.

Mid+BE can handle them.

### Prerequisites

Anyone who wants to use Mid+LS must use koordinator node reservation.

Member

Can you provide more details about the prerequisites? Why is koordinator node reservation required? Is it the Reservation described in this document (20221227-node-resource-reservation.md) or the Reservation defined by the Koordinator SLO?


### Design principles

**QoS**

By default, pods with the same QoS should perform similarly: Prod+LS and Mid+LS should be *basically* the same, and the same applies to Mid+BE and Batch+BE.

The QoS policy must be adapted case by case. In general, when the QoS is the same and only the priority differs, large differences are unnecessary: beyond items such as eviction priority and some advanced memory reclamation settings, usually no extra configuration is required.
CPU group identity, however, offers no finer configuration granularity, so no additional customization is needed for the time being.
In the future, advanced features such as core scheduling and CPU idle need to be considered.

Alternatively, a new QoS level, such as Mid between LS and BE, could be introduced to allow fine-grained QoS customization.

**resource overselling**

The Mid resource is currently calculated as
```
Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio)
```
We need to change this to

```
Allocatable[Mid] := min(Reclaimable[Mid], NodeAllocatable * thresholdRatio) + Unallocated[Mid]
Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0)
```

Member

According to my understanding, adding unallocated here is meant to allow Mid + LS to also use unallocated resources in the cluster, but there is a problem here. Using unallocated resources will affect the view of Prod resources, and Prod's preemption of Mid will ultimately need to be supported, which in turn affects the stability of Mid resources.

We are also considering applying node prediction to the amplification factor of the Node, so that Prod can be oversold directly and Quota management and priority preemption stay consistent with the native semantics of k8s.
Does this satisfy your Mid + LS need?

Contributor Author

Consider the following scenario.
A user deploys a Mid + LS pod while the Mid resources of the cluster are insufficient, so the cluster autoscaler scales out a new node.
The problem is that the new node has no Prod pods, so it has no Mid resources available either. We therefore want to allow Mid + LS to use unallocated resources in the cluster.

Reference: [mid-tier-overcommitment](https://koordinator.sh/docs/designs/node-prediction/#mid-tier-overcommitment)

In the long term, non-overselling can be supported behind a feature gate.
```
Unallocated * thresholdRatio
```
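
The updated Mid-tier calculation can be sketched in Python as follows. This is a minimal sketch, not the koord-manager implementation: the function and parameter names are hypothetical, quantities are assumed to be milli-CPU (or bytes), and treating `Unallocated * thresholdRatio` as the whole Mid amount when overselling is disabled is only one reading of the formula above.

```python
def unallocated_mid(node_allocatable: int, prod_allocated: int) -> int:
    """Unallocated[Mid] = max(NodeAllocatable - Allocated[Prod], 0)."""
    return max(node_allocatable - prod_allocated, 0)


def mid_allocatable(
    node_allocatable: int,
    prod_allocated: int,
    reclaimable_mid: int,
    threshold_ratio: float,
    oversell: bool = True,
) -> float:
    """Mid-tier allocatable amount of one node for a single resource.

    `oversell` stands in for the feature gate mentioned above (assumption).
    """
    unallocated = unallocated_mid(node_allocatable, prod_allocated)
    if not oversell:
        # Assumed non-overselling mode: only unallocated resources are offered.
        return unallocated * threshold_ratio
    # Reclaimed Prod resources (capped by the threshold) plus unallocated ones.
    return min(reclaimable_mid, node_allocatable * threshold_ratio) + unallocated
```

For a freshly scaled-out node with no Prod pods (`prod_allocated = 0`, `reclaimable_mid = 0`), the sketch returns the full node allocatable, which is the autoscaler scenario raised in the review thread.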

**native resource or extended resource**

*native resource*:
Hijack the node update and change `node.Status.allocatable`, so that Mid pods also use native resources. In this case Mid is equivalent to a sub-priority of Prod, and resource quota needs to be adapted accordingly.

Member

I don’t quite understand the logic described in this paragraph. Why do we need to hijack node update?

Contributor Author

Hijack the node update to add prod-reclaimable to the original `node.Status.allocatable`.

Mid+BE pods can then end up in Burstable or even Guaranteed, which violates the QoS level policy.

*extended resource*:
Add mid-cpu/mid-memory and insert the "extended resource" fields through the webhook.

Member

And here, do you want to add a new webhook plugin?

Contributor Author

No.

Mid+BE pods are placed in BestEffort.
Mid+LS pods are placed in Burstable by retaining limits.cpu/limits.memory.

**share resource account or not**

Let us look at the scenario without overselling.

Member

What does overselling mean here?


If Prod and Mid pods share a resource account, preemption is required for an incoming Prod pod. koord-scheduler needs filter and preemption plugins to handle this.

Member

Please clarify the specific rules of filter and preemption.


If Prod and Mid pods do not share a resource account, a standalone mechanism or the descheduler is needed to migrate Mid pods when necessary, e.g. on hot spots.

In this scenario, Mid+LS can affect Prod+LS; how much depends on the calculation strategy for the Mid-tier resource and the degree of interference from Mid pods.
The calculation of the Mid-tier resource should be relatively conservative in the given target scenario, and the application type should not cause too much interference.

**trade-off result**
- Retain Mid+LS and Mid+BE; do not introduce a new QoS tier.
- Change the calculation method for Mid resources: add unallocated resources to Mid resources.
- Use the mid-cpu/mid-memory extended resources.
- Do not share the resource account.

### Implementation Details/Notes/Constraints

#### Cgroup basic configuration

**cfsQuota/memoryLimit configuration**

Configured according to limits.mid-cpu and limits.mid-memory.

**cpuShares**

Configured according to requests.mid-cpu.
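
As a rough illustration of the two settings above, the sketch below converts mid-cpu quantities into cgroup values with the same arithmetic kubelet applies to native CPU (1024 shares per core, a 100 ms CFS period). It assumes that mid-cpu is expressed in milli-cores and that koordlet would reuse this conversion; the function names are hypothetical.

```python
CPU_SHARES_PER_CORE = 1024   # cpu.shares granted per full core
MIN_CPU_SHARES = 2           # smallest value accepted for cpu.shares
CFS_PERIOD_US = 100_000      # default CFS period (100 ms)
MIN_CFS_QUOTA_US = 1_000     # smallest usable CFS quota (1 ms)


def cpu_shares_from_request(request_milli: int) -> int:
    """cpu.shares derived from requests.mid-cpu (milli-cores, assumed unit)."""
    if request_milli <= 0:
        return MIN_CPU_SHARES
    return max(request_milli * CPU_SHARES_PER_CORE // 1000, MIN_CPU_SHARES)


def cfs_quota_from_limit(limit_milli: int) -> int:
    """cpu.cfs_quota_us derived from limits.mid-cpu; -1 means no limit."""
    if limit_milli <= 0:
        return -1
    return max(limit_milli * CFS_PERIOD_US // 1000, MIN_CFS_QUOTA_US)
```

For example, a container requesting 500 milli-cores of mid-cpu with a limit of 1500 would get 512 shares and a quota of 150000 µs per 100 ms period under these assumptions.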

**cgroup hierarchy**

- Mid+LS: the webhook injects limits.cpu/limits.memory so that the pod is placed in Burstable.
- Mid+BE: placed in BestEffort by default.

*Note*
The Burstable cpuShares/memoryLimit can be updated periodically by kubelet.

#### QoS policy

Configured according to koord-QoS:
- LS for Mid+LS
- BE for Mid+BE

#### Node QoS

**CPU Suppress**

- Mid+LS pods are not suppressed by default; if task performance does not meet the SLA, eviction is acceptable.
- Mid+BE pods can be suppressed and should not be evicted frequently.

Both Batch+BE and Mid+BE should be taken into account for CPU suppression.

**CPU Eviction**

CPU eviction is currently linked to pod satisfaction.
In the long term, however, it should be driven from the perspective of the operating system, like memory eviction: evict low-priority pods when CPU utilization is higher than a given threshold.
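
A threshold-driven CPU eviction of the kind described above might look like the following sketch. It is not the koordlet implementation: the tuple layout, the `max_evictable_priority` cut-off (7999, the upper bound of the koord-mid priority band, so that Prod pods are never selected) and the choice to evict the largest consumers first within a priority are all assumptions made for illustration.

```python
from typing import List, Tuple

# (pod name, koord-priority value, current CPU usage in milli-cores)
PodUsage = Tuple[str, int, int]


def select_cpu_eviction_victims(
    node_capacity_milli: int,
    pods: List[PodUsage],
    threshold: float,
    max_evictable_priority: int = 7999,
) -> List[str]:
    """Pick low-priority pods until node CPU usage is back under the threshold."""
    used = sum(usage for _, _, usage in pods)
    limit = node_capacity_milli * threshold
    # Only Mid-tier and lower priorities are candidates (assumed cut-off).
    evictable = [p for p in pods if p[1] <= max_evictable_priority]
    # Lowest priority first; within a priority, largest CPU consumer first.
    evictable.sort(key=lambda p: (p[1], -p[2]))
    victims: List[str] = []
    for name, _, usage in evictable:
        if used <= limit:
            break
        victims.append(name)
        used -= usage
    return victims
```

With this ordering Batch pods are always selected before Mid pods, and nothing is evicted while utilization stays at or below the threshold.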

**Memory Eviction**

Eviction is sorted by priority and resource model:
- Batch first and then Mid.
- For each pod, request and usage should be taken into account when evicting, for fairness reasons.
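
The ordering above can be sketched as a sort key. The tier order comes from the list; how request and usage are combined is not specified in this proposal, so ranking pods within a tier by their usage-to-request ratio is only one plausible fairness rule, and the `Pod` type and its fields are hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Lower rank is evicted earlier: Batch first, then Mid. Prod is never listed.
TIER_EVICTION_RANK = {"koord-batch": 0, "koord-mid": 1}


@dataclass
class Pod:
    name: str
    priority_class: str
    memory_request: int  # bytes
    memory_usage: int    # bytes


def memory_eviction_order(pods: List[Pod]) -> List[Pod]:
    """Return evictable pods in the order they would be evicted."""
    candidates = [p for p in pods if p.priority_class in TIER_EVICTION_RANK]

    def sort_key(pod: Pod):
        # Pods using the most memory relative to their request go first
        # (assumed fairness rule); a zero request counts as the worst case.
        if pod.memory_request > 0:
            ratio = pod.memory_usage / pod.memory_request
        else:
            ratio = float("inf")
        return (TIER_EVICTION_RANK[pod.priority_class], -ratio)

    return sorted(candidates, key=sort_key)
```

A Mid pod is therefore only reached once every Batch candidate has been considered, regardless of how far it exceeds its request.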

### Risks and mitigations

- Burstable cpuShares can be periodically updated by kubelet, leading to conflicts with koordlet update. [reference](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/qos_container_manager_linux.go#L170)

Because we cannot update the Burstable cpuShares, Prod+LS and Mid+LS may interfere with each other under heavy load.

- Burstable memory limit can be periodically updated by kubelet, which conflicts with koordlet update. [reference](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/qos_container_manager_linux.go#L343)

We should disable kubefeatures.QOSReserved for memory resources, see [Prerequisites](#prerequisites).

## Alternatives

**Add Mid QoS**

Introduce a new QoS level that can be used to fine-tune the QoS of Mid-tier pods.

## Upgrade Strategy

- [ ] add midresource runtimehook to configure cgroups
- [ ] update the Mid-tier calculation policy
- [ ] update BE suppression to support Mid+BE
- [ ] update CPU/Memory eviction to support Mid-tier
- [ ] scheduler/descheduler filters and policies, e.g. no over-allocation of mid+prod resources when a feature gate is enabled, and migration of some Mid pods away from hotspots for load balancing.

## Additional Details

With the improved mid-resource we have the following panorama:

| koord-priority | resource model | koord-QoS | k8s-QoS | scenario |
| -- | -- | -- | -- | -- |
| koord-prod | cpu/memory | LSE | guaranteed | middleware |
| koord-prod | cpu/memory | LSR | guaranteed | high-priority, latency-sensitive, CPU binding |
| koord-prod | cpu/memory | LS | guaranteed | high-priority, latency-sensitive |
| koord-prod | cpu/memory | LS | burstable | high-priority, latency-sensitive |
| koord-mid | mid-cpu/mid-memory | LS | burstable | low-priority, latency-sensitive |
| koord-mid | mid-cpu/mid-memory | BE | besteffort | AI/Flink jobs |
| koord-batch | batch-cpu/batch-memory | BE | besteffort | big data jobs |
| koord-free | TBD | TBD | TBD | TBD |

## Implementation History
- [ ] 11/28/2023: Open proposal PR