proposal: add enhance mid-tier resource proposal
Signed-off-by: j4ckstraw <[email protected]>
j4ckstraw committed Nov 28, 2023
1 parent 3784df1 commit 0da22aa
Showing 1 changed file with 187 additions and 0 deletions.
187 changes: 187 additions & 0 deletions docs/proposals/20231123-enhance-mid-tier-resource.md
@@ -0,0 +1,187 @@
---
title: Enhance Mid-tier resources
authors:
- "@j4ckstraw"
- "@jiasheng55"
reviewers:
- "@zwzhang0107"
- "@hormes"
- "@eahydra"
- "@FillZpp"
- "@jasonliu747"
creation-date: 2023-11-23
last-updated: 2023-11-28
status: implementable
see-also:
- "/docs/proposals/20230613-node-prediction.md"
---

# Enhance Mid-tier resources

## Table of Contents

<!--ts-->
- [Enhance Mid-tier resources](#enhance-mid-tier-resources)
- [Table of Contents](#table-of-contents)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [User Stories](#user-stories)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Design Principles](#design-principles)
- [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
- [Cgroup basic configuration](#cgroup-basic-configuration)
- [QoS policy](#qos-policy)
- [Node QoS](#node-qos)
- [Risks and Mitigations](#risks-and-mitigations)
- [Alternatives](#alternatives)
- [Upgrade Strategy](#upgrade-strategy)
- [Additional Details](#additional-details)
- [Implementation History](#implementation-history)
<!--te-->

## Summary

*Mid-tier resources* are proposed to both improve node utilization and avoid overloading. They rely on [node prediction](https://github.com/koordinator-sh/koordinator/blob/main/docs/proposals/20230613-node-prediction.md).

The *node prediction* proposal clarifies how Mid-tier resources are calculated from the prediction. This proposal clarifies the Mid-tier cgroup and QoS design, as well as the suppress and eviction policies.

## Motivation

This section first explains some concepts:
1. koord-QoS

Quality of Service. We assume that workloads at the same QoS level have similar runtime performance and quality.

2. koord-priority

Scheduling priority. By default, a high-priority pod can preempt a low-priority one.

3. Resource type

There are currently four resource types: prod, mid, batch, and free.
The resource type indicates whether the resource is oversold and whether it is stable, which affects pod eviction.

Koordinator binds koord-priority to resource type: each priority has its own resource type.

This proposal introduces Mid+LS and Mid+BE to fill the gap between Prod+LS and Batch+BE
and to meet the requirements of different types of tasks.

### Goals

- Clarify Mid-tier cgroup and QoS
- Clarify Mid-tier suppress and eviction policy

### Non-Goals

- Replace Batch-tier resources
- Add new QoS type

## Proposal

### User Stories

#### Story 1

Some low-priority online-service tasks have the same performance requirements as Prod+LS. They do not want to be suppressed, but they can tolerate being evicted when machine usage spikes.

Mid+LS addresses this case.

#### Story 2

Some resource-intensive tasks, such as AI or stream computing jobs (for example, Apache Spark), may consume a lot of resources. They need stable resources: they can be suppressed, but they do not want to be evicted.

Mid+BE addresses this case.

### Implementation Details/Notes/Constraints

#### Cgroup basic configuration

**cfsQuota/memoryLimit configuration**

Configured according to limits.mid-cpu and limits.mid-memory.

**cpuShares**

Configured according to requests.mid-cpu:
- for Mid+LS, same as Prod+LS
- for Mid+BE, same as Batch+BE
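
The conversions above can be sketched as follows. This is a minimal illustration, not koordlet's actual code: it assumes mid-cpu is expressed in milli-cores and mid-memory in bytes, and it reuses the conventional Kubernetes constants for cgroup v1 (1024 shares per core, a 100 ms CFS period). The function names and the `mid_cgroup_values` helper are hypothetical.

```python
# Conventional cgroup v1 constants used by Kubernetes-style conversions.
CPU_SHARES_PER_CORE = 1024   # cpu.shares granted per full core
MIN_CPU_SHARES = 2           # lower bound for cpu.shares
CFS_PERIOD_US = 100_000      # default CFS period (100 ms)
MIN_CFS_QUOTA_US = 1_000     # lower bound for a finite quota


def milli_cpu_to_shares(milli_cpu: int) -> int:
    """Convert a CPU request in milli-cores to cpu.shares."""
    if milli_cpu <= 0:
        return MIN_CPU_SHARES
    return max(MIN_CPU_SHARES, milli_cpu * CPU_SHARES_PER_CORE // 1000)


def milli_cpu_to_cfs_quota(milli_cpu: int, period_us: int = CFS_PERIOD_US) -> int:
    """Convert a CPU limit in milli-cores to cpu.cfs_quota_us (-1 means unlimited)."""
    if milli_cpu <= 0:
        return -1
    return max(MIN_CFS_QUOTA_US, milli_cpu * period_us // 1000)


def mid_cgroup_values(requests: dict, limits: dict) -> dict:
    """Hypothetical helper: derive cgroup values from Mid-tier requests/limits.

    Assumes "mid-cpu" is in milli-cores and "mid-memory" is in bytes.
    """
    return {
        "cpu.shares": milli_cpu_to_shares(requests.get("mid-cpu", 0)),
        "cpu.cfs_quota_us": milli_cpu_to_cfs_quota(limits.get("mid-cpu", 0)),
        "memory.limit_in_bytes": limits.get("mid-memory", -1),
    }


print(mid_cgroup_values({"mid-cpu": 500}, {"mid-cpu": 1500, "mid-memory": 2 * 1024**3}))
```

Keeping the conversion identical for Mid+LS and Mid+BE means the difference between the two comes only from where the pod sits in the cgroup hierarchy, not from the arithmetic.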

**cgroup hierarchy**

- Mid+LS: the webhook injects limits.cpu/limits.memory, so the pod is placed in the Burstable cgroup.
- Mid+BE: placed in the BestEffort cgroup by default.
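
To see why injecting native limits moves a Mid+LS pod out of BestEffort, here is a simplified sketch of the Kubernetes QoS classification, which only looks at non-zero `cpu` and `memory` quantities and ignores extended resources such as mid-cpu. The function is a simplification for illustration, and the example pods are hypothetical. In particular, the proposal does not specify which request values the webhook sets, so the Mid+LS example assumes native requests stay at zero.

```python
QOS_RESOURCES = ("cpu", "memory")  # only native cpu/memory affect the QoS class


def k8s_qos_class(containers: list) -> str:
    """Simplified Kubernetes QoS classification.

    Each container is {"requests": {...}, "limits": {...}} with integer
    quantities (milli-cores for cpu, bytes for memory). Zero values are
    treated as unset.
    """
    requests, limits = {}, {}
    guaranteed = True
    for c in containers:
        for name, qty in c.get("requests", {}).items():
            if name in QOS_RESOURCES and qty > 0:
                requests[name] = requests.get(name, 0) + qty
        found = set()
        for name, qty in c.get("limits", {}).items():
            if name in QOS_RESOURCES and qty > 0:
                found.add(name)
                limits[name] = limits.get(name, 0) + qty
        # Guaranteed requires every container to limit both cpu and memory.
        if found != set(QOS_RESOURCES):
            guaranteed = False
    if not requests and not limits:
        return "BestEffort"
    if guaranteed and requests == limits:
        return "Guaranteed"
    return "Burstable"


GIB = 1024**3
mid = {"mid-cpu": 2000, "mid-memory": 4 * GIB}

# Mid+BE: only extended resources, so the pod stays BestEffort.
mid_be = [{"requests": dict(mid), "limits": dict(mid)}]

# Mid+LS: assumed webhook result with native limits injected, requests left at zero.
mid_ls = [{
    "requests": {**mid, "cpu": 0, "memory": 0},
    "limits": {**mid, "cpu": 2000, "memory": 4 * GIB},
}]

print(k8s_qos_class(mid_be), k8s_qos_class(mid_ls))
```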

*Note*
The kubelet may periodically update the Burstable cpuShares/memoryLimit.

#### QoS Policy

Configured according to koord-QoS:
- LS for Mid+LS
- BE for Mid+BE

#### Node QoS

**CPU Suppress**

- Mid+LS is not suppressed by default. If task performance does not meet the SLA, eviction is acceptable.
- Mid+BE can be suppressed, but it should not be evicted frequently.

CPU suppress should take both Batch+BE and Mid+BE into account.
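
A rough sketch of what this could look like. It assumes the CPU budget keeps the shape of the existing BE suppress rule (node capacity times a threshold, minus non-BE and system usage) and that the suppress targets are simply all BE pods at koord-batch or koord-mid priority. The field names, the threshold parameter, and the zero floor are illustrative assumptions, not part of this proposal.

```python
SUPPRESSIBLE_PRIORITIES = ("koord-batch", "koord-mid")


def suppress_targets(pods: list) -> list:
    """Pods subject to CPU suppress: Batch+BE and Mid+BE, never Mid+LS."""
    return [
        p for p in pods
        if p["qos"] == "BE" and p["priority"] in SUPPRESSIBLE_PRIORITIES
    ]


def be_cpu_budget_milli(node_total: int, threshold_percent: int,
                        non_be_used: int, system_used: int) -> int:
    """CPU (milli-cores) left for suppressible pods; the floor of 0 is an assumption."""
    budget = node_total * threshold_percent // 100 - non_be_used - system_used
    return max(budget, 0)


pods = [
    {"name": "web", "priority": "koord-prod", "qos": "LS"},
    {"name": "api-lowpri", "priority": "koord-mid", "qos": "LS"},
    {"name": "flink", "priority": "koord-mid", "qos": "BE"},
    {"name": "spark", "priority": "koord-batch", "qos": "BE"},
]
print([p["name"] for p in suppress_targets(pods)])
print(be_cpu_budget_milli(16000, 65, 6000, 1000))
```

How the budget is split between Batch+BE and Mid+BE is left open here; the sketch only shows that both groups draw from it while Mid+LS does not.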

**CPU Eviction**

CPU eviction is currently based on pod satisfaction.
In the long term, it should be driven from the OS perspective, like memory eviction.

**Memory Eviction**

Eviction candidates are sorted by priority and resource type:
- Batch first and then Mid.
- Mid+LS first and then Mid+BE.
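
The ordering above can be expressed as a two-level sort key. This is a minimal sketch with hypothetical pod dictionaries. It considers only Batch and Mid pods, since the text does not say where other priorities fall.

```python
# Lower rank means evicted earlier.
PRIORITY_RANK = {"koord-batch": 0, "koord-mid": 1}  # Batch first, then Mid
QOS_RANK = {"LS": 0, "BE": 1}                       # within Mid: LS first, then BE


def memory_eviction_order(pods: list) -> list:
    """Return Batch and Mid pods in the order they would be evicted."""
    candidates = [p for p in pods if p["priority"] in PRIORITY_RANK]
    return sorted(
        candidates,
        key=lambda p: (PRIORITY_RANK[p["priority"]], QOS_RANK.get(p["qos"], 0)),
    )


pods = [
    {"name": "flink", "priority": "koord-mid", "qos": "BE"},
    {"name": "web", "priority": "koord-prod", "qos": "LS"},
    {"name": "api-lowpri", "priority": "koord-mid", "qos": "LS"},
    {"name": "spark", "priority": "koord-batch", "qos": "BE"},
]
print([p["name"] for p in memory_eviction_order(pods)])
```

Because Python's sort is stable, pods with the same priority and QoS keep their input order, so a further tie-breaker such as memory usage can be applied by pre-sorting the input.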

### Risks and Mitigations

- The kubelet may periodically update the Burstable cpuShares, which conflicts with koordlet updates. [reference](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/qos_container_manager_linux.go#L170)

We can leave the Burstable cpuShares untouched. Prod+LS and Mid+LS would then interfere with each other only under high load.

- The kubelet may periodically update the Burstable memory limit, which conflicts with koordlet updates. [reference](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/cm/qos_container_manager_linux.go#L343)

For now, we can mitigate this by disabling kubefeatures.QOSReserved for the memory resource.

## Alternatives

**Add Mid QoS**

Introduce a new QoS level, which would allow finer-grained tuning of Mid-tier pod QoS.

## Upgrade Strategy

- [ ] add midresource runtimehook, configure cgroup
- [ ] update Mid-tier calculation policy
- [ ] update BE suppress to support Mid+BE
- [ ] update CPU/Memory eviction to support Mid-tier
- [ ] scheduler/descheduler filter and policy

## Additional Details

With Mid-tier resources enhanced, the overall picture is as follows:

| koord-priority | resource type | koord-QoS | k8s-QoS | scenario |
| -- | -- | -- | -- | -- |
| koord-prod | cpu/memory | LSE | guaranteed | middleware |
| koord-prod | cpu/memory | LSR | guaranteed | high-priority online-service, CPU binding |
| koord-prod | cpu/memory | LS | guaranteed | high-priority online-service, microservice workloads |
| koord-prod | cpu/memory | LS | burstable | high-priority online-service, microservice workloads |
| koord-mid | mid-cpu/mid-memory | LS | burstable | low-priority online-service |
| koord-mid | mid-cpu/mid-memory | BE | besteffort | AI/Flink jobs |
| koord-batch | batch-cpu/batch-memory | BE | besteffort | big data jobs |
| koord-free | TBD | TBD | TBD | TBD |

## Implementation History
- [ ] 11/28/2023: Open proposal PR
