Cost measurement and analysis #75

Open
Tracked by #71
cmelone opened this issue Jul 31, 2024 · 0 comments
cmelone commented Jul 31, 2024

The main goal of this project is to optimize the cost of Spack's CI pipelines. To do this, we need to compute and store the cost of each job so we can determine whether the predictive framework is having a positive impact.

We should collect price data for at least a couple of weeks to establish a baseline.


Objectives

  1. Measure the cost of a job's submission and execution on the cluster
  2. Quantify the efficiency of resource usage to discourage wasted cycles

Current Approaches
In the Kitware analytics database, they store a "node occupancy" metric, which measures the proportion of a node that was available to a job over the job's life. For instance, if the job was alone on the node, this value would be 1; if five builds were sharing the node equally, 0.2. This is then multiplied by the cost of the node during the job's life to get a cost per job.
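As a minimal sketch of the occupancy-based approach described above (the node price and job duration are hypothetical numbers for illustration):

```python
def occupancy_cost(node_cost_during_job, occupancy):
    """Kitware-style cost per job: the node's cost over the job's life,
    scaled by the share of the node available to the job."""
    return node_cost_during_job * occupancy

# Hypothetical: a $0.40/hr node, 2-hour job.
node_cost = 0.40 * 2
alone = occupancy_cost(node_cost, 1.0)   # job alone on the node
shared = occupancy_cost(node_cost, 0.2)  # five builds sharing the node
```

Note that the same job is charged five times less in the shared case, which is exactly the sensitivity to co-located work criticized below.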

However, it's not a perfect measurement for our application. The cost should be independent of other activity on the node; otherwise, it would be impossible to compare cost per job across samples. The metric does not reflect the job or spec itself, only whether other jobs happened to be running on the node.

While the node occupancy metric is useful for understanding node utilization, it mostly measures how well Karpenter/K8s packs nodes, which is out of our control. It may be helpful when we investigate scheduling for CI, but not now, while we're primarily interested in improving the efficiency of resource usage.

Setup

To normalize the cost of resources within instance types, we'll define cost per resource metrics.

$$\text{Cost per CPU}_i = \frac{C_i}{\text{CPU}_i}$$ $$\text{Cost per RAM}_i = \frac{C_i}{\text{RAM}_i}$$

where

  • $C_i$ is the cost of node $i$ over the life of the job
  • $\text{CPU}_i$ is the number of CPUs available on node $i$
  • $\text{RAM}_i$ is the amount of RAM available on node $i$

$$\text{Job Cost} = (\text{CPU}_{\text{request}} \times \text{Cost per CPU}_i + \text{RAM}_{\text{request}} \times \text{Cost per RAM}_i)$$

where
$\text{CPU}_{\text{request}}$ and $\text{RAM}_{\text{request}}$ are the job's resource requests. We use requests rather than actual usage because requests represent the resources a job reserves on a node. If a build requests 10GB of memory but only uses 5, it should be charged for its allocation, as it prevented other jobs from running on the node.

Using this cost per job metric, jobs are rewarded for minimizing their requests and wall time.
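The cost-per-resource and job-cost formulas above translate directly to code. A sketch, with made-up node numbers:

```python
def cost_per_resource(node_cost, node_cpus, node_ram_gb):
    """Normalize the node's cost over the job's life by its capacity,
    yielding (Cost per CPU, Cost per RAM)."""
    return node_cost / node_cpus, node_cost / node_ram_gb

def job_cost(cpu_request, ram_request_gb, cost_per_cpu, cost_per_ram):
    """Charge the job for what it reserved, regardless of actual usage."""
    return cpu_request * cost_per_cpu + ram_request_gb * cost_per_ram

# Hypothetical node: $1.60 over the job's life, 16 CPUs, 64 GB RAM.
cpu_rate, ram_rate = cost_per_resource(1.60, 16, 64)
cost = job_cost(4, 10, cpu_rate, ram_rate)  # 4 CPUs, 10 GB requested
```

A job that trims its requests or finishes faster (lowering $C_i$) sees a proportionally lower cost.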

However, we should also measure whether a job uses more or fewer resources than it requested. Under-requesting can negatively impact other processes on the node and slow down the build, while over-requesting is simply a waste of cycles. In conjunction with the cost per job, a penalty factor helps capture the cost imposed on the rest of the cluster, or on jobs that could otherwise have run on the node.

$$\text{P}_{\text{CPU}} = \max\left(\frac{1}{\text{UR}_{\text{CPU}}}, \text{UR}_{\text{CPU}}\right)$$ $$\text{P}_{\text{RAM}} = \max\left(\frac{1}{\text{UR}_{\text{RAM}}}, \text{UR}_{\text{RAM}}\right)$$

where

$$\text{UR}_{\text{CPU}} = \frac{\text{CPU}_{\text{usage avg}}}{\text{CPU}_{\text{request}}}$$ $$\text{UR}_{\text{RAM}} = \frac{\text{RAM}_{\text{usage avg}}}{\text{RAM}_{\text{request}}}$$

This penalizes jobs both for using fewer resources than requested (via the inverse of the utilization ratio) and for using more than requested (via the ratio itself, which can exceed 1).

Therefore, a "weighted" cost per job would be

$$(\text{CPU}_{\text{request}} \times \text{Cost per CPU}_i \times \text{P}_{\text{CPU}} + \text{RAM}_{\text{request}} \times \text{Cost per RAM}_i \times \text{P}_{\text{RAM}})$$

Job cost and $P$ would be stored separately: the former represents the "true" cost, while the latter measures the efficiency of a job's resource requests via an artificial penalty. When analyzing costs, node instance type should be controlled for, because cost per job depends on $\text{Cost per CPU}_i$ and $\text{Cost per RAM}_i$, which vary among instance types.
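Putting the penalty factors together with the job cost, a minimal sketch of the weighted cost (usage averages are hypothetical):

```python
def penalty(usage_avg, request):
    """Penalty factor >= 1; equals 1 only when average usage matches the
    request. Penalizes under-use via 1/UR and over-use via UR > 1."""
    ur = usage_avg / request
    return max(1 / ur, ur)

def weighted_job_cost(cpu_req, ram_req, cost_per_cpu, cost_per_ram,
                      cpu_usage_avg, ram_usage_avg):
    """Job cost with each resource term scaled by its penalty factor."""
    p_cpu = penalty(cpu_usage_avg, cpu_req)
    p_ram = penalty(ram_usage_avg, ram_req)
    return cpu_req * cost_per_cpu * p_cpu + ram_req * cost_per_ram * p_ram

# The 10 GB requested / 5 GB used example from above: UR = 0.5, penalty = 2.
p_ram_example = penalty(5, 10)
```

A job that uses exactly what it requests has $P = 1$ for both resources, so its weighted cost equals its true cost; any deviation in either direction inflates only the weighted figure.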
