
etcdserver: separate "raft log compact" from snapshot #18235

Closed
wants to merge 8 commits

Conversation

clement2026
Contributor

I'm looking into #17098 (cc @serathius) and have made changes to separate the compact logic from the snapshot logic.

I ran rw-benchmark.sh several times to observe performance changes. I expected this patch to slow down etcd, but it seems to have improved rw-benchmark.sh results. I’m not sure how. Similar results were observed on both cloud VMs (8vCPU, 16GB Memory, NVMe) and a bare-metal machine (8CPU, 32GB Memory, SSD).

Since the rw-benchmark.sh script was taking forever to finish, I adjusted the parameters to complete the benchmark within a few hours.

Below are the script and the results from running it on a Vultr bare-metal machine (8 CPUs, 32GB Memory, SSD, Ubuntu 24.04 LTS x64).

export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh

compare
main.csv
issue-17098-patch.csv

@k8s-ci-robot

Hi @clement2026. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

/ok-to-test

@serathius
Member

/retest

-compacti := uint64(1)
+compacti := uint64(0)
 if snapi > s.Cfg.SnapshotCatchUpEntries {
 	compacti = snapi - s.Cfg.SnapshotCatchUpEntries
Member

Why change from 1 to 0?

Contributor Author

https://github.com/etcd-io/raft/blob/ee19bbe7f9bb16fd7d874a6c08dfac319596d2ad/storage.go#L258-L260

	if compactIndex > ms.lastIndex() {
		getLogger().Panicf("compact %d is out of bound lastindex(%d)", compactIndex, ms.lastIndex())
	}

A panic occurred at this line during local test runs. It appears the initial value of ms.lastIndex() is 0, so compacti should default to zero.

In the main branch, a panic doesn't occur because ms always contains more than one element when compaction starts, which isn't the case in this patch.
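
For anyone following along, here is a small standalone sketch (not the PR code itself; snapi and snapshotCatchUpEntries are placeholder values mirroring the diff) that shows why the default matters: on a fresh raft MemoryStorage, lastIndex() is 0, so compacting to any index above 0 panics, while Compact(0) merely returns ErrCompacted, which is safe to ignore.

package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
)

func main() {
	// A fresh MemoryStorage has lastIndex() == 0, so Compact(1) would panic
	// with "compact 1 is out of bound lastindex(0)".
	ms := raft.NewMemoryStorage()

	snapi := uint64(0)                     // nothing applied yet
	snapshotCatchUpEntries := uint64(5000) // placeholder for s.Cfg.SnapshotCatchUpEntries

	// Guarded computation: default to 0 so we never ask the storage to
	// compact beyond its last index on an empty log.
	compacti := uint64(0)
	if snapi > snapshotCatchUpEntries {
		compacti = snapi - snapshotCatchUpEntries
	}

	if err := ms.Compact(compacti); err != nil {
		// ErrCompacted just means there is nothing new to compact.
		if err != raft.ErrCompacted {
			panic(err)
		}
		fmt.Println("nothing to compact at index", compacti)
		return
	}
	fmt.Println("compacted raft log up to index", compacti)
}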

@serathius
Member

serathius commented Jun 26, 2024

Great work. The change looks good, but as with any change to the etcd apply loop, we need to be careful here.

> I expected this patch to slow down etcd, but it seems to have improved rw-benchmark.sh results. I'm not sure how.

It would be good to investigate. Would you be able to collect CPU & memory profiles?

@serathius
Member

Found by robustness tests:

panic: need non-empty snapshot

goroutine 210 [running]:
go.etcd.io/raft/v3.(*raft).maybeSendAppend(0xc0002be180, 0xeabdbb777cf498cb, 0x1)
	go.etcd.io/raft/[email protected]/raft.go:640 +0x838
go.etcd.io/raft/v3.(*raft).sendAppend(...)
	go.etcd.io/raft/[email protected]/raft.go:592
go.etcd.io/raft/v3.stepLeader(0xc0002be180, {0x9, 0xbf19ae4419db00dc, 0xeabdbb777cf498cb, 0x3, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
	go.etcd.io/raft/[email protected]/raft.go:1545 +0x99d
go.etcd.io/raft/v3.(*raft).Step(0xc0002be180, {0x9, 0xbf19ae4419db00dc, 0xeabdbb777cf498cb, 0x3, 0x0, 0x0, {0x0, 0x0, 0x0}, ...})
	go.etcd.io/raft/[email protected]/raft.go:1215 +0xe3c
go.etcd.io/raft/v3.(*node).run(0xc000018240)
	go.etcd.io/raft/[email protected]/node.go:399 +0x825
created by go.etcd.io/raft/v3.StartNode in goroutine 1
	go.etcd.io/raft/[email protected]/node.go:273 +0x59

@clement2026
Contributor Author

> Great work. The change looks good, but as with any change to the etcd apply loop, we need to be careful here.
>
> I expected this patch to slow down etcd, but it seems to have improved rw-benchmark.sh results. I'm not sure how.
>
> It would be good to investigate. Would you be able to collect CPU & memory profiles?

Thanks! I understand the importance of the apply loop. I'll investigate it and provide CPU & Memory profiles after resolving the bug and test failures.

@clement2026
Contributor Author

/retest

@serathius
Member

Thanks for fixing the tests. Please note that while this PR is small, it might have a big impact on etcd's behavior, so its review might take a while. It would be great to repeat the tests and compare the profiles to understand the performance difference.

cc @ahrtr

@clement2026
Contributor Author

> Thanks for fixing the tests. Please note that while this PR is small, it might have a big impact on etcd's behavior, so its review might take a while. It would be great to repeat the tests and compare the profiles to understand the performance difference.
>
> cc @ahrtr

Thanks for checking in. I made a temporary fix to the test so I could try out my ideas, but it might cause a serious memory leak. I'm currently developing an optimized version that fixes the test properly. I'll update you once it's done and provide the CPU and memory profiles.

@serathius serathius changed the title WIP: etcdserver: separate compact from snapshot WIP: etcdserver: separate "raft log compact" from snapshot Jun 28, 2024
@clement2026
Contributor Author

/retest

Signed-off-by: Clement <[email protected]>
@clement2026
Contributor Author

/retest

Signed-off-by: Clement <[email protected]>
@clement2026
Contributor Author

clement2026 commented Jul 1, 2024

I have completed some benchmarks.  Here’s the heatmap comparison between this PR and the main branch:

compare
main.csv
pr-18235-patch.csv

Since the full benchmark is time-consuming, the results were generated using this simplified script:

export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh

The script was executed on a Vultr cloud VM with 8 vCPUs, 32 GB of memory, and a 150 GB NVMe drive. All subsequent results are based on this machine configuration unless otherwise specified.

Updates:
The script was executed on both the main branch and the patch branch separately, each run taking about 2 hours (I forgot to record the exact time).

@clement2026
Contributor Author

To verify that this patch improves memory usage, I collected go_memstats_heap_inuse_bytes using Prometheus and Grafana.

patch:

patch-mem-in-use-grafana

main:

main-mem-in-use-grafana

Step 1: start etcd with pprof enabled

ETCD_BIN="./bin/etcd"
BACKEND_SIZE="$((130 * 1024 * 1024 * 1024))" # 130GB

${ETCD_BIN} --quota-backend-bytes=${BACKEND_SIZE} \
  --log-level 'error' \
  --enable-pprof

Step 2: keep writing data to etcd

ETCD_BM_BIN="./bin/tools/benchmark"
KEY_SPACE_SIZE="$((1024 * 64))"
VALUE_SIZE="$((2 ** 20))" # 1MB
RUN_COUNT="50000"

${ETCD_BM_BIN} put "" \
  --key-space-size=${KEY_SPACE_SIZE} \
  --total=${RUN_COUNT} \
  --val-size ${VALUE_SIZE}
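
As a side note, the same go_memstats_heap_inuse_bytes series can also be sampled straight from etcd's /metrics endpoint if you don't want to set up Prometheus/Grafana (a rough sketch; assumes the default client URL):

while true; do
  # etcd exposes Go runtime metrics on /metrics; grab the heap-in-use gauge.
  curl -s http://127.0.0.1:2379/metrics | grep '^go_memstats_heap_inuse_bytes'
  sleep 10
done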

@clement2026
Contributor Author

Benchmark etcd using a small CONN_CLI_COUNT, a small VALUE_SIZE, and a large RUN_COUNT (1,000,000).

The benchmark took about 600s, and go tool pprof took 700s to capture the CPU profile.

| Branch | Heap before benchmark | Heap after benchmark | Profile |
| --- | --- | --- | --- |
| main | main-before-benchmark.inuse_space.pb.gz | main-after-benchmark.inuse_space.pb.gz | main-pprof.etcd.samples.cpu.001.pb.gz |
| patch | patch-before-benchmark.inuse_space.pb.gz | patch-after-benchmark.inuse_space.pb.gz | patch-pprof.etcd.samples.cpu.001.pb.gz |

Images produced by executing go tool pprof -pdf -diff_base $file_a $file_b:

image

compare-profile.pdf

comparing-heap-after-benchmark
comparing-heap-after-benchmark.pdf
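
For reference, the heap comparison above was presumably produced by an invocation along these lines (file names taken from the table; main used as the diff base; Graphviz is required for PDF output):

go tool pprof -pdf -diff_base main-after-benchmark.inuse_space.pb.gz \
  patch-after-benchmark.inuse_space.pb.gz > comparing-heap-after-benchmark.pdf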

Compared to the main branch, this patch uses more memory and consumes less CPU.

Below are the steps to reproduce.

Step 1: Prepare ENV

ETCD_BIN="./bin/etcd"
ETCD_BM_BIN="./bin/tools/benchmark"
BACKEND_SIZE="$((20 * 1024 * 1024 * 1024))" # 20GB
KEY_SPACE_SIZE="$((1024 * 64))"
RATIO="4.0"
KEY_SIZE="256"
RUN_COUNT="1000000"
RANGE_RESULT_LIMIT="100"
VALUE_SIZE="$((2 ** 10))" # VALUE_SIZE="$((2 ** 14))" 
CONN_CLI_COUNT="$((2**5))"

Step 2: start etcd with pprof enabled

${ETCD_BIN} --quota-backend-bytes=${BACKEND_SIZE} \
  --log-level 'error' \
  --enable-pprof 

Step 3: init etcd with data

${ETCD_BM_BIN} put --sequential-keys \
  --key-space-size=${KEY_SPACE_SIZE} \
  --val-size=${VALUE_SIZE} --key-size=${KEY_SIZE} \
  --total=${KEY_SPACE_SIZE} \
  &>/dev/null

Step 4: capture heap usage
Step 5: start capturing CPU profile for 700s
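
Steps 4 and 5 can be done, for example, as follows (a sketch, not necessarily the exact commands used here; it assumes --enable-pprof and the default client URL, with file names mirroring the table above):

# Step 4: save an in-use heap profile in pprof's protobuf format.
curl -s http://127.0.0.1:2379/debug/pprof/heap -o before-benchmark.inuse_space.pb.gz

# Step 5: collect a CPU profile covering the whole benchmark window (700s),
# in the background so the benchmark can run in the meantime.
curl -s "http://127.0.0.1:2379/debug/pprof/profile?seconds=700" \
  -o pprof.etcd.samples.cpu.001.pb.gz &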

Step 6: benchmark

${ETCD_BM_BIN} txn-mixed "" \
  --conns=${CONN_CLI_COUNT} --clients=${CONN_CLI_COUNT} \
  --total=${RUN_COUNT} \
  --rw-ratio ${RATIO} --limit ${RANGE_RESULT_LIMIT} \
  --val-size ${VALUE_SIZE}

Step 7: capture heap usage
Step 8: stop capturing the CPU profile after 700s

@clement2026
Contributor Author

Benchmark etcd using a large CONN_CLI_COUNT, a large VALUE_SIZE, and a large RUN_COUNT (1,000,000).

The benchmark took about 1100s, and go tool pprof took 1200s to capture the CPU profile.

| Branch | Heap before benchmark | Heap after benchmark | Profile |
| --- | --- | --- | --- |
| main | main-before-benchmark.inuse_space.pb.gz | main-after-benchmark.inuse_space.pb.gz | main-pprof.etcd.samples.cpu.001.pb.gz |
| patch | patch-before-benchmark.inuse_space.pb.gz | patch-after-benchmark.inuse_space.pb.gz | patch-pprof.etcd.samples.cpu.001.pb.gz |

Images produced by executing go tool pprof -pdf -diff_base $file_a $file_b:

image

compare-profile.pdf

compare-heap

compare-heap.pdf

Compared to the main branch, this patch uses less memory and consumes less CPU.

The steps to reproduce are nearly identical to the previous benchmark description, except for the different ENV settings.

Step 1: Prepare ENV

ETCD_BIN="./bin/etcd"
ETCD_BM_BIN="./bin/tools/benchmark"
BACKEND_SIZE="$((20 * 1024 * 1024 * 1024))" # 20GB
KEY_SPACE_SIZE="$((1024 * 64))"
RATIO="4.0"
KEY_SIZE="256"
RUN_COUNT="1000000"
RANGE_RESULT_LIMIT="100"
VALUE_SIZE="$((2 ** 14))"
CONN_CLI_COUNT="$((2**11))"

@clement2026 clement2026 marked this pull request as ready for review July 1, 2024 22:48
@clement2026
Contributor Author

I am still struggling to understand why this patch reduced the go.etcd.io/etcd/server/v3/storage/mvcc.(*keyIndex).get time.

@clement2026
Contributor Author

@serathius
I've fixed the bug and updated some benchmark results to help us review this PR’s performance. The information might be a bit messy, please bear with me. Let me know if further information is needed.

@clement2026 clement2026 changed the title WIP: etcdserver: separate "raft log compact" from snapshot etcdserver: separate "raft log compact" from snapshot Jul 2, 2024
@serathius
Member

@clement2026 great work. It's a lot of information, and it should be enough to draw some conclusions. I think what we're missing is a clear summary of your findings: which tests were run, and what improvement/regression each test showed.

Please see kubernetes/enhancements#2340 (comment) for an example

@clement2026
Contributor Author

Thanks for the example. I’ll create a summary of the findings, including the tests conducted and their results.

@clement2026
Contributor Author

clement2026 commented Jul 3, 2024

@serathius Here’s a summary of the three tests I’ve done. Please review it when you have a moment.

Test 1 #18235 (comment)

rw-benchmark R/W ratio 4:1

compare

Improvements

  • 2% ~ 12% improvement to read performance
  • 1% ~ 13% improvement to write performance

Test 2 #18235 (comment)

benchmark put 50000 values of 1 MB each, within 6 minutes

| Branch | Max Memory |
| --- | --- |
| main | 31 GB |
| patch | 12 GB |

Improvements

  • 59% reduction in maximum memory usage

Test 3

1 KB: #18235 (comment)
16 KB: #18235 (comment)

benchmark txn-mixed 1,000,000 transactions, within 20 minutes

| Value Size | Branch | Initial Memory | Final Memory | CPU Usage |
| --- | --- | --- | --- | --- |
| 1 KB | main | 35.5 MB | 45.83 MB | 4050.75s |
| 1 KB | patch | 39.4 MB | 52.01 MB | 3584.51s |
| 16 KB | main | 188.97 MB | 884.82 MB | 6552.40s |
| 16 KB | patch | 119.57 MB | 753.51 MB | 5954.52s |

Improvements

  • 12% improvement in CPU usage for 1 KB transactions
  • 9% improvement in CPU usage for 16 KB transactions
  • 15% reduction in memory usage for 16 KB transactions

Degradations

  • 13% increase in memory usage for 1 KB transactions - unexpected ☹️

Updates:

  • Add links to detailed test information.
  • Correct "16 MB" to "16 KB."
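
As a sanity check, the Test 3 percentages follow directly from the table above; here is a throwaway snippet that recomputes them (values copied from the table):

package main

import "fmt"

// pct returns the percentage change from oldV to newV, positive when newV is lower.
func pct(oldV, newV float64) float64 { return (oldV - newV) / oldV * 100 }

func main() {
	// CPU usage (seconds) from the Test 3 table.
	fmt.Printf("1 KB CPU improvement: %.0f%%\n", pct(4050.75, 3584.51))  // ~12%
	fmt.Printf("16 KB CPU improvement: %.0f%%\n", pct(6552.40, 5954.52)) // ~9%
	// Final memory (MB) from the Test 3 table.
	fmt.Printf("16 KB memory reduction: %.0f%%\n", pct(884.82, 753.51)) // ~15%
	fmt.Printf("1 KB memory increase: %.0f%%\n", -pct(45.83, 52.01))    // ~13%
}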

@clement2026
Contributor Author

I find testing and recording the memory usage of etcd to be quite time-consuming. We already have rw-benchmark.sh for collecting read/write performance data. Are there similar scripts for gathering memory usage statistics?

@serathius
Member

serathius commented Jul 3, 2024

I feel your pain; it shouldn't take so much time to iterate when validating a performance improvement. There is an open issue to improve the current situation, but there hasn't been a lot of progress: #16467

Overall the change looks good to me. Using a heap might be overkill here; I don't expect many concurrent snapshots to be executed. However, users might set the snapshot count very low, which for a big db file could cause many concurrent snapshot attempts.

The improvement in performance is awesome! I don't think the increase in memory for small objects is surprising or concerning. Executing compaction after each apply might be overkill, though. Would you want to try checking whether reducing the frequency to once every 100/1000 applies would be better?

Happy to see the change passes the robustness tests. We might also want to think about whether we could add some more testing to ensure this change is correct.
cc @MadhavJivrajani @siyuanfoundation @fuweid for ideas on extending robustness tests.

cc @ahrtr to confirm change correctness and performance benefits.

@clement2026
Contributor Author

This is great advice. I'll reduce the compaction frequency to once every 100/1000 runs and observe the benchmark results. With such a low frequency, replacing the min heap with a simple slice should not impact performance and will improve readability and maintainability.
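
To make the "once every N applies" idea concrete, here is a rough standalone sketch (the constants are placeholders, not etcd's actual configuration, and this is not the PR's code):

package main

import "fmt"

const (
	compactEvery           = 1000         // only consider compaction every N applies
	snapshotCatchUpEntries = uint64(5000) // entries kept behind for slow followers
)

// compactIndexFor reports whether the raft log should be compacted after
// applying entry appliedIndex, and if so up to which index.
func compactIndexFor(appliedIndex uint64) (uint64, bool) {
	if appliedIndex%compactEvery != 0 {
		return 0, false // not on a compaction boundary; skip
	}
	if appliedIndex <= snapshotCatchUpEntries {
		return 0, false // nothing old enough to compact yet
	}
	return appliedIndex - snapshotCatchUpEntries, true
}

func main() {
	for _, i := range []uint64{999, 1000, 6000, 123000} {
		if ci, ok := compactIndexFor(i); ok {
			fmt.Printf("applied=%d -> compact raft log up to %d\n", i, ci)
		} else {
			fmt.Printf("applied=%d -> no compaction\n", i)
		}
	}
}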

@ahrtr ahrtr self-requested a review July 4, 2024 05:55
@ahrtr
Member

ahrtr commented Jul 4, 2024

Thanks for the PR, but I think the complexity introduced in this PR outweighs the benefit. We should always consider simplifying the core apply workflow, so as to improve its understandability and maintainability.

Previously, creating a snapshot was performed asynchronously in a separate goroutine (see the link below), because the v2 snapshot contained all key data, so it could be huge and accordingly could take a long time to create. But that isn't the case anymore; the v2store and the v2 snapshot only contain membership data now.

s.GoAttach(func() {

So I think we can start by turning it into a sync operation; a direct and immediate benefit is a big simplification. But we need to evaluate the impact on performance.

Afterwards, we can discuss any further improvement step by step.

@serathius
Member

serathius commented Jul 4, 2024

> Thanks for the PR, but I think the complexity introduced in this PR outweighs the benefit. We should always consider simplifying the core apply workflow, so as to improve its understandability and maintainability.

I wouldn't phrase it as "complexity outweighs benefits"; however, I would agree that we should consider simplifying the code first. It's a good suggestion to start by making the operation synchronous.

@clement2026 Could you prepare a separate PR that changes snapshot + compact into a sync operation within the apply loop and see how this impacts performance? Removing GoAttach should be enough.

Please let me know if you need any help

@clement2026
Contributor Author

@ahrtr @serathius Wow, that sounds great! Converting snapshot and compact into sync operations should significantly simplify things. I'll start with the sync operation in a new PR and see how it goes. If everything works well, I'll rewrite this PR based on that one.

@ivanvc
Member

ivanvc commented Jul 5, 2024

> Since the rw-benchmark.sh script was taking forever to finish, I adjusted the parameters to complete the benchmark within a few hours.

Out of curiosity, @clement2026, what parameters did you use to run your benchmarks? And how long does it take to run them?

@clement2026
Contributor Author

@ivanvc Initially, I ran the script with the default parameters, but it didn't complete overnight. I then terminated the script and adjusted the parameters to ensure it would finish within a limited timeframe. With the modified parameters, the script completed in a few hours (I did not record the exact time):

export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh

I ran the same script twice last night to benchmark another PR, taking a total of 4 hours and 13 minutes (approximately 2 hours and 6 minutes per run). FYI: #18283 (comment)

@siyuanfoundation
Contributor

I think this PR demonstrates the great potential benefits of this idea. Should the change be guarded by a feature gate to be safe?

@fuweid fuweid self-requested a review July 18, 2024 03:33
@serathius
Member

@clement2026 With #18283 merged, what would be the best way to move forward? Is this PR still relevant?

cc @ahrtr

@k8s-ci-robot

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@clement2026
Contributor Author

> @clement2026 With #18283 merged, what would be the best way to move forward? Is this PR still relevant?
>
> cc @ahrtr

Based on #18283, we can implement this feature with less code and make it simpler. So, we don't need this PR anymore. I’ll create a new PR to keep things moving forward.

Thanks, everyone, for taking the time to review it!

Closing this.

@clement2026 clement2026 deleted the issues-17098 branch July 27, 2024 05:58