gc-compaction: find the correct horizon #10192

Open
Tracked by #9114
skyzh opened this issue Dec 18, 2024 · 2 comments · May be fixed by #10193
Labels: c/storage/pageserver (Component: storage: pageserver), t/bug (Issue Type: Bug)

Comments


skyzh commented Dec 18, 2024

It seems that neither time|space_cutoff nor latest_gc_cutoff is correct for determining the gc horizon for gc-compaction. Currently, gc-compaction does not consider the case where the child branch has a retention period that is lower than the parent branch's gc horizon. With gc-compaction enabled, this makes it impossible to branch off within the child branch's retention period, and it also causes logical size computation failures.

main ------------
        |  |
child   |  ^---------
        |            ^now
        ^now-24hr is on the parent branch
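
One way to picture the case in the diagram: the horizon used on the parent would have to stay at or below the start of every child's retention window that reaches back onto the parent. The following is a minimal sketch under that assumption, not the pageserver's implementation; `Child`, `retention_start_on_parent`, and `effective_horizon` are hypothetical names, and LSNs are modeled as plain `u64` values.

```rust
/// Hypothetical view of a child branch (illustrative only, not a pageserver type).
struct Child {
    /// LSN on the parent where the child's retention window begins,
    /// set only when that window reaches back past the branch point.
    retention_start_on_parent: Option<u64>,
}

/// Assumed rule: the parent's horizon may not exceed any child's
/// retention start that lies on the parent.
fn effective_horizon(parent_cutoff: u64, children: &[Child]) -> u64 {
    children
        .iter()
        .filter_map(|c| c.retention_start_on_parent)
        .fold(parent_cutoff, u64::min)
}
```

With this rule, a child whose window starts at LSN 90 on the parent would pull a parent cutoff of 100 down to 90, while children whose window starts after the branch point leave it unchanged.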

skyzh added the c/storage/pageserver (Component: storage: pageserver) and t/bug (Issue Type: Bug) labels Dec 18, 2024
skyzh self-assigned this Dec 18, 2024

skyzh commented Dec 18, 2024

It seems we have a tenant-level PiTR, but we still get into races where the gc_horizon obtained in gc-compaction causes logical size computation to get stuck. I assume this is because we are not using the same SystemTime::now() for all branches when computing the horizon.
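
To illustrate the suspected race, here is a minimal sketch of reading the clock once and deriving every branch's cutoff timestamp from that single value. `Branch` and `cutoff_timestamps` are hypothetical names for illustration, not pageserver APIs; the caller is assumed to pass in one `SystemTime::now()` reading for the whole tenant.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for a timeline (illustrative only).
struct Branch {
    name: &'static str,
}

/// Derive the PITR cutoff timestamp for every branch from one clock reading,
/// so that all branches agree on the same point in time.
fn cutoff_timestamps(
    branches: &[Branch],
    pitr: Duration,
    now: SystemTime,
) -> Vec<(&'static str, SystemTime)> {
    // `checked_sub` returns None on underflow; fall back to the epoch then.
    let cutoff = now.checked_sub(pitr).unwrap_or(SystemTime::UNIX_EPOCH);
    branches.iter().map(|b| (b.name, cutoff)).collect()
}
```

Calling `SystemTime::now()` separately per branch would instead yield slightly different cutoff timestamps, which can map to different cutoff LSNs.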


skyzh commented Dec 18, 2024

Looking at the staging tenant, the race can be observed: the global cutoff is set to 86400s, so ideally we should get exactly the same time cutoff LSNs for all branches. But we get two different time cutoff LSNs on two branches:

2024-12-17T15:57:30.232055Z  INFO gc_loop{tenant_id=12fd6e6d7a50bf7dd96154ec39b8b7c8 shard_id=0000}:run:gc_timeline{timeline_id=2365c8af38983a9c48e6e4df0c5ae767 cutoff=0/E4B979D0}: Nothing to GC: new_gc_cutoff_lsn 0/E4B979D0, latest_gc_cutoff_lsn 0/E4B979D0
2024-12-17T15:57:30.232041Z  INFO gc_loop{tenant_id=12fd6e6d7a50bf7dd96154ec39b8b7c8 shard_id=0000}:run:gc_timeline{timeline_id=9136e295b2647dae2fc5e2a2abbb1dc6 cutoff=0/E4B96D18}: Nothing to GC: new_gc_cutoff_lsn 0/E4B96D18, latest_gc_cutoff_lsn 0/E4B96D18

This in turn causes a missing key error when computing logical size after running gc-compaction on these latest cutoff LSNs:

2024-12-18T21:35:46.442792Z ERROR synthetic_size_worker: failed to calculate synthetic size for tenant 12fd6e6d7a50bf7dd96154ec39b8b7c8: could not find data for key 010000000000000000000000000000000000 (shard ShardNumber(0)) at LSN 0/E4B839F1, request LSN 0/E4B839F0, ancestor 0/0
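
To compare the LSNs in these log lines numerically, the `HI/LO` hex notation can be converted to a 64-bit integer. This is a small illustrative helper (`parse_lsn` is a hypothetical name, not the pageserver's `Lsn` type); it assumes the standard PostgreSQL notation where the part before the slash is the high 32 bits.

```rust
/// Parse an LSN printed as "HI/LO" (two hex halves) into a u64.
/// Returns None if the text does not have that shape.
fn parse_lsn(text: &str) -> Option<u64> {
    let (hi, lo) = text.split_once('/')?;
    let hi = u32::from_str_radix(hi, 16).ok()?;
    let lo = u32::from_str_radix(lo, 16).ok()?;
    Some(((hi as u64) << 32) | lo as u64)
}
```

Applied to the values above, the two time cutoff LSNs 0/E4B979D0 and 0/E4B96D18 differ by 0xCB8 (3256) bytes, and the request LSN 0/E4B839F0 from the error lies below both of them.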
