
Repair: describe ring per table and without cache #3718

Merged: 6 commits into master from ml/tablet-repair on Mar 13, 2024

Conversation

@Michal-Leszczynski (Collaborator) commented Feb 16, 2024:

Right now, the repair plan contains descriptions of all repaired rings, stored at the keyspace level. This has two main issues:

  • unnecessary memory consumption
  • it makes it difficult to move to the per-table ring description required for tablets

The main goal of this PR is to remove ring descriptions from the plan and query them on demand during the actual repair.
To achieve that, the generator's responsibilities are split between generator and tableGenerator. The generator queries the ring description and creates tableGenerators, each of which takes care of repairing a given table (a rough sketch of the flow is below).
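
A rough sketch of the intended split, with hypothetical names and signatures (it does not mirror the actual scylla-manager code):

```go
package repair

import "context"

// Illustrative types only - not the real scylla-manager structures.
type Table struct{ Keyspace, Name string }

type Ring struct {
	// Token ranges, replica sets, replication info, ...
}

type ringDescriber interface {
	// Hypothetical call: describe the ring of a single table on demand.
	DescribeRing(ctx context.Context, keyspace, table string) (Ring, error)
}

// generator walks the plan, queries each table's ring just in time and
// hands repair of that table over to a tableGenerator. Only one ring
// description is kept in memory at a time.
type generator struct {
	describer ringDescriber
	tables    []Table
}

func (g *generator) Run(ctx context.Context) error {
	for _, t := range g.tables {
		ring, err := g.describer.DescribeRing(ctx, t.Keyspace, t.Name)
		if err != nil {
			return err
		}
		tg := tableGenerator{table: t, ring: ring}
		if err := tg.Run(ctx); err != nil {
			return err
		}
	}
	return nil
}

// tableGenerator is responsible for repairing a single table.
type tableGenerator struct {
	table Table
	ring  Ring
}

func (tg tableGenerator) Run(ctx context.Context) error {
	// Schedule and track repair jobs for the table's token ranges here.
	return nil
}
```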

This involves a lot of refactoring, so to make the repair logic more comprehensible I added:

  • MaxRingParallel
  • ShouldRepairRing

Currently, determining the max parallelism or whether a keyspace should be repaired is convoluted, as it requires traversing all token ranges. The changes to the DescribeRing replication strategy calculation make it possible to do this by looking only at the replication strategy and replication factor, which is easier to understand, explain and test (sketch below).
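
To illustrate the idea only (hypothetical types and field names, assuming the ring exposes per-DC node counts and replication factors; not the actual implementation):

```go
package repair

// Illustrative ring metadata - the real type lives in the scyllaclient package.
type Strategy string

const (
	LocalStrategy           Strategy = "LocalStrategy"
	SimpleStrategy          Strategy = "SimpleStrategy"
	NetworkTopologyStrategy Strategy = "NetworkTopologyStrategy"
)

type RingInfo struct {
	Strategy Strategy
	RF       int            // total replication factor
	DCrf     map[string]int // datacenter -> replication factor
	DCnodes  map[string]int // datacenter -> node count
}

// MaxRingParallel estimates how many repair jobs can run on the ring in
// parallel: each job occupies a full replica set, so a DC with n nodes and
// replication factor rf can host at most n/rf concurrent jobs.
func MaxRingParallel(ring RingInfo, dcs []string) int {
	parallel := 0
	switch ring.Strategy {
	case NetworkTopologyStrategy:
		for _, dc := range dcs {
			if rf := ring.DCrf[dc]; rf > 0 {
				p := ring.DCnodes[dc] / rf
				if parallel == 0 || p < parallel {
					parallel = p // bounded by the most constrained DC
				}
			}
		}
	default:
		nodes := 0
		for _, dc := range dcs {
			nodes += ring.DCnodes[dc]
		}
		if ring.RF > 0 {
			parallel = nodes / ring.RF
		}
	}
	if parallel < 1 {
		parallel = 1
	}
	return parallel
}

// ShouldRepairRing reports whether repairing the ring makes sense: the data
// must not be node-local and there must be at least two replicas to compare
// within the repaired DCs.
func ShouldRepairRing(ring RingInfo, dcs []string) bool {
	if ring.Strategy == LocalStrategy {
		return false
	}
	rf := ring.RF
	if ring.Strategy == NetworkTopologyStrategy {
		rf = 0
		for _, dc := range dcs {
			rf += ring.DCrf[dc]
		}
	}
	return rf >= 2
}
```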

This PR tries not to touch the repair progress manager, but it might be good to refactor it as well.

Fixes #3745

@Michal-Leszczynski (Collaborator, Author):

On second thought, it is also possible to keep the repair plan organized per keyspace/table. That might require fewer changes.

@Michal-Leszczynski force-pushed the ml/tablet-repair branch 3 times, most recently from e070900 to 573d0f5 on February 22, 2024 12:20
@Michal-Leszczynski changed the title from "WIP(repair): reduce plan to table level instead of keyspace/table level" to "Repair: describe ring per table and without cache" on Feb 22, 2024
@Michal-Leszczynski marked this pull request as ready for review on February 22, 2024 13:37
@Michal-Leszczynski (Collaborator, Author):

We could also change progress to keep stats only for the currently repaired table, but that can be part of another PR.

@Michal-Leszczynski (Collaborator, Author):

Dtests passed except for one unrelated test.

@karol-kokoszka (Collaborator) left a comment:


My brain compiler is too weak to verify that all the changes made to the generator won't break the repair process.
The code itself (generator.go) looks good.
I believe the repair integration tests should catch potential bugs. Do you agree? Maybe it's also worth executing the SCT sanity tests against this branch?

Review threads (resolved) on: pkg/service/repair/plan.go, pkg/service/repair/service.go, pkg/service/repair/progress_integration_test.go, pkg/service/repair/progress.go
@Michal-Leszczynski (Collaborator, Author):

@karol-kokoszka It seems like the additional fields added to Ring in #3747 would be useful in this PR. Having the dc -> rf mapping would make keyspace filtering and max parallelism calculation way cleaner. I will refactor and rebase this PR on top of master when #3747 gets merged.
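
For context only, a minimal sketch of the kind of field meant here and how it could be consumed; the actual fields added in #3747 may differ:

```go
package repair

// Hypothetical extension of the ring description (the real #3747 fields may differ).
type Ring struct {
	// ...existing fields: token ranges, replica sets, etc.
	DCrf map[string]int // datacenter -> replication factor
}

// repairedRF is the effective replication factor within the repaired DCs:
// the single value that keyspace filtering and max parallelism can be
// derived from without scanning token ranges.
func repairedRF(ring Ring, dcs []string) int {
	rf := 0
	for _, dc := range dcs {
		rf += ring.DCrf[dc]
	}
	return rf
}
```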

MaxRingParallel is a cleaner and more comprehensible way of calculating max repair parallelism based on Ring replication strategy.
ShouldRepairRing is a cleaner and more comprehensible way of checking if Ring should be repaired.
This approach stores only a single ring description at a time (instead of all of them) and makes it easier to move to per-table ring descriptions.
@Michal-Leszczynski (Collaborator, Author):

@karol-kokoszka I decided to fix #3745 in this PR, as the fix is used by the refactored code. I also tried to address all the comments above. Sorry for the large PR, but could you please take a look at it again?

I have some problems with building and running tests on Jenkins. I will retry them tomorrow.

@karol-kokoszka (Collaborator):

> @karol-kokoszka It seems like the additional fields added to Ring in #3747 would be useful in this PR. Having the dc -> rf mapping would make keyspace filtering and max parallelism calculation way cleaner. I will refactor and rebase this PR on top of master when #3747 gets merged.

But you closed the PR.

@Michal-Leszczynski (Collaborator, Author):

> But you closed the PR.

Yes, as mentioned above, I decided to include those fixes here because they are important for this PR.

@karol-kokoszka (Collaborator) left a comment:


👍

@Michal-Leszczynski (Collaborator, Author) commented Mar 13, 2024:

Test status:

  • dtests - passed
  • centos-sanity-test - failed due to instance termination - re-kicked - passed
  • ubuntu22-sanity-test - failed due to instance termination - re-kicked - failed again, but it looks like a cluster setup failure

@Michal-Leszczynski (Collaborator, Author):

@karol-kokoszka are we OK to merge, or do we want to investigate the ubuntu22-sanity-test failure first?

@karol-kokoszka (Collaborator):

I checked it and it looks like it failed on your run. Previous builds failed due to the spot instance being terminated.
Unfortunately, I don't see a clear reason why this SCT run failed.
Did you try other sanity tests? Is it only ubuntu22-sanity-test that fails?

@Michal-Leszczynski (Collaborator, Author):

> Did you try other sanity tests? Is it only ubuntu22-sanity-test that fails?

centos-sanity-test passes.

@karol-kokoszka (Collaborator):

Merge it, then please check the output of this job after merging to master. If it fails again without a clear reason in the logs (as it does now), please file a ticket in https://github.com/scylladb/scylla-cluster-tests.

@Michal-Leszczynski merged commit ef46c8b into master on Mar 13, 2024
38 of 42 checks passed
@Michal-Leszczynski deleted the ml/tablet-repair branch on March 13, 2024 15:29
@Michal-Leszczynski (Collaborator, Author):

Actions look good after merging to master.


Successfully merging this pull request may close these issues:

  • Incorrect replication strategy calculation in DescribeRing