
Repair: describe ring per table and without cache #3718

Merged: 6 commits into master from ml/tablet-repair on Mar 13, 2024

Conversation

@Michal-Leszczynski (Collaborator) commented Feb 16, 2024:

Right now, the repair plan contains descriptions of all repaired rings, stored at the keyspace level. This has two main issues:

  • unnecessary memory consumption
  • it makes it difficult to move to the per-table ring description required for tablets

The main goal of this PR is to remove ring descriptions from the plan and query them on demand during the actual repair.
To achieve that, the generator's responsibilities are split between generator and tableGenerator. The generator queries the ring description and creates tableGenerators, each of which takes care of repairing a given table (a rough sketch of the flow is below).
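
A rough sketch of the intended split, with hypothetical names and signatures (it does not mirror the actual scylla-manager code):

```go
package repair

import "context"

// Illustrative types only - not the real scylla-manager structures.
type Table struct{ Keyspace, Name string }

type Ring struct {
	// Token ranges, replica sets, replication info, ...
}

type ringDescriber interface {
	// Hypothetical call: describe the ring of a single table on demand.
	DescribeRing(ctx context.Context, keyspace, table string) (Ring, error)
}

// generator walks the plan, queries each table's ring just in time and
// hands repair of that table over to a tableGenerator. Only one ring
// description is kept in memory at a time.
type generator struct {
	describer ringDescriber
	tables    []Table
}

func (g *generator) Run(ctx context.Context) error {
	for _, t := range g.tables {
		ring, err := g.describer.DescribeRing(ctx, t.Keyspace, t.Name)
		if err != nil {
			return err
		}
		tg := tableGenerator{table: t, ring: ring}
		if err := tg.Run(ctx); err != nil {
			return err
		}
	}
	return nil
}

// tableGenerator is responsible for repairing a single table.
type tableGenerator struct {
	table Table
	ring  Ring
}

func (tg tableGenerator) Run(ctx context.Context) error {
	// Schedule and track repair jobs for the table's token ranges here.
	return nil
}
```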

This involves a lot of refactoring, so to make the repair logic more comprehensible I added:

  • MaxRingParallel
  • ShouldRepairRing

Currently, determining the max parallelism or whether a keyspace should be repaired is convoluted, as it requires traversing all token ranges. The changes to the DescribeRing replication strategy calculation make it possible to do this by looking only at the replication strategy and replication factor, which is easier to understand, explain and test (sketch below).
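
To illustrate the idea only (hypothetical types and field names, assuming the ring exposes per-DC node counts and replication factors; not the actual implementation):

```go
package repair

// Illustrative ring metadata - the real type lives in the scyllaclient package.
type Strategy string

const (
	LocalStrategy           Strategy = "LocalStrategy"
	SimpleStrategy          Strategy = "SimpleStrategy"
	NetworkTopologyStrategy Strategy = "NetworkTopologyStrategy"
)

type RingInfo struct {
	Strategy Strategy
	RF       int            // total replication factor
	DCrf     map[string]int // datacenter -> replication factor
	DCnodes  map[string]int // datacenter -> node count
}

// MaxRingParallel estimates how many repair jobs can run on the ring in
// parallel: each job occupies a full replica set, so a DC with n nodes and
// replication factor rf can host at most n/rf concurrent jobs.
func MaxRingParallel(ring RingInfo, dcs []string) int {
	parallel := 0
	switch ring.Strategy {
	case NetworkTopologyStrategy:
		for _, dc := range dcs {
			if rf := ring.DCrf[dc]; rf > 0 {
				p := ring.DCnodes[dc] / rf
				if parallel == 0 || p < parallel {
					parallel = p // bounded by the most constrained DC
				}
			}
		}
	default:
		nodes := 0
		for _, dc := range dcs {
			nodes += ring.DCnodes[dc]
		}
		if ring.RF > 0 {
			parallel = nodes / ring.RF
		}
	}
	if parallel < 1 {
		parallel = 1
	}
	return parallel
}

// ShouldRepairRing reports whether repairing the ring makes sense: the data
// must not be node-local and there must be at least two replicas to compare
// within the repaired DCs.
func ShouldRepairRing(ring RingInfo, dcs []string) bool {
	if ring.Strategy == LocalStrategy {
		return false
	}
	rf := ring.RF
	if ring.Strategy == NetworkTopologyStrategy {
		rf = 0
		for _, dc := range dcs {
			rf += ring.DCrf[dc]
		}
	}
	return rf >= 2
}
```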

This PR tries not to touch the repair progress manager, but it might be good to refactor it as well.

Fixes #3745

@Michal-Leszczynski (Collaborator, Author):

On second thought, it is also possible to keep the repair plan organized per keyspace/table. That might require fewer changes.

@Michal-Leszczynski force-pushed the ml/tablet-repair branch 3 times, most recently from e070900 to 573d0f5 on February 22, 2024 12:20
@Michal-Leszczynski changed the title from "WIP(repair): reduce plan to table level instead of keyspace/table level" to "Repair: describe ring per table and without cache" on Feb 22, 2024
@Michal-Leszczynski marked this pull request as ready for review on February 22, 2024 13:37
@Michal-Leszczynski (Collaborator, Author):

We could also change progress to keep stats only for the currently repaired table, but that can be part of another PR.

@Michal-Leszczynski (Collaborator, Author):

Dtests passed except for one unrelated test.

@karol-kokoszka (Collaborator) left a comment:


My brain compiler is too weak to verify that all the changes made to the generator won't break the repair process.
The code itself (generator.go) looks good.
I believe the repair integration tests should catch potential bugs. Do you agree? Maybe it's also worth executing the SCT sanity tests against this branch?

Review threads (resolved) on: pkg/service/repair/plan.go, pkg/service/repair/service.go, pkg/service/repair/progress_integration_test.go, pkg/service/repair/progress.go
@Michal-Leszczynski (Collaborator, Author):

@karol-kokoszka It seems like the additional fields added to Ring in #3747 would be useful in this PR. Having the dc -> rf mapping would make keyspace filtering and max parallelism calculation way cleaner. I will refactor and rebase this PR on top of master when #3747 gets merged.
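
For context only, a minimal sketch of the kind of field meant here and how it could be consumed; the actual fields added in #3747 may differ:

```go
package repair

// Hypothetical extension of the ring description (the real #3747 fields may differ).
type Ring struct {
	// ...existing fields: token ranges, replica sets, etc.
	DCrf map[string]int // datacenter -> replication factor
}

// repairedRF is the effective replication factor within the repaired DCs:
// the single value that keyspace filtering and max parallelism can be
// derived from without scanning token ranges.
func repairedRF(ring Ring, dcs []string) int {
	rf := 0
	for _, dc := range dcs {
		rf += ring.DCrf[dc]
	}
	return rf
}
```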

MaxRingParallel is a cleaner and more comprehensible way of calculating max repair parallelism based on Ring replication strategy.
ShouldRepairRing is a cleaner and more comprehensible way of checking if Ring should be repaired.
This approach stores only a single ring description at a time (instead of all of them) and makes it easier to move to per-table ring descriptions.
@Michal-Leszczynski (Collaborator, Author):

@karol-kokoszka I decided to fix #3745 in this PR, as the fix is used by the refactored code. I also tried to address all the comments above. Sorry for the large PR, but could you please take a look at it again?

I have some problems with building and running tests on Jenkins. I will retry them tomorrow.

@karol-kokoszka (Collaborator):

> @karol-kokoszka It seems like the additional fields added to Ring in #3747 would be useful in this PR. Having the dc -> rf mapping would make keyspace filtering and max parallelism calculation way cleaner. I will refactor and rebase this PR on top of master when #3747 gets merged.

But you closed the PR.

@Michal-Leszczynski (Collaborator, Author):

> But you closed the PR.

Yes, as mentioned above, I decided to include those fixes here because they are important for this PR.

@karol-kokoszka (Collaborator) left a comment:


👍

@Michal-Leszczynski (Collaborator, Author) commented Mar 13, 2024:

Test status:

  • dtests - passed
  • centos-sanity-test - failed due to instance termination - re-kicked - passed
  • ubuntu22-sanity-test - failed due to instance termination - re-kicked - failed again, but it looks like a cluster setup failure

@Michal-Leszczynski (Collaborator, Author):

@karol-kokoszka are we OK to merge, or do we want to investigate the ubuntu22-sanity-test failure first?

@karol-kokoszka (Collaborator):

I checked it and it looks like it failed on your run. Previous builds failed due to the spot instance being terminated.
Unfortunately, I don't see a clear reason why this SCT run failed.
Did you try other sanity tests? Is it only ubuntu22-sanity-test that fails?

@Michal-Leszczynski (Collaborator, Author):

> Did you try other sanity tests? Is it only ubuntu22-sanity-test that fails?

centos-sanity-test passes.

@karol-kokoszka (Collaborator):

Merge it, then please check the output of this job after merging to master. If it fails again without a clear reason in the logs (as it does now), please file a ticket in https://github.com/scylladb/scylla-cluster-tests.

@Michal-Leszczynski merged commit ef46c8b into master on Mar 13, 2024
38 of 42 checks passed
@Michal-Leszczynski deleted the ml/tablet-repair branch on March 13, 2024 15:29
@Michal-Leszczynski (Collaborator, Author):

Actions look good after merging to master.


Successfully merging this pull request may close these issues:

  • Incorrect replication strategy calculation in DescribeRing