[storage] anvil-manage-server-storage must be able to handle drbd resync during grow #748

Open · fabbione opened this issue Oct 14, 2024 · 6 comments

@fabbione (Member):
This is not a super common situation, but it still needs to be handled properly, or storage is leaked during grow operations.

Create a server, then stop it to resize the root disk (this can happen on any disk; in my test I only had one disk).

Run for the first time:
anvil-manage-server-storage --server an-test-deploy1 --grow 5G --disk vda --confirm
....
Done!

Wait for the DRBD resync to complete <-- IMPORTANT. All good, you can issue again:

anvil-manage-server-storage --server an-test-deploy1 --grow 5G --disk vda --confirm
....
Done!

and it will work as expected.
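
For reference, a quick way to confirm the resync has finished is to poll drbdadm status and wait for the Sync* replication states to clear. A minimal sketch (not part of the tool; the resource name is taken from the example above):

# Wait until no peer device of the resource reports a Sync* replication
# state (SyncSource/SyncTarget); once clear, it is safe to grow again.
while drbdadm status an-test-deploy1 | grep -q 'Sync'; do
    sleep 5
done
echo "Resync complete, safe to grow again."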

Wait for the DRBD resync to complete <-- IMPORTANT. All good, you can issue:

anvil-manage-server-storage --server an-test-deploy1 --grow 30G --disk vda --confirm
...
Done!

and issue the same command IMMEDIATELY after:

# anvil-manage-server-storage --server an-test-deploy1 --grow 30G --disk vda --confirm
Working with the server: [an-test-deploy1], UUID: [d5af3b99-8e57-418f-99d6-90f74372ff78]
- Target: [vda], boot: [01], path: [/dev/drbd/by-res/an-test-deploy1/0], Available space: [130.00 GiB]
- Preparing to grow the storage by: [30.00GiB]...
 - Extending local LV: [/dev/anvil-test-vg/an-test-deploy1_0]...
Done!
 - Extending peer: [an-a01n02:/dev/anvil-test-vg/an-test-deploy1_0], via: [10.201.10.2 (bcn1)]
Done!
- Extending backing devices complete. Now extending DRBD resource/volume...
 Error!
[ Failed ] - When trying to grow the DRBD device: [an-test-deploy1/0]
[ Failed ] - using the command: [/usr/sbin/drbdadm resize an-test-deploy1/0]
[ Failed ] - The return code: [10] was received, expected '0'. Output, if any:
==========
print $output!#
==========
The extension of the resource is incomplete, manual intervention is required!!
[ Note ] - All backing devices have been grown. Manually resolving the drbd grow
[ Note ] - error should complete the drive expansion!

This issue is caused by the DRBD resource refusing a resize while one is already in flight. At this point we are leaking storage.

The LV has been resized, but DRBD will not see or recognize the new size.
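
For what it's worth, the manual intervention the tool asks for is presumably just re-issuing the resize once the in-flight resync settles, along these lines (a hedged sketch, not verified against the tool):

# Once the earlier resync has finished, re-running the resize lets DRBD
# pick up the already-grown backing LVs.
while drbdadm status an-test-deploy1 | grep -q 'Sync'; do
    sleep 5
done
drbdadm resize an-test-deploy1/0   # the command that previously failed with rc 10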

Storage is leaked any time a DRBD resize request fails; this is just one possible trigger.

For the grow operation specifically, either check the DRBD status BEFORE resizing the LV and exit 1 if a resync is in progress (avoiding the leak), or loop and wait for the first sync to complete before issuing the next resize; a sketch of the first option follows.
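
A minimal shell sketch of the check-and-refuse option (illustrative only; the tool itself would implement this internally):

# Refuse to start the grow while a resync is in flight; since nothing has
# been resized yet at this point, no storage is leaked.
RES="an-test-deploy1"    # hypothetical variable for the resource name
if drbdadm status "$RES" | grep -q 'Sync'; then
    echo "[ Failed ] - A resync of ${RES} is still in flight, refusing to grow." >&2
    exit 1
fi
# Only now is it safe to extend the local and peer LVs and then run
# 'drbdadm resize'.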

@digimer (Member) commented Oct 14, 2024:

What do you mean by “leaking storage”?

@fabbione (Member, Author):

Simple: the LV is resized, but not the DRBD device. That means the VM doesn't see the storage, but it is allocated in LVM. That storage is unavailable for anyone to use.

@digimer (Member) commented Oct 14, 2024:

Ah, that is expected. There's a period of time where it's unavoidable that one LV is grown before the peer node's LV is grown, and DRBD can't be grown until both are. If I've started a grow operation, I don't want that space to be available for others to use. The scan-lvm scan agent should see the reduced free space in the VG and drop the available space in the associated storage group.

@fabbione (Member, Author):

That is NOT the issue. The issue is that the LV is grown (correctly), the second DRBD resize fails, and nothing is going to trigger another DRBD resize to match the new LV size. Hence the space is lost.

@digimer (Member) commented Oct 14, 2024:

Aaaah, ok, sorry I misunderstood.

@digimer (Member) commented Oct 16, 2024:

ToDo:

  1. When doing a resize, check the DRBD device size against the LV size and make sure all space is used. If not, do a grow (see the sketch after this list).
  2. On resize, don't even start the resize operation until both/all DRBD resources are UpToDate.

Don't allow the resize job to start until all nodes are online (there is no other way to ensure UpToDate on all DRBD nodes).
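
A rough sketch of item 1 (hedged; DRBD's internal metadata makes the device slightly smaller than its backing LV, so an exact equality check won't work and the margin below is an assumption):

# Compare the DRBD device size against the backing LV size; the device
# and LV paths come from the transcript above.
DRBD_DEV=/dev/drbd/by-res/an-test-deploy1/0
LV_DEV=/dev/anvil-test-vg/an-test-deploy1_0
drbd_bytes=$(blockdev --getsize64 "$DRBD_DEV")
lv_bytes=$(blockdev --getsize64 "$LV_DEV")
margin=$((128 * 1024 * 1024))   # assumed slack for internal metadata, not exact
if [ $((lv_bytes - drbd_bytes)) -gt "$margin" ]; then
    echo "DRBD device is smaller than its LV, re-running the resize..."
    drbdadm resize an-test-deploy1/0
fi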
