-
Notifications
You must be signed in to change notification settings - Fork 253
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: deadlock between bootstrap_cross_signing
and sync
#4060
Conversation
`bootstrap_cross_signing` holds a lock on the private identity. In case a new identity is created, it will try to acquire a lock on `account`. The latter is locked by `sync`, which tries to acquire a lock on the private identity. Note that the `bootstrap_cross_signing` call is spawned in a separate task e.g. in `restore_session`. In particular, this task and `sync` both race to acquire locks described above. Signed-off-by: boxdot <[email protected]>
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #4060 +/- ##
==========================================
- Coverage 84.55% 84.55% -0.01%
==========================================
Files 266 266
Lines 28449 28448 -1
==========================================
- Hits 24054 24053 -1
Misses 4395 4395 ☔ View full report in Codecov by Sentry. |
@poljar I'm out of depth here: please could you review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an alternative approach might be better here, let me know if that solves the deadlock as well.
@@ -642,28 +642,32 @@ impl OlmMachine { | |||
&self, | |||
reset: bool, | |||
) -> StoreResult<CrossSigningBootstrapRequests> { | |||
let mut identity = self.inner.user_identity.lock().await; | |||
// Don't hold the lock, otherwise we might deadlock in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not quite sure I understand what's happening here.
We drop the lock on the identity before we call upload_device_keys()
, so it can't be that we deadlock because of that.
Is it perhaps because we can't acquire the lock on the account to call bootstrap_cross_singing()
?
If so, then we should, instead of not holding the lock here, acquire the account lock first. The lock on the identity also servers to make this method idempotent. Which kinda was ruined by the drop()
call bellow, but this would make it even worse.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here are the backtraces from two places locking the account.
<Holding accounts lock from sync>
0: matrix_sdk_crypto::store::StoreCache::account::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:371:25
1: matrix_sdk_crypto::store::StoreCacheGuard::account::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:409:30
2: matrix_sdk_crypto::machine::OlmMachine::outgoing_requests::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/machine/mod.rs:513:49
3: matrix_sdk::encryption::<impl matrix_sdk::client::Client>::send_outgoing_requests::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:723:18
4: matrix_sdk::client::Client::sync_once::{{closure}}::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/client/mod.rs:1821:55
5: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
6: matrix_sdk::client::Client::sync_once::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/client/mod.rs:1809:5
7: matrix_sdk::sync::<impl matrix_sdk::client::Client>::sync_loop_helper::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/sync.rs:283:62
8: matrix_sdk::client::Client::sync_with_result_callback::{{closure}}::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/client/mod.rs:2066:68
<Holding account lock from bootstrap_cross_signing>
0: matrix_sdk_crypto::store::StoreCache::account::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:371:25
1: matrix_sdk_crypto::store::StoreCacheGuard::account::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:409:30
2: matrix_sdk_crypto::machine::OlmMachine::bootstrap_cross_signing::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/machine/mod.rs:654:47
3: matrix_sdk::encryption::Encryption::bootstrap_cross_signing::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:1185:48
4: matrix_sdk::encryption::Encryption::bootstrap_cross_signing_if_needed::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:1353:53
5: matrix_sdk::encryption::Encryption::spawn_initialization_task::{{closure}}
at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:1668:83
Technically, we need the other two stack trace acquiring private_identity
, but we can just follow the stack down and track the execution. Indeed, in the second backtrace the private_identity
is already locked (this is the lock this patch is removing). In the first, trace continuing the executing private_identity
is locked in keys_for_upload
.
As proposed, we could try to acquire the lock on account first. Before actually locking the private identity. I am not sure it is ok to hold two locks for such a big chunk of execution. It also creates a dependency between locks. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As proposed, we could try to acquire the lock on account first. Before actually locking the private identity. I am not sure it is ok to hold two locks for such a big chunk of execution. It also creates a dependency between locks. WDYT?
I think that this would be the way to go. Eventually we'll want to move the private identity behind the store cache lock, and then we would only have a single lock for this and no deadlock because of interdependent locks could happen.
I don't think that this is quite as a big chunk of execution, we're just generating some random bytes and producing a signature, so things should™ be fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Locking account on top of bootstrap_cross_signing
now deadlocks on it. But only for the first two initial logins. First time (no account yet) it just deadlocks, second time the bootstrap finishes with Own identity is now verified, check all known identities for verification status changes
but now sync is deadlocked. Third time all starts working. I can't tell why it is like that without digging deeper. Maybe sync hold a lock to it for some reason.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, alright, let's merge this PR then and I'll see if I can move the things under the cache so we have a simpler locking model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I could take a look at this as a follow up?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, if you have the time, that would be great.
bootstrap_cross_signing
holds a lock on the private identity. In case a new identity is created, it will try to acquire a lock onaccount
. The latter is locked bysync
, which tries to acquire a lock on the private identity.Note that the
bootstrap_cross_signing
call is spawned in a separate task e.g. inrestore_session
. In particular, this task andsync
both race to acquire locks described above.