Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: deadlock between bootstrap_cross_signing and sync #4060

Merged
merged 4 commits into from
Oct 8, 2024

Conversation

boxdot
Copy link
Contributor

@boxdot boxdot commented Oct 1, 2024

bootstrap_cross_signing holds a lock on the private identity. In case a new identity is created, it will try to acquire a lock on account. The latter is locked by sync, which tries to acquire a lock on the private identity.

Note that the bootstrap_cross_signing call is spawned in a separate task e.g. in restore_session. In particular, this task and sync both race to acquire locks described above.

`bootstrap_cross_signing` holds a lock on the private identity. In case
a new identity is created, it will try to acquire a lock on `account`.
The latter is locked by `sync`, which tries to acquire a lock on the
private identity.

Note that the `bootstrap_cross_signing` call is spawned in a separate
task e.g. in `restore_session`. In particular, this task and `sync` both
race to acquire locks described above.

Signed-off-by: boxdot <[email protected]>
@boxdot boxdot requested review from a team as code owners October 1, 2024 12:06
@boxdot boxdot requested review from andybalaam and removed request for a team October 1, 2024 12:06
Copy link

codecov bot commented Oct 1, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 84.55%. Comparing base (866b6e5) to head (e506d08).
Report is 76 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4060      +/-   ##
==========================================
- Coverage   84.55%   84.55%   -0.01%     
==========================================
  Files         266      266              
  Lines       28449    28448       -1     
==========================================
- Hits        24054    24053       -1     
  Misses       4395     4395              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@andybalaam
Copy link
Contributor

@poljar I'm out of depth here: please could you review?

@andybalaam andybalaam requested review from poljar and removed request for andybalaam October 3, 2024 10:08
Copy link
Contributor

@poljar poljar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think an alternative approach might be better here, let me know if that solves the deadlock as well.

@@ -642,28 +642,32 @@ impl OlmMachine {
&self,
reset: bool,
) -> StoreResult<CrossSigningBootstrapRequests> {
let mut identity = self.inner.user_identity.lock().await;
// Don't hold the lock, otherwise we might deadlock in
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not quite sure I understand what's happening here.

We drop the lock on the identity before we call upload_device_keys(), so it can't be that we deadlock because of that.

Is it perhaps because we can't acquire the lock on the account to call bootstrap_cross_singing()?

If so, then we should, instead of not holding the lock here, acquire the account lock first. The lock on the identity also servers to make this method idempotent. Which kinda was ruined by the drop() call bellow, but this would make it even worse.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are the backtraces from two places locking the account.

  <Holding accounts lock from sync>
  0: matrix_sdk_crypto::store::StoreCache::account::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:371:25
  1: matrix_sdk_crypto::store::StoreCacheGuard::account::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:409:30
  2: matrix_sdk_crypto::machine::OlmMachine::outgoing_requests::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/machine/mod.rs:513:49
  3: matrix_sdk::encryption::<impl matrix_sdk::client::Client>::send_outgoing_requests::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:723:18
  4: matrix_sdk::client::Client::sync_once::{{closure}}::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk/src/client/mod.rs:1821:55
  5: <tracing::instrument::Instrumented<T> as core::future::future::Future>::poll
            at .cargo/registry/src/index.crates.io-6f17d22bba15001f/tracing-0.1.40/src/instrument.rs:321:9
  6: matrix_sdk::client::Client::sync_once::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk/src/client/mod.rs:1809:5
  7: matrix_sdk::sync::<impl matrix_sdk::client::Client>::sync_loop_helper::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk/src/sync.rs:283:62
  8: matrix_sdk::client::Client::sync_with_result_callback::{{closure}}::{{closure}}
            at src/matrix-rust-sdk/crates/matrix-sdk/src/client/mod.rs:2066:68

<Holding account lock from bootstrap_cross_signing>
   0: matrix_sdk_crypto::store::StoreCache::account::{{closure}}
             at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:371:25
   1: matrix_sdk_crypto::store::StoreCacheGuard::account::{{closure}}
             at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/store/mod.rs:409:30
   2: matrix_sdk_crypto::machine::OlmMachine::bootstrap_cross_signing::{{closure}}
             at src/matrix-rust-sdk/crates/matrix-sdk-crypto/src/machine/mod.rs:654:47
   3: matrix_sdk::encryption::Encryption::bootstrap_cross_signing::{{closure}}
             at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:1185:48
   4: matrix_sdk::encryption::Encryption::bootstrap_cross_signing_if_needed::{{closure}}
             at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:1353:53
   5: matrix_sdk::encryption::Encryption::spawn_initialization_task::{{closure}}
             at src/matrix-rust-sdk/crates/matrix-sdk/src/encryption/mod.rs:1668:83

Technically, we need the other two stack trace acquiring private_identity, but we can just follow the stack down and track the execution. Indeed, in the second backtrace the private_identity is already locked (this is the lock this patch is removing). In the first, trace continuing the executing private_identity is locked in keys_for_upload.

As proposed, we could try to acquire the lock on account first. Before actually locking the private identity. I am not sure it is ok to hold two locks for such a big chunk of execution. It also creates a dependency between locks. WDYT?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As proposed, we could try to acquire the lock on account first. Before actually locking the private identity. I am not sure it is ok to hold two locks for such a big chunk of execution. It also creates a dependency between locks. WDYT?

I think that this would be the way to go. Eventually we'll want to move the private identity behind the store cache lock, and then we would only have a single lock for this and no deadlock because of interdependent locks could happen.

I don't think that this is quite as a big chunk of execution, we're just generating some random bytes and producing a signature, so things should™ be fine.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Locking account on top of bootstrap_cross_signing now deadlocks on it. But only for the first two initial logins. First time (no account yet) it just deadlocks, second time the bootstrap finishes with Own identity is now verified, check all known identities for verification status changes but now sync is deadlocked. Third time all starts working. I can't tell why it is like that without digging deeper. Maybe sync hold a lock to it for some reason.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, alright, let's merge this PR then and I'll see if I can move the things under the cache so we have a simpler locking model.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I could take a look at this as a follow up?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if you have the time, that would be great.

@poljar poljar merged commit 4bcb9b7 into matrix-org:main Oct 8, 2024
40 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants