Disconnect on Drop idiom is unsound #482

Comments
Also as a note, the Xous Book should be updated with a reference to this issue to explain why.
An alternative is to make `CID` non-copyable. Then we get rid of the reuse of CIDs -- each call to `connect()` would hand out a distinct connection. For cases where you need to connect to services in a loop, you could have a pool of connections that get handed out as copies of the same underlying connection. That would at least remove some of the unsoundness behind `Drop`.
Oooh, that's a nice idea. Make `CID` not copy-able, and force us to use runtime checks like `Arc`. The problem is you still have a shadow copy of the CID existing in `std`.
I rely on this pretty often, especially with services that have threads that create/destroy handles to hardware objects frequently. I only rely on it to return an existing CID insofar as the thread's copy of the connection does not burn another entry in the connection table, which by definition means it is a copy of an existing CID. I don't actually care what the value is, just that it's not a different or new one, because connections are scarce. A clear and simple example is the mount poller.
In the example below:

```rust
use std::sync::{
    atomic::{AtomicU8, Ordering},
    Arc, OnceLock,
};

/// A mock Connection ID. Note that in the real world, the caller
/// wouldn't have access to the `u8` inside, and wouldn't be able
/// to call `clone()`.
pub struct CID(u8);

impl Drop for CID {
    fn drop(&mut self) {
        println!("Dropping connection {}", self.0);
    }
}

/// Keep track of the CID to ensure we don't duplicate it. If we
/// were getting a new connection each time, this number would
/// increment.
static CID_INDEX: AtomicU8 = AtomicU8::new(1);

/// Return the existing connection or create a new one.
pub fn connect() -> Arc<CID> {
    static OUR_CID: OnceLock<Arc<CID>> = OnceLock::new();
    OUR_CID
        .get_or_init(|| Arc::new(CID(CID_INDEX.fetch_add(1, Ordering::Relaxed))))
        .clone()
}

pub fn main() {
    // Demonstrate that explicit dropping works
    let dropped_cid = CID(0);
    drop(dropped_cid);

    // This should get dropped at the end of the program
    let _long_lived_cid = CID(100);

    // Use the long-lived connection. Since this is the first one,
    // it should have a CID of `1`.
    let cid = connect();
    println!("Got CID {}", cid.0);

    // Duplicate it. It should have the same CID as the first.
    let cid2 = connect();
    println!("On second connect got CID {}", cid2.0);
    // Dropping cid2 doesn't call the destructor
    drop(cid2);

    let mut thread_index = vec![];
    for i in 0..10 {
        thread_index.push(std::thread::spawn(move || {
            let c = connect();
            println!("Child thread {} got CID {}", i, c.0);
        }));
    }
    println!("Joining child threads...");
    for t in thread_index {
        t.join().unwrap();
    }

    // _long_lived_cid cannot be cloned or copied,
    // so this doesn't compile:
    //let other_cid = _long_lived_cid.clone();

    // Even if we drop `cid` here, the destructor doesn't
    // get run because it's still hanging out in the
    // `OnceLock<Arc<CID>>`
    drop(cid);

    // The _long_lived_cid should get dropped here as it
    // goes out of scope.
}
```
The magic there is entirely within the `OnceLock`. In the above program, the CID is never incremented, even though it's "connected to" from multiple threads. The actual CID itself is never cloned; only a reference to it is cloned. And because the reference persists throughout the lifetime of the program, it is never dropped.
I'm missing how this integrates with `xns`. Or are you re-using the term CID to mean something else in this example, and the intention is that I still allocate a real CID underneath?
I think the example makes more sense to me if it were like this:

```rust
pub fn connect() -> Arc<CID> {
    let xns = xous_api_names::XousNames::new().unwrap();
    static OUR_CID: OnceLock<Arc<CID>> = OnceLock::new();
    OUR_CID
        .get_or_init(|| {
            Arc::new(CID(
                xns.request_connection_blocking(MY_SERVER_NAME)
                    .expect("can't connect to server"),
            ))
        })
        .clone()
}
```

Or something like that maybe? And maybe here it's meant to have this actually call `disconnect()` instead of just printing something. Maybe that's what you meant, I can't tell, and you just put placeholders in because you're trying to write something for godbolt, but there's a lot of context missing with all the Xous API calls stripped out. Or maybe you're suggesting something entirely different and more generic in terms of how CID tracking should be done, like the callers are responsible for allocating and tracking CIDs now?
Here is a better example where I've created mocks of the Xous calls. I've also renamed the pool function. You could change the details to fit the real API. The end result is that you have one static function that can be called from any thread that will give you the same long-lived connection ID. Furthermore, if you want to share connection IDs between threads, you can do so by wrapping them in `Arc`. With this approach in place, we can change Xous so that subsequent calls to `connect()` no longer have to reuse CIDs.
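To make the sharing pattern concrete, here is a small usage sketch that reuses the mock `CID` and `connect()` from the example above; `send_ping()` is a made-up helper, not a Xous call:

```rust
use std::sync::Arc;

// Hypothetical helper that does some work over a shared connection.
fn send_ping(conn: &CID) {
    println!("pinging over connection {}", conn.0);
}

fn demo() {
    // Pattern 1: each thread just calls connect(); every caller gets a clone
    // of the same Arc<CID>, so no extra connection-table entry is consumed.
    let t1 = std::thread::spawn(|| {
        let conn = connect();
        send_ping(&conn);
    });

    // Pattern 2: clone the Arc in the parent and move it into the thread.
    let shared: Arc<CID> = connect();
    let t2 = std::thread::spawn(move || send_ping(&shared));

    t1.join().unwrap();
    t2.join().unwrap();
}
```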
Background
Every service in Xous currently uses an idiom similar to this one:
xous-core/api/xous-api-ticktimer/src/lib.rs
Lines 168 to 180 in 8703764
The rationale behind this is as follows:

- When objects go out of scope, `Drop` gets called
- `xous::disconnect()` is called on the connection inside `Drop`
Seems simple in concept -- except that, because connection IDs are scarce [1], they are shared between all threads within a process space. Thus, we have to count all the connections across all the threads before we know we can safely `disconnect`.

The solution to this was to put a reference counter in the `lib` wrapper for every service. This reference count is a `static` `Atomic` variable, and when we are the last one to decrement it, we know we are the last object standing and thus it is safe to `disconnect()`.
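For concreteness, here is a minimal, self-contained sketch of that idiom. The connection and the disconnect are mocked: `Cid`, `mock_connect()`, `mock_disconnect()`, and `ServiceHandle` are illustrative names, not the actual ticktimer code linked above.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

// Mock stand-ins so the sketch compiles on its own.
type Cid = u32;
fn mock_connect() -> Cid {
    1
}
fn mock_disconnect(conn: Cid) {
    println!("disconnecting {}", conn);
}

// One count of live wrapper objects, shared by every thread in the process.
static REFCOUNT: AtomicU32 = AtomicU32::new(0);

pub struct ServiceHandle {
    conn: Cid,
}

impl ServiceHandle {
    pub fn new() -> Self {
        REFCOUNT.fetch_add(1, Ordering::Relaxed);
        ServiceHandle { conn: mock_connect() }
    }
}

impl Drop for ServiceHandle {
    fn drop(&mut self) {
        // Only the last object standing tears down the shared connection.
        // The gap between this check and the disconnect itself is where the
        // race described below lives.
        if REFCOUNT.fetch_sub(1, Ordering::Relaxed) == 1 {
            mock_disconnect(self.conn);
        }
    }
}
```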
Ah, but our troubles only begin here...
Reference Counting is Unsound

The reference count implementation is unsound. Although the count itself is atomic, the subsequent `disconnect()` call is not. Thus, we could get a reference count indicating we're the last object standing, and before we complete the `disconnect()`, our quantum is yielded and a new thread starts up and allocates a connection. When our quantum is returned, we continue our `disconnect()` call, leading to the new thread having its connection to the server severed.

A possible solution to this is to add another `AtomicBool` that indicates that we're in a DISCONNECT-LOCKED state. If this is set by `Drop` before checking the reference count and cleared only after the disconnect is completed, and all allocators check this and wait until we are out of DISCONNECT-LOCKED, perhaps this eliminates the unsoundness in reference counting.
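A rough sketch of that DISCONNECT-LOCKED idea, reusing the mocked `Cid`, `REFCOUNT`, `mock_connect()`, and `mock_disconnect()` from the sketch above. This is only an illustration of the proposal, not code that exists in Xous, and it does not claim to close every window.

```rust
use std::sync::atomic::AtomicBool;

// Set while some thread sits between "refcount says we're last" and
// "disconnect finished".
static DISCONNECT_LOCKED: AtomicBool = AtomicBool::new(false);

/// Every code path that allocates a connection goes through here.
fn allocate_connection() -> Cid {
    // Wait for any in-progress disconnect to finish before handing out
    // a connection that could alias the one being torn down.
    while DISCONNECT_LOCKED.load(Ordering::Acquire) {
        std::thread::yield_now();
    }
    REFCOUNT.fetch_add(1, Ordering::Relaxed);
    mock_connect()
}

/// Called from Drop.
fn release_connection(conn: Cid) {
    // Enter the locked state *before* checking the count...
    DISCONNECT_LOCKED.store(true, Ordering::Release);
    if REFCOUNT.fetch_sub(1, Ordering::Relaxed) == 1 {
        // ...so nothing allocates while we tear the connection down.
        mock_disconnect(conn);
    }
    // ...and leave it only once the disconnect has completed.
    DISCONNECT_LOCKED.store(false, Ordering::Release);
}
```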
Except it gets worse.

std has Shadow Copies of Connections

`std` contains wrappers that connect to well-known servers to do work on behalf of the caller. If you do something with `std::net`, for example, it has to create a connection to `net`. This connection is outside the context of the `net` crate's `lib` API, so that connection is not included in its reference count. Thus, a user who uses both `std::net` and Xous-native `net` calls (for example, you're doing both TCP connections and also checking wifi state) could have the legs swept out from underneath the `std::net` APIs if the Xous-native `net` objects go out of scope before the `std::net` operations are finished. (Historical note: we didn't always have `std`, so in the original design of the system the reference counting scheme could have been sound; once we implemented `std`, we didn't notice this problem until a code review just found it. In practice, I haven't seen this trigger a crash...but maybe it could explain some TLS instabilities for the folks working on `sigchat`!).

We probably haven't been bitten by this because, generally, shared "well-known" services such as the `ticktimer`, `net` (`std::net`), `pddb` (`std::fs`), and `rtc` (datetime) are often bound at the top level of the `main` loop, and those objects live for the duration of the service. However, users who create and destroy threads that use any of these `std` calls could see crashes if these services aren't also bound at the top of the `main` loop, because then the reference count can actually go to zero when the threads exit.
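Sketching that workaround, reusing the mocked `ServiceHandle` from the earlier sketch (the real code would bind the actual service wrappers, e.g. the ticktimer object, instead):

```rust
fn main() {
    // Bind the well-known service once at the top of main(), so the reference
    // count never reaches zero while the process is alive.
    let _anchor = ServiceHandle::new();

    let worker = std::thread::spawn(|| {
        // A short-lived handle inside a thread: when it drops, the count goes
        // from 2 to 1, no disconnect fires, and any shadow connection held by
        // std stays intact.
        let _t = ServiceHandle::new();
    });
    worker.join().unwrap();

    // _anchor drops at the end of the program; only then does the count reach
    // zero and the disconnect path run.
}
```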
Here are the problems with some obvious solutions:

- `disconnect`, since it would be a unique number.

A close look at the process structure shows that we do have space to add 14 more connections to the connection table. Thus, one potential solution looks something like this:
- `std`-related services never `disconnect()` on `Drop`. This should probably be done sooner rather than later -- in practice, most of those connections never do get to the point of calling `disconnect()`, so it will have minimal impact on current resource usage, but at least we can guarantee we don't step on `std`. Services that never use a particular `std` library call won't pay any penalty for this. Probably the biggest problem would come if you had a server that was tight on connections and only ever needed a `std` call once early on and could otherwise use the connection space. In which case we should...
- Add an explicit `disconnect()` API for services that have the `Drop` modified (see the sketch after this list). This allows users who are tight on connections, and can guarantee they will never ever use a `std` API that relies on it, to free up space in the connection table.
- Reserve the extra connection-table entries for the `std` layer. 14 extra connections would still be a net improvement because the number of well-known servers that `std` relies upon is probably about 5 or 6 max.
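As a sketch of what such an explicit teardown could look like: this API does not exist today, and the name, signature, and `unsafe` contract are assumptions layered on the mocked `ServiceHandle` from the sketches above.

```rust
impl ServiceHandle {
    /// Explicitly tear down the connection, consuming the handle.
    ///
    /// Unsafe because the caller must guarantee that nothing else in this
    /// process -- including the std wrappers -- still uses this connection.
    pub unsafe fn manual_disconnect(self) {
        // Drop the refcount and release the connection now (the real call
        // would be xous::disconnect()).
        REFCOUNT.fetch_sub(1, Ordering::Relaxed);
        mock_disconnect(self.conn);
        // Skip Drop so nothing tries to release the connection a second time.
        std::mem::forget(self);
    }
}
```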
Anyways, the impact of this problem seems to have been minimal, because this would lead to a very particular panic that would be reported on the screen (I forget the exact error message, but it would probably say something about "Disconnected" or "Server not Found"). The correctness impact is large, but the practical impact so far has been small, likely because most important connections are bound in such a way that their lifetimes are typically eternal. But again, as we get more complicated (especially with the `sigchat` project), I suspect we'll run smack into this problem sooner or later.

References
[1] Connections are scarce because a goal (perhaps questionably?) is for the kernel to never allocate memory. The result is that we have a limited amount of memory to count connections (all state has to live in 4096 bytes).
This is the struct that describes a process, and it lives within the process itself at a fixed address:
xous-core/kernel/src/arch/riscv/process.rs
Lines 69 to 70 in 8703764
The connection map is here:
xous-core/kernel/src/services.rs
Line 201 in 8703764