"Computation has been terminated" when terminating a process while resuming from an exception #112
I was able to reproduce!

```smalltalk
block := [[Compiler evaluate: 'x'] on: UndeclaredVariableNotification do: [:ex | ex resume]].
"context := block asContext.
steps := 0.
[context willReturn and: [context sender isNil]] whileFalse:
	[steps := steps + 1.
	context := context step].
steps. ""11145""
0 to: 11145"
#(6402) do: [:n |
	Transcript showln: n.
	context := block asContext.
	n timesRepeat: [context := context step].
	process := Process forContext: context priority: 40.
	process terminate.
	self assert: process isTerminated].
```

As the commented-out code shows, terminate should not fail at whatever point the process is interrupted ...
Hi Christoph, My guess is that at step 6402 the process is in the middle of unwinding the `ex resume` when you initiate termination, i.e. another unwind on top of the previous one - and the outer unwind kind of slips through some crack in the inner one, so the procedure fails with the BCR (see the two ensure contexts inserted at the bottom of the contextToUnwind in your last screenshot). Although I tried to make unwind resilient against termination at any arbitrary point (see the tests covering termination of a terminating process etc.), I clearly failed to cover all such situations :) I'll investigate more thoroughly next week. Thanks for this lovely problem! PS: have you tried it in 5.3?
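For readers following the thread: the mechanism under discussion can be illustrated with a minimal Squeak sketch. This shows the normal, working behavior (not the bug): terminating a process is itself an unwind, i.e. it runs the pending ensure blocks of the terminated process - the bug arises when that unwind starts while another unwind (the `ex resume`) is still in progress.

```smalltalk
"Minimal sketch of standard Squeak behavior, not the bug itself:
terminate unwinds the terminated process and runs its ensure blocks."
log := OrderedCollection new.
p := [[Semaphore new wait] ensure: [log add: #unwound]] newProcess.
p resume.
Processor yield.	"let p run until it blocks on the semaphore"
p terminate.
log	"should now contain #unwound - the ensure block ran during termination"
```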
Hi Christoph, I've described the root cause and quickly drafted a solution. If you find it going in the right direction, we can polish the final form. It turns out the problem lies in the way the ensure guard contexts are generated by contextEnsure. I'm curious whether the suggested solution will fix all your observed problems or whether some more surface.
Hi Christoph (@LinqLover), I've modified #contextEnsure a bit - see Kernel-jar.1553. I guess this prevents the situation you observed. Please check and merge if you like it. Let me know of any other issues, thanks. |
Hi Christoph, Have you had a chance to check Kernel-jar.1553 yet? I wonder whether it solves the bug you observed. I assume you found another occurrence of the BCR at a later step in your amazing example: this time it's a completely unrelated issue where two nested unwinds don't cooperate correctly. I've improved the "granularity" of the unwind to distinguish unwind blocks that have started but not finished their execution from those that are really finished - see Kernel-jar.1554 for a description of the bug and the fix. Please let me know if this helps. Please either merge if you approve, or otherwise I could bundle all my recent changes into a single package... Let me know what you think.
Hi @isCzech, so sorry for not replying earlier! With too many different things to do right now, diving into the depths of simulation and unwinding is never something I can handle in a couple of minutes. :( Thank you for looking into this!
Good idea! The original example was pretty slow, but I have just added a faster test. I confirm that Kernel-jar.1553 fixes the original issue and together with Kernel-jar.1554 makes the new test pass entirely, so I am tempted to merge both of them. :-) Nevertheless, I do not yet fully understand why this issue is specific to [...]. Once I have understood that, I will merge both versions. I guess Kernel-jar.1552 can be moved to treated, or do you have any further plans for it? One disadvantage with this solution is that you hardcode expectations about the bytecodes of [...]:

```smalltalk
block := [[| c |
	c := thisContext swapSender: nil.
	thisContext swapSender: c.
	42] ensure: []].
```
Hi Christoph (@LinqLover),
The crucial difference between the two examples is that the original one [...]. If you're happy with the fix, I'd do the same modification to #contextOn:do: for the sake of consistency. I can't see a scenario where this would cause a similar problem, but who would have thought of what you observed :)
Sure, thanks for the cleanup. It was just a first approximation, but as you said, it's bound to the bytecodes, which is bad...
mildly put :)
Great! I'd love to see this really as a "pattern": just feed it a block, a test method (anything, not just terminate), and an expectation, and watch the result (ok, or where it derails). I haven't really thought it through; it just feels so powerful :) (PS: regarding the other issues, I haven't had time yet, but I'll get there)
Ah, I got it! So a more minimal example would be something like:

```smalltalk
[| c |
	[c := thisContext sender sender swapSender: nil]
		ensure: [thisContext sender sender swapSender: c]] value
```

Yes, it sounds wise to patch [...]. Oh no, here is another example that still fails when put inside #testTerminateEverywhere:

```smalltalk
block := [| c |
	c := thisContext.
	[] ensure: [c jump].
	42].
```

What is going on here?
Sure, feel free to extract the logic from [...].
I think this is inevitable... a jump inside the ensure argument block is a nightmare, and if it jumps over my unwind guard in #runUntilReturnFrom:, I don't know what can be done about it. I'll explore this along with your other examples around the stepOver bug. Coming soon, I hope :)
I guess the example would have to involve #runUntilErrorOrReturnFrom:, which is usually called only in simulation... I haven't had the energy to try, so I'm only guessing :) I'll send the fix.
contextOn:do: fix in Kernel-jar.1555
@isCzech Thanks, will take a look soon! In the meantime, here is just another example that triggers the same error:

```smalltalk
block := [(Context runSimulated: [41]) + 1].
```

This one is actually of practical relevance for me, because in my scenario I am running a lot of sandboxed simulations in background processes ... I wonder whether we could think this through together with the stepOver/runUntilErrorOrReturnFrom issue, which shares the same vulnerability to irregular context switches ... Like, place a marker on the context stack/a safe bottom context while performing a temporary context switch, and when we are terminating a process, check for such a marker, and if found, continue execution regularly until the marker is no longer set? Or, a bit more stateless, use a pragma in known context-switching methods such as #contextEnsure: (the old version) that instructs the unwinding logic to defer unwinding until that method has popped? Hm ... this is tricky ^^
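To make the pragma variant of this idea a little more concrete, here is a rough, hypothetical sketch. #deferUnwind is an invented marker name, not existing Squeak API; only the stack walk itself (Context>>sender, CompiledMethod>>pragmas) uses real protocol, and whether termination should actually consult such a marker is exactly the open design question above.

```smalltalk
"Hypothetical sketch of the pragma idea; #deferUnwind is invented here.
Walk the context stack and check whether any active method carries the
marker pragma - termination logic could defer unwinding while it does."
hasMarker := false.
ctx := thisContext.
[ctx notNil] whileTrue: [
	(ctx method pragmas anySatisfy: [:pragma | pragma keyword == #deferUnwind])
		ifTrue: [hasMarker := true].
	ctx := ctx sender].
hasMarker
```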
Ah, I see. The ensure context in [...] makes sense for something like

```smalltalk
[Context runSimulated: [2/0]] on: ZeroDivide do: [:ex | ex return: 1]
```

but when the execution is actually to abort, this behavior makes no sense. Hmm... maybe [...]
Hi @isCzech, all,
I've discovered another situation where I get a "Computation has been terminated" error from Process>>#terminate. The entire situation is pretty complex, depends on some experimental code that I have not yet committed to third-party packages, and occurs only very sporadically (terminate is maybe getting sent thousands of times per day and fails every second day...) - so it won't be possible to reproduce it clearly right now. Nevertheless, I'm trying to collect as much information about the bug here as possible and hope it helps:

First debugger:
[Screenshot: bug report]
At the same time, the UI hangs, and I need to press Cmd-dot to continue. The interrupt reveals where the UI process was stuck:
[Screenshot: bug report]
Exploring the receiver of the selected context in the first debugger (FullBlockClosure(BlockClosure)>>on:do:) reveals the following stack (it is cyclic/infinite; I used self stackOfSize: 100):

Exploring the contextToUnwind from the interrupted context of the second debugger reveals the following (note that the print-it displays the full stack of the interrupted context's receiver):

These are the methods from my package that are relevant to the bug (just look at the <-- pointer):

To me, this looks as if the process was attempted to be terminated while the UndeclaredVariableNotification was on the stack (maybe while it was already being handled/resumed from), and something in the stack manipulation logic has prevented the termination from working correctly. Maybe this is related to the fact that the context stack is temporarily invalidated during Context class>>#contextEnsure: et al. (cf. stepping into all the details of cut: during thisContext insertSender: (Context contextEnsure: []))?

I wish I could reproduce this issue, but I cannot for now. I have a vague hope that this information might be enough to suggest ideas about what might be wrong, and maybe to create a simpler example for reproduction ...